
Projected State-action Balancing Weights for Offline Reinforcement Learning

Jiayi Wang$^{1}$    Zhengling Qi$^{2}$    Raymond K.W. Wong$^{1}$

$^{1}$Texas A&M University
$^{2}$George Washington University
Abstract

Offline policy evaluation (OPE) is considered a fundamental and challenging problem in reinforcement learning (RL). This paper focuses on the value estimation of a target policy based on pre-collected data generated from a possibly different policy, under the framework of infinite-horizon Markov decision processes. Motivated by the recently developed marginal importance sampling method in RL and the covariate balancing idea in causal inference, we propose a novel estimator with approximately projected state-action balancing weights for the policy value estimation. We obtain the convergence rate of these weights, and show that the proposed value estimator is semi-parametric efficient under technical conditions. In terms of asymptotics, our results scale with both the number of trajectories and the number of decision points per trajectory. As such, consistency can still be achieved with a limited number of subjects when the number of decision points diverges. In addition, we develop a necessary and sufficient condition for establishing the well-posedness of the Bellman operator in the off-policy setting, which characterizes the difficulty of OPE and may be of independent interest. Numerical experiments demonstrate the promising performance of our proposed estimator.

Keywords: Infinite horizons; Markov decision process; Policy evaluation; Reinforcement learning

1 Introduction

In reinforcement learning (RL), off-policy evaluation (OPE) refers to the problem of estimating some notion of rewards (e.g., the value defined in (2)) of a target policy based on historical data collected from a potentially different policy. There is a recent surge of interest in OPE among the statistics and machine learning communities. On the one hand, OPE serves as the foundation for many RL methods. On the other hand, OPE is of great importance in some high-stakes domains where deploying a new policy can be very costly or risky, such as in medical applications.

This work is partly motivated by mobile health (mHealth) studies. Due to the rapid development of mobile devices and sensing technology, mHealth studies have recently emerged as a promising way to promote the healthy behaviors of patients [38]. Mobile devices are used to monitor their health conditions in real-time and deliver just-in-time interventions to individuals [50]. With access to a rich amount of longitudinal data pre-collected by wearable devices, researchers are often interested in evaluating a policy (intervention), potentially different from the one behind the data collection process. For example, the OhioT1DM dataset [43] was collected to improve the health and wellbeing of people with type 1 diabetes. The data consist of 8 weeks’ health information of six subjects, based on some unknown policy. For each subject, real-time information such as insulin doses, glucose levels, and self-reported times of meals and exercises was collected. It is often of great interest to evaluate the efficacy of insulin pump therapy. One notable challenge of mHealth data is a limited number of subjects, combined with a usually large number of decision points for each subject. For instance, the OhioT1DM dataset has six subjects with a few thousand decision points per subject. In statistics, there is a rich literature on estimating the treatment effect or optimal policy from complex longitudinal data [57, 56, 48, 47, 35, 77, 72, 59, 34]. However, these methods are mainly designed for studies with very few decision points and often require the number of subjects to grow in order to achieve accurate estimation.

To address the above challenge in mHealth studies, we adopt the framework of the Markov decision process (MDP) [55] and consider an infinite-horizon time-homogeneous setting. This framework is particularly suitable for mHealth studies, and its efficacy has been well demonstrated in the recent literature [42, 39, 62, 40, 24].

In this paper, we focus on developing a new model-free approach for OPE. The existing model-free OPE methods under the infinite-horizon discounted setting can be mainly divided into three categories. The methods in the first category [e.g., 3, 63, 36, 42, 62] directly estimate the state(-action) value function (see Section 2.1), based on the Bellman equation (see (6)). The second category is motivated by importance weights or the so-called marginal importance sampling [e.g., 41, 49, 68]. These approaches utilize the probability ratio function (see (8)) to adjust for the distributional mismatch due to the difference between the target policy and the behavior policy (i.e., the one in the data collection process). The last category of OPE methods combines methods in the first category and the second category to obtain the so-called “double robust” estimators [e.g., 30, 64, 60]. Apart from the model-free methods, we note in passing that there exists a rich class of literature on model-based methods which mainly rely on directly modeling the dynamics (i.e., the transition mechanism and the reward function) [e.g. 53, 11, 26].

This paper focuses on the second category of model-free OPE methods, which is based on the probability ratio function. These methods do not depend on the actual form of the reward function and thus can be used flexibly to evaluate the target policy on different reward functions, which is appealing in practice. Some core theoretical questions associated with the use of the (estimated) probability ratio function directly relate to the fundamental challenges of offline RL [37], and therefore have recently attracted much interest from the machine learning community [41, 49, 76, 68, 69]. Last but not least, the probability ratio function and related estimators can improve the accuracy and stability of offline RL, which has been demonstrated in [49, 29]. Despite some recent progress towards using the (estimated) probability ratio function to perform OPE, the corresponding development and (theoretical) understanding are still incomplete.

Motivated by the covariate balancing methods [e.g., 1, 74, 73, 27] that have been recently studied in the average treatment effect (ATE) estimation, we propose a novel OPE estimator via projected state-action balancing weights, under the framework of the time-homogeneous MDP, and provide a comprehensive theoretical analysis. Specifically, we characterize the special challenge in developing a weighted estimator for the OPE problem in terms of an “expanded dimension” phenomenon, which significantly complicates the adoption of the balancing idea in OPE, as compared with the ATE estimation. See the detailed discussion in Section 3.2. Roughly speaking, a direct modification of an ATE estimator [e.g., 73] for the purpose of OPE leads to estimated weights that depend on not only the current state-action pairs, but also the state variable at the next decision point, which is inconsistent with the definition of the true probability ratio function. To tackle this issue, we propose an approximate projection step to rule out this “expanded dimension” issue. With the help of this additional projection, we are able to show the convergence rate of the proposed weights (or, equivalently, the estimated ratio function).

As for theoretical contributions, we analyze the convergence rates of the approximate projection step, the projected state-action balancing weights, as well as the estimated value of a given policy. All these convergence rates are characterized in terms of both the sample size ($n$, the number of observed trajectories) and the number of observed decision points ($T$) per trajectory. The scaling with respect to $T$ is particularly important when $n$ (the number of subjects) is limited, which is common in mHealth datasets such as the aforementioned OhioT1DM dataset (see also Section 6). For instance, under our setup, the estimated value of the policy is still consistent in the asymptotic setting where $n$ is bounded but $T\rightarrow\infty$. In the course of analyzing the proposed method, we also obtain a uniform convergence rate (with respect to $n$ and $T$) for a non-parametric regression based on exponentially $\beta$-mixing sequences, which may be of independent interest for other applications. Furthermore, under some appropriate technical assumptions (notably a non-diminishing minimum eigenvalue condition), we show that the proposed weighted estimator is asymptotically normal with a $\sqrt{nT}$ rate of convergence, and achieves the efficiency bound, which aligns with other types of estimators [30, 39]. Besides, our theoretical results do not require the underlying data to be independent or generated from stationary sequences, although these assumptions are widely used in the existing literature [e.g., 15, 30, 49, 64] for general OPE methods. Removing these restrictive assumptions significantly increases the applicability of the proposed method.

As another theoretical contribution, we make the first attempt to analyze the difficulty of non-parametric OPE under the infinite-horizon discounted setting. Strong assumptions are often imposed in the existing literature to establish desirable theoretical properties for the corresponding OPE estimators. For instance, [62] assumed that the minimal eigenvalue of some second moment matrix is strictly bounded away from 0 as the number of basis functions grows. [70] adopted the “completeness” condition to study the convergence of their OPE estimators. In their work, they show that these assumptions can be satisfied when the discount factor $0\leq\gamma<1$ in the policy value (2) is small enough (with additional boundedness conditions on the average visitation probability). However, in practice, the discount factor is often preferred to be set close to 1 [32]. As such, these sufficient conditions are of limited use in practice. In this paper, we provide a necessary and sufficient condition for lower bounding the minimal eigenvalue of the operator $I-\gamma\mathcal{P}^{\pi}$ (see the definition in Section 4.4) with respect to the data generating process, which characterizes the well-posedness of the $Q$-function estimation in [62] and [70]. With the help of this characterization, we can further show that the minimal eigenvalue is strictly bounded away from zero under some mild sufficient conditions without any restrictions on $\pi$ or $\gamma$, which may be of independent interest.

The rest of the paper is organized as follows. Section 2 presents the basic framework for OPE and some existing OPE methods. In Section 3 we introduce the state-action balancing weights and the proposed estimator. Theoretical results of our estimated weight function and the asymptotic properties of the proposed estimator are developed in Section 4. A detailed discussion regarding the lower bound of the minimum eigenvalue is presented in Section 4.4. Lastly, a simulation study and a real data application are presented in Sections 5 and 6 respectively.

2 Offline Policy Evaluation in Infinite-horizon Markov Decision Processes

In this section, we review the framework of discrete-time homogeneous MDPs and the related OPE methods. Specifically, the framework of MDP and some necessary notations are presented in Section 2.1, while three major classes of model-free OPE methods under the infinite-horizon discounted setting are reviewed in Section 2.2.

2.1 Preliminary and Notations

Consider a trajectory $\{S_{t},A_{t},R_{t}\}_{t\geq 0}$, where $(S_{t},A_{t},R_{t})$ denotes the triplet of the state, action and immediate reward observed at the decision point $t$. Let $\mathcal{S}$ and $\mathcal{A}$ be the state and action spaces, respectively. We assume $\mathcal{A}$ is finite and consider the following two assumptions, which are commonly imposed in infinite-horizon OPE problems.

Assumption 1 (Markovian assumption with stationary transitions).

There exists a transition kernel $P$ such that for every $t\geq 0$, $a\in\mathcal{A}$, $s\in\mathcal{S}$ and any set $F\in\mathcal{B}(\mathcal{S})$,

\Pr(S_{t+1}\in F\,|\,A_{t}=a,S_{t}=s,\{S_{j},A_{j},R_{j}\}_{0\leq j<t})=P(S_{t+1}\in F\,|\,A_{t}=a,S_{t}=s),

where $\mathcal{B}(\mathcal{S})$ is the family of Borel subsets of $\mathcal{S}$.

For notational convenience, we assume the transition kernel has a density $p$. Assumption 1 requires that, given the current state-action pair, future states are independent of past observations. Note that this assumption can be tested using the observed data [e.g., 61]. If one believes the trajectory satisfies some higher-order Markovian property, one can instead construct a state variable by aggregating the original state variables over the corresponding number of decision points. We refer interested readers to [61] for an example.

Assumption 2.

There exists a reward function $r$ defined on $\mathcal{S}\times\mathcal{A}$ such that
$\mathbb{E}\left[R_{t}\,|\,A_{t}=a,S_{t}=s,\{S_{j},A_{j},R_{j}\}_{0\leq j<t}\right]=r(s,a)$, for every $t\geq 0$, $s\in\mathcal{S}$ and $a\in\mathcal{A}$. In addition, $R_{t}$ is uniformly bounded, i.e., there exists a constant $R_{\max}$ such that $|R_{t}|\leq R_{\max}$ for every $t\geq 0$.

Assumption 2 states that the current reward is conditionally mean independent of the history given the current state-action pair. One can also regard $R_{t}$ as a part of $S_{t+1}$ if needed. The uniform boundedness assumption on rewards is introduced to simplify the technical derivations and can be relaxed. In practice, the time-stationarity of the transition density $p$ and the reward function $r$ can be warranted by incorporating time-associated covariates. Under Assumptions 1 and 2, the tuple $\mathcal{M}=\langle\mathcal{S},\mathcal{A},r,p\rangle$ forms a discrete-time homogeneous MDP.

A policy is defined as a way of choosing actions at all decision points. In this work, we use the value criterion (defined in (2) below) to evaluate policies and thus focus on time-invariant Markovian policies (also called stationary policies). A stationary policy $\pi$ is a function mapping from the state space $\mathcal{S}$ to a probability mass function over the action space $\mathcal{A}$; more specifically, $\pi(a\,|\,s)$ is the probability of choosing an action $a$ given a state $s$. The sufficiency of considering only stationary policies is explained in Section 6.2 of [55].

In the offline RL setting, one of the fundamental tasks is to estimate a target stationary policy's (expected) value function based on the pre-collected (batch) data. Given a stationary policy $\pi$, the value function is defined as

V^{\pi}(s)=\mathbb{E}^{\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}R_{t}\,\Big|\,S_{0}=s\right], (1)

where $\mathbb{E}^{\pi}$ denotes the expectation with respect to the distribution whose actions are generated by $\pi$, and $0\leq\gamma<1$ is the discount factor controlling the trade-off between future and immediate rewards. The value function $V^{\pi}(s)$, which is always finite due to Assumption 2, represents the expected discounted sum of rewards under the target policy given the initial state $s$. Our goal is to estimate the policy value, i.e., the expectation of the value function, defined as

{\cal V}(\pi)=(1-\gamma)\int_{s\in\mathcal{S}}V^{\pi}(s)\,\mathbb{G}(ds), (2)

using the pre-collected training data, where $\mathbb{G}$ denotes a reference distribution over $\mathcal{S}$. In the RL literature, $\mathbb{G}$ is typically assumed known.
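As a concrete illustration of the quantities (1) and (2), the following minimal sketch (not from the paper) computes $V^{\pi}$ and $\mathcal{V}(\pi)$ exactly on a hypothetical two-state, two-action MDP by iterating the fixed-point relation $V(s)=\sum_{a}\pi(a\mid s)[r(s,a)+\gamma\sum_{s'}p(s'\mid s,a)V(s')]$; all dynamics, rewards, and policies below are made up.

```python
gamma = 0.5
P = {  # P[s][a] = probability vector over next states (assumed dynamics)
    0: {0: [0.9, 0.1], 1: [0.2, 0.8]},
    1: {0: [0.4, 0.6], 1: [0.7, 0.3]},
}
r = {0: {0: 1.0, 1: 0.0}, 1: {0: 0.5, 1: 2.0}}   # reward function r(s, a)
pi = {0: {0: 0.3, 1: 0.7}, 1: {0: 0.6, 1: 0.4}}  # target policy pi(a | s)
G = [0.5, 0.5]                                   # reference distribution over states

V = {0: 0.0, 1: 0.0}
for _ in range(200):  # the update is a gamma-contraction, so it converges fast
    V = {
        s: sum(
            pi[s][a] * (r[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in (0, 1)))
            for a in (0, 1)
        )
        for s in (0, 1)
    }

# the policy value (2): a G-weighted average of V, scaled by (1 - gamma)
value = (1 - gamma) * sum(G[s] * V[s] for s in (0, 1))
print(V, value)
```

Since the state space here is finite, the integral in (2) reduces to the weighted sum in the last step.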

Now suppose we are given pre-collected training data $\mathcal{D}_{n}$ consisting of $n$ independent and identically distributed finite-horizon trajectories, denoted by

{\cal D}_{n}=\left\{\left\{\left(S_{i,t},A_{i,t},R_{i,t},S_{i,t+1}\right)\right\}_{0\leq t<T_{i}}\right\}_{1\leq i\leq n},

where $T_{i}$ denotes the termination time of the $i$-th trajectory. For simplicity, we assume the same number of time points is observed for all trajectories, i.e., $T_{i}=T$ for $i=1,\dots,n$. This assumption can be relaxed as long as the $T_{i}$, $i=1,\dots,n$, are of the same order. We also make the following assumption on the data generating mechanism.

Assumption 3.

The training data $\mathcal{D}_{n}$ are generated by a fixed stationary policy $b$ with an initial distribution $\mathbb{G}_{0}$.

Under Assumptions 1 and 3, $\{S_{t},A_{t}\}_{t\geq 0}$ forms a discrete-time time-homogeneous Markov chain. In the literature, $b$ is usually called the behavior policy, which may not be known. For convenience, we write $\mathbb{E}$ for $\mathbb{E}^{b}$. Next we define notations for several important probability distributions. For any $t\geq 0$, we define $p_{t}^{\pi}(s,a)$ as the marginal density of $(S_{t},A_{t})$ at $(s,a)\in\mathcal{S}\times\mathcal{A}$ under the target policy $\pi$ and reference distribution $\mathbb{G}$. In particular, when $t=0$, $p_{0}^{\pi}(s,a)=\mathbb{G}(s)\pi(a\mid s)$. Similarly, we can define $p_{t}^{b}$ over $\mathcal{S}\times\mathcal{A}$ under the behavior policy, where $p^{b}_{0}(s,a)=\mathbb{G}_{0}(s)b(a\mid s)$. With $p_{t}^{\pi}$, the discounted visitation probability density is defined as

d^{\pi}(s,a)=(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}p^{\pi}_{t}(s,a), (3)

which is assumed to be well-defined.
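The series (3) can be computed directly in a small tabular example. The following sketch (an illustration, not from the paper) propagates the marginals $p_{t}^{\pi}$ forward through a hypothetical two-state, two-action MDP and truncates the discounted sum; all numbers are made up.

```python
gamma = 0.5
P = {0: {0: [0.9, 0.1], 1: [0.2, 0.8]}, 1: {0: [0.4, 0.6], 1: [0.7, 0.3]}}
pi = {0: {0: 0.3, 1: 0.7}, 1: {0: 0.6, 1: 0.4}}  # target policy pi(a | s)
G = [0.5, 0.5]                                   # reference distribution

# forward recursion: p_{t+1}(s, a) = sum_{s', a'} p_t(s', a') p(s | s', a') pi(a | s)
p_t = {(s, a): G[s] * pi[s][a] for s in (0, 1) for a in (0, 1)}  # p_0^pi
d = {k: 0.0 for k in p_t}
for t in range(60):  # truncating the series (3); the tail is O(gamma^60)
    for k in d:
        d[k] += (1 - gamma) * gamma**t * p_t[k]
    p_t = {
        (s, a): sum(p_t[(s2, a2)] * P[s2][a2][s] * pi[s][a]
                    for s2 in (0, 1) for a2 in (0, 1))
        for s in (0, 1) for a in (0, 1)
    }

print(d)  # a probability distribution over state-action pairs
```

Because each $p_{t}^{\pi}$ sums to one and the geometric weights $(1-\gamma)\gamma^{t}$ sum to one, the truncated $d^{\pi}$ is itself (numerically) a probability distribution.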

2.2 Existing Off-policy Evaluation Methods

Most existing model-free OPE methods for the above setting can be grouped into three categories. The first category is direct methods, which directly estimate the state-action value function [e.g., 42, 62] defined as

Q^{\pi}(s,a)=\mathbb{E}^{\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}R_{t}\,\Big|\,S_{0}=s,A_{0}=a\right], (4)

also known as the Q-function. As we can see from (1), (2) and (4),

{\cal V}(\pi)=(1-\gamma)\int_{s\in\mathcal{S}}\sum_{a\in{\cal A}}\pi(a\mid s)Q^{\pi}(s,a)\,\mathbb{G}(ds). (5)
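A one-line derivation sketch of the identity (5): conditioning the value (1) on the initial action drawn from $\pi(\cdot\mid s)$ gives

```latex
% conditioning (1) on the initial action A_0 ~ pi(. | s)
V^{\pi}(s)
  = \sum_{a\in\mathcal{A}} \pi(a\mid s)\,
    \mathbb{E}^{\pi}\!\left[\sum_{t=0}^{\infty}\gamma^{t}R_{t}\,\Big|\,S_{0}=s,\,A_{0}=a\right]
  = \sum_{a\in\mathcal{A}} \pi(a\mid s)\,Q^{\pi}(s,a),
% substituting this expression into the policy value (2) yields (5)
```

and substituting this expression for $V^{\pi}$ into (2) yields (5).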

It is well known that $Q^{\pi}$ satisfies the following Bellman equation:

Q^{\pi}(s,a)=\mathbb{E}\left[R_{t}+\gamma\sum_{a^{\prime}\in{\cal A}}\pi(a^{\prime}\,|\,S_{t+1})Q^{\pi}(S_{t+1},a^{\prime})\,\Big|\,S_{t}=s,A_{t}=a\right], (6)

for $t\geq 0$, $s\in\mathcal{S}$ and $a\in\mathcal{A}$. Clearly (6) forms a conditional moment restriction, based on which $Q^{\pi}$ can be estimated by many methods, such as the generalized method of moments [21] and nonparametric instrumental variable regression [52, 8]. The second category of OPE methods is motivated by the idea of marginal importance sampling [e.g., 41, 49]. Notice that, with (3), one can rewrite $\mathcal{V}(\pi)$ as

{\cal V}(\pi)=\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1}\frac{d^{\pi}(S_{t},A_{t})}{\frac{1}{T}\sum_{v=0}^{T-1}p_{v}^{b}(S_{t},A_{t})}R_{t}\right]=\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1}\omega^{\pi}(S_{t},A_{t})R_{t}\right], (7)

as long as $d^{\pi}$ is absolutely continuous with respect to $\bar{p}_{T}^{b}=\frac{1}{T}\sum_{v=0}^{T-1}p_{v}^{b}$. We call

\omega^{\pi}(s,a):=\frac{d^{\pi}(s,a)}{\bar{p}_{T}^{b}(s,a)}, (8)

the probability ratio function, which is used to adjust for the mismatch between the behavior policy $b$ and the target policy $\pi$. Based on this relationship, one can obtain an estimator of $\mathcal{V}(\pi)$ via estimating $\omega^{\pi}$. By the so-called backward Bellman equation of $d^{\pi}$, for any $(s,a)\in\mathcal{S}\times\mathcal{A}$, one can show that

d^{\pi}(s,a)=(1-\gamma)\mathbb{G}(s)\pi(a\mid s)+\gamma\int_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in{\cal A}}d^{\pi}(s^{\prime},a^{\prime})p(s\mid s^{\prime},a^{\prime})\pi(a\mid s)\,ds^{\prime}. (9)

This implies that

\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1}\omega^{\pi}(S_{t},A_{t})\left(f(S_{t},A_{t})-\gamma\sum_{a^{\prime}\in{\cal A}}\pi(a^{\prime}\mid S_{t+1})f(S_{t+1},a^{\prime})\right)\right]=(1-\gamma)\mathbb{E}_{S_{0}\sim\mathbb{G}}\left[\sum_{a\in{\cal A}}\pi(a\mid S_{0})f(S_{0},a)\right]. (10)

See the detailed derivation of (10) in Section S2.1 of the Supplementary Material. Recall that $\mathbb{G}$ is known. Based on (10), several methods [e.g., 49, 68] can be leveraged to estimate $\omega^{\pi}$. In Section 3.4, we provide more discussion on other weighted estimators as compared with the proposed estimator.
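The balance identity (10) can be verified numerically. The sketch below (an illustration, not from the paper) computes the ratio function (8) exactly on a hypothetical two-state, two-action MDP and checks that reweighting the behavior distribution reproduces the right-hand side of (10) for an arbitrary test function $f$; all dynamics and policies are made up, and for simplicity the initial distribution $\mathbb{G}_{0}$ is taken equal to $\mathbb{G}$.

```python
gamma, T = 0.5, 10
P = {0: {0: [0.9, 0.1], 1: [0.2, 0.8]}, 1: {0: [0.4, 0.6], 1: [0.7, 0.3]}}
pi = {0: {0: 0.3, 1: 0.7}, 1: {0: 0.6, 1: 0.4}}   # target policy
b = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}    # behavior policy
G = [0.5, 0.5]  # reference distribution (also used as G_0 here for simplicity)
SA = [(s, a) for s in (0, 1) for a in (0, 1)]

def step(p, pol):  # one forward step of the state-action marginals
    return {(s, a): sum(p[(s2, a2)] * P[s2][a2][s] * pol[s][a]
                        for s2 in (0, 1) for a2 in (0, 1)) for s, a in SA}

# d^pi via the (truncated) series (3)
p_pi = {(s, a): G[s] * pi[s][a] for s, a in SA}
d = {k: 0.0 for k in SA}
for t in range(200):
    for k in SA:
        d[k] += (1 - gamma) * gamma**t * p_pi[k]
    p_pi = step(p_pi, pi)

# pbar_T^b = (1/T) sum_t p_t^b, keeping each marginal for the check below
p_b = {(s, a): G[s] * b[s][a] for s, a in SA}
pbar, marginals = {k: 0.0 for k in SA}, []
for t in range(T):
    marginals.append(p_b)
    for k in SA:
        pbar[k] += p_b[k] / T
    p_b = step(p_b, b)

omega = {k: d[k] / pbar[k] for k in SA}            # the ratio function (8)
f = {(0, 0): 1.0, (0, 1): -2.0, (1, 0): 0.5, (1, 1): 3.0}  # arbitrary test f

lhs = sum(
    (1 / T) * p_t[(s, a)] * P[s][a][s2] * omega[(s, a)]
    * (f[(s, a)] - gamma * sum(pi[s2][a2] * f[(s2, a2)] for a2 in (0, 1)))
    for p_t in marginals for s, a in SA for s2 in (0, 1)
)
rhs = (1 - gamma) * sum(G[s] * pi[s][a] * f[(s, a)] for s, a in SA)
print(lhs, rhs)  # these agree up to the series truncation error
```

Note that the identity holds for any behavior marginals, since $\bar{p}_{T}^{b}\,\omega^{\pi}=d^{\pi}$ cancels the behavior distribution.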

The last category of methods combines direct and marginal importance sampling methods to construct a doubly robust estimator, which is also motivated by the following efficient influence function [e.g., 30]

\frac{1}{T}\sum_{t=0}^{T-1}\omega^{\pi}(S_{t},A_{t})\left(R_{t}+\gamma\sum_{a^{\prime}\in{\cal A}}\pi(a^{\prime}\mid S_{t+1})Q^{\pi}(S_{t+1},a^{\prime})-Q^{\pi}(S_{t},A_{t})\right)+(1-\gamma)\mathbb{E}_{S_{0}\sim\mathbb{G}}\left[\sum_{a_{0}\in{\cal A}}\pi(a_{0}\mid S_{0})Q^{\pi}(S_{0},a_{0})\right]-{\cal V}(\pi), (11)

where $Q^{\pi}$ and $\omega^{\pi}$ are nuisance functions.
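For completeness, the first category above can also be illustrated concretely: the Bellman equation (6) can be solved by fixed-point iteration in a small tabular MDP. The sketch below (not from the paper) uses the same kind of hypothetical two-state, two-action setup with made-up dynamics.

```python
gamma = 0.5
P = {0: {0: [0.9, 0.1], 1: [0.2, 0.8]}, 1: {0: [0.4, 0.6], 1: [0.7, 0.3]}}
r = {0: {0: 1.0, 1: 0.0}, 1: {0: 0.5, 1: 2.0}}   # reward function r(s, a)
pi = {0: {0: 0.3, 1: 0.7}, 1: {0: 0.6, 1: 0.4}}  # target policy pi(a | s)
SA = [(s, a) for s in (0, 1) for a in (0, 1)]

Q = {k: 0.0 for k in SA}
for _ in range(200):  # the Bellman operator in (6) is a gamma-contraction
    Q = {
        (s, a): r[s][a] + gamma * sum(
            P[s][a][s2] * sum(pi[s2][a2] * Q[(s2, a2)] for a2 in (0, 1))
            for s2 in (0, 1)
        )
        for s, a in SA
    }
print(Q)  # Q^pi, satisfying (6) up to numerical tolerance
```

Plugging the resulting $Q^{\pi}$ into (5) would give the direct estimator of $\mathcal{V}(\pi)$; with estimated nuisances, the same $Q^{\pi}$ and $\omega^{\pi}$ enter the doubly robust construction (11).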

3 Projected State-action Balancing Estimator

In this section, we introduce the proposed weighted estimator for $\mathcal{V}(\pi)$. Since our estimator is motivated by covariate balancing weights in the ATE estimation literature, we discuss their connection in Section 3.1. In Section 3.2, we show the difficulty of directly applying the covariate balancing idea to the aforementioned policy evaluation problem due to an “expanded-dimension” issue. We address this difficulty and propose a projected state-action balancing estimator for $\mathcal{V}(\pi)$ in Section 3.3.

3.1 State-action Balancing Weights

Consider a general form of weighted estimators:

\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\omega_{i,t}R_{i,t},

where the $\omega_{i,t}$'s are weights constructed from the training data $\mathcal{D}_{n}$. Due to (7), a reasonable strategy to derive such weights is to first estimate the probability ratio function $\omega^{\pi}$, and then evaluate it at the observed state-action pairs $\{(S_{i,t},A_{i,t})\}$ in $\mathcal{D}_{n}$. This is analogous to the inverse probability weights commonly adopted in the ATE estimation. However, this strategy often produces an unstable weighted estimator due to small weights [31]. Instead, there has been a recent surge of interest in directly obtaining weights that achieve empirical covariate balance [e.g., 25, 7, 74, 73]. These weights usually produce more stable weighted estimators with superior finite-sample performance for the ATE estimation.

Inspired by covariate balancing, a natural idea is to choose weights $\{\omega_{i,t}\}$ that ensure the (approximate) validity of the empirical counterpart of (10), i.e.,

\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\omega_{i,t}\left(f(S_{i,t},A_{i,t})-\gamma\sum_{a^{\prime}\in{\cal A}}\pi(a^{\prime}\mid S_{i,t+1})f(S_{i,t+1},a^{\prime})\right)=(1-\gamma)\mathbb{E}_{S_{0}\sim\mathbb{G}}\left[\sum_{a\in{\cal A}}\pi(a\mid S_{0})f(S_{0},a)\right], (12)

over $f\in\mathrm{span}\{B_{1},\dots,B_{K}\}$, where $B_{1},\dots,B_{K}$ are $K$ pre-specified functions defined on $\mathcal{S}\times\mathcal{A}$. The equality (12) can be viewed as a form of state-action balance, in contrast to the covariate balance in the ATE estimation. The space $\mathrm{span}\{B_{1},\dots,B_{K}\}$ can be viewed as a finite-dimensional approximation of the function space in which the balance should be enforced. In theory, $K$ is expected to increase with $n$ and $T$. Adapting the balancing criterion in [73], one can obtain a form of state-action balancing weights via the following mathematical program:

\underset{\{\omega_{i,t}\}_{1\leq i\leq n,0\leq t\leq T-1}}{\mbox{minimize}}\quad\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}h(\omega_{i,t}) (13a)

\mbox{subject to}\quad\left|\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\omega_{i,t}\left\{B_{k}(S_{i,t},A_{i,t})-\gamma\sum_{a^{\prime}\in{\cal A}}\pi(a^{\prime}\mid S_{i,t+1})B_{k}(S_{i,t+1},a^{\prime})\right\}-(1-\gamma)\mathbb{E}_{S_{0}\sim\mathbb{G}}\left[\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid S_{0})B_{k}(S_{0},a^{\prime})\right]\right|\leq\delta_{k},\quad\mbox{for}\ k=1,2,\dots,K, (13b)

where the tuning parameter $\delta_{k}\geq 0$ controls the imbalance with respect to $B_{k}$. When $\delta_{k}=0$, the weights achieve the exact balance (12) over $B_{k}$. In practice, exact balance can be hard to achieve, especially when $K$ is large. Allowing $\delta_{k}>0$ leads to approximate balance [1, 74], which introduces flexibility. In addition, one can also constrain the weights to be non-negative. Since non-negativity constraints are not necessary for consistent estimation (as shown in our theoretical analysis), we do not enforce them in this paper. Common choices of $\{B_{k}\}$ are constructed from tensor products of one-dimensional basis functions; examples of one-dimensional basis functions include spline bases [12] (for a continuous dimension) and indicator functions of levels (for a categorical dimension). The objective function (13a) is introduced to control the magnitude of the weights. Here $h$ is chosen to be a non-negative, strictly convex and continuously differentiable function; examples include $h(x)=(x-1)^{2}$ and $h(x)=(x-1)\log(x)$. In the following, we discuss the issue with the weights defined by (13), and explain the challenge of directly applying this covariate balancing idea in the OPE problem.
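To make the constraint (13b) concrete, the sketch below (an illustration, not from the paper) evaluates the imbalance of a candidate set of weights on made-up data: a single short trajectory, two hypothetical basis functions on $\mathcal{S}\times\mathcal{A}$ with $\mathcal{S}=\mathbb{R}$ and $\mathcal{A}=\{0,1\}$, and a small sample of states standing in for draws from $\mathbb{G}$.

```python
gamma = 0.5
pi = lambda a, s: 0.7 if a == 1 else 0.3              # a hypothetical target policy
B = [lambda s, a: 1.0, lambda s, a: s * (2 * a - 1)]  # basis functions {B_1, B_2}

# one trajectory (n = 1) with T = 3 transitions: (s_t, a_t, s_{t+1}); made-up data
transitions = [(0.2, 1, 0.5), (0.5, 0, -0.1), (-0.1, 1, 0.4)]
ref_states = [0.0, 1.0]                               # draws representing G

def imbalance(weights):
    """Residuals of (13b): empirical balance minus the reference-side target."""
    out, nT = [], len(transitions)
    for Bk in B:
        lhs = sum(
            w * (Bk(s, a) - gamma * sum(pi(a2, s2) * Bk(s2, a2) for a2 in (0, 1)))
            for w, (s, a, s2) in zip(weights, transitions)
        ) / nT
        rhs = (1 - gamma) * sum(
            sum(pi(a0, s0) * Bk(s0, a0) for a0 in (0, 1)) for s0 in ref_states
        ) / len(ref_states)
        out.append(lhs - rhs)
    return out

print(imbalance([1.0, 1.0, 1.0]))  # residuals to keep within [-delta_k, delta_k]
```

Note that for the constant basis $B_{1}\equiv 1$, any weights averaging to one balance exactly, since both sides reduce to $1-\gamma$; the nontrivial constraints come from the non-constant basis functions.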

Define $\epsilon_{i,t}=R_{i,t}-r(S_{i,t},A_{i,t})$ and write the solution of (13) as $\tilde{\omega}^{\pi}_{i,t}$. Naturally, we can obtain a weighted estimator as $\tilde{\mathcal{V}}(\pi)=(nT)^{-1}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\tilde{\omega}_{i,t}^{\pi}R_{i,t}$. For any function $B$ defined on $\mathcal{S}\times\mathcal{A}$, let

g_{*}^{\pi}(S_{t},A_{t};B):=\mathbb{E}\left\{\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid S_{t+1})B(S_{t+1},a^{\prime})\mid S_{t},A_{t}\right\}. (14)

The difference between $\tilde{\mathcal{V}}(\pi)$ and $\mathcal{V}(\pi)$ admits the following decomposition:

\tilde{{\cal V}}(\pi)-{\cal V}(\pi)

=\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\tilde{\omega}^{\pi}_{i,t}\left[Q^{\pi}(S_{i,t},A_{i,t})-\gamma g_{*}^{\pi}(S_{i,t},A_{i,t};Q^{\pi})\right] (15)

\qquad-(1-\gamma)\mathbb{E}_{S_{0}\sim\mathbb{G}}\left\{\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid S_{0})Q^{\pi}(S_{0},a^{\prime})\right\}+\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\tilde{\omega}^{\pi}_{i,t}\epsilon_{i,t} (16)

=\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\tilde{\omega}^{\pi}_{i,t}\left[Q^{\pi}(S_{i,t},A_{i,t})-\gamma\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid S_{i,t+1})Q^{\pi}(S_{i,t+1},a^{\prime})\right] (17)

\qquad-(1-\gamma)\mathbb{E}_{S_{0}\sim\mathbb{G}}\left\{\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid S_{0})Q^{\pi}(S_{0},a^{\prime})\right\} (18)

\qquad+\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\tilde{\omega}^{\pi}_{i,t}\epsilon_{i,t} (19)

\qquad+\frac{\gamma}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\tilde{\omega}_{i,t}^{\pi}\left[\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid S_{i,t+1})Q^{\pi}(S_{i,t+1},a^{\prime})-g_{*}^{\pi}(S_{i,t},A_{i,t};Q^{\pi})\right], (20)

where the first equality follows from (6) and the representation of $\mathcal{V}(\pi)$ in (5). Clearly, (17)-(18) can be controlled via the balancing constraint (13b) by carefully controlling $\{\delta_{k}\}$ and selecting $\{B_{k}\}$ so that $Q^{\pi}$ can be well approximated. For (19), by assuming that $\{\epsilon_{i,t}\}$ are independent noises of the trajectories $\{S_{i,t},A_{i,t}\}_{i=1,\dots,n;\,t=0,\dots,T-1}$, it is not hard to obtain an upper bound of order $(nT)^{-1/2}$ as long as the magnitude of $\tilde{\omega}_{i,t}^{\pi}$ is properly controlled. However, it remains unclear how to control (20) due to the complex dependence between $\tilde{\omega}_{i,t}^{\pi}$ and $\mathcal{D}_{n}$. Indeed, we observe an “expanded-dimension” issue caused by the balancing constraints (13b), which will be explained in detail in Section 3.2. This also motivates the development of the novel projected balancing constraints in Section 3.3.

3.2 Expanded Dimension

First, we obtain the dual of (13), which provides an important characterization of the solution of (13). Define $\bm{\Psi}_{K}(s,a,s^{\prime})=[B_{k}(s,a)-\gamma\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid s^{\prime})B_{k}(s^{\prime},a^{\prime})]_{k=1}^{K}\in\mathbb{R}^{K}$, $\bm{l}_{K}=[\mathbb{E}_{s\sim\mathbb{G}}\{(1-\gamma)\sum_{a\in\mathcal{A}}\pi(a\mid s)B_{k}(s,a)\}]_{k=1,\dots,K}\in\mathbb{R}^{K}$, and $\bm{\delta}_{K}=[\delta_{1},\dots,\delta_{K}]^{\intercal}\in\mathbb{R}^{K}$.

Theorem 1.

The dual of (13) is equivalent to the following unconstrained optimization problem:

\min_{\bm{\lambda}\in\mathbb{R}^{K}}\quad\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\rho({\bm{\Psi}}_{K}(S_{i,t},A_{i,t},S_{i,t+1})^{\intercal}\bm{\lambda})-\bm{\lambda}^{\intercal}\bm{l}_{K}+|\bm{\lambda}|^{\intercal}\bm{\delta}_{K}, (21)

where $\rho(t)=t(h^{\prime})^{-1}(t)-h\{(h^{\prime})^{-1}(t)\}$. The primal solution to (13) is given by

\tilde{w}^{\pi}_{i,t}=\rho^{\prime}({\bm{\Psi}}_{K}(S_{i,t},A_{i,t},S_{i,t+1})^{\intercal}\bm{\lambda}^{\star}), (22)

for every $1\leq i\leq n$ and $0\leq t\leq T-1$, where $\bm{\lambda}^{\star}$ is the solution to (21).

The proof of Theorem 1 is similar to that of Theorem 2 below, and can be found in Section S2.2 of the Supplementary Material. Now the expanded-dimension issue can be easily seen via the following example. Suppose that there are two triplets $(S_{i_{1},t_{1}},A_{i_{1},t_{1}},S_{i_{1},t_{1}+1})$ and $(S_{i_{2},t_{2}},A_{i_{2},t_{2}},S_{i_{2},t_{2}+1})$ such that $S_{i_{1},t_{1}}=S_{i_{2},t_{2}}$, $A_{i_{1},t_{1}}=A_{i_{2},t_{2}}$ and $S_{i_{1},t_{1}+1}\neq S_{i_{2},t_{2}+1}$. As the true probability ratio function $\omega^{\pi}$ is a function of the current state and action variables, we must have $\omega^{\pi}_{i_{1},t_{1}}=\omega^{\pi}_{i_{2},t_{2}}$. However, the solution form (22) in Theorem 1 does not lead to $\tilde{\omega}^{\pi}_{i_{1},t_{1}}=\tilde{\omega}^{\pi}_{i_{2},t_{2}}$ in general, which violates our knowledge of $\omega^{\pi}$.
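As a numerical companion to Theorem 1 (a minimal sketch, not the paper's code), take $h(x)=(x-1)^{2}$ and $\delta_{k}=0$; then $(h')^{-1}(t)=t/2+1$, $\rho(t)=t^{2}/4+t$, $\rho'(t)=t/2+1$, and the dual (21) is smooth, so plain gradient descent suffices. The rows of `Psi` and the vector `l` below are made-up stand-ins for $\bm{\Psi}_{K}(S_{i,t},A_{i,t},S_{i,t+1})$ and $\bm{l}_{K}$.

```python
Psi = [[1.0, 0.2], [0.5, -0.4], [0.5, 0.9], [1.0, -0.1]]  # nT = 4 samples, K = 2
l = [0.6, 0.1]

rho_prime = lambda t: t / 2 + 1   # rho'(t) for h(x) = (x - 1)^2
nT, K = len(Psi), len(l)

lam = [0.0] * K
for _ in range(5000):  # gradient descent on the smooth dual objective (21)
    w = [rho_prime(sum(p[k] * lam[k] for k in range(K))) for p in Psi]
    grad = [sum(w[i] * Psi[i][k] for i in range(nT)) / nT - l[k] for k in range(K)]
    lam = [lam[k] - 0.5 * grad[k] for k in range(K)]

# recover the primal weights through (22)
weights = [rho_prime(sum(p[k] * lam[k] for k in range(K))) for p in Psi]
# with delta = 0, the optimal weights balance exactly: (1/nT) sum_i w_i Psi_i = l
print(weights, [sum(weights[i] * Psi[i][k] for i in range(nT)) / nT for k in range(K)])
```

The vanishing dual gradient is exactly the balance condition, which illustrates why the recovered weights satisfy (13b) with equality when $\bm{\delta}_{K}=0$. It also makes the expanded dimension visible: each weight is a function of the whole row of `Psi`, which depends on the next state.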

One may hypothesize that the expanded-dimension issue is a finite-sample property, and the variation of the estimated weights due to the next state may diminish asymptotically under some reasonable conditions. To gain more insight, we show that the solution form (22) indeed induces an implicit restriction on the modeling of the true weight function under finite-state and finite-action settings. Therefore, unless one is willing to make further non-trivial assumptions on the weight function, the hypothesis cannot be true in general.

Notice that

\bm{\Psi}_{K}(s,a,s^{\prime})^{\intercal}\bm{\lambda}=f(s,a)-\gamma\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid s^{\prime})f(s^{\prime},a^{\prime}), (23)

where $f(s,a)=\sum^{K}_{k=1}\lambda_{k}B_{k}(s,a)$. To avoid dealing with the approximation error, we focus on an even more general class $\rho^{\prime}(\mathcal{G})$ of functions on $\mathcal{S}\times\mathcal{A}\times\mathcal{S}$, where

\mathcal{G}:=\left\{f(s,a)-\gamma\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid s^{\prime})f(s^{\prime},a^{\prime}):\mbox{$f$ is any real-valued function defined on $\mathcal{S}\times\mathcal{A}$}\right\}.

Recall that $\rho^{\prime}$ is the first derivative of $\rho$ defined in Theorem 1. Assume that $\tilde{\omega}^{\pi}(s,a,s^{\prime})\equiv\omega^{\pi}(s,a)$. We would like to know if $\rho^{\prime}(\mathcal{G})$ can model $\tilde{\omega}^{\pi}$ well. Suppose $\tilde{\omega}^{\pi}\in\rho^{\prime}(\mathcal{G}):=\{\rho^{\prime}(g(\cdot)):g\in\mathcal{G}\}$. As $\tilde{\omega}^{\pi}(s,a,s^{\prime})$ is constant with respect to $s^{\prime}$, we have $\tilde{\omega}^{\pi}\in\rho^{\prime}(\mathcal{G}^{\prime})$, where

\mathcal{G}^{\prime}:=\left\{g\in\mathcal{G}:g(s,a,s^{\prime})=g(s,a,s^{\prime\prime})\ \forall s,s^{\prime},s^{\prime\prime}\in\mathcal{S},a\in\mathcal{A}\right\}

characterizes the subclass with reduced input dimensions. A key question is whether $\mathcal{G}^{\prime}$ restricts the class of possible weight functions modeled by $\rho^{\prime}(\mathcal{G})$ and, as a result, induces some implicit form of restriction. To see this, we focus on the settings where $|\mathcal{S}|=p_{S}$ and $|\mathcal{A}|=p_{A}$ are finite. In Lemma 1 below, we show that the dimension of $\mathcal{G}^{\prime}$ is $p_{A}p_{S}-p_{S}+1$, which is strictly less than $p_{S}p_{A}-1$ as long as $p_{S}>2$. Note that, due to the natural constraint that $\mathbb{E}[T^{-1}\sum_{t=0}^{T-1}\omega^{\pi}(S_{t},A_{t})]=1$, a general weight function should have $p_{A}p_{S}-1$ free parameters. As $\rho^{\prime}$ is invertible, Lemma 1 suggests a possible implicit restriction and that the solution obtained from (13) may not be a consistent estimator for $\omega^{\pi}$.

Lemma 1.

Suppose $|\mathcal{S}|=p_{S}$ and $|\mathcal{A}|=p_{A}$ are both finite. Then $\dim(\mathcal{G}^{\prime})=p_{A}p_{S}-p_{S}+1$.

Proof sketch.

For any function $g=f(s,a)-\gamma\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid s^{\prime})f(s^{\prime},a^{\prime})\in\mathcal{G}^{\prime}$, there exists a constant $c_{g}\in\mathbb{R}$ such that $\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid s^{\prime})f(s^{\prime},a^{\prime})\equiv c_{g}$ for any $s^{\prime}\in\mathcal{S}$. Since $|\mathcal{S}|=p_{S}$, $\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid s^{\prime})=1$ and $\pi(a^{\prime}\mid s^{\prime})\geq 0$, this yields $p_{S}$ linearly independent constraints on $f$. Together with the parameter $c_{g}$, we can show that the dimension of $\mathcal{G}^{\prime}$ is $p_{S}p_{A}-p_{S}+1$. The detailed proof can be found in Section S2.2 of the Supplementary Material. ∎
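The dimension count in Lemma 1 can also be checked numerically. The sketch below is our illustration only: the sizes $p_{S}$, $p_{A}$ and the random policy are arbitrary choices, and the constraints defining $\mathcal{G}^{\prime}$ are encoded equivalently as $p_{S}-1$ difference constraints plus the free parameter $c_{g}$.

```python
import numpy as np

rng = np.random.default_rng(0)
p_S, p_A = 3, 2  # illustrative finite state/action sizes

# A random target policy pi(a | s); rows sum to one.
pi = rng.random((p_S, p_A))
pi /= pi.sum(axis=1, keepdims=True)

# f lives in R^{p_S * p_A}, flattened by (s, a). Membership of
# g = f(s,a) - gamma * sum_a' pi(a'|s') f(s',a') in G' requires
# sum_a' pi(a'|s') f(s',a') to be constant in s', i.e. p_S - 1
# independent difference constraints across consecutive states.
C = np.zeros((p_S - 1, p_S * p_A))
for s in range(p_S - 1):
    C[s, s * p_A:(s + 1) * p_A] = pi[s]
    C[s, (s + 1) * p_A:(s + 2) * p_A] = -pi[s + 1]

# The map f -> g is injective for gamma < 1 (g = 0 forces f = 0), so
# dim(G') equals the dimension of the constrained f-space.
dim_Gprime = p_S * p_A - np.linalg.matrix_rank(C)
print(dim_Gprime, p_A * p_S - p_S + 1)
```

For $p_{S}=3$, $p_{A}=2$ this yields dimension $4$, strictly below the $p_{S}p_{A}-1=5$ free parameters of a general weight function, matching Lemma 1.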

3.3 Projected State-action Balancing Weights

To overcome the expanded-dimension issue, we propose an approximate projection step, which is applied to $\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid S_{i,t+1})B_{k}(S_{i,t+1},a^{\prime})$, $k=1,\dots,K$, to rule out the involvement of $S_{i,t+1}$. To explain the idea, we again focus on the decomposition of $\tilde{\cal V}(\pi)-{\cal V}(\pi)$. From (15) and (16), we would like to choose weights that ideally control

\left|\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}{\omega}_{i,t}\left[B_{k}(S_{i,t},A_{i,t})-\gamma g_{*}^{\pi}(S_{i,t},A_{i,t};B_{k})\right]-(1-\gamma)\mathbb{E}_{S_{0}\sim\mathbb{G}}\left[\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid S_{0})B_{k}(S_{0},a^{\prime})\right]\right|,

for every $k=1,\dots,K$. However, in practice, $g_{*}^{\pi}(S_{i,t},A_{i,t};B_{k})$ (i.e., $\mathbb{E}\{\sum_{a^{\prime}\in{\cal A}}\pi(a^{\prime}\mid S_{i,t+1})B_{k}(S_{i,t+1},a^{\prime})\mid S_{i,t},A_{i,t}\}$) is unknown to us. As explained in Section 3.2, replacing it with the empirical counterpart $\sum_{a^{\prime}\in{\cal A}}\pi(a^{\prime}\mid S_{i,t+1})B_{k}(S_{i,t+1},a^{\prime})$ results in a non-trivial expanded-dimension issue. Instead, we propose to estimate the projection term $g_{*}^{\pi}(S_{t},A_{t};B_{k})$ via a more involved optimization problem:

\hat{g}^{\pi}(\cdot,\cdot;B_{k})=\operatorname*{argmin}_{g\in{\mathcal{G}}}\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\left\{\sum_{a^{\prime}}\pi(a^{\prime}\mid S_{i,t+1})B_{k}(S_{i,t+1},a^{\prime})-g(S_{i,t},A_{i,t})\right\}^{2}+\mu J^{2}_{\mathcal{G}}(g), (24)

where ${\mathcal{G}}$ is a pre-specified function space that contains $g_{*}^{\pi}$, $\mu\geq 0$ is a tuning parameter and $J_{\mathcal{G}}(\cdot)$ is a regularization functional. In this work we focus on kernel ridge regression, where ${\mathcal{G}}$ is a subset of a reproducing kernel Hilbert space (RKHS) and $J_{\mathcal{G}}$ is taken as the squared RKHS norm; see Assumption 4’(c) in Section 4.1 of the Supplementary Material. In Theorem 3 of Section 4, we establish a finite-sample error bound for $\hat{g}^{\pi}(\cdot,\cdot;B)$ (in the scaling of both $n$ and $T$) that holds uniformly over $B$ (in a reasonably large class $\mathcal{Q}$ specified later). This provides a solid theoretical guarantee for replacing $g^{\pi}_{*}$ by $\hat{g}^{\pi}$ in the construction of the weights.
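The projection (24) with an RKHS choice of $\mathcal{G}$ has a well-known closed form. The following is a minimal sketch under stated assumptions: the Gaussian kernel, its bandwidth, the tuning parameter, and the synthetic data generator are all illustrative choices, with `y` playing the role of the regression target $\sum_{a^{\prime}}\pi(a^{\prime}\mid S_{t+1})B_{k}(S_{t+1},a^{\prime})$.

```python
import numpy as np

rng = np.random.default_rng(1)
nT, mu = 200, 1e-2  # pooled sample size and tuning parameter (illustrative)

# Toy stand-ins: rows of x play the role of (state, action) pairs, and y is
# the projection target, generated here as a smooth conditional mean + noise.
x = rng.uniform(-1.0, 1.0, size=(nT, 2))
truth = np.sin(2.0 * x[:, 0])
y = truth + 0.1 * rng.standard_normal(nT)

def gauss_kernel(a, b, bw=0.5):
    """Gaussian kernel matrix between rows of a and b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bw ** 2))

# Closed-form minimizer of (24) when J_G is the RKHS norm:
# ghat(.) = k(., X) (K + nT * mu * I)^{-1} y.
K = gauss_kernel(x, x)
alpha = np.linalg.solve(K + nT * mu * np.eye(nT), y)
ghat = K @ alpha  # fitted projection at the observed (state, action) pairs

mse = float(np.mean((ghat - truth) ** 2))
```

In practice this step is repeated once per basis function $B_{k}$, after which the fitted values enter the constraint (25b) below.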

With the approximate projection (24), we propose to estimate the weights by solving the following optimization problem:

\underset{\{{\omega}_{i,t}\}_{1\leq i\leq n,0\leq t\leq T-1}}{\mbox{minimize}}\quad\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}h({\omega}_{i,t}) (25a)
\mbox{subject to}\quad\left|\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}{\omega}_{i,t}\left\{B_{k}(S_{i,t},A_{i,t})-\gamma\hat{g}^{\pi}(S_{i,t},A_{i,t};B_{k})\right\}\right.
\left.-(1-\gamma)\mathbb{E}_{S_{0}\sim\mathbb{G}}\left[\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid S_{0})B_{k}(S_{0},a^{\prime})\right]\right|\leq\delta_{k},\quad\mbox{for}\quad k=1,2,\dots,K. (25b)

The resulting solutions $\{\hat{\omega}_{i,t}^{\pi}\}_{1\leq i\leq n,0\leq t<T}$ are the proposed weights. As such, the proposed estimator for ${\cal V}(\pi)$ is

\hat{\cal V}(\pi)=\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\hat{\omega}^{\pi}_{i,t}R_{i,t}. (26)

Similar to Theorem 1, we derive the dual form of (25), which is shown in Theorem 2 below. For notational simplicity, define $L_{k}(s,a)=B_{k}(s,a)-\gamma g^{\pi}_{*}(s,a;B_{k})$, $\hat{L}_{k}(s,a)=B_{k}(s,a)-\gamma\hat{g}^{\pi}(s,a;B_{k})$, ${\bm{B}}_{K}(s,a)=[{B}_{k}(s,a)]_{k=1}^{K}\in\mathbb{R}^{K}$, ${\bm{L}}_{K}(s,a)=[{L}_{k}(s,a)]_{k=1}^{K}\in\mathbb{R}^{K}$ and $\hat{\bm{L}}_{K}(s,a)=[\hat{L}_{k}(s,a)]_{k=1}^{K}\in\mathbb{R}^{K}$.

Theorem 2.

The dual of (25) is equivalent to the following unconstrained optimization problem:

\min_{\bm{\lambda}\in\mathbb{R}^{K}}\qquad\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\rho(\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})^{\intercal}\bm{\lambda})-\bm{\lambda}^{\intercal}\bm{l}_{K}+|\bm{\lambda}|^{\intercal}\bm{\delta}_{K}. (27)

And the primal solution to (25) is given by, for every 1in1\leq i\leq n and 0tT10\leq t\leq T-1,

\hat{\omega}^{\pi}_{i,t}=\rho^{\prime}(\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})^{\intercal}\bm{\lambda}^{+}), (28)

where $\rho(t)=t(h^{\prime})^{-1}(t)-h\{(h^{\prime})^{-1}(t)\}$ and $\bm{\lambda}^{+}$ is the solution to (27).

We sometimes write $\hat{\omega}^{\pi}_{i,t}=\hat{\omega}^{\pi}_{i,t}(S_{i,t},A_{i,t})$. The proof can be found in Section S2.2 of the Supplementary Material. As seen from the representation (28), $\hat{\omega}^{\pi}_{i,t}$ does not suffer from the expanded-dimension issue observed in (22). Besides, (27) can be regarded as an M-estimation of $\bm{\lambda}$ with a weighted $\ell_{1}$-norm regularization. Since the estimated weights are parametrized in $\bm{\lambda}$ via (28), Theorem 2 reveals a connection between $\hat{\omega}^{\pi}_{i,t}$ and the shrinkage estimation of the probability ratio function $\omega^{\pi}$. To see this, consider the objective function $T^{-1}\sum_{t=0}^{T-1}\rho({\bm{L}}_{K}(S_{t},A_{t})^{\intercal}\bm{\lambda})-\bm{\lambda}^{\intercal}\bm{l}_{K}$. By Lemma 1 in [68], the expectation of this loss function is minimized when $\bm{\lambda}$ satisfies $\omega^{\pi}(s,a)=\rho^{\prime}(\bm{L}_{K}(s,a)^{\intercal}\bm{\lambda})$ for every $s\in\mathcal{S}$, $a\in\mathcal{A}$. In Theorem 4 of Section 4, we establish the convergence rate of the proposed weights to the true weights in the scaling of both $n$ and $T$.

Next, we discuss the computation of the weights $\hat{\omega}_{i,t}$ in practice. The projection step (24) is a pre-computation step for the optimization (25). In other words, we only need to estimate $\hat{g}^{\pi}(S_{i,t},A_{i,t};B_{k})$, $k=1,\dots,K$, once, and there is no need to recompute them within the weights estimation (25). The optimization (25) is a standard convex optimization problem with linear constraints. We outline the proposed algorithm for computing the weights in Algorithm 1 of the Supplementary Material.
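To make the computation concrete, here is a minimal sketch of solving the dual (27) and recovering the weights via (28). It is an illustration under stated assumptions: we take $h(x)=(x-1)^{2}$ (so $\rho(t)=t^{2}/4+t$ and $\rho^{\prime}(t)=t/2+1$), use synthetic stand-ins for $\hat{\bm{L}}_{K}$, $\bm{l}_{K}$ and $\bm{\delta}_{K}$, and deliberately set $\bm{l}_{K}$ to the sample mean of the features so that the data are already balanced and the optimal weights are identically one.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
nT, K_dim = 500, 5

# Synthetic stand-ins: rows of Lhat play the role of Lhat_K(S_it, A_it).
Lhat = rng.standard_normal((nT, K_dim))
# Choosing l_K as the feature mean makes lambda = 0 the dual optimum,
# hence weights identically 1 -- a convenient correctness check.
l_K = Lhat.mean(axis=0)
delta_K = np.full(K_dim, 0.05)

# With h(x) = (x-1)^2: rho(t) = t^2/4 + t and rho'(t) = t/2 + 1.
rho = lambda t: t ** 2 / 4.0 + t
drho = lambda t: t / 2.0 + 1.0

def dual_obj(lam):
    # Empirical version of (27): mean rho(Lhat lam) - lam'l_K + |lam|'delta_K.
    return rho(Lhat @ lam).mean() - lam @ l_K + np.abs(lam) @ delta_K

# Derivative-free method to sidestep the non-smooth l1 term (for simplicity).
res = minimize(dual_obj, np.zeros(K_dim), method="Nelder-Mead")
weights = drho(Lhat @ res.x)  # primal weights via (28)
```

In practice $\hat{\bm{L}}_{K}$ comes from the pre-computed projection (24) applied to the chosen basis $B_{1},\dots,B_{K}$, and $\bm{l}_{K}$ is evaluated under the initial distribution $\mathbb{G}$.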

3.4 Other weighting methods

Apart from the proposed projection method, a sensible alternative to avoid the expanded dimension is to directly specify a class $\mathcal{F}_{w}$ of functions over $\mathcal{S}\times\mathcal{A}$ as a model of the weight function (i.e., weights are evaluations of this function), which naturally results in the following form of estimators:

\hat{f}^{\pi}_{w}=\operatorname*{argmin}_{f_{1}\in\mathcal{F}_{w}}\sup_{f_{2}\in\mathcal{F}_{Q}}\left[\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}f_{1}(S_{i,t},A_{i,t})\left\{f_{2}(S_{i,t},A_{i,t})-\gamma\sum_{a^{\prime}}\pi(a^{\prime}\mid S_{i,t+1})f_{2}(S_{i,t+1},a^{\prime})\right\}-(1-\gamma)\mathbb{E}_{S_{0}\sim\mathbb{G}}\left\{\sum_{a\in{\cal A}}\pi(a\mid S_{0})f_{2}(a,S_{0})\right\}-\nu_{1}\mathcal{P}_{1}(f_{2})\right], (29)

where $\mathcal{F}_{Q}$ is a space that models $Q^{\pi}$, $\mathcal{P}_{1}$ is a regularization functional and $\nu_{1}\geq 0$ is a tuning parameter. Similar approaches have been adopted by a few recent works such as [41, 68, 70], although they are not directly motivated by the above expanded-dimension issue. Since these works assume $T=1$ (without considering the dependence in the trajectory), we restrict our results to this setting for comparisons, with a remark that our estimator is analyzed in a more general setting of $T$ (e.g., either bounded, or diverging to infinity). To use the above estimator, however, the choice of the function classes $\mathcal{F}_{w}$ and $\mathcal{F}_{Q}$ poses computational difficulties for the weights. For nonparametric modeling, a natural idea is to take $\mathcal{F}_{Q}$ as an RKHS, since this often leads to a finite-dimensional optimization in regression problems via a representer theorem [71]. However, the term $\mathbb{E}_{S_{0}\sim\mathbb{G}}\{\sum_{a\in{\cal A}}\pi(a\mid S_{0})f(a,S_{0})\}$ could make the representer theorem inapplicable when $S$ is continuous, and so (29) becomes an impractical infinite-dimensional optimization. Another way is to take finite-dimensional spaces to approximate $\mathcal{F}_{Q}$ and $\mathcal{F}_{w}$. The dimensions of the approximation spaces would need to increase with $n$ so as to avoid an asymptotic bias. Moreover, the existing results on the convergence rate of such estimators to the policy value are not optimal (i.e., slower than $\sqrt{n}$ or $\sqrt{nT}$); see Corollary 11 in [70]. As for our weighted estimator, an optimal convergence rate with statistical efficiency can be achieved without an additional nuisance parameter estimation, even in the more general dependent setting when $T>1$. See Theorem 6 in Section 4.

4 Theoretical Results

In this section, we study the theoretical properties of the approximate projection $\hat{g}^{\pi}$ in (24), the balancing weights $\hat{\omega}^{\pi}$ in (25) and the final weighted estimator $\hat{\cal V}(\pi)$ in Section 3.3. Specifically, in Section 4.1, we derive the finite-sample error bound for the approximate projection. In Section 4.2, we study the convergence rate of the proposed balancing weights. Finally, we show that the proposed weighted estimator is statistically efficient under additional conditions specified in Section 4.3. In Section 4.4, we study the difficulty of offline RL in a conservative manner. To start with, we introduce some notation. We define the squared empirical norm as $\|f(\cdot,\cdot)\|^{2}_{n,T}=(nT)^{-1}\sum_{i=1}^{n}\sum_{t=0}^{T-1}f^{2}(S_{i,t},A_{i,t})$. The notation $\varpi(n,T)\lesssim\theta(n,T)$ (resp. $\varpi(n,T)\gtrsim\theta(n,T)$) means that there exists a constant $c>0$ such that $\varpi(n,T)\leq c\,\theta(n,T)$ (resp. $\varpi(n,T)\geq c\,\theta(n,T)$) for the sequences $\theta(n,T)$ and $\varpi(n,T)$. Also, we denote by $\mathcal{N}(\epsilon,{\mathcal{F}},\|\cdot\|)$ the $\epsilon$-covering number of ${\mathcal{F}}$ with respect to some norm $\|\cdot\|$. We take $d$ as the dimension of a state vector $S$.

4.1 Non-parametric Regressions with Exponentially β\beta-mixing Sequences

Recall that the constraint (25b) is merely a surrogate of the desired state-action balancing condition, due to the use of $\hat{g}^{\pi}(\cdot,\cdot;B_{k})$, $k=1,\dots,K$. To bound the effect of the surrogate error on the estimation of the weights and the policy value, we study the uniform convergence of the approximate projection. More generally, in Theorem 3 below, we obtain the error bound of $\hat{g}^{\pi}(\cdot,\cdot;B)$ uniformly over $B\in{\cal Q}$, where $\mathcal{Q}$ is a class of functions of interest defined on ${\cal S}\times{\cal A}$. The bound scales with both $n$ and $T$. Later, when we apply Theorem 3, $\mathcal{Q}$ will be taken as a subset of the linear span of $\{B_{1},\dots,B_{K}\}$; see Corollary 1 for more details. Theorem 3 requires the following assumption. Let $\|\cdot\|_{\mathrm{TV}}$ denote the total variation norm.

Assumption 4.

The following conditions hold.

  1. (a)

    The Markov chain $\{S_{t},A_{t}\}_{t\geq 0}$ has a unique stationary distribution $\mathbb{G}^{*}$ with density $p^{*}$, and is geometrically ergodic, i.e., there exist a function $\phi(s,a)$ and a constant $\kappa\in(0,1)$ such that, for any $s\in\mathcal{S}$ and $a\in\mathcal{A}$,

    \left\|\mathbb{G}^{b}_{t}(\cdot\mid(s,a))-\mathbb{G}^{*}(\cdot)\right\|_{\mathrm{TV}}\leq\phi(s,a)\kappa^{t},\qquad\forall t\geq 0,

    where $\mathbb{G}_{t}^{b}(\cdot\mid(s,a))$ is the behavior-policy-induced conditional distribution of $(S_{t},A_{t})$ given $S_{0}=s$ and $A_{0}=a$. Also, there exists a constant $C_{1}>0$ such that $\int\phi(s,a)\,d\mathbb{G}_{0}(s,a)\leq C_{1}$, where we recall that $\mathbb{G}_{0}$ is the initial distribution in the batch data.

  2. (b)

    The function class $\mathcal{Q}$ satisfies $\|B\|_{\infty}\leq Q_{\max}$ for all $B\in\mathcal{Q}$.

  3. (c)

    The function class $\mathcal{G}$ in (24) satisfies $\|g\|_{\infty}\leq G_{\max}$ for all $g\in\mathcal{G}$, and $g^{\pi}_{*}(\cdot,\cdot;B)\in\mathcal{G}$ for all $B\in\mathcal{Q}$.

  4. (d)

    The regularization functional $J_{\mathcal{G}}$ in (24) is a pseudo norm. Also, $J_{\mathcal{Q}}$ (chosen in Theorem 3) is a pseudo norm. Let $\mathcal{Q}_{M}=\{B:B\in\mathcal{Q},J_{\mathcal{Q}}(B)\leq M\}$ and $\mathcal{G}_{M}=\{g:g\in\mathcal{G},J_{\mathcal{G}}(g)\leq M\}$. There exist constants $C_{2}>0$ and $\alpha\in(0,1)$ such that for any $\epsilon,M>0$,

    \max\left\{\log\mathcal{N}(\epsilon,\mathcal{G}_{M},\|\cdot\|_{\infty}),\log\mathcal{N}(\epsilon,\mathcal{Q}_{M},\|\cdot\|_{\infty})\right\}\leq C_{2}(M/\epsilon)^{2\alpha}.

Assumption 4(a) basically assumes that the stationary distribution exists and that the mixing rate is exponentially fast. This allows the convergence rate to scale well with $T$. By a truncation argument (see the proof of Theorem 3), we can show that our estimation is almost equivalent to nonparametric regression based on stationary and exponentially $\beta$-mixing sequences. It is known that if a Markov chain is aperiodic and satisfies some drift condition in terms of a well-behaved non-negative measurable function, then it is geometrically ergodic; see [6] and Chapter 15 of [46] for a detailed characterization. The boundedness assumptions on $\mathcal{Q}$ and $\mathcal{G}$ in Assumptions 4(b) and 4(c) are used to simplify the proof and can be relaxed by a careful truncation argument. Similar assumptions are also adopted in [39]. Assumption 4(d) specifies the complexity of the spaces. These entropy assumptions are satisfied for common function classes, such as RKHSs and Sobolev spaces [22, 19, 20]. Taking Sobolev spaces as an example, $\alpha=d/(2q)$, where $q$ is the number of continuous derivatives possessed by the functions in the corresponding space. With Assumption 4, we obtain the following uniform finite-sample error bound.

Theorem 3.

Suppose Assumption 4 holds. For any $0<\tau\leq 1/3$ and all sufficiently large $nT$ (i.e., either $n$ or $T$ is large enough), with probability at least $1-2\delta-1/(nT)$, the following inequality holds for all $B\in\mathcal{Q}$:

\mathbb{E}\left\{\frac{1}{T}\sum_{t=0}^{T-1}\left[\hat{g}^{\pi}(S_{t},A_{t};B)-g_{*}^{\pi}(S_{t},A_{t};B)\right]^{2}\right\}+\|\hat{g}^{\pi}(\cdot,\cdot;B)-g_{*}^{\pi}(\cdot,\cdot;B)\|_{n,T}^{2}+\mu J_{\mathcal{G}}^{2}\{\hat{g}^{\pi}(\cdot,\cdot;B)\}\lesssim\mu\{1+J_{\mathcal{G}}^{2}(g^{\pi}_{*}(\cdot,\cdot;B))+J_{\mathcal{Q}}^{2}(B)\}+\frac{1}{nT\mu^{\frac{\alpha}{1-\tau(2+\alpha)}}}+\frac{[\log(\max\{1/\delta,nT\})]^{1/\tau}}{nT}, (30)

where the leading constant in the above inequality depends on $Q_{\max}$, $G_{\max}$, $\kappa$, and $C_{1}$.

Further suppose $J_{\mathcal{G}}^{2}(g^{\pi}_{*}(\cdot,\cdot;B))$ and $J_{\mathcal{Q}}^{2}(B)$ are bounded. If we take $\delta=(nT)^{-1}$, $\tau=(1+\alpha)\log(\log(nT))/(\alpha\log(nT))$ and $\mu\asymp(nT)^{-1/(1+\alpha)}(\log(nT))^{(2+\alpha)/(1+\alpha)}$, then for sufficiently large $nT$, the following inequality holds for all $B\in\mathcal{Q}$ with probability at least $1-1/(nT)$:

\mathbb{E}\{\hat{g}^{\pi}(\cdot,\cdot;B)-g_{*}^{\pi}(\cdot,\cdot;B)\}^{2}+\|\hat{g}^{\pi}(\cdot,\cdot;B)-g_{*}^{\pi}(\cdot,\cdot;B)\|_{n,T}^{2}\lesssim(nT)^{-\frac{1}{1+\alpha}}(\log(nT))^{\frac{2+\alpha}{1+\alpha}}.

Next, we adapt Theorem 3 to our case. More specifically, we consider the setting where ${\mathcal{G}}$ in Assumption 4(c) is an RKHS. Due to space limitations, we list the corresponding assumption in the Supplementary Material (see Assumption 4’ in Section 4.1 of the Supplementary Material). Then we have the following corollary.

Corollary 1.

Suppose Assumption 4’ (in the Supplementary Material) holds, and $\hat{g}^{\pi}(\cdot,\cdot;B_{k})$, $k=1,\dots,K$, are defined by (24) with the same tuning parameter $\mu$. If $\mu\asymp(nT)^{-1/(1+\alpha)}(\log(nT))^{(2+\alpha)/(1+\alpha)}$, then for any $\bm{\upsilon}=(\upsilon_{k})_{k=1}^{K}\in\mathbb{R}^{K}$ such that $\sum_{k=1}^{K}\upsilon_{k}B_{k}\in\mathcal{Q}$, $J_{\mathcal{Q}}(\sum_{k=1}^{K}\upsilon_{k}B_{k})<\infty$ and $g^{\pi}_{*}(\cdot,\cdot;\sum_{k=1}^{K}\upsilon_{k}B_{k})\in\mathcal{G}$ (where $K$ is allowed to grow with $nT$), and for sufficiently large $nT$, the following inequality holds with probability at least $1-1/(nT)$:

\mathbb{E}\left\{\sum_{k=1}^{K}\upsilon_{k}\hat{g}^{\pi}(\cdot,\cdot;B_{k})-\sum_{k=1}^{K}\upsilon_{k}g_{*}^{\pi}(\cdot,\cdot;B_{k})\right\}^{2}+\left\|\sum_{k=1}^{K}\upsilon_{k}\hat{g}^{\pi}(\cdot,\cdot;B_{k})-\sum_{k=1}^{K}\upsilon_{k}g_{*}^{\pi}(\cdot,\cdot;B_{k})\right\|_{n,T}^{2}\lesssim(nT)^{-\frac{1}{1+\alpha}}(\log(nT))^{\frac{2+\alpha}{1+\alpha}}.

The proofs of Theorem 3 and Corollary 1 can be found in Section S2.3 of the Supplementary Material. From Theorem 3, by carefully choosing $\mu$ and $\tau$, we can achieve, up to a logarithmic factor, the optimal nonparametric convergence rate of the i.i.d. setting, uniformly over all $B\in\mathcal{Q}$. More importantly, Theorem 3 does not require the initial distribution to be the stationary distribution, an assumption that can be unrealistic in practice but is often required in existing results such as [16]. Thus our result is broadly applicable. Accordingly, Corollary 1 provides a tight bound (in the scaling of both $n$ and $T$) on the proposed approximate projection step (24), which leads to an accurate estimation of the target state-action balancing condition for the construction of the proposed weights.
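To see what the rate in Theorem 3 means quantitatively, the small sketch below (our illustration only) evaluates the exponent $1/(1+\alpha)$ under the Sobolev example $\alpha=d/(2q)$ from the discussion of Assumption 4(d); recall that Assumption 4(d) requires $\alpha\in(0,1)$, i.e., $q>d/2$.

```python
def rate_exponent(d: int, q: int) -> float:
    """Exponent r in the bound (nT)^{-r} (up to log factors) from Theorem 3,
    under the Sobolev-entropy example alpha = d / (2q); requires q > d/2."""
    alpha = d / (2 * q)
    return 1.0 / (1.0 + alpha)

# Smoother targets (larger q) push the rate toward the parametric (nT)^{-1}.
exps = {q: rate_exponent(2, q) for q in (2, 5, 50)}
```

For a two-dimensional state, $q=2$ gives exponent $2/3$, while very smooth targets approach the parametric exponent $1$.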

4.2 Convergence Rates of Balancing Weights

With Theorem 3 and Corollary 1, we now derive the convergence of the proposed weights $\hat{\omega}^{\pi}$. Define

\zeta_{n,T}=\left((nT)^{-\frac{1}{1+\alpha}}(\log(nT))^{\frac{2+\alpha}{1+\alpha}}\right)^{1/2}.

For any square and symmetric matrix $\bm{A}$, $\lambda_{\max}(\bm{A})$ and $\lambda_{\min}(\bm{A})$ denote the maximum and minimum eigenvalues, respectively. We use $\|\cdot\|_{2}$ to denote the Euclidean norm of a vector. We will need the following assumption.

Assumption 5.

The following conditions hold.

  1. (a)

    There exist $r_{1}>1/2$ and $\bm{\lambda}^{*}\in\mathbb{R}^{K}$ such that the true weight function $\omega^{\pi}(s,a)$ satisfies $\sup_{s,a}|\omega^{\pi}(s,a)-\rho^{\prime}\{\bm{L}_{K}(s,a)^{\intercal}\bm{\lambda}^{*}\}|\leq C_{3}K^{-r_{1}}$, where $C_{3}>0$ is a constant. Also, there exists a constant $C_{4}>0$ such that $\sup_{s,a}|\omega^{\pi}(s,a)|\leq C_{4}$.

  2. (b)

    The second derivative of $\rho$ defined in Theorem 2, i.e., $\rho^{\prime\prime}$, is a positive and continuous function and $\rho^{\prime\prime}\geq C_{5}$ for some constant $C_{5}>0$.

  3. (c)

    There exist constants $C_{6}>0$ and $C_{7}>0$ such that $\sup_{s,a}\|\bm{B}_{K}(s,a)\|_{2}\leq C_{6}K^{1/2}$, and for every $T\geq 0$,

    \max\left\{\lambda_{\max}\left\{\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1}\bm{L}_{K}(S_{t},A_{t})\bm{L}_{K}(S_{t},A_{t})^{\intercal}\right]\right\},\mathbb{E}_{(S,A)\sim\mathbb{G}^{*}}\left[\frac{\bar{p}^{b}_{T}(S,A)}{p^{*}(S,A)}\right]^{2}\right\}\leq C_{7},\qquad\lambda_{\max}\left\{\mathbb{E}_{(S,A)\sim\mathbb{G}^{*}}\left[\bm{L}_{K}(S,A)\bm{L}_{K}(S,A)^{\intercal}\right]\right\}\leq C_{7}.
  4. (d)

    There exists a quantity ψ(K)>0\psi(K)>0 depending on KK such that

    \lambda_{\min}\left\{\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1}\bm{L}_{K}(S_{t},A_{t})\bm{L}_{K}(S_{t},A_{t})^{\intercal}\right]\right\}\geq\psi(K).
  5. (e)

    $\sum_{k=1}^{K}\bm{\lambda}^{\ast}_{k}B_{k}\in\mathcal{Q}$ and $g^{\pi}_{*}(\cdot,\cdot;\sum_{k=1}^{K}\bm{\lambda}^{\ast}_{k}B_{k})\in\mathcal{G}$, where $\mathcal{Q}$ and $\mathcal{G}$ are the function classes considered in Corollary 1 (also see Section S2.3 of the Supplementary Material). In addition, the same tuning parameter $\mu\asymp(nT)^{-1/(1+\alpha)}(\log(nT))^{(2+\alpha)/(1+\alpha)}$ is adopted when estimating $g^{\pi}_{*}(\cdot,\cdot;B_{k})$, $k=1,\dots,K$. For any vector $\bm{\alpha}=(\alpha_{k})_{k=1}^{K}$ that satisfies $\|\sum_{k=1}^{K}\alpha_{k}B_{k}\|_{\infty}\leq Q_{\max}$, we have $\sum_{k=1}^{K}\alpha_{k}B_{k}\in\mathcal{Q}$ and $g^{\pi}_{*}(\cdot,\cdot;\sum_{k=1}^{K}\alpha_{k}B_{k})\in\mathcal{G}$. Moreover, $K=o(\zeta_{n,T}^{-2})$.

  6. (f)

    \|\bm{\delta}_{K}\|_{2}=\mathcal{O}\left[\{\psi(K)\}^{-1}\left(\sqrt{K}\log(nT)/\sqrt{nT}+K^{-r_{1}}+\sqrt{K}\zeta_{n,T}\right)\right].

Assumption 5(a) specifies the requirement on the uniform approximation error of the true weight function $\omega^{\pi}$. The boundedness of $\omega^{\pi}$ is guaranteed if the average visitation density $\bar{p}_{T}^{b}$ is bounded away from 0 and the policy-induced discounted visitation probability $d^{\pi}$ is bounded above. This overlapping condition is commonly assumed in the ATE and RL literature [e.g. 74, 68, 70]. For Assumption 5(b) on $\rho^{\prime\prime}$, by the relationship between $\rho$ and $h$ detailed in Theorem 2, many convex choices of $h$ adopted in (25) result in a $\rho$ that satisfies this condition, e.g., $h(x)=(x-1)^{2}$. For Assumption 5(c), the uniform bound for $\|\bm{B}_{K}\|_{2}$ is a mild technical condition on the basis functions $B_{k}$, $k=1,\dots,K$. It is satisfied by many classes of basis functions, including the regression spline, the trigonometric polynomial, and the wavelet basis [51, 23, 8, 2]. See the same assumption in [13] (Condition 6 of their Assumption 4.1). As for the largest-eigenvalue condition, we verify that it holds for common bases such as the tensor-product B-spline basis and the tensor-product wavelet basis in Lemma S5 of the Supplementary Material. As for Assumption 5(d), $\psi(K)$ is allowed to depend on $K$ in our analysis, as opposed to existing studies such as [62]. While we allow $\psi(K)$ to diminish as $K$ grows, we later show that $\psi(K)$ can be strictly bounded below if we are willing to make further relatively minor assumptions; see the detailed discussion in Section 4.4. Assumption 5(e) is a mild condition requiring that the best approximation $\sum_{k=1}^{K}\bm{\lambda}^{\ast}_{k}B_{k}$ lies in $\mathcal{Q}$ and that its projection lies in $\mathcal{G}$. Now we can show the convergence of the proposed balancing weights.

Theorem 4.

Suppose Assumptions 4’ (in the Supplementary Material) and 5 hold. If we further assume that $\{\psi(K)\}^{-1}\sqrt{K}\zeta_{n,T}=o(1)$ as $nT\rightarrow+\infty$, then

\max\left\{\left\|\hat{\omega}^{\pi}-\omega^{\pi}\right\|_{n,T},\mathbb{E}\left\{\frac{1}{T}\sum_{t=0}^{T-1}\left[\hat{\omega}^{\pi}(S_{t},A_{t})-\omega^{\pi}(S_{t},A_{t})\right]^{2}\right\}^{1/2}\right\}=\mathcal{O}_{\mathrm{p}}\left[\{\psi(K)\}^{-1}\left(\frac{\sqrt{K}\log(nT)}{\sqrt{nT}}+K^{-r_{1}}+\sqrt{K}\zeta_{n,T}\right)\right]=o_{\mathrm{p}}(1).

The proof of Theorem 4 can be found in Section S2.4 of the Supplementary Material. Theorem 4 gives the convergence rate of the proposed balancing weights in terms of both the empirical $L_{2}$ and the $L_{2}$ metrics. Note that the rate implies that, as long as either $n$ or $T$ goes to infinity, the proposed weight estimates converge to the true ratio function. To the best of our knowledge, this is the first consistency result for the ratio function in the scaling of both $n$ and $T$ under the discounted infinite-horizon setting.

4.3 Estimation Error and Statistical Efficiency

Based on Theorems 3 and 4, we can derive an error bound for $\hat{\cal V}(\pi)$.

Assumption 6.

Suppose the following conditions hold.

  1. (a)

    There exists a $\bm{\beta}\in\mathbb{R}^{K}$ such that $\sup_{s,a}|\Delta_{Q}(s,a)|=\sup_{s,a}|Q^{\pi}(s,a)-\bm{B}_{K}(s,a)^{\intercal}\bm{\beta}|\leq C_{8}K^{-r_{2}}$ for some constants $r_{2}>1/2$ and $C_{8}>0$. Also, $\sum_{k=1}^{K}\bm{\beta}_{k}B_{k}\in\mathcal{Q}$ and $g^{\pi}_{*}(\cdot,\cdot;\sum_{k=1}^{K}\bm{\beta}_{k}B_{k})\in\mathcal{G}$, where $\mathcal{Q}$ and $\mathcal{G}$ are the function classes considered in Corollary 1 (also see Section S2.3 of the Supplementary Material).

  2. (b)

    The errors ϵi,t:=Ri,t𝔼(Ri,tSi,t,Ai,t)\epsilon_{i,t}:=R_{i,t}-\mathbb{E}(R_{i,t}\mid S_{i,t},A_{i,t}), i=1,,n,t=0,,T1i=1,\dots,n,t=0,\dots,T-1, are independent mean-zero subgaussian random variables.

Assumption 6(a) is a regularity condition for Q^π. It is satisfied by letting r₂ = q_Q/d, where q_Q is the maximum number of continuous derivatives of Q^π with respect to S among all action levels, and choosing B_K as basis functions such as splines or power series when Q^π is defined over a compact domain. Assumption 6(b) is a mild condition on the error of the reward. In fact, this assumption can be relaxed to allow the errors themselves to be dependent. Note that the estimated weight function ω̂^π only depends on {S_{i,t}, A_{i,t} : i=1,…,n, t=0,…,T−1}, which is independent of {ε_{i,t} : i=1,…,n, t=0,…,T−1}. The proof of the convergence of 𝒱̂(π) stated in the following theorems is based on a conditioning argument that separates the effects of the weights and the errors. Standard techniques for weighted averages of dependent random variables can be adopted to extend the current results (e.g., Theorem 5) to such dependent settings. For instance, if there are weak autocorrelations among the ε_{i,t}, we can still obtain the results in Theorems 5 and 6.

Theorem 5.

Suppose Assumptions 4’ (in the Supplementary Material), 5 and 6 hold. Furthermore, assume {ψ(K)}^{−1}√K ζ_{n,T} = o(1). Then we have

|𝒱^(π)𝒱(π)|=𝒪p{𝜷𝜹K1+{ψ(K)}1(Kr1ζn,T+K1/2r2ζn,T+Kr1r2+Kζn,T2)+ζn,T},\left|\hat{\cal V}(\pi)-{\cal V}(\pi)\right|=\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}\left\{{\|\bm{\beta}\circ\bm{\delta}_{K}\|_{1}}\right.\\ \left.+\{\psi(K)\}^{-1}\left(K^{-r_{1}}\zeta_{{\color[rgb]{0,0,0}n,T}}+K^{1/2-r_{2}}\zeta_{{\color[rgb]{0,0,0}n,T}}+K^{-r_{1}-r_{2}}+\sqrt{K}\zeta^{2}_{{\color[rgb]{0,0,0}n,T}}\right)+\zeta_{{\color[rgb]{0,0,0}n,T}}\right\},

where \circ refers to the element-wise product between two vectors.

Theorem 5 can be proved by arguments similar to those in the proof of Theorem 6. To obtain a sharp convergence rate from this theorem, one needs to tune K accordingly. As such, the eigenvalue bound ψ(K) specified in Assumption 5(d) becomes crucial. Indeed, we can show that ψ(K) is strictly lower bounded by a positive constant independent of K under some mild conditions. We defer the detailed discussion of characterizing ψ(K) to Section 4.4. In the following, we establish the bounds when ψ(K) is bounded below.
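To make the role of K concrete, the following sketch evaluates the terms of the bound in Theorem 5 over a grid of K, under purely hypothetical values r₁ = r₂ = 1 and ζ_{n,T} ≈ √(log(nT)/nT), treating ψ(K) as a constant; none of these choices come from the paper's actual tuning procedure.

```python
import numpy as np

# Hypothetical smoothness levels and statistical rate; illustrative only.
n, T = 40, 50
r1, r2 = 1.0, 1.0
zeta = np.sqrt(np.log(n * T) / (n * T))

K = np.arange(4, 200, dtype=float)
# Terms of the bound in Theorem 5 (psi(K) treated as constant, dropped here)
bound = K**(-r1) * zeta + K**(0.5 - r2) * zeta + K**(-r1 - r2) + np.sqrt(K) * zeta**2
K_star = int(K[np.argmin(bound)])
print(K_star)
```

The first three terms shrink with K (bias) while the last grows (variance), so the minimizer trades them off, mirroring the K ≍ ζ^{-a} prescriptions in Theorem 6.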

Theorem 6.

Suppose Assumptions 4’ (in the Supplementary Material), 5(a)–(g) and 6 hold. Also, assume that ψ(K) ≥ C₉ for some constant C₉ > 0.

  1. (i)

    If 𝜷𝜹K1=𝒪p(ζn,T){\|\bm{\beta}\circ\bm{\delta}_{K}\|_{1}}=\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}(\zeta_{{\color[rgb]{0,0,0}n,T}}) and K1=𝒪p(ζn,T1/(r1+r2))K^{-1}=\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}(\zeta_{{\color[rgb]{0,0,0}n,T}}^{1/(r_{1}+r_{2})}) , we have

    |𝒱^(π)𝒱(π)|=𝒪p(ζn,T).\displaystyle\left|\hat{\cal V}(\pi)-{\cal V}(\pi)\right|=\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}(\zeta_{{\color[rgb]{0,0,0}n,T}}). (31)

    In particular, we can take Kξn,TaK\asymp\xi_{{\color[rgb]{0,0,0}n,T}}^{-a} for any aa such that 1/(r1+r2)<a<21/(r_{1}+r_{2})<a<2.

  2. (ii)

    Assume 𝜷𝜹K1=𝒪p((nT)1/2){\|\bm{\beta}\circ\bm{\delta}_{K}\|_{1}}=\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}(({\color[rgb]{0,0,0}nT})^{-1/2}), r11r_{1}\geq 1, r2>1r_{2}>1, K=𝒪((nT)1ξn,T4)K=\scalebox{0.7}{$\mathcal{O}$}(({\color[rgb]{0,0,0}nT})^{-1}\xi_{{\color[rgb]{0,0,0}n,T}}^{-4}), K1=𝒪((nT)1/(2r1)ζn,T1/r1)K^{-1}=\scalebox{0.7}{$\mathcal{O}$}(({\color[rgb]{0,0,0}nT})^{-1/(2r_{1})}\zeta_{{\color[rgb]{0,0,0}n,T}}^{-1/r_{1}}), K1=𝒪((nT)1/(2r21)ζn,T1/(r21/2))K^{-1}=\scalebox{0.7}{$\mathcal{O}$}(({\color[rgb]{0,0,0}nT})^{-1/(2r_{2}-1)}\zeta_{{\color[rgb]{0,0,0}n,T}}^{-1/(r_{2}-1/2)}), Kr1r2=𝒪p((nT)1/2)K^{-r_{1}-r_{2}}=\scalebox{0.7}{$\mathcal{O}$}_{\mathrm{p}}(({\color[rgb]{0,0,0}nT})^{-1/2}), and ωπ𝒢\omega^{\pi}\in\mathcal{G}. Take

    σ2=1Tt=0T1𝔼{ωπ(St,At)(Rt+γa𝒜π(aSt+1)Qπ(St+1,a)Qπ(St,At))}2.\sigma^{2}=\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\left\{\omega^{\pi}(S_{t},A_{t})\left(R_{t}+\gamma\sum_{a^{\prime}\in{\cal A}}\pi(a^{\prime}\mid S_{t+1})Q^{\pi}(S_{t+1},a^{\prime})-Q^{\pi}(S_{t},A_{t})\right)\right\}^{2}.

    Then, as either nn\rightarrow\infty or TT\rightarrow\infty, we have

    nTσ{𝒱^(π)𝒱(π)}𝑑N(0,1).\displaystyle\frac{\sqrt{nT}}{\sigma}\left\{\hat{\cal V}(\pi)-{\cal V}(\pi)\right\}\xrightarrow[]{d}N(0,1). (32)

    In particular, if r1=1r_{1}=1, then we can take K(nT)aK\asymp({\color[rgb]{0,0,0}nT})^{a} for any aa such that 1/(6r23)<a<1/31/(6r_{2}-3)<a<1/3 so that the above results hold.

The proof of this theorem can be found in Section S2.5 of the Supplementary Material. Note that Theorem 6 requires ψ(K) to be lower bounded by a positive constant, which can be satisfied under the conditions in Assumption 7. The constraints on 𝜹_K in Theorem 6(i) and (ii) are stronger than that in Assumption 5, which leads to the desired ζ_{n,T}-consistency and √(nT)-consistency of our weighted estimator in cases (i) and (ii), respectively. Compared with case (i), the constraints on K in case (ii) are more restrictive so that the bias is asymptotically negligible and thus 𝒱̂(π) is √(nT)-consistent. When r₁ ≥ 1 and r₂ > 1, the existence of such a K is guaranteed. The additional condition ω^π ∈ 𝒢 is mild and allows the bias term to diminish asymptotically.

If T is fixed, one can show that σ²/T is indeed the semi-parametric efficiency bound in the standard i.i.d. setting. When both n and T are allowed to go to infinity, σ² becomes 𝔼*{ω^π(S,A)(R + γΣ_{a′∈𝒜} π(a′∣S′)Q^π(S′,a′) − Q^π(S,A))}², where 𝔼* denotes the expectation with respect to the stationary measure induced by the behavior policy, i.e., 𝔾*. As shown in [30], σ² is also the statistical efficiency bound in the sense of [33]. Note that our results do not require the data to come from the stationary distribution, which is however needed in [30].

Finally, we remark that, to the best of our knowledge, only two prior works establish convergence rates of policy value estimators under non-parametric models in the scaling of both n and T. One is [62], which directly estimates the Q-function. Their analysis does not require the stationarity assumption on the data-generating process. However, they did not show that their estimator achieves statistical efficiency. In addition, their conditions require the initial distribution of the data to be bounded away from zero, which we do not require. The other is [30], which shows the convergence of their estimator as T → ∞. Their theoretical results require the stationarity assumption. To obtain efficiency, this work requires cross-fitting (sample splitting) for the estimation of the Q-function and the ratio (weight) function. However, it is unclear how to perform an efficient cross-fitting due to the temporal dependence within the batch data.

4.4 Lower boundedness for the minimal eigenvalue

In this section, we discuss ψ(K), defined in Assumption 5(d), in detail. For notational simplicity, we write Pr̄ and 𝔼̄ for the probability and expectation with respect to the average visitation distribution p̄_T^b. We also write

dπ(s,as,a)=t=0γtptπ(s,aS0=s,A0=a)d^{\pi}(s^{\prime},a^{\prime}\mid s,a)=\sum_{t=0}^{\infty}\gamma^{t}p_{t}^{\pi}(s^{\prime},a^{\prime}\mid S_{0}=s,A_{0}=a)

as the conditional discounted visitation probability. Define the operator 𝒫^π: ℝ^{𝒮×𝒜} → ℝ^{𝒮×𝒜} by

(𝒫πf)(s,a)=𝔼¯[a𝒜f(S,a)π(aS)S=s,A=a](\mathcal{P}^{\pi}f)(s,a)=\bar{\mathbb{E}}\left[\sum_{a^{\prime}\in\mathcal{A}}f(S^{\prime},a^{\prime})\pi(a^{\prime}\mid S^{\prime})\mid S=s,A=a\right]

for any function f:𝒮×𝒜f:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}, and denote I:𝒮×𝒜𝒮×𝒜I:\mathbb{R}^{\mathcal{S}\times\mathcal{A}}\rightarrow\mathbb{R}^{\mathcal{S}\times\mathcal{A}} as the identity operator.
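For intuition, in a finite (tabular) state-action space the operator 𝒫^π is a matrix acting on functions of (s,a), and d^π can be read off from a Neumann series. The following sketch, on a randomly generated toy MDP of our own (not from the paper), illustrates both objects; each row of (I − γ𝒫^π)^{-1} carries total mass 1/(1−γ).

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma = 4, 2, 0.9
n = nS * nA                                  # number of state-action pairs

p = rng.dirichlet(np.ones(nS), size=n)       # p(s'|s,a), rows indexed by (s,a)
pi = rng.dirichlet(np.ones(nA), size=nS)     # target policy pi(a'|s')

# Matrix of P^pi: entry [(s,a), (s',a')] = p(s'|s,a) * pi(a'|s')
P = np.einsum('ks,sa->ksa', p, pi).reshape(n, n)

# d^pi(.,.|s,a) = sum_t gamma^t p_t^pi(.,.|s,a): the rows of (I - gamma P)^{-1}
d = np.linalg.inv(np.eye(n) - gamma * P)
print(d.sum(axis=1))                         # each row has total mass 1/(1-gamma)
```

Since P^π is row-stochastic, (I − γP^π)^{-1} = Σ_t γ^t (P^π)^t, matching the series definition of d^π above.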

In Section C.1 of [62], a sufficient condition for the lower boundedness of ψ(K)\psi(K) is provided. They argue that, under some boundedness conditions on the average visitation probability, the minimal eigenvalue is lower bounded by a constant independent of KK as long as γ\gamma is small enough. (In fact, we show that this sufficient condition can be further relaxed; see Corollary 2 for details.) However, in practice, we may not know the distance between the target policy and the behavior policy in advance, or be able to choose a reasonably small γ\gamma that reflects the desired emphasis of the long-term rewards. More importantly, choosing γ\gamma close to 11 is often preferred in many applications as we discussed before. Despite its importance, the theoretical property of non-parametric OPE under this setting is largely uncharted territory in the current state of the literature. As a result, it is important to understand the behavior of ψ(K)\psi(K) for any 0γ<10\leq\gamma<1 and target policy π\pi.

In the following, we focus on the general operator I − γ𝒫^π and study its (squared) minimal eigenvalue:

Υ:=inf{f:𝔼¯f2(S,A)1}[𝔼¯{(Iγ𝒫π)f(S,A)}2].\displaystyle\Upsilon:=\inf_{\{f:\bar{\mathbb{E}}f^{2}(S,A)\geq 1\}}\left[{\bar{\mathbb{E}}\left\{(I-\gamma\mathcal{P}^{\pi})f(S,A)\right\}^{2}}\right]. (33)

To see the relationship between Υ\Upsilon and ψ(K)\psi(K), we can take f(s,a)=𝑩K(s,a)𝜶f(s,a)=\bm{B}_{K}^{\intercal}(s,a)\bm{\alpha} for some 𝜶K\bm{\alpha}\in\mathbb{R}^{K} in (33). Then we have

\psi(K) = \inf_{\|\bm{\alpha}\|_{2}=1}\bar{\mathbb{E}}\left\{(I-\gamma\mathcal{P}^{\pi})\bm{B}_{K}^{\intercal}(S,A)\bm{\alpha}\right\}^{2}
= \inf_{\|\bm{\alpha}\|_{2}=1}\frac{\bar{\mathbb{E}}\left\{(I-\gamma\mathcal{P}^{\pi})\bm{B}_{K}^{\intercal}(S,A)\bm{\alpha}\right\}^{2}}{\bar{\mathbb{E}}\left\{\bm{B}_{K}^{\intercal}(S,A)\bm{\alpha}\right\}^{2}}\,\bar{\mathbb{E}}\left\{\bm{B}_{K}^{\intercal}(S,A)\bm{\alpha}\right\}^{2}
\geq \Upsilon\,\lambda_{\min}\left\{\bar{\mathbb{E}}\,\bm{B}_{K}(S,A)\bm{B}_{K}^{\intercal}(S,A)\right\}.
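The inequality above can be checked numerically in a tabular toy example, with an arbitrary basis matrix and an arbitrary weighting distribution standing in for p̄_T^b (all of these are our own illustrative choices, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(2)
nS, nA, gamma, K = 4, 2, 0.9, 3
n = nS * nA

p = rng.dirichlet(np.ones(nS), size=n)       # p(s'|s,a)
pi = rng.dirichlet(np.ones(nA), size=nS)     # pi(a'|s')
P = np.einsum('ks,sa->ksa', p, pi).reshape(n, n)
w = rng.dirichlet(np.ones(n))                # stand-in for \bar p_T^b
D, s = np.diag(w), np.sqrt(w)
B = rng.standard_normal((n, K))              # arbitrary basis matrix B_K

A = np.eye(n) - gamma * P
# psi(K): smallest eigenvalue of the weighted Gram matrix of (I - gamma P) B
psi_K = np.linalg.eigvalsh((A @ B).T @ D @ (A @ B)).min()

# Upsilon: smallest generalized eigenvalue of (A^T D A, D), via D^{-1/2} scaling
Upsilon = np.linalg.eigvalsh((A.T @ D @ A) / np.outer(s, s)).min()
lam_min = np.linalg.eigvalsh(B.T @ D @ B).min()
print(psi_K, Upsilon * lam_min)              # left-hand side dominates
```

For every unit vector α, ‖(I−γP)Bα‖²_D ≥ Υ‖Bα‖²_D ≥ Υ λ_min(BᵀDB), which is exactly the chain of inequalities in the display.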

In Theorem 7, we provide a necessary and sufficient condition for the lower boundedness of the minimal eigenvalue of the operator I − γ𝒫^π that holds for any π and γ. When combined with the standard eigenvalue condition on the basis 𝑩_K, it allows a comprehensive characterization of the lower boundedness of ψ(K).

Theorem 7.

Υ\Upsilon is lower bounded by a positive constant if and only if

χ:=sup{f:𝔼¯f2(S,A)1}𝔼¯[𝔼(S,A)dπ(,S,A){f(S,A)}]2<+.\chi:=\sup_{\{f:\bar{\mathbb{E}}f^{2}(S,A)\leq 1\}}{\bar{\mathbb{E}}\left[\mathbb{E}_{(S^{\prime},A^{\prime})\sim d^{\pi}(\cdot,\cdot\mid S,A)}\left\{f(S^{\prime},A^{\prime})\right\}\right]^{2}}<+\infty.

In this case, Υ=(1γ)2/χ\Upsilon=(1-\gamma)^{2}/\chi.
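In a tabular toy model (our own construction, with d^π normalized to a probability distribution so that 𝔼_{(S′,A′)∼d^π} corresponds to (1−γ)(I−γ𝒫^π)^{-1}), both Υ and χ reduce to generalized eigenvalue problems, and the identity Υ = (1−γ)²/χ can be verified numerically:

```python
import numpy as np

rng = np.random.default_rng(3)
nS, nA, gamma = 5, 2, 0.7
n = nS * nA

p = rng.dirichlet(np.ones(nS), size=n)       # p(s'|s,a)
pi = rng.dirichlet(np.ones(nA), size=nS)     # pi(a'|s')
P = np.einsum('ks,sa->ksa', p, pi).reshape(n, n)   # matrix of P^pi
w = rng.dirichlet(np.ones(n))                # stand-in for \bar p_T^b
D, s = np.diag(w), np.sqrt(w)

A = np.eye(n) - gamma * P
# E_{d^pi}[f] corresponds to (1 - gamma)(I - gamma P)^{-1} f once d^pi is normalized
M = (1 - gamma) * np.linalg.inv(A)

# Generalized eigenvalues w.r.t. D, computed after symmetric D^{-1/2} scaling
Upsilon = np.linalg.eigvalsh((A.T @ D @ A) / np.outer(s, s)).min()
chi = np.linalg.eigvalsh((M.T @ D @ M) / np.outer(s, s)).max()
print(Upsilon, (1 - gamma) ** 2 / chi)       # the two sides of Theorem 7 coincide
```

Substituting g = (I−γP)f shows Υ = min_g ‖g‖²_D / ‖Mg/(1−γ)‖²_D = (1−γ)²/χ, which is what the numerical check confirms.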

Theorem 7 allows us to avoid directly analyzing the minimal eigenvalue of I − γ𝒫^π and instead study the upper bound of χ, which is much easier to handle. We can view χ as a criterion for the difficulty of an OPE problem: the larger χ is, the more difficult the OPE problem becomes. The value of χ depends on many components, including the target policy π, the behavior policy π_b, the discount factor γ and the horizon T of the observed data. Next, we focus on χ and provide some sufficient conditions for it to be upper bounded. Take

d¯π(s,a)=𝔼¯[dπ(s,aS,A)].{\bar{d}}^{\pi}(s^{\prime},a^{\prime})=\bar{\mathbb{E}}\left[d^{\pi}(s^{\prime},a^{\prime}\mid S,A)\right].
Assumption 7.

Suppose the following conditions hold.

  1. (a)

    The average visitation probability (density) p¯Tb(s,a)\bar{p}_{T}^{b}(s,a) is lower bounded by a constant pmin>0p_{\min}>0 and upper bounded by a constant pmax,1p_{\max,1}.

  2. (b)

    The transition probability under the target policy q(s,as,a):=π(as)p(ss,a)q(s^{\prime},a^{\prime}\mid s,a):=\pi(a^{\prime}\mid s^{\prime})p(s^{\prime}\mid s,a) is upper bounded by some constant pmax,2p_{\max,2} for all (s,a)𝒮×𝒜(s^{\prime},a^{\prime})\in\mathcal{S}\times\mathcal{A}.

Corollary 2.

If there exists a constant C₁₀ such that

\varrho:=\sup_{(s,a)\in\mathcal{S}\times\mathcal{A}}\frac{\bar{d}^{\pi}(s,a)}{\bar{p}_{T}^{b}(s,a)}\leq C_{10}, \qquad (34)

then we have

\chi\leq C_{10}\quad\mbox{and}\quad\Upsilon\geq\frac{(1-\gamma)^{2}}{C_{10}}.

In addition, if we assume Assumption 7, then we have

χpmaxpmin,Υ(1γ)2pminpmax,\chi\leq\frac{p_{\max}}{p_{\min}},\qquad\Upsilon\geq{(1-\gamma)^{2}}\frac{p_{\min}}{p_{\max}},

where pmax=max{pmax,1,pmax,2}p_{\max}=\max\{p_{\max,1},p_{\max,2}\}. Finally, ψ(K)(1γ)2pminpmax\psi(K)\gtrsim{(1-\gamma)^{2}}\frac{p_{\min}}{p_{\max}} under an additional condition that

\lambda_{\min}\left\{\bar{\mathbb{E}}\left[\bm{B}_{K}(S,A)\bm{B}_{K}(S,A)^{\intercal}\right]\right\}\gtrsim 1. \qquad (35)

In Corollary 2, (34) provides a sufficient condition for χ to be upper bounded (or equivalently, for Υ to be lower bounded) that works in general settings without any restriction on γ or the target policy π. The second part of Corollary 2 provides a specific characterization of ϱ under Assumption 7, in which we require the average visitation probability p̄_T^b to be lower and upper bounded. This coverage assumption is very common in the RL literature [e.g. 54, 30, 62]. Compared to [62], we provide an explicit bound with respect to p_min and p_max. More importantly, we do not impose any further assumptions on the target policy π or the discount factor γ in order to show the lower boundedness of ψ(K). In other words, such boundedness holds uniformly over all target policies π and all γ. We note that there is parallel work on the well-posedness and minimax optimal rates of nonparametric Q-function estimation in OPE [10]. They provide a sufficient condition for the lower boundedness of Υ. Their result is similar to the one specified in Corollary 2 and requires boundedness conditions similar to those in Assumption 7; see Assumption 4(a) and Theorem 1 in [10] for more details. Compared to their bound:

Υ(1γ)2pminpmax(1+γ2pmaxpmin),\Upsilon\gtrsim\frac{(1-\gamma)^{2}p_{\min}}{p_{\max}\left(1+\gamma^{2}\frac{p_{\max}}{p_{\min}}\right)},

Corollary 2 provides a sharper dependence with respect to γ\gamma, pminp_{\min} and pmaxp_{\max}. Apart from Corollary 2, we note that our general result in Theorem 7 does not require any boundedness condition on visitation and transition probability as in Assumption 7 and [10], and, more importantly, provides a necessary and sufficient condition for the boundedness of Υ\Upsilon.

Finally, we remark that the additional assumption (35) is satisfied for common bases such as tensor-product B-spline basis and tensor-product wavelet basis. See Lemma S5 of the Supplementary Material.
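As a quick illustration of condition (35), consider a normalized histogram (piecewise-constant) basis instead of B-splines or wavelets; this simplified choice is ours, not the paper's. If the design density is bounded away from zero, the Gram matrix 𝔼̄[𝑩_K𝑩_K^⊺] has eigenvalues bounded below:

```python
import numpy as np

rng = np.random.default_rng(4)
K, m = 8, 200_000
S = rng.uniform(size=m)                          # design points on [0, 1]

# Normalized histogram basis: B_k(s) = sqrt(K) * 1{s in k-th bin}
bins = np.minimum((S * K).astype(int), K - 1)
B = np.sqrt(K) * np.eye(K)[bins]                 # m x K basis evaluations

gram = B.T @ B / m                               # empirical E[B_K B_K^T]
min_eig = np.linalg.eigvalsh(gram).min()
print(min_eig)                                   # close to 1 for a uniform density
```

Here the Gram matrix is exactly diagonal with entries K·p_k, where p_k is the bin probability, so the uniform design gives eigenvalues near 1; the same mechanism underlies the B-spline and wavelet cases treated in Lemma S5.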

5 Simulation Study

We conduct a simulation study to investigate the finite-sample performance of the proposed estimator. We adopt simulation settings similar to those in [42], [39] and [62]. Specifically, the data-generating model is given as follows. The state variables are two-dimensional, i.e., S_{i,t}=(S^{(1)}_{i,t},S^{(2)}_{i,t}) for 0≤t≤T and 1≤i≤n, while the action is binary, i.e., 𝒜={0,1}. The initial state follows the standard bivariate normal distribution. The transition dynamics are given by S^{(1)}_{i,t+1}=(3/4)(2A_{i,t}−1)S^{(1)}_{i,t}+ε^{(1)}_{i,t} and S^{(2)}_{i,t+1}=(3/4)(1−2A_{i,t})S^{(2)}_{i,t}+ε^{(2)}_{i,t}, where ε^{(1)}_{i,t} and ε^{(2)}_{i,t} are independent normal random variables with mean 0 and variance 0.25. Under the behavior policy, actions are drawn independently from a Bernoulli distribution with mean 1/2. The immediate reward is R_{i,t}=2S^{(1)}_{i,t+1}+S^{(2)}_{i,t+1}−(1/4)(2A_{i,t}−1). We use the initial state distribution as the reference distribution 𝔾 and set γ to 0.9. We evaluate the following four target policies.

  1. (a)

π₁(1∣s)=1 for all s∈𝒮. This is the “always-treat” policy used in the simulation study of [39], where the chosen action is always 1 and does not depend on the state variable.

  2. (b)
\pi_{2}(1\mid s)=\begin{cases}1&\text{if $s^{(1)}\leq 0$ and $s^{(2)}\leq 0$,}\\ 0&\text{otherwise,}\end{cases}\quad s=(s^{(1)},s^{(2)})\in\mathcal{S}.

    This policy is a discontinuous function with respect to the state variable. The same type of policy is used in the simulation study of [62].

  3. (c)

    π3(as)=exp{(s(1)+s(2))},a𝒜,s=(s(1),s(2))𝒮.\pi_{3}(a\mid s)=\exp\{-(s^{(1)}+s^{(2)})\},\,a\in\mathcal{A},s=(s^{(1)},s^{(2)})\in\mathcal{S}. This policy is smooth with respect to the state variable.

  4. (d)

π₄(a∣s)=0.5, a∈𝒜, s∈𝒮. This policy is the same as the behavior policy. Note that the observed data only contain a finite horizon of decision points, i.e., T<∞, while the target here is the policy value under an infinite horizon.

For each target policy, we consider four different combinations of n and T, i.e., (n,T)=(40,25), (80,25), (40,50) and (80,50). The true policy values 𝒱(π_k), k=1,…,4, are computed approximately by the Monte Carlo method. Specifically, for every π_k, we simulate ñ=10⁵ independent trajectories of length T̃=5000, with initial states drawn from 𝔾. Then we approximate 𝒱(π_k) by (1−γ)ñ^{−1} Σ_{i=1}^{ñ} Σ_{t=0}^{T̃−1} γ^t R_{i,t}.
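A minimal sketch of this data-generating process and the Monte Carlo approximation of the value, shown here for the uniform policy π₄ with far fewer and shorter trajectories than the ñ=10⁵ and T̃=5000 used above:

```python
import numpy as np

rng = np.random.default_rng(5)
gamma, n_mc, T_mc = 0.9, 2000, 200

def step(s, a, rng):
    """One transition and reward under the dynamics stated in Section 5."""
    s1 = 0.75 * (2 * a - 1) * s[0] + rng.normal(0.0, 0.5)   # sd 0.5 = sqrt(0.25)
    s2 = 0.75 * (1 - 2 * a) * s[1] + rng.normal(0.0, 0.5)
    return np.array([s1, s2]), 2 * s1 + s2 - 0.25 * (2 * a - 1)

returns = np.zeros(n_mc)
for i in range(n_mc):
    s = rng.normal(size=2)                  # initial state ~ N(0, I_2)
    for t in range(T_mc):
        a = int(rng.integers(2))            # pi_4 picks each action w.p. 1/2
        s, r = step(s, a, rng)
        returns[i] += gamma ** t * r

est = (1 - gamma) * returns.mean()          # Monte Carlo estimate of V(pi_4)
print(est)
```

Under π₄ the action is independent of the state, so every term of the reward has mean zero and the estimate should be close to 0; truncating at T_mc=200 is harmless since γ^200 is negligible.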

Due to space limitations, the implementation details of the proposed method (ProjBalance) are reported in Section S1.2 of the Supplementary Material. For comparison, we include the following estimators: (1) VL: the estimator from [42]; (2) SAVE: the estimator from [62]; (3) FQE: the fitted Q-evaluation estimator developed in [36], where the regression problems are solved using random forest models; (4) IS: the importance sampling estimator from [54]; (5) MINIMAX: the minimax weight learning method from [68]; (6) DR: the double reinforcement learning method considered in [28]. As suggested by [42], we implemented VL with Gaussian basis functions since they offer the highest flexibility. For SAVE, we use the same basis functions 𝑩_K as in the proposed weighted estimator to estimate Q^π. Note that, to compute the confidence interval for the proposed weighted estimator, we use the estimate of Q^π from SAVE. Construction of our confidence interval can be found in Section S1.2 of the Supplementary Material. The implementation details of FQE, IS, MINIMAX and DR can be found in [60].

Table 1 shows the mean squared error (MSE) and median squared error (MeSE) of the above estimators, as well as the empirical coverage probabilities (ECP) and average lengths (AL) of their 95% confidence intervals, over 500 simulated data sets when (n,T)=(40,50). The results for (n,T)=(40,25), (80,25) and (80,50) can be found in Section S1.3 of the Supplementary Material. Overall, the proposed estimator (ProjBalance) shows competitive performance. The other weighted estimators, IS and MINIMAX, suffer from instability issues and produce inferior results compared with the other methods. Also, DR is influenced by the unstable weights from IS and does not perform well in these settings.

Specifically, for Scenario (a), VL in general has the smallest MSE and MeSE. The performances of ProjBalance and VL are close, while those of SAVE and FQE are worse. For Scenario (b), SAVE is the best when the sample size is relatively large, while the performance of ProjBalance is close to that of SAVE. However, when n and T are relatively small, SAVE produces some extreme estimates, as seen from the notable difference between its MSE and MeSE; see Tables 1 and S2 in the Supplementary Material. In contrast, the results of ProjBalance remain stable for small n and T, with performance comparable to the best estimator (FQE) in these settings. As for Scenario (c), ProjBalance and VL perform similarly, and both are better than SAVE and FQE. ProjBalance always has the smallest MSE and MeSE when the target policy is π₄. As for ECP and AL, the confidence intervals of ProjBalance tend to have lower coverage than the target level 95%, especially when the target policy is π₂. We hypothesize that this under-coverage phenomenon is due to the regularization of weights in (25a), which affects the variance estimation in (32). Since this is beyond the scope of the paper, we leave it for future study. In practice, we recommend multiplying the length of our confidence interval (CI) by a constant factor to relieve this under-coverage issue. In this simulation study, we use the factor 1.2 to obtain adjusted intervals. The ECPs and ALs of the adjusted confidence intervals are provided in all tables. In general, our method performs robustly and satisfactorily in terms of the coverage and average length of the confidence interval. In the Supplementary Material, we also evaluate the performance of the above methods on the Cartpole environment, which is obtained from OpenAI Gym [5] and has been frequently considered in the computer science literature.
Overall, the performance of ProjBalance is appealing; see Section S1.5 of the Supplementary Material for more details.

Table 1: Simulation results for four different target policies when n=40 and T=50. The MSE values (with standard errors in parentheses) and MeSE values are reported for all estimators; ECP and AL (with standard errors in parentheses) are reported for the three estimators that provide confidence intervals. For ProjBalance, ECPs and ALs for the adjusted confidence interval are provided after the slashes in the ECP and AL columns.
Target methods MSE (×1000\times 1000) MeSE (×1000\times 1000) ECP AL (×100\times 100)
π1\pi_{1} ProjBalance 8.76 (0.527) 3.75 0.96 / 0.99 38.93 (0.258) / 46.71 (0.310)
VL 7.71 (0.461) 3.57 0.96 36.25 (0.159)
SAVE 10.2 (0.614) 4.32 0.96 42.28 ( 0.932)
FQE 10.99 (0.744) 4.40 _ _
minimax 34.53 (3.184) 14.04 _ _
DR 151.87 (16.670) 22.84 _ _
IS 1040.89 (270.752) 53.94 _ _
π2\pi_{2} ProjBalance 4.29 (0.356) 1.75 0.91 / 0.94 23.57 (2.130) / 28.28 (2.556)
VL 6.61 (0.369) 3.50 0.84 24.18 (0.142)
SAVE 36.90 (33.300) 1.52 0.94 439.90 (415.277)
FQE 3.93 (0.242) 1.87 _ _
minimax 120.92 (1.276) 118.98 _ _
DR 92.33 (6.212) 16.283 _ _
IS 1111.542 (100.934) 327.248 _ _
π3\pi_{3} ProjBalance 2.41 (0.153) 1.06 0.93 / 0.97 18.47 (0.125) / 22.17 (0.150)
VL 2.45 (0.162) 1.14 0.98 23.10 (0.158)
SAVE 2.72 (0.173) 1.18 0.94 21.43 (1.717)
FQE 4.66 (0.307) 2.30 _ _
minimax 37.11 (1.415) 31.21 _ _
DR 49.20 (34.761) 2.36 _ _
IS 38.12 (4.839) 13.97 _ _
π4\pi_{4} ProjBalance 0.67 (0.044) 0.32 0.93 / 0.97 9.70 (0.021) / 11.64 (0.025)
VL 0.77 (0.053) 0.34 1.00 17.00 (0.031)
SAVE 0.72 (0.047) 0.35 0.93 10.03 (0.013)
FQE 1.70 (0.103) 0.81 _ _
minimax 1.62 (0.104) 0.73 _ _
DR 1.35 (0.087) 0.61 _ _
IS 4.78 (0.290) 2.27 _ _

6 Real Data Application

In this section, we apply the methods from Section 5 that provide confidence intervals (ProjBalance, VL and SAVE) to the OhioT1DM dataset [44] from Ohio University. This dataset contains approximately eight weeks of records of CGM blood glucose levels, insulin doses and self-reported life-event data for each of six subjects with type 1 diabetes. Following [62], we divide the trajectory of each subject into segments of three-hour spans and construct the state variable S_{i,t}=(S^{(1)}_{i,t},S^{(2)}_{i,t},S^{(3)}_{i,t}) as follows. First, S^{(1)}_{i,t} is the average CGM glucose level over the three-hour interval [t−1,t). Next, S^{(2)}_{i,t} is constructed from the i-th subject's self-reported meal times and the corresponding carbohydrate estimates. More specifically, S^{(2)}_{i,t}=Σ_{j=1}^{J} CE_j γ_c^{36(t_j−t+1)}, where CE₁,…,CE_J are the carbohydrate estimates for the i-th subject's meals at times t₁,…,t_J ∈ [t−1,t), and γ_c is a 5-minute decay rate, set to γ_c=0.5 here. Last, S^{(3)}_{i,t} is the average basal rate during the three-hour interval. The action A_{i,t} is binarized according to the amount of insulin injected: we set A_{i,t}=1 when the total amount of insulin delivered to the i-th subject during the three-hour interval exceeds one unit, and A_{i,t}=0 otherwise. In the data, the time series of glucose levels and the life-event data do not perfectly overlap in time, so we remove several boundary points of the time series to ensure that all state variables have records in the same time frame. Following the Index of Glycemic Control (IGC) [58], the immediate reward R_{i,t} is constructed as

Ri,t={130(80Si,t+1(1))2,if Si,t+1(1)<80;0,if 80Si,t+1(1)<140;130(Si,t+1(1)140)1.35,otherwise .R_{i,t}=\begin{cases}-\frac{1}{30}(80-S^{(1)}_{i,t+1})^{2},&\text{if }S^{(1)}_{i,t+1}<80;\\ 0,&\text{if }80\leq S^{(1)}_{i,t+1}<140;\\ -\frac{1}{30}(S^{(1)}_{i,t+1}-140)^{1.35},&\text{otherwise }.\end{cases}
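The reward construction above translates directly into code; the function name below is our own:

```python
def igc_reward(g_next):
    """Immediate reward from the next-interval average glucose level, per the IGC."""
    if g_next < 80:                          # hypoglycemic range
        return -((80 - g_next) ** 2) / 30
    if g_next < 140:                         # target range
        return 0.0
    return -((g_next - 140) ** 1.35) / 30    # hyperglycemic range

print(igc_reward(70.0), igc_reward(100.0), igc_reward(180.0))
```

The quadratic penalty below 80 mg/dL penalizes hypoglycemia more aggressively than the sub-quadratic penalty above 140 mg/dL.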

Similar to the simulation study, we set the discount factor γ=0.9. In addition, we study six reference distributions 𝔾, each taken as the point mass at the initial state of one subject. We evaluate the aforementioned three estimators under two policies: the always-treat policy π₁, which always selects action 1, and the never-treat policy π₀, which always selects action 0. These are chosen because the estimated optimal policy based on [62] is very close to the always-treat policy; a similar discovery has been observed in [42]. Therefore, we expect 𝒱(π₁)>𝒱(π₀).

Figure 1: Offline policy evaluation (OPE) estimates and the associated 95% confidence intervals on the OhioT1DM dataset under the always-treat and never-treat policies. The six sub-figures display these quantities when the reference distribution is taken as the point mass at the initial state of each of the six patients, respectively. The adjusted confidence intervals for ProjBalance are also included.

Figure 1 shows the point estimates and associated confidence intervals of 𝒱(π₁) and 𝒱(π₀) for ProjBalance, VL and SAVE. To adjust for the under-coverage issue observed in the simulation study, we also calculate the adjusted confidence intervals for ProjBalance. In most scenarios, the estimated values under the always-treat policy are larger than those under the never-treat policy, which aligns with our expectation. Note that the results of VL do not show as distinctive a difference between the two policies as the other methods. For SAVE, due to an invertibility issue (the matrix Σ̂_π in [62] is numerically low-rank under the never-treat policy), the confidence interval for the never-treat policy is surprisingly wide. Even though the confidence interval of our proposed estimator relies on the Q-function estimated by SAVE (where the point estimate of the Q-function does not involve inverting the problematic matrix), it does not inherit this instability issue.

7 Acknowledgements

Wong’s research was partially supported by the National Science Foundation (DMS-1711952 and CCF-1934904). Portions of this research were conducted with the advanced computing resources provided by Texas A&M High Performance Research Computing.

Supplement to “Projected State-action Balancing Weights for Offline Reinforcement Learning”

S1 Additional Implementation Details and Numerical Results

S1.1 Algorithm

We present the algorithm outline for the proposed estimator in Algorithm 1.

Input: Observed trajectories 𝒟n={{Si,t,Ai,t,Ri,t}t=0T1Si,T}i=1n{\cal D}_{n}=\{\{S_{i,t},A_{i,t},R_{i,t}\}_{t=0}^{T-1}\cup S_{i,T}\}_{i=1}^{n}; number of basis functions KK; a grid of tuning parameters μl\mu_{l}, l=1,,Ll=1,\dots,L; function hh; basis function BkB_{k}, k=1,,Kk=1,\dots,K; reference distribution: 𝔾\mathbb{G}.
1 Calculate 𝔼S𝔾{a𝒜π(aS)Bk(S,a)}\mathbb{E}_{S\sim\mathbb{G}}\{\sum_{a\in\mathcal{A}}\pi(a\mid S)B_{k}(S,a)\}, k=1,,Kk=1,\dots,K.
2Divide 𝒟n{\cal D}_{n} into the training data 𝒟1{\cal D}_{1} and validation data 𝒟2{\cal D}_{2}. for l=1,,Ll=1,\dots,L do
3       for k=0,1,,Kk=0,1,\dots,K do
4             Compute the solution of (24) (i.e., g^lπ(,,Bk)\hat{g}^{\pi}_{l}(\cdot,\cdot,B_{k})) by taking μ=μl\mu=\mu_{l} on 𝒟1{\cal D}_{1}.
5       end for
6      Calculate the validation error Σ_{k=1}^K Σ_{(i,t)∈𝒟₂}{ĝ_l^π(S_{i,t},A_{i,t},B_k)−Σ_{a′∈𝒜}π(a′∣S_{i,t+1})B_k(S_{i,t+1},a′)}²/{(nT)^{−1}Σ_{i=1}^n Σ_{t=1}^T B_k²(S_{i,t},A_{i,t})}
7 end for
8
9Select ll^{*} that achieves the smallest validation error among all tuning parameters.
10Take ĝ^π(S_{i,t},A_{i,t},B_k) as ĝ_{l*}^π(S_{i,t},A_{i,t},B_k) and optimize (13) in the main text with the smallest feasible 𝜹_K=δ𝑱, where 𝑱=[1,…,1]^⊺. Obtain the solution ω̂^π_{i,t}, i=1,…,n and t=0,…,T−1.
Output: 𝒱^π(π)=1nTi=1nt=0T1ω^i,tπRi,t\hat{\cal V}^{\pi}(\pi)=\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\hat{\omega}^{\pi}_{i,t}R_{i,t}
Algorithm 1 Outline for computing 𝒱̂(π).

S1.2 Implementation of the Proposed Estimator

In this subsection, we discuss the implementation details for the simulation study. For the proposed estimator (ProjBalance), the basis functions were constructed as follows. For the settings described in Section 5, we adopted tensor-product basis functions for the two-dimensional state variables. First, we set B_{1,l}(S,A)=b_{1,l₁}(S^{(1)})b_{2,l₂}(S^{(2)})𝟙_{A=1} and B_{2,l}(S,A)=b_{1,l₁}(S^{(1)})b_{2,l₂}(S^{(2)}). Here b_{1,l₁} and b_{2,l₂} were one-dimensional cubic B-splines whose internal knots were placed at equally spaced sample quantiles of the state variables. To avoid extrapolation of the basis functions, we placed three repeated boundary knots. Then we chose 𝑩_K(S,A)=[B_k(S,A)]_{k=1}^K=[B_{1,l}(S,A),B_{2,l}(S,A)]_{l=1}^{K/2}. In our numerical experiments, we set K=2max{(nT)^{1/3},4²}, where 4² comes from the fact that at least 4 one-dimensional basis functions are needed to construct a cubic B-spline basis. Note that the number of tensor products of spline bases grows exponentially with the dimension of the state space. We recommend reducing the interaction order of the tensor products or using kernel bases in higher-dimensional settings, while the number of basis functions K can still be chosen as stated above. In the simulation study for the Cartpole datasets in Section S1.5, we did not include any interactions among the four state variables, for the sake of simplicity. For the approximate projection step (24), we used kernel ridge regression with a Gaussian kernel, whose bandwidth parameter was chosen by the median heuristic (see Section 2.2 in [18]). We adopted 5-fold cross-validation to tune the parameter μ.
Note that we only needed one $\mu$ for the $K$ projections, so we took as our validation criterion the average of the ``standardized'' validation errors from the projection steps for all basis functions, where each validation error was standardized by the empirical norm of the corresponding basis function. For the weight estimation (25), we chose $h(x)=(x-1)^{2}$. Since $\mathbb{G}$ is known to us, we used the Monte Carlo method to obtain $\mathbb{E}_{S\sim\mathbb{G}}\{\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid S)B_{k}(S,a^{\prime})\}$. To reduce the amount of tuning, we took $\bm{\delta}_{K}=\delta\bm{J}$, where $\delta>0$ and $\bm{J}=[1,\dots,1]^{\intercal}\in\mathbb{R}^{K}$, and selected the minimal $\delta$ for which (25) had a valid solution.

In the following, we provide the details for constructing the confidence interval for ProjBalance. Let $\hat{Q}^{\pi}$ be the estimator of $Q^{\pi}$ obtained by SAVE in [62] and let

σ^i,t=ω^i,tπ(Ri,t+γa𝒜π(aSi,t+1)Q^π(Si,t+1,a)Q^π(Si,t,Ai,t)).\hat{\sigma}_{i,t}=\hat{\omega}_{i,t}^{\pi}\left(R_{i,t}+\gamma\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid S_{i,t+1})\hat{Q}^{\pi}(S_{i,t+1},a^{\prime})-\hat{Q}^{\pi}(S_{i,t},A_{i,t})\right).

Calculate

σ^={1nTi=1nt=0T1σ^i,t2}1/2.\hat{\sigma}=\left\{\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\hat{\sigma}_{i,t}^{2}\right\}^{1/2}.

Then the confidence interval is given by

[𝒱^(π)zα2σ^nT,𝒱^(π)+zα2σ^nT],\left[\hat{\cal V}(\pi)-z_{\frac{\alpha}{2}}\frac{\hat{\sigma}}{\sqrt{nT}},\ \hat{\cal V}(\pi)+z_{\frac{\alpha}{2}}\frac{\hat{\sigma}}{\sqrt{nT}}\right],

where $\hat{\cal V}(\pi)=(nT)^{-1}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\hat{\omega}^{\pi}_{i,t}R_{i,t}$ and $z_{\frac{\alpha}{2}}$ is the $(1-\alpha/2)$-quantile of the standard normal distribution.
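A minimal sketch of this interval construction, assuming the weights $\hat{\omega}^{\pi}_{i,t}$ and the SAVE-based $\hat{Q}^{\pi}$ evaluations have already been computed (array names are our own conventions):

```python
import numpy as np
from statistics import NormalDist

def value_ci(weights, rewards, q_next_pi, q_cur, gamma=0.9, alpha=0.05):
    """Point estimate and Wald-type confidence interval for V(pi),
    following the sigma-hat construction above.  `weights` and `rewards`
    are (n, T) arrays of omega-hat and R; `q_next_pi` holds
    sum_a' pi(a'|S_{t+1}) Qhat(S_{t+1}, a') and `q_cur` holds
    Qhat(S_t, A_t)."""
    n, T = rewards.shape
    v_hat = float(np.mean(weights * rewards))        # (nT)^{-1} double sum
    sigma_it = weights * (rewards + gamma * q_next_pi - q_cur)
    sigma_hat = float(np.sqrt(np.mean(sigma_it ** 2)))
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)      # z_{alpha/2}
    half = z * sigma_hat / np.sqrt(n * T)
    return v_hat, (v_hat - half, v_hat + half)
```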

S1.3 Additional Simulation Results

Tables S2, S3 and S4 show the simulation results under the target policies and data-generation process described in Section 5, with $(n,T)=(40,25)$, $(80,25)$ and $(80,50)$, respectively.

Table S2: Simulation results for four different target policies when n=40n=40 and T=25T=25. See details in Table 1.
Target methods MSE (×1000\times 1000) MeSE (×1000\times 1000) ECP AL (×100\times 100)
π1\pi_{1} ProjBalance 17.32 (1.064) 8.17 0.96 / 0.99 66.30 (6.145) / 79.56 (7.374)
VL 15.43 (0.942) 7.08 0.97 54.15 (0.472)
SAVE 272.36 (178.185) 11.06 0.96 637.96 (337.450)
FQE 23.58 (1.431) 10.74 _ _
minimax 46.27 (3.544) 16.75 _ _
DR 855.00 (178.345) 82.56 _ _
IS 945.92 (234.336) 53.75 _ _
π2\pi_{2} ProjBalance 10.83 (0.874) 3.17 0.87 / 0.91 35.26 (4.682) / 42.31 (5.619)
VL 12.33 (0.721) 5.65 0.88 36.63 (0.364)
SAVE 310.41 (301.139) 3.25 0.94 918.00 (874.044)
FQE 23.58 (1.431) 10.74 _ _
minimax 41.52 (1.973) 28.31 _ _
DR 60.20 (5.623) 16.30 _ _
IS 2973.88 (1803.384) 70.51 _ _
π3\pi_{3} ProjBalance 4.62 (0.299) 1.899 0.92 / 0.95 27.67 (0.984) / 33.20 (1.181)
VL 4.50 (0.305) 1.89 0.98 33.04 (0.239)
SAVE 15.87 (8.981) 2.35 0.94 669.62 (636.178)
FQE 8.687 (0.604) 3.69 _ _
minimax 45.51 (1.710) 39.61 _ _
DR 68.18 (21.377) 3.98 _ _
IS 34.02 (4.214) 12.45 _ _
π4\pi_{4} ProjBalance 1.25 (0.087) 0.49 0.94 / 0.98 13.61 (0.074) / 16.33 (0.089)
VL 1.63 (0.111) 0.66 0.99 24.47 (0.070)
SAVE 1.40 (0.096) 0.53 0.95 14.68 (0.119)
FQE 3.46 (0.224) 1.65 _ _
minimax 3.26 (0.202) 1.50 _ _
DR 2.48 (0.157) 1.15 _ _
IS 4.67 (0.281) 2.28 _ _
Table S3: Simulation results for four different target policies when n=80n=80 and T=25T=25. See details in Table 1.
Target methods MSE (×1000\times 1000) MeSE (×1000\times 1000) ECP AL (×100\times 100)
π1\pi_{1} ProjBalance 10.13 (0.661) 4.61 0.95 / 0.97 38.95 (0.175) / 46.74 (0.211)
VL 8.84 (0.562) 3.66 0.95 38.38 (0.178)
SAVE 11.55 (0.766) 4.90 0.94 40.58 (0.183)
FQE 3.46 (0.224) 1.65 _ _
minimax 38.08 (3.330) 17.80 _ _
DR 181.51 (24.850) 28.07 _ _
IS 4.67 (0.281) 2.28 _ _
π2\pi_{2} ProjBalance 3.88 (0.279) 1.76 0.92 / 0.97 21.53 (0.120) / 25.83 (0.144)
VL 5.06 (0.293) 2.46 0.94 25.18 (0.145)
SAVE 3.42 (0.199) 1.77 0.96 23.55 (0.512)
FQE 3.88 (0.214) 2.04 _ _
minimax 36.48 (1.556) 26.28 _ _
DR 28.51 (2.915) 7.08 _ _
IS 2951.59 (2613.528) 62.73 _ _
π3\pi_{3} ProjBalance 2.33 (0.134) 1.09 0.94 / 0.98 18.36 (0.100) / 22.03 (0.120)
VL 2.33 (0.137) 1.18 0.98 23.75 (0.122)
SAVE 2.74 (0.161) 1.26 0.94 19.56 (0.177)
FQE 4.96 (0.322) 2.10 _ _
minimax 32.12 (1.150) 25.85 _ _
DR 15.64 (5.164) 2.35 _ _
IS 22.39 (4.263) 7.05 _ _
π4\pi_{4} ProjBalance 0.75 (0.042) 0.41 0.92 / 0.98 9.81 (0.018) / 11.77 (0.022)
VL 0.88 (0.056) 0.37 1 17.32 (0.033)
SAVE 0.79 (0.045) 0.43 0.93 10.19 (0.107)
FQE 1.66 (0.713) 0.66 _ _
minimax 1.46 (0.084) 0.73 _ _
DR 1.25 (0.076) 0.62 _ _
IS 2.27 (0.144) 1.10 _ _
Table S4: Simulation results for four different target policies when n=80n=80 and T=50T=50. See details in Table 1.
Target methods MSE (×1000\times 1000) MeSE (×1000\times 1000) ECP AL (×100\times 100)
π1\pi_{1} ProjBalance 5.43 (0.353) 2.31 0.92 / 0.97 27.20 (0.068) / 32.58 (0.081)
VL 5.09 (0.318) 2.01 0.94 26.80 (0.075)
SAVE 5.93 (0.382) 2.65 0.92 28.09 (0.117)
FQE 6.21 (0.402) 2.57 _ _
minimax 27.27 (2.030) 12.17 _ _
DR 67.29 (11.554) 11.03 _ _
IS 192585.20 (189045.400) 44.53 _ _
π2\pi_{2} ProjBalance 2.49 (0.204) 1.09 0.87 / 0.93 15.05 (0.045) / 18.06 (0.054)
VL 2.92 (0.162) 1.58 0.89 16.93 (0.056)
SAVE 1.73 (0.107) 0.72 0.94 15.78 (0.053)
FQE 2.10 (0.140) 0.91 _ _
minimax 33.13 (1.275) 26.19 _ _
DR 11.98 (1.063) 4.10 _ _
IS 2951.59 (2613.528) 62.73 _ _
π3\pi_{3} ProjBalance 1.19 (0.071) 0.61 0.94 / 0.98 13.01 (0.048) / 15.61 (0.057)
VL 1.25 (0.071) 0.73 0.98 16.42 (0.053)
SAVE 1.34 (0.079) 0.64 0.94 13.97 (0.359)
FQE 2.23 (0.174) 0.89 _ _
minimax 32.78 (0.747) 30.92 _ _
DR 3.45 (0.966) 0.86 _ _
IS 29.67 (6.884) 6.79 _ _
π4\pi_{4} ProjBalance 0.39 (0.023) 0.21 0.92 / 0.97 6.900 (0.008) / 8.28 (0.009)
VL 0.44 (0.026) 0.24 1 12.12 (0.027)
SAVE 0.41 (0.024) 0.22 0.92 7.06 (0.005)
FQE 0.77 (0.046) 0.38 _ _
minimax 0.77 (0.049) 0.39 _ _
DR 0.62 (0.039) 0.29 _ _
IS 2.26 (0.143) 1.02 _ _

S1.4 Weighted Estimator without Projection and Augmented Projected Balancing Estimator

In this section, we compare the performance of the weighted estimator (balance), whose weights are obtained from (13), with our proposed estimator ProjBalance under the simulation settings described in Section 5. In particular, the same basis functions are adopted for both weighted estimators in the balancing step. We also include the augmented estimator (aug) based on our procedure in the comparison. The augmented estimator is constructed as

(1-\gamma)\mathbb{E}_{S\sim\mathbb{G}}\left\{\sum_{a\in\mathcal{A}}\pi(a\mid S)\hat{Q}^{\pi}(S,a)\right\}\\ +\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\hat{\omega}^{\pi}_{i,t}\left[R_{i,t}-\hat{Q}^{\pi}(S_{i,t},A_{i,t})+\gamma\left\{\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid S_{i,t+1})\hat{Q}^{\pi}(S_{i,t+1},a^{\prime})\right\}\right],

where $\hat{Q}^{\pi}$ is estimated by SAVE, and $\hat{\omega}^{\pi}_{i,t}$, $i=1,\dots,n$, $t=0,\dots,T-1$, are obtained from (25). Table S5 shows the performance of the above three estimators when $n=40$ and $T=50$. As we can see, balance performs better only when the target policy is $\pi_{1}$. Under the other settings, ProjBalance performs much better than balance, especially when the target policy is $\pi_{2}$. As for aug, its performance is very similar to that of SAVE in all four settings.
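A minimal sketch of the augmented estimator displayed above, assuming the weights and the $\hat{Q}^{\pi}$ evaluations are precomputed (input names are our own conventions):

```python
import numpy as np

def augmented_value(weights, rewards, q_cur, q_next_pi, q_init_pi, gamma=0.9):
    """Sketch of the augmented estimator.  `weights`, `rewards`,
    `q_cur` = Qhat(S_t, A_t) and `q_next_pi` =
    sum_a' pi(a'|S_{t+1}) Qhat(S_{t+1}, a') are (n, T) arrays;
    `q_init_pi` approximates E_{S~G}[sum_a pi(a|S) Qhat(S, a)],
    e.g. by Monte Carlo."""
    td_residual = rewards - q_cur + gamma * q_next_pi
    return (1.0 - gamma) * q_init_pi + float(np.mean(weights * td_residual))
```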

Table S5: Simulation results for ProjBalance, balance and aug under four different target policies when n=40n=40 and T=50T=50. See details in Table 1.
Target methods MSE (×1000\times 1000) MeSE (×1000\times 1000) ECP AL (×100\times 100)
π1\pi_{1} ProjBalance 8.76 (0.527) 3.75 0.96 / 0.99 38.93 (0.258) / 46.71 (0.310)
balance 4.73 (0.298) 2.36 0.99 / 1.00 38.63 (0.127) / 46.36 (0.152)
aug 9.86 (0.585) 4.31 0.95 / 0.99 38.81 (0.259) / 46.57 (0.310)
π2\pi_{2} ProjBalance 4.29 (0.356) 1.75 0.91 / 0.94 23.57 (2.130) / 28.28 (2.556)
balance 117.23 (1.38) 111.44 0.002 / 0.002 19.43 (0.583) / 23.33 (0.699)
aug 36.66 (33.118) 1.52 0.92 / 0.95 23.60 (2.166) / 28.32 (2.599)
π3\pi_{3} ProjBalance 2.41 (0.153) 1.06 0.93 / 0.97 18.47 (0.125) / 22.17 (0.150)
balance 8.32 (0.32) 6.79 0.45 / 0.57 14.95 (0.048) / 17.94 (0.058)
aug 2.62 (0.164) 1.16 0.92 / 0.97 18.50 (0.124) / 22.20 (0.148)
π4\pi_{4} ProjBalance 0.67 (0.044) 0.32 0.93 / 0.97 9.70 (0.021) / 11.64 (0.025)
balance 0.84 (0.053) 0.37 0.91 / 0.95 9.64 (0.010) / 11.56 (0.012)
aug 0.73 (0.048) 0.34 0.92 / 0.96 9.69 (0.023) / 11.63 (0.027)

S1.5 Simulation Results for the CartPole Environment

In this section, we compare the performance of various estimators using the CartPole datasets from the OpenAI Gym environment [5]. The state variables in the CartPole environment are four-dimensional: cart position, cart velocity, pole angle and pole angular velocity. The action is binary. We made a slight modification to the CartPole environment following [68] and [60]: we added a small Gaussian noise to the original deterministic transition dynamics and defined a new state-action-dependent reward function. See details in Section B2.1 of [60]. Following [68] and [60], we first ran a deep Q-network to obtain a near-optimal policy as the target policy ($\pi_{opt}$). We then applied an ``epsilon-greedy'' adjustment with a factor $\epsilon$ to define our behavior policy, i.e.,

πb(as)=(1ϵ)πopt(as)+ϵ0.5,\displaystyle\pi_{b}(a\mid s)=(1-\epsilon)*\pi_{opt}(a\mid s)+\epsilon*0.5,

for every $(s,a)\in{\cal S}\times{\cal A}$. For comparison, we implemented various methods such as fitted-Q evaluation (FQE), minimax and DR, following their implementations in [60] and the public code at https://github.com/RunzheStat/D2OPE. We set the discount factor $\gamma=0.9$ in our evaluations.
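The epsilon-greedy mixture above is straightforward to implement; a sketch for the binary-action case (array conventions are ours; in the experiments $\pi_{opt}$ comes from a trained deep Q-network):

```python
import numpy as np

def behavior_policy(pi_opt_a1, epsilon):
    """Epsilon-greedy mixture defining the behavior policy:
    pi_b(a|s) = (1 - eps) * pi_opt(a|s) + eps * 0.5 for a binary action.
    `pi_opt_a1` is an array of pi_opt(a = 1 | s) values."""
    p1 = (1.0 - epsilon) * np.asarray(pi_opt_a1) + epsilon * 0.5
    return np.stack([1.0 - p1, p1], axis=-1)   # columns: P(a=0|s), P(a=1|s)
```

For a state where $\pi_{opt}$ puts all mass on $a=1$, $\epsilon=0.2$ gives $\pi_{b}(1\mid s)=0.9$.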

In Tables S6, S7 and S8, we show the results over $200$ simulated datasets when setting $\epsilon=0.1$, $\epsilon=0.2$ and $\epsilon=0.3$, respectively. We fixed $n=50$ and $T=100$ for all three settings. As we can see, ProjBalance has the smallest or the second smallest MSE and MeSE among all the methods. The empirical coverages of ProjBalance are also better than those of the other two methods (SAVE and VL). We did not report confidence intervals for the remaining methods because there are no related theoretical results.

Table S6: Simulation results for the CartPole environment when $n=50$, $T=100$ and $\epsilon=0.1$. See the detailed description in Table 1.
methods MSE (×106\times 10^{6}) MeSE (×106\times 10^{6}) ECP AL (×103\times 10^{3})
ProjBalance 0.17 (0.011) 0.13 0.72 / 0.86 0.93 (0.016) / 1.12 (0.02)
SAVE 0.49 (0.022) 0.43 0.12 0.92 (0.018)
VL 32.66 (1.262) 29.57 0 1.84 (0.008)
FQE 5.15 (0.033) 5.12 _ _
minimax 43.04 (1.122) 43.24 _ _
DR 1.26 (0.042) 1.11 _ _
IS 5581.23 (491.491) 2757.71 _ _
Table S7: Simulation results for the CartPole environment when $n=50$, $T=100$ and $\epsilon=0.2$. See the detailed description in Table 1.
methods MSE (×106\times 10^{6}) MeSE (×106\times 10^{6}) ECP AL (×103\times 10^{3})
ProjBalance 0.10 (0.009) 0.04 0.84 / 0.91 0.85 (0.014) / 1.02 (0.016)
SAVE 0.33 (0.021) 0.24 0.38 0.84 (0.014)
VL 23.06 (0.547) 22.5 0 1.21 (0.003)
FQE 4.44 (0.043) 4.38 _ _
minimax 35.28 (1.163) 33.69 _ _
DR 0.76 (0.049) 0.64 _ _
IS 20541.94 (2356.361) 9295.65 _ _
Table S8: Simulation results for the CartPole environment when $n=50$, $T=100$ and $\epsilon=0.3$. See the detailed description in Table 1.
methods MSE (×106\times 10^{6}) MeSE (×106\times 10^{6}) ECP AL (×103\times 10^{3})
ProjBalance 0.14 (0.011) 0.08 0.65 / 0.76 0.84 (0.031) / 1.01 (0.038)
SAVE 0.13 (0.014) 0.05 0.74 0.93 (0.124)
VL 16.08 (0.326) 15.9 0 1.02 (0.001)
FQE 3.99 (0.029) 3.98 _ _
minimax 26.94 (0.962) 25.54 _ _
DR 0.87 (0.053) 0.7 _ _
IS 59936.33 (7265.061) 30726.43 _ _

S2 Proofs

S2.1 Proof in Section 2

Here we show how to obtain (10) from (9).

Starting from (9), we multiply both sides by $f(s,a)$ and get

d^{\pi}(s,a)f(s,a)=(1-\gamma)\mathbb{G}(s)\pi(a\mid s)f(s,a)+\gamma f(s,a)\int_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}d^{\pi}(s^{\prime},a^{\prime})p(s\mid s^{\prime},a^{\prime})\pi(a\mid s)ds^{\prime}. (S36)

Then we integrate both sides of (S36) with respect to $s$ and sum over $a$. By a change of measure, we obtain

\int_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}d^{\pi}(s,a)f(s,a)ds =(1-\gamma)\int_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}\mathbb{G}(s)\pi(a\mid s)f(s,a)ds
\qquad+\gamma\int_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}f(s,a)\int_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}d^{\pi}(s^{\prime},a^{\prime})p(s\mid s^{\prime},a^{\prime})\pi(a\mid s)ds^{\prime}ds
=(1-\gamma)\int_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}\mathbb{G}(s)\pi(a\mid s)f(s,a)ds
\qquad+\gamma\int_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}d^{\pi}(s^{\prime},a^{\prime})\int_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}f(s,a)p(s\mid s^{\prime},a^{\prime})\pi(a\mid s)dsds^{\prime}
=(1-\gamma)\mathbb{E}_{S_{0}\sim\mathbb{G}}\left[\sum_{a\in\mathcal{A}}\pi(a\mid S_{0})f(S_{0},a)\right]
\qquad+\gamma\int_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}d^{\pi}(s,a)\mathbb{E}\left[\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid S^{\prime})f(S^{\prime},a^{\prime})\mid S=s,A=a\right]ds (S37)

Then, based on the definition of p¯Tb(St,At)\bar{p}_{T}^{b}(S_{t},A_{t}), we have

\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1}\frac{d^{\pi}(S_{t},A_{t})}{\bar{p}_{T}^{b}(S_{t},A_{t})}\left\{f(S_{t},A_{t})-\gamma\mathbb{E}\left[\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid S_{t+1})f(S_{t+1},a^{\prime})\mid S_{t},A_{t}\right]\right\}\right]\\ =\frac{1}{T}\sum_{t=0}^{T-1}\int_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}p_{t}(s,a)\frac{d^{\pi}(s,a)}{\bar{p}^{b}_{T}(s,a)}\left\{f(s,a)-\gamma\mathbb{E}\left[\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid S^{\prime})f(S^{\prime},a^{\prime})\mid S=s,A=a\right]\right\}ds\\ =\int_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}{d^{\pi}(s,a)}\left\{f(s,a)-\gamma\mathbb{E}\left[\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid S^{\prime})f(S^{\prime},a^{\prime})\mid S=s,A=a\right]\right\}ds. (S38)

Combining (S37) and (S38), equation (10) follows.

S2.2 Technical Proofs in Section 3

Here we present the proofs of Lemma 1 and Theorem 2. The proof of Lemma 1 is straightforward. For Theorem 2, we first remove the absolute value in (25b) by adding $K$ additional constraints, which transforms the optimization problem into a standard convex optimization problem with linear constraints. Then we make use of the duality results in [67] to obtain the statement of Theorem 2.

Proof of Lemma 1.

First, for any function g𝒢g\in\mathcal{G}^{\prime}, there exists cc\in\mathbb{R} (independent of s,a,ss,a,s^{\prime}) such that

g(s,a,s)=f(s,a)γcwherea𝒜π(a|s)f(s,a)c.g(s,a,s^{\prime})=f(s,a)-\gamma c\quad\mbox{where}\quad\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}|s^{\prime})f(s^{\prime},a^{\prime})\equiv c.

Re-parametrizing f(s,a)=m(s,a)+cf(s,a)=m(s,a)+c, we have

𝒢={g(s,a,s)=m(s,a)+(1γ)c:c,a𝒜π(as)m(s,a)0}.\mathcal{G}^{\prime}=\left\{g(s,a,s^{\prime})=m(s,a)+(1-\gamma)c:c\in\mathbb{R},\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid s^{\prime})m(s^{\prime},a^{\prime})\equiv 0\right\}.

Let $\{s_{1},\dots,s_{p_{S}}\}$ and $\{a_{1},\dots,a_{p_{A}}\}$ be the possible values in $\mathcal{S}$ and $\mathcal{A}$ respectively. A real-valued function $m$ defined on $\mathcal{S}\times\mathcal{A}$ can be identified with a matrix $\bm{M}=[m(s_{i},a_{j})]_{i,j=1}^{p_{S},p_{A}}\in\mathbb{R}^{p_{S}\times p_{A}}$. Similarly, we define $\bm{\Pi}=[\pi(a_{j}\mid s_{i})]_{i,j=1}^{p_{S},p_{A}}$. Denote by $\bm{1}_{a}\in\mathbb{R}^{a}$ a vector of ones for any positive integer $a$. To derive $\dim(\mathcal{G}^{\prime})$, we study the constraint $\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid s^{\prime})m(s^{\prime},a^{\prime})\equiv 0$, which is equivalent to

𝝅i,𝒎i=0,i=1,,pS,\langle\bm{\pi}_{i},\bm{m}_{i}\rangle=0,\quad i=1,\dots,p_{S}, (S39)

where $\bm{\pi}_{i}$ and $\bm{m}_{i}$ are the $i$-th rows of $\bm{\Pi}$ and $\bm{M}$ respectively. Since the elements of $\bm{\pi}_{i}$ are non-negative and sum to 1 for each $i$, and each of these constraints is restricted to a non-overlapping set of entries of $\bm{M}$, (S39) imposes $p_{S}$ linearly independent constraints on $\bm{M}$. It follows that $\dim(\{g(s,a,s^{\prime})=m(s,a):\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid s^{\prime})m(s^{\prime},a^{\prime})\equiv 0\})=p_{S}p_{A}-p_{S}$. Together with the parameter $c$, we conclude that $\dim(\mathcal{G}^{\prime})=p_{S}p_{A}-p_{S}+1$. ∎
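The dimension count above can also be checked numerically. The following sketch (our own illustration, not part of the proof) draws a random row-stochastic $\bm{\Pi}$, stacks the $p_{S}$ constraints of (S39) acting on $\mathrm{vec}(\bm{M})$, and counts the null-space dimension plus one for the constant $c$:

```python
import numpy as np

def gprime_dimension(p_s, p_a, seed=0):
    """Numerical sanity check of dim(G') = p_S p_A - p_S + 1.  Functions
    m on S x A are vectors in R^{p_S p_A}; constraint (S39) makes each
    row of M orthogonal to the matching row of Pi.  We count the
    null-space dimension of the stacked constraints and add one for the
    free constant c."""
    rng = np.random.default_rng(seed)
    pi = rng.random((p_s, p_a))
    pi /= pi.sum(axis=1, keepdims=True)      # rows of Pi are probabilities
    # Constraint matrix: row i acts only on the i-th block of vec(M).
    C = np.zeros((p_s, p_s * p_a))
    for i in range(p_s):
        C[i, i * p_a:(i + 1) * p_a] = pi[i]
    null_dim = p_s * p_a - np.linalg.matrix_rank(C)
    return int(null_dim) + 1                 # + the constant c
```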

Proof of Theorem 2.

Let

\hat{\bm{L}}=\frac{1}{nT}\left[\begin{array}[]{cccccc}\hat{A}_{1}(S_{1,0},A_{1,0})&\dots&\hat{A}_{1}(S_{1,T-1},A_{1,T-1})&\hat{A}_{1}(S_{2,0},A_{2,0})&\dots&\hat{A}_{1}(S_{n,T-1},A_{n,T-1})\\ \vdots&\ddots&\vdots&\vdots&\ddots&\vdots\\ \hat{A}_{K}(S_{1,0},A_{1,0})&\dots&\hat{A}_{K}(S_{1,T-1},A_{1,T-1})&\hat{A}_{K}(S_{2,0},A_{2,0})&\dots&\hat{A}_{K}(S_{n,T-1},A_{n,T-1})\end{array}\right],
\bm{w}=[w_{i,t}]_{i=1,\dots,n,\,t=0,\dots,T-1}\in\mathbb{R}^{nT}.

Then (25b) can be represented by

min 1nTi=1nt=0T1h(wi,t)\displaystyle\qquad\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}h(w_{i,t})
subject to [𝑳^𝑳^]𝒘[𝜹K+𝒍K𝜹K𝒍K]\displaystyle\qquad\left[\begin{array}[]{c}\hat{\bm{L}}\\ -\hat{\bm{L}}\end{array}\right]\bm{w}\leq\left[\begin{array}[]{c}\bm{\delta}_{K}+\bm{l}_{K}\\ \bm{\delta}_{K}-\bm{l}_{K}\end{array}\right] (S44)
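As a quick sanity check of this reformulation, the following sketch (with hypothetical small arrays, not the estimator's actual matrices) verifies that the stacked linear system in (S44) describes exactly the feasible set $|\hat{\bm{L}}\bm{w}-\bm{l}_{K}|\leq\bm{\delta}_{K}$:

```python
import numpy as np

def stacked_constraints_hold(L, w, l, delta):
    """Check of the reformulation in (S44): the absolute-value
    constraints |L w - l| <= delta (elementwise) are equivalent to the
    2K stacked linear inequalities [L; -L] w <= [delta + l; delta - l]."""
    lhs = np.concatenate([L @ w, -(L @ w)])
    rhs = np.concatenate([delta + l, delta - l])
    return bool(np.all(lhs <= rhs))
```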

From [67], the dual of (S44) is

max g(𝝀)\displaystyle\qquad g(\bm{\lambda})
subject to 𝝀0,\displaystyle\qquad\bm{\lambda}\geq 0,

where $g(\bm{\lambda})=-\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}h^{*}(-\bm{Q}(S_{i,t},A_{i,t})^{\intercal}\bm{\lambda})-\langle\bm{\lambda},\bm{d}\rangle$, $\bm{\lambda}\in\mathbb{R}^{2K}$ and $\bm{d}=[(\bm{\delta}_{K}+\bm{l}_{K})^{\intercal},(\bm{\delta}_{K}-\bm{l}_{K})^{\intercal}]^{\intercal}\in\mathbb{R}^{2K}$. In addition, $\bm{Q}(S_{i,t},A_{i,t})=[\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})^{\intercal},-\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})^{\intercal}]^{\intercal}\in\mathbb{R}^{2K}$, where $\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})=[\hat{A}_{k}(S_{i,t},A_{i,t})]_{k=1}^{K}$, and $h^{*}$ is the convex conjugate of $h$ defined as

h(t)\displaystyle h^{*}(t) =supw{twh(w)}\displaystyle=\sup_{w}\{tw-h(w)\}
=tw~h(w~),\displaystyle=t\tilde{w}-h(\tilde{w}),

where w~\tilde{w} satisfies the first order condition that th(w~)=0t-h^{\prime}(\tilde{w})=0. Then we obtain that w~=(h)1(t)\tilde{w}=(h^{\prime})^{-1}(t) and

h(t)=t(h)1(t)h{(h)1(t)}.h^{*}(t)=t(h^{\prime})^{-1}(t)-h\{(h^{\prime})^{-1}(t)\}.
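As a concrete illustration, with the choice $h(x)=(x-1)^{2}$ used in our implementation (Section S1.2), these quantities are available in closed form:

```latex
h'(w) = 2(w-1), \qquad (h')^{-1}(t) = \frac{t}{2}+1, \qquad
h^{*}(t) = t\Big(\frac{t}{2}+1\Big) - \Big(\frac{t}{2}\Big)^{2}
         = \frac{t^{2}}{4} + t.
```

In particular, $\rho^{\prime}(t)=t/2+1$ here, consistent with the identity $\rho^{\prime}(t)=(h^{\prime})^{-1}(t)=\tilde{w}$ below.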

Take ρ(t)=h(t)\rho(t)=h^{*}(t), and it can be verified that ρ(t)=(h)1(t)=w~\rho^{\prime}(t)=(h^{\prime})^{-1}(t)=\tilde{w}. Then the dual form can be written as

min \ell(\bm{\lambda})=\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\rho(-\bm{Q}(S_{i,t},A_{i,t})^{\intercal}\bm{\lambda})+\bm{\lambda}^{\intercal}\bm{d} (S45)
subject to 𝝀0\displaystyle\qquad\bm{\lambda}\geq 0

Suppose $\tilde{\bm{\lambda}}=[\tilde{\bm{\lambda}}_{1}^{\intercal},\tilde{\bm{\lambda}}_{2}^{\intercal}]^{\intercal}$, with $\tilde{\bm{\lambda}}_{1},\tilde{\bm{\lambda}}_{2}\in\mathbb{R}^{K}$, is an optimal solution of (S45). Let $\tilde{\lambda}_{1,k}$, $\tilde{\lambda}_{2,k}$ and $\delta_{k}$ denote the $k$-th elements of $\tilde{\bm{\lambda}}_{1}$, $\tilde{\bm{\lambda}}_{2}$ and $\bm{\delta}_{K}$ respectively. Next, we show that $\tilde{\lambda}_{1,k}\tilde{\lambda}_{2,k}=0$ for $k=1,\dots,K$.

Suppose there exists $k$ such that $\tilde{\lambda}_{1,k}>0$ and $\tilde{\lambda}_{2,k}>0$. Then we take $\tilde{\bm{\lambda}^{\prime}}_{1}=\tilde{\bm{\lambda}}_{1}-\min\{\tilde{\lambda}_{1,k},\tilde{\lambda}_{2,k}\}\bm{e}_{k}$ and $\tilde{\bm{\lambda}^{\prime}}_{2}=\tilde{\bm{\lambda}}_{2}-\min\{\tilde{\lambda}_{1,k},\tilde{\lambda}_{2,k}\}\bm{e}_{k}$, where $\bm{e}_{k}$ is the vector whose $k$-th entry is 1 and all remaining entries are zero. Take $\tilde{\bm{\lambda}^{\prime}}=[\tilde{\bm{\lambda}^{\prime}}_{1}^{\intercal},\tilde{\bm{\lambda}^{\prime}}_{2}^{\intercal}]^{\intercal}$, which remains feasible since both components stay non-negative. Then

(𝝀~)(𝝀~)=2min{𝝀~1,k,𝝀~2,k}δk<0,\ell(\tilde{\bm{\lambda}^{\prime}})-\ell(\tilde{\bm{\lambda}})=-2\min\{\tilde{\bm{\lambda}}_{1,k},\tilde{\bm{\lambda}}_{2,k}\}\delta_{k}<0,

since $\delta_{k}>0$ and the $\rho$ term in (S45) depends on $\bm{\lambda}$ only through $\tilde{\bm{\lambda}}_{2}-\tilde{\bm{\lambda}}_{1}$, which is unchanged. This contradicts the assumption that $\tilde{\bm{\lambda}}$ is optimal.

Therefore, we can take $\bm{\lambda}^{+}=\tilde{\bm{\lambda}}_{2}-\tilde{\bm{\lambda}}_{1}$, and it can be verified that $|\bm{\lambda}^{+}|=\tilde{\bm{\lambda}}_{2}+\tilde{\bm{\lambda}}_{1}$. Rewriting (S45) accordingly yields the result in Theorem 2. ∎

S2.3 Technical Proofs in Section 4.1

In this section, we provide proofs of Theorem 3 and Corollary 1.

To prove Theorem 3, a novel truncation argument for the Markov chain is presented in Lemma S2 to obtain a concentration bound that is tight in the scaling of both $n$ and $T$. Specifically, we split each trajectory into two parts at a threshold depending on $T$. Informally, for the first part, the truncation threshold should be small enough that we can borrow the proof techniques developed for standard i.i.d. settings in [38] without losing too much in the upper bound. For the remaining part, the idea is that as the chain evolves, its distribution becomes ``exponentially'' close to the stationary distribution by the mixing condition in Assumption 4(a) (or 4'(a)). We first develop a delicate peeling argument based on [14] to bound the empirical process under the stationary distribution, from which we can achieve the desired order in the scaling of both $n$ and $T$. It then remains to bound the difference between the stationary distribution and the distribution after truncation. By carefully choosing the truncation threshold, we balance the upper bounds from the two parts and obtain the desired rate of convergence. See the details in Lemma S2.

Proof of Theorem 3.

Take $\delta_{t}^{\pi}(B)=\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid S_{t+1})B(S_{t+1},a^{\prime})$. In the following, we write $g^{\pi}(\cdot,\cdot;B)$ and $g_{*}^{\pi}(\cdot,\cdot;B)$ as $g^{\pi}(B)$ and $g_{*}^{\pi}(B)$ for short. We define

n,TF(X)=(nT)1i=1nt=0T1F(Xi,t)\mathbb{P}_{{\color[rgb]{0,0,0}n,T}}F(X)=(nT)^{-1}\sum_{i=1}^{n}\sum_{t=0}^{T-1}F(X_{i,t})

as the empirical expectation of a function $F$ with input $X$, which represents transitions sampled from the Markov chains.

To proceed with the proof, we decompose the error as

1Tt=0T1𝔼[{g^π(St,At;B)gπ(St,At;B)}2]\displaystyle\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\left[\left\{\hat{g}^{\pi}(S_{t},A_{t};B)-g_{*}^{\pi}(S_{t},A_{t};B)\right\}^{2}\right]
=\displaystyle= 1Tt=0T1𝔼[{g^π(St,At;B)δtπ(B)+δtπ(B)gπ(St,At;B)}2]\displaystyle\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\left[\left\{\hat{g}^{\pi}(S_{t},A_{t};B)-\delta_{t}^{\pi}(B)+\delta_{t}^{\pi}(B)-g_{*}^{\pi}(S_{t},A_{t};B)\right\}^{2}\right]
=\displaystyle= 1Tt=0T1𝔼[{δtπ(B)g^π(St,At;B)}2]+1Tt=0T1𝔼[{δtπ(B)gπ(St,At;B)}2]\displaystyle\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\left[\left\{\delta_{t}^{\pi}(B)-\hat{g}^{\pi}(S_{t},A_{t};B)\right\}^{2}\right]+\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\left[\left\{\delta_{t}^{\pi}(B)-g_{*}^{\pi}(S_{t},A_{t};B)\right\}^{2}\right]
+2Tt=0T1𝔼[{g^π(St,At;B)δtπ(B)}{δtπ(B)gπ(St,At;B)}].\displaystyle\qquad\quad+\frac{2}{T}\sum_{t=0}^{T-1}\mathbb{E}\left[\left\{\hat{g}^{\pi}(S_{t},A_{t};B)-\delta_{t}^{\pi}(B)\right\}\left\{\delta_{t}^{\pi}(B)-g_{*}^{\pi}(S_{t},A_{t};B)\right\}\right].

Since t=0T1𝔼[{δtπ(B)gπ(St,At;B)}g(St,At)]=0\sum_{t=0}^{T-1}\mathbb{E}\left[\left\{\delta_{t}^{\pi}(B)-g_{*}^{\pi}(S_{t},A_{t};B)\right\}g(S_{t},A_{t})\right]=0 for all g𝒢g\in\mathcal{G} due to the optimizing property of gπg_{*}^{\pi}, the last term above can be simplified as

2Tt=0T1𝔼[{g^π(St,At;B)gπ(St,At;B)\displaystyle\frac{2}{T}\sum_{t=0}^{T-1}\mathbb{E}\Big{[}\big{\{}\hat{g}^{\pi}(S_{t},A_{t};B)-g_{*}^{\pi}(S_{t},A_{t};B)
+gπ(St,At;B)δtπ(B)}{δtπ(B)gπ(St,At;B)}]\displaystyle\qquad\qquad+g_{*}^{\pi}(S_{t},A_{t};B)-\delta_{t}^{\pi}(B)\big{\}}\big{\{}\delta_{t}^{\pi}(B)-g_{*}^{\pi}(S_{t},A_{t};B)\big{\}}\Big{]}
=2Tt=0T1𝔼[{gπ(St,At;B)δtπ(B)}{δtπ(B)gπ(St,At;B)}]\displaystyle=\frac{2}{T}\sum_{t=0}^{T-1}\mathbb{E}\left[\big{\{}g_{*}^{\pi}(S_{t},A_{t};B)-\delta_{t}^{\pi}(B)\big{\}}\big{\{}\delta_{t}^{\pi}(B)-g_{*}^{\pi}(S_{t},A_{t};B)\big{\}}\right]
=2Tt=0T1𝔼[{δtπ(B)gπ(St,At;B)}2].\displaystyle=-\frac{2}{T}\sum_{t=0}^{T-1}\mathbb{E}\left[\left\{\delta_{t}^{\pi}(B)-g_{*}^{\pi}(S_{t},A_{t};B)\right\}^{2}\right].

As a result, we have

1Tt=0T1𝔼[{g^π(St,At;B)gπ(St,At;B)}2]\displaystyle\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\left[\left\{\hat{g}^{\pi}(S_{t},A_{t};B)-g_{*}^{\pi}(S_{t},A_{t};B)\right\}^{2}\right]
=\displaystyle= 𝔼[1Tt=0T1{δtπ(B)g^π(St,At;B)}2{δtπ(B)gπ(St,At;B)}2].\displaystyle\mathbb{E}\Big{[}\frac{1}{T}\sum_{t=0}^{T-1}\big{\{}\delta_{t}^{\pi}(B)-\hat{g}^{\pi}(S_{t},A_{t};B)\big{\}}^{2}-\big{\{}\delta_{t}^{\pi}(B)-g_{*}^{\pi}(S_{t},A_{t};B)\big{\}}^{2}\Big{]}.

For g1,g2𝒢,B𝒬g_{1},g_{2}\in\mathcal{G},B\in\mathcal{Q}, we define the following two functions:

f1π(g1,g2,B):(St,At,St+1){δtπ(B)g1(St,At)}2{δtπ(B)g2(St,At)}2\displaystyle f_{1}^{\pi}(g_{1},g_{2},B):(S_{t},A_{t},S_{t+1})\mapsto\left\{\delta_{t}^{\pi}(B)-g_{1}(S_{t},A_{t})\right\}^{2}-\left\{\delta_{t}^{\pi}(B)-g_{2}(S_{t},A_{t})\right\}^{2}
f_{2}^{\pi}(g_{1},g_{2},B):(S_{t},A_{t},S_{t+1})\mapsto\left\{\delta_{t}^{\pi}(B)-g_{2}(S_{t},A_{t})\right\}\left\{g_{1}(S_{t},A_{t})-g_{2}(S_{t},A_{t})\right\}.

With these notations, we know that

1Tt=0T1𝔼[{g^π(St,At;B)gπ(St,At;B)}2]=𝔼[1Tt=0T1f1π{g^π(B),gπ(B),B}(St,At,St+1)],\displaystyle\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\left[\left\{\hat{g}^{\pi}(S_{t},A_{t};B)-g_{*}^{\pi}(S_{t},A_{t};B)\right\}^{2}\right]=\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1}f_{1}^{\pi}\left\{\hat{g}^{\pi}(B),g_{*}^{\pi}(B),B\right\}(S_{t},A_{t},S_{t+1})\right],
g^π(B)gπ(B)n,T2=N[f1π{g^π(B),gπ(B),B}(S,A,S)+2f2π{g^π(B),gπ(B),B}(S,A,S)].\displaystyle\|\hat{g}^{\pi}(B)-g_{*}^{\pi}(B)\|_{{\color[rgb]{0,0,0}n,T}}^{2}=\mathbb{P}_{N}\left[f_{1}^{\pi}\left\{\hat{g}^{\pi}(B),g_{*}^{\pi}(B),B\right\}(S,A,S^{\prime})+2f_{2}^{\pi}\left\{\hat{g}^{\pi}(B),g_{*}^{\pi}(B),B\right\}(S,A,S^{\prime})\right].

In the following, we decompose

1Tt=0T1𝔼[{g^π(St,At;B)gπ(St,At;B)}2]+g^π(B)gπ(B)n,T2+μJ𝒢2{g^π(B)}=I1(B)+I2(B),\displaystyle\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\left[\left\{\hat{g}^{\pi}(S_{t},A_{t};B)-g_{*}^{\pi}(S_{t},A_{t};B)\right\}^{2}\right]+\|\hat{g}^{\pi}(B)-g_{*}^{\pi}(B)\|_{{\color[rgb]{0,0,0}n,T}}^{2}+\mu J_{\mathcal{G}}^{2}\left\{\hat{g}^{\pi}(B)\right\}=I_{1}(B)+I_{2}(B),

where

I1(B)=3Nf1π{g^π(B),gπ(B),B}+μ[3J𝒢2{g^π(B)}+2J𝒢2{gπ(B)}+2J𝒬2(B)],\displaystyle I_{1}(B)=3\mathbb{P}_{N}f_{1}^{\pi}\left\{\hat{g}^{\pi}(B),g_{*}^{\pi}(B),B\right\}+\mu[3J_{\mathcal{G}}^{2}\left\{\hat{g}^{\pi}(B)\right\}+2J_{\mathcal{G}}^{2}\left\{g_{*}^{\pi}(B)\right\}+2J_{\mathcal{Q}}^{2}(B)],
I2(B)=Nf1π{g^π(B),gπ(B),B}+1T𝔼t=0T1f1π{g^π(B),gπ(B),B}(St,At,St+1)+μJ𝒢2{g^π(B)}\displaystyle I_{2}(B)=\mathbb{P}_{N}f_{1}^{\pi}\left\{\hat{g}^{\pi}(B),g_{*}^{\pi}(B),B\right\}+\frac{1}{T}\mathbb{E}\sum_{t=0}^{T-1}f_{1}^{\pi}\left\{\hat{g}^{\pi}(B),g_{*}^{\pi}(B),B\right\}(S_{t},A_{t},S_{t+1})+\mu J_{\mathcal{G}}^{2}\left\{\hat{g}^{\pi}(B)\right\}
+2Nf2π{g^π(B),gπ(B),B}I1(B).\displaystyle\qquad\qquad\qquad+2\mathbb{P}_{N}f_{2}^{\pi}\left\{\hat{g}^{\pi}(B),g_{*}^{\pi}(B),B\right\}-I_{1}(B).

For the first term, the optimizing property of g^π(B)\hat{g}^{\pi}(B) implies that

\frac{1}{3}I_{1}(B) =\mathbb{P}_{N}\Big[\left\{\delta^{\pi}(B)-\hat{g}^{\pi}(S,A;B)\right\}^{2}-\left\{\delta^{\pi}(B)-g_{*}^{\pi}(S,A;B)\right\}^{2}\Big]
\qquad+\mu J_{\mathcal{G}}^{2}\left\{\hat{g}^{\pi}(B)\right\}+\frac{2}{3}\mu J_{\mathcal{G}}^{2}\left\{g_{*}^{\pi}(B)\right\}+\frac{2}{3}\mu J_{\mathcal{Q}}^{2}(B)
=\mathbb{P}_{N}\Big[\left\{\delta^{\pi}(B)-\hat{g}^{\pi}(S,A;B)\right\}^{2}\Big]+\mu J_{\mathcal{G}}^{2}\left\{\hat{g}^{\pi}(B)\right\}
\qquad-\mathbb{P}_{N}\Big[\left\{\delta^{\pi}(B)-g_{*}^{\pi}(S,A;B)\right\}^{2}\Big]+\frac{2}{3}\mu J_{\mathcal{G}}^{2}\left\{g_{*}^{\pi}(B)\right\}+\frac{2}{3}\mu J_{\mathcal{Q}}^{2}(B)
53μJ𝒢2{gπ(B)}+23μJ𝒬2(B).\displaystyle\leq\frac{5}{3}\mu J_{\mathcal{G}}^{2}\left\{g_{*}^{\pi}(B)\right\}+\frac{2}{3}\mu J_{\mathcal{Q}}^{2}(B).

Thus, I1(B)5μJ𝒢2{gπ(B)}+2μJ𝒬2(B)I_{1}(B)\leq 5\mu J_{\mathcal{G}}^{2}\left\{g_{*}^{\pi}(B)\right\}+2\mu J_{\mathcal{Q}}^{2}(B) holds for all BB.

Next we derive the uniform bound of I2(B)I_{2}(B) over all BB. For simplicity, take

𝐉2(g1,g2,B)=J𝒢2(g1)+J𝒢2(g2)+J𝒬2(B),\displaystyle\mathbf{J}^{2}(g_{1},g_{2},B)=J_{\mathcal{G}}^{2}(g_{1})+J_{\mathcal{G}}^{2}(g_{2})+J_{\mathcal{Q}}^{2}(B), (S46)

for any g1,g2𝒢g_{1},g_{2}\in\mathcal{G} and B𝒬B\in\mathcal{Q}.

Take $f^{\pi}=f_{1}^{\pi}-f_{2}^{\pi}$, where

f^{\pi}(g_{1},g_{2},B):(S_{t},A_{t},S_{t+1})\mapsto\left\{g_{2}(S_{t},A_{t})-g_{1}(S_{t},A_{t})\right\}\left\{3\delta_{t}^{\pi}(B)-2g_{2}(S_{t},A_{t})-g_{1}(S_{t},A_{t})\right\}. (S47)

Then

I2(B)\displaystyle I_{2}(B) =2(𝔼N)fπ{g^π(B),gπ(B),B}𝔼fπ{g^π(B),gπ(B),B}2μ𝐉2(g^π,gπ,B).\displaystyle=2(\mathbb{E}-\mathbb{P}_{N})f^{\pi}\left\{\hat{g}^{\pi}(B),g_{*}^{\pi}(B),B\right\}-\mathbb{E}f^{\pi}\left\{\hat{g}^{\pi}(B),g_{*}^{\pi}(B),B\right\}-2\mu\mathbf{J}^{2}(\hat{g}^{\pi},g^{\pi}_{*},B).

By Lemma S2, we are able to show that with probability at least 12δ1/(nT)1-2\delta-1/({\color[rgb]{0,0,0}nT}),

I2(B)μ+1nTμα1τ(2+α)+[log(max{1/δ,nT})]1/τnT,\displaystyle I_{2}(B)\lesssim\mu+\frac{1}{{\color[rgb]{0,0,0}nT}\mu^{\frac{\alpha}{1-\tau(2+\alpha)}}}+\frac{[\log(\max\{1/\delta,{\color[rgb]{0,0,0}nT}\})]^{1/\tau}}{{\color[rgb]{0,0,0}nT}},

where $0<\tau\leq\frac{1}{3}$, and the leading constant depends on $Q_{\max}$, $G_{\max}$, $\alpha$, $\kappa$ and $C_{s}$. Taking $\tau=(1+\alpha)\log(\log(nT))/(\alpha\log(nT))$ and $\mu\asymp(nT)^{-1/(1+\alpha)}\log(nT)$, the result of the theorem then follows. ∎

Corollary 1 is a direct application of Theorem 3; it utilizes the uniformity of Theorem 3 and the construction of kernel ridge regression estimators. Before presenting the proof, we state the assumptions for Corollary 1 below.

Assumption 4’.

The following conditions hold.

  1. (a)

    The Markov chain $\{S_{t},A_{t}\}_{t\geq 0}$ has a unique stationary distribution $\mathbb{G}^{*}$ with density $p^{*}$ over $\mathcal{S}\times\mathcal{A}$ and is geometrically ergodic, i.e., there exist a function $\phi(s,a)$ and a constant $\kappa\in(0,1)$ such that, for any $s\in\mathcal{S}$ and $a\in\mathcal{A}$,

    𝔾tb((s,a))𝔾()TVϕ(s,a)κt,t0,\left\|\mathbb{G}^{b}_{t}(\cdot\mid(s,a))-\mathbb{G}^{*}(\cdot)\right\|_{\mathrm{TV}}\leq\phi(s,a)\kappa^{t},\qquad\forall t\geq 0,

    where $\|\cdot\|_{\mathrm{TV}}$ denotes the total variation norm, and $\mathbb{G}_{t}^{b}(\cdot\mid(s,a))$ is the distribution of $(S_{t},A_{t})$ conditioned on $S_{0}=s$ and $A_{0}=a$, under the behavior policy. Also, there exists a constant $C_{11}>0$ such that $\int_{(s,a)}\phi(s,a)\,d\mathbb{G}_{0}(s,a)\leq C_{11}$, where recall that $\mathbb{G}_{0}$ is the initial distribution of $(S,A)$.

  2. (b)

    The function class 𝒬\mathcal{Q} satisfies that max{B,J𝒬(B)}Qmax\max\{\|B\|_{\infty},J_{\mathcal{Q}}(B)\}\leq Q_{\max} for all B𝒬B\in\mathcal{Q}.

  3. (c)

    The function class 𝒢\mathcal{G} in (24) is a subset of an RKHS whose corresponding RKHS norm is \|\cdot\|_{{\mathcal{H}}}. In addition, 𝒢\mathcal{G} is a star-shaped set with center 0 and it satisfies that gGmax\|g\|_{\infty}\leq G_{\max} for all g𝒢g\in\mathcal{G} and gπ(,;B)Gmax\|g^{\pi}_{*}(\cdot,\cdot;B)\|_{{\mathcal{H}}}\leq G_{\max} for all B𝒬B\in\mathcal{Q}.

  4. (d)

    The regularization functional J𝒢J_{\mathcal{G}} in (24) is taken as 2\|\cdot\|_{{\mathcal{H}}}^{2}. Let 𝒬M={B:B𝒬,J𝒬(B)M}\mathcal{Q}_{M}=\{B:B\in\mathcal{Q},J_{\mathcal{Q}}(B)\leq M\} and 𝒢M={g:g𝒢,g2M}\mathcal{G}_{M}=\{g:g\in\mathcal{G},\|g\|_{{\mathcal{H}}}^{2}\leq M\}. There exist constants CLABEL:entropyC_{\ref*{entropy}} and α(0,1)\alpha\in(0,1), such that for any ϵ,M>0\epsilon,M>0,

    \max\left\{\log\mathcal{N}(\epsilon,\mathcal{G}_{M},\|\cdot\|_{\infty}),\,\log\mathcal{N}(\epsilon,\mathcal{Q}_{M},\|\cdot\|_{\infty})\right\}\leq C_{\ref*{entropy}}\left(\frac{M}{\epsilon}\right)^{2\alpha}.
Proof of Corollary 1.

Under the conditions stated in Corollary 1, directly applying Theorem 3 with $\delta=\frac{1}{nT}$ shows that with probability at least $1-3/(nT)$,

𝔼{1Tt=0T1[g^π(St,At;k=1KυkBk)gπ(St,At;k=1KυkBk)]2}\displaystyle{\mathbb{E}\left\{\frac{1}{T}\sum_{t=0}^{T-1}\left[\hat{g}^{\pi}\left(S_{t},A_{t};\sum_{k=1}^{K}{\upsilon_{k}}B_{k}\right)-g_{*}^{\pi}\left(S_{t},A_{t};\sum_{k=1}^{K}{\upsilon_{k}}B_{k}\right)\right]^{2}\right\}}
+g^π(,;k=1KυkBk)gπ(,;k=1KυkBk)n,T2\displaystyle\quad+\left\|\hat{g}^{\pi}\left(\cdot,\cdot;\sum_{k=1}^{K}{\upsilon_{k}}B_{k}\right)-g_{*}^{\pi}\left(\cdot,\cdot;\sum_{k=1}^{K}{\upsilon_{k}}B_{k}\right)\right\|_{{\color[rgb]{0,0,0}n,T}}^{2}
(nT)11+α(log(nT))2+α1+α.\displaystyle\lesssim({\color[rgb]{0,0,0}nT})^{-\frac{1}{1+\alpha}}(\log({\color[rgb]{0,0,0}nT}))^{\frac{2+\alpha}{1+\alpha}}.

Then it suffices to show that

gπ(,;𝝊[Bk]k=1K)=𝝊[gπ(,;Bk)]k=1K,\displaystyle g_{*}^{\pi}(\cdot,\cdot;{\bm{\upsilon}}^{\intercal}[B_{k}]_{k=1}^{K})={\bm{\upsilon}}^{\intercal}[g^{\pi}_{*}(\cdot,\cdot;B_{k})]_{k=1}^{K}, (S48)

and

g^π(,;𝝊[Bk]k=1K)=𝝊[g^π(,;Bk)]k=1K.\displaystyle\hat{g}^{\pi}(\cdot,\cdot;{\bm{\upsilon}}^{\intercal}[B_{k}]_{k=1}^{K})={\bm{\upsilon}}^{\intercal}[\hat{g}^{\pi}(\cdot,\cdot;B_{k})]_{k=1}^{K}. (S49)

Note that (S48) is due to the definition of gπg_{*}^{\pi}. Next, we verify (S49).

Denote the reproducing kernel of $\mathcal{G}$ by $\kappa(\cdot,\cdot)$. Write $Y^{k}_{i,t}=\sum_{a^{\prime}}\pi(a^{\prime}\mid S_{i,t+1})B_{k}(S_{i,t+1},a^{\prime})$, $\bm{Y}_{k}=[Y^{k}_{i,t}]_{i=1,\dots,n,\,t=0,\dots,T-1}$, and let $\bm{G}\in\mathbb{R}^{nT\times nT}$ be the Gram matrix over the observed state-action pairs. Take

\bm{G}_{\kappa}(s,a)=[\kappa\left\{(s,a),(S_{i,t},A_{i,t})\right\}]_{i=1,\dots,n,\,t=0,\dots,T-1}\in\mathbb{R}^{nT}.

Then

\hat{g}^{\pi}(s,a;B_{k})=\bm{Y}_{k}^{\intercal}(\bm{G}+\mu\bm{I})^{-1}\bm{G}_{\kappa}(s,a).

If we keep the tuning parameter the same when approximating all gπ(,;Bk)g^{\pi}_{*}(\cdot,\cdot;B_{k}), then for any 𝝊{\bm{\upsilon}},

𝝊[g^π(s,a;Bk)]k=1K\displaystyle{\bm{\upsilon}}^{\intercal}[\hat{g}^{\pi}(s,a;B_{k})]_{k=1}^{K} =𝝊[𝒀k(𝑮+μ𝑰)1𝑮κ(s,a)]k=1K\displaystyle={\bm{\upsilon}}^{\intercal}\left[\bm{Y}_{k}^{\intercal}(\bm{G}+\mu\bm{I})^{-1}\bm{G}_{\kappa}(s,a)\right]_{k=1}^{K}
=[𝝊𝒀~i,t]i=1,n,t=0,,T1(𝑮+μ𝑰)1𝑮κ(s,a)\displaystyle=[{\bm{\upsilon}}^{\intercal}\tilde{\bm{Y}}_{i,t}]^{\intercal}_{i=1\dots,n,t=0,\dots,T-1}(\bm{G}+\mu\bm{I})^{-1}\bm{G}_{\kappa}(s,a)

where $\tilde{\bm{Y}}_{i,t}=[Y^{k}_{i,t}]_{k=1}^{K}\in\mathbb{R}^{K}$ and ${\bm{\upsilon}}^{\intercal}\tilde{\bm{Y}}_{i,t}=\sum_{k=1}^{K}\upsilon_{k}\{\sum_{a^{\prime}}\pi(a^{\prime}\mid S_{i,t+1})B_{k}(S_{i,t+1},a^{\prime})\}=\sum_{a^{\prime}}\pi(a^{\prime}\mid S_{i,t+1})\{\sum_{k=1}^{K}\upsilon_{k}B_{k}(S_{i,t+1},a^{\prime})\}$. Then ${\bm{\upsilon}}^{\intercal}[\hat{g}^{\pi}(s,a;B_{k})]_{k=1}^{K}$ is exactly the kernel ridge regression estimator of $g^{\pi}_{*}(s,a;{\bm{\upsilon}}^{\intercal}[B_{k}]_{k=1}^{K})$ with the same tuning parameter $\mu$, which establishes (S49). Therefore we have with probability at least $1-3/(nT)$,

\mathbb{E}\left\{\frac{1}{T}\sum_{t=0}^{T-1}\left[\sum_{k=1}^{K}\upsilon_{k}\hat{g}^{\pi}\left(S_{t},A_{t};B_{k}\right)-\sum_{k=1}^{K}\upsilon_{k}g_{*}^{\pi}\left(S_{t},A_{t};B_{k}\right)\right]^{2}\right\}+\left\|\sum_{k=1}^{K}\upsilon_{k}\hat{g}^{\pi}\left(\cdot,\cdot;B_{k}\right)-\sum_{k=1}^{K}\upsilon_{k}g_{*}^{\pi}\left(\cdot,\cdot;B_{k}\right)\right\|_{n,T}^{2}
\lesssim(nT)^{-\frac{1}{1+\alpha}}(\log(nT))^{\frac{2+\alpha}{1+\alpha}}.
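The linearity underlying (S49) can be checked numerically. The following is a minimal self-contained sketch with synthetic data and a Gaussian kernel; all names and values are illustrative, not part of the paper's implementation.

```python
import numpy as np

# Check of (S49): kernel ridge regression is linear in the responses, so
# fitting a linear combination of pseudo-responses equals the same linear
# combination of the individual fits (with a common tuning parameter mu).
rng = np.random.default_rng(0)
nT = 40                                   # number of (state, action) pairs
X = rng.normal(size=(nT, 3))              # toy state-action features

def kern(a, b):                           # Gaussian (RBF) kernel
    return np.exp(-np.sum((a - b) ** 2))

G = np.array([[kern(x, y) for y in X] for x in X])  # Gram matrix
mu = 0.1                                  # ridge tuning parameter
A = np.linalg.inv(G + mu * np.eye(nT))    # (G + mu I)^{-1}

x0 = rng.normal(size=3)                   # evaluation point (s, a)
g0 = np.array([kern(x0, x) for x in X])   # G_kappa(s, a)

def fit(Y):                               # hat{g}(s, a; .) = Y' (G + mu I)^{-1} G_kappa(s, a)
    return Y @ A @ g0

Y1, Y2 = rng.normal(size=nT), rng.normal(size=nT)   # pseudo-responses Y_k
v1, v2 = 0.3, -1.2                                  # coefficients upsilon_k
assert np.isclose(fit(v1 * Y1 + v2 * Y2), v1 * fit(Y1) + v2 * fit(Y2))
```

Since the fitted value is $\bm{Y}^{\intercal}(\bm{G}+\mu\bm{I})^{-1}\bm{G}_{\kappa}(s,a)$, it is linear in the response vector $\bm{Y}$ for a fixed $\mu$, which is exactly the property (S49) relies on.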

S2.4 Technical Proof in Section 4.2

In this section, we present the proof of Theorem 4, which generalizes proofs in [13] and [73]. First, we show the $\ell_{2}$ convergence of $\bm{\lambda}^{+}$ to $\bm{\lambda}_{*}$. This requires a more delicate decomposition than those in [13] and [73], due to the additional projection error from Theorem 3. In addition, by a truncation argument similar to that discussed in Section S2.3, we develop a tight matrix concentration inequality for independent Markov chains. After obtaining the convergence of $\bm{\lambda}^{+}$ to $\bm{\lambda}_{*}$, the convergence of the weights can be derived from the results in Lemma 1 combined with Assumptions 5(a) and (c).

Proof.

Recall that 𝝀+\bm{\lambda}^{+} is the solution to (27) and 𝝀\bm{\lambda}_{*} is the coefficient that leads to the best approximation to the true ratio function ωπ\omega^{\pi} with basis 𝑳K\bm{L}_{K}. By Assumption 5(a), we can show that sups,a|ωπρ(𝑳K(s,a)𝝀)|CLABEL:weightsbound(Kr1)\sup_{s,a}|\omega^{\pi}-\rho^{\prime}(\bm{L}_{K}(s,a)^{\intercal}\bm{\lambda}_{*})|\leq C_{\ref*{weightsbound}}(K^{-r_{1}}). In the following, we study the convergence of 𝝀+𝝀2\|\bm{\lambda}^{+}-\bm{\lambda}_{*}\|_{2}. We aim to prove

𝝀+𝝀2=𝒪p[{ψ(K)}1(KlogKnT+K1/2r1+Kζn,T)].\displaystyle\|\bm{\lambda}^{+}-\bm{\lambda}_{*}\|_{2}=\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}\left[\{\psi(K)\}^{-1}\left(\frac{\sqrt{K}\log K}{\sqrt{{\color[rgb]{0,0,0}nT}}}+K^{1/2-r_{1}}+\sqrt{K}\zeta_{{\color[rgb]{0,0,0}n,T}}\right)\right]. (S50)

Define the objective function G(𝝀)=1nTi=1nt=0T1ρ(𝑳^K(Si,t,Ai,t)𝝀)+|𝝀|𝜹K𝝀𝒍KG(\bm{\lambda})=\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\rho(\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})^{\intercal}\bm{\lambda})+|\bm{\lambda}|^{\intercal}\bm{\delta}_{K}-\bm{\lambda}^{\intercal}\bm{l}_{K}. In order to show the above bound in (S50), by the continuity and convexity of G()G(\cdot), it suffices to show that, with high probability,

infΔ𝒞{G(𝝀+Δ)G(𝝀)}>0,\inf_{\Delta\in\mathcal{C}}\{G(\bm{\lambda}_{*}+\Delta)-G(\bm{\lambda}_{*})\}>0,

where $\mathcal{C}=\{\Delta\in\mathbb{R}^{K}:\|\Delta\|_{2}=C_{12}\{\psi(K)\}^{-1}({\sqrt{K}\log(nT)}/{\sqrt{nT}}+K^{-r_{1}}+\sqrt{K}\zeta_{n,T})\}$. Here $C_{12}$ is some appropriately chosen constant.
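For completeness, the reduction from positivity on the sphere $\mathcal{C}$ to the $\ell_{2}$ bound (S50) is the standard convexity argument; a sketch:

```latex
\begin{align*}
&\text{Suppose }\|\bm{\lambda}^{+}-\bm{\lambda}_{*}\|_{2}>r,\text{ where }r\text{ is the radius of }\mathcal{C}.
\text{ Then }\bm{\lambda}^{+}=\bm{\lambda}_{*}+t\Delta_{0}\text{ for some }\Delta_{0}\in\mathcal{C},\,t>1,\\
&\text{and convexity of }G\text{ gives }
G(\bm{\lambda}_{*}+\Delta_{0})\leq(1-t^{-1})\,G(\bm{\lambda}_{*})+t^{-1}G(\bm{\lambda}^{+})\leq G(\bm{\lambda}_{*}),\\
&\text{using }G(\bm{\lambda}^{+})\leq G(\bm{\lambda}_{*})\text{ by optimality of }\bm{\lambda}^{+}.
\text{ This contradicts }\inf_{\Delta\in\mathcal{C}}\{G(\bm{\lambda}_{*}+\Delta)-G(\bm{\lambda}_{*})\}>0.
\end{align*}
```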

First, we take the following decomposition:

G(𝝀+Δ)G(𝝀)\displaystyle G(\bm{\lambda}_{*}+\Delta)-G(\bm{\lambda}_{*}) (S51)
=1nTi=1nt=0T1[ρ{𝑳^K(Si,t,Ai,t)(𝝀+Δ)}ρ{𝑳^K(Si,t,Ai,t)𝝀}]+𝜹K(|𝝀+Δ||𝝀|)Δ𝒍K\displaystyle=\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\left[\rho\{\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})^{\intercal}(\bm{\lambda}_{*}+\Delta)\}-\rho\{\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})^{\intercal}\bm{\lambda}_{*}\}\right]+\bm{\delta}_{K}^{\intercal}(|\bm{\lambda}_{*}+\Delta|-|\bm{\lambda}_{*}|)-\Delta^{\intercal}\bm{l}_{K}
1nTi=1nt=0T1[ρ{𝑳^K(Si,t,Ai,t)𝝀}{𝑳^K(Si,t,Ai,t)Δ}+12ρ′′{v^(Si,t,Ai,t)}{𝑳^K(Si,t,Ai,t)Δ}2]\displaystyle\geq\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\left[\rho^{\prime}\{\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})^{\intercal}\bm{\lambda}_{*}\}\{\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})^{\intercal}\Delta\}+\frac{1}{2}\rho^{{}^{\prime\prime}}\{\hat{v}(S_{i,t},A_{i,t})\}\{\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})^{\intercal}\Delta\}^{2}\right]
Δ𝒍K|Δ|𝜹K,\displaystyle\quad-\Delta^{\intercal}\bm{l}_{K}-|\Delta|^{\intercal}\bm{\delta}_{K}, (S52)

where v^(Si,t,Ai,t)\hat{v}(S_{i,t},A_{i,t}) is a value between 𝑳^K(Si,t,Ai,t)(𝝀+Δ)\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})^{\intercal}(\bm{\lambda}_{*}+\Delta) and 𝑳^K(Si,t,Ai,t)𝝀\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})^{\intercal}\bm{\lambda}_{*}.

Define 𝝃^(s,a)=𝑳^K(s,a)𝑳K(s,a)\hat{\bm{\xi}}(s,a)=\hat{\bm{L}}_{K}(s,a)-\bm{L}_{K}(s,a). Now we focus on the following term:

1nTi=1nt=0T1ρ{𝑳^K(Si,t,Ai,t)𝝀}{𝑳^K(Si,t,Ai,t)Δ}Δ𝒍K\displaystyle\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\rho^{\prime}\{\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})^{\intercal}\bm{\lambda}_{*}\}\{\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})^{\intercal}\Delta\}-\Delta^{\intercal}\bm{l}_{K}
=\displaystyle= 1nTi=1nt=0T1ρ[𝑳K(Si,t,Ai,t)𝝀+𝝀𝝃^(Sit,Ait)]{𝑳K(Si,t,Ai,t)Δ+𝝃^(Si,t,Ai,t)Δ}Δ𝒍K\displaystyle\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\rho^{\prime}\left[\bm{L}_{K}(S_{i,t},A_{i,t})^{\intercal}\bm{\lambda}_{*}+\bm{\lambda}_{*}^{\intercal}\hat{\bm{\xi}}(S_{it},A_{it})\right]\{\bm{L}_{K}(S_{i,t},A_{i,t})^{\intercal}\Delta+\hat{\bm{\xi}}(S_{i,t},A_{i,t})^{\intercal}\Delta\}-\Delta^{\intercal}\bm{l}_{K}
=\displaystyle= 1nTi=1nt=0T1[ρ{𝑳K(Si,t,Ai,t)𝝀}{𝑳K(Si,t,Ai,t)Δ}Δ𝒍K]\displaystyle\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\left[\rho^{\prime}\left\{\bm{L}_{K}(S_{i,t},A_{i,t})^{\intercal}\bm{\lambda}_{*}\right\}\left\{\bm{L}_{K}(S_{i,t},A_{i,t})^{\intercal}\Delta\right\}-\Delta^{\intercal}\bm{l}_{K}\right] (S53)
+\displaystyle+ 1nTi=1nt=0T1ρ{𝑳K(Si,t,Ai,t)𝝀}{𝝃^(Si,t,Ai,t)Δ}\displaystyle\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\rho^{\prime}\left\{\bm{L}_{K}(S_{i,t},A_{i,t})^{\intercal}\bm{\lambda}_{*}\right\}\left\{\hat{\bm{\xi}}(S_{i,t},A_{i,t})^{\intercal}\Delta\right\} (S54)
+\displaystyle+ 1nTi=1nt=0T1ρ′′{v~(Si,t,Ai,t)}{𝝃^(Si,t,Ai,t)𝝀}{𝑳K(Si,t,Ai,t)Δ}\displaystyle\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\rho^{{}^{\prime\prime}}\{\tilde{v}(S_{i,t},A_{i,t})\}\left\{\hat{\bm{\xi}}(S_{i,t},A_{i,t})^{\intercal}\bm{\lambda}_{*}\right\}\left\{\bm{L}_{K}(S_{i,t},A_{i,t})^{\intercal}\Delta\right\} (S55)
+\displaystyle+ 1nTi=1nt=0T1ρ′′{v~(Si,t,Ai,t)}{𝝃^(Si,t,Ai,t)𝝀}{𝝃^(Si,t,Ai,t)Δ},\displaystyle\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\rho^{{}^{\prime\prime}}\{\tilde{v}(S_{i,t},A_{i,t})\}\left\{\hat{\bm{\xi}}(S_{i,t},A_{i,t})^{\intercal}\bm{\lambda}_{*}\right\}\left\{\hat{\bm{\xi}}(S_{i,t},A_{i,t})^{\intercal}\Delta\right\}, (S56)

where v~(Si,t,Ai,t)\tilde{v}(S_{i,t},A_{i,t}) is a value between 𝑳K(Si,t,Ai,t)𝝀\bm{L}_{K}(S_{i,t},A_{i,t})^{\intercal}\bm{\lambda}_{*} and 𝑳^K(Si,t,Ai,t)𝝀\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})^{\intercal}\bm{\lambda}_{*}. In the following, we show how to control (S53)-(S56) one by one.

  • For (S53),

    (S53)\displaystyle\eqref{eqn:t1} =1nTi=1nt=0T1[ωπ(Si,t,Ai,t){𝑳K(Si,t,Ai,t)Δ}Δ𝒍K]\displaystyle=\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\left[\omega^{\pi}(S_{i,t},A_{i,t})\left\{\bm{L}_{K}(S_{i,t},A_{i,t})^{\intercal}\Delta\right\}-\Delta^{\intercal}\bm{l}_{K}\right]
    +1nTi=1nt=0T1[ρ{𝑳K(Si,t,Ai,t)𝝀}ωπ(Si,t,Ai,t)]{𝑳K(Si,t,Ai,t)Δ}\displaystyle\quad+\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\left[\rho^{\prime}\left\{\bm{L}_{K}(S_{i,t},A_{i,t})^{\intercal}\bm{\lambda}_{*}\right\}-\omega^{\pi}(S_{i,t},A_{i,t})\right]\left\{\bm{L}_{K}(S_{i,t},A_{i,t})^{\intercal}\Delta\right\}
    1nTi=1nt=0T1ωπ(Si,t,Ai,t)𝑳K(Si,t,Ai,t)𝒍K2Δ2\displaystyle\geq-\left\|\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\omega^{\pi}(S_{i,t},A_{i,t})\bm{L}_{K}(S_{i,t},A_{i,t})-\bm{l}_{K}\right\|_{2}\left\|\Delta\right\|_{2}
    \displaystyle\quad-\mathop{}\mathopen{}\mathcal{O}\mathopen{}(K^{-r_{1}})\sqrt{\Delta^{\intercal}\left\{\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\bm{L}_{K}(S_{i,t},A_{i,t})\bm{L}_{K}(S_{i,t},A_{i,t})^{\intercal}\right\}\Delta}
    1nTi=1nt=0T1ωπ(Si,t,Ai,t)𝑳K(Si,t,Ai,t)𝒍K2Δ2\displaystyle\geq-\left\|\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\omega^{\pi}(S_{i,t},A_{i,t})\bm{L}_{K}(S_{i,t},A_{i,t})-\bm{l}_{K}\right\|_{2}\left\|\Delta\right\|_{2}
    𝒪(Kr1)[λmax{1nTi=1nt=0T1𝑳K(Si,t,Ai,t)𝑳K(Si,t,Ai,t)}]1/2Δ2\displaystyle\quad-\mathop{}\mathopen{}\mathcal{O}\mathopen{}(K^{-r_{1}})\left[\lambda_{\max}\left\{\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\bm{L}_{K}(S_{i,t},A_{i,t})\bm{L}_{K}(S_{i,t},A_{i,t})^{\intercal}\right\}\right]^{1/2}\|\Delta\|_{2}
    1nTi=1nt=0T1ωπ(Si,t,Ai,t)𝑳K(Si,t,Ai,t)𝒍K2Δ2𝒪p(Kr1)Δ2.\displaystyle\geq-\left\|\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\omega^{\pi}(S_{i,t},A_{i,t})\bm{L}_{K}(S_{i,t},A_{i,t})-\bm{l}_{K}\right\|_{2}\left\|\Delta\right\|_{2}-\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}(K^{-r_{1}})\|\Delta\|_{2}. (S57)

    The first inequality is based on the Cauchy–Schwarz inequality and Assumption 5(a). For the last inequality, we apply Lemma S3, which yields

    λmax{1nTi=1nt=0T1𝑳K(Si,t,Ai,t)𝑳K(Si,t,Ai,t)}CLABEL:weightsC4+𝒪p(1),\displaystyle\lambda_{\max}\left\{\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\bm{L}_{K}(S_{i,t},A_{i,t})\bm{L}_{K}(S_{i,t},A_{i,t})^{\intercal}\right\}\leq C_{\ref*{weightsC4}}+\scalebox{0.7}{$\mathcal{O}$}_{\mathrm{p}}(1),

    due to Assumptions 4’(a) and 5(c).

    Next, we show how to bound (nT)1i=1nt=0T1ωπ(Si,t,Ai,t)𝑳K(Si,t,Ai,t)𝒍K2\|(nT)^{-1}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\omega^{\pi}(S_{i,t},A_{i,t})\bm{L}_{K}(S_{i,t},A_{i,t})-\bm{l}_{K}\|_{2} in (S57). Note that

    𝒍K=𝔼{1Tt=0T1ωπ(Si,t,Ai,t)𝑳K(Si,t,Ai,t)}.\bm{l}_{K}=\mathbb{E}\left\{\frac{1}{T}\sum_{t=0}^{T-1}\omega^{\pi}(S_{i,t},A_{i,t})\bm{L}_{K}(S_{i,t},A_{i,t})\right\}.

    Then we apply the matrix concentration inequality developed in Lemma S4 to bound it. Specifically, we take 𝑭(s,a)=ωπ(s,a)𝑳K(s,a)\bm{F}(s,a)=\omega^{\pi}(s,a)\bm{L}_{K}(s,a) in Lemma S4. By Assumption 5(a) and (c), we can show that sup(s,a)𝑭(s,a)2ωπ{sup(s,a)𝑳K(s,a)2}CLABEL:weightsC1CLABEL:weightsC2(1+γ)K\sup_{(s,a)}\|\bm{F}(s,a)\|_{2}\leq\|\omega^{\pi}\|_{\infty}\left\{\sup_{(s,a)}\|\bm{L}_{K}(s,a)\|_{2}\right\}\leq C_{\ref*{weightsC1}}C_{\ref*{weightsC2}}(1+\gamma)\sqrt{K}. Therefore, Lemma S4 implies that

    1nTi=1nt=0T1ωπ(Si,t,Ai,t)𝑳K(Si,t,Ai,t)𝒍K2=𝒪p(Klog(nT)nT).\displaystyle\left\|\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\omega^{\pi}(S_{i,t},A_{i,t})\bm{L}_{K}(S_{i,t},A_{i,t})-\bm{l}_{K}\right\|_{2}=\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}\left(\frac{\sqrt{K}\log({\color[rgb]{0,0,0}nT})}{\sqrt{{\color[rgb]{0,0,0}nT}}}\right).

    Thus we can show that

    (S53)𝒪p(Klog(nT)nT)Δ2𝒪(Kr1)Δ2\displaystyle\eqref{eqn:t1}\geq-\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}\left(\frac{\sqrt{K}\log({\color[rgb]{0,0,0}nT})}{\sqrt{{\color[rgb]{0,0,0}nT}}}\right)\|\Delta\|_{2}-\mathop{}\mathopen{}\mathcal{O}\mathopen{}(K^{-r_{1}})\|\Delta\|_{2} (S58)
  • For (S54),

    \eqref{eqn:t2}\geq-C\sup_{s,a}\left|\rho^{\prime}\left\{\bm{L}_{K}(s,a)^{\intercal}\bm{\lambda}_{*}\right\}\right|\sqrt{\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\left\{\hat{\bm{\xi}}(S_{i,t},A_{i,t})^{\intercal}\Delta\right\}^{2}}
    C(CLABEL:weightsC1+Kr1)Δ2γg^π(,;𝑩KΔΔ2)gπ(,;𝑩KΔΔ2)n,T,\displaystyle\geq-C(C_{\ref*{weightsC1}}+K^{-r_{1}})\|\Delta\|_{2}\gamma\left\|\hat{g}^{\pi}\left(\cdot,\cdot;\bm{B}_{K}^{\intercal}\frac{\Delta}{\|\Delta\|_{2}}\right)-g^{\pi}_{*}\left(\cdot,\cdot;\bm{B}_{K}^{\intercal}\frac{\Delta}{\|\Delta\|_{2}}\right)\right\|_{{\color[rgb]{0,0,0}n,T}},

    where the first inequality follows from the Cauchy–Schwarz inequality, and the second from Assumption 5(a) and the definition of $\hat{\bm{\xi}}$. As $\sup_{s,a}\sup_{\|\bm{\delta}\|_{2}\leq 1}|\bm{B}_{K}(s,a)^{\intercal}\bm{\delta}|\lesssim\sqrt{K}$, we have $\|K^{-1/2}\bm{B}_{K}^{\intercal}\Delta/\|\Delta\|_{2}\|_{\infty}\lesssim 1$. Due to Theorem 3 and Assumption 5(e), we have

    K1/2g^π(,;𝑩KΔΔ2)gπ(,;𝑩KΔΔ2)n,T=𝒪p(ζn,T).\displaystyle K^{-1/2}\left\|\hat{g}^{\pi}\left(\cdot,\cdot;\bm{B}_{K}^{\intercal}\frac{\Delta}{\|\Delta\|_{2}}\right)-g^{\pi}_{*}\left(\cdot,\cdot;\bm{B}_{K}^{\intercal}\frac{\Delta}{\|\Delta\|_{2}}\right)\right\|_{{\color[rgb]{0,0,0}n,T}}=\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}\left(\zeta_{{\color[rgb]{0,0,0}n,T}}\right).

    Therefore,

    (S54)\displaystyle\eqref{eqn:t2} CCLABEL:weightsC1K𝒪p(ζn,T)Δ2,\displaystyle\geq-CC_{\ref*{weightsC1}}\sqrt{K}\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}\left(\zeta_{{\color[rgb]{0,0,0}n,T}}\right)\|\Delta\|_{2}, (S59)

    where CC is some positive constant.

  • For (S55), note that

    \displaystyle\left|\bm{L}_{K}(S_{i,t},A_{i,t})^{\intercal}\bm{\lambda}_{*}\right|=\left|\bm{\lambda}_{*}^{\intercal}\bm{B}_{K}(S_{i,t},A_{i,t})-\gamma g^{\pi}_{*}(S_{i,t},A_{i,t};\bm{\lambda}_{*}^{\intercal}\bm{\beta}_{K})\right|
    \leq\left|\bm{\lambda}_{*}^{\intercal}\bm{B}_{K}(S_{i,t},A_{i,t})\right|+\gamma\left|g^{\pi}_{*}(S_{i,t},A_{i,t};\bm{\lambda}_{*}^{\intercal}\bm{\beta}_{K})\right|
    \leq Q_{\max}+\gamma G_{\max},

    due to Assumption 5(e). Similarly, we can show that

    |𝑳^K(Si,t,Ai,t)𝝀|Qmax+γGmax.\left|\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})^{\intercal}\bm{\lambda}_{*}\right|\leq Q_{\max}+\gamma G_{\max}.

    As $\hat{g}^{\pi}(\cdot,\cdot;\bm{\lambda}_{*}^{\intercal}\bm{\beta}_{K})\in\mathcal{G}$ by assumption, $|\tilde{v}(S_{i,t},A_{i,t})|\leq Q_{\max}+\gamma G_{\max}$. Since $\rho^{\prime\prime}$ is a strictly positive and continuous function, there exists a constant $c_{1}>0$ such that $\rho^{\prime\prime}\{\tilde{v}(S_{i,t},A_{i,t})\}\leq c_{1}$. We then have

    (S55)\displaystyle\eqref{eqn:t3} 1nTi=1nt=0T1ρ′′{v~(Si,t,Ai,t)}2{𝝃^(Si,t,Ai,t)𝝀}21nTi=1nt=0T1{𝑳K(Si,t,Ai,t)Δ}2\displaystyle\geq-\sqrt{\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\rho^{{}^{\prime\prime}}\{\tilde{v}(S_{i,t},A_{i,t})\}^{2}\left\{\hat{\bm{\xi}}(S_{i,t},A_{i,t})^{\intercal}\bm{\lambda}_{*}\right\}^{2}}\sqrt{\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\left\{\bm{L}_{K}(S_{i,t},A_{i,t})^{\intercal}\Delta\right\}^{2}}
    \geq-c_{\ref*{wc4}}\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}(\zeta_{n,T})\left[\lambda_{\max}\left\{\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\bm{L}_{K}(S_{i,t},A_{i,t})\bm{L}_{K}(S_{i,t},A_{i,t})^{\intercal}\right\}\right]^{1/2}\|\Delta\|_{2}
    CcLABEL:wc4𝒪p(ζn,T)([λmax{𝔼1nTi=1nt=0T1𝑳K(Si,t,Ai,t)𝑳K(Si,t,Ai,t)}]1/2+𝒪p(1))Δ2\displaystyle\geq-Cc_{\ref*{wc4}}\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}(\zeta_{{\color[rgb]{0,0,0}n,T}})\left(\left[\lambda_{\max}\left\{\mathbb{E}\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\bm{L}_{K}(S_{i,t},A_{i,t})\bm{L}_{K}(S_{i,t},A_{i,t})^{\intercal}\right\}\right]^{1/2}+\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}(1)\right)\|\Delta\|_{2}
    CcLABEL:wc4CLABEL:weightsC4𝒪p(ζn,T)Δ2,\displaystyle\geq-Cc_{\ref*{wc4}}\sqrt{C_{\ref*{weightsC4}}}\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}(\zeta_{{\color[rgb]{0,0,0}n,T}})\|\Delta\|_{2}, (S60)

    where $C$ is some positive constant. The first inequality above follows from the Cauchy–Schwarz inequality. The second applies Theorem 3 under Assumption 5(e) to bound $(nT)^{-1}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\{\hat{\bm{\xi}}(S_{i,t},A_{i,t})^{\intercal}\bm{\lambda}_{*}\}^{2}$. The third inequality uses Lemma S3, and the last follows from Assumption 5(c).

  • For (S56), by the same arguments used to show (S59) and (S60), we have

    (S56)\displaystyle\eqref{eqn:t4} 1nTi=1nt=0T1ρ′′{v~(Si,t,Ai,t)}2{𝝃^(Si,t,Ai,t)𝝀}21nTi=1nt=0T1{𝝃^(Si,t,Ai,t)Δ}2\displaystyle\geq-\sqrt{\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\rho^{{}^{\prime\prime}}\{\tilde{v}(S_{i,t},A_{i,t})\}^{2}\left\{\hat{\bm{\xi}}(S_{i,t},A_{i,t})^{\intercal}\bm{\lambda}_{*}\right\}^{2}}\sqrt{\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\left\{\hat{\bm{\xi}}(S_{i,t},A_{i,t})^{\intercal}\Delta\right\}^{2}}
    cLABEL:wc4𝒪p(ζn,T)K𝒪p(ζn,T)Δ2\displaystyle\geq-c_{\ref*{wc4}}\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}(\zeta_{{\color[rgb]{0,0,0}n,T}})\sqrt{K}\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}(\zeta_{{\color[rgb]{0,0,0}n,T}})\|\Delta\|_{2}
    cLABEL:wc4K𝒪p(ζn,T2)Δ2\displaystyle\geq-c_{\ref*{wc4}}\sqrt{K}\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}(\zeta^{2}_{{\color[rgb]{0,0,0}n,T}})\|\Delta\|_{2} (S61)

Combining the bounds (S58), (S59), (S60) and (S61) for the terms (S53), (S54), (S55) and (S56), respectively, we obtain

\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\rho^{\prime}\{\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})^{\intercal}\bm{\lambda}_{*}\}\{\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})^{\intercal}\Delta\}-\Delta^{\intercal}\bm{l}_{K}
\displaystyle\geq 𝒪p(Klog(nT)nT+Kr1+Kζn,T+Kζn,T2)Δ2.\displaystyle-\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}\left(\frac{\sqrt{K}\log({\color[rgb]{0,0,0}nT})}{\sqrt{{\color[rgb]{0,0,0}nT}}}+K^{-r_{1}}+\sqrt{K}\zeta_{{\color[rgb]{0,0,0}n,T}}+\sqrt{K}\zeta_{{\color[rgb]{0,0,0}n,T}}^{2}\right)\|\Delta\|_{2}. (S62)

In the following, we discuss the remaining term in (S52). Due to Assumption 5(b), ρ′′{v^(Si,t,Ai,t)}CLABEL:weightssecond\rho^{{}^{\prime\prime}}\{\hat{v}(S_{i,t},A_{i,t})\}\geq C_{\ref*{weightssecond}}.

Next, we show how to bound

λmin[1nTi=1nt=0T1𝑳^K(Si,t,Ai,t)𝑳^K(Si,t,Ai,t)].\lambda_{\min}\left[\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})^{\intercal}\right].

First, note that for any vector $\bm{a}\in\mathbb{R}^{K}$ with $\|\bm{a}\|_{2}=1$, the elementary inequality $(u+v)^{2}\geq u^{2}/2-v^{2}$ yields

\bm{a}^{\intercal}\left[\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})^{\intercal}\right]\bm{a}=\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\left\{\bm{a}^{\intercal}\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})\right\}^{2}
\geq\frac{1}{2nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\left\{\bm{a}^{\intercal}{\bm{L}}_{K}(S_{i,t},A_{i,t})\right\}^{2}-\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\left\{\bm{a}^{\intercal}\hat{\bm{\xi}}(S_{i,t},A_{i,t})\right\}^{2}
\geq\frac{1}{2}\bm{a}^{\intercal}\left[\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}{\bm{L}}_{K}(S_{i,t},A_{i,t}){\bm{L}}_{K}(S_{i,t},A_{i,t})^{\intercal}\right]\bm{a}-\mathcal{O}_{\mathrm{p}}(K\zeta^{2}_{n,T}),\qquad(S63)

where the bound on $(nT)^{-1}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\{\bm{a}^{\intercal}\hat{\bm{\xi}}(S_{i,t},A_{i,t})\}^{2}$ can be derived by the same arguments as in proving (S54).

Then we have

\lambda_{\min}\left[\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})^{\intercal}\right]
\geq\frac{1}{2}\lambda_{\min}\left[\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}{\bm{L}}_{K}(S_{i,t},A_{i,t}){\bm{L}}_{K}(S_{i,t},A_{i,t})^{\intercal}\right]-\mathcal{O}_{\mathrm{p}}(K\zeta^{2}_{n,T})
\geq\frac{1}{2}\lambda_{\min}\left[\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\,{\bm{L}}_{K}(S_{t},A_{t}){\bm{L}}_{K}(S_{t},A_{t})^{\intercal}\right]-\mathcal{O}_{\mathrm{p}}\left(\sqrt{\frac{K}{nT}}\log(nT)\right)-\mathcal{O}_{\mathrm{p}}(K\zeta^{2}_{n,T})
\geq\frac{1}{2}\psi(K)-o_{\mathrm{p}}(\psi(K)).

The first inequality is due to (S63), and the second is by Lemma S3. For the third, the condition $\{\psi(K)\}^{-1}\sqrt{K}\xi_{n,T}=o(1)$ implies $\sqrt{{K}/({nT})}\log(nT)=o(\psi(K))$ and $K\xi_{n,T}^{2}=o(\psi(K)^{2})=o(\psi(K))$.

Now, returning to (S52), we can show that

G(\bm{\lambda}_{*}+\Delta)-G(\bm{\lambda}_{*})
\geq-\mathcal{O}_{\mathrm{p}}\left(\frac{\sqrt{K}\log(nT)}{\sqrt{nT}}+K^{-r_{1}}+\sqrt{K}\zeta_{n,T}\right)\|\Delta\|_{2}+\frac{1}{2nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\rho^{\prime\prime}\{\hat{v}(S_{i,t},A_{i,t})\}\{\hat{\bm{L}}_{K}(S_{i,t},A_{i,t})^{\intercal}\Delta\}^{2}-\|\Delta\|_{2}\|\bm{\delta}_{K}\|_{2}
\geq-\mathcal{O}_{\mathrm{p}}\left(\frac{\sqrt{K}\log(nT)}{\sqrt{nT}}+K^{-r_{1}}+\sqrt{K}\zeta_{n,T}\right)\|\Delta\|_{2}+\left\{c\,\psi(K)-o_{\mathrm{p}}(\psi(K))\right\}\|\Delta\|_{2}^{2},

where $c>0$ is a constant, and the last inequality uses the lower bound on $\lambda_{\min}$ above together with the condition on $\bm{\delta}_{K}$ specified in Assumption 5(g). As long as $\|\Delta\|_{2}\geq C_{12}\{\psi(K)\}^{-1}({\sqrt{K}\log(nT)}/{\sqrt{nT}}+K^{-r_{1}}+\sqrt{K}\zeta_{n,T})$ for some large enough constant $C_{12}$, with high probability,

G(𝝀+Δ)G(𝝀)>0.G(\bm{\lambda}_{*}+\Delta)-G(\bm{\lambda}_{*})>0.

Therefore we have proved (S50).

Finally, we are ready to show the convergence of $\hat{\omega}^{\pi}$ as follows.

[𝔼{ω^π(,)ωπ(,)}2]1/2\displaystyle[\mathbb{E}\left\{\hat{\omega}^{\pi}(\cdot,\cdot)-\omega^{\pi}(\cdot,\cdot)\right\}^{2}]^{1/2}
\displaystyle\leq [𝔼{ρ(𝑳^K(,)𝝀+)ρ(𝑳K(,)𝝀)}2]1/2+sups,a|ωπ(s,a)ρ(𝑳K(s,a)𝝀)|\displaystyle\left[\mathbb{E}\left\{\rho^{\prime}(\hat{\bm{L}}_{K}(\cdot,\cdot)^{\intercal}\bm{\lambda}^{+})-\rho^{\prime}({\bm{L}}_{K}(\cdot,\cdot)^{\intercal}\bm{\lambda}_{*})\right\}^{2}\right]^{1/2}+\sup_{s,a}\left|\omega^{\pi}(s,a)-\rho^{\prime}({\bm{L}}_{K}(s,a)^{\intercal}\bm{\lambda}_{*})\right|
\displaystyle\leq [𝔼{ρ(𝑳K(,)𝝀+)ρ(𝑳K(,)𝝀)}2]1/2+[𝔼{ρ′′{v~(,)}{𝑳^K(,)𝑳K(,)}𝝀+}2]1/2+𝒪(Kr1)\displaystyle\left[\mathbb{E}\left\{\rho^{\prime}({\bm{L}}_{K}(\cdot,\cdot)^{\intercal}\bm{\lambda}^{+})-\rho^{\prime}({\bm{L}}_{K}(\cdot,\cdot)^{\intercal}\bm{\lambda}_{*})\right\}^{2}\right]^{1/2}+\left[\mathbb{E}\left\{\rho^{{}^{\prime\prime}}\{\tilde{v}(\cdot,\cdot)\}\left\{\hat{\bm{L}}_{K}(\cdot,\cdot)-{\bm{L}}_{K}(\cdot,\cdot)\right\}^{\intercal}\bm{\lambda}^{+}\right\}^{2}\right]^{1/2}+\mathop{}\mathopen{}\mathcal{O}\mathopen{}(K^{-r_{1}})
\displaystyle\leq cLABEL:wc4[𝔼{𝑳K(,)(𝝀+𝝀)}2]1/2+cLABEL:wc4𝒪p(ζn,T)+𝒪(Kr1)\displaystyle c_{\ref*{wc4}}\left[\mathbb{E}\left\{{\bm{L}}_{K}(\cdot,\cdot)^{\intercal}(\bm{\lambda}^{+}-\bm{\lambda}_{*})\right\}^{2}\right]^{1/2}+c_{\ref*{wc4}}\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}(\zeta_{{\color[rgb]{0,0,0}n,T}})+\mathop{}\mathopen{}\mathcal{O}\mathopen{}(K^{-r_{1}})
\displaystyle\leq cLABEL:wc4(𝝀+𝝀)2[λmax{𝔼[1Tt=0T1𝑳K(St,At)𝑳K(St,At)]}]1/2+𝒪p(ζn,T)+𝒪(Kr1)\displaystyle c_{\ref*{wc4}}\|(\bm{\lambda}^{+}-\bm{\lambda}_{*})\|_{2}\left[\lambda_{\max}\left\{\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1}\bm{L}_{K}(S_{t},A_{t})\bm{L}_{K}(S_{t},A_{t})^{\intercal}\right]\right\}\right]^{1/2}+\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}(\zeta_{{\color[rgb]{0,0,0}n,T}})+\mathop{}\mathopen{}\mathcal{O}\mathopen{}(K^{-r_{1}})
\displaystyle\leq cLABEL:wc4CLABEL:weightsC4(𝝀+𝝀)2+𝒪p(ζn,T)+𝒪(Kr1)\displaystyle c_{\ref*{wc4}}\sqrt{C_{\ref*{weightsC4}}}\|(\bm{\lambda}^{+}-\bm{\lambda}_{*})\|_{2}+\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}(\zeta_{{\color[rgb]{0,0,0}n,T}})+\mathop{}\mathopen{}\mathcal{O}\mathopen{}(K^{-r_{1}})
\displaystyle\leq 𝒪p[{ψ(K)}1(Klog(nT)nT+Kr1+Kζn,T)].\displaystyle\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}\left[\{\psi(K)\}^{-1}\left(\frac{\sqrt{K}\log({\color[rgb]{0,0,0}nT})}{\sqrt{{\color[rgb]{0,0,0}nT}}}+K^{-r_{1}}+\sqrt{K}\zeta_{{\color[rgb]{0,0,0}n,T}}\right)\right].

The second inequality is based on Assumption 5(a) and the mean value theorem. For the third inequality, we apply the mean value theorem and Assumption 5(b) again, using arguments similar to those for proving (S55) to establish the boundedness of $\rho^{\prime\prime}$; Assumption 5(e) is used to obtain the desired $\mathcal{O}_{\mathrm{p}}(\zeta_{n,T})$ order. The fifth inequality results from Assumption 5(c), and the last inequality is due to (S50).

Noting that

λmax{1nTi=1nt=0T1𝑳K(Si,t,Ai,t)𝑳K(Si,t,Ai,t)}=λmax{𝔼[1Tt=0T1𝑳K(St,At)𝑳K(St,At)]}+𝒪p(1)\lambda_{\max}\left\{\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\bm{L}_{K}(S_{i,t},A_{i,t})\bm{L}_{K}(S_{i,t},A_{i,t})^{\intercal}\right\}=\lambda_{\max}\left\{\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1}\bm{L}_{K}(S_{t},A_{t})\bm{L}_{K}(S_{t},A_{t})^{\intercal}\right]\right\}+\scalebox{0.7}{$\mathcal{O}$}_{\mathrm{p}}(1)

by Lemma S3, we can then use a similar argument to bound $\|\hat{\omega}^{\pi}(\cdot,\cdot)-\omega^{\pi}(\cdot,\cdot)\|_{n,T}$, which completes the proof. ∎
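As a purely illustrative aside (not the paper's implementation), the minimization defining $\bm{\lambda}^{+}$ can be sketched numerically. The data below are synthetic, $\rho(x)=e^{x}$ is one illustrative strictly convex choice with $\rho^{\prime},\rho^{\prime\prime}>0$, and plain subgradient descent stands in for whatever solver is used in practice.

```python
import numpy as np

rng = np.random.default_rng(1)
nT, K = 200, 5
L = rng.normal(size=(nT, K)) / 2        # toy rows standing in for \hat{L}_K(S_{i,t}, A_{i,t})
l_K = L.mean(axis=0) + 0.05             # toy target vector l_K (synthetic)
delta_K = np.full(K, 1e-3)              # tuning vector delta_K (synthetic)

def G_obj(lam):                         # G(lam) = mean rho(L lam) + |lam|' delta_K - lam' l_K
    return np.mean(np.exp(L @ lam)) + np.abs(lam) @ delta_K - lam @ l_K

lam = np.zeros(K)
for _ in range(3000):                   # subgradient descent on the convex objective
    u = np.exp(L @ lam)                 # rho'(L lam) evaluated at the sample points
    grad = L.T @ u / nT + np.sign(lam) * delta_K - l_K
    lam -= 0.2 * grad

weights = np.exp(L @ lam)               # estimated ratios rho'(\hat{L}_K' lam)
assert G_obj(lam) < G_obj(np.zeros(K))  # descent improved on the initial point
```

The fitted `weights` play the role of $\rho^{\prime}(\hat{\bm{L}}_{K}^{\intercal}\bm{\lambda}^{+})$; convexity of $\rho$ is what makes the sphere argument in the proof above applicable.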

S2.5 Technical Proof in Section 4.3

In this section, we provide the proof of Theorem 6, which shows the efficiency of our weighted estimator. In the proof, we first decompose the estimation error of $\hat{\mathcal{V}}(\pi)$ relative to the true value $\mathcal{V}(\pi)$. Based on this decomposition, we show that the convergence results for the projection step and the final weights, together with the approximation error to $Q^{\pi}$, yield the desired $\sqrt{nT}$ convergence rate of $\hat{\mathcal{V}}(\pi)$. Finally, the efficiency of our estimator can be established.

Proof of Theorem 6.

We mainly prove Result (ii). Result (i) can be derived through similar arguments but under a different condition for $K$; we omit it for brevity. To start with, we derive the following decomposition:

\hat{\cal V}(\pi)-{\cal V}(\pi)
= \frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\hat{\omega}^{\pi}_{i,t}\left[Q^{\pi}(S_{i,t},A_{i,t})-\gamma g_{*}^{\pi}(S_{i,t},A_{i,t};Q^{\pi})\right]-(1-\gamma)\mathbb{E}_{S_{0}\sim\mathbb{G}}\left[\sum_{a^{\prime}\in{\mathcal{A}}}\pi(a^{\prime}\mid S_{0})Q^{\pi}(S_{0},a^{\prime})\right]
\qquad+\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\hat{\omega}^{\pi}_{i,t}\epsilon_{i,t}
= \frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\hat{\omega}^{\pi}_{i,t}\left[\bm{B}_{K}(S_{i,t},A_{i,t})^{\intercal}-\gamma g_{*}^{\pi}(S_{i,t},A_{i,t};\bm{B}_{K})^{\intercal}\right]\bm{\beta}-(1-\gamma)\mathbb{E}_{S_{0}\sim\mathbb{G}}\left[\sum_{a^{\prime}\in{\mathcal{A}}}\pi(a^{\prime}\mid S_{0})\bm{B}_{K}(S_{0},a^{\prime})^{\intercal}\bm{\beta}\right]
\qquad+\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\hat{\omega}^{\pi}_{i,t}\left[\Delta_{Q}(S_{i,t},A_{i,t})-\gamma g^{\pi}_{*}(S_{i,t},A_{i,t};\Delta_{Q})\right]
\qquad-(1-\gamma)\mathbb{E}_{S_{0}\sim\mathbb{G}}\left[\sum_{a^{\prime}\in{\mathcal{A}}}\pi(a^{\prime}\mid S_{0})\Delta_{Q}(S_{0},a^{\prime})\right]+\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\hat{\omega}^{\pi}_{i,t}\epsilon_{i,t}
= \frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\hat{\omega}^{\pi}_{i,t}\left[\bm{B}_{K}(S_{i,t},A_{i,t})^{\intercal}-\gamma\hat{g}^{\pi}(S_{i,t},A_{i,t};\bm{B}_{K})^{\intercal}\right]\bm{\beta}-(1-\gamma)\mathbb{E}_{S_{0}\sim\mathbb{G}}\left[\sum_{a^{\prime}\in{\mathcal{A}}}\pi(a^{\prime}\mid S_{0})\bm{B}_{K}(S_{0},a^{\prime})^{\intercal}\bm{\beta}\right]\cdots\cdots(\mathrm{I})
\qquad+\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\omega^{\pi}_{i,t}\gamma\left[\hat{g}^{\pi}(S_{i,t},A_{i,t};\bm{B}_{K})-g_{*}^{\pi}(S_{i,t},A_{i,t};\bm{B}_{K})\right]^{\intercal}\bm{\beta}\cdots\cdots(\mathrm{II})
\qquad+\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}(\hat{\omega}^{\pi}_{i,t}-\omega^{\pi}_{i,t})\gamma\left[\hat{g}^{\pi}(S_{i,t},A_{i,t};\bm{B}_{K})-g_{*}^{\pi}(S_{i,t},A_{i,t};\bm{B}_{K})\right]^{\intercal}\bm{\beta}\cdots\cdots(\mathrm{III})
\qquad+\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\omega^{\pi}_{i,t}\left[\Delta_{Q}(S_{i,t},A_{i,t})-\gamma g^{\pi}_{*}(S_{i,t},A_{i,t};\Delta_{Q})\right]-(1-\gamma)\mathbb{E}_{S_{0}\sim\mathbb{G}}\left[\sum_{a^{\prime}\in{\mathcal{A}}}\pi(a^{\prime}\mid S_{0})\Delta_{Q}(S_{0},a^{\prime})\right]\cdots\cdots(\mathrm{IV})
\qquad+\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}(\hat{\omega}^{\pi}_{i,t}-\omega^{\pi}_{i,t})\left[\Delta_{Q}(S_{i,t},A_{i,t})-\gamma g^{\pi}_{*}(S_{i,t},A_{i,t};\Delta_{Q})\right]\cdots\cdots(\mathrm{V})
\qquad+\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\hat{\omega}^{\pi}_{i,t}\epsilon_{i,t}.\cdots\cdots(\mathrm{VI})

In the following, we analyze the components (I)-(VI) separately.

  • For (I), from the optimization constraint (25b),

    (\mathrm{I})\leq\|\bm{\beta}\circ\bm{\delta}_{K}\|_{1}.

    By the condition in Assumption 6(a), we can show that $(\mathrm{I})=o((nT)^{-1/2})$.

  • For (III), we control it by utilizing the convergence of the weights and of the function $\hat{g}^{\pi}$. Note that

    (\mathrm{III}) \leq \gamma\|\hat{\omega}^{\pi}-\omega^{\pi}\|_{n,T}\,\|\{g_{*}^{\pi}(\cdot,\cdot;\bm{B}_{K})-\hat{g}^{\pi}(\cdot,\cdot;\bm{B}_{K})\}^{\intercal}\bm{\beta}\|_{n,T}
    = \mathcal{O}_{\mathrm{p}}\left(\frac{\sqrt{K}\log(nT)}{\sqrt{nT}}+K^{-r_{1}}+\sqrt{K}\zeta_{n,T}\right)\mathcal{O}_{\mathrm{p}}\left(\zeta_{n,T}\right)=o_{\mathrm{p}}((nT)^{-1/2}),

    due to the conditions in Corollary 1 and the conditions on $K$ in Result (ii) of Theorem 6.

  • For (IV), its mean is zero. In addition, $\sup_{s,a}|\Delta_{Q}(s,a)|=\mathcal{O}(K^{-r_{2}})$ by Assumption 6(a), and $\omega^{\pi}(s,a)$ is bounded above by Assumption 5(a). Applying arguments similar to those in the proof of Theorem 4, we can show that

    (\mathrm{IV})=\mathcal{O}_{\mathrm{p}}((nT)^{-1/2}K^{-r_{2}})=o_{\mathrm{p}}((nT)^{-1/2}).
  • For (V), we control it by the convergence of the weights and the magnitude of $\Delta_{Q}$:

    (\mathrm{V}) \leq \|\hat{\omega}^{\pi}-\omega^{\pi}\|_{n,T}\,\|\Delta_{Q}-\gamma g^{\pi}_{*}(\cdot,\cdot;\Delta_{Q})\|_{n,T}
    \leq \mathcal{O}_{\mathrm{p}}\left(\frac{\sqrt{K}\log(nT)}{\sqrt{nT}}+K^{-r_{1}}+\sqrt{K}\zeta_{n,T}\right)\mathcal{O}\left(K^{-r_{2}}\right)=o_{\mathrm{p}}((nT)^{-1/2}),

    due to the conditions on $K$ in Theorem 6(ii).

  • For (VI), using the convergence of $\hat{\omega}^{\pi}$ to $\omega^{\pi}$ and the conditional independence of the $\epsilon_{i,t}$ over $1\leq i\leq n$ and $0\leq t\leq T-1$, we can show that

    \mathbb{E}\left\{\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}(\hat{\omega}^{\pi}_{i,t}-\omega^{\pi}_{i,t})\epsilon_{i,t}\right\}^{2}
    = \mathbb{E}\left[\mathbb{E}\left[\left\{\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}(\hat{\omega}^{\pi}_{i,t}-\omega^{\pi}_{i,t})\epsilon_{i,t}\right\}^{2}\,\Big|\,\{S_{i,t},A_{i,t}:i=1,\dots,n,\;t=0,\dots,T-1\}\right]\right]
    \leq \mathbb{E}\left[\frac{\sup_{i,t}\mathbb{E}[\epsilon_{i,t}^{2}]}{(nT)^{2}}\sum_{i=1}^{n}\sum_{t=0}^{T-1}(\hat{\omega}^{\pi}_{i,t}-\omega^{\pi}_{i,t})^{2}\right]\lesssim\frac{1}{nT}\mathbb{E}\left[\|\hat{\omega}^{\pi}-\omega^{\pi}\|_{n,T}^{2}\right]=o\left(\frac{1}{nT}\right).

    The first inequality is derived from Assumption 6(b), and the second comes from Theorem 4. Then we can show that

    \frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}(\hat{\omega}^{\pi}_{i,t}-\omega^{\pi}_{i,t})\epsilon_{i,t}=o_{\mathrm{p}}((nT)^{-1/2}).

    Therefore,

    (\mathrm{VI})=\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\omega^{\pi}_{i,t}\epsilon_{i,t}+\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}(\hat{\omega}^{\pi}_{i,t}-\omega^{\pi}_{i,t})\epsilon_{i,t}=\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\omega^{\pi}_{i,t}\epsilon_{i,t}+o_{\mathrm{p}}((nT)^{-1/2}).
  • For (II), recall that the true weight function satisfies $\omega^{\pi}(s,a)\in\mathcal{G}$ by the assumption in Theorem 6(ii); this is the key to bounding (II). Take the function $B=\sum_{k=1}^{K}\beta_{k}B_{k}$ in optimization problem (24) and define $\hat{g}^{\pi}_{\beta}(\cdot,\cdot):=\sum_{k=1}^{K}\beta_{k}\hat{g}^{\pi}(\cdot,\cdot;B_{k})$, which, by the linear structure of the optimization problem, is the corresponding estimate for $B$. Let $Y_{i,t}=\sum_{a^{\prime}}\pi(a^{\prime}\mid S_{i,t+1})B(S_{i,t+1},a^{\prime})$.

    Due to the first-order optimality condition for $\hat{g}^{\pi}_{\beta}$, we have

    \frac{d}{du}\left[\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\left\{Y_{i,t}-\hat{g}^{\pi}_{\beta}(S_{i,t},A_{i,t})-u\omega^{\pi}(S_{i,t},A_{i,t})\right\}^{2}+\mu J_{\mathcal{G}}(\hat{g}^{\pi}_{\beta}+u\omega^{\pi})\right]\Bigg|_{u=0}=0.

    Recall that $J_{\mathcal{G}}(g)=\|g\|_{\mathcal{H}_{2}}^{2}$. Here, with a slight abuse of notation, we denote by $\langle\cdot,\cdot\rangle_{\mathcal{H}_{2}}$ the inner product of the RKHS specified in Assumption 4'(b). Then we have

    0 = \frac{-2}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\omega^{\pi}(S_{i,t},A_{i,t})\{Y_{i,t}-g^{\pi}_{*}(S_{i,t},A_{i,t};B)\}
    \quad+\frac{2}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\omega^{\pi}(S_{i,t},A_{i,t})\{\hat{g}^{\pi}_{\beta}(S_{i,t},A_{i,t})-g^{\pi}_{*}(S_{i,t},A_{i,t};B)\}+2\mu\langle\hat{g}^{\pi}_{\beta},\omega^{\pi}\rangle_{\mathcal{H}_{2}}
    =(\mathrm{i})+(\mathrm{ii})+(\mathrm{iii}).

    Now, we can see that the term $(\mathrm{ii})$, multiplied by $\gamma/2$, is exactly the term (II) that we aim to bound. Therefore, it suffices to bound $(\mathrm{i})$ and $(\mathrm{iii})$. Since we require $\mu=o((nT)^{-1/2})$, we have $|(\mathrm{iii})|=o_{\mathrm{p}}((nT)^{-1/2})$. As for term $(\mathrm{i})$, recall the error $e_{i,t}:=\sum_{a^{\prime}}\pi(a^{\prime}\mid S_{i,t+1})Q^{\pi}(S_{i,t+1},a^{\prime})-g^{\pi}_{*}(S_{i,t},A_{i,t};Q^{\pi})$. Then we have

    (\mathrm{i}) = \frac{-2}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\omega^{\pi}(S_{i,t},A_{i,t})e_{i,t}
    \quad+\frac{-2}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\omega^{\pi}(S_{i,t},A_{i,t})\left\{\sum_{a^{\prime}}\pi(a^{\prime}\mid S_{i,t+1})\Delta_{Q}(S_{i,t+1},a^{\prime})-g_{*}^{\pi}(S_{i,t},A_{i,t};\Delta_{Q})\right\}.

    Next, we use Freedman's inequality [17] to bound the second term in $(\mathrm{i})$. For any integer $1\leq g\leq nT$, let $i(g)$ and $t(g)$ be the quotient and the remainder of $g+T-1$ divided by $T$, so that $g=\{i(g)-1\}T+t(g)+1$ and $0\leq t(g)<T$. Let $\mathcal{F}^{(0)}=\{S_{1,0},A_{1,0}\}$. Then we iteratively define $\{\mathcal{F}^{(g)}\}_{1\leq g\leq nT}$ as follows:

    \mathcal{F}^{(g)}=\mathcal{F}^{(g-1)}\cup\{S_{i(g),t(g)+1},A_{i(g),t(g)+1}\},\mbox{ if }t(g)<T-1,
    \mathcal{F}^{(g)}=\mathcal{F}^{(g-1)}\cup\{S_{i(g),T},S_{i(g)+1,0},A_{i(g)+1,0}\},\mbox{ otherwise}.
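The index bookkeeping $g\mapsto(i(g),t(g))$ above is easy to get wrong off by one, so it can be sanity-checked in code; a minimal sketch:

```python
def flatten_index(g, T):
    """Map the flat index g in {1, ..., nT} to (i(g), t(g)) with
    g = (i(g) - 1) * T + t(g) + 1 and 0 <= t(g) < T, as in the proof."""
    i_g = (g + T - 1) // T  # quotient of g + T - 1 divided by T
    t_g = (g + T - 1) % T   # remainder
    return i_g, t_g

# check the identity on a small grid (n = 3 trajectories, T = 4 steps)
n, T = 3, 4
for g in range(1, n * T + 1):
    i_g, t_g = flatten_index(g, T)
    assert g == (i_g - 1) * T + t_g + 1 and 0 <= t_g < T
```

This confirms the map visits the pairs $(i,t)$ trajectory by trajectory, in time order, which is what the filtration construction relies on.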

    Take $e^{(g)}:=\omega^{\pi}(S_{i(g),t(g)},A_{i(g),t(g)})e^{\Delta}_{i(g),t(g)}$, where the errors

    e_{i,t}^{\Delta}:=\sum_{a^{\prime}}\pi(a^{\prime}\mid S_{i,t+1})\Delta_{Q}(S_{i,t+1},a^{\prime})-g_{*}^{\pi}(S_{i,t},A_{i,t};\Delta_{Q}),\quad i=1,\dots,n,\;t=0,\dots,T-1,

    are bounded by $\|\Delta_{Q}\|_{\infty}\leq C_{\ref*{Qconst1}}K^{-r_{2}}$. From the definition of $e_{i,t}^{\Delta}$, we can show that

    \mathbb{E}\{e^{(g)}\mid\mathcal{F}^{(g-1)}\}=\mathbb{E}\{e^{(g)}\mid S_{i(g),t(g)},A_{i(g),t(g)}\}=0.

    Then $\{\sum_{g=1}^{G}e^{(g)}\}_{G=1,\dots,nT}$ forms a martingale with respect to the filtration $\{\sigma(\mathcal{F}^{(g)})\}_{g\geq 0}$, where $\sigma(\mathcal{F}^{(g)})$ stands for the $\sigma$-algebra generated by $\mathcal{F}^{(g)}$. We can also verify that $\mathbb{E}[(e^{(g)})^{2}]\leq C_{\ref*{weightsC1}}^{2}C_{\ref*{Qconst1}}^{2}K^{-2r_{2}}$ by Assumptions 5(a) and 6(a). Now we are able to apply Theorem 1.6 in [17]: for all $x>0$, we have

    \Pr\left(\sum_{g=1}^{nT}e^{(g)}>x\right)\leq\exp\left(-\frac{x^{2}/2}{nTC_{\ref*{weightsC1}}^{2}C_{\ref*{Qconst1}}^{2}K^{-2r_{2}}+C_{\ref*{weightsC1}}C_{\ref*{Qconst1}}K^{-r_{2}}x/3}\right).
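To get a feel for this tail probability, the bound can be evaluated numerically; a minimal sketch in which `var_bound` and `range_bound` stand in for the (unknown) constants $C^{2}C^{2}K^{-2r_{2}}$ and $CCK^{-r_{2}}$, so the concrete values below are illustrative assumptions only:

```python
import math

def freedman_tail(x, nT, var_bound, range_bound):
    # Freedman-type tail bound exp(-(x^2/2) / (nT*var_bound + range_bound*x/3)),
    # with var_bound an upper bound on E[(e^(g))^2] and range_bound a
    # sup-norm bound on each martingale increment.
    return math.exp(-(x ** 2 / 2) / (nT * var_bound + range_bound * x / 3))

# the bound is strictly decreasing in the deviation level x
p1 = freedman_tail(10.0, nT=1000, var_bound=0.01, range_bound=0.1)
p2 = freedman_tail(50.0, nT=1000, var_bound=0.01, range_bound=0.1)
```

Taking $x$ of order $\sqrt{nT\cdot\texttt{var\_bound}}\log(nT)$ drives the bound to zero, which is exactly how the $\mathcal{O}_{\mathrm{p}}((nT)^{-1/2}K^{-r_{2}}\log(nT))$ rate below is extracted.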

    As such, we can derive

    \frac{-2}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\omega^{\pi}(S_{i,t},A_{i,t})\left\{\sum_{a^{\prime}}\pi(a^{\prime}\mid S_{i,t+1})\Delta_{Q}(S_{i,t+1},a^{\prime})-g_{*}^{\pi}(S_{i,t},A_{i,t};\Delta_{Q})\right\}
    =-\frac{2}{nT}\sum_{g=1}^{nT}e^{(g)}=\mathcal{O}_{\mathrm{p}}((nT)^{-1/2}K^{-r_{2}}\log(nT))=o_{\mathrm{p}}((nT)^{-1/2}).

    Then we can show that

    (\mathrm{II})=\frac{\gamma}{2}(\mathrm{ii})=\gamma\left\{\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\omega^{\pi}_{i,t}e_{i,t}+o_{\mathrm{p}}((nT)^{-1/2})\right\}. (S64)

Combining all the bounds from (I) to (VI), we have

\hat{\cal V}(\pi)-{\cal V}(\pi) =(\mathrm{I})+(\mathrm{II})+(\mathrm{III})+(\mathrm{IV})+(\mathrm{V})+(\mathrm{VI})
=\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\omega^{\pi}_{i,t}\left\{\epsilon_{i,t}+\gamma e_{i,t}\right\}+o_{\mathrm{p}}((nT)^{-1/2}).

Taking

\sigma^{2}=\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\left\{\omega^{\pi}(S_{t},A_{t})\left(R_{t}+\gamma\sum_{a^{\prime}\in{\cal A}}\pi(a^{\prime}\mid S_{t+1})Q^{\pi}(S_{t+1},a^{\prime})-Q^{\pi}(S_{t},A_{t})\right)\right\}^{2},

we see that

\sum_{i=1}^{n}\sum_{t=0}^{T-1}\frac{\omega^{\pi}_{i,t}\left\{\epsilon_{i,t}+\gamma e_{i,t}\right\}}{\sqrt{nT}\sigma}

forms a mean-zero martingale with respect to the filtration $\{\sigma(\mathcal{F}^{(g)})\}_{g\geq 0}$.

Then we can verify that

\max_{i,t}\frac{|\omega^{\pi}_{i,t}(\epsilon_{i,t}+\gamma e_{i,t})|}{\sigma\sqrt{nT}}\leq C_{\ref*{weightsC1}}\left\{\frac{R_{\max}}{\sigma\sqrt{nT}}+\gamma\frac{Q_{\max}+G_{\max}+C_{\ref*{Qconst1}}K^{-r_{2}}}{\sigma\sqrt{nT}}\right\}\xrightarrow{P}0

due to Assumptions 2, 5(a) and 6(a), and

\sum_{i=1}^{n}\sum_{t=0}^{T-1}\frac{|\omega^{\pi}_{i,t}(\epsilon_{i,t}+\gamma e_{i,t})|^{2}}{\sigma^{2}nT}\xrightarrow{P}1

due to the definition of $\sigma$.

Then by the martingale central limit theorem [45], we can show the asymptotic normality: as long as either $n\rightarrow\infty$ or $T\rightarrow\infty$, we have

\frac{\sqrt{nT}}{\sigma}\left\{\hat{\cal V}(\pi)-{\cal V}(\pi)\right\}\xrightarrow{d}N(0,1). (S65)
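In practice, (S65) yields Wald-type confidence intervals once $\sigma$ is replaced by a plug-in estimate built from fitted quantities. A minimal sketch, assuming hypothetical arrays of estimated weights, rewards, and fitted $Q$-values (all argument names are ours, not the paper's notation):

```python
import math
import numpy as np

def sigma2_hat(omega, reward, q_sa, q_next, gamma):
    # Plug-in estimate of sigma^2: omega[i, t] are estimated ratio weights,
    # reward[i, t] observed rewards, q_sa[i, t] = Q(S_it, A_it), and
    # q_next[i, t] = sum_{a'} pi(a' | S_{i,t+1}) Q(S_{i,t+1}, a').
    td = reward + gamma * q_next - q_sa  # TD residual, epsilon + gamma * e
    return float(np.mean((omega * td) ** 2))

def wald_ci(v_hat, sigma_hat, n, T, z=1.96):
    # Interval implied by (S65): v_hat +/- z * sigma_hat / sqrt(nT).
    half = z * sigma_hat / math.sqrt(n * T)
    return v_hat - half, v_hat + half
```

The interval shrinks at the $\sqrt{nT}$ rate, consistent with the asymptotics above: consistency is attainable with few trajectories provided $T$ diverges.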

S2.6 Technical Lemmas

In this section, we provide two technical lemmas (Lemmas S2 and S4) that involve truncation arguments for Markov chains. They are developed to obtain sharp convergence rates that depend on both the sample size $n$ and the length of the trajectories $T$. In particular, Lemma S2 uses empirical process arguments and is used for Theorem 3. Lemma S4 generalizes the standard matrix Bernstein inequality to independent Markov chains and is used for Theorem 4.

Lemma S2.

Suppose Assumption 4 holds. For

I_{2}(B) = 2(\mathbb{E}-\mathbb{P}_{N})f^{\pi}\left\{\hat{g}^{\pi}(B),g_{*}^{\pi}(B),B\right\}-\mathbb{E}f^{\pi}\left\{\hat{g}^{\pi}(B),g_{*}^{\pi}(B),B\right\}-2\mu\mathbf{J}^{2}(\hat{g}^{\pi},g^{\pi}_{*},B),

where $f^{\pi}$ and $\mathbf{J}^{2}$ are defined in (S47) and (S46), respectively. Then with probability at least $1-2\delta-1/(nT)$, for any $B\in\mathcal{Q}$, we have

I_{2}(B)\lesssim\mu+\frac{1}{nT\mu^{\frac{\alpha}{1-\tau(2+\alpha)}}}+\frac{[\log(\max\{1/\delta,nT\})]^{1/\tau}}{nT},

where $0<\tau\leq\frac{1}{3}$, and the leading constant depends on $Q_{\max},G_{\max},\alpha,\kappa$ and $C_{\ref*{C_s}}$.

Proof of Lemma S2.

To deal with the scenario where the initial distribution $\mathbb{G}$ is possibly not the stationary distribution $\mathbb{G}^{*}$, we decompose $I_{2}(B)$ into three components.

Take $T^{\prime}=\min\{\lceil K_{3}\log(nT)\rceil,T\}$, where $K_{3}$ is a constant to be specified later, $Z^{T}=[(S_{t},A_{t})]_{t=0}^{T-1}$, and $f_{T}^{\pi}(g_{1},g_{2},B)(Z^{T})=\frac{1}{T}\sum_{t=0}^{T-1}f^{\pi}(g_{1},g_{2},B)(S_{t},A_{t},S_{t+1})$. Take $\mathbb{P}_{n}f_{T}^{\pi}(g_{1},g_{2},B)(Z^{T})=n^{-1}\sum_{i=1}^{n}f_{T}^{\pi}(g_{1},g_{2},B)(Z^{T}_{i})$ as the empirical evaluation of $f_{T}^{\pi}$. We denote by $\mathbb{E}^{*}$ and ${\Pr}^{*}$ the expectation and probability under the stationary distribution, respectively.

I_{2}(B) = 2(\mathbb{E}-\mathbb{P}_{N})f^{\pi}\left\{\hat{g}^{\pi}(B),g_{*}^{\pi}(B),B\right\}-\mathbb{E}f^{\pi}\left\{\hat{g}^{\pi}(B),g_{*}^{\pi}(B),B\right\}-2\mu\mathbf{J}^{2}(\hat{g}^{\pi},g^{\pi}_{*},B)
=\frac{T^{\prime}}{T}\mathrm{II}(B)+\frac{T-T^{\prime}}{T}\{\mathrm{III}(B)+\mathrm{IV}(B)\},

where

\mathrm{II}(B)=2(\mathbb{E}-\mathbb{P}_{n})f_{T^{\prime}}^{\pi}\left\{\hat{g}^{\pi}(B),g_{*}^{\pi}(B),B\right\}-\mathbb{E}f^{\pi}_{T^{\prime}}\left\{\hat{g}^{\pi}(B),g_{*}^{\pi}(B),B\right\}-2\mu\mathbf{J}^{2}(\hat{g}^{\pi},g^{\pi}_{*},B),
\mathrm{III}(B)= \left(\frac{1}{n(T-T^{\prime})}\sum_{i=1}^{n}\sum_{t=T^{\prime}}^{T-1}-2f^{\pi}\left\{\hat{g}^{\pi}(B),g_{*}^{\pi}(B),B\right\}(S_{i,t},A_{i,t},S_{i,t+1})\right)+\mathbb{E}^{*}f^{\pi}\left\{\hat{g}^{\pi}(B),g_{*}^{\pi}(B),B\right\}
\qquad-2\mu\mathbf{J}^{2}(\hat{g}^{\pi},g^{\pi}_{*},B),

and

\mathrm{IV}(B)=\mathbb{E}\left[\frac{1}{T-T^{\prime}}\sum_{t=T^{\prime}}^{T-1}f^{\pi}\left\{\hat{g}^{\pi}(B),g_{*}^{\pi}(B),B\right\}(S_{t},A_{t},S_{t+1})\right]-\mathbb{E}^{*}f^{\pi}\left\{\hat{g}^{\pi}(B),g_{*}^{\pi}(B),B\right\}.

For some fixed t>0t>0,

\Pr\left(\exists B\in\mathcal{Q},\,I_{2}(B)>3t\right)
\leq \Pr\left(\exists B\in\mathcal{Q},\,\mathrm{II}(B)>\frac{T}{T^{\prime}}t\right)+\Pr\left(\exists B\in\mathcal{Q},\,\mathrm{III}(B)>\frac{T}{T-T^{\prime}}t\right)+\Pr\left(\exists B\in\mathcal{Q},\,\mathrm{IV}(B)>\frac{T}{T-T^{\prime}}t\right).

Then we bound these three probabilities one by one. Note that when $T^{\prime}=T$, we do not need to take the components III(B) and IV(B) into account. So in the following, we analyze III(B) and IV(B) under the condition that $T\gg\log(nT)$.

  • For II(B):

Take $\mathcal{F}^{T^{\prime}}_{l}=\{f_{T^{\prime}}^{\pi}\left\{g,g_{*}^{\pi}(B),B\right\}:J_{\mathcal{G}}^{2}(g)\leq\frac{2^{l}}{\mu}\frac{T}{T^{\prime}}t,\;J_{\mathcal{G}}^{2}(g_{*}^{\pi}(B))\leq\frac{2^{l}}{\mu}\frac{T}{T^{\prime}}t,\;J_{\mathcal{Q}}^{2}(B)\leq\frac{2^{l}}{\mu}\frac{T}{T^{\prime}}t,\;B\in\mathcal{Q}\}$. Then

\Pr\left(\exists B\in\mathcal{Q},\,\mathrm{II}(B)>\frac{T}{T^{\prime}}t\right)
\leq \sum_{l=0}^{\infty}\Pr\left(\sup_{h\in\mathcal{F}^{T^{\prime}}_{l}}\frac{(\mathbb{E}-\mathbb{P}_{n})h}{\mathbb{E}(h)+2^{l}\frac{T}{T^{\prime}}t}>\frac{1}{2}\right).

In the following, we verify conditions (A1)-(A4) of Theorem 19.3 in [20] with $\mathcal{F}=\mathcal{F}^{T^{\prime}}_{l}$, $\epsilon=1/2$ and $\eta=2^{l}\frac{T}{T^{\prime}}t$.

Condition (A1) is easy to verify: for any $h\in\mathcal{F}$,

\|f^{\pi}\{g,g^{\pi}_{*}(B),B\}\|_{\infty}\leq 6G_{\max}(3Q_{\max}+3G_{\max})\triangleq K_{1} (S66)

and therefore

\|f_{T^{\prime}}^{\pi}\{g,g^{\pi}_{*}(B),B\}\|_{\infty}\leq\frac{1}{T^{\prime}}\sum_{t=0}^{T^{\prime}-1}\|f^{\pi}\{g,g^{\pi}_{*}(B),B\}\|_{\infty}\leq K_{1}.

For (A2), recall that $f^{\pi}=f_{1}^{\pi}-f_{2}^{\pi}$, and thus

\mathbb{E}[f^{\pi}\left\{g,g_{*}^{\pi}(B),B\right\}(S,A,S^{\prime})^{2}]\leq 2\mathbb{E}[f_{1}^{\pi}\left\{g,g_{*}^{\pi}(B),B\right\}(S,A,S^{\prime})^{2}]+2\mathbb{E}[f_{2}^{\pi}\left\{g,g_{*}^{\pi}(B),B\right\}(S,A,S^{\prime})^{2}].

For the first term on the RHS above:

\mathbb{E}[f_{1}^{\pi}\left\{g,g_{*}^{\pi}(B),B\right\}(S,A,S^{\prime})^{2}]
=\mathbb{E}\left[\left\{\left\{\delta^{\pi}(B)-g(S,A)\right\}^{2}-\left\{\delta^{\pi}(B)-g_{*}^{\pi}(S,A;B)\right\}^{2}\right\}^{2}\right]
=\mathbb{E}\left[\left\{2\delta^{\pi}(B)-g(S,A)-g_{*}^{\pi}(S,A;B)\right\}^{2}\left\{g_{*}^{\pi}(S,A;B)-g(S,A)\right\}^{2}\right]
\leq\left\{2Q_{\max}+2G_{\max}\right\}^{2}\mathbb{E}\left[\left\{g_{*}^{\pi}(S,A;B)-g(S,A)\right\}^{2}\right]
=4\left(Q_{\max}+G_{\max}\right)^{2}\mathbb{E}\left[f^{\pi}\left\{g,g_{*}^{\pi}(B),B\right\}(S,A,S^{\prime})\right],

and the second term:

\mathbb{E}[f_{2}^{\pi}\left\{g,g_{*}^{\pi}(B),B\right\}(S,A,S^{\prime})^{2}]
=\mathbb{E}\left[\Big{\{}\{\delta^{\pi}(B)-g_{*}^{\pi}(S,A;B)\}\{g(S,A)-g_{*}^{\pi}(S,A;B)\}\Big{\}}^{2}\right]
=\mathbb{E}\left[\{\delta^{\pi}(B)-g_{*}^{\pi}(S,A;B)\}^{2}\{g(S,A)-g_{*}^{\pi}(S,A;B)\}^{2}\right]
\leq\left(Q_{\max}+G_{\max}\right)^{2}\mathbb{E}\left[\left\{g_{*}^{\pi}(S,A;B)-g(S,A)\right\}^{2}\right]
=\left(Q_{\max}+G_{\max}\right)^{2}\mathbb{E}[f^{\pi}\left\{g,g_{*}^{\pi}(B),B\right\}(S,A,S^{\prime})],

where we use the fact that $\mathbb{E}[f_{2}^{\pi}\left\{g,g_{*}^{\pi}(B),B\right\}(S,A,S^{\prime})]=0$. Putting these together, we can show that

\mathbb{E}[f^{\pi}\left\{g,g_{*}^{\pi}(B),B\right\}(S,A,S^{\prime})^{2}] \leq K_{2}\mathbb{E}[f^{\pi}\left\{g,g_{*}^{\pi}(B),B\right\}(S,A,S^{\prime})], (S67)

where $K_{2}=10\left(Q_{\max}+G_{\max}\right)^{2}$. Therefore,

\mathbb{E}[f_{T}^{2}] =\mathbb{E}\left(\frac{1}{T}\sum_{t=0}^{T-1}f^{\pi}\left\{g,g_{*}^{\pi}(B),B\right\}(S_{t},A_{t},S_{t+1})\right)^{2}
\leq\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}[f^{\pi}\left\{g,g_{*}^{\pi}(B),B\right\}(S_{t},A_{t},S_{t+1})^{2}]
\leq\frac{1}{T}\sum_{t=0}^{T-1}K_{2}\mathbb{E}[f^{\pi}\left\{g,g_{*}^{\pi}(B),B\right\}(S_{t},A_{t},S_{t+1})]=K_{2}\mathbb{E}[f^{\pi}_{T}].

To verify that Condition (A3) holds for every $l$, it suffices to ensure the inequality holds when $l=0$, i.e.,

\sqrt{n}(1/2)^{3/2}\sqrt{\frac{T}{T^{\prime}}t}\geq 288\max(K_{1},\sqrt{2K_{2}}), (S68)

which holds whenever $\frac{T}{T^{\prime}}t\geq 8\{288\max(K_{1},\sqrt{2K_{2}})\}^{2}n^{-1}$.

Next we verify (A4). We first consider the function class $\mathcal{F}_{l}=\{f^{\pi}\left\{g,g_{*}^{\pi}(B),B\right\}:J_{\mathcal{G}}^{2}(g)\leq\frac{2^{l}}{\mu}\frac{T}{T^{\prime}}t,\;J_{\mathcal{G}}^{2}(g_{*}^{\pi}(B))\leq\frac{2^{l}}{\mu}\frac{T}{T^{\prime}}t,\;J_{\mathcal{Q}}^{2}(B)\leq\frac{2^{l}}{\mu}\frac{T}{T^{\prime}}t,\;B\in\mathcal{Q}\}$. It is not hard to verify that, with $M=\sqrt{2^{l}\frac{T}{T^{\prime}}t/\mu}$,

\log\left(\mathcal{N}(\epsilon,\mathcal{F}^{T^{\prime}}_{l},\|\cdot\|_{n})\right)
\lesssim \log\left(\mathcal{N}(\epsilon,\mathcal{F}_{l},\|\cdot\|_{n,T})\right)
\lesssim \log\left(\mathcal{N}(\epsilon,\mathcal{Q}_{M},\|\cdot\|_{\infty})\mathcal{N}(\epsilon,\mathcal{G}_{M},\|\cdot\|_{\infty})\right).

As a result of the entropy condition in Assumption 4,

\log\left(\mathcal{N}(\epsilon,\mathcal{F}^{T^{\prime}}_{l},\|\cdot\|_{n})\right)
\lesssim \log\mathcal{N}(\epsilon,\mathcal{F}_{l},\|\cdot\|_{n,T})
\lesssim \log\mathcal{N}(\epsilon,\mathcal{Q}_{M},\|\cdot\|_{\infty})+2\log\mathcal{N}(\epsilon,\mathcal{G}_{M},\|\cdot\|_{\infty})
\lesssim \left(\frac{2^{l}}{\mu}\frac{T}{T^{\prime}}t\right)^{\alpha}\epsilon^{-2\alpha}. (S69)

Now Condition (A4) is satisfied if

x^{\frac{1+\alpha}{2}}\geq 4\cdot 96\sqrt{2}\max(K_{1},2K_{2})\sqrt{c_{2}}\left(\frac{2^{l}}{\mu}\frac{T}{T^{\prime}}t\right)^{\alpha/2}n^{-1/2}

for all $x\geq 2^{l}Tt/(8T^{\prime})$ and any $l$, where $c_{2}>0$ is a constant depending on $\alpha,G_{\max},Q_{\max}$. The above inequality holds as long as

\frac{T}{T^{\prime}}t\geq c_{3}(\mu^{\alpha}n)^{-1}

for some constant $c_{3}>0$ depending on $\alpha,G_{\max},Q_{\max}$.

Now, we are able to show that, when $\frac{T}{T^{\prime}}t>\mu$, $\frac{T}{T^{\prime}}t\geq c_{3}(\mu^{\alpha}n)^{-1}$ and $\frac{T}{T^{\prime}}t\geq 8\{288\max(K_{1},\sqrt{2K_{2}})\}^{2}n^{-1}$,

\Pr\left(\exists B\in\mathcal{Q},\,\mathrm{II}(B)>\frac{T}{T^{\prime}}t\right)
\leq \sum_{l=0}^{\infty}\Pr\left(\sup_{h\in\mathcal{F}^{T^{\prime}}_{l}}\frac{(\mathbb{E}-\mathbb{P}_{n})h}{\mathbb{E}h+2^{l}\frac{T}{T^{\prime}}t}>\frac{1}{2}\right)
\leq \sum_{l=0}^{\infty}\exp\left\{-\frac{n2^{l}\frac{T}{T^{\prime}}t(1/2)^{3}}{128\cdot 2304\max(K_{1}^{2},K_{2})}\right\}
\leq c_{4}\exp\left(-c_{5}n\frac{T}{T^{\prime}}t\right),

where $c_{4},c_{5}>0$ are constants depending on $K_{1}$ and $K_{2}$. Then, fixing some $\delta>0$, we can verify that there exists a constant $c_{6}>0$ such that, by taking $t\geq c_{6}\log(1/\delta)T^{\prime}/(nT)$, we have $\Pr\left(\exists B\in\mathcal{Q},\,\mathrm{II}(B)>\frac{T}{T^{\prime}}t\right)\leq c_{4}\exp(-c_{5}nTt/T^{\prime})\leq\delta$.

Combining all the conditions on $t$, there exists a constant $c_{7}$ such that

\Pr\left(\exists B\in\mathcal{Q},\,\frac{T^{\prime}}{T}\mathrm{II}(B)>c_{7}\left\{\frac{T^{\prime}}{T}\mu+\frac{T^{\prime}}{T}(\mu^{\alpha}n)^{-1}+(\log(1/\delta)+1)\frac{T^{\prime}}{nT}\right\}\right)\leq\delta.
  • For III(B):

By the geometric ergodicity property (Assumption 4(a)),

\mathbb{E}\left[\Pr\left(\exists B,\,\mathrm{III}(B)\geq\frac{T}{T-T^{\prime}}t\mid\{(S_{i,0},A_{i,0})\}\right)-{\Pr}^{*}\left(\exists B,\,\mathrm{III}(B)\geq\frac{T}{T-T^{\prime}}t\right)\right]
\leq \mathbb{E}\left\{\sum_{i=1}^{n}\left\|\mathbb{G}^{b}_{T^{\prime}}(\cdot\mid(S_{i,0},A_{i,0}))-\mathbb{G}^{*}(\cdot)\right\|_{\mathrm{TV}}\right\}
\leq \mathbb{E}\left\{\sum_{i=1}^{n}\phi(S_{i,0},A_{i,0})\kappa^{T^{\prime}}\right\}
\leq n\kappa^{T^{\prime}}\int_{(s,a)}\phi(s,a)\,d\mathbb{G}(s,a)\leq n\kappa^{T^{\prime}}C_{\ref*{C_s}}.

Thus we have

\Pr\left(\exists B,\,\mathrm{III}(B)\geq\frac{T}{T-T^{\prime}}t\right)\leq{\Pr}^{*}\left(\exists B,\,\mathrm{III}(B)\geq\frac{T}{T-T^{\prime}}t\right)+n\kappa^{T^{\prime}}C_{\ref*{C_s}}. (S70)
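The coupling remainder $n\kappa^{T^{\prime}}$ in (S70) is driven to zero by the logarithmic choice of $T^{\prime}$; a small numerical sketch (the particular values of $\kappa$ and $K_{3}$ below are arbitrary illustrations, not quantities from the paper):

```python
import math

def coupling_remainder(n, T, kappa, K3):
    # Size of the remainder n * kappa^{T'} with
    # T' = min(ceil(K3 * log(nT)), T), under geometric ergodicity
    # (Assumption 4(a), 0 < kappa < 1).
    T_prime = min(math.ceil(K3 * math.log(n * T)), T)
    return n * kappa ** T_prime
```

Taking $K_{3}$ large enough relative to $1/\log(1/\kappa)$ makes $n\kappa^{T^{\prime}}$ negligible compared with the $(nT)^{-1/2}$ scale, which is why $T^{\prime}\asymp\log(nT)$ suffices.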

Next, we focus on the term ${\Pr}^{*}(\exists B,\,\mathrm{III}(B)\geq\frac{T}{T-T^{\prime}}t)$, which is easier to handle given stationarity.

Take $\tilde{N}=n(T-T^{\prime})$. Since the Markov chains are stationary, we can write

\mathrm{III}(B)= 2(\mathbb{E}^{*}-\mathbb{P}_{\tilde{N}})f^{\pi}\left\{\hat{g}^{\pi}(B),g_{*}^{\pi}(B),B\right\}(S,A,S^{\prime})-\mathbb{E}^{*}f^{\pi}\left\{\hat{g}^{\pi}(B),g_{*}^{\pi}(B),B\right\}(S,A,S^{\prime})
\qquad-2\mu\mathbf{J}^{2}(\hat{g}^{\pi},g^{\pi}_{*},B).

Similarly to how we handled II(B), we bound

{\Pr}^{*}\left(\exists B\in\mathcal{Q},\,\mathrm{III}(B)>\frac{T}{T-T^{\prime}}t\right)\leq\sum_{l=0}^{\infty}{\Pr}^{*}\left(\sup_{h\in\mathcal{F}_{l}}\frac{(\mathbb{E}^{*}-\mathbb{P}_{\tilde{N}})h}{\mathbb{E}^{*}h+2^{l}\frac{T}{T-T^{\prime}}t}>\frac{1}{2}\right),

where $\mathcal{F}_{l}=\{f^{\pi}\left\{g,g_{*}^{\pi}(B),B\right\}:J_{\mathcal{G}}^{2}(g)\leq\frac{2^{l}}{\mu}\frac{T}{T-T^{\prime}}t,\;J_{\mathcal{G}}^{2}(g_{*}^{\pi}(B))\leq\frac{2^{l}}{\mu}\frac{T}{T-T^{\prime}}t,\;J_{\mathcal{Q}}^{2}(B)\leq\frac{2^{l}}{\mu}\frac{T}{T-T^{\prime}}t,\;B\in\mathcal{Q}\}$.

We use the independent-block technique [75] and the peeling device, together with the exponential inequality for the relative deviation of the empirical process developed in [16].

Next, we bound each of the above probabilities using the independent-block technique. We define a partition by dividing the index set $\{1,\cdots,\tilde{N}\}$ into $2v_{\tilde{N}}$ blocks of equal length $x_{\tilde{N}}$, i.e., $\{(j-1)x_{\tilde{N}}+1,\cdots,(j-1)x_{\tilde{N}}+x_{\tilde{N}}\}_{j=1}^{2v_{\tilde{N}}}$, together with a residual block $R_{\tilde{N}}=\{2v_{\tilde{N}}x_{\tilde{N}}+1,\cdots,\tilde{N}\}$. Then it can be seen that $\tilde{N}-2x_{\tilde{N}}<2v_{\tilde{N}}x_{\tilde{N}}\leq\tilde{N}$ and the cardinality $|R_{\tilde{N}}|<2x_{\tilde{N}}$.
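The block construction above can be sketched in code as follows (a minimal illustration; the function and variable names are ours):

```python
def block_partition(N_tilde, x_N):
    # Split {1, ..., N_tilde} into 2*v_N full blocks of length x_N, plus a
    # residual block R_N, as in the independent-block construction.
    v_N = N_tilde // (2 * x_N)
    blocks = [list(range((j - 1) * x_N + 1, (j - 1) * x_N + x_N + 1))
              for j in range(1, 2 * v_N + 1)]
    residual = list(range(2 * v_N * x_N + 1, N_tilde + 1))
    return blocks, residual
```

One can check on small examples that the counting identities stated above (e.g. $|R_{\tilde{N}}|<2x_{\tilde{N}}$) hold for this construction.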

For each l0l\geq 0, we will use a different independent block sequence denoted by (xN~,l,vN~,l)(x_{\tilde{N},l},v_{\tilde{N},l}) with the residual RlR_{l} and then optimize the probability bound by properly choosing (xN~,l,vN~,l)(x_{\tilde{N},l},v_{\tilde{N},l}) and RlR_{l}. More specifically, we choose

x_{\tilde{N},l}=\lfloor x^{\prime}_{\tilde{N},l}\rfloor\hskip 12.0pt\mbox{and }\hskip 12.0ptv_{\tilde{N},l}=\left\lfloor\frac{\tilde{N}}{2x_{\tilde{N},l}}\right\rfloor,

where $x^{\prime}_{\tilde{N},l}=(\tilde{N}\frac{T}{T-T^{\prime}}t)^{\tau}(2^{l})^{p}$ and $v^{\prime}_{\tilde{N},l}=\frac{\tilde{N}}{2x^{\prime}_{\tilde{N},l}}$ for some positive constants $\tau$ and $p$ to be determined later. We require $\tau\leq p\leq\frac{1}{2+\alpha}\leq\frac{1}{2}$. We also need $\frac{T}{T-T^{\prime}}t\geq\frac{1}{\tilde{N}}$ so that $x^{\prime}_{\tilde{N},l}\geq 1$. Suppose $\tilde{N}$ is sufficiently large such that

\displaystyle\tilde{N}\geq c_{8}\triangleq 4\times 8^{2}\times K_{1}. (S71)

In the following, we consider two cases. The first case considers any $l$ such that $x^{\prime}_{\tilde{N},l}\geq\frac{\tilde{N}}{8}$. In this case, since $\tau\leq p$, we have $x^{\prime}_{\tilde{N},l}\leq(\tilde{N}\frac{T}{T-T^{\prime}}t2^{l})^{p}$. Combined with the sample size requirement, this gives $\tilde{N}\frac{T}{T-T^{\prime}}t2^{l}\geq(\frac{\tilde{N}}{8})^{\frac{1}{p}}\geq 4\tilde{N}K_{1}$. Then we can show that in this case,

(𝔼N~){h(Z)}𝔼{h(Z)}+2lTTTt2K12lTTTt12.\frac{(\mathbb{E}^{*}-\mathbb{P}_{\tilde{N}})\left\{h(Z)\right\}}{\mathbb{E}^{*}\left\{h(Z)\right\}+2^{l}\frac{T}{T-T^{\prime}}t}\leq\frac{2K_{1}}{2^{l}\frac{T}{T-T^{\prime}}t}\leq\frac{1}{2}.

Therefore, when TTTt1N~\frac{T}{T-T^{\prime}}t\geq\frac{1}{\tilde{N}} and xN~,lN~8x^{\prime}_{\tilde{N},l}\geq\frac{\tilde{N}}{8},

Pr(suphl(𝔼N~){h(Z)}𝔼{h(Z)}+2lTTTt>12)=0.{\Pr}^{*}\left(\sup_{h\in\mathcal{F}_{l}}\frac{(\mathbb{E}^{*}-\mathbb{P}_{\tilde{N}})\left\{h(Z)\right\}}{\mathbb{E}^{*}\left\{h(Z)\right\}+2^{l}\frac{T}{T-T^{\prime}}t}>\frac{1}{2}\right)=0.

The second case is when $x^{\prime}_{\tilde{N},l}<\frac{\tilde{N}}{8}$. Under the geometric ergodicity assumption, it follows from Theorem 3.7 of [4] that the stationary Markov chain is exponentially $\beta$-mixing, i.e., the $\beta$-mixing coefficient at time lag $k$ satisfies $\beta_{k}\leq\beta_{0}\exp(-\beta_{1}k)$ for some constants $\beta_{0}\geq 0$ and $\beta_{1}>0$. We apply the relative deviation concentration inequality for exponentially $\beta$-mixing stationary processes given in Theorem 4 of [16], which combines results in [75] with Theorem 19.3 in [20]. It then suffices to verify Conditions (C1)-(C5) in Theorem 4 of [16] with $\mathcal{F}=\mathcal{F}_{l}$, $\epsilon=1/2$ and $\eta=2^{l}\frac{T}{T-T^{\prime}}t$ to obtain an exponential inequality for each term in the summation. First, (C1) and (C2) have been verified in (S66) and (S67).

To verify Condition (C3), without loss of generality, assume $K_{1}\geq 1$; otherwise replace $K_{1}$ by $\max(1,K_{1})$. Then $2K_{1}x_{\tilde{N},l}\geq\sqrt{2K_{1}x_{\tilde{N},l}}$ since $x_{\tilde{N},l}\geq 1$. We need to verify $\sqrt{\tilde{N}}\epsilon\sqrt{1-\epsilon}\sqrt{\eta}\geq 1152K_{1}x_{\tilde{N},l}$, for which it suffices to have $\sqrt{\tilde{N}}\epsilon\sqrt{1-\epsilon}\sqrt{\eta}\geq 1152K_{1}x^{\prime}_{\tilde{N},l}$ since $x^{\prime}_{\tilde{N},l}\geq x_{\tilde{N},l}$ by definition. Recall that $\epsilon=1/2$ and $\eta=2^{l}\frac{T}{T-T^{\prime}}t$. It is then enough to verify that

N~242lTTTt1152K1(N~TTTt2l)p,\sqrt{\tilde{N}}\frac{\sqrt{2}}{4}\sqrt{2^{l}\frac{T}{T-T^{\prime}}t}\geq 1152K_{1}(\tilde{N}\frac{T}{T-T^{\prime}}t2^{l})^{p},

since $(\tilde{N}\frac{T}{T-T^{\prime}}t2^{l})^{p}\geq x^{\prime}_{\tilde{N},l}$. Since $p\leq\frac{1}{2+\alpha}$, it suffices to let $\frac{T}{T-T^{\prime}}t\geq\frac{2304\sqrt{2}K_{1}}{\tilde{N}}\triangleq\frac{c_{1}^{\prime}}{\tilde{N}}$ so that the above inequality holds for every $l\geq 0$.

Next we verify (C4), i.e., $\frac{|R_{l}|}{\tilde{N}}\leq\frac{\epsilon\eta}{6K_{1}}$. Recall that $|R_{l}|<2x_{\tilde{N},l}\leq 2x^{\prime}_{\tilde{N},l}=2(\tilde{N}\frac{T}{T-T^{\prime}}t)^{\tau}(2^{l})^{p}$. So if $\frac{T}{T-T^{\prime}}t\geq\frac{c_{9}}{\tilde{N}}$ for some positive constant $c_{9}$ depending on $K_{1}$, we have

ϵη6K1=2l12K1TTTt2(N~TTTt)τ(2l)pN~=2xN~,lN~>|Rl|N~.\frac{\epsilon\eta}{6K_{1}}=\frac{2^{l}}{12K_{1}}\frac{T}{T-T^{\prime}}t\geq\frac{2(\tilde{N}\frac{T}{T-T^{\prime}}t)^{\tau}(2^{l})^{p}}{\tilde{N}}=\frac{2x^{\prime}_{\tilde{N},l}}{\tilde{N}}>\frac{|R_{l}|}{\tilde{N}}.

In addition, |Rl|2xN~,l<N~2|R_{l}|\leq 2x^{\prime}_{\tilde{N},l}<\frac{\tilde{N}}{2}.

Lastly we verify Condition (C5). Using arguments similar to those used to verify (S69), we can show that

log𝒩(ϵ,l,n,T)(2lTTTtμ)αϵ2α.\displaystyle\log\mathcal{N}(\epsilon,\mathcal{F}_{l},\|\cdot\|_{{\color[rgb]{0,0,0}n,T}})\lesssim\left(\frac{2^{l}\frac{T}{T-T^{\prime}}t}{\mu}\right)^{\alpha}\epsilon^{-2\alpha}.

Then Condition (C5) is satisfied if the following inequality holds for all x(2lTTTtxN~,l)/8x\geq(2^{l}\frac{T}{T-T^{\prime}}tx_{\tilde{N},l})/8,

\displaystyle\frac{\sqrt{v_{\tilde{N},l}}(1/2)^{2}x}{96x_{\tilde{N},l}\sqrt{2}\max(K_{1},2K_{2})}\geq\int_{0}^{\sqrt{x}}\sqrt{c_{10}}\left(\frac{2^{l}\frac{T}{T-T^{\prime}}t}{\mu}\right)^{\alpha/2}\left(\frac{u}{2x_{\tilde{N},l}}\right)^{-\alpha}du
\displaystyle=x_{\tilde{N},l}^{\alpha}x^{\frac{1-\alpha}{2}}\sqrt{2^{\alpha}c_{10}}\left(\frac{2^{l}\frac{T}{T-T^{\prime}}t}{\mu}\right)^{\alpha/2},

where $c_{10}>0$ is a constant depending on $\alpha,G_{\max},Q_{\max}$.

It is enough to guarantee that

\frac{\sqrt{v_{\tilde{N},l}}(1/2)^{2}x}{96x_{\tilde{N},l}\sqrt{2}\max(K_{1},2K_{2})}\geq x_{\tilde{N},l}^{\alpha}x^{\frac{1-\alpha}{2}}\sqrt{2^{\alpha}c_{10}}\left(\frac{2^{l}\frac{T}{T-T^{\prime}}t}{\mu}\right)^{\alpha/2}.

After some algebra, we can check that the above inequality holds if, for some constant $c_{11}>0$,

\displaystyle\frac{T}{T-T^{\prime}}t\geq c_{11}\frac{(x_{\tilde{N},l})^{1+\alpha}}{v^{\prime}_{\tilde{N},l}2^{l}\mu^{\alpha}},

or equivalently, for some constant $c_{12}>0$,

\displaystyle\frac{T}{T-T^{\prime}}t\geq c_{12}\frac{1}{\tilde{N}\mu^{\frac{\alpha}{1-\tau(2+\alpha)}}\left(2^{l}\right)^{\frac{1-p(2+\alpha)}{1-\tau(2+\alpha)}}},

by the definitions $x_{\tilde{N},l}\leq x^{\prime}_{\tilde{N},l}$ and $v^{\prime}_{\tilde{N},l}\leq v_{\tilde{N},l}$. To summarize, if for any $l\geq 0$,

\frac{T}{T-T^{\prime}}t\geq\mu+c_{12}\frac{1}{\tilde{N}\mu^{\frac{\alpha}{1-\tau(2+\alpha)}}\left(2^{l}\right)^{\frac{1-p(2+\alpha)}{1-\tau(2+\alpha)}}},

then the entropy inequality in Condition (C5) above holds. Since $0<\tau\leq p\leq\frac{1}{2+\alpha}$, the right-hand side is a non-increasing function of $l$. Then, as long as

\frac{T}{T-T^{\prime}}t\geq\mu+c_{12}\frac{1}{\tilde{N}\mu^{\frac{\alpha}{1-\tau(2+\alpha)}}},

Condition (C5) holds.

To summarize, Conditions (C1)-(C5) in Theorem 4 of [16] with $\mathcal{F}=\mathcal{F}_{l}$, $\epsilon=1/2$ and $\eta=2^{l}\frac{T}{T-T^{\prime}}t$ hold for every $l\geq 0$ when $\frac{T}{T-T^{\prime}}t\geq c_{9}\tilde{N}^{-1}$ for some constant $c_{9}\geq 1$ and $\frac{T}{T-T^{\prime}}t\geq\mu+c_{12}\frac{1}{\tilde{N}\mu^{\frac{\alpha}{1-\tau(2+\alpha)}}}$. Thus when $\tilde{N}\geq c_{8}$, there exists a constant $c_{13}>0$ such that

Pr{B𝒬,III(B)>TTTt}\displaystyle{\Pr}^{*}\left\{\exists B\in\mathcal{Q},\mathrm{III}(B)>\frac{T}{T-T^{\prime}}t\right\}
l=0Pr[suphl(𝔼N~){h(Z)}𝔼{h(Z)}+2lTTTt>12]\displaystyle\leq\sum_{l=0}^{\infty}{\Pr}^{*}\left[\sup_{h\in\mathcal{F}_{l}}\frac{(\mathbb{E}^{*}-\mathbb{P}_{\tilde{N}})\left\{h(Z)\right\}}{\mathbb{E}^{*}\left\{h(Z)\right\}+2^{l}\frac{T}{T-T^{\prime}}t}>\frac{1}{2}\right]
\displaystyle\leq\sum_{l=0}^{\infty}120\exp\left\{-c_{13}\frac{v_{\tilde{N},l}^{\prime 2}\frac{T}{T-T^{\prime}}t2^{l}}{\tilde{N}}\right\}+2\beta_{x_{\tilde{N},l}}v_{\tilde{N},l}
\displaystyle\leq\sum_{l=0}^{\infty}120\exp\left\{-c_{13}\frac{v_{\tilde{N},l}^{\prime 2}\frac{T}{T-T^{\prime}}t2^{l}}{\tilde{N}}\right\}+2\beta_{0}\exp\left(-\beta_{1}x_{\tilde{N},l}+\log v_{\tilde{N},l}\right),

where the last inequality follows because the Markov chain is stationary and exponentially $\beta$-mixing. When $t\geq\frac{\left(4/\beta_{1}\log(\tilde{N})\right)^{1/\tau}}{\tilde{N}}$, we have $\log v_{\tilde{N},l}\leq\frac{1}{2}\beta_{1}x_{\tilde{N},l}$ by using $x_{\tilde{N},l}^{\prime}\leq 2x_{\tilde{N},l}$ and $v_{\tilde{N},l}\leq\tilde{N}$. This further implies that $2\beta_{x_{\tilde{N},l}}v_{\tilde{N},l}\leq 2\beta_{0}\exp\left(-\beta_{1}x_{\tilde{N},l}/2\right)$.
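The requirement on $t$ here is easy to check numerically (a sketch under our own simplifications: $l=0$, $\frac{T}{T-T^{\prime}}t$ abbreviated to $t$, and $x_{\tilde{N}}=\lfloor(\tilde{N}t)^{\tau}\rfloor$; the function name is hypothetical):

```python
import math

def log_v_dominated(N, t, tau, beta1):
    """Check log(v_N) <= beta1 * x_N / 2 for x_N = floor((N*t)**tau)
    and v_N = floor(N / (2*x_N)), as used to absorb the residual mixing term."""
    x = math.floor((N * t) ** tau)
    v = N // (2 * x)
    return math.log(v) <= beta1 * x / 2

N, tau, beta1 = 10**5, 0.3, 1.0
t = (4 * math.log(N) / beta1) ** (1 / tau) / N   # the threshold from the text
```

At the threshold value of $t$, the block length $x_{\tilde{N}}$ is already of order $\log\tilde{N}$, so $\log v_{\tilde{N}}\leq\log\tilde{N}$ is dominated by $\beta_{1}x_{\tilde{N}}/2$.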

Then we will have

Pr{B𝒬,III(B)>TTTt}\displaystyle{\Pr}^{*}\left\{\exists B\in\mathcal{Q},\mathrm{III}(B)>\frac{T}{T-T^{\prime}}t\right\}
\displaystyle\lesssim\sum_{l=0}^{\infty}120\exp\left(-c_{13}\left(\tilde{N}\frac{T}{T-T^{\prime}}t\right)^{1-2\tau}(2^{l})^{1-2p}\right)+2\beta_{0}\exp\left(-\beta_{1}\left(\tilde{N}\frac{T}{T-T^{\prime}}t\right)^{\tau}(2^{l})^{p}\right)
\displaystyle\leq c_{14}\exp\left(-c_{15}\left(\tilde{N}\frac{T}{T-T^{\prime}}t\right)^{1-2\tau}\right)+c_{16}\exp\left(-c_{17}\left(\tilde{N}\frac{T}{T-T^{\prime}}t\right)^{\tau}\right),

where $c_{14}$, $c_{15}$, $c_{16}$, $c_{17}$ are some positive constants.

For any fixed $\delta>0$, we can verify that there exists a constant $c_{18}>0$ such that $c_{14}\exp(-c_{15}(\tilde{N}t)^{1-2\tau})+c_{16}\exp(-c_{17}(\tilde{N}t)^{\tau})\leq\delta$ whenever $t\geq c_{18}\tilde{N}^{-1}\max\{(\log(1/\delta))^{1/(1-2\tau)},(\log(1/\delta))^{1/\tau}\}$.
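The inversion of the two exponential tails into a deviation level for $t$ can be sketched numerically (constants set to one for illustration; `deviation_level` is our own name, not part of the paper):

```python
import math

def deviation_level(N, delta, tau):
    """Smallest t (up to constants) with exp(-(N*t)**(1-2*tau)) <= delta/2
    and exp(-(N*t)**tau) <= delta/2, i.e., t of order
    N**-1 * max(log(1/delta)**(1/(1-2*tau)), log(1/delta)**(1/tau))."""
    L = math.log(2 / delta)
    return max(L ** (1 / (1 - 2 * tau)), L ** (1 / tau)) / N

N, delta, tau = 10**4, 0.05, 0.3
t = deviation_level(N, delta, tau)
# at this t, each tail is at most delta/2, so their sum is at most delta
```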

Substituting this probability into (S70) and combining all the conditions on $t$, we can derive that there exists a constant $c_{19}>0$ such that

\displaystyle\Pr\left(\exists B,\frac{T-T^{\prime}}{T}\mathrm{III}(B)\geq c_{19}\left\{\mu+\frac{1}{\tilde{N}\mu^{\frac{\alpha}{1-\tau(2+\alpha)}}}+\frac{(\log\tilde{N})^{1/\tau}+1+(\log(1/\delta))^{1/(1-2\tau)}+(\log(1/\delta))^{1/\tau}}{\tilde{N}}\right\}\right)
\displaystyle\leq\delta+n\kappa^{T^{\prime}}C_{s}.
  • For IV(B):

\displaystyle|\mathrm{IV}(B)|\leq\mathbb{E}\frac{1}{T-T^{\prime}}\sum_{t=T^{\prime}}^{T-1}K_{1}\left\|\mathbb{G}^{b}_{t}(\cdot\mid(s_{0},a_{0}))-G_{s}(\cdot)\right\|_{\mathrm{TV}}
\displaystyle\leq K_{1}C_{s}\frac{1}{T-T^{\prime}}\kappa^{T^{\prime}}/(1-\kappa).

Recall that $T^{\prime}=\lceil K_{3}\log(nT)\rceil$, given the condition that $T\gg\log(nT)$. If we take $K_{3}=-3(\log\kappa)^{-1}$, then for sufficiently large $nT$ we have

n\kappa^{T^{\prime}}C_{s}\leq C_{s}n^{-2}T^{-3}\leq n^{-1}T^{-1},

and

κT(TT)(1κ)=𝒪(1n3T4(1κ))=𝒪(t),\frac{\kappa^{T^{\prime}}}{(T-T^{\prime})(1-\kappa)}=\mathop{}\mathopen{}\mathcal{O}\mathopen{}\left(\frac{1}{n^{3}T^{4}(1-\kappa)}\right)=\scalebox{0.7}{$\mathcal{O}$}(t),

due to the conditions of tt.

Combining all the results we have derived for II(B), III(B) and IV(B), we have that with probability at least 12δ1/(nT)1-2\delta-1/({\color[rgb]{0,0,0}nT}), for any B𝒬B\in\mathcal{Q},

\displaystyle I_{2}(B)\lesssim\frac{\log(nT)}{T}\left\{\mu+(\mu^{\alpha}n)^{-1}\right\}+\frac{\log(nT)(1+\log(1/\delta))}{nT}
\displaystyle+\mu+\frac{1}{nT\mu^{\frac{\alpha}{1-\tau(2+\alpha)}}}+\frac{(\log(nT))^{1/\tau}+1+(\log(1/\delta))^{1/(1-2\tau)}+(\log(1/\delta))^{1/\tau}}{nT}.

Recall that we require $0<\tau\leq p\leq 1/(2+\alpha)$. Consider any $\tau\leq 1/3$ and pick $p$ to be any value between $\tau$ and $1/(2+\alpha)$. Then we can simplify the bound to

I2(B)μ+1nTμα1τ(2+α)+[log(max{1/δ,nT})]1/τnT.\displaystyle I_{2}(B)\lesssim\mu+\frac{1}{{\color[rgb]{0,0,0}nT}\mu^{\frac{\alpha}{1-\tau(2+\alpha)}}}+\frac{[\log(\max\{1/\delta,{\color[rgb]{0,0,0}nT}\})]^{1/\tau}}{{\color[rgb]{0,0,0}nT}}.

Lemma S3.

Suppose Assumption 4(a) (4’(a)) and Assumption 5(c) hold. Then we have

λmax{1nTi=1nt=0T1𝑳K(Si,t,Ai,t)𝑳K(Si,t,Ai,t)𝔼1Tt=0T1𝑳K(St,At)𝑳K(St,At)}=𝒪p(KnTlog(nT))=𝒪p(1).\lambda_{\max}\left\{\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\bm{L}_{K}(S_{i,t},A_{i,t})\bm{L}_{K}(S_{i,t},A_{i,t})^{\intercal}-\mathbb{E}\frac{1}{T}\sum_{t=0}^{T-1}\bm{L}_{K}(S_{t},A_{t})\bm{L}_{K}(S_{t},A_{t})^{\intercal}\right\}\\ =\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}\left(\sqrt{\frac{K}{{\color[rgb]{0,0,0}nT}}}\log({\color[rgb]{0,0,0}nT})\right)=\scalebox{0.7}{$\mathcal{O}$}_{\mathrm{p}}(1).
Proof.

Recall that $\overline{p}_{T}^{b}=\frac{1}{T}\sum_{t=0}^{T-1}p_{t}^{b}$. By Assumption 5(c), we can show that
sup(s,a)λmax{𝑳K(s,a)𝑳K(s,a)}sup(s,a)𝑳K(s,a)22(CLABEL:weightsC2+γCLABEL:weightsC2)2K\sup_{(s,a)}\lambda_{\max}\{\bm{L}_{K}(s,a)\bm{L}_{K}(s,a)^{\intercal}\}\leq\sup_{(s,a)}\|\bm{L}_{K}(s,a)\|_{2}^{2}\leq(C_{\ref*{weightsC2}}+\gamma C_{\ref*{weightsC2}})^{2}K. Next, let us consider two cases.

  • Case 1: TT is fixed, i.e., T1T\asymp 1.

    Take $\bm{Z}_{i}=\frac{1}{T}\sum_{t=0}^{T-1}\bm{L}_{K}(S_{i,t},A_{i,t})\bm{L}_{K}(S_{i,t},A_{i,t})^{\intercal}$. From above, we can see that $\lambda_{\max}\{\bm{Z}_{i}\}\leq(C_{\ref*{weightsC2}}+\gamma C_{\ref*{weightsC2}})^{2}K$ for any $i$.

    \displaystyle\lambda_{\max}\left\{\sum_{i=1}^{n}\mathbb{E}\bm{Z}_{i}\bm{Z}_{i}^{\intercal}\right\}\leq n\lambda_{\max}\left\{\mathbb{E}\bm{Z}_{i}\bm{Z}_{i}^{\intercal}\right\}
    \displaystyle\leq n\sup_{i}\lambda_{\max}\{\bm{Z}_{i}\}\lambda_{\max}\left\{\mathbb{E}\bm{Z}_{i}\right\}
    \displaystyle\leq n(C_{\ref*{weightsC2}}+\gamma C_{\ref*{weightsC2}})^{2}KC_{\ref*{weightsC4}}.

    Then, applying the matrix Bernstein inequality [66], we have

    Pr(λmax{1nTi=1nt=0T1𝑳K(Si,t,Ai,t)𝑳K(Si,t,Ai,t)𝔼(1Tt=0T1𝑳K(St,At)𝑳K(St,At))}x)\displaystyle\mbox{Pr}\left(\lambda_{\max}\left\{\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\bm{L}_{K}(S_{i,t},A_{i,t})\bm{L}_{K}(S_{i,t},A_{i,t})^{\intercal}-\mathbb{E}\left(\frac{1}{T}\sum_{t=0}^{T-1}\bm{L}_{K}(S_{t},A_{t})\bm{L}_{K}(S_{t},A_{t})^{\intercal}\right)\right\}\geq x\right)
    Pr(λmax{1ni=1n𝒁i𝔼𝒁i}x)\displaystyle\leq\mbox{Pr}\left(\lambda_{\max}\left\{\frac{1}{n}\sum_{i=1}^{n}\bm{Z}_{i}-\mathbb{E}\bm{Z}_{i}\right\}\geq x\right)
    \displaystyle\leq 2K\exp\left(\frac{-nx^{2}/2}{(C_{\ref*{weightsC2}}+\gamma C_{\ref*{weightsC2}})^{2}KC_{\ref*{weightsC4}}+(C_{\ref*{weightsC2}}+\gamma C_{\ref*{weightsC2}})^{2}Kx/3}\right).

    Due to the condition that K=𝒪(nT/log(nT))=𝒪(n/logn)K=\scalebox{0.7}{$\mathcal{O}$}(\sqrt{{\color[rgb]{0,0,0}nT}/\log({\color[rgb]{0,0,0}nT})})=\scalebox{0.7}{$\mathcal{O}$}(\sqrt{n/\log n}), we have

    λmax{1nTi=1nt=0T1𝑳K(Si,t,Ai,t)𝑳K(Si,t,Ai,t)𝔼1Tt=0T1𝑳K(St,At)𝑳K(St,At)}\displaystyle\lambda_{\max}\left\{\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\bm{L}_{K}(S_{i,t},A_{i,t})\bm{L}_{K}(S_{i,t},A_{i,t})^{\intercal}-\mathbb{E}\frac{1}{T}\sum_{t=0}^{T-1}\bm{L}_{K}(S_{t},A_{t})\bm{L}_{K}(S_{t},A_{t})^{\intercal}\right\}
    =\displaystyle= 𝒪p(KlogKn)=𝒪p(1).\displaystyle\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}\left(\sqrt{\frac{K\log K}{n}}\right)=\scalebox{0.7}{$\mathcal{O}$}_{\mathrm{p}}(1).
  • Case 2: TT\rightarrow\infty.

    Take ${\Pr}^{*}$ and $\mathbb{E}^{*}$ as the probability and expectation under the stationary distribution. Take $\bm{Z}_{i,t}=\bm{L}_{K}(S_{i,t},A_{i,t})\bm{L}_{K}(S_{i,t},A_{i,t})^{\intercal}$. From Assumption 5(c), we know that $\lambda_{\max}\{\mathbb{E}^{*}\bm{Z}_{i,t}\}\leq C_{7}$. It can also be verified that $\lambda_{\max}\{\bm{Z}_{i,t}\}\leq(C_{\ref*{weightsC2}}+\gamma C_{\ref*{weightsC2}})^{2}K$ for any $i,t$, and

    λmax{𝔼𝒁i,t𝒁i,t}supi,tλmax{𝒁i,t}λmax{𝔼𝒁i,t}K.\displaystyle\lambda_{\max}\left\{{\mathbb{E}}^{*}\bm{Z}_{i,t}\bm{Z}_{i,t}\right\}\leq\sup_{i,t}\lambda_{\max}\left\{\bm{Z}_{i,t}\right\}\lambda_{\max}\left\{{\mathbb{E}}^{*}\bm{Z}_{i,t}\right\}\lesssim K.

    For any fixed $i$, by Theorem 4.2 of [9], there exists some constant $c_{20}>0$ such that for any $\tau\geq 0$ and integer $1<q<T$,

    Pr(λmax{t=0T1(𝒁i,t𝔼𝒁i,t)}6τ)\displaystyle{\mbox{Pr}}^{*}\left(\lambda_{\max}\left\{\sum_{t=0}^{T-1}\left(\bm{Z}_{i,t}-\mathbb{E}^{*}\bm{Z}_{i,t}\right)\right\}\geq 6\tau\right)
    \displaystyle\leq Tqβ(q)+Pr(λmax{tr(𝒁i,t𝔼𝒁i,t)}6τ)\displaystyle\frac{T}{q}\beta(q)+{\mbox{Pr}}^{*}\left(\lambda_{\max}\left\{\sum_{t\in\mathcal{I}_{r}}\left(\bm{Z}_{i,t}-\mathbb{E}^{*}\bm{Z}_{i,t}\right)\right\}\geq 6\tau\right)
    \displaystyle+c_{20}K\exp\left(-\frac{\tau^{2}/2}{TqK+q(C_{\ref*{weightsC2}}+\gamma C_{\ref*{weightsC2}})^{2}K\tau/3}\right),

    where r={q(T+1)/q,q(T+1)/q+1,,T1}\mathcal{I}_{r}=\{q\lfloor(T+1)/q\rfloor{},q\lfloor(T+1)/q\rfloor{}+1,\dots,T-1\}. Suppose τq(CLABEL:weightsC2+γCLABEL:weightsC2)2K\tau\geq q(C_{\ref*{weightsC2}}+\gamma C_{\ref*{weightsC2}})^{2}K. Notice that |r|q|\mathcal{I}_{r}|\leq q. It follows that

    Pr(λmax{tr(𝒁i,t𝔼𝒁i,t)}6τ)=0.\displaystyle{\mbox{Pr}}^{*}\left(\lambda_{\max}\left\{\sum_{t\in\mathcal{I}_{r}}\left(\bm{Z}_{i,t}-\mathbb{E}^{*}\bm{Z}_{i,t}\right)\right\}\geq 6\tau\right)=0.

    Since $\beta(q)=\mathcal{O}(\kappa^{q})$, set $q=-6\log(nT)/\log\kappa$; we obtain $T\beta(q)/q=\mathcal{O}(n^{-6}T^{-5})$. Set $\tau=c_{21}\max\{\sqrt{TqK\log(nT)},q(C_{\ref*{weightsC2}}+\gamma C_{\ref*{weightsC2}})^{2}K\}$ for some appropriate constant $c_{21}>0$. From the condition that $K=\scalebox{0.7}{$\mathcal{O}$}(nT)$, the following event occurs with probability at least $1-\mathcal{O}(n^{-6}T^{-5})$,

    λmax{t=0T1(𝒁i,t𝔼𝒁i,t)}max{TKlog(nT),Klog2(Tn)},where𝒁i,t𝔾.\displaystyle\lambda_{\max}\left\{\sum_{t=0}^{T-1}\left(\bm{Z}_{i,t}-\mathbb{E}^{*}\bm{Z}_{i,t}\right)\right\}\lesssim\max\{\sqrt{TK}\log(nT),K\log^{2}(Tn)\},\quad\mathrm{where}\quad\bm{Z}_{i,t}\sim\mathbb{G}^{*}.

    Note that

    \displaystyle\mbox{Pr}\left(\lambda_{\max}\left\{\sum_{t=0}^{T-1}\left(\bm{Z}_{i,t}-\mathbb{E}^{*}\bm{Z}_{i,t}\right)\right\}\geq x\right)=\mathbb{E}\mathbbm{1}_{\lambda_{\max}\left\{\sum_{t=0}^{T-1}\left(\bm{Z}_{i,t}-\mathbb{E}^{*}\bm{Z}_{i,t}\right)\right\}\geq x}
    \displaystyle=\mathbb{E}_{(S_{i,t},A_{i,t})\sim\mathbb{G}^{*}}\frac{\bar{p}_{T}^{b}(S,A)}{p^{*}(S,A)}\mathbbm{1}_{\lambda_{\max}\left\{\sum_{t=0}^{T-1}\left(\bm{Z}_{i,t}-\mathbb{E}^{*}\bm{Z}_{i,t}\right)\right\}\geq x}
    \displaystyle\leq\sqrt{\mathbb{E}^{*}\left(\frac{\bar{p}_{T}^{b}(S,A)}{p^{*}(S,A)}\right)^{2}}\sqrt{{\mbox{Pr}}^{*}\left(\lambda_{\max}\left\{\sum_{t=0}^{T-1}\left(\bm{Z}_{i,t}-\mathbb{E}^{*}\bm{Z}_{i,t}\right)\right\}\geq x\right)}
    \displaystyle\leq\sqrt{C_{7}}\sqrt{{\mbox{Pr}}^{*}\left(\lambda_{\max}\left\{\sum_{t=0}^{T-1}\left(\bm{Z}_{i,t}-\mathbb{E}^{*}\bm{Z}_{i,t}\right)\right\}\geq x\right)},

    where the first inequality is due to the Cauchy-Schwarz inequality and the second inequality is due to Assumption 5(c).

    Then we have with probability at least 1𝒪(n3T5/2)1-\mathop{}\mathopen{}\mathcal{O}\mathopen{}(n^{-3}T^{-5/2}),

    λmax{t=0T1(𝒁i,t𝔼𝒁i,t)}max{TKlog(nT),Klog2(Tn)},where𝒁i,tp¯Tb.\displaystyle\lambda_{\max}\left\{\sum_{t=0}^{T-1}\left(\bm{Z}_{i,t}-\mathbb{E}^{*}\bm{Z}_{i,t}\right)\right\}\lesssim\max\{\sqrt{TK}\log(nT),K\log^{2}(Tn)\},\quad\mathrm{where}\quad\bm{Z}_{i,t}\sim\overline{p}_{T}^{b}.

    Then with probability at least 1𝒪(n2T5/2)1-\mathop{}\mathopen{}\mathcal{O}\mathopen{}(n^{-2}T^{-5/2}),

    maxi=1,,nλmax{t=0T1(𝒁i,t𝔼𝒁i,t)}max{TKlog(nT),Klog2(Tn)}\displaystyle\max_{i=1,\dots,n}\lambda_{\max}\left\{\sum_{t=0}^{T-1}\left(\bm{Z}_{i,t}-\mathbb{E}^{*}\bm{Z}_{i,t}\right)\right\}\lesssim\max\{\sqrt{TK}\log(nT),K\log^{2}(Tn)\} (S72)

    Take

    𝒜i={λmax{t=0T1(𝒁i,t𝔼𝒁i,t)}max{TKlog(nT),Klog2(Tn)}}.\displaystyle\mathcal{A}_{i}=\left\{\lambda_{\max}\left\{\sum_{t=0}^{T-1}\left(\bm{Z}_{i,t}-\mathbb{E}^{*}\bm{Z}_{i,t}\right)\right\}\lesssim\max\{\sqrt{TK}\log(nT),K\log^{2}(Tn)\}\right\}.

    Then based on the condition of KK (K=𝒪(nT/log2(nT))K=\scalebox{0.7}{$\mathcal{O}$}(nT/\log^{2}(nT))), the matrix Bernstein inequality [65] yields that

    λmax{i=1nt=0T1(𝒁i,t𝔼𝒁i,t)𝟙(𝒜i)n𝔼t=0T1(𝒁0,t𝔼𝒁0,t)𝟙(𝒜0)}=𝒪p(nTKlog(nT)).\displaystyle\lambda_{\max}\left\{\sum_{i=1}^{n}\sum_{t=0}^{T-1}\left(\bm{Z}_{i,t}-\mathbb{E}^{*}\bm{Z}_{i,t}\right)\mathbbm{1}(\mathcal{A}_{i})-n\mathbb{E}\sum_{t=0}^{T-1}\left(\bm{Z}_{0,t}-\mathbb{E}^{*}\bm{Z}_{0,t}\right)\mathbbm{1}(\mathcal{A}_{0})\right\}=\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}\left(\sqrt{nTK}\log(nT)\right).

    Based on the probability for (S72), we have

    λmax{i=1nt=0T1(𝒁i,t𝔼𝒁i,t)n𝔼t=0T1(𝒁0,t𝔼𝒁0,t)𝟙(𝒜0)}=𝒪p(nTKlog(nT)).\displaystyle\lambda_{\max}\left\{\sum_{i=1}^{n}\sum_{t=0}^{T-1}\left(\bm{Z}_{i,t}-\mathbb{E}^{*}\bm{Z}_{i,t}\right)-n\mathbb{E}\sum_{t=0}^{T-1}\left(\bm{Z}_{0,t}-\mathbb{E}^{*}\bm{Z}_{0,t}\right)\mathbbm{1}(\mathcal{A}_{0})\right\}=\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}\left(\sqrt{nTK}\log(nT)\right).

    Also, we can verify that for any 𝒂K\bm{a}\in\mathbb{R}^{K} such that 𝒂2=1\|\bm{a}\|_{2}=1,

    𝒂{𝔼t=0T1(𝒁0,t𝔼𝒁0,t)𝟙(𝒜0c)}𝒂𝔼{𝒂t=0T1(𝒁0,t𝔼𝒁0,t)𝒂}2Pr(𝒜0c)\displaystyle\bm{a}^{\intercal}\left\{\mathbb{E}\sum_{t=0}^{T-1}\left(\bm{Z}_{0,t}-\mathbb{E}^{*}\bm{Z}_{0,t}\right)\mathbbm{1}(\mathcal{A}^{c}_{0})\right\}\bm{a}\leq\sqrt{\mathbb{E}\left\{\bm{a}^{\intercal}\sum_{t=0}^{T-1}\left(\bm{Z}_{0,t}-\mathbb{E}^{*}\bm{Z}_{0,t}\right)\bm{a}\right\}^{2}}\sqrt{\mbox{Pr}(\mathcal{A}_{0}^{c})}
    TKlog(nT)n3/2T5/4=𝒪(n1).\displaystyle{\leq\sqrt{TK}\log(nT)n^{-3/2}T^{-5/4}=\mathop{}\mathopen{}\mathcal{O}\mathopen{}(n^{-1}).}

    Overall, we have

    \displaystyle\lambda_{\max}\left\{\sum_{i=1}^{n}\sum_{t=0}^{T-1}\bm{Z}_{i,t}-n\mathbb{E}\left(\sum_{t=0}^{T-1}\bm{Z}_{0,t}\right)\right\}=\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}(\sqrt{nTK}\log(nT)).

    Therefore,

    λmax{1nTi=1nt=0T1𝑳K(Si,t,Ai,t)𝑳K(Si,t,Ai,t)𝔼1Tt=0T1𝑳K(St,At)𝑳K(St,At)}\displaystyle\lambda_{\max}\left\{\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\bm{L}_{K}(S_{i,t},A_{i,t})\bm{L}_{K}(S_{i,t},A_{i,t})^{\intercal}-\mathbb{E}\frac{1}{T}\sum_{t=0}^{T-1}\bm{L}_{K}(S_{t},A_{t})\bm{L}_{K}(S_{t},A_{t})^{\intercal}\right\}
    \displaystyle=\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}\left(\frac{\sqrt{K}}{\sqrt{nT}}\log(nT)\right)=\scalebox{0.7}{$\mathcal{O}$}_{\mathrm{p}}(1).
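As a toy illustration of the matrix Bernstein scale used in the proof above (our own simulation, not part of the argument): for diagonal random matrices, $\lambda_{\max}$ of the centered average reduces to the largest coordinate deviation, and the $K\exp(-nx^{2}/\cdot)$ tail predicts deviations of order $\sqrt{\log K/n}$:

```python
import math, random

def max_eig_deviation(n, K, seed=0):
    """lambda_max of (1/n) sum_i (Z_i - E Z_i) for diagonal Z_i with iid
    Bernoulli(1/2) diagonal entries; for diagonal matrices this equals the
    largest coordinate-wise deviation of the sample means from 1/2."""
    rng = random.Random(seed)
    sums = [0] * K
    for _ in range(n):
        for k in range(K):
            sums[k] += rng.random() < 0.5
    return max(abs(s / n - 0.5) for s in sums)

n, K = 2000, 50
dev = max_eig_deviation(n, K)
bound = 3 * math.sqrt(2 * math.log(K) / n)   # a small multiple of the Bernstein scale
```

The simulated deviation stays within a small multiple of $\sqrt{2\log K/n}$, which is the scale behind the $\sqrt{K/(nT)}\log(nT)$ rate in Lemma S3 (with the extra $K$ factor coming from the operator-norm variance bound).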

Lemma S4.

Suppose the Markov chain $\{S_{t},A_{t}\}_{t\geq 0}$ satisfies Assumption 4(a) (4'(a)). Take $\bm{F}(S,A)=[f_{k}(S,A)]_{k=1,\dots,K}\in\mathbb{R}^{K}$ and suppose that $\sup_{(s,a)}\|\bm{F}(s,a)\|_{2}\leq R$ for some quantity $R>0$. Then we have

1nTi=1nt=0T1𝑭(Si,t,Ai,t)𝔼{1Tt=0T1𝑭(St,At)}2=𝒪p(Rlog(K)log(nT)nT)\displaystyle\left\|\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\bm{F}(S_{i,t},A_{i,t})-\mathbb{E}\left\{\frac{1}{T}\sum_{t=0}^{T-1}\bm{F}(S_{t},A_{t})\right\}\right\|_{2}=\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}\left(R\sqrt{\frac{\log(K)\log(nT)}{nT}}\right)
Proof.

Take $T^{\prime}=\min\{T,2\log(nT)/\log(1/\kappa)\}$, $\bm{F}^{\mathrm{I}}(S_{i},A_{i})=\sum_{t=0}^{T^{\prime}-1}\bm{F}(S_{i,t},A_{i,t})/T^{\prime}$, and $\bm{F}^{\mathrm{II}}(S_{i},A_{i})=\sum_{t=T^{\prime}}^{T-1}\bm{F}(S_{i,t},A_{i,t})/(T-T^{\prime})$. Take $\mathbb{E}^{*}$ and ${\Pr}^{*}$ as the expectation and probability under the stationary distribution of $\{S_{t},A_{t}\}$.

1nTi=1nt=0T1𝑭(Si,t,Ai,t)𝔼{1Tt=0T1𝑭(St,At)}2\displaystyle\left\|\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\bm{F}(S_{i,t},A_{i,t})-\mathbb{E}\left\{\frac{1}{T}\sum_{t=0}^{T-1}\bm{F}(S_{t},A_{t})\right\}\right\|_{2}
\displaystyle\leq\frac{T^{\prime}}{T}\left\|\frac{1}{n}\sum_{i=1}^{n}\bm{F}^{\mathrm{I}}(S_{i},A_{i})-\mathbb{E}\bm{F}^{\mathrm{I}}(S,A)\right\|_{2}+\frac{T-T^{\prime}}{T}\left\|\frac{1}{n}\sum_{i=1}^{n}\bm{F}^{\mathrm{II}}(S_{i},A_{i})-\mathbb{E}^{*}\bm{F}^{\mathrm{II}}(S_{i},A_{i})\right\|_{2}
\displaystyle+\frac{T-T^{\prime}}{T}\left\|\mathbb{E}^{*}\bm{F}^{\mathrm{II}}(S,A)-\mathbb{E}\bm{F}^{\mathrm{II}}(S,A)\right\|_{2}
\displaystyle\triangleq\mathrm{(i)}+\mathrm{(ii)}+\mathrm{(iii)}.
  • For (i),

We apply the Matrix Bernstein inequality [66] to bound it. Take 𝒁(Si,Ai)=𝑭I(Si,Ai)𝔼𝑭I(Si,Ai)\bm{Z}(S_{i},A_{i})=\bm{F}^{\mathrm{I}}(S_{i},A_{i})-\mathbb{E}\bm{F}^{\mathrm{I}}(S_{i},A_{i}). We can verify that

\displaystyle\left\|\sum_{i=1}^{n}\mathbb{E}\bm{Z}(S_{i},A_{i})\{\bm{Z}(S_{i},A_{i})\}^{\intercal}\right\|_{2}\leq\mathbb{E}\sum_{i=1}^{n}\left\|\bm{Z}(S_{i},A_{i})\right\|^{2}_{2}
\displaystyle\leq 2n\sup_{s,a}\|\bm{F}^{\mathrm{I}}(s,a)\|^{2}_{2}\leq\frac{2n}{T^{\prime}}\sum_{t=0}^{T^{\prime}-1}\sup_{s,a}\|\bm{F}(s,a)\|^{2}_{2}\leq 2nR^{2}.

The same bound can be derived for i=1n𝔼{𝒁(Si,Ai)}𝒁(Si,Ai)2\left\|\sum_{i=1}^{n}\mathbb{E}\{\bm{Z}(S_{i},A_{i})\}^{\intercal}\bm{Z}(S_{i},A_{i})\right\|_{2}. Then for all x>0x>0,

Pr(1ni=1n𝑭I(Si,Ai)𝔼𝑭I(S,A)2x)Kexp(nx2/22R2+2xR/3).\displaystyle\mbox{Pr}\left(\left\|\frac{1}{n}\sum_{i=1}^{n}\bm{F}^{\mathrm{I}}(S_{i},A_{i})-\mathbb{E}\bm{F}^{\mathrm{I}}(S,A)\right\|_{2}\geq x\right)\leq K\exp\left(\frac{-nx^{2}/2}{2R^{2}+2xR/3}\right).

Given the condition that $K=\scalebox{0.7}{$\mathcal{O}$}(nT)$ and $T^{\prime}=\min\{T,2\log(nT)/\log(1/\kappa)\}$, we have

(i)=𝒪p(TTRlogKn)=𝒪p(Rlog(nT)nT).\displaystyle(i)=\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}\left(\frac{T^{\prime}}{T}\frac{R\sqrt{\log K}}{\sqrt{n}}\right)=\mathop{}\mathopen{}\mathcal{O}\mathopen{}_{\mathrm{p}}\left(\frac{R\log({\color[rgb]{0,0,0}nT})}{\sqrt{{\color[rgb]{0,0,0}nT}}}\right).

If T=𝒪(log(nT))T=\mathop{}\mathopen{}\mathcal{O}\mathopen{}(\log({\color[rgb]{0,0,0}nT})), then we do not need to analyze the remaining two components. In the following, we assume that T1=𝒪(1/log(nT))T^{-1}=\scalebox{0.7}{$\mathcal{O}$}(1/\log({\color[rgb]{0,0,0}nT})).

  • For (ii),

Pr(1ni=1n𝑭II(Si,Ai)𝔼𝑭II(S,A)2>x(Si,0,Ai,0),i=1,,n)\displaystyle\mbox{Pr}\left(\left\|\frac{1}{n}\sum_{i=1}^{n}\bm{F}^{\mathrm{II}}(S_{i},A_{i})-\mathbb{E}^{*}\bm{F}^{\mathrm{II}}(S,A)\right\|_{2}>x\mid(S_{i,0},A_{i,0}),i=1,\dots,n\right)
Pr(1ni=1n𝑭II(Si,Ai)𝔼𝑭II(S,A)2>x(Si,0,Ai,0),i=1,,n)\displaystyle\qquad-{\mbox{Pr}}^{*}\left(\left\|\frac{1}{n}\sum_{i=1}^{n}\bm{F}^{\mathrm{II}}(S_{i},A_{i})-\mathbb{E}^{*}\bm{F}^{\mathrm{II}}(S,A)\right\|_{2}>x\mid(S_{i,0},A_{i,0}),i=1,\dots,n\right)
\displaystyle\leq i=1n𝔾Tb((Si,0,Ai,0))𝔾TVi=1nϕ(Si,0,Ai,0)κT\displaystyle\sum_{i=1}^{n}\|\mathbb{G}_{T^{\prime}}^{b}(\cdot\mid(S_{i,0},A_{i,0}))-\mathbb{G}^{*}\|_{\mathrm{TV}}\leq\sum_{i=1}^{n}\phi(S_{i,0},A_{i,0})\kappa^{T^{\prime}}

Then

Pr(1ni=1n𝑭II(Si,Ai)𝔼𝑭II(Si,Ai)2>x)\displaystyle\mbox{Pr}\left(\left\|\frac{1}{n}\sum_{i=1}^{n}\bm{F}^{\mathrm{II}}(S_{i},A_{i})-\mathbb{E}^{*}\bm{F}^{\mathrm{II}}(S_{i},A_{i})\right\|_{2}>x\right)
Pr(1ni=1n𝑭II(S,A)𝔼𝑭II(Si,Ai)2>x)+nκT(s,a)ϕ(s,a)𝑑𝔾(s,a)\displaystyle\leq{\mbox{Pr}}^{*}\left(\left\|\frac{1}{n}\sum_{i=1}^{n}\bm{F}^{\mathrm{II}}(S,A)-\mathbb{E}^{*}\bm{F}^{\mathrm{II}}(S_{i},A_{i})\right\|_{2}>x\right)+n\kappa^{T^{\prime}}\int_{(s,a)}\phi(s,a)d\mathbb{G}(s,a)
\displaystyle\leq{\mbox{Pr}}^{*}\left(\left\|\frac{1}{n(T-T^{\prime})}\sum_{i=1}^{n}\sum_{t=T^{\prime}}^{T-1}\bm{F}(S_{i,t},A_{i,t})-\mathbb{E}^{*}\bm{F}(S,A)\right\|_{2}>x\right)+C_{s}n\kappa^{T^{\prime}}.

Next, we apply Theorem 4.2 in [9] to bound

Pr(1n(TT)i=1nt=TT1𝑭(Si,t,Ai,t)𝔼𝑭(S,A)2>x).{\mbox{Pr}}^{*}\left(\left\|\frac{1}{n(T-T^{\prime})}\sum_{i=1}^{n}\sum_{t=T^{\prime}}^{T-1}\bm{F}(S_{i,t},A_{i,t})-\mathbb{E}^{*}\bm{F}(S,A)\right\|_{2}>x\right).

To begin with, take $\bm{Z}^{*}(S_{i,t},A_{i,t})=\bm{F}(S_{i,t},A_{i,t})-\mathbb{E}^{*}\bm{F}(S,A)$. We can verify that

\displaystyle\left\|\sum_{i=1}^{n}\mathbb{E}\bm{Z}^{*}(S_{i},A_{i})\{\bm{Z}^{*}(S_{i},A_{i})\}^{\intercal}\right\|_{2}\leq\mathbb{E}\sum_{i=1}^{n}\left\|\bm{Z}^{*}(S_{i},A_{i})\right\|^{2}_{2}\leq 2n\sup_{s,a}\|\bm{F}(s,a)\|^{2}_{2}\leq 2nR^{2}.

The same bound can be derived for $\left\|\sum_{i=1}^{n}\mathbb{E}\{\bm{Z}^{*}(S_{i},A_{i})\}^{\intercal}\bm{Z}^{*}(S_{i},A_{i})\right\|_{2}$. We then adopt arguments similar to those in the proof of Case 2 of Lemma S3. Take $q=-3\log(nT)/\log\kappa$ in Theorem 4.2 of [9]. Then by Corollary 4.2 in [9], we can verify that $\beta(q)nT/q=\scalebox{0.7}{$\mathcal{O}$}(1)$ and $R\sqrt{q\log K}=\scalebox{0.7}{$\mathcal{O}$}(R\sqrt{nT})$, and we have

\displaystyle\left\|\frac{1}{n(T-T^{\prime})}\sum_{i=1}^{n}\sum_{t=T^{\prime}}^{T-1}\bm{F}(S_{i,t},A_{i,t})-\mathbb{E}^{*}\bm{F}(S,A)\right\|_{2}=\mathcal{O}_{\mathrm{p}}\left(\frac{R\sqrt{\log(nT)\log K}}{\sqrt{n(T-T^{\prime})}}\right)
  • For (iii),

\displaystyle\left\|\mathbb{E}^{*}\bm{F}^{\mathrm{II}}(S,A)-\mathbb{E}\bm{F}^{\mathrm{II}}(S,A)\right\|^{2}_{2}=\sum_{k}\left[\frac{1}{T-T^{\prime}}\sum_{t=T^{\prime}}^{T-1}\left\{\mathbb{E}^{*}f_{k}(S,A)-\mathbb{E}f_{k}(S_{t},A_{t})\right\}\right]^{2}
\displaystyle\leq\sum_{k}\frac{1}{T-T^{\prime}}\sum_{t=T^{\prime}}^{T-1}\left\{\mathbb{E}^{*}f_{k}(S,A)-\mathbb{E}f_{k}(S_{t},A_{t})\right\}^{2}
\displaystyle\leq\frac{1}{T-T^{\prime}}\sum_{t=T^{\prime}}^{T-1}\kappa^{2t}\int_{(s,a)}R^{2}\,d\mathbb{G}_{0}(s,a)
\displaystyle\leq\frac{1}{T-T^{\prime}}\sum_{t=T^{\prime}}^{T-1}\kappa^{2t}R^{2}\leq\frac{c_{22}}{T-T^{\prime}}\frac{R^{2}}{n^{4}T^{4}},

where c_{22}>0 is a constant depending on C_{\ref*{C_s}} and \kappa.

Combining all the bounds from (i), (ii) and (iii), under the stated conditions on a, b and R, we can derive that

\displaystyle\left\|\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\bm{F}(S_{i,t},A_{i,t})-\mathbb{E}\left\{\frac{1}{T}\sum_{t=0}^{T-1}\bm{F}(S_{t},A_{t})\right\}\right\|_{2}
\displaystyle=\mathcal{O}_{\mathrm{p}}\left(\frac{R\sqrt{\log K\log(nT)}}{\sqrt{nT}}\right)
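The rate above can be illustrated numerically. The Python sketch below is not part of the proof: it replaces the \beta-mixing sequence with i.i.d. mean-zero feature vectors of Euclidean norm exactly R (a simplifying assumption), with hypothetical values of K and R, and checks by Monte Carlo that the deviation \|(nT)^{-1}\sum\bm{F}-\mathbb{E}\bm{F}\|_{2} shrinks with nT and stays below R\sqrt{\log(nT)\log K/(nT)}:

```python
import numpy as np

rng = np.random.default_rng(0)
K, R = 50, 1.0  # number of basis functions and sup-norm bound (hypothetical values)

def mean_deviation(nT, reps=200):
    # Monte Carlo estimate of || (nT)^{-1} sum_i F_i - E F ||_2 for mean-zero
    # feature vectors drawn uniformly from the radius-R sphere in R^K
    devs = np.empty(reps)
    for r in range(reps):
        F = rng.normal(size=(nT, K))
        F *= R / np.linalg.norm(F, axis=1, keepdims=True)  # ||F_i||_2 = R exactly
        devs[r] = np.linalg.norm(F.mean(axis=0))
    return devs.mean()

for nT in [100, 400, 1600]:
    rate = R * np.sqrt(np.log(nT) * np.log(K) / nT)
    print(nT, round(mean_deviation(nT), 4), round(rate, 4))
```

In the dependent case the same qualitative behavior holds, with the effective sample size deflated by the mixing coefficients.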

S3 Additional Proofs and Lemmas

Lemma S5.

Suppose we take \bm{B}_{K}(\cdot,a) to be either a tensor-product B-spline basis or a tensor-product wavelet basis for every a\in\mathcal{A}, and assume that the average visitation probability (density) \bar{p}_{T}^{b}(s,a) is bounded above by a constant p_{\max} and below by a constant p_{\min}. Then there exist constants c_{*} and C_{*} such that

\displaystyle c_{*}\leq\lambda_{\min}\left\{\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1}\bm{B}_{K}(S_{t},A_{t})\bm{B}_{K}(S_{t},A_{t})^{\intercal}\right]\right\}\leq\lambda_{\max}\left\{\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1}\bm{B}_{K}(S_{t},A_{t})\bm{B}_{K}(S_{t},A_{t})^{\intercal}\right]\right\}\leq C_{*}. (S73)

In addition, when T\rightarrow\infty, we have

\displaystyle\frac{c_{*}}{2}\leq\lambda_{\min}\left\{\mathbb{E}_{(S,A)\sim G^{*}}\left[\bm{B}_{K}(S,A)\bm{B}_{K}(S,A)^{\intercal}\right]\right\}\leq\lambda_{\max}\left\{\mathbb{E}_{(S,A)\sim G^{*}}\left[\bm{B}_{K}(S,A)\bm{B}_{K}(S,A)^{\intercal}\right]\right\}\leq 2C_{*}.
Proof of Lemma S5.

Under the conditions in Lemma S5, by Lemma 2 in [62], for any a\in\mathcal{A}, there exist positive constants \tilde{C} and \tilde{c} such that

\displaystyle\tilde{c}\leq\lambda_{\min}\left\{\int_{s\in\mathcal{S}}\bm{B}_{K}(s,a)\bm{B}_{K}(s,a)^{\intercal}\,ds\right\}\leq\lambda_{\max}\left\{\int_{s\in\mathcal{S}}\bm{B}_{K}(s,a)\bm{B}_{K}(s,a)^{\intercal}\,ds\right\}\leq\tilde{C}. (S74)

Based on these results, we have

\displaystyle\lambda_{\max}\left\{\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1}\bm{B}_{K}(S_{t},A_{t})\bm{B}_{K}(S_{t},A_{t})^{\intercal}\right]\right\}
\displaystyle=\lambda_{\max}\left\{\overline{\mathbb{E}}\left\{\bm{B}_{K}(S,A)\bm{B}_{K}(S,A)^{\intercal}\right\}\right\}
\displaystyle\leq p_{\max}\max_{a\in\mathcal{A}}\left[\lambda_{\max}\left\{\int_{s\in\mathcal{S}}\bm{B}_{K}(s,a)\bm{B}_{K}(s,a)^{\intercal}\,ds\right\}\right]
\displaystyle\leq p_{\max}\tilde{C}.

Using similar arguments, we can show that

\displaystyle\lambda_{\min}\left\{\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1}\bm{B}_{K}(S_{t},A_{t})\bm{B}_{K}(S_{t},A_{t})^{\intercal}\right]\right\}
\displaystyle\geq p_{\min}\tilde{c}.

Taking c_{*}=p_{\min}\tilde{c} and C_{*}=p_{\max}\tilde{C}, the conclusion follows.

When T\rightarrow\infty, due to Assumption 4’(a),

\displaystyle\left\|\overline{p}_{T}^{b}-\mbox{Pr}^{*}\right\|_{\mathrm{TV}}=\int_{(s,a)}\left\|\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{G}_{t}^{b}(\cdot\mid(s,a))-\mathbb{G}^{*}(\cdot)\right\|_{\mathrm{TV}}dG_{0}(s,a)
\displaystyle\leq\int_{(s,a)}\frac{1}{T}\sum_{t=0}^{T-1}\phi(s,a)\kappa^{t}\,dG_{0}(s,a)\leq C_{\ref*{C_s}}\frac{1}{T}\frac{1-\kappa^{T}}{1-\kappa}=o(1).

Since \overline{p}_{T}^{b} is bounded below by p_{\min} and above by p_{\max}, the density of the stationary distribution is bounded below by p_{\min}/2 and above by 2p_{\max} for sufficiently large T. Then, following the same argument, we can show the bounds for \lambda_{\max}\{{\mathbb{E}}^{*}\left\{\bm{B}_{K}(S,A)\bm{B}_{K}(S,A)^{\intercal}\right\}\} and \lambda_{\min}\{{\mathbb{E}}^{*}\left\{\bm{B}_{K}(S,A)\bm{B}_{K}(S,A)^{\intercal}\right\}\}. ∎
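Lemma S5 can be illustrated numerically in its simplest instance. The sketch below is a toy example, not the lemma's full generality: it uses degree-0 B-splines (a normalized indicator basis on [0,1], so that \tilde{c}=\tilde{C}=1) and a state density bounded between hypothetical values p_{\min}=0.5 and p_{\max}=1.5, and checks that the eigenvalues of the empirical Gram matrix fall roughly within [p_{\min}\tilde{c},\,p_{\max}\tilde{C}]:

```python
import numpy as np

rng = np.random.default_rng(1)
K = 10  # number of basis functions (hypothetical)

def basis(s):
    # degree-0 B-splines on a uniform partition of [0,1]: indicator functions,
    # rescaled by sqrt(K) so that  int_0^1 B(s) B(s)^T ds = I
    j = np.minimum((s * K).astype(int), K - 1)
    B = np.zeros((len(s), K))
    B[np.arange(len(s)), j] = np.sqrt(K)
    return B

# draw states from a density bounded between p_min = 0.5 and p_max = 1.5:
# density 1.5 on [0, 0.5) and 0.5 on [0.5, 1]
n = 200_000
u = rng.random(n)
s = np.where(rng.random(n) < 0.75, 0.5 * u, 0.5 + 0.5 * u)
G = basis(s).T @ basis(s) / n    # empirical Gram matrix, approximates E[B B^T]
lams = np.linalg.eigvalsh(G)
print(lams.min(), lams.max())    # eigenvalues lie roughly within [0.5, 1.5]
```

For tensor-product splines of higher degree or wavelets the Gram matrix is banded rather than diagonal, but the same sandwich between p_{\min}\tilde{c} and p_{\max}\tilde{C} holds.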

Proof of Theorem 7.

We start by showing the “if” direction of the statement. Suppose \chi\leq c_{23} for some constant c_{23}>0.

First, we show that the following equality holds for any function f\in\mathbb{R}^{\mathcal{S}\times\mathcal{A}}:

\displaystyle f(s,a)=\frac{1}{1-\gamma}\mathbb{E}_{(S^{\prime},A^{\prime})\sim d^{\pi}(\cdot,\cdot\mid s,a)}[f(S^{\prime},A^{\prime})-\gamma(\mathcal{P}^{\pi}f)(S^{\prime},A^{\prime})\mid S=s,A=a],\quad\forall s\in\mathcal{S},a\in\mathcal{A}. (S75)

To see this, by the definition of d^{\pi}(\cdot,\cdot\mid S,A), we have

\displaystyle(1-\gamma)^{-1}\mathbb{E}_{(S^{\prime},A^{\prime})\sim d^{\pi}(\cdot,\cdot\mid s,a)}[f(S^{\prime},A^{\prime})-\gamma(\mathcal{P}^{\pi}f)(S^{\prime},A^{\prime})\mid S=s,A=a]
\displaystyle=\int_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid s^{\prime})\left(\sum_{t=0}^{\infty}\left\{f(s^{\prime},a^{\prime})-\gamma\mathbb{E}_{\tilde{S}\sim P(\cdot\mid s^{\prime},a^{\prime}),\tilde{a}\sim\pi(\cdot\mid\tilde{S})}\left[f(\tilde{S},\tilde{a})\mid s^{\prime},a^{\prime}\right]\right\}\gamma^{t}p_{t}^{\pi}(s^{\prime},a^{\prime}\mid S_{0}=s,A_{0}=a)\right)ds^{\prime}
\displaystyle=\int_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid s^{\prime})\left(\sum_{t=0}^{\infty}f(s^{\prime},a^{\prime})\gamma^{t}p_{t}^{\pi}(s^{\prime},a^{\prime}\mid S_{0}=s,A_{0}=a)\right)ds^{\prime}
\displaystyle\qquad\qquad\qquad-\int_{\tilde{s}\in\mathcal{S}}\sum_{\tilde{a}\in\mathcal{A}}\pi(\tilde{a}\mid\tilde{s})\left(\sum_{t=1}^{\infty}f(\tilde{s},\tilde{a})\gamma^{t}p_{t}^{\pi}(\tilde{s},\tilde{a}\mid S_{0}=s,A_{0}=a)\right)d\tilde{s}
\displaystyle=\int_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}\mid s^{\prime})\left(f(s^{\prime},a^{\prime})p_{0}^{\pi}(s^{\prime},a^{\prime}\mid S_{0}=s,A_{0}=a)\right)ds^{\prime}=f(s,a).
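Identity (S75) can also be verified numerically on a small finite MDP. The Python sketch below is purely illustrative: the transition kernel, target policy, state-action sizes nS, nA, and discount factor are arbitrary random choices. It builds \mathcal{P}^{\pi} as a matrix and the discounted visitation distribution d^{\pi}(\cdot,\cdot\mid s,a) explicitly, then checks that the right-hand side of (S75) reproduces f:

```python
import numpy as np

rng = np.random.default_rng(2)
nS, nA, gamma = 4, 2, 0.9  # arbitrary small sizes and discount factor

# random transition kernel P(s' | s, a) and target policy pi(a | s)
P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)
pi = rng.random((nS, nA)); pi /= pi.sum(axis=1, keepdims=True)

# P^pi as an (nS*nA) x (nS*nA) matrix: (P^pi f)(s, a) = E[f(S', A') | s, a]
Ppi = np.einsum('xay,yb->xayb', P, pi).reshape(nS * nA, nS * nA)

# discounted visitation d^pi(., . | s, a) = (1-gamma) sum_t gamma^t (P^pi)^t, row-wise
d = (1 - gamma) * np.linalg.inv(np.eye(nS * nA) - gamma * Ppi)

f = rng.random(nS * nA)
rhs = d @ (f - gamma * Ppi @ f) / (1 - gamma)  # right-hand side of (S75)
print(np.max(np.abs(f - rhs)))                 # ~ 0, so (S75) holds
```

In matrix form the identity is just d^{\pi}=(1-\gamma)(I-\gamma\mathcal{P}^{\pi})^{-1} applied to (I-\gamma\mathcal{P}^{\pi})f, which is the telescoping cancellation in the display above.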

Based on (S75), for any f\in\mathbb{R}^{\mathcal{S}\times\mathcal{A}}, we have

\displaystyle\bar{\mathbb{E}}f^{2}(S,A)=\frac{1}{(1-\gamma)^{2}}\bar{\mathbb{E}}\left\{\mathbb{E}_{(S^{\prime},A^{\prime})\sim d^{\pi}(\cdot,\cdot\mid S,A)}[f(S^{\prime},A^{\prime})-\gamma(\mathcal{P}^{\pi}f)(S^{\prime},A^{\prime})\mid S,A]\right\}^{2}
\displaystyle\leq\frac{1}{(1-\gamma)^{2}}\chi\bar{\mathbb{E}}\left\{(I-\gamma\mathcal{P}^{\pi})f(S,A)\right\}^{2},

where the inequality comes from the definition of \chi. Hence,

\displaystyle\Upsilon=\inf_{\bar{\mathbb{E}}f^{2}(S,A)\geq 1}\left[{\bar{\mathbb{E}}\left\{(I-\gamma\mathcal{P}^{\pi})f(S,A)\right\}^{2}}\right]\geq\frac{(1-\gamma)^{2}}{\chi}\geq\frac{(1-\gamma)^{2}}{c_{23}}>0.

Next, we show the “only if” direction of the statement. Suppose \Upsilon\geq c_{24} for some positive constant c_{24}.

Then the null space of I-\gamma\mathcal{P}^{\pi} is trivial, and for any f\in\mathbb{R}^{\mathcal{S}\times\mathcal{A}}, there exists \tilde{f}\in\mathbb{R}^{\mathcal{S}\times\mathcal{A}} such that

\displaystyle f=(I-\gamma\mathcal{P}^{\pi})\tilde{f}. (S76)

Now we can derive

\displaystyle\bar{\mathbb{E}}\left[\mathbb{E}_{(S^{\prime},A^{\prime})\sim d^{\pi}(\cdot,\cdot\mid S,A)}\left\{f(S^{\prime},A^{\prime})\mid S,A\right\}\right]^{2}
\displaystyle=\bar{\mathbb{E}}\left[\mathbb{E}_{(S^{\prime},A^{\prime})\sim d^{\pi}(\cdot,\cdot\mid S,A)}\left\{(I-\gamma\mathcal{P}^{\pi})\tilde{f}(S^{\prime},A^{\prime})\mid S,A\right\}\right]^{2}
\displaystyle=(1-\gamma)^{2}\bar{\mathbb{E}}\left[\tilde{f}(S,A)\right]^{2}
\displaystyle\leq(1-\gamma)^{2}\left[\inf_{\bar{\mathbb{E}}g^{2}(S,A)=1}{\bar{\mathbb{E}}\left\{(I-\gamma\mathcal{P}^{\pi})g(S,A)\right\}^{2}}\right]^{-1}\bar{\mathbb{E}}\left\{(I-\gamma\mathcal{P}^{\pi})\tilde{f}(S,A)\right\}^{2}
\displaystyle\leq(1-\gamma)^{2}c_{24}^{-1}\bar{\mathbb{E}}\left[f(S,A)\right]^{2},

where the first equality is from (S76), the second equality is due to (S75), and the last inequality follows from the lower bound on \Upsilon and (S76). Then

\displaystyle\chi\leq(1-\gamma)^{2}c_{24}^{-1}.
∎
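On a finite state-action space, \chi and \Upsilon reduce to extreme eigenvalues of weighted matrices, so Theorem 7 can be checked directly. The sketch below is illustrative only: the MDP, target policy, and behavior average distribution \mu (standing in for \bar{p}_{T}^{b}) are randomly generated, and the eigenvalue expressions are the finite analogues of the definitions of \chi and \Upsilon used above:

```python
import numpy as np

rng = np.random.default_rng(3)
nS, nA, gamma = 4, 2, 0.9  # arbitrary small sizes and discount factor

P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)  # transitions
pi = rng.random((nS, nA)); pi /= pi.sum(axis=1, keepdims=True)   # target policy
mu = rng.random(nS * nA); mu /= mu.sum()  # behavior average distribution (E-bar)

Ppi = np.einsum('xay,yb->xayb', P, pi).reshape(nS * nA, nS * nA)
B = np.eye(nS * nA) - gamma * Ppi                  # the operator I - gamma P^pi
d = (1 - gamma) * np.linalg.inv(B)                 # rows: d^pi(., . | s, a)
D = np.diag(mu); Dih = np.diag(mu ** -0.5)

# finite analogues: chi is the sup of E-bar[(E_{d^pi}[f])^2] / E-bar[f^2];
# Upsilon is the inf of E-bar[((I - gamma P^pi) f)^2] subject to E-bar[f^2] >= 1
chi = np.linalg.eigvalsh(Dih @ d.T @ D @ d @ Dih).max()
Upsilon = np.linalg.eigvalsh(Dih @ B.T @ D @ B @ Dih).min()
print(chi, Upsilon)
# both directions of Theorem 7: Upsilon >= (1-gamma)^2/chi and chi <= (1-gamma)^2/Upsilon
```

In this finite case the two proof directions pin the product \chi\Upsilon at (1-\gamma)^{2} up to the generality of the sup/inf, which is why each quantity bounds the other.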

Proof of Corollary 2.

By Jensen’s inequality and the definition of \bar{d}^{\pi}, we have

\displaystyle\bar{\mathbb{E}}\left[\mathbb{E}_{(S^{\prime},A^{\prime})\sim d^{\pi}(\cdot,\cdot\mid S,A)}\left\{f(S^{\prime},A^{\prime})\mid S,A\right\}\right]^{2}
\displaystyle\leq\bar{\mathbb{E}}\left[\mathbb{E}_{(S^{\prime},A^{\prime})\sim d^{\pi}(\cdot,\cdot\mid S,A)}\left\{f^{2}(S^{\prime},A^{\prime})\mid S,A\right\}\right]
\displaystyle={\mathbb{E}}_{(S,A)\sim\bar{d}^{\pi}}f^{2}(S,A).

Then under the condition (34), we obtain

\displaystyle\chi\leq\sup_{f\in\mathbb{R}^{\mathcal{S}\times\mathcal{A}}}\frac{{\mathbb{E}}_{(S,A)\sim\bar{d}^{\pi}}f^{2}(S,A)}{\bar{\mathbb{E}}f^{2}(S,A)}=\sup_{f\in\mathbb{R}^{\mathcal{S}\times\mathcal{A}}}\frac{\bar{\mathbb{E}}\left\{\frac{\bar{d}^{\pi}(S,A)}{\bar{p}_{T}^{b}(S,A)}f^{2}(S,A)\right\}}{\bar{\mathbb{E}}f^{2}(S,A)}\leq C_{\ref*{cnt:denbound}}\sup_{f\in\mathbb{R}^{\mathcal{S}\times\mathcal{A}}}\frac{\bar{\mathbb{E}}f^{2}(S,A)}{\bar{\mathbb{E}}f^{2}(S,A)}=C_{\ref*{cnt:denbound}}.

Under Assumption 7(a) and (b), we can show that \bar{d}^{\pi}(s,a)\leq p_{\max} for every (s,a)\in\mathcal{S}\times\mathcal{A}. Then we have

\displaystyle\mathbb{E}_{(S,A)\sim\bar{d}^{\pi}}\left[f(S,A)\right]^{2}=\bar{\mathbb{E}}\left[\frac{\bar{d}^{\pi}(S,A)}{\bar{p}_{T}^{b}(S,A)}f^{2}(S,A)\right]\leq\frac{p_{\max}}{p_{\min}}\bar{\mathbb{E}}f^{2}(S,A).

Hence,

\displaystyle\chi\leq\frac{p_{\max}}{p_{\min}}.

The bounds for \Upsilon can be obtained by applying Theorem 7.

For \psi(K), by taking f(S,A)=\bm{B}_{K}(S,A)^{\intercal}\bm{\alpha} for any \bm{\alpha} with \|\bm{\alpha}\|_{2}=1, we have

\displaystyle\bar{\mathbb{E}}\left[\bm{L}^{\intercal}_{K}(S,A)\bm{\alpha}\right]^{2}\geq\frac{(1-\gamma)^{2}}{C_{\ref*{cnt:denbound}}}\bar{\mathbb{E}}\left[\bm{B}^{\intercal}_{K}(S,A)\bm{\alpha}\right]^{2}\geq\frac{(1-\gamma)^{2}}{C_{\ref*{cnt:denbound}}}\lambda_{\min}\left\{\bar{\mathbb{E}}\,\bm{B}_{K}(S,A)\bm{B}_{K}(S,A)^{\intercal}\right\}\|\bm{\alpha}\|^{2}_{2}.

The conclusion follows. ∎

References

  • Athey et al. [2018] Athey, S., G. W. Imbens, and S. Wager (2018). Approximate residual balancing: debiased inference of average treatment effects in high dimensions. Journal of the Royal Statistical Society: Series B 80(4), 597–623.
  • Belloni et al. [2015] Belloni, A., V. Chernozhukov, D. Chetverikov, and K. Kato (2015). Some new asymptotic theory for least squares series: Pointwise and uniform results. Journal of Econometrics 186(2), 345–366.
  • Bertsekas [1995] Bertsekas, D. P. (1995). Dynamic programming and optimal control, Volume 1. Athena scientific Belmont, MA.
  • Bradley [2005] Bradley, R. C. (2005). Basic properties of strong mixing conditions. a survey and some open questions. Probability Surveys 2, 107–144.
  • Brockman et al. [2016] Brockman, G., V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016). OpenAI Gym. arXiv preprint arXiv:1606.01540.
  • Chan [1989] Chan, K. (1989). A note on the geometric ergodicity of a markov chain. Advances in applied probability 21(3), 702–704.
  • Chan et al. [2015] Chan, K. C. G., S. C. P. Yam, and Z. Zhang (2015). Globally efficient non-parametric inference of average treatment effects by empirical balancing calibration weighting. Journal of the Royal Statistical Society: Series B.
  • Chen [2007] Chen, X. (2007). Large sample sieve estimation of semi-nonparametric models. Handbook of econometrics 6, 5549–5632.
  • Chen and Christensen [2015] Chen, X. and T. M. Christensen (2015). Optimal uniform convergence rates and asymptotic normality for series estimators under weak dependence and weak conditions. Journal of Econometrics 188(2), 447–465.
  • Chen and Qi [2022] Chen, X. and Z. Qi (2022). On well-posedness and minimax optimal rates of nonparametric q-function estimation in off-policy evaluation. arXiv preprint arXiv:2201.06169.
  • Chua et al. [2018] Chua, K., R. Calandra, R. McAllister, and S. Levine (2018). Deep reinforcement learning in a handful of trials using probabilistic dynamics models. Advances in neural information processing systems 31.
  • De Boor [1976] De Boor, C. (1976). Splines as linear combinations of b-splines. a survey. Technical report, Wisconsin Univ Madison Mathematics Research Center.
  • Fan et al. [2016] Fan, J., K. Imai, H. Liu, Y. Ning, and X. Yang (2016). Improving covariate balancing propensity score: A doubly robust and efficient approach. Technical report, Technical report, Princeton University.
  • Farahmand et al. [2009] Farahmand, A. M., M. Ghavamzadeh, S. Mannor, and C. Szepesvári (2009). Regularized policy iteration. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou (Eds.), Advances in Neural Information Processing Systems 21, pp.  441–448. Curran Associates, Inc.
  • Farahmand et al. [2016] Farahmand, A.-m., M. Ghavamzadeh, C. Szepesvári, and S. Mannor (2016). Regularized policy iteration with nonparametric function spaces. The Journal of Machine Learning Research 17(1), 4809–4874.
  • Farahmand and Szepesvári [2012] Farahmand, A.-m. and C. Szepesvári (2012). Regularized least-squares regression: Learning from a β\beta-mixing sequence. Journal of Statistical Planning and Inference 142(2), 493–505.
  • Freedman [1975] Freedman, D. A. (1975). On tail probabilities for martingales. the Annals of Probability, 100–118.
  • Garreau et al. [2017] Garreau, D., W. Jitkrittum, and M. Kanagawa (2017). Large sample analysis of the median heuristic. arXiv preprint arXiv:1707.07269.
  • van de Geer [2000] van de Geer, S. A. (2000). Empirical Processes in M-estimation, Volume 6. Cambridge University Press.
  • Györfi et al. [2006] Györfi, L., M. Kohler, A. Krzyzak, and H. Walk (2006). A distribution-free theory of nonparametric regression. Springer Science & Business Media.
  • Hansen [1982] Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica: Journal of the Econometric Society, 1029–1054.
  • Hearst et al. [1998] Hearst, M. A., S. T. Dumais, E. Osuna, J. Platt, and B. Scholkopf (1998). Support vector machines. IEEE Intelligent Systems and their applications 13(4), 18–28.
  • Horowitz et al. [2004] Horowitz, J. L., E. Mammen, et al. (2004). Nonparametric estimation of an additive model with a link function. Annals of Statistics 32(6), 2412–2443.
  • Hu et al. [2021] Hu, X., M. Qian, B. Cheng, and Y. K. Cheung (2021). Personalized policy learning using longitudinal mobile health data. Journal of the American Statistical Association 116(533), 410–420.
  • Imai and Ratkovic [2014] Imai, K. and M. Ratkovic (2014). Covariate balancing propensity score. Journal of the Royal Statistical Society: Series B 76(1), 243–263.
  • Janner et al. [2019] Janner, M., J. Fu, M. Zhang, and S. Levine (2019). When to trust your model: Model-based policy optimization. Advances in Neural Information Processing Systems 32.
  • Kallus [2020] Kallus, N. (2020). Generalized optimal matching methods for causal inference. Journal of Machine Learning Research 21(62), 1–54.
  • Kallus and Uehara [2019] Kallus, N. and M. Uehara (2019). Double reinforcement learning for efficient off-policy evaluation in markov decision processes. arXiv preprint arXiv:1908.08526.
  • Kallus and Uehara [2020] Kallus, N. and M. Uehara (2020). Statistically efficient off-policy policy gradients. In International Conference on Machine Learning, pp. 5089–5100. PMLR.
  • Kallus and Uehara [2022] Kallus, N. and M. Uehara (2022). Efficiently breaking the curse of horizon in off-policy evaluation with double reinforcement learning. Operations Research.
  • Kang and Schafer [2007] Kang, J. D. and J. L. Schafer (2007). Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical science, 523–539.
  • Komorowski et al. [2018] Komorowski, M., L. A. Celi, O. Badawi, A. C. Gordon, and A. A. Faisal (2018). The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care. Nature medicine 24(11), 1716–1720.
  • Komunjer and Vuong [2010] Komunjer, I. and Q. Vuong (2010). Semiparametric efficiency bound in time-series models for conditional quantiles. Econometric Theory, 383–405.
  • Kosorok and Laber [2019] Kosorok, M. R. and E. B. Laber (2019). Precision medicine. Annual review of statistics and its application 6, 263–286.
  • Laber et al. [2014] Laber, E. B., D. J. Lizotte, M. Qian, W. E. Pelham, and S. A. Murphy (2014). Dynamic treatment regimes: Technical challenges and applications. Electronic journal of statistics 8(1), 1225.
  • Le et al. [2019] Le, H., C. Voloshin, and Y. Yue (2019). Batch policy learning under constraints. In International Conference on Machine Learning, pp. 3703–3712.
  • Levine et al. [2020] Levine, S., A. Kumar, G. Tucker, and J. Fu (2020). Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643.
  • Liao et al. [2018] Liao, P., W. Dempsey, H. Sarker, S. M. Hossain, M. al’Absi, P. Klasnja, and S. Murphy (2018). Just-in-time but not too much: Determining treatment timing in mobile health. Proceedings of the ACM on interactive, mobile, wearable and ubiquitous technologies 2(4), 179.
  • Liao et al. [2020] Liao, P., P. Klasnja, and S. Murphy (2020). Off-policy estimation of long-term average outcomes with applications to mobile health. Journal of the American Statistical Association, 1–10.
  • Liao et al. [2020] Liao, P., Z. Qi, and S. Murphy (2020). Batch policy learning in average reward markov decision processes. arXiv preprint arXiv:2007.11771.
  • Liu et al. [2018] Liu, Q., L. Li, Z. Tang, and D. Zhou (2018). Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems, pp. 5356–5366.
  • Luckett et al. [2019] Luckett, D. J., E. B. Laber, A. R. Kahkoska, D. M. Maahs, E. Mayer-Davis, and M. R. Kosorok (2019). Estimating dynamic treatment regimes in mobile health using v-learning. Journal of the American Statistical Association (just-accepted), 1–39.
  • Marcolino et al. [2018] Marcolino, M. S., J. A. Q. Oliveira, M. D’Agostino, A. L. Ribeiro, M. B. M. Alkmim, and D. Novillo-Ortiz (2018). The impact of mhealth interventions: systematic review of systematic reviews. JMIR mHealth and uHealth 6(1), e23.
  • Marling and Bunescu [2020] Marling, C. and R. Bunescu (2020). The OhioT1DM dataset for blood glucose level prediction: Update 2020. In CEUR Workshop Proceedings, Volume 2675, pp. 71.
  • McLeish [1974] McLeish, D. L. (1974). Dependent central limit theorems and invariance principles. the Annals of Probability 2(4), 620–628.
  • Meyn et al. [1995] Meyn, S., R. Tweedie, and J. Hibey (1995). Markov chains and stochastic stability. IEEE Transactions on Automatic Control 40(5), 979.
  • Murphy [2003] Murphy, S. A. (2003). Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 65(2), 331–355.
  • Murphy et al. [2001] Murphy, S. A., M. J. van der Laan, J. M. Robins, and C. P. P. R. Group (2001). Marginal mean models for dynamic regimes. Journal of the American Statistical Association 96(456), 1410–1423.
  • Nachum et al. [2019] Nachum, O., Y. Chow, B. Dai, and L. Li (2019). DualDICE: Behavior-agnostic estimation of discounted stationary distribution corrections. In Advances in Neural Information Processing Systems, pp. 2315–2325.
  • Nahum-Shani et al. [2016] Nahum-Shani, I., S. N. Smith, B. J. Spring, L. M. Collins, K. Witkiewitz, A. Tewari, and S. A. Murphy (2016). Just-in-time adaptive interventions (jitais) in mobile health: key components and design principles for ongoing health behavior support. Annals of Behavioral Medicine, 1–17.
  • Newey [1997] Newey, W. K. (1997). Convergence rates and asymptotic normality for series estimators. Journal of econometrics 79(1), 147–168.
  • Newey and Powell [2003] Newey, W. K. and J. L. Powell (2003). Instrumental variable estimation of nonparametric models. Econometrica 71(5), 1565–1578.
  • Paduraru [2007] Paduraru, C. (2007). Planning with approximate and learned models of Markov decision processes. Master’s thesis, University of Alberta.
  • Precup [2000] Precup, D. (2000). Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, 80.
  • Puterman [1994] Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc.
  • Robins et al. [2000] Robins, J. M., M. A. Hernan, and B. Brumback (2000). Marginal structural models and causal inference in epidemiology. Epidemiology 11(5), 550–560.
  • Robins et al. [1994] Robins, J. M., A. Rotnitzky, and L. P. Zhao (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American statistical Association 89(427), 846–866.
  • Rodbard [2009] Rodbard, D. (2009). Interpretation of continuous glucose monitoring data: glycemic variability and quality of glycemic control. Diabetes technology & therapeutics 11(S1), S–55.
  • Shi et al. [2018] Shi, C., A. Fan, R. Song, and W. Lu (2018). High-dimensional a-learning for optimal dynamic treatment regimes. Annals of statistics 46(3), 925.
  • Shi et al. [2021] Shi, C., R. Wan, V. Chernozhukov, and R. Song (2021). Deeply-debiased off-policy interval estimation. arXiv preprint arXiv:2105.04646.
  • Shi et al. [2020] Shi, C., R. Wan, R. Song, W. Lu, and L. Leng (2020). Does the markov decision process fit the data: Testing for the markov property in sequential decision making. arXiv preprint arXiv:2002.01751.
  • Shi et al. [2020] Shi, C., S. Zhang, W. Lu, and R. Song (2020). Statistical inference of the value function for reinforcement learning in infinite horizon settings. arXiv preprint arXiv:2001.04515.
  • Sutton and Barto [2018] Sutton, R. S. and A. G. Barto (2018). Reinforcement learning: An introduction. MIT press.
  • Tang et al. [2019] Tang, Z., Y. Feng, L. Li, D. Zhou, and Q. Liu (2019). Doubly robust bias reduction in infinite horizon off-policy estimation. arXiv preprint arXiv:1910.07186.
  • Tropp [2012] Tropp, J. A. (2012). User-friendly tail bounds for sums of random matrices. Foundations of computational mathematics 12(4), 389–434.
  • Tropp [2015] Tropp, J. A. (2015). An introduction to matrix concentration inequalities. arXiv preprint arXiv:1501.01571.
  • Tseng and Bertsekas [1991] Tseng, P. and D. P. Bertsekas (1991). Relaxation methods for problems with strictly convex costs and linear constraints. Mathematics of operations research 16(3), 462–481.
  • Uehara et al. [2020] Uehara, M., J. Huang, and N. Jiang (2020). Minimax weight and q-function learning for off-policy evaluation. In International Conference on Machine Learning, pp. 9659–9668. PMLR.
  • Uehara et al. [2021a] Uehara, M., M. Imaizumi, N. Jiang, N. Kallus, W. Sun, and T. Xie (2021a). Finite sample analysis of minimax offline reinforcement learning: Completeness, fast rates and first-order efficiency. arXiv preprint arXiv:2102.02981.
  • Uehara et al. [2021b] Uehara, M., M. Imaizumi, N. Jiang, N. Kallus, W. Sun, and T. Xie (2021b). Finite sample analysis of minimax offline reinforcement learning: Completeness, fast rates and first-order efficiency. arXiv preprint arXiv:2102.02981.
  • Wahba [1990] Wahba, G. (1990). Spline Models for Observational Data. Philadelphia: SIAM.
  • Wang et al. [2018] Wang, L., Y. Zhou, R. Song, and B. Sherwood (2018). Quantile-optimal treatment regimes. Journal of the American Statistical Association 113(523), 1243–1254.
  • Wang and Zubizarreta [2020] Wang, Y. and J. R. Zubizarreta (2020). Minimal dispersion approximately balancing weights: asymptotic properties and practical considerations. Biometrika 107(1), 93–105.
  • Wong and Chan [2018] Wong, R. K. W. and K. C. G. Chan (2018). Kernel-based covariate functional balancing for observational studies. Biometrika 105(1), 199–213.
  • Yu [1994] Yu, B. (1994). Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, 94–116.
  • Zhang et al. [2020] Zhang, R., B. Dai, L. Li, and D. Schuurmans (2020). GenDICE: Generalized offline estimation of stationary values. In International Conference on Learning Representations.
  • Zhao et al. [2015] Zhao, Y.-Q., D. Zeng, E. B. Laber, and M. R. Kosorok (2015). New statistical learning methods for estimating optimal dynamic treatment regimes. Journal of the American Statistical Association 110(510), 583–598.