Minimax Weight Learning for Absorbing MDPs
This research is supported by the National Key R&D Program of China (Nos. 2021YFA1000100 and 2021YFA1000101) and the Natural Science Foundation of China (No. 71771089).
Abstract
Reinforcement learning policy evaluation problems are often modeled as finite-horizon or discounted/averaged infinite-horizon Markov Decision Processes (MDPs). In this paper, we study undiscounted off-policy evaluation for absorbing MDPs. Given a dataset consisting of i.i.d. episodes under a given truncation level, we propose an algorithm (referred to as MWLA in the text) to directly estimate the expected return via the importance ratio of the state-action occupancy measure. A Mean Square Error (MSE) bound for the MWLA method is provided, and the dependence of statistical errors on the data size and the truncation level is analyzed. The performance of the algorithm is illustrated by means of computational experiments in an episodic taxi environment.
keywords:
Absorbing MDP, Off-policy, Minimax weight learning, Policy evaluation, Occupancy measure
1 Introduction
Off-policy evaluation (OPE) in reinforcement learning refers to the estimation of the expected return of a target policy using data collected by possibly different behavior policies. It is particularly important in situations where implementing new strategies is expensive, risky, or even dangerous, such as medicine (Murphy et al., 2001), education (Mandel et al., 2014), economics (Hirano et al., 2003), recommender systems (Li et al., 2011), and more. Currently available OPE procedures are mostly based on direct importance sampling (IS) techniques, which suffer from high variances that increase exponentially over the time horizon, known as “the curse of horizon” (Jiang and Li, 2016; Li et al., 2015).
A promising idea recently proposed uses marginalized importance sampling (MIS) to alleviate the curse of horizon, but it raises a new problem of how to estimate the marginalized importance ratios. For instance, in the case of an infinite-horizon discounted Markov Decision Process (MDP), Liu et al. (2018) compute the importance weights based on the state distribution by solving a minimax optimization problem, and propose a method to estimate the expected return. Moreover, Uehara et al. (2020) propose a Minimax Weight Learning (MWL) algorithm that directly estimates the ratio of the state-action distributions without relying on the specification of the behavior policy.
In many real applications, such as robotic tasks, the environment terminates at a random time once it evolves into certain states. In such situations, it is no longer appropriate to model the environment with finite-horizon or infinite-horizon MDPs alone. Instead, MDPs with absorbing states are suitable, where absorption reflects the termination of the process. Furthermore, from a theoretical perspective, absorbing MDPs extend the framework of infinite-horizon discounted MDPs (Altman, 1999), but the reverse does not hold (see Section 2.1 for more details).
The theory of absorbing MDPs has been extensively studied and is well understood. For example, results on the times required to reach the absorbing states, depending on both the states and the actions, can be found in Chatterjee et al. (2008) and Iida and Mori (1996), and the minimization of the expected undiscounted cost until the state enters the absorbing set in various applications (such as pursuit problems, transient programming, and first-passage problems) is investigated by Eaton and Zadeh (1962), Derman (1970), and Kushner (1971), among others. Other research efforts include the stochastic shortest path problem (Bertsekas and Tsitsiklis, 1991) and the control-to-exit-time problem (Kesten and Spitzer, 1975; Borkar, 1988), among a vast number of others.
In the context of learning, when the underlying distributions are unknown, many benchmark environments are indeed episodic with random horizons, such as board games (a game terminates once the winner is determined), trips through a maze, and dialog systems (a session terminates when the conversation is concluded) (Jiang, 2017); yet only limited efforts have been devoted specifically to absorbing MDPs. In this paper, we propose an MWL algorithm for offline RL involving absorbing MDPs, referred to as MWLA hereafter. Our proposed approach tackles two key challenges.
The first challenge pertains to the data structure. While an infinite-horizon MDP can be treated as an ergodic Markov chain under suitable assumptions, the same assumption is not always valid for absorbing MDPs because of their indefinite horizons and varying episode lengths, some of which can be quite short. Therefore, assuming that the collected data consist of i.i.d. tuples , which is crucial for the MWL method, is not always natural. Instead, we propose working with data that consist of trajectories, where each data point represents a single trajectory.
The second challenge arises from the random episode length and the expected undiscounted total reward. In absorbing MDPs, the length of an episode is indefinite, and it may not always be practical to observe extremely long episodes fully, for various reasons such as their length or cost. To address this issue, a simple but practical choice is to truncate long episodes at a level . If the expected total rewards are discounted with , it is easy to see that the errors resulting from such truncations can be controlled by a sufficiently small . With expected undiscounted total rewards, however, it has so far been unclear how to quantify these errors for absorbing MDPs.
Our proposed MWLA algorithm works with truncated episode data, and we provide a theoretical analysis of the errors resulting from episode truncation. This analysis helps to clarify the effects of episode truncation and to identify an appropriate truncation level under which the errors caused by truncation can be deemed acceptable.
Specifically, in this paper we derive an estimate of the expected undiscounted return of an absorbing MDP and establish an upper bound on the MSE of the MWLA algorithm. The bound consists of three parts: statistical, approximation, and optimization errors. The statistical error depends on both the truncation level and the data size. Moreover, we present a uniform bound on the MSE by optimizing over the truncation level when the truncation level is relatively large. We also demonstrate the effectiveness of our algorithm through numerical experiments in the episodic taxi environment.
The remainder of this paper is organized as follows: Section 2 introduces the model formulation and specifies some basic settings. The MWLA algorithm and its theoretical guarantees without knowledge of the behavior policy are presented in Section 3. Additionally, we discuss a parallel version of MWLA, referred to as MSWLA, for absorbing MDPs with a known behavior policy in Remark 3.2. In Section 4, under the assumption that the data consist of i.i.d. episodes, an MSE bound for the MWLA method is provided in Theorems 4.1 and 4.2. Specifically, when the function classes are VC classes, compared with Theorem 9 in Uehara et al. (2020), our statistical error is found to depend on the truncation length . Related work is discussed in Section 5, with more details clarifying its connections to and differences from the current work. In Section 6, a computer experiment in the episodic taxi environment is reported and compared with on-policy, naive-average, and MSWLA methods, where estimates of the returns and their MSEs are given under different episode numbers and truncation lengths. All theoretical proofs and the pseudo-code of the algorithm are deferred to the Appendix.
2 Basic setting
An MDP is a controllable rewarded Markov process, represented by a tuple of a state space , an action space , a reward distribution mapping a state-action pair to a distribution over the real numbers with expectation , a transition probability function and an initial state distribution .
In this paper, a policy refers to a time-homogeneous mapping from to the family of all distributions over , executed as follows. Starting with an initial state , at any integer time , an action is sampled, a scalar reward is collected, and a next state is then assigned by the environment. The space is assumed enumerable and is bounded. The probability distribution generated by under a policy and an initial distribution is denoted by , and is used for its expectation operator. When the initial state is , the probability distribution and expectation are indicated by and respectively, so that and The notations and are also used to indicate the probability and expectation generated by starting from the state-action pair and subsequently following the policy .
An absorbing state, represented by , is a state such that and for all . It is assumed that there is a unique absorbing state. For a trajectory, denote by the terminal time, where , and throughout the paper signifies “defined as”. An MDP is absorbing if for all states and all policies . Denote by the set of non-absorbing states. We make the following assumption on .
Assumption 2.1.
.
Write for the collection of all functions mapping a couple to a real number. Introduce a functional on by
and for any functions , let . Note that implicitly depends on the initial distribution .
The expected return under a policy is given by
(1)
which depends only on the transition probability and the mean reward function rather than the more detailed reward distribution functions.
For any , taking the indicator function as a particular mean reward function, i.e., collecting a unit reward at and zero otherwise, gives rise to the occupancy measure
(2)
where by Assumption 2.1. Conceptually, can also be retrieved from the Chapman–Kolmogorov equations
(3)
Conversely, with the measure , we can retrieve the expected return
as an integration of the mean reward function with respect to the occupancy measure. In addition, a simple recursive argument shows that
A direct result of this equality is a family of equations
(4)
Denote by . Then the system of equations (4) can equivalently be rewritten as
(5)
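To make equations (2)-(5) concrete, the following sketch computes the occupancy measure and the expected return of a tabular absorbing MDP by solving the linear occupancy equations directly; the array layout and function names are our own illustrative choices, not part of the paper's algorithm.

```python
import numpy as np

def occupancy_measure(P, pi, nu0):
    """Solve the occupancy-measure equations for a tabular absorbing MDP.

    Assumed (hypothetical) inputs:
      P   : array (S, A, S); P[s, a, s'] is the transition probability among
            the non-absorbing states (mass lost to the absorbing state simply
            leaves the system).
      pi  : array (S, A); pi[s, a] is the target policy's probability of a at s.
      nu0 : array (S,); initial distribution over the non-absorbing states.

    Returns d with d[s, a] = expected number of visits to (s, a) before
    absorption, i.e. the (unnormalized) occupancy measure.
    """
    S = pi.shape[0]
    # State-to-state kernel under pi: P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a).
    P_pi = np.einsum("sa,sat->st", pi, P)
    # State occupancy mu solves mu = nu0 + P_pi^T mu; absorption makes I - P_pi invertible.
    mu = np.linalg.solve(np.eye(S) - P_pi.T, nu0)
    # State-action occupancy: d(s, a) = mu(s) * pi(a|s).
    return mu[:, None] * pi

def expected_return(d, r):
    """Expected undiscounted return: sum over (s, a) of d(s, a) * r(s, a)."""
    return float(np.sum(d * r))
```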
Remark 2.1.
MDP models with absorbing states to maximize expected return (1) fundamentally differ from the usually analyzed discounted MDPs with infinite time-horizon. Here two points are discussed:
1. From the perspective of probability theory, MDP models with absorbing states that maximize the expected return generalize infinite-horizon MDPs that maximize the discounted expected return. To see this, consider an infinite-horizon discounted MDP . To create a new model with an absorbing state, we introduce a new state such that , define
keep the action space the same, and introduce a reward function by . The optimal policies of and are the same because the two models share the same Bellman optimality equation (a similar argument is discussed in Section 10.1 of Altman (1999)).
In the context of learning, and are essentially two different models because one generally needs to estimate the unknown parameter (the probability of transitioning to the absorbing state or the unknown discount factor) in , whereas is known in .
2. In general, an MDP with absorbing states that maximizes the expected return cannot necessarily be translated into an MDP that maximizes the expected discounted return. For example, one may consider the case where the transition probability from a state to the absorbing state under a policy depends on the state.
3 Minimax Weight Learning for absorbing MDP (MWLA)
Let
(6)
be a representative episode of an absorbing MDP with probability distribution and its i.i.d. copies, respectively. The objective is to estimate the expected return by using the dataset of i.i.d. trajectories , each of which is a realization of truncated at a prespecified time step , i.e.
The algorithm developed here, as indicated by the title “Minimax Weight Learning for Absorbing MDP” of this section and referred to as MWLA below, is an extension of the Minimax Weight Learning (MWL) algorithm of Uehara et al. (2020) for infinite-horizon discounted MDPs. Their method uses a discriminator function class to learn the importance weight (defined in equation (7) below) on state-action pairs. One of their important tools is the (normalized) discounted occupancy, which can be approximated well using the given discount factor and a suitable dataset (for example, a dataset consisting of i.i.d. tuples ). However, in our setting the normalized occupancy is not applicable since the reward is not discounted. Our method is essentially based on the occupancy measure defined in equation (2).
For any , let
(7)
Observe that
provided that implies . By replacing , and with their estimates , and , respectively, a plug-in approach suggests that can be simply estimated by
(8)
in which the crucial step is to estimate .
From equality (5), we formally introduce an error function
(9)
so that . Conversely, by taking a family of particular functions , as was done in (2), the uniqueness of the solution to this system of equations can be derived, as stated in the following theorem.
Note that if and only if , we can safely work with the system of equations , where , and throughout the remainder of the paper, of a vector or a matrix denotes its Euclidean norm.
Theorem 3.1.
The function is the unique bounded solution to the system of equations , provided that the following conditions hold:
1. The occupancy measure is the unique solution to the system of equations of ,
(10)
2. implies for all .
Theorem 3.1 simply states that
(11)
In real applications, is approximated by
where is a working class of , and treated as discriminators. For artificially designed , the class is only required to be bounded because it may be computationally inefficient to require for .
For any , define a function by . The theorem below, which gives a novel relationship between and the error function , will be helpful in bounding the estimation error of the occupancy-measure ratio by means of the minimax loss with a properly chosen discriminator class .
Theorem 3.2.
The error function allows for the expressions
(12)
Consequently,
1. , and
2. .
Here denotes the supremum norm.
The MWLA algorithm to estimate is now ready to be described as follows. Firstly, for all , define the trajectory-specific empirical occupancy measures and rewards by
(13)
and
Secondly, for any , introduce an empirical error function
(14)
so that an estimate of is then defined by
Putting them into equation (8), the expected return is consequently estimated by
(15)
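As a concrete illustration of equations (13) and (15), the sketch below builds the empirical occupancy measure and mean rewards from truncated episodes and combines them with a given weight estimate into the plug-in return estimate; the data format and all names are our own assumptions rather than the paper's notation.

```python
from collections import defaultdict

def empirical_occupancy(episodes, T):
    """Empirical occupancy measure and mean rewards from truncated episodes.

    episodes : list of trajectories, each a list of (state, action, reward)
               triples recorded before absorption (assumed data format).
    T        : truncation level; only the first T steps of each episode are used.
    Returns (d_hat, r_hat): dicts mapping (s, a) to the average number of
    visits per episode and the average observed reward, respectively.
    """
    n = len(episodes)
    visits, reward_sum = defaultdict(float), defaultdict(float)
    for ep in episodes:
        for (s, a, r) in ep[:T]:
            visits[(s, a)] += 1.0
            reward_sum[(s, a)] += r
    d_hat = {sa: c / n for sa, c in visits.items()}
    r_hat = {sa: reward_sum[sa] / visits[sa] for sa in visits}
    return d_hat, r_hat

def plug_in_return(d_hat, r_hat, w_hat):
    """Plug-in return estimate: sum over observed (s, a) of
    w_hat(s, a) * d_hat(s, a) * r_hat(s, a), cf. equations (8) and (15)."""
    return sum(w_hat.get(sa, 0.0) * d_hat[sa] * r_hat[sa] for sa in d_hat)
```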
Remark 3.1.
Compared to the MWL algorithm in Uehara et al. (2020), the MWLA algorithm described above utilizes truncated episodes instead of the tuples. Consequently, the accuracy of the estimation depends on two factors: the data size and the truncation level . Therefore, it is crucial to comprehend how errors vary at different levels of and . This will aid us in gaining a better understanding of the effects of episode truncation and identifying an appropriate truncation level under which the errors caused by truncation can be deemed acceptable. A detailed analysis on this matter is presented in the next section.
Below is an example of the MWLA algorithm applied to absorbing tabular MDPs.
Example 3.1.
Write and for the tabular model. Note that any matrix defines a map by . Take the function classes and . For every , denote by the -dimensional unit column vector with at its -th component, and let . For and and , the empirical error is
where
is an matrix, is the vectorization of by columns, and is an -vector. Therefore, the estimate is with the least-squares solution to the equation .
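A minimal sketch of this final step, assuming the matrix and vector of Example 3.1 have already been assembled from the data (our own notation): the weight estimate is the least-squares solution of the resulting linear system, optionally stabilized by a small ridge term.

```python
import numpy as np

def solve_tabular_weights(M, b, reg=0.0):
    """Least-squares solution of M w = b for the tabular weight estimate.

    M   : (SA, SA) matrix and b : (SA,) vector, assumed assembled as in
          Example 3.1 (our notation).
    reg : optional ridge term to keep the system well conditioned.
    """
    if reg > 0.0:
        return np.linalg.solve(M.T @ M + reg * np.eye(M.shape[1]), M.T @ b)
    return np.linalg.lstsq(M, b, rcond=None)[0]
```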
Remark 3.2 (A variant for known ).
If we define , then and from (5), we have that
For a given target policy , simply denote by , so that the equation above can be rewritten as
where .
With this equation, if the behavior policy is known, we can construct a corresponding estimate of the value function based on the minimax optimization problem:
For convenience, we refer to this method as the MSWLA (marginalized state weight learning for absorbing MDPs) algorithm, which is essentially an extension of the method discussed in Liu et al. (2018). By arguments similar to those in Example 3.1, in the tabular case where and the function classes and are , the empirical error function for the MSWLA algorithm is for any , where
, and for any , is the -dimensional column vector whose -th entry is and other elements are .
4 MSE bound of the estimated return
Denote by the commonly known Q-function and by the unique positive solution to the equation . Let us now analyze the error bound of with respect to the number of episodes and the truncation level , measured by the mean squared error (MSE), as provided in the following theorems.
The following technical assumption is necessary; it is also imposed as Assumption 2 in Uehara et al. (2020).
Assumption 4.1.
There exists a constant , such that .
Theorem 4.1.
Suppose that
1. there exists a common envelope of and , i.e. , for all , , satisfying
(16)
2. and have finite pseudo-dimensions and , respectively;
3.
4. ;
5. there exists such that .
Then we have the following:
1. When , there exists a constant independent of and , such that
2. When , there exists a constant independent of and , such that
In particular, if and , then
Theorem 4.2.
Suppose the assumptions in Theorem 4.1 hold and but .
(1) When , there exists a constant independent of , such that
(2) When , there exists a constant independent of , such that
Obviously, the additional term becomes when lies in the closure of under the metric .
Theorems 4.1 and 4.2 provide upper bounds on the MSE of the MWLA algorithm in several situations. They are expressed as functions of two key parameters: the truncation level and the number of episodes . When is small (i.e., ), the estimation error is composed of four parts: a pure truncation term , a cross term arising from the sampling randomness, an approximation error , and an optimization error . While the first two errors stem from the randomness of the statistics, the other two result from how closely the two function classes and approximate and , respectively. For a large (i.e., ), however, the pure truncation term and the cross term can be dominated by an -free term . This simply indicates that the MWLA algorithm can avoid the curse of horizon.
In the following are more remarks on the results.
Remark 4.1.
Consider the case . For the infinite horizon MDP with i.i.d. tuples , the error bound of the MWL method consists of a statistical error and an approximation error , where is the Rademacher complexity of the function class
as given in Theorem 9 of Uehara et al. (2020). Let , be the VC-subgraph dimension (i.e. pseudo-dimension) of respectively. Because (Corollary 1 of Uehara et al., 2021), the statistical error is dominated by . For MWLA with i.i.d. episodes, the MSE bound also consists of an approximation error and a statistical error. When , the statistical error is bounded by , including an extra factor in form.
Remark 4.2.
Remark 4.3.
Remark 4.4.
With an MSE bound, a confidence interval for the estimation error can easily be derived via Markov's inequality. That is, if and , then
As a result, for any given and , one can easily retrieve a sample complexity or , such that if or .
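For concreteness, the step from an MSE bound to a confidence statement is the standard Markov-inequality argument applied to the squared error, sketched below with our own generic symbols $\widehat{R}$ (the estimate), $R$ (the true return), and $\varepsilon$ (the accuracy level):

```latex
\[
\mathbb{P}\bigl(|\widehat{R} - R| \ge \varepsilon\bigr)
  = \mathbb{P}\bigl((\widehat{R} - R)^{2} \ge \varepsilon^{2}\bigr)
  \le \frac{\mathbb{E}\bigl[(\widehat{R} - R)^{2}\bigr]}{\varepsilon^{2}}
  = \frac{\mathrm{MSE}}{\varepsilon^{2}}.
\]
```

Setting the right-hand side equal to a prescribed confidence level and inverting the MSE bound in the number of episodes (and, where relevant, the truncation level) then yields the sample complexity mentioned above.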
5 Connections to related work
Research on off-policy evaluation (OPE) for MDPs with infinite and fixed finite horizons can be classified into two categories according to whether the behavior policy is known.
When the behavior policy is known, IS is a commonly used method that reweights the rewards obtained by the behavior policy according to the likelihood ratio of the evaluation policy over the behavior policy, providing unbiased estimates of the expected return. However, the IS method suffers from exponentially increasing variance over the time horizon because the ratio is computed as a cumulative product of the importance weights over actions at each time step (Precup, 2000). To reduce this extremely high variance, a series of OPE methods based on IS have been proposed. For example, the weighted importance sampling (WIS) method, the stepwise importance sampling method, and the doubly robust (DR) method can reduce the variance to a certain degree (Cassel et al., 1976; Robins et al., 1994; Robins and Rotnitzky, 1995; Bang and Robins, 2005). However, the exponential variance of IS-based methods cannot be significantly improved when the MDP has high stochasticity (Jiang and Li, 2016).
The MIS method offers a promising improvement over IS by successfully avoiding the problem of exponential variance. For example, for a finite-horizon inhomogeneous MDP, rather than reweighting whole trajectories, Xie et al. (2019) use a ratio with to reweight the rewards in order to achieve a lower variance. In an infinite-horizon setting, based on a discounted stationary distribution, Liu et al. (2018) propose using the ratio with . The ratio is then estimated by a minimax procedure with two function approximators: one to model a weight function , and the other to model , serving as a discriminator class for distribution learning.
For the case of unknown behavior policies, Hanna et al. (2019) show that the IS method with an estimated behavior policy has a lower asymptotic variance than the one with a known behavior policy. Fitted -iteration, which uses dynamic programming to fit directly from the data, can overcome the curse of dimensionality at the cost of assuming that the function class contains and is closed under the Bellman update , so as to avoid a high bias; see Ernst et al. (2005) and Le et al. (2019). Uehara et al. (2020) propose the MWL algorithm, which estimates the marginalized importance weight . The DualDICE algorithm is further proposed to estimate discounted stationary distribution ratios (Nachum et al., 2019a; Nachum et al., 2019b; Nachum and Dai, 2020), where the error function can be regarded as a derivative of the error function (loss function in their terminology) in Uehara et al. (2020). Jiang and Huang (2020) combine MWL and MQL into a unified value interval with a unique type of double robustness: if either the value-function class or the importance-weight class is correctly specified, the interval is valid, and its length measures the misspecification of the other class.
In reinforcement learning, while many benchmark environments are indeed episodic with random horizons, such as board games (a game terminates once the winner is determined), trips through a maze, and dialog systems (a session terminates when the conversation is concluded) (Jiang, 2017), only limited efforts have been devoted specifically to absorbing MDPs. Researchers often treat absorbing MDPs as special cases of finite-horizon MDPs by padding all trajectories (of random lengths) with absorbing states to the same length. Another practical way to handle absorption is to use the infinite-horizon setup (with a sufficiently large discount factor) and, whenever a trajectory terminates, to imagine it continuing indefinitely at the absorbing state. However, when the random horizons are not bounded and the random episodes are not observed completely, especially with undiscounted rewards, new issues arise. For example, how do the unobserved trajectories affect the results? As our results show, this problem is by no means trivial, and it is essentially neglected when the two workarounds mentioned above are applied directly.
The current paper deals with OPE for absorbing MDPs through the MWLA algorithm, a variant of MWL adapted to the random horizons and truncated episodic data modeled by absorbing MDPs, using episodic data rather than -tuple data. In addition, we explicitly analyze the dependence of the MSE bound on the truncation level and the data size, and derive a uniform MSE bound by optimization when the truncation level is relatively large.
6 Experiments
In this section, we present several computational experiments that showcase the performance of the MWLA and other relevant algorithms. We first describe the experimental settings and subsequently report and discuss the results.
6.1 Setting
The environment we employ is a version of Dietterich (2000)'s Taxi, a two-dimensional setup that simulates a taxi moving in a grid world, as shown in Figure 1. The four corners are marked as R(ed), B(lue), G(reen), and Y(ellow). Initially, the taxi randomly chooses a corner to wait for a passenger, who appears or disappears with probability at each of the four corners and wishes to be transported to one of the four corners (also chosen randomly). The taxi must pick up the passenger and drop them off at the destination. An episode ends once a passenger is successfully dropped off at the destination.
[Figure 1: the taxi grid world.]
There are a total of 2000 states (), formed from 25 taxi locations, 16 passenger appearance statuses, and 5 taxi statuses (empty, or carrying a passenger bound for one of the 4 destinations). There are four navigation actions that move the taxi one square North, South, East, or West, respectively. Each action yields a deterministic movement. Only 3 and 2 actions can be taken when the taxi is at a boundary and a corner, respectively. The taxi receives a reward of -1 at every time step while a passenger is on board, 0 when the passenger is successfully dropped off at the right place, and -2 at each time step while the taxi is empty.
As in Liu et al. (2018), we first run Q-learning with soft-max action preferences for 400,000 iterations to produce the target policy , and then run 60,000 iterations to produce another auxiliary policy ; both are regularized so that the probability of crossing any boundary of the grid is . The behavior policies are then formed as with . The true expected return of the target policy is approximated by a set of on-policy Monte Carlo trajectories, truncated at to ensure that the majority of trials have reached the absorbing state before time step .
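A minimal sketch of how the behavior policies and the truncated episode data of this experiment could be generated, assuming a tabular representation of the policies, a convex-combination mixing formula, and a simple reset/step interface for the taxi simulator (all of these are our assumptions, not the paper's code):

```python
import numpy as np

def mix_policies(pi_target, pi_aux, alpha):
    """Behavior policy as a convex mixture of target and auxiliary policies.

    pi_target, pi_aux : arrays (S, A) of action probabilities.
    alpha             : mixing ratio in [0, 1]; alpha = 0 recovers the target policy.
    """
    return (1.0 - alpha) * pi_target + alpha * pi_aux

def sample_episode(env_reset, env_step, policy, T, rng):
    """Generate one episode truncated at level T under a tabular policy.

    env_reset() -> initial state and env_step(s, a) -> (next_state, reward, done)
    are assumed interfaces of the taxi simulator.
    """
    s = env_reset()
    episode = []
    for _ in range(T):
        a = rng.choice(policy.shape[1], p=policy[s])
        s_next, r, done = env_step(s, a)
        episode.append((s, a, r))
        if done:          # absorbing state reached before truncation
            break
        s = s_next
    return episode
```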
6.2 Results
Reported here are the experimental results for MWLA, MSWLA, an on-policy baseline, IS, and a naive-average baseline. The on-policy algorithm (referred to as On-Policy below) estimates the expected return by directly averaging over a set of trajectories generated by the target policy itself, and the naive-average baseline (referred to as Naive Average below) does so by directly averaging over a different set of trajectories generated by the behavior policy, all truncated at . We also show the results of MWL applied to another set of simulated data.
The first experiment examines the MSEs of the five methods under , , , and trajectories and a set of truncation levels . A total of duplicates for every parameter combination are generated with different random seeds. The results are visualized in Figures 2 (for ) and 3 (for ), where every graph corresponds to a single episode size in the upper panels and a single truncation level in the lower panels. The MSEs of MWLA and MSWLA all decrease at the beginning and then vary slowly as the truncation level increases. MWLA is moderately better than MSWLA, and both have significantly lower MSEs than the on-policy, IS, and naive-average baselines.
[Figure 2] Legend: in the upper panels, the horizontal axis indicates the truncation level and the vertical axis the logarithm of the MSE; in the lower panels, the horizontal axis indicates the number of trajectories and the vertical axis the MSE, both on logarithmic scales.
[Figure 3] Legend: in the upper panels, the horizontal axis indicates the truncation level and the vertical axis the logarithm of the MSE; in the lower panels, the horizontal axis indicates the number of trajectories and the vertical axis the MSE, both on logarithmic scales.
The averaged estimated returns are also provided in Figure 4, together with twice the standard errors of the estimates , corresponding to 95% confidence intervals, where N=100 is the number of duplicates, is the estimated return of the -th duplicate, and is the sample standard deviation of . The estimates by MSWLA and MWLA both approach the expected return as the number of trajectories gets large. MSWLA has a slightly smaller bias than MWLA but significantly larger fluctuations, giving rise to a higher MSE, as indicated by Figures 2 and 3; even in the final graph of the bottom panel, the averaged estimated return deviates somewhat from the expected return for the largest data size due to the randomness of the data.
[Figure 4] Legend: 1. The horizontal axis indicates the number of trajectories and the vertical axis the MSE, both on logarithmic scales. 2. The blue lines represent the true expected returns.
The final experiment examines the performance of the MWL algorithm of Uehara et al. (2020) for estimating the expected undiscounted returns of absorbing MDPs by treating them as a special case of infinite-horizon MDPs with a subjectively chosen large discount factor . The experiment is carried out under the policy mixing ratio , truncation levels and , and numbers of trajectories , , , , . We again produce duplicates for every parameter combination to empirically evaluate the MSEs. Two methods are employed to estimate the expected returns. One is the direct MWLA with trajectory data as above. The other is the MWL algorithm under large discount factors , , and , where the data consist of all the tuples obtained by breaking up the trajectories, and the MSEs are computed from the errors between the estimates and the true expected (undiscounted) return. Here we note that the MWL algorithm actually estimates a quantity that is a function of the artificially associated discount factor and the target policy . The error between this quantity and the true value thus depends on the policy and, more importantly, also on the discount factor, so there may exist some optimal discount factor, unknown to the agent because the MDP model itself is unknown, that minimizes this error and hence the MSE. The empirical MSE results are depicted in Figure 5, where the horizontal axis again shows the logarithm of the number of trajectories.
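For reference, the conversion of truncated episodes into the tuple data consumed by the discounted MWL baseline can be sketched as follows; the tuple format and the absorbing-state convention are our assumptions, and the final transition of an episode cut off by truncation (rather than by absorption) is only approximated here:

```python
def episodes_to_tuples(episodes, absorbing_state):
    """Break truncated episodes into (s, a, r, s') transition tuples.

    Episodes are lists of (state, action, reward) triples.  The next state of
    the last recorded step is set to `absorbing_state`, which is exact for
    episodes that terminated naturally and an approximation for episodes that
    were cut off by the truncation level.
    """
    tuples = []
    for ep in episodes:
        for t, (s, a, r) in enumerate(ep):
            s_next = ep[t + 1][0] if t + 1 < len(ep) else absorbing_state
            tuples.append((s, a, r, s_next))
    return tuples
```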
[Figure 5] Legend: the horizontal axis indicates the number of trajectories and the vertical axis the MSE, both on logarithmic scales.
From the figure, we clearly have the following observations:
1. The MSEs are almost constant over the numbers of trajectories we experimented with, quite possibly implying that, compared with the variance, the bias contributes the most to the MSE.
2. The MSEs of the MWL algorithm vary over different , meaning that different values give rise to different errors of the MWL estimate relative to the true value (caused mainly by the bias, according to the previous item).
3. It appears that is optimal in our experiment. However, it is unclear how to identify an optimal theoretically, even approximately.
4. MWLA performs significantly better than MWL, which we attribute to the unbiasedness of MWLA obtained by taking .
7 Conclusions and discussions
This paper addresses an OPE problem for absorbing MDPs. We propose the MWLA algorithm as an extension of the MWL algorithm by Uehara et al. (2020). This MWLA algorithm proceeds with episodic data subject to truncations. The expected return of a target policy is estimated and an upper bound of its MSE is derived, in which the statistical error is related to both data size and truncation level. We also briefly discuss the MSWLA algorithm for situations where behavior policies are known. The numerical experiments demonstrate that the MWLA method has a lower MSE as the number of episodes and truncation length increase, significantly improving the accuracy of policy evaluation.
Conceptually, one can estimate the corresponding state-action value function from the estimated expected return, for example by fitted-Q evaluation. The doubly robust (DR) estimation algorithm, which integrates learning the weights and the state-action value functions , is an effective and robust approach. It remains unclear whether a DR variant of the MWLA algorithm can be developed.
In statistics, confidence intervals are important for quantifying the uncertainty of point estimates. Recent work in the RL area includes Shi et al. (2021), who propose a novel deeply-debiasing procedure to construct efficient, robust, and flexible confidence intervals for the values of target policies in infinite-horizon discounted MDPs, and Dai et al. (2020), who develop the CoinDICE algorithm for calculating confidence intervals. It would be interesting to combine these methods with MWLA for absorbing MDPs.
Moreover, we would like to note that policy optimization is one of the crucial goals of RL. Policy optimization based on the MWL family of algorithms has been analyzed by Liu et al. (2019), who propose an off-policy policy gradient technique for infinite-horizon discounted MDPs that uses MSWL to correct state-distribution shifts for the i.i.d. tuple data structure, and by Huang and Jiang (2020), who investigate the convergence of two off-policy policy gradient algorithms based on state-action density-ratio learning, among others. It is therefore important, and possible, to explore how the MWLA approach can be used in off-policy optimization for absorbing MDPs.
Finally, it should be noted that the state-action space considered in this paper is discrete. This choice is based on the fact that absorbing MDPs with discrete state-action spaces are prevalent in real-world applications. Moreover, our theoretical analysis relies heavily on the discreteness of the state-action space, as evidenced by, e.g., the proof of Lemma B.4 in the Appendix. For cases involving continuous state-action spaces, it is common practice to employ approximations by, e.g., linear models or deep neural networks, so extending the framework to continuous state-action spaces would require additional effort and consideration.
Acknowledgments
The authors thank the editors and referees for their helpful comments and suggestions.
References
- Altman (1999) Eitan Altman. Constrained Markov decision processes: stochastic modeling. Routledge, 1999.
- Bang and Robins (2005) Heejung Bang and James M Robins. Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4):962–973, 2005.
- Bertsekas and Tsitsiklis (1991) Dimitri P Bertsekas and John N Tsitsiklis. An analysis of stochastic shortest path problems. Mathematics of Operations Research, 16(3):580–595, 1991.
- Borkar (1988) Vivek S Borkar. A convex analytic approach to markov decision processes. Probability Theory and Related Fields, 78(4):583–602, 1988.
- Cassel et al. (1976) Claes M Cassel, Carl E Särndal, and Jan H Wretman. Some results on generalized difference estimation and generalized regression estimation for finite populations. Biometrika, 63(3):615–620, 1976.
- Chatterjee et al. (2008) Debasish Chatterjee, Eugenio Cinquemani, Giorgos Chaloulos, and John Lygeros. Stochastic control up to a hitting time: optimality and rolling-horizon implementation. arXiv preprint arXiv:0806.3008, 2008.
- Dai et al. (2020) Bo Dai, Ofir Nachum, Yinlam Chow, Lihong Li, Csaba Szepesvári, and Dale Schuurmans. Coindice: Off-policy confidence interval estimation. Advances in neural information processing systems, 33:9398–9411, 2020.
- Derman (1970) Cyrus Derman. Finite state markovian decision processes. Technical report, 1970.
- Dietterich (2000) Thomas G Dietterich. Hierarchical reinforcement learning with the maxq value function decomposition. Journal of artificial intelligence research, 13:227–303, 2000.
- Eaton and Zadeh (1962) Jo H Eaton and LA Zadeh. Optimal pursuit strategies in discrete-state probabilistic systems. 1962.
- Ernst et al. (2005) Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503–556, 2005.
- Hanna et al. (2019) Josiah Hanna, Scott Niekum, and Peter Stone. Importance sampling policy evaluation with an estimated behavior policy. In International Conference on Machine Learning, pages 2605–2613. PMLR, 2019.
- Hirano et al. (2003) Keisuke Hirano, Guido W Imbens, and Geert Ridder. Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 71(4):1161–1189, 2003.
- Huang and Jiang (2020) Jiawei Huang and Nan Jiang. On the convergence rate of density-ratio basedoff-policy policy gradient methods. In Neural Information Processing Systems Offline Reinforcement Learning Workshop, 2020.
- Iida and Mori (1996) Tetsuo Iida and Masao Mori. Markov decision processes with random horizon. Journal of the Operations Research Society of Japan, 39(4):592–603, 1996.
- Jiang (2017) Nan Jiang. A theory of model selection in reinforcement learning. PhD thesis, 2017.
- Jiang and Huang (2020) Nan Jiang and Jiawei Huang. Minimax value interval for off-policy evaluation and policy optimization. Advances in Neural Information Processing Systems, 33:2747–2758, 2020.
- Jiang and Li (2016) Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, pages 652–661. PMLR, 2016.
- Kesten and Spitzer (1975) Harry Kesten and Frank Spitzer. Controlled markov chains. The Annals of Probability, pages 32–40, 1975.
- Kushner (1971) H. J. Kushner. Introduction to Stochastic Control. Holt, Rinehart and Winston, New York, 1971.
- Le et al. (2019) Hoang Le, Cameron Voloshin, and Yisong Yue. Batch policy learning under constraints. In International Conference on Machine Learning, pages 3703–3712. PMLR, 2019.
- Li et al. (2011) Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the fourth ACM international conference on Web search and data mining, pages 297–306, 2011.
- Li et al. (2015) Lihong Li, Rémi Munos, and Csaba Szepesvári. Toward minimax off-policy value estimation. In Artificial Intelligence and Statistics, pages 608–616. PMLR, 2015.
- Liu et al. (2018) Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the curse of horizon: Infinite-horizon off-policy estimation. Advances in Neural Information Processing Systems, 31, 2018.
- Liu et al. (2019) Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. Off-policy policy gradient with state distribution correction. arXiv preprint arXiv:1904.08473, 2019.
- Mandel et al. (2014) Travis Mandel, Yun-En Liu, Sergey Levine, Emma Brunskill, and Zoran Popovic. Offline policy evaluation across representations with applications to educational games. In AAMAS, volume 1077, 2014.
- Murphy et al. (2001) Susan A Murphy, Mark J van der Laan, James M Robins, and Conduct Problems Prevention Research Group. Marginal mean models for dynamic regimes. Journal of the American Statistical Association, 96(456):1410–1423, 2001.
- Nachum and Dai (2020) Ofir Nachum and Bo Dai. Reinforcement learning via fenchel-rockafellar duality. arXiv preprint arXiv:2001.01866, 2020.
- Nachum et al. (2019a) Ofir Nachum, Yinlam Chow, Bo Dai, and Lihong Li. Dualdice: Behavior-agnostic estimation of discounted stationary distribution corrections. Advances in Neural Information Processing Systems, 32, 2019a.
- Nachum et al. (2019b) Ofir Nachum, Bo Dai, Ilya Kostrikov, Yinlam Chow, Lihong Li, and Dale Schuurmans. Algaedice: Policy gradient from arbitrary experience. arXiv preprint arXiv:1912.02074, 2019b.
- Pollard (2012) David Pollard. Convergence of stochastic processes. Springer Science & Business Media, 2012.
- Precup (2000) Doina Precup. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, page 80, 2000.
- Robins and Rotnitzky (1995) James M Robins and Andrea Rotnitzky. Semiparametric efficiency in multivariate regression models with missing data. Journal of the American Statistical Association, 90(429):122–129, 1995.
- Robins et al. (1994) James M Robins, Andrea Rotnitzky, and Lue Ping Zhao. Estimation of regression coefficients when some regressors are not always observed. Journal of the American statistical Association, 89(427):846–866, 1994.
- Sen (2018) Bodhisattva Sen. A gentle introduction to empirical process theory and applications. Lecture Notes, Columbia University, 11:28–29, 2018.
- Shi et al. (2021) Chengchun Shi, Runzhe Wan, Victor Chernozhukov, and Rui Song. Deeply-debiased off-policy interval estimation. In International Conference on Machine Learning, pages 9580–9591. PMLR, 2021.
- Uehara et al. (2020) Masatoshi Uehara, Jiawei Huang, and Nan Jiang. Minimax weight and q-function learning for off-policy evaluation. In International Conference on Machine Learning, pages 9659–9668. PMLR, 2020.
- Uehara et al. (2021) Masatoshi Uehara, Masaaki Imaizumi, Nan Jiang, Nathan Kallus, Wen Sun, and Tengyang Xie. Finite sample analysis of minimax offline reinforcement learning: Completeness, fast rates and first-order efficiency. arXiv preprint arXiv:2102.02981, 2021.
- Xie et al. (2019) Tengyang Xie, Yifei Ma, and Yu-Xiang Wang. Towards optimal off-policy evaluation for reinforcement learning with marginalized importance sampling. Advances in Neural Information Processing Systems, 32, 2019.
Appendix
Appendix A Proof of Theorems in Section 3
A.1. Proof of Theorem 3.1.
It suffices to prove the uniqueness. Note that, for all ,
By the condition that is the unique solution to (10), it follows that , i.e. .
A.2. Proof of Theorem 3.2.
With any policy , associate a one-step forward operator on , defined by
(17)
Then, is linear over . Moreover, the state-action function with respect to
(18)
and, by definition (9) of , the error function
(19)
By Theorem 3.1, we can further change the form of as
(20)
By the definition, is the Q-function of the reward under the policy . Therefore, Equation (18) states that
It then follows from putting the above into (20) that, for every ,
This gives the first equality of Equation (12). The second can be verified similarly by considering the function instead of .
We prove the remaining results in what follows.
(1). As a result of the first equality of part (1),
so that
provided that . The desired result (2) is thus proved.
(3). If , then substituting for in the second equality of part (1) gives rise to
The proof is thus complete.
Appendix B Proof of Theorems in Section 4
To begin with, the following two results are quoted for easy reference later on. Denote by the set generated by a function class and a data set . Then the empirical -covering number refers to the smallest number of balls of radius required to cover , where the distance is computed by the empirical -norm
Lemma B.1.
(Pollard, 2012) Let be a permissible class of functions and let be i.i.d. samples from some distribution . Then, for any given ,
(21)
For a class of functions on a measurable space equipped with a probability measure , the covering number refers to the smallest number of balls of radius required to cover , where distances are measured in terms of the -norm for all . The covering number can then be bounded in terms of the function class's pseudo-dimension:
Lemma B.2.
(Sen, 2018) Suppose that is a class of functions with measurable envelope (i.e. for any ) and has a finite pseudo-dimension . Then, for any , and any probability measure such that ,
where is a universal constant and .
Recall that in Theorem 4.1 we assume that the function classes and have finite pseudo-dimensions and , , and that there exists a function satisfying , for all , and . In addition, we assume that there exists such that . We also recall that our data consist of i.i.d. samples from under the behavior policy . The proof of Theorem 4.1 proceeds in the following three parts: decomposition, evaluation, and optimization.
B.1. Decomposition
First decompose the estimate defined in equation (15) as
(22)
where
and
Because , it follows that
where is defined in (14) and we have used the fact that
Using the fact
we further have
(23)
where
Substituting (13) into the expression of in (14), it follows that
where is assumed. Similarly,
Define
and
where is a constant specified later in B.2. Then
(24)
B.2. Evaluation
For any , let , where is the minimum integer no less than . Obviously, and
B.2.1. The upper bound of
An upper bound of is derived via a sequence of auxiliary results.
Lemma B.3.
With the constant , for any ,
Proof. For any and ,
Recall the notation . Due to the assumption (1) in Theorem 4.1, and are bounded by , respectively,
Therefore,
(25)
where the equality follows from the i.i.d. property of the trajectories. Further using the inequality for every ,
(26)
and
(27)
substituting (26) and (27) into (25) leads to the desired result.
Lemma B.4.
There exists a constant independent of such that for ,
Proof. For a representative trajectory of the form (6), denote by
We have that
It is easy to see that . Denote by . The distance in can be bounded by
Note that, in our setting,
where at the second inequality we use assumption (16). Let be the probability on such that . We get that
The same arguments imply that
As a result,
Note that . Due to the assumption (2) in Theorem 4.1, from Lemma B.2 we get that
Taking , we have
where . By Pollard's tail inequality (Lemma B.1),
(28)
For any , let . Then, by (28), we have that
which implies the desired result.
Lemma B.5.
There exists a constant independent of such that for ,
Proof. By (24),
(29)
Note that for any ,
(30)
where the last inequality comes from the assumption (5) in Theorem 4.1 and the setting of . Substituting (30) into (29) and applying Lemma B.3 and Lemma B.4, we get that
where . Since , it follows from (29) that
where . Let
we can readily get the desired result.
B.2.2. The upper bounds of and
Lemma B.6.
Denote by . We have that
Proof.
Define a truncated occupancy measure . Then
and hence
where the last inequality follows from Hölder’s inequality. Invoking the bounds from and , we get that
where the first inequality is due to and the second inequality follows from the fact that are i.i.d and have expectation . Observing that
we obtain that
Since and , we arrive at the conclusion.
Lemma B.7.
We have that
Proof. Noting that
applying the Hölder inequality, we have that
Therefore,
Because are i.i.d and follows distribution and is independent of , it follows that
When is given, is the sum of random variables that are independent with the same distribution . Hence
Therefore
which implies the desired result, since .
B.3. Optimization
Based on the above discussion, we get that for any truncation level and such that ,
where the first inequality follows from (22) and the simple inequality , the second inequality follows from (23), and the last inequality is due to Lemmas B.5–B.7. Letting
we have that, for any truncation level and ,
(31)
where
for . Note that
which, combined with the facts and , implies that there exists a unique such that
which implies and hence for all ,
Moreover, is decreasing for while increasing for .
Lemma B.8.
For any , and there exist constants such that
Proof. From , we know that is a solution of
Note that is increasing on and
We have that .
To prove the second assertion, we note that
On the other hand, when ,
for . Noting that for all , we can find a constant such that
for all and . Moreover, when ,
Consequently, there exists a constant such that
We are now in a position to finish the proof of Theorem 4.1.
Appendix C Algorithm Supplement
Algorithm 1 summarizes the pseudo-code of our MWLA algorithm applied to the taxi environment in Section 6.
The algorithm needs the following notation:
represents the absorbing state set,
The idea of the algorithm is explained in Example 3.1. Here, for the convenience of computations, we set
We also introduce a regularization factor , which helps us find the unique solution of the constrained quadratic programming problem . When is sufficiently small, the solution is an approximation of , where is the Moore–Penrose pseudo-inverse of . In our experiments, is set to .
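The claim that the regularized solution approximates the Moore–Penrose solution for small regularization can be checked numerically on a toy, rank-deficient system; the matrix, vector, and regularization values below are arbitrary illustrations, not the ones used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
M[:, -1] = M[:, 0]                    # make M deliberately rank deficient
b = rng.standard_normal(6)

w_pinv = np.linalg.pinv(M) @ b        # Moore-Penrose (minimum-norm least-squares) solution
for lam in (1e-2, 1e-4, 1e-6):
    # Tikhonov-regularized least squares converges to the pseudo-inverse solution as lam -> 0.
    w_reg = np.linalg.solve(M.T @ M + lam * np.eye(6), M.T @ b)
    print(f"lambda={lam:g}  ||w_reg - w_pinv|| = {np.linalg.norm(w_reg - w_pinv):.2e}")
```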