
Minimax Weight Learning for Absorbing MDPs

This research is supported by the National Key R&D Program of China (Nos. 2021YFA1000100 and 2021YFA1000101) and the Natural Science Foundation of China (No. 71771089).

Fengying Li, Yuqiang Li and Xianyi Wu (corresponding authors: Yuqiang Li at [email protected] and Xianyi Wu at [email protected])
School of Statistics, KLATASDS-MOE, East China Normal University, Shanghai 200062, PR China
Abstract

Reinforcement learning policy evaluation problems are often modeled as finite-horizon or discounted/averaged infinite-horizon Markov Decision Processes (MDPs). In this paper, we study undiscounted off-policy evaluation for absorbing MDPs. Given a dataset consisting of i.i.d. episodes truncated at a given level, we propose an algorithm (referred to as MWLA in the text) that directly estimates the expected return via the importance ratio of the state-action occupancy measure. A bound on the mean squared error (MSE) of the MWLA method is provided, and the dependence of the statistical error on the data size and the truncation level is analyzed. The performance of the algorithm is illustrated by means of computational experiments in an episodic taxi environment.

keywords:
Absorbing MDP, Off-policy, Minimax weight learning, Policy evaluation, Occupancy measure

1 Introduction

Off-policy evaluation (OPE) in reinforcement learning refers to the estimation of the expected returns of target policies using data collected by possibly different behavior policies. It is particularly important in situations where implementing new strategies is expensive, risky, or even dangerous, such as medicine (Murphy et al., 2001), education (Mandel et al., 2014), economics (Hirano et al., 2003), recommender systems (Li et al., 2011), and more. Currently available OPE procedures are mostly based on direct importance sampling (IS) techniques, which suffer from high variances that increase exponentially with the time horizon, known as "the curse of horizon" (Jiang and Li, 2016; Li et al., 2015).

A promising idea recently proposed uses marginalized importance sampling (MIS) to alleviate the curse of horizon, but it raises a new problem of how to estimate the marginalized importance ratios. For instance, in the case of an infinite-horizon discounted Markov Decision Process (MDP), Liu et al. (2018) compute the importance weights based on the state distribution by solving a minimax optimization problem and propose a method to estimate the expected return. Moreover, Uehara et al. (2020) propose a Minimax Weight Learning (MWL) algorithm that directly estimates the ratio of the state-action distributions without relying on the specification of the behavior policy.

In many real applications, such as robotic tasks, the environment terminates at a random time once it evolves into certain states. In such situations, it is no longer appropriate to model the environment using only finite-horizon or infinite-horizon MDPs. Instead, MDPs with absorbing states are suitable, where the absorption reflects the termination of the process. Furthermore, from a theoretical perspective, absorbing MDPs extend the framework of infinite-horizon discounted MDPs (Altman, 1999), whereas the reverse does not hold (see Remark 2.1 for more details).

The theory of absorbing MDPs has been extensively studied and is well understood. For example, results on the times required to reach the absorbing states, which depend on both the states and the actions, can be found in Chatterjee et al. (2008) and Iida and Mori (1996), and the minimization of the expected undiscounted cost until the state enters the absorbing set in various applications (such as pursuit problems, transient programming, and first-pass problems) is investigated by Eaton and Zadeh (1962), Derman (1970), and Kushner (1971), among others. Other research efforts include the stochastic shortest path problem (Bertsekas and Tsitsiklis, 1991) and the control-to-exit-time problem (Kesten and Spitzer, 1975; Borkar, 1988), among a vast number of others.

In the context of learning, when the underlying distributions are unknown, however, while many benchmark environments are indeed episodic and have random horizons, such as board games (a game terminates once the winner is determined), trips through a maze, and dialog systems (a session terminates when the conversation is concluded) (Jiang, 2017), there are only limited efforts specifically devoted to absorbing MDPs. In this paper, we propose an MWL algorithm for offline RL with absorbing MDPs, referred to as MWLA hereafter. Our proposed approach tackles two key challenges.

The first challenge pertains to the data structure. While an infinite-horizon MDP can be treated as an ergodic Markov chain under suitable assumptions, the same assumption is not always valid for absorbing MDPs because of their indefinite horizons and varying episode lengths, some of which can be quite short. Therefore, assuming that the collected data consist of i.i.d. tuples $(s_t,a_t,r_t,s_{t+1})$, which is crucial for the MWL method, is not always natural. Instead, we propose working with data that consist of trajectories, where each data point represents a single trajectory.

The second challenge arises from the random episode length and the expected undiscounted total reward. In absorbing MDPs, the length of an episode is indefinite, and it may not always be practical to observe extremely long episodes in full, for various reasons such as their length or cost. To address this issue, a simple but practical choice is to truncate long episodes at a level $H$. If the expected total rewards are discounted with $\gamma<1$, it is easy to see that the errors resulting from such truncations can be controlled by a sufficiently small $\gamma^H$. With expected undiscounted total rewards, however, how to quantify these errors has been unclear up to now for absorbing MDPs.

Our proposed MWLA algorithm deals with truncated episode data, and we provide a theoretical analysis of the errors resulting from episode truncation. As a result, it aids us in gaining a better understanding of the effects of episode truncation and identifying an appropriate truncation level under which the errors caused by truncation can be deemed acceptable.

Specifically, in this paper, we derive an estimate of the expected undiscounted return of an absorbing MDP and establish an upper bound on the MSE of the MWLA algorithm. The bound consists of three parts: a statistical error, an approximation error, and an optimization error. The statistical error depends on both the truncation level and the data size. Moreover, we present a uniform bound on the MSE by optimizing over the truncation level when the truncation level is relatively large. We also demonstrate the effectiveness of our algorithm through numerical experiments in the episodic taxi environment.

The remainder of this paper is organized as follows. Section 2 introduces the model formulation and specifies some basic settings. The MWLA algorithm and its theoretical guarantees without knowledge of the behavior policy are presented in Section 3. Additionally, in Remark 3.2 we discuss a parallel version of MWLA, referred to as MSWLA, for absorbing MDPs with a known behavior policy. In Section 4, under the assumption that the data consist of i.i.d. episodes, MSE bounds for the MWLA method are provided in Theorems 4.1 and 4.2. In particular, when the function classes are VC classes, a comparison with Theorem 9 of Uehara et al. (2020) shows that our statistical error is related to the truncation level $H$. Related work is discussed in Section 5, with more details clarifying its connections to and differences from the current work. In Section 6, a computer experiment in the episodic taxi environment is reported, comparing with the on-policy, naive-average, and MSWLA methods, where estimates of returns and their MSEs are given for different episode numbers and truncation levels. All theoretical proofs and the pseudo-code of the algorithm are deferred to the Appendix.

2 Basic setting

An MDP is a controllable rewarded Markov process, represented by a tuple $M=(\mathcal{S},\mathcal{A},\mathcal{R},P,\mu)$ of a state space $\mathcal{S}$, an action space $\mathcal{A}$, a reward distribution $\mathcal{R}$ mapping a state-action pair $(s,a)$ to a distribution $\mathcal{R}(s,a)$ over the set $\mathbb{R}$ of real numbers with expected value $R(s,a)$, a transition probability function $P:(s,a,s')\in\mathcal{S}\times\mathcal{A}\times\mathcal{S}\mapsto P(s'|s,a)\in[0,1]$, and an initial state distribution $\mu$.

In this paper, a policy $\pi:=\pi(a|s)$ refers to a time-homogeneous mapping from $\mathcal{S}$ to the family of all distributions over $\mathcal{A}$, executed as follows. Starting with an initial state $s_0\sim\mu$, at any integer time $t\geq 0$, an action $a_t\sim\pi(\cdot|s_t)$ is sampled, a scalar reward $r_t\sim\mathcal{R}(s_t,a_t)$ is collected, and a next state $s_{t+1}\sim P(\cdot|s_t,a_t)$ is then assigned by the environment. The space $\mathcal{S}\times\mathcal{A}$ is assumed countable and $R$ is bounded. The probability distribution generated by $M$ under a policy $\pi$ and an initial distribution $\mu$ is denoted by $\mathrm{P}_{\mu,\pi}$, and $\mathrm{E}_{\mu,\pi}$ is used for the corresponding expectation operator. When the initial state is $s_0=i$, the probability distribution and expectation are denoted by $\mathrm{P}_{i,\pi}$ and $\mathrm{E}_{i,\pi}$, respectively, so that $\mathrm{P}_{\mu,\pi}=\sum_{i\in\mathcal{S}}\mu(i)\mathrm{P}_{i,\pi}$ and $\mathrm{E}_{\mu,\pi}=\sum_{i\in\mathcal{S}}\mu(i)\mathrm{E}_{i,\pi}$. The notations $\mathrm{P}_{(s,a),\pi}$ and $\mathrm{E}_{(s,a),\pi}$ are also used to denote the probability and expectation generated by $M$ starting from the state-action pair $(s,a)$ and subsequently following the policy $\pi$.

An absorbing state, represented by $\xi\in\mathcal{S}$, is a state such that $r(\xi,a)=0$ and $P(\xi|\xi,a)=1$ for all $a\in\mathcal{A}$. It is assumed that there is a unique absorbing state. For a trajectory, denote by $T\doteq\min\{n\geq 1: s_n=\xi\}$ the terminal time, where, here and throughout the paper, $\doteq$ signifies "defined as". An MDP is absorbing if $\mathrm{P}_{i,\pi}(T<\infty)=1$ for all states $i$ and all policies $\pi$. Denote by $\mathcal{S}_0=\mathcal{S}\setminus\{\xi\}$ the set of non-absorbing states. We make the following assumption on $T$.

Assumption 2.1.

$\sup_{(s,a)\in\mathcal{S}_0\times\mathcal{A},\,\pi}\mathrm{E}_{(s,a),\pi}(T)<\infty$.

Write $\Xi=\{q:\mathcal{S}_0\times\mathcal{A}\mapsto\mathbb{R}\}$ for the collection of all functions mapping a pair $(s,a)$ to a real number. Introduce a functional $\Phi_\pi(q)$ on $\Xi$ by

$$\Phi_\pi(q)\doteq\mathrm{E}_{\mu,\pi}\Big[\sum_{t=0}^{T-1}q(s_t,a_t)\Big]$$

and, for any function $q:\mathcal{S}_0\times\mathcal{A}\mapsto\mathbb{R}$, let $q(s,\pi)\doteq\sum_{a\in\mathcal{A}}\pi(a|s)q(s,a)$. Note that $\Phi_\pi$ implicitly depends on the initial distribution $\mu$.

The expected return under a policy $\pi$ is given by

$$R_\pi\doteq\mathrm{E}_{\mu,\pi}\Big[\sum_{t=0}^{T-1}r_t\Big]=\mathrm{E}_{\mu,\pi}\Big[\sum_{t=0}^{\infty}R(s_t,a_t)\Big]=\Phi_\pi(R),\qquad(1)$$

which depends only on the transition probability and the mean reward function rather than the more detailed reward distribution functions.

For any $(s,a)\in\mathcal{S}_0\times\mathcal{A}$, taking the indicator function $\bm{1}_{(s,a)}$ as a particular mean reward function, i.e., collecting a unit reward at $(s,a)$ and zero otherwise, gives rise to the occupancy measure

$$d_\pi(s,a)\doteq\mathrm{E}_{\mu,\pi}\Big[\sum_{t=0}^{T-1}\bm{1}_{(s,a)}(s_t,a_t)\Big]=\sum_{t=0}^{\infty}\mathrm{P}_{\mu,\pi}(s_t=s,a_t=a)=\Phi_\pi(\bm{1}_{(s,a)}),\qquad(2)$$

where $d_\pi(s,a)<\infty$ by Assumption 2.1. Conceptually, $d_\pi$ can also be retrieved from the Chapman-Kolmogorov equations

$$d_\pi(s,a)=\mu(s)\pi(a|s)+\sum_{(s',a')\in\mathcal{S}_0\times\mathcal{A}}P(s|s',a')\pi(a|s)\,d_\pi(s',a'),\quad\text{for all }(s,a)\in\mathcal{S}_0\times\mathcal{A}.\qquad(3)$$

Conversely, with the measure $d_\pi$, we can retrieve the expected return

$$R_\pi=\sum_{(s,a)\in\mathcal{S}_0\times\mathcal{A}}R(s,a)d_\pi(s,a),$$

as an integration of the mean reward function with respect to the occupancy measure. In addition, a simple recursive argument shows that

$$\Phi_\pi(q)=\sum_{(s,a)\in\mathcal{S}_0\times\mathcal{A}}q(s,a)d_\pi(s,a)=\sum_{s\in\mathcal{S}_0}\mu(s)q(s,\pi)+\sum_{(s,a,s')\in\mathcal{S}_0\times\mathcal{A}\times\mathcal{S}_0}P(s'|s,a)q(s',\pi)d_\pi(s,a).$$

A direct result of this equality is a family of equations

$$\sum_{(s,a)\in\mathcal{S}_0\times\mathcal{A}}\Big[q(s,a)-\sum_{s'\in\mathcal{S}_0}P(s'|s,a)q(s',\pi)\Big]d_\pi(s,a)=\sum_{s'\in\mathcal{S}_0}\mu(s')q(s',\pi),\quad q\in\Xi.\qquad(4)$$

Write $d_\pi(s,a,s')=d_\pi(s,a)P(s'|s,a)$. Then the system of equations (4) can equivalently be rewritten as

$$\sum_{(s,a,s')\in\mathcal{S}_0\times\mathcal{A}\times\mathcal{S}_0}q(s',\pi)d_\pi(s,a,s')-\sum_{(s,a)\in\mathcal{S}_0\times\mathcal{A}}q(s,a)d_\pi(s,a)+\mathrm{E}_{s\sim\mu}\big[q(s,\pi)\big]=0,\quad q\in\Xi.\qquad(5)$$
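For a small tabular model with known dynamics, equation (3) and the identity $R_\pi=\sum_{(s,a)}R(s,a)d_\pi(s,a)$ can be implemented directly. The following is a minimal Python sketch (with hypothetical function and variable names, under the stated array-shape assumptions) that solves the linear system (3) for $d_\pi$ and recovers $R_\pi$; it is an illustration, not code from the paper.

```python
import numpy as np

def occupancy_and_return(P, R, mu, pi):
    """Solve the Chapman-Kolmogorov system (3) for the occupancy measure
    d_pi over non-absorbing state-action pairs, then recover R_pi.

    P  : (S, A, S) array, P[s, a, s'] = P(s'|s, a) among the S non-absorbing
         states (rows may sum to < 1; the deficit is the absorption probability).
    R  : (S, A) array of mean rewards.
    mu : (S,) initial distribution over non-absorbing states.
    pi : (S, A) array with pi[s, a] = pi(a|s).
    """
    S, A = R.shape
    n = S * A
    # Vectorize (s, a) -> s * A + a and build the fixed-point system
    #   d = b + M d, with b[(s,a)] = mu(s) pi(a|s) and
    #   M[(s,a), (s',a')] = P(s|s',a') pi(a|s).
    b = (mu[:, None] * pi).reshape(n)
    M = np.zeros((n, n))
    for s in range(S):
        for a in range(A):
            for sp in range(S):
                for ap in range(A):
                    M[s * A + a, sp * A + ap] = P[sp, ap, s] * pi[s, a]
    # (I - M) is invertible when absorption occurs in finite expected time
    # under every policy (Assumption 2.1).
    d = np.linalg.solve(np.eye(n) - M, b)   # occupancy measure d_pi
    R_pi = float(d @ R.reshape(n))          # expected return, equation (1)
    return d.reshape(S, A), R_pi
```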
Remark 2.1.

MDP models with absorbing states under the expected return (1) differ fundamentally from the commonly analyzed infinite-horizon discounted MDPs. Two points are discussed here:

1. From the perspective of probability theory, MDP models with absorbing states that maximize the expected return generalize infinite-horizon MDPs that maximize the discounted expected return. To see this, consider an infinite-horizon discounted MDP $M=(\mathcal{S}_0,\mathcal{A},r,P,\mu,\gamma)$. To create a new model $\tilde{M}=(\mathcal{S},\mathcal{A},\tilde{r},\tilde{P},\mu)$ with an absorbing state, we introduce a new state $\xi$ such that $\mathcal{S}=\mathcal{S}_0\cup\{\xi\}$, define

$$\tilde{P}(s'|s,a)=\begin{cases}\gamma P(s'|s,a) & \text{if } s,s'\in\mathcal{S}_0,\\ 1-\gamma & \text{if } s\in\mathcal{S}_0,\ s'=\xi,\\ 1 & \text{if } s=s'=\xi,\end{cases}$$

keep the action space the same, and introduce a reward function $\tilde{r}(s,a)=r(s,a)$ if $s\in\mathcal{S}_0$ and $\tilde{r}(s,a)=0$ otherwise (a code sketch of this construction is given after this remark). The optimal policies of $M$ and $\tilde{M}$ are the same because the two models share the same Bellman optimality equation (a similar argument is discussed in Section 10.1 of Altman (1999)).

In the context of learning, $M$ and $\tilde{M}$ are essentially two different models because one generally needs to estimate the unknown parameter $\gamma$ (the probability of transitioning to the absorbing state, or equivalently the unknown discount factor) in $\tilde{M}$, whereas $\gamma$ is known in $M$.

2. In general, an MDP with absorbing states under the expected-return criterion cannot necessarily be translated into an MDP under the expected discounted return criterion. For example, consider the case where the transition probability from a state to the absorbing state under a policy depends on the state.
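The construction in item 1 is easy to carry out for tabular models. Below is a minimal Python sketch, assuming arrays $P$ of shape $(S_0,A,S_0)$ and $r$ of shape $(S_0,A)$; the function name and array layout are illustrative choices, not part of the paper.

```python
import numpy as np

def discounted_to_absorbing(P, r, gamma):
    """Embed a discounted MDP (P, r, gamma) over S0 states into an absorbing
    MDP with one extra state xi, following item 1 of Remark 2.1.

    Returns (P_tilde, r_tilde) of shapes (S0+1, A, S0+1) and (S0+1, A),
    where state index S0 plays the role of xi.
    """
    S0, A, _ = P.shape
    P_t = np.zeros((S0 + 1, A, S0 + 1))
    P_t[:S0, :, :S0] = gamma * P      # gamma * P(s'|s,a) inside S0
    P_t[:S0, :, S0] = 1.0 - gamma     # jump to xi with probability 1 - gamma
    P_t[S0, :, S0] = 1.0              # xi is absorbing
    r_t = np.zeros((S0 + 1, A))
    r_t[:S0, :] = r                   # zero reward at xi
    return P_t, r_t
```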

3 Minimax Weight Learning for absorbing MDP (MWLA)

Let

$$Z=(s_0,a_0,s_1,a_1,\dots,s_{T-1},a_{T-1})\ \text{ and }\ Z_i=(s_0^i,a_0^i,s_1^i,a_1^i,\dots,s_{T_i-1}^i,a_{T_i-1}^i),\quad i=1,2,\dots,m\qquad(6)$$

be a representative episode of an absorbing MDP with probability distribution $\mathrm{P}_{\mu,\pi_b}$ and its i.i.d. copies, respectively. The objective is to estimate the expected return $R_{\pi_e}$ using a dataset $D$ of $m$ i.i.d. trajectories $\{\boldsymbol{\tau}^i, i=1,\dots,m\}$, each of which is a realization of $Z_i$ truncated at a prespecified time step $H$, i.e.

$$\boldsymbol{\tau}^i=\big(s_0^i,a_0^i,r_0^i,s_1^i,a_1^i,r_1^i,\dots,s_{T_i\wedge H-1}^i,a_{T_i\wedge H-1}^i,r_{T_i\wedge H-1}^i,s_{T_i\wedge H}^i\big),\quad i=1,2,\dots,m.$$

The algorithm developed here, as indicated by the title "Minimax Weight Learning for Absorbing MDP" of this section and referred to as MWLA below, is an extension of Uehara et al. (2020)'s Minimax Weight Learning (MWL) algorithm for infinite-horizon discounted MDPs. Their method uses a discriminator function class $\mathcal{Q}$ to learn the importance weight $w$ (defined in equation (7) below) on state-action pairs. One of their important tools is the (normalized) discounted occupancy, which can be approximated well using the given discount factor $\gamma$ and a suitable dataset (for example, a dataset consisting of i.i.d. tuples $(s,a,r,s')$). However, in our setting, the normalized occupancy is not applicable since the reward is not discounted. Our method is essentially based on the occupancy measure defined in equation (2).

For any $(s,a)\in\mathcal{S}_0\times\mathcal{A}$, let

$$w_{\frac{\pi_e}{\pi_b}}(s,a)\doteq\frac{d_{\pi_e}(s,a)}{d_{\pi_b}(s,a)},\quad\text{a.e. }d_{\pi_b}.\qquad(7)$$

Observe that

$$R_{\pi_e}=\sum_{(s,a)\in\mathcal{S}_0\times\mathcal{A}}R(s,a)d_{\pi_e}(s,a)=\sum_{(s,a)\in\mathcal{S}_0\times\mathcal{A}}w_{\frac{\pi_e}{\pi_b}}(s,a)R(s,a)d_{\pi_b}(s,a)=\Phi_{\pi_b}(w_{\frac{\pi_e}{\pi_b}}R),$$

provided that $d_{\pi_e}(s,a)>0$ implies $d_{\pi_b}(s,a)>0$. By replacing $d_{\pi_b}$, $R$ and $w_{\frac{\pi_e}{\pi_b}}$ with their estimates $\hat{d}_{\pi_b}$, $\hat{R}$ and $\hat{w}_{\frac{\pi_e}{\pi_b}}$, respectively, a plug-in approach suggests that $R_{\pi_e}$ can be estimated simply by

$$\hat{R}_{\pi_e}=\hat{\Phi}_{\pi_b}(\hat{w}_{\frac{\pi_e}{\pi_b}}\hat{R})\doteq\sum_{(s,a)\in\mathcal{S}_0\times\mathcal{A}}\hat{w}_{\frac{\pi_e}{\pi_b}}(s,a)\hat{R}(s,a)\hat{d}_{\pi_b}(s,a),\qquad(8)$$

in which the crucial step is to estimate $w_{\frac{\pi_e}{\pi_b}}$.

From equality (5), we formally introduce an error function

$$L(w,q)\doteq\sum_{(s,a,s')\in\mathcal{S}_0\times\mathcal{A}\times\mathcal{S}_0}w(s,a)q(s',\pi_e)d_{\pi_b}(s,a,s')-\sum_{(s,a)\in\mathcal{S}_0\times\mathcal{A}}w(s,a)q(s,a)d_{\pi_b}(s,a)+\mathrm{E}_{s\sim\mu}\big[q(s,\pi_e)\big],\qquad(9)$$

so that $L(w_{\frac{\pi_e}{\pi_b}},q)=0$ for all $q\in\Xi$. Conversely, by taking the family of particular functions $\{q(s,a)=\bm{1}_{(\bar{s},\bar{a})}(s,a):(\bar{s},\bar{a})\in\mathcal{S}_0\times\mathcal{A}\}$, as in (2), the uniqueness of the solution to this system of equations can be derived, as stated in the following theorem.

Note that $L(w_{\frac{\pi_e}{\pi_b}},q)=0$ for all $q\in\Xi$ if and only if $L(w_{\frac{\pi_e}{\pi_b}},q)=0$ for all $q\in\Xi$ with $\|q\|_2^2=1$; hence we can safely work with the system of equations $L(w,q)=0$, $\|q\|_2^2=1$, where, here and throughout the remainder of the paper, $\|\cdot\|_2$ denotes the Euclidean norm of a vector or a matrix.

Theorem 3.1.

The function $w=w_{\frac{\pi_e}{\pi_b}}$ is the unique bounded solution to the system of equations $\{L(w,q)=0:\|q\|_2^2=1\}$, provided that the following conditions hold:

1. The occupancy measure $d_{\pi_e}$ is the unique solution to the system of equations in $q$,

$$q(s,a)=\mu(s)\pi_e(a|s)+\sum_{(s',a')\in\mathcal{S}_0\times\mathcal{A}}P(s|s',a')\pi_e(a|s)q(s',a'),\quad(s,a)\in\mathcal{S}_0\times\mathcal{A}.\qquad(10)$$

2. $d_{\pi_e}(s,a)>0$ implies $d_{\pi_b}(s,a)>0$ for all $(s,a)\in\mathcal{S}_0\times\mathcal{A}$.

Theorem 3.1 simply states that

$$w_{\frac{\pi_e}{\pi_b}}=\operatorname*{argmin}_{w}\max_{\|q\|_2^2=1}L(w,q)^2.\qquad(11)$$

In real applications, $w_{\frac{\pi_e}{\pi_b}}$ is approximated by

$$w^*(s,a)\doteq\operatorname*{argmin}_{w\in\mathcal{W}}\max_{q\in\mathcal{Q}}L(w,q)^2,$$

where $\mathcal{W}\subset\Xi$ is a working class for $w_{\frac{\pi_e}{\pi_b}}$, and $\mathcal{Q}\subset\Xi$ is a class of functions treated as discriminators. For an artificially designed $\mathcal{Q}$, the class is only required to be bounded, because it may be computationally inefficient to require $\|q\|_2^2=1$ for every $q\in\mathcal{Q}$.

For any $(s,a)\in\mathcal{S}_0\times\mathcal{A}$, define a function $V_{s,a,\pi_e}\in\Xi$ by $V_{s,a,\pi_e}(s',a')\doteq\sum_{t=0}^{\infty}\mathrm{P}_{(s,a),\pi_e}(s_t=s',a_t=a')$. The theorem below, which gives a novel relationship between $w_{\frac{\pi_e}{\pi_b}}-w$ and the error function $L$, will be helpful in bounding the estimation error of the occupancy measure ratio by means of the minimax loss, via a properly chosen discriminator class $\mathcal{Q}$.

Theorem 3.2.

The error function allows for the expressions

$$L\big(w,V_{s',a',\pi_e}\big)=d_{\pi_e}(s',a')-w(s',a')d_{\pi_b}(s',a')\ \text{ and }\ L\big(w,V_{s',a',\pi_e}/d_{\pi_b}(s',a')\big)=w_{\frac{\pi_e}{\pi_b}}(s',a')-w(s',a').\qquad(12)$$

Consequently,

1. $\|d_{\pi_e}-wd_{\pi_b}\|_{\infty}\leq\max\limits_{q\in\mathcal{Q}}|L(w,q)|$ if $\{\pm V_{s',a',\pi_e}:(s',a')\in\mathcal{S}_0\times\mathcal{A}\}\subseteq\mathcal{Q}$, and

2. $\|w_{\frac{\pi_e}{\pi_b}}-w^*\|_{\infty}\leq\min\limits_{w\in\mathcal{W}}\max\limits_{q\in\mathcal{Q}}|L(w,q)|$ if $\{\pm V_{s',a',\pi_e}/d_{\pi_b}(s',a'):(s',a')\in\mathcal{S}_0\times\mathcal{A}\}\subseteq\mathcal{Q}$.

Here $\|\cdot\|_{\infty}$ denotes the supremum norm.

The MWLA algorithm to estimate $R_{\pi_e}$ is now ready to be described, as follows. Firstly, for all $(s,a,s')\in\mathcal{S}_0\times\mathcal{A}\times\mathcal{S}_0$, with $l_i\doteq T_i\wedge H$ the observed length of the $i$-th trajectory, define the trajectory-specific empirical occupancy measures and rewards by

$$\hat{d}^i_{\pi_b}(s,a)\doteq\sum_{t=0}^{l_i-1}\bm{1}_{(s,a)}(s_t^i,a_t^i),\qquad\hat{d}^i_{\pi_b}(s,a,s')\doteq\sum_{t=0}^{l_i-1}\bm{1}_{(s,a,s')}(s_t^i,a_t^i,s_{t+1}^i)\qquad(13)$$

and

$$\hat{R}^i(s,a)\doteq\frac{\sum_{t=0}^{l_i-1}r_t^i\bm{1}_{(s,a)}(s_t^i,a_t^i)}{\hat{d}^i_{\pi_b}(s,a)}\ \text{ if }\ \hat{d}^i_{\pi_b}(s,a)>0,\ \text{ and }\ 0\ \text{ otherwise}.$$
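These per-trajectory quantities are simple counts and averages. As an illustration, the sketch below (hypothetical names, assuming each truncated trajectory is stored as a list of transitions $(s,a,r,s')$) accumulates $\hat{d}^i_{\pi_b}(s,a)$, $\hat{d}^i_{\pi_b}(s,a,s')$ and $\hat{R}^i(s,a)$.

```python
from collections import defaultdict

def empirical_counts(traj):
    """Per-trajectory empirical occupancy measures and rewards, cf. (13).

    traj is a list of transitions (s, a, r, s_next) of length l_i = T_i ^ H.
    Returns dicts d_sa[(s, a)], d_sas[(s, a, s')] (raw visit counts) and
    R_hat[(s, a)] (average reward observed at (s, a) along this trajectory).
    """
    d_sa = defaultdict(int)
    d_sas = defaultdict(int)
    reward_sum = defaultdict(float)
    for s, a, r, s_next in traj:
        d_sa[(s, a)] += 1
        d_sas[(s, a, s_next)] += 1
        reward_sum[(s, a)] += r
    R_hat = {key: reward_sum[key] / d_sa[key] for key in d_sa}
    return d_sa, d_sas, R_hat
```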

Secondly, for any $w,q\in B(\mathcal{S}_0\times\mathcal{A})$, introduce an empirical error function

$$\hat{L}_m(w,q)=\frac{1}{m}\sum_{i=1}^m\sum_{(s,a,s')\in\mathcal{S}_0\times\mathcal{A}\times\mathcal{S}_0}w(s,a)q(s',\pi_e)\hat{d}^i_{\pi_b}(s,a,s')-\frac{1}{m}\sum_{i=1}^m\sum_{(s,a)\in\mathcal{S}_0\times\mathcal{A}}w(s,a)q(s,a)\hat{d}^i_{\pi_b}(s,a)+\mathrm{E}_{s\sim\mu}\big[q(s,\pi_e)\big],\qquad(14)$$

so that an estimate of $w_{\frac{\pi_e}{\pi_b}}$ is then defined by

$$\hat{w}_m(s,a)\doteq\operatorname*{argmin}_{w\in\mathcal{W}}\max_{q\in\mathcal{Q}}\hat{L}_m(w,q)^2.$$

Putting these into equation (8), the expected return $R_{\pi_e}$ is consequently estimated by

$$\hat{R}_{\pi_e,m}=\frac{1}{m}\sum_{i=1}^m\sum_{(s,a)\in\mathcal{S}_0\times\mathcal{A}}\hat{w}_m(s,a)\hat{R}^i(s,a)\hat{d}^i_{\pi_b}(s,a).\qquad(15)$$
Remark 3.1.

Compared to the MWL algorithm in Uehara et al. (2020), the MWLA algorithm described above utilizes truncated episodes instead of $(s,a,r,s')$ tuples. Consequently, the accuracy of the estimation depends on two factors: the data size $m$ and the truncation level $H$. Therefore, it is crucial to understand how the errors vary with different levels of $m$ and $H$. This will aid us in gaining a better understanding of the effects of episode truncation and in identifying an appropriate truncation level $H$ under which the errors caused by truncation can be deemed acceptable. A detailed analysis of this matter is presented in the next section.

Below is an example of the MWLA algorithm applied to absorbing tabular MDPs.

Example 3.1.

Write $\mathcal{S}_0=\{0,1,\dots,n-1\}$ and $\mathcal{A}=\{0,1,\dots,h-1\}$ for the tabular model. Note that any matrix $\bm{u}\doteq(u_{kl})\in\mathbb{R}^{n\times h}$ defines a map $\bm{u}:\mathcal{S}_0\times\mathcal{A}\mapsto\mathbb{R}$ by $\bm{u}(k,l)=u_{kl}$. Take the function classes $\mathcal{W}=\mathbb{R}^{n\times h}$ and $\mathcal{Q}=\{\bm{u}\in\mathbb{R}^{n\times h}:\|\bm{u}\|_2^2=1\}$. For every $0\leq k\leq n-1$, $0\leq l\leq h-1$, denote by $\bm{1}_{(k,l)}$ the $nh$-dimensional unit column vector with $1$ at its $(kh+l)$-th component, and let $\bm{1}_{(k,\pi_e)}=\sum_{l=0}^{h-1}\pi_e(l|k)\bm{1}_{(k,l)}$. For any $\bm{u}\in\mathcal{W}$ and $\bm{v}\in\mathcal{Q}$, the empirical error is

$$\hat{L}_m(\bm{u},\bm{v})=\frac{1}{m}\sum_{i=1}^m\sum_{(k,l,k')\in\mathcal{S}_0\times\mathcal{A}\times\mathcal{S}_0}\bm{u}(k,l)\bm{v}(k',\pi_e)\hat{d}^i_{\pi_b}(k,l,k')-\frac{1}{m}\sum_{i=1}^m\sum_{(k,l)\in\mathcal{S}_0\times\mathcal{A}}\bm{u}(k,l)\bm{v}(k,l)\hat{d}^i_{\pi_b}(k,l)+\sum_{k\in\mathcal{S}_0}\mu(k)\bm{v}(k,\pi_e)=(\vec{\bm{u}}^{\top}\hat{\bm{A}}+\bm{b}^{\top})\vec{\bm{v}},$$

where

$$\hat{\bm{A}}=\frac{1}{m}\sum_{i=1}^m\sum_{k=0}^{n-1}\sum_{l=0}^{h-1}\sum_{v=0}^{n-1}\bm{1}_{(k,l)}\bm{1}_{(v,\pi_e)}^{\top}\hat{d}^i_{\pi_b}(k,l,v)-\frac{1}{m}\sum_{i=1}^m\sum_{k=0}^{n-1}\sum_{l=0}^{h-1}\bm{1}_{(k,l)}\bm{1}_{(k,l)}^{\top}\hat{d}^i_{\pi_b}(k,l)=\frac{1}{m}\sum_{i=1}^m\sum_{t=0}^{T_i\wedge H-1}\bm{1}_{(s_t^i,a_t^i)}\Big[\sum_{a\in\mathcal{A}}\pi_e(a|s_{t+1}^i)\bm{1}^{\top}_{(s_{t+1}^i,a)}-\bm{1}^{\top}_{(s_t^i,a_t^i)}\Big]$$

is an $nh\times nh$ matrix, $\vec{\bm{u}}$ is the vectorized $\bm{u}$ by columns, and $\bm{b}=\sum_{(s,a)\in\mathcal{S}_0\times\mathcal{A}}\mu(s)\pi_e(a|s)\bm{1}_{(s,a)}$ is an $nh$-vector. Therefore,

$$\max_{\bm{v}:\|\bm{v}\|_2^2=1}\hat{L}^2_m(\bm{u},\bm{v})=\max_{\bm{v}:\|\bm{v}\|_2^2=1}\vec{\bm{v}}^{\top}(\hat{\bm{A}}^{\top}\vec{\bm{u}}+\bm{b})(\hat{\bm{A}}^{\top}\vec{\bm{u}}+\bm{b})^{\top}\vec{\bm{v}}=\|\hat{\bm{A}}^{\top}\vec{\bm{u}}+\bm{b}\|_2^2.$$

Consequently, the estimate is $\hat{w}_m=\hat{\bm{u}}$ with $\vec{\hat{\bm{u}}}$ the least-squares solution to the equation $\hat{\bm{A}}^{\top}\vec{\bm{u}}=-\bm{b}$.
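A minimal Python sketch of this tabular procedure is given below, assuming trajectories are stored as lists of $(s,a,r,s')$ transitions with a sentinel marking absorption (all names are hypothetical). It builds $\hat{\bm{A}}$ and $\bm{b}$, solves the least-squares problem $\hat{\bm{A}}^{\top}\vec{\bm{u}}=-\bm{b}$, and forms the plug-in estimate (15) using the identity $\hat{R}^i(s,a)\hat{d}^i_{\pi_b}(s,a)=\sum_t r_t^i\bm{1}_{(s,a)}(s_t^i,a_t^i)$.

```python
import numpy as np

def mwla_tabular(trajectories, pi_e, mu, n, h, absorbing=None):
    """Tabular MWLA in the spirit of Example 3.1 (a sketch, not the paper's code).

    trajectories : list of episodes, each a list of (s, a, r, s_next) with
                   s in {0,...,n-1}, a in {0,...,h-1}; s_next == absorbing
                   marks a transition into the absorbing state xi.
    pi_e : (n, h) array, pi_e[s, a] = pi_e(a|s).
    mu   : (n,) initial state distribution.
    Returns (w_hat, R_hat): the (n, h) weight estimate and the estimate (15).
    """
    m = len(trajectories)
    nh = n * h
    A_hat = np.zeros((nh, nh))
    for traj in trajectories:
        for s, a, _, s_next in traj:
            row = s * h + a
            A_hat[row, row] -= 1.0               # the -1_{(s_t,a_t)} term
            if s_next != absorbing:              # only s' in S_0 contributes
                for a_next in range(h):
                    A_hat[row, s_next * h + a_next] += pi_e[s_next, a_next]
    A_hat /= m
    b = (mu[:, None] * pi_e).reshape(nh)         # b = sum_s mu(s) pi_e(a|s) 1_{(s,a)}
    # w_hat is the least-squares solution of A_hat^T u = -b.
    u, *_ = np.linalg.lstsq(A_hat.T, -b, rcond=None)
    w_hat = u.reshape(n, h)
    # Plug-in return (15): since R^i(s,a) d^i(s,a) = sum_t r_t 1_{(s,a)}, the
    # double sum reduces to an average of w_hat-weighted rewards.
    R_hat = sum(w_hat[s, a] * r
                for traj in trajectories for s, a, r, _ in traj) / m
    return w_hat, R_hat
```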

Remark 3.2 (A variant for known $\pi_b$).

If we define $d_\pi(s)=\Phi_\pi(\bm{1}_{\{s\}})$, then $d_\pi(s,a)=d_\pi(s)\pi(a|s)$ and, from (5), we have that

$$\sum_{(s,a,s')\in\mathcal{S}_0\times\mathcal{A}\times\mathcal{S}_0}q(s',\pi)d_\pi(s)\pi(a|s)P(s'|s,a)-\sum_{s\in\mathcal{S}_0}q(s,\pi)d_\pi(s)+\mathrm{E}_{s\sim\mu}\big[q(s,\pi)\big]=0.$$

For a given target policy $\pi_e$, simply denote $q(s,\pi_e)$ by $q(s)$, so that the equation above can be rewritten as

$$\sum_{(s,a,s')\in\mathcal{S}_0\times\mathcal{A}\times\mathcal{S}_0}w_{\frac{\pi_e}{\pi_b}}(s)q(s')\frac{\pi_e(a|s)}{\pi_b(a|s)}d_{\pi_b}(s,a,s')-\sum_{s\in\mathcal{S}_0}w_{\frac{\pi_e}{\pi_b}}(s)q(s)d_{\pi_b}(s)+\mathrm{E}_{s\sim\mu}\big[q(s)\big]=0,$$

where $w_{\frac{\pi_e}{\pi_b}}(s)=\frac{d_{\pi_e}(s)}{d_{\pi_b}(s)}$.

With this equation, if the behavior policy $\pi_b$ is known, we can construct a corresponding estimate of the value function based on the minimax optimization problem:

$$\min_{w\in\mathcal{W}^s}\max_{q\in\mathcal{Q}^s}\Big(\sum_{(s,a,s')\in\mathcal{S}_0\times\mathcal{A}\times\mathcal{S}_0}w(s)q(s')\frac{\pi_e(a|s)}{\pi_b(a|s)}d_{\pi_b}(s,a,s')-\sum_{s\in\mathcal{S}_0}w(s)q(s)d_{\pi_b}(s)+\mathrm{E}_{s\sim\mu}\big[q(s)\big]\Big)^2.$$

For convenience, we refer to this method as the MSWLA (marginalized state weight learning for absorbing MDPs) algorithm, which is essentially an extension of the method discussed in Liu et al. (2018). By arguments similar to those in Example 3.1, in the tabular case where $\mathcal{S}_0=\{0,1,\dots,n-1\}$ and the function classes $\mathcal{W}^s$ and $\mathcal{Q}^s$ are $\mathbb{R}^n$, the empirical error function for the MSWLA algorithm is $\hat{L}_m(\bm{u},\bm{v})=(\bm{u}^{\top}\hat{\bm{A}}+\bm{b}^{\top})\bm{v}$ for any $\bm{u}\in\mathcal{W}^s$, $\bm{v}\in\mathcal{Q}^s$, where

$$\hat{\bm{A}}=\frac{1}{m}\sum_{i=1}^m\sum_{t=0}^{T_i\wedge H-1}\bm{1}_{\{s_t^i\}}\Big[\frac{\pi_e(a_t^i|s_t^i)}{\pi_b(a_t^i|s_t^i)}\bm{1}^{\top}_{\{s_{t+1}^i\}}-\bm{1}^{\top}_{\{s_t^i\}}\Big],$$

$\bm{b}=\sum_{s\in\mathcal{S}_0}\mu(s)\bm{1}_{\{s\}}$, and for any $s\in\mathcal{S}_0$, $\bm{1}_{\{s\}}$ is the $n$-dimensional column vector whose $s$-th entry is $1$ and whose other entries are $0$.
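A corresponding minimal sketch of the tabular MSWLA system (again with hypothetical names and the same trajectory format as above) only tracks states and reweights each transition by the per-step ratio $\pi_e(a|s)/\pi_b(a|s)$.

```python
import numpy as np

def mswla_system(trajectories, pi_e, pi_b, mu, n, absorbing=None):
    """Build the MSWLA matrix A_hat and vector b of Remark 3.2 and solve
    for the state-weight estimate (tabular, known behavior policy).

    trajectories : list of episodes of (s, a, r, s_next); s_next == absorbing
                   marks absorption.
    pi_e, pi_b   : (n, h) arrays of the target and behavior policies.
    mu           : (n,) initial state distribution.
    """
    m = len(trajectories)
    A_hat = np.zeros((n, n))
    for traj in trajectories:
        for s, a, _, s_next in traj:
            rho = pi_e[s, a] / pi_b[s, a]     # per-step importance ratio
            A_hat[s, s] -= 1.0
            if s_next != absorbing:
                A_hat[s, s_next] += rho
    A_hat /= m
    b = mu.copy()
    # As in Example 3.1, the state-weight estimate is the least-squares
    # solution of A_hat^T w = -b; the return can then be estimated by
    # reweighting rewards with w_hat[s_t] * pi_e/pi_b (cf. Liu et al., 2018).
    w_hat, *_ = np.linalg.lstsq(A_hat.T, -b, rcond=None)
    return A_hat, b, w_hat
```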

4 MSE bound of the estimated return

Denote by $Q_{\pi_e}(s,a)\doteq\mathrm{E}_{(s,a),\pi_e}\big(\sum_{t=0}^{T-1}r(s_t,a_t)\big)$ the commonly known Q-function and by $H_m$ the unique positive solution to the equation $2mx^2+2\ln m\ln x-\ln m=0$. Let us now analyze the error bound of $\hat{R}_{\pi_e,m}$ with respect to the number of episodes $m$ and the truncation level $H$, measured by the mean squared error (MSE), as provided in the following theorems.
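$H_m$ has no closed form, but since $x\mapsto 2mx^2+2\ln m\ln x-\ln m$ is strictly increasing on $(0,\infty)$ and changes sign below $x=1$, it is easy to compute numerically. The following sketch (helper names are ours) finds it by bisection and checks it against the bounds of Remark 4.2.

```python
import math

def H_m(m, tol=1e-12):
    """Unique positive root of 2 m x^2 + 2 ln(m) ln(x) - ln(m) = 0 for m > e,
    found by bisection (a minimal sketch)."""
    def f(x):
        return 2 * m * x**2 + 2 * math.log(m) * math.log(x) - math.log(m)
    lo = math.sqrt(math.log(m) / (2 * m))  # f(lo) < 0: lower bound of Remark 4.2
    hi = 1.0                               # f(1) = 2m - ln m > 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Consistency check against the bounds in Remark 4.2:
m = 10_000
assert math.sqrt(math.log(m) / (2 * m)) < H_m(m) < math.log(m) * math.sqrt(math.e / m)
```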

The following technical assumption is necessary; it is also imposed as Assumption 2 in Uehara et al. (2020).

Assumption 4.1.

There exists a constant $C_w>0$ such that $\sup_{(s,a)\in\mathcal{S}_0\times\mathcal{A}}w_{\frac{\pi_e}{\pi_b}}(s,a)\leq C_w$.

Theorem 4.1.

Suppose that

1. there exists a common envelope $G$ of $\mathcal{W}$ and $\mathcal{Q}$, i.e. $|w|\leq G$ and $|q|\leq G$ for all $w\in\mathcal{W}$, $q\in\mathcal{Q}$, satisfying

$$\Lambda_1:=\sum_{(s,a)\in\mathcal{S}_0\times\mathcal{A}}G(s,a)<+\infty\quad\text{and}\quad\Lambda_2:=\sum_{(s,a)\in\mathcal{S}_0\times\mathcal{A}}G^2(s,a)<+\infty;\qquad(16)$$

2. $\mathcal{W}$ and $\mathcal{Q}$ have finite pseudo-dimensions $D_{\mathcal{W}}$ and $D_{\mathcal{Q}}$, respectively;

3. Assumptions 2.1 and 4.1 hold;

4. $Q_{\pi_e}\in\mathcal{Q}$;

5. there exists $\lambda_0>0$ such that $\mathrm{E}_{\mu,\pi_b}(e^{\lambda_0 T})=M_0<\infty$.

Then we have the following:

1. When $M_0e^{-\lambda_0 H}>H_m$, there exists a constant $C$ independent of $H$ and $m$ such that

$$\mathrm{E}\big[(\hat{R}_{\pi_e,m}-R_{\pi_e})^2\big]\leq C\Big(e^{-2\lambda_0 H}+\frac{H^2\ln m}{m}\Big)+8\min_{w\in\mathcal{W}}\max_{q\in\mathcal{Q}}L(w,q)^2.$$

2. When $M_0e^{-\lambda_0 H}\leq H_m$, there exists a constant $C$ independent of $H$ and $m$ such that

$$\mathrm{E}\big[(\hat{R}_{\pi_e,m}-R_{\pi_e})^2\big]\leq C\frac{\ln^3 m}{m}+8\min_{w\in\mathcal{W}}\max_{q\in\mathcal{Q}}L(w,q)^2.$$

In particular, if $w_{\frac{\pi_e}{\pi_b}}\in\mathcal{W}$ and $M_0e^{-\lambda_0 H}\leq H_m$, then

$$\mathrm{E}\big[(\hat{R}_{\pi_e,m}-R_{\pi_e})^2\big]\leq C\frac{\ln^3 m}{m}.$$
Theorem 4.2.

Suppose that the assumptions of Theorem 4.1 hold with $m\geq e$, except that $Q_{\pi_e}\not\in\mathcal{Q}$.

(1) When $M_0e^{-\lambda_0 H}>H_m$, there exists a constant $C$ independent of $H$ and $m$ such that

$$\mathrm{E}\big[(\hat{R}_{\pi_e,m}-R_{\pi_e})^2\big]\leq C\Big(e^{-2\lambda_0 H}+\frac{H^2\ln m}{m}\Big)+16\min_{w\in\mathcal{W}}\max_{q\in\mathcal{Q}}L(w,q)^2+4\max_{w\in\mathcal{W}}\min_{q\in\mathcal{Q}}L^2(w,Q_{\pi_e}-q).$$

(2) When $M_0e^{-\lambda_0 H}\leq H_m$, there exists a constant $C$ independent of $H$ and $m$ such that

$$\mathrm{E}\big[(\hat{R}_{\pi_e,m}-R_{\pi_e})^2\big]\leq C\frac{\ln^3 m}{m}+16\min_{w\in\mathcal{W}}\max_{q\in\mathcal{Q}}L(w,q)^2+4\max_{w\in\mathcal{W}}\min_{q\in\mathcal{Q}}L^2(w,Q_{\pi_e}-q).$$

Obviously, the additional term $\max\limits_{w\in\mathcal{W}}\min\limits_{q\in\mathcal{Q}}L(w,Q_{\pi_e}-q)^2$ becomes $0$ when $Q_{\pi_e}$ lies in the closure of $\mathcal{Q}$ under the metric $\|\cdot\|_{\infty}$.

Theorems 4.1 and 4.2 provide upper bounds for the MSE of the MWLA algorithm in several situations. They are expressed as functions of two key parameters: the truncation level $H$ and the number of episodes $m$. When $H$ is small (i.e., $M_0e^{-\lambda_0 H}>H_m$), the estimation error is composed of four parts: a pure truncation term $e^{-2\lambda_0 H}$, a crossing term $H^2\ln m/m$ arising from the sampling randomness, an approximation error $\min_{w\in\mathcal{W}}\max_{q\in\mathcal{Q}}L^2(w,q)$, and an optimization error $\max_{w\in\mathcal{W}}\min_{q\in\mathcal{Q}}L^2(w,Q_{\pi_e}-q)$. While the first two errors stem from the randomness of the statistics, the other two result from how close the function classes $\mathcal{W}$ and $\mathcal{Q}$ are to $w_{\frac{\pi_e}{\pi_b}}$ and $Q_{\pi_e}$, respectively. For a large $H$ (i.e., $M_0e^{-\lambda_0 H}\leq H_m$), however, the pure truncation term $e^{-2\lambda_0 H}$ and the mixing term $H^2\ln m/m$ are dominated by an $H$-free term $C\ln^3 m/m$. This indicates that the MWLA algorithm can avoid the curse of horizon.

More remarks on the results follow.

Remark 4.1.

Consider the case $Q_{\pi_e}\in\mathcal{Q}$. For the infinite-horizon MDP with $m$ i.i.d. tuples $(s_i,a_i,s_i',r_i)$, the error bound of the MWL method consists of a statistical error $\frac{\ln m}{m}+\mathcal{R}_m^2(\mathcal{W},\mathcal{Q})$ and an approximation error $\min_{w\in\mathcal{W}}\max_{q\in\mathcal{Q}}L(w,q)^2$, where $\mathcal{R}_m(\mathcal{W},\mathcal{Q})$ is the Rademacher complexity of the function class

$$\big\{(s,a,s')\mapsto|w(s,a)(q(s',\pi)-q(s,a))|:w\in\mathcal{W},q\in\mathcal{Q}\big\},$$

as given in Theorem 9 of Uehara et al. (2020). Let $D_{\mathcal{W}}$ and $D_{\mathcal{Q}}$ be the VC-subgraph dimensions (i.e. pseudo-dimensions) of $\mathcal{W}$ and $\mathcal{Q}$, respectively. Because $\mathcal{R}_m(\mathcal{W},\mathcal{Q})=O\big(\sqrt{\max(D_{\mathcal{W}},D_{\mathcal{Q}})/m}\big)$ (Corollary 1 of Uehara et al., 2021), the statistical error is dominated by $\frac{\ln m}{m}$. For MWLA with $m$ i.i.d. episodes, the MSE bound also consists of an approximation error $\min_{w\in\mathcal{W}}\max_{q\in\mathcal{Q}}L(w,q)^2$ and a statistical error. When $M_0e^{-\lambda_0 H}\leq H_m$, the statistical error is bounded by $\frac{\ln^3 m}{m}$, which includes an extra factor $\ln^2 m$ in form.

Remark 4.2.

For $m>e$, one has $\sqrt{\ln m/(2m)}<H_m<\ln m\sqrt{e/m}$ (Lemma B.8 in the Appendix). Therefore, when $M_0e^{-\lambda_0 H}>H_m$, it follows that $\lambda_0 H\leq\ln M_0+\ln 2/2+\ln m/2$ and $\frac{H^2\ln m}{m}\leq C\frac{\ln^3 m}{m}$ for some constant $C$. Whatever $H$ is, the bounds in Theorem 4.1 are thus both less than

$$C\big(e^{-2\lambda_0 H}+\ln^3 m/m\big)+8\min_{w\in\mathcal{W}}\max_{q\in\mathcal{Q}}L(w,q)^2.$$
Remark 4.3.

In the tabular setting, if we take $\mathcal{W}=\{w:\|w\|_2\leq K_0\}$ and $\mathcal{Q}=\{q:\|q\|_{\infty}<K_1\}$, where $K_0$ is a constant larger than $C_w$ in Assumption 4.1, then all the assumptions in Theorem 4.1 hold. Hence,

$$\mathrm{E}\big[(\hat{R}_{\pi_e,m}-R_{\pi_e})^2\big]\leq C\Big(e^{-2\lambda_0 H}+\frac{\ln^3 m}{m}\Big).$$
Remark 4.4.

With an MSE bound, a confidence interval for the estimation error can be derived easily by Markov's inequality. That is, if $Q_{\pi_e}\in\mathcal{Q}$ and $w_{\frac{\pi_e}{\pi_b}}\in\mathcal{W}$, then

$$P(|\hat{R}_{\pi_e,m}-R_{\pi_e}|>\epsilon)<\frac{\mathrm{E}\big((\hat{R}_{\pi_e,m}-R_{\pi_e})^2\big)}{\epsilon^2}\leq\frac{C\big(e^{-2\lambda_0 H}+\frac{\ln^3 m}{m}\big)}{\epsilon^2}.$$

As a result, for any given $\delta$ and $\epsilon$, one can easily retrieve a sample complexity $m(\epsilon,\delta,H)$ or a truncation level $H(\epsilon,\delta,m)$ such that $P(|\hat{R}_{\pi_e,m}-R_{\pi_e}|>\epsilon)<\delta$ if $m>m(\epsilon,\delta,H)$ or $H>H(\epsilon,\delta,m)$.
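For instance, under the bound displayed above, splitting $\delta$ equally between its two terms gives one admissible choice:

$$\frac{C e^{-2\lambda_0 H}}{\epsilon^2}\leq\frac{\delta}{2}\ \text{ and }\ \frac{C\ln^3 m}{m\,\epsilon^2}\leq\frac{\delta}{2}
\quad\Longleftrightarrow\quad
H\geq\frac{1}{2\lambda_0}\ln\frac{2C}{\delta\epsilon^2}\ \text{ and }\ \frac{\ln^3 m}{m}\leq\frac{\delta\epsilon^2}{2C},$$

so $H(\epsilon,\delta,m)$ may be taken as the threshold on the left of the last display and $m(\epsilon,\delta,H)$ as the smallest integer satisfying the second inequality, which exists since $\ln^3 m/m\to 0$ as $m\to\infty$.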

5 Connections to related work

Research on off-policy evaluation (OPE) for MDPs with infinite and fixed finite horizons can be classified into two categories according to whether the behavior policy is known.

When the behavior policy is known, IS is a commonly used method that reweights the rewards obtained by the behavior policy according to the likelihood ratio of the evaluation policy $\pi_e$ over the behavior policy $\pi_b$, so as to provide unbiased estimates of the expected returns. However, the IS method suffers from exponentially increasing variance over the time horizon because the ratio is computed as a cumulative product of the per-step importance weight $\frac{\pi_e(a|s)}{\pi_b(a|s)}$ (Precup, 2000). To reduce this extremely high variance, a series of OPE methods have been proposed based on IS. For example, the weighted importance sampling (WIS) method, the stepwise importance sampling method, and the doubly robust (DR) method can reduce the variance to a certain degree (Cassel et al., 1976; Robins et al., 1994; Robins and Rotnitzky, 1995; Bang and Robins, 2005). However, the exponential variance of IS-based methods cannot be significantly improved when the MDP has high stochasticity (Jiang and Li, 2016).

The MIS method is a promising improvement over IS that successfully avoids the problem of exponential variance. For example, for a finite-horizon inhomogeneous MDP, instead of weighting whole trajectories, Xie et al. (2019) use a ratio $w_t(s)\frac{\pi_e(a|s)}{\pi_b(a|s)}$ with $w_t(s)=\frac{d_{\pi_e,t}(s)}{d_{\pi_b,t}(s)}$ to reweight the rewards $r$ in order to achieve a lower variance. In an infinite-horizon setting, based on a discounted stationary distribution, Liu et al. (2018) propose using the ratio $w_{\frac{\pi_e}{\pi_b}}(s)\cdot\frac{\pi_e(a|s)}{\pi_b(a|s)}$ with $w_{\frac{\pi_e}{\pi_b}}(s)=\frac{d_{\pi_e,\gamma}(s)}{d_{\pi_b,\gamma}(s)}$. The ratio $w_{\frac{\pi_e}{\pi_b}}(s)$ is then estimated by a minimax procedure with two function approximators: one to model the weight function $w_{\frac{\pi_e}{\pi_b}}(s)$, and the other to model $V^{\pi_b}$, serving as a discriminator class for distribution learning.

For the case of unknown behavior policies, Hanna et al. (2019) show that the IS method with an estimated behavior policy has a lower asymptotic variance than the one with a known behavior policy. Fitted $Q$-iteration, which uses dynamic programming to fit $Q_{\pi_e}$ directly from the data, can overcome the curse of dimensionality, at the cost of assuming that the function class contains $Q_{\pi_e}$ and is closed under the Bellman update $B^{\pi_e}$ so as to avoid a high bias; see Ernst et al. (2005) and Le et al. (2019). Uehara et al. (2020) propose the MWL algorithm, which estimates the marginalized importance weight $w_{\frac{\pi_e}{\pi_b}}(s,a)=\frac{d_{\pi_e,\gamma}(s,a)}{d_{\pi_b}(s,a)}$. The DualDICE algorithm was further proposed to estimate discounted stationary distribution ratios (Nachum et al., 2019a; Nachum et al., 2019b; Nachum and Dai, 2020), where the error function can be regarded as a derivative of the error function (loss function in their terminology) in Uehara et al. (2020). Jiang and Huang (2020) combine MWL and MQL into a unified value interval with a unique type of double robustness: if either the value-function class or the importance-weight class is correctly specified, the interval is valid, and its length measures the misspecification of the other class.

In reinforcement learning, while many benchmark environments are indeed episodic and have random horizons, such as board games (a game terminates once the winner is determined), trips through a maze, and dialog systems (a session terminates when the conversation is concluded) (Jiang, 2017), there are only limited efforts specifically devoted to absorbing MDPs. Researchers often treat absorbing MDPs as special cases of finite-horizon MDPs by padding all trajectories (of random lengths) with absorbing states to a common length. Another practical way to handle absorption is to use the infinite-horizon setup (with a sufficiently large discount factor) and, whenever a trajectory terminates, to imagine it continuing indefinitely at the absorbing state. However, when the random horizons are unbounded and the random episodes are not observed completely, especially with undiscounted rewards, new issues arise. For example, how do the unobserved tails of trajectories affect the results? As our results show, this problem is by no means trivial, yet it is essentially neglected when one simply applies the two approaches mentioned above.

The current paper deals with OPE for absorbing MDPs through the MWLA algorithm, a variant of MWL adapted to the random-horizon, truncated episodic data modeled by absorbing MDPs, using episodic data rather than $(s,a,r,s')$-tuple data. In addition, we explicitly analyze the dependence of the MSE bound on the truncation level and the data size, and derive a uniform bound on the MSE by optimization when the truncation level is relatively large.

6 Experiments

In this section, we present several computational experiments that showcase the performance of the MWLA and other relevant algorithms. We first describe the experimental settings and subsequently report and discuss the results.

6.1 Setting

The environment we employ is a version of Dietterich (2000)'s Taxi, a two-dimensional setup that simulates a taxi moving in a $5\times 5$ grid world, as shown in Figure 1. The four corners are marked R(ed), B(lue), G(reen), and Y(ellow). Initially, the taxi randomly chooses a corner to wait for a passenger, who appears or disappears with some probability at each of the four corners and wishes to be transported to one of the four corners (also chosen randomly). The taxi must pick up the passenger and drop him off at his destination. An episode ends once a passenger is successfully dropped off at his destination.

Figure 1: Taxi Grid

There are a total of 2000 states ($25\times 2^4\times 5$), formed from 25 taxi locations, $2^4$ passenger-appearance statuses, and 5 taxi statuses (empty, or carrying a passenger bound for one of the 4 destinations). There are four navigation actions that move the taxi one square North, South, East, or West, respectively. Each action yields a deterministic movement. Only 3 and 2 actions can be taken when the taxi is at a boundary or a corner, respectively. The taxi receives a reward of $-1$ at each time step while carrying a passenger, $0$ when the passenger is successfully dropped off at the right place, and $-2$ at each time step while empty.

As in Liu et al. (2018), we first run Q-learning with a soft-max action preference for 400,000 iterations to produce the target policy $\pi_e$ and then run 60,000 iterations to produce another auxiliary policy $\pi^+$; both are regularized so that the probability of crossing any boundary of the grid is 0. The behavior policies are then formed as $\pi_b=\alpha\pi_e+(1-\alpha)\pi^+$ with $\alpha\in\{0.2,0.4\}$. The true expected return of the target policy is approximated by a set of $2\times 10^6$ on-policy Monte-Carlo trajectories, truncated at $H=500$ to ensure that the majority of the trials have reached the absorbing state before time step $H$.
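As an illustration of this protocol, a minimal sketch of mixing the two policies and approximating the true return by truncated on-policy Monte Carlo is given below; it assumes tabular policy arrays and an environment object exposing `reset()`/`step()` methods that return (state) and (state, reward, done), and all names are hypothetical.

```python
import numpy as np

def mix_policies(pi_e, pi_plus, alpha):
    """Behavior policy pi_b = alpha * pi_e + (1 - alpha) * pi_plus,
    mixed row-wise over (n_states, n_actions) probability tables."""
    return alpha * pi_e + (1.0 - alpha) * pi_plus

def monte_carlo_return(env, pi, n_episodes=2_000_000, H=500, seed=0):
    """On-policy Monte-Carlo approximation of the expected undiscounted
    return, with episodes truncated at level H (assumed env interface)."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_episodes):
        s = env.reset()
        for _ in range(H):
            a = rng.choice(len(pi[s]), p=pi[s])
            s, r, done = env.step(a)
            total += r
            if done:          # absorbing state reached
                break
    return total / n_episodes
```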

6.2 Results

Reported here are the experimental results for MWLA, MSWLA, an on-policy method, IS, and a naive-averaging baseline. The on-policy algorithm (referred to as On-Policy below) estimates the expected return by a direct average over a set of trajectories generated by the target policy itself, and the naive-average baseline (referred to as Naive Average below) does so by a direct average over a different set of trajectories generated by the behavior policy, all truncated at $H$. We also show the results of MWL applied to another set of simulated data.

The first experiment examines the MSEs of the five methods with $m=15000$, $20000$, $30000$, $40000$ and $50000$ trajectories and a set of truncation levels $H=20,50,100,150,200$. A total of $100$ replications are generated for every parameter combination with different random seeds. The results are visualized in Figures 2 (for $\alpha=0.2$) and 3 (for $\alpha=0.4$), where each graph in the upper panels corresponds to a single episode size $m$, and each graph in the lower panels to a single truncation level $H$. The MSEs of MWLA and MSWLA all decrease at the beginning and then vary slowly as the truncation level increases. MWLA is moderately better than MSWLA, and both have significantly lower MSEs than the on-policy, IS, and naive-averaging baseline algorithms.

Figure 2: MSEs of the five algorithms ($\alpha=0.2$). Upper panels: the horizontal axis indicates the truncation level $H$ and the vertical axis the logarithm of the MSE. Lower panels: the horizontal axis indicates the number of trajectories and the vertical axis the MSE, both on logarithmic scales.
Figure 3: MSEs of the five algorithms ($\alpha=0.4$). Upper panels: the horizontal axis indicates the truncation level $H$ and the vertical axis the logarithm of the MSE. Lower panels: the horizontal axis indicates the number of trajectories and the vertical axis the MSE, both on logarithmic scales.

The averaged estimated returns are also provided; see Figure 4, together with twice the standard errors of the estimates, $\frac{1}{N}\sum_{i=1}^N\hat{R}_m^i\pm\frac{2S}{\sqrt{N}}$, corresponding to 95% confidence intervals, where $N=100$ is the number of replications, $\hat{R}_m^i$ is the estimated return of the $i$-th replication, and $S$ is the sample standard deviation of $\hat{R}_m^i$, $i=1,\dots,N$. The estimates by MSWLA and MWLA both approach the expected return as the number of trajectories grows. MSWLA has slightly smaller biases than MWLA but significantly larger fluctuations, giving rise to a higher MSE, as indicated by Figures 2 and 3; even in the final graph in the bottom panel, the deviation of the averaged estimated returns from the expected return is quite small for the largest data size, owing to the randomness of the data.

Figure 4: Estimated expected returns, shown in two panels ($\alpha=0.2$ and $\alpha=0.4$). The horizontal axis indicates the number of trajectories, on a logarithmic scale; the blue lines represent the true expected returns.

The final experiment examines the performance of the MWL algorithm of Uehara et al. (2020) for estimating the expected undiscounted returns of absorbing MDPs by treating them as a special case of infinite-horizon MDPs with a subjectively chosen large discount factor $\gamma$. The experiment is carried out under the policy mixing ratio $\alpha=0.2$, truncation levels $H=100$ and $150$, and numbers of trajectories $m=15000$, $20000$, $30000$, $40000$, $50000$. We again produce $N=100$ replications for every parameter combination to evaluate the MSEs empirically. Two methods are employed to estimate the expected returns. One is the direct MWLA with trajectory data, as above. The other is the MWL algorithm under large discount factors $\gamma=0.97$, $0.98$, $0.99$ and $0.995$, where the data consist of all the tuples $(s,a,r,s')$ obtained by breaking up the $m$ trajectories, and the MSEs are computed from the errors between the estimates and the true expected (undiscounted) return. Here, we need to note that the MWL algorithm really estimates some quantity $A(\gamma,\pi_e)$ that is a function of the artificially introduced discount factor $\gamma$ and the target policy $\pi_e$. The error $|A(\gamma,\pi_e)-R_{\pi_e}|$ between $A(\gamma,\pi_e)$ and the true value $R_{\pi_e}$ thus depends on the policy $\pi_e$ and, more importantly, on the discount factor $\gamma$, so that there could exist some optimal $\gamma_0$ minimizing $|A(\gamma,\pi_e)-R_{\pi_e}|$ and thus the MSE; its value is certainly unknown to the agent because so is the MDP model $M$. The MSE results are depicted empirically in Figure 5, where the horizontal axis again shows the logarithm of the number of trajectories.

Figure 5: The MSEs of the MWL and MWLA algorithms ($\alpha=0.2$). The horizontal axis indicates the number of trajectories and the vertical axis the MSE, both on logarithmic scales.

From the figure, we clearly have the following observations:

1. The MSEs are almost constant over the numbers of trajectories we experimented with, quite possibly implying that, compared with the variance, the bias contributes the most to the MSE.

2. The MSEs of the MWL algorithm vary over different $\gamma$, meaning that different $\gamma$ give rise to different errors of the MWL estimate relative to the true value $R_{\pi_e}$ (caused mainly by the bias, according to the previous item).

3. It appears that $\gamma_0=0.98$ is optimal in our experiment. However, it is unclear how to identify an optimal $\gamma_0$ theoretically, even approximately.

4. MWLA performs significantly better than MWL, which we attribute to the unbiasedness of MWLA obtained by taking $\gamma=1$.

7 Conclusions and discussions

This paper addresses an OPE problem for absorbing MDPs. We propose the MWLA algorithm as an extension of the MWL algorithm of Uehara et al. (2020). The MWLA algorithm works with episodic data subject to truncation. The expected return of a target policy is estimated and an upper bound on its MSE is derived, in which the statistical error is related to both the data size and the truncation level. We also briefly discuss the MSWLA algorithm for situations where the behavior policy is known. The numerical experiments demonstrate that the MWLA method attains a lower MSE as the number of episodes and the truncation level increase, significantly improving the accuracy of policy evaluation.

Conceptually, one can also estimate the corresponding state-action value function $Q$ from the estimated expected return, for example by fitted-Q evaluation. The doubly robust (DR) estimation approach, which integrates learning the weights and the state-action value function $Q$, is effective and robust. It is still unclear whether a DR variant of the MWLA algorithm can be developed.

In statistics, confidence intervals are important for quantifying the uncertainty of point estimates. Recent work in the RL area includes Shi et al. (2021), who propose a novel deeply-debiasing procedure to construct efficient, robust, and flexible confidence intervals for the values of target policies in infinite-horizon discounted MDPs, and Dai et al. (2020), who develop the CoinDICE algorithm for calculating confidence intervals. It would be interesting to combine these methods with MWLA for absorbing MDPs.

Moreover, we would like to note that policy optimization is one of the crucial goals of RL. Policy optimization based on the MWL family of algorithms has been analyzed by Liu et al. (2019), who propose an off-policy policy-gradient optimization technique for infinite-horizon discounted MDPs by using MSWL to correct state-distribution shifts for the i.i.d.-tuple data structure, and by Huang and Jiang (2020), who investigate the convergence of two off-policy policy-gradient optimization algorithms based on state-action density-ratio learning, among others. It is therefore important, and appears feasible, to explore how the MWLA approach can be used in off-policy optimization for absorbing MDPs.

Finally, it should be noted that the state-action space considered in this paper is discrete. This choice reflects the fact that absorbing MDPs with discrete state-action spaces are prevalent in real-world applications. Moreover, our theoretical analysis relies heavily on the discreteness of the state-action space, as evidenced by, e.g., the proof of Lemma B.4 in the Appendix. For continuous state-action spaces, it is common practice to employ approximations by, e.g., linear functions or deep neural networks, so extending the framework to that setting would require additional effort and consideration.


Acknowledgments

The authors thank the editors and referees for their helpful comments and suggestions.


References

  • Altman (1999) Eitan Altman. Constrained Markov decision processes: stochastic modeling. Routledge, 1999.
  • Bang and Robins (2005) Heejung Bang and James M Robins. Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4):962–973, 2005.
  • Bertsekas and Tsitsiklis (1991) Dimitri P Bertsekas and John N Tsitsiklis. An analysis of stochastic shortest path problems. Mathematics of Operations Research, 16(3):580–595, 1991.
  • Borkar (1988) Vivek S Borkar. A convex analytic approach to markov decision processes. Probability Theory and Related Fields, 78(4):583–602, 1988.
  • Cassel et al. (1976) Claes M Cassel, Carl E Särndal, and Jan H Wretman. Some results on generalized difference estimation and generalized regression estimation for finite populations. Biometrika, 63(3):615–620, 1976.
  • Chatterjee et al. (2008) Debasish Chatterjee, Eugenio Cinquemani, Giorgos Chaloulos, and John Lygeros. Stochastic control up to a hitting time: optimality and rolling-horizon implementation. arXiv preprint arXiv:0806.3008, 2008.
  • Dai et al. (2020) Bo Dai, Ofir Nachum, Yinlam Chow, Lihong Li, Csaba Szepesvári, and Dale Schuurmans. CoinDICE: Off-policy confidence interval estimation. Advances in Neural Information Processing Systems, 33:9398–9411, 2020.
  • Derman (1970) Cyrus Derman. Finite state markovian decision processes. Technical report, 1970.
  • Dietterich (2000) Thomas G Dietterich. Hierarchical reinforcement learning with the maxq value function decomposition. Journal of artificial intelligence research, 13:227–303, 2000.
  • Eaton and Zadeh (1962) Jo H Eaton and LA Zadeh. Optimal pursuit strategies in discrete-state probabilistic systems. 1962.
  • Ernst et al. (2005) Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503–556, 2005.
  • Hanna et al. (2019) Josiah Hanna, Scott Niekum, and Peter Stone. Importance sampling policy evaluation with an estimated behavior policy. In International Conference on Machine Learning, pages 2605–2613. PMLR, 2019.
  • Hirano et al. (2003) Keisuke Hirano, Guido W Imbens, and Geert Ridder. Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 71(4):1161–1189, 2003.
  • Huang and Jiang (2020) Jiawei Huang and Nan Jiang. On the convergence rate of density-ratio based off-policy policy gradient methods. In Neural Information Processing Systems Offline Reinforcement Learning Workshop, 2020.
  • Iida and Mori (1996) Tetsuo Iida and Masao Mori. Markov decision processes with random horizon. Journal of the Operations Research Society of Japan, 39(4):592–603, 1996.
  • Jiang (2017) Nan Jiang. A theory of model selection in reinforcement learning. PhD thesis, 2017.
  • Jiang and Huang (2020) Nan Jiang and Jiawei Huang. Minimax value interval for off-policy evaluation and policy optimization. Advances in Neural Information Processing Systems, 33:2747–2758, 2020.
  • Jiang and Li (2016) Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, pages 652–661. PMLR, 2016.
  • Kesten and Spitzer (1975) Harry Kesten and Frank Spitzer. Controlled markov chains. The Annals of Probability, pages 32–40, 1975.
  • Kushner (1971) H. J. Kushner. Introduction to Stochastic Control. Holt, Rinehart and Winston, New York, 1971.
  • Le et al. (2019) Hoang Le, Cameron Voloshin, and Yisong Yue. Batch policy learning under constraints. In International Conference on Machine Learning, pages 3703–3712. PMLR, 2019.
  • Li et al. (2011) Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the fourth ACM international conference on Web search and data mining, pages 297–306, 2011.
  • Li et al. (2015) Lihong Li, Rémi Munos, and Csaba Szepesvári. Toward minimax off-policy value estimation. In Artificial Intelligence and Statistics, pages 608–616. PMLR, 2015.
  • Liu et al. (2018) Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the curse of horizon: Infinite-horizon off-policy estimation. Advances in Neural Information Processing Systems, 31, 2018.
  • Liu et al. (2019) Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. Off-policy policy gradient with state distribution correction. arXiv preprint arXiv:1904.08473, 2019.
  • Mandel et al. (2014) Travis Mandel, Yun-En Liu, Sergey Levine, Emma Brunskill, and Zoran Popovic. Offline policy evaluation across representations with applications to educational games. In AAMAS, volume 1077, 2014.
  • Murphy et al. (2001) Susan A Murphy, Mark J van der Laan, James M Robins, and Conduct Problems Prevention Research Group. Marginal mean models for dynamic regimes. Journal of the American Statistical Association, 96(456):1410–1423, 2001.
  • Nachum and Dai (2020) Ofir Nachum and Bo Dai. Reinforcement learning via fenchel-rockafellar duality. arXiv preprint arXiv:2001.01866, 2020.
  • Nachum et al. (2019a) Ofir Nachum, Yinlam Chow, Bo Dai, and Lihong Li. DualDICE: Behavior-agnostic estimation of discounted stationary distribution corrections. Advances in Neural Information Processing Systems, 32, 2019a.
  • Nachum et al. (2019b) Ofir Nachum, Bo Dai, Ilya Kostrikov, Yinlam Chow, Lihong Li, and Dale Schuurmans. AlgaeDICE: Policy gradient from arbitrary experience. arXiv preprint arXiv:1912.02074, 2019b.
  • Pollard (2012) David Pollard. Convergence of stochastic processes. Springer Science & Business Media, 2012.
  • Precup (2000) Doina Precup. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, page 80, 2000.
  • Robins and Rotnitzky (1995) James M Robins and Andrea Rotnitzky. Semiparametric efficiency in multivariate regression models with missing data. Journal of the American Statistical Association, 90(429):122–129, 1995.
  • Robins et al. (1994) James M Robins, Andrea Rotnitzky, and Lue Ping Zhao. Estimation of regression coefficients when some regressors are not always observed. Journal of the American statistical Association, 89(427):846–866, 1994.
  • Sen (2018) Bodhisattva Sen. A gentle introduction to empirical process theory and applications. Lecture Notes, Columbia University, 11:28–29, 2018.
  • Shi et al. (2021) Chengchun Shi, Runzhe Wan, Victor Chernozhukov, and Rui Song. Deeply-debiased off-policy interval estimation. In International Conference on Machine Learning, pages 9580–9591. PMLR, 2021.
  • Uehara et al. (2020) Masatoshi Uehara, Jiawei Huang, and Nan Jiang. Minimax weight and q-function learning for off-policy evaluation. In International Conference on Machine Learning, pages 9659–9668. PMLR, 2020.
  • Uehara et al. (2021) Masatoshi Uehara, Masaaki Imaizumi, Nan Jiang, Nathan Kallus, Wen Sun, and Tengyang Xie. Finite sample analysis of minimax offline reinforcement learning: Completeness, fast rates and first-order efficiency. arXiv preprint arXiv:2102.02981, 2021.
  • Xie et al. (2019) Tengyang Xie, Yifei Ma, and Yu-Xiang Wang. Towards optimal off-policy evaluation for reinforcement learning with marginalized importance sampling. Advances in Neural Information Processing Systems, 32, 2019.

Appendix


Appendix A Proof of Theorems in Section 3

A.1. Proof of Theorem 3.1.

It suffices to prove the uniqueness. Note that, for all $(\bar{s},\bar{a})\in\mathcal{S}_{0}\times\mathcal{A}$,

\[
L(w,{\bf 1}_{(\bar{s},\bar{a})})=\mu(\bar{s}){\pi_{e}}(\bar{a}|\bar{s})+\sum_{(s,a)\in\mathcal{S}_{0}\times\mathcal{A}}w(s,a)d_{\pi_{b}}(s,a)P(\bar{s}|s,a){\pi_{e}}(\bar{a}|\bar{s})-w(\bar{s},\bar{a})d_{\pi_{b}}(\bar{s},\bar{a}).
\]

By the condition that $d_{\pi_{e}}$ is the unique solution to (10), it follows that $wd_{\pi_{b}}=d_{\pi_{e}}$, i.e. $w=w_{\pi_{e}/\pi_{b}}$.

A.2. Proof of Theorem 3.2.

With any policy $\pi$, associate a one-step forward operator $\mathcal{L}_{\pi}$ on $\Xi$, defined by

\[
(\mathcal{L}_{\pi}q)(s,a)\doteq\sum_{(s^{\prime},a^{\prime})\in\mathcal{S}_{0}\times\mathcal{A}}P(s^{\prime}|s,a)\pi(a^{\prime}|s^{\prime})q(s^{\prime},a^{\prime})=\mathrm{E}_{\pi}\left[q(s_{t+1},a_{t+1})|s_{t}=s,a_{t}=a\right]. \qquad (17)
\]

Then $\mathcal{L}_{\pi}$ is linear over $\Xi$. Moreover, the state-action value function with respect to $\pi$ satisfies

\begin{align*}
Q_{\pi}(s,a) &\doteq \mathrm{E}_{(s,a),\pi}\Big{(}\sum_{t=0}^{T-1}r(s_{t},a_{t})\Big{)}\\
&= R(s,a)+\sum_{s^{\prime}\in\mathcal{S}_{0}}P(s^{\prime}|s,a)\mathrm{E}_{s^{\prime},\pi}\Big{(}\sum_{t=1}^{T-1}r_{t}\Big{)}\\
&= R(s,a)+\sum_{(s^{\prime},a^{\prime})\in\mathcal{S}_{0}\times\mathcal{A}}P(s^{\prime}|s,a)\pi(a^{\prime}|s^{\prime})Q_{\pi}(s^{\prime},a^{\prime})\\
&= R(s,a)+(\mathcal{L}_{\pi}Q_{\pi})(s,a) \qquad (18)
\end{align*}

and, by definition (9) of $L(w,q)$, the error function can be written as

\[
L(w,q)=\sum_{(s,a)\in\mathcal{S}_{0}\times\mathcal{A}}w(s,a)\left[\mathcal{L}_{\pi_{e}}q(s,a)-q(s,a)\right]d_{\pi_{b}}(s,a)+\mathrm{E}_{s\sim\mu}\big{[}q(s,{\pi_{e}})\big{]}. \qquad (19)
\]

By Theorem 3.1, we can further rewrite $L(w,q)$ as

\[
L(w,q)=\sum_{(s,a)\in\mathcal{S}_{0}\times\mathcal{A}}\left[w_{\pi_{e}/\pi_{b}}(s,a)-w(s,a)\right]\left[q(s,a)-(\mathcal{L}_{\pi_{e}}q)(s,a)\right]d_{\pi_{b}}(s,a). \qquad (20)
\]

By definition, $V_{s^{\prime},a^{\prime},\pi_{e}}$ is the Q-function of the reward ${\bm{1}}_{(s^{\prime},a^{\prime})}(s,a)$ under the policy $\pi_{e}$. Therefore, Equation (18) states that

\[
V_{s^{\prime},a^{\prime},\pi_{e}}(s,a)-\mathcal{L}_{\pi_{e}}V_{s^{\prime},a^{\prime},\pi_{e}}(s,a)={\bm{1}}_{(s^{\prime},a^{\prime})}(s,a).
\]

Substituting this into (20) then gives, for every $(s^{\prime},a^{\prime})\in\mathcal{S}_{0}\times\mathcal{A}$,

\begin{align*}
L(w,V_{s^{\prime},a^{\prime},\pi_{e}}) &= \sum_{(s,a)\in\mathcal{S}_{0}\times\mathcal{A}}\left[w_{\pi_{e}/\pi_{b}}(s,a)-w(s,a)\right]{\bm{1}}_{(s^{\prime},a^{\prime})}(s,a)d_{\pi_{b}}(s,a)\\
&= d_{\pi_{e}}(s^{\prime},a^{\prime})-w(s^{\prime},a^{\prime})d_{\pi_{b}}(s^{\prime},a^{\prime}).
\end{align*}

This gives the first equality of Equation (12). The second can be verified similarly by considering the function $V_{s^{\prime},a^{\prime},\pi_{e}}/d_{\pi_{b}}(s^{\prime},a^{\prime})$ instead of $V_{s^{\prime},a^{\prime},\pi_{e}}$.

We prove the remaining results in what follows.

(2). As a result of the first equality of part (1),

\[
|d_{\pi_{e}}(s^{\prime},a^{\prime})-w(s^{\prime},a^{\prime})d_{\pi_{b}}(s^{\prime},a^{\prime})|=|L(w,V_{s^{\prime},a^{\prime},\pi_{e}})|,
\]

so that

\[
\|d_{\pi_{e}}-wd_{\pi_{b}}\|_{\infty}=\max_{(s^{\prime},a^{\prime})\in\mathcal{S}_{0}\times\mathcal{A}}|L(w,V_{s^{\prime},a^{\prime},\pi_{e}})|\leq\max_{q\in\mathcal{Q}}|L(w,q)|,
\]

provided that $\{\pm V_{s^{\prime},a^{\prime},\pi_{e}}:(s^{\prime},a^{\prime})\in\mathcal{S}_{0}\times\mathcal{A}\}\subseteq\mathcal{Q}$. The desired result (2) is thus proved.

(3). If $\{\pm V_{s^{\prime},a^{\prime},\pi_{e}}/d_{\pi_{b}}(s^{\prime},a^{\prime}):(s^{\prime},a^{\prime})\in\mathcal{S}_{0}\times\mathcal{A}\}\subseteq\mathcal{Q}$, then substituting $w^{*}$ for $w$ in the second equality of part (1) gives rise to

\[
\|w_{\pi_{e}/\pi_{b}}-w^{*}\|_{\infty}=\max_{(s^{\prime},a^{\prime})\in\mathcal{S}_{0}\times\mathcal{A}}|L(w^{*},V_{s^{\prime},a^{\prime},\pi_{e}}/d_{\pi_{b}}(s^{\prime},a^{\prime}))|\leq\max_{q\in\mathcal{Q}}|L(w^{*},q)|=\min_{w\in\mathcal{W}}\max_{q\in\mathcal{Q}}|L(w,q)|.
\]

The proof is thus complete.


Appendix B Proof of Theorems in Section 4

To begin with, the following two results are quoted for easy reference later on. Denote by $\mathcal{F}\left(\left\{Z_{i}\right\}\right)=\left\{\left(f\left(Z_{1}\right),\ldots,f\left(Z_{n}\right)\right)\mid f\in\mathcal{F}\right\}\subset\mathbb{R}^{n}$ the set generated by a function class $\mathcal{F}$ and a data set $\left\{Z_{1},\ldots,Z_{n}\right\}$. Then the empirical $\ell_{1}$-covering number $\mathcal{N}\left(\epsilon;\mathcal{F},\{Z_{i}\}_{i=1}^{n}\right)$ refers to the smallest number of balls of radius $\epsilon$ required to cover $\mathcal{F}\left(\left\{Z_{i}\right\}\right)$, where the distance is computed by the empirical $\ell_{1}$-norm $\|f-g\|_{\mathcal{F}\left(\left\{Z_{i}\right\}\right)}:=\frac{1}{n}\sum_{i=1}^{n}\left|f\left(Z_{i}\right)-g\left(Z_{i}\right)\right|$.

Lemma B.1.

(Pollard, 2012) Let $\mathcal{F}$ be a permissible class of functions $\mathcal{Z}\rightarrow[-M,M]$ and let $\left\{Z_{i}\right\}_{i=1}^{n}$ be i.i.d. samples from some distribution $P$. Then, for any given $\epsilon>0$,

\[
\mathbb{P}\Big{(}\sup_{f\in\mathcal{F}}\Big{|}\frac{1}{n}\sum_{i=1}^{n}f\big{(}Z_{i}\big{)}-\mathrm{E}f(Z)\Big{|}>\epsilon\Big{)}\leq 8\mathbb{E}\big{[}\mathcal{N}\left(\epsilon;\mathcal{F},\{Z_{i}\}_{i=1}^{n}\right)\big{]}\exp\Big{(}\frac{-n\epsilon^{2}}{512M^{2}}\Big{)}. \qquad (21)
\]

For a class of functions $\mathcal{F}$ on a measurable space equipped with a probability measure $\vartheta$, the covering number $\mathcal{N}(\varepsilon,\mathcal{F},L^{r}(\vartheta))$ refers to the smallest number of balls of radius $\varepsilon$ required to cover $\mathcal{F}$, where distances are measured in terms of the $L^{r}$-norm $\|f\|_{L^{r}(\vartheta)}:=(\int|f|^{r}d\vartheta)^{1/r}$ for $f\in\mathcal{F}$. The covering number can then be bounded in terms of the function class's pseudo-dimension:

Lemma B.2.

(Sen, 2018) Suppose that $\mathcal{F}$ is a class of functions with a measurable envelope $F$ (i.e. $|f|\leq F$ for any $f\in\mathcal{F}$) and a finite pseudo-dimension $D(\mathcal{F})$. Then, for any $r\geq 1$ and any probability measure $\vartheta$ such that $\|F\|_{L^{r}(\vartheta)}>0$,

\[
\mathcal{N}(\varepsilon\|F\|_{L^{r}(\vartheta)},\mathcal{F},L^{r}(\vartheta))<KD(\mathcal{F})(4e)^{D(\mathcal{F})}(2/\varepsilon)^{rD(\mathcal{F})},
\]

where $K$ is a universal constant and $0<\varepsilon<1$.

Recall that in Theorem 4.1 we assume that the function classes $\mathcal{W}$ and $\mathcal{Q}$ have finite pseudo-dimensions $D_{\mathcal{W}}$ and $D_{\mathcal{Q}}$, that $Q_{\pi_{e}}\in\mathcal{Q}$, and that there exists a function $G$ satisfying $|w|\leq G$ and $|q|\leq G$ for all $w\in\mathcal{W}$, $q\in\mathcal{Q}$, with $\Lambda_{1}:=\sum_{(s,a)\in\mathcal{S}_{0}\times\mathcal{A}}G(s,a)<+\infty$ and $\Lambda_{2}:=\sum_{(s,a)\in\mathcal{S}_{0}\times\mathcal{A}}G^{2}(s,a)<+\infty$. In addition, we assume that there exists $\lambda_{0}>0$ such that $\mathrm{E}_{\mu,\pi_{b}}(e^{\lambda_{0}T})=M_{0}<\infty$. We also recall that our data consist of i.i.d. episodes generated from $M=(\mathcal{S},\mathcal{A},\mathcal{R},P,\mu)$ under the behavior policy $\pi_{b}$. The proof of Theorem 4.1 proceeds in three parts: decomposition, evaluation, and optimization.

B.1. Decomposition

First decompose the estimate $\hat{R}_{\pi_{e},m}$ defined in equation (15) as

\[
\hat{R}_{\pi_{e},m}-R_{\pi_{e}}=I_{1}+I_{2}+I_{3}, \qquad (22)
\]

where

\begin{align*}
I_{1} &= \sum_{(s,a)\in\mathcal{S}_{0}\times\mathcal{A}}[\hat{w}_{m}(s,a)-w_{\pi_{e}/\pi_{b}}(s,a)]R(s,a)d_{\pi_{b}}(s,a),\\
I_{2} &= \frac{1}{m}\sum_{i=1}^{m}\sum_{(s,a)\in\mathcal{S}_{0}\times\mathcal{A}}\hat{w}_{m}(s,a)R(s,a)\left[\hat{d}^{i}_{\pi_{b}}(s,a)-d_{\pi_{b}}(s,a)\right],
\end{align*}

and

\[
I_{3}=\frac{1}{m}\sum_{i=1}^{m}\sum_{(s,a)\in\mathcal{S}_{0}\times\mathcal{A}}\hat{w}_{m}(s,a)\Big{(}\hat{R}^{i}(s,a)-R(s,a)\Big{)}\hat{d}^{i}_{\pi_{b}}(s,a).
\]

By (18) and (20),

\[
I_{1}=\sum_{(s,a)\in\mathcal{S}_{0}\times\mathcal{A}}[\hat{w}_{m}(s,a)-w_{\pi_{e}/\pi_{b}}(s,a)][Q_{\pi_{e}}(s,a)-(\mathcal{L}_{\pi_{e}}Q_{\pi_{e}})(s,a)]d_{\pi_{b}}(s,a)=-L(\hat{w}_{m},Q_{\pi_{e}}).
\]

Because Qπe𝒬Q_{\pi_{e}}\in\mathcal{Q}, it follows that

I12\displaystyle I_{1}^{2} =\displaystyle= (L(w^m,Qπe)L^m(w^m,Qπe)+L^m(w^m,Qπe))2\displaystyle\big{(}L(\hat{w}_{m},Q_{\pi_{e}})-\hat{L}_{m}(\hat{w}_{m},Q_{\pi_{e}})+\hat{L}_{m}(\hat{w}_{m},Q_{\pi_{e}})\big{)}^{2}
\displaystyle\leq 2(L(w^m,Qπe)L^m(w^m,Qπe))2+2maxq𝒬L^m2(w^m,q),\displaystyle 2\big{(}L(\hat{w}_{m},Q_{\pi_{e}})-\hat{L}_{m}(\hat{w}_{m},Q_{\pi_{e}})\big{)}^{2}+2\max_{q\in\mathcal{Q}}\hat{L}_{m}^{2}(\hat{w}_{m},q),

where L^m\hat{L}_{m} is defined in (14) and we have used the fact that

L^m2(w^m,Qπe)maxq𝒬L^m2(w^m,q)=minw𝒲maxq𝒬L^m2(w,q).\hat{L}_{m}^{2}(\hat{w}_{m},Q_{\pi_{e}})\leq\max_{q\in\mathcal{Q}}\hat{L}_{m}^{2}(\hat{w}_{m},q)=\min_{w\in\mathcal{W}}\max_{q\in\mathcal{Q}}\hat{L}_{m}^{2}(w,q).

Using the fact

maxq𝒬L^m2(w^m,q)2maxq𝒬L2(w^m,q)+2maxq𝒬(L^m(w^m,q)L(w^m,q))2,\max_{q\in\mathcal{Q}}\hat{L}_{m}^{2}(\hat{w}_{m},q)\leq 2\max_{q\in\mathcal{Q}}L^{2}(\hat{w}_{m},q)+2\max_{q\in\mathcal{Q}}\left(\hat{L}_{m}(\hat{w}_{m},q)-L(\hat{w}_{m},q)\right)^{2},

we further have

I12\displaystyle I_{1}^{2} \displaystyle\leq 6I112+4minw𝒲maxq𝒬L(w,q)2,\displaystyle 6I_{11}^{2}+4\min\limits_{w\in\mathcal{W}}\max\limits_{q\in\mathcal{Q}}L(w,q)^{2}, (23)

where

I112=supw𝒲,q𝒬|L(w,q)L^m(w,q)|2.I_{11}^{2}=\sup\limits_{w\in\mathcal{W},q\in\mathcal{Q}}|L(w,q)-\hat{L}_{m}(w,q)|^{2}.

Substituting (13) into the expression of L^m(w,q)\hat{L}_{m}(w,q) in (14), it follows that

L^m(w,q)Eμ(q(s,πe))\displaystyle\hat{L}_{m}(w,q)-\mathrm{E}_{\mu}(q(s,\pi_{e})) =\displaystyle= 1mi=1mt=0li1w(sti,ati)(q(st+1i,πe)q(sti,ati)),\displaystyle\frac{1}{m}\sum_{i=1}^{m}\sum_{t=0}^{l_{i}-1}w(s_{t}^{i},a_{t}^{i})(q(s_{t+1}^{i},\pi_{e})-q(s_{t}^{i},a^{i}_{t})),

where q(ξ,πe)=0q(\xi,\pi_{e})=0 is assumed. Similarly,

L(w,q)Eμ(q(s,πe))=Eμ,πb(t=0T1w(st,at)(q(st+1,πe)q(st,at))).\displaystyle L(w,q)-\mathrm{E}_{\mu}(q(s,\pi_{e}))=\mathrm{E}_{\mu,\pi_{b}}\Big{(}\sum_{t=0}^{T-1}w(s_{t},a_{t})(q(s_{t+1},\pi_{e})-q(s_{t},a_{t}))\Big{)}.

Define

\[
\tilde{L}(w,q)=\mathrm{E}_{\mu,\pi_{b}}\Big{(}\sum_{t=0}^{T\wedge H_{\beta}-1}w(s_{t},a_{t})(q(s_{t+1},\pi_{e})-q(s_{t},a_{t}))\Big{)}
\]

and

\[
\tilde{L}_{m}(w,q)=\frac{1}{m}\sum_{i=1}^{m}\sum_{t=0}^{T_{i}\wedge H_{\beta}-1}w(s_{t}^{i},a_{t}^{i})\big{(}q(s_{t+1}^{i},\pi_{e})-q(s_{t}^{i},a^{i}_{t})\big{)},
\]

where $H_{\beta}$ is a constant specified later in B.2. Then

|L^m(w,q)L(w,q)|\displaystyle|\hat{L}_{m}(w,q)-L(w,q)| =\displaystyle= |L^m(w,q)Eμ(q(s,πe))(L(w,q)Eμ(q(s,πe)))|\displaystyle|\hat{L}_{m}(w,q)-\mathrm{E}_{\mu}(q(s,\pi_{e}))-(L(w,q)-\mathrm{E}_{\mu}(q(s,\pi_{e})))| (24)
|\displaystyle\leq| L^m(w,q)Eμ(q(s,πe))L~m(w,q)|+|L~m(w,q)L~(w,q)|\displaystyle\hat{L}_{m}(w,q)-\mathrm{E}_{\mu}(q(s,\pi_{e}))-\tilde{L}_{m}(w,q)|+|\tilde{L}_{m}(w,q)-\tilde{L}(w,q)|
+|L~(w,q)(L(w,q)Eμ(q(s,πe)))|.\displaystyle+|\tilde{L}(w,q)-(L(w,q)-\mathrm{E}_{\mu}(q(s,\pi_{e})))|.

B.2. Evaluation

For any $\beta\geq M_{0}e^{-\lambda_{0}H}$, let $H_{\beta}=\min\{k:M_{0}e^{-\lambda_{0}k}\leq\beta\}=\lceil\ln(M_{0}/\beta)/\lambda_{0}\rceil$, where $\lceil x\rceil$ is the minimum integer no less than $x$. Obviously, $H_{\beta}\leq H$ and $M_{0}e^{-\lambda_{0}H_{\beta}}\leq\beta$.
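For a concrete illustration (with purely hypothetical values, not taken from our experiments), if $M_{0}=10$, $\lambda_{0}=0.1$ and $\beta=0.05$, then

\[
H_{\beta}=\lceil\ln(10/0.05)/0.1\rceil=\lceil 52.98\cdots\rceil=53,
\]

and indeed $M_{0}e^{-\lambda_{0}H_{\beta}}=10e^{-5.3}\approx 0.0499\leq\beta$, provided the truncation level satisfies $H\geq 53$.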

B.2.1. The upper bound of $\mathrm{E}(I_{11}^{2})$

An upper bound of $\mathrm{E}(I_{11}^{2})$ is derived via a sequence of auxiliary results.

Lemma B.3.

With the constant $\Lambda_{1}$, for any $\beta\geq M_{0}e^{-\lambda_{0}H}$,

\[
\mathrm{E}\Big{(}\sup_{w\in\mathcal{W},q\in\mathcal{Q}}|\hat{L}_{m}(w,q)-\mathrm{E}_{\mu}(q(s,\pi_{e}))-\tilde{L}_{m}(w,q)|^{2}\Big{)}\leq\frac{4\Lambda_{1}^{4}}{\lambda_{0}^{2}}\Big{[}\beta^{2}+\frac{2\beta}{m}\Big{]}.
\]

Proof. For any w𝒲w\in\mathcal{W} and q𝒬q\in\mathcal{Q},

|L^m(w,q)Eμ(q(s,πe))L~m(w,q)|\displaystyle|\hat{L}_{m}(w,q)-\mathrm{E}_{\mu}(q(s,\pi_{e}))-\tilde{L}_{m}(w,q)| =\displaystyle= |1mi=1mt=TiHβli1w(sti,ati)(q(st+1i,πe)q(sti,ati))𝟏{Ti>Hβ}|.\displaystyle\Big{|}\frac{1}{m}\sum_{i=1}^{m}\sum_{t=T_{i}\wedge H_{\beta}}^{l_{i}-1}w(s_{t}^{i},a_{t}^{i})\left(q(s_{t+1}^{i},\pi_{e})-q(s_{t}^{i},a^{i}_{t})\right){\bm{1}}_{\{T_{i}>H_{\beta}\}}\Big{|}.

Recall the notation $l_{i}=T_{i}\wedge H$. By assumption (1) in Theorem 4.1, $w$ and $q$ are both bounded by $\Lambda_{1}$, and hence

|L^m(w,q)Eμ(q(s,πe))L~m(w,q)|2Λ12mi=1m(TiHβ)𝟏{Ti>Hβ}.\displaystyle|\hat{L}_{m}(w,q)-\mathrm{E}_{\mu}(q(s,\pi_{e}))-\tilde{L}_{m}(w,q)|\leq\frac{2\Lambda_{1}^{2}}{m}\sum_{i=1}^{m}(T_{i}-H_{\beta}){\bm{1}}_{\{T_{i}>H_{\beta}\}}.

Therefore,

E(supw𝒲,q𝒬|L^m(w,q)Eμ(q(s,πe))L~m(w,q)|2)\displaystyle\mathrm{E}(\sup_{w\in\mathcal{W},q\in\mathcal{Q}}|\hat{L}_{m}(w,q)-\mathrm{E}_{\mu}(q(s,\pi_{e}))-\tilde{L}_{m}(w,q)|^{2}) (25)
\displaystyle\leq 4Λ14m2(i,j=1,ijmE[(TiHβ)(TjHβ)𝟏{Ti,Tj>Hβ}]+i=1mE[(TiHβ)2𝟏{Ti>Hβ}])\displaystyle\frac{4\Lambda_{1}^{4}}{m^{2}}\left(\sum_{i,j=1,i\not=j}^{m}\mathrm{E}[(T_{i}-H_{\beta})(T_{j}-H_{\beta}){\bm{1}}_{\{T_{i},T_{j}>H_{\beta}\}}]+\sum_{i=1}^{m}\mathrm{E}[(T_{i}-H_{\beta})^{2}{\bm{1}}_{\{T_{i}>H_{\beta}\}}]\right)
=\displaystyle= 4Λ14m2(m(m1)E2[(T1Hβ)𝟏{T1>Hβ}]+mE[(T1Hβ)2𝟏{T1>Hβ}]),\displaystyle\frac{4\Lambda_{1}^{4}}{m^{2}}\left(m(m-1)\mathrm{E}^{2}[(T_{1}-H_{\beta}){\bm{1}}_{\{T_{1}>H_{\beta}\}}]+m\mathrm{E}[(T_{1}-H_{\beta})^{2}{\bm{1}}_{\{T_{1}>H_{\beta}\}}]\right),

where the equality follows from the i.i.d. property of the trajectories. Using further the inequality $x\vee x^{2}/2\leq e^{x}$ for every $x>0$,

E[(T1Hβ)𝟏{T1>Hβ}]\displaystyle\mathrm{E}[(T_{1}-H_{\beta}){\bm{1}}_{\{T_{1}>H_{\beta}\}}] \displaystyle\leq 1λ0E(eλ0(T1Hβ)𝟏{T1>Hβ})M0eλ0Hβλ0βλ0,\displaystyle\frac{1}{\lambda_{0}}\mathrm{E}\left(e^{\lambda_{0}(T_{1}-H_{\beta})}{\bm{1}}_{\{T_{1}>H_{\beta}\}}\right)\leq\frac{M_{0}e^{-\lambda_{0}H_{\beta}}}{\lambda_{0}}\leq\frac{\beta}{\lambda_{0}}, (26)

and

E[(TiHβ)2𝟏{Ti>Hβ}]\displaystyle\mathrm{E}[(T_{i}-H_{\beta})^{2}{\bm{1}}_{\{T_{i}>H_{\beta}\}}] \displaystyle\leq 2λ02E(eλ0(T1Hβ)𝟏{T1>Hβ})2M0eλ0Hβλ022βλ02,\displaystyle\frac{2}{\lambda_{0}^{2}}\mathrm{E}(e^{\lambda_{0}(T_{1}-H_{\beta})}{\bm{1}}_{\{T_{1}>H_{\beta}\}})\leq\frac{2M_{0}e^{-\lambda_{0}H_{\beta}}}{\lambda_{0}^{2}}\leq\frac{2\beta}{\lambda_{0}^{2}}, (27)

substituting (26) and (27) into (25) leads to the desired result. \square

Lemma B.4.

There exists a constant $C_{1}$, independent of $m$ and $H$, such that for $m\geq 2$,

\[
\mathrm{E}\Big{(}\sup_{w\in\mathcal{W},q\in\mathcal{Q}}|\tilde{L}_{m}(w,q)-\tilde{L}(w,q)|^{2}\Big{)}\leq C_{1}H_{\beta}^{2}\frac{\ln m}{m}.
\]

Proof. For a representative trajectory $Z$ of the form (6), denote

\[
\tilde{w}_{q}(Z)=\sum_{t=0}^{T\wedge H_{\beta}-1}w(s_{t},a_{t})(q(s_{t+1},\pi_{e})-q(s_{t},a_{t})).
\]

We have that

\[
\tilde{L}_{m}(w,q)-\tilde{L}(w,q)=\frac{1}{m}\sum_{i=1}^{m}\tilde{w}_{q}(Z_{i})-\mathrm{E}(\tilde{w}_{q}(Z)).
\]

It is easy to see that $|\tilde{w}_{q}|\leq 2\Lambda_{1}^{2}H_{\beta}$. Denote $\mathcal{H}=\{\tilde{w}_{q}(Z):w\in\mathcal{W},q\in\mathcal{Q}\}$. The distance in $\mathcal{H}$ can be bounded by

\begin{align*}
&\frac{1}{m}\sum_{i=1}^{m}\Big{|}\sum_{t=0}^{T_{i}\wedge H_{\beta}-1}w_{1}(s_{t}^{i},a_{t}^{i})(q_{1}(s_{t+1}^{i},\pi_{e})-q_{1}(s_{t}^{i},a_{t}^{i}))-\sum_{t=0}^{T_{i}\wedge H_{\beta}-1}w_{2}(s_{t}^{i},a_{t}^{i})(q_{2}(s_{t+1}^{i},\pi_{e})-q_{2}(s_{t}^{i},a_{t}^{i}))\Big{|}\\
&\quad\leq 2\Lambda_{1}H_{\beta}(\|w_{1}-w_{2}\|_{\infty}+\|q_{1}-q_{2}\|_{\infty}).
\end{align*}

Note that, in our setting,

\begin{align*}
\|w_{1}-w_{2}\|_{\infty} &\leq \|w_{1}-w_{2}\|_{2}\leq\sum_{(s,a)\in\mathcal{S}_{0}\times\mathcal{A}}2G(s,a)|w_{1}(s,a)-w_{2}(s,a)|\\
&\leq 2\Lambda_{1}\sum_{(s,a)\in\mathcal{S}_{0}\times\mathcal{A}}\frac{G(s,a)}{\Lambda_{1}}|w_{1}(s,a)-w_{2}(s,a)|,
\end{align*}

where the second inequality uses assumption (16). Let $\vartheta:=(\vartheta(s,a))$ be the probability measure on $\mathcal{S}_{0}\times\mathcal{A}$ such that $\vartheta(s,a):=G(s,a)/\Lambda_{1}$. We then get that

\[
\|w_{1}-w_{2}\|_{\infty}\leq 2\Lambda_{1}\|w_{1}-w_{2}\|_{L^{1}(\vartheta)}.
\]

The same arguments imply that

\[
\|q_{1}-q_{2}\|_{\infty}\leq 2\Lambda_{1}\|q_{1}-q_{2}\|_{L^{1}(\vartheta)}.
\]

As a result,

𝒩(4Λ12Hβ(ϵ1+ϵ2),,{Zi}i=1m)\displaystyle\mathcal{N}(4\Lambda_{1}^{2}H_{\beta}(\epsilon_{1}+\epsilon_{2}),\mathcal{H},{\{Z_{i}\}}_{i=1}^{m}) \displaystyle\leq 𝒩(ϵ1,𝒲,L1(ϑ))𝒩(ϵ2,𝒬,L1(ϑ)).\displaystyle\mathcal{N}(\epsilon_{1},\mathcal{W},L^{1}(\vartheta))\mathcal{N}(\epsilon_{2},\mathcal{Q},L^{1}(\vartheta)).

Note that $\|G\|_{L^{1}(\vartheta)}=\sum_{(s,a)\in\mathcal{S}_{0}\times\mathcal{A}}G^{2}(s,a)/\Lambda_{1}=\Lambda_{2}/\Lambda_{1}$. By assumption (2) in Theorem 4.1 and Lemma B.2, we get that

𝒩1(4HβΛ12(ϵ1+ϵ2),,{Zi}i=1m)\displaystyle\mathcal{N}_{1}(4H_{\beta}\Lambda_{1}^{2}(\epsilon_{1}+\epsilon_{2}),\mathcal{H},{\{Z_{i}\}}_{i=1}^{m}) \displaystyle\leq K2D𝒲D𝒬(8eΛ2Λ1ϵ1)D𝒲(8eΛ2Λ1ϵ2)D𝒬.\displaystyle K^{2}D_{\mathcal{W}}D_{\mathcal{Q}}\left(\frac{8e\Lambda_{2}}{\Lambda_{1}\epsilon_{1}}\right)^{D_{\mathcal{W}}}\left(\frac{8e\Lambda_{2}}{\Lambda_{1}\epsilon_{2}}\right)^{D_{\mathcal{Q}}}.

Taking ϵ1=ϵ2=ϵ64HβΛ12\epsilon_{1}=\epsilon_{2}={\epsilon\over 64H_{\beta}\Lambda_{1}^{2}}, we have

𝒩1(ϵ8,,{Zi}i=1m)MϵD𝒲+D𝒬,\mathcal{N}_{1}\left(\frac{\epsilon}{8},\mathcal{H},{\{Z_{i}\}}_{i=1}^{m}\right)\leq\frac{{M}}{\epsilon^{D_{\mathcal{W}}+D_{\mathcal{Q}}}},

where M=KD𝒲D𝒬(512eHβΛ1Λ2)D𝒲+D𝒬.{M}=KD_{\mathcal{W}}D_{\mathcal{Q}}(512eH_{\beta}\Lambda_{1}\Lambda_{2})^{D_{\mathcal{W}}+D_{\mathcal{Q}}}. By Pollard’s tail inequality (Lemma B.1),

P(supw𝒲,q𝒬|1mi=1mhw,q(Zi)Ehw,q(Z)|>ϵ)8MϵD𝒲+D𝒬exp(mϵ22048Λ14Hβ2).\displaystyle P\left(\sup\limits_{w\in\mathcal{W},q\in\mathcal{Q}}\left|\frac{1}{m}\sum\limits_{i=1}^{m}h_{w,q}(Z_{i})-\mathrm{E}h_{w,q}(Z)\right|>\epsilon\right)\leq\frac{8{M}}{\epsilon^{D_{\mathcal{W}}+D_{\mathcal{Q}}}}{\exp\big{(}\frac{-m\epsilon^{2}}{2048\Lambda_{1}^{4}H_{\beta}^{2}}\big{)}}.\qquad (28)

For any m>1m>1, let x0=(32Λ12Hβ)2(D𝒲+D𝒬)lnmmx_{0}=\frac{(32\Lambda_{1}^{2}H_{\beta})^{2}(D_{\mathcal{W}}+D_{\mathcal{Q}})\ln m}{m}. Then, by (28), we have that

E(supw𝒲,q𝒬|L~m(w,q)L~(w,q)|2)\displaystyle\mathrm{E}(\sup_{w\in\mathcal{W},q\in\mathcal{Q}}|\tilde{L}_{m}(w,q)-\tilde{L}(w,q)|^{2})
\displaystyle\leq 0P(supw𝒲,q𝒬|L~m(w,q)L~(w,q)|x)𝑑x.\displaystyle\int_{0}^{\infty}\mathrm{P}(\sup_{w\in\mathcal{W},q\in\mathcal{Q}}|\tilde{L}_{m}(w,q)-\tilde{L}(w,q)|\geq\sqrt{x})dx.
\displaystyle\leq 0x0𝑑x+x08Mx(D𝒲+D𝒬)/2exp(mx2048(Λ12Hβ)2)𝑑x\displaystyle\int_{0}^{x_{0}}dx+\int_{x_{0}}^{\infty}\frac{8{M}}{x^{(D_{\mathcal{W}}+D_{\mathcal{Q}})/2}}\exp\big{(}\frac{-mx}{2048(\Lambda_{1}^{2}H_{\beta})^{2}}\big{)}dx
\displaystyle\leq x0+8Mx0(D𝒲+D𝒬)/2x0exp(mx2048(Λ12Hβ)2)𝑑x\displaystyle x_{0}+\frac{8{M}}{x_{0}^{(D_{\mathcal{W}}+D_{\mathcal{Q}})/2}}\int_{x_{0}}^{\infty}\exp\big{(}\frac{-mx}{2048(\Lambda_{1}^{2}H_{\beta})^{2}}\big{)}dx
=\displaystyle= (32Λ12Hβ)2m[(D𝒲+D𝒬)lnm+16M[(32Λ12Hβ)2lnm(D𝒲+D𝒬)](D𝒲+D𝒬)/2],\displaystyle\frac{(32\Lambda_{1}^{2}H_{\beta})^{2}}{m}\Big{[}(D_{\mathcal{W}}+D_{\mathcal{Q}})\ln m+\frac{16M}{\left[(32\Lambda_{1}^{2}H_{\beta})^{2}\ln m(D_{\mathcal{W}}+D_{\mathcal{Q}})\right]^{(D_{\mathcal{W}}+D_{\mathcal{Q}})/2}}\Big{]},

which implies the desired result. \square

Lemma B.5.

There exists a constant $C_{3}$, independent of $m$ and $H$, such that for $m\geq 2$,

\[
\mathrm{E}\Big{(}\sup_{w\in\mathcal{W},q\in\mathcal{Q}}|\hat{L}_{m}(w,q)-L(w,q)|^{2}\Big{)}\leq C_{3}\Big{(}\beta^{2}+(1-\ln\beta+\ln^{2}\beta)\frac{\ln m}{m}+\frac{\beta}{m}\Big{)}.
\]

Proof. By (24),

E[supw𝒲,q𝒬|L^m(w,q)L(w,q)|2]\displaystyle\mathrm{E}[\sup_{w\in\mathcal{W},q\in\mathcal{Q}}|\hat{L}_{m}(w,q)-L(w,q)|^{2}] \displaystyle\leq 3E(supw𝒲,q𝒬|L^m(w,q)Eμ(q(s,πe))L~m(w,q)|2)\displaystyle 3\mathrm{E}(\sup_{w\in\mathcal{W},q\in\mathcal{Q}}|\hat{L}_{m}(w,q)-\mathrm{E}_{\mu}(q(s,\pi_{e}))-\tilde{L}_{m}(w,q)|^{2}) (29)
+3E(supw𝒲,q𝒬|L~(w,q)(L(w,q)Eμ(q(s,πe)))|2)\displaystyle+3\mathrm{E}(\sup_{w\in\mathcal{W},q\in\mathcal{Q}}|\tilde{L}(w,q)-(L(w,q)-\mathrm{E}_{\mu}(q(s,\pi_{e})))|^{2})
+3supw𝒲,q𝒬|L~m(w,q)L~(w,q)|2.\displaystyle+3\sup_{w\in\mathcal{W},q\in\mathcal{Q}}|\tilde{L}_{m}(w,q)-\tilde{L}(w,q)|^{2}.

Note that for any w𝒲w\in\mathcal{W},

|L~(w,q)(L(w,q)Eμ(q(s,πe)))|\displaystyle|\tilde{L}(w,q)-(L(w,q)-\mathrm{E}_{\mu}(q(s,\pi_{e})))| =\displaystyle= E(THβT1w(st,at)(q(st+1,πe)q(st,at)))\displaystyle\mathrm{E}\Big{(}\sum_{T\wedge H_{\beta}}^{T-1}w(s_{t},a_{t})(q(s_{t+1},\pi_{e})-q(s_{t},a_{t}))\Big{)} (30)
\displaystyle\leq 2Λ12E((THβ)𝟏{THβ})\displaystyle 2\Lambda_{1}^{2}\mathrm{E}((T-H_{\beta}){\bm{1}}_{\{T\geq H_{\beta}\}})
\displaystyle\leq 2Λ12E(eλ0(THβ)λ0𝟏{THβ})\displaystyle 2\Lambda_{1}^{2}\mathrm{E}\left(\frac{e^{\lambda_{0}(T-H_{\beta})}}{\lambda_{0}}{\bm{1}}_{\{T\geq H_{\beta}\}}\right)
\displaystyle\leq 2Λ12λ0β,\displaystyle\frac{2\Lambda_{1}^{2}}{\lambda_{0}}\beta,

where the last inequality comes from assumption (5) in Theorem 4.1 and the choice of $H_{\beta}$. Substituting (30) into (29) and applying Lemma B.3 and Lemma B.4, we get that

\[
\mathrm{E}\Big{(}\sup_{w\in\mathcal{W},q\in\mathcal{Q}}|\hat{L}_{m}(w,q)-L(w,q)|^{2}\Big{)}\leq 3M_{3}^{2}\beta^{2}+3\Big{(}M_{3}^{2}\beta^{2}+\frac{2M_{3}^{2}}{m}\beta\Big{)}+3C_{1}H_{\beta}^{2}\frac{\ln m}{m},
\]

where $M_{3}=2\Lambda_{1}^{2}/\lambda_{0}$. Since $H_{\beta}\leq\frac{\ln(M_{0}/\beta)}{\lambda_{0}}+1$, it follows that

E(supw𝒲,q𝒬|L^m(w,q)L(w,q)|2)\displaystyle\mathrm{E}(\sup_{w\in\mathcal{W},q\in\mathcal{Q}}|\hat{L}_{m}(w,q)-L(w,q)|^{2})
\displaystyle\leq 3C1λ02((lnβ)22(lnM0+λ0)lnβ+(lnM0+λ0)2)lnmm+6M32β2+6M32mβ\displaystyle\frac{3C_{1}}{\lambda_{0}^{2}}\left((\ln\beta)^{2}-2(\ln M_{0}+\lambda_{0})\ln\beta+(\ln M_{0}+\lambda_{0})^{2}\right)\frac{\ln m}{m}+6M_{3}^{2}\beta^{2}+\frac{6M_{3}^{2}}{m}\beta
\displaystyle\leq 3C1λ02((lnβ)22C2lnβ+C22)lnmm+6M32β2+6M32βm,\displaystyle\frac{3C_{1}}{\lambda_{0}^{2}}\left((\ln\beta)^{2}-2C_{2}\ln\beta+C_{2}^{2}\right)\frac{\ln m}{m}+6M_{3}^{2}\beta^{2}+\frac{6M_{3}^{2}\beta}{m},

where C2=lnM0+λ0C_{2}=\ln M_{0}+\lambda_{0}. Let

C3=max{3C1λ02,3C1λ02C2,3C1λ02C22,6M32},C_{3}=\max\left\{\frac{3C_{1}}{\lambda_{0}^{2}},\frac{3C_{1}}{\lambda_{0}^{2}}C_{2},\frac{3C_{1}}{\lambda_{0}^{2}}C_{2}^{2},6M_{3}^{2}\right\},

we can readily get the desired result. \square

B.2.2. The upper bounds of $\mathrm{E}(I_{2}^{2})$ and $\mathrm{E}(I_{3}^{2})$

Lemma B.6.

Denote $\sup|R(s,a)|$ by $R_{\max}$. We have that

\[
\mathrm{E}(I_{2}^{2})\leq\frac{2R_{\max}^{2}K_{0}^{2}M_{0}}{\lambda_{0}^{2}}\Big{(}\frac{1}{m}+M_{0}e^{-2\lambda_{0}H}\Big{)}.
\]
Proof.

Define a truncated occupancy measure d~πb(s,a)=Eμ,πb(t=0TH1𝟏(s,a)(st,at)).\tilde{d}_{\pi_{b}}(s,a)=\mathrm{E}_{\mu,\pi_{b}}(\sum_{t=0}^{T\wedge H-1}{\bm{1}}_{(s,a)}(s_{t},a_{t})). Then

I2\displaystyle I_{2} =\displaystyle= 1mi=1m(s,a)𝒮0×𝒜w^m(s,a)R(s,a)[d^πbi(s,a)d~πb(s,a)]\displaystyle\frac{1}{m}\sum\limits_{i=1}^{m}\sum_{(s,a)\in\mathcal{S}_{0}\times\mathcal{A}}\hat{w}_{m}(s,a)R(s,a)[\hat{d}^{i}_{\pi_{b}}(s,a)-\tilde{d}_{\pi_{b}}(s,a)]
+(s,a)𝒮0×𝒜w^m(s,a)R(s,a)[d~πb(s,a)dπb(s,a)]\displaystyle+\sum_{(s,a)\in\mathcal{S}_{0}\times\mathcal{A}}\hat{w}_{m}(s,a)R(s,a)[\tilde{d}_{\pi_{b}}(s,a)-{d}_{\pi_{b}}(s,a)]

and hence

I22\displaystyle I_{2}^{2} \displaystyle\leq 2[(s,a)𝒮0×𝒜w^m(s,a)R(s,a)(1mi=1md^πbi(s,a)d~πb(s,a))]2\displaystyle 2\Big{[}\sum_{(s,a)\in\mathcal{S}_{0}\times\mathcal{A}}\hat{w}_{m}(s,a)R(s,a)\Big{(}\frac{1}{m}\sum\limits_{i=1}^{m}\hat{d}^{i}_{\pi_{b}}(s,a)-\tilde{d}_{\pi_{b}}(s,a)\Big{)}\Big{]}^{2}
+2[(s,a)𝒮0×𝒜w^m(s,a)R(s,a)(d~πb(s,a)dπb(s,a))]2\displaystyle+2\Big{[}\sum_{(s,a)\in\mathcal{S}_{0}\times\mathcal{A}}\hat{w}_{m}(s,a)R(s,a)\Big{(}\tilde{d}_{\pi_{b}}(s,a)-d_{\pi_{b}}(s,a)\Big{)}\Big{]}^{2}
\displaystyle\leq 2(s,a)𝒮0×𝒜w^m(s,a)2R(s,a)2(s,a)𝒮0×𝒜(1mi=1md^πbi(s,a)d~πb(s,a))2\displaystyle 2\sum_{(s,a)\in\mathcal{S}_{0}\times\mathcal{A}}\hat{w}_{m}(s,a)^{2}R(s,a)^{2}\sum_{(s,a)\in\mathcal{S}_{0}\times\mathcal{A}}\Big{(}\frac{1}{m}\sum\limits_{i=1}^{m}\hat{d}^{i}_{\pi_{b}}(s,a)-\tilde{d}_{\pi_{b}}(s,a)\Big{)}^{2}
+2[(s,a)𝒮0×𝒜w^m(s,a)R(s,a)Eμ,πb(t=THT1𝟏(s.a)(st,at))]2,\displaystyle+2\Big{[}\sum_{(s,a)\in\mathcal{S}_{0}\times\mathcal{A}}\hat{w}_{m}(s,a)R(s,a)\mathrm{E}_{\mu,\pi_{b}}(\sum_{t=T\wedge H}^{T-1}{\bm{1}}_{(s.a)}(s_{t},a_{t}))\Big{]}^{2},

where the last inequality follows from Hölder's inequality. Invoking the bounds on $\hat{w}_{m}$ and $R$, we get that

E(I22)\displaystyle\mathrm{E}(I_{2}^{2}) \displaystyle\leq 2Rmax2Λ12{(s,a)𝒮0×𝒜E(1mi=1md^πbi(s,a)d~πb(s,a))2\displaystyle 2R_{\max}^{2}\Lambda_{1}^{2}\Big{\{}\sum_{(s,a)\in\mathcal{S}_{0}\times\mathcal{A}}\mathrm{E}\Big{(}\frac{1}{m}\sum\limits_{i=1}^{m}\hat{d}^{i}_{\pi_{b}}(s,a)-\tilde{d}_{\pi_{b}}(s,a)\Big{)}^{2}
+(Eμ,πb[(s,a)𝒮0×𝒜t=THT1𝟏(s,a)(st,at)])2}\displaystyle\qquad\quad\qquad+\Big{(}\mathrm{E}_{\mu,\pi_{b}}\Big{[}\sum_{(s,a)\in\mathcal{S}_{0}\times\mathcal{A}}\sum_{t=T\wedge H}^{T-1}{\bm{1}}_{(s,a)}(s_{t},a_{t})\Big{]}\Big{)}^{2}\Big{\}}
\displaystyle\leq 2Rmax2Λ12{1m(s,a)𝒮0×𝒜Var(d^πbi(s,a)))+(Eμ,πb((TH)𝟏{T>H}))2},\displaystyle 2R_{\max}^{2}\Lambda_{1}^{2}\Big{\{}\frac{1}{m}\sum_{(s,a)\in\mathcal{S}_{0}\times\mathcal{A}}{\rm Var}(\hat{d}^{i}_{\pi_{b}}(s,a)))+\Big{(}\mathrm{E}_{\mu,\pi_{b}}((T-H){\bm{1}}_{\{T>H\}})\Big{)}^{2}\Big{\}},

where the first inequality is due to w^m𝒲\hat{w}_{m}\in\mathcal{W} and the second inequality follows from the fact that dπbi(s,a),1im,d_{\pi_{b}}^{i}(s,a),1\leq i\leq m, are i.i.d and have expectation d~πb(s,a)\tilde{d}_{\pi_{b}}(s,a). Observing that

Var(d^πbi(s,a)))\displaystyle{\rm Var}(\hat{d}^{i}_{\pi_{b}}(s,a))) =\displaystyle= Varμ,πb(t=0TH1𝟏(s,a)(st,at))Eμ,πb((t=0TH1𝟏(s,a)(st,at))2)\displaystyle{\rm Var}_{\mu,\pi_{b}}(\sum_{t=0}^{T\wedge H-1}{\bm{1}}_{(s,a)}(s_{t},a_{t}))\leq\mathrm{E}_{\mu,\pi_{b}}\Big{(}\Big{(}\sum_{t=0}^{T\wedge H-1}{\bm{1}}_{(s,a)}(s_{t},a_{t})\Big{)}^{2}\Big{)}
\displaystyle\leq Eμ,πb((TH)t=0TH1𝟏(s,a)(st,at)),\displaystyle\mathrm{E}_{\mu,\pi_{b}}\Big{(}(T\wedge H)\sum_{t=0}^{T\wedge H-1}{\bm{1}}_{(s,a)}(s_{t},a_{t})\Big{)},

we obtain that

(s,a)𝒮0×𝒜Var(d^πbi(s,a)))Eμ,πb((TH)(s,a)𝒮0×𝒜t=0TH1𝟏(s,a)(st,at))Eμ,πb(T2).\displaystyle\sum_{(s,a)\in\mathcal{S}_{0}\times\mathcal{A}}{\rm Var}(\hat{d}^{i}_{\pi_{b}}(s,a)))\leq\mathrm{E}_{\mu,\pi_{b}}\Big{(}(T\wedge H)\sum_{(s,a)\in\mathcal{S}_{0}\times\mathcal{A}}\sum_{t=0}^{T\wedge H-1}{\bm{1}}_{(s,a)}(s_{t},a_{t})\Big{)}\leq\mathrm{E}_{\mu,\pi_{b}}\big{(}T^{2}\big{)}.

Since $\mathrm{E}_{\mu,\pi_{b}}(T^{2})\leq\frac{2M_{0}}{\lambda_{0}^{2}}$ and $\mathrm{E}_{\mu,\pi_{b}}((T-H){\bm{1}}_{\{T>H\}})\leq\frac{M_{0}}{\lambda_{0}}e^{-\lambda_{0}H}$, we arrive at the conclusion. $\square$

Lemma B.7.

We have that

\[
\mathrm{E}(I_{3}^{2})\leq\frac{R_{\max}^{2}\Lambda_{1}^{2}M_{0}}{\lambda_{0}m}.
\]

Proof. Noting that

I3\displaystyle I_{3} =\displaystyle= (s,a)𝒮0×𝒜w^m(s,a)[1mi=1m(r^i(s,a)R(s,a))d^πbi(s,a)],\displaystyle\sum_{(s,a)\in\mathcal{S}_{0}\times\mathcal{A}}\hat{w}_{m}(s,a)\Big{[}\frac{1}{m}\sum_{i=1}^{m}(\hat{r}^{i}(s,a)-R(s,a))\hat{d}^{i}_{\pi_{b}}(s,a)\Big{]},

applying the Hölder inequality, we have that

I32\displaystyle I_{3}^{2} \displaystyle\leq (s,a)𝒮0×𝒜w^m2(s,a)(s,a)𝒮0×𝒜[1mi=1m(r^i(s,a)R(s,a))d^πbi(s,a)]2\displaystyle\sum_{(s,a)\in\mathcal{S}_{0}\times\mathcal{A}}\hat{w}_{m}^{2}(s,a)\sum_{(s,a)\in\mathcal{S}_{0}\times\mathcal{A}}\Big{[}\frac{1}{m}\sum_{i=1}^{m}(\hat{r}^{i}(s,a)-R(s,a))\hat{d}^{i}_{\pi_{b}}(s,a)\Big{]}^{2}
\displaystyle\leq Λ12(s,a)𝒮0×𝒜[1mi=1m(r^i(s,a)R(s,a))d^πbi(s,a)]2.\displaystyle\Lambda_{1}^{2}\sum_{(s,a)\in\mathcal{S}_{0}\times\mathcal{A}}\Big{[}\frac{1}{m}\sum_{i=1}^{m}(\hat{r}^{i}(s,a)-R(s,a))\hat{d}^{i}_{\pi_{b}}(s,a)\Big{]}^{2}.

Therefore,

E[I32]\displaystyle\mathrm{E}[I_{3}^{2}] \displaystyle\leq Λ12E[(s,a)𝒮0×𝒜E([1mi=1m(r^i(s,a)R(s,a))d^πbi(s,a)]2|d^πbi(s,a),i=1,,m)].\displaystyle\Lambda_{1}^{2}\mathrm{E}\left[\sum_{(s,a)\in\mathcal{S}_{0}\times\mathcal{A}}\mathrm{E}\left(\Big{[}\frac{1}{m}\sum_{i=1}^{m}(\hat{r}^{i}(s,a)-R(s,a))\hat{d}^{i}_{\pi_{b}}(s,a)\Big{]}^{2}\Big{|}\hat{d}^{i}_{\pi_{b}}(s,a),i=1,\cdots,m\right)\right].

Because $\hat{d}^{i}_{\pi_{b}}(s,a)$, $i=1,\cdots,m$, are i.i.d., and the rewards collected at $(s,a)$ follow the distribution $\mathcal{R}(s,a)$ independently of $\hat{d}^{i}_{\pi_{b}}(s,a)$, $i=1,\cdots,m$, it follows that

E([1mi=1m(r^i(s,a)R(s,a))d^πbi(s,a)]2|d^πbi(s,a),i=1,,m)\displaystyle\mathrm{E}\left(\Big{[}\frac{1}{m}\sum_{i=1}^{m}(\hat{r}^{i}(s,a)-R(s,a))\hat{d}^{i}_{\pi_{b}}(s,a)\Big{]}^{2}\Big{|}\hat{d}^{i}_{\pi_{b}}(s,a),i=1,\cdots,m\right)
=\displaystyle= 1m2i=1m(d^πbi(s,a))2Var(r^i(s,a)|d^πbi(s,a))\displaystyle\frac{1}{m^{2}}\sum_{i=1}^{m}(\hat{d}^{i}_{\pi_{b}}(s,a))^{2}{\rm Var}\Big{(}\hat{r}^{i}(s,a)\Big{|}\hat{d}^{i}_{\pi_{b}}(s,a)\Big{)}
=\displaystyle= 1m2i=1m(d^πbi(s,a))2Var(t=0li1rti𝟏(s,a)(sti,ati)d^πbi(s,a)|d^πbi(s,a)).\displaystyle\frac{1}{m^{2}}\sum_{i=1}^{m}(\hat{d}^{i}_{\pi_{b}}(s,a))^{2}{\rm Var}\Big{(}\frac{\sum_{t=0}^{l_{i}-1}r^{i}_{t}{\bm{1}}_{(s,a)}(s_{t}^{i},a_{t}^{i})}{\hat{d}^{i}_{\pi_{b}}(s,a)}\Big{|}\hat{d}^{i}_{\pi_{b}}(s,a)\Big{)}.

When $\hat{d}^{i}_{\pi_{b}}(s,a)$ is given, $\sum_{t=0}^{l_{i}-1}r^{i}_{t}{\bm{1}}_{(s,a)}(s_{t}^{i},a_{t}^{i})$ is the sum of $\hat{d}^{i}_{\pi_{b}}(s,a)$ independent random variables with the common distribution $\mathcal{R}(s,a)$. Hence

Var(r^i(s,a)|d^πbi(s,a))=Var(s,a)(r)d^πbi(s,a)Rmax2d^πbi(s,a).\displaystyle{\rm Var}\Big{(}\hat{r}^{i}(s,a)\Big{|}\hat{d}^{i}_{\pi_{b}}(s,a)\Big{)}=\frac{{\rm Var}_{\mathcal{R}(s,a)}(r)}{\hat{d}^{i}_{\pi_{b}}(s,a)}\leq\frac{R^{2}_{max}}{\hat{d}^{i}_{\pi_{b}}(s,a)}.

Therefore

E(I32)\displaystyle\mathrm{E}(I_{3}^{2}) \displaystyle\leq Rmax2Λ12m2E((s,a)𝒮0×𝒜i=1md^πbi(s,a))=Rmax2Λ12m2E(i=1m(s,a)𝒮0×𝒜d^πbi(s,a))\displaystyle\frac{R_{\max}^{2}\Lambda_{1}^{2}}{m^{2}}\mathrm{E}\Big{(}\sum_{(s,a)\in\mathcal{S}_{0}\times\mathcal{A}}\sum_{i=1}^{m}\hat{d}^{i}_{\pi_{b}}(s,a)\Big{)}=\frac{R_{\max}^{2}\Lambda_{1}^{2}}{m^{2}}\mathrm{E}\Big{(}\sum_{i=1}^{m}\sum_{(s,a)\in\mathcal{S}_{0}\times\mathcal{A}}\hat{d}^{i}_{\pi_{b}}(s,a)\Big{)}
=\displaystyle= Rmax2Λ12m2E(i=1m(li1))Rmax2Λ12mEμ,πb(T),\displaystyle\frac{R_{\max}^{2}\Lambda_{1}^{2}}{m^{2}}\mathrm{E}\Big{(}\sum_{i=1}^{m}(l_{i}-1)\Big{)}\leq\frac{R_{\max}^{2}\Lambda_{1}^{2}}{m}\mathrm{E}_{\mu,\pi_{b}}(T),

which implies the desired result, since Eμ,πb(T)M0λ0\mathrm{E}_{\mu,\pi_{b}}(T)\leq\frac{M_{0}}{\lambda_{0}}. \square

B.3. Optimization

Based on the above discussion, we get that for any truncation level $H$ and any $\beta$ such that $M_{0}e^{-\lambda_{0}H}\leq\beta$,

E(|R^mRπe|2)\displaystyle\mathrm{E}(|\hat{R}_{m}-R_{\pi_{e}}|^{2}) \displaystyle\leq 2E(I12)+4E(I22)+4E(I32)\displaystyle 2\mathrm{E}(I_{1}^{2})+4\mathrm{E}(I_{2}^{2})+4\mathrm{E}(I_{3}^{2})
\displaystyle\leq 8minw𝒲maxq𝒬L(w,q)2+12E(I112)+4E(I22)+4E(I32)\displaystyle 8\min\limits_{w\in\mathcal{W}}\max\limits_{q\in\mathcal{Q}}L(w,q)^{2}+12\mathrm{E}(I_{11}^{2})+4\mathrm{E}(I_{2}^{2})+4\mathrm{E}(I_{3}^{2})
\displaystyle\leq 8minw𝒲maxq𝒬L(w,q)2+12C3(β2+(1lnβ+(lnβ)2)lnmm+βm)\displaystyle 8\min\limits_{w\in\mathcal{W}}\max\limits_{q\in\mathcal{Q}}L(w,q)^{2}+12C_{3}\Big{(}\beta^{2}+(1-\ln\beta+(\ln\beta)^{2})\frac{\ln m}{m}+\frac{\beta}{m}\Big{)}
+4Rmax2Λ12M0mλ0+8Λ12Rmax2M0λ02(1m+M0e2λ0H),\displaystyle+\frac{4R_{\max}^{2}\Lambda_{1}^{2}M_{0}}{m\lambda_{0}}+\frac{8\Lambda_{1}^{2}R_{\max}^{2}M_{0}}{\lambda_{0}^{2}}(\frac{1}{m}+M_{0}e^{-2\lambda_{0}H}),

where the first inequality follows from (22) and the simple inequality (a+b)22a2+2b2(a+b)^{2}\leq 2a^{2}+2b^{2}, the second inequality follows from (23) and the last inequality is due to Lemma B.5-Lemma B.7. Letting

C4=max{24C3+4Rmax2Λ12M0λ0+8Λ12Rmax2M0λ02,8Λ12Rmax2M02λ02},C_{4}=\max\{24C_{3}+\frac{4R_{\max}^{2}\Lambda_{1}^{2}M_{0}}{\lambda_{0}}+\frac{8\Lambda_{1}^{2}R_{\max}^{2}M_{0}}{\lambda_{0}^{2}},\frac{8\Lambda_{1}^{2}R_{\max}^{2}M_{0}^{2}}{\lambda_{0}^{2}}\},

we have that, for any truncation level HH and βM0eλ0H\beta\geq M_{0}e^{-\lambda_{0}H},

E(|R^mRπe|2)8minw𝒲maxq𝒬L(w,q)2+C4(G(β,m)+e2λ0H),\displaystyle\mathrm{E}(|\hat{R}_{m}-R_{\pi_{e}}|^{2})\leq 8\min\limits_{w\in\mathcal{W}}\max\limits_{q\in\mathcal{Q}}L(w,q)^{2}+C_{4}\big{(}G(\beta,m)+e^{-2\lambda_{0}H}\big{)}, (31)

where

G(x,m):=x2+(1lnx+ln2x)lnmm,G(x,m):=x^{2}+(1-\ln x+\ln^{2}x)\frac{\ln m}{m},

for x(0,+),m1x\in(0,+\infty),m\geq 1. Note that

Gx(x,m)=2x+lnm(2lnx1)mx,Gxx′′(x,m)=2+lnm(32lnx)mx2>0,G^{\prime}_{x}(x,m)=2x+\frac{\ln m(2\ln x-1)}{mx},\quad G^{\prime\prime}_{xx}(x,m)=2+\frac{\ln m(3-2\ln x)}{mx^{2}}>0,

which, combined with the facts that $G^{\prime}_{x}(1,m)>1$ and $\lim_{x\to 0+}G^{\prime}_{x}(x,m)=-\infty$, implies that there exists a unique $H_{m}\in(0,1)$ such that

2mHm2+2lnmlnHmlnm=0,2mH_{m}^{2}+2\ln m\ln H_{m}-\ln m=0,

which implies Gx(Hm,m)=0G^{\prime}_{x}(H_{m},m)=0 and hence for all x(0,+)x\in(0,+\infty),

G(Hm,m)G(x,m).G(H_{m},m)\leq G(x,m).

Moreover, G(x,m)G(x,m) is decreasing for x(0,Hm)x\in(0,H_{m}) while increasing for x(Hm,+)x\in(H_{m},+\infty).

Lemma B.8.

For any $m\geq e$, $\sqrt{\ln m/(2m)}\leq H_{m}\leq\sqrt{e\ln^{2}m/m}$, and there exist constants $0<k_{1}<k_{2}$ such that

\[
k_{1}\frac{\ln^{3}m}{m}\leq G(H_{m},m)\leq k_{2}\frac{\ln^{3}m}{m}.
\]

Proof. From $G^{\prime}_{x}(H_{m},m)=0$, we know that $H_{m}$ is a solution of

\[
h(x):=2mx^{2}+2\ln m\ln x-\ln m=0.
\]

Note that $h(x)$ is increasing on $(0,1]$ and

\begin{align*}
h\left(\sqrt{\frac{\ln m}{2m}}\right) &= 2\ln m\ln\left(\sqrt{\frac{\ln m}{2m}}\right)<0,\\
h\left(\sqrt{\frac{e\ln^{2}m}{m}}\right) &= (2e-1)\ln^{2}m+2\ln m\ln\ln m>0.
\end{align*}

We thus have $\sqrt{\ln m/(2m)}\leq H_{m}\leq\sqrt{e\ln^{2}m/m}$.

To prove the second assertion, we note that

G(Hm,m)G(ln3mm,m)ln3mm+(1+12lnm+14ln2m)lnmm3ln3mm.\displaystyle G(H_{m},m)\leq G\left(\sqrt{\frac{\ln^{3}m}{m}},m\right)\leq\frac{\ln^{3}m}{m}+(1+\frac{1}{2}\ln m+\frac{1}{4}\ln^{2}m)\frac{\ln m}{m}\leq 3\frac{\ln^{3}m}{m}.

On the other hand, when x<ln3m/mx<\sqrt{\ln^{3}m/m},

G(x,m)\displaystyle G(x,m) \displaystyle\geq (1lnx+ln2x)lnmm14(lnln3mm)2lnmm\displaystyle(1-\ln x+\ln^{2}x)\frac{\ln m}{m}\geq\frac{1}{4}\left(\ln\frac{\ln^{3}m}{m}\right)^{2}\frac{\ln m}{m}
\displaystyle\geq ln3m4m[13lnlnmlnm]2(e39)24e6ln3mm,\displaystyle\frac{\ln^{3}m}{4m}\Big{[}1-3\frac{\ln\ln m}{\ln m}\Big{]}^{2}\geq\frac{(e^{3}-9)^{2}}{4e^{6}}\frac{\ln^{3}m}{m},

for mee3m\geq e^{e^{3}}. Noting that ln3m/m>0\ln^{3}m/m>0 for all m>em>e, we can find a constant kk^{\prime} such that

G(x,m)kln3mm,G(x,m)\geq k^{\prime}\frac{\ln^{3}m}{m},

for all x(0,ln3m/m)x\in(0,\sqrt{\ln^{3}m/m}) and m>em>e. Moreover, when xln3m/mx\geq\sqrt{\ln^{3}m/m},

G(x,m)>x2ln3mm.\displaystyle G(x,m)>x^{2}\geq\frac{\ln^{3}m}{m}.

Consequently, there exists a constant k1k_{1} such that G(Hm,m)k1ln3m/m.G(H_{m},m)\geq k_{1}\ln^{3}m/m. \square
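Although the equation defining $H_{m}$ has no closed form, it is straightforward to solve numerically. The following short Python sketch (with illustrative values of $m$ chosen by us; the function name is ours) finds $H_{m}$ by bisection, using the monotonicity of the left-hand side, and checks the bracketing bounds of Lemma B.8:

import math

def solve_hm(m, tol=1e-12):
    # Solve 2*m*x**2 + 2*ln(m)*ln(x) - ln(m) = 0 on (0, 1) by bisection;
    # the left-hand side is strictly increasing in x on (0, 1].
    f = lambda x: 2 * m * x ** 2 + 2 * math.log(m) * math.log(x) - math.log(m)
    lo, hi = 1e-12, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

for m in (100, 10_000, 1_000_000):
    hm = solve_hm(m)
    lower = math.sqrt(math.log(m) / (2 * m))
    upper = math.sqrt(math.e * math.log(m) ** 2 / m)
    print(m, lower <= hm <= upper, hm)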

Now we are in a position to finish the proof of Theorem 4.1.

Proof. From (31) and Lemma B.8, it follows that if HmM0eλ0HH_{m}\geq M_{0}e^{-\lambda_{0}H}, there is a constant CC independent of m,Hm,H such that

E(|R^mRπe|2)\displaystyle\mathrm{E}(|\hat{R}_{m}-R_{\pi_{e}}|^{2}) \displaystyle\leq 8minw𝒲,q𝒬L2(w,q)+C4(minβM0eλ0HG(β,m)+e2λ0H)\displaystyle 8\min_{w\in\mathcal{W},q\in\mathcal{Q}}L^{2}(w,q)+C_{4}\big{(}\min_{{\beta\geq M_{0}e^{-\lambda_{0}H}}}G(\beta,m)+e^{-2\lambda_{0}H}\big{)}
=\displaystyle= 8minw𝒲,q𝒬L2(w,q)+C4(G(Hm,m)+e2λ0H)\displaystyle 8\min_{w\in\mathcal{W},q\in\mathcal{Q}}L^{2}(w,q)+C_{4}(G(H_{m},m)+e^{-2\lambda_{0}H})
\displaystyle\leq 8minw𝒲,q𝒬L2(w,q)+Cln3mm,\displaystyle 8\min_{w\in\mathcal{W},q\in\mathcal{Q}}L^{2}(w,q)+C\frac{\ln^{3}m}{m},

since HmM0eλ0HH_{m}\geq M_{0}e^{-\lambda_{0}H} implies that

e2λ0HHm2M0eM0ln2mm.\displaystyle e^{-2\lambda_{0}H}\leq\frac{H_{m}^{2}}{M_{0}}\leq\frac{e}{M_{0}}\frac{\ln^{2}m}{m}.

When Hm<M0eλ0HH_{m}<M_{0}e^{-\lambda_{0}H},

E(|R^mRπe|2)8minw𝒲,q𝒬L2(w,q)+C4(G(M0eλ0H,m)+e2λ0H)\displaystyle\mathrm{E}(|\hat{R}_{m}-R_{\pi_{e}}|^{2})\leq 8\min_{w\in\mathcal{W},q\in\mathcal{Q}}L^{2}(w,q)+C_{4}(G(M_{0}e^{-\lambda_{0}H},m)+e^{-2\lambda_{0}H})
=8minw𝒲,q𝒬L2(w,q)+C4((1+M02)e2λ0H+(1lnM0+λ0H+(lnM0λ0H)2)lnmm)\displaystyle\quad=8\min_{w\in\mathcal{W},q\in\mathcal{Q}}L^{2}(w,q)+C_{4}\Big{(}(1+M_{0}^{2})e^{-2\lambda_{0}H}+(1-\ln M_{0}+\lambda_{0}H+(\ln M_{0}-\lambda_{0}H)^{2})\frac{\ln m}{m}\Big{)}
8minw𝒲,q𝒬L2(w,q)+C(e2λ0H+H2lnmm),\displaystyle\quad\leq 8\min_{w\in\mathcal{W},q\in\mathcal{Q}}L^{2}(w,q)+C\Big{(}e^{-2\lambda_{0}H}+H^{2}\frac{\ln m}{m}\Big{)},

for some constant CC independent of m,Hm,H. \square

From Theorem 4.1, it is easy to get Theorem 4.2. We briefly state the proof as follows.

Proof of Theorem 4.2. When Qπe𝒬Q_{\pi_{e}}\not\in\mathcal{Q}, (23) does not hold but can be adjusted as follows. Since

I12\displaystyle I_{1}^{2} =\displaystyle= L2(w^m,Qπe)=(L(w^m,Qπeq)+L(w^m,q))22(L2(w^m,q)+L2(w^m,Qπeq)),\displaystyle L^{2}(\hat{w}_{m},Q_{\pi_{e}})=(L(\hat{w}_{m},Q_{\pi_{e}}-q)+L(\hat{w}_{m},q))^{2}\leq 2(L^{2}(\hat{w}_{m},q)+L^{2}(\hat{w}_{m},Q_{\pi_{e}}-q)),

for any q𝒬q\in\mathcal{Q}, we have that,

I12\displaystyle I_{1}^{2} \displaystyle\leq 2maxq𝒬L2(w^m,q)+2minq𝒬L2(w^m,Qπeq).\displaystyle 2\max_{q\in\mathcal{Q}}L^{2}(\hat{w}_{m},q)+2\min_{q\in\mathcal{Q}}L^{2}(\hat{w}_{m},Q_{\pi_{e}}-q).
\displaystyle\leq 2maxq𝒬(L(w^m,q)L^m(w^m,q)+L^m(w^m,q))2+2maxw𝒲minq𝒬L2(w,Qπeq)\displaystyle 2\max_{q\in\mathcal{Q}}(L(\hat{w}_{m},q)-\hat{L}_{m}(\hat{w}_{m},q)+\hat{L}_{m}(\hat{w}_{m},q))^{2}+2\max_{w\in\mathcal{W}}\min_{q\in\mathcal{Q}}L^{2}(w,Q_{\pi_{e}}-q)
\displaystyle\leq 4maxq𝒬(L(w^m,q)L^m(w^m,q))2+4maxq𝒬L^m2(w,q)+2maxw𝒲minq𝒬L2(w,Qπeq),\displaystyle 4\max_{q\in\mathcal{Q}}(L(\hat{w}_{m},q)-\hat{L}_{m}(\hat{w}_{m},q))^{2}+4\max_{q\in\mathcal{Q}}\hat{L}_{m}^{2}(w,q)+2\max_{w\in\mathcal{W}}\min_{q\in\mathcal{Q}}L^{2}(w,Q_{\pi_{e}}-q),

for any w𝒲w\in\mathcal{W}. Consequently,

I12\displaystyle I_{1}^{2} \displaystyle\leq 4maxw𝒲,q𝒬(L(w^m,q)L^m(w^m,q))2+8maxq𝒬(L^m(w,q)L(w,q))2\displaystyle 4\max_{w\in\mathcal{W},q\in\mathcal{Q}}(L(\hat{w}_{m},q)-\hat{L}_{m}(\hat{w}_{m},q))^{2}+8\max_{q\in\mathcal{Q}}(\hat{L}_{m}(w,q)-L(w,q))^{2}
+8maxq𝒬L2(w,q)+2maxw𝒲minq𝒬L2(w,Qπeq)\displaystyle+8\max_{q\in\mathcal{Q}}L^{2}(w,q)+2\max_{w\in\mathcal{W}}\min_{q\in\mathcal{Q}}L^{2}(w,Q_{\pi_{e}}-q)
\displaystyle\leq 4maxw𝒲,q𝒬(L(w^m,q)L^m(w^m,q))2+8maxw𝒲,q𝒬(L^m(w,q)L(w,q))2\displaystyle 4\max_{w\in\mathcal{W},q\in\mathcal{Q}}(L(\hat{w}_{m},q)-\hat{L}_{m}(\hat{w}_{m},q))^{2}+8\max_{w\in\mathcal{W},q\in\mathcal{Q}}(\hat{L}_{m}(w,q)-L(w,q))^{2}
+8minw𝒲maxq𝒬L2(w,q)+2maxw𝒲minq𝒬L2(w,Qπeq)\displaystyle+8\min_{w\in\mathcal{W}}\max_{q\in\mathcal{Q}}L^{2}(w,q)+2\max_{w\in\mathcal{W}}\min_{q\in\mathcal{Q}}L^{2}(w,Q_{\pi_{e}}-q)
=\displaystyle= 12maxw𝒲,q𝒬(L(w^m,q)L^m(w^m,q))2+8minw𝒲maxq𝒬L2(w,q)+2maxw𝒲minq𝒬L2(w,Qπeq),\displaystyle 12\max_{w\in\mathcal{W},q\in\mathcal{Q}}(L(\hat{w}_{m},q)-\hat{L}_{m}(\hat{w}_{m},q))^{2}+8\min_{w\in\mathcal{W}}\max_{q\in\mathcal{Q}}L^{2}(w,q)+2\max_{w\in\mathcal{W}}\min_{q\in\mathcal{Q}}L^{2}(w,Q_{\pi_{e}}-q),

and therefore

E(I12)\displaystyle\mathrm{E}(I_{1}^{2}) \displaystyle\leq 12E(maxw𝒲,q𝒬(L(w^m,q)L^m(w^m,q))2)\displaystyle 12\mathrm{E}(\max_{w\in\mathcal{W},q\in\mathcal{Q}}(L(\hat{w}_{m},q)-\hat{L}_{m}(\hat{w}_{m},q))^{2})
+8minw𝒲maxq𝒬L2(w,q)+2maxw𝒲minq𝒬L2(w,Qπeq).\displaystyle+8\min_{w\in\mathcal{W}}\max_{q\in\mathcal{Q}}L^{2}(w,q)+2\max_{w\in\mathcal{W}}\min_{q\in\mathcal{Q}}L^{2}(w,Q_{\pi_{e}}-q).

Using this inequality in place of (23) and then repeating the argument in the proof of Theorem 4.1, we readily get the desired result. $\square$

Appendix C Algorithm Supplement

Algorithm 1 summarizes the pseudo-code of our MWLA algorithm as applied to the taxi environment in Section 6.

The algorithm uses the following notation: $G$ is the $nh\times nh$ zero matrix; $Frequency=[0,\dots,0]_{nh}^{\top}$, $auxi=[0,\dots,0]_{nh}^{\top}$ and $\hat{\mu}=[0,\dots,0]_{nh}^{\top}$ are zero vectors of length $nh$; $X$ denotes the set of absorbing states; and $Y=\{h\times i+j:i\in X,\,j\in\mathcal{A}\}$.

Algorithm 1 Tabular case

1: Input: off-policy data $D=\{s_{0}^{i},a_{0}^{i},r_{0}^{i},s_{1}^{i},\cdots,s_{T_{i}\wedge H-1}^{i},a_{T_{i}\wedge H-1}^{i},r_{T_{i}\wedge H-1}^{i},s_{T_{i}\wedge H}^{i}\}_{i=1}^{m}$ collected under the behavior policy $\pi_{b}$; a target policy $\pi$ for which we want to estimate the expected return.
2: Estimate the initial state distribution $\hat{\mu}(s)=\frac{1}{m}\sum_{i=1}^{m}{\bm{1}}_{\{s_{0}^{i}=s\}}$, where ${\bm{1}}_{\{\cdot\}}$ is the indicator function.
3: for episode in D do
4:     for (s, a, s′, r) in episode do
5:         cur = h × s + a,
6:         G[cur, h × s′ : h × (s′ + 1)] += π[s′, :],
7:         G[cur, cur] −= 1.0,
8:         Frequency[cur] += 1.
       end for
   end for
9: $auxi=\sum_{s,a}\hat{\mu}(s)\pi(a|s){\bm{1}}_{(s,a)}$.
10: tvalid = where(Frequency > 0), the indices of the entries of Frequency that are greater than 0.
11: $\hat{d}_{\pi_{b}}$ = delete(Frequency, Y, 0): delete the rows corresponding to the absorbing states from Frequency.
12: tvalid1 = where($\hat{d}_{\pi_{b}}$ > 0).
13: $\hat{d}_{\pi_{b}}=\hat{d}_{\pi_{b}}/m$.
14: $\hat{\bm{G}}$ = delete(G, Y, 0): delete the rows corresponding to the absorbing states from G.
15: $\hat{\bm{G}}$ = delete($\hat{\bm{G}}$, Y, 1): delete the columns corresponding to the absorbing states from $\hat{\bm{G}}$.
16: $\hat{\bm{G}}[:,tvalid1]=\hat{\bm{G}}[:,tvalid1]/(m\times\hat{d}_{\pi_{b}}[tvalid1])$.
17: $\hat{\bm{b}}$ = delete(auxi, Y, 0).
18: Compute $\hat{{\bm{u}}}=\arg\min_{{\bm{u}}\geq 0}\|(\hat{\bm{G}}+\lambda{\bf I})^{\top}{\bm{u}}+\hat{\bm{b}}\|^{2}$, where
\[
\hat{\bm{G}}=\frac{1}{m}\sum_{i=1}^{m}\sum_{t=0}^{T_{i}\wedge H-1}\frac{{\bf 1}_{(s^{i}_{t},a^{i}_{t})}\Big{[}\sum_{a\in\mathcal{A}}\pi(a|s^{i}_{t+1}){\bf 1}^{\top}_{(s_{t+1}^{i},a)}-{\bf 1}^{\top}_{(s_{t}^{i},a^{i}_{t})}\Big{]}}{\hat{d}_{\pi_{b}}(s_{t}^{i},a_{t}^{i})},\qquad\hat{\bm{b}}=\sum_{(s,a)\in\mathcal{S}_{0}\times\mathcal{A}}\hat{\mu}(s)\pi(a|s){\bf 1}_{(s,a)},
\]
$\lambda$ is a regularization factor, and $\bf I$ denotes the identity matrix.
19: Parameterize $w(tvalid)=\hat{\bm{u}}(tvalid1)/\hat{d}_{\pi_{b}}(tvalid1)$.
20: for episode in D do
21:     for (s, a, s′, r) in episode do
22:         cur = h × s + a,
23:         $\hat{R}_{\pi_{e},m}=\frac{1}{m}\sum_{i=1}^{m}\sum_{t=0}^{T_{i}\wedge H-1}w(cur)\times r$.
       end for
   end for
24: Output $\hat{R}_{\pi_{e},m}$.
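For concreteness, the accumulation loop in steps 3–8 can be sketched in Python as follows. This is only an illustrative sketch: the names episodes, pi, n_states and n_actions are our assumptions (an episode being a list of $(s,a,r,s^{\prime})$ tuples and pi an array with pi[s, a] $=\pi(a|s)$) and are not part of the pseudo-code above.

import numpy as np

def accumulate_statistics(episodes, pi, n_states, n_actions):
    # Sketch of steps 3-8 of Algorithm 1: accumulate the unnormalized matrix G and
    # the state-action visit counts Frequency from the off-policy episodes.
    h = n_actions
    nh = n_states * h
    G = np.zeros((nh, nh))
    frequency = np.zeros(nh)
    for episode in episodes:
        for (s, a, r, s_next) in episode:  # the reward r is not used in this step
            cur = h * s + a
            G[cur, h * s_next: h * (s_next + 1)] += pi[s_next, :]  # add pi(.|s') mass over actions
            G[cur, cur] -= 1.0                                     # subtract the indicator of (s, a)
            frequency[cur] += 1.0
    return G, frequency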

The idea of the algorithm is explained in Example 3.1. Here, for computational convenience, we set

\[
w(s,a)=\frac{\mathbf{1}_{\{s,a\}}^{\top}{\bm{u}}}{\hat{d}_{\pi_{b}}(s,a)}\quad\text{and}\quad q(s,a)=\mathbf{1}_{\{s,a\}}^{\top}{\bm{v}},\qquad\forall s\in\mathcal{S}_{0},a\in\mathcal{A}.
\]

We also introduce a regularization factor $\lambda>0$, which helps us find the unique solution of the constrained quadratic programming problem $\arg\min_{{\bm{u}}\geq 0}\|(\hat{\bm{G}}+\lambda{\bf I})^{\top}{\bm{u}}+\hat{\bm{b}}\|^{2}$. When $\lambda$ is sufficiently small, the solution is an approximation of $-(\hat{\bm{G}}^{+})^{\top}\hat{\bm{b}}$, where $\hat{\bm{G}}^{+}$ is the Moore–Penrose pseudo-inverse of $\hat{\bm{G}}$. In our experiments, $\lambda$ is set to $0.001$.
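As an illustrative sketch of how step 18 might be carried out in practice (not part of the original algorithm; the array names G_hat, b_hat and d_hat are our assumptions, and scipy.optimize.nnls is used as a generic non-negative least-squares solver), one possible implementation is:

import numpy as np
from scipy.optimize import nnls

def solve_weights(G_hat, b_hat, d_hat, lam=1e-3):
    # Solve  min_{u >= 0} || (G_hat + lam*I)^T u + b_hat ||^2  by rewriting it as the
    # non-negative least-squares problem  min_{u >= 0} || A u - (-b_hat) ||^2.
    n = G_hat.shape[0]
    A = (G_hat + lam * np.eye(n)).T
    u, _ = nnls(A, -b_hat)
    # Convert u back to the ratio w(s, a) = u(s, a) / d_hat_{pi_b}(s, a)
    # on the state-action pairs visited in the data.
    w = np.zeros(n)
    visited = d_hat > 0
    w[visited] = u[visited] / d_hat[visited]
    return w

The estimate $\hat{R}_{\pi_{e},m}$ is then obtained by weighting the observed rewards with the resulting $w$, as in steps 20–24.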