
Robust Offline Reinforcement Learning for Non-Markovian Decision Processes

Ruiquan Huang
Department of Electrical Engineering
The Pennsylvania State University
University Park, PA 16801
[email protected]
   Yingbin Liang
Department of Electrical and Computer Engineering
The Ohio State University
Columbus, OH 43210
[email protected]
   Jing Yang
Department of Electrical Engineering
The Pennsylvania State University
University Park, PA 16801
[email protected]
Abstract

Distributionally robust offline reinforcement learning (RL) aims to find a policy that performs best under the worst environment within an uncertainty set, using an offline dataset collected from a nominal model. While recent advances in robust RL focus on Markov decision processes (MDPs), robust non-Markovian RL is limited to the planning problem, where the transitions in the uncertainty set are known. In this paper, we study the learning problem of robust offline non-Markovian RL. Specifically, when the nominal model admits a low-rank structure, we propose a new algorithm featuring a novel dataset distillation and a lower confidence bound (LCB) design for robust values under different types of the uncertainty set. We also derive new dual forms for these robust values in non-Markovian RL, making our algorithm more amenable to practical implementation. By further introducing a novel type-I concentrability coefficient tailored for offline low-rank non-Markovian decision processes, we prove that our algorithm can find an $\epsilon$-optimal robust policy using $O(1/\epsilon^{2})$ offline samples. Moreover, we extend our algorithm to the case when the nominal model does not have a specific structure. With a new type-II concentrability coefficient, the extended algorithm also enjoys polynomial sample efficiency under all different types of the uncertainty set.

Keywords— Robust RL, Offline RL, POMDP, Non-Markovian Decision Processes

1 Introduction

In reinforcement learning (RL) (Sutton and Barto, 2018), an agent learns to make decisions by interacting with an unknown environment so as to maximize the total received reward. This learning framework has found wide applications such as video games (Ye et al., 2020), robotics (Ibarz et al., 2021), and language model fine-tuning (Ouyang et al., 2022). However, in many real-world scenarios such as autonomous driving and industrial control, training directly in the actual environment and generating data samples in an online manner is often unrealistic or even harmful. Offline RL (Lange et al., 2012; Levine et al., 2020), which leverages datasets collected in the past for learning, thus becomes important. Yet, pre-collected data suffers from either noisy simulation or out-of-date measurements, causing the so-called sim-to-real problem (Peng et al., 2018; Zhao et al., 2020).

Distributionally robust RL (Iyengar, 2005) was proposed to address this issue by learning an optimal policy under the worst environment in an uncertainty set. Robust RL has been studied under Markov decision processes (MDPs) with known transition probabilities in Xu and Mannor (2010); Wiesemann et al. (2013); Yu and Xu (2015). It has also been studied for MDPs with unknown transitions (Panaganti and Kalathil, 2022; Wang and Zou, 2021; Dong et al., 2022; Panaganti et al., 2022; Blanchet et al., 2023). Regarding robust RL under non-Markovian decision processes, only robust POMDPs with known transitions in the uncertainty set have been studied, in Osogami (2015); Rasouli and Saghafian (2018); Nakao et al. (2021). To the best of our knowledge, there has not been any study on robust non-Markovian RL with unknown transitions.

Several challenges need to be handled in exploring robust offline non-Markovian RL with unknown transitions. (i) Considering non-Markovian processes over the entire uncertainty set can cause a dramatic increase in sample complexity. The challenge then lies in how to exploit the structural conditions of these non-Markovian processes for sample-efficient design. (ii) The existing coverage condition for offline non-Markovian RL (Huang et al., 2023) is strong, and robust offline RL can incur even more stringent coverage conditions. It is thus critical to relax such conditions to design desirable offline algorithms. These challenges pose the following fundamental open question for us to address:

Q: Can we design a provably efficient algorithm for learning robust non-Markovian decision processes from offline data that (i) effectively exploits the structure of non-Markovian decision processes, and (ii) relaxes the coverage condition on the offline dataset?

Our contributions. In this paper, we provide the first study of robust non-Markovian RL with unknown transitions, i.e., an unknown uncertainty set. We focus on the setting where an offline dataset is provided, collected from one (unknown) nominal environment in an uncertainty set, and the environments are non-Markovian. We summarize our contributions as follows.

  • New algorithm design. We propose a novel algorithm for robust offline RL when the nominal model admits a low-rank structure of predictive state representations (PSRs). In particular, we consider two types of uncertainty sets, namely $\mathcal{T}$-type and $\mathcal{P}$-type, for non-Markovian decision processes. Our algorithm features two new elements: (i) a novel dataset distillation, where the distilled dataset ensures that the estimated model from vanilla maximum likelihood estimation performs well on the distilled data; and (ii) a lower confidence bound design for the robust value, which is derived by leveraging the low-rank structure and the newly developed dual forms for robust values under both $\mathcal{T}$-type and $\mathcal{P}$-type uncertainty sets.

  • Novel coverage condition and theory for low-rank non-Markovian RL. We characterize the near-optimality guarantee of our algorithm based on a novel type-I concentrability coefficient for offline learning in low-rank non-Markovian decision processes. This coefficient is much more relaxed than the one proposed for offline low-rank non-Markovian RL in Huang et al. (2023); the relaxation is due to a refined analysis of the distributional shift. We also identify a wellness condition number of the nominal model. Our results show that as long as the concentrability coefficient and the condition number are finite, our algorithm finds a robust policy whose suboptimality gap vanishes at a rate of $O(1/\sqrt{N})$, where $N$ is the number of offline samples.

  • Algorithm and theory for general non-Markovian RL. We develop a robust algorithm for general offline non-Markovian RL where the transition probabilities may not have any structure. We show that our algorithm can learn a near-optimal robust policy efficiently provided that the offline dataset has a finite type-II concentrability coefficient, which is again a much weaker condition than that proposed in Huang et al. (2023).

2 Related Work

Offline RL. The offline RL problem has been investigated extensively in the literature (Antos et al., 2008; Bertsekas, 2011; Lange et al., 2012; Chen and Jiang, 2019; Xie and Jiang, 2020; Levine et al., 2020; Xie et al., 2021). Many methods such as Q-learning (Jin et al., 2021), its variant Fitted Q-Iteration (FQI) (Gordon, 1995; Ernst et al., 2005; Munos and Szepesvári, 2008; Farahmand et al., 2010; Lazaric et al., 2012; Liu et al., 2020; Xie et al., 2021), value iteration (Nguyen-Tang et al., 2023; Xiong et al., 2022), policy gradient (Zhou et al., 2017), and model-based approaches (Uehara and Sun, 2021; Uehara et al., 2021) have been shown to be effective in offline RL either theoretically or empirically. Among them, the pessimism principle is a popular approach to handle the distribution shift between the offline data and the target policy. Empirically, pessimistic approaches perform well on simulated control tasks (Kidambi et al., 2020; Yu et al., 2020; Kumar et al., 2020; Liu et al., 2020; Chang et al., 2021). On the theoretical side, pessimism yields Probably Approximately Correct (PAC) guarantees for various models under a coverage condition (Jin et al., 2021; Yin et al., 2021; Rashidinejad et al., 2021; Zanette et al., 2021; Uehara and Sun, 2021; Uehara et al., 2021). In this work, we also adopt the pessimism principle and are the first to establish its efficiency for robust offline non-Markovian RL. We develop new coverage conditions in order to guarantee provable efficiency.

Robust RL. Robust MDPs (RMDPs) were first introduced by Iyengar (2005); Nilim and El Ghaoui (2005). Most RMDP research in the literature focuses on the planning problem (Xu and Mannor, 2010; Wiesemann et al., 2013; Tamar et al., 2014; Yu and Xu, 2015; Mannor et al., 2016; Petrik and Russel, 2019; Wang et al., 2023a; Wang et al., 2023b), providing computationally efficient algorithms. Based on the data generation process, studies of the learning problem in RMDPs can be primarily divided into three categories. RMDPs with a generative model have been studied in Zhou et al. (2021); Yang et al. (2022); Panaganti and Kalathil (2022), where a generative model can uniformly collect offline data for each and every state-action pair. Online RMDPs have been studied in Roy et al. (2017); Lim et al. (2013); Wang and Zou (2021); Dong et al. (2022), which allow the agent to adaptively sample states and actions in the nominal model. Offline RMDPs have been studied in Panaganti et al. (2022); Blanchet et al. (2023), where only a pre-collected dataset is provided, which makes the problem more difficult due to distributional shift (Levine et al., 2020).

In addition to RMDPs, several works have studied robust POMDPs under assumptions such as known transitions in the uncertainty set (Osogami, 2015; Rasouli and Saghafian, 2018; Nakao et al., 2021). In particular, they focus on the planning problem in robust POMDPs, i.e., finding an optimal policy with full information of the uncertainty set, while our work focuses on the learning problem and characterizes the sample complexity with unknown uncertainty sets.

Non-Markovian RL. Learning in non-Markovian decision processes is both computationally and statistically intractable in general (Mossel and Roch, 2005; Mundhenk et al., 2000; Papadimitriou and Tsitsiklis, 1987; Vlassis et al., 2012; Krishnamurthy et al., 2016). Recently, a line of research (Boots et al., 2011; Jiang et al., 2018; Zhang et al., 2022; Zhan et al., 2022; Uehara et al., 2022; Liu et al., 2022; Chen et al., 2022; Zhong et al., 2022; Huang et al., 2023) has obtained polynomial sample complexity under various structural conditions. Among them, the low-rank structure (Uehara et al., 2022; Liu et al., 2022) has been proved to admit predictive state representations (PSRs) (Littman and Sutton, 2001; Singh et al., 2012), and captures and generalizes a rich subclass of non-Markovian decision-making problems such as MDPs and POMDPs. However, none of these works studies robust offline non-Markovian RL.

3 Preliminaries

Notations. For a vector $x$, $\|x\|_{A}$ stands for $\sqrt{x^{\top}Ax}$, and its $i$-th coordinate is denoted by $[x]_{i}$. For functions $\mathbb{P}$ and $\mathbb{Q}$ (not necessarily probability measures) over a set $\mathcal{X}$, the $\ell_{1}$ distance between them is $\|\mathbb{P}-\mathbb{Q}\|_{1}=\sum_{x}|\mathbb{P}(x)-\mathbb{Q}(x)|$, while the squared Hellinger distance is defined as $\mathtt{D}_{\mathtt{H}}^{2}(\mathbb{P},\mathbb{Q})=\frac{1}{2}\sum_{x}(\sqrt{\mathbb{P}(x)}-\sqrt{\mathbb{Q}(x)})^{2}$. For a sequence $\{x_{n},x_{n+1},\ldots,x_{m}\}$ with $n<m$, we write it compactly as $x_{n:m}$. For a sequence of sets $\mathcal{S}_{1},\mathcal{S}_{2},\ldots,\mathcal{S}_{n}$, the Cartesian product is denoted by $\mathcal{S}_{1}\times\mathcal{S}_{2}\times\cdots\times\mathcal{S}_{n}$; when these sets are identical, this product simplifies to $\mathcal{S}_{1}^{n}$. $\Delta(\mathcal{S})$ denotes the set of all probability distributions over the set $\mathcal{S}$.

3.1 Non-Markovian Decision Processes

In this work, we consider finite-horizon episodic (non-Markovian) decision processes with an observation space $\mathcal{O}$, an action space $\mathcal{A}$, and a horizon $H$. The dynamics of the process is determined by a set of distributions $\{\mathcal{T}_{h}(\cdot|o_{1:h},a_{1:h})\in\Delta(\mathcal{O})\}_{h=1}^{H-1}$, where $o_{1:h}\in\mathcal{O}^{h}$ denotes the sequence of observations up to time $h$, $a_{1:h}\in\mathcal{A}^{h}$ denotes the sequence of actions up to time $h$, and $\mathcal{T}_{h}(o_{h+1}|o_{1:h},a_{1:h})$ denotes the probability of observing $o_{h+1}$ given the history $o_{1:h},a_{1:h}$. A policy $\pi$ is defined by a collection of distributions $\pi=\{\pi_{h}:\mathcal{O}^{h}\times\mathcal{A}^{h-1}\rightarrow\Delta(\mathcal{A})\}_{h=1}^{H}$ chosen by the agent, with each $\pi_{h}(\cdot|\tau_{h-1},o_{h})$ being a distribution over $\mathcal{A}$. Specifically, at the beginning of each episode, the process starts at a fixed observation $o_{1}\in\mathcal{O}$ at time step 1. After observing $o_{1}$, the agent samples (takes) an action $a_{1}\in\mathcal{A}$ according to the distribution (policy) $\pi_{1}(\cdot|o_{1})$, and the process transits to $o_{2}\in\mathcal{O}$, sampled from the distribution $\mathcal{T}_{1}(\cdot|o_{1},a_{1})$. Then, at any time step $h\geq 2$, due to the non-Markovian nature, the agent samples an action $a_{h}$ based on all past information $(o_{1:h},a_{1:h-1})$, and the process transits to $o_{h+1}$, sampled from $\mathcal{T}_{h}(o_{h+1}|o_{1:h},a_{1:h})$. The interaction terminates after time step $H$.

Notably, the general non-Markovian decision-making problem subsumes not only fully observable MDPs but also POMDPs as special cases. In MDPs, $\mathcal{T}_{h-1}(o_{h}|\tau_{h-1})=\mathcal{T}_{h-1}(o_{h}|o_{h-1},a_{h-1})$, while in POMDPs, $\mathcal{T}_{h-1}(o_{h}|\tau_{h-1})$ can be factorized as $\mathcal{T}_{h-1}(o_{h}|\tau_{h-1})=\sum_{s}\mathbb{O}_{h}(o_{h}|s_{h})\mathbb{T}_{h-1}(s_{h}|s_{h-1},a_{h-1})\cdots\mathbb{T}_{1}(s_{2}|s_{1},a_{1})$, where $\{s_{t}\}_{t=1}^{H}$ are unobserved states, and $\{\mathbb{O}_{t}\}_{t=1}^{H}$ and $\{\mathbb{T}_{t}\}_{t=1}^{H}$ are called the emission distributions and transition distributions, respectively.

We note that an alternative, equivalent description of the process dynamics is the set of probabilities $\{\mathcal{P}(o_{1:H}|a_{1:H-1})=\mathcal{T}_{H-1}(o_{H}|o_{1:H-1},a_{1:H-1})\cdots\mathcal{T}_{1}(o_{2}|o_{1},a_{1}),\ \forall o_{1:H},a_{1:H-1}\}$, which is also frequently used in this paper. In fact, both $\mathcal{P}$ and $\mathcal{T}$ fully determine the dynamics, with an additional constraint on $\mathcal{P}$: $\sum_{o_{h+1:H}}\mathcal{P}(o_{1:H}|a_{1:H-1})$ is constant with respect to $a_{h:H-1}$ for all action sequences $a_{h:H-1}$. Hence, we use $\mathbb{D}=\{\mathcal{T}_{h},\mathcal{P}\}$ to represent the model dynamics. To make the presentation more explicit, we introduce a unified parameterization $\theta\in\Theta\subset\mathbb{R}^{D}$ for the process dynamics, written as $\mathbb{D}_{\theta}=\{\mathcal{T}_{h}^{\theta},\mathcal{P}^{\theta}\}$. Here, $\Theta$ is the complete set of model parameters and $D$ is usually a large integer. We also use $\theta$ to refer to a model.

For a given reward function $R:(\mathcal{O}\times\mathcal{A})^{H}\rightarrow[0,1]$ and any policy $\pi$, the value of $\pi$ under the dynamics $\mathbb{D}_{\theta}$ is denoted by $V_{\theta,R}^{\pi}=\mathbb{E}_{\tau_{H}\sim\mathbb{D}_{\theta}^{\pi}}[R(o_{1:H},a_{1:H})]$. We remark that the goal of an agent in standard RL is to find an $\epsilon$-optimal policy $\hat{\pi}$ such that $\max_{\pi}V_{\theta,R}^{\pi}-V_{\theta,R}^{\hat{\pi}}\leq\epsilon$ for some small error $\epsilon>0$ under a single model $\theta$.
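To make the interaction protocol concrete, the following minimal Python sketch rolls out trajectories from a history-dependent model and estimates $V_{\theta,R}^{\pi}$ by Monte Carlo. The callables transition, policy, and reward are illustrative placeholders (not part of the paper) that a user would supply for a particular model $\theta$.

import numpy as np

def rollout(transition, policy, o1, H, rng):
    # Sample one trajectory tau_H = (o_{1:H}, a_{1:H}); each call conditions on the
    # full history, reflecting the non-Markovian dynamics T_h(. | o_{1:h}, a_{1:h}).
    obs, acts = [o1], []
    for h in range(1, H + 1):
        acts.append(policy(obs, acts, rng))          # a_h ~ pi_h(. | o_{1:h}, a_{1:h-1})
        if h < H:
            obs.append(transition(obs, acts, rng))   # o_{h+1} ~ T_h(. | o_{1:h}, a_{1:h})
    return obs, acts

def mc_value(transition, policy, reward, o1, H, n_episodes=1000, seed=0):
    # Monte Carlo estimate of V_{theta,R}^pi = E[R(o_{1:H}, a_{1:H})].
    rng = np.random.default_rng(seed)
    returns = [reward(*rollout(transition, policy, o1, H, rng)) for _ in range(n_episodes)]
    return float(np.mean(returns))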

3.2 Robust RL

While standard RL assumes that the system dynamics remain the same, robust RL diverges from this setting by focusing on the ‘worst-case’ or robust value in situations where the parameter $\theta$ can change within a defined uncertainty set $B\subset\mathbb{R}^{D}$ (where $B$ will be defined below). Mathematically, given a policy $\pi$, the robust value under the uncertainty set $B$ is defined by $V_{B,R}^{\pi}=\min_{\theta\in B}V_{\theta,R}^{\pi}$.

In this paper, we consider two types of uncertainty sets, defined respectively on $\mathcal{T}$ and $\mathcal{P}$ via the $f$-divergence $\mathtt{D}_{f}(P,P^{\prime})=\int f(dP/dP^{\prime})\,dP^{\prime}$, where $f$ is a convex function measuring the uncertainty. Given a nominal model $\theta^{*}$, a $\mathcal{T}$-type uncertainty set under the $f$-divergence with center $\theta^{*}$ and radius $\xi$ is defined as

$$B_{\mathcal{T}}^{f}(\theta^{*}):=\left\{\theta:\mathtt{D}_{f}\left(\mathcal{T}_{h}^{\theta}(\cdot|\tau_{h}),\,\mathcal{T}_{h}^{\theta^{*}}(\cdot|\tau_{h})\right)\leq\xi,\ \forall h,\tau_{h}\right\}. \tag{1}$$

Similarly, a $\mathcal{P}$-type uncertainty set under the $f$-divergence with center $\theta^{*}$ and radius $\xi$ is defined as

$$B_{\mathcal{P}}^{f}(\theta^{*}):=\left\{\theta:\mathtt{D}_{f}\left(\mathcal{P}^{\theta}(\cdot|a_{1:H}),\,\mathcal{P}^{\theta^{*}}(\cdot|a_{1:H})\right)\leq\xi,\ \forall a_{1:H}\right\}. \tag{2}$$

For simplicity, we set $\xi$ to be a constant throughout the paper; our algorithm and analysis can be easily extended to the case where $\xi$ is a function of $\tau_{h}$ or $a_{1:H}$. Notably, the $\mathcal{T}$-type uncertainty set reduces to the $\mathcal{S}\times\mathcal{A}$-rectangular uncertainty set (Iyengar, 2005) if $\mathbb{D}$ reduces to a Markov decision process, i.e., $\mathcal{T}_{h}(\cdot|\tau_{h})=\mathcal{T}_{h}(\cdot|o_{h},a_{h})$. However, due to the non-Markovian nature of our problem, we also consider the $\mathcal{P}$-type uncertainty set, since $\mathcal{P}$ also fully determines the process.

We study two example types of $f$-divergence, which correspond to two classical divergences: (i) if $f_{1}(t)=|t-1|/2$, then $\mathtt{D}_{f_{1}}$ is the total variation (TV) distance; (ii) if $f_{2}(t)=t\log t$, then $\mathtt{D}_{f_{2}}$ is the Kullback–Leibler (KL) divergence. We also write $B^{i}$ for $B^{f_{i}}$ for simplicity.

We focus on robust offline RL, where the agent is provided with an offline dataset $\mathcal{D}=\{\tau_{H}^{n}\}_{n=1}^{N}$ that contains $N$ sample trajectories collected via a behavior policy $\rho$ under the nominal model $\theta^{*}$. Given the type of the uncertainty set $B\in\{B_{\mathcal{T}}^{i},B_{\mathcal{P}}^{i}\}_{i=1}^{2}$ (without knowing the center $\theta^{*}$), the learning goal of the agent is to find a near-optimal robust policy $\hat{\pi}$ using this offline dataset $\mathcal{D}$ such that

$$\max_{\pi}\min_{\theta\in B}V_{\theta,R}^{\pi}-\min_{\theta\in B}V_{\theta,R}^{\hat{\pi}}\leq\epsilon.$$
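As an illustration of the objects just defined, the sketch below checks the $\mathcal{P}$-type TV constraint of Equation 2 on a finite observation/action space and evaluates the max-min objective by brute force over finite surrogate sets of models and policies; value_fn, models_in_B, and candidate_policies are hypothetical placeholders, not quantities defined by the paper.

import numpy as np

def in_p_type_tv_ball(P_theta, P_star, xi):
    # P-type TV constraint (Eq. 2 with f_1(t) = |t-1|/2): for every action sequence,
    # (1/2) * sum over observation sequences of |P^theta - P^theta*| <= xi.
    # Both arguments: arrays of shape (num_action_sequences, num_observation_sequences).
    tv = 0.5 * np.abs(np.asarray(P_theta) - np.asarray(P_star)).sum(axis=1)
    return bool(np.all(tv <= xi))

def robust_value(value_fn, policy, models_in_B):
    # V_{B,R}^pi = min over theta in (a finite surrogate of) the uncertainty set B.
    return min(value_fn(theta, policy) for theta in models_in_B)

def best_robust_policy(value_fn, candidate_policies, models_in_B):
    # Brute-force version of the learning goal max_pi min_{theta in B} V_{theta,R}^pi.
    return max(candidate_policies, key=lambda pi: robust_value(value_fn, pi, models_in_B))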

3.3 Low-rank Non-Markovian Decision Process

We first introduce some additional notations that are helpful for describing low-rank non-Markovian decision processes. Given a historical trajectory $\tau_{h}:=(o_{1:h},a_{1:h})$ at time step $h$, we denote a future trajectory by $\omega_{h}:=(o_{h+1:H},a_{h+1:H})$. The set of all $\tau_{h}$ is denoted by $\mathcal{H}_{h}=(\mathcal{O}\times\mathcal{A})^{h}$ and the set of all future trajectories by $\Omega_{h}=(\mathcal{O}\times\mathcal{A})^{H-h}$. In addition, we add a superscript $\mathrm{o}$ or $\mathrm{a}$ to a historical or future trajectory to denote the observation sequence or the action sequence contained in that trajectory, respectively; for example, $\omega_{h}^{\mathrm{o}}=o_{h+1:H}$ and $\omega_{h}^{\mathrm{a}}=a_{h+1:H}$. To better describe a policy, we also use the notation $x_{h}=(\tau_{h-1},o_{h})=(o_{1:h},a_{1:h-1})$ for the information available to the learning agent at time step $h$. We remark that having $\tau_{h}$ and $x_{h}$ in the same equation implies that $\tau_{h}=(x_{h},a_{h})$.

Recall that a non-Markov decision process $\theta$ can be fully determined by $\mathcal{P}^{\theta}$. We introduce a collection of dynamics matrices $\{\mathbb{M}_{h}^{\theta}\in\mathbb{R}^{|\mathcal{H}_{h}|\times|\Omega_{h}|}\}_{h=1}^{H}$, where the entry at the $\tau_{h}$-th row and $\omega_{h}$-th column of $\mathbb{M}_{h}^{\theta}$ is $\mathcal{P}^{\theta}(\omega_{h}^{\mathrm{o}},\tau_{h}^{\mathrm{o}}|\tau_{h}^{\mathrm{a}},\omega_{h}^{\mathrm{a}})$.

Definition 3.1 (Rank-$r$ non-Markov decision process).

A non-Markov decision process $\theta$ has rank $r$ if, for any $h$, the dynamics matrix $\mathbb{M}_{h}^{\theta}$ has rank $r$.

It has been proved that any low-rank sequential decision-making problem $\theta$ admits a predictive state representation (PSR) (Liu et al., 2022) if, for each $h$, there exists a set of future trajectories known to the agent, namely core tests $\mathcal{Q}_{h}=\{\mathbf{q}_{h,1},\ldots,\mathbf{q}_{h,d}\}\subset\Omega_{h}$, such that the submatrix $\mathbb{M}_{h}^{\theta}[\mathcal{Q}_{h}]$ restricted to these tests (columns) has rank $r$, where $d\geq r$ is a positive integer (using the same $d$ for different $h$ is only for notational simplicity). We consider all low-rank problems with the same set of core tests $\{\mathcal{Q}_{h}\}_{h=1}^{H}$, which admit the following representations:

$$\mathcal{P}^{\theta}(o_{H},\ldots,o_{1}|a_{1},\dots,a_{H})=\mathbf{m}(\omega_{h})^{\top}\psi(\tau_{h}),\qquad \phi_{h}=\sum_{o_{H},\ldots,o_{h+1}}\mathbf{m}(\omega_{h}),$$

where $\mathbf{m}(\omega_{h})\in\mathbb{R}^{d}$ and $\psi(\tau_{h})\in\mathbb{R}^{d}$. We denote by $\theta=(\phi_{h},\psi(\tau_{h}))$ the representation. In particular, $\Theta$ is the set of all low-rank problems with the same core tests $\{\mathcal{Q}_{h}\}_{h=1}^{H}$.

We assume $\psi_{0}$ is known to the agent (Huang et al., 2023), and use a bar over a feature $\psi(\tau_{h})$ to denote its normalization $\bar{\psi}(\tau_{h})=\psi(\tau_{h})/\phi_{h}^{\top}\psi(\tau_{h})$. The vector $\bar{\psi}(\tau_{h})$ is also known as the prediction vector (Littman and Sutton, 2001) or prediction feature of $\tau_{h}$, since $[\bar{\psi}(\tau_{h})]_{\ell}$ is the probability of observing $\mathbf{q}_{h,\ell}^{\mathrm{o}}$ given the history $\tau_{h}$ and taking the action sequence $\mathbf{q}_{h,\ell}^{\mathrm{a}}$.
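A minimal sketch of these PSR quantities, assuming the representation $(\mathbf{m}(\omega_{h}),\psi(\tau_{h}),\phi_{h})$ is available as numpy vectors; the function names are illustrative rather than part of the paper.

import numpy as np

def prediction_feature(psi_tau_h, phi_h):
    # psi_bar(tau_h) = psi(tau_h) / (phi_h^T psi(tau_h)); its l-th entry is the
    # probability of observing q_{h,l}^o given history tau_h and actions q_{h,l}^a.
    psi_tau_h, phi_h = np.asarray(psi_tau_h, float), np.asarray(phi_h, float)
    return psi_tau_h / float(phi_h @ psi_tau_h)

def trajectory_probability(m_omega_h, psi_tau_h):
    # P^theta(o_{1:H} | a_{1:H}) = m(omega_h)^T psi(tau_h), for any split point h.
    return float(np.asarray(m_omega_h, float) @ np.asarray(psi_tau_h, float))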

Moreover, let $\mathcal{Q}_{h}^{A}=\{\mathbf{q}_{h,\ell}^{\mathrm{a}}\}_{\ell=1}^{d}$ be the set of action sequences that are part of the core tests, constructed by eliminating any repeated action sequences. $\mathcal{Q}_{h}^{A}$ is known as the set of core action sequences.

We further assume that the PSRs studied in this paper are well-conditioned, as specified in Assumption 3.2. Such an assumption and its variants are commonly adopted in the study of PSRs (Liu et al., 2022; Chen et al., 2022; Zhong et al., 2022; Huang et al., 2023).

Assumption 3.2 ($\gamma$-well-conditioned PSR).

A PSR $\theta$ is said to be $\gamma$-well-conditioned if

$$\forall h,\qquad\max_{x\in\mathbb{R}^{d_{h}}:\|x\|_{1}\leq 1}\ \max_{\pi,\tau_{h}}\ \sum_{\omega_{h}}\pi(\omega_{h}|\tau_{h})\,\big|\mathbf{m}(\omega_{h})^{\top}x\big|\leq\frac{1}{\gamma},$$

where $\pi(\omega_{h}|\tau_{h})=\pi(a_{H}|x_{H})\cdots\pi(a_{h+1}|x_{h+1})$.

Assumption 3.2 requires that the error in estimating $\theta$ does not blow up significantly when the error $x$ in estimating the probabilities of the core tests is small.

4 Robust Offline RL with Low-rank Nominal Model

In this section, we study the setting where the nominal model has a low-rank structure (with rank $r$) (Zhan et al., 2022; Liu et al., 2022; Chen et al., 2022; Huang et al., 2023). To unify the notations, we use $B_{\mathrm{u}}^{i}(\theta)$ to denote the uncertainty set, where $\mathrm{u}$ is either the symbol $\mathcal{T}$ or $\mathcal{P}$.

4.1 Algorithm Design

We present our algorithm for robust offline RL when the nominal model enjoys a low-rank structure. Our algorithm features two main novel elements: (a) dataset distillation, which keeps the estimated prediction features of the distilled data close to the true features, and thus enables us to construct (b) a lower confidence bound for the robust value $V_{B_{\mathrm{u}}^{i}(\theta^{*}),R}^{\pi}$. In particular, we develop a new dual form for the robust value $V_{B_{\mathcal{P}}^{i}(\theta),R}^{\pi}$ under the $\mathcal{P}$-type uncertainty set that is more amenable to practical implementation. The main steps of the algorithm are explained in detail as follows, and the pseudo-code is presented in Algorithm 1.

Algorithm 1 Robust Offline RL with a low-rank nominal model
1:  Input: offline dataset $\mathcal{D}$, divergence type $i$
2:  Estimate a model $\hat{\theta}$ according to Equation 3.
3:  Obtain the distilled dataset $\mathcal{D}^{\mathrm{g}}$ according to Equation 4.
4:  Construct $\hat{b}$ according to Equation 5.
5:  Output $\hat{\pi}$ such that $\hat{\pi}=\arg\max_{\pi\in\Pi}\left(V_{B_{\mathrm{u}}^{i}(\hat{\theta}),R}^{\pi}-C_{\mathrm{u}}^{i}V_{\hat{\theta},\hat{b}}^{\pi}\right)$.

Model estimation and dataset reform. First, we perform maximum likelihood estimation (MLE) of the nominal model $\theta^{*}$ by minimizing the following negative log-likelihood loss

$$\mathcal{L}(\theta|\mathcal{D})=\frac{1}{N}\sum_{n=1}^{N}-\log\mathbb{D}_{\theta}^{\rho}(\tau_{H}^{n}), \tag{3}$$

which can be done using a supervised learning oracle, e.g., stochastic gradient descent. Let $\hat{\theta}=\arg\min_{\theta}\mathcal{L}(\theta|\mathcal{D})$.
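A hedged sketch of the MLE step in Equation 3, using a generic black-box optimizer in place of the supervised-learning oracle; the parameterization of $\theta$ and the function log_prob_fn (returning $\log\mathbb{D}_{\theta}^{\rho}(\tau_{H})$) are assumptions supplied by the user.

import numpy as np
from scipy.optimize import minimize

def nll(theta_vec, trajectories, log_prob_fn):
    # Negative log-likelihood of Eq. (3): (1/N) * sum_n -log D_theta^rho(tau_H^n).
    return -float(np.mean([log_prob_fn(theta_vec, tau) for tau in trajectories]))

def mle_estimate(trajectories, log_prob_fn, theta_init):
    # Vanilla MLE theta_hat = argmin_theta L(theta | D); the paper assumes a
    # supervised-learning oracle (e.g., SGD); a generic optimizer is used here instead.
    res = minimize(nll, np.asarray(theta_init, float),
                   args=(trajectories, log_prob_fn), method="L-BFGS-B")
    return res.x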

Novel dataset distillation. Motivated by Huang et al. (2023), to construct a desirable lower confidence bound for any value function under the nominal model, the estimated model $\hat{\theta}$ needs to assign non-negligible probability to every sample trajectory. Unlike Huang et al. (2023), which introduces additional constraints into the optimization oracle, we propose a novel idea: keep only the “good” sample trajectories whose estimated probability under $\hat{\theta}$ is greater than a pre-defined small value $p_{\min}$, and collect them in the set $\mathcal{D}^{\mathrm{g}}$ defined as

$$\mathcal{D}^{\mathrm{g}}=\left\{\tau_{H}\in\mathcal{D}:~\mathbb{D}_{\hat{\theta}}^{\rho}(\tau_{H})\geq p_{\min}\right\}. \tag{4}$$

We prove in Lemma B.1 that, with high probability, $|\mathcal{D}^{\mathrm{g}}|=\Omega(N)$, which still guarantees a sufficient number of samples for learning. Then, $\mathcal{D}^{\mathrm{g}}$ is randomly and evenly divided into $H$ datasets $\mathcal{D}_{0}^{\mathrm{g}},\ldots,\mathcal{D}_{H-1}^{\mathrm{g}}$, in order to separately learn $\bar{\psi}^{*}(\tau_{h})$ for each $h\in\{0,1,\ldots,H-1\}$.
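The distillation step in Equation 4 and the subsequent split into $H$ datasets could be implemented roughly as follows; prob_hat_fn (returning $\mathbb{D}_{\hat{\theta}}^{\rho}(\tau_{H})$) is a placeholder for a model-evaluation routine.

import numpy as np

def distill_and_split(trajectories, prob_hat_fn, p_min, H, seed=0):
    # Keep "good" trajectories with D_{theta_hat}^rho(tau_H) >= p_min (Eq. 4),
    # then split them randomly and evenly into H datasets D_0^g, ..., D_{H-1}^g.
    good = [tau for tau in trajectories if prob_hat_fn(tau) >= p_min]
    order = np.random.default_rng(seed).permutation(len(good))
    return [[good[i] for i in order[h::H]] for h in range(H)]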

LCB of the robust value. Given the estimated model $\hat{\theta}$ and divergence type $i$, Algorithm 1 constructs a lower confidence bound (LCB) of $V_{B_{\mathrm{u}}^{i}(\theta^{*}),R}^{\pi}$ for every policy $\pi$ through a bonus function $\hat{b}$ defined as

$$\hat{U}_{h}=\lambda I+\sum_{\tau_{h}\in\mathcal{D}_{h}^{\mathrm{g}}}\bar{\hat{\psi}}(\tau_{h})\bar{\hat{\psi}}(\tau_{h})^{\top},\qquad \hat{b}(\tau_{H})=\min\Big\{\alpha\sqrt{\sum_{h}\big\|\bar{\hat{\psi}}(\tau_{h})\big\|^{2}_{(\hat{U}_{h})^{-1}}},\,1\Big\}, \tag{5}$$

where $\lambda$ and $\alpha$ are pre-defined parameters. Recall that $\bar{\hat{\psi}}(\tau_{h})$ is the estimated prediction feature described in Section 3.3. Then, we can show that $V_{B_{\mathrm{u}}^{i}(\hat{\theta}),R}^{\pi}-C_{\mathrm{u}}^{i}V_{\hat{\theta},\hat{b}}^{\pi}$ is an LCB of $V_{B_{\mathrm{u}}^{i}(\theta^{*}),R}^{\pi}$, where $C_{\mathrm{u}}^{i}$ (defined in Section 4.2) is a scaling parameter that depends on the type of uncertainty set.
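A sketch of the bonus construction in Equation 5, assuming the estimated prediction features $\bar{\hat{\psi}}(\tau_{h})$ have already been computed and stored as numpy arrays.

import numpy as np

def covariance_matrices(feature_folds, lam):
    # U_h = lam * I + sum_{tau_h in D_h^g} psi_bar_hat(tau_h) psi_bar_hat(tau_h)^T;
    # feature_folds[h] is an (n_h, d) array of estimated prediction features.
    d = feature_folds[0].shape[1]
    return [lam * np.eye(d) + F.T @ F for F in feature_folds]

def bonus(features_along_traj, U_list, alpha):
    # b_hat(tau_H) = min{ alpha * sqrt( sum_h ||psi_bar_hat(tau_h)||^2_{U_h^{-1}} ), 1 }.
    total = sum(float(psi @ np.linalg.solve(U, psi))
                for psi, U in zip(features_along_traj, U_list))
    return min(alpha * np.sqrt(total), 1.0)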

Novel dual form of the robust value. For the $\mathcal{T}$-type uncertainty sets, define the robust value functions as

$$V_{B(\theta^{*}),R,h}^{\pi}(x_{h})=\inf_{\theta\in B(\theta^{*})}\mathbb{E}_{\theta}^{\pi}\left[\sum_{h^{\prime}=h}^{H}R(\bm{\tau}_{h^{\prime}})\,\bigg|\,x_{h}\right],\qquad Q_{B(\theta^{*}),R,h}^{\pi}(\tau_{h})=\inf_{\theta\in B(\theta^{*})}\mathbb{E}_{\theta}^{\pi}\left[\sum_{h^{\prime}=h}^{H}R(\bm{\tau}_{h^{\prime}})\,\bigg|\,\tau_{h}\right],$$

where $\tau_{h}=(x_{h},a_{h})$. Then, the following Bellman-type equations hold for the robust value functions.

Lemma 4.1.

For any $\theta$, let $V_{B(\theta),R,H+1}^{\pi}(\tau_{H})=0$. Then, we have

$$V_{B(\theta),R,h}^{\pi}(x_{h})=\sum_{a_{h}}\pi(a_{h}|x_{h})\,Q_{B(\theta),R,h}^{\pi}(\tau_{h}),\qquad Q_{B(\theta),R,h}^{\pi}(\tau_{h})=R(\tau_{h})+\inf_{\theta^{\prime}\in B(\theta)}\sum_{o_{h+1}}\mathcal{T}_{h}^{\theta^{\prime}}(o_{h+1}|\tau_{h})\,V_{B(\theta),R,h+1}^{\pi}(x_{h+1}).$$

Therefore, the robust value under the $\mathcal{T}$-type uncertainty set can be calculated using the recursive formula above. Notably, the calculation of $Q_{B(\theta),R,h}^{\pi}(\tau_{h})$ involves solving an optimization problem of the form $\inf_{P:\mathtt{D}_{f}(P,P^{*})\leq\xi}\mathbb{E}_{P}[g(X)]$, where $P$ and $P^{*}$ are probability distributions over a set $\mathcal{X}$. This optimization problem can be solved efficiently, as shown in the literature (Panaganti et al., 2022; Blanchet et al., 2023). We also provide its dual form in Appendix A; a sketch for the KL case is given below.
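For the KL case, the inner problem admits the dual form used in Assumption 4.5 and Equation 7, namely $\inf_{P:\mathtt{D}_{\mathtt{KL}}(P\|P^{*})\leq\xi}\mathbb{E}_{P}[g]=\sup_{\lambda\geq 0}\{-\lambda\log\mathbb{E}_{P^{*}}[e^{-g/\lambda}]-\lambda\xi\}$. The sketch below evaluates this dual over a finite support by a one-dimensional search; the upper bound lam_max is an implementation choice, not part of the paper.

import numpy as np
from scipy.optimize import minimize_scalar

def robust_expectation_kl(g, p_star, xi, lam_max=100.0):
    # Dual of inf_{P: KL(P||P*) <= xi} E_P[g] over a finite support:
    #   sup_{lam >= 0}  -lam * log E_{P*}[exp(-g / lam)] - lam * xi.
    g, p_star = np.asarray(g, float), np.asarray(p_star, float)
    support = p_star > 0

    def neg_dual(lam):
        z = -g[support] / lam
        m = np.max(z)
        log_mgf = m + np.log(np.sum(p_star[support] * np.exp(z - m)))  # stable log-sum-exp
        return lam * log_mgf + lam * xi                                # minus the dual objective

    res = minimize_scalar(neg_dual, bounds=(1e-6, lam_max), method="bounded")
    return -res.fun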

On the other hand, when the uncertainty set is $\mathcal{P}$-type, we develop new dual forms of the robust values $V_{B_{\mathcal{P}}^{1}(\theta^{*}),R}^{\pi}$ and $V_{B_{\mathcal{P}}^{2}(\theta^{*}),R}^{\pi}$ in Equation 6 and Equation 7, respectively. Note that, in those equations, $f(x_{H})=\sum_{a_{H}}R(\tau_{H})\pi(\tau_{H-1})$.

$$V_{B_{\mathcal{P}}^{1}(\theta^{*}),R}^{\pi}=\max_{\bm{\gamma}\geq 0,\bm{\lambda}}\Big\{\sum_{x_{H}}(f(x_{H})-\gamma_{x_{H}})\mathcal{P}^{\theta^{*}}(x_{H})-\sum_{a_{1:H-1}}\max_{o_{1:H}}\left|\gamma_{x_{H}}-f(x_{H})-\lambda_{a_{H-1},x_{H-1}}\right|\xi\Big\} \tag{6}$$

$$V_{B_{\mathcal{P}}^{2}(\theta^{*}),R}^{\pi}=\max_{\bm{\eta}\geq 0}\Big\{-\sum_{\tau_{H-1}}\eta_{a_{1:H-1}}\mathcal{P}^{\theta^{*}}(\tau_{H-1})\log\mathbb{E}_{o_{H}\sim\mathcal{T}^{\theta^{*}}_{H-1}}\left[e^{-\frac{f(x_{H})}{\eta_{a_{1:H-1}}}}\right]-\sum_{a_{1:H-1}}\eta_{a_{1:H-1}}\xi\Big\} \tag{7}$$

Output policy design. Finally, Algorithm 1 outputs the policy $\hat{\pi}=\arg\max_{\pi\in\Pi}\left(V_{B_{\mathrm{u}}^{i}(\hat{\theta}),R}^{\pi}-C_{\mathrm{u}}^{i}V_{\hat{\theta},\hat{b}}^{\pi}\right)$, which maximizes a lower confidence bound of $V_{B_{\mathrm{u}}^{i}(\theta^{*}),R}^{\pi}$.

4.2 Theoretical Guarantee

In this section, we present the theoretical results for Algorithm 1. Before we state the theorem, we need to introduce several definitions and condition coefficients.

Bracketing number. First, we introduce a complexity measure of the model class $\Theta$, which is standard in the literature (Liu et al., 2022; Huang et al., 2023).

Definition 4.2 ($\varepsilon$-Bracketing number of $\Theta$).

Suppose that for each $\theta\in\Theta$ there exists a function $F_{\theta}:\mathcal{X}\rightarrow\mathbb{R}_{+}$ parameterized by $\theta$. Given two functions $l,g:\mathcal{X}\to\mathbb{R}_{+}$, the bracket $[l,g]$ is the set of all $\theta\in\Theta$ satisfying $l\leq F_{\theta}\leq g$. An $\varepsilon$-bracket is a bracket $[l,g]$ with $\|g-l\|_{1}<\varepsilon$. The bracketing number $\mathcal{N}_{\varepsilon}(\Theta)$ is the minimum number of $\varepsilon$-brackets needed to cover $\Theta$.

Note that if $\Theta$ is the parameter space for tabular PSRs with rank $r$, we have $\log\mathcal{N}_{\varepsilon}(\Theta)\leq r^{2}|\mathcal{O}||\mathcal{A}|H^{2}\log\frac{H|\mathcal{O}||\mathcal{A}|}{\varepsilon}$ (see Theorem 4.7 in Liu et al. (2022)).

Novel concentrability coefficient. Under the low-rank structure and PSRs, we introduce the following novel concentrability coefficient $C_{\mathrm{p}}(\pi|\rho)$, namely the type-I concentrability coefficient, of a behavior policy $\rho$ against a robust policy $\pi$:

$$C_{\mathrm{p}}(\pi|\rho)=\max_{h}\max_{x}\frac{x^{\top}\mathbb{E}_{\tau_{h}\sim\mathbb{D}_{\theta^{*}}^{\pi}}[\bar{\psi}^{*}(\tau_{h})\bar{\psi}^{*}(\tau_{h})^{\top}]\,x}{x^{\top}\mathbb{E}_{\tau_{h}\sim\mathbb{D}_{\theta^{*}}^{\rho}}[\bar{\psi}^{*}(\tau_{h})\bar{\psi}^{*}(\tau_{h})^{\top}]\,x}.$$
Remark 4.3.

Requiring a finite type-I concentrability coefficient is much weaker than requiring a finite concentrability coefficient $C$ as defined in Huang et al. (2023), in which a point-to-point probability ratio is used, i.e., $C\geq\mathbb{D}_{\theta^{*}}^{\pi}(\tau_{h})/\mathbb{D}_{\theta^{*}}^{\rho}(\tau_{h})$. It is clear that $C_{\mathrm{p}}(\pi|\rho)\leq C$. Moreover, even if $\mathbb{D}_{\theta^{*}}^{\rho}(\tau_{h})$ is exponentially small, so that $C$ is exponentially large, the minimum eigenvalue of the covariance matrix of $\bar{\psi}^{*}(\tau_{h})$ under $\rho$ (i.e., $\mathbb{E}_{\tau_{h}\sim\mathbb{D}_{\theta^{*}}^{\rho}}[\bar{\psi}^{*}(\tau_{h})\bar{\psi}^{*}(\tau_{h})^{\top}]$) could still be bounded away from zero. By Assumption 3.2, we have $\|\bar{\psi}^{*}(\tau_{h})\|_{1}/\gamma\geq\sum_{\omega_{h}}\pi(\omega_{h}|\tau_{h})\mathbf{m}(\omega_{h})^{\top}\bar{\psi}^{*}(\tau_{h})=1$, which implies that the inverse of the norm of the prediction feature scales as $\mathrm{Poly}(d,1/\gamma)$. Thus, with a good behavior policy $\rho$, the scaling of $C_{\mathrm{p}}(\pi|\rho)$ is likely to be polynomial.
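On a finite grid of histories, $C_{\mathrm{p}}(\pi|\rho)$ is, for each $h$, the largest generalized eigenvalue of the pair of feature covariance matrices; the sketch below assumes those covariance estimates are already given, and the regularizer reg is an implementation choice.

import numpy as np
from scipy.linalg import eigh

def type1_concentrability(cov_pi_list, cov_rho_list, reg=1e-8):
    # C_p(pi|rho) = max_h max_x (x^T A_h x) / (x^T B_h x) with
    # A_h = E_pi[psi_bar psi_bar^T] and B_h = E_rho[psi_bar psi_bar^T],
    # i.e., the largest generalized eigenvalue of the pair (A_h, B_h).
    worst = 0.0
    for A, B in zip(cov_pi_list, cov_rho_list):
        B_reg = B + reg * np.eye(B.shape[0])              # keep B positive definite
        worst = max(worst, float(eigh(A, B_reg, eigvals_only=True)[-1]))
    return worst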

Wellness condition of the nominal model. When the uncertainty set is $\mathcal{T}$-type, we further introduce a wellness condition number $C_{B}$ of the nominal model $\theta^{*}$, defined as follows:

$$C_{B}=\max_{\theta\in B(\theta^{*})}\max_{h}\frac{\mathbb{D}_{\theta}^{\pi}(\tau_{h})}{\mathbb{D}_{\theta^{*}}^{\pi}(\tau_{h})}=\max_{\theta\in B(\theta^{*})}\max_{h}\frac{\mathcal{P}^{\theta}(\tau_{h})}{\mathcal{P}^{\theta^{*}}(\tau_{h})}.$$

Intuitively, this condition number is the maximum likelihood ratio between the distributions $\mathcal{P}^{\theta^{*}}(\cdot|a_{1:h})$ and $\mathcal{P}^{\theta}(\cdot|a_{1:h})$, and it depends only on the nominal model and the shape of the uncertainty set. In particular, when $B=B_{\mathcal{T}}^{2}$, we must have

$$\mathbb{E}_{\theta}^{\pi}\left[\log\frac{\mathcal{P}^{\theta}(\tau_{h})}{\mathcal{P}^{\theta^{*}}(\tau_{h})}\right]=\mathbb{E}_{\theta}^{\pi}\left[\sum_{h^{\prime}\leq h}\log\frac{\mathcal{T}^{\theta}(o_{h^{\prime}}|\tau_{h^{\prime}-1})}{\mathcal{T}^{\theta^{*}}(o_{h^{\prime}}|\tau_{h^{\prime}-1})}\right]\leq\xi h,$$

which holds for all policies $\pi$, and thus suggests that $C_{B}$ is likely to be small.

Scaling parameters. Finally, we briefly introduce the scaling parameters $C_{\mathrm{u}}^{i}$ used to construct the LCB of the robust value. Equipped with the dual forms, we define $C_{\mathcal{P}}^{1}=1$ and $C_{\mathcal{T}}^{1}=C_{B}$. When the uncertainty set is determined by the KL divergence, we make the following assumptions.

Assumption 4.4.

Suppose that $\bm{\eta}^{*}\in\mathbb{R}^{|\mathcal{A}|^{H-1}}$ is an optimal solution of Equation 7, and $\bm{\eta}^{*}_{\hat{\theta}}\in\mathbb{R}^{|\mathcal{A}|^{H-1}}$ is an optimal solution of Equation 7 when $\theta^{*}$ is replaced by $\hat{\theta}$. We assume that the agent has an estimate of a lower bound $\eta^{*}$ on the coordinates of both $\bm{\eta}^{*}$ and $\bm{\eta}^{*}_{\hat{\theta}}$, i.e., $\eta^{*}\leq\min_{a_{1:H-1}}\{\eta_{a_{1:H-1}}^{*},\eta_{\hat{\theta},a_{1:H-1}}^{*}\}$.

Assumption 4.5.

$\lambda_{\theta\|\theta^{*}}^{*}(\tau_{h})$ is an optimal solution to the following optimization problem:

$$\sup_{\lambda\geq 0}\left\{-\lambda\log\mathbb{E}_{\mathcal{T}_{h}^{\theta}}\left[\exp\left(-V_{B_{\mathcal{T}}^{2}(\theta^{*}),R,h+1}^{\pi}(\tau_{h},o)/\lambda\right)\right]-\lambda\xi\right\}.$$

We assume that the agent has an estimate of a lower bound $\lambda^{*}$ on all $\lambda_{\theta^{*}\|\hat{\theta}}^{*}(\tau_{h})$ and $\lambda_{\hat{\theta}\|\theta^{*}}^{*}(\tau_{h})$, i.e., $\lambda^{*}\leq\min_{h,\tau_{h}}\{\lambda_{\theta^{*}\|\hat{\theta}}^{*}(\tau_{h}),\lambda_{\hat{\theta}\|\theta^{*}}^{*}(\tau_{h})\}$.

Remark 4.6.

The optimization problem in Assumption 4.5 is the dual problem that needs to be solved in the Bellman-type equations (Lemma 4.1). In the algorithm implementation, it suffices to estimate $\min_{a_{1:H-1}}\{\eta^{*}_{\hat{\theta},a_{1:H-1}}\}$ and $\min_{h,\tau_{h}}\{\lambda_{\hat{\theta}\|\theta^{*}}^{*}(\tau_{h})\}$, and such values are not difficult to obtain since $\hat{\theta}$ is known. We make these two assumptions for notational simplicity; similar assumptions also appear in Blanchet et al. (2023).

We define the scaling parameters $C_{\mathcal{P}}^{2}=3\exp(1/\eta^{*})$ and $C_{\mathcal{T}}^{2}=C_{B}\max\left\{\exp(\xi)/\xi,\ \lambda^{*}\exp(1/\lambda^{*})\right\}$.

We are now ready to present the theorem characterizing the performance of Algorithm 1.

Theorem 4.7.

Suppose Assumptions 3.2, 4.4, and 4.5 hold. Let $Q_{A}=\max_{h}|\mathcal{Q}_{h}^{A}|$ be the maximum number of core action sequences. Let $\iota=\min_{\mathbf{q}_{h,\ell}^{\mathrm{a}},\tau_{h}}\rho(\mathbf{q}_{h,\ell}^{\mathrm{a}}|\tau_{h})$, $p_{\min}=\frac{\delta}{N(|\mathcal{O}||\mathcal{A}|)^{2H}}$, $\varepsilon=\frac{p_{\min}}{NH}$, $\beta=O(\log(\mathcal{N}_{\varepsilon}(\Theta)/\delta))$, $\lambda=H^{2}Q_{A}^{2}$, and $\alpha=O\left(\frac{Q_{A}\sqrt{dH}}{\gamma^{2}}\sqrt{\lambda}+\frac{\sqrt{\beta}}{\iota\gamma}\right)$. Then, with probability at least $1-\delta$, we have the following four results for the suboptimality gap of the policy $\hat{\pi}$ output by Algorithm 1 under four settings.

(i) $\mathcal{P}$-type uncertainty set under the TV distance: $V_{B_{\mathcal{P}}^{1}(\theta^{*}),R}^{\pi^{*}}-V_{B_{\mathcal{P}}^{1}(\theta^{*}),R}^{\hat{\pi}}\lesssim C_{0}\sqrt{\frac{C_{\mathrm{p}}(\pi^{*}|\rho)}{N}}$,

(ii) $\mathcal{P}$-type uncertainty set under the KL divergence: $V_{B_{\mathcal{P}}^{2}(\theta^{*}),R}^{\pi^{*}}-V_{B_{\mathcal{P}}^{2}(\theta^{*}),R}^{\hat{\pi}}\lesssim C_{0}C_{\mathcal{P}}^{\mathrm{kl}}\sqrt{\frac{C_{\mathrm{p}}(\pi^{*}|\rho)}{N}}$,

(iii) $\mathcal{T}$-type uncertainty set under the TV distance: $V_{B_{\mathcal{T}}^{1}(\theta^{*}),R}^{\pi^{*}}-V_{B_{\mathcal{T}}^{1}(\theta^{*}),R}^{\hat{\pi}}\lesssim C_{0}C_{B}\sqrt{\frac{C_{\mathrm{p}}(\pi^{*}|\rho)}{N}}$,

(iv) $\mathcal{T}$-type uncertainty set under the KL divergence: $V_{B_{\mathcal{T}}^{2}(\theta^{*}),R}^{\pi^{*}}-V_{B_{\mathcal{T}}^{2}(\theta^{*}),R}^{\hat{\pi}}\lesssim C_{0}C_{B}C_{\mathcal{T}}^{\mathrm{kl}}\sqrt{\frac{C_{\mathrm{p}}(\pi^{*}|\rho)}{N}}$,

where $\pi^{*}$ is the optimal robust policy under each setting, $C_{\mathrm{p}}(\pi^{*}|\rho)$ is the type-I coefficient of the behavior policy $\rho$ against $\pi^{*}$, $C_{0}=\frac{H^{2}Q_{A}\sqrt{rd\beta}}{\iota\gamma^{2}}\left(\sqrt{r}+\frac{Q_{A}\sqrt{H}}{\gamma}\right)$, $C_{B}$ is the wellness condition number of $\theta^{*}$, $C_{\mathcal{P}}^{\mathrm{kl}}=3\exp(1/\eta^{*})$, $C_{\mathcal{T}}^{\mathrm{kl}}=\max\left\{\exp(\xi)/\xi,\ \lambda^{*}\exp(1/\lambda^{*})\right\}$, and $\eta^{*},\lambda^{*}$ are defined in Assumption 4.4 and Assumption 4.5.

The theorem states that as long as the type-I concentrability coefficient of $\rho$ against the optimal robust policy $\pi^{*}=\arg\max_{\pi}V_{B(\theta^{*}),R}^{\pi}$ is bounded, and the nominal model $\theta^{*}$ has a finite wellness condition number $C_{B}$, Algorithm 1 finds an $\epsilon$-optimal robust policy provided that the offline dataset contains $\Omega(1/\epsilon^{2})$ sample trajectories.

Proof sketch of Theorem 4.7. The proof consists of three main steps. (i) A novel MLE guarantee. We show that the estimator $\hat{\theta}$ produced by vanilla MLE already performs well on a large portion of the offline data samples, namely, $\sum_{\tau_{h}\in\mathcal{D}_{h}^{\mathrm{g}}}\|\mathbb{D}_{\hat{\theta}}^{\rho}(\cdot|\tau_{h})-\mathbb{D}_{\theta^{*}}^{\rho}(\cdot|\tau_{h})\|_{1}\leq O(\beta)$. This result differs from the MLE guarantee proved in Huang et al. (2023), where a constrained MLE oracle is required. (ii) A simulation lemma for the robust value. By utilizing the newly proposed dual forms in Equation 6 and Equation 7, we show that for any model $\theta$ (not necessarily low-rank), the difference $|V_{B(\theta^{*}),R}^{\pi}-V_{B(\theta),R}^{\pi}|$ is upper bounded by $O(c_{B}\|\mathbb{D}_{\theta^{*}}^{\pi}-\mathbb{D}_{\theta}^{\pi}\|_{1})$, where $c_{B}$ depends on the uncertainty type. (iii) Applying the concentrability coefficient. In low-rank non-Markovian decision processes, the $\ell_{1}$ distance $\|\mathbb{D}_{\theta^{*}}^{\pi}-\mathbb{D}_{\theta}^{\pi}\|_{1}$ can be upper bounded by $\mathbb{E}_{\theta^{*}}^{\pi}[\|\bar{\psi}^{*}(\tau_{h})\|_{\Lambda^{-1}}]$ (Liu et al., 2022; Huang et al., 2023). By the Cauchy–Schwarz inequality, this quantity is controlled by the covariance matrix $\mathbb{E}_{\theta^{*}}^{\pi}[\bar{\psi}^{*}(\tau_{h})\bar{\psi}^{*}(\tau_{h})^{\top}]$, which can in turn be bounded through the type-I concentrability coefficient $C_{\mathrm{p}}(\pi|\rho)$ and the covariance matrix under the behavior policy $\rho$. Combining this with the MLE guarantee yields the final result. The full proof can be found in Appendix C.

5 Robust Offline RL under General Non-Markovian Decision Processes

In this section, we first introduce a more general algorithm (see Algorithm 2) for robust offline learning in non-Markovian decision processes, and then present its theoretical guarantee.

5.1 Algorithm Design

The general algorithm adopts the double pessimism principle for robust offline RL and consists of two parts: model estimation and robust policy planning. We note that a similar double pessimism principle has been adopted in Blanchet et al. (2023); however, due to the non-Markovian nature of our problem, our robust value, its dual forms, and the corresponding analysis are different from theirs.

Model estimation. We construct a confidence set $\mathcal{C}$ for the nominal model $\theta^{*}$ using the offline dataset $\mathcal{D}$:

$$\mathcal{C}=\left\{\theta:\sum_{n=1}^{N}\mathcal{P}^{\theta}(\tau_{H}^{n})\geq\max_{\theta^{\prime}}\sum_{n=1}^{N}\mathcal{P}^{\theta^{\prime}}(\tau_{H}^{n})-\beta\right\}, \tag{8}$$

where $\beta$ is a pre-defined parameter. This confidence set ensures that $\theta^{*}\in\mathcal{C}$ and that any model $\theta\in\mathcal{C}$ is “close” to the nominal model $\theta^{*}$ under the behavior policy $\rho$.

Robust policy planning. Then, we output a policy following the double pessimism principle. Specifically, for each model $\hat{\theta}\in\mathcal{C}$, since it is close to the nominal model $\theta^{*}$, we examine its robust value $V_{B(\hat{\theta}),R}^{\pi}$ under a policy $\pi$. By minimizing the robust value $V_{B(\hat{\theta}),R}^{\pi}$ over the confidence set $\mathcal{C}$, we obtain a lower confidence bound for the true robust value $V_{B(\theta^{*}),R}^{\pi}$, since $\theta^{*}\in\mathcal{C}$. Finally, we output a policy that maximizes this LCB: $\hat{\pi}=\arg\max_{\pi\in\Pi}\min_{\hat{\theta}\in\mathcal{C}}V_{B(\hat{\theta}),R}^{\pi}$. A brute-force sketch of this procedure over finite candidate sets is given below.
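In the sketch, score_fn (the dataset likelihood score in Equation 8), robust_value_fn, models (a finite grid standing in for $\Theta$), and candidate_policies are placeholders; exact evaluation of the inner robust value is outside the scope of this illustration.

def confidence_set(models, score_fn, beta):
    # Finite-grid surrogate of Eq. (8): keep models whose dataset likelihood
    # score is within beta of the best score over the grid.
    scores = [score_fn(theta) for theta in models]
    best = max(scores)
    return [theta for theta, s in zip(models, scores) if s >= best - beta]

def double_pessimism_policy(candidate_policies, conf_set, robust_value_fn):
    # pi_hat = argmax_pi min_{theta_hat in C} V_{B(theta_hat),R}^pi (both sets finite here).
    return max(candidate_policies,
               key=lambda pi: min(robust_value_fn(theta, pi) for theta in conf_set))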

5.2 Theoretical Guarantee

In this section, we present the theoretical guarantees of Algorithm 2. Since the nominal model does not have a specific structure or representation, the type-I concentrability coefficient is not applicable in this case. Therefore, we introduce the type-II concentrability coefficient $C(\pi|\rho)$ of a behavior policy $\rho$ against a robust policy $\pi$, defined as $C(\pi|\rho)=\sum_{h=1}^{H}\mathbb{E}_{\theta^{*}}^{\rho}\left[\left(\frac{\mathbb{D}_{\theta^{*}}^{\pi}(\tau_{h})}{\mathbb{D}_{\theta^{*}}^{\rho}(\tau_{h})}\right)^{2}\right]$.

Remark 5.1.

In general, requiring a finite type-II concentrability coefficient is stronger than requiring a finite type-I concentrability coefficient as defined in Section 4.2. However, it is still a more relaxed coverage assumption than that in Huang et al. (2023), because in the analysis we average the estimation error over the distribution induced by the behavior policy instead of using a point-wise bound. A Monte Carlo estimate of $C(\pi|\rho)$ is sketched below.
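Given trajectories collected under $\rho$ and a routine for the prefix probability ratio $\mathbb{D}_{\theta^{*}}^{\pi}(\tau_{h})/\mathbb{D}_{\theta^{*}}^{\rho}(\tau_{h})$ (ratio_fn here is a placeholder), $C(\pi|\rho)$ could be estimated by Monte Carlo as follows.

import numpy as np

def type2_concentrability(rho_trajectories, ratio_fn, H):
    # Monte Carlo estimate of C(pi|rho) = sum_{h=1}^H E_rho[(D^pi(tau_h)/D^rho(tau_h))^2],
    # where ratio_fn(tau, h) returns the probability ratio for the length-h prefix of tau.
    total = 0.0
    for h in range(1, H + 1):
        ratios = np.array([ratio_fn(tau, h) for tau in rho_trajectories])
        total += float(np.mean(ratios ** 2))
    return total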

We are now ready to present the main result for Algorithm 2.

Theorem 5.2.

Let $\varepsilon=\frac{1}{NH}$ and $\beta=O(\log(\mathcal{N}_{\varepsilon}(\Theta)/\delta))$. Then, with probability at least $1-\delta$, we have the following four results for the suboptimality gap of the policy $\hat{\pi}$ output by Algorithm 2.

(i) $\mathcal{P}$-type uncertainty set under the TV distance: $V_{B_{\mathcal{P}}^{1}(\theta^{*}),R}^{\pi^{*}}-V_{B_{\mathcal{P}}^{1}(\theta^{*}),R}^{\hat{\pi}}\lesssim H\sqrt{\frac{C(\pi^{*}|\rho)\beta}{N}}$,

(ii) $\mathcal{P}$-type uncertainty set under the KL divergence: $V_{B_{\mathcal{P}}^{2}(\theta^{*}),R}^{\pi^{*}}-V_{B_{\mathcal{P}}^{2}(\theta^{*}),R}^{\hat{\pi}}\lesssim HC_{\mathcal{P}}^{\mathrm{kl}}\sqrt{\frac{C(\pi^{*}|\rho)\beta}{N}}$,

(iii) $\mathcal{T}$-type uncertainty set under the TV distance: $V_{B_{\mathcal{T}}^{1}(\theta^{*}),R}^{\pi^{*}}-V_{B_{\mathcal{T}}^{1}(\theta^{*}),R}^{\hat{\pi}}\lesssim HC_{B}\sqrt{\frac{C(\pi^{*}|\rho)\beta}{N}}$,

(iv) $\mathcal{T}$-type uncertainty set under the KL divergence: $V_{B_{\mathcal{T}}^{2}(\theta^{*}),R}^{\pi^{*}}-V_{B_{\mathcal{T}}^{2}(\theta^{*}),R}^{\hat{\pi}}\lesssim HC_{B}C_{\mathcal{T}}^{\mathrm{kl}}\sqrt{\frac{C(\pi^{*}|\rho)\beta}{N}}$,

where $\pi^{*}$ is the optimal robust policy under each setting, $C(\pi^{*}|\rho)$ is the type-II coefficient of the behavior policy $\rho$ against $\pi^{*}$, $C_{B}$ is the wellness condition number defined in Section 4.2, and $C_{\mathcal{P}}^{\mathrm{kl}},C_{\mathcal{T}}^{\mathrm{kl}}$ are defined in Theorem 4.7.

The theorem suggests that Algorithm 2 is more sample efficient than Algorithm 1, owing to the general double pessimism design and the type-II concentrability coefficient, while Algorithm 1 is more amenable to practical implementation due to the low-rank structure, the computationally tractable bonus function, and the developed dual forms.

6 Conclusion

In this work, we studied robust offline non-Markovian RL with unknown transitions. We developed the first sample-efficient algorithm for the case where the nominal model admits a low-rank structure. To characterize the theoretical guarantee of our algorithm, we proposed a novel type-I concentrability coefficient for PSRs, which can be of independent interest in the study of offline low-rank non-Markovian RL. In addition, we extended our algorithm to the more general case where the nominal model does not have any specific structure. With the notion of a type-II concentrability coefficient, we proved that the extended algorithm also enjoys sample efficiency under all different types of the uncertainty set.

References

  • Antos et al., (2008) Antos, A., Szepesvári, C., and Munos, R. (2008). Learning near-optimal policies with bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71:89–129.
  • Bertsekas, (2011) Bertsekas, D. P. (2011). Approximate policy iteration: A survey and some new methods. Journal of Control Theory and Applications, 9:310–335.
  • Blanchet et al., (2023) Blanchet, J., Lu, M., Zhang, T., and Zhong, H. (2023). Double pessimism is provably efficient for distributionally robust offline reinforcement learning: Generic algorithm and robust partial coverage. arXiv preprint arXiv:2305.09659.
  • Boots et al., (2011) Boots, B., Siddiqi, S. M., and Gordon, G. J. (2011). Closing the learning-planning loop with predictive state representations. The International Journal of Robotics Research, 30(7):954–966.
  • Chang et al., (2021) Chang, J., Uehara, M., Sreenivas, D., Kidambi, R., and Sun, W. (2021). Mitigating covariate shift in imitation learning via offline data with partial coverage. Advances in Neural Information Processing Systems, 34:965–979.
  • Chen et al., (2022) Chen, F., Bai, Y., and Mei, S. (2022). Partially observable rl with b-stability: Unified structural condition and sharp sample-efficient algorithms. arXiv preprint arXiv:2209.14990.
  • Chen and Jiang, (2019) Chen, J. and Jiang, N. (2019). Information-theoretic considerations in batch reinforcement learning. In International Conference on Machine Learning, pages 1042–1051. PMLR.
  • Dong et al., (2022) Dong, J., Li, J., Wang, B., and Zhang, J. (2022). Online policy optimization for robust mdp. arXiv preprint arXiv:2209.13841.
  • Ernst et al., (2005) Ernst, D., Geurts, P., and Wehenkel, L. (2005). Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6.
  • Farahmand et al., (2010) Farahmand, A.-m., Szepesvári, C., and Munos, R. (2010). Error propagation for approximate policy and value iteration. Advances in Neural Information Processing Systems, 23.
  • Gordon, (1995) Gordon, G. J. (1995). Stable function approximation in dynamic programming. In Machine learning proceedings 1995, pages 261–268. Elsevier.
  • Huang et al., (2023) Huang, R., Liang, Y., and Yang, J. (2023). Provably efficient ucb-type algorithms for learning predictive state representations. arXiv preprint arXiv:2307.00405.
  • Ibarz et al., (2021) Ibarz, J., Tan, J., Finn, C., Kalakrishnan, M., Pastor, P., and Levine, S. (2021). How to train your robot with deep reinforcement learning: lessons we have learned. The International Journal of Robotics Research, 40(4-5):698–721.
  • Iyengar, (2005) Iyengar, G. N. (2005). Robust dynamic programming. Mathematics of Operations Research, 30(2):257–280.
  • Jiang et al., (2018) Jiang, N., Kulesza, A., and Singh, S. (2018). Completing state representations using spectral learning. Advances in Neural Information Processing Systems, 31.
  • Jin et al., (2021) Jin, Y., Yang, Z., and Wang, Z. (2021). Is pessimism provably efficient for offline rl? In International Conference on Machine Learning, pages 5084–5096. PMLR.
  • Kidambi et al., (2020) Kidambi, R., Rajeswaran, A., Netrapalli, P., and Joachims, T. (2020). Morel: Model-based offline reinforcement learning. Advances in neural information processing systems, 33:21810–21823.
  • Krishnamurthy et al., (2016) Krishnamurthy, A., Agarwal, A., and Langford, J. (2016). Pac reinforcement learning with rich observations. Advances in Neural Information Processing Systems, 29.
  • Kumar et al., (2020) Kumar, A., Zhou, A., Tucker, G., and Levine, S. (2020). Conservative q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:1179–1191.
  • Lange et al., (2012) Lange, S., Gabel, T., and Riedmiller, M. (2012). Batch reinforcement learning. Reinforcement learning: State-of-the-art, pages 45–73.
  • Lazaric et al., (2012) Lazaric, A., Ghavamzadeh, M., and Munos, R. (2012). Finite-sample analysis of least-squares policy iteration. Journal of Machine Learning Research, 13:3041–3074.
  • Levine et al., (2020) Levine, S., Kumar, A., Tucker, G., and Fu, J. (2020). Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643.
  • Lim et al., (2013) Lim, S. H., Xu, H., and Mannor, S. (2013). Reinforcement learning in robust markov decision processes. Advances in Neural Information Processing Systems, 26.
  • Littman and Sutton, (2001) Littman, M. and Sutton, R. S. (2001). Predictive representations of state. Advances in neural information processing systems, 14.
  • Liu et al., (2022) Liu, Q., Netrapalli, P., Szepesvari, C., and Jin, C. (2022). Optimistic mle–a generic model-based algorithm for partially observable sequential decision making. arXiv preprint arXiv:2209.14997.
  • Liu et al., (2020) Liu, Y., Swaminathan, A., Agarwal, A., and Brunskill, E. (2020). Provably good batch off-policy reinforcement learning without great exploration. Advances in neural information processing systems, 33:1264–1274.
  • Mannor et al., (2016) Mannor, S., Mebel, O., and Xu, H. (2016). Robust mdps with k-rectangular uncertainty. Mathematics of Operations Research, 41(4):1484–1509.
  • Mossel and Roch, (2005) Mossel, E. and Roch, S. (2005). Learning nonsingular phylogenies and hidden markov models. In Proceedings of the thirty-seventh annual ACM symposium on Theory of computing, pages 366–375.
  • Mundhenk et al., (2000) Mundhenk, M., Goldsmith, J., Lusena, C., and Allender, E. (2000). Complexity of finite-horizon markov decision process problems. Journal of the ACM (JACM), 47(4):681–720.
  • Munos and Szepesvári, (2008) Munos, R. and Szepesvári, C. (2008). Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(5).
  • Nakao et al., (2021) Nakao, H., Jiang, R., and Shen, S. (2021). Distributionally robust partially observable markov decision process with moment-based ambiguity. SIAM Journal on Optimization, 31(1):461–488.
  • Nguyen-Tang et al., (2023) Nguyen-Tang, T., Yin, M., Gupta, S., Venkatesh, S., and Arora, R. (2023). On instance-dependent bounds for offline reinforcement learning with linear function approximation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 9310–9318.
  • Nilim and El Ghaoui, (2005) Nilim, A. and El Ghaoui, L. (2005). Robust control of markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798.
  • Osogami, (2015) Osogami, T. (2015). Robust partially observable markov decision process. In International Conference on Machine Learning, pages 106–115. PMLR.
  • Ouyang et al., (2022) Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
  • Panaganti and Kalathil, (2022) Panaganti, K. and Kalathil, D. (2022). Sample complexity of robust reinforcement learning with a generative model. In International Conference on Artificial Intelligence and Statistics, pages 9582–9602. PMLR.
  • Panaganti et al., (2022) Panaganti, K., Xu, Z., Kalathil, D., and Ghavamzadeh, M. (2022). Robust reinforcement learning using offline data. Advances in neural information processing systems, 35:32211–32224.
  • Papadimitriou and Tsitsiklis, (1987) Papadimitriou, C. H. and Tsitsiklis, J. N. (1987). The complexity of markov decision processes. Mathematics of operations research, 12(3):441–450.
  • Peng et al., (2018) Peng, X. B., Andrychowicz, M., Zaremba, W., and Abbeel, P. (2018). Sim-to-real transfer of robotic control with dynamics randomization. In 2018 IEEE international conference on robotics and automation (ICRA), pages 3803–3810. IEEE.
  • Petrik and Russel, (2019) Petrik, M. and Russel, R. H. (2019). Beyond confidence regions: Tight bayesian ambiguity sets for robust mdps. Advances in neural information processing systems, 32.
  • Rashidinejad et al., (2021) Rashidinejad, P., Zhu, B., Ma, C., Jiao, J., and Russell, S. (2021). Bridging offline reinforcement learning and imitation learning: A tale of pessimism. Advances in Neural Information Processing Systems, 34:11702–11716.
  • Rasouli and Saghafian, (2018) Rasouli, M. and Saghafian, S. (2018). Robust partially observable markov decision processes.
  • Roy et al., (2017) Roy, A., Xu, H., and Pokutta, S. (2017). Reinforcement learning under model mismatch. Advances in neural information processing systems, 30.
  • Singh et al., (2012) Singh, S., James, M., and Rudary, M. (2012). Predictive state representations: A new theory for modeling dynamical systems. arXiv preprint arXiv:1207.4167.
  • Sutton and Barto, (2018) Sutton, R. S. and Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.
  • Tamar et al., (2014) Tamar, A., Mannor, S., and Xu, H. (2014). Scaling up robust mdps using function approximation. In International conference on machine learning, pages 181–189. PMLR.
  • Uehara et al., (2022) Uehara, M., Sekhari, A., Lee, J. D., Kallus, N., and Sun, W. (2022). Provably efficient reinforcement learning in partially observable dynamical systems. arXiv preprint arXiv:2206.12020.
  • Uehara and Sun, (2021) Uehara, M. and Sun, W. (2021). Pessimistic model-based offline reinforcement learning under partial coverage. arXiv preprint arXiv:2107.06226.
  • Uehara et al., (2021) Uehara, M., Zhang, X., and Sun, W. (2021). Representation learning for online and offline rl in low-rank mdps. arXiv preprint arXiv:2110.04652.
  • Vlassis et al., (2012) Vlassis, N., Littman, M. L., and Barber, D. (2012). On the computational complexity of stochastic controller optimization in pomdps. ACM Transactions on Computation Theory (TOCT), 4(4):1–8.
  • Wang et al., (2023a) Wang, Y., Velasquez, A., Atia, G., Prater-Bennette, A., and Zou, S. (2023a). Robust average-reward markov decision processes. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 15215–15223.
  • Wang et al., (2023b) Wang, Y., Velasquez, A., Atia, G. K., Prater-Bennette, A., and Zou, S. (2023b). Model-free robust average-reward reinforcement learning. In International Conference on Machine Learning, pages 36431–36469. PMLR.
  • Wang and Zou, (2021) Wang, Y. and Zou, S. (2021). Online robust reinforcement learning with model uncertainty. Advances in Neural Information Processing Systems, 34:7193–7206.
  • Wiesemann et al., (2013) Wiesemann, W., Kuhn, D., and Rustem, B. (2013). Robust markov decision processes. Mathematics of Operations Research, 38(1):153–183.
  • Xie et al., (2021) Xie, T., Cheng, C.-A., Jiang, N., Mineiro, P., and Agarwal, A. (2021). Bellman-consistent pessimism for offline reinforcement learning. Advances in neural information processing systems, 34:6683–6694.
  • Xie and Jiang, (2020) Xie, T. and Jiang, N. (2020). Q* approximation schemes for batch reinforcement learning: A theoretical comparison. In Conference on Uncertainty in Artificial Intelligence, pages 550–559. PMLR.
  • Xiong et al., (2022) Xiong, W., Zhong, H., Shi, C., Shen, C., Wang, L., and Zhang, T. (2022). Nearly minimax optimal offline reinforcement learning with linear function approximation: Single-agent mdp and markov game. arXiv preprint arXiv:2205.15512.
  • Xu and Mannor, (2010) Xu, H. and Mannor, S. (2010). Distributionally robust markov decision processes. Advances in Neural Information Processing Systems, 23.
  • Yang et al., (2022) Yang, W., Zhang, L., and Zhang, Z. (2022). Toward theoretical understandings of robust markov decision processes: Sample complexity and asymptotics. The Annals of Statistics, 50(6):3223–3248.
  • Ye et al., (2020) Ye, D., Liu, Z., Sun, M., Shi, B., Zhao, P., Wu, H., Yu, H., Yang, S., Wu, X., Guo, Q., et al. (2020). Mastering complex control in moba games with deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 6672–6679.
  • Yin et al., (2021) Yin, M., Bai, Y., and Wang, Y.-X. (2021). Near-optimal offline reinforcement learning via double variance reduction. Advances in neural information processing systems, 34:7677–7688.
  • Yu and Xu, (2015) Yu, P. and Xu, H. (2015). Distributionally robust counterpart in markov decision processes. IEEE Transactions on Automatic Control, 61(9):2538–2543.
  • Yu et al., (2020) Yu, T., Thomas, G., Yu, L., Ermon, S., Zou, J. Y., Levine, S., Finn, C., and Ma, T. (2020). Mopo: Model-based offline policy optimization. Advances in Neural Information Processing Systems, 33:14129–14142.
  • Zanette et al., (2021) Zanette, A., Wainwright, M. J., and Brunskill, E. (2021). Provable benefits of actor-critic methods for offline reinforcement learning. Advances in neural information processing systems, 34:13626–13640.
  • Zhan et al., (2022) Zhan, W., Uehara, M., Sun, W., and Lee, J. D. (2022). Pac reinforcement learning for predictive state representations. arXiv preprint arXiv:2207.05738.
  • Zhang et al., (2022) Zhang, Z., Yang, Z., Liu, H., Tokekar, P., and Huang, F. (2022). Reinforcement learning under a multi-agent predictive state representation model: Method and theory. In The Tenth International Conference on Learning Representations (ICLR 2022).
  • Zhao et al., (2020) Zhao, W., Queralta, J. P., and Westerlund, T. (2020). Sim-to-real transfer in deep reinforcement learning for robotics: a survey. In 2020 IEEE symposium series on computational intelligence (SSCI), pages 737–744. IEEE.
  • Zhong et al., (2022) Zhong, H., Xiong, W., Zheng, S., Wang, L., Wang, Z., Yang, Z., and Zhang, T. (2022). A posterior sampling framework for interactive decision making. arXiv preprint arXiv:2211.01962.
  • Zhou et al., (2017) Zhou, L., Small, K., Rokhlenko, O., and Elkan, C. (2017). End-to-end offline goal-oriented dialog policy learning via policy gradient. arXiv preprint arXiv:1712.02838.
  • Zhou et al., (2021) Zhou, Z., Zhou, Z., Bai, Q., Qiu, L., Blanchet, J., and Glynn, P. (2021). Finite-sample regret bound for distributionally robust offline tabular reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 3331–3339. PMLR.

Appendices

Appendix A Technical Lemmas

In this section, we present the technical lemmas for proving our main results. The section consists of three subsections. First, Section A.1 primarily derives the dual forms of the robust values under 𝒫\mathcal{P}-type uncertainty sets. Second, Section A.2 verifies that the robust value functions satisfy Bellman-type equations. Finally, Section A.3 provides various simulation lemmas for robust values under the different uncertainty sets.

A.1 Dual Form of Robust Values

We first present two classical duality results for the total variation distance and the KL divergence, which can be found in Panaganti et al., (2022); Blanchet et al., (2023).

Proposition A.1.

Let 𝙳f1\mathtt{D}_{f_{1}} be defined with the convex function f1(t)=|t1|/2f_{1}(t)=|t-1|/2 corresponding to the total variation distance. Then,

inf𝙳f1(PPo)ξ𝔼P[(X)]=maxλ{λ𝔼Po[(λ(X))+]ξmaxx(λ(x))+}.\inf_{\mathtt{D}_{f_{1}}(P\|P_{o})\leq\xi}\mathbb{E}_{P}[\ell(X)]=\max_{\lambda\in\mathbb{R}}\left\{\lambda-\mathbb{E}_{P_{o}}[(\lambda-\ell(X))_{+}]-\xi\max_{x}(\lambda-\ell(x))_{+}\right\}.
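As a quick illustration (not part of the paper's analysis), the following sketch numerically verifies this total-variation duality on a hypothetical three-point example: the primal robust expectation is solved as a small linear program, and the dual is evaluated by a grid search over the dual variable. All numbers below are ours and purely illustrative.

```python
# A minimal numerical sanity check of the total-variation dual in Proposition A.1.
# Illustration only: the nominal distribution, loss, and radius are hypothetical.
import numpy as np
from scipy.optimize import linprog

q = np.array([0.5, 0.3, 0.2])      # nominal distribution P_o
loss = np.array([0.0, 0.6, 1.0])   # loss function values ell(x)
xi = 0.25                          # radius of the total-variation ball
n = len(q)

# Primal: min <p, loss>  s.t.  p >= 0, sum(p) = 1, 0.5 * sum_i |p_i - q_i| <= xi,
# written as an LP over the stacked variable (p, s) with s_i >= |p_i - q_i|.
c = np.concatenate([loss, np.zeros(n)])
A_ub = np.block([[np.eye(n), -np.eye(n)],                     # p - s <= q
                 [-np.eye(n), -np.eye(n)],                    # -p - s <= -q
                 [np.zeros((1, n)), 0.5 * np.ones((1, n))]])  # 0.5 * sum(s) <= xi
b_ub = np.concatenate([q, -q, [xi]])
A_eq = np.concatenate([np.ones(n), np.zeros(n)])[None, :]     # sum(p) = 1
primal = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                 bounds=[(0, None)] * (2 * n)).fun

# Dual of Proposition A.1, maximized by a simple grid search over lambda.
grid = np.linspace(-1.0, 2.0, 20001)
dual = max(lam - q @ np.maximum(lam - loss, 0.0)
           - xi * np.max(np.maximum(lam - loss, 0.0)) for lam in grid)

print(f"primal {primal:.4f}  dual {dual:.4f}")  # both should print 0.1500
```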
Proposition A.2.

Let 𝙳f2\mathtt{D}_{f_{2}} be defined with the convex function f2(t)=tlogtf_{2}(t)=t\log t corresponding to the KL-divergence. Then,

inf𝙳f2(PPo)ξ𝔼P[(X)]=supλ0{λlog𝔼Po[exp((X)/λ)]λξ}.\inf_{\mathtt{D}_{f_{2}}(P\|P_{o})\leq\xi}\mathbb{E}_{P}[\ell(X)]=\sup_{\lambda\geq 0}\left\{-\lambda\log\mathbb{E}_{P_{o}}\left[\exp(-\ell(X)/\lambda)\right]-\lambda\xi\right\}.
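Similarly, the KL dual can be checked numerically. In the sketch below (again an illustration with hypothetical numbers, not part of the paper), the primal is computed through the exponentially tilted family — the minimizer of a linear objective over a KL ball is a tilted version of the nominal distribution, and the KL constraint is active for this choice of radius — while the dual is a one-dimensional concave maximization.

```python
# A minimal numerical sanity check of the KL dual in Proposition A.2.
# Illustration only: the nominal distribution, loss, and radius are hypothetical.
import numpy as np
from scipy.optimize import brentq, minimize_scalar

q = np.array([0.5, 0.3, 0.2])      # nominal distribution P_o
loss = np.array([0.0, 0.6, 1.0])   # loss function values ell(x)
xi = 0.1                           # radius of the KL ball

def tilt(t):
    """Exponentially tilted distribution proportional to q * exp(-loss / t)."""
    w = q * np.exp(-loss / t)
    return w / w.sum()

# Primal: the KL-ball minimizer is the tilted distribution whose KL divergence
# from q equals xi (the constraint is active for this choice of xi).
kl = lambda t: float(np.sum(tilt(t) * np.log(tilt(t) / q)))
t_star = brentq(lambda t: kl(t) - xi, 0.05, 50.0)
primal = float(tilt(t_star) @ loss)

# Dual of Proposition A.2: a concave one-dimensional maximization over lambda.
neg_dual = lambda lam: -(-lam * np.log(q @ np.exp(-loss / lam)) - lam * xi)
dual = -minimize_scalar(neg_dual, bounds=(0.05, 50.0), method="bounded").fun

print(f"primal {primal:.4f}  dual {dual:.4f}")  # the two values should agree
```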

Then, we show the dual form for the newly proposed 𝒫\mathcal{P}-type uncertainty sets.

Proposition A.3.

If B(θ)=B𝒫1(θ)B(\theta^{*})=B_{\mathcal{P}}^{1}(\theta^{*}), then, we have

VB(θ),Rπ=max𝜸0,𝝀{xH(f(xH)γxH)𝒫(xH)a1:H1maxo1:H|γxHf(xH)λ(aH1,xH1)|ξ},\displaystyle V_{B(\theta^{*}),R}^{\pi}=\max_{\bm{\gamma}\geq 0,\bm{\lambda}}\left\{\sum_{x_{H}}(f(x_{H})-\gamma_{x_{H}})\mathcal{P}^{*}(x_{H})-\sum_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\max_{o_{1\mathrel{\mathop{\mathchar 58\relax}}H}}\left|\gamma_{x_{H}}-f(x_{H})-\lambda_{(a_{H-1},x_{H-1})}\right|\xi\right\},

where f(xH)=R(xH)π(τH1)f(x_{H})=R(x_{H})\pi(\tau_{H-1}), and the maximum must be achieved when

f(xH)λ(aH1,xH1)0,0γxHmax{f(xH)+λaH1,xH1,0}f(xH).-f(x_{H})\leq\lambda_{(a_{H-1},x_{H-1})}\leq 0,\quad 0\leq\gamma_{x_{H}}\leq\max\left\{f(x_{H})+\lambda_{a_{H-1},x_{H-1}},0\right\}\leq f(x_{H}).
Proof.

We recall the definition of VB𝒫1(θ),RπV_{B_{\mathcal{P}}^{1}(\theta^{*}),R}^{\pi}:

VB𝒫1(θ),Rπ=minθB𝒫1(θ)xH𝒫θ(xH)π(τH1)R(xH)V_{B_{\mathcal{P}}^{1}(\theta^{*}),R}^{\pi}=\min_{\theta\in B_{\mathcal{P}}^{1}(\theta^{*})}\sum_{x_{H}}\mathcal{P}^{\theta}(x_{H})\pi(\tau_{H-1})R(x_{H})

To simplify the notation, we denote π(τH1)R(xH)\pi(\tau_{H-1})R(x_{H}) by f(xH)f(x_{H}). Next, we specify the constraints that define B𝒫1(θ)B_{\mathcal{P}}^{1}(\theta^{*}).

The most important constraint is that, for any a1:H1a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}, o2,,oH|𝒫(o2:H|a1:H1)𝒫(o2:H|a1:H1)|ξ\sum_{o_{2},\ldots,o_{H}}|\mathcal{P}(o_{2\mathrel{\mathop{\mathchar 58\relax}}H}|a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1})-\mathcal{P}^{*}(o_{2\mathrel{\mathop{\mathchar 58\relax}}H}|a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1})|\leq\xi. Another implicit constraint is that, after summing over all oHo_{H}, the value oH𝒫(oH,,o2|a1,,aH1)\sum_{o_{H}}\mathcal{P}(o_{H},\ldots,o_{2}|a_{1},\ldots,a_{H-1}) does not depend on aH1a_{H-1}. Therefore, we introduce additional variables 𝒫(xh)\mathcal{P}(x_{h}) and s(xH)|𝒫(xH)𝒫(xH)|s(x_{H})\geq|\mathcal{P}(x_{H})-\mathcal{P}^{*}(x_{H})| and formulate the following constrained optimization problem.

P1: {minxH𝒫(xH)f(xH)s.t.o2𝒫(x2)=1,(a1,x1)o3𝒫(x3)=𝒫(x2),(a2,x2)oh𝒫(xh)=𝒫(xh1),(ah1,xh1)oH𝒫(xH)=𝒫(xH1),(aH1,xH1)s(xH)𝒫(xH)𝒫(xH)s(xH),xHo2:Hs(xH)ξ,a1:H1𝒫(xH)0,s(xH)0,xH\displaystyle\left\{\begin{aligned} &\min\sum_{x_{H}}\mathcal{P}(x_{H})f(x_{H})\\ &\ \text{s.t.}\sum_{o_{2}}\mathcal{P}(x_{2})=1,\quad\forall(a_{1},x_{1})\\ &\ \ \quad\sum_{o_{3}}\mathcal{P}(x_{3})=\mathcal{P}(x_{2}),\quad\forall(a_{2},x_{2})\\ &\ \ \quad\ldots\\ &\ \ \quad\sum_{o_{h}}\mathcal{P}(x_{h})=\mathcal{P}(x_{h-1}),\quad\forall(a_{h-1},x_{h-1})\\ &\ \ \quad\ldots\\ &\ \ \quad\sum_{o_{H}}\mathcal{P}(x_{H})=\mathcal{P}(x_{H-1}),\quad\forall(a_{H-1},x_{H-1})\\ &\ \ \quad-s(x_{H})\leq\mathcal{P}(x_{H})-\mathcal{P}^{*}(x_{H})\leq s(x_{H}),\quad\forall x_{H}\\ &\ \ \quad\sum_{o_{2\mathrel{\mathop{\mathchar 58\relax}}H}}s(x_{H})\leq\xi,\quad\forall a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}\\ &\ \ \quad\mathcal{P}(x_{H})\geq 0,s(x_{H})\geq 0,\quad\forall x_{H}\end{aligned}\right.

Since P1 is a linear program, we can formulate the Lagrangian function L1(𝝀,𝜶,𝜷,𝜸,𝜹,𝜼)L_{1}(\bm{\lambda},\bm{\alpha},\bm{\beta},\bm{\gamma},\bm{\delta},\bm{\eta}) as follows:

L1\displaystyle L_{1} (𝝀,𝜶,𝜷,𝜸,𝜹,𝜼)\displaystyle(\bm{\lambda},\bm{\alpha},\bm{\beta},\bm{\gamma},\bm{\delta},\bm{\eta})
=xH𝒫(xH)f(xH)+(a1,x1)λ(a1,x1)(o2𝒫(x2)1)+h=2H1(ah,xh)λ(ah,xh)(oh+1𝒫(xh+1)𝒫(xh))\displaystyle=\sum_{x_{H}}\mathcal{P}(x_{H})f(x_{H})+\sum_{(a_{1},x_{1})}\lambda_{(a_{1},x_{1})}\left(\sum_{o_{2}}\mathcal{P}(x_{2})-1\right)+\sum_{h=2}^{H-1}\sum_{(a_{h},x_{h})}\lambda_{(a_{h},x_{h})}\left(\sum_{o_{h+1}}\mathcal{P}(x_{h+1})-\mathcal{P}(x_{h})\right)
+xHαxH(𝒫(xH)𝒫(xH)s(xH))+xHβxH(𝒫(xH)+𝒫(xH)s(xH))\displaystyle\quad+\sum_{x_{H}}\alpha_{x_{H}}(\mathcal{P}(x_{H})-\mathcal{P}^{*}(x_{H})-s(x_{H}))+\sum_{x_{H}}\beta_{x_{H}}(-\mathcal{P}(x_{H})+\mathcal{P}^{*}(x_{H})-s(x_{H}))
xH(γxH𝒫(xH)+δxHs(xH))+a1:H1ηa1:H1(o2:Hs(xH)ξ),\displaystyle\quad-\sum_{x_{H}}(\gamma_{x_{H}}\mathcal{P}(x_{H})+\delta_{x_{H}}s(x_{H}))+\sum_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\eta_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\left(\sum_{o_{2\mathrel{\mathop{\mathchar 58\relax}}H}}s(x_{H})-\xi\right),

where λ(ah,xh)\lambda_{(a_{h},x_{h})}\in\mathbb{R}, and αxH,βxH,γxH,δxH,ηa1:H10,xH\alpha_{x_{H}},\beta_{x_{H}},\gamma_{x_{H}},\delta_{x_{H}},\eta_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\geq 0,\quad\forall x_{H}.

Thus, the dual problem is

D1:{max𝝀,𝜶,𝜷,𝜸,𝜹,ηinf𝒫,sL1(𝝀,𝜶,𝜷,𝜸,𝜹,η)s.t.f(xH)+λ(aH1,xH1)+αxHβxHγxH=0,xHηa1:H1δxHαxHβxH=0,xHλ(ah1,xh1)=ahλ(ah,xh),h,xhαxH,βxH,γxH,δxH,ηa1:H10,xH.\displaystyle\text{D1:}\left\{\begin{aligned} &\max_{\bm{\lambda},\bm{\alpha},\bm{\beta},\bm{\gamma},\bm{\delta},\eta}\inf_{\mathcal{P},s}L_{1}(\bm{\lambda},\bm{\alpha},\bm{\beta},\bm{\gamma},\bm{\delta},\eta)\\ &\ \text{s.t.}\ f(x_{H})+\lambda_{(a_{H-1},x_{H-1})}+\alpha_{x_{H}}-\beta_{x_{H}}-\gamma_{x_{H}}=0,\quad\forall x_{H}\\ &\ \ \quad\eta_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}-\delta_{x_{H}}-\alpha_{x_{H}}-\beta_{x_{H}}=0,\quad\forall x_{H}\\ &\ \ \quad\lambda_{(a_{h-1},x_{h-1})}=\sum_{a_{h}}\lambda_{(a_{h},x_{h})},\quad\forall h,x_{h}\\ &\ \ \quad\alpha_{x_{H}},\beta_{x_{H}},\gamma_{x_{H}},\delta_{x_{H}},\eta_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\geq 0,\quad\forall x_{H}.\end{aligned}\right.

Denote inf𝒫,sL1(𝝀,𝜶,𝜷,𝜸,𝜹,η)\inf_{\mathcal{P},s}L_{1}(\bm{\lambda},\bm{\alpha},\bm{\beta},\bm{\gamma},\bm{\delta},\eta) by L¯1\bar{L}_{1}. Then, we have

L¯1=xH(βxHαxH)𝒫(xH)λ(a1,x1)a1:H1ηa1:H1ξ.\displaystyle\bar{L}_{1}=\sum_{x_{H}}(\beta_{x_{H}}-\alpha_{x_{H}})\mathcal{P}^{*}(x_{H})-\lambda_{(a_{1},x_{1})}-\sum_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\eta_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\xi.

In addition, by the third constraint in the dual problem D1, we have

λ(a1,x1)\displaystyle\lambda_{(a_{1},x_{1})} =a2λ(a2,x2)\displaystyle=\sum_{a_{2}}\lambda_{(a_{2},x_{2})}
=o2𝒫(x2)a2λ(a2,x2)\displaystyle=\sum_{o_{2}}\mathcal{P}^{*}(x_{2})\sum_{a_{2}}\lambda_{(a_{2},x_{2})}
\displaystyle\ldots
=xH𝒫(xH)λ(aH1,xH1).\displaystyle=\sum_{x_{H}}\mathcal{P}^{*}(x_{H})\lambda_{(a_{H-1},x_{H-1})}.

Thus, we can simplify L¯1\bar{L}_{1} to

L¯1(𝝀,𝜶,𝜷,𝜸,𝜹,𝜼)\displaystyle\bar{L}_{1}(\bm{\lambda},\bm{\alpha},\bm{\beta},\bm{\gamma},\bm{\delta},\bm{\eta}) =xH(βxHαxH)𝒫(xH)λ(a1,x1)a1:H1ηa1:H1ξ\displaystyle=\sum_{x_{H}}(\beta_{x_{H}}-\alpha_{x_{H}})\mathcal{P}^{*}(x_{H})-\lambda_{(a_{1},x_{1})}-\sum_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\eta_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\xi
=xH(βxHαxHλ(aH1,xH1))𝒫(xH)a1:H1ηa1:H1ξ\displaystyle=\sum_{x_{H}}(\beta_{x_{H}}-\alpha_{x_{H}}-\lambda_{(a_{H-1},x_{H-1})})\mathcal{P}^{*}(x_{H})-\sum_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\eta_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\xi
=xH(f(xH)γxH)𝒫(xH)a1:H1ηa1:H1ξ.\displaystyle=\sum_{x_{H}}(f(x_{H})-\gamma_{x_{H}})\mathcal{P}^{*}(x_{H})-\sum_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\eta_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\xi.

Hence, by Slater’s condition, the primal value equals the dual value. Mathematically, we have

VB(θ),Rπ\displaystyle V_{B(\theta^{*}),R}^{\pi}
=max𝝀,𝜶,𝜷,𝜸,𝜹,𝜼L¯1(𝝀,𝜶,𝜷,𝜸,𝜹,𝜼)\displaystyle\quad=\max_{\bm{\lambda},\bm{\alpha},\bm{\beta},\bm{\gamma},\bm{\delta},\bm{\eta}}\bar{L}_{1}(\bm{\lambda},\bm{\alpha},\bm{\beta},\bm{\gamma},\bm{\delta},\bm{\eta})
=max𝝀,𝜶,𝜷,𝜸,𝜹,𝜼{xH(f(xH)γxH)𝒫(xH)a1:H1ηa1:H1ξ}\displaystyle\quad=\max_{\bm{\lambda},\bm{\alpha},\bm{\beta},\bm{\gamma},\bm{\delta},\bm{\eta}}\left\{\sum_{x_{H}}(f(x_{H})-\gamma_{x_{H}})\mathcal{P}^{*}(x_{H})-\sum_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\eta_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\xi\right\}
s.t.{ηa1:H1=αxH+βxH+δxHγxH=f(xH)+λ(aH1,xH1)+αxHβxHηa1:H1,γxH,αxH,βxH,δxH0.\displaystyle\quad\quad\quad\text{s.t.}\left\{\begin{aligned} &\eta_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}=\alpha_{x_{H}}+\beta_{x_{H}}+\delta_{x_{H}}\\ &\gamma_{x_{H}}=f(x_{H})+\lambda_{(a_{H-1},x_{H-1})}+\alpha_{x_{H}}-\beta_{x_{H}}\\ &\eta_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}},\gamma_{x_{H}},\alpha_{x_{H}},\beta_{x_{H}},\delta_{x_{H}}\geq 0.\end{aligned}\right.

If we fix 𝝀\bm{\lambda} and 𝜸\bm{\gamma}, then the maximum is achieved when

{ηa1:H1=maxo1:H|γxHf(xH)λ(aH1,xH1)|αxH=max{γxHf(xH)λ(aH1,xH1),0}βxH=max{γxH+f(xH)+λ(aH1,xH1),0}\left\{\begin{aligned} &\eta_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}=\max_{o_{1\mathrel{\mathop{\mathchar 58\relax}}H}}\left|\gamma_{x_{H}}-f(x_{H})-\lambda_{(a_{H-1},x_{H-1})}\right|\\ &\alpha_{x_{H}}=\max\left\{\gamma_{x_{H}}-f(x_{H})-\lambda_{(a_{H-1},x_{H-1})},0\right\}\\ &\beta_{x_{H}}=\max\left\{-\gamma_{x_{H}}+f(x_{H})+\lambda_{(a_{H-1},x_{H-1})},0\right\}\\ \end{aligned}\right.

Therefore,

VB(θ),Rπ\displaystyle V_{B(\theta^{*}),R}^{\pi}
=max𝜸0,𝝀{xH(f(xH)γxH)𝒫(xH)a1:H1maxo1:H|γxHf(xH)λ(aH1,xH1)|ξ}\displaystyle\quad=\max_{\bm{\gamma}\geq 0,\bm{\lambda}}\left\{\sum_{x_{H}}(f(x_{H})-\gamma_{x_{H}})\mathcal{P}^{*}(x_{H})-\sum_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\max_{o_{1\mathrel{\mathop{\mathchar 58\relax}}H}}\left|\gamma_{x_{H}}-f(x_{H})-\lambda_{(a_{H-1},x_{H-1})}\right|\xi\right\}

Moreover, since 1f(xH)01\geq f(x_{H})\geq 0, the maximum must be achieved when

f(xH)λ(aH1,xH1)0,0γxHmax{f(xH)+λaH1,xH1,0}f(xH).-f(x_{H})\leq\lambda_{(a_{H-1},x_{H-1})}\leq 0,\quad 0\leq\gamma_{x_{H}}\leq\max\left\{f(x_{H})+\lambda_{a_{H-1},x_{H-1}},0\right\}\leq f(x_{H}).

Proposition A.4.

If B(θ)=B𝒫2(θ)B(\theta^{*})=B_{\mathcal{P}}^{2}(\theta^{*}), then, we have

VB(θ),Rπ=max𝜼0{aH1,xH1ηa1:H1𝒫θ(xH1)log𝔼o𝒯H1θ(|τH1)[exp(f(xH)ηa1:H1)]a1:H1ηa1:H1ξ}.\displaystyle V_{B(\theta^{*}),R}^{\pi}=\max_{\bm{\eta}\geq 0}\left\{-\sum_{a_{H-1},x_{H-1}}\eta_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\mathcal{P}^{\theta^{*}}(x_{H-1})\log\mathbb{E}_{o\sim\mathcal{T}^{\theta^{*}}_{H-1}(\cdot|\tau_{H-1})}\left[\exp\left(-\frac{f(x_{H})}{\eta_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}}\right)\right]-\sum_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\eta_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\xi\right\}.
Proof.

Similar to Proposition A.3, we denote π(τH1)R(xH)\pi(\tau_{H-1})R(x_{H}) by f(xH)f(x_{H}). The robust value equals the optimal value of the following optimization problem.

P2: {minxH𝒫(xH)f(xH)s.t.o2𝒫(x2)=1,(a1,x1)o3𝒫(x3)=𝒫(x2),(a2,x2)oh𝒫(xh)=𝒫(xh1),(ah1,xh1)oH𝒫(xH)=𝒫(xH1),(aH1,xH1)o2:H𝒫(xH)log𝒫(xH)𝒫(xH)ξ,a1:H1𝒫(xH)0,xH\displaystyle\left\{\begin{aligned} &\min\sum_{x_{H}}\mathcal{P}(x_{H})f(x_{H})\\ &\ \text{s.t.}\sum_{o_{2}}\mathcal{P}(x_{2})=1,\quad\forall(a_{1},x_{1})\\ &\ \ \quad\sum_{o_{3}}\mathcal{P}(x_{3})=\mathcal{P}(x_{2}),\quad\forall(a_{2},x_{2})\\ &\ \ \quad\ldots\\ &\ \ \quad\sum_{o_{h}}\mathcal{P}(x_{h})=\mathcal{P}(x_{h-1}),\quad\forall(a_{h-1},x_{h-1})\\ &\ \ \quad\ldots\\ &\ \ \quad\sum_{o_{H}}\mathcal{P}(x_{H})=\mathcal{P}(x_{H-1}),\quad\forall(a_{H-1},x_{H-1})\\ &\ \ \quad\sum_{o_{2\mathrel{\mathop{\mathchar 58\relax}}H}}\mathcal{P}(x_{H})\log\frac{\mathcal{P}(x_{H})}{\mathcal{P}^{*}(x_{H})}\leq\xi,\quad\forall a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}\\ &\ \ \quad\mathcal{P}(x_{H})\geq 0,\quad\forall x_{H}\end{aligned}\right.

Since P2 is a convex optimization problem, we can formulate the Lagrangian function L2(𝝀,𝜸,𝜼)L_{2}(\bm{\lambda},\bm{\gamma},\bm{\eta}) as follows

L2\displaystyle L_{2} (𝝀,𝜸,𝜼)\displaystyle(\bm{\lambda},\bm{\gamma},\bm{\eta})
=xH𝒫(xH)f(xH)+(a1,x1)λ(a1,x1)(o2𝒫(x2)1)+h=2H1(ah,xh)λ(ah,xh)(oh+1𝒫(xh+1)𝒫(xh))\displaystyle=\sum_{x_{H}}\mathcal{P}(x_{H})f(x_{H})+\sum_{(a_{1},x_{1})}\lambda_{(a_{1},x_{1})}\left(\sum_{o_{2}}\mathcal{P}(x_{2})-1\right)+\sum_{h=2}^{H-1}\sum_{(a_{h},x_{h})}\lambda_{(a_{h},x_{h})}\left(\sum_{o_{h+1}}\mathcal{P}(x_{h+1})-\mathcal{P}(x_{h})\right)
xHγxH𝒫(xH)+a1:H1ηa1:H1(o2:H𝒫(xH)log𝒫(xH)𝒫(xH)ξ),\displaystyle\quad-\sum_{x_{H}}\gamma_{x_{H}}\mathcal{P}(x_{H})+\sum_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\eta_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\left(\sum_{o_{2\mathrel{\mathop{\mathchar 58\relax}}H}}\mathcal{P}(x_{H})\log\frac{\mathcal{P}(x_{H})}{\mathcal{P}^{*}(x_{H})}-\xi\right),

where λ(ah,xh)\lambda_{(a_{h},x_{h})}\in\mathbb{R}, and γxH,ηa1:H10,xH\gamma_{x_{H}},\eta_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\geq 0,\quad\forall x_{H}.

Thus, the dual problem is

D2:{max𝝀,𝜸,𝜼inf𝒫L2(𝝀,𝜸,𝜼)s.t.f(xH)+λ(aH1,xH1)γxH+ηa1:H1(log𝒫(xH)𝒫(xH)+1)=0,xHλ(ah1,xh1)=ahλ(ah,xh),h,xhγxH,ηa1:H10,xH.\displaystyle\text{D2:}\left\{\begin{aligned} &\max_{\bm{\lambda},\bm{\gamma},\bm{\eta}}\inf_{\mathcal{P}}L_{2}(\bm{\lambda},\bm{\gamma},\bm{\eta})\\ &\ \text{s.t.}\ f(x_{H})+\lambda_{(a_{H-1},x_{H-1})}-\gamma_{x_{H}}+\eta_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\left(\log\frac{\mathcal{P}(x_{H})}{\mathcal{P}^{*}(x_{H})}+1\right)=0,\quad\forall x_{H}\\ &\ \ \quad\lambda_{(a_{h-1},x_{h-1})}=\sum_{a_{h}}\lambda_{(a_{h},x_{h})},\quad\forall h,x_{h}\\ &\ \ \quad\gamma_{x_{H}},\eta_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\geq 0,\quad\forall x_{H}.\end{aligned}\right.

Denote inf𝒫L2(𝝀,𝜸,𝜼)\inf_{\mathcal{P}}L_{2}(\bm{\lambda},\bm{\gamma},\bm{\eta}) by L¯2\bar{L}_{2}. Then, we have

L¯2\displaystyle\bar{L}_{2} =xHηa1:H1𝒫(xH)λ(a1,x1)a1:H1ηa1:H1ξ\displaystyle=-\sum_{x_{H}}\eta_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\mathcal{P}(x_{H})-\lambda_{(a_{1},x_{1})}-\sum_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\eta_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\xi
=xHηa1:H1𝒫(xH)xH𝒫(xH)λ(aH1,xH1)a1:H1ηa1:H1ξ\displaystyle=-\sum_{x_{H}}\eta_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\mathcal{P}(x_{H})-\sum_{x_{H}}\mathcal{P}^{*}(x_{H})\lambda_{(a_{H-1},x_{H-1})}-\sum_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\eta_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\xi
=xHηa1:H1𝒫(xH)exp(γxHf(xH)λ(aH1,xH1)ηa1:H11)xH𝒫(xH)λ(aH1,xH1)a1:H1ηa1:H1ξ.\displaystyle=-\sum_{x_{H}}\eta_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\mathcal{P}^{*}(x_{H})\exp\left(\frac{\gamma_{x_{H}}-f(x_{H})-\lambda_{(a_{H-1},x_{H-1})}}{\eta_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}}-1\right)-\sum_{x_{H}}\mathcal{P}^{*}(x_{H})\lambda_{(a_{H-1},x_{H-1})}-\sum_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\eta_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\xi.

Hence, by Slater’s condition, the primal value equals the dual value. Mathematically, we have

VB(θ),Rπ\displaystyle V_{B(\theta^{*}),R}^{\pi}
=max𝝀,𝜸,𝜼L¯2(𝝀,𝜸,𝜼)\displaystyle\quad=\max_{\bm{\lambda},\bm{\gamma},\bm{\eta}}\bar{L}_{2}(\bm{\lambda},\bm{\gamma},\bm{\eta})
=max𝜼0max𝜸0,𝝀{xHηa1:H1𝒫(xH)exp(γxHf(xH)λ(aH1,xH1)ηa1:H11)\displaystyle\quad=\max_{\bm{\eta}\geq 0}\max_{\bm{\gamma}\geq 0,\bm{\lambda}}\left\{-\sum_{x_{H}}\eta_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\mathcal{P}^{*}(x_{H})\exp\left(\frac{\gamma_{x_{H}}-f(x_{H})-\lambda_{(a_{H-1},x_{H-1})}}{\eta_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}}-1\right)\right.
xH𝒫(xH)λ(aH1,xH1)a1:H1ηa1:H1ξ}\displaystyle\hskip 56.9055pt\qquad\qquad\left.-\sum_{x_{H}}\mathcal{P}^{*}(x_{H})\lambda_{(a_{H-1},x_{H-1})}-\sum_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\eta_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\xi\right\}
=max𝜼0{aH1,xH1ηaH1,xH1𝒫(xH1)log𝔼o𝒯H1(|τH1)[exp(f(xH)ηa1:H1)]a1:H1ηa1:H1ξ},\displaystyle\quad=\max_{\bm{\eta}\geq 0}\left\{-\sum_{a_{H-1},x_{H-1}}\eta_{a_{H-1},x_{H-1}}\mathcal{P}^{*}(x_{H-1})\log\mathbb{E}_{o\sim\mathcal{T}^{*}_{H-1}(\cdot|\tau_{H-1})}\left[\exp\left(-\frac{f(x_{H})}{\eta_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}}\right)\right]-\sum_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\eta_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\xi\right\},

where we choose γxH=0\gamma_{x_{H}}=0 and

λ(aH1,xH1)=ηa1:H1log(o𝒯H1(o|τH1)exp(f(xH)ηa1:H11)).\lambda_{(a_{H-1},x_{H-1})}=\eta_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\log\left(\sum_{o}\mathcal{T}^{*}_{H-1}(o|\tau_{H-1})\exp\left(\frac{-f(x_{H})}{\eta_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}}-1\right)\right).

A.2 Bellman-type Equations under 𝒯\mathcal{T}-type Uncertainty Sets

Recall that the robust value functions are defined as follows.

VB(θ),R,hπ(τh1,oh)=infθB(θ)𝔼θπ[h=hHR(𝝉h)|τh1,oh],\displaystyle V_{B(\theta^{*}),R,h}^{\pi}(\tau_{h-1},o_{h})=\inf_{\theta\in B(\theta^{*})}\mathbb{E}_{\theta}^{\pi}\left[\sum_{h^{\prime}=h}^{H}R(\bm{\tau}_{h^{\prime}})\bigg{|}\tau_{h-1},o_{h}\right],
QB(θ),R,hπ(τh)=infθB(θ)𝔼θπ[h=hHR(𝝉h)|τh].\displaystyle Q_{B(\theta^{*}),R,h}^{\pi}(\tau_{h})=\inf_{\theta\in B(\theta^{*})}\mathbb{E}_{\theta}^{\pi}\left[\sum_{h^{\prime}=h}^{H}R(\bm{\tau}_{h^{\prime}})\bigg{|}\tau_{h}\right].

Then, we have the following Bellman-type equations for robust value functions.

Lemma A.5.

For any nominal model θ\theta^{*} and any policy π\pi, we have

{VB(θ),R,hπ(τh1,oh)=ahπ(ah|τh1,oh)QB(θ),R,hπ(τh)QB(θ),R,hπ(τh)=R(τh)+infθB(θ)oh+1𝒯hθ(oh+1|τh)VB(θ),R,h+1π(τh,oh+1).\displaystyle\left\{\begin{aligned} &V_{B(\theta^{*}),R,h}^{\pi}(\tau_{h-1},o_{h})=\sum_{a_{h}}\pi(a_{h}|\tau_{h-1},o_{h})Q_{B(\theta^{*}),R,h}^{\pi}(\tau_{h})\\ &Q_{B(\theta^{*}),R,h}^{\pi}(\tau_{h})=R(\tau_{h})+\inf_{\theta\in B(\theta^{*})}\sum_{o_{h+1}}\mathcal{T}_{h}^{\theta}(o_{h+1}|\tau_{h})V_{B(\theta^{*}),R,h+1}^{\pi}(\tau_{h},o_{h+1})\end{aligned}\right..
Proof.

The first equation is straightforward.

For the second equation, by the definition of robust Q-value function, for any ι>0\iota>0, there exists θιB(θ)\theta_{\iota}\in B(\theta^{*}) such that

QB(θ),R,hπ(τh)𝔼θιπ[h=hHR(𝝉h)|τh]ι.Q_{B(\theta^{*}),R,h}^{\pi}(\tau_{h})\geq\mathbb{E}_{\theta_{\iota}}^{\pi}\left[\sum_{h^{\prime}=h}^{H}R(\bm{\tau}_{h^{\prime}})\bigg{|}\tau_{h}\right]-\iota.

Then, we have

QB(θ),R,hπ(τh)\displaystyle Q_{B(\theta^{*}),R,h}^{\pi}(\tau_{h}) 𝔼θιπ[h=hHR(𝝉h)|τh]ι\displaystyle\geq\mathbb{E}_{\theta_{\iota}}^{\pi}\left[\sum_{h^{\prime}=h}^{H}R(\bm{\tau}_{h^{\prime}})\bigg{|}\tau_{h}\right]-\iota
=R(τh)+oh+1𝒯hθι(oh+1|τh)Vθι,R,h+1π(τh,oh+1)ι\displaystyle=R(\tau_{h})+\sum_{o_{h+1}}\mathcal{T}_{h}^{\theta_{\iota}}(o_{h+1}|\tau_{h})V_{\theta_{\iota},R,h+1}^{\pi}(\tau_{h},o_{h+1})-\iota
R(τh)+infθB(θ)oh+1𝒯hθ(oh+1|τh)VB(θ),R,h+1π(τh,oh+1)ι.\displaystyle\geq R(\tau_{h})+\inf_{\theta\in B(\theta^{*})}\sum_{o_{h+1}}\mathcal{T}_{h}^{\theta}(o_{h+1}|\tau_{h})V_{B(\theta^{*}),R,h+1}^{\pi}(\tau_{h},o_{h+1})-\iota.

Letting ι0\iota\rightarrow 0, we conclude that

QB(θ),R,hπ(τh)R(τh)+infθB(θ)oh+1𝒯hθ(oh+1|τh)VB(θ),R,h+1π(τh,oh+1).Q_{B(\theta^{*}),R,h}^{\pi}(\tau_{h})\geq R(\tau_{h})+\inf_{\theta\in B(\theta^{*})}\sum_{o_{h+1}}\mathcal{T}_{h}^{\theta}(o_{h+1}|\tau_{h})V_{B(\theta^{*}),R,h+1}^{\pi}(\tau_{h},o_{h+1}).

On the other hand, for any ι1,ι2>0\iota_{1},\iota_{2}>0, there exist θι1,θι2B(θ)\theta_{\iota_{1}},\theta_{\iota_{2}}\in B(\theta^{*}) such that

VB(θ),R,h+1π(τh,oh+1)Vθι1,R,h+1π(τh,oh+1)ι1,V_{B(\theta^{*}),R,h+1}^{\pi}(\tau_{h},o_{h+1})\geq V_{\theta_{\iota_{1}},R,h+1}^{\pi}(\tau_{h},o_{h+1})-\iota_{1},

and

infθB(θ)oh+1𝒯hθ(oh+1|τh)VB(θ),R,h+1π(τh,oh+1)oh+1𝒯hθι2(oh+1|τh)VB(θ),R,h+1π(τh,oh+1)ι2.\inf_{\theta\in B(\theta^{*})}\sum_{o_{h+1}}\mathcal{T}_{h}^{\theta}(o_{h+1}|\tau_{h})V_{B(\theta^{*}),R,h+1}^{\pi}(\tau_{h},o_{h+1})\geq\sum_{o_{h+1}}\mathcal{T}_{h}^{\theta_{\iota_{2}}}(o_{h+1}|\tau_{h})V_{B(\theta^{*}),R,h+1}^{\pi}(\tau_{h},o_{h+1})-\iota_{2}.

In addition, there exists θ3B(θ)\theta_{3}\in B(\theta^{*}) such that 𝒯θ3(oh+1|τh)=𝒯θι2(oh+1|τh)\mathcal{T}_{\theta_{3}}(o_{h+1}|\tau_{h})=\mathcal{T}_{\theta_{\iota_{2}}}(o_{h+1}|\tau_{h}) and Vθι1,R,h+1π=Vθ3,R,h+1πV_{\theta_{\iota_{1}},R,h+1}^{\pi}=V_{\theta_{3},R,h+1}^{\pi}. Then, we have

R(τh)\displaystyle R(\tau_{h}) +infθB(θ)oh+1𝒯hθ(oh+1|τh)VB(θ),R,h+1π(τh,oh+1)\displaystyle+\inf_{\theta\in B(\theta^{*})}\sum_{o_{h+1}}\mathcal{T}_{h}^{\theta}(o_{h+1}|\tau_{h})V_{B(\theta^{*}),R,h+1}^{\pi}(\tau_{h},o_{h+1})
R(τh)+oh+1𝒯hθ3(oh+1|τh)VB(θ),R,h+1π(τh,oh+1)ι2\displaystyle\geq R(\tau_{h})+\sum_{o_{h+1}}\mathcal{T}_{h}^{\theta_{3}}(o_{h+1}|\tau_{h})V_{B(\theta^{*}),R,h+1}^{\pi}(\tau_{h},o_{h+1})-\iota_{2}
R(τh)+oh+1𝒯hθ3(oh+1|τh)Vθ3,R,h+1π(τh,oh+1)ι1ι2\displaystyle\geq R(\tau_{h})+\sum_{o_{h+1}}\mathcal{T}_{h}^{\theta_{3}}(o_{h+1}|\tau_{h})V_{\theta_{3},R,h+1}^{\pi}(\tau_{h},o_{h+1})-\iota_{1}-\iota_{2}
=𝔼θ3π[h=hHR(𝝉h)|τh]ι1ι2\displaystyle=\mathbb{E}_{\theta_{3}}^{\pi}\left[\sum_{h^{\prime}=h}^{H}R(\bm{\tau}_{h^{\prime}})\bigg{|}\tau_{h}\right]-\iota_{1}-\iota_{2}
QB(θ),R,hπ(τh)ι1ι2.\displaystyle\geq Q_{B(\theta^{*}),R,h}^{\pi}(\tau_{h})-\iota_{1}-\iota_{2}.

Letting ι1,ι20\iota_{1},\iota_{2}\rightarrow 0, we conclude that

R(τh)+infθB(θ)oh+1𝒯hθ(oh+1|τh)VB(θ),R,h+1π(τh,oh+1)QB(θ),R,hπ(τh),R(\tau_{h})+\inf_{\theta\in B(\theta^{*})}\sum_{o_{h+1}}\mathcal{T}_{h}^{\theta}(o_{h+1}|\tau_{h})V_{B(\theta^{*}),R,h+1}^{\pi}(\tau_{h},o_{h+1})\geq Q_{B(\theta^{*}),R,h}^{\pi}(\tau_{h}),

which finishes the proof.
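To make the recursion concrete, the sketch below (not from the paper) evaluates a uniform policy on a toy two-step, history-dependent process with a per-history total-variation ball, i.e., a rectangular B𝒯1B_{\mathcal{T}}^{1}-type uncertainty set; the inner minimization is solved in closed form by shifting probability mass toward the smallest next-step robust value. The observation and action spaces, kernel, reward, and policy are all hypothetical.

```python
# A minimal sketch of the Bellman-type recursion in Lemma A.5 for a toy two-step
# process under a per-history TV ball (a rectangular B_T^1-type uncertainty set).
# Illustration only: the spaces, kernel, reward, and policy below are hypothetical.
import numpy as np

OBS, ACT, H, XI = (0, 1), (0, 1), 2, 0.2

def nominal_kernel(tau):
    """Hypothetical nominal kernel T*_h(. | tau_h), an arbitrary function of the history."""
    bias = 0.1 * (sum(tau) % 3)
    return np.array([0.3 + bias, 0.7 - bias])

def reward(tau):
    """Hypothetical reward R(tau_h): the last observation, rescaled so totals stay in [0, 1]."""
    return tau[-2] / (2.0 * H)  # tau = (o_1, a_1, ..., o_h, a_h)

def tv_worst_case_mean(p, v, xi):
    """min over q with TV(q, p) <= xi of <q, v>: shift xi mass onto the smallest value."""
    order = np.argsort(v)
    q, budget = p.copy(), xi
    for i in order[::-1]:
        if i == order[0] or budget <= 0:
            break
        moved = min(q[i], budget)
        q[i] -= moved
        q[order[0]] += moved
        budget -= moved
    return float(q @ v)

def robust_V(prefix, h):
    """First equation of Lemma A.5, specialized to a uniform policy."""
    return float(np.mean([robust_Q(prefix + (a,), h) for a in ACT]))

def robust_Q(tau, h):
    """Second equation of Lemma A.5: R(tau_h) plus the per-step worst case."""
    if h == H:
        return reward(tau)
    v_next = np.array([robust_V(tau + (o,), h + 1) for o in OBS])
    return reward(tau) + tv_worst_case_mean(nominal_kernel(tau), v_next, XI)

for o1 in OBS:
    print(f"o_1 = {o1}: robust value {robust_V((o1,), 1):.4f}")
```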

A.3 Simulation Lemmas of Robust Values

Lemma A.6 (Simulation lemma of robust value under B𝒫1B_{\mathcal{P}}^{1}).
|VB𝒫1(θ),RπVB𝒫1(θ),Rπ|𝔻θπ𝔻θπ1\left|V_{B_{\mathcal{P}}^{1}(\theta^{*}),R}^{\pi}-V_{B_{\mathcal{P}}^{1}(\theta),R}^{\pi}\right|\leq\left\|\mathbb{D}_{\theta}^{\pi}-\mathbb{D}_{\theta^{*}}^{\pi}\right\|_{1}
Proof.

Let f(xH)=π(τH1)R(xH)f(x_{H})=\pi(\tau_{H-1})R(x_{H}).

We first prove that VB𝒫1(θ),RπVB𝒫1(θ),Rπ𝔻θπ𝔻θπ1V_{B_{\mathcal{P}}^{1}(\theta^{*}),R}^{\pi}-V_{B_{\mathcal{P}}^{1}(\theta),R}^{\pi}\leq\left\|\mathbb{D}_{\theta}^{\pi}-\mathbb{D}_{\theta^{*}}^{\pi}\right\|_{1}. By Proposition A.3, we have

VB(θ),RπVB(θ),Rπ\displaystyle V_{B(\theta^{*}),R}^{\pi}-V_{B(\theta),R}^{\pi}
=max𝜸0,𝝀{xH(f(xH)γxH)𝒫θ(xH)a1:H1maxo1:H|γxHf(xH)λ(aH1,xH1)|ξ}\displaystyle\quad=\max_{\bm{\gamma}\geq 0,\bm{\lambda}}\left\{\sum_{x_{H}}(f(x_{H})-\gamma_{x_{H}})\mathcal{P}^{\theta^{*}}(x_{H})-\sum_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\max_{o_{1\mathrel{\mathop{\mathchar 58\relax}}H}}\left|\gamma_{x_{H}}-f(x_{H})-\lambda_{(a_{H-1},x_{H-1})}\right|\xi\right\}
max𝜸0,𝝀{xH(f(xH)γxH)𝒫θ(xH)a1:H1maxo1:H|γxHf(xH)λ(aH1,xH1)|ξ},\displaystyle\quad\quad-\max_{\bm{\gamma}\geq 0,\bm{\lambda}}\left\{\sum_{x_{H}}(f(x_{H})-\gamma_{x_{H}})\mathcal{P}^{\theta}(x_{H})-\sum_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\max_{o_{1\mathrel{\mathop{\mathchar 58\relax}}H}}\left|\gamma_{x_{H}}-f(x_{H})-\lambda_{(a_{H-1},x_{H-1})}\right|\xi\right\},

Let 𝜸,𝝀\bm{\gamma}^{*},\bm{\lambda}^{*} be the solution to

max𝜸0,𝝀{xH(f(xH)γxH)𝒫θ(xH)a1:H1maxo1:H|γxHf(xH)λ(aH1,xH1)|ξ}.\max_{\bm{\gamma}\geq 0,\bm{\lambda}}\left\{\sum_{x_{H}}(f(x_{H})-\gamma_{x_{H}})\mathcal{P}^{\theta^{*}}(x_{H})-\sum_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\max_{o_{1\mathrel{\mathop{\mathchar 58\relax}}H}}\left|\gamma_{x_{H}}-f(x_{H})-\lambda_{(a_{H-1},x_{H-1})}\right|\xi\right\}.

Again, from Proposition A.3, we know that

f(xH)λ(aH1,xH1)0,0γxHmax{f(xH)+λaH1,xH1,0}f(xH).-f(x_{H})\leq\lambda^{*}_{(a_{H-1},x_{H-1})}\leq 0,\quad 0\leq\gamma^{*}_{x_{H}}\leq\max\left\{f(x_{H})+\lambda^{*}_{a_{H-1},x_{H-1}},0\right\}\leq f(x_{H}). (9)

Thus, the difference can be upper bounded as follows.

VB(θ),RπVB(θ),Rπ\displaystyle V_{B(\theta^{*}),R}^{\pi}-V_{B(\theta),R}^{\pi}
xH(f(xH)γxH)𝒫θ(xH)a1:H1maxo1:H|γxHf(xH)λ(aH1,xH1)|ξ\displaystyle\quad\leq\sum_{x_{H}}(f(x_{H})-\gamma^{*}_{x_{H}})\mathcal{P}^{\theta^{*}}(x_{H})-\sum_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\max_{o_{1\mathrel{\mathop{\mathchar 58\relax}}H}}\left|\gamma^{*}_{x_{H}}-f(x_{H})-\lambda^{*}_{(a_{H-1},x_{H-1})}\right|\xi
xH(f(xH)γxH)𝒫θ(xH)+a1:H1maxo1:H|γxHf(xH)λ(aH1,xH1)|ξ\displaystyle\quad\quad-\sum_{x_{H}}(f(x_{H})-\gamma^{*}_{x_{H}})\mathcal{P}^{\theta}(x_{H})+\sum_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\max_{o_{1\mathrel{\mathop{\mathchar 58\relax}}H}}\left|\gamma^{*}_{x_{H}}-f(x_{H})-\lambda^{*}_{(a_{H-1},x_{H-1})}\right|\xi
=xH(f(xH)γxH)(𝒫θ(xH)𝒫θ(xH))\displaystyle\quad=\sum_{x_{H}}(f(x_{H})-\gamma^{*}_{x_{H}})(\mathcal{P}^{\theta^{*}}(x_{H})-\mathcal{P}^{\theta}(x_{H}))
xH|f(xH)γxH||𝒫θ(xH)𝒫θ(xH)|\displaystyle\quad\leq\sum_{x_{H}}|f(x_{H})-\gamma^{*}_{x_{H}}||\mathcal{P}^{\theta^{*}}(x_{H})-\mathcal{P}^{\theta}(x_{H})|
(a)xHf(xH)|𝒫θ(xH)𝒫θ(xH)|\displaystyle\quad\overset{(a)}{\leq}\sum_{x_{H}}f(x_{H})|\mathcal{P}^{\theta^{*}}(x_{H})-\mathcal{P}^{\theta}(x_{H})|
=xHπ(τH1)R(xH)|𝒫θ(xH)𝒫θ(xH)|\displaystyle\quad=\sum_{x_{H}}\pi(\tau_{H-1})R(x_{H})|\mathcal{P}^{\theta^{*}}(x_{H})-\mathcal{P}^{\theta}(x_{H})|
𝔻θπ𝔻θπ1,\displaystyle\quad\leq\left\|\mathbb{D}_{\theta}^{\pi}-\mathbb{D}_{\theta^{*}}^{\pi}\right\|_{1},

where (a)(a) is due to Equation 9.

The opposite side VB𝒫1(θ),RπVB𝒫1(θ),Rπ𝔻θπ𝔻θπ1V_{B_{\mathcal{P}}^{1}(\theta^{*}),R}^{\pi}-V_{B_{\mathcal{P}}^{1}(\theta),R}^{\pi}\geq-\left\|\mathbb{D}_{\theta}^{\pi}-\mathbb{D}_{\theta^{*}}^{\pi}\right\|_{1} follows exactly the same proof.

Lemma A.7 (Simulation lemma of robust value under B𝒫2B_{\mathcal{P}}^{2}).
|VB𝒫2(θ),RπVB𝒫2(θ),Rπ|3exp(1/η)𝔻θπ𝔻θπ1,\left|V_{B_{\mathcal{P}}^{2}(\theta^{*}),R}^{\pi}-V_{B_{\mathcal{P}}^{2}(\theta),R}^{\pi}\right|\leq 3\exp(1/\eta^{*})\left\|\mathbb{D}_{\theta}^{\pi}-\mathbb{D}_{\theta^{*}}^{\pi}\right\|_{1},

where η\eta^{*} is a lower bound on the optimal solutions of the following two problems:

ηmin{argmax𝜼0{aH1,xH1ηaH1,xH1𝒫θ(xH1)log𝔼o𝒯H1θ(|τH1)[exp(f(xH)ηa1:H1)]a1:H1ηa1:H1ξ},argmax𝜼0{aH1,xH1ηaH1,xH1𝒫θ(xH1)log𝔼o𝒯H1θ(|τH1)[exp(f(xH)ηa1:H1)]a1:H1ηa1:H1ξ}.\eta^{*}\leq\min\left\{\begin{aligned} &\arg\max_{\bm{\eta}\geq 0}\left\{-\sum_{a_{H-1},x_{H-1}}\eta_{a_{H-1},x_{H-1}}\mathcal{P}^{\theta^{*}}(x_{H-1})\log\mathbb{E}_{o\sim\mathcal{T}^{\theta^{*}}_{H-1}(\cdot|\tau_{H-1})}\left[\exp\left(-\frac{f(x_{H})}{\eta_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}}\right)\right]\right.\\ &\quad\hskip 56.9055pt\left.-\sum_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\eta_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\xi\right\},\\ &\arg\max_{\bm{\eta}\geq 0}\left\{-\sum_{a_{H-1},x_{H-1}}\eta_{a_{H-1},x_{H-1}}\mathcal{P}^{\theta}(x_{H-1})\log\mathbb{E}_{o\sim\mathcal{T}^{\theta}_{H-1}(\cdot|\tau_{H-1})}\left[\exp\left(-\frac{f(x_{H})}{\eta_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}}\right)\right]\right.\\ &\quad\hskip 56.9055pt\left.-\sum_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\eta_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\xi\right\}.\end{aligned}\right.

Here, for a scalar xx and a vector 𝐯\mathbf{v}, the inequality x𝐯x\leq\mathbf{v} means that xx is less than or equal to each coordinate of 𝐯\mathbf{v}.

Proof.

Let f(xH)=π(τH1)R(xH)f(x_{H})=\pi(\tau_{H-1})R(x_{H}). By Proposition A.4, we have

VB(θ),RπVB(θ),Rπ\displaystyle V_{B(\theta^{*}),R}^{\pi}-V_{B(\theta),R}^{\pi}
=max𝜼0{aH1,xH1ηaH1,xH1𝒫θ(xH1)log𝔼o𝒯H1θ(|τH1)[exp(f(xH)ηa1:H1)]a1:H1ηa1:H1ξ}\displaystyle\quad=\max_{\bm{\eta}\geq 0}\left\{-\sum_{a_{H-1},x_{H-1}}\eta_{a_{H-1},x_{H-1}}\mathcal{P}^{\theta^{*}}(x_{H-1})\log\mathbb{E}_{o\sim\mathcal{T}^{\theta^{*}}_{H-1}(\cdot|\tau_{H-1})}\left[\exp\left(-\frac{f(x_{H})}{\eta_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}}\right)\right]-\sum_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\eta_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\xi\right\}
max𝜼0{aH1,xH1ηaH1,xH1𝒫θ(xH1)log𝔼o𝒯H1θ(|τH1)[exp(f(xH)ηa1:H1)]a1:H1ηa1:H1ξ}.\displaystyle\quad\quad-\max_{\bm{\eta}\geq 0}\left\{-\sum_{a_{H-1},x_{H-1}}\eta_{a_{H-1},x_{H-1}}\mathcal{P}^{\theta}(x_{H-1})\log\mathbb{E}_{o\sim\mathcal{T}^{\theta}_{H-1}(\cdot|\tau_{H-1})}\left[\exp\left(-\frac{f(x_{H})}{\eta_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}}\right)\right]-\sum_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\eta_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\xi\right\}.

Let 𝜼\bm{\eta}^{*} be the solution to

max𝜼0{aH1,xH1ηaH1,xH1𝒫θ(xH1)log𝔼o𝒯H1θ(|τH1)[exp(f(xH)ηa1:H1)]a1:H1ηa1:H1ξ}.\max_{\bm{\eta}\geq 0}\left\{-\sum_{a_{H-1},x_{H-1}}\eta_{a_{H-1},x_{H-1}}\mathcal{P}^{\theta^{*}}(x_{H-1})\log\mathbb{E}_{o\sim\mathcal{T}^{\theta^{*}}_{H-1}(\cdot|\tau_{H-1})}\left[\exp\left(-\frac{f(x_{H})}{\eta_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}}\right)\right]-\sum_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\eta_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\xi\right\}.

Then, the difference can be upper bounded as follows.

VB(θ),RπVB(θ),Rπ\displaystyle V_{B(\theta^{*}),R}^{\pi}-V_{B(\theta),R}^{\pi}
aH1,xH1ηaH1,xH1𝒫θlog𝔼oH𝒯H1θ(|τH1)[exp(f(xH)ηa1:H1)]+a1:H1ηa1:H1ξ\displaystyle\quad\leq\sum_{a_{H-1},x_{H-1}}\eta^{*}_{a_{H-1},x_{H-1}}\mathcal{P}^{\theta}\log\mathbb{E}_{o_{H}\sim\mathcal{T}^{\theta}_{H-1}(\cdot|\tau_{H-1})}\left[\exp\left(-\frac{f(x_{H})}{\eta^{*}_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}}\right)\right]+\sum_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\eta_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}^{*}\xi
aH1,xH1ηaH1,xH1𝒫θ(xH1)log𝔼oH𝒯H1θ(|τH1)[exp(f(xH)ηa1:H1)]a1:H1ηa1:H1ξ\displaystyle\quad\quad-\sum_{a_{H-1},x_{H-1}}\eta^{*}_{a_{H-1},x_{H-1}}\mathcal{P}^{\theta^{*}}(x_{H-1})\log\mathbb{E}_{o_{H}\sim\mathcal{T}^{\theta^{*}}_{H-1}(\cdot|\tau_{H-1})}\left[\exp\left(-\frac{f(x_{H})}{\eta^{*}_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}}\right)\right]-\sum_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\eta^{*}_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\xi
=aH1,xH1ηaH1,xH1(𝒫θ(xH1)𝒫θ(xH1))log𝔼oH𝒯H1θ(|τH1)[exp(f(xH)ηa1:H1)]\displaystyle\quad=\sum_{a_{H-1},x_{H-1}}\eta^{*}_{a_{H-1},x_{H-1}}\left(\mathcal{P}^{\theta}(x_{H-1})-\mathcal{P}^{\theta^{*}}(x_{H-1})\right)\log\mathbb{E}_{o_{H}\sim\mathcal{T}^{\theta}_{H-1}(\cdot|\tau_{H-1})}\left[\exp\left(-\frac{f(x_{H})}{\eta^{*}_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}}\right)\right]
+aH1,xH1ηaH1,xH1𝒫θ(xH1)(log𝔼oH𝒯H1θ(|τH1)[exp(f(xH)ηa1:H1)]\displaystyle\quad\quad+\sum_{a_{H-1},x_{H-1}}\eta^{*}_{a_{H-1},x_{H-1}}\mathcal{P}^{\theta^{*}}(x_{H-1})\left(\log\mathbb{E}_{o_{H}\sim\mathcal{T}^{\theta}_{H-1}(\cdot|\tau_{H-1})}\left[\exp\left(-\frac{f(x_{H})}{\eta^{*}_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}}\right)\right]\right.
log𝔼oH𝒯H1θ(|τH1)[exp(f(xH)ηa1:H1)])\displaystyle\quad\quad\left.-\log\mathbb{E}_{o_{H}\sim\mathcal{T}^{\theta^{*}}_{H-1}(\cdot|\tau_{H-1})}\left[\exp\left(-\frac{f(x_{H})}{\eta^{*}_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}}\right)\right]\right)
aH1,xH1ηaH1,xH1|𝒫θ(xH1)𝒫θ(xH1)|log𝔼oH𝒯H1θ(|τH1)[exp(f(xH)ηa1:H1)]\displaystyle\quad\leq\sum_{a_{H-1},x_{H-1}}-\eta^{*}_{a_{H-1},x_{H-1}}\left|\mathcal{P}^{\theta}(x_{H-1})-\mathcal{P}^{\theta^{*}}(x_{H-1})\right|\log\mathbb{E}_{o_{H}\sim\mathcal{T}^{\theta}_{H-1}(\cdot|\tau_{H-1})}\left[\exp\left(-\frac{f(x_{H})}{\eta^{*}_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}}\right)\right]
+aH1,xH1ηaH1,xH1𝒫θ(xH1)log𝔼oH𝒯H1θ(|τH1)[exp(f(xH)ηa1:H1)]𝔼oH𝒯H1θ(|τH1)[exp(f(xH)ηa1:H1)]\displaystyle\quad\quad+\sum_{a_{H-1},x_{H-1}}\eta^{*}_{a_{H-1},x_{H-1}}\mathcal{P}^{\theta^{*}}(x_{H-1})\log\frac{\mathbb{E}_{o_{H}\sim\mathcal{T}^{\theta}_{H-1}(\cdot|\tau_{H-1})}\left[\exp\left(-\frac{f(x_{H})}{\eta^{*}_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}}\right)\right]}{\mathbb{E}_{o_{H}\sim\mathcal{T}^{\theta^{*}}_{H-1}(\cdot|\tau_{H-1})}\left[\exp\left(-\frac{f(x_{H})}{\eta^{*}_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}}\right)\right]}
aH1,xH1|𝒫θ(xH1)𝒫θ(xH1)|maxoHf(xH)\displaystyle\quad\leq\sum_{a_{H-1},x_{H-1}}\left|\mathcal{P}^{\theta}(x_{H-1})-\mathcal{P}^{\theta^{*}}(x_{H-1})\right|\max_{o_{H}}f(x_{H})
+aH1,xH1ηaH1,xH1𝒫θ(xH1)log𝔼oH𝒯H1θ(|τH1)[exp(f(xH)ηa1:H1)]𝔼oH𝒯H1θ(|τH1)[exp(f(xH)ηa1:H1)]I0\displaystyle\quad\quad+\sum_{a_{H-1},x_{H-1}}\eta^{*}_{a_{H-1},x_{H-1}}\mathcal{P}^{\theta^{*}}(x_{H-1})\underbrace{\log\frac{\mathbb{E}_{o_{H}\sim\mathcal{T}^{\theta}_{H-1}(\cdot|\tau_{H-1})}\left[\exp\left(-\frac{f(x_{H})}{\eta^{*}_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}}\right)\right]}{\mathbb{E}_{o_{H}\sim\mathcal{T}^{\theta^{*}}_{H-1}(\cdot|\tau_{H-1})}\left[\exp\left(-\frac{f(x_{H})}{\eta^{*}_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}}\right)\right]}}_{I_{0}}

For term I0I_{0}, we fix τH1\tau_{H-1} and write f(xH)f(x_{H}) as π(τH1)R(oH)\pi(\tau_{H-1})R(o_{H}) for convenience.

Since τH1\tau_{H-1} is fixed, in order to upper bound the term

I0=log𝔼oH𝒯H1θ(|τH1)[exp(π(τH1)R(oH)ηa1:H1)]𝔼oH𝒯H1θ(|τH1)[exp(π(τH1)R(oH)ηa1:H1)],I_{0}=\log\frac{\mathbb{E}_{o_{H}\sim\mathcal{T}^{\theta}_{H-1}(\cdot|\tau_{H-1})}\left[\exp\left(-\frac{\pi(\tau_{H-1})R(o_{H})}{\eta^{*}_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}}\right)\right]}{\mathbb{E}_{o_{H}\sim\mathcal{T}^{\theta^{*}}_{H-1}(\cdot|\tau_{H-1})}\left[\exp\left(-\frac{\pi(\tau_{H-1})R(o_{H})}{\eta^{*}_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}}\right)\right]},

π(τH1)/ηa1:H1[0,1/η]\pi(\tau_{H-1})/\eta^{*}_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}\in[0,1/\eta^{*}] can be treated as a variable. We define a function g(t)g(t) for 1/ηt01/\eta^{*}\geq t\geq 0 as follows

g(t)=log𝔼oP[exp(tR(o))]𝔼oQ[exp(tR(o))],g(t)=\log\frac{\mathbb{E}_{o\sim P}\left[\exp\left(-tR(o)\right)\right]}{\mathbb{E}_{o\sim Q}\left[\exp\left(-tR(o)\right)\right]},

where P,QΔ(𝒪)P,Q\in\Delta(\mathcal{O}) are two distributions over the observation space.

Taking the derivative with respect to tt, we have

dgdt\displaystyle\frac{dg}{dt} =𝔼oP[R(o)exp(tR(o))]𝔼oP[exp(tR(o))]𝔼oQ[R(o)exp(tR(o))]𝔼oQ[exp(tR(o))]\displaystyle=\frac{\mathbb{E}_{o\sim P}\left[-R(o)\exp\left(-tR(o)\right)\right]}{\mathbb{E}_{o\sim P}\left[\exp\left(-tR(o)\right)\right]}-\frac{\mathbb{E}_{o\sim Q}\left[-R(o)\exp\left(-tR(o)\right)\right]}{\mathbb{E}_{o\sim Q}\left[\exp\left(-tR(o)\right)\right]}
=𝔼oQ[R(o)exp(tR(o))𝔼oQ[exp(tR(o))]]𝔼oP[R(o)exp(tR(o))𝔼oP[exp(tR(o))]].\displaystyle=\mathbb{E}_{o\sim Q}\left[R(o)\frac{\exp\left(-tR(o)\right)}{\mathbb{E}_{o\sim Q}\left[\exp\left(-tR(o)\right)\right]}\right]-\mathbb{E}_{o\sim P}\left[R(o)\frac{\exp\left(-tR(o)\right)}{\mathbb{E}_{o\sim P}\left[\exp\left(-tR(o)\right)\right]}\right].

Due to R(o)[0,1]R(o)\in[0,1] and 1/ηt01/\eta^{*}\geq t\geq 0, we have

|dgdt|PQ11exp(1/η).\left|\frac{dg}{dt}\right|\leq\|P-Q\|_{1}\frac{1}{\exp(-1/\eta^{*})}.

Thus, for t[0,1/η]t\in[0,1/\eta^{*}], we have

g(t)=0tg(s)𝑑stPQ1exp(1/η).g(t)=\int_{0}^{t}g^{\prime}(s)ds\leq t\|P-Q\|_{1}\exp(1/\eta^{*}).

Plugging the upper bound on g(t)g(t) back into the robust value difference, we have

VB(θ),RπVB(θ),Rπ\displaystyle V_{B(\theta^{*}),R}^{\pi}-V_{B(\theta),R}^{\pi}
aH1,xH1|𝒫θ(xH1)𝒫θ(xH1)|maxoHf(xH)\displaystyle\quad\leq\sum_{a_{H-1},x_{H-1}}\left|\mathcal{P}^{\theta}(x_{H-1})-\mathcal{P}^{\theta^{*}}(x_{H-1})\right|\max_{o_{H}}f(x_{H})
+aH1,xH1ηaH1,xH1𝒫θ(xH1)𝒯H1θ(|τH1)𝒯H1θ(|τH1)1π(τH1)ηa1:H1exp(1/η)\displaystyle\quad\quad+\sum_{a_{H-1},x_{H-1}}\eta^{*}_{a_{H-1},x_{H-1}}\mathcal{P}^{\theta^{*}}(x_{H-1})\left\|\mathcal{T}^{\theta}_{H-1}(\cdot|\tau_{H-1})-\mathcal{T}^{\theta^{*}}_{H-1}(\cdot|\tau_{H-1})\right\|_{1}\frac{\pi(\tau_{H-1})}{\eta_{a_{1\mathrel{\mathop{\mathchar 58\relax}}H-1}}}\exp(1/\eta^{*})
τH1|𝔻θπ(τH1)𝔻θπ(τH1)|+exp(1/η)𝔼θπ[𝒯H1θ(|τH1)𝒯H1θ(|τH1)1]\displaystyle\leq\sum_{\tau_{H-1}}\left|\mathbb{D}_{\theta}^{\pi}(\tau_{H-1})-\mathbb{D}_{\theta^{*}}^{\pi}(\tau_{H-1})\right|+\exp(1/\eta^{*})\mathbb{E}_{\theta^{*}}^{\pi}\left[\left\|\mathcal{T}^{\theta}_{H-1}(\cdot|\tau_{H-1})-\mathcal{T}^{\theta^{*}}_{H-1}(\cdot|\tau_{H-1})\right\|_{1}\right]
(a)3exp(1/η)𝔻θπ𝔻θπ1,\displaystyle\overset{(a)}{\leq}3\exp(1/\eta^{*})\left\|\mathbb{D}_{\theta}^{\pi}-\mathbb{D}_{\theta^{*}}^{\pi}\right\|_{1},

where (a)(a) is due to Lemma E.1.

The reverse inequality follows from the same argument, which completes the proof.

Lemma A.8 (Simulation lemma for robust value under B𝒯iB_{\mathcal{T}}^{i}).
|VB𝒯1(θ),RπVB𝒯1(θ),Rπ|2CB𝔻θπ𝔻θπ1,\displaystyle\left|V_{B_{\mathcal{T}}^{1}(\theta^{*}),R}^{\pi}-V_{B_{\mathcal{T}}^{1}(\theta),R}^{\pi}\right|\leq 2C_{B}\|\mathbb{D}_{\theta^{*}}^{\pi}-\mathbb{D}_{\theta}^{\pi}\|_{1},
|VB𝒯2(θ),RπVB𝒯2(θ),Rπ|2CBC𝒯kl𝔻θπ𝔻θπ1\displaystyle\left|V_{B_{\mathcal{T}}^{2}(\theta^{*}),R}^{\pi}-V_{B_{\mathcal{T}}^{2}(\theta),R}^{\pi}\right|\leq 2C_{B}C_{\mathcal{T}}^{\mathrm{kl}}\|\mathbb{D}_{\theta^{*}}^{\pi}-\mathbb{D}_{\theta}^{\pi}\|_{1}

where C𝒯kl=max{exp(ξ)/ξ,λexp(1/λ)}C_{\mathcal{T}}^{\mathrm{kl}}=\max\left\{\exp(\xi)/\xi,\lambda^{*}\exp(1/\lambda^{*})\right\}, and λ\lambda^{*} is a lower bound for the optimal solutions of the following problems

λ=minh,τh{argsupλ0{λlog𝔼𝒯hθ[exp(VB𝒯2(θ),R,h+1π(τh,o)/λ)]λξ},argsupλ0{λlog𝔼𝒯hθ[exp(VB𝒯2(θ),R,h+1π(τh,o)/λ)]λξ}.\lambda^{*}=\min_{h,\tau_{h}}\left\{\begin{aligned} &\arg\sup_{\lambda\geq 0}\left\{-\lambda\log\mathbb{E}_{\mathcal{T}_{h}^{\theta^{*}}}[\exp(-V_{B_{\mathcal{T}}^{2}(\theta),R,h+1}^{\pi}(\tau_{h},o)/\lambda)]-\lambda\xi\right\},\\ &\arg\sup_{\lambda\geq 0}\left\{-\lambda\log\mathbb{E}_{\mathcal{T}_{h}^{\theta}}[\exp(-V_{B_{\mathcal{T}}^{2}(\theta^{*}),R,h+1}^{\pi}(\tau_{h},o)/\lambda)]-\lambda\xi\right\}.\end{aligned}\right.
Proof.

First, we have

VB𝒯i(θ),Rπ\displaystyle V_{B_{\mathcal{T}}^{i}(\theta^{*}),R}^{\pi} VB𝒯i(θ),Rπ\displaystyle-V_{B_{\mathcal{T}}^{i}(\theta),R}^{\pi}
=𝔼π[QB𝒯i(θ),R,1π(τ1)QB𝒯i(θ),R,1π(τ1)|o1]\displaystyle=\mathbb{E}^{\pi}\left[Q_{B_{\mathcal{T}}^{i}(\theta^{*}),R,1}^{\pi}(\tau_{1})-Q_{B_{\mathcal{T}}^{i}(\theta),R,1}^{\pi}(\tau_{1})\bigg{|}o_{1}\right]
=𝔼π[infθB𝒯i(θ)o2𝒯1θ(o2|τ1)VB𝒯i(θ),R,2π(τ1,o2)infθB𝒯i(θ)o2𝒯1θ(o2|τ1)VB𝒯i(θ),R,2π(τ1,o2)]\displaystyle=\mathbb{E}^{\pi}\left[\inf_{\theta^{\prime}\in B_{\mathcal{T}}^{i}(\theta^{*})}\sum_{o_{2}}\mathcal{T}_{1}^{\theta^{\prime}}(o_{2}|\tau_{1})V_{B_{\mathcal{T}}^{i}(\theta^{*}),R,2}^{\pi}(\tau_{1},o_{2})-\inf_{\theta^{\prime}\in B_{\mathcal{T}}^{i}(\theta)}\sum_{o_{2}}\mathcal{T}_{1}^{\theta^{\prime}}(o_{2}|\tau_{1})V_{B_{\mathcal{T}}^{i}(\theta),R,2}^{\pi}(\tau_{1},o_{2})\right]
=𝔼π[infθB𝒯i(θ)o2𝒯1θ(o2|τ1)VB𝒯i(θ),R,2π(τ1,o2)infθB𝒯i(θ)o2𝒯1θ(o2|τ1)VB𝒯i(θ),R,2π(τ1,o2)\displaystyle=\mathbb{E}^{\pi}\left[\inf_{\theta^{\prime}\in B_{\mathcal{T}}^{i}(\theta^{*})}\sum_{o_{2}}\mathcal{T}_{1}^{\theta}(o_{2}|\tau_{1})V_{B_{\mathcal{T}}^{i}(\theta^{*}),R,2}^{\pi}(\tau_{1},o_{2})-\inf_{\theta^{\prime}\in B_{\mathcal{T}}^{i}(\theta^{*})}\sum_{o_{2}}\mathcal{T}_{1}^{\theta}(o_{2}|\tau_{1})V_{B_{\mathcal{T}}^{i}(\theta),R,2}^{\pi}(\tau_{1},o_{2})\right.
+infθB𝒯i(θ)o2𝒯1θ(o2|τ1)VB𝒯i(θ),R,2π(τ1,o2)infθB𝒯i(θ)o2𝒯1θ(o2|τ1)VB𝒯i(θ),R,2π(τ1,o2)]\displaystyle\quad\quad\quad\left.+\inf_{\theta^{\prime}\in B_{\mathcal{T}}^{i}(\theta^{*})}\sum_{o_{2}}\mathcal{T}_{1}^{\theta^{\prime}}(o_{2}|\tau_{1})V_{B_{\mathcal{T}}^{i}(\theta),R,2}^{\pi}(\tau_{1},o_{2})-\inf_{\theta^{\prime}\in B_{\mathcal{T}}^{i}(\theta)}\sum_{o_{2}}\mathcal{T}_{1}^{\theta^{\prime}}(o_{2}|\tau_{1})V_{B_{\mathcal{T}}^{i}(\theta),R,2}^{\pi}(\tau_{1},o_{2})\right]
maxθB𝒯1(θ)𝔼θπ[VB𝒯i(θ),R,2π(τ1,o2)VB𝒯i(θ),R,2π(τ1,o2)]+𝔼π[B𝒯i(θ,τ1)]\displaystyle\leq\max_{\theta^{\prime}\in B_{\mathcal{T}}^{1}(\theta^{*})}\mathbb{E}_{\theta^{\prime}}^{\pi}\left[V_{B_{\mathcal{T}}^{i}(\theta^{*}),R,2}^{\pi}(\tau_{1},o_{2})-V_{B_{\mathcal{T}}^{i}(\theta),R,2}^{\pi}(\tau_{1},o_{2})\right]+\mathbb{E}^{\pi}\left[\mathcal{E}_{B_{\mathcal{T}}^{i}}(\theta,\tau_{1})\right]
\displaystyle\leq\ldots
maxθB𝒯i(θ)h=1H𝔼θπ[B𝒯i(θ,τh)]CBh=1H𝔼θπ[B𝒯i(θ,τh)]\displaystyle\leq\max_{\theta^{\prime}\in B_{\mathcal{T}}^{i}(\theta^{*})}\sum_{h=1}^{H}\mathbb{E}_{\theta^{\prime}}^{\pi}\left[\mathcal{E}_{B_{\mathcal{T}}^{i}}(\theta,\tau_{h})\right]\leq C_{B}\sum_{h=1}^{H}\mathbb{E}_{\theta^{*}}^{\pi}\left[\mathcal{E}_{B_{\mathcal{T}}^{i}}(\theta,\tau_{h})\right]

When i=1i=1, we have

B𝒯1(θ,τh)\displaystyle\mathcal{E}_{B_{\mathcal{T}}^{1}}(\theta,\tau_{h}) =minθB𝒯1(θ)𝔼o𝒯hθ(|τh)[VB𝒯1(θ),R,h+1π(τh,o)]minθB𝒯1(θ)𝔼o𝒯hθ(|τh)[VB𝒯1(θ),R,h+1π(τh,o)]\displaystyle=\min_{\theta^{\prime}\in B_{\mathcal{T}}^{1}(\theta^{*})}\mathop{\mathbb{E}}_{o\sim\mathcal{T}_{h}^{\theta^{\prime}}(\cdot|\tau_{h})}\left[V_{B_{\mathcal{T}}^{1}(\theta),R,h+1}^{\pi}(\tau_{h},o)\right]-\min_{\theta^{\prime}\in B_{\mathcal{T}}^{1}(\theta)}\mathop{\mathbb{E}}_{o\sim\mathcal{T}_{h}^{\theta^{\prime}}(\cdot|\tau_{h})}\left[V_{B_{\mathcal{T}}^{1}(\theta),R,h+1}^{\pi}(\tau_{h},o)\right]
=(a)supλ[0,1]{𝔼𝒯hθ[(λVB𝒯1(θ),R,h+1π(τh,o))+]ξmaxo(λVB𝒯1(θ),R,h+1π(τh,o))++λ}\displaystyle\overset{(a)}{=}\sup_{\lambda\in[0,1]}\left\{-\mathbb{E}_{\mathcal{T}_{h}^{\theta^{*}}}[(\lambda-V_{B_{\mathcal{T}}^{1}(\theta),R,h+1}^{\pi}(\tau_{h},o))_{+}]-\xi\max_{o}(\lambda-V_{B_{\mathcal{T}}^{1}(\theta),R,h+1}^{\pi}(\tau_{h},o))_{+}+\lambda\right\}
supλ[0,1]{𝔼𝒯hθ[(λVB𝒯1(θ),R,h+1π(τh,o))+]ξmaxo(λVB𝒯1(θ),R,h+1π(τh,o))++λ}\displaystyle\qquad-\sup_{\lambda\in[0,1]}\left\{-\mathbb{E}_{\mathcal{T}_{h}^{\theta}}[(\lambda-V_{B_{\mathcal{T}}^{1}(\theta),R,h+1}^{\pi}(\tau_{h},o))_{+}]-\xi\max_{o}(\lambda-V_{B_{\mathcal{T}}^{1}(\theta),R,h+1}^{\pi}(\tau_{h},o))_{+}+\lambda\right\}
supλ[0,1]{𝔼𝒯hθ[(λVB𝒯1(θ),R,h+1π(τh,o))+]𝔼𝒯hθ[(λVB𝒯1(θ),R,h+1π(τh,o))+]}\displaystyle\leq\sup_{\lambda\in[0,1]}\left\{\mathbb{E}_{\mathcal{T}_{h}^{\theta}}[(\lambda-V_{B_{\mathcal{T}}^{1}(\theta),R,h+1}^{\pi}(\tau_{h},o))_{+}]-\mathbb{E}_{\mathcal{T}_{h}^{\theta^{*}}}[(\lambda-V_{B_{\mathcal{T}}^{1}(\theta),R,h+1}^{\pi}(\tau_{h},o))_{+}]\right\}
𝒯hθ(|τh)𝒯hθ(|τh)1,\displaystyle\leq\left\|\mathcal{T}_{h}^{\theta}(\cdot|\tau_{h})-\mathcal{T}_{h}^{\theta^{*}}(\cdot|\tau_{h})\right\|_{1},

where (a)(a) is due to Proposition A.1.

When i=2i=2, we have

B𝒯2(θ,τh)\displaystyle\mathcal{E}_{B_{\mathcal{T}}^{2}}(\theta,\tau_{h}) =minθB𝒯2(θ)𝔼o𝒯hθ(|τh)[VB𝒯2(θ),R,h+1π(τh,o)]minθB𝒯2(θ)𝔼o𝒯hθ(|τh)[VB𝒯2(θ),R,h+1π(τh,o)]\displaystyle=\min_{\theta^{\prime}\in B_{\mathcal{T}}^{2}(\theta^{*})}\mathop{\mathbb{E}}_{o\sim\mathcal{T}_{h}^{\theta^{\prime}}(\cdot|\tau_{h})}\left[V_{B_{\mathcal{T}}^{2}(\theta),R,h+1}^{\pi}(\tau_{h},o)\right]-\min_{\theta^{\prime}\in B_{\mathcal{T}}^{2}(\theta)}\mathop{\mathbb{E}}_{o\sim\mathcal{T}_{h}^{\theta^{\prime}}(\cdot|\tau_{h})}\left[V_{B_{\mathcal{T}}^{2}(\theta),R,h+1}^{\pi}(\tau_{h},o)\right]
=(a)supλ0{λlog𝔼𝒯hθ[exp(VB𝒯2(θ),R,h+1π(τh,o)/λ)]λξ}\displaystyle\overset{(a)}{=}\sup_{\lambda\geq 0}\left\{-\lambda\log\mathbb{E}_{\mathcal{T}_{h}^{\theta^{*}}}[\exp(-V_{B_{\mathcal{T}}^{2}(\theta),R,h+1}^{\pi}(\tau_{h},o)/\lambda)]-\lambda\xi\right\}
supλ0{λlog𝔼𝒯hθ[exp(VB𝒯2(θ),R,h+1π(τh,o)/λ)]λξ},\displaystyle\qquad-\sup_{\lambda\geq 0}\left\{-\lambda\log\mathbb{E}_{\mathcal{T}_{h}^{\theta}}[\exp(-V_{B_{\mathcal{T}}^{2}(\theta),R,h+1}^{\pi}(\tau_{h},o)/\lambda)]-\lambda\xi\right\},

where (a)(a) is due to Proposition A.2.

Let λ(τh)=argsupλ0{λlog𝔼𝒯hθ[exp(VB𝒯2(θ),R,h+1π(τh,o)/λ)]λξ}\lambda^{*}(\tau_{h})=\arg\sup_{\lambda\geq 0}\left\{-\lambda\log\mathbb{E}_{\mathcal{T}_{h}^{\theta^{*}}}[\exp(-V_{B_{\mathcal{T}}^{2}(\theta),R,h+1}^{\pi}(\tau_{h},o)/\lambda)]-\lambda\xi\right\}.

Then, we further have

B𝒯2(θ,τh)\displaystyle\mathcal{E}_{B_{\mathcal{T}}^{2}}(\theta,\tau_{h}) λ(τh)log𝔼𝒯hθ[exp(VB𝒯2(θ),R,h+1π(τh,o)/λ(τh))]\displaystyle\leq\lambda^{*}(\tau_{h})\log\mathbb{E}_{\mathcal{T}_{h}^{\theta}}[\exp(-V_{B_{\mathcal{T}}^{2}(\theta),R,h+1}^{\pi}(\tau_{h},o)/\lambda^{*}(\tau_{h}))]
λ(τh)log𝔼𝒯hθ[exp(VB𝒯2(θ),R,h+1π(τh,o)/λ(τh))]\displaystyle\qquad-\lambda^{*}(\tau_{h})\log\mathbb{E}_{\mathcal{T}_{h}^{\theta^{*}}}[\exp(-V_{B_{\mathcal{T}}^{2}(\theta),R,h+1}^{\pi}(\tau_{h},o)/\lambda^{*}(\tau_{h}))]
=λ(τh)log𝔼𝒯hθ[exp(VB𝒯2(θ),R,h+1π(τh,o)/λ(τh))]𝔼𝒯hθ[exp(VB𝒯2(θ),R,h+1π(τh,o)/λ(τh))]\displaystyle=\lambda^{*}(\tau_{h})\log\frac{\mathbb{E}_{\mathcal{T}_{h}^{\theta}}[\exp(-V_{B_{\mathcal{T}}^{2}(\theta),R,h+1}^{\pi}(\tau_{h},o)/\lambda^{*}(\tau_{h}))]}{\mathbb{E}_{\mathcal{T}_{h}^{\theta^{*}}}[\exp(-V_{B_{\mathcal{T}}^{2}(\theta),R,h+1}^{\pi}(\tau_{h},o)/\lambda^{*}(\tau_{h}))]}
(a)λ(τh)𝔼𝒯hθ[exp(VB𝒯2(θ),R,h+1π(τh,o)/λ(τh))]𝔼𝒯hθ[exp(VB𝒯2(θ),R,h+1π(τh,o)/λ(τh))]𝔼𝒯hθ[exp(VB𝒯2(θ),R,h+1π(τh,o)/λ(τh))]\displaystyle\overset{(a)}{\leq}\lambda^{*}(\tau_{h})\frac{\mathbb{E}_{\mathcal{T}_{h}^{\theta}}[\exp(-V_{B_{\mathcal{T}}^{2}(\theta),R,h+1}^{\pi}(\tau_{h},o)/\lambda^{*}(\tau_{h}))]-\mathbb{E}_{\mathcal{T}_{h}^{\theta^{*}}}[\exp(-V_{B_{\mathcal{T}}^{2}(\theta),R,h+1}^{\pi}(\tau_{h},o)/\lambda^{*}(\tau_{h}))]}{\mathbb{E}_{\mathcal{T}_{h}^{\theta^{*}}}[\exp(-V_{B_{\mathcal{T}}^{2}(\theta),R,h+1}^{\pi}(\tau_{h},o)/\lambda^{*}(\tau_{h}))]}
λ(τh)exp(1/λ(τh))𝒯hθ(|τh)𝒯hθ(|τh)1\displaystyle\leq\frac{\lambda^{*}(\tau_{h})}{\exp(-1/\lambda^{*}(\tau_{h}))}\left\|\mathcal{T}_{h}^{\theta^{*}}(\cdot|\tau_{h})-\mathcal{T}_{h}^{\theta}(\cdot|\tau_{h})\right\|_{1}
max{exp(ξ)ξ,λexp(1/λ)}𝒯hθ(|τh)𝒯hθ(|τh)1,\displaystyle\leq\max\left\{\frac{\exp(\xi)}{\xi},\lambda^{*}\exp(1/\lambda^{*})\right\}\left\|\mathcal{T}_{h}^{\theta^{*}}(\cdot|\tau_{h})-\mathcal{T}_{h}^{\theta}(\cdot|\tau_{h})\right\|_{1},

where (a)(a) follows from log(x)x1\log(x)\leq x-1 for x>0x>0, and 0<λminh,τhλ(τh)0<\lambda^{*}\leq\min_{h,\tau_{h}}\lambda^{*}(\tau_{h}).

In summary, the difference between robust values can be upper bounded by

VB𝒯i(θ),RπVB𝒯i(θ),RπCBCih=1H𝔼θπ[𝒯hθ(|τh)𝒯hθ(|τh)1]2CBCi𝔻θπ𝔻θπ1,\displaystyle V_{B_{\mathcal{T}}^{i}(\theta^{*}),R}^{\pi}-V_{B_{\mathcal{T}}^{i}(\theta),R}^{\pi}\leq C_{B}C_{i}\sum_{h=1}^{H}\mathbb{E}_{\theta^{*}}^{\pi}\left[\left\|\mathcal{T}_{h}^{\theta^{*}}(\cdot|\tau_{h})-\mathcal{T}_{h}^{\theta}(\cdot|\tau_{h})\right\|_{1}\right]\leq 2C_{B}C_{i}\|\mathbb{D}_{\theta^{*}}^{\pi}-\mathbb{D}_{\theta}^{\pi}\|_{1},

where

Ci={1,i=1,max{exp(ξ)ξ,λexp(1/λ)},i=2.C_{i}=\left\{\begin{aligned} &1,\quad i=1,\\ &\max\left\{\frac{\exp(\xi)}{\xi},\lambda^{*}\exp(1/\lambda^{*})\right\},\quad i=2.\end{aligned}\right.

The opposite sides of the inequalities in the lemma follow from the same argument.
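For the KL case, the per-step inner minimization appearing in the proof above can be computed directly from the dual in Proposition A.2. The sketch below (not from the paper; the nominal distribution, value vector, and radius are hypothetical) returns the worst-case expectation together with the maximizing multiplier, which plays the role of λ\lambda^{*} in the constant C𝒯klC_{\mathcal{T}}^{\mathrm{kl}}.

```python
# A minimal sketch of the per-step inner minimization under the KL (B_T^2-type)
# uncertainty set, computed via the dual of Proposition A.2.  Illustration only:
# the nominal kernel, next-step robust values, and radius below are hypothetical.
import numpy as np
from scipy.optimize import minimize_scalar

def kl_worst_case_mean(p, v, xi):
    """sup over lambda >= 0 of -lambda*log E_p[exp(-v/lambda)] - lambda*xi."""
    neg_dual = lambda lam: -(-lam * np.log(p @ np.exp(-v / lam)) - lam * xi)
    res = minimize_scalar(neg_dual, bounds=(0.01, 100.0), method="bounded")
    return -res.fun, res.x  # worst-case expectation and the maximizing lambda

p = np.array([0.6, 0.3, 0.1])   # nominal T*_h(. | tau_h)
v = np.array([0.9, 0.4, 0.1])   # next-step robust values in [0, 1]
val, lam_star = kl_worst_case_mean(p, v, xi=0.05)

# A small maximizing lambda corresponds to a large constant C_T^kl in Lemma A.8.
print(f"nominal {p @ v:.3f}  KL-robust {val:.3f}  lambda* {lam_star:.3f}")
```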

Appendix B MLE Analysis

In this section, we provide the estimation guarantee for the MLE oracle, stated in Lemma B.5.

We first present a sequence of supporting lemmas. Notably, the first lemma states that after the dataset distillation step, sufficiently many offline samples remain.

Lemma B.1 (Sufficient good samples).

Let pmin<δ|𝒪|2H|𝒜|2Hp_{\min}<\frac{\delta}{|\mathcal{O}|^{2H}|\mathcal{A}|^{2H}}. Then, with probability at least 1δ1-\delta, we have |𝒟g|N/2|\mathcal{D}^{\mathrm{g}}|\geq N/2.

Proof.

Let L=Hlog(|𝒪||𝒜|)L^{*}=H\log(|\mathcal{O}||\mathcal{A}|). Then, we have

𝔼\displaystyle\mathbb{E} [exp(n=1N(log𝔻θρ(τHn)L))]\displaystyle\left[\exp\left(\sum_{n=1}^{N}(-\log\mathbb{D}_{\theta^{*}}^{\rho}(\tau_{H}^{n})-L^{*})\right)\right]
=𝔼[n=1N1𝔻θρ(τH)eL]\displaystyle=\mathbb{E}\left[\prod_{n=1}^{N}\frac{1}{\mathbb{D}_{\theta^{*}}^{\rho}(\tau_{H})e^{L^{*}}}\right]
=n=1N|𝒪|H|𝒜|HeL\displaystyle=\prod_{n=1}^{N}\frac{|\mathcal{O}|^{H}|\mathcal{A}|^{H}}{e^{L^{*}}}
=1.\displaystyle=1.

Thus, by the Chernoff bound, with probability at least 1δ1-\delta,

minθ(θ|𝒟)1Nn=1Nlog𝔻θρ(τHn)L+1Nlog1δ.\min_{\theta}\mathcal{L}(\theta|\mathcal{D})\leq\frac{1}{N}\sum_{n=1}^{N}-\log\mathbb{D}_{\theta^{*}}^{\rho}(\tau_{H}^{n})\leq L^{*}+\frac{1}{N}\log\frac{1}{\delta}.

Recall that θ^=argminθ(θ|𝒟)\hat{\theta}=\arg\min_{\theta}\mathcal{L}(\theta|\mathcal{D}). Let 𝒟b=𝒟/𝒟g={τH𝒟,𝔻θ^ρ(τH)<pmin}\mathcal{D}^{\mathrm{b}}=\mathcal{D}/\mathcal{D}^{\mathrm{g}}=\{\tau_{H}\in\mathcal{D},\mathbb{D}_{\hat{\theta}}^{\rho}(\tau_{H})<p_{\min}\}. Thus, we further have

NL+log1δ\displaystyle NL^{*}+\log\frac{1}{\delta} n=1Nlog𝔻θ^ρ(τHn)\displaystyle\geq\sum_{n=1}^{N}-\log\mathbb{D}_{\hat{\theta}}^{\rho}(\tau_{H}^{n})
\displaystyle\geq\sum_{\tau_{H}\in\mathcal{D}^{\mathrm{b}}}-\log\mathbb{D}^{\rho}_{\hat{\theta}}(\tau_{H})
|𝒟b|logpmin,\displaystyle\geq-|\mathcal{D}^{\mathrm{b}}|\log p_{\min},

which implies

|\mathcal{D}^{\mathrm{b}}|\leq\frac{NL^{*}+\log(1/\delta)}{\log\frac{1}{p_{\min}}}<\frac{N}{2},

where the last inequality follows from $p_{\min}<\delta(|\mathcal{O}||\mathcal{A}|)^{-2H}$, which gives $\log\frac{1}{p_{\min}}>2L^{*}+\log\frac{1}{\delta}$, together with $N>2$. ∎
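The distillation step analyzed in Lemma B.1 simply discards trajectories whose estimated occupancy under the MLE model falls below $p_{\min}$. A minimal sketch is given below; the data structures (a list of trajectories and a callable returning $\mathbb{D}_{\hat{\theta}}^{\rho}(\tau_{H})$) are assumptions made for illustration, not the paper's implementation.

```python
def distill_dataset(trajectories, prob_hat, p_min):
    """Split the offline dataset into D_good (estimated probability >= p_min)
    and D_bad; Lemma B.1 guarantees |D_good| >= N/2 with high probability."""
    good, bad = [], []
    for tau in trajectories:
        (good if prob_hat(tau) >= p_min else bad).append(tau)
    return good, bad

# Illustrative usage with a toy probability table standing in for D_{theta_hat}^rho.
probs = {"tau1": 0.3, "tau2": 1e-9, "tau3": 0.2}
D_good, D_bad = distill_dataset(list(probs), probs.get, p_min=1e-6)
print(D_good, D_bad)   # ['tau1', 'tau3'] ['tau2']
```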

Lemma B.2.

Fix $\varepsilon<\frac{1}{N}$. Let $\bar{\Theta}_{\varepsilon}$, with size $\mathcal{N}_{\varepsilon}(\Theta)$, be a collection of $\varepsilon$-brackets that covers $\Theta$. Then, with probability at least $1-\delta$, the following event holds:

\mathcal{E}_{\log}=\left\{\begin{aligned} &\forall\bar{\theta}\in\bar{\Theta}_{\varepsilon},~{}~{}\sum_{h}\sum_{\tau_{h}\in\mathcal{D}_{h}^{\mathrm{g}}}\log\mathbb{D}_{\bar{\theta}}^{\rho}(\tau_{h})-3\log\frac{\mathcal{N}_{\varepsilon}(\Theta)}{\delta}\leq\sum_{h}\sum_{\tau_{h}\in\mathcal{D}_{h}^{\mathrm{g}}}\log\mathbb{D}_{\theta^{*}}^{\rho}(\tau_{h}),\\ &\forall\bar{\theta}\in\bar{\Theta}_{\varepsilon},~{}~{}\sum_{\tau_{H}\in\mathcal{D}^{\mathrm{g}}}\log\mathbb{D}_{\bar{\theta}}^{\rho}(\tau_{H})-3\log\frac{\mathcal{N}_{\varepsilon}(\Theta)}{\delta}\leq\sum_{\tau_{H}\in\mathcal{D}^{\mathrm{g}}}\log\mathbb{D}_{\theta^{*}}^{\rho}(\tau_{H}).\end{aligned}\right.
Proof.

We start with the first inequality. Suppose the data in 𝒟hg\mathcal{D}_{h}^{\mathrm{g}} is indexed by tt. Then,

𝔼\displaystyle\mathbb{E} [exp(hτh𝒟hglog𝔻θ¯ρ(τh)𝔻θρ(τh))]\displaystyle\left[\exp\left(\sum_{h}\sum_{\tau_{h}\in\mathcal{D}_{h}^{\mathrm{g}}}\log\frac{\mathbb{D}_{\bar{\theta}}^{\rho}(\tau_{h})}{\mathbb{D}_{\theta^{*}}^{\rho}(\tau_{h})}\right)\right]
=𝔼[t|𝒟hg|h𝔻θ¯ρ(τht)𝔻θρ(τht)]\displaystyle=\mathbb{E}\left[\prod_{t\leq|\mathcal{D}_{h}^{\mathrm{g}}|}\prod_{h}\frac{\mathbb{D}_{\bar{\theta}}^{\rho}(\tau_{h}^{t})}{\mathbb{D}_{\theta^{*}}^{\rho}(\tau_{h}^{t})}\right]
=𝔼[t|𝒟hg|1h𝔻θ¯ρ(τht)𝔻θρ(τht)𝔼[𝔻θ¯ρ(τh|𝒟hg|)𝔻θρ(τh|𝒟hg|)]]\displaystyle=\mathbb{E}\left[\prod_{t\leq|\mathcal{D}_{h}^{\mathrm{g}}|-1}\prod_{h}\frac{\mathbb{D}_{\bar{\theta}}^{\rho}(\tau_{h}^{t})}{\mathbb{D}_{\theta^{*}}^{\rho}(\tau_{h}^{t})}\mathbb{E}\left[\frac{\mathbb{D}_{\bar{\theta}}^{\rho}(\tau_{h}^{|\mathcal{D}_{h}^{\mathrm{g}}|})}{\mathbb{D}_{\theta^{*}}^{\rho}(\tau_{h}^{|\mathcal{D}_{h}^{\mathrm{g}}|})}\right]\right]
=𝔼[t|𝒟hg|1h𝔻θ¯ρ(τht)𝔻θρ(τht)hτh𝔻θ¯ρ(τh)]\displaystyle=\mathbb{E}\left[\prod_{t\leq|\mathcal{D}_{h}^{\mathrm{g}}|-1}\prod_{h}\frac{\mathbb{D}_{\bar{\theta}}^{\rho}(\tau_{h}^{t})}{\mathbb{D}_{\theta^{*}}^{\rho}(\tau_{h}^{t})}\prod_{h}\sum_{\tau_{h}}\mathbb{D}_{\bar{\theta}}^{\rho}(\tau_{h})\right]
(a)(1+ε)H𝔼[t|𝒟hg|1h𝔻θ¯ρ(τht)𝔻θρ(τht)]\displaystyle\overset{(a)}{\leq}\left(1+\varepsilon\right)^{H}\mathbb{E}\left[\prod_{t\leq|\mathcal{D}_{h}^{\mathrm{g}}|-1}\prod_{h}\frac{\mathbb{D}_{\bar{\theta}}^{\rho}(\tau_{h}^{t})}{\mathbb{D}_{\theta^{*}}^{\rho}(\tau_{h}^{t})}\right]
(1+ε)|𝒟g|\displaystyle\leq\left(1+\varepsilon\right)^{|\mathcal{D}^{\mathrm{g}}|}
(b)e,\displaystyle\overset{(b)}{\leq}e,

where $(a)$ follows because $\sum_{\tau_{h}}\mathbb{D}_{\bar{\theta}}^{\rho}(\tau_{h})\leq 1+\|\mathbb{D}_{\bar{\theta}}^{\rho}-\mathbb{D}_{\theta}^{\rho}\|_{1}\leq 1+\varepsilon$ for the $\theta$ bracketed by $[\underline{\theta},\bar{\theta}]$, and $(b)$ follows because $\varepsilon\leq\frac{1}{N}\leq\frac{1}{|\mathcal{D}^{\mathrm{g}}|}$ and $(1+1/n)^{n}\leq e$ for all $n\geq 1$.

By the Chernoff bound and the union bound over Θ¯ϵ\bar{\Theta}_{\epsilon}, with probability at least 1δ1-\delta, we have

θ¯Θ¯ϵ,hτh𝒟hglog𝔻θ¯ρ(τh)𝔻θρ(τh)3log|Θ¯ϵ|δ,\displaystyle\forall\bar{\theta}\in\bar{\Theta}_{\epsilon},~{}~{}\sum_{h}\sum_{\tau_{h}\in\mathcal{D}_{h}^{\mathrm{g}}}\log\frac{\mathbb{D}_{\bar{\theta}}^{\rho}(\tau_{h})}{\mathbb{D}_{\theta^{*}}^{\rho}(\tau_{h})}\leq 3\log\frac{\left|\bar{\Theta}_{\epsilon}\right|}{\delta},

which yields the first inequality of the lemma.

To show the second inequality, we follow an argument similar to that for the first inequality. We have

𝔼[exp(τH𝒟glog𝔻θρ(τH)𝔻θρ(τH))]\displaystyle\mathbb{E}\left[\exp\left(\sum_{\tau_{H}\in\mathcal{D}^{\mathrm{g}}}\log\frac{\mathbb{D}_{\theta}^{\rho}(\tau_{H})}{\mathbb{D}_{\theta^{*}}^{\rho}(\tau_{H})}\right)\right] 𝔼[exp(τH𝒟glog𝔻θ¯ρ(τH)𝔻θρ(τH))]\displaystyle\leq\mathbb{E}\left[\exp\left(\sum_{\tau_{H}\in\mathcal{D}^{\mathrm{g}}}\log\frac{\mathbb{D}_{\bar{\theta}}^{\rho}(\tau_{H})}{\mathbb{D}_{\theta^{*}}^{\rho}(\tau_{H})}\right)\right]
(a)(1+ε)Ne,\displaystyle\overset{(a)}{\leq}(1+\varepsilon)^{N}\leq e,

where (a)(a) follows from the tower rule of the expectation and because τH𝔻θ¯ρ(τH)1+ε\sum_{\tau_{H}}\mathbb{D}_{\bar{\theta}}^{\rho}(\tau_{H})\leq 1+\varepsilon.

Thus, with probability at least $1-\delta$, for any $\bar{\theta}\in\bar{\Theta}_{\varepsilon}$, the following inequality holds

\displaystyle\sum_{\tau_{H}\in\mathcal{D}^{\mathrm{g}}}\log\frac{\mathbb{D}_{\bar{\theta}}^{\rho}(\tau_{H})}{\mathbb{D}_{\theta^{*}}^{\rho}(\tau_{H})}\leq 3\log\frac{\mathcal{N}_{\varepsilon}(\Theta)}{\delta},

which completes the proof. ∎

Proposition B.3.

Fix pminp_{\min} and εpminN\varepsilon\leq\frac{p_{\min}}{N}. Let Θmin={θ:h,τh𝒟hg,𝔻θρ(τh)pmin}\Theta_{\min}=\{\theta\mathrel{\mathop{\mathchar 58\relax}}\forall h,\tau_{h}\in\mathcal{D}_{h}^{\mathrm{g}},~{}~{}\mathbb{D}_{\theta}^{\rho}(\tau_{h})\geq p_{\min}\}. Consider the following event

\displaystyle\mathcal{E}_{\omega}=\left\{\forall\theta\in\Theta_{\min},~{}~{}\sum_{h}\sum_{\tau_{h}\in\mathcal{D}_{h}^{\mathrm{g}}}\left\|\mathbb{D}_{\theta}^{\rho}(\cdot|\tau_{h})-\mathbb{D}_{\theta^{*}}^{\rho}(\cdot|\tau_{h})\right\|_{1}^{2}\leq 6\sum_{h}\sum_{\tau_{H}\in\mathcal{D}_{h}^{\mathrm{g}}}\log\frac{\mathbb{D}_{\theta^{*}}^{\rho}(\tau_{H})}{\mathbb{D}_{\theta}^{\rho}(\tau_{H})}+31\log\frac{\mathcal{N}_{\varepsilon}(\Theta)}{\delta}\right\}.

Then, (ω)1δ.\mathbb{P}\left(\mathcal{E}_{\omega}\right)\geq 1-\delta.

Proof.

We start with a general upper bound on the total variation distance between two conditional distributions. Note that for any $\theta,\theta^{\prime}\in\Theta\cup\bar{\Theta}_{\varepsilon}$ and a fixed history $\tau_{h}$, we have

\displaystyle\left\|\mathbb{D}_{\theta}^{\rho}(\omega_{h}|\tau_{h})-\mathbb{D}_{\theta^{\prime}}^{\rho}(\omega_{h}|\tau_{h})\right\|_{1}
\displaystyle\quad=\sum_{\omega_{h}}\left|\frac{\mathbb{D}_{\theta^{\prime}}^{\rho}(\omega_{h},\tau_{h})\mathbb{D}_{\theta}^{\rho}(\tau_{h})-\mathbb{D}_{\theta}^{\rho}(\omega_{h},\tau_{h})\mathbb{D}_{\theta^{\prime}}^{\rho}(\tau_{h})}{\mathbb{D}_{\theta}^{\rho}(\tau_{h})\mathbb{D}_{\theta^{\prime}}^{\rho}(\tau_{h})}\right|
\displaystyle\quad=\sum_{\omega_{h}}\left|\frac{\left(\mathbb{D}_{\theta^{\prime}}^{\rho}(\omega_{h},\tau_{h})-\mathbb{D}_{\theta}^{\rho}(\omega_{h},\tau_{h})\right)\mathbb{D}_{\theta}^{\rho}(\tau_{h})+\mathbb{D}_{\theta}^{\rho}(\omega_{h},\tau_{h})\left(\mathbb{D}_{\theta}^{\rho}(\tau_{h})-\mathbb{D}_{\theta^{\prime}}^{\rho}(\tau_{h})\right)}{\mathbb{D}_{\theta}^{\rho}(\tau_{h})\mathbb{D}_{\theta^{\prime}}^{\rho}(\tau_{h})}\right|
\displaystyle\quad\leq\frac{|\mathbb{D}_{\theta}^{\rho}(\tau_{h})-\mathbb{D}_{\theta^{\prime}}^{\rho}(\tau_{h})|}{\mathbb{D}_{\theta^{\prime}}^{\rho}(\tau_{h})}+\frac{1}{\mathbb{D}_{\theta^{\prime}}^{\rho}(\tau_{h})}\sum_{\omega_{h}}\left|\mathbb{D}_{\theta^{\prime}}^{\rho}(\omega_{h},\tau_{h})-\mathbb{D}_{\theta}^{\rho}(\omega_{h},\tau_{h})\right|
\displaystyle\quad\leq\frac{2}{\mathbb{D}_{\theta^{\prime}}^{\rho}(\tau_{h})}\left\|\mathbb{D}_{\theta}^{\rho}-\mathbb{D}_{\theta^{\prime}}^{\rho}\right\|_{1}.

By symmetry, we also have

𝔻θρ(ωh|τh)𝔻θρ(ωh|τh)12max{𝔻θρ(τh),𝔻θρ(τh)}𝔻θρ𝔻θρ1.\displaystyle\left\|\mathbb{D}_{\theta}^{\rho}(\omega_{h}|\tau_{h})-\mathbb{D}_{\theta^{\prime}}^{\rho}(\omega_{h}|\tau_{h})\right\|_{1}\leq\frac{2}{\max\left\{\mathbb{D}_{\theta}^{\rho}(\tau_{h}),\mathbb{D}_{\theta^{\prime}}^{\rho}(\tau_{h})\right\}}\left\|\mathbb{D}_{\theta}^{\rho}-\mathbb{D}_{\theta^{\prime}}^{\rho}\right\|_{1}.

We now replace $\theta^{\prime}$ with $\bar{\theta}$, where $[\underline{\theta},\bar{\theta}]\in\bar{\Theta}_{\varepsilon}$ is an $\varepsilon$-bracket containing $\theta$ (recall Definition 4.2), i.e., $\|\mathbb{D}_{\theta}^{\rho}-\mathbb{D}_{\bar{\theta}}^{\rho}\|_{1}\leq\varepsilon$ and $\mathbb{D}_{\bar{\theta}}^{\rho}(\tau_{h})\geq\mathbb{D}_{\theta}^{\rho}(\tau_{h})$ for all $\tau_{h}$. Then, due to the construction of $\Theta_{\min}$, we have

\forall\tau_{h}\in\mathcal{D}_{h}^{\mathrm{g}},~{}~{}\left\|\mathbb{D}_{\theta}^{\rho}(\omega_{h}|\tau_{h})-\mathbb{D}_{\bar{\theta}}^{\rho}(\omega_{h}|\tau_{h})\right\|_{1}\leq\frac{2\varepsilon}{p_{\min}}\leq\frac{2}{N},

which implies

\displaystyle\sum_{h}\sum_{\tau_{h}\in\mathcal{D}_{h}^{\mathrm{g}}}\left\|\mathbb{D}_{\theta}^{\rho}(\omega_{h}|\tau_{h})-\mathbb{D}_{\theta^{*}}^{\rho}(\omega_{h}|\tau_{h})\right\|_{1}^{2}
\displaystyle\overset{(a)}{\leq}\sum_{h}\sum_{\tau_{h}\in\mathcal{D}_{h}^{\mathrm{g}}}2\left\|\mathbb{D}_{\theta}^{\rho}(\omega_{h}|\tau_{h})-\mathbb{D}_{\bar{\theta}}^{\rho}(\omega_{h}|\tau_{h})\right\|_{1}^{2}+2\left\|\mathbb{D}_{\bar{\theta}}^{\rho}(\omega_{h}|\tau_{h})-\mathbb{D}_{\theta^{*}}^{\rho}(\omega_{h}|\tau_{h})\right\|_{1}^{2}
\displaystyle\leq\frac{8}{N}+2\sum_{h}\sum_{\tau_{h}\in\mathcal{D}_{h}^{\mathrm{g}}}\left\|\mathbb{D}_{\bar{\theta}}^{\rho}(\omega_{h}|\tau_{h})-\mathbb{D}_{\theta^{*}}^{\rho}(\omega_{h}|\tau_{h})\right\|_{1}^{2}.

Here, $(a)$ follows because the total variation distance satisfies the triangle inequality and $(a+b)^{2}\leq 2a^{2}+2b^{2}$; the last inequality then uses the preceding display together with $|\mathcal{D}^{\mathrm{g}}|\leq N$.

Moreover, note that

𝔻θ¯ρ(ωh|τh)𝔻θρ(ωh|τh)12\displaystyle\left\|\mathbb{D}_{\bar{\theta}}^{\rho}(\omega_{h}|\tau_{h})-\mathbb{D}_{\theta^{*}}^{\rho}(\omega_{h}|\tau_{h})\right\|_{1}^{2}
(a)4(2+2/(N))𝙳𝙷2(𝔻θ¯ρ(ωh|τh),𝔻θρ(ωh|τh))\displaystyle\quad\overset{(a)}{\leq}4(2+2/(N))\mathtt{D}^{2}_{\mathtt{H}}\left(\mathbb{D}_{\bar{\theta}}^{\rho}(\omega_{h}|\tau_{h}),\mathbb{D}_{\theta^{*}}^{\rho}(\omega_{h}|\tau_{h})\right)
\displaystyle\quad\leq 6\left(1+\frac{1}{N}-\mathop{\mathbb{E}}_{\omega_{h}\sim\mathbb{D}_{\theta^{*}}^{\rho}(\cdot|\tau_{h})}\sqrt{\frac{\mathbb{D}_{\bar{\theta}}^{\rho}(\omega_{h}|\tau_{h})}{\mathbb{D}_{\theta^{*}}^{\rho}(\omega_{h}|\tau_{h})}}\right)
(b)6log𝔼ωh𝔻θρ(|τh)𝔻θ¯ρ(ωh|τh)𝔻θρ(ωh|τh)+6N,\displaystyle\quad\overset{(b)}{\leq}-6\log\mathop{\mathbb{E}}_{\omega_{h}\sim\mathbb{D}_{\theta^{*}}^{\rho}(\cdot|\tau_{h})}\sqrt{\frac{\mathbb{D}_{\bar{\theta}}^{\rho}(\omega_{h}|\tau_{h})}{\mathbb{D}_{\theta^{*}}^{\rho}(\omega_{h}|\tau_{h})}}+\frac{6}{N},

where (a)(a) is due to Lemma E.2 and (b)(b) follows because 1xlogx1-x\leq-\log x for any x>0x>0.

Thus, the summation of the squared total variation distances between the conditional distributions, over $\tau_{h}\in\mathcal{D}_{h}^{\mathrm{g}}$, can be upper bounded by

hτh𝒟hg\displaystyle\sum_{h}\sum_{\tau_{h}\in\mathcal{D}_{h}^{\mathrm{g}}} 𝔻θ¯ρ(ωh|τh)𝔻θρ(ωh|τh)12\displaystyle\left\|\mathbb{D}_{\bar{\theta}}^{\rho}(\omega_{h}|\tau_{h})-\mathbb{D}_{\theta^{*}}^{\rho}(\omega_{h}|\tau_{h})\right\|_{1}^{2}
18N12hτh𝒟hglog𝔼ωh𝔻θρ(|τh)𝔻θ¯ρ(ωh|τh)𝔻θρ(ωh|τh).\displaystyle\leq\frac{18}{N}-12\sum_{h}\sum_{\tau_{h}\in\mathcal{D}_{h}^{\mathrm{g}}}\log\mathop{\mathbb{E}}_{\omega_{h}\sim\mathbb{D}_{\theta^{*}}^{\rho}(\cdot|\tau_{h})}\sqrt{\frac{\mathbb{D}_{\bar{\theta}}^{\rho}(\omega_{h}|\tau_{h})}{\mathbb{D}_{\theta^{*}}^{\rho}(\omega_{h}|\tau_{h})}}.

In addition, by only taking expectation over ωh\omega_{h}, we have

𝔼h,τh𝒟hg,ωh𝔻θρ(|τh)[exp(12h(ωh,τh)𝒟hglog𝔻θ¯ρ(ωh|τh)𝔻θρ(ωh|τh)hτh𝒟hglog𝔼ωh𝔻θρ(|τh)𝔻θ¯ρ(ωh|τh)𝔻θρ(ωh|τh))]\displaystyle\mathop{\mathbb{E}}_{\begin{subarray}{c}\forall h,\tau_{h}\in\mathcal{D}_{h}^{\mathrm{g}},\\ \omega_{h}\sim\mathbb{D}_{\theta^{*}}^{\rho}(\cdot|\tau_{h})\end{subarray}}\left[\exp\left(\frac{1}{2}\sum_{h}\sum_{(\omega_{h},\tau_{h})\in\mathcal{D}_{h}^{\mathrm{g}}}\log\frac{\mathbb{D}_{\bar{\theta}}^{\rho}(\omega_{h}|\tau_{h})}{\mathbb{D}_{\theta^{*}}^{\rho}(\omega_{h}|\tau_{h})}-\sum_{h}\sum_{\tau_{h}\in\mathcal{D}_{h}^{\mathrm{g}}}\log\mathop{\mathbb{E}}_{\omega_{h}\sim\mathbb{D}_{\theta^{*}}^{\rho}(\cdot|\tau_{h})}\sqrt{\frac{\mathbb{D}_{\bar{\theta}}^{\rho}(\omega_{h}|\tau_{h})}{\mathbb{D}_{\theta^{*}}^{\rho}(\omega_{h}|\tau_{h})}}\right)\right]
=𝔼h,τh𝒟hg,ωh𝔻θρ(|τh)[h(ωh,τh)𝒟hg𝔻θ¯ρ(ωh|τh)𝔻θρ(ωh|τh)]hτh𝒟hg𝔼ωh𝔻θρ(|τh)[𝔻θ¯ρ(ωh|τh)𝔻θρ(ωh|τh)]=1,\displaystyle\quad\quad=\frac{\mathop{\mathbb{E}}_{\begin{subarray}{c}\forall h,\tau_{h}\in\mathcal{D}_{h}^{\mathrm{g}},\\ \omega_{h}\sim\mathbb{D}_{\theta^{*}}^{\rho}(\cdot|\tau_{h})\end{subarray}}\left[\prod_{h}\prod_{(\omega_{h},\tau_{h})\in\mathcal{D}_{h}^{\mathrm{g}}}\sqrt{\frac{\mathbb{D}_{\bar{\theta}}^{\rho}(\omega_{h}|\tau_{h})}{\mathbb{D}_{\theta^{*}}^{\rho}(\omega_{h}|\tau_{h})}}\right]}{\prod_{h}\prod_{\tau_{h}\in\mathcal{D}_{h}^{\mathrm{g}}}\mathop{\mathbb{E}}_{\omega_{h}\sim\mathbb{D}_{\theta^{*}}^{\rho}(\cdot|\tau_{h})}\left[\sqrt{\frac{\mathbb{D}_{\bar{\theta}}^{\rho}(\omega_{h}|\tau_{h})}{\mathbb{D}_{\theta^{*}}^{\rho}(\omega_{h}|\tau_{h})}}\right]}=1,

where the last equality is due to the conditional independence of ωh𝒟hg\omega_{h}\in\mathcal{D}_{h}^{\mathrm{g}} given τh𝒟hg\tau_{h}\in\mathcal{D}_{h}^{\mathrm{g}}.

Therefore, by the Chernoff bound, with probability 1δ1-\delta, we have

\displaystyle-\sum_{h}\sum_{\tau_{h}\in\mathcal{D}_{h}^{\mathrm{g}}}\log\mathop{\mathbb{E}}_{\omega_{h}\sim\mathbb{D}_{\theta^{*}}^{\rho}(\cdot|\tau_{h})}\sqrt{\frac{\mathbb{D}_{\bar{\theta}}^{\rho}(\omega_{h}|\tau_{h})}{\mathbb{D}_{\theta^{*}}^{\rho}(\omega_{h}|\tau_{h})}}\leq\frac{1}{2}\sum_{h}\sum_{(\omega_{h},\tau_{h})\in\mathcal{D}_{h}^{\mathrm{g}}}\log\frac{\mathbb{D}_{\theta^{*}}^{\rho}(\omega_{h}|\tau_{h})}{\mathbb{D}_{\bar{\theta}}^{\rho}(\omega_{h}|\tau_{h})}+\log\frac{1}{\delta}.

Taking the union bound over Θ¯ϵ\bar{\Theta}_{\epsilon}, and rescaling δ\delta, we have, with probability at least 1δ1-\delta, the following inequality holds:

hτh𝒟hg\displaystyle\sum_{h}\sum_{\tau_{h}\in\mathcal{D}_{h}^{\mathrm{g}}} 𝔻θρ(ωh|τh)𝔻θρ(ωh|τh)12\displaystyle\left\|\mathbb{D}_{\theta}^{\rho}(\omega_{h}|\tau_{h})-\mathbb{D}_{\theta^{*}}^{\rho}(\omega_{h}|\tau_{h})\right\|_{1}^{2}
18N+6h(ωh,τh)𝒟hglog𝔻θρ(ωh|τh)𝔻θ¯ρ(ωh|τh)+12log𝒩ε(Θ)δ\displaystyle\leq\frac{18}{N}+6\sum_{h}\sum_{(\omega_{h},\tau_{h})\in\mathcal{D}_{h}^{\mathrm{g}}}\log\frac{\mathbb{D}_{\theta^{*}}^{\rho}(\omega_{h}|\tau_{h})}{\mathbb{D}_{\bar{\theta}}^{\rho}(\omega_{h}|\tau_{h})}+12\log\frac{\mathcal{N}_{\varepsilon}(\Theta)}{\delta}
6h(ωh,τh)𝒟hglog𝔻θρ(ωh,τh)𝔻θ¯ρ(ωh,τh)+6hτh𝒟hglog𝔻θ¯ρ(τh)𝔻θρ(τh)+13log𝒩ε(Θ)δ.\displaystyle\leq 6\sum_{h}\sum_{(\omega_{h},\tau_{h})\in\mathcal{D}_{h}^{\mathrm{g}}}\log\frac{\mathbb{D}_{\theta^{*}}^{\rho}(\omega_{h},\tau_{h})}{\mathbb{D}_{\bar{\theta}}^{\rho}(\omega_{h},\tau_{h})}+6\sum_{h}\sum_{\tau_{h}\in\mathcal{D}_{h}^{\mathrm{g}}}\log\frac{\mathbb{D}_{\bar{\theta}}^{\rho}(\tau_{h})}{\mathbb{D}_{\theta^{*}}^{\rho}(\tau_{h})}+13\log\frac{\mathcal{N}_{\varepsilon}(\Theta)}{\delta}.

Note that, by Lemma B.2, with probability at least $1-\delta$, we have

hτh𝒟hglog𝔻θ¯ρ(τh)𝔻θρ(τh)3log𝒩ε(Θ)δ.\displaystyle\sum_{h}\sum_{\tau_{h}\in\mathcal{D}_{h}^{\mathrm{g}}}\log\frac{\mathbb{D}_{\bar{\theta}}^{\rho}(\tau_{h})}{\mathbb{D}_{\theta^{*}}^{\rho}(\tau_{h})}\leq 3\log\frac{\mathcal{N}_{\varepsilon}(\Theta)}{\delta}.

Hence, combining with the upper-bracket property of $\bar{\theta}$ (i.e., $\mathbb{D}_{\bar{\theta}}^{\rho}\geq\mathbb{D}_{\theta}^{\rho}$) and rescaling $\delta$, we have that the following inequality holds with probability at least $1-\delta$:

\displaystyle\forall\theta\in\Theta_{\min},~{}~{}\sum_{h}\sum_{\tau_{h}\in\mathcal{D}_{h}^{\mathrm{g}}}\left\|\mathbb{D}_{\theta}^{\rho}(\omega_{h}|\tau_{h})-\mathbb{D}_{\theta^{*}}^{\rho}(\omega_{h}|\tau_{h})\right\|_{1}^{2}\leq 6\sum_{h}\sum_{\tau_{H}\in\mathcal{D}_{h}^{\mathrm{g}}}\log\frac{\mathbb{D}_{\theta^{*}}^{\rho}(\tau_{H})}{\mathbb{D}_{\theta}^{\rho}(\tau_{H})}+31\log\frac{\mathcal{N}_{\varepsilon}(\Theta)}{\delta},

which yields the final result. ∎

Proposition B.4.

Fix ε<1N2\varepsilon<\frac{1}{N^{2}}. Define the following event:

\displaystyle\mathcal{E}_{\mathrm{H}}=\left\{\forall\theta\in\Theta,~{}~{}|\mathcal{D}^{\mathrm{g}}|\mathtt{D}_{\mathtt{H}}^{2}(\mathbb{D}_{\theta}^{\rho},\mathbb{D}_{\theta^{*}}^{\rho})\leq\frac{1}{2}\sum_{\tau_{H}\in\mathcal{D}^{\mathrm{g}}}\log\frac{\mathbb{D}_{\theta^{*}}^{\rho}(\tau_{H})}{\mathbb{D}_{\theta}^{\rho}(\tau_{H})}+2\log\frac{\mathcal{N}_{\varepsilon}(\Theta)}{\delta}\right\}.

We have $\mathbb{P}(\mathcal{E}_{\mathrm{H}})\geq 1-\delta$.

Proof.

First, by the construction of Θ¯ε\bar{\Theta}_{\varepsilon}, for any θ\theta, let θ¯\bar{\theta} satisfy τH|𝔻θρ(τH)𝔻θ¯ρ(τH)|ε\sum_{\tau_{H}}\left|\mathbb{D}_{\theta}^{\rho}(\tau_{H})-\mathbb{D}_{\bar{\theta}}^{\rho}(\tau_{H})\right|\leq\varepsilon. We translate the distance between θ\theta and θ\theta^{*} to the distance between θ¯\bar{\theta} and θ\theta^{*} as follows.

𝙳𝙷2\displaystyle\mathtt{D}_{\mathtt{H}}^{2} (𝔻θρ,𝔻θρ)\displaystyle(\mathbb{D}_{\theta}^{\rho},\mathbb{D}_{\theta^{*}}^{\rho})
=1τH𝔻θρ(τH)𝔻θρ(τH)\displaystyle=1-\sum_{\tau_{H}}\sqrt{\mathbb{D}_{\theta}^{\rho}(\tau_{H})\mathbb{D}_{\theta^{*}}^{\rho}(\tau_{H})}
=1τH𝔻θ¯ρ(τH)𝔻θρ(τH)+(𝔻θρ(τH)𝔻θ¯ρ(τH))𝔻θρ(τH)\displaystyle=1-\sum_{\tau_{H}}\sqrt{\mathbb{D}_{\bar{\theta}}^{\rho}(\tau_{H})\mathbb{D}_{\theta^{*}}^{\rho}(\tau_{H})+\left(\mathbb{D}_{\theta}^{\rho}(\tau_{H})-\mathbb{D}_{\bar{\theta}}^{\rho}(\tau_{H})\right)\mathbb{D}_{\theta^{*}}^{\rho}(\tau_{H})}
(a)1τH𝔻θ¯ρ(τH)𝔻θρ(τH)+τH|𝔻θρ(τH)𝔻θ¯ρ(τH)|𝔻θρ(τH)\displaystyle\overset{(a)}{\leq}1-\sum_{\tau_{H}}\sqrt{\mathbb{D}_{\bar{\theta}}^{\rho}(\tau_{H})\mathbb{D}_{\theta^{*}}^{\rho}(\tau_{H})}+\sum_{\tau_{H}}\sqrt{\left|\mathbb{D}_{\theta}^{\rho}(\tau_{H})-\mathbb{D}_{\bar{\theta}}^{\rho}(\tau_{H})\right|\mathbb{D}_{\theta^{*}}^{\rho}(\tau_{H})}
(b)log𝔼τH𝔻θρ()𝔻θ¯ρ(τH)𝔻θρ(τH)+τH|𝔻θρ(τH)𝔻θ¯ρ(τH)|\displaystyle\overset{(b)}{\leq}-\log\mathop{\mathbb{E}}_{\tau_{H}\sim\mathbb{D}_{\theta^{*}}^{\rho}(\cdot)}\sqrt{\frac{\mathbb{D}_{\bar{\theta}}^{\rho}(\tau_{H})}{\mathbb{D}_{\theta^{*}}^{\rho}(\tau_{H})}}+\sqrt{\sum_{\tau_{H}}\left|\mathbb{D}_{\theta}^{\rho}(\tau_{H})-\mathbb{D}_{\bar{\theta}}^{\rho}(\tau_{H})\right|}
log𝔼τH𝔻θρ()𝔻θ¯ρ(τH)𝔻θρ(τH)+ε,\displaystyle\leq-\log\mathop{\mathbb{E}}_{\tau_{H}\sim\mathbb{D}_{\theta^{*}}^{\rho}(\cdot)}\sqrt{\frac{\mathbb{D}_{\bar{\theta}}^{\rho}(\tau_{H})}{\mathbb{D}_{\theta^{*}}^{\rho}(\tau_{H})}}+\sqrt{\varepsilon},

where $(a)$ follows because $\sqrt{a+b}\geq\sqrt{a}-\sqrt{|b|}$ if $a>0$ and $a+b>0$, and $(b)$ follows from the Cauchy–Schwarz inequality and the fact that $1-x\leq-\log x$.

Hence, in order to upper bound |𝒟g|𝙳𝙷2(𝔻θρ,𝔻θρ)|\mathcal{D}^{\mathrm{g}}|\mathtt{D}_{\mathtt{H}}^{2}(\mathbb{D}_{\theta}^{\rho},\mathbb{D}_{\theta^{*}}^{\rho}), it suffices to upper bound log𝔼τH𝔻θρ()𝔻θ¯ρ(τH)𝔻θρ(τH).-\log\mathop{\mathbb{E}}_{\tau_{H}\sim\mathbb{D}_{\theta^{*}}^{\rho}(\cdot)}\sqrt{\frac{\mathbb{D}_{\bar{\theta}}^{\rho}(\tau_{H})}{\mathbb{D}_{\theta^{*}}^{\rho}(\tau_{H})}}. To this end, we observe that,

\displaystyle\mathbb{E}\left[\exp\left(\frac{1}{2}\sum_{\tau_{H}\in\mathcal{D}^{\mathrm{g}}}\log\frac{\mathbb{D}_{\bar{\theta}}^{\rho}(\tau_{H})}{\mathbb{D}_{\theta^{*}}^{\rho}(\tau_{H})}-|\mathcal{D}^{\mathrm{g}}|\log\mathop{\mathbb{E}}_{\tau_{H}\sim\mathbb{D}_{\theta^{*}}^{\rho}(\cdot)}\sqrt{\frac{\mathbb{D}_{\bar{\theta}}^{\rho}(\tau_{H})}{\mathbb{D}_{\theta^{*}}^{\rho}(\tau_{H})}}\right)\right]
\displaystyle\quad\overset{(a)}{=}\frac{\mathbb{E}\left[\prod_{\tau_{H}\in\mathcal{D}^{\mathrm{g}}}\sqrt{\frac{\mathbb{D}_{\bar{\theta}}^{\rho}(\tau_{H})}{\mathbb{D}_{\theta^{*}}^{\rho}(\tau_{H})}}\right]}{\left(\mathop{\mathbb{E}}_{\tau_{H}\sim\mathbb{D}_{\theta^{*}}^{\rho}(\cdot)}\left[\sqrt{\frac{\mathbb{D}_{\bar{\theta}}^{\rho}(\tau_{H})}{\mathbb{D}_{\theta^{*}}^{\rho}(\tau_{H})}}\right]\right)^{|\mathcal{D}^{\mathrm{g}}|}}=1,

where $(a)$ holds because the trajectories in $\mathcal{D}^{\mathrm{g}}$ are independent draws from $\mathbb{D}_{\theta^{*}}^{\rho}$. Then, by the Chernoff bound, we have

\displaystyle\mathbb{P} (12τH𝒟glog𝔻θ¯ρ(τH)𝔻θρ(τH)|𝒟g|log𝔼τH𝔻θρ()𝔻θρ(τH)𝔻θρ(τH)log1δ)\displaystyle\left(\frac{1}{2}\sum_{\tau_{H}\in\mathcal{D}^{\mathrm{g}}}\log\frac{\mathbb{D}_{\bar{\theta}}^{\rho}(\tau_{H})}{\mathbb{D}_{\theta^{*}}^{\rho}(\tau_{H})}-|\mathcal{D}^{\mathrm{g}}|\log\mathop{\mathbb{E}}_{\tau_{H}\sim\mathbb{D}_{\theta^{*}}^{\rho}(\cdot)}\sqrt{\frac{\mathbb{D}_{\theta}^{\rho}(\tau_{H})}{\mathbb{D}_{\theta^{*}}^{\rho}(\tau_{H})}}\geq\log\frac{1}{\delta}\right)
δ.\displaystyle\leq\delta.

Finally, rescaling $\delta$ to $\delta/\mathcal{N}_{\varepsilon}(\Theta)$ and taking the union bound over $\bar{\Theta}_{\varepsilon}$, we conclude that, with probability at least $1-\delta$,

θΘ,\displaystyle\forall\theta\in\Theta,~{}~{} |𝒟g|𝙳𝙷2(𝔻θρ(τH),𝔻θρ(τH))\displaystyle|\mathcal{D}^{\mathrm{g}}|\mathtt{D}_{\mathtt{H}}^{2}(\mathbb{D}_{\theta}^{\rho}(\tau_{H}),\mathbb{D}_{\theta^{*}}^{\rho}(\tau_{H}))
Nε+12τH𝒟glog𝔻θρ(τH)𝔻θ¯ρ(τH)+log𝒩ε(Θ)δ\displaystyle\leq N\sqrt{\varepsilon}+\frac{1}{2}\sum_{\tau_{H}\in\mathcal{D}^{\mathrm{g}}}\log\frac{\mathbb{D}_{\theta^{*}}^{\rho}(\tau_{H})}{\mathbb{D}_{\bar{\theta}}^{\rho}(\tau_{H})}+\log\frac{\mathcal{N}_{\varepsilon}(\Theta)}{\delta}
(a)12τH𝒟glog𝔻θρ(τH)𝔻θρ(τH)+2log𝒩ε(Θ)δ,\displaystyle\overset{(a)}{\leq}\frac{1}{2}\sum_{\tau_{H}\in\mathcal{D}^{\mathrm{g}}}\log\frac{\mathbb{D}_{\theta^{*}}^{\rho}(\tau_{H})}{\mathbb{D}_{\theta}^{\rho}(\tau_{H})}+2\log\frac{\mathcal{N}_{\varepsilon}(\Theta)}{\delta},

where $(a)$ follows because $\varepsilon\leq\frac{1}{N^{2}}$ and $\mathbb{D}_{\bar{\theta}}^{\rho}(\tau_{H})\geq\mathbb{D}_{\theta}^{\rho}(\tau_{H})$ for all $\tau_{H}$. ∎

Now, we are ready to provide the MLE guarantee under the behavior policy $\rho$.

Lemma B.5 (MLE guarantee).

Let pmin<δ/(|𝒪||𝒜|)2Hp_{\min}<\delta/(|\mathcal{O}||\mathcal{A}|)^{2H}. With probability at least 1δ1-\delta, for any parameter θ\theta satisfying (θ|𝒟)minθ(θ|𝒟)+β/N\mathcal{L}(\theta|\mathcal{D})\leq\min_{\theta^{\prime}}\mathcal{L}(\theta^{\prime}|\mathcal{D})+\beta/N, we have the following inequalities.

hτh𝒟hg𝔻θρ(|τh)𝔻θρ(|τh)1213β,\displaystyle\sum_{h}\sum_{\tau_{h}\in\mathcal{D}_{h}^{\mathrm{g}}}\left\|\mathbb{D}_{\theta}^{\rho}(\cdot|\tau_{h})-\mathbb{D}_{\theta^{*}}^{\rho}(\cdot|\tau_{h})\right\|_{1}^{2}\leq 13\beta,
𝙳𝙷2(𝔻θρ,𝔻θρ)4βN,\displaystyle\mathtt{D}_{\mathtt{H}}^{2}\left(\mathbb{D}_{\theta}^{\rho},\mathbb{D}_{\theta^{*}}^{\rho}\right)\leq\frac{4\beta}{N},

where $\beta=31\log\frac{3\mathcal{N}_{\varepsilon}(\Theta)}{\delta}$.

Proof.

Consider the three events $\mathcal{E}_{\log}$, $\mathcal{E}_{\omega}$, and $\mathcal{E}_{\mathrm{H}}$ defined in Lemma B.2, Proposition B.3, and Proposition B.4, respectively. Let $\mathcal{E}_{o}=\mathcal{E}_{\log}\cap\mathcal{E}_{\omega}\cap\mathcal{E}_{\mathrm{H}}$. Then, by a union bound (after rescaling $\delta$ to $\delta/3$ in each event, which is absorbed into the constant $3$ inside $\beta$), we have $\mathbb{P}(\mathcal{E}_{o})\geq 1-\delta$. The rest of the proof is carried out on the event $\mathcal{E}_{o}$.

First, due to the condition that (θ|𝒟)minθ(θ|𝒟)+β/N\mathcal{L}(\theta|\mathcal{D})\leq\min_{\theta^{\prime}}\mathcal{L}(\theta^{\prime}|\mathcal{D})+\beta/N, we have

hτH𝒟hg\displaystyle\sum_{h}\sum_{\tau_{H}\in\mathcal{D}_{h}^{\mathrm{g}}} log𝔻θρ(τH)𝔻θρ(τH)\displaystyle\log\frac{\mathbb{D}_{\theta^{*}}^{\rho}(\tau_{H})}{\mathbb{D}_{\theta}^{\rho}(\tau_{H})}
=τH𝒟log𝔻θρ(τH)τH𝒟log𝔻θρ(τH)+τH𝒟glog𝔻θρ(τH)τH𝒟glog𝔻θρ(τH)\displaystyle=\sum_{\tau_{H}\in\mathcal{D}}\log\mathbb{D}_{\theta^{*}}^{\rho}(\tau_{H})-\sum_{\tau_{H}\in\mathcal{D}}\log\mathbb{D}_{\theta}^{\rho}(\tau_{H})+\sum_{\tau_{H}\notin\mathcal{D}^{\mathrm{g}}}\log\mathbb{D}_{\theta}^{\rho}(\tau_{H})-\sum_{\tau_{H}\notin\mathcal{D}^{\mathrm{g}}}\log\mathbb{D}_{\theta^{*}}^{\rho}(\tau_{H})
τH𝒟glog𝔻θρ(τH)τH𝒟glog𝔻θρ(τH)+β\displaystyle\leq\sum_{\tau_{H}\notin\mathcal{D}^{\mathrm{g}}}\log\mathbb{D}_{\theta}^{\rho}(\tau_{H})-\sum_{\tau_{H}\notin\mathcal{D}^{\mathrm{g}}}\log\mathbb{D}_{\theta^{*}}^{\rho}(\tau_{H})+\beta
(a)τH𝒟glog𝔻θ¯ρ(τH)τH𝒟glog𝔻θρ(τH)+β\displaystyle\overset{(a)}{\leq}\sum_{\tau_{H}\notin\mathcal{D}^{\mathrm{g}}}\log\mathbb{D}_{\bar{\theta}}^{\rho}(\tau_{H})-\sum_{\tau_{H}\notin\mathcal{D}^{\mathrm{g}}}\log\mathbb{D}_{\theta^{*}}^{\rho}(\tau_{H})+\beta
(b)2β,\displaystyle\overset{(b)}{\leq}2\beta,

where $(a)$ is due to the upper-bracket property $\mathbb{D}_{\bar{\theta}}^{\rho}\geq\mathbb{D}_{\theta}^{\rho}$ of $\bar{\theta}$, and $(b)$ follows from Lemma B.2.

Then, due to Proposition B.3, we have

hτh𝒟hg𝔻θρ(|τh)𝔻θρ(|τh)12\displaystyle\sum_{h}\sum_{\tau_{h}\in\mathcal{D}_{h}^{\mathrm{g}}}\left\|\mathbb{D}_{\theta}^{\rho}(\cdot|\tau_{h})-\mathbb{D}_{\theta^{*}}^{\rho}(\cdot|\tau_{h})\right\|_{1}^{2} 6hτH𝒟hglog𝔻θρ(τH)𝔻θρ(τH)+31log𝒩ε(Θ)δ\displaystyle\leq 6\sum_{h}\sum_{\tau_{H}\in\mathcal{D}_{h}^{\mathrm{g}}}\log\frac{\mathbb{D}_{\theta^{*}}^{\rho}(\tau_{H})}{\mathbb{D}_{\theta}^{\rho}(\tau_{H})}+31\log\frac{\mathcal{N}_{\varepsilon}(\Theta)}{\delta}
12β+β\displaystyle\leq 12\beta+\beta
13β.\displaystyle\leq 13\beta.

Similarly, due to Proposition B.4

𝙳𝙷2(𝔻θρ,𝔻θρ)\displaystyle\mathtt{D}_{\mathtt{H}}^{2}\left(\mathbb{D}_{\theta}^{\rho},\mathbb{D}_{\theta^{*}}^{\rho}\right) 1|𝒟g|(hτH𝒟hg12log𝔻θρ(τH)𝔻θρ(τH)+2log𝒩ε(Θ)δ)\displaystyle\leq\frac{1}{|\mathcal{D}^{\mathrm{g}}|}\left(\sum_{h}\sum_{\tau_{H}\in\mathcal{D}_{h}^{\mathrm{g}}}\frac{1}{2}\log\frac{\mathbb{D}_{\theta^{*}}^{\rho}(\tau_{H})}{\mathbb{D}_{\theta}^{\rho}(\tau_{H})}+2\log\frac{\mathcal{N}_{\varepsilon}(\Theta)}{\delta}\right)
1|𝒟g|(β+β)\displaystyle\leq\frac{1}{|\mathcal{D}^{\mathrm{g}}|}\left(\beta+\beta\right)
4β/N,\displaystyle\leq 4\beta/N,

where the last inequality follows from Lemma B.1, which gives $|\mathcal{D}^{\mathrm{g}}|\geq N/2$. ∎
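Lemma B.5 applies to every parameter whose empirical negative log-likelihood is within $\beta/N$ of the minimum. A minimal sketch of this acceptance rule is given below, assuming each candidate model exposes trajectory probabilities; the candidate models and toy data are illustrative assumptions, not part of the paper.

```python
import numpy as np

def neg_log_likelihood(model_prob, dataset):
    """Empirical loss  L(theta | D) = -(1/N) * sum_n log D_theta^rho(tau^n)."""
    return -np.mean([np.log(model_prob(tau)) for tau in dataset])

def near_mle_models(candidates, dataset, beta):
    """Keep the candidates with  L(theta | D) <= min_theta' L(theta' | D) + beta / N,
    i.e. exactly the parameters covered by Lemma B.5."""
    losses = {name: neg_log_likelihood(p, dataset) for name, p in candidates.items()}
    threshold = min(losses.values()) + beta / len(dataset)
    return [name for name, loss in losses.items() if loss <= threshold]

# Toy illustration: two candidate models over a binary "trajectory".
data = [0, 0, 1, 0]
candidates = {"theta_a": lambda t: 0.75 if t == 0 else 0.25,
              "theta_b": lambda t: 0.5}
print(near_mle_models(candidates, data, beta=0.5))   # ['theta_a']
```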

Appendix C Proof of Theorem 4.7

In this section, we provide the full proof of Theorem 4.7. Recall that $\mathcal{E}_{o}=\mathcal{E}_{\log}\cap\mathcal{E}_{\omega}\cap\mathcal{E}_{\mathrm{H}}$, where $\mathcal{E}_{\log}$, $\mathcal{E}_{\omega}$, and $\mathcal{E}_{\mathrm{H}}$ are defined in Lemma B.2, Proposition B.3, and Proposition B.4, respectively.

Recall that we make the following assumption.

Assumption C.1.

The nominal model θ\theta^{*} has rank rr.

Under this assumption, we have the following representation lemma.

Lemma C.2 (Theorem C.1 in Liu et al., (2022)).

Any low-rank decision-making problem θ\theta with core tests {𝒬h}h=1H\{\mathcal{Q}_{h}\}_{h=1}^{H} admits a (self-consistent) predictive state representation θ={ϕh,𝐌h}h=1H\theta=\{\phi_{h},\mathbf{M}_{h}\}_{h=1}^{H} given core tests {𝒬h}h=0H1\{\mathcal{Q}_{h}\}_{h=0}^{H-1}, such that for any τhh,ωhΩh\tau_{h}\in\mathcal{H}_{h},\omega_{h}\in\Omega_{h}, we have

{ψ(τh)=𝐌h(oh,ah)𝐌1(o1,a1)ψ0,[ψ(τh)]=𝒫θ(𝐪h,o,τho|τha,𝐪h,a),𝐦(ωh)=ϕH𝐌H(oH,aH)𝐌h+1(oh+1,ah+1),oh+1ϕh+1𝐌h+1(oh+1,ah+1)=ϕh,𝒫θ(oh,,o1|a1,,ah)=ϕhψ(τh),\displaystyle\left\{\begin{aligned} &\psi(\tau_{h})=\mathbf{M}_{h}(o_{h},a_{h})\cdots\mathbf{M}_{1}(o_{1},a_{1})\psi_{0},\\ &[\psi(\tau_{h})]_{\ell}=\mathcal{P}^{\theta}(\mathbf{q}_{h,\ell}^{\mathrm{o}},\tau_{h}^{\mathrm{o}}|\tau_{h}^{\mathrm{a}},\mathbf{q}_{h,\ell}^{\mathrm{a}}),\\ &\mathbf{m}(\omega_{h})^{\top}=\phi_{H}^{\top}\mathbf{M}_{H}(o_{H},a_{H})\cdots\mathbf{M}_{h+1}(o_{h+1},a_{h+1}),\\ &\sum_{o_{h+1}}\phi_{h+1}^{\top}\mathbf{M}_{h+1}(o_{h+1},a_{h+1})=\phi_{h}^{\top},\\ &\mathcal{P}^{\theta}(o_{h},\ldots,o_{1}|a_{1},\dots,a_{h})=\phi_{h}^{\top}\psi(\tau_{h}),\end{aligned}\right.

where 𝐌h:𝒪×𝒜d×d\mathbf{M}_{h}\mathrel{\mathop{\mathchar 58\relax}}\mathcal{O}\times\mathcal{A}\rightarrow\mathbb{R}^{d\times d}, ϕhd\phi_{h}\in\mathbb{R}^{d}, and ψ0d\psi_{0}\in\mathbb{R}^{d}. In particular, Θ\Theta is the set of all low-rank problems with the same core tests {𝒬h}h=1H\{\mathcal{Q}_{h}\}_{h=1}^{H}.
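In particular, the last equation of the system gives $\mathcal{P}^{\theta}(o_{1:h}|a_{1:h})=\phi_{h}^{\top}\mathbf{M}_{h}(o_{h},a_{h})\cdots\mathbf{M}_{1}(o_{1},a_{1})\psi_{0}$, so trajectory probabilities are obtained by a forward product of operators. The following is a minimal sketch of this forward pass; the random matrices are placeholders for shapes only and do not satisfy the self-consistency constraints above.

```python
import numpy as np

def psr_trajectory_prob(phi_h, M, psi0, trajectory):
    """Compute  phi_h^T M_h(o_h, a_h) ... M_1(o_1, a_1) psi_0  for a trajectory
    [(o_1, a_1), ..., (o_h, a_h)], where M[t][(o, a)] is the d x d operator at step t+1."""
    psi = psi0
    for t, (o, a) in enumerate(trajectory):
        psi = M[t][(o, a)] @ psi        # update the predictive state psi(tau_t)
    return float(phi_h @ psi)

# Shape-only illustration with d = 3, two observations and two actions per step.
rng = np.random.default_rng(0)
d, H = 3, 2
M = [{(o, a): rng.random((d, d)) for o in range(2) for a in range(2)} for _ in range(H)]
psi0, phi_h = rng.random(d), rng.random(d)
print(psr_trajectory_prob(phi_h, M, psi0, [(0, 1), (1, 0)]))
```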

First, we derive the following guarantee for the difference of robust values.

Corollary C.3.

Under the event o\mathcal{E}_{o}, we have that

|VB𝒫1(θ),RπVB𝒫1(θ^),Rπ|Vθ^,b^π,|VB𝒫2(θ),RπVB𝒫2(θ^),Rπ|C𝒫klVθ^,b^π,|VB𝒯1(θ),RπVB𝒯1(θ^),Rπ|2CBVθ^,b^π,|VB𝒯2(θ),RπVB𝒯2(θ^),Rπ|2CBC𝒯klVθ^,b^π.\left.\begin{aligned} &\left|V_{B_{\mathcal{P}}^{1}(\theta^{*}),R}^{\pi}-V_{B_{\mathcal{P}}^{1}(\hat{\theta}),R}^{\pi}\right|\leq V_{\hat{\theta},\hat{b}}^{\pi},\\ &\left|V_{B_{\mathcal{P}}^{2}(\theta^{*}),R}^{\pi}-V_{B_{\mathcal{P}}^{2}(\hat{\theta}),R}^{\pi}\right|\leq C_{\mathcal{P}}^{\mathrm{kl}}V_{\hat{\theta},\hat{b}}^{\pi},\\ &\left|V_{B_{\mathcal{T}}^{1}(\theta^{*}),R}^{\pi}-V_{B_{\mathcal{T}}^{1}(\hat{\theta}),R}^{\pi}\right|\leq 2C_{B}V_{\hat{\theta},\hat{b}}^{\pi},\\ &\left|V_{B_{\mathcal{T}}^{2}(\theta^{*}),R}^{\pi}-V_{B_{\mathcal{T}}^{2}(\hat{\theta}),R}^{\pi}\right|\leq 2C_{B}C_{\mathcal{T}}^{\mathrm{kl}}V_{\hat{\theta},\hat{b}}^{\pi}.\end{aligned}\right.
Proof.

The proof is finished by directly combining Lemma A.6, Lemma A.7, and Lemma A.8 with Lemma 9 and Corollary 2 in Huang et al., (2023). ∎

Theorem C.4 (Restatement of Theorem 4.7).

Suppose Assumption C.1 and Assumption 3.2 hold, and suppose the type-I concentrability coefficient of the behavior policy $\rho$ against the optimal robust policy $\pi^{*}$ is finite. Let $Q_{A}=\max_{h}|\mathcal{Q}_{h}^{A}|$ be the maximum number of core action sequences. Let $\iota=\min_{\mathbf{q}_{h,\ell}^{\mathrm{a}}}\rho(\mathbf{q}_{h,\ell}^{\mathrm{a}})$, $p_{\min}=\frac{\delta}{N(|\mathcal{O}||\mathcal{A}|)^{2H}}$, $\varepsilon=\frac{p_{\min}}{NH}$, $\beta=O(\log(\mathcal{N}_{\varepsilon}(\Theta)/\delta))$, $\lambda=H^{2}Q_{A}^{2}$, and $\alpha=O\left(\frac{Q_{A}\sqrt{dH}}{\gamma^{2}}\sqrt{\lambda}+\frac{\sqrt{\beta}}{\iota\gamma}\right)$. Then, with probability at least $1-\delta$, the output $\hat{\pi}$ of Algorithm 1 under the different types of uncertainty sets satisfies

VB𝒫1(θ),RπVB𝒫1(θ),Rπ^H2QAdβιγ2(r+QAHγ)Cp(π|ρ)N,\displaystyle V_{B_{\mathcal{P}}^{1}(\theta^{*}),R}^{\pi^{*}}-V_{B_{\mathcal{P}}^{1}(\theta^{*}),R}^{\hat{\pi}}\lesssim\frac{H^{2}Q_{A}\sqrt{d\beta}}{\iota\gamma^{2}}\left(\sqrt{r}+\frac{Q_{A}\sqrt{H}}{\gamma}\right)\sqrt{\frac{C_{\mathrm{p}}(\pi^{*}|\rho)}{N}},
VB𝒫2(θ),RπVB𝒫2(θ),Rπ^H2QAdβιγ2(r+QAHγ)C𝒫klCp(π|ρ)N,\displaystyle V_{B_{\mathcal{P}}^{2}(\theta^{*}),R}^{\pi^{*}}-V_{B_{\mathcal{P}}^{2}(\theta^{*}),R}^{\hat{\pi}}\lesssim\frac{H^{2}Q_{A}\sqrt{d\beta}}{\iota\gamma^{2}}\left(\sqrt{r}+\frac{Q_{A}\sqrt{H}}{\gamma}\right)C_{\mathcal{P}}^{\mathrm{kl}}\sqrt{\frac{C_{\mathrm{p}}(\pi^{*}|\rho)}{N}},
VB𝒯1(θ),RπVB𝒯1(θ),Rπ^H2QAdβιγ2(r+QAHγ)CBCp(π|ρ)N,\displaystyle V_{B_{\mathcal{T}}^{1}(\theta^{*}),R}^{\pi^{*}}-V_{B_{\mathcal{T}}^{1}(\theta^{*}),R}^{\hat{\pi}}\lesssim\frac{H^{2}Q_{A}\sqrt{d\beta}}{\iota\gamma^{2}}\left(\sqrt{r}+\frac{Q_{A}\sqrt{H}}{\gamma}\right)C_{B}\sqrt{\frac{C_{\mathrm{p}}(\pi^{*}|\rho)}{N}},
VB𝒯2(θ),RπVB𝒯2(θ),Rπ^H2QAdβιγ2(r+QAHγ)CBC𝒯klCp(π|ρ)N.\displaystyle V_{B_{\mathcal{T}}^{2}(\theta^{*}),R}^{\pi^{*}}-V_{B_{\mathcal{T}}^{2}(\theta^{*}),R}^{\hat{\pi}}\lesssim\frac{H^{2}Q_{A}\sqrt{d\beta}}{\iota\gamma^{2}}\left(\sqrt{r}+\frac{Q_{A}\sqrt{H}}{\gamma}\right)C_{B}C_{\mathcal{T}}^{\mathrm{kl}}\sqrt{\frac{C_{\mathrm{p}}(\pi^{*}|\rho)}{N}}.
Proof.

We first prove the result for 𝒫\mathcal{P}-type uncertainty set with 1\ell_{1} distance.

Recall that $\hat{\pi}=\arg\max_{\pi}\left\{V_{B_{\mathcal{P}}^{1}(\hat{\theta}),R}^{\pi}-V_{\hat{\theta},\hat{b}}^{\pi}\right\}$, and let $\pi^{*}=\arg\max_{\pi}V_{B_{\mathcal{P}}^{1}(\theta^{*}),R}^{\pi}$. Then, by utilizing Corollary C.3, the suboptimality gap of $\hat{\pi}$ can be upper bounded as follows.

VB𝒫1(θ),Rπ\displaystyle V_{B_{\mathcal{P}}^{1}(\theta^{*}),R}^{\pi^{*}} VB𝒫1(θ),Rπ^\displaystyle-V_{B_{\mathcal{P}}^{1}(\theta^{*}),R}^{\hat{\pi}}
VB𝒫1(θ),Rπ(VB𝒫1(θ^),RπVθ^,b^π)+(VB𝒫1(θ^),RπVθ^,b^π)(VB𝒫1(θ^),Rπ^Vθ^,b^π^)\displaystyle\leq V_{B_{\mathcal{P}}^{1}(\theta^{*}),R}^{\pi^{*}}-\left(V_{B_{\mathcal{P}}^{1}(\hat{\theta}),R}^{\pi^{*}}-V_{\hat{\theta},\hat{b}}^{\pi^{*}}\right)+\left(V_{B_{\mathcal{P}}^{1}(\hat{\theta}),R}^{\pi^{*}}-V_{\hat{\theta},\hat{b}}^{\pi^{*}}\right)-\left(V_{B_{\mathcal{P}}^{1}(\hat{\theta}),R}^{\hat{\pi}}-V_{\hat{\theta},\hat{b}}^{\hat{\pi}}\right)
2Vθ^,b^π.\displaystyle\leq 2V_{\hat{\theta},\hat{b}}^{\pi^{*}}.

By Lemma 8 and Lemma 10 in Huang et al., (2023) we have

Vθ^,b^π\displaystyle V_{\hat{\theta},\hat{b}}^{\pi^{*}} Vθ,b^π+𝔻θπ𝔻θ^π1\displaystyle\leq V_{\theta^{*},\hat{b}}^{\pi^{*}}+\left\|\mathbb{D}_{\theta^{*}}^{\pi^{*}}-\mathbb{D}_{\hat{\theta}}^{\pi^{*}}\right\|_{1}
\displaystyle\overset{(a)}{\lesssim}\alpha\frac{\sqrt{r\beta}}{\iota\sqrt{\lambda}}\sum_{h=0}^{H-1}\mathbb{E}_{\theta^{*}}^{\pi^{*}}\left[\left\|\bar{\psi}^{*}(\tau_{h})\right\|_{\hat{U}_{h}^{-1}}\right]+\frac{\alpha HQ_{A}}{\sqrt{\lambda}}\left\|\mathbb{D}_{\theta^{*}}^{\pi^{*}}-\mathbb{D}_{\hat{\theta}}^{\pi^{*}}\right\|_{1}
\displaystyle\lesssim\alpha\frac{\sqrt{r\beta}}{\iota\sqrt{\lambda}}\sum_{h=0}^{H}\mathbb{E}_{\theta^{*}}^{\pi^{*}}\left[\left\|\bar{\psi}^{*}(\tau_{h})\right\|_{\bar{U}_{h}^{-1}}\right]+\frac{\alpha HQ_{A}}{\sqrt{\lambda}}\sqrt{\frac{\beta}{H\iota^{2}\gamma^{2}}}\sum_{h=0}^{H}\mathbb{E}_{\theta^{*}}^{\pi^{*}}\left[\left\|\bar{\psi}^{*}(\tau_{h})\right\|_{\Lambda_{h}^{-1}}\right],

where (a)(a) is due to λ<H2QA2\lambda<H^{2}Q_{A}^{2}, α=4λHQA2dγ4+7βι2γ2\alpha=\sqrt{\frac{4\lambda HQ_{A}^{2}d}{\gamma^{4}}+\frac{7\beta}{\iota^{2}\gamma^{2}}}, and

U^h=λI+τh𝒟hgψ¯(τh)ψ¯(τh),\displaystyle\hat{U}_{h}=\lambda I+\sum_{\tau_{h}\in\mathcal{D}_{h}^{\mathrm{g}}}\bar{\psi}^{*}(\tau_{h})\bar{\psi}^{*}(\tau_{h})^{\top},
U¯h=λI+|𝒟hg|𝔼θρ[ψ¯(τh)ψ¯(τh)],\displaystyle\bar{U}_{h}=\lambda I+|\mathcal{D}_{h}^{\mathrm{g}}|\mathbb{E}_{\theta^{*}}^{\rho}\left[\bar{\psi}^{*}(\tau_{h})\bar{\psi}^{*}(\tau_{h})^{\top}\right],
Λh=λ0I+N2H𝔼θρ[ψ¯(τh)ψ¯(τh)],\displaystyle\Lambda_{h}=\lambda_{0}I+\frac{N}{2H}\mathbb{E}_{\theta^{*}}^{\rho}\left[\bar{\psi}^{*}(\tau_{h})\bar{\psi}^{*}(\tau_{h})^{\top}\right],
λ0=γ44QA2d.\displaystyle\lambda_{0}=\frac{\gamma^{4}}{4Q_{A}^{2}d}.
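The weighted norms $\|\bar{\psi}^{*}(\tau_{h})\|_{U^{-1}}$ appearing above are standard elliptical bonuses built from a regularized feature covariance. A minimal sketch under assumed inputs is shown below; the generic feature vectors are stand-ins for $\bar{\psi}^{*}(\tau_{h})$ and are purely illustrative.

```python
import numpy as np

def elliptical_bonus(features, query, lam):
    """Return  || query ||_{U^{-1}}  with  U = lam * I + sum_i f_i f_i^T."""
    d = query.shape[0]
    U = lam * np.eye(d) + sum(np.outer(f, f) for f in features)
    return float(np.sqrt(query @ np.linalg.solve(U, query)))

rng = np.random.default_rng(1)
feats = [rng.normal(size=4) for _ in range(50)]   # stand-ins for psi_bar(tau_h), tau_h in D_h^g
print(elliptical_bonus(feats, rng.normal(size=4), lam=1.0))
```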

Recall that the type-I concentrability coefficient implies that

𝔼θπ[ψ¯(τh)ψ¯(τh)]Cp(π|ρ)𝔼θρ[ψ¯(τh)ψ¯(τh)].\mathbb{E}_{\theta^{*}}^{\pi^{*}}\left[\bar{\psi}^{*}(\tau_{h})\bar{\psi}^{*}(\tau_{h})^{\top}\right]\leq C_{\mathrm{p}}(\pi^{*}|\rho)\mathbb{E}_{\theta^{*}}^{\rho}\left[\bar{\psi}^{*}(\tau_{h})\bar{\psi}^{*}(\tau_{h})^{\top}\right].

Thus, by the Cauchy–Schwarz inequality, we have

VB𝒫1(θ),Rπ\displaystyle V_{B_{\mathcal{P}}^{1}(\theta^{*}),R}^{\pi^{*}} VB𝒫1(θ),Rπ^\displaystyle-V_{B_{\mathcal{P}}^{1}(\theta^{*}),R}^{\hat{\pi}}
αβιλ(r+QAHγ)hCp(π|ρ)𝔼θρ[ψ¯(τh)Λh12]\displaystyle\lesssim\alpha\frac{\sqrt{\beta}}{\iota\sqrt{\lambda}}\left(\sqrt{r}+\frac{Q_{A}\sqrt{H}}{\gamma}\right)\sum_{h}\sqrt{C_{\mathrm{p}}(\pi^{*}|\rho)\mathbb{E}_{\theta^{*}}^{\rho}\left[\left\|\bar{\psi}^{*}(\tau_{h})\right\|^{2}_{\Lambda_{h}^{-1}}\right]}
αHβιλ(r+QAHγ)HCp(π|ρ)N\displaystyle\leq\alpha H\frac{\sqrt{\beta}}{\iota\sqrt{\lambda}}\left(\sqrt{r}+\frac{Q_{A}\sqrt{H}}{\gamma}\right)\sqrt{\frac{HC_{\mathrm{p}}(\pi^{*}|\rho)}{N}}
H2QAdβιγ2(r+QAHγ)Cp(π|ρ)N.\displaystyle\lesssim\frac{H^{2}Q_{A}\sqrt{d\beta}}{\iota\gamma^{2}}\left(\sqrt{r}+\frac{Q_{A}\sqrt{H}}{\gamma}\right)\sqrt{\frac{C_{\mathrm{p}}(\pi^{*}|\rho)}{N}}.

For the other three inequalities, we follow the same argument as above, which completes the proof. ∎
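As a side note on the type-I concentrability relation used in the proof above: the smallest constant $C$ with $\mathbb{E}_{\theta^{*}}^{\pi^{*}}[\bar{\psi}^{*}\bar{\psi}^{*\top}]\preceq C\,\mathbb{E}_{\theta^{*}}^{\rho}[\bar{\psi}^{*}\bar{\psi}^{*\top}]$ is the top generalized eigenvalue of the two second-moment matrices. A minimal sketch with sample-based plug-in estimates follows; the synthetic features are illustrative assumptions only.

```python
import numpy as np

def smallest_psd_ratio(A, B):
    """Smallest C such that A <= C * B in the PSD order (B positive definite):
    the top eigenvalue of B^{-1/2} A B^{-1/2}."""
    w, V = np.linalg.eigh(B)
    B_inv_sqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    return float(np.linalg.eigvalsh(B_inv_sqrt @ A @ B_inv_sqrt)[-1])

rng = np.random.default_rng(2)
psi_rho = rng.normal(size=(500, 4))                    # features under the behavior policy rho
psi_pi = psi_rho + 0.3 * rng.normal(size=(500, 4))     # features under the target policy pi*
A = psi_pi.T @ psi_pi / 500
B = psi_rho.T @ psi_rho / 500
print(smallest_psd_ratio(A, B))                        # plug-in estimate of C_p(pi* | rho)
```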

Appendix D Proof of Theorem 5.2

We first provide the pseudo code for our proposed algorithm.

Algorithm 2 Robust Offline RL
1:  Input: offline dataset 𝒟\mathcal{D}
2:  Construct confidence set 𝒞\mathcal{C} according to Equation 8.
3:  Planning: compute $\hat{\pi}=\arg\max_{\pi\in\Pi}\min_{\hat{\theta}\in\mathcal{C}}V_{B(\hat{\theta}),R}^{\pi}$.
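Line 3 of Algorithm 2 is a max–min computation over the confidence set. When both the policy class and the confidence set are finite (an assumption made only for this illustration), it reduces to a table lookup, as in the following sketch, where robust_value stands in for $V_{B(\theta),R}^{\pi}$.

```python
def max_min_policy(policies, confidence_set, robust_value):
    """Return  argmax_{pi} min_{theta in C} robust_value(theta, pi)  (finite case only)."""
    return max(policies,
               key=lambda pi: min(robust_value(theta, pi) for theta in confidence_set))

# Toy illustration: two policies, two candidate models in the confidence set.
table = {("theta1", "pi1"): 0.6, ("theta2", "pi1"): 0.4,
         ("theta1", "pi2"): 0.9, ("theta2", "pi2"): 0.2}
print(max_min_policy(["pi1", "pi2"], ["theta1", "theta2"],
                     lambda th, pi: table[(th, pi)]))   # 'pi1' (worst-case value 0.4 > 0.2)
```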

Theorem D.1.

Suppose the type-II concentrability coefficient of the behavior policy $\rho$ against the optimal robust policy $\pi^{*}$ is finite. Let $\varepsilon=\frac{1}{NH}$ and $\beta=O(\log(\mathcal{N}_{\varepsilon}(\Theta)/\delta))$. Then, with probability at least $1-\delta$, the output $\hat{\pi}$ of Algorithm 2 under the different types of uncertainty sets satisfies

VB𝒫1(θ),RπVB𝒫1(θ),Rπ^HC(π|ρ)βN,\displaystyle V_{B_{\mathcal{P}}^{1}(\theta^{*}),R}^{\pi^{*}}-V_{B_{\mathcal{P}}^{1}(\theta^{*}),R}^{\hat{\pi}}\lesssim H\sqrt{\frac{C(\pi^{*}|\rho)\beta}{N}},
VB𝒫2(θ),RπVB𝒫2(θ),Rπ^HC𝒫klC(π|ρ)βN,\displaystyle V_{B_{\mathcal{P}}^{2}(\theta^{*}),R}^{\pi^{*}}-V_{B_{\mathcal{P}}^{2}(\theta^{*}),R}^{\hat{\pi}}\lesssim HC_{\mathcal{P}}^{\mathrm{kl}}\sqrt{\frac{C(\pi^{*}|\rho)\beta}{N}},
\displaystyle V_{B_{\mathcal{T}}^{1}(\theta^{*}),R}^{\pi^{*}}-V_{B_{\mathcal{T}}^{1}(\theta^{*}),R}^{\hat{\pi}}\lesssim HC_{B}\sqrt{\frac{C(\pi^{*}|\rho)\beta}{N}},
\displaystyle V_{B_{\mathcal{T}}^{2}(\theta^{*}),R}^{\pi^{*}}-V_{B_{\mathcal{T}}^{2}(\theta^{*}),R}^{\hat{\pi}}\lesssim HC_{B}C_{\mathcal{T}}^{\mathrm{kl}}\sqrt{\frac{C(\pi^{*}|\rho)\beta}{N}}.
Proof.

We first prove the result for the setting when the uncertainty set is 𝒫\mathcal{P}-type with 1\ell_{1} distance.

Let $\hat{\theta}=\arg\min_{\theta\in\mathcal{C}}V_{B_{\mathcal{P}}^{1}(\theta),R}^{\pi^{*}}$. Then, we perform the following decomposition.

VB𝒫1(θ),Rπ\displaystyle V_{B_{\mathcal{P}}^{1}(\theta^{*}),R}^{\pi^{*}} VB𝒫1(θ),Rπ^\displaystyle-V_{B_{\mathcal{P}}^{1}(\theta^{*}),R}^{\hat{\pi}}
=VB𝒫1(θ),RπVB𝒫1(θ^),Rπ+VB𝒫1(θ^),RπVB𝒫1(θ^),Rπ^I1+VB𝒫1(θ^),Rπ^VB𝒫1(θ),Rπ^I2\displaystyle=V_{B_{\mathcal{P}}^{1}(\theta^{*}),R}^{\pi^{*}}-V_{B_{\mathcal{P}}^{1}(\hat{\theta}),R}^{\pi^{*}}+\underbrace{V_{B_{\mathcal{P}}^{1}(\hat{\theta}),R}^{\pi^{*}}-V_{B_{\mathcal{P}}^{1}(\hat{\theta}),R}^{\hat{\pi}}}_{I_{1}}+\underbrace{V_{B_{\mathcal{P}}^{1}(\hat{\theta}),R}^{\hat{\pi}}-V_{B_{\mathcal{P}}^{1}(\theta^{*}),R}^{\hat{\pi}}}_{I_{2}}
VB𝒫1(θ),RπVB𝒫1(θ^),Rπ,\displaystyle\leq V_{B_{\mathcal{P}}^{1}(\theta^{*}),R}^{\pi^{*}}-V_{B_{\mathcal{P}}^{1}(\hat{\theta}),R}^{\pi^{*}},

where the last inequality holds because $I_{1}+I_{2}=V_{B_{\mathcal{P}}^{1}(\hat{\theta}),R}^{\pi^{*}}-V_{B_{\mathcal{P}}^{1}(\theta^{*}),R}^{\hat{\pi}}\leq\min_{\theta\in\mathcal{C}}V_{B_{\mathcal{P}}^{1}(\theta),R}^{\hat{\pi}}-V_{B_{\mathcal{P}}^{1}(\theta^{*}),R}^{\hat{\pi}}\leq 0$: the first inequality uses the choice of $\hat{\theta}$ together with the optimality of $\hat{\pi}$, and the second holds because $\theta^{*}\in\mathcal{C}$ with high probability by the construction of the confidence set.

By Lemma A.6, we have

VB𝒫1(θ),Rπ\displaystyle V_{B_{\mathcal{P}}^{1}(\theta^{*}),R}^{\pi^{*}} VB𝒫1(θ^),Rπ\displaystyle-V_{B_{\mathcal{P}}^{1}(\hat{\theta}),R}^{\pi^{*}}
𝔻θπ𝔻θ^π1\displaystyle\leq\left\|\mathbb{D}_{\theta^{*}}^{\pi^{*}}-\mathbb{D}_{\hat{\theta}}^{\pi^{*}}\right\|_{1}
(a)h=1H𝔼θπ[𝒯θ(|τh)𝒯θ^(|τh)1]\displaystyle\overset{(a)}{\leq}\sum_{h=1}^{H}\mathbb{E}_{\theta^{*}}^{\pi^{*}}\left[\left\|\mathcal{T}^{\theta^{*}}(\cdot|\tau_{h})-\mathcal{T}^{\hat{\theta}}(\cdot|\tau_{h})\right\|_{1}\right]
(b)h=1HC(π|ρ)𝔼θρ[𝒯θ(|τh)𝒯θ^(|τh)12]\displaystyle\overset{(b)}{\leq}\sum_{h=1}^{H}\sqrt{C(\pi^{*}|\rho)\mathbb{E}_{\theta^{*}}^{\rho}\left[\left\|\mathcal{T}^{\theta^{*}}(\cdot|\tau_{h})-\mathcal{T}^{\hat{\theta}}(\cdot|\tau_{h})\right\|_{1}^{2}\right]}
(c)h=1HC(π|ρ)𝔼θρ[𝙳H2(𝒯θ(|τh),𝒯θ^(|τh))]\displaystyle\overset{(c)}{\lesssim}\sum_{h=1}^{H}\sqrt{C(\pi^{*}|\rho)\mathbb{E}_{\theta^{*}}^{\rho}\left[\mathtt{D}_{\mathrm{H}}^{2}\left(\mathcal{T}^{\theta^{*}}(\cdot|\tau_{h}),\mathcal{T}^{\hat{\theta}}(\cdot|\tau_{h})\right)\right]}
(d)HC(π|ρ)𝙳H2(𝔻θρ,𝔻θ^ρ)\displaystyle\overset{(d)}{\lesssim}H\sqrt{C(\pi^{*}|\rho)\mathtt{D}_{\mathrm{H}}^{2}\left(\mathbb{D}_{\theta^{*}}^{\rho},\mathbb{D}_{\hat{\theta}}^{\rho}\right)}
HβC(π|ρ)N,\displaystyle\leq H\sqrt{\frac{\beta C(\pi^{*}|\rho)}{N}},

where $(a)$ follows from Lemma E.1, $(b)$ is due to the definition of the type-II concentrability coefficient, and $(c)$ and $(d)$ follow from Lemma E.2.

The other three results follow from the same argument by utilizing Lemma A.7 and Lemma A.8. ∎

Appendix E Auxiliary Lemmas

Lemma E.1.

For two joint distributions $P_{X,Y}$ and $Q_{X,Y}$ over a pair of random variables $(X,Y)$, we have

\mathbb{E}_{X\sim P_{X}}\left[\left\|P_{Y|X}(\cdot|X)-Q_{Y|X}(\cdot|X)\right\|_{1}\right]\leq 2\|P_{X,Y}-Q_{X,Y}\|_{1}.

In addition, the following inequality holds.

PX,YQX,Y1𝔼P[PY|X(|X)QY|X(|X)1]+PXQX1.\|P_{X,Y}-Q_{X,Y}\|_{1}\leq\mathbb{E}_{P}[\|P_{Y|X}(\cdot|X)-Q_{Y|X}(\cdot|X)\|_{1}]+\|P_{X}-Q_{X}\|_{1}.
Proof.

We have

𝔼XPX\displaystyle\mathbb{E}_{X\sim P_{X}} [PY(|X)QY(|X)1]\displaystyle\left[\left\|P_{Y}(\cdot|X)-Q_{Y}(\cdot|X)\right\|_{1}\right]
=xPX(x)y|PY|X(y|x)QY|X(y|x)|\displaystyle=\sum_{x}P_{X}(x)\sum_{y}|P_{Y|X}(y|x)-Q_{Y|X}(y|x)|
=xy|PY|X(y|x)PX(x)QY|X(y|x)PX(x)|\displaystyle=\sum_{x}\sum_{y}|P_{Y|X}(y|x)P_{X}(x)-Q_{Y|X}(y|x)P_{X}(x)|
=xy|PY|X(y|x)PX(x)QY|X(y|x)QX(x)+QY|X(y|x)QX(x)QY|X(y|x)PX(x)|\displaystyle=\sum_{x}\sum_{y}|P_{Y|X}(y|x)P_{X}(x)-Q_{Y|X}(y|x)Q_{X}(x)+Q_{Y|X}(y|x)Q_{X}(x)-Q_{Y|X}(y|x)P_{X}(x)|
xy|PY|X(y|x)PX(x)QY|X(y|x)QX(x)|\displaystyle\leq\sum_{x}\sum_{y}|P_{Y|X}(y|x)P_{X}(x)-Q_{Y|X}(y|x)Q_{X}(x)|
+xy|QY|X(y|x)QX(x)QY|X(y|x)PX(x)|\displaystyle\quad+\sum_{x}\sum_{y}|Q_{Y|X}(y|x)Q_{X}(x)-Q_{Y|X}(y|x)P_{X}(x)|
=PX,YQX,Y1+x|PX(x)QX(x)|\displaystyle=\|P_{X,Y}-Q_{X,Y}\|_{1}+\sum_{x}|P_{X}(x)-Q_{X}(x)|
=PX,YQX,Y1+x|yPY|X(y|x)PX(x)yQY|X(y|x)QX(x)|\displaystyle=\|P_{X,Y}-Q_{X,Y}\|_{1}+\sum_{x}|\sum_{y}P_{Y|X}(y|x)P_{X}(x)-\sum_{y}Q_{Y|X}(y|x)Q_{X}(x)|
2PX,YQX,Y1.\displaystyle\leq 2\|P_{X,Y}-Q_{X,Y}\|_{1}.

On the other hand,

PX,Y\displaystyle\|P_{X,Y} QX,Y1\displaystyle-Q_{X,Y}\|_{1}
=xy|PY|X(y|x)PX(x)QY|X(y|x)QX(x)|\displaystyle=\sum_{x}\sum_{y}|P_{Y|X}(y|x)P_{X}(x)-Q_{Y|X}(y|x)Q_{X}(x)|
=xy|PY|X(y|x)PX(x)QY|X(y|x)PX(x)+QY|X(y|x)PX(x)QY|X(y|x)QX(x)|\displaystyle=\sum_{x}\sum_{y}|P_{Y|X}(y|x)P_{X}(x)-Q_{Y|X}(y|x)P_{X}(x)+Q_{Y|X}(y|x)P_{X}(x)-Q_{Y|X}(y|x)Q_{X}(x)|
𝔼P[PY|X(|X)QY|X(|X)1]+PXQX1.\displaystyle\leq\mathbb{E}_{P}[\|P_{Y|X}(\cdot|X)-Q_{Y|X}(\cdot|X)\|_{1}]+\|P_{X}-Q_{X}\|_{1}.

Lemma E.2.

Let $P$ and $Q$ be two bounded non-negative measures on a finite set $\mathcal{X}$, and let $|P|=\sum_{x\in\mathcal{X}}P(x)$, $|Q|=\sum_{x\in\mathcal{X}}Q(x)$, and $\mathtt{D}_{\mathtt{H}}^{2}(P,Q)=\frac{1}{2}\sum_{x\in\mathcal{X}}(\sqrt{P(x)}-\sqrt{Q(x)})^{2}$. We have

PQ124(|P|+|Q|)𝙳𝙷2(P,Q)\|P-Q\|_{1}^{2}\leq 4(|P|+|Q|)\mathtt{D}_{\mathtt{H}}^{2}(P,Q)

In addition, if PY|X,QY|XP_{Y|X},Q_{Y|X} are two conditional distributions over a random variable YY, and PX,Y=PY|XPP_{X,Y}=P_{Y|X}P, QX,Y=QY|XQQ_{X,Y}=Q_{Y|X}Q are the joint distributions when XX follows the distributions PP and QQ, respectively, we have

𝔼XP[𝙳𝙷2(PY|X,QY|X)]8𝙳𝙷2(PX,Y,QX,Y).\mathop{\mathbb{E}}_{X\sim P}\left[\mathtt{D}_{\mathtt{H}}^{2}(P_{Y|X},Q_{Y|X})\right]\leq 8\mathtt{D}_{\mathtt{H}}^{2}(P_{X,Y},Q_{X,Y}).
Proof.

We first prove the first inequality. By the definition of total variation distance, we have

PQ12\displaystyle\|P-Q\|_{1}^{2} =(x|P(x)Q(x)|)2\displaystyle=\left(\sum_{x}|P(x)-Q(x)|\right)^{2}
=(x(P(x)Q(x))(P(x)+Q(x)))2\displaystyle=\left(\sum_{x}\left(\sqrt{P(x)}-\sqrt{Q(x)}\right)\left(\sqrt{P(x)}+\sqrt{Q(x)}\right)\right)^{2}
(a)(x(P(x)Q(x))2)(2x(P(x)+Q(x)))\displaystyle\overset{(a)}{\leq}\left(\sum_{x}\left(\sqrt{P(x)}-\sqrt{Q(x)}\right)^{2}\right)\left(2\sum_{x}\left(P(x)+Q(x)\right)\right)
4(|P|+|Q|)𝙳𝙷2(P,Q),\displaystyle\leq 4(|P|+|Q|)\mathtt{D}_{\mathtt{H}}^{2}(P,Q),

where $(a)$ follows from the Cauchy–Schwarz inequality and the fact that $(\sqrt{P(x)}+\sqrt{Q(x)})^{2}\leq 2(P(x)+Q(x))$.

For the second inequality, we have,

𝔼XP\displaystyle\mathop{\mathbb{E}}_{X\sim P} [𝙳𝙷2(PY|X,QY|X)]\displaystyle\left[\mathtt{D}_{\mathtt{H}}^{2}(P_{Y|X},Q_{Y|X})\right]
\displaystyle\leq\sum_{x}P(x)\left(\sum_{y}\left(\sqrt{P_{Y|X}(y)}-\sqrt{Q_{Y|X}(y)}\right)^{2}\right)
=x,y(PX,Y(x,y)QX,Y(x,y)+QY|X(y)Q(x)QY|X(y)P(x))2\displaystyle=\sum_{x,y}\left(\sqrt{P_{X,Y}(x,y)}-\sqrt{Q_{X,Y}(x,y)}+\sqrt{Q_{Y|X}(y)Q(x)}-\sqrt{Q_{Y|X}(y)P(x)}\right)^{2}
2x,y(PX,Y(x,y)QX,Y(x,y))2+2x,yQY|X(y)(Q(x)P(x))2\displaystyle\leq 2\sum_{x,y}\left(\sqrt{P_{X,Y}(x,y)}-\sqrt{Q_{X,Y}(x,y)}\right)^{2}+2\sum_{x,y}Q_{Y|X}(y)\left(\sqrt{Q(x)}-\sqrt{P(x)}\right)^{2}
=4𝙳𝙷2(PX,Y,QX,Y)+2(|P|+|Q|2xP(x)Q(x))\displaystyle=4\mathtt{D}_{\mathtt{H}}^{2}(P_{X,Y},Q_{X,Y})+2(|P|+|Q|-2\sum_{x}\sqrt{P(x)Q(x)})
(a)4𝙳𝙷2(PX,Y,QX,Y)+2(|P|+|Q|2xyPY|X(y)P(x)QY|X(y)Q(x))\displaystyle\overset{(a)}{\leq}4\mathtt{D}_{\mathtt{H}}^{2}(P_{X,Y},Q_{X,Y})+2(|P|+|Q|-2\sum_{x}\sum_{y}\sqrt{P_{Y|X}(y)P(x)Q_{Y|X}(y)Q(x)})
=8𝙳𝙷2(PX,Y,QX,Y),\displaystyle=8\mathtt{D}_{\mathtt{H}}^{2}(P_{X,Y},Q_{X,Y}),

where $(a)$ follows by applying the Cauchy–Schwarz inequality to $\sum_{y}\sqrt{P_{Y|X}(y)Q_{Y|X}(y)}$, which is at most $1$. ∎
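As a quick numerical sanity check (not part of the proof), one can verify the first inequality of Lemma E.2 on random unnormalized measures, using $\mathtt{D}_{\mathtt{H}}^{2}(P,Q)=\frac{1}{2}\sum_{x}(\sqrt{P(x)}-\sqrt{Q(x)})^{2}$ as in the statement above.

```python
import numpy as np

rng = np.random.default_rng(3)
for _ in range(5):
    P = rng.random(6)        # bounded nonnegative measures, not necessarily normalized
    Q = rng.random(6)
    tv_sq = np.abs(P - Q).sum() ** 2
    hell_sq = 0.5 * ((np.sqrt(P) - np.sqrt(Q)) ** 2).sum()
    assert tv_sq <= 4 * (P.sum() + Q.sum()) * hell_sq + 1e-12
print("Lemma E.2 (first inequality) verified on all random draws.")
```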