\stackMath

The Curious Price of Distributional Robustness
in Reinforcement Learning with a Generative Model

Laixi Shi
Caltech Department of Computing Mathematical Sciences, California Institute of Technology, CA 91125, USA. Part of L. Shi’s work was completed when she was at CMU. Gen Li
CUHK
Department of Statistics, The Chinese University of Hong Kong, Hong Kong. Yuting Wei
UPenn
Department of Statistics and Data Science, Wharton School, University of Pennsylvania, Philadelphia, PA 19104, USA. Yuxin Chen³³footnotemark: 3
UPenn
Matthieu Geist
Cohere
Cohere. Yuejie Chi
CMU Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA.

(May 2023; Revised )

Abstract

This paper investigates model robustness in reinforcement learning (RL) to reduce the sim-to-real gap in practice. We adopt the framework of distributionally robust Markov decision processes (RMDPs), aimed at learning a policy that optimizes the worst-case performance when the deployed environment falls within a prescribed uncertainty set around the nominal MDP. Despite recent efforts, the sample complexity of RMDPs remained mostly unsettled regardless of the uncertainty set in use. It was unclear if distributional robustness bears any statistical consequences when benchmarked against standard RL. Assuming access to a generative model that draws samples based on the nominal MDP, we characterize the sample complexity of RMDPs when the uncertainty set is specified via either the total variation (TV) distance or $\chi^{2}$ divergence. The algorithm studied here is a model-based method called distributionally robust value iteration, which is shown to be near-optimal for the full range of uncertainty levels. Somewhat surprisingly, our results uncover that RMDPs are not necessarily easier or harder to learn than standard MDPs. The statistical consequence incurred by the robustness requirement depends heavily on the size and shape of the uncertainty set: in the case w.r.t. the TV distance, the minimax sample complexity of RMDPs is always smaller than that of standard MDPs; in the case w.r.t. the $\chi^{2}$ divergence, the sample complexity of RMDPs can often far exceed the standard MDP counterpart.

Keywords: distributionally robust RL, robust Markov decision processes, sample complexity, distributionally robust value iteration, model-based RL

1 Introduction

Reinforcement learning (RL) strives to learn desirable sequential decisions based on trial-and-error interactions with an unknown environment. As a fast-growing subfield of artificial intelligence, it has achieved remarkable success in a variety of applications, such as networked systems (Qu et al.,, 2022), trading (Park and Van Roy,, 2015), operations research (de Castro Silva et al.,, 2003; Pan et al.,, 2023; Zhao et al.,, 2021), large language model alignment (OpenAI,, 2023; Ziegler et al.,, 2019), healthcare (Liu et al.,, 2019; Fatemi et al.,, 2021), robotics and control (Kober et al.,, 2013; Mnih et al.,, 2013). Due to the unprecedented dimensionality of the state-action space, the issue of data efficiency inevitably lies at the core of modern RL practice. A large portion of recent efforts in RL has been directed towards designing sample-efficient algorithms and understanding the fundamental statistical bottleneck for a diverse range of RL scenarios.

While standard RL has been heavily investigated recently, its use can be significantly hampered in practice due to the sim-to-real gap or uncertainty (Bertsimas et al.,, 2019); for instance, a policy learned in an ideal, nominal environment might fail catastrophically when the deployed environment is subject to small changes in task objectives or adversarial perturbations (Zhang et al., 2020a, ; Klopp et al.,, 2017; Mahmood et al.,, 2018). Consequently, in addition to maximizing the long-term cumulative reward, robustness emerges as another critical goal for RL, especially in high-stakes applications such as robotics, autonomous driving, clinical trials, financial investments, and so on. Towards achieving this, distributionally robust RL (Iyengar,, 2005; Nilim and El Ghaoui,, 2005; Xu and Mannor,, 2012; Bäuerle and Glauner,, 2022; Cai et al.,, 2016), which leverages insights from distributionally robust optimization and supervised learning (Rahimian and Mehrotra,, 2019; Gao,, 2020; Bertsimas et al.,, 2018; Duchi and Namkoong,, 2018; Blanchet and Murthy,, 2019; Chen et al.,, 2019; Lam,, 2019), becomes a natural yet versatile framework; the aim is to learn a policy that performs well even when the deployed environment deviates from the nominal one in the face of environment uncertainty.

In this paper, we pursue fundamental understanding about whether, and how, the choice of distributional robustness bears statistical implications in learning a desirable policy, through the lens of sample complexity. More concretely, imagine that one has access to a generative model (also called a simulator) that draws samples from a Markov decision processes (MDP) with a nominal transition kernel (Kearns and Singh,, 1999). Standard RL aims to learn the optimal policy tailored to the nominal kernel, for which the minimax sample complexity limit has been fully settled (Azar et al., 2013b, ; Li et al., 2023b, ). In contrast, distributionally robust RL seeks to learn a more robust policy using the same set of samples, with the aim of optimizing the worst-case performance when the transition kernel is arbitrarily chosen from some prescribed uncertainty set around the nominal kernel; this setting is frequently referred to as robust MDPs (RMDPs).¹¹1While it is straightforward to incorporate additional uncertainty of the reward in our framework, we do not consider it here for simplicity, since the key challenge is to deal with the uncertainty of the transition kernel. Clearly, the RMDP framework helps ensure that the performance of the learned policy does not fail catastrophically as long as the sim-to-real gap is not overly large. It is then natural to wonder how the robustness consideration impacts data efficiency: is there a statistical premium that one needs to pay in quest of additional robustness?

Compared with standard MDPs, the class of RMDPs encapsulates richer models, given that one is allowed to prescribe the shape and size of the uncertainty set. Oftentimes, the uncertainty set is hand-picked as a small ball surrounding the nominal kernel, with the size and shape of the ball specified by some distance-like metric $\rho$ between probability distributions and some uncertainty level $\sigma$ . To ensure tractability of solving RMDPs, the uncertainty set is often selected to obey certain structures. For instance, a number of prior works assumed that the uncertainty set can be decomposed as a product of independent uncertainty subsets over each state or state-action pair (Zhou et al.,, 2021; Wiesemann et al.,, 2013), dubbed as the $s$ - and $(s,a)$ -rectangularity, respectively. The current paper adopts the second choice by assuming $(s,a)$ -rectangularity for the uncertainty set. An additional challenge with RMDPs arises from distribution shift, where the transition kernel drawn from the uncertainty set can be different from the nominal kernel. This challenge leads to complicated nonlinearity and nested optimization in the problem structure not present in standard MDPs.

1.1 Prior art and open questions

Result type	Reference	Sample complexity
Result type	Reference	$0<\sigma\lesssim 1-\gamma$	$1-\gamma\lesssim\sigma<1$
Upper bound	Yang et al., (2022) $\frac{1^{7}}{1^{7^{7}}}$	$\frac{S^{2}A}{\sigma^{2}(1-\gamma)^{4}\varepsilon^{2}}$
	Panaganti and Kalathil, (2022) $\frac{1^{7}}{1^{7^{7}}}$	$\frac{S^{2}A}{(1-\gamma)^{4}\varepsilon^{2}}$
	This paper $\frac{1^{7}}{1^{7^{7}}}$	$\frac{SA}{(1-\gamma)^{3}\varepsilon^{2}}$	$\frac{SA}{(1-\gamma)^{2}\sigma\varepsilon^{2}}$
Lower bound	Yang et al., (2022) $\frac{1^{7}}{1^{7^{7}}}$	$\frac{SA}{(1-\gamma)^{3}\varepsilon^{2}}$	$\frac{SA(1-\gamma)}{\sigma^{4}\varepsilon^{2}}$
Lower bound	This paper $\frac{1^{7}}{1^{7^{7}}}$	$\frac{SA}{(1-\gamma)^{3}\varepsilon^{2}}$	$\frac{SA}{(1-\gamma)^{2}\sigma\varepsilon^{2}}$

Table 1: Comparisons between our results and prior arts for finding an

\varepsilon

-optimal robust policy in the infinite-horizon RMDPs with a generative model, where the uncertainty set is measured w.r.t. the TV distance. Here,

S

A

\gamma

, and

\sigma\in(0,1)

are the state space size, the action space size, the discount factor, and the uncertainty level, respectively, and all logarithmic factors are omitted in the table. Our results provide the first matching upper and lower bounds (up to log factors), improving upon all prior results.

Result type	Reference	Sample complexity
Result type	Reference	$0<\sigma\lesssim 1-\gamma$	$1-\gamma\lesssim\sigma\lesssim\frac{1}{1-\gamma}$	$\sigma\gtrsim\frac{1}{1-\gamma}$
Upper bound	Panaganti and Kalathil, (2022) $\frac{1^{7}}{1^{7^{7}}}$	$\frac{S^{2}A(1+\sigma)}{(1-\gamma)^{4}\varepsilon^{2}}$
	Yang et al., (2022) $\frac{1^{7}}{1^{7^{7}}}$	$\frac{S^{2}A(1+\sigma)^{2}}{(\sqrt{1+\sigma}-1)^{2}(1-\gamma)^{4}\varepsilon^{2}}$
	This paper $\frac{1^{7}}{1^{7^{7}}}$	$\frac{SA(1+\sigma)}{(1-\gamma)^{4}\varepsilon^{2}}$
Lower bound	Yang et al., (2022) $\frac{1^{7}}{1^{7^{7}}}$	$\frac{SA}{(1-\gamma)^{3}\varepsilon^{2}}$	$\frac{SA}{(1-\gamma)^{2}\sigma\varepsilon^{2}}$
Lower bound	This paper $\frac{1^{7}}{1^{7^{7}}}$	$\frac{SA}{(1-\gamma)^{3}\varepsilon^{2}}$	$\frac{SA\sigma}{(1-\gamma)^{4}(1+\sigma)^{4}\varepsilon^{2}}$	$\qquad\frac{SA\sigma}{\varepsilon^{2}}\qquad$

Table 2: Comparisons between our results and prior art on finding an

\varepsilon

-optimal robust policy in the infinite-horizon RMDPs with a generative model, where the uncertainty set is measured w.r.t. the

\chi^{2}

divergence. Here,

S

A

\gamma

, and

\sigma\in(0,\infty)

are the state space size, the action space size, the discount factor, and the uncertainty level, respectively, and all logarithmic factors are omitted in the table. Improving upon all prior results, our theory is tight (up to log factors) when

\sigma\asymp 1

, and otherwise loose by no more than a polynomial factor in

1/(1-\gamma)

Refer to caption — Figure 1: Illustrations of the obtained sample complexity upper and lower bounds for learning RMDPs with comparisons to state-of-the-art and the sample complexity of standard MDPs, where the uncertainty set is specified using the TV distance (a) and the $\chi^{2}$ divergence (b).

In this paper, we focus attention on RMDPs in the context of $\gamma$ -discounted infinite-horizon setting, assuming access to a generative model. The uncertainty set considered herein is specified using one of the $f$ -divergence metrics: the total variation (TV) distance and the $\chi^{2}$ divergence. These two choices are motivated by their practical appeals: easy to implement, and already adopted by empirical RL (Lee et al.,, 2021; Pan et al.,, 2023).

A popular learning approach is model-based, which first estimates the nominal transition kernel using a plug-in estimator based on the collected samples, and then runs a planning algorithm (e.g., a robust variant of value iteration) on top of the estimated kernel. Despite the surge of recent activities, however, existing statistical guarantees for the above paradigm remained highly inadequate, as we shall elaborate on momentarily (see Table 1 and Table 2 respectively for a summary of existing results). For concreteness, let $S$ be the size of the state space, $A$ the size of the action space, $\gamma$ the discount factor (so that the effective horizon is $\frac{1}{1-\gamma}$ ), and $\sigma$ the uncertainty level. We are interested in how the sample complexity — the number of samples needed for an algorithm to output a policy whose robust value function (the worst-case value over all the transition kernels in the uncertainty set) is at most $\varepsilon$ away from the optimal robust one — scales with all these salient problem parameters.

•

Large gaps between existing upper and lower bounds. There remained large gaps between the sample complexity upper and lower bounds established in prior literature, regardless of the divergence metric in use. Specifically, considering the cases using either TV distance or $\chi^{2}$ divergence, the state-of-the-art upper bounds (Panaganti and Kalathil,, 2022) scales quadratically with the size $S$ of the state space, while the lower bound (Yang et al.,, 2022) exhibits only linear scaling with $S$ . Moreover, in the $\chi^{2}$ divergence case, the state-of-the-art upper bound grows linearly with the uncertainty level $\sigma$ when $\sigma\gtrsim 1$ ,²²2Let $\mathcal{X}\coloneqq\big{(}S,A,\frac{1}{1-\gamma},\sigma,\frac{1}{\varepsilon},\frac{1}{\delta}\big{)}$ . The notation $f(\mathcal{X})=O(g(\mathcal{X}))$ or $f(\mathcal{X})\lesssim g(\mathcal{X})$ indicates that there exists a universal constant $C_{1}>0$ such that $f\leq C_{1}g$ , the notation $f(\mathcal{X})\gtrsim g(\mathcal{X})$ indicates that $g(\mathcal{X})=O(f(\mathcal{X}))$ , and the notation $f(\mathcal{X})\asymp g(\mathcal{X})$ indicates that $f(\mathcal{X})\lesssim g(\mathcal{X})$ and $f(\mathcal{X})\gtrsim g(\mathcal{X})$ hold simultaneously. Additionally, the notation $\widetilde{O}(\cdot)$ is defined in the same way as ${O}(\cdot)$ except that it hides logarithmic factors. while the lower bound (Yang et al.,, 2022) is inversely proportional to $\sigma$ . These lead to unbounded gaps between the upper and lower bounds as $\sigma$ grows. Can we hope to close these gaps for RMDPs?
•

Benchmarking with standard MDPs. Perhaps a more pressing issue is that, past works failed to provide an affirmative answer regarding how to benchmark the sample complexity of RMDPs with that of standard MDPs regardless of the chosen shape (determined by $\rho$ ) or size (determined by $\sigma$ ) of the uncertainty set, given the large unresolved gaps mentioned above. Specifically, existing sample complexity upper (resp. lower) bounds are all larger (resp. smaller) than the sample size requirement for standard MDPs. As a consequence, it remains mostly unclear whether learning RMDPs is harder or easier than learning standard MDPs.

1.2 Main contributions

To address the aforementioned questions, this paper develops strengthened sample complexity upper bounds on learning RMDPs with the TV distance and $\chi^{2}$ divergence in the infinite-horizon setting, using a model-based approach called distributionally robust value iteration (DRVI). Improved minimax lower bounds are also developed to help gauge the tightness of our upper bounds and enable benchmarking with standard MDPs. The novel analysis framework developed herein leads to new insights into the interplay between the geometry of uncertainty sets and statistical hardness.

Sample complexity of RMDPs under the TV distance.

We summarize our results and compare them with past works in Table 1; see Figure 1(a) for a graphical illustration.

•

Minimax-optimal sample complexity. We prove that DRVI reaches $\varepsilon$ accuracy as soon as the sample complexity is on the order of

$\widetilde{O}\left(\frac{SA}{(1-\gamma)^{2}\varepsilon^{2}}\min\left\{\frac{1}{1-\gamma},\frac{1}{\sigma}\right\}\right)$

for all $\sigma\in(0,1)$ , assuming that $\varepsilon$ is small enough. In addition, a matching minimax lower bound (modulo some logarithmic factor) is established to guarantee the tightness of the upper bound. To the best of our knowledge, this is the first minimax-optimal sample complexity for RMDPs, which was previously unavailable regardless of the divergence metric and uncertainty level in use and is over the full range of the uncertainty level.
•

RMDPs are easier to learn than standard MDPs under the TV distance. Given the sample complexity $\widetilde{O}\left(\frac{SA}{(1-\gamma)^{3}\varepsilon^{2}}\right)$ of standard MDPs (Li et al., 2023b, ), it can be seen that learning RMDPs under the TV distance is never harder than learning standard MDPs; more concretely, the sample complexity for RMDPs matches that of standard MDPs when $\sigma\lesssim 1-\gamma$ , and becomes smaller by a factor of $\sigma/(1-\gamma)$ when $1-\gamma\lesssim\sigma<1$ . Therefore, in this case, distributional robustness comes almost for free, given that we do not need to collect more samples.

Sample complexity of RMDPs under the $\chi^{2}$ divergence.

We summarize our results and provide comparisons with prior works in Table 2; see Figure 1(b) for an illustration.

•

Near-optimal sample complexity. We demonstrate that DRVI yields $\varepsilon$ accuracy as soon as the sample complexity is on the order of

$\widetilde{O}\left(\frac{SA(1+\sigma)}{(1-\gamma)^{4}\varepsilon^{2}}\right)$

for all $\sigma\in(0,\infty)$ , which is the first sample complexity in this setting that scales linearly in the size $S$ of the state space; in other words, our theory breaks the quadratic scaling bottleneck that was present in prior works (Panaganti and Kalathil,, 2022; Yang et al.,, 2022). We have also developed a strengthened lower bound that is optimized by leveraging the geometry of the uncertainty set under different ranges of $\sigma$ . Our theory is tight when $\sigma\asymp 1$ , and is otherwise loose by at most a polynomial factor of the effective horizon $1/(1-\gamma)$ (regardless of the uncertainty level $\sigma$ ). This significantly improves upon prior results (as there exists an unbounded gap between prior upper and lower bounds as $\sigma\rightarrow\infty$ ).
•

RMDPs can be harder to learn than standard MDPs under the $\chi^{2}$ divergence. Somewhat surprisingly, our improved lower bound suggests that RMDPs in this case can be much harder to learn than standard MDPs, at least for a certain range of uncertainty levels. We single out two regimes of particular interest. Firstly, when $\sigma\asymp 1$ , the sample size requirement of RMDPs is on the order of $\frac{SA}{(1-\gamma)^{4}\varepsilon^{2}}$ (up to log factor), which is provably larger than the one for standard MDPs by a factor of $\frac{1}{1-\gamma}$ . Secondly, the lower bound continues to increase as $\sigma$ grows and exceeds the sample complexity of standard MDPs when $\sigma\gtrsim\frac{1}{(1-\gamma)^{3}}$ .

In sum, our sample complexity bounds not only strengthen the prior art in the development of both upper and lower bounds, but also unveil that the additional robustness consideration might affect the sample complexity in a somewhat surprising manner. As it turns out, RMDPs are not necessarily harder nor easier to learn than standard MDPs; the conclusion is far more nuanced and highly dependent on both the size and shape of the uncertainty set. This constitutes a curious phenomenon that has not been elucidated in prior analyses.

Technical novelty.

Our upper bound analyses require careful treatments of the impact of the uncertainty set upon the value functions, and decouple the statistical dependency across the iterates of the robust value iteration using tailored leave-one-out arguments (Agarwal et al.,, 2020; Li et al., 2022b, ) that have not been introduced to the RMDP setting previously. Turning to the lower bound, we develop new hard instances that differ from those for standard MDPs (Azar et al., 2013a, ; Li et al.,, 2024). These new instances draw inspiration from the asymmetric structure of RMDPs induced by the additional infimum operator in the robust value function. In addition, we construct a series of hard instances depending on the uncertainty level $\sigma$ to establish the tight lower bound as $\sigma$ varies.

Extension: offline RL with uniform coverage.

Last but not least, we extend our analysis framework to accommodate a widely studied offline setting with uniform data coverage (Zhou et al.,, 2021; Yang et al.,, 2022) in Section 6. In particular, given a historical dataset with minimal coverage probability $\mu_{\min}$ over the state-action space (see Assumption 1), we provide sample complexity results for both cases with TV distance or $\chi^{2}$ divergence, where in effect the dependency with the size of the state-action space $SA$ is replaced by $1/\mu_{\min}$ . The sample complexity upper bounds significantly improve upon prior art (Yang et al.,, 2022) by a factor of $\frac{S}{(1-\gamma)^{2}}$ (resp. $S(1+\sigma)$ ) when the uncertainty set is measured by the TV distance (resp. the $\chi^{2}$ divergence).

Notation and paper organization.

Throughout this paper, we denote by $\Delta({\mathcal{S}})$ the probability simplex over a set ${\mathcal{S}}$ and $x=\big{[}x(s,a)\big{]}_{(s,a)\in{\mathcal{S}}\times\mathcal{A}}\in\mathbb{R}^{SA}$ (resp. $x=\big{[}x(s)\big{]}_{s\in{\mathcal{S}}}\in\mathbb{R}^{S}$ ) as any vector that constitutes certain values for each state-action pair (resp. state). In addition, we denote by $x\circ y=\big{[}x(s)\cdot y(s)\big{]}_{s\in{\mathcal{S}}}$ the Hadamard product of any two vectors $x,y\in\mathbb{R}^{S}$ .

The remainder of this paper is structured as follows. Section 2 presents the background about discounted infinite-horizon standard MDPs and formulates distributionally robust MDPs. In Section 3, a model-based approach is introduced, tailored to both the TV distance and the $\chi^{2}$ divergence. Both upper and lower bounds on the sample complexity are developed in Section 4, covering both divergence metrics. Section 5 provides an outline of our analysis. Section 6 further extends the findings to the offline RL setting with uniform data coverage. We then summarize several additional related works in Section 7 and conclude the main paper with further discussions in Section 8. The proof details are deferred to the appendix.

2 Problem formulation

In this section, we formulate distributionally robust Markov decision processes (RMDPs) in the discounted infinite-horizon setting, introduce the sampling mechanism, and describe our goal.

Standard MDPs.

To begin, we first introduce the standard Markov decision processes (MDPs), which facilitate the understanding of RMDPs. A discounted infinite-horizon MDP is represented by $\mathcal{M}=\big{(}\mathcal{S},\mathcal{A},\gamma,P,r\big{)}$ , where $\mathcal{S}=\{1,\cdots,S\}$ and $\mathcal{A}=\{1,\cdots,A\}$ are the finite state and action spaces, respectively, $\gamma\in[0,1)$ is the discounted factor, $P:{\mathcal{S}}\times\mathcal{A}\rightarrow\Delta({\mathcal{S}})$ denotes the probability transition kernel, and $r:{\mathcal{S}}\times\mathcal{A}\rightarrow[0,1]$ is the immediate reward function which is assumed to be deterministic. A policy is denoted by $\pi:\mathcal{S}\rightarrow\Delta(\mathcal{A})$ , which specifies the action selection probability over the action space in any state. When the policy is deterministic, we overload the notation and refer to $\pi(s)$ as the action selected by policy $\pi$ in state $s$ . To characterize the cumulative reward, the value function $V^{\pi,P}$ for any policy $\pi$ under the transition kernel $P$ is defined by

\displaystyle\forall s\in{\mathcal{S}}:\qquad V^{\pi,P}(s)

\displaystyle\coloneqq\mathbb{E}_{\pi,P}\left[\sum_{t=0}^{\infty}\gamma^{t}r\big{(}s_{t},a_{t}\big{)}\,\Big{|}\,s_{0}=s\right],

(1)

where the expectation is taken over the randomness of the trajectory $\{s_{t},a_{t}\}_{t=0}^{\infty}$ generated by executing policy $\pi$ under the transition kernel $P$ , namely, $a_{t}\sim\pi(\cdot\,|\,s_{t})$ and $s_{t+1}\sim P(\cdot\,|\,s_{t},a_{t})$ for all $t\geq 0$ . Similarly, the Q-function $Q^{\pi,P}$ associated with any policy $\pi$ under the transition kernel $P$ is defined as

\displaystyle\forall(s,a)\in{\mathcal{S}}\times\mathcal{A}:\qquad Q^{\pi,P}(s,a)

\displaystyle\coloneqq\mathbb{E}_{\pi,P}\left[\sum_{t=0}^{\infty}\gamma^{t}r\big{(}s_{t},a_{t}\big{)}\,\Big{|}\,s_{0}=s,a_{0}=a\right],

(2)

where the expectation is again taken over the randomness of the trajectory under policy $\pi$ .

Distributionally robust MDPs.

We now introduce the distributionally robust MDP (RMDP) tailored to the discounted infinite-horizon setting, denoted by $\mathcal{M}_{\mathsf{rob}}=\{{\mathcal{S}},\mathcal{A},\gamma,\mathcal{U}_{\rho}^{\sigma}(P^{0}),r\}$ , where ${\mathcal{S}},\mathcal{A},\gamma,r$ are identical to those in the standard MDP. A key distinction from the standard MDP is that: rather than assuming a fixed transition kernel $P$ , it allows the transition kernel to be chosen arbitrarily from a prescribed uncertainty set $\mathcal{U}_{\rho}^{\sigma}(P^{0})$ centered around a nominal kernel $P^{0}:{\mathcal{S}}\times\mathcal{A}\rightarrow\Delta({\mathcal{S}})$ , where the uncertainty set is specified using some distance metric $\rho$ of radius $\sigma>0$ . In particular, given the nominal transition kernel $P^{0}$ and some uncertainty level $\sigma$ , the uncertainty set—with the divergence metric $\rho:\Delta({\mathcal{S}})\times\Delta({\mathcal{S}})\rightarrow\mathbb{R}^{+}$ —is specified as

\displaystyle\mathcal{U}_{\rho}^{\sigma}(P^{0})

\displaystyle\coloneqq\otimes\;\mathcal{U}_{\rho}^{\sigma}(P^{0}_{s,a})\qquad\text{with}\quad\mathcal{U}_{\rho}^{\sigma}(P^{0}_{s,a})\coloneqq\left\{P_{s,a}\in\Delta({\mathcal{S}}):\rho\left(P_{s,a},P^{0}_{s,a}\right)\leq\sigma\right\},

(3)

where we denote a vector of the transition kernel $P$ or $P^{0}$ at state-action pair $(s,a)$ respectively as

\displaystyle P_{s,a}\coloneqq P(\cdot\,|\,s,a)\in\mathbb{R}^{1\times S},\qquad P_{s,a}^{0}\coloneqq P^{0}(\cdot\,|\,s,a)\in\mathbb{R}^{1\times S}.

(4)

In other words, the uncertainty is imposed in a decoupled manner for each state-action pair, obeying the so-called $(s,a)$ -rectangularity (Zhou et al.,, 2021; Wiesemann et al.,, 2013).

In RMDPs, we are interested in the worst-case performance of a policy $\pi$ over all the possible transition kernels in the uncertainty set. This is measured by the robust value function $V^{\pi,\sigma}$ and the robust Q-function $Q^{\pi,\sigma}$ in $\mathcal{M}_{\mathsf{rob}}$ , defined respectively as

\displaystyle\forall(s,a)\in{\mathcal{S}}\times\mathcal{A}:\quad V^{\pi,\sigma}(s)

\displaystyle\coloneqq\inf_{P\in\mathcal{U}_{\rho}^{\sigma}(P^{0})}V^{\pi,P}(s),\qquad Q^{\pi,\sigma}(s,a)\coloneqq\inf_{P\in\mathcal{U}_{\rho}^{\sigma}(P^{0})}Q^{\pi,P}(s,a).

(5)

Optimal robust policy and robust Bellman operator.

As a generalization of properties of standard MDPs, it is well-known that there exists at least one deterministic policy that maximizes the robust value function (resp. robust Q-function) simultaneously for all states (resp. state-action pairs) (Iyengar,, 2005; Nilim and El Ghaoui,, 2005). Therefore, we denote the optimal robust value function (resp. optimal robust Q-function) as $V^{\star,\sigma}$ (resp. $Q^{\star,\sigma}$ ), and the optimal robust policy as $\pi^{\star}$ , which satisfy


$\displaystyle\forall s\in{\mathcal{S}}:\quad$	$\displaystyle V^{\star,\sigma}(s)\coloneqq V^{\pi^{\star},\sigma}(s)=\max_{\pi}V^{\pi,\sigma}(s),$	(6a)
$\displaystyle\forall(s,a)\in{\mathcal{S}}\times\mathcal{A}:\quad$	$\displaystyle Q^{\star,\sigma}(s,a)\coloneqq Q^{\pi^{\star},\sigma}(s,a)=\max_{\pi}Q^{\pi,\sigma}(s,a).$	(6b)

A key machinery in RMDPs is a generalization of Bellman’s optimality principle, encapsulated in the following robust Bellman consistency equation (resp. robust Bellman optimality equation):


$\displaystyle\forall(s,a)\in{\mathcal{S}}\times\mathcal{A}:\quad$	$\displaystyle Q^{\pi,\sigma}(s,a)=r(s,a)+\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}_{\rho}(P^{0}_{s,a})}\mathcal{P}V^{\pi,\sigma},$	(7a)
$\displaystyle\forall(s,a)\in{\mathcal{S}}\times\mathcal{A}:\quad$	$\displaystyle Q^{\star,\sigma}(s,a)=r(s,a)+\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}_{\rho}(P^{0}_{s,a})}\mathcal{P}V^{\star,\sigma}.$	(7b)

The robust Bellman operator (Iyengar,, 2005; Nilim and El Ghaoui,, 2005) is denoted by ${\mathcal{T}}^{\sigma}(\cdot):\mathbb{R}^{SA}\rightarrow\mathbb{R}^{SA}$ and defined as follows:

\displaystyle\forall(s,a)\in{\mathcal{S}}\times\mathcal{A}:\quad{\mathcal{T}}^{\sigma}(Q)(s,a)\coloneqq r(s,a)+\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}_{\rho}(P^{0}_{s,a})}\mathcal{P}V,\quad\text{with}\quad V(s)\coloneqq\max_{a}Q(s,a).

(8)

Given that $Q^{\star,\sigma}$ is the unique fixed point of ${\mathcal{T}}^{\sigma}$ , one can recover the optimal robust value function and Q-function using a procedure termed distributionally robust value iteration (DRVI). Generalizing the standard value iteration, DRVI starts from some given initialization and recursively applies the robust Bellman operator until convergence. As has been shown previously, this procedure converges rapidly due to the $\gamma$ -contraction property of ${\mathcal{T}}^{\sigma}$ w.r.t. the $\ell_{\infty}$ norm (Iyengar,, 2005; Nilim and El Ghaoui,, 2005).

Specification of the divergence $\rho$ .

We consider two popular choices of the uncertainty set measured in terms of two different $f$ -divergence metric: the total variation distance and the $\chi^{2}$ divergence, given respectively by (Tsybakov,, 2009)

	$\displaystyle\rho_{{\mathsf{TV}}}\left(P_{s,a},P^{0}_{s,a}\right)$	$\displaystyle\coloneqq\frac{1}{2}\left\\|P_{s,a}-P^{0}_{s,a}\right\\|_{1}=\frac{1}{2}\sum_{s^{\prime}\in{\mathcal{S}}}P^{0}(s^{\prime}\,\|\,s,a)\left\|1-\frac{P(s^{\prime}\,\|\,s,a)}{P^{0}(s^{\prime}\,\|\,s,a)}\right\|,$		(9)
	$\displaystyle\rho_{\chi^{2}}\left(P_{s,a},P^{0}_{s,a}\right)$	$\displaystyle\coloneqq\sum_{s^{\prime}\in{\mathcal{S}}}P^{0}(s^{\prime}\,\|\,s,a)\left(1-\frac{P(s^{\prime}\,\|\,s,a)}{P^{0}(s^{\prime}\,\|\,s,a)}\right)^{2}.$		(10)

Note that $\rho_{{\mathsf{TV}}}\left(P_{s,a},P^{0}_{s,a}\right)\in[0,1]$ and $\rho_{\chi^{2}}\left(P_{s,a},P^{0}_{s,a}\right)\in[0,\infty)$ in general. As we shall see shortly, these two choices of divergence metrics result in drastically different messages when it comes to sample complexities.

Sampling mechanism: a generative model.

Following Zhou et al., (2021); Panaganti and Kalathil, (2022), we assume access to a generative model or a simulator (Kearns and Singh,, 1999), which allows us to collect $N$ independent samples for each state-action pair generated based on the nominal kernel $P^{0}$ :

\displaystyle\forall(s,a)\in{\mathcal{S}}\times\mathcal{A},\qquad s_{i,s,a}\overset{i.i.d}{\sim}P^{0}(\cdot\,|\,s,a),\qquad i=1,2,\cdots,N.

(11)

The total sample size is, therefore, $NSA$ .

Goal.

Given the collected samples, the task is to learn the robust optimal policy for the RMDP — w.r.t. some prescribed uncertainty set $\mathcal{U}^{\sigma}(P^{0})$ around the nominal kernel — using as few samples as possible. Specifically, given some target accuracy level $\varepsilon>0$ , the goal is to seek an $\varepsilon$ -optimal robust policy $\widehat{\pi}$ obeying

\displaystyle\forall s\in{\mathcal{S}}:\quad V^{\star,\sigma}(s)-V^{\widehat{\pi},\sigma}(s)\leq\varepsilon.

(12)

3 Model-based algorithm: distributionally robust value iteration

We consider a model-based approach tailored to RMDPs, which first constructs an empirical nominal transition kernel based on the collected samples, and then applies distributionally robust value iteration (DRVI) to compute an optimal robust policy.

Empirical nominal kernel.

The empirical nominal transition kernel $\widehat{P}^{0}\in\mathbb{R}^{SA\times S}$ can be constructed on the basis of the empirical frequency of state transitions, i.e.,

\displaystyle\forall(s,a)\in{\mathcal{S}}\times\mathcal{A}:\quad\widehat{P}^{0}(s^{\prime}\,|\,s,a)\coloneqq\frac{1}{N}\sum\limits_{i=1}^{N}\mathds{1}\big{\{}s_{i,s,a}=s^{\prime}\big{\}},

(13)

which leads to an empirical RMDP $\widehat{\mathcal{M}}_{\mathsf{rob}}=\{{\mathcal{S}},\mathcal{A},\gamma,\mathcal{U}^{\sigma}_{\rho}(\widehat{P}^{0}),r\}$ . Analogously, we can define the corresponding robust value function (resp. robust Q-function) of policy $\pi$ in $\widehat{\mathcal{M}}_{\mathsf{rob}}$ as $\widehat{V}^{\pi,\sigma}$ (resp. $\widehat{Q}^{\pi,\sigma}$ ) (cf. (6)). In addition, we denote the corresponding optimal robust policy as $\widehat{\pi}^{\star}$ and the optimal robust value function (resp. optimal robust Q-function) as $\widehat{V}^{\star,\sigma}$ (resp. $\widehat{Q}^{\star,\sigma}$ ) (cf. (7)), which satisfies the robust Bellman optimality equation:

\displaystyle\forall(s,a)\in{\mathcal{S}}\times\mathcal{A}:\quad

\displaystyle\widehat{Q}^{\star,\sigma}(s,a)=r(s,a)+\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}_{\rho}(\widehat{P}^{0}_{s,a})}\mathcal{P}\widehat{V}^{\star,\sigma}.

(14)

Equipped with $\widehat{P}^{0}$ , we can define the empirical robust Bellman operator $\widehat{{\mathcal{T}}}^{\sigma}$ as

\displaystyle\forall(s,a)\in

\displaystyle{\mathcal{S}}\times\mathcal{A}:\quad\widehat{{\mathcal{T}}}^{\sigma}(Q)(s,a)\coloneqq r(s,a)+\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}_{\rho}(\widehat{P}^{0}_{s,a})}\mathcal{P}V,\quad\text{with}\quad V(s)\coloneqq\max_{a}Q(s,a).

(15)

DRVI: distributionally robust value iteration.

To compute the fixed point of $\widehat{{\mathcal{T}}}^{\sigma}$ , we introduce distributionally robust value iteration (DRVI), which is summarized in Algorithm 1. From an initialization $\widehat{Q}_{0}=0$ , the update rule at the $t$ -th ( $t\geq 1$ ) iteration can be formulated as:

\displaystyle\forall(s,a)\in

\displaystyle{\mathcal{S}}\times\mathcal{A}:\quad\widehat{Q}_{t}(s,a)=\widehat{{\mathcal{T}}}^{\sigma}\big{(}\widehat{Q}_{t-1}\big{)}(s,a)=r(s,a)+\gamma\inf_{\mathcal{P}\in\mathcal{U}_{\rho}^{\sigma}(\widehat{P}^{0}_{s,a})}\mathcal{P}\widehat{V}_{t-1},

(16)

where $\widehat{V}_{t-1}(s)=\max_{a}\widehat{Q}_{t-1}(s,a)$ for all $s\in{\mathcal{S}}$ . However, directly solving (16) is computationally expensive since it involves optimization over an $S$ -dimensional probability simplex at each iteration, especially when the dimension of the state space ${\mathcal{S}}$ is large. Fortunately, in view of strong duality (Iyengar,, 2005), (16) can be equivalently solved using its dual problem, which concerns optimizing a scalar dual variable and thus can be solved efficiently. In what follows, we shall illustrate this for the two choices of the divergence $\rho$ of interest (cf. (9) and (10)). Before continuing, for any $V\in\mathbb{R}^{S}$ , we denote $[V]_{\alpha}$ as its clipped version by some non-negative value $\alpha$ , namely,

\displaystyle[V]_{\alpha}(s)\coloneqq\begin{cases}\alpha,&\text{if }V(s)>\alpha,\\ V(s),&\text{otherwise.}\end{cases}

(17)

•

TV distance, where the uncertainty set is $\mathcal{U}^{\sigma}_{\rho}(\widehat{P}^{0}_{s,a})\coloneqq\mathcal{U}^{\sigma}_{\mathsf{TV}}(\widehat{P}^{0}_{s,a})\coloneqq\mathcal{U}^{\sigma}_{\rho_{\mathsf{TV}}}(\widehat{P}^{0}_{s,a})$ w.r.t. the TV distance $\rho=\rho_{\mathsf{TV}}$ defined in (9). In particular, we have the following lemma due to strong duality, which is a direct consequence of Iyengar, (2005, Lemma 4.3).

Lemma 1 (Strong duality for TV).

Consider any probability vector $P\in\Delta({\mathcal{S}})$ , any fixed uncertainty level $\sigma$ and the uncertainty set $\mathcal{U}^{\sigma}(P)\coloneqq\mathcal{U}^{\sigma}_{\mathsf{TV}}(P)$ . For any vector $V\in\mathbb{R}^{S}$ obeying $V\geq{0}$ , recalling the definition of $[V]_{\alpha}$ in (17), one has

\displaystyle\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(P)}\mathcal{P}V

\displaystyle=\max_{\alpha\in[\min_{s}V(s),\max_{s}V(s)]}\left\{P\left[V\right]_{\alpha}-\sigma\left(\alpha-\min_{s^{\prime}}\left[V\right]_{\alpha}(s^{\prime})\right)\right\}.

(18)

In view of the above lemma, the following dual update rule is equivalent to (16) in DRVI:

\displaystyle\widehat{Q}_{t}(s,a)=r(s,a)+\gamma\max_{\alpha\in\left[\min_{s}\widehat{V}_{t-1}(s),\max_{s}\widehat{V}_{t-1}(s)\right]}\left\{\widehat{P}^{0}_{s,a}\left[\widehat{V}_{t-1}\right]_{\alpha}-\sigma\left(\alpha-\min_{s^{\prime}}\left[\widehat{V}_{t-1}\right]_{\alpha}(s^{\prime})\right)\right\}.

(19)

•

$\chi^{2}$ divergence, where the uncertainty set is $\mathcal{U}^{\sigma}_{\rho}(\widehat{P}^{0}_{s,a})\coloneqq\mathcal{U}^{\sigma}_{\chi^{2}}(\widehat{P}^{0}_{s,a})\coloneqq\mathcal{U}^{\sigma}_{\rho_{\chi^{2}}}(\widehat{P}^{0}_{s,a})$ w.r.t. the $\chi^{2}$ divergence $\rho=\rho_{\chi^{2}}$ defined in (10). We introduce the following lemma which directly follows from (Iyengar,, 2005, Lemma 4.2).

Lemma 2 (Strong duality for $\chi^{2}$ ).

Consider any probability vector $P\in\Delta({\mathcal{S}})$ , any fixed uncertainty level $\sigma$ and the uncertainty set $\mathcal{U}^{\sigma}(P)\coloneqq\mathcal{U}^{\sigma}_{\chi^{2}}(P)$ . For any vector $V\in\mathbb{R}^{S}$ obeying $V\geq{0}$ , one has

\displaystyle\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(P)}\mathcal{P}V=\max_{\alpha\in[\min_{s}V(s),\max_{s}V(s)]}\left\{P[V]_{\alpha}-\sqrt{\sigma\mathsf{Var}_{P}\left([V]_{\alpha}\right)}\right\},

(20)

where $\mathsf{Var}_{P}\left(\cdot\right)$ is defined as (40).

In view of the above lemma, the update rule (16) in DRVI can be equivalently written as:

\displaystyle\widehat{Q}_{t}(s,a)=r(s,a)+\gamma\max_{\alpha\in\left[\min_{s}\widehat{V}_{t-1}(s),\max_{s}\widehat{V}_{t-1}(s)\right]}\left\{\widehat{P}^{0}_{s,a}\left[\widehat{V}_{t-1}\right]_{\alpha}-\sqrt{\sigma\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left(\left[\widehat{V}_{t-1}\right]_{\alpha}\right)}\right\}.

(21)

The proofs of Lemma 1 and Lemma 2 are provided in Appendix A. To complete the description, we output the greedy policy of the final Q-estimate $\widehat{Q}_{T}$ as the final policy $\widehat{\pi}$ , namely,

\displaystyle\forall s\in{\mathcal{S}}:\quad\widehat{\pi}(s)=\arg\max_{a}\widehat{Q}_{T}(s,a).

(22)

Encouragingly, the iterates $\big{\{}\widehat{Q}_{t}\big{\}}_{t\geq 0}$ of DRVI converge linearly to the fixed point $\widehat{Q}^{\star,\sigma}$ , owing to the appealing $\gamma$ -contraction property of $\widehat{{\mathcal{T}}}^{\sigma}$ .

1 input: empirical nominal transition kernel

\widehat{P}^{0}

; reward function

r

; uncertainty level

\sigma

; number of iterations

T

2 initialization:

\widehat{Q}_{0}(s,a)=0

\widehat{V}_{0}(s)=0

for all

(s,a)\in{\mathcal{S}}\times\mathcal{A}

3 for $t=1,2,\cdots,T$ do

5 for $s\in{\mathcal{S}},a\in\mathcal{A}$ do

6 Set

\widehat{Q}_{t}(s,a)

according to (16);

7 for $s\in{\mathcal{S}}$ do

8 Set

\widehat{V}_{t}(s)=\max_{a}\widehat{Q}_{t}(s,a)

;

output:

\widehat{Q}_{T}

\widehat{V}_{T}

and

\widehat{\pi}

obeying

\widehat{\pi}(s)\coloneqq\arg\max_{a}\widehat{Q}_{T}(s,a)

Algorithm 1 Distributionally robust value iteration (DRVI) for infinite-horizon RMDPs.

4 Theoretical guarantees: sample complexity analyses

We now present our main results, which concern the sample complexities of learning RMDPs when the uncertainty set is specified using the TV distance or the $\chi^{2}$ divergence. Somewhat surprisingly, different choices of the uncertainty set can lead to dramatically different consequences in the sample size requirement.

4.1 The case of TV distance: RMDPs are easier to learn than standard MDPs

We start with the case where the uncertainty set is measured via the TV distance. The following theorem, whose proof is deferred to Section 5.2, develops an upper bound on the sample complexity of DRVI in order to return an $\varepsilon$ -optimal robust policy. The key challenge of the analysis lies in careful control of the robust value function $V^{\pi,\sigma}$ as a function of the uncertainty level $\sigma$ .

Theorem 1 (Upper bound under TV distance).

Let the uncertainty set be $\mathcal{U}_{\rho}^{\sigma}(\cdot)=\mathcal{U}^{\sigma}_{\mathsf{TV}}(\cdot)$ , as specified by the TV distance (9). Consider any discount factor $\gamma\in\left[\frac{1}{4},1\right)$ , uncertainty level $\sigma\in(0,1)$ , and $\delta\in(0,1)$ . Let $\widehat{\pi}$ be the output policy of Algorithm 1 after $T=C_{1}\log\big{(}\frac{N}{1-\gamma}\big{)}$ iterations. Then with probability at least $1-\delta$ , one has

\displaystyle\forall s\in{\mathcal{S}}:\quad V^{\star,\sigma}(s)-V^{\widehat{\pi},\sigma}(s)\leq\varepsilon

(23)

for any $\varepsilon\in\left(0,\sqrt{1/\max\{1-\gamma,\sigma\}}\right]$ , as long as the total number of samples obeys

\displaystyle NSA\geq\frac{C_{2}SA}{(1-\gamma)^{2}\max\{1-\gamma,\sigma\}\varepsilon^{2}}\log\left(\frac{SAN}{(1-\gamma)\delta}\right).

(24)

Here, $C_{1},C_{2}>0$ are some large enough universal constants.

Remark 1.

Note that Theorem 1 is not only valid when invoking Algorithm 1. In fact, the theorem holds for any oracle planning algorithm (designed based on the empirical transitions $\widehat{P}^{0}$ ) whose output policy $\widehat{\pi}$ obeys

\big{\|}\widehat{V}^{\star,\sigma}-\widehat{V}^{\widehat{\pi},\sigma}\big{\|}_{\infty}\leq O\left(\frac{(1-\gamma)^{2}}{N}\log\left(\frac{SAN}{(1-\gamma)\delta}\right)\right).

(25)

Before discussing the implications of Theorem 1, we present a matching minimax lower bound that confirms the tightness and optimality of the upper bound, which in turn pins down the sample complexity requirement for learning RMDPs with TV distance. The proof is based on constructing new hard instances inspired by the asymmetric structure of RMDPs, with the details postponed to Section 5.3.

Theorem 2 (Lower bound under TV distance).

Consider any tuple $(S,A,\gamma,\sigma,\varepsilon)$ obeying $\sigma\in(0,1-c_{0}]$ with $0<c_{0}\leq\frac{1}{8}$ being any small enough positive constant, $\gamma\in\left[\frac{1}{2},1\right)$ , and $\varepsilon\in\big{(}0,\frac{c_{0}}{256(1-\gamma)}\big{]}$ . We can construct a collection of infinite-horizon RMDPs $\mathcal{M}_{0},\mathcal{M}_{1}$ defined by the uncertainty set $\mathcal{U}_{\rho}^{\sigma}(\cdot)=\mathcal{U}^{\sigma}_{\mathsf{TV}}(\cdot)$ , an initial state distribution $\varphi$ , and a dataset with $N$ independent samples for each state-action pair over the nominal transition kernel (for $\mathcal{M}_{0}$ and $\mathcal{M}_{1}$ respectively), such that

\inf_{\widehat{\pi}}\max\left\{\mathbb{P}_{0}\big{(}V^{\star,\sigma}(\varphi)-V^{\widehat{\pi},\sigma}(\varphi)>\varepsilon\big{)},\,\mathbb{P}_{1}\big{(}V^{\star,\sigma}(\varphi)-V^{\widehat{\pi},\sigma}(\varphi)>\varepsilon\big{)}\right\}\geq\frac{1}{8},

provided that

NSA\leq\frac{c_{0}SA\log 2}{8192(1-\gamma)^{2}\max\{1-\gamma,\sigma\}\varepsilon^{2}}.

Here, the infimum is taken over all estimators $\widehat{\pi}$ , and $\mathbb{P}_{0}$ (resp. $\mathbb{P}_{1}$ ) denotes the probability when the RMDP is $\mathcal{M}_{0}$ (resp. $\mathcal{M}_{1}$ ).

Below, we interpret the above theorems and highlight several key implications about the sample complexity requirements for learning RMDPs for the case w.r.t. the TV distance.

Near minimax-optimal sample complexity.

Theorem 1 shows that the total number of samples required for DRVI (or any oracle planning algorithm claimed in Remark 1) to yield $\varepsilon$ -accuracy is

\displaystyle\widetilde{O}\left(\frac{SA}{(1-\gamma)^{2}\max\{1-\gamma,\sigma\}\varepsilon^{2}}\right).

(26)

Taken together with the minimax lower bound asserted by Theorem 2, this confirms the near optimality of the sample complexity (up to some logarithmic factor) almost over the full range of the uncertainty level $\sigma$ . Importantly, this sample complexity scales linearly with the size of the state-action space, and is inversely proportional to $\sigma$ in the regime where $\sigma\gtrsim 1-\gamma$ .

RMDPs is easier than standard MDPs with TV distance.

Recall that the sample complexity requirement for learning standard MDPs with a generative model is (Azar et al., 2013a, ; Agarwal et al.,, 2020; Li et al., 2023b, )

\widetilde{O}\left(\frac{SA}{(1-\gamma)^{3}\varepsilon^{2}}\right)

(27)

in order to yield $\varepsilon$ accuracy. Comparing this with the sample complexity requirement in (26) for RMDPs under the TV distance, we confirm that the latter is at least as easy as — if not easier than — standard MDPs. In particular, when $\sigma\lesssim 1-\gamma$ is small, the sample complexity of RMDPs is the same as that of standard MDPs as in (27), which is as anticipated since the RMDP reduces to the standard MDP when $\sigma=0$ . On the other hand, when $1-\gamma\lesssim\sigma<1$ , the sample complexity of RMDPs simplifies to

\displaystyle\widetilde{O}\left(\frac{SA}{(1-\gamma)^{2}\sigma\varepsilon^{2}}\right),

(28)

which is smaller than that of standard MDPs by a factor of $\sigma/(1-\gamma)$ .

Comparison with state-of-the-art bounds.

For the upper bound, our results (cf. Theorem 1) significantly improves over the prior art $\widetilde{O}\left(\frac{S^{2}A}{(1-\gamma)^{4}\varepsilon^{2}}\right)$ of Panaganti and Kalathil, (2022) by at least a factor of $\frac{S}{1-\gamma}$ and even $\frac{S}{(1-\gamma)^{2}}$ when the uncertainty level $1-\gamma\lesssim\sigma<1$ is large. Turning to the lower bound side, Yang et al., (2022) developed a lower bound for RMDPs under the TV distance, which scales as

\widetilde{O}\left(\frac{SA(1-\gamma)}{\varepsilon^{2}}\min\left\{\frac{1}{(1-\gamma)^{4}},\frac{1}{\sigma^{4}}\right\}\right).

Clearly, this is worse than ours by a factor of $\frac{\sigma^{3}}{(1-\gamma)^{3}}\in\big{(}1,\frac{1}{(1-\gamma)^{3}}\big{)}$ in the regime where $1-\gamma\lesssim\sigma<1$ .

4.2 The case of $\chi^{2}$ divergence: RMDPs can be harder than standard MDPs

We now switch attention to the case when the uncertainty set is measured via the $\chi^{2}$ divergence. The theorem below presents an upper bound on the sample complexity for this case, whose proof is deferred to Appendix D.

Theorem 3 (Upper bound under $\chi^{2}$ divergence).

Let the uncertainty set be $\mathcal{U}_{\rho}^{\sigma}(\cdot)=\mathcal{U}^{\sigma}_{\chi^{2}}(\cdot)$ , as specified using the $\chi^{2}$ divergence (10). Consider any uncertainty level $\sigma\in(0,\infty)$ , $\gamma\in[1/4,1)$ and $\delta\in(0,1)$ . With probability at least $1-\delta$ , the output policy $\widehat{\pi}$ from Algorithm 1 with at most $T=c_{1}\log\big{(}\frac{N}{1-\gamma}\big{)}$ iterations yields

\displaystyle\forall s\in{\mathcal{S}}:\quad V^{\star,\sigma}(s)-V^{\widehat{\pi},\sigma}(s)\leq\varepsilon

(29)

for any $\varepsilon\in\big{(}0,\frac{1}{1-\gamma}\big{]}$ , as long as the total number of samples obeying

\displaystyle NSA\geq\frac{c_{2}SA(1+\sigma)}{(1-\gamma)^{4}\varepsilon^{2}}\log\left(\frac{SAN}{\delta}\right).

(30)

Here, $c_{1},c_{2}>0$ are some large enough universal constants.

Remark 2.

Akin to Remark 1, the sample complexity derived in Theorem 3 continues to hold for any oracle planning algorithm that outputs a policy $\widehat{\pi}$ obeying $\big{\|}\widehat{V}^{\star,\sigma}-\widehat{V}^{\widehat{\pi},\sigma}\big{\|}_{\infty}\leq O\Big{(}\frac{\log(\frac{SAN}{(1-\gamma)\delta})}{N^{2}}\Big{)}$ .

In addition, in order to gauge the tightness of Theorem 3 and understand the minimal sample complexity requirement under the $\chi^{2}$ divergence, we further develop a minimax lower bound as follows; the proof is deferred to Appendix E.

Theorem 4 (Lower bound under $\chi^{2}$ divergence).

Consider any $(S,A,\gamma,\sigma,\varepsilon)$ obeying $\gamma\in[\frac{3}{4},1)$ , $\sigma\in(0,\infty)$ , and

\displaystyle\varepsilon

\displaystyle\leq c_{3}\begin{cases}\frac{1}{1-\gamma}\quad&\text{if }\sigma\in\left(0,\frac{1-\gamma}{4}\right)\\ \max\left\{\frac{1}{(1+\sigma)(1-\gamma)},1\right\}\quad&\text{if }\sigma\in\left[\frac{1-\gamma}{4},\infty\right)\end{cases}

(31)

for some small universal constant $c_{3}>0$ . Then we can construct two infinite-horizon RMDPs $\mathcal{M}_{0},\mathcal{M}_{1}$ defined by the uncertainty set $\mathcal{U}_{\rho}^{\sigma}(\cdot)=\mathcal{U}^{\sigma}_{\chi^{2}}(\cdot)$ , an initial state distribution $\varphi$ , and a dataset with $N$ independent samples per $(s,a)$ pair over the nominal transition kernel (for $\mathcal{M}_{0}$ and $\mathcal{M}_{1}$ respectively), such that

\displaystyle\inf_{\widehat{\pi}}\max\left\{\mathbb{P}_{0}\big{(}V^{\star,\sigma}(\varphi)-V^{\widehat{\pi},\sigma}(\varphi)>\varepsilon\big{)},\,\mathbb{P}_{1}\big{(}V^{\star,\sigma}(\varphi)-V^{\widehat{\pi},\sigma}(\varphi)>\varepsilon\big{)}\right\}\geq\frac{1}{8},

(32)

provided that the total number of samples

\displaystyle NSA\leq c_{4}\begin{cases}\frac{SA}{(1-\gamma)^{3}\varepsilon^{2}}&\text{if }\sigma\in\left(0,\frac{1-\gamma}{4}\right)\\ \frac{\sigma SA}{\min\left\{1,(1-\gamma)^{4}(1+\sigma)^{4}\right\}\varepsilon^{2}}&\text{if }\sigma\in\left[\frac{1-\gamma}{4},\infty\right)\end{cases}

(33)

for some universal constant $c_{4}>0$ .

We are now positioned to single out several key implications of the above theorems.

Nearly tight sample complexity.

In order to achieve $\varepsilon$ -accuracy for RMDPs under the $\chi^{2}$ divergence, Theorem 3 asserts that a total number of samples on the order of

\displaystyle\widetilde{O}\left(\frac{SA(1+\sigma)}{(1-\gamma)^{4}\varepsilon^{2}}\right).

(34)

is sufficient for DRVI (or any other oracle planning algorithm as discussed in Remark 2). Taking this together with the minimax lower bound in Theorem 4 confirms that the sample complexity is near-optimal — up to a polynomial factor of the effective horizon $\frac{1}{1-\gamma}$ — over the entire range of the uncertainty level $\sigma$ . In particular,

•

when $\sigma\asymp 1$ , our sample complexity $\widetilde{O}\left(\frac{SA}{(1-\gamma)^{4}\varepsilon^{2}}\right)$ is sharp and matches the minimax lower bound;
•

when $\sigma\gtrsim\frac{1}{(1-\gamma)^{3}}$ , our sample complexity correctly predicts the linear dependency with $\sigma$ , suggesting that more samples are needed when one wishes to account for a larger $\chi^{2}$ -based uncertainty sets.

RMDPs can be much harder to learn than standard MDPs with $\chi^{2}$ divergence.

The minimax lower bound developed in Theorem 4 exhibits a curious non-monotonic behavior of the sample size requirement over the entire range of the uncertainty level $\sigma\in(0,\infty)$ when the uncertainty set is measured via the $\chi^{2}$ divergence. When $\sigma\lesssim 1-\gamma$ , the lower bound reduces to

\widetilde{O}\left(\frac{SA}{(1-\gamma)^{3}\varepsilon^{2}}\right),

which matches with that of standard MDPs, as $\sigma=0$ corresponds to standard MDP. However, two additional regimes are worth calling out:

	$\displaystyle 1-\gamma\lesssim\sigma\lesssim\frac{1}{(1-\gamma)^{1/3}}:\qquad$	$\displaystyle\widetilde{O}\left(\frac{SA}{(1-\gamma)^{4}\varepsilon^{2}}\min\left\{\sigma,\frac{1}{\sigma^{3}}\right\}\right),$
	$\displaystyle\sigma\gtrsim\frac{1}{(1-\gamma)^{3}}:\qquad$	$\displaystyle\widetilde{O}\left(\frac{SA\sigma}{\varepsilon^{2}}\right),$

both of which are greater than that of standard MDPs, indicating learning RMDPs under the $\chi^{2}$ divergence can be much harder.

Comparison with state-of-the-art bounds.

Our upper bound significantly improves over the prior art $\widetilde{O}\left(\frac{S^{2}A(1+\sigma)}{(1-\gamma)^{4}\varepsilon^{2}}\right)$ of Panaganti and Kalathil, (2022) by a factor of $S$ , and provides the first finite-sample complexity that scales linearly with respect to $S$ for discounted infinite-horizon RMDPs, which typically exhibit more complicated statistical dependencies than the finite-horizon counterpart. On the other hand, Yang et al., (2022) established a lower bound on the order of $\widetilde{O}\left(\frac{SA}{(1-\gamma)^{2}\sigma\varepsilon^{2}}\right)$ when $\sigma\gtrsim 1-\gamma$ , which is always smaller than the requirement of standard MDPs, and diminishes when $\sigma$ grows. Consequently, Yang et al., (2022) does not lead to the rigorous justification that RMDPs can be much harder than standard MDPs, nor the correct linear scaling of the sample size as $\sigma$ grows.

5 Analysis: the TV case

This section presents the key technical steps for proving our main results of the TV case.

5.1 Preliminaries of the analysis

5.1.1 Additional notations and basic facts

For convenience, we introduce the notation $[T]\coloneqq\{1,\cdots,T\}$ for any positive integer $T>0$ . Moreover, for any two vectors $x=[x_{i}]_{1\leq i\leq n}$ and $y=[y_{i}]_{1\leq i\leq n}$ , the notation ${x}\leq{y}$ (resp. ${x}\geq{y}$ ) means $x_{i}\leq y_{i}$ (resp. $x_{i}\geq y_{i}$ ) for all $1\leq i\leq n$ . And for any vecvor $x$ , we overload the notation by letting $x\circ x=\big{[}x(s,a)^{2}\big{]}_{(s,a)\in{\mathcal{S}}\times\mathcal{A}}$ (resp. $x\circ x=\big{[}x(s)^{2}\big{]}_{s\in{\mathcal{S}}}$ ). With slight abuse of notation, we denote ${0}$ (resp. ${1}$ ) as the all-zero (resp. all-one) vector, and drop the subscript $\rho$ to write $\mathcal{U}^{\sigma}(\cdot)=\mathcal{U}^{\sigma}_{\rho}(\cdot)$ whenever the argument holds for all divergence $\rho$ .

Matrix notation.

To continue, we recall or introduce some additional matrix notation that is useful throughout the analysis.

•

$P^{0}\in\mathbb{R}^{SA\times S}$ : the matrix of the nominal transition kernel with $P^{0}_{s,a}$ as the $(s,a)$ -th row.
•

$\widehat{P}^{0}\in\mathbb{R}^{SA\times S}$ : the matrix of the estimated nomimal transition kernel with $\widehat{P}^{0}_{s,a}$ as the $(s,a)$ -th row.
•

$r\in\mathbb{R}^{SA}$ : a vector representing the reward function $r$ (so that $r_{(s,a)}=r(s,a)$ for all $(s,a)\in{\mathcal{S}}\times\mathcal{A}$ ).

•

$\Pi^{\pi}\in\{0,1\}^{S\times SA}$ : a projection matrix associated with a given deterministic policy $\pi$ taking the following form

\displaystyle\Pi^{\pi}={\scriptsize\begin{pmatrix}{e}_{\pi(1)}^{\top}&{0}^{\top}&\cdots&{0}^{\top}\\ {0}^{\top}&{e}_{\pi(2)}^{\top}&\cdots&{0}^{\top}\\ \vdots&\vdots&\ddots&\vdots\\ {0}^{\top}&{0}^{\top}&\cdots&{e}_{\pi(S)}^{\top}\end{pmatrix}},

(35)

where ${e}_{\pi(1)}^{\top},{e}_{\pi(2)}^{\top},\ldots,{e}_{\pi(S)}^{\top}\in\mathbb{R}^{A}$ are standard basis vectors.

•

$r_{\pi}\in\mathbb{R}^{S}$ : a reward vector restricted to the actions chosen by the policy $\pi$ , namely, $r_{\pi}(s)=r(s,\pi(s))$ for all $s\in{\mathcal{S}}$ (or simply, $r_{\pi}=\Pi^{\pi}r$ ).
•

$\mathrm{Var}_{P}(V)\in\mathbb{R}^{SA}$ : for any transition kernel $P\in\mathbb{R}^{SA\times S}$ and vector $V\in\mathbb{R}^{S}$ , we denote the $(s,a)$ -th row of $\mathrm{Var}_{P}(V)$ as

$\displaystyle\mathsf{Var}_{P}(s,a)\coloneqq\mathrm{Var}_{P_{s,a}}(V).$ (36)

•

$P^{V}\in\mathbb{R}^{SA\times S}$ , $\widehat{P}^{V}\in\mathbb{R}^{SA\times S}$ : the matrices representing the probability transition kernel in the uncertainty set that leads to the worst-case value for any vector $V\in\mathbb{R}^{S}$ . We denote $P_{s,a}^{V}$ (resp. $\widehat{P}_{s,a}^{V}$ ) as the $(s,a)$ -th row of the transition matrix $P^{V}$ (resp. $\widehat{P}^{V}$ ). In truth, the $(s,a)$ -th rows of these transition matrices are defined as


	$\displaystyle P_{s,a}^{V}$	$\displaystyle=\mathrm{argmin}_{\mathcal{P}\in\mathcal{U}^{\sigma}(P^{0}_{s,a})}\mathcal{P}V,\qquad\text{and}\qquad\widehat{P}_{s,a}^{V}=\mathrm{argmin}_{\mathcal{P}\in\mathcal{U}^{\sigma}(\widehat{P}^{0}_{s,a})}\mathcal{P}V.$	(37a)
Furthermore, we make use of the following short-hand notation:

	$\displaystyle P_{s,a}^{\pi,V}$	$\displaystyle:=P_{s,a}^{V^{\pi,\sigma}}=\mathrm{argmin}_{\mathcal{P}\in\mathcal{U}^{\sigma}(P^{0}_{s,a})}\mathcal{P}V^{\pi,\sigma},\qquad P_{s,a}^{\pi,\widehat{V}}:=P_{s,a}^{\widehat{V}^{\pi,\sigma}}=\mathrm{argmin}_{\mathcal{P}\in\mathcal{U}^{\sigma}(P^{0}_{s,a})}\mathcal{P}\widehat{V}^{\pi,\sigma},$	(37b)
	$\displaystyle\widehat{P}_{s,a}^{\pi,V}$	$\displaystyle:=\widehat{P}_{s,a}^{V^{\pi,\sigma}}=\mathrm{argmin}_{P\in\mathcal{U}^{\sigma}(\widehat{P}^{0}_{s,a})}PV^{\pi,\sigma},\qquad\widehat{P}_{s,a}^{\pi,\widehat{V}}:=\widehat{P}_{s,a}^{\widehat{V}^{\pi,\sigma}}=\mathrm{argmin}_{P\in\mathcal{U}^{\sigma}(\widehat{P}^{0}_{s,a})}P\widehat{V}^{\pi,\sigma}.$	(37c)

The corresponding probability transition matrices are denoted by $P^{\pi,V}\in\mathbb{R}^{SA\times S}$ , $P^{\pi,\widehat{V}}\in\mathbb{R}^{SA\times S}$ , $\widehat{P}^{\pi,V}\in\mathbb{R}^{SA\times S}$ and $\widehat{P}^{\pi,\widehat{V}}\in\mathbb{R}^{SA\times S}$ , respectively.

•

$P^{\pi}\in\mathbb{R}^{S\times S}$ , $\widehat{P}^{\pi}\in\mathbb{R}^{S\times S}$ , $\underline{P}^{\pi,V}\in\mathbb{R}^{S\times S}$ , $\underline{P}^{\pi,\widehat{V}}\in\mathbb{R}^{S\times S}$ , $\underline{\widehat{P}}^{\pi,V}\in\mathbb{R}^{S\times S}$ and $\underline{\widehat{P}}^{\pi,\widehat{V}}\in\mathbb{R}^{S\times S}$ : six square probability transition matrices w.r.t. policy $\pi$ over the states, namely

		$\displaystyle P^{\pi}\coloneqq\Pi^{\pi}P^{0},\qquad\widehat{P}^{\pi}\coloneqq\Pi^{\pi}\widehat{P}^{0},\qquad\underline{P}^{\pi,V}\coloneqq\Pi^{\pi}P^{\pi,V},\qquad\underline{P}^{\pi,\widehat{V}}\coloneqq\Pi^{\pi}P^{\pi,\widehat{V}},$
		$\displaystyle\underline{\widehat{P}}^{\pi,V}\coloneqq\Pi^{\pi}\widehat{P}^{\pi,V},\qquad\text{and}\qquad\underline{\widehat{P}}^{\pi,\widehat{V}}\coloneqq\Pi^{\pi}\widehat{P}^{\pi,\widehat{V}}.$		(38)

We denote $P^{\pi}_{s}$ as the $s$ -th row of the transition matrix $P^{\pi}$ ; similar quantities can be defined for the other matrices as well.

Kullback-Leibler (KL) divergence.

First, for any two distributions $P$ and $Q$ , we denote by $\mathsf{KL}(P\parallel Q)$ the Kullback-Leibler (KL) divergence of $P$ and $Q$ . Letting $\mathsf{Ber}(p)$ be the Bernoulli distribution with mean $p$ , we also introduce

\displaystyle\mathsf{KL}(p\parallel q)\coloneqq p\log\frac{p}{q}+(1-p)\log\frac{1-p}{1-q}\quad\text{and}\quad\chi^{2}(p\parallel q)\coloneqq\frac{(p-q)^{2}}{q}+\frac{(p-q)^{2}}{1-q}=\frac{(p-q)^{2}}{q(1-q)},

(39)

which represent respectively the KL divergence and the $\chi^{2}$ divergence of $\mathsf{Ber}(p)$ from $\mathsf{Ber}(q)$ (Tsybakov,, 2009).

Variance.

For any probability vector $P\in\mathbb{R}^{1\times S}$ and vector $V\in\mathbb{R}^{S}$ , we denote the variance

\displaystyle\mathrm{Var}_{P}(V)\coloneqq P(V\circ V)-(PV)\circ(PV).

(40)

The following lemma bounds the Lipschitz constant of the variance function.

Lemma 3.

Consider any ${0}\leq V_{1},V_{2}\leq\frac{1}{1-\gamma}$ obeying $\|V_{1}-V_{2}\|_{\infty}\leq x$ and any probability vector $P\in\Delta(S)$ , one has

\displaystyle\left|\mathrm{Var}_{P}(V_{1})-\mathrm{Var}_{P}(V_{2})\right|\leq\frac{2x}{(1-\gamma)}.

(41)

Proof of Lemma 3: It is immediate to check that

$\displaystyle\left\|\mathrm{Var}_{P}(V_{1})-\mathrm{Var}_{P}(V_{2})\right\|$	$\displaystyle=\left\|P(V_{1}\circ V_{1})-(PV_{1})\circ(PV_{1})-P(V_{2}\circ V_{2})+(PV_{2})\circ(PV_{2})\right\|$
	$\displaystyle\leq\left\|P\big{(}V_{1}\circ V_{1}-V_{2}\circ V_{2}\big{)}\right\|+\left\|(PV_{1}+PV_{2})P(V_{1}-V_{2})\right\|$
	$\displaystyle\leq 2\\|V_{1}+V_{2}\\|_{\infty}\\|V_{1}-V_{2}\\|_{\infty}\leq\frac{2x}{(1-\gamma)}.$	(42)

where the penultimate inequality holds by the triangle inequality.

5.1.2 Facts of the robust Bellman operator and the empirical robust MDP

$\gamma$ -contraction of the robust Bellman operator.

It is worth noting that the robust Bellman operator (cf. (8)) shares the nice $\gamma$ -contraction property of the standard Bellman operator, stated as below.

Lemma 4 ( $\gamma$ -Contraction).

(Iyengar,, 2005, Theorem 3.2) For any $\gamma\in[0,1)$ , the robust Bellman operator ${\mathcal{T}}^{\sigma}(\cdot)$ (cf. (8)) is a $\gamma$ -contraction w.r.t. $\|\cdot\|_{\infty}$ . Namely, for any $Q_{1},Q_{2}\in\mathbb{R}^{SA}$ s.t. $Q_{1}(s,a),Q_{2}(s,a)\in\big{[}0,\frac{1}{1-\gamma}\big{]}$ for all $(s,a)\in{\mathcal{S}}\times\mathcal{A}$ , one has

\displaystyle\left\|{\mathcal{T}}^{\sigma}(Q_{1})-{\mathcal{T}}^{\sigma}(Q_{2})\right\|_{\infty}\leq\gamma\left\|Q_{1}-Q_{2}\right\|_{\infty}.

(43)

Additionally, $Q^{\star,\sigma}$ is the unique fixed point of ${\mathcal{T}}^{\sigma}(\cdot)$ obeying $0\leq{Q}^{\star,\sigma}(s,a)\leq\frac{1}{1-\gamma}$ for all $(s,a)\in{\mathcal{S}}\times\mathcal{A}$ .

Bellman equations of the empirical robust MDP $\widehat{\mathcal{M}}_{\mathsf{rob}}$ .

To begin with, recall that the empirical robust MDP $\widehat{\mathcal{M}}_{\mathsf{rob}}=\{{\mathcal{S}},\mathcal{A},\gamma,\mathcal{U}^{\sigma}(\widehat{P}^{0}),r\}$ based on the estimated nominal distribution $\widehat{P}^{0}$ constructed in (13) and its corresponding robust value function (resp. robust Q-function) $\widehat{V}^{\pi,\sigma}$ (resp. $\widehat{Q}^{\pi,\sigma}$ ).

Note that $\widehat{Q}^{\star,\sigma}$ is the unique fixed point of $\widehat{{\mathcal{T}}}^{\sigma}(\cdot)$ (see Lemma 4), the empirical robust Bellman operator constructed using $\widehat{P}^{0}$ . Moreover, similar to (7), for $\widehat{\mathcal{M}}_{\mathsf{rob}}$ , the Bellman’s optimality principle gives the following robust Bellman consistency equation (resp. robust Bellman optimality equation):


$\displaystyle\forall(s,a)\in{\mathcal{S}}\times\mathcal{A}:\quad$	$\displaystyle\widehat{Q}^{\pi,\sigma}(s,a)=r(s,a)+\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(\widehat{P}^{0}_{s,a})}\mathcal{P}\widehat{V}^{\pi,\sigma},$	(44a)
$\displaystyle\forall(s,a)\in{\mathcal{S}}\times\mathcal{A}:\quad$	$\displaystyle\widehat{Q}^{\star,\sigma}(s,a)=r(s,a)+\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(\widehat{P}^{0}_{s,a})}\mathcal{P}\widehat{V}^{\star,\sigma}.$	(44b)

With these in mind, combined with the matrix notation (introduced at the beginning of Section 5), for any policy $\pi$ , we can write the robust Bellman consistency equations as

\displaystyle Q^{\pi,\sigma}=r+\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(P^{0})}\mathcal{P}V^{\pi,\sigma}\quad\text{and}\quad\widehat{Q}^{\pi,\sigma}=r+\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(\widehat{P}^{0})}\mathcal{P}\widehat{V}^{\pi,\sigma},

(45)

which leads to

	$\displaystyle V^{\pi,\sigma}$	$\displaystyle=r_{\pi}+\gamma\Pi^{\pi}\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(P^{0})}\mathcal{P}V^{\pi,\sigma}\overset{\mathrm{(i)}}{=}r_{\pi}+\gamma\underline{P}^{\pi,V}V^{\pi,\sigma},$
	$\displaystyle\widehat{V}^{\pi,\sigma}$	$\displaystyle=r_{\pi}+\gamma\Pi^{\pi}\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(\widehat{P}^{0})}\mathcal{P}\widehat{V}^{\pi,\sigma}\overset{\mathrm{(ii)}}{=}r_{\pi}+\gamma\underline{\widehat{P}}^{\pi,\widehat{V}}\widehat{V}^{\pi,\sigma},$		(46)

where (i) and (ii) holds by the definitions in (35), (37) and (• ‣ 5.1.1).

Encouragingly, the above property of the robust Bellman operator ensures the fast convergence of DRVI. We collect this consequence in the following lemma, whose proof is postponed to Appendix A.2.

Lemma 5.

Let $\widehat{Q}_{0}=0$ . The iterates $\{\widehat{Q}_{t}\},\{\widehat{V}_{t}\}$ of DRVI (cf. Algorithm 1) obey

\displaystyle\forall t\geq 0:\quad\big{\|}\widehat{Q}_{t}-\widehat{Q}^{\star,\sigma}\big{\|}_{\infty}\leq\frac{\gamma^{t}}{1-\gamma}\qquad\text{and}\qquad\big{\|}\widehat{V}_{t}-\widehat{V}^{\star,\sigma}\big{\|}_{\infty}\leq\frac{\gamma^{t}}{1-\gamma}.

(47)

Furthermore, the output policy $\widehat{\pi}$ obeys

\displaystyle\big{\|}\widehat{V}^{\star,\sigma}-\widehat{V}^{\widehat{\pi},\sigma}\big{\|}_{\infty}

\displaystyle\leq\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma},\qquad\mbox{where}\quad\big{\|}\widehat{V}^{\star,\sigma}-\widehat{V}_{T-1}\big{\|}_{\infty}=:\varepsilon_{\mathsf{opt}}.

(48)

5.2 Proof of the upper bound with TV distance: Theorem 1

Throughout this section, for any transition kernel $P$ , the uncertainty set is taken as (see (9))

\displaystyle\mathcal{U}^{\sigma}(P)\coloneqq\mathcal{U}^{\sigma}_{\mathsf{TV}}(P)=\otimes\;\mathcal{U}^{\sigma}_{\mathsf{TV}}(P_{s,a}),\qquad

\displaystyle\mathcal{U}^{\sigma}_{\mathsf{TV}}(P_{s,a})\coloneqq\Big{\{}P^{\prime}_{s,a}\in\Delta({\mathcal{S}}):\frac{1}{2}\left\|P^{\prime}_{s,a}-P_{s,a}\right\|_{1}\leq\sigma\Big{\}}.

(49)

5.2.1 Technical lemmas

We begin with a key lemma that is new and distinguishes robust MDPs with TV distance from standard MDPs , which plays a critical role in obtaining the sample complexity upper bound in Theorem 1. This lemma concerns the dynamic range of the robust value function $V^{\pi,\sigma}$ (cf. (5)) for any fixed policy $\pi$ , which produces tighter control than that in standard MDP (cf. $\frac{1}{1-\gamma}$ ) when $\sigma$ is large. This lemma The proof is deferred to Appendix B.1.

Lemma 6.

For any nominal transition kernel $P\in\mathbb{R}^{SA\times S}$ , any fixed uncertainty level $\sigma$ , and any policy $\pi$ , its corresponding robust value function $V^{\pi,\sigma}$ (cf. (5)) satisfies

\displaystyle\max_{s\in{\mathcal{S}}}V^{\pi,\sigma}(s)-\min_{s\in{\mathcal{S}}}V^{\pi,\sigma}(s)\leq\frac{1}{\gamma\max\{1-\gamma,\sigma\}}.

With the above lemma in hand, we introduce the following lemma that is useful throughout this section, whose proof is postponed in Appendix B.2.

Lemma 7.

Consider an MDP with transition kernel matrix $P$ and reward function ${0}\leq r\leq 1$ . For any policy $\pi$ and its associated state transition matrix $P_{\pi}\coloneqq\Pi^{\pi}P$ and value function $0\leq V^{\pi,P}\leq\frac{1}{1-\gamma}$ (cf. (1)), one has

\displaystyle\left(I-\gamma P_{\pi}\right)^{-1}\sqrt{\mathrm{Var}_{P_{\pi}}(V^{\pi,P})}\leq\sqrt{\frac{8(\max_{s}V^{\pi,P}(s)-\min_{s}V^{\pi,P}(s))}{\gamma^{2}(1-\gamma)^{2}}}1.

5.2.2 Proof of Theorem 1

Recall that the proof for standard RL (Agarwal et al.,, 2020; Li et al., 2023b, ) deals with the upper and lower bound of the value function estimate gap identically. In contrast, the proof of Theorem 1 needs tailored argument for the robust RL setting — controlling the upper and lower bound of the value function estimate gap in an asymmetric way — motivated by the varying worst-case transition kernels associated with different value functions. Before proceeding, applying Lemma 5 yields that for any $\varepsilon_{\mathsf{opt}}>0$ , as long as $T\geq\log(\frac{1}{(1-\gamma)\varepsilon_{\mathsf{opt}}})$ , one has

\displaystyle\big{\|}\widehat{V}^{\star,\sigma}-\widehat{V}^{\widehat{\pi},\sigma}\big{\|}_{\infty}\leq\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma},

(50)

allowing us to justify the more general statement in Remark 1. To control the performance gap $\left\|V^{\star,\sigma}-V^{\widehat{\pi},\sigma}\right\|_{\infty}$ , the proof is divided into several key steps.

Step 1: decomposing the error.

Recall the optimal robust policy $\pi^{\star}$ w.r.t. $\mathcal{M}_{\mathsf{rob}}$ and the optimal robust policy $\widehat{\pi}^{\star}$ , the optimal robust value function $\widehat{V}^{\star,\sigma}$ (resp. robust value function $\widehat{Q}^{\pi,\sigma}$ ) w.r.t. $\widehat{\mathcal{M}}_{\mathsf{rob}}$ . The term of interest $V^{\star,\sigma}-V^{\widehat{\pi},\sigma}$ can be decomposed as

$\displaystyle V^{\star,\sigma}-V^{\widehat{\pi},\sigma}$	$\displaystyle=\left(V^{\pi^{\star},\sigma}-\widehat{V}^{\pi^{\star},\sigma}\right)+\left(\widehat{V}^{\pi^{\star},\sigma}-\widehat{V}^{\widehat{\pi}^{\star},\sigma}\right)+\left(\widehat{V}^{\widehat{\pi}^{\star},\sigma}-\widehat{V}^{\widehat{\pi},\sigma}\right)+\left(\widehat{V}^{\widehat{\pi},\sigma}-V^{\widehat{\pi},\sigma}\right)$
	$\displaystyle\overset{\mathrm{(i)}}{\leq}\left(V^{\pi^{\star},\sigma}-\widehat{V}^{\pi^{\star},\sigma}\right)+\left(\widehat{V}^{\widehat{\pi}^{\star},\sigma}-\widehat{V}^{\widehat{\pi},\sigma}\right)+\left(\widehat{V}^{\widehat{\pi},\sigma}-V^{\widehat{\pi},\sigma}\right)$
	$\displaystyle\overset{\mathrm{(ii)}}{\leq}\left(V^{\pi^{\star},\sigma}-\widehat{V}^{\pi^{\star},\sigma}\right)+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma}1+\left(\widehat{V}^{\widehat{\pi},\sigma}-V^{\widehat{\pi},\sigma}\right)$	(51)

where (i) holds by $\widehat{V}^{\pi^{\star},\sigma}-\widehat{V}^{\widehat{\pi}^{\star},\sigma}\leq 0$ since $\widehat{\pi}^{\star}$ is the robust optimal policy for $\widehat{\mathcal{M}}_{\mathsf{rob}}$ , and (ii) comes from the fact in (50).

To control the two important terms in (51), we first consider a more general term $\widehat{V}^{\pi,\sigma}-V^{\pi,\sigma}$ for any policy $\pi$ . Towards this, plugging in (46) yields

	$\displaystyle\widehat{V}^{\pi,\sigma}-V^{\pi,\sigma}$	$\displaystyle=r_{\pi}+\gamma\underline{\widehat{P}}^{\pi,\widehat{V}}\widehat{V}^{\pi,\sigma}-\big{(}r_{\pi}+\gamma\underline{P}^{\pi,V}V^{\pi,\sigma}\big{)}$
		$\displaystyle=\Big{(}\gamma\underline{\widehat{P}}^{\pi,\widehat{V}}\widehat{V}^{\pi,\sigma}-\gamma\underline{P}^{\pi,\widehat{V}}\widehat{V}^{\pi,\sigma}\Big{)}+\Big{(}\gamma\underline{P}^{\pi,\widehat{V}}\widehat{V}^{\pi,\sigma}-\gamma\underline{P}^{\pi,V}V^{\pi,\sigma}\Big{)}$
		$\displaystyle\overset{\mathrm{(i)}}{\leq}\gamma\Big{(}\underline{P}^{\pi,V}\widehat{V}^{\pi,\sigma}-\underline{P}^{\pi,V}V^{\pi,\sigma}\Big{)}+\Big{(}\gamma\underline{\widehat{P}}^{\pi,\widehat{V}}\widehat{V}^{\pi,\sigma}-\gamma\underline{P}^{\pi,\widehat{V}}\widehat{V}^{\pi,\sigma}\Big{)},$

where (i) holds by observing

\displaystyle\underline{P}^{\pi,\widehat{V}}\widehat{V}^{\pi,\sigma}

\displaystyle\leq\underline{P}^{\pi,V}\widehat{V}^{\pi,\sigma}

due to the optimality of $\underline{P}^{\pi,\widehat{V}}$ (cf. (37)). Rearranging terms leads to

\displaystyle\widehat{V}^{\pi,\sigma}-V^{\pi,\sigma}

\displaystyle\leq\gamma\left(I-\gamma\underline{P}^{\pi,V}\right)^{-1}\Big{(}\underline{\widehat{P}}^{\pi,\widehat{V}}\widehat{V}^{\pi,\sigma}-\underline{P}^{\pi,\widehat{V}}\widehat{V}^{\pi,\sigma}\Big{)}.

(52)

Similarly, we can also deduce

$\displaystyle\widehat{V}^{\pi,\sigma}-V^{\pi,\sigma}$	$\displaystyle=r_{\pi}+\gamma\underline{\widehat{P}}^{\pi,\widehat{V}}\widehat{V}^{\pi,\sigma}-\left(r_{\pi}+\gamma\underline{P}^{\pi,V}V^{\pi,\sigma}\right)$
	$\displaystyle=\Big{(}\gamma\underline{\widehat{P}}^{\pi,\widehat{V}}\widehat{V}^{\pi,\sigma}-\gamma\underline{P}^{\pi,\widehat{V}}\widehat{V}^{\pi,\sigma}\Big{)}+\left(\gamma\underline{P}^{\pi,\widehat{V}}\widehat{V}^{\pi,\sigma}-\gamma\underline{P}^{\pi,V}V^{\pi,\sigma}\right)$
	$\displaystyle\geq\gamma\left(\underline{P}^{\pi,\widehat{V}}\widehat{V}^{\pi,\sigma}-\underline{P}^{\pi,\widehat{V}}V^{\pi,\sigma}\right)+\Big{(}\gamma\underline{\widehat{P}}^{\pi,\widehat{V}}\widehat{V}^{\pi,\sigma}-\gamma\underline{P}^{\pi,\widehat{V}}\widehat{V}^{\pi,\sigma}\Big{)}$
	$\displaystyle\geq\gamma\left(I-\gamma\underline{P}^{\pi,\widehat{V}}\right)^{-1}\Big{(}\underline{\widehat{P}}^{\pi,\widehat{V}}\widehat{V}^{\pi,\sigma}-\underline{P}^{\pi,\widehat{V}}\widehat{V}^{\pi,\sigma}\Big{)}.$	(53)

Combining (52) and (53), we arrive at

	$\displaystyle\big{\\|}\widehat{V}^{\pi,\sigma}-V^{\pi,\sigma}\big{\\|}_{\infty}$	$\displaystyle\leq\gamma\max\Big{\{}\Big{\\|}\left(I-\gamma\underline{P}^{\pi,V}\right)^{-1}\Big{(}\underline{\widehat{P}}^{\pi,\widehat{V}}\widehat{V}^{\pi,\sigma}-\underline{P}^{\pi,\widehat{V}}\widehat{V}^{\pi,\sigma}\Big{)}\Big{\\|}_{\infty},$
		$\displaystyle\qquad\Big{\\|}\left(I-\gamma\underline{P}^{\pi,\widehat{V}}\right)^{-1}\Big{(}\underline{\widehat{P}}^{\pi,\widehat{V}}\widehat{V}^{\pi,\sigma}-\underline{P}^{\pi,\widehat{V}}\widehat{V}^{\pi,\sigma}\Big{)}\Big{\\|}_{\infty}\Big{\}}.$		(54)

By decomposing the error in a symmetric way, we can similarly obtain

	$\displaystyle\big{\\|}\widehat{V}^{\pi,\sigma}-V^{\pi,\sigma}\big{\\|}_{\infty}$	$\displaystyle\leq\gamma\max\Big{\{}\Big{\\|}\left(I-\gamma\underline{\widehat{P}}^{\pi,V}\right)^{-1}\Big{(}\underline{\widehat{P}}^{\pi,V}V^{\pi,\sigma}-\underline{P}^{\pi,V}V^{\pi,\sigma}\Big{)}\Big{\\|}_{\infty},$
		$\displaystyle\qquad\Big{\\|}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi,\widehat{V}}\Big{)}^{-1}\Big{(}\underline{\widehat{P}}^{\pi,V}V^{\pi,\sigma}-\underline{P}^{\pi,V}V^{\pi,\sigma}\Big{)}\Big{\\|}_{\infty}\Big{\}}.$		(55)

With the above facts in mind, we are ready to control the two terms $\big{\|}\widehat{V}^{\pi^{\star},\sigma}-V^{\pi^{\star},\sigma}\big{\|}_{\infty}$ and $\big{\|}\widehat{V}^{\widehat{\pi},\sigma}-V^{\widehat{\pi},\sigma}\big{\|}_{\infty}$ in (51) separately. More specifically, taking $\pi=\pi^{\star}$ , applying (5.2.2) leads to

	$\displaystyle\big{\\|}\widehat{V}^{\pi^{\star},\sigma}-V^{\pi^{\star},\sigma}\big{\\|}_{\infty}$	$\displaystyle\leq\gamma\max\Big{\{}\Big{\\|}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{-1}\Big{(}\underline{\widehat{P}}^{\pi^{\star},V}V^{\pi^{\star},\sigma}-\underline{P}^{\pi^{\star},V}V^{\pi^{\star},\sigma}\Big{)}\Big{\\|}_{\infty},$
		$\displaystyle\qquad\Big{\\|}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},\widehat{V}}\Big{)}^{-1}\Big{(}\underline{\widehat{P}}^{\pi^{\star},V}V^{\pi^{\star},\sigma}-\underline{P}^{\pi^{\star},V}V^{\pi^{\star},\sigma}\Big{)}\Big{\\|}_{\infty}\Big{\}}.$		(56)

Similarly, taking $\pi=\widehat{\pi}$ , applying (5.2.2) leads to

	$\displaystyle\big{\\|}\widehat{V}^{\widehat{\pi},\sigma}-V^{\widehat{\pi},\sigma}\big{\\|}_{\infty}$	$\displaystyle\leq\gamma\max\Big{\{}\Big{\\|}\left(I-\gamma\underline{P}^{\widehat{\pi},\widehat{V}}\right)^{-1}\Big{(}\underline{\widehat{P}}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}-\underline{P}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}\Big{)}\Big{\\|}_{\infty},$
		$\displaystyle\qquad\Big{\\|}\left(I-\gamma\underline{P}^{\widehat{\pi},V}\right)^{-1}\Big{(}\underline{\widehat{P}}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}-\underline{P}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}\Big{)}\Big{\\|}_{\infty}\Big{\}}.$		(57)

Step 2: controlling $\big{\|}\widehat{V}^{\pi^{\star},\sigma}-V^{\pi^{\star},\sigma}\big{\|}_{\infty}$ and $\big{\|}\widehat{V}^{\widehat{\pi},\sigma}-V^{\widehat{\pi},\sigma}\big{\|}_{\infty}$ separately and summing up.

First, we introduce the following two lemmas that control the two main terms in (51), respectively. The first lemma controls the value function estimation error associated with the optimal policy $\pi^{\star}$ induced by the randomness of the generated dataset. The proof are postponed to Appendix B.3 and B.4.

Lemma 8.

Consider any $\delta\in(0,1)$ . With probability at least $1-\delta$ , taking $N\geq\frac{16\log(\frac{SAN}{\delta})}{(1-\gamma)^{2}}$ , one has

\displaystyle\left\|\widehat{V}^{\pi^{\star},\sigma}-V^{\pi^{\star},\sigma}\right\|_{\infty}

\displaystyle\leq 160\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}+\frac{8\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}.

(58)

Unlike the term $\|\widehat{V}^{\pi^{\star},\sigma}-V^{\pi^{\star},\sigma}\|_{\infty}$ associated with the fixed policy $\pi^{\star}$ (independent from the dataset), to control $\left\|\widehat{V}^{\widehat{\pi},\sigma}-V^{\widehat{\pi},\sigma}\right\|_{\infty}$ , we need to deal with the additional complicated statistical dependency between the learned policy $\widehat{\pi}$ and the empirical RMDP constructed by the dataset.

Lemma 9.

Taking $\varepsilon_{\mathsf{opt}}\leq\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{\gamma N}$ and $N\geq\frac{16\log(\frac{54SAN^{2}}{\delta})}{(1-\gamma)^{2}}$ , with probability at least $1-\delta$ , one has

\displaystyle\left\|\widehat{V}^{\widehat{\pi},\sigma}-V^{\widehat{\pi},\sigma}\right\|_{\infty}

\displaystyle\leq 24\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{\gamma^{3}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}+\frac{28\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N(1-\gamma)^{2}}.

(59)

Summing up the results in (58) and (59) and inserting back to (51) complete the proof as follows: taking $\varepsilon_{\mathsf{opt}}\leq\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{\gamma N}$ and $N\geq\frac{16\log(\frac{54SAN^{2}}{\delta})}{(1-\gamma)^{2}}$ , with probability at least $1-\delta$ ,

$\displaystyle\big{\\|}V^{\star,\sigma}-V^{\widehat{\pi},\sigma}\big{\\|}_{\infty}$	$\displaystyle\leq\big{\\|}V^{\pi^{\star},\sigma}-\widehat{V}^{\pi^{\star},\sigma}\big{\\|}_{\infty}+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma}+\big{\\|}\widehat{V}^{\widehat{\pi},\sigma}-V^{\widehat{\pi},\sigma}\big{\\|}_{\infty}$
	$\displaystyle\leq\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma}+160\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}+\frac{8\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}$
	$\displaystyle\quad+24\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{\gamma^{3}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}+\frac{28\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N(1-\gamma)^{2}}$
	$\displaystyle\leq 184\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{\gamma^{3}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}+\frac{36\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N(1-\gamma)^{2}}$
	$\displaystyle\leq 1508\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}},$	(60)

where the last inequality holds by $\gamma\geq\frac{1}{4}$ and $N\geq\frac{16\log(\frac{54SAN^{2}}{\delta})}{(1-\gamma)^{2}}$ .

5.3 Proof of the lower bound with TV distance: Theorem 2

To achieve a tight lower bound for robust MDPs, we construct new hard instances that are different from those for standard MDPs (Azar et al., 2013a, ), addressing two new challenges: 1) Due to robustness requirement, the recursive step (or bootstrapping) in robust MDPs has asymmetric structures over all states, since the worst-case transition probability depends on the value function and puts more weights on the states with lower values. Inspired by such an asymmetric structure, we develop new hard instances by setting larger rewards on the states with action-invariant transition kernels to achieve a tighter lower bound. Note that standard MDPs do not have such reward allocation challenges, where the bootstrapping step is determined by a fixed transition probability independent from the value function. 2) As the uncertainty level can vary within $0<\sigma\leq 1$ , for given any fixed uncertainty level $\sigma$ , a tailored $\sigma$ -dependent hard instance is required to achieve a tight lower bound, leading to the construction of a series of different instances as $\sigma$ varies. Instead, standard RL only needs to construct one hard instance (i.e., $\sigma=0$ ). By constructing a new class of hard instances addressing the above challenges, we develop a new lower bound in Theorem 2 that is tighter than prior art (Yang et al.,, 2022), which used an identical hard instance for all uncertainty levels $0<\sigma\leq 1$ .

5.3.1 Construction of the hard problem instances

Construction of two hard MDPs.

Suppose there are two standard MDPs defined as below:

\displaystyle\left\{\mathcal{M}_{\phi}=\left(\mathcal{S},\mathcal{A},P^{\phi},r,\gamma\right)\,|\,\phi=\{0,1\}\right\}.

Here, $\gamma$ is the discount parameter, ${\mathcal{S}}=\{0,1,\ldots,S-1\}$ is the state space. Given any state $s\in\{2,3,\cdots,S-1\}$ , the corresponding action space are $\mathcal{A}=\{0,1,2,\cdots,A-1\}$ . While for states $s=0$ or $s=1$ , the action space is only $\mathcal{A}^{\prime}=\{0,1\}$ . For any $\phi\in\{0,1\}$ , the transition kernel $P^{\phi}$ of the constructed MDP $\mathcal{M}_{\phi}$ is defined as

\displaystyle P^{\phi}(s^{\prime}\,|\,s,a)=\left\{\begin{array}[]{lll}p\mathds{1}(s^{\prime}=1)+(1-p)\mathds{1}(s^{\prime}=0)&\text{if}&(s,a)=(0,\phi)\\ q\mathds{1}(s^{\prime}=1)+(1-q)\mathds{1}(s^{\prime}=0)&\text{if}&(s,a)=(0,1-\phi)\\ \mathds{1}(s^{\prime}=1)&\text{if}&s\geq 1\end{array}\right.,

(64)

where $p$ and $q$ are set to satisfy

\displaystyle 0\leq p\leq 1\quad\text{ and }\quad 0\leq q=p-\Delta

(65)

for some $p$ and $\Delta>0$ that shall be introduced later. The above transition kernel $P^{\phi}$ implies that state $1$ is an absorbing state, namely, the MDP will always stay after it arrives at $1$ .

Then, we define the reward function as

\displaystyle r(s,a)=\left\{\begin{array}[]{lll}1&\text{if }s=1\\ 0&\text{otherwise}\end{array}\right..

(68)

Additionally, we choose the following initial state distribution:

\displaystyle\varphi(s)=\begin{cases}1,\quad&\text{if }s=0\\ 0,&\text{otherwise }\end{cases}.

(69)

Here, the constructed two instances are set with different probability transition from state $0$ with reward $0$ but not state $1$ with reward $1$ (which were used in standard MDPs (Li et al., 2022b, )), yielding a larger gap between the value functions of the two instances.

Uncertainty set of the transition kernels.

Recalling the uncertainty set assumed throughout this section is defined as $\mathcal{U}^{\sigma}(P^{\phi})$ with TV distance:

\displaystyle\mathcal{U}^{\sigma}(P^{\phi})\coloneqq\mathcal{U}^{\sigma}_{\mathsf{TV}}(P^{\phi})=\otimes\;\mathcal{U}^{\sigma}_{\mathsf{TV}}(P^{\phi}_{s,a}),\qquad

\displaystyle\mathcal{U}^{\sigma}_{\mathsf{TV}}(P^{\phi}_{s,a})\coloneqq\Big{\{}P^{\prime}_{s,a}\in\Delta({\mathcal{S}}):\frac{1}{2}\left\|P^{\prime}_{s,a}-P^{\phi}_{s,a}\right\|_{1}\leq\sigma\Big{\}},

(70)

where $P^{\phi}_{s,a}\coloneqq P^{\phi}(\cdot\,|\,s,a)$ is defined similar to (4). In addition, without loss of generality, we recall the radius $\sigma\in(0,1-c_{0}]$ with $0<c_{0}<1$ . With the uncertainty level in hand, taking $c_{1}\coloneqq\frac{c_{0}}{2}$ , $p$ and $\Delta$ which determines the instances obey

\displaystyle p=\left(1+c_{1}\right)\max\{1-\gamma,\sigma\}\qquad\text{and}\qquad\Delta\leq c_{1}\max\{1-\gamma,\sigma\},

(71)

which ensure $0\leq p\leq 1$ as follows:

\displaystyle\left(1+c_{1}\right)\sigma

\displaystyle\leq 1-c_{0}+c_{1}\sigma\leq 1-\frac{c_{0}}{2}<1,\qquad\left(1+c_{1}\right)(1-\gamma)\leq\frac{3}{2}(1-\gamma)\leq\frac{3}{4}<1.

(72)

Consequently, applying (65) directly leads to

\displaystyle p\geq q\geq\max\{1-\gamma,\sigma\}.

(73)

To continue, for any $(s,a,s^{\prime})\in{\mathcal{S}}\times\mathcal{A}\times{\mathcal{S}}$ , we denote the infimum probability of moving to the next state $s^{\prime}$ associated with any perturbed transition kernel $P_{s,a}\in\mathcal{U}^{\sigma}(P^{\phi}_{s,a})$ as

\displaystyle\underline{P}^{\phi}(s^{\prime}\,|\,s,a)

\displaystyle\coloneqq\inf_{P_{s,a}\in\mathcal{U}^{\sigma}(P^{\phi}_{s,a})}P(s^{\prime}\,|\,s,a)=\max\{P(s^{\prime}\,|\,s,a)-\sigma,0\},

(74)

where the last equation can be easily verified by the definition of $\mathcal{U}^{\sigma}(P^{\phi})$ in (70). As shall be seen, the transition from state $0$ to state $1$ plays an important role in the analysis, for convenience, we denote

\displaystyle\underline{p}

\displaystyle\coloneqq\underline{P}^{\phi}(1\,|\,0,\phi)=p-\sigma,\qquad\underline{q}\coloneqq\underline{P}^{\phi}(1\,|\,0,1-\phi)=q-\sigma,

(75)

which follows from the fact that $p\geq q\geq\sigma$ in (73).

Robust value functions and robust optimal policies.

To proceed, we are ready to derive the corresponding robust value functions, identify the optimal policies, and characterize the optimal values. For any MDP $\mathcal{M}_{\phi}$ with the above uncertainty set, we denote $\pi^{\star}_{\phi}$ as the optimal policy, and the robust value function of any policy $\pi$ (resp. the optimal policy $\pi^{\star}_{\phi}$ ) as $V^{\pi,\sigma}_{\phi}$ (resp. $V^{\star,\sigma}_{\phi}$ ). Then, we introduce the following lemma which describes some important properties of the robust (optimal) value functions and optimal policies. The proof is postponed to Appendix C.1.

Lemma 10.

For any $\phi=\{0,1\}$ and any policy $\pi$ , the robust value function obeys

\displaystyle V^{\pi,\sigma}_{\phi}(0)=\frac{\gamma\big{(}z_{\phi}^{\pi}-\sigma\big{)}}{(1-\gamma)\bigg{(}1+\frac{\gamma\left(z_{\phi}^{\pi}-\sigma\right)}{1-\gamma\left(1-\sigma\right)}\bigg{)}\left(1-\gamma\left(1-\sigma\right)\right)},

(76)

where $z_{\phi}^{\pi}$ is defined as

\displaystyle z_{\phi}^{\pi}\coloneqq p\pi(\phi\,|\,0)+q\pi(1-\phi\,|\,0).

(77)

In addition, the robust optimal value functions and the robust optimal policies satisfy


$\displaystyle V_{\phi}^{\star,\sigma}(0)$	$\displaystyle=\frac{\gamma\left(p-\sigma\right)}{(1-\gamma)\left(1+\frac{\gamma\left(p-\sigma\right)}{1-\gamma\left(1-\sigma\right)}\right)\left(1-\gamma\left(1-\sigma\right)\right)},$	(78a)
$\displaystyle\pi_{\phi}^{\star}(\phi\,\|\,s)$	$\displaystyle=1,\qquad\qquad\text{ for }s\in{\mathcal{S}}.$	(78b)

5.3.2 Establishing the minimax lower bound

Note that our goal is to control the quantity w.r.t. any policy estimator $\widehat{\pi}$ based on the chosen initial distribution $\varphi$ in (69) and the dataset consisting of $N$ samples over each state-action pair generated from the nominal transition kernel $P^{\phi}$ , which gives

\displaystyle\big{\langle}\varphi,V^{\star,\sigma}_{\phi}-V^{\widehat{\pi},\sigma}_{\phi}\big{\rangle}=V^{\star,\sigma}_{\phi}(0)-V^{\widehat{\pi},\sigma}_{\phi}(0).

Step 1: converting the goal to estimate $\phi$ .

We make the following useful claim which shall be verified in Appendix C.2: With $\varepsilon\leq\frac{c_{1}}{32(1-\gamma)}$ , letting

\displaystyle\Delta=32(1-\gamma)\max\{1-\gamma,\sigma\}\varepsilon\leq c_{1}\max\{1-\gamma,\sigma\}

(79)

which satisfies (71), it leads to that for any policy $\widehat{\pi}$ ,

\displaystyle\big{\langle}\varphi,V^{\star,\sigma}_{\phi}-V^{\widehat{\pi},\sigma}_{\phi}\big{\rangle}\geq 2\varepsilon\big{(}1-\widehat{\pi}(\phi\,|\,0)\big{)}.

(80)

With this connection established between the policy $\widehat{\pi}$ and its sub-optimality gap as depicted in (80), we can now proceed to build an estimate for $\phi$ . Here, we denote $\mathbb{P}_{\phi}$ as the probability distribution when the MDP is $\mathcal{M}_{\phi}$ , where $\phi$ can take on values in the set $\{0,1\}$ .

Let’s assume momentarily that an estimated policy $\widehat{\pi}$ achieves

\displaystyle\mathbb{P}_{\phi}\big{\{}\big{\langle}\varphi,V^{\star,\sigma}_{\phi}-V^{\widehat{\pi},\sigma}_{\phi}\big{\rangle}\leq\varepsilon\big{\}}\geq\frac{7}{8},

(81)

then in view of (80), we necessarily have $\widehat{\pi}(\phi\,|\,0)\geq\frac{1}{2}$ with probability at least $\frac{7}{8}$ . With this in mind, we are motivated to construct the following estimate $\widehat{\phi}$ for $\phi\in\{0,1\}$ :

\displaystyle\widehat{\phi}=\arg\max_{a\in\{0,1\}}\,\widehat{\pi}(a\,|\,0),

(82)

which obeys

\displaystyle\mathbb{P}_{\phi}\big{\{}\widehat{\phi}=\phi\big{\}}\geq\mathbb{P}_{\phi}\big{\{}\widehat{\pi}(\phi\,|\,0)>1/2\big{\}}\geq\frac{7}{8}.

(83)

Subsequently, our aim is to demonstrate that (83) cannot occur without an adequate number of samples, which would in turn contradict (80).

Step 2: probability of error in testing two hypotheses.

Equipped with the aforementioned groundwork, we can now delve into differentiating between the two hypotheses $\phi\in\{0,1\}$ . To achieve this, we consider the concept of minimax probability of error, defined as follows:

p_{\mathrm{e}}\coloneqq\inf_{\psi}\max\big{\{}\mathbb{P}_{0}(\psi\neq 0),\,\mathbb{P}_{1}(\psi\neq 1)\big{\}}.

(84)

Here, the infimum is taken over all possible tests $\psi$ constructed from the samples generated from the nominal transition kernel $P^{\phi}$ .

Moving forward, let us denote $\mu_{\phi}$ (resp. $\mu_{\phi}(s)$ ) as the distribution of a sample tuple $(s_{i},a_{i},s_{i}^{\prime})$ under the nominal transition kernel $P^{\phi}$ associated with $\mathcal{M}_{\phi}$ and the samples are generated independently. Applying standard results from Tsybakov, (2009, Theorem 2.2) and the additivity of the KL divergence (cf. Tsybakov, (2009, Page 85)), we obtain

	$\displaystyle p_{\mathrm{e}}$	$\displaystyle\geq\frac{1}{4}\exp\Big{(}-NSA\cdot\mathsf{KL}\big{(}\mu_{0}\parallel\mu_{1}\big{)}\Big{)}$
		$\displaystyle=\frac{1}{4}\exp\Big{\{}-N\Big{(}\mathsf{KL}\big{(}P^{0}(\cdot\,\|\,0,0)\parallel P^{1}(\cdot\,\|\,0,0)\big{)}+\mathsf{KL}\big{(}P^{0}(\cdot\,\|\,0,1)\parallel P^{1}(\cdot\,\|\,0,1)\big{)}\Big{)}\Big{\}},$		(85)

where the last inequality holds by observing that

	$\displaystyle\mathsf{KL}\big{(}\mu_{0}\parallel\mu_{1}\big{)}$	$\displaystyle=\frac{1}{SA}\sum_{s,a,s^{\prime}}\mathsf{KL}\big{(}P^{0}(s^{\prime}\,\|\,s,a)\parallel P^{1}(s^{\prime}\,\|\,s,a)\big{)}$
		$\displaystyle=\frac{1}{SA}\sum_{a\in\{0,1\}}\mathsf{KL}\big{(}P^{0}(\cdot\,\|\,0,a)\parallel P^{1}(\cdot\,\|\,0,a)\big{)},$

Here, the last equality holds by the fact that $P^{0}(\cdot\,|\,s,a)$ and $P^{1}(\cdot\,|\,s,a)$ only differ when $s=0$ .

Now, our focus shifts towards bounding the terms involving the KL divergence in (85). Given $p\geq q\geq\max\{1-\gamma,\sigma\}$ (cf. (73)), applying Tsybakov, (2009, Lemma 2.7) gives

$\displaystyle\mathsf{KL}\big{(}P^{0}(\cdot\,\|\,0,1)\parallel P^{1}(\cdot\,\|\,0,1)\big{)}$	$\displaystyle=\mathsf{KL}\left(p\parallel q\right)\leq\frac{(p-q)^{2}}{(1-p)p}\overset{\mathrm{(i)}}{=}\frac{\Delta^{2}}{p(1-p)}$
	$\displaystyle\overset{\mathrm{(ii)}}{=}\frac{1024(1-\gamma)^{2}\max\{1-\gamma,\sigma\}^{2}\varepsilon^{2}}{p(1-p)}$
	$\displaystyle\leq\frac{1024(1-\gamma)^{2}\max\{1-\gamma,\sigma\}\varepsilon^{2}}{1-p}\leq\frac{4096}{c_{1}}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}\varepsilon^{2},$	(86)

where (i) stems from the definition in (65), (ii) follows by the expression of $\Delta$ in (79), and the last inequality arises from $1-q\geq 1-p\geq\frac{c_{0}}{4}$ (see (72)).

Note that it can be shown that $\mathsf{KL}\big{(}P^{0}(\cdot\,|\,0,0)\parallel P^{1}(\cdot\,|\,0,0)\big{)}$ can be upper bounded in a same manner. Substituting (86) back into (85) demonstrates that: if the sample size is selected as

\displaystyle N\leq\frac{c_{1}\log 2}{8192(1-\gamma)^{2}\max\{1-\gamma,\sigma\}\varepsilon^{2}},

(87)

then one necessarily has

\displaystyle p_{\mathrm{e}}

\displaystyle\geq\frac{1}{4}\exp\bigg{\{}-N\frac{8192}{c_{1}}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}\varepsilon^{2}\bigg{\}}\geq\frac{1}{8},

(88)

Step 3: putting the results together.

Lastly, suppose that there exists an estimator $\widehat{\pi}$ such that

\mathbb{P}_{0}\big{\{}\big{\langle}\varphi,V_{0}^{\star,\sigma}-V_{0}^{\widehat{\pi},\sigma}\big{\rangle}>\varepsilon\big{\}}<\frac{1}{8}\qquad\text{and}\qquad\mathbb{P}_{1}\big{\{}\big{\langle}\varphi,V_{1}^{\star,\sigma}-V_{1}^{\widehat{\pi},\sigma}\big{\rangle}>\varepsilon\big{\}}<\frac{1}{8}.

According to Step 1, the estimator $\widehat{\phi}$ defined in (82) must satisfy

\mathbb{P}_{0}\big{(}\widehat{\phi}\neq 0\big{)}<\frac{1}{8}\qquad\text{and}\qquad\mathbb{P}_{1}\big{(}\widehat{\phi}\neq 1\big{)}<\frac{1}{8}.

However, this cannot occur under the sample size condition (87) to avoid contradiction with (88). Thus, we have completed the proof.

6 Offline distributionally robust RL with uniform coverage

In this section, we extend our theoretical analysis to broader sampling mechanism scenarios with offline datasets. We first specify the offline settings as below.

Offline/batch dataset.

Suppose that we observe a batch/historical dataset $\mathcal{D}^{\mathsf{b}}=\{(s_{i},a_{i},r_{i},s_{i}^{\prime})\}_{1\leq i\leq N_{\mathsf{b}}}$ consisting of $N_{\mathsf{b}}$ sample transitions generated independently. Specifically, the state-action pair $(s_{i},a_{i})$ is drawn from some behavior distribution $\mu^{\mathsf{b}}\in\Delta({\mathcal{S}}\times\mathcal{A})$ , followed by a next state $s_{i}^{\prime}$ drawn over the nominal transition kernel $P^{0}$ , i.e.,

\displaystyle(s_{i},a_{i})\overset{\text{i.i.d.}}{\sim}\mu^{\mathsf{b}}\quad\text{and}\quad s_{i}^{\prime}\overset{\text{i.i.d.}}{\sim}P^{0}(\cdot\,|\,s_{i},a_{i}),\qquad 1\leq i\leq N_{\mathsf{b}}.

(89)

We consider uniform coverage historical dataset that is widely studied in offline settings for both standard RL and robust RL (Liao et al.,, 2022; Chen and Jiang,, 2019; Jin et al., 2020b, ; Zhou et al.,, 2021; Yang et al.,, 2022), specified in the following assumption.

Assumption 1.

Suppose the historical dataset $\mathcal{D}^{\mathsf{b}}$ obeys

\displaystyle\mu_{\min}\coloneqq\min_{(s,a)\in{\mathcal{S}}\times\mathcal{A}}\mu^{\mathsf{b}}(s,a)>0.

(90)

Armed with the above dataset $\mathcal{D}^{\mathsf{b}}$ , the empirical nominal transition kernel $\widehat{P}^{0}\in\mathbb{R}^{SA\times S}$ can be constructed through (13) analogously. Then in such offline setting, we introduce the sample complexity upper bounds for DRVI and information-theoretical lower bounds in the cases of TV or $\chi^{2}$ divergence respectively. The proof of the following corollaries are postponed to Appendix F.

6.1 The case of TV distance

With above historical dataset $\mathcal{D}^{\mathsf{b}}$ in hand, we achieve the following corollary implied by Theorem 1.

Corollary 1 (Upper bound under TV distance).

Let the uncertainty set be $\mathcal{U}_{\rho}^{\sigma}(\cdot)=\mathcal{U}^{\sigma}_{\mathsf{TV}}(\cdot)$ defined in (9), and $C_{3},C_{4}>0$ be some large enough universal constants. Consider any discount factor $\gamma\in\left[\frac{1}{4},1\right)$ , uncertainty level $\sigma\in(0,1)$ , and $\delta\in(0,1)$ . Let $\widehat{\pi}$ be the output policy of Algorithm 1 after $T=C_{3}\log\big{(}\frac{N_{\mathsf{b}}}{1-\gamma}\big{)}$ iterations, based on a dataset $\mathcal{D}^{\mathsf{b}}$ satisfying Assumption 1. Then with probability at least $1-\delta$ , one has

\displaystyle\forall s\in{\mathcal{S}}:\quad V^{\star,\sigma}(s)-V^{\widehat{\pi},\sigma}(s)\leq\varepsilon

(91)

for any $\varepsilon\in\left(0,\sqrt{1/\max\{1-\gamma,\sigma\}}\right]$ , as long as the total number of samples obeys

\displaystyle N_{\mathsf{b}}\geq\frac{C_{4}}{\mu_{\min}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}\varepsilon^{2}}\log\left(\frac{N_{\mathsf{b}}SA}{(1-\gamma)\delta}\right).

(92)

We also derive a lower bound in the offline setting by adapting Theorem 2.

Corollary 2 (Lower bound under TV distance).

Let the uncertainty set be $\mathcal{U}_{\rho}^{\sigma}(\cdot)=\mathcal{U}^{\sigma}_{\mathsf{TV}}(\cdot)$ defined in (9). Consider any tuple $(S,\gamma,\sigma,\varepsilon,\mu_{\min})$ that obeys $\mu_{\min}>0$ , $\sigma\in(0,1-c_{0}]$ with $0<c_{0}\leq\frac{1}{8}$ being any small enough positive constant, $\gamma\in\left[\frac{1}{2},1\right)$ , and $\varepsilon\in\big{(}0,\frac{c_{0}}{256(1-\gamma)}\big{]}$ . We can construct two infinite-horizon RMDPs $\mathcal{M}_{0},\mathcal{M}_{1}$ , an initial state distribution $\varphi$ , and a dataset with $N_{\mathsf{b}}$ samples satisfying Assumption 1 (for $\mathcal{M}_{0}$ and $\mathcal{M}_{1}$ respectively) such that

\inf_{\widehat{\pi}}\max\left\{\mathbb{P}_{0}\big{(}V^{\star,\sigma}(\varphi)-V^{\widehat{\pi},\sigma}(\varphi)>\varepsilon\big{)},\,\mathbb{P}_{1}\big{(}V^{\star,\sigma}(\varphi)-V^{\widehat{\pi},\sigma}(\varphi)>\varepsilon\big{)}\right\}\geq\frac{1}{8},

provided that

N_{\mathsf{b}}\leq\frac{c_{0}\log 2}{8192\mu_{\min}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}\varepsilon^{2}}.

Here, the infimum is taken over all estimators $\widehat{\pi}$ , and $\mathbb{P}_{0}$ (resp. $\mathbb{P}_{1}$ ) denotes the probability when the RMDP is $\mathcal{M}_{0}$ (resp. $\mathcal{M}_{1}$ ).

Discussions.

In the offline setting with uniform coverage dataset (cf. Assumption 1), Corollary 1 shows that DRVI algorithm can find an $\varepsilon$ -optimal policy with the following sample complexity

\displaystyle\widetilde{O}\left(\frac{1}{\mu_{\min}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}\varepsilon^{2}}\right),

(93)

which is near minimax optimal with respect to all salient parameters (up to logarithmic factors) almost over the full range of the uncertainty level $\sigma$ , verified by the lower bound in Corollary 2. Our sample complexity upper bound (Corollary 1) significantly improves over the prior art $\widetilde{O}\left(\frac{S(2+\sigma)^{2}}{\mu_{\min}\sigma^{2}(1-\gamma)^{4}\varepsilon^{2}}\right)$ (Yang et al.,, 2022) by at least a factor of $\frac{S}{(1-\gamma)^{2}}$ , and even more than $\frac{S}{(1-\gamma)^{3}}$ when the uncertainty level $0<\sigma\lesssim 1-\gamma$ is small.

6.2 The case of $\chi^{2}$ divergence

With uncertainty sets measured by the $\chi^{2}$ divergence, we obtain the following upper bounds for DRVI and information-theoretical lower bounds, adapted from Theorem 3 and Theorem 4 respectively.

Corollary 3 (Upper bound under $\chi^{2}$ divergence).

Let the uncertainty set be $\mathcal{U}_{\rho}^{\sigma}(\cdot)=\mathcal{U}^{\sigma}_{\chi^{2}}(\cdot)$ specified by the $\chi^{2}$ divergence (cf. (10)), and $c_{1},c_{2}>0$ be some large enough universal constants. Consider any uncertainty level $\sigma\in(0,\infty)$ , $\gamma\in[1/4,1)$ and $\delta\in(0,1)$ . Given a dataset $\mathcal{D}^{\mathsf{b}}$ satisfying Assumption 1, with probability at least $1-\delta$ , the output policy $\widehat{\pi}$ from Algorithm 1 with at most $T=c_{1}\log\big{(}\frac{N_{\mathsf{b}}}{1-\gamma}\big{)}$ iterations yields

\displaystyle\forall s\in{\mathcal{S}}:\quad V^{\star,\sigma}(s)-V^{\widehat{\pi},\sigma}(s)\leq\varepsilon

(94)

for any $\varepsilon\in\big{(}0,\frac{1}{1-\gamma}\big{]}$ , as long as the total number of samples obeying

\displaystyle N_{\mathsf{b}}\geq\frac{c_{2}(1+\sigma)}{\mu_{\min}(1-\gamma)^{4}\varepsilon^{2}}\log\left(\frac{N_{\mathsf{b}}}{\mu_{\min}\delta}\right).

(95)

Corollary 4 (Lower bound under $\chi^{2}$ divergence).

Let the uncertainty set be $\mathcal{U}_{\rho}^{\sigma}(\cdot)=\mathcal{U}^{\sigma}_{\chi^{2}}(\cdot)$ , and $c_{3},c_{4}>0$ be some universal constants. Consider any tuple $(S,\gamma,\sigma,\varepsilon,\mu_{\min})$ obeying $\mu_{\min}>0$ , $\gamma\in[\frac{3}{4},1)$ , $\sigma\in(0,\infty)$ , and

\displaystyle\varepsilon

\displaystyle\leq c_{3}\begin{cases}\frac{1}{1-\gamma}\quad&\text{if }\sigma\in\left(0,\frac{1-\gamma}{4}\right)\\ \max\left\{\frac{1}{(1+\sigma)(1-\gamma)},1\right\}\quad&\text{if }\sigma\in\left[\frac{1-\gamma}{4},\infty\right).\end{cases}

(96)

Then we can construct two infinite-horizon RMDPs $\mathcal{M}_{0},\mathcal{M}_{1}$ , an initial state distribution $\varphi$ , and a dataset with $N_{\mathsf{b}}$ independent samples satisfying Assumption 1 over the nominal transition kernel (for $\mathcal{M}_{0}$ and $\mathcal{M}_{1}$ respectively), such that

\displaystyle\inf_{\widehat{\pi}}\max\left\{\mathbb{P}_{0}\big{(}V^{\star,\sigma}(\varphi)-V^{\widehat{\pi},\sigma}(\varphi)>\varepsilon\big{)},\,\mathbb{P}_{1}\big{(}V^{\star,\sigma}(\varphi)-V^{\widehat{\pi},\sigma}(\varphi)>\varepsilon\big{)}\right\}\geq\frac{1}{8},

(97)

provided that the total number of samples

\displaystyle N_{\mathsf{b}}\leq c_{4}\begin{cases}\frac{1}{\mu_{\min}(1-\gamma)^{3}\varepsilon^{2}}&\text{if }\sigma\in\left(0,\frac{1-\gamma}{4}\right)\\ \frac{\sigma}{\min\left\{1,(1-\gamma)^{4}(1+\sigma)^{4}\right\}\mu_{\min}\varepsilon^{2}}&\text{if }\sigma\in\left[\frac{1-\gamma}{4}.\infty\right)\end{cases}

(98)

Discussions.

Corollary 3 indicates that in the offline setting with uniform coverage dataset (cf. Assumption 1), DRVI can achieve $\varepsilon$ -accuracy for RMDPs under the $\chi^{2}$ divergence with a total number of samples on the order of

\displaystyle\widetilde{O}\left(\frac{(1+\sigma)}{\mu_{\min}(1-\gamma)^{4}\varepsilon^{2}}\right).

(99)

The above upper bound is relatively tight, since it matches the lower bound derived in Corollary 4 when the uncertainty level $\sigma\asymp 1$ and correctly captures the linear dependency with $\sigma$ when the uncertainty level $\sigma\gtrsim\frac{1}{(1-\gamma)^{3}}$ is large. In addition, it significantly improves upon the prior art $\widetilde{O}\left(\frac{S(1+\sigma)^{2}}{\mu_{\min}(\sqrt{1+\sigma}-1)^{2}(1-\gamma)^{4}\varepsilon^{2}}\right)$ (Yang et al.,, 2022) by at least a factor of $S(1+\sigma)$ .

7 Other related works

This section briefly discusses a small sample of other related works. We limit our discussions primarily to provable RL algorithms in the tabular setting with finite state and action spaces, which are most related to the current paper.

Finite-sample guarantees for standard RL.

A surge of recent research has utilized the toolkit from high-dimensional probability/statistics to investigate the performance of standard RL algorithms in non-asymptotic settings. There has been a considerable amount of research into non-asymptotic sample analysis of standard RL for a variety of settings; partial examples include, but are not limited to, the works via probably approximately correct (PAC) bounds for the generative model setting (Kearns and Singh,, 1999; Beck and Srikant,, 2012; Li et al., 2022a, ; Chen et al.,, 2020; Azar et al., 2013b, ; Sidford et al.,, 2018; Agarwal et al.,, 2020; Li et al., 2023a, ; Li et al., 2023b, ; Wainwright,, 2019) and the offline setting (Liao et al.,, 2022; Chen and Jiang,, 2019; Rashidinejad et al.,, 2021; Xie et al.,, 2021; Yin et al.,, 2021; Shi et al.,, 2022; Li et al.,, 2024; Jin et al.,, 2021; Yan et al.,, 2022; Woo et al.,, 2024; Uehara et al.,, 2022), as well as the online setting via both regret-based and PAC-base analyses (Jin et al.,, 2018; Bai et al.,, 2019; Li et al.,, 2021; Zhang et al., 2020b, ; Dong et al.,, 2019; Jin et al., 2020a, ; Li et al., 2023c, ; Jafarnia-Jahromi et al.,, 2020; Yang et al.,, 2021; Woo et al.,, 2023).

Robustness in RL.

While standard RL has achieved remarkable success, current RL algorithms still have significant drawbacks in that the learned policy could be completely off if the deployed environment is subject to perturbation, model mismatch, or other structural changes. To address these challenges, an emerging line of works begin to address robustness of RL algorithms with respect to the uncertainty or perturbation over different components of MDPs — state, action, reward, and the transition kernel; see Moos et al., (2022) for a recent review. Besides the framework of distributionally robust MDPs (RMDPs) (Iyengar,, 2005) adopted by this work, to promote robustness in RL, there exist various other works including but not limited to Zhang et al., 2020a ; Zhang et al., (2021); Han et al., (2022); Qiaoben et al., (2021); Sun et al., (2021); Xiong et al., (2022) investigating the robustness w.r.t. state uncertainty, where the agent’s policy is chosen based on a perturbed observation generated from the state by adding restricted noise or adversarial attack. Besides, Tessler et al., (2019); Tan et al., (2020) considered the robustness w.r.t. the uncertainty of the action, namely, the action is possibly distorted by an adversarial agent abruptly or smoothly, and Ding et al., (2023) tackles robustness against spurious correlations..

Distributionally robust RL.

Rooted in the literature of distributionally robust optimization, which has primarily been investigated in the context of supervised learning (Rahimian and Mehrotra,, 2019; Gao,, 2020; Bertsimas et al.,, 2018; Duchi and Namkoong,, 2018; Blanchet and Murthy,, 2019), distributionally robust dynamic programming and RMDPs have attracted considerable attention recently (Iyengar,, 2005; Xu and Mannor,, 2012; Wolff et al.,, 2012; Kaufman and Schaefer,, 2013; Ho et al.,, 2018; Smirnova et al.,, 2019; Ho et al.,, 2021; Goyal and Grand-Clement,, 2022; Derman and Mannor,, 2020; Tamar et al.,, 2014; Badrinath and Kalathil,, 2021). In the context of RMDPs, both empirical and theoretical studies have been widely conducted, although most prior theoretical analyses focus on planning with an exact knowledge of the uncertainty set (Iyengar,, 2005; Xu and Mannor,, 2012; Tamar et al.,, 2014), or are asymptotic in nature (Roy et al.,, 2017).

Resorting to the tools of high-dimensional statistics, various recent works begin to shift attention to understand the finite-sample performance of provable robust RL algorithms, under diverse data generating mechanisms and forms of the uncertainty set over the transition kernel. Besides the infinite-horizon setting, finite-sample complexity bounds for RMDPs with the TV distance and the $\chi^{2}$ divergence are also developed for the finite-horizon setting in Xu et al., (2023); Dong et al., (2022); Lu et al., (2024). In addition, many other forms of uncertainty sets have been considered. For example, Wang and Zou, (2021) considered a R-contamination uncertain set and proposed a provable robust Q-learning algorithm for the online setting with similar guarantees as standard MDPs. The KL divergence is another popular choice widely considered, where Yang et al., (2022); Panaganti and Kalathil, (2022); Zhou et al., (2021); Shi and Chi, (2022); Xu et al., (2023); Wang et al., 2023b ; Blanchet et al., (2023); Liu et al., (2022); Wang et al., 2023d ; Liang et al., (2023); Wang et al., 2023a investigated the sample complexity of both model-based and model-free algorithms under the simulator, offline settings, or single-trajectory setting. Xu et al., (2023) considered a variety of uncertainty sets including one associated with Wasserstein distance. Badrinath and Kalathil, (2021); Ramesh et al., (2023); Panaganti et al., (2022); Ma et al., (2022); Wang et al., (2024); Liu and Xu, 2024b ; Liu and Xu, 2024a considered function approximation settings. Moreover, various other related issues have been explored such as the difference of various uncertainty types (Wang et al., 2023c, ), the iteration complexity of the policy-based methods (Li et al., 2022c, ; Kumar et al.,, 2023; Li and Lan,, 2023), the case when the uncertainty level is instance-dependent small enough (Clavier et al.,, 2023), regularization-based robust RL (Yang et al.,, 2023; Zhang et al.,, 2023), and distributionally robust optimization for offline RL (Panaganti et al.,, 2023).

8 Discussions

This work has developed improved sample complexity bounds for learning RMDPs when the uncertainty set is measured via the TV distance or the $\chi^{2}$ divergence, assuming availability of a generative model. Our results have not only strengthened the prior art in both the upper and lower bounds, but have also unlocked curious insights into how the quest for distributional robustness impacts the sample complexity. As a key takeaway of this paper, RMDPs are not necessarily harder nor easier to learn than standard MDPs, as the answer depends — in a rather subtle manner — on the specific choice of the uncertainty set. For the case w.r.t. the TV distance, we have settled the minimax sample complexity for RMDPs, which is never larger than that required to learn standard MDPs. Regarding the case w.r.t. the $\chi^{2}$ divergence, we have uncovered that learning RMDPs can oftentimes be provably harder than the standard MDP counterpart. All in all, our findings help raise awareness that the choice of the uncertainty set not only represents a preference in robustness, but also exerts fundamental influences upon the intrinsic statistical complexity.

Moving forward, our work opens up numerous avenues for future studies, and we point out a few below.

•

Extensions to the finite-horizon setting. It is likely that our current analysis framework can be extended to tackle finite-horizon RMDPs, which would help complete our understanding for the tabular cases.
•

Improved analysis for the case of $\chi^{2}$ divergence. While we have settled the sample complexity of RMDPs with the TV distance, the upper and lower bounds we have developed for RMDPs w.r.t. the $\chi^{2}$ divergence still differ by some polynomial factor in the effective horizon. It would be of great interest to see how to close this gap.
•

A unified theory for other families of uncertainty sets. Our work raises an interesting question concerning how the geometry of the uncertainty sets intervenes the sample complexity. Characterizing the tight sample complexity for RMDPs under a more general family of uncertainty sets — such as using $\ell_{p}$ distance or $f$ -divergence, as well as $s$ -rectangular sets — would be highly desirable.
•

Instance-dependent sample complexity analyses. We note that we focus on understanding the minimax-optimal sample complexity of RMDPs, which might be rather pessimistic. When consider a given MDP, the feasible and reasonable magnitude of the uncertainty level $\sigma$ is limited by a certain instance-dependent finite threshold. It will be desirable to study instance-dependent sample complexity of RMDPs, which might shed better light on guiding the practice.

Acknowledgement

The work of L. Shi and Y. Chi is supported in part by the grants ONR N00014-19-1-2404, NSF CCF-2106778, DMS-2134080, and CNS-2148212. L. Shi is also gratefully supported by the Leo Finzi Memorial Fellowship, Wei Shen and Xuehong Zhang Presidential Fellowship, and Liang Ji-Dian Graduate Fellowship at Carnegie Mellon University. The work of Y. Wei is supported in part by the NSF grants DMS-2147546/2015447, CAREER award DMS-2143215, CCF-2106778, and the Google Research Scholar Award. The work of Y. Chen is supported in part by the Alfred P. Sloan Research Fellowship, the Google Research Scholar Award, the AFOSR grant FA9550-22-1-0198, the ONR grant N00014-22-1-2354, and the NSF grants CCF-2221009 and CCF-1907661. The authors also acknowledge Mengdi Xu, Zuxin Liu and He Wang for valuable discussions.

References

Agarwal et al., (2020) Agarwal, A., Kakade, S., and Yang, L. F. (2020). Model-based reinforcement learning with a generative model is minimax optimal. In Conference on Learning Theory, pages 67–83. PMLR.
(2) Azar, M., Munos, R., and Kappen, H. J. (2013a). Minimax pac bounds on the sample complexity of reinforcement learning with a generative model. Machine learning, 91:325–349.
(3) Azar, M. G., Munos, R., and Kappen, H. J. (2013b). Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine learning, 91(3):325–349.
Badrinath and Kalathil, (2021) Badrinath, K. P. and Kalathil, D. (2021). Robust reinforcement learning using least squares policy iteration with provable performance guarantees. In International Conference on Machine Learning, pages 511–520. PMLR.
Bai et al., (2019) Bai, Y., Xie, T., Jiang, N., and Wang, Y.-X. (2019). Provably efficient Q-learning with low switching cost. arXiv preprint arXiv:1905.12849.
Bäuerle and Glauner, (2022) Bäuerle, N. and Glauner, A. (2022). Distributionally robust markov decision processes and their connection to risk measures. Mathematics of Operations Research, 47(3):1757–1780.
Beck and Srikant, (2012) Beck, C. L. and Srikant, R. (2012). Error bounds for constant step-size Q-learning. Systems & control letters, 61(12):1203–1208.
Bertsimas et al., (2018) Bertsimas, D., Gupta, V., and Kallus, N. (2018). Data-driven robust optimization. Mathematical Programming, 167(2):235–292.
Bertsimas et al., (2019) Bertsimas, D., Sim, M., and Zhang, M. (2019). Adaptive distributionally robust optimization. Management Science, 65(2):604–618.
Blanchet et al., (2023) Blanchet, J., Lu, M., Zhang, T., and Zhong, H. (2023). Double pessimism is provably efficient for distributionally robust offline reinforcement learning: Generic algorithm and robust partial coverage. arXiv preprint arXiv:2305.09659.
Blanchet and Murthy, (2019) Blanchet, J. and Murthy, K. (2019). Quantifying distributional model risk via optimal transport. Mathematics of Operations Research, 44(2):565–600.
Cai et al., (2016) Cai, J.-F., Qu, X., Xu, W., and Ye, G.-B. (2016). Robust recovery of complex exponential signals from random gaussian projections via low rank hankel matrix reconstruction. Applied and computational harmonic analysis, 41(2):470–490.
Chen and Jiang, (2019) Chen, J. and Jiang, N. (2019). Information-theoretic considerations in batch reinforcement learning. In International Conference on Machine Learning, pages 1042–1051. PMLR.
Chen et al., (2020) Chen, Z., Maguluri, S. T., Shakkottai, S., and Shanmugam, K. (2020). Finite-sample analysis of stochastic approximation using smooth convex envelopes. arXiv preprint arXiv:2002.00874.
Chen et al., (2019) Chen, Z., Sim, M., and Xu, H. (2019). Distributionally robust optimization with infinitely constrained ambiguity sets. Operations Research, 67(5):1328–1344.
Clavier et al., (2023) Clavier, P., Pennec, E. L., and Geist, M. (2023). Towards minimax optimality of model-based robust reinforcement learning. arXiv preprint arXiv:2302.05372v1.
de Castro Silva et al., (2003) de Castro Silva, J., Soma, N., and Maculan, N. (2003). A greedy search for the three-dimensional bin packing problem: the packing static stability case. International Transactions in Operational Research, 10(2):141–153.
Derman and Mannor, (2020) Derman, E. and Mannor, S. (2020). Distributional robustness and regularization in reinforcement learning. arXiv preprint arXiv:2003.02894.
Ding et al., (2023) Ding, W., Shi, L., Chi, Y., and Zhao, D. (2023). Seeing is not believing: Robust reinforcement learning against spurious correlation. In Thirty-seventh Conference on Neural Information Processing Systems.
Dong et al., (2022) Dong, J., Li, J., Wang, B., and Zhang, J. (2022). Online policy optimization for robust MDP. arXiv preprint arXiv:2209.13841.
Dong et al., (2019) Dong, K., Wang, Y., Chen, X., and Wang, L. (2019). Q-learning with UCB exploration is sample efficient for infinite-horizon MDP. arXiv preprint arXiv:1901.09311.
Duchi and Namkoong, (2018) Duchi, J. and Namkoong, H. (2018). Learning models with uniform performance via distributionally robust optimization. arXiv preprint arXiv:1810.08750.
Fatemi et al., (2021) Fatemi, M., Killian, T. W., Subramanian, J., and Ghassemi, M. (2021). Medical dead-ends and learning to identify high-risk states and treatments. Advances in Neural Information Processing Systems, 34:4856–4870.
Gao, (2020) Gao, R. (2020). Finite-sample guarantees for wasserstein distributionally robust optimization: Breaking the curse of dimensionality. arXiv preprint arXiv:2009.04382.
Goyal and Grand-Clement, (2022) Goyal, V. and Grand-Clement, J. (2022). Robust markov decision processes: Beyond rectangularity. Mathematics of Operations Research.
Han et al., (2022) Han, S., Su, S., He, S., Han, S., Yang, H., and Miao, F. (2022). What is the solution for state adversarial multi-agent reinforcement learning? arXiv preprint arXiv:2212.02705.
Ho et al., (2018) Ho, C. P., Petrik, M., and Wiesemann, W. (2018). Fast bellman updates for robust MDPs. In International Conference on Machine Learning, pages 1979–1988. PMLR.
Ho et al., (2021) Ho, C. P., Petrik, M., and Wiesemann, W. (2021). Partial policy iteration for l1-robust markov decision processes. Journal of Machine Learning Research, 22(275):1–46.
Iyengar, (2005) Iyengar, G. N. (2005). Robust dynamic programming. Mathematics of Operations Research, 30(2):257–280.
Jafarnia-Jahromi et al., (2020) Jafarnia-Jahromi, M., Wei, C.-Y., Jain, R., and Luo, H. (2020). A model-free learning algorithm for infinite-horizon average-reward MDPs with near-optimal regret. arXiv preprint arXiv:2006.04354.
Jin et al., (2018) Jin, C., Allen-Zhu, Z., Bubeck, S., and Jordan, M. I. (2018). Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4863–4873.
(32) Jin, C., Krishnamurthy, A., Simchowitz, M., and Yu, T. (2020a). Reward-free exploration for reinforcement learning. In International Conference on Machine Learning, pages 4870–4879. PMLR.
(33) Jin, C., Yang, Z., Wang, Z., and Jordan, M. I. (2020b). Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR.
Jin et al., (2021) Jin, Y., Yang, Z., and Wang, Z. (2021). Is pessimism provably efficient for offline RL? In International Conference on Machine Learning, pages 5084–5096.
Kaufman and Schaefer, (2013) Kaufman, D. L. and Schaefer, A. J. (2013). Robust modified policy iteration. INFORMS Journal on Computing, 25(3):396–410.
Kearns and Singh, (1999) Kearns, M. J. and Singh, S. P. (1999). Finite-sample convergence rates for Q-learning and indirect algorithms. In Advances in neural information processing systems, pages 996–1002.
Klopp et al., (2017) Klopp, O., Lounici, K., and Tsybakov, A. B. (2017). Robust matrix completion. Probability Theory and Related Fields, 169(1-2):523–564.
Kober et al., (2013) Kober, J., Bagnell, J. A., and Peters, J. (2013). Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274.
Kumar et al., (2023) Kumar, N., Derman, E., Geist, M., Levy, K., and Mannor, S. (2023). Policy gradient for s-rectangular robust markov decision processes. arXiv preprint arXiv:2301.13589.
Lam, (2019) Lam, H. (2019). Recovering best statistical guarantees via the empirical divergence-based distributionally robust optimization. Operations Research, 67(4):1090–1105.
Lee et al., (2021) Lee, J., Jeon, W., Lee, B., Pineau, J., and Kim, K.-E. (2021). Optidice: Offline policy optimization via stationary distribution correction estimation. In International Conference on Machine Learning, pages 6120–6130. PMLR.
(42) Li, G., Cai, C., Chen, Y., Wei, Y., and Chi, Y. (2023a). Is Q-learning minimax optimal? a tight sample complexity analysis. Operations Research.
(43) Li, G., Chi, Y., Wei, Y., and Chen, Y. (2022a). Minimax-optimal multi-agent RL in Markov games with a generative model. Neural Information Processing Systems.
(44) Li, G., Shi, L., Chen, Y., Chi, Y., and Wei, Y. (2022b). Settling the sample complexity of model-based offline reinforcement learning. arXiv preprint arXiv:2204.05275.
Li et al., (2024) Li, G., Shi, L., Chen, Y., Chi, Y., and Wei, Y. (2024). Settling the sample complexity of model-based offline reinforcement learning. The Annals of Statistics, 52(1):233–260.
Li et al., (2021) Li, G., Shi, L., Chen, Y., Gu, Y., and Chi, Y. (2021). Breaking the sample complexity barrier to regret-optimal model-free reinforcement learning. Advances in Neural Information Processing Systems, 34.
(47) Li, G., Wei, Y., Chi, Y., and Chen, Y. (2023b). Breaking the sample size barrier in model-based reinforcement learning with a generative model. accepted to Operations Research.
(48) Li, G., Yan, Y., Chen, Y., and Fan, J. (2023c). Minimax-optimal reward-agnostic exploration in reinforcement learning. arXiv preprint arXiv:2304.07278.
Li and Lan, (2023) Li, Y. and Lan, G. (2023). First-order policy optimization for robust policy evaluation. arXiv preprint arXiv:2307.15890.
(50) Li, Y., Zhao, T., and Lan, G. (2022c). First-order policy optimization for robust markov decision process. arXiv preprint arXiv:2209.10579.
Liang et al., (2023) Liang, Z., Ma, X., Blanchet, J., Zhang, J., and Zhou, Z. (2023). Single-trajectory distributionally robust reinforcement learning. arXiv preprint arXiv:2301.11721.
Liao et al., (2022) Liao, P., Qi, Z., Wan, R., Klasnja, P., and Murphy, S. A. (2022). Batch policy learning in average reward markov decision processes. Annals of statistics, 50(6):3364.
Liu et al., (2019) Liu, S., Ngiam, K. Y., and Feng, M. (2019). Deep reinforcement learning for clinical decision support: a brief survey. arXiv preprint arXiv:1907.09475.
Liu et al., (2022) Liu, Z., Bai, Q., Blanchet, J., Dong, P., Xu, W., Zhou, Z., and Zhou, Z. (2022). Distributionally robust Q-learning. In International Conference on Machine Learning, pages 13623–13643. PMLR.
(55) Liu, Z. and Xu, P. (2024a). Distributionally robust off-dynamics reinforcement learning: Provable efficiency with linear function approximation. arXiv preprint arXiv:2402.15399.
(56) Liu, Z. and Xu, P. (2024b). Minimax optimal and computationally efficient algorithms for distributionally robust offline reinforcement learning. arXiv preprint arXiv:2403.09621.
Lu et al., (2024) Lu, M., Zhong, H., Zhang, T., and Blanchet, J. (2024). Distributionally robust reinforcement learning with interactive data collection: Fundamental hardness and near-optimal algorithm. arXiv preprint arXiv:2404.03578.
Ma et al., (2022) Ma, X., Liang, Z., Blanchet, J., Liu, M., Xia, L., Zhang, J., Zhao, Q., and Zhou, Z. (2022). Distributionally robust offline reinforcement learning with linear function approximation. arXiv preprint arXiv:2209.06620.
Mahmood et al., (2018) Mahmood, A. R., Korenkevych, D., Vasan, G., Ma, W., and Bergstra, J. (2018). Benchmarking reinforcement learning algorithms on real-world robots. In Conference on robot learning, pages 561–591. PMLR.
Mnih et al., (2013) Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
Moos et al., (2022) Moos, J., Hansel, K., Abdulsamad, H., Stark, S., Clever, D., and Peters, J. (2022). Robust reinforcement learning: A review of foundations and recent advances. Machine Learning and Knowledge Extraction, 4(1):276–315.
Nilim and El Ghaoui, (2005) Nilim, A. and El Ghaoui, L. (2005). Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798.
OpenAI, (2023) OpenAI (2023). Gpt-4 technical report.
Pan et al., (2023) Pan, Y., Chen, Y., and Lin, F. (2023). Adjustable robust reinforcement learning for online 3d bin packing. arXiv preprint arXiv:2310.04323.
Panaganti and Kalathil, (2022) Panaganti, K. and Kalathil, D. (2022). Sample complexity of robust reinforcement learning with a generative model. In International Conference on Artificial Intelligence and Statistics, pages 9582–9602. PMLR.
Panaganti et al., (2022) Panaganti, K., Xu, Z., Kalathil, D., and Ghavamzadeh, M. (2022). Robust reinforcement learning using offline data. Advances in neural information processing systems, 35:32211–32224.
Panaganti et al., (2023) Panaganti, K., Xu, Z., Kalathil, D., and Ghavamzadeh, M. (2023). Bridging distributionally robust learning and offline rl: An approach to mitigate distribution shift and partial data coverage. arXiv preprint arXiv:2310.18434.
Park and Van Roy, (2015) Park, B. and Van Roy, B. (2015). Adaptive execution: Exploration and learning of price impact. Operations Research, 63(5):1058–1076.
Qiaoben et al., (2021) Qiaoben, Y., Zhou, X., Ying, C., and Zhu, J. (2021). Strategically-timed state-observation attacks on deep reinforcement learning agents. In ICML 2021 Workshop on Adversarial Machine Learning.
Qu et al., (2022) Qu, G., Wierman, A., and Li, N. (2022). Scalable reinforcement learning for multiagent networked systems. Operations Research, 70(6):3601–3628.
Rahimian and Mehrotra, (2019) Rahimian, H. and Mehrotra, S. (2019). Distributionally robust optimization: A review. arXiv preprint arXiv:1908.05659.
Ramesh et al., (2023) Ramesh, S. S., Sessa, P. G., Hu, Y., Krause, A., and Bogunovic, I. (2023). Distributionally robust model-based reinforcement learning with large state spaces. arXiv preprint arXiv:2309.02236.
Rashidinejad et al., (2021) Rashidinejad, P., Zhu, B., Ma, C., Jiao, J., and Russell, S. (2021). Bridging offline reinforcement learning and imitation learning: A tale of pessimism. Neural Information Processing Systems (NeurIPS).
Roy et al., (2017) Roy, A., Xu, H., and Pokutta, S. (2017). Reinforcement learning under model mismatch. Advances in neural information processing systems, 30.
Shi and Chi, (2022) Shi, L. and Chi, Y. (2022). Distributionally robust model-based offline reinforcement learning with near-optimal sample complexity. arXiv preprint arXiv:2208.05767.
Shi et al., (2022) Shi, L., Li, G., Wei, Y., Chen, Y., and Chi, Y. (2022). Pessimistic Q-learning for offline reinforcement learning: Towards optimal sample complexity. In Proceedings of the 39th International Conference on Machine Learning, volume 162, pages 19967–20025. PMLR.
Sidford et al., (2018) Sidford, A., Wang, M., Wu, X., Yang, L., and Ye, Y. (2018). Near-optimal time and sample complexities for solving Markov decision processes with a generative model. In Advances in Neural Information Processing Systems, pages 5186–5196.
Smirnova et al., (2019) Smirnova, E., Dohmatob, E., and Mary, J. (2019). Distributionally robust reinforcement learning. arXiv preprint arXiv:1902.08708.
Sun et al., (2021) Sun, K., Liu, Y., Zhao, Y., Yao, H., Jui, S., and Kong, L. (2021). Exploring the training robustness of distributional reinforcement learning against noisy state observations. arXiv preprint arXiv:2109.08776.
Tamar et al., (2014) Tamar, A., Mannor, S., and Xu, H. (2014). Scaling up robust MDPs using function approximation. In International conference on machine learning, pages 181–189. PMLR.
Tan et al., (2020) Tan, K. L., Esfandiari, Y., Lee, X. Y., and Sarkar, S. (2020). Robustifying reinforcement learning agents via action space adversarial training. In 2020 American control conference (ACC), pages 3959–3964. IEEE.
Tessler et al., (2019) Tessler, C., Efroni, Y., and Mannor, S. (2019). Action robust reinforcement learning and applications in continuous control. In International Conference on Machine Learning, pages 6215–6224. PMLR.
Tsybakov, (2009) Tsybakov, A. B. (2009). Introduction to nonparametric estimation, volume 11. Springer.
Uehara et al., (2022) Uehara, M., Shi, C., and Kallus, N. (2022). A review of off-policy evaluation in reinforcement learning. arXiv preprint arXiv:2212.06355.
Vershynin, (2018) Vershynin, R. (2018). High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press.
Wainwright, (2019) Wainwright, M. J. (2019). Stochastic approximation with cone-contractive operators: Sharp $\ell_{\infty}$ -bounds for Q-learning. arXiv preprint arXiv:1905.06265.
Wang et al., (2024) Wang, H., Shi, L., and Chi, Y. (2024). Sample complexity of offline distributionally robust linear markov decision processes. arXiv preprint arXiv:2403.12946.
(88) Wang, K., Gadot, U., Kumar, N., Levy, K., and Mannor, S. (2023a). Robust reinforcement learning via adversarial kernel approximation. arXiv preprint arXiv:2306.05859.
(89) Wang, S., Si, N., Blanchet, J., and Zhou, Z. (2023b). A finite sample complexity bound for distributionally robust Q-learning. arXiv preprint arXiv:2302.13203.
(90) Wang, S., Si, N., Blanchet, J., and Zhou, Z. (2023c). On the foundation of distributionally robust reinforcement learning. arXiv preprint arXiv:2311.09018.
(91) Wang, S., Si, N., Blanchet, J., and Zhou, Z. (2023d). Sample complexity of variance-reduced distributionally robust Q-learning. arXiv preprint arXiv:2305.18420.
Wang and Zou, (2021) Wang, Y. and Zou, S. (2021). Online robust reinforcement learning with model uncertainty. Advances in Neural Information Processing Systems, 34.
Wiesemann et al., (2013) Wiesemann, W., Kuhn, D., and Rustem, B. (2013). Robust Markov decision processes. Mathematics of Operations Research, 38(1):153–183.
Wolff et al., (2012) Wolff, E. M., Topcu, U., and Murray, R. M. (2012). Robust control of uncertain Markov decision processes with temporal logic specifications. In 2012 IEEE 51st IEEE Conference on Decision and Control (CDC), pages 3372–3379. IEEE.
Woo et al., (2023) Woo, J., Joshi, G., and Chi, Y. (2023). The blessing of heterogeneity in federated Q-learning: Linear speedup and beyond. arXiv preprint arXiv:2305.10697.
Woo et al., (2024) Woo, J., Shi, L., Joshi, G., and Chi, Y. (2024). Federated offline reinforcement learning: Collaborative single-policy coverage suffices. arXiv preprint arXiv:2402.05876.
Xie et al., (2021) Xie, T., Jiang, N., Wang, H., Xiong, C., and Bai, Y. (2021). Policy finetuning: Bridging sample-efficient offline and online reinforcement learning. Advances in neural information processing systems, 34.
Xiong et al., (2022) Xiong, Z., Eappen, J., Zhu, H., and Jagannathan, S. (2022). Defending observation attacks in deep reinforcement learning via detection and denoising. arXiv preprint arXiv:2206.07188.
Xu and Mannor, (2012) Xu, H. and Mannor, S. (2012). Distributionally robust Markov decision processes. Mathematics of Operations Research, 37(2):288–300.
Xu et al., (2023) Xu, Z., Panaganti, K., and Kalathil, D. (2023). Improved sample complexity bounds for distributionally robust reinforcement learning. arXiv preprint arXiv:2303.02783.
Yan et al., (2022) Yan, Y., Li, G., Chen, Y., and Fan, J. (2022). The efficacy of pessimism in asynchronous Q-learning. arXiv preprint arXiv:2203.07368.
Yang et al., (2021) Yang, K., Yang, L., and Du, S. (2021). Q-learning with logarithmic regret. In International Conference on Artificial Intelligence and Statistics, pages 1576–1584. PMLR.
Yang et al., (2023) Yang, W., Wang, H., Kozuno, T., Jordan, S. M., and Zhang, Z. (2023). Avoiding model estimation in robust markov decision processes with a generative model. arXiv preprint arXiv:2302.01248.
Yang et al., (2022) Yang, W., Zhang, L., and Zhang, Z. (2022). Toward theoretical understandings of robust Markov decision processes: Sample complexity and asymptotics. The Annals of Statistics, 50(6):3223–3248.
Yin et al., (2021) Yin, M., Bai, Y., and Wang, Y.-X. (2021). Near-optimal offline reinforcement learning via double variance reduction. arXiv preprint arXiv:2102.01748.
Zhang et al., (2021) Zhang, H., Chen, H., Boning, D., and Hsieh, C.-J. (2021). Robust reinforcement learning on state observations with learned optimal adversary. arXiv preprint arXiv:2101.08452.
(107) Zhang, H., Chen, H., Xiao, C., Li, B., Liu, M., Boning, D., and Hsieh, C.-J. (2020a). Robust deep reinforcement learning against adversarial perturbations on state observations. Advances in Neural Information Processing Systems, 33:21024–21037.
Zhang et al., (2023) Zhang, R., Hu, Y., and Li, N. (2023). Regularized robust mdps and risk-sensitive mdps: Equivalence, policy gradient, and sample complexity. arXiv preprint arXiv:2306.11626.
(109) Zhang, Z., Zhou, Y., and Ji, X. (2020b). Almost optimal model-free reinforcement learning via reference-advantage decomposition. Advances in Neural Information Processing Systems, 33.
Zhao et al., (2021) Zhao, H., Yu, Y., and Xu, K. (2021). Learning efficient online 3d bin packing on packing configuration trees. In International conference on learning representations.
Zhou et al., (2021) Zhou, Z., Bai, Q., Zhou, Z., Qiu, L., Blanchet, J., and Glynn, P. (2021). Finite-sample regret bound for distributionally robust offline tabular reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 3331–3339. PMLR.
Ziegler et al., (2019) Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano, P., and Irving, G. (2019). Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593.

Appendix A Proof of the preliminaries

A.1 Proof of Lemma 1 and Lemma 2

Proof of Lemma 1.

To begin with, applying (Iyengar,, 2005, Lemma 4.3), the term of interest obeys

\displaystyle\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(P)}\mathcal{P}V

\displaystyle=\max_{\mu\in\mathbb{R}^{S},\mu\geq 0}\left\{P\left(V-\mu\right)-\sigma\left(\max_{s^{\prime}}\left\{V(s^{\prime})-\mu(s^{\prime})\right\}-\min_{s^{\prime}}\left\{V(s^{\prime})-\mu(s^{\prime})\right\}\right)\right\},

(100)

where $\mu(s^{\prime})$ represents the $s^{\prime}$ -th entry of $\mu\in\mathbb{R}^{S}$ . Denoting $\mu^{\star}$ as the optimal dual solution, taking $\alpha=\max_{s^{\prime}}\left\{V(s^{\prime})-\mu^{\star}(s^{\prime})\right\}$ , it is easily verified that $\mu^{\star}$ obeys

\displaystyle\mu^{\star}(s)=\begin{cases}V(s)-\alpha,&\text{if }V(s)>\alpha\\ 0,&\text{otherwise}.\end{cases}

(101)

Therefore, (100) can be solved by optimizing $\alpha$ as below (Iyengar,, 2005, Lemma 4.3):

\displaystyle\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(P)}\mathcal{P}V=\max_{\alpha\in\left[\min_{s}V(s),\max_{s}V(s)\right]}\left\{P\left[V\right]_{\alpha}-\sigma\left(\alpha-\min_{s^{\prime}}\left[V\right]_{\alpha}(s^{\prime})\right)\right\}.

(102)

Proof of Lemma 2.

Due to strong duality (Iyengar,, 2005, Lemma 4.2), it holds that

\displaystyle\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(P)}\mathcal{P}V

\displaystyle=\max_{\mu\in\mathbb{R}^{S},\mu\geq 0}\left\{P\left(V-\mu\right)-\sqrt{\sigma\mathsf{Var}_{P}\left(V-\mu\right)}\right\},

(103)

and the optimal $\mu^{\star}$ obeys

\displaystyle\mu^{\star}(s)=\begin{cases}V(s)-\alpha,&\text{if }V(s)>\alpha\\ 0,&\text{otherwise}.\end{cases}

(104)

for some $\alpha\in[\min_{s}V(s),\max_{s}V(s)]$ . As a result, solving (103) is equivalent to optimizing the scalar $\alpha$ as below:

\displaystyle\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(P)}\mathcal{P}V=\max_{\alpha\in[\min_{s}V(s),\max_{s}V(s)]}\left\{P[V]_{\alpha}-\sqrt{\sigma\mathsf{Var}_{P}\left([V]_{\alpha}\right)}\right\}.

(105)

A.2 Proof of Lemma 5

Applying the $\gamma$ -contraction property in Lemma 4 directly yields that for any $t\geq 0$ ,

	$\displaystyle\\|\widehat{Q}_{t}-\widehat{Q}^{\star,\sigma}\\|_{\infty}=\big{\\|}\widehat{{\mathcal{T}}}^{\sigma}(\widehat{Q}_{t-1})-\widehat{{\mathcal{T}}}^{\sigma}(\widehat{Q}^{\star,\sigma})\big{\\|}_{\infty}$	$\displaystyle\leq\gamma\big{\\|}\widehat{Q}_{t-1}-\widehat{Q}^{\star,\sigma}\big{\\|}_{\infty}$
		$\displaystyle\leq\cdots\leq\gamma^{t}\big{\\|}\widehat{Q}_{0}-\widehat{Q}^{\star,\sigma}\big{\\|}_{\infty}=\gamma^{t}\big{\\|}\widehat{Q}^{\star,\sigma}\big{\\|}_{\infty}\leq\frac{\gamma^{t}}{1-\gamma},$

where the last inequality holds by the fact $\|\widehat{Q}^{\star,\sigma}\|_{\infty}\leq\frac{1}{1-\gamma}$ (see Lemma 4). In addition,

\displaystyle\big{\|}\widehat{V}_{t}-\widehat{V}^{\star,\sigma}\big{\|}_{\infty}

\displaystyle=\max_{s\in{\mathcal{S}}}\Big{\|}\max_{a\in\mathcal{A}}\widehat{Q}_{t}(s,a)-\max_{a\in\mathcal{A}}\widehat{Q}^{\star,\sigma}(s,a)\Big{\|}_{\infty}\leq\big{\|}\widehat{Q}_{t}-\widehat{Q}^{\star,\sigma}\big{\|}_{\infty}\leq\frac{\gamma^{t}}{1-\gamma},

where the penultimate inequality holds by the maximum operator is $1$ -Lipschitz. This completes the proof of (47).

We now move to establish (48). Note that there exists at least one state $s_{0}\in{\mathcal{S}}$ that is associated with the maximum of the value gap, i.e.,

\displaystyle\big{\|}\widehat{V}^{\star,\sigma}-\widehat{V}^{\widehat{\pi},\sigma}\big{\|}_{\infty}=\widehat{V}^{\star,\sigma}(s_{0})-\widehat{V}^{\widehat{\pi},\sigma}(s_{0})\geq\widehat{V}^{\star,\sigma}(s)-\widehat{V}^{\widehat{\pi},\sigma}(s),\qquad\forall s\in{\mathcal{S}}.

Recall $\widehat{\pi}^{\star}$ is the optimal robust policy for the empirical RMDP $\widehat{\mathcal{M}}_{\mathsf{rob}}$ . For convenience, we denote $a_{1}=\widehat{\pi}^{\star}(s_{0})$ and $a_{2}=\widehat{\pi}(s_{0})$ . Then, since $\widehat{\pi}$ is the greedy policy w.r.t. $\widehat{Q}_{T}$ , one has

\displaystyle r(s_{0},a_{1})+\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(\widehat{P}^{0}_{s_{0},a_{1}})}\mathcal{P}\widehat{V}_{T-1}=\widehat{Q}_{T}(s_{0},a_{1})\leq\widehat{Q}_{T}(s_{0},a_{2})=r(s_{0},a_{2})+\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(\widehat{P}^{0}_{s_{0},a_{2}})}\mathcal{P}\widehat{V}_{T-1}.

(106)

Recalling the notation in (37), the above fact and (48) altogether yield

$\displaystyle r(s_{0},a_{1})+\gamma\widehat{P}_{s_{0},a_{1}}^{\widehat{V}_{T-1}}\left(\widehat{V}^{\star,\sigma}-\varepsilon_{\mathsf{opt}}{1}\right)$	$\displaystyle\leq r(s_{0},a_{1})+\gamma\widehat{P}_{s_{0},a_{1}}^{\widehat{V}_{T-1}}\widehat{V}_{T-1}$
	$\displaystyle\leq r(s_{0},a_{2})+\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(\widehat{P}^{0}_{s_{0},a_{2}})}\mathcal{P}\widehat{V}_{T-1}$
	$\displaystyle\overset{\mathrm{(i)}}{\leq}r(s_{0},a_{2})+\gamma\widehat{P}_{s_{0},a_{2}}^{\widehat{V}^{\widehat{\pi},\sigma}}\widehat{V}_{T-1}$
	$\displaystyle\leq r(s_{0},a_{2})+\gamma\widehat{P}_{s_{0},a_{2}}^{\widehat{V}^{\widehat{\pi},\sigma}}\left(\widehat{V}^{\star,\sigma}+\varepsilon_{\mathsf{opt}}{1}\right),$	(107)

where (i) follows from the optimality criteria. The term of interest can be controlled as

		$\displaystyle\big{\\|}\widehat{V}^{\star,\sigma}-\widehat{V}^{\widehat{\pi},\sigma}\big{\\|}_{\infty}$
		$\displaystyle=\widehat{V}^{\star,\sigma}(s_{0})-\widehat{V}^{\widehat{\pi},\sigma}(s_{0})$
		$\displaystyle=r(s_{0},a_{1})+\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(\widehat{P}^{0}_{s_{0},a_{1}})}\mathcal{P}\widehat{V}^{\star,\sigma}-\bigg{(}r(s_{0},a_{2})+\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(\widehat{P}^{0}_{s_{0},a_{2}})}\mathcal{P}\widehat{V}^{\widehat{\pi},\sigma}\bigg{)}$
		$\displaystyle=r(s_{0},a_{1})-r(s_{0},a_{2})+\gamma\bigg{(}\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(\widehat{P}^{0}_{s_{0},a_{1}})}\mathcal{P}\widehat{V}^{\star,\sigma}-\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(\widehat{P}^{0}_{s_{0},a_{2}})}\mathcal{P}\widehat{V}^{\widehat{\pi},\sigma}\bigg{)}$
		$\displaystyle\overset{\mathrm{(i)}}{\leq}2\gamma\varepsilon_{\mathsf{opt}}+\gamma\bigg{(}\widehat{P}_{s_{0},a_{2}}^{\widehat{V}^{\widehat{\pi},\sigma}}\widehat{V}^{\star,\sigma}-\widehat{P}_{s_{0},a_{1}}^{\widehat{V}_{T-1}}\widehat{V}^{\star,\sigma}+\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(\widehat{P}^{0}_{s_{0},a_{1}})}\mathcal{P}\widehat{V}^{\star,\sigma}-\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(\widehat{P}^{0}_{s_{0},a_{2}})}\mathcal{P}\widehat{V}^{\widehat{\pi},\sigma}\bigg{)}$
		$\displaystyle=2\gamma\varepsilon_{\mathsf{opt}}+\gamma\bigg{(}\widehat{P}_{s_{0},a_{2}}^{\widehat{V}^{\widehat{\pi},\sigma}}\widehat{V}^{\star,\sigma}-\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(\widehat{P}^{0}_{s_{0},a_{2}})}\mathcal{P}\widehat{V}^{\widehat{\pi},\sigma}\bigg{)}+\gamma\bigg{(}\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(\widehat{P}^{0}_{s_{0},a_{1}})}\mathcal{P}\widehat{V}^{\star,\sigma}-\widehat{P}_{s_{0},a_{1}}^{\widehat{V}_{T-1}}\widehat{V}^{\star,\sigma}\bigg{)}$
		$\displaystyle\overset{\mathrm{(ii)}}{\leq}2\gamma\varepsilon_{\mathsf{opt}}+\gamma\widehat{P}_{s_{0},a_{2}}^{\widehat{V}^{\widehat{\pi},\sigma}}\big{(}\widehat{V}^{\star,\sigma}-\widehat{V}^{\widehat{\pi},\sigma}\big{)}+\gamma\bigg{(}\widehat{P}_{s_{0},a_{1}}^{\widehat{V}_{T-1}}\widehat{V}^{\star,\sigma}-\widehat{P}_{s_{0},a_{1}}^{\widehat{V}_{T-1}}\widehat{V}^{\star,\sigma}\bigg{)}$
		$\displaystyle\leq 2\gamma\varepsilon_{\mathsf{opt}}+\gamma\big{\\|}\widehat{V}^{\star,\sigma}-\widehat{V}^{\widehat{\pi},\sigma}\big{\\|}_{\infty},$		(108)

where $\mathrm{(i)}$ holds by plugging in (107), and (ii) follows from $\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(\widehat{P}^{0}_{s_{0},a_{1}})}\mathcal{P}\widehat{V}^{\star,\sigma}\leq\mathcal{P}\widehat{V}^{\star,\sigma}$ for any $\mathcal{P}\in\mathcal{U}^{\sigma}(\widehat{P}^{0}_{s_{0},a_{1}})$ . Rearranging (A.2) leads to

\big{\|}\widehat{V}^{\star,\sigma}-\widehat{V}^{\widehat{\pi},\sigma}\big{\|}_{\infty}\leq\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma}.

Appendix B Proof of the auxiliary lemmas for Theorem 1

B.1 Proof of Lemma 6

To begin, note that there at leasts exist one state $s_{0}$ for any $V^{\pi,\sigma}$ such that $V^{\pi,\sigma}(s_{0})=\min_{s\in{\mathcal{S}}}V^{\pi,\sigma}(s)$ . With this in mind, for any policy $\pi$ , one has by the definition in (5) and the Bellman’s equation (7a),

	$\displaystyle\max_{s\in{\mathcal{S}}}V^{\pi,\sigma}(s)$	$\displaystyle=\max_{s\in{\mathcal{S}}}\mathbb{E}_{a\sim\pi(\cdot\,\|\,s)}\Big{[}r(s,a)+\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(P_{s,a})}\mathcal{P}V^{\pi,\sigma}\Big{]}$
		$\displaystyle\leq\max_{(s,a)\in{\mathcal{S}}\times\mathcal{A}}\Big{(}1+\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(P_{s,a})}\mathcal{P}V^{\pi,\sigma}\Big{)},$

where the second line holds since the reward function $r(s,a)\in[0,1]$ for all $(s,a)\in{\mathcal{S}}\times\mathcal{A}$ . To continue, note that for any $(s,a)\in{\mathcal{S}}\times\mathcal{A}$ , there exists some $\widetilde{P}_{s,a}\in\mathbb{R}^{S}$ constructed by reducing the values of some elements of $P_{s,a}$ to obey $P_{s,a}\geq\widetilde{P}_{s,a}\geq 0$ and $\sum_{s^{\prime}}(P_{s,a}(s^{\prime})-\widetilde{P}_{s,a}(s^{\prime}))=\sigma$ . This implies $\widetilde{P}_{s,a}+\sigma{e}_{s_{0}}^{\top}\in\mathcal{U}^{\sigma}(P_{s,a})$ , where $e_{s_{0}}$ is the standard basis vector supported on $s_{0}$ , since $\frac{1}{2}\|\widetilde{P}_{s,a}+\sigma{e}_{s_{0}}^{\top}-P_{s,a}\|_{1}\leq\frac{1}{2}\|\widetilde{P}_{s,a}-P_{s,a}\|_{1}+\frac{\sigma}{2}=\sigma$ . Consequently,

	$\displaystyle\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(P_{s,a})}\mathcal{P}V^{\pi,\sigma}\leq\left(\widetilde{P}_{s,a}+\sigma{e}_{s_{0}}^{\top}\right)V^{\pi,\sigma}$	$\displaystyle\leq\big{\\|}\widetilde{P}_{s,a}\big{\\|}_{1}\big{\\|}V^{\pi,\sigma}\big{\\|}_{\infty}+\sigma V^{\pi,\sigma}(s_{0})$
		$\displaystyle\leq\left(1-\sigma\right)\max_{s\in{\mathcal{S}}}V^{\pi,\sigma}(s)+\sigma\min_{s\in{\mathcal{S}}}V^{\pi,\sigma}(s),$		(109)

where the second inequality holds by $\big{\|}\widetilde{P}_{s,a}\big{\|}_{1}=\sum_{s^{\prime}}\widetilde{P}_{s,a}(s^{\prime})=-\sum_{s^{\prime}}\left(P_{s,a}(s^{\prime})-\widetilde{P}_{s,a}(s^{\prime})\right)+\sum_{s^{\prime}}P_{s,a}(s^{\prime})=1-\sigma$ . Plugging this back to the previous relation gives

\displaystyle\max_{s\in{\mathcal{S}}}V^{\pi,\sigma}(s)

\displaystyle\leq 1+\gamma\left(1-\sigma\right)\max_{s\in{\mathcal{S}}}V^{\pi,\sigma}(s)+\gamma\sigma\min_{s\in{\mathcal{S}}}V^{\pi,\sigma}(s),

which, by rearranging terms, immediately yields

	$\displaystyle\max_{s\in{\mathcal{S}}}V^{\pi,\sigma}(s)$	$\displaystyle\leq\frac{1+\gamma\sigma\min_{s\in{\mathcal{S}}}V^{\pi,\sigma}(s)}{1-\gamma\left(1-\sigma\right)}$
		$\displaystyle\leq\frac{1}{(1-\gamma)+\gamma\sigma}+\min_{s\in{\mathcal{S}}}V^{\pi,\sigma}(s)\leq\frac{1}{\gamma\max\{1-\gamma,\sigma\}}+\min_{s\in{\mathcal{S}}}V^{\pi,\sigma}(s).$

B.2 Proof of Lemma 7

Observing that each row of $P_{\pi}$ belongs to $\Delta(S)$ , it can be directly verified that each row of $(1-\gamma)\left(I-\gamma P_{\pi}\right)^{-1}$ falls into $\Delta(S)$ . As a result,

$\displaystyle\left(I-\gamma P_{\pi}\right)^{-1}\sqrt{\mathrm{Var}_{P_{\pi}}(V^{\pi,P})}$	$\displaystyle=\frac{1}{1-\gamma}(1-\gamma)\left(I-\gamma P_{\pi}\right)^{-1}\sqrt{\mathrm{Var}_{P_{\pi}}(V^{\pi,P})}$
	$\displaystyle\overset{\mathrm{(i)}}{\leq}\frac{1}{1-\gamma}\sqrt{(1-\gamma)\left(I-\gamma P_{\pi}\right)^{-1}\mathrm{Var}_{P_{\pi}}(V^{\pi,P})}$
	$\displaystyle=\sqrt{\frac{1}{1-\gamma}}\sqrt{\sum_{t=0}^{\infty}\gamma^{t}\left(P_{\pi}\right)^{t}\mathrm{Var}_{P_{\pi}}(V^{\pi,P})},$	(110)

where (i) holds by Jensen’s inequality.

To continue, denoting the minimum value of $V$ as $V_{\min}=\min_{s\in{\mathcal{S}}}V^{\pi,P}(s)$ and $V^{\prime}\coloneqq V^{\pi,P}-V_{\min}1$ . We control $\mathrm{Var}_{P_{\pi}}(V^{\pi,P})$ as follows:

	$\displaystyle\mathrm{Var}_{P_{\pi}}(V^{\pi,P})$
	$\displaystyle\overset{\mathrm{(i)}}{=}\mathrm{Var}_{P_{\pi}}(V^{\prime})=P_{\pi}\left(V^{\prime}\circ V^{\prime}\right)-\big{(}P_{\pi}V^{\prime}\big{)}\circ\big{(}P_{\pi}V^{\prime}\big{)}$
	$\displaystyle\overset{\mathrm{(ii)}}{=}P_{\pi}\left(V^{\prime}\circ V^{\prime}\right)-\frac{1}{\gamma^{2}}\left(V^{\prime}-r_{\pi}+(1-\gamma)V_{\min}1\right)\circ\left(V^{\prime}-r_{\pi}+(1-\gamma)V_{\min}1\right)$
	$\displaystyle=P_{\pi}\left(V^{\prime}\circ V^{\prime}\right)-\frac{1}{\gamma^{2}}V^{\prime}\circ V^{\prime}+\frac{2}{\gamma^{2}}V^{\prime}\circ\left(r_{\pi}-(1-\gamma)V_{\min}1\right)$
	$\displaystyle\quad-\frac{1}{\gamma^{2}}\left(r_{\pi}-(1-\gamma)V_{\min}1\right)\circ\left(r_{\pi}-(1-\gamma)V_{\min}1\right)$
	$\displaystyle\leq P_{\pi}\left(V^{\prime}\circ V^{\prime}\right)-\frac{1}{\gamma}V^{\prime}\circ V^{\prime}+\frac{2}{\gamma^{2}}\\|V^{\prime}\\|_{\infty}1,$		(111)

where (i) holds by the fact that $\mathrm{Var}_{P_{\pi}}(V^{\pi,P}-b1)=\mathrm{Var}_{P_{\pi}}(V^{\pi,P})$ for any scalar $b$ and $V^{\pi,P}\in\mathbb{R}^{S}$ , (ii) follows from $V^{\prime}=r_{\pi}+\gamma P_{\pi}V^{\pi,P}-V_{\min}1=r_{\pi}-(1-\gamma)V_{\min}1+\gamma P_{\pi}V^{\prime}$ , and the last line arises from $\frac{1}{\gamma^{2}}V^{\prime}\circ V^{\prime}\geq\frac{1}{\gamma}V^{\prime}\circ V^{\prime}$ and $\|r_{\pi}-(1-\gamma)V_{\min}1\|_{\infty}\leq 1$ . Plugging (111) back to (110) leads to

	$\displaystyle\left(I-\gamma P_{\pi}\right)^{-1}\sqrt{\mathrm{Var}_{P_{\pi}}(V^{\pi,P})}\leq\sqrt{\frac{1}{1-\gamma}}\sqrt{\sum_{t=0}^{\infty}\gamma^{t}\left(P_{\pi}\right)^{t}\left(P_{\pi}\left(V^{\prime}\circ V^{\prime}\right)-\frac{1}{\gamma}V^{\prime}\circ V^{\prime}+\frac{2}{\gamma^{2}}\\|V^{\prime}\\|_{\infty}1\right)}$
	$\displaystyle\overset{\mathrm{(i)}}{\leq}\sqrt{\frac{1}{1-\gamma}}\sqrt{\bigg{\|}\sum_{t=0}^{\infty}\gamma^{t}\left(P_{\pi}\right)^{t}\left(P_{\pi}\left(V^{\prime}\circ V^{\prime}\right)-\frac{1}{\gamma}V^{\prime}\circ V^{\prime}\right)\bigg{\|}}+\sqrt{\frac{1}{1-\gamma}}\sqrt{\sum_{t=0}^{\infty}\gamma^{t}\left(P_{\pi}\right)^{t}\frac{2}{\gamma^{2}}\\|V^{\prime}\\|_{\infty}1}$
	$\displaystyle\leq\sqrt{\frac{1}{1-\gamma}}\sqrt{\bigg{\|}\bigg{(}\sum_{t=0}^{\infty}\gamma^{t}\left(P_{\pi}\right)^{t+1}-\sum_{t=0}^{\infty}\gamma^{t-1}\left(P_{\pi}\right)^{t}\bigg{)}\left(V^{\prime}\circ V^{\prime}\right)\bigg{\|}}+\sqrt{\frac{2\\|V^{\prime}\\|_{\infty}1}{\gamma^{2}(1-\gamma)^{2}}}$
	$\displaystyle\overset{\mathrm{(ii)}}{\leq}\sqrt{\frac{\\|V^{\prime}\\|_{\infty}^{2}1}{\gamma(1-\gamma)}}+\sqrt{\frac{2\\|V^{\prime}\\|_{\infty}1}{\gamma^{2}(1-\gamma)^{2}}}$
	$\displaystyle\leq\sqrt{\frac{8\\|V^{\prime}\\|_{\infty}1}{\gamma^{2}(1-\gamma)^{2}}},$		(112)

where (i) holds by the triangle inequality, (ii) holds by following recursion, and the last inequality holds by $\|V^{\prime}\|_{\infty}\leq\frac{1}{1-\gamma}$ .

B.3 Proof of Lemma 8

Step 1: controlling $\|\widehat{V}^{\pi^{\star},\sigma}-V^{\pi^{\star},\sigma}\|_{\infty}$ : bounding the first term in (56).

To control the two terms in (56), we first introduce the following lemma whose proof is postponed to Appendix B.5.

Lemma 11.

Consider any $\delta\in(0,1)$ . Setting $N\geq\log(\frac{18SAN}{\delta})$ , with probability at least $1-\delta$ , one has

	$\displaystyle\Big{\|}\underline{\widehat{P}}^{\pi^{\star},V}V^{\pi^{\star},\sigma}-\underline{P}^{\pi^{\star},V}V^{\pi^{\star},\sigma}\Big{\|}$	$\displaystyle\leq 2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{N}}\sqrt{\mathrm{Var}_{P^{\pi^{\star}}}(V^{\star,\sigma})}+\frac{\log(\frac{18SAN}{\delta})}{N(1-\gamma)}1$
		$\displaystyle\leq 3\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}}1,$		(113)

where $\mathrm{Var}_{P^{\pi^{\star}}}(V^{\star,\sigma})$ is defined in (36).

Armed with the above lemma, now we control the first term on the right hand side of (56) as follows:

	$\displaystyle\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{-1}\Big{(}\underline{\widehat{P}}^{\pi^{\star},V}V^{\pi^{\star},\sigma}-\underline{P}^{\pi^{\star},V}V^{\pi^{\star},\sigma}\Big{)}$
	$\displaystyle\overset{\mathrm{(i)}}{\leq}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{-1}\Big{\\|}\underline{\widehat{P}}^{\pi^{\star},V}V^{\pi^{\star},\sigma}-\underline{P}^{\pi^{\star},V}V^{\pi^{\star},\sigma}\Big{\\|}_{\infty}$
	$\displaystyle\overset{\mathrm{(ii)}}{\leq}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{-1}\bigg{(}2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{N}}\sqrt{\mathrm{Var}_{P^{\pi^{\star}}}(V^{\star,\sigma})}+\frac{\log(\frac{18SAN}{\delta})}{N(1-\gamma)}1\bigg{)}$
	$\displaystyle\leq\frac{\log(\frac{18SAN}{\delta})}{N(1-\gamma)}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{-1}1+\underbrace{2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{N}}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{-1}\sqrt{\mathrm{Var}_{\underline{\widehat{P}}^{\pi^{\star},V}}(V^{\star,\sigma})}}_{=:\mathcal{C}_{1}}$
	$\displaystyle\quad+\underbrace{2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{N}}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{-1}\sqrt{\left\|\mathrm{Var}_{\widehat{P}^{\pi^{\star}}}(V^{\star,\sigma})-\mathrm{Var}_{\underline{\widehat{P}}^{\pi^{\star},V}}(V^{\star,\sigma})\right\|}}_{=:\mathcal{C}_{2}}$
	$\displaystyle\quad+\underbrace{2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{N}}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{-1}\Big{(}\sqrt{\mathrm{Var}_{P^{\pi^{\star}}}(V^{\star,\sigma})}-\sqrt{\mathrm{Var}_{\widehat{P}^{\pi^{\star}}}(V^{\star,\sigma})}\Big{)}}_{=:\mathcal{C}_{3}},$		(114)

where (i) holds by $\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{-1}\geq 0$ , (ii) follows from Lemma 11, and the last inequality arise from

	$\displaystyle\sqrt{\mathrm{Var}_{P^{\pi^{\star}}}(V^{\star,\sigma})}=\left(\sqrt{\mathrm{Var}_{P^{\pi^{\star}}}(V^{\star,\sigma})}-\sqrt{\mathrm{Var}_{\widehat{P}^{\pi^{\star}}}(V^{\star,\sigma})}\right)+\sqrt{\mathrm{Var}_{\widehat{P}^{\pi^{\star}}}(V^{\star,\sigma})}$
	$\displaystyle\leq\left(\sqrt{\mathrm{Var}_{P^{\pi^{\star}}}(V^{\star,\sigma})}-\sqrt{\mathrm{Var}_{\widehat{P}^{\pi^{\star}}}(V^{\star,\sigma})}\right)+\sqrt{\left\|\mathrm{Var}_{\widehat{P}^{\pi^{\star}}}(V^{\star,\sigma})-\mathrm{Var}_{\underline{\widehat{P}}^{\pi^{\star},V}}(V^{\star,\sigma})\right\|}+\sqrt{\mathrm{Var}_{\underline{\widehat{P}}^{\pi^{\star},V}}(V^{\star,\sigma})}$

by applying the triangle inequality.

To continue, observing that each row of $\underline{\widehat{P}}^{\pi^{\star},V}$ is a probability distribution obeying that the sum is $1$ , we arrive at

\displaystyle\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{-1}1=\Big{(}I+\sum_{t=1}^{\infty}\gamma^{t}\Big{(}\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{t}\Big{)}1=\frac{1}{1-\gamma}1.

(115)

Armed with this fact, we shall control the other three terms $\mathcal{C}_{1},\mathcal{C}_{2},\mathcal{C}_{3}$ in (114) separately.

•

Consider $\mathcal{C}_{1}$ . We first introduce the following lemma, whose proof is postponed to Appendix B.6.

Lemma 12.

Consider any $\delta\in(0,1)$ . With probability at least $1-\delta$ , one has

\displaystyle\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{-1}\sqrt{\mathrm{Var}_{\underline{\widehat{P}}^{\pi^{\star},V}}(V^{\star,\sigma})}\leq 4\sqrt{\frac{\left(1+\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}}\right)}{\gamma^{3}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}}}1\leq 4\sqrt{\frac{\left(1+\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}}\right)}{\gamma^{3}(1-\gamma)^{3}}}1.

Applying Lemma 12 and inserting back to (114) leads to

	$\displaystyle\mathcal{C}_{1}$	$\displaystyle=2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{N}}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{-1}\sqrt{\mathrm{Var}_{\underline{\widehat{P}}^{\pi^{\star},V}}(V^{\star,\sigma})}$
		$\displaystyle\leq 8\sqrt{\frac{\log(\frac{18SAN}{\delta})}{\gamma^{3}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}\bigg{(}1+\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}}\bigg{)}}1.$		(116)

•

Consider $\mathcal{C}_{2}$ . First, denote $V^{\prime}\coloneqq V^{\star,\sigma}-\min_{s^{\prime}\in{\mathcal{S}}}V^{\star,\sigma}(s^{\prime})1$ , by Lemma 6, it follows that

\displaystyle{0}\leq V^{\prime}\leq\frac{1}{\gamma\max\{1-\gamma,\sigma\}}1.

(117)

Then, we have for all $(s,a)\in{\mathcal{S}}\times\mathcal{A}$ , and $P_{s,a}\in\Delta({\mathcal{S}})$ , and $\widetilde{P}_{s,a}\in\mathcal{U}^{\sigma}(P_{s,a})$ :

$\displaystyle\big{\|}\mathsf{Var}_{\widetilde{P}_{s,a}}(V^{\star,\sigma})-\mathsf{Var}_{P_{s,a}}(V^{\star,\sigma})\big{\|}$	$\displaystyle=\big{\|}\mathsf{Var}_{\widetilde{P}_{s,a}}(V^{\prime})-\mathsf{Var}_{P_{s,a}}(V^{\prime})\big{\|}$
	$\displaystyle\leq\big{\\|}\widetilde{P}_{s,a}-P_{s,a}\big{\\|}_{1}\big{\\|}V^{\prime}\big{\\|}_{\infty}^{2}$
	$\displaystyle\leq\frac{2\sigma}{\gamma^{2}(\max\{1-\gamma,\sigma\})^{2}}\leq\frac{2}{\gamma^{2}\max\{1-\gamma,\sigma\}}.$	(118)

Applying the above relation we obtain

$\displaystyle\mathcal{C}_{2}$	$\displaystyle=2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{N}}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{-1}\sqrt{\left\|\mathrm{Var}_{\widehat{P}^{\pi^{\star}}}(V^{\star,\sigma})-\mathrm{Var}_{\underline{\widehat{P}}^{\pi^{\star},V}}(V^{\star,\sigma})\right\|}$
	$\displaystyle=2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{N}}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{-1}\sqrt{\left\|\Pi^{\pi^{\star}}\left(\mathrm{Var}_{\widehat{P}^{0}}(V^{\star,\sigma})-\mathrm{Var}_{\widehat{P}^{\pi^{\star},V}}(V^{\star,\sigma})\right)\right\|}$
	$\displaystyle\leq 2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{N}}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{-1}\sqrt{\left\\|\mathrm{Var}_{\widehat{P}^{0}}(V^{\star,\sigma})-\mathrm{Var}_{\widehat{P}^{\pi^{\star},V}}(V^{\star,\sigma})\right\\|_{\infty}1}$
	$\displaystyle\leq 2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{N}}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{-1}\sqrt{\frac{2}{\gamma^{2}\max\{1-\gamma,\sigma\}}}1=2\sqrt{\frac{2\log(\frac{18SAN}{\delta})}{\gamma^{2}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}1,$	(119)

where the last equality uses $\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{-1}1=\frac{1}{1-\gamma}$ (cf. (115)).

•

Consider $\mathcal{C}_{3}$ . The following lemma plays an important role.

Lemma 13.

(Panaganti and Kalathil,, 2022, Lemma 6) Consider any $\delta\in(0,1)$ . For any fixed policy $\pi$ and fixed value vector $V\in\mathbb{R}^{S}$ , one has with probability at least $1-\delta$ ,

\displaystyle\left|\sqrt{\mathrm{Var}_{\widehat{P}^{\pi}}(V)}-\sqrt{\mathrm{Var}_{P^{\pi}}(V)}\right|\leq\sqrt{\frac{2\|V\|_{\infty}^{2}\log(\frac{2SA}{\delta})}{N}}1.

Applying Lemma 13 with $\pi=\pi^{\star}$ and $V=V^{\star,\sigma}$ leads to

\displaystyle\sqrt{\mathrm{Var}_{P^{\pi^{\star}}}(V^{\star,\sigma})}-\sqrt{\mathrm{Var}_{\widehat{P}^{\pi^{\star}}}(V^{\star,\sigma})}\leq\sqrt{\frac{2\|V^{\star,\sigma}\|_{\infty}^{2}\log(\frac{2SA}{\delta})}{N}}1,

which can be plugged in (114) to verify

	$\displaystyle\mathcal{C}_{3}$	$\displaystyle=2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{N}}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{-1}\left(\sqrt{\mathrm{Var}_{P^{\pi^{\star}}}(V^{\star,\sigma})}-\sqrt{\mathrm{Var}_{\widehat{P}^{\pi^{\star}}}(V^{\star,\sigma})}\right)$
		$\displaystyle\leq\frac{4}{(1-\gamma)}\frac{\log(\frac{SAN}{\delta})\\|V^{\star,\sigma}\\|_{\infty}}{N}1\leq\frac{4\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}1,$		(120)

where the last line uses $\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{-1}1=\frac{1}{1-\gamma}$ (cf. (115)).

Finally, inserting the results of $\mathcal{C}_{1}$ in (116), $\mathcal{C}_{2}$ in (119), $\mathcal{C}_{3}$ in (120), and (115) back into (114) gives

	$\displaystyle\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{-1}\Big{(}\underline{\widehat{P}}^{\pi^{\star},V}V^{\pi^{\star},\sigma}-\underline{P}^{\pi^{\star},V}V^{\pi^{\star},\sigma}\Big{)}\leq 8\sqrt{\frac{\log(\frac{18SAN}{\delta})}{\gamma^{3}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}\bigg{(}1+\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}}\bigg{)}}1$
	$\displaystyle\quad+2\sqrt{\frac{2\log(\frac{18SAN}{\delta})}{\gamma^{2}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}1+\frac{4\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}1+\frac{\log(\frac{18SAN}{\delta})}{N(1-\gamma)^{2}}1$
	$\displaystyle\leq 10\sqrt{\frac{2\log(\frac{18SAN}{\delta})}{\gamma^{3}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}\bigg{(}1+\sqrt{\frac{\log(\frac{SAN}{\delta})}{(1-\gamma)^{2}N}}\bigg{)}}1+\frac{5\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}1$
	$\displaystyle\leq 160\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}1+\frac{5\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}1,$		(121)

where the last inequality holds by the fact $\gamma\geq\frac{1}{4}$ and letting $N\geq\frac{\log(\frac{SAN}{\delta})}{(1-\gamma)^{2}}$ .

Step 2: controlling $\|\widehat{V}^{\pi^{\star},\sigma}-V^{\pi^{\star},\sigma}\|_{\infty}$ : bounding the second term in (56).

To proceed, applying Lemma 11 on the second term of the right hand side of (56) leads to

	$\displaystyle\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},\widehat{V}}\Big{)}^{-1}\Big{(}\underline{\widehat{P}}^{\pi^{\star},V}V^{\pi^{\star},\sigma}-\underline{P}^{\pi^{\star},V}V^{\pi^{\star},\sigma}\Big{)}$
	$\displaystyle\leq 2\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},\widehat{V}}\Big{)}^{-1}\bigg{(}\sqrt{\frac{\log(\frac{18SAN}{\delta})}{N}}\sqrt{\mathrm{Var}_{P^{\pi^{\star}}}(V^{\star,\sigma})}+\frac{\log(\frac{18SAN}{\delta})}{N(1-\gamma)}1\bigg{)}$
	$\displaystyle\leq\frac{2\log(\frac{18SAN}{\delta})}{N(1-\gamma)}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},\widehat{V}}\Big{)}^{-1}1+\underbrace{2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{N}}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},\widehat{V}}\Big{)}^{-1}\sqrt{\mathrm{Var}_{\underline{\widehat{P}}^{\pi^{\star},\widehat{V}}}(\widehat{V}^{\pi^{\star},\sigma})}}_{=:\mathcal{C}_{4}}$
	$\displaystyle\quad+\underbrace{2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{N}}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},\widehat{V}}\Big{)}^{-1}\left(\sqrt{\mathrm{Var}_{\underline{\widehat{P}}^{\pi^{\star},\widehat{V}}}(V^{\pi^{\star},\sigma}-\widehat{V}^{\pi^{\star},\sigma})}\right)}_{=:\mathcal{C}_{5}}$
	$\displaystyle\quad+\underbrace{2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{N}}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},\widehat{V}}\Big{)}^{-1}\left(\sqrt{\left\|\mathrm{Var}_{\widehat{P}^{\pi^{\star}}}(V^{\star,\sigma})-\mathrm{Var}_{\underline{\widehat{P}}^{\pi^{\star},\widehat{V}}}(V^{\star,\sigma})\right\|}\right)}_{=:\mathcal{C}_{6}}$
	$\displaystyle\quad+\underbrace{2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{N}}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},\widehat{V}}\Big{)}^{-1}\left(\sqrt{\mathrm{Var}_{P^{\pi^{\star}}}(V^{\star,\sigma})}-\sqrt{\mathrm{Var}_{\widehat{P}^{\pi^{\star}}}(V^{\star,\sigma})}\right)}_{=:\mathcal{C}_{7}},$		(122)

where the last term $\widetilde{\mathcal{C}_{3}}$ can be controlled the same as $\mathcal{C}_{3}$ in (120). We now bound the above terms separately.

•

Applying Lemma 7 with $P=\widehat{P}^{\pi^{\star},\widehat{V}}$ , $\pi=\pi^{\star}$ and taking $V=\widehat{V}^{\pi^{\star},\sigma}$ which obeys $\widehat{V}^{\pi^{\star},\sigma}=r_{\pi^{\star}}+\gamma\underline{\widehat{P}}^{\pi^{\star},\widehat{V}}\widehat{V}^{\pi^{\star},\sigma}$ , and in view of (115), the term $\mathcal{C}_{4}$ in (122) can be controlled as follows:

$\displaystyle\mathcal{C}_{4}$	$\displaystyle=2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{N}}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},\widehat{V}}\Big{)}^{-1}\sqrt{\mathrm{Var}_{\underline{\widehat{P}}^{\pi^{\star},\widehat{V}}}(\widehat{V}^{\pi^{\star},\sigma})}$
	$\displaystyle\leq 2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{N}}\sqrt{\frac{8(\max_{s}\widehat{V}^{\pi^{\star},\sigma}(s)-\min_{s}\widehat{V}^{\pi^{\star},\sigma}(s))}{\gamma^{2}(1-\gamma)^{2}}}1$
	$\displaystyle\leq 8\sqrt{\frac{\log(\frac{18SAN}{\delta})}{\gamma^{3}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}1,$	(123)

where the last inequality holds by applying Lemma 6.

•

To continue, considering $\mathcal{C}_{5}$ , we directly observe that (in view of (115))

	$\displaystyle\mathcal{C}_{5}$	$\displaystyle=2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{N}}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},\widehat{V}}\Big{)}^{-1}\sqrt{\mathrm{Var}_{\underline{\widehat{P}}^{\pi^{\star},\widehat{V}}}(V^{\pi^{\star},\sigma}-\widehat{V}^{\pi^{\star},\sigma})}$
		$\displaystyle\leq 2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}}\left\\|V^{\star,\sigma}-\widehat{V}^{\pi^{\star},\sigma}\right\\|_{\infty}1.$		(124)

•

Then, it is easily verified that $\mathcal{C}_{6}$ can be controlled similarly as (119) as follows:

\displaystyle\mathcal{C}_{6}

\displaystyle\leq 2\sqrt{\frac{2\log(\frac{18SAN}{\delta})}{\gamma^{2}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}1.

(125)

•

Similarly, $\mathcal{C}_{7}$ can be controlled the same as (120) shown below:

$\displaystyle\mathcal{C}_{7}$ $\displaystyle\leq\frac{4\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}1.$ (126)

Combining the results in (123), (124), (125), and (126) and inserting back to (122) leads to

	$\displaystyle\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},\widehat{V}}\Big{)}^{-1}\Big{(}\underline{\widehat{P}}^{\pi^{\star},V}V^{\pi^{\star},\sigma}-\underline{P}^{\pi^{\star},V}V^{\pi^{\star},\sigma}\Big{)}\leq 8\sqrt{\frac{\log(\frac{18SAN}{\delta})}{\gamma^{3}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}1$
	$\displaystyle\qquad\qquad+2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}}\left\\|V^{\star,\sigma}-\widehat{V}^{\pi^{\star},\sigma}\right\\|_{\infty}1+2\sqrt{\frac{2\log(\frac{18SAN}{\delta})}{\gamma^{2}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}1+\frac{4\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}1$
	$\displaystyle\leq 80\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}1+2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}}\left\\|V^{\star,\sigma}-\widehat{V}^{\pi^{\star},\sigma}\right\\|_{\infty}1+\frac{4\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}1,$		(127)

where the last inequality follows from the assumption $\gamma\geq\frac{1}{4}$ .

Finally, inserting (121) and (127) back to (56) yields

	$\displaystyle\left\\|\widehat{V}^{\pi^{\star},\sigma}-V^{\pi^{\star},\sigma}\right\\|_{\infty}$
	$\displaystyle\leq\max\Bigg{\{}160\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}+\frac{5\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N},$
	$\displaystyle\qquad\quad 80\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}+2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}}\left\\|V^{\star,\sigma}-\widehat{V}^{\pi^{\star},\sigma}\right\\|_{\infty}+\frac{4\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}\Bigg{\}}$
	$\displaystyle\leq 160\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}+\frac{8\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N},$		(128)

where the last inequality holds by taking $N\geq\frac{16\log(\frac{SAN}{\delta})}{(1-\gamma)^{2}}$ .

B.4 Proof of Lemma 9

Step 1: controlling $\|\widehat{V}^{\widehat{\pi},\sigma}-V^{\widehat{\pi},\sigma}\|_{\infty}$ : bounding the first term in (57).

To begin with, we introduce the following lemma which controls the main term on the right hand side of (57), which is proved in Appendix B.7.

Lemma 14.

Consider any $\delta\in(0,1)$ . Taking $N\geq\log\left(\frac{54SAN^{2}}{(1-\gamma)\delta}\right)$ , with probability at least $1-\delta$ , one has

	$\displaystyle\Big{\|}\underline{\widehat{P}}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}-\underline{P}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}\Big{\|}$	$\displaystyle\leq 2\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N}}\sqrt{\mathrm{Var}_{P^{0}_{s,a}}(\widehat{V}^{\star,\sigma})}1+\frac{8\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N(1-\gamma)}1+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma}1$
		$\displaystyle\leq 10\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{(1-\gamma)^{2}N}}1+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma}1.$		(129)

With Lemma 14 in hand, we have

	$\displaystyle\Big{(}I-\gamma\underline{P}^{\widehat{\pi},\widehat{V}}\Big{)}^{-1}\Big{(}\underline{\widehat{P}}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}-\underline{P}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}\Big{)}$
	$\displaystyle\overset{\mathrm{(i)}}{\leq}\left(I-\gamma\underline{P}^{\widehat{\pi},\widehat{V}}\right)^{-1}\left\|\underline{\widehat{P}}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}-\underline{P}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}\right\|$
	$\displaystyle\leq 2\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N}}\left(I-\gamma\underline{P}^{\widehat{\pi},\widehat{V}}\right)^{-1}\sqrt{\mathrm{Var}_{P^{\widehat{\pi}}}(\widehat{V}^{\star,\sigma})}+\left(I-\gamma P_{Q}^{\widehat{\pi},V^{\widehat{\pi}}}\right)^{-1}\Bigg{(}\frac{8\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N(1-\gamma)}+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma}\Bigg{)}1$
	$\displaystyle\overset{\mathrm{(ii)}}{\leq}\Bigg{(}\frac{8\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N(1-\gamma)^{2}}+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{(1-\gamma)^{2}}\Bigg{)}1+\underbrace{2\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N}}\left(I-\gamma\underline{P}^{\widehat{\pi},\widehat{V}}\right)^{-1}\sqrt{\mathrm{Var}_{\underline{P}^{\widehat{\pi},\widehat{V}}}(\widehat{V}^{\widehat{\pi},\sigma})}}_{=:\mathcal{D}_{1}}$
	$\displaystyle\quad+\underbrace{2\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N}}\left(I-\gamma\underline{P}^{\widehat{\pi},\widehat{V}}\right)^{-1}\sqrt{\left\|\mathrm{Var}_{\underline{P}^{\widehat{\pi},\widehat{V}}}(\widehat{V}^{\star,\sigma})-\mathrm{Var}_{\underline{P}^{\widehat{\pi},\widehat{V}}}(\widehat{V}^{\widehat{\pi},\sigma})\right\|}}_{=:\mathcal{D}_{2}}$
	$\displaystyle\quad+\underbrace{2\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N}}\left(I-\gamma\underline{P}^{\widehat{\pi},\widehat{V}}\right)^{-1}\sqrt{\left\|\mathrm{Var}_{P^{\widehat{\pi}}}(\widehat{V}^{\star,\sigma})-\mathrm{Var}_{\underline{P}^{\widehat{\pi},\widehat{V}}}(\widehat{V}^{\star,\sigma})\right\|}}_{=:\mathcal{D}_{3}},$		(130)

where (i) and (ii) hold by the fact that each row of $(1-\gamma)\left(I-\gamma\underline{P}^{\widehat{\pi},\widehat{V}}\right)^{-1}$ is a probability vector that falls into $\Delta({\mathcal{S}})$ .

The remainder of the proof will focus on controlling the three terms in (130) separately.

•

For $\mathcal{D}_{1}$ , we introduce the following lemma, whose proof is postponed to B.8.

Lemma 15.

Consider any $\delta\in(0,1)$ . Taking $N\geq\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{(1-\gamma)^{2}}$ and $\varepsilon_{\mathsf{opt}}\leq\frac{1-\gamma}{\gamma}$ , one has with probability at least $1-\delta$ ,

\displaystyle\Big{(}I-\gamma\underline{P}^{\widehat{\pi},\widehat{V}}\Big{)}^{-1}\sqrt{\mathrm{Var}_{\underline{P}^{\widehat{\pi},\widehat{V}}}(\widehat{V}^{\widehat{\pi},\sigma})}

\displaystyle\leq 6\sqrt{\frac{1}{\gamma^{3}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}}}1\leq 6\sqrt{\frac{1}{(1-\gamma)^{3}\gamma^{2}}}1.

Applying Lemma 15 and (115) to (130) leads to

	$\displaystyle\mathcal{D}_{1}$	$\displaystyle=2\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N}}\Big{(}I-\gamma\underline{P}^{\widehat{\pi},\widehat{V}}\Big{)}^{-1}\sqrt{\mathrm{Var}_{\underline{P}^{\widehat{\pi},\widehat{V}}}(\widehat{V}^{\widehat{\pi},\sigma})}$
		$\displaystyle\leq 12\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{\gamma^{3}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}1.$		(131)

•

Applying Lemma 3 with $\|\widehat{V}^{\star,\sigma}-\widehat{V}^{\widehat{\pi},\sigma}\|_{\infty}\leq\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma}$ and (115), $\mathcal{D}_{2}$ can be controlled as

	$\displaystyle\mathcal{D}_{2}$	$\displaystyle=2\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N}}\Big{(}I-\gamma\underline{P}^{\widehat{\pi},\widehat{V}}\Big{)}^{-1}\sqrt{\left\|\mathrm{Var}_{\underline{P}^{\widehat{\pi},\widehat{V}}}(\widehat{V}^{\star,\sigma})-\mathrm{Var}_{\underline{P}^{\widehat{\pi},\widehat{V}}}(\widehat{V}^{\widehat{\pi},\sigma})\right\|}$
		$\displaystyle\leq 4\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N}}\Big{(}I-\gamma\underline{P}^{\widehat{\pi},\widehat{V}}\Big{)}^{-1}\frac{\sqrt{\gamma\varepsilon_{\mathsf{opt}}}}{1-\gamma}\leq 4\sqrt{\frac{\gamma\varepsilon_{\mathsf{opt}}\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{(1-\gamma)^{4}N}}1.$		(132)

•

$\mathcal{D}_{3}$ can be controlled similar to $\mathcal{C}_{2}$ in (119) as follows:

	$\displaystyle\mathcal{D}_{3}$	$\displaystyle=2\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N}}\left(I-\gamma\underline{P}^{\widehat{\pi},\widehat{V}}\right)^{-1}\sqrt{\left\|\mathrm{Var}_{P^{\widehat{\pi}}}(\widehat{V}^{\star,\sigma})-\mathrm{Var}_{\underline{P}^{\widehat{\pi},\widehat{V}}}(\widehat{V}^{\star,\sigma})\right\|}$
		$\displaystyle\leq 4\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N}}\left(I-\gamma\underline{P}^{\widehat{\pi},\widehat{V}}\right)^{-1}\sqrt{\frac{1}{\gamma^{2}\max\{1-\gamma,\sigma\}}}1\leq 4\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{\gamma^{2}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}1$		(133)

Finally, summing up the results in (131), (132), and (133) and inserting them back to (130) yields: taking $N\geq\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{(1-\gamma)^{2}}$ and $\varepsilon_{\mathsf{opt}}\leq\frac{1-\gamma}{\gamma}$ , with probability at least $1-\delta$ ,

	$\displaystyle\left(I-\gamma\underline{P}^{\widehat{\pi},\widehat{V}}\right)^{-1}\left(\underline{\widehat{P}}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}-\underline{P}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}\right)\leq\left(\frac{8\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N(1-\gamma)^{2}}+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{(1-\gamma)^{2}}\right)1$
	$\displaystyle\quad+12\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{\gamma^{3}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}1+4\sqrt{\frac{\gamma\varepsilon_{\mathsf{opt}}\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{(1-\gamma)^{4}N}}1+4\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{\gamma^{2}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}1$
	$\displaystyle\leq 16\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{\gamma^{3}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}1+\frac{14\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N(1-\gamma)^{2}}1,$		(134)

where the last inequality holds by taking $\varepsilon_{\mathsf{opt}}\leq\min\left\{\frac{1-\gamma}{\gamma},\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{\gamma N}\right\}=\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{\gamma N}$ .

Step 2: controlling $\|\widehat{V}^{\widehat{\pi},\sigma}-V^{\widehat{\pi},\sigma}\|_{\infty}$ : bounding the second term in (57).

Towards this, applying Lemma 14 leads to

	$\displaystyle\Big{(}I-\gamma\underline{P}^{\widehat{\pi},V}\Big{)}^{-1}\Big{(}\underline{\widehat{P}}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}-\underline{P}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}\Big{)}\leq\Big{(}I-\gamma\underline{P}^{\widehat{\pi},V}\Big{)}^{-1}\Big{\|}\underline{\widehat{P}}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}-\underline{P}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}\Big{\|}$
	$\displaystyle\leq 2\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N}}\Big{(}I-\gamma\underline{P}^{\widehat{\pi},V}\Big{)}^{-1}\sqrt{\mathrm{Var}_{P^{\widehat{\pi}}}(\widehat{V}^{\star,\sigma})}+\Big{(}I-\gamma\underline{P}^{\widehat{\pi},V}\Big{)}^{-1}\Bigg{(}\frac{8\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N(1-\gamma)}+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma}\Bigg{)}1$
	$\displaystyle\leq\Bigg{(}\frac{8\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N(1-\gamma)^{2}}+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{(1-\gamma)^{2}}\Bigg{)}1+\underbrace{2\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N}}\left(I-\gamma\underline{P}^{\widehat{\pi},V}\right)^{-1}\sqrt{\mathrm{Var}_{\underline{P}^{\widehat{\pi},V}}(V^{\widehat{\pi},\sigma})}}_{=:\mathcal{D}_{4}}$
	$\displaystyle\quad+\underbrace{2\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N}}\Big{(}I-\gamma\underline{P}^{\widehat{\pi},V}\Big{)}^{-1}\sqrt{\mathrm{Var}_{\underline{P}^{\widehat{\pi},V}}(\widehat{V}^{\widehat{\pi},\sigma}-V^{\widehat{\pi},\sigma})}}_{=:\mathcal{D}_{5}}$
	$\displaystyle\quad+\underbrace{2\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N}}\Big{(}I-\gamma\underline{P}^{\widehat{\pi},\widehat{V}}\Big{)}^{-1}\sqrt{\left\|\mathrm{Var}_{\underline{P}^{\widehat{\pi},V}}(\widehat{V}^{\star,\sigma})-\mathrm{Var}_{\underline{P}^{\widehat{\pi},V}}(\widehat{V}^{\widehat{\pi},\sigma})\right\|}}_{=:\mathcal{D}_{6}}$
	$\displaystyle\quad+\underbrace{2\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N}}\Big{(}I-\gamma\underline{P}^{\widehat{\pi},\widehat{V}}\Big{)}^{-1}\sqrt{\left\|\mathrm{Var}_{P^{\widehat{\pi}}}(\widehat{V}^{\star,\sigma})-\mathrm{Var}_{\underline{P}^{\widehat{\pi},V}}(\widehat{V}^{\star,\sigma})\right\|}}_{=:\mathcal{D}_{7}}.$		(135)

We shall bound each of the terms separately.

•

Applying Lemma 7 with $P=\underline{P}^{\widehat{\pi},V}$ , $\pi=\widehat{\pi}$ , and taking $V=V^{\widehat{\pi},\sigma}$ which obeys $V^{\widehat{\pi},\sigma}=r_{\widehat{\pi}}+\gamma\underline{P}^{\widehat{\pi},V}V^{\widehat{\pi},\sigma}$ , the term $\mathcal{D}_{4}$ can be controlled similar to (123) as follows:

\displaystyle\mathcal{D}_{4}

\displaystyle\leq 8\sqrt{\frac{\log(\frac{54SAN^{2}}{\delta})}{\gamma^{3}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}1.

(136)

•

For $\mathcal{D}_{5}$ , it is observed that

	$\displaystyle\mathcal{D}_{5}$	$\displaystyle=2\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N}}\Big{(}I-\gamma\underline{P}^{\widehat{\pi},V}\Big{)}^{-1}\sqrt{\mathrm{Var}_{\underline{P}^{\widehat{\pi},V}}(\widehat{V}^{\widehat{\pi},\sigma}-V^{\widehat{\pi},\sigma})}$
		$\displaystyle\leq 2\sqrt{\frac{\log(\frac{54SAN^{2}}{\delta})}{(1-\gamma)^{2}N}}\left\\|V^{\widehat{\pi},\sigma}-\widehat{V}^{\widehat{\pi},\sigma}\right\\|_{\infty}1.$		(137)

•

Next, observing that $\mathcal{D}_{6}$ and $\mathcal{D}_{7}$ are almost the same as the terms $\mathcal{D}_{2}$ (controlled in (132)) and $\mathcal{D}_{3}$ (controlled in (133)) in (130), it is easily verified that they can be controlled as follows

\displaystyle\mathcal{D}_{6}

\displaystyle\leq 4\sqrt{\frac{\gamma\varepsilon_{\mathsf{opt}}\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{(1-\gamma)^{4}N}}1,\qquad\qquad\mathcal{D}_{7}\leq 4\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{\gamma^{2}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}1.

(138)

Then inserting the results in (136), (137), and (138) back to (135) leads to

	$\displaystyle\Big{(}I-\gamma\underline{P}^{\widehat{\pi},V}\Big{)}^{-1}\Big{(}\underline{\widehat{P}}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}-\underline{P}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}\Big{)}$
	$\displaystyle\leq\Bigg{(}\frac{8\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N(1-\gamma)^{2}}+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{(1-\gamma)^{2}}\Bigg{)}1+8\sqrt{\frac{\log(\frac{54SAN^{2}}{\delta})}{\gamma^{3}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}1$
	$\displaystyle\quad+2\sqrt{\frac{\log(\frac{54SAN^{2}}{\delta})}{(1-\gamma)^{2}N}}\left\\|V^{\widehat{\pi},\sigma}-\widehat{V}^{\widehat{\pi},\sigma}\right\\|_{\infty}1+4\sqrt{\frac{\gamma\varepsilon_{\mathsf{opt}}\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{(1-\gamma)^{4}N}}1+4\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{\gamma^{2}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}1$
	$\displaystyle\leq 12\sqrt{\frac{2\log(\frac{8SAN^{2}}{(1-\gamma)\delta})}{\gamma^{3}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}1+\frac{14\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N(1-\gamma)^{2}}1+2\sqrt{\frac{\log(\frac{54SAN^{2}}{\delta})}{(1-\gamma)^{2}N}}\left\\|V^{\widehat{\pi},\sigma}-\widehat{V}^{\widehat{\pi},\sigma}\right\\|_{\infty}1,$		(139)

where the last inequality holds by letting $\varepsilon_{\mathsf{opt}}\leq\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{\gamma N}$ , which directly satisfies $\varepsilon_{\mathsf{opt}}\leq\frac{1-\gamma}{\gamma}$ by letting $N\geq\frac{\log(\frac{54SAN^{2}}{\delta})}{1-\gamma}$ .

Finally, inserting (134) and (139) back to (57) yields: taking $\varepsilon_{\mathsf{opt}}\leq\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{\gamma N}$ and $N\geq\frac{16\log(\frac{54SAN^{2}}{\delta})}{(1-\gamma)^{2}}$ , with probability at least $1-\delta$ , one has

	$\displaystyle\left\\|\widehat{V}^{\widehat{\pi},\sigma}-V^{\widehat{\pi},\sigma}\right\\|_{\infty}$
	$\displaystyle\leq\max\Big{\{}16\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{\gamma^{3}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}+\frac{14\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N(1-\gamma)^{2}},$
	$\displaystyle\quad 12\sqrt{\frac{2\log(\frac{8SAN^{2}}{(1-\gamma)\delta})}{\gamma^{3}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}+\frac{14\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N(1-\gamma)^{2}}+2\sqrt{\frac{\log(\frac{54SAN^{2}}{\delta})}{(1-\gamma)^{2}N}}\left\\|V^{\widehat{\pi},\sigma}-\widehat{V}^{\widehat{\pi},\sigma}\right\\|_{\infty}\Big{\}}$
	$\displaystyle\leq 24\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{\gamma^{3}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}+\frac{28\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N(1-\gamma)^{2}}.$		(140)

B.5 Proof of Lemma 11

Step 1: controlling the point-wise concentration.

We first consider a more general term w.r.t. any fixed (independent from $\widehat{P}^{0}$ ) value vector $V$ obeying ${0}\leq V\leq\frac{1}{1-\gamma}1$ and any policy $\pi$ . Invoking Lemma 1 leads to that for any $(s,a)\in{\mathcal{S}}\times\mathcal{A}$ ,

$\displaystyle\left\|\widehat{P}^{\pi,V}_{s,a}V-P^{\pi,V}_{s,a}V\right\|$	$\displaystyle\leq\Big{\|}\max_{\alpha\in[\min_{s}V(s),\max_{s}V(s)]}\left\{\widehat{P}^{0}_{s,a}\left[V\right]_{\alpha}-\sigma\left(\alpha-\min_{s^{\prime}}\left[V\right]_{\alpha}(s^{\prime})\right)\right\}$
	$\displaystyle\qquad-\max_{\alpha\in[\min_{s}V(s),\max_{s}V(s)]}\left\{P^{0}_{s,a}\left[V\right]_{\alpha}-w\sigma\left(\alpha-\min_{s^{\prime}}\left[V\right]_{\alpha}(s^{\prime})\right)\right\}\Big{\|}$
	$\displaystyle\leq\max_{\alpha\in[\min_{s}V(s),\max_{s}V(s)]}\underbrace{\left\|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)[V]_{\alpha}\right\|}_{=:g_{s,a}(\alpha,V)},$	(141)

where the last inequality holds by that the maximum operator is $1$ -Lipschitz.

Then for a fixed $\alpha$ and any vector $V$ that is independent with $\widehat{P}^{0}$ , using the Bernstein’s inequality, one has with probability at least $1-\delta$ ,

	$\displaystyle g_{s,a}(\alpha,V)=\left\|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)[V]_{\alpha}\right\|$	$\displaystyle\leq\sqrt{\frac{2\log(\frac{2}{\delta})}{N}}\sqrt{\mathrm{Var}_{P^{0}_{s,a}}([V]_{\alpha})}+\frac{2\log(\frac{2}{\delta})}{3N(1-\gamma)}$
		$\displaystyle\leq\sqrt{\frac{2\log(\frac{2}{\delta})}{N}}\sqrt{\mathrm{Var}_{P^{0}_{s,a}}(V)}+\frac{2\log(\frac{2}{\delta})}{3N(1-\gamma)}.$		(142)

Step 2: deriving the uniform concentration.

To obtain the union bound, we first notice that $g_{s,a}(\alpha,V)$ is $1$ -Lipschitz w.r.t. $\alpha$ for any $V$ obeying $\|V\|_{\infty}\leq\frac{1}{1-\gamma}$ . In addition, we can construct an $\varepsilon_{1}$ -net $N_{\varepsilon_{1}}$ over $[0,\frac{1}{1-\gamma}]$ whose size satisfies $|N_{\varepsilon_{1}}|\leq\frac{3}{\varepsilon_{1}(1-\gamma)}$ (Vershynin,, 2018). By the union bound and (B.5), it holds with probability at least $1-\frac{\delta}{SA}$ that for all $\alpha\in N_{\varepsilon_{1}}$ ,

g_{s,a}(\alpha,V)\leq\sqrt{\frac{2\log(\frac{2SA|N_{\varepsilon_{1}}|}{\delta})}{N}}\sqrt{\mathrm{Var}_{P^{0}_{s,a}}(V)}+\frac{2\log(\frac{2SA|N_{\varepsilon_{1}}|}{\delta})}{3N(1-\gamma)}.

(143)

Combined with (141), it yields that,

$\displaystyle\left\|\widehat{P}^{\pi,V}_{s,a}V-P^{\pi,V}_{s,a}V\right\|$	$\displaystyle\leq\max_{\alpha\in[\min_{s}V(s),\max_{s}V(s)]}\left\|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)[V]_{\alpha}\right\|$
	$\displaystyle\overset{\mathrm{(i)}}{\leq}\varepsilon_{1}+\sup_{\alpha\in N_{\varepsilon_{1}}}\left\|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)[V]_{\alpha}\right\|$
	$\displaystyle\overset{\mathrm{(ii)}}{\leq}\varepsilon_{1}+\sqrt{\frac{2\log(\frac{2SA\|N_{\varepsilon_{1}}\|}{\delta})}{N}}\sqrt{\mathrm{Var}_{P^{0}_{s,a}}(V)}+\frac{2\log(\frac{2SA\|N_{\varepsilon_{1}}\|}{\delta})}{3N(1-\gamma)}$	(144)
	$\displaystyle\overset{\mathrm{(iii)}}{\leq}\sqrt{\frac{2\log(\frac{2SA\|N_{\varepsilon_{1}}\|}{\delta})}{N}}\sqrt{\mathrm{Var}_{P^{0}_{s,a}}(V)}+\frac{\log(\frac{2SA\|N_{\varepsilon_{1}}\|}{\delta})}{N(1-\gamma)}$
	$\displaystyle\overset{\mathrm{(iv)}}{\leq}2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{N}}\sqrt{\mathrm{Var}_{P^{0}_{s,a}}(V)}+\frac{\log(\frac{18SAN}{\delta})}{N(1-\gamma)}$	(145)
	$\displaystyle\leq 2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{N}}\\|V\\|_{\infty}+\frac{\log(\frac{18SAN}{\delta})}{N(1-\gamma)}$
	$\displaystyle\leq 3\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}}$	(146)

where (i) follows from that the optimal $\alpha^{\star}$ falls into the $\varepsilon_{1}$ -ball centered around some point inside $N_{\varepsilon_{1}}$ and $g_{s,a}(\alpha,V)$ is $1$ -Lipschitz, (ii) holds by (143), (iii) arises from taking $\varepsilon_{1}=\frac{\log(\frac{2SA|N_{\varepsilon_{1}}|}{\delta})}{3N(1-\gamma)}$ , (iv) is verified by $|N_{\varepsilon_{1}}|\leq\frac{3}{\varepsilon_{1}(1-\gamma)}\leq 9N$ , and the last inequality is due to the fact $\|V^{\star,\sigma}\|_{\infty}\leq\frac{1}{1-\gamma}$ and letting $N\geq\log(\frac{18SAN}{\delta})$ .

To continue, applying (145) and (146) with $\pi=\pi^{\star}$ and $V=V^{\star,\sigma}$ (independent with $\widehat{P}^{0}$ ) and taking the union bound over $(s,a)\in{\mathcal{S}}\times\mathcal{A}$ gives that with probability at least $1-\delta$ , it holds simultaneously for all $(s,a)\in{\mathcal{S}}\times\mathcal{A}$ that

	$\displaystyle\left\|\widehat{P}^{\pi^{\star},V}_{s,a}V^{\star,\sigma}-P^{\pi^{\star},V}_{s,a}V^{\star,\sigma}\right\|$	$\displaystyle\leq 2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{N}}\sqrt{\mathrm{Var}_{P^{0}_{s,a}}(V^{\star,\sigma})}+\frac{\log(\frac{18SAN}{\delta})}{N(1-\gamma)}$
		$\displaystyle\leq 3\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}}.$		(147)

By converting (147) to the matrix form, one has with probability at least $1-\delta$ ,

	$\displaystyle\left\|\underline{\widehat{P}}^{\pi^{\star},V}V^{\pi^{\star},\sigma}-\underline{P}^{\pi^{\star},V}V^{\pi^{\star},\sigma}\right\|$	$\displaystyle\leq 2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{N}}\sqrt{\mathrm{Var}_{P^{\pi^{\star}}}(V^{\star,\sigma})}+\frac{\log(\frac{18SAN}{\delta})}{N(1-\gamma)}1$
		$\displaystyle\leq 3\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}}1.$		(148)

B.6 Proof of Lemma 12

Following the same argument as (110), it follows

\displaystyle\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{-1}\sqrt{\mathrm{Var}_{\underline{\widehat{P}}^{\pi^{\star},V}}(V^{\star,\sigma})}

\displaystyle=\sqrt{\frac{1}{1-\gamma}}\sqrt{\sum_{t=0}^{\infty}\gamma^{t}\left(\underline{\widehat{P}}^{\pi^{\star},V}\right)^{t}\mathrm{Var}_{\underline{\widehat{P}}^{\pi^{\star},V}}(V^{\star,\sigma})}.

(149)

To continue, we first focus on controlling $\mathrm{Var}_{\underline{\widehat{P}}^{\pi^{\star},V}}(V^{\star,\sigma})$ . Towards this, denoting the minimum value of $V^{\star,\sigma}$ as $V_{\min}\coloneqq\min_{s\in{\mathcal{S}}}V^{\star,\sigma}(s)$ and $V^{\prime}\coloneqq V^{\star,\sigma}-V_{\min}1$ , we arrive at (see the robust Bellman’s consistency equation in (46))

$\displaystyle V^{\prime}=V^{\star,\sigma}-V_{\min}1$	$\displaystyle=r_{\pi^{\star}}+\gamma\underline{P}^{\pi^{\star},V}V^{\star,\sigma}-V_{\min}1$
	$\displaystyle=r_{\pi^{\star}}+\gamma\underline{\widehat{P}}^{\pi^{\star},V}V^{\star,\sigma}+\gamma\Big{(}\underline{P}^{\pi^{\star},V}-\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}V^{\star,\sigma}-V_{\min}1$
	$\displaystyle=r_{\pi^{\star}}-(1-\gamma)V_{\min}1+\gamma\underline{\widehat{P}}^{\pi^{\star},V}V^{\prime}+\gamma\Big{(}\underline{P}^{\pi^{\star},V}-\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}V^{\star,\sigma}$
	$\displaystyle=r_{\pi^{\star}}^{\prime}+\gamma\underline{\widehat{P}}^{\pi^{\star},V}V^{\prime}+\gamma\Big{(}\underline{P}^{\pi^{\star},V}-\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}V^{\star,\sigma},$	(150)

where the last line holds by letting $r_{\pi^{\star}}^{\prime}\coloneqq r_{\pi^{\star}}-(1-\gamma)V_{\min}1\leq r_{\pi^{\star}}$ . With the above fact in hand, we control $\mathrm{Var}_{\underline{\widehat{P}}^{\pi^{\star},V}}(V^{\star,\sigma})$ as follows:

$\displaystyle\mathrm{Var}_{\underline{\widehat{P}}^{\pi^{\star},V}}(V^{\star,\sigma})$	$\displaystyle\overset{\mathrm{(i)}}{=}\mathrm{Var}_{\underline{\widehat{P}}^{\pi^{\star},V}}(V^{\prime})=\underline{\widehat{P}}^{\pi^{\star},V}\left(V^{\prime}\circ V^{\prime}\right)-\big{(}\underline{\widehat{P}}^{\pi^{\star},V}V^{\prime}\big{)}\circ\big{(}\underline{\widehat{P}}^{\pi^{\star},V}V^{\prime}\big{)}$
	$\displaystyle\overset{\mathrm{(ii)}}{=}\underline{\widehat{P}}^{\pi^{\star},V}\left(V^{\prime}\circ V^{\prime}\right)-\frac{1}{\gamma^{2}}\Big{(}V^{\prime}-r_{\pi^{\star}}^{\prime}-\gamma\Big{(}\underline{P}^{\pi^{\star},V}-\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}V^{\star,\sigma}\Big{)}^{\circ 2}$
	$\displaystyle=\underline{\widehat{P}}^{\pi^{\star},V}\left(V^{\prime}\circ V^{\prime}\right)-\frac{1}{\gamma^{2}}V^{\prime}\circ V^{\prime}+\frac{2}{\gamma^{2}}V^{\prime}\circ\Big{(}r_{\pi^{\star}}^{\prime}+\gamma\Big{(}\underline{P}^{\pi^{\star},V}-\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}V^{\star,\sigma}\Big{)}$
	$\displaystyle\quad-\frac{1}{\gamma^{2}}\Big{(}r_{\pi^{\star}}^{\prime}+\gamma\Big{(}\underline{P}^{\pi^{\star},V}-\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}V^{\star,\sigma}\Big{)}^{\circ 2}$
	$\displaystyle\overset{\mathrm{(iii)}}{\leq}\underline{\widehat{P}}^{\pi^{\star},V}\left(V^{\prime}\circ V^{\prime}\right)-\frac{1}{\gamma}V^{\prime}\circ V^{\prime}+\frac{2}{\gamma^{2}}\\|V^{\prime}\\|_{\infty}1+\frac{2}{\gamma}\\|V^{\prime}\\|_{\infty}\Big{\|}\Big{(}\underline{P}^{\pi^{\star},V}-\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}V^{\star,\sigma}\Big{\|}$	(151)
	$\displaystyle\leq\underline{\widehat{P}}^{\pi^{\star},V}\left(V^{\prime}\circ V^{\prime}\right)-\frac{1}{\gamma}V^{\prime}\circ V^{\prime}+\frac{2}{\gamma^{2}}\\|V^{\prime}\\|_{\infty}1+\frac{6}{\gamma}\\|V^{\prime}\\|_{\infty}\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}}1,$	(152)

where (i) holds by the fact that $\mathrm{Var}_{P_{\pi}}(V-b1)=\mathrm{Var}_{P_{\pi}}(V)$ for any scalar $b$ and $V\in\mathbb{R}^{S}$ , (ii) follows from (150), (iii) arises from $\frac{1}{\gamma^{2}}V^{\prime}\circ V^{\prime}\geq\frac{1}{\gamma}V^{\prime}\circ V^{\prime}$ and $-1\leq r_{\pi^{\star}}-(1-\gamma)V_{\min}1=r_{\pi^{\star}}^{\prime}\leq r_{\pi^{\star}}\leq 1$ , and the last inequality holds by Lemma 11.

Plugging (152) into (149) leads to

	$\displaystyle\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{-1}\sqrt{\mathrm{Var}_{\underline{\widehat{P}}^{\pi^{\star},V}}(V^{\star,\sigma})}$
	$\displaystyle\leq\sqrt{\frac{1}{1-\gamma}}\sqrt{\sum_{t=0}^{\infty}\gamma^{t}\left(\underline{\widehat{P}}^{\pi^{\star},V}\right)^{t}\Bigg{(}\underline{\widehat{P}}^{\pi^{\star},V}\left(V^{\prime}\circ V^{\prime}\right)-\frac{1}{\gamma}V^{\prime}\circ V^{\prime}+\frac{2}{\gamma^{2}}\\|V^{\prime}\\|_{\infty}1+\frac{6}{\gamma}\\|V^{\prime}\\|_{\infty}\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}}1\Bigg{)}}$
	$\displaystyle\overset{\mathrm{(i)}}{\leq}\sqrt{\frac{1}{1-\gamma}}\sqrt{\bigg{\|}\sum_{t=0}^{\infty}\gamma^{t}\left(\underline{\widehat{P}}^{\pi^{\star},V}\right)^{t}\bigg{(}\underline{\widehat{P}}^{\pi^{\star},V}\left(V^{\prime}\circ V^{\prime}\right)-\frac{1}{\gamma}V^{\prime}\circ V^{\prime}\bigg{)}\bigg{\|}}$
	$\displaystyle\qquad+\sqrt{\frac{1}{1-\gamma}}\sqrt{\sum_{t=0}^{\infty}\gamma^{t}\left(\underline{\widehat{P}}^{\pi^{\star},V}\right)^{t}\Bigg{(}\frac{2}{\gamma^{2}}\\|V^{\prime}\\|_{\infty}1+\frac{6}{\gamma}\\|V^{\prime}\\|_{\infty}\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}}1\Bigg{)}}$
	$\displaystyle\leq\sqrt{\frac{1}{1-\gamma}}\sqrt{\bigg{\|}\sum_{t=0}^{\infty}\gamma^{t}\left(\underline{\widehat{P}}^{\pi^{\star},V}\right)^{t}\bigg{[}\underline{\widehat{P}}^{\pi^{\star},V}\left(V^{\prime}\circ V^{\prime}\right)-\frac{1}{\gamma}V^{\prime}\circ V^{\prime}\bigg{]}\bigg{\|}}+\sqrt{\frac{\left(2+6\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}}\right)\\|V^{\prime}\\|_{\infty}}{(1-\gamma)^{2}\gamma^{2}}}1,$		(153)

where (i) holds by the triangle inequality. Therefore, the remainder of the proof shall focus on the first term, which follows

	$\displaystyle\bigg{\|}\sum_{t=0}^{\infty}\gamma^{t}\Big{(}\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{t}\Big{(}\underline{\widehat{P}}^{\pi^{\star},V}\left(V^{\prime}\circ V^{\prime}\right)-\frac{1}{\gamma}V^{\prime}\circ V^{\prime}\Big{)}\bigg{\|}$
	$\displaystyle=\bigg{\|}\bigg{(}\sum_{t=0}^{\infty}\gamma^{t}\Big{(}\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{t+1}-\sum_{t=0}^{\infty}\gamma^{t-1}\Big{(}\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{t}\bigg{)}\left(V^{\prime}\circ V^{\prime}\right)\bigg{\|}\leq\frac{1}{\gamma}\\|V^{\prime}\\|_{\infty}^{2}1$		(154)

by recursion. Inserting (154) back to (153) leads to

	$\displaystyle\bigg{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\bigg{)}^{-1}\sqrt{\mathrm{Var}_{\underline{\widehat{P}}^{\pi^{\star},V}}(V^{\star,\sigma})}$
	$\displaystyle\leq\sqrt{\frac{\\|V^{\prime}\\|_{\infty}^{2}}{\gamma(1-\gamma)}}1+3\sqrt{\frac{\Big{(}1+\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}}\Big{)}\\|V^{\prime}\\|_{\infty}}{(1-\gamma)^{2}\gamma^{2}}}1$
	$\displaystyle\leq 4\sqrt{\frac{\Big{(}1+\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}}\Big{)}\\|V^{\prime}\\|_{\infty}}{(1-\gamma)^{2}\gamma^{2}}}1\leq 4\sqrt{\frac{\Big{(}1+\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}}\Big{)}}{\gamma^{3}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}}}1\leq 4\sqrt{\frac{\Big{(}1+\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}}\Big{)}}{\gamma^{3}(1-\gamma)^{3}}}1,$		(155)

where the penultimate inequality follows from applying Lemma 6 with $P=P^{0}$ and $\pi=\pi^{\star}$ :

\displaystyle\|V^{\prime}\|_{\infty}=\max_{s\in{\mathcal{S}}}V^{\star,\sigma}(s)-\min_{s\in{\mathcal{S}}}V^{\star,\sigma}(s)\leq\frac{1}{\gamma\max\{1-\gamma,\sigma\}}.

B.7 Proof of Lemma 14

To begin with, for any $(s,a)\in{\mathcal{S}}\times\mathcal{A}$ , invoking the results in (141), we have

	$\displaystyle\left\|\widehat{P}^{\widehat{\pi},\widehat{V}}_{s,a}\widehat{V}^{\widehat{\pi},\sigma}-P^{\widehat{\pi},\widehat{V}}_{s,a}\widehat{V}^{\widehat{\pi},\sigma}\right\|\leq\max_{\alpha\in[\min_{s}\widehat{V}^{\widehat{\pi},\sigma}(s),\max_{s}\widehat{V}^{\widehat{\pi},\sigma}(s)]}\left\|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)\big{[}\widehat{V}^{\widehat{\pi},\sigma}\big{]}_{\alpha}\right\|$
	$\displaystyle\overset{\mathrm{(i)}}{\leq}\max_{\alpha\in[\min_{s}\widehat{V}^{\widehat{\pi},\sigma}(s),\max_{s}\widehat{V}^{\widehat{\pi},\sigma}(s)]}\left(\left\|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)\big{[}\widehat{V}^{\star,\sigma}\big{]}_{\alpha}\right\|+\left\|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)\left(\big{[}\widehat{V}^{\widehat{\pi},\sigma}\big{]}_{\alpha}-\big{[}\widehat{V}^{\star,\sigma}\big{]}_{\alpha}\right)\right\|\right)$
	$\displaystyle\leq\max_{\alpha\in[\min_{s}\widehat{V}^{\widehat{\pi},\sigma}(s),\max_{s}\widehat{V}^{\widehat{\pi},\sigma}(s)]}\Big{(}\left\|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)\big{[}\widehat{V}^{\star,\sigma}\right]_{\alpha}\big{\|}+\left\\|P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right\\|_{1}\left\\|\big{[}\widehat{V}^{\widehat{\pi},\sigma}\big{]}_{\alpha}-\big{[}\widehat{V}^{\star,\sigma}\right]_{\alpha}\big{\\|}_{\infty}\Big{)}$
	$\displaystyle\overset{\mathrm{(ii)}}{\leq}\max_{\alpha\in[\min_{s}\widehat{V}^{\widehat{\pi},\sigma}(s),\max_{s}\widehat{V}^{\widehat{\pi},\sigma}(s)]}\left\|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)\big{[}\widehat{V}^{\star,\sigma}\big{]}_{\alpha}\right\|+2\left\\|\widehat{V}^{\widehat{\pi},\sigma}-\widehat{V}^{\star,\sigma}\right\\|_{\infty}$
	$\displaystyle\overset{\mathrm{(iii)}}{\leq}\max_{\alpha\in[\min_{s}\widehat{V}^{\widehat{\pi},\sigma}(s),\max_{s}\widehat{V}^{\widehat{\pi},\sigma}(s)]}\left\|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)\big{[}\widehat{V}^{\star,\sigma}\big{]}_{\alpha}\right\|+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma},$		(156)

where (i) holds by the triangle inequality, and (ii) follows from $\big{\|}P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\big{\|}_{1}\leq 2$ and $\big{\|}\big{[}\widehat{V}^{\widehat{\pi},\sigma}\big{]}_{\alpha}-\big{[}\widehat{V}^{\star,\sigma}\big{]}_{\alpha}\big{\|}_{\infty}\leq\big{\|}\widehat{V}^{\widehat{\pi},\sigma}-\widehat{V}^{\star,\sigma}\big{\|}_{\infty}$ , and (iii) follows from (50).

To control $\left|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)\big{[}\widehat{V}^{\star,\sigma}\big{]}_{\alpha}\right|$ in (156) for any given $\alpha\in\big{[}0,\frac{1}{1-\gamma}\big{]}$ , and tame the dependency between $\widehat{V}^{\star,\sigma}$ and $\widehat{P}^{0}$ , we resort to the following leave-one-out argument motivated by (Agarwal et al.,, 2020; Li et al., 2022b, ; Shi and Chi,, 2022). Specifically, we first construct a set of auxiliary RMDPs which simultaneously have the desired statistical independence between robust value functions and the estimated nominal transition kernel, and are minimally different from the original RMDPs under consideration. Then we control the term of interest associated with these auxiliary RMDPs and show the value is close to the target quantity for the desired RMDP. The process is divided into several steps as below.

Step 1: construction of auxiliary RMDPs with deterministic empirical nominal transitions.

Recall that we target the empirical infinite-horizon robust MDP $\widehat{\mathcal{M}}_{\mathsf{rob}}$ with the nominal transition kernel $\widehat{P}^{0}$ . Towards this, we can construct an auxiliary robust MDP $\widehat{\mathcal{M}}^{s,u}_{\mathsf{rob}}$ for each state $s$ and any non-negative scalar $u\geq 0$ , so that it is the same as $\widehat{\mathcal{M}}_{\mathsf{rob}}$ except for the transition properties in state $s$ . In particular, we define the nominal transition kernel and reward function of $\widehat{\mathcal{M}}^{s,u}_{\mathsf{rob}}$ as $P^{s,u}$ and $r^{s,u}$ , which are expressed as follows

\displaystyle\begin{cases}P^{s,u}(s^{\prime}\,|\,s,a)=\operatorname{\mathds{1}}(s^{\prime}=s)&\qquad\qquad\text{for all }(s^{\prime},a)\in{\mathcal{S}}\times\mathcal{A},\\ P^{s,u}(\cdot\,|\,\widetilde{s},a)=\widehat{P}^{0}(\cdot\,|\,\widetilde{s},a)&\qquad\qquad\text{for all }(\widetilde{s},a)\in{\mathcal{S}}\times\mathcal{A}\text{ and }\widetilde{s}\neq s,\end{cases}

(157)

and

\displaystyle\begin{cases}r^{s,u}(s,a)=u&\qquad\qquad\qquad\text{for all }a\in\mathcal{A},\\ r^{s,u}(\widetilde{s},a)=r(\widetilde{s},a)&\qquad\qquad\qquad\text{for all }(\widetilde{s},a)\in{\mathcal{S}}\times\mathcal{A}\text{ and }\widetilde{s}\neq s.\end{cases}

(158)

It is evident that the nominal transition probability at state $s$ of the auxiliary $\widehat{\mathcal{M}}^{s,u}_{\mathsf{rob}}$ , i.e. it never leaves state $s$ once entered. This useful property removes the randomness of $\widehat{P}^{0}_{s,a}$ for all $a\in\mathcal{A}$ in state $s$ , which will be leveraged later.

Correspondingly, the robust Bellman operator $\widehat{{\mathcal{T}}}^{\sigma}_{s,u}(\cdot)$ associated with the RMDP $\widehat{\mathcal{M}}^{s,u}_{\mathsf{rob}}$ is defined as

\displaystyle\forall(\tilde{s},a)\in{\mathcal{S}}\times\mathcal{A}:\quad\widehat{{\mathcal{T}}}^{\sigma}_{s,u}(Q)(\tilde{s},a)=r^{s,u}(\tilde{s},a)+\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(P^{s,u}_{\tilde{s},a})}\mathcal{P}V,\qquad\text{with }V(\tilde{s})=\max_{a}Q(\tilde{s},a).

(159)

Step 2: fixed-point equivalence between $\widehat{\mathcal{M}}_{\mathsf{rob}}$ and the auxiliary RMDP $\widehat{\mathcal{M}}^{s,u}_{\mathsf{rob}}$ .

Recall that $\widehat{Q}^{\star,\sigma}$ is the unique fixed point of $\widehat{{\mathcal{T}}}^{\sigma}(\cdot)$ with the corresponding robust value $\widehat{V}^{\star,\sigma}$ . We assert that the corresponding robust value function $\widehat{V}^{\star,\sigma}_{s,u^{\star}}$ obtained from the fixed point of $\widehat{{\mathcal{T}}}^{\sigma}_{s,u}(\cdot)$ aligns with the robust value function $\widehat{V}^{\star,\sigma}$ derived from $\widehat{{\mathcal{T}}}^{\sigma}(\cdot)$ , as long as we choose $u$ in the following manner:

\displaystyle u^{\star}\coloneqq u^{\star}(s)=\widehat{V}^{\star,\sigma}(s)-\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(e_{s})}\mathcal{P}\widehat{V}^{\star,\sigma}.

(160)

where $e_{s}$ is the $s$ -th standard basis vector in $\mathbb{R}^{S}$ . Towards verifying this, we shall break our arguments in two different cases.

•

For state $s$ : One has for any $a\in\mathcal{A}$ :

	$\displaystyle r^{s,u^{\star}}(s,a)+\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(P^{s,u^{\star}}_{s,a})}\mathcal{P}\widehat{V}^{\star,\sigma}$	$\displaystyle=u^{\star}+\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(e_{s})}\mathcal{P}\widehat{V}^{\star,\sigma}$
		$\displaystyle=\widehat{V}^{\star,\sigma}(s)-\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(e_{s})}\mathcal{P}\widehat{V}^{\star,\sigma}+\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(e_{s})}\mathcal{P}\widehat{V}^{\star,\sigma}=\widehat{V}^{\star,\sigma}(s),$		(161)

where the first equality follows from the definition of $P^{s,u^{\star}}_{s,a}$ in (157), and the second equality follows from plugging in the definition of $u^{\star}$ in (160).

•

For state $s^{\prime}\neq s$ : It is easily verified that for all $a\in\mathcal{A}$ ,

	$\displaystyle r^{s,u^{\star}}(s^{\prime},a)+\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(P^{s,u^{\star}}_{s^{\prime},a})}\mathcal{P}\widehat{V}^{\star,\sigma}$	$\displaystyle=r(s^{\prime},a)+\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(\widehat{P}^{0}_{s^{\prime},a})}\mathcal{P}\widehat{V}^{\star,\sigma}$
		$\displaystyle=\widehat{{\mathcal{T}}}^{\sigma}(\widehat{Q}^{\star,\sigma})(s^{\prime},a)=\widehat{Q}^{\star,\sigma}(s^{\prime},a),$		(162)

where the first equality follows from the definitions in (158) and (157), and the last line arises from the definition of the robust Bellman operator in (15), and that $\widehat{Q}^{\star,\sigma}$ is the fixed point of $\widehat{{\mathcal{T}}}^{\sigma}(\cdot)$ (see Lemma 4).

Combining the facts in the above two cases, we establish that there exists a fixed point $\widehat{Q}^{\star,\sigma}_{s,u^{\star}}$ of the operator $\widehat{{\mathcal{T}}}^{\sigma}_{s,u^{\star}}(\cdot)$ by taking

\displaystyle\begin{cases}\widehat{Q}^{\star,\sigma}_{s,u^{\star}}(s,a)=\widehat{V}^{\star,\sigma}(s)&\qquad\qquad\qquad\text{for all }a\in\mathcal{A},\\ \widehat{Q}^{\star,\sigma}_{s,u^{\star}}(s^{\prime},a)=\widehat{Q}^{\star,\sigma}(s^{\prime},a)&\qquad\qquad\qquad\text{for all }s^{\prime}\neq s\text{ and }a\in\mathcal{A}.\end{cases}

(163)

Consequently, we confirm the existence of a fixed point of the operator $\widehat{{\mathcal{T}}}^{\sigma}_{s,u^{\star}}(\cdot)$ . In addition, its corresponding value function $\widehat{V}^{\star,\sigma}_{s,u^{\star}}$ also coincides with $\widehat{V}^{\star,\sigma}$ . Note that the corresponding facts between $\widehat{\mathcal{M}}_{\mathsf{rob}}$ and $\widehat{\mathcal{M}}_{\mathsf{rob}}^{s,u}$ in Step 1 and step 2 holds in fact for any uncertainty set.

Step 3: building an $\varepsilon$ -net for all reward values $u$ .

It is easily verified that

\displaystyle 0\leq u^{\star}\leq\widehat{V}^{\star,\sigma}(s)\leq\frac{1}{1-\gamma}.

(164)

We can construct a $N_{\varepsilon_{2}}$ -net over the interval $\big{[}0,\frac{1}{1-\gamma}\big{]}$ , where the size is bounded by $|N_{\varepsilon_{2}}|\leq\frac{3}{\varepsilon_{2}(1-\gamma)}$ (Vershynin,, 2018). Following the same arguments in the proof of Lemma 4, we can demonstrate that for each $u\in N_{\varepsilon_{2}}$ , there exists a unique fixed point $\widehat{Q}^{\star,\sigma}_{s,u}$ of the operator $\widehat{{\mathcal{T}}}^{\sigma}_{s,u}(\cdot)$ , which satisfies ${0}\leq\widehat{Q}^{\star,\sigma}_{s,u}\leq\frac{1}{1-\gamma}\cdot 1$ . Consequently, the corresponding robust value function also satisfies $\left\|\widehat{V}^{\star,\sigma}_{s,u}\right\|_{\infty}\leq\frac{1}{1-\gamma}$ .

By the definitions in (157) and (158), we observe that for all $u\in N_{\varepsilon_{2}}$ , $\widehat{\mathcal{M}}_{\mathsf{rob}}^{s,u}$ is statistically independent from $\widehat{P}^{0}_{s,a}$ . This independence indicates that $[\widehat{V}^{\star,\sigma}_{s,u}]_{\alpha}$ and $\widehat{P}^{0}_{s,a}$ are independent for a fixed $\alpha$ . With this in mind, invoking the fact in (145) and (146) and taking the union bound over all $(s,a,\alpha)\in{\mathcal{S}}\times\mathcal{A}\times N_{\varepsilon_{1}}$ , $u\in N_{\varepsilon_{2}}$ yields that, with probability at least $1-\delta$ , it holds for all $(s,a,u)\in{\mathcal{S}}\times\mathcal{A}\times N_{\varepsilon_{2}}$ that

	$\displaystyle\max_{\alpha\in[0,1/(1-\gamma)]}\left\|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)\big{[}\widehat{V}^{\star,\sigma}_{s,u}\big{]}_{\alpha}\right\|$	$\displaystyle\leq\varepsilon_{2}+2\sqrt{\frac{\log(\frac{18SAN\|N_{\varepsilon_{2}}\|}{\delta})}{N}}\sqrt{\mathrm{Var}_{P^{0}_{s,a}}(\widehat{V}^{\star,\sigma}_{s,u})}+\frac{2\log(\frac{18SAN\|N_{\varepsilon_{2}}\|}{\delta})}{3N(1-\gamma)}$
		$\displaystyle\leq\varepsilon_{2}+3\sqrt{\frac{\log(\frac{18SAN\|N_{\varepsilon_{2}}\|}{\delta})}{(1-\gamma)^{2}N}},$		(165)

where the last inequality holds by the fact $\mathrm{Var}_{P^{0}_{s,a}}(\widehat{V}^{\star,\sigma}_{s,u})\leq\|\widehat{V}^{\star,\sigma}_{s,u}\|_{\infty}\leq\frac{1}{1-\gamma}$ and letting $N\geq\log\left(\frac{18SAN|N_{\varepsilon_{2}}|}{\delta}\right)$ .

Step 4: uniform concentration.

Recalling that $u^{\star}\in\big{[}0,\frac{1}{1-\gamma}\big{]}$ (see (164)), we can always find some $\overline{u}\in N_{\varepsilon_{2}}$ such that $|\overline{u}-u^{\star}|\leq\varepsilon_{2}$ . Consequently, plugging in the operator $\widehat{{\mathcal{T}}}^{\sigma}_{s,u}(\cdot)$ in (159) yields

\displaystyle\forall Q\in\mathbb{R}^{SA}:\quad\Big{\|}\widehat{{\mathcal{T}}}^{\sigma}_{s,\overline{u}}(Q)-\widehat{{\mathcal{T}}}^{\sigma}_{s,u^{\star}}(Q)\Big{\|}_{\infty}=|\overline{u}-u^{\star}|\leq\varepsilon_{2}

With this in mind, we observe that the fixed points of $\widehat{{\mathcal{T}}}^{\sigma}_{s,\overline{u}}(\cdot)$ and $\widehat{{\mathcal{T}}}^{\sigma}_{s,u^{\star}}(\cdot)$ obey

	$\displaystyle\left\\|\widehat{Q}^{\star,\sigma}_{s,\overline{u}}-\widehat{Q}^{\star,\sigma}_{s,u^{\star}}\right\\|_{\infty}$	$\displaystyle=\left\\|\widehat{{\mathcal{T}}}^{\sigma}_{s,\overline{u}}(\widehat{Q}^{\star,\sigma}_{s,\overline{u}})-\widehat{{\mathcal{T}}}^{\sigma}_{s,u^{\star}}(\widehat{Q}^{\star,\sigma}_{s,u^{\star}})\right\\|_{\infty}$
		$\displaystyle\leq\left\\|\widehat{{\mathcal{T}}}^{\sigma}_{s,\overline{u}}(\widehat{Q}^{\star,\sigma}_{s,\overline{u}})-\widehat{{\mathcal{T}}}^{\sigma}_{s,\overline{u}}(\widehat{Q}^{\star,\sigma}_{s,u^{\star}})\right\\|_{\infty}+\left\\|\widehat{{\mathcal{T}}}^{\sigma}_{s,\overline{u}}(\widehat{Q}^{\star,\sigma}_{s,u^{\star}})-\widehat{{\mathcal{T}}}^{\sigma}_{s,u^{\star}}(\widehat{Q}^{\star,\sigma}_{s,u^{\star}})\right\\|_{\infty}$
		$\displaystyle\leq\gamma\left\\|\widehat{Q}^{\star,\sigma}_{s,\overline{u}}-\widehat{Q}^{\star,\sigma}_{s,u^{\star}}\right\\|_{\infty}+\varepsilon_{2},$

where the last inequality holds by the fact that $\widehat{{\mathcal{T}}}^{\sigma}_{s,u}(\cdot)$ is a $\gamma$ -contraction. It directly indicates that

\displaystyle\left\|\widehat{Q}^{\star,\sigma}_{s,\overline{u}}-\widehat{Q}^{\star,\sigma}_{s,u^{\star}}\right\|_{\infty}\leq\frac{\varepsilon_{2}}{(1-\gamma)}\quad\mbox{and}\quad\left\|\widehat{V}^{\star,\sigma}_{s,\overline{u}}-\widehat{V}^{\star,\sigma}_{s,u^{\star}}\right\|_{\infty}\leq\left\|\widehat{Q}^{\star,\sigma}_{s,\overline{u}}-\widehat{Q}^{\star,\sigma}_{s,u^{\star}}\right\|_{\infty}\leq\frac{\varepsilon_{2}}{(1-\gamma)}.

(166)

Armed with the above facts, to control the first term in (156), invoking the identity $\widehat{V}^{\star,\sigma}=\widehat{V}^{\star,\sigma}_{s,u^{\star}}$ established in Step 2 gives that: for all $(s,a)\in{\mathcal{S}}\times\mathcal{A}$ ,

	$\displaystyle\max_{\alpha\in[\min_{s}\widehat{V}^{\widehat{\pi},\sigma}(s),\max_{s}\widehat{V}^{\widehat{\pi},\sigma}(s)]}\left\|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)[\widehat{V}^{\star,\sigma}]_{\alpha}\right\|$
	$\displaystyle\leq\max_{\alpha\in[0,1/(1-\gamma)]}\left\|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)[\widehat{V}^{\star,\sigma}]_{\alpha}\right\|=\max_{\alpha\in[0,1/(1-\gamma)]}\left\|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)[\widehat{V}^{\star,\sigma}_{s,u^{\star}}]_{\alpha}\right\|$
	$\displaystyle\overset{\mathrm{(i)}}{\leq}\max_{\alpha\in[0,1/(1-\gamma)]}\left\{\left\|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)[\widehat{V}^{\star,\sigma}_{s,\overline{u}}]_{\alpha}\right\|+\left\|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)\left([\widehat{V}^{\star,\sigma}_{s,\overline{u}}]_{\alpha}-[\widehat{V}^{\star,\sigma}_{s,u^{\star}}]_{\alpha}\right)\right\|\right\}$
	$\displaystyle\overset{\mathrm{(ii)}}{\leq}\max_{\alpha\in[0,1/(1-\gamma)]}\left\|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)[\widehat{V}^{\star,\sigma}_{s,\overline{u}}]_{\alpha}\right\|+\frac{2\varepsilon_{2}}{(1-\gamma)}$
	$\displaystyle\overset{\mathrm{(iii)}}{\leq}\frac{2\varepsilon_{2}}{(1-\gamma)}+\varepsilon_{2}+2\sqrt{\frac{\log(\frac{18SAN\|N_{\varepsilon_{2}}\|}{\delta})}{N}}\sqrt{\mathrm{Var}_{P^{0}_{s,a}}(\widehat{V}^{\star,\sigma}_{s,u})}+\frac{2\log(\frac{18SAN\|N_{\varepsilon_{2}}\|}{\delta})}{3N(1-\gamma)}$
	$\displaystyle\leq\frac{3\varepsilon_{2}}{(1-\gamma)}+2\sqrt{\frac{\log(\frac{18SAN\|N_{\varepsilon_{2}}\|}{\delta})}{N}}\sqrt{\mathrm{Var}_{P^{0}_{s,a}}(\widehat{V}^{\star,\sigma})}+\frac{2\log(\frac{18SAN\|N_{\varepsilon_{2}}\|}{\delta})}{3N(1-\gamma)}$
	$\displaystyle\qquad+2\sqrt{\frac{\log(\frac{18SAN\|N_{\varepsilon_{2}}\|}{\delta})}{N}}\sqrt{\left\|\mathrm{Var}_{P^{0}_{s,a}}(\widehat{V}^{\star,\sigma})-\mathrm{Var}_{P^{0}_{s,a}}(\widehat{V}^{\star,\sigma}_{s,\overline{u}})\right\|}$
	$\displaystyle\overset{\mathrm{(iv)}}{\leq}\frac{3\varepsilon_{2}}{(1-\gamma)}+2\sqrt{\frac{\log(\frac{18SAN\|N_{\varepsilon_{2}}\|}{\delta})}{N}}\sqrt{\mathrm{Var}_{P^{0}_{s,a}}(\widehat{V}^{\star,\sigma})}+\frac{2\log(\frac{18SAN\|N_{\varepsilon_{2}}\|}{\delta})}{3N(1-\gamma)}+2\sqrt{\frac{2\varepsilon_{2}\log(\frac{18SAN\|N_{\varepsilon_{2}}\|}{\delta})}{N(1-\gamma)^{2}}}$
	$\displaystyle\leq 2\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N}}\sqrt{\mathrm{Var}_{P^{0}_{s,a}}(\widehat{V}^{\star,\sigma})}+\frac{8\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N(1-\gamma)}$		(167)
	$\displaystyle\leq 10\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{(1-\gamma)^{2}N}},$		(168)

where (i) holds by the triangle inequality, (ii) arises from (the last inequality holds by (166))

	$\displaystyle\left\|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)\left([\widehat{V}^{\star,\sigma}_{s,\overline{u}}]_{\alpha}-[\widehat{V}^{\star,\sigma}_{s,u^{\star}}]_{\alpha}\right)\right\|$	$\displaystyle\leq\left\\|P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right\\|_{1}\left\\|[\widehat{V}^{\star,\sigma}_{s,\overline{u}}]_{\alpha}-[\widehat{V}^{\star,\sigma}_{s,u^{\star}}]_{\alpha}\right\\|_{\infty}$
		$\displaystyle\leq 2\left\\|\widehat{V}^{\star,\sigma}_{s,\overline{u}}-\widehat{V}^{\star,\sigma}_{s,u^{\star}}\right\\|_{\infty}\leq\frac{2\varepsilon_{2}}{(1-\gamma)},$		(169)

(iii) follows from (B.7), (iv) can be verified by applying Lemma 3 with (166). Here, the penultimate inequality holds by letting $\varepsilon_{2}=\frac{\log(\frac{18SAN|N_{\varepsilon_{2}}|}{\delta})}{N}$ , which leads to $|N_{\varepsilon_{2}}|\leq\frac{3}{\varepsilon_{2}(1-\gamma)}\leq\frac{3N}{1-\gamma}$ , and the last inequality holds by the fact $\mathrm{Var}_{P^{0}_{s,a}}(\widehat{V}^{\star,\sigma})\leq\|\widehat{V}^{\star,\sigma}\|_{\infty}\leq\frac{1}{1-\gamma}$ and letting $N\geq\log\left(\frac{54SAN^{2}}{(1-\gamma)\delta}\right)$ .

Step 5: finishing up.

Inserting (167) and (168) back into (156) and combining with (168) give that with probability at least $1-\delta$ ,

$\displaystyle\left\|\widehat{P}^{\widehat{\pi},\widehat{V}}_{s,a}\widehat{V}^{\widehat{\pi},\sigma}-P^{\widehat{\pi},\widehat{V}}_{s,a}\widehat{V}^{\widehat{\pi},\sigma}\right\|$	$\displaystyle\leq\max_{\alpha\in[\min_{s}\widehat{V}^{\widehat{\pi},\sigma}(s),\max_{s}\widehat{V}^{\widehat{\pi},\sigma}(s)]}\left\|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)[\widehat{V}^{\star,\sigma}]_{\alpha}\right\|+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma}$
	$\displaystyle\leq\max_{\alpha\in[0,1/(1-\gamma)]}\left\|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)[\widehat{V}^{\star,\sigma}]_{\alpha}\right\|+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma}$
	$\displaystyle\leq 2\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N}}\sqrt{\mathrm{Var}_{P^{0}_{s,a}}(\widehat{V}^{\star,\sigma})}+\frac{8\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N(1-\gamma)}+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma}$
	$\displaystyle\leq 10\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{(1-\gamma)^{2}N}}+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma}$	(170)

holds for all $(s,a)\in{\mathcal{S}}\times\mathcal{A}$ .

Finally, we complete the proof by compiling everything into the matrix form as follows:

	$\displaystyle\bigg{\|}\underline{\widehat{P}}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}-\underline{P}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}\bigg{\|}$	$\displaystyle\leq 2\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N}}\sqrt{\mathrm{Var}_{P^{0}_{s,a}}(\widehat{V}^{\star,\sigma})}1+\frac{8\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N(1-\gamma)}1+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma}1$
		$\displaystyle\leq 10\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{(1-\gamma)^{2}N}}1+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma}1.$		(171)

B.8 Proof of Lemma 15

The proof can be achieved by directly applying the same routine as Appendix B.6. Towards this, similar to (149), we arrive at

\displaystyle\Big{(}I-\gamma\underline{P}^{\widehat{\pi},\widehat{V}}\Big{)}^{-1}\sqrt{\mathrm{Var}_{\underline{P}^{\widehat{\pi},\widehat{V}}}(\widehat{V}^{\widehat{\pi},\sigma})}\leq\sqrt{\frac{1}{1-\gamma}}\sqrt{\sum_{t=0}^{\infty}\gamma^{t}\Big{(}\underline{P}^{\widehat{\pi},\widehat{V}}\Big{)}^{t}\mathrm{Var}_{\underline{P}^{\widehat{\pi},\widehat{V}}}(\widehat{V}^{\widehat{\pi},\sigma})}.

(172)

To control $\mathrm{Var}_{\underline{P}^{\widehat{\pi},\widehat{V}}}(\widehat{V}^{\widehat{\pi},\sigma})$ , we denote the minimum value of $\widehat{V}^{\widehat{\pi},\sigma}$ as $V_{\min}=\min_{s\in{\mathcal{S}}}\widehat{V}^{\widehat{\pi},\sigma}(s)$ and $V^{\prime}\coloneqq\widehat{V}^{\widehat{\pi},\sigma}-V_{\min}1$ . By the same argument as (151), we arrive at

	$\displaystyle\mathrm{Var}_{\underline{P}^{\widehat{\pi},\widehat{V}}}(\widehat{V}^{\widehat{\pi},\sigma})$
	$\displaystyle\leq\underline{P}^{\widehat{\pi},\widehat{V}}\left(V^{\prime}\circ V^{\prime}\right)-\frac{1}{\gamma}V^{\prime}\circ V^{\prime}+\frac{2}{\gamma^{2}}\\|V^{\prime}\\|_{\infty}1+\frac{2}{\gamma}\\|V^{\prime}\\|_{\infty}\left\|\left(\underline{\widehat{P}}^{\widehat{\pi},\widehat{V}}-\underline{P}^{\widehat{\pi},\widehat{V}}\right)\widehat{V}^{\widehat{\pi},\sigma}\right\|$
	$\displaystyle\leq\underline{P}^{\widehat{\pi},\widehat{V}}\left(V^{\prime}\circ V^{\prime}\right)-\frac{1}{\gamma}V^{\prime}\circ V^{\prime}+\frac{2}{\gamma^{2}}\\|V^{\prime}\\|_{\infty}1+\frac{2}{\gamma}\\|V^{\prime}\\|_{\infty}\Bigg{(}10\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{(1-\gamma)^{2}N}}+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma}\Bigg{)}1,$		(173)

where the last inequality makes use of Lemma 14. Plugging (173) back into (172) leads to

$\displaystyle\Big{(}I-\gamma\underline{P}^{\widehat{\pi},\widehat{V}}\Big{)}^{-1}\sqrt{\mathrm{Var}_{\underline{P}^{\widehat{\pi},\widehat{V}}}(\widehat{V}^{\widehat{\pi},\sigma})}$	$\displaystyle\overset{\mathrm{(i)}}{\leq}\sqrt{\frac{1}{1-\gamma}}\sqrt{\bigg{\|}\sum_{t=0}^{\infty}\gamma^{t}\left(\underline{P}^{\widehat{\pi},\widehat{V}}\right)^{t}\Big{(}\underline{P}^{\widehat{\pi},\widehat{V}}\left(V^{\prime}\circ V^{\prime}\right)-\frac{1}{\gamma}V^{\prime}\circ V^{\prime}\Big{)}\bigg{\|}}$
	$\displaystyle\quad+\sqrt{\frac{1}{(1-\gamma)^{2}\gamma^{2}}\bigg{(}2+20\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{(1-\gamma)^{2}N}}+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma}\bigg{)}\\|V^{\prime}\\|_{\infty}}1$
	$\displaystyle\overset{\mathrm{(ii)}}{\leq}\sqrt{\frac{\\|V^{\prime}\\|_{\infty}^{2}}{\gamma(1-\gamma)}}1+\sqrt{\frac{\bigg{(}2+20\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{(1-\gamma)^{2}N}}+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma}\bigg{)}\\|V^{\prime}\\|_{\infty}}{(1-\gamma)^{2}\gamma^{2}}}1$
	$\displaystyle\overset{\mathrm{(iii)}}{\leq}\sqrt{\frac{\\|V^{\prime}\\|_{\infty}^{2}}{\gamma(1-\gamma)}}1+\sqrt{\frac{24\\|V^{\prime}\\|_{\infty}}{(1-\gamma)^{2}\gamma^{2}}}1\leq 6\sqrt{\frac{\\|V^{\prime}\\|_{\infty}}{(1-\gamma)^{2}\gamma^{2}}}1,$	(174)

where (i) arises from following the routine of (153), (ii) holds by repeating the argument of (154), (iii) follows by taking $N\geq\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{(1-\gamma)^{2}}$ and $\varepsilon_{\mathsf{opt}}\leq\frac{1-\gamma}{\gamma}$ , and the last inequality holds by $\|V^{\prime}\|_{\infty}\leq\|V^{\star,\sigma}\|_{\infty}\leq\frac{1}{1-\gamma}$ .

Finally, applying Lemma 6 with $P=\widehat{P}^{0}$ and $\pi=\widehat{\pi}$ yields

\displaystyle\|V^{\prime}\|_{\infty}\leq\max_{s\in{\mathcal{S}}}\widehat{V}^{\widehat{\pi},\sigma}(s)-\min_{s\in{\mathcal{S}}}\widehat{V}^{\widehat{\pi},\sigma}(s)\leq\frac{1}{\gamma\max\{1-\gamma,\sigma\}},

which can be inserted into (174) and gives

\displaystyle\left(I-\gamma\underline{P}^{\widehat{\pi},\widehat{V}}\right)^{-1}\sqrt{\mathrm{Var}_{\underline{P}^{\widehat{\pi},\widehat{V}}}(\widehat{V}^{\widehat{\pi},\sigma})}\leq 6\sqrt{\frac{1}{\gamma^{3}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}}}1\leq 6\sqrt{\frac{1}{(1-\gamma)^{3}\gamma^{2}}}1.

Appendix C Proof of the auxiliary facts for Theorem 2

C.1 Proof of Lemma 10

Deriving the robust value function over different states.

For any $\mathcal{M}_{\phi}$ with $\phi\in\{0,1\}$ , we first characterize the robust value function of any policy $\pi$ over different states. Before proceeding, we denote the minimum of the robust value function over states as below:

V_{\phi,\min}^{\pi,\sigma}\coloneqq\min_{s\in{\mathcal{S}}}V_{\phi}^{\pi,\sigma}(s).

(175)

Clearly, there exists at least one state $s_{\phi,\min}^{\pi}$ that satisfies $V_{\phi}^{\pi,\sigma}(s_{\phi,\min}^{\pi})=V_{\phi,\min}^{\pi,\sigma}$ .

With this in mind, it is easily observed that for any policy $\pi$ , the robust value function at state $s=1$ obeys

	$\displaystyle V_{\phi}^{\pi,\sigma}(1)$	$\displaystyle=\mathbb{E}_{a\sim\pi(\cdot\,\|\,1)}\bigg{[}r(1,a)+\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(P^{\phi}_{1,a})}\mathcal{P}V_{\phi}^{\pi,\sigma}\bigg{]}$
		$\displaystyle\overset{\mathrm{(i)}}{=}1+\gamma\mathbb{E}_{a\sim\pi(\cdot\,\|\,1)}\left[\underline{P}^{\phi}(1\,\|\,1,a)V_{\phi}^{\pi,\sigma}(1)\right]+\gamma\sigma V_{\phi,\min}^{\pi,\sigma}\overset{\mathrm{(ii)}}{=}1+\gamma(1-\sigma)V_{\phi}^{\pi,\sigma}(1)+\gamma\sigma V_{\phi,\min}^{\pi,\sigma},$		(176)

where (i) holds by $r(1,a)=1$ for all $a\in\mathcal{A}^{\prime}$ and (74), and (ii) follows from $P^{\phi}(1\,|\,1,a)=1$ for all $a\in\mathcal{A}^{\prime}$ .

Similarly, for any $s\in\{2,3,\cdots,S-1\}$ , we have

	$\displaystyle V_{\phi}^{\pi,\sigma}(s)$	$\displaystyle=0+\gamma\mathbb{E}_{a\sim\pi(\cdot\,\|\,s)}\left[\underline{P}^{\phi}(1\,\|\,s,a)V_{\phi}^{\pi,\sigma}(1)\right]+\gamma\sigma V_{\phi,\min}^{\pi,\sigma}$
		$\displaystyle=\gamma\left(1-\sigma\right)V_{\phi}^{\pi,\sigma}(1)+\gamma\sigma V_{\phi,\min}^{\pi,\sigma},$		(177)

since $r(s,a)=0$ for all $s\in\{2,3,\cdots,S-1\}$ and the definition in (74).

Finally, we move onto compute $V_{\phi}^{\pi,\sigma}(0)$ , the robust value function at state $0$ associated with any policy $\pi$ . First, it obeys

	$\displaystyle V_{\phi}^{\pi,\sigma}(0)$	$\displaystyle=\mathbb{E}_{a\sim\pi(\cdot\,\|\,0)}\bigg{[}r(0,a)+\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(P^{\phi}_{0,a})}\mathcal{P}V^{\pi,\sigma}_{\phi}\bigg{]}$
		$\displaystyle=0+\gamma\pi(\phi\,\|\,0)\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(P^{\phi}_{0,\phi})}\mathcal{P}V^{\pi,\sigma}_{\phi}+\gamma\pi(1-\phi\,\|\,0)\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(P^{\phi}_{0,1-\phi})}\mathcal{P}V^{\pi,\sigma}_{\phi}.$		(178)

Recall the transition kernel defined in (64) and the fact about the uncertainty set over state $0$ in (75), it is easily verified that the following probability vector $P_{1}\in\Delta({\mathcal{S}})$ obeys $P_{1}\in\mathcal{U}^{\sigma}(P^{\phi}_{0,\phi})$ , which is defined as

	$\displaystyle P_{1}(0)=1-p+\sigma\operatorname{\mathds{1}}\left(0=s_{\phi,\min}^{\pi}\right),\qquad P_{1}(1)=\underline{p}=p-\sigma,$
	$\displaystyle P_{1}(s)=\sigma\operatorname{\mathds{1}}\left(s=s_{\phi,\min}^{\pi}\right),\qquad\forall s\in\{2,3,\cdots,S-1\},$		(179)

where $\underline{p}=p-\sigma$ due to (75). Similarly, the following probability vector $P_{2}\in\Delta({\mathcal{S}})$ also falls into the uncertainty set $\mathcal{U}^{\sigma}(P^{\phi}_{0,1-\phi})$ :

	$\displaystyle P_{2}(0)=1-q+\sigma\operatorname{\mathds{1}}\left(0=s_{\phi,\min}^{\pi}\right),\qquad P_{2}(1)=\underline{q}=q-\sigma,$
	$\displaystyle P_{2}(s)=\sigma\operatorname{\mathds{1}}\left(0=s_{\phi,\min}^{\pi}\right)\qquad\forall s\in\{2,3,\cdots,S-1\}.$		(180)

It is noticed that $P_{0}$ and $P_{1}$ defined above are the worst-case perturbations, since the probability mass at state $1$ will be moved to the state with the least value. Plugging the above facts about $P_{1}\in\mathcal{U}^{\sigma}(P^{\phi}_{0,\phi})$ and $P_{2}\in\mathcal{U}^{\sigma}(P^{\phi}_{0,1-\phi})$ into (178), we arrive at

$\displaystyle V_{\phi}^{\pi,\sigma}(0)$	$\displaystyle\leq\gamma\pi(\phi\,\|\,0)P_{1}V^{\pi,\sigma}_{\phi}+\gamma\pi(1-\phi\,\|\,0)P_{2}V^{\pi,\sigma}_{\phi}$
	$\displaystyle=\gamma\pi(\phi\,\|\,0)\Big{[}\left(p-\sigma\right)V_{\phi}^{\pi,\sigma}(1)+\left(1-p\right)V_{\phi}^{\pi,\sigma}(0)+\sigma V_{\phi,\min}^{\pi,\sigma}\Big{]}$
	$\displaystyle\qquad+\gamma\pi(1-\phi\,\|\,0)\Big{[}\left(q-\sigma\right)V_{\phi}^{\pi,\sigma}(1)+\left(1-q\right)V_{\phi}^{\pi,\sigma}(0)+\sigma V_{\phi,\min}^{\pi,\sigma}\Big{]}$
	$\displaystyle\overset{\mathrm{(i)}}{=}\gamma\left(z_{\phi}^{\pi}-\sigma\right)V_{\phi}^{\pi,\sigma}(1)+\gamma\sigma V_{\phi,\min}^{\pi,\sigma}+\gamma(1-z_{\phi}^{\pi})V_{\phi}^{\pi,\sigma}(0),$	(181)

where the last equality holds by the definition of $z_{\phi}^{\pi}$ in (77). To continue, recursively applying (181) yields

$\displaystyle V_{\phi}^{\pi,\sigma}(0)$	$\displaystyle\leq\gamma\left(z_{\phi}^{\pi}-\sigma\right)V_{\phi}^{\pi,\sigma}(1)+\gamma\sigma V_{\phi,\min}^{\pi,\sigma}+\gamma(1-z_{\phi}^{\pi})\Big{[}\gamma\left(z_{\phi}^{\pi}-\sigma\right)V_{\phi}^{\pi,\sigma}(1)+\gamma\sigma V_{\phi,\min}^{\pi,\sigma}+\gamma(1-z_{\phi}^{\pi})V_{\phi}^{\pi,\sigma}(0)\Big{]}$
	$\displaystyle\overset{\mathrm{(i)}}{\leq}\gamma\left(z_{\phi}^{\pi}-\sigma\right)V_{\phi}^{\pi,\sigma}(1)+\gamma\sigma V_{\phi,\min}^{\pi,\sigma}+\gamma(1-z_{\phi}^{\pi})\Big{[}\gamma z_{\phi}^{\pi}V_{\phi}^{\pi,\sigma}(1)+\gamma(1-z_{\phi}^{\pi})V_{\phi}^{\pi,\sigma}(0)\Big{]}$
	$\displaystyle\leq...$
	$\displaystyle\leq\gamma\left(z_{\phi}^{\pi}-\sigma\right)V_{\phi}^{\pi,\sigma}(1)+\gamma\sigma V_{\phi,\min}^{\pi,\sigma}+\gamma z_{\phi}^{\pi}\sum_{t=1}^{\infty}\gamma^{t}(1-z_{\phi}^{\pi})^{t}V_{\phi}^{\pi,\sigma}(1)+\lim_{t\rightarrow\infty}\gamma^{t}(1-z_{\phi}^{\pi})^{t}V_{\phi}^{\pi,\sigma}(0)$
	$\displaystyle\overset{\mathrm{(ii)}}{\leq}\gamma\left(z_{\phi}^{\pi}-\sigma\right)V_{\phi}^{\pi,\sigma}(1)+\gamma\sigma V_{\phi,\min}^{\pi,\sigma}+\gamma(1-z_{\phi}^{\pi})\frac{\gamma z_{\phi}^{\pi}}{1-\gamma(1-z_{\phi}^{\pi})}V_{\phi}^{\pi,\sigma}(1)+0$
	$\displaystyle<\gamma\left(z_{\phi}^{\pi}-\sigma\right)V_{\phi}^{\pi,\sigma}(1)+\gamma\sigma V_{\phi,\min}^{\pi,\sigma}+\gamma(1-z_{\phi}^{\pi})V_{\phi}^{\pi,\sigma}(1)$
	$\displaystyle=\gamma\left(1-\sigma\right)V_{\phi}^{\pi,\sigma}(1)+\gamma\sigma V_{\phi,\min}^{\pi,\sigma},$	(182)

where (i) uses $V_{\phi,\min}^{\pi,\sigma}\leq V_{\phi}^{\pi,\sigma}(1)$ , (ii) follows from $\gamma(1-z_{\phi}^{\pi})<1$ , and the penultimate line follows from the trivial fact that $\frac{\gamma z_{\phi}^{\pi}}{1-\gamma(1-z_{\phi}^{\pi})}<1$ .

Combining (176), (177), and (182), we have that for any policy $\pi$ ,

\displaystyle V_{\phi}^{\pi,\sigma}(0)=V_{\phi,\min}^{\pi,\sigma},

(183)

which directly leads to

\displaystyle V_{\phi}^{\pi,\sigma}(1)

\displaystyle=1+\gamma\left(1-\sigma\right)V_{\phi}^{\pi,\sigma}(1)+\gamma\sigma V_{\phi,\min}^{\pi,\sigma}=\frac{1+\gamma\sigma V_{\phi}^{\pi,\sigma}(0)}{1-\gamma\left(1-\sigma\right)}.

(184)

Let’s now return to the characterization of $V_{\phi}^{\pi,\sigma}(0)$ . In view of (183), the equality in (181) holds, and we have

	$\displaystyle V_{\phi}^{\pi,\sigma}(0)$	$\displaystyle=\gamma\left(z_{\phi}^{\pi}-\sigma\right)V_{\phi}^{\pi,\sigma}(1)+\gamma\left(1-z_{\phi}^{\pi}+\sigma\right)V_{\phi}^{\pi,\sigma}(0)$
		$\displaystyle\overset{\mathrm{(i)}}{=}\gamma\left(z_{\phi}^{\pi}-\sigma\right)\frac{1+\gamma\sigma V_{\phi}^{\pi,\sigma}(0)}{1-\gamma\left(1-\sigma\right)}+\gamma\left(1-z_{\phi}^{\pi}+\sigma\right)V_{\phi}^{\pi,\sigma}(0)$
		$\displaystyle=\frac{\gamma\left(z_{\phi}^{\pi}-\sigma\right)}{1-\gamma\left(1-\sigma\right)}+\gamma\left(1+\left(z_{\phi}^{\pi}-\sigma\right)\frac{\gamma\sigma-\left(1-\gamma\left(1-\sigma\right)\right)}{1-\gamma\left(1-\sigma\right)}\right)V_{\phi}^{\pi,\sigma}(0)$
		$\displaystyle=\frac{\gamma\left(z_{\phi}^{\pi}-\sigma\right)}{1-\gamma\left(1-\sigma\right)}+\gamma\left(1-\frac{(1-\gamma)\big{(}z_{\phi}^{\pi}-\sigma\big{)}}{1-\gamma\left(1-\sigma\right)}\right)V_{\phi}^{\pi,\sigma}(0),$

where (i) arises from (184). Solving this relation gives

\displaystyle V_{\phi}^{\pi,\sigma}(0)

\displaystyle=\frac{\frac{\gamma\left(z_{\phi}^{\pi}-\sigma\right)}{1-\gamma\left(1-\sigma\right)}}{(1-\gamma)\bigg{(}1+\frac{\gamma\left(z_{\phi}^{\pi}-\sigma\right)}{1-\gamma\left(1-\sigma\right)}\bigg{)}}.

(185)

The optimal robust policy and optimal robust value function.

We move on to characterize the robust optimal policy and its corresponding robust value function. To begin with, denoting

\displaystyle z\coloneqq\frac{\gamma\big{(}z_{\phi}^{\pi}-\sigma\big{)}}{1-\gamma\left(1-\sigma\right)},

(186)

we rewrite (185) as

\displaystyle V_{\phi}^{\pi,\sigma}(0)=\frac{z}{(1-\gamma)(1+z)}=:f(z).

Plugging in the fact that $z_{\phi}^{\pi}\geq q\geq\sigma>0$ in (73), it follows that $z>0$ . So for any $z>0$ , the derivative of $f(z)$ w.r.t. $z$ obeys

\displaystyle\frac{(1-\gamma)(1+z)-(1-\gamma)z}{(1-\gamma)^{2}(1+z)^{2}}=\frac{1}{(1-\gamma)(1+z)^{2}}>0.

(187)

Observing that $f(z)$ is increasing in $z$ , $z$ is increasing in $z_{\phi}^{\pi}$ , and $z_{\phi}^{\pi}$ is also increasing in $\pi(\phi\,|\,0)$ (see the fact $p\geq q$ in (73)), the optimal policy in state $0$ thus obeys

\pi_{\phi}^{\star}(\phi\,|\,0)=1.

(188)

Considering that the action does not influence the state transition for all states $s>0$ , without loss of generality, we choose the robust optimal policy to obey

\displaystyle\forall s>0:\quad\pi_{\phi}^{\star}(\phi\,|\,s)=1.

(189)

Taking $\pi=\pi^{\star}_{\phi}$ , we complete the proof by showing that the corresponding robust optimal robust value function at state $0$ as follows:

\displaystyle V_{\phi}^{\star,\sigma}(0)=\frac{\frac{\gamma\left(z_{\phi}^{\pi^{\star}}-\sigma\right)}{1-\gamma\left(1-\sigma\right)}}{(1-\gamma)\left(1+\frac{\gamma\left(z_{\phi}^{\pi^{\star}}-\sigma\right)}{1-\gamma\left(1-\sigma\right)}\right)}=\frac{\frac{\gamma\left(p-\sigma\right)}{1-\gamma\left(1-\sigma\right)}}{(1-\gamma)\left(1+\frac{\gamma\left(p-\sigma\right)}{1-\gamma\left(1-\sigma\right)}\right)}.

(190)

C.2 Proof of the claim (80)

Plugging in the definition of $\varphi$ , we arrive at that for any policy $\pi$ ,

\displaystyle\big{\langle}\varphi,V^{\star,\sigma}_{\phi}-V^{\pi,\sigma}_{\phi}\big{\rangle}=V^{\star,\sigma}_{\phi}(0)-V^{\pi,\sigma}_{\phi}(0)

\displaystyle=\frac{\frac{\gamma\left(p-z_{\phi}^{\pi}\right)}{1-\gamma\left(1-\sigma\right)}}{(1-\gamma)\left(1+\frac{\gamma\left(p-\sigma\right)}{1-\gamma\left(1-\sigma\right)}\right)\left(1+\frac{\gamma\left(z_{\phi}^{\pi}-\sigma\right)}{1-\gamma\left(1-\sigma\right)}\right)},

(191)

which follows from applying (76) and basic calculus. Then, we proceed to control the above term in two cases separately in terms of the uncertainty level $\sigma$ .

•

When $\sigma\in(0,1-\gamma]$ . Then regarding the important terms in (191), we observe that

\displaystyle 1-\gamma<1-\gamma\left(1-\sigma\right)

\displaystyle\leq 1-\gamma\left(1-(1-\gamma)\right)=(1-\gamma)(1+\gamma)\leq 2(1-\gamma),

(192)

which directly leads to

\displaystyle\frac{\gamma\big{(}z_{\phi}^{\pi}-\sigma\big{)}}{1-\gamma\left(1-\sigma\right)}\overset{\mathrm{(i)}}{\leq}\frac{\gamma\left(p-\sigma\right)}{1-\gamma\left(1-\sigma\right)}\leq\frac{\gamma c_{1}(1-\gamma)}{1-\gamma\left(1-\sigma\right)}\overset{\mathrm{(ii)}}{<}c_{1}\gamma,

(193)

where (i) holds by $z_{\phi}^{\pi}<p$ , and (ii) is due to (192). Inserting (192) and (193) back into (191), we arrive at

	$\displaystyle\big{\langle}\varphi,V^{\star,\sigma}_{\phi}-V^{\pi,\sigma}_{\phi}\big{\rangle}$	$\displaystyle\geq\frac{\frac{\gamma\left(p-z_{\phi}^{\pi}\right)}{2(1-\gamma)}}{(1-\gamma)(1+c_{1}\gamma)^{2}}\geq\frac{\gamma\big{(}p-z_{\phi}^{\pi}\big{)}}{8(1-\gamma)^{2}}$
		$\displaystyle=\frac{\gamma\left(p-q\right)\big{(}1-\pi(\phi\,\|\,0)\big{)}}{8(1-\gamma)^{2}}=\frac{\gamma\Delta\big{(}1-\pi(\phi\,\|\,0)\big{)}}{8(1-\gamma)^{2}}\geq 2\varepsilon\big{(}1-\pi(\phi\,\|\,0)\big{)},$		(194)

where the last inequality holds by setting ( $\gamma\geq 1/2$ )

\displaystyle\Delta=32(1-\gamma)^{2}\varepsilon.

(195)

Finally, it is easily verified that

\displaystyle\varepsilon\leq\frac{c_{1}}{32(1-\gamma)}\quad\Longrightarrow\quad\Delta\leq c_{1}(1-\gamma).

•

When $\sigma\in(1-\gamma,1-c_{1}]$ . Regarding (191), we observe that

\displaystyle\gamma\sigma<1-\gamma\left(1-\sigma\right)

\displaystyle=1-\gamma+\gamma\sigma\leq(1+\gamma)\sigma\leq 2\sigma,

(196)

which directly leads to

\displaystyle\frac{\gamma\big{(}z_{\phi}^{\pi}-\sigma\big{)}}{1-\gamma\left(1-\sigma\right)}\leq\frac{\gamma\left(p-\sigma\right)}{1-\gamma\left(1-\sigma\right)}\leq\frac{\gamma c_{1}\sigma}{1-\gamma\left(1-\sigma\right)}\overset{\mathrm{(i)}}{<}c_{1},

(197)

where (i) holds by (196). Inserting (196) and (197) back into (191), we arrive at

	$\displaystyle\big{\langle}\varphi,V^{\star,\sigma}_{\phi}-V^{\pi,\sigma}_{\phi}\big{\rangle}$	$\displaystyle\geq\frac{\frac{\gamma\left(p-z_{\phi}^{\pi}\right)}{2\sigma}}{(1-\gamma)(1+c_{1})^{2}}\geq\frac{\gamma\left(p-z_{\phi}^{\pi}\right)}{8(1-\gamma)\sigma}=\frac{\gamma\left(p-q\right)\big{(}1-\pi(\phi\,\|\,0)\big{)}}{8(1-\gamma)\sigma}$
		$\displaystyle=\frac{\gamma\Delta\big{(}1-\pi(\phi\,\|\,0)\big{)}}{8(1-\gamma)\sigma}\geq 2\varepsilon\big{(}1-\pi(\phi\,\|\,0)\big{)},$		(198)

where the last inequality holds by letting ( $\gamma\geq 1/2$ )

\displaystyle\Delta=32(1-\gamma)\sigma\varepsilon.

(199)

Finally, it is easily verified that

\displaystyle\varepsilon\leq\frac{c_{1}}{32(1-\gamma)}\quad\Longrightarrow\quad\Delta\leq c_{1}\sigma.

(200)

Appendix D Proof of the upper bound with $\chi^{2}$ divergence: Theorem 3

The proof of Theorem 3 mainly follows the structure of the proof of Theorem 1 in Appendix 5.2. Throughout this section, for any nominal transition kernel $P$ , the uncertainty set is taken as (see (10))

\displaystyle\mathcal{U}^{\sigma}(P)=\mathcal{U}^{\sigma}_{\chi^{2}}(P)\coloneqq\otimes\;\mathcal{U}^{\sigma}_{\chi^{2}}(P_{s,a}),\quad

\displaystyle\mathcal{U}^{\sigma}_{\chi^{2}}(P_{s,a})\coloneqq\Big{\{}P^{\prime}_{s,a}\in\Delta({\mathcal{S}}):\sum_{s^{\prime}\in{\mathcal{S}}}\frac{(P^{\prime}(s^{\prime}\,|\,s,a)-P(s^{\prime}\,|\,s,a))^{2}}{P(s^{\prime}\,|\,s,a)}\leq\sigma\Big{\}}.

(201)

D.1 Proof of Theorem 3

In order to control the performance gap $\left\|V^{\star,\sigma}-V^{\widehat{\pi},\sigma}\right\|_{\infty}$ , recall the error decomposition in (51):

\displaystyle V^{\star,\sigma}-V^{\widehat{\pi},\sigma}\leq\left(V^{\pi^{\star},\sigma}-\widehat{V}^{\pi^{\star},\sigma}\right)+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma}{1}+\left(\widehat{V}^{\widehat{\pi},\sigma}-V^{\widehat{\pi},\sigma}\right),

(202)

where $\varepsilon_{\mathsf{opt}}$ (cf. (50)) shall be specified later (which justifies Remark 2). To further control (202), we bound the remaining two terms separately.

Step 1: controlling $\big{\|}\widehat{V}^{\pi^{\star},\sigma}-V^{\pi^{\star},\sigma}\big{\|}_{\infty}$ .

Towards this, recall the bound in (56) which holds for any uncertainty set:

	$\displaystyle\big{\\|}\widehat{V}^{\pi^{\star},\sigma}-V^{\pi^{\star},\sigma}\big{\\|}_{\infty}$	$\displaystyle\leq\gamma\max\Big{\{}\Big{\\|}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},\widehat{V}}\Big{)}^{-1}\Big{(}\underline{\widehat{P}}^{\pi^{\star},V}V^{\pi^{\star},\sigma}-\underline{P}^{\pi^{\star},V}V^{\pi^{\star},\sigma}\Big{)}\Big{\\|}_{\infty},$
		$\displaystyle\qquad\qquad\Big{\\|}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{-1}\Big{(}\underline{\widehat{P}}^{\pi^{\star},V}V^{\pi^{\star},\sigma}-\underline{P}^{\pi^{\star},V}V^{\pi^{\star},\sigma}\Big{)}\Big{\\|}_{\infty}\Big{\}}.$		(203)

To control the main term $\underline{\widehat{P}}^{\pi^{\star},V}V^{\pi^{\star},\sigma}-\underline{P}^{\pi^{\star},V}V^{\pi^{\star},\sigma}$ in (203), we first introduce an important lemma whose proof is postponed to Appendix D.2.1.

Lemma 16.

Consider any $\sigma>0$ and the uncertainty set $\mathcal{U}^{\sigma}(\cdot)\coloneqq\mathcal{U}^{\sigma}_{\chi^{2}}(\cdot)$ . For any $\delta\in(0,1)$ and any fixed policy $\pi$ , one has with probability at least $1-\delta$ ,

\displaystyle\left\|\underline{\widehat{P}}^{\pi,V}V^{\pi,\sigma}-\underline{P}^{\pi,V}V^{\pi,\sigma}\right\|_{\infty}\leq 4\sqrt{\frac{2(1+\sigma)\log(\frac{24SAN}{\delta})}{(1-\gamma)^{2}N}}.

Applying Lemma 16 by taking $\pi=\pi^{\star}$ gives

\displaystyle\Big{\|}\underline{\widehat{P}}^{\pi^{\star},V}V^{\pi^{\star},\sigma}-\underline{P}^{\pi^{\star},V}V^{\pi^{\star},\sigma}\Big{\|}_{\infty}\leq 4\sqrt{\frac{2(1+\sigma)\log(\frac{24SAN}{\delta})}{(1-\gamma)^{2}N}},

(204)

which directly leads to

	$\displaystyle\Big{\\|}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},\widehat{V}}\Big{)}^{-1}\Big{(}\underline{\widehat{P}}^{\pi^{\star},V}V^{\pi^{\star},\sigma}-\underline{P}^{\pi^{\star},V}V^{\pi^{\star},\sigma}\Big{)}\Big{\\|}_{\infty}$
	$\displaystyle\leq\Big{\\|}\underline{\widehat{P}}^{\pi^{\star},V}V^{\pi^{\star},\sigma}-\underline{P}^{\pi^{\star},V}V^{\pi^{\star},\sigma}\Big{\\|}_{\infty}\cdot\Big{\\|}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},\widehat{V}}\Big{)}^{-1}{1}\Big{\\|}_{\infty}\leq 4\sqrt{\frac{2(1+\sigma)\log(\frac{24SAN}{\delta})}{(1-\gamma)^{4}N}}.$		(205)

Similarly, we have

\displaystyle\Big{\|}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{-1}\Big{(}\underline{\widehat{P}}^{\pi^{\star},V}V^{\pi^{\star},\sigma}-\underline{P}^{\pi^{\star},V}V^{\pi^{\star},\sigma}\Big{)}\Big{\|}_{\infty}\leq 4\sqrt{\frac{2(1+\sigma)\log(\frac{24SAN}{\delta})}{(1-\gamma)^{4}N}}.

(206)

Inserting (205) and (206) back to (203) yields

\displaystyle\big{\|}\widehat{V}^{\pi^{\star},\sigma}-V^{\pi^{\star},\sigma}\big{\|}_{\infty}

\displaystyle\leq 4\sqrt{\frac{2(1+\sigma)\log(\frac{24SAN}{\delta})}{(1-\gamma)^{4}N}}.

(207)

Step 2: controlling $\left\|\widehat{V}^{\widehat{\pi},\sigma}-V^{\widehat{\pi},\sigma}\right\|_{\infty}$ .

Recall the bound in (57) which holds for any uncertainty set:

	$\displaystyle\big{\\|}\widehat{V}^{\widehat{\pi},\sigma}-V^{\widehat{\pi},\sigma}\big{\\|}_{\infty}$	$\displaystyle\leq\gamma\max\Big{\{}\Big{\\|}\Big{(}I-\gamma\underline{P}^{\widehat{\pi},V}\Big{)}^{-1}\Big{(}\underline{\widehat{P}}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}-\underline{P}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}\Big{)}\Big{\\|}_{\infty},$
		$\displaystyle\qquad\qquad\Big{\\|}\Big{(}I-\gamma\underline{P}^{\widehat{\pi},\widehat{V}}\Big{)}^{-1}\Big{(}\underline{\widehat{P}}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}-\underline{P}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}\Big{)}\Big{\\|}_{\infty}\Big{\}}.$		(208)

We introduce the following lemma which controls $\underline{\widehat{P}}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}-\underline{P}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}$ in (208); the proof is deferred to Appendix D.2.2.

Lemma 17.

Consider the uncertainty set $\mathcal{U}^{\sigma}(\cdot)\coloneqq\mathcal{U}^{\sigma}_{\chi^{2}}(\cdot)$ and any $\delta\in(0,1)$ . With probability at least $1-\delta$ , one has

\displaystyle\Big{\|}\underline{\widehat{P}}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}-\underline{P}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}\Big{\|}_{\infty}

\displaystyle\leq 12\sqrt{\frac{2(1+\sigma)\log(\frac{36SAN^{2}}{\delta})}{(1-\gamma)^{2}N}}+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma}+4\sqrt{\frac{\sigma\varepsilon_{\mathsf{opt}}}{(1-\gamma)^{2}}}.

(209)

Repeating the arguments from (204) to (207) yields

\displaystyle\big{\|}\widehat{V}^{\widehat{\pi},\sigma}-V^{\widehat{\pi},\sigma}\big{\|}_{\infty}

\displaystyle\leq 12\sqrt{\frac{2(1+\sigma)\log(\frac{36SAN^{2}}{\delta})}{(1-\gamma)^{4}N}}+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{(1-\gamma)^{2}}+4\sqrt{\frac{\sigma\varepsilon_{\mathsf{opt}}}{(1-\gamma)^{4}}}.

(210)

Finally, inserting (207) and (210) back to (202) complete the proof

	$\displaystyle\big{\\|}V^{\star,\sigma}-V^{\widehat{\pi},\sigma}\big{\\|}_{\infty}$
	$\displaystyle\leq\big{\\|}V^{\pi^{\star},\sigma}-\widehat{V}^{\pi^{\star},\sigma}\big{\\|}_{\infty}+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma}+\big{\\|}\widehat{V}^{\widehat{\pi},\sigma}-V^{\widehat{\pi},\sigma}\big{\\|}_{\infty}$
	$\displaystyle\leq 4\sqrt{\frac{2(1+\sigma)\log(\frac{24SAN}{\delta})}{(1-\gamma)^{4}N}}+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma}+12\sqrt{\frac{2(1+\sigma)\log(\frac{36SAN^{2}}{\delta})}{(1-\gamma)^{4}N}}+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{(1-\gamma)^{2}}+4\sqrt{\frac{\sigma\varepsilon_{\mathsf{opt}}}{(1-\gamma)^{4}}}$
	$\displaystyle\leq 24\sqrt{\frac{2(1+\sigma)\log(\frac{36SAN^{2}}{\delta})}{(1-\gamma)^{4}N}},$		(211)

where the last line holds by taking $\varepsilon_{\mathsf{opt}}\leq\min\left\{\sqrt{\frac{32(1+\sigma)\log(\frac{36SAN^{2}}{\delta})}{N}},\frac{4\log(\frac{36SAN^{2}}{\delta})}{N}\right\}$ .

D.2 Proof of the auxiliary lemmas

D.2.1 Proof of Lemma 16

Step 1: controlling the point-wise concentration.

Consider any fixed policy $\pi$ and the corresponding robust value vector $V\coloneqq V^{\pi,\sigma}$ (independent from $\widehat{P}^{0}$ ). Invoking Lemma 2 leads to that for any $(s,a)\in{\mathcal{S}}\times\mathcal{A}$ ,

	$\displaystyle\left\|\widehat{P}^{\pi,V}_{s,a}V^{\pi,\sigma}-P^{\pi,V}_{s,a}V^{\pi,\sigma}\right\|$
	$\displaystyle=\bigg{\|}\max_{\alpha\in\left[\min_{s}V(s),\max_{s}V(s)\right]}\left\{P^{0}_{s,a}[V]_{\alpha}-\sqrt{\sigma\mathsf{Var}_{P^{0}_{s,a}}\left([V]_{\alpha}\right)}\right\}$
	$\displaystyle\qquad\qquad-\max_{\alpha\in\left[\min_{s}V(s),\max_{s}V(s)\right]}\left\{\widehat{P}^{0}_{s,a}[V]_{\alpha}-\sqrt{\sigma\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left([V]_{\alpha}\right)}\right\}\bigg{\|}$
	$\displaystyle\leq\max_{\alpha\in\left[\min_{s}V(s),\max_{s}V(s)\right]}\left\|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)[V]_{\alpha}+\sqrt{\sigma\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left([V]_{\alpha}\right)}-\sqrt{\sigma\mathsf{Var}_{P^{0}_{s,a}}\left([V]_{\alpha}\right)}\right\|$
	$\displaystyle\leq\max_{\alpha\in\left[\min_{s}V(s),\max_{s}V(s)\right]}\left\|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)[V]_{\alpha}\right\|+$
	$\displaystyle\qquad+\max_{\alpha\in\left[\min_{s}V(s),\max_{s}V(s)\right]}\sqrt{\sigma}\left\|\sqrt{\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left([V]_{\alpha}\right)}-\sqrt{\mathsf{Var}_{P^{0}_{s,a}}\left([V]_{\alpha}\right)}\right\|,$		(212)

where the first inequality follows by that the maximum operator is $1$ -Lipschitz, and the second inequality follows from the triangle inequality. Observing that the first term in (212) is exactly the same as (141), recalling the fact in (146) directly leads to: with probability at least $1-\delta$ ,

\displaystyle\max_{\alpha\in[\min_{s}V(s),\max_{s}V(s)]}\left|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)[V]_{\alpha}\right|\leq 2\sqrt{\frac{\log(\frac{2SAN}{\delta})}{(1-\gamma)^{2}N}}

(213)

holds for all $(s,a)\in{\mathcal{S}}\times\mathcal{A}$ . Then the remainder of the proof focuses on controlling the second term in (212).

Step 2: controlling the second term in (212).

For any given $(s,a)\in{\mathcal{S}}\times\mathcal{A}$ and fixed $\alpha\in[0,\frac{1}{1-\gamma}]$ , applying the concentration inequality (Panaganti and Kalathil,, 2022, Lemma 6) with $\|[V]_{\alpha}\|_{\infty}\leq\frac{1}{1-\gamma}$ , we arrive at

\displaystyle\left|\sqrt{\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left([V]_{\alpha}\right)}-\sqrt{\mathsf{Var}_{P^{0}_{s,a}}\left([V]_{\alpha}\right)}\right|\leq\sqrt{\frac{2\log(\frac{2}{\delta})}{(1-\gamma)^{2}N}}

(214)

holds with probability at least $1-\delta$ . To obtain a uniform bound, we first observe the follow lemma proven in Appendix D.2.3.

Lemma 18.

For any $V$ obeying $\|V\|_{\infty}\leq\frac{1}{1-\gamma}$ , the function $J_{s,a}(\alpha,V):=\Big{|}\sqrt{\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left([V]_{\alpha}\right)}-\sqrt{\mathsf{Var}_{P^{0}_{s,a}}\left([V]_{\alpha}\right)}\Big{|}$ w.r.t. $\alpha$ obeys

\displaystyle\left|J_{s,a}(\alpha_{1},V)-J_{s,a}(\alpha_{2},V)\right|\leq 4\sqrt{\frac{|\alpha_{1}-\alpha_{2}|}{1-\gamma}}.

In addition, we can construct an $\varepsilon_{3}$ -net $N_{\varepsilon_{3}}$ over $[0,\frac{1}{1-\gamma}]$ whose size is $|N_{\varepsilon_{3}}|\leq\frac{3}{\varepsilon_{3}(1-\gamma)}$ (Vershynin,, 2018). Armed with the above, we can derive the uniform bound over $\alpha\in[\min_{s}V(s),\max_{s}V(s)]\subset[0,1/(1-\gamma)]$ : with probability at least $1-\frac{\delta}{SA}$ , it holds that for any $(s,a)\in{\mathcal{S}}\times\mathcal{A}$ ,

	$\displaystyle\max_{\alpha\in\left[\min_{s}V(s),\max_{s}V(s)\right]}\left\|\sqrt{\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left([V]_{\alpha}\right)}-\sqrt{\mathsf{Var}_{P^{0}_{s,a}}\left([V]_{\alpha}\right)}\right\|$
	$\displaystyle\leq\max_{\alpha\in\left[0,1/(1-\gamma)\right]}\left\|\sqrt{\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left([V]_{\alpha}\right)}-\sqrt{\mathsf{Var}_{P^{0}_{s,a}}\left([V]_{\alpha}\right)}\right\|$
	$\displaystyle\overset{\mathrm{(i)}}{\leq}4\sqrt{\frac{\varepsilon_{3}}{1-\gamma}}+\sup_{\alpha\in N_{\varepsilon_{3}}}\left\|\sqrt{\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left([V]_{\alpha}\right)}-\sqrt{\mathsf{Var}_{P^{0}_{s,a}}\left([V]_{\alpha}\right)}\right\|$
	$\displaystyle\overset{\mathrm{(ii)}}{\leq}4\sqrt{\frac{\varepsilon_{3}}{1-\gamma}}+\sqrt{\frac{2\log(\frac{2SA\|N_{\varepsilon_{3}}\|}{\delta})}{(1-\gamma)^{2}N}}$
	$\displaystyle\overset{\mathrm{(iii)}}{\leq}2\sqrt{\frac{2\log(\frac{2SA\|N_{\varepsilon_{3}}\|}{\delta})}{(1-\gamma)^{2}N}}\leq 2\sqrt{\frac{2\log(\frac{24SAN}{\delta})}{(1-\gamma)^{2}N}},$		(215)

where (i) holds by the property of $N_{\varepsilon_{3}}$ , (ii) follows from (214), (iii) arises from taking $\varepsilon_{3}=\frac{\log(\frac{2SA|N_{\varepsilon_{3}}|}{\delta})}{8N(1-\gamma)}$ , and the last inequality is verified by $|N_{\varepsilon_{3}}|\leq\frac{3}{\varepsilon_{3}(1-\gamma)}\leq 24N$ .

Inserting (213) and (215) back to (212) and taking the union bound over $(s,a)\in{\mathcal{S}}\times\mathcal{A}$ , we arrive at that for all $(s,a)\in{\mathcal{S}}\times\mathcal{A}$ , with probability at least $1-{\delta}$ ,

	$\displaystyle\left\|\widehat{P}^{\pi,V}_{s,a}V-P^{\pi,V}_{s,a}V\right\|$	$\displaystyle\leq\max_{\alpha\in\left[\min_{s}V(s),\max_{s}V(s)\right]}\left\|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)[V]_{\alpha}\right\|+$
		$\displaystyle\qquad+\max_{\alpha\in\left[\min_{s}V(s),\max_{s}V(s)\right]}\left\|\sqrt{\sigma\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left([V]_{\alpha}\right)}-\sqrt{\sigma\mathsf{Var}_{P^{0}_{s,a}}\left([V]_{\alpha}\right)}\right\|$
		$\displaystyle\leq\sqrt{\frac{2\log(\frac{2SAN}{\delta})}{(1-\gamma)^{2}N}}+2\sqrt{\frac{2\sigma\log(\frac{24SAN}{\delta})}{(1-\gamma)^{2}N}}\leq 4\sqrt{\frac{2(1+\sigma)\log(\frac{24SAN}{\delta})}{(1-\gamma)^{2}N}}.$

Finally, we complete the proof by recalling the matrix form as below:

\displaystyle\left\|\underline{\widehat{P}}^{\pi,V}V^{\pi,\sigma}-\underline{P}^{\pi,V}V^{\pi,\sigma}\right\|_{\infty}\leq\max_{(s,a)\in{\mathcal{S}}\times\mathcal{A}}\left|\widehat{P}^{\pi,V}_{s,a}V-P^{\pi,V}_{s,a}V\right|\leq 4\sqrt{\frac{2(1+\sigma)\log(\frac{24SAN}{\delta})}{(1-\gamma)^{2}N}}.

D.2.2 Proof of Lemma 17

Step 1: decomposing the term of interest.

The proof follows the routine of the proof of Lemma 14 in Appendix B.7. To begin with, for any $(s,a)\in{\mathcal{S}}\times\mathcal{A}$ , following the same arguments of (212) yields

	$\displaystyle\left\|\widehat{P}^{\widehat{\pi},\widehat{V}}_{s,a}\widehat{V}^{\widehat{\pi},\sigma}-P^{\widehat{\pi},\widehat{V}}_{s,a}\widehat{V}^{\widehat{\pi},\sigma}\right\|\leq\max_{\alpha\in\left[\min_{s}\widehat{V}^{\widehat{\pi},\sigma}(s),\max_{s}\widehat{V}^{\widehat{\pi},\sigma}(s)\right]}\left\|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)\big{[}\widehat{V}^{\widehat{\pi},\sigma}\big{]}_{\alpha}\right\|+$
	$\displaystyle\qquad+\max_{\alpha\in\left[\min_{s}\widehat{V}^{\widehat{\pi},\sigma}(s),\max_{s}\widehat{V}^{\widehat{\pi},\sigma}(s)\right]}\sqrt{\sigma}\left\|\sqrt{\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left(\big{[}\widehat{V}^{\widehat{\pi},\sigma}\big{]}_{\alpha}\right)}-\sqrt{\mathsf{Var}_{P^{0}_{s,a}}\Big{(}\big{[}\widehat{V}^{\widehat{\pi},\sigma}\big{]}_{\alpha}\Big{)}}\right\|.$		(216)

Invoking the fact in (170) (for proving Lemma 14), the first term in (216) obeys

	$\displaystyle\max_{\alpha\in\left[\min_{s}\widehat{V}^{\widehat{\pi},\sigma}(s),\max_{s}\widehat{V}^{\widehat{\pi},\sigma}(s)\right]}\left\|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)\big{[}\widehat{V}^{\widehat{\pi},\sigma}\big{]}_{\alpha}\right\|$	$\displaystyle\leq\max_{\alpha\in[0,1/(1-\gamma)]}\left\|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)\big{[}\widehat{V}^{\widehat{\pi},\sigma}\big{]}_{\alpha}\right\|$
		$\displaystyle\leq 4\sqrt{\frac{\log(\frac{3SAN^{3/2}}{(1-\gamma)\delta})}{(1-\gamma)^{2}N}}+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma}.$		(217)

The remainder of the proof will focus on controlling the second term of (216).

Step 2: controlling the second term of (216).

Towards this, we recall the auxiliary robust MDP $\widehat{\mathcal{M}}_{\mathsf{rob}}^{s,u}$ defined in Appendix B.7. Taking the uncertainty set $\mathcal{U}^{\sigma}(\cdot)\coloneqq\mathcal{U}_{\chi^{2}}^{\sigma}(\cdot)$ for both $\widehat{\mathcal{M}}_{\mathsf{rob}}^{s,u}$ and $\widehat{\mathcal{M}}_{\mathsf{rob}}$ , we recall the corresponding robust Bellman operator $\widehat{{\mathcal{T}}}^{\sigma}_{s,u}(\cdot)$ in (159) and the following definition in (160)

\displaystyle u^{\star}\coloneqq\widehat{V}^{\star,\sigma}(s)-\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(e_{s})}\mathcal{P}\widehat{V}^{\star,\sigma}.

(218)

Following the arguments in Appendix B.7, it can be verified that there exists a unique fixed point $\widehat{Q}^{\star,\sigma}_{s,u}$ of the operator $\widehat{{\mathcal{T}}}^{\sigma}_{s,u}(\cdot)$ , which satisfies $0\leq\widehat{Q}^{\star,\sigma}_{s,u}\leq\frac{1}{1-\gamma}{1}$ . In addition, the corresponding robust value function coincides with that of the operator $\widehat{{\mathcal{T}}}^{\sigma}(\cdot)$ , i.e., $\widehat{V}^{\star,\sigma}_{s,u}=\widehat{V}^{\star,\sigma}$ .

We recall the $N_{\varepsilon_{2}}$ -net over $\left[0,\frac{1}{1-\gamma}\right]$ whose size obeying $|N_{\varepsilon_{2}}|\leq\frac{3}{\varepsilon_{2}(1-\gamma)}$ (Vershynin,, 2018). Then for all $u\in N_{\varepsilon_{2}}$ and a fixed $\alpha$ , $\widehat{\mathcal{M}}_{\mathsf{rob}}^{s,u}$ is statistically independent from $\widehat{P}^{0}_{s,a}$ , which indicates the independence between $[\widehat{V}^{\star,\sigma}_{s,u}]_{\alpha}$ and $\widehat{P}^{0}_{s,a}$ . With this in mind, invoking the fact in (215) and taking the union bound over all $(s,a)\in{\mathcal{S}}\times\mathcal{A}$ and $u\in N_{\varepsilon_{2}}$ yields that, with probability at least $1-\delta$ ,

\displaystyle\max_{\alpha\in\left[0,1/(1-\gamma)\right]}\bigg{|}\sqrt{\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left([\widehat{V}^{\star,\sigma}_{s,u}]_{\alpha}\right)}-\sqrt{\mathsf{Var}_{P^{0}_{s,a}}\left([\widehat{V}^{\star,\sigma}_{s,u}]_{\alpha}\right)}\bigg{|}

\displaystyle\leq 2\sqrt{\frac{2\log(\frac{24SAN|N_{\varepsilon_{2}}|}{\delta})}{(1-\gamma)^{2}N}}

(219)

holds for all $(s,a,u)\in{\mathcal{S}}\times\mathcal{A}\times N_{\varepsilon_{2}}$ .

To continue, we decompose the term of interest in (216) as follows:

	$\displaystyle\max_{\alpha\in\left[\min_{s}\widehat{V}^{\widehat{\pi},\sigma}(s),\max_{s}\widehat{V}^{\widehat{\pi},\sigma}(s)\right]}\bigg{\|}\sqrt{\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left(\big{[}\widehat{V}^{\widehat{\pi},\sigma}\big{]}_{\alpha}\right)}-\sqrt{\mathsf{Var}_{P^{0}_{s,a}}\left(\big{[}\widehat{V}^{\widehat{\pi},\sigma}\big{]}_{\alpha}\right)}\bigg{\|}$
	$\displaystyle\leq\max_{\alpha\in\left[0,1/(1-\gamma)\right]}\bigg{\|}\sqrt{\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left(\big{[}\widehat{V}^{\widehat{\pi},\sigma}\big{]}_{\alpha}\right)}-\sqrt{\mathsf{Var}_{P^{0}_{s,a}}\left(\big{[}\widehat{V}^{\widehat{\pi},\sigma}\big{]}_{\alpha}\right)}\bigg{\|}$
	$\displaystyle\overset{\mathrm{(i)}}{\leq}\max_{\alpha\in\left[0,1/(1-\gamma)\right]}\bigg{\|}\sqrt{\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left(\big{[}\widehat{V}^{\star,\sigma}\big{]}_{\alpha}\right)}-\sqrt{\mathsf{Var}_{P^{0}_{s,a}}\left(\big{[}\widehat{V}^{\star,\sigma}\big{]}_{\alpha}\right)}\bigg{\|}$
	$\displaystyle\quad+\max_{\alpha\in\left[0,1/(1-\gamma)\right]}\bigg{[}\sqrt{\left\|\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left(\big{[}\widehat{V}^{\widehat{\pi},\sigma}\big{]}_{\alpha}\right)-\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left(\big{[}\widehat{V}^{\star,\sigma}\big{]}_{\alpha}\right)\right\|}$
	$\displaystyle\quad+\sqrt{\left\|\mathsf{Var}_{P^{0}_{s,a}}\left(\big{[}\widehat{V}^{\widehat{\pi},\sigma}\big{]}_{\alpha}\right)-\mathsf{Var}_{P^{0}_{s,a}}\left(\big{[}\widehat{V}^{\star,\sigma}\big{]}_{\alpha}\right)\right\|}\bigg{]}$
	$\displaystyle\overset{\mathrm{(ii)}}{\leq}\max_{\alpha\in\left[0,1/(1-\gamma)\right]}\bigg{\|}\sqrt{\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left(\big{[}\widehat{V}^{\star,\sigma}\big{]}_{\alpha}\right)}-\sqrt{\mathsf{Var}_{P^{0}_{s,a}}\left(\big{[}\widehat{V}^{\star,\sigma}\big{]}_{\alpha}\right)}\bigg{\|}$
	$\displaystyle\quad+\max_{\alpha\in\left[0,1/(1-\gamma)\right]}2\sqrt{\frac{2}{(1-\gamma)}\left\\|\big{[}\widehat{V}^{\widehat{\pi},\sigma}\big{]}_{\alpha}-\big{[}\widehat{V}^{\star,\sigma}\big{]}_{\alpha}\right\\|_{\infty}}$
	$\displaystyle\leq\max_{\alpha\in\left[0,1/(1-\gamma)\right]}\left\|\sqrt{\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left(\big{[}\widehat{V}^{\star,\sigma}\big{]}_{\alpha}\right)}-\sqrt{\mathsf{Var}_{P^{0}_{s,a}}\left(\big{[}\widehat{V}^{\star,\sigma}\big{]}_{\alpha}\right)}\right\|+4\sqrt{\frac{\varepsilon_{\mathsf{opt}}}{(1-\gamma)^{2}}},$		(220)

where (i) holds by the triangle inequality, (ii) arises from applying Lemma 3, and the last inequality holds by (50).

Armed with the above facts, invoking the identity $\widehat{V}^{\star,\sigma}=\widehat{V}^{\star,\sigma}_{s,u^{\star}}$ leads to that for all $(s,a)\in{\mathcal{S}}\times\mathcal{A}$ , with probability at least $1-\delta$ ,

	$\displaystyle\max_{\alpha\in\left[0,1/(1-\gamma)\right]}\bigg{\|}\sqrt{\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left(\big{[}\widehat{V}^{\star,\sigma}\big{]}_{\alpha}\right)}-\sqrt{\mathsf{Var}_{P^{0}_{s,a}}\left(\big{[}\widehat{V}^{\star,\sigma}\big{]}_{\alpha}\right)}\bigg{\|}$
	$\displaystyle=\max_{\alpha\in\left[0,1/(1-\gamma)\right]}\left\|\sqrt{\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left(\left[\widehat{V}^{\star,\sigma}_{s,u^{\star}}\right]_{\alpha}\right)}-\sqrt{\mathsf{Var}_{P^{0}_{s,a}}\left(\left[\widehat{V}^{\star,\sigma}_{s,u^{\star}}\right]_{\alpha}\right)}\right\|$
	$\displaystyle\overset{\mathrm{(i)}}{\leq}\max_{\alpha\in\left[0,1/(1-\gamma)\right]}\left\|\sqrt{\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left(\left[\widehat{V}^{\star,\sigma}_{s,\overline{u}}\right]_{\alpha}\right)}-\sqrt{\mathsf{Var}_{P^{0}_{s,a}}\left(\left[\widehat{V}^{\star,\sigma}_{s,\overline{u}}\right]_{\alpha}\right)}\right\|$
	$\displaystyle\quad+\max_{\alpha\in\left[0,1/(1-\gamma)\right]}\bigg{[}\sqrt{\left\|\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left(\left[\widehat{V}^{\star,\sigma}_{s,u^{\star}}\right]_{\alpha}\right)-\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left(\left[\widehat{V}^{\star,\sigma}_{s,\overline{u}}\right]_{\alpha}\right)\right\|}$
	$\displaystyle\quad+\sqrt{\left\|\mathsf{Var}_{P^{0}_{s,a}}\left(\left[\widehat{V}^{\star,\sigma}_{s,u^{\star}}\right]_{\alpha}\right)-\mathsf{Var}_{P^{0}_{s,a}}\left(\left[\widehat{V}^{\star,\sigma}_{s,\overline{u}}\right]_{\alpha}\right)\right\|}\bigg{]}$
	$\displaystyle\overset{\mathrm{(ii)}}{\leq}\max_{\alpha\in\left[0,1/(1-\gamma)\right]}\left\|\sqrt{\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left(\left[\widehat{V}^{\star,\sigma}_{s,\overline{u}}\right]_{\alpha}\right)}-\sqrt{\mathsf{Var}_{P^{0}_{s,a}}\left(\left[\widehat{V}^{\star,\sigma}_{s,\overline{u}}\right]_{\alpha}\right)}\right\|+4\sqrt{\frac{\varepsilon_{2}}{(1-\gamma)}}$
	$\displaystyle\overset{\mathrm{(iii)}}{\leq}2\sqrt{\frac{2\log(\frac{24SAN\|N_{\varepsilon_{2}}\|}{\delta})}{(1-\gamma)^{2}N}}+4\sqrt{\frac{\varepsilon_{2}}{(1-\gamma)}}$
	$\displaystyle\leq 6\sqrt{\frac{2\log(\frac{36SAN^{2}\|N_{\varepsilon_{2}}\|}{\delta})}{(1-\gamma)^{2}N}},$		(221)

where (i) holds by the triangle inequality, (ii) arises from applying Lemma 3 and the fact $\left\|\widehat{V}^{\star,\sigma}_{s,\overline{u}}-\widehat{V}^{\star,\sigma}_{s,u^{\star}}\right\|_{\infty}\leq\frac{\varepsilon_{2}}{(1-\gamma)}$ (see (166)), (iii) follows from (219), and the last inequality holds by letting $\varepsilon_{2}=\frac{2\log(\frac{24SAN|N_{\varepsilon_{2}}|}{\delta})}{(1-\gamma)N}$ , which leads to $|N_{\varepsilon_{2}}|\leq\frac{3}{\varepsilon_{2}(1-\gamma)}\leq\frac{3N}{2}$ .

In summary, inserting (221) back to (220) and (220) leads to with probability at least $1-\delta$ ,

	$\displaystyle\max_{\alpha\in\left[\min_{s}\widehat{V}^{\widehat{\pi},\sigma}(s),\max_{s}\widehat{V}^{\widehat{\pi},\sigma}(s)\right]}\bigg{\|}\sqrt{\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left(\big{[}\widehat{V}^{\widehat{\pi},\sigma}\big{]}_{\alpha}\right)}-\sqrt{\mathsf{Var}_{P^{0}_{s,a}}\left(\big{[}\widehat{V}^{\widehat{\pi},\sigma}\big{]}_{\alpha}\right)}\bigg{\|}$
	$\displaystyle\leq 6\sqrt{\frac{2\sigma\log(\frac{36SAN^{2}\|N_{\varepsilon_{2}}\|}{\delta})}{(1-\gamma)^{2}N}}+4\sqrt{\frac{\sigma\varepsilon_{\mathsf{opt}}}{(1-\gamma)^{2}}}$		(222)

holds for all $(s,a)\in{\mathcal{S}}\times\mathcal{A}$ .

Step 4: finishing up.

Inserting (222) and (217) back to (216), we complete the proof: with probability at least $1-\delta$ ,

	$\displaystyle\Big{\\|}\underline{\widehat{P}}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}-\underline{P}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}\Big{\\|}_{\infty}$	$\displaystyle\leq 4\sqrt{\frac{\log(\frac{3SAN^{3/2}}{(1-\gamma)\delta})}{(1-\gamma)^{2}N}}+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma}+6\sqrt{\frac{2\sigma\log(\frac{36SAN^{2}\|N_{\varepsilon_{2}}\|}{\delta})}{(1-\gamma)^{2}N}}+4\sqrt{\frac{\sigma\varepsilon_{\mathsf{opt}}}{(1-\gamma)^{2}}}$
		$\displaystyle\leq 12\sqrt{\frac{2(1+\sigma)\log(\frac{36SAN^{2}}{\delta})}{(1-\gamma)^{2}N}}+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma}+4\sqrt{\frac{\sigma\varepsilon_{\mathsf{opt}}}{(1-\gamma)^{2}}}.$		(223)

D.2.3 Proof of Lemma 18

For any $0\leq\alpha_{1},\alpha_{2}\leq 1/(1-\gamma)$ , one has

	$\displaystyle\|J_{s,a}(\alpha_{1},V)-J_{s,a}(\alpha_{2},V)\|$
	$\displaystyle=\bigg{\|}\left\|\sqrt{\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left([V]_{\alpha_{1}}\right)}-\sqrt{\mathsf{Var}_{P^{0}_{s,a}}\left([V]_{\alpha_{1}}\right)}\right\|-\left\|\sqrt{\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left([V]_{\alpha_{2}}\right)}-\sqrt{\mathsf{Var}_{P^{0}_{s,a}}\left([V]_{\alpha_{2}}\right)}\right\|\bigg{\|}$
	$\displaystyle\overset{\mathrm{(i)}}{\leq}\left\|\sqrt{\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left([V]_{\alpha_{1}}\right)}-\sqrt{\mathsf{Var}_{P^{0}_{s,a}}\left([V]_{\alpha_{1}}\right)}-\sqrt{\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left([V]_{\alpha_{2}}\right)}+\sqrt{\mathsf{Var}_{P^{0}_{s,a}}\left([V]_{\alpha_{2}}\right)}\right\|$
	$\displaystyle\leq\left\|\sqrt{\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left([V]_{\alpha_{1}}\right)}-\sqrt{\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left([V]_{\alpha_{2}}\right)}\right\|+\left\|\sqrt{\mathsf{Var}_{P^{0}_{s,a}}\left([V]_{\alpha_{1}}\right)}-\sqrt{\mathsf{Var}_{P^{0}_{s,a}}\left([V]_{\alpha_{2}}\right)}\right\|$
	$\displaystyle\overset{\mathrm{(ii)}}{\leq}\sqrt{\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left([V]_{\alpha_{2}}\right)-\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left([V]_{\alpha_{1}}\right)}+\sqrt{\mathsf{Var}_{P^{0}_{s,a}}\left([V]_{\alpha_{2}}\right)-\mathsf{Var}_{P^{0}_{s,a}}\left([V]_{\alpha_{1}}\right)}$
	$\displaystyle\overset{\mathrm{(iii)}}{\leq}\sqrt{\left\|\widehat{P}^{0}_{s,a}\left[\left([V]_{\alpha_{1}}\right)\circ\left([V]_{\alpha_{1}}\right)-\left([V]_{\alpha_{2}}\right)\circ\left([V]_{\alpha_{2}}\right)\right]\right\|+\left\|\widehat{P}^{0}_{s,a}\left([V]_{\alpha_{1}}+[V]_{\alpha_{2}}\right)\cdot\widehat{P}^{0}_{s,a}\left([V]_{\alpha_{1}}-[V]_{\alpha_{2}}\right)\right\|}$
	$\displaystyle\quad+\sqrt{\left\|P^{0}_{s,a}\left[\left([V]_{\alpha_{1}}\right)\circ\left([V]_{\alpha_{1}}\right)-\left([V]_{\alpha_{2}}\right)\circ\left([V]_{\alpha_{2}}\right)\right]\right\|+\left\|P^{0}_{s,a}\left([V]_{\alpha_{1}}+[V]_{\alpha_{2}}\right)\cdot P^{0}_{s,a}\left([V]_{\alpha_{1}}-[V]_{\alpha_{2}}\right)\right\|}$
	$\displaystyle\leq 2\sqrt{2(\alpha_{1}+\alpha_{2})\|\alpha_{1}-\alpha_{2}\|}\leq 4\sqrt{\frac{\|\alpha_{1}-\alpha_{2}\|}{1-\gamma}}.$		(224)

where (i) holds by the fact $||x|-|y||\leq|x-y|$ for all $x,y\in\mathbb{R}$ , (ii) follows from the fact that $\sqrt{x}-\sqrt{y}\leq\sqrt{x-y}$ for any $x\geq y\geq 0$ and $\mathsf{Var}_{P}\left([V]_{\alpha_{2}}\right)\geq\mathsf{Var}_{P}\left([V]_{\alpha_{1}}\right)$ for any transition kernel $P\in\Delta({\mathcal{S}})$ , (iii) holds by the definition of $\mathrm{Var}_{P}(\cdot)$ defined in (40), and the last inequality arises from $0\leq\alpha_{1},\alpha_{2}\leq 1/(1-\gamma)$ .

Appendix E Proof of the lower bound with $\chi^{2}$ divergence: Theorem 4

To prove Theorem 4, we shall first construct some hard instances and then characterize the sample complexity requirements over these instances. The structure of the hard instances are the same as the ones used in the proof of Theorem 2.

E.1 Construction of the hard problem instances

First, note that we shall use the same MDPs defined in Appendix 5.3.1 as follows

\displaystyle\left\{\mathcal{M}_{\phi}=\left(\mathcal{S},\mathcal{A},P^{\phi},r,\gamma\right)\,|\,\phi=\{0,1\}\right\}.

In particular, we shall keep the structure of the transition kernel in (64), reward function in (68) and initial state distribution in (69), while $p$ and $\Delta$ shall be specified differently later.

Uncertainty set of the transition kernels.

Recalling the uncertainty set associated with $\chi^{2}$ divergence in (201), for any uncertainty level $\sigma$ , the uncertainty set throughout this section is defined as $\mathcal{U}^{\sigma}(P^{\phi})$ :

\displaystyle\mathcal{U}^{\sigma}(P^{\phi})

\displaystyle\coloneqq\otimes\;\mathcal{U}^{\sigma}_{\mathsf{\chi^{2}}}(P^{\phi}_{s,a}),\qquad

\displaystyle\mathcal{U}^{\sigma}_{\mathsf{\chi^{2}}}(P^{\phi}_{s,a})\coloneqq\bigg{\{}P_{s,a}\in\Delta({\mathcal{S}}):\sum_{s^{\prime}\in{\mathcal{S}}}\frac{\left(P(s^{\prime}\,|\,s,a)-P^{\phi}(s^{\prime}\,|\,s,a)\right)^{2}}{P^{\phi}(s^{\prime}\,|\,s,a)}\leq\sigma\bigg{\}}.

(225)

Clearly, $\mathcal{U}^{\sigma}(P^{\phi}_{s,a})=P^{\phi}_{s,a}$ whenever the state transition is deterministic for $\chi^{2}$ divergence. Here, $q$ and $\Delta$ (whose choice will be specified later in more detail) which determine the instances are specified as

\displaystyle 0

\displaystyle\leq q=\begin{cases}1-\gamma&\text{if }\sigma\in\left(0,\frac{1-\gamma}{4}\right)\\ \frac{\sigma}{1+\sigma}&\text{if }\sigma\in\left[\frac{1-\gamma}{4},\infty\right)\\ \end{cases},\qquad p=q+\Delta,

(226)

and

\displaystyle 0

\displaystyle<\Delta\leq\begin{cases}\frac{1}{4}(1-\gamma)&\text{if }\sigma\in\left(0,\frac{1-\gamma}{4}\right)\\ \min\left\{\frac{1}{4}(1-\gamma),\,\frac{1}{2(1+\sigma)}\right\}&\text{if }\sigma\in\left[\frac{1-\gamma}{4},\infty\right)\\ \end{cases}.

(227)

This directly ensures that

p=\Delta+q\leq\max\left\{\frac{\frac{1}{2}+\sigma}{1+\sigma},\frac{5}{4}(1-\gamma)\right\}\leq 1

since $\gamma\in\big{[}\frac{3}{4},1\big{)}$ .

\displaystyle\underline{P}^{\phi}(s^{\prime}\,|\,s,a)

\displaystyle\coloneqq\inf_{P_{s,a}\in\mathcal{U}^{\sigma}(P^{\phi}_{s,a})}P(s^{\prime}\,|\,s,a).

(228)

In addition, we denote the transition from state $0$ to state $1$ as follows, which plays an important role in the analysis,

\displaystyle\underline{p}

\displaystyle\coloneqq\underline{P}^{\phi}(1\,|\,0,\phi),\qquad\underline{q}\coloneqq\underline{P}^{\phi}(1\,|\,0,1-\phi).

(229)

Before continuing, we introduce some facts about $\underline{p}$ and $\underline{q}$ which are summarized as the following lemma; the proof is postponed to Appendix E.3.1.

Lemma 19.

Consider any $\sigma\in(0,\infty)$ and any $p,q,\Delta$ obeying (226) and (227), the following properties hold

\displaystyle\begin{cases}\frac{1-\gamma}{2}<\underline{q}<1-\gamma,\quad\underline{q}+\frac{3}{4}\Delta\leq\underline{p}\leq\underline{q}+\Delta\leq\frac{5(1-\gamma)}{4}&\text{if }\sigma\in\left(0,\frac{1-\gamma}{4}\right),\\ \underline{q}=0,\quad\frac{\sigma+1}{2}\Delta\leq\underline{p}\leq(3+\sigma)\Delta&\text{if }\sigma\in\left[\frac{1-\gamma}{4},\infty\right).\end{cases}

(230)

Value functions and optimal policies.

Armed with above facts, we are positioned to derive the corresponding robust value functions, the optimal policies, and its corresponding optimal robust value functions. For any RMDP $\mathcal{M}_{\phi}$ with the uncertainty set defined in (225), we denote the robust optimal policy as $\pi^{\star}_{\phi}$ , the robust value function of any policy $\pi$ (resp. the optimal policy $\pi^{\star}_{\phi}$ ) as $V^{\pi,\sigma}_{\phi}$ (resp. $V^{\star,\sigma}_{\phi}$ ). The following lemma describes some key properties of the robust (optimal) value functions and optimal policies whose proof is postponed to Appendix E.3.2.

Lemma 20.

For any $\phi=\{0,1\}$ and any policy $\pi$ , one has

\displaystyle V^{\pi,\sigma}_{\phi}(0)=\frac{\gamma z_{\phi}^{\pi}}{(1-\gamma)\left(1-\gamma\big{(}1-z_{\phi}^{\pi}\big{)}\right)},

(231)

where $z_{\phi}^{\pi}$ is defined as

\displaystyle z_{\phi}^{\pi}\coloneqq\underline{p}\pi(\phi\,|\,0)+\underline{q}\pi(1-\phi\,|\,0).

(232)

In addition, the optimal value functions and the optimal policies obey


$\displaystyle V_{\phi}^{\star,\sigma}(0)$	$\displaystyle=\frac{\gamma\underline{p}}{(1-\gamma)\left(1-\gamma\left(1-\underline{p}\right)\right)},$	(233a)
$\displaystyle\pi_{\phi}^{\star}(\phi\,\|\,s)$	$\displaystyle=1,\qquad\qquad\text{ for }s\in{\mathcal{S}}.$	(233b)

E.2 Establishing the minimax lower bound

Our goal is to control the performance gap w.r.t. any policy estimator $\widehat{\pi}$ based on the generated dataset and the chosen initial distribution $\varphi$ in (69), which gives

\displaystyle\big{\langle}\varphi,V^{\star,\sigma}_{\phi}-V^{\widehat{\pi},\sigma}_{\phi}\big{\rangle}=V^{\star,\sigma}_{\phi}(0)-V^{\widehat{\pi},\sigma}_{\phi}(0).

(234)

Step 1: converting the goal to estimate $\phi$ .

To achieve the goal, we first introduce the following fact which shall be verified in Appendix E.3.3: given

\displaystyle\varepsilon\leq\begin{cases}\frac{1}{72(1-\gamma)}\quad&\text{if }\sigma\in\left(0,\frac{1-\gamma}{4}\right),\\ \frac{1}{256(1+\sigma)(1-\gamma)}\quad&\text{if }\sigma\in\left[\frac{1-\gamma}{4},\frac{1}{3(1-\gamma)}\right),\\ \frac{3}{32}\quad&\text{if }\sigma>\frac{1}{3(1-\gamma)}.\end{cases}

(235)

choosing

\displaystyle\Delta=\begin{cases}18(1-\gamma)^{2}\varepsilon\quad&\text{if }\sigma\in\left(0,\frac{1-\gamma}{4}\right),\\ 64(1+\sigma)(1-\gamma)^{2}\varepsilon\quad&\text{if }\sigma\in\left[\frac{1-\gamma}{4},\frac{1}{3(1-\gamma)}\right),\\ \frac{16}{3(1+\sigma)}\varepsilon\quad&\text{if }\sigma>\frac{1}{3(1-\gamma)}.\end{cases}

(236)

which satisfies the requirement of $\Delta$ in (226), it holds that for any policy $\widehat{\pi}$ ,

\displaystyle\big{\langle}\varphi,V^{\star,\sigma}_{\phi}-V^{\widehat{\pi},\sigma}_{\phi}\big{\rangle}\geq 2\varepsilon\big{(}1-\widehat{\pi}(\phi\,|\,0)\big{)}.

(237)

Step 2: arriving at the final results.

To continue, following the same definitions and argument in Appendix 5.3.2, we recall the minimax probability of the error and its property as follows:

\displaystyle p_{\mathrm{e}}

\displaystyle\geq\frac{1}{4}\exp\bigg{\{}-N\Big{(}\mathsf{KL}\big{(}P^{0}(\cdot\,|\,0,0)\parallel P^{1}(\cdot\,|\,0,0)\big{)}+\mathsf{KL}\big{(}P^{0}(\cdot\,|\,0,1)\parallel P^{1}(\cdot\,|\,0,1)\big{)}\Big{)}\bigg{\}},

(238)

then we can complete the proof by showing $p_{\mathrm{e}}\geq\frac{1}{8}$ given the bound for the sample size $N$ . In the following, we shall control the KL divergence terms in (238) in three different cases.

•

Case 1: $\sigma\in\left(0,\frac{1-\gamma}{4}\right)$ . In this case, applying $\gamma\in[\frac{3}{4},1)$ yields

	$\displaystyle 1-q$	$\displaystyle>1-p=1-q-\Delta>\gamma-\frac{1-\gamma}{4}>\frac{3}{4}-\frac{1}{16}>\frac{1}{2},$
	$\displaystyle p$	$\displaystyle\geq q=1-\gamma.$		(239)

Armed with the above facts, applying Tsybakov, (2009, Lemma 2.7) yields

$\displaystyle\mathsf{KL}\big{(}P^{0}(\cdot\,\|\,0,1)\parallel P^{1}(\cdot\,\|\,0,1)\big{)}$	$\displaystyle=\mathsf{KL}\left(p\parallel q\right)\leq\frac{(p-q)^{2}}{(1-p)p}\overset{\mathrm{(i)}}{=}\frac{\Delta^{2}}{p(1-p)}$
	$\displaystyle\overset{\mathrm{(ii)}}{=}\frac{324(1-\gamma)^{4}\varepsilon^{2}}{p(1-p)}$
	$\displaystyle\overset{\mathrm{(iii)}}{\leq}648(1-\gamma)^{3}\varepsilon^{2},$	(240)

where (i) follows from the definition in (226), (ii) holds by plugging in the expression of $\Delta$ in (236), and (iii) arises from (• ‣ E.2). The same bound can be established for $\mathsf{KL}\big{(}P_{1}^{0}(\cdot\,|\,0,0)\parallel P_{1}^{1}(\cdot\,|\,0,0)\big{)}$ . Substituting (240) back into (238) demonstrates that: if the sample size is chosen as

\displaystyle N\leq\frac{\log 2}{1296(1-\gamma)^{3}\varepsilon^{2}},

(241)

then one necessarily has

\displaystyle p_{\mathrm{e}}

\displaystyle\geq\frac{1}{4}\exp\Big{\{}-N\cdot 1296(1-\gamma)^{3}\varepsilon^{2}\Big{\}}\geq\frac{1}{8}.

(242)

•

Case 2: $\sigma\in\left[\frac{1-\gamma}{4},\frac{1}{3(1-\gamma)}\right)$ . Applying the facts of $\Delta$ in (227), one has

	$\displaystyle 1-q$	$\displaystyle>1-p=1-q-\Delta\geq\frac{1}{1+\sigma}-\frac{1}{2(1+\sigma)}=\frac{1}{2(1+\sigma)},$
	$\displaystyle p$	$\displaystyle\geq q=\frac{\sigma}{1+\sigma}.$		(243)

Given (• ‣ E.2), applying Tsybakov, (2009, Lemma 2.7) yields

$\displaystyle\mathsf{KL}\big{(}P^{0}(\cdot\,\|\,0,1)\parallel P^{1}(\cdot\,\|\,0,1)\big{)}$	$\displaystyle=\mathsf{KL}\left(p\parallel q\right)\leq\frac{(p-q)^{2}}{(1-p)p}\overset{\mathrm{(i)}}{=}\frac{\Delta^{2}}{p(1-p)}$
	$\displaystyle\overset{\mathrm{(ii)}}{=}\frac{4096(1+\sigma)^{2}(1-\gamma)^{4}\varepsilon^{2}}{p(1-p)}$
	$\displaystyle\overset{\mathrm{(iii)}}{\leq}\frac{4096(1+\sigma)^{2}(1-\gamma)^{4}\varepsilon^{2}}{\frac{\sigma}{2(1+\sigma)^{2}}}\leq\frac{8192(1-\gamma)^{4}(1+\sigma)^{4}\varepsilon^{2}}{\sigma},$	(244)

Substituting (244) back into (85) demonstrates that: if the sample size is chosen as

\displaystyle N\leq\frac{\sigma\log 2}{16384(1-\gamma)^{4}(1+\sigma)^{4}\varepsilon^{2}},

(245)

then one necessarily has

\displaystyle p_{\mathrm{e}}

\displaystyle\geq\frac{1}{4}\exp\bigg{\{}-N\frac{16384(1-\gamma)^{4}(1+\sigma)^{4}\varepsilon^{2}}{\sigma}\bigg{\}}\geq\frac{1}{8}.

(246)

•

Case 3: $\sigma>\frac{1}{3(1-\gamma)}\geq\frac{1}{3}$ . Regarding this, one gives

	$\displaystyle 1-q$	$\displaystyle>1-p=1-q-\Delta\geq\frac{1}{1+\sigma}-\frac{1}{4(1+\sigma)}\geq\frac{1}{2(1+\sigma)},$
	$\displaystyle p$	$\displaystyle\geq q\geq\frac{1}{4}.$		(247)

Given $p\geq q\geq 1/2$ and (• ‣ E.2), applying Tsybakov, (2009, Lemma 2.7) yields

$\displaystyle\mathsf{KL}\big{(}P^{0}(\cdot\,\|\,0,1)\parallel P^{1}(\cdot\,\|\,0,1)\big{)}$	$\displaystyle=\mathsf{KL}\left(p\parallel q\right)\leq\frac{(p-q)^{2}}{(1-p)p}\overset{\mathrm{(i)}}{=}\frac{\Delta^{2}}{p(1-p)}$
	$\displaystyle\overset{\mathrm{(ii)}}{\leq}\frac{\frac{64}{(1+\sigma)^{2}}\varepsilon^{2}}{p(1-p)}$
	$\displaystyle\overset{\mathrm{(iii)}}{\leq}\frac{492\varepsilon^{2}}{\sigma},$	(248)

where (i) follows from the definition in (226), (ii) holds by plugging in the expression of $\Delta$ in (236), and (iii) arises from (• ‣ E.2). The same bound can be established for $\mathsf{KL}\big{(}P_{1}^{0}(\cdot\,|\,0,0)\parallel P_{1}^{1}(\cdot\,|\,0,0)\big{)}$ . Substituting (248) back into (85) demonstrates that: if the sample size is chosen as

\displaystyle N\leq\frac{\sigma\log 2}{984\varepsilon^{2}},

(249)

then one necessarily has

\displaystyle p_{\mathrm{e}}

\displaystyle\geq\frac{1}{4}\exp\bigg{\{}-N\frac{984\varepsilon^{2}}{\sigma}\bigg{\}}\geq\frac{1}{8}.

(250)

Step 3: putting things together.

Finally, summing up the results in (241), (245), and (249), combined with the requirement in (235), one has when

\displaystyle\varepsilon

\displaystyle\leq c_{1}\begin{cases}\frac{1}{1-\gamma}\quad&\text{if }\sigma\in\left(0,\frac{1-\gamma}{4}\right)\\ \max\left\{\frac{1}{(1+\sigma)(1-\gamma)},1\right\}\quad&\text{if }\sigma\in\left[\frac{1-\gamma}{4},\infty\right)\end{cases},

(251)

taking

\displaystyle N\leq c_{2}\begin{cases}\frac{1}{(1-\gamma)^{3}\varepsilon^{2}}&\text{if }\sigma\in\left(0,\frac{1-\gamma}{4}\right)\\ \frac{\sigma}{\min\left\{1,(1-\gamma)^{4}(1+\sigma)^{4}\right\}\varepsilon^{2}}&\text{if }\sigma\in\left[\frac{1-\gamma}{4},\infty\right)\end{cases}

(252)

leads to $p_{e}\geq\frac{1}{8}$ , for some universal constants $c_{1},c_{2}>0$ .

E.3 Proof of the auxiliary facts

We begin with some basic facts about the $\chi^{2}$ divergence defined in (39) for any two Bernoulli distributions $\mathsf{Ber}(w)$ and $\mathsf{Ber}(x)$ , denoted as

\displaystyle f(w,x)\coloneqq\chi^{2}(x\parallel w)=\frac{(w-x)^{2}}{w}+\frac{(1-w-(1-x))^{2}}{1-w}=\frac{(w-x)^{2}}{w(1-w)}.

(253)

For $x\in[0,w)$ , it is easily verified that the partial derivative w.r.t. $x$ obeys $\frac{\partial f(w,x)}{\partial x}=\frac{2(x-w)}{w(1-w)}<0$ , implying that

\displaystyle\forall\;x_{1}<x_{2}\in[0,w),\qquad f(w,x_{1})>f(w,x_{2}).

(254)

In other words, the $\chi^{2}$ divergence $f(w,x)$ increases as $x$ decreases from $w$ to $0$ .

Next, we introduce the following function for any fixed $\sigma\in(0,\infty)$ and any $x\in\left[\frac{\sigma}{1+\sigma},1\right)$ :

\displaystyle f_{\sigma}(x)\coloneqq\inf_{\{y:\chi^{2}(y\parallel x)\leq\sigma,y\in[0,x]\}}y\overset{\mathrm{(i)}}{=}\max\left\{0,x-\sqrt{\sigma x(1-x)}\right\}=x-\sqrt{\sigma x(1-x)},

(255)

where (i) has been verified in Yang et al., (2022, Corollary B.2), and the last equality holds since $x\geq\frac{\sigma}{1+\sigma}$ . The next lemma summarizes some useful facts about $f_{\sigma}(\cdot)$ , which again has been verified in Yang et al., (2022, Lemma B.12 and Corollary B.2).

Lemma 21.

Consider any $\sigma\in(0,\infty)$ . For $x\in[\frac{\sigma}{1+\sigma},1)$ , $f_{\sigma}(x)$ is convex and differentiable, which obeys

\displaystyle f_{\sigma}^{\prime}(x)=1+\frac{\sqrt{\sigma}(2x-1)}{2\sqrt{x(1-x)}}.

E.3.1 Proof of Lemma 19

Let us control $\underline{q}$ and $\underline{p}$ respectively.

Step 1: controlling $\underline{q}$ .

We shall control $\underline{q}$ in different cases w.r.t. the uncertainty level $\sigma$ .

•

Case 1: $\sigma\in\left(0,\frac{1-\gamma}{4}\right)$ . In this case, recall that $q=1-\gamma$ defined in (226), applying (255) with $x=q$ leads to

\displaystyle 1-\gamma=q>\underline{q}=f_{\sigma}(q)=1-\gamma-\sqrt{\sigma\gamma(1-\gamma)}\geq 1-\gamma-\sqrt{\frac{1-\gamma}{4}\gamma(1-\gamma)}>\frac{1-\gamma}{2}.

(256)

•

Case 2: $\sigma\in\left[\frac{1-\gamma}{4},\infty\right)$ . Note that it suffices to treat $P^{\phi}_{0,1-\phi}$ as a Bernoulli distribution $\mathsf{Ber}(q)$ over states $1$ and $0$ , since we do not allow transition to other states. Recalling $q=\frac{\sigma}{1+\sigma}$ in (226) and noticing the fact that

$\displaystyle f(q,0)$ $\displaystyle=\frac{q^{2}}{q}+\frac{(1-(1-q))^{2}}{1-q}=\frac{q}{(1-q)}=\sigma,$ (257)

one has the probability $\mathsf{Ber}(0)$ falls into the uncertainty set of $\mathsf{Ber}(q))$ of size $\sigma$ . As a result, recalling the definition (229) leads to

$\displaystyle\underline{q}=\underline{P}^{\phi}(1\,|\,0,1-\phi)=0,$ (258)

since $\underline{q}\geq 0$ .

Step 2: controlling $\underline{p}$ .

To characterize the value of $\underline{p}$ , we also divide into several cases separately.

•

Case 1: $\sigma\in\left(0,\frac{1-\gamma}{4}\right)$ . In this case, note that $p>q=1-\gamma\geq\frac{\sigma}{1+\sigma}$ . Therefore, applying that $f_{\sigma}(\cdot)$ is convex and the form of its derivative in Lemma 21, one has

	$\displaystyle\underline{p}$	$\displaystyle=f_{\sigma}(p)\geq f_{\sigma}(q)+f_{\sigma}^{\prime}(q)(p-q)$
		$\displaystyle=\underline{q}+\Bigg{(}1+\frac{\sqrt{\sigma}(2q-1)}{2\sqrt{q(1-q)}}\Bigg{)}\Delta\geq\underline{q}+\Bigg{(}1-\frac{\sqrt{\frac{1-\gamma}{4}}(1-2(1-\gamma))}{2\sqrt{(1-\gamma)\gamma}}\Bigg{)}\Delta\geq\underline{q}+\frac{3\Delta}{4}.$		(259)

Similarly, applying Lemma 21 leads to

	$\displaystyle\underline{p}$	$\displaystyle=f_{\sigma}(p)\leq f_{\sigma}(q)+f_{\sigma}^{\prime}(p)(p-q)$
		$\displaystyle=\underline{q}+\Bigg{(}1-\frac{\sqrt{\sigma}(1-2p)}{2\sqrt{p(1-p)}}\Bigg{)}\Delta\leq\underline{q}+\Delta,$		(260)

where the last inequality holds by $1-2p>0$ due to the fact $p=q+\Delta\leq\frac{5}{4}(1-\gamma)\leq\frac{5}{16}<\frac{1}{2}$ (cf. (227) and $\gamma\in[\frac{3}{4},1)$ ). To sum up, given $\sigma\in\left(0,\frac{1-\gamma}{4}\right)$ , combined with (256), we arrive at

\displaystyle\underline{q}+\frac{3}{4}\Delta\leq\underline{p}\leq\underline{q}+\Delta\leq\frac{5(1-\gamma)}{4},

(261)

where the last inequality holds by $\Delta\leq\frac{1}{4}(1-\gamma)$ (see (226)).

•

Case 2: $\sigma\in\left[\frac{1-\gamma}{4},\infty\right)$ . We recall that $p=q+\Delta>q=\frac{\sigma}{1+\sigma}$ in (226). To derive the lower bound for $\underline{p}$ in (229), similar to (• ‣ E.3.1), one has

$\displaystyle\underline{p}$	$\displaystyle=f_{\sigma}(p)\geq f_{\sigma}(q)+f_{\sigma}^{\prime}(q)(p-q)$
	$\displaystyle=\underline{q}+\left(1+\frac{\sqrt{\sigma}(2q-1)}{2\sqrt{q(1-q)}}\right)\Delta$
	$\displaystyle\overset{\mathrm{(i)}}{=}0+\left(1+\frac{\sqrt{\sigma}\frac{\sigma-1}{1+\sigma}}{2\sqrt{\frac{\sigma}{1+\sigma}\frac{1}{1+\sigma}}}\right)\Delta=\left(1+\frac{\sigma-1}{2}\right)\Delta=\left(\frac{\sigma+1}{2}\right)\Delta,$	(262)

where (i) follows from $q=\frac{\sigma}{1+\sigma}$ and $\underline{q}=0$ (see (258)). For the other direction, similar to (• ‣ E.3.1), we have

$\displaystyle\underline{p}$	$\displaystyle=f_{\sigma}(p)\leq f_{\sigma}(q)+f_{\sigma}^{\prime}(p)(p-q)=\underline{q}+\left(1+\frac{\sqrt{\sigma}(2p-1)}{2\sqrt{p(1-p)}}\right)\Delta$
	$\displaystyle\overset{\mathrm{(i)}}{=}\left(1+\frac{\sqrt{\sigma}(2p-1)}{2\sqrt{p(1-p)}}\right)\Delta\overset{\mathrm{(ii)}}{=}\left(1+\frac{\sqrt{\sigma}\left(\frac{\sigma-1}{1+\sigma}+2\Delta\right)}{2\sqrt{\left(\frac{\sigma}{1+\sigma}+\Delta\right)\left(\frac{1}{1+\sigma}-\Delta\right)}}\right)\Delta$
	$\displaystyle\overset{\mathrm{(iii)}}{\leq}\left(1+\frac{\sqrt{\sigma}(1+2\Delta)}{2\sqrt{\frac{\sigma}{1+\sigma}\cdot\frac{1}{2(1+\sigma)}}}\right)\Delta\overset{\mathrm{(iv)}}{\leq}\left(1+(1+\sigma)\left(1+\frac{1}{1+\sigma}\right)\right)\Delta=(3+\sigma)\Delta,$	(263)

where (i) holds by $\underline{q}=0$ (see (258)), (ii) follows from plugging in $p=q+\Delta=\frac{\sigma}{1+\sigma}+\Delta$ , and (iii) and (iv) arises from $\Delta=\min\left\{\frac{1}{4}(1-\gamma),\frac{1}{2(1+\sigma)}\right\}\leq 1$ in (227). Combining (• ‣ E.3.1) and (263) yields

\displaystyle\frac{\sigma+1}{2}\Delta\leq\underline{p}\leq(3+\sigma)\Delta.

(264)

Step 3: combining all the results.

Finally, summing up the results for both $\underline{q}$ (in (256) and (258)) and $\underline{p}$ (in (261) and (264)), we arrive at the advertised bound.

E.3.2 Proof of Lemma 20

The robust value function for any policy $\pi$ .

For any $\mathcal{M}_{\phi}$ with $\phi\in\{0,1\}$ , we first characterize the robust value function of any policy $\pi$ over different states.

Towards this, it is easily observed that for any policy $\pi$ , the robust value functions at state $s=1$ or any $s\in\{2,3,\cdots,S-1\}$ obey


	$\displaystyle V_{\phi}^{\pi,\sigma}(1)\overset{\mathrm{(i)}}{=}1+\gamma V_{\phi}^{\pi,\sigma}(1)=\frac{1}{1-\gamma}$		(265a)
and

	$\displaystyle\forall s\in\{2,3,\cdots,S\}:\qquad V_{\phi}^{\pi,\sigma}(s)$	$\displaystyle\overset{\mathrm{(ii)}}{=}0+\gamma V_{\phi}^{\pi,\sigma}(1)=\frac{\gamma}{1-\gamma},$	(265b)

where (i) and (ii) is according to the facts that the transitions defined over states $s\geq 1$ in (64) give only one possible next state $1$ , leading to a non-random transition in the uncertainty set associated with $\chi^{2}$ divergence, and $r(1,a)=1$ for all $a\in\mathcal{A}^{\prime}$ and $r(s,a)=0$ holds all $(s,a)\in\{2,3,\cdots,S-1\}\times\mathcal{A}$ .

To continue, the robust value function at state $0$ with policy $\pi$ satisfies

$\displaystyle V_{\phi}^{\pi,\sigma}(0)$	$\displaystyle=\mathbb{E}_{a\sim\pi(\cdot\,\|\,0)}\bigg{[}r(0,a)+\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(P^{\phi}_{0,a})}\mathcal{P}V^{\pi,\sigma}_{\phi}\bigg{]}$
	$\displaystyle=0+\gamma\pi(\phi\,\|\,0)\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(P^{\phi}_{0,\phi})}\mathcal{P}V^{\pi,\sigma}_{\phi}+\gamma\pi(1-\phi\,\|\,0)\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(P^{\phi}_{0,1-\phi})}\mathcal{P}V^{\pi,\sigma}_{\phi}$	(266)
	$\displaystyle\overset{\mathrm{(i)}}{\leq}\frac{\gamma}{1-\gamma},$	(267)

where (i) holds by that $\|V^{\pi,\sigma}_{\phi}\|_{\infty}\leq\frac{1}{1-\gamma}$ . Summing up the results in (265b) and (267) leads to

\displaystyle\forall s\in\{2,3,\cdots,S\},\qquad V_{\phi}^{\pi,\sigma}(1)>V_{\phi}^{\pi,\sigma}(s)\geq V_{\phi}^{\pi,\sigma}(0).

(268)

With the transition kernel in (64) over state $0$ and the fact in (268), (266) can be rewritten as

$\displaystyle V_{\phi}^{\pi,\sigma}(0)$	$\displaystyle=\gamma\pi(\phi\,\|\,0)\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(P^{\phi}_{0,\phi})}\mathcal{P}V^{\pi,\sigma}_{\phi}+\gamma\pi(1-\phi\,\|\,0)\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(P^{\phi}_{0,1-\phi})}\mathcal{P}V^{\pi,\sigma}_{\phi}$
	$\displaystyle\overset{\mathrm{(i)}}{=}\gamma\pi(\phi\,\|\,0)\Big{[}\underline{p}V_{\phi}^{\pi,\sigma}(1)+\left(1-\underline{p}\right)V_{\phi}^{\pi,\sigma}(0)\Big{]}+\gamma\pi(1-\phi\,\|\,0)\Big{[}\underline{q}V_{\phi}^{\pi,\sigma}(1)+\left(1-\underline{q}\right)V_{\phi}^{\pi,\sigma}(0)\Big{]}$
	$\displaystyle\overset{\mathrm{(ii)}}{=}\gamma z_{\phi}^{\pi}V_{\phi}^{\pi,\sigma}(1)+\gamma\left(1-z_{\phi}^{\pi}\right)V_{\phi}^{\pi,\sigma}(0)$
	$\displaystyle=\frac{\gamma z_{\phi}^{\pi}}{(1-\gamma)\Big{(}1-\gamma\big{(}1-z_{\phi}^{\pi}\big{)}\Big{)}},$	(269)

where (i) holds by the definition of $\underline{p}$ and $\underline{q}$ in (229), (ii) follows from the definition of $z_{\phi}^{\pi}$ in (232), and the last line holds by applying (265a) and solving the resulting linear equation for $V_{\phi}^{\pi,\sigma}(0)$ .

Optimal policy and its optimal value function.

To continue, observing that $V_{\phi}^{\pi,\sigma}(0)=:f(z_{\phi}^{\pi})$ is increasing in $z_{\phi}^{\pi}$ since the derivative of $f(z_{\phi}^{\pi})$ w.r.t. $z_{\phi}^{\pi}$ obeys

\displaystyle f^{\prime}(z_{\phi}^{\pi})=\frac{\gamma(1-\gamma)\left(1-\gamma\big{(}1-z_{\phi}^{\pi}\big{)}\right)-\gamma^{2}z_{\phi}^{\pi}(1-\gamma)}{(1-\gamma)^{2}\left(1-\gamma\big{(}1-z_{\phi}^{\pi}\big{)}\right)^{2}}=\frac{\gamma}{\left(1-\gamma\big{(}1-z_{\phi}^{\pi}\big{)}\right)^{2}}>0,

where the last inequality holds by $0\leq z_{\phi}^{\pi}\leq 1$ . Further, $z_{\phi}^{\pi}$ is also increasing in $\pi(\phi\,|\,0)$ (see the fact $\underline{p}\geq\underline{q}$ in (229)), the optimal robust policy in state $0$ thus obeys

\pi_{\phi}^{\star}(\phi\,|\,0)=1.

(270)

Considering that the action does not influence the state transition for all states $s>0$ , without loss of generality, we choose the optimal robust policy to obey

\displaystyle\forall s>0:\quad\pi_{\phi}^{\star}(\phi\,|\,s)=1.

(271)

Taking $\pi=\pi^{\star}_{\phi}$ and $z_{\phi}^{\pi^{\star}_{\phi}}=\underline{p}$ in (269), we complete the proof by showing the corresponding optimal robust value function at state $0$ as follows:

\displaystyle V_{\phi}^{\star,\sigma}(0)

\displaystyle=\frac{\gamma z_{\phi}^{\pi^{\star}_{\phi}}}{(1-\gamma)\left(1-\gamma\left(1-z_{\phi}^{\pi^{\star}_{\phi}}\right)\right)}=\frac{\gamma\underline{p}}{(1-\gamma)\left(1-\gamma\left(1-\underline{p}\right)\right)}.

E.3.3 Proof of the claim (237)

Plugging in the definition of $\varphi$ , we arrive at that for any policy $\pi$ ,

$\displaystyle\big{\langle}\varphi,V^{\star,\sigma}_{\phi}-V^{\pi,\sigma}_{\phi}\big{\rangle}$	$\displaystyle=V^{\star,\sigma}_{\phi}(0)-V^{\pi,\sigma}_{\phi}(0)$
	$\displaystyle\overset{\mathrm{(i)}}{=}\frac{\gamma\underline{p}}{(1-\gamma)\left(1-\gamma\big{(}1-\underline{p}\big{)}\right)}-\frac{\gamma z_{\phi}^{\pi}}{(1-\gamma)\left(1-\gamma\big{(}1-z_{\phi}^{\pi}\big{)}\right)}$
	$\displaystyle=\frac{\gamma\left(\underline{p}-z_{\phi}^{\pi}\right)}{\left(1-\gamma\big{(}1-\underline{p}\big{)}\right)\left(1-\gamma\big{(}1-z_{\phi}^{\pi}\big{)}\right)}\overset{\mathrm{(ii)}}{\geq}\frac{\gamma\left(\underline{p}-z_{\phi}^{\pi}\right)}{\left(1-\gamma\left(1-\underline{p}\right)\right)^{2}}\overset{\mathrm{(iii)}}{=}\frac{\gamma(\underline{p}-\underline{q})\big{(}1-\pi(\phi\,\|\,0)\big{)}}{\left(1-\gamma\left(1-\underline{p}\right)\right)^{2}},$	(272)

where (i) holds by applying Lemma 20, (ii) arises from $z_{\phi}^{\pi}\leq\underline{p}$ (see the definition of $z_{\phi}^{\pi}$ in (232) and the fact $\underline{p}\geq\underline{q}+\frac{3\Delta}{4}$ in (229)), and (iii) follows from the definition of $z_{\phi}^{\pi}$ in (232).

To further control (272), we consider it in two cases separately:

•

Case 1: $\sigma\in\left(0,\frac{1-\gamma}{4}\right)$ . In this case, applying Lemma 19 to (272) yields

	$\displaystyle\big{\langle}\varphi,V^{\star,\sigma}_{\phi}-V^{\pi,\sigma}_{\phi}\big{\rangle}$	$\displaystyle\geq\frac{\gamma(\underline{p}-\underline{q})\big{(}1-\pi(\phi\,\|\,0)\big{)}}{\left(1-\gamma\left(1-\underline{p}\right)\right)^{2}}\geq\frac{\gamma\frac{3\Delta}{4}\big{(}1-\pi(\phi\,\|\,0)\big{)}}{\left(1-\gamma\left(1-\frac{5(1-\gamma)}{4}\right)\right)^{2}}$
		$\displaystyle\geq\frac{\Delta\big{(}1-\pi(\phi\,\|\,0)\big{)}}{9(1-\gamma)^{2}}=2\varepsilon\big{(}1-{\pi}(\phi\,\|\,0)\big{)},$		(273)

where the penultimate inequality follows from $\gamma\geq 3/4$ , and the last inequality holds by taking the specification of $\Delta$ in (236) as follows:

\displaystyle\Delta=18(1-\gamma)^{2}\varepsilon.

(274)

It is easily verified that taking $\varepsilon\leq\frac{1}{72(1-\gamma)}$ as in (235) directly leads to meeting the requirement in (227), i.e., $\Delta\leq\frac{1}{4}(1-\gamma)$ .

•

Case 2: $\sigma\in\left[\frac{1-\gamma}{4},\infty\right)$ . Similarly, applying Lemma 19 to (272) gives

\displaystyle\big{\langle}\varphi,V^{\star,\sigma}_{\phi}-V^{\pi,\sigma}_{\phi}\big{\rangle}

\displaystyle\geq\frac{\gamma(\underline{p}-\underline{q})\big{(}1-\pi(\phi\,|\,0)\big{)}}{\left(1-\gamma\left(1-\underline{p}\right)\right)^{2}}\geq\frac{\gamma\frac{\sigma+1}{2}\Delta\big{(}1-\pi(\phi\,|\,0)\big{)}}{\min\left\{1,\left(1-\gamma\left(1-(3+\sigma)\Delta\right)\right)^{2}\right\}}

(275)

Before continuing, it can be verified that

	$\displaystyle 1-\gamma\left(1-(3+\sigma)\Delta\right)$	$\displaystyle=1-\gamma+\gamma(3+\sigma)\Delta\overset{\mathrm{(i)}}{\leq}1-\gamma+(3+\sigma)\min\left\{\frac{1}{4}(1-\gamma),\frac{1}{2(\sigma+1)}\right\}$
		$\displaystyle\leq\min\left\{2(1+\sigma)(1-\gamma),\,\frac{3}{2}\right\},$		(276)

where (i) is obtained by $\Delta\leq\min\left\{\frac{1}{4}(1-\gamma),\frac{1}{2(1+\sigma)}\right\}$ (see (226)). Applying the above fact to (275) gives

	$\displaystyle\big{\langle}\varphi,V^{\star,\sigma}_{\phi}-V^{\pi,\sigma}_{\phi}\big{\rangle}$	$\displaystyle\geq\frac{\gamma\frac{\sigma+1}{2}\Delta\big{(}1-\pi(\phi\,\|\,0)\big{)}}{\min\left\{1,\left(1-\gamma\left(1-(3+\sigma)\Delta\right)\right)^{2}\right\}}\overset{\mathrm{(i)}}{\geq}\frac{3(\sigma+1)\Delta\big{(}1-\pi(\phi\,\|\,0)\big{)}}{8\min\left\{4(1+\sigma)^{2}(1-\gamma)^{2},1\right\}}$
		$\displaystyle\geq\frac{\Delta\big{(}1-\pi(\phi\,\|\,0)\big{)}}{\min\left\{32(1+\sigma)(1-\gamma)^{2},\frac{8}{3(1+\sigma)}\right\}}=2\varepsilon\big{(}1-{\pi}(\phi\,\|\,0)\big{)},$		(277)

where (i) holds by $\gamma\geq\frac{3}{4}$ and (275), and the last equality holds by the specification in (236):

\displaystyle\Delta=\begin{cases}64(1+\sigma)(1-\gamma)^{2}\varepsilon\quad&\text{if }\sigma\in\left[\frac{1-\gamma}{4},\frac{1}{3(1-\gamma)}\right),\\ \frac{16}{3(1+\sigma)}\varepsilon\quad&\text{if }\sigma>\frac{1}{3(1-\gamma)}.\end{cases}

(278)

As a result, it is easily verified that the requirement in (227)

\displaystyle\Delta\leq\min\left\{\frac{1}{4}(1-\gamma),\frac{1}{2(1+\sigma)}\right\}

(279)

is met if we let

\displaystyle\varepsilon\leq\begin{cases}\frac{1}{256(1+\sigma)(1-\gamma)}\quad&\text{if }\sigma\in\left[\frac{1-\gamma}{4},\frac{1}{3(1-\gamma)}\right),\\ \frac{3}{32}\quad&\text{if }\sigma>\frac{1}{3(1-\gamma)},\end{cases}

(280)

as in (235).

The proof is then completed by summing up the results in the above two cases.

Appendix F Proof for the offline setting

F.1 Proof of the upper bounds: Corollary 1 and Corollary 3

As the proofs of Corollary 1 and Corollary 3 are similar, without loss of generality, we first focus on Corollary 1 in the case of TV distance.

To begin with, suppose we have access to in total $N_{\mathsf{b}}$ independent sample tuples $\{s_{i},a_{i},a_{i}^{\prime},r_{i}\}_{i=1}^{N_{\mathsf{b}}}$ from either the generative model or a historical dataset. We denote the number of samples generated based on the state-action pair $(s,a)$ as $N(s,a)$ , i.e,

\displaystyle\forall(s,a)\in{\mathcal{S}}\times\mathcal{A}:\quad N(s,a)=\sum\limits_{i=1}^{N_{\mathsf{b}}}\mathds{1}\big{\{}s_{i}=s,a_{i}=a\big{\}}.

(281)

Then according to (13), we can construct an empirical nominal transition for DRVI (Algorithm 1).

\displaystyle\forall(s,a)\in{\mathcal{S}}\times\mathcal{A}:\quad\widehat{P}^{0}(s^{\prime}\,|\,s,a)\coloneqq\frac{1}{N(s,a)}\sum\limits_{i=1}^{N(s,a)}\mathds{1}\big{\{}s_{i}=s,a_{i}=a,s_{i}^{\prime}=s^{\prime}\big{\}}.

(282)

Armed with the above estimate of nominal transition kernel, we introduce a slightly general version of Theorem 1, which follows directly from the same proof routine in Appendix 5.2.2.

Theorem 5 (Upper bound under TV distance).

Let the uncertainty set be $\mathcal{U}_{\rho}^{\sigma}(\cdot)=\mathcal{U}^{\sigma}_{\mathsf{TV}}(\cdot)$ , as specified by the TV distance (9). Consider any discount factor $\gamma\in\left[\frac{1}{4},1\right)$ , uncertainty level $\sigma\in(0,1)$ , and $\delta\in(0,1)$ . Based on the empirical nominal transition kernel in (282), let $\widehat{\pi}$ be the output policy of Algorithm 1 after $T=C_{1}\log\big{(}\frac{N_{\mathsf{b}}}{1-\gamma}\big{)}$ iterations. Then with probability at least $1-\delta$ , one has

\displaystyle\forall s\in{\mathcal{S}}:\quad V^{\star,\sigma}(s)-V^{\widehat{\pi},\sigma}(s)\leq\varepsilon

(283)

for any $\varepsilon\in\left(0,\sqrt{1/\max\{1-\gamma,\sigma\}}\right]$ , as long as

\displaystyle\forall(s,a)\in{\mathcal{S}}\times\mathcal{A}:\quad N(s,a)\geq\frac{C_{2}}{(1-\gamma)^{2}\max\{1-\gamma,\sigma\}\varepsilon^{2}}\log\left(\frac{SAN_{\mathsf{b}}}{(1-\gamma)\delta}\right).

(284)

Here, $C_{1},C_{2}>0$ are some large enough universal constants.

Furthermore, we invoke a fact derived from basic concentration inequalities (Li et al.,, 2024) as below.

Lemma 22.

Consider any $\delta\in(0,1)$ and a dataset with $N_{\mathsf{b}}$ independent samples satisfying Assumption 1. With probability at least $1-\delta$ , the quantities $\{N(s,a)\}$ obey

\displaystyle\max\Big{\{}N(s,a),\frac{2}{3}\log\frac{N_{\mathsf{b}}}{\delta}\Big{\}}

\displaystyle\geq\frac{N_{\mathsf{b}}\mu^{\mathsf{b}}(s,a)}{12}

(285)

simultaneously for all $(s,a)\in{\mathcal{S}}\times\mathcal{A}$ .

Now we are ready to verify Corollary 1. Armed with a historical dataset $\mathcal{D}^{\mathsf{b}}$ with $N_{\mathsf{b}}$ independent samples that obeys Assumption 1, one has with probability at least $1-\delta$ ,

\displaystyle\forall(s,a)\in{\mathcal{S}}\times\mathcal{A}:\quad N(s,a)\geq\frac{N_{\mathsf{b}}\mu^{\mathsf{b}}(s,a)}{12}\geq\frac{N_{\mathsf{b}}\mu_{\min}}{12}

(286)

as long as $N_{\mathsf{b}}\geq\frac{8\log\frac{N_{\mathsf{b}}}{\delta}}{\mu_{\min}}\geq\frac{8\log\frac{N_{\mathsf{b}}}{\delta}}{\mu^{\mathsf{b}}(s,a)}$ for all $(s,a)\in{\mathcal{S}}\times\mathcal{A}$ . Consequently, given $N_{\mathsf{b}}\geq\frac{8\log\frac{N_{\mathsf{b}}}{\delta}}{\mu_{\min}}$ , applying Theorem 5 with the fact $N(s,a)\geq\frac{N_{\mathsf{b}}\mu_{\min}}{12}$ for all $(s,a)\in{\mathcal{S}}\times\mathcal{A}$ (see (286)) directly leads to: DRVI can achieve an $\varepsilon$ -optimal policy as long as

\displaystyle N(s,a)\geq\frac{N_{\mathsf{b}}\mu_{\min}}{12}\geq\frac{C_{2}}{(1-\gamma)^{2}\max\{1-\gamma,\sigma\}\varepsilon^{2}}\log\left(\frac{SAN_{\mathsf{b}}}{(1-\gamma)\delta}\right),

(287)

namely

\displaystyle N_{\mathsf{b}}\geq\frac{C_{3}}{\mu_{\min}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}\varepsilon^{2}}\log\left(\frac{SAN_{\mathsf{b}}}{(1-\gamma)\delta}\right),

(288)

where $C_{3}$ is some large enough universal constant. Note that the above inequality directly implies $N_{\mathsf{b}}\geq\frac{8\log\frac{N_{\mathsf{b}}}{\delta}}{\mu_{\min}}$ . This complete the proof of Corollary 1. The same argument holds for Corollary 3.

F.2 Proof of the lower bounds: Corollary 2 and Corollary 4

Analogous to Appendix F.1, without loss of generality, we firstly focus on verifying Corollary 2, where we use the TV distance to measure the uncertainty set.

We stick to the two hard instances $\mathcal{M}_{0}$ and $\mathcal{M}_{1}$ (i.e., $\mathcal{M}_{\phi}$ with $\phi\in\{0,1\}$ ) constructed in the proof for Theorem 2 (Appendix 5.3.1). Recall that the state space is defined as ${\mathcal{S}}=\{0,1,2,\cdots,S-1\}$ , where the corresponding action space for any state $s\in\{2,3,\cdots,S-1\}$ is $\mathcal{A}=\{0,1,2,\cdots,A-1\}$ . For states $s=0$ or $s=1$ , the action space is only $\mathcal{A}^{\prime}=\{0,1\}$ . Hence, for a given factor $\mu_{\min}\in(0,\frac{1}{SA}]$ , we can construct a historical dataset $\mathcal{D}^{\mathsf{b}}$ with $N_{\mathsf{b}}$ samples such that the data coverage becomes the smallest over the state-action pairs $(0,0)$ and $(0,1)$ , i.e.,

\displaystyle\mu^{\mathsf{b}}(0,0)=\mu^{\mathsf{b}}(0,1)=\mu_{\min}\quad\text{and}\quad\mu^{\mathsf{b}}(s,a)=\frac{1-2\mu_{\min}}{(S-2)A+2},\quad\forall s\in\{1,2,\cdots,S\}.

(289)

Armed with the above hard instance and historical dataset, we follow the proof procedure in Appendix 5.3.2 to verify the corollary. Our goal is to distinguish between the two hypotheses $\phi\in\{0,1\}$ by considering the minimax probability of error as follows:

p_{\mathrm{e}}\coloneqq\inf_{\psi}\max\big{\{}\mathbb{P}_{0}(\psi\neq 0),\,\mathbb{P}_{1}(\psi\neq 1)\big{\}},

(290)

where the infimum is taken over all possible tests $\psi$ constructed from the samples in $\mathcal{D}^{\mathsf{b}}$ .

Recall that we denote $\mu_{\phi}$ (resp. $\mu_{\phi}(s)$ ) as the distribution of a sample tuple $(s_{i},a_{i},s_{i}^{\prime})$ under the nominal transition kernel $P^{\phi}$ associated with $\mathcal{M}_{\phi}$ and the samples are generated independently. Analogous to (85), one has

	$\displaystyle p_{\mathrm{e}}$	$\displaystyle\geq\frac{1}{4}\exp\Big{(}-N_{\mathsf{b}}\mathsf{KL}\big{(}\mu_{0}\parallel\mu_{1}\big{)}\Big{)}$
		$\displaystyle=\frac{1}{4}\exp\Big{\{}-N_{\mathsf{b}}\mu_{\min}\Big{(}\mathsf{KL}\big{(}P^{0}(\cdot\,\|\,0,0)\parallel P^{1}(\cdot\,\|\,0,0)\big{)}+\mathsf{KL}\big{(}P^{0}(\cdot\,\|\,0,1)\parallel P^{1}(\cdot\,\|\,0,1)\big{)}\Big{)}\Big{\}},$		(291)

where the last inequality holds by observing that

	$\displaystyle\mathsf{KL}\big{(}\mu_{0}\parallel\mu_{1}\big{)}$	$\displaystyle=\sum_{s,a,s^{\prime}}\mu^{\mathsf{b}}(s,a)\mathsf{KL}\big{(}P^{0}(s^{\prime}\,\|\,s,a)\parallel P^{1}(s^{\prime}\,\|\,s,a)\big{)}$
		$\displaystyle=\sum_{a\in\{0,1\}}\mu^{\mathsf{b}}(0,a)\mathsf{KL}\big{(}P^{0}(\cdot\,\|\,0,a)\parallel P^{1}(\cdot\,\|\,0,a)\big{)}=\mu_{\min}\sum_{a\in\{0,1\}}\mathsf{KL}\big{(}P^{0}(\cdot\,\|\,0,a)\parallel P^{1}(\cdot\,\|\,0,a)\big{)}.$		(292)

Here, the last line holds by the fact that $P^{0}(\cdot\,|\,s,a)$ and $P^{1}(\cdot\,|\,s,a)$ (associated with $\mathcal{M}_{0}$ and $\mathcal{M}_{1}$ ) only differ from each other in state-action pairs $(0,0)$ and $(0,1)$ , each has a visitation density of $\mu_{\min}$ . Consequently, following the same routine from (86) to the end of Appendix 5.3.2, we applying (87) and (88) with $N=N_{\mathsf{b}}\mu_{\min}$ and complete the proof by showing: if the sample size is selected as

\displaystyle N_{\mathsf{b}}\mu_{\min}=N\leq\frac{c_{1}\log 2}{8192(1-\gamma)^{2}\max\{1-\gamma,\sigma\}\varepsilon^{2}},

(293)

then one necessarily has

\displaystyle p_{e}=\inf_{\widehat{\pi}}\max\left\{\mathbb{P}_{0}\big{(}V^{\star,\sigma}(\varphi)-V^{\widehat{\pi},\sigma}(\varphi)>\varepsilon\big{)},\,\mathbb{P}_{1}\big{(}V^{\star,\sigma}(\varphi)-V^{\widehat{\pi},\sigma}(\varphi)>\varepsilon\big{)}\right\}\geq\frac{1}{8}.

(294)

We can follow the same argument to complete the proof of Corollary 4.

$\displaystyle\left\|\mathrm{Var}_{P}(V_{1})-\mathrm{Var}_{P}(V_{2})\right\|$	$\displaystyle=\left\|P(V_{1}\circ V_{1})-(PV_{1})\circ(PV_{1})-P(V_{2}\circ V_{2})+(PV_{2})\circ(PV_{2})\right\|$
	$\displaystyle\leq\left\|P\big{(}V_{1}\circ V_{1}-V_{2}\circ V_{2}\big{)}\right\|+\left\|(PV_{1}+PV_{2})P(V_{1}-V_{2})\right\|$
	$\displaystyle\leq 2\\|V_{1}+V_{2}\\|_{\infty}\\|V_{1}-V_{2}\\|_{\infty}\leq\frac{2x}{(1-\gamma)}.$	(42)

	$\displaystyle\big{\\|}\widehat{V}^{\pi,\sigma}-V^{\pi,\sigma}\big{\\|}_{\infty}$	$\displaystyle\leq\gamma\max\Big{\{}\Big{\\|}\left(I-\gamma\underline{P}^{\pi,V}\right)^{-1}\Big{(}\underline{\widehat{P}}^{\pi,\widehat{V}}\widehat{V}^{\pi,\sigma}-\underline{P}^{\pi,\widehat{V}}\widehat{V}^{\pi,\sigma}\Big{)}\Big{\\|}_{\infty},$
		$\displaystyle\qquad\Big{\\|}\left(I-\gamma\underline{P}^{\pi,\widehat{V}}\right)^{-1}\Big{(}\underline{\widehat{P}}^{\pi,\widehat{V}}\widehat{V}^{\pi,\sigma}-\underline{P}^{\pi,\widehat{V}}\widehat{V}^{\pi,\sigma}\Big{)}\Big{\\|}_{\infty}\Big{\}}.$		(54)

	$\displaystyle\big{\\|}\widehat{V}^{\pi,\sigma}-V^{\pi,\sigma}\big{\\|}_{\infty}$	$\displaystyle\leq\gamma\max\Big{\{}\Big{\\|}\left(I-\gamma\underline{\widehat{P}}^{\pi,V}\right)^{-1}\Big{(}\underline{\widehat{P}}^{\pi,V}V^{\pi,\sigma}-\underline{P}^{\pi,V}V^{\pi,\sigma}\Big{)}\Big{\\|}_{\infty},$
		$\displaystyle\qquad\Big{\\|}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi,\widehat{V}}\Big{)}^{-1}\Big{(}\underline{\widehat{P}}^{\pi,V}V^{\pi,\sigma}-\underline{P}^{\pi,V}V^{\pi,\sigma}\Big{)}\Big{\\|}_{\infty}\Big{\}}.$		(55)

	$\displaystyle\left(I-\gamma P_{\pi}\right)^{-1}\sqrt{\mathrm{Var}_{P_{\pi}}(V^{\pi,P})}\leq\sqrt{\frac{1}{1-\gamma}}\sqrt{\sum_{t=0}^{\infty}\gamma^{t}\left(P_{\pi}\right)^{t}\left(P_{\pi}\left(V^{\prime}\circ V^{\prime}\right)-\frac{1}{\gamma}V^{\prime}\circ V^{\prime}+\frac{2}{\gamma^{2}}\\|V^{\prime}\\|_{\infty}1\right)}$
	$\displaystyle\overset{\mathrm{(i)}}{\leq}\sqrt{\frac{1}{1-\gamma}}\sqrt{\bigg{\|}\sum_{t=0}^{\infty}\gamma^{t}\left(P_{\pi}\right)^{t}\left(P_{\pi}\left(V^{\prime}\circ V^{\prime}\right)-\frac{1}{\gamma}V^{\prime}\circ V^{\prime}\right)\bigg{\|}}+\sqrt{\frac{1}{1-\gamma}}\sqrt{\sum_{t=0}^{\infty}\gamma^{t}\left(P_{\pi}\right)^{t}\frac{2}{\gamma^{2}}\\|V^{\prime}\\|_{\infty}1}$
	$\displaystyle\leq\sqrt{\frac{1}{1-\gamma}}\sqrt{\bigg{\|}\bigg{(}\sum_{t=0}^{\infty}\gamma^{t}\left(P_{\pi}\right)^{t+1}-\sum_{t=0}^{\infty}\gamma^{t-1}\left(P_{\pi}\right)^{t}\bigg{)}\left(V^{\prime}\circ V^{\prime}\right)\bigg{\|}}+\sqrt{\frac{2\\|V^{\prime}\\|_{\infty}1}{\gamma^{2}(1-\gamma)^{2}}}$
	$\displaystyle\overset{\mathrm{(ii)}}{\leq}\sqrt{\frac{\\|V^{\prime}\\|_{\infty}^{2}1}{\gamma(1-\gamma)}}+\sqrt{\frac{2\\|V^{\prime}\\|_{\infty}1}{\gamma^{2}(1-\gamma)^{2}}}$
	$\displaystyle\leq\sqrt{\frac{8\\|V^{\prime}\\|_{\infty}1}{\gamma^{2}(1-\gamma)^{2}}},$		(112)

$\displaystyle\big{\|}\mathsf{Var}_{\widetilde{P}_{s,a}}(V^{\star,\sigma})-\mathsf{Var}_{P_{s,a}}(V^{\star,\sigma})\big{\|}$	$\displaystyle=\big{\|}\mathsf{Var}_{\widetilde{P}_{s,a}}(V^{\prime})-\mathsf{Var}_{P_{s,a}}(V^{\prime})\big{\|}$
	$\displaystyle\leq\big{\\|}\widetilde{P}_{s,a}-P_{s,a}\big{\\|}_{1}\big{\\|}V^{\prime}\big{\\|}_{\infty}^{2}$
	$\displaystyle\leq\frac{2\sigma}{\gamma^{2}(\max\{1-\gamma,\sigma\})^{2}}\leq\frac{2}{\gamma^{2}\max\{1-\gamma,\sigma\}}.$	(118)

The Curious Price of Distributional Robustness in Reinforcement Learning with a Generative Model

Abstract

1 Introduction

1.1 Prior art and open questions

1.2 Main contributions

Sample complexity of RMDPs under the TV distance.

Sample complexity of RMDPs under the χ2\chi^{2} divergence.

Technical novelty.

Extension: offline RL with uniform coverage.

Notation and paper organization.

2 Problem formulation

Standard MDPs.

Distributionally robust MDPs.

Optimal robust policy and robust Bellman operator.

Specification of the divergence ρ\rho.

Sampling mechanism: a generative model.

Goal.

3 Model-based algorithm: distributionally robust value iteration

Empirical nominal kernel.

DRVI: distributionally robust value iteration.

Lemma 1 (Strong duality for TV).

Lemma 2 (Strong duality for χ2\chi^{2}).

4 Theoretical guarantees: sample complexity analyses

4.1 The case of TV distance: RMDPs are easier to learn than standard MDPs

Theorem 1 (Upper bound under TV distance).

Remark 1.

Theorem 2 (Lower bound under TV distance).

Near minimax-optimal sample complexity.

RMDPs is easier than standard MDPs with TV distance.

Comparison with state-of-the-art bounds.

4.2 The case of χ2\chi^{2} divergence: RMDPs can be harder than standard MDPs

Theorem 3 (Upper bound under χ2\chi^{2} divergence).

Remark 2.

Theorem 4 (Lower bound under χ2\chi^{2} divergence).

Nearly tight sample complexity.

RMDPs can be much harder to learn than standard MDPs with χ2\chi^{2} divergence.

Comparison with state-of-the-art bounds.

5 Analysis: the TV case

5.1 Preliminaries of the analysis

5.1.1 Additional notations and basic facts

Matrix notation.

Kullback-Leibler (KL) divergence.

Variance.

Lemma 3.

5.1.2 Facts of the robust Bellman operator and the empirical robust MDP

γ\gamma-contraction of the robust Bellman operator.

Lemma 4 (γ\gamma-Contraction).

Bellman equations of the empirical robust MDP ℳ^𝗋𝗈𝖻\widehat{\mathcal{M}}_{\mathsf{rob}}.

Lemma 5.

5.2 Proof of the upper bound with TV distance: Theorem 1

5.2.1 Technical lemmas

Lemma 6.

Lemma 7.

5.2.2 Proof of Theorem 1

Step 1: decomposing the error.

Step 2: controlling ‖V^π⋆,σ−Vπ⋆,σ‖∞\big{\|}\widehat{V}^{\pi^{\star},\sigma}-V^{\pi^{\star},\sigma}\big{\|}_{\infty} and ‖V^π^,σ−Vπ^,σ‖∞\big{\|}\widehat{V}^{\widehat{\pi},\sigma}-V^{\widehat{\pi},\sigma}\big{\|}_{\infty} separately and summing up.

Lemma 8.

Lemma 9.

5.3 Proof of the lower bound with TV distance: Theorem 2

5.3.1 Construction of the hard problem instances

Construction of two hard MDPs.

Uncertainty set of the transition kernels.

Robust value functions and robust optimal policies.

Lemma 10.

5.3.2 Establishing the minimax lower bound

Step 1: converting the goal to estimate ϕ\phi.

Step 2: probability of error in testing two hypotheses.

Step 3: putting the results together.

6 Offline distributionally robust RL with uniform coverage

Offline/batch dataset.

Assumption 1.

6.1 The case of TV distance

Corollary 1 (Upper bound under TV distance).

Corollary 2 (Lower bound under TV distance).

Discussions.

6.2 The case of χ2\chi^{2} divergence

Corollary 3 (Upper bound under χ2\chi^{2} divergence).

Corollary 4 (Lower bound under χ2\chi^{2} divergence).

Discussions.

7 Other related works

The Curious Price of Distributional Robustness
in Reinforcement Learning with a Generative Model

Sample complexity of RMDPs under the $\chi^{2}$ divergence.

Specification of the divergence $\rho$ .

Lemma 2 (Strong duality for $\chi^{2}$ ).

4.2 The case of $\chi^{2}$ divergence: RMDPs can be harder than standard MDPs

Theorem 3 (Upper bound under $\chi^{2}$ divergence).

Theorem 4 (Lower bound under $\chi^{2}$ divergence).

RMDPs can be much harder to learn than standard MDPs with $\chi^{2}$ divergence.

$\gamma$ -contraction of the robust Bellman operator.

Lemma 4 ( $\gamma$ -Contraction).

Bellman equations of the empirical robust MDP $\widehat{\mathcal{M}}_{\mathsf{rob}}$ .

Step 2: controlling $\big{\|}\widehat{V}^{\pi^{\star},\sigma}-V^{\pi^{\star},\sigma}\big{\|}_{\infty}$ and $\big{\|}\widehat{V}^{\widehat{\pi},\sigma}-V^{\widehat{\pi},\sigma}\big{\|}_{\infty}$ separately and summing up.

Step 1: converting the goal to estimate $\phi$ .

6.2 The case of $\chi^{2}$ divergence

Corollary 3 (Upper bound under $\chi^{2}$ divergence).

Corollary 4 (Lower bound under $\chi^{2}$ divergence).

Step 1: controlling $\|\widehat{V}^{\pi^{\star},\sigma}-V^{\pi^{\star},\sigma}\|_{\infty}$ : bounding the first term in (56).

Step 2: controlling $\|\widehat{V}^{\pi^{\star},\sigma}-V^{\pi^{\star},\sigma}\|_{\infty}$ : bounding the second term in (56).

Step 1: controlling $\|\widehat{V}^{\widehat{\pi},\sigma}-V^{\widehat{\pi},\sigma}\|_{\infty}$ : bounding the first term in (57).

Step 2: controlling $\|\widehat{V}^{\widehat{\pi},\sigma}-V^{\widehat{\pi},\sigma}\|_{\infty}$ : bounding the second term in (57).

Step 2: fixed-point equivalence between $\widehat{\mathcal{M}}_{\mathsf{rob}}$ and the auxiliary RMDP $\widehat{\mathcal{M}}^{s,u}_{\mathsf{rob}}$ .

Step 3: building an $\varepsilon$ -net for all reward values $u$ .

Appendix D Proof of the upper bound with $\chi^{2}$ divergence: Theorem 3

Step 1: controlling $\big{\|}\widehat{V}^{\pi^{\star},\sigma}-V^{\pi^{\star},\sigma}\big{\|}_{\infty}$ .

Step 2: controlling $\left\|\widehat{V}^{\widehat{\pi},\sigma}-V^{\widehat{\pi},\sigma}\right\|_{\infty}$ .

Appendix E Proof of the lower bound with $\chi^{2}$ divergence: Theorem 4

Step 1: converting the goal to estimate $\phi$ .