

The Curious Price of Distributional Robustness
in Reinforcement Learning with a Generative Model

Laixi Shi
Caltech
Department of Computing and Mathematical Sciences, California Institute of Technology, CA 91125, USA. Part of L. Shi’s work was completed when she was at CMU.
   Gen Li
CUHK
Department of Statistics, The Chinese University of Hong Kong, Hong Kong.
   Yuting Wei
UPenn
Department of Statistics and Data Science, Wharton School, University of Pennsylvania, Philadelphia, PA 19104, USA.
   Yuxin Chen
UPenn
   Matthieu Geist
Cohere
Cohere.
   Yuejie Chi
CMU
Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA.
(May 2023; Revised )
Abstract

This paper investigates model robustness in reinforcement learning (RL) to reduce the sim-to-real gap in practice. We adopt the framework of distributionally robust Markov decision processes (RMDPs), aimed at learning a policy that optimizes the worst-case performance when the deployed environment falls within a prescribed uncertainty set around the nominal MDP. Despite recent efforts, the sample complexity of RMDPs remained mostly unsettled regardless of the uncertainty set in use. It was unclear if distributional robustness bears any statistical consequences when benchmarked against standard RL. Assuming access to a generative model that draws samples based on the nominal MDP, we characterize the sample complexity of RMDPs when the uncertainty set is specified via either the total variation (TV) distance or χ2\chi^{2} divergence. The algorithm studied here is a model-based method called distributionally robust value iteration, which is shown to be near-optimal for the full range of uncertainty levels. Somewhat surprisingly, our results uncover that RMDPs are not necessarily easier or harder to learn than standard MDPs. The statistical consequence incurred by the robustness requirement depends heavily on the size and shape of the uncertainty set: in the case w.r.t. the TV distance, the minimax sample complexity of RMDPs is always smaller than that of standard MDPs; in the case w.r.t. the χ2\chi^{2} divergence, the sample complexity of RMDPs can often far exceed the standard MDP counterpart.

Keywords: distributionally robust RL, robust Markov decision processes, sample complexity, distributionally robust value iteration, model-based RL

1 Introduction

Reinforcement learning (RL) strives to learn desirable sequential decisions based on trial-and-error interactions with an unknown environment. As a fast-growing subfield of artificial intelligence, it has achieved remarkable success in a variety of applications, such as networked systems (Qu et al.,, 2022), trading (Park and Van Roy,, 2015), operations research (de Castro Silva et al.,, 2003; Pan et al.,, 2023; Zhao et al.,, 2021), large language model alignment (OpenAI,, 2023; Ziegler et al.,, 2019), healthcare (Liu et al.,, 2019; Fatemi et al.,, 2021), robotics and control (Kober et al.,, 2013; Mnih et al.,, 2013). Due to the unprecedented dimensionality of the state-action space, the issue of data efficiency inevitably lies at the core of modern RL practice. A large portion of recent efforts in RL has been directed towards designing sample-efficient algorithms and understanding the fundamental statistical bottleneck for a diverse range of RL scenarios.

While standard RL has been heavily investigated recently, its use can be significantly hampered in practice due to the sim-to-real gap or uncertainty (Bertsimas et al.,, 2019); for instance, a policy learned in an ideal, nominal environment might fail catastrophically when the deployed environment is subject to small changes in task objectives or adversarial perturbations (Zhang et al., 2020a, ; Klopp et al.,, 2017; Mahmood et al.,, 2018). Consequently, in addition to maximizing the long-term cumulative reward, robustness emerges as another critical goal for RL, especially in high-stakes applications such as robotics, autonomous driving, clinical trials, financial investments, and so on. Towards achieving this, distributionally robust RL (Iyengar,, 2005; Nilim and El Ghaoui,, 2005; Xu and Mannor,, 2012; Bäuerle and Glauner,, 2022; Cai et al.,, 2016), which leverages insights from distributionally robust optimization and supervised learning (Rahimian and Mehrotra,, 2019; Gao,, 2020; Bertsimas et al.,, 2018; Duchi and Namkoong,, 2018; Blanchet and Murthy,, 2019; Chen et al.,, 2019; Lam,, 2019), becomes a natural yet versatile framework; the aim is to learn a policy that performs well even when the deployed environment deviates from the nominal one in the face of environment uncertainty.

In this paper, we pursue fundamental understanding about whether, and how, the choice of distributional robustness bears statistical implications in learning a desirable policy, through the lens of sample complexity. More concretely, imagine that one has access to a generative model (also called a simulator) that draws samples from a Markov decision process (MDP) with a nominal transition kernel (Kearns and Singh, 1999). Standard RL aims to learn the optimal policy tailored to the nominal kernel, for which the minimax sample complexity limit has been fully settled (Azar et al., 2013b; Li et al., 2023b). In contrast, distributionally robust RL seeks to learn a more robust policy using the same set of samples, with the aim of optimizing the worst-case performance when the transition kernel is arbitrarily chosen from some prescribed uncertainty set around the nominal kernel; this setting is frequently referred to as robust MDPs (RMDPs). (While it is straightforward to incorporate additional uncertainty of the reward in our framework, we do not consider it here for simplicity, since the key challenge is to deal with the uncertainty of the transition kernel.) Clearly, the RMDP framework helps ensure that the performance of the learned policy does not fail catastrophically as long as the sim-to-real gap is not overly large. It is then natural to wonder how the robustness consideration impacts data efficiency: is there a statistical premium that one needs to pay in quest of additional robustness?

Compared with standard MDPs, the class of RMDPs encapsulates richer models, given that one is allowed to prescribe the shape and size of the uncertainty set. Oftentimes, the uncertainty set is hand-picked as a small ball surrounding the nominal kernel, with the size and shape of the ball specified by some distance-like metric ρ\rho between probability distributions and some uncertainty level σ\sigma. To ensure tractability of solving RMDPs, the uncertainty set is often selected to obey certain structures. For instance, a number of prior works assumed that the uncertainty set can be decomposed as a product of independent uncertainty subsets over each state or state-action pair (Zhou et al.,, 2021; Wiesemann et al.,, 2013), dubbed as the ss- and (s,a)(s,a)-rectangularity, respectively. The current paper adopts the second choice by assuming (s,a)(s,a)-rectangularity for the uncertainty set. An additional challenge with RMDPs arises from distribution shift, where the transition kernel drawn from the uncertainty set can be different from the nominal kernel. This challenge leads to complicated nonlinearity and nested optimization in the problem structure not present in standard MDPs.

1.1 Prior art and open questions

Upper bounds:
  Yang et al. (2022): \frac{S^{2}A}{\sigma^{2}(1-\gamma)^{4}\varepsilon^{2}} (same bound over both regimes of σ)
  Panaganti and Kalathil (2022): \frac{S^{2}A}{(1-\gamma)^{4}\varepsilon^{2}} (same bound over both regimes of σ)
  This paper: \frac{SA}{(1-\gamma)^{3}\varepsilon^{2}} for 0 < σ ≲ 1−γ; \frac{SA}{(1-\gamma)^{2}\sigma\varepsilon^{2}} for 1−γ ≲ σ < 1
Lower bounds:
  Yang et al. (2022): \frac{SA}{(1-\gamma)^{3}\varepsilon^{2}} for 0 < σ ≲ 1−γ; \frac{SA(1-\gamma)}{\sigma^{4}\varepsilon^{2}} for 1−γ ≲ σ < 1
  This paper: \frac{SA}{(1-\gamma)^{3}\varepsilon^{2}} for 0 < σ ≲ 1−γ; \frac{SA}{(1-\gamma)^{2}\sigma\varepsilon^{2}} for 1−γ ≲ σ < 1
Table 1: Comparisons between our results and prior art for finding an ε-optimal robust policy in infinite-horizon RMDPs with a generative model, where the uncertainty set is measured w.r.t. the TV distance. Here, S, A, γ, and σ ∈ (0,1) are the state space size, the action space size, the discount factor, and the uncertainty level, respectively, and all logarithmic factors are omitted in the table. Our results provide the first matching upper and lower bounds (up to log factors), improving upon all prior results.
Upper bounds:
  Panaganti and Kalathil (2022): \frac{S^{2}A(1+\sigma)}{(1-\gamma)^{4}\varepsilon^{2}} (same bound over all three regimes of σ)
  Yang et al. (2022): \frac{S^{2}A(1+\sigma)^{2}}{(\sqrt{1+\sigma}-1)^{2}(1-\gamma)^{4}\varepsilon^{2}} (same bound over all three regimes of σ)
  This paper: \frac{SA(1+\sigma)}{(1-\gamma)^{4}\varepsilon^{2}} (same bound over all three regimes of σ)
Lower bounds:
  Yang et al. (2022): \frac{SA}{(1-\gamma)^{3}\varepsilon^{2}} for 0 < σ ≲ 1−γ; \frac{SA}{(1-\gamma)^{2}\sigma\varepsilon^{2}} for σ ≳ 1−γ
  This paper: \frac{SA}{(1-\gamma)^{3}\varepsilon^{2}} for 0 < σ ≲ 1−γ; \frac{SA\sigma}{(1-\gamma)^{4}(1+\sigma)^{4}\varepsilon^{2}} for 1−γ ≲ σ ≲ \frac{1}{1-\gamma}; \frac{SA\sigma}{\varepsilon^{2}} for σ ≳ \frac{1}{1-\gamma}
Table 2: Comparisons between our results and prior art on finding an ε-optimal robust policy in infinite-horizon RMDPs with a generative model, where the uncertainty set is measured w.r.t. the χ² divergence. Here, S, A, γ, and σ ∈ (0,∞) are the state space size, the action space size, the discount factor, and the uncertainty level, respectively, and all logarithmic factors are omitted in the table. Improving upon all prior results, our theory is tight (up to log factors) when σ ≍ 1, and otherwise loose by no more than a polynomial factor in 1/(1−γ).
[Figure 1 images omitted in this conversion; panels: (a) TV distance, (b) χ² divergence.]
Figure 1: Illustrations of the obtained sample complexity upper and lower bounds for learning RMDPs, with comparisons to the state of the art and to the sample complexity of standard MDPs, where the uncertainty set is specified using the TV distance (a) and the χ² divergence (b).

In this paper, we focus attention on RMDPs in the γ-discounted infinite-horizon setting, assuming access to a generative model. The uncertainty set considered herein is specified using one of two f-divergence metrics: the total variation (TV) distance and the χ² divergence. These two choices are motivated by their practical appeal: they are easy to implement and have already been adopted in empirical RL (Lee et al., 2021; Pan et al., 2023).

A popular learning approach is model-based, which first estimates the nominal transition kernel using a plug-in estimator based on the collected samples, and then runs a planning algorithm (e.g., a robust variant of value iteration) on top of the estimated kernel. Despite the surge of recent activities, however, existing statistical guarantees for the above paradigm remained highly inadequate, as we shall elaborate on momentarily (see Table 1 and Table 2 respectively for a summary of existing results). For concreteness, let SS be the size of the state space, AA the size of the action space, γ\gamma the discount factor (so that the effective horizon is 11γ\frac{1}{1-\gamma}), and σ\sigma the uncertainty level. We are interested in how the sample complexity — the number of samples needed for an algorithm to output a policy whose robust value function (the worst-case value over all the transition kernels in the uncertainty set) is at most ε\varepsilon away from the optimal robust one — scales with all these salient problem parameters.

  • Large gaps between existing upper and lower bounds. There remained large gaps between the sample complexity upper and lower bounds established in the prior literature, regardless of the divergence metric in use. Specifically, for the cases using either the TV distance or the χ² divergence, the state-of-the-art upper bounds (Panaganti and Kalathil, 2022) scale quadratically with the size S of the state space, while the lower bound (Yang et al., 2022) exhibits only linear scaling with S. Moreover, in the χ² divergence case, the state-of-the-art upper bound grows linearly with the uncertainty level σ when σ ≳ 1, while the lower bound (Yang et al., 2022) is inversely proportional to σ. These lead to unbounded gaps between the upper and lower bounds as σ grows. Can we hope to close these gaps for RMDPs? (Notation: letting 𝒳 ≔ (S, A, 1/(1−γ), σ, 1/ε, 1/δ), the notation f(𝒳) = O(g(𝒳)) or f(𝒳) ≲ g(𝒳) indicates that there exists a universal constant C_1 > 0 such that f ≤ C_1 g; the notation f(𝒳) ≳ g(𝒳) indicates that g(𝒳) = O(f(𝒳)); and the notation f(𝒳) ≍ g(𝒳) indicates that f(𝒳) ≲ g(𝒳) and f(𝒳) ≳ g(𝒳) hold simultaneously. Additionally, the notation Õ(⋅) is defined in the same way as O(⋅) except that it hides logarithmic factors.)

  • Benchmarking with standard MDPs. Perhaps a more pressing issue is that past works failed to provide an answer as to how the sample complexity of RMDPs compares with that of standard MDPs, regardless of the chosen shape (determined by ρ) or size (determined by σ) of the uncertainty set, given the large unresolved gaps mentioned above. Specifically, existing sample complexity upper (resp. lower) bounds are all larger (resp. smaller) than the sample size requirement for standard MDPs. As a consequence, it remains mostly unclear whether learning RMDPs is harder or easier than learning standard MDPs.

1.2 Main contributions

To address the aforementioned questions, this paper develops strengthened sample complexity upper bounds on learning RMDPs with the TV distance and χ2\chi^{2} divergence in the infinite-horizon setting, using a model-based approach called distributionally robust value iteration (DRVI). Improved minimax lower bounds are also developed to help gauge the tightness of our upper bounds and enable benchmarking with standard MDPs. The novel analysis framework developed herein leads to new insights into the interplay between the geometry of uncertainty sets and statistical hardness.

Sample complexity of RMDPs under the TV distance.

We summarize our results and compare them with past works in Table 1; see Figure 1(a) for a graphical illustration.

  • Minimax-optimal sample complexity. We prove that DRVI reaches ε\varepsilon accuracy as soon as the sample complexity is on the order of

    O~(SA(1γ)2ε2min{11γ,1σ})\widetilde{O}\left(\frac{SA}{(1-\gamma)^{2}\varepsilon^{2}}\min\left\{\frac{1}{1-\gamma},\frac{1}{\sigma}\right\}\right)

    for all σ ∈ (0,1), assuming that ε is small enough. In addition, a matching minimax lower bound (modulo some logarithmic factor) is established to confirm the tightness of the upper bound. To the best of our knowledge, this is the first minimax-optimal sample complexity result for RMDPs; no such result was previously available for any divergence metric, and it covers the full range of the uncertainty level.

  • RMDPs are easier to learn than standard MDPs under the TV distance. Given the sample complexity O~(SA(1γ)3ε2)\widetilde{O}\left(\frac{SA}{(1-\gamma)^{3}\varepsilon^{2}}\right) of standard MDPs (Li et al., 2023b, ), it can be seen that learning RMDPs under the TV distance is never harder than learning standard MDPs; more concretely, the sample complexity for RMDPs matches that of standard MDPs when σ1γ\sigma\lesssim 1-\gamma, and becomes smaller by a factor of σ/(1γ)\sigma/(1-\gamma) when 1γσ<11-\gamma\lesssim\sigma<1. Therefore, in this case, distributional robustness comes almost for free, given that we do not need to collect more samples.

Sample complexity of RMDPs under the χ2\chi^{2} divergence.

We summarize our results and provide comparisons with prior works in Table 2; see Figure 1(b) for an illustration.

  • Near-optimal sample complexity. We demonstrate that DRVI yields ε\varepsilon accuracy as soon as the sample complexity is on the order of

    O~(SA(1+σ)(1γ)4ε2)\widetilde{O}\left(\frac{SA(1+\sigma)}{(1-\gamma)^{4}\varepsilon^{2}}\right)

    for all σ(0,)\sigma\in(0,\infty), which is the first sample complexity in this setting that scales linearly in the size SS of the state space; in other words, our theory breaks the quadratic scaling bottleneck that was present in prior works (Panaganti and Kalathil,, 2022; Yang et al.,, 2022). We have also developed a strengthened lower bound that is optimized by leveraging the geometry of the uncertainty set under different ranges of σ\sigma. Our theory is tight when σ1\sigma\asymp 1, and is otherwise loose by at most a polynomial factor of the effective horizon 1/(1γ)1/(1-\gamma) (regardless of the uncertainty level σ\sigma). This significantly improves upon prior results (as there exists an unbounded gap between prior upper and lower bounds as σ\sigma\rightarrow\infty).

  • RMDPs can be harder to learn than standard MDPs under the χ2\chi^{2} divergence. Somewhat surprisingly, our improved lower bound suggests that RMDPs in this case can be much harder to learn than standard MDPs, at least for a certain range of uncertainty levels. We single out two regimes of particular interest. Firstly, when σ1\sigma\asymp 1, the sample size requirement of RMDPs is on the order of SA(1γ)4ε2\frac{SA}{(1-\gamma)^{4}\varepsilon^{2}} (up to log factor), which is provably larger than the one for standard MDPs by a factor of 11γ\frac{1}{1-\gamma}. Secondly, the lower bound continues to increase as σ\sigma grows and exceeds the sample complexity of standard MDPs when σ1(1γ)3\sigma\gtrsim\frac{1}{(1-\gamma)^{3}}.

In sum, our sample complexity bounds not only strengthen the prior art in the development of both upper and lower bounds, but also unveil that the additional robustness consideration might affect the sample complexity in a somewhat surprising manner. As it turns out, RMDPs are not necessarily easier or harder to learn than standard MDPs; the conclusion is far more nuanced and depends heavily on both the size and shape of the uncertainty set. This constitutes a curious phenomenon that has not been elucidated in prior analyses.

Technical novelty.

Our upper bound analyses require careful treatments of the impact of the uncertainty set upon the value functions, and decouple the statistical dependency across the iterates of the robust value iteration using tailored leave-one-out arguments (Agarwal et al.,, 2020; Li et al., 2022b, ) that have not been introduced to the RMDP setting previously. Turning to the lower bound, we develop new hard instances that differ from those for standard MDPs (Azar et al., 2013a, ; Li et al.,, 2024). These new instances draw inspiration from the asymmetric structure of RMDPs induced by the additional infimum operator in the robust value function. In addition, we construct a series of hard instances depending on the uncertainty level σ\sigma to establish the tight lower bound as σ\sigma varies.

Extension: offline RL with uniform coverage.

Last but not least, we extend our analysis framework to accommodate a widely studied offline setting with uniform data coverage (Zhou et al., 2021; Yang et al., 2022) in Section 6. In particular, given a historical dataset with minimal coverage probability μ_min over the state-action space (see Assumption 1), we provide sample complexity results for both the TV distance and the χ² divergence cases, where in effect the dependency on the size SA of the state-action space is replaced by 1/μ_min. The sample complexity upper bounds significantly improve upon the prior art (Yang et al., 2022) by a factor of S/(1−γ)² (resp. S(1+σ)) when the uncertainty set is measured by the TV distance (resp. the χ² divergence).

Notation and paper organization.

Throughout this paper, we denote by Δ(𝒮)\Delta({\mathcal{S}}) the probability simplex over a set 𝒮{\mathcal{S}} and x=[x(s,a)](s,a)𝒮×𝒜SAx=\big{[}x(s,a)\big{]}_{(s,a)\in{\mathcal{S}}\times\mathcal{A}}\in\mathbb{R}^{SA} (resp. x=[x(s)]s𝒮Sx=\big{[}x(s)\big{]}_{s\in{\mathcal{S}}}\in\mathbb{R}^{S}) as any vector that constitutes certain values for each state-action pair (resp. state). In addition, we denote by xy=[x(s)y(s)]s𝒮x\circ y=\big{[}x(s)\cdot y(s)\big{]}_{s\in{\mathcal{S}}} the Hadamard product of any two vectors x,ySx,y\in\mathbb{R}^{S}.

The remainder of this paper is structured as follows. Section 2 presents the background about discounted infinite-horizon standard MDPs and formulates distributionally robust MDPs. In Section 3, a model-based approach is introduced, tailored to both the TV distance and the χ2\chi^{2} divergence. Both upper and lower bounds on the sample complexity are developed in Section 4, covering both divergence metrics. Section 5 provides an outline of our analysis. Section 6 further extends the findings to the offline RL setting with uniform data coverage. We then summarize several additional related works in Section 7 and conclude the main paper with further discussions in Section 8. The proof details are deferred to the appendix.

2 Problem formulation

In this section, we formulate distributionally robust Markov decision processes (RMDPs) in the discounted infinite-horizon setting, introduce the sampling mechanism, and describe our goal.

Standard MDPs.

To begin, we first introduce the standard Markov decision processes (MDPs), which facilitate the understanding of RMDPs. A discounted infinite-horizon MDP is represented by =(𝒮,𝒜,γ,P,r)\mathcal{M}=\big{(}\mathcal{S},\mathcal{A},\gamma,P,r\big{)}, where 𝒮={1,,S}\mathcal{S}=\{1,\cdots,S\} and 𝒜={1,,A}\mathcal{A}=\{1,\cdots,A\} are the finite state and action spaces, respectively, γ[0,1)\gamma\in[0,1) is the discount factor, P:𝒮×𝒜Δ(𝒮)P:{\mathcal{S}}\times\mathcal{A}\rightarrow\Delta({\mathcal{S}}) denotes the probability transition kernel, and r:𝒮×𝒜[0,1]r:{\mathcal{S}}\times\mathcal{A}\rightarrow[0,1] is the immediate reward function, which is assumed to be deterministic. A policy is denoted by π:𝒮Δ(𝒜)\pi:\mathcal{S}\rightarrow\Delta(\mathcal{A}), which specifies the action selection probability over the action space in any state. When the policy is deterministic, we overload the notation and refer to π(s)\pi(s) as the action selected by policy π\pi in state ss.

s𝒮:Vπ,P(s)\displaystyle\forall s\in{\mathcal{S}}:\qquad V^{\pi,P}(s) 𝔼π,P[t=0γtr(st,at)|s0=s],\displaystyle\coloneqq\mathbb{E}_{\pi,P}\left[\sum_{t=0}^{\infty}\gamma^{t}r\big{(}s_{t},a_{t}\big{)}\,\Big{|}\,s_{0}=s\right], (1)

where the expectation is taken over the randomness of the trajectory {st,at}t=0\{s_{t},a_{t}\}_{t=0}^{\infty} generated by executing policy π\pi under the transition kernel PP, namely, atπ(|st)a_{t}\sim\pi(\cdot\,|\,s_{t}) and st+1P(|st,at)s_{t+1}\sim P(\cdot\,|\,s_{t},a_{t}) for all t0t\geq 0. Similarly, the Q-function Qπ,PQ^{\pi,P} associated with any policy π\pi under the transition kernel PP is defined as

(s,a)𝒮×𝒜:Qπ,P(s,a)\displaystyle\forall(s,a)\in{\mathcal{S}}\times\mathcal{A}:\qquad Q^{\pi,P}(s,a) 𝔼π,P[t=0γtr(st,at)|s0=s,a0=a],\displaystyle\coloneqq\mathbb{E}_{\pi,P}\left[\sum_{t=0}^{\infty}\gamma^{t}r\big{(}s_{t},a_{t}\big{)}\,\Big{|}\,s_{0}=s,a_{0}=a\right], (2)

where the expectation is again taken over the randomness of the trajectory under policy π\pi.
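To make definition (1) concrete, below is a minimal Python sketch (our own illustration, not part of the paper's development) that estimates V^{π,P}(s₀) by truncated Monte Carlo rollouts under a tabular kernel; the array shapes, the function name, and the truncation rule are illustrative assumptions.

```python
import numpy as np

def monte_carlo_value(P, r, pi, gamma, s0, n_traj=2000, seed=0):
    """Monte Carlo estimate of V^{pi,P}(s0) from (1), via truncated rollouts.

    P: (S, A, S) transition kernel; r: (S, A) rewards in [0, 1];
    pi: (S, A) action-selection probabilities (each row sums to one).
    Rollouts are truncated at a horizon H chosen so that the truncation bias
    gamma^H / (1 - gamma) is at most 1e-3.
    """
    rng = np.random.default_rng(seed)
    S, A, _ = P.shape
    H = int(np.ceil(np.log(1e-3 * (1 - gamma)) / np.log(gamma)))
    total = 0.0
    for _ in range(n_traj):
        s, discount, ret = s0, 1.0, 0.0
        for _ in range(H):
            a = rng.choice(A, p=pi[s])      # a_t ~ pi(. | s_t)
            ret += discount * r[s, a]
            discount *= gamma
            s = rng.choice(S, p=P[s, a])    # s_{t+1} ~ P(. | s_t, a_t)
        total += ret
    return total / n_traj
```

The Q-function in (2) can be estimated in the same way by forcing the first action to be a fixed a₀ before following π.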

Distributionally robust MDPs.

We now introduce the distributionally robust MDP (RMDP) tailored to the discounted infinite-horizon setting, denoted by 𝗋𝗈𝖻={𝒮,𝒜,γ,𝒰ρσ(P0),r}\mathcal{M}_{\mathsf{rob}}=\{{\mathcal{S}},\mathcal{A},\gamma,\mathcal{U}_{\rho}^{\sigma}(P^{0}),r\}, where 𝒮,𝒜,γ,r{\mathcal{S}},\mathcal{A},\gamma,r are identical to those in the standard MDP. A key distinction from the standard MDP is that: rather than assuming a fixed transition kernel PP, it allows the transition kernel to be chosen arbitrarily from a prescribed uncertainty set 𝒰ρσ(P0)\mathcal{U}_{\rho}^{\sigma}(P^{0}) centered around a nominal kernel P0:𝒮×𝒜Δ(𝒮)P^{0}:{\mathcal{S}}\times\mathcal{A}\rightarrow\Delta({\mathcal{S}}), where the uncertainty set is specified using some distance metric ρ\rho of radius σ>0\sigma>0. In particular, given the nominal transition kernel P0P^{0} and some uncertainty level σ\sigma, the uncertainty set—with the divergence metric ρ:Δ(𝒮)×Δ(𝒮)+\rho:\Delta({\mathcal{S}})\times\Delta({\mathcal{S}})\rightarrow\mathbb{R}^{+}—is specified as

𝒰ρσ(P0)\displaystyle\mathcal{U}_{\rho}^{\sigma}(P^{0}) 𝒰ρσ(Ps,a0)with𝒰ρσ(Ps,a0){Ps,aΔ(𝒮):ρ(Ps,a,Ps,a0)σ},\displaystyle\coloneqq\otimes\;\mathcal{U}_{\rho}^{\sigma}(P^{0}_{s,a})\qquad\text{with}\quad\mathcal{U}_{\rho}^{\sigma}(P^{0}_{s,a})\coloneqq\left\{P_{s,a}\in\Delta({\mathcal{S}}):\rho\left(P_{s,a},P^{0}_{s,a}\right)\leq\sigma\right\}, (3)

where we denote a vector of the transition kernel PP or P0P^{0} at state-action pair (s,a)(s,a) respectively as

Ps,aP(|s,a)1×S,Ps,a0P0(|s,a)1×S.\displaystyle P_{s,a}\coloneqq P(\cdot\,|\,s,a)\in\mathbb{R}^{1\times S},\qquad P_{s,a}^{0}\coloneqq P^{0}(\cdot\,|\,s,a)\in\mathbb{R}^{1\times S}. (4)

In other words, the uncertainty is imposed in a decoupled manner for each state-action pair, obeying the so-called (s,a)(s,a)-rectangularity (Zhou et al.,, 2021; Wiesemann et al.,, 2013).

In RMDPs, we are interested in the worst-case performance of a policy π\pi over all the possible transition kernels in the uncertainty set. This is measured by the robust value function Vπ,σV^{\pi,\sigma} and the robust Q-function Qπ,σQ^{\pi,\sigma} in 𝗋𝗈𝖻\mathcal{M}_{\mathsf{rob}}, defined respectively as

(s,a)𝒮×𝒜:Vπ,σ(s)\displaystyle\forall(s,a)\in{\mathcal{S}}\times\mathcal{A}:\quad V^{\pi,\sigma}(s) infP𝒰ρσ(P0)Vπ,P(s),Qπ,σ(s,a)infP𝒰ρσ(P0)Qπ,P(s,a).\displaystyle\coloneqq\inf_{P\in\mathcal{U}_{\rho}^{\sigma}(P^{0})}V^{\pi,P}(s),\qquad Q^{\pi,\sigma}(s,a)\coloneqq\inf_{P\in\mathcal{U}_{\rho}^{\sigma}(P^{0})}Q^{\pi,P}(s,a). (5)
Optimal robust policy and robust Bellman operator.

As a generalization of properties of standard MDPs, it is well-known that there exists at least one deterministic policy that maximizes the robust value function (resp. robust Q-function) simultaneously for all states (resp. state-action pairs) (Iyengar,, 2005; Nilim and El Ghaoui,, 2005). Therefore, we denote the optimal robust value function (resp. optimal robust Q-function) as V,σV^{\star,\sigma} (resp. Q,σQ^{\star,\sigma}), and the optimal robust policy as π\pi^{\star}, which satisfy

s𝒮:\displaystyle\forall s\in{\mathcal{S}}:\quad V,σ(s)Vπ,σ(s)=maxπVπ,σ(s),\displaystyle V^{\star,\sigma}(s)\coloneqq V^{\pi^{\star},\sigma}(s)=\max_{\pi}V^{\pi,\sigma}(s), (6a)
(s,a)𝒮×𝒜:\displaystyle\forall(s,a)\in{\mathcal{S}}\times\mathcal{A}:\quad Q,σ(s,a)Qπ,σ(s,a)=maxπQπ,σ(s,a).\displaystyle Q^{\star,\sigma}(s,a)\coloneqq Q^{\pi^{\star},\sigma}(s,a)=\max_{\pi}Q^{\pi,\sigma}(s,a). (6b)

A key machinery in RMDPs is a generalization of Bellman’s optimality principle, encapsulated in the following robust Bellman consistency equation (resp. robust Bellman optimality equation):

(s,a)𝒮×𝒜:\displaystyle\forall(s,a)\in{\mathcal{S}}\times\mathcal{A}:\quad Qπ,σ(s,a)=r(s,a)+γinf𝒫𝒰ρσ(Ps,a0)𝒫Vπ,σ,\displaystyle Q^{\pi,\sigma}(s,a)=r(s,a)+\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}_{\rho}(P^{0}_{s,a})}\mathcal{P}V^{\pi,\sigma}, (7a)
(s,a)𝒮×𝒜:\displaystyle\forall(s,a)\in{\mathcal{S}}\times\mathcal{A}:\quad Q,σ(s,a)=r(s,a)+γinf𝒫𝒰ρσ(Ps,a0)𝒫V,σ.\displaystyle Q^{\star,\sigma}(s,a)=r(s,a)+\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}_{\rho}(P^{0}_{s,a})}\mathcal{P}V^{\star,\sigma}. (7b)

The robust Bellman operator (Iyengar,, 2005; Nilim and El Ghaoui,, 2005) is denoted by 𝒯σ():SASA{\mathcal{T}}^{\sigma}(\cdot):\mathbb{R}^{SA}\rightarrow\mathbb{R}^{SA} and defined as follows:

(s,a)𝒮×𝒜:𝒯σ(Q)(s,a)r(s,a)+γinf𝒫𝒰ρσ(Ps,a0)𝒫V,withV(s)maxaQ(s,a).\displaystyle\forall(s,a)\in{\mathcal{S}}\times\mathcal{A}:\quad{\mathcal{T}}^{\sigma}(Q)(s,a)\coloneqq r(s,a)+\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}_{\rho}(P^{0}_{s,a})}\mathcal{P}V,\quad\text{with}\quad V(s)\coloneqq\max_{a}Q(s,a). (8)

Given that Q,σQ^{\star,\sigma} is the unique fixed point of 𝒯σ{\mathcal{T}}^{\sigma}, one can recover the optimal robust value function and Q-function using a procedure termed distributionally robust value iteration (DRVI). Generalizing the standard value iteration, DRVI starts from some given initialization and recursively applies the robust Bellman operator until convergence. As has been shown previously, this procedure converges rapidly due to the γ\gamma-contraction property of 𝒯σ{\mathcal{T}}^{\sigma} w.r.t. the \ell_{\infty} norm (Iyengar,, 2005; Nilim and El Ghaoui,, 2005).
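As a back-of-the-envelope consequence (our own remark, not a statement quoted from the cited works): writing ε_opt for the target optimization accuracy of the iterative procedure (a symbol used only here), the γ-contraction of 𝒯^σ gives, for the initialization Q_0 = 0 used later in Algorithm 1 and the fact that 0 ≤ Q^{⋆,σ} ≤ 1/(1−γ),

\|Q_t - Q^{\star,\sigma}\|_\infty \le \gamma^t \|Q_0 - Q^{\star,\sigma}\|_\infty \le \frac{\gamma^t}{1-\gamma},

so running t \ge \log\big(\frac{1}{\varepsilon_{\mathsf{opt}}(1-\gamma)}\big) / \log(1/\gamma) iterations already guarantees \|Q_t - Q^{\star,\sigma}\|_\infty \le \varepsilon_{\mathsf{opt}}; in other words, the optimization error decays geometrically and only logarithmically many iterations in 1/ε_opt are needed.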

Specification of the divergence ρ\rho.

We consider two popular choices of the uncertainty set measured in terms of two different ff-divergence metrics: the total variation distance and the χ2\chi^{2} divergence, given respectively by (Tsybakov, 2009)

ρ𝖳𝖵(Ps,a,Ps,a0)\displaystyle\rho_{{\mathsf{TV}}}\left(P_{s,a},P^{0}_{s,a}\right) 12Ps,aPs,a01=12s𝒮P0(s|s,a)|1P(s|s,a)P0(s|s,a)|,\displaystyle\coloneqq\frac{1}{2}\left\|P_{s,a}-P^{0}_{s,a}\right\|_{1}=\frac{1}{2}\sum_{s^{\prime}\in{\mathcal{S}}}P^{0}(s^{\prime}\,|\,s,a)\left|1-\frac{P(s^{\prime}\,|\,s,a)}{P^{0}(s^{\prime}\,|\,s,a)}\right|, (9)
ρχ2(Ps,a,Ps,a0)\displaystyle\rho_{\chi^{2}}\left(P_{s,a},P^{0}_{s,a}\right) s𝒮P0(s|s,a)(1P(s|s,a)P0(s|s,a))2.\displaystyle\coloneqq\sum_{s^{\prime}\in{\mathcal{S}}}P^{0}(s^{\prime}\,|\,s,a)\left(1-\frac{P(s^{\prime}\,|\,s,a)}{P^{0}(s^{\prime}\,|\,s,a)}\right)^{2}. (10)

Note that ρ𝖳𝖵(Ps,a,Ps,a0)[0,1]\rho_{{\mathsf{TV}}}\left(P_{s,a},P^{0}_{s,a}\right)\in[0,1] and ρχ2(Ps,a,Ps,a0)[0,)\rho_{\chi^{2}}\left(P_{s,a},P^{0}_{s,a}\right)\in[0,\infty) in general. As we shall see shortly, these two choices of divergence metrics result in drastically different messages when it comes to sample complexities.
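For concreteness, here is a minimal Python sketch (our illustration; representing the transition rows as NumPy arrays is an assumption) of the two divergences (9) and (10) between a candidate row P_{s,a} and the nominal row P⁰_{s,a}.

```python
import numpy as np

def tv_distance(p, p0):
    """Total variation distance (9) between two distributions over S states."""
    p, p0 = np.asarray(p, dtype=float), np.asarray(p0, dtype=float)
    return 0.5 * np.abs(p - p0).sum()

def chi2_divergence(p, p0):
    """Chi-squared divergence (10) of p from the nominal p0, assuming p places
    no mass where p0 does not (otherwise the divergence is infinite)."""
    p, p0 = np.asarray(p, dtype=float), np.asarray(p0, dtype=float)
    support = p0 > 0
    return float(np.sum((p[support] - p0[support]) ** 2 / p0[support]))

# e.g., tv_distance([0.5, 0.5], [0.6, 0.4]) returns 0.1
```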

Sampling mechanism: a generative model.

Following Zhou et al., (2021); Panaganti and Kalathil, (2022), we assume access to a generative model or a simulator (Kearns and Singh,, 1999), which allows us to collect NN independent samples for each state-action pair generated based on the nominal kernel P0P^{0}:

(s,a)𝒮×𝒜,si,s,ai.i.dP0(|s,a),i=1,2,,N.\displaystyle\forall(s,a)\in{\mathcal{S}}\times\mathcal{A},\qquad s_{i,s,a}\overset{i.i.d}{\sim}P^{0}(\cdot\,|\,s,a),\qquad i=1,2,\cdots,N. (11)

The total sample size is, therefore, NSANSA.
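A minimal sketch of this sampling mechanism (11) in Python follows (the (S, A, S)-shaped kernel array and the function name are our own conventions, not the paper's).

```python
import numpy as np

def draw_generative_samples(P0, N, seed=0):
    """Draw N i.i.d. next states per (s, a) from the nominal kernel, as in (11).

    P0: (S, A, S) array with P0[s, a] = P^0(. | s, a).
    Returns an (S, A, N) integer array of next-state indices; the total number
    of generated transitions is N * S * A.
    """
    rng = np.random.default_rng(seed)
    S, A, _ = P0.shape
    samples = np.empty((S, A, N), dtype=np.int64)
    for s in range(S):
        for a in range(A):
            samples[s, a] = rng.choice(S, size=N, p=P0[s, a])
    return samples
```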

Goal.

Given the collected samples, the task is to learn the robust optimal policy for the RMDP — w.r.t. some prescribed uncertainty set 𝒰σ(P0)\mathcal{U}^{\sigma}(P^{0}) around the nominal kernel — using as few samples as possible. Specifically, given some target accuracy level ε>0\varepsilon>0, the goal is to seek an ε\varepsilon-optimal robust policy π^\widehat{\pi} obeying

s𝒮:V,σ(s)Vπ^,σ(s)ε.\displaystyle\forall s\in{\mathcal{S}}:\quad V^{\star,\sigma}(s)-V^{\widehat{\pi},\sigma}(s)\leq\varepsilon. (12)

3 Model-based algorithm: distributionally robust value iteration

We consider a model-based approach tailored to RMDPs, which first constructs an empirical nominal transition kernel based on the collected samples, and then applies distributionally robust value iteration (DRVI) to compute an optimal robust policy.

Empirical nominal kernel.

The empirical nominal transition kernel P^0SA×S\widehat{P}^{0}\in\mathbb{R}^{SA\times S} can be constructed on the basis of the empirical frequency of state transitions, i.e.,

(s,a)𝒮×𝒜:P^0(s|s,a)1Ni=1N𝟙{si,s,a=s},\displaystyle\forall(s,a)\in{\mathcal{S}}\times\mathcal{A}:\quad\widehat{P}^{0}(s^{\prime}\,|\,s,a)\coloneqq\frac{1}{N}\sum\limits_{i=1}^{N}\mathds{1}\big{\{}s_{i,s,a}=s^{\prime}\big{\}}, (13)

which leads to an empirical RMDP ^𝗋𝗈𝖻={𝒮,𝒜,γ,𝒰ρσ(P^0),r}\widehat{\mathcal{M}}_{\mathsf{rob}}=\{{\mathcal{S}},\mathcal{A},\gamma,\mathcal{U}^{\sigma}_{\rho}(\widehat{P}^{0}),r\}. Analogously, we can define the corresponding robust value function (resp. robust Q-function) of policy π\pi in ^𝗋𝗈𝖻\widehat{\mathcal{M}}_{\mathsf{rob}} as V^π,σ\widehat{V}^{\pi,\sigma} (resp. Q^π,σ\widehat{Q}^{\pi,\sigma}) (cf. (6)). In addition, we denote the corresponding optimal robust policy as π^\widehat{\pi}^{\star} and the optimal robust value function (resp. optimal robust Q-function) as V^,σ\widehat{V}^{\star,\sigma} (resp. Q^,σ\widehat{Q}^{\star,\sigma}) (cf. (7)), which satisfies the robust Bellman optimality equation:

(s,a)𝒮×𝒜:\displaystyle\forall(s,a)\in{\mathcal{S}}\times\mathcal{A}:\quad Q^,σ(s,a)=r(s,a)+γinf𝒫𝒰ρσ(P^s,a0)𝒫V^,σ.\displaystyle\widehat{Q}^{\star,\sigma}(s,a)=r(s,a)+\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}_{\rho}(\widehat{P}^{0}_{s,a})}\mathcal{P}\widehat{V}^{\star,\sigma}. (14)
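The plug-in estimator (13) amounts to simple per-(s,a) counting; a hedged Python sketch, reusing the (S, A, N) sample array from the earlier snippet (our convention, not the paper's), is given below.

```python
import numpy as np

def empirical_nominal_kernel(samples, S):
    """Empirical nominal kernel as in (13).

    samples: (S, A, N) integer array of next-state indices from the generative
    model; returns an (S, A, S) array whose (s, a, :) slice is the empirical
    distribution over next states.
    """
    S0, A, N = samples.shape  # S0 == S for a tabular MDP
    P_hat = np.zeros((S0, A, S))
    for s in range(S0):
        for a in range(A):
            P_hat[s, a] = np.bincount(samples[s, a], minlength=S) / N
    return P_hat
```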

Equipped with P^0\widehat{P}^{0}, we can define the empirical robust Bellman operator 𝒯^σ\widehat{{\mathcal{T}}}^{\sigma} as

(s,a)\displaystyle\forall(s,a)\in 𝒮×𝒜:𝒯^σ(Q)(s,a)r(s,a)+γinf𝒫𝒰ρσ(P^s,a0)𝒫V,withV(s)maxaQ(s,a).\displaystyle{\mathcal{S}}\times\mathcal{A}:\quad\widehat{{\mathcal{T}}}^{\sigma}(Q)(s,a)\coloneqq r(s,a)+\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}_{\rho}(\widehat{P}^{0}_{s,a})}\mathcal{P}V,\quad\text{with}\quad V(s)\coloneqq\max_{a}Q(s,a). (15)
DRVI: distributionally robust value iteration.

To compute the fixed point of 𝒯^σ\widehat{{\mathcal{T}}}^{\sigma}, we introduce distributionally robust value iteration (DRVI), which is summarized in Algorithm 1. From an initialization Q^0=0\widehat{Q}_{0}=0, the update rule at the tt-th (t1t\geq 1) iteration can be formulated as:

(s,a)\displaystyle\forall(s,a)\in 𝒮×𝒜:Q^t(s,a)=𝒯^σ(Q^t1)(s,a)=r(s,a)+γinf𝒫𝒰ρσ(P^s,a0)𝒫V^t1,\displaystyle{\mathcal{S}}\times\mathcal{A}:\quad\widehat{Q}_{t}(s,a)=\widehat{{\mathcal{T}}}^{\sigma}\big{(}\widehat{Q}_{t-1}\big{)}(s,a)=r(s,a)+\gamma\inf_{\mathcal{P}\in\mathcal{U}_{\rho}^{\sigma}(\widehat{P}^{0}_{s,a})}\mathcal{P}\widehat{V}_{t-1}, (16)

where V^t1(s)=maxaQ^t1(s,a)\widehat{V}_{t-1}(s)=\max_{a}\widehat{Q}_{t-1}(s,a) for all s𝒮s\in{\mathcal{S}}. However, directly solving (16) is computationally expensive since it involves optimization over an SS-dimensional probability simplex at each iteration, especially when the dimension of the state space 𝒮{\mathcal{S}} is large. Fortunately, in view of strong duality (Iyengar,, 2005), (16) can be equivalently solved using its dual problem, which concerns optimizing a scalar dual variable and thus can be solved efficiently. In what follows, we shall illustrate this for the two choices of the divergence ρ\rho of interest (cf. (9) and (10)). Before continuing, for any VSV\in\mathbb{R}^{S}, we denote [V]α[V]_{\alpha} as its clipped version by some non-negative value α\alpha, namely,

[V]α(s){α,if V(s)>α,V(s),otherwise.\displaystyle[V]_{\alpha}(s)\coloneqq\begin{cases}\alpha,&\text{if }V(s)>\alpha,\\ V(s),&\text{otherwise.}\end{cases} (17)
  • TV distance, where the uncertainty set is 𝒰ρσ(P^s,a0)𝒰𝖳𝖵σ(P^s,a0)𝒰ρ𝖳𝖵σ(P^s,a0)\mathcal{U}^{\sigma}_{\rho}(\widehat{P}^{0}_{s,a})\coloneqq\mathcal{U}^{\sigma}_{\mathsf{TV}}(\widehat{P}^{0}_{s,a})\coloneqq\mathcal{U}^{\sigma}_{\rho_{\mathsf{TV}}}(\widehat{P}^{0}_{s,a}) w.r.t. the TV distance ρ=ρ𝖳𝖵\rho=\rho_{\mathsf{TV}} defined in (9). In particular, we have the following lemma due to strong duality, which is a direct consequence of Iyengar, (2005, Lemma 4.3).

    Lemma 1 (Strong duality for TV).

    Consider any probability vector PΔ(𝒮)P\in\Delta({\mathcal{S}}), any fixed uncertainty level σ\sigma and the uncertainty set 𝒰σ(P)𝒰𝖳𝖵σ(P)\mathcal{U}^{\sigma}(P)\coloneqq\mathcal{U}^{\sigma}_{\mathsf{TV}}(P). For any vector VSV\in\mathbb{R}^{S} obeying V0V\geq{0}, recalling the definition of [V]α[V]_{\alpha} in (17), one has

    inf𝒫𝒰σ(P)𝒫V\displaystyle\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(P)}\mathcal{P}V =maxα[minsV(s),maxsV(s)]{P[V]ασ(αmins[V]α(s))}.\displaystyle=\max_{\alpha\in[\min_{s}V(s),\max_{s}V(s)]}\left\{P\left[V\right]_{\alpha}-\sigma\left(\alpha-\min_{s^{\prime}}\left[V\right]_{\alpha}(s^{\prime})\right)\right\}. (18)

    In view of the above lemma, the following dual update rule is equivalent to (16) in DRVI:

    Q^t(s,a)=r(s,a)+γmaxα[minsV^t1(s),maxsV^t1(s)]{P^s,a0[V^t1]ασ(αmins[V^t1]α(s))}.\displaystyle\widehat{Q}_{t}(s,a)=r(s,a)+\gamma\max_{\alpha\in\left[\min_{s}\widehat{V}_{t-1}(s),\max_{s}\widehat{V}_{t-1}(s)\right]}\left\{\widehat{P}^{0}_{s,a}\left[\widehat{V}_{t-1}\right]_{\alpha}-\sigma\left(\alpha-\min_{s^{\prime}}\left[\widehat{V}_{t-1}\right]_{\alpha}(s^{\prime})\right)\right\}. (19)
  • χ2\chi^{2} divergence, where the uncertainty set is 𝒰ρσ(P^s,a0)𝒰χ2σ(P^s,a0)𝒰ρχ2σ(P^s,a0)\mathcal{U}^{\sigma}_{\rho}(\widehat{P}^{0}_{s,a})\coloneqq\mathcal{U}^{\sigma}_{\chi^{2}}(\widehat{P}^{0}_{s,a})\coloneqq\mathcal{U}^{\sigma}_{\rho_{\chi^{2}}}(\widehat{P}^{0}_{s,a}) w.r.t. the χ2\chi^{2} divergence ρ=ρχ2\rho=\rho_{\chi^{2}} defined in (10). We introduce the following lemma which directly follows from (Iyengar,, 2005, Lemma 4.2).

    Lemma 2 (Strong duality for χ2\chi^{2}).

    Consider any probability vector PΔ(𝒮)P\in\Delta({\mathcal{S}}), any fixed uncertainty level σ\sigma and the uncertainty set 𝒰σ(P)𝒰χ2σ(P)\mathcal{U}^{\sigma}(P)\coloneqq\mathcal{U}^{\sigma}_{\chi^{2}}(P). For any vector VSV\in\mathbb{R}^{S} obeying V0V\geq{0}, one has

    inf𝒫𝒰σ(P)𝒫V=maxα[minsV(s),maxsV(s)]{P[V]ασ𝖵𝖺𝗋P([V]α)},\displaystyle\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(P)}\mathcal{P}V=\max_{\alpha\in[\min_{s}V(s),\max_{s}V(s)]}\left\{P[V]_{\alpha}-\sqrt{\sigma\mathsf{Var}_{P}\left([V]_{\alpha}\right)}\right\}, (20)

    where 𝖵𝖺𝗋P()\mathsf{Var}_{P}\left(\cdot\right) is defined as (40).

    In view of the above lemma, the update rule (16) in DRVI can be equivalently written as:

    Q^t(s,a)=r(s,a)+γmaxα[minsV^t1(s),maxsV^t1(s)]{P^s,a0[V^t1]ασ𝖵𝖺𝗋P^s,a0([V^t1]α)}.\displaystyle\widehat{Q}_{t}(s,a)=r(s,a)+\gamma\max_{\alpha\in\left[\min_{s}\widehat{V}_{t-1}(s),\max_{s}\widehat{V}_{t-1}(s)\right]}\left\{\widehat{P}^{0}_{s,a}\left[\widehat{V}_{t-1}\right]_{\alpha}-\sqrt{\sigma\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left(\left[\widehat{V}_{t-1}\right]_{\alpha}\right)}\right\}. (21)

The proofs of Lemma 1 and Lemma 2 are provided in Appendix A. To complete the description, we output the greedy policy of the final Q-estimate Q^T\widehat{Q}_{T} as the final policy π^\widehat{\pi}, namely,

s𝒮:π^(s)=argmaxaQ^T(s,a).\displaystyle\forall s\in{\mathcal{S}}:\quad\widehat{\pi}(s)=\arg\max_{a}\widehat{Q}_{T}(s,a). (22)

Encouragingly, the iterates {Q^t}t0\big{\{}\widehat{Q}_{t}\big{\}}_{t\geq 0} of DRVI converge linearly to the fixed point Q^,σ\widehat{Q}^{\star,\sigma}, owing to the appealing γ\gamma-contraction property of 𝒯^σ\widehat{{\mathcal{T}}}^{\sigma}.
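To make the dual updates (19) and (21) concrete, here is a minimal Python sketch of the inner scalar maximizations in Lemmas 1 and 2 (our own illustration). The grid search over α is a simple approximate stand-in for an exact one-dimensional solver, and the function names are ours; the χ² sketch takes Var_P(V) = P(V∘V) − (PV)², the standard definition the paper refers to as (40).

```python
import numpy as np

def robust_inner_tv(P, V, sigma, n_grid=200):
    """Approximately solve the TV dual (18)/(19):
    max over alpha in [min V, max V] of P [V]_alpha - sigma*(alpha - min_s [V]_alpha(s)),
    where [V]_alpha clips V above at alpha.  A grid search replaces the exact
    scalar maximization (which could also check the breakpoints alpha = V(s))."""
    best = -np.inf
    for alpha in np.linspace(V.min(), V.max(), n_grid):
        V_clip = np.minimum(V, alpha)
        best = max(best, float(P @ V_clip - sigma * (alpha - V_clip.min())))
    return best

def robust_inner_chi2(P, V, sigma, n_grid=200):
    """Approximately solve the chi^2 dual (20)/(21):
    max over alpha of P [V]_alpha - sqrt(sigma * Var_P([V]_alpha))."""
    best = -np.inf
    for alpha in np.linspace(V.min(), V.max(), n_grid):
        V_clip = np.minimum(V, alpha)
        mean = float(P @ V_clip)
        var = max(float(P @ (V_clip ** 2)) - mean ** 2, 0.0)
        best = max(best, mean - np.sqrt(sigma * var))
    return best
```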

input: empirical nominal transition kernel \widehat{P}^{0}; reward function r; uncertainty level \sigma; number of iterations T.
initialization: \widehat{Q}_{0}(s,a)=0, \widehat{V}_{0}(s)=0 for all (s,a)\in{\mathcal{S}}\times\mathcal{A}.
for t=1,2,\cdots,T do
    for s\in{\mathcal{S}}, a\in\mathcal{A} do
        Set \widehat{Q}_{t}(s,a) according to (16);
    for s\in{\mathcal{S}} do
        Set \widehat{V}_{t}(s)=\max_{a}\widehat{Q}_{t}(s,a);
output: \widehat{Q}_{T}, \widehat{V}_{T} and \widehat{\pi} obeying \widehat{\pi}(s)\coloneqq\arg\max_{a}\widehat{Q}_{T}(s,a).
Algorithm 1: Distributionally robust value iteration (DRVI) for infinite-horizon RMDPs.
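Below is a hedged, self-contained Python sketch of Algorithm 1 for the TV uncertainty set, using the dual update (19) with a grid-search inner maximization (a simplification of the exact scalar solve); the array shapes and names are our own illustrative choices, not the paper's.

```python
import numpy as np

def drvi_tv(P_hat, r, sigma, gamma, T, n_grid=200):
    """Minimal sketch of Algorithm 1 under the TV uncertainty set via (19).

    P_hat: (S, A, S) empirical nominal kernel; r: (S, A) rewards in [0, 1].
    Returns the final Q- and V-estimates and the greedy policy of (22).
    """
    S, A, _ = P_hat.shape
    Q = np.zeros((S, A))
    V = np.zeros(S)
    for _ in range(T):
        # dual variable alpha ranges over [min V_{t-1}, max V_{t-1}]
        alphas = np.linspace(V.min(), V.max(), n_grid)
        for s in range(S):
            for a in range(A):
                vals = [P_hat[s, a] @ np.minimum(V, al) - sigma * (al - V.min())
                        for al in alphas]
                Q[s, a] = r[s, a] + gamma * max(vals)
        V = Q.max(axis=1)           # robust value estimate \hat V_t
    pi_hat = Q.argmax(axis=1)       # greedy policy as in (22)
    return Q, V, pi_hat
```

The χ² variant only replaces the bracketed inner maximization with the dual form (21); see the dual-solve sketch above.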

4 Theoretical guarantees: sample complexity analyses

We now present our main results, which concern the sample complexities of learning RMDPs when the uncertainty set is specified using the TV distance or the χ2\chi^{2} divergence. Somewhat surprisingly, different choices of the uncertainty set can lead to dramatically different consequences in the sample size requirement.

4.1 The case of TV distance: RMDPs are easier to learn than standard MDPs

We start with the case where the uncertainty set is measured via the TV distance. The following theorem, whose proof is deferred to Section 5.2, develops an upper bound on the sample complexity of DRVI in order to return an ε\varepsilon-optimal robust policy. The key challenge of the analysis lies in careful control of the robust value function Vπ,σV^{\pi,\sigma} as a function of the uncertainty level σ\sigma.

Theorem 1 (Upper bound under TV distance).

Let the uncertainty set be 𝒰ρσ()=𝒰𝖳𝖵σ()\mathcal{U}_{\rho}^{\sigma}(\cdot)=\mathcal{U}^{\sigma}_{\mathsf{TV}}(\cdot), as specified by the TV distance (9). Consider any discount factor γ[14,1)\gamma\in\left[\frac{1}{4},1\right), uncertainty level σ(0,1)\sigma\in(0,1), and δ(0,1)\delta\in(0,1). Let π^\widehat{\pi} be the output policy of Algorithm 1 after T=C1log(N1γ)T=C_{1}\log\big{(}\frac{N}{1-\gamma}\big{)} iterations. Then with probability at least 1δ1-\delta, one has

s𝒮:V,σ(s)Vπ^,σ(s)ε\displaystyle\forall s\in{\mathcal{S}}:\quad V^{\star,\sigma}(s)-V^{\widehat{\pi},\sigma}(s)\leq\varepsilon (23)

for any ε(0,1/max{1γ,σ}]\varepsilon\in\left(0,\sqrt{1/\max\{1-\gamma,\sigma\}}\right], as long as the total number of samples obeys

NSAC2SA(1γ)2max{1γ,σ}ε2log(SAN(1γ)δ).\displaystyle NSA\geq\frac{C_{2}SA}{(1-\gamma)^{2}\max\{1-\gamma,\sigma\}\varepsilon^{2}}\log\left(\frac{SAN}{(1-\gamma)\delta}\right). (24)

Here, C1,C2>0C_{1},C_{2}>0 are some large enough universal constants.

Remark 1.

Note that Theorem 1 is not restricted to Algorithm 1; in fact, the theorem holds for any oracle planning algorithm (designed based on the empirical transitions P^0\widehat{P}^{0}) whose output policy π^\widehat{\pi} obeys

V^,σV^π^,σO((1γ)2Nlog(SAN(1γ)δ)).\big{\|}\widehat{V}^{\star,\sigma}-\widehat{V}^{\widehat{\pi},\sigma}\big{\|}_{\infty}\leq O\left(\frac{(1-\gamma)^{2}}{N}\log\left(\frac{SAN}{(1-\gamma)\delta}\right)\right). (25)

Before discussing the implications of Theorem 1, we present a matching minimax lower bound that confirms the tightness and optimality of the upper bound, which in turn pins down the sample complexity requirement for learning RMDPs with TV distance. The proof is based on constructing new hard instances inspired by the asymmetric structure of RMDPs, with the details postponed to Section 5.3.

Theorem 2 (Lower bound under TV distance).

Consider any tuple (S,A,γ,σ,ε)(S,A,\gamma,\sigma,\varepsilon) obeying σ(0,1c0]\sigma\in(0,1-c_{0}] with 0<c0180<c_{0}\leq\frac{1}{8} being any small enough positive constant, γ[12,1)\gamma\in\left[\frac{1}{2},1\right), and ε(0,c0256(1γ)]\varepsilon\in\big{(}0,\frac{c_{0}}{256(1-\gamma)}\big{]}. We can construct a collection of infinite-horizon RMDPs 0,1\mathcal{M}_{0},\mathcal{M}_{1} defined by the uncertainty set 𝒰ρσ()=𝒰𝖳𝖵σ()\mathcal{U}_{\rho}^{\sigma}(\cdot)=\mathcal{U}^{\sigma}_{\mathsf{TV}}(\cdot), an initial state distribution φ\varphi, and a dataset with NN independent samples for each state-action pair over the nominal transition kernel (for 0\mathcal{M}_{0} and 1\mathcal{M}_{1} respectively), such that

infπ^max{0(V,σ(φ)Vπ^,σ(φ)>ε),1(V,σ(φ)Vπ^,σ(φ)>ε)}18,\inf_{\widehat{\pi}}\max\left\{\mathbb{P}_{0}\big{(}V^{\star,\sigma}(\varphi)-V^{\widehat{\pi},\sigma}(\varphi)>\varepsilon\big{)},\,\mathbb{P}_{1}\big{(}V^{\star,\sigma}(\varphi)-V^{\widehat{\pi},\sigma}(\varphi)>\varepsilon\big{)}\right\}\geq\frac{1}{8},

provided that

NSAc0SAlog28192(1γ)2max{1γ,σ}ε2.NSA\leq\frac{c_{0}SA\log 2}{8192(1-\gamma)^{2}\max\{1-\gamma,\sigma\}\varepsilon^{2}}.

Here, the infimum is taken over all estimators π^\widehat{\pi}, and 0\mathbb{P}_{0} (resp. 1\mathbb{P}_{1}) denotes the probability when the RMDP is 0\mathcal{M}_{0} (resp. 1\mathcal{M}_{1}).

Below, we interpret the above theorems and highlight several key implications about the sample complexity requirements for learning RMDPs for the case w.r.t. the TV distance.

Near minimax-optimal sample complexity.

Theorem 1 shows that the total number of samples required for DRVI (or any oracle planning algorithm claimed in Remark 1) to yield ε\varepsilon-accuracy is

O~(SA(1γ)2max{1γ,σ}ε2).\displaystyle\widetilde{O}\left(\frac{SA}{(1-\gamma)^{2}\max\{1-\gamma,\sigma\}\varepsilon^{2}}\right). (26)

Taken together with the minimax lower bound asserted by Theorem 2, this confirms the near optimality of the sample complexity (up to some logarithmic factor) almost over the full range of the uncertainty level σ\sigma. Importantly, this sample complexity scales linearly with the size of the state-action space, and is inversely proportional to σ\sigma in the regime where σ1γ\sigma\gtrsim 1-\gamma.

RMDPs are easier to learn than standard MDPs under the TV distance.

Recall that the sample complexity requirement for learning standard MDPs with a generative model is (Azar et al., 2013a, ; Agarwal et al.,, 2020; Li et al., 2023b, )

O~(SA(1γ)3ε2)\widetilde{O}\left(\frac{SA}{(1-\gamma)^{3}\varepsilon^{2}}\right) (27)

in order to yield ε\varepsilon accuracy. Comparing this with the sample complexity requirement in (26) for RMDPs under the TV distance, we confirm that the latter is at least as easy as — if not easier than — standard MDPs. In particular, when σ1γ\sigma\lesssim 1-\gamma is small, the sample complexity of RMDPs is the same as that of standard MDPs as in (27), which is as anticipated since the RMDP reduces to the standard MDP when σ=0\sigma=0. On the other hand, when 1γσ<11-\gamma\lesssim\sigma<1, the sample complexity of RMDPs simplifies to

O~(SA(1γ)2σε2),\displaystyle\widetilde{O}\left(\frac{SA}{(1-\gamma)^{2}\sigma\varepsilon^{2}}\right), (28)

which is smaller than that of standard MDPs by a factor of σ/(1γ)\sigma/(1-\gamma).
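To put this gain in perspective, consider an illustrative instance (the numbers are chosen by us purely for concreteness and are not part of the theorems): with γ = 0.99 and σ = 0.5, which lies in the regime 1−γ ≲ σ < 1, the savings factor is

\frac{\sigma}{1-\gamma} = \frac{0.5}{0.01} = 50,

so the TV-robust requirement (28) is fifty times smaller than the standard-MDP requirement (27) in this instance.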

Comparison with state-of-the-art bounds.

For the upper bound, our results (cf. Theorem 1) significantly improve over the prior art O~(S2A(1γ)4ε2)\widetilde{O}\left(\frac{S^{2}A}{(1-\gamma)^{4}\varepsilon^{2}}\right) of Panaganti and Kalathil, (2022) by at least a factor of S1γ\frac{S}{1-\gamma} and even S(1γ)2\frac{S}{(1-\gamma)^{2}} when the uncertainty level 1γσ<11-\gamma\lesssim\sigma<1 is large. Turning to the lower bound side, Yang et al., (2022) developed a lower bound for RMDPs under the TV distance, which scales as

O~(SA(1γ)ε2min{1(1γ)4,1σ4}).\widetilde{O}\left(\frac{SA(1-\gamma)}{\varepsilon^{2}}\min\left\{\frac{1}{(1-\gamma)^{4}},\frac{1}{\sigma^{4}}\right\}\right).

Clearly, this is worse than ours by a factor of σ3(1γ)3(1,1(1γ)3)\frac{\sigma^{3}}{(1-\gamma)^{3}}\in\big{(}1,\frac{1}{(1-\gamma)^{3}}\big{)} in the regime where 1γσ<11-\gamma\lesssim\sigma<1.

4.2 The case of χ2\chi^{2} divergence: RMDPs can be harder than standard MDPs

We now switch attention to the case when the uncertainty set is measured via the χ2\chi^{2} divergence. The theorem below presents an upper bound on the sample complexity for this case, whose proof is deferred to Appendix D.

Theorem 3 (Upper bound under χ2\chi^{2} divergence).

Let the uncertainty set be 𝒰ρσ()=𝒰χ2σ()\mathcal{U}_{\rho}^{\sigma}(\cdot)=\mathcal{U}^{\sigma}_{\chi^{2}}(\cdot), as specified using the χ2\chi^{2} divergence (10). Consider any uncertainty level σ(0,)\sigma\in(0,\infty), γ[1/4,1)\gamma\in[1/4,1) and δ(0,1)\delta\in(0,1). With probability at least 1δ1-\delta, the output policy π^\widehat{\pi} from Algorithm 1 with at most T=c1log(N1γ)T=c_{1}\log\big{(}\frac{N}{1-\gamma}\big{)} iterations yields

s𝒮:V,σ(s)Vπ^,σ(s)ε\displaystyle\forall s\in{\mathcal{S}}:\quad V^{\star,\sigma}(s)-V^{\widehat{\pi},\sigma}(s)\leq\varepsilon (29)

for any ε(0,11γ]\varepsilon\in\big{(}0,\frac{1}{1-\gamma}\big{]}, as long as the total number of samples obeys

NSAc2SA(1+σ)(1γ)4ε2log(SANδ).\displaystyle NSA\geq\frac{c_{2}SA(1+\sigma)}{(1-\gamma)^{4}\varepsilon^{2}}\log\left(\frac{SAN}{\delta}\right). (30)

Here, c1,c2>0c_{1},c_{2}>0 are some large enough universal constants.

Remark 2.

Akin to Remark 1, the sample complexity derived in Theorem 3 continues to hold for any oracle planning algorithm that outputs a policy π^\widehat{\pi} obeying V^,σV^π^,σO(log(SAN(1γ)δ)N2)\big{\|}\widehat{V}^{\star,\sigma}-\widehat{V}^{\widehat{\pi},\sigma}\big{\|}_{\infty}\leq O\Big{(}\frac{\log(\frac{SAN}{(1-\gamma)\delta})}{N^{2}}\Big{)}.

In addition, in order to gauge the tightness of Theorem 3 and understand the minimal sample complexity requirement under the χ2\chi^{2} divergence, we further develop a minimax lower bound as follows; the proof is deferred to Appendix E.

Theorem 4 (Lower bound under χ2\chi^{2} divergence).

Consider any (S,A,γ,σ,ε)(S,A,\gamma,\sigma,\varepsilon) obeying γ[34,1)\gamma\in[\frac{3}{4},1), σ(0,)\sigma\in(0,\infty), and

ε\displaystyle\varepsilon c3{11γif σ(0,1γ4)max{1(1+σ)(1γ),1}if σ[1γ4,)\displaystyle\leq c_{3}\begin{cases}\frac{1}{1-\gamma}\quad&\text{if }\sigma\in\left(0,\frac{1-\gamma}{4}\right)\\ \max\left\{\frac{1}{(1+\sigma)(1-\gamma)},1\right\}\quad&\text{if }\sigma\in\left[\frac{1-\gamma}{4},\infty\right)\end{cases} (31)

for some small universal constant c3>0c_{3}>0. Then we can construct two infinite-horizon RMDPs 0,1\mathcal{M}_{0},\mathcal{M}_{1} defined by the uncertainty set 𝒰ρσ()=𝒰χ2σ()\mathcal{U}_{\rho}^{\sigma}(\cdot)=\mathcal{U}^{\sigma}_{\chi^{2}}(\cdot), an initial state distribution φ\varphi, and a dataset with NN independent samples per (s,a)(s,a) pair over the nominal transition kernel (for 0\mathcal{M}_{0} and 1\mathcal{M}_{1} respectively), such that

infπ^max{0(V,σ(φ)Vπ^,σ(φ)>ε),1(V,σ(φ)Vπ^,σ(φ)>ε)}18,\displaystyle\inf_{\widehat{\pi}}\max\left\{\mathbb{P}_{0}\big{(}V^{\star,\sigma}(\varphi)-V^{\widehat{\pi},\sigma}(\varphi)>\varepsilon\big{)},\,\mathbb{P}_{1}\big{(}V^{\star,\sigma}(\varphi)-V^{\widehat{\pi},\sigma}(\varphi)>\varepsilon\big{)}\right\}\geq\frac{1}{8}, (32)

provided that the total number of samples

NSAc4{SA(1γ)3ε2if σ(0,1γ4)σSAmin{1,(1γ)4(1+σ)4}ε2if σ[1γ4,)\displaystyle NSA\leq c_{4}\begin{cases}\frac{SA}{(1-\gamma)^{3}\varepsilon^{2}}&\text{if }\sigma\in\left(0,\frac{1-\gamma}{4}\right)\\ \frac{\sigma SA}{\min\left\{1,(1-\gamma)^{4}(1+\sigma)^{4}\right\}\varepsilon^{2}}&\text{if }\sigma\in\left[\frac{1-\gamma}{4},\infty\right)\end{cases} (33)

for some universal constant c4>0c_{4}>0.

We are now positioned to single out several key implications of the above theorems.

Nearly tight sample complexity.

In order to achieve ε\varepsilon-accuracy for RMDPs under the χ2\chi^{2} divergence, Theorem 3 asserts that a total number of samples on the order of

O~(SA(1+σ)(1γ)4ε2)\displaystyle\widetilde{O}\left(\frac{SA(1+\sigma)}{(1-\gamma)^{4}\varepsilon^{2}}\right) (34)

is sufficient for DRVI (or any other oracle planning algorithm as discussed in Remark 2). Taking this together with the minimax lower bound in Theorem 4 confirms that the sample complexity is near-optimal — up to a polynomial factor of the effective horizon 11γ\frac{1}{1-\gamma} — over the entire range of the uncertainty level σ\sigma. In particular,

  • when σ1\sigma\asymp 1, our sample complexity O~(SA(1γ)4ε2)\widetilde{O}\left(\frac{SA}{(1-\gamma)^{4}\varepsilon^{2}}\right) is sharp and matches the minimax lower bound;

  • when σ1(1γ)3\sigma\gtrsim\frac{1}{(1-\gamma)^{3}}, our sample complexity correctly predicts the linear dependency on σ\sigma, suggesting that more samples are needed when one wishes to account for larger χ2\chi^{2}-based uncertainty sets.

RMDPs can be much harder to learn than standard MDPs with χ2\chi^{2} divergence.

The minimax lower bound developed in Theorem 4 exhibits a curious non-monotonic behavior of the sample size requirement over the entire range of the uncertainty level σ(0,)\sigma\in(0,\infty) when the uncertainty set is measured via the χ2\chi^{2} divergence. When σ1γ\sigma\lesssim 1-\gamma, the lower bound reduces to

O~(SA(1γ)3ε2),\widetilde{O}\left(\frac{SA}{(1-\gamma)^{3}\varepsilon^{2}}\right),

which matches that of standard MDPs, since σ=0\sigma=0 corresponds to the standard MDP. However, two additional regimes are worth calling out:

1γσ1(1γ)1/3:\displaystyle 1-\gamma\lesssim\sigma\lesssim\frac{1}{(1-\gamma)^{1/3}}:\qquad O~(SA(1γ)4ε2min{σ,1σ3}),\displaystyle\widetilde{O}\left(\frac{SA}{(1-\gamma)^{4}\varepsilon^{2}}\min\left\{\sigma,\frac{1}{\sigma^{3}}\right\}\right),
σ1(1γ)3:\displaystyle\sigma\gtrsim\frac{1}{(1-\gamma)^{3}}:\qquad O~(SAσε2),\displaystyle\widetilde{O}\left(\frac{SA\sigma}{\varepsilon^{2}}\right),

both of which are greater than that of standard MDPs, indicating learning RMDPs under the χ2\chi^{2} divergence can be much harder.
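As an illustrative instance (again with numbers chosen by us only for concreteness): taking σ = 1 and γ = 0.99, the first regime above gives a lower bound on the order of

\frac{SA}{(1-\gamma)^{4}\varepsilon^{2}} = \frac{1}{1-\gamma}\cdot\frac{SA}{(1-\gamma)^{3}\varepsilon^{2}} = 100\cdot\frac{SA}{(1-\gamma)^{3}\varepsilon^{2}},

i.e., one hundred times the standard-MDP requirement (27).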

Comparison with state-of-the-art bounds.

Our upper bound significantly improves over the prior art O~(S2A(1+σ)(1γ)4ε2)\widetilde{O}\left(\frac{S^{2}A(1+\sigma)}{(1-\gamma)^{4}\varepsilon^{2}}\right) of Panaganti and Kalathil, (2022) by a factor of SS, and provides the first finite-sample complexity that scales linearly with respect to SS for discounted infinite-horizon RMDPs, which typically exhibit more complicated statistical dependencies than the finite-horizon counterpart. On the other hand, Yang et al., (2022) established a lower bound on the order of O~(SA(1γ)2σε2)\widetilde{O}\left(\frac{SA}{(1-\gamma)^{2}\sigma\varepsilon^{2}}\right) when σ1γ\sigma\gtrsim 1-\gamma, which is always smaller than the requirement of standard MDPs, and diminishes when σ\sigma grows. Consequently, Yang et al., (2022) does not lead to the rigorous justification that RMDPs can be much harder than standard MDPs, nor the correct linear scaling of the sample size as σ\sigma grows.

5 Analysis: the TV case

This section presents the key technical steps for proving our main results of the TV case.

5.1 Preliminaries of the analysis

5.1.1 Additional notations and basic facts

For convenience, we introduce the notation [T]{1,,T}[T]\coloneqq\{1,\cdots,T\} for any positive integer T>0T>0. Moreover, for any two vectors x=[xi]1inx=[x_{i}]_{1\leq i\leq n} and y=[yi]1iny=[y_{i}]_{1\leq i\leq n}, the notation xy{x}\leq{y} (resp. xy{x}\geq{y}) means xiyix_{i}\leq y_{i} (resp. xiyix_{i}\geq y_{i}) for all 1in1\leq i\leq n. In addition, for any vector xx, we overload the notation by letting xx=[x(s,a)2](s,a)𝒮×𝒜x\circ x=\big{[}x(s,a)^{2}\big{]}_{(s,a)\in{\mathcal{S}}\times\mathcal{A}} (resp. xx=[x(s)2]s𝒮x\circ x=\big{[}x(s)^{2}\big{]}_{s\in{\mathcal{S}}}). With slight abuse of notation, we denote 0{0} (resp. 1{1}) as the all-zero (resp. all-one) vector, and drop the subscript ρ\rho to write 𝒰σ()=𝒰ρσ()\mathcal{U}^{\sigma}(\cdot)=\mathcal{U}^{\sigma}_{\rho}(\cdot) whenever the argument holds for all divergences ρ\rho.

Matrix notation.

To continue, we recall or introduce some additional matrix notation that is useful throughout the analysis.

  • P0SA×SP^{0}\in\mathbb{R}^{SA\times S}: the matrix of the nominal transition kernel with Ps,a0P^{0}_{s,a} as the (s,a)(s,a)-th row.

  • P^0SA×S\widehat{P}^{0}\in\mathbb{R}^{SA\times S}: the matrix of the estimated nominal transition kernel with P^s,a0\widehat{P}^{0}_{s,a} as the (s,a)(s,a)-th row.

  • rSAr\in\mathbb{R}^{SA}: a vector representing the reward function rr (so that r(s,a)=r(s,a)r_{(s,a)}=r(s,a) for all (s,a)𝒮×𝒜(s,a)\in{\mathcal{S}}\times\mathcal{A}).

  • Ππ{0,1}S×SA\Pi^{\pi}\in\{0,1\}^{S\times SA}: a projection matrix associated with a given deterministic policy π\pi taking the following form

    Ππ=(eπ(1)000eπ(2)000eπ(S)),\displaystyle\Pi^{\pi}={\scriptsize\begin{pmatrix}{e}_{\pi(1)}^{\top}&{0}^{\top}&\cdots&{0}^{\top}\\ {0}^{\top}&{e}_{\pi(2)}^{\top}&\cdots&{0}^{\top}\\ \vdots&\vdots&\ddots&\vdots\\ {0}^{\top}&{0}^{\top}&\cdots&{e}_{\pi(S)}^{\top}\end{pmatrix}}, (35)

    where eπ(1),eπ(2),,eπ(S)A{e}_{\pi(1)}^{\top},{e}_{\pi(2)}^{\top},\ldots,{e}_{\pi(S)}^{\top}\in\mathbb{R}^{A} are standard basis vectors.

  • rπSr_{\pi}\in\mathbb{R}^{S}: a reward vector restricted to the actions chosen by the policy π\pi, namely, rπ(s)=r(s,π(s))r_{\pi}(s)=r(s,\pi(s)) for all s𝒮s\in{\mathcal{S}} (or simply, rπ=Ππrr_{\pi}=\Pi^{\pi}r).

  • VarP(V)SA\mathrm{Var}_{P}(V)\in\mathbb{R}^{SA}: for any transition kernel PSA×SP\in\mathbb{R}^{SA\times S} and vector VSV\in\mathbb{R}^{S}, we denote the (s,a)(s,a)-th row of VarP(V)\mathrm{Var}_{P}(V) as

    𝖵𝖺𝗋P(s,a)VarPs,a(V).\displaystyle\mathsf{Var}_{P}(s,a)\coloneqq\mathrm{Var}_{P_{s,a}}(V). (36)
  • PVSA×SP^{V}\in\mathbb{R}^{SA\times S}, P^VSA×S\widehat{P}^{V}\in\mathbb{R}^{SA\times S}: the matrices representing the probability transition kernel in the uncertainty set that leads to the worst-case value for any vector VSV\in\mathbb{R}^{S}. We denote Ps,aVP_{s,a}^{V} (resp. P^s,aV\widehat{P}_{s,a}^{V}) as the (s,a)(s,a)-th row of the transition matrix PVP^{V} (resp. P^V\widehat{P}^{V}). In truth, the (s,a)(s,a)-th rows of these transition matrices are defined as

    Ps,aV\displaystyle P_{s,a}^{V} =argmin𝒫𝒰σ(Ps,a0)𝒫V,andP^s,aV=argmin𝒫𝒰σ(P^s,a0)𝒫V.\displaystyle=\mathrm{argmin}_{\mathcal{P}\in\mathcal{U}^{\sigma}(P^{0}_{s,a})}\mathcal{P}V,\qquad\text{and}\qquad\widehat{P}_{s,a}^{V}=\mathrm{argmin}_{\mathcal{P}\in\mathcal{U}^{\sigma}(\widehat{P}^{0}_{s,a})}\mathcal{P}V. (37a)
    Furthermore, we make use of the following short-hand notation:
    Ps,aπ,V\displaystyle P_{s,a}^{\pi,V} :=Ps,aVπ,σ=argmin𝒫𝒰σ(Ps,a0)𝒫Vπ,σ,Ps,aπ,V^:=Ps,aV^π,σ=argmin𝒫𝒰σ(Ps,a0)𝒫V^π,σ,\displaystyle:=P_{s,a}^{V^{\pi,\sigma}}=\mathrm{argmin}_{\mathcal{P}\in\mathcal{U}^{\sigma}(P^{0}_{s,a})}\mathcal{P}V^{\pi,\sigma},\qquad P_{s,a}^{\pi,\widehat{V}}:=P_{s,a}^{\widehat{V}^{\pi,\sigma}}=\mathrm{argmin}_{\mathcal{P}\in\mathcal{U}^{\sigma}(P^{0}_{s,a})}\mathcal{P}\widehat{V}^{\pi,\sigma}, (37b)
    P^s,aπ,V\displaystyle\widehat{P}_{s,a}^{\pi,V} :=P^s,aVπ,σ=argminP𝒰σ(P^s,a0)PVπ,σ,P^s,aπ,V^:=P^s,aV^π,σ=argminP𝒰σ(P^s,a0)PV^π,σ.\displaystyle:=\widehat{P}_{s,a}^{V^{\pi,\sigma}}=\mathrm{argmin}_{P\in\mathcal{U}^{\sigma}(\widehat{P}^{0}_{s,a})}PV^{\pi,\sigma},\qquad\widehat{P}_{s,a}^{\pi,\widehat{V}}:=\widehat{P}_{s,a}^{\widehat{V}^{\pi,\sigma}}=\mathrm{argmin}_{P\in\mathcal{U}^{\sigma}(\widehat{P}^{0}_{s,a})}P\widehat{V}^{\pi,\sigma}. (37c)

    The corresponding probability transition matrices are denoted by Pπ,VSA×SP^{\pi,V}\in\mathbb{R}^{SA\times S}, Pπ,V^SA×SP^{\pi,\widehat{V}}\in\mathbb{R}^{SA\times S}, P^π,VSA×S\widehat{P}^{\pi,V}\in\mathbb{R}^{SA\times S} and P^π,V^SA×S\widehat{P}^{\pi,\widehat{V}}\in\mathbb{R}^{SA\times S}, respectively.

  • PπS×SP^{\pi}\in\mathbb{R}^{S\times S}, P^πS×S\widehat{P}^{\pi}\in\mathbb{R}^{S\times S}, P¯π,VS×S\underline{P}^{\pi,V}\in\mathbb{R}^{S\times S}, P¯π,V^S×S\underline{P}^{\pi,\widehat{V}}\in\mathbb{R}^{S\times S}, P¯^π,VS×S\underline{\widehat{P}}^{\pi,V}\in\mathbb{R}^{S\times S} and P¯^π,V^S×S\underline{\widehat{P}}^{\pi,\widehat{V}}\in\mathbb{R}^{S\times S}: six square probability transition matrices w.r.t. policy π\pi over the states, namely

    PπΠπP0,P^πΠπP^0,P¯π,VΠπPπ,V,P¯π,V^ΠπPπ,V^,\displaystyle P^{\pi}\coloneqq\Pi^{\pi}P^{0},\qquad\widehat{P}^{\pi}\coloneqq\Pi^{\pi}\widehat{P}^{0},\qquad\underline{P}^{\pi,V}\coloneqq\Pi^{\pi}P^{\pi,V},\qquad\underline{P}^{\pi,\widehat{V}}\coloneqq\Pi^{\pi}P^{\pi,\widehat{V}},
    P¯^π,VΠπP^π,V,andP¯^π,V^ΠπP^π,V^.\displaystyle\underline{\widehat{P}}^{\pi,V}\coloneqq\Pi^{\pi}\widehat{P}^{\pi,V},\qquad\text{and}\qquad\underline{\widehat{P}}^{\pi,\widehat{V}}\coloneqq\Pi^{\pi}\widehat{P}^{\pi,\widehat{V}}. (38)

    We denote PsπP^{\pi}_{s} as the ss-th row of the transition matrix PπP^{\pi}; similar quantities can be defined for the other matrices as well.
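To make the matrix notation above concrete, here is a minimal numpy sketch (purely illustrative; the row ordering (s, a) -> s*A + a and all sizes are conventions of this sketch, not of the paper) that builds Π^π as in (35) and forms P^π = Π^π P^0 and r_π = Π^π r.

```python
import numpy as np

# Minimal illustration of the matrix notation: rows of P0 are indexed by (s, a) -> s * A + a
# (an arbitrary convention of this sketch), and Pi_pi is the S x SA selector matrix of (35).
S, A = 4, 3
rng = np.random.default_rng(0)
P0 = rng.dirichlet(np.ones(S), size=S * A)       # nominal kernel, shape (SA, S)
r = rng.uniform(0.0, 1.0, size=S * A)            # reward vector, r[s * A + a] = r(s, a)
pi = rng.integers(0, A, size=S)                  # a deterministic policy, pi[s] in {0, ..., A-1}

Pi_pi = np.zeros((S, S * A))
for s in range(S):
    Pi_pi[s, s * A + pi[s]] = 1.0                # row s picks out the (s, pi(s)) entry

P_pi = Pi_pi @ P0                                # S x S state transition matrix P^pi
r_pi = Pi_pi @ r                                 # r_pi(s) = r(s, pi(s))

assert np.allclose(P_pi.sum(axis=1), 1.0)        # each row of P^pi is a probability vector
print(P_pi.shape, r_pi.shape)
```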

Kullback-Leibler (KL) divergence.

First, for any two distributions PP and QQ, we denote by 𝖪𝖫(PQ)\mathsf{KL}(P\parallel Q) the Kullback-Leibler (KL) divergence of PP and QQ. Letting 𝖡𝖾𝗋(p)\mathsf{Ber}(p) be the Bernoulli distribution with mean pp, we also introduce

𝖪𝖫(pq)plogpq+(1p)log1p1qandχ2(pq)(pq)2q+(pq)21q=(pq)2q(1q),\displaystyle\mathsf{KL}(p\parallel q)\coloneqq p\log\frac{p}{q}+(1-p)\log\frac{1-p}{1-q}\quad\text{and}\quad\chi^{2}(p\parallel q)\coloneqq\frac{(p-q)^{2}}{q}+\frac{(p-q)^{2}}{1-q}=\frac{(p-q)^{2}}{q(1-q)}, (39)

which represent respectively the KL divergence and the χ2\chi^{2} divergence of 𝖡𝖾𝗋(p)\mathsf{Ber}(p) from 𝖡𝖾𝗋(q)\mathsf{Ber}(q) (Tsybakov,, 2009).
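For concreteness, the two scalar divergences in (39) can be evaluated directly; the short sketch below (ours, for illustration only) also checks the standard fact that the KL divergence is dominated by the χ2 divergence, the kind of comparison used later in the lower-bound analysis of Section 5.3.

```python
import numpy as np

def bernoulli_kl(p, q):
    """KL(p || q) for Bernoulli distributions as in (39); assumes p, q lie strictly in (0, 1)."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def bernoulli_chi2(p, q):
    """chi^2(p || q) for Bernoulli distributions as in (39)."""
    return (p - q) ** 2 / (q * (1 - q))

p, q = 0.3, 0.25
kl, chi2 = bernoulli_kl(p, q), bernoulli_chi2(p, q)
print(kl, chi2)
assert kl <= chi2          # the KL divergence is always dominated by the chi^2 divergence
```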

Variance.

For any probability vector P1×SP\in\mathbb{R}^{1\times S} and vector VSV\in\mathbb{R}^{S}, we denote the variance

VarP(V)P(VV)(PV)(PV).\displaystyle\mathrm{Var}_{P}(V)\coloneqq P(V\circ V)-(PV)\circ(PV). (40)
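As a quick illustration of (36) and (40), the variance vector can be formed in a single matrix expression; the sketch below (illustrative sizes only) cross-checks one row against a direct computation.

```python
import numpy as np

rng = np.random.default_rng(1)
S, A = 4, 3
P = rng.dirichlet(np.ones(S), size=S * A)        # each row is a distribution P_{s,a}
V = rng.uniform(0.0, 10.0, size=S)

var_vec = P @ (V * V) - (P @ V) ** 2             # Var_P(V) from (40), one entry per (s, a)

row = 5                                          # cross-check the row corresponding to (s, a) = (1, 2)
direct = np.sum(P[row] * (V - P[row] @ V) ** 2)
assert np.isclose(var_vec[row], direct)
```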

The following lemma bounds the Lipschitz constant of the variance function.

Lemma 3.

Consider any 0V1,V211γ{0}\leq V_{1},V_{2}\leq\frac{1}{1-\gamma} obeying V1V2x\|V_{1}-V_{2}\|_{\infty}\leq x and any probability vector PΔ(S)P\in\Delta(S). Then one has

|VarP(V1)VarP(V2)|2x(1γ).\displaystyle\left|\mathrm{Var}_{P}(V_{1})-\mathrm{Var}_{P}(V_{2})\right|\leq\frac{2x}{(1-\gamma)}. (41)

Proof of Lemma 3: It is immediate to check that

|VarP(V1)VarP(V2)|\displaystyle\left|\mathrm{Var}_{P}(V_{1})-\mathrm{Var}_{P}(V_{2})\right| =|P(V1V1)(PV1)(PV1)P(V2V2)+(PV2)(PV2)|\displaystyle=\left|P(V_{1}\circ V_{1})-(PV_{1})\circ(PV_{1})-P(V_{2}\circ V_{2})+(PV_{2})\circ(PV_{2})\right|
|P(V1V1V2V2)|+|(PV1+PV2)P(V1V2)|\displaystyle\leq\left|P\big{(}V_{1}\circ V_{1}-V_{2}\circ V_{2}\big{)}\right|+\left|(PV_{1}+PV_{2})P(V_{1}-V_{2})\right|
2V1+V2V1V22x(1γ).\displaystyle\leq 2\|V_{1}+V_{2}\|_{\infty}\|V_{1}-V_{2}\|_{\infty}\leq\frac{2x}{(1-\gamma)}. (42)

Here, the penultimate inequality holds by the triangle inequality.
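As a quick numerical sanity check of Lemma 3 (illustration only, not part of the proof), one can draw random value vectors within the stated range and verify the bound (41):

```python
import numpy as np

rng = np.random.default_rng(2)
S, gamma = 6, 0.9
vmax = 1.0 / (1.0 - gamma)

def var(P, V):
    return P @ (V * V) - (P @ V) ** 2

for _ in range(1000):
    P = rng.dirichlet(np.ones(S))
    V1 = rng.uniform(0.0, vmax, size=S)
    V2 = np.clip(V1 + rng.uniform(-0.5, 0.5, size=S), 0.0, vmax)
    x = np.max(np.abs(V1 - V2))
    assert abs(var(P, V1) - var(P, V2)) <= 2 * x / (1 - gamma) + 1e-9
```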

5.1.2 Facts of the robust Bellman operator and the empirical robust MDP

γ\gamma-contraction of the robust Bellman operator.

It is worth noting that the robust Bellman operator (cf. (8)) shares the nice γ\gamma-contraction property of the standard Bellman operator, as stated below.

Lemma 4 (γ\gamma-Contraction).

(Iyengar,, 2005, Theorem 3.2) For any γ[0,1)\gamma\in[0,1), the robust Bellman operator 𝒯σ(){\mathcal{T}}^{\sigma}(\cdot) (cf. (8)) is a γ\gamma-contraction w.r.t. \|\cdot\|_{\infty}. Namely, for any Q1,Q2SAQ_{1},Q_{2}\in\mathbb{R}^{SA} s.t. Q1(s,a),Q2(s,a)[0,11γ]Q_{1}(s,a),Q_{2}(s,a)\in\big{[}0,\frac{1}{1-\gamma}\big{]} for all (s,a)𝒮×𝒜(s,a)\in{\mathcal{S}}\times\mathcal{A}, one has

𝒯σ(Q1)𝒯σ(Q2)γQ1Q2.\displaystyle\left\|{\mathcal{T}}^{\sigma}(Q_{1})-{\mathcal{T}}^{\sigma}(Q_{2})\right\|_{\infty}\leq\gamma\left\|Q_{1}-Q_{2}\right\|_{\infty}. (43)

Additionally, Q,σQ^{\star,\sigma} is the unique fixed point of 𝒯σ(){\mathcal{T}}^{\sigma}(\cdot) obeying 0Q,σ(s,a)11γ0\leq{Q}^{\star,\sigma}(s,a)\leq\frac{1}{1-\gamma} for all (s,a)𝒮×𝒜(s,a)\in{\mathcal{S}}\times\mathcal{A}.

Bellman equations of the empirical robust MDP ^𝗋𝗈𝖻\widehat{\mathcal{M}}_{\mathsf{rob}}.

To begin with, recall the empirical robust MDP ^𝗋𝗈𝖻={𝒮,𝒜,γ,𝒰σ(P^0),r}\widehat{\mathcal{M}}_{\mathsf{rob}}=\{{\mathcal{S}},\mathcal{A},\gamma,\mathcal{U}^{\sigma}(\widehat{P}^{0}),r\} based on the estimated nominal distribution P^0\widehat{P}^{0} constructed in (13), together with its corresponding robust value function (resp. robust Q-function) V^π,σ\widehat{V}^{\pi,\sigma} (resp. Q^π,σ\widehat{Q}^{\pi,\sigma}).

Note that Q^,σ\widehat{Q}^{\star,\sigma} is the unique fixed point of 𝒯^σ()\widehat{{\mathcal{T}}}^{\sigma}(\cdot) (see Lemma 4), the empirical robust Bellman operator constructed using P^0\widehat{P}^{0}. Moreover, similar to (7), for ^𝗋𝗈𝖻\widehat{\mathcal{M}}_{\mathsf{rob}}, the Bellman’s optimality principle gives the following robust Bellman consistency equation (resp. robust Bellman optimality equation):

(s,a)𝒮×𝒜:\displaystyle\forall(s,a)\in{\mathcal{S}}\times\mathcal{A}:\quad Q^π,σ(s,a)=r(s,a)+γinf𝒫𝒰σ(P^s,a0)𝒫V^π,σ,\displaystyle\widehat{Q}^{\pi,\sigma}(s,a)=r(s,a)+\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(\widehat{P}^{0}_{s,a})}\mathcal{P}\widehat{V}^{\pi,\sigma}, (44a)
(s,a)𝒮×𝒜:\displaystyle\forall(s,a)\in{\mathcal{S}}\times\mathcal{A}:\quad Q^,σ(s,a)=r(s,a)+γinf𝒫𝒰σ(P^s,a0)𝒫V^,σ.\displaystyle\widehat{Q}^{\star,\sigma}(s,a)=r(s,a)+\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(\widehat{P}^{0}_{s,a})}\mathcal{P}\widehat{V}^{\star,\sigma}. (44b)

With these in mind, combined with the matrix notation (introduced at the beginning of Section 5), for any policy π\pi, we can write the robust Bellman consistency equations as

Qπ,σ=r+γinf𝒫𝒰σ(P0)𝒫Vπ,σandQ^π,σ=r+γinf𝒫𝒰σ(P^0)𝒫V^π,σ,\displaystyle Q^{\pi,\sigma}=r+\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(P^{0})}\mathcal{P}V^{\pi,\sigma}\quad\text{and}\quad\widehat{Q}^{\pi,\sigma}=r+\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(\widehat{P}^{0})}\mathcal{P}\widehat{V}^{\pi,\sigma}, (45)

which leads to

Vπ,σ\displaystyle V^{\pi,\sigma} =rπ+γΠπinf𝒫𝒰σ(P0)𝒫Vπ,σ=(i)rπ+γP¯π,VVπ,σ,\displaystyle=r_{\pi}+\gamma\Pi^{\pi}\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(P^{0})}\mathcal{P}V^{\pi,\sigma}\overset{\mathrm{(i)}}{=}r_{\pi}+\gamma\underline{P}^{\pi,V}V^{\pi,\sigma},
V^π,σ\displaystyle\widehat{V}^{\pi,\sigma} =rπ+γΠπinf𝒫𝒰σ(P^0)𝒫V^π,σ=(ii)rπ+γP¯^π,V^V^π,σ,\displaystyle=r_{\pi}+\gamma\Pi^{\pi}\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(\widehat{P}^{0})}\mathcal{P}\widehat{V}^{\pi,\sigma}\overset{\mathrm{(ii)}}{=}r_{\pi}+\gamma\underline{\widehat{P}}^{\pi,\widehat{V}}\widehat{V}^{\pi,\sigma}, (46)

where (i) and (ii) hold by the definitions in (35), (37) and (38).

Encouragingly, the above property of the robust Bellman operator ensures the fast convergence of DRVI. We collect this consequence in the following lemma, whose proof is postponed to Appendix A.2.

Lemma 5.

Let Q^0=0\widehat{Q}_{0}=0. The iterates {Q^t},{V^t}\{\widehat{Q}_{t}\},\{\widehat{V}_{t}\} of DRVI (cf. Algorithm 1) obey

t0:Q^tQ^,σγt1γandV^tV^,σγt1γ.\displaystyle\forall t\geq 0:\quad\big{\|}\widehat{Q}_{t}-\widehat{Q}^{\star,\sigma}\big{\|}_{\infty}\leq\frac{\gamma^{t}}{1-\gamma}\qquad\text{and}\qquad\big{\|}\widehat{V}_{t}-\widehat{V}^{\star,\sigma}\big{\|}_{\infty}\leq\frac{\gamma^{t}}{1-\gamma}. (47)

Furthermore, the output policy π^\widehat{\pi} obeys

V^,σV^π^,σ\displaystyle\big{\|}\widehat{V}^{\star,\sigma}-\widehat{V}^{\widehat{\pi},\sigma}\big{\|}_{\infty} 2γε𝗈𝗉𝗍1γ,whereV^,σV^T1=:ε𝗈𝗉𝗍.\displaystyle\leq\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma},\qquad\mbox{where}\quad\big{\|}\widehat{V}^{\star,\sigma}-\widehat{V}_{T-1}\big{\|}_{\infty}=:\varepsilon_{\mathsf{opt}}. (48)

5.2 Proof of the upper bound with TV distance: Theorem 1

Throughout this section, for any transition kernel PP, the uncertainty set is taken as (see (9))

𝒰σ(P)𝒰𝖳𝖵σ(P)=𝒰𝖳𝖵σ(Ps,a),\displaystyle\mathcal{U}^{\sigma}(P)\coloneqq\mathcal{U}^{\sigma}_{\mathsf{TV}}(P)=\otimes\;\mathcal{U}^{\sigma}_{\mathsf{TV}}(P_{s,a}),\qquad 𝒰𝖳𝖵σ(Ps,a){Ps,aΔ(𝒮):12Ps,aPs,a1σ}.\displaystyle\mathcal{U}^{\sigma}_{\mathsf{TV}}(P_{s,a})\coloneqq\Big{\{}P^{\prime}_{s,a}\in\Delta({\mathcal{S}}):\frac{1}{2}\left\|P^{\prime}_{s,a}-P_{s,a}\right\|_{1}\leq\sigma\Big{\}}. (49)
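For the TV ball in (49), the inner minimization defining the worst-case transition rows in (37) admits a simple greedy computation: move up to σ of the probability mass away from the highest-value states and onto the lowest-value state. The numpy sketch below is our own illustration of this object (not necessarily the routine used inside Algorithm 1), together with a small sanity check.

```python
import numpy as np

def worst_case_tv(P_row, V, sigma):
    """Greedy solution of min_{P' in U^sigma_TV(P_row)} P' V for a single row, cf. (49)."""
    P_new = P_row.astype(float).copy()
    sink = int(np.argmin(V))                 # the lowest-value state receives the mass
    budget = sigma
    for s in np.argsort(V)[::-1]:            # highest-value states give up their mass first
        if s == sink or budget <= 0:
            continue
        moved = min(budget, P_new[s])
        P_new[s] -= moved
        P_new[sink] += moved
        budget -= moved
    return P_new

rng = np.random.default_rng(3)
S, sigma = 5, 0.2
P_row = rng.dirichlet(np.ones(S))
V = rng.uniform(0.0, 10.0, size=S)
P_star = worst_case_tv(P_row, V, sigma)

assert 0.5 * np.abs(P_star - P_row).sum() <= sigma + 1e-12      # stays inside the TV ball
for _ in range(200):                                            # no mixture in the ball does better
    Q = (1 - sigma) * P_row + sigma * rng.dirichlet(np.ones(S))
    assert Q @ V >= P_star @ V - 1e-9
```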

5.2.1 Technical lemmas

We begin with a key lemma that is new and distinguishes robust MDPs with TV distance from standard MDPs, which plays a critical role in obtaining the sample complexity upper bound in Theorem 1. This lemma concerns the dynamic range of the robust value function Vπ,σV^{\pi,\sigma} (cf. (5)) for any fixed policy π\pi, which produces tighter control than that in standard MDPs (cf. 11γ\frac{1}{1-\gamma}) when σ\sigma is large. The proof of this lemma is deferred to Appendix B.1.

Lemma 6.

For any nominal transition kernel PSA×SP\in\mathbb{R}^{SA\times S}, any fixed uncertainty level σ\sigma, and any policy π\pi, its corresponding robust value function Vπ,σV^{\pi,\sigma} (cf. (5)) satisfies

maxs𝒮Vπ,σ(s)mins𝒮Vπ,σ(s)1γmax{1γ,σ}.\displaystyle\max_{s\in{\mathcal{S}}}V^{\pi,\sigma}(s)-\min_{s\in{\mathcal{S}}}V^{\pi,\sigma}(s)\leq\frac{1}{\gamma\max\{1-\gamma,\sigma\}}.
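As a numerical illustration of Lemma 6 (a sanity check only, not part of its proof), the sketch below evaluates the robust value function of a random deterministic policy by fixed-point iteration of the robust Bellman evaluation operator, reusing the greedy TV inner solver from the sketch following (49) (redefined here so the snippet runs on its own), and verifies the claimed range bound; all sizes and parameters are arbitrary choices.

```python
import numpy as np

def worst_case_tv(P_row, V, sigma):
    # same greedy inner solver as in the sketch following (49)
    P_new = P_row.astype(float).copy()
    sink = int(np.argmin(V))
    budget = sigma
    for s in np.argsort(V)[::-1]:
        if s == sink or budget <= 0:
            continue
        moved = min(budget, P_new[s])
        P_new[s] -= moved
        P_new[sink] += moved
        budget -= moved
    return P_new

rng = np.random.default_rng(4)
S, A, gamma, sigma = 5, 3, 0.9, 0.4
P0 = rng.dirichlet(np.ones(S), size=(S, A))      # nominal kernel, P0[s, a] is a row
r = rng.uniform(0.0, 1.0, size=(S, A))
pi = rng.integers(0, A, size=S)                  # a random deterministic policy

V = np.zeros(S)                                  # robust policy evaluation by fixed-point iteration
for _ in range(800):
    V = np.array([r[s, pi[s]] + gamma * worst_case_tv(P0[s, pi[s]], V, sigma) @ V
                  for s in range(S)])

spread = V.max() - V.min()
bound = 1.0 / (gamma * max(1 - gamma, sigma))
print(spread, bound)
assert spread <= bound + 1e-6
```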

With the above lemma in hand, we introduce the following lemma that is useful throughout this section, whose proof is postponed to Appendix B.2.

Lemma 7.

Consider an MDP with transition kernel matrix PP and reward function 0r1{0}\leq r\leq 1. For any policy π\pi and its associated state transition matrix PπΠπPP_{\pi}\coloneqq\Pi^{\pi}P and value function 0Vπ,P11γ0\leq V^{\pi,P}\leq\frac{1}{1-\gamma} (cf. (1)), one has

(IγPπ)1VarPπ(Vπ,P)8(maxsVπ,P(s)minsVπ,P(s))γ2(1γ)21.\displaystyle\left(I-\gamma P_{\pi}\right)^{-1}\sqrt{\mathrm{Var}_{P_{\pi}}(V^{\pi,P})}\leq\sqrt{\frac{8(\max_{s}V^{\pi,P}(s)-\min_{s}V^{\pi,P}(s))}{\gamma^{2}(1-\gamma)^{2}}}1.

5.2.2 Proof of Theorem 1

Recall that the proof for standard RL (Agarwal et al.,, 2020; Li et al., 2023b, ) deals with the upper and lower bounds of the value function estimate gap identically. In contrast, the proof of Theorem 1 needs a tailored argument for the robust RL setting, controlling the upper and lower bounds of the value function estimate gap in an asymmetric way, motivated by the varying worst-case transition kernels associated with different value functions. Before proceeding, applying Lemma 5 yields that for any ε𝗈𝗉𝗍>0\varepsilon_{\mathsf{opt}}>0, as long as Tlog(1(1γ)ε𝗈𝗉𝗍)T\geq\log(\frac{1}{(1-\gamma)\varepsilon_{\mathsf{opt}}}), one has

V^,σV^π^,σ2γε𝗈𝗉𝗍1γ,\displaystyle\big{\|}\widehat{V}^{\star,\sigma}-\widehat{V}^{\widehat{\pi},\sigma}\big{\|}_{\infty}\leq\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma}, (50)

allowing us to justify the more general statement in Remark 1. To control the performance gap V,σVπ^,σ\left\|V^{\star,\sigma}-V^{\widehat{\pi},\sigma}\right\|_{\infty}, the proof is divided into several key steps.

Step 1: decomposing the error.

Recall the optimal robust policy π\pi^{\star} w.r.t. 𝗋𝗈𝖻\mathcal{M}_{\mathsf{rob}}, as well as the optimal robust policy π^\widehat{\pi}^{\star}, the optimal robust value function V^,σ\widehat{V}^{\star,\sigma}, and the robust Q-function Q^π,σ\widehat{Q}^{\pi,\sigma} w.r.t. ^𝗋𝗈𝖻\widehat{\mathcal{M}}_{\mathsf{rob}}. The term of interest V,σVπ^,σV^{\star,\sigma}-V^{\widehat{\pi},\sigma} can be decomposed as

V,σVπ^,σ\displaystyle V^{\star,\sigma}-V^{\widehat{\pi},\sigma} =(Vπ,σV^π,σ)+(V^π,σV^π^,σ)+(V^π^,σV^π^,σ)+(V^π^,σVπ^,σ)\displaystyle=\left(V^{\pi^{\star},\sigma}-\widehat{V}^{\pi^{\star},\sigma}\right)+\left(\widehat{V}^{\pi^{\star},\sigma}-\widehat{V}^{\widehat{\pi}^{\star},\sigma}\right)+\left(\widehat{V}^{\widehat{\pi}^{\star},\sigma}-\widehat{V}^{\widehat{\pi},\sigma}\right)+\left(\widehat{V}^{\widehat{\pi},\sigma}-V^{\widehat{\pi},\sigma}\right)
(i)(Vπ,σV^π,σ)+(V^π^,σV^π^,σ)+(V^π^,σVπ^,σ)\displaystyle\overset{\mathrm{(i)}}{\leq}\left(V^{\pi^{\star},\sigma}-\widehat{V}^{\pi^{\star},\sigma}\right)+\left(\widehat{V}^{\widehat{\pi}^{\star},\sigma}-\widehat{V}^{\widehat{\pi},\sigma}\right)+\left(\widehat{V}^{\widehat{\pi},\sigma}-V^{\widehat{\pi},\sigma}\right)
(ii)(Vπ,σV^π,σ)+2γε𝗈𝗉𝗍1γ1+(V^π^,σVπ^,σ)\displaystyle\overset{\mathrm{(ii)}}{\leq}\left(V^{\pi^{\star},\sigma}-\widehat{V}^{\pi^{\star},\sigma}\right)+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma}1+\left(\widehat{V}^{\widehat{\pi},\sigma}-V^{\widehat{\pi},\sigma}\right) (51)

where (i) holds by V^π,σV^π^,σ0\widehat{V}^{\pi^{\star},\sigma}-\widehat{V}^{\widehat{\pi}^{\star},\sigma}\leq 0 since π^\widehat{\pi}^{\star} is the robust optimal policy for ^𝗋𝗈𝖻\widehat{\mathcal{M}}_{\mathsf{rob}}, and (ii) comes from the fact in (50).

To control the two important terms in (51), we first consider a more general term V^π,σVπ,σ\widehat{V}^{\pi,\sigma}-V^{\pi,\sigma} for any policy π\pi. Towards this, plugging in (46) yields

V^π,σVπ,σ\displaystyle\widehat{V}^{\pi,\sigma}-V^{\pi,\sigma} =rπ+γP¯^π,V^V^π,σ(rπ+γP¯π,VVπ,σ)\displaystyle=r_{\pi}+\gamma\underline{\widehat{P}}^{\pi,\widehat{V}}\widehat{V}^{\pi,\sigma}-\big{(}r_{\pi}+\gamma\underline{P}^{\pi,V}V^{\pi,\sigma}\big{)}
=(γP¯^π,V^V^π,σγP¯π,V^V^π,σ)+(γP¯π,V^V^π,σγP¯π,VVπ,σ)\displaystyle=\Big{(}\gamma\underline{\widehat{P}}^{\pi,\widehat{V}}\widehat{V}^{\pi,\sigma}-\gamma\underline{P}^{\pi,\widehat{V}}\widehat{V}^{\pi,\sigma}\Big{)}+\Big{(}\gamma\underline{P}^{\pi,\widehat{V}}\widehat{V}^{\pi,\sigma}-\gamma\underline{P}^{\pi,V}V^{\pi,\sigma}\Big{)}
(i)γ(P¯π,VV^π,σP¯π,VVπ,σ)+(γP¯^π,V^V^π,σγP¯π,V^V^π,σ),\displaystyle\overset{\mathrm{(i)}}{\leq}\gamma\Big{(}\underline{P}^{\pi,V}\widehat{V}^{\pi,\sigma}-\underline{P}^{\pi,V}V^{\pi,\sigma}\Big{)}+\Big{(}\gamma\underline{\widehat{P}}^{\pi,\widehat{V}}\widehat{V}^{\pi,\sigma}-\gamma\underline{P}^{\pi,\widehat{V}}\widehat{V}^{\pi,\sigma}\Big{)},

where (i) holds by observing

P¯π,V^V^π,σ\displaystyle\underline{P}^{\pi,\widehat{V}}\widehat{V}^{\pi,\sigma} P¯π,VV^π,σ\displaystyle\leq\underline{P}^{\pi,V}\widehat{V}^{\pi,\sigma}

due to the optimality of P¯π,V^\underline{P}^{\pi,\widehat{V}} (cf. (37)). Rearranging terms leads to

V^π,σVπ,σ\displaystyle\widehat{V}^{\pi,\sigma}-V^{\pi,\sigma} γ(IγP¯π,V)1(P¯^π,V^V^π,σP¯π,V^V^π,σ).\displaystyle\leq\gamma\left(I-\gamma\underline{P}^{\pi,V}\right)^{-1}\Big{(}\underline{\widehat{P}}^{\pi,\widehat{V}}\widehat{V}^{\pi,\sigma}-\underline{P}^{\pi,\widehat{V}}\widehat{V}^{\pi,\sigma}\Big{)}. (52)

Similarly, we can also deduce

V^π,σVπ,σ\displaystyle\widehat{V}^{\pi,\sigma}-V^{\pi,\sigma} =rπ+γP¯^π,V^V^π,σ(rπ+γP¯π,VVπ,σ)\displaystyle=r_{\pi}+\gamma\underline{\widehat{P}}^{\pi,\widehat{V}}\widehat{V}^{\pi,\sigma}-\left(r_{\pi}+\gamma\underline{P}^{\pi,V}V^{\pi,\sigma}\right)
=(γP¯^π,V^V^π,σγP¯π,V^V^π,σ)+(γP¯π,V^V^π,σγP¯π,VVπ,σ)\displaystyle=\Big{(}\gamma\underline{\widehat{P}}^{\pi,\widehat{V}}\widehat{V}^{\pi,\sigma}-\gamma\underline{P}^{\pi,\widehat{V}}\widehat{V}^{\pi,\sigma}\Big{)}+\left(\gamma\underline{P}^{\pi,\widehat{V}}\widehat{V}^{\pi,\sigma}-\gamma\underline{P}^{\pi,V}V^{\pi,\sigma}\right)
γ(P¯π,V^V^π,σP¯π,V^Vπ,σ)+(γP¯^π,V^V^π,σγP¯π,V^V^π,σ)\displaystyle\geq\gamma\left(\underline{P}^{\pi,\widehat{V}}\widehat{V}^{\pi,\sigma}-\underline{P}^{\pi,\widehat{V}}V^{\pi,\sigma}\right)+\Big{(}\gamma\underline{\widehat{P}}^{\pi,\widehat{V}}\widehat{V}^{\pi,\sigma}-\gamma\underline{P}^{\pi,\widehat{V}}\widehat{V}^{\pi,\sigma}\Big{)}
γ(IγP¯π,V^)1(P¯^π,V^V^π,σP¯π,V^V^π,σ).\displaystyle\geq\gamma\left(I-\gamma\underline{P}^{\pi,\widehat{V}}\right)^{-1}\Big{(}\underline{\widehat{P}}^{\pi,\widehat{V}}\widehat{V}^{\pi,\sigma}-\underline{P}^{\pi,\widehat{V}}\widehat{V}^{\pi,\sigma}\Big{)}. (53)

Combining (52) and (53), we arrive at

V^π,σVπ,σ\displaystyle\big{\|}\widehat{V}^{\pi,\sigma}-V^{\pi,\sigma}\big{\|}_{\infty} γmax{(IγP¯π,V)1(P¯^π,V^V^π,σP¯π,V^V^π,σ),\displaystyle\leq\gamma\max\Big{\{}\Big{\|}\left(I-\gamma\underline{P}^{\pi,V}\right)^{-1}\Big{(}\underline{\widehat{P}}^{\pi,\widehat{V}}\widehat{V}^{\pi,\sigma}-\underline{P}^{\pi,\widehat{V}}\widehat{V}^{\pi,\sigma}\Big{)}\Big{\|}_{\infty},
(IγP¯π,V^)1(P¯^π,V^V^π,σP¯π,V^V^π,σ)}.\displaystyle\qquad\Big{\|}\left(I-\gamma\underline{P}^{\pi,\widehat{V}}\right)^{-1}\Big{(}\underline{\widehat{P}}^{\pi,\widehat{V}}\widehat{V}^{\pi,\sigma}-\underline{P}^{\pi,\widehat{V}}\widehat{V}^{\pi,\sigma}\Big{)}\Big{\|}_{\infty}\Big{\}}. (54)

By decomposing the error in a symmetric way, we can similarly obtain

V^π,σVπ,σ\displaystyle\big{\|}\widehat{V}^{\pi,\sigma}-V^{\pi,\sigma}\big{\|}_{\infty} γmax{(IγP¯^π,V)1(P¯^π,VVπ,σP¯π,VVπ,σ),\displaystyle\leq\gamma\max\Big{\{}\Big{\|}\left(I-\gamma\underline{\widehat{P}}^{\pi,V}\right)^{-1}\Big{(}\underline{\widehat{P}}^{\pi,V}V^{\pi,\sigma}-\underline{P}^{\pi,V}V^{\pi,\sigma}\Big{)}\Big{\|}_{\infty},
(IγP¯^π,V^)1(P¯^π,VVπ,σP¯π,VVπ,σ)}.\displaystyle\qquad\Big{\|}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi,\widehat{V}}\Big{)}^{-1}\Big{(}\underline{\widehat{P}}^{\pi,V}V^{\pi,\sigma}-\underline{P}^{\pi,V}V^{\pi,\sigma}\Big{)}\Big{\|}_{\infty}\Big{\}}. (55)

With the above facts in mind, we are ready to control the two terms V^π,σVπ,σ\big{\|}\widehat{V}^{\pi^{\star},\sigma}-V^{\pi^{\star},\sigma}\big{\|}_{\infty} and V^π^,σVπ^,σ\big{\|}\widehat{V}^{\widehat{\pi},\sigma}-V^{\widehat{\pi},\sigma}\big{\|}_{\infty} in (51) separately. More specifically, taking π=π\pi=\pi^{\star}, applying (55) leads to

V^π,σVπ,σ\displaystyle\big{\|}\widehat{V}^{\pi^{\star},\sigma}-V^{\pi^{\star},\sigma}\big{\|}_{\infty} γmax{(IγP¯^π,V)1(P¯^π,VVπ,σP¯π,VVπ,σ),\displaystyle\leq\gamma\max\Big{\{}\Big{\|}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{-1}\Big{(}\underline{\widehat{P}}^{\pi^{\star},V}V^{\pi^{\star},\sigma}-\underline{P}^{\pi^{\star},V}V^{\pi^{\star},\sigma}\Big{)}\Big{\|}_{\infty},
(IγP¯^π,V^)1(P¯^π,VVπ,σP¯π,VVπ,σ)}.\displaystyle\qquad\Big{\|}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},\widehat{V}}\Big{)}^{-1}\Big{(}\underline{\widehat{P}}^{\pi^{\star},V}V^{\pi^{\star},\sigma}-\underline{P}^{\pi^{\star},V}V^{\pi^{\star},\sigma}\Big{)}\Big{\|}_{\infty}\Big{\}}. (56)

Similarly, taking π=π^\pi=\widehat{\pi}, applying (54) leads to

V^π^,σVπ^,σ\displaystyle\big{\|}\widehat{V}^{\widehat{\pi},\sigma}-V^{\widehat{\pi},\sigma}\big{\|}_{\infty} γmax{(IγP¯π^,V^)1(P¯^π^,V^V^π^,σP¯π^,V^V^π^,σ),\displaystyle\leq\gamma\max\Big{\{}\Big{\|}\left(I-\gamma\underline{P}^{\widehat{\pi},\widehat{V}}\right)^{-1}\Big{(}\underline{\widehat{P}}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}-\underline{P}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}\Big{)}\Big{\|}_{\infty},
(IγP¯π^,V)1(P¯^π^,V^V^π^,σP¯π^,V^V^π^,σ)}.\displaystyle\qquad\Big{\|}\left(I-\gamma\underline{P}^{\widehat{\pi},V}\right)^{-1}\Big{(}\underline{\widehat{P}}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}-\underline{P}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}\Big{)}\Big{\|}_{\infty}\Big{\}}. (57)
Step 2: controlling V^π,σVπ,σ\big{\|}\widehat{V}^{\pi^{\star},\sigma}-V^{\pi^{\star},\sigma}\big{\|}_{\infty} and V^π^,σVπ^,σ\big{\|}\widehat{V}^{\widehat{\pi},\sigma}-V^{\widehat{\pi},\sigma}\big{\|}_{\infty} separately and summing up.

First, we introduce the following two lemmas that control the two main terms in (51), respectively. The first lemma controls the value function estimation error associated with the optimal policy π\pi^{\star} induced by the randomness of the generated dataset. The proofs are postponed to Appendices B.3 and B.4.

Lemma 8.

Consider any δ(0,1)\delta\in(0,1). With probability at least 1δ1-\delta, taking N16log(SANδ)(1γ)2N\geq\frac{16\log(\frac{SAN}{\delta})}{(1-\gamma)^{2}}, one has

V^π,σVπ,σ\displaystyle\left\|\widehat{V}^{\pi^{\star},\sigma}-V^{\pi^{\star},\sigma}\right\|_{\infty} 160log(18SANδ)(1γ)2max{1γ,σ}N+8log(18SANδ)(1γ)2N.\displaystyle\leq 160\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}+\frac{8\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}. (58)

Unlike the term V^π,σVπ,σ\|\widehat{V}^{\pi^{\star},\sigma}-V^{\pi^{\star},\sigma}\|_{\infty} associated with the fixed policy π\pi^{\star} (independent of the dataset), to control V^π^,σVπ^,σ\left\|\widehat{V}^{\widehat{\pi},\sigma}-V^{\widehat{\pi},\sigma}\right\|_{\infty}, we need to further deal with the complicated statistical dependency between the learned policy π^\widehat{\pi} and the empirical RMDP constructed from the dataset.

Lemma 9.

Taking ε𝗈𝗉𝗍log(54SAN2(1γ)δ)γN\varepsilon_{\mathsf{opt}}\leq\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{\gamma N} and N16log(54SAN2δ)(1γ)2N\geq\frac{16\log(\frac{54SAN^{2}}{\delta})}{(1-\gamma)^{2}}, with probability at least 1δ1-\delta, one has

V^π^,σVπ^,σ\displaystyle\left\|\widehat{V}^{\widehat{\pi},\sigma}-V^{\widehat{\pi},\sigma}\right\|_{\infty} 24log(54SAN2(1γ)δ)γ3(1γ)2max{1γ,σ}N+28log(54SAN2(1γ)δ)N(1γ)2.\displaystyle\leq 24\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{\gamma^{3}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}+\frac{28\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N(1-\gamma)^{2}}. (59)

Summing up the results in (58) and (59) and inserting them back into (51) completes the proof as follows: taking ε𝗈𝗉𝗍log(54SAN2(1γ)δ)γN\varepsilon_{\mathsf{opt}}\leq\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{\gamma N} and N16log(54SAN2δ)(1γ)2N\geq\frac{16\log(\frac{54SAN^{2}}{\delta})}{(1-\gamma)^{2}}, with probability at least 1δ1-\delta,

V,σVπ^,σ\displaystyle\big{\|}V^{\star,\sigma}-V^{\widehat{\pi},\sigma}\big{\|}_{\infty} Vπ,σV^π,σ+2γε𝗈𝗉𝗍1γ+V^π^,σVπ^,σ\displaystyle\leq\big{\|}V^{\pi^{\star},\sigma}-\widehat{V}^{\pi^{\star},\sigma}\big{\|}_{\infty}+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma}+\big{\|}\widehat{V}^{\widehat{\pi},\sigma}-V^{\widehat{\pi},\sigma}\big{\|}_{\infty}
2γε𝗈𝗉𝗍1γ+160log(18SANδ)(1γ)2max{1γ,σ}N+8log(18SANδ)(1γ)2N\displaystyle\leq\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma}+160\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}+\frac{8\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}
+24log(54SAN2(1γ)δ)γ3(1γ)2max{1γ,σ}N+28log(54SAN2(1γ)δ)N(1γ)2\displaystyle\quad+24\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{\gamma^{3}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}+\frac{28\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N(1-\gamma)^{2}}
184log(54SAN2(1γ)δ)γ3(1γ)2max{1γ,σ}N+36log(54SAN2(1γ)δ)N(1γ)2\displaystyle\leq 184\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{\gamma^{3}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}+\frac{36\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N(1-\gamma)^{2}}
1508log(54SAN2(1γ)δ)(1γ)2max{1γ,σ}N,\displaystyle\leq 1508\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}, (60)

where the last inequality holds by γ14\gamma\geq\frac{1}{4} and N16log(54SAN2δ)(1γ)2N\geq\frac{16\log(\frac{54SAN^{2}}{\delta})}{(1-\gamma)^{2}}.

5.3 Proof of the lower bound with TV distance: Theorem 2

To achieve a tight lower bound for robust MDPs, we construct new hard instances that are different from those for standard MDPs (Azar et al., 2013a, ), addressing two new challenges: 1) Due to the robustness requirement, the recursive step (or bootstrapping) in robust MDPs has asymmetric structures over the states, since the worst-case transition probability depends on the value function and puts more weight on the states with lower values. Inspired by such an asymmetric structure, we develop new hard instances by setting larger rewards on the states with action-invariant transition kernels to achieve a tighter lower bound. Note that standard MDPs do not face such reward allocation challenges, since their bootstrapping step is determined by a fixed transition probability independent of the value function. 2) As the uncertainty level can vary within 0<σ10<\sigma\leq 1, for any given uncertainty level σ\sigma, a tailored σ\sigma-dependent hard instance is required to achieve a tight lower bound, leading to the construction of a series of different instances as σ\sigma varies. In contrast, standard RL only needs to construct one hard instance (i.e., σ=0\sigma=0). By constructing a new class of hard instances addressing the above challenges, we develop a new lower bound in Theorem 2 that is tighter than the prior art (Yang et al.,, 2022), which used an identical hard instance for all uncertainty levels 0<σ10<\sigma\leq 1.

5.3.1 Construction of the hard problem instances

Construction of two hard MDPs.

Suppose there are two standard MDPs defined as below:

{ϕ=(𝒮,𝒜,Pϕ,r,γ)|ϕ{0,1}}.\displaystyle\left\{\mathcal{M}_{\phi}=\left(\mathcal{S},\mathcal{A},P^{\phi},r,\gamma\right)\,|\,\phi\in\{0,1\}\right\}.

Here, γ\gamma is the discount parameter, and 𝒮={0,1,,S1}{\mathcal{S}}=\{0,1,\ldots,S-1\} is the state space. For any state s{2,3,,S1}s\in\{2,3,\cdots,S-1\}, the corresponding action space is 𝒜={0,1,2,,A1}\mathcal{A}=\{0,1,2,\cdots,A-1\}, while for states s=0s=0 and s=1s=1, the action space is restricted to 𝒜={0,1}\mathcal{A}^{\prime}=\{0,1\}. For any ϕ{0,1}\phi\in\{0,1\}, the transition kernel PϕP^{\phi} of the constructed MDP ϕ\mathcal{M}_{\phi} is defined as

Pϕ(s|s,a)={p𝟙(s=1)+(1p)𝟙(s=0)if(s,a)=(0,ϕ)q𝟙(s=1)+(1q)𝟙(s=0)if(s,a)=(0,1ϕ)𝟙(s=1)ifs1,\displaystyle P^{\phi}(s^{\prime}\,|\,s,a)=\left\{\begin{array}[]{lll}p\mathds{1}(s^{\prime}=1)+(1-p)\mathds{1}(s^{\prime}=0)&\text{if}&(s,a)=(0,\phi)\\ q\mathds{1}(s^{\prime}=1)+(1-q)\mathds{1}(s^{\prime}=0)&\text{if}&(s,a)=(0,1-\phi)\\ \mathds{1}(s^{\prime}=1)&\text{if}&s\geq 1\end{array}\right., (64)

where pp and qq are set to satisfy

0p1 and 0q=pΔ\displaystyle 0\leq p\leq 1\quad\text{ and }\quad 0\leq q=p-\Delta (65)

for some pp and Δ>0\Delta>0 that shall be introduced later. The above transition kernel PϕP^{\phi} implies that state 11 is an absorbing state; namely, once the MDP arrives at state 11, it stays there forever.

Then, we define the reward function as

r(s,a)={1if s=10otherwise.\displaystyle r(s,a)=\left\{\begin{array}[]{lll}1&\text{if }s=1\\ 0&\text{otherwise}\end{array}\right.. (68)

Additionally, we choose the following initial state distribution:

φ(s)={1,if s=00,otherwise .\displaystyle\varphi(s)=\begin{cases}1,\quad&\text{if }s=0\\ 0,&\text{otherwise }\end{cases}. (69)

Here, the two constructed instances differ in their transition probabilities from state 0 (which has reward 0) rather than from state 11 (which has reward 11 and was used in the hard instances for standard MDPs (Li et al., 2022b, )), yielding a larger gap between the value functions of the two instances.
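For concreteness, the construction (64)-(69) can be written out directly. The sketch below builds the nominal kernel, reward, and initial distribution of the instance with ϕ = 0, using S = 4 and the action set {0, 1} at every state for simplicity (transitions at states s ≥ 1 are action-invariant anyway); the numerical values of p and Δ are arbitrary choices of this sketch obeying (65), not the specific choices made later in the analysis.

```python
import numpy as np

# Hard-instance construction (64)-(69) for phi = 0, with S = 4 and action set {0, 1} everywhere.
S, A, phi = 4, 2, 0
p, Delta = 0.36, 0.04          # arbitrary values obeying (65)
q = p - Delta

P_phi = np.zeros((S, A, S))
P_phi[0, phi, 1], P_phi[0, phi, 0] = p, 1 - p            # (s, a) = (0, phi)
P_phi[0, 1 - phi, 1], P_phi[0, 1 - phi, 0] = q, 1 - q    # (s, a) = (0, 1 - phi)
P_phi[1:, :, 1] = 1.0                                    # every state s >= 1 moves to state 1

r = np.zeros((S, A))
r[1, :] = 1.0                                            # reward (68): 1 only at state 1

varphi = np.zeros(S)
varphi[0] = 1.0                                          # initial distribution (69)

assert np.allclose(P_phi.sum(axis=-1), 1.0)
```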

Uncertainty set of the transition kernels.

Recall that the uncertainty set assumed throughout this section is defined as 𝒰σ(Pϕ)\mathcal{U}^{\sigma}(P^{\phi}) w.r.t. the TV distance:

𝒰σ(Pϕ)𝒰𝖳𝖵σ(Pϕ)=𝒰𝖳𝖵σ(Ps,aϕ),\displaystyle\mathcal{U}^{\sigma}(P^{\phi})\coloneqq\mathcal{U}^{\sigma}_{\mathsf{TV}}(P^{\phi})=\otimes\;\mathcal{U}^{\sigma}_{\mathsf{TV}}(P^{\phi}_{s,a}),\qquad 𝒰𝖳𝖵σ(Ps,aϕ){Ps,aΔ(𝒮):12Ps,aPs,aϕ1σ},\displaystyle\mathcal{U}^{\sigma}_{\mathsf{TV}}(P^{\phi}_{s,a})\coloneqq\Big{\{}P^{\prime}_{s,a}\in\Delta({\mathcal{S}}):\frac{1}{2}\left\|P^{\prime}_{s,a}-P^{\phi}_{s,a}\right\|_{1}\leq\sigma\Big{\}}, (70)

where Ps,aϕPϕ(|s,a)P^{\phi}_{s,a}\coloneqq P^{\phi}(\cdot\,|\,s,a) is defined similarly to (4). In addition, without loss of generality, we recall the radius σ(0,1c0]\sigma\in(0,1-c_{0}] with 0<c0<10<c_{0}<1. With the uncertainty level in hand, taking c1c02c_{1}\coloneqq\frac{c_{0}}{2}, the parameters pp and Δ\Delta that determine the instances are chosen to obey

p=(1+c1)max{1γ,σ}andΔc1max{1γ,σ},\displaystyle p=\left(1+c_{1}\right)\max\{1-\gamma,\sigma\}\qquad\text{and}\qquad\Delta\leq c_{1}\max\{1-\gamma,\sigma\}, (71)

which ensures 0p10\leq p\leq 1, as can be seen from

(1+c1)σ\displaystyle\left(1+c_{1}\right)\sigma 1c0+c1σ1c02<1,(1+c1)(1γ)32(1γ)34<1.\displaystyle\leq 1-c_{0}+c_{1}\sigma\leq 1-\frac{c_{0}}{2}<1,\qquad\left(1+c_{1}\right)(1-\gamma)\leq\frac{3}{2}(1-\gamma)\leq\frac{3}{4}<1. (72)

Consequently, applying (65) directly leads to

pqmax{1γ,σ}.\displaystyle p\geq q\geq\max\{1-\gamma,\sigma\}. (73)

To continue, for any (s,a,s)𝒮×𝒜×𝒮(s,a,s^{\prime})\in{\mathcal{S}}\times\mathcal{A}\times{\mathcal{S}}, we denote the infimum probability of moving to the next state ss^{\prime} associated with any perturbed transition kernel Ps,a𝒰σ(Ps,aϕ)P_{s,a}\in\mathcal{U}^{\sigma}(P^{\phi}_{s,a}) as

P¯ϕ(s|s,a)\displaystyle\underline{P}^{\phi}(s^{\prime}\,|\,s,a) infPs,a𝒰σ(Ps,aϕ)P(s|s,a)=max{P(s|s,a)σ,0},\displaystyle\coloneqq\inf_{P_{s,a}\in\mathcal{U}^{\sigma}(P^{\phi}_{s,a})}P(s^{\prime}\,|\,s,a)=\max\{P(s^{\prime}\,|\,s,a)-\sigma,0\}, (74)

where the last equation can be easily verified by the definition of 𝒰σ(Pϕ)\mathcal{U}^{\sigma}(P^{\phi}) in (70). As shall be seen, the transition from state 0 to state 11 plays an important role in the analysis; for convenience, we denote

p¯\displaystyle\underline{p} P¯ϕ(1| 0,ϕ)=pσ,q¯P¯ϕ(1| 0,1ϕ)=qσ,\displaystyle\coloneqq\underline{P}^{\phi}(1\,|\,0,\phi)=p-\sigma,\qquad\underline{q}\coloneqq\underline{P}^{\phi}(1\,|\,0,1-\phi)=q-\sigma, (75)

which follows from the fact that pqσp\geq q\geq\sigma in (73).

Robust value functions and robust optimal policies.

To proceed, we are ready to derive the corresponding robust value functions, identify the optimal policies, and characterize the optimal values. For any MDP ϕ\mathcal{M}_{\phi} with the above uncertainty set, we denote πϕ\pi^{\star}_{\phi} as the optimal policy, and the robust value function of any policy π\pi (resp. the optimal policy πϕ\pi^{\star}_{\phi}) as Vϕπ,σV^{\pi,\sigma}_{\phi} (resp. Vϕ,σV^{\star,\sigma}_{\phi}). Then, we introduce the following lemma which describes some important properties of the robust (optimal) value functions and optimal policies. The proof is postponed to Appendix C.1.

Lemma 10.

For any ϕ{0,1}\phi\in\{0,1\} and any policy π\pi, the robust value function obeys

Vϕπ,σ(0)=γ(zϕπσ)(1γ)(1+γ(zϕπσ)1γ(1σ))(1γ(1σ)),\displaystyle V^{\pi,\sigma}_{\phi}(0)=\frac{\gamma\big{(}z_{\phi}^{\pi}-\sigma\big{)}}{(1-\gamma)\bigg{(}1+\frac{\gamma\left(z_{\phi}^{\pi}-\sigma\right)}{1-\gamma\left(1-\sigma\right)}\bigg{)}\left(1-\gamma\left(1-\sigma\right)\right)}, (76)

where zϕπz_{\phi}^{\pi} is defined as

zϕπpπ(ϕ| 0)+qπ(1ϕ| 0).\displaystyle z_{\phi}^{\pi}\coloneqq p\pi(\phi\,|\,0)+q\pi(1-\phi\,|\,0). (77)

In addition, the robust optimal value functions and the robust optimal policies satisfy

Vϕ,σ(0)\displaystyle V_{\phi}^{\star,\sigma}(0) =γ(pσ)(1γ)(1+γ(pσ)1γ(1σ))(1γ(1σ)),\displaystyle=\frac{\gamma\left(p-\sigma\right)}{(1-\gamma)\left(1+\frac{\gamma\left(p-\sigma\right)}{1-\gamma\left(1-\sigma\right)}\right)\left(1-\gamma\left(1-\sigma\right)\right)}, (78a)
πϕ(ϕ|s)\displaystyle\pi_{\phi}^{\star}(\phi\,|\,s) =1, for s𝒮.\displaystyle=1,\qquad\qquad\text{ for }s\in{\mathcal{S}}. (78b)
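The closed-form expression (78a) can be checked numerically by running robust policy evaluation of the policy π⋆_ϕ(s) = ϕ from (78b) on the hard instance, using the greedy TV inner solver sketched after (49) (redefined inline so the snippet is self-contained). The snippet below is only a sanity check; the parameter values are arbitrary choices of this sketch satisfying (71) and (65).

```python
import numpy as np

gamma, sigma, c1 = 0.9, 0.3, 0.2
p = (1 + c1) * max(1 - gamma, sigma)     # choice (71)
Delta = 0.04                             # any 0 < Delta <= c1 * max{1 - gamma, sigma}
q = p - Delta
S, A, phi = 4, 2, 0

P = np.zeros((S, A, S))
P[0, phi, 1], P[0, phi, 0] = p, 1 - p
P[0, 1 - phi, 1], P[0, 1 - phi, 0] = q, 1 - q
P[1:, :, 1] = 1.0
r = np.zeros((S, A)); r[1, :] = 1.0

def worst_case_tv(P_row, V, sig):
    # same greedy inner solver as in the sketch following (49)
    P_new = P_row.copy(); sink = int(np.argmin(V)); budget = sig
    for s in np.argsort(V)[::-1]:
        if s == sink or budget <= 0:
            continue
        moved = min(budget, P_new[s]); P_new[s] -= moved; P_new[sink] += moved; budget -= moved
    return P_new

V = np.zeros(S)                          # robust evaluation of pi*_phi(s) = phi, cf. (78b)
for _ in range(2000):
    V = np.array([r[s, phi] + gamma * worst_case_tv(P[s, phi], V, sigma) @ V for s in range(S)])

b = 1 - gamma * (1 - sigma)
closed_form = gamma * (p - sigma) / ((1 - gamma) * (1 + gamma * (p - sigma) / b) * b)   # (78a)
print(V[0], closed_form)
assert abs(V[0] - closed_form) < 1e-6
```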

5.3.2 Establishing the minimax lower bound

Note that our goal is to control the quantity w.r.t. any policy estimator π^\widehat{\pi} based on the chosen initial distribution φ\varphi in (69) and the dataset consisting of NN samples over each state-action pair generated from the nominal transition kernel PϕP^{\phi}, which gives

φ,Vϕ,σVϕπ^,σ=Vϕ,σ(0)Vϕπ^,σ(0).\displaystyle\big{\langle}\varphi,V^{\star,\sigma}_{\phi}-V^{\widehat{\pi},\sigma}_{\phi}\big{\rangle}=V^{\star,\sigma}_{\phi}(0)-V^{\widehat{\pi},\sigma}_{\phi}(0).
Step 1: converting the goal to estimate ϕ\phi.

We make the following useful claim which shall be verified in Appendix C.2: With εc132(1γ)\varepsilon\leq\frac{c_{1}}{32(1-\gamma)}, letting

Δ=32(1γ)max{1γ,σ}εc1max{1γ,σ}\displaystyle\Delta=32(1-\gamma)\max\{1-\gamma,\sigma\}\varepsilon\leq c_{1}\max\{1-\gamma,\sigma\} (79)

which satisfies (71), we have that for any policy π^\widehat{\pi},

φ,Vϕ,σVϕπ^,σ2ε(1π^(ϕ| 0)).\displaystyle\big{\langle}\varphi,V^{\star,\sigma}_{\phi}-V^{\widehat{\pi},\sigma}_{\phi}\big{\rangle}\geq 2\varepsilon\big{(}1-\widehat{\pi}(\phi\,|\,0)\big{)}. (80)

With this connection established between the policy π^\widehat{\pi} and its sub-optimality gap as depicted in (80), we can now proceed to build an estimate for ϕ\phi. Here, we denote ϕ\mathbb{P}_{\phi} as the probability distribution when the MDP is ϕ\mathcal{M}_{\phi}, where ϕ\phi can take on values in the set {0,1}\{0,1\}.

Let us assume momentarily that an estimated policy π^\widehat{\pi} achieves

ϕ{φ,Vϕ,σVϕπ^,σε}78,\displaystyle\mathbb{P}_{\phi}\big{\{}\big{\langle}\varphi,V^{\star,\sigma}_{\phi}-V^{\widehat{\pi},\sigma}_{\phi}\big{\rangle}\leq\varepsilon\big{\}}\geq\frac{7}{8}, (81)

then in view of (80), we necessarily have π^(ϕ| 0)12\widehat{\pi}(\phi\,|\,0)\geq\frac{1}{2} with probability at least 78\frac{7}{8}. With this in mind, we are motivated to construct the following estimate ϕ^\widehat{\phi} for ϕ{0,1}\phi\in\{0,1\}:

ϕ^=argmaxa{0,1}π^(a| 0),\displaystyle\widehat{\phi}=\arg\max_{a\in\{0,1\}}\,\widehat{\pi}(a\,|\,0), (82)

which obeys

ϕ{ϕ^=ϕ}ϕ{π^(ϕ| 0)>1/2}78.\displaystyle\mathbb{P}_{\phi}\big{\{}\widehat{\phi}=\phi\big{\}}\geq\mathbb{P}_{\phi}\big{\{}\widehat{\pi}(\phi\,|\,0)>1/2\big{\}}\geq\frac{7}{8}. (83)

Subsequently, our aim is to demonstrate that (83) cannot occur without an adequate number of samples, which would in turn contradict (80).

Step 2: probability of error in testing two hypotheses.

Equipped with the aforementioned groundwork, we can now delve into differentiating between the two hypotheses ϕ{0,1}\phi\in\{0,1\}. To achieve this, we consider the concept of minimax probability of error, defined as follows:

peinfψmax{0(ψ0),1(ψ1)}.p_{\mathrm{e}}\coloneqq\inf_{\psi}\max\big{\{}\mathbb{P}_{0}(\psi\neq 0),\,\mathbb{P}_{1}(\psi\neq 1)\big{\}}. (84)

Here, the infimum is taken over all possible tests ψ\psi constructed from the samples generated from the nominal transition kernel PϕP^{\phi}.

Moving forward, let us denote μϕ\mu_{\phi} (resp. μϕ(s)\mu_{\phi}(s)) as the distribution of a sample tuple (si,ai,si)(s_{i},a_{i},s_{i}^{\prime}) under the nominal transition kernel PϕP^{\phi} associated with ϕ\mathcal{M}_{\phi}, where the samples are generated independently. Applying standard results from Tsybakov, (2009, Theorem 2.2) and the additivity of the KL divergence (cf. Tsybakov, (2009, Page 85)), we obtain

pe\displaystyle p_{\mathrm{e}} 14exp(NSA𝖪𝖫(μ0μ1))\displaystyle\geq\frac{1}{4}\exp\Big{(}-NSA\cdot\mathsf{KL}\big{(}\mu_{0}\parallel\mu_{1}\big{)}\Big{)}
=14exp{N(𝖪𝖫(P0(| 0,0)P1(| 0,0))+𝖪𝖫(P0(| 0,1)P1(| 0,1)))},\displaystyle=\frac{1}{4}\exp\Big{\{}-N\Big{(}\mathsf{KL}\big{(}P^{0}(\cdot\,|\,0,0)\parallel P^{1}(\cdot\,|\,0,0)\big{)}+\mathsf{KL}\big{(}P^{0}(\cdot\,|\,0,1)\parallel P^{1}(\cdot\,|\,0,1)\big{)}\Big{)}\Big{\}}, (85)

where the last equality holds by observing that

𝖪𝖫(μ0μ1)\displaystyle\mathsf{KL}\big{(}\mu_{0}\parallel\mu_{1}\big{)} =1SAs,a,s𝖪𝖫(P0(s|s,a)P1(s|s,a))\displaystyle=\frac{1}{SA}\sum_{s,a,s^{\prime}}\mathsf{KL}\big{(}P^{0}(s^{\prime}\,|\,s,a)\parallel P^{1}(s^{\prime}\,|\,s,a)\big{)}
=1SAa{0,1}𝖪𝖫(P0(| 0,a)P1(| 0,a)),\displaystyle=\frac{1}{SA}\sum_{a\in\{0,1\}}\mathsf{KL}\big{(}P^{0}(\cdot\,|\,0,a)\parallel P^{1}(\cdot\,|\,0,a)\big{)},

Here, the last equality holds by the fact that P0(|s,a)P^{0}(\cdot\,|\,s,a) and P1(|s,a)P^{1}(\cdot\,|\,s,a) only differ when s=0s=0.

Now, our focus shifts towards bounding the terms involving the KL divergence in (85). Given pqmax{1γ,σ}p\geq q\geq\max\{1-\gamma,\sigma\} (cf. (73)), applying Tsybakov, (2009, Lemma 2.7) gives

𝖪𝖫(P0(| 0,1)P1(| 0,1))\displaystyle\mathsf{KL}\big{(}P^{0}(\cdot\,|\,0,1)\parallel P^{1}(\cdot\,|\,0,1)\big{)} =𝖪𝖫(pq)(pq)2(1p)p=(i)Δ2p(1p)\displaystyle=\mathsf{KL}\left(p\parallel q\right)\leq\frac{(p-q)^{2}}{(1-p)p}\overset{\mathrm{(i)}}{=}\frac{\Delta^{2}}{p(1-p)}
=(ii)1024(1γ)2max{1γ,σ}2ε2p(1p)\displaystyle\overset{\mathrm{(ii)}}{=}\frac{1024(1-\gamma)^{2}\max\{1-\gamma,\sigma\}^{2}\varepsilon^{2}}{p(1-p)}
1024(1γ)2max{1γ,σ}ε21p4096c1(1γ)2max{1γ,σ}ε2,\displaystyle\leq\frac{1024(1-\gamma)^{2}\max\{1-\gamma,\sigma\}\varepsilon^{2}}{1-p}\leq\frac{4096}{c_{1}}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}\varepsilon^{2}, (86)

where (i) stems from the definition in (65), (ii) follows by the expression of Δ\Delta in (79), and the last inequality arises from 1q1pc041-q\geq 1-p\geq\frac{c_{0}}{4} (see (72)).

Note that 𝖪𝖫(P0(| 0,0)P1(| 0,0))\mathsf{KL}\big{(}P^{0}(\cdot\,|\,0,0)\parallel P^{1}(\cdot\,|\,0,0)\big{)} can be upper bounded in the same manner. Substituting (86) back into (85) demonstrates that if the sample size is selected as

Nc1log28192(1γ)2max{1γ,σ}ε2,\displaystyle N\leq\frac{c_{1}\log 2}{8192(1-\gamma)^{2}\max\{1-\gamma,\sigma\}\varepsilon^{2}}, (87)

then one necessarily has

pe\displaystyle p_{\mathrm{e}} 14exp{N8192c1(1γ)2max{1γ,σ}ε2}18,\displaystyle\geq\frac{1}{4}\exp\bigg{\{}-N\frac{8192}{c_{1}}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}\varepsilon^{2}\bigg{\}}\geq\frac{1}{8}, (88)
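To make the counting concrete, the following sketch instantiates (71) and (79) with arbitrary admissible values of (γ, σ, c0, ε), evaluates the Bernoulli KL divergences appearing in (85), and confirms that at the sample-size threshold (87) the resulting probability of error is at least 1/8, in line with (88). The constants are illustrative choices of this sketch, not prescriptions from the analysis.

```python
import numpy as np

gamma, sigma, c0, eps = 0.9, 0.3, 0.4, 0.01          # arbitrary admissible choices
c1 = c0 / 2.0
m = max(1 - gamma, sigma)
assert eps <= c1 / (32 * (1 - gamma))                # range required before (79)

p = (1 + c1) * m                                     # (71)
Delta = 32 * (1 - gamma) * m * eps                   # (79)
q = p - Delta

def bernoulli_kl(a, b):
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

kl_bound = 4096 / c1 * (1 - gamma) ** 2 * m * eps ** 2              # right-hand side of (86)
assert max(bernoulli_kl(p, q), bernoulli_kl(q, p)) <= kl_bound

N_max = c1 * np.log(2) / (8192 * (1 - gamma) ** 2 * m * eps ** 2)   # threshold (87)
p_err = 0.25 * np.exp(-N_max * (bernoulli_kl(p, q) + bernoulli_kl(q, p)))   # cf. (85)
print(N_max, p_err)
assert p_err >= 1 / 8                                # consistent with (88)
```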
Step 3: putting the results together.

Lastly, suppose that there exists an estimator π^\widehat{\pi} such that

0{φ,V0,σV0π^,σ>ε}<18and1{φ,V1,σV1π^,σ>ε}<18.\mathbb{P}_{0}\big{\{}\big{\langle}\varphi,V_{0}^{\star,\sigma}-V_{0}^{\widehat{\pi},\sigma}\big{\rangle}>\varepsilon\big{\}}<\frac{1}{8}\qquad\text{and}\qquad\mathbb{P}_{1}\big{\{}\big{\langle}\varphi,V_{1}^{\star,\sigma}-V_{1}^{\widehat{\pi},\sigma}\big{\rangle}>\varepsilon\big{\}}<\frac{1}{8}.

According to Step 1, the estimator ϕ^\widehat{\phi} defined in (82) must satisfy

0(ϕ^0)<18and1(ϕ^1)<18.\mathbb{P}_{0}\big{(}\widehat{\phi}\neq 0\big{)}<\frac{1}{8}\qquad\text{and}\qquad\mathbb{P}_{1}\big{(}\widehat{\phi}\neq 1\big{)}<\frac{1}{8}.

However, this cannot occur under the sample size condition (87), since it would contradict (88). Thus, we have completed the proof.

6 Offline distributionally robust RL with uniform coverage

In this section, we extend our theoretical analysis to broader sampling mechanism scenarios with offline datasets. We first specify the offline settings as below.

Offline/batch dataset.

Suppose that we observe a batch/historical dataset 𝒟𝖻={(si,ai,ri,si)}1iN𝖻\mathcal{D}^{\mathsf{b}}=\{(s_{i},a_{i},r_{i},s_{i}^{\prime})\}_{1\leq i\leq N_{\mathsf{b}}} consisting of N𝖻N_{\mathsf{b}} sample transitions generated independently. Specifically, the state-action pair (si,ai)(s_{i},a_{i}) is drawn from some behavior distribution μ𝖻Δ(𝒮×𝒜)\mu^{\mathsf{b}}\in\Delta({\mathcal{S}}\times\mathcal{A}), followed by a next state sis_{i}^{\prime} drawn from the nominal transition kernel P0P^{0}, i.e.,

(si,ai)i.i.d.μ𝖻andsii.i.d.P0(|si,ai),1iN𝖻.\displaystyle(s_{i},a_{i})\overset{\text{i.i.d.}}{\sim}\mu^{\mathsf{b}}\quad\text{and}\quad s_{i}^{\prime}\overset{\text{i.i.d.}}{\sim}P^{0}(\cdot\,|\,s_{i},a_{i}),\qquad 1\leq i\leq N_{\mathsf{b}}. (89)

We consider a historical dataset with uniform coverage, which is widely studied in offline settings for both standard RL and robust RL (Liao et al.,, 2022; Chen and Jiang,, 2019; Jin et al., 2020b, ; Zhou et al.,, 2021; Yang et al.,, 2022), as specified in the following assumption.

Assumption 1.

Suppose the historical dataset 𝒟𝖻\mathcal{D}^{\mathsf{b}} obeys

μminmin(s,a)𝒮×𝒜μ𝖻(s,a)>0.\displaystyle\mu_{\min}\coloneqq\min_{(s,a)\in{\mathcal{S}}\times\mathcal{A}}\mu^{\mathsf{b}}(s,a)>0. (90)

Armed with the above dataset 𝒟𝖻\mathcal{D}^{\mathsf{b}}, the empirical nominal transition kernel P^0SA×S\widehat{P}^{0}\in\mathbb{R}^{SA\times S} can be constructed through (13) analogously. Then, in this offline setting, we present sample complexity upper bounds for DRVI and information-theoretic lower bounds in the cases of the TV distance and the χ2\chi^{2} divergence, respectively. The proofs of the following corollaries are postponed to Appendix F.
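For concreteness, here is a minimal sketch of the sampling mechanism (89) together with a frequency-based construction of the empirical kernel (the estimator referred to around (13)). The sizes, the behavior distribution, and the convention used for state-action pairs that happen to be unvisited are all choices of this sketch rather than specifications from the paper.

```python
import numpy as np

rng = np.random.default_rng(5)
S, A, N_b = 4, 3, 20_000                             # illustrative sizes

P0 = rng.dirichlet(np.ones(S), size=(S, A))          # nominal kernel P0(. | s, a)
mu_b = rng.dirichlet(np.ones(S * A)).reshape(S, A)   # behavior distribution over (s, a)
assert mu_b.min() > 0                                # Assumption 1 (uniform coverage)

# draw (s_i, a_i) ~ mu_b and s'_i ~ P0(. | s_i, a_i), then count transitions
idx = rng.choice(S * A, size=N_b, p=mu_b.reshape(-1))
s_idx, a_idx = np.unravel_index(idx, (S, A))
counts = np.zeros((S, A, S))
for s, a in zip(s_idx, a_idx):
    counts[s, a, rng.choice(S, p=P0[s, a])] += 1

N_sa = counts.sum(axis=-1, keepdims=True)
P_hat = counts / np.maximum(N_sa, 1)                 # empirical kernel; unvisited rows stay all-zero,
                                                     # purely a convention of this sketch
print(np.abs(P_hat - P0).max(), int(N_sa.min()))
```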

6.1 The case of TV distance

With the above historical dataset 𝒟𝖻\mathcal{D}^{\mathsf{b}} in hand, we obtain the following corollary, implied by Theorem 1.

Corollary 1 (Upper bound under TV distance).

Let the uncertainty set be 𝒰ρσ()=𝒰𝖳𝖵σ()\mathcal{U}_{\rho}^{\sigma}(\cdot)=\mathcal{U}^{\sigma}_{\mathsf{TV}}(\cdot) defined in (9), and C3,C4>0C_{3},C_{4}>0 be some large enough universal constants. Consider any discount factor γ[14,1)\gamma\in\left[\frac{1}{4},1\right), uncertainty level σ(0,1)\sigma\in(0,1), and δ(0,1)\delta\in(0,1). Let π^\widehat{\pi} be the output policy of Algorithm 1 after T=C3log(N𝖻1γ)T=C_{3}\log\big{(}\frac{N_{\mathsf{b}}}{1-\gamma}\big{)} iterations, based on a dataset 𝒟𝖻\mathcal{D}^{\mathsf{b}} satisfying Assumption 1. Then with probability at least 1δ1-\delta, one has

s𝒮:V,σ(s)Vπ^,σ(s)ε\displaystyle\forall s\in{\mathcal{S}}:\quad V^{\star,\sigma}(s)-V^{\widehat{\pi},\sigma}(s)\leq\varepsilon (91)

for any ε(0,1/max{1γ,σ}]\varepsilon\in\left(0,\sqrt{1/\max\{1-\gamma,\sigma\}}\right], as long as the total number of samples obeys

N𝖻C4μmin(1γ)2max{1γ,σ}ε2log(N𝖻SA(1γ)δ).\displaystyle N_{\mathsf{b}}\geq\frac{C_{4}}{\mu_{\min}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}\varepsilon^{2}}\log\left(\frac{N_{\mathsf{b}}SA}{(1-\gamma)\delta}\right). (92)

We also derive a lower bound in the offline setting by adapting Theorem 2.

Corollary 2 (Lower bound under TV distance).

Let the uncertainty set be 𝒰ρσ()=𝒰𝖳𝖵σ()\mathcal{U}_{\rho}^{\sigma}(\cdot)=\mathcal{U}^{\sigma}_{\mathsf{TV}}(\cdot) defined in (9). Consider any tuple (S,γ,σ,ε,μmin)(S,\gamma,\sigma,\varepsilon,\mu_{\min}) that obeys μmin>0\mu_{\min}>0, σ(0,1c0]\sigma\in(0,1-c_{0}] with 0<c0180<c_{0}\leq\frac{1}{8} being any small enough positive constant, γ[12,1)\gamma\in\left[\frac{1}{2},1\right), and ε(0,c0256(1γ)]\varepsilon\in\big{(}0,\frac{c_{0}}{256(1-\gamma)}\big{]}. We can construct two infinite-horizon RMDPs 0,1\mathcal{M}_{0},\mathcal{M}_{1}, an initial state distribution φ\varphi, and a dataset with N𝖻N_{\mathsf{b}} samples satisfying Assumption 1 (for 0\mathcal{M}_{0} and 1\mathcal{M}_{1} respectively) such that

infπ^max{0(V,σ(φ)Vπ^,σ(φ)>ε),1(V,σ(φ)Vπ^,σ(φ)>ε)}18,\inf_{\widehat{\pi}}\max\left\{\mathbb{P}_{0}\big{(}V^{\star,\sigma}(\varphi)-V^{\widehat{\pi},\sigma}(\varphi)>\varepsilon\big{)},\,\mathbb{P}_{1}\big{(}V^{\star,\sigma}(\varphi)-V^{\widehat{\pi},\sigma}(\varphi)>\varepsilon\big{)}\right\}\geq\frac{1}{8},

provided that

N𝖻c0log28192μmin(1γ)2max{1γ,σ}ε2.N_{\mathsf{b}}\leq\frac{c_{0}\log 2}{8192\mu_{\min}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}\varepsilon^{2}}.

Here, the infimum is taken over all estimators π^\widehat{\pi}, and 0\mathbb{P}_{0} (resp. 1\mathbb{P}_{1}) denotes the probability when the RMDP is 0\mathcal{M}_{0} (resp. 1\mathcal{M}_{1}).

Discussions.

In the offline setting with a uniform-coverage dataset (cf. Assumption 1), Corollary 1 shows that the DRVI algorithm can find an ε\varepsilon-optimal policy with the following sample complexity

O~(1μmin(1γ)2max{1γ,σ}ε2),\displaystyle\widetilde{O}\left(\frac{1}{\mu_{\min}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}\varepsilon^{2}}\right), (93)

which is near minimax optimal with respect to all salient parameters (up to logarithmic factors) over almost the full range of the uncertainty level σ\sigma, as verified by the lower bound in Corollary 2. Our sample complexity upper bound (Corollary 1) significantly improves over the prior art O~(S(2+σ)2μminσ2(1γ)4ε2)\widetilde{O}\left(\frac{S(2+\sigma)^{2}}{\mu_{\min}\sigma^{2}(1-\gamma)^{4}\varepsilon^{2}}\right) (Yang et al.,, 2022) by at least a factor of S(1γ)2\frac{S}{(1-\gamma)^{2}}, and by even more than S(1γ)3\frac{S}{(1-\gamma)^{3}} when the uncertainty level 0<σ1γ0<\sigma\lesssim 1-\gamma is small.

6.2 The case of χ2\chi^{2} divergence

With uncertainty sets measured by the χ2\chi^{2} divergence, we obtain the following upper bounds for DRVI and information-theoretical lower bounds, adapted from Theorem 3 and Theorem 4 respectively.

Corollary 3 (Upper bound under χ2\chi^{2} divergence).

Let the uncertainty set be 𝒰ρσ()=𝒰χ2σ()\mathcal{U}_{\rho}^{\sigma}(\cdot)=\mathcal{U}^{\sigma}_{\chi^{2}}(\cdot) specified by the χ2\chi^{2} divergence (cf. (10)), and c1,c2>0c_{1},c_{2}>0 be some large enough universal constants. Consider any uncertainty level σ(0,)\sigma\in(0,\infty), γ[1/4,1)\gamma\in[1/4,1) and δ(0,1)\delta\in(0,1). Given a dataset 𝒟𝖻\mathcal{D}^{\mathsf{b}} satisfying Assumption 1, with probability at least 1δ1-\delta, the output policy π^\widehat{\pi} from Algorithm 1 with at most T=c1log(N𝖻1γ)T=c_{1}\log\big{(}\frac{N_{\mathsf{b}}}{1-\gamma}\big{)} iterations yields

s𝒮:V,σ(s)Vπ^,σ(s)ε\displaystyle\forall s\in{\mathcal{S}}:\quad V^{\star,\sigma}(s)-V^{\widehat{\pi},\sigma}(s)\leq\varepsilon (94)

for any ε(0,11γ]\varepsilon\in\big{(}0,\frac{1}{1-\gamma}\big{]}, as long as the total number of samples obeys

N𝖻c2(1+σ)μmin(1γ)4ε2log(N𝖻μminδ).\displaystyle N_{\mathsf{b}}\geq\frac{c_{2}(1+\sigma)}{\mu_{\min}(1-\gamma)^{4}\varepsilon^{2}}\log\left(\frac{N_{\mathsf{b}}}{\mu_{\min}\delta}\right). (95)
Corollary 4 (Lower bound under χ2\chi^{2} divergence).

Let the uncertainty set be 𝒰ρσ()=𝒰χ2σ()\mathcal{U}_{\rho}^{\sigma}(\cdot)=\mathcal{U}^{\sigma}_{\chi^{2}}(\cdot), and c3,c4>0c_{3},c_{4}>0 be some universal constants. Consider any tuple (S,γ,σ,ε,μmin)(S,\gamma,\sigma,\varepsilon,\mu_{\min}) obeying μmin>0\mu_{\min}>0, γ[34,1)\gamma\in[\frac{3}{4},1), σ(0,)\sigma\in(0,\infty), and

ε\displaystyle\varepsilon c3{11γif σ(0,1γ4)max{1(1+σ)(1γ),1}if σ[1γ4,).\displaystyle\leq c_{3}\begin{cases}\frac{1}{1-\gamma}\quad&\text{if }\sigma\in\left(0,\frac{1-\gamma}{4}\right)\\ \max\left\{\frac{1}{(1+\sigma)(1-\gamma)},1\right\}\quad&\text{if }\sigma\in\left[\frac{1-\gamma}{4},\infty\right).\end{cases} (96)

Then we can construct two infinite-horizon RMDPs 0,1\mathcal{M}_{0},\mathcal{M}_{1}, an initial state distribution φ\varphi, and a dataset with N𝖻N_{\mathsf{b}} independent samples satisfying Assumption 1 over the nominal transition kernel (for 0\mathcal{M}_{0} and 1\mathcal{M}_{1} respectively), such that

infπ^max{0(V,σ(φ)Vπ^,σ(φ)>ε),1(V,σ(φ)Vπ^,σ(φ)>ε)}18,\displaystyle\inf_{\widehat{\pi}}\max\left\{\mathbb{P}_{0}\big{(}V^{\star,\sigma}(\varphi)-V^{\widehat{\pi},\sigma}(\varphi)>\varepsilon\big{)},\,\mathbb{P}_{1}\big{(}V^{\star,\sigma}(\varphi)-V^{\widehat{\pi},\sigma}(\varphi)>\varepsilon\big{)}\right\}\geq\frac{1}{8}, (97)

provided that the total number of samples

N𝖻c4{1μmin(1γ)3ε2if σ(0,1γ4)σmin{1,(1γ)4(1+σ)4}μminε2if σ[1γ4,)\displaystyle N_{\mathsf{b}}\leq c_{4}\begin{cases}\frac{1}{\mu_{\min}(1-\gamma)^{3}\varepsilon^{2}}&\text{if }\sigma\in\left(0,\frac{1-\gamma}{4}\right)\\ \frac{\sigma}{\min\left\{1,(1-\gamma)^{4}(1+\sigma)^{4}\right\}\mu_{\min}\varepsilon^{2}}&\text{if }\sigma\in\left[\frac{1-\gamma}{4},\infty\right)\end{cases} (98)
Discussions.

Corollary 3 indicates that in the offline setting with a uniform-coverage dataset (cf. Assumption 1), DRVI can achieve ε\varepsilon-accuracy for RMDPs under the χ2\chi^{2} divergence with a total number of samples on the order of

O~((1+σ)μmin(1γ)4ε2).\displaystyle\widetilde{O}\left(\frac{(1+\sigma)}{\mu_{\min}(1-\gamma)^{4}\varepsilon^{2}}\right). (99)

The above upper bound is relatively tight, since it matches the lower bound derived in Corollary 4 when the uncertainty level σ1\sigma\asymp 1 and correctly captures the linear dependency with σ\sigma when the uncertainty level σ1(1γ)3\sigma\gtrsim\frac{1}{(1-\gamma)^{3}} is large. In addition, it significantly improves upon the prior art O~(S(1+σ)2μmin(1+σ1)2(1γ)4ε2)\widetilde{O}\left(\frac{S(1+\sigma)^{2}}{\mu_{\min}(\sqrt{1+\sigma}-1)^{2}(1-\gamma)^{4}\varepsilon^{2}}\right) (Yang et al.,, 2022) by at least a factor of S(1+σ)S(1+\sigma).

7 Other related works

This section briefly discusses a small sample of other related works. We limit our discussions primarily to provable RL algorithms in the tabular setting with finite state and action spaces, which are most related to the current paper.

Finite-sample guarantees for standard RL.

A surge of recent research has utilized the toolkit from high-dimensional probability/statistics to investigate the performance of standard RL algorithms in non-asymptotic settings. There has been a considerable amount of research into non-asymptotic sample analysis of standard RL for a variety of settings; partial examples include, but are not limited to, the works via probably approximately correct (PAC) bounds for the generative model setting (Kearns and Singh,, 1999; Beck and Srikant,, 2012; Li et al., 2022a, ; Chen et al.,, 2020; Azar et al., 2013b, ; Sidford et al.,, 2018; Agarwal et al.,, 2020; Li et al., 2023a, ; Li et al., 2023b, ; Wainwright,, 2019) and the offline setting (Liao et al.,, 2022; Chen and Jiang,, 2019; Rashidinejad et al.,, 2021; Xie et al.,, 2021; Yin et al.,, 2021; Shi et al.,, 2022; Li et al.,, 2024; Jin et al.,, 2021; Yan et al.,, 2022; Woo et al.,, 2024; Uehara et al.,, 2022), as well as the online setting via both regret-based and PAC-based analyses (Jin et al.,, 2018; Bai et al.,, 2019; Li et al.,, 2021; Zhang et al., 2020b, ; Dong et al.,, 2019; Jin et al., 2020a, ; Li et al., 2023c, ; Jafarnia-Jahromi et al.,, 2020; Yang et al.,, 2021; Woo et al.,, 2023).

Robustness in RL.

While standard RL has achieved remarkable success, current RL algorithms still have significant drawbacks in that the learned policy could be completely off if the deployed environment is subject to perturbation, model mismatch, or other structural changes. To address these challenges, an emerging line of works begins to address the robustness of RL algorithms with respect to the uncertainty or perturbation over different components of MDPs (state, action, reward, and the transition kernel); see Moos et al., (2022) for a recent review. Besides the framework of distributionally robust MDPs (RMDPs) (Iyengar,, 2005) adopted by this work, to promote robustness in RL, there exist various other works including but not limited to Zhang et al., 2020a ; Zhang et al., (2021); Han et al., (2022); Qiaoben et al., (2021); Sun et al., (2021); Xiong et al., (2022) investigating the robustness w.r.t. state uncertainty, where the agent’s policy is chosen based on a perturbed observation generated from the state by adding restricted noise or adversarial attacks. In addition, Tessler et al., (2019); Tan et al., (2020) considered the robustness w.r.t. the uncertainty of the action, namely, the action is possibly distorted by an adversarial agent abruptly or smoothly, and Ding et al., (2023) tackled robustness against spurious correlations.

Distributionally robust RL.

Rooted in the literature of distributionally robust optimization, which has primarily been investigated in the context of supervised learning (Rahimian and Mehrotra,, 2019; Gao,, 2020; Bertsimas et al.,, 2018; Duchi and Namkoong,, 2018; Blanchet and Murthy,, 2019), distributionally robust dynamic programming and RMDPs have attracted considerable attention recently (Iyengar,, 2005; Xu and Mannor,, 2012; Wolff et al.,, 2012; Kaufman and Schaefer,, 2013; Ho et al.,, 2018; Smirnova et al.,, 2019; Ho et al.,, 2021; Goyal and Grand-Clement,, 2022; Derman and Mannor,, 2020; Tamar et al.,, 2014; Badrinath and Kalathil,, 2021). In the context of RMDPs, both empirical and theoretical studies have been widely conducted, although most prior theoretical analyses focus on planning with an exact knowledge of the uncertainty set (Iyengar,, 2005; Xu and Mannor,, 2012; Tamar et al.,, 2014), or are asymptotic in nature (Roy et al.,, 2017).

Resorting to the tools of high-dimensional statistics, various recent works have begun to shift attention to understanding the finite-sample performance of provable robust RL algorithms, under diverse data generating mechanisms and forms of the uncertainty set over the transition kernel. Besides the infinite-horizon setting, finite-sample complexity bounds for RMDPs with the TV distance and the χ2\chi^{2} divergence are also developed for the finite-horizon setting in Xu et al., (2023); Dong et al., (2022); Lu et al., (2024). In addition, many other forms of uncertainty sets have been considered. For example, Wang and Zou, (2021) considered an R-contamination uncertainty set and proposed a provable robust Q-learning algorithm for the online setting with guarantees similar to those for standard MDPs. The KL divergence is another popular choice widely considered, where Yang et al., (2022); Panaganti and Kalathil, (2022); Zhou et al., (2021); Shi and Chi, (2022); Xu et al., (2023); Wang et al., 2023b ; Blanchet et al., (2023); Liu et al., (2022); Wang et al., 2023d ; Liang et al., (2023); Wang et al., 2023a investigated the sample complexity of both model-based and model-free algorithms under the simulator, offline settings, or single-trajectory setting. Xu et al., (2023) considered a variety of uncertainty sets including one associated with Wasserstein distance. Badrinath and Kalathil, (2021); Ramesh et al., (2023); Panaganti et al., (2022); Ma et al., (2022); Wang et al., (2024); Liu and Xu, 2024b ; Liu and Xu, 2024a considered function approximation settings. Moreover, various other related issues have been explored such as the difference of various uncertainty types (Wang et al., 2023c, ), the iteration complexity of the policy-based methods (Li et al., 2022c, ; Kumar et al.,, 2023; Li and Lan,, 2023), the case when the uncertainty level is small enough in an instance-dependent sense (Clavier et al.,, 2023), regularization-based robust RL (Yang et al.,, 2023; Zhang et al.,, 2023), and distributionally robust optimization for offline RL (Panaganti et al.,, 2023).

8 Discussions

This work has developed improved sample complexity bounds for learning RMDPs when the uncertainty set is measured via the TV distance or the χ2\chi^{2} divergence, assuming availability of a generative model. Our results have not only strengthened the prior art in both the upper and lower bounds, but have also unlocked curious insights into how the quest for distributional robustness impacts the sample complexity. As a key takeaway of this paper, RMDPs are not necessarily harder nor easier to learn than standard MDPs, as the answer depends — in a rather subtle manner — on the specific choice of the uncertainty set. For the case w.r.t. the TV distance, we have settled the minimax sample complexity for RMDPs, which is never larger than that required to learn standard MDPs. Regarding the case w.r.t. the χ2\chi^{2} divergence, we have uncovered that learning RMDPs can oftentimes be provably harder than the standard MDP counterpart. All in all, our findings help raise awareness that the choice of the uncertainty set not only represents a preference in robustness, but also exerts fundamental influences upon the intrinsic statistical complexity.

Moving forward, our work opens up numerous avenues for future studies, and we point out a few below.

  • Extensions to the finite-horizon setting. It is likely that our current analysis framework can be extended to tackle finite-horizon RMDPs, which would help complete our understanding of the tabular cases.

  • Improved analysis for the case of χ2\chi^{2} divergence. While we have settled the sample complexity of RMDPs with the TV distance, the upper and lower bounds we have developed for RMDPs w.r.t. the χ2\chi^{2} divergence still differ by some polynomial factor in the effective horizon. It would be of great interest to see how to close this gap.

  • A unified theory for other families of uncertainty sets. Our work raises an interesting question concerning how the geometry of the uncertainty set influences the sample complexity. Characterizing the tight sample complexity of RMDPs under a more general family of uncertainty sets, such as those induced by the p\ell_{p} distance or ff-divergence, as well as ss-rectangular sets, would be highly desirable.

  • Instance-dependent sample complexity analyses. We note that we focus on understanding the minimax-optimal sample complexity of RMDPs, which might be rather pessimistic. For a given MDP, the feasible and reasonable magnitude of the uncertainty level σ\sigma is limited by a certain instance-dependent finite threshold. It would be desirable to study the instance-dependent sample complexity of RMDPs, which might shed better light on guiding practice.

Acknowledgement

The work of L. Shi and Y. Chi is supported in part by the grants ONR N00014-19-1-2404, NSF CCF-2106778, DMS-2134080, and CNS-2148212. L. Shi is also gratefully supported by the Leo Finzi Memorial Fellowship, Wei Shen and Xuehong Zhang Presidential Fellowship, and Liang Ji-Dian Graduate Fellowship at Carnegie Mellon University. The work of Y. Wei is supported in part by the NSF grants DMS-2147546/2015447, CAREER award DMS-2143215, CCF-2106778, and the Google Research Scholar Award. The work of Y. Chen is supported in part by the Alfred P. Sloan Research Fellowship, the Google Research Scholar Award, the AFOSR grant FA9550-22-1-0198, the ONR grant N00014-22-1-2354, and the NSF grants CCF-2221009 and CCF-1907661. The authors also acknowledge Mengdi Xu, Zuxin Liu and He Wang for valuable discussions.

References

  • Agarwal et al., (2020) Agarwal, A., Kakade, S., and Yang, L. F. (2020). Model-based reinforcement learning with a generative model is minimax optimal. In Conference on Learning Theory, pages 67–83. PMLR.
  • (2) Azar, M., Munos, R., and Kappen, H. J. (2013a). Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine learning, 91:325–349.
  • (3) Azar, M. G., Munos, R., and Kappen, H. J. (2013b). Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine learning, 91(3):325–349.
  • Badrinath and Kalathil, (2021) Badrinath, K. P. and Kalathil, D. (2021). Robust reinforcement learning using least squares policy iteration with provable performance guarantees. In International Conference on Machine Learning, pages 511–520. PMLR.
  • Bai et al., (2019) Bai, Y., Xie, T., Jiang, N., and Wang, Y.-X. (2019). Provably efficient Q-learning with low switching cost. arXiv preprint arXiv:1905.12849.
  • Bäuerle and Glauner, (2022) Bäuerle, N. and Glauner, A. (2022). Distributionally robust Markov decision processes and their connection to risk measures. Mathematics of Operations Research, 47(3):1757–1780.
  • Beck and Srikant, (2012) Beck, C. L. and Srikant, R. (2012). Error bounds for constant step-size Q-learning. Systems & control letters, 61(12):1203–1208.
  • Bertsimas et al., (2018) Bertsimas, D., Gupta, V., and Kallus, N. (2018). Data-driven robust optimization. Mathematical Programming, 167(2):235–292.
  • Bertsimas et al., (2019) Bertsimas, D., Sim, M., and Zhang, M. (2019). Adaptive distributionally robust optimization. Management Science, 65(2):604–618.
  • Blanchet et al., (2023) Blanchet, J., Lu, M., Zhang, T., and Zhong, H. (2023). Double pessimism is provably efficient for distributionally robust offline reinforcement learning: Generic algorithm and robust partial coverage. arXiv preprint arXiv:2305.09659.
  • Blanchet and Murthy, (2019) Blanchet, J. and Murthy, K. (2019). Quantifying distributional model risk via optimal transport. Mathematics of Operations Research, 44(2):565–600.
  • Cai et al., (2016) Cai, J.-F., Qu, X., Xu, W., and Ye, G.-B. (2016). Robust recovery of complex exponential signals from random Gaussian projections via low rank Hankel matrix reconstruction. Applied and Computational Harmonic Analysis, 41(2):470–490.
  • Chen and Jiang, (2019) Chen, J. and Jiang, N. (2019). Information-theoretic considerations in batch reinforcement learning. In International Conference on Machine Learning, pages 1042–1051. PMLR.
  • Chen et al., (2020) Chen, Z., Maguluri, S. T., Shakkottai, S., and Shanmugam, K. (2020). Finite-sample analysis of stochastic approximation using smooth convex envelopes. arXiv preprint arXiv:2002.00874.
  • Chen et al., (2019) Chen, Z., Sim, M., and Xu, H. (2019). Distributionally robust optimization with infinitely constrained ambiguity sets. Operations Research, 67(5):1328–1344.
  • Clavier et al., (2023) Clavier, P., Pennec, E. L., and Geist, M. (2023). Towards minimax optimality of model-based robust reinforcement learning. arXiv preprint arXiv:2302.05372v1.
  • de Castro Silva et al., (2003) de Castro Silva, J., Soma, N., and Maculan, N. (2003). A greedy search for the three-dimensional bin packing problem: the packing static stability case. International Transactions in Operational Research, 10(2):141–153.
  • Derman and Mannor, (2020) Derman, E. and Mannor, S. (2020). Distributional robustness and regularization in reinforcement learning. arXiv preprint arXiv:2003.02894.
  • Ding et al., (2023) Ding, W., Shi, L., Chi, Y., and Zhao, D. (2023). Seeing is not believing: Robust reinforcement learning against spurious correlation. In Thirty-seventh Conference on Neural Information Processing Systems.
  • Dong et al., (2022) Dong, J., Li, J., Wang, B., and Zhang, J. (2022). Online policy optimization for robust MDP. arXiv preprint arXiv:2209.13841.
  • Dong et al., (2019) Dong, K., Wang, Y., Chen, X., and Wang, L. (2019). Q-learning with UCB exploration is sample efficient for infinite-horizon MDP. arXiv preprint arXiv:1901.09311.
  • Duchi and Namkoong, (2018) Duchi, J. and Namkoong, H. (2018). Learning models with uniform performance via distributionally robust optimization. arXiv preprint arXiv:1810.08750.
  • Fatemi et al., (2021) Fatemi, M., Killian, T. W., Subramanian, J., and Ghassemi, M. (2021). Medical dead-ends and learning to identify high-risk states and treatments. Advances in Neural Information Processing Systems, 34:4856–4870.
  • Gao, (2020) Gao, R. (2020). Finite-sample guarantees for wasserstein distributionally robust optimization: Breaking the curse of dimensionality. arXiv preprint arXiv:2009.04382.
  • Goyal and Grand-Clement, (2022) Goyal, V. and Grand-Clement, J. (2022). Robust Markov decision processes: Beyond rectangularity. Mathematics of Operations Research.
  • Han et al., (2022) Han, S., Su, S., He, S., Han, S., Yang, H., and Miao, F. (2022). What is the solution for state adversarial multi-agent reinforcement learning? arXiv preprint arXiv:2212.02705.
  • Ho et al., (2018) Ho, C. P., Petrik, M., and Wiesemann, W. (2018). Fast bellman updates for robust MDPs. In International Conference on Machine Learning, pages 1979–1988. PMLR.
  • Ho et al., (2021) Ho, C. P., Petrik, M., and Wiesemann, W. (2021). Partial policy iteration for l1-robust Markov decision processes. Journal of Machine Learning Research, 22(275):1–46.
  • Iyengar, (2005) Iyengar, G. N. (2005). Robust dynamic programming. Mathematics of Operations Research, 30(2):257–280.
  • Jafarnia-Jahromi et al., (2020) Jafarnia-Jahromi, M., Wei, C.-Y., Jain, R., and Luo, H. (2020). A model-free learning algorithm for infinite-horizon average-reward MDPs with near-optimal regret. arXiv preprint arXiv:2006.04354.
  • Jin et al., (2018) Jin, C., Allen-Zhu, Z., Bubeck, S., and Jordan, M. I. (2018). Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4863–4873.
  • (32) Jin, C., Krishnamurthy, A., Simchowitz, M., and Yu, T. (2020a). Reward-free exploration for reinforcement learning. In International Conference on Machine Learning, pages 4870–4879. PMLR.
  • (33) Jin, C., Yang, Z., Wang, Z., and Jordan, M. I. (2020b). Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR.
  • Jin et al., (2021) Jin, Y., Yang, Z., and Wang, Z. (2021). Is pessimism provably efficient for offline RL? In International Conference on Machine Learning, pages 5084–5096.
  • Kaufman and Schaefer, (2013) Kaufman, D. L. and Schaefer, A. J. (2013). Robust modified policy iteration. INFORMS Journal on Computing, 25(3):396–410.
  • Kearns and Singh, (1999) Kearns, M. J. and Singh, S. P. (1999). Finite-sample convergence rates for Q-learning and indirect algorithms. In Advances in neural information processing systems, pages 996–1002.
  • Klopp et al., (2017) Klopp, O., Lounici, K., and Tsybakov, A. B. (2017). Robust matrix completion. Probability Theory and Related Fields, 169(1-2):523–564.
  • Kober et al., (2013) Kober, J., Bagnell, J. A., and Peters, J. (2013). Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274.
  • Kumar et al., (2023) Kumar, N., Derman, E., Geist, M., Levy, K., and Mannor, S. (2023). Policy gradient for s-rectangular robust Markov decision processes. arXiv preprint arXiv:2301.13589.
  • Lam, (2019) Lam, H. (2019). Recovering best statistical guarantees via the empirical divergence-based distributionally robust optimization. Operations Research, 67(4):1090–1105.
  • Lee et al., (2021) Lee, J., Jeon, W., Lee, B., Pineau, J., and Kim, K.-E. (2021). Optidice: Offline policy optimization via stationary distribution correction estimation. In International Conference on Machine Learning, pages 6120–6130. PMLR.
  • (42) Li, G., Cai, C., Chen, Y., Wei, Y., and Chi, Y. (2023a). Is Q-learning minimax optimal? a tight sample complexity analysis. Operations Research.
  • (43) Li, G., Chi, Y., Wei, Y., and Chen, Y. (2022a). Minimax-optimal multi-agent RL in Markov games with a generative model. Neural Information Processing Systems.
  • (44) Li, G., Shi, L., Chen, Y., Chi, Y., and Wei, Y. (2022b). Settling the sample complexity of model-based offline reinforcement learning. arXiv preprint arXiv:2204.05275.
  • Li et al., (2024) Li, G., Shi, L., Chen, Y., Chi, Y., and Wei, Y. (2024). Settling the sample complexity of model-based offline reinforcement learning. The Annals of Statistics, 52(1):233–260.
  • Li et al., (2021) Li, G., Shi, L., Chen, Y., Gu, Y., and Chi, Y. (2021). Breaking the sample complexity barrier to regret-optimal model-free reinforcement learning. Advances in Neural Information Processing Systems, 34.
  • (47) Li, G., Wei, Y., Chi, Y., and Chen, Y. (2023b). Breaking the sample size barrier in model-based reinforcement learning with a generative model. accepted to Operations Research.
  • (48) Li, G., Yan, Y., Chen, Y., and Fan, J. (2023c). Minimax-optimal reward-agnostic exploration in reinforcement learning. arXiv preprint arXiv:2304.07278.
  • Li and Lan, (2023) Li, Y. and Lan, G. (2023). First-order policy optimization for robust policy evaluation. arXiv preprint arXiv:2307.15890.
  • (50) Li, Y., Zhao, T., and Lan, G. (2022c). First-order policy optimization for robust Markov decision process. arXiv preprint arXiv:2209.10579.
  • Liang et al., (2023) Liang, Z., Ma, X., Blanchet, J., Zhang, J., and Zhou, Z. (2023). Single-trajectory distributionally robust reinforcement learning. arXiv preprint arXiv:2301.11721.
  • Liao et al., (2022) Liao, P., Qi, Z., Wan, R., Klasnja, P., and Murphy, S. A. (2022). Batch policy learning in average reward Markov decision processes. Annals of Statistics, 50(6):3364.
  • Liu et al., (2019) Liu, S., Ngiam, K. Y., and Feng, M. (2019). Deep reinforcement learning for clinical decision support: a brief survey. arXiv preprint arXiv:1907.09475.
  • Liu et al., (2022) Liu, Z., Bai, Q., Blanchet, J., Dong, P., Xu, W., Zhou, Z., and Zhou, Z. (2022). Distributionally robust Q-learning. In International Conference on Machine Learning, pages 13623–13643. PMLR.
  • (55) Liu, Z. and Xu, P. (2024a). Distributionally robust off-dynamics reinforcement learning: Provable efficiency with linear function approximation. arXiv preprint arXiv:2402.15399.
  • (56) Liu, Z. and Xu, P. (2024b). Minimax optimal and computationally efficient algorithms for distributionally robust offline reinforcement learning. arXiv preprint arXiv:2403.09621.
  • Lu et al., (2024) Lu, M., Zhong, H., Zhang, T., and Blanchet, J. (2024). Distributionally robust reinforcement learning with interactive data collection: Fundamental hardness and near-optimal algorithm. arXiv preprint arXiv:2404.03578.
  • Ma et al., (2022) Ma, X., Liang, Z., Blanchet, J., Liu, M., Xia, L., Zhang, J., Zhao, Q., and Zhou, Z. (2022). Distributionally robust offline reinforcement learning with linear function approximation. arXiv preprint arXiv:2209.06620.
  • Mahmood et al., (2018) Mahmood, A. R., Korenkevych, D., Vasan, G., Ma, W., and Bergstra, J. (2018). Benchmarking reinforcement learning algorithms on real-world robots. In Conference on robot learning, pages 561–591. PMLR.
  • Mnih et al., (2013) Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
  • Moos et al., (2022) Moos, J., Hansel, K., Abdulsamad, H., Stark, S., Clever, D., and Peters, J. (2022). Robust reinforcement learning: A review of foundations and recent advances. Machine Learning and Knowledge Extraction, 4(1):276–315.
  • Nilim and El Ghaoui, (2005) Nilim, A. and El Ghaoui, L. (2005). Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798.
  • OpenAI, (2023) OpenAI (2023). GPT-4 technical report.
  • Pan et al., (2023) Pan, Y., Chen, Y., and Lin, F. (2023). Adjustable robust reinforcement learning for online 3d bin packing. arXiv preprint arXiv:2310.04323.
  • Panaganti and Kalathil, (2022) Panaganti, K. and Kalathil, D. (2022). Sample complexity of robust reinforcement learning with a generative model. In International Conference on Artificial Intelligence and Statistics, pages 9582–9602. PMLR.
  • Panaganti et al., (2022) Panaganti, K., Xu, Z., Kalathil, D., and Ghavamzadeh, M. (2022). Robust reinforcement learning using offline data. Advances in neural information processing systems, 35:32211–32224.
  • Panaganti et al., (2023) Panaganti, K., Xu, Z., Kalathil, D., and Ghavamzadeh, M. (2023). Bridging distributionally robust learning and offline RL: An approach to mitigate distribution shift and partial data coverage. arXiv preprint arXiv:2310.18434.
  • Park and Van Roy, (2015) Park, B. and Van Roy, B. (2015). Adaptive execution: Exploration and learning of price impact. Operations Research, 63(5):1058–1076.
  • Qiaoben et al., (2021) Qiaoben, Y., Zhou, X., Ying, C., and Zhu, J. (2021). Strategically-timed state-observation attacks on deep reinforcement learning agents. In ICML 2021 Workshop on Adversarial Machine Learning.
  • Qu et al., (2022) Qu, G., Wierman, A., and Li, N. (2022). Scalable reinforcement learning for multiagent networked systems. Operations Research, 70(6):3601–3628.
  • Rahimian and Mehrotra, (2019) Rahimian, H. and Mehrotra, S. (2019). Distributionally robust optimization: A review. arXiv preprint arXiv:1908.05659.
  • Ramesh et al., (2023) Ramesh, S. S., Sessa, P. G., Hu, Y., Krause, A., and Bogunovic, I. (2023). Distributionally robust model-based reinforcement learning with large state spaces. arXiv preprint arXiv:2309.02236.
  • Rashidinejad et al., (2021) Rashidinejad, P., Zhu, B., Ma, C., Jiao, J., and Russell, S. (2021). Bridging offline reinforcement learning and imitation learning: A tale of pessimism. Neural Information Processing Systems (NeurIPS).
  • Roy et al., (2017) Roy, A., Xu, H., and Pokutta, S. (2017). Reinforcement learning under model mismatch. Advances in neural information processing systems, 30.
  • Shi and Chi, (2022) Shi, L. and Chi, Y. (2022). Distributionally robust model-based offline reinforcement learning with near-optimal sample complexity. arXiv preprint arXiv:2208.05767.
  • Shi et al., (2022) Shi, L., Li, G., Wei, Y., Chen, Y., and Chi, Y. (2022). Pessimistic Q-learning for offline reinforcement learning: Towards optimal sample complexity. In Proceedings of the 39th International Conference on Machine Learning, volume 162, pages 19967–20025. PMLR.
  • Sidford et al., (2018) Sidford, A., Wang, M., Wu, X., Yang, L., and Ye, Y. (2018). Near-optimal time and sample complexities for solving Markov decision processes with a generative model. In Advances in Neural Information Processing Systems, pages 5186–5196.
  • Smirnova et al., (2019) Smirnova, E., Dohmatob, E., and Mary, J. (2019). Distributionally robust reinforcement learning. arXiv preprint arXiv:1902.08708.
  • Sun et al., (2021) Sun, K., Liu, Y., Zhao, Y., Yao, H., Jui, S., and Kong, L. (2021). Exploring the training robustness of distributional reinforcement learning against noisy state observations. arXiv preprint arXiv:2109.08776.
  • Tamar et al., (2014) Tamar, A., Mannor, S., and Xu, H. (2014). Scaling up robust MDPs using function approximation. In International conference on machine learning, pages 181–189. PMLR.
  • Tan et al., (2020) Tan, K. L., Esfandiari, Y., Lee, X. Y., and Sarkar, S. (2020). Robustifying reinforcement learning agents via action space adversarial training. In 2020 American control conference (ACC), pages 3959–3964. IEEE.
  • Tessler et al., (2019) Tessler, C., Efroni, Y., and Mannor, S. (2019). Action robust reinforcement learning and applications in continuous control. In International Conference on Machine Learning, pages 6215–6224. PMLR.
  • Tsybakov, (2009) Tsybakov, A. B. (2009). Introduction to nonparametric estimation, volume 11. Springer.
  • Uehara et al., (2022) Uehara, M., Shi, C., and Kallus, N. (2022). A review of off-policy evaluation in reinforcement learning. arXiv preprint arXiv:2212.06355.
  • Vershynin, (2018) Vershynin, R. (2018). High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press.
  • Wainwright, (2019) Wainwright, M. J. (2019). Stochastic approximation with cone-contractive operators: Sharp \ell_{\infty}-bounds for Q-learning. arXiv preprint arXiv:1905.06265.
  • Wang et al., (2024) Wang, H., Shi, L., and Chi, Y. (2024). Sample complexity of offline distributionally robust linear markov decision processes. arXiv preprint arXiv:2403.12946.
  • (88) Wang, K., Gadot, U., Kumar, N., Levy, K., and Mannor, S. (2023a). Robust reinforcement learning via adversarial kernel approximation. arXiv preprint arXiv:2306.05859.
  • (89) Wang, S., Si, N., Blanchet, J., and Zhou, Z. (2023b). A finite sample complexity bound for distributionally robust Q-learning. arXiv preprint arXiv:2302.13203.
  • (90) Wang, S., Si, N., Blanchet, J., and Zhou, Z. (2023c). On the foundation of distributionally robust reinforcement learning. arXiv preprint arXiv:2311.09018.
  • (91) Wang, S., Si, N., Blanchet, J., and Zhou, Z. (2023d). Sample complexity of variance-reduced distributionally robust Q-learning. arXiv preprint arXiv:2305.18420.
  • Wang and Zou, (2021) Wang, Y. and Zou, S. (2021). Online robust reinforcement learning with model uncertainty. Advances in Neural Information Processing Systems, 34.
  • Wiesemann et al., (2013) Wiesemann, W., Kuhn, D., and Rustem, B. (2013). Robust Markov decision processes. Mathematics of Operations Research, 38(1):153–183.
  • Wolff et al., (2012) Wolff, E. M., Topcu, U., and Murray, R. M. (2012). Robust control of uncertain Markov decision processes with temporal logic specifications. In 2012 IEEE 51st IEEE Conference on Decision and Control (CDC), pages 3372–3379. IEEE.
  • Woo et al., (2023) Woo, J., Joshi, G., and Chi, Y. (2023). The blessing of heterogeneity in federated Q-learning: Linear speedup and beyond. arXiv preprint arXiv:2305.10697.
  • Woo et al., (2024) Woo, J., Shi, L., Joshi, G., and Chi, Y. (2024). Federated offline reinforcement learning: Collaborative single-policy coverage suffices. arXiv preprint arXiv:2402.05876.
  • Xie et al., (2021) Xie, T., Jiang, N., Wang, H., Xiong, C., and Bai, Y. (2021). Policy finetuning: Bridging sample-efficient offline and online reinforcement learning. Advances in neural information processing systems, 34.
  • Xiong et al., (2022) Xiong, Z., Eappen, J., Zhu, H., and Jagannathan, S. (2022). Defending observation attacks in deep reinforcement learning via detection and denoising. arXiv preprint arXiv:2206.07188.
  • Xu and Mannor, (2012) Xu, H. and Mannor, S. (2012). Distributionally robust Markov decision processes. Mathematics of Operations Research, 37(2):288–300.
  • Xu et al., (2023) Xu, Z., Panaganti, K., and Kalathil, D. (2023). Improved sample complexity bounds for distributionally robust reinforcement learning. arXiv preprint arXiv:2303.02783.
  • Yan et al., (2022) Yan, Y., Li, G., Chen, Y., and Fan, J. (2022). The efficacy of pessimism in asynchronous Q-learning. arXiv preprint arXiv:2203.07368.
  • Yang et al., (2021) Yang, K., Yang, L., and Du, S. (2021). Q-learning with logarithmic regret. In International Conference on Artificial Intelligence and Statistics, pages 1576–1584. PMLR.
  • Yang et al., (2023) Yang, W., Wang, H., Kozuno, T., Jordan, S. M., and Zhang, Z. (2023). Avoiding model estimation in robust Markov decision processes with a generative model. arXiv preprint arXiv:2302.01248.
  • Yang et al., (2022) Yang, W., Zhang, L., and Zhang, Z. (2022). Toward theoretical understandings of robust Markov decision processes: Sample complexity and asymptotics. The Annals of Statistics, 50(6):3223–3248.
  • Yin et al., (2021) Yin, M., Bai, Y., and Wang, Y.-X. (2021). Near-optimal offline reinforcement learning via double variance reduction. arXiv preprint arXiv:2102.01748.
  • Zhang et al., (2021) Zhang, H., Chen, H., Boning, D., and Hsieh, C.-J. (2021). Robust reinforcement learning on state observations with learned optimal adversary. arXiv preprint arXiv:2101.08452.
  • (107) Zhang, H., Chen, H., Xiao, C., Li, B., Liu, M., Boning, D., and Hsieh, C.-J. (2020a). Robust deep reinforcement learning against adversarial perturbations on state observations. Advances in Neural Information Processing Systems, 33:21024–21037.
  • Zhang et al., (2023) Zhang, R., Hu, Y., and Li, N. (2023). Regularized robust MDPs and risk-sensitive MDPs: Equivalence, policy gradient, and sample complexity. arXiv preprint arXiv:2306.11626.
  • (109) Zhang, Z., Zhou, Y., and Ji, X. (2020b). Almost optimal model-free reinforcement learning via reference-advantage decomposition. Advances in Neural Information Processing Systems, 33.
  • Zhao et al., (2021) Zhao, H., Yu, Y., and Xu, K. (2021). Learning efficient online 3d bin packing on packing configuration trees. In International conference on learning representations.
  • Zhou et al., (2021) Zhou, Z., Bai, Q., Zhou, Z., Qiu, L., Blanchet, J., and Glynn, P. (2021). Finite-sample regret bound for distributionally robust offline tabular reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 3331–3339. PMLR.
  • Ziegler et al., (2019) Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano, P., and Irving, G. (2019). Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593.

Appendix A Proof of the preliminaries

A.1 Proof of Lemma 1 and Lemma 2

Proof of Lemma 1.

To begin with, applying (Iyengar,, 2005, Lemma 4.3), the term of interest obeys

inf𝒫𝒰σ(P)𝒫V\displaystyle\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(P)}\mathcal{P}V =maxμS,μ0{P(Vμ)σ(maxs{V(s)μ(s)}mins{V(s)μ(s)})},\displaystyle=\max_{\mu\in\mathbb{R}^{S},\mu\geq 0}\left\{P\left(V-\mu\right)-\sigma\left(\max_{s^{\prime}}\left\{V(s^{\prime})-\mu(s^{\prime})\right\}-\min_{s^{\prime}}\left\{V(s^{\prime})-\mu(s^{\prime})\right\}\right)\right\}, (100)

where μ(s)\mu(s^{\prime}) represents the ss^{\prime}-th entry of μS\mu\in\mathbb{R}^{S}. Denoting μ\mu^{\star} as the optimal dual solution, taking α=maxs{V(s)μ(s)}\alpha=\max_{s^{\prime}}\left\{V(s^{\prime})-\mu^{\star}(s^{\prime})\right\}, it is easily verified that μ\mu^{\star} obeys

μ(s)={V(s)α,if V(s)>α0,otherwise.\displaystyle\mu^{\star}(s)=\begin{cases}V(s)-\alpha,&\text{if }V(s)>\alpha\\ 0,&\text{otherwise}.\end{cases} (101)

Therefore, (100) can be solved by optimizing α\alpha as below (Iyengar,, 2005, Lemma 4.3):

inf𝒫𝒰σ(P)𝒫V=maxα[minsV(s),maxsV(s)]{P[V]ασ(αmins[V]α(s))}.\displaystyle\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(P)}\mathcal{P}V=\max_{\alpha\in\left[\min_{s}V(s),\max_{s}V(s)\right]}\left\{P\left[V\right]_{\alpha}-\sigma\left(\alpha-\min_{s^{\prime}}\left[V\right]_{\alpha}(s^{\prime})\right)\right\}. (102)
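To make the dual formulation (102) concrete, below is a minimal numerical sketch (not part of the analysis) showing how the robust inner expectation under the TV uncertainty set can be evaluated through a one-dimensional search over α\alpha. The helper name tv_dual_inner, the grid search over α\alpha, and the toy inputs are our own illustration choices rather than the procedure used in the paper.

```python
import numpy as np

def tv_dual_inner(P_row, V, sigma, n_grid=1001):
    """Evaluate inf_{P' in U^sigma(P_row)} P' V for the TV uncertainty set via the
    scalar dual (102): max_alpha { P [V]_alpha - sigma * (alpha - min_s [V]_alpha(s)) },
    where [V]_alpha clips V from above at alpha. A simple grid search over alpha is used."""
    alphas = np.linspace(V.min(), V.max(), n_grid)
    best = -np.inf
    for alpha in alphas:
        V_clip = np.minimum(V, alpha)                      # [V]_alpha
        best = max(best, P_row @ V_clip - sigma * (alpha - V_clip.min()))
    return best

# toy sanity check: with sigma = 0 the dual recovers the nominal expectation P_row @ V
rng = np.random.default_rng(0)
P_row = rng.dirichlet(np.ones(5))
V = rng.uniform(0.0, 10.0, size=5)
assert abs(tv_dual_inner(P_row, V, sigma=0.0) - P_row @ V) < 1e-8
print(tv_dual_inner(P_row, V, sigma=0.3))                  # never exceeds the nominal value
```
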
Proof of Lemma 2.

Due to strong duality (Iyengar,, 2005, Lemma 4.2), it holds that

inf𝒫𝒰σ(P)𝒫V\displaystyle\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(P)}\mathcal{P}V =maxμS,μ0{P(Vμ)σ𝖵𝖺𝗋P(Vμ)},\displaystyle=\max_{\mu\in\mathbb{R}^{S},\mu\geq 0}\left\{P\left(V-\mu\right)-\sqrt{\sigma\mathsf{Var}_{P}\left(V-\mu\right)}\right\}, (103)

and the optimal μ\mu^{\star} obeys

μ(s)={V(s)α,if V(s)>α0,otherwise.\displaystyle\mu^{\star}(s)=\begin{cases}V(s)-\alpha,&\text{if }V(s)>\alpha\\ 0,&\text{otherwise}.\end{cases} (104)

for some α[minsV(s),maxsV(s)]\alpha\in[\min_{s}V(s),\max_{s}V(s)]. As a result, solving (103) is equivalent to optimizing the scalar α\alpha as below:

inf𝒫𝒰σ(P)𝒫V=maxα[minsV(s),maxsV(s)]{P[V]ασ𝖵𝖺𝗋P([V]α)}.\displaystyle\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(P)}\mathcal{P}V=\max_{\alpha\in[\min_{s}V(s),\max_{s}V(s)]}\left\{P[V]_{\alpha}-\sqrt{\sigma\mathsf{Var}_{P}\left([V]_{\alpha}\right)}\right\}. (105)
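Analogously, the χ2\chi^{2} dual (105) reduces the inner minimization to a scalar problem in α\alpha. The snippet below is a minimal sketch under the same illustrative assumptions as before (a grid search over α\alpha and a hypothetical helper name chi2_dual_inner); it is not part of the proofs.

```python
import numpy as np

def chi2_dual_inner(P_row, V, sigma, n_grid=1001):
    """Evaluate inf_{P' in U^sigma(P_row)} P' V for the chi^2 uncertainty set via the
    scalar dual (105): max_alpha { P [V]_alpha - sqrt(sigma * Var_P([V]_alpha)) }."""
    alphas = np.linspace(V.min(), V.max(), n_grid)
    best = -np.inf
    for alpha in alphas:
        V_clip = np.minimum(V, alpha)                      # [V]_alpha
        mean = P_row @ V_clip
        var = P_row @ (V_clip ** 2) - mean ** 2            # Var_P([V]_alpha)
        best = max(best, mean - np.sqrt(sigma * max(var, 0.0)))
    return best

# toy sanity check: the robust value decreases as the uncertainty level sigma grows
rng = np.random.default_rng(0)
P_row = rng.dirichlet(np.ones(5))
V = rng.uniform(0.0, 10.0, size=5)
vals = [chi2_dual_inner(P_row, V, s) for s in (0.0, 0.5, 2.0)]
assert vals[0] >= vals[1] >= vals[2]
```
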

A.2 Proof of Lemma 5

Applying the γ\gamma-contraction property in Lemma 4 directly yields that for any t0t\geq 0,

Q^tQ^,σ=𝒯^σ(Q^t1)𝒯^σ(Q^,σ)\displaystyle\|\widehat{Q}_{t}-\widehat{Q}^{\star,\sigma}\|_{\infty}=\big{\|}\widehat{{\mathcal{T}}}^{\sigma}(\widehat{Q}_{t-1})-\widehat{{\mathcal{T}}}^{\sigma}(\widehat{Q}^{\star,\sigma})\big{\|}_{\infty} γQ^t1Q^,σ\displaystyle\leq\gamma\big{\|}\widehat{Q}_{t-1}-\widehat{Q}^{\star,\sigma}\big{\|}_{\infty}
γtQ^0Q^,σ=γtQ^,σγt1γ,\displaystyle\leq\cdots\leq\gamma^{t}\big{\|}\widehat{Q}_{0}-\widehat{Q}^{\star,\sigma}\big{\|}_{\infty}=\gamma^{t}\big{\|}\widehat{Q}^{\star,\sigma}\big{\|}_{\infty}\leq\frac{\gamma^{t}}{1-\gamma},

where the last inequality holds by the fact Q^,σ11γ\|\widehat{Q}^{\star,\sigma}\|_{\infty}\leq\frac{1}{1-\gamma} (see Lemma 4). In addition,

V^tV^,σ\displaystyle\big{\|}\widehat{V}_{t}-\widehat{V}^{\star,\sigma}\big{\|}_{\infty} =maxs𝒮maxa𝒜Q^t(s,a)maxa𝒜Q^,σ(s,a)Q^tQ^,σγt1γ,\displaystyle=\max_{s\in{\mathcal{S}}}\Big{\|}\max_{a\in\mathcal{A}}\widehat{Q}_{t}(s,a)-\max_{a\in\mathcal{A}}\widehat{Q}^{\star,\sigma}(s,a)\Big{\|}_{\infty}\leq\big{\|}\widehat{Q}_{t}-\widehat{Q}^{\star,\sigma}\big{\|}_{\infty}\leq\frac{\gamma^{t}}{1-\gamma},

where the penultimate inequality holds since the maximum operator is 11-Lipschitz. This completes the proof of (47).

We now move to establish (48). Note that there exists at least one state s0𝒮s_{0}\in{\mathcal{S}} that is associated with the maximum of the value gap, i.e.,

V^,σV^π^,σ=V^,σ(s0)V^π^,σ(s0)V^,σ(s)V^π^,σ(s),s𝒮.\displaystyle\big{\|}\widehat{V}^{\star,\sigma}-\widehat{V}^{\widehat{\pi},\sigma}\big{\|}_{\infty}=\widehat{V}^{\star,\sigma}(s_{0})-\widehat{V}^{\widehat{\pi},\sigma}(s_{0})\geq\widehat{V}^{\star,\sigma}(s)-\widehat{V}^{\widehat{\pi},\sigma}(s),\qquad\forall s\in{\mathcal{S}}.

Recall π^\widehat{\pi}^{\star} is the optimal robust policy for the empirical RMDP ^𝗋𝗈𝖻\widehat{\mathcal{M}}_{\mathsf{rob}}. For convenience, we denote a1=π^(s0)a_{1}=\widehat{\pi}^{\star}(s_{0}) and a2=π^(s0)a_{2}=\widehat{\pi}(s_{0}). Then, since π^\widehat{\pi} is the greedy policy w.r.t. Q^T\widehat{Q}_{T}, one has

r(s0,a1)+γinf𝒫𝒰σ(P^s0,a10)𝒫V^T1=Q^T(s0,a1)Q^T(s0,a2)=r(s0,a2)+γinf𝒫𝒰σ(P^s0,a20)𝒫V^T1.\displaystyle r(s_{0},a_{1})+\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(\widehat{P}^{0}_{s_{0},a_{1}})}\mathcal{P}\widehat{V}_{T-1}=\widehat{Q}_{T}(s_{0},a_{1})\leq\widehat{Q}_{T}(s_{0},a_{2})=r(s_{0},a_{2})+\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(\widehat{P}^{0}_{s_{0},a_{2}})}\mathcal{P}\widehat{V}_{T-1}. (106)

Recalling the notation in (37), the above fact and (47) altogether yield

r(s0,a1)+γP^s0,a1V^T1(V^,σε𝗈𝗉𝗍1)\displaystyle r(s_{0},a_{1})+\gamma\widehat{P}_{s_{0},a_{1}}^{\widehat{V}_{T-1}}\left(\widehat{V}^{\star,\sigma}-\varepsilon_{\mathsf{opt}}{1}\right) r(s0,a1)+γP^s0,a1V^T1V^T1\displaystyle\leq r(s_{0},a_{1})+\gamma\widehat{P}_{s_{0},a_{1}}^{\widehat{V}_{T-1}}\widehat{V}_{T-1}
r(s0,a2)+γinf𝒫𝒰σ(P^s0,a20)𝒫V^T1\displaystyle\leq r(s_{0},a_{2})+\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(\widehat{P}^{0}_{s_{0},a_{2}})}\mathcal{P}\widehat{V}_{T-1}
(i)r(s0,a2)+γP^s0,a2V^π^,σV^T1\displaystyle\overset{\mathrm{(i)}}{\leq}r(s_{0},a_{2})+\gamma\widehat{P}_{s_{0},a_{2}}^{\widehat{V}^{\widehat{\pi},\sigma}}\widehat{V}_{T-1}
r(s0,a2)+γP^s0,a2V^π^,σ(V^,σ+ε𝗈𝗉𝗍1),\displaystyle\leq r(s_{0},a_{2})+\gamma\widehat{P}_{s_{0},a_{2}}^{\widehat{V}^{\widehat{\pi},\sigma}}\left(\widehat{V}^{\star,\sigma}+\varepsilon_{\mathsf{opt}}{1}\right), (107)

where (i) follows from the optimality criteria. The term of interest can be controlled as

V^,σV^π^,σ\displaystyle\big{\|}\widehat{V}^{\star,\sigma}-\widehat{V}^{\widehat{\pi},\sigma}\big{\|}_{\infty}
=V^,σ(s0)V^π^,σ(s0)\displaystyle=\widehat{V}^{\star,\sigma}(s_{0})-\widehat{V}^{\widehat{\pi},\sigma}(s_{0})
=r(s0,a1)+γinf𝒫𝒰σ(P^s0,a10)𝒫V^,σ(r(s0,a2)+γinf𝒫𝒰σ(P^s0,a20)𝒫V^π^,σ)\displaystyle=r(s_{0},a_{1})+\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(\widehat{P}^{0}_{s_{0},a_{1}})}\mathcal{P}\widehat{V}^{\star,\sigma}-\bigg{(}r(s_{0},a_{2})+\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(\widehat{P}^{0}_{s_{0},a_{2}})}\mathcal{P}\widehat{V}^{\widehat{\pi},\sigma}\bigg{)}
=r(s0,a1)r(s0,a2)+γ(inf𝒫𝒰σ(P^s0,a10)𝒫V^,σinf𝒫𝒰σ(P^s0,a20)𝒫V^π^,σ)\displaystyle=r(s_{0},a_{1})-r(s_{0},a_{2})+\gamma\bigg{(}\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(\widehat{P}^{0}_{s_{0},a_{1}})}\mathcal{P}\widehat{V}^{\star,\sigma}-\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(\widehat{P}^{0}_{s_{0},a_{2}})}\mathcal{P}\widehat{V}^{\widehat{\pi},\sigma}\bigg{)}
(i)2γε𝗈𝗉𝗍+γ(P^s0,a2V^π^,σV^,σP^s0,a1V^T1V^,σ+inf𝒫𝒰σ(P^s0,a10)𝒫V^,σinf𝒫𝒰σ(P^s0,a20)𝒫V^π^,σ)\displaystyle\overset{\mathrm{(i)}}{\leq}2\gamma\varepsilon_{\mathsf{opt}}+\gamma\bigg{(}\widehat{P}_{s_{0},a_{2}}^{\widehat{V}^{\widehat{\pi},\sigma}}\widehat{V}^{\star,\sigma}-\widehat{P}_{s_{0},a_{1}}^{\widehat{V}_{T-1}}\widehat{V}^{\star,\sigma}+\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(\widehat{P}^{0}_{s_{0},a_{1}})}\mathcal{P}\widehat{V}^{\star,\sigma}-\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(\widehat{P}^{0}_{s_{0},a_{2}})}\mathcal{P}\widehat{V}^{\widehat{\pi},\sigma}\bigg{)}
=2γε𝗈𝗉𝗍+γ(P^s0,a2V^π^,σV^,σinf𝒫𝒰σ(P^s0,a20)𝒫V^π^,σ)+γ(inf𝒫𝒰σ(P^s0,a10)𝒫V^,σP^s0,a1V^T1V^,σ)\displaystyle=2\gamma\varepsilon_{\mathsf{opt}}+\gamma\bigg{(}\widehat{P}_{s_{0},a_{2}}^{\widehat{V}^{\widehat{\pi},\sigma}}\widehat{V}^{\star,\sigma}-\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(\widehat{P}^{0}_{s_{0},a_{2}})}\mathcal{P}\widehat{V}^{\widehat{\pi},\sigma}\bigg{)}+\gamma\bigg{(}\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(\widehat{P}^{0}_{s_{0},a_{1}})}\mathcal{P}\widehat{V}^{\star,\sigma}-\widehat{P}_{s_{0},a_{1}}^{\widehat{V}_{T-1}}\widehat{V}^{\star,\sigma}\bigg{)}
(ii)2γε𝗈𝗉𝗍+γP^s0,a2V^π^,σ(V^,σV^π^,σ)+γ(P^s0,a1V^T1V^,σP^s0,a1V^T1V^,σ)\displaystyle\overset{\mathrm{(ii)}}{\leq}2\gamma\varepsilon_{\mathsf{opt}}+\gamma\widehat{P}_{s_{0},a_{2}}^{\widehat{V}^{\widehat{\pi},\sigma}}\big{(}\widehat{V}^{\star,\sigma}-\widehat{V}^{\widehat{\pi},\sigma}\big{)}+\gamma\bigg{(}\widehat{P}_{s_{0},a_{1}}^{\widehat{V}_{T-1}}\widehat{V}^{\star,\sigma}-\widehat{P}_{s_{0},a_{1}}^{\widehat{V}_{T-1}}\widehat{V}^{\star,\sigma}\bigg{)}
2γε𝗈𝗉𝗍+γV^,σV^π^,σ,\displaystyle\leq 2\gamma\varepsilon_{\mathsf{opt}}+\gamma\big{\|}\widehat{V}^{\star,\sigma}-\widehat{V}^{\widehat{\pi},\sigma}\big{\|}_{\infty}, (108)

where (i)\mathrm{(i)} holds by plugging in (107), and (ii) follows from inf𝒫𝒰σ(P^s0,a10)𝒫V^,σ𝒫V^,σ\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(\widehat{P}^{0}_{s_{0},a_{1}})}\mathcal{P}\widehat{V}^{\star,\sigma}\leq\mathcal{P}\widehat{V}^{\star,\sigma} for any 𝒫𝒰σ(P^s0,a10)\mathcal{P}\in\mathcal{U}^{\sigma}(\widehat{P}^{0}_{s_{0},a_{1}}), together with the fact that \widehat{P}_{s_{0},a_{2}}^{\widehat{V}^{\widehat{\pi},\sigma}} attains the infimum \inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(\widehat{P}^{0}_{s_{0},a_{2}})}\mathcal{P}\widehat{V}^{\widehat{\pi},\sigma}. Rearranging (108) leads to

V^,σV^π^,σ2γε𝗈𝗉𝗍1γ.\big{\|}\widehat{V}^{\star,\sigma}-\widehat{V}^{\widehat{\pi},\sigma}\big{\|}_{\infty}\leq\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma}.
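For illustration only, the following self-contained sketch runs distributionally robust value iteration on a small randomly generated RMDP with a TV uncertainty set and compares the observed error against the bound γt/(1γ)\gamma^{t}/(1-\gamma) from (47). The function names drvi and tv_dual, the grid-search dual solver, and the toy instance are our own assumptions rather than the paper's implementation.

```python
import numpy as np

def tv_dual(P_row, V, sigma, n_grid=501):
    # scalar dual (102) for the TV uncertainty set (cf. Appendix A.1), via grid search over alpha
    alphas = np.linspace(V.min(), V.max(), n_grid)
    clips = np.minimum(V[None, :], alphas[:, None])        # [V]_alpha for every alpha on the grid
    vals = clips @ P_row - sigma * (alphas - clips.min(axis=1))
    return vals.max()

def drvi(P, r, gamma, sigma, T):
    # distributionally robust value iteration: Q_t = r + gamma * inf_{P' in U^sigma(P)} P' V_{t-1}
    S, A = r.shape
    Q = np.zeros((S, A))
    for _ in range(T):
        V = Q.max(axis=1)
        Q = np.array([[r[s, a] + gamma * tv_dual(P[s, a], V, sigma) for a in range(A)]
                      for s in range(S)])
    return Q

rng = np.random.default_rng(1)
S, A, gamma, sigma = 4, 3, 0.9, 0.2
P = rng.dirichlet(np.ones(S), size=(S, A))                 # nominal kernel, P[s, a] in Delta(S)
r = rng.uniform(0.0, 1.0, size=(S, A))                     # rewards in [0, 1]
Q_star = drvi(P, r, gamma, sigma, T=2000)                  # proxy for the robust fixed point
for t in (10, 50, 100):
    gap = np.abs(drvi(P, r, gamma, sigma, t) - Q_star).max()
    print(t, gap, gamma ** t / (1 - gamma))                # observed gap vs. the bound in (47)
```
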

Appendix B Proof of the auxiliary lemmas for Theorem 1

B.1 Proof of Lemma 6

To begin, note that for any Vπ,σV^{\pi,\sigma}, there exists at least one state s0𝒮s_{0}\in{\mathcal{S}} such that Vπ,σ(s0)=mins𝒮Vπ,σ(s)V^{\pi,\sigma}(s_{0})=\min_{s\in{\mathcal{S}}}V^{\pi,\sigma}(s). With this in mind, for any policy π\pi, one has, by the definition in (5) and the Bellman equation (7a),

maxs𝒮Vπ,σ(s)\displaystyle\max_{s\in{\mathcal{S}}}V^{\pi,\sigma}(s) =maxs𝒮𝔼aπ(|s)[r(s,a)+γinf𝒫𝒰σ(Ps,a)𝒫Vπ,σ]\displaystyle=\max_{s\in{\mathcal{S}}}\mathbb{E}_{a\sim\pi(\cdot\,|\,s)}\Big{[}r(s,a)+\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(P_{s,a})}\mathcal{P}V^{\pi,\sigma}\Big{]}
max(s,a)𝒮×𝒜(1+γinf𝒫𝒰σ(Ps,a)𝒫Vπ,σ),\displaystyle\leq\max_{(s,a)\in{\mathcal{S}}\times\mathcal{A}}\Big{(}1+\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(P_{s,a})}\mathcal{P}V^{\pi,\sigma}\Big{)},

where the second line holds since the reward function r(s,a)[0,1]r(s,a)\in[0,1] for all (s,a)𝒮×𝒜(s,a)\in{\mathcal{S}}\times\mathcal{A}. To continue, note that for any (s,a)𝒮×𝒜(s,a)\in{\mathcal{S}}\times\mathcal{A}, there exists some P~s,aS\widetilde{P}_{s,a}\in\mathbb{R}^{S} constructed by reducing the values of some elements of Ps,aP_{s,a} to obey Ps,aP~s,a0P_{s,a}\geq\widetilde{P}_{s,a}\geq 0 and s(Ps,a(s)P~s,a(s))=σ\sum_{s^{\prime}}(P_{s,a}(s^{\prime})-\widetilde{P}_{s,a}(s^{\prime}))=\sigma. This implies P~s,a+σes0𝒰σ(Ps,a)\widetilde{P}_{s,a}+\sigma{e}_{s_{0}}^{\top}\in\mathcal{U}^{\sigma}(P_{s,a}), where es0e_{s_{0}} is the standard basis vector supported on s0s_{0}, since 12P~s,a+σes0Ps,a112P~s,aPs,a1+σ2=σ\frac{1}{2}\|\widetilde{P}_{s,a}+\sigma{e}_{s_{0}}^{\top}-P_{s,a}\|_{1}\leq\frac{1}{2}\|\widetilde{P}_{s,a}-P_{s,a}\|_{1}+\frac{\sigma}{2}=\sigma. Consequently,

inf𝒫𝒰σ(Ps,a)𝒫Vπ,σ(P~s,a+σes0)Vπ,σ\displaystyle\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(P_{s,a})}\mathcal{P}V^{\pi,\sigma}\leq\left(\widetilde{P}_{s,a}+\sigma{e}_{s_{0}}^{\top}\right)V^{\pi,\sigma} P~s,a1Vπ,σ+σVπ,σ(s0)\displaystyle\leq\big{\|}\widetilde{P}_{s,a}\big{\|}_{1}\big{\|}V^{\pi,\sigma}\big{\|}_{\infty}+\sigma V^{\pi,\sigma}(s_{0})
(1σ)maxs𝒮Vπ,σ(s)+σmins𝒮Vπ,σ(s),\displaystyle\leq\left(1-\sigma\right)\max_{s\in{\mathcal{S}}}V^{\pi,\sigma}(s)+\sigma\min_{s\in{\mathcal{S}}}V^{\pi,\sigma}(s), (109)

where the second inequality holds by P~s,a1=sP~s,a(s)=s(Ps,a(s)P~s,a(s))+sPs,a(s)=1σ\big{\|}\widetilde{P}_{s,a}\big{\|}_{1}=\sum_{s^{\prime}}\widetilde{P}_{s,a}(s^{\prime})=-\sum_{s^{\prime}}\left(P_{s,a}(s^{\prime})-\widetilde{P}_{s,a}(s^{\prime})\right)+\sum_{s^{\prime}}P_{s,a}(s^{\prime})=1-\sigma. Plugging this back to the previous relation gives

maxs𝒮Vπ,σ(s)\displaystyle\max_{s\in{\mathcal{S}}}V^{\pi,\sigma}(s) 1+γ(1σ)maxs𝒮Vπ,σ(s)+γσmins𝒮Vπ,σ(s),\displaystyle\leq 1+\gamma\left(1-\sigma\right)\max_{s\in{\mathcal{S}}}V^{\pi,\sigma}(s)+\gamma\sigma\min_{s\in{\mathcal{S}}}V^{\pi,\sigma}(s),

which, by rearranging terms, immediately yields

maxs𝒮Vπ,σ(s)\displaystyle\max_{s\in{\mathcal{S}}}V^{\pi,\sigma}(s) 1+γσmins𝒮Vπ,σ(s)1γ(1σ)\displaystyle\leq\frac{1+\gamma\sigma\min_{s\in{\mathcal{S}}}V^{\pi,\sigma}(s)}{1-\gamma\left(1-\sigma\right)}
1(1γ)+γσ+mins𝒮Vπ,σ(s)1γmax{1γ,σ}+mins𝒮Vπ,σ(s).\displaystyle\leq\frac{1}{(1-\gamma)+\gamma\sigma}+\min_{s\in{\mathcal{S}}}V^{\pi,\sigma}(s)\leq\frac{1}{\gamma\max\{1-\gamma,\sigma\}}+\min_{s\in{\mathcal{S}}}V^{\pi,\sigma}(s).

B.2 Proof of Lemma 7

Observing that each row of PπP_{\pi} belongs to Δ(S)\Delta(S), it can be directly verified that each row of (1γ)(IγPπ)1(1-\gamma)\left(I-\gamma P_{\pi}\right)^{-1} falls into Δ(S)\Delta(S). As a result,

(IγPπ)1VarPπ(Vπ,P)\displaystyle\left(I-\gamma P_{\pi}\right)^{-1}\sqrt{\mathrm{Var}_{P_{\pi}}(V^{\pi,P})} =11γ(1γ)(IγPπ)1VarPπ(Vπ,P)\displaystyle=\frac{1}{1-\gamma}(1-\gamma)\left(I-\gamma P_{\pi}\right)^{-1}\sqrt{\mathrm{Var}_{P_{\pi}}(V^{\pi,P})}
(i)11γ(1γ)(IγPπ)1VarPπ(Vπ,P)\displaystyle\overset{\mathrm{(i)}}{\leq}\frac{1}{1-\gamma}\sqrt{(1-\gamma)\left(I-\gamma P_{\pi}\right)^{-1}\mathrm{Var}_{P_{\pi}}(V^{\pi,P})}
=11γt=0γt(Pπ)tVarPπ(Vπ,P),\displaystyle=\sqrt{\frac{1}{1-\gamma}}\sqrt{\sum_{t=0}^{\infty}\gamma^{t}\left(P_{\pi}\right)^{t}\mathrm{Var}_{P_{\pi}}(V^{\pi,P})}, (110)

where (i) holds by Jensen’s inequality.

To continue, denote the minimum value of VV as Vmin=mins𝒮Vπ,P(s)V_{\min}=\min_{s\in{\mathcal{S}}}V^{\pi,P}(s) and let VVπ,PVmin1V^{\prime}\coloneqq V^{\pi,P}-V_{\min}1. We control VarPπ(Vπ,P)\mathrm{Var}_{P_{\pi}}(V^{\pi,P}) as follows:

VarPπ(Vπ,P)\displaystyle\mathrm{Var}_{P_{\pi}}(V^{\pi,P})
=(i)VarPπ(V)=Pπ(VV)(PπV)(PπV)\displaystyle\overset{\mathrm{(i)}}{=}\mathrm{Var}_{P_{\pi}}(V^{\prime})=P_{\pi}\left(V^{\prime}\circ V^{\prime}\right)-\big{(}P_{\pi}V^{\prime}\big{)}\circ\big{(}P_{\pi}V^{\prime}\big{)}
=(ii)Pπ(VV)1γ2(Vrπ+(1γ)Vmin1)(Vrπ+(1γ)Vmin1)\displaystyle\overset{\mathrm{(ii)}}{=}P_{\pi}\left(V^{\prime}\circ V^{\prime}\right)-\frac{1}{\gamma^{2}}\left(V^{\prime}-r_{\pi}+(1-\gamma)V_{\min}1\right)\circ\left(V^{\prime}-r_{\pi}+(1-\gamma)V_{\min}1\right)
=Pπ(VV)1γ2VV+2γ2V(rπ(1γ)Vmin1)\displaystyle=P_{\pi}\left(V^{\prime}\circ V^{\prime}\right)-\frac{1}{\gamma^{2}}V^{\prime}\circ V^{\prime}+\frac{2}{\gamma^{2}}V^{\prime}\circ\left(r_{\pi}-(1-\gamma)V_{\min}1\right)
1γ2(rπ(1γ)Vmin1)(rπ(1γ)Vmin1)\displaystyle\quad-\frac{1}{\gamma^{2}}\left(r_{\pi}-(1-\gamma)V_{\min}1\right)\circ\left(r_{\pi}-(1-\gamma)V_{\min}1\right)
Pπ(VV)1γVV+2γ2V1,\displaystyle\leq P_{\pi}\left(V^{\prime}\circ V^{\prime}\right)-\frac{1}{\gamma}V^{\prime}\circ V^{\prime}+\frac{2}{\gamma^{2}}\|V^{\prime}\|_{\infty}1, (111)

where (i) holds by the fact that VarPπ(Vπ,Pb1)=VarPπ(Vπ,P)\mathrm{Var}_{P_{\pi}}(V^{\pi,P}-b1)=\mathrm{Var}_{P_{\pi}}(V^{\pi,P}) for any scalar bb and Vπ,PSV^{\pi,P}\in\mathbb{R}^{S}, (ii) follows from V=rπ+γPπVπ,PVmin1=rπ(1γ)Vmin1+γPπVV^{\prime}=r_{\pi}+\gamma P_{\pi}V^{\pi,P}-V_{\min}1=r_{\pi}-(1-\gamma)V_{\min}1+\gamma P_{\pi}V^{\prime}, and the last line arises from 1γ2VV1γVV\frac{1}{\gamma^{2}}V^{\prime}\circ V^{\prime}\geq\frac{1}{\gamma}V^{\prime}\circ V^{\prime} and rπ(1γ)Vmin11\|r_{\pi}-(1-\gamma)V_{\min}1\|_{\infty}\leq 1. Plugging (111) back to (110) leads to

(IγPπ)1VarPπ(Vπ,P)11γt=0γt(Pπ)t(Pπ(VV)1γVV+2γ2V1)\displaystyle\left(I-\gamma P_{\pi}\right)^{-1}\sqrt{\mathrm{Var}_{P_{\pi}}(V^{\pi,P})}\leq\sqrt{\frac{1}{1-\gamma}}\sqrt{\sum_{t=0}^{\infty}\gamma^{t}\left(P_{\pi}\right)^{t}\left(P_{\pi}\left(V^{\prime}\circ V^{\prime}\right)-\frac{1}{\gamma}V^{\prime}\circ V^{\prime}+\frac{2}{\gamma^{2}}\|V^{\prime}\|_{\infty}1\right)}
(i)11γ|t=0γt(Pπ)t(Pπ(VV)1γVV)|+11γt=0γt(Pπ)t2γ2V1\displaystyle\overset{\mathrm{(i)}}{\leq}\sqrt{\frac{1}{1-\gamma}}\sqrt{\bigg{|}\sum_{t=0}^{\infty}\gamma^{t}\left(P_{\pi}\right)^{t}\left(P_{\pi}\left(V^{\prime}\circ V^{\prime}\right)-\frac{1}{\gamma}V^{\prime}\circ V^{\prime}\right)\bigg{|}}+\sqrt{\frac{1}{1-\gamma}}\sqrt{\sum_{t=0}^{\infty}\gamma^{t}\left(P_{\pi}\right)^{t}\frac{2}{\gamma^{2}}\|V^{\prime}\|_{\infty}1}
11γ|(t=0γt(Pπ)t+1t=0γt1(Pπ)t)(VV)|+2V1γ2(1γ)2\displaystyle\leq\sqrt{\frac{1}{1-\gamma}}\sqrt{\bigg{|}\bigg{(}\sum_{t=0}^{\infty}\gamma^{t}\left(P_{\pi}\right)^{t+1}-\sum_{t=0}^{\infty}\gamma^{t-1}\left(P_{\pi}\right)^{t}\bigg{)}\left(V^{\prime}\circ V^{\prime}\right)\bigg{|}}+\sqrt{\frac{2\|V^{\prime}\|_{\infty}1}{\gamma^{2}(1-\gamma)^{2}}}
(ii)V21γ(1γ)+2V1γ2(1γ)2\displaystyle\overset{\mathrm{(ii)}}{\leq}\sqrt{\frac{\|V^{\prime}\|_{\infty}^{2}1}{\gamma(1-\gamma)}}+\sqrt{\frac{2\|V^{\prime}\|_{\infty}1}{\gamma^{2}(1-\gamma)^{2}}}
8V1γ2(1γ)2,\displaystyle\leq\sqrt{\frac{8\|V^{\prime}\|_{\infty}1}{\gamma^{2}(1-\gamma)^{2}}}, (112)

where (i) holds by the triangle inequality, (ii) holds since the two sums telescope, namely \sum_{t=0}^{\infty}\gamma^{t}\left(P_{\pi}\right)^{t+1}-\sum_{t=0}^{\infty}\gamma^{t-1}\left(P_{\pi}\right)^{t}=-\gamma^{-1}I, and the last inequality holds by V11γ\|V^{\prime}\|_{\infty}\leq\frac{1}{1-\gamma}.

B.3 Proof of Lemma 8

Step 1: controlling V^π,σVπ,σ\|\widehat{V}^{\pi^{\star},\sigma}-V^{\pi^{\star},\sigma}\|_{\infty}: bounding the first term in (56).

To control the two terms in (56), we first introduce the following lemma whose proof is postponed to Appendix B.5.

Lemma 11.

Consider any δ(0,1)\delta\in(0,1). Setting Nlog(18SANδ)N\geq\log(\frac{18SAN}{\delta}), with probability at least 1δ1-\delta, one has

|P¯^π,VVπ,σP¯π,VVπ,σ|\displaystyle\Big{|}\underline{\widehat{P}}^{\pi^{\star},V}V^{\pi^{\star},\sigma}-\underline{P}^{\pi^{\star},V}V^{\pi^{\star},\sigma}\Big{|} 2log(18SANδ)NVarPπ(V,σ)+log(18SANδ)N(1γ)1\displaystyle\leq 2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{N}}\sqrt{\mathrm{Var}_{P^{\pi^{\star}}}(V^{\star,\sigma})}+\frac{\log(\frac{18SAN}{\delta})}{N(1-\gamma)}1
3log(18SANδ)(1γ)2N1,\displaystyle\leq 3\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}}1, (113)

where VarPπ(V,σ)\mathrm{Var}_{P^{\pi^{\star}}}(V^{\star,\sigma}) is defined in (36).

Armed with the above lemma, now we control the first term on the right hand side of (56) as follows:

(IγP¯^π,V)1(P¯^π,VVπ,σP¯π,VVπ,σ)\displaystyle\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{-1}\Big{(}\underline{\widehat{P}}^{\pi^{\star},V}V^{\pi^{\star},\sigma}-\underline{P}^{\pi^{\star},V}V^{\pi^{\star},\sigma}\Big{)}
(i)(IγP¯^π,V)1P¯^π,VVπ,σP¯π,VVπ,σ\displaystyle\overset{\mathrm{(i)}}{\leq}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{-1}\Big{\|}\underline{\widehat{P}}^{\pi^{\star},V}V^{\pi^{\star},\sigma}-\underline{P}^{\pi^{\star},V}V^{\pi^{\star},\sigma}\Big{\|}_{\infty}
(ii)(IγP¯^π,V)1(2log(18SANδ)NVarPπ(V,σ)+log(18SANδ)N(1γ)1)\displaystyle\overset{\mathrm{(ii)}}{\leq}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{-1}\bigg{(}2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{N}}\sqrt{\mathrm{Var}_{P^{\pi^{\star}}}(V^{\star,\sigma})}+\frac{\log(\frac{18SAN}{\delta})}{N(1-\gamma)}1\bigg{)}
log(18SANδ)N(1γ)(IγP¯^π,V)11+2log(18SANδ)N(IγP¯^π,V)1VarP¯^π,V(V,σ)=:𝒞1\displaystyle\leq\frac{\log(\frac{18SAN}{\delta})}{N(1-\gamma)}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{-1}1+\underbrace{2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{N}}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{-1}\sqrt{\mathrm{Var}_{\underline{\widehat{P}}^{\pi^{\star},V}}(V^{\star,\sigma})}}_{=:\mathcal{C}_{1}}
+2log(18SANδ)N(IγP¯^π,V)1|VarP^π(V,σ)VarP¯^π,V(V,σ)|=:𝒞2\displaystyle\quad+\underbrace{2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{N}}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{-1}\sqrt{\left|\mathrm{Var}_{\widehat{P}^{\pi^{\star}}}(V^{\star,\sigma})-\mathrm{Var}_{\underline{\widehat{P}}^{\pi^{\star},V}}(V^{\star,\sigma})\right|}}_{=:\mathcal{C}_{2}}
+2log(18SANδ)N(IγP¯^π,V)1(VarPπ(V,σ)VarP^π(V,σ))=:𝒞3,\displaystyle\quad+\underbrace{2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{N}}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{-1}\Big{(}\sqrt{\mathrm{Var}_{P^{\pi^{\star}}}(V^{\star,\sigma})}-\sqrt{\mathrm{Var}_{\widehat{P}^{\pi^{\star}}}(V^{\star,\sigma})}\Big{)}}_{=:\mathcal{C}_{3}}, (114)

where (i) holds by (IγP¯^π,V)10\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{-1}\geq 0, (ii) follows from Lemma 11, and the last inequality arises from

VarPπ(V,σ)=(VarPπ(V,σ)VarP^π(V,σ))+VarP^π(V,σ)\displaystyle\sqrt{\mathrm{Var}_{P^{\pi^{\star}}}(V^{\star,\sigma})}=\left(\sqrt{\mathrm{Var}_{P^{\pi^{\star}}}(V^{\star,\sigma})}-\sqrt{\mathrm{Var}_{\widehat{P}^{\pi^{\star}}}(V^{\star,\sigma})}\right)+\sqrt{\mathrm{Var}_{\widehat{P}^{\pi^{\star}}}(V^{\star,\sigma})}
(VarPπ(V,σ)VarP^π(V,σ))+|VarP^π(V,σ)VarP¯^π,V(V,σ)|+VarP¯^π,V(V,σ)\displaystyle\leq\left(\sqrt{\mathrm{Var}_{P^{\pi^{\star}}}(V^{\star,\sigma})}-\sqrt{\mathrm{Var}_{\widehat{P}^{\pi^{\star}}}(V^{\star,\sigma})}\right)+\sqrt{\left|\mathrm{Var}_{\widehat{P}^{\pi^{\star}}}(V^{\star,\sigma})-\mathrm{Var}_{\underline{\widehat{P}}^{\pi^{\star},V}}(V^{\star,\sigma})\right|}+\sqrt{\mathrm{Var}_{\underline{\widehat{P}}^{\pi^{\star},V}}(V^{\star,\sigma})}

by applying the triangle inequality.

To continue, observing that each row of P¯^π,V\underline{\widehat{P}}^{\pi^{\star},V} is a probability distribution whose entries sum to 11, we arrive at

(IγP¯^π,V)11=(I+t=1γt(P¯^π,V)t)1=11γ1.\displaystyle\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{-1}1=\Big{(}I+\sum_{t=1}^{\infty}\gamma^{t}\Big{(}\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{t}\Big{)}1=\frac{1}{1-\gamma}1. (115)

Armed with this fact, we shall control the other three terms 𝒞1,𝒞2,𝒞3\mathcal{C}_{1},\mathcal{C}_{2},\mathcal{C}_{3} in (114) separately.

  • Consider 𝒞1\mathcal{C}_{1}. We first introduce the following lemma, whose proof is postponed to Appendix B.6.

    Lemma 12.

    Consider any δ(0,1)\delta\in(0,1). With probability at least 1δ1-\delta, one has

    (IγP¯^π,V)1VarP¯^π,V(V,σ)4(1+log(18SANδ)(1γ)2N)γ3(1γ)2max{1γ,σ}14(1+log(18SANδ)(1γ)2N)γ3(1γ)31.\displaystyle\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{-1}\sqrt{\mathrm{Var}_{\underline{\widehat{P}}^{\pi^{\star},V}}(V^{\star,\sigma})}\leq 4\sqrt{\frac{\left(1+\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}}\right)}{\gamma^{3}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}}}1\leq 4\sqrt{\frac{\left(1+\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}}\right)}{\gamma^{3}(1-\gamma)^{3}}}1.

    Applying Lemma 12 and inserting back to (114) leads to

    𝒞1\displaystyle\mathcal{C}_{1} =2log(18SANδ)N(IγP¯^π,V)1VarP¯^π,V(V,σ)\displaystyle=2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{N}}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{-1}\sqrt{\mathrm{Var}_{\underline{\widehat{P}}^{\pi^{\star},V}}(V^{\star,\sigma})}
    8log(18SANδ)γ3(1γ)2max{1γ,σ}N(1+log(18SANδ)(1γ)2N)1.\displaystyle\leq 8\sqrt{\frac{\log(\frac{18SAN}{\delta})}{\gamma^{3}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}\bigg{(}1+\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}}\bigg{)}}1. (116)
  • Consider 𝒞2\mathcal{C}_{2}. First, denote VV,σmins𝒮V,σ(s)1V^{\prime}\coloneqq V^{\star,\sigma}-\min_{s^{\prime}\in{\mathcal{S}}}V^{\star,\sigma}(s^{\prime})1. By Lemma 6, it follows that

    0V1γmax{1γ,σ}1.\displaystyle{0}\leq V^{\prime}\leq\frac{1}{\gamma\max\{1-\gamma,\sigma\}}1. (117)

    Then, for all (s,a)𝒮×𝒜(s,a)\in{\mathcal{S}}\times\mathcal{A}, any Ps,aΔ(𝒮)P_{s,a}\in\Delta({\mathcal{S}}), and any P~s,a𝒰σ(Ps,a)\widetilde{P}_{s,a}\in\mathcal{U}^{\sigma}(P_{s,a}), we have:

    |𝖵𝖺𝗋P~s,a(V,σ)𝖵𝖺𝗋Ps,a(V,σ)|\displaystyle\big{|}\mathsf{Var}_{\widetilde{P}_{s,a}}(V^{\star,\sigma})-\mathsf{Var}_{P_{s,a}}(V^{\star,\sigma})\big{|} =|𝖵𝖺𝗋P~s,a(V)𝖵𝖺𝗋Ps,a(V)|\displaystyle=\big{|}\mathsf{Var}_{\widetilde{P}_{s,a}}(V^{\prime})-\mathsf{Var}_{P_{s,a}}(V^{\prime})\big{|}
    P~s,aPs,a1V2\displaystyle\leq\big{\|}\widetilde{P}_{s,a}-P_{s,a}\big{\|}_{1}\big{\|}V^{\prime}\big{\|}_{\infty}^{2}
    2σγ2(max{1γ,σ})22γ2max{1γ,σ}.\displaystyle\leq\frac{2\sigma}{\gamma^{2}(\max\{1-\gamma,\sigma\})^{2}}\leq\frac{2}{\gamma^{2}\max\{1-\gamma,\sigma\}}. (118)

    Applying the above relation we obtain

    𝒞2\displaystyle\mathcal{C}_{2} =2log(18SANδ)N(IγP¯^π,V)1|VarP^π(V,σ)VarP¯^π,V(V,σ)|\displaystyle=2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{N}}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{-1}\sqrt{\left|\mathrm{Var}_{\widehat{P}^{\pi^{\star}}}(V^{\star,\sigma})-\mathrm{Var}_{\underline{\widehat{P}}^{\pi^{\star},V}}(V^{\star,\sigma})\right|}
    =2log(18SANδ)N(IγP¯^π,V)1|Ππ(VarP^0(V,σ)VarP^π,V(V,σ))|\displaystyle=2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{N}}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{-1}\sqrt{\left|\Pi^{\pi^{\star}}\left(\mathrm{Var}_{\widehat{P}^{0}}(V^{\star,\sigma})-\mathrm{Var}_{\widehat{P}^{\pi^{\star},V}}(V^{\star,\sigma})\right)\right|}
    2log(18SANδ)N(IγP¯^π,V)1VarP^0(V,σ)VarP^π,V(V,σ)1\displaystyle\leq 2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{N}}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{-1}\sqrt{\left\|\mathrm{Var}_{\widehat{P}^{0}}(V^{\star,\sigma})-\mathrm{Var}_{\widehat{P}^{\pi^{\star},V}}(V^{\star,\sigma})\right\|_{\infty}1}
    2log(18SANδ)N(IγP¯^π,V)12γ2max{1γ,σ}1=22log(18SANδ)γ2(1γ)2max{1γ,σ}N1,\displaystyle\leq 2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{N}}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{-1}\sqrt{\frac{2}{\gamma^{2}\max\{1-\gamma,\sigma\}}}1=2\sqrt{\frac{2\log(\frac{18SAN}{\delta})}{\gamma^{2}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}1, (119)

    where the last equality uses (IγP¯^π,V)11=11γ\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{-1}1=\frac{1}{1-\gamma} (cf. (115)).

  • Consider 𝒞3\mathcal{C}_{3}. The following lemma plays an important role.

    Lemma 13.

    (Panaganti and Kalathil,, 2022, Lemma 6) Consider any δ(0,1)\delta\in(0,1). For any fixed policy π\pi and fixed value vector VSV\in\mathbb{R}^{S}, one has with probability at least 1δ1-\delta,

    |VarP^π(V)VarPπ(V)|2V2log(2SAδ)N1.\displaystyle\left|\sqrt{\mathrm{Var}_{\widehat{P}^{\pi}}(V)}-\sqrt{\mathrm{Var}_{P^{\pi}}(V)}\right|\leq\sqrt{\frac{2\|V\|_{\infty}^{2}\log(\frac{2SA}{\delta})}{N}}1.

    Applying Lemma 13 with π=π\pi=\pi^{\star} and V=V,σV=V^{\star,\sigma} leads to

    VarPπ(V,σ)VarP^π(V,σ)2V,σ2log(2SAδ)N1,\displaystyle\sqrt{\mathrm{Var}_{P^{\pi^{\star}}}(V^{\star,\sigma})}-\sqrt{\mathrm{Var}_{\widehat{P}^{\pi^{\star}}}(V^{\star,\sigma})}\leq\sqrt{\frac{2\|V^{\star,\sigma}\|_{\infty}^{2}\log(\frac{2SA}{\delta})}{N}}1,

    which can be plugged into (114) to verify

    𝒞3\displaystyle\mathcal{C}_{3} =2log(18SANδ)N(IγP¯^π,V)1(VarPπ(V,σ)VarP^π(V,σ))\displaystyle=2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{N}}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{-1}\left(\sqrt{\mathrm{Var}_{P^{\pi^{\star}}}(V^{\star,\sigma})}-\sqrt{\mathrm{Var}_{\widehat{P}^{\pi^{\star}}}(V^{\star,\sigma})}\right)
    4(1γ)log(SANδ)V,σN14log(18SANδ)(1γ)2N1,\displaystyle\leq\frac{4}{(1-\gamma)}\frac{\log(\frac{SAN}{\delta})\|V^{\star,\sigma}\|_{\infty}}{N}1\leq\frac{4\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}1, (120)

    where the last line uses (IγP¯^π,V)11=11γ\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{-1}1=\frac{1}{1-\gamma} (cf. (115)).

Finally, inserting the results of 𝒞1\mathcal{C}_{1} in (116), 𝒞2\mathcal{C}_{2} in (119), 𝒞3\mathcal{C}_{3} in (120), and (115) back into (114) gives

(IγP¯^π,V)1(P¯^π,VVπ,σP¯π,VVπ,σ)8log(18SANδ)γ3(1γ)2max{1γ,σ}N(1+log(18SANδ)(1γ)2N)1\displaystyle\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{-1}\Big{(}\underline{\widehat{P}}^{\pi^{\star},V}V^{\pi^{\star},\sigma}-\underline{P}^{\pi^{\star},V}V^{\pi^{\star},\sigma}\Big{)}\leq 8\sqrt{\frac{\log(\frac{18SAN}{\delta})}{\gamma^{3}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}\bigg{(}1+\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}}\bigg{)}}1
+22log(18SANδ)γ2(1γ)2max{1γ,σ}N1+4log(18SANδ)(1γ)2N1+log(18SANδ)N(1γ)21\displaystyle\quad+2\sqrt{\frac{2\log(\frac{18SAN}{\delta})}{\gamma^{2}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}1+\frac{4\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}1+\frac{\log(\frac{18SAN}{\delta})}{N(1-\gamma)^{2}}1
102log(18SANδ)γ3(1γ)2max{1γ,σ}N(1+log(SANδ)(1γ)2N)1+5log(18SANδ)(1γ)2N1\displaystyle\leq 10\sqrt{\frac{2\log(\frac{18SAN}{\delta})}{\gamma^{3}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}\bigg{(}1+\sqrt{\frac{\log(\frac{SAN}{\delta})}{(1-\gamma)^{2}N}}\bigg{)}}1+\frac{5\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}1
160log(18SANδ)(1γ)2max{1γ,σ}N1+5log(18SANδ)(1γ)2N1,\displaystyle\leq 160\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}1+\frac{5\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}1, (121)

where the last inequality holds by the fact that γ14\gamma\geq\frac{1}{4} and by taking Nlog(SANδ)(1γ)2N\geq\frac{\log(\frac{SAN}{\delta})}{(1-\gamma)^{2}}.

Step 2: controlling V^π,σVπ,σ\|\widehat{V}^{\pi^{\star},\sigma}-V^{\pi^{\star},\sigma}\|_{\infty}: bounding the second term in (56).

To proceed, applying Lemma 11 to the second term on the right-hand side of (56) leads to

(IγP¯^π,V^)1(P¯^π,VVπ,σP¯π,VVπ,σ)\displaystyle\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},\widehat{V}}\Big{)}^{-1}\Big{(}\underline{\widehat{P}}^{\pi^{\star},V}V^{\pi^{\star},\sigma}-\underline{P}^{\pi^{\star},V}V^{\pi^{\star},\sigma}\Big{)}
2(IγP¯^π,V^)1(log(18SANδ)NVarPπ(V,σ)+log(18SANδ)N(1γ)1)\displaystyle\leq 2\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},\widehat{V}}\Big{)}^{-1}\bigg{(}\sqrt{\frac{\log(\frac{18SAN}{\delta})}{N}}\sqrt{\mathrm{Var}_{P^{\pi^{\star}}}(V^{\star,\sigma})}+\frac{\log(\frac{18SAN}{\delta})}{N(1-\gamma)}1\bigg{)}
2log(18SANδ)N(1γ)(IγP¯^π,V^)11+2log(18SANδ)N(IγP¯^π,V^)1VarP¯^π,V^(V^π,σ)=:𝒞4\displaystyle\leq\frac{2\log(\frac{18SAN}{\delta})}{N(1-\gamma)}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},\widehat{V}}\Big{)}^{-1}1+\underbrace{2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{N}}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},\widehat{V}}\Big{)}^{-1}\sqrt{\mathrm{Var}_{\underline{\widehat{P}}^{\pi^{\star},\widehat{V}}}(\widehat{V}^{\pi^{\star},\sigma})}}_{=:\mathcal{C}_{4}}
+2log(18SANδ)N(IγP¯^π,V^)1(VarP¯^π,V^(Vπ,σV^π,σ))=:𝒞5\displaystyle\quad+\underbrace{2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{N}}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},\widehat{V}}\Big{)}^{-1}\left(\sqrt{\mathrm{Var}_{\underline{\widehat{P}}^{\pi^{\star},\widehat{V}}}(V^{\pi^{\star},\sigma}-\widehat{V}^{\pi^{\star},\sigma})}\right)}_{=:\mathcal{C}_{5}}
+2log(18SANδ)N(IγP¯^π,V^)1(|VarP^π(V,σ)VarP¯^π,V^(V,σ)|)=:𝒞6\displaystyle\quad+\underbrace{2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{N}}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},\widehat{V}}\Big{)}^{-1}\left(\sqrt{\left|\mathrm{Var}_{\widehat{P}^{\pi^{\star}}}(V^{\star,\sigma})-\mathrm{Var}_{\underline{\widehat{P}}^{\pi^{\star},\widehat{V}}}(V^{\star,\sigma})\right|}\right)}_{=:\mathcal{C}_{6}}
+2log(18SANδ)N(IγP¯^π,V^)1(VarPπ(V,σ)VarP^π(V,σ))=:𝒞7,\displaystyle\quad+\underbrace{2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{N}}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},\widehat{V}}\Big{)}^{-1}\left(\sqrt{\mathrm{Var}_{P^{\pi^{\star}}}(V^{\star,\sigma})}-\sqrt{\mathrm{Var}_{\widehat{P}^{\pi^{\star}}}(V^{\star,\sigma})}\right)}_{=:\mathcal{C}_{7}}, (122)

where the last term 𝒞7\mathcal{C}_{7} can be controlled in the same way as 𝒞3\mathcal{C}_{3} in (120). We now bound the above terms separately.

  • Applying Lemma 7 with P=P^π,V^P=\widehat{P}^{\pi^{\star},\widehat{V}}, π=π\pi=\pi^{\star} and taking V=V^π,σV=\widehat{V}^{\pi^{\star},\sigma} which obeys V^π,σ=rπ+γP¯^π,V^V^π,σ\widehat{V}^{\pi^{\star},\sigma}=r_{\pi^{\star}}+\gamma\underline{\widehat{P}}^{\pi^{\star},\widehat{V}}\widehat{V}^{\pi^{\star},\sigma}, and in view of (115), the term 𝒞4\mathcal{C}_{4} in (122) can be controlled as follows:

    𝒞4\displaystyle\mathcal{C}_{4} =2log(18SANδ)N(IγP¯^π,V^)1VarP¯^π,V^(V^π,σ)\displaystyle=2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{N}}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},\widehat{V}}\Big{)}^{-1}\sqrt{\mathrm{Var}_{\underline{\widehat{P}}^{\pi^{\star},\widehat{V}}}(\widehat{V}^{\pi^{\star},\sigma})}
    2log(18SANδ)N8(maxsV^π,σ(s)minsV^π,σ(s))γ2(1γ)21\displaystyle\leq 2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{N}}\sqrt{\frac{8(\max_{s}\widehat{V}^{\pi^{\star},\sigma}(s)-\min_{s}\widehat{V}^{\pi^{\star},\sigma}(s))}{\gamma^{2}(1-\gamma)^{2}}}1
    8log(18SANδ)γ3(1γ)2max{1γ,σ}N1,\displaystyle\leq 8\sqrt{\frac{\log(\frac{18SAN}{\delta})}{\gamma^{3}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}1, (123)

    where the last inequality holds by applying Lemma 6.

  • To continue, considering 𝒞5\mathcal{C}_{5}, we directly observe that (in view of (115))

    𝒞5\displaystyle\mathcal{C}_{5} =2log(18SANδ)N(IγP¯^π,V^)1VarP¯^π,V^(Vπ,σV^π,σ)\displaystyle=2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{N}}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},\widehat{V}}\Big{)}^{-1}\sqrt{\mathrm{Var}_{\underline{\widehat{P}}^{\pi^{\star},\widehat{V}}}(V^{\pi^{\star},\sigma}-\widehat{V}^{\pi^{\star},\sigma})}
    2log(18SANδ)(1γ)2NV,σV^π,σ1.\displaystyle\leq 2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}}\left\|V^{\star,\sigma}-\widehat{V}^{\pi^{\star},\sigma}\right\|_{\infty}1. (124)
  • Then, it is easily verified that 𝒞6\mathcal{C}_{6} can be controlled similarly to (119) as follows:

    𝒞6\displaystyle\mathcal{C}_{6} 22log(18SANδ)γ2(1γ)2max{1γ,σ}N1.\displaystyle\leq 2\sqrt{\frac{2\log(\frac{18SAN}{\delta})}{\gamma^{2}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}1. (125)
  • Similarly, 𝒞7\mathcal{C}_{7} can be controlled in the same way as (120), as shown below:

    𝒞7\displaystyle\mathcal{C}_{7} 4log(18SANδ)(1γ)2N1.\displaystyle\leq\frac{4\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}1. (126)

Combining the results in (123), (124), (125), and (126) and inserting back to (122) leads to

(IγP¯^π,V^)1(P¯^π,VVπ,σP¯π,VVπ,σ)8log(18SANδ)γ3(1γ)2max{1γ,σ}N1\displaystyle\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},\widehat{V}}\Big{)}^{-1}\Big{(}\underline{\widehat{P}}^{\pi^{\star},V}V^{\pi^{\star},\sigma}-\underline{P}^{\pi^{\star},V}V^{\pi^{\star},\sigma}\Big{)}\leq 8\sqrt{\frac{\log(\frac{18SAN}{\delta})}{\gamma^{3}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}1
+2log(18SANδ)(1γ)2NV,σV^π,σ1+22log(18SANδ)γ2(1γ)2max{1γ,σ}N1+4log(18SANδ)(1γ)2N1\displaystyle\qquad\qquad+2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}}\left\|V^{\star,\sigma}-\widehat{V}^{\pi^{\star},\sigma}\right\|_{\infty}1+2\sqrt{\frac{2\log(\frac{18SAN}{\delta})}{\gamma^{2}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}1+\frac{4\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}1
80log(18SANδ)(1γ)2max{1γ,σ}N1+2log(18SANδ)(1γ)2NV,σV^π,σ1+4log(18SANδ)(1γ)2N1,\displaystyle\leq 80\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}1+2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}}\left\|V^{\star,\sigma}-\widehat{V}^{\pi^{\star},\sigma}\right\|_{\infty}1+\frac{4\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}1, (127)

where the last inequality follows from the assumption γ14\gamma\geq\frac{1}{4}.

Finally, inserting (121) and (127) back to (56) yields

V^π,σVπ,σ\displaystyle\left\|\widehat{V}^{\pi^{\star},\sigma}-V^{\pi^{\star},\sigma}\right\|_{\infty}
max{160log(18SANδ)(1γ)2max{1γ,σ}N+5log(18SANδ)(1γ)2N,\displaystyle\leq\max\Bigg{\{}160\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}+\frac{5\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N},
80log(18SANδ)(1γ)2max{1γ,σ}N+2log(18SANδ)(1γ)2NV,σV^π,σ+4log(18SANδ)(1γ)2N}\displaystyle\qquad\quad 80\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}+2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}}\left\|V^{\star,\sigma}-\widehat{V}^{\pi^{\star},\sigma}\right\|_{\infty}+\frac{4\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}\Bigg{\}}
160log(18SANδ)(1γ)2max{1γ,σ}N+8log(18SANδ)(1γ)2N,\displaystyle\leq 160\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}+\frac{8\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}, (128)

where the last inequality holds by taking N16log(18SANδ)(1γ)2N\geq\frac{16\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}}.
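To spell out the last step: under this sample-size condition one has

2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}}\leq 2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{16\log(\frac{18SAN}{\delta})}}=\frac{1}{2},

so in the second branch of the maximum the term involving V,σV^π,σ\left\|V^{\star,\sigma}-\widehat{V}^{\pi^{\star},\sigma}\right\|_{\infty} can be moved to the left-hand side, and doubling the remaining terms yields (128).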

B.4 Proof of Lemma 9

Step 1: controlling V^π^,σVπ^,σ\|\widehat{V}^{\widehat{\pi},\sigma}-V^{\widehat{\pi},\sigma}\|_{\infty}: bounding the first term in (57).

To begin with, we introduce the following lemma, which controls the main term on the right-hand side of (57); its proof is postponed to Appendix B.7.

Lemma 14.

Consider any δ(0,1)\delta\in(0,1). Taking Nlog(54SAN2(1γ)δ)N\geq\log\left(\frac{54SAN^{2}}{(1-\gamma)\delta}\right), with probability at least 1δ1-\delta, one has

|P¯^π^,V^V^π^,σP¯π^,V^V^π^,σ|\displaystyle\Big{|}\underline{\widehat{P}}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}-\underline{P}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}\Big{|} 2log(54SAN2(1γ)δ)NVarPs,a0(V^,σ)1+8log(54SAN2(1γ)δ)N(1γ)1+2γε𝗈𝗉𝗍1γ1\displaystyle\leq 2\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N}}\sqrt{\mathrm{Var}_{P^{0}_{s,a}}(\widehat{V}^{\star,\sigma})}1+\frac{8\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N(1-\gamma)}1+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma}1
10log(54SAN2(1γ)δ)(1γ)2N1+2γε𝗈𝗉𝗍1γ1.\displaystyle\leq 10\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{(1-\gamma)^{2}N}}1+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma}1. (129)

With Lemma 14 in hand, we have

(IγP¯π^,V^)1(P¯^π^,V^V^π^,σP¯π^,V^V^π^,σ)\displaystyle\Big{(}I-\gamma\underline{P}^{\widehat{\pi},\widehat{V}}\Big{)}^{-1}\Big{(}\underline{\widehat{P}}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}-\underline{P}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}\Big{)}
(i)(IγP¯π^,V^)1|P¯^π^,V^V^π^,σP¯π^,V^V^π^,σ|\displaystyle\overset{\mathrm{(i)}}{\leq}\left(I-\gamma\underline{P}^{\widehat{\pi},\widehat{V}}\right)^{-1}\left|\underline{\widehat{P}}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}-\underline{P}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}\right|
2log(54SAN2(1γ)δ)N(IγP¯π^,V^)1VarPπ^(V^,σ)+(IγPQπ^,Vπ^)1(8log(54SAN2(1γ)δ)N(1γ)+2γε𝗈𝗉𝗍1γ)1\displaystyle\leq 2\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N}}\left(I-\gamma\underline{P}^{\widehat{\pi},\widehat{V}}\right)^{-1}\sqrt{\mathrm{Var}_{P^{\widehat{\pi}}}(\widehat{V}^{\star,\sigma})}+\left(I-\gamma P_{Q}^{\widehat{\pi},V^{\widehat{\pi}}}\right)^{-1}\Bigg{(}\frac{8\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N(1-\gamma)}+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma}\Bigg{)}1
(ii)(8log(54SAN2(1γ)δ)N(1γ)2+2γε𝗈𝗉𝗍(1γ)2)1+2log(54SAN2(1γ)δ)N(IγP¯π^,V^)1VarP¯π^,V^(V^π^,σ)=:𝒟1\displaystyle\overset{\mathrm{(ii)}}{\leq}\Bigg{(}\frac{8\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N(1-\gamma)^{2}}+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{(1-\gamma)^{2}}\Bigg{)}1+\underbrace{2\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N}}\left(I-\gamma\underline{P}^{\widehat{\pi},\widehat{V}}\right)^{-1}\sqrt{\mathrm{Var}_{\underline{P}^{\widehat{\pi},\widehat{V}}}(\widehat{V}^{\widehat{\pi},\sigma})}}_{=:\mathcal{D}_{1}}
+2log(54SAN2(1γ)δ)N(IγP¯π^,V^)1|VarP¯π^,V^(V^,σ)VarP¯π^,V^(V^π^,σ)|=:𝒟2\displaystyle\quad+\underbrace{2\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N}}\left(I-\gamma\underline{P}^{\widehat{\pi},\widehat{V}}\right)^{-1}\sqrt{\left|\mathrm{Var}_{\underline{P}^{\widehat{\pi},\widehat{V}}}(\widehat{V}^{\star,\sigma})-\mathrm{Var}_{\underline{P}^{\widehat{\pi},\widehat{V}}}(\widehat{V}^{\widehat{\pi},\sigma})\right|}}_{=:\mathcal{D}_{2}}
+2log(54SAN2(1γ)δ)N(IγP¯π^,V^)1|VarPπ^(V^,σ)VarP¯π^,V^(V^,σ)|=:𝒟3,\displaystyle\quad+\underbrace{2\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N}}\left(I-\gamma\underline{P}^{\widehat{\pi},\widehat{V}}\right)^{-1}\sqrt{\left|\mathrm{Var}_{P^{\widehat{\pi}}}(\widehat{V}^{\star,\sigma})-\mathrm{Var}_{\underline{P}^{\widehat{\pi},\widehat{V}}}(\widehat{V}^{\star,\sigma})\right|}}_{=:\mathcal{D}_{3}}, (130)

where (i) and (ii) hold by the fact that each row of (1γ)(IγP¯π^,V^)1(1-\gamma)\left(I-\gamma\underline{P}^{\widehat{\pi},\widehat{V}}\right)^{-1} is a probability vector that falls into Δ(𝒮)\Delta({\mathcal{S}}).
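As a quick numerical illustration of this fact (a minimal Python sketch with toy quantities, not part of the formal argument), one can check that (1-\gamma)(I-\gamma P)^{-1} is entrywise nonnegative with unit row sums for any row-stochastic P:

import numpy as np

rng = np.random.default_rng(0)
S, gamma = 6, 0.9
P = rng.random((S, S))
P /= P.sum(axis=1, keepdims=True)                # make P row-stochastic
M = (1 - gamma) * np.linalg.inv(np.eye(S) - gamma * P)
assert np.all(M >= -1e-12)                       # entrywise nonnegative
assert np.allclose(M.sum(axis=1), 1.0)           # each row sums to one, i.e., lies in the simplex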

The remainder of the proof will focus on controlling the three terms in (130) separately.

  • For 𝒟1\mathcal{D}_{1}, we introduce the following lemma, whose proof is postponed to B.8.

    Lemma 15.

    Consider any δ(0,1)\delta\in(0,1). Taking Nlog(54SAN2(1γ)δ)(1γ)2N\geq\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{(1-\gamma)^{2}} and ε𝗈𝗉𝗍1γγ\varepsilon_{\mathsf{opt}}\leq\frac{1-\gamma}{\gamma}, one has with probability at least 1δ1-\delta,

    (IγP¯π^,V^)1VarP¯π^,V^(V^π^,σ)\displaystyle\Big{(}I-\gamma\underline{P}^{\widehat{\pi},\widehat{V}}\Big{)}^{-1}\sqrt{\mathrm{Var}_{\underline{P}^{\widehat{\pi},\widehat{V}}}(\widehat{V}^{\widehat{\pi},\sigma})} 61γ3(1γ)2max{1γ,σ}161(1γ)3γ21.\displaystyle\leq 6\sqrt{\frac{1}{\gamma^{3}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}}}1\leq 6\sqrt{\frac{1}{(1-\gamma)^{3}\gamma^{2}}}1.

    Applying Lemma 15 and (115) to (130) leads to

    𝒟1\displaystyle\mathcal{D}_{1} =2log(54SAN2(1γ)δ)N(IγP¯π^,V^)1VarP¯π^,V^(V^π^,σ)\displaystyle=2\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N}}\Big{(}I-\gamma\underline{P}^{\widehat{\pi},\widehat{V}}\Big{)}^{-1}\sqrt{\mathrm{Var}_{\underline{P}^{\widehat{\pi},\widehat{V}}}(\widehat{V}^{\widehat{\pi},\sigma})}
    12log(54SAN2(1γ)δ)γ3(1γ)2max{1γ,σ}N1.\displaystyle\leq 12\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{\gamma^{3}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}1. (131)
  • Applying Lemma 3 with V^,σV^π^,σ2γε𝗈𝗉𝗍1γ\|\widehat{V}^{\star,\sigma}-\widehat{V}^{\widehat{\pi},\sigma}\|_{\infty}\leq\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma} and (115), 𝒟2\mathcal{D}_{2} can be controlled as

    𝒟2\displaystyle\mathcal{D}_{2} =2log(54SAN2(1γ)δ)N(IγP¯π^,V^)1|VarP¯π^,V^(V^,σ)VarP¯π^,V^(V^π^,σ)|\displaystyle=2\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N}}\Big{(}I-\gamma\underline{P}^{\widehat{\pi},\widehat{V}}\Big{)}^{-1}\sqrt{\left|\mathrm{Var}_{\underline{P}^{\widehat{\pi},\widehat{V}}}(\widehat{V}^{\star,\sigma})-\mathrm{Var}_{\underline{P}^{\widehat{\pi},\widehat{V}}}(\widehat{V}^{\widehat{\pi},\sigma})\right|}
    4log(54SAN2(1γ)δ)N(IγP¯π^,V^)1γε𝗈𝗉𝗍1γ4γε𝗈𝗉𝗍log(54SAN2(1γ)δ)(1γ)4N1.\displaystyle\leq 4\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N}}\Big{(}I-\gamma\underline{P}^{\widehat{\pi},\widehat{V}}\Big{)}^{-1}\frac{\sqrt{\gamma\varepsilon_{\mathsf{opt}}}}{1-\gamma}\leq 4\sqrt{\frac{\gamma\varepsilon_{\mathsf{opt}}\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{(1-\gamma)^{4}N}}1. (132)
  • 𝒟3\mathcal{D}_{3} can be controlled similar to 𝒞2\mathcal{C}_{2} in (119) as follows:

    𝒟3\displaystyle\mathcal{D}_{3} =2log(54SAN2(1γ)δ)N(IγP¯π^,V^)1|VarPπ^(V^,σ)VarP¯π^,V^(V^,σ)|\displaystyle=2\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N}}\left(I-\gamma\underline{P}^{\widehat{\pi},\widehat{V}}\right)^{-1}\sqrt{\left|\mathrm{Var}_{P^{\widehat{\pi}}}(\widehat{V}^{\star,\sigma})-\mathrm{Var}_{\underline{P}^{\widehat{\pi},\widehat{V}}}(\widehat{V}^{\star,\sigma})\right|}
    4log(54SAN2(1γ)δ)N(IγP¯π^,V^)11γ2max{1γ,σ}14log(54SAN2(1γ)δ)γ2(1γ)2max{1γ,σ}N1\displaystyle\leq 4\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N}}\left(I-\gamma\underline{P}^{\widehat{\pi},\widehat{V}}\right)^{-1}\sqrt{\frac{1}{\gamma^{2}\max\{1-\gamma,\sigma\}}}1\leq 4\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{\gamma^{2}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}1 (133)

Finally, summing up the results in (131), (132), and (133) and inserting them back to (130) yields: taking Nlog(54SAN2(1γ)δ)(1γ)2N\geq\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{(1-\gamma)^{2}} and ε𝗈𝗉𝗍1γγ\varepsilon_{\mathsf{opt}}\leq\frac{1-\gamma}{\gamma}, with probability at least 1δ1-\delta,

(IγP¯π^,V^)1(P¯^π^,V^V^π^,σP¯π^,V^V^π^,σ)(8log(54SAN2(1γ)δ)N(1γ)2+2γε𝗈𝗉𝗍(1γ)2)1\displaystyle\left(I-\gamma\underline{P}^{\widehat{\pi},\widehat{V}}\right)^{-1}\left(\underline{\widehat{P}}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}-\underline{P}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}\right)\leq\left(\frac{8\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N(1-\gamma)^{2}}+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{(1-\gamma)^{2}}\right)1
+12log(54SAN2(1γ)δ)γ3(1γ)2max{1γ,σ}N1+4γε𝗈𝗉𝗍log(54SAN2(1γ)δ)(1γ)4N1+4log(54SAN2(1γ)δ)γ2(1γ)2max{1γ,σ}N1\displaystyle\quad+12\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{\gamma^{3}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}1+4\sqrt{\frac{\gamma\varepsilon_{\mathsf{opt}}\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{(1-\gamma)^{4}N}}1+4\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{\gamma^{2}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}1
16log(54SAN2(1γ)δ)γ3(1γ)2max{1γ,σ}N1+14log(54SAN2(1γ)δ)N(1γ)21,\displaystyle\leq 16\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{\gamma^{3}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}1+\frac{14\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N(1-\gamma)^{2}}1, (134)

where the last inequality holds by taking ε𝗈𝗉𝗍min{1γγ,log(54SAN2(1γ)δ)γN}=log(54SAN2(1γ)δ)γN\varepsilon_{\mathsf{opt}}\leq\min\left\{\frac{1-\gamma}{\gamma},\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{\gamma N}\right\}=\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{\gamma N}.
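In particular, with this choice of ε𝗈𝗉𝗍\varepsilon_{\mathsf{opt}}, the two ε𝗈𝗉𝗍\varepsilon_{\mathsf{opt}}-dependent terms are absorbed as

\frac{2\gamma\varepsilon_{\mathsf{opt}}}{(1-\gamma)^{2}}\leq\frac{2\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N(1-\gamma)^{2}}\qquad\text{and}\qquad 4\sqrt{\frac{\gamma\varepsilon_{\mathsf{opt}}\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{(1-\gamma)^{4}N}}\leq\frac{4\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N(1-\gamma)^{2}},

which, combined with the term \frac{8\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N(1-\gamma)^{2}}, accounts for the factor 14 in (134).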

Step 2: controlling V^π^,σVπ^,σ\|\widehat{V}^{\widehat{\pi},\sigma}-V^{\widehat{\pi},\sigma}\|_{\infty}: bounding the second term in (57).

Towards this, applying Lemma 14 leads to

(IγP¯π^,V)1(P¯^π^,V^V^π^,σP¯π^,V^V^π^,σ)(IγP¯π^,V)1|P¯^π^,V^V^π^,σP¯π^,V^V^π^,σ|\displaystyle\Big{(}I-\gamma\underline{P}^{\widehat{\pi},V}\Big{)}^{-1}\Big{(}\underline{\widehat{P}}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}-\underline{P}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}\Big{)}\leq\Big{(}I-\gamma\underline{P}^{\widehat{\pi},V}\Big{)}^{-1}\Big{|}\underline{\widehat{P}}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}-\underline{P}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}\Big{|}
2log(54SAN2(1γ)δ)N(IγP¯π^,V)1VarPπ^(V^,σ)+(IγP¯π^,V)1(8log(54SAN2(1γ)δ)N(1γ)+2γε𝗈𝗉𝗍1γ)1\displaystyle\leq 2\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N}}\Big{(}I-\gamma\underline{P}^{\widehat{\pi},V}\Big{)}^{-1}\sqrt{\mathrm{Var}_{P^{\widehat{\pi}}}(\widehat{V}^{\star,\sigma})}+\Big{(}I-\gamma\underline{P}^{\widehat{\pi},V}\Big{)}^{-1}\Bigg{(}\frac{8\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N(1-\gamma)}+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma}\Bigg{)}1
(8log(54SAN2(1γ)δ)N(1γ)2+2γε𝗈𝗉𝗍(1γ)2)1+2log(54SAN2(1γ)δ)N(IγP¯π^,V)1VarP¯π^,V(Vπ^,σ)=:𝒟4\displaystyle\leq\Bigg{(}\frac{8\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N(1-\gamma)^{2}}+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{(1-\gamma)^{2}}\Bigg{)}1+\underbrace{2\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N}}\left(I-\gamma\underline{P}^{\widehat{\pi},V}\right)^{-1}\sqrt{\mathrm{Var}_{\underline{P}^{\widehat{\pi},V}}(V^{\widehat{\pi},\sigma})}}_{=:\mathcal{D}_{4}}
+2log(54SAN2(1γ)δ)N(IγP¯π^,V)1VarP¯π^,V(V^π^,σVπ^,σ)=:𝒟5\displaystyle\quad+\underbrace{2\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N}}\Big{(}I-\gamma\underline{P}^{\widehat{\pi},V}\Big{)}^{-1}\sqrt{\mathrm{Var}_{\underline{P}^{\widehat{\pi},V}}(\widehat{V}^{\widehat{\pi},\sigma}-V^{\widehat{\pi},\sigma})}}_{=:\mathcal{D}_{5}}
+2log(54SAN2(1γ)δ)N(IγP¯π^,V^)1|VarP¯π^,V(V^,σ)VarP¯π^,V(V^π^,σ)|=:𝒟6\displaystyle\quad+\underbrace{2\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N}}\Big{(}I-\gamma\underline{P}^{\widehat{\pi},\widehat{V}}\Big{)}^{-1}\sqrt{\left|\mathrm{Var}_{\underline{P}^{\widehat{\pi},V}}(\widehat{V}^{\star,\sigma})-\mathrm{Var}_{\underline{P}^{\widehat{\pi},V}}(\widehat{V}^{\widehat{\pi},\sigma})\right|}}_{=:\mathcal{D}_{6}}
+2log(54SAN2(1γ)δ)N(IγP¯π^,V^)1|VarPπ^(V^,σ)VarP¯π^,V(V^,σ)|=:𝒟7.\displaystyle\quad+\underbrace{2\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N}}\Big{(}I-\gamma\underline{P}^{\widehat{\pi},\widehat{V}}\Big{)}^{-1}\sqrt{\left|\mathrm{Var}_{P^{\widehat{\pi}}}(\widehat{V}^{\star,\sigma})-\mathrm{Var}_{\underline{P}^{\widehat{\pi},V}}(\widehat{V}^{\star,\sigma})\right|}}_{=:\mathcal{D}_{7}}. (135)

We shall bound each of the terms separately.

  • Applying Lemma 7 with P=P¯π^,VP=\underline{P}^{\widehat{\pi},V}, π=π^\pi=\widehat{\pi}, and taking V=Vπ^,σV=V^{\widehat{\pi},\sigma} which obeys Vπ^,σ=rπ^+γP¯π^,VVπ^,σV^{\widehat{\pi},\sigma}=r_{\widehat{\pi}}+\gamma\underline{P}^{\widehat{\pi},V}V^{\widehat{\pi},\sigma}, the term 𝒟4\mathcal{D}_{4} can be controlled similar to (123) as follows:

    𝒟4\displaystyle\mathcal{D}_{4} 8log(54SAN2δ)γ3(1γ)2max{1γ,σ}N1.\displaystyle\leq 8\sqrt{\frac{\log(\frac{54SAN^{2}}{\delta})}{\gamma^{3}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}1. (136)
  • For 𝒟5\mathcal{D}_{5}, it is observed that

    𝒟5\displaystyle\mathcal{D}_{5} =2log(54SAN2(1γ)δ)N(IγP¯π^,V)1VarP¯π^,V(V^π^,σVπ^,σ)\displaystyle=2\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N}}\Big{(}I-\gamma\underline{P}^{\widehat{\pi},V}\Big{)}^{-1}\sqrt{\mathrm{Var}_{\underline{P}^{\widehat{\pi},V}}(\widehat{V}^{\widehat{\pi},\sigma}-V^{\widehat{\pi},\sigma})}
    2log(54SAN2δ)(1γ)2NVπ^,σV^π^,σ1.\displaystyle\leq 2\sqrt{\frac{\log(\frac{54SAN^{2}}{\delta})}{(1-\gamma)^{2}N}}\left\|V^{\widehat{\pi},\sigma}-\widehat{V}^{\widehat{\pi},\sigma}\right\|_{\infty}1. (137)
  • Next, observing that 𝒟6\mathcal{D}_{6} and 𝒟7\mathcal{D}_{7} are almost the same as the terms 𝒟2\mathcal{D}_{2} (controlled in (132)) and 𝒟3\mathcal{D}_{3} (controlled in (133)) in (130), it is easily verified that they can be controlled as follows

    𝒟6\displaystyle\mathcal{D}_{6} 4γε𝗈𝗉𝗍log(54SAN2(1γ)δ)(1γ)4N1,𝒟74log(54SAN2(1γ)δ)γ2(1γ)2max{1γ,σ}N1.\displaystyle\leq 4\sqrt{\frac{\gamma\varepsilon_{\mathsf{opt}}\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{(1-\gamma)^{4}N}}1,\qquad\qquad\mathcal{D}_{7}\leq 4\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{\gamma^{2}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}1. (138)

Then inserting the results in (136), (137), and (138) back to (135) leads to

(IγP¯π^,V)1(P¯^π^,V^V^π^,σP¯π^,V^V^π^,σ)\displaystyle\Big{(}I-\gamma\underline{P}^{\widehat{\pi},V}\Big{)}^{-1}\Big{(}\underline{\widehat{P}}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}-\underline{P}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}\Big{)}
(8log(54SAN2(1γ)δ)N(1γ)2+2γε𝗈𝗉𝗍(1γ)2)1+8log(54SAN2δ)γ3(1γ)2max{1γ,σ}N1\displaystyle\leq\Bigg{(}\frac{8\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N(1-\gamma)^{2}}+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{(1-\gamma)^{2}}\Bigg{)}1+8\sqrt{\frac{\log(\frac{54SAN^{2}}{\delta})}{\gamma^{3}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}1
+2log(54SAN2δ)(1γ)2NVπ^,σV^π^,σ1+4γε𝗈𝗉𝗍log(54SAN2(1γ)δ)(1γ)4N1+4log(54SAN2(1γ)δ)γ2(1γ)2max{1γ,σ}N1\displaystyle\quad+2\sqrt{\frac{\log(\frac{54SAN^{2}}{\delta})}{(1-\gamma)^{2}N}}\left\|V^{\widehat{\pi},\sigma}-\widehat{V}^{\widehat{\pi},\sigma}\right\|_{\infty}1+4\sqrt{\frac{\gamma\varepsilon_{\mathsf{opt}}\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{(1-\gamma)^{4}N}}1+4\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{\gamma^{2}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}1
122log(8SAN2(1γ)δ)γ3(1γ)2max{1γ,σ}N1+14log(54SAN2(1γ)δ)N(1γ)21+2log(54SAN2δ)(1γ)2NVπ^,σV^π^,σ1,\displaystyle\leq 12\sqrt{\frac{2\log(\frac{8SAN^{2}}{(1-\gamma)\delta})}{\gamma^{3}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}1+\frac{14\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N(1-\gamma)^{2}}1+2\sqrt{\frac{\log(\frac{54SAN^{2}}{\delta})}{(1-\gamma)^{2}N}}\left\|V^{\widehat{\pi},\sigma}-\widehat{V}^{\widehat{\pi},\sigma}\right\|_{\infty}1, (139)

where the last inequality holds by letting ε𝗈𝗉𝗍log(54SAN2(1γ)δ)γN\varepsilon_{\mathsf{opt}}\leq\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{\gamma N}, which directly satisfies ε𝗈𝗉𝗍1γγ\varepsilon_{\mathsf{opt}}\leq\frac{1-\gamma}{\gamma} by letting Nlog(54SAN2δ)1γN\geq\frac{\log(\frac{54SAN^{2}}{\delta})}{1-\gamma}.

Finally, inserting (134) and (139) back to (57) yields: taking ε𝗈𝗉𝗍log(54SAN2(1γ)δ)γN\varepsilon_{\mathsf{opt}}\leq\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{\gamma N} and N16log(54SAN2δ)(1γ)2N\geq\frac{16\log(\frac{54SAN^{2}}{\delta})}{(1-\gamma)^{2}}, with probability at least 1δ1-\delta, one has

V^π^,σVπ^,σ\displaystyle\left\|\widehat{V}^{\widehat{\pi},\sigma}-V^{\widehat{\pi},\sigma}\right\|_{\infty}
max{16log(54SAN2(1γ)δ)γ3(1γ)2max{1γ,σ}N+14log(54SAN2(1γ)δ)N(1γ)2,\displaystyle\leq\max\Big{\{}16\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{\gamma^{3}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}+\frac{14\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N(1-\gamma)^{2}},
122log(8SAN2(1γ)δ)γ3(1γ)2max{1γ,σ}N+14log(54SAN2(1γ)δ)N(1γ)2+2log(54SAN2δ)(1γ)2NVπ^,σV^π^,σ}\displaystyle\quad 12\sqrt{\frac{2\log(\frac{8SAN^{2}}{(1-\gamma)\delta})}{\gamma^{3}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}+\frac{14\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N(1-\gamma)^{2}}+2\sqrt{\frac{\log(\frac{54SAN^{2}}{\delta})}{(1-\gamma)^{2}N}}\left\|V^{\widehat{\pi},\sigma}-\widehat{V}^{\widehat{\pi},\sigma}\right\|_{\infty}\Big{\}}
24log(54SAN2(1γ)δ)γ3(1γ)2max{1γ,σ}N+28log(54SAN2(1γ)δ)N(1γ)2.\displaystyle\leq 24\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{\gamma^{3}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}N}}+\frac{28\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N(1-\gamma)^{2}}. (140)

B.5 Proof of Lemma 11

Step 1: controlling the point-wise concentration.

We first consider a more general term w.r.t. any fixed value vector VV (independent of P^0\widehat{P}^{0}) obeying 0V11γ1{0}\leq V\leq\frac{1}{1-\gamma}1 and any policy π\pi. Invoking Lemma 1 yields that, for any (s,a)𝒮×𝒜(s,a)\in{\mathcal{S}}\times\mathcal{A},

|P^s,aπ,VVPs,aπ,VV|\displaystyle\left|\widehat{P}^{\pi,V}_{s,a}V-P^{\pi,V}_{s,a}V\right| |maxα[minsV(s),maxsV(s)]{P^s,a0[V]ασ(αmins[V]α(s))}\displaystyle\leq\Big{|}\max_{\alpha\in[\min_{s}V(s),\max_{s}V(s)]}\left\{\widehat{P}^{0}_{s,a}\left[V\right]_{\alpha}-\sigma\left(\alpha-\min_{s^{\prime}}\left[V\right]_{\alpha}(s^{\prime})\right)\right\}
maxα[minsV(s),maxsV(s)]{Ps,a0[V]ασ(αmins[V]α(s))}|\displaystyle\qquad-\max_{\alpha\in[\min_{s}V(s),\max_{s}V(s)]}\left\{P^{0}_{s,a}\left[V\right]_{\alpha}-\sigma\left(\alpha-\min_{s^{\prime}}\left[V\right]_{\alpha}(s^{\prime})\right)\right\}\Big{|}
maxα[minsV(s),maxsV(s)]|(Ps,a0P^s,a0)[V]α|=:gs,a(α,V),\displaystyle\leq\max_{\alpha\in[\min_{s}V(s),\max_{s}V(s)]}\underbrace{\left|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)[V]_{\alpha}\right|}_{=:g_{s,a}(\alpha,V)}, (141)

where the last inequality holds since the maximum operator is 11-Lipschitz.
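To make the quantity inside the maxima concrete, the dual expression appearing in (141) can be evaluated numerically over a grid of α\alpha. The Python sketch below does exactly this, under the assumption (consistent with the notation above) that [V]α[V]_{\alpha} denotes VV clipped entrywise from above at level α\alpha; all inputs are toy quantities of our own choosing, not the paper's code.

import numpy as np

def clip_above(V, alpha):
    # [V]_alpha: V clipped entrywise from above at level alpha (assumed reading of the bracket notation)
    return np.minimum(V, alpha)

def dual_value(p, V, sigma, n_grid=2001):
    # max over alpha in [min V, max V] of  p @ [V]_alpha - sigma * (alpha - min_s [V]_alpha(s)),
    # i.e., the expression inside the maxima of (141), evaluated on a finite grid of alpha
    alphas = np.linspace(V.min(), V.max(), n_grid)
    return max(p @ clip_above(V, a) - sigma * (a - clip_above(V, a).min()) for a in alphas)

p0 = np.array([0.5, 0.3, 0.2])          # a toy nominal distribution
V = np.array([1.0, 4.0, 10.0])          # a toy value vector
print(dual_value(p0, V, sigma=0.1))

Comparing dual_value(p0, V, sigma) with dual_value(p0_hat, V, sigma) for an empirical estimate p0_hat mirrors the difference bounded in (141).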

Then, for a fixed α\alpha and any vector VV that is independent of P^0\widehat{P}^{0}, Bernstein’s inequality gives that, with probability at least 1δ1-\delta,

gs,a(α,V)=|(Ps,a0P^s,a0)[V]α|\displaystyle g_{s,a}(\alpha,V)=\left|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)[V]_{\alpha}\right| 2log(2δ)NVarPs,a0([V]α)+2log(2δ)3N(1γ)\displaystyle\leq\sqrt{\frac{2\log(\frac{2}{\delta})}{N}}\sqrt{\mathrm{Var}_{P^{0}_{s,a}}([V]_{\alpha})}+\frac{2\log(\frac{2}{\delta})}{3N(1-\gamma)}
2log(2δ)NVarPs,a0(V)+2log(2δ)3N(1γ).\displaystyle\leq\sqrt{\frac{2\log(\frac{2}{\delta})}{N}}\sqrt{\mathrm{Var}_{P^{0}_{s,a}}(V)}+\frac{2\log(\frac{2}{\delta})}{3N(1-\gamma)}. (142)
Step 2: deriving the uniform concentration.

To obtain the union bound, we first notice that gs,a(α,V)g_{s,a}(\alpha,V) is 11-Lipschitz w.r.t. α\alpha for any VV obeying V11γ\|V\|_{\infty}\leq\frac{1}{1-\gamma}. In addition, we can construct an ε1\varepsilon_{1}-net Nε1N_{\varepsilon_{1}} over [0,11γ][0,\frac{1}{1-\gamma}] whose size satisfies |Nε1|3ε1(1γ)|N_{\varepsilon_{1}}|\leq\frac{3}{\varepsilon_{1}(1-\gamma)} (Vershynin,, 2018). By the union bound and (B.5), it holds with probability at least 1δSA1-\frac{\delta}{SA} that for all αNε1\alpha\in N_{\varepsilon_{1}},

gs,a(α,V)2log(2SA|Nε1|δ)NVarPs,a0(V)+2log(2SA|Nε1|δ)3N(1γ).g_{s,a}(\alpha,V)\leq\sqrt{\frac{2\log(\frac{2SA|N_{\varepsilon_{1}}|}{\delta})}{N}}\sqrt{\mathrm{Var}_{P^{0}_{s,a}}(V)}+\frac{2\log(\frac{2SA|N_{\varepsilon_{1}}|}{\delta})}{3N(1-\gamma)}. (143)

Combined with (141), it yields that,

|P^s,aπ,VVPs,aπ,VV|\displaystyle\left|\widehat{P}^{\pi,V}_{s,a}V-P^{\pi,V}_{s,a}V\right| maxα[minsV(s),maxsV(s)]|(Ps,a0P^s,a0)[V]α|\displaystyle\leq\max_{\alpha\in[\min_{s}V(s),\max_{s}V(s)]}\left|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)[V]_{\alpha}\right|
(i)ε1+supαNε1|(Ps,a0P^s,a0)[V]α|\displaystyle\overset{\mathrm{(i)}}{\leq}\varepsilon_{1}+\sup_{\alpha\in N_{\varepsilon_{1}}}\left|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)[V]_{\alpha}\right|
(ii)ε1+2log(2SA|Nε1|δ)NVarPs,a0(V)+2log(2SA|Nε1|δ)3N(1γ)\displaystyle\overset{\mathrm{(ii)}}{\leq}\varepsilon_{1}+\sqrt{\frac{2\log(\frac{2SA|N_{\varepsilon_{1}}|}{\delta})}{N}}\sqrt{\mathrm{Var}_{P^{0}_{s,a}}(V)}+\frac{2\log(\frac{2SA|N_{\varepsilon_{1}}|}{\delta})}{3N(1-\gamma)} (144)
(iii)2log(2SA|Nε1|δ)NVarPs,a0(V)+log(2SA|Nε1|δ)N(1γ)\displaystyle\overset{\mathrm{(iii)}}{\leq}\sqrt{\frac{2\log(\frac{2SA|N_{\varepsilon_{1}}|}{\delta})}{N}}\sqrt{\mathrm{Var}_{P^{0}_{s,a}}(V)}+\frac{\log(\frac{2SA|N_{\varepsilon_{1}}|}{\delta})}{N(1-\gamma)}
(iv)2log(18SANδ)NVarPs,a0(V)+log(18SANδ)N(1γ)\displaystyle\overset{\mathrm{(iv)}}{\leq}2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{N}}\sqrt{\mathrm{Var}_{P^{0}_{s,a}}(V)}+\frac{\log(\frac{18SAN}{\delta})}{N(1-\gamma)} (145)
2log(18SANδ)NV+log(18SANδ)N(1γ)\displaystyle\leq 2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{N}}\|V\|_{\infty}+\frac{\log(\frac{18SAN}{\delta})}{N(1-\gamma)}
3log(18SANδ)(1γ)2N\displaystyle\leq 3\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}} (146)

where (i) follows from the fact that the optimal α\alpha^{\star} falls into the ε1\varepsilon_{1}-ball centered around some point inside Nε1N_{\varepsilon_{1}} and that gs,a(α,V)g_{s,a}(\alpha,V) is 11-Lipschitz, (ii) holds by (143), (iii) arises from taking ε1=log(2SA|Nε1|δ)3N(1γ)\varepsilon_{1}=\frac{\log(\frac{2SA|N_{\varepsilon_{1}}|}{\delta})}{3N(1-\gamma)}, (iv) is verified by |Nε1|3ε1(1γ)9N|N_{\varepsilon_{1}}|\leq\frac{3}{\varepsilon_{1}(1-\gamma)}\leq 9N, and the last inequality is due to the fact V11γ\|V\|_{\infty}\leq\frac{1}{1-\gamma} and letting Nlog(18SANδ)N\geq\log(\frac{18SAN}{\delta}).
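The covering step above can also be checked numerically. The minimal Python sketch below (toy quantities and hypothetical names; an empirical distribution plays the role of P^0s,a\widehat{P}^{0}_{s,a}) verifies that, thanks to the 11-Lipschitzness of gs,a(α,V)g_{s,a}(\alpha,V) in α\alpha, the supremum over all α\alpha exceeds the maximum over an ε1\varepsilon_{1}-net by at most ε1\varepsilon_{1}.

import numpy as np

rng = np.random.default_rng(1)
S, N, gamma = 5, 2000, 0.9
p0 = rng.dirichlet(np.ones(S))                        # a toy nominal distribution
p0_hat = rng.multinomial(N, p0) / N                   # empirical estimate from N draws
V = rng.uniform(0, 1 / (1 - gamma), size=S)           # value vector with entries in [0, 1/(1-gamma)]

def g(alpha):                                         # g_{s,a}(alpha, V) from (141)
    return abs((p0 - p0_hat) @ np.minimum(V, alpha))

eps1 = 0.05
net = np.arange(0.0, 1 / (1 - gamma) + eps1, eps1)    # an eps1-net of [0, 1/(1-gamma)]
fine = np.linspace(0.0, 1 / (1 - gamma), 20001)       # fine grid standing in for the continuum
assert max(g(a) for a in fine) <= max(g(a) for a in net) + eps1 + 1e-12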

To continue, applying (145) and (146) with π=π\pi=\pi^{\star} and V=V,σV=V^{\star,\sigma} (which is independent of P^0\widehat{P}^{0}) and taking the union bound over (s,a)𝒮×𝒜(s,a)\in{\mathcal{S}}\times\mathcal{A} gives that, with probability at least 1δ1-\delta, it holds simultaneously for all (s,a)𝒮×𝒜(s,a)\in{\mathcal{S}}\times\mathcal{A} that

|P^s,aπ,VV,σPs,aπ,VV,σ|\displaystyle\left|\widehat{P}^{\pi^{\star},V}_{s,a}V^{\star,\sigma}-P^{\pi^{\star},V}_{s,a}V^{\star,\sigma}\right| 2log(18SANδ)NVarPs,a0(V,σ)+log(18SANδ)N(1γ)\displaystyle\leq 2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{N}}\sqrt{\mathrm{Var}_{P^{0}_{s,a}}(V^{\star,\sigma})}+\frac{\log(\frac{18SAN}{\delta})}{N(1-\gamma)}
3log(18SANδ)(1γ)2N.\displaystyle\leq 3\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}}. (147)

By converting (147) to the matrix form, one has with probability at least 1δ1-\delta,

|P¯^π,VVπ,σP¯π,VVπ,σ|\displaystyle\left|\underline{\widehat{P}}^{\pi^{\star},V}V^{\pi^{\star},\sigma}-\underline{P}^{\pi^{\star},V}V^{\pi^{\star},\sigma}\right| 2log(18SANδ)NVarPπ(V,σ)+log(18SANδ)N(1γ)1\displaystyle\leq 2\sqrt{\frac{\log(\frac{18SAN}{\delta})}{N}}\sqrt{\mathrm{Var}_{P^{\pi^{\star}}}(V^{\star,\sigma})}+\frac{\log(\frac{18SAN}{\delta})}{N(1-\gamma)}1
3log(18SANδ)(1γ)2N1.\displaystyle\leq 3\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}}1. (148)

B.6 Proof of Lemma 12

Following the same argument as in (110), it follows that

(IγP¯^π,V)1VarP¯^π,V(V,σ)\displaystyle\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{-1}\sqrt{\mathrm{Var}_{\underline{\widehat{P}}^{\pi^{\star},V}}(V^{\star,\sigma})} ≤11γt=0γt(P¯^π,V)tVarP¯^π,V(V,σ).\displaystyle\leq\sqrt{\frac{1}{1-\gamma}}\sqrt{\sum_{t=0}^{\infty}\gamma^{t}\left(\underline{\widehat{P}}^{\pi^{\star},V}\right)^{t}\mathrm{Var}_{\underline{\widehat{P}}^{\pi^{\star},V}}(V^{\star,\sigma})}. (149)
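This inequality can be seen via Jensen's inequality, using again the fact that each row of (1-\gamma)\big(I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\big)^{-1} is a probability vector. The following minimal Python sketch (toy quantities, not part of the proof) checks the generic matrix inequality numerically:

import numpy as np

rng = np.random.default_rng(2)
S, gamma = 6, 0.9
P = rng.random((S, S))
P /= P.sum(axis=1, keepdims=True)                 # a toy row-stochastic transition matrix
v = rng.random(S)                                 # stands in for a (nonnegative) variance vector
A = np.linalg.inv(np.eye(S) - gamma * P)
lhs = A @ np.sqrt(v)                              # (I - gamma P)^{-1} sqrt(v)
rhs = np.sqrt(1 / (1 - gamma)) * np.sqrt(A @ v)   # sqrt(1/(1-gamma)) * sqrt((I - gamma P)^{-1} v)
assert np.all(lhs <= rhs + 1e-10)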

To continue, we first focus on controlling VarP¯^π,V(V,σ)\mathrm{Var}_{\underline{\widehat{P}}^{\pi^{\star},V}}(V^{\star,\sigma}). Towards this, denoting the minimum value of V,σV^{\star,\sigma} as Vminmins𝒮V,σ(s)V_{\min}\coloneqq\min_{s\in{\mathcal{S}}}V^{\star,\sigma}(s) and VV,σVmin1V^{\prime}\coloneqq V^{\star,\sigma}-V_{\min}1, we arrive at (see the robust Bellman’s consistency equation in (46))

V=V,σVmin1\displaystyle V^{\prime}=V^{\star,\sigma}-V_{\min}1 =rπ+γP¯π,VV,σVmin1\displaystyle=r_{\pi^{\star}}+\gamma\underline{P}^{\pi^{\star},V}V^{\star,\sigma}-V_{\min}1
=rπ+γP¯^π,VV,σ+γ(P¯π,VP¯^π,V)V,σVmin1\displaystyle=r_{\pi^{\star}}+\gamma\underline{\widehat{P}}^{\pi^{\star},V}V^{\star,\sigma}+\gamma\Big{(}\underline{P}^{\pi^{\star},V}-\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}V^{\star,\sigma}-V_{\min}1
=rπ(1γ)Vmin1+γP¯^π,VV+γ(P¯π,VP¯^π,V)V,σ\displaystyle=r_{\pi^{\star}}-(1-\gamma)V_{\min}1+\gamma\underline{\widehat{P}}^{\pi^{\star},V}V^{\prime}+\gamma\Big{(}\underline{P}^{\pi^{\star},V}-\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}V^{\star,\sigma}
=rπ+γP¯^π,VV+γ(P¯π,VP¯^π,V)V,σ,\displaystyle=r_{\pi^{\star}}^{\prime}+\gamma\underline{\widehat{P}}^{\pi^{\star},V}V^{\prime}+\gamma\Big{(}\underline{P}^{\pi^{\star},V}-\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}V^{\star,\sigma}, (150)

where the last line holds by letting rπrπ(1γ)Vmin1rπr_{\pi^{\star}}^{\prime}\coloneqq r_{\pi^{\star}}-(1-\gamma)V_{\min}1\leq r_{\pi^{\star}}. With the above fact in hand, we control VarP¯^π,V(V,σ)\mathrm{Var}_{\underline{\widehat{P}}^{\pi^{\star},V}}(V^{\star,\sigma}) as follows:

VarP¯^π,V(V,σ)\displaystyle\mathrm{Var}_{\underline{\widehat{P}}^{\pi^{\star},V}}(V^{\star,\sigma}) =(i)VarP¯^π,V(V)=P¯^π,V(VV)(P¯^π,VV)(P¯^π,VV)\displaystyle\overset{\mathrm{(i)}}{=}\mathrm{Var}_{\underline{\widehat{P}}^{\pi^{\star},V}}(V^{\prime})=\underline{\widehat{P}}^{\pi^{\star},V}\left(V^{\prime}\circ V^{\prime}\right)-\big{(}\underline{\widehat{P}}^{\pi^{\star},V}V^{\prime}\big{)}\circ\big{(}\underline{\widehat{P}}^{\pi^{\star},V}V^{\prime}\big{)}
=(ii)P^¯π,V(VV)1γ2(Vrπγ(P¯π,VP^¯π,V)V,σ)2\displaystyle\overset{\mathrm{(ii)}}{=}\underline{\widehat{P}}^{\pi^{\star},V}\left(V^{\prime}\circ V^{\prime}\right)-\frac{1}{\gamma^{2}}\Big{(}V^{\prime}-r_{\pi^{\star}}^{\prime}-\gamma\Big{(}\underline{P}^{\pi^{\star},V}-\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}V^{\star,\sigma}\Big{)}^{\circ 2}
=P^¯π,V(VV)1γ2VV+2γ2V(rπ+γ(P¯π,VP^¯π,V)V,σ)\displaystyle=\underline{\widehat{P}}^{\pi^{\star},V}\left(V^{\prime}\circ V^{\prime}\right)-\frac{1}{\gamma^{2}}V^{\prime}\circ V^{\prime}+\frac{2}{\gamma^{2}}V^{\prime}\circ\Big{(}r_{\pi^{\star}}^{\prime}+\gamma\Big{(}\underline{P}^{\pi^{\star},V}-\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}V^{\star,\sigma}\Big{)}
1γ2(rπ+γ(P¯π,VP^¯π,V)V,σ)2\displaystyle\quad-\frac{1}{\gamma^{2}}\Big{(}r_{\pi^{\star}}^{\prime}+\gamma\Big{(}\underline{P}^{\pi^{\star},V}-\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}V^{\star,\sigma}\Big{)}^{\circ 2}
(iii)P^¯π,V(VV)1γVV+2γ2V1+2γV|(P¯π,VP^¯π,V)V,σ|\displaystyle\overset{\mathrm{(iii)}}{\leq}\underline{\widehat{P}}^{\pi^{\star},V}\left(V^{\prime}\circ V^{\prime}\right)-\frac{1}{\gamma}V^{\prime}\circ V^{\prime}+\frac{2}{\gamma^{2}}\|V^{\prime}\|_{\infty}1+\frac{2}{\gamma}\|V^{\prime}\|_{\infty}\Big{|}\Big{(}\underline{P}^{\pi^{\star},V}-\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}V^{\star,\sigma}\Big{|} (151)
P^¯π,V(VV)1γVV+2γ2V1+6γVlog(18SANδ)(1γ)2N1,\displaystyle\leq\underline{\widehat{P}}^{\pi^{\star},V}\left(V^{\prime}\circ V^{\prime}\right)-\frac{1}{\gamma}V^{\prime}\circ V^{\prime}+\frac{2}{\gamma^{2}}\|V^{\prime}\|_{\infty}1+\frac{6}{\gamma}\|V^{\prime}\|_{\infty}\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}}1, (152)

where (i) holds by the fact that VarPπ(Vb1)=VarPπ(V)\mathrm{Var}_{P_{\pi}}(V-b1)=\mathrm{Var}_{P_{\pi}}(V) for any scalar bb and VSV\in\mathbb{R}^{S}, (ii) follows from (150), (iii) arises from 1γ2VV1γVV\frac{1}{\gamma^{2}}V^{\prime}\circ V^{\prime}\geq\frac{1}{\gamma}V^{\prime}\circ V^{\prime} and 1rπ(1γ)Vmin1=rπrπ1-1\leq r_{\pi^{\star}}-(1-\gamma)V_{\min}1=r_{\pi^{\star}}^{\prime}\leq r_{\pi^{\star}}\leq 1, and the last inequality holds by Lemma 11.

Plugging (152) into (149) leads to

(IγP^¯π,V)1VarP^¯π,V(V,σ)\displaystyle\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{-1}\sqrt{\mathrm{Var}_{\underline{\widehat{P}}^{\pi^{\star},V}}(V^{\star,\sigma})}
11γt=0γt(P^¯π,V)t(P^¯π,V(VV)1γVV+2γ2V1+6γVlog(18SANδ)(1γ)2N1)\displaystyle\leq\sqrt{\frac{1}{1-\gamma}}\sqrt{\sum_{t=0}^{\infty}\gamma^{t}\left(\underline{\widehat{P}}^{\pi^{\star},V}\right)^{t}\Bigg{(}\underline{\widehat{P}}^{\pi^{\star},V}\left(V^{\prime}\circ V^{\prime}\right)-\frac{1}{\gamma}V^{\prime}\circ V^{\prime}+\frac{2}{\gamma^{2}}\|V^{\prime}\|_{\infty}1+\frac{6}{\gamma}\|V^{\prime}\|_{\infty}\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}}1\Bigg{)}}
(i)11γ|t=0γt(P^¯π,V)t(P^¯π,V(VV)1γVV)|\displaystyle\overset{\mathrm{(i)}}{\leq}\sqrt{\frac{1}{1-\gamma}}\sqrt{\bigg{|}\sum_{t=0}^{\infty}\gamma^{t}\left(\underline{\widehat{P}}^{\pi^{\star},V}\right)^{t}\bigg{(}\underline{\widehat{P}}^{\pi^{\star},V}\left(V^{\prime}\circ V^{\prime}\right)-\frac{1}{\gamma}V^{\prime}\circ V^{\prime}\bigg{)}\bigg{|}}
+11γt=0γt(P^¯π,V)t(2γ2V1+6γVlog(18SANδ)(1γ)2N1)\displaystyle\qquad+\sqrt{\frac{1}{1-\gamma}}\sqrt{\sum_{t=0}^{\infty}\gamma^{t}\left(\underline{\widehat{P}}^{\pi^{\star},V}\right)^{t}\Bigg{(}\frac{2}{\gamma^{2}}\|V^{\prime}\|_{\infty}1+\frac{6}{\gamma}\|V^{\prime}\|_{\infty}\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}}1\Bigg{)}}
11γ|t=0γt(P^¯π,V)t[P^¯π,V(VV)1γVV]|+(2+6log(18SANδ)(1γ)2N)V(1γ)2γ21,\displaystyle\leq\sqrt{\frac{1}{1-\gamma}}\sqrt{\bigg{|}\sum_{t=0}^{\infty}\gamma^{t}\left(\underline{\widehat{P}}^{\pi^{\star},V}\right)^{t}\bigg{[}\underline{\widehat{P}}^{\pi^{\star},V}\left(V^{\prime}\circ V^{\prime}\right)-\frac{1}{\gamma}V^{\prime}\circ V^{\prime}\bigg{]}\bigg{|}}+\sqrt{\frac{\left(2+6\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}}\right)\|V^{\prime}\|_{\infty}}{(1-\gamma)^{2}\gamma^{2}}}1, (153)

where (i) holds by the triangle inequality. Therefore, the remainder of the proof shall focus on the first term, which obeys

|t=0γt(P^¯π,V)t(P^¯π,V(VV)1γVV)|\displaystyle\bigg{|}\sum_{t=0}^{\infty}\gamma^{t}\Big{(}\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{t}\Big{(}\underline{\widehat{P}}^{\pi^{\star},V}\left(V^{\prime}\circ V^{\prime}\right)-\frac{1}{\gamma}V^{\prime}\circ V^{\prime}\Big{)}\bigg{|}
=|(t=0γt(P^¯π,V)t+1t=0γt1(P^¯π,V)t)(VV)|1γV21\displaystyle=\bigg{|}\bigg{(}\sum_{t=0}^{\infty}\gamma^{t}\Big{(}\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{t+1}-\sum_{t=0}^{\infty}\gamma^{t-1}\Big{(}\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{t}\bigg{)}\left(V^{\prime}\circ V^{\prime}\right)\bigg{|}\leq\frac{1}{\gamma}\|V^{\prime}\|_{\infty}^{2}1 (154)

where the last inequality holds since the two sums telescope to -\frac{1}{\gamma}V^{\prime}\circ V^{\prime}, whose magnitude is at most \frac{1}{\gamma}\|V^{\prime}\|_{\infty}^{2}1. Inserting (154) back to (153) leads to

(IγP^¯π,V)1VarP^¯π,V(V,σ)\displaystyle\bigg{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\bigg{)}^{-1}\sqrt{\mathrm{Var}_{\underline{\widehat{P}}^{\pi^{\star},V}}(V^{\star,\sigma})}
V2γ(1γ)1+3(1+log(18SANδ)(1γ)2N)V(1γ)2γ21\displaystyle\leq\sqrt{\frac{\|V^{\prime}\|_{\infty}^{2}}{\gamma(1-\gamma)}}1+3\sqrt{\frac{\Big{(}1+\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}}\Big{)}\|V^{\prime}\|_{\infty}}{(1-\gamma)^{2}\gamma^{2}}}1
4(1+log(18SANδ)(1γ)2N)V(1γ)2γ214(1+log(18SANδ)(1γ)2N)γ3(1γ)2max{1γ,σ}14(1+log(18SANδ)(1γ)2N)γ3(1γ)31,\displaystyle\leq 4\sqrt{\frac{\Big{(}1+\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}}\Big{)}\|V^{\prime}\|_{\infty}}{(1-\gamma)^{2}\gamma^{2}}}1\leq 4\sqrt{\frac{\Big{(}1+\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}}\Big{)}}{\gamma^{3}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}}}1\leq 4\sqrt{\frac{\Big{(}1+\sqrt{\frac{\log(\frac{18SAN}{\delta})}{(1-\gamma)^{2}N}}\Big{)}}{\gamma^{3}(1-\gamma)^{3}}}1, (155)

where the penultimate inequality follows from applying Lemma 6 with P=P0P=P^{0} and π=π\pi=\pi^{\star}:

V=maxs𝒮V,σ(s)mins𝒮V,σ(s)1γmax{1γ,σ}.\displaystyle\|V^{\prime}\|_{\infty}=\max_{s\in{\mathcal{S}}}V^{\star,\sigma}(s)-\min_{s\in{\mathcal{S}}}V^{\star,\sigma}(s)\leq\frac{1}{\gamma\max\{1-\gamma,\sigma\}}.

B.7 Proof of Lemma 14

To begin with, for any (s,a)𝒮×𝒜(s,a)\in{\mathcal{S}}\times\mathcal{A}, invoking the results in (141), we have

|P^π^,V^s,aV^π^,σPπ^,V^s,aV^π^,σ|maxα[minsV^π^,σ(s),maxsV^π^,σ(s)]|(P0s,aP^0s,a)[V^π^,σ]α|\displaystyle\left|\widehat{P}^{\widehat{\pi},\widehat{V}}_{s,a}\widehat{V}^{\widehat{\pi},\sigma}-P^{\widehat{\pi},\widehat{V}}_{s,a}\widehat{V}^{\widehat{\pi},\sigma}\right|\leq\max_{\alpha\in[\min_{s}\widehat{V}^{\widehat{\pi},\sigma}(s),\max_{s}\widehat{V}^{\widehat{\pi},\sigma}(s)]}\left|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)\big{[}\widehat{V}^{\widehat{\pi},\sigma}\big{]}_{\alpha}\right|
(i)maxα[minsV^π^,σ(s),maxsV^π^,σ(s)](|(P0s,aP^0s,a)[V^,σ]α|+|(P0s,aP^0s,a)([V^π^,σ]α[V^,σ]α)|)\displaystyle\overset{\mathrm{(i)}}{\leq}\max_{\alpha\in[\min_{s}\widehat{V}^{\widehat{\pi},\sigma}(s),\max_{s}\widehat{V}^{\widehat{\pi},\sigma}(s)]}\left(\left|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)\big{[}\widehat{V}^{\star,\sigma}\big{]}_{\alpha}\right|+\left|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)\left(\big{[}\widehat{V}^{\widehat{\pi},\sigma}\big{]}_{\alpha}-\big{[}\widehat{V}^{\star,\sigma}\big{]}_{\alpha}\right)\right|\right)
maxα[minsV^π^,σ(s),maxsV^π^,σ(s)](|(P0s,aP^0s,a)[V^,σ]α|+P0s,aP^0s,a1[V^π^,σ]α[V^,σ]α)\displaystyle\leq\max_{\alpha\in[\min_{s}\widehat{V}^{\widehat{\pi},\sigma}(s),\max_{s}\widehat{V}^{\widehat{\pi},\sigma}(s)]}\Big{(}\left|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)\big{[}\widehat{V}^{\star,\sigma}\right]_{\alpha}\big{|}+\left\|P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right\|_{1}\left\|\big{[}\widehat{V}^{\widehat{\pi},\sigma}\big{]}_{\alpha}-\big{[}\widehat{V}^{\star,\sigma}\right]_{\alpha}\big{\|}_{\infty}\Big{)}
(ii)maxα[minsV^π^,σ(s),maxsV^π^,σ(s)]|(P0s,aP^0s,a)[V^,σ]α|+2V^π^,σV^,σ\displaystyle\overset{\mathrm{(ii)}}{\leq}\max_{\alpha\in[\min_{s}\widehat{V}^{\widehat{\pi},\sigma}(s),\max_{s}\widehat{V}^{\widehat{\pi},\sigma}(s)]}\left|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)\big{[}\widehat{V}^{\star,\sigma}\big{]}_{\alpha}\right|+2\left\|\widehat{V}^{\widehat{\pi},\sigma}-\widehat{V}^{\star,\sigma}\right\|_{\infty}
(iii)maxα[minsV^π^,σ(s),maxsV^π^,σ(s)]|(P0s,aP^0s,a)[V^,σ]α|+2γε𝗈𝗉𝗍1γ,\displaystyle\overset{\mathrm{(iii)}}{\leq}\max_{\alpha\in[\min_{s}\widehat{V}^{\widehat{\pi},\sigma}(s),\max_{s}\widehat{V}^{\widehat{\pi},\sigma}(s)]}\left|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)\big{[}\widehat{V}^{\star,\sigma}\big{]}_{\alpha}\right|+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma}, (156)

where (i) holds by the triangle inequality, and (ii) follows from P0s,aP^0s,a12\big{\|}P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\big{\|}_{1}\leq 2 and [V^π^,σ]α[V^,σ]αV^π^,σV^,σ\big{\|}\big{[}\widehat{V}^{\widehat{\pi},\sigma}\big{]}_{\alpha}-\big{[}\widehat{V}^{\star,\sigma}\big{]}_{\alpha}\big{\|}_{\infty}\leq\big{\|}\widehat{V}^{\widehat{\pi},\sigma}-\widehat{V}^{\star,\sigma}\big{\|}_{\infty}, and (iii) follows from (50).

To control |(P0s,aP^0s,a)[V^,σ]α|\left|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)\big{[}\widehat{V}^{\star,\sigma}\big{]}_{\alpha}\right| in (156) for any given α[0,11γ]\alpha\in\big{[}0,\frac{1}{1-\gamma}\big{]}, and to tame the dependency between V^,σ\widehat{V}^{\star,\sigma} and P^0\widehat{P}^{0}, we resort to the following leave-one-out argument motivated by (Agarwal et al.,, 2020; Li et al., 2022b, ; Shi and Chi,, 2022). Specifically, we first construct a set of auxiliary RMDPs which simultaneously have the desired statistical independence between robust value functions and the estimated nominal transition kernel, and are minimally different from the original RMDPs under consideration. Then we control the term of interest associated with these auxiliary RMDPs and show that it is close to the target quantity for the desired RMDP. The argument proceeds in several steps, as detailed below.

Step 1: construction of auxiliary RMDPs with deterministic empirical nominal transitions.

Recall that we target the empirical infinite-horizon robust MDP ^𝗋𝗈𝖻\widehat{\mathcal{M}}_{\mathsf{rob}} with the nominal transition kernel P^0\widehat{P}^{0}. Towards this, we can construct an auxiliary robust MDP ^s,u𝗋𝗈𝖻\widehat{\mathcal{M}}^{s,u}_{\mathsf{rob}} for each state ss and any non-negative scalar u0u\geq 0, so that it is the same as ^𝗋𝗈𝖻\widehat{\mathcal{M}}_{\mathsf{rob}} except for the transition properties in state ss. In particular, we define the nominal transition kernel and reward function of ^s,u𝗋𝗈𝖻\widehat{\mathcal{M}}^{s,u}_{\mathsf{rob}} as Ps,uP^{s,u} and rs,ur^{s,u}, which are expressed as follows

{Ps,u(s|s,a)=𝟙(s=s)for all (s,a)𝒮×𝒜,Ps,u(|s~,a)=P^0(|s~,a)for all (s~,a)𝒮×𝒜 and s~s,\displaystyle\begin{cases}P^{s,u}(s^{\prime}\,|\,s,a)=\operatorname{\mathds{1}}(s^{\prime}=s)&\qquad\qquad\text{for all }(s^{\prime},a)\in{\mathcal{S}}\times\mathcal{A},\\ P^{s,u}(\cdot\,|\,\widetilde{s},a)=\widehat{P}^{0}(\cdot\,|\,\widetilde{s},a)&\qquad\qquad\text{for all }(\widetilde{s},a)\in{\mathcal{S}}\times\mathcal{A}\text{ and }\widetilde{s}\neq s,\end{cases} (157)

and

{rs,u(s,a)=ufor all a𝒜,rs,u(s~,a)=r(s~,a)for all (s~,a)𝒮×𝒜 and s~s.\displaystyle\begin{cases}r^{s,u}(s,a)=u&\qquad\qquad\qquad\text{for all }a\in\mathcal{A},\\ r^{s,u}(\widetilde{s},a)=r(\widetilde{s},a)&\qquad\qquad\qquad\text{for all }(\widetilde{s},a)\in{\mathcal{S}}\times\mathcal{A}\text{ and }\widetilde{s}\neq s.\end{cases} (158)

It is evident that the nominal transition kernel of the auxiliary ^s,u𝗋𝗈𝖻\widehat{\mathcal{M}}^{s,u}_{\mathsf{rob}} is deterministic at state ss, i.e., it never leaves state ss once entered. This useful property removes the randomness of P^0s,a\widehat{P}^{0}_{s,a} for all a𝒜a\in\mathcal{A} in state ss, which will be leveraged later.
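For concreteness, the construction in (157) and (158) amounts to the following short Python sketch (array shapes and variable names are ours and purely illustrative): starting from an estimated nominal kernel of shape (S, A, S) and a reward table of shape (S, A), state s is made absorbing and its reward is replaced by the scalar u, while all other entries are left untouched.

import numpy as np

def auxiliary_rmdp(P_hat0, r, s, u):
    # P_hat0: estimated nominal kernel, shape (S, A, S); r: reward table, shape (S, A)
    P_su = P_hat0.copy()
    P_su[s, :, :] = 0.0
    P_su[s, :, s] = 1.0            # state s becomes absorbing under every action, as in (157)
    r_su = r.copy()
    r_su[s, :] = u                 # constant reward u at state s, as in (158)
    return P_su, r_su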

Correspondingly, the robust Bellman operator 𝒯^σs,u()\widehat{{\mathcal{T}}}^{\sigma}_{s,u}(\cdot) associated with the RMDP ^s,u𝗋𝗈𝖻\widehat{\mathcal{M}}^{s,u}_{\mathsf{rob}} is defined as

(s~,a)𝒮×𝒜:𝒯^σs,u(Q)(s~,a)=rs,u(s~,a)+γinf𝒫𝒰σ(Ps,us~,a)𝒫V,with V(s~)=maxaQ(s~,a).\displaystyle\forall(\tilde{s},a)\in{\mathcal{S}}\times\mathcal{A}:\quad\widehat{{\mathcal{T}}}^{\sigma}_{s,u}(Q)(\tilde{s},a)=r^{s,u}(\tilde{s},a)+\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(P^{s,u}_{\tilde{s},a})}\mathcal{P}V,\qquad\text{with }V(\tilde{s})=\max_{a}Q(\tilde{s},a). (159)
Step 2: fixed-point equivalence between ^𝗋𝗈𝖻\widehat{\mathcal{M}}_{\mathsf{rob}} and the auxiliary RMDP ^s,u𝗋𝗈𝖻\widehat{\mathcal{M}}^{s,u}_{\mathsf{rob}}.

Recall that Q^,σ\widehat{Q}^{\star,\sigma} is the unique fixed point of 𝒯^σ()\widehat{{\mathcal{T}}}^{\sigma}(\cdot) with the corresponding robust value V^,σ\widehat{V}^{\star,\sigma}. We assert that the corresponding robust value function V^,σs,u\widehat{V}^{\star,\sigma}_{s,u^{\star}} obtained from the fixed point of 𝒯^σs,u()\widehat{{\mathcal{T}}}^{\sigma}_{s,u}(\cdot) aligns with the robust value function V^,σ\widehat{V}^{\star,\sigma} derived from 𝒯^σ()\widehat{{\mathcal{T}}}^{\sigma}(\cdot), as long as we choose uu in the following manner:

uu(s)=V^,σ(s)γinf𝒫𝒰σ(es)𝒫V^,σ.\displaystyle u^{\star}\coloneqq u^{\star}(s)=\widehat{V}^{\star,\sigma}(s)-\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(e_{s})}\mathcal{P}\widehat{V}^{\star,\sigma}. (160)

where ese_{s} is the ss-th standard basis vector in S\mathbb{R}^{S}. Towards verifying this, we shall break our arguments in two different cases.

  • For state ss: One has for any a𝒜a\in\mathcal{A}:

    rs,u(s,a)+γinf𝒫𝒰σ(Ps,us,a)𝒫V^,σ\displaystyle r^{s,u^{\star}}(s,a)+\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(P^{s,u^{\star}}_{s,a})}\mathcal{P}\widehat{V}^{\star,\sigma} =u+γinf𝒫𝒰σ(es)𝒫V^,σ\displaystyle=u^{\star}+\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(e_{s})}\mathcal{P}\widehat{V}^{\star,\sigma}
    =V^,σ(s)γinf𝒫𝒰σ(es)𝒫V^,σ+γinf𝒫𝒰σ(es)𝒫V^,σ=V^,σ(s),\displaystyle=\widehat{V}^{\star,\sigma}(s)-\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(e_{s})}\mathcal{P}\widehat{V}^{\star,\sigma}+\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(e_{s})}\mathcal{P}\widehat{V}^{\star,\sigma}=\widehat{V}^{\star,\sigma}(s), (161)

    where the first equality follows from the definition of Ps,us,aP^{s,u^{\star}}_{s,a} in (157), and the second equality follows from plugging in the definition of uu^{\star} in (160).

  • For state sss^{\prime}\neq s: It is easily verified that for all a𝒜a\in\mathcal{A},

    rs,u(s,a)+γinf𝒫𝒰σ(Ps,us,a)𝒫V^,σ\displaystyle r^{s,u^{\star}}(s^{\prime},a)+\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(P^{s,u^{\star}}_{s^{\prime},a})}\mathcal{P}\widehat{V}^{\star,\sigma} =r(s,a)+γinf𝒫𝒰σ(P^0s,a)𝒫V^,σ\displaystyle=r(s^{\prime},a)+\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(\widehat{P}^{0}_{s^{\prime},a})}\mathcal{P}\widehat{V}^{\star,\sigma}
    =𝒯^σ(Q^,σ)(s,a)=Q^,σ(s,a),\displaystyle=\widehat{{\mathcal{T}}}^{\sigma}(\widehat{Q}^{\star,\sigma})(s^{\prime},a)=\widehat{Q}^{\star,\sigma}(s^{\prime},a), (162)

    where the first equality follows from the definitions in (158) and (157), and the last line arises from the definition of the robust Bellman operator in (15), and that Q^,σ\widehat{Q}^{\star,\sigma} is the fixed point of 𝒯^σ()\widehat{{\mathcal{T}}}^{\sigma}(\cdot) (see Lemma 4).

Combining the facts in the above two cases, we establish that there exists a fixed point Q^,σs,u\widehat{Q}^{\star,\sigma}_{s,u^{\star}} of the operator 𝒯^σs,u()\widehat{{\mathcal{T}}}^{\sigma}_{s,u^{\star}}(\cdot) by taking

{Q^,σs,u(s,a)=V^,σ(s)for all a𝒜,Q^,σs,u(s,a)=Q^,σ(s,a)for all ss and a𝒜.\displaystyle\begin{cases}\widehat{Q}^{\star,\sigma}_{s,u^{\star}}(s,a)=\widehat{V}^{\star,\sigma}(s)&\qquad\qquad\qquad\text{for all }a\in\mathcal{A},\\ \widehat{Q}^{\star,\sigma}_{s,u^{\star}}(s^{\prime},a)=\widehat{Q}^{\star,\sigma}(s^{\prime},a)&\qquad\qquad\qquad\text{for all }s^{\prime}\neq s\text{ and }a\in\mathcal{A}.\end{cases} (163)

Consequently, we confirm the existence of a fixed point of the operator 𝒯^σs,u()\widehat{{\mathcal{T}}}^{\sigma}_{s,u^{\star}}(\cdot). In addition, its corresponding value function V^,σs,u\widehat{V}^{\star,\sigma}_{s,u^{\star}} also coincides with V^,σ\widehat{V}^{\star,\sigma}. Note that the correspondence between ^𝗋𝗈𝖻\widehat{\mathcal{M}}_{\mathsf{rob}} and ^𝗋𝗈𝖻s,u\widehat{\mathcal{M}}_{\mathsf{rob}}^{s,u} established in Step 1 and Step 2 in fact holds for any uncertainty set.

Step 3: building an ε\varepsilon-net for all reward values uu.

It is easily verified that

0uV^,σ(s)11γ.\displaystyle 0\leq u^{\star}\leq\widehat{V}^{\star,\sigma}(s)\leq\frac{1}{1-\gamma}. (164)

We can construct an ε2\varepsilon_{2}-net Nε2N_{\varepsilon_{2}} over the interval [0,11γ]\big{[}0,\frac{1}{1-\gamma}\big{]}, whose size is bounded by |Nε2|3ε2(1γ)|N_{\varepsilon_{2}}|\leq\frac{3}{\varepsilon_{2}(1-\gamma)} (Vershynin,, 2018). Following the same arguments as in the proof of Lemma 4, we can demonstrate that for each uNε2u\in N_{\varepsilon_{2}}, there exists a unique fixed point Q^,σs,u\widehat{Q}^{\star,\sigma}_{s,u} of the operator 𝒯^σs,u()\widehat{{\mathcal{T}}}^{\sigma}_{s,u}(\cdot), which satisfies 0Q^,σs,u11γ1{0}\leq\widehat{Q}^{\star,\sigma}_{s,u}\leq\frac{1}{1-\gamma}\cdot 1. Consequently, the corresponding robust value function also satisfies V^,σs,u11γ\left\|\widehat{V}^{\star,\sigma}_{s,u}\right\|_{\infty}\leq\frac{1}{1-\gamma}.

By the definitions in (157) and (158), we observe that for all uNε2u\in N_{\varepsilon_{2}}, ^𝗋𝗈𝖻s,u\widehat{\mathcal{M}}_{\mathsf{rob}}^{s,u} is statistically independent from P^0s,a\widehat{P}^{0}_{s,a}. This independence indicates that [V^,σs,u]α[\widehat{V}^{\star,\sigma}_{s,u}]_{\alpha} and P^0s,a\widehat{P}^{0}_{s,a} are independent for a fixed α\alpha. With this in mind, invoking the fact in (145) and (146) and taking the union bound over all (s,a,α)𝒮×𝒜×Nε1(s,a,\alpha)\in{\mathcal{S}}\times\mathcal{A}\times N_{\varepsilon_{1}}, uNε2u\in N_{\varepsilon_{2}} yields that, with probability at least 1δ1-\delta, it holds for all (s,a,u)𝒮×𝒜×Nε2(s,a,u)\in{\mathcal{S}}\times\mathcal{A}\times N_{\varepsilon_{2}} that

maxα[0,1/(1γ)]|(P0s,aP^0s,a)[V^,σs,u]α|\displaystyle\max_{\alpha\in[0,1/(1-\gamma)]}\left|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)\big{[}\widehat{V}^{\star,\sigma}_{s,u}\big{]}_{\alpha}\right| ε2+2log(18SAN|Nε2|δ)NVarP0s,a(V^,σs,u)+2log(18SAN|Nε2|δ)3N(1γ)\displaystyle\leq\varepsilon_{2}+2\sqrt{\frac{\log(\frac{18SAN|N_{\varepsilon_{2}}|}{\delta})}{N}}\sqrt{\mathrm{Var}_{P^{0}_{s,a}}(\widehat{V}^{\star,\sigma}_{s,u})}+\frac{2\log(\frac{18SAN|N_{\varepsilon_{2}}|}{\delta})}{3N(1-\gamma)}
ε2+3log(18SAN|Nε2|δ)(1γ)2N,\displaystyle\leq\varepsilon_{2}+3\sqrt{\frac{\log(\frac{18SAN|N_{\varepsilon_{2}}|}{\delta})}{(1-\gamma)^{2}N}}, (165)

where the last inequality holds by the fact VarP0s,a(V^,σs,u)V^,σs,u11γ\mathrm{Var}_{P^{0}_{s,a}}(\widehat{V}^{\star,\sigma}_{s,u})\leq\|\widehat{V}^{\star,\sigma}_{s,u}\|_{\infty}\leq\frac{1}{1-\gamma} and letting Nlog(18SAN|Nε2|δ)N\geq\log\left(\frac{18SAN|N_{\varepsilon_{2}}|}{\delta}\right).

Step 4: uniform concentration.

Recalling that u[0,11γ]u^{\star}\in\big{[}0,\frac{1}{1-\gamma}\big{]} (see (164)), we can always find some u¯Nε2\overline{u}\in N_{\varepsilon_{2}} such that |u¯u|ε2|\overline{u}-u^{\star}|\leq\varepsilon_{2}. Consequently, plugging in the operator 𝒯^σs,u()\widehat{{\mathcal{T}}}^{\sigma}_{s,u}(\cdot) in (159) yields

QSA:𝒯^σs,u¯(Q)𝒯^σs,u(Q)=|u¯u|ε2\displaystyle\forall Q\in\mathbb{R}^{SA}:\quad\Big{\|}\widehat{{\mathcal{T}}}^{\sigma}_{s,\overline{u}}(Q)-\widehat{{\mathcal{T}}}^{\sigma}_{s,u^{\star}}(Q)\Big{\|}_{\infty}=|\overline{u}-u^{\star}|\leq\varepsilon_{2}

With this in mind, we observe that the fixed points of 𝒯^σs,u¯()\widehat{{\mathcal{T}}}^{\sigma}_{s,\overline{u}}(\cdot) and 𝒯^σs,u()\widehat{{\mathcal{T}}}^{\sigma}_{s,u^{\star}}(\cdot) obey

Q^,σs,u¯Q^,σs,u\displaystyle\left\|\widehat{Q}^{\star,\sigma}_{s,\overline{u}}-\widehat{Q}^{\star,\sigma}_{s,u^{\star}}\right\|_{\infty} =𝒯^σs,u¯(Q^,σs,u¯)𝒯^σs,u(Q^,σs,u)\displaystyle=\left\|\widehat{{\mathcal{T}}}^{\sigma}_{s,\overline{u}}(\widehat{Q}^{\star,\sigma}_{s,\overline{u}})-\widehat{{\mathcal{T}}}^{\sigma}_{s,u^{\star}}(\widehat{Q}^{\star,\sigma}_{s,u^{\star}})\right\|_{\infty}
𝒯^σs,u¯(Q^,σs,u¯)𝒯^σs,u¯(Q^,σs,u)+𝒯^σs,u¯(Q^,σs,u)𝒯^σs,u(Q^,σs,u)\displaystyle\leq\left\|\widehat{{\mathcal{T}}}^{\sigma}_{s,\overline{u}}(\widehat{Q}^{\star,\sigma}_{s,\overline{u}})-\widehat{{\mathcal{T}}}^{\sigma}_{s,\overline{u}}(\widehat{Q}^{\star,\sigma}_{s,u^{\star}})\right\|_{\infty}+\left\|\widehat{{\mathcal{T}}}^{\sigma}_{s,\overline{u}}(\widehat{Q}^{\star,\sigma}_{s,u^{\star}})-\widehat{{\mathcal{T}}}^{\sigma}_{s,u^{\star}}(\widehat{Q}^{\star,\sigma}_{s,u^{\star}})\right\|_{\infty}
γQ^,σs,u¯Q^,σs,u+ε2,\displaystyle\leq\gamma\left\|\widehat{Q}^{\star,\sigma}_{s,\overline{u}}-\widehat{Q}^{\star,\sigma}_{s,u^{\star}}\right\|_{\infty}+\varepsilon_{2},

where the last inequality holds by the fact that 𝒯^σs,u()\widehat{{\mathcal{T}}}^{\sigma}_{s,u}(\cdot) is a γ\gamma-contraction. It directly indicates that

Q^,σs,u¯Q^,σs,uε2(1γ)andV^,σs,u¯V^,σs,uQ^,σs,u¯Q^,σs,uε2(1γ).\displaystyle\left\|\widehat{Q}^{\star,\sigma}_{s,\overline{u}}-\widehat{Q}^{\star,\sigma}_{s,u^{\star}}\right\|_{\infty}\leq\frac{\varepsilon_{2}}{(1-\gamma)}\quad\mbox{and}\quad\left\|\widehat{V}^{\star,\sigma}_{s,\overline{u}}-\widehat{V}^{\star,\sigma}_{s,u^{\star}}\right\|_{\infty}\leq\left\|\widehat{Q}^{\star,\sigma}_{s,\overline{u}}-\widehat{Q}^{\star,\sigma}_{s,u^{\star}}\right\|_{\infty}\leq\frac{\varepsilon_{2}}{(1-\gamma)}. (166)

Armed with the above facts, to control the first term in (156), invoking the identity V^,σ=V^,σs,u\widehat{V}^{\star,\sigma}=\widehat{V}^{\star,\sigma}_{s,u^{\star}} established in Step 2 gives that: for all (s,a)𝒮×𝒜(s,a)\in{\mathcal{S}}\times\mathcal{A},

maxα[minsV^π^,σ(s),maxsV^π^,σ(s)]|(P0s,aP^0s,a)[V^,σ]α|\displaystyle\max_{\alpha\in[\min_{s}\widehat{V}^{\widehat{\pi},\sigma}(s),\max_{s}\widehat{V}^{\widehat{\pi},\sigma}(s)]}\left|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)[\widehat{V}^{\star,\sigma}]_{\alpha}\right|
maxα[0,1/(1γ)]|(P0s,aP^0s,a)[V^,σ]α|=maxα[0,1/(1γ)]|(P0s,aP^0s,a)[V^,σs,u]α|\displaystyle\leq\max_{\alpha\in[0,1/(1-\gamma)]}\left|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)[\widehat{V}^{\star,\sigma}]_{\alpha}\right|=\max_{\alpha\in[0,1/(1-\gamma)]}\left|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)[\widehat{V}^{\star,\sigma}_{s,u^{\star}}]_{\alpha}\right|
(i)maxα[0,1/(1γ)]{|(P0s,aP^0s,a)[V^,σs,u¯]α|+|(P0s,aP^0s,a)([V^,σs,u¯]α[V^,σs,u]α)|}\displaystyle\overset{\mathrm{(i)}}{\leq}\max_{\alpha\in[0,1/(1-\gamma)]}\left\{\left|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)[\widehat{V}^{\star,\sigma}_{s,\overline{u}}]_{\alpha}\right|+\left|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)\left([\widehat{V}^{\star,\sigma}_{s,\overline{u}}]_{\alpha}-[\widehat{V}^{\star,\sigma}_{s,u^{\star}}]_{\alpha}\right)\right|\right\}
(ii)maxα[0,1/(1γ)]|(P0s,aP^0s,a)[V^,σs,u¯]α|+2ε2(1γ)\displaystyle\overset{\mathrm{(ii)}}{\leq}\max_{\alpha\in[0,1/(1-\gamma)]}\left|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)[\widehat{V}^{\star,\sigma}_{s,\overline{u}}]_{\alpha}\right|+\frac{2\varepsilon_{2}}{(1-\gamma)}
(iii)2ε2(1γ)+ε2+2log(18SAN|Nε2|δ)NVarP0s,a(V^,σs,u)+2log(18SAN|Nε2|δ)3N(1γ)\displaystyle\overset{\mathrm{(iii)}}{\leq}\frac{2\varepsilon_{2}}{(1-\gamma)}+\varepsilon_{2}+2\sqrt{\frac{\log(\frac{18SAN|N_{\varepsilon_{2}}|}{\delta})}{N}}\sqrt{\mathrm{Var}_{P^{0}_{s,a}}(\widehat{V}^{\star,\sigma}_{s,u})}+\frac{2\log(\frac{18SAN|N_{\varepsilon_{2}}|}{\delta})}{3N(1-\gamma)}
3ε2(1γ)+2log(18SAN|Nε2|δ)NVarP0s,a(V^,σ)+2log(18SAN|Nε2|δ)3N(1γ)\displaystyle\leq\frac{3\varepsilon_{2}}{(1-\gamma)}+2\sqrt{\frac{\log(\frac{18SAN|N_{\varepsilon_{2}}|}{\delta})}{N}}\sqrt{\mathrm{Var}_{P^{0}_{s,a}}(\widehat{V}^{\star,\sigma})}+\frac{2\log(\frac{18SAN|N_{\varepsilon_{2}}|}{\delta})}{3N(1-\gamma)}
+2log(18SAN|Nε2|δ)N|VarP0s,a(V^,σ)VarP0s,a(V^,σs,u¯)|\displaystyle\qquad+2\sqrt{\frac{\log(\frac{18SAN|N_{\varepsilon_{2}}|}{\delta})}{N}}\sqrt{\left|\mathrm{Var}_{P^{0}_{s,a}}(\widehat{V}^{\star,\sigma})-\mathrm{Var}_{P^{0}_{s,a}}(\widehat{V}^{\star,\sigma}_{s,\overline{u}})\right|}
(iv)3ε2(1γ)+2log(18SAN|Nε2|δ)NVarP0s,a(V^,σ)+2log(18SAN|Nε2|δ)3N(1γ)+22ε2log(18SAN|Nε2|δ)N(1γ)2\displaystyle\overset{\mathrm{(iv)}}{\leq}\frac{3\varepsilon_{2}}{(1-\gamma)}+2\sqrt{\frac{\log(\frac{18SAN|N_{\varepsilon_{2}}|}{\delta})}{N}}\sqrt{\mathrm{Var}_{P^{0}_{s,a}}(\widehat{V}^{\star,\sigma})}+\frac{2\log(\frac{18SAN|N_{\varepsilon_{2}}|}{\delta})}{3N(1-\gamma)}+2\sqrt{\frac{2\varepsilon_{2}\log(\frac{18SAN|N_{\varepsilon_{2}}|}{\delta})}{N(1-\gamma)^{2}}}
2log(54SAN2(1γ)δ)NVarP0s,a(V^,σ)+8log(54SAN2(1γ)δ)N(1γ)\displaystyle\leq 2\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N}}\sqrt{\mathrm{Var}_{P^{0}_{s,a}}(\widehat{V}^{\star,\sigma})}+\frac{8\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N(1-\gamma)} (167)
10log(54SAN2(1γ)δ)(1γ)2N,\displaystyle\leq 10\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{(1-\gamma)^{2}N}}, (168)

where (i) holds by the triangle inequality, (ii) arises from (the last inequality holds by (166))

|(P0s,aP^0s,a)([V^,σs,u¯]α[V^,σs,u]α)|\displaystyle\left|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)\left([\widehat{V}^{\star,\sigma}_{s,\overline{u}}]_{\alpha}-[\widehat{V}^{\star,\sigma}_{s,u^{\star}}]_{\alpha}\right)\right| P0s,aP^0s,a1[V^,σs,u¯]α[V^,σs,u]α\displaystyle\leq\left\|P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right\|_{1}\left\|[\widehat{V}^{\star,\sigma}_{s,\overline{u}}]_{\alpha}-[\widehat{V}^{\star,\sigma}_{s,u^{\star}}]_{\alpha}\right\|_{\infty}
2V^,σs,u¯V^,σs,u2ε2(1γ),\displaystyle\leq 2\left\|\widehat{V}^{\star,\sigma}_{s,\overline{u}}-\widehat{V}^{\star,\sigma}_{s,u^{\star}}\right\|_{\infty}\leq\frac{2\varepsilon_{2}}{(1-\gamma)}, (169)

(iii) follows from (165), (iv) can be verified by applying Lemma 3 with (166). Here, the penultimate inequality holds by letting ε2=log(18SAN|Nε2|δ)N\varepsilon_{2}=\frac{\log(\frac{18SAN|N_{\varepsilon_{2}}|}{\delta})}{N}, which leads to |Nε2|3ε2(1γ)3N1γ|N_{\varepsilon_{2}}|\leq\frac{3}{\varepsilon_{2}(1-\gamma)}\leq\frac{3N}{1-\gamma}, and the last inequality holds by the fact VarP0s,a(V^,σ)V^,σ11γ\mathrm{Var}_{P^{0}_{s,a}}(\widehat{V}^{\star,\sigma})\leq\|\widehat{V}^{\star,\sigma}\|_{\infty}\leq\frac{1}{1-\gamma} and letting Nlog(54SAN2(1γ)δ)N\geq\log\left(\frac{54SAN^{2}}{(1-\gamma)\delta}\right).
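For completeness, the substitution of the net size into the logarithmic factor is the elementary computation

\log\Big(\frac{18SAN|N_{\varepsilon_{2}}|}{\delta}\Big)\leq\log\Big(\frac{18SAN}{\delta}\cdot\frac{3N}{1-\gamma}\Big)=\log\Big(\frac{54SAN^{2}}{(1-\gamma)\delta}\Big),

which explains how the logarithmic factor in (167) and (168) arises.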

Step 5: finishing up.

Inserting (167) and (168) back into (156) gives that, with probability at least 1δ1-\delta,

|P^π^,V^s,aV^π^,σPπ^,V^s,aV^π^,σ|\displaystyle\left|\widehat{P}^{\widehat{\pi},\widehat{V}}_{s,a}\widehat{V}^{\widehat{\pi},\sigma}-P^{\widehat{\pi},\widehat{V}}_{s,a}\widehat{V}^{\widehat{\pi},\sigma}\right| maxα[minsV^π^,σ(s),maxsV^π^,σ(s)]|(P0s,aP^0s,a)[V^,σ]α|+2γε𝗈𝗉𝗍1γ\displaystyle\leq\max_{\alpha\in[\min_{s}\widehat{V}^{\widehat{\pi},\sigma}(s),\max_{s}\widehat{V}^{\widehat{\pi},\sigma}(s)]}\left|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)[\widehat{V}^{\star,\sigma}]_{\alpha}\right|+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma}
maxα[0,1/(1γ)]|(P0s,aP^0s,a)[V^,σ]α|+2γε𝗈𝗉𝗍1γ\displaystyle\leq\max_{\alpha\in[0,1/(1-\gamma)]}\left|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)[\widehat{V}^{\star,\sigma}]_{\alpha}\right|+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma}
2log(54SAN2(1γ)δ)NVarP0s,a(V^,σ)+8log(54SAN2(1γ)δ)N(1γ)+2γε𝗈𝗉𝗍1γ\displaystyle\leq 2\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N}}\sqrt{\mathrm{Var}_{P^{0}_{s,a}}(\widehat{V}^{\star,\sigma})}+\frac{8\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N(1-\gamma)}+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma}
10log(54SAN2(1γ)δ)(1γ)2N+2γε𝗈𝗉𝗍1γ\displaystyle\leq 10\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{(1-\gamma)^{2}N}}+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma} (170)

holds for all (s,a)𝒮×𝒜(s,a)\in{\mathcal{S}}\times\mathcal{A}.

Finally, we complete the proof by compiling everything into the matrix form as follows:

|P^¯π^,V^V^π^,σP¯π^,V^V^π^,σ|\displaystyle\bigg{|}\underline{\widehat{P}}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}-\underline{P}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}\bigg{|} 2log(54SAN2(1γ)δ)NVarP0s,a(V^,σ)1+8log(54SAN2(1γ)δ)N(1γ)1+2γε𝗈𝗉𝗍1γ1\displaystyle\leq 2\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N}}\sqrt{\mathrm{Var}_{P^{0}_{s,a}}(\widehat{V}^{\star,\sigma})}1+\frac{8\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{N(1-\gamma)}1+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma}1
10log(54SAN2(1γ)δ)(1γ)2N1+2γε𝗈𝗉𝗍1γ1.\displaystyle\leq 10\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{(1-\gamma)^{2}N}}1+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma}1. (171)

B.8 Proof of Lemma 15

The proof follows by directly applying the same routine as in Appendix B.6. Towards this, similarly to (149), we arrive at

(IγP¯π^,V^)1VarP¯π^,V^(V^π^,σ)11γt=0γt(P¯π^,V^)tVarP¯π^,V^(V^π^,σ).\displaystyle\Big{(}I-\gamma\underline{P}^{\widehat{\pi},\widehat{V}}\Big{)}^{-1}\sqrt{\mathrm{Var}_{\underline{P}^{\widehat{\pi},\widehat{V}}}(\widehat{V}^{\widehat{\pi},\sigma})}\leq\sqrt{\frac{1}{1-\gamma}}\sqrt{\sum_{t=0}^{\infty}\gamma^{t}\Big{(}\underline{P}^{\widehat{\pi},\widehat{V}}\Big{)}^{t}\mathrm{Var}_{\underline{P}^{\widehat{\pi},\widehat{V}}}(\widehat{V}^{\widehat{\pi},\sigma})}. (172)

To control VarP¯π^,V^(V^π^,σ)\mathrm{Var}_{\underline{P}^{\widehat{\pi},\widehat{V}}}(\widehat{V}^{\widehat{\pi},\sigma}), we denote the minimum value of V^π^,σ\widehat{V}^{\widehat{\pi},\sigma} as Vmin=mins𝒮V^π^,σ(s)V_{\min}=\min_{s\in{\mathcal{S}}}\widehat{V}^{\widehat{\pi},\sigma}(s) and VV^π^,σVmin1V^{\prime}\coloneqq\widehat{V}^{\widehat{\pi},\sigma}-V_{\min}1. By the same argument as (151), we arrive at

VarP¯π^,V^(V^π^,σ)\displaystyle\mathrm{Var}_{\underline{P}^{\widehat{\pi},\widehat{V}}}(\widehat{V}^{\widehat{\pi},\sigma})
P¯π^,V^(VV)1γVV+2γ2V1+2γV|(P^¯π^,V^P¯π^,V^)V^π^,σ|\displaystyle\leq\underline{P}^{\widehat{\pi},\widehat{V}}\left(V^{\prime}\circ V^{\prime}\right)-\frac{1}{\gamma}V^{\prime}\circ V^{\prime}+\frac{2}{\gamma^{2}}\|V^{\prime}\|_{\infty}1+\frac{2}{\gamma}\|V^{\prime}\|_{\infty}\left|\left(\underline{\widehat{P}}^{\widehat{\pi},\widehat{V}}-\underline{P}^{\widehat{\pi},\widehat{V}}\right)\widehat{V}^{\widehat{\pi},\sigma}\right|
P¯π^,V^(VV)1γVV+2γ2V1+2γV(10log(54SAN2(1γ)δ)(1γ)2N+2γε𝗈𝗉𝗍1γ)1,\displaystyle\leq\underline{P}^{\widehat{\pi},\widehat{V}}\left(V^{\prime}\circ V^{\prime}\right)-\frac{1}{\gamma}V^{\prime}\circ V^{\prime}+\frac{2}{\gamma^{2}}\|V^{\prime}\|_{\infty}1+\frac{2}{\gamma}\|V^{\prime}\|_{\infty}\Bigg{(}10\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{(1-\gamma)^{2}N}}+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma}\Bigg{)}1, (173)

where the last inequality makes use of Lemma 14. Plugging (173) back into (172) leads to

(IγP¯π^,V^)1VarP¯π^,V^(V^π^,σ)\displaystyle\Big{(}I-\gamma\underline{P}^{\widehat{\pi},\widehat{V}}\Big{)}^{-1}\sqrt{\mathrm{Var}_{\underline{P}^{\widehat{\pi},\widehat{V}}}(\widehat{V}^{\widehat{\pi},\sigma})} (i)11γ|t=0γt(P¯π^,V^)t(P¯π^,V^(VV)1γVV)|\displaystyle\overset{\mathrm{(i)}}{\leq}\sqrt{\frac{1}{1-\gamma}}\sqrt{\bigg{|}\sum_{t=0}^{\infty}\gamma^{t}\left(\underline{P}^{\widehat{\pi},\widehat{V}}\right)^{t}\Big{(}\underline{P}^{\widehat{\pi},\widehat{V}}\left(V^{\prime}\circ V^{\prime}\right)-\frac{1}{\gamma}V^{\prime}\circ V^{\prime}\Big{)}\bigg{|}}
+1(1γ)2γ2(2+20log(54SAN2(1γ)δ)(1γ)2N+2γε𝗈𝗉𝗍1γ)V1\displaystyle\quad+\sqrt{\frac{1}{(1-\gamma)^{2}\gamma^{2}}\bigg{(}2+20\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{(1-\gamma)^{2}N}}+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma}\bigg{)}\|V^{\prime}\|_{\infty}}1
(ii)V2γ(1γ)1+(2+20log(54SAN2(1γ)δ)(1γ)2N+2γε𝗈𝗉𝗍1γ)V(1γ)2γ21\displaystyle\overset{\mathrm{(ii)}}{\leq}\sqrt{\frac{\|V^{\prime}\|_{\infty}^{2}}{\gamma(1-\gamma)}}1+\sqrt{\frac{\bigg{(}2+20\sqrt{\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{(1-\gamma)^{2}N}}+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma}\bigg{)}\|V^{\prime}\|_{\infty}}{(1-\gamma)^{2}\gamma^{2}}}1
(iii)V2γ(1γ)1+24V(1γ)2γ216V(1γ)2γ21,\displaystyle\overset{\mathrm{(iii)}}{\leq}\sqrt{\frac{\|V^{\prime}\|_{\infty}^{2}}{\gamma(1-\gamma)}}1+\sqrt{\frac{24\|V^{\prime}\|_{\infty}}{(1-\gamma)^{2}\gamma^{2}}}1\leq 6\sqrt{\frac{\|V^{\prime}\|_{\infty}}{(1-\gamma)^{2}\gamma^{2}}}1, (174)

where (i) arises from following the routine of (153), (ii) holds by repeating the argument of (154), (iii) follows by taking Nlog(54SAN2(1γ)δ)(1γ)2N\geq\frac{\log(\frac{54SAN^{2}}{(1-\gamma)\delta})}{(1-\gamma)^{2}} and ε𝗈𝗉𝗍1γγ\varepsilon_{\mathsf{opt}}\leq\frac{1-\gamma}{\gamma}, and the last inequality holds by VV,σ11γ\|V^{\prime}\|_{\infty}\leq\|V^{\star,\sigma}\|_{\infty}\leq\frac{1}{1-\gamma}.

Finally, applying Lemma 6 with P=P^0P=\widehat{P}^{0} and π=π^\pi=\widehat{\pi} yields

Vmaxs𝒮V^π^,σ(s)mins𝒮V^π^,σ(s)1γmax{1γ,σ},\displaystyle\|V^{\prime}\|_{\infty}\leq\max_{s\in{\mathcal{S}}}\widehat{V}^{\widehat{\pi},\sigma}(s)-\min_{s\in{\mathcal{S}}}\widehat{V}^{\widehat{\pi},\sigma}(s)\leq\frac{1}{\gamma\max\{1-\gamma,\sigma\}},

which can be inserted into (174) and gives

(IγP¯π^,V^)1VarP¯π^,V^(V^π^,σ)61γ3(1γ)2max{1γ,σ}161(1γ)3γ21.\displaystyle\left(I-\gamma\underline{P}^{\widehat{\pi},\widehat{V}}\right)^{-1}\sqrt{\mathrm{Var}_{\underline{P}^{\widehat{\pi},\widehat{V}}}(\widehat{V}^{\widehat{\pi},\sigma})}\leq 6\sqrt{\frac{1}{\gamma^{3}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}}}1\leq 6\sqrt{\frac{1}{(1-\gamma)^{3}\gamma^{2}}}1.

Appendix C Proof of the auxiliary facts for Theorem 2

C.1 Proof of Lemma 10

Deriving the robust value function over different states.

For any ϕ\mathcal{M}_{\phi} with ϕ{0,1}\phi\in\{0,1\}, we first characterize the robust value function of any policy π\pi over different states. Before proceeding, we denote the minimum of the robust value function over states as below:

Vϕ,minπ,σmins𝒮Vϕπ,σ(s).V_{\phi,\min}^{\pi,\sigma}\coloneqq\min_{s\in{\mathcal{S}}}V_{\phi}^{\pi,\sigma}(s). (175)

Clearly, there exists at least one state sϕ,minπs_{\phi,\min}^{\pi} that satisfies Vϕπ,σ(sϕ,minπ)=Vϕ,minπ,σV_{\phi}^{\pi,\sigma}(s_{\phi,\min}^{\pi})=V_{\phi,\min}^{\pi,\sigma}.

With this in mind, it is easily observed that for any policy π\pi, the robust value function at state s=1s=1 obeys

Vϕπ,σ(1)\displaystyle V_{\phi}^{\pi,\sigma}(1) =𝔼aπ(| 1)[r(1,a)+γinf𝒫𝒰σ(Pϕ1,a)𝒫Vϕπ,σ]\displaystyle=\mathbb{E}_{a\sim\pi(\cdot\,|\,1)}\bigg{[}r(1,a)+\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(P^{\phi}_{1,a})}\mathcal{P}V_{\phi}^{\pi,\sigma}\bigg{]}
=(i)1+γ𝔼aπ(| 1)[P¯ϕ(1| 1,a)Vϕπ,σ(1)]+γσVϕ,minπ,σ=(ii)1+γ(1σ)Vϕπ,σ(1)+γσVϕ,minπ,σ,\displaystyle\overset{\mathrm{(i)}}{=}1+\gamma\mathbb{E}_{a\sim\pi(\cdot\,|\,1)}\left[\underline{P}^{\phi}(1\,|\,1,a)V_{\phi}^{\pi,\sigma}(1)\right]+\gamma\sigma V_{\phi,\min}^{\pi,\sigma}\overset{\mathrm{(ii)}}{=}1+\gamma(1-\sigma)V_{\phi}^{\pi,\sigma}(1)+\gamma\sigma V_{\phi,\min}^{\pi,\sigma}, (176)

where (i) holds by r(1,a)=1r(1,a)=1 for all a𝒜a\in\mathcal{A}^{\prime} and (74), and (ii) follows from Pϕ(1| 1,a)=1P^{\phi}(1\,|\,1,a)=1 for all a𝒜a\in\mathcal{A}^{\prime}.

Similarly, for any s{2,3,,S1}s\in\{2,3,\cdots,S-1\}, we have

Vϕπ,σ(s)\displaystyle V_{\phi}^{\pi,\sigma}(s) =0+γ𝔼aπ(|s)[P¯ϕ(1|s,a)Vϕπ,σ(1)]+γσVϕ,minπ,σ\displaystyle=0+\gamma\mathbb{E}_{a\sim\pi(\cdot\,|\,s)}\left[\underline{P}^{\phi}(1\,|\,s,a)V_{\phi}^{\pi,\sigma}(1)\right]+\gamma\sigma V_{\phi,\min}^{\pi,\sigma}
=γ(1σ)Vϕπ,σ(1)+γσVϕ,minπ,σ,\displaystyle=\gamma\left(1-\sigma\right)V_{\phi}^{\pi,\sigma}(1)+\gamma\sigma V_{\phi,\min}^{\pi,\sigma}, (177)

since r(s,a)=0r(s,a)=0 for all s{2,3,,S1}s\in\{2,3,\cdots,S-1\}, in view of the definition in (74).

Finally, we move on to computing Vϕπ,σ(0)V_{\phi}^{\pi,\sigma}(0), the robust value function at state 0 associated with any policy π\pi. First, it obeys

Vϕπ,σ(0)\displaystyle V_{\phi}^{\pi,\sigma}(0) =𝔼aπ(| 0)[r(0,a)+γinf𝒫𝒰σ(Pϕ0,a)𝒫Vπ,σϕ]\displaystyle=\mathbb{E}_{a\sim\pi(\cdot\,|\,0)}\bigg{[}r(0,a)+\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(P^{\phi}_{0,a})}\mathcal{P}V^{\pi,\sigma}_{\phi}\bigg{]}
=0+γπ(ϕ| 0)inf𝒫𝒰σ(Pϕ0,ϕ)𝒫Vπ,σϕ+γπ(1ϕ| 0)inf𝒫𝒰σ(Pϕ0,1ϕ)𝒫Vπ,σϕ.\displaystyle=0+\gamma\pi(\phi\,|\,0)\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(P^{\phi}_{0,\phi})}\mathcal{P}V^{\pi,\sigma}_{\phi}+\gamma\pi(1-\phi\,|\,0)\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(P^{\phi}_{0,1-\phi})}\mathcal{P}V^{\pi,\sigma}_{\phi}. (178)

Recalling the transition kernel defined in (64) and the fact about the uncertainty set over state 0 in (75), it is easily verified that the following probability vector P1Δ(𝒮)P_{1}\in\Delta({\mathcal{S}}), defined below, obeys P1𝒰σ(Pϕ0,ϕ)P_{1}\in\mathcal{U}^{\sigma}(P^{\phi}_{0,\phi}):

P1(0)=1p+σ𝟙(0=sϕ,minπ),P1(1)=p¯=pσ,\displaystyle P_{1}(0)=1-p+\sigma\operatorname{\mathds{1}}\left(0=s_{\phi,\min}^{\pi}\right),\qquad P_{1}(1)=\underline{p}=p-\sigma,
P1(s)=σ𝟙(s=sϕ,minπ),s{2,3,,S1},\displaystyle P_{1}(s)=\sigma\operatorname{\mathds{1}}\left(s=s_{\phi,\min}^{\pi}\right),\qquad\forall s\in\{2,3,\cdots,S-1\}, (179)

where p¯=pσ\underline{p}=p-\sigma due to (75). Similarly, the following probability vector P2Δ(𝒮)P_{2}\in\Delta({\mathcal{S}}) also falls into the uncertainty set 𝒰σ(Pϕ0,1ϕ)\mathcal{U}^{\sigma}(P^{\phi}_{0,1-\phi}):

P2(0)=1q+σ𝟙(0=sϕ,minπ),P2(1)=q¯=qσ,\displaystyle P_{2}(0)=1-q+\sigma\operatorname{\mathds{1}}\left(0=s_{\phi,\min}^{\pi}\right),\qquad P_{2}(1)=\underline{q}=q-\sigma,
P2(s)=σ𝟙(s=sϕ,minπ),s{2,3,,S1}.\displaystyle P_{2}(s)=\sigma\operatorname{\mathds{1}}\left(s=s_{\phi,\min}^{\pi}\right),\qquad\forall s\in\{2,3,\cdots,S-1\}. (180)

It is worth noting that P1P_{1} and P2P_{2} defined above are the worst-case perturbations, since the probability mass removed from state 11 is moved to the state with the least value. Plugging the above facts about P1𝒰σ(Pϕ0,ϕ)P_{1}\in\mathcal{U}^{\sigma}(P^{\phi}_{0,\phi}) and P2𝒰σ(Pϕ0,1ϕ)P_{2}\in\mathcal{U}^{\sigma}(P^{\phi}_{0,1-\phi}) into (178), we arrive at

Vϕπ,σ(0)\displaystyle V_{\phi}^{\pi,\sigma}(0) γπ(ϕ| 0)P1Vπ,σϕ+γπ(1ϕ| 0)P2Vπ,σϕ\displaystyle\leq\gamma\pi(\phi\,|\,0)P_{1}V^{\pi,\sigma}_{\phi}+\gamma\pi(1-\phi\,|\,0)P_{2}V^{\pi,\sigma}_{\phi}
=γπ(ϕ| 0)[(pσ)Vϕπ,σ(1)+(1p)Vϕπ,σ(0)+σVϕ,minπ,σ]\displaystyle=\gamma\pi(\phi\,|\,0)\Big{[}\left(p-\sigma\right)V_{\phi}^{\pi,\sigma}(1)+\left(1-p\right)V_{\phi}^{\pi,\sigma}(0)+\sigma V_{\phi,\min}^{\pi,\sigma}\Big{]}
+γπ(1ϕ| 0)[(qσ)Vϕπ,σ(1)+(1q)Vϕπ,σ(0)+σVϕ,minπ,σ]\displaystyle\qquad+\gamma\pi(1-\phi\,|\,0)\Big{[}\left(q-\sigma\right)V_{\phi}^{\pi,\sigma}(1)+\left(1-q\right)V_{\phi}^{\pi,\sigma}(0)+\sigma V_{\phi,\min}^{\pi,\sigma}\Big{]}
=(i)γ(zϕπσ)Vϕπ,σ(1)+γσVϕ,minπ,σ+γ(1zϕπ)Vϕπ,σ(0),\displaystyle\overset{\mathrm{(i)}}{=}\gamma\left(z_{\phi}^{\pi}-\sigma\right)V_{\phi}^{\pi,\sigma}(1)+\gamma\sigma V_{\phi,\min}^{\pi,\sigma}+\gamma(1-z_{\phi}^{\pi})V_{\phi}^{\pi,\sigma}(0), (181)

where (i) holds by the definition of zϕπz_{\phi}^{\pi} in (77). To continue, recursively applying (181) yields

Vϕπ,σ(0)\displaystyle V_{\phi}^{\pi,\sigma}(0) γ(zϕπσ)Vϕπ,σ(1)+γσVϕ,minπ,σ+γ(1zϕπ)[γ(zϕπσ)Vϕπ,σ(1)+γσVϕ,minπ,σ+γ(1zϕπ)Vϕπ,σ(0)]\displaystyle\leq\gamma\left(z_{\phi}^{\pi}-\sigma\right)V_{\phi}^{\pi,\sigma}(1)+\gamma\sigma V_{\phi,\min}^{\pi,\sigma}+\gamma(1-z_{\phi}^{\pi})\Big{[}\gamma\left(z_{\phi}^{\pi}-\sigma\right)V_{\phi}^{\pi,\sigma}(1)+\gamma\sigma V_{\phi,\min}^{\pi,\sigma}+\gamma(1-z_{\phi}^{\pi})V_{\phi}^{\pi,\sigma}(0)\Big{]}
(i)γ(zϕπσ)Vϕπ,σ(1)+γσVϕ,minπ,σ+γ(1zϕπ)[γzϕπVϕπ,σ(1)+γ(1zϕπ)Vϕπ,σ(0)]\displaystyle\overset{\mathrm{(i)}}{\leq}\gamma\left(z_{\phi}^{\pi}-\sigma\right)V_{\phi}^{\pi,\sigma}(1)+\gamma\sigma V_{\phi,\min}^{\pi,\sigma}+\gamma(1-z_{\phi}^{\pi})\Big{[}\gamma z_{\phi}^{\pi}V_{\phi}^{\pi,\sigma}(1)+\gamma(1-z_{\phi}^{\pi})V_{\phi}^{\pi,\sigma}(0)\Big{]}
\displaystyle\leq...
γ(zϕπσ)Vϕπ,σ(1)+γσVϕ,minπ,σ+γzϕπt=1γt(1zϕπ)tVϕπ,σ(1)+limtγt(1zϕπ)tVϕπ,σ(0)\displaystyle\leq\gamma\left(z_{\phi}^{\pi}-\sigma\right)V_{\phi}^{\pi,\sigma}(1)+\gamma\sigma V_{\phi,\min}^{\pi,\sigma}+\gamma z_{\phi}^{\pi}\sum_{t=1}^{\infty}\gamma^{t}(1-z_{\phi}^{\pi})^{t}V_{\phi}^{\pi,\sigma}(1)+\lim_{t\rightarrow\infty}\gamma^{t}(1-z_{\phi}^{\pi})^{t}V_{\phi}^{\pi,\sigma}(0)
(ii)γ(zϕπσ)Vϕπ,σ(1)+γσVϕ,minπ,σ+γ(1zϕπ)γzϕπ1γ(1zϕπ)Vϕπ,σ(1)+0\displaystyle\overset{\mathrm{(ii)}}{\leq}\gamma\left(z_{\phi}^{\pi}-\sigma\right)V_{\phi}^{\pi,\sigma}(1)+\gamma\sigma V_{\phi,\min}^{\pi,\sigma}+\gamma(1-z_{\phi}^{\pi})\frac{\gamma z_{\phi}^{\pi}}{1-\gamma(1-z_{\phi}^{\pi})}V_{\phi}^{\pi,\sigma}(1)+0
<γ(zϕπσ)Vϕπ,σ(1)+γσVϕ,minπ,σ+γ(1zϕπ)Vϕπ,σ(1)\displaystyle<\gamma\left(z_{\phi}^{\pi}-\sigma\right)V_{\phi}^{\pi,\sigma}(1)+\gamma\sigma V_{\phi,\min}^{\pi,\sigma}+\gamma(1-z_{\phi}^{\pi})V_{\phi}^{\pi,\sigma}(1)
=γ(1σ)Vϕπ,σ(1)+γσVϕ,minπ,σ,\displaystyle=\gamma\left(1-\sigma\right)V_{\phi}^{\pi,\sigma}(1)+\gamma\sigma V_{\phi,\min}^{\pi,\sigma}, (182)

where (i) uses Vϕ,minπ,σVϕπ,σ(1)V_{\phi,\min}^{\pi,\sigma}\leq V_{\phi}^{\pi,\sigma}(1), (ii) follows from γ(1zϕπ)<1\gamma(1-z_{\phi}^{\pi})<1, and the penultimate line follows from the trivial fact that γzϕπ1γ(1zϕπ)<1\frac{\gamma z_{\phi}^{\pi}}{1-\gamma(1-z_{\phi}^{\pi})}<1.

Combining (176), (177), and (182), we have that for any policy π\pi,

Vϕπ,σ(0)=Vϕ,minπ,σ,\displaystyle V_{\phi}^{\pi,\sigma}(0)=V_{\phi,\min}^{\pi,\sigma}, (183)

which directly leads to

Vϕπ,σ(1)\displaystyle V_{\phi}^{\pi,\sigma}(1) =1+γ(1σ)Vϕπ,σ(1)+γσVϕ,minπ,σ=1+γσVϕπ,σ(0)1γ(1σ).\displaystyle=1+\gamma\left(1-\sigma\right)V_{\phi}^{\pi,\sigma}(1)+\gamma\sigma V_{\phi,\min}^{\pi,\sigma}=\frac{1+\gamma\sigma V_{\phi}^{\pi,\sigma}(0)}{1-\gamma\left(1-\sigma\right)}. (184)

Let’s now return to the characterization of Vϕπ,σ(0)V_{\phi}^{\pi,\sigma}(0). In view of (183), the equality in (181) holds, and we have

Vϕπ,σ(0)\displaystyle V_{\phi}^{\pi,\sigma}(0) =γ(zϕπσ)Vϕπ,σ(1)+γ(1zϕπ+σ)Vϕπ,σ(0)\displaystyle=\gamma\left(z_{\phi}^{\pi}-\sigma\right)V_{\phi}^{\pi,\sigma}(1)+\gamma\left(1-z_{\phi}^{\pi}+\sigma\right)V_{\phi}^{\pi,\sigma}(0)
=(i)γ(zϕπσ)1+γσVϕπ,σ(0)1γ(1σ)+γ(1zϕπ+σ)Vϕπ,σ(0)\displaystyle\overset{\mathrm{(i)}}{=}\gamma\left(z_{\phi}^{\pi}-\sigma\right)\frac{1+\gamma\sigma V_{\phi}^{\pi,\sigma}(0)}{1-\gamma\left(1-\sigma\right)}+\gamma\left(1-z_{\phi}^{\pi}+\sigma\right)V_{\phi}^{\pi,\sigma}(0)
=γ(zϕπσ)1γ(1σ)+γ(1+(zϕπσ)γσ(1γ(1σ))1γ(1σ))Vϕπ,σ(0)\displaystyle=\frac{\gamma\left(z_{\phi}^{\pi}-\sigma\right)}{1-\gamma\left(1-\sigma\right)}+\gamma\left(1+\left(z_{\phi}^{\pi}-\sigma\right)\frac{\gamma\sigma-\left(1-\gamma\left(1-\sigma\right)\right)}{1-\gamma\left(1-\sigma\right)}\right)V_{\phi}^{\pi,\sigma}(0)
=γ(zϕπσ)1γ(1σ)+γ(1(1γ)(zϕπσ)1γ(1σ))Vϕπ,σ(0),\displaystyle=\frac{\gamma\left(z_{\phi}^{\pi}-\sigma\right)}{1-\gamma\left(1-\sigma\right)}+\gamma\left(1-\frac{(1-\gamma)\big{(}z_{\phi}^{\pi}-\sigma\big{)}}{1-\gamma\left(1-\sigma\right)}\right)V_{\phi}^{\pi,\sigma}(0),

where (i) arises from (184). Solving this relation gives

Vϕπ,σ(0)\displaystyle V_{\phi}^{\pi,\sigma}(0) =γ(zϕπσ)1γ(1σ)(1γ)(1+γ(zϕπσ)1γ(1σ)).\displaystyle=\frac{\frac{\gamma\left(z_{\phi}^{\pi}-\sigma\right)}{1-\gamma\left(1-\sigma\right)}}{(1-\gamma)\bigg{(}1+\frac{\gamma\left(z_{\phi}^{\pi}-\sigma\right)}{1-\gamma\left(1-\sigma\right)}\bigg{)}}. (185)
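
As a quick sanity check of the closed form (185), the following minimal numerical sketch (illustrative only; the values of γ, σ, and z below are arbitrary, subject to 0 < σ < z and γ ∈ (0,1)) solves the linear system formed by the equality version of (181) together with (184), and compares the solution against (185).

```python
# Minimal numerical check (not part of the proof) that the two fixed-point relations
#   V0 = gamma*(z - sigma)*V1 + gamma*(1 - z + sigma)*V0      # equality version of (181)
#   V1 = 1 + gamma*(1 - sigma)*V1 + gamma*sigma*V0            # cf. (184)
# reproduce the closed form (185). The chosen constants are arbitrary illustrations.
import numpy as np

gamma, sigma, z = 0.9, 0.05, 0.3      # require 0 < sigma < z and gamma in (0, 1)

A = np.array([[1 - gamma * (1 - z + sigma), -gamma * (z - sigma)],
              [-gamma * sigma,              1 - gamma * (1 - sigma)]])
b = np.array([0.0, 1.0])
V0, V1 = np.linalg.solve(A, b)

w = gamma * (z - sigma) / (1 - gamma * (1 - sigma))
closed_form = w / ((1 - gamma) * (1 + w))           # right-hand side of (185)
assert abs(V0 - closed_form) < 1e-10
```
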
The optimal robust policy and optimal robust value function.

We move on to characterize the robust optimal policy and its corresponding robust value function. To begin with, denoting

zγ(zϕπσ)1γ(1σ),\displaystyle z\coloneqq\frac{\gamma\big{(}z_{\phi}^{\pi}-\sigma\big{)}}{1-\gamma\left(1-\sigma\right)}, (186)

we rewrite (185) as

Vϕπ,σ(0)=z(1γ)(1+z)=:f(z).\displaystyle V_{\phi}^{\pi,\sigma}(0)=\frac{z}{(1-\gamma)(1+z)}=:f(z).

Plugging in the fact that zϕπqσ>0z_{\phi}^{\pi}\geq q\geq\sigma>0 from (73), it follows that z0z\geq 0. Moreover, for any z>0z>0, the derivative of f(z)f(z) w.r.t. zz obeys

(1γ)(1+z)(1γ)z(1γ)2(1+z)2=1(1γ)(1+z)2>0.\displaystyle\frac{(1-\gamma)(1+z)-(1-\gamma)z}{(1-\gamma)^{2}(1+z)^{2}}=\frac{1}{(1-\gamma)(1+z)^{2}}>0. (187)

Observing that f(z)f(z) is increasing in zz, zz is increasing in zϕπz_{\phi}^{\pi}, and zϕπz_{\phi}^{\pi} is also increasing in π(ϕ| 0)\pi(\phi\,|\,0) (see the fact pqp\geq q in (73)), the optimal policy in state 0 thus obeys

πϕ(ϕ| 0)=1.\pi_{\phi}^{\star}(\phi\,|\,0)=1. (188)
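As an additional numerical sanity check of the monotonicity argument above (illustrative only; γ is an arbitrary value), one may verify that f(z) = z/((1−γ)(1+z)) is increasing on z > 0, consistent with the positive derivative in (187):

```python
# Check numerically (illustration only) that f(z) = z / ((1 - gamma) * (1 + z))
# is increasing on z > 0, matching the positive derivative computed in (187).
import numpy as np

gamma = 0.9                                              # arbitrary discount factor in (0, 1)
z = np.linspace(1e-3, 10.0, 1000)
f = z / ((1 - gamma) * (1 + z))
assert np.all(np.diff(f) > 0)                            # f is strictly increasing on the grid
assert np.all(1.0 / ((1 - gamma) * (1 + z) ** 2) > 0)    # closed-form derivative is positive
```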

Considering that the action does not influence the state transition for all states s>0s>0, without loss of generality, we choose the robust optimal policy to obey

s>0:πϕ(ϕ|s)=1.\displaystyle\forall s>0:\quad\pi_{\phi}^{\star}(\phi\,|\,s)=1. (189)

Taking π=πϕ\pi=\pi^{\star}_{\phi}, we complete the proof by computing the corresponding optimal robust value function at state 0 as follows:

Vϕ,σ(0)=γ(zϕπσ)1γ(1σ)(1γ)(1+γ(zϕπσ)1γ(1σ))=γ(pσ)1γ(1σ)(1γ)(1+γ(pσ)1γ(1σ)).\displaystyle V_{\phi}^{\star,\sigma}(0)=\frac{\frac{\gamma\left(z_{\phi}^{\pi^{\star}}-\sigma\right)}{1-\gamma\left(1-\sigma\right)}}{(1-\gamma)\left(1+\frac{\gamma\left(z_{\phi}^{\pi^{\star}}-\sigma\right)}{1-\gamma\left(1-\sigma\right)}\right)}=\frac{\frac{\gamma\left(p-\sigma\right)}{1-\gamma\left(1-\sigma\right)}}{(1-\gamma)\left(1+\frac{\gamma\left(p-\sigma\right)}{1-\gamma\left(1-\sigma\right)}\right)}. (190)

C.2 Proof of the claim (80)

Plugging in the definition of φ\varphi, we obtain that for any policy π\pi,

φ,V,σϕVπ,σϕ=V,σϕ(0)Vπ,σϕ(0)\displaystyle\big{\langle}\varphi,V^{\star,\sigma}_{\phi}-V^{\pi,\sigma}_{\phi}\big{\rangle}=V^{\star,\sigma}_{\phi}(0)-V^{\pi,\sigma}_{\phi}(0) =γ(pzϕπ)1γ(1σ)(1γ)(1+γ(pσ)1γ(1σ))(1+γ(zϕπσ)1γ(1σ)),\displaystyle=\frac{\frac{\gamma\left(p-z_{\phi}^{\pi}\right)}{1-\gamma\left(1-\sigma\right)}}{(1-\gamma)\left(1+\frac{\gamma\left(p-\sigma\right)}{1-\gamma\left(1-\sigma\right)}\right)\left(1+\frac{\gamma\left(z_{\phi}^{\pi}-\sigma\right)}{1-\gamma\left(1-\sigma\right)}\right)}, (191)

which follows from applying (76) and basic algebra. We then proceed to control the above term separately in two cases, depending on the uncertainty level σ\sigma.

  • When σ(0,1γ]\sigma\in(0,1-\gamma]. In this case, regarding the key terms in (191), we observe that

    1γ<1γ(1σ)\displaystyle 1-\gamma<1-\gamma\left(1-\sigma\right) 1γ(1(1γ))=(1γ)(1+γ)2(1γ),\displaystyle\leq 1-\gamma\left(1-(1-\gamma)\right)=(1-\gamma)(1+\gamma)\leq 2(1-\gamma), (192)

    which directly leads to

    γ(zϕπσ)1γ(1σ)(i)γ(pσ)1γ(1σ)γc1(1γ)1γ(1σ)<(ii)c1γ,\displaystyle\frac{\gamma\big{(}z_{\phi}^{\pi}-\sigma\big{)}}{1-\gamma\left(1-\sigma\right)}\overset{\mathrm{(i)}}{\leq}\frac{\gamma\left(p-\sigma\right)}{1-\gamma\left(1-\sigma\right)}\leq\frac{\gamma c_{1}(1-\gamma)}{1-\gamma\left(1-\sigma\right)}\overset{\mathrm{(ii)}}{<}c_{1}\gamma, (193)

    where (i) holds by zϕπ<pz_{\phi}^{\pi}<p, and (ii) is due to (192). Inserting (192) and (193) back into (191), we arrive at

    φ,V,σϕVπ,σϕ\displaystyle\big{\langle}\varphi,V^{\star,\sigma}_{\phi}-V^{\pi,\sigma}_{\phi}\big{\rangle} γ(pzϕπ)2(1γ)(1γ)(1+c1γ)2γ(pzϕπ)8(1γ)2\displaystyle\geq\frac{\frac{\gamma\left(p-z_{\phi}^{\pi}\right)}{2(1-\gamma)}}{(1-\gamma)(1+c_{1}\gamma)^{2}}\geq\frac{\gamma\big{(}p-z_{\phi}^{\pi}\big{)}}{8(1-\gamma)^{2}}
    =γ(pq)(1π(ϕ| 0))8(1γ)2=γΔ(1π(ϕ| 0))8(1γ)22ε(1π(ϕ| 0)),\displaystyle=\frac{\gamma\left(p-q\right)\big{(}1-\pi(\phi\,|\,0)\big{)}}{8(1-\gamma)^{2}}=\frac{\gamma\Delta\big{(}1-\pi(\phi\,|\,0)\big{)}}{8(1-\gamma)^{2}}\geq 2\varepsilon\big{(}1-\pi(\phi\,|\,0)\big{)}, (194)

    where the last inequality holds (using γ1/2\gamma\geq 1/2) by setting

    Δ=32(1γ)2ε.\displaystyle\Delta=32(1-\gamma)^{2}\varepsilon. (195)

    Finally, it is easily verified that

    εc132(1γ)Δc1(1γ).\displaystyle\varepsilon\leq\frac{c_{1}}{32(1-\gamma)}\quad\Longrightarrow\quad\Delta\leq c_{1}(1-\gamma).
  • When σ(1γ,1c1]\sigma\in(1-\gamma,1-c_{1}]. Regarding (191), we observe that

    γσ<1γ(1σ)\displaystyle\gamma\sigma<1-\gamma\left(1-\sigma\right) =1γ+γσ(1+γ)σ2σ,\displaystyle=1-\gamma+\gamma\sigma\leq(1+\gamma)\sigma\leq 2\sigma, (196)

    which directly leads to

    γ(zϕπσ)1γ(1σ)γ(pσ)1γ(1σ)γc1σ1γ(1σ)<(i)c1,\displaystyle\frac{\gamma\big{(}z_{\phi}^{\pi}-\sigma\big{)}}{1-\gamma\left(1-\sigma\right)}\leq\frac{\gamma\left(p-\sigma\right)}{1-\gamma\left(1-\sigma\right)}\leq\frac{\gamma c_{1}\sigma}{1-\gamma\left(1-\sigma\right)}\overset{\mathrm{(i)}}{<}c_{1}, (197)

    where (i) holds by (196). Inserting (196) and (197) back into (191), we arrive at

    φ,V,σϕVπ,σϕ\displaystyle\big{\langle}\varphi,V^{\star,\sigma}_{\phi}-V^{\pi,\sigma}_{\phi}\big{\rangle} γ(pzϕπ)2σ(1γ)(1+c1)2γ(pzϕπ)8(1γ)σ=γ(pq)(1π(ϕ| 0))8(1γ)σ\displaystyle\geq\frac{\frac{\gamma\left(p-z_{\phi}^{\pi}\right)}{2\sigma}}{(1-\gamma)(1+c_{1})^{2}}\geq\frac{\gamma\left(p-z_{\phi}^{\pi}\right)}{8(1-\gamma)\sigma}=\frac{\gamma\left(p-q\right)\big{(}1-\pi(\phi\,|\,0)\big{)}}{8(1-\gamma)\sigma}
    =γΔ(1π(ϕ| 0))8(1γ)σ2ε(1π(ϕ| 0)),\displaystyle=\frac{\gamma\Delta\big{(}1-\pi(\phi\,|\,0)\big{)}}{8(1-\gamma)\sigma}\geq 2\varepsilon\big{(}1-\pi(\phi\,|\,0)\big{)}, (198)

    where the last inequality holds (using γ1/2\gamma\geq 1/2) by letting

    Δ=32(1γ)σε.\displaystyle\Delta=32(1-\gamma)\sigma\varepsilon. (199)

    Finally, it is easily verified that

    εc132(1γ)Δc1σ.\displaystyle\varepsilon\leq\frac{c_{1}}{32(1-\gamma)}\quad\Longrightarrow\quad\Delta\leq c_{1}\sigma. (200)

Appendix D Proof of the upper bound with χ2\chi^{2} divergence: Theorem 3

The proof of Theorem 3 mainly follows the structure of the proof of Theorem 1 in Appendix 5.2. Throughout this section, for any nominal transition kernel PP, the uncertainty set is taken as (see (10))

𝒰σ(P)=𝒰σχ2(P)𝒰σχ2(Ps,a),\displaystyle\mathcal{U}^{\sigma}(P)=\mathcal{U}^{\sigma}_{\chi^{2}}(P)\coloneqq\otimes\;\mathcal{U}^{\sigma}_{\chi^{2}}(P_{s,a}),\quad 𝒰σχ2(Ps,a){Ps,aΔ(𝒮):s𝒮(P(s|s,a)P(s|s,a))2P(s|s,a)σ}.\displaystyle\mathcal{U}^{\sigma}_{\chi^{2}}(P_{s,a})\coloneqq\Big{\{}P^{\prime}_{s,a}\in\Delta({\mathcal{S}}):\sum_{s^{\prime}\in{\mathcal{S}}}\frac{(P^{\prime}(s^{\prime}\,|\,s,a)-P(s^{\prime}\,|\,s,a))^{2}}{P(s^{\prime}\,|\,s,a)}\leq\sigma\Big{\}}. (201)
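
For concreteness, the following sketch (illustrative only) approximates the worst-case expectation of a value vector over the χ² ball via the dual representation invoked through Lemma 2 in the proofs below, namely the maximization over α of P[V]_α − √(σ Var_P([V]_α)). Here we assume, consistently with the paper's notation, that [V]_α denotes the entrywise clipping min{V, α}; the grid search over α and the toy numbers are our own illustrative choices.

```python
# A minimal sketch (assumption: [V]_alpha is the entrywise clipping min{V, alpha},
# and the dual form below is the one invoked via Lemma 2 in this appendix) for
# approximating the chi^2 worst-case expectation
#     inf_{P' in U^sigma_chi2(P)} P' V
#       = max_{alpha in [min V, max V]} { P [V]_alpha - sqrt(sigma * Var_P([V]_alpha)) }
# by a simple grid search over alpha. All numerical values are illustrative.
import numpy as np

def chi2_robust_expectation(P, V, sigma, num_grid=2001):
    alphas = np.linspace(V.min(), V.max(), num_grid)
    best = -np.inf
    for alpha in alphas:
        V_alpha = np.minimum(V, alpha)             # clipped value vector [V]_alpha
        mean = P @ V_alpha
        var = P @ (V_alpha - mean) ** 2            # Var_P([V]_alpha)
        best = max(best, mean - np.sqrt(sigma * var))
    return best

P = np.array([0.5, 0.3, 0.2])                      # toy nominal distribution
V = np.array([1.0, 0.2, 0.0])                      # toy value vector
print(chi2_robust_expectation(P, V, sigma=0.1))    # <= P @ V, and approaches it as sigma -> 0
```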

D.1 Proof of Theorem 3

In order to control the performance gap V,σVπ^,σ\left\|V^{\star,\sigma}-V^{\widehat{\pi},\sigma}\right\|_{\infty}, recall the error decomposition in (51):

V,σVπ^,σ(Vπ,σV^π,σ)+2γε𝗈𝗉𝗍1γ1+(V^π^,σVπ^,σ),\displaystyle V^{\star,\sigma}-V^{\widehat{\pi},\sigma}\leq\left(V^{\pi^{\star},\sigma}-\widehat{V}^{\pi^{\star},\sigma}\right)+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma}{1}+\left(\widehat{V}^{\widehat{\pi},\sigma}-V^{\widehat{\pi},\sigma}\right), (202)

where ε𝗈𝗉𝗍\varepsilon_{\mathsf{opt}} (cf. (50)) shall be specified later (which justifies Remark 2). To further control (202), we bound the remaining two terms separately.

Step 1: controlling V^π,σVπ,σ\big{\|}\widehat{V}^{\pi^{\star},\sigma}-V^{\pi^{\star},\sigma}\big{\|}_{\infty}.

Towards this, recall the bound in (56) which holds for any uncertainty set:

V^π,σVπ,σ\displaystyle\big{\|}\widehat{V}^{\pi^{\star},\sigma}-V^{\pi^{\star},\sigma}\big{\|}_{\infty} γmax{(IγP^¯π,V^)1(P^¯π,VVπ,σP¯π,VVπ,σ),\displaystyle\leq\gamma\max\Big{\{}\Big{\|}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},\widehat{V}}\Big{)}^{-1}\Big{(}\underline{\widehat{P}}^{\pi^{\star},V}V^{\pi^{\star},\sigma}-\underline{P}^{\pi^{\star},V}V^{\pi^{\star},\sigma}\Big{)}\Big{\|}_{\infty},
(IγP^¯π,V)1(P^¯π,VVπ,σP¯π,VVπ,σ)}.\displaystyle\qquad\qquad\Big{\|}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{-1}\Big{(}\underline{\widehat{P}}^{\pi^{\star},V}V^{\pi^{\star},\sigma}-\underline{P}^{\pi^{\star},V}V^{\pi^{\star},\sigma}\Big{)}\Big{\|}_{\infty}\Big{\}}. (203)

To control the main term P^¯π,VVπ,σP¯π,VVπ,σ\underline{\widehat{P}}^{\pi^{\star},V}V^{\pi^{\star},\sigma}-\underline{P}^{\pi^{\star},V}V^{\pi^{\star},\sigma} in (203), we first introduce an important lemma whose proof is postponed to Appendix D.2.1.

Lemma 16.

Consider any σ>0\sigma>0 and the uncertainty set 𝒰σ()𝒰σχ2()\mathcal{U}^{\sigma}(\cdot)\coloneqq\mathcal{U}^{\sigma}_{\chi^{2}}(\cdot). For any δ(0,1)\delta\in(0,1) and any fixed policy π\pi, one has with probability at least 1δ1-\delta,

P^¯π,VVπ,σP¯π,VVπ,σ42(1+σ)log(24SANδ)(1γ)2N.\displaystyle\left\|\underline{\widehat{P}}^{\pi,V}V^{\pi,\sigma}-\underline{P}^{\pi,V}V^{\pi,\sigma}\right\|_{\infty}\leq 4\sqrt{\frac{2(1+\sigma)\log(\frac{24SAN}{\delta})}{(1-\gamma)^{2}N}}.

Applying Lemma 16 by taking π=π\pi=\pi^{\star} gives

P^¯π,VVπ,σP¯π,VVπ,σ42(1+σ)log(24SANδ)(1γ)2N,\displaystyle\Big{\|}\underline{\widehat{P}}^{\pi^{\star},V}V^{\pi^{\star},\sigma}-\underline{P}^{\pi^{\star},V}V^{\pi^{\star},\sigma}\Big{\|}_{\infty}\leq 4\sqrt{\frac{2(1+\sigma)\log(\frac{24SAN}{\delta})}{(1-\gamma)^{2}N}}, (204)

which directly leads to

(IγP^¯π,V^)1(P^¯π,VVπ,σP¯π,VVπ,σ)\displaystyle\Big{\|}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},\widehat{V}}\Big{)}^{-1}\Big{(}\underline{\widehat{P}}^{\pi^{\star},V}V^{\pi^{\star},\sigma}-\underline{P}^{\pi^{\star},V}V^{\pi^{\star},\sigma}\Big{)}\Big{\|}_{\infty}
P^¯π,VVπ,σP¯π,VVπ,σ(IγP^¯π,V^)1142(1+σ)log(24SANδ)(1γ)4N.\displaystyle\leq\Big{\|}\underline{\widehat{P}}^{\pi^{\star},V}V^{\pi^{\star},\sigma}-\underline{P}^{\pi^{\star},V}V^{\pi^{\star},\sigma}\Big{\|}_{\infty}\cdot\Big{\|}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},\widehat{V}}\Big{)}^{-1}{1}\Big{\|}_{\infty}\leq 4\sqrt{\frac{2(1+\sigma)\log(\frac{24SAN}{\delta})}{(1-\gamma)^{4}N}}. (205)

Similarly, we have

(IγP^¯π,V)1(P^¯π,VVπ,σP¯π,VVπ,σ)42(1+σ)log(24SANδ)(1γ)4N.\displaystyle\Big{\|}\Big{(}I-\gamma\underline{\widehat{P}}^{\pi^{\star},V}\Big{)}^{-1}\Big{(}\underline{\widehat{P}}^{\pi^{\star},V}V^{\pi^{\star},\sigma}-\underline{P}^{\pi^{\star},V}V^{\pi^{\star},\sigma}\Big{)}\Big{\|}_{\infty}\leq 4\sqrt{\frac{2(1+\sigma)\log(\frac{24SAN}{\delta})}{(1-\gamma)^{4}N}}. (206)

Inserting (205) and (206) back to (203) yields

V^π,σVπ,σ\displaystyle\big{\|}\widehat{V}^{\pi^{\star},\sigma}-V^{\pi^{\star},\sigma}\big{\|}_{\infty} 42(1+σ)log(24SANδ)(1γ)4N.\displaystyle\leq 4\sqrt{\frac{2(1+\sigma)\log(\frac{24SAN}{\delta})}{(1-\gamma)^{4}N}}. (207)
Step 2: controlling V^π^,σVπ^,σ\left\|\widehat{V}^{\widehat{\pi},\sigma}-V^{\widehat{\pi},\sigma}\right\|_{\infty}.

Recall the bound in (57) which holds for any uncertainty set:

V^π^,σVπ^,σ\displaystyle\big{\|}\widehat{V}^{\widehat{\pi},\sigma}-V^{\widehat{\pi},\sigma}\big{\|}_{\infty} γmax{(IγP¯π^,V)1(P^¯π^,V^V^π^,σP¯π^,V^V^π^,σ),\displaystyle\leq\gamma\max\Big{\{}\Big{\|}\Big{(}I-\gamma\underline{P}^{\widehat{\pi},V}\Big{)}^{-1}\Big{(}\underline{\widehat{P}}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}-\underline{P}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}\Big{)}\Big{\|}_{\infty},
(IγP¯π^,V^)1(P^¯π^,V^V^π^,σP¯π^,V^V^π^,σ)}.\displaystyle\qquad\qquad\Big{\|}\Big{(}I-\gamma\underline{P}^{\widehat{\pi},\widehat{V}}\Big{)}^{-1}\Big{(}\underline{\widehat{P}}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}-\underline{P}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}\Big{)}\Big{\|}_{\infty}\Big{\}}. (208)

We introduce the following lemma which controls P^¯π^,V^V^π^,σP¯π^,V^V^π^,σ\underline{\widehat{P}}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}-\underline{P}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma} in (208); the proof is deferred to Appendix D.2.2.

Lemma 17.

Consider the uncertainty set 𝒰σ()𝒰σχ2()\mathcal{U}^{\sigma}(\cdot)\coloneqq\mathcal{U}^{\sigma}_{\chi^{2}}(\cdot) and any δ(0,1)\delta\in(0,1). With probability at least 1δ1-\delta, one has

P^¯π^,V^V^π^,σP¯π^,V^V^π^,σ\displaystyle\Big{\|}\underline{\widehat{P}}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}-\underline{P}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}\Big{\|}_{\infty} 122(1+σ)log(36SAN2δ)(1γ)2N+2γε𝗈𝗉𝗍1γ+4σε𝗈𝗉𝗍(1γ)2.\displaystyle\leq 12\sqrt{\frac{2(1+\sigma)\log(\frac{36SAN^{2}}{\delta})}{(1-\gamma)^{2}N}}+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma}+4\sqrt{\frac{\sigma\varepsilon_{\mathsf{opt}}}{(1-\gamma)^{2}}}. (209)

Repeating the arguments from (204) to (207) yields

V^π^,σVπ^,σ\displaystyle\big{\|}\widehat{V}^{\widehat{\pi},\sigma}-V^{\widehat{\pi},\sigma}\big{\|}_{\infty} 122(1+σ)log(36SAN2δ)(1γ)4N+2γε𝗈𝗉𝗍(1γ)2+4σε𝗈𝗉𝗍(1γ)4.\displaystyle\leq 12\sqrt{\frac{2(1+\sigma)\log(\frac{36SAN^{2}}{\delta})}{(1-\gamma)^{4}N}}+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{(1-\gamma)^{2}}+4\sqrt{\frac{\sigma\varepsilon_{\mathsf{opt}}}{(1-\gamma)^{4}}}. (210)

Finally, inserting (207) and (210) back into (202) completes the proof:

V,σVπ^,σ\displaystyle\big{\|}V^{\star,\sigma}-V^{\widehat{\pi},\sigma}\big{\|}_{\infty}
Vπ,σV^π,σ+2γε𝗈𝗉𝗍1γ+V^π^,σVπ^,σ\displaystyle\leq\big{\|}V^{\pi^{\star},\sigma}-\widehat{V}^{\pi^{\star},\sigma}\big{\|}_{\infty}+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma}+\big{\|}\widehat{V}^{\widehat{\pi},\sigma}-V^{\widehat{\pi},\sigma}\big{\|}_{\infty}
42(1+σ)log(24SANδ)(1γ)4N+2γε𝗈𝗉𝗍1γ+122(1+σ)log(36SAN2δ)(1γ)4N+2γε𝗈𝗉𝗍(1γ)2+4σε𝗈𝗉𝗍(1γ)4\displaystyle\leq 4\sqrt{\frac{2(1+\sigma)\log(\frac{24SAN}{\delta})}{(1-\gamma)^{4}N}}+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma}+12\sqrt{\frac{2(1+\sigma)\log(\frac{36SAN^{2}}{\delta})}{(1-\gamma)^{4}N}}+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{(1-\gamma)^{2}}+4\sqrt{\frac{\sigma\varepsilon_{\mathsf{opt}}}{(1-\gamma)^{4}}}
242(1+σ)log(36SAN2δ)(1γ)4N,\displaystyle\leq 24\sqrt{\frac{2(1+\sigma)\log(\frac{36SAN^{2}}{\delta})}{(1-\gamma)^{4}N}}, (211)

where the last line holds by taking ε𝗈𝗉𝗍min{32(1+σ)log(36SAN2δ)N,4log(36SAN2δ)N}\varepsilon_{\mathsf{opt}}\leq\min\left\{\sqrt{\frac{32(1+\sigma)\log(\frac{36SAN^{2}}{\delta})}{N}},\frac{4\log(\frac{36SAN^{2}}{\delta})}{N}\right\}.
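
For intuition, the final bound (211) can be read as a sample-size requirement scaling on the order of (1+σ)·log(SAN/δ)/((1−γ)⁴ε²); the short sketch below (illustrative only, ignoring the ε_opt terms, and with arbitrary numerical values) finds a power-of-two N for which the right-hand side of (211) drops below a target accuracy ε.

```python
# Rough illustration (not part of the analysis): solve the right-hand side of (211)
# for N by doubling until it falls below the target accuracy eps. The eps_opt terms
# are ignored and all numerical values are arbitrary.
import math

def required_N(S, A, gamma, sigma, eps, delta):
    N = 1
    while 24 * math.sqrt(2 * (1 + sigma) * math.log(36 * S * A * N ** 2 / delta)
                         / ((1 - gamma) ** 4 * N)) > eps:
        N *= 2
    return N

print(required_N(S=10, A=5, gamma=0.9, sigma=0.5, eps=0.1, delta=0.01))
```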

D.2 Proof of the auxiliary lemmas

D.2.1 Proof of Lemma 16

Step 1: controlling the point-wise concentration.

Consider any fixed policy π\pi and the corresponding robust value vector VVπ,σV\coloneqq V^{\pi,\sigma} (which is independent of P^0\widehat{P}^{0}). Invoking Lemma 2 yields that for any (s,a)𝒮×𝒜(s,a)\in{\mathcal{S}}\times\mathcal{A},

|P^π,Vs,aVπ,σPπ,Vs,aVπ,σ|\displaystyle\left|\widehat{P}^{\pi,V}_{s,a}V^{\pi,\sigma}-P^{\pi,V}_{s,a}V^{\pi,\sigma}\right|
=|maxα[minsV(s),maxsV(s)]{P0s,a[V]ασ𝖵𝖺𝗋P0s,a([V]α)}\displaystyle=\bigg{|}\max_{\alpha\in\left[\min_{s}V(s),\max_{s}V(s)\right]}\left\{P^{0}_{s,a}[V]_{\alpha}-\sqrt{\sigma\mathsf{Var}_{P^{0}_{s,a}}\left([V]_{\alpha}\right)}\right\}
maxα[minsV(s),maxsV(s)]{P^0s,a[V]ασ𝖵𝖺𝗋P^0s,a([V]α)}|\displaystyle\qquad\qquad-\max_{\alpha\in\left[\min_{s}V(s),\max_{s}V(s)\right]}\left\{\widehat{P}^{0}_{s,a}[V]_{\alpha}-\sqrt{\sigma\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left([V]_{\alpha}\right)}\right\}\bigg{|}
maxα[minsV(s),maxsV(s)]|(P0s,aP^0s,a)[V]α+σ𝖵𝖺𝗋P^0s,a([V]α)σ𝖵𝖺𝗋P0s,a([V]α)|\displaystyle\leq\max_{\alpha\in\left[\min_{s}V(s),\max_{s}V(s)\right]}\left|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)[V]_{\alpha}+\sqrt{\sigma\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left([V]_{\alpha}\right)}-\sqrt{\sigma\mathsf{Var}_{P^{0}_{s,a}}\left([V]_{\alpha}\right)}\right|
maxα[minsV(s),maxsV(s)]|(P0s,aP^0s,a)[V]α|+\displaystyle\leq\max_{\alpha\in\left[\min_{s}V(s),\max_{s}V(s)\right]}\left|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)[V]_{\alpha}\right|+
+maxα[minsV(s),maxsV(s)]σ|𝖵𝖺𝗋P^0s,a([V]α)𝖵𝖺𝗋P0s,a([V]α)|,\displaystyle\qquad+\max_{\alpha\in\left[\min_{s}V(s),\max_{s}V(s)\right]}\sqrt{\sigma}\left|\sqrt{\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left([V]_{\alpha}\right)}-\sqrt{\mathsf{Var}_{P^{0}_{s,a}}\left([V]_{\alpha}\right)}\right|, (212)

where the first inequality follows from the fact that the maximum operator is 11-Lipschitz, and the second inequality follows from the triangle inequality. Since the first term in (212) is exactly the same as (141), recalling the fact in (146) directly yields that, with probability at least 1δ1-\delta,

maxα[minsV(s),maxsV(s)]|(P0s,aP^0s,a)[V]α|2log(2SANδ)(1γ)2N\displaystyle\max_{\alpha\in[\min_{s}V(s),\max_{s}V(s)]}\left|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)[V]_{\alpha}\right|\leq 2\sqrt{\frac{\log(\frac{2SAN}{\delta})}{(1-\gamma)^{2}N}} (213)

holds for all (s,a)𝒮×𝒜(s,a)\in{\mathcal{S}}\times\mathcal{A}. Then the remainder of the proof focuses on controlling the second term in (212).

Step 2: controlling the second term in (212).

For any given (s,a)𝒮×𝒜(s,a)\in{\mathcal{S}}\times\mathcal{A} and fixed α[0,11γ]\alpha\in[0,\frac{1}{1-\gamma}], applying the concentration inequality (Panaganti and Kalathil,, 2022, Lemma 6) with [V]α11γ\|[V]_{\alpha}\|_{\infty}\leq\frac{1}{1-\gamma}, we arrive at

|𝖵𝖺𝗋P^0s,a([V]α)𝖵𝖺𝗋P0s,a([V]α)|2log(2δ)(1γ)2N\displaystyle\left|\sqrt{\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left([V]_{\alpha}\right)}-\sqrt{\mathsf{Var}_{P^{0}_{s,a}}\left([V]_{\alpha}\right)}\right|\leq\sqrt{\frac{2\log(\frac{2}{\delta})}{(1-\gamma)^{2}N}} (214)

holds with probability at least 1δ1-\delta. To obtain a uniform bound, we first state the following lemma, which is proven in Appendix D.2.3.

Lemma 18.

For any VV obeying V11γ\|V\|_{\infty}\leq\frac{1}{1-\gamma}, the function Js,a(α,V):=|𝖵𝖺𝗋P^0s,a([V]α)𝖵𝖺𝗋P0s,a([V]α)|J_{s,a}(\alpha,V):=\Big{|}\sqrt{\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left([V]_{\alpha}\right)}-\sqrt{\mathsf{Var}_{P^{0}_{s,a}}\left([V]_{\alpha}\right)}\Big{|} w.r.t. α\alpha obeys

|Js,a(α1,V)Js,a(α2,V)|4|α1α2|1γ.\displaystyle\left|J_{s,a}(\alpha_{1},V)-J_{s,a}(\alpha_{2},V)\right|\leq 4\sqrt{\frac{|\alpha_{1}-\alpha_{2}|}{1-\gamma}}.

In addition, we can construct an ε3\varepsilon_{3}-net Nε3N_{\varepsilon_{3}} over [0,11γ][0,\frac{1}{1-\gamma}] whose size is |Nε3|3ε3(1γ)|N_{\varepsilon_{3}}|\leq\frac{3}{\varepsilon_{3}(1-\gamma)} (Vershynin,, 2018). Armed with the above, we can derive the uniform bound over α[minsV(s),maxsV(s)][0,1/(1γ)]\alpha\in[\min_{s}V(s),\max_{s}V(s)]\subset[0,1/(1-\gamma)]: with probability at least 1δSA1-\frac{\delta}{SA}, it holds that for any (s,a)𝒮×𝒜(s,a)\in{\mathcal{S}}\times\mathcal{A},

maxα[minsV(s),maxsV(s)]|𝖵𝖺𝗋P^0s,a([V]α)𝖵𝖺𝗋P0s,a([V]α)|\displaystyle\max_{\alpha\in\left[\min_{s}V(s),\max_{s}V(s)\right]}\left|\sqrt{\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left([V]_{\alpha}\right)}-\sqrt{\mathsf{Var}_{P^{0}_{s,a}}\left([V]_{\alpha}\right)}\right|
maxα[0,1/(1γ)]|𝖵𝖺𝗋P^0s,a([V]α)𝖵𝖺𝗋P0s,a([V]α)|\displaystyle\leq\max_{\alpha\in\left[0,1/(1-\gamma)\right]}\left|\sqrt{\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left([V]_{\alpha}\right)}-\sqrt{\mathsf{Var}_{P^{0}_{s,a}}\left([V]_{\alpha}\right)}\right|
(i)4ε31γ+supαNε3|𝖵𝖺𝗋P^0s,a([V]α)𝖵𝖺𝗋P0s,a([V]α)|\displaystyle\overset{\mathrm{(i)}}{\leq}4\sqrt{\frac{\varepsilon_{3}}{1-\gamma}}+\sup_{\alpha\in N_{\varepsilon_{3}}}\left|\sqrt{\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left([V]_{\alpha}\right)}-\sqrt{\mathsf{Var}_{P^{0}_{s,a}}\left([V]_{\alpha}\right)}\right|
(ii)4ε31γ+2log(2SA|Nε3|δ)(1γ)2N\displaystyle\overset{\mathrm{(ii)}}{\leq}4\sqrt{\frac{\varepsilon_{3}}{1-\gamma}}+\sqrt{\frac{2\log(\frac{2SA|N_{\varepsilon_{3}}|}{\delta})}{(1-\gamma)^{2}N}}
(iii)22log(2SA|Nε3|δ)(1γ)2N22log(24SANδ)(1γ)2N,\displaystyle\overset{\mathrm{(iii)}}{\leq}2\sqrt{\frac{2\log(\frac{2SA|N_{\varepsilon_{3}}|}{\delta})}{(1-\gamma)^{2}N}}\leq 2\sqrt{\frac{2\log(\frac{24SAN}{\delta})}{(1-\gamma)^{2}N}}, (215)

where (i) holds by the property of Nε3N_{\varepsilon_{3}}, (ii) follows from (214), (iii) arises from taking ε3=log(2SA|Nε3|δ)8N(1γ)\varepsilon_{3}=\frac{\log(\frac{2SA|N_{\varepsilon_{3}}|}{\delta})}{8N(1-\gamma)}, and the last inequality is verified by |Nε3|3ε3(1γ)24N|N_{\varepsilon_{3}}|\leq\frac{3}{\varepsilon_{3}(1-\gamma)}\leq 24N.
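
For concreteness, a minimal sketch (illustrative only; the values of ε3 and γ are arbitrary) of the ε3-net used above: a uniform grid with spacing ε3 covers [0, 1/(1−γ)], and its cardinality stays below 3/(ε3(1−γ)) whenever ε3(1−γ) ≤ 1.

```python
# A minimal sketch (illustrative only) of an epsilon-net over [0, 1/(1 - gamma)]:
# a uniform grid with spacing eps covers the interval, and its cardinality stays
# below 3 / (eps * (1 - gamma)) whenever eps * (1 - gamma) <= 1.
import numpy as np

gamma, eps = 0.9, 0.05                      # arbitrary illustrative values
upper = 1.0 / (1.0 - gamma)
net = np.arange(0.0, upper + eps, eps)      # every alpha in [0, upper] is within eps of the net
assert len(net) <= 3.0 / (eps * (1.0 - gamma))

alpha = 7.31                                # arbitrary point in [0, 1/(1 - gamma)]
assert np.min(np.abs(net - alpha)) <= eps
```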

Inserting (213) and (215) back into (212) and taking the union bound over (s,a)𝒮×𝒜(s,a)\in{\mathcal{S}}\times\mathcal{A}, we conclude that for all (s,a)𝒮×𝒜(s,a)\in{\mathcal{S}}\times\mathcal{A}, with probability at least 1δ1-{\delta},

|P^π,Vs,aVPπ,Vs,aV|\displaystyle\left|\widehat{P}^{\pi,V}_{s,a}V-P^{\pi,V}_{s,a}V\right| maxα[minsV(s),maxsV(s)]|(P0s,aP^0s,a)[V]α|+\displaystyle\leq\max_{\alpha\in\left[\min_{s}V(s),\max_{s}V(s)\right]}\left|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)[V]_{\alpha}\right|+
+maxα[minsV(s),maxsV(s)]|σ𝖵𝖺𝗋P^0s,a([V]α)σ𝖵𝖺𝗋P0s,a([V]α)|\displaystyle\qquad+\max_{\alpha\in\left[\min_{s}V(s),\max_{s}V(s)\right]}\left|\sqrt{\sigma\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left([V]_{\alpha}\right)}-\sqrt{\sigma\mathsf{Var}_{P^{0}_{s,a}}\left([V]_{\alpha}\right)}\right|
2log(2SANδ)(1γ)2N+22σlog(24SANδ)(1γ)2N42(1+σ)log(24SANδ)(1γ)2N.\displaystyle\leq\sqrt{\frac{2\log(\frac{2SAN}{\delta})}{(1-\gamma)^{2}N}}+2\sqrt{\frac{2\sigma\log(\frac{24SAN}{\delta})}{(1-\gamma)^{2}N}}\leq 4\sqrt{\frac{2(1+\sigma)\log(\frac{24SAN}{\delta})}{(1-\gamma)^{2}N}}.

Finally, we complete the proof by recalling the matrix form as below:

P^¯π,VVπ,σP¯π,VVπ,σmax(s,a)𝒮×𝒜|P^π,Vs,aVPπ,Vs,aV|42(1+σ)log(24SANδ)(1γ)2N.\displaystyle\left\|\underline{\widehat{P}}^{\pi,V}V^{\pi,\sigma}-\underline{P}^{\pi,V}V^{\pi,\sigma}\right\|_{\infty}\leq\max_{(s,a)\in{\mathcal{S}}\times\mathcal{A}}\left|\widehat{P}^{\pi,V}_{s,a}V-P^{\pi,V}_{s,a}V\right|\leq 4\sqrt{\frac{2(1+\sigma)\log(\frac{24SAN}{\delta})}{(1-\gamma)^{2}N}}.

D.2.2 Proof of Lemma 17

Step 1: decomposing the term of interest.

The proof follows the routine of the proof of Lemma 14 in Appendix B.7. To begin with, for any (s,a)𝒮×𝒜(s,a)\in{\mathcal{S}}\times\mathcal{A}, following the same arguments as in (212) yields

|P^π^,V^s,aV^π^,σPπ^,V^s,aV^π^,σ|maxα[minsV^π^,σ(s),maxsV^π^,σ(s)]|(P0s,aP^0s,a)[V^π^,σ]α|+\displaystyle\left|\widehat{P}^{\widehat{\pi},\widehat{V}}_{s,a}\widehat{V}^{\widehat{\pi},\sigma}-P^{\widehat{\pi},\widehat{V}}_{s,a}\widehat{V}^{\widehat{\pi},\sigma}\right|\leq\max_{\alpha\in\left[\min_{s}\widehat{V}^{\widehat{\pi},\sigma}(s),\max_{s}\widehat{V}^{\widehat{\pi},\sigma}(s)\right]}\left|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)\big{[}\widehat{V}^{\widehat{\pi},\sigma}\big{]}_{\alpha}\right|+
+maxα[minsV^π^,σ(s),maxsV^π^,σ(s)]σ|𝖵𝖺𝗋P^0s,a([V^π^,σ]α)𝖵𝖺𝗋P0s,a([V^π^,σ]α)|.\displaystyle\qquad+\max_{\alpha\in\left[\min_{s}\widehat{V}^{\widehat{\pi},\sigma}(s),\max_{s}\widehat{V}^{\widehat{\pi},\sigma}(s)\right]}\sqrt{\sigma}\left|\sqrt{\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left(\big{[}\widehat{V}^{\widehat{\pi},\sigma}\big{]}_{\alpha}\right)}-\sqrt{\mathsf{Var}_{P^{0}_{s,a}}\Big{(}\big{[}\widehat{V}^{\widehat{\pi},\sigma}\big{]}_{\alpha}\Big{)}}\right|. (216)

Invoking the fact in (170) (for proving Lemma 14), the first term in (216) obeys

maxα[minsV^π^,σ(s),maxsV^π^,σ(s)]|(P0s,aP^0s,a)[V^π^,σ]α|\displaystyle\max_{\alpha\in\left[\min_{s}\widehat{V}^{\widehat{\pi},\sigma}(s),\max_{s}\widehat{V}^{\widehat{\pi},\sigma}(s)\right]}\left|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)\big{[}\widehat{V}^{\widehat{\pi},\sigma}\big{]}_{\alpha}\right| maxα[0,1/(1γ)]|(P0s,aP^0s,a)[V^π^,σ]α|\displaystyle\leq\max_{\alpha\in[0,1/(1-\gamma)]}\left|\left(P^{0}_{s,a}-\widehat{P}^{0}_{s,a}\right)\big{[}\widehat{V}^{\widehat{\pi},\sigma}\big{]}_{\alpha}\right|
4log(3SAN3/2(1γ)δ)(1γ)2N+2γε𝗈𝗉𝗍1γ.\displaystyle\leq 4\sqrt{\frac{\log(\frac{3SAN^{3/2}}{(1-\gamma)\delta})}{(1-\gamma)^{2}N}}+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma}. (217)

The remainder of the proof will focus on controlling the second term of (216).

Step 2: controlling the second term of (216).

Towards this, we recall the auxiliary robust MDP ^𝗋𝗈𝖻s,u\widehat{\mathcal{M}}_{\mathsf{rob}}^{s,u} defined in Appendix B.7. Taking the uncertainty set 𝒰σ()𝒰χ2σ()\mathcal{U}^{\sigma}(\cdot)\coloneqq\mathcal{U}_{\chi^{2}}^{\sigma}(\cdot) for both ^𝗋𝗈𝖻s,u\widehat{\mathcal{M}}_{\mathsf{rob}}^{s,u} and ^𝗋𝗈𝖻\widehat{\mathcal{M}}_{\mathsf{rob}}, we recall the corresponding robust Bellman operator 𝒯^σs,u()\widehat{{\mathcal{T}}}^{\sigma}_{s,u}(\cdot) in (159) and the following definition in (160)

uV^,σ(s)γinf𝒫𝒰σ(es)𝒫V^,σ.\displaystyle u^{\star}\coloneqq\widehat{V}^{\star,\sigma}(s)-\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(e_{s})}\mathcal{P}\widehat{V}^{\star,\sigma}. (218)

Following the arguments in Appendix B.7, it can be verified that there exists a unique fixed point Q^,σs,u\widehat{Q}^{\star,\sigma}_{s,u} of the operator 𝒯^σs,u()\widehat{{\mathcal{T}}}^{\sigma}_{s,u}(\cdot), which satisfies 0Q^,σs,u11γ10\leq\widehat{Q}^{\star,\sigma}_{s,u}\leq\frac{1}{1-\gamma}{1}. In addition, the corresponding robust value function coincides with that of the operator 𝒯^σ()\widehat{{\mathcal{T}}}^{\sigma}(\cdot), i.e., V^,σs,u=V^,σ\widehat{V}^{\star,\sigma}_{s,u}=\widehat{V}^{\star,\sigma}.

We recall the Nε2N_{\varepsilon_{2}}-net over [0,11γ]\left[0,\frac{1}{1-\gamma}\right] whose size obeys |Nε2|3ε2(1γ)|N_{\varepsilon_{2}}|\leq\frac{3}{\varepsilon_{2}(1-\gamma)} (Vershynin,, 2018). Then for all uNε2u\in N_{\varepsilon_{2}} and any fixed α\alpha, ^𝗋𝗈𝖻s,u\widehat{\mathcal{M}}_{\mathsf{rob}}^{s,u} is statistically independent of P^0s,a\widehat{P}^{0}_{s,a}, which implies the independence between [V^,σs,u]α[\widehat{V}^{\star,\sigma}_{s,u}]_{\alpha} and P^0s,a\widehat{P}^{0}_{s,a}. With this in mind, invoking the fact in (215) and taking the union bound over all (s,a)𝒮×𝒜(s,a)\in{\mathcal{S}}\times\mathcal{A} and uNε2u\in N_{\varepsilon_{2}} yields that, with probability at least 1δ1-\delta,

maxα[0,1/(1γ)]|𝖵𝖺𝗋P^0s,a([V^,σs,u]α)𝖵𝖺𝗋P0s,a([V^,σs,u]α)|\displaystyle\max_{\alpha\in\left[0,1/(1-\gamma)\right]}\bigg{|}\sqrt{\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left([\widehat{V}^{\star,\sigma}_{s,u}]_{\alpha}\right)}-\sqrt{\mathsf{Var}_{P^{0}_{s,a}}\left([\widehat{V}^{\star,\sigma}_{s,u}]_{\alpha}\right)}\bigg{|} 22log(24SAN|Nε2|δ)(1γ)2N\displaystyle\leq 2\sqrt{\frac{2\log(\frac{24SAN|N_{\varepsilon_{2}}|}{\delta})}{(1-\gamma)^{2}N}} (219)

holds for all (s,a,u)𝒮×𝒜×Nε2(s,a,u)\in{\mathcal{S}}\times\mathcal{A}\times N_{\varepsilon_{2}}.

To continue, we decompose the term of interest in (216) as follows:

maxα[minsV^π^,σ(s),maxsV^π^,σ(s)]|𝖵𝖺𝗋P^0s,a([V^π^,σ]α)𝖵𝖺𝗋P0s,a([V^π^,σ]α)|\displaystyle\max_{\alpha\in\left[\min_{s}\widehat{V}^{\widehat{\pi},\sigma}(s),\max_{s}\widehat{V}^{\widehat{\pi},\sigma}(s)\right]}\bigg{|}\sqrt{\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left(\big{[}\widehat{V}^{\widehat{\pi},\sigma}\big{]}_{\alpha}\right)}-\sqrt{\mathsf{Var}_{P^{0}_{s,a}}\left(\big{[}\widehat{V}^{\widehat{\pi},\sigma}\big{]}_{\alpha}\right)}\bigg{|}
maxα[0,1/(1γ)]|𝖵𝖺𝗋P^0s,a([V^π^,σ]α)𝖵𝖺𝗋P0s,a([V^π^,σ]α)|\displaystyle\leq\max_{\alpha\in\left[0,1/(1-\gamma)\right]}\bigg{|}\sqrt{\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left(\big{[}\widehat{V}^{\widehat{\pi},\sigma}\big{]}_{\alpha}\right)}-\sqrt{\mathsf{Var}_{P^{0}_{s,a}}\left(\big{[}\widehat{V}^{\widehat{\pi},\sigma}\big{]}_{\alpha}\right)}\bigg{|}
(i)maxα[0,1/(1γ)]|𝖵𝖺𝗋P^0s,a([V^,σ]α)𝖵𝖺𝗋P0s,a([V^,σ]α)|\displaystyle\overset{\mathrm{(i)}}{\leq}\max_{\alpha\in\left[0,1/(1-\gamma)\right]}\bigg{|}\sqrt{\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left(\big{[}\widehat{V}^{\star,\sigma}\big{]}_{\alpha}\right)}-\sqrt{\mathsf{Var}_{P^{0}_{s,a}}\left(\big{[}\widehat{V}^{\star,\sigma}\big{]}_{\alpha}\right)}\bigg{|}
+maxα[0,1/(1γ)][|𝖵𝖺𝗋P^0s,a([V^π^,σ]α)𝖵𝖺𝗋P^0s,a([V^,σ]α)|\displaystyle\quad+\max_{\alpha\in\left[0,1/(1-\gamma)\right]}\bigg{[}\sqrt{\left|\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left(\big{[}\widehat{V}^{\widehat{\pi},\sigma}\big{]}_{\alpha}\right)-\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left(\big{[}\widehat{V}^{\star,\sigma}\big{]}_{\alpha}\right)\right|}
+|𝖵𝖺𝗋P0s,a([V^π^,σ]α)𝖵𝖺𝗋P0s,a([V^,σ]α)|]\displaystyle\quad+\sqrt{\left|\mathsf{Var}_{P^{0}_{s,a}}\left(\big{[}\widehat{V}^{\widehat{\pi},\sigma}\big{]}_{\alpha}\right)-\mathsf{Var}_{P^{0}_{s,a}}\left(\big{[}\widehat{V}^{\star,\sigma}\big{]}_{\alpha}\right)\right|}\bigg{]}
(ii)maxα[0,1/(1γ)]|𝖵𝖺𝗋P^0s,a([V^,σ]α)𝖵𝖺𝗋P0s,a([V^,σ]α)|\displaystyle\overset{\mathrm{(ii)}}{\leq}\max_{\alpha\in\left[0,1/(1-\gamma)\right]}\bigg{|}\sqrt{\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left(\big{[}\widehat{V}^{\star,\sigma}\big{]}_{\alpha}\right)}-\sqrt{\mathsf{Var}_{P^{0}_{s,a}}\left(\big{[}\widehat{V}^{\star,\sigma}\big{]}_{\alpha}\right)}\bigg{|}
+maxα[0,1/(1γ)]22(1γ)[V^π^,σ]α[V^,σ]α\displaystyle\quad+\max_{\alpha\in\left[0,1/(1-\gamma)\right]}2\sqrt{\frac{2}{(1-\gamma)}\left\|\big{[}\widehat{V}^{\widehat{\pi},\sigma}\big{]}_{\alpha}-\big{[}\widehat{V}^{\star,\sigma}\big{]}_{\alpha}\right\|_{\infty}}
maxα[0,1/(1γ)]|𝖵𝖺𝗋P^0s,a([V^,σ]α)𝖵𝖺𝗋P0s,a([V^,σ]α)|+4ε𝗈𝗉𝗍(1γ)2,\displaystyle\leq\max_{\alpha\in\left[0,1/(1-\gamma)\right]}\left|\sqrt{\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left(\big{[}\widehat{V}^{\star,\sigma}\big{]}_{\alpha}\right)}-\sqrt{\mathsf{Var}_{P^{0}_{s,a}}\left(\big{[}\widehat{V}^{\star,\sigma}\big{]}_{\alpha}\right)}\right|+4\sqrt{\frac{\varepsilon_{\mathsf{opt}}}{(1-\gamma)^{2}}}, (220)

where (i) holds by the triangle inequality, (ii) arises from applying Lemma 3, and the last inequality holds by (50).

Armed with the above facts, invoking the identity V^,σ=V^,σs,u\widehat{V}^{\star,\sigma}=\widehat{V}^{\star,\sigma}_{s,u^{\star}} yields that for all (s,a)𝒮×𝒜(s,a)\in{\mathcal{S}}\times\mathcal{A}, with probability at least 1δ1-\delta,

maxα[0,1/(1γ)]|𝖵𝖺𝗋P^0s,a([V^,σ]α)𝖵𝖺𝗋P0s,a([V^,σ]α)|\displaystyle\max_{\alpha\in\left[0,1/(1-\gamma)\right]}\bigg{|}\sqrt{\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left(\big{[}\widehat{V}^{\star,\sigma}\big{]}_{\alpha}\right)}-\sqrt{\mathsf{Var}_{P^{0}_{s,a}}\left(\big{[}\widehat{V}^{\star,\sigma}\big{]}_{\alpha}\right)}\bigg{|}
=maxα[0,1/(1γ)]|𝖵𝖺𝗋P^0s,a([V^,σs,u]α)𝖵𝖺𝗋P0s,a([V^,σs,u]α)|\displaystyle=\max_{\alpha\in\left[0,1/(1-\gamma)\right]}\left|\sqrt{\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left(\left[\widehat{V}^{\star,\sigma}_{s,u^{\star}}\right]_{\alpha}\right)}-\sqrt{\mathsf{Var}_{P^{0}_{s,a}}\left(\left[\widehat{V}^{\star,\sigma}_{s,u^{\star}}\right]_{\alpha}\right)}\right|
(i)maxα[0,1/(1γ)]|𝖵𝖺𝗋P^0s,a([V^,σs,u¯]α)𝖵𝖺𝗋P0s,a([V^,σs,u¯]α)|\displaystyle\overset{\mathrm{(i)}}{\leq}\max_{\alpha\in\left[0,1/(1-\gamma)\right]}\left|\sqrt{\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left(\left[\widehat{V}^{\star,\sigma}_{s,\overline{u}}\right]_{\alpha}\right)}-\sqrt{\mathsf{Var}_{P^{0}_{s,a}}\left(\left[\widehat{V}^{\star,\sigma}_{s,\overline{u}}\right]_{\alpha}\right)}\right|
+maxα[0,1/(1γ)][|𝖵𝖺𝗋P^0s,a([V^,σs,u]α)𝖵𝖺𝗋P^0s,a([V^,σs,u¯]α)|\displaystyle\quad+\max_{\alpha\in\left[0,1/(1-\gamma)\right]}\bigg{[}\sqrt{\left|\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left(\left[\widehat{V}^{\star,\sigma}_{s,u^{\star}}\right]_{\alpha}\right)-\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left(\left[\widehat{V}^{\star,\sigma}_{s,\overline{u}}\right]_{\alpha}\right)\right|}
+|𝖵𝖺𝗋P0s,a([V^,σs,u]α)𝖵𝖺𝗋P0s,a([V^,σs,u¯]α)|]\displaystyle\quad+\sqrt{\left|\mathsf{Var}_{P^{0}_{s,a}}\left(\left[\widehat{V}^{\star,\sigma}_{s,u^{\star}}\right]_{\alpha}\right)-\mathsf{Var}_{P^{0}_{s,a}}\left(\left[\widehat{V}^{\star,\sigma}_{s,\overline{u}}\right]_{\alpha}\right)\right|}\bigg{]}
(ii)maxα[0,1/(1γ)]|𝖵𝖺𝗋P^0s,a([V^,σs,u¯]α)𝖵𝖺𝗋P0s,a([V^,σs,u¯]α)|+4ε2(1γ)\displaystyle\overset{\mathrm{(ii)}}{\leq}\max_{\alpha\in\left[0,1/(1-\gamma)\right]}\left|\sqrt{\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left(\left[\widehat{V}^{\star,\sigma}_{s,\overline{u}}\right]_{\alpha}\right)}-\sqrt{\mathsf{Var}_{P^{0}_{s,a}}\left(\left[\widehat{V}^{\star,\sigma}_{s,\overline{u}}\right]_{\alpha}\right)}\right|+4\sqrt{\frac{\varepsilon_{2}}{(1-\gamma)}}
(iii)22log(24SAN|Nε2|δ)(1γ)2N+4ε2(1γ)\displaystyle\overset{\mathrm{(iii)}}{\leq}2\sqrt{\frac{2\log(\frac{24SAN|N_{\varepsilon_{2}}|}{\delta})}{(1-\gamma)^{2}N}}+4\sqrt{\frac{\varepsilon_{2}}{(1-\gamma)}}
62log(36SAN2|Nε2|δ)(1γ)2N,\displaystyle\leq 6\sqrt{\frac{2\log(\frac{36SAN^{2}|N_{\varepsilon_{2}}|}{\delta})}{(1-\gamma)^{2}N}}, (221)

where (i) holds by the triangle inequality, (ii) arises from applying Lemma 3 and the fact V^,σs,u¯V^,σs,uε2(1γ)\left\|\widehat{V}^{\star,\sigma}_{s,\overline{u}}-\widehat{V}^{\star,\sigma}_{s,u^{\star}}\right\|_{\infty}\leq\frac{\varepsilon_{2}}{(1-\gamma)} (see (166)), (iii) follows from (219), and the last inequality holds by letting ε2=2log(24SAN|Nε2|δ)(1γ)N\varepsilon_{2}=\frac{2\log(\frac{24SAN|N_{\varepsilon_{2}}|}{\delta})}{(1-\gamma)N}, which leads to |Nε2|3ε2(1γ)3N2|N_{\varepsilon_{2}}|\leq\frac{3}{\varepsilon_{2}(1-\gamma)}\leq\frac{3N}{2}.

In summary, inserting (221) back into (220) leads to the following: with probability at least 1δ1-\delta,

maxα[minsV^π^,σ(s),maxsV^π^,σ(s)]|𝖵𝖺𝗋P^0s,a([V^π^,σ]α)𝖵𝖺𝗋P0s,a([V^π^,σ]α)|\displaystyle\max_{\alpha\in\left[\min_{s}\widehat{V}^{\widehat{\pi},\sigma}(s),\max_{s}\widehat{V}^{\widehat{\pi},\sigma}(s)\right]}\bigg{|}\sqrt{\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left(\big{[}\widehat{V}^{\widehat{\pi},\sigma}\big{]}_{\alpha}\right)}-\sqrt{\mathsf{Var}_{P^{0}_{s,a}}\left(\big{[}\widehat{V}^{\widehat{\pi},\sigma}\big{]}_{\alpha}\right)}\bigg{|}
62σlog(36SAN2|Nε2|δ)(1γ)2N+4σε𝗈𝗉𝗍(1γ)2\displaystyle\leq 6\sqrt{\frac{2\sigma\log(\frac{36SAN^{2}|N_{\varepsilon_{2}}|}{\delta})}{(1-\gamma)^{2}N}}+4\sqrt{\frac{\sigma\varepsilon_{\mathsf{opt}}}{(1-\gamma)^{2}}} (222)

holds for all (s,a)𝒮×𝒜(s,a)\in{\mathcal{S}}\times\mathcal{A}.

Step 3: finishing up.

Inserting (222) and (217) back into (216), we complete the proof: with probability at least 1δ1-\delta,

P^¯π^,V^V^π^,σP¯π^,V^V^π^,σ\displaystyle\Big{\|}\underline{\widehat{P}}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}-\underline{P}^{\widehat{\pi},\widehat{V}}\widehat{V}^{\widehat{\pi},\sigma}\Big{\|}_{\infty} 4log(3SAN3/2(1γ)δ)(1γ)2N+2γε𝗈𝗉𝗍1γ+62σlog(36SAN2|Nε2|δ)(1γ)2N+4σε𝗈𝗉𝗍(1γ)2\displaystyle\leq 4\sqrt{\frac{\log(\frac{3SAN^{3/2}}{(1-\gamma)\delta})}{(1-\gamma)^{2}N}}+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma}+6\sqrt{\frac{2\sigma\log(\frac{36SAN^{2}|N_{\varepsilon_{2}}|}{\delta})}{(1-\gamma)^{2}N}}+4\sqrt{\frac{\sigma\varepsilon_{\mathsf{opt}}}{(1-\gamma)^{2}}}
122(1+σ)log(36SAN2δ)(1γ)2N+2γε𝗈𝗉𝗍1γ+4σε𝗈𝗉𝗍(1γ)2.\displaystyle\leq 12\sqrt{\frac{2(1+\sigma)\log(\frac{36SAN^{2}}{\delta})}{(1-\gamma)^{2}N}}+\frac{2\gamma\varepsilon_{\mathsf{opt}}}{1-\gamma}+4\sqrt{\frac{\sigma\varepsilon_{\mathsf{opt}}}{(1-\gamma)^{2}}}. (223)

D.2.3 Proof of Lemma 18

For any 0α1,α21/(1γ)0\leq\alpha_{1},\alpha_{2}\leq 1/(1-\gamma), one has

|Js,a(α1,V)Js,a(α2,V)|\displaystyle|J_{s,a}(\alpha_{1},V)-J_{s,a}(\alpha_{2},V)|
=||𝖵𝖺𝗋P^0s,a([V]α1)𝖵𝖺𝗋P0s,a([V]α1)||𝖵𝖺𝗋P^0s,a([V]α2)𝖵𝖺𝗋P0s,a([V]α2)||\displaystyle=\bigg{|}\left|\sqrt{\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left([V]_{\alpha_{1}}\right)}-\sqrt{\mathsf{Var}_{P^{0}_{s,a}}\left([V]_{\alpha_{1}}\right)}\right|-\left|\sqrt{\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left([V]_{\alpha_{2}}\right)}-\sqrt{\mathsf{Var}_{P^{0}_{s,a}}\left([V]_{\alpha_{2}}\right)}\right|\bigg{|}
(i)|𝖵𝖺𝗋P^0s,a([V]α1)𝖵𝖺𝗋P0s,a([V]α1)𝖵𝖺𝗋P^0s,a([V]α2)+𝖵𝖺𝗋P0s,a([V]α2)|\displaystyle\overset{\mathrm{(i)}}{\leq}\left|\sqrt{\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left([V]_{\alpha_{1}}\right)}-\sqrt{\mathsf{Var}_{P^{0}_{s,a}}\left([V]_{\alpha_{1}}\right)}-\sqrt{\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left([V]_{\alpha_{2}}\right)}+\sqrt{\mathsf{Var}_{P^{0}_{s,a}}\left([V]_{\alpha_{2}}\right)}\right|
|𝖵𝖺𝗋P^0s,a([V]α1)𝖵𝖺𝗋P^0s,a([V]α2)|+|𝖵𝖺𝗋P0s,a([V]α1)𝖵𝖺𝗋P0s,a([V]α2)|\displaystyle\leq\left|\sqrt{\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left([V]_{\alpha_{1}}\right)}-\sqrt{\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left([V]_{\alpha_{2}}\right)}\right|+\left|\sqrt{\mathsf{Var}_{P^{0}_{s,a}}\left([V]_{\alpha_{1}}\right)}-\sqrt{\mathsf{Var}_{P^{0}_{s,a}}\left([V]_{\alpha_{2}}\right)}\right|
(ii)𝖵𝖺𝗋P^0s,a([V]α2)𝖵𝖺𝗋P^0s,a([V]α1)+𝖵𝖺𝗋P0s,a([V]α2)𝖵𝖺𝗋P0s,a([V]α1)\displaystyle\overset{\mathrm{(ii)}}{\leq}\sqrt{\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left([V]_{\alpha_{2}}\right)-\mathsf{Var}_{\widehat{P}^{0}_{s,a}}\left([V]_{\alpha_{1}}\right)}+\sqrt{\mathsf{Var}_{P^{0}_{s,a}}\left([V]_{\alpha_{2}}\right)-\mathsf{Var}_{P^{0}_{s,a}}\left([V]_{\alpha_{1}}\right)}
(iii)|P^0s,a[([V]α1)([V]α1)([V]α2)([V]α2)]|+|P^0s,a([V]α1+[V]α2)P^0s,a([V]α1[V]α2)|\displaystyle\overset{\mathrm{(iii)}}{\leq}\sqrt{\left|\widehat{P}^{0}_{s,a}\left[\left([V]_{\alpha_{1}}\right)\circ\left([V]_{\alpha_{1}}\right)-\left([V]_{\alpha_{2}}\right)\circ\left([V]_{\alpha_{2}}\right)\right]\right|+\left|\widehat{P}^{0}_{s,a}\left([V]_{\alpha_{1}}+[V]_{\alpha_{2}}\right)\cdot\widehat{P}^{0}_{s,a}\left([V]_{\alpha_{1}}-[V]_{\alpha_{2}}\right)\right|}
+|P0s,a[([V]α1)([V]α1)([V]α2)([V]α2)]|+|P0s,a([V]α1+[V]α2)P0s,a([V]α1[V]α2)|\displaystyle\quad+\sqrt{\left|P^{0}_{s,a}\left[\left([V]_{\alpha_{1}}\right)\circ\left([V]_{\alpha_{1}}\right)-\left([V]_{\alpha_{2}}\right)\circ\left([V]_{\alpha_{2}}\right)\right]\right|+\left|P^{0}_{s,a}\left([V]_{\alpha_{1}}+[V]_{\alpha_{2}}\right)\cdot P^{0}_{s,a}\left([V]_{\alpha_{1}}-[V]_{\alpha_{2}}\right)\right|}
22(α1+α2)|α1α2|4|α1α2|1γ.\displaystyle\leq 2\sqrt{2(\alpha_{1}+\alpha_{2})|\alpha_{1}-\alpha_{2}|}\leq 4\sqrt{\frac{|\alpha_{1}-\alpha_{2}|}{1-\gamma}}. (224)

where (i) holds by the fact ||x||y|||xy|||x|-|y||\leq|x-y| for all x,yx,y\in\mathbb{R}, (ii) follows from the fact that xyxy\sqrt{x}-\sqrt{y}\leq\sqrt{x-y} for any xy0x\geq y\geq 0, together with 𝖵𝖺𝗋P([V]α2)𝖵𝖺𝗋P([V]α1)\mathsf{Var}_{P}\left([V]_{\alpha_{2}}\right)\geq\mathsf{Var}_{P}\left([V]_{\alpha_{1}}\right) for any transition kernel PΔ(𝒮)P\in\Delta({\mathcal{S}}) (assuming, without loss of generality, α2α1\alpha_{2}\geq\alpha_{1}), (iii) holds by the definition of VarP()\mathrm{Var}_{P}(\cdot) in (40), and the last inequality arises from 0α1,α21/(1γ)0\leq\alpha_{1},\alpha_{2}\leq 1/(1-\gamma).
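
As an illustrative numerical spot-check of Lemma 18 (assuming, as in the sketches above, that [V]_α denotes the entrywise clipping min{V, α}; all constants below are arbitrary):

```python
# Numerically spot-check (illustration, not a proof) Lemma 18:
#   |J(alpha1, V) - J(alpha2, V)| <= 4 * sqrt(|alpha1 - alpha2| / (1 - gamma)),
# where J(alpha, V) = |sqrt(Var_{Phat}([V]_alpha)) - sqrt(Var_P([V]_alpha))|.
import numpy as np

rng = np.random.default_rng(0)
gamma, S = 0.9, 20
V = rng.uniform(0.0, 1.0 / (1.0 - gamma), size=S)   # any V with ||V||_inf <= 1/(1-gamma)
P = rng.dirichlet(np.ones(S))
P_hat = rng.dirichlet(np.ones(S))

def J(alpha):
    V_alpha = np.minimum(V, alpha)                   # clipped vector [V]_alpha
    def var(Q):
        m = Q @ V_alpha
        return Q @ (V_alpha - m) ** 2
    return abs(np.sqrt(var(P_hat)) - np.sqrt(var(P)))

for _ in range(1000):
    a1, a2 = rng.uniform(0.0, 1.0 / (1.0 - gamma), size=2)
    assert abs(J(a1) - J(a2)) <= 4 * np.sqrt(abs(a1 - a2) / (1 - gamma)) + 1e-9
```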

Appendix E Proof of the lower bound with χ2\chi^{2} divergence: Theorem 4

To prove Theorem 4, we shall first construct some hard instances and then characterize the sample complexity requirements over these instances. The structure of the hard instances is the same as that of the ones used in the proof of Theorem 2.

E.1 Construction of the hard problem instances

First, note that we shall use the same MDPs defined in Appendix 5.3.1 as follows

{ϕ=(𝒮,𝒜,Pϕ,r,γ)|ϕ={0,1}}.\displaystyle\left\{\mathcal{M}_{\phi}=\left(\mathcal{S},\mathcal{A},P^{\phi},r,\gamma\right)\,|\,\phi=\{0,1\}\right\}.

In particular, we shall keep the structure of the transition kernel in (64), reward function in (68) and initial state distribution in (69), while pp and Δ\Delta shall be specified differently later.

Uncertainty set of the transition kernels.

Recalling the uncertainty set associated with χ2\chi^{2} divergence in (201), for any uncertainty level σ\sigma, the uncertainty set throughout this section is defined as 𝒰σ(Pϕ)\mathcal{U}^{\sigma}(P^{\phi}):

𝒰σ(Pϕ)\displaystyle\mathcal{U}^{\sigma}(P^{\phi}) 𝒰σχ𝟤(Pϕs,a),\displaystyle\coloneqq\otimes\;\mathcal{U}^{\sigma}_{\mathsf{\chi^{2}}}(P^{\phi}_{s,a}),\qquad 𝒰σχ𝟤(Pϕs,a){Ps,aΔ(𝒮):s𝒮(P(s|s,a)Pϕ(s|s,a))2Pϕ(s|s,a)σ}.\displaystyle\mathcal{U}^{\sigma}_{\mathsf{\chi^{2}}}(P^{\phi}_{s,a})\coloneqq\bigg{\{}P_{s,a}\in\Delta({\mathcal{S}}):\sum_{s^{\prime}\in{\mathcal{S}}}\frac{\left(P(s^{\prime}\,|\,s,a)-P^{\phi}(s^{\prime}\,|\,s,a)\right)^{2}}{P^{\phi}(s^{\prime}\,|\,s,a)}\leq\sigma\bigg{\}}. (225)

Clearly, for the χ2\chi^{2} divergence, 𝒰σ(Pϕs,a)=Pϕs,a\mathcal{U}^{\sigma}(P^{\phi}_{s,a})=P^{\phi}_{s,a} whenever the state transition is deterministic. Here, qq and Δ\Delta, which determine the instances (and whose choices will be further specified later), are set as

0\displaystyle 0 q={1γif σ(0,1γ4)σ1+σif σ[1γ4,),p=q+Δ,\displaystyle\leq q=\begin{cases}1-\gamma&\text{if }\sigma\in\left(0,\frac{1-\gamma}{4}\right)\\ \frac{\sigma}{1+\sigma}&\text{if }\sigma\in\left[\frac{1-\gamma}{4},\infty\right)\\ \end{cases},\qquad p=q+\Delta, (226)

and

0\displaystyle 0 <Δ{14(1γ)if σ(0,1γ4)min{14(1γ),12(1+σ)}if σ[1γ4,).\displaystyle<\Delta\leq\begin{cases}\frac{1}{4}(1-\gamma)&\text{if }\sigma\in\left(0,\frac{1-\gamma}{4}\right)\\ \min\left\{\frac{1}{4}(1-\gamma),\,\frac{1}{2(1+\sigma)}\right\}&\text{if }\sigma\in\left[\frac{1-\gamma}{4},\infty\right)\\ \end{cases}. (227)

This directly ensures that

p=Δ+qmax{12+σ1+σ,54(1γ)}1p=\Delta+q\leq\max\left\{\frac{\frac{1}{2}+\sigma}{1+\sigma},\frac{5}{4}(1-\gamma)\right\}\leq 1

since γ[34,1)\gamma\in\big{[}\frac{3}{4},1\big{)}.

To continue, for any (s,a,s)𝒮×𝒜×𝒮(s,a,s^{\prime})\in{\mathcal{S}}\times\mathcal{A}\times{\mathcal{S}}, we denote the infimum probability of moving to the next state ss^{\prime} over all perturbed transition kernels Ps,a𝒰σ(Pϕs,a)P_{s,a}\in\mathcal{U}^{\sigma}(P^{\phi}_{s,a}) as

P¯ϕ(s|s,a)\displaystyle\underline{P}^{\phi}(s^{\prime}\,|\,s,a) infPs,a𝒰σ(Pϕs,a)P(s|s,a).\displaystyle\coloneqq\inf_{P_{s,a}\in\mathcal{U}^{\sigma}(P^{\phi}_{s,a})}P(s^{\prime}\,|\,s,a). (228)

In addition, we denote the worst-case transition probabilities from state 0 to state 11 as follows, which play an important role in the analysis:

p¯\displaystyle\underline{p} P¯ϕ(1| 0,ϕ),q¯P¯ϕ(1| 0,1ϕ).\displaystyle\coloneqq\underline{P}^{\phi}(1\,|\,0,\phi),\qquad\underline{q}\coloneqq\underline{P}^{\phi}(1\,|\,0,1-\phi). (229)

Before continuing, we introduce some facts about p¯\underline{p} and q¯\underline{q}, summarized in the following lemma; the proof is postponed to Appendix E.3.1.

Lemma 19.

Consider any σ(0,)\sigma\in(0,\infty) and any p,q,Δp,q,\Delta obeying (226) and (227). Then the following properties hold:

{1γ2<q¯<1γ,q¯+34Δp¯q¯+Δ5(1γ)4if σ(0,1γ4),q¯=0,σ+12Δp¯(3+σ)Δif σ[1γ4,).\displaystyle\begin{cases}\frac{1-\gamma}{2}<\underline{q}<1-\gamma,\quad\underline{q}+\frac{3}{4}\Delta\leq\underline{p}\leq\underline{q}+\Delta\leq\frac{5(1-\gamma)}{4}&\text{if }\sigma\in\left(0,\frac{1-\gamma}{4}\right),\\ \underline{q}=0,\quad\frac{\sigma+1}{2}\Delta\leq\underline{p}\leq(3+\sigma)\Delta&\text{if }\sigma\in\left[\frac{1-\gamma}{4},\infty\right).\end{cases} (230)
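
As a numerical illustration of the second case of Lemma 19 (with arbitrary γ, σ, and Δ satisfying (226) and (227)), the infimum probabilities in (228) can be approximated directly: recalling the two-point structure of the nominal transition out of state 0 in (64), and noting that the χ² ball in (225) cannot move mass onto states with zero nominal probability, the worst case reduces to a one-dimensional search over the probability assigned to state 1.

```python
# Numerically check (illustration only) the second case of Lemma 19: with
# q = sigma / (1 + sigma) and p = q + Delta as in (226)-(227), the infimum
# probabilities (228), computed over distributions supported on {0, 1} (the chi^2
# ball keeps zero-probability states at zero), satisfy
#   q_bar = 0   and   (sigma + 1) / 2 * Delta <= p_bar <= (3 + sigma) * Delta.
import numpy as np

def chi2_inf_prob(p, sigma, grid=200001):
    """inf of P'(1) over the chi^2 ball of radius sigma around (1 - p, p)."""
    x = np.linspace(0.0, 1.0, grid)                        # candidate values of P'(1)
    chi2 = (x - p) ** 2 / p + (x - p) ** 2 / (1.0 - p)     # chi^2 divergence of (1 - x, x)
    return x[chi2 <= sigma].min()

gamma, sigma, Delta = 0.9, 0.5, 0.01        # arbitrary values with sigma >= (1 - gamma) / 4
q = sigma / (1.0 + sigma)
p = q + Delta
p_bar, q_bar = chi2_inf_prob(p, sigma), chi2_inf_prob(q, sigma)
assert q_bar <= 1e-4
assert (sigma + 1) / 2 * Delta - 1e-4 <= p_bar <= (3 + sigma) * Delta + 1e-4
```
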
Value functions and optimal policies.

Armed with the above facts, we are ready to derive the corresponding robust value functions, the optimal policies, and the associated optimal robust value functions. For any RMDP ϕ\mathcal{M}_{\phi} with the uncertainty set defined in (225), we denote the robust optimal policy as πϕ\pi^{\star}_{\phi}, and the robust value function of any policy π\pi (resp. the optimal policy πϕ\pi^{\star}_{\phi}) as Vπ,σϕV^{\pi,\sigma}_{\phi} (resp. V,σϕV^{\star,\sigma}_{\phi}). The following lemma describes some key properties of the robust (optimal) value functions and optimal policies; the proof is postponed to Appendix E.3.2.

Lemma 20.

For any ϕ{0,1}\phi\in\{0,1\} and any policy π\pi, one has

Vπ,σϕ(0)=γzϕπ(1γ)(1γ(1zϕπ)),\displaystyle V^{\pi,\sigma}_{\phi}(0)=\frac{\gamma z_{\phi}^{\pi}}{(1-\gamma)\left(1-\gamma\big{(}1-z_{\phi}^{\pi}\big{)}\right)}, (231)

where zϕπz_{\phi}^{\pi} is defined as

zϕπp¯π(ϕ| 0)+q¯π(1ϕ| 0).\displaystyle z_{\phi}^{\pi}\coloneqq\underline{p}\pi(\phi\,|\,0)+\underline{q}\pi(1-\phi\,|\,0). (232)

In addition, the optimal value functions and the optimal policies obey

Vϕ,σ(0)\displaystyle V_{\phi}^{\star,\sigma}(0) =γp¯(1γ)(1γ(1p¯)),\displaystyle=\frac{\gamma\underline{p}}{(1-\gamma)\left(1-\gamma\left(1-\underline{p}\right)\right)}, (233a)
πϕ(ϕ|s)\displaystyle\pi_{\phi}^{\star}(\phi\,|\,s) =1, for s𝒮.\displaystyle=1,\qquad\qquad\text{ for }s\in{\mathcal{S}}. (233b)

E.2 Establishing the minimax lower bound

Our goal is to control the performance gap of any policy estimator π^\widehat{\pi} based on the generated dataset, evaluated over the initial distribution φ\varphi chosen in (69), which gives

φ,V,σϕVπ^,σϕ=V,σϕ(0)Vπ^,σϕ(0).\displaystyle\big{\langle}\varphi,V^{\star,\sigma}_{\phi}-V^{\widehat{\pi},\sigma}_{\phi}\big{\rangle}=V^{\star,\sigma}_{\phi}(0)-V^{\widehat{\pi},\sigma}_{\phi}(0). (234)
Step 1: converting the goal to estimate ϕ\phi.

To achieve the goal, we first introduce the following fact which shall be verified in Appendix E.3.3: given

ε{172(1γ)if σ(0,1γ4),1256(1+σ)(1γ)if σ[1γ4,13(1γ)),332if σ>13(1γ).\displaystyle\varepsilon\leq\begin{cases}\frac{1}{72(1-\gamma)}\quad&\text{if }\sigma\in\left(0,\frac{1-\gamma}{4}\right),\\ \frac{1}{256(1+\sigma)(1-\gamma)}\quad&\text{if }\sigma\in\left[\frac{1-\gamma}{4},\frac{1}{3(1-\gamma)}\right),\\ \frac{3}{32}\quad&\text{if }\sigma>\frac{1}{3(1-\gamma)}.\end{cases} (235)

choosing

Δ={18(1γ)2εif σ(0,1γ4),64(1+σ)(1γ)2εif σ[1γ4,13(1γ)),163(1+σ)εif σ>13(1γ).\displaystyle\Delta=\begin{cases}18(1-\gamma)^{2}\varepsilon\quad&\text{if }\sigma\in\left(0,\frac{1-\gamma}{4}\right),\\ 64(1+\sigma)(1-\gamma)^{2}\varepsilon\quad&\text{if }\sigma\in\left[\frac{1-\gamma}{4},\frac{1}{3(1-\gamma)}\right),\\ \frac{16}{3(1+\sigma)}\varepsilon\quad&\text{if }\sigma>\frac{1}{3(1-\gamma)}.\end{cases} (236)

which satisfies the requirement on Δ\Delta in (227), it holds that for any policy π^\widehat{\pi},

φ,V,σϕVπ^,σϕ2ε(1π^(ϕ| 0)).\displaystyle\big{\langle}\varphi,V^{\star,\sigma}_{\phi}-V^{\widehat{\pi},\sigma}_{\phi}\big{\rangle}\geq 2\varepsilon\big{(}1-\widehat{\pi}(\phi\,|\,0)\big{)}. (237)
Step 2: arriving at the final results.

To continue, following the same definitions and arguments as in Appendix 5.3.2, we recall the minimax probability of error and its property as follows:

pe\displaystyle p_{\mathrm{e}} 14exp{N(𝖪𝖫(P0(| 0,0)P1(| 0,0))+𝖪𝖫(P0(| 0,1)P1(| 0,1)))},\displaystyle\geq\frac{1}{4}\exp\bigg{\{}-N\Big{(}\mathsf{KL}\big{(}P^{0}(\cdot\,|\,0,0)\parallel P^{1}(\cdot\,|\,0,0)\big{)}+\mathsf{KL}\big{(}P^{0}(\cdot\,|\,0,1)\parallel P^{1}(\cdot\,|\,0,1)\big{)}\Big{)}\bigg{\}}, (238)

then we can complete the proof by showing pe18p_{\mathrm{e}}\geq\frac{1}{8} under the claimed bound on the sample size NN. In the following, we shall control the KL divergence terms in (238) in three different cases; a numerical sanity check of the resulting per-case KL bounds is sketched right after the case analysis.

  • Case 1: σ(0,1γ4)\sigma\in\left(0,\frac{1-\gamma}{4}\right). In this case, applying γ[34,1)\gamma\in[\frac{3}{4},1) yields

    1q\displaystyle 1-q >1p=1qΔ>γ1γ4>34116>12,\displaystyle>1-p=1-q-\Delta>\gamma-\frac{1-\gamma}{4}>\frac{3}{4}-\frac{1}{16}>\frac{1}{2},
    p\displaystyle p q=1γ.\displaystyle\geq q=1-\gamma. (239)

    Armed with the above facts, applying Tsybakov, (2009, Lemma 2.7) yields

    𝖪𝖫(P0(| 0,1)P1(| 0,1))\displaystyle\mathsf{KL}\big{(}P^{0}(\cdot\,|\,0,1)\parallel P^{1}(\cdot\,|\,0,1)\big{)} =𝖪𝖫(pq)(pq)2(1p)p=(i)Δ2p(1p)\displaystyle=\mathsf{KL}\left(p\parallel q\right)\leq\frac{(p-q)^{2}}{(1-p)p}\overset{\mathrm{(i)}}{=}\frac{\Delta^{2}}{p(1-p)}
    =(ii)324(1γ)4ε2p(1p)\displaystyle\overset{\mathrm{(ii)}}{=}\frac{324(1-\gamma)^{4}\varepsilon^{2}}{p(1-p)}
    (iii)648(1γ)3ε2,\displaystyle\overset{\mathrm{(iii)}}{\leq}648(1-\gamma)^{3}\varepsilon^{2}, (240)

    where (i) follows from the definition in (226), (ii) holds by plugging in the expression of Δ\Delta in (236), and (iii) arises from the facts in (239). The same bound can be established for 𝖪𝖫(P0(| 0,0)P1(| 0,0))\mathsf{KL}\big{(}P^{0}(\cdot\,|\,0,0)\parallel P^{1}(\cdot\,|\,0,0)\big{)}. Substituting (240) back into (238) demonstrates that: if the sample size obeys

    Nlog21296(1γ)3ε2,\displaystyle N\leq\frac{\log 2}{1296(1-\gamma)^{3}\varepsilon^{2}}, (241)

    then one necessarily has

    pe\displaystyle p_{\mathrm{e}} 14exp{N1296(1γ)3ε2}18.\displaystyle\geq\frac{1}{4}\exp\Big{\{}-N\cdot 1296(1-\gamma)^{3}\varepsilon^{2}\Big{\}}\geq\frac{1}{8}. (242)
  • Case 2: σ[1γ4,13(1γ))\sigma\in\left[\frac{1-\gamma}{4},\frac{1}{3(1-\gamma)}\right). Applying the facts of Δ\Delta in (227), one has

    1q\displaystyle 1-q >1p=1qΔ11+σ12(1+σ)=12(1+σ),\displaystyle>1-p=1-q-\Delta\geq\frac{1}{1+\sigma}-\frac{1}{2(1+\sigma)}=\frac{1}{2(1+\sigma)},
    p\displaystyle p q=σ1+σ.\displaystyle\geq q=\frac{\sigma}{1+\sigma}. (243)

    Given the facts in (243), applying Tsybakov, (2009, Lemma 2.7) yields

    𝖪𝖫(P0(| 0,1)P1(| 0,1))\displaystyle\mathsf{KL}\big{(}P^{0}(\cdot\,|\,0,1)\parallel P^{1}(\cdot\,|\,0,1)\big{)} =𝖪𝖫(pq)(pq)2(1p)p=(i)Δ2p(1p)\displaystyle=\mathsf{KL}\left(p\parallel q\right)\leq\frac{(p-q)^{2}}{(1-p)p}\overset{\mathrm{(i)}}{=}\frac{\Delta^{2}}{p(1-p)}
    =(ii)4096(1+σ)2(1γ)4ε2p(1p)\displaystyle\overset{\mathrm{(ii)}}{=}\frac{4096(1+\sigma)^{2}(1-\gamma)^{4}\varepsilon^{2}}{p(1-p)}
    (iii)4096(1+σ)2(1γ)4ε2σ2(1+σ)28192(1γ)4(1+σ)4ε2σ,\displaystyle\overset{\mathrm{(iii)}}{\leq}\frac{4096(1+\sigma)^{2}(1-\gamma)^{4}\varepsilon^{2}}{\frac{\sigma}{2(1+\sigma)^{2}}}\leq\frac{8192(1-\gamma)^{4}(1+\sigma)^{4}\varepsilon^{2}}{\sigma}, (244)

    where (i) follows from the definition in (226), (ii) holds by plugging in the expression of Δ\Delta in (236), and (iii) arises from the facts in (243). The same bound can be established for 𝖪𝖫(P0(| 0,0)P1(| 0,0))\mathsf{KL}\big{(}P^{0}(\cdot\,|\,0,0)\parallel P^{1}(\cdot\,|\,0,0)\big{)}.

    Substituting (244) back into (238) demonstrates that: if the sample size obeys

    Nσlog216384(1γ)4(1+σ)4ε2,\displaystyle N\leq\frac{\sigma\log 2}{16384(1-\gamma)^{4}(1+\sigma)^{4}\varepsilon^{2}}, (245)

    then one necessarily has

    pe\displaystyle p_{\mathrm{e}} 14exp{N16384(1γ)4(1+σ)4ε2σ}18.\displaystyle\geq\frac{1}{4}\exp\bigg{\{}-N\frac{16384(1-\gamma)^{4}(1+\sigma)^{4}\varepsilon^{2}}{\sigma}\bigg{\}}\geq\frac{1}{8}. (246)
  • Case 3: σ>13(1γ)13\sigma>\frac{1}{3(1-\gamma)}\geq\frac{1}{3}. In this case, one has

    1q\displaystyle 1-q >1p=1qΔ11+σ14(1+σ)12(1+σ),\displaystyle>1-p=1-q-\Delta\geq\frac{1}{1+\sigma}-\frac{1}{4(1+\sigma)}\geq\frac{1}{2(1+\sigma)},
    p\displaystyle p q14.\displaystyle\geq q\geq\frac{1}{4}. (247)

    Given the facts in (247), applying Tsybakov, (2009, Lemma 2.7) yields

    𝖪𝖫(P0(| 0,1)P1(| 0,1))\displaystyle\mathsf{KL}\big{(}P^{0}(\cdot\,|\,0,1)\parallel P^{1}(\cdot\,|\,0,1)\big{)} =𝖪𝖫(pq)(pq)2(1p)p=(i)Δ2p(1p)\displaystyle=\mathsf{KL}\left(p\parallel q\right)\leq\frac{(p-q)^{2}}{(1-p)p}\overset{\mathrm{(i)}}{=}\frac{\Delta^{2}}{p(1-p)}
    (ii)64(1+σ)2ε2p(1p)\displaystyle\overset{\mathrm{(ii)}}{\leq}\frac{\frac{64}{(1+\sigma)^{2}}\varepsilon^{2}}{p(1-p)}
    (iii)492ε2σ,\displaystyle\overset{\mathrm{(iii)}}{\leq}\frac{492\varepsilon^{2}}{\sigma}, (248)

    where (i) follows from the definition in (226), (ii) holds by plugging in the expression of Δ\Delta in (236), and (iii) arises from the facts in (247). The same bound can be established for 𝖪𝖫(P0(| 0,0)P1(| 0,0))\mathsf{KL}\big{(}P^{0}(\cdot\,|\,0,0)\parallel P^{1}(\cdot\,|\,0,0)\big{)}. Substituting (248) back into (238) demonstrates that: if the sample size obeys

    Nσlog2984ε2,\displaystyle N\leq\frac{\sigma\log 2}{984\varepsilon^{2}}, (249)

    then one necessarily has

    pe\displaystyle p_{\mathrm{e}} 14exp{N984ε2σ}18.\displaystyle\geq\frac{1}{4}\exp\bigg{\{}-N\frac{984\varepsilon^{2}}{\sigma}\bigg{\}}\geq\frac{1}{8}. (250)
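In passing, the per-case KL bounds (240), (244), and (248) derived above can be sanity checked numerically. The sketch below (the triples of (γ, σ, ε) are ours and purely illustrative) evaluates the exact KL divergence between the two Bernoulli transition kernels at state 0 in both directions, and verifies that it stays below the corresponding per-case bound:

import math

def kl_bern(a, b):
    # KL(Ber(a) || Ber(b)) for a, b in (0, 1)
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def case_kl_and_bound(gamma, sigma, eps):
    # parameter choices (226) and (236), together with the per-case KL bound
    # appearing in (240), (244), and (248)
    if sigma < (1 - gamma) / 4:
        q = 1 - gamma
        Delta = 18 * (1 - gamma) ** 2 * eps
        bound = 648 * (1 - gamma) ** 3 * eps ** 2
    elif sigma < 1 / (3 * (1 - gamma)):
        q = sigma / (1 + sigma)
        Delta = 64 * (1 + sigma) * (1 - gamma) ** 2 * eps
        bound = 8192 * (1 - gamma) ** 4 * (1 + sigma) ** 4 * eps ** 2 / sigma
    else:
        q = sigma / (1 + sigma)
        Delta = 16 / (3 * (1 + sigma)) * eps
        bound = 492 * eps ** 2 / sigma
    p = q + Delta
    return max(kl_bern(p, q), kl_bern(q, p)), bound

# one illustrative (gamma, sigma, eps) triple per case
for gamma, sigma, eps in [(0.9, 0.01, 0.1), (0.9, 1.0, 0.01), (0.9, 10.0, 0.05)]:
    kl, bound = case_kl_and_bound(gamma, sigma, eps)
    assert kl <= bound, (gamma, sigma, eps, kl, bound)
print("per-case KL bounds hold on the tested triples")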
Step 3: putting things together.

Finally, summing up the results in (241), (245), and (249), combined with the requirement in (235), one has when

ε\displaystyle\varepsilon c1{11γif σ(0,1γ4)max{1(1+σ)(1γ),1}if σ[1γ4,),\displaystyle\leq c_{1}\begin{cases}\frac{1}{1-\gamma}\quad&\text{if }\sigma\in\left(0,\frac{1-\gamma}{4}\right)\\ \max\left\{\frac{1}{(1+\sigma)(1-\gamma)},1\right\}\quad&\text{if }\sigma\in\left[\frac{1-\gamma}{4},\infty\right)\end{cases}, (251)

taking

Nc2{1(1γ)3ε2if σ(0,1γ4)σmin{1,(1γ)4(1+σ)4}ε2if σ[1γ4,)\displaystyle N\leq c_{2}\begin{cases}\frac{1}{(1-\gamma)^{3}\varepsilon^{2}}&\text{if }\sigma\in\left(0,\frac{1-\gamma}{4}\right)\\ \frac{\sigma}{\min\left\{1,(1-\gamma)^{4}(1+\sigma)^{4}\right\}\varepsilon^{2}}&\text{if }\sigma\in\left[\frac{1-\gamma}{4},\infty\right)\end{cases} (252)

leads to pe18p_{e}\geq\frac{1}{8}, for some universal constants c1,c2>0c_{1},c_{2}>0.

E.3 Proof of the auxiliary facts

We begin with some basic facts about the χ2\chi^{2} divergence defined in (39) for any two Bernoulli distributions 𝖡𝖾𝗋(w)\mathsf{Ber}(w) and 𝖡𝖾𝗋(x)\mathsf{Ber}(x), denoted as

f(w,x)χ2(xw)=(wx)2w+(1w(1x))21w=(wx)2w(1w).\displaystyle f(w,x)\coloneqq\chi^{2}(x\parallel w)=\frac{(w-x)^{2}}{w}+\frac{(1-w-(1-x))^{2}}{1-w}=\frac{(w-x)^{2}}{w(1-w)}. (253)

For x[0,w)x\in[0,w), it is easily verified that the partial derivative w.r.t. xx obeys f(w,x)x=2(xw)w(1w)<0\frac{\partial f(w,x)}{\partial x}=\frac{2(x-w)}{w(1-w)}<0, implying that

x1<x2[0,w),f(w,x1)>f(w,x2).\displaystyle\forall\;x_{1}<x_{2}\in[0,w),\qquad f(w,x_{1})>f(w,x_{2}). (254)

In other words, the χ2\chi^{2} divergence f(w,x)f(w,x) increases as xx decreases from ww to 0.

Next, we introduce the following function for any fixed σ(0,)\sigma\in(0,\infty) and any x[σ1+σ,1)x\in\left[\frac{\sigma}{1+\sigma},1\right):

fσ(x)inf{y:χ2(yx)σ,y[0,x]}y=(i)max{0,xσx(1x)}=xσx(1x),\displaystyle f_{\sigma}(x)\coloneqq\inf_{\{y:\chi^{2}(y\parallel x)\leq\sigma,y\in[0,x]\}}y\overset{\mathrm{(i)}}{=}\max\left\{0,x-\sqrt{\sigma x(1-x)}\right\}=x-\sqrt{\sigma x(1-x)}, (255)

where (i) has been verified in Yang et al., (2022, Corollary B.2), and the last equality holds since xσ1+σx\geq\frac{\sigma}{1+\sigma}. The next lemma summarizes some useful facts about fσ()f_{\sigma}(\cdot), which again has been verified in Yang et al., (2022, Lemma B.12 and Corollary B.2).

Lemma 21.

Consider any σ(0,)\sigma\in(0,\infty). For x[σ1+σ,1)x\in[\frac{\sigma}{1+\sigma},1), fσ(x)f_{\sigma}(x) is convex and differentiable, with derivative given by

fσ(x)=1+σ(2x1)2x(1x).\displaystyle f_{\sigma}^{\prime}(x)=1+\frac{\sqrt{\sigma}(2x-1)}{2\sqrt{x(1-x)}}.
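The derivative formula in Lemma 21 can be verified numerically; the short sketch below (ours, for illustration only, with σ = 0.5 chosen arbitrarily) compares it against central finite differences and checks that the derivative is increasing (i.e., that f_σ is convex) on a grid:

import numpy as np

sigma = 0.5
f  = lambda x: x - np.sqrt(sigma * x * (1 - x))                                # closed form (255)
fp = lambda x: 1 + np.sqrt(sigma) * (2 * x - 1) / (2 * np.sqrt(x * (1 - x)))  # Lemma 21

x = np.linspace(sigma / (1 + sigma) + 1e-3, 1 - 1e-3, 1000)
h = 1e-6
finite_diff = (f(x + h) - f(x - h)) / (2 * h)       # central differences
assert np.max(np.abs(finite_diff - fp(x))) < 1e-4   # matches the stated derivative
assert np.all(np.diff(fp(x)) > 0)                   # f_sigma' increasing, hence f_sigma convex
print("derivative formula and convexity of f_sigma verified on the grid")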

E.3.1 Proof of Lemma 19

Let us control q¯\underline{q} and p¯\underline{p} separately.

Step 1: controlling q¯\underline{q}.

We shall control q¯\underline{q} in different cases w.r.t. the uncertainty level σ\sigma.

  • Case 1: σ(0,1γ4)\sigma\in\left(0,\frac{1-\gamma}{4}\right). In this case, recalling q=1γq=1-\gamma from (226) and applying (255) with x=qx=q leads to

    1γ=q>q¯=fσ(q)=1γσγ(1γ)1γ1γ4γ(1γ)>1γ2.\displaystyle 1-\gamma=q>\underline{q}=f_{\sigma}(q)=1-\gamma-\sqrt{\sigma\gamma(1-\gamma)}\geq 1-\gamma-\sqrt{\frac{1-\gamma}{4}\gamma(1-\gamma)}>\frac{1-\gamma}{2}. (256)
  • Case 2: σ[1γ4,)\sigma\in\left[\frac{1-\gamma}{4},\infty\right). Note that it suffices to treat Pϕ0,1ϕP^{\phi}_{0,1-\phi} as a Bernoulli distribution 𝖡𝖾𝗋(q)\mathsf{Ber}(q) over states 11 and 0, since we do not allow transition to other states. Recalling q=σ1+σq=\frac{\sigma}{1+\sigma} in (226) and noticing the fact that

    f(q,0)\displaystyle f(q,0) =q2q+(1(1q))21q=q(1q)=σ,\displaystyle=\frac{q^{2}}{q}+\frac{(1-(1-q))^{2}}{1-q}=\frac{q}{(1-q)}=\sigma, (257)

    one sees that the distribution 𝖡𝖾𝗋(0)\mathsf{Ber}(0) falls into the uncertainty set of size σ\sigma around 𝖡𝖾𝗋(q)\mathsf{Ber}(q). As a result, recalling the definition (229) leads to

    q¯=P¯ϕ(1| 0,1ϕ)=0,\displaystyle\underline{q}=\underline{P}^{\phi}(1\,|\,0,1-\phi)=0, (258)

    since q¯0\underline{q}\geq 0.

Step 2: controlling p¯\underline{p}.

To characterize the value of p¯\underline{p}, we again proceed case by case.

  • Case 1: σ(0,1γ4)\sigma\in\left(0,\frac{1-\gamma}{4}\right). In this case, note that p>q=1γσ1+σp>q=1-\gamma\geq\frac{\sigma}{1+\sigma}. Therefore, applying the convexity of fσ()f_{\sigma}(\cdot) and the form of its derivative in Lemma 21, one has

    p¯\displaystyle\underline{p} =fσ(p)fσ(q)+fσ(q)(pq)\displaystyle=f_{\sigma}(p)\geq f_{\sigma}(q)+f_{\sigma}^{\prime}(q)(p-q)
    =q¯+(1+σ(2q1)2q(1q))Δq¯+(11γ4(12(1γ))2(1γ)γ)Δq¯+3Δ4.\displaystyle=\underline{q}+\Bigg{(}1+\frac{\sqrt{\sigma}(2q-1)}{2\sqrt{q(1-q)}}\Bigg{)}\Delta\geq\underline{q}+\Bigg{(}1-\frac{\sqrt{\frac{1-\gamma}{4}}(1-2(1-\gamma))}{2\sqrt{(1-\gamma)\gamma}}\Bigg{)}\Delta\geq\underline{q}+\frac{3\Delta}{4}. (259)

    Similarly, applying Lemma 21 leads to

    p¯\displaystyle\underline{p} =fσ(p)fσ(q)+fσ(p)(pq)\displaystyle=f_{\sigma}(p)\leq f_{\sigma}(q)+f_{\sigma}^{\prime}(p)(p-q)
    =q¯+(1σ(12p)2p(1p))Δq¯+Δ,\displaystyle=\underline{q}+\Bigg{(}1-\frac{\sqrt{\sigma}(1-2p)}{2\sqrt{p(1-p)}}\Bigg{)}\Delta\leq\underline{q}+\Delta, (260)

    where the last inequality holds by 12p>01-2p>0 due to the fact p=q+Δ54(1γ)516<12p=q+\Delta\leq\frac{5}{4}(1-\gamma)\leq\frac{5}{16}<\frac{1}{2} (cf. (227) and γ[34,1)\gamma\in[\frac{3}{4},1)). To sum up, given σ(0,1γ4)\sigma\in\left(0,\frac{1-\gamma}{4}\right), combined with (256), we arrive at

    q¯+34Δp¯q¯+Δ5(1γ)4,\displaystyle\underline{q}+\frac{3}{4}\Delta\leq\underline{p}\leq\underline{q}+\Delta\leq\frac{5(1-\gamma)}{4}, (261)

    where the last inequality holds by Δ14(1γ)\Delta\leq\frac{1}{4}(1-\gamma) (see (227)).

  • Case 2: σ[1γ4,)\sigma\in\left[\frac{1-\gamma}{4},\infty\right). We recall that p=q+Δ>q=σ1+σp=q+\Delta>q=\frac{\sigma}{1+\sigma} in (226). To derive the lower bound for p¯\underline{p} in (229), similar to (259), one has

    p¯\displaystyle\underline{p} =fσ(p)fσ(q)+fσ(q)(pq)\displaystyle=f_{\sigma}(p)\geq f_{\sigma}(q)+f_{\sigma}^{\prime}(q)(p-q)
    =q¯+(1+σ(2q1)2q(1q))Δ\displaystyle=\underline{q}+\left(1+\frac{\sqrt{\sigma}(2q-1)}{2\sqrt{q(1-q)}}\right)\Delta
    =(i)0+(1+σσ11+σ2σ1+σ11+σ)Δ=(1+σ12)Δ=(σ+12)Δ,\displaystyle\overset{\mathrm{(i)}}{=}0+\left(1+\frac{\sqrt{\sigma}\frac{\sigma-1}{1+\sigma}}{2\sqrt{\frac{\sigma}{1+\sigma}\frac{1}{1+\sigma}}}\right)\Delta=\left(1+\frac{\sigma-1}{2}\right)\Delta=\left(\frac{\sigma+1}{2}\right)\Delta, (262)

    where (i) follows from q=σ1+σq=\frac{\sigma}{1+\sigma} and q¯=0\underline{q}=0 (see (258)). For the other direction, similar to (260), we have

    p¯\displaystyle\underline{p} =fσ(p)fσ(q)+fσ(p)(pq)=q¯+(1+σ(2p1)2p(1p))Δ\displaystyle=f_{\sigma}(p)\leq f_{\sigma}(q)+f_{\sigma}^{\prime}(p)(p-q)=\underline{q}+\left(1+\frac{\sqrt{\sigma}(2p-1)}{2\sqrt{p(1-p)}}\right)\Delta
    =(i)(1+σ(2p1)2p(1p))Δ=(ii)(1+σ(σ11+σ+2Δ)2(σ1+σ+Δ)(11+σΔ))Δ\displaystyle\overset{\mathrm{(i)}}{=}\left(1+\frac{\sqrt{\sigma}(2p-1)}{2\sqrt{p(1-p)}}\right)\Delta\overset{\mathrm{(ii)}}{=}\left(1+\frac{\sqrt{\sigma}\left(\frac{\sigma-1}{1+\sigma}+2\Delta\right)}{2\sqrt{\left(\frac{\sigma}{1+\sigma}+\Delta\right)\left(\frac{1}{1+\sigma}-\Delta\right)}}\right)\Delta
    (iii)(1+σ(1+2Δ)2σ1+σ12(1+σ))Δ(iv)(1+(1+σ)(1+11+σ))Δ=(3+σ)Δ,\displaystyle\overset{\mathrm{(iii)}}{\leq}\left(1+\frac{\sqrt{\sigma}(1+2\Delta)}{2\sqrt{\frac{\sigma}{1+\sigma}\cdot\frac{1}{2(1+\sigma)}}}\right)\Delta\overset{\mathrm{(iv)}}{\leq}\left(1+(1+\sigma)\left(1+\frac{1}{1+\sigma}\right)\right)\Delta=(3+\sigma)\Delta, (263)

    where (i) holds by q¯=0\underline{q}=0 (see (258)), (ii) follows from plugging in p=q+Δ=σ1+σ+Δp=q+\Delta=\frac{\sigma}{1+\sigma}+\Delta, and (iii) and (iv) arise from Δmin{14(1γ),12(1+σ)}1\Delta\leq\min\left\{\frac{1}{4}(1-\gamma),\frac{1}{2(1+\sigma)}\right\}\leq 1 in (227). Combining (262) and (263) yields

    σ+12Δp¯(3+σ)Δ.\displaystyle\frac{\sigma+1}{2}\Delta\leq\underline{p}\leq(3+\sigma)\Delta. (264)
Step 3: combining all the results.

Finally, summing up the results for both q¯\underline{q} (in (256) and (258)) and p¯\underline{p} (in (261) and (264)), we arrive at the advertised bound.
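The bounds of Lemma 19 can also be confirmed numerically: the sketch below (ours; the values of γ, σ, and Δ are purely illustrative) approximates p̄ and q̄ by a direct grid search over the χ² ball around Ber(p) and Ber(q), and checks both regimes of (230) up to the grid resolution:

import numpy as np

def chi2_ball_min(x, sigma, grid_size=200001):
    # smallest mass on state 1 among Ber(y) with chi^2(Ber(y) || Ber(x)) <= sigma,
    # approximated by a grid search (cf. the closed form (255))
    y = np.linspace(0.0, 1.0, grid_size)
    feasible = (y - x) ** 2 <= sigma * x * (1 - x) + 1e-12
    return float(y[feasible].min())

gamma = 0.9
for sigma in [0.01, 0.02, 0.2, 1.0, 10.0]:          # spans both regimes of (226)
    if sigma < (1 - gamma) / 4:
        q, delta_max = 1 - gamma, (1 - gamma) / 4
    else:
        q, delta_max = sigma / (1 + sigma), min((1 - gamma) / 4, 1 / (2 * (1 + sigma)))
    Delta = delta_max / 2                            # an admissible Delta obeying (227)
    p = q + Delta
    p_low, q_low = chi2_ball_min(p, sigma), chi2_ball_min(q, sigma)
    if sigma < (1 - gamma) / 4:
        assert (1 - gamma) / 2 < q_low < 1 - gamma
        assert q_low + 0.75 * Delta <= p_low + 1e-4 <= q_low + Delta + 2e-4
        assert p_low <= 5 * (1 - gamma) / 4 + 1e-4
    else:
        assert q_low == 0.0
        assert (sigma + 1) / 2 * Delta <= p_low + 1e-4 <= (3 + sigma) * Delta + 2e-4
print("both regimes of (230) verified on the tested grid")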

E.3.2 Proof of Lemma 20

The robust value function for any policy π\pi.

For any ϕ\mathcal{M}_{\phi} with ϕ{0,1}\phi\in\{0,1\}, we first characterize the robust value function of any policy π\pi over different states.

Towards this, it is easily observed that for any policy π\pi, the robust value functions at state s=1s=1 or any s{2,3,,S1}s\in\{2,3,\cdots,S-1\} obey

Vϕπ,σ(1)=(i)1+γVϕπ,σ(1)=11γ\displaystyle V_{\phi}^{\pi,\sigma}(1)\overset{\mathrm{(i)}}{=}1+\gamma V_{\phi}^{\pi,\sigma}(1)=\frac{1}{1-\gamma} (265a)
and
s{2,3,,S1}:Vϕπ,σ(s)\displaystyle\forall s\in\{2,3,\cdots,S-1\}:\qquad V_{\phi}^{\pi,\sigma}(s) =(ii)0+γVϕπ,σ(1)=γ1γ,\displaystyle\overset{\mathrm{(ii)}}{=}0+\gamma V_{\phi}^{\pi,\sigma}(1)=\frac{\gamma}{1-\gamma}, (265b)

where (i) and (ii) follow from the facts that the transitions defined over states s1s\geq 1 in (64) admit only one possible next state 11 (so that every kernel in the χ2\chi^{2}-divergence uncertainty set coincides with this deterministic transition), that r(1,a)=1r(1,a)=1 for all a𝒜a\in\mathcal{A}^{\prime}, and that r(s,a)=0r(s,a)=0 for all (s,a){2,3,,S1}×𝒜(s,a)\in\{2,3,\cdots,S-1\}\times\mathcal{A}.

To continue, the robust value function at state 0 with policy π\pi satisfies

Vϕπ,σ(0)\displaystyle V_{\phi}^{\pi,\sigma}(0) =𝔼aπ(| 0)[r(0,a)+γinf𝒫𝒰σ(Pϕ0,a)𝒫Vπ,σϕ]\displaystyle=\mathbb{E}_{a\sim\pi(\cdot\,|\,0)}\bigg{[}r(0,a)+\gamma\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(P^{\phi}_{0,a})}\mathcal{P}V^{\pi,\sigma}_{\phi}\bigg{]}
=0+γπ(ϕ| 0)inf𝒫𝒰σ(Pϕ0,ϕ)𝒫Vπ,σϕ+γπ(1ϕ| 0)inf𝒫𝒰σ(Pϕ0,1ϕ)𝒫Vπ,σϕ\displaystyle=0+\gamma\pi(\phi\,|\,0)\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(P^{\phi}_{0,\phi})}\mathcal{P}V^{\pi,\sigma}_{\phi}+\gamma\pi(1-\phi\,|\,0)\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(P^{\phi}_{0,1-\phi})}\mathcal{P}V^{\pi,\sigma}_{\phi} (266)
(i)γ1γ,\displaystyle\overset{\mathrm{(i)}}{\leq}\frac{\gamma}{1-\gamma}, (267)

where (i) holds since Vπ,σϕ11γ\|V^{\pi,\sigma}_{\phi}\|_{\infty}\leq\frac{1}{1-\gamma}. Summing up the results in (265) and (267) leads to

s{2,3,,S1},Vϕπ,σ(1)>Vϕπ,σ(s)Vϕπ,σ(0).\displaystyle\forall s\in\{2,3,\cdots,S-1\},\qquad V_{\phi}^{\pi,\sigma}(1)>V_{\phi}^{\pi,\sigma}(s)\geq V_{\phi}^{\pi,\sigma}(0). (268)

With the transition kernel in (64) over state 0 and the fact in (268), (266) can be rewritten as

Vϕπ,σ(0)\displaystyle V_{\phi}^{\pi,\sigma}(0) =γπ(ϕ| 0)inf𝒫𝒰σ(Pϕ0,ϕ)𝒫Vπ,σϕ+γπ(1ϕ| 0)inf𝒫𝒰σ(Pϕ0,1ϕ)𝒫Vπ,σϕ\displaystyle=\gamma\pi(\phi\,|\,0)\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(P^{\phi}_{0,\phi})}\mathcal{P}V^{\pi,\sigma}_{\phi}+\gamma\pi(1-\phi\,|\,0)\inf_{\mathcal{P}\in\mathcal{U}^{\sigma}(P^{\phi}_{0,1-\phi})}\mathcal{P}V^{\pi,\sigma}_{\phi}
=(i)γπ(ϕ| 0)[p¯Vϕπ,σ(1)+(1p¯)Vϕπ,σ(0)]+γπ(1ϕ| 0)[q¯Vϕπ,σ(1)+(1q¯)Vϕπ,σ(0)]\displaystyle\overset{\mathrm{(i)}}{=}\gamma\pi(\phi\,|\,0)\Big{[}\underline{p}V_{\phi}^{\pi,\sigma}(1)+\left(1-\underline{p}\right)V_{\phi}^{\pi,\sigma}(0)\Big{]}+\gamma\pi(1-\phi\,|\,0)\Big{[}\underline{q}V_{\phi}^{\pi,\sigma}(1)+\left(1-\underline{q}\right)V_{\phi}^{\pi,\sigma}(0)\Big{]}
=(ii)γzϕπVϕπ,σ(1)+γ(1zϕπ)Vϕπ,σ(0)\displaystyle\overset{\mathrm{(ii)}}{=}\gamma z_{\phi}^{\pi}V_{\phi}^{\pi,\sigma}(1)+\gamma\left(1-z_{\phi}^{\pi}\right)V_{\phi}^{\pi,\sigma}(0)
=γzϕπ(1γ)(1γ(1zϕπ)),\displaystyle=\frac{\gamma z_{\phi}^{\pi}}{(1-\gamma)\Big{(}1-\gamma\big{(}1-z_{\phi}^{\pi}\big{)}\Big{)}}, (269)

where (i) holds by the definition of p¯\underline{p} and q¯\underline{q} in (229), (ii) follows from the definition of zϕπz_{\phi}^{\pi} in (232), and the last line holds by applying (265a) and solving the resulting linear equation for Vϕπ,σ(0)V_{\phi}^{\pi,\sigma}(0).
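As a quick check of the closed form in the last line of (269) (and hence (231)), the following sketch (ours; the values of γ and z are arbitrary stand-ins for the discount factor and z_ϕ^π) iterates the two-state recursion to convergence and compares with the formula:

# quick check of (269): iterate
#   V(1) = 1 + gamma * V(1),    V(0) = gamma * (z * V(1) + (1 - z) * V(0)),
# and compare with gamma * z / ((1 - gamma) * (1 - gamma * (1 - z)))
gamma, z = 0.9, 0.3
V0, V1 = 0.0, 0.0
for _ in range(5000):
    V1 = 1.0 + gamma * V1
    V0 = gamma * (z * V1 + (1.0 - z) * V0)
closed_form = gamma * z / ((1.0 - gamma) * (1.0 - gamma * (1.0 - z)))
assert abs(V0 - closed_form) < 1e-8
print(V0, closed_form)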

Optimal policy and its optimal value function.

To continue, observe that Vϕπ,σ(0)=:f(zϕπ)V_{\phi}^{\pi,\sigma}(0)=:f(z_{\phi}^{\pi}) is increasing in zϕπz_{\phi}^{\pi}, since the derivative of f(zϕπ)f(z_{\phi}^{\pi}) w.r.t. zϕπz_{\phi}^{\pi} obeys

f(zϕπ)=γ(1γ)(1γ(1zϕπ))γ2zϕπ(1γ)(1γ)2(1γ(1zϕπ))2=γ(1γ(1zϕπ))2>0,\displaystyle f^{\prime}(z_{\phi}^{\pi})=\frac{\gamma(1-\gamma)\left(1-\gamma\big{(}1-z_{\phi}^{\pi}\big{)}\right)-\gamma^{2}z_{\phi}^{\pi}(1-\gamma)}{(1-\gamma)^{2}\left(1-\gamma\big{(}1-z_{\phi}^{\pi}\big{)}\right)^{2}}=\frac{\gamma}{\left(1-\gamma\big{(}1-z_{\phi}^{\pi}\big{)}\right)^{2}}>0,

where the last inequality holds by 0zϕπ10\leq z_{\phi}^{\pi}\leq 1. Further, since zϕπz_{\phi}^{\pi} is also increasing in π(ϕ| 0)\pi(\phi\,|\,0) (owing to the fact p¯q¯\underline{p}\geq\underline{q} implied by Lemma 19), the optimal robust policy at state 0 obeys

πϕ(ϕ| 0)=1.\pi_{\phi}^{\star}(\phi\,|\,0)=1. (270)

Considering that the action does not influence the state transition for all states s>0s>0, without loss of generality, we choose the optimal robust policy to obey

s>0:πϕ(ϕ|s)=1.\displaystyle\forall s>0:\quad\pi_{\phi}^{\star}(\phi\,|\,s)=1. (271)

Taking π=πϕ\pi=\pi^{\star}_{\phi} in (269), which gives zϕπϕ=p¯z_{\phi}^{\pi^{\star}_{\phi}}=\underline{p}, we complete the proof by obtaining the corresponding optimal robust value function at state 0 as follows:

Vϕ,σ(0)\displaystyle V_{\phi}^{\star,\sigma}(0) =γzϕπϕ(1γ)(1γ(1zϕπϕ))=γp¯(1γ)(1γ(1p¯)).\displaystyle=\frac{\gamma z_{\phi}^{\pi^{\star}_{\phi}}}{(1-\gamma)\left(1-\gamma\left(1-z_{\phi}^{\pi^{\star}_{\phi}}\right)\right)}=\frac{\gamma\underline{p}}{(1-\gamma)\left(1-\gamma\left(1-\underline{p}\right)\right)}.

E.3.3 Proof of the claim (237)

Plugging in the definition of φ\varphi, we find that for any policy π\pi,

φ,V,σϕVπ,σϕ\displaystyle\big{\langle}\varphi,V^{\star,\sigma}_{\phi}-V^{\pi,\sigma}_{\phi}\big{\rangle} =V,σϕ(0)Vπ,σϕ(0)\displaystyle=V^{\star,\sigma}_{\phi}(0)-V^{\pi,\sigma}_{\phi}(0)
=(i)γp¯(1γ)(1γ(1p¯))γzϕπ(1γ)(1γ(1zϕπ))\displaystyle\overset{\mathrm{(i)}}{=}\frac{\gamma\underline{p}}{(1-\gamma)\left(1-\gamma\big{(}1-\underline{p}\big{)}\right)}-\frac{\gamma z_{\phi}^{\pi}}{(1-\gamma)\left(1-\gamma\big{(}1-z_{\phi}^{\pi}\big{)}\right)}
=γ(p¯zϕπ)(1γ(1p¯))(1γ(1zϕπ))(ii)γ(p¯zϕπ)(1γ(1p¯))2=(iii)γ(p¯q¯)(1π(ϕ| 0))(1γ(1p¯))2,\displaystyle=\frac{\gamma\left(\underline{p}-z_{\phi}^{\pi}\right)}{\left(1-\gamma\big{(}1-\underline{p}\big{)}\right)\left(1-\gamma\big{(}1-z_{\phi}^{\pi}\big{)}\right)}\overset{\mathrm{(ii)}}{\geq}\frac{\gamma\left(\underline{p}-z_{\phi}^{\pi}\right)}{\left(1-\gamma\left(1-\underline{p}\right)\right)^{2}}\overset{\mathrm{(iii)}}{=}\frac{\gamma(\underline{p}-\underline{q})\big{(}1-\pi(\phi\,|\,0)\big{)}}{\left(1-\gamma\left(1-\underline{p}\right)\right)^{2}}, (272)

where (i) holds by applying Lemma 20, (ii) arises from zϕπp¯z_{\phi}^{\pi}\leq\underline{p} (see the definition of zϕπz_{\phi}^{\pi} in (232) and the fact p¯q¯\underline{p}\geq\underline{q} implied by Lemma 19), and (iii) follows from the definition of zϕπz_{\phi}^{\pi} in (232).
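Before proceeding, the algebraic identity behind the third line of (272) can be checked numerically with the short sketch below (ours; the numerical values are arbitrary stand-ins for p̄ and z_ϕ^π):

# identity used in (272):
#   f(p) - f(z) = gamma * (p - z) / ((1 - gamma*(1 - p)) * (1 - gamma*(1 - z)))
# with f(t) = gamma * t / ((1 - gamma) * (1 - gamma*(1 - t)))
gamma, p_low, z = 0.9, 0.2, 0.05
f = lambda t: gamma * t / ((1 - gamma) * (1 - gamma * (1 - t)))
lhs = f(p_low) - f(z)
rhs = gamma * (p_low - z) / ((1 - gamma * (1 - p_low)) * (1 - gamma * (1 - z)))
assert abs(lhs - rhs) < 1e-10
print(lhs, rhs)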

To further control (272), we consider it in two cases separately:

  • Case 1: σ(0,1γ4)\sigma\in\left(0,\frac{1-\gamma}{4}\right). In this case, applying Lemma 19 to (272) yields

    φ,V,σϕVπ,σϕ\displaystyle\big{\langle}\varphi,V^{\star,\sigma}_{\phi}-V^{\pi,\sigma}_{\phi}\big{\rangle} γ(p¯q¯)(1π(ϕ| 0))(1γ(1p¯))2γ3Δ4(1π(ϕ| 0))(1γ(15(1γ)4))2\displaystyle\geq\frac{\gamma(\underline{p}-\underline{q})\big{(}1-\pi(\phi\,|\,0)\big{)}}{\left(1-\gamma\left(1-\underline{p}\right)\right)^{2}}\geq\frac{\gamma\frac{3\Delta}{4}\big{(}1-\pi(\phi\,|\,0)\big{)}}{\left(1-\gamma\left(1-\frac{5(1-\gamma)}{4}\right)\right)^{2}}
    Δ(1π(ϕ| 0))9(1γ)2=2ε(1π(ϕ| 0)),\displaystyle\geq\frac{\Delta\big{(}1-\pi(\phi\,|\,0)\big{)}}{9(1-\gamma)^{2}}=2\varepsilon\big{(}1-{\pi}(\phi\,|\,0)\big{)}, (273)

    where the penultimate inequality follows from γ3/4\gamma\geq 3/4, and the last equality holds by the specification of Δ\Delta in (236), namely

    Δ=18(1γ)2ε.\displaystyle\Delta=18(1-\gamma)^{2}\varepsilon. (274)

    It is easily verified that taking ε172(1γ)\varepsilon\leq\frac{1}{72(1-\gamma)} as in (235) directly leads to meeting the requirement in (227), i.e., Δ14(1γ)\Delta\leq\frac{1}{4}(1-\gamma).

  • Case 2: σ[1γ4,)\sigma\in\left[\frac{1-\gamma}{4},\infty\right). Similarly, applying Lemma 19 to (272) gives

    φ,V,σϕVπ,σϕ\displaystyle\big{\langle}\varphi,V^{\star,\sigma}_{\phi}-V^{\pi,\sigma}_{\phi}\big{\rangle} γ(p¯q¯)(1π(ϕ| 0))(1γ(1p¯))2γσ+12Δ(1π(ϕ| 0))min{1,(1γ(1(3+σ)Δ))2}\displaystyle\geq\frac{\gamma(\underline{p}-\underline{q})\big{(}1-\pi(\phi\,|\,0)\big{)}}{\left(1-\gamma\left(1-\underline{p}\right)\right)^{2}}\geq\frac{\gamma\frac{\sigma+1}{2}\Delta\big{(}1-\pi(\phi\,|\,0)\big{)}}{\min\left\{1,\left(1-\gamma\left(1-(3+\sigma)\Delta\right)\right)^{2}\right\}} (275)

    Before continuing, it can be verified that

    1γ(1(3+σ)Δ)\displaystyle 1-\gamma\left(1-(3+\sigma)\Delta\right) =1γ+γ(3+σ)Δ(i)1γ+(3+σ)min{14(1γ),12(σ+1)}\displaystyle=1-\gamma+\gamma(3+\sigma)\Delta\overset{\mathrm{(i)}}{\leq}1-\gamma+(3+\sigma)\min\left\{\frac{1}{4}(1-\gamma),\frac{1}{2(\sigma+1)}\right\}
    min{2(1+σ)(1γ),32},\displaystyle\leq\min\left\{2(1+\sigma)(1-\gamma),\,\frac{3}{2}\right\}, (276)

    where (i) is obtained by Δmin{14(1γ),12(1+σ)}\Delta\leq\min\left\{\frac{1}{4}(1-\gamma),\frac{1}{2(1+\sigma)}\right\} (see (227)). Applying the above fact to (275) gives

    φ,V,σϕVπ,σϕ\displaystyle\big{\langle}\varphi,V^{\star,\sigma}_{\phi}-V^{\pi,\sigma}_{\phi}\big{\rangle} γσ+12Δ(1π(ϕ| 0))min{1,(1γ(1(3+σ)Δ))2}(i)3(σ+1)Δ(1π(ϕ| 0))8min{4(1+σ)2(1γ)2,1}\displaystyle\geq\frac{\gamma\frac{\sigma+1}{2}\Delta\big{(}1-\pi(\phi\,|\,0)\big{)}}{\min\left\{1,\left(1-\gamma\left(1-(3+\sigma)\Delta\right)\right)^{2}\right\}}\overset{\mathrm{(i)}}{\geq}\frac{3(\sigma+1)\Delta\big{(}1-\pi(\phi\,|\,0)\big{)}}{8\min\left\{4(1+\sigma)^{2}(1-\gamma)^{2},1\right\}}
    Δ(1π(ϕ| 0))min{32(1+σ)(1γ)2,83(1+σ)}=2ε(1π(ϕ| 0)),\displaystyle\geq\frac{\Delta\big{(}1-\pi(\phi\,|\,0)\big{)}}{\min\left\{32(1+\sigma)(1-\gamma)^{2},\frac{8}{3(1+\sigma)}\right\}}=2\varepsilon\big{(}1-{\pi}(\phi\,|\,0)\big{)}, (277)

    where (i) holds by γ34\gamma\geq\frac{3}{4} and (276), and the last equality holds by the specification in (236):

    Δ={64(1+σ)(1γ)2εif σ[1γ4,13(1γ)),163(1+σ)εif σ>13(1γ).\displaystyle\Delta=\begin{cases}64(1+\sigma)(1-\gamma)^{2}\varepsilon\quad&\text{if }\sigma\in\left[\frac{1-\gamma}{4},\frac{1}{3(1-\gamma)}\right),\\ \frac{16}{3(1+\sigma)}\varepsilon\quad&\text{if }\sigma>\frac{1}{3(1-\gamma)}.\end{cases} (278)

    As a result, it is easily verified that the requirement in (227)

    Δmin{14(1γ),12(1+σ)}\displaystyle\Delta\leq\min\left\{\frac{1}{4}(1-\gamma),\frac{1}{2(1+\sigma)}\right\} (279)

    is met if we let

    ε{1256(1+σ)(1γ)if σ[1γ4,13(1γ)),332if σ>13(1γ),\displaystyle\varepsilon\leq\begin{cases}\frac{1}{256(1+\sigma)(1-\gamma)}\quad&\text{if }\sigma\in\left[\frac{1-\gamma}{4},\frac{1}{3(1-\gamma)}\right),\\ \frac{3}{32}\quad&\text{if }\sigma>\frac{1}{3(1-\gamma)},\end{cases} (280)

    as in (235).

The proof is then completed by summing up the results in the above two cases.

Appendix F Proof for the offline setting

F.1 Proof of the upper bounds: Corollary 1 and Corollary 3

As the proofs of Corollary 1 and Corollary 3 are similar, without loss of generality, we first focus on Corollary 1 in the case of TV distance.

To begin with, suppose we have access to in total N𝖻N_{\mathsf{b}} independent sample tuples {si,ai,si,ri}i=1N𝖻\{s_{i},a_{i},s_{i}^{\prime},r_{i}\}_{i=1}^{N_{\mathsf{b}}} from either the generative model or a historical dataset. We denote the number of samples generated based on the state-action pair (s,a)(s,a) as N(s,a)N(s,a), i.e.,

(s,a)𝒮×𝒜:N(s,a)=i=1N𝖻𝟙{si=s,ai=a}.\displaystyle\forall(s,a)\in{\mathcal{S}}\times\mathcal{A}:\quad N(s,a)=\sum\limits_{i=1}^{N_{\mathsf{b}}}\mathds{1}\big{\{}s_{i}=s,a_{i}=a\big{\}}. (281)

Then according to (13), we can construct an empirical nominal transition kernel for DRVI (Algorithm 1) as follows:

(s,a)𝒮×𝒜:P^0(s|s,a)1N(s,a)i=1N𝖻𝟙{si=s,ai=a,si=s}.\displaystyle\forall(s,a)\in{\mathcal{S}}\times\mathcal{A}:\quad\widehat{P}^{0}(s^{\prime}\,|\,s,a)\coloneqq\frac{1}{N(s,a)}\sum\limits_{i=1}^{N_{\mathsf{b}}}\mathds{1}\big{\{}s_{i}=s,a_{i}=a,s_{i}^{\prime}=s^{\prime}\big{\}}. (282)
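For concreteness, a minimal sketch of the counts (281) and the plug-in estimate (282) is given below (ours and purely illustrative; rewards are ignored since only the transition kernel is estimated):

import numpy as np

def empirical_nominal_kernel(samples, S, A):
    """samples: iterable of (s, a, s_next) index tuples, as in (281)-(282)."""
    N = np.zeros((S, A), dtype=int)          # visitation counts N(s, a)
    P_hat = np.zeros((S, A, S))              # empirical nominal kernel
    for s, a, s_next in samples:
        N[s, a] += 1
        P_hat[s, a, s_next] += 1.0
    visited = N > 0
    P_hat[visited] /= N[visited][:, None]    # normalize the visited rows only
    return N, P_hat

# toy usage on a 3-state / 2-action example
rng = np.random.default_rng(0)
samples = [(rng.integers(3), rng.integers(2), rng.integers(3)) for _ in range(1000)]
N, P_hat = empirical_nominal_kernel(samples, S=3, A=2)
assert np.allclose(P_hat.sum(axis=-1)[N > 0], 1.0)   # every estimated row is a distribution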

Armed with the above estimate of the nominal transition kernel, we introduce a slightly more general version of Theorem 1, which follows directly from the same proof routine as in Appendix 5.2.2.

Theorem 5 (Upper bound under TV distance).

Let the uncertainty set be 𝒰ρσ()=𝒰σ𝖳𝖵()\mathcal{U}_{\rho}^{\sigma}(\cdot)=\mathcal{U}^{\sigma}_{\mathsf{TV}}(\cdot), as specified by the TV distance (9). Consider any discount factor γ[14,1)\gamma\in\left[\frac{1}{4},1\right), uncertainty level σ(0,1)\sigma\in(0,1), and δ(0,1)\delta\in(0,1). Based on the empirical nominal transition kernel in (282), let π^\widehat{\pi} be the output policy of Algorithm 1 after T=C1log(N𝖻1γ)T=C_{1}\log\big{(}\frac{N_{\mathsf{b}}}{1-\gamma}\big{)} iterations. Then with probability at least 1δ1-\delta, one has

s𝒮:V,σ(s)Vπ^,σ(s)ε\displaystyle\forall s\in{\mathcal{S}}:\quad V^{\star,\sigma}(s)-V^{\widehat{\pi},\sigma}(s)\leq\varepsilon (283)

for any ε(0,1/max{1γ,σ}]\varepsilon\in\left(0,\sqrt{1/\max\{1-\gamma,\sigma\}}\right], as long as

(s,a)𝒮×𝒜:N(s,a)C2(1γ)2max{1γ,σ}ε2log(SAN𝖻(1γ)δ).\displaystyle\forall(s,a)\in{\mathcal{S}}\times\mathcal{A}:\quad N(s,a)\geq\frac{C_{2}}{(1-\gamma)^{2}\max\{1-\gamma,\sigma\}\varepsilon^{2}}\log\left(\frac{SAN_{\mathsf{b}}}{(1-\gamma)\delta}\right). (284)

Here, C1,C2>0C_{1},C_{2}>0 are some large enough universal constants.

Furthermore, we invoke a fact derived from basic concentration inequalities (Li et al.,, 2024) as below.

Lemma 22.

Consider any δ(0,1)\delta\in(0,1) and a dataset with N𝖻N_{\mathsf{b}} independent samples satisfying Assumption 1. With probability at least 1δ1-\delta, the quantities {N(s,a)}\{N(s,a)\} obey

max{N(s,a),23logN𝖻δ}\displaystyle\max\Big{\{}N(s,a),\frac{2}{3}\log\frac{N_{\mathsf{b}}}{\delta}\Big{\}} N𝖻μ𝖻(s,a)12\displaystyle\geq\frac{N_{\mathsf{b}}\mu^{\mathsf{b}}(s,a)}{12} (285)

simultaneously for all (s,a)𝒮×𝒜(s,a)\in{\mathcal{S}}\times\mathcal{A}.

Now we are ready to verify Corollary 1. Armed with a historical dataset 𝒟𝖻\mathcal{D}^{\mathsf{b}} with N𝖻N_{\mathsf{b}} independent samples that obeys Assumption 1, one has with probability at least 1δ1-\delta,

(s,a)𝒮×𝒜:N(s,a)N𝖻μ𝖻(s,a)12N𝖻μmin12\displaystyle\forall(s,a)\in{\mathcal{S}}\times\mathcal{A}:\quad N(s,a)\geq\frac{N_{\mathsf{b}}\mu^{\mathsf{b}}(s,a)}{12}\geq\frac{N_{\mathsf{b}}\mu_{\min}}{12} (286)

as long as N𝖻8logN𝖻δμmin8logN𝖻δμ𝖻(s,a)N_{\mathsf{b}}\geq\frac{8\log\frac{N_{\mathsf{b}}}{\delta}}{\mu_{\min}}\geq\frac{8\log\frac{N_{\mathsf{b}}}{\delta}}{\mu^{\mathsf{b}}(s,a)} for all (s,a)𝒮×𝒜(s,a)\in{\mathcal{S}}\times\mathcal{A}. Consequently, given N𝖻8logN𝖻δμminN_{\mathsf{b}}\geq\frac{8\log\frac{N_{\mathsf{b}}}{\delta}}{\mu_{\min}}, applying Theorem 5 with the fact N(s,a)N𝖻μmin12N(s,a)\geq\frac{N_{\mathsf{b}}\mu_{\min}}{12} for all (s,a)𝒮×𝒜(s,a)\in{\mathcal{S}}\times\mathcal{A} (see (286)) directly leads to: DRVI can achieve an ε\varepsilon-optimal policy as long as

N(s,a)N𝖻μmin12C2(1γ)2max{1γ,σ}ε2log(SAN𝖻(1γ)δ),\displaystyle N(s,a)\geq\frac{N_{\mathsf{b}}\mu_{\min}}{12}\geq\frac{C_{2}}{(1-\gamma)^{2}\max\{1-\gamma,\sigma\}\varepsilon^{2}}\log\left(\frac{SAN_{\mathsf{b}}}{(1-\gamma)\delta}\right), (287)

namely

N𝖻C3μmin(1γ)2max{1γ,σ}ε2log(SAN𝖻(1γ)δ),\displaystyle N_{\mathsf{b}}\geq\frac{C_{3}}{\mu_{\min}(1-\gamma)^{2}\max\{1-\gamma,\sigma\}\varepsilon^{2}}\log\left(\frac{SAN_{\mathsf{b}}}{(1-\gamma)\delta}\right), (288)

where C3C_{3} is some large enough universal constant. Note that the above inequality directly implies N𝖻8logN𝖻δμminN_{\mathsf{b}}\geq\frac{8\log\frac{N_{\mathsf{b}}}{\delta}}{\mu_{\min}}. This completes the proof of Corollary 1. The same argument applies to Corollary 3.

F.2 Proof of the lower bounds: Corollary 2 and Corollary 4

Analogous to Appendix F.1, without loss of generality, we first focus on verifying Corollary 2, where the uncertainty set is specified via the TV distance.

We stick to the two hard instances 0\mathcal{M}_{0} and 1\mathcal{M}_{1} (i.e., ϕ\mathcal{M}_{\phi} with ϕ{0,1}\phi\in\{0,1\}) constructed in the proof for Theorem 2 (Appendix 5.3.1). Recall that the state space is defined as 𝒮={0,1,2,,S1}{\mathcal{S}}=\{0,1,2,\cdots,S-1\}, where the corresponding action space for any state s{2,3,,S1}s\in\{2,3,\cdots,S-1\} is 𝒜={0,1,2,,A1}\mathcal{A}=\{0,1,2,\cdots,A-1\}. For states s=0s=0 or s=1s=1, the action space is only 𝒜={0,1}\mathcal{A}^{\prime}=\{0,1\}. Hence, for a given factor μmin(0,1SA]\mu_{\min}\in(0,\frac{1}{SA}], we can construct a historical dataset 𝒟𝖻\mathcal{D}^{\mathsf{b}} with N𝖻N_{\mathsf{b}} samples such that the data coverage becomes the smallest over the state-action pairs (0,0)(0,0) and (0,1)(0,1), i.e.,

μ𝖻(0,0)=μ𝖻(0,1)=μminandμ𝖻(s,a)=12μmin(S2)A+2for all other (s,a) with s{1,2,,S1}.\displaystyle\mu^{\mathsf{b}}(0,0)=\mu^{\mathsf{b}}(0,1)=\mu_{\min}\quad\text{and}\quad\mu^{\mathsf{b}}(s,a)=\frac{1-2\mu_{\min}}{(S-2)A+2}\quad\text{for all other }(s,a)\text{ with }s\in\{1,2,\cdots,S-1\}. (289)
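The following sketch (ours; the values of S, A, and μ_min are illustrative) constructs such a behavior distribution and confirms that it is a valid probability distribution whose minimum is attained at the pairs (0,0) and (0,1):

def behavior_distribution(S, A, mu_min):
    # states 0 and 1 have two actions each; states 2, ..., S-1 have A actions
    mu = {(0, 0): mu_min, (0, 1): mu_min}
    n_rest = (S - 2) * A + 2                 # number of remaining state-action pairs
    for a in range(2):
        mu[(1, a)] = (1 - 2 * mu_min) / n_rest
    for s in range(2, S):
        for a in range(A):
            mu[(s, a)] = (1 - 2 * mu_min) / n_rest
    return mu

mu_b = behavior_distribution(S=6, A=3, mu_min=0.01)   # mu_min <= 1/(S*A)
assert abs(sum(mu_b.values()) - 1.0) < 1e-12
assert min(mu_b.values()) == 0.01                     # minimum attained at (0,0) and (0,1)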

Armed with the above hard instance and historical dataset, we follow the proof procedure in Appendix 5.3.2 to verify the corollary. Our goal is to distinguish between the two hypotheses ϕ{0,1}\phi\in\{0,1\} by considering the minimax probability of error as follows:

peinfψmax{0(ψ0),1(ψ1)},p_{\mathrm{e}}\coloneqq\inf_{\psi}\max\big{\{}\mathbb{P}_{0}(\psi\neq 0),\,\mathbb{P}_{1}(\psi\neq 1)\big{\}}, (290)

where the infimum is taken over all possible tests ψ\psi constructed from the samples in 𝒟𝖻\mathcal{D}^{\mathsf{b}}.

Recall that we denote by μϕ\mu_{\phi} (resp. μϕ(s)\mu_{\phi}(s)) the distribution of a sample tuple (si,ai,si)(s_{i},a_{i},s_{i}^{\prime}) under the nominal transition kernel PϕP^{\phi} associated with ϕ\mathcal{M}_{\phi}, where the samples are generated independently. Analogous to (85), one has

pe\displaystyle p_{\mathrm{e}} 14exp(N𝖻𝖪𝖫(μ0μ1))\displaystyle\geq\frac{1}{4}\exp\Big{(}-N_{\mathsf{b}}\mathsf{KL}\big{(}\mu_{0}\parallel\mu_{1}\big{)}\Big{)}
=14exp{N𝖻μmin(𝖪𝖫(P0(| 0,0)P1(| 0,0))+𝖪𝖫(P0(| 0,1)P1(| 0,1)))},\displaystyle=\frac{1}{4}\exp\Big{\{}-N_{\mathsf{b}}\mu_{\min}\Big{(}\mathsf{KL}\big{(}P^{0}(\cdot\,|\,0,0)\parallel P^{1}(\cdot\,|\,0,0)\big{)}+\mathsf{KL}\big{(}P^{0}(\cdot\,|\,0,1)\parallel P^{1}(\cdot\,|\,0,1)\big{)}\Big{)}\Big{\}}, (291)

where the last equality holds by observing that

𝖪𝖫(μ0μ1)\displaystyle\mathsf{KL}\big{(}\mu_{0}\parallel\mu_{1}\big{)} =s,aμ𝖻(s,a)𝖪𝖫(P0(|s,a)P1(|s,a))\displaystyle=\sum_{s,a}\mu^{\mathsf{b}}(s,a)\mathsf{KL}\big{(}P^{0}(\cdot\,|\,s,a)\parallel P^{1}(\cdot\,|\,s,a)\big{)}
=a{0,1}μ𝖻(0,a)𝖪𝖫(P0(| 0,a)P1(| 0,a))=μmina{0,1}𝖪𝖫(P0(| 0,a)P1(| 0,a)).\displaystyle=\sum_{a\in\{0,1\}}\mu^{\mathsf{b}}(0,a)\mathsf{KL}\big{(}P^{0}(\cdot\,|\,0,a)\parallel P^{1}(\cdot\,|\,0,a)\big{)}=\mu_{\min}\sum_{a\in\{0,1\}}\mathsf{KL}\big{(}P^{0}(\cdot\,|\,0,a)\parallel P^{1}(\cdot\,|\,0,a)\big{)}. (292)

Here, the last line holds by the fact that P0(|s,a)P^{0}(\cdot\,|\,s,a) and P1(|s,a)P^{1}(\cdot\,|\,s,a) (associated with 0\mathcal{M}_{0} and 1\mathcal{M}_{1}) only differ from each other at the state-action pairs (0,0)(0,0) and (0,1)(0,1), each of which has dataset density μmin\mu_{\min} (cf. (289)). Consequently, following the same routine from (86) to the end of Appendix 5.3.2, we apply (87) and (88) with N=N𝖻μminN=N_{\mathsf{b}}\mu_{\min} and complete the proof by showing: if the sample size obeys

N𝖻μmin=Nc1log28192(1γ)2max{1γ,σ}ε2,\displaystyle N_{\mathsf{b}}\mu_{\min}=N\leq\frac{c_{1}\log 2}{8192(1-\gamma)^{2}\max\{1-\gamma,\sigma\}\varepsilon^{2}}, (293)

then one necessarily has

pe=infπ^max{0(V,σ(φ)Vπ^,σ(φ)>ε),1(V,σ(φ)Vπ^,σ(φ)>ε)}18.\displaystyle p_{e}=\inf_{\widehat{\pi}}\max\left\{\mathbb{P}_{0}\big{(}V^{\star,\sigma}(\varphi)-V^{\widehat{\pi},\sigma}(\varphi)>\varepsilon\big{)},\,\mathbb{P}_{1}\big{(}V^{\star,\sigma}(\varphi)-V^{\widehat{\pi},\sigma}(\varphi)>\varepsilon\big{)}\right\}\geq\frac{1}{8}. (294)

We can follow the same argument to complete the proof of Corollary 4.