The Curious Price of Distributional Robustness
in Reinforcement Learning with a Generative Model
Abstract
This paper investigates model robustness in reinforcement learning (RL) to reduce the sim-to-real gap in practice. We adopt the framework of distributionally robust Markov decision processes (RMDPs), aimed at learning a policy that optimizes the worst-case performance when the deployed environment falls within a prescribed uncertainty set around the nominal MDP. Despite recent efforts, the sample complexity of RMDPs remained mostly unsettled regardless of the uncertainty set in use. It was unclear if distributional robustness bears any statistical consequences when benchmarked against standard RL. Assuming access to a generative model that draws samples based on the nominal MDP, we characterize the sample complexity of RMDPs when the uncertainty set is specified via either the total variation (TV) distance or the χ² divergence. The algorithm studied here is a model-based method called distributionally robust value iteration, which is shown to be near-optimal for the full range of uncertainty levels. Somewhat surprisingly, our results uncover that RMDPs are not necessarily easier or harder to learn than standard MDPs. The statistical consequence incurred by the robustness requirement depends heavily on the size and shape of the uncertainty set: in the case w.r.t. the TV distance, the minimax sample complexity of RMDPs is always smaller than that of standard MDPs; in the case w.r.t. the χ² divergence, the sample complexity of RMDPs can often far exceed the standard MDP counterpart.
Keywords: distributionally robust RL, robust Markov decision processes, sample complexity, distributionally robust value iteration, model-based RL
1 Introduction
Reinforcement learning (RL) strives to learn desirable sequential decisions based on trial-and-error interactions with an unknown environment. As a fast-growing subfield of artificial intelligence, it has achieved remarkable success in a variety of applications, such as networked systems (Qu et al.,, 2022), trading (Park and Van Roy,, 2015), operations research (de Castro Silva et al.,, 2003; Pan et al.,, 2023; Zhao et al.,, 2021), large language model alignment (OpenAI,, 2023; Ziegler et al.,, 2019), healthcare (Liu et al.,, 2019; Fatemi et al.,, 2021), robotics and control (Kober et al.,, 2013; Mnih et al.,, 2013). Due to the unprecedented dimensionality of the state-action space, the issue of data efficiency inevitably lies at the core of modern RL practice. A large portion of recent efforts in RL has been directed towards designing sample-efficient algorithms and understanding the fundamental statistical bottleneck for a diverse range of RL scenarios.
While standard RL has been heavily investigated recently, its use can be significantly hampered in practice due to the sim-to-real gap or uncertainty (Bertsimas et al.,, 2019); for instance, a policy learned in an ideal, nominal environment might fail catastrophically when the deployed environment is subject to small changes in task objectives or adversarial perturbations (Zhang et al., 2020a, ; Klopp et al.,, 2017; Mahmood et al.,, 2018). Consequently, in addition to maximizing the long-term cumulative reward, robustness emerges as another critical goal for RL, especially in high-stakes applications such as robotics, autonomous driving, clinical trials, financial investments, and so on. Towards achieving this, distributionally robust RL (Iyengar,, 2005; Nilim and El Ghaoui,, 2005; Xu and Mannor,, 2012; Bäuerle and Glauner,, 2022; Cai et al.,, 2016), which leverages insights from distributionally robust optimization and supervised learning (Rahimian and Mehrotra,, 2019; Gao,, 2020; Bertsimas et al.,, 2018; Duchi and Namkoong,, 2018; Blanchet and Murthy,, 2019; Chen et al.,, 2019; Lam,, 2019), becomes a natural yet versatile framework; the aim is to learn a policy that performs well even when the deployed environment deviates from the nominal one in the face of environment uncertainty.
In this paper, we pursue a fundamental understanding of whether, and how, the choice of distributional robustness bears statistical implications in learning a desirable policy, through the lens of sample complexity. More concretely, imagine that one has access to a generative model (also called a simulator) that draws samples from a Markov decision process (MDP) with a nominal transition kernel (Kearns and Singh, 1999). Standard RL aims to learn the optimal policy tailored to the nominal kernel, for which the minimax sample complexity limit has been fully settled (Azar et al., 2013b; Li et al., 2023b). In contrast, distributionally robust RL seeks to learn a more robust policy using the same set of samples, with the aim of optimizing the worst-case performance when the transition kernel is arbitrarily chosen from some prescribed uncertainty set around the nominal kernel; this setting is frequently referred to as robust MDPs (RMDPs). (While it is straightforward to incorporate additional uncertainty of the reward in our framework, we do not consider it here for simplicity, since the key challenge is to deal with the uncertainty of the transition kernel.) Clearly, the RMDP framework helps ensure that the performance of the learned policy does not fail catastrophically as long as the sim-to-real gap is not overly large. It is then natural to wonder how the robustness consideration impacts data efficiency: is there a statistical premium that one needs to pay in quest of additional robustness?
Compared with standard MDPs, the class of RMDPs encapsulates richer models, given that one is allowed to prescribe the shape and size of the uncertainty set. Oftentimes, the uncertainty set is hand-picked as a small ball surrounding the nominal kernel, with the size and shape of the ball specified by some distance-like metric $\rho$ between probability distributions and some uncertainty level $\sigma$. To ensure tractability of solving RMDPs, the uncertainty set is often selected to obey certain structures. For instance, a number of prior works assumed that the uncertainty set can be decomposed as a product of independent uncertainty subsets over each state or state-action pair (Zhou et al., 2021; Wiesemann et al., 2013), dubbed the s- and (s,a)-rectangularity, respectively. The current paper adopts the second choice by assuming (s,a)-rectangularity for the uncertainty set. An additional challenge with RMDPs arises from distribution shift, where the transition kernel drawn from the uncertainty set can be different from the nominal kernel. This challenge leads to complicated nonlinearity and nested optimization in the problem structure not present in standard MDPs.
1.1 Prior art and open questions
Table 1: Comparison of the sample complexity upper and lower bounds for learning RMDPs when the uncertainty set is measured by the TV distance. The upper bounds are from Yang et al. (2022), Panaganti and Kalathil (2022), and this paper; the lower bounds are from Yang et al. (2022) and this paper (cf. Theorems 1 and 2).
Table 2: Comparison of the sample complexity upper and lower bounds for learning RMDPs when the uncertainty set is measured by the χ² divergence. The upper bounds are from Panaganti and Kalathil (2022), Yang et al. (2022), and this paper; the lower bounds are from Yang et al. (2022) and this paper (cf. Theorems 3 and 4).
Figure 1: Illustration of the sample complexity bounds for learning RMDPs as the uncertainty level σ varies: (a) the TV distance case; (b) the χ² divergence case.
In this paper, we focus attention on RMDPs in the γ-discounted infinite-horizon setting, assuming access to a generative model. The uncertainty set considered herein is specified using one of two f-divergence metrics: the total variation (TV) distance and the χ² divergence. These two choices are motivated by their practical appeal: they are easy to implement and have already been adopted in empirical RL (Lee et al., 2021; Pan et al., 2023).
A popular learning approach is model-based: it first estimates the nominal transition kernel using a plug-in estimator based on the collected samples, and then runs a planning algorithm (e.g., a robust variant of value iteration) on top of the estimated kernel. Despite the surge of recent activity, however, existing statistical guarantees for the above paradigm remain highly inadequate, as we shall elaborate on momentarily (see Table 1 and Table 2 respectively for a summary of existing results). For concreteness, let $S$ be the size of the state space, $A$ the size of the action space, $\gamma$ the discount factor (so that the effective horizon is $\frac{1}{1-\gamma}$), and $\sigma$ the uncertainty level. We are interested in how the sample complexity — the number of samples needed for an algorithm to output a policy whose robust value function (the worst-case value over all the transition kernels in the uncertainty set) is at most $\varepsilon$ away from the optimal robust one — scales with all these salient problem parameters.
- Large gaps between existing upper and lower bounds. There remain large gaps between the sample complexity upper and lower bounds established in the prior literature, regardless of the divergence metric in use. Specifically, for both the TV distance and the χ² divergence, the state-of-the-art upper bounds (Panaganti and Kalathil, 2022) scale quadratically with the size $S$ of the state space, while the lower bound (Yang et al., 2022) exhibits only linear scaling with $S$. Moreover, in the χ² divergence case, the state-of-the-art upper bound grows linearly with the uncertainty level $\sigma$ when $\sigma \gtrsim 1$, while the lower bound (Yang et al., 2022) is inversely proportional to $\sigma$. (Here, for two functions $f$ and $g$, the notation $f(n) = O(g(n))$ or $f(n) \lesssim g(n)$ indicates that there exists a universal constant $C > 0$ such that $f(n) \le C g(n)$; the notation $f(n) \gtrsim g(n)$ indicates that $g(n) \lesssim f(n)$; and $f(n) \asymp g(n)$ indicates that $f(n) \lesssim g(n)$ and $f(n) \gtrsim g(n)$ hold simultaneously. Additionally, $\widetilde{O}(\cdot)$ is defined in the same way as $O(\cdot)$ except that it hides logarithmic factors.) These lead to unbounded gaps between the upper and lower bounds as $\sigma$ grows. Can we hope to close these gaps for RMDPs?
- Benchmarking with standard MDPs. Perhaps a more pressing issue is that past works failed to provide an affirmative answer as to how the sample complexity of RMDPs compares with that of standard MDPs, regardless of the chosen shape (determined by the divergence metric $\rho$) or size (determined by $\sigma$) of the uncertainty set, given the large unresolved gaps mentioned above. Specifically, existing sample complexity upper (resp. lower) bounds are all larger (resp. smaller) than the sample size requirement for standard MDPs. As a consequence, it remains mostly unclear whether learning RMDPs is harder or easier than learning standard MDPs.
1.2 Main contributions
To address the aforementioned questions, this paper develops strengthened sample complexity upper bounds on learning RMDPs with the TV distance and the χ² divergence in the infinite-horizon setting, using a model-based approach called distributionally robust value iteration (DRVI). Improved minimax lower bounds are also developed to help gauge the tightness of our upper bounds and enable benchmarking with standard MDPs. The novel analysis framework developed herein leads to new insights into the interplay between the geometry of uncertainty sets and statistical hardness.
Sample complexity of RMDPs under the TV distance.
We summarize our results and compare them with past works in Table 1; see Figure 1(a) for a graphical illustration.
- Minimax-optimal sample complexity. We prove that DRVI reaches $\varepsilon$-accuracy as soon as the sample complexity is on the order of

  $\widetilde{O}\left(\dfrac{SA}{(1-\gamma)^2 \max\{1-\gamma, \sigma\}\, \varepsilon^2}\right)$

  for all $\sigma \in [0, 1)$, assuming that $\varepsilon$ is small enough. In addition, a matching minimax lower bound (modulo some logarithmic factor) is established to guarantee the tightness of the upper bound. To the best of our knowledge, this is the first minimax-optimal sample complexity for RMDPs — previously unavailable regardless of the divergence metric in use — and it holds over the full range of the uncertainty level.
- RMDPs are easier to learn than standard MDPs under the TV distance. Given that the minimax sample complexity of standard MDPs is $\widetilde{O}\big(\frac{SA}{(1-\gamma)^3 \varepsilon^2}\big)$ (Li et al., 2023b), it can be seen that learning RMDPs under the TV distance is never harder than learning standard MDPs; more concretely, the sample complexity for RMDPs matches that of standard MDPs when $\sigma \lesssim 1-\gamma$, and becomes smaller by a factor of $\sigma/(1-\gamma)$ when $\sigma \gtrsim 1-\gamma$. Therefore, in this case, distributional robustness comes almost for free, given that we do not need to collect more samples.
Sample complexity of RMDPs under the χ² divergence.
We summarize our results and provide comparisons with prior works in Table 2; see Figure 1(b) for an illustration.
- Near-optimal sample complexity. We demonstrate that DRVI yields $\varepsilon$-accuracy as soon as the sample complexity is on the order of

  $\widetilde{O}\left(\dfrac{SA(1+\sigma)}{(1-\gamma)^4\, \varepsilon^2}\right)$

  for all $\sigma > 0$, which is the first sample complexity in this setting that scales linearly in the size $S$ of the state space; in other words, our theory breaks the quadratic scaling bottleneck present in prior works (Panaganti and Kalathil, 2022; Yang et al., 2022). We have also developed a strengthened lower bound that is optimized by leveraging the geometry of the uncertainty set under different ranges of $\sigma$. Our theory is tight when $\sigma \lesssim 1$, and is otherwise loose by at most a polynomial factor of the effective horizon $\frac{1}{1-\gamma}$ (regardless of the uncertainty level $\sigma$). This significantly improves upon prior results (as there exists an unbounded gap between prior upper and lower bounds as $\sigma$ grows).
- RMDPs can be harder to learn than standard MDPs under the χ² divergence. Somewhat surprisingly, our improved lower bound suggests that RMDPs in this case can be much harder to learn than standard MDPs, at least for a certain range of uncertainty levels. We single out two regimes of particular interest. Firstly, when $\sigma \asymp 1$, the sample size requirement of RMDPs is on the order of $\frac{SA}{(1-\gamma)^4 \varepsilon^2}$ (up to log factors), which is provably larger than the one for standard MDPs by a factor of $\frac{1}{1-\gamma}$. Secondly, the lower bound continues to increase as $\sigma$ grows and exceeds the sample complexity of standard MDPs whenever $\sigma \gtrsim 1-\gamma$.
In sum, our sample complexity bounds not only strengthen the prior art in the development of both upper and lower bounds, but also unveil that the additional robustness consideration might affect the sample complexity in a somewhat surprising manner. As it turns out, RMDPs are not necessarily harder or easier to learn than standard MDPs; the conclusion is far more nuanced and highly dependent on both the size and shape of the uncertainty set. This constitutes a curious phenomenon that has not been elucidated in prior analyses.
Technical novelty.
Our upper bound analyses require careful treatment of the impact of the uncertainty set upon the value functions, and decouple the statistical dependency across the iterates of the robust value iteration using tailored leave-one-out arguments (Agarwal et al., 2020; Li et al., 2022b) that have not been introduced to the RMDP setting previously. Turning to the lower bound, we develop new hard instances that differ from those for standard MDPs (Azar et al., 2013a; Li et al., 2024). These new instances draw inspiration from the asymmetric structure of RMDPs induced by the additional infimum operator in the robust value function. In addition, we construct a series of hard instances, depending on the uncertainty level, to establish tight lower bounds as $\sigma$ varies.
Extension: offline RL with uniform coverage.
Last but not least, we extend our analysis framework to accommodate a widely studied offline setting with uniform data coverage (Zhou et al., 2021; Yang et al., 2022) in Section 6. In particular, given a historical dataset with minimal coverage probability $\mu_{\min}$ over the state-action space (see Assumption 1), we provide sample complexity results for both the TV distance and the χ² divergence cases, where in effect the dependency on the size $SA$ of the state-action space is replaced by $\frac{1}{\mu_{\min}}$. The resulting sample complexity upper bounds significantly improve upon the prior art (Yang et al., 2022) in both cases.
Notation and paper organization.
Throughout this paper, we denote by $\Delta(\mathcal{S})$ the probability simplex over a set $\mathcal{S}$, and write $V = [V(s,a)]_{(s,a)\in\mathcal{S}\times\mathcal{A}}$ (resp. $V = [V(s)]_{s\in\mathcal{S}}$) for any vector that assigns a value to each state-action pair (resp. state). In addition, we denote by $V_1 \circ V_2$ the Hadamard product of any two vectors $V_1, V_2$ of the same dimension.
The remainder of this paper is structured as follows. Section 2 presents the background on discounted infinite-horizon standard MDPs and formulates distributionally robust MDPs. In Section 3, a model-based approach is introduced, tailored to both the TV distance and the χ² divergence. Both upper and lower bounds on the sample complexity are developed in Section 4, covering both divergence metrics. Section 5 provides an outline of our analysis. Section 6 further extends the findings to the offline RL setting with uniform data coverage. We then summarize several additional related works in Section 7 and conclude the main paper with further discussions in Section 8. The proof details are deferred to the appendix.
2 Problem formulation
In this section, we formulate distributionally robust Markov decision processes (RMDPs) in the discounted infinite-horizon setting, introduce the sampling mechanism, and describe our goal.
Standard MDPs.
To begin, we first introduce the standard Markov decision process (MDP), which facilitates the understanding of RMDPs. A discounted infinite-horizon MDP is represented by $\mathcal{M} = \{\mathcal{S}, \mathcal{A}, \gamma, P, r\}$, where $\mathcal{S}$ and $\mathcal{A}$ are the finite state and action spaces, respectively, $\gamma \in [0,1)$ is the discount factor, $P: \mathcal{S}\times\mathcal{A} \to \Delta(\mathcal{S})$ denotes the probability transition kernel, and $r: \mathcal{S}\times\mathcal{A} \to [0,1]$ is the immediate reward function, which is assumed to be deterministic. A policy is denoted by $\pi: \mathcal{S} \to \Delta(\mathcal{A})$, which specifies the action selection probability over the action space in any state. When the policy is deterministic, we overload the notation and refer to $\pi(s)$ as the action selected by policy $\pi$ in state $s$. To characterize the cumulative reward, the value function $V^{\pi, P}$ of any policy $\pi$ under the transition kernel $P$ is defined by

$V^{\pi, P}(s) := \mathbb{E}_{\pi, P}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\Big|\, s_0 = s\Big], \qquad \forall s \in \mathcal{S},$ (1)
where the expectation is taken over the randomness of the trajectory $\{(s_t, a_t)\}_{t \ge 0}$ generated by executing policy $\pi$ under the transition kernel $P$, namely, $a_t \sim \pi(\cdot \,|\, s_t)$ and $s_{t+1} \sim P(\cdot \,|\, s_t, a_t)$ for all $t \ge 0$. Similarly, the Q-function $Q^{\pi, P}$ associated with any policy $\pi$ under the transition kernel $P$ is defined as

$Q^{\pi, P}(s, a) := r(s, a) + \mathbb{E}_{\pi, P}\Big[\sum_{t=1}^{\infty} \gamma^t r(s_t, a_t) \,\Big|\, s_0 = s, a_0 = a\Big], \qquad \forall (s,a) \in \mathcal{S}\times\mathcal{A},$ (2)
where the expectation is again taken over the randomness of the trajectory under policy .
Distributionally robust MDPs.
We now introduce the distributionally robust MDP (RMDP) tailored to the discounted infinite-horizon setting, denoted by $\mathcal{M}_{\mathsf{rob}} = \{\mathcal{S}, \mathcal{A}, \mathcal{U}^{\sigma}_{\rho}(P^0), \gamma, r\}$, where $\mathcal{S}, \mathcal{A}, \gamma, r$ are identical to those in the standard MDP. A key distinction from the standard MDP is that, rather than assuming a fixed transition kernel $P$, it allows the transition kernel to be chosen arbitrarily from a prescribed uncertainty set $\mathcal{U}^{\sigma}_{\rho}(P^0)$ centered around a nominal kernel $P^0: \mathcal{S}\times\mathcal{A} \to \Delta(\mathcal{S})$, where the uncertainty set is specified using some distance metric $\rho$ of radius $\sigma$. In particular, given the nominal transition kernel $P^0$ and some uncertainty level $\sigma$, the uncertainty set — with the divergence metric $\rho: \Delta(\mathcal{S}) \times \Delta(\mathcal{S}) \to \mathbb{R}^{+}$ — is specified as

$\mathcal{U}^{\sigma}_{\rho}(P^0) := \otimes_{(s,a)}\, \mathcal{U}^{\sigma}_{\rho}(P^0_{s,a}), \qquad \mathcal{U}^{\sigma}_{\rho}(P^0_{s,a}) := \big\{ P_{s,a} \in \Delta(\mathcal{S}): \rho\big(P_{s,a}, P^0_{s,a}\big) \le \sigma \big\},$ (3)

where $\otimes$ denotes the Cartesian product, and we denote the vector of the transition kernel $P$ or $P^0$ at state-action pair $(s,a)$ respectively as

$P_{s,a} := P(\cdot \,|\, s, a) \in \Delta(\mathcal{S}), \qquad P^0_{s,a} := P^0(\cdot \,|\, s, a) \in \Delta(\mathcal{S}).$ (4)
In other words, the uncertainty is imposed in a decoupled manner for each state-action pair, obeying the so-called -rectangularity (Zhou et al.,, 2021; Wiesemann et al.,, 2013).
In RMDPs, we are interested in the worst-case performance of a policy $\pi$ over all the possible transition kernels in the uncertainty set. This is measured by the robust value function $V^{\pi, \sigma}$ and the robust Q-function $Q^{\pi, \sigma}$ in $\mathcal{M}_{\mathsf{rob}}$, defined respectively as

$V^{\pi, \sigma}(s) := \inf_{P \in \mathcal{U}^{\sigma}_{\rho}(P^0)} V^{\pi, P}(s), \qquad Q^{\pi, \sigma}(s, a) := \inf_{P \in \mathcal{U}^{\sigma}_{\rho}(P^0)} Q^{\pi, P}(s, a).$ (5)
Optimal robust policy and robust Bellman operator.
As a generalization of properties of standard MDPs, it is well-known that there exists at least one deterministic policy that maximizes the robust value function (resp. robust Q-function) simultaneously for all states (resp. state-action pairs) (Iyengar, 2005; Nilim and El Ghaoui, 2005). Therefore, we denote the optimal robust value function (resp. optimal robust Q-function) as $V^{\star, \sigma}$ (resp. $Q^{\star, \sigma}$), and the optimal robust policy as $\pi^{\star}$, which satisfy

$V^{\star, \sigma} := V^{\pi^{\star}, \sigma} = \max_{\pi} V^{\pi, \sigma},$ (6a)
$Q^{\star, \sigma} := Q^{\pi^{\star}, \sigma} = \max_{\pi} Q^{\pi, \sigma}.$ (6b)
A key machinery in RMDPs is a generalization of Bellman's optimality principle, encapsulated in the following robust Bellman consistency equation (resp. robust Bellman optimality equation): for all $(s,a) \in \mathcal{S}\times\mathcal{A}$,

$Q^{\pi, \sigma}(s, a) = r(s, a) + \gamma \inf_{\mathcal{P} \in \mathcal{U}^{\sigma}_{\rho}(P^0_{s,a})} \mathcal{P}\, V^{\pi, \sigma},$ (7a)
$Q^{\star, \sigma}(s, a) = r(s, a) + \gamma \inf_{\mathcal{P} \in \mathcal{U}^{\sigma}_{\rho}(P^0_{s,a})} \mathcal{P}\, V^{\star, \sigma}.$ (7b)
The robust Bellman operator (Iyengar, 2005; Nilim and El Ghaoui, 2005) is denoted by $\mathcal{T}^{\sigma}(\cdot): \mathbb{R}^{SA} \to \mathbb{R}^{SA}$ and defined as follows:

$\forall (s,a) \in \mathcal{S}\times\mathcal{A}: \quad \mathcal{T}^{\sigma}(Q)(s, a) := r(s, a) + \gamma \inf_{\mathcal{P} \in \mathcal{U}^{\sigma}_{\rho}(P^0_{s,a})} \mathcal{P}\, V, \quad \text{with } V(s) := \max_{a} Q(s, a).$ (8)
Given that $Q^{\star, \sigma}$ is the unique fixed point of $\mathcal{T}^{\sigma}$, one can recover the optimal robust value function and Q-function using a procedure termed distributionally robust value iteration (DRVI). Generalizing standard value iteration, DRVI starts from some given initialization and recursively applies the robust Bellman operator until convergence. As has been shown previously, this procedure converges rapidly thanks to the $\gamma$-contraction property of $\mathcal{T}^{\sigma}$ w.r.t. the $\ell_{\infty}$ norm (Iyengar, 2005; Nilim and El Ghaoui, 2005).
Specification of the divergence metric ρ.
We consider two popular choices of the uncertainty set, measured in terms of two different f-divergence metrics: the total variation distance and the χ² divergence, given respectively by (Tsybakov, 2009)

$\rho_{\mathsf{TV}}\big(P_{s,a}, P^0_{s,a}\big) := \frac{1}{2} \big\| P_{s,a} - P^0_{s,a} \big\|_1 = \frac{1}{2} \sum_{s' \in \mathcal{S}} \big| P(s' \,|\, s, a) - P^0(s' \,|\, s, a) \big|,$ (9)
$\rho_{\chi^2}\big(P_{s,a}, P^0_{s,a}\big) := \sum_{s' \in \mathcal{S}} \frac{\big( P(s' \,|\, s, a) - P^0(s' \,|\, s, a) \big)^2}{P^0(s' \,|\, s, a)}.$ (10)
Note that $\rho_{\mathsf{TV}} \in [0, 1]$ while $\rho_{\chi^2} \in [0, \infty)$ in general. As we shall see shortly, these two choices of divergence metrics result in drastically different messages when it comes to sample complexities.
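To make the two divergence metrics concrete, here is a minimal numerical sketch of (9) and (10) for discrete distributions; the function names are our own illustrative choices, not from any referenced codebase.

```python
import numpy as np

def tv_distance(p, q):
    """Total variation distance (9): half the l1 distance between p and q."""
    return 0.5 * np.abs(p - q).sum()

def chi2_divergence(p, q):
    """Chi-squared divergence (10) of p from q; assumes q(s) > 0 wherever p(s) > 0."""
    mask = q > 0
    return ((p[mask] - q[mask]) ** 2 / q[mask]).sum()

# A small sanity check on two distributions over 3 states.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(tv_distance(p, q))      # 0.1, always within [0, 1]
print(chi2_divergence(p, q))  # 0.05, unbounded in general
```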
Sampling mechanism: a generative model.
Following Zhou et al. (2021); Panaganti and Kalathil (2022), we assume access to a generative model or a simulator (Kearns and Singh, 1999), which allows us to collect $N$ independent samples for each state-action pair, generated based on the nominal kernel $P^0$:

$s'_{i, s, a} \overset{\mathrm{i.i.d.}}{\sim} P^0(\cdot \,|\, s, a), \qquad i = 1, 2, \ldots, N, \quad \text{for all } (s, a) \in \mathcal{S}\times\mathcal{A}.$ (11)

The total sample size is, therefore, $N_{\mathsf{all}} = N S A$.
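As an illustration of this sampling mechanism, the following sketch draws $N$ independent next states per state-action pair from a nominal kernel; the variable names and the randomly generated kernel are our own assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, N = 4, 2, 100                       # state/action space sizes and samples per (s, a)

# A random nominal kernel P0 whose rows P0[s, a] lie in the probability simplex.
P0 = rng.dirichlet(np.ones(S), size=(S, A))

# Draw N i.i.d. next states per (s, a) under P0, as in (11); the total size is N * S * A.
samples = np.empty((S, A, N), dtype=int)
for s in range(S):
    for a in range(A):
        samples[s, a] = rng.choice(S, size=N, p=P0[s, a])
```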
Goal.
Given the collected samples, the task is to learn the robust optimal policy for the RMDP — w.r.t. some prescribed uncertainty set $\mathcal{U}^{\sigma}(P^0)$ around the nominal kernel — using as few samples as possible. Specifically, given some target accuracy level $\varepsilon > 0$, the goal is to seek an $\varepsilon$-optimal robust policy $\widehat{\pi}$ obeying

$\forall s \in \mathcal{S}: \quad V^{\star, \sigma}(s) - V^{\widehat{\pi}, \sigma}(s) \le \varepsilon.$ (12)
3 Model-based algorithm: distributionally robust value iteration
We consider a model-based approach tailored to RMDPs, which first constructs an empirical nominal transition kernel based on the collected samples, and then applies distributionally robust value iteration (DRVI) to compute an optimal robust policy.
Empirical nominal kernel.
The empirical nominal transition kernel $\widehat{P}^0$ can be constructed on the basis of the empirical frequency of state transitions, i.e.,

$\forall (s, a) \in \mathcal{S}\times\mathcal{A}: \quad \widehat{P}^0(s' \,|\, s, a) := \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\big\{ s'_{i, s, a} = s' \big\},$ (13)
which leads to an empirical RMDP $\widehat{\mathcal{M}}_{\mathsf{rob}} = \{\mathcal{S}, \mathcal{A}, \mathcal{U}^{\sigma}_{\rho}(\widehat{P}^0), \gamma, r\}$. Analogously, we can define the corresponding robust value function (resp. robust Q-function) of policy $\pi$ in $\widehat{\mathcal{M}}_{\mathsf{rob}}$ as $\widehat{V}^{\pi, \sigma}$ (resp. $\widehat{Q}^{\pi, \sigma}$) (cf. (6)). In addition, we denote the corresponding optimal robust policy as $\widehat{\pi}^{\star}$ and the optimal robust value function (resp. optimal robust Q-function) as $\widehat{V}^{\star, \sigma}$ (resp. $\widehat{Q}^{\star, \sigma}$) (cf. (7)), which satisfies the robust Bellman optimality equation:

$\forall (s,a) \in \mathcal{S}\times\mathcal{A}: \quad \widehat{Q}^{\star, \sigma}(s, a) = r(s, a) + \gamma \inf_{\mathcal{P} \in \mathcal{U}^{\sigma}_{\rho}(\widehat{P}^0_{s,a})} \mathcal{P}\, \widehat{V}^{\star, \sigma}.$ (14)
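A minimal sketch of the plug-in estimator (13), assuming the samples are stored as an (S, A, N) integer array of next states (as in the earlier sampling sketch); the helper name is hypothetical.

```python
import numpy as np

def empirical_kernel(samples, num_states):
    """Empirical frequency estimate (13): P_hat[s, a, s'] = (# of draws equal to s') / N."""
    S, A, N = samples.shape
    P_hat = np.zeros((S, A, num_states))
    for s in range(S):
        for a in range(A):
            P_hat[s, a] = np.bincount(samples[s, a], minlength=num_states) / N
    return P_hat

# Example: 3 states, 2 actions, 50 draws per pair from a uniform nominal kernel.
rng = np.random.default_rng(1)
draws = rng.integers(0, 3, size=(3, 2, 50))
P_hat = empirical_kernel(draws, num_states=3)
assert np.allclose(P_hat.sum(axis=-1), 1.0)   # each estimated row is a distribution
```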
Equipped with $\widehat{P}^0$, we can define the empirical robust Bellman operator $\widehat{\mathcal{T}}^{\sigma}(\cdot)$ as

$\forall (s,a) \in \mathcal{S}\times\mathcal{A}: \quad \widehat{\mathcal{T}}^{\sigma}(Q)(s, a) := r(s, a) + \gamma \inf_{\mathcal{P} \in \mathcal{U}^{\sigma}_{\rho}(\widehat{P}^0_{s,a})} \mathcal{P}\, V, \quad \text{with } V(s) := \max_{a} Q(s, a).$ (15)
DRVI: distributionally robust value iteration.
To compute the fixed point of $\widehat{\mathcal{T}}^{\sigma}$, we introduce distributionally robust value iteration (DRVI), which is summarized in Algorithm 1. From an initialization $Q_0$, the update rule at the $t$-th ($t \ge 1$) iteration can be formulated as:

$\forall (s,a) \in \mathcal{S}\times\mathcal{A}: \quad Q_t(s, a) = r(s, a) + \gamma \inf_{\mathcal{P} \in \mathcal{U}^{\sigma}_{\rho}(\widehat{P}^0_{s,a})} \mathcal{P}\, V_{t-1},$ (16)

where $V_{t-1}(s) := \max_{a} Q_{t-1}(s, a)$ for all $s \in \mathcal{S}$. However, directly solving (16) is computationally expensive since it involves optimization over an $S$-dimensional probability simplex at each iteration, especially when the dimension of the state space is large. Fortunately, in view of strong duality (Iyengar, 2005), (16) can be equivalently solved using its dual problem, which concerns optimizing a scalar dual variable and thus can be solved efficiently. In what follows, we shall illustrate this for the two choices of the divergence of interest (cf. (9) and (10)). Before continuing, for any vector $V \in \mathbb{R}^{S}$, we denote by $[V]_{\alpha}$ its version clipped by some non-negative value $\alpha$, namely,

$[V]_{\alpha}(s) := \begin{cases} \alpha, & \text{if } V(s) > \alpha, \\ V(s), & \text{otherwise}. \end{cases}$ (17)
- TV distance, where the uncertainty set is $\mathcal{U}^{\sigma}_{\mathsf{TV}}(\widehat{P}^0)$ w.r.t. the TV distance defined in (9). In particular, we have the following lemma due to strong duality, which is a direct consequence of Iyengar (2005, Lemma 4.3). (A direct numerical evaluation of this inner infimum is also sketched right after this list.)

  Lemma 1 (Strong duality for TV). Consider any probability vector $P \in \Delta(\mathcal{S})$, any fixed uncertainty level $\sigma \in [0, 1)$ and the uncertainty set $\mathcal{U}^{\sigma}_{\mathsf{TV}}(P)$. For any vector $V \in \mathbb{R}^{S}$ obeying $V \ge 0$, recalling the definition of $[V]_{\alpha}$ in (17), one has

  $\inf_{\mathcal{P} \in \mathcal{U}^{\sigma}_{\mathsf{TV}}(P)} \mathcal{P}\, V = \max_{\alpha \in [\min_s V(s),\, \max_s V(s)]} \Big\{ P\, [V]_{\alpha} - \sigma\big( \alpha - \min_{s'} [V]_{\alpha}(s') \big) \Big\}.$ (18)

  In view of the above lemma, the following dual update rule is equivalent to (16) in DRVI: for all $(s,a) \in \mathcal{S}\times\mathcal{A}$,

  $Q_t(s, a) = r(s, a) + \gamma \max_{\alpha \in [\min_s V_{t-1}(s),\, \max_s V_{t-1}(s)]} \Big\{ \widehat{P}^0_{s,a} [V_{t-1}]_{\alpha} - \sigma\big( \alpha - \min_{s'} [V_{t-1}]_{\alpha}(s') \big) \Big\}.$ (19)
- χ² divergence, where the uncertainty set is $\mathcal{U}^{\sigma}_{\chi^2}(\widehat{P}^0)$ w.r.t. the χ² divergence defined in (10). We introduce the following lemma, which directly follows from Iyengar (2005, Lemma 4.2).

  Lemma 2 (Strong duality for χ²). Consider any probability vector $P \in \Delta(\mathcal{S})$, any fixed uncertainty level $\sigma \ge 0$ and the uncertainty set $\mathcal{U}^{\sigma}_{\chi^2}(P)$. For any vector $V \in \mathbb{R}^{S}$ obeying $V \ge 0$, one has

  $\inf_{\mathcal{P} \in \mathcal{U}^{\sigma}_{\chi^2}(P)} \mathcal{P}\, V = \max_{\alpha \in [\min_s V(s),\, \max_s V(s)]} \Big\{ P\, [V]_{\alpha} - \sqrt{\sigma\, \mathsf{Var}_{P}\big([V]_{\alpha}\big)} \Big\},$ (20)

  where $\mathsf{Var}_P(\cdot)$ is the variance defined in (40).

  In view of the above lemma, the update rule (16) in DRVI can be equivalently written as: for all $(s,a) \in \mathcal{S}\times\mathcal{A}$,

  $Q_t(s, a) = r(s, a) + \gamma \max_{\alpha \in [\min_s V_{t-1}(s),\, \max_s V_{t-1}(s)]} \Big\{ \widehat{P}^0_{s,a} [V_{t-1}]_{\alpha} - \sqrt{\sigma\, \mathsf{Var}_{\widehat{P}^0_{s,a}}\big([V_{t-1}]_{\alpha}\big)} \Big\}.$ (21)
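For intuition, the inner infimum over the TV ball in (16) can also be evaluated directly, without the scalar dual: the minimizing kernel moves up to σ units of probability mass from the highest-value states onto the lowest-value state. Below is a minimal sketch of this greedy evaluation (our own illustrative code, not the paper's implementation); it computes the same quantity that (18) characterizes through duality.

```python
import numpy as np

def worst_case_value_tv(p0, v, sigma):
    """min over the TV ball of radius sigma around p0 of <P, v>.

    The minimizer removes up to `sigma` total mass from the states with the
    largest values of v and places it all on the state with the smallest value.
    """
    p = np.asarray(p0, dtype=float).copy()
    v = np.asarray(v, dtype=float)
    s_min = int(np.argmin(v))
    budget = min(sigma, 1.0 - p[s_min])          # mass allowed to be moved
    for s in np.argsort(v)[::-1]:                # visit highest-value states first
        if budget <= 0 or s == s_min:
            continue
        moved = min(p[s], budget)
        p[s] -= moved
        p[s_min] += moved
        budget -= moved
    return float(p @ v), p

# Example: nominal row, a value vector, and radius sigma = 0.2.
val, p_worst = worst_case_value_tv([0.25, 0.25, 0.25, 0.25], [1.0, 0.5, 0.2, 0.0], 0.2)
print(val)    # 0.225, strictly below the nominal expectation of 0.425
```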
The proofs of Lemma 1 and Lemma 2 are provided in Appendix A. To complete the description, we output the greedy policy w.r.t. the final Q-estimate $\widehat{Q}$ as the final policy $\widehat{\pi}$, namely,

$\forall s \in \mathcal{S}: \quad \widehat{\pi}(s) = \arg\max_{a} \widehat{Q}(s, a).$ (22)
Encouragingly, the iterates $\{Q_t\}$ of DRVI converge linearly to the fixed point $\widehat{Q}^{\star, \sigma}$, owing to the appealing $\gamma$-contraction property of $\widehat{\mathcal{T}}^{\sigma}$.
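Putting the pieces together, here is a compact sketch of DRVI for the TV case; this is an illustration only — the variable names and the fixed iteration count are our own choices, and the inner infimum reuses the greedy evaluation idea sketched above rather than the scalar dual (19).

```python
import numpy as np

def tv_worst_case(p0, v, sigma):
    """Exact inner infimum of <P, v> over the TV ball of radius sigma around p0."""
    p, s_min = p0.copy(), int(np.argmin(v))
    budget = min(sigma, 1.0 - p[s_min])
    for s in np.argsort(v)[::-1]:
        if budget <= 0 or s == s_min:
            continue
        moved = min(p[s], budget)
        p[s] -= moved
        p[s_min] += moved
        budget -= moved
    return p @ v

def drvi_tv(P_hat, r, gamma, sigma, num_iters=200):
    """Distributionally robust value iteration on the empirical RMDP (TV ball)."""
    S, A = r.shape
    Q = np.zeros((S, A))
    for _ in range(num_iters):
        V = Q.max(axis=1)                                  # V_{t-1}(s) = max_a Q_{t-1}(s, a)
        Q = np.array([[r[s, a] + gamma * tv_worst_case(P_hat[s, a], V, sigma)
                       for a in range(A)] for s in range(S)])
    policy = Q.argmax(axis=1)                              # greedy policy as in (22)
    return Q, policy
```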
4 Theoretical guarantees: sample complexity analyses
We now present our main results, which concern the sample complexities of learning RMDPs when the uncertainty set is specified using either the TV distance or the χ² divergence. Somewhat surprisingly, different choices of the uncertainty set can lead to dramatically different consequences in the sample size requirement.
4.1 The case of TV distance: RMDPs are easier to learn than standard MDPs
We start with the case where the uncertainty set is measured via the TV distance. The following theorem, whose proof is deferred to Section 5.2, develops an upper bound on the sample complexity of DRVI in order to return an $\varepsilon$-optimal robust policy. The key challenge of the analysis lies in careful control of the robust value function as a function of the uncertainty level $\sigma$.
Theorem 1 (Upper bound under TV distance).
Let the uncertainty set be $\mathcal{U}^{\sigma}_{\mathsf{TV}}(\cdot)$, as specified by the TV distance (9). Consider any discount factor $\gamma$, uncertainty level $\sigma \in [0, 1)$, and $\delta \in (0, 1)$. Let $\widehat{\pi}$ be the output policy of Algorithm 1 after a sufficient number $T$ of iterations. Then with probability at least $1 - \delta$, one has

$\forall s \in \mathcal{S}: \quad V^{\star, \sigma}(s) - V^{\widehat{\pi}, \sigma}(s) \le \varepsilon$ (23)

for any target accuracy level $\varepsilon$ within the admissible range, as long as the total number of samples obeys
(24) |
Here, are some large enough universal constants.
Remark 1.
Before discussing the implications of Theorem 1, we present a matching minimax lower bound that confirms the tightness and optimality of the upper bound, which in turn pins down the sample complexity requirement for learning RMDPs with TV distance. The proof is based on constructing new hard instances inspired by the asymmetric structure of RMDPs, with the details postponed to Section 5.3.
Theorem 2 (Lower bound under TV distance).
Consider any tuple obeying with being any small enough positive constant, , and . We can construct a collection of infinite-horizon RMDPs defined by the uncertainty set , an initial state distribution , and a dataset with independent samples for each state-action pair over the nominal transition kernel (for and respectively), such that
provided that
Here, the infimum is taken over all estimators , and (resp. ) denotes the probability when the RMDP is (resp. ).
Below, we interpret the above theorems and highlight several key implications about the sample complexity requirements for learning RMDPs for the case w.r.t. the TV distance.
Near minimax-optimal sample complexity.
Theorem 1 shows that the total number of samples required for DRVI (or any oracle planning algorithm claimed in Remark 1) to yield $\varepsilon$-accuracy is

$\widetilde{O}\left(\frac{SA}{(1-\gamma)^2 \max\{1-\gamma, \sigma\}\, \varepsilon^2}\right).$ (26)

Taken together with the minimax lower bound asserted by Theorem 2, this confirms the near optimality of the sample complexity (up to some logarithmic factor) over almost the full range of the uncertainty level $\sigma$. Importantly, this sample complexity scales linearly with the size of the state-action space, and is inversely proportional to $\sigma$ in the regime where $\sigma \gtrsim 1 - \gamma$.
RMDPs are easier to learn than standard MDPs under the TV distance.
Recall that the sample complexity requirement for learning standard MDPs with a generative model is (Azar et al., 2013a; Agarwal et al., 2020; Li et al., 2023b)

$\widetilde{O}\left(\frac{SA}{(1-\gamma)^3\, \varepsilon^2}\right)$ (27)

in order to yield $\varepsilon$-accuracy. Comparing this with the sample complexity requirement in (26) for RMDPs under the TV distance, we confirm that the latter is at least as easy as — if not easier than — learning standard MDPs. In particular, when $\sigma \lesssim 1-\gamma$, the sample complexity of RMDPs is the same as that of standard MDPs in (27), which is as anticipated since the RMDP reduces to the standard MDP when $\sigma = 0$. On the other hand, when $\sigma \gtrsim 1-\gamma$, the sample complexity of RMDPs simplifies to

$\widetilde{O}\left(\frac{SA}{(1-\gamma)^2\, \sigma\, \varepsilon^2}\right),$ (28)

which is smaller than that of standard MDPs by a factor of $\sigma/(1-\gamma)$.
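To see this comparison numerically, the snippet below evaluates the orders in (26) and (27), dropping constants and logarithmic factors; it is only a heuristic illustration of the scalings, not the exact bounds.

```python
def rmdp_tv_order(S, A, gamma, sigma, eps):
    # order in (26): SA / ((1-gamma)^2 * max{1-gamma, sigma} * eps^2)
    return S * A / ((1 - gamma) ** 2 * max(1 - gamma, sigma) * eps ** 2)

def standard_mdp_order(S, A, gamma, eps):
    # order in (27): SA / ((1-gamma)^3 * eps^2)
    return S * A / ((1 - gamma) ** 3 * eps ** 2)

S, A, gamma, eps = 100, 10, 0.99, 0.1
for sigma in (0.001, 0.01, 0.1, 0.5):
    ratio = rmdp_tv_order(S, A, gamma, sigma, eps) / standard_mdp_order(S, A, gamma, eps)
    print(f"sigma={sigma}: RMDP/standard ratio = {ratio:.3f}")
# The ratio equals 1 when sigma <~ 1-gamma and shrinks like (1-gamma)/sigma for larger sigma.
```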
Comparison with state-of-the-art bounds.
For the upper bound, our result (cf. Theorem 1) significantly improves over the prior art of Panaganti and Kalathil (2022) by at least a factor of $S$, and by even more when the uncertainty level $\sigma$ is large. Turning to the lower bound side, Yang et al. (2022) developed a lower bound for RMDPs under the TV distance, which is looser than ours for a wide range of uncertainty levels $\sigma$.
4.2 The case of χ² divergence: RMDPs can be harder than standard MDPs
We now switch attention to the case when the uncertainty set is measured via the χ² divergence. The theorem below presents an upper bound on the sample complexity for this case, whose proof is deferred to Appendix D.
Theorem 3 (Upper bound under χ² divergence).
Let the uncertainty set be $\mathcal{U}^{\sigma}_{\chi^2}(\cdot)$, as specified using the χ² divergence (10). Consider any uncertainty level $\sigma > 0$ and $\delta \in (0, 1)$. With probability at least $1 - \delta$, the output policy $\widehat{\pi}$ from Algorithm 1 with at most $T$ iterations yields

$\forall s \in \mathcal{S}: \quad V^{\star, \sigma}(s) - V^{\widehat{\pi}, \sigma}(s) \le \varepsilon$ (29)

for any target accuracy level $\varepsilon$ within the admissible range, as long as the total number of samples obeys
(30) |
Here, are some large enough universal constants.
Remark 2.
In addition, in order to gauge the tightness of Theorem 3 and understand the minimal sample complexity requirement under the χ² divergence, we further develop a minimax lower bound as follows; the proof is deferred to Appendix E.
Theorem 4 (Lower bound under χ² divergence).
Consider any obeying , , and
(31) |
for some small universal constant . Then we can construct two infinite-horizon RMDPs defined by the uncertainty set , an initial state distribution , and a dataset with independent samples per pair over the nominal transition kernel (for and respectively), such that
(32) |
provided that the total number of samples
(33) |
for some universal constant .
We are now positioned to single out several key implications of the above theorems.
Nearly tight sample complexity.
In order to achieve $\varepsilon$-accuracy for RMDPs under the χ² divergence, Theorem 3 asserts that a total number of samples on the order of

$\widetilde{O}\left(\frac{SA(1+\sigma)}{(1-\gamma)^4\, \varepsilon^2}\right)$ (34)

is sufficient for DRVI (or any other oracle planning algorithm as discussed in Remark 2). Taking this together with the minimax lower bound in Theorem 4 confirms that the sample complexity is near-optimal — up to a polynomial factor of the effective horizon $\frac{1}{1-\gamma}$ — over the entire range of the uncertainty level $\sigma$. In particular,
- when $\sigma \lesssim 1$, our sample complexity is sharp and matches the minimax lower bound;
- when $\sigma \gtrsim 1$, our sample complexity correctly predicts the linear dependency on $\sigma$, suggesting that more samples are needed when one wishes to account for a larger χ²-based uncertainty set.
RMDPs can be much harder to learn than standard MDPs under the χ² divergence.
The minimax lower bound developed in Theorem 4 exhibits a curious non-monotonic behavior of the sample size requirement over the entire range of the uncertainty level $\sigma$ when the uncertainty set is measured via the χ² divergence. When $\sigma \lesssim 1-\gamma$, the lower bound reduces to

$\widetilde{\Omega}\left(\frac{SA}{(1-\gamma)^3\, \varepsilon^2}\right),$

which matches that of standard MDPs, as $\sigma = 0$ corresponds to the standard MDP. However, two additional regimes are worth calling out: when $\sigma \asymp 1$, the lower bound is on the order of $\frac{SA}{(1-\gamma)^4 \varepsilon^2}$; and as $\sigma$ continues to grow, the lower bound increases further (cf. Theorem 4). Both of these exceed the requirement of standard MDPs, indicating that learning RMDPs under the χ² divergence can be much harder.
Comparison with state-of-the-art bounds.
Our upper bound significantly improves over the prior art of Panaganti and Kalathil (2022) by at least a factor of $S$, and provides the first finite-sample complexity that scales linearly with respect to $S$ for discounted infinite-horizon RMDPs, which typically exhibit more complicated statistical dependencies than their finite-horizon counterparts. On the other hand, Yang et al. (2022) established a lower bound that is always smaller than the requirement of standard MDPs and diminishes as $\sigma$ grows. Consequently, Yang et al. (2022) does not lead to a rigorous justification that RMDPs can be much harder than standard MDPs, nor the correct linear scaling of the sample size as $\sigma$ grows.
5 Analysis: the TV case
This section presents the key technical steps for proving our main results of the TV case.
5.1 Preliminaries of the analysis
5.1.1 Additional notations and basic facts
For convenience, we introduce the notation $[T] := \{1, \ldots, T\}$ for any positive integer $T$. Moreover, for any two vectors $x = [x_i]$ and $y = [y_i]$, the notation $x \ge y$ (resp. $x \le y$) means $x_i \ge y_i$ (resp. $x_i \le y_i$) for all $i$. For any vector $x$, we overload the notation by letting $x^2 = [x_i^2]$ (resp. $\sqrt{x} = [\sqrt{x_i}]$). With slight abuse of notation, we denote $0$ (resp. $1$) as the all-zero (resp. all-one) vector, and drop the subscript $\rho$ to write $\mathcal{U}^{\sigma}(\cdot)$ whenever the argument holds for all divergences of interest.
Matrix notation.
To continue, we recall or introduce some additional matrix notation that is useful throughout the analysis.
-
•
: the matrix of the nominal transition kernel with as the -th row.
-
•
: the matrix of the estimated nominal transition kernel with as the -th row.
-
•
: a vector representing the reward function (so that for all ).
-
•
: a projection matrix associated with a given deterministic policy taking the following form
(35) where are standard basis vectors.
-
•
: a reward vector restricted to the actions chosen by the policy , namely, for all (or simply, ).
-
•
: for any transition kernel and vector , we denote the -th row of as
(36) -
•
, : the matrices representing the probability transition kernel in the uncertainty set that leads to the worst-case value for any vector . We denote (resp. ) as the -th row of the transition matrix (resp. ). In truth, the -th rows of these transition matrices are defined as
(37a) Furthermore, we make use of the following short-hand notation: (37b) (37c) The corresponding probability transition matrices are denoted by , , and , respectively.
-
•
, , , , and : six square probability transition matrices w.r.t. policy over the states, namely
(38) We denote as the -th row of the transition matrix ; similar quantities can be defined for the other matrices as well.
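To make the matrix notation above concrete, the following sketch (our own illustration, with hypothetical names) builds the projection matrix $\Pi^{\pi}$ of (35) and the induced state transition matrix $P^{\pi}$ for a deterministic policy.

```python
import numpy as np

def projection_matrix(policy, S, A):
    """Pi^pi in {0,1}^{S x SA}: row s selects the (s, policy[s]) entry, as in (35)."""
    Pi = np.zeros((S, S * A))
    for s in range(S):
        Pi[s, s * A + policy[s]] = 1.0
    return Pi

def induced_state_kernel(P, policy):
    """P^pi in R^{S x S}: row s is the kernel row P[s, policy[s]]."""
    S = P.shape[0]
    return np.stack([P[s, policy[s]] for s in range(S)])

# Tiny example: 3 states, 2 actions, a deterministic policy, and a uniform kernel.
S, A = 3, 2
P = np.full((S, A, S), 1.0 / S)
pi = np.array([0, 1, 0])
Pi = projection_matrix(pi, S, A)
P_pi = induced_state_kernel(P, pi)
assert np.allclose(Pi @ P.reshape(S * A, S), P_pi)   # applying Pi^pi to P recovers P^pi
```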
Kullback-Leibler (KL) divergence.
First, for any two distributions $P$ and $Q$, we denote by $\mathsf{KL}(P \parallel Q)$ the Kullback-Leibler (KL) divergence of $P$ from $Q$. Letting $\mathsf{Ber}(p)$ be the Bernoulli distribution with mean $p$, we also introduce

$\mathsf{KL}\big(\mathsf{Ber}(p) \parallel \mathsf{Ber}(q)\big) := p \log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q}, \qquad \chi^2\big(\mathsf{Ber}(p) \parallel \mathsf{Ber}(q)\big) := \frac{(p-q)^2}{q} + \frac{(p-q)^2}{1-q},$ (39)

which represent respectively the KL divergence and the χ² divergence of $\mathsf{Ber}(p)$ from $\mathsf{Ber}(q)$ (Tsybakov, 2009).
Variance.
For any probability vector $P \in \Delta(\mathcal{S})$ and any vector $V \in \mathbb{R}^{S}$, we denote the variance

$\mathsf{Var}_P(V) := P\,(V \circ V) - (P V) \circ (P V).$ (40)
The following lemma bounds the Lipschitz constant of the variance function.
Lemma 3.
Consider any obeying and any probability vector , one has
(41) |
Proof of Lemma 3: It is immediate to check that
(42) |
where the penultimate inequality holds by the triangle inequality.
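As a quick numerical companion to the definitions in (39) and (40), the following sketch (names are our own) evaluates the Bernoulli KL and χ² divergences and the variance $\mathsf{Var}_P(V)$ for a probability row vector.

```python
import numpy as np

def kl_bernoulli(p, q):
    """KL(Ber(p) || Ber(q)) as in (39); assumes 0 < q < 1 and 0 <= p <= 1."""
    terms = []
    if p > 0:
        terms.append(p * np.log(p / q))
    if p < 1:
        terms.append((1 - p) * np.log((1 - p) / (1 - q)))
    return float(sum(terms))

def chi2_bernoulli(p, q):
    """chi^2(Ber(p) || Ber(q)) = (p - q)^2 / (q (1 - q)), as in (39)."""
    return (p - q) ** 2 / (q * (1 - q))

def variance(P, V):
    """Var_P(V) = P (V o V) - (P V)^2 for a probability row vector P, as in (40)."""
    P, V = np.asarray(P, float), np.asarray(V, float)
    return float(P @ (V * V) - (P @ V) ** 2)

print(kl_bernoulli(0.6, 0.5))             # ~0.0201
print(chi2_bernoulli(0.6, 0.5))           # 0.04
print(variance([0.5, 0.5], [0.0, 1.0]))   # 0.25
```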
5.1.2 Facts of the robust Bellman operator and the empirical robust MDP
γ-contraction of the robust Bellman operator.
It is worth noting that the robust Bellman operator $\mathcal{T}^{\sigma}(\cdot)$ (cf. (8)) shares the nice $\gamma$-contraction property of the standard Bellman operator, as stated below.
Bellman equations of the empirical robust MDP .
To begin with, recall the empirical robust MDP $\widehat{\mathcal{M}}_{\mathsf{rob}}$ based on the estimated nominal distribution $\widehat{P}^0$ constructed in (13), along with its corresponding robust value function (resp. robust Q-function) $\widehat{V}^{\pi, \sigma}$ (resp. $\widehat{Q}^{\pi, \sigma}$).
Note that $\widehat{Q}^{\star, \sigma}$ is the unique fixed point of $\widehat{\mathcal{T}}^{\sigma}$ (see Lemma 4), the empirical robust Bellman operator constructed using $\widehat{P}^0$. Moreover, similar to (7), for $\widehat{\mathcal{M}}_{\mathsf{rob}}$, Bellman's optimality principle gives the following robust Bellman consistency equation (resp. robust Bellman optimality equation):
$\widehat{Q}^{\pi, \sigma}(s, a) = r(s, a) + \gamma \inf_{\mathcal{P} \in \mathcal{U}^{\sigma}_{\rho}(\widehat{P}^0_{s,a})} \mathcal{P}\, \widehat{V}^{\pi, \sigma},$ (44a)
$\widehat{Q}^{\star, \sigma}(s, a) = r(s, a) + \gamma \inf_{\mathcal{P} \in \mathcal{U}^{\sigma}_{\rho}(\widehat{P}^0_{s,a})} \mathcal{P}\, \widehat{V}^{\star, \sigma}.$ (44b)
With these in mind, combined with the matrix notation (introduced at the beginning of Section 5), for any policy , we can write the robust Bellman consistency equations as
(45) |
which leads to
(46) |
where (i) and (ii) holds by the definitions in (35), (37) and (• ‣ 5.1.1).
Encouragingly, the above property of the robust Bellman operator ensures the fast convergence of DRVI. We collect this consequence in the following lemma, whose proof is postponed to Appendix A.2.
Lemma 5.
5.2 Proof of the upper bound with TV distance: Theorem 1
Throughout this section, for any transition kernel $P$, the uncertainty set is taken as the TV uncertainty set (see (9))

$\mathcal{U}^{\sigma}(P) = \mathcal{U}^{\sigma}_{\mathsf{TV}}(P) := \otimes_{(s,a)}\, \mathcal{U}^{\sigma}_{\mathsf{TV}}(P_{s,a}), \qquad \mathcal{U}^{\sigma}_{\mathsf{TV}}(P_{s,a}) := \Big\{ \widetilde{P}_{s,a} \in \Delta(\mathcal{S}): \tfrac{1}{2}\big\| \widetilde{P}_{s,a} - P_{s,a} \big\|_1 \le \sigma \Big\}.$ (49)
5.2.1 Technical lemmas
We begin with a key lemma that is new and that distinguishes robust MDPs with the TV distance from standard MDPs; it plays a critical role in obtaining the sample complexity upper bound in Theorem 1. This lemma concerns the dynamic range of the robust value function $V^{\pi, \sigma}$ (cf. (5)) for any fixed policy $\pi$, which produces tighter control than the crude range $[0, \frac{1}{1-\gamma}]$ used for standard MDPs when $\sigma$ is large. The proof is deferred to Appendix B.1.
Lemma 6.
For any nominal transition kernel , any fixed uncertainty level , and any policy , its corresponding robust value function (cf. (5)) satisfies
With the above lemma in hand, we introduce the following lemma that is useful throughout this section, whose proof is postponed to Appendix B.2.
Lemma 7.
Consider an MDP with transition kernel matrix and reward function . For any policy and its associated state transition matrix and value function (cf. (1)), one has
5.2.2 Proof of Theorem 1
Recall that the proofs for standard RL (Agarwal et al., 2020; Li et al., 2023b) deal with the upper and lower bounds of the value function estimation gap identically. In contrast, the proof of Theorem 1 requires a tailored argument for the robust RL setting — controlling the upper and lower bounds of the value function estimation gap in an asymmetric way — motivated by the varying worst-case transition kernels associated with different value functions. Before proceeding, note that applying Lemma 5 yields that, once the number of iterations $T$ is sufficiently large, one has
(50) |
allowing us to justify the more general statement in Remark 1. To control the performance gap $V^{\star, \sigma} - V^{\widehat{\pi}, \sigma}$, the proof is divided into several key steps.
Step 1: decomposing the error.
Recall the optimal robust policy w.r.t. and the optimal robust policy , the optimal robust value function (resp. robust value function ) w.r.t. . The term of interest can be decomposed as
(51) |
where (i) holds by since is the robust optimal policy for , and (ii) comes from the fact in (50).
To control the two important terms in (51), we first consider a more general term for any policy . Towards this, plugging in (46) yields
where (i) holds by observing
due to the optimality of (cf. (37)). Rearranging terms leads to
(52) |
Similarly, we can also deduce
(53) |
Step 2: controlling and separately and summing up.
First, we introduce the following two lemmas that control the two main terms in (51), respectively. The first lemma controls the value function estimation error associated with the optimal policy $\pi^{\star}$, induced by the randomness of the generated dataset. The proofs are postponed to Appendices B.3 and B.4.
Lemma 8.
Consider any . With probability at least , taking , one has
(58) |
Unlike the preceding term, which involves a fixed policy (independent of the dataset), controlling this term requires dealing with the additional, complicated statistical dependency between the learned policy $\widehat{\pi}$ and the empirical RMDP constructed from the dataset.
Lemma 9.
Taking and , with probability at least , one has
(59) |
5.3 Proof of the lower bound with TV distance: Theorem 2
To achieve a tight lower bound for robust MDPs, we construct new hard instances that differ from those for standard MDPs (Azar et al., 2013a), addressing two new challenges. 1) Due to the robustness requirement, the recursive step (or bootstrapping) in robust MDPs has an asymmetric structure over the states, since the worst-case transition probability depends on the value function and puts more weight on the states with lower values. Inspired by this asymmetric structure, we develop new hard instances by assigning larger rewards to the states with action-invariant transition kernels, so as to achieve a tighter lower bound. Note that standard MDPs do not face such a reward-allocation challenge, since their bootstrapping step is determined by a fixed transition probability independent of the value function. 2) As the uncertainty level $\sigma$ can vary over a wide range, a tailored $\sigma$-dependent hard instance is required to achieve a tight lower bound for each given $\sigma$, leading to the construction of a series of different instances as $\sigma$ varies. In contrast, standard RL only needs to construct one hard instance (i.e., one that does not depend on $\sigma$). By constructing a new class of hard instances addressing the above challenges, we develop a new lower bound in Theorem 2 that is tighter than the prior art (Yang et al., 2022), which used an identical hard instance for all uncertainty levels $\sigma$.
5.3.1 Construction of the hard problem instances
Construction of two hard MDPs.
Suppose there are two standard MDPs defined as below:
Here, is the discount parameter, is the state space. Given any state , the corresponding action space are . While for states or , the action space is only . For any , the transition kernel of the constructed MDP is defined as
(64) |
where and are set to satisfy
(65) |
for some and that shall be introduced later. The above transition kernel implies that state is an absorbing state, namely, the MDP will always stay after it arrives at .
Then, we define the reward function as
(68) |
Additionally, we choose the following initial state distribution:
(69) |
Here, the constructed two instances are set with different probability transition from state with reward but not state with reward (which were used in standard MDPs (Li et al., 2022b, )), yielding a larger gap between the value functions of the two instances.
Uncertainty set of the transition kernels.
Recalling the uncertainty set assumed throughout this section is defined as with TV distance:
(70) |
where is defined similar to (4). In addition, without loss of generality, we recall the radius with . With the uncertainty level in hand, taking , and which determines the instances obey
(71) |
which ensure as follows:
(72) |
Consequently, applying (65) directly leads to
(73) |
To continue, for any , we denote the infimum probability of moving to the next state associated with any perturbed transition kernel as
(74) |
where the last equation can be easily verified by the definition of in (70). As shall be seen, the transition from state to state plays an important role in the analysis, for convenience, we denote
(75) |
which follows from the fact that in (73).
Robust value functions and robust optimal policies.
To proceed, we are ready to derive the corresponding robust value functions, identify the optimal policies, and characterize the optimal values. For any MDP with the above uncertainty set, we denote as the optimal policy, and the robust value function of any policy (resp. the optimal policy ) as (resp. ). Then, we introduce the following lemma which describes some important properties of the robust (optimal) value functions and optimal policies. The proof is postponed to Appendix C.1.
Lemma 10.
For any and any policy , the robust value function obeys
(76) |
where is defined as
(77) |
In addition, the robust optimal value functions and the robust optimal policies satisfy
(78a) | ||||
(78b) |
5.3.2 Establishing the minimax lower bound
Note that our goal is to control the quantity w.r.t. any policy estimator based on the chosen initial distribution in (69) and the dataset consisting of samples over each state-action pair generated from the nominal transition kernel , which gives
Step 1: converting the goal to estimate .
We make the following useful claim which shall be verified in Appendix C.2: With , letting
(79) |
which satisfies (71), it leads to that for any policy ,
(80) |
With this connection established between the policy and its sub-optimality gap as depicted in (80), we can now proceed to build an estimate for . Here, we denote as the probability distribution when the MDP is , where can take on values in the set .
Let’s assume momentarily that an estimated policy achieves
(81) |
then in view of (80), we necessarily have with probability at least . With this in mind, we are motivated to construct the following estimate for :
(82) |
which obeys
(83) |
Subsequently, our aim is to demonstrate that (83) cannot occur without an adequate number of samples, which would in turn contradict (80).
Step 2: probability of error in testing two hypotheses.
Equipped with the aforementioned groundwork, we can now delve into differentiating between the two hypotheses . To achieve this, we consider the concept of minimax probability of error, defined as follows:
(84) |
Here, the infimum is taken over all possible tests constructed from the samples generated from the nominal transition kernel .
Moving forward, let us denote (resp. ) as the distribution of a sample tuple under the nominal transition kernel associated with and the samples are generated independently. Applying standard results from Tsybakov, (2009, Theorem 2.2) and the additivity of the KL divergence (cf. Tsybakov, (2009, Page 85)), we obtain
(85) |
where the last inequality holds by observing that
Here, the last equality holds by the fact that and only differ when .
Step 3: putting the results together.
6 Offline distributionally robust RL with uniform coverage
In this section, we extend our theoretical analysis to a broader sampling mechanism with offline datasets. We first specify the offline setting as below.
Offline/batch dataset.
Suppose that we observe a batch/historical dataset $\mathcal{D} = \{(s_i, a_i, s'_i)\}_{1 \le i \le N_{\mathsf{all}}}$ consisting of $N_{\mathsf{all}}$ sample transitions generated independently. Specifically, each state-action pair $(s_i, a_i)$ is drawn from some behavior distribution $\mu_{\mathsf{b}} \in \Delta(\mathcal{S}\times\mathcal{A})$, followed by a next state $s'_i$ drawn from the nominal transition kernel $P^0$, i.e.,

$(s_i, a_i) \sim \mu_{\mathsf{b}}, \qquad s'_i \sim P^0(\cdot \,|\, s_i, a_i).$ (89)
We consider a historical dataset with uniform coverage, which is widely studied in offline settings for both standard RL and robust RL (Liao et al., 2022; Chen and Jiang, 2019; Jin et al., 2020b; Zhou et al., 2021; Yang et al., 2022), as specified in the following assumption.
Assumption 1.
Suppose the historical dataset $\mathcal{D}$ obeys

$\min_{(s,a) \in \mathcal{S}\times\mathcal{A}} \mu_{\mathsf{b}}(s, a) \ge \mu_{\min} > 0.$ (90)
Armed with the above dataset $\mathcal{D}$, the empirical nominal transition kernel $\widehat{P}^0$ can be constructed analogously to (13), using the empirical transition frequencies of $\mathcal{D}$. In this offline setting, we present sample complexity upper bounds for DRVI and information-theoretic lower bounds for the TV and χ² divergence cases, respectively. The proofs of the following corollaries are postponed to Appendix F.
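A minimal sketch of this offline sampling mechanism and the corresponding count-based kernel estimate follows; the names are hypothetical, and the behavior distribution below is uniform, which trivially satisfies Assumption 1 with $\mu_{\min} = \frac{1}{SA}$.

```python
import numpy as np

rng = np.random.default_rng(2)
S, A, N_all = 4, 2, 5000
P0 = rng.dirichlet(np.ones(S), size=(S, A))      # nominal kernel
mu_b = np.full((S, A), 1.0 / (S * A))            # uniform behavior distribution

# Draw the batch dataset as in (89): (s_i, a_i) ~ mu_b, then s'_i ~ P0(. | s_i, a_i).
flat = rng.choice(S * A, size=N_all, p=mu_b.ravel())
s_idx, a_idx = flat // A, flat % A
s_next = np.array([rng.choice(S, p=P0[s, a]) for s, a in zip(s_idx, a_idx)])

# Count-based empirical nominal kernel, analogous to (13).
counts = np.zeros((S, A, S))
np.add.at(counts, (s_idx, a_idx, s_next), 1.0)
row_tot = counts.sum(axis=-1, keepdims=True)
# Unvisited pairs default to uniform rows (an arbitrary choice for this sketch).
P_hat = np.divide(counts, row_tot, out=np.full_like(counts, 1.0 / S), where=row_tot > 0)
```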
6.1 The case of TV distance
With the above historical dataset in hand, we obtain the following corollary, implied by Theorem 1.
Corollary 1 (Upper bound under TV distance).
Let the uncertainty set be defined in (9), and be some large enough universal constants. Consider any discount factor , uncertainty level , and . Let be the output policy of Algorithm 1 after iterations, based on a dataset satisfying Assumption 1. Then with probability at least , one has
(91) |
for any , as long as the total number of samples obeys
(92) |
We also derive a lower bound in the offline setting by adapting Theorem 2.
Corollary 2 (Lower bound under TV distance).
Let the uncertainty set be defined in (9). Consider any tuple that obeys , with being any small enough positive constant, , and . We can construct two infinite-horizon RMDPs , an initial state distribution , and a dataset with samples satisfying Assumption 1 (for and respectively) such that
provided that
Here, the infimum is taken over all estimators , and (resp. ) denotes the probability when the RMDP is (resp. ).
Discussions.
In the offline setting with a uniform coverage dataset (cf. Assumption 1), Corollary 1 shows that the DRVI algorithm can find an $\varepsilon$-optimal policy with a sample complexity of

$\widetilde{O}\left(\frac{1}{\mu_{\min}(1-\gamma)^2 \max\{1-\gamma, \sigma\}\, \varepsilon^2}\right),$ (93)

which is near minimax optimal with respect to all salient parameters (up to logarithmic factors) over almost the full range of the uncertainty level $\sigma$, as verified by the lower bound in Corollary 2. Our sample complexity upper bound (Corollary 1) significantly improves over the prior art (Yang et al., 2022), with an even larger improvement when the uncertainty level $\sigma$ is small.
6.2 The case of χ² divergence
With uncertainty sets measured by the χ² divergence, we obtain the following upper bound for DRVI and information-theoretic lower bound, adapted from Theorem 3 and Theorem 4, respectively.
Corollary 3 (Upper bound under χ² divergence).
Let the uncertainty set be specified by the χ² divergence (cf. (10)). Consider any uncertainty level $\sigma > 0$ and $\delta \in (0, 1)$. Given a dataset satisfying Assumption 1, with probability at least $1 - \delta$, the output policy $\widehat{\pi}$ from Algorithm 1 with at most $T$ iterations yields
(94) |
for any , as long as the total number of samples obeying
(95) |
Corollary 4 (Lower bound under χ² divergence).
Let the uncertainty set be , and be some universal constants. Consider any tuple obeying , , , and
(96) |
Then we can construct two infinite-horizon RMDPs , an initial state distribution , and a dataset with independent samples satisfying Assumption 1 over the nominal transition kernel (for and respectively), such that
(97) |
provided that the total number of samples
(98) |
Discussions.
Corollary 3 indicates that in the offline setting with a uniform coverage dataset (cf. Assumption 1), DRVI can achieve $\varepsilon$-accuracy for RMDPs under the χ² divergence with a total number of samples on the order of

$\widetilde{O}\left(\frac{1+\sigma}{\mu_{\min}(1-\gamma)^4\, \varepsilon^2}\right).$ (99)

The above upper bound is relatively tight, since it matches the lower bound derived in Corollary 4 when the uncertainty level $\sigma \lesssim 1$ and correctly captures the linear dependency on $\sigma$ when the uncertainty level is large. In addition, it significantly improves upon the prior art (Yang et al., 2022).
7 Other related works
This section briefly discusses a small sample of other related works. We limit our discussions primarily to provable RL algorithms in the tabular setting with finite state and action spaces, which are most related to the current paper.
Finite-sample guarantees for standard RL.
A surge of recent research has utilized the toolkit from high-dimensional probability/statistics to investigate the performance of standard RL algorithms in non-asymptotic settings. There has been a considerable amount of research into non-asymptotic sample analysis of standard RL for a variety of settings; partial examples include, but are not limited to, the works via probably approximately correct (PAC) bounds for the generative model setting (Kearns and Singh,, 1999; Beck and Srikant,, 2012; Li et al., 2022a, ; Chen et al.,, 2020; Azar et al., 2013b, ; Sidford et al.,, 2018; Agarwal et al.,, 2020; Li et al., 2023a, ; Li et al., 2023b, ; Wainwright,, 2019) and the offline setting (Liao et al.,, 2022; Chen and Jiang,, 2019; Rashidinejad et al.,, 2021; Xie et al.,, 2021; Yin et al.,, 2021; Shi et al.,, 2022; Li et al.,, 2024; Jin et al.,, 2021; Yan et al.,, 2022; Woo et al.,, 2024; Uehara et al.,, 2022), as well as the online setting via both regret-based and PAC-base analyses (Jin et al.,, 2018; Bai et al.,, 2019; Li et al.,, 2021; Zhang et al., 2020b, ; Dong et al.,, 2019; Jin et al., 2020a, ; Li et al., 2023c, ; Jafarnia-Jahromi et al.,, 2020; Yang et al.,, 2021; Woo et al.,, 2023).
Robustness in RL.
While standard RL has achieved remarkable success, current RL algorithms still have a significant drawback: the learned policy could be completely off if the deployed environment is subject to perturbation, model mismatch, or other structural changes. To address these challenges, an emerging line of works has begun to address the robustness of RL algorithms with respect to uncertainty or perturbation over different components of MDPs — state, action, reward, and the transition kernel; see Moos et al. (2022) for a recent review. Besides the framework of distributionally robust MDPs (RMDPs) (Iyengar, 2005) adopted by this work, there exist various other works promoting robustness in RL, including but not limited to Zhang et al. (2020a); Zhang et al. (2021); Han et al. (2022); Qiaoben et al. (2021); Sun et al. (2021); Xiong et al. (2022), which investigate robustness w.r.t. state uncertainty, where the agent's policy is chosen based on a perturbed observation generated from the state by adding restricted noise or adversarial attacks. In addition, Tessler et al. (2019); Tan et al. (2020) considered robustness w.r.t. the uncertainty of the action, namely, the action is possibly distorted by an adversarial agent abruptly or smoothly, and Ding et al. (2023) tackles robustness against spurious correlations.
Distributionally robust RL.
Rooted in the literature of distributionally robust optimization, which has primarily been investigated in the context of supervised learning (Rahimian and Mehrotra,, 2019; Gao,, 2020; Bertsimas et al.,, 2018; Duchi and Namkoong,, 2018; Blanchet and Murthy,, 2019), distributionally robust dynamic programming and RMDPs have attracted considerable attention recently (Iyengar,, 2005; Xu and Mannor,, 2012; Wolff et al.,, 2012; Kaufman and Schaefer,, 2013; Ho et al.,, 2018; Smirnova et al.,, 2019; Ho et al.,, 2021; Goyal and Grand-Clement,, 2022; Derman and Mannor,, 2020; Tamar et al.,, 2014; Badrinath and Kalathil,, 2021). In the context of RMDPs, both empirical and theoretical studies have been widely conducted, although most prior theoretical analyses focus on planning with an exact knowledge of the uncertainty set (Iyengar,, 2005; Xu and Mannor,, 2012; Tamar et al.,, 2014), or are asymptotic in nature (Roy et al.,, 2017).
Resorting to the tools of high-dimensional statistics, various recent works have begun to shift attention to understanding the finite-sample performance of provable robust RL algorithms, under diverse data generating mechanisms and forms of the uncertainty set over the transition kernel. Besides the infinite-horizon setting, finite-sample complexity bounds for RMDPs with the TV distance and the χ² divergence have also been developed for the finite-horizon setting in Xu et al. (2023); Dong et al. (2022); Lu et al. (2024). In addition, many other forms of uncertainty sets have been considered. For example, Wang and Zou (2021) considered an R-contamination uncertainty set and proposed a provably robust Q-learning algorithm for the online setting with guarantees similar to standard MDPs. The KL divergence is another popular choice widely considered, for which Yang et al. (2022); Panaganti and Kalathil (2022); Zhou et al. (2021); Shi and Chi (2022); Xu et al. (2023); Wang et al. (2023b); Blanchet et al. (2023); Liu et al. (2022); Wang et al. (2023d); Liang et al. (2023); Wang et al. (2023a) investigated the sample complexity of both model-based and model-free algorithms under the simulator, offline, or single-trajectory settings. Xu et al. (2023) considered a variety of uncertainty sets, including one associated with the Wasserstein distance. Badrinath and Kalathil (2021); Ramesh et al. (2023); Panaganti et al. (2022); Ma et al. (2022); Wang et al. (2024); Liu and Xu (2024b); Liu and Xu (2024a) considered function approximation settings. Moreover, various other related issues have been explored, such as the differences between various uncertainty types (Wang et al., 2023c), the iteration complexity of policy-based methods (Li et al., 2022c; Kumar et al., 2023; Li and Lan, 2023), the case where the uncertainty level is instance-dependent and sufficiently small (Clavier et al., 2023), regularization-based robust RL (Yang et al., 2023; Zhang et al., 2023), and distributionally robust optimization for offline RL (Panaganti et al., 2023).
8 Discussions
This work has developed improved sample complexity bounds for learning RMDPs when the uncertainty set is measured via the TV distance or the χ² divergence, assuming availability of a generative model. Our results have not only strengthened the prior art in both the upper and lower bounds, but have also unlocked curious insights into how the quest for distributional robustness impacts the sample complexity. As a key takeaway of this paper, RMDPs are not necessarily harder or easier to learn than standard MDPs, as the answer depends — in a rather subtle manner — on the specific choice of the uncertainty set. For the case w.r.t. the TV distance, we have settled the minimax sample complexity for RMDPs, which is never larger than that required to learn standard MDPs. Regarding the case w.r.t. the χ² divergence, we have uncovered that learning RMDPs can often be provably harder than the standard MDP counterpart. All in all, our findings help raise awareness that the choice of the uncertainty set not only represents a preference in robustness, but also exerts fundamental influences upon the intrinsic statistical complexity.
Moving forward, our work opens up numerous avenues for future studies, and we point out a few below.
- Extensions to the finite-horizon setting. It is likely that our current analysis framework can be extended to tackle finite-horizon RMDPs, which would help complete our understanding of the tabular cases.
• Improved analysis for the case of the χ² divergence. While we have settled the sample complexity of RMDPs with the TV distance, the upper and lower bounds we have developed for RMDPs w.r.t. the χ² divergence still differ by some polynomial factor in the effective horizon. It would be of great interest to see how to close this gap.
• A unified theory for other families of uncertainty sets. Our work raises an interesting question concerning how the geometry of the uncertainty set affects the sample complexity. Characterizing the tight sample complexity for RMDPs under a more general family of uncertainty sets — such as using distance or -divergence, as well as -rectangular sets — would be highly desirable.
• Instance-dependent sample complexity analyses. We note that we focus on understanding the minimax-optimal sample complexity of RMDPs, which might be rather pessimistic. When considering a given MDP, the feasible and reasonable magnitude of the uncertainty level is limited by a certain instance-dependent finite threshold. It would be desirable to study the instance-dependent sample complexity of RMDPs, which might shed better light on guiding practice.
Acknowledgement
The work of L. Shi and Y. Chi is supported in part by the grants ONR N00014-19-1-2404, NSF CCF-2106778, DMS-2134080, and CNS-2148212. L. Shi is also gratefully supported by the Leo Finzi Memorial Fellowship, Wei Shen and Xuehong Zhang Presidential Fellowship, and Liang Ji-Dian Graduate Fellowship at Carnegie Mellon University. The work of Y. Wei is supported in part by the NSF grants DMS-2147546/2015447, CAREER award DMS-2143215, CCF-2106778, and the Google Research Scholar Award. The work of Y. Chen is supported in part by the Alfred P. Sloan Research Fellowship, the Google Research Scholar Award, the AFOSR grant FA9550-22-1-0198, the ONR grant N00014-22-1-2354, and the NSF grants CCF-2221009 and CCF-1907661. The authors also acknowledge Mengdi Xu, Zuxin Liu and He Wang for valuable discussions.
References
- Agarwal et al., (2020) Agarwal, A., Kakade, S., and Yang, L. F. (2020). Model-based reinforcement learning with a generative model is minimax optimal. In Conference on Learning Theory, pages 67–83. PMLR.
- (2) Azar, M., Munos, R., and Kappen, H. J. (2013a). Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine learning, 91:325–349.
- (3) Azar, M. G., Munos, R., and Kappen, H. J. (2013b). Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine learning, 91(3):325–349.
- Badrinath and Kalathil, (2021) Badrinath, K. P. and Kalathil, D. (2021). Robust reinforcement learning using least squares policy iteration with provable performance guarantees. In International Conference on Machine Learning, pages 511–520. PMLR.
- Bai et al., (2019) Bai, Y., Xie, T., Jiang, N., and Wang, Y.-X. (2019). Provably efficient Q-learning with low switching cost. arXiv preprint arXiv:1905.12849.
- Bäuerle and Glauner, (2022) Bäuerle, N. and Glauner, A. (2022). Distributionally robust Markov decision processes and their connection to risk measures. Mathematics of Operations Research, 47(3):1757–1780.
- Beck and Srikant, (2012) Beck, C. L. and Srikant, R. (2012). Error bounds for constant step-size Q-learning. Systems & control letters, 61(12):1203–1208.
- Bertsimas et al., (2018) Bertsimas, D., Gupta, V., and Kallus, N. (2018). Data-driven robust optimization. Mathematical Programming, 167(2):235–292.
- Bertsimas et al., (2019) Bertsimas, D., Sim, M., and Zhang, M. (2019). Adaptive distributionally robust optimization. Management Science, 65(2):604–618.
- Blanchet et al., (2023) Blanchet, J., Lu, M., Zhang, T., and Zhong, H. (2023). Double pessimism is provably efficient for distributionally robust offline reinforcement learning: Generic algorithm and robust partial coverage. arXiv preprint arXiv:2305.09659.
- Blanchet and Murthy, (2019) Blanchet, J. and Murthy, K. (2019). Quantifying distributional model risk via optimal transport. Mathematics of Operations Research, 44(2):565–600.
- Cai et al., (2016) Cai, J.-F., Qu, X., Xu, W., and Ye, G.-B. (2016). Robust recovery of complex exponential signals from random Gaussian projections via low rank Hankel matrix reconstruction. Applied and Computational Harmonic Analysis, 41(2):470–490.
- Chen and Jiang, (2019) Chen, J. and Jiang, N. (2019). Information-theoretic considerations in batch reinforcement learning. In International Conference on Machine Learning, pages 1042–1051. PMLR.
- Chen et al., (2020) Chen, Z., Maguluri, S. T., Shakkottai, S., and Shanmugam, K. (2020). Finite-sample analysis of stochastic approximation using smooth convex envelopes. arXiv preprint arXiv:2002.00874.
- Chen et al., (2019) Chen, Z., Sim, M., and Xu, H. (2019). Distributionally robust optimization with infinitely constrained ambiguity sets. Operations Research, 67(5):1328–1344.
- Clavier et al., (2023) Clavier, P., Pennec, E. L., and Geist, M. (2023). Towards minimax optimality of model-based robust reinforcement learning. arXiv preprint arXiv:2302.05372v1.
- de Castro Silva et al., (2003) de Castro Silva, J., Soma, N., and Maculan, N. (2003). A greedy search for the three-dimensional bin packing problem: the packing static stability case. International Transactions in Operational Research, 10(2):141–153.
- Derman and Mannor, (2020) Derman, E. and Mannor, S. (2020). Distributional robustness and regularization in reinforcement learning. arXiv preprint arXiv:2003.02894.
- Ding et al., (2023) Ding, W., Shi, L., Chi, Y., and Zhao, D. (2023). Seeing is not believing: Robust reinforcement learning against spurious correlation. In Thirty-seventh Conference on Neural Information Processing Systems.
- Dong et al., (2022) Dong, J., Li, J., Wang, B., and Zhang, J. (2022). Online policy optimization for robust MDP. arXiv preprint arXiv:2209.13841.
- Dong et al., (2019) Dong, K., Wang, Y., Chen, X., and Wang, L. (2019). Q-learning with UCB exploration is sample efficient for infinite-horizon MDP. arXiv preprint arXiv:1901.09311.
- Duchi and Namkoong, (2018) Duchi, J. and Namkoong, H. (2018). Learning models with uniform performance via distributionally robust optimization. arXiv preprint arXiv:1810.08750.
- Fatemi et al., (2021) Fatemi, M., Killian, T. W., Subramanian, J., and Ghassemi, M. (2021). Medical dead-ends and learning to identify high-risk states and treatments. Advances in Neural Information Processing Systems, 34:4856–4870.
- Gao, (2020) Gao, R. (2020). Finite-sample guarantees for Wasserstein distributionally robust optimization: Breaking the curse of dimensionality. arXiv preprint arXiv:2009.04382.
- Goyal and Grand-Clement, (2022) Goyal, V. and Grand-Clement, J. (2022). Robust Markov decision processes: Beyond rectangularity. Mathematics of Operations Research.
- Han et al., (2022) Han, S., Su, S., He, S., Han, S., Yang, H., and Miao, F. (2022). What is the solution for state adversarial multi-agent reinforcement learning? arXiv preprint arXiv:2212.02705.
- Ho et al., (2018) Ho, C. P., Petrik, M., and Wiesemann, W. (2018). Fast bellman updates for robust MDPs. In International Conference on Machine Learning, pages 1979–1988. PMLR.
- Ho et al., (2021) Ho, C. P., Petrik, M., and Wiesemann, W. (2021). Partial policy iteration for l1-robust Markov decision processes. Journal of Machine Learning Research, 22(275):1–46.
- Iyengar, (2005) Iyengar, G. N. (2005). Robust dynamic programming. Mathematics of Operations Research, 30(2):257–280.
- Jafarnia-Jahromi et al., (2020) Jafarnia-Jahromi, M., Wei, C.-Y., Jain, R., and Luo, H. (2020). A model-free learning algorithm for infinite-horizon average-reward MDPs with near-optimal regret. arXiv preprint arXiv:2006.04354.
- Jin et al., (2018) Jin, C., Allen-Zhu, Z., Bubeck, S., and Jordan, M. I. (2018). Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4863–4873.
- (32) Jin, C., Krishnamurthy, A., Simchowitz, M., and Yu, T. (2020a). Reward-free exploration for reinforcement learning. In International Conference on Machine Learning, pages 4870–4879. PMLR.
- (33) Jin, C., Yang, Z., Wang, Z., and Jordan, M. I. (2020b). Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR.
- Jin et al., (2021) Jin, Y., Yang, Z., and Wang, Z. (2021). Is pessimism provably efficient for offline RL? In International Conference on Machine Learning, pages 5084–5096.
- Kaufman and Schaefer, (2013) Kaufman, D. L. and Schaefer, A. J. (2013). Robust modified policy iteration. INFORMS Journal on Computing, 25(3):396–410.
- Kearns and Singh, (1999) Kearns, M. J. and Singh, S. P. (1999). Finite-sample convergence rates for Q-learning and indirect algorithms. In Advances in neural information processing systems, pages 996–1002.
- Klopp et al., (2017) Klopp, O., Lounici, K., and Tsybakov, A. B. (2017). Robust matrix completion. Probability Theory and Related Fields, 169(1-2):523–564.
- Kober et al., (2013) Kober, J., Bagnell, J. A., and Peters, J. (2013). Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274.
- Kumar et al., (2023) Kumar, N., Derman, E., Geist, M., Levy, K., and Mannor, S. (2023). Policy gradient for s-rectangular robust Markov decision processes. arXiv preprint arXiv:2301.13589.
- Lam, (2019) Lam, H. (2019). Recovering best statistical guarantees via the empirical divergence-based distributionally robust optimization. Operations Research, 67(4):1090–1105.
- Lee et al., (2021) Lee, J., Jeon, W., Lee, B., Pineau, J., and Kim, K.-E. (2021). Optidice: Offline policy optimization via stationary distribution correction estimation. In International Conference on Machine Learning, pages 6120–6130. PMLR.
- (42) Li, G., Cai, C., Chen, Y., Wei, Y., and Chi, Y. (2023a). Is Q-learning minimax optimal? a tight sample complexity analysis. Operations Research.
- (43) Li, G., Chi, Y., Wei, Y., and Chen, Y. (2022a). Minimax-optimal multi-agent RL in Markov games with a generative model. Neural Information Processing Systems.
- (44) Li, G., Shi, L., Chen, Y., Chi, Y., and Wei, Y. (2022b). Settling the sample complexity of model-based offline reinforcement learning. arXiv preprint arXiv:2204.05275.
- Li et al., (2024) Li, G., Shi, L., Chen, Y., Chi, Y., and Wei, Y. (2024). Settling the sample complexity of model-based offline reinforcement learning. The Annals of Statistics, 52(1):233–260.
- Li et al., (2021) Li, G., Shi, L., Chen, Y., Gu, Y., and Chi, Y. (2021). Breaking the sample complexity barrier to regret-optimal model-free reinforcement learning. Advances in Neural Information Processing Systems, 34.
- (47) Li, G., Wei, Y., Chi, Y., and Chen, Y. (2023b). Breaking the sample size barrier in model-based reinforcement learning with a generative model. accepted to Operations Research.
- (48) Li, G., Yan, Y., Chen, Y., and Fan, J. (2023c). Minimax-optimal reward-agnostic exploration in reinforcement learning. arXiv preprint arXiv:2304.07278.
- Li and Lan, (2023) Li, Y. and Lan, G. (2023). First-order policy optimization for robust policy evaluation. arXiv preprint arXiv:2307.15890.
- (50) Li, Y., Zhao, T., and Lan, G. (2022c). First-order policy optimization for robust Markov decision process. arXiv preprint arXiv:2209.10579.
- Liang et al., (2023) Liang, Z., Ma, X., Blanchet, J., Zhang, J., and Zhou, Z. (2023). Single-trajectory distributionally robust reinforcement learning. arXiv preprint arXiv:2301.11721.
- Liao et al., (2022) Liao, P., Qi, Z., Wan, R., Klasnja, P., and Murphy, S. A. (2022). Batch policy learning in average reward Markov decision processes. Annals of Statistics, 50(6):3364.
- Liu et al., (2019) Liu, S., Ngiam, K. Y., and Feng, M. (2019). Deep reinforcement learning for clinical decision support: a brief survey. arXiv preprint arXiv:1907.09475.
- Liu et al., (2022) Liu, Z., Bai, Q., Blanchet, J., Dong, P., Xu, W., Zhou, Z., and Zhou, Z. (2022). Distributionally robust Q-learning. In International Conference on Machine Learning, pages 13623–13643. PMLR.
- (55) Liu, Z. and Xu, P. (2024a). Distributionally robust off-dynamics reinforcement learning: Provable efficiency with linear function approximation. arXiv preprint arXiv:2402.15399.
- (56) Liu, Z. and Xu, P. (2024b). Minimax optimal and computationally efficient algorithms for distributionally robust offline reinforcement learning. arXiv preprint arXiv:2403.09621.
- Lu et al., (2024) Lu, M., Zhong, H., Zhang, T., and Blanchet, J. (2024). Distributionally robust reinforcement learning with interactive data collection: Fundamental hardness and near-optimal algorithm. arXiv preprint arXiv:2404.03578.
- Ma et al., (2022) Ma, X., Liang, Z., Blanchet, J., Liu, M., Xia, L., Zhang, J., Zhao, Q., and Zhou, Z. (2022). Distributionally robust offline reinforcement learning with linear function approximation. arXiv preprint arXiv:2209.06620.
- Mahmood et al., (2018) Mahmood, A. R., Korenkevych, D., Vasan, G., Ma, W., and Bergstra, J. (2018). Benchmarking reinforcement learning algorithms on real-world robots. In Conference on robot learning, pages 561–591. PMLR.
- Mnih et al., (2013) Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
- Moos et al., (2022) Moos, J., Hansel, K., Abdulsamad, H., Stark, S., Clever, D., and Peters, J. (2022). Robust reinforcement learning: A review of foundations and recent advances. Machine Learning and Knowledge Extraction, 4(1):276–315.
- Nilim and El Ghaoui, (2005) Nilim, A. and El Ghaoui, L. (2005). Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798.
- OpenAI, (2023) OpenAI (2023). GPT-4 technical report.
- Pan et al., (2023) Pan, Y., Chen, Y., and Lin, F. (2023). Adjustable robust reinforcement learning for online 3d bin packing. arXiv preprint arXiv:2310.04323.
- Panaganti and Kalathil, (2022) Panaganti, K. and Kalathil, D. (2022). Sample complexity of robust reinforcement learning with a generative model. In International Conference on Artificial Intelligence and Statistics, pages 9582–9602. PMLR.
- Panaganti et al., (2022) Panaganti, K., Xu, Z., Kalathil, D., and Ghavamzadeh, M. (2022). Robust reinforcement learning using offline data. Advances in neural information processing systems, 35:32211–32224.
- Panaganti et al., (2023) Panaganti, K., Xu, Z., Kalathil, D., and Ghavamzadeh, M. (2023). Bridging distributionally robust learning and offline rl: An approach to mitigate distribution shift and partial data coverage. arXiv preprint arXiv:2310.18434.
- Park and Van Roy, (2015) Park, B. and Van Roy, B. (2015). Adaptive execution: Exploration and learning of price impact. Operations Research, 63(5):1058–1076.
- Qiaoben et al., (2021) Qiaoben, Y., Zhou, X., Ying, C., and Zhu, J. (2021). Strategically-timed state-observation attacks on deep reinforcement learning agents. In ICML 2021 Workshop on Adversarial Machine Learning.
- Qu et al., (2022) Qu, G., Wierman, A., and Li, N. (2022). Scalable reinforcement learning for multiagent networked systems. Operations Research, 70(6):3601–3628.
- Rahimian and Mehrotra, (2019) Rahimian, H. and Mehrotra, S. (2019). Distributionally robust optimization: A review. arXiv preprint arXiv:1908.05659.
- Ramesh et al., (2023) Ramesh, S. S., Sessa, P. G., Hu, Y., Krause, A., and Bogunovic, I. (2023). Distributionally robust model-based reinforcement learning with large state spaces. arXiv preprint arXiv:2309.02236.
- Rashidinejad et al., (2021) Rashidinejad, P., Zhu, B., Ma, C., Jiao, J., and Russell, S. (2021). Bridging offline reinforcement learning and imitation learning: A tale of pessimism. Neural Information Processing Systems (NeurIPS).
- Roy et al., (2017) Roy, A., Xu, H., and Pokutta, S. (2017). Reinforcement learning under model mismatch. Advances in neural information processing systems, 30.
- Shi and Chi, (2022) Shi, L. and Chi, Y. (2022). Distributionally robust model-based offline reinforcement learning with near-optimal sample complexity. arXiv preprint arXiv:2208.05767.
- Shi et al., (2022) Shi, L., Li, G., Wei, Y., Chen, Y., and Chi, Y. (2022). Pessimistic Q-learning for offline reinforcement learning: Towards optimal sample complexity. In Proceedings of the 39th International Conference on Machine Learning, volume 162, pages 19967–20025. PMLR.
- Sidford et al., (2018) Sidford, A., Wang, M., Wu, X., Yang, L., and Ye, Y. (2018). Near-optimal time and sample complexities for solving Markov decision processes with a generative model. In Advances in Neural Information Processing Systems, pages 5186–5196.
- Smirnova et al., (2019) Smirnova, E., Dohmatob, E., and Mary, J. (2019). Distributionally robust reinforcement learning. arXiv preprint arXiv:1902.08708.
- Sun et al., (2021) Sun, K., Liu, Y., Zhao, Y., Yao, H., Jui, S., and Kong, L. (2021). Exploring the training robustness of distributional reinforcement learning against noisy state observations. arXiv preprint arXiv:2109.08776.
- Tamar et al., (2014) Tamar, A., Mannor, S., and Xu, H. (2014). Scaling up robust MDPs using function approximation. In International conference on machine learning, pages 181–189. PMLR.
- Tan et al., (2020) Tan, K. L., Esfandiari, Y., Lee, X. Y., and Sarkar, S. (2020). Robustifying reinforcement learning agents via action space adversarial training. In 2020 American control conference (ACC), pages 3959–3964. IEEE.
- Tessler et al., (2019) Tessler, C., Efroni, Y., and Mannor, S. (2019). Action robust reinforcement learning and applications in continuous control. In International Conference on Machine Learning, pages 6215–6224. PMLR.
- Tsybakov, (2009) Tsybakov, A. B. (2009). Introduction to nonparametric estimation, volume 11. Springer.
- Uehara et al., (2022) Uehara, M., Shi, C., and Kallus, N. (2022). A review of off-policy evaluation in reinforcement learning. arXiv preprint arXiv:2212.06355.
- Vershynin, (2018) Vershynin, R. (2018). High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press.
- Wainwright, (2019) Wainwright, M. J. (2019). Stochastic approximation with cone-contractive operators: Sharp -bounds for Q-learning. arXiv preprint arXiv:1905.06265.
- Wang et al., (2024) Wang, H., Shi, L., and Chi, Y. (2024). Sample complexity of offline distributionally robust linear markov decision processes. arXiv preprint arXiv:2403.12946.
- (88) Wang, K., Gadot, U., Kumar, N., Levy, K., and Mannor, S. (2023a). Robust reinforcement learning via adversarial kernel approximation. arXiv preprint arXiv:2306.05859.
- (89) Wang, S., Si, N., Blanchet, J., and Zhou, Z. (2023b). A finite sample complexity bound for distributionally robust Q-learning. arXiv preprint arXiv:2302.13203.
- (90) Wang, S., Si, N., Blanchet, J., and Zhou, Z. (2023c). On the foundation of distributionally robust reinforcement learning. arXiv preprint arXiv:2311.09018.
- (91) Wang, S., Si, N., Blanchet, J., and Zhou, Z. (2023d). Sample complexity of variance-reduced distributionally robust Q-learning. arXiv preprint arXiv:2305.18420.
- Wang and Zou, (2021) Wang, Y. and Zou, S. (2021). Online robust reinforcement learning with model uncertainty. Advances in Neural Information Processing Systems, 34.
- Wiesemann et al., (2013) Wiesemann, W., Kuhn, D., and Rustem, B. (2013). Robust Markov decision processes. Mathematics of Operations Research, 38(1):153–183.
- Wolff et al., (2012) Wolff, E. M., Topcu, U., and Murray, R. M. (2012). Robust control of uncertain Markov decision processes with temporal logic specifications. In 2012 IEEE 51st IEEE Conference on Decision and Control (CDC), pages 3372–3379. IEEE.
- Woo et al., (2023) Woo, J., Joshi, G., and Chi, Y. (2023). The blessing of heterogeneity in federated Q-learning: Linear speedup and beyond. arXiv preprint arXiv:2305.10697.
- Woo et al., (2024) Woo, J., Shi, L., Joshi, G., and Chi, Y. (2024). Federated offline reinforcement learning: Collaborative single-policy coverage suffices. arXiv preprint arXiv:2402.05876.
- Xie et al., (2021) Xie, T., Jiang, N., Wang, H., Xiong, C., and Bai, Y. (2021). Policy finetuning: Bridging sample-efficient offline and online reinforcement learning. Advances in neural information processing systems, 34.
- Xiong et al., (2022) Xiong, Z., Eappen, J., Zhu, H., and Jagannathan, S. (2022). Defending observation attacks in deep reinforcement learning via detection and denoising. arXiv preprint arXiv:2206.07188.
- Xu and Mannor, (2012) Xu, H. and Mannor, S. (2012). Distributionally robust Markov decision processes. Mathematics of Operations Research, 37(2):288–300.
- Xu et al., (2023) Xu, Z., Panaganti, K., and Kalathil, D. (2023). Improved sample complexity bounds for distributionally robust reinforcement learning. arXiv preprint arXiv:2303.02783.
- Yan et al., (2022) Yan, Y., Li, G., Chen, Y., and Fan, J. (2022). The efficacy of pessimism in asynchronous Q-learning. arXiv preprint arXiv:2203.07368.
- Yang et al., (2021) Yang, K., Yang, L., and Du, S. (2021). Q-learning with logarithmic regret. In International Conference on Artificial Intelligence and Statistics, pages 1576–1584. PMLR.
- Yang et al., (2023) Yang, W., Wang, H., Kozuno, T., Jordan, S. M., and Zhang, Z. (2023). Avoiding model estimation in robust Markov decision processes with a generative model. arXiv preprint arXiv:2302.01248.
- Yang et al., (2022) Yang, W., Zhang, L., and Zhang, Z. (2022). Toward theoretical understandings of robust Markov decision processes: Sample complexity and asymptotics. The Annals of Statistics, 50(6):3223–3248.
- Yin et al., (2021) Yin, M., Bai, Y., and Wang, Y.-X. (2021). Near-optimal offline reinforcement learning via double variance reduction. arXiv preprint arXiv:2102.01748.
- Zhang et al., (2021) Zhang, H., Chen, H., Boning, D., and Hsieh, C.-J. (2021). Robust reinforcement learning on state observations with learned optimal adversary. arXiv preprint arXiv:2101.08452.
- (107) Zhang, H., Chen, H., Xiao, C., Li, B., Liu, M., Boning, D., and Hsieh, C.-J. (2020a). Robust deep reinforcement learning against adversarial perturbations on state observations. Advances in Neural Information Processing Systems, 33:21024–21037.
- Zhang et al., (2023) Zhang, R., Hu, Y., and Li, N. (2023). Regularized robust MDPs and risk-sensitive MDPs: Equivalence, policy gradient, and sample complexity. arXiv preprint arXiv:2306.11626.
- (109) Zhang, Z., Zhou, Y., and Ji, X. (2020b). Almost optimal model-free reinforcement learning via reference-advantage decomposition. Advances in Neural Information Processing Systems, 33.
- Zhao et al., (2021) Zhao, H., Yu, Y., and Xu, K. (2021). Learning efficient online 3d bin packing on packing configuration trees. In International conference on learning representations.
- Zhou et al., (2021) Zhou, Z., Bai, Q., Zhou, Z., Qiu, L., Blanchet, J., and Glynn, P. (2021). Finite-sample regret bound for distributionally robust offline tabular reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 3331–3339. PMLR.
- Ziegler et al., (2019) Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano, P., and Irving, G. (2019). Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593.
Appendix A Proof of the preliminaries
A.1 Proof of Lemma 1 and Lemma 2
Proof of Lemma 1.
To begin with, applying (Iyengar,, 2005, Lemma 4.3), the term of interest obeys
(100) |
where represents the -th entry of . Denoting as the optimal dual solution, taking , it is easily verified that obeys
(101) |
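For intuition, the inner minimization being dualized here, namely the worst-case expectation of a value vector over a TV ball around a nominal distribution, can also be solved greedily in the primal: move at most a σ amount of probability mass from the highest-value states onto the lowest-value state. The Python sketch below illustrates this; the function name, the toy numbers, and the radius convention (TV distance as half the ℓ1 distance) are illustrative assumptions rather than the paper's notation, and DRVI itself works with the one-dimensional dual form instead.

```python
import numpy as np

def tv_worst_case_expectation(p0, v, sigma):
    """Greedy solution of  min_P  <P, v>  subject to  TV(P, P0) <= sigma:
    move at most `sigma` probability mass from the highest-value states
    onto the lowest-value state (radius convention: TV = (1/2) * l1)."""
    p = p0.astype(float).copy()
    target = int(np.argmin(v))      # state receiving the reallocated mass
    budget = sigma
    for s in np.argsort(v)[::-1]:   # visit states from largest to smallest value
        if s == target or budget <= 0:
            continue
        moved = min(p[s], budget)   # cannot remove more mass than is available
        p[s] -= moved
        p[target] += moved
        budget -= moved
    return float(p @ v), p

# toy usage with made-up numbers
p0 = np.array([0.5, 0.3, 0.2])
v = np.array([1.0, 0.2, 0.7])
val, worst_p = tv_worst_case_expectation(p0, v, sigma=0.25)
print(val, worst_p)   # worst-case expected value and a minimizing distribution
```

Since Lemma 1 provides a dual reformulation of this inner problem, evaluating that dual at its maximizer should recover the same value as the greedy primal solution, which offers a convenient sanity check for implementations.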
Proof of Lemma 2.
A.2 Proof of Lemma 5
Applying the -contraction property in Lemma 4 directly yields that for any ,
where the last inequality holds by the fact (see Lemma 4). In addition,
where the penultimate inequality holds since the maximum operator is 1-Lipschitz. This completes the proof of (47).
We now move to establish (48). Note that there exists at least one state that is associated with the maximum of the value gap, i.e.,
Recall that is the optimal robust policy for the empirical RMDP . For convenience, we denote and . Then, since is the greedy policy w.r.t. , one has
(106) |
Recalling the notation in (37), the above fact and (48) altogether yield
(107) |
where (i) follows from the optimality criteria. The term of interest can be controlled as
(108) |
where (i) holds by plugging in (107), and (ii) follows from for any . Rearranging (A.2) leads to
Appendix B Proof of the auxiliary lemmas for Theorem 1
B.1 Proof of Lemma 6
To begin, note that there exists at least one state for any such that . With this in mind, for any policy , one has by the definition in (5) and the Bellman equation (7a),
where the second line holds since the reward function for all . To continue, note that for any , there exists some constructed by reducing the values of some elements of to obey and . This implies , where is the standard basis vector supported on , since . Consequently,
(109) |
where the second inequality holds by . Plugging this back to the previous relation gives
which, by rearranging terms, immediately yields
B.2 Proof of Lemma 7
Observing that each row of belongs to , it can be directly verified that each row of falls into . As a result,
(110) |
where (i) holds by Jensen’s inequality.
To continue, we denote the minimum value of as and . We then control as follows:
(111) |
where (i) holds by the fact that for any scalar and , (ii) follows from , and the last line arises from and . Plugging (111) back to (110) leads to
(112) |
where (i) holds by the triangle inequality, (ii) holds by following recursion, and the last inequality holds by .
B.3 Proof of Lemma 8
Step 1: controlling : bounding the first term in (56).
To control the two terms in (56), we first introduce the following lemma whose proof is postponed to Appendix B.5.
Lemma 11.
Armed with the above lemma, now we control the first term on the right hand side of (56) as follows:
(114) |
where (i) holds by , (ii) follows from Lemma 11, and the last inequality arises from
by applying the triangle inequality.
To continue, observing that each row of is a probability distribution whose entries sum to , we arrive at
(115) |
Armed with this fact, we shall control the other three terms in (114) separately.
• Consider . We first introduce the following lemma, whose proof is postponed to Appendix B.6.
Lemma 12.
Consider any . With probability at least , one has
• Consider . First, denoting , it follows from Lemma 6 that
(117) Then, we have for all , and , and :
(118)
• Consider . The following lemma plays an important role.
Lemma 13.
(Panaganti and Kalathil,, 2022, Lemma 6) Consider any . For any fixed policy and fixed value vector , one has with probability at least ,
Step 2: controlling : bounding the second term in (56).
To proceed, applying Lemma 11 on the second term of the right hand side of (56) leads to
(122) |
where the last term can be controlled in the same way as in (120). We now bound the above terms separately.
B.4 Proof of Lemma 9
Step 1: controlling : bounding the first term in (57).
To begin with, we introduce the following lemma, which controls the main term on the right-hand side of (57); its proof is provided in Appendix B.7.
Lemma 14.
Consider any . Taking , with probability at least , one has
(129) |
With Lemma 14 in hand, we have
(130) |
where (i) and (ii) hold by the fact that each row of is a probability vector that falls into .
The remainder of the proof will focus on controlling the three terms in (130) separately.
• For , we introduce the following lemma, whose proof is postponed to Appendix B.8.
Lemma 15.
Consider any . Taking and , one has with probability at least ,
•
• can be controlled similarly to in (119) as follows:
(133)
Finally, summing up the results in (131), (132), and (133) and inserting them back to (130) yields: taking and , with probability at least ,
(134) |
where the last inequality holds by taking .
Step 2: controlling : bounding the second term in (57).
B.5 Proof of Lemma 11
Step 1: controlling the point-wise concentration.
We first consider a more general term w.r.t. any fixed value vector (independent of ) obeying and any policy . Invoking Lemma 1 yields that for any ,
(141) |
where the last inequality holds since the maximum operator is 1-Lipschitz.
Then for a fixed and any vector that is independent of , using Bernstein’s inequality, one has with probability at least ,
(142) |
Step 2: deriving the uniform concentration.
To obtain the union bound, we first notice that is -Lipschitz w.r.t. for any obeying . In addition, we can construct an -net over whose size satisfies (Vershynin,, 2018). By the union bound and (B.5), it holds with probability at least that for all ,
(143) |
Combined with (141), it yields that,
(144) | ||||
(145) | ||||
(146) |
where (i) follows from that the optimal falls into the -ball centered around some point inside and is -Lipschitz, (ii) holds by (143), (iii) arises from taking , (iv) is verified by , and the last inequality is due to the fact and letting .
To continue, applying (145) and (146) with and (independent of ) and taking the union bound over gives that with probability at least , it holds simultaneously for all that
(147) |
By converting (147) to the matrix form, one has with probability at least ,
(148) |
B.6 Proof of Lemma 12
Following the same argument as (110), it follows
(149) |
To continue, we first focus on controlling . Towards this, denoting the minimum value of as and , we arrive at (see the robust Bellman’s consistency equation in (46))
(150) |
where the last line holds by letting . With the above fact in hand, we control as follows:
(151) | ||||
(152) |
where (i) holds by the fact that for any scalar and , (ii) follows from (150), (iii) arises from and , and the last inequality holds by Lemma 11.
Plugging (152) into (149) leads to
(153) |
where (i) holds by the triangle inequality. Therefore, the remainder of the proof shall focus on the first term, which follows
(154) |
by recursion. Inserting (154) back to (153) leads to
(155) |
where the penultimate inequality follows from applying Lemma 6 with and :
B.7 Proof of Lemma 14
To begin with, for any , invoking the results in (141), we have
(156) |
where (i) holds by the triangle inequality, and (ii) follows from and , and (iii) follows from (50).
To control in (156) for any given , and tame the dependency between and , we resort to the following leave-one-out argument motivated by (Agarwal et al.,, 2020; Li et al., 2022b, ; Shi and Chi,, 2022). Specifically, we first construct a set of auxiliary RMDPs which simultaneously have the desired statistical independence between robust value functions and the estimated nominal transition kernel, and are minimally different from the original RMDPs under consideration. Then we control the term of interest associated with these auxiliary RMDPs and show the value is close to the target quantity for the desired RMDP. The process is divided into several steps as below.
Step 1: construction of auxiliary RMDPs with deterministic empirical nominal transitions.
Recall that we target the empirical infinite-horizon robust MDP with the nominal transition kernel . Towards this, we can construct an auxiliary robust MDP for each state and any non-negative scalar , so that it is the same as except for the transition properties in state . In particular, we define the nominal transition kernel and reward function of as and , which are expressed as follows
(157) |
and
(158) |
It is evident that the nominal transition at state of the auxiliary is deterministic, i.e., it never leaves state once entered. This useful property removes the randomness of for all in state , which will be leveraged later.
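To make this construction concrete, the following Python sketch builds such an auxiliary nominal kernel and reward; the array shapes, the function name, and the convention that the reward at the absorbing state is pinned to the scalar u are schematic assumptions meant only to mirror the description above, not a verbatim transcription of (157)–(158).

```python
import numpy as np

def make_auxiliary_rmdp(P_hat, r, s_star, u):
    """Auxiliary nominal model of Step 1: identical to (P_hat, r) except that
    state `s_star` becomes absorbing under every action and its reward is
    replaced by the scalar `u`.  Assumed shapes: P_hat is (S, A, S), r is (S, A)."""
    P_aux = P_hat.copy()
    r_aux = r.copy()
    P_aux[s_star, :, :] = 0.0
    P_aux[s_star, :, s_star] = 1.0   # never leaves s_star once entered
    r_aux[s_star, :] = u             # reward at s_star pinned to u
    return P_aux, r_aux
```

Because the rows of P_aux at state s_star no longer depend on the samples collected at that state, the randomness there is removed, which is the decoupling leveraged in the subsequent steps.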
Correspondingly, the robust Bellman operator associated with the RMDP is defined as
(159) |
Step 2: fixed-point equivalence between and the auxiliary RMDP .
Recall that is the unique fixed point of with the corresponding robust value . We assert that the corresponding robust value function obtained from the fixed point of aligns with the robust value function derived from , as long as we choose in the following manner:
(160) |
where is the -th standard basis vector in . Towards verifying this, we shall break our arguments in two different cases.
- •
- •
Combining the facts in the above two cases, we establish that there exists a fixed point of the operator by taking
(163) |
Consequently, we confirm the existence of a fixed point of the operator . In addition, its corresponding value function also coincides with . Note that the correspondence between and established in Step 1 and Step 2 in fact holds for any uncertainty set.
Step 3: building an -net for all reward values .
It is easily verified that
(164) |
We can construct a -net over the interval , where the size is bounded by (Vershynin,, 2018). Following the same arguments in the proof of Lemma 4, we can demonstrate that for each , there exists a unique fixed point of the operator , which satisfies . Consequently, the corresponding robust value function also satisfies .
By the definitions in (157) and (158), we observe that for all , is statistically independent of . This independence indicates that and are independent for a fixed . With this in mind, invoking the facts in (145) and (146) and taking the union bound over all yields that, with probability at least , it holds for all that
(165) |
where the last inequality holds by the fact and letting .
Step 4: uniform concentration.
Recalling that (see (164)), we can always find some such that . Consequently, plugging in the operator in (159) yields
With this in mind, we observe that the fixed points of and obey
where the last inequality holds by the fact that is a -contraction. It directly indicates that
(166) |
Armed with the above facts, to control the first term in (156), invoking the identity established in Step 2 gives that: for all ,
(167) | |||
(168) |
where (i) holds by the triangle inequality, (ii) arises from (the last inequality holds by (166))
(169) |
(iii) follows from (B.7), (iv) can be verified by applying Lemma 3 with (166). Here, the penultimate inequality holds by letting , which leads to , and the last inequality holds by the fact and letting .
Step 5: finishing up.
Inserting (167) and (168) back into (156) and combining with (168) give that with probability at least ,
(170) |
holds for all .
Finally, we complete the proof by compiling everything into the matrix form as follows:
(171) |
B.8 Proof of Lemma 15
The proof can be achieved by directly applying the same routine as in Appendix B.6. Towards this, similar to (149), we arrive at
(172) |
To control , we denote the minimum value of as and . By the same argument as (151), we arrive at
(173) |
where the last inequality makes use of Lemma 14. Plugging (173) back into (172) leads to
(174) |
where (i) arises from following the routine of (153), (ii) holds by repeating the argument of (154), (iii) follows by taking and , and the last inequality holds by .
Appendix C Proof of the auxiliary facts for Theorem 2
C.1 Proof of Lemma 10
Deriving the robust value function over different states.
For any with , we first characterize the robust value function of any policy over different states. Before proceeding, we denote the minimum of the robust value function over states as below:
(175) |
Clearly, there exists at least one state that satisfies .
With this in mind, it is easily observed that for any policy , the robust value function at state obeys
(176) |
where (i) holds by for all and (74), and (ii) follows from for all .
Finally, we move on to compute , the robust value function at state associated with any policy . First, it obeys
(178) |
Recalling the transition kernel defined in (64) and the fact about the uncertainty set over state in (75), it is easily verified that the following probability vector obeys , which is defined as
(179) |
where due to (75). Similarly, the following probability vector also falls into the uncertainty set :
(180) |
Note that and defined above are the worst-case perturbations, since the probability mass at state will be moved to the state with the least value. Plugging the above facts about and into (178), we arrive at
(181) |
where the last equality holds by the definition of in (77). To continue, recursively applying (181) yields
(182) |
where (i) uses , (ii) follows from , and the penultimate line follows from the trivial fact that .
The optimal robust policy and optimal robust value function.
We move on to characterize the robust optimal policy and its corresponding robust value function. To begin with, denoting
(186) |
we rewrite (185) as
Plugging in the fact that in (73), it follows that . So for any , the derivative of w.r.t. obeys
(187) |
Observing that is increasing in , is increasing in , and is also increasing in (see the fact in (73)), the optimal policy in state thus obeys
(188) |
Considering that the action does not influence the state transition for all states , without loss of generality, we choose the robust optimal policy to obey
(189) |
Taking , we complete the proof by showing that the corresponding optimal robust value function at state is as follows:
(190) |
C.2 Proof of the claim (80)
Plugging in the definition of , we arrive at the following for any policy :
(191) |
which follows from applying (76) and basic calculus. Then, we proceed to control the above term in two cases separately in terms of the uncertainty level .
• When . Then, regarding the important terms in (191), we observe that
(192) which directly leads to
(193) where (i) holds by , and (ii) is due to (192). Inserting (192) and (193) back into (191), we arrive at
(194) where the last inequality holds by setting ()
(195) Finally, it is easily verified that
•
Appendix D Proof of the upper bound with the χ² divergence: Theorem 3
The proof of Theorem 3 mainly follows the structure of the proof of Theorem 1 in Appendix 5.2. Throughout this section, for any nominal transition kernel , the uncertainty set is taken as (see (10))
(201) |
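As a numerical companion to this definition, the inner worst-case expectation over the χ² ball can be computed directly from the primal formulation with an off-the-shelf constrained solver, as sketched below in Python; the helper name, the restriction to the support of the nominal distribution, and the use of SciPy are illustrative assumptions, whereas the analysis in this section relies on the dual reformulation of Lemma 2 rather than on any such solver.

```python
import numpy as np
from scipy.optimize import minimize

def chi2_worst_case_expectation(p0, v, sigma):
    """Directly solve  min_p  <p, v>  subject to
    sum_s (p_s - p0_s)^2 / p0_s <= sigma  and  p in the probability simplex,
    restricting attention to the support of p0."""
    support = p0 > 0
    q0, w = p0[support], v[support]
    constraints = [
        {"type": "eq", "fun": lambda q: np.sum(q) - 1.0},                       # sums to one
        {"type": "ineq", "fun": lambda q: sigma - np.sum((q - q0) ** 2 / q0)},  # chi^2 radius
    ]
    bounds = [(0.0, 1.0)] * q0.size
    res = minimize(lambda q: float(q @ w), q0, method="SLSQP",
                   bounds=bounds, constraints=constraints)
    return float(res.fun)

# toy usage with made-up numbers
p0 = np.array([0.5, 0.3, 0.2])
v = np.array([1.0, 0.2, 0.7])
print(chi2_worst_case_expectation(p0, v, sigma=0.1))
```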
D.1 Proof of Theorem 3
In order to control the performance gap , recall the error decomposition in (51):
(202) |
where (cf. (50)) shall be specified later (which justifies Remark 2). To further control (202), we bound the remaining two terms separately.
Step 1: controlling .
Towards this, recall the bound in (56) which holds for any uncertainty set:
(203) |
To control the main term in (203), we first introduce an important lemma whose proof is postponed to Appendix D.2.1.
Lemma 16.
Consider any and the uncertainty set . For any and any fixed policy , one has with probability at least ,
Step 2: controlling .
Recall the bound in (57) which holds for any uncertainty set:
(208) |
We introduce the following lemma which controls in (208); the proof is deferred to Appendix D.2.2.
Lemma 17.
Consider the uncertainty set and any . With probability at least , one has
(209) |
D.2 Proof of the auxiliary lemmas
D.2.1 Proof of Lemma 16
Step 1: controlling the point-wise concentration.
Consider any fixed policy and the corresponding robust value vector (independent of ). Invoking Lemma 2 yields that for any ,
(212) |
where the first inequality follows since the maximum operator is 1-Lipschitz, and the second inequality follows from the triangle inequality. Observing that the first term in (212) is exactly the same as (141), recalling the fact in (146) directly leads to: with probability at least ,
(213) |
holds for all . Then the remainder of the proof focuses on controlling the second term in (212).
Step 2: controlling the second term in (212).
For any given and fixed , applying the concentration inequality (Panaganti and Kalathil,, 2022, Lemma 6) with , we arrive at
(214) |
holds with probability at least . To obtain a uniform bound, we first observe the following lemma, proven in Appendix D.2.3.
Lemma 18.
For any obeying , the function w.r.t. obeys
In addition, we can construct an -net over whose size is (Vershynin,, 2018). Armed with the above, we can derive the uniform bound over : with probability at least , it holds that for any ,
(215) |
where (i) holds by the property of , (ii) follows from (214), (iii) arises from taking , and the last inequality is verified by .
Inserting (213) and (215) back to (212) and taking the union bound over , we arrive at that for all , with probability at least ,
Finally, we complete the proof by recalling the matrix form as below:
D.2.2 Proof of Lemma 17
Step 1: decomposing the term of interest.
The proof follows the routine of the proof of Lemma 14 in Appendix B.7. To begin with, for any , following the same arguments of (212) yields
(216) |
The remainder of the proof will focus on controlling the second term of (216).
Step 2: controlling the second term of (216).
Towards this, we recall the auxiliary robust MDP defined in Appendix B.7. Taking the uncertainty set for both and , we recall the corresponding robust Bellman operator in (159) and the following definition in (160)
(218) |
Following the arguments in Appendix B.7, it can be verified that there exists a unique fixed point of the operator , which satisfies . In addition, the corresponding robust value function coincides with that of the operator , i.e., .
We recall the -net over whose size obeys (Vershynin,, 2018). Then for all and a fixed , is statistically independent of , which indicates the independence between and . With this in mind, invoking the fact in (215) and taking the union bound over all and yields that, with probability at least ,
(219) |
holds for all .
To continue, we decompose the term of interest in (216) as follows:
(220) |
where (i) holds by the triangle inequality, (ii) arises from applying Lemma 3, and the last inequality holds by (50).
Step 4: finishing up.
D.2.3 Proof of Lemma 18
For any , one has
(224) |
where (i) holds by the fact for all , (ii) follows from the fact that for any and for any transition kernel , (iii) holds by the definition of in (40), and the last inequality arises from .
Appendix E Proof of the lower bound with the χ² divergence: Theorem 4
To prove Theorem 4, we shall first construct some hard instances and then characterize the sample complexity requirements over these instances. The structure of the hard instances is the same as that of the ones used in the proof of Theorem 2.
E.1 Construction of the hard problem instances
First, note that we shall use the same MDPs defined in Appendix 5.3.1 as follows
In particular, we shall keep the structure of the transition kernel in (64), reward function in (68) and initial state distribution in (69), while and shall be specified differently later.
Uncertainty set of the transition kernels.
Recalling the uncertainty set associated with the χ² divergence in (201), for any uncertainty level , the uncertainty set throughout this section is defined as :
(225) |
Clearly, whenever the state transition is deterministic for the χ² divergence. Here, and (whose choices will be specified later in more detail), which determine the instances, are specified as
(226) |
and
(227) |
This directly ensures that
since .
To continue, for any , we denote the infimum probability of moving to the next state associated with any perturbed transition kernel as
(228) |
In addition, we denote the transition from state to state as follows, which plays an important role in the analysis,
(229) |
Before continuing, we introduce some facts about and which are summarized as the following lemma; the proof is postponed to Appendix E.3.1.
Value functions and optimal policies.
Armed with the above facts, we are ready to derive the corresponding robust value functions, the optimal policies, and their corresponding optimal robust value functions. For any RMDP with the uncertainty set defined in (225), we denote the robust optimal policy as , and the robust value function of any policy (resp. the optimal policy ) as (resp. ). The following lemma describes some key properties of the robust (optimal) value functions and optimal policies, whose proof is postponed to Appendix E.3.2.
Lemma 20.
For any and any policy , one has
(231) |
where is defined as
(232) |
In addition, the optimal value functions and the optimal policies obey
(233a) | ||||
(233b) |
E.2 Establishing the minimax lower bound
Our goal is to control the performance gap w.r.t. any policy estimator based on the generated dataset and the chosen initial distribution in (69), which gives
(234) |
Step 1: converting the goal to estimate .
Step 2: arriving at the final results.
To continue, following the same definitions and argument in Appendix 5.3.2, we recall the minimax probability of the error and its property as follows:
(238) |
then we can complete the proof by showing given the bound for the sample size . In the following, we shall control the KL divergence terms in (238) in three different cases.
• Case 1: . In this case, applying yields
(239) Armed with the above facts, applying Tsybakov, (2009, Lemma 2.7) yields
(240) where (i) follows from the definition in (226), (ii) holds by plugging in the expression of in (236), and (iii) arises from (• ‣ E.2). The same bound can be established for . Substituting (240) back into (238) demonstrates that: if the sample size is chosen as
(241) then one necessarily has
(242) -
•
Case 2: . Applying the facts of in (227), one has
(243)
• Case 3: . Regarding this, one has
(247) Given and (• ‣ E.2), applying Tsybakov, (2009, Lemma 2.7) yields
(248) where (i) follows from the definition in (226), (ii) holds by plugging in the expression of in (236), and (iii) arises from (• ‣ E.2). The same bound can be established for . Substituting (248) back into (85) demonstrates that: if the sample size is chosen as
(249) then one necessarily has
(250)
Step 3: putting things together.
E.3 Proof of the auxiliary facts
We begin with some basic facts about the χ² divergence defined in (39) for any two Bernoulli distributions and , denoted as
(253) |
For , it is easily verified that the partial derivative w.r.t. obeys , implying that
(254) |
In other words, the χ² divergence increases as decreases from to .
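For concreteness, assuming the convention χ²(P ∥ Q) = Σ_x (P(x) − Q(x))²/Q(x) for the divergence in (39), the Bernoulli case admits the closed form below, and the displayed sign of the derivative yields the monotonicity claimed above for 0 < q < p < 1.

```latex
% chi-squared divergence between Ber(p) and Ber(q), with Ber(q) in the denominator
\chi^2\bigl(\mathrm{Ber}(p)\,\|\,\mathrm{Ber}(q)\bigr)
  = \frac{(p-q)^2}{q} + \frac{(p-q)^2}{1-q}
  = \frac{(p-q)^2}{q(1-q)},
\qquad
\frac{\partial}{\partial q}\,\frac{(p-q)^2}{q(1-q)}
  = -\,\frac{(p-q)\bigl(q(1-p)+p(1-q)\bigr)}{\bigl(q(1-q)\bigr)^2}
  < 0
  \quad\text{for } 0 < q < p < 1 .
```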
Next, we introduce the following function for any fixed and any :
(255) |
where (i) has been verified in Yang et al., (2022, Corollary B.2), and the last equality holds since . The next lemma summarizes some useful facts about , which again has been verified in Yang et al., (2022, Lemma B.12 and Corollary B.2).
Lemma 21.
Consider any . For , is convex and differentiable, which obeys
E.3.1 Proof of Lemma 19
Let us control and respectively.
Step 1: controlling .
We shall control in different cases w.r.t. the uncertainty level .
•
• Case 2: . Note that it suffices to treat as a Bernoulli distribution over states and , since we do not allow transitions to other states. Recalling in (226) and noticing the fact that
(257) one has the probability falls into the uncertainty set of of size . As a result, recalling the definition (229) leads to
(258) since .
Step 2: controlling .
To characterize the value of , we also divide into several cases separately.
• Case 1: . In this case, note that . Therefore, applying the fact that is convex and the form of its derivative in Lemma 21, one has
(259) Similarly, applying Lemma 21 leads to
(260) where the last inequality holds by due to the fact (cf. (227) and ). To sum up, given , combined with (256), we arrive at
(261) where the last inequality holds by (see (226)).
• Case 2: . We recall that in (226). To derive the lower bound for in (229), similar to (• ‣ E.3.1), one has
(262) where (i) follows from and (see (258)). For the other direction, similar to (• ‣ E.3.1), we have
(263) where (i) holds by (see (258)), (ii) follows from plugging in , and (iii) and (iv) arises from in (227). Combining (• ‣ E.3.1) and (263) yields
(264)
Step 3: combining all the results.
E.3.2 Proof of Lemma 20
The robust value function for any policy .
For any with , we first characterize the robust value function of any policy over different states.
Towards this, it is easily observed that for any policy , the robust value functions at state or any obey
(265a) | ||||
and | ||||
(265b) |
where (i) and (ii) follow from the facts that the transitions defined over states in (64) admit only one possible next state (leading to a non-random transition in the uncertainty set associated with the χ² divergence), and that for all and all .
To continue, the robust value function at state with policy satisfies
(266) | ||||
(267) |
where (i) holds since . Summing up the results in (265b) and (267) leads to
(268) |
With the transition kernel in (64) over state and the fact in (268), (266) can be rewritten as
(269) |
where (i) holds by the definition of and in (229), (ii) follows from the definition of in (232), and the last line holds by applying (265a) and solving the resulting linear equation for .
Optimal policy and its optimal value function.
To continue, observe that is increasing in , since the derivative of w.r.t. obeys
where the last inequality holds by . Further, since is also increasing in (see the fact in (229)), the optimal robust policy in state thus obeys
(270) |
Considering that the action does not influence the state transition for all states , without loss of generality, we choose the optimal robust policy to obey
(271) |
Taking and in (269), we complete the proof by showing the corresponding optimal robust value function at state as follows:
E.3.3 Proof of the claim (237)
Plugging in the definition of , we arrive at the following for any policy :
(272) |
where (i) holds by applying Lemma 20, (ii) arises from (see the definition of in (232) and the fact in (229)), and (iii) follows from the definition of in (232).
To further control (272), we consider it in two cases separately:
• Case 1: . In this case, applying Lemma 19 to (272) yields
(273) where the penultimate inequality follows from , and the last inequality holds by taking the specification of in (236) as follows:
(274) It is easily verified that taking as in (235) directly leads to meeting the requirement in (227), i.e., .
• Case 2: . Similarly, applying Lemma 19 to (272) gives
(275) Before continuing, it can be verified that
(276) where (i) is obtained by (see (226)). Applying the above fact to (275) gives
(277) where (i) holds by and (275), and the last equality holds by the specification in (236):
(278) As a result, it is easily verified that the requirement in (227)
(279) is met if we let
(280) as in (235).
The proof is then completed by summing up the results in the above two cases.
Appendix F Proof for the offline setting
F.1 Proof of the upper bounds: Corollary 1 and Corollary 3
As the proofs of Corollary 1 and Corollary 3 are similar, without loss of generality, we first focus on Corollary 1 in the case of the TV distance.
To begin with, suppose we have access to independent sample tuples in total, drawn either from the generative model or from a historical dataset. We denote the number of samples generated based on the state-action pair as , i.e.,
(281) |
Then according to (13), we can construct an empirical nominal transition kernel for DRVI (Algorithm 1):
(282) |
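For illustration, a minimal Python sketch of this estimator is given below; the tuple format and variable names are hypothetical, and the construction simply counts the visits N(s, a) together with the transition counts and normalizes each nonempty row.

```python
import numpy as np

def empirical_nominal_kernel(samples, S, A):
    """Build the empirical nominal kernel from transition tuples (s, a, s_next):
    P_hat(s' | s, a) = N(s, a, s') / N(s, a), left as zero whenever N(s, a) = 0."""
    counts = np.zeros((S, A, S))
    for s, a, s_next in samples:
        counts[s, a, s_next] += 1.0
    N_sa = counts.sum(axis=-1, keepdims=True)   # visit counts N(s, a)
    P_hat = np.divide(counts, N_sa, out=np.zeros_like(counts), where=N_sa > 0)
    return P_hat, N_sa.squeeze(-1)
```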
Armed with the above estimate of the nominal transition kernel, we introduce a slightly more general version of Theorem 1, which follows directly from the same proof routine as in Appendix 5.2.2.
Theorem 5 (Upper bound under TV distance).
Let the uncertainty set be , as specified by the TV distance (9). Consider any discount factor , uncertainty level , and . Based on the empirical nominal transition kernel in (282), let be the output policy of Algorithm 1 after iterations. Then with probability at least , one has
(283) |
for any , as long as
(284) |
Here, are some large enough universal constants.
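To make the statement above easier to parse, here is a schematic tabular sketch of distributionally robust value iteration for the TV case in Python; it is a simplified stand-in for Algorithm 1 (whose exact update rule, initialization, and iteration count are specified in the main text), with hypothetical names and shapes, and it inlines a copy of the greedy inner minimization sketched in Appendix A.1.

```python
import numpy as np

def tv_worst_case(p0, v, sigma):
    """Greedy inner minimization over the TV ball (cf. the sketch in Appendix A.1)."""
    p, target, budget = p0.astype(float).copy(), int(np.argmin(v)), sigma
    for s in np.argsort(v)[::-1]:
        if s == target or budget <= 0:
            continue
        moved = min(p[s], budget)
        p[s] -= moved
        p[target] += moved
        budget -= moved
    return float(p @ v)

def drvi_tv(P_hat, r, gamma, sigma, num_iters):
    """Schematic distributionally robust value iteration under TV uncertainty.
    Assumed shapes: P_hat is (S, A, S); r is (S, A) with rewards in [0, 1]."""
    S, A = r.shape
    V = np.zeros(S)
    for _ in range(num_iters):
        Q = np.array([[r[s, a] + gamma * tv_worst_case(P_hat[s, a], V, sigma)
                       for a in range(A)] for s in range(S)])   # robust Bellman update
        V = Q.max(axis=1)
    return Q.argmax(axis=1), V   # greedy policy w.r.t. the final Q, and its value estimate
```

Feeding in the empirical kernel from (282) and running for the number of iterations prescribed by the theorem mirrors, at a schematic level, the model-based pipeline analyzed here.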
Furthermore, we invoke a fact derived from basic concentration inequalities (Li et al.,, 2024) as below.
Lemma 22.
Consider any and a dataset with independent samples satisfying Assumption 1. With probability at least , the quantities obey
(285) |
simultaneously for all .
Now we are ready to verify Corollary 1. Armed with a historical dataset with independent samples that obeys Assumption 1, one has with probability at least ,
(286) |
as long as for all . Consequently, given , applying Theorem 5 with the fact for all (see (286)) directly leads to: DRVI can achieve an -optimal policy as long as
(287) |
namely
(288) |
where is some large enough universal constant. Note that the above inequality directly implies . This completes the proof of Corollary 1. The same argument holds for Corollary 3.
F.2 Proof of the lower bounds: Corollary 2 and Corollary 4
Analogous to Appendix F.1, without loss of generality, we first focus on verifying Corollary 2, where we use the TV distance to measure the uncertainty set.
We stick to the two hard instances and (i.e., with ) constructed in the proof for Theorem 2 (Appendix 5.3.1). Recall that the state space is defined as , where the corresponding action space for any state is . For states or , the action space is only . Hence, for a given factor , we can construct a historical dataset with samples such that the data coverage becomes the smallest over the state-action pairs and , i.e.,
(289) |
Armed with the above hard instance and historical dataset, we follow the proof procedure in Appendix 5.3.2 to verify the corollary. Our goal is to distinguish between the two hypotheses by considering the minimax probability of error as follows:
(290) |
where the infimum is taken over all possible tests constructed from the samples in .
Recall that we denote (resp. ) as the distribution of a sample tuple under the nominal transition kernel associated with , and that the samples are generated independently. Analogous to (85), one has
(291) |
where the last inequality holds by observing that
(292) |
Here, the last line holds by the fact that and (associated with and ) only differ from each other in the state-action pairs and , each of which has a visitation density of . Consequently, following the same routine from (86) to the end of Appendix 5.3.2, we apply (87) and (88) with and complete the proof by showing the following: if the sample size is selected as
(293) |
then one necessarily has
(294) |
We can follow the same argument to complete the proof of Corollary 4.