Robust Offline Reinforcement Learning for Non-Markovian Decision Processes
Abstract
Distributionally robust offline reinforcement learning (RL) aims to find a policy that performs best under the worst environment within an uncertainty set, using an offline dataset collected from a nominal model. While recent advances in robust RL focus on Markov decision processes (MDPs), robust non-Markovian RL has been limited to the planning problem, where the transitions in the uncertainty set are known. In this paper, we study the learning problem of robust offline non-Markovian RL. Specifically, when the nominal model admits a low-rank structure, we propose a new algorithm featuring a novel dataset distillation and a lower confidence bound (LCB) design for robust values under different types of the uncertainty set. We also derive new dual forms for these robust values in non-Markovian RL, making our algorithm more amenable to practical implementation. By further introducing a novel type-I concentrability coefficient tailored for offline low-rank non-Markovian decision processes, we prove that our algorithm can find an ε-optimal robust policy using offline samples. Moreover, we extend our algorithm to the case when the nominal model does not have a specific structure. With a new type-II concentrability coefficient, the extended algorithm also enjoys polynomial sample efficiency under all different types of the uncertainty set.
Keywords— Robust RL, Offline RL, POMDP, Non-Markovian Decision Processes
1 Introduction
In reinforcement learning (RL) (Sutton and Barto, 2018), an agent learns to make decisions by interacting with an unknown environment so as to maximize the total received reward. This learning framework has found wide applications such as video games (Ye et al., 2020), robotics (Ibarz et al., 2021), and language model fine-tuning (Ouyang et al., 2022). However, in many real-world scenarios such as autonomous driving and industrial control, directly training in the actual environment and generating data samples in an online manner is often unrealistic or even harmful. Offline RL (Lange et al., 2012; Levine et al., 2020), which leverages a dataset collected in the past, thus becomes important. Yet, pre-collected data often comes from noisy simulation or out-of-date measurements, causing the so-called sim-to-real problem (Peng et al., 2018; Zhao et al., 2020).
Distributionally robust RL (Iyengar, 2005) was proposed to address this issue by learning an optimal policy under the worst environment in an uncertainty set. Robust RL has been studied for Markov decision processes (MDPs) with known transition probabilities in Xu and Mannor (2010); Wiesemann et al. (2013); Yu and Xu (2015). It has also been studied for MDPs with unknown transitions (Panaganti and Kalathil, 2022; Wang and Zou, 2021; Dong et al., 2022; Panaganti et al., 2022; Blanchet et al., 2023). Regarding robust RL under non-Markovian decision processes, only robust POMDPs with known transitions in the uncertainty set have been studied in Osogami (2015); Rasouli and Saghafian (2018); Nakao et al. (2021). To the best of our knowledge, there has not been any study on robust non-Markovian RL with unknown transitions.
Several challenges need to be addressed in order to study robust offline non-Markovian RL with unknown transitions. (i) Considering non-Markovian processes over the entire uncertainty set can cause a dramatic increase in sample complexity. The challenge lies in how to exploit the structural conditions of these non-Markovian processes for sample-efficient design. (ii) The existing coverage condition for offline non-Markovian RL (Huang et al., 2023) is strong and becomes even more stringent in the robust offline setting. It is thus critical to relax this condition in order to design desirable offline algorithms. These challenges pose the following fundamental open question:
Q: Can we design a provably efficient algorithm for learning robust non-Markovian decision processes from offline data, which (i) effectively exploits the structure of non-Markovian decision processes, and (ii) relaxes the coverage condition on the offline dataset?
Our contributions. In this paper, we provide the first study of robust non-Markovian RL with unknown transitions, i.e., an unknown uncertainty set. We focus on the setting where an offline dataset is provided, collected from one (unknown) nominal environment in the uncertainty set, and the environments are non-Markovian. We summarize our contributions as follows.
•
New algorithm design. We propose a novel algorithm for robust offline RL when the nominal model admits a low-rank structure of predictive state representations (PSRs). In particular, we consider two types of uncertainty sets, namely -type and -type, for non-Markovian decision processes. Our algorithm features two new elements: (i) a novel dataset distillation, where the distilled dataset ensures that the estimated model from vanilla maximum likelihood estimation performs well on the distilled data; and (ii) a lower confidence bound (LCB) design for the robust value, which is derived by leveraging the low-rank structure and the newly developed dual forms for robust values under both -type and -type uncertainty sets.
•
Novel coverage condition and theory for low-rank non-Markovian RL. We characterize the near-optimality guarantee of our algorithm based on a novel type-I concentrability coefficient for offline learning in low-rank non-Markovian decision processes. This coefficient is much more relaxed than the one proposed for offline low-rank non-Markovian RL in Huang et al. (2023); the relaxation comes from a refined analysis of the distributional shift. We also identify a wellness condition number of the nominal model. Our results suggest that as long as the concentrability coefficient and the condition number are finite, our algorithm can find a robust policy with a suboptimality gap vanishing at a rate of , where is the number of offline samples.
•
Algorithm and theory for general non-Markovian RL. We develop a robust algorithm for general offline non-Markovian RL where the transition probabilities may not have any structure. We show that our algorithm can learn a near-optimal robust policy efficiently provided that the offline dataset has a finite type-II concentrability coefficient, which is also a much weaker condition than that proposed in Huang et al. (2023).
2 Related Work
Offline RL. The offline RL problem has been investigated extensively in the literature (Antos et al., 2008; Bertsekas, 2011; Lange et al., 2012; Chen and Jiang, 2019; Xie and Jiang, 2020; Levine et al., 2020; Xie et al., 2021). Many methods such as Q-learning (Jin et al., 2021), its variant Fitted Q-Iteration (FQI) (Gordon, 1995; Ernst et al., 2005; Munos and Szepesvári, 2008; Farahmand et al., 2010; Lazaric et al., 2012; Liu et al., 2020; Xie et al., 2021), value iteration (Nguyen-Tang et al., 2023; Xiong et al., 2022), policy gradient (Zhou et al., 2017), and model-based approaches (Uehara and Sun, 2021; Uehara et al., 2021) have been shown to be effective in offline RL either theoretically or empirically. Among them, the pessimism principle is a popular approach to handle the distribution shift caused by an offline policy. Empirically, pessimistic approaches perform well on simulated control tasks (Kidambi et al., 2020; Yu et al., 2020; Kumar et al., 2020; Liu et al., 2020; Chang et al., 2021). On the theoretical side, pessimism allows us to obtain Probably Approximately Correct (PAC) guarantees for various models under a coverage condition (Jin et al., 2021; Yin et al., 2021; Rashidinejad et al., 2021; Zanette et al., 2021; Uehara and Sun, 2021; Uehara et al., 2021). In this work, we also adopt the pessimism principle and are the first to establish its efficiency for robust offline non-Markovian RL. We develop new coverage conditions in order to guarantee provable efficiency.
Robust RL. Robust MDPs (RMDPs) were first introduced by Iyengar (2005); Nilim and El Ghaoui (2005). Most RMDP research in the literature focuses on the planning problem (Xu and Mannor, 2010; Wiesemann et al., 2013; Tamar et al., 2014; Yu and Xu, 2015; Mannor et al., 2016; Petrik and Russel, 2019; Wang et al., 2023a; Wang et al., 2023b), providing computationally efficient algorithms. Based on the data generation process, studies on the learning problem in RMDPs can be primarily divided into three categories. RMDPs with a generative model have been studied in Zhou et al. (2021); Yang et al. (2022); Panaganti and Kalathil (2022), where the generative model can uniformly collect offline data for each and every state-action pair. Online RMDPs have been studied in Roy et al. (2017); Lim et al. (2013); Wang and Zou (2021); Dong et al. (2022), which allow the agent to adaptively sample states and actions in the nominal model. Offline RMDPs have been studied in Panaganti et al. (2022); Blanchet et al. (2023), where only a pre-collected dataset is provided, which makes the problem more difficult due to distributional shift (Levine et al., 2020).
In addition to RMDPs, several works have studied robust POMDPs under assumptions such as known transitions in the uncertainty set (Osogami, 2015; Rasouli and Saghafian, 2018; Nakao et al., 2021). In particular, they focus on the planning problem in robust POMDPs, i.e., finding an optimal policy with full information of the uncertainty set, while our work focuses on the learning problem and characterizes the sample complexity with unknown uncertainty sets.
Non-Markovian RL. Learning in non-Markovian decision processes is both computationally and statistically intractable in general (Mossel and Roch, 2005; Mundhenk et al., 2000; Papadimitriou and Tsitsiklis, 1987; Vlassis et al., 2012; Krishnamurthy et al., 2016). Recently, a line of research (Boots et al., 2011; Jiang et al., 2018; Zhang et al., 2022; Zhan et al., 2022; Uehara et al., 2022; Liu et al., 2022; Chen et al., 2022; Zhong et al., 2022; Huang et al., 2023) has obtained polynomial sample complexity under various structural conditions. Among them, the low-rank structure (Uehara et al., 2022; Liu et al., 2022) has been proved to admit predictive state representations (PSRs) (Littman and Sutton, 2001; Singh et al., 2012), and captures and generalizes a rich subclass of non-Markovian decision-making problems such as MDPs and POMDPs. However, none of these works studies robust offline non-Markovian RL.
3 Preliminaries
Notations. For a vector , stands for , and its -th coordinate is denoted by . For functions and (not necessarily probability measures) over a set , the distance between them is , while the Hellinger-squared distance is defined as . For a sequence with , we write it as for short. For a sequence of sets , the Cartesian product is denoted by . When these sets are identical, this product simplifies to . denotes the set of all probability distributions over the set .
3.1 Non-Markovian Decision Processes
In this work, we consider finite-horizon episodic (non-Markovian) decision processes with an observation space , an action space and a horizon . The dynamics of the process is determined by a set of distributions , where denotes a sequence of observations up to time , denotes a sequence of actions up to time , and denotes the probability of observing given history . A policy is defined by a collection of distributions chosen by an agent, with each being a distribution over . Specifically, at the beginning of each episode, the process starts at a fixed observation at time step 1. After observing , the agent samples (takes) an action according to the distribution (policy) , and the process transits to , which is sampled according to the distribution . Then, at any time step , due to the non-Markovian nature, the agent samples an action based on all past information , and the process transits to , sampled from . The interaction terminates after time step .
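To make this interaction protocol concrete, the following minimal sketch (in Python; the callables `transition` and `policy` are our own illustrative interface, not objects defined in the paper) rolls out one episode in which both the action distribution and the next-observation distribution may depend on the entire history:

```python
import random

def sample(dist):
    # Draw one element from a dict mapping outcomes to probabilities.
    return random.choices(list(dist), weights=list(dist.values()))[0]

def sample_trajectory(transition, policy, horizon):
    # transition(history) -> distribution over the next observation given all past (obs, action) pairs;
    # policy(history, obs) -> distribution over actions given all past pairs plus the current observation.
    # Both may depend on the entire history, which is what makes the process non-Markovian.
    history, obs = [], "o_1"                 # the episode starts from a fixed initial observation
    for h in range(1, horizon + 1):
        action = sample(policy(history, obs))
        history.append((obs, action))
        if h < horizon:                      # the interaction terminates after the final time step
            obs = sample(transition(history))
    return history
```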
Notably, the general non-Markovian decision-making problem subsumes not only fully observable MDPs but also POMDPs as special cases. In MDPs, , and in POMDPs, can be factorized as , where represents unobserved states, and are called emission distributions and transition distributions, respectively.
We note that an alternative, equivalent description of the process dynamics is the set of probabilities , which is also frequently used in this paper. In fact, both and fully determine the dynamics with an additional constraint for , namely, that is constant with respect to for all action sequences . Hence, we use to represent the model dynamics. To make the presentation more explicit, we introduce a unified parameterization for the process dynamics, written as . Here, is the complete set of model parameters and is usually a large integer. We also use to refer to a model.
For a given reward function and any policy , the value of the policy under the dynamics is denoted by . We remark here that the goal of an agent in standard RL is to find an ε-optimal policy such that for some small error ε under a single model .
3.2 Robust RL
While standard RL assumes that the system dynamics remains the same, robust RL diverges from this setting by focusing on the ‘worst-case’ or robust value in situations where the parameter can change within a defined uncertainty set (where will be defined below). Mathematically, given a policy , the robust value under the uncertainty set is defined by .
In this paper, we consider two types of uncertainty sets defined respectively on and via -divergence , where is a convex function for measuring the uncertainty. Given a nominal model , a -type uncertainty set under -divergence with center and radius is defined as
(1)
Similarly, a -type uncertainty set under -divergence with center and radius is defined as
(2)
For simplicity, we set to be a constant throughout the paper. Our proposed algorithm and analysis can be easily extended to the case when is a function of or . Notably, the -type uncertainty set reduces to the -rectangular uncertainty set (Iyengar, , 2005) if reduces to a Markov decision process, i.e., . However, due to the non-Markovian nature of our problem, we also consider the -type uncertainty set as also fully determines the process.
We study two example types of -divergence, which correspond to two classical divergences: (i) if , then is the total variation (TV) distance; (ii) if , then is the Kullback–Leibler (KL) divergence. We also use to represent for simplicity.
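For completeness, under the standard normalization the two divergences correspond to the generators below; the exact scaling of the convex function used in the paper may differ, so this is stated only as a reference.

```latex
f_{\mathrm{TV}}(t) = \tfrac{1}{2}\,\lvert t-1\rvert
  \quad\Longrightarrow\quad
  D_f(P \,\|\, Q) = \tfrac{1}{2}\sum_{x}\bigl\lvert P(x)-Q(x)\bigr\rvert ,
\qquad
f_{\mathrm{KL}}(t) = t\log t
  \quad\Longrightarrow\quad
  D_f(P \,\|\, Q) = \sum_{x} P(x)\log\frac{P(x)}{Q(x)} .
```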
We focus on robust offline RL, where the agent is provided with an offline dataset that contains sample trajectories collected via a behavior policy under the nominal model . Given the type of the uncertainty set (without knowing the center ), the learning goal of the agent is to find a near-optimal robust policy using this offline dataset such that
3.3 Low-rank Non-Markovian Decision Process
We first introduce some additional notations that are helpful for describing low-rank non-Markovian decision processes. Given a historical trajectory at time step as , we denote a future trajectory as . The set of all is denoted by and the set of all future trajectories is denoted by . In addition, we add a superscript or to either a historical trajectory or a future trajectory to represent the observation sequence and the action sequence contained in that trajectory, respectively. For example, and . To better describe a policy, we also use the notation for the available information of the learning agent at time step . We remark that having in the same equation implies that
Recall that a non-Markov decision process can be fully determined by . We introduce a collection of dynamic matrices . The entry at the -th row and -th column of is .
Definition 3.1 (Rank- non-Markov decision process).
A non-Markov decision process has rank if, for any , the model dynamics matrix has rank .
It has been proved that any low-rank sequential decision-making problem admits predictive state representations (PSRs) (Liu et al., 2022) if, for each , there exists a set of future trajectories known to the agent, namely core tests , such that the submatrix restricted to these tests (columns) has rank , where is a positive integer (setting the same for different is only for notational simplicity). We consider all low-rank problems with the same set of core tests , and have the following representations.
where , and . We denote to be the representation. In particular, is the set of all low-rank problems with the same core tests .
We assume is known to the agent (Huang et al., 2023), and employ a bar over a feature to denote its normalization . is also known as the prediction vector (Littman and Sutton, 2001) or prediction feature of , since is the probability of observing given the history and taking action sequence .
Moreover, let be the set of action sequences that appear in the core tests, constructed by eliminating repeated action sequences; it is known as the set of core action sequences.
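To illustrate how a PSR parameterizes the dynamics, the sketch below (numpy; the operator-form interface is our assumption, following the standard PSR factorization in which a trajectory probability is a product of per-step matrices applied to an initial feature) computes a trajectory probability and a normalized prediction feature. The l1 normalization is our choice for illustration.

```python
import numpy as np

def psr_trajectory_probability(q0, operators, phi, trajectory):
    # q0:         initial prediction feature, shape (r,)
    # operators:  operators[h][(o, a)] is an (r, r) matrix M_h(o, a), 0-indexed over steps 1, ..., H-1
    # phi:        final feature map, phi[(o_H, a_H)] is an (r,) vector
    # trajectory: [(o_1, a_1), ..., (o_H, a_H)]
    # Returns P(o_{1:H} | a_{1:H}) = phi(o_H, a_H)^T M_{H-1}(o_{H-1}, a_{H-1}) ... M_1(o_1, a_1) q0.
    psi = np.asarray(q0, dtype=float)
    for h, (o, a) in enumerate(trajectory[:-1]):
        psi = operators[h][(o, a)] @ psi     # un-normalized prediction feature of the length-(h+1) prefix
    o_H, a_H = trajectory[-1]
    return float(phi[(o_H, a_H)] @ psi)

def normalized_feature(psi):
    # "bar" feature: the prediction feature rescaled to unit l1 norm (illustrative normalization)
    psi = np.asarray(psi, dtype=float)
    return psi / np.sum(np.abs(psi))
```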
We further assume that the PSRs studied in this paper are well-conditioned, as specified in Assumption 3.2. Such an assumption and its variants are commonly adopted in the study of PSRs (Liu et al., 2022; Chen et al., 2022; Zhong et al., 2022; Huang et al., 2023).
Assumption 3.2 (-well-conditioned PSR).
A PSR is said to be -well-conditioned if
where
Assumption 3.2 requires that the error in estimating does not blow up significantly when the estimation error on the probabilities of the core tests is small.
4 Robust Offline RL with Low-rank Nominal Model
In this section, we study the setting where the nominal model has a low-rank structure (with rank ) (Zhan et al., 2022; Liu et al., 2022; Chen et al., 2022; Huang et al., 2023). To unify the notations, we use to denote the uncertainty set, where is either or .
4.1 Algorithm Design
We present our algorithm for robust offline RL when the nominal model enjoys a low-rank structure. Our algorithm features two main novel elements: (a) a dataset distillation, which ensures that the estimated prediction features of the distilled data are close to the true features, and thus enables us to construct (b) a lower confidence bound (LCB) for the robust value . In particular, we develop a new dual form for the robust value under the -type uncertainty set, which is more amenable to practical implementation. The main steps of the algorithm are explained in detail as follows, and the pseudo-code is presented in Algorithm 1.
Model estimation and dataset reform. First, we perform maximum likelihood estimation of the nominal model by minimizing the following negative log-likelihood loss function
(3)
which can be done using a supervised learning oracle, e.g., stochastic gradient descent. Let .
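A minimal sketch of this estimation step is given below (PyTorch; the `log_prob` interface is our assumption for a generic parameterized model, not part of the paper). Vanilla MLE simply minimizes the empirical negative log-likelihood of the offline trajectories by gradient descent.

```python
import torch

def mle_estimate(model, dataset, lr=1e-2, epochs=100):
    # model:   a torch.nn.Module exposing log_prob(tau), the log-probability of trajectory tau
    #          under the current parameters (the model's trajectory log-likelihood).
    # dataset: offline trajectories collected by the behavior policy under the nominal model.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        nll = -torch.stack([model.log_prob(tau) for tau in dataset]).mean()
        nll.backward()
        opt.step()
    return model   # the (approximate) MLE of the nominal model
```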
Novel dataset distillation. Motivated by Huang et al. (2023), to construct a desirable lower confidence bound for any value function under the nominal model, the estimated model needs to assign non-negligible probabilities to every sample trajectory. Different from Huang et al. (2023), which introduces additional constraints into the optimization oracle, we propose a novel idea that keeps only the “good” sample trajectories whose estimated probability under the estimated model exceeds a pre-defined small value, and collects those sample trajectories in the set , defined as follows.
(4)
We prove in Lemma B.1 that, with high probability, , which still guarantees a sufficient number of samples for learning. Then, is randomly and evenly divided into datasets , in order to separately learn for each .
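A sketch of the distillation step, assuming the estimated model exposes a trajectory probability `prob(tau)` (our naming), is:

```python
import random

def distill_and_split(dataset, model_hat, p_min, num_splits):
    # Keep only trajectories whose probability under the MLE estimate exceeds the threshold p_min
    # (cf. Equation (4)), then split the surviving trajectories randomly and (roughly) evenly
    # into num_splits sub-datasets, one per time step to be learned separately.
    good = [tau for tau in dataset if model_hat.prob(tau) > p_min]
    random.shuffle(good)
    return [good[i::num_splits] for i in range(num_splits)]
```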
LCB of robust value. Given the estimated model and the divergence type , Algorithm 1 constructs a lower confidence bound (LCB) of for every policy through a bonus function defined as:
(5)
where and are pre-defined parameters. Recall that is the prediction feature described in Section 3.3. Then, we can show that is an LCB of , where (defined in Section 4.2) is a scaling parameter that depends on the type of uncertainty set.
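As an illustration only, one common elliptical-potential form of such a feature-based bonus is sketched below (numpy; all names are ours, and the exact expression in Equation (5) used by Algorithm 1 may differ):

```python
import numpy as np

def elliptical_bonus(psi_hat, distilled_features, alpha, lam):
    # distilled_features: estimated prediction features of the trajectories in the distilled dataset.
    # Bonus of the form min{ alpha * ||psi_hat||_{Lambda^{-1}}, 1 }, where Lambda is the regularized
    # empirical covariance of the distilled features. Illustrative only; see Equation (5).
    psi_hat = np.asarray(psi_hat, dtype=float)
    Lambda = lam * np.eye(len(psi_hat)) + sum(np.outer(f, f) for f in distilled_features)
    width = alpha * np.sqrt(psi_hat @ np.linalg.solve(Lambda, psi_hat))
    return min(width, 1.0)
```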
Novel dual form of robust value. For the -type uncertainty sets, we define the robust value function as:
where . Then, the following Bellman-type equations hold for robust value functions.
Lemma 4.1.
For any , let . Then, we have
Therefore, the robust value under the -type uncertainty set can be calculated using the recursive formula above. Notably, the calculation of involves solving an optimization problem of the form , where and are two probability distributions over a set . This optimization problem can be solved efficiently, as shown in the literature (Panaganti et al., 2022; Blanchet et al., 2023). We also provide its dual form in Appendix A.
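As a concrete illustration for a finite observation set, the sketch below (numpy/scipy; the function names are ours) solves this inner problem for the two divergences we consider: the TV case via its greedy primal solution, and the KL case via its well-known one-dimensional dual.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def robust_expectation_tv(p, v, rho):
    # min_q E_q[v] over distributions q with TV(q, p) <= rho on a finite set: move at most rho of
    # probability mass from the largest-v outcomes onto the smallest-v outcome.
    p, v = np.asarray(p, float), np.asarray(v, float)
    q, budget, i_min = p.copy(), rho, int(np.argmin(v))
    for i in np.argsort(v)[::-1]:
        if i == i_min:
            continue
        take = min(q[i], budget)
        q[i] -= take
        budget -= take
    q[i_min] += rho - budget
    return float(q @ v)

def robust_expectation_kl(p, v, rho):
    # Dual form: min_{KL(q||p) <= rho} E_q[v] = sup_{lam > 0} { -lam * log E_p[exp(-v/lam)] - lam * rho }.
    p, v = np.asarray(p, float), np.asarray(v, float)
    v0 = v.min()
    w = v - v0                                     # shift values for numerical stability
    neg_dual = lambda lam: lam * np.log(p @ np.exp(-w / lam)) + lam * rho
    res = minimize_scalar(neg_dual, bounds=(1e-6, 1e3), method="bounded")  # heuristic search range
    return float(v0 - res.fun)
```

For instance, robust_expectation_tv([0.5, 0.5], [0.0, 1.0], 0.2) returns 0.3, since a TV budget of 0.2 shifts that much mass from the high-value outcome to the low-value one.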
On the other hand, when the uncertainty set is -type, we develop a new dual form of the robust value and in Equation 6 and Equation 7, respectively. Note that, in those equations, .
(6)
(7)
Output policy design. Finally, Algorithm 1 outputs a policy , which maximizes a lower confidence bound of .
4.2 Theoretical Guarantee
In this section, we present the theoretical results for Algorithm 1. Before we state the theorem, we need to introduce several definitions and condition coefficients.
Bracketing number. First, we introduce a complexity measure of the model class , which is standard in the literature (Liu et al., 2022; Huang et al., 2023).
Definition 4.2 (-Bracketing number of ).
Suppose for each , there exists a function parameterized by . Given two functions and , the bracket is the set of all satisfying . An -bracket is a bracket with . The bracketing number is the minimum number of -brackets needed to cover .
Note that if is the parameter space for tabular PSRs with rank , we have (see Theorem 4.7 in Liu et al., (2022)).
Novel concentrability coefficient. Under the low-rank structure and PSRs, we introduce the following novel concentrability coefficient , namely the type-I concentrability coefficient, of a behavior policy against a robust policy .
Remark 4.3.
Requiring a finite type-I concentrability coefficient is much weaker than requiring a finite concentrability coefficient as defined in Huang et al. (2023), in which a point-to-point probability ratio is used, i.e., . It is obvious that . Moreover, even if is exponentially small, indicating that is exponentially large, the minimum eigenvalue of the covariance matrix of under (i.e., ) could still be bounded away from zero. By Assumption 3.2, we have , which implies that the inverse of the norm of the prediction feature scales in Poly. Thus, with a good behavior policy , the scaling of is likely to be polynomial.
Wellness condition of the nominal model. When the uncertainty set is -type, we further introduce a wellness condition number of the nominal model defined as follows.
Intuitively, this condition number is the maximum likelihood ratio between the distributions and , and it depends only on the properties of the nominal model and the shape of the uncertainty set. In particular, when , we must have
which should hold for all policies , and thus implies that is highly likely to be small.
Scaling parameters. Finally, we briefly introduce the scaling parameter that is used to construct the LCB of the robust value. Equipped with the dual forms, we define , . When the uncertainty set is determined by the KL divergence, we make the following assumption.
Assumption 4.4.
Suppose that is an optimal solution of Equation 7, and is an optimal solution of Equation 7 when is replaced by . We assume that the agent has an estimate of a lower bound of the coordinates of both and , i.e., .
Assumption 4.5.
is an optimal solution to the following optimization problem:
We assume that the agent has an estimate of a lower bound of all and , i.e.,
Remark 4.6.
The optimization problem mentioned in Assumption 4.5 is the dual problem that needs to be solved in the Bellman-type equation (Lemma 4.1). In the algorithm implementation, it suffices to estimate and , and such values are not difficult to obtain since is known. We make these two assumptions for notational simplicity. Similar assumptions also appear in Blanchet et al. (2023).
Define the scaling parameters , and
We are now ready to present the theorem characterizing the performance of Algorithm 1.
Theorem 4.7.
Suppose Assumption 3.2, Assumption 4.4, and Assumption 4.5 hold. Let be the maximum number of core action sequences. Let , , , , , and . Then, with probability at least , we have the following four results for the suboptimality gap of the policy output by Algorithm 1 under four settings.
(i) -type uncertainty set under TV distance: ,
(ii) -type uncertainty set under KL divergence: ,
(iii) -type uncertainty set under TV distance: ,
(iv) -type uncertainty set under KL divergence: ,
where is the optimal robust policy under each setting, is the type-I coefficient of the behavior policy against , , is the wellness condition number of , , , and are defined in Assumption 4.4 and Assumption 4.5.
The theorem states that as long as the type-I concentrability coefficient of against the optimal robust policy is bounded and the nominal model has a finite wellness condition number , Algorithm 1 finds an ε-optimal robust policy provided that the offline dataset has sample trajectories.
Proof sketch of Theorem 4.7. The proof consists of three main steps. (i) A novel MLE guarantee. We show that the estimator produced by vanilla MLE already performs well on a large portion of the offline data samples, namely, . This result is different from the MLE guarantee proved in Huang et al. (2023), where a constrained MLE oracle is required. (ii) Simulation lemma for robust values. By utilizing the newly proposed dual forms in Equation 6 and Equation 7, we show that for any model (not necessarily low-rank), the difference is upper bounded by , where depends on the uncertainty type. (iii) Applying the concentrability coefficient. In low-rank non-Markovian decision processes, the distance can be upper bounded by (Liu et al., 2022; Huang et al., 2023). By Cauchy's inequality, this quantity is determined by the covariance matrix , which can thus be controlled by the type-I concentrability coefficient and the covariance matrix under the behavior policy . Combining this with the MLE guarantee yields the final result. The full proof can be found in Appendix C.
5 Robust Offline RL under General Non-Markovian Decision Processes
In this section, we first introduce a more general algorithm (see Algorithm 2) for robust offline learning in non-Markovian decision processes, and then present its theoretical guarantee.
5.1 Algorithm Design
The general algorithm utilizes double pessimism for robust offline RL. It consists of two parts: model estimation and robust policy planning. We note that a similar double pessimism principle has been adopted in Blanchet et al. (2023). However, due to the non-Markovian nature of our problem, our robust value and its dual form, as well as the corresponding analysis, are different from theirs.
Model estimation. We construct the confidence set for the nominal model using the offline dataset .
(8)
where is a pre-defined parameter. This confidence set ensures that and any model is “close” to the nominal model under the behavior policy .
Robust policy planning. Then, we output a policy based on the double pessimism principle. Specifically, each model is close to the nominal model , so we examine its robust value function under a policy . By minimizing the robust value over the confidence set , we obtain a lower confidence bound on the true robust value, since . Finally, we output an optimal policy with respect to the LCB of . Mathematically,
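In schematic form, for finite candidate policy and model classes, this max-min planning step can be sketched as follows (the names `policies`, `confidence_set`, and `robust_value` are ours; `robust_value(theta, pi)` is assumed to return the robust value of `pi` when `theta` is the center of the uncertainty set):

```python
def double_pessimism_policy(policies, confidence_set, robust_value):
    # Pessimism over models: the LCB of a policy is its worst robust value over the confidence set.
    def lcb(pi):
        return min(robust_value(theta, pi) for theta in confidence_set)
    # Output the policy that maximizes this lower confidence bound.
    return max(policies, key=lcb)
```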
5.2 Theoretical Guarantee
In this section, we present the theoretical guarantees for Algorithm 2. Since the nominal model does not have a specific structure or representation, the type-I concentrability coefficient is not suitable for this case. Therefore, we introduce the type-II concentrability coefficient of a behavior policy against a robust policy , defined as .
Remark 5.1.
In general, requiring a finite type-II concentrability coefficient is stronger than requiring a finite type-I concentrability coefficient as defined in Section 4.2. However, it is still a more relaxed coverage assumption than that in Huang et al. (2023). This is because, in the analysis, we rescale the estimation error across the distribution induced by the behavior policy instead of using a point-wise bound.
We are now ready to present the main result for Algorithm 2.
Theorem 5.2.
Let , . Then, with probability at least , we have the following four results for the suboptimality gap of the policy output by Algorithm 2.
(i) -type uncertainty set under TV distance:
(ii) -type uncertainty set under KL divergence:
(iii) -type uncertainty set under TV distance:
(iv) -type uncertainty set under KL divergence:
where is the optimal robust policy under each setting, is the type-II coefficient of the behavior policy against , is the wellness condition number defined in Section 4.2, and are defined in Theorem 4.7.
The theorem suggests that Algorithm 2 is more sample efficient than Algorithm 1, thanks to the general double pessimism design and the type-II concentrability coefficient, while Algorithm 1 is more amenable to practical implementation due to the low-rank structure, the computationally tractable bonus function, and the newly developed dual forms.
6 Conclusion
In this work, we studied robust offline non-Markovian RL with unknown transitions. We developed the first sample-efficient algorithm for the case when the nominal model admits a low-rank structure. To characterize the theoretical guarantee of our algorithm, we proposed a novel type-I concentrability coefficient for PSRs, which can be of independent interest in the study of offline low-rank non-Markovian RL. In addition, we extended our algorithm to a more general case in which the nominal model does not have any specific structure. With the notion of the type-II concentrability coefficient, we proved that the extended algorithm also enjoys sample efficiency under all different types of the uncertainty set.
References
- Antos et al., (2008) Antos, A., Szepesvári, C., and Munos, R. (2008). Learning near-optimal policies with bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71:89–129.
- Bertsekas, (2011) Bertsekas, D. P. (2011). Approximate policy iteration: A survey and some new methods. Journal of Control Theory and Applications, 9:310–335.
- Blanchet et al., (2023) Blanchet, J., Lu, M., Zhang, T., and Zhong, H. (2023). Double pessimism is provably efficient for distributionally robust offline reinforcement learning: Generic algorithm and robust partial coverage. arXiv preprint arXiv:2305.09659.
- Boots et al., (2011) Boots, B., Siddiqi, S. M., and Gordon, G. J. (2011). Closing the learning-planning loop with predictive state representations. The International Journal of Robotics Research, 30(7):954–966.
- Chang et al., (2021) Chang, J., Uehara, M., Sreenivas, D., Kidambi, R., and Sun, W. (2021). Mitigating covariate shift in imitation learning via offline data with partial coverage. Advances in Neural Information Processing Systems, 34:965–979.
- Chen et al., (2022) Chen, F., Bai, Y., and Mei, S. (2022). Partially observable rl with b-stability: Unified structural condition and sharp sample-efficient algorithms. arXiv preprint arXiv:2209.14990.
- Chen and Jiang, (2019) Chen, J. and Jiang, N. (2019). Information-theoretic considerations in batch reinforcement learning. In International Conference on Machine Learning, pages 1042–1051. PMLR.
- Dong et al., (2022) Dong, J., Li, J., Wang, B., and Zhang, J. (2022). Online policy optimization for robust mdp. arXiv preprint arXiv:2209.13841.
- Ernst et al., (2005) Ernst, D., Geurts, P., and Wehenkel, L. (2005). Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6.
- Farahmand et al., (2010) Farahmand, A.-m., Szepesvári, C., and Munos, R. (2010). Error propagation for approximate policy and value iteration. Advances in Neural Information Processing Systems, 23.
- Gordon, (1995) Gordon, G. J. (1995). Stable function approximation in dynamic programming. In Machine learning proceedings 1995, pages 261–268. Elsevier.
- Huang et al., (2023) Huang, R., Liang, Y., and Yang, J. (2023). Provably efficient ucb-type algorithms for learning predictive state representations. arXiv preprint arXiv:2307.00405.
- Ibarz et al., (2021) Ibarz, J., Tan, J., Finn, C., Kalakrishnan, M., Pastor, P., and Levine, S. (2021). How to train your robot with deep reinforcement learning: lessons we have learned. The International Journal of Robotics Research, 40(4-5):698–721.
- Iyengar, (2005) Iyengar, G. N. (2005). Robust dynamic programming. Mathematics of Operations Research, 30(2):257–280.
- Jiang et al., (2018) Jiang, N., Kulesza, A., and Singh, S. (2018). Completing state representations using spectral learning. Advances in Neural Information Processing Systems, 31.
- Jin et al., (2021) Jin, Y., Yang, Z., and Wang, Z. (2021). Is pessimism provably efficient for offline rl? In International Conference on Machine Learning, pages 5084–5096. PMLR.
- Kidambi et al., (2020) Kidambi, R., Rajeswaran, A., Netrapalli, P., and Joachims, T. (2020). Morel: Model-based offline reinforcement learning. Advances in neural information processing systems, 33:21810–21823.
- Krishnamurthy et al., (2016) Krishnamurthy, A., Agarwal, A., and Langford, J. (2016). Pac reinforcement learning with rich observations. Advances in Neural Information Processing Systems, 29.
- Kumar et al., (2020) Kumar, A., Zhou, A., Tucker, G., and Levine, S. (2020). Conservative q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:1179–1191.
- Lange et al., (2012) Lange, S., Gabel, T., and Riedmiller, M. (2012). Batch reinforcement learning. Reinforcement learning: State-of-the-art, pages 45–73.
- Lazaric et al., (2012) Lazaric, A., Ghavamzadeh, M., and Munos, R. (2012). Finite-sample analysis of least-squares policy iteration. Journal of Machine Learning Research, 13:3041–3074.
- Levine et al., (2020) Levine, S., Kumar, A., Tucker, G., and Fu, J. (2020). Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643.
- Lim et al., (2013) Lim, S. H., Xu, H., and Mannor, S. (2013). Reinforcement learning in robust markov decision processes. Advances in Neural Information Processing Systems, 26.
- Littman and Sutton, (2001) Littman, M. and Sutton, R. S. (2001). Predictive representations of state. Advances in neural information processing systems, 14.
- Liu et al., (2022) Liu, Q., Netrapalli, P., Szepesvari, C., and Jin, C. (2022). Optimistic mle–a generic model-based algorithm for partially observable sequential decision making. arXiv preprint arXiv:2209.14997.
- Liu et al., (2020) Liu, Y., Swaminathan, A., Agarwal, A., and Brunskill, E. (2020). Provably good batch off-policy reinforcement learning without great exploration. Advances in neural information processing systems, 33:1264–1274.
- Mannor et al., (2016) Mannor, S., Mebel, O., and Xu, H. (2016). Robust mdps with k-rectangular uncertainty. Mathematics of Operations Research, 41(4):1484–1509.
- Mossel and Roch, (2005) Mossel, E. and Roch, S. (2005). Learning nonsingular phylogenies and hidden markov models. In Proceedings of the thirty-seventh annual ACM symposium on Theory of computing, pages 366–375.
- Mundhenk et al., (2000) Mundhenk, M., Goldsmith, J., Lusena, C., and Allender, E. (2000). Complexity of finite-horizon markov decision process problems. Journal of the ACM (JACM), 47(4):681–720.
- Munos and Szepesvári, (2008) Munos, R. and Szepesvári, C. (2008). Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(5).
- Nakao et al., (2021) Nakao, H., Jiang, R., and Shen, S. (2021). Distributionally robust partially observable markov decision process with moment-based ambiguity. SIAM Journal on Optimization, 31(1):461–488.
- Nguyen-Tang et al., (2023) Nguyen-Tang, T., Yin, M., Gupta, S., Venkatesh, S., and Arora, R. (2023). On instance-dependent bounds for offline reinforcement learning with linear function approximation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 9310–9318.
- Nilim and El Ghaoui, (2005) Nilim, A. and El Ghaoui, L. (2005). Robust control of markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798.
- Osogami, (2015) Osogami, T. (2015). Robust partially observable markov decision process. In International Conference on Machine Learning, pages 106–115. PMLR.
- Ouyang et al., (2022) Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
- Panaganti and Kalathil, (2022) Panaganti, K. and Kalathil, D. (2022). Sample complexity of robust reinforcement learning with a generative model. In International Conference on Artificial Intelligence and Statistics, pages 9582–9602. PMLR.
- Panaganti et al., (2022) Panaganti, K., Xu, Z., Kalathil, D., and Ghavamzadeh, M. (2022). Robust reinforcement learning using offline data. Advances in neural information processing systems, 35:32211–32224.
- Papadimitriou and Tsitsiklis, (1987) Papadimitriou, C. H. and Tsitsiklis, J. N. (1987). The complexity of markov decision processes. Mathematics of operations research, 12(3):441–450.
- Peng et al., (2018) Peng, X. B., Andrychowicz, M., Zaremba, W., and Abbeel, P. (2018). Sim-to-real transfer of robotic control with dynamics randomization. In 2018 IEEE international conference on robotics and automation (ICRA), pages 3803–3810. IEEE.
- Petrik and Russel, (2019) Petrik, M. and Russel, R. H. (2019). Beyond confidence regions: Tight bayesian ambiguity sets for robust mdps. Advances in neural information processing systems, 32.
- Rashidinejad et al., (2021) Rashidinejad, P., Zhu, B., Ma, C., Jiao, J., and Russell, S. (2021). Bridging offline reinforcement learning and imitation learning: A tale of pessimism. Advances in Neural Information Processing Systems, 34:11702–11716.
- Rasouli and Saghafian, (2018) Rasouli, M. and Saghafian, S. (2018). Robust partially observable markov decision processes.
- Roy et al., (2017) Roy, A., Xu, H., and Pokutta, S. (2017). Reinforcement learning under model mismatch. Advances in neural information processing systems, 30.
- Singh et al., (2012) Singh, S., James, M., and Rudary, M. (2012). Predictive state representations: A new theory for modeling dynamical systems. arXiv preprint arXiv:1207.4167.
- Sutton and Barto, (2018) Sutton, R. S. and Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.
- Tamar et al., (2014) Tamar, A., Mannor, S., and Xu, H. (2014). Scaling up robust mdps using function approximation. In International conference on machine learning, pages 181–189. PMLR.
- Uehara et al., (2022) Uehara, M., Sekhari, A., Lee, J. D., Kallus, N., and Sun, W. (2022). Provably efficient reinforcement learning in partially observable dynamical systems. arXiv preprint arXiv:2206.12020.
- Uehara and Sun, (2021) Uehara, M. and Sun, W. (2021). Pessimistic model-based offline reinforcement learning under partial coverage. arXiv preprint arXiv:2107.06226.
- Uehara et al., (2021) Uehara, M., Zhang, X., and Sun, W. (2021). Representation learning for online and offline rl in low-rank mdps. arXiv preprint arXiv:2110.04652.
- Vlassis et al., (2012) Vlassis, N., Littman, M. L., and Barber, D. (2012). On the computational complexity of stochastic controller optimization in pomdps. ACM Transactions on Computation Theory (TOCT), 4(4):1–8.
- Wang et al., (2023a) Wang, Y., Velasquez, A., Atia, G., Prater-Bennette, A., and Zou, S. (2023a). Robust average-reward markov decision processes. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 15215–15223.
- Wang et al., (2023b) Wang, Y., Velasquez, A., Atia, G. K., Prater-Bennette, A., and Zou, S. (2023b). Model-free robust average-reward reinforcement learning. In International Conference on Machine Learning, pages 36431–36469. PMLR.
- Wang and Zou, (2021) Wang, Y. and Zou, S. (2021). Online robust reinforcement learning with model uncertainty. Advances in Neural Information Processing Systems, 34:7193–7206.
- Wiesemann et al., (2013) Wiesemann, W., Kuhn, D., and Rustem, B. (2013). Robust markov decision processes. Mathematics of Operations Research, 38(1):153–183.
- Xie et al., (2021) Xie, T., Cheng, C.-A., Jiang, N., Mineiro, P., and Agarwal, A. (2021). Bellman-consistent pessimism for offline reinforcement learning. Advances in neural information processing systems, 34:6683–6694.
- Xie and Jiang, (2020) Xie, T. and Jiang, N. (2020). Q* approximation schemes for batch reinforcement learning: A theoretical comparison. In Conference on Uncertainty in Artificial Intelligence, pages 550–559. PMLR.
- Xiong et al., (2022) Xiong, W., Zhong, H., Shi, C., Shen, C., Wang, L., and Zhang, T. (2022). Nearly minimax optimal offline reinforcement learning with linear function approximation: Single-agent mdp and markov game. arXiv preprint arXiv:2205.15512.
- Xu and Mannor, (2010) Xu, H. and Mannor, S. (2010). Distributionally robust markov decision processes. Advances in Neural Information Processing Systems, 23.
- Yang et al., (2022) Yang, W., Zhang, L., and Zhang, Z. (2022). Toward theoretical understandings of robust markov decision processes: Sample complexity and asymptotics. The Annals of Statistics, 50(6):3223–3248.
- Ye et al., (2020) Ye, D., Liu, Z., Sun, M., Shi, B., Zhao, P., Wu, H., Yu, H., Yang, S., Wu, X., Guo, Q., et al. (2020). Mastering complex control in moba games with deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 6672–6679.
- Yin et al., (2021) Yin, M., Bai, Y., and Wang, Y.-X. (2021). Near-optimal offline reinforcement learning via double variance reduction. Advances in neural information processing systems, 34:7677–7688.
- Yu and Xu, (2015) Yu, P. and Xu, H. (2015). Distributionally robust counterpart in markov decision processes. IEEE Transactions on Automatic Control, 61(9):2538–2543.
- Yu et al., (2020) Yu, T., Thomas, G., Yu, L., Ermon, S., Zou, J. Y., Levine, S., Finn, C., and Ma, T. (2020). Mopo: Model-based offline policy optimization. Advances in Neural Information Processing Systems, 33:14129–14142.
- Zanette et al., (2021) Zanette, A., Wainwright, M. J., and Brunskill, E. (2021). Provable benefits of actor-critic methods for offline reinforcement learning. Advances in neural information processing systems, 34:13626–13640.
- Zhan et al., (2022) Zhan, W., Uehara, M., Sun, W., and Lee, J. D. (2022). Pac reinforcement learning for predictive state representations. arXiv preprint arXiv:2207.05738.
- Zhang et al., (2022) Zhang, Z., Yang, Z., Liu, H., Tokekar, P., and Huang, F. (2022). Reinforcement learning under a multi-agent predictive state representation model: Method and theory. In The Tenth International Conference on Learning Representations (ICLR 2022).
- Zhao et al., (2020) Zhao, W., Queralta, J. P., and Westerlund, T. (2020). Sim-to-real transfer in deep reinforcement learning for robotics: a survey. In 2020 IEEE symposium series on computational intelligence (SSCI), pages 737–744. IEEE.
- Zhong et al., (2022) Zhong, H., Xiong, W., Zheng, S., Wang, L., Wang, Z., Yang, Z., and Zhang, T. (2022). A posterior sampling framework for interactive decision making. arXiv preprint arXiv:2211.01962.
- Zhou et al., (2017) Zhou, L., Small, K., Rokhlenko, O., and Elkan, C. (2017). End-to-end offline goal-oriented dialog policy learning via policy gradient. arXiv preprint arXiv:1712.02838.
- Zhou et al., (2021) Zhou, Z., Zhou, Z., Bai, Q., Qiu, L., Blanchet, J., and Glynn, P. (2021). Finite-sample regret bound for distributionally robust offline tabular reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 3331–3339. PMLR.
Appendices
Appendix A Technical Lemmas
In this section, we present the technical lemmas for proving our main results. It includes three subsections. First, Section A.1 primarily describes the dual forms of robust values under -type uncertainty sets. Second, Section A.2 verifies that the robust value functions satisfy Bellman-type equations. Finally, Section A.3 provides various simulation lemmas for robust values under different uncertainty sets.
A.1 Dual Form of Robust Values
We first present two classical lemmas for the total variation distance and the KL divergence, which can be found in Panaganti et al. (2022); Blanchet et al. (2023).
Proposition A.1.
Let be defined with the convex function corresponding to the total variation distance. Then,
Proposition A.2.
Let be defined with the convex function corresponding to the KL-divergence. Then,
Then, we show the dual form for the newly proposed -type uncertainty sets.
Proposition A.3.
If , then we have
where , and the maximum must be achieved when
Proof.
We recall the definition of :
To simplify the notation, we denote by . Then, we need to specify the equations that form .
The most important one is that for any . Another implicit constraint is that if we sum over all , the value is constant with respect to . Therefore, we construct some additional variables and and formulate the following constrained optimization problem.
P1:
Since P1 is a linear program, we can formulate its Lagrangian function as follows:
where , and .
Thus, the dual problem is
Denote by . Then, we have
In addition, by the third constraint in the dual problem D1, we have
Thus, we can simplify to
Hence, by Slater’s condition, the primal value equals the dual value. Mathematically, we have
If we fix and , then the maximum is achieved when
Therefore,
Moreover, since , the maximum must be achieved when
∎
Proposition A.4.
If , then we have
Proof.
Similar to Proposition A.3, we denote by . The robust value equals the value of the following optimization problem.
P2:
Since P2 is a convex optimization problem, we can formulate the Lagrangian function as follows:
where , and .
Thus, the dual problem is
Denote by . Then, we have
Hence, by Slater’s condition, the primal value equals the dual value. Mathematically, we have
where we choose and
∎
A.2 Bellman-type Equations under -type Uncertainty Sets
Recall that the robust value function is defined as follows.
Then, we have the following Bellman-type equations for robust value functions.
Lemma A.5.
For any , we have
Proof.
The first equation is straightforward.
For the second equation, by the definition of robust Q-value function, for any , there exists such that
Then, we have
Letting , we conclude that
On the other hand, for any , there exist such that
and
In addition, there exists such that and . Then, we have
Letting , we can conclude that
which finishes the proof.
∎
A.3 Simulation Lemmas of Robust Values
Lemma A.6 (Simulation lemma of robust value under ).
Proof.
Let .
We first prove that . By Proposition A.3, we have
Let be the solution to
Again, from Proposition A.3, we know that
(9)
The opposite side follows exactly the same proof.
∎
Lemma A.7 (Simulation lemma of robust value under ).
where is a lower bound of the optimal solution
Here, if , where is a scalar and is a vector, then is less than or equal to each coordinate of the vector.
Proof.
Let . By Proposition A.4, we have
Let be the solution to
Then, the difference can be upper bounded as follows.
For term , we fix and write as for convenience.
Since is fixed and we aim to upper bound the term
can be treated as a variable. We define a function for as follows
where are two distributions over the observation space.
Taking the derivative with respect to , we have
Due to and , we have
Thus, for , we have
For the second inequality, we follow the same argument and the proof is complete.
∎
Lemma A.8 (Simulation lemma for robust value under ).
where , and is a lower bound for the optimal solutions of the following problems
Proof.
First, we have
Let .
Then, we further have
where follows from for , and .
In summary, the difference between robust values can be upper bounded by
where
The opposite side of the inequalities of the lemma follows the same argument.
∎
Appendix B MLE Analysis
In this section, we provide the estimation guarantee for the MLE oracle in Lemma B.5.
We first present a sequence of supporting lemmas in the following. Notably, the first lemma states that after dataset distillation, we still have sufficient offline samples.
Lemma B.1 (Sufficient good samples).
Let . Then, with probability at least , we have .
Proof.
Let . Then, we have
Thus, by the Chernoff bound, we have that with probability ,
Recall that . Let . Thus, we further have
which implies
∎
Lemma B.2.
Fix . Let with size consist of all -brackets that cover . With probability at least , for any , the following two inequalities hold:
Proof.
We start with the first inequality. Suppose the data in is indexed by . Then,
where follows because , and follows because .
By the Chernoff bound and the union bound over , with probability at least , we have
which yields the first result of this proposition.
To show the second inequality, we follow an argument similar to that for the first inequality. We have
where follows from the tower rule of the expectation and because .
Thus, with probability at least , for any , the following inequality holds
which completes the proof. ∎
Proposition B.3.
Fix and . Let . Consider the following event
Then,
Proof.
We start with a general upper bound on the total variation distance between two conditional distributions. Note that for any and fixed , we have
By symmetry, we also have
We replace by a , where is an -bracket of (recall Definition 4.2), i.e., , and , . Then, due to the construction of , we have
which implies
Here follows because the total variation distance satisfies the triangle inequality and .
Thus, the summation of the total variation distance between conditional distributions conditioned on can be upper bounded by
In addition, by only taking expectation over , we have
where the last equality is due to the conditional independence of given .
Therefore, by the Chernoff bound, with probability , we have
Taking the union bound over and rescaling , we have that, with probability at least , the following inequality holds:
Note that, following from Lemma B.2, with probability at least , we have for any ,
Hence, combining with the optimistic property of and rescaling , we have that the following inequality holds with probability at least :
which yields the final result. ∎
Proposition B.4.
Fix . Define the following event:
We have
Proof.
First, by the construction of , for any , let satisfy . We translate the distance between and to the distance between and as follows.
where follows because if and , and follows from Cauchy’s inequality and the fact that .
Hence, in order to upper bound , it suffices to upper bound . To this end, we observe that
Then, by the Chernoff bound, we have
Finally, rescaling to and taking the union bound over , we conclude that, with probability at least , ,
where follows because . ∎
Now, we are ready to provide the MLE guarantee of under the behavior policy .
Lemma B.5 (MLE guarantee).
Let . With probability at least , for any parameter satisfying , we have the following inequalities.
where .
Proof.
Consider the three events , , and , defined in Lemma B.2, Proposition B.3, and Proposition B.4, respectively. Let . Then, we have . The following proof is under the event that happens.
First, due to the condition that , we have
where is due to the definition of , and follows from Lemma B.2.
Then, due to Proposition B.3, we have
Similarly, due to Proposition B.4
where the last inequality follows from Lemma B.1.
∎
Appendix C Proof of Theorem 4.7
In this section, we provide the full proof of Theorem 4.7. Recall that , where , , and are defined in Lemma B.2, Proposition B.3, and Proposition B.4, respectively.
Recall that we make the following assumption.
Assumption C.1.
The nominal model has rank .
Under this assumption, we have the following lemma.
Lemma C.2 (Theorem C.1 in Liu et al., (2022)).
Any low-rank decision-making problem with core tests admits a (self-consistent) predictive state representation given core tests , such that for any , we have
where , , and . In particular, is the set of all low-rank problems with the same core tests .
First, we derive the following guarantee for the difference of robust values.
Corollary C.3.
Under the event , we have that
Proof.
Theorem C.4 (Restatement of Theorem 4.7).
Suppose Assumption C.1 and Assumption 3.2 hold, and that the type-I coefficient of the behavior policy against the robust policy is finite. Let be the maximum number of core action sequences. Let , , , , , and . Then, with probability at least , the output of Algorithm 1 under the different settings satisfies that
Proof.
We first prove the result for -type uncertainty set with distance.
Recall that . Let . Then, by utilizing Corollary C.3, the suboptimality gap of can be upper bounded as follows.
Recall that the type-I concentrability coefficient implies that
Thus, due to Cauchy’s inequality, we have
For the other three inequalities, we follow the same argument as above and the proof is finished.
∎
Appendix D Proof of Theorem 5.2
We first provide the pseudo-code for our proposed algorithm.
Theorem D.1.
Suppose the type-II coefficient of the behavior policy against the robust policy is finite. Let , . Then, with probability at least , the output of Algorithm 2 under the different settings satisfies that
Proof.
We first prove the result for the setting when the uncertainty set is -type with distance.
Let . Then, we perform the decomposition as follows.
where is due to the optimality , and follows from the property of .
Appendix E Auxiliary Lemmas
Lemma E.1.
For two distributions and , we have
In addition, the following inequality holds.
Proof.
We have
On the other hand,
∎
Lemma E.2.
Given two bounded measures and defined on the set , let and . We have
In addition, if are two conditional distributions over a random variable , and , are the joint distributions when follows the distributions and , respectively, we have
Proof.
We first prove the first inequality. By the definition of total variation distance, we have
where follows from Cauchy’s inequality and because .
For the second inequality, we have,
where follows from Cauchy’s inequality applied to . ∎