Risk-Averse Model Uncertainty for
Distributionally Robust Safe Reinforcement Learning
Abstract
Many real-world domains require safe decision making in uncertain environments. In this work, we introduce a deep reinforcement learning framework for approaching this important problem. We consider a distribution over transition models, and apply a risk-averse perspective towards model uncertainty through the use of coherent distortion risk measures. We provide robustness guarantees for this framework by showing it is equivalent to a specific class of distributionally robust safe reinforcement learning problems. Unlike existing approaches to robustness in deep reinforcement learning, however, our formulation does not involve minimax optimization. This leads to an efficient, model-free implementation of our approach that only requires standard data collection from a single training environment. In experiments on continuous control tasks with safety constraints, we demonstrate that our framework produces robust performance and safety at deployment time across a range of perturbed test environments.
1 Introduction
In many real-world decision making applications, it is important to satisfy safety requirements while achieving a desired goal. In addition, real-world environments often involve uncertain or changing conditions. Therefore, in order to reliably deploy data-driven decision making methods such as deep reinforcement learning (RL) in these settings, they must deliver robust performance and safety even in the presence of uncertainty. Recently, techniques have been developed to handle safety constraints within the deep RL framework [3, 41, 47, 27, 44, 55, 28], but these safe RL algorithms only focus on performance and safety in the training environment. They do not consider uncertainty about the true environment at deployment time due to unknown disturbances or irreducible modeling errors, which we refer to as model uncertainty. In this work, we introduce a framework that incorporates model uncertainty into safe RL. In order for our framework to be useful, we emphasize the importance of (i) an efficient deep RL implementation during training and (ii) robustness guarantees on performance and safety upon deployment.
Existing robust RL methods address the issue of model uncertainty, but they can be difficult to implement and are not always suitable in real-world decision making settings. Robust RL focuses on worst-case environments in an uncertainty set, which requires solving complex minimax optimization problems throughout training. This is typically approximated in a deep RL setting through direct interventions with a learned adversary [38, 46, 49], or through the use of parametric uncertainty with multiple simulated training environments [39, 31, 32]. However, we do not always have access to fast, high-fidelity simulators for training [10, 33, 54]. In these cases, we must be able to incorporate robustness to model uncertainty without relying on multiple training environments or potentially dangerous adversarial interventions, as real-world data collection may be necessary.
A more informative way to represent model uncertainty is to instead consider a distribution over potential environments. Domain randomization [37] collects training data from a range of environments by randomizing across parameter values in a simulator, and optimizes for average performance. This approach to model uncertainty avoids minimax formulations and works well in practice [5], but lacks robustness guarantees. In addition, domain randomization focuses on parametric uncertainty, which still requires detailed simulator access and domain knowledge to define the training distribution.
In this work, we introduce a general approach to safe RL in the presence of model uncertainty that addresses the main shortcomings of existing methods. In particular, we consider a distribution over potential environments, and apply a risk-averse perspective towards model uncertainty. Through the use of coherent distortion risk measures, this leads to a safe RL framework with robustness guarantees that does not involve difficult minimax formulations. Using this framework, we show how we can learn safe policies that are robust to model uncertainty, without the need for detailed simulator access or adversarial interventions during training. Our main contributions are as follows:
1. We reformulate the safe RL problem to incorporate a risk-averse perspective towards model uncertainty through the use of coherent distortion risk measures, and we introduce the corresponding Bellman operators.
2. From a theoretical standpoint, we provide robustness guarantees for our framework by showing it is equivalent to a specific class of distributionally robust safe RL problems.
3. We propose an efficient deep RL implementation that avoids the difficult minimax formulation present in robust RL and only uses data collected from a single training environment.
4. In experiments on continuous control tasks with safety constraints, we demonstrate that our framework delivers robust performance and safety at deployment time across a range of perturbed test environments.
2 Preliminaries
Safe reinforcement learning
In this work, we consider RL in the presence of safety constraints. We model this sequential decision making problem as an infinite-horizon, discounted Constrained Markov Decision Process (CMDP) [4] defined by the tuple $(\mathcal{S}, \mathcal{A}, p, r, c, d_0, \gamma)$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions, $p: \mathcal{S} \times \mathcal{A} \to \mathcal{P}(\mathcal{S})$ is the transition model where $\mathcal{P}(\mathcal{S})$ represents the space of probability measures over $\mathcal{S}$, $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ and $c: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ are the reward function and cost function used to define the objective and constraint, respectively, $d_0 \in \mathcal{P}(\mathcal{S})$ is the initial state distribution, and $\gamma \in (0,1)$ is the discount rate. We focus on the setting with a single constraint, but all results can be extended to the case of multiple constraints.
We model the agent’s decisions as a stationary policy $\pi: \mathcal{S} \to \mathcal{P}(\mathcal{A})$. For a given CMDP and policy $\pi$, we write the expected total discounted rewards and costs as $J_{r,p}(\pi) = \mathbb{E}_{\tau \sim (\pi, p)}\left[ \sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \right]$ and $J_{c,p}(\pi) = \mathbb{E}_{\tau \sim (\pi, p)}\left[ \sum_{t=0}^{\infty} \gamma^{t} c(s_t, a_t) \right]$, respectively, where $\tau = (s_0, a_0, s_1, a_1, \ldots)$ represents a trajectory sampled according to $s_0 \sim d_0$, $a_t \sim \pi(\cdot \mid s_t)$, and $s_{t+1} \sim p(\cdot \mid s_t, a_t)$. The goal of safe RL is to find a policy that solves the constrained optimization problem
$$\max_{\pi}\; J_{r,p}(\pi) \quad \text{s.t.} \quad J_{c,p}(\pi) \leq B, \tag{1}$$
where $B$ is a safety budget on expected total discounted costs.
We write the corresponding state-action value functions (i.e., Q functions) for a given transition model $p$ and policy $\pi$ as $Q^{\pi}_{r,p}(s,a)$ and $Q^{\pi}_{c,p}(s,a)$, respectively. Off-policy optimization techniques [55, 28] find a policy that solves (1) by considering at each iteration $k$ the related optimization problem
$$\max_{\pi}\; \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi(\cdot \mid s)}\!\left[ Q^{\pi_k}_{r,p}(s,a) \right] \quad \text{s.t.} \quad \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi(\cdot \mid s)}\!\left[ Q^{\pi_k}_{c,p}(s,a) \right] \leq B, \tag{2}$$
where $\pi_k$ is the current policy and $\mathcal{D}$ is a replay buffer containing data collected in the training environment. Note that $Q^{\pi}_{r,p}$ and $Q^{\pi}_{c,p}$ are the respective fixed points of the Bellman operators
$$\mathcal{T}^{\pi}_{r,p} Q(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s' \sim p(\cdot \mid s,a),\, a' \sim \pi(\cdot \mid s')}\!\left[ Q(s',a') \right], \qquad \mathcal{T}^{\pi}_{c,p} Q(s,a) = c(s,a) + \gamma\, \mathbb{E}_{s' \sim p(\cdot \mid s,a),\, a' \sim \pi(\cdot \mid s')}\!\left[ Q(s',a') \right].$$
Model uncertainty in reinforcement learning
Rather than focusing on a single CMDP with transition model $p$, we incorporate uncertainty about the transition model by considering a distribution over models. We focus on distributions of the form $\mu = \bigotimes_{(s,a) \in \mathcal{S} \times \mathcal{A}} \mu_{s,a}$, where $\mu_{s,a}$ represents a distribution over transition models at a given state-action pair and $\mu$ is the product over all $(s,a) \in \mathcal{S} \times \mathcal{A}$. This is known as rectangularity, and is a common assumption in the literature [53, 56, 16, 15, 11]. Note that $\mu_{s,a} \in \mathcal{P}(\mathcal{P}_{s,a})$, where we write $\mathcal{P}_{s,a} = \mathcal{P}(\mathcal{S})$ to denote model space at a given state-action pair. Compared to robust RL methods that apply uncertainty sets over transition models, the use of a distribution over transition models is a more informative way to represent model uncertainty that does not require solving for worst-case environments (i.e., does not introduce a minimax formulation).
In order to incorporate robustness to the choice of $\mu$, distributionally robust MDPs [53, 56] consider an ambiguity set $\mathcal{M}$ of distributions over transition models, where $\mu \in \mathcal{M}$. The goal of distributionally robust RL is to optimize the worst-case average performance across all distributions contained in $\mathcal{M}$. In this work, we will show that a risk-averse perspective towards model uncertainty defined by $\mu$ is equivalent to distributionally robust RL for appropriate choices of ambiguity sets in the objective and constraint of a CMDP. However, our use of risk measures avoids the need to solve for worst-case distributions in $\mathcal{M}$ throughout training.
Risk measures
Consider the probability space $(\Omega, \mathcal{F}, \mathbb{P})$, where $\mathcal{F}$ is a $\sigma$-algebra on $\Omega$ and $\mathbb{P}$ defines a probability measure over $\Omega$. Let $\mathcal{Z}$ be a space of random variables defined on this probability space, and let $\mathcal{Z}^*$ be its corresponding dual space. A real-valued risk measure $\rho: \mathcal{Z} \to \mathbb{R}$ summarizes a random variable as a value on the real line. In this section, we consider cost random variables $Z \in \mathcal{Z}$ where a lower value of $Z$ is better. We can define a corresponding risk measure for reward random variables through an appropriate change in sign, where $\tilde{\rho}(Z) = -\rho(-Z)$. Risk-sensitive methods typically focus on classes of risk measures with desirable properties [30], such as coherent risk measures [6] and distortion risk measures [50, 17].
Definition 1 (Coherent risk measure).
A risk measure is a coherent risk measure if it satisfies monotonicity, translation invariance, positive homogeneity, and convexity.
Definition 2 (Distortion risk measure).
Let $g: [0,1] \to [0,1]$ be a non-decreasing, left-continuous function with $g(0) = 0$ and $g(1) = 1$. A distortion risk measure with respect to $g$ is defined as
$$\rho_g(Z) = \int_0^1 F_Z^{-1}(1-u)\, \mathrm{d}g(u),$$
where $F_Z^{-1}(u) = \inf\{ z \in \mathbb{R} : F_Z(z) \geq u \}$ is the inverse cumulative distribution function of $Z$ and $F_Z$ is its cumulative distribution function.
A distortion risk measure is coherent if and only if $g$ is concave [52]. In this work, we focus on the class of coherent distortion risk measures. We will leverage properties of coherent risk measures to provide robustness guarantees for our framework, and we will leverage properties of distortion risk measures to propose an efficient, model-free implementation that does not involve minimax optimization. See the Appendix for additional details on the properties of coherent distortion risk measures. Many commonly used risk measures belong to this class, including expectation, conditional value-at-risk (CVaR), and the Wang transform [51], for which the distortion function is given by $g_{\eta}(u) = \Phi\!\left( \Phi^{-1}(u) + \eta \right)$ with parameter $\eta \geq 0$, where $\Phi$ is the standard Normal cumulative distribution function.
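To make the distortion functions concrete, the short sketch below (our illustration, not code from the paper) evaluates the Wang transform and a CVaR distortion and numerically checks the endpoint and monotonicity conditions from Definition 2; the CVaR level and function names are illustrative choices.

```python
import numpy as np
from scipy.stats import norm

def wang_distortion(u, eta=0.75):
    """Wang transform g(u) = Phi(Phi^{-1}(u) + eta), concave (hence coherent) for eta >= 0."""
    u = np.clip(u, 1e-12, 1.0 - 1e-12)  # avoid infinite quantiles at the endpoints
    return norm.cdf(norm.ppf(u) + eta)

def cvar_distortion(u, alpha=0.25):
    """CVaR distortion g(u) = min(u / alpha, 1), which averages the worst alpha-fraction of outcomes."""
    return np.minimum(np.asarray(u) / alpha, 1.0)

if __name__ == "__main__":
    u = np.linspace(0.0, 1.0, 11)
    for g in (wang_distortion, cvar_distortion):
        vals = g(u)
        # g(0) = 0, g(1) = 1, and g non-decreasing, as required by Definition 2
        print(g.__name__, np.round(vals, 3), "non-decreasing:", bool(np.all(np.diff(vals) >= -1e-12)))
```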
3 Related work
Safe reinforcement learning
The CMDP framework is the most popular approach to safety in RL, and several deep RL algorithms have been developed to solve the constrained optimization problem in (1). These include primal-dual methods that consider the Lagrangian relaxation of (1) [41, 47, 44], algorithms that compute closed-form solutions to related or approximate versions of (1) [3, 28], and direct methods for constraint satisfaction such as the use of barriers [27] or immediate switching between the objective and constraint [55]. All of these approaches are designed to satisfy expected cost constraints for a single CMDP observed during training. In our work, on the other hand, we consider a distribution over possible transition models.
Uncertainty in reinforcement learning
Our work focuses on irreducible uncertainty about the true environment at deployment time, which we refer to as model uncertainty and represent using a distribution over transition models. The most popular approach that incorporates model uncertainty in this way is domain randomization [48, 37], which randomizes across parameter values in a simulator and trains a policy to maximize average performance over this training distribution. This represents a risk-neutral attitude towards model uncertainty, which has been referred to as a soft-robust approach [16]. Distributionally robust MDPs incorporate robustness to the choice of $\mu$ by instead considering a set of distributions [53, 56, 15, 11], but application of this distributionally robust framework has remained limited in deep RL as it leads to a difficult minimax formulation that requires solving for worst-case distributions over transition models.
Robust RL represents an alternative approach to model uncertainty that considers uncertainty sets of transition models [21, 34]. A major drawback of robust RL is the need to calculate worst-case environments during training, which is typically approximated through the use of parametric uncertainty with multiple training environments [39, 31, 32] or a trained adversary that directly intervenes during trajectory rollouts [38, 46, 49]. Unlike these methods, we propose a robust approach to model uncertainty based on a distribution over models, which does not require access to a range of simulated training environments, does not impact data collection during training, and does not involve minimax optimization problems.
In contrast to irreducible model uncertainty, epistemic uncertainty captures estimation error that can be reduced during training through data collection. Epistemic uncertainty has been considered in the estimation of Q functions [35, 36, 9] and learned transition models [13, 25, 22, 40, 7], and has been applied to promote both exploration and safety in a fixed MDP. Finally, risk-sensitive methods typically focus on the aleatoric uncertainty in RL, which refers to the range of stochastic outcomes within a single MDP. Rather than considering the standard expected value objective, they learn risk-sensitive policies over this distribution of possible outcomes in a fixed MDP [43, 12, 45, 24, 26]. Distributional RL [8] trains critics that estimate the full distribution of future returns due to aleatoric uncertainty, and risk measures can be applied to these distributional critics for risk-sensitive learning [14, 29]. We also consider the use of risk measures in our work, but different from standard risk-sensitive RL methods we apply a risk measure over model uncertainty instead of aleatoric uncertainty.
4 Risk-averse model uncertainty for safe reinforcement learning
The standard safe RL problem in (1) focuses on performance and safety in a single environment with fixed transition model $p$. In this work, however, we are interested in a distribution $\mu$ of possible transition models rather than a fixed transition model. The distribution $\mu$ provides a natural way to capture our uncertainty about the unknown transition model at deployment time. Next, we must incorporate this model uncertainty into our problem formulation. Prior methods have done this by applying the expectation operator over $\mu_{s,a}$ at every transition [16]. Instead, we adopt a risk-averse view towards model uncertainty in order to learn policies with robust performance and safety. We accomplish this by applying a coherent distortion risk measure with respect to model uncertainty at every transition.
We consider the risk-averse model uncertainty (RAMU) safe RL problem
$$\max_{\pi}\; J_{r,\mu}(\pi) \quad \text{s.t.} \quad J_{c,\mu}(\pi) \leq B, \tag{3}$$
where we use $\rho_r$ and $\rho_c$ to account for reward and cost random variables, respectively, and we apply these coherent distortion risk measures over $\mu_{s,a}$ at every transition to define
$$J_{r,\mu}(\pi) = \mathbb{E}_{s_0 \sim d_0,\, a_0 \sim \pi}\left[ r(s_0,a_0) + \gamma\, \rho_r\Big( \mathbb{E}_{s_1 \sim p_{s_0,a_0},\, a_1 \sim \pi}\big[ r(s_1,a_1) + \gamma\, \rho_r\left( \,\cdots\, \right) \big] \Big) \right],$$
with $J_{c,\mu}(\pi)$ defined analogously using $c$ and $\rho_c$, where each risk measure is applied over $p_{s_t,a_t} \sim \mu_{s_t,a_t}$.
The notation $J_{r,\mu}$ and $J_{c,\mu}$ emphasizes that the stochasticity of the random variables is with respect to the transition models sampled from $\mu$. Note that we still apply expectations over the aleatoric uncertainty of the CMDP (i.e., the randomness associated with a stochastic transition model and stochastic policy), while being risk-averse with respect to model uncertainty. Because we are interested in learning policies that achieve robust performance and robust safety at deployment time, we apply this risk-averse perspective to model uncertainty in both the objective and constraint of (3).
We write the corresponding RAMU reward and cost Q functions as $Q^{\pi}_{r,\mu}(s,a)$ and $Q^{\pi}_{c,\mu}(s,a)$, respectively. Similar to the standard safe RL setting, we can apply off-policy techniques to solve the RAMU safe RL problem in (3) by iteratively optimizing
$$\max_{\pi}\; \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi(\cdot \mid s)}\!\left[ Q^{\pi_k}_{r,\mu}(s,a) \right] \quad \text{s.t.} \quad \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi(\cdot \mid s)}\!\left[ Q^{\pi_k}_{c,\mu}(s,a) \right] \leq B. \tag{4}$$
Therefore, we have replaced the standard Q functions for a fixed transition model in (2) with our RAMU Q functions in (4).
We can write the RAMU Q functions recursively as
$$Q^{\pi}_{r,\mu}(s,a) = r(s,a) + \gamma\, \rho_r\!\left( \mathbb{E}_{s' \sim p_{s,a},\, a' \sim \pi(\cdot \mid s')}\!\left[ Q^{\pi}_{r,\mu}(s',a') \right] \right), \qquad Q^{\pi}_{c,\mu}(s,a) = c(s,a) + \gamma\, \rho_c\!\left( \mathbb{E}_{s' \sim p_{s,a},\, a' \sim \pi(\cdot \mid s')}\!\left[ Q^{\pi}_{c,\mu}(s',a') \right] \right),$$
where the risk measures are applied over $p_{s,a} \sim \mu_{s,a}$.
These recursive definitions motivate corresponding RAMU Bellman operators.
Definition 3 (RAMU Bellman operators).
For a given policy $\pi$, the RAMU Bellman operators are defined as
$$\mathcal{T}^{\pi}_{r,\mu} Q(s,a) = r(s,a) + \gamma\, \rho_r\!\left( \mathbb{E}_{s' \sim p_{s,a},\, a' \sim \pi(\cdot \mid s')}\!\left[ Q(s',a') \right] \right), \qquad \mathcal{T}^{\pi}_{c,\mu} Q(s,a) = c(s,a) + \gamma\, \rho_c\!\left( \mathbb{E}_{s' \sim p_{s,a},\, a' \sim \pi(\cdot \mid s')}\!\left[ Q(s',a') \right] \right),$$
where the risk measures are applied over $p_{s,a} \sim \mu_{s,a}$.
Note that the RAMU Bellman operators can also be interpreted as applying a coherent distortion risk measure over standard Bellman targets, which are random variables with respect to the transition model for a given state-action pair.
Lemma 1.
The RAMU Bellman operators can be written in terms of standard Bellman operators as
$$\mathcal{T}^{\pi}_{r,\mu} Q(s,a) = \rho_r\!\left( \mathcal{T}^{\pi}_{r,p} Q(s,a) \right), \qquad \mathcal{T}^{\pi}_{c,\mu} Q(s,a) = \rho_c\!\left( \mathcal{T}^{\pi}_{c,p} Q(s,a) \right). \tag{5}$$
Proof.
The results follow from the definitions of $\mathcal{T}^{\pi}_{r,\mu}$ and $\mathcal{T}^{\pi}_{c,\mu}$, along with the translation invariance and positive homogeneity of coherent distortion risk measures. See the Appendix for details. ∎
In the next section, we show that $\mathcal{T}^{\pi}_{r,\mu}$ and $\mathcal{T}^{\pi}_{c,\mu}$ are contraction operators, so we can apply standard temporal difference learning techniques to learn the RAMU Q functions $Q^{\pi}_{r,\mu}$ and $Q^{\pi}_{c,\mu}$ that are needed for our RAMU policy update in (4).
5 Robustness guarantees
Intuitively, our risk-averse perspective places more emphasis on potential transition models that result in higher costs or lower rewards under the current policy, which should result in learning safe policies that are robust to model uncertainty. Next, we formalize the robustness guarantees of our RAMU framework by showing it is equivalent to a distributionally robust safe RL problem for appropriate choices of ambiguity sets.
Theorem 1.
The RAMU safe RL problem in (3) is equivalent to the distributionally robust safe RL problem
$$\max_{\pi}\; \inf_{\nu \in \mathcal{M}_r} \mathbb{E}_{p \sim \nu}\!\left[ J_{r,p}(\pi) \right] \quad \text{s.t.} \quad \sup_{\nu \in \mathcal{M}_c} \mathbb{E}_{p \sim \nu}\!\left[ J_{c,p}(\pi) \right] \leq B, \tag{6}$$
with ambiguity sets $\mathcal{M}_r = \bigotimes_{(s,a)} \mathcal{M}_{r,s,a}$ and $\mathcal{M}_c = \bigotimes_{(s,a)} \mathcal{M}_{c,s,a}$, where
$$\mathcal{M}_{r,s,a} = \left\{ \xi\, \mu_{s,a} : \xi \in \mathfrak{A}_r \right\} \quad \text{and} \quad \mathcal{M}_{c,s,a} = \left\{ \xi\, \mu_{s,a} : \xi \in \mathfrak{A}_c \right\}$$
are sets of feasible reweightings of $\mu_{s,a}$ with densities $\xi$ that depend on the choice of $\rho_r$ and $\rho_c$, respectively.
Proof.
Using duality results for coherent risk measures [42], we show that the RAMU Bellman operators $\mathcal{T}^{\pi}_{r,\mu}$ and $\mathcal{T}^{\pi}_{c,\mu}$ are equivalent to distributionally robust Bellman operators [53, 56] with ambiguity sets of distributions $\mathcal{M}_{r,s,a}$ and $\mathcal{M}_{c,s,a}$, respectively. The RAMU Q functions are the respective fixed points of these Bellman operators, so they can be written as distributionally robust Q functions. Finally, by averaging over initial states and actions, we see that (3) is equivalent to (6). See the Appendix for details. ∎
Theorem 1 shows that the application of $\rho_r$ and $\rho_c$ at every timestep is equivalent to solving distributionally robust optimization problems over the ambiguity sets of distributions $\mathcal{M}_{r,s,a}$ and $\mathcal{M}_{c,s,a}$, respectively. This can be interpreted as adversarially reweighting $\mu_{s,a}$ with densities $\xi$ at every state-action pair. Note that worst-case distributions appear in both the objective and constraint of (6), so any policy trained with our RAMU framework is guaranteed to deliver robust performance and robust safety. The level of robustness depends on the choice of $\rho_r$ and $\rho_c$, which determine the structure and size of the corresponding ambiguity sets based on their dual representations [42].
In addition, because (3) is equivalent to a distributionally robust safe RL problem according to Theorem 1, we can leverage existing results for distributionally robust MDPs [53, 56] to show that $\mathcal{T}^{\pi}_{r,\mu}$ and $\mathcal{T}^{\pi}_{c,\mu}$ are contraction operators.
Corollary 1.
The RAMU Bellman operators $\mathcal{T}^{\pi}_{r,\mu}$ and $\mathcal{T}^{\pi}_{c,\mu}$ are $\gamma$-contractions in the sup-norm.
Therefore, we have that $Q^{\pi}_{r,\mu}$ and $Q^{\pi}_{c,\mu}$ can be interpreted as distributionally robust Q functions by Theorem 1, and we can apply standard temporal difference methods to learn these RAMU Q functions as a result of Corollary 1. Importantly, Theorem 1 demonstrates the robustness properties of our RAMU framework, but it is not used to implement our approach. Directly implementing (6) would require solving for adversarial distributions over transition models throughout training. Instead, our framework provides the same robustness, but the use of risk measures leads to an efficient deep RL implementation as we describe in the following section.
6 Model-free implementation with a single training environment
The RAMU policy update in (4) takes the same form as the standard safe RL update in (2), except for the use of $Q^{\pi_k}_{r,\mu}$ and $Q^{\pi_k}_{c,\mu}$. Because our RAMU Bellman operators are contractions, we can learn these RAMU Q functions by applying standard temporal difference loss functions that are used throughout deep RL. In particular, we consider parameterized critics $Q_{\theta_r}$ and $Q_{\theta_c}$, and we optimize their parameters during training to minimize the loss functions
$$\mathcal{L}(\theta_r) = \mathbb{E}_{(s,a) \sim \mathcal{D}}\!\left[ \left( Q_{\theta_r}(s,a) - \widehat{\mathcal{T}}^{\pi}_{r,\mu} \bar{Q}_r(s,a) \right)^{2} \right], \qquad \mathcal{L}(\theta_c) = \mathbb{E}_{(s,a) \sim \mathcal{D}}\!\left[ \left( Q_{\theta_c}(s,a) - \widehat{\mathcal{T}}^{\pi}_{c,\mu} \bar{Q}_c(s,a) \right)^{2} \right],$$
where $\widehat{\mathcal{T}}^{\pi}_{r,\mu}$ and $\widehat{\mathcal{T}}^{\pi}_{c,\mu}$ represent sample-based estimates of the RAMU Bellman operators applied to target Q functions denoted by $\bar{Q}_r$ and $\bar{Q}_c$. Therefore, we must be able to efficiently estimate the RAMU Bellman targets, which involve calculating coherent distortion risk measures that depend on the distribution $\mu_{s,a}$.
Sample-based estimation of risk measures
Using the formulation of our RAMU Bellman operators from Lemma 1, we can leverage properties of distortion risk measures to efficiently estimate the results in (5) using sample-based weighted averages of standard Bellman targets. For transition models $p^{(i)}_{s,a}$, $i = 1, \ldots, n$, sampled independently from $\mu_{s,a}$ and sorted according to their corresponding Bellman targets, consider the weights
$$w_i = g\!\left( \frac{i}{n} \right) - g\!\left( \frac{i-1}{n} \right), \quad i = 1, \ldots, n,$$
where $g$ defines the distortion risk measure according to Definition 2. See Figure 1 for the distortion functions and weights associated with the risk measures used in our experiments. Then, from Jones and Zitikis [23] we have that
$$\widehat{\rho}_r\!\left( \mathcal{T}^{\pi}_{r,p} Q(s,a) \right) = \sum_{i=1}^{n} w_i\, \mathcal{T}^{\pi}_{r,p^{(i)}} Q(s,a), \qquad \widehat{\rho}_c\!\left( \mathcal{T}^{\pi}_{c,p} Q(s,a) \right) = \sum_{i=1}^{n} w_i\, \mathcal{T}^{\pi}_{c,p^{(i)}} Q(s,a)$$
are consistent estimators of the results in (5), where the reward Bellman targets $\mathcal{T}^{\pi}_{r,p^{(i)}} Q(s,a)$ are sorted in ascending order and the cost Bellman targets $\mathcal{T}^{\pi}_{c,p^{(i)}} Q(s,a)$ are sorted in descending order. Finally, we can replace $\mathcal{T}^{\pi}_{r,p^{(i)}} Q(s,a)$ and $\mathcal{T}^{\pi}_{c,p^{(i)}} Q(s,a)$ with the standard unbiased sample-based estimates
$$\widehat{\mathcal{T}}^{\pi}_{r,p^{(i)}} Q(s,a) = r(s,a) + \gamma\, Q(s'_i, a'_i), \qquad \widehat{\mathcal{T}}^{\pi}_{c,p^{(i)}} Q(s,a) = c(s,a) + \gamma\, Q(s'_i, a'_i),$$
where $s'_i \sim p^{(i)}_{s,a}$ and $a'_i \sim \pi(\cdot \mid s'_i)$. This leads to the sample-based estimates
$$\widehat{\mathcal{T}}^{\pi}_{r,\mu} Q(s,a) = \sum_{i=1}^{n} w_i\, \widehat{\mathcal{T}}^{\pi}_{r,p^{(i)}} Q(s,a), \qquad \widehat{\mathcal{T}}^{\pi}_{c,\mu} Q(s,a) = \sum_{i=1}^{n} w_i\, \widehat{\mathcal{T}}^{\pi}_{c,p^{(i)}} Q(s,a), \tag{7}$$
which we use to train our RAMU Q functions. Note that the estimates in (7) can be computed very efficiently, which is a major benefit of our RAMU framework compared to robust RL methods. Next, we describe how we can sample models $p^{(i)}_{s,a}$, $i = 1, \ldots, n$, from $\mu_{s,a}$, and generate state transitions from these models to use in the calculation of our sample-based Bellman targets in (7).
[Figure 1: Distortion functions and corresponding weights for the risk measures used in our experiments.]
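As a concrete illustration of the estimator above (a minimal sketch in our notation, not the authors' released implementation), the following functions compute the distortion weights and the sorted, weighted average of sampled Bellman targets used in (7); the distortion function is passed in as a callable such as `wang_distortion` from the earlier sketch.

```python
import numpy as np

def distortion_weights(n, g):
    """Weights w_i = g(i/n) - g((i-1)/n) induced by the distortion function g (Definition 2)."""
    grid = np.arange(n + 1) / n
    return np.diff(g(grid))

def ramu_risk_estimate(bellman_targets, g, lower_is_better):
    """Sample-based distortion risk estimate over Bellman targets from n sampled models.

    Cost targets (lower is better) are sorted in descending order and reward targets in
    ascending order, so a concave distortion places more weight on unfavorable models.
    """
    targets = np.asarray(bellman_targets, dtype=float)
    order = -np.sort(-targets) if lower_is_better else np.sort(targets)
    w = distortion_weights(len(targets), g)
    return float(np.dot(w, order))
```

For example, `ramu_risk_estimate(cost_targets, lambda u: wang_distortion(u, 0.75), lower_is_better=True)` gives the cost-side estimate in (7), while passing the identity distortion `lambda u: u` recovers the risk-neutral average.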
Generative distribution of transition models
Note that our RAMU framework can be applied using any choice of distribution $\mu$, provided we can sample transition models $p_{s,a} \sim \mu_{s,a}$ and corresponding next states $s' \sim p_{s,a}$. In this work, we define the distribution $\mu$ over perturbed versions of a single training environment $p_{\text{train}}$, and we propose a generative approach to sampling transition models and corresponding next states that only requires data collected from $p_{\text{train}}$. By doing so, our RAMU framework achieves robust performance and safety with minimal assumptions on the training process, and can even be applied to settings that require real-world data collection for training.
We consider a latent variable $\xi$, and we define a transition model $p^{\xi}_{s,a}$ for every $\xi$ that shifts the probability of $s'$ under $p_{\text{train}}$ according to a perturbation function $f_{\xi}$. This perturbation function takes as input a state transition $(s,a,s')$, and outputs a perturbed next state $f_{\xi}(s,a,s')$ that depends on the latent variable $\xi$. Therefore, a distribution over latent space implicitly defines a distribution over perturbed versions of $p_{\text{train}}$. In order to obtain the next state samples needed to compute the Bellman target estimates in (7), we sample latent variables $\xi_i$, $i = 1, \ldots, n$, and apply $f_{\xi_i}$ to the state transition observed in the training environment. We have that $s' \sim p_{\text{train}}(\cdot \mid s,a)$ for data collected in the training environment, so $s'_i = f_{\xi_i}(s,a,s')$ represents the corresponding sample from the perturbed transition model $p^{\xi_i}_{s,a}$.
In our experiments, we consider a simple implementation for the common case where $\mathcal{S} \subseteq \mathbb{R}^{d}$. We use uniformly distributed latent variables $\xi \sim \mathcal{U}\!\left( [-2\alpha, 2\alpha]^{d} \right)$, and we define the perturbation function as
$$f_{\xi}(s,a,s') = s' + \xi \odot (s' - s),$$
where all operations are performed per-coordinate. Therefore, the latent variable $\xi$ can be interpreted as the percentage change in each dimension of a state transition observed in the training environment, where the average magnitude of the percentage change is $\alpha$. The hyperparameter $\alpha$ determines the distribution over transition models, where a larger value of $\alpha$ leads to transition models that vary more significantly from the training environment. The structure of $f_{\xi}$ provides an intuitive, scale-invariant meaning for the hyperparameter $\alpha$, which makes it easy to tune in practice. This choice of distribution captures general uncertainty in the training environment, without requiring specific domain knowledge of potential disturbances.
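The generative sampling step can be sketched in a few lines (a minimal sketch under the notation above; the per-coordinate uniform range $[-2\alpha, 2\alpha]$ is an assumed concrete choice for which the average perturbation magnitude equals $\alpha$):

```python
import numpy as np

def sample_perturbed_next_states(s, s_next, alpha=0.1, n=5, rng=None):
    """Sample n perturbed next states s'_i = s' + xi_i * (s' - s) from one observed transition.

    Latent variables are drawn per-coordinate from U(-2*alpha, 2*alpha), so the average
    magnitude of the percentage change applied to the state transition is alpha (assumed form).
    """
    rng = np.random.default_rng() if rng is None else rng
    s, s_next = np.asarray(s, dtype=float), np.asarray(s_next, dtype=float)
    xi = rng.uniform(-2.0 * alpha, 2.0 * alpha, size=(n,) + s.shape)  # latent variables
    return s_next + xi * (s_next - s)  # one perturbed next state per sampled model
```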
Algorithm
We summarize the implementation of our RAMU framework in Algorithm 1. Given data collected in a single training environment, we can efficiently calculate the sample-based RAMU Bellman targets in (7) by (i) sampling latent variables $\xi_i$, $i = 1, \ldots, n$, (ii) computing the corresponding next state samples $s'_i = f_{\xi_i}(s,a,s')$, and (iii) sorting and weighting the standard Bellman estimates that correspond to these sampled transition models. Given the sample-based RAMU Bellman targets, updates of the critics and policy have the same form as in standard deep safe RL algorithms. Therefore, our RAMU framework can be easily combined with many popular safe RL algorithms to incorporate model uncertainty with robustness guarantees, using only a minor change to the estimation of Bellman targets that is efficient to implement in practice.
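Putting the pieces together, a hedged sketch of the per-transition target computation in Algorithm 1 is shown below; `q_target` and `policy_sample` are placeholders for the target critic and the current policy, and the helpers are the ones sketched earlier in this section.

```python
import numpy as np

def ramu_bellman_target(s, a, reward_or_cost, s_next, q_target, policy_sample, g,
                        lower_is_better, alpha=0.1, n=5, gamma=0.99, rng=None):
    """Sample-based RAMU Bellman target (7) from a single observed transition (s, a, r or c, s')."""
    # (i) sample latent variables and (ii) compute the corresponding perturbed next states
    next_states = sample_perturbed_next_states(s, s_next, alpha=alpha, n=n, rng=rng)
    # standard Bellman target under each sampled transition model
    targets = np.array([reward_or_cost + gamma * q_target(ns, policy_sample(ns))
                        for ns in next_states])
    # (iii) sort the targets and combine them with the distortion weights
    return ramu_risk_estimate(targets, g, lower_is_better)
```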
7 Experiments
In order to evaluate the performance and safety of our RAMU framework, we conduct experiments on 5 continuous control tasks with safety constraints from the Real-World RL Suite [18, 19]: Cartpole Swingup, Walker Walk, Walker Run, Quadruped Walk, and Quadruped Run. Each task has a horizon length of 1,000 with $\gamma = 0.99$, and we consider a safety budget of $B = 100$. Unless noted otherwise, we train on a single training environment for 1 million steps across 5 random seeds, and we evaluate performance of the learned policies across a range of perturbed test environments via 10 trajectory rollouts. See the Appendix for information on the safety constraints and environment perturbations that we consider.
Table 1: Aggregate performance and safety of learned policies across perturbed test environments.

| Algorithm | % Safe¹ | Normalized Ave. Reward² | Normalized Ave. Cost² | Rollouts Require Adversary³ | Rollouts Require Simulator³ |
|---|---|---|---|---|---|
| Safe RL | 51% | 1.00 | 1.00 | No | No |
| RAMU (Wang 0.75) | 80% | 1.08 | 0.51 | No | No |
| RAMU (Expectation) | 74% | 1.05 | 0.67 | No | No |
| Domain Randomization | 76% | 1.14 | 0.72 | No | Yes |
| Domain Randomization (OOD) | 55% | 1.02 | 1.02 | No | Yes |
| Adversarial RL | 82% | 1.05 | 0.48 | Yes | No |

¹ Percentage of policies that satisfy the safety constraint across all tasks and test environments.
² Normalized relative to the average performance of standard safe RL for each task and test environment.
³ Denotes need for adversary or simulator during data collection (i.e., trajectory rollouts) for training.
[Figure 2: Performance and safety of the learned policies across perturbed test environments for each task.]
Our RAMU framework can be combined with several choices of safe RL algorithms. We consider the safe RL algorithm Constraint-Rectified Policy Optimization (CRPO) [55], and we use Maximum a Posteriori Policy Optimization (MPO) [1] as the unconstrained policy optimization algorithm in CRPO. For a fair comparison, we apply this choice of safe RL policy update in every method we consider in our experiments. We use a multivariate Gaussian policy with learned mean and diagonal covariance at each state, along with separate reward and cost critics. We parameterize our policy and critics using neural networks. See the Appendix for implementation details. Code is publicly available at https://github.com/jqueeney/robust-safe-rl.
We summarize the performance and safety of our RAMU framework in Table 1 and Figure 2, compared to several baseline algorithms that we discuss next. We include detailed experimental results across all perturbed test environments in the Appendix. We apply our RAMU framework using the Wang transform with $\eta = 0.75$ as the risk measure in both the objective and constraint. In order to understand the impact of being risk-averse to model uncertainty, we also consider the risk-neutral special case of our framework where expectations are applied to the objective and constraint. For our RAMU results in Table 1 and Figure 2, we specify the risk measure in parentheses. Finally, we consider $n = 5$ samples of transition models with latent variable hyperparameter $\alpha = 0.10$ in order to calculate Bellman targets in our RAMU framework.
Comparison to safe reinforcement learning
First, we analyze the impact of our RAMU framework compared to standard safe RL. In both cases, we train policies using data collected from a single training environment, so the only difference comes from our use of risk-averse model uncertainty to learn RAMU Q functions. By evaluating the learned policies in perturbed test environments different from the training environment, we see that our RAMU framework provides robustness in terms of both total rewards and safety. In particular, the risk-averse implementation of our algorithm leads to safety constraint satisfaction in 80% of test environments, compared to only 51% with standard safe RL. In addition, this implementation results in higher total rewards (1.08x) and lower total costs (0.51x), on average. We see in Table 1 that the use of expectations over model uncertainty (i.e., a risk-neutral approach) also improves robustness in both the objective and constraint, on average, compared to standard safe RL. However, we further improve upon the benefits observed in the risk-neutral case by instead applying a risk-averse perspective.
Comparison to domain randomization
Next, we compare our RAMU framework to domain randomization, a popular approach that also represents model uncertainty using a distribution over models. Note that domain randomization considers parametric uncertainty and has the benefit of training on a range of simulated environments, while our method only collects data from a single training environment. In order to evaluate the importance of domain knowledge for defining the training distribution in domain randomization, we consider two different cases: an in-distribution version that trains on a subset of the perturbed test environments, and an out-of-distribution (OOD) version that randomizes over a different perturbation parameter than the one varied at test time.
The results in Table 1 and Figure 2 show the importance of domain knowledge: in-distribution domain randomization leads to improved robustness compared to standard safe RL and the highest normalized average rewards (1.14x), while the out-of-distribution version provides little benefit. In both cases, however, domain randomization achieves lower levels of safety, on average, than our risk-averse formulation. In fact, we see in Figure 2 that the safety constraint satisfaction of our risk-averse formulation is at least as strong as both versions of domain randomization in 4 out of 5 tasks, despite only training on a single environment with no specific knowledge about the disturbances at test time. This demonstrates the key benefit of our risk-averse approach to model uncertainty.
Comparison to adversarial reinforcement learning
Finally, we compare our approach to adversarial RL using the action-robust PR-MDP framework [46], which randomly applies worst-case actions a percentage of the time during data collection. Although adversarial RL only collects data from a single training environment, it requires potentially dangerous adversarial interventions during training in order to provide robustness at test time. In order to apply this method to the safe RL setting, we train an adversary to maximize costs and consider a 5% probability of intervention during training. The performance of adversarial RL is typically evaluated without adversarial interventions, which requires a clear distinction between training and testing.
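For reference, the probabilistic intervention used by this baseline can be summarized as a simple action-mixing step during rollouts (an illustrative sketch of the mechanism described above, not the implementation from Tessler et al. [46]):

```python
import numpy as np

def pr_mdp_action(agent_action, adversary_action, intervention_prob=0.05, rng=None):
    """With small probability, replace the agent's action by the cost-maximizing adversary's action."""
    rng = np.random.default_rng() if rng is None else rng
    return adversary_action if rng.random() < intervention_prob else agent_action
```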
We see in Figure 2 that adversarial RL learns policies that achieve robust safety constraint satisfaction at test time in the Quadruped tasks. Our risk-averse formulation, on the other hand, achieves higher levels of safety in the remaining 3 out of 5 tasks, and similar levels of safety on average. Unlike adversarial RL, our RAMU framework achieves robust safety in a way that (i) does not alter the data collection process, (ii) does not require training an adversary in a minimax formulation, and (iii) does not require different implementations during training and testing. In addition, our use of a distribution over models represents a less conservative approach than adversarial RL, resulting in higher normalized average rewards as shown in Table 1.
8 Conclusion
We have presented a framework for safe RL in the presence of model uncertainty, an important setting for many real-world decision making applications. Compared to existing approaches to model uncertainty in deep RL, our formulation applies a risk-averse perspective through the use of coherent distortion risk measures. We show that this results in robustness guarantees, while still leading to an efficient deep RL implementation that does not involve minimax optimization problems. Importantly, our method only requires data collected from a single training environment, so it can be applied to real-world domains where high-fidelity simulators are not readily available or are computationally expensive. Therefore, our framework represents an attractive approach to safe decision making under model uncertainty that can be deployed across a range of applications.
Prior to potential deployment, it is important to understand the limitations of our proposed methodology. The robustness and safety of our RAMU framework depend on the user-defined choices of the model distribution $\mu$ and the risk measures $\rho_r$ and $\rho_c$. The distribution $\mu$ defines the uncertainty over transition models, and the risk measures define the level of robustness to this choice of $\mu$. In addition, our approach only considers robustness with respect to model uncertainty and safety as defined by expected total cost constraints. It would be interesting to extend our techniques to address other forms of uncertainty and other definitions of safety, including epistemic uncertainty in model-based RL, observational uncertainty, and safety-critical formulations based on sets of unsafe states.
References
- Abdolmaleki et al. [2018] A. Abdolmaleki, J. T. Springenberg, Y. Tassa, R. Munos, N. Heess, and M. Riedmiller. Maximum a posteriori policy optimisation. In Sixth International Conference on Learning Representations, 2018.
- Abdolmaleki et al. [2020] A. Abdolmaleki, S. Huang, L. Hasenclever, M. Neunert, F. Song, M. Zambelli, M. Martins, N. Heess, R. Hadsell, and M. Riedmiller. A distributional view on multi-objective policy optimization. In Proceedings of the 37th International Conference on Machine Learning, volume 119, pages 11–22. PMLR, 2020.
- Achiam et al. [2017] J. Achiam, D. Held, A. Tamar, and P. Abbeel. Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 22–31. PMLR, 2017.
- Altman [1999] E. Altman. Constrained Markov Decision Processes. CRC Press, 1999.
- Andrychowicz et al. [2020] M. Andrychowicz, B. Baker, M. Chociej, R. Józefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, J. Schneider, S. Sidor, J. Tobin, P. Welinder, L. Weng, and W. Zaremba. Learning dexterous in-hand manipulation. The International Journal of Robotics Research, 39(1):3–20, 2020. doi: 10.1177/0278364919887447.
- Artzner et al. [1999] P. Artzner, F. Delbaen, J.-M. Eber, and D. Heath. Coherent measures of risk. Mathematical Finance, 9(3):203–228, 1999. doi: 10.1111/1467-9965.00068.
- As et al. [2022] Y. As, I. Usmanova, S. Curi, and A. Krause. Constrained policy optimization via Bayesian world models. In Tenth International Conference on Learning Representations, 2022.
- Bellemare et al. [2017] M. G. Bellemare, W. Dabney, and R. Munos. A distributional perspective on reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 449–458. PMLR, 2017.
- Bharadhwaj et al. [2021] H. Bharadhwaj, A. Kumar, N. Rhinehart, S. Levine, F. Shkurti, and A. Garg. Conservative safety critics for exploration. In Ninth International Conference on Learning Representations, 2021.
- Cao et al. [2022] W. Cao, M. Benosman, and R. Ma. Domain knowledge-based automated analog circuit design with deep reinforcement learning. In The 59th ACM/IEEE Design Automation Conference, 2022.
- Chen and Paschalidis [2020] R. Chen and I. C. Paschalidis. Distributionally robust learning. Foundations and Trends® in Optimization, 4(1-2):1–243, 2020. doi: 10.1561/2400000026.
- Chow et al. [2015] Y. Chow, A. Tamar, S. Mannor, and M. Pavone. Risk-sensitive and robust decision-making: a CVaR optimization approach. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.
- Chua et al. [2018] K. Chua, R. Calandra, R. McAllister, and S. Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
- Dabney et al. [2018] W. Dabney, G. Ostrovski, D. Silver, and R. Munos. Implicit quantile networks for distributional reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 1096–1105. PMLR, 2018.
- Derman and Mannor [2020] E. Derman and S. Mannor. Distributional robustness and regularization in reinforcement learning. arXiv preprint, 2020. arXiv:2003.02894.
- Derman et al. [2018] E. Derman, D. J. Mankowitz, T. A. Mann, and S. Mannor. Soft-robust actor-critic policy-gradient. arXiv preprint, 2018. arXiv:1803.04848.
- Dhaene et al. [2012] J. Dhaene, A. Kukush, D. Linders, and Q. Tang. Remarks on quantiles and distortion risk measures. European Actuarial Journal, 2:319–328, 2012. doi: 10.1007/s13385-012-0058-0.
- Dulac-Arnold et al. [2020] G. Dulac-Arnold, N. Levine, D. J. Mankowitz, J. Li, C. Paduraru, S. Gowal, and T. Hester. An empirical investigation of the challenges of real-world reinforcement learning. arXiv preprint, 2020. arXiv:2003.11881.
- Dulac-Arnold et al. [2021] G. Dulac-Arnold, N. Levine, D. J. Mankowitz, J. Li, C. Paduraru, S. Gowal, and T. Hester. Challenges of real-world reinforcement learning: definitions, benchmarks and analysis. Machine Learning, 110:2419–2468, 2021. doi: 10.1007/s10994-021-05961-4.
- Hoffman et al. [2020] M. W. Hoffman, B. Shahriari, J. Aslanides, G. Barth-Maron, N. Momchev, D. Sinopalnikov, P. Stańczyk, S. Ramos, A. Raichuk, D. Vincent, L. Hussenot, R. Dadashi, G. Dulac-Arnold, M. Orsini, A. Jacq, J. Ferret, N. Vieillard, S. K. S. Ghasemipour, S. Girgin, O. Pietquin, F. Behbahani, T. Norman, A. Abdolmaleki, A. Cassirer, F. Yang, K. Baumli, S. Henderson, A. Friesen, R. Haroun, A. Novikov, S. G. Colmenarejo, S. Cabi, C. Gulcehre, T. L. Paine, S. Srinivasan, A. Cowie, Z. Wang, B. Piot, and N. de Freitas. Acme: A research framework for distributed reinforcement learning. arXiv preprint, 2020. arXiv:2006.00979.
- Iyengar [2005] G. N. Iyengar. Robust dynamic programming. Mathematics of Operations Research, 30(2):257–280, 2005. doi: 10.1287/moor.1040.0129.
- Janner et al. [2019] M. Janner, J. Fu, M. Zhang, and S. Levine. When to trust your model: Model-based policy optimization. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
- Jones and Zitikis [2003] B. L. Jones and R. Zitikis. Empirical estimation of risk measures and related quantities. North American Actuarial Journal, 7(4):44–54, 2003. doi: 10.1080/10920277.2003.10596117.
- Keramati et al. [2020] R. Keramati, C. Dann, A. Tamkin, and E. Brunskill. Being optimistic to be conservative: Quickly learning a CVaR policy. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 4436–4443, 2020.
- Kurutach et al. [2018] T. Kurutach, I. Clavera, Y. Duan, A. Tamar, and P. Abbeel. Model-ensemble trust-region policy optimization. In Sixth International Conference on Learning Representations, 2018.
- L.A. and Fu [2022] P. L.A. and M. C. Fu. Risk-sensitive reinforcement learning via policy gradient search. Foundations and Trends® in Machine Learning, 15(5):537–693, 2022. ISSN 1935-8237. doi: 10.1561/2200000091.
- Liu et al. [2020] Y. Liu, J. Ding, and X. Liu. IPO: Interior-point policy optimization under constraints. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 4940–4947, 2020.
- Liu et al. [2022] Z. Liu, Z. Cen, V. Isenbaev, W. Liu, S. Wu, B. Li, and D. Zhao. Constrained variational policy optimization for safe reinforcement learning. In Proceedings of the 39th International Conference on Machine Learning, pages 13644–13668. PMLR, 2022.
- Ma et al. [2020] X. Ma, L. Xia, Z. Zhou, J. Yang, and Q. Zhao. DSAC: Distributional soft actor critic for risk-sensitive reinforcement learning. arXiv preprint, 2020. arXiv:2004.14547.
- Majumdar and Pavone [2020] A. Majumdar and M. Pavone. How should a robot assess risk? Towards an axiomatic theory of risk in robotics. In Robotics Research, pages 75–84. Springer International Publishing, 2020. ISBN 978-3-030-28619-4.
- Mankowitz et al. [2020] D. J. Mankowitz, N. Levine, R. Jeong, A. Abdolmaleki, J. T. Springenberg, Y. Shi, J. Kay, T. Hester, T. Mann, and M. Riedmiller. Robust reinforcement learning for continuous control with model misspecification. In Eighth International Conference on Learning Representations, 2020.
- Mankowitz et al. [2021] D. J. Mankowitz, D. A. Calian, R. Jeong, C. Paduraru, N. Heess, S. Dathathri, M. Riedmiller, and T. Mann. Robust constrained reinforcement learning for continuous control with model misspecification. arXiv preprint, 2021. arXiv:2010.10644.
- Mowlavi et al. [2022] S. Mowlavi, M. Benosman, and S. Nabi. Reinforcement learning state estimation for high-dimensional nonlinear systems. In Tenth International Conference on Learning Representations, 2022.
- Nilim and Ghaoui [2005] A. Nilim and L. E. Ghaoui. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798, 2005. doi: 10.1287/opre.1050.0216.
- Osband et al. [2016] I. Osband, C. Blundell, A. Pritzel, and B. Van Roy. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.
- Osband et al. [2018] I. Osband, J. Aslanides, and A. Cassirer. Randomized prior functions for deep reinforcement learning. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
- Peng et al. [2018] X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel. Sim-to-real transfer of robotic control with dynamics randomization. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 3803–3810, 2018. doi: 10.1109/ICRA.2018.8460528.
- Pinto et al. [2017] L. Pinto, J. Davidson, R. Sukthankar, and A. Gupta. Robust adversarial reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 2817–2826. PMLR, 2017.
- Rajeswaran et al. [2017] A. Rajeswaran, S. Ghotra, B. Ravindran, and S. Levine. EPOpt: Learning robust neural network policies using model ensembles. In 5th International Conference on Learning Representations, 2017.
- Rajeswaran et al. [2020] A. Rajeswaran, I. Mordatch, and V. Kumar. A game theoretic framework for model based reinforcement learning. In Proceedings of the 37th International Conference on Machine Learning, volume 119, pages 7953–7963. PMLR, 2020.
- Ray et al. [2019] A. Ray, J. Achiam, and D. Amodei. Benchmarking safe exploration in deep reinforcement learning, 2019.
- Shapiro et al. [2014] A. Shapiro, D. Dentcheva, and A. Ruszczyński. Lectures on Stochastic Programming: Modeling and Theory, Second Edition. Society for Industrial and Applied Mathematics, 2014. ISBN 1611973422.
- Shen et al. [2014] Y. Shen, M. J. Tobia, T. Sommer, and K. Obermayer. Risk-sensitive reinforcement learning. Neural Computation, 26(7):1298–1328, 2014. doi: 10.1162/NECO_a_00600.
- Stooke et al. [2020] A. Stooke, J. Achiam, and P. Abbeel. Responsive safety in reinforcement learning by PID Lagrangian methods. In Proceedings of the 37th International Conference on Machine Learning, volume 119, pages 9133–9143. PMLR, 2020.
- Tamar et al. [2015] A. Tamar, Y. Chow, M. Ghavamzadeh, and S. Mannor. Policy gradient for coherent risk measures. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.
- Tessler et al. [2019a] C. Tessler, Y. Efroni, and S. Mannor. Action robust reinforcement learning and applications in continuous control. In Proceedings of the 36th International Conference on Machine Learning, volume 97, pages 6215–6224. PMLR, 2019a.
- Tessler et al. [2019b] C. Tessler, D. J. Mankowitz, and S. Mannor. Reward constrained policy optimization. In Seventh International Conference on Learning Representations, 2019b.
- Tobin et al. [2017] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 23–30, 2017. doi: 10.1109/IROS.2017.8202133.
- Vinitsky et al. [2020] E. Vinitsky, Y. Du, K. Parvate, K. Jang, P. Abbeel, and A. Bayen. Robust reinforcement learning using adversarial populations. arXiv preprint, 2020. arXiv:2008.01825.
- Wang [1996] S. Wang. Premium calculation by transforming the layer premium density. ASTIN Bulletin, 26(1):71–92, 1996. doi: 10.2143/AST.26.1.563234.
- Wang [2000] S. S. Wang. A class of distortion operators for pricing financial and insurance risks. The Journal of Risk and Insurance, 67(1):15–36, 2000. ISSN 00224367, 15396975. doi: 10.2307/253675.
- Wirch and Hardy [2003] J. L. Wirch and M. R. Hardy. Distortion risk measures: Coherence and stochastic dominance. Insurance Mathematics and Economics, 32:168–168, 2003.
- Xu and Mannor [2010] H. Xu and S. Mannor. Distributionally robust Markov decision processes. In Advances in Neural Information Processing Systems, volume 23. Curran Associates, Inc., 2010.
- Xu et al. [2022] J. Xu, S. Kim, T. Chen, A. Rodriguez, P. Agrawal, W. Matusik, and S. Sueda. Efficient tactile simulation with differentiability for robotic manipulation. In The Conference on Robot Learning (CoRL), 2022.
- Xu et al. [2021] T. Xu, Y. Liang, and G. Lan. CRPO: A new approach for safe reinforcement learning with convergence guarantee. In Proceedings of the 38th International Conference on Machine Learning, pages 11480–11491. PMLR, 2021.
- Yu and Xu [2016] P. Yu and H. Xu. Distributionally robust counterpart in Markov decision processes. IEEE Transactions on Automatic Control, 61(9):2538–2543, 2016. doi: 10.1109/TAC.2015.2495174.
Appendix A Properties of coherent distortion risk measures
Majumdar and Pavone [30] proposed a set of six axioms to characterize desirable properties of risk measures in the context of robotics.
- A1. Monotonicity: If $Z, Z' \in \mathcal{Z}$ and $Z \leq Z'$ almost everywhere, then $\rho(Z) \leq \rho(Z')$.
- A2. Translation invariance: If $b \in \mathbb{R}$ and $Z \in \mathcal{Z}$, then $\rho(Z + b) = \rho(Z) + b$.
- A3. Positive homogeneity: If $\lambda \geq 0$ and $Z \in \mathcal{Z}$, then $\rho(\lambda Z) = \lambda \rho(Z)$.
- A4. Convexity: If $\lambda \in [0,1]$ and $Z, Z' \in \mathcal{Z}$, then $\rho(\lambda Z + (1-\lambda) Z') \leq \lambda \rho(Z) + (1-\lambda) \rho(Z')$.
- A5. Comonotonic additivity: If $Z, Z' \in \mathcal{Z}$ are comonotonic, then $\rho(Z + Z') = \rho(Z) + \rho(Z')$.
- A6. Law invariance: If $Z, Z' \in \mathcal{Z}$ are identically distributed, then $\rho(Z) = \rho(Z')$.
See Majumdar and Pavone [30] for a discussion on the intuition behind these axioms. Note that coherent risk measures [6] satisfy Axioms A1–A4, distortion risk measures [50, 17] satisfy Axioms A1–A3 and Axioms A5–A6, and coherent distortion risk measures satisfy all six axioms.
The properties of coherent risk measures also lead to a useful dual representation.
Lemma 2 (Shapiro et al. [42]).
Let $\rho: \mathcal{Z} \to \mathbb{R}$ be a proper, real-valued coherent risk measure. Then, for any $Z \in \mathcal{Z}$ we have that
$$\rho(Z) = \sup_{\xi \in \mathfrak{A}} \mathbb{E}_{\xi \mathbb{P}}\!\left[ Z \right],$$
where $\mathbb{E}_{\xi \mathbb{P}}$ represents expectation with respect to the probability measure $\xi \mathbb{P}$ determined by the density $\xi$, and $\mathfrak{A} \subseteq \mathcal{Z}^*$ is a convex, bounded, and weakly* closed set that depends on $\rho$.
See Shapiro et al. [42] for a general treatment of this result.
Appendix B Proofs
In this section, we prove all results related to the RAMU cost Bellman operator $\mathcal{T}^{\pi}_{c,\mu}$. Using the fact that $\tilde{\rho}(Z) = -\rho(-Z)$ for a coherent distortion risk measure $\rho$ on a cost random variable, all results related to the RAMU reward Bellman operator $\mathcal{T}^{\pi}_{r,\mu}$ follow by an appropriate change in sign.
B.1 Proof of Lemma 1
Proof.
Starting from the definition of $\mathcal{T}^{\pi}_{c,\mu}$ in Definition 3, we have that
$$\begin{aligned}
\mathcal{T}^{\pi}_{c,\mu} Q(s,a) &= c(s,a) + \gamma\, \rho_c\!\left( \mathbb{E}_{s' \sim p_{s,a},\, a' \sim \pi}\!\left[ Q(s',a') \right] \right) \\
&= c(s,a) + \rho_c\!\left( \gamma\, \mathbb{E}_{s' \sim p_{s,a},\, a' \sim \pi}\!\left[ Q(s',a') \right] \right) \\
&= \rho_c\!\left( c(s,a) + \gamma\, \mathbb{E}_{s' \sim p_{s,a},\, a' \sim \pi}\!\left[ Q(s',a') \right] \right) \\
&= \rho_c\!\left( \mathcal{T}^{\pi}_{c,p} Q(s,a) \right),
\end{aligned}$$
which proves the result. Note that the second equality follows from the positive homogeneity of $\rho_c$ (Axiom A3), the third equality follows from the translation invariance of $\rho_c$ (Axiom A2), and the fourth equality follows from the definition of the standard cost Bellman operator $\mathcal{T}^{\pi}_{c,p}$. ∎
B.2 Proof of Theorem 1
Proof.
First, we show that $\mathcal{T}^{\pi}_{c,\mu}$ is equivalent to a distributionally robust Bellman operator. For a given state-action pair, we apply Lemma 2 to the risk measure $\rho_c$ that appears in the formulation of $\mathcal{T}^{\pi}_{c,\mu}$ given by Lemma 1. By doing so, we have that
$$\mathcal{T}^{\pi}_{c,\mu} Q(s,a) = \rho_c\!\left( \mathcal{T}^{\pi}_{c,p} Q(s,a) \right) = \sup_{\xi \in \mathfrak{A}_c} \mathbb{E}_{p_{s,a} \sim \xi \mu_{s,a}}\!\left[ \mathcal{T}^{\pi}_{c,p} Q(s,a) \right] = \sup_{\nu_{s,a} \in \mathcal{M}_{c,s,a}} \mathbb{E}_{p_{s,a} \sim \nu_{s,a}}\!\left[ \mathcal{T}^{\pi}_{c,p} Q(s,a) \right],$$
where $\mathfrak{A}_c$ is defined in Lemma 2. Therefore, $\mathcal{T}^{\pi}_{c,\mu}$ has the same form as a distributionally robust Bellman operator [53, 56] with the ambiguity set $\mathcal{M}_{c,s,a} = \left\{ \xi\, \mu_{s,a} : \xi \in \mathfrak{A}_c \right\}$. The RAMU cost Q function $Q^{\pi}_{c,\mu}$ is the fixed point of $\mathcal{T}^{\pi}_{c,\mu}$, so it is equivalent to a distributionally robust Q function with ambiguity set $\mathcal{M}_{c,s,a}$. Using the rectangularity of $\mathcal{M}_c = \bigotimes_{(s,a)} \mathcal{M}_{c,s,a}$, we can write this succinctly as
$$Q^{\pi}_{c,\mu}(s,a) = \sup_{\nu \in \mathcal{M}_c} \mathbb{E}_{p \sim \nu}\!\left[ Q^{\pi}_{c,p}(s,a) \right].$$
Then, using the definition of $J_{c,\mu}(\pi)$ we have that
$$J_{c,\mu}(\pi) = \mathbb{E}_{s_0 \sim d_0,\, a_0 \sim \pi}\!\left[ Q^{\pi}_{c,\mu}(s_0,a_0) \right] = \sup_{\nu \in \mathcal{M}_c} \mathbb{E}_{p \sim \nu}\!\left[ \mathbb{E}_{s_0 \sim d_0,\, a_0 \sim \pi}\!\left[ Q^{\pi}_{c,p}(s_0,a_0) \right] \right] = \sup_{\nu \in \mathcal{M}_c} \mathbb{E}_{p \sim \nu}\!\left[ J_{c,p}(\pi) \right],$$
where we can move the optimization over $\nu$ outside of the expectation operators due to rectangularity. Together with the analogous result for rewards, this shows that (3) is equivalent to (6). ∎
B.3 Proof of Corollary 1
Given the equivalence of and to distributionally robust Bellman operators as shown in Theorem 1, Corollary 1 follows from results in Xu and Mannor [53] and Yu and Xu [56]. We include a proof for completeness.
Proof.
Due to the linearity of the expectation operator, for a given $\nu_{s,a} \in \mathcal{M}_{c,s,a}$ we have that
$$\mathbb{E}_{p_{s,a} \sim \nu_{s,a}}\!\left[ \mathbb{E}_{s' \sim p_{s,a},\, a' \sim \pi}\!\left[ Q(s',a') \right] \right] = \mathbb{E}_{s' \sim p^{\nu}_{s,a},\, a' \sim \pi}\!\left[ Q(s',a') \right],$$
where $p^{\nu}_{s,a} = \mathbb{E}_{p_{s,a} \sim \nu_{s,a}}\!\left[ p_{s,a} \right]$ represents a mixture transition model determined by $\nu_{s,a}$. Therefore, starting from the result in Theorem 1, we can write
$$\mathcal{T}^{\pi}_{c,\mu} Q(s,a) = c(s,a) + \gamma \sup_{p_{s,a} \in \mathcal{U}_{s,a}} \mathbb{E}_{s' \sim p_{s,a},\, a' \sim \pi}\!\left[ Q(s',a') \right],$$
where
$$\mathcal{U}_{s,a} = \left\{ p^{\nu}_{s,a} = \mathbb{E}_{p_{s,a} \sim \nu_{s,a}}\!\left[ p_{s,a} \right] : \nu_{s,a} \in \mathcal{M}_{c,s,a} \right\}.$$
As a result, $\mathcal{T}^{\pi}_{c,\mu}$ has the same form as a robust Bellman operator [21, 34] with the uncertainty set $\mathcal{U}_{s,a}$.
Consider Q functions $Q_1$ and $Q_2$, and denote the sup-norm by
$$\| Q \|_{\infty} = \sup_{(s,a) \in \mathcal{S} \times \mathcal{A}} \left| Q(s,a) \right|.$$
Fix $\epsilon > 0$ and consider $(s,a) \in \mathcal{S} \times \mathcal{A}$. Then, there exists $\bar{p}_{s,a} \in \mathcal{U}_{s,a}$ such that
$$\sup_{p_{s,a} \in \mathcal{U}_{s,a}} \mathbb{E}_{s' \sim p_{s,a},\, a' \sim \pi}\!\left[ Q_1(s',a') \right] \leq \mathbb{E}_{s' \sim \bar{p}_{s,a},\, a' \sim \pi}\!\left[ Q_1(s',a') \right] + \epsilon.$$
We have that
$$\begin{aligned}
\mathcal{T}^{\pi}_{c,\mu} Q_1(s,a) - \mathcal{T}^{\pi}_{c,\mu} Q_2(s,a)
&= \gamma \left( \sup_{p_{s,a} \in \mathcal{U}_{s,a}} \mathbb{E}_{s' \sim p_{s,a},\, a' \sim \pi}\!\left[ Q_1(s',a') \right] - \sup_{p_{s,a} \in \mathcal{U}_{s,a}} \mathbb{E}_{s' \sim p_{s,a},\, a' \sim \pi}\!\left[ Q_2(s',a') \right] \right) \\
&\leq \gamma \left( \mathbb{E}_{s' \sim \bar{p}_{s,a},\, a' \sim \pi}\!\left[ Q_1(s',a') \right] + \epsilon - \mathbb{E}_{s' \sim \bar{p}_{s,a},\, a' \sim \pi}\!\left[ Q_2(s',a') \right] \right) \\
&\leq \gamma\, \| Q_1 - Q_2 \|_{\infty} + \gamma \epsilon.
\end{aligned}$$
A similar argument can be used to show that
$$\mathcal{T}^{\pi}_{c,\mu} Q_2(s,a) - \mathcal{T}^{\pi}_{c,\mu} Q_1(s,a) \leq \gamma\, \| Q_1 - Q_2 \|_{\infty} + \gamma \epsilon,$$
so we have that
$$\left| \mathcal{T}^{\pi}_{c,\mu} Q_1(s,a) - \mathcal{T}^{\pi}_{c,\mu} Q_2(s,a) \right| \leq \gamma\, \| Q_1 - Q_2 \|_{\infty} + \gamma \epsilon.$$
By applying a supremum over state-action pairs on the left-hand side, we obtain
$$\left\| \mathcal{T}^{\pi}_{c,\mu} Q_1 - \mathcal{T}^{\pi}_{c,\mu} Q_2 \right\|_{\infty} \leq \gamma\, \| Q_1 - Q_2 \|_{\infty} + \gamma \epsilon.$$
Finally, since $\epsilon > 0$ was arbitrary, we have shown that $\mathcal{T}^{\pi}_{c,\mu}$ is a $\gamma$-contraction in the sup-norm. ∎
Appendix C Implementation details and additional experimental results
Safety constraints and environment perturbations
[Figure 3: Total rewards throughout training for CRPO and the unconstrained algorithm MPO across a range of safety coefficients for each task.]
Table 2: Safety constraint and safety coefficient used for each task.

| Task | Safety Constraint | Safety Coefficient |
|---|---|---|
| Cartpole Swingup | Slider Position | 0.30 |
| Walker Walk | Joint Velocity | 0.25 |
| Walker Run | Joint Velocity | 0.30 |
| Quadruped Walk | Joint Angle | 0.15 |
| Quadruped Run | Joint Angle | 0.30 |
In all of our experiments, we consider the problem of optimizing a task objective while satisfying a safety constraint. We focus on a single safety constraint corresponding to a cost function defined in the Real-World RL Suite for each task, and we consider a safety budget of $B = 100$. The safety constraints used for each task are described in Table 2. In the Cartpole domain, costs are applied when the slider is outside of a specified range. In the Walker domain, costs are applied for large joint velocities. In the Quadruped domain, costs are applied for large joint angles. See Dulac-Arnold et al. [19] for detailed definitions of each safety constraint.
The definitions of these cost functions depend on a safety coefficient in $[0,1]$, which determines the range of outcomes that lead to constraint violations and therefore controls how difficult it will be to satisfy safety constraints corresponding to these cost functions. As the safety coefficient decreases, the range of safe outcomes also decreases and the safety constraint becomes more difficult to satisfy. In order to consider safe RL tasks with difficult safety constraints where strong performance is still possible, we selected the value of this safety coefficient from a range of candidate values for each task based on the performance of the baseline safe RL algorithm CRPO compared to the unconstrained algorithm MPO. Figure 3 shows total rewards throughout training for each task across this range of safety coefficients. We selected the most difficult cost definition in this range (i.e., the lowest safety coefficient value) where CRPO is still able to achieve the same total rewards as MPO (or the value that leads to the smallest gap between the two in the case of Walker Run and Quadruped Run). The resulting safety coefficients used for our experiments are listed in Table 2.
In order to evaluate the robustness of our learned policies, we generate a range of test environments for each task based on perturbing a simulator parameter in the Real-World RL Suite. See Table 3 for the perturbation parameters and corresponding ranges considered in our experiments. The test range for each domain is centered around the nominal parameter value that defines the single training environment used for all experiments except domain randomization. See Figure 4 for detailed results of the risk-averse and risk-neutral versions of our RAMU framework across all tasks and environment perturbations.
Table 3: Perturbation parameter, nominal value, and test range for each domain.

| Domain | Perturbation Parameter | Nominal Value | Test Range |
|---|---|---|---|
| Cartpole | Pole Length | 1.00 | |
| Walker | Torso Length | 0.30 | |
| Quadruped | Torso Density | | |
[Figure 4: Detailed results of the risk-averse and risk-neutral versions of our RAMU framework across all tasks and environment perturbations.]
Domain randomization
Table 4: Perturbation parameters and training ranges used for domain randomization.

| Domain | Perturbation Parameter | Nominal Value | Training Range |
|---|---|---|---|
| In-Distribution | | | |
| Cartpole | Pole Length | 1.00 | |
| Walker | Torso Length | 0.30 | |
| Quadruped | Torso Density | | |
| Out-of-Distribution | | | |
| Cartpole | Pole Mass | 0.10 | |
| Walker | Contact Friction | 0.70 | |
| Quadruped | Contact Friction | 1.50 | |
Domain randomization requires a training distribution over a range of environments, which is typically defined by considering a range of simulator parameters. For the in-distribution version of domain randomization considered in our experiments, we apply a uniform distribution over a subset of the test environments defined in Table 3. In particular, we consider the middle 50% of test environment parameter values centered around the nominal environment value for training. In the out-of-distribution version of domain randomization, on the other hand, we consider a different perturbation parameter from the one varied at test time. We apply a uniform distribution over a range of values for this alternate parameter centered around the value in the nominal environment. Therefore, the only environment shared between the set of test environments and the set of training environments used for out-of-distribution domain randomization is the nominal environment. See Table 4 for details on the parameters and corresponding ranges used for training in domain randomization.
[Figure 5: Detailed results of domain randomization across all tasks and environment perturbations.]
We include the results for domain randomization across all tasks and environment perturbations in Figure 5. Across all tasks, we observe that our RAMU framework leads to similar or improved constraint satisfaction compared to in-distribution domain randomization, while only using one training environment. In addition, our framework consistently outperforms out-of-distribution domain randomization, which provides little benefit compared to standard safe RL due to its misspecified training distribution.
Adversarial reinforcement learning
In order to implement the action-robust PR-MDP framework, we must train an adversarial policy. We represent the adversarial policy using the same structure and neural network architecture as our main policy, and we train the adversarial policy to maximize total costs using MPO. Using the default setting in Tessler et al. [46], we apply one adversary update for every 10 policy updates.
We include the results for adversarial RL across all tasks and environment perturbations in Figure 6, where adversarial RL is evaluated without adversarial interventions. We see that adversarial RL leads to robust safety in some cases, such as the two Quadruped tasks. However, in other tasks such as Cartpole Swingup, safety constraint satisfaction is not as robust. Safety also comes at the cost of conservative performance in some tasks, as evidenced by the total rewards achieved by adversarial RL in Walker Run and Quadruped Run. Overall, our RAMU framework achieves similar performance to adversarial RL, without the drawbacks associated with adversarial methods that preclude their use in some real-world settings.
[Figure 6: Detailed results of adversarial RL across all tasks and environment perturbations.]
Network architectures and algorithm hyperparameters
In our experiments, we consider neural network representations of the policy and critics. Each of these neural networks contains 3 hidden layers of 256 units with ELU activations. In addition, we apply layer normalization followed by a tanh activation after the first layer of these networks as proposed in Abdolmaleki et al. [2]. We consider a multivariate Gaussian policy, where at a given state $s$ we have $\pi(\cdot \mid s) = \mathcal{N}\!\left( m_{\phi}(s), \Sigma_{\phi}(s) \right)$, where $m_{\phi}(s)$ and $\Sigma_{\phi}(s)$ represent outputs of the policy network. $\Sigma_{\phi}(s)$ is a diagonal covariance matrix, whose diagonal elements are calculated by applying the softplus operator to the outputs of the neural network. We parameterize the reward and cost critics with separate neural networks. In addition, we consider target networks that are updated as an exponential moving average, with the averaging coefficient given in Table 5.
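A minimal PyTorch sketch of the policy architecture described above (our rendering of the stated design choices; anything beyond the layer sizes, activations, and softplus parameterization is an illustrative assumption):

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """3 hidden layers of 256 units with ELU; layer norm followed by tanh after the first layer."""

    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.LayerNorm(hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
        )
        self.mean_head = nn.Linear(hidden, action_dim)
        self.std_head = nn.Linear(hidden, action_dim)

    def forward(self, state):
        h = self.trunk(state)
        mean = self.mean_head(h)
        std = nn.functional.softplus(self.std_head(h))  # diagonal covariance via softplus
        return torch.distributions.Normal(mean, std)    # factorized Gaussian policy
```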
We consider CRPO [55] as the baseline safe RL algorithm in all of our experiments, which immediately switches between maximizing rewards and minimizing costs at every update based on the value of the safety constraint. If the sample-average estimate of the safety constraint for the current batch of data satisfies the safety budget, we update the policy to maximize rewards. Otherwise, we update the policy to minimize costs.
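The switching rule can be summarized in a few lines (an illustrative sketch of the logic described above; the batch fields and update callables are placeholders, not the interface of the released code):

```python
def crpo_update(batch, policy, reward_step, cost_step, budget):
    """CRPO-style switching: maximize rewards when the sampled constraint estimate is satisfied,
    otherwise minimize costs."""
    constraint_estimate = batch.cost_estimate.mean()  # sample-average estimate of the safety constraint
    if constraint_estimate <= budget:
        return reward_step(policy, batch)  # update the policy to maximize rewards
    return cost_step(policy, batch)        # update the policy to minimize costs
```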
After CRPO determines the appropriate objective for the current batch of data, we apply MPO [1] to calculate policy updates. MPO calculates a non-parametric policy update based on the KL divergence parameter $\epsilon$, and then takes a step towards this non-parametric policy while constraining the KL divergence from updating the mean by $\epsilon_{m}$ and the KL divergence from updating the covariance matrix by $\epsilon_{\Sigma}$. We consider per-dimension KL divergence constraints by dividing these parameter values by the number of action dimensions, and we penalize actions outside of the feasible action limits using the multi-objective MPO framework [2] as suggested in Hoffman et al. [20]. In order to avoid potential issues related to the immediate switching between reward and cost objectives throughout training, we completely solve for the temperature parameter of the non-parametric target policy in MPO at every update as done in Liu et al. [28]. See Table 5 for the default hyperparameter values used in our experiments, which are based on default values considered in Hoffman et al. [20].
For our RAMU framework, the latent variable hyperparameter $\alpha$ controls the definition of the distribution over transition models. Figure 7 shows the performance of our RAMU framework in Walker Run and Quadruped Run for several values of $\alpha$. A larger value of $\alpha$ leads to a distribution over a wider range of transition models, which results in a more robust approach when combined with a risk-averse perspective on model uncertainty. We see in Figure 7 that our algorithm more robustly satisfies safety constraints as $\alpha$ increases, but this robustness also leads to a decrease in total rewards. We consider $\alpha = 0.10$ in our experiments, as it achieves strong constraint satisfaction without a meaningful decrease in rewards. Finally, for computational efficiency we consider $n = 5$ samples of transition models per data point to calculate sample-based Bellman targets in our RAMU framework, as we did not observe meaningful improvements in performance from considering a larger number of samples.
Table 5: Default hyperparameter values.

| Hyperparameter | Value |
|---|---|
| General | |
| Batch size per update | 256 |
| Updates per environment step | 1 |
| Discount rate ($\gamma$) | 0.99 |
| Target network exponential moving average | 5e-3 |
| Policy | |
| Layer sizes | 256, 256, 256 |
| Layer activations | ELU |
| Layer norm + tanh on first layer | Yes |
| Initial standard deviation | 0.3 |
| Learning rate | 1e-4 |
| Non-parametric KL ($\epsilon$) | 0.10 |
| Action penalty KL | 1e-3 |
| Action samples per update | 20 |
| Parametric mean KL ($\epsilon_{m}$) | 0.01 |
| Parametric covariance KL ($\epsilon_{\Sigma}$) | 1e-5 |
| Parametric KL dual learning rate | 0.01 |
| Critics | |
| Layer sizes | 256, 256, 256 |
| Layer activations | ELU |
| Layer norm + tanh on first layer | Yes |
| Learning rate | 1e-4 |
| RAMU | |
| Transition model samples per data point ($n$) | 5 |
| Latent variable hyperparameter ($\alpha$) | 0.10 |
[Figure 7: Performance and safety of our RAMU framework in Walker Run and Quadruped Run across different values of the latent variable hyperparameter $\alpha$.]
Computational resources
All experiments were run on a Linux cluster with 2.9 GHz Intel Gold processors and NVIDIA A40 and A100 GPUs. The Real-World RL Suite is available under the Apache License 2.0. We trained policies for 1 million steps across 5 random seeds, which required approximately one day of wall-clock time on a single GPU for each combination of algorithm and task using code that has not been optimized for execution speed.