
Offline Policy Comparison with Confidence: Benchmarks and Baselines

Anurag Koul [email protected]
School of EECS
Oregon State University
Mariano Phielipp [email protected]
Intel Labs
Alan Fern [email protected]
School of EECS
Oregon State University
Abstract

Decision makers often wish to use offline historical data to compare sequential-action policies at various world states. Importantly, computational tools should produce confidence values for such offline policy comparison (OPC) to account for statistical variance and limited data coverage. Nevertheless, there is little work that directly evaluates the quality of confidence values for OPC. In this work, we address this issue by creating benchmarks for OPC with Confidence (OPCC), derived by adding sets of policy comparison queries to datasets from offline reinforcement learning. In addition, we present an empirical evaluation of the risk versus coverage trade-off for a class of model-based baselines. In particular, the baselines learn ensembles of dynamics models, which are used in various ways to produce simulations for answering queries with confidence values. While our results suggest advantages for certain baseline variations, there appears to be significant room for improvement in future work.

1 Introduction

Given historical data from a dynamic environment, how well can we make predictions about future trajectories while also quantifying the uncertainty of those predictions? Our main goal is to drive research toward a positive answer by encouraging work on a specific prediction problem, offline policy comparison with confidence (OPCC). Toward this goal we contribute OPCC benchmarks with metrics that directly relate to uncertainty quantification, along with a baseline pilot evaluation. The benchmarks (https://github.com/koulanurag/opcc) and baselines (https://github.com/koulanurag/opcc-baselines) are made publicly available.

OPCC involves using historical data to answer queries that each ask for: 1) a prediction of which of two policies is better for an initial state and horizon, where the policies, state, and horizon can be arbitrarily specified, and 2) a confidence value for the prediction. While here we use OPCC for benchmarking uncertainty quantification, it also has utility for both decision support and policy optimization. For decision support, a farm manager may want a prediction for which of two irrigation policies will best match season-level crop goals. A careful farm manager, however, would only take the prediction seriously if it comes with a meaningful measure of confidence. For policy optimization, we may want to search through policy variations to identify variations that confidently improve over others in light of historical data.

Offline reinforcement learning (ORL) (Levine et al., 2020), both for policy evaluation and optimization, offers a number of techniques relevant to decision support and OPCC in particular. One of the key ORL challenges is dealing with uncertainty due to statistical variance and limited coverage of historical data. This recognition has led to rapid progress in ORL, yielding different approaches for addressing uncertainty, e.g. pessimism in the face of uncertainty (Kumar et al., 2020; Buckman et al., 2020; Jin et al., 2021a; Shrestha et al., 2021) or regularizing policy learning toward the historical data (Fujimoto & Gu, 2021; Kostrikov et al., 2021; Kumar et al., 2019; Peng et al., 2019). However, there has been very little work on directly evaluating the uncertainty quantification capabilities embedded in these approaches. Rather, overall ORL performance is typically evaluated, which can be affected by many algorithmic choices that are not directly related to uncertainty quantification. A major motivation for our work is to better measure and understand the underlying uncertainty quantification embedded in popular ORL approaches.

Prior work has studied non-sequential prediction (e.g. image classification) with an abstention (or rejection) option (El-Yaniv et al., 2010; Geifman & El-Yaniv, 2017; 2019; Hendrickx et al., 2021; Xin et al., 2021; Condessa et al., 2017). Typically, these methods produce confidence values for predictions and abstain based on a confidence threshold. Ideally, if the confidence values strongly relate to prediction uncertainty, then abstentions will be biased toward the erroneous predictions. In order to directly evaluate the quality of uncertainty quantification, this line of work commonly reports measures of risk-coverage curves (RCCs) such as area under the curve (AUC) and reverse-pair proportion (RPP). To the best of our knowledge, analogous benchmarks and evaluations have not yet been established for sequential decision making. Our focus on establishing benchmarks and metrics for OPCC aims at partially filling this gap.

Contribution. The first contribution of this paper is to develop benchmarks for OPCC derived from existing ORL benchmarks and to suggest metrics for the quality of uncertainty quantification. Each benchmark includes: 1) a set of trajectory data $D$ collected in an environment via different types of data collection policies, and 2) a set of queries $Q$, where each query asks which of two provided policies has a larger expected reward with respect to a specified horizon and initial states. Note that our OPCC benchmarks are related to recent benchmarks for offline policy evaluation (OPE) (Fu et al., 2021), which include a policy ranking task similar to OPCC. That work, however, does not propose evaluation metrics and protocols for measuring uncertainty quantification over policy rankings. Further, our query sets $Q$ span a much broader range of initial states than existing benchmarks, which is critical for understanding how uncertainty quantification varies across the wider state space as it relates to the trajectory data $D$. The benchmarks and baselines are publicly available with the intention of supporting community expansion over time.

Our second contribution is to present a pilot empirical evaluation of OPCC for a class of approaches that use ensembles as the mechanism for capturing uncertainty, which is one of the prevalent approaches in ORL. This class uses learned ensembles of dynamics and reward models to produce Monte-Carlo simulations of each policy, which can then be compared in various ways to produce a prediction and confidence value. Our results for different variations of this class provide evidence that some variations may improve aspects of uncertainty quantification. However, overall, we did not observe sizeable and consistent improvements from most of the variations we considered. This suggests that there is significant room for future work aimed at consistent improvement on one or more of the uncertainty-quantification metrics.

2 Background

We formulate our work in the framework of Markov Decision Processes (MDPs), for which we assume basic familiarity (Puterman, 2014). An MDP is a tuple $M=\{S,A,P,R\}$, where $S$ is the state space, $A$ is the action space, and $P(s'|s,a)$ is the first-order Markovian transition function that gives the probability of transitioning to state $s'$ given that action $a$ is taken in state $s$. Finally, $R(s,a)$ is a potentially stochastic reward function, which returns the reward for taking action $a$ in state $s$.

In this work, we focus on decision problems with a finite horizon $h$, where action selection can depend on the time step. A non-stationary policy $\pi(s,t)$ is a possibly stochastic function that returns an action for the specified state $s$ and time step $t\in\{0,\ldots,h-1\}$. Given a horizon $h$ and discount factor $\gamma\in[0,1)$, the value of a policy $\pi$ at state $s$, denoted $V^{\pi}(s,h)$, is the expected cumulative discounted future reward over the horizon:

$$V^{\pi}(s,h)=\mathbb{E}\left[\left.\sum_{t=0}^{h-1}\gamma^{t}R\left(S_{t},A_{t}\right)\,\right|\,S_{0}=s,\;A_{t}=\pi(S_{t},t)\right]$$

where $S_{t}$ and $A_{t}$ are the state and action random variables at time $t$. It is important to note that we gain considerable flexibility by allowing for non-stationary policies. For example, $\pi$ could be an open-loop policy or even a fixed sequence of actions, which are commonly used in the context of model-predictive control (Richards, 2005). Further, we can implicitly represent the action value function $Q^{\pi}(s,a,h)$ for a policy $\pi$ by defining a new non-stationary policy $\pi'$ that takes action $a$ at $t=0$ and then follows $\pi$ thereafter, which yields $V^{\pi'}(s,h)=Q^{\pi}(s,a,h)$. For this reason, we will focus exclusively on comparisons in terms of state-value functions without loss of generality.
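
As an illustration of this reduction, the following minimal Python sketch wraps an arbitrary policy into the non-stationary policy $\pi'$ described above. The callable policy interface $\pi(s,t)$ is an assumption made only for illustration.

```python
def q_as_v_policy(pi, a0):
    """Wrap policy `pi` into a non-stationary policy pi' that takes the fixed
    action `a0` at t=0 and follows `pi` thereafter, so that
    V^{pi'}(s, h) = Q^{pi}(s, a0, h)."""
    def pi_prime(s, t):
        return a0 if t == 0 else pi(s, t)
    return pi_prime
```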

3 Offline Policy Comparison with Confidence

In this section, we first introduce the concept of policy comparison queries, which are then used to define the OPCC learning problem. Finally, we discuss the OPCC evaluation metrics used in our experiments.

3.1 Policy Comparison Queries

We consider the fundamental decision problem of predicting the relative future performance of two policies, which we formalize via policy comparison queries (PCQs). A PCQ is a tuple $q=(s,\pi,\hat{s},\hat{\pi},h)$, where $s$ and $\hat{s}$ are arbitrary starting states, $\pi$ and $\hat{\pi}$ are policies, and $h$ is a horizon. The answer to a PCQ is the truth value of $V^{\pi}(s,h)<V^{\hat{\pi}}(\hat{s},h)$. That is, a PCQ asks whether the $h$-horizon value of $\hat{\pi}$ started in $\hat{s}$ is greater than that of $\pi$ started in $s$.

As motivated in Section 1, PCQs are useful for both human decision support and automated policy optimization. For example, if a farm manager wants information about which of two irrigation policies, $\pi$ and $\hat{\pi}$, will result in the best future crop yield given the environment state $s$, then the corresponding PCQ would be $(s,\pi,s,\hat{\pi},h)$. Alternatively, the manager may be interested in whether a policy $\pi$ is better suited to an environmental state $s$ or $\hat{s}$, which is captured by the PCQ $(s,\pi,\hat{s},\pi,h)$. In addition, PCQs can be used as the basis for the classic policy improvement step of policy iteration (Puterman, 2014). In particular, we can improve over policy $\pi$ at state $s$ by identifying an action $a'$ with a higher action value than the action chosen by $\pi$. The corresponding PCQ for testing $a'$ is $(s,\pi,s,\pi',h)$, where $\pi'$ is the non-stationary policy that first takes action $a'$ and then follows $\pi$.

In practice, PCQs within an application domain need not be restricted to comparing policies via a single reward function. Rather, there are often multiple quantities of interest to users. For example, a farm manager may be interested in understanding how two irrigation policies compare across multiple features of the future, such as cumulative water usage, plant stress, runoff, etc. This can be facilitated by defining reward functions corresponding to each feature and issuing the appropriate PCQs.

3.2 Learning to Answer PCQs with Confidence

Given an accurate generative model of the environment MDP, a PCQ $(s,\pi,\hat{s},\hat{\pi},h)$ can be answered via Monte-Carlo trajectory sampling to estimate $V^{\pi}(s,h)$ and $V^{\hat{\pi}}(\hat{s},h)$. Further, the confidence in the answer can be arbitrarily improved by increasing the number of sampled trajectories. In this work, we do not assume an environment model, but instead are provided with an offline dataset of environment trajectories produced by one or more unknown behavior policies. We will denote this dataset by $\mathcal{D}=\{(s_{i},a_{i},s'_{i},r_{i})\}$, where each tuple corresponds to an observed transition from state $s_{i}$ to state $s'_{i}$ after taking action $a_{i}$ and receiving reward $r_{i}$.
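
To make the generative-model case concrete, the sketch below answers a PCQ via Monte-Carlo rollouts. It is a minimal illustration rather than the benchmark implementation: the simulator interface (`reset_to`, `step`) and the callable policy `policy(s, t)` are assumptions.

```python
import numpy as np

def mc_value(simulator, policy, s0, horizon, gamma=1.0, n_rollouts=100):
    """Monte-Carlo estimate of V^pi(s0, h) using a generative model.
    `simulator.reset_to(state)` and `simulator.step(action) -> (next_state, reward)`
    are placeholder names for a generic simulator interface."""
    returns = []
    for _ in range(n_rollouts):
        s = simulator.reset_to(s0)
        total, discount = 0.0, 1.0
        for t in range(horizon):
            a = policy(s, t)                  # non-stationary policy pi(s, t)
            s, r = simulator.step(a)
            total += discount * r
            discount *= gamma
        returns.append(total)
    return float(np.mean(returns))

def answer_pcq(simulator, query, n_rollouts=100):
    """Return the truth value of V^pi(s, h) < V^pi_hat(s_hat, h)."""
    s, pi, s_hat, pi_hat, h = query
    v = mc_value(simulator, pi, s, h, n_rollouts=n_rollouts)
    v_hat = mc_value(simulator, pi_hat, s_hat, h, n_rollouts=n_rollouts)
    return v < v_hat
```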

Given a dataset $\mathcal{D}$, we would like to learn a model for predicting answers to PCQs from a query space $\mathcal{Q}$. Here, $\mathcal{Q}$ may impose application-specific restrictions on the states and policies involved in PCQs. A fundamental challenge is that the coverage of $\mathcal{D}$ will not necessarily be representative of the dynamics and rewards relevant to answering all queries in $\mathcal{Q}$. Thus, if query answers are being used to inform important decisions, then it is critical for each answer to come with a meaningful measure of confidence that accounts for data coverage and statistical variance. Dealing with this uncertainty is also a core challenge for general offline RL (Levine et al., 2020), which has led to a number of approaches for addressing it. However, there is little direct evaluation of the uncertainty-handling components.

The above motivates the OPCC learning problem, which provides a dataset $\mathcal{D}$ and desired constraints on the query space $\mathcal{Q}$. The learner should output a model $w=(f,c)$ composed of: 1) a query prediction function $f:\mathcal{Q}\rightarrow\{0,1\}$, which returns a binary answer for any query in $\mathcal{Q}$, and 2) a confidence function $c:\mathcal{Q}\rightarrow[l,u]$ that maps queries in $\mathcal{Q}$ to a confidence value within a bounded interval. Given a query $q$, the intent is for larger values of $c(q)$ to indicate a higher confidence in the prediction $f(q)$. Note that we do not attach any predefined semantics to the values of $c(q)$ to allow for flexibility of potential solutions. Rather, we focus on defining metrics for directly evaluating the quality of uncertainty quantification provided by $w$. If desired, various methods can be used after learning to calibrate the confidence values of $c$ to meaningful scales (e.g., Loh, 1987; Naeini et al., 2015). Section 5 discusses possible learning approaches and the baselines evaluated in this paper.

3.3 Evaluation Metrics

Since OPCC involves confidence estimation for binary PCQ predictions, we can draw on evaluation metrics from prior work on selective classification (e.g., El-Yaniv et al., 2010; Xin et al., 2021). In selective classification, the aim is to reduce prediction errors by allowing a predictor to abstain from a prediction if the confidence is below a threshold. The quality of confidence values is thus related to how well they result in abstaining when the prediction would have been incorrect. This idea is formalized via risk-coverage curves (RCCs), as outlined below.

Risk-Coverage Curve. Let $L(q,\hat{y})$ be a loss function for predicting $\hat{y}$ for query $q$, e.g., 0/1 loss. Given a test set of queries $Q=\{q_{1},\ldots,q_{N}\}$, a model $w=(f,c)$, and a confidence threshold $\tau$, the coverage is the fraction of test queries with confidence at least $\tau$. The selective risk is the average loss of $f$ over the covered queries. Formally, the coverage and selective risk are respectively defined by

$$cov(w,Q,\tau)=\frac{1}{|Q|}\sum_{q\in Q}I[c(q)\geq\tau] \qquad (1)$$
$$r(w,Q,\tau)=\frac{\sum_{q\in Q}I[c(q)\geq\tau]\,L(q,f(q))}{\sum_{q\in Q}I[c(q)\geq\tau]} \qquad (2)$$

where $I$ is the binary indicator function. Thus, each possible threshold corresponds to a risk-coverage operating point $\langle r(w,Q,\tau),cov(w,Q,\tau)\rangle$. An RCC is simply the risk versus coverage curve of these operating points when sweeping through possible thresholds. Practically, for a finite test set $Q$ there can be at most $|Q|$ unique operating points, since there are at most $|Q|$ distinct confidence values produced by $c$. Thus, when displaying empirical RCCs we linearly interpolate between those operating points. (This is justified by the fact that we can achieve, in expectation, any linearly interpolated operating point between two thresholds $\tau_{1}$ and $\tau_{2}$ by varying the probability $p\in[0,1]$ of using $\tau_{1}$ versus $\tau_{2}$ to decide on abstention.) Figure 1 shows an example of an RCC from our experiments. The curve starts at the point $(0,0)$, since the risk is 0 at zero coverage, and ends at $(1,r_{f})$, where $r_{f}$ is the risk of $f$ evaluated on all of $Q$.

Figure 1: A sample Risk-Coverage Curve (RCC).

In order to provide a single measure of RCC quality, we aggregate across all thresholds to compute the Area Under the RCC (AURCC). Since lower risk is preferred, we consider a lower AURCC to indicate better confidence estimation. The minimum AURCC is 0, which occurs when the predictor $f$ has zero risk on all of $Q$. In contrast, for a randomized confidence function that returns a uniformly random value in $[l,u]$, the expected AURCC is $r_{f}$ (for any threshold $\tau>l$, a uniformly random confidence function covers each query with equal probability, so the expected risk at that threshold is the expected risk over a random draw from the query set, which is $r_{f}$), indicating no ability to quantify prediction uncertainty.
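
The following sketch computes empirical RCC operating points and AURCC from per-query confidences and losses. It is a minimal illustration of the definitions above; the exact threshold handling and interpolation in the released benchmark code may differ.

```python
import numpy as np

def risk_coverage_curve(confidences, losses):
    """Operating points (coverage, selective risk) obtained by sweeping the
    threshold tau over the observed confidence values (at most |Q| points)."""
    confidences = np.asarray(confidences, dtype=float)
    losses = np.asarray(losses, dtype=float)
    coverages, risks = [0.0], [0.0]                   # curve starts at (0, 0)
    for tau in sorted(set(confidences.tolist()), reverse=True):
        covered = confidences >= tau
        coverages.append(covered.mean())              # cov(w, Q, tau)
        risks.append(losses[covered].mean())          # r(w, Q, tau)
    return np.array(coverages), np.array(risks)

def aurcc(confidences, losses):
    """Area under the linearly interpolated risk-coverage curve."""
    cov, risk = risk_coverage_curve(confidences, losses)
    order = np.argsort(cov)
    return float(np.trapz(risk[order], cov[order]))
```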

RPP. Our second selective-classification metric from (Xin et al., 2021) is reverse pair proportion (RPP). The main idea is that the ordering of confidence values for a pair of queries should reflect the relative prediction loss for those queries. RPP measures how often the confidence-value ordering conflicts with the relative losses across all pairs of queries. In particular, a conflict occurs when the loss of $q_{1}$ is less than that of $q_{2}$, but we are more confident about $q_{2}$ than $q_{1}$. The RPP is simply the fraction of such conflicts:

$$RPP(w,Q)=\frac{1}{|Q|^{2}}\sum_{q_{1},q_{2}\in Q}I[\,l(q_{1})<l(q_{2}),\;c(q_{1})<c(q_{2})\,]$$

where $l(q)=L(q,f(q))$ is the loss of $f$ on $q$.
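
A direct quadratic-time computation of RPP from per-query losses and confidences might look as follows; this is an illustrative sketch rather than the benchmark implementation.

```python
def rpp(confidences, losses):
    """Reverse pair proportion: fraction of ordered pairs (q1, q2) with
    l(q1) < l(q2) but c(q1) < c(q2), normalized by |Q|^2."""
    n = len(losses)
    conflicts = sum(
        1
        for i in range(n)
        for j in range(n)
        if losses[i] < losses[j] and confidences[i] < confidences[j]
    )
    return conflicts / (n * n)
```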

$CR_{K}$. Finally, we introduce a new metric on just the confidence function $c$. A practical difference between confidence functions is the resolution of the values they output in practice. For example, given a set of queries $Q$, one confidence function $c_{1}$ may result in only three distinct coverage values $cov(w,Q,\tau)$ across all thresholds, while another confidence function $c_{2}$ results in $|Q|$ distinct coverage values. All else being equal, $c_{2}$ is the preferable function, since it provides a higher level of resolution with respect to abstention/coverage rates. We measure this via coverage resolution at $K$, denoted $CR_{K}$. To compute $CR_{K}$ for $w=(f,c)$ and query set $Q$, the coverage interval $[0,1]$ is partitioned into $K$ equal bins and we return the fraction of bins that contain $cov(w,Q,\tau)$ for some threshold $\tau$. Increasing $K$ gives a finer-grained distinction in coverage resolution.
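
A sketch of $CR_{K}$ under the definition above; details such as bin-edge handling and whether a zero-coverage threshold is counted are assumptions here and may differ from the benchmark code.

```python
import numpy as np

def coverage_resolution(confidences, K=10):
    """CR_K: fraction of the K equal-width coverage bins over [0, 1] that are
    hit by cov(w, Q, tau) for some threshold tau swept over the observed
    confidence values."""
    confidences = np.asarray(confidences, dtype=float)
    coverages = {(confidences >= tau).mean()
                 for tau in set(confidences.tolist())}
    hit = np.zeros(K, dtype=bool)
    for cov in coverages:
        hit[min(int(cov * K), K - 1)] = True          # bin containing this coverage
    return float(hit.mean())
```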

4 OPCC Benchmark Construction

In this section, we describe our approach to constructing OPCC benchmarks. We first describe our choice of environments, which are based on existing offline RL benchmarks; in particular, the training datasets used in our benchmarks are based on data from those benchmarks. Next, we describe our approach to constructing test query sets for each of the benchmark environments. The outlined benchmark-construction scheme is generic and can be followed by others to extend the set of available OPCC benchmarks. A tabular summary of the benchmarks can be found in Appendix A.

4.1 Environments

To support easier adoption of our benchmarks, we selected seven environments and corresponding datasets that are currently used in offline RL research. As a first set of OPCC benchmarks, we have chosen to focus on relatively low-dimensional environments with non-image-based observations. This helps focus initial studies on fundamental OPCC capabilities, rather than simultaneously addressing the additional complexities that enter with lower-level perceptual observations such as images.

Maze2d (4 environments). The Maze2d environments were introduced in D4RL (Fu et al., 2020) and comprise 2D mazes of different complexities: open, u-maze, medium, and large, as illustrated in Figure 3. Each environment has a 4D observation giving the position and velocity of the ball being controlled and a 2D action space specifying the direction of movement. The goal in each environment is to control a rolling ball to reach a goal location. For our benchmarks, we used the sparse-reward version of the environments, which provides unit reward for each time step in the goal region. There are no terminal states in these environments, and an episode ends after the maximum number of allowed time steps reported in Table 8. We use the datasets provided by D4RL, which we refer to as “1M” due to each dataset having 1 million state transitions. The D4RL trajectory datasets were created by running a path-planning algorithm to navigate in the maze between different start and end points.

Gym-Mujoco (3 environments). The Gym-Mujoco environments are based on controlling the actuators of systems within the Mujoco physics simulator. We consider three locomotion-based environments from OpenAI Gym (Brockman et al., 2016): HalfCheetah, Walker2d, and Hopper. For each environment, we use the corresponding D4RL (Fu et al., 2020) datasets that include behavior trajectories of varying quality: “random, medium, medium-replay, expert and expert-replay”. These environments are qualitatively different from Maze2d in that they involve controlling periodic locomotion behavior based on continuous states and actions, whereas Maze2d is primarily about goal-based path planning (navigation) rather than low-level locomotion control.

Figure 2: Gym-Mujoco tasks: half-cheetah, hopper, walker2d (left to right).
Figure 3: Maze2d tasks: open, umaze, medium, and large (left to right).

4.2 Query Set Construction

For each environment we must create a set of PCQs with ground truth answers for evaluating OPCC. A possible starting point is the off-policy evaluation (OPE) extension (Fu et al., 2021) to D4RL, which includes policies for a subset of the environments. In particular, one of the tasks considered is policy ranking, which is similar in spirit to OPCC. However, that extension of D4RL does not capture at least two important characteristics of OPCC. First, the evaluation protocols do not explicitly address measuring the quality of uncertainty quantification. Of course, this can be addressed by just extending the evaluation protocol and metrics.

Second, and more importantly, it is not clear how to define adequate query sets for evaluating OPCC. In particular, the OPE ranking task from D4RL is currently limited to ranking policies based on their expected values over the initial state distribution of each environment. In contrast, OPCC evaluations should involve sets of PCQs that cover a wide range of states that are both in-distribution and out-of-distribution relative to the offline dataset. Further, it is desirable to select the PCQs in a way that spans some notion of PCQ difficulty. In particular, the notion of difficulty we consider here for a PCQ $(s,\pi,\hat{s},\hat{\pi},h)$ is directly related to the performance gap between the policies, i.e., $\left|V^{\pi}(s,h)-V^{\hat{\pi}}(\hat{s},h)\right|$. It is expected that, all else being equal, larger gaps will result in easier discrimination between policies. Indeed, one of the initial challenges in developing the benchmarks was to create query sets that were neither all too easy nor all too hard.

Based on the above considerations, we create the evaluation sets of PCQs for each environment via the following steps.

Step 1: Policy Generation. We first train multiple policies for an environment to serve as the policies used for PCQ construction. We chose to train new policies, rather than use policies from D4RL, to ensure they would be distinct from the behavior policies used to create the D4RL datasets. For each environment we used the corresponding simulator and multiple runs of the PPO algorithm (Schulman et al., 2017) to train a set of policies of varying quality. We then hand-picked 4 policies with an effort to ensure that they were sufficiently distinct in terms of quality and behavior to support non-trivial PCQs. The performance of these policies is shared in Table 9 (appendix).

Step 2: Initial State Generation. For each environment we generated a large set of potential initial states by running episodes of the random policy, the learned policies, and a mixture of random and learned policies. This produced a set of states covering a wide range of each environment, extending well beyond the initial state distributions.

Step 3: Candidate PCQ Generation. For each horizon $h\in\{10,20,30,40,50\}$ we create a set of 2000 randomly constructed PCQs from the initial states and learned policies. This includes explicitly creating random PCQs of the form $(s,\pi,s,\hat{\pi},h)$ with $s$ a random initial state and $\pi,\hat{\pi}$ a random pair of the learned policies. In addition, we create a set of PCQs of the form $(s,\pi,\hat{s},\hat{\pi},h)$ in the same way, except that two random initial states are used instead of one.

Step 4: PCQ Labeling and Selection. For each generated PCQ from Step 3 we used Monte-Carlo simulation via the environment simulator to accurately estimate $V^{\pi}(s,h)$ and $V^{\hat{\pi}}(\hat{s},h)$ and removed any PCQ having a difference of less than 10 between the values of the two sides of the query. The motivation is to filter out PCQs that are the most ambiguous and more likely to act as a source of noise in evaluations. Finally, for each $h$, we randomly selected 1500 of the PCQs to include in the benchmark query set (see the sketch below). In Figure 4, we show a scatter plot of $\left(V^{\pi}(s,h),V^{\hat{\pi}}(\hat{s},h)\right)$ for the selected set of PCQs for Halfcheetah-v2 and maze2d-open-v0. Notice the lack of PCQs along the diagonal, which corresponds to the removal of ambiguous queries. Also note that the PCQs span a range of value gaps, which suggests that they span varying degrees of difficulty. If these plots showed a bias toward only large-gap queries, then additional steps would be necessary to ensure that more variation was present in the selected query sets.
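
The labeling-and-selection step can be sketched as follows. Here `value_fn` stands in for the accurate simulator-based Monte-Carlo estimate of $V^{\pi}(s,h)$; the thresholds mirror the description above, but the function names and signature are otherwise illustrative.

```python
import random

def label_and_filter_pcqs(candidates, value_fn, min_gap=10.0, n_keep=1500, seed=0):
    """Label candidate PCQs with ground-truth value estimates, drop ambiguous
    ones (value gap below `min_gap`), and sample `n_keep` for the benchmark."""
    labelled = []
    for (s, pi, s_hat, pi_hat, h) in candidates:
        v, v_hat = value_fn(pi, s, h), value_fn(pi_hat, s_hat, h)
        if abs(v - v_hat) < min_gap:          # too ambiguous; likely label noise
            continue
        labelled.append(((s, pi, s_hat, pi_hat, h), v < v_hat))
    random.seed(seed)
    return random.sample(labelled, min(n_keep, len(labelled)))
```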

Figure 4: Scatter plots of the PCQs in (a) Halfcheetah-v2 and (b) Maze2d-open-v0. For each PCQ $(s,\pi,\hat{s},\hat{\pi},h)$, we plot $V^{\pi}(s,h)$ vs. $V^{\hat{\pi}}(\hat{s},h)$.

5 OPCC Baselines

In this section, we describe the class of baselines that will be made available with the benchmarks and included in our pilot experiments (Section 7). Recall that each baseline must provide a prediction function $f$ and confidence function $c$ that are derived from the dataset $\mathcal{D}$. Perhaps the most natural approach for $f$ to answer a PCQ $(s,\pi,\hat{s},\hat{\pi},h)$ is to estimate and then compare the relevant values using OPE. The corresponding confidence function $c$ might then be based on the uncertainty of the value estimates.

There are at least two types of OPE approaches to consider: model-free and model-based. Model-free approaches, such as fitted Q-evaluation (Ernst et al., 2005), typically learn a Q-function $Q^{\pi}(s,a)$ for a given policy $\pi$ that can be evaluated for any state and action. Unfortunately, each function learned by such model-free methods is valid only for the single policy $\pi$ and the effective horizon used during training. Thus, answering PCQs involving other policies or horizons requires costly retraining. Since we seek an OPCC approach that can be quickly applied to arbitrary policies, states, and horizons, we instead use a model-based approach for our baselines.

Our baselines are variants of model-based ensemble approaches, which are among the most common classes of approaches used in model-based RL for dynamics modeling and capturing uncertainty (Argenson & Dulac-Arnold, 2020; Yu et al., 2020; Kidambi et al., 2020). In particular, our baselines all have the following structure:

1. Learn an ensemble of models $\{\hat{P}_{i}\}$ from $\mathcal{D}$ that each predict the dynamics and reward of the environment.

2. Use each model in the ensemble to generate estimates of the relevant PCQ values, $V^{\pi}(s,h)$ and $V^{\hat{\pi}}(\hat{s},h)$, via Monte-Carlo simulation of the policies.

3. Combine the ensemble estimates to provide a prediction and confidence value.

Compared to model-free approaches, this approach applies immediately to arbitrary policies and horizons without costly retraining. That is, new policies and horizons can easily be swapped into the Monte-Carlo simulation of step 2 with no modifications to the model. We obtain different baselines by varying the choices for learning the model ensemble (step 1) as well as the ensemble combination approach (step 3). Below we describe the variations used in our experiments; a sketch of the overall query-time procedure follows.
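
The query-time portion of this pipeline (steps 2 and 3) can be sketched as follows, assuming the ensemble of step 1 has already been trained offline. The `simulate` method and the `combine` callback are hypothetical interfaces used only for illustration; Section 5.3 describes the concrete combination rules we evaluate.

```python
def ensemble_answer(models, query, combine, n_rollouts=20):
    """Answer a PCQ with an ensemble of learned dynamics/reward models.

    Each model is assumed to expose simulate(policy, state, horizon, n) that
    returns a Monte-Carlo value estimate under that model."""
    s, pi, s_hat, pi_hat, h = query
    # Step 2: per-member value estimates for both sides of the query.
    value_pairs = [(m.simulate(pi, s, h, n_rollouts),
                    m.simulate(pi_hat, s_hat, h, n_rollouts)) for m in models]
    # Step 3: combine the M estimate pairs into a (prediction, confidence),
    # e.g., with the EV, PCI, or U-PCI rules of Section 5.3.
    return combine(value_pairs)
```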

5.1 Base Models

In our experiments, we consider two types of base models for forming ensembles. The first base model is the commonly used Feed-Forward (FF) Gaussian model, which, given the current state/observation and action as input, returns the mean and diagonal covariance matrix of a Gaussian distribution over the next state and reward. This model allows for stochastic Monte-Carlo simulations by drawing the next state from the model’s Gaussian distribution at each time step. In this work, we use the same FF base-model architecture and training details as MBPO (Janner et al., 2019).

We also consider a recent base model (Zhang et al., 2021), referred to as Auto-Regressive (AR), which was demonstrated in some cases to improve over the output architecture of FF. Instead of generating all $n$ features of the predicted next state in a single pass, AR auto-regressively samples each feature one at a time using $n$ forward passes. In particular, to sample feature $i$ of the next state, denoted $s_{t+1}^{i}$, the network receives the usual input $s_{t}$ and $a_{t}$ as well as the previously sampled state features $s^{0}_{t+1},\ldots,s^{i-1}_{t+1}$. AR then returns the mean and variance of a Gaussian that is used to sample $s_{t+1}^{i}$. The intuition is that this approach may allow for representing non-Gaussian and multi-modal next-state distributions compared to the uni-modal Gaussian FF model.
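
The AR sampling loop can be sketched as follows; `predict_feature` stands in for one forward pass of the AR network, and its exact signature is an assumption made for illustration.

```python
import numpy as np

def ar_sample_next_state(predict_feature, s_t, a_t, state_dim, rng=np.random):
    """Auto-regressively sample s_{t+1}: feature i is drawn from a Gaussian
    conditioned on (s_t, a_t) and the features sampled so far, requiring
    `state_dim` forward passes per transition."""
    prefix = []
    for i in range(state_dim):
        mean, std = predict_feature(s_t, a_t, np.array(prefix), i)
        prefix.append(rng.normal(mean, std))
    return np.array(prefix)
```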

5.2 Ensemble Learning

Model-based approaches to ORL have commonly used ensembles as an attempt to quantify uncertainty, e.g., via measures of ensemble-member disagreement (Janner et al., 2019). We consider two choices for generating ensembles. The first choice is the standard bootstrapping ensemble approach, which simply trains each ensemble member using a different random weight initialization and a bootstrapped dataset $\hat{\mathcal{D}}$ formed by sampling from $\mathcal{D}$ with replacement $|\mathcal{D}|$ times. The intent is that the combination of classic statistical bootstrapping (Efron & Tibshirani, 1994) and random initialization will produce a diverse set of ensemble models.

Often, however, the basic bootstrapping approach does not create enough diversity in an ensemble, which is counter to our motivation of representing uncertainty. For this reason, there are a number of proposals for increasing ensemble diversity, of which we consider just one in this work. In particular, work motivated by capturing uncertainty in RL proposed the use of randomized constant priors to increase ensemble diversity (Osband et al., 2018). For each base model, a randomized constant prior is produced, which is simply a network with random initial weights. The base model is trained as an additive component on top of this prior, and the final output is the sum of the two. The intuition is that the constant prior should cause ensemble members to disagree more often in unrepresented parts of the state space, which will provide a better measure of disagreement-based uncertainty.
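
A minimal sketch of the randomized-constant-prior construction is shown below (in PyTorch); the architecture, hidden size, and prior scaling are illustrative assumptions rather than the exact baseline configuration.

```python
import torch.nn as nn

def _mlp(in_dim, out_dim, hidden=200):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

class RandomPriorModel(nn.Module):
    """Trainable network plus a frozen, randomly initialized prior network;
    the prediction is the trainable output plus the scaled prior output."""

    def __init__(self, in_dim, out_dim, prior_scale=1.0):
        super().__init__()
        self.model = _mlp(in_dim, out_dim)        # trained on D
        self.prior = _mlp(in_dim, out_dim)        # random weights, never trained
        for p in self.prior.parameters():
            p.requires_grad_(False)
        self.prior_scale = prior_scale

    def forward(self, x):
        return self.model(x) + self.prior_scale * self.prior(x)
```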

5.3 Prediction and Confidence Values

Given a PCQ $(s,\pi,\hat{s},\hat{\pi},h)$ and an ensemble of size $M$, we generate a prediction and confidence by first using each ensemble member to generate, via Monte-Carlo simulation, a pair of value estimates of $V^{\pi}(s,h)$ and $V^{\hat{\pi}}(\hat{s},h)$. This results in a set of $M$ value-estimate pairs, denoted by $\mathcal{V}=\{(V_{1},\hat{V}_{1}),\ldots,(V_{M},\hat{V}_{M})\}$. This approach is based on the classic view, from bagging classifiers (Breiman, 1996), of the base learning algorithm being a stochastic function. (Indeed, each run of the base algorithm has at least three sources of randomness in our implementation: each run uses a different bootstrap sample of the dataset, each run uses randomized initial weights, and mini-batches in stochastic gradient descent depend on the random seed.) The set $\mathcal{V}$ can then be viewed as value-estimate pairs sampled from the distribution of learning-algorithm runs. Bagging analysis tells us that if more than 50% of the learning-algorithm runs result in estimates $(V_{i},\hat{V}_{i})$ that correctly rank the values, then a large enough ensemble will correctly predict the query. (This argument assumes independence of the ensemble members, which clearly is not true in practice due, at the least, to correlation between their training data.) Based on this view, below we describe the three approaches we consider for producing predictions and confidences from $\mathcal{V}$; a sketch of all three follows the list.

  • Ensemble Voting (EV). Following Dietterich (2000), EV simply returns a prediction for a query based on the majority vote across the ensemble of $V_{i}<\hat{V}_{i}$. The confidence score is equal to the fraction of ensemble members that agree with the majority vote (in the range $[0.5,1]$), re-scaled to fall in the range $[0,1]$.

  • Paired Confidence Interval (PCI). The PCI confidence value is computed by estimating the expected value of $V-\hat{V}$ for a random run of the learning algorithm. The mean estimate is given by $\frac{1}{M}\sum_{i}(V_{i}-\hat{V}_{i})$ and the prediction is based on the sign of this estimate. The confidence value is based on computing $\alpha$-percentile confidence intervals on the difference, denoted by $[l_{\alpha},u_{\alpha}]$. In particular, it is equal to the largest value of $\alpha$ such that $0\not\in[l_{\alpha},u_{\alpha}]$. Thus, a high confidence value reflects strong evidence that the expected difference is either above or below zero (in agreement with the prediction). Confidence intervals are computed based on the $t$ distribution.

  • UnPaired Confidence Interval (U-PCI). This approach makes the prediction in the same way as PCI, but uses unpaired confidence intervals to compute the confidence, which should be expected to be more conservative. In particular, we compute $\alpha$-percentile confidence intervals for the means of the $V_{i}$ and the $\hat{V}_{i}$, denoted respectively by $\left[l_{\alpha},u_{\alpha}\right]$ and $\left[\hat{l}_{\alpha},\hat{u}_{\alpha}\right]$, and let the confidence be the maximum value of $\alpha$ for which the two confidence intervals do not overlap.
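
The following sketch implements the three combination rules for an ensemble's value-estimate pairs $\mathcal{V}$. The closed-form computation of the "largest $\alpha$" assumes symmetric $t$-based intervals; the exact interval construction in the released baselines may differ.

```python
import numpy as np
from scipy import stats

def ev_confidence(value_pairs):
    """Ensemble Voting: majority vote, agreement re-scaled from [0.5, 1] to [0, 1]."""
    votes = np.array([v < v_hat for v, v_hat in value_pairs], dtype=float)
    agreement = max(votes.mean(), 1.0 - votes.mean())
    return votes.mean() >= 0.5, 2.0 * (agreement - 0.5)

def pci_confidence(value_pairs):
    """Paired CI: largest alpha whose t-interval on mean(V - V_hat) excludes 0."""
    diffs = np.array([v - v_hat for v, v_hat in value_pairs])
    m = len(diffs)
    t_stat = abs(diffs.mean()) / (diffs.std(ddof=1) / np.sqrt(m) + 1e-12)
    alpha = 2.0 * stats.t.cdf(t_stat, df=m - 1) - 1.0
    return diffs.mean() < 0, alpha

def u_pci_confidence(value_pairs):
    """Unpaired CI: largest alpha for which the two t-intervals do not overlap."""
    v = np.array([p[0] for p in value_pairs])
    v_hat = np.array([p[1] for p in value_pairs])
    m = len(v)
    half_width_sum = (v.std(ddof=1) + v_hat.std(ddof=1)) / np.sqrt(m) + 1e-12
    t_stat = abs(v.mean() - v_hat.mean()) / half_width_sum
    alpha = 2.0 * stats.t.cdf(t_stat, df=m - 1) - 1.0
    return v.mean() < v_hat.mean(), alpha
```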

6 Related Work

Dynamics Learning in RL. There has been much recent interest in learning deep models of dynamical systems to support model-based RL. Examples from online RL include Clavera et al. (2018); Kurutach et al. (2018), which learn one-step observation-based dynamics, along with extensions to ensembles (Deisenroth & Rasmussen, 2011; Chua et al., 2018; Janner et al., 2019; Nagabandi et al., 2020). PILCO (Deisenroth & Rasmussen, 2011; Gal et al., 2016) is another model-based RL approach that learns dynamics via Gaussian Processes (Rasmussen, 2003), which are able to capture epistemic uncertainty. However, performance is primarily measured in terms of overall task performance, and it is unclear how well uncertainty is actually quantified. Recent work on offline reinforcement learning, such as MBOP (Argenson & Dulac-Arnold, 2020), MOPO (Yu et al., 2020), and MoREL (Kidambi et al., 2020), has also considered learning dynamics models over observations from fixed, offline datasets. These approaches incorporate uncertainty estimates in different ways (e.g., pessimistic rewards or dynamics) and all use ensembles to estimate uncertainty. Thus, they are established exemplars of the baselines considered in our work. However, only the final task performance is tested, and it is unclear how well uncertainty is actually captured by the models. COMBO (Yu et al., 2021), Muzero Unplugged (Schrittwieser et al., 2021), and LOMPO (Rafailov et al., 2021) investigate learning latent-space dynamics models for offline RL rather than learning in the observation space. Again, however, there is no explicit evaluation of the models’ ability to quantify uncertainty.

Policy Ranking. Similar to our work, Sonabend-W et al. (2020) also identify the need to measure uncertainty for policies learned with limited data. In order to learn safe policies, their approach uses hypothesis testing to quantify uncertainty in policy evaluation for a pair of candidate policies based on sampling from model posteriors. This supports ranking the policies and safely selecting one that improves over the behavior policy. The work, however, was limited to small flat state spaces and did not explicitly evaluate uncertainty quantification. In contrast, our work produces a benchmark that focuses primarily on the uncertainty quantification of a system using offline data, rather than evaluating in terms of overall task performance.

DOPE (Fu et al., 2021) studies OPE and devises a protocol that measures policy evaluation, ranking, and selection. For this purpose, the approach introduces a set of candidate policies along with their expected values over a distribution of initial states. In contrast, our work questions the ability of a system to rank policies from any arbitrary state for a given horizon, rather than limiting evaluation to the initial state distribution only. This can help provide a more comprehensive view of uncertainty estimation across the state space.

With similar motivation, SOPR-T (Jin et al., 2021b) also considers policy ranking from offline data and additional policy-value supervision. This is done by learning an encoded representation of a policy using a transformer-based architecture and a scoring function over the representation. In order to learn the representation, they require a set of pre-defined policies, each labeled by its ground-truth value with respect to an initial state distribution. Our framework does not assume the availability of such policy-value supervision and also puts an emphasis on uncertainty quantification, which is not evaluated by that work.

Confidence Intervals. We make use of confidence intervals over policy value estimates for answering queries. Thomas et al. (2015) also study confidence-interval estimation over policy value estimates using trajectories generated by a different set of policies. Their approach uses importance sampling for unbiased value estimates, which suffers from high variance, leading to loose confidence bounds. They also introduce the problem of high-confidence off-policy evaluation and produce tighter bounds on estimates using improved concentration inequalities (Massart, 2007).

Another class of approaches is based on statistical bootstrapping (Efron, 1987). Hanna et al. (2017) bootstrap learned MDP transition models in order to estimate lower confidence bounds on policy evaluation estimates with limited data. Kostrikov & Nachum (2020) suggest that confidence intervals of these bootstrapped estimates are not guaranteed to be accurate. In practice, they are shown to be overly confident, especially for insufficient sample sizes and under-coverage of the data distribution. They suggest that, in practice, this issue may be mitigated by injecting reward noise and regularization to learn smoother empirical transition and reward functions. Evaluating that claim within our OPCC framework is a potential direction for future work.

CoinDICE (Dai et al., 2020) and similar methods (Nachum et al., 2019; Zhang et al., 2020; Strehl & Littman, 2008; Hao et al., 2021) focus on confidence intervals for OPE based on the formulation of certain optimization problems. This iterative optimization for estimating policy values and confidence bounds incurs substantial computational overhead, which is undesirable in our framework, since we aim to rapidly answer queries over arbitrary horizons and policies. An interesting direction for future work is to consider generalizing this optimization-based approach to more flexibly handle arbitrary policies.

7 Experiments

Our pilot experiments explore the baseline methods on our benchmarks using the proposed metrics for OPCC. It is important to note that these experiments are not intended to identify a top performer. Rather, our primary goal is to assess the adequacy of the benchmarks and metrics for future work and to establish a basic performance bar. Secondarily, we are interested in observing evidence, or the lack thereof, for certain assumptions that might be drawn about the baselines from prior work. Based on these goals, our experiments and analysis are designed to: 1) assess whether the benchmarks appear to be too difficult or too easy for supporting future work; 2) assess whether there is any evidence that our baselines are sensitive to the dataset type used for each benchmark environment (in particular, the performance of a strong OPCC approach should vary with the coverage afforded by the dataset); and 3) assess whether there is any evidence for performance differences among the baseline variations (in particular, features such as auto-regressive sampling, constant priors, and larger ensembles have been claimed to improve uncertainty handling in prior work).

In our experiments, unless otherwise specified, the default model is an ensemble of 100 deterministic feed-forward models and uses EV for the confidence score. Two additional details are important to note for these experiments. First, as is customary in model-based RL (including ORL), we use a pre-defined episode termination function rather than a learned one. We have found that this can significantly impact the performance of model-based RL systems and also our OPCC evaluations. Second, we clipped predicted observations and rewards to keep them within the bounds of the available datasets, which is also a common practice in ORL that we found to be important.

Too hard or too easy? We first assess the degree of difficulty posed by our OPCC benchmark for our baselines. Figures 5 and 6 show RCCs of our default model for different dataset types (averaged across the different PCQ horizons $h$) in the gym-mujoco and maze2d environments. Tables 1 and 2 report the corresponding metrics, i.e., AURCC, RPP, $CR_{K}$, and loss (risk) at complete coverage.

First, we consider risk at complete coverage and find that it does not differ significantly across dataset types, but varies significantly across gym-mujoco environments. This shows that some environments are more challenging than others due to their underlying complex dynamics and high-dimensional observation and action spaces. Also, the risk at complete coverage for the maze2d environments with a single dataset (‘1m’) is significantly lower than for gym-mujoco. This is potentially due to data collection via a path-planning procedure leading to significant state-action space coverage. Further, medium and umaze have very small risks without much room for improvement, while large and open appear to have room for improvement. Second, we consider how the risk varies across coverage values. In most cases, there are no thresholds that produce points within the coverage interval (0, 0.5], which indicates a lack of sensitivity in that coverage range. There are typically multiple points between 0.5 and 1, though often just a few. Ideally, we would hope for a more gradual degradation in risk spanning from no coverage to complete coverage. This suggests that there is significant room to improve coverage sensitivity, especially in the range [0, 0.5].

Overall, the current set of benchmarks, with the exception of two Maze2d environments, is not too easy and appears to offer significant room for improvement in terms of both overall risk and the sensitivity of the RCCs across coverage values. Likewise, the observation that the risks achieved are significantly less than chance suggests that the benchmarks are not too hard.

Impact of dataset type. The different types of datasets provide different types of coverage of the system dynamics. Is there evidence that our baselines are able to distinguish among these types? Figure 5 shows that the RCCs for different datasets are quite similar for each of the gym-mujoco environments. The AURCC and RPP values in Table 1 are consistent with these observations. This could be due to the diverse coverage of queries across the state space, which poses challenges for all datasets. The small variation in RCCs across dataset types could also be due to the models learned from different datasets providing similar types of generalization. It is also possible that differences between dataset types would become more prevalent for smaller versions of the datasets, which is an interesting future extension to the benchmarks. Finally, no significant patterns for CR in relation to dataset type are apparent, which is not surprising since CR is expected to be more heavily influenced by the type of baseline approach.

Table 1: Evaluation metrics for dataset-type comparison in gym-mujoco environments. This includes means and 95% confidence-interval estimates for metrics over 5 (seed) dynamics models trained on each dataset.

Env. | dataset-type | AURCC (↓) | RPP (↓) | $CR_K$ (↑) | loss (↓)
Half Cheetah | expert | 0.212 ± 0.002 | 0.05 ± 0.001 | 0.5 ± (<0.001) | 0.361 ± 0.002
Half Cheetah | medium | 0.222 ± 0.001 | 0.048 ± 0.001 | 0.5 ± (<0.001) | 0.374 ± 0.002
Half Cheetah | medium-expert | 0.24 ± 0.004 | 0.06 ± 0.002 | 0.6 ± (<0.001) | 0.387 ± 0.003
Half Cheetah | medium-replay | 0.216 ± 0.001 | 0.04 ± 0.001 | 0.4 ± (<0.001) | 0.368 ± 0.001
Half Cheetah | random | 0.206 ± 0.001 | 0.023 ± 0.001 | 0.3 ± (<0.001) | 0.378 ± 0.001
Hopper | expert | 0.152 ± 0.002 | 0.04 ± 0.001 | 0.5 ± (<0.001) | 0.284 ± 0.002
Hopper | medium | 0.133 ± 0.001 | 0.03 ± 0.001 | 0.4 ± (<0.001) | 0.26 ± 0.002
Hopper | medium-expert | 0.136 ± 0.001 | 0.028 ± (<0.001) | 0.4 ± (<0.001) | 0.265 ± 0.001
Hopper | medium-replay | 0.128 ± 0.001 | 0.012 ± 0.001 | 0.3 ± (<0.001) | 0.258 ± 0.001
Hopper | random | 0.156 ± 0.008 | 0.045 ± 0.004 | 0.54 ± 0.043 | 0.273 ± (<0.001)
Walker 2d | expert | 0.064 ± 0.001 | 0.011 ± (<0.001) | 0.3 ± (<0.001) | 0.161 ± (<0.001)
Walker 2d | medium | 0.069 ± 0.001 | 0.007 ± (<0.001) | 0.22 ± 0.035 | 0.156 ± 0.001
Walker 2d | medium-expert | 0.068 ± (<0.001) | 0.007 ± (<0.001) | 0.24 ± 0.043 | 0.153 ± 0.001
Walker 2d | medium-replay | 0.07 ± 0.001 | 0.005 ± (<0.001) | 0.2 ± (<0.001) | 0.161 ± 0.001
Walker 2d | random | 0.067 ± 0.001 | 0.024 ± 0.001 | 0.54 ± 0.043 | 0.165 ± 0.001
Table 2: Evaluation metrics for dataset-type comparison in maze environments. This includes means and 95% confidence-interval estimates for metrics over 5 (seed) dynamics models trained on each dataset.

Env. | dataset-type | AURCC (↓) | RPP (↓) | $CR_K$ (↑) | loss (↓)
large | 1m | 0.14 ± 0.015 | 0.062 ± 0.004 | 0.82 ± 0.035 | 0.251 ± 0.029
medium | 1m | 0.001 ± (<0.001) | 0.0 ± (<0.001) | 0.2 ± (<0.001) | 0.022 ± 0.001
open | 1m | 0.029 ± 0.001 | 0.012 ± 0.001 | 0.5 ± (<0.001) | 0.107 ± 0.005
umaze | 1m | 0.008 ± 0.001 | 0.002 ± (<0.001) | 0.3 ± (<0.001) | 0.075 ± 0.003
Figure 5: Selective-risk coverage curves for different gym-mujoco environments and dataset types (depicted by different colors). The x-axis spans from no (0) coverage to complete (1) coverage of queries, and the y-axis is the risk for the corresponding query coverage. Each risk-coverage point is determined by varying the confidence threshold.
Figure 6: Selective-risk coverage curves for the ‘1m’ dataset in maze environments. This is the complete navigation dataset of 1 million transitions.

Impact of Query Horizon. Learned dynamics models are well known to suffer from error accumulation in multi-step rollouts. This leads to the hypothesis that OPCC performance might degrade with increasing query horizons. Table 3 and Table 17 (appendix) provide metrics for various horizons $h$ averaged across dataset types. As expected, we observe higher AURCCs for longer horizons, which provides positive evidence for the hypothesis. Interestingly, we observe very low risk for short horizons in most of the environments. In general, AURCCs for $h=10$ or $h=20$ are at least an order of magnitude smaller than for larger horizons across the benchmark. This suggests a possible threshold effect for OPCC with respect to increasing horizon due to error accumulation. It also suggests that our current baselines are better suited for applications like reliable policy improvement with smaller horizons.

Table 3: Evaluation metrics for horizon comparison in gym-mujoco environments. These means and 95% confidence-interval estimates are over 50 samples corresponding to 5 (seed) dynamics models trained on each of 5 different datasets.

Env. | horizon | AURCC (↓) | RPP (↓) | $CR_K$ (↑) | loss (↓)
Half Cheetah | 10 | 0.077 ± 0.004 | 0.008 ± 0.001 | 0.288 ± 0.017 | 0.191 ± 0.006
Half Cheetah | 20 | 0.217 ± 0.006 | 0.041 ± 0.005 | 0.416 ± 0.042 | 0.374 ± 0.005
Half Cheetah | 30 | 0.215 ± 0.004 | 0.038 ± 0.005 | 0.404 ± 0.038 | 0.377 ± 0.005
Half Cheetah | 40 | 0.223 ± 0.006 | 0.049 ± 0.006 | 0.464 ± 0.041 | 0.368 ± 0.005
Half Cheetah | 50 | 0.277 ± 0.008 | 0.063 ± 0.007 | 0.516 ± 0.054 | 0.428 ± 0.003
Hopper | 10 | 0.017 ± 0.002 | 0.001 ± (<0.001) | 0.2 ± (<0.001) | 0.048 ± 0.002
Hopper | 20 | 0.078 ± 0.002 | 0.016 ± 0.003 | 0.336 ± 0.038 | 0.17 ± 0.003
Hopper | 30 | 0.146 ± 0.004 | 0.029 ± 0.004 | 0.42 ± 0.033 | 0.284 ± 0.005
Hopper | 40 | 0.169 ± 0.007 | 0.038 ± 0.006 | 0.432 ± 0.036 | 0.293 ± 0.004
Hopper | 50 | 0.196 ± 0.009 | 0.047 ± 0.007 | 0.516 ± 0.049 | 0.334 ± 0.006
Walker 2d | 10 | 0.011 ± (<0.001) | 0.001 ± (<0.001) | 0.22 ± 0.016 | 0.033 ± 0.003
Walker 2d | 20 | 0.025 ± 0.002 | 0.003 ± 0.001 | 0.252 ± 0.042 | 0.077 ± 0.004
Walker 2d | 30 | 0.059 ± 0.001 | 0.01 ± 0.003 | 0.3 ± 0.061 | 0.132 ± 0.004
Walker 2d | 40 | 0.093 ± 0.002 | 0.017 ± 0.004 | 0.384 ± 0.049 | 0.209 ± 0.004
Walker 2d | 50 | 0.131 ± 0.002 | 0.023 ± 0.005 | 0.392 ± 0.06 | 0.259 ± 0.002

Influence of different confidence functions. Table 4 and Table 16 (appendix) give metrics for our three different confidence functions (EV, PCI, and U-PCI) averaged over dataset types and horizons. The results for AURCC and RPP both indicate evidence that the confidence-interval approaches (PCI and U-PCI) have an advantage over EV. This is encouraging, as it suggests that considering other, more sophisticated statistical testing approaches may lead to further improvement. However, the results for CR indicate that the confidence-interval approaches have significantly less resolution than EV. This may lead to poorer performance for probability-calibration approaches applied to PCI or U-PCI confidence scores. Further work is required to understand this decrease in resolution.

Table 4: Evaluation metrics for confidence-function (uncertainty-type) comparison in gym-mujoco environments. These means and 95% confidence-interval estimates are over 50 samples corresponding to 5 (seed) dynamics models trained on each of 5 different datasets.

Env. | uncertainty-type | AURCC (↓) | RPP (↓) | $CR_K$ (↑) | loss (↓)
Half Cheetah | ev | 0.219 ± 0.005 | 0.044 ± 0.005 | 0.46 ± 0.04 | 0.374 ± 0.004
Half Cheetah | pci | 0.191 ± 0.002 | 0.006 ± 0.001 | 0.2 ± (<0.001) | 0.373 ± 0.004
Half Cheetah | u-pci | 0.196 ± 0.003 | 0.014 ± 0.002 | 0.228 ± 0.018 | 0.373 ± 0.004
Hopper | ev | 0.141 ± 0.005 | 0.031 ± 0.005 | 0.428 ± 0.034 | 0.268 ± 0.004
Hopper | pci | 0.135 ± 0.002 | 0.004 ± 0.001 | 0.2 ± (<0.001) | 0.269 ± 0.004
Hopper | u-pci | 0.135 ± 0.003 | 0.009 ± 0.002 | 0.216 ± 0.014 | 0.269 ± 0.004
Walker 2d | ev | 0.068 ± 0.001 | 0.011 ± 0.003 | 0.3 ± 0.051 | 0.159 ± 0.002
Walker 2d | pci | 0.078 ± 0.001 | 0.002 ± 0.001 | 0.2 ± (<0.001) | 0.16 ± 0.003
Walker 2d | u-pci | 0.076 ± 0.001 | 0.004 ± 0.001 | 0.22 ± 0.016 | 0.16 ± 0.003

Impact of Ensemble Size. We now consider the impact of ensemble size for our baselines. Table 5 and Table 18 (appendix) show the results for ensemble sizes ranging from 10 to 100. Our prior expectation was that performance would improve with a significant increase in ensemble size. In general, we do not see statistically significant differences between ensemble sizes for AURCC based on our current experimental budget (i.e., the confidence intervals intersect). However, based on trends in the means, there is weak evidence of improved AURCC. The exception is HalfCheetah, where the AURCC trend is opposite to the expectation. However, the differences in means tend to be small, suggesting that ensemble size is not having a large impact even if more computational budget were devoted to supporting statistical significance.

For RPP and coverage resolution ($CR_{K}$) there is typically a statistically significant improvement from ensemble size 10 to 100. The exceptions are umaze and medium-maze, where losses are very small for all ensemble sizes. Overall, however, the differences are relatively small in magnitude. This may be due to the ensembles not being diverse enough, or to the base models used to construct the ensembles not being accurate enough. These results demonstrate the value of the OPCC benchmarks in being able to explicitly test hypotheses about uncertainty quantification, rather than relying on downstream results that may be impacted by many possible factors.

Randomized Constant Priors. In order to encourage diversity, we introduce randomized constant priors in our ensemble models. These are suggested to encourage extrapolation diversity, especially on out-of-distribution state-action pairs, which could improve disagreement-based uncertainty estimates. However, when we included the constant priors in our model, we did not find significant improvements in our evaluation metrics, as shown in Table 10 (appendix) and Table 6. We use the same architecture as the dynamics model for the prior (with random weights) and scale its output by a “prior-scale” factor before adding it to the ensemble model; a prior-scale of 0 indicates that no prior is used. In the case of maze2d, we generally observe a slight (but statistically insignificant) reduction in AURCC, whereas RPP and $CR_{K}$ tend to remain the same. In contrast, in the case of gym-mujoco (Table 6), we generally observe a slight (but statistically insignificant) increase in AURCC, RPP, and $CR_{K}$.

Prior work by Osband et al. (2018) demonstrated improvements in end-task RL performance from using an ensemble of DQN (Mnih et al., 2013) models with randomized constant priors. However, an explicit analysis of the uncertainty quantification was not provided. Our observations suggest that randomized constant priors do not appear to improve uncertainty quantification, at least as measured through our OPCC benchmarks. Further investigation is needed to better understand the performance differences observed by Osband et al. (2018). An interesting direction for future work is to consider other previously proposed mechanisms for improving ensemble diversity within the OPCC framework.
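As an illustration, the sketch below shows how a single ensemble member with a randomized constant prior can be constructed in the spirit of Osband et al. (2018): a frozen copy of the dynamics architecture with random weights is added to the trainable network's output, scaled by the prior-scale. The layer sizes and the output layout (next-observation delta plus reward) are illustrative assumptions, not the exact released configuration.

```python
import torch
import torch.nn as nn

class RandomizedPriorDynamics(nn.Module):
    """One ensemble member whose output is augmented by a frozen random prior network."""

    def __init__(self, obs_dim, act_dim, hidden=200, prior_scale=5.0):
        super().__init__()

        def make_net():
            # same architecture for the trainable network and the prior (assumed sizes)
            return nn.Sequential(
                nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, obs_dim + 1),  # next-obs delta + reward (assumed layout)
            )

        self.trainable = make_net()
        self.prior = make_net()                   # random initialization, never trained
        for p in self.prior.parameters():
            p.requires_grad_(False)
        self.prior_scale = prior_scale

    def forward(self, obs, act):
        x = torch.cat([obs, act], dim=-1)
        out = self.trainable(x)
        if self.prior_scale > 0:                  # prior-scale of 0 disables the prior
            out = out + self.prior_scale * self.prior(x)
        return out
```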

Table 5: Evaluation metrics for ensemble-count comparison in gym-mujoco environments. For each dataset we train 5 (one per seed) dynamics ensembles of 100 models, compute OPCC metrics starting from a sub-ensemble of 10 models, and then incrementally increase the number of models used. We report means and 95% confidence intervals.
Env.  ensemble-count  AURCC (↓)  RPP (↓)  CR_K (↑)  loss (↓)
Half Cheetah  10  0.209 ± 0.004  0.028 ± 0.004  0.32 ± 0.029  0.378 ± 0.004
20  0.213 ± 0.004  0.034 ± 0.004  0.36 ± 0.04  0.376 ± 0.004
40  0.215 ± 0.004  0.039 ± 0.005  0.4 ± 0.035  0.374 ± 0.004
80  0.219 ± 0.005  0.043 ± 0.005  0.452 ± 0.04  0.374 ± 0.004
100  0.219 ± 0.005  0.044 ± 0.005  0.46 ± 0.04  0.374 ± 0.004
Hopper  10  0.141 ± 0.005  0.02 ± 0.004  0.316 ± 0.029  0.273 ± 0.005
20  0.141 ± 0.005  0.024 ± 0.004  0.344 ± 0.037  0.272 ± 0.004
40  0.141 ± 0.005  0.027 ± 0.004  0.388 ± 0.04  0.27 ± 0.004
80  0.141 ± 0.005  0.03 ± 0.004  0.416 ± 0.038  0.269 ± 0.004
100  0.141 ± 0.005  0.031 ± 0.005  0.428 ± 0.034  0.268 ± 0.004
Walker 2d  10  0.072 ± 0.001  0.007 ± 0.002  0.256 ± 0.03  0.16 ± 0.002
20  0.071 ± 0.001  0.008 ± 0.002  0.268 ± 0.038  0.16 ± 0.002
40  0.07 ± 0.001  0.01 ± 0.003  0.28 ± 0.046  0.16 ± 0.002
80  0.068 ± 0.001  0.01 ± 0.003  0.284 ± 0.049  0.159 ± 0.002
100  0.068 ± 0.001  0.011 ± 0.003  0.3 ± 0.051  0.159 ± 0.002
Table 6: Evaluation metrics for prior-scale comparison in gym-mujoco environments, reporting means and 95% confidence intervals over 50 samples from 5 (seed) dynamics models for each of the 5 datasets. A prior-scale of 0 means no randomized constant prior is added.
Env.  prior-scale  AURCC (↓)  RPP (↓)  CR_K (↑)  loss (↓)
Half Cheetah  0  0.219 ± 0.005  0.044 ± 0.005  0.46 ± 0.04  0.374 ± 0.004
5  0.236 ± 0.005  0.067 ± 0.004  0.768 ± 0.081  0.373 ± 0.006
Hopper  0  0.141 ± 0.005  0.031 ± 0.005  0.428 ± 0.034  0.268 ± 0.004
5  0.145 ± 0.003  0.045 ± 0.004  0.616 ± 0.05  0.269 ± 0.004
Walker 2d  0  0.068 ± 0.001  0.011 ± 0.003  0.3 ± 0.051  0.159 ± 0.002
5  0.057 ± 0.001  0.017 ± 0.002  0.44 ± 0.04  0.159 ± 0.002

Dynamics Model Types. Finally, we compare the impact of the dynamics model type, in our case either feed-forward (FF) or auto-regressive (AR). We use the same architecture as defined by Zhang et al. (2021). In Table 7 and Table 13 (Appendix), we do not observe significant evidence in favor of the AR model with respect to OPCC performance. There is a marginal, but not statistically significant, reduction in AURCC in some cases. We do see an increase in coverage resolution (CR_K) for the gym-mujoco environments when using the AR model, while it remains the same for the maze2d environments. This may be due to the additional uncertainty propagation that can occur during auto-regressive inference of each dimension, especially in the higher-dimensional gym-mujoco environments. Currently, our results do not suggest that the extra computational cost of the AR model over FF is worthwhile with respect to uncertainty quantification as measured via OPCC. This may be because these environments do not require representing multi-modal output distributions, which is where the AR model could have a distinct advantage.
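The sketch below contrasts the two model types at a schematic level: the feed-forward model predicts all next-observation dimensions in one pass, whereas the auto-regressive model predicts one dimension at a time, conditioning each dimension on those already predicted. This is a simplified point-prediction illustration with assumed layer sizes; the actual AR architecture of Zhang et al. (2021) additionally parameterizes a distribution per dimension.

```python
import torch
import torch.nn as nn

class FeedForwardDynamics(nn.Module):
    """Predicts all next-observation dimensions in a single forward pass."""
    def __init__(self, obs_dim, act_dim, hidden=200):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, obs_dim),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

class AutoregressiveDynamics(nn.Module):
    """Predicts next-observation dimensions one at a time, conditioning each
    dimension on the dimensions predicted so far."""
    def __init__(self, obs_dim, act_dim, hidden=200):
        super().__init__()
        # head i sees (obs, act) plus the i previously predicted dimensions
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(obs_dim + act_dim + i, hidden), nn.ReLU(),
                          nn.Linear(hidden, 1))
            for i in range(obs_dim)
        ])

    def forward(self, obs, act):
        x = torch.cat([obs, act], dim=-1)
        preds = []
        for head in self.heads:
            inp = torch.cat([x] + preds, dim=-1) if preds else x
            preds.append(head(inp))
        return torch.cat(preds, dim=-1)
```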

Determinism. Our baseline model is a deterministic version of the stochastic model defined in Chua et al. (2018), trained via a regression loss. A classic improvement is to introduce stochasticity by learning a normal distribution over the next observation rather than a point estimate. We experimented with this modification and provide results in the Appendix (Section B.2, Tables 11 and 12). Although the limitations of deterministic models in stochastic environments are well understood, we do not see significant gains from stochastic models in our pilot run. This is possibly due to the deterministic nature of the maze environments and the low stochasticity of the gym-mujoco environments.
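For concreteness, the two variants differ mainly in their training losses, sketched below: a mean-squared regression loss for the deterministic point-prediction model versus a diagonal-Gaussian negative log-likelihood when a mean and log-variance are predicted. Constant factors are dropped and variance-clamping details are omitted in this sketch.

```python
import torch
import torch.nn.functional as F

def deterministic_loss(pred_next_obs, next_obs):
    """Regression loss for the deterministic model (point prediction)."""
    return F.mse_loss(pred_next_obs, next_obs)

def gaussian_nll_loss(pred_mean, pred_logvar, next_obs):
    """Negative log-likelihood (up to constants) for a diagonal-Gaussian
    next-observation model in the style of Chua et al. (2018)."""
    inv_var = torch.exp(-pred_logvar)
    return (((pred_mean - next_obs) ** 2) * inv_var + pred_logvar).mean()
```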

Table 7: Evaluation metrics for dynamics-type comparison in gym-mujoco environments, reporting means and 95% confidence intervals over 50 samples from 5 (seed) dynamics models for each of the 5 datasets.
Env.  dynamics-type  AURCC (↓)  RPP (↓)  CR_K (↑)  loss (↓)
Half Cheetah  autoregressive  0.249 ± 0.008  0.068 ± 0.006  0.736 ± 0.087  0.379 ± 0.003
feed-forward  0.219 ± 0.005  0.044 ± 0.005  0.46 ± 0.04  0.374 ± 0.004
Hopper  autoregressive  0.139 ± 0.007  0.034 ± 0.005  0.48 ± 0.042  0.268 ± 0.007
feed-forward  0.141 ± 0.005  0.031 ± 0.005  0.428 ± 0.034  0.268 ± 0.004
Walker 2d  autoregressive  0.065 ± 0.002  0.013 ± 0.002  0.356 ± 0.046  0.158 ± 0.002
feed-forward  0.068 ± 0.001  0.011 ± 0.003  0.3 ± 0.051  0.159 ± 0.002

8 Summary

Properly quantifying uncertainty of complex models is a major open problem of practical significance in machine learning. Despite this fact, only a small fraction of the work in machine learning attempts to address this problem. Further, in areas such as offline RL, where methods for addressing uncertainty are developed, there is very little direct evaluation of uncertainty quantification. In recent years, there has been impressive progress on out-of-distribution detection for image classification, where quantifying uncertainty is a core problem. This has been largely driven by the availability of benchmarks that lower the overhead for conducting research and comparing methods. Currently, there is a lack of such benchmarks for sequential decision-making. The OPCC problem is relatively simple to state, yet rich enough to capture the essence of uncertainty quantification for sequential decision making. We hope that the OPCC benchmarks will inspire other researchers to develop new ideas for uncertainty quantification. Indeed, our pilot experiments show there is significant room to improve and that our understanding of current mechanisms is incomplete. Finally, we hope that this benchmark and baseline contribution serves as an initial seed that the community at large will build upon as progress is made.

References

  • Argenson & Dulac-Arnold (2020) Arthur Argenson and Gabriel Dulac-Arnold. Model-based offline planning. arXiv preprint arXiv:2008.05556, 2020.
  • Breiman (1996) Leo Breiman. Bagging predictors. Machine learning, 24(2):123–140, 1996.
  • Brockman et al. (2016) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
  • Buckman et al. (2020) Jacob Buckman, Carles Gelada, and Marc G Bellemare. The importance of pessimism in fixed-dataset policy optimization. arXiv preprint arXiv:2009.06799, 2020.
  • Chua et al. (2018) Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. arXiv preprint arXiv:1805.12114, 2018.
  • Clavera et al. (2018) Ignasi Clavera, Jonas Rothfuss, John Schulman, Yasuhiro Fujita, Tamim Asfour, and Pieter Abbeel. Model-based reinforcement learning via meta-policy optimization. In Conference on Robot Learning, pp.  617–629. PMLR, 2018.
  • Condessa et al. (2017) Filipe Condessa, José Bioucas-Dias, and Jelena Kovačević. Performance measures for classification systems with rejection. Pattern Recognition, 63:437–450, 2017.
  • Dai et al. (2020) Bo Dai, Ofir Nachum, Yinlam Chow, Lihong Li, Csaba Szepesvári, and Dale Schuurmans. Coindice: Off-policy confidence interval estimation. arXiv preprint arXiv:2010.11652, 2020.
  • Deisenroth & Rasmussen (2011) Marc Deisenroth and Carl E Rasmussen. Pilco: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on machine learning (ICML-11), pp.  465–472. Citeseer, 2011.
  • Dietterich (2000) Thomas G Dietterich. Ensemble methods in machine learning. In International workshop on multiple classifier systems, pp. 1–15. Springer, 2000.
  • Efron (1987) Bradley Efron. Better bootstrap confidence intervals. Journal of the American statistical Association, 82(397):171–185, 1987.
  • Efron & Tibshirani (1994) Bradley Efron and Robert J Tibshirani. An introduction to the bootstrap. CRC press, 1994.
  • El-Yaniv et al. (2010) Ran El-Yaniv et al. On the foundations of noise-free selective classification. Journal of Machine Learning Research, 11(5), 2010.
  • Ernst et al. (2005) Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503–556, 2005.
  • Fu et al. (2020) Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4rl: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
  • Fu et al. (2021) Justin Fu, Mohammad Norouzi, Ofir Nachum, George Tucker, Ziyu Wang, Alexander Novikov, Mengjiao Yang, Michael R Zhang, Yutian Chen, Aviral Kumar, et al. Benchmarks for deep off-policy evaluation. arXiv preprint arXiv:2103.16596, 2021.
  • Fujimoto & Gu (2021) Scott Fujimoto and Shixiang Shane Gu. A minimalist approach to offline reinforcement learning. arXiv preprint arXiv:2106.06860, 2021.
  • Gal et al. (2016) Yarin Gal, Rowan McAllister, and Carl Edward Rasmussen. Improving pilco with bayesian neural network dynamics models. In Data-Efficient Machine Learning workshop, ICML, volume 4, pp.  25, 2016.
  • Geifman & El-Yaniv (2017) Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp.  4885–4894, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964.
  • Geifman & El-Yaniv (2019) Yonatan Geifman and Ran El-Yaniv. Selectivenet: A deep neural network with an integrated reject option. In International Conference on Machine Learning, pp. 2151–2159. PMLR, 2019.
  • Hanna et al. (2017) Josiah P Hanna, Peter Stone, and Scott Niekum. Bootstrapping with models: Confidence intervals for off-policy evaluation. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
  • Hao et al. (2021) Botao Hao, Xiang Ji, Yaqi Duan, Hao Lu, Csaba Szepesvári, and Mengdi Wang. Bootstrapping statistical inference for off-policy evaluation. arXiv preprint arXiv:2102.03607, 2021.
  • Hendrickx et al. (2021) Kilian Hendrickx, Lorenzo Perini, Dries Van der Plas, Wannes Meert, and Jesse Davis. Machine learning with a reject option: A survey. arXiv preprint arXiv:2107.11277, 2021.
  • Janner et al. (2019) Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. arXiv preprint arXiv:1906.08253, 2019.
  • Jin et al. (2021a) Ying Jin, Zhuoran Yang, and Zhaoran Wang. Is pessimism provably efficient for offline rl? In International Conference on Machine Learning, pp. 5084–5096. PMLR, 2021a.
  • Jin et al. (2021b) Yue Jin, Yue Zhang, Tao Qin, Xudong Zhang, Jian Yuan, Houqiang Li, and Tie-Yan Liu. Supervised off-policy ranking. arXiv preprint arXiv:2107.01360, 2021b.
  • Kidambi et al. (2020) Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, and Thorsten Joachims. Morel: Model-based offline reinforcement learning. arXiv preprint arXiv:2005.05951, 2020.
  • Kostrikov & Nachum (2020) Ilya Kostrikov and Ofir Nachum. Statistical bootstrapping for uncertainty estimation in off-policy evaluation. arXiv preprint arXiv:2007.13609, 2020.
  • Kostrikov et al. (2021) Ilya Kostrikov, Rob Fergus, Jonathan Tompson, and Ofir Nachum. Offline reinforcement learning with fisher divergence critic regularization. In International Conference on Machine Learning, pp. 5774–5783. PMLR, 2021.
  • Kumar et al. (2019) Aviral Kumar, Justin Fu, George Tucker, and Sergey Levine. Stabilizing off-policy q-learning via bootstrapping error reduction. arXiv preprint arXiv:1906.00949, 2019.
  • Kumar et al. (2020) Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for offline reinforcement learning. arXiv preprint arXiv:2006.04779, 2020.
  • Kurutach et al. (2018) Thanard Kurutach, Ignasi Clavera, Yan Duan, Aviv Tamar, and Pieter Abbeel. Model-ensemble trust-region policy optimization. arXiv preprint arXiv:1802.10592, 2018.
  • Levine et al. (2020) Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
  • Loh (1987) Wei-Yin Loh. Calibrating confidence coefficients. Journal of the American Statistical Association, 82(397):155–162, 1987.
  • Massart (2007) Pascal Massart. Concentration inequalities and model selection: Ecole d’Eté de Probabilités de Saint-Flour XXXIII-2003. Springer, 2007.
  • Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  • Nachum et al. (2019) Ofir Nachum, Yinlam Chow, Bo Dai, and Lihong Li. Dualdice: Behavior-agnostic estimation of discounted stationary distribution corrections. arXiv preprint arXiv:1906.04733, 2019.
  • Naeini et al. (2015) Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using bayesian binning. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
  • Nagabandi et al. (2020) Anusha Nagabandi, Kurt Konolige, Sergey Levine, and Vikash Kumar. Deep dynamics models for learning dexterous manipulation. In Conference on Robot Learning, pp.  1101–1112. PMLR, 2020.
  • Osband et al. (2018) Ian Osband, John Aslanides, and Albin Cassirer. Randomized prior functions for deep reinforcement learning. arXiv preprint arXiv:1806.03335, 2018.
  • Peng et al. (2019) Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019.
  • Puterman (2014) Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.
  • Rafailov et al. (2021) Rafael Rafailov, Tianhe Yu, Aravind Rajeswaran, and Chelsea Finn. Offline reinforcement learning from images with latent space models. In Learning for Dynamics and Control, pp.  1154–1168. PMLR, 2021.
  • Rasmussen (2003) Carl Edward Rasmussen. Gaussian processes in machine learning. In Summer school on machine learning, pp.  63–71. Springer, 2003.
  • Richards (2005) Arthur George Richards. Robust constrained model predictive control. PhD thesis, Massachusetts Institute of Technology, 2005.
  • Schrittwieser et al. (2021) Julian Schrittwieser, Thomas Hubert, Amol Mandhane, Mohammadamin Barekatain, Ioannis Antonoglou, and David Silver. Online and offline reinforcement learning by planning with a learned model. arXiv preprint arXiv:2104.06294, 2021.
  • Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Shrestha et al. (2021) Aayam Shrestha, Stefan Lee, Prasad Tadepalli, and Alan Fern. Deepaveragers: Offline reinforcement learning by solving derived non-parametric mdps. In International Conference on Learning Representations, 2021.
  • Sonabend-W et al. (2020) Aaron Sonabend-W, Junwei Lu, Leo A Celi, Tianxi Cai, and Peter Szolovits. Expert-supervised reinforcement learning for offline policy learning and evaluation. arXiv preprint arXiv:2006.13189, 2020.
  • Strehl & Littman (2008) Alexander L Strehl and Michael L Littman. An analysis of model-based interval estimation for markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.
  • Thomas et al. (2015) Philip Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. High-confidence off-policy evaluation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015.
  • Xin et al. (2021) Ji Xin, Raphael Tang, Yaoliang Yu, and Jimmy Lin. The art of abstention: Selective prediction and error regularization for natural language processing. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp.  1040–1051, 2021.
  • Yu et al. (2020) Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. Mopo: Model-based offline policy optimization. arXiv preprint arXiv:2005.13239, 2020.
  • Yu et al. (2021) Tianhe Yu, Aviral Kumar, Rafael Rafailov, Aravind Rajeswaran, Sergey Levine, and Chelsea Finn. Combo: Conservative offline model-based policy optimization. arXiv preprint arXiv:2102.08363, 2021.
  • Zhang et al. (2021) Michael R Zhang, Tom Le Paine, Ofir Nachum, Cosmin Paduraru, George Tucker, Ziyu Wang, and Mohammad Norouzi. Autoregressive dynamics models for offline policy evaluation and optimization. arXiv preprint arXiv:2104.13877, 2021.
  • Zhang et al. (2020) Shangtong Zhang, Bo Liu, and Shimon Whiteson. Gradientdice: Rethinking generalized offline estimation of stationary values. In International Conference on Machine Learning, pp. 11194–11203. PMLR, 2020.

Appendix A Appendix - OPCC Benchmark Summary

In the following, we share a snapshot of OPCC benchmark components.

Table 8: Information about the OPCC benchmark, comprising environment details, datasets, and queries.
Env.  Observation-size  Action-Dimensions  Max. Env. Steps  Dataset-name  Query-count
maze2d-open-v0  4  2  150  1m  1500
maze2d-medium-v1  4  2  600  1m  1500
maze2d-umaze-v1  4  2  300  1m  1500
maze2d-large-v1  4  2  800  1m  121
HalfCheetah-v2  17  6  1000  random, expert, medium, medium-replay, medium-expert  1500
Hopper-v2  11  3  1000  random, expert, medium, medium-replay, medium-expert  1500
Walker2d-v2  17  6  1000  random, expert, medium, medium-replay, medium-expert  1500
Table 9: Performance of policies used in PCQs. These policies are trained using PPO (Schulman et al., 2017) over the original environment task and hand-picked at different performance levels. We report mean and standard deviation of policy performance over 20 episodes.
Env.  policy-1  policy-2  policy-3  policy-4
maze2d-open-v0  122 ± 10  104 ± 22  18 ± 14  4 ± 8
maze2d-umaze-v1  245 ± 272  203 ± 252  256 ± 260  258 ± 262
maze2d-medium-v1  235 ± 35  197 ± 58  23 ± 73  3 ± 9
maze2d-large-v1  231 ± 268  160 ± 201  50 ± 76  9 ± 9
HalfCheetah-v2  1168 ± 80  1044 ± 112  785 ± 303  94 ± 40
Hopper-v2  1195 ± 794  1466 ± 487  1832 ± 560  236 ± 1
Walker2d-v2  2506 ± 698  811 ± 321  387 ± 42  162 ± 102
Figure 7: Scatter plot of the PCQs for each benchmark environment. For each PCQ $(s, \pi, \hat{s}, \hat{\pi}, h)$, we plot $V^{\pi}(s,h)$ vs. $V^{\hat{\pi}}(\hat{s},h)$.

Appendix B Appendix - Evaluation Metrics & Selective-Risk Coverage Curves for Ablations

In the following sub-sections, we report OPCC metrics for various ablations of our baseline. In each table cell, we show the mean and 95% confidence interval of the corresponding metric, estimated by evaluating 5 dynamics-model training runs over each dataset of the corresponding environment.

B.1 Randomized Constant Priors

Figures 8 and 9 and Tables 6 and 10 show the impact of adding randomized constant priors to our baseline ensemble models. The output of each prior model is scaled by the prior-scale before being added to the corresponding dynamics model, and a prior-scale of 0 implies that no randomized prior is added. We observe a significant performance gain only in the large-maze environment.

Table 10: Evaluation metrics for prior-scale comparison in maze environments.
Env.  prior-scale  AURCC (↓)  RPP (↓)  CR_K (↑)  loss (↓)
large  0  0.14 ± 0.015  0.062 ± 0.004  0.82 ± 0.035  0.251 ± 0.029
5  0.104 ± 0.017  0.051 ± 0.012  0.82 ± 0.035  0.197 ± 0.017
medium  0  0.001 ± (<0.001)  0.0 ± (<0.001)  0.2 ± (<0.001)  0.022 ± 0.001
5  0.0 ± (<0.001)  0.0 ± (<0.001)  0.2 ± (<0.001)  0.02 ± 0.003
open  0  0.029 ± 0.001  0.012 ± 0.001  0.5 ± (<0.001)  0.107 ± 0.005
5  0.032 ± 0.001  0.012 ± (<0.001)  0.5 ± (<0.001)  0.115 ± 0.008
umaze  0  0.008 ± 0.001  0.002 ± (<0.001)  0.3 ± (<0.001)  0.075 ± 0.003
5  0.006 ± 0.001  0.002 ± (<0.001)  0.3 ± (<0.001)  0.071 ± 0.002
Figure 8: Selective-risk coverage curves for prior-scale in gym-mujoco environments.
Figure 9: Selective-risk coverage curves for prior-scale in maze environments

B.2 Deterministic Model

In Tables 11 and 12 and Figures 10 and 11, we compare a deterministic (True) versus stochastic (False) dynamics model. Although the mean AURCC of the stochastic model is lower in several cases, the differences are not significant, as they fall within the confidence intervals of the deterministic model.

Table 11: Evaluation metrics for deterministic model comparison in gym-mujoco environments.
Env.  deterministic  AURCC (↓)  RPP (↓)  CR_K (↑)  loss (↓)
Half Cheetah  False  0.229 ± 0.007  0.054 ± 0.006  0.568 ± 0.057  0.377 ± 0.005
True  0.219 ± 0.005  0.044 ± 0.005  0.46 ± 0.04  0.374 ± 0.004
Hopper  False  0.138 ± 0.002  0.039 ± 0.002  0.524 ± 0.039  0.26 ± 0.003
True  0.141 ± 0.005  0.031 ± 0.005  0.428 ± 0.034  0.268 ± 0.004
Walker 2d  False  0.064 ± (<0.001)  0.012 ± 0.001  0.328 ± 0.024  0.16 ± 0.002
True  0.068 ± 0.001  0.011 ± 0.003  0.3 ± 0.051  0.159 ± 0.002
Table 12: Evaluation metrics for deterministic model comparison in maze environments.
Env.  deterministic  AURCC (↓)  RPP (↓)  CR_K (↑)  loss (↓)
large  False  0.152 ± 0.01  0.058 ± 0.007  0.54 ± 0.043  0.167 ± 0.003
True  0.14 ± 0.015  0.062 ± 0.004  0.82 ± 0.035  0.251 ± 0.029
medium  False  0.0 ± (<0.001)  0.0 ± (<0.001)  0.2 ± (<0.001)  0.003 ± (<0.001)
True  0.001 ± (<0.001)  0.0 ± (<0.001)  0.2 ± (<0.001)  0.022 ± 0.001
open  False  0.037 ± 0.001  0.01 ± (<0.001)  0.4 ± (<0.001)  0.143 ± 0.005
True  0.029 ± 0.001  0.012 ± 0.001  0.5 ± (<0.001)  0.107 ± 0.005
umaze  False  0.004 ± (<0.001)  0.001 ± (<0.001)  0.2 ± (<0.001)  0.059 ± 0.004
True  0.008 ± 0.001  0.002 ± (<0.001)  0.3 ± (<0.001)  0.075 ± 0.003
Figure 10: Selective-risk coverage curves for deterministic in gym-mujoco environments
Figure 11: Selective-risk coverage curves for deterministic in maze environments

B.3 Dynamics Type

Table 13: Evaluation metrics for dynamics-type comparison in maze environments
Env.  dynamics-type  AURCC (↓)  RPP (↓)  CR_K (↑)  loss (↓)
large  autoregressive  0.131 ± 0.017  0.06 ± 0.003  0.8 ± (<0.001)  0.233 ± 0.044
feed-forward  0.14 ± 0.015  0.062 ± 0.004  0.82 ± 0.035  0.251 ± 0.029
medium  autoregressive  0.001 ± (<0.001)  0.0 ± (<0.001)  0.2 ± (<0.001)  0.031 ± 0.001
feed-forward  0.001 ± (<0.001)  0.0 ± (<0.001)  0.2 ± (<0.001)  0.022 ± 0.001
open  autoregressive  0.034 ± 0.003  0.014 ± 0.002  0.5 ± (<0.001)  0.123 ± 0.006
feed-forward  0.029 ± 0.001  0.012 ± 0.001  0.5 ± (<0.001)  0.107 ± 0.005
umaze  autoregressive  0.007 ± 0.001  0.002 ± (<0.001)  0.3 ± (<0.001)  0.07 ± 0.002
feed-forward  0.008 ± 0.001  0.002 ± (<0.001)  0.3 ± (<0.001)  0.075 ± 0.003
Figure 12: Selective-risk coverage curves for dynamics-type in gym-mujoco environments
Figure 13: Selective-risk coverage curves for dynamics-type in maze environments

B.4 Normalization of input state-space

In Tables 14 and 15 and Figures 14 and 15, we investigate the impact of learning the dynamics with a normalized state-space. Here, 'True' means the dynamics model was learned with a normalized state-space and 'False' means otherwise. There is only a marginal performance difference between the two choices for the gym-mujoco environments. The maze2d environments show mixed results, with normalization benefiting large-maze and hurting umaze. The performance of medium-maze and open-maze is not impacted significantly.
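The normalization itself is a standard pre-processing step: observations are shifted and scaled by statistics computed on the offline dataset before the dynamics model is trained, and predictions are mapped back to the original scale for rollouts. A minimal sketch, with the class name and epsilon value chosen for illustration:

```python
import numpy as np

class ObservationNormalizer:
    """Normalizes observations with offline-dataset statistics before dynamics training."""

    def __init__(self, dataset_obs, eps=1e-6):
        # dataset_obs: array of shape (num_transitions, obs_dim) from the offline dataset
        self.mean = dataset_obs.mean(axis=0)
        self.std = dataset_obs.std(axis=0) + eps   # eps avoids division by zero

    def normalize(self, obs):
        return (obs - self.mean) / self.std

    def denormalize(self, obs_norm):
        return obs_norm * self.std + self.mean
```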

Table 14: Evaluation metrics for normalize comparison in gym-mujoco environments
Env.  normalize  AURCC (↓)  RPP (↓)  CR_K (↑)  loss (↓)
Half Cheetah  False  0.228 ± 0.006  0.053 ± 0.007  0.548 ± 0.07  0.376 ± 0.003
True  0.219 ± 0.005  0.044 ± 0.005  0.46 ± 0.04  0.374 ± 0.004
Hopper  False  0.128 ± 0.002  0.017 ± 0.002  0.3 ± 0.025  0.255 ± 0.003
True  0.141 ± 0.005  0.031 ± 0.005  0.428 ± 0.034  0.268 ± 0.004
Walker 2d  False  0.078 ± 0.007  0.013 ± 0.005  0.316 ± 0.073  0.17 ± 0.009
True  0.068 ± 0.001  0.011 ± 0.003  0.3 ± 0.051  0.159 ± 0.002
Table 15: Evaluation metrics for normalize comparison in maze environments
Env.  normalize  AURCC (↓)  RPP (↓)  CR_K (↑)  loss (↓)
large  False  0.215 ± 0.027  0.067 ± 0.007  0.82 ± 0.035  0.402 ± 0.037
True  0.14 ± 0.015  0.062 ± 0.004  0.82 ± 0.035  0.251 ± 0.029
medium  False  0.0 ± (<0.001)  0.0 ± (<0.001)  0.2 ± (<0.001)  0.017 ± 0.002
True  0.001 ± (<0.001)  0.0 ± (<0.001)  0.2 ± (<0.001)  0.022 ± 0.001
open  False  0.033 ± 0.001  0.014 ± (<0.001)  0.5 ± (<0.001)  0.108 ± 0.004
True  0.029 ± 0.001  0.012 ± 0.001  0.5 ± (<0.001)  0.107 ± 0.005
umaze  False  0.003 ± (<0.001)  0.002 ± (<0.001)  0.3 ± (<0.001)  0.047 ± 0.004
True  0.008 ± 0.001  0.002 ± (<0.001)  0.3 ± (<0.001)  0.075 ± 0.003
Figure 14: Selective-risk coverage curves for normalize in gym-mujoco environments.
Figure 15: Selective-risk coverage curves for normalize in maze environments.

B.5 Uncertainty Types

In the following, EV, PCI, and U-PCI refer to ensemble voting, paired confidence interval, and unpaired confidence interval, respectively.
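For intuition, the sketch below gives one plausible instantiation of the ensemble-voting and paired-confidence-interval mechanisms, assuming each of the N ensemble members produces a Monte-Carlo return estimate for both policies in a query. The exact confidence scores used in our released baselines may differ; in particular, turning the paired interval into a rankable confidence score via the t-statistic magnitude is an illustrative choice here. U-PCI is analogous to PCI but treats the two sets of value estimates as independent samples rather than paired ones.

```python
import numpy as np
from scipy import stats

def ensemble_voting(values_a, values_b):
    """EV: predict the policy that a majority of ensemble members rank higher;
    confidence is the fraction of members agreeing with the majority."""
    prefer_b = values_b > values_a
    frac_b = prefer_b.mean()
    prediction_b_better = frac_b > 0.5
    confidence = max(frac_b, 1.0 - frac_b)
    return prediction_b_better, confidence

def paired_confidence_interval(values_a, values_b, alpha=0.05):
    """PCI: t-interval on the paired per-member differences; the prediction is
    the sign of the mean difference, and confidence grows as the interval moves
    away from zero (scored here by the |t|-statistic)."""
    diff = values_b - values_a
    n = len(diff)
    se = diff.std(ddof=1) / np.sqrt(n) + 1e-12
    t_crit = stats.t.ppf(1.0 - alpha / 2.0, df=n - 1)
    lower, upper = diff.mean() - t_crit * se, diff.mean() + t_crit * se
    prediction_b_better = diff.mean() > 0
    confidence = abs(diff.mean()) / se
    return prediction_b_better, confidence, (lower, upper)
```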

Table 16: Evaluation metrics for uncertainty-type comparison in maze environments
Env.  uncertainty-type  AURCC (↓)  RPP (↓)  CR_K (↑)  loss (↓)
medium  ev  0.001 ± (<0.001)  0.0 ± (<0.001)  0.2 ± (<0.001)  0.022 ± 0.001
pci  0.001 ± (<0.001)  0.0 ± (<0.001)  0.2 ± (<0.001)  0.009 ± 0.001
u-pci  0.001 ± (<0.001)  0.0 ± (<0.001)  0.2 ± (<0.001)  0.009 ± 0.001
open  ev  0.029 ± 0.001  0.012 ± 0.001  0.5 ± (<0.001)  0.107 ± 0.005
pci  0.057 ± 0.004  0.012 ± 0.001  0.38 ± 0.035  0.168 ± 0.008
u-pci  0.05 ± 0.002  0.012 ± 0.001  0.4 ± (<0.001)  0.168 ± 0.008
umaze  ev  0.008 ± 0.001  0.002 ± (<0.001)  0.3 ± (<0.001)  0.075 ± 0.003
pci  0.035 ± 0.002  0.001 ± (<0.001)  0.2 ± (<0.001)  0.084 ± 0.003
u-pci  0.017 ± 0.002  0.001 ± (<0.001)  0.2 ± (<0.001)  0.084 ± 0.003
Figure 16: Selective-risk coverage curves for uncertainty-type in gym-mujoco environments.
Figure 17: Selective-risk coverage curves for uncertainty-type in maze environments.

B.6 Horizon

Table 17: Evaluation metrics for horizon comparison in maze environments
Env.  horizon  AURCC (↓)  RPP (↓)  CR_K (↑)  loss (↓)
large  20  0.028 ± (<0.001)  0.021 ± (<0.001)  0.62 ± 0.035  0.059 ± (<0.001)
30  0.118 ± 0.013  0.047 ± 0.004  0.8 ± (<0.001)  0.295 ± 0.037
40  0.218 ± 0.018  0.087 ± 0.004  0.82 ± 0.035  0.321 ± 0.027
50  0.16 ± 0.018  0.07 ± 0.005  0.92 ± 0.035  0.247 ± 0.038
medium  20  0.0 ± (<0.001)  0.0 ± (<0.001)  0.2 ± (<0.001)  0.006 ± 0.001
30  0.0 ± (<0.001)  0.0 ± (<0.001)  0.2 ± (<0.001)  0.019 ± 0.001
40  0.001 ± (<0.001)  0.0 ± (<0.001)  0.2 ± (<0.001)  0.032 ± 0.003
50  0.001 ± (<0.001)  0.001 ± (<0.001)  0.2 ± (<0.001)  0.031 ± (<0.001)
open  20  0.005 ± (<0.001)  0.001 ± (<0.001)  0.2 ± (<0.001)  0.029 ± 0.001
30  0.015 ± 0.002  0.007 ± 0.001  0.5 ± (<0.001)  0.104 ± 0.006
40  0.036 ± 0.002  0.016 ± 0.001  0.6 ± (<0.001)  0.119 ± 0.007
50  0.062 ± 0.002  0.025 ± 0.001  0.6 ± (<0.001)  0.148 ± 0.006
umaze  20  0.0 ± (<0.001)  0.0 ± (<0.001)  0.2 ± (<0.001)  0.007 ± 0.001
30  0.0 ± (<0.001)  0.0 ± (<0.001)  0.2 ± (<0.001)  0.012 ± 0.002
40  0.006 ± 0.001  0.002 ± (<0.001)  0.2 ± (<0.001)  0.048 ± 0.003
50  0.035 ± 0.002  0.008 ± 0.001  0.4 ± (<0.001)  0.195 ± 0.007
Figure 18: Selective-risk coverage curves for horizon in gym-mujoco environments.
Figure 19: Selective-risk coverage curves for horizon in maze environments

B.7 Ensemble-Count

Table 18: Evaluation metrics for ensemble-count comparison in maze environments.
Env.  ensemble-count  AURCC (↓)  RPP (↓)  CR_K (↑)  loss (↓)
large  10  0.168 ± 0.044  0.051 ± 0.011  0.6 ± (<0.001)  0.307 ± 0.061
20  0.149 ± 0.039  0.057 ± 0.015  0.78 ± 0.066  0.269 ± 0.065
40  0.138 ± 0.031  0.056 ± 0.011  0.82 ± 0.035  0.264 ± 0.044
80  0.146 ± 0.019  0.061 ± 0.007  0.82 ± 0.035  0.269 ± 0.027
100  0.14 ± 0.015  0.062 ± 0.004  0.82 ± 0.035  0.251 ± 0.029
medium  10  0.001 ± (<0.001)  0.0 ± (<0.001)  0.2 ± (<0.001)  0.023 ± 0.007
20  0.001 ± (<0.001)  0.0 ± (<0.001)  0.2 ± (<0.001)  0.024 ± 0.006
40  0.001 ± (<0.001)  0.0 ± (<0.001)  0.2 ± (<0.001)  0.021 ± 0.003
80  0.001 ± (<0.001)  0.0 ± (<0.001)  0.2 ± (<0.001)  0.022 ± 0.001
100  0.001 ± (<0.001)  0.0 ± (<0.001)  0.2 ± (<0.001)  0.022 ± 0.001
open  10  0.033 ± 0.004  0.009 ± (<0.001)  0.48 ± 0.035  0.127 ± 0.015
20  0.031 ± 0.003  0.01 ± (<0.001)  0.5 ± (<0.001)  0.117 ± 0.019
40  0.03 ± 0.003  0.011 ± 0.001  0.5 ± (<0.001)  0.11 ± 0.014
80  0.029 ± 0.002  0.012 ± 0.001  0.5 ± (<0.001)  0.108 ± 0.007
100  0.029 ± 0.001  0.012 ± 0.001  0.5 ± (<0.001)  0.107 ± 0.005
umaze  10  0.011 ± 0.002  0.002 ± (<0.001)  0.3 ± (<0.001)  0.073 ± 0.006
20  0.009 ± 0.001  0.002 ± (<0.001)  0.3 ± (<0.001)  0.074 ± 0.004
40  0.008 ± 0.001  0.002 ± (<0.001)  0.3 ± (<0.001)  0.077 ± 0.004
80  0.008 ± 0.001  0.002 ± (<0.001)  0.3 ± (<0.001)  0.075 ± 0.004
100  0.008 ± 0.001  0.002 ± (<0.001)  0.3 ± (<0.001)  0.075 ± 0.003
Figure 20: Selective-risk coverage curves for ensemble-count in gym-mujoco environments
Figure 21: Selective-risk coverage curves for ensemble-count in maze environments