
Model-Based Offline Reinforcement Learning with Pessimism-Modulated Dynamics Belief

Kaiyang Guo   Yunfeng Shao   Yanhui Geng
Huawei Noah’s Ark Lab
Corresponding to: [email protected]
Abstract

Model-based offline reinforcement learning (RL) aims to find a highly rewarding policy by leveraging a previously collected static dataset and a dynamics model. While the dynamics model is learned by reusing the static dataset, its generalization ability can promote policy learning if properly utilized. To that end, several works propose to quantify the uncertainty of predicted dynamics and explicitly apply it to penalize reward. However, as the dynamics and the reward are intrinsically different factors in the context of an MDP, characterizing the impact of dynamics uncertainty through reward penalty may incur an unexpected tradeoff between model utilization and risk avoidance. In this work, we instead maintain a belief distribution over dynamics and evaluate/optimize the policy through biased sampling from the belief. The sampling procedure, biased towards pessimism, is derived from an alternating Markov game formulation of offline RL. We formally show that the biased sampling naturally induces an updated dynamics belief with a policy-dependent reweighting factor, termed Pessimism-Modulated Dynamics Belief. To improve the policy, we devise an iterative regularized policy optimization algorithm for the game, with guarantee of monotonic improvement under certain conditions. To make it practical, we further devise an offline RL algorithm to approximately find the solution. Empirical results show that the proposed approach achieves state-of-the-art performance on a wide range of benchmark tasks.

1 Introduction

In the typical RL paradigm, the agent actively interacts with the environment and receives feedback to promote policy improvement. This essential trial-and-error procedure can be costly, unsafe or even prohibitive in practice (e.g. robotics robotic , autonomous driving driving , and healthcare health ), thus constituting a major impediment to the actual deployment of RL. Meanwhile, for a number of applications, historical data records are available to reflect the system feedback under a predefined policy. This raises the opportunity to learn a policy in a purely offline setting.

In the offline setting, as no further interaction with the environment is permitted, the dataset provides only limited coverage of the state-action space. A policy that induces out-of-distribution (OOD) state-action pairs therefore cannot be well evaluated in the offline learning phase, and deploying it online can lead to poor performance. Recent studies have reported that applying vanilla RL algorithms to offline datasets exacerbates such distributional shift BCQ ; BEAR ; MOREL , making them unsuitable for the offline setting.

To tackle the distributional shift issue, a number of offline RL approaches have been developed. One category of approaches proposes to directly constrain the policy to stay close to the one collecting data BEAR ; BCQ ; BRAC ; EMaQ , or to penalize the Q-value towards conservatism for OOD state-action pairs CQL ; Fisher ; AlgaeDICE . While they achieve remarkable performance gains, the policy regularizer and the Q-value penalty tightly restrict the produced policy within the data manifold. Instead, more recent works consider quantifying the uncertainty of the Q-value with neural network ensembles Ensemble , where consistent Q-value estimates indicate high confidence and can be plausibly used during the learning process, even for OOD state-action pairs EDAC ; PBRL . However, uncertainty quantification over the OOD region relies heavily on how the neural network generalizes BayesGenalization . As prior knowledge of the Q-function is hard to acquire and insert into the neural network, the generalization is unlikely to be reliable enough to facilitate meaningful uncertainty quantification SWAG . Notably, all these works are model-free.

Model-based offline RL optimizes the policy based on a constructed dynamics model. Compared to model-free approaches, one prominent advantage is that prior knowledge of the dynamics is easier to access. First, generic priors such as smoothness widely exist in various domains smooth . Second, sufficiently learned dynamics models for relevant tasks can act as a data-driven prior for the concerned task blockMDP ; invariant ; zeroshot . With richer prior knowledge, the uncertainty quantification for dynamics is more trustworthy. Similar to the model-free approaches, the dynamics uncertainty can be incorporated to find a reliable policy beyond the data coverage. However, an additional challenge is how to characterize the accumulative impact of dynamics uncertainty on the long-term reward, as the system dynamics has an entirely different meaning from the reward or Q-value.

Although existing model-based offline RL literature theoretically bounds the impact of dynamics uncertainty on final performance, the practical variants characterize the impact through reward penalty MOPO ; MOREL ; Revisit . Concretely, the reward function is penalized by the dynamics uncertainty for each state-action pair MOPO , or the agent is forced to a low-reward absorbing state when the dynamics uncertainty exceeds a certain level MOREL . While optimizing the policy in these constructed MDPs stimulates anti-uncertainty behavior, the final policy tends to be over-conservative. For example, even if the transition dynamics for a state-action pair is ambiguous among several possible candidates, these candidates may generate states from which the system evolves similarly (or from which the system evolves differently but generates similar rewards). Then, such a state-action pair should not be treated specially.

Motivated by the above intuition, we propose pessimism-modulated dynamics belief for model-based offline RL. In contrast with the previous approaches, the dynamics uncertainty is not explicitly quantified. To characterize its impact, we maintain a belief distribution over system dynamics, and the policy is evaluated/optimized through biased sampling from it. The sampling procedure, biased towards pessimism, is derived from an alternating Markov game (AMG) formulation of offline RL. We formally show that the biased sampling naturally induces an updated dynamics belief with a policy-dependent reweighting factor, termed Pessimism-Modulated Dynamics Belief. Besides, the degree of pessimism is monotonically determined by the hyperparameters in the sampling procedure.

The considered AMG formulation can be regarded as a generalization of robust MDP, which is proposed as a surrogate to optimize the percentile performance in the face of dynamics uncertainty RobustControl ; RMDP . However, robust MDP suffers from two significant shortcomings: 1) The percentile criterion is over-conservative since it fixates on a single pessimistic dynamics instance SRRL ; SRAC ; 2) Robust MDP is constructed based on an uncertainty set, and an improper choice of uncertainty set further aggravates the degree of conservatism BCR ; Percentile . The AMG formulation is free from these shortcomings. To solve the AMG, we devise an iterative regularized policy optimization algorithm, with guarantee of monotonic improvement under certain conditions. To make it practical, we further derive an offline RL algorithm to approximately find the solution, and empirically evaluate it on the offline RL benchmark D4RL. The results show that the proposed approach clearly outperforms the previous state-of-the-art (SoTA) in 9 out of 18 environment-dataset configurations and performs competitively in the rest, without tuning hyperparameters for each task. The proofs of the theorems in this paper are presented in Appendix B.

2 Preliminaries

Markov Decision Process (MDP)

An MDP is depicted by the tuple $(\mathcal{S},\mathcal{A},T,r,\rho_{0},\gamma)$, where $\mathcal{S},\mathcal{A}$ are the state and action spaces, $T(s^{\prime}|s,a)$ is the transition probability, $r(s,a)$ is the reward function, $\rho_{0}(s)$ is the initial state distribution, and $\gamma$ is the discount factor. The goal of RL is to find the policy $\pi:s\rightarrow\Delta(\mathcal{A})$ that maximizes the cumulative discounted reward:

J(\pi,T)=\mathbb{E}_{\rho_{0},T,\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t},a_{t})\right], \qquad (1)

where $\Delta(\cdot)$ denotes the probability simplex. In the typical RL paradigm, this is done by actively interacting with the environment.

Offline RL

In the offline setting, the environment is inaccessible, and only a static dataset $\mathcal{D}=\left\{\left(s,a,r,s^{\prime}\right)\right\}$ is provided, containing previously logged data samples under an unknown behavior policy. Offline RL aims to optimize the policy by solely leveraging the offline dataset.

To simplify the presentation, we assume the reward function $r$ and the initial state distribution $\rho_{0}$ are known. Then, the system dynamics is unknown only in terms of the transition probability $T$. Note that the considered formulation and the proposed approach can be easily extended to the general case without additional technical modification.

Robust MDP

With the offline dataset, a straightforward strategy is to first learn a dynamics model $\tau(s^{\prime}|s,a)$ and then optimize the policy via simulation. However, due to the limitedness of available data, the learned model is inevitably imprecise. Robust MDP RobustControl is a surrogate to optimize the policy with consideration of the ambiguity of dynamics. Concretely, the robust MDP is constructed by introducing an uncertainty set $\mathcal{T}=\{\tau\}$ to contain plausible transition probabilities. If the uncertainty set includes the true transition with probability $(1-\delta)$, the performance of any policy $\pi$ in the true MDP can be lower bounded by $\min_{\tau\in\mathcal{T}}J(\pi,\tau)$ with probability at least $(1-\delta)$. Thus, the percentile performance for the true MDP can be optimized by finding a solution to

\max_{\pi}\min_{\tau\in\mathcal{T}}J(\pi,\tau). \qquad (2)

Despite its popularity, robust MDP suffers from two major shortcomings. First, the percentile criterion overly fixates on a single pessimistic transition instance, especially when there are multiple optimal policies for this transition but they lead to dramatically different performance for other transitions SRRL ; SRAC . This behavior results in an unnecessarily conservative policy.

Second, the level of conservatism can be further aggravated when the uncertainty set is inappropriately constructed BCR . For a given policy $\pi$, the ideal situation is that $\mathcal{T}$ contains the $(1-\delta)$ proportion of transitions with which the policy achieves higher performance than with the other $\delta$ proportion. Then, $\min_{\tau\in\mathcal{T}}J(\pi,\tau)$ is exactly the $\delta$-quantile performance. This requires the uncertainty set to be policy-dependent, and during policy optimization the uncertainty set should change accordingly. Otherwise, if $\mathcal{T}$ is predetermined and fixed, it is possible to have $\tau^{\prime}\notin\mathcal{T}$ with non-zero probability and satisfying $J(\pi^{*},\tau^{\prime})>\min_{\tau\in\mathcal{T}}J(\pi^{*},\tau)$, where $\pi^{*}$ is the optimal policy for (2). Then, adding $\tau^{\prime}$ into $\mathcal{T}$ does not affect the optimal solution of problem (2). This indicates that we are essentially optimizing a $\delta^{\prime}$-quantile performance, where $\delta^{\prime}$ can be much smaller than $\delta$. In the literature, the uncertainty sets are mostly predetermined before policy optimization RobustControl ; safe19 ; HCPE ; fastRMDP .

3 Pessimism-Modulated Dynamics Belief

In short, robust MDP is over-conservative due to the fixation on a single pessimistic transition instance and the predetermination of uncertainty set. In this work, we strive to take the entire spectrum of plausible transitions into account, and let the algorithm by itself determine which part deserves more attention. To this end, we consider an alternating Markov game formulation of offline RL, based on which the proposed offline RL approach is derived.

3.1 Formulation

Alternating Markov game (AMG)

The AMG is a specialization of the two-player zero-sum game, depicted by $(\mathcal{S},\bar{\mathcal{S}},\mathcal{A},\bar{\mathcal{A}},G,r,\rho_{0},\gamma)$. The game starts from a state sampled from $\rho_{0}$, then the two players alternately choose actions $a\in\mathcal{A}$ and $\bar{a}\in\bar{\mathcal{A}}$ under states $s\in\mathcal{S}$ and $\bar{s}\in\bar{\mathcal{S}}$, along with the game transition defined by $G(\bar{s}|s,a)$ and $G(s|\bar{s},\bar{a})$. At each round, the primary player receives reward $r(s,a)$ and the secondary player receives its negative counterpart $-r(s,a)$.

Offline RL as AMG

We formulate the offline RL problem as an AMG, where the primary player optimizes a reliable policy for the concerned MDP in the face of stochastic disturbance from the secondary player. The AMG is constructed by augmenting the original MDP. As both involve transition probabilities, we use the terms game transition and system transition to differentiate them.

For the primary player, its state space $\mathcal{S}$, action space $\mathcal{A}$ and reward function $r(s,a)$ are the same as those in the original MDP. After the primary player acts, the game emits an $N$-sized set of system transition candidates $\mathcal{T}^{sa}$, which later acts as the state of the secondary player. Formally, $\mathcal{T}^{sa}$ is generated according to

G\left(\bar{s}=\mathcal{T}^{sa}|s,a\right)=\prod_{\tau^{sa}\in\mathcal{T}^{sa}}\mathbb{P}_{T}^{sa}(\tau^{sa}), \qquad (3)

where $\tau^{sa}(\cdot)$ re-denotes the plausible system transition $\tau(\cdot|s,a)$ for short, and $\mathbb{P}_{T}^{sa}$ is a given belief distribution over $\tau^{sa}$. According to (3), the elements in $\mathcal{T}^{sa}$ are independent and identically distributed samples following $\mathbb{P}_{T}^{sa}$. The major difference to the uncertainty set in robust MDP is that the set introduced here is unfixed and stochastic at each step. To distinguish it from the uncertainty set, we call it the candidate set. The belief distribution $\mathbb{P}_{T}^{sa}$ can be chosen arbitrarily to incorporate knowledge of the system transition. Particularly, when a prior distribution of the system transition is accessible, $\mathbb{P}_{T}^{sa}$ can be obtained as the posterior by integrating the prior and the evidence $\mathcal{D}$ through Bayes' rule.

The secondary player receives the candidate set $\mathcal{T}^{sa}$ as its state. Thus, its state space can be denoted by $\bar{\mathcal{S}}=\Delta^{N}(\mathcal{S})$, i.e., the $N$-fold Cartesian product of the probability simplex over $\mathcal{S}$. Note that the state $\mathcal{T}^{sa}$ also takes the role of the action space, i.e., $\bar{\mathcal{A}}=\mathcal{T}^{sa}$, meaning that the action of the secondary player is to choose a system transition from the candidate set. Given the chosen $\tau^{sa}\in\mathcal{T}^{sa}$, the game evolves by sampling $\tau^{sa}$, i.e.,

G\left(s^{\prime}|\bar{s}=\mathcal{T}^{sa},\bar{a}=\tau^{sa}\right)=\tau^{sa}(s^{\prime}), \qquad (4)

and the primary player receives $s^{\prime}$ to continue the game. In the following, we use $\mathbb{P}_{T}^{N}(\mathcal{T}^{sa})$ to compactly denote the game transition $G\left(\bar{s}=\mathcal{T}^{sa}|s,a\right)$ in (3), and omit the superscript $sa$ in $\tau^{sa}$, $\mathcal{T}^{sa}$ and $\mathbb{P}_{T}^{sa}$ when it is clear from the context.
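To make the game mechanics concrete, the sketch below simulates one round of the AMG under an ensemble-based dynamics belief: the candidate set of (3) is drawn by sampling $N$ transition candidates i.i.d. from the belief, and the game then evolves as in (4) by sampling the candidate chosen by the secondary player. The `predict` interface of the ensemble members is hypothetical, and the uniform belief over Gaussian ensemble members is an illustrative assumption rather than a requirement of the formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_candidate_set(belief_models, s, a, N):
    """Candidate set T^{sa} of Eq. (3): N i.i.d. draws from the dynamics belief,
    here assumed to be a uniform distribution over an ensemble of learned models
    with a hypothetical predict(s, a) -> (mean, std) interface."""
    idx = rng.integers(len(belief_models), size=N)
    return [belief_models[i].predict(s, a) for i in idx]

def game_transition(candidates, chosen_index):
    """Game transition of Eq. (4): the chosen candidate is itself a distribution
    over next states; the game evolves by sampling it (Gaussian candidates assumed)."""
    mean, std = candidates[chosen_index]
    return mean + std * rng.standard_normal(mean.shape)
```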

For the above AMG, we consider a specific policy (explained below) for the secondary player, such that the cumulative discounted reward of the primary player with policy $\pi$ can be written as:

J(\pi):=\mathop{\mathbb{E}}_{\rho_{0},\pi,\mathbb{P}_{T}^{N}}\lfloor\min\rfloor_{\tau_{0}\in\mathcal{T}_{0}}^{k}\left[\mathop{\mathbb{E}}_{\tau_{0},\pi,\mathbb{P}_{T}^{N}}\lfloor\min\rfloor_{\tau_{1}\in\mathcal{T}_{1}}^{k}\cdots\left[\mathop{\mathbb{E}}_{\tau_{\infty},\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t},a_{t})\right]\right]\right], \qquad (5)

where the subscripts of $\tau$ and $\mathcal{T}$ denote the time step, the expectation is over $s_{0}\sim\rho_{0}$, $s_{t>0}\sim\tau_{t-1}(\cdot|s_{t-1},a_{t-1})$, $a_{t}\sim\pi(\cdot|s_{t})$ and $\mathcal{T}_{t}\sim\mathbb{P}_{T}^{N}$, and the operator $\lfloor\min\rfloor_{x\in\mathcal{X}}^{k}f(x)$ denotes finding the $k$th minimum of $f(x)$ over $x\in\mathcal{X}$. The policy of the secondary player is implicitly defined by the operator $\lfloor\min\rfloor_{x\in\mathcal{X}}^{k}f(x)$. When varying $k\in\{1,2,\cdots,N\}$, the secondary player exhibits various degrees of adversarial or aggressive disturbance to the future reward. From the view of the original MDP, this induces a flexible tendency ranging from pessimism to optimism when evaluating the policy $\pi$.

The distinctions between the introduced AMG and the robust MDP are twofold: 1) With a belief distribution over transitions, robust MDP selects only part of its support into the uncertainty set, and the set elements are treated indiscriminately. This means both the possibility of transitions outside the uncertainty set and the relative likelihood of transitions within the uncertainty set are discarded. In the AMG, however, the candidate set simply contains samples drawn from the belief distribution, implying no information drop in an average sense. Intuitively, by keeping richer knowledge of the system, the performance evaluation is more exact and away from excessive conservatism; 2) In robust MDP, the level of conservatism is expected to be controlled by its hyperparameter $\delta$. However, as illustrated in Section 2, a smaller $\delta$ does not necessarily correspond to a more conservative performance evaluation, due to the extra impact of uncertainty set construction. In contrast, for the AMG, the degree of conservatism is adjusted by the candidate set size $N$ and the order of minimum $k$. When changing the values of $k$ or $N$, the impact on performance evaluation is unambiguous, as formalized in Theorem 3.

To evaluate $J(\pi)$, we define the following Bellman backup operator:

\mathcal{B}^{\pi}_{N,k}Q(s,a)=r(s,a)+\gamma\mathbb{E}_{\mathbb{P}_{T}^{N}}\Big[\lfloor\min\rfloor_{\tau\in\mathcal{T}}^{k}\mathbb{E}_{\tau,\pi}\left[Q(s^{\prime},a^{\prime})\right]\Big]. \qquad (6)

As the operator depends on $N$, $k$ and we emphasize pessimism in offline RL, we call it the $(N,k)$-pessimistic Bellman backup operator. Compared to the standard Bellman backup operator in Q-learning, $\mathcal{B}^{\pi}_{N,k}$ additionally includes the expectation over $\mathcal{T}\sim\mathbb{P}_{T}^{N}$ and the $k$-minimum operator over $\mathcal{T}$. Despite these differences, we prove that $\mathcal{B}^{\pi}_{N,k}$ is still a contraction mapping, based on which $J(\pi)$ can be easily evaluated.

Theorem 1 (Policy Evaluation).

The $(N,k)$-pessimistic Bellman backup operator $\mathcal{B}^{\pi}_{N,k}$ is a contraction mapping. By starting from any function $Q:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$ and repeatedly applying $\mathcal{B}^{\pi}_{N,k}$, the sequence converges to $Q^{\pi}_{N,k}$, with which we have $J(\pi)=\mathbb{E}_{\rho_{0},\pi}\big[Q^{\pi}_{N,k}(s_{0},a_{0})\big]$.
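As a concrete illustration of the operator in (6) and the fixed-point iteration of Theorem 1, the following sketch performs policy evaluation in a small tabular MDP. The dynamics belief is assumed to be a finite distribution over candidate transition matrices, and the outer expectation over candidate sets is approximated by Monte Carlo; both are illustrative assumptions, not requirements of the method.

```python
import numpy as np

def pessimistic_backup(Q, r, belief_T, belief_p, pi, gamma, N, k, n_sets=64, rng=None):
    """One application of the (N,k)-pessimistic Bellman backup (Eq. 6), tabular case.
    belief_T: (M, S, A, S) array of M candidate transition matrices,
    belief_p: (M,) belief probabilities, pi: (S, A) policy, Q: (S, A) value table.
    The expectation over T ~ P_T^N is approximated with n_sets sampled candidate sets."""
    rng = rng or np.random.default_rng(0)
    M, S, A, _ = belief_T.shape
    V = (pi * Q).sum(axis=1)                    # V(s') = E_{a'~pi}[Q(s', a')]
    EV = np.einsum('msat,t->msa', belief_T, V)  # E_{s'~tau}[V(s')] per candidate and (s, a)
    new_Q = np.empty_like(Q)
    for s in range(S):
        for a in range(A):
            sets = rng.choice(M, size=(n_sets, N), p=belief_p)      # sampled candidate sets
            kth_min = np.sort(EV[sets, s, a], axis=1)[:, k - 1]     # k-th smallest per set
            new_Q[s, a] = r[s, a] + gamma * kth_min.mean()
    return new_Q

# Repeated application converges to Q^pi_{N,k} (Theorem 1), e.g.:
# for _ in range(300):
#     Q = pessimistic_backup(Q, r, belief_T, belief_p, pi, gamma=0.95, N=10, k=2)
```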

3.2 Pessimism-Modulated Dynamics Belief

With the converged Q-value, we are ready to establish a more direct connection between the AMG and the original MDP. The connection appears as the answer to a natural question: the calculation of (6) involves biased samples from the dynamics belief distribution; can we treat these samples as unbiased samples from another belief distribution? We give a positive answer in the following theorem.

Theorem 2 (Equivalent MDP with Pessimism-Modulated Dynamics Belief).

The alternating Markov game in (5) is equivalent to the MDP with tuple $(\mathcal{S},\mathcal{A},\widetilde{T},r,\rho_{0},\gamma)$, where the transition probability $\widetilde{T}(s^{\prime}|s,a)=\mathbb{E}_{\widetilde{\mathbb{P}}_{T}^{sa}}\left[\tau^{sa}(s^{\prime})\right]$ is defined with the reweighted belief distribution $\widetilde{\mathbb{P}}_{T}^{sa}$:

\widetilde{\mathbb{P}}_{T}^{sa}(\tau^{sa})\propto w\Big(\mathbb{E}_{\tau^{sa},\pi}\big[Q^{\pi}_{N,k}(s^{\prime},a^{\prime})\big];k,N\Big)\mathbb{P}_{T}^{sa}(\tau^{sa}), \qquad (7)
w(x;k,N)=\big[F(x)\big]^{k-1}\big[1-F(x)\big]^{N-k}, \qquad (8)

and $F(\cdot)$ is the cumulative distribution function. Furthermore, the value of $w(x;k,N)$ first increases and then decreases with $x$, and its maximum is attained at the $\frac{k-1}{N-1}$ quantile, i.e., $x^{*}=F^{-1}\left(\frac{k-1}{N-1}\right)$.

On the right-hand side of (7), $\tau^{sa}$ itself is random following the belief distribution, thus $\mathbb{E}_{\tau^{sa},\pi}\big[Q^{\pi}_{N,k}(s^{\prime},a^{\prime})\big]$, as a functional of $\tau^{sa}$, is also a random variable, whose cumulative distribution function is determined by the belief distribution $\mathbb{P}_{T}^{sa}$. Intuitively, we can treat $\mathbb{E}_{\tau^{sa},\pi}\big[Q^{\pi}_{N,k}(s^{\prime},a^{\prime})\big]$ as a pessimism indicator for the transition $\tau^{sa}$, with a larger value indicating less pessimism.

From Theorem 2, the maximum of $w$ is attained at $\tau^{*}:F\left(\mathbb{E}_{\tau^{*},\pi}\big[Q^{\pi}_{N,k}(s^{\prime},a^{\prime})\big]\right)=\frac{k-1}{N-1}$, i.e., the transition with $\frac{k-1}{N-1}$-quantile pessimism indicator. Besides, as $\mathbb{E}_{\tau^{sa},\pi}\big[Q^{\pi}_{N,k}(s^{\prime},a^{\prime})\big]$ departs from the $\frac{k-1}{N-1}$ quantile, the reweighting coefficient for its $\tau^{sa}$ decreases. Considering the effect of $w$ on $\widetilde{\mathbb{P}}_{T}^{sa}$ and the equivalence between the AMG and the refined MDP, we can say that $J(\pi)$ is a soft percentile performance. Compared to the standard percentile criteria, $J(\pi)$ is derived by reshaping the belief distribution towards concentrating around a certain percentile, rather than fixating on a single percentile point. Due to this feature, we term $\widetilde{\mathbb{P}}_{T}^{sa}$ the Pessimism-Modulated Dynamics Belief (PMDB).
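The reweighting factor in (8) is simple to compute and inspect numerically. The sketch below evaluates $w(x;k,N)$ under a standard normal distribution for the pessimism indicator (an arbitrary choice for illustration; in the method the distribution is induced by the dynamics belief) and checks that the weight peaks at the $\frac{k-1}{N-1}$ quantile, as stated in Theorem 2.

```python
import numpy as np
from scipy.stats import norm

def pmdb_weight(x, k, N, F=norm.cdf):
    """Reweighting factor w(x; k, N) = F(x)^{k-1} (1 - F(x))^{N-k} of Eq. (8).
    x is the pessimism indicator E_{tau,pi}[Q(s', a')]; F is its CDF under the
    dynamics belief (a standard normal CDF is assumed here only for illustration)."""
    Fx = F(x)
    return Fx ** (k - 1) * (1.0 - Fx) ** (N - k)

N, k = 10, 2
xs = np.linspace(-4.0, 4.0, 4001)
x_peak = xs[np.argmax(pmdb_weight(xs, k, N))]
print(x_peak, norm.ppf((k - 1) / (N - 1)))   # both close to the 1/9-quantile of the indicator
```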

Lastly, recall that all the above derivations involve the hyperparameters $k$ and $N$; we present the monotonicity of $Q^{\pi}_{N,k}$ over them in Theorem 3. Furthermore, by combining Theorem 1 with Theorem 3, we conclude that $J(\pi)$ decreases with $N$ and increases with $k$.

Theorem 3 (Monotonicity).

The converged Q-function $Q^{\pi}_{N,k}$ has the following properties:

  • Given any $k$, the Q-function $Q^{\pi}_{N,k}$ element-wise decreases with $N\in\{k,k+1,\cdots\}$.

  • Given any $N$, the Q-function $Q^{\pi}_{N,k}$ element-wise increases with $k\in\{1,2,\cdots,N\}$.

  • The Q-function $Q^{\pi}_{N,N}$ element-wise increases with $N$.

Figure 1: Monotonicity of Q-values. The arrows indicate the directions along which Q-values increase.

Remark 1 (Special Cases). For $N=k=1$, we have $\widetilde{\mathbb{P}}^{sa}_{T}=\mathbb{P}_{T}^{sa}$. Then, the performance is evaluated through sampling the initial belief distribution. This resembles the common methodology in model-based RL (MBRL), with the dynamics belief defined by the uniform distribution over dynamics model ensembles. For $k=\delta(N-1)+1$ and $N\rightarrow\infty$, $\widetilde{\mathbb{P}}_{T}^{sa}$ asymptotically collapses to a delta function. Then, $J(\pi)$ degrades to fixating on a single transition instance. It is equivalent to the robust MDP with the uncertainty set constructed as $\Big\{\tau^{sa}:\mathbb{P}_{T}^{sa}(\tau^{sa})>0,\mathbb{E}_{\tau^{sa},\pi}\big[Q^{\pi}_{N,k}(s^{\prime},a^{\prime})\big]\geq F^{-1}(\delta)\Big\}$. In this sense, the AMG is a successive interpolation between MBRL and robust MDP.

4 Policy Optimization with Pessimism-Modulated Dynamics Belief

In this section, we optimize the policy by maximizing $J(\pi)$. The major consideration is that the methodology should adapt well to both discrete and continuous action spaces. In the continuous setting of an MDP, the policy can be updated by following the stochastic/deterministic policy gradient SPG ; DPG . However, for the AMG, evaluating $J(\pi)$ itself involves an inner dynamic programming procedure as in Theorem 1. As each evaluation of $J(\pi)$ can only produce one exact gradient, it is inefficient to maximize $J(\pi)$ via gradient-based methods. Instead, we consider a series of sub-problems with Kullback–Leibler (KL) regularization. Solving each sub-problem makes a prominent update to the policy, and the sequence of solutions to the sub-problems monotonically improves $J(\pi)$. Based on this idea, we further derive an offline RL algorithm to approximately find the solution.

4.1 Iterative Regularized Policy Optimization

Define the KL-regularized return for the AMG by

\bar{J}(\pi;\mu):=\mathop{\mathbb{E}}_{\rho_{0},\pi,\mathbb{P}_{T}^{N}}\lfloor\min\rfloor_{\tau_{0}\in\mathcal{T}_{0}}^{k}\left[\mathop{\mathbb{E}}_{\tau_{0},\pi,\mathbb{P}_{T}^{N}}\lfloor\min\rfloor_{\tau_{1}\in\mathcal{T}_{1}}^{k}\cdots\left[\mathop{\mathbb{E}}_{\tau_{\infty},\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}\Big(r(s_{t},a_{t})-\alpha D_{\mathrm{KL}}\big(\pi(\cdot|s_{t})\,\|\,\mu(\cdot|s_{t})\big)\Big)\right]\right]\right], \qquad (9)

where $\alpha\geq 0$ is the strength of regularization, and $\mu$ is a reference policy to keep close to.

KL-regularized MDPs have been considered in previous works to enhance exploration, improve robustness to noise, or insert expert knowledge SQL ; SAC ; RegularizedMDP ; ISKL ; Prior . Here, the idea is to constrain the optimized policy to the neighborhood of a reference policy so that the inner problem is adequately evaluated over this small policy region.

To optimize $\bar{J}(\pi;\mu)$, we introduce the soft $(N,k)$-pessimistic Bellman backup operator:

\bar{\mathcal{B}}^{*}_{N,k}Q(s,a)=r(s,a)+\gamma\mathbb{E}_{\mathbb{P}_{T}^{N}}\left[\lfloor\min\rfloor_{\tau\in\mathcal{T}}^{k}\mathbb{E}_{\tau}\left[\alpha\log\mathbb{E}_{\mu}\exp{\left(\frac{1}{\alpha}Q(s^{\prime},a^{\prime})\right)}\right]\right]. \qquad (10)
Theorem 4 (Regularized Policy Optimization).

The soft $(N,k)$-pessimistic Bellman backup operator $\bar{\mathcal{B}}^{*}_{N,k}$ is a contraction mapping. By starting from any function $Q:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$ and repeatedly applying $\bar{\mathcal{B}}^{*}_{N,k}$, the sequence converges to $\bar{Q}^{*}_{N,k}$, with which the optimal policy for $\bar{J}(\pi;\mu)$ is obtained as $\bar{\pi}^{*}(a|s)\propto\mu(a|s)\exp{\left(\frac{1}{\alpha}\bar{Q}^{*}_{N,k}(s,a)\right)}$.
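To make the soft backup and the resulting policy form tangible, the sketch below computes the inner soft value $\alpha\log\mathbb{E}_{\mu}\exp(Q/\alpha)$ with the log-sum-exp trick, forms the target of (10) for one sampled candidate set, and recovers $\bar{\pi}^{*}\propto\mu\exp(Q/\alpha)$ from Theorem 4. A discrete action space is assumed purely to keep the expectations exact; the continuous case replaces them with sampled-action estimates.

```python
import numpy as np

def soft_value(q_next, mu_next, alpha):
    """alpha * log E_{a'~mu}[exp(Q(s', a')/alpha)], via log-sum-exp for stability."""
    z = q_next / alpha
    m = z.max()
    return alpha * (m + np.log(np.sum(mu_next * np.exp(z - m))))

def soft_pessimistic_target(r_sa, candidate_values, k, gamma):
    """Target of the soft (N,k)-pessimistic backup (Eq. 10) for one (s, a):
    candidate_values[i] = E_{s'~tau_i}[soft_value(s')] for the i-th of the N
    sampled candidates; the secondary player takes the k-th minimum."""
    return r_sa + gamma * np.sort(np.asarray(candidate_values))[k - 1]

def regularized_optimal_policy(q_s, mu_s, alpha):
    """Theorem 4: pi_bar*(a|s) is proportional to mu(a|s) exp(Q_bar*(s, a)/alpha)."""
    logits = np.log(mu_s + 1e-12) + q_s / alpha
    p = np.exp(logits - logits.max())
    return p / p.sum()
```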

Apparently, the solved policy $\bar{\pi}^{*}$ depends on the reference policy $\mu$, and setting $\mu$ arbitrarily beforehand can result in a suboptimal policy for $J(\pi)$. In fact, we can construct a sequence of sub-problems, with $\mu$ chosen as an improved policy from the last sub-problem. By successively solving them, the impact of the initial reference policy is gradually eliminated.

Theorem 5 (Iterative Regularized Policy Optimization).

By starting from any stochastic policy $\pi_{0}:s\rightarrow\Delta(\mathcal{A})$ and repeatedly finding $\pi_{i+1}:\bar{J}(\pi_{i+1};\pi_{i})>\bar{J}(\pi_{i};\pi_{i})$, the sequence $\{\pi_{i}\}$ monotonically improves regarding $J(\pi)$, i.e., $J(\pi_{i+1})\geq J(\pi_{i})$. Especially, when $\pi_{0}(a|s)>0,\forall s,a$ and $\{\pi_{i}\}$ are obtained via regularized policy optimization in Theorem 4, we have $\frac{\pi_{i}(a|s)}{\pi_{i}(a^{\prime}|s)}\rightarrow\infty$ for any $s,a,a^{\prime}$ such that $\lim_{i\rightarrow\infty}Q_{N,k}^{\pi_{i}}(s,a)>\lim_{i\rightarrow\infty}Q_{N,k}^{\pi_{i}}(s,a^{\prime})$.

Ideally, by combining Theorems 4 and 5, the policy for $J(\pi)$ can be continuously improved by infinitely applying the soft pessimistic Bellman backup operator for each of the sequential sub-problems.

Remark 2 (Iterative Regularized Policy Optimization as Expectation–Maximization with Structured Variational Posterior). According to Theorem 2, PMDB can be recovered with the converged Q-function $Q^{\pi^{*}}_{N,k}$. From an end-to-end view, we have an initial dynamics belief $\mathbb{P}_{T}^{sa}$, and via the calculation based on the belief samples and the reward function, we obtain the updated dynamics belief $\widetilde{\mathbb{P}}_{T}^{sa}$. This suggests that we are doing some form of posterior inference, where the evidence comes from the reward function. In fact, the iterative regularized policy optimization can be formally recast as an Expectation-Maximization algorithm for offline policy optimization, where the Expectation step corresponds to a structured variational inference procedure for the dynamics. We elaborate on this in Appendix C.

4.2 Offline Reinforcement Learning with Pessimism-Modulated Dynamics Belief

While solving each sub-problem makes a prominent update to the policy compared with the policy gradient method, we may need to construct several sub-problems before convergence, and exactly solving each of them incurs unnecessary computation. For practical purposes, we next introduce a smooth-evolving reference policy, with which the explicit boundary between sub-problems is blurred. Based on this reference policy, and by further adopting function approximators, we devise an offline RL algorithm to approximately maximize $J(\pi)$.

The idea of the smooth-evolving reference policy is inspired by the softly updated target network in the deep RL literature DQN ; DDPG . That is, we set the reference policy as a slowly tracked copy of the policy being optimized. Formally, consider a parameterized policy $\pi_{\phi}$ with parameter $\phi$. The reference policy is set as $\mu=\pi_{\phi^{\prime}}$, where $\phi^{\prime}$ is the moving average of $\phi$: $\phi^{\prime}\leftarrow\omega_{1}\phi+(1-\omega_{1})\phi^{\prime}$. With small enough $\omega_{1}$, the Q-value of the state-action pairs induced by $\pi_{\phi^{\prime}}$ (or its slight variant) can be sufficiently evaluated before being used to update the policy. Next, we detail the loss functions to learn the Q-value and the policy with neural network approximators.
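The moving-average updates for both the reference policy and the target Q-network can be implemented with a single helper, sketched below in PyTorch; the concrete $\omega$ values in the comments are illustrative, not taken from the paper.

```python
import torch

@torch.no_grad()
def soft_update(target_net, source_net, omega):
    """Moving-average tracking: phi' <- omega*phi + (1-omega)*phi' (reference policy),
    and theta' <- omega*theta + (1-omega)*theta' (target Q-network)."""
    for p_target, p_source in zip(target_net.parameters(), source_net.parameters()):
        p_target.mul_(1.0 - omega).add_(omega * p_source)

# after each gradient step, e.g.:
# soft_update(reference_policy, policy, omega=5e-3)   # omega_1 (illustrative value)
# soft_update(target_q, q_net, omega=5e-3)            # omega_2 (illustrative value)
```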

Denote the parameterized Q-function by $Q_{\theta}$ with parameter $\theta$. It is trained by minimizing the Bellman residual on both the AMG and the empirical MDP:

L_{Q}(\theta)=\mathbb{E}_{(s,a,\mathcal{T})\sim\mathcal{D}^{\prime}}\left[\left(Q_{\theta}(s,a)-\widehat{Q}_{\text{AMG}}(s,a)\right)^{2}\right]+\mathbb{E}_{(s,a,s^{\prime})\sim\mathcal{D}}\left[\left(Q_{\theta}(s,a)-\widehat{Q}_{\text{MDP}}(s,a)\right)^{2}\right], \qquad (11)

with

\widehat{Q}_{\text{AMG}}(s,a)=r(s,a)+\gamma\lfloor\min\rfloor_{\tau\in\mathcal{T}}^{k}\mathbb{E}_{\tau}\left[\alpha\log\mathbb{E}_{\pi_{\phi^{\prime}}}\exp{\left(\frac{1}{\alpha}Q_{\theta^{\prime}}(s^{\prime},a^{\prime})\right)}\right], \qquad (12)
\widehat{Q}_{\text{MDP}}(s,a)=r(s,a)+\gamma\cdot\alpha\log\mathbb{E}_{\pi_{\phi^{\prime}}}\exp{\left(\frac{1}{\alpha}Q_{\theta^{\prime}}(s^{\prime},a^{\prime})\right)}, \qquad (13)

where $Q_{\theta^{\prime}}$ represents the target Q-value softly updated for stability DDPG , i.e., $\theta^{\prime}\leftarrow\omega_{2}\theta+(1-\omega_{2})\theta^{\prime}$, and $\mathcal{D}^{\prime}$ is the on-policy data buffer for the AMG. Since the game transition is known, the game can be executed with multiple instances in parallel, and the buffer only collects the latest sample from each of them. To promote direct learning from $\mathcal{D}$, we also include the Bellman residual of the empirical MDP in (11).
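The sketch below shows one way to form the targets of (12)-(13) and the loss of (11) in PyTorch. Several pieces are assumptions made for illustration: the `sample` and call interfaces of the policy and Q-networks are hypothetical, the expectation over next actions is estimated with a handful of samples from $\pi_{\phi^{\prime}}$, and the inner expectation $\mathbb{E}_{\tau}$ in (12) is approximated with a single next-state sample per candidate.

```python
import torch

def soft_state_value(q_target, policy_ref, s_next, alpha, n_actions=10):
    """Estimate alpha * log E_{a'~pi_phi'}[exp(Q_theta'(s', a')/alpha)] by sampling actions."""
    a = policy_ref.sample(s_next, n_actions)                         # hypothetical: (B, n, act_dim)
    s_rep = s_next.unsqueeze(1).expand(-1, n_actions, -1)
    q = q_target(s_rep, a) / alpha                                   # hypothetical: (B, n)
    return alpha * (torch.logsumexp(q, dim=1) - torch.log(torch.tensor(float(n_actions))))

def q_loss(q_net, q_target, policy_ref, amg_batch, mdp_batch, alpha, gamma, k):
    """Bellman residual on AMG rollouts (Eq. 12) plus offline data (Eq. 13), as in Eq. (11)."""
    s, a, r, s_next_cands = amg_batch                                # s_next_cands: (B, N, state_dim)
    with torch.no_grad():
        vals = torch.stack([soft_state_value(q_target, policy_ref, s_next_cands[:, i], alpha)
                            for i in range(s_next_cands.shape[1])], dim=1)       # (B, N)
        target_amg = r + gamma * torch.kthvalue(vals, k, dim=1).values           # k-th minimum
    loss_amg = ((q_net(s, a) - target_amg) ** 2).mean()

    s, a, r, s_next = mdp_batch                                      # logged transition from D
    with torch.no_grad():
        target_mdp = r + gamma * soft_state_value(q_target, policy_ref, s_next, alpha)
    loss_mdp = ((q_net(s, a) - target_mdp) ** 2).mean()
    return loss_amg + loss_mdp
```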

As for the policy update, Theorem 4 states that the optimal policy for $\bar{J}(\pi;\mu)$ is proportional to $\mu(a|s)\exp{\left(\frac{1}{\alpha}\bar{Q}^{*}_{N,k}(s,a)\right)}$. Then, we update $\pi_{\phi}$ by supervisedly learning this policy, with $\mu$ and $\bar{Q}^{*}_{N,k}$ replaced by the smooth-evolving reference policy and the learned Q-value:

L_{P}(\phi)=\mathbb{E}_{s\sim\mathcal{D}\cup\mathcal{D}^{\prime}}\left[D_{\mathrm{KL}}\left(\frac{\pi_{\phi^{\prime}}(\cdot|s)\exp\left(\frac{1}{\alpha}Q_{\theta}(s,\cdot)\right)}{\mathbb{E}_{\pi_{\phi^{\prime}}}\left[\exp\left(\frac{1}{\alpha}Q_{\theta}(s,a)\right)\right]}\;\middle\|\;\pi_{\phi}(\cdot|s)\right)\right]=A\cdot\mathbb{E}_{s\sim\mathcal{D}\cup\mathcal{D}^{\prime},\,a\sim\pi_{\phi^{\prime}}}\left[\exp\left(\frac{1}{\alpha}Q_{\theta}(s,a)\right)\log\pi_{\phi}(a|s)\right]+B, \qquad (14)

where $A$ and $B$ are constant terms. In general, (14) can be replaced by any tractable function that measures the similarity of distributions. For example, when $\pi_{\phi}$ is Gaussian, we can apply the recently proposed $\beta$-NLL Pitfall , in which each data point's contribution to the negative log-likelihood loss is weighted by the $\beta$-exponentiated variance to improve the learning of heteroscedastic behavior.
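A weighted maximum-likelihood implementation of (14) is sketched below. The action-sampling and log-probability interfaces of the policy networks are hypothetical, and self-normalizing the $\exp(Q/\alpha)$ weights per state is an implementation choice for numerical stability, not something prescribed by the paper.

```python
import torch

def policy_loss(policy, policy_ref, q_net, states, alpha, n_actions=10):
    """Project pi_phi'(a|s) exp(Q_theta(s,a)/alpha) onto the policy class (Eq. 14)
    by weighted maximum likelihood over actions sampled from the reference policy."""
    with torch.no_grad():
        a = policy_ref.sample(states, n_actions)                     # hypothetical: (B, n, act_dim)
        s_rep = states.unsqueeze(1).expand(-1, n_actions, -1)
        w = torch.softmax(q_net(s_rep, a) / alpha, dim=1)            # weights ~ exp(Q/alpha), per state
    log_prob = policy.log_prob(s_rep, a)                             # hypothetical: (B, n)
    return -(w * log_prob).sum(dim=1).mean()
```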

To summarize, the algorithm alternates between collecting on-policy data samples in the AMG and updating the function approximators. In detail, the latter procedure includes updating the Q-value with (11), updating the optimized policy with (14), and updating the target Q-value as well as the reference policy with the moving-average rule. The complete algorithm is listed in Appendix D.

5 Experiments

Through the experiments, we aim to answer the following questions: 1) How does the proposed approach compare to previous SoTA offline RL algorithms on the standard benchmark? 2) How does the learning process in the AMG connect with the performance change in the original MDP? 3) Section 3 presents the monotonicity of $J(\pi)$ over $N$ and $k$ for any specified $\pi$, and it is easy to verify that this statement also holds when considering the optimal policy for each setting of $(N,k)$. However, with neural network approximators, our proposed offline RL algorithm only approximately solves the AMG. How well is the monotonicity preserved in this case?

We consider the Gym domains in the D4RL benchmark D4RL to answer these questions. As PMDB relies on the initial dynamics belief, inserting additional knowledge into the initial dynamics belief would result in unfair comparison. To avoid that, we consider a uniform distribution over dynamics model ensembles as the initial belief. The dynamics model ensembles are trained in a supervised manner with the offline dataset. This is similar to previous model-based works MOPO ; MOREL , where dynamics model ensembles are used for dynamics uncertainty quantification. Since hyperparameter tuning for offline RL algorithms requires extra online tests for each task, we purposely keep the same hyperparameters for all tasks, except when answering the last question. In particular, the hyperparameters of the sampling procedure are $N=10$ and $k=2$. The more detailed setup for experiments and hyperparameters can be found in Appendix G. The code is released at https://github.com/huawei-noah/HEBO/tree/master/PMDB and https://gitee.com/mindspore/models/tree/master/research/rl/pmdb.

5.1 Performance Comparison

We compare the proposed offline RL algorithm with the following baselines: BEAR BEAR and BRAC BRAC , model-free approaches based on policy constraints; CQL CQL , a model-free approach that penalizes the Q-value; EDAC EDAC , the previous SoTA on the D4RL benchmark; MOReL MOREL , a model-based approach which terminates the trajectory if the dynamics uncertainty exceeds a certain degree; and BC, the behavior cloning method. These approaches are evaluated on a total of eighteen tasks involving three environments (hopper, walker2d, halfcheetah) and six dataset types (random, medium, expert, medium-expert, medium-replay, full-replay) per environment. We use the v2 version of each dataset.

The results are summarized in Table 1. Our approach PMDB clearly improves over the previous SoTA on 9 tasks and performs competitively in the rest. Although EDAC achieves better performance in walker2d with several dataset types, its hyperparameters are tuned individually for each task. The later experiments on the impact of hyperparameters indicate that a larger $k$ or smaller $N$ could generate better results for walker2d and halfcheetah. We also find that PMDB significantly outperforms MOReL, another model-based approach. It is encouraging that our model-based approach achieves competitive or better performance compared with the SoTA model-free approach, as model-based approaches naturally have better support for multi-task learning and transfer learning, where offline data from relevant tasks can be further leveraged.

Table 1: Results for D4RL datasets. Each result is the normalized score computed as (score - random policy score) / (expert policy score - random policy score), ± standard deviation. The score of our proposed approach is averaged over 4 random seeds, and the results of the baselines are taken from EDAC .
Task Name BC BEAR BRAC CQL MOReL EDAC PMDB
hopper-random 3.7±0.6 3.6±3.6 8.1±0.6 5.3±0.6 38.1±10.1 25.3±10.4 32.7±0.1
hopper-medium 54.1±3.8 55.3±3.2 77.8±6.1 61.9±6.4 84.0±17.0 101.6±0.6 106.8±0.2
hopper-expert 107.7±9.7 39.4±20.5 78.1±52.6 106.5±9.1 80.4±34.9 110.1±0.1 111.7±0.3
hopper-medium-expert 53.9±4.7 66.2±8.5 81.3±8.0 96.9±15.1 105.6±8.2 110.7±0.1 111.8±0.6
hopper-medium-replay 16.6±4.8 57.7±16.5 62.7±30.4 86.3±7.3 81.8±17.0 101.0±0.5 106.2±0.6
hopper-full-replay 19.9±12.9 54.0±24.0 107.4±0.5 101.9±0.6 94.4±20.5 105.4±0.7 109.1±0.2
walker2d-random 1.3±0.1 4.3±1.2 1.3±1.4 5.4±1.7 16.0±7.7 16.6±7.0 21.8±0.1
walker2d-medium 70.9±11.0 59.8±40.0 59.7±39.9 79.5±3.2 72.8±11.9 92.5±0.8 94.2±1.1
walker2d-expert 108.7±0.2 110.1±0.6 55.2±62.2 109.3±0.1 62.6±29.9 115.1±1.9 115.9±1.9
walker2d-medium-expert 90.1±13.2 107.0±2.9 9.3±18.9 109.1±0.2 107.5±5.6 114.7±0.9 111.9±0.2
walker2d-medium-replay 20.3±9.8 12.2±4.7 40.1±47.9 76.8±10.0 40.8±20.4 87.1±2.3 79.9±0.2
walker2d-full-replay 68.8±17.7 79.6±15.6 96.9±2.2 94.2±1.9 84.8±13.1 99.8±0.7 95.4±0.7
halfcheetah-random 2.2±0.0 12.6±1.0 24.3±0.7 31.3±3.5 38.9±1.8 28.4±1.0 37.8±0.2
halfcheetah-medium 43.2±0.6 42.8±0.1 51.9±0.3 46.9±0.4 60.7±4.4 65.9±0.6 75.6±1.3
halfcheetah-expert 91.8±1.5 92.6±0.6 39.0±13.8 97.3±1.1 8.4±11.8 106.8±3.4 105.7±1.0
halfcheetah-medium-expert 44.0±1.6 45.7±4.2 52.3±0.1 95.0±1.4 80.4±11.7 106.3±1.9 108.5±0.5
halfcheetah-medium-replay 37.6±2.1 39.4±0.8 48.6±0.4 45.3±0.3 44.5±5.6 61.3±1.9 71.7±1.1
halfcheetah-full-replay 62.9±0.8 60.1±3.2 78.0±0.7 76.9±0.9 70.1±5.1 84.6±0.9 90.0±0.8
Average 49.9 52.4 54.0 73.7 65.1 85.2 88.2

5.2 Learning in Alternating Markov Game

Figure 2 presents the learning curves in the AMG, as well as the return received when deploying the policy being learned in the true MDP. The performance in the AMG closely tracks the true performance from the lower side, implying that it can act as a reasonable surrogate to evaluate/optimize performance for the true MDP. Besides, the performance in the AMG improves nearly monotonically, verifying the effectiveness of the proposed algorithm for approximately solving the game.

Figure 2: Learning and test curves for medium datasets.

Recall that PMDB does not explicitly quantify dynamics uncertainty to penalize the return; Figure 3 instead examines how the dynamics uncertainty and the Q-value of visited state-action pairs change during the learning process. The uncertainty is measured by the logarithm of the standard deviation of the predicted means from the $N$ dynamics samples, i.e., $\log\left(\text{std}\left(\mathbb{E}_{\tau}[s^{\prime}];\tau\in\mathcal{T}\right)\right)$. The policy being learned is periodically tested in the AMG for ten trials, and we collect the whole ten trajectories of state-action pairs. The solid curves in Figure 3 denote the mean uncertainty and Q-value over the collected pairs, and the shaded regions denote the standard deviation. From the results, the dynamics uncertainty first sharply decreases and then keeps a slowly increasing trend. Besides, in the long-term view, the Q-value is correlated with the degree of uncertainty negatively in the first phase and positively in the second phase. This indicates that the policy first moves to the in-distribution region and then tries to move away by resorting to the generalization of the dynamics model.

Figure 3: Change of the dynamics uncertainty and Q-value of the encountered state-action pairs during the learning process. The dynamics uncertainty of a state-action pair $(s,a)$ is measured by $\log\left(\text{std}\left(\mathbb{E}_{\tau(\cdot|s,a)}[s^{\prime}];\tau\in\mathcal{T}\right)\right)$.
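For reference, the uncertainty measure used in Figure 3 can be computed from the dynamics samples as sketched below. The `predict_mean` interface of the belief samples and the averaging over state dimensions before taking the logarithm are assumptions made for illustration.

```python
import numpy as np

def dynamics_uncertainty(belief_samples, s, a):
    """log(std(E_tau[s'] ; tau in T)): log of the standard deviation of the predicted
    next-state means across the N dynamics samples in the candidate set."""
    means = np.stack([tau.predict_mean(s, a) for tau in belief_samples])   # (N, state_dim)
    return np.log(means.std(axis=0).mean() + 1e-12)   # dimension-averaged std (an assumption)
```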

In Figure 3, we also notice that the large dip in Q-value is accompanied by a sudden rise in dynamics uncertainty. We suspect this is due to the optimized policy moving far from the dataset. We verify this by checking the maximal return covered by the offline dataset. The maximal normalized returns provided by the offline datasets are 99.6, 92.0 and 45.0 for hopper, walker2d and halfcheetah respectively, while the proposed approach achieves 106.8, 94.2 and 75.6. The policy optimization is more significant for halfcheetah (where we observe the large dip), indicating that the policy has to move further from the dataset.

The above finding also explains why the AMG performance in Figure 2 runs into a large dip only for halfcheetah: with the larger dynamics uncertainty, the secondary player can choose a more pessimistic transition. However, we want to highlight that this is normal behavior of the proposed algorithm and does not imply instability, as we are handling the alternating Markov game, a specialization of the zero-sum game. Besides, we can see that even when the AMG performance goes down, the MDP performance is still stable.

5.3 Practical Impact of Hyperparameters in Sampling Procedure

Table 2 lists the impact of $k$. In each setting, we evaluate the learned policy in both the true MDP and the AMG. The performance in the AMG improves when increasing $k$. This is consistent with the theoretical result, even though we only approximately solve the game. Regarding the performance in the true MDPs, we notice that $k=2$ corresponds to the best performance for hopper, but for the others $k=3$ is better. This indicates that tuning the hyperparameter online can further improve performance. The impact of $N$ is presented in Appendix H, suggesting the opposite monotonicity.

hopper-medium walker2d-medium halfcheetah-medium
$k$ MDP AMG MDP AMG MDP AMG
1 106.2±0.2 91.6±2.2 82.6±0.5 33.3±2.6 70.7±0.8 63.1±0.2
2 106.8±0.2 105.2±1.6 94.2±1.1 77.2±3.7 75.6±1.3 67.3±1.1
3 90.8±17.5 106.6±2.1 105.1±0.2 82.5±0.5 77.3±0.5 70.1±0.2
Table 2: Impact of $k$, with $N=10$.

6 Related Works

Inadequate data coverage is the root of the challenge in offline RL. Existing works differ in how they react in the face of limited system knowledge.

Model-free offline RL

The prevalent idea is to find a policy within the data manifold through model-free learning. Analogous to online RL, both policy-based and value-based approaches have been devised to this end. Policy-based approaches directly constrain the optimized policy to stay close to the behavior policy that collected the data, via various measurements such as KL divergence BRAC , MMD BEAR and action deviation BCQ ; EMaQ . Value-based approaches instead reflect the policy regularization through the value function. For example, CQL enforces small Q-values for OOD state-action pairs CQL , AlgaeDICE penalizes the return with the $f$-divergence between the optimized and offline state-action distributions AlgaeDICE , and Fisher-BRC proposes a novel parameterization of the Q-value to encourage the generated policy to stay close to the data Fisher . Our proposed approach is more relevant to the value-based scope, and the key difference from existing works is that our Q-value is penalized through an adversarial choice of transition from plausible candidates.

Learning within the data manifold limits the degree to which the policy can improve, and recent works attempt to relieve this restriction. Along the model-free line, EDAC EDAC and PBRL PBRL quantify the uncertainty of the Q-value via neural network ensembles, and assign a penalty to the Q-value depending on the degree of uncertainty. In this way, OOD state-action pairs are touchable if they pose low uncertainty on the Q-value. However, the uncertainty quantification over the OOD region relies heavily on how the neural network generalizes BayesGenalization . As prior knowledge of the Q-function is hard to acquire and insert into the neural network, the generalization is unlikely to be reliable enough to facilitate meaningful uncertainty quantification SWAG .

Model-based offline RL

The model-based approach is widely recognized for its superior data efficiency. However, directly optimizing the policy based on an offline learned model is vulnerable to model exploitation MBPO ; Revisit . A line of works improves the dynamics learning for the sake of robustness PL or adaptation WME to distributional shift. In terms of policy learning, several works extend the idea from model-free approaches, and constrain the optimized policy close to the behavior policy when applying the dynamics model for planning MBOP or policy optimization BREMEN ; COMBO . There are also recent works incorporating uncertainty quantification of the dynamics model to learn a policy beyond the data coverage. In particular, MOPO MOPO and MOReL MOREL perform policy improvement in states that may not directly occur in the static offline dataset, but can be predicted by leveraging the power of generalization. Compared to them, our approach does not explicitly characterize the dynamics uncertainty as a reward penalty. There are also relevant works dealing with model ambiguity in light of Bayesian decision theory, which are discussed in Appendix A.

7 Discussion

We proposed model-based offline RL with Pessimism-Modulated Dynamics Belief (PMDB), a framework to reliably learn a policy from an offline dataset, with the ability to leverage dynamics prior knowledge. Empirically, the proposed approach outperforms the previous SoTA on a wide range of D4RL tasks. Compared to previous model-based approaches, we characterize the impact of dynamics uncertainty through biased sampling from the dynamics belief, which implicitly induces PMDB. As PMDB takes the form of reweighting an initial dynamics belief, it provides a principled way to insert prior knowledge via the belief to boost policy learning. However, posing a valuable dynamics belief for an arbitrary task is challenging, as expert knowledge is not always available. Besides, an over-aggressive belief may still incur high-risk behavior in reality. Encouragingly, recent works have actively researched learning data-driven priors from relevant tasks. We believe that integrating them, as well as developing safe criteria to design/learn the dynamics belief, would further promote the practical deployment of offline RL.

References

  • [1] Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In IEEE ICRA, 2017.
  • [2] B Ravi Kiran, Ibrahim Sobh, Victor Talpaert, Patrick Mannion, Ahmad A. Al Sallab, Senthil Yogamani, and Patrick Pérez. Deep reinforcement learning for autonomous driving: A survey. IEEE TITS, 2021.
  • [3] Chao Yu, Jiming Liu, Shamim Nemati, and Guosheng Yin. Reinforcement learning in healthcare: A survey. ACM Computing Surveys, 2021.
  • [4] Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In ICML, 2019.
  • [5] Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. Stabilizing off-policy Q-learning via bootstrapping error reduction. In NeurIPS, 2019.
  • [6] Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, and Thorsten Joachims. MOReL: Model-based offline reinforcement learning. In NeurIPS, 2020.
  • [7] Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361, 2019.
  • [8] Seyed Kamyar Seyed Ghasemipour, Dale Schuurmans, and Shixiang Shane Gu. EMaQ: Expected-max Q-learning operator for simple yet effective offline and online RL. In ICML, 2021.
  • [9] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning. In NeurIPS, 2020.
  • [10] Ilya Kostrikov, Rob Fergus, Jonathan Tompson, and Ofir Nachum. Offline reinforcement learning with fisher divergence critic regularization. In ICML, 2021.
  • [11] Ofir Nachum, Bo Dai, Ilya Kostrikov, Yinlam Chow, Lihong Li, and Dale Schuurmans. AlgaeDICE: Policy gradient from arbitrary experience. arXiv preprint arXiv:1912.02074, 2019.
  • [12] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In NeurIPS, 2017.
  • [13] Gaon An, Seungyong Moon, Jang-Hyun Kim, and Hyun Oh Song. Uncertainty-based offline reinforcement learning with diversified Q-ensemble. In NeurIPS, 2021.
  • [14] Chenjia Bai, Lingxiao Wang, Zhuoran Yang, Zhihong Deng, Animesh Garg, Peng Liu, and Zhaoran Wang. Pessimistic bootstrapping for uncertainty-driven offline reinforcement learning. arXiv preprint arXiv:2202.11566, 2022.
  • [15] Andrew G Wilson and Pavel Izmailov. Bayesian deep learning and a probabilistic perspective of generalization. In NeurIPS, 2020.
  • [16] Wesley J Maddox, Pavel Izmailov, Timur Garipov, Dmitry P Vetrov, and Andrew Gordon Wilson. A simple baseline for Bayesian uncertainty in deep learning. In NeurIPS, 2019.
  • [17] Andy Zeng, Shuran Song, Johnny Lee, Alberto Rodriguez, and Thomas Funkhouser. Tossingbot: Learning to throw arbitrary objects with residual physics. IEEE T-RO, 2020.
  • [18] Amy Zhang, Clare Lyle, Shagun Sodhani, Angelos Filos, Marta Kwiatkowska, Joelle Pineau, Yarin Gal, and Doina Precup. Invariant causal prediction for block MDPs. In ICML, 2020.
  • [19] Amy Zhang, Rowan McAllister, Roberto Calandra, Yarin Gal, and Sergey Levine. Learning invariant representations for reinforcement learning without reconstruction. In ICLR, 2021.
  • [20] Philip J Ball, Cong Lu, Jack Parker-Holder, and Stephen Roberts. Augmented world models facilitate zero-shot dynamics generalization from a single offline environment. In ICML, 2021.
  • [21] Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. MOPO: Model-based offline policy optimization. In NeurIPS, 2020.
  • [22] Cong Lu, Philip J. Ball, Jack Parker-Holder, Michael A. Osborne, and Stephen J. Roberts. Revisiting design choices in model-based offline reinforcement learning. In ICLR, 2022.
  • [23] Arnab Nilim and Laurent El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 2005.
  • [24] Wolfram Wiesemann, Daniel Kuhn, and Berç Rustem. Robust Markov decision processes. Mathematics of Operations Research, 2013.
  • [25] Elita A. Lobo, Mohammad Ghavamzadeh, and Marek Petrik. Soft-robust algorithms for batch reinforcement learning. arXiv preprint arXiv:2011.14495, 2020.
  • [26] Esther Derman, Daniel J. Mankowitz, Timothy A. Mann, and Shie Mannor. Soft-robust actor-critic policy-gradient. In UAI, 2018.
  • [27] Marek Petrik and Reazul Hasan Russel. Beyond confidence regions: Tight Bayesian ambiguity sets for robust MDPs. In NeurIPS, 2019.
  • [28] Bahram Behzadian, Reazul Hasan Russel, Marek Petrik, and Chin Pang Ho. Optimizing percentile criterion using robust MDPs. In AISTATS, 2021.
  • [29] Romain Laroche, Paul Trichelair, and Remi Tachet Des Combes. Safe policy improvement with baseline bootstrapping. In ICML, 2019.
  • [30] Philip Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. High-confidence off-policy evaluation. In AAAI, 2015.
  • [31] Chin Pang Ho, Marek Petrik, and Wolfram Wiesemann. Fast Bellman updates for robust MDPs. In ICML, 2018.
  • [32] Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In NeurIPS, 1999.
  • [33] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In ICML, 2014.
  • [34] Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. In ICML, 2017.
  • [35] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In ICML, 2018.
  • [36] Matthieu Geist, Bruno Scherrer, and Olivier Pietquin. A theory of regularized Markov decision processes. In ICML, 2019.
  • [37] Alexandre Galashov, Siddhant M. Jayakumar, Leonard Hasenclever, Dhruva Tirumala, Jonathan Schwarz, Guillaume Desjardins, Wojciech M. Czarnecki, Yee Whye Teh, Razvan Pascanu, and Nicolas Heess. Information asymmetry in KL-regularized RL. In ICLR, 2019.
  • [38] Noah Y. Siegel, Jost Tobias Springenberg, Felix Berkenkamp, Abbas Abdolmaleki, Michael Neunert, Thomas Lampe, Roland Hafner, Nicolas Heess, and Martin A. Riedmiller. Keep doing what worked: Behavioral modelling priors for offline reinforcement learning. In ICLR, 2020.
  • [39] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 2015.
  • [40] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In ICLR, 2016.
  • [41] Maximilian Seitzer, Arash Tavakoli, Dimitrije Antic, and Georg Martius. On the pitfalls of heteroscedastic uncertainty estimation with probabilistic neural networks. arXiv preprint arXiv:2203.09168, 2022.
  • [42] Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
  • [43] Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. In NeurIPS, 2019.
  • [44] Byung-Jun Lee, Jongmin Lee, and Kim Kee-Eung. Representation balancing offline model-based reinforcement learning. In ICLR, 2020.
  • [45] Toru Hishinuma and Kei Senda. Weighted model estimation for offline model-based reinforcement learning. In NeurIPS, 2021.
  • [46] Arthur Argenson and Gabriel Dulac-Arnold. Model-based offline planning. In ICLR, 2021.
  • [47] Tatsuya Matsushima, Hiroki Furuta, Yutaka Matsuo, Ofir Nachum, and Shixiang Gu. Deployment-efficient reinforcement learning via model-based offline optimization. In ICLR, 2021.
  • [48] Tianhe Yu, Aviral Kumar, Rafael Rafailov, Aravind Rajeswaran, Sergey Levine, and Chelsea Finn. COMBO: Conservative offline model-based policy optimization. In NeurIPS, 2021.
  • [49] RT Rockafellar and Stanislav Uryasev. Optimization of conditional value-at-risk. Journal of Risk, 2000.
  • [50] Aviv Tamar, Huan Xu, and Shie Mannor. Scaling up robust MDPs by reinforcement learning. arXiv preprint arXiv:1306.6189, 2013.
  • [51] Romain Laroche, Paul Trichelair, and Remi Tachet Des Combes. Safe policy improvement with baseline bootstrapping. In ICML, 2019.
  • [52] Daniel J. Mankowitz, Nir Levine, Rae Jeong, Abbas Abdolmaleki, Jost Tobias Springenberg, Timothy A. Mann, Todd Hester, and Martin A. Riedmiller. Robust reinforcement learning for continuous control with model misspecification. In ICLR, 2020.
  • [53] Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta. Robust adversarial reinforcement learning. In ICML, 2017.
  • [54] Chen Tessler, Yonathan Efroni, and Shie Mannor. Action robust reinforcement learning and applications in continuous control. In ICML, 2019.
  • [55] Elena Smirnova, Elvis Dohmatob, and Jérémie Mary. Distributionally robust reinforcement learning. In ICML Workshop, 2019.
  • [56] Marc Rigter, Paul Duckworth, Bruno Lacerda, and Nick Hawes. Planning for risk-aversion and expected value in MDPs. In ICAPS, 2022.
  • [57] Esther Derman, Daniel Mankowitz, Timothy Mann, and Shie Mannor. A Bayesian approach to robust reinforcement learning. In UAI, 2020.
  • [58] Marc Rigter, Bruno Lacerda, and Nick Hawes. Risk-averse Bayes-adaptive reinforcement learning. In NeurIPS, 2021.
  • [59] H.A. David and H.N. Nagaraja. Order Statistics. Wiley, 2004.
  • [60] Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018.
  • [61] Marc Rigter, Bruno Lacerda, and Nick Hawes. RAMBO-RL: Robust adversarial model-based offline reinforcement learning. arXiv preprint arXiv:2204.12581, 2022.

Appendix A Additional Related Works

Robust MDP and CVaR Criterion

Bayesian decision theory provides a principled formalism for decision making under uncertainty. Robust MDP [23, 24], Conditional Value at Risk (CVaR) [49] and our proposed criterion can all be deemed specializations of Bayesian decision theory, but they are derived from different principles and have different properties.

Robust MDP is proposed as a surrogate to optimize the percentile performance. Early works mainly focus on algorithmic design and theoretical analysis in the tabular case [23, 24] or under linear function approximation [50]. Recently, it has also been extended to continuous action spaces and nonlinear cases by integrating the advances of deep RL [51, 52]. Meanwhile, a variety of works generalize the uncertainty to system disturbance [53] and action disturbance [54, 55]. Although robust MDP produces robust policies, it focuses purely on the percentile performance, and ignoring the other possibilities is reported to make it over-conservative [25, 26, 27].

CVaR instead considers the average performance over the worst $\delta$-fraction of possibilities. Despite involving more information about the stochasticity, CVaR still takes a solely pessimistic view. Recent works propose improvements by maximizing a convex combination of the mean performance and CVaR [25], or maximizing the mean performance under a CVaR constraint [56]. However, policy optimization for these criteria is intractable, being either provably NP-hard or reliant on heuristics. In comparison, the proposed AMG formulation presents an alternative way to account for the entire spectrum of plausible transitions while paying more attention to the pessimistic parts, and its policy optimization comes with a theoretical guarantee.

Apart from offline RL, Bayesian decision theory is also applied in other RL settings. Particularly, Bayesian RL considers that new observations are continually received and utilized to make adaptive decisions. The goal of Bayesian RL is to explore and adapt quickly, while that of offline RL is to sufficiently exploit the offline dataset to generate a best-effort policy supported or surrounded by the dataset. Recently, Bayesian robust RL [57] integrates the idea of robust MDP into the Bayesian RL setting, where the uncertainty set is constructed to produce a robust policy and is updated upon new observations to alleviate the degree of conservativeness. Besides, the CVaR criterion has also been considered in Bayesian RL [58].

Appendix B Theorem Proof

We first present and prove the fundamental inequalities used to prove the main theorems, and then present the proofs for Sections 3 and 4, respectively. For conciseness, the subscripts $N,k$ are omitted in the Q-value and the Bellman backup operator when clear from context.

B.1 Preliminaries

Lemma 1.

Let minikxi\lfloor\min\rfloor_{i}^{k}~{}x_{i} denote the kkth minimum in {xi}\left\{x_{i}\right\}, then

mini(xiyi)minikximinikyimaxi(xiyi),k=1,2,,N,\min_{i}\left(x_{i}-y_{i}\right)\leq\lfloor\min\rfloor_{i}^{k}~{}x_{i}-\lfloor\min\rfloor_{i}^{k}~{}y_{i}\leq\max_{i}\left(x_{i}-y_{i}\right),\quad\forall k=1,2,\cdots,N,

where NN is the size of both {xi}\{x_{i}\} and {yi}\{y_{i}\}.

Proof of Lemma 1.

Denote i=argminikxii^{*}=\arg\lfloor\min\rfloor_{i}^{k}~{}x_{i} and j=argminikyij^{*}=\arg\lfloor\min\rfloor_{i}^{k}~{}y_{i}. Next, we prove the first inequality. The proof is done by dividing into two cases.

Case 1: yiyjy_{i^{*}}\geq y_{j^{*}}

It is easy to check

minikximinikyi=xiyjxiyimini(xiyi).\displaystyle\lfloor\min\rfloor_{i}^{k}~{}x_{i}-\lfloor\min\rfloor_{i}^{k}~{}y_{i}=x_{i^{*}}-y_{j^{*}}\geq x_{i^{*}}-y_{i^{*}}\geq\min_{i}\left(x_{i}-y_{i}\right).

Case 2: yi<yjy_{i^{*}}<y_{j^{*}}

We prove by contradiction. Let $\mathcal{S}_{x}=\Big\{\arg\lfloor\min\rfloor_{i}^{l}~x_{i}~\Big|~l=1,2,\cdots,k-1\Big\}$ and, analogously, $\mathcal{S}_{y}=\Big\{\arg\lfloor\min\rfloor_{i}^{l}~y_{i}~\Big|~l=1,2,\cdots,k-1\Big\}$ collect the indices of the $k-1$ smallest elements of $\{x_{i}\}$ and $\{y_{i}\}$, respectively. Assume

ys<yj,s𝒮x.y_{s}<y_{j^{*}},\quad\forall s\in\mathcal{S}_{x}.

Since $y_{j^{*}}$ is the $k$th minimum of $\{y_{i}\}$, the above assumption implies $\mathcal{S}_{x}\subseteq\mathcal{S}_{y}$. Meanwhile, according to the condition of Case 2, $i^{*}\in\mathcal{S}_{y}$. Putting these together, we have $\{i^{*}\}\cup\mathcal{S}_{x}\subseteq\mathcal{S}_{y}$. According to the definition of $i^{*}$, we know $i^{*}\notin\mathcal{S}_{x}$. Hence $\mathcal{S}_{y}$ would contain at least $k$ elements, contradicting its definition.

Thus,

s¯𝒮x:ys¯yj.\exists\bar{s}\in\mathcal{S}_{x}:y_{\bar{s}}\geq y_{j^{*}}.

By applying the above inequality and xs¯xix_{\bar{s}}\leq x_{i^{*}}, we have

minikximinikyi=xiyjxs¯ys¯mini(xiyi).\displaystyle\lfloor\min\rfloor_{i}^{k}~{}x_{i}-\lfloor\min\rfloor_{i}^{k}~{}y_{i}=x_{i^{*}}-y_{j^{*}}\geq x_{\bar{s}}-y_{\bar{s}}\geq\min_{i}\left(x_{i}-y_{i}\right).

In summary, we have mini(xiyi)minikximinikyi\min_{i}\left(x_{i}-y_{i}\right)\leq\lfloor\min\rfloor_{i}^{k}~{}x_{i}-\lfloor\min\rfloor_{i}^{k}~{}y_{i} for both cases.

The second inequality can be proved by resorting to the first one. By respectively replacing $x_{i}$ and $y_{i}$ with $-x_{i}$ and $-y_{i}$ in the first inequality, we obtain

mini(xi+yi)minik(xi)minik(yi),\displaystyle\min_{i}\left(-x_{i}+y_{i}\right)\leq\lfloor\min\rfloor_{i}^{k}~{}(-x_{i})-\lfloor\min\rfloor_{i}^{k}~{}(-y_{i}),

which can be rewritten as

\displaystyle\max_{i}\left(x_{i}-y_{i}\right)\geq-\Big{(}\lfloor\min\rfloor_{i}^{k}~{}(-x_{i})-\lfloor\min\rfloor_{i}^{k}~{}(-y_{i})\Big{)}=\lfloor\min\rfloor_{i}^{N-k+1}~{}x_{i}-\lfloor\min\rfloor_{i}^{N-k+1}~{}y_{i},

where the last equation is due to $\lfloor\min\rfloor_{i}^{k}~(-x_{i})=-\lfloor\min\rfloor_{i}^{N-k+1}~x_{i}$. As the above inequality holds for any $k\in\{1,2,\cdots,N\}$, the index $N-k+1$ ranges over $\{1,2,\cdots,N\}$ as well, which yields exactly the second inequality in Lemma 1. ∎

Corollary 1.
|minikximinikyi|maxi|xiyi|,k=1,2,,N.\Big{|}\lfloor\min\rfloor_{i}^{k}~{}x_{i}-\lfloor\min\rfloor_{i}^{k}~{}y_{i}\Big{|}\leq\max_{i}|x_{i}-y_{i}|,\quad\forall k=1,2,\cdots,N.
Proof of Corollary 1.

The inequality can be attained through simple derivation based on Lemma 1, i.e.,

minikximinikyimini(xiyi)mini(|xiyi|)=maxi|xiyi|\lfloor\min\rfloor_{i}^{k}~{}x_{i}-\lfloor\min\rfloor_{i}^{k}~{}y_{i}\geq\min_{i}\left(x_{i}-y_{i}\right)\geq\min_{i}\left(-\left|x_{i}-y_{i}\right|\right)=-\max_{i}\left|x_{i}-y_{i}\right|

and

minikximinikyimaxi(xiyi)maxi|xiyi|.\lfloor\min\rfloor_{i}^{k}~{}x_{i}-\lfloor\min\rfloor_{i}^{k}~{}y_{i}\leq\max_{i}\left(x_{i}-y_{i}\right)\leq\max_{i}\left|x_{i}-y_{i}\right|.

Putting them together, we obtain

|minikximinikyi|maxi|xiyi|.\Big{|}\lfloor\min\rfloor_{i}^{k}~{}x_{i}-\lfloor\min\rfloor_{i}^{k}~{}y_{i}\Big{|}\leq\max_{i}\left|x_{i}-y_{i}\right|.
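To make the inequalities concrete, the following minimal NumPy sketch (illustrative only; the vector size and random instances are arbitrary assumptions) numerically checks Lemma 1 and Corollary 1:

```python
import numpy as np

rng = np.random.default_rng(0)

def kth_min(v, k):
    # k-th smallest element of v, with k being 1-indexed as in Lemma 1
    return np.sort(v)[k - 1]

N = 7
for _ in range(10_000):
    x, y = rng.normal(size=N), rng.normal(size=N)
    for k in range(1, N + 1):
        gap = kth_min(x, k) - kth_min(y, k)
        assert np.min(x - y) - 1e-12 <= gap <= np.max(x - y) + 1e-12  # Lemma 1
        assert abs(gap) <= np.max(np.abs(x - y)) + 1e-12              # Corollary 1
print("Lemma 1 and Corollary 1 hold on all sampled instances.")
```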

B.2 Proofs for Section 3

Proof of Theorem 1.

Let $Q_{1}$ and $Q_{2}$ be two arbitrary Q-functions; then

πQ1πQ2\displaystyle\left\lVert\mathcal{B}^{\pi}Q_{1}-\mathcal{B}^{\pi}Q_{2}\right\rVert_{\infty}
=γmaxs,a|𝔼TN[minτ𝒯k𝔼τ,π[Q1(s,a)]]𝔼TN[minτ𝒯k𝔼τ,π[Q2(s,a)]]|\displaystyle=\gamma\max_{s,a}\Bigg{|}\mathbb{E}_{\mathbb{P}_{T}^{N}}\bigg{[}\lfloor\min\rfloor_{\tau\in\mathcal{T}}^{k}\mathbb{E}_{\tau,\pi}\big{[}Q_{1}(s^{\prime},a^{\prime})\big{]}\bigg{]}-\mathbb{E}_{\mathbb{P}_{T}^{N}}\bigg{[}\lfloor\min\rfloor_{\tau\in\mathcal{T}}^{k}\mathbb{E}_{\tau,\pi}\big{[}Q_{2}(s^{\prime},a^{\prime})\big{]}\bigg{]}\Bigg{|}
=γmaxs,a|𝔼TN[minτ𝒯k𝔼τ,π[Q1(s,a)]minτ𝒯k𝔼τ,π[Q2(s,a)]]|\displaystyle=\gamma\max_{s,a}\Bigg{|}\mathbb{E}_{\mathbb{P}_{T}^{N}}\bigg{[}\lfloor\min\rfloor_{\tau\in\mathcal{T}}^{k}\mathbb{E}_{\tau,\pi}\big{[}Q_{1}(s^{\prime},a^{\prime})\big{]}-\lfloor\min\rfloor_{\tau\in\mathcal{T}}^{k}\mathbb{E}_{\tau,\pi}\big{[}Q_{2}(s^{\prime},a^{\prime})\big{]}\bigg{]}\Bigg{|}
γmaxs,a(𝔼TN|minτ𝒯k𝔼τ,π[Q1(s,a)]minτ𝒯k𝔼τ,π[Q2(s,a)]|)\displaystyle\leq\gamma\max_{s,a}\Bigg{(}\mathbb{E}_{\mathbb{P}_{T}^{N}}\bigg{|}\lfloor\min\rfloor_{\tau\in\mathcal{T}}^{k}\mathbb{E}_{\tau,\pi}\big{[}Q_{1}(s^{\prime},a^{\prime})\big{]}-\lfloor\min\rfloor_{\tau\in\mathcal{T}}^{k}\mathbb{E}_{\tau,\pi}\big{[}Q_{2}(s^{\prime},a^{\prime})\big{]}\bigg{|}\Bigg{)}
γmaxs,a(𝔼TN[maxτ𝒯|𝔼τ,π[Q1(s,a)Q2(s,a)]|])\displaystyle\leq\gamma\max_{s,a}\Bigg{(}\mathbb{E}_{\mathbb{P}_{T}^{N}}\bigg{[}\max_{\tau\in\mathcal{T}}\Big{|}\mathbb{E}_{\tau,\pi}\big{[}Q_{1}(s^{\prime},a^{\prime})-Q_{2}(s^{\prime},a^{\prime})\big{]}\Big{|}\bigg{]}\Bigg{)}
γmaxs,a(𝔼TNQ1Q2)\displaystyle\leq\gamma\max_{s,a}\left(\mathbb{E}_{\mathbb{P}_{T}^{N}}\lVert Q_{1}-Q_{2}\rVert_{\infty}\right)
=γQ1Q2,\displaystyle=\gamma\lVert Q_{1}-Q_{2}\rVert_{\infty},

where the second inequality is due to Corollary 1. Thus, the pessimistic Bellman update operator π\mathcal{B}^{\pi} is a contraction mapping.

After convergence, it is easy to check $J(\pi)=\mathbb{E}_{\rho_{0},\pi}\Big[Q^{\pi}(s_{0},a_{0})\Big]$ by recursively unfolding the Q-function. ∎
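As an illustration of the operator and its contraction property, the sketch below applies the $(N,k)$-pessimistic Bellman backup for policy evaluation on a toy tabular MDP. The dynamics belief is taken as a discrete distribution over a handful of candidate transition models so that the expectation over $\mathbb{P}_{T}^{N}$ can be enumerated exactly; all sizes and distributions are illustrative assumptions rather than the setting used in the experiments.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
S, A, M, gamma, N, k = 4, 2, 3, 0.9, 3, 2

# Candidate transition models taus[m, s, a] (a distribution over next states),
# belief weights p_belief over the M models, rewards r, and a fixed policy pi.
taus = rng.dirichlet(np.ones(S), size=(M, S, A))          # shape (M, S, A, S)
p_belief = rng.dirichlet(np.ones(M))
r = rng.uniform(size=(S, A))
pi = rng.dirichlet(np.ones(A), size=S)                    # pi(a|s), shape (S, A)

def pessimistic_backup(Q):
    """One application of the (N,k)-pessimistic Bellman backup for policy pi."""
    V = taus @ (pi * Q).sum(axis=1)                       # (M, S, A): E_{tau,pi}[Q(s',a')]
    newQ = np.zeros((S, A))
    for idx in itertools.product(range(M), repeat=N):     # enumerate the set T ~ P_T^N
        weight = np.prod(p_belief[list(idx)])
        kth_min = np.sort(V[list(idx)], axis=0)[k - 1]    # k-th smallest over the set
        newQ += weight * kth_min
    return r + gamma * newQ

# contraction check: the sup-norm distance shrinks by at least a factor gamma
Q1, Q2 = rng.normal(size=(S, A)), rng.normal(size=(S, A))
for _ in range(5):
    d = np.max(np.abs(Q1 - Q2))
    Q1, Q2 = pessimistic_backup(Q1), pessimistic_backup(Q2)
    assert np.max(np.abs(Q1 - Q2)) <= gamma * d + 1e-10
print("sup-norm gap after 5 backups:", np.max(np.abs(Q1 - Q2)))
```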

Proof of Theorem 2.

For conciseness, in this proof we drop the superscript sasa in τsa,Tsa\tau^{sa},\mathbb{P}_{T}^{sa} and ~Tsa\widetilde{\mathbb{P}}_{T}^{sa}.

The proof is based on the definition and the probability density function of order statistics [59]. For any random variables $X_{1},X_{2},\cdots,X_{N}$, their $k$th order statistic is defined as $\lfloor\min\rfloor_{n\in\{1,\cdots,N\}}^{k}X_{n}$, which is another random variable. Particularly, when $X_{1},X_{2},\cdots,X_{N}$ are independent and identically distributed with probability density function $\mathbb{P}(x)$, the order statistic has the probability density function

N,k(x)=N!(k1)!(Nk)!C(x)[F(x)]k1[1F(x)]Nk,\displaystyle\mathbb{P}_{N,k}(x)=\underbrace{\frac{N!}{(k-1)!(N-k)!}}_{C}\mathbb{P}(x)\big{[}F(x)\big{]}^{k-1}\big{[}1-F(x)\big{]}^{N-k},

where $F(x)$ is the cumulative distribution function corresponding to $\mathbb{P}(x)$.

Let g(τ)=𝔼τ,π[Qπ(s,a)]g(\tau)=\mathbb{E}_{\tau,\pi}\big{[}Q^{\pi}(s^{\prime},a^{\prime})\big{]} for short. As τ\tau is random following the belief distribution, gg as the functional of τ\tau is also a random variable. Its sample can be drawn by

g=g(τ),τT(τ).\displaystyle g=g(\tau),\quad\tau\sim\mathbb{P}_{T}(\tau).

As the elements in 𝒯\mathcal{T} are independent and identically distributed samples from T(τ)\mathbb{P}_{T}(\tau), the elements in 𝒢={g(τ)τ𝒯}\mathcal{G}=\{g(\tau)\mid\tau\in\mathcal{T}\} are also independent and identically distributed. Thus, ming𝒢kg\lfloor\min\rfloor_{g\in\mathcal{G}}^{k}~{}g is their kkth order statistic, and we have

𝔼TN[minτ𝒯kg(τ)]\displaystyle\mathbb{E}_{\mathbb{P}^{N}_{T}}\Big{[}\lfloor\min\rfloor_{\tau\in\mathcal{T}}^{k}g(\tau)\Big{]} =𝔼TN[ming𝒢kg]=N,k(g)g𝑑g\displaystyle=\mathbb{E}_{\mathbb{P}^{N}_{T}}\Big{[}\lfloor\min\rfloor_{g\in\mathcal{G}}^{k}~{}g\Big{]}=\int_{-\infty}^{\infty}\mathbb{P}_{N,k}(g)gdg
=C(g)[F(g)]k1[1F(g)]Nkg𝑑g,\displaystyle=C\int_{-\infty}^{\infty}\mathbb{P}(g)\big{[}F(g)\big{]}^{k-1}\big{[}1-F(g)\big{]}^{N-k}gdg,
=C\bigintsss[τ:g(τ)=gT(τ)𝑑ν(τ)][F(g)]k1[1F(g)]Nkgdg,\displaystyle=C\bigintsss_{-\infty}^{\infty}\left[\int_{\tau:g(\tau)=g}\mathbb{P}_{T}(\tau)d\nu(\tau)\right]\big{[}F(g)\big{]}^{k-1}\big{[}1-F(g)\big{]}^{N-k}gdg,
=C\bigintsss\bigintsssτ:g(τ)=gT(τ)[F(g)]k1[1F(g)]Nkgdν(τ)dg,\displaystyle=C\bigintsss_{-\infty}^{\infty}\bigintsss_{\tau:g(\tau)=g}\mathbb{P}_{T}(\tau)\big{[}F(g)\big{]}^{k-1}\big{[}1-F(g)\big{]}^{N-k}gd\nu(\tau)dg,
=()C\bigintsssτ\bigintsssg=g(τ)T(τ)[F(g)]k1[1F(g)]Nkgdgdν(τ),\displaystyle\overset{(*)}{=}C\bigintsss_{\tau}\bigintsss_{g=g(\tau)}\mathbb{P}_{T}(\tau)\big{[}F(g)\big{]}^{k-1}\big{[}1-F(g)\big{]}^{N-k}gdgd\nu(\tau),
=\bigintsssτCT(τ)[F(g(τ))]k1[1F(g(τ))]Nk~T(τ)g(τ)dν(τ)\displaystyle=\bigintsss_{\tau}\underbrace{C~{}\mathbb{P}_{T}(\tau)\Big{[}F\big{(}g(\tau)\big{)}\Big{]}^{k-1}\Big{[}1-F\big{(}g(\tau)\big{)}\Big{]}^{N-k}}_{\widetilde{\mathbb{P}}_{T}(\tau)}g(\tau)d\nu(\tau)
=𝔼~T[g(τ)],\displaystyle=\mathbb{E}_{\widetilde{\mathbb{P}}_{T}}[g(\tau)],

where $\nu(\tau)$ is the reference measure based on which the belief distribution $\mathbb{P}_{T}$ is defined, and the equation $(*)$ is obtained by exchanging the order of integration. The above equation can be rewritten as

𝔼TN[minτ𝒯k𝔼τ,π[Q(s,a)]]=𝔼~T𝔼τ,π[Q(s,a)].\displaystyle\mathbb{E}_{\mathbb{P}_{T}^{N}}\Big{[}\lfloor\min\rfloor_{\tau\in\mathcal{T}}^{k}\mathbb{E}_{\tau,\pi}\left[Q(s^{\prime},a^{\prime})\right]\Big{]}=\mathbb{E}_{\widetilde{\mathbb{P}}_{T}}\mathbb{E}_{\tau,\pi}\left[Q(s^{\prime},a^{\prime})\right].

Taking this into consideration, the pessimistic Bellman backup operator in (6) is exactly the vanilla Bellman backup operator for the MDP with transition probability T~(s|s,a)=𝔼~T[τ(s)]\widetilde{T}(s^{\prime}|s,a)=\mathbb{E}_{\widetilde{\mathbb{P}}_{T}}\left[\tau(s^{\prime})\right]. Then, evaluating/optimizing policy in the AMG is equivalent to evaluating/optimizing in this MDP.

To prove the property of $w$, we treat it as a composite function of the form $w\big(F(x)\big)$. Then, the derivative of $w$ with respect to $F$ is

δwδF=Fk2(1F)Nk1[(k1)(N1)F].\displaystyle\frac{\delta w}{\delta F}=F^{k-2}(1-F)^{N-k-1}\left[(k-1)-(N-1)F\right]. (15)

It is easy to check that $\frac{\delta w}{\delta F}\geq 0$ for $F\leq\frac{k-1}{N-1}$ and $\frac{\delta w}{\delta F}\leq 0$ for $F\geq\frac{k-1}{N-1}$. Thus, $w(F)$ reaches its maximum at $F=\frac{k-1}{N-1}$. Besides, as $F(\cdot)$ is the CDF of $x$, it monotonically increases with $x$. Putting the monotonicity of $w$ and $F$ together, we know $w\big(F(x)\big)$ first increases, then decreases with $x$, and achieves its maximum at $x^{*}=F^{-1}\left(\frac{k-1}{N-1}\right)$. ∎
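The identity in Theorem 2 can also be checked numerically. The sketch below compares the Monte Carlo average of the $k$th minimum over $N$ i.i.d. draws with the self-normalized reweighted mean under $w(x;k,N)=[F(x)]^{k-1}[1-F(x)]^{N-k}$, using a Uniform(0,1) variable as a stand-in for $\mathbb{E}_{\tau,\pi}[Q^{\pi}(s^{\prime},a^{\prime})]$ so that $F(x)=x$ (an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(2)
N, k, n_samples = 5, 2, 200_000

# Left-hand side: average of the k-th minimum over sets of N i.i.d. samples.
g_sets = rng.uniform(size=(n_samples, N))
lhs = np.sort(g_sets, axis=1)[:, k - 1].mean()

# Right-hand side: self-normalized reweighting of plain samples by w(g) = F^{k-1}(1-F)^{N-k}.
g = rng.uniform(size=n_samples)
w = g ** (k - 1) * (1.0 - g) ** (N - k)
rhs = np.sum(w * g) / np.sum(w)

# both agree with the closed form k/(N+1) for uniform order statistics
print(lhs, rhs, k / (N + 1))
```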

Lemma 2 (Monotonicity of Pessimistic Bellman Backup Operator).

Assume that Q1Q2Q_{1}\geq Q_{2} holds element-wisely, then πQ1πQ2\mathcal{B}^{\pi}Q_{1}\geq\mathcal{B}^{\pi}Q_{2} element-wisely.

Proof of Lemma 2.
πQ1(s,a)πQ2(s,a)\displaystyle\mathcal{B}^{\pi}Q_{1}(s,a)-\mathcal{B}^{\pi}Q_{2}(s,a)
=γ𝔼TN[minτ𝒯k𝔼τ,π[Q1(s,a)]minτ𝒯k𝔼τ,π[Q2(s,a)]]\displaystyle=\gamma\mathbb{E}_{\mathbb{P}^{N}_{T}}\bigg{[}\lfloor\min\rfloor_{\tau\in\mathcal{T}}^{k}\mathbb{E}_{\tau,\pi}\big{[}Q_{1}(s^{\prime},a^{\prime})\big{]}-\lfloor\min\rfloor_{\tau\in\mathcal{T}}^{k}\mathbb{E}_{\tau,\pi}\big{[}Q_{2}(s^{\prime},a^{\prime})\big{]}\bigg{]}
γ𝔼TN[minτ𝒯𝔼τ,π[Q1(s,a)Q2(s,a)]]\displaystyle\geq\gamma\mathbb{E}_{\mathbb{P}_{T}^{N}}\bigg{[}\min_{\tau\in\mathcal{T}}\mathbb{E}_{\tau,\pi}\big{[}Q_{1}(s^{\prime},a^{\prime})-Q_{2}(s^{\prime},a^{\prime})\big{]}\bigg{]}
0,s,a,\displaystyle\geq 0,\qquad\forall s,a,

where the first inequality is due to Lemma 1. ∎

Proof of Theorem 3.

It is sufficient to prove QN,k+1πQN,kπ,QN+1,kπQN,kπQ^{\pi}_{N,k+1}\geq Q^{\pi}_{N,k},Q^{\pi}_{N+1,k}\leq Q^{\pi}_{N,k} and QN+1,N+1πQN,NπQ^{\pi}_{N+1,N+1}\geq Q^{\pi}_{N,N} element-wisely. The idea is to first show N,k+1πQN,kπQN,kπ,N+1,kπQN,kπQN,kπ\mathcal{B}_{N,k+1}^{\pi}Q^{\pi}_{N,k}\geq Q^{\pi}_{N,k},\mathcal{B}_{N+1,k}^{\pi}Q^{\pi}_{N,k}\leq Q^{\pi}_{N,k} and N+1,N+1πQN,NπQN,Nπ\mathcal{B}_{N+1,N+1}^{\pi}Q^{\pi}_{N,N}\geq Q^{\pi}_{N,N}. Then, the proof can be finished by recursively applying Lemma 2, for example:

QN,k+1π=limn(N,k+1π)nQN,kπN,k+1πQN,kπQN,kπ.\displaystyle Q^{\pi}_{N,k+1}=\lim_{n\rightarrow\infty}\left(\mathcal{B}_{N,k+1}^{\pi}\right)^{n}Q^{\pi}_{N,k}\geq\cdots\geq\mathcal{B}_{N,k+1}^{\pi}Q^{\pi}_{N,k}\geq Q^{\pi}_{N,k}.

Next, we prove the three inequalities in sequence.

N,k+1πQN,kπQN,kπ\mathcal{B}_{N,k+1}^{\pi}Q^{\pi}_{N,k}\geq Q^{\pi}_{N,k}

N,k+1πQN,kπ(s,a)\displaystyle\mathcal{B}_{N,k+1}^{\pi}Q^{\pi}_{N,k}(s,a)
\displaystyle=r(s,a)+\gamma\mathbb{E}_{\mathbb{P}^{N}_{T}}\bigg{[}\lfloor\min\rfloor_{\tau\in\mathcal{T}}^{k+1}\mathbb{E}_{\tau,\pi}\big{[}Q^{\pi}_{N,k}(s^{\prime},a^{\prime})\big{]}\bigg{]}
\displaystyle\geq r(s,a)+\gamma\mathbb{E}_{\mathbb{P}^{N}_{T}}\bigg{[}\lfloor\min\rfloor_{\tau\in\mathcal{T}}^{k}\mathbb{E}_{\tau,\pi}\big{[}Q^{\pi}_{N,k}(s^{\prime},a^{\prime})\big{]}\bigg{]}
=N,kπQN,kπ(s,a)\displaystyle=\mathcal{B}_{N,k}^{\pi}Q^{\pi}_{N,k}(s,a)
=QN,kπ(s,a),s,a,\displaystyle=Q^{\pi}_{N,k}(s,a),\qquad\forall s,a,

N+1,kπQN,kπQN,kπ\mathcal{B}_{N+1,k}^{\pi}Q^{\pi}_{N,k}\leq Q^{\pi}_{N,k}

N+1,kπQN,kπ(s,a)\displaystyle\mathcal{B}_{N+1,k}^{\pi}Q^{\pi}_{N,k}(s,a)
=r(s,a)+γ𝔼𝒯TN+1[minτ𝒯k𝔼τ,π[QN,kπ(s,a)]]\displaystyle=r(s,a)+\gamma\mathbb{E}_{\mathcal{T}\sim\mathbb{P}_{T}^{N+1}}\bigg{[}\lfloor\min\rfloor_{\tau\in\mathcal{T}}^{k}\mathbb{E}_{\tau,\pi}\big{[}Q^{\pi}_{N,k}(s^{\prime},a^{\prime})\big{]}\bigg{]}
=r(s,a)+γ𝔼𝒯TN[𝔼τT[minτ𝒯{τ}k𝔼τ,π[QN,kπ(s,a)]]]\displaystyle=r(s,a)+\gamma\mathbb{E}_{\mathcal{T}^{\prime}\sim\mathbb{P}_{T}^{N}}\Bigg{[}\mathbb{E}_{\tau^{\prime}\sim\mathbb{P}_{T}}\bigg{[}\lfloor\min\rfloor_{\tau\in\mathcal{T^{\prime}\cup\{\tau^{\prime}\}}}^{k}\mathbb{E}_{\tau,\pi}\big{[}Q^{\pi}_{N,k}(s^{\prime},a^{\prime})\big{]}\bigg{]}\Bigg{]}
r(s,a)+γ𝔼𝒯TN[minτ𝒯k𝔼τ,π[QN,kπ(s,a)]]\displaystyle\leq r(s,a)+\gamma\mathbb{E}_{\mathcal{T}^{\prime}\sim\mathbb{P}_{T}^{N}}\bigg{[}\lfloor\min\rfloor_{\tau\in\mathcal{T}^{\prime}}^{k}\mathbb{E}_{\tau,\pi}\big{[}Q^{\pi}_{N,k}(s^{\prime},a^{\prime})\big{]}\bigg{]}
=N,kπQN,kπ(s,a)\displaystyle=\mathcal{B}_{N,k}^{\pi}Q^{\pi}_{N,k}(s,a)
=QN,kπ(s,a),s,a,\displaystyle=Q^{\pi}_{N,k}(s,a),\qquad\forall s,a,

where we divide the $(N+1)$-size set $\mathcal{T}$ into $\mathcal{T}^{\prime}$ and $\{\tau^{\prime}\}$, with $\mathcal{T}^{\prime}$ containing the first $N$ elements and $\tau^{\prime}$ being the last element. The second equality is due to the independence among the set elements, and the inequality holds since adding an element to the set can only decrease (or keep) its $k$th minimum.

N+1,N+1πQN,NπQN,Nπ\mathcal{B}_{N+1,N+1}^{\pi}Q^{\pi}_{N,N}\geq Q^{\pi}_{N,N}

N+1,N+1πQN,Nπ(s,a)\displaystyle\mathcal{B}_{N+1,N+1}^{\pi}Q^{\pi}_{N,N}(s,a)
\displaystyle=r(s,a)+\gamma\mathbb{E}_{\mathcal{T}\sim\mathbb{P}_{T}^{N+1}}\bigg{[}\max_{\tau\in\mathcal{T}}\mathbb{E}_{\tau,\pi}\big{[}Q^{\pi}_{N,N}(s^{\prime},a^{\prime})\big{]}\bigg{]}
\displaystyle=r(s,a)+\gamma\mathbb{E}_{\mathcal{T}^{\prime}\sim\mathbb{P}_{T}^{N}}\Bigg{[}\mathbb{E}_{\tau^{\prime}\sim\mathbb{P}_{T}}\bigg{[}\max_{\tau\in\mathcal{T^{\prime}\cup\{\tau^{\prime}\}}}\mathbb{E}_{\tau,\pi}\big{[}Q^{\pi}_{N,N}(s^{\prime},a^{\prime})\big{]}\bigg{]}\Bigg{]}
\displaystyle\geq r(s,a)+\gamma\mathbb{E}_{\mathcal{T}^{\prime}\sim\mathbb{P}_{T}^{N}}\bigg{[}\max_{\tau\in\mathcal{T}^{\prime}}\mathbb{E}_{\tau,\pi}\big{[}Q^{\pi}_{N,N}(s^{\prime},a^{\prime})\big{]}\bigg{]}
\displaystyle=\mathcal{B}_{N,N}^{\pi}Q^{\pi}_{N,N}(s,a)
\displaystyle=Q^{\pi}_{N,N}(s,a),\qquad\forall s,a,

where the inequality holds since the maximum over a larger set is no smaller. ∎
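The monotonicity pattern of Theorem 3 can be visualized with a scalar Monte Carlo experiment, where a Uniform(0,1) random variable stands in for $\mathbb{E}_{\tau,\pi}[Q^{\pi}_{N,k}(s^{\prime},a^{\prime})]$ (an illustrative assumption, not the actual Q-values):

```python
import numpy as np

rng = np.random.default_rng(3)

def expected_kth_min(N, k, n=200_000):
    # Monte Carlo estimate of the expected k-th minimum over N i.i.d. Uniform(0,1) draws
    return np.sort(rng.uniform(size=(n, N)), axis=1)[:, k - 1].mean()

print([round(expected_kth_min(N, 2), 3) for N in (2, 3, 4, 5)])     # decreasing in N
print([round(expected_kth_min(5, k), 3) for k in (1, 2, 3, 4, 5)])  # increasing in k
print([round(expected_kth_min(N, N), 3) for N in (2, 3, 4, 5)])     # max case, increasing in N
```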

B.3 Proofs for Section 4

Analogous to the policy evaluation for the non-regularized case, we define the KL-regularized Bellman update operator for a given policy $\pi$ by

\displaystyle\bar{\mathcal{B}}^{\pi}_{N,k}Q(s,a)=r(s,a)+\gamma\mathbb{E}_{\mathbb{P}_{T}^{N}}\bigg{[}\lfloor\min\rfloor_{\tau\in\mathcal{T}}^{k}\mathbb{E}_{\tau,\pi}\Big{[}Q(s^{\prime},a^{\prime})-\alpha D_{\mathrm{KL}}\left(\big{.}\pi(\cdot|s^{\prime})\;\middle|\middle|\;\mu(\cdot|s^{\prime})\right)\Big{]}\bigg{]}. (16)

It is easy to check that all proofs in the last subsection adapt well to the KL-regularized case. We state the corresponding theorems and lemma below, and apply them to prove the theorems in Section 4.

Theorem 6 (Policy Evaluation for KL-Regularized AMG).

The regularized (N,k)(N,k)-pessimistic Bellman backup operator ¯N,kπ\bar{\mathcal{B}}^{\pi}_{N,k} is a contraction mapping. By starting from any function Q:𝒮×𝒜Q:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R} and repeatedly applying ¯N,kπ\bar{\mathcal{B}}^{\pi}_{N,k}, the sequence converges to Q¯N,kπ\bar{Q}^{\pi}_{N,k}, with which we have J¯(π;μ)=𝔼ρ0,π[Q¯N,kπ(s0,a0)αDKL(.π(|s0)||μ(|s0))]\bar{J}(\pi;\mu)=\mathbb{E}_{\rho_{0},\pi}\Big{[}\bar{Q}^{\pi}_{N,k}(s_{0},a_{0})-\alpha D_{\mathrm{KL}}\left(\big{.}\pi(\cdot|s_{0})\;\middle|\middle|\;\mu(\cdot|s_{0})\right)\Big{]}.

Theorem 7 (Equivalent KL-Regularized MDP with Pessimism-Modulated Dynamics Belief).

The KL-regularized alternating Markov game in (4.1) is equivalent to the KL-regularized MDP with tuple (𝒮,𝒜,T~,r,ρ0,γ)(\mathcal{S},\mathcal{A},\widetilde{T},r,\rho_{0},\gamma), where the transition probability T~(s|s,a)=𝔼~Tsa[τsa(s)]\widetilde{T}(s^{\prime}|s,a)=\mathbb{E}_{\widetilde{\mathbb{P}}_{T}^{sa}}\left[\tau^{sa}(s^{\prime})\right] is defined with the reweighted belief distribution ~Tsa\widetilde{\mathbb{P}}_{T}^{sa}:

~Tsa(τsa)w(𝔼τsa,π[Q¯N,kπ(s,a)];k,N)Tsa(τsa),\displaystyle\widetilde{\mathbb{P}}_{T}^{sa}(\tau^{sa})\propto w\Big{(}\mathbb{E}_{\tau^{sa},\pi}\big{[}\bar{Q}^{\pi}_{N,k}(s^{\prime},a^{\prime})\big{]};k,N\Big{)}\mathbb{P}_{T}^{sa}(\tau^{sa}), (17)
w(x;k,N)=[F(x)]k1[1F(x)]Nk,\displaystyle w(x;k,N)=\big{[}F(x)\big{]}^{k-1}\big{[}1-F(x)\big{]}^{N-k}, (18)

and $F(\cdot)$ is the cumulative distribution function. Furthermore, the value of $w(x;k,N)$ first increases and then decreases with $x$, and its maximum is obtained at the $\frac{k-1}{N-1}$ quantile, i.e., $x^{*}=F^{-1}\left(\frac{k-1}{N-1}\right)$. Similar to the non-regularized case, the reweighting factor $w$ reshapes the initial belief distribution towards being pessimistic in terms of $\mathbb{E}_{\tau,\pi}\big[\bar{Q}^{\pi}_{N,k}(s^{\prime},a^{\prime})\big]$.

Lemma 3 (Monotonicity of Regularized Pessimistic Bellman Backup Operator).

Assume that Q1Q2Q_{1}\geq Q_{2} holds element-wisely, then ¯N,kπQ1¯N,kπQ2\bar{\mathcal{B}}^{\pi}_{N,k}Q_{1}\geq\bar{\mathcal{B}}^{\pi}_{N,k}Q_{2} element-wisely.

Theorem 8 (Monotonicity in Regularized Alternating Markov Game).

The converged Q-function $\bar{Q}^{\pi}_{N,k}$ has the following properties:

  • Given any kk, the Q-function Q¯N,kπ\bar{Q}^{\pi}_{N,k} element-wisely decreases with N{k,k+1,}N\in\{k,k+1,\cdots\}.

  • Given any NN, the Q-function Q¯N,kπ\bar{Q}^{\pi}_{N,k} element-wisely increases with k{1,2,,N}k\in\{1,2,\cdots,N\}.

  • The Q-function Q¯N,Nπ\bar{Q}^{\pi}_{N,N} element-wisely increases with NN.

Proof of Theorem 4.

The proof of the contraction mapping basically follows the same steps as in the proof of Theorem 1. Let $Q_{1}$ and $Q_{2}$ be two arbitrary Q-functions.

¯Q1¯Q2\displaystyle\left\lVert\bar{\mathcal{B}}^{*}Q_{1}-\bar{\mathcal{B}}^{*}Q_{2}\right\rVert_{\infty}
=γmaxs,a|𝔼TN[minτ𝒯k𝔼τ[αlog𝔼μexp(1αQ1(s,a))]\displaystyle=\gamma\max_{s,a}\Bigg{|}\mathbb{E}_{\mathbb{P}_{T}^{N}}\bigg{[}\lfloor\min\rfloor_{\tau\in\mathcal{T}}^{k}\mathbb{E}_{\tau}\left[\alpha\log\mathbb{E}_{\mu}\exp{\left(\frac{1}{\alpha}Q_{1}(s^{\prime},a^{\prime})\right)}\right]
minτ𝒯k𝔼τ[αlog𝔼μexp(1αQ2(s,a))]]|\displaystyle\qquad\qquad\qquad\ \ -\lfloor\min\rfloor_{\tau\in\mathcal{T}}^{k}\mathbb{E}_{\tau}\left[\alpha\log\mathbb{E}_{\mu}\exp{\left(\frac{1}{\alpha}Q_{2}(s^{\prime},a^{\prime})\right)}\right]\bigg{]}\Bigg{|}
γmaxs,a(𝔼TN|minτ𝒯k𝔼τ[αlog𝔼μexp(1αQ1(s,a))]\displaystyle\leq\gamma\max_{s,a}\Bigg{(}\mathbb{E}_{\mathbb{P}_{T}^{N}}\Bigg{|}\lfloor\min\rfloor_{\tau\in\mathcal{T}}^{k}\mathbb{E}_{\tau}\left[\alpha\log\mathbb{E}_{\mu}\exp{\left(\frac{1}{\alpha}Q_{1}(s^{\prime},a^{\prime})\right)}\right]
minτ𝒯k𝔼τ[αlog𝔼μexp(1αQ2(s,a))]|)\displaystyle\qquad\qquad\qquad\ \ -\lfloor\min\rfloor_{\tau\in\mathcal{T}}^{k}\mathbb{E}_{\tau}\left[\alpha\log\mathbb{E}_{\mu}\exp{\left(\frac{1}{\alpha}Q_{2}(s^{\prime},a^{\prime})\right)}\right]\Bigg{|}\Bigg{)}
γmaxs,a(𝔼TN[maxτ𝒯|𝔼τ[αlog𝔼μexp(1αQ1(s,a))]𝔼τ[αlog𝔼μexp(1αQ2(s,a))]|])\displaystyle\leq\gamma\max_{s,a}\Bigg{(}\mathbb{E}_{\mathbb{P}_{T}^{N}}\Bigg{[}\max_{\tau\in\mathcal{T}}\Bigg{|}\mathbb{E}_{\tau}\left[\alpha\log\mathbb{E}_{\mu}\exp{\left(\frac{1}{\alpha}Q_{1}(s^{\prime},a^{\prime})\right)}\right]-\mathbb{E}_{\tau}\left[\alpha\log\mathbb{E}_{\mu}\exp{\left(\frac{1}{\alpha}Q_{2}(s^{\prime},a^{\prime})\right)}\right]\Bigg{|}\Bigg{]}\Bigg{)}
=γmaxs,a(𝔼TN[maxτ𝒯|𝔼τ[αlog𝔼μexp(1αQ1(s,a))αlog𝔼μexp(1αQ2(s,a))]|])\displaystyle=\gamma\max_{s,a}\Bigg{(}\mathbb{E}_{\mathbb{P}_{T}^{N}}\Bigg{[}\max_{\tau\in\mathcal{T}}\Bigg{|}\mathbb{E}_{\tau}\left[\alpha\log\mathbb{E}_{\mu}\exp{\left(\frac{1}{\alpha}Q_{1}(s^{\prime},a^{\prime})\right)}-\alpha\log\mathbb{E}_{\mu}\exp{\left(\frac{1}{\alpha}Q_{2}(s^{\prime},a^{\prime})\right)}\right]\Bigg{|}\Bigg{]}\Bigg{)}
γmaxs,a(𝔼TNQ1Q2)\displaystyle\leq\gamma\max_{s,a}\left(\mathbb{E}_{\mathbb{P}_{T}^{N}}\left\lVert Q_{1}-Q_{2}\right\rVert_{\infty}\right)
=γQ1Q2,\displaystyle=\gamma\left\lVert Q_{1}-Q_{2}\right\rVert_{\infty},

where the second inequality is obtained with Corollary 1, and the last inequality is due to αlog𝔼μexp(1αQ1(s,a))αlog𝔼μexp(1αQ2(s,a))Q1Q2\left\lVert\alpha\log\mathbb{E}_{\mu}\exp{\left(\frac{1}{\alpha}Q_{1}(s,a)\right)}-\alpha\log\mathbb{E}_{\mu}\exp{\left(\frac{1}{\alpha}Q_{2}(s,a)\right)}\right\rVert_{\infty}\leq\left\lVert Q_{1}-Q_{2}\right\rVert_{\infty}. We present its proof by following [34]:

Suppose ϵ=Q1Q2\epsilon=\lVert Q_{1}-Q_{2}\rVert_{\infty}, then

αlog𝔼μexp(1αQ1(s,a))\displaystyle\alpha\log\mathbb{E}_{\mu}\exp{\left(\frac{1}{\alpha}Q_{1}(s,a)\right)} αlog𝔼μexp(1αQ2(s,a)+ϵα)\displaystyle\leq\alpha\log\mathbb{E}_{\mu}\exp{\left(\frac{1}{\alpha}Q_{2}(s,a)+\frac{\epsilon}{\alpha}\right)}
=αlog𝔼μexp(1αQ2(s,a))+ϵ.\displaystyle=\alpha\log\mathbb{E}_{\mu}\exp{\left(\frac{1}{\alpha}Q_{2}(s,a)\right)}+\epsilon.

Similarly, αlog𝔼μexp(1αQ1(s,a))αlog𝔼μexp(1αQ2(s,a))ϵ\alpha\log\mathbb{E}_{\mu}\exp{\left(\frac{1}{\alpha}Q_{1}(s,a)\right)}\geq\alpha\log\mathbb{E}_{\mu}\exp{\left(\frac{1}{\alpha}Q_{2}(s,a)\right)}-\epsilon. The desired inequality is proved by putting them together.
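This non-expansiveness of the soft maximum is easy to verify numerically; the sketch below checks it for a random reference policy $\mu$ over a small discrete action set (an illustrative setup):

```python
import numpy as np

rng = np.random.default_rng(4)
alpha, n_actions = 0.5, 6
mu = rng.dirichlet(np.ones(n_actions))            # reference policy over actions

def soft_value(Q):
    # alpha * log E_mu[exp(Q/alpha)], the smoothed maximum used in the proof
    return alpha * np.log(np.sum(mu * np.exp(Q / alpha)))

for _ in range(10_000):
    Q1, Q2 = rng.normal(size=n_actions), rng.normal(size=n_actions)
    assert abs(soft_value(Q1) - soft_value(Q2)) <= np.max(np.abs(Q1 - Q2)) + 1e-9
print("The soft maximum is non-expansive on all sampled instances.")
```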

Next, we prove π¯(a|s)μ(a|s)exp(1αQ¯(s,a))\bar{\pi}^{*}(a|s)\propto\mu(a|s)\exp{\left(\frac{1}{\alpha}\bar{Q}^{*}(s,a)\right)} is the optimal policy for J¯(π;μ)\bar{J}(\pi;\mu).

First, for any policy π\pi^{\prime},

¯Q¯π(s,a)\displaystyle\bar{\mathcal{B}}^{*}\bar{Q}^{\pi^{\prime}}(s,a)
=r(s,a)+γ𝔼TN[minτ𝒯k𝔼τ[αlog𝔼μexp(1αQ¯π(s,a))]]\displaystyle=r(s,a)+\gamma\mathbb{E}_{\mathbb{P}_{T}^{N}}\Bigg{[}\lfloor\min\rfloor_{\tau\in\mathcal{T}}^{k}\mathbb{E}_{\tau}\bigg{[}\alpha\log\mathbb{E}_{\mu}\exp{\left(\frac{1}{\alpha}\bar{Q}^{\pi^{\prime}}(s^{\prime},a^{\prime})\right)}\bigg{]}\Bigg{]}
=r(s,a)+γ𝔼TN[minτ𝒯k𝔼τ[αlog𝔼μexp(1αQ¯π(s,a))\displaystyle=r(s,a)+\gamma\mathbb{E}_{\mathbb{P}_{T}^{N}}\Bigg{[}\lfloor\min\rfloor_{\tau\in\mathcal{T}}^{k}\mathbb{E}_{\tau}\Bigg{[}\alpha\log\mathbb{E}_{\mu}\exp{\left(\frac{1}{\alpha}\bar{Q}^{\pi^{\prime}}(s^{\prime},a^{\prime})\right)}
minπαDKL(.π(|s)||μ(|s)exp1αQ¯π(s,)𝔼μexp(1αQ¯π(s,a)))]]\displaystyle\qquad\qquad\qquad\qquad\qquad\qquad\qquad-\min_{\pi}\alpha D_{\mathrm{KL}}\left(\big{.}\pi(\cdot|s^{\prime})\;\middle|\middle|\;\frac{\mu(\cdot|s^{\prime})\exp{\frac{1}{\alpha}\bar{Q}^{\pi^{\prime}}(s^{\prime},\cdot)}}{\mathbb{E}_{\mu}\exp{\left(\frac{1}{\alpha}\bar{Q}^{\pi^{\prime}}(s^{\prime},a^{\prime})\right)}}\right)\Bigg{]}\Bigg{]}
=r(s,a)+γ𝔼TN[minτ𝒯k𝔼τ[maxπ(𝔼π[Q¯π(s,a)]αDKL(.π(|s)||μ(|s)))]]\displaystyle=r(s,a)+\gamma\mathbb{E}_{\mathbb{P}_{T}^{N}}\Bigg{[}\lfloor\min\rfloor_{\tau\in\mathcal{T}}^{k}\mathbb{E}_{\tau}\bigg{[}\max_{\pi}\Big{(}\mathbb{E}_{\pi}\big{[}\bar{Q}^{\pi^{\prime}}(s^{\prime},a^{\prime})\big{]}-\alpha D_{\mathrm{KL}}\left(\big{.}\pi(\cdot|s^{\prime})\;\middle|\middle|\;\mu(\cdot|s^{\prime})\right)\Big{)}\bigg{]}\Bigg{]}
r(s,a)+γ𝔼TN[minτ𝒯k𝔼τ[𝔼π[Q¯π(s,a)]αDKL(.π(|s)||μ(|s))]]\displaystyle\geq r(s,a)+\gamma\mathbb{E}_{\mathbb{P}_{T}^{N}}\Bigg{[}\lfloor\min\rfloor_{\tau\in\mathcal{T}}^{k}\mathbb{E}_{\tau}\bigg{[}\mathbb{E}_{\pi^{\prime}}\big{[}\bar{Q}^{\pi^{\prime}}(s^{\prime},a^{\prime})\big{]}-\alpha D_{\mathrm{KL}}\left(\big{.}\pi^{\prime}(\cdot|s^{\prime})\;\middle|\middle|\;\mu(\cdot|s^{\prime})\right)\bigg{]}\Bigg{]}
=¯πQ¯π(s,a)\displaystyle=\bar{\mathcal{B}}^{\pi^{\prime}}\bar{Q}^{\pi^{\prime}}(s,a)
=Q¯π(s,a),s,a.\displaystyle=\bar{Q}^{\pi^{\prime}}(s,a),\qquad\forall s,a.

By applying Lemma 3 recursively, we obtain

Q¯(s,a)=limn(¯)nQ¯π(s,a)¯Q¯π(s,a)Q¯π(s,a),s,a.\displaystyle\bar{Q}^{*}(s,a)=\lim_{n\rightarrow\infty}\left(\bar{\mathcal{B}}^{*}\right)^{n}\bar{Q}^{\pi^{\prime}}(s,a)\geq\cdots\geq\bar{\mathcal{B}}^{*}\bar{Q}^{\pi^{\prime}}(s,a)\geq\bar{Q}^{\pi^{\prime}}(s,a),\quad\forall s,a. (19)

Besides,

¯π¯Q¯(s,a)\displaystyle\bar{\mathcal{B}}^{\bar{\pi}^{*}}\bar{Q}^{*}(s,a)
\displaystyle=r(s,a)+\gamma\mathbb{E}_{\mathbb{P}_{T}^{N}}\bigg{[}\lfloor\min\rfloor_{\tau\in\mathcal{T}}^{k}\mathbb{E}_{\tau,\bar{\pi}^{*}}\Big{[}\bar{Q}^{*}(s^{\prime},a^{\prime})-\alpha D_{\mathrm{KL}}\left(\big{.}\bar{\pi}^{*}(\cdot|s^{\prime})\;\middle|\middle|\;\mu(\cdot|s^{\prime})\right)\Big{]}\bigg{]}
=r(s,a)+γ𝔼TN[minτ𝒯k𝔼τ[αlog𝔼μexp(1αQ¯(s,a))]]\displaystyle=r(s,a)+\gamma\mathbb{E}_{\mathbb{P}_{T}^{N}}\left[\lfloor\min\rfloor_{\tau\in\mathcal{T}}^{k}\mathbb{E}_{\tau}\left[\alpha\log\mathbb{E}_{\mu}\exp{\left(\frac{1}{\alpha}\bar{Q}^{*}(s^{\prime},a^{\prime})\right)}\right]\right]
=Q¯(s,a),s,a.\displaystyle=\bar{Q}^{*}(s,a),\qquad\forall s,a.

By repeatedly applying ¯π¯\bar{\mathcal{B}}^{\bar{\pi}^{*}} to the above equation, we obtain

Q¯π¯(s,a)=limn(¯π¯)nQ¯(s,a)==¯π¯Q¯(s,a)=Q¯(s,a),s,a.\displaystyle\bar{Q}^{\bar{\pi}^{*}}(s,a)=\lim_{n\rightarrow\infty}\left(\bar{\mathcal{B}}^{\bar{\pi}^{*}}\right)^{n}\bar{Q}^{*}(s,a)=\cdots=\bar{\mathcal{B}}^{\bar{\pi}^{*}}\bar{Q}^{*}(s,a)=\bar{Q}^{*}(s,a),\quad\forall s,a. (20)

By combining equations (19) and (20), we have

Q¯π¯(s,a)Q¯π(s,a),π,s,a.\displaystyle\bar{Q}^{\bar{\pi}^{*}}(s,a)\geq\bar{Q}^{\pi^{\prime}}(s,a),\quad\forall\pi^{\prime},\forall s,a. (21)

Finally, by expanding $\bar{J}$ as stated in Theorem 6 and applying (21), the proof is completed:

J¯(π¯;μ)\displaystyle\bar{J}(\bar{\pi}^{*};\mu) =𝔼ρ0,π¯[Q¯π¯(s0,a0)αDKL(.π¯(|s0)||μ(|s0))]\displaystyle=\mathbb{E}_{\rho_{0},\bar{\pi}^{*}}\Big{[}\bar{Q}^{\bar{\pi}^{*}}(s_{0},a_{0})-\alpha D_{\mathrm{KL}}\left(\big{.}\bar{\pi}^{*}(\cdot|s_{0})\;\middle|\middle|\;\mu(\cdot|s_{0})\right)\Big{]}
𝔼ρ0,π¯[Q¯π(s0,a0)αDKL(.π¯(|s0)||μ(|s0))]\displaystyle\geq\mathbb{E}_{\rho_{0},\bar{\pi}^{*}}\Big{[}\bar{Q}^{\pi^{\prime}}(s_{0},a_{0})-\alpha D_{\mathrm{KL}}\left(\big{.}\bar{\pi}^{*}(\cdot|s_{0})\;\middle|\middle|\;\mu(\cdot|s_{0})\right)\Big{]}
𝔼ρ0,π[Q¯π(s0,a0)αDKL(.π(|s0)||μ(|s0))]\displaystyle\geq\mathbb{E}_{\rho_{0},\pi^{\prime}}\Big{[}\bar{Q}^{\pi^{\prime}}(s_{0},a_{0})-\alpha D_{\mathrm{KL}}\left(\big{.}\pi^{\prime}(\cdot|s_{0})\;\middle|\middle|\;\mu(\cdot|s_{0})\right)\Big{]}
=J¯(π;μ),π.\displaystyle=\bar{J}(\pi^{\prime};\mu),\quad\forall\pi^{\prime}.

Proof of Theorem 5.

We first prove J(πi+1)>J(πi)J(\pi_{i+1})>J(\pi_{i}). As the iteration requires J¯(πi+1;πi)>J¯(πi;πi)=J(πi)\bar{J}(\pi_{i+1};\pi_{i})>\bar{J}(\pi_{i};\pi_{i})=J(\pi_{i}), it is sufficient to prove J(πi+1)J¯(πi+1;πi)J(\pi_{i+1})\geq\bar{J}(\pi_{i+1};\pi_{i}). We do that by showing Qπi+1Q¯πi+1Q^{\pi_{i+1}}\geq\bar{Q}^{\pi_{i+1}} element-wisely.

First,

πi+1Q¯πi+1(s,a)Q¯πi+1(s,a)\displaystyle\mathcal{B}^{\pi_{i+1}}\bar{Q}^{\pi_{i+1}}(s,a)-\bar{Q}^{\pi_{i+1}}(s,a)
=πi+1Q¯πi+1(s,a)¯πi+1Q¯πi+1(s,a)\displaystyle=\mathcal{B}^{\pi_{i+1}}\bar{Q}^{\pi_{i+1}}(s,a)-\bar{\mathcal{B}}^{\pi_{i+1}}\bar{Q}^{\pi_{i+1}}(s,a)
=γ𝔼TN[minτ𝒯k𝔼τ,πi+1[Q¯πi+1(s,a)]]\displaystyle=\gamma\mathbb{E}_{\mathbb{P}_{T}^{N}}\bigg{[}\lfloor\min\rfloor_{\tau\in\mathcal{T}}^{k}\mathbb{E}_{\tau,\pi_{i+1}}\Big{[}\bar{Q}^{\pi_{i+1}}(s^{\prime},a^{\prime})\Big{]}\bigg{]}
γ𝔼TN[minτ𝒯k𝔼τ,πi+1[Q¯πi+1(s,a)αDKL(.πi+1(|s)||πi(|s))]]\displaystyle\quad-\gamma\mathbb{E}_{\mathbb{P}_{T}^{N}}\bigg{[}\lfloor\min\rfloor_{\tau\in\mathcal{T}}^{k}\mathbb{E}_{\tau,\pi_{i+1}}\Big{[}\bar{Q}^{\pi_{i+1}}(s^{\prime},a^{\prime})-\alpha D_{\mathrm{KL}}\left(\big{.}\pi_{i+1}(\cdot|s^{\prime})\;\middle|\middle|\;\pi_{i}(\cdot|s^{\prime})\right)\Big{]}\bigg{]}
γ𝔼TN[minτ𝔼τ[αDKL(.πi+1(|s)||πi(|s))]]\displaystyle\geq\gamma\mathbb{E}_{\mathbb{P}_{T}^{N}}\bigg{[}\min_{\tau}\mathbb{E}_{\tau}\Big{[}\alpha D_{\mathrm{KL}}\left(\big{.}\pi_{i+1}(\cdot|s^{\prime})\;\middle|\middle|\;\pi_{i}(\cdot|s^{\prime})\right)\Big{]}\bigg{]}
0,s,a,\displaystyle\geq 0,\qquad\forall s,a, (22)

where the first inequality is due to Lemma 1, and the second inequality is due to the non-negativity of the KL divergence.

Then, by recursively applying Lemma 2 we obtain

Qπi+1(s,a)=limn(πi+1)nQ¯πi+1(s,a)πi+1Q¯πi+1(s,a)Q¯πi+1(s,a),s,a.\displaystyle Q^{\pi_{i+1}}(s,a)=\lim_{n\rightarrow\infty}\left(\mathcal{B}^{\pi_{i+1}}\right)^{n}\bar{Q}^{\pi_{i+1}}(s,a)\geq\cdots\geq\mathcal{B}^{\pi_{i+1}}\bar{Q}^{\pi_{i+1}}(s,a)\geq\bar{Q}^{\pi_{i+1}}(s,a),\quad\forall s,a. (23)

By substituting into J(πi+1)J(\pi_{i+1}) and J¯(πi+1;πi)\bar{J}(\pi_{i+1};\pi_{i}), we have

J(πi+1)\displaystyle J(\pi_{i+1}) =𝔼ρ0,πi+1Qπi+1(s0,a0)\displaystyle=\mathbb{E}_{\rho_{0},\pi_{i+1}}Q^{\pi_{i+1}}(s_{0},a_{0})
𝔼ρ0,πi+1Q¯πi+1(s0,a0)\displaystyle\geq\mathbb{E}_{\rho_{0},\pi_{i+1}}\bar{Q}^{\pi_{i+1}}(s_{0},a_{0})
𝔼ρ0,πi+1[Q¯πi+1(s0,a0)αDKL(πi+1(|s0)||πi(|s0))]\displaystyle\geq\mathbb{E}_{\rho_{0},\pi_{i+1}}\left[\bar{Q}^{\pi_{i+1}}(s_{0},a_{0})-\alpha D_{\mathrm{KL}}\left(\pi_{i+1}(\cdot|s_{0})\;\middle|\middle|\;\pi_{i}(\cdot|s_{0})\right)\right]
=J¯(πi+1;πi).\displaystyle=\bar{J}(\pi_{i+1};\pi_{i}). (24)

To summarize, $J(\pi_{i+1})\geq\bar{J}(\pi_{i+1};\pi_{i})>\bar{J}(\pi_{i};\pi_{i})=J(\pi_{i})$, which establishes the claimed improvement.

Next, we consider the special case where $\{\pi_{i}\}$ are obtained via the regularized policy optimization in Theorem 4. For the $(i+1)$th step, $\pi_{i+1}$ is the optimal solution for the sub-problem of maximizing $\bar{J}(\pi;\pi_{i})$. Thus, according to (21), $\bar{Q}^{\pi_{i+1}}(s,a)\geq\bar{Q}^{\pi^{\prime}}(s,a),\forall\pi^{\prime},\forall s,a$. For $\pi^{\prime}=\pi_{i}$, the KL term in the Q-value vanishes and we have $\bar{Q}^{\pi_{i+1}}(s,a)\geq Q^{\pi_{i}}(s,a)$. By combining it with (23), we obtain

Qπi+1(s,a)Q¯πi+1(s,a)Qπi(s,a),s,a.\displaystyle Q^{\pi_{i+1}}(s,a)\geq\bar{Q}^{\pi_{i+1}}(s,a)\geq Q^{\pi_{i}}(s,a),\quad\forall s,a. (25)

Then, together with the monotonicity in (25), the boundedness of $Q$ implies the existence of $\lim_{i\rightarrow\infty}Q^{\pi_{i}}(s,a)$, and also $\lim_{i\rightarrow\infty}Q^{\pi_{i}}(s,a)=\lim_{i\rightarrow\infty}\bar{Q}^{\pi_{i}}(s,a),\forall s,a$.

For any s,a,as,a,a^{\prime} satisfying limiQπi(s,a)>limiQπi(s,a)\lim_{i\rightarrow\infty}Q^{\pi_{i}}(s,a)>\lim_{i\rightarrow\infty}Q^{\pi_{i}}(s,a^{\prime}), it satisfies limiQ¯πi(s,a)>limiQ¯πi(s,a)\lim_{i\rightarrow\infty}\bar{Q}^{\pi_{i}}(s,a)>\lim_{i\rightarrow\infty}\bar{Q}^{\pi_{i}}(s,a^{\prime}). Thus,

N,ϵ>0jN:Q¯πj(s,a)Q¯πj(s,a)ϵ.\displaystyle\exists N,\epsilon>0\quad\forall j\geq N:\bar{Q}^{\pi_{j}}(s,a)-\bar{Q}^{\pi_{j}}(s,a^{\prime})\geq\epsilon. (26)

According to Theorem 4, the updated policy is of the form (strictly speaking, Theorem 4 shows $\bar{\pi}^{*}(a|s)\propto\mu(a|s)\exp\left(\frac{1}{\alpha}\bar{Q}^{*}_{N,k}(s,a)\right)$; besides, we have shown $\bar{Q}_{N,k}^{\bar{\pi}^{*}}(s,a)=\bar{Q}_{N,k}^{*}(s,a)$ in (20), and thus $\bar{\pi}^{*}(a|s)\propto\mu(a|s)\exp\left(\frac{1}{\alpha}\bar{Q}^{\bar{\pi}^{*}}_{N,k}(s,a)\right)$):

πi(a|s)πi1(a|s)exp(1αQ¯πi(s,a)).\displaystyle\pi_{i}(a|s)\propto\pi_{i-1}(a|s)\exp{\left(\frac{1}{\alpha}\bar{Q}^{\pi_{i}}(s,a)\right)}.

Then, the policy ratio can be rewritten and bounded as

πi(a|s)πi(a|s)=πN(a|s)πN(a|s)exp(j=NiQ¯πj(s,a)Q¯πj(s,a)α)πN(a|s)πN(a|s)exp(iNαϵ),iN.\displaystyle\frac{\pi_{i}(a|s)}{\pi_{i}(a^{\prime}|s)}=\frac{\pi_{N}(a|s)}{\pi_{N}(a^{\prime}|s)}\exp{\left(\sum_{j=N}^{i}\frac{\bar{Q}^{\pi_{j}}(s,a)-\bar{Q}^{\pi_{j}}(s,a^{\prime})}{\alpha}\right)}\geq\frac{\pi_{N}(a|s)}{\pi_{N}(a^{\prime}|s)}\exp{\left(\frac{i-N}{\alpha}\epsilon\right)},\ \ \forall i\geq N. (27)

With the prerequisite of π0(a|s)>0,s,a\pi_{0}(a|s)>0,\forall s,a and the form of policy update, we know πN(a|s)>0,s,a\pi_{N}(a|s)>0,\forall s,a, and further πN(a|s)πN(a|s)>0\frac{\pi_{N}(a|s)}{\pi_{N}(a^{\prime}|s)}>0. Then, as ii approaches infinity in (27), we obtain πi(a|s)πi(a|s)\frac{\pi_{i}(a|s)}{\pi_{i}(a^{\prime}|s)}\rightarrow\infty. ∎
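The divergence of the policy ratio in (27) can be illustrated with a toy multiplicative update at a single state, where a fixed action-value vector stands in for $\bar{Q}^{\pi_{i}}$ (an illustrative simplification, since in the actual iteration the Q-value changes with $\pi_{i}$):

```python
import numpy as np

alpha = 1.0
Q = np.array([1.0, 0.5, 0.0])        # surrogate action values at a fixed state
pi = np.ones(3) / 3.0                # pi_0 > 0 everywhere

for i in range(1, 11):
    pi = pi * np.exp(Q / alpha)      # multiplicative update pi_i ~ pi_{i-1} * exp(Q/alpha)
    pi /= pi.sum()
    # the ratio pi(a0)/pi(a1) grows as exp(i * (Q[0] - Q[1]) / alpha)
    print(i, pi.round(4), round(pi[0] / pi[1], 3))
```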

Appendix C Iterative Regularized Policy Optimization as Expectation–Maximization with Structured Variational Posterior

This section recasts the iterative regularized policy optimization as an Expectation-Maximization algorithm for policy optimization, where the Expectation step corresponds to a structured variational inference procedure for the dynamics. To simplify the presentation, we consider a horizon of length $L$ and let $\gamma=1$ (thus omitted in the derivation). For an infinite horizon $L\to\infty$, the discount factor $\gamma$ can be readily recovered by modifying the transition dynamics such that any action produces a transition into a terminal state with probability $1-\gamma$.

C.1 Review of RL as Probabilistic Inference

Figure 4: Probabilistic graphical model for RL as inference.

We first review the general framework of casting RL as probabilistic inference [60]. It starts by embedding the MDP into a probabilistic graphical model, as shown in Figure 4. Apart from the basic elements of the MDP, an additional binary random variable $\mathcal{O}_{t}$ is introduced, where $\mathcal{O}_{t}=1$ denotes that the action at time step $t$ is optimal, and $\mathcal{O}_{t}=0$ denotes suboptimality. Its distribution is defined as (assuming the reward function is non-positive such that the probability is not larger than one; if the assumption is unsatisfied, we can subtract its maximum from the reward function without changing the optimal policy)

p(𝒪t=1|st,at)=exp(r(st,at)α),\displaystyle p(\mathcal{O}_{t}=1|s_{t},a_{t})=\exp{\left(\frac{r(s_{t},a_{t})}{\alpha}\right)}, (28)

where $\alpha$ is a hyperparameter. As we focus on optimality, in the following we drop $=1$ and use $\mathcal{O}_{t}$ to denote $\mathcal{O}_{t}=1$ for conciseness. The remaining random variables in the probabilistic graphical model are $s_{t}$ and $a_{t}$, whose distributions are defined by the system dynamics $\rho_{0}(s)$ and $T(s^{\prime}|s,a)$ as well as a reference policy $\mu(a|s)$. Then, the joint distribution over all random variables for $t\in\{0,1,\cdots,L\}$ can be written as

(s0:L,a0:L,𝒪0:L)=ρ0(s0)t=0L1T(st+1|st,at)μ(at|st)μ(aL|sL)exp(t=0Lr(st,at)α).\displaystyle\mathbb{P}\left(s_{0:L},a_{0:L},\mathcal{O}_{0:L}\right)=\rho_{0}(s_{0})\cdot\prod_{t=0}^{L-1}T(s_{t+1}|s_{t},a_{t})\mu(a_{t}|s_{t})\cdot\mu(a_{L}|s_{L})\exp{\left(\sum_{t=0}^{L}\frac{r(s_{t},a_{t})}{\alpha}\right)}. (29)

Regarding optimal control, a natural question to ask is what the trajectory should be like given the optimality over all time steps. This raises the posterior inference of (s0:L,a0:L|𝒪0:L)\mathbb{P}\left(s_{0:L},a_{0:L}|\mathcal{O}_{0:L}\right). According to d-separation, the exact posterior follows the form of

(s0:L,a0:L|𝒪0:L)=(s0|𝒪0:L)t=0L1(st+1|st,at,𝒪0:L)(at|st,𝒪0:L)(aL|sL,𝒪0:L).\displaystyle\mathbb{P}\left(s_{0:L},a_{0:L}|\mathcal{O}_{0:L}\right)=\mathbb{P}(s_{0}|\mathcal{O}_{0:L})\cdot\prod_{t=0}^{L-1}\mathbb{P}(s_{t+1}|s_{t},a_{t},\mathcal{O}_{0:L})\mathbb{P}(a_{t}|s_{t},\mathcal{O}_{0:L})\cdot\mathbb{P}(a_{L}|s_{L},\mathcal{O}_{0:L}). (30)

Notice that the dynamics posteriors $\mathbb{P}(s_{0}|\mathcal{O}_{0:L})$ and $\mathbb{P}(s_{t+1}|s_{t},a_{t},\mathcal{O}_{0:L})$ depend on $\mathcal{O}_{0:L}$, and in fact their concrete mathematical expressions are inconsistent with those of the system dynamics $\rho_{0}(s_{0})$ and $T(s_{t+1}|s_{t},a_{t})$ [60]. This essentially assumes that the dynamics itself can be controlled when conditioning on optimality, which is impractical in general.

Variational inference can be applied to correct this issue. Concretely, define the variational approximation to the exact posterior by

^(s0:L,a0:L)=ρ0(s0)t=0L1T(st+1|st,at)π(at|st)π(aL|sL).\displaystyle\widehat{\mathbb{P}}\left(s_{0:L},a_{0:L}\right)=\rho_{0}(s_{0})\cdot\prod_{t=0}^{L-1}T(s_{t+1}|s_{t},a_{t})\pi(a_{t}|s_{t})\cdot\pi(a_{L}|s_{L}). (31)

Its difference from (30) is that the dynamics posterior is enforced to match the actual dynamics. Under this structure, the variational posterior can be adjusted by optimizing $\pi$ to best approximate the exact posterior. The optimization is performed under the KL divergence, i.e.,

DKL(^(s0:L,a0:L)||(s0:L,a0:L|𝒪0:L))=\bigintssss^(s0:L,a0:L)log^(s0:L,a0:L)(s0:L,a0:L|𝒪0:L)ds0:Lda0:L\displaystyle D_{\mathrm{KL}}\left(\widehat{\mathbb{P}}\left(s_{0:L},a_{0:L}\right)\;\middle|\middle|\;\mathbb{P}\left(s_{0:L},a_{0:L}|\mathcal{O}_{0:L}\right)\right)\!=\!\!\bigintssss\widehat{\mathbb{P}}\left(s_{0:L},a_{0:L}\right)\log\frac{\widehat{\mathbb{P}}\left(s_{0:L},a_{0:L}\right)}{\mathbb{P}\left(s_{0:L},a_{0:L}|\mathcal{O}_{0:L}\right)}ds_{0:L}da_{0:L}
=\bigintssss^(s0:L,a0:L)log^(s0:L,a0:L)(s0:L,a0:L,𝒪0:L)ds0:Lda0:L+log(𝒪0:L)\displaystyle=\bigintssss\widehat{\mathbb{P}}\left(s_{0:L},a_{0:L}\right)\log\frac{\widehat{\mathbb{P}}\left(s_{0:L},a_{0:L}\right)}{\mathbb{P}\left(s_{0:L},a_{0:L},\mathcal{O}_{0:L}\right)}ds_{0:L}da_{0:L}+\log\mathbb{P}(\mathcal{O}_{0:L})
=𝔼ρ0,T,π[t=0L(r(st,at)α+logπ(at|st)μ(at|st))]+log(𝒪0:L)\displaystyle=\mathbb{E}_{\rho_{0},T,\pi}\left[\sum_{t=0}^{L}\left(-\frac{r(s_{t},a_{t})}{\alpha}+\log\frac{\pi(a_{t}|s_{t})}{\mu(a_{t}|s_{t})}\right)\right]+\log\mathbb{P}(\mathcal{O}_{0:L})
=1α𝔼ρ0,T,π[t=0L(r(st,at)+αDKL(.π(|st)||.μ(|st)))]+log(𝒪0:L),\displaystyle=\frac{1}{\alpha}\mathbb{E}_{\rho_{0},T,\pi}\left[\sum_{t=0}^{L}\bigg{(}-r(s_{t},a_{t})+\alpha D_{\mathrm{KL}}\left(\Big{.}\pi(\cdot|s_{t})\;\middle|\middle|\;\Big{.}\mu(\cdot|s_{t})\right)\bigg{)}\right]+\log\mathbb{P}(\mathcal{O}_{0:L}), (32)

where the third equation is obtained by substituting (29) and (31). As the second term in (C.1) is constant, minimizing the above KL divergence is equivalent to maximizing the cumulative reward with a policy regularizer. Several well-known online RL methods can be treated as algorithmic instances of this framework [34, 35].
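For a one-step (bandit) special case, the identity underlying (C.1), namely that minimizing the KL divergence to the optimality-conditioned posterior equals maximizing the KL-regularized return up to a constant, can be checked directly; the sketch below uses arbitrary discrete distributions and non-positive rewards (illustrative assumptions only).

```python
import numpy as np

rng = np.random.default_rng(7)
alpha, n_actions = 0.7, 5

r = -rng.uniform(0.0, 2.0, size=n_actions)          # non-positive rewards, as assumed above
mu = rng.dirichlet(np.ones(n_actions))              # reference policy mu
q = rng.dirichlet(np.ones(n_actions))               # variational policy pi (denoted q here)

post = mu * np.exp(r / alpha)                       # exact posterior p(a | O), up to Z
logZ = np.log(post.sum())
post /= post.sum()

kl_q_post = np.sum(q * np.log(q / post))
regularized_return = np.sum(q * r) - alpha * np.sum(q * np.log(q / mu))
# identity: KL(q || p(.|O)) = -(1/alpha) * (E_q[r] - alpha * KL(q||mu)) + log Z
assert abs(kl_q_post - (-regularized_return / alpha + logZ)) < 1e-10
```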

To summarize, the structured variational posterior of form (31) is vital to ensure that the inferred policy is meaningful in the actual environment.

C.2 Pessimism-Modulated Dynamics Belief as Structured Variational Posterior

Figure 5: Probabilistic graphical model for offline RL as inference.

The above probabilistic graphical model is devised for online RL. In the offline setting, the environment cannot be interacted with to minimize (C.1). A straightforward modification to reflect this is to add the transition dynamics as a random variable in the graph, as shown in Figure 5. We assume the transition follows a predefined belief distribution, i.e., $\mathbb{P}_{T}^{sa}(\tau^{sa})$ introduced in Subsection 3.1. To make its dependence on $(s,a)$ explicit, let $\mathbb{P}_{T}(\tau^{sa}|s,a)$ denote $\mathbb{P}_{T}^{sa}(\tau^{sa})$ hereafter. For conciseness, we drop the superscript $sa$ in $\tau^{sa}$ in the remainder.

The joint distribution over all random variables in Figure 5 for $t\in\{0,1,\cdots,L\}$ can be written as

(s0:L,a0:L,τ0:L1,𝒪0:L)=ρ0(s0)\displaystyle\mathbb{P}\left(s_{0:L},a_{0:L},\tau_{0:L-1},\mathcal{O}_{0:L}\right)=\rho_{0}(s_{0}) t=0L1T(τt|st,at)τt(st+1)μ(at|st)\displaystyle\cdot\prod_{t=0}^{L-1}\mathbb{P}_{T}(\tau_{t}|s_{t},a_{t})\tau_{t}(s_{t+1})\mu(a_{t}|s_{t})
μ(aL|sL)exp(t=0Lr(st,at)α).\displaystyle\cdot\mu(a_{L}|s_{L})\exp{\left(\sum_{t=0}^{L}\frac{r(s_{t},a_{t})}{\alpha}\right)}. (33)

Similar to the online setting, we ask what the trajectory should look like given optimality over all time steps. By examining the conditional independence in the probabilistic graphical model, the exact posterior follows the form of

(s0:L,a0:L,τ0:L1|𝒪0:L)=(s0|𝒪0:L)\displaystyle\mathbb{P}\left(s_{0:L},a_{0:L},\tau_{0:L-1}|\mathcal{O}_{0:L}\right)=\mathbb{P}(s_{0}|\mathcal{O}_{0:L}) t=0L1(τt|st,at,𝒪0:L)(st+1|τt,𝒪0:L)(at|st,𝒪0:L)\displaystyle\cdot\prod_{t=0}^{L-1}\mathbb{P}(\tau_{t}|s_{t},a_{t},\mathcal{O}_{0:L})\mathbb{P}(s_{t+1}|\tau_{t},\mathcal{O}_{0:L})\mathbb{P}(a_{t}|s_{t},\mathcal{O}_{0:L})
(aL|sL,𝒪0:L).\displaystyle\cdot\mathbb{P}(a_{L}|s_{L},\mathcal{O}_{0:L}). (34)

Unsurprisingly, $s_{0:L}$ and $\tau_{0:L-1}$ again depend on $\mathcal{O}_{0:L}$, indicating that the system transition and its belief can be controlled when conditioning on optimality. In other words, this leads to over-optimistic inference.

To emphasize pessimism, we define a novel structured variational posterior:

^(s0:L,a0:L,τ0:L1)=ρ0(s0)t=0L1~T(τt|st,at)τt(st+1)π(at|st)π(aL|sL),\displaystyle\widehat{\mathbb{P}}\left(s_{0:L},a_{0:L},\tau_{0:L-1}\right)=\rho_{0}(s_{0})\cdot\prod_{t=0}^{L-1}\widetilde{\mathbb{P}}_{T}(\tau_{t}|s_{t},a_{t})\tau_{t}(s_{t+1})\pi(a_{t}|s_{t})\cdot\pi(a_{L}|s_{L}), (35)

with ~T\widetilde{\mathbb{P}}_{T} being the Pessimism-Modulated Dynamics Belief (PMDB) constructed via the KL-regularized AMG (see Theorem 7):

~T(τ|s,a)w(𝔼τ,π[Q¯N,kπ(s,a)];k,N)T(τ|s,a),\displaystyle\widetilde{\mathbb{P}}_{T}(\tau|s,a)\propto w\Big{(}\mathbb{E}_{\tau,\pi}\big{[}\bar{Q}^{\pi}_{N,k}(s^{\prime},a^{\prime})\big{]};k,N\Big{)}\mathbb{P}_{T}(\tau|s,a), (36)
w(x;k,N)=[F(x)]k1[1F(x)]Nk,\displaystyle w(x;k,N)=\big{[}F(x)\big{]}^{k-1}\big{[}1-F(x)\big{]}^{N-k}, (37)

$F(\cdot)$ is the cumulative distribution function and $\bar{Q}^{\pi}_{N,k}$ is the Q-value for the KL-regularized AMG. As discussed, $w$ reshapes the initial belief distribution towards being pessimistic in terms of $\mathbb{E}_{\tau,\pi}\big[\bar{Q}^{\pi}_{N,k}(s^{\prime},a^{\prime})\big]$.

It seems that we need to solve the AMG to obtain $\bar{Q}^{\pi}_{N,k}$ and further define $\widetilde{\mathbb{P}}_{T}$. In fact, $\bar{Q}^{\pi}_{N,k}$ is also the Q-value for the MDP considered in (35). This can be verified by checking Theorem 7: the KL-regularized AMG is equivalent to the MDP with transition $\widetilde{T}(s^{\prime}|s,a)=\mathbb{E}_{\widetilde{\mathbb{P}}_{T}}\left[\tau(s^{\prime})\right]$, which can be implemented by first sampling $\tau\sim\widetilde{\mathbb{P}}_{T}$ and then $s^{\prime}\sim\tau$, exactly the procedure in (35).

To best approximate the exact posterior, we optimize the variational posterior by minimizing

DKL(^(s0:L,a0:L,τ0:L1)||(s0:L,a0:L,τ0:L1|𝒪0:L))\displaystyle D_{\mathrm{KL}}\left(\widehat{\mathbb{P}}\left(s_{0:L},a_{0:L},\tau_{0:L-1}\right)\;\middle|\middle|\;\mathbb{P}\left(s_{0:L},a_{0:L},\tau_{0:L-1}|\mathcal{O}_{0:L}\right)\right)
=\bigintssss^(s0:L,a0:L,τ0:L1)log^(s0:L,a0:L,τ0:L1)(s0:L,a0:L,τ0:L1|𝒪0:L)ds0:Lda0:L\displaystyle=\bigintssss\widehat{\mathbb{P}}\left(s_{0:L},a_{0:L},\tau_{0:L-1}\right)\log\frac{\widehat{\mathbb{P}}\left(s_{0:L},a_{0:L},\tau_{0:L-1}\right)}{\mathbb{P}\left(s_{0:L},a_{0:L},\tau_{0:L-1}|\mathcal{O}_{0:L}\right)}ds_{0:L}da_{0:L}
=\bigintssss^(s0:L,a0:L,τ0:L1)log^(s0:L,a0:L,τ0:L1)(s0:L,a0:L,τ0:L1,𝒪0:L)ds0:Lda0:L+log(𝒪0:L)\displaystyle=\bigintssss\widehat{\mathbb{P}}\left(s_{0:L},a_{0:L},\tau_{0:L-1}\right)\log\frac{\widehat{\mathbb{P}}\left(s_{0:L},a_{0:L},\tau_{0:L-1}\right)}{\mathbb{P}\left(s_{0:L},a_{0:L},\tau_{0:L-1},\mathcal{O}_{0:L}\right)}ds_{0:L}da_{0:L}+\log\mathbb{P}(\mathcal{O}_{0:L})
=𝔼ρ0,~T,τ0:L1,π[t=0L(r(st,at)α+DKL(.π(|st)||μ(|st)))]M(π;μ)\displaystyle=\underbrace{\mathbb{E}_{\rho_{0},\widetilde{\mathbb{P}}_{T},\tau_{0:L-1},\pi}\Bigg{[}\sum_{t=0}^{L}\left(-\frac{r(s_{t},a_{t})}{\alpha}+D_{\mathrm{KL}}\left(\Big{.}\pi(\cdot|s_{t})\;\middle|\middle|\;\mu(\cdot|s_{t})\right)\right)\Bigg{]}}_{M(\pi;\mu)}
+𝔼ρ0,~T,τ0:L1,π[t=0L1logw(𝔼τt,π[Q¯N,kπ(st+1,at+1)];k,N)]+(L1)logC\displaystyle\qquad+\mathbb{E}_{\rho_{0},\widetilde{\mathbb{P}}_{T},\tau_{0:L-1},\pi}\Bigg{[}\sum_{t=0}^{L-1}\log w\Big{(}\mathbb{E}_{\tau_{t},\pi}\big{[}\bar{Q}^{\pi}_{N,k}(s_{t+1},a_{t+1})\big{]};k,N\Big{)}\Bigg{]}+(L-1)\cdot\log C
=()M(π;μ)+(L1)logC\displaystyle\overset{(*)}{=}M(\pi;\mu)+(L-1)\cdot\log C
+t=0L1𝔼ρ0,π,^T[𝔼τ0,π,^T[𝔼τt1,π,~T[logw(𝔼τt,π[Q¯N,kπ(st+1,at+1)];k,N)]]]\displaystyle\qquad+\sum_{t=0}^{L-1}\mathbb{E}_{\rho_{0},\pi,\widehat{\mathbb{P}}_{T}}\left[\mathbb{E}_{\tau_{0},\pi,\widehat{\mathbb{P}}_{T}}\cdots\left[{\color[rgb]{0,0,1}\mathbb{E}_{\tau_{t-1},\pi,\widetilde{\mathbb{P}}_{T}}\left[\log w\Big{(}\mathbb{E}_{\tau_{t},\pi}\big{[}\bar{Q}^{\pi}_{N,k}(s_{t+1},a_{t+1})\big{]};k,N\Big{)}\vphantom{\sum_{1}^{2}}\right]}\right]\right]
M(π;μ)+(L1)(logC+logC),\displaystyle\approx M(\pi;\mu)+(L-1)\cdot({\color[rgb]{0,0,1}\log C^{\prime}}+\log C), (38)

where the equation ()(*) is by unfolding the expectation sequentially over each step, C=N!(k1)!(Nk)!C=\frac{N!}{(k-1)!(N-k)!} is the normalization constant in (36), and C=(k1)k1(Nk)Nk(N1)N1C^{\prime}=\frac{(k-1)^{k-1}(N-k)^{N-k}}{(N-1)^{N-1}} is used to approximate ww. To clarify the approximation, recall Theorem 7 stating that a sample τt~T\tau_{t}\sim\widetilde{\mathbb{P}}_{T} can be equivalently drawn by finding τt=argminτ𝒯tk𝔼τ,π[Q¯N,kπ(st+1,at+1)]\tau_{t}=\arg\lfloor\min\rfloor_{\tau\in\mathcal{T}_{t}}^{k}\mathbb{E}_{\tau,\pi}\left[\bar{Q}^{\pi}_{N,k}(s_{t+1},a_{t+1})\right] based on another sampling procedure 𝒯t={τ}NTN\mathcal{T}_{t}=\{\tau\}^{N}\sim\mathbb{P}^{N}_{T}. Then, given 𝒯t\mathcal{T}_{t}, we observe that 𝔼τt,π[Q¯N,kπ(st+1,at+1)]\mathbb{E}_{{\color[rgb]{1,0,1}\tau_{t}},\pi}\big{[}\bar{Q}^{\pi}_{N,k}(s_{t+1},a_{t+1})\big{]} is the empirical k1N1\frac{k-1}{N-1} quantile of the random variable 𝔼τ,π[Q¯N,kπ(st+1,at+1)]\mathbb{E}_{{\color[rgb]{1,0,1}\tau},\pi}\big{[}\bar{Q}^{\pi}_{N,k}(s_{t+1},a_{t+1})\big{]}, i.e., F(𝔼τt,π[Q¯N,kπ(st+1,at+1)])k1N1F\left(\mathbb{E}_{\tau_{t},\pi}\big{[}\bar{Q}^{\pi}_{N,k}(s_{t+1},a_{t+1})\big{]}\right)\approx\frac{k-1}{N-1}. By substituting into ww, we obtain wCw\approx C^{\prime}.
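The approximation $w\approx C^{\prime}$ can be verified directly: evaluating $w(x;k,N)$ at the $\frac{k-1}{N-1}$ quantile, i.e., at $F(x)=\frac{k-1}{N-1}$, recovers $C^{\prime}$ exactly, as the short check below confirms (a numerical sanity check, not part of the derivation).

```python
def w_at_quantile(N, k):
    # w(x; k, N) = F^{k-1} (1-F)^{N-k} evaluated at F = (k-1)/(N-1)
    F = (k - 1) / (N - 1)
    return F ** (k - 1) * (1.0 - F) ** (N - k)

def C_prime(N, k):
    # C' = (k-1)^{k-1} (N-k)^{N-k} / (N-1)^{N-1}
    return (k - 1) ** (k - 1) * (N - k) ** (N - k) / (N - 1) ** (N - 1)

for N, k in [(5, 2), (10, 3), (10, 9), (4, 1)]:
    assert abs(w_at_quantile(N, k) - C_prime(N, k)) < 1e-12
```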

Note that αM(π;μ)-\alpha M(\pi;\mu) is exactly the return of KL-regularized MDP in Theorem 7. By the equivalence of this KL-regularized MDP and the KL-regularized AMG in (4.1), we have M(π;μ)=J¯(π;μ)αM(\pi;\mu)=-\frac{\bar{J}(\pi;\mu)}{\alpha}. Thus, minimization of (C.2) is equivalent to maximization of J¯(π;μ)\bar{J}(\pi;\mu).

C.3 Full Expectation-Maximization Algorithm

In the previous subsection, the reference policy $\mu$ is assumed as a prior, and the optimized policy is constrained to be close to it through the KL divergence. In practice, the prior of the optimal policy cannot be easily obtained, and a popular methodology to handle this is to learn the prior itself in a data-driven way, i.e., the principle of empirical Bayes.

The prior learning is done by maximizing the log-marginal likelihood:

L(μ)=log(𝒪0:L)=log(s0:L,a0:L,τ0:L1,𝒪0:L)𝑑s0:L𝑑a0:L𝑑τ0:L1,\displaystyle L(\mu)=\log\mathbb{P}\left(\mathcal{O}_{0:L}\right)=\log\int\mathbb{P}\left(s_{0:L},a_{0:L},\tau_{0:L-1},\mathcal{O}_{0:L}\right)d{s_{0:L}}d{a_{0:L}}d{\tau_{0:L-1}}, (39)

where $\mathbb{P}\left(s_{0:L},a_{0:L},\tau_{0:L-1},\mathcal{O}_{0:L}\right)$ is given in (C.2). As the logarithm involves a high-dimensional integral, evaluating $L(\mu)$ incurs intensive computation. The Expectation-Maximization algorithm instead considers a lower bound of $L(\mu)$ to make the evaluation/optimization tractable:

L(μ)\displaystyle L(\mu)\geq log(𝒪0:L)DKL(^(s0:L,a0:L,τ0:L1)||(s0:L,a0:L,τ0:L1|𝒪0:L))\displaystyle\log\mathbb{P}\left(\mathcal{O}_{0:L}\right)-D_{\mathrm{KL}}\left(\widehat{\mathbb{P}}\left(s_{0:L},a_{0:L},\tau_{0:L-1}\right)\;\middle|\middle|\;\mathbb{P}\left(s_{0:L},a_{0:L},\tau_{0:L-1}|\mathcal{O}_{0:L}\right)\right)
=\displaystyle= \bigintssss^(s0:L,a0:L,τ0:L1)log(s0:L,a0:L,τ0:L1,𝒪0:L)ds0:Lda0:Ldτ0:L1\displaystyle\bigintssss\widehat{\mathbb{P}}\left(s_{0:L},a_{0:L},\tau_{0:L-1}\right)\log\mathbb{P}\left(s_{0:L},a_{0:L},\tau_{0:L-1},\mathcal{O}_{0:L}\right)d{s_{0:L}}d{a_{0:L}}d{\tau_{0:L-1}}
[^(s0:L,a0:L,τ0:L1)],\displaystyle-\mathcal{H}\left[\widehat{\mathbb{P}}\left(s_{0:L},a_{0:L},\tau_{0:L-1}\right)\right], (40)

where the inequality is due to the non-negativity of the KL divergence, and $\widehat{\mathbb{P}}\left(s_{0:L},a_{0:L},\tau_{0:L-1}\right)$ is an approximation to the exact posterior $\mathbb{P}\left(s_{0:L},a_{0:L},\tau_{0:L-1}|\mathcal{O}_{0:L}\right)$. The lower bound is tighter when the posterior approximation is more exact. In the previous subsection, we introduced the structured variational approximation of form (35) to emphasize pessimism on the transition dynamics. Although this variational posterior leads to a non-zero KL term, it promotes learning a robust policy, as discussed in the previous subsection. Since the variational posterior contains an adjustable policy $\pi$, we denote the lower bound by $\bar{L}(\mu;\pi)$.

By substituting (35) into (C.3), it follows that

L¯(μ;π)=\displaystyle\bar{L}(\mu;\pi)= 𝔼ρ0,~T,τ0:L1,π[t=0Llogμ(at|st)]+C′′\displaystyle\mathbb{E}_{\rho_{0},\widetilde{\mathbb{P}}_{T},\tau_{0:L-1},\pi}\left[\sum_{t=0}^{L}\log\mu(a_{t}|s_{t})\right]+C^{\prime\prime}
=\displaystyle= 𝔼ρ0,~T,τ0:L1,π[t=0Llogμ(at|st)logπ(at|st)+logπ(at|st)]+C′′\displaystyle\mathbb{E}_{\rho_{0},\widetilde{\mathbb{P}}_{T},\tau_{0:L-1},\pi}\left[\sum_{t=0}^{L}\log\mu(a_{t}|s_{t})-\log\pi(a_{t}|s_{t})+\log\pi(a_{t}|s_{t})\right]+C^{\prime\prime}
=\displaystyle= 𝔼ρ0,~T,τ0:L1,π[t=0LDKL(π(|st)||μ(|st))]+C′′′,\displaystyle\mathbb{E}_{\rho_{0},\widetilde{\mathbb{P}}_{T},\tau_{0:L-1},\pi}\left[\sum_{t=0}^{L}-D_{\mathrm{KL}}\left(\pi(\cdot|s_{t})\;\middle|\middle|\;\mu(\cdot|s_{t})\right)\right]+C^{\prime\prime\prime}, (41)

where $C^{\prime\prime}$ and $C^{\prime\prime\prime}$ include the constant terms irrelevant to $\mu$. According to the form of (C.3), given a fixed $\pi$, the optimal prior policy maximizing $\bar{L}(\mu;\pi)$ is obtained as $\mu=\pi$. Maximizing the lower bound is known as the Maximization step.

Recall that $\pi$ in the variational posterior is adjustable; we can optimize it by minimizing $D_{\mathrm{KL}}\left(\widehat{\mathbb{P}}\left(s_{0:L},a_{0:L},\tau_{0:L-1}\right)\;\middle|\middle|\;\mathbb{P}\left(s_{0:L},a_{0:L},\tau_{0:L-1}|\mathcal{O}_{0:L}\right)\right)$ to tighten the bound. This minimization procedure is known as the Expectation step. In our case, the minimization problem is exactly the one discussed in the previous subsection.

By repeatedly and alternately applying the Expectation and Maximization steps, the iterative regularized policy optimization algorithm is recovered. According to Theorem 5, both $\pi$ and $\mu$ continually improve with respect to the objective function.

Appendix D Algorithm and Implementation Details for Model-Based Offline RL with PMDB

The pseudocode for model-based offline RL with PMDB is presented in Algorithm 1. As $\rho_{0}$ is unknown in practice, we uniformly sample states from $\mathcal{D}$ as the initial $\{s_{0}\}$. In Step 4, the primary players act according to the non-parametric policy $\pi$, rather than its approximation $\pi_{\phi}$. This is because, during the learning process, $\pi_{\phi}$ is not always trained adequately to approximate $\pi$, and following $\pi_{\phi}$ would visit unexpected states. In Step 11, the reference policy $\pi_{\phi^{\prime}}$ is returned as the final policy, considering that it is more stable than $\pi_{\phi}$.

Algorithm 1 Model-Based Offline RL with PMDB

Require: \mathcal{D}, \mathbb{P}_{T}, N, k, M.

1:  Approximator initialization: Randomly initialize the Q-function Q_{\theta}(s,a) and the policy \pi_{\phi}(a|s); initialize the target Q-function Q_{\theta^{\prime}}(s,a) and the reference policy \pi_{\phi^{\prime}}(a|s) with \theta^{\prime}\leftarrow\theta, \phi^{\prime}\leftarrow\phi.
2:  Game initialization: Randomly sample C states from \mathcal{D} as the initial states of C parallel games \{s\}.
3:  for step t=1,2,\cdots,M do
4:     Primary players: Sample actions according to
\pi(a|s)\propto\pi_{\phi^{\prime}}(a|s)\exp\left(\frac{1}{\alpha}Q_{\theta}(s,a)\right).
5:     Game transitions: Sample candidate sets \{\mathcal{T}\} according to (3).
6:     Update: Sample a batch of transitions from \mathcal{D}, together with the C game transitions \{(s,a,\mathcal{T})\}, to update \theta and \phi via one-step gradient descent on (11) and (4.2).
7:     Secondary players: Determine whether to exploit or explore: with probability (1-\epsilon),
\bar{\tau}=\arg\lfloor\min\rfloor_{\tau\in\mathcal{T}}^{k}\,\mathbb{E}_{\tau}\left[\alpha\log\mathbb{E}_{\pi_{\phi^{\prime}}}\exp\left(\frac{1}{\alpha}Q_{\theta}(s^{\prime},a^{\prime})\right)\right],
otherwise randomly choose \bar{\tau} from \mathcal{T}.
8:     Game transitions: Sample next states following \{\bar{\tau}\} to update \{s\}. For terminal states in \{s\}, replace them with random samples from \mathcal{D}.
9:     Moving-average update: Update the reference policy and the target Q-function with
\phi^{\prime}\leftarrow\omega_{1}\phi+(1-\omega_{1})\phi^{\prime},
\theta^{\prime}\leftarrow\omega_{2}\theta+(1-\omega_{2})\theta^{\prime}.
10:  end for
11:  return \pi_{\phi^{\prime}}.
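For concreteness, the per-step game interaction (Steps 4, 5 and 7) can be sketched in Python as below. This is a minimal sketch under our own simplifying assumptions: q_value, ref_policy_sample and ensemble are hypothetical stand-ins for Q_{\theta}, \pi_{\phi^{\prime}} and the dynamics samples behind (3); each candidate \tau is represented by a single predicted next state (i.e., deterministic dynamics models); and the non-parametric policy \pi is sampled by self-normalized importance resampling, which is one possible implementation rather than the only one.

import numpy as np

# Values follow Table 3 / Algorithm 1: KL coefficient, candidate size, k-min index, exploration ratio.
alpha, N, k, eps = 0.1, 10, 2, 0.1
rng = np.random.default_rng(0)

def sample_candidate_set(s, a, ensemble):
    # Step 5 (sketch): under the uniform-ensemble initial belief, form the candidate set T
    # from N dynamics drawn from the belief; here each candidate is one predicted next state.
    members = rng.choice(len(ensemble), size=N, replace=False)
    return [ensemble[i](s, a) for i in members]

def sample_nonparametric_policy(s, q_value, ref_policy_sample, n=20):
    # Step 4 (sketch): draw an action from pi(a|s) proportional to pi_phi'(a|s) exp(Q_theta(s,a)/alpha)
    # by self-normalized importance resampling over proposals from pi_phi'.
    cands = ref_policy_sample(s, n)                    # n candidate actions from pi_phi'
    q = np.array([q_value(s, a) for a in cands])
    w = np.exp((q - q.max()) / alpha)                  # importance weights proportional to exp(Q/alpha)
    return cands[rng.choice(n, p=w / w.sum())]

def soft_value(s_next, q_value, ref_policy_sample, n=20):
    # alpha * log E_{a'~pi_phi'}[exp(Q_theta(s',a')/alpha)], estimated by Monte Carlo
    # with max-subtraction for numerical stability.
    a = ref_policy_sample(s_next, n)
    q = np.array([q_value(s_next, ai) for ai in a])
    return q.max() + alpha * np.log(np.mean(np.exp((q - q.max()) / alpha)))

def secondary_player_choice(T, q_value, ref_policy_sample):
    # Step 7 (sketch): exploit (k-th smallest soft value over T) with prob. 1-eps, else explore.
    if rng.random() < eps:
        return T[rng.integers(len(T))]
    vals = [soft_value(s_next, q_value, ref_policy_sample) for s_next in T]
    return T[np.argsort(vals)[k - 1]]                  # the k-th minimum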

Computing Expectations

Algorithm 1 involves the computation of expectations. In discrete domains, they can be computed exactly. In continuous domains, we approximate them with Monte Carlo methods. Concretely, for the expectation over states we apply vanilla Monte Carlo sampling, while for the expectation over actions we apply importance sampling. To elaborate, the expectation over actions can be written as

\mathbb{E}_{\mu}\exp\left(\frac{1}{\alpha}Q(s,a)\right)=\frac{1}{2}\left[\mathbb{E}_{\mu}\exp\left(\frac{1}{\alpha}Q(s,a)\right)+\mathbb{E}_{q}\frac{\mu(a|s)\exp\left(\frac{1}{\alpha}Q(s,a)\right)}{q(a|s)}\right]
\approx\frac{1}{2n}\left[\sum_{a_{i}\sim\mu(\cdot|s)}^{n}\exp\left(\frac{1}{\alpha}Q(s,a_{i})\right)+\sum_{a_{i}\sim q(\cdot|s)}^{n}\frac{\mu(a_{i}|s)\exp\left(\frac{1}{\alpha}Q(s,a_{i})\right)}{q(a_{i}|s)}\right],

where q is the proposal distribution.

In Algorithm 1, the above expectation is computed for both s\in\mathcal{D} and s\in\mathcal{D}^{\prime}. For s\in\mathcal{D}, we choose

q(\cdot|s)=\mathcal{N}\left(\,\cdot\,;a,\sigma^{2}I\right),\quad\text{where }a|s\sim\mathcal{D},

i.e., the samples are drawn close to the data points, and \sigma^{2} determines how close they stay. For example, in Step 6 a batch of \{(s,a,s^{\prime})\} is sampled from \mathcal{D} to calculate (4.2); we then construct the above proposal distribution for each (s,a) in the batch. The motivation for drawing actions near the data samples is to enhance learning in the multi-modal scenario, where the offline dataset \mathcal{D} is collected by a mixture of multiple policies. If \mu is uni-modal (say the widely adopted Gaussian policy) and we solely draw samples from it to approximate the expectation, these samples will be locally clustered. Then, applying them to update \pi_{\phi} in (4.2) can easily get stuck at a local optimum.

For s\in\mathcal{D}^{\prime}, we choose

q(\cdot|s)=\pi_{\phi}(\cdot|s).

The reason is that \pi_{\phi} approximates the improved policy with higher Q-values, and sampling from it hopefully reduces the variance of the Monte Carlo estimator.
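As an illustration, the mixed estimator can be written in a few lines. This is a minimal Python sketch under our own assumptions: q_value, mu_sample, log_mu, q_sample and log_q are hypothetical callables for Q(s,a), for sampling and evaluating the log-density of \mu, and for the proposal q; the computation is done in log-space with a shared shift for numerical stability.

import numpy as np

def log_mixed_expectation(s, q_value, mu_sample, log_mu, q_sample, log_q, alpha=0.1, n=20):
    # Estimate log E_{a~mu}[exp(Q(s,a)/alpha)] by averaging the vanilla Monte Carlo
    # estimate (samples from mu) with the importance-sampling estimate (samples from q).
    a_mu, a_q = mu_sample(s, n), q_sample(s, n)
    q_mu = np.array([q_value(s, a) for a in a_mu])
    q_q = np.array([q_value(s, a) for a in a_q])
    c = max(q_mu.max(), q_q.max()) / alpha                            # shared shift against overflow
    term_mu = np.mean(np.exp(q_mu / alpha - c))
    w = np.exp(np.array([log_mu(s, a) - log_q(s, a) for a in a_q]))   # weights mu(a|s)/q(a|s)
    term_q = np.mean(w * np.exp(q_q / alpha - c))
    return c + np.log(0.5 * (term_mu + term_q))                       # log of the mixed estimate

Here q_sample and log_q would correspond to the Gaussian proposal around dataset actions for s\in\mathcal{D}, and to the learned policy for s\in\mathcal{D}^{\prime}, as described above.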

Although applying Monte Carlo methods to approximate the expectation incurs extra computation, all the operations can be executed in parallel. In the experiments, we use 10 and 20 samples for the expectations over states and actions, respectively, and the algorithm is run on a single machine with one Quadro RTX 6000 GPU. On average it takes 73.4 s to finish 1k training steps, and the GPU memory cost is 2.5 GB.

Several future directions regarding the Monte Carlo method are worth exploring. For example, by reducing the sample size for the expectation over states, the optimized policy additionally tends to avoid the risk due to aleatoric uncertainty (while in this work we focus on epistemic uncertainty). Besides, the computational cost can be reduced by more aggressive Monte Carlo approximation, for example using only the mean action to compute the expectation with respect to the policy. We leave these as future work.

Appendix E Choice of Initial Dynamics Belief

In the offline setting, extra knowledge is strongly desired to aggressively optimize the policy. The initial dynamics belief provides an interface to absorb prior knowledge of the system transition. In what follows, we illustrate several potential use cases:

  • Consider a physical system whose dynamics can be described by a mathematical expression with uncertain parameters. If we have a narrow distribution over the parameters (according to expert knowledge or inferred from data), the system is almost known for certain. Here, both the mathematical expression and the narrow distribution provide more information.

  • Consider the case where we know the dynamics is smooth with probability 0.7 and periodic with probability 0.3. Gaussian processes (GPs) with an RBF kernel and a periodic kernel can encode this prior knowledge well. Then, the 0.7-0.3 mixture of the two GPs trained with the offline data can act as the dynamics belief to provide more information.

  • In the case where multi-task datasets are available, we can train dynamics models on each of the datasets and assign likelihood ratios to these models. If the likelihood ratios well reflect the similarity between the concerned task and the offline tasks, the multi-task datasets provide additional knowledge.

The performance gain is expected to increase monotonically with the amount of correct knowledge. As an impractical but intuitive example, with exact knowledge of the system transition (i.e., the initial belief is a delta function), the proposed approach actually optimizes the policy as in the real system.

In practice, expert knowledge is not always available. When it is unavailable, the best we can hope for is that the final policy stays close to the dataset, while not necessarily being fully covered by it (as we want to utilize the generalization ability of the dynamics model at least around the data). To that end, the dynamics belief should be certain in the region covered by the dataset, and become more and more uncertain when departing from it. It has been reported that a simple model ensemble exhibits such behavior [12]. In this sense, the uniform distribution over a learned dynamics ensemble can act as a fairly general belief. In the experiments, we adopt it for fair comparison with the baseline methods.

Appendix F Automatically Adjusting KL Coefficient

In Section 4, the KL regularizer is introduced to restrict \pi_{\phi} to a small region near \pi_{\phi^{\prime}}, such that the Q-value can be evaluated sufficiently before policy improvement. Apart from fixing the KL coefficient \alpha throughout training, we provide a strategy to adjust it automatically.

Note that the optimal policy minimizing L_{P} in (4.2) is \frac{\pi_{\phi^{\prime}}(\cdot|s)\exp\left(\frac{1}{\alpha}Q_{\theta}(s,\cdot)\right)}{\mathbb{E}_{\pi_{\phi^{\prime}}}\left[\exp\left(\frac{1}{\alpha}Q_{\theta}(s,a)\right)\right]}. The criterion for choosing \alpha is to constrain the KL divergence between this policy and \pi_{\phi^{\prime}} to be smaller than a specified constant, i.e.,

D_{\mathrm{KL}}\left(\frac{\pi_{\phi^{\prime}}(\cdot|s)\exp\left(\frac{1}{\alpha}Q_{\theta}(s,\cdot)\right)}{\mathbb{E}_{\pi_{\phi^{\prime}}}\left[\exp\left(\frac{1}{\alpha}Q_{\theta}(s,a)\right)\right]}\,\middle\|\,\pi_{\phi^{\prime}}(\cdot|s)\right)\leq d. (42)

Finding \alpha to exactly satisfy the above inequality is intractable; instead we consider an upper-bound surrogate of the KL divergence:

D_{\mathrm{KL}}\left(\frac{\pi_{\phi^{\prime}}(\cdot|s)\exp\left(\frac{1}{\alpha}Q_{\theta}(s,\cdot)\right)}{\mathbb{E}_{\pi_{\phi^{\prime}}}\left[\exp\left(\frac{1}{\alpha}Q_{\theta}(s,a)\right)\right]}\,\middle\|\,\pi_{\phi^{\prime}}(\cdot|s)\right)
=\mathbb{E}_{\pi_{\phi^{\prime}}}\left[\frac{\exp\left(\frac{1}{\alpha}Q_{\theta}(s,a)\right)}{\mathbb{E}_{\pi_{\phi^{\prime}}}\left[\exp\left(\frac{1}{\alpha}Q_{\theta}(s,a)\right)\right]}\cdot\frac{1}{\alpha}Q_{\theta}(s,a)\right]-\log\mathbb{E}_{\pi_{\phi^{\prime}}}\left[\exp\left(\frac{1}{\alpha}Q_{\theta}(s,a)\right)\right]
\leq\frac{1}{\alpha}\left(\mathbb{E}_{\pi_{\phi^{\prime}}}\left[\frac{\exp\left(\frac{1}{\alpha_{0}}Q_{\theta}(s,a)\right)}{\mathbb{E}_{\pi_{\phi^{\prime}}}\left[\exp\left(\frac{1}{\alpha_{0}}Q_{\theta}(s,a)\right)\right]}\cdot Q_{\theta}(s,a)\right]-\mathbb{E}_{\pi_{\phi^{\prime}}}\left[Q_{\theta}(s,a)\right]\right),

where \alpha_{0} is a predefined lower bound of \alpha.

Then, (42) can be satisfied by setting

\alpha\geq\frac{1}{d}\left(\mathbb{E}_{\pi_{\phi^{\prime}}}\left[\frac{\exp\left(\frac{1}{\alpha_{0}}Q_{\theta}(s,a)\right)}{\mathbb{E}_{\pi_{\phi^{\prime}}}\left[\exp\left(\frac{1}{\alpha_{0}}Q_{\theta}(s,a)\right)\right]}\cdot Q_{\theta}(s,a)\right]-\mathbb{E}_{\pi_{\phi^{\prime}}}\left[Q_{\theta}(s,a)\right]\right).

Combining with the predefined lower bound, we choose \alpha as

\alpha=\max\left(\frac{1}{d}\left(\mathbb{E}_{\pi_{\phi^{\prime}}}\left[\frac{\exp\left(\frac{1}{\alpha_{0}}Q_{\theta}(s,a)\right)}{\mathbb{E}_{\pi_{\phi^{\prime}}}\left[\exp\left(\frac{1}{\alpha_{0}}Q_{\theta}(s,a)\right)\right]}\cdot Q_{\theta}(s,a)\right]-\mathbb{E}_{\pi_{\phi^{\prime}}}\left[Q_{\theta}(s,a)\right]\right),\,\alpha_{0}\right).

In practice, the expectations can be estimated with Monte Carlo samples. Note that the coefficient can be computed individually for each state; picking d is hopefully easier than picking a single \alpha suitable for all states.
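As a minimal sketch of this rule (with hypothetical values of d and \alpha_{0}, and q_values standing for Q_{\theta}(s,a_{i}) evaluated at Monte Carlo actions a_{i}\sim\pi_{\phi^{\prime}}(\cdot|s)):

import numpy as np

def auto_alpha(q_values, d=0.5, alpha0=0.05):
    # Per-state KL coefficient: alpha = max(surrogate / d, alpha0), where the surrogate is
    # the gap between the exp(Q/alpha0)-tilted mean of Q and its plain mean under pi_phi'.
    w = np.exp((q_values - q_values.max()) / alpha0)      # tilted weights at temperature alpha0
    tilted_mean = np.sum(w * q_values) / np.sum(w)
    gap = tilted_mean - np.mean(q_values)                 # always non-negative
    return max(gap / d, alpha0)

# Example: the larger the spread of Q over actions, the larger alpha, i.e. the smaller the policy step.
print(auto_alpha(np.array([1.0, 2.0, 5.0, 3.0])))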

Appendix G Additional Experimental Setup

Task Domains

We evaluate the proposed method and the baselines on eighteen domains involving three environments (hopper, walker2d, halfcheetah), each with six dataset types. The dataset types are collected by different policies: random, a randomly initialized policy; expert, a policy trained to completion with SAC; medium, a policy trained to approximately 1/3 of the expert's performance; medium-expert, a 50-50 mixture of medium and expert data; medium-replay, the replay buffer of a policy trained up to the performance of the medium agent; full-replay, the replay buffer of a policy trained up to the performance of the expert agent.

Dynamics Belief

We adopt a uniform distribution over a dynamics model ensemble as the initial belief. The ensemble contains 100 neural networks, each with 4 hidden layers and 256 hidden units per layer. All the neural networks are trained independently on the same dataset \mathcal{D}, in parallel. The training stops once the average training loss no longer changes noticeably. Specifically, the number of epochs for hopper-random and walker2d-medium is 2000, and that for the other tasks is 1000. Note that the level of pessimism depends on the candidate size N (=10 by default), rather than on the ensemble size.
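For reference, one ensemble member and its independent training loop can be sketched as follows (a PyTorch sketch; the 4x256 architecture, learning rate and epoch numbers follow the description above, while the deterministic delta-state output, the MSE loss and the example state/action dimensions are our own simplifying assumptions, since the exact model parameterization is not restated here).

import torch
import torch.nn as nn

class DynamicsNet(nn.Module):
    # One ensemble member: 4 hidden layers of 256 units, predicting the next-state delta.
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        layers, dim = [], state_dim + action_dim
        for _ in range(4):
            layers += [nn.Linear(dim, hidden), nn.ReLU()]
            dim = hidden
        layers += [nn.Linear(dim, state_dim)]
        self.net = nn.Sequential(*layers)

    def forward(self, s, a):
        return s + self.net(torch.cat([s, a], dim=-1))    # predicted next state

def train_member(model, loader, epochs=1000, lr=1e-4):
    # Each member is trained independently (and can run in parallel) on the same dataset.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for s, a, s_next in loader:
            loss = ((model(s, a) - s_next) ** 2).mean()
            opt.zero_grad(); loss.backward(); opt.step()
    return model

ensemble = [DynamicsNet(17, 6) for _ in range(100)]        # e.g., walker2d dimensions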

Policy Network and Q Network

The policy network has 3 hidden layers and 256 hidden units per layer. It outputs the mean and the diagonal variance of a Gaussian distribution, which is then transformed via the tanh function to generate the policy. When evaluating our approach, we apply the deterministic policy, where the action is the tanh transformation of the Gaussian mean. The Q network has the same architecture as the policy network except for the output layer. Similar to existing RL approaches [35], we use two Q networks and take their minimum in the calculations of Algorithm 1, in order to mitigate over-estimation when learning in the AMG. The policy learning stops once the performance in the AMG no longer changes noticeably. Specifically, the number of gradient steps for walker2d-random, halfcheetah-random and hopper with all dataset types is 1 million, and that for the other tasks is 2 million.
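For illustration, the tanh-Gaussian policy head and the twin-Q minimum can be sketched as below (a PyTorch sketch; the log-variance parameterization and its clipping range are implementation assumptions not specified above).

import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256, depth=3):
    layers, dim = [], in_dim
    for _ in range(depth):
        layers += [nn.Linear(dim, hidden), nn.ReLU()]
        dim = hidden
    return nn.Sequential(*layers, nn.Linear(dim, out_dim))

class TanhGaussianPolicy(nn.Module):
    # Outputs the mean and diagonal (log-)variance; actions are tanh-squashed Gaussian samples.
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.body = mlp(state_dim, 2 * action_dim)

    def forward(self, s, deterministic=False):
        mean, log_var = self.body(s).chunk(2, dim=-1)
        if deterministic:                                 # evaluation: tanh of the Gaussian mean
            return torch.tanh(mean)
        std = torch.exp(0.5 * log_var.clamp(-10, 2))      # clipping range is an assumption
        return torch.tanh(mean + std * torch.randn_like(std))

def min_q(q1, q2, s, a):
    # Twin-Q trick: take the element-wise minimum of the two Q networks.
    sa = torch.cat([s, a], dim=-1)
    return torch.minimum(q1(sa), q2(sa))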

Hyperparameters

We list the detailed hyperparameters in Table 3.

Table 3: Hyperparameters
Parameter  Value
dynamics learning rate  10^{-4}
policy learning rate  3\cdot 10^{-5}
Q-value learning rate  3\cdot 10^{-4}
discount factor (\gamma)  0.99
smoothing coefficient for policy (\omega_{1})  10^{-5}
smoothing coefficient for Q-value (\omega_{2})  5\cdot 10^{-3}
exploration ratio for secondary player (\epsilon)  0.1
KL coefficient (\alpha)  0.1
variance for importance sampling (\sigma^{2})  0.01
batch size for dynamics learning  256
batch size for AMG and MDP  128
maximal horizon of AMG  1000

Appendix H Practical Impact of N

Table 4 lists the impact of N. The performance in the AMGs improves when decreasing N. Regarding the performance in the true MDPs, we notice that N=15 gives the best performance for hopper, while for the other two environments N=5 is better.

     hopper-medium         walker2d-medium       halfcheetah-medium
N    MDP        AMG        MDP        AMG        MDP       AMG
5    90.2±25.4  108.6±2.2  112.7±0.9  101.7±5.7  79.8±0.4  69.5±1.6
10   106.8±0.2  105.2±1.6  94.2±1.1   77.2±3.7   75.6±1.3  67.3±1.1
15   107.3±0.2  103.1±1.8  92.1±0.3   68.3±6.7   75.4±0.4  63.2±2.3
Table 4: Impact of N, with k=2.

Appendix I Ablation of Randomness of \mathcal{T}

Compared to the standard Bellman backup operator in Q-learning, the proposed one additionally includes the expectation over \mathcal{T}\sim\mathcal{P}_{T}^{N} and the k-minimum operator over \tau\in\mathcal{T}. We report the impact of choosing different k in Table 2, and present the impact of the randomness of \mathcal{T} below. Fixed \mathcal{T} denotes that, after sampling \mathcal{T} once from the belief distribution, we keep it fixed during policy optimization.

Table 5: Impact of randomness of \mathcal{T}
Task Name           Stochastic \mathcal{T}   Fixed \mathcal{T}
hopper-medium       106.8±0.2                106.2±0.3
walker2d-medium     94.2±1.1                 90.1±4.3
halfcheetah-medium  75.6±1.3                 73.1±2.8

We observe that the randomness of \mathcal{T} has a mild effect on the average performance. The reason can be that we apply the uniform distribution over the dynamics ensemble as the initial belief (without additional knowledge to insert). The model ensemble is reported to produce low uncertainty estimates within the data coverage and high estimates when departing from the dataset [12]. This property keeps the optimized policy close to the dataset, and it does not rely on the randomness of the ensemble elements. However, involving the randomness leads to a smoother variation of the estimated uncertainty, which benefits the training process and results in better performance. Apart from these empirical results, we highlight that in cases with a more informative dynamics belief, picking only several fixed samples from the belief distribution as \mathcal{T} would result in a loss of knowledge.

Appendix J Weighting AMG Loss and MDP Loss in (11)

In (11), the Q-function is trained to minimize the Bellman residuals of both the AMG and the empirical MDP, with equal weights (both are 1). In the following table, we show experimental results to examine the impact of different weights.

Table 6: Impact of weights in (11)
Task Name 0.5:1.5 1.0:1.0 1.5:0.5
hopper-medium 106.6±0.3 106.8±0.2 106.5±0.3
walker2d-medium 93.8±1.5 94.2±1.1 93.1±1.3
halfcheetah-medium 75.2±0.8 75.6±1.3 76.1±1.0

The results suggest that the performance does not depend obviously on the weights. However, in cases with available expert knowledge about the dynamics, the weights can be adjusted to match our confidence in that knowledge, i.e., the lower the confidence, the smaller the weight for the AMG.

Appendix K Comparison with RAMBO

We additionally compare the proposed approach with RAMBO [61], a concurrent work that also formulates offline RL as a two-player zero-sum game. The results of RAMBO for the random, medium, medium-expert and medium-replay dataset types are taken from [61]. For the other two dataset types, we run the official code and follow the hyperparameter search procedure reported in its paper.

Table 7: Extended Results for D4RL datasets.
Task Name  BC  BEAR  BRAC  CQL  MOReL  EDAC  RAMBO  PMDB
hopper-random  3.7±0.6  3.6±3.6  8.1±0.6  5.3±0.6  38.1±10.1  25.3±10.4  25.4±7.5  32.7±0.1
hopper-medium  54.1±3.8  55.3±3.2  77.8±6.1  61.9±6.4  84.0±17.0  101.6±0.6  87.0±15.4  106.8±0.2
hopper-expert  107.7±9.7  39.4±20.5  78.1±52.6  106.5±9.1  80.4±34.9  110.1±0.1  50.0±8.1  111.7±0.3
hopper-medium-expert  53.9±4.7  66.2±8.5  81.3±8.0  96.9±15.1  105.6±8.2  110.7±0.1  88.2±20.5  111.8±0.6
hopper-medium-replay  16.6±4.8  57.7±16.5  62.7±30.4  86.3±7.3  81.8±17.0  101.0±0.5  99.5±4.8  106.2±0.6
hopper-full-replay  19.9±12.9  54.0±24.0  107.4±0.5  101.9±0.6  94.4±20.5  105.4±0.7  105.2±2.1  109.1±0.2
walker2d-random  1.3±0.1  4.3±1.2  1.3±1.4  5.4±1.7  16.0±7.7  16.6±7.0  0.0±0.3  21.8±0.1
walker2d-medium  70.9±11.0  59.8±40.0  59.7±39.9  79.5±3.2  72.8±11.9  92.5±0.8  84.9±2.6  94.2±1.1
walker2d-expert  108.7±0.2  110.1±0.6  55.2±62.2  109.3±0.1  62.6±29.9  115.1±1.9  1.6±2.3  115.9±1.9
walker2d-medium-expert  90.1±13.2  107.0±2.9  9.3±18.9  109.1±0.2  107.5±5.6  114.7±0.9  56.7±39.0  111.9±0.2
walker2d-medium-replay  20.3±9.8  12.2±4.7  40.1±47.9  76.8±10.0  40.8±20.4  87.1±2.3  89.2±6.7  79.9±0.2
walker2d-full-replay  68.8±17.7  79.6±15.6  96.9±2.2  94.2±1.9  84.8±13.1  99.8±0.7  88.3±4.9  95.4±0.7
halfcheetah-random  2.2±0.0  12.6±1.0  24.3±0.7  31.3±3.5  38.9±1.8  28.4±1.0  39.5±3.5  37.8±0.2
halfcheetah-medium  43.2±0.6  42.8±0.1  51.9±0.3  46.9±0.4  60.7±4.4  65.9±0.6  77.9±4.0  75.6±1.3
halfcheetah-expert  91.8±1.5  92.6±0.6  39.0±13.8  97.3±1.1  8.4±11.8  106.8±3.4  79.3±15.1  105.7±1.0
halfcheetah-medium-expert  44.0±1.6  45.7±4.2  52.3±0.1  95.0±1.4  80.4±11.7  106.3±1.9  95.4±5.4  108.5±0.5
halfcheetah-medium-replay  37.6±2.1  39.4±0.8  48.6±0.4  45.3±0.3  44.5±5.6  61.3±1.9  68.7±5.3  71.7±1.1
halfcheetah-full-replay  62.9±0.8  60.1±3.2  78.0±0.7  76.9±0.9  70.1±5.1  84.6±0.9  87.0±3.2  90.0±0.8
Average  49.9  52.4  54.0  73.7  65.1  85.2  68.0  88.2

The results show that our approach outperforms RAMBO on most of the considered tasks. One reason can be that the problem formulation of RAMBO is based on the robust MDP, whose defects are discussed in Section 2 and Appendix A.