
DiffCPS: Diffusion-based Constrained Policy Search for Offline Reinforcement Learning

Longxiang He    Li Shen    Linrui Zhang    Junbo Tan    Xueqian Wang
Abstract

Constrained policy search (CPS) is a fundamental problem in offline reinforcement learning and is commonly solved by advantage weighted regression (AWR). However, previous methods may still produce out-of-distribution actions due to the limited expressivity of Gaussian-based policies. On the other hand, directly applying state-of-the-art expressive generative models (i.e., diffusion models) within the AWR framework is intractable, since AWR requires exact policy probability densities, which diffusion models do not provide. In this paper, we propose a novel approach, Diffusion-based Constrained Policy Search (dubbed DiffCPS), which tackles diffusion-based constrained policy search with the primal-dual method. Our theoretical analysis reveals that strong duality holds for diffusion-based CPS problems and that, once parameter approximation is introduced, an approximate solution can be obtained after $\mathcal{O}(1/\epsilon)$ dual iterations, where $\epsilon$ denotes the representation ability of the parametrized policy. Extensive experiments on the D4RL benchmark demonstrate the efficacy of our approach. We empirically show that DiffCPS achieves better or at least competitive performance compared to traditional AWR-based baselines as well as recent diffusion-based offline RL methods.

Machine Learning, ICML

1 Introduction

Offline Reinforcement Learning (offline RL) aims to seek an optimal policy without environmental interactions (Fujimoto et al., 2019; Levine et al., 2020). This is compelling because it has the potential to transform large-scale datasets into powerful decision-making tools while avoiding costly and risky online data collection, offering significant application prospects in fields such as healthcare (Nie et al., 2021; Tseng et al., 2017) and autonomous driving (Yurtsever et al., 2020; Rhinehart et al., 2018).

Notwithstanding its promise, applying contemporary off-policy RL algorithms (Lillicrap et al., 2015; Fujimoto et al., 2018; Haarnoja et al., 2018a, b) directly in the offline context presents challenges due to distribution shift (Fujimoto et al., 2019; Levine et al., 2020). Previous methods to mitigate this issue in the model-free offline RL setting generally fall into three categories: 1) value function-based approaches, which implement pessimistic value estimation by assigning low values to out-of-distribution actions (Kumar et al., 2020; Fujimoto et al., 2019); 2) sequential modeling approaches, which cast offline RL as a sequence generation task with return guidance (Chen et al., 2021; Janner et al., 2022; Liang et al., 2023; Ajay et al., 2022); and 3) constrained policy search (CPS) approaches, which regularize the discrepancy between the learned policy and the behavior policy (Peters et al., 2010; Peng et al., 2019; Nair et al., 2020). We focus on CPS-based offline RL due to its convergence guarantee and outstanding performance in a wide range of tasks.

Prior solutions for CPS (Peters et al., 2010; Peng et al., 2019; Nair et al., 2020) primarily train a parameterized unimodal Gaussian policy through weighted regression. However, recent works (Chen et al., 2022; Hansen-Estruch et al., 2023; Shafiullah et al., 2022) show that such unimodal Gaussian models in weighted regression impair policy performance due to limited distributional expressivity. For example, if we fit a multi-modal distribution with a unimodal Gaussian, the fit unavoidably covers the low-density area between peaks. Intuitively, we can choose a more expressive model to eliminate this issue. Nevertheless, Ghasemipour et al. (2020) show that the VAEs (Kingma & Welling, 2013) in BCQ (Fujimoto et al., 2019) do not align well with the behavior dataset, which introduces biases into the generated actions. Chen et al. (2022) utilize the diffusion probabilistic model (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song & Ermon, 2019) to generate actions and select among them with an action evaluation model under the well-studied AWR framework. However, AWR requires an exact probability density of the behavior policy, which remains intractable for diffusion-based policies. Alternatively, they leverage Monte Carlo sampling to approximate the probability density of the behavior policy, which inevitably introduces estimation biases and increases the cost of inference. In the motivating examples of Section 3.1, we visually demonstrate that these issues are pronounced even on a simple bandit task.

To solve the above issues, we present Diffusion-based Constrained Policy Search (DiffCPS), which incorporates the diffusion model to solve the limited expressivity problem of unimodal Gaussian policies. To avoid the exact probability density required by AWR, we solve the diffusion-based CPS problem with the primal-dual method. Thereby, DiffCPS addresses the limited policy expressivity problem while avoiding the intractable density calculation required by AWR. Interestingly, we find that when diffusion-based policies are used, the policy distribution constraint in CPS can be eliminated through the action distribution of the diffusion model. Then, under some mild assumptions, we show that strong duality holds for the diffusion-based CPS problem, which makes it readily solvable by primal-dual methods. We also study the effect of using a finite parametrization for the policy and show that the price to pay in terms of the duality gap depends on the representation ability of the parametrization, and expressiveness is precisely the greatest strength of diffusion models.

In summary, our main contributions are three-fold: 1) We explicitly eliminate the policy distribution constraint in the CPS problem, which allows us to establish strong duality under mild assumptions. 2) We present DiffCPS, which tackles diffusion-based constrained policy search without resorting to AWR. Our theoretical analysis shows that our method can achieve near-optimal results by solving the primal-dual problem with a parametric policy. 3) Our experimental results illustrate superior or competitive performance compared to existing offline RL methods on D4RL tasks, which also corroborates our theoretical analysis. Even compared to other diffusion-based methods, DiffCPS achieves state-of-the-art (SOTA) average scores on the D4RL MuJoCo locomotion and AntMaze suites, which substantiates the effectiveness of our method.

Figure 1: Toy offline experiment on a simple bandit task. We test the performance of AWR and diffusion-based offline RL algorithms (DQL (Wang et al., 2022) and SfBC (Chen et al., 2022)). The first row displays the actions taken by the trained policies, where $T$ denotes the number of diffusion steps. We note that AWR fails to capture the multi-modal actions in the offline dataset due to the limited policy expressivity of the unimodal Gaussian. The second row shows the effect of different diffusion steps $T$.

2 Preliminaries

Notation. In this paper, we use $\mu_\theta(\bm{a}|\bm{s})$ or $\mu$ to denote the learned policy with parameter $\theta$ and $\pi_b$ to denote the behavior policy that generated the offline data. We use superscript $i$ to denote the diffusion timestep and subscript $t$ to denote the RL trajectory timestep. For instance, $\bm{a}_t^i$ denotes the $t$-th action at the $i$-th diffusion timestep.

2.1 Constrained Policy Search in Offline RL

Consider a Markov decision process (MDP) $M=\{\mathcal{S},\mathcal{A},P,R,\gamma,d_0\}$, with state space $\mathcal{S}$, action space $\mathcal{A}$, environment dynamics $\mathcal{P}(\bm{s}'|\bm{s},\bm{a}):\mathcal{S}\times\mathcal{S}\times\mathcal{A}\rightarrow[0,1]$, reward function $R:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$, discount factor $\gamma\in[0,1)$, policy $\mu\in\mathcal{P}(\mathcal{S})$, where $\mathcal{P}(\mathcal{S})$ is the space of probability measures on $(\mathcal{A},\mathcal{B}(\mathcal{A}))$ parametrized by elements of $\mathcal{S}$ and $\mathcal{B}(\mathcal{A})$ are the Borel sets of $\mathcal{A}$, and initial state distribution $d_0$. The action-value or Q-value of policy $\mu$ is defined as $Q^{\mu}(\bm{s}_t,\bm{a}_t)=\mathbb{E}_{\bm{a}_{t+1},\bm{a}_{t+2},\ldots\sim\mu}\left[\sum_{j=0}^{\infty}\gamma^j r(\bm{s}_{t+j},\bm{a}_{t+j})\right]$. Our goal is to obtain a policy that maximizes the cumulative discounted reward $J(\theta)=\int_{\mathcal{S}}d_0(\bm{s})Q^{\mu}(\bm{s},\bm{a})\,d\bm{s}$, where $\rho^{\mu}(\bm{s})=(1-\gamma)\sum_{t=0}^{\infty}\gamma^t p_{\mu}(\bm{s}_t=\bm{s})$ denotes the normalized discounted state visitation frequencies induced by the policy $\mu$ and $p_{\mu}(\bm{s}_t=\bm{s})$ is the likelihood of being in state $\bm{s}$ after following $\mu$ for $t$ timesteps (Sutton & Barto, 2018; Peng et al., 2019).

In the offline setting (Fujimoto et al., 2019), environmental interaction is not allowed, and a static dataset $\mathcal{D}\triangleq\{(\mathcal{S},\mathcal{A},R,\mathcal{S}',\text{done})\}$ is used to learn a policy. To avoid out-of-distribution actions, we restrict the learned policy $\mu$ to stay close to the behavior policy $\pi_b$ via a KL-divergence constraint. Prior works (Peters et al., 2010; Peng et al., 2019) formulate the above offline RL problem as a constrained policy search (CPS) problem of the following form:

$$\mu^* = \operatorname*{arg\,max}_{\mu} J(\mu) = \operatorname*{arg\,max}_{\mu}\int_{\mathcal{S}} d_0(\bm{s})\int_{\mathcal{A}} Q^{\mu}(\bm{s},\bm{a})\,d\bm{a}\,d\bm{s}$$
$$\text{s.t.}\quad D_{\mathrm{KL}}(\pi_b(\cdot|\bm{s})\,\|\,\mu(\cdot|\bm{s}))\leq\epsilon,\quad\forall\bm{s}, \quad (1)$$
$$\int_{\bm{a}}\mu(\bm{a}|\bm{s})\,d\bm{a}=1,\quad\forall\bm{s},$$

Previous works (Peters et al., 2010; Peng et al., 2019; Nair et al., 2020) solve Eq. (1) through the KKT conditions and obtain the optimal policy $\pi^*$ as:

$$\pi^*(\bm{a}|\bm{s})=\frac{1}{Z(\bm{s})}\,\pi_b(\bm{a}|\bm{s})\,\exp\left(\alpha Q_\phi(\bm{s},\bm{a})\right), \quad (2)$$

where $Z(\bm{s})$ is the partition function. Intuitively, we can use Eq. (2) to optimize the policy $\pi$. However, the behavior policy may be very diverse and hard to model. To avoid modeling the behavior policy, prior works (Peng et al., 2019; Wang et al., 2020; Chen et al., 2020) optimize $\pi^*$ through a parameterized policy $\pi_\theta$, known as AWR:

$$\operatorname*{arg\,min}_{\theta}\ \mathbb{E}_{\bm{s}\sim\mathcal{D}^{\mu}}\left[D_{\mathrm{KL}}\left(\pi^*(\cdot|\bm{s})\,\|\,\pi_\theta(\cdot|\bm{s})\right)\right] \quad (3)$$
$$=\operatorname*{arg\,max}_{\theta}\ \mathbb{E}_{(\bm{s},\bm{a})\sim\mathcal{D}^{\mu}}\left[\frac{1}{Z(\bm{s})}\log\pi_\theta(\bm{a}|\bm{s})\exp\left(\alpha Q_\phi(\bm{s},\bm{a})\right)\right],$$

where $\exp(\alpha Q_\phi(\bm{s},\bm{a}))$ are the regression weights. However, AWR requires the exact probability density of the policy, which restricts the use of generative models such as diffusion models. In this paper, we directly utilize a diffusion-based policy to address Eq. (1). Therefore, our method not only avoids the need for explicit probability densities in AWR but also solves the limited policy expressivity problem.
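To make the practical obstacle concrete, the following is a minimal PyTorch-style sketch of the AWR objective in Eq. (3); the network names, the temperature `alpha`, and the weight clipping are illustrative assumptions rather than part of the original formulation:

```python
import torch

def awr_policy_loss(policy, q_net, states, actions, alpha=3.0, max_weight=100.0):
    """Weighted-regression loss of Eq. (3), ignoring the partition function Z(s).

    Note the explicit log-probability: this is exactly the quantity that a
    diffusion-based policy cannot provide in closed form.
    """
    with torch.no_grad():
        q = q_net(states, actions)                               # Q_phi(s, a)
        weights = torch.exp(alpha * q).clamp(max=max_weight)     # exp(alpha * Q), clipped for stability
    log_prob = policy.log_prob(states, actions)                  # log pi_theta(a | s)
    return -(weights * log_prob).mean()                          # maximize the weighted log-likelihood
```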

2.2 Diffusion Model

Diffusion Probabilistic Model (DPM). Diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song & Ermon, 2019) are composed of two processes: the forward diffusion process and the reverse process. In the forward diffusion process, we gradually add Gaussian noise to the data $\bm{x}_0\sim q(\bm{x}_0)$ over $T$ steps. The step sizes are controlled by a variance schedule $\beta_i$:

$$q(\bm{x}_{1:T}\,|\,\bm{x}_0):=\prod_{i=1}^{T} q(\bm{x}_i\,|\,\bm{x}_{i-1}), \quad (4)$$
$$q(\bm{x}_i\,|\,\bm{x}_{i-1}):=\mathcal{N}(\bm{x}_i;\sqrt{1-\beta_i}\,\bm{x}_{i-1},\beta_i\bm{I}).$$

In the reverse process, we can recreate the true sample $\bm{x}^0$ through $p(\bm{x}^{i-1}|\bm{x}^i)$:

$$p(\bm{x})=\int p(\bm{x}^{0:T})\,d\bm{x}^{1:T}=\int\mathcal{N}(\bm{x}^T;\bm{0},\bm{I})\prod_{i=1}^{T}p(\bm{x}^{i-1}|\bm{x}^i)\,d\bm{x}^{1:T}. \quad (5)$$

The training objective is to maximize the ELBO of $\mathbb{E}_{q(\bm{x}_0)}\left[\log p(\bm{x}_0)\right]$. Following DDPM (Ho et al., 2020), we use the simplified surrogate loss $\mathcal{L}_d(\theta)=\mathbb{E}_{i\sim[1,T],\,\epsilon\sim\mathcal{N}(\bm{0},\bm{I}),\,\bm{x}_0\sim q}\left[\|\epsilon-\epsilon_\theta(\bm{x}_i,i)\|^2\right]$ to approximate the ELBO. After training, sampling from the diffusion model is equivalent to running the reverse process.
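As a reference point, a minimal sketch of the simplified surrogate loss $\mathcal{L}_d(\theta)$ could look as follows; the `eps_model` interface and the precomputed `alphas_bar` table are assumptions for illustration:

```python
import torch

def ddpm_loss(eps_model, x0, alphas_bar, T):
    """Simplified DDPM surrogate loss: || eps - eps_theta(x_i, i) ||^2.

    alphas_bar[i-1] = prod_{j<=i} (1 - beta_j) for i = 1, ..., T.
    """
    batch = x0.shape[0]
    i = torch.randint(1, T + 1, (batch,), device=x0.device)            # i ~ Uniform{1, ..., T}
    eps = torch.randn_like(x0)                                          # eps ~ N(0, I)
    a_bar = alphas_bar[i - 1].view(batch, *([1] * (x0.dim() - 1)))
    x_i = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps        # closed-form forward sample
    return ((eps - eps_model(x_i, i)) ** 2).mean()
```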

Conditional DPM. There are two kinds of conditioning methods: classifier-guided (Dhariwal & Nichol, 2021) and classifier-free (Ho & Salimans, 2021). The former requires training a classifier on noisy data $\bm{x}_i$ and uses the gradient $\nabla_{\bm{x}}\log f_\phi(\bm{y}|\bm{x}_i)$ to guide the diffusion samples toward the conditioning information $\bm{y}$. The latter does not train an independent $f_\phi$ but instead combines a conditional noise model $\epsilon_\theta(\bm{x}_i,i,\bm{s})$ and an unconditional model $\epsilon_\theta(\bm{x}_i,i)$. The perturbed noise $(w+1)\epsilon_\theta(\bm{x}_i,i,\bm{s})-w\epsilon_\theta(\bm{x}_i,i)$ is then used to generate samples. However, Pearce et al. (2023) show that this combination can degrade policy performance in offline RL. Following Pearce et al. (2023) and Wang et al. (2022), we solely employ a conditional noise model $\epsilon_\theta(\bm{x}_i,i,\bm{s})$ as our noise model (i.e., $w=0$).
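For completeness, a hedged sketch of the classifier-free combination described above; the function and model names are illustrative, and setting `w = 0` recovers the purely conditional noise model used in this paper:

```python
def guided_noise(eps_cond, eps_uncond, x_i, i, s, w=0.0):
    """Classifier-free guidance: (w + 1) * eps_theta(x_i, i, s) - w * eps_theta(x_i, i)."""
    eps_c = eps_cond(x_i, i, s)          # conditional noise prediction
    if w == 0.0:
        return eps_c                     # the setting adopted here (w = 0)
    eps_u = eps_uncond(x_i, i)           # unconditional noise prediction
    return (w + 1.0) * eps_c - w * eps_u
```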

3 Methodology

In this section, we detail the design of our Diffusion-based Constrained Policy Search (DiffCPS). First, we show that the limited policy expressivity of AWR degrades policy performance on a bandit problem (Section 3.1). Second, we formulate CPS with a diffusion-based policy and obtain our diffusion-based CPS problem (Section 3.2). Then, we solve DiffCPS with the primal-dual method and show that the price to pay in terms of the duality gap depends on the representation ability of the parametrization (Section 3.3). (All proofs are placed in Appendix A due to limited space.)

3.1 Is policy expressivity all you need for AWR?

Before presenting our method, we first show that the limited policy expressivity of previous Advantage Weighted Regression (AWR) methods (Peters et al., 2010; Peng et al., 2019; Nair et al., 2020) may degrade performance, using a simple 2-D bandit experiment with real-valued actions. Our offline dataset is constructed from a unit circle with noise (first panel of Figure 1). Data on the noisy circle receive a positive reward of 1. Note that this offline dataset exhibits strong multi-modality, since many actions (points on the noisy circle) achieve the same reward of 1 for a given state. However, if we use a unimodal Gaussian to represent the policy, although the points on the noisy circle all receive the same reward of 1, the actions taken by AWR move toward the center of the noisy circle (AWR in Figure 1) due to the incorrect unimodal assumption. The results in Figure 1 illustrate that AWR performs poorly in the bandit experiment compared to the diffusion-based methods. We also notice that, compared to the other diffusion-based methods, the actions generated by SfBC include many points that differ significantly from those in the dataset when $T=15$. This is due to the sampling error introduced by incorporating diffusion into AWR.

Therefore, we conclude that policy expressivity is important in offline RL since most offline RL datasets are collected from multiple behavior policies or human experts, which exhibit strong multi-modality. To better model behavior policies, we need more expressive generative models to model the policy distribution, rather than using unimodal Gaussians. Furthermore, we also need to avoid the sampling errors caused by using diffusion in AWR, which is the motivation behind our algorithm.

3.2 Diffusion-Based Constrained Policy Search

In this subsection, we focus on deriving our diffusion-based CPS problem Eq. (12). Firstly, we present the form of constrained policy search (Peng et al., 2019):

$$\mu^*=\operatorname*{arg\,max}_{\mu} J(\mu)=\operatorname*{arg\,max}_{\mu}\int_{\mathcal{S}} d_0(\bm{s})\int_{\mathcal{A}} Q^{\mu}(\bm{s},\bm{a})\,d\bm{a}\,d\bm{s} \quad (6)$$
$$\text{s.t.}\quad \mathbb{E}_{\bm{s}\sim\rho^{\pi_b}(\bm{s})}\left[D_{\mathrm{KL}}(\pi_b(\cdot|\bm{s})\,\|\,\mu(\cdot|\bm{s}))\right]\leq\epsilon, \quad (7)$$
$$\int_{\bm{a}}\mu(\bm{a}|\bm{s})\,d\bm{a}=1,\quad\forall\bm{s}, \quad (8)$$

where $\mu(\bm{a}|\bm{s})$ denotes our diffusion-based policy and $\pi_b$ denotes the behavior policy. Here we represent our policy $\mu(\bm{a}|\bm{s})$ via the conditional diffusion model:

$$\mu(\bm{a}|\bm{s})=\int\mu(\bm{a}^{0:T}|\bm{s})\,d\bm{a}^{1:T}=\int\mathcal{N}(\bm{a}^T;\bm{0},\bm{I})\prod_{i=1}^{T}\mu(\bm{a}^{i-1}|\bm{a}^i,\bm{s})\,d\bm{a}^{1:T}, \quad (9)$$

where the end sample of the reverse chain, $\bm{a}^0$, is the action used as the policy's output and $\mu(\bm{a}^{0:T})$ is the joint distribution of all noisy samples. Following DDPM, we approximate the reverse process $\mu(\bm{a}^{i-1}|\bm{a}^i,\bm{s})$ with a Gaussian distribution $\mathcal{N}(\bm{a}^{i-1};\bm{\mu}_\theta(\bm{a}^i,\bm{s},i),\bm{\Sigma}_\theta(\bm{a}^i,\bm{s},i))$. We also follow DDPM in fixing the covariance matrix and predicting the mean with a conditional noise model $\epsilon_\theta(\bm{a}^i,\bm{s},i)$:

$$\bm{\mu}_\theta(\bm{a}^i,\bm{s},i)=\frac{1}{\sqrt{\alpha_i}}\left(\bm{a}^i-\frac{\beta_i}{\sqrt{1-\bar{\alpha}_i}}\epsilon_\theta(\bm{a}^i,\bm{s},i)\right). \quad (10)$$

In the reverse process, we sample $\bm{a}^T\sim\mathcal{N}(\bm{0},\bm{I})$ and then follow the reverse diffusion chain $\mu(\bm{a}^{i-1}|\bm{a}^i,\bm{s})$, parameterized by $\theta$, as

$$\mathcal{N}\left(\bm{a}^{i-1};\ \frac{\bm{a}^i}{\sqrt{\alpha_i}}-\frac{\beta_i}{\sqrt{\alpha_i(1-\bar{\alpha}_i)}}\epsilon_\theta(\bm{a}^i,\bm{s},i),\ \beta_i\right) \quad (11)$$

for $i=T,\ldots,1$. The reverse sampling in Eq. (11) requires iteratively predicting $\epsilon$ a total of $T$ times. When $T$ is large, it consumes much time during the sampling process. To work with a small $T$ ($T=5$ in our experiments), we follow Xiao et al. (2021) and Wang et al. (2022) and define

$$\beta_i=1-\alpha_i=1-e^{-\beta_{\min}\frac{1}{T}-0.5(\beta_{\max}-\beta_{\min})\frac{2i-1}{T^2}},$$

which is a noise schedule obtained under the variance preserving SDE of Song et al. (2020).
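A minimal sketch of this sampler is given below, combining the noise schedule above with the reverse chain of Eq. (11); the network interface `eps_model(a, s, i)`, the $\beta_{\min}/\beta_{\max}$ values, and the final action clamp are assumptions borrowed from common diffusion-policy implementations rather than specifics of this paper:

```python
import numpy as np
import torch

def vp_betas(T, beta_min=0.1, beta_max=10.0):
    """beta_i = 1 - exp(-beta_min / T - 0.5 * (beta_max - beta_min) * (2i - 1) / T^2)."""
    i = np.arange(1, T + 1)
    return 1.0 - np.exp(-beta_min / T - 0.5 * (beta_max - beta_min) * (2 * i - 1) / T**2)

@torch.no_grad()
def sample_action(eps_model, s, action_dim, T=5, beta_min=0.1, beta_max=10.0):
    """Reverse chain of Eq. (11): start from a^T ~ N(0, I) and denoise for i = T, ..., 1."""
    betas = torch.as_tensor(vp_betas(T, beta_min, beta_max), dtype=torch.float32)
    alphas = 1.0 - betas
    alphas_bar = torch.cumprod(alphas, dim=0)

    a = torch.randn(s.shape[0], action_dim)                       # a^T ~ N(0, I)
    for i in reversed(range(T)):                                  # 0-indexed diffusion step
        step = torch.full((s.shape[0],), i, dtype=torch.long)
        eps = eps_model(a, s, step)                               # epsilon_theta(a^i, s, i)
        mean = (a - betas[i] / torch.sqrt(1.0 - alphas_bar[i]) * eps) / torch.sqrt(alphas[i])
        noise = torch.randn_like(a) if i > 0 else torch.zeros_like(a)
        a = mean + torch.sqrt(betas[i]) * noise                   # sample a^{i-1}
    return a.clamp(-1.0, 1.0)                                     # a^0 is the executed action
```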

Theorem 3.1.

Let $\mu(\bm{a}|\bm{s})$ be a diffusion-based policy and $\pi_b$ be the behavior policy. Then, we have

(1) There exists $\kappa_0\in\mathbb{R}$ such that Eq. (7) can be transformed to

$$H(\pi_b,\mu)=-\mathbb{E}_{\bm{s}\sim\rho^{\pi_b}(\bm{s}),\,\bm{a}\sim\pi_b(\cdot|\bm{s})}\left[\log\mu(\bm{a}|\bm{s})\right]\leq\kappa_0.$$

(2) $\forall\bm{s}$, $\int_{\bm{a}}\mu(\bm{a}|\bm{s})\,d\bm{a}\equiv 1$.

Corollary 3.2.

The primal problem (Eq. (6)-Eq. (8)) can be transformed into the following DiffCPS problem:

$$\mu^*=\operatorname*{arg\,max}_{\mu} J(\mu)=\operatorname*{arg\,max}_{\mu}\int_{\mathcal{S}}d_0(\bm{s})\int_{\mathcal{A}}Q^{\mu}(\bm{s},\bm{a})\,d\bm{a}\,d\bm{s}$$
$$\text{s.t.}\quad H(\pi_b,\mu)\leq\kappa_0. \quad (12)$$
Remark 3.3.

Theorem 3.1 and Corollary 3.2 formulate our diffusion-based CPS problem. Next, we show that under some mild assumptions, strong duality holds for Eq. (12).

Assumption 3.4.

Suppose that $r$ is bounded and that Slater's condition (Boyd & Vandenberghe, 2004) holds for our offline RL setting.

Assumption 3.5.

Suppose that all policies learned from the offline dataset $\mathcal{D}$ satisfy

$$\mu(\bm{a}|\bm{s})=\frac{\rho^{\mu}(\bm{s},\bm{a})}{\rho^{\pi_b}(\bm{s})}, \quad (13)$$

where $\rho^{\pi_b}(\bm{s})$ denotes the normalized discounted state visitation frequencies induced by the behavior policy.

Remark 3.6.

Assumption 3.5 is a mild assumption in offline reinforcement learning, as the state distributions encountered by all policies can only be those collected by the behavior policy. Therefore, it is feasible to assume Eq. (13) regarding the occupancy measure of the learned policy.

Theorem 3.7.

Under Assumptions 3.4 and 3.5, strong duality holds for Eq. (12), which has the same optimal policy $\mu^*$ as Eq. (15).

According to Theorem 3.7, we can get the optimal policy by solving the dual problem. In the following section, we outline the algorithm for solving Eq. (12). It consists of two steps: (i) solving the unconstrained dual function of Eq. (12), (ii) updating the Lagrange multiplier.

3.3 Primal-Dual Method for DiffCPS

In this section, duality is used to solve Eq. (12) and derive our method. The dual function of Eq. (12) is

$$d(\lambda)=\max_{\mu}\ \mathcal{L}(\mu,\lambda),\quad\text{where}\quad\mathcal{L}(\mu,\lambda)=J(\mu)+\lambda(\kappa_0-H(\pi_b,\mu)). \quad (14)$$

The Lagrange dual problem associated with Eq. (12) is

$$D^*\triangleq\min_{\lambda\geq 0}\max_{\mu}\ J(\mu)+\lambda(\kappa_0-H(\pi_b,\mu)). \quad (15)$$

In Eq. (14), we need to calculate the cross-entropy $H(\pi_b,\mu)$, which is intractable in practice.

Proposition 3.8.

For the diffusion model, we can approximate the cross-entropy $H(\pi_b,\mu)$ with an MSE-like loss $\mathcal{L}_c(\pi_b,\mu)$ through the ELBO:

$$H(\pi_b,\mu)\approx c+\mathcal{L}_c(\pi_b,\mu), \quad (16)$$

where $c$ is a constant.

Remark 3.9.

Let $\kappa=\kappa_0-c$. Proposition 3.8 allows us to use the loss of the diffusion model to approximate Eq. (7). In the following, we replace $H(\pi_b,\mu)$ with $\mathcal{L}_c(\pi_b,\mu)$.
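In implementation terms, $\mathcal{L}_c(\pi_b,\mu_\theta)$ reduces to the conditional noise-prediction loss evaluated on dataset actions. A minimal sketch follows; the `eps_model` interface and the `alphas_bar` table are illustrative assumptions:

```python
import torch

def diffusion_bc_loss(eps_model, states, actions, alphas_bar, T=5):
    """Surrogate L_c(pi_b, mu_theta) for the cross-entropy H(pi_b, mu_theta).

    (states, actions) are drawn from the offline dataset, i.e. from pi_b.
    """
    batch = actions.shape[0]
    i = torch.randint(0, T, (batch,), device=actions.device)           # random diffusion step
    eps = torch.randn_like(actions)
    a_bar = alphas_bar[i].unsqueeze(-1)
    noisy_actions = torch.sqrt(a_bar) * actions + torch.sqrt(1.0 - a_bar) * eps
    eps_pred = eps_model(noisy_actions, states, i)                      # epsilon_theta(a^i, s, i)
    return ((eps - eps_pred) ** 2).mean()
```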

In practice, to turn the original functional optimization problem into a tractable one, we use parametrized policies such as neural networks. Next, we shift focus to solving the parametric version of Eq. (12), i.e., to finding the parametrized policy $\mu_\theta$ that solves Eq. (14):

$$d_\theta(\lambda)\triangleq\max_{\theta}\ \mathcal{L}(\mu_\theta,\lambda_t),\quad\text{where}\quad\mathcal{L}(\mu_\theta,\lambda_t)=\mathbb{E}_{\bm{s}\sim d_0(\bm{s}),\,\bm{a}\sim\mu_\theta(\cdot|\bm{s})}\left[\frac{Q^{\mu_\theta}(\bm{s},\bm{a})}{Q^{\mu_\theta}_{\text{target}}(\bm{s},\bm{a})}\right]+\lambda\left(\kappa-\mathcal{L}_c(\pi_b,\mu_\theta)\right), \quad (17)$$
$$D^*_\theta\triangleq\min_{\lambda\geq 0} d_\theta(\lambda). \quad (18)$$

Notice that we follow Fujimoto & Gu (2021) and Wang et al. (2022) and normalize $Q^{\mu}(\bm{s},\bm{a})$ by dividing it by the target Q-network's value. We also clip $\lambda$ via $\lambda_{\text{clip}}=\max(c,\lambda)$ with $c\geq 0$ to keep the constraint $\lambda\geq 0$ satisfied. We further find that we can improve policy performance by delaying the policy update (Fujimoto et al., 2018). The algorithm given by Eq. (17)-Eq. (18) for solving Eq. (12) is summarized in Algorithm 1.

Algorithm 1 DiffCPS
1:  Initialize policy network $\mu_\theta$, critic networks $Q_{\phi_1}$, $Q_{\phi_2}$, and target networks $\mu_{\theta'}$, $Q_{\phi_1'}$, $Q_{\phi_2'}$,
2:  policy evaluation interval $d$ and step size $\eta$.
3:  for $t=1$ to $T$ do
4:    Sample transition mini-batch $\mathcal{B}=\{(\bm{s}_t,\bm{a}_t,r_t,\bm{s}_{t+1})\}\sim\mathcal{D}$.
5:    # Critic updating
6:    Sample $\bm{a}_{t+1}\sim\mu_\theta(\cdot|\bm{s}_{t+1})$ according to Eq. (11).
7:    $y=r_t+\gamma\min_{i=1,2}Q_{\phi_i'}(\bm{s}_{t+1},\bm{a}_{t+1})$.
8:    Update critics $\phi_i\leftarrow\operatorname*{arg\,min}_{\phi_i}N^{-1}\sum(y-Q_{\phi_i}(\bm{s}_t,\bm{a}_t))^2$.
9:    if $t$ mod $d$ then
10:       Sample $\bm{a}\sim\mu_\theta(\cdot|\bm{s})$ according to Eq. (11).
11:       # Policy updating through Eq. (17)
12:       $\theta_{t+1}\approx\operatorname*{arg\,max}_{\theta}\mathcal{L}(\mu_\theta,\lambda_t)$.
13:       # Lagrange multiplier $\lambda$ updating through Eq. (18)
14:       $\lambda_{t+1}\leftarrow\lambda_t-\eta\left[\kappa-\mathcal{L}_c(\pi_b,\mu^*_{\theta_{t+1}})\right]$.
15:       $\lambda_{t+1}\leftarrow\lambda_{\text{clip}}=\max(c,\lambda_{t+1})$, $c\geq 0$.
16:    end if
17:    # Target networks updating
18:    $\theta'=\rho\theta'+(1-\rho)\theta_{t+1}$, $\phi_i'=\rho\phi_i'+(1-\rho)\phi_i$ for $i=\{1,2\}$.
19:  end for
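A minimal sketch of lines 10-15 of Algorithm 1 is shown below; the `policy.sample` / `policy.bc_loss` interfaces, the batch-mean normalization of the target Q-value, and the hyperparameter names are illustrative assumptions, not the reference implementation:

```python
import torch

def diffcps_policy_and_dual_step(policy, q_net, q_target, states, actions,
                                 lam, kappa, policy_optim, eta=1e-4, lam_min=0.0):
    """One delayed policy update (Eq. (17)) followed by one dual update (Eq. (18))."""
    # Policy step: ascend L(mu_theta, lambda) = E[Q / Q_target] + lambda * (kappa - L_c).
    new_actions = policy.sample(states)                               # a ~ mu_theta(.|s), reparameterized
    q = q_net(states, new_actions)
    q_norm = q / q_target(states, new_actions).abs().mean().detach()  # normalized Q-value
    bc = policy.bc_loss(states, actions)                              # L_c(pi_b, mu_theta)
    policy_loss = -(q_norm.mean() + lam * (kappa - bc))               # minimize the negative Lagrangian
    policy_optim.zero_grad()
    policy_loss.backward()
    policy_optim.step()

    # Dual step: lambda <- max(c, lambda - eta * (kappa - L_c)), with c = lam_min >= 0.
    with torch.no_grad():
        lam = max(lam_min, lam - eta * (kappa - policy.bc_loss(states, actions).item()))
    return lam
```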

However, there is a cost to introducing such a parametrization in Algorithm 1: the duality gap is no longer null. In Theorem 3.11 below, we show that after introducing the parametrization, we can obtain a suboptimal solution within $\mathcal{O}(1/\epsilon)$ dual iterations. We first define $\epsilon$-universal parametrization functions.

Definition 3.10.

A parametrization $\pi_\theta$ is an $\epsilon$-universal parametrization of functions in $\mathcal{P}(\mathcal{S})$ if, for some $\epsilon>0$, there exists for any $\pi\in\mathcal{P}(\mathcal{S})$ a parameter $\theta\in\mathbb{R}^p$ such that

$$\max_{\bm{s}\in\mathcal{S}}\int_{\mathcal{A}}\left|\pi(\bm{a}|\bm{s})-\pi_\theta(\bm{a}|\bm{s})\right|\,d\bm{a}\leq\epsilon. \quad (19)$$
Table 1: The performance of DiffCPS and other SOTA baselines on D4RL tasks. The mean and standard deviation of DiffCPS are obtained by evaluating the trained policy on five different random seeds. We report the performance of baseline methods using the best results reported in their papers. "-A" refers to any number of hyperparameters allowed. Results within 3 percent of the maximum in every D4RL task and the best average result are marked with an asterisk (*).
Dataset | Environment | CQL | IDQL-A | QGPO | SfBC | DD | Diffuser | Diffusion-QL | IQL | DiffCPS (ours)
Medium-Expert | HalfCheetah | 62.4 | 95.9* | 93.5 | 92.6 | 90.6 | 79.8 | 96.8* | 86.7 | 100.3±4.1*
Medium-Expert | Hopper | 98.7 | 108.6 | 108.0 | 108.6 | 111.8* | 107.2 | 111.1* | 91.5 | 112.1±0.6*
Medium-Expert | Walker2d | 110.1 | 112.7* | 110.7* | 109.8 | 108.8 | 108.4 | 110.1 | 109.6 | 113.1±1.8*
Medium | HalfCheetah | 44.4 | 51.0 | 54.1* | 45.9 | 49.1 | 44.2 | 51.1* | 47.4 | 71.0±0.5*
Medium | Hopper | 58.0 | 65.4 | 98.0* | 57.1 | 79.3 | 58.5 | 90.5* | 66.3 | 100.1±3.5*
Medium | Walker2d | 79.2 | 82.5 | 86.0* | 77.9 | 82.5 | 79.7 | 87.0* | 78.3 | 90.9±1.6*
Medium-Replay | HalfCheetah | 46.2 | 45.9 | 47.6* | 37.1 | 39.3 | 42.2 | 47.8* | 44.2 | 50.5±0.6*
Medium-Replay | Hopper | 48.6 | 92.1 | 96.9 | 86.2 | 100* | 96.8 | 101.3* | 94.7 | 101.1±0.2*
Medium-Replay | Walker2d | 26.7 | 85.1* | 84.4 | 65.1 | 75.0 | 61.2 | 95.5* | 73.9 | 91.3±0.7*
Average (Locomotion) | | 63.9 | 82.1 | 86.6 | 75.6 | 81.8 | 75.3 | 87.9 | 76.9 | 92.26*
Default | AntMaze-umaze | 74.0 | 94.0* | 96.4* | 92.0 | - | - | 93.4 | 87.5 | 97.4±3.7*
Diverse | AntMaze-umaze | 84.0* | 80.2 | 74.4 | 85.3* | - | - | 66.2 | 62.2 | 87.4±3.8*
Play | AntMaze-medium | 61.2 | 84.5* | 83.6* | 81.3 | - | - | 76.6 | 71.2 | 88.2±2.2*
Diverse | AntMaze-medium | 53.7 | 84.8* | 83.8* | 82.0 | - | - | 78.6 | 70.0 | 87.8±6.5*
Play | AntMaze-large | 15.8 | 63.5* | 66.6* | 59.3 | - | - | 46.4 | 39.6 | 65.6±3.6*
Diverse | AntMaze-large | 14.9 | 67.9* | 64.8* | 45.5 | - | - | 57.3 | 47.5 | 63.6±3.9*
Average (AntMaze) | | 50.6 | 79.1 | 78.3 | 74.2 | - | - | 69.8 | 63.0 | 81.67*
# Diffusion steps | | - | 5 | 15 | 15 | 100 | 100 | 5 | - | 5
Theorem 3.11.

Let $\pi_\theta$ be an $\epsilon$-universal parametrization of $\mathcal{P}(\mathcal{S})$ according to Definition 3.10. Suppose that $r$ is bounded by a constant $R>0$, let $B=\max_{\bm{s},\bm{a}}\rho^{\pi_b}(\bm{s},\bm{a})/\rho(\bm{s},\bm{a})$, and let $\gamma\in(0,1)$ be the discount factor. Then, under Assumptions 3.4 and 3.5 and for any $\varepsilon>0$, the sequence of updates of Algorithm 1 with step size $\eta$ converges in $K>0$ steps, with

$$K\leq\left\|\lambda_0-\lambda_\theta^\star\right\|^2\big/\,2\eta\varepsilon, \quad (20)$$

to a neighborhood of $P^\star$ (the solution of Eq. (12)), satisfying

$$-\frac{\left(R+\left\|\lambda^\star\right\|_1 B\right)\epsilon}{1-\gamma}\leq d_\theta(\lambda_K)-P^\star\leq\eta\frac{C}{2}+\delta+\varepsilon, \quad (21)$$

where $C=\left(\kappa_0+\int_{\mathcal{S}\times\mathcal{A}}\rho^{\pi_b}(\bm{s},\bm{a})\log\pi_b(\bm{a}|\bm{s})\,d\bm{a}\,d\bm{s}\right)^2$, $\delta$ denotes the error between the parameterized local optimal solution and the optimal solution of Eq. (17), and $\lambda^\star$ is the solution to the dual problem associated with Eq. (12).

Remark 3.12.

Notice that the suboptimal solution to which the primal-dual algorithm converges depends on the quality of the solution of Eq. (17) and on the representation ability of the parametrized policy, and expressiveness is precisely the greatest strength of diffusion models.

In theory, we need to solve Eq. (17) and Eq. (18) exactly. In practice, however, we can resort to SGD and a parametrized policy to solve them. In this way, we can recursively optimize Eq. (12) through Eq. (17) and Eq. (18). The policy improvement described in Eq. (17) and the solution of Eq. (18) constitute the core of DiffCPS.

4 Experiments

We evaluate DiffCPS on the D4RL (Fu et al., 2020) benchmark in Section 4.1. We then conduct ablation experiments to assess the contribution of the different parts of DiffCPS in Section 4.2. Due to space limitations, we place the experimental setup, implementation details, and additional experimental results in Appendix C.

4.1 Results

In Table 1, we compare the performance of DiffCPS to other offline RL methods on D4RL (Fu et al., 2020) tasks. We only illustrate the results of the MuJoCo locomotion and AntMaze tasks due to the page limit. On the traditional MuJoCo tasks, DiffCPS outperforms other methods, including recent diffusion-based methods (Wang et al., 2022; Lu et al., 2023; Hansen-Estruch et al., 2023), by large margins in most tasks, especially HalfCheetah medium. The medium datasets are collected by an online SAC (Haarnoja et al., 2018a) agent trained to approximately 1/3 of expert performance. Hence, they contain many suboptimal trajectories, which makes them hard for offline RL algorithms to learn from.

Compared to the locomotion tasks, the AntMaze tasks are more challenging since the datasets consist of sparse rewards and suboptimal trajectories. Even so, DiffCPS achieves competitive or SOTA results compared with other methods. For simpler tasks like umaze, DiffCPS achieves a $100\pm 0\%$ success rate on some seeds, which shows its powerful ability to learn from suboptimal trajectories. On the other AntMaze tasks, DiffCPS also shows competitive performance compared to other SOTA diffusion-based approaches. Overall, the experiments demonstrate the effectiveness of DiffCPS compared to several existing offline RL methods.

Table 2: Ablation study of diffusion steps. We conduct an ablation study to investigate the impact of diffusion steps on different algorithms. We only show the average score due to the page limit. The results of DiffCPS are obtained from three random seeds, while the results of SfBC are derived from the original SfBC paper.
D4RL Tasks | DiffCPS (T=5) | DiffCPS (T=15) | DiffCPS (T=20) | SfBC (T=10) | SfBC (T=15) | SfBC (T=25)
Locomotion | 92.0 | 87.5 | 87.6 | 72.9 | 75.6 | 74.4
AntMaze | 80.0 | 60.7 | 66.7 | 65.7 | 74.2 | 73.0
Figure 2: Ablation studies of diffusion steps $T$ on selected Gym tasks (three random seeds). We observe that as $T$ increases, training stability improves, but the final performance drops.

4.2 Ablation Study

In this section, we analyze quantitatively why DiffCPS outperforms the other methods on D4RL tasks. We conduct an ablation study to investigate the impact of three parts of DiffCPS, i.e., the diffusion steps, the minimum value of the Lagrange multiplier $\lambda_{\text{clip}}$, and the policy evaluation interval.

Diffusion Steps. We show the effect of the number of diffusion steps $T$, which is a vital hyperparameter in all diffusion-based methods. In SfBC the best $T$ is 15, while $T=5$ works best for DiffCPS. Table 2 shows the average performance for different diffusion steps. Figure 2 shows the training curves of selected D4RL tasks over different diffusion steps $T$.

We also note that a large $T$ works better in the bandit experiment. However, for D4RL tasks, a large $T$ leads to a performance drop. The reason is that, compared to the bandit task, D4RL datasets contain a significant amount of suboptimal trajectories. A larger $T$ implies a stronger behavior-cloning ability, which can lead the policy to overfit the suboptimal data, especially when combined with actor-critic methods. Poor policies result in erroneous value estimates, and vice versa, creating a vicious cycle that degrades policy performance for large $T$.

The Minimum Value of the Lagrange Multiplier $\lambda_{\text{clip}}$. In our method, $\lambda$ serves as the coefficient of the policy constraint term, where a larger $\lambda$ implies a stronger policy constraint. Although we need to restrict $\lambda\geq 0$ according to the definition of the Lagrange multiplier, we notice that we obtain better results in the AntMaze tasks by clipping $\lambda\geq c$, where $c$ is a positive number; see the full $\lambda$ ablation results in Figure 3 for details.

Figure 3: Ablation studies of $\lambda_{\text{clip}}$ on AntMaze and MuJoCo tasks. We observe that $\lambda_{\text{clip}}$ has little impact on the MuJoCo tasks but significantly influences the AntMaze tasks, especially the larger AntMaze datasets. The reason is that the sparse rewards and suboptimal trajectories in the AntMaze datasets make the critic network prone to estimation errors, which leads to learning poor policies. Therefore, there is a need to strengthen learning from the original dataset, which means we should increase $\lambda$ or strengthen the KL constraint. We find that increasing $\lambda_{\text{clip}}$ while maintaining a moderate KL constraint achieves the best results. All results are obtained by evaluating three random seeds.
Figure 4: Ablation studies of the policy evaluation interval on AntMaze and MuJoCo tasks. Delayed policy updates have a relatively minor impact on the MuJoCo locomotion tasks. However, for large-scale sparse-reward datasets like AntMaze Large, choosing an appropriate update frequency can greatly improve the final results. The MuJoCo results are obtained with 2 million training steps (three random seeds), while the AntMaze results are obtained with 1 million training steps (three random seeds).

Policy Evaluation Interval. We include the ablation of the policy evaluation interval in Figure 4. We find that delayed policy updates (Fujimoto et al., 2018) have a significant impact on the AntMaze large tasks. For the other tasks, however, they do not have much effect and may even lead to a slight performance decline. The reason is that infrequent policy updates reduce the variance of value estimates, which is most effective in tasks where sparse rewards and suboptimal trajectories lead to significant errors in value function estimation, such as the AntMaze large tasks.

In a nutshell, the ablation study shows that the combination of these three components collectively leads to the good performance of DiffCPS.

5 Related Work

Offline Reinforcement Learning. Offline RL algorithms need to avoid extrapolation error. Prior works usually address this problem through policy regularization (Fujimoto et al., 2019; Kumar et al., 2019; Wu et al., 2019; Fujimoto & Gu, 2021), value pessimism about unseen actions (Kumar et al., 2020; Kostrikov et al., 2021a), or implicit TD backups (Kostrikov et al., 2021b; Ma et al., 2021) to avoid out-of-distribution actions. Another line of research solves the offline RL problem through weighted regression (Peters et al., 2010; Peng et al., 2019; Nair et al., 2020) from the perspective of CPS. The derivation of DiffCPS is related to these works but is formulated with a diffusion-model policy.

Diffusion Models in RL. Several works introduce diffusion models to RL. Diffuser (Janner et al., 2022) uses a diffusion model to directly generate trajectories guided by gradients or rewards. Diffusion-QL (Wang et al., 2022) uses a diffusion model as the actor and optimizes it with a TD3+BC-style objective, using a coefficient $\eta$ to balance the two terms. AdaptDiffuser (Liang et al., 2023) uses a diffusion model to generate extra trajectories and a discriminator to select desired data to add to the training set, enhancing the adaptability of the diffusion model. DD (Ajay et al., 2022) uses a conditional diffusion model to generate trajectories and compose skills. Unlike Diffuser, DD diffuses only states and trains an inverse dynamics model to predict actions. QGPO (Lu et al., 2023) uses an energy function to guide the sampling process and proves that the proposed CEP training method yields an unbiased estimate of the gradient of the energy function under unlimited model capacity and data samples. IDQL (Hansen-Estruch et al., 2023) reinterprets IQL as an actor-critic method and extracts the policy by sampling from a diffusion-parameterized behavior policy with weights computed from the IQL-style critic. EDP (Kang & Ma) focuses on boosting sampling speed through approximated actions. SRPO (Chen et al.) uses a Gaussian policy whose gradient is regularized by a pretrained diffusion model to recover the IQL-style policy. DiffCPS is distinct from these methods because we derive it from Eq. (12).

The closest to our work are methods that combine AWR and diffusion models. SfBC (Chen et al., 2022) uses a diffusion model to generate candidate actions and uses the regression weights to select the best action. Our method differs in that we directly solve the limited policy expressivity problem through the primal-dual method for DiffCPS without resorting to AWR. This makes DiffCPS simple to implement and easy to tune: as shown in Table 4, we can achieve SOTA results in most tasks by merely tuning one hyperparameter.

6 Conclusion

In this work, we address the limited expressivity problem of weighted regression with a diffusion model. We first simplify the CPS problem using the action distribution of the diffusion-based policy. Then we prove that strong duality holds for diffusion-based CPS problems and solve them through the primal-dual method with function approximation. Experimental results on the D4RL benchmark illustrate the superiority of our method, which outperforms previous SOTA algorithms in most tasks; moreover, DiffCPS is easy to tune, requiring only the constraint $\kappa$ to be adjusted in most tasks. We hope our work can inspire researchers to utilize powerful generative models, especially diffusion models, for offline RL and decision-making.

7 Impact statements

Offline RL aims to learn a policy from an offline dataset, much like supervised learning, but due to extrapolation error and function approximation, offline reinforcement learning is much more challenging than supervised learning and pre-training. The DiffCPS algorithm proposed in this paper utilizes a diffusion-based policy and the primal-dual method to address the CPS problem in offline reinforcement learning. Our method may promote the development of offline reinforcement learning, thus facilitating its deployment in practical scenarios such as robotic control, and consequently bring about the corresponding ethical and societal considerations.

References

  • Ajay et al. (2022) Ajay, A., Du, Y., Gupta, A., Tenenbaum, J. B., Jaakkola, T. S., and Agrawal, P. Is conditional generative modeling all you need for decision making? In The Eleventh International Conference on Learning Representations, 2022.
  • Borkar (1988) Borkar, V. S. A convex analytic approach to markov decision processes. Probability Theory and Related Fields, 78(4):583–602, 1988.
  • Boyd & Vandenberghe (2004) Boyd, S. P. and Vandenberghe, L. Convex optimization. Cambridge university press, 2004.
  • (4) Chen, H., Lu, C., Wang, Z., Su, H., and Zhu, J. Score regularized policy optimization through diffusion behavior. URL http://arxiv.org/abs/2310.07297.
  • Chen et al. (2022) Chen, H., Lu, C., Ying, C., Su, H., and Zhu, J. Offline reinforcement learning via high-fidelity generative behavior modeling. In The Eleventh International Conference on Learning Representations, 2022.
  • Chen et al. (2021) Chen, L., Lu, K., Rajeswaran, A., Lee, K., Grover, A., Laskin, M., Abbeel, P., Srinivas, A., and Mordatch, I. Decision transformer: Reinforcement learning via sequence modeling. Advances in neural information processing systems, 34:15084–15097, 2021.
  • Chen et al. (2020) Chen, X., Zhou, Z., Wang, Z., Wang, C., Wu, Y., and Ross, K. Bail: Best-action imitation learning for batch deep reinforcement learning. Advances in Neural Information Processing Systems, 33:18353–18363, 2020.
  • Dhariwal & Nichol (2021) Dhariwal, P. and Nichol, A. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
  • Fu et al. (2020) Fu, J., Kumar, A., Nachum, O., Tucker, G., and Levine, S. D4rl: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
  • Fujimoto & Gu (2021) Fujimoto, S. and Gu, S. S. A minimalist approach to offline reinforcement learning. Advances in neural information processing systems, 34:20132–20145, 2021.
  • Fujimoto et al. (2018) Fujimoto, S., Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pp.  1587–1596. PMLR, 2018.
  • Fujimoto et al. (2019) Fujimoto, S., Meger, D., and Precup, D. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pp.  2052–2062. PMLR, 2019.
  • Ghasemipour et al. (2020) Ghasemipour, S. K. S., Schuurmans, D., and Gu, S. S. Emaq: Expected-max q-learning operator for simple yet effective offline and online rl. arXiv preprint arXiv:2007.11091, 2020.
  • (14) Gürtler, N., Blaes, S., Kolev, P., Widmaier, F., Wüthrich, M., Bauer, S., Schölkopf, B., and Martius, G. Benchmarking offline reinforcement learning on real-robot hardware.
  • Haarnoja et al. (2018a) Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pp.  1861–1870. PMLR, 2018a.
  • Haarnoja et al. (2018b) Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., and Abbeel, P. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018b.
  • Hansen-Estruch et al. (2023) Hansen-Estruch, P., Kostrikov, I., Janner, M., Kuba, J. G., and Levine, S. Idql: Implicit q-learning as an actor-critic method with diffusion policies. arXiv preprint arXiv:2304.10573, 2023.
  • Ho & Salimans (2021) Ho, J. and Salimans, T. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
  • Ho et al. (2020) Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  • Janner et al. (2022) Janner, M., Du, Y., Tenenbaum, J. B., and Levine, S. Planning with diffusion for flexible behavior synthesis. arXiv preprint arXiv:2205.09991, 2022.
  • (21) Kang, B. and Ma, X. Efficient diffusion policies for offline reinforcement learning.
  • Kidambi et al. (2020) Kidambi, R., Rajeswaran, A., Netrapalli, P., and Joachims, T. Morel: Model-based offline reinforcement learning. Advances in neural information processing systems, 33:21810–21823, 2020.
  • Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kingma & Welling (2013) Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv e-prints, pp.  arXiv–1312, 2013.
  • Kostrikov et al. (2021a) Kostrikov, I., Fergus, R., Tompson, J., and Nachum, O. Offline reinforcement learning with fisher divergence critic regularization. In International Conference on Machine Learning, pp.  5774–5783. PMLR, 2021a.
  • Kostrikov et al. (2021b) Kostrikov, I., Nair, A., and Levine, S. Offline reinforcement learning with implicit q-learning. In International Conference on Learning Representations, 2021b.
  • Kumar et al. (2019) Kumar, A., Fu, J., Soh, M., Tucker, G., and Levine, S. Stabilizing off-policy q-learning via bootstrapping error reduction. Advances in Neural Information Processing Systems, 32, 2019.
  • Kumar et al. (2020) Kumar, A., Zhou, A., Tucker, G., and Levine, S. Conservative q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:1179–1191, 2020.
  • Levine et al. (2020) Levine, S., Kumar, A., Tucker, G., and Fu, J. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
  • Liang et al. (2023) Liang, Z., Mu, Y., Ding, M., Ni, F., Tomizuka, M., and Luo, P. Adaptdiffuser: Diffusion models as adaptive self-evolving planners. arXiv preprint arXiv:2302.01877, 2023.
  • Lillicrap et al. (2015) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
  • Lu et al. (2023) Lu, C., Chen, H., Chen, J., Su, H., Li, C., and Zhu, J. Contrastive energy prediction for exact energy-guided diffusion sampling in offline reinforcement learning. arXiv preprint arXiv:2304.12824, 2023.
  • Ma et al. (2021) Ma, X., Yang, Y., Hu, H., Liu, Q., Yang, J., Zhang, C., Zhao, Q., and Liang, B. Offline reinforcement learning with value-based episodic memory. arXiv preprint arXiv:2110.09796, 2021.
  • (34) Matsushima, T., Furuta, H., Matsuo, Y., Nachum, O., and Gu, S. S. Deployment-efficient reinforcement learning via model-based offline optimization.
  • Nair et al. (2020) Nair, A., Gupta, A., Dalal, M., and Levine, S. Awac: Accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359, 2020.
  • Nie et al. (2021) Nie, X., Brunskill, E., and Wager, S. Learning when-to-treat policies. Journal of the American Statistical Association, 116(533):392–409, 2021.
  • Paternain et al. (2019) Paternain, S., Chamon, L., Calvo-Fullana, M., and Ribeiro, A. Constrained reinforcement learning has zero duality gap. Advances in Neural Information Processing Systems, 32, 2019.
  • Pearce et al. (2023) Pearce, T., Rashid, T., Kanervisto, A., Bignell, D., Sun, M., Georgescu, R., Macua, S. V., Tan, S. Z., Momennejad, I., Hofmann, K., et al. Imitating human behaviour with diffusion models. arXiv preprint arXiv:2301.10677, 2023.
  • Peng et al. (2019) Peng, X. B., Kumar, A., Zhang, G., and Levine, S. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019.
  • Peters et al. (2010) Peters, J., Mulling, K., and Altun, Y. Relative entropy policy search. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 24, pp.  1607–1612, 2010.
  • Rashidinejad et al. (2021) Rashidinejad, P., Zhu, B., Ma, C., Jiao, J., and Russell, S. Bridging offline reinforcement learning and imitation learning: A tale of pessimism. Advances in Neural Information Processing Systems, 34:11702–11716, 2021.
  • Rhinehart et al. (2018) Rhinehart, N., McAllister, R., and Levine, S. Deep imitative models for flexible inference, planning, and control. arXiv preprint arXiv:1810.06544, 2018.
  • Rockafellar (1970) Rockafellar, R. Convex analysis. Princeton Math. Series, 28, 1970.
  • Shafiullah et al. (2022) Shafiullah, N. M., Cui, Z., Altanzaya, A. A., and Pinto, L. Behavior transformers: Cloning $k$ modes with one stone. Advances in neural information processing systems, 35:22955–22968, 2022.
  • Sohl-Dickstein et al. (2015) Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp.  2256–2265. PMLR, 2015.
  • Song & Ermon (2019) Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 32, 2019.
  • Song et al. (2020) Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
  • Suh et al. (2023) Suh, H., Chou, G., Dai, H., Yang, L., Gupta, A., and Tedrake, R. Fighting uncertainty with gradients: Offline reinforcement learning via diffusion score matching. arXiv preprint arXiv:2306.14079, 2023.
  • Sutton & Barto (2018) Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT press, 2018.
  • Tseng et al. (2017) Tseng, H.-H., Luo, Y., Cui, S., Chien, J.-T., Ten Haken, R. K., and Naqa, I. E. Deep reinforcement learning for automated radiation adaptation in lung cancer. Medical physics, 44(12):6690–6705, 2017.
  • Wang et al. (2020) Wang, Z., Novikov, A., Zolna, K., Merel, J. S., Springenberg, J. T., Reed, S. E., Shahriari, B., Siegel, N., Gulcehre, C., and Heess, N. Critic regularized regression. Advances in Neural Information Processing Systems, 33:7768–7778, 2020.
  • Wang et al. (2022) Wang, Z., Hunt, J. J., and Zhou, M. Diffusion policies as an expressive policy class for offline reinforcement learning. arXiv preprint arXiv:2208.06193, 2022.
  • Wu et al. (2019) Wu, Y., Tucker, G., and Nachum, O. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361, 2019.
  • Xiao et al. (2021) Xiao, Z., Kreis, K., and Vahdat, A. Tackling the generative learning trilemma with denoising diffusion gans. arXiv preprint arXiv:2112.07804, 2021.
  • Yu et al. (2020) Yu, T., Thomas, G., Yu, L., Ermon, S., Zou, J. Y., Levine, S., Finn, C., and Ma, T. Mopo: Model-based offline policy optimization. Advances in Neural Information Processing Systems, 33:14129–14142, 2020.
  • Yurtsever et al. (2020) Yurtsever, E., Lambert, J., Carballo, A., and Takeda, K. A survey of autonomous driving: Common practices and emerging technologies. IEEE access, 8:58443–58469, 2020.

In the appendix, we provide the proofs of the main theorems in the paper (Appendix A), the notation table (Appendix B), details of the experiments (Appendix C), some additional experiments (Appendices D, F, E), and a discussion on the use of the augmented Lagrangian method (Appendix G).

Appendix A Proofs

A.1 Theorem 3.1

We provide the proof of Theorem 3.1.

Proof.

(1): Since the pointwise KL constraint in Eq. (1) at all states is intractable, we follow Peng et al. (2019) and relax the constraint by enforcing it only in expectation:

$$\mathbb{E}_{\bm{s}\sim\rho^{\pi_b}(\bm{s})}\left[D_{\mathrm{KL}}(\pi_b(\cdot|\bm{s})\,\|\,\mu(\cdot|\bm{s}))\right]\leq\epsilon.$$

The LHS can be transformed to

$$\mathbb{E}_{\bm{s}\sim\rho^{\pi_b}(\bm{s})}\left[D_{\mathrm{KL}}(\pi_b(\cdot|\bm{s})\,\|\,\mu(\cdot|\bm{s}))\right]=\mathbb{E}_{\bm{s}\sim\rho^{\pi_b}(\bm{s}),\,\bm{a}\sim\pi_b(\cdot|\bm{s})}\left[\log\pi_b(\bm{a}|\bm{s})-\log\mu(\bm{a}|\bm{s})\right], \quad (22)$$

where $\mathbb{E}_{\bm{s}\sim\rho^{\pi_b}(\bm{s}),\,\bm{a}\sim\pi_b(\cdot|\bm{s})}\left[\log\pi_b(\bm{a}|\bm{s})\right]$ is the negative entropy of $\pi_b$, which is a constant with respect to $\mu$. So the constraint in Eq. (7) can be transformed to

$$H(\pi_b,\mu)=-\mathbb{E}_{\bm{s}\sim\rho^{\pi_b}(\bm{s}),\,\bm{a}\sim\pi_b(\cdot|\bm{s})}\left[\log\mu(\bm{a}|\bm{s})\right]\leq\kappa_0, \quad (23)$$

where $\kappa_0=\epsilon-\mathbb{E}_{\bm{s}\sim\rho^{\pi_b}(\bm{s}),\,\bm{a}\sim\pi_b(\cdot|\bm{s})}\left[\log\pi_b(\bm{a}|\bm{s})\right]$.

(2): According to the definition of the diffusion model in Eq. (9):

$$\int_{\bm{a}}\mu(\bm{a}|\bm{s})\,d\bm{a}=\int_{\bm{a}}\mu(\bm{a}^0|\bm{s})\,d\bm{a}^0=\int\int\mu(\bm{a}^{0:T}|\bm{s})\,d\bm{a}^{1:T}\,d\bm{a}^0=\int\mu(\bm{a}^{0:T}|\bm{s})\,d\bm{a}^{0:T}=1. \quad (24)$$

Eq. (24) implies the constraint Eq. (8) always holds for the diffusion model, so we can omit this constraint.

According to the above two properties, we can rewrite the diffusion-based CPS problem Eq. (6)-Eq. (8) as

$$\mu^*=\operatorname*{arg\,max}_{\mu} J(\mu)=\operatorname*{arg\,max}_{\mu}\int_{\mathcal{S}}d_0(\bm{s})\int_{\mathcal{A}}Q^{\mu}(\bm{s},\bm{a})\,d\bm{a}\,d\bm{s} \quad (25)$$
$$\text{s.t.}\quad H(\pi_b,\mu)\leq\kappa_0.$$

A.2 Theorem 3.7

We now prove Theorem 3.7.

The perturbation function associated to Eq. (25) is defined as

P(\xi)=\max_{\mu}{\mathcal{J}}(\mu)=\max_{\mu}\int_{\mathcal{S}}d_{0}({\bm{s}})\int_{\mathcal{A}}Q^{\mu}({\bm{s}},{\bm{a}})d{\bm{a}}d{\bm{s}} (26)
\text{s.t.}\quad -H(\pi_{b},\mu)\geq-\kappa_{0}+\xi.
Lemma A.1.

If (i) r is bounded; (ii) Slater's condition holds for Eq. (25); and (iii) its perturbation function P(\xi) is concave, then strong duality holds for Eq. (25).

Proof.

See, e.g., Rockafellar (1970)[Cor. 30.2.2] ∎

Proof.

(The proof skeleton is essentially based on Theorem 1 of Paternain et al. (2019).)

Conditions (i) and (ii) are satisfied by Assumption 3.4. To prove strong duality for Eq. (25), it then suffices to show that the perturbation function is concave [condition (iii)], i.e., that for every \xi^{1},\xi^{2}\in\mathbb{R} and t\in(0,1),

P\left[t\xi^{1}+(1-t)\xi^{2}\right]\geq tP\left(\xi^{1}\right)+(1-t)P\left(\xi^{2}\right). (27)

If for either perturbation \xi^{1} or \xi^{2} the problem becomes infeasible, then P(\xi^{1})=-\infty or P(\xi^{2})=-\infty, and thus Eq. (27) holds trivially. For perturbations which keep the problem feasible, suppose P(\xi^{1}) and P(\xi^{2}) are achieved by the policies \mu_{1}\in{\mathcal{P}}({\mathcal{S}}) and \mu_{2}\in{\mathcal{P}}({\mathcal{S}}), respectively. Then P(\xi^{1})={\mathcal{J}}(\mu_{1}) with -H(\pi_{b},\mu_{1})+\kappa_{0}\geq\xi^{1}, and P(\xi^{2})={\mathcal{J}}(\mu_{2}) with -H(\pi_{b},\mu_{2})+\kappa_{0}\geq\xi^{2}. To establish Eq. (27), it suffices to show that for every t\in(0,1) there exists a policy \mu_{t} such that -H(\pi_{b},\mu_{t})+\kappa_{0}\geq t\xi^{1}+(1-t)\xi^{2} and {\mathcal{J}}(\mu_{t})=t{\mathcal{J}}(\mu_{1})+(1-t){\mathcal{J}}(\mu_{2}). Notice that any policy \mu_{t} satisfying the previous conditions is a feasible policy for the slack -\kappa_{0}+t\xi^{1}+(1-t)\xi^{2}. Hence, by the definition of the perturbation function in Eq. (26), it follows that

P\left[t\xi^{1}+(1-t)\xi^{2}\right]\geq{\mathcal{J}}(\mu_{t})=t{\mathcal{J}}(\mu_{1})+(1-t){\mathcal{J}}(\mu_{2})=tP\left(\xi^{1}\right)+(1-t)P\left(\xi^{2}\right). (28)

If such a policy exists, the previous equation implies Eq. (27). Thus, to complete the proof of the result, we need to establish its existence. To do so, we start by formulating the objective {\mathcal{J}}(\mu) as a linear function.

After sufficient training, we can regard Q_{\phi}({\bm{s}},{\bm{a}}) as an unbiased approximation of Q^{\mu}({\bm{s}},{\bm{a}}):

Q_{\phi}({\bm{s}},{\bm{a}})\approx Q^{\mu}({\bm{s}},{\bm{a}})=\mathbb{E}_{\tau\sim d_{0},\mu,{\mathcal{T}}}\left[\sum_{j=0}^{J}\gamma^{j}r({\bm{s}}_{j},{\bm{a}}_{j}|{\bm{s}}_{0}={\bm{s}},{\bm{a}}_{0}={\bm{a}})\right], (29)

where \tau denotes the trajectory generated by the policy \mu({\bm{a}}|{\bm{s}}), the initial state distribution d_{0}, and the environment dynamics {\mathcal{T}}({\bm{s}}^{\prime}|{\bm{s}},{\bm{a}}).

So the objective function Eq. (6) can be written as

J(\mu) =\int_{\mathcal{S}}d_{0}({\bm{s}})\int_{\mathcal{A}}Q^{\mu}({\bm{s}},{\bm{a}})d{\bm{a}}d{\bm{s}} (30)
=\sum_{t=0}^{\infty}\gamma^{t}\mathbb{E}_{{\bm{s}}_{t}\sim p_{\mu}({\bm{s}}_{t}={\bm{s}})}\left[\mathbb{E}_{{\bm{a}}_{t}\sim\mu({\bm{a}}_{t}|{\bm{s}}_{t})}r({\bm{s}}_{t},{\bm{a}}_{t})\right]
=\sum_{t=0}^{\infty}\gamma^{t}\int_{{\mathcal{S}}\times{\mathcal{A}}}p_{\mu}({\bm{s}}_{t}={\bm{s}})\mu({\bm{a}}_{t}={\bm{a}}|{\bm{s}}_{t}={\bm{s}})r({\bm{s}},{\bm{a}})d{\bm{a}}d{\bm{s}}
=\int_{{\mathcal{S}}\times{\mathcal{A}}}\sum_{t=0}^{\infty}\gamma^{t}p_{\mu}({\bm{s}}_{t}={\bm{s}})\mu({\bm{a}}_{t}={\bm{a}}|{\bm{s}}_{t}={\bm{s}})r({\bm{s}},{\bm{a}})d{\bm{a}}d{\bm{s}}
=\frac{1}{1-\gamma}\mathbb{E}_{{\bm{s}},{\bm{a}}\sim\rho^{\mu}({\bm{s}},{\bm{a}})}\left[r({\bm{s}},{\bm{a}})\right]
=\frac{1}{1-\gamma}\int_{{\mathcal{S}}\times{\mathcal{A}}}\rho^{\mu}({\bm{s}},{\bm{a}})r({\bm{s}},{\bm{a}})d{\bm{a}}d{\bm{s}},

which implies that the objective function Eq. (6) is equivalent to

\max_{\rho\in{\mathcal{R}}}\frac{1}{1-\gamma}\int_{{\mathcal{S}}\times{\mathcal{A}}}\rho^{\mu}({\bm{s}},{\bm{a}})r({\bm{s}},{\bm{a}})d{\bm{a}}d{\bm{s}}, (31)

where {\mathcal{R}} is the set of occupation measures, which is convex and compact [Borkar (1988), Theorem 3.1].

So Eq. (31) is a linear function of \rho({\bm{s}},{\bm{a}}). Let \rho_{1}({\bm{s}},{\bm{a}}),\rho_{2}({\bm{s}},{\bm{a}})\in{\mathcal{R}} be the occupation measures associated to \mu_{1} and \mu_{2}. Since {\mathcal{R}} is convex, there exists a policy \mu_{t} whose occupation measure is \rho_{t}({\bm{s}},{\bm{a}})=t\rho_{1}({\bm{s}},{\bm{a}})+(1-t)\rho_{2}({\bm{s}},{\bm{a}})\in{\mathcal{R}}, and {\mathcal{J}}(\mu_{t})=t{\mathcal{J}}(\mu_{1})+(1-t){\mathcal{J}}(\mu_{2}). To prove Eq. (27), it suffices to show that such a \mu_{t} satisfies -H(\pi_{b},\mu_{t})+\kappa_{0}\geq t\xi^{1}+(1-t)\xi^{2}. Using Assumption 3.5, we get

-H(\pi_{b},\mu_{t}) =\int_{{\mathcal{S}}\times{\mathcal{A}}}\rho^{\pi_{b}}({\bm{s}},{\bm{a}})\log\frac{\rho_{t}({\bm{s}},{\bm{a}})}{\rho^{\pi_{b}}({\bm{s}})}d{\bm{a}}d{\bm{s}} (32)
=\int_{{\mathcal{S}}\times{\mathcal{A}}}\rho^{\pi_{b}}({\bm{s}},{\bm{a}})\log\frac{t\rho_{1}({\bm{s}},{\bm{a}})+(1-t)\rho_{2}({\bm{s}},{\bm{a}})}{\rho^{\pi_{b}}({\bm{s}})}d{\bm{a}}d{\bm{s}}
\geq t\int_{{\mathcal{S}}\times{\mathcal{A}}}\rho^{\pi_{b}}({\bm{s}},{\bm{a}})\log\frac{\rho_{1}({\bm{s}},{\bm{a}})}{\rho^{\pi_{b}}({\bm{s}})}d{\bm{a}}d{\bm{s}}+(1-t)\int_{{\mathcal{S}}\times{\mathcal{A}}}\rho^{\pi_{b}}({\bm{s}},{\bm{a}})\log\frac{\rho_{2}({\bm{s}},{\bm{a}})}{\rho^{\pi_{b}}({\bm{s}})}d{\bm{a}}d{\bm{s}} \quad\text{(Jensen's inequality)}
\geq t\xi^{1}+(1-t)\xi^{2}-\kappa_{0}.

This completes the proof that the perturbation function is concave; hence, by Lemma A.1, strong duality holds for Eq. (12).

Finally, we prove that Eq. (12) has the same optimal policy \mu as Eq. (15).

For Eq. (15), if strong duality holds and a dual optimal solution \lambda^{*} exists, then any primal optimal point of Eq. (12) is also a maximizer of {\mathcal{L}}(\mu,\lambda^{*}); this follows from the KKT conditions [Boyd & Vandenberghe (2004), Ch. 5.5.5].

So if we can prove that the maximizer of {\mathcal{L}}(\mu,\lambda^{*}) is unique, we can compute the optimal policy of Eq. (12) from the dual problem Eq. (15):

{\mathcal{L}}(\mu,\lambda^{*}) =J(\mu)+\lambda^{*}(\kappa_{0}-H(\pi_{b},\mu)) (33)
=\frac{1}{1-\gamma}\left[\int_{{\mathcal{S}}\times{\mathcal{A}}}\rho^{\mu}({\bm{s}},{\bm{a}})r({\bm{s}},{\bm{a}})d{\bm{a}}d{\bm{s}}+\lambda^{*}\left(\bar{\kappa}_{0}-\int_{{\mathcal{S}}\times{\mathcal{A}}}\rho^{\pi_{b}}({\bm{s}},{\bm{a}})\log\frac{\rho^{\mu}({\bm{s}},{\bm{a}})}{\rho^{\pi_{b}}({\bm{s}})}d{\bm{a}}d{\bm{s}}\right)\right], \quad\text{(using Eq. (30))}

According to Eq. (30), J(\mu) is a linear function of \rho^{\mu}, and the negative relative entropy is concave. This implies that {\mathcal{L}}(\mu,\lambda^{*}) is a strictly concave function of \rho^{\mu} with a unique maximizer. Since the occupancy measure \rho^{\mu}({\bm{s}},{\bm{a}}) has a one-to-one relationship with the policy \mu, we can obtain the unique \mu corresponding to \rho^{\mu}. ∎

A.3 Proposition 3.8

Proof.
H(\pi_{b},\mu) =-\mathbb{E}_{{\bm{s}}\sim\rho^{\pi_{b}}({\bm{s}}),{\bm{a}}\sim\pi_{b}(\cdot|{\bm{s}})}\left[\log\mu(\cdot|{\bm{s}})\right] (34)
=-\mathbb{E}_{{\bm{s}}\sim\rho^{\pi_{b}}({\bm{s}})}\left[\int\pi_{b}({\bm{a}}|{\bm{s}})d{\bm{a}}\log\int\pi_{b}({\bm{a}}^{1:T}|{\bm{a}},{\bm{s}})\frac{\mu({\bm{a}}^{0:T}|{\bm{s}})}{\pi_{b}({\bm{a}}^{1:T}|{\bm{a}},{\bm{s}})}d{\bm{a}}^{1:T}\right]
\leq\mathbb{E}_{{\bm{s}}\sim\rho^{\pi_{b}}({\bm{s}})}\left[\int\pi_{b}({\bm{a}}^{0:T})\log\frac{\pi_{b}({\bm{a}}^{1:T}|{\bm{a}},{\bm{s}})}{\mu({\bm{a}}^{0:T}|{\bm{s}})}d{\bm{a}}^{0:T}\right]=K. \quad\text{(Jensen's inequality)}

H(\pi_{b},\mu)=K holds if and only if \pi_{b}({\bm{a}}^{1:T}|{\bm{a}}^{0},{\bm{s}})=\mu({\bm{a}}^{1:T}|{\bm{a}}^{0},{\bm{s}}). By making the KL divergence between \pi_{b} and \mu small enough, we can approximate H(\pi_{b},\mu) with K. Furthermore, following Ho et al. (2020), we can decompose the variational bound K as:

H(\pi_{b},\mu)\approx K =\mathbb{E}_{{\bm{s}}\sim\rho^{\pi_{b}}({\bm{s}})}\left[\mathbb{E}_{\pi_{b}({\bm{a}}^{0:T})}\Big[\log\frac{\pi_{b}({\bm{a}}^{1:T}|{\bm{a}},{\bm{s}})}{\mu({\bm{a}}^{0:T}|{\bm{s}})}\Big]\right]=\mathbb{E}_{{\bm{s}}\sim\rho^{\pi_{b}}({\bm{s}})}\left[\mathbb{E}_{\pi_{b}({\bm{a}}^{0:T})}\Big[\log\frac{\pi_{b}({\bm{a}}^{1:T}|{\bm{a}}^{0},{\bm{s}})}{\mu({\bm{a}}^{0:T}|{\bm{s}})}\Big]\right]
=\mathbb{E}_{\rho^{\pi_{b}}({\bm{s}},{\bm{a}})}\Big[\log\frac{\prod_{t=1}^{T}\pi_{b}({\bm{a}}^{t}|{\bm{a}}^{t-1})}{\mu({\bm{a}}^{T})\prod_{t=1}^{T}\mu({\bm{a}}^{t-1}|{\bm{a}}^{t})}\Big]
=\mathbb{E}_{\rho^{\pi_{b}}({\bm{s}},{\bm{a}})}\Big[-\log\mu({\bm{a}}^{T})+\sum_{t=1}^{T}\log\frac{\pi_{b}({\bm{a}}^{t}|{\bm{a}}^{t-1})}{\mu({\bm{a}}^{t-1}|{\bm{a}}^{t})}\Big]
=\mathbb{E}_{\rho^{\pi_{b}}({\bm{s}},{\bm{a}})}\Big[-\log\mu({\bm{a}}^{T})+\sum_{t=2}^{T}\log\frac{\pi_{b}({\bm{a}}^{t}|{\bm{a}}^{t-1})}{\mu({\bm{a}}^{t-1}|{\bm{a}}^{t})}+\log\frac{\pi_{b}({\bm{a}}^{1}|{\bm{a}}^{0})}{\mu({\bm{a}}^{0}|{\bm{a}}^{1})}\Big]
=\mathbb{E}_{\rho^{\pi_{b}}({\bm{s}},{\bm{a}})}\Big[-\log\mu({\bm{a}}^{T})+\sum_{t=2}^{T}\log\Big(\frac{\pi_{b}({\bm{a}}^{t-1}|{\bm{a}}^{t},{\bm{a}}^{0})}{\mu({\bm{a}}^{t-1}|{\bm{a}}^{t})}\cdot\frac{\pi_{b}({\bm{a}}^{t}|{\bm{a}}^{0})}{\pi_{b}({\bm{a}}^{t-1}|{\bm{a}}^{0})}\Big)+\log\frac{\pi_{b}({\bm{a}}^{1}|{\bm{a}}^{0})}{\mu({\bm{a}}^{0}|{\bm{a}}^{1})}\Big]
=\mathbb{E}_{\rho^{\pi_{b}}({\bm{s}},{\bm{a}})}\Big[-\log\mu({\bm{a}}^{T})+\sum_{t=2}^{T}\log\frac{\pi_{b}({\bm{a}}^{t-1}|{\bm{a}}^{t},{\bm{a}}^{0})}{\mu({\bm{a}}^{t-1}|{\bm{a}}^{t})}+\sum_{t=2}^{T}\log\frac{\pi_{b}({\bm{a}}^{t}|{\bm{a}}^{0})}{\pi_{b}({\bm{a}}^{t-1}|{\bm{a}}^{0})}+\log\frac{\pi_{b}({\bm{a}}^{1}|{\bm{a}}^{0})}{\mu({\bm{a}}^{0}|{\bm{a}}^{1})}\Big]
=\mathbb{E}_{\rho^{\pi_{b}}({\bm{s}},{\bm{a}})}\Big[-\log\mu({\bm{a}}^{T})+\sum_{t=2}^{T}\log\frac{\pi_{b}({\bm{a}}^{t-1}|{\bm{a}}^{t},{\bm{a}}^{0})}{\mu({\bm{a}}^{t-1}|{\bm{a}}^{t})}+\log\frac{\pi_{b}({\bm{a}}^{T}|{\bm{a}}^{0})}{\pi_{b}({\bm{a}}^{1}|{\bm{a}}^{0})}+\log\frac{\pi_{b}({\bm{a}}^{1}|{\bm{a}}^{0})}{\mu({\bm{a}}^{0}|{\bm{a}}^{1})}\Big]
=\mathbb{E}_{\rho^{\pi_{b}}({\bm{s}},{\bm{a}})}\Big[\log\frac{\pi_{b}({\bm{a}}^{T}|{\bm{a}}^{0})}{\mu({\bm{a}}^{T})}+\sum_{t=2}^{T}\log\frac{\pi_{b}({\bm{a}}^{t-1}|{\bm{a}}^{t},{\bm{a}}^{0})}{\mu({\bm{a}}^{t-1}|{\bm{a}}^{t})}-\log\mu({\bm{a}}^{0}|{\bm{a}}^{1})\Big]
=\mathbb{E}_{\rho^{\pi_{b}}({\bm{s}},{\bm{a}})}\Big[\underbrace{D_{\mathrm{KL}}(\pi_{b}({\bm{a}}^{T}|{\bm{a}}^{0})\parallel\mu({\bm{a}}^{T}))}_{L_{T}}+\sum_{t=2}^{T}\underbrace{D_{\mathrm{KL}}(\pi_{b}({\bm{a}}^{t-1}|{\bm{a}}^{t},{\bm{a}}^{0})\parallel\mu({\bm{a}}^{t-1}|{\bm{a}}^{t}))}_{L_{t-1}}\underbrace{-\log\mu({\bm{a}}^{0}|{\bm{a}}^{1})}_{L_{0}}\Big].

So we can approximate the entropy constraint in Eq. (12):

H(\pi_{b},\mu)\approx K=L_{T}+\sum_{t=2}^{T}L_{t-1}+L_{0}. (35)

Following DDPM (Ho et al., 2020), we use random sampling to approximate \sum_{t=2}^{T}L_{t-1}+L_{0}, obtaining

\sum_{t=2}^{T}L_{t-1}+L_{0}\approx{\mathcal{L}}_{c}(\pi_{b},\mu) (36)
=\mathbb{E}_{i\sim[1:N],\,\epsilon\sim{\mathcal{N}}(\mathbf{0},{\bm{I}}),\,({\bm{s}},{\bm{a}})\sim{\mathcal{D}}}\left[\|\epsilon-\epsilon_{\theta}(\sqrt{\bar{\alpha}_{i}}{\bm{a}}+\sqrt{1-\bar{\alpha}_{i}}\epsilon,{\bm{s}},i)\|^{2}\right].

Letting L_{T}=c and combining Eq. (35) with Eq. (36), we get

H(\pi_{b},\mu)\approx c+{\mathcal{L}}_{c}(\pi_{b},\mu). (37)
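For concreteness, the sketch below shows how the loss in Eq. (36) can be estimated from a sampled mini-batch in PyTorch. It is a minimal illustration, not our exact implementation: the noise-prediction network eps_model, the cumulative schedule alpha_bar, and the tensor shapes are assumptions described in the comments.

```python
import torch

def diffusion_bc_loss(eps_model, alpha_bar, states, actions, n_timesteps):
    """Monte Carlo estimate of L_c in Eq. (36).

    eps_model: assumed noise-prediction network eps_theta(noisy_action, state, i)
    alpha_bar: assumed 1-D tensor of length n_timesteps with cumulative products alpha_bar_i
    states, actions: mini-batch (s, a) ~ D
    """
    batch_size = actions.shape[0]
    # Sample a diffusion step i ~ Uniform{1, ..., N} for each batch element.
    i = torch.randint(1, n_timesteps + 1, (batch_size,), device=actions.device)
    # Sample Gaussian noise epsilon ~ N(0, I).
    eps = torch.randn_like(actions)
    # Perturb the dataset action: sqrt(alpha_bar_i) * a + sqrt(1 - alpha_bar_i) * eps.
    a_bar = alpha_bar[i - 1].unsqueeze(-1)
    noisy_actions = a_bar.sqrt() * actions + (1.0 - a_bar).sqrt() * eps
    # Regress the predicted noise onto the injected noise with an L2 loss.
    eps_pred = eps_model(noisy_actions, states, i)
    return ((eps - eps_pred) ** 2).sum(dim=-1).mean()
```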

A.4 Theorem 3.11

The proof of Theorem 3.11 is based on Paternain et al. (2019)[Appendix A].

We first state Lemma A.2, which shows that the error introduced by the parametrization is bounded by a constant.

Lemma A.2.

Let \rho and \rho_{\theta} be the occupation measures induced by the policies \pi\in{\mathcal{P}}({\mathcal{S}}) and \pi_{\theta}, respectively, where \pi_{\theta} is an \epsilon-parametrization of \pi. Then, it follows that

\int_{{\mathcal{S}}\times{\mathcal{A}}}|\rho(s,a)-\rho_{\theta}(s,a)|\,ds\,da\leq\frac{\epsilon}{1-\gamma}. (38)

See proof in Paternain et al. (2019)[Appendix A].

We next introduce Theorem A.3, which shows that the duality gap introduced after parameterization is bounded by a linear function of the model's expressive capability.

Theorem A.3.

Suppose that r is bounded by a constant R>0 and let B=\max_{{\bm{s}},{\bm{a}}}\rho^{\pi_{b}}({\bm{s}},{\bm{a}})/\rho({\bm{s}},{\bm{a}}). Let \lambda^{\star} be the solution to the dual problem associated with Eq. (12). Then, under the hypotheses of Theorem 3.7, it follows that

P^{\star}\geq D_{\theta}^{\star}\geq P^{\star}-\left(R+\|\lambda^{\star}\|_{1}B\right)\frac{\epsilon}{1-\gamma}, (39)

where P^{\star} is the optimal value of Eq. (12), and D_{\theta}^{\star} is the value of the parametrized dual problem Eq. (18).

Note that Theorem A.3 is actually based on Theorem 2 of Paternain et al. (2019), and the proof is as follows.

Proof of Theorem A.3.

Notice that the dual functions d(\lambda) and d_{\theta}(\lambda) associated to the problems Eq. (12) and Eq. (17), respectively, satisfy d_{\theta}(\lambda)\leq d(\lambda) for every \lambda. The latter follows from the fact that the set of parametrized policies over which the Lagrangian is maximized is contained in the set of non-parametrized policies. In particular, this holds for \lambda^{\star}, the solution of the dual problem associated to Eq. (12). Hence, we have the following sequence of inequalities:

D^{\star}=d(\lambda^{\star})\geq d_{\theta}(\lambda^{\star})\geq D^{\star}_{\theta}, (40)

where the last inequality follows from the fact that D^{\star}_{\theta} is the minimum of Eq. (18). The zero duality gap established in Theorem 3.7 completes the proof of the upper bound on D_{\theta}^{\star}. We next work towards the lower bound on D_{\theta}^{\star}. Let us write the dual function of the parametrized problem Eq. (17) as

d_{\theta}(\lambda)=d(\lambda)-\left(\max_{\mu\in{\mathcal{P}}({\mathcal{S}})}{\mathcal{L}}(\mu,\lambda)-\max_{\theta}{\mathcal{L}}_{\theta}(\theta,\lambda)\right). (41)

Let \mu^{\star}\triangleq\operatorname*{arg\,max}_{\mu\in{\mathcal{P}}({\mathcal{S}})}{\mathcal{L}}(\mu,\lambda) and let \theta^{\star} be an \epsilon-approximation of \mu^{\star}. Then, by definition of the maximum, it follows that

d_{\theta}(\lambda)\geq d(\lambda)-\left({\mathcal{L}}(\mu^{\star},\lambda)-{\mathcal{L}}_{\theta}(\theta^{\star},\lambda)\right). (42)

We next work towards a bound for {\mathcal{L}}(\mu^{\star},\lambda)-{\mathcal{L}}_{\theta}(\theta^{\star},\lambda). To do so, notice that we can write the difference in terms of the occupation measures \rho^{\star} and \rho_{\theta}^{\star} associated to the policies \mu^{\star} and \mu_{\theta^{\star}}:

{\mathcal{L}}(\mu^{\star},\lambda)-{\mathcal{L}}_{\theta}(\theta^{\star},\lambda)=\int_{{\mathcal{S}}\times{\mathcal{A}}}r\left(d\rho^{\star}(\lambda)-d\rho_{\theta}^{\star}(\lambda)\right)+\lambda\rho^{\pi_{b}}({\bm{s}},{\bm{a}})\,d(\log\rho^{\star}(\lambda)-\log\rho_{\theta}^{\star}(\lambda)). (43)

Since \mu_{\theta^{\star}} is by definition an \epsilon-approximation of \mu^{\star}, it follows from Lemma A.2 that

\int_{{\mathcal{S}}\times{\mathcal{A}}}\left|d\rho^{\star}(\lambda)-d\rho_{\theta}^{\star}(\lambda)\right|\leq\frac{\epsilon}{1-\gamma}. (44)

Using the bounds on the reward function and the occupation measures, we can upper bound the difference {\mathcal{L}}(\mu^{\star},\lambda)-{\mathcal{L}}_{\theta}(\theta^{\star},\lambda) by

{\mathcal{L}}(\mu^{\star},\lambda)-{\mathcal{L}}_{\theta}(\theta^{\star},\lambda)\leq\left(R+\|\lambda\|_{1}B\right)\frac{\epsilon}{1-\gamma}. (45)

The bound on the second term follows from the inequality (a-b)/a\leq\log a-\log b\leq(a-b)/b for a\geq b>0. Combining the previous bound with Eq. (42), we can lower bound d_{\theta}(\lambda) as

d_{\theta}(\lambda)\geq d(\lambda)-\left(R+\|\lambda\|_{1}B\right)\frac{\epsilon}{1-\gamma}. (46)

Since the previous expression holds for every \lambda, in particular it holds for \lambda_{\theta}^{\star}, the dual solution of the parametrized problem Eq. (18). Thus, we have

D_{\theta}^{\star}\geq d(\lambda)-\left(R+\|\lambda\|_{1}B\right)\frac{\epsilon}{1-\gamma}. (47)

Recall that \lambda^{\star}=\operatorname*{arg\,min}_{\lambda}d(\lambda), and use the definition of the dual function to lower bound D_{\theta}^{\star} by

D_{\theta}^{\star}\geq\max_{\mu\in{\mathcal{P}}({\mathcal{S}})}{\mathcal{J}}(\mu)+\lambda^{\star}\left(\kappa_{0}-H(\pi_{b},\mu)\right)-\left(R+\|\lambda\|_{1}B\right)\frac{\epsilon}{1-\gamma}. (48)

By definition of the maximum, we can lower bound the previous expression by substituting any \mu\in{\mathcal{P}}({\mathcal{S}}). In particular, we select \mu^{\star}, the solution to Eq. (12):

D_{\theta}^{\star}\geq{\mathcal{J}}(\mu^{\star})+\lambda^{\star}\left(\kappa_{0}-H(\pi_{b},\mu^{\star})\right)-\left(R+\|\lambda\|_{1}B\right)\frac{\epsilon}{1-\gamma}. (49)

Since \mu^{\star} is the optimal solution to Eq. (12), it follows that \kappa_{0}-H(\pi_{b},\mu^{\star})\geq 0, and since \lambda^{\star}\geq 0, the previous expression reduces to

D_{\theta}^{\star}\geq{\mathcal{J}}(\mu^{\star})-\left(R+\|\lambda\|_{1}B\right)\frac{\epsilon}{1-\gamma}=P^{\star}-\left(R+\|\lambda\|_{1}B\right)\frac{\epsilon}{1-\gamma}, (50)

which completes the proof of the result. ∎

Next, we bound the subgradient approximation error in the dual update (line 14 in Algorithm 1) through Proposition A.5.

Assumption A.4.

Let \mu_{\theta} be a parametrization of functions in {\mathcal{P}}({\mathcal{S}}) and let {\mathcal{L}}_{\theta}(\theta,\lambda) with \lambda\in\mathbb{R}^{m}_{+} be the Lagrangian associated to Eq. (17). Denote by \theta^{\star}(\lambda),\theta^{\dagger}(\lambda)\in\mathbb{R}^{P} the maximizer of {\mathcal{L}}_{\theta}(\theta,\lambda) and a local maximizer achieved by a generic reinforcement learning algorithm, respectively. Then, there exists \delta>0 such that for all \lambda\in\mathbb{R}^{m}_{+} it holds that {\mathcal{L}}_{\theta}(\theta^{\star}(\lambda),\lambda)\leq{\mathcal{L}}_{\theta}(\theta^{\dagger}(\lambda),\lambda)+\delta.

Proposition A.5.

Under Assumption A.4, the constraint in Eq. (17) evaluated at a local maximizer \theta^{\dagger}(\lambda) of the Lagrangian approximates the subgradient of the dual function Eq. (18). In particular, it follows that

d_{\theta}(\lambda)-d_{\theta}(\lambda_{\theta}^{\star})\leq\left(\lambda-\lambda_{\theta}^{\star}\right)\left(\kappa_{0}-H(\pi_{b},\mu)\right)+\delta. (51)

See proof in Paternain et al. (2019) (Appendix A).

Finally, we prove Theorem 3.11.

Proof of Theorem 3.11.

We start by showing the lower bound, which in fact holds for any \lambda. Notice that for any \lambda, by definition of the dual problem, it follows that d_{\theta}(\lambda)\geq D^{\star}_{\theta}. Combining this bound with the result of Theorem A.3, it follows that

d_{\theta}(\lambda)\geq P^{\star}-\left(R+\|\lambda^{\star}\|_{1}B\right)\frac{\epsilon}{1-\gamma}. (52)

To show the upper bound, we start by writing the difference between the dual multiplier at iteration k+1 and the solution of Eq. (18) in terms of the iterate at time k. Since \lambda_{\theta}^{\star}\in\mathbb{R}^{+}, using the non-expansiveness of the projection it follows that

\left\|\lambda_{k+1}-\lambda^{\star}_{\theta}\right\|^{2}\leq\left\|\lambda_{k}-\eta\left(\kappa_{0}-H(\pi_{b},\theta^{\dagger}(\lambda_{k}))\right)-\lambda^{\star}_{\theta}\right\|^{2}. (53)

Expanding the square and using that the squared norm of \kappa_{0}-H(\pi_{b},\theta^{\dagger}(\lambda_{k})) is bounded by C=\left(\kappa_{0}+\int_{{\mathcal{S}}\times{\mathcal{A}}}\rho^{\pi_{b}}({\bm{s}},{\bm{a}})\log\pi_{b}({\bm{a}}|{\bm{s}})d{\bm{a}}d{\bm{s}}\right)^{2}, which follows from the nonnegativity of the KL divergence in Eq. (7), it follows that

\left\|\lambda_{k+1}-\lambda^{\star}_{\theta}\right\|^{2}\leq\left\|\lambda_{k}-\lambda_{\theta}^{\star}\right\|^{2}-2\eta\left(\lambda_{k}-\lambda_{\theta}^{\star}\right)\left(\kappa_{0}-H(\pi_{b},\theta^{\dagger}(\lambda_{k}))\right)+\eta^{2}C. (54)

Using the result of Proposition A.5, we can further upper bound the inner product in the previous expression by the difference of the dual function evaluated at \lambda_{k} and \lambda_{\theta}^{\star}, plus \delta, the error in the solution of the primal maximization:

\left\|\lambda_{k+1}-\lambda^{\star}_{\theta}\right\|^{2}\leq\left\|\lambda_{k}-\lambda_{\theta}^{\star}\right\|^{2}+2\eta\left(\delta+d_{\theta}(\lambda_{\theta}^{\star})-d_{\theta}(\lambda_{k})\right)+\eta^{2}C. (55)

Defining \alpha_{k}=2(\delta+d_{\theta}(\lambda_{\theta}^{\star})-d_{\theta}(\lambda_{k}))+\eta C and unrolling the previous expression recursively yields

\left\|\lambda_{k+1}-\lambda^{\star}_{\theta}\right\|^{2}\leq\left\|\lambda_{0}-\lambda_{\theta}^{\star}\right\|^{2}+\eta\sum_{j=0}^{k}\alpha_{j}. (56)

Since d_{\theta}(\lambda_{\theta}^{\star}) is the minimum of the dual function, the difference d_{\theta}(\lambda_{\theta}^{\star})-d_{\theta}(\lambda_{k}) is always nonpositive. Thus, when \lambda_{k} is not close to the solution of the dual problem, \alpha_{k} is negative. The latter implies that the distance between \lambda_{k} and \lambda_{\theta}^{\star} is reduced by virtue of Eq. (56). Formally, for any \varepsilon>0, when \alpha_{j}>-2\varepsilon we have

d_{\theta}(\lambda_{j})-d_{\theta}(\lambda_{\theta}^{\star})\leq\eta\frac{C}{2}+\delta+\varepsilon. (57)

Using the result of Theorem A.3, we can upper bound D_{\theta}^{\star} by P^{\star}, which establishes the neighborhood defined in Eq. (21). It remains to show that the number of iterations required to do so is bounded by

K\leq\frac{\left\|\lambda_{0}-\lambda_{\theta}^{\star}\right\|^{2}}{2\eta\varepsilon}. (58)

To do so, let K>0 be the first iterate in the neighborhood Eq. (21); formally, K=\min\{j\in\mathbb{N}:\alpha_{j}>-2\varepsilon\}. Then it follows from the recursion that

\left\|\lambda_{K}-\lambda^{\star}_{\theta}\right\|^{2}\leq\left\|\lambda_{0}-\lambda_{\theta}^{\star}\right\|^{2}-2K\eta\varepsilon. (59)

Since \left\|\lambda_{K}-\lambda_{\theta}^{\star}\right\|^{2} is nonnegative, the previous expression reduces to 2K\eta\varepsilon\leq\left\|\lambda_{0}-\lambda_{\theta}^{\star}\right\|^{2}, which completes the proof of the result. ∎
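As an illustration of the dual iteration analyzed above, the sketch below performs one projected dual update \lambda_{k+1}=\left[\lambda_{k}-\eta\left(\kappa_{0}-H(\pi_{b},\theta^{\dagger}(\lambda_{k}))\right)\right]_{+}. The helper entropy_estimate (an approximation of H(\pi_{b},\mu_{\theta}) via Eq. (37)) and the lower clip lambda_clip (cf. Table 4) are assumptions for illustration, not our exact implementation.

```python
def dual_ascent_step(lam, kappa_0, entropy_estimate, eta, lambda_clip=0.0):
    """One projected (sub)gradient step on the dual variable lambda.

    lam:              current multiplier lambda_k
    kappa_0:          entropy threshold from Eq. (23)
    entropy_estimate: estimate of H(pi_b, mu_theta), e.g. via Eq. (37)
    eta:              dual step size
    lambda_clip:      lower bound so that lambda >= lambda_clip (cf. Table 4)
    """
    # The (sub)gradient of the dual function at lambda_k is kappa_0 - H.
    lam = lam - eta * (kappa_0 - entropy_estimate)
    # Project back onto the feasible set {lambda >= lambda_clip}.
    return max(lam, lambda_clip)
```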

Appendix B Notation Table

Table 3: Notation used in the paper.
H(\pi_{b},\mu) \triangleq -\mathbb{E}_{{\bm{s}}\sim\rho^{\pi_{b}}({\bm{s}}),{\bm{a}}\sim\pi_{b}(\cdot|{\bm{s}})}\left[\log\mu(\cdot|{\bm{s}})\right].
\mu, \mu({\bm{a}}|{\bm{s}}) or \mu(\cdot|{\bm{s}}) \triangleq policy.
\pi_{b}, \pi_{b}({\bm{a}}|{\bm{s}}) or \pi_{b}(\cdot|{\bm{s}}) \triangleq behavior policy.
p_{\mu}({\bm{s}}_{t}={\bm{s}}) \triangleq probability of landing in state {\bm{s}} at time t when following policy \mu from an initial state sampled from d_{0}, in an environment with transition dynamics {\mathcal{T}}.
{\mathcal{T}}, {\mathcal{P}}({\bm{s}}^{\prime}|{\bm{s}},{\bm{a}}) or p({\bm{s}}^{\prime}|{\bm{s}},{\bm{a}}) \triangleq transition dynamics.
d_{0}({\bm{s}}) \triangleq initial state distribution of the behavior policy.
\rho^{\mu}({\bm{s}}) \triangleq \sum_{t=0}^{\infty}\gamma^{t}p_{\mu}({\bm{s}}_{t}={\bm{s}}), the unnormalized discounted state visitation frequencies.
\rho^{\mu}({\bm{s}},{\bm{a}}) \triangleq (1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}p_{\mu}({\bm{s}}_{t}={\bm{s}})\mu({\bm{a}}|{\bm{s}})=(1-\gamma)\rho^{\mu}({\bm{s}})\mu({\bm{a}}|{\bm{s}}), the occupation measure of policy \mu.
Q_{\phi}({\bm{s}},{\bm{a}}) \triangleq parameterized state-action value function.
Q_{\phi^{\prime}}({\bm{s}},{\bm{a}}) \triangleq parameterized target state-action value function.
\mu_{\theta}({\bm{a}}|{\bm{s}}) \triangleq parameterized policy.
\mu_{\theta^{\prime}}({\bm{a}}|{\bm{s}}) \triangleq parameterized target policy.
\epsilon_{\theta}({\bm{x}}_{i},i) \triangleq parameterized Gaussian noise prediction network in the diffusion model.
f_{\phi}({\bm{y}}|{\bm{x}}_{i}) \triangleq parameterized classifier.

Appendix C EXPERIMENTAL DETAILS

We follow Wang et al. (2022) and build our policy with an MLP-based conditional diffusion model. Following DDPM, we recover the action from the residual (noise-prediction) network \epsilon({\bm{a}}^{i},{\bm{s}},i), and we model \epsilon_{\theta} as a 3-layer MLP with Mish activations and 256 hidden units per layer. We also follow Fujimoto et al. (2018) and build two Q-networks with the same MLP settings as our critic networks. All networks are optimized with Adam (Kingma & Ba, 2014). We provide all the hyperparameters in Table 4.
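A minimal PyTorch sketch of this architecture is given below. It is only an illustration of the description above: the timestep embedding and the exact layer arrangement are assumptions, and the actual implementation may differ.

```python
import torch
import torch.nn as nn

class NoisePredictor(nn.Module):
    """3-layer Mish MLP that predicts the injected noise eps_theta(a^i, s, i)."""
    def __init__(self, state_dim, action_dim, hidden=256, n_timesteps=5):
        super().__init__()
        # Simple learned timestep embedding (an assumption for illustration).
        self.time_embed = nn.Embedding(n_timesteps + 1, hidden)
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + hidden, hidden), nn.Mish(),
            nn.Linear(hidden, hidden), nn.Mish(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, noisy_action, state, timestep):
        t_emb = self.time_embed(timestep)
        return self.net(torch.cat([noisy_action, state, t_emb], dim=-1))

class DoubleCritic(nn.Module):
    """Two independent Q-networks, following Fujimoto et al. (2018)."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        def q_net():
            return nn.Sequential(
                nn.Linear(state_dim + action_dim, hidden), nn.Mish(),
                nn.Linear(hidden, hidden), nn.Mish(),
                nn.Linear(hidden, 1),
            )
        self.q1, self.q2 = q_net(), q_net()

    def forward(self, state, action):
        sa = torch.cat([state, action], dim=-1)
        return self.q1(sa), self.q2(sa)
```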

We train for 1000 epochs (2000 for Gym tasks). Each epoch consists of 1000 gradient steps with batch size 256. Training on the MuJoCo locomotion tasks is usually quite stable, as shown in Figure 2. However, training on AntMaze is relatively unstable due to its sparse reward setting and the suboptimal trajectories in the offline datasets.

Table 4: Hyperparameter settings of all selected tasks. Reward Tune with CQL means that we use the reward tuning method in Kumar et al. (2020). We also use max Q backup to improve performance in AntMaze tasks. In offline reinforcement learning, it is common to fine-tune rewards for the AntMaze tasks. Additionally, we have observed that some other diffusion-based methods, such as SfBC, perform more reward engineering compared to our method. \lambda_{\text{clip}} in our paper means that \lambda\geq\lambda_{\text{clip}}.
Dataset Environment Learning Rate \kappa Reward Tune max Q backup Policy evaluation interval \lambda_{\text{clip}}
Medium-Expert HalfCheetah 3e-4 0.04 None False 2 0
Medium-Expert Hopper 3e-4 0.03 None False 2 0
Medium-Expert Walker2d 3e-4 0.04 None False 2 0
Medium HalfCheetah 3e-4 0.06 None False 2 0
Medium Hopper 3e-4 0.05 None False 2 0
Medium Walker2d 3e-4 0.03 None False 2 0
Medium-Replay HalfCheetah 3e-4 0.06 None False 2 0
Medium-Replay Hopper 3e-4 0.03 None False 2 0
Medium-Replay Walker2d 3e-4 0.03 None False 2 0
Default AntMaze-umaze 3e-4 0.2 CQL False 2 0.3
Diverse AntMaze-umaze 3e-4 0.09 CQL True 2 0.3
Play AntMaze-medium 1e-3 0.3 CQL True 2 0.3
Diverse AntMaze-medium 3e-4 0.2 CQL True 2 0.3
Play AntMaze-large 3e-4 0.2 CQL True 4 0.3
Diverse AntMaze-large 3e-4 0.2 CQL True 4 0.3

Runtime. We test the runtime of DiffCPS on an RTX 3050 GPU. For training, the Gym locomotion tasks take about 4h26min for 2000 epochs (2e6 gradient steps); see Table 5 for details.

Table 5: Runtime of different diffusion-based offline RL methods.
D4RL Tasks DiffCPS (T=5) DiffusionQL (T=5) SfBC (T=5)
Locomotion Runtime (1 epoch) 7.68s 5.1s 8.42s
AntMaze Runtime (1 epoch) 9.96s 10.5s 10.53s

From Table 5, it is evident that the runtime of DiffCPS is comparable to that of other diffusion-based methods. Further optimizations, such as using JAX or incorporating other diffusion model acceleration techniques, could improve the runtime of DiffCPS.

Appendix D More Toy experiments

In this appendix, we provide further details about the toy experiments. The noisy circle consists of the unit circle with added Gaussian noise {\mathcal{N}}(0,0.05). The offline data {\mathcal{D}}=\{{\bm{a}}_{i}\}_{i=1}^{5000} are collected from the noisy circle. We train all methods for 20,000 steps to ensure convergence. The network architecture of SfBC remains consistent with the original SfBC paper, and both DQL and DiffCPS use the same MLP architecture. The code for DQL and SfBC is sourced from the official implementations provided by the authors, while the code for AWR is adapted from a PyTorch implementation available on GitHub.
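A minimal sketch of how such a dataset can be generated is shown below; the per-coordinate noise parameterization and the random seed are assumptions for illustration.

```python
import numpy as np

def make_noisy_circle_dataset(n_samples=5000, noise_std=0.05, seed=0):
    """Sample 2D actions on the unit circle and perturb them with N(0, 0.05) noise."""
    rng = np.random.default_rng(seed)
    angles = rng.uniform(0.0, 2.0 * np.pi, size=n_samples)
    actions = np.stack([np.cos(angles), np.sin(angles)], axis=-1)
    actions += rng.normal(scale=noise_std, size=actions.shape)
    return actions.astype(np.float32)
```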

Figure 5: Evaluation performance of DiffCPS and other baselines on toy bandit experiments. The dashed line represents the score of AWR. We also observe that as T increases, diffusion-based algorithms all experience a certain degree of performance decline, especially SfBC. The reason could be that as T increases, the increased model capacity leads to overfitting the data in the dataset. In the case of SfBC, the presence of sampling errors exacerbates this phenomenon.

Figure 6 shows the full results of the diffusion-based methods, and Figure 5 displays the scores of the diffusion-based algorithms under different diffusion steps T. The score is a modified Jaccard index, which computes how many generated actions fall into the offline dataset (i.e., the minimum distance to a dataset action is less than 1e-8). The dashed line represents the score of AWR.
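A sketch of this score computation under the description above is given below; normalizing by the number of generated actions is an assumption, since the exact normalization is not spelled out here.

```python
import numpy as np

def modified_jaccard_score(generated_actions, dataset_actions, tol=1e-8):
    """Fraction of generated actions whose nearest dataset action lies within tol."""
    # Pairwise Euclidean distances between generated and dataset actions.
    dists = np.linalg.norm(
        generated_actions[:, None, :] - dataset_actions[None, :, :], axis=-1
    )
    min_dists = dists.min(axis=1)
    return float((min_dists < tol).mean())
```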

Figure 6: Effect of different diffusion steps on diffusion-based algorithms.
Figure 7: Evaluation of prior offline RL methods.
Figure 8: Results of diffusion-based behavior cloning.

In Figure 7, we compare DiffCPS with prior offline RL methods, including TD3+BC (Fujimoto & Gu, 2021) with a Gaussian policy, BCQ (Fujimoto et al., 2019), and BEAR (Kumar et al., 2019). Note that both BCQ and BEAR use a VAE to model the behavior policy and then obtain the policy by regularizing toward the cloned behavior policy. However, the results in Figure 7 show that the VAE cannot accurately model the multi-modal data in the offline dataset. TD3+BC with a Gaussian policy also fails to model the optimal behavior due to the limited policy expressivity described in Section 3.1.

In Figure 8, we show the results of diffusion-based behavior cloning. The results in Figures 7 and 8 show that the diffusion model can capture the multi-modal distribution, while the other methods struggle to model the multi-modal behavior policy.

Appendix E Extra Related Work

Offline Model-based RL. Model-based RL methods represent another approach to offline reinforcement learning. Similar to model-free offline RL, model-based RL also needs to address extrapolation error. Kidambi et al. (2020) and Yu et al. (2020) perform pessimistic planning in the learned model, which penalizes the reward for uncertainty. BREMEN (Matsushima et al., ) uses a trust-region constraint to update the policy within an ensemble of dynamics models to avoid extrapolation error. VL-LCB (Rashidinejad et al., 2021) employs offline value iteration with a lower confidence bound to address extrapolation error. SGP (Suh et al., 2023) uses the score function of the diffusion model to approximate the gradient of the proposed uncertainty estimate. Our method differs from these approaches in that DiffCPS tackles the diffusion-based constrained policy search problem in offline RL through the Lagrangian dual and recursive optimization.

Appendix F Extra Experiments

F.1 Trajectory Stitching

The offline dataset is depicted in Figure 9 (a). The reward is the negative Euclidean distance from the point (1,1) plus a constant, with both actions and states represented as 2D coordinates. The dataset consists only of trajectories forming an X shape. During evaluation, the starting point is (-1,1). To achieve the maximum reward, the policy needs to stitch together offline data to reach the goal. The trajectories generated by our policy are shown in Figure 9 (b). We note that DiffCPS successfully combines trajectories from the dataset to reach the vicinity of the target point (1,1) while avoiding low-reward areas.
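For reference, a minimal sketch of this reward function is shown below; the constant offset is an assumed placeholder, since its exact value is not specified above.

```python
import numpy as np

GOAL = np.array([1.0, 1.0])
REWARD_OFFSET = 2.0  # assumed constant; the exact value is not specified in the text

def stitching_reward(state):
    """Reward = constant minus the Euclidean distance of the 2D state from the goal (1, 1)."""
    return REWARD_OFFSET - np.linalg.norm(np.asarray(state) - GOAL)
```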

(a) Offline Dataset.
(b) Trajectory of DiffCPS.
Figure 9: Trajectory Stitching.

F.2 TriFinger Dataset Experiment

The TriFinger dataset (Gürtler et al., ) is an offline dataset for the TriFinger robotic platform that includes both real and simulated data. The dataset primarily focuses on two tasks: Push and Lift. We use the Push-Sim-Expert dataset to train DiffCPS, with the network architecture consistent with the details provided in Appendix C. The goal of the push task is to move the cube to a target location. This task does not require the agent to align the orientation of the cube; the reward is based only on the desired and achieved positions.

When trained for 2e6 gradient steps on the Sim-Expert dataset, DiffCPS (T=45, target_kl=0.01, \lambda_{\text{min}}=0) achieves a success rate of 0.906\pm 0.001 on the push task over 100 evaluation episodes and 5 random seeds, as detailed in Table 6.

TriFinger-Push BC CRR IQL DiffCPS (Ours)
Sim-Expert 0.83\pm 0.02 \bf{0.94\pm 0.04} 0.88\pm 0.04 \bf{0.906\pm 0.001}
Table 6: The success rate of DiffCPS and other baselines on TriFinger Push task. Results for other baselines are sourced from Gürtler et al. and have been carefully tuned. It’s worth noting that our DiffCPS still utilizes a simple MLP to represent the policy, suggesting potential benefits from more powerful network architectures.

Appendix G Augmented Lagrangian Method

In Algorithm 1, we use alternating optimization (dual ascent) to solve the dual problem in Eq. (15). In practice, we can employ the Augmented Lagrangian Method (ALM) to enhance the stability of the algorithm. ALM introduces a strongly convex term (a penalty function) into the Lagrangian to improve convergence. Our experimental results suggest that ALM can enhance training stability but may lead to a performance decrease. This could be attributed to (i) the penalty function altering the optimal value of the KL divergence constraint and (ii) the penalty function making the policy too conservative, thereby preventing the adoption of high-reward actions. We believe that incorporating ALM and other optimization algorithms is a promising direction for further improving our algorithm; a sketch of such an update follows below.
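To make the discussion concrete, the sketch below shows an ALM-style treatment of the constraint H(\pi_{b},\mu_{\theta})\leq\kappa_{0}: a shifted quadratic penalty is added to the (minimized) policy loss and the multiplier is updated with the constraint residual. The penalty coefficient rho and the helper entropy_estimate are assumptions for illustration, not our exact implementation.

```python
def augmented_lagrangian_terms(lam, kappa_0, entropy_estimate, rho=10.0):
    """ALM-style penalty for the inequality constraint g = H(pi_b, mu_theta) - kappa_0 <= 0.

    Uses the standard shifted-penalty form for inequality constraints:
      penalty = (rho / 2) * (max(0, g + lam / rho) ** 2 - (lam / rho) ** 2)
      lam    <- max(0, lam + rho * g)
    """
    g = entropy_estimate - kappa_0
    shifted = max(0.0, g + lam / rho)
    penalty = 0.5 * rho * (shifted ** 2 - (lam / rho) ** 2)  # added to the minimized policy loss
    new_lam = max(0.0, lam + rho * g)                        # multiplier update
    return penalty, new_lam
```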