
DiffCPS: Diffusion-based Constrained Policy Search for Offline Reinforcement Learning

Longxiang He    Li Shen    Linrui Zhang    Junbo Tan    Xueqian Wang
Abstract

Constrained policy search (CPS) is a fundamental problem in offline reinforcement learning and is commonly solved by advantage weighted regression (AWR). However, previous methods may still produce out-of-distribution actions due to the limited expressivity of Gaussian-based policies. On the other hand, directly applying state-of-the-art expressive generative models (i.e., diffusion models) within the AWR framework is intractable, since AWR requires exact policy probability densities, which diffusion models do not provide. In this paper, we propose a novel approach, Diffusion-based Constrained Policy Search (dubbed DiffCPS), which tackles diffusion-based constrained policy search with the primal-dual method. Our theoretical analysis reveals that strong duality holds for diffusion-based CPS problems and that, once parameter approximation is introduced, an approximate solution can be obtained after $\mathcal{O}(1/\epsilon)$ dual iterations, where $\epsilon$ denotes the representation ability of the parametrized policy. Extensive experiments on the D4RL benchmark demonstrate the efficacy of our approach. We empirically show that DiffCPS achieves better or at least competitive performance compared to traditional AWR-based baselines as well as recent diffusion-based offline RL methods.

Machine Learning, ICML

1 Introduction

Offline Reinforcement Learning (offline RL) aims to seek an optimal policy without environmental interactions (Fujimoto et al., 2019; Levine et al., 2020). This is compelling because it has the potential to transform large-scale datasets into powerful decision-making tools while avoiding costly and risky online data collection, offering significant application prospects in fields such as healthcare (Nie et al., 2021; Tseng et al., 2017) and autonomous driving (Yurtsever et al., 2020; Rhinehart et al., 2018).

Notwithstanding its promise, applying contemporary off-policy RL algorithms (Lillicrap et al., 2015; Fujimoto et al., 2018; Haarnoja et al., 2018a, b) directly in the offline context presents challenges due to distribution shift (Fujimoto et al., 2019; Levine et al., 2020). Previous methods to mitigate this issue in the model-free offline RL setting generally fall into three categories: 1) value function-based approaches, which implement pessimistic value estimation by assigning low values to out-of-distribution actions (Kumar et al., 2020; Fujimoto et al., 2019); 2) sequential modeling approaches, which cast offline RL as a sequence generation task with return guidance (Chen et al., 2021; Janner et al., 2022; Liang et al., 2023; Ajay et al., 2022); and 3) constrained policy search (CPS) approaches, which regularize the discrepancy between the learned policy and the behavior policy (Peters et al., 2010; Peng et al., 2019; Nair et al., 2020). We focus on CPS-based offline RL due to its convergence guarantee and outstanding performance in a wide range of tasks.

Prior solutions for CPS (Peters et al., 2010; Peng et al., 2019; Nair et al., 2020) primarily train a parameterized unimodal Gaussian policy through weighted regression. However, recent works (Chen et al., 2022; Hansen-Estruch et al., 2023; Shafiullah et al., 2022) show that such unimodal Gaussian models in weighted regression impair policy performance due to limited distributional expressivity. For example, if we fit a multi-modal distribution with a unimodal Gaussian, the fit unavoidably covers the low-density area between peaks. Intuitively, we can choose a more expressive model to eliminate this issue. Nevertheless, Ghasemipour et al. (2020) show that the VAEs (Kingma & Welling, 2013) in BCQ (Fujimoto et al., 2019) do not align well with the behavior dataset, which introduces biases into the generated actions. Chen et al. (2022) utilize the diffusion probabilistic model (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song & Ermon, 2019) to generate actions and select among them with an action evaluation model under the well-studied AWR framework. However, AWR requires an exact probability density of the behavior policy, which remains intractable for diffusion-based policies. Alternatively, they leverage Monte Carlo sampling to approximate the probability density of the behavior policy, which inevitably introduces estimation biases and increases the cost of inference. In the motivating examples of Section 3.1, we visually demonstrate that these issues are pronounced even on a simple bandit task.

To solve the above issues, we present Diffusion-based Constrained Policy Search (DiffCPS), which incorporates the diffusion model to solve the limited expressivity problem of unimodal Gaussian policies. To avoid the exact probability density required by AWR, we solve the diffusion-based CPS problem with the primal-dual method. Thereby, DiffCPS addresses the limited policy expressivity problem while avoiding the intractable density calculation required by AWR. Interestingly, we find that when diffusion-based policies are used, the policy distribution constraint in CPS can be eliminated through the action distribution of the diffusion model. Then, under some mild assumptions, we show that strong duality holds for the diffusion-based CPS problem, which makes it readily solvable by primal-dual methods. We also study the effect of using a finite parametrization for the policy and show that the price to pay in terms of the duality gap depends on the representation ability of the parametrization, and expressiveness is precisely the greatest strength of diffusion models.

In summary, our main contributions are three-fold: 1) We explicitly eliminate the policy distribution constraint in the CPS problem, which allows us to establish strong duality under mild assumptions. 2) We present DiffCPS, which tackles diffusion-based constrained policy search without resorting to AWR. Our theoretical analysis shows that our method can achieve near-optimal results by solving the primal-dual problem with a parametric policy. 3) Our experimental results illustrate superior or competitive performance compared to existing offline RL methods on D4RL tasks, which also corroborates our theoretical analysis. Even compared to other diffusion-based methods, DiffCPS achieves state-of-the-art (SOTA) average scores on the D4RL MuJoCo locomotion and AntMaze suites, which substantiates the effectiveness of our method.

Figure 1: Toy offline experiment on a simple bandit task. We test the performance of AWR and diffusion-based offline RL algorithms (DQL (Wang et al., 2022) and SfBC (Chen et al., 2022)). The first row displays the actions taken by the trained policies, where $T$ denotes the number of diffusion steps. We note that AWR fails to capture the multi-modal actions in the offline dataset due to the limited policy expressivity of the unimodal Gaussian. The second row shows the effect of different diffusion steps $T$.

2 Preliminaries

Notation. In this paper, we use $\mu_\theta(\bm{a}|\bm{s})$ or $\mu$ to denote the learned policy with parameter $\theta$ and $\pi_b$ to denote the behavior policy that generated the offline data. We use superscript $i$ to denote the diffusion timestep and subscript $t$ to denote the RL trajectory timestep. For instance, $\bm{a}_t^i$ denotes the $t$-th action at the $i$-th diffusion timestep.

2.1 Constrained Policy Search in Offline RL

Consider a Markov decision process (MDP) $M=\{\mathcal{S},\mathcal{A},P,R,\gamma,d_0\}$, with state space $\mathcal{S}$, action space $\mathcal{A}$, environment dynamics $\mathcal{P}(\bm{s}'|\bm{s},\bm{a}):\mathcal{S}\times\mathcal{S}\times\mathcal{A}\rightarrow[0,1]$, reward function $R:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$, discount factor $\gamma\in[0,1)$, policy $\mu\in\mathcal{P}(\mathcal{S})$, where $\mathcal{P}(\mathcal{S})$ is the space of probability measures on $(\mathcal{A},\mathcal{B}(\mathcal{A}))$ parametrized by elements of $\mathcal{S}$ and $\mathcal{B}(\mathcal{A})$ are the Borel sets of $\mathcal{A}$, and initial state distribution $d_0$. The action-value or Q-value of policy $\mu$ is defined as $Q^{\mu}(\bm{s}_t,\bm{a}_t)=\mathbb{E}_{\bm{a}_{t+1},\bm{a}_{t+2},\ldots\sim\mu}\left[\sum_{j=0}^{\infty}\gamma^j r(\bm{s}_{t+j},\bm{a}_{t+j})\right]$. Our goal is to obtain a policy that maximizes the cumulative discounted reward $J(\theta)=\int_{\mathcal{S}}d_0(\bm{s})Q^{\mu}(\bm{s},\bm{a})\,d\bm{s}$, where $\rho^{\mu}(\bm{s})=(1-\gamma)\sum_{t=0}^{\infty}\gamma^t p_{\mu}(\bm{s}_t=\bm{s})$ denotes the normalized discounted state visitation frequencies induced by the policy $\mu$ and $p_{\mu}(\bm{s}_t=\bm{s})$ is the likelihood of being in state $\bm{s}$ after following $\mu$ for $t$ timesteps (Sutton & Barto, 2018; Peng et al., 2019).

In the offline setting (Fujimoto et al., 2019), environmental interaction is not allowed, and a static dataset $\mathcal{D}\triangleq\{(\mathcal{S},\mathcal{A},R,\mathcal{S}',\text{done})\}$ is used to learn a policy. To avoid out-of-distribution actions, we restrict the learned policy $\mu$ to stay close to the behavior policy $\pi_b$ via a KL-divergence constraint. Prior works (Peters et al., 2010; Peng et al., 2019) formulate the above offline RL problem as a constrained policy search (CPS) problem of the following form:

$$\mu^* = \operatorname*{arg\,max}_{\mu} J(\mu) = \operatorname*{arg\,max}_{\mu}\int_{\mathcal{S}} d_0(\bm{s})\int_{\mathcal{A}} Q^{\mu}(\bm{s},\bm{a})\,d\bm{a}\,d\bm{s}$$
$$\text{s.t.}\quad D_{\mathrm{KL}}(\pi_b(\cdot|\bm{s})\,\|\,\mu(\cdot|\bm{s}))\leq\epsilon,\quad\forall\bm{s}, \quad (1)$$
$$\int_{\bm{a}}\mu(\bm{a}|\bm{s})\,d\bm{a}=1,\quad\forall\bm{s},$$

Previous works (Peters et al., 2010; Peng et al., 2019; Nair et al., 2020) solve Eq. (1) through the KKT conditions and obtain the optimal policy $\pi^*$ as:

$$\pi^*(\bm{a}|\bm{s})=\frac{1}{Z(\bm{s})}\,\pi_b(\bm{a}|\bm{s})\,\exp\left(\alpha Q_\phi(\bm{s},\bm{a})\right), \quad (2)$$

where $Z(\bm{s})$ is the partition function. Intuitively, we can use Eq. (2) to optimize the policy $\pi$. However, the behavior policy may be very diverse and hard to model. To avoid modeling the behavior policy, prior works (Peng et al., 2019; Wang et al., 2020; Chen et al., 2020) optimize $\pi^*$ through a parameterized policy $\pi_\theta$, known as AWR:

$$\operatorname*{arg\,min}_{\theta}\ \mathbb{E}_{\bm{s}\sim\mathcal{D}^{\mu}}\left[D_{\mathrm{KL}}\left(\pi^*(\cdot|\bm{s})\,\|\,\pi_\theta(\cdot|\bm{s})\right)\right] \quad (3)$$
$$=\operatorname*{arg\,max}_{\theta}\ \mathbb{E}_{(\bm{s},\bm{a})\sim\mathcal{D}^{\mu}}\left[\frac{1}{Z(\bm{s})}\log\pi_\theta(\bm{a}|\bm{s})\exp\left(\alpha Q_\phi(\bm{s},\bm{a})\right)\right],$$

where $\exp(\alpha Q_\phi(\bm{s},\bm{a}))$ are the regression weights. However, AWR requires the exact probability density of the policy, which restricts the use of generative models such as diffusion models. In this paper, we directly utilize a diffusion-based policy to address Eq. (1). Therefore, our method not only avoids the need for explicit probability densities in AWR but also solves the limited policy expressivity problem.
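To make the practical obstacle concrete, the following is a minimal PyTorch-style sketch of the AWR objective in Eq. (3); the network names, the temperature `alpha`, and the weight clipping are illustrative assumptions rather than part of the original formulation:

```python
import torch

def awr_policy_loss(policy, q_net, states, actions, alpha=3.0, max_weight=100.0):
    """Weighted-regression loss of Eq. (3), ignoring the partition function Z(s).

    Note the explicit log-probability: this is exactly the quantity that a
    diffusion-based policy cannot provide in closed form.
    """
    with torch.no_grad():
        q = q_net(states, actions)                               # Q_phi(s, a)
        weights = torch.exp(alpha * q).clamp(max=max_weight)     # exp(alpha * Q), clipped for stability
    log_prob = policy.log_prob(states, actions)                  # log pi_theta(a | s)
    return -(weights * log_prob).mean()                          # maximize the weighted log-likelihood
```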

2.2 Diffusion Model

Diffusion Probabilistic Model (DPM). Diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song & Ermon, 2019) are composed of two processes: the forward diffusion process and the reverse process. In the forward diffusion process, we gradually add Gaussian noise to the data $\bm{x}_0\sim q(\bm{x}_0)$ over $T$ steps. The step sizes are controlled by a variance schedule $\beta_i$:

$$q(\bm{x}_{1:T}\,|\,\bm{x}_0):=\prod_{i=1}^{T} q(\bm{x}_i\,|\,\bm{x}_{i-1}), \quad (4)$$
$$q(\bm{x}_i\,|\,\bm{x}_{i-1}):=\mathcal{N}(\bm{x}_i;\sqrt{1-\beta_i}\,\bm{x}_{i-1},\beta_i\bm{I}).$$

In the reverse process, we can recreate the true sample $\bm{x}^0$ through $p(\bm{x}^{i-1}|\bm{x}^i)$:

$$p(\bm{x})=\int p(\bm{x}^{0:T})\,d\bm{x}^{1:T}=\int\mathcal{N}(\bm{x}^T;\bm{0},\bm{I})\prod_{i=1}^{T}p(\bm{x}^{i-1}|\bm{x}^i)\,d\bm{x}^{1:T}. \quad (5)$$

The training objective is to maximize the ELBO of $\mathbb{E}_{q(\bm{x}_0)}\left[\log p(\bm{x}_0)\right]$. Following DDPM (Ho et al., 2020), we use the simplified surrogate loss $\mathcal{L}_d(\theta)=\mathbb{E}_{i\sim[1,T],\,\epsilon\sim\mathcal{N}(\bm{0},\bm{I}),\,\bm{x}_0\sim q}\left[\|\epsilon-\epsilon_\theta(\bm{x}_i,i)\|^2\right]$ to approximate the ELBO. After training, sampling from the diffusion model is equivalent to running the reverse process.
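As a reference point, a minimal sketch of the simplified surrogate loss $\mathcal{L}_d(\theta)$ could look as follows; the `eps_model` interface and the precomputed `alphas_bar` table are assumptions for illustration:

```python
import torch

def ddpm_loss(eps_model, x0, alphas_bar, T):
    """Simplified DDPM surrogate loss: || eps - eps_theta(x_i, i) ||^2.

    alphas_bar[i-1] = prod_{j<=i} (1 - beta_j) for i = 1, ..., T.
    """
    batch = x0.shape[0]
    i = torch.randint(1, T + 1, (batch,), device=x0.device)            # i ~ Uniform{1, ..., T}
    eps = torch.randn_like(x0)                                          # eps ~ N(0, I)
    a_bar = alphas_bar[i - 1].view(batch, *([1] * (x0.dim() - 1)))
    x_i = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps        # closed-form forward sample
    return ((eps - eps_model(x_i, i)) ** 2).mean()
```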

Conditional DPM. There are two kinds of conditioning methods: classifier-guided (Dhariwal & Nichol, 2021) and classifier-free (Ho & Salimans, 2021). The former requires training a classifier on noisy data $\bm{x}_i$ and uses the gradient $\nabla_{\bm{x}}\log f_\phi(\bm{y}|\bm{x}_i)$ to guide the diffusion samples toward the conditioning information $\bm{y}$. The latter does not train an independent $f_\phi$ but instead combines a conditional noise model $\epsilon_\theta(\bm{x}_i,i,\bm{s})$ and an unconditional model $\epsilon_\theta(\bm{x}_i,i)$. The perturbed noise $(w+1)\epsilon_\theta(\bm{x}_i,i,\bm{s})-w\epsilon_\theta(\bm{x}_i,i)$ is then used to generate samples. However, Pearce et al. (2023) show that this combination can degrade policy performance in offline RL. Following Pearce et al. (2023) and Wang et al. (2022), we solely employ a conditional noise model $\epsilon_\theta(\bm{x}_i,i,\bm{s})$ as our noise model (i.e., $w=0$).
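For completeness, a hedged sketch of the classifier-free combination described above; the function and model names are illustrative, and setting `w = 0` recovers the purely conditional noise model used in this paper:

```python
def guided_noise(eps_cond, eps_uncond, x_i, i, s, w=0.0):
    """Classifier-free guidance: (w + 1) * eps_theta(x_i, i, s) - w * eps_theta(x_i, i)."""
    eps_c = eps_cond(x_i, i, s)          # conditional noise prediction
    if w == 0.0:
        return eps_c                     # the setting adopted here (w = 0)
    eps_u = eps_uncond(x_i, i)           # unconditional noise prediction
    return (w + 1.0) * eps_c - w * eps_u
```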

3 Methodology

In this section, we detail the design of our Diffusion-based Constrained Policy Search (DiffCPS). First, we show that the limited policy expressivity of AWR degrades policy performance on a bandit problem (Section 3.1). Second, we formulate CPS with a diffusion-based policy and obtain our diffusion-based CPS problem (Section 3.2). Then, we solve DiffCPS with the primal-dual method and show that the price to pay in terms of the duality gap depends on the representation ability of the parametrization (Section 3.3). (All proofs are placed in Appendix A due to limited space.)

3.1 Is policy expressivity all you need for AWR?

Before presenting our method, we first show that the limited policy expressivity of previous Advantage Weighted Regression (AWR) methods (Peters et al., 2010; Peng et al., 2019; Nair et al., 2020) may degrade performance, using a simple 2-D bandit experiment with real-valued actions. Our offline dataset is constructed from a unit circle with noise (first panel of Figure 1). Data on the noisy circle receive a positive reward of 1. Note that this offline dataset exhibits strong multi-modality, since many actions (points on the noisy circle) achieve the same reward of 1 for a given state. However, if we use a unimodal Gaussian to represent the policy, although the points on the noisy circle all receive the same reward of 1, the actions taken by AWR move toward the center of the noisy circle (AWR in Figure 1) due to the incorrect unimodal assumption. The results in Figure 1 illustrate that AWR performs poorly in the bandit experiment compared to the diffusion-based methods. We also notice that, compared to the other diffusion-based methods, the actions generated by SfBC include many points that differ significantly from those in the dataset when $T=15$. This is due to the sampling error introduced by incorporating diffusion into AWR.

Therefore, we conclude that policy expressivity is important in offline RL since most offline RL datasets are collected from multiple behavior policies or human experts, which exhibit strong multi-modality. To better model behavior policies, we need more expressive generative models to model the policy distribution, rather than using unimodal Gaussians. Furthermore, we also need to avoid the sampling errors caused by using diffusion in AWR, which is the motivation behind our algorithm.

3.2 Diffusion-Based Constrained Policy Search

In this subsection, we focus on deriving our diffusion-based CPS problem Eq. (12). Firstly, we present the form of constrained policy search (Peng et al., 2019):

$$\mu^*=\operatorname*{arg\,max}_{\mu} J(\mu)=\operatorname*{arg\,max}_{\mu}\int_{\mathcal{S}} d_0(\bm{s})\int_{\mathcal{A}} Q^{\mu}(\bm{s},\bm{a})\,d\bm{a}\,d\bm{s} \quad (6)$$
$$\text{s.t.}\quad \mathbb{E}_{\bm{s}\sim\rho^{\pi_b}(\bm{s})}\left[D_{\mathrm{KL}}(\pi_b(\cdot|\bm{s})\,\|\,\mu(\cdot|\bm{s}))\right]\leq\epsilon, \quad (7)$$
$$\int_{\bm{a}}\mu(\bm{a}|\bm{s})\,d\bm{a}=1,\quad\forall\bm{s}, \quad (8)$$

where $\mu(\bm{a}|\bm{s})$ denotes our diffusion-based policy and $\pi_b$ denotes the behavior policy. Here we represent our policy $\mu(\bm{a}|\bm{s})$ via the conditional diffusion model:

$$\mu(\bm{a}|\bm{s})=\int\mu(\bm{a}^{0:T}|\bm{s})\,d\bm{a}^{1:T}=\int\mathcal{N}(\bm{a}^T;\bm{0},\bm{I})\prod_{i=1}^{T}\mu(\bm{a}^{i-1}|\bm{a}^i,\bm{s})\,d\bm{a}^{1:T}, \quad (9)$$

where the end sample of the reverse chain, $\bm{a}^0$, is the action used as the policy's output and $\mu(\bm{a}^{0:T})$ is the joint distribution of all noisy samples. Following DDPM, we approximate the reverse process $\mu(\bm{a}^{i-1}|\bm{a}^i,\bm{s})$ with a Gaussian distribution $\mathcal{N}(\bm{a}^{i-1};\bm{\mu}_\theta(\bm{a}^i,\bm{s},i),\bm{\Sigma}_\theta(\bm{a}^i,\bm{s},i))$. We also follow DDPM in fixing the covariance matrix and predicting the mean with a conditional noise model $\epsilon_\theta(\bm{a}^i,\bm{s},i)$:

$$\bm{\mu}_\theta(\bm{a}^i,\bm{s},i)=\frac{1}{\sqrt{\alpha_i}}\left(\bm{a}^i-\frac{\beta_i}{\sqrt{1-\bar{\alpha}_i}}\epsilon_\theta(\bm{a}^i,\bm{s},i)\right). \quad (10)$$

In the reverse process, we sample $\bm{a}^T\sim\mathcal{N}(\bm{0},\bm{I})$ and then follow the reverse diffusion chain $\mu(\bm{a}^{i-1}|\bm{a}^i,\bm{s})$, parameterized by $\theta$, as

$$\mathcal{N}\left(\bm{a}^{i-1};\ \frac{\bm{a}^i}{\sqrt{\alpha_i}}-\frac{\beta_i}{\sqrt{\alpha_i(1-\bar{\alpha}_i)}}\epsilon_\theta(\bm{a}^i,\bm{s},i),\ \beta_i\right) \quad (11)$$

for $i=T,\ldots,1$. The reverse sampling in Eq. (11) requires iteratively predicting $\epsilon$ a total of $T$ times. When $T$ is large, it consumes much time during the sampling process. To work with a small $T$ ($T=5$ in our experiments), we follow Xiao et al. (2021) and Wang et al. (2022) and define

$$\beta_i=1-\alpha_i=1-e^{-\beta_{\min}\frac{1}{T}-0.5(\beta_{\max}-\beta_{\min})\frac{2i-1}{T^2}},$$

which is a noise schedule obtained under the variance preserving SDE of Song et al. (2020).
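A minimal sketch of this sampler is given below, combining the noise schedule above with the reverse chain of Eq. (11); the network interface `eps_model(a, s, i)`, the $\beta_{\min}/\beta_{\max}$ values, and the final action clamp are assumptions borrowed from common diffusion-policy implementations rather than specifics of this paper:

```python
import numpy as np
import torch

def vp_betas(T, beta_min=0.1, beta_max=10.0):
    """beta_i = 1 - exp(-beta_min / T - 0.5 * (beta_max - beta_min) * (2i - 1) / T^2)."""
    i = np.arange(1, T + 1)
    return 1.0 - np.exp(-beta_min / T - 0.5 * (beta_max - beta_min) * (2 * i - 1) / T**2)

@torch.no_grad()
def sample_action(eps_model, s, action_dim, T=5, beta_min=0.1, beta_max=10.0):
    """Reverse chain of Eq. (11): start from a^T ~ N(0, I) and denoise for i = T, ..., 1."""
    betas = torch.as_tensor(vp_betas(T, beta_min, beta_max), dtype=torch.float32)
    alphas = 1.0 - betas
    alphas_bar = torch.cumprod(alphas, dim=0)

    a = torch.randn(s.shape[0], action_dim)                       # a^T ~ N(0, I)
    for i in reversed(range(T)):                                  # 0-indexed diffusion step
        step = torch.full((s.shape[0],), i, dtype=torch.long)
        eps = eps_model(a, s, step)                               # epsilon_theta(a^i, s, i)
        mean = (a - betas[i] / torch.sqrt(1.0 - alphas_bar[i]) * eps) / torch.sqrt(alphas[i])
        noise = torch.randn_like(a) if i > 0 else torch.zeros_like(a)
        a = mean + torch.sqrt(betas[i]) * noise                   # sample a^{i-1}
    return a.clamp(-1.0, 1.0)                                     # a^0 is the executed action
```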

Theorem 3.1.

Let $\mu(\bm{a}|\bm{s})$ be a diffusion-based policy and $\pi_b$ be the behavior policy. Then, we have

(1) There exists $\kappa_0\in\mathbb{R}$ such that Eq. (7) can be transformed to

$$H(\pi_b,\mu)=-\mathbb{E}_{\bm{s}\sim\rho^{\pi_b}(\bm{s}),\,\bm{a}\sim\pi_b(\cdot|\bm{s})}\left[\log\mu(\bm{a}|\bm{s})\right]\leq\kappa_0.$$

(2) $\forall\bm{s}$, $\int_{\bm{a}}\mu(\bm{a}|\bm{s})\,d\bm{a}\equiv 1$.

Corollary 3.2.

The primal problem (Eq. (6)-Eq. (8)) can be transformed into the following DiffCPS problem:

$$\mu^*=\operatorname*{arg\,max}_{\mu} J(\mu)=\operatorname*{arg\,max}_{\mu}\int_{\mathcal{S}}d_0(\bm{s})\int_{\mathcal{A}}Q^{\mu}(\bm{s},\bm{a})\,d\bm{a}\,d\bm{s}$$
$$\text{s.t.}\quad H(\pi_b,\mu)\leq\kappa_0. \quad (12)$$
Remark 3.3.

Theorem 3.1 and Corollary 3.2 formulate our diffusion-based CPS problem. Next, we show that under some mild assumptions, strong duality holds for Eq. (12).

Assumption 3.4.

Suppose that $r$ is bounded and that Slater's condition (Boyd & Vandenberghe, 2004) holds for our offline RL setting.

Assumption 3.5.

Suppose that all policies learned from the offline dataset $\mathcal{D}$ satisfy

$$\mu(\bm{a}|\bm{s})=\frac{\rho^{\mu}(\bm{s},\bm{a})}{\rho^{\pi_b}(\bm{s})}, \quad (13)$$

where $\rho^{\pi_b}(\bm{s})$ denotes the normalized discounted state visitation frequencies induced by the behavior policy.

Remark 3.6.

Assumption 3.5 is a mild assumption in offline reinforcement learning, as the state distributions encountered by all policies can only be those collected by the behavior policy. Therefore, it is feasible to assume Eq. (13) regarding the occupancy measure of the learned policy.

Theorem 3.7.

Under Assumptions 3.4 and 3.5, strong duality holds for Eq. (12), which has the same optimal policy $\mu^*$ as Eq. (15).

According to Theorem 3.7, we can get the optimal policy by solving the dual problem. In the following section, we outline the algorithm for solving Eq. (12). It consists of two steps: (i) solving the unconstrained dual function of Eq. (12), (ii) updating the Lagrange multiplier.

3.3 Primal-Dual Method for DiffCPS

In this section, duality is used to solve Eq. (12) and derive our method. The dual function of Eq. (12) is

$$d(\lambda)=\max_{\mu}\ \mathcal{L}(\mu,\lambda),\quad\text{where}\quad\mathcal{L}(\mu,\lambda)=J(\mu)+\lambda(\kappa_0-H(\pi_b,\mu)). \quad (14)$$

The Lagrange dual problem associated with Eq. (12) is

$$D^*\triangleq\min_{\lambda\geq 0}\max_{\mu}\ J(\mu)+\lambda(\kappa_0-H(\pi_b,\mu)). \quad (15)$$

In Eq. (14), we need to calculate the cross-entropy $H(\pi_b,\mu)$, which is intractable in practice.

Proposition 3.8.

For the diffusion model, we can approximate the cross-entropy $H(\pi_b,\mu)$ with an MSE-like loss $\mathcal{L}_c(\pi_b,\mu)$ through the ELBO:

$$H(\pi_b,\mu)\approx c+\mathcal{L}_c(\pi_b,\mu), \quad (16)$$

where $c$ is a constant.

Remark 3.9.

Let $\kappa=\kappa_0-c$. Proposition 3.8 allows us to use the loss of the diffusion model to approximate Eq. (7). In the following, we replace $H(\pi_b,\mu)$ with $\mathcal{L}_c(\pi_b,\mu)$.
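In implementation terms, $\mathcal{L}_c(\pi_b,\mu_\theta)$ reduces to the conditional noise-prediction loss evaluated on dataset actions. A minimal sketch follows; the `eps_model` interface and the `alphas_bar` table are illustrative assumptions:

```python
import torch

def diffusion_bc_loss(eps_model, states, actions, alphas_bar, T=5):
    """Surrogate L_c(pi_b, mu_theta) for the cross-entropy H(pi_b, mu_theta).

    (states, actions) are drawn from the offline dataset, i.e. from pi_b.
    """
    batch = actions.shape[0]
    i = torch.randint(0, T, (batch,), device=actions.device)           # random diffusion step
    eps = torch.randn_like(actions)
    a_bar = alphas_bar[i].unsqueeze(-1)
    noisy_actions = torch.sqrt(a_bar) * actions + torch.sqrt(1.0 - a_bar) * eps
    eps_pred = eps_model(noisy_actions, states, i)                      # epsilon_theta(a^i, s, i)
    return ((eps - eps_pred) ** 2).mean()
```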

In practice, to turn the original functional optimization problem into a tractable one, we use parametrized policies such as neural networks. Next, we shift focus to solving the parametric version of Eq. (12), i.e., to finding the parametrized policy $\mu_\theta$ that solves Eq. (14):

$$d_\theta(\lambda)\triangleq\max_{\theta}\ \mathcal{L}(\mu_\theta,\lambda_t),\quad\text{where}\quad\mathcal{L}(\mu_\theta,\lambda_t)=\mathbb{E}_{\bm{s}\sim d_0(\bm{s}),\,\bm{a}\sim\mu_\theta(\cdot|\bm{s})}\left[\frac{Q^{\mu_\theta}(\bm{s},\bm{a})}{Q^{\mu_\theta}_{\text{target}}(\bm{s},\bm{a})}\right]+\lambda\left(\kappa-\mathcal{L}_c(\pi_b,\mu_\theta)\right), \quad (17)$$
$$D^*_\theta\triangleq\min_{\lambda\geq 0} d_\theta(\lambda). \quad (18)$$

Notice that we follow Fujimoto & Gu (2021) and Wang et al. (2022) and normalize $Q^{\mu}(\bm{s},\bm{a})$ by dividing it by the target Q-network's value. We also clip $\lambda$ via $\lambda_{\text{clip}}=\max(c,\lambda)$ with $c\geq 0$ to keep the constraint $\lambda\geq 0$ satisfied. We further find that we can improve policy performance by delaying the policy update (Fujimoto et al., 2018). The algorithm given by Eq. (17)-Eq. (18) for solving Eq. (12) is summarized in Algorithm 1.

Algorithm 1 DiffCPS
1:  Initialize policy network $\mu_\theta$, critic networks $Q_{\phi_1}$, $Q_{\phi_2}$, and target networks $\mu_{\theta'}$, $Q_{\phi_1'}$, $Q_{\phi_2'}$,
2:  policy evaluation interval $d$ and step size $\eta$.
3:  for $t=1$ to $T$ do
4:    Sample transition mini-batch $\mathcal{B}=\{(\bm{s}_t,\bm{a}_t,r_t,\bm{s}_{t+1})\}\sim\mathcal{D}$.
5:    # Critic updating
6:    Sample $\bm{a}_{t+1}\sim\mu_\theta(\cdot|\bm{s}_{t+1})$ according to Eq. (11).
7:    $y=r_t+\gamma\min_{i=1,2}Q_{\phi_i'}(\bm{s}_{t+1},\bm{a}_{t+1})$.
8:    Update critics $\phi_i\leftarrow\operatorname*{arg\,min}_{\phi_i}N^{-1}\sum(y-Q_{\phi_i}(\bm{s}_t,\bm{a}_t))^2$.
9:    if $t$ mod $d$ then
10:       Sample $\bm{a}\sim\mu_\theta(\cdot|\bm{s})$ according to Eq. (11).
11:       # Policy updating through Eq. (17)
12:       $\theta_{t+1}\approx\operatorname*{arg\,max}_{\theta}\mathcal{L}(\mu_\theta,\lambda_t)$.
13:       # Lagrange multiplier $\lambda$ updating through Eq. (18)
14:       $\lambda_{t+1}\leftarrow\lambda_t-\eta\left[\kappa-\mathcal{L}_c(\pi_b,\mu^*_{\theta_{t+1}})\right]$.
15:       $\lambda_{t+1}\leftarrow\lambda_{\text{clip}}=\max(c,\lambda_{t+1})$, $c\geq 0$.
16:    end if
17:    # Target networks updating
18:    $\theta'=\rho\theta'+(1-\rho)\theta_{t+1}$, $\phi_i'=\rho\phi_i'+(1-\rho)\phi_i$ for $i=\{1,2\}$.
19:  end for
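A minimal sketch of lines 10-15 of Algorithm 1 is shown below; the `policy.sample` / `policy.bc_loss` interfaces, the batch-mean normalization of the target Q-value, and the hyperparameter names are illustrative assumptions, not the reference implementation:

```python
import torch

def diffcps_policy_and_dual_step(policy, q_net, q_target, states, actions,
                                 lam, kappa, policy_optim, eta=1e-4, lam_min=0.0):
    """One delayed policy update (Eq. (17)) followed by one dual update (Eq. (18))."""
    # Policy step: ascend L(mu_theta, lambda) = E[Q / Q_target] + lambda * (kappa - L_c).
    new_actions = policy.sample(states)                               # a ~ mu_theta(.|s), reparameterized
    q = q_net(states, new_actions)
    q_norm = q / q_target(states, new_actions).abs().mean().detach()  # normalized Q-value
    bc = policy.bc_loss(states, actions)                              # L_c(pi_b, mu_theta)
    policy_loss = -(q_norm.mean() + lam * (kappa - bc))               # minimize the negative Lagrangian
    policy_optim.zero_grad()
    policy_loss.backward()
    policy_optim.step()

    # Dual step: lambda <- max(c, lambda - eta * (kappa - L_c)), with c = lam_min >= 0.
    with torch.no_grad():
        lam = max(lam_min, lam - eta * (kappa - policy.bc_loss(states, actions).item()))
    return lam
```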

However, there is a cost to introducing such a parametrization in Algorithm 1: the duality gap is no longer null. In Theorem 3.11 below, we show that after introducing the parametrization, we can obtain a suboptimal solution within $\mathcal{O}(1/\epsilon)$ dual iterations. We first define $\epsilon$-universal parametrization functions.

Definition 3.10.

A parametrization $\pi_\theta$ is an $\epsilon$-universal parametrization of functions in $\mathcal{P}(\mathcal{S})$ if, for some $\epsilon>0$, there exists for any $\pi\in\mathcal{P}(\mathcal{S})$ a parameter $\theta\in\mathbb{R}^p$ such that

$$\max_{\bm{s}\in\mathcal{S}}\int_{\mathcal{A}}\left|\pi(\bm{a}|\bm{s})-\pi_\theta(\bm{a}|\bm{s})\right|\,d\bm{a}\leq\epsilon. \quad (19)$$
Table 1: The performance of DiffCPS and other SOTA baselines on D4RL tasks. The mean and standard deviation of DiffCPS are obtained by evaluating the trained policy on five different random seeds. We report the performance of baseline methods using the best results reported in their papers. "-A" refers to any number of hyperparameters allowed. Results within 3 percent of the maximum in every D4RL task and the best average result are marked with an asterisk (*).
Dataset | Environment | CQL | IDQL-A | QGPO | SfBC | DD | Diffuser | Diffusion-QL | IQL | DiffCPS (ours)
Medium-Expert | HalfCheetah | 62.4 | 95.9* | 93.5 | 92.6 | 90.6 | 79.8 | 96.8* | 86.7 | 100.3±4.1*
Medium-Expert | Hopper | 98.7 | 108.6 | 108.0 | 108.6 | 111.8* | 107.2 | 111.1* | 91.5 | 112.1±0.6*
Medium-Expert | Walker2d | 110.1 | 112.7* | 110.7* | 109.8 | 108.8 | 108.4 | 110.1 | 109.6 | 113.1±1.8*
Medium | HalfCheetah | 44.4 | 51.0 | 54.1* | 45.9 | 49.1 | 44.2 | 51.1* | 47.4 | 71.0±0.5*
Medium | Hopper | 58.0 | 65.4 | 98.0* | 57.1 | 79.3 | 58.5 | 90.5* | 66.3 | 100.1±3.5*
Medium | Walker2d | 79.2 | 82.5 | 86.0* | 77.9 | 82.5 | 79.7 | 87.0* | 78.3 | 90.9±1.6*
Medium-Replay | HalfCheetah | 46.2 | 45.9 | 47.6* | 37.1 | 39.3 | 42.2 | 47.8* | 44.2 | 50.5±0.6*
Medium-Replay | Hopper | 48.6 | 92.1 | 96.9 | 86.2 | 100* | 96.8 | 101.3* | 94.7 | 101.1±0.2*
Medium-Replay | Walker2d | 26.7 | 85.1* | 84.4 | 65.1 | 75.0 | 61.2 | 95.5* | 73.9 | 91.3±0.7*
Average (Locomotion) | | 63.9 | 82.1 | 86.6 | 75.6 | 81.8 | 75.3 | 87.9 | 76.9 | 92.26*
Default | AntMaze-umaze | 74.0 | 94.0* | 96.4* | 92.0 | - | - | 93.4 | 87.5 | 97.4±3.7*
Diverse | AntMaze-umaze | 84.0* | 80.2 | 74.4 | 85.3* | - | - | 66.2 | 62.2 | 87.4±3.8*
Play | AntMaze-medium | 61.2 | 84.5* | 83.6* | 81.3 | - | - | 76.6 | 71.2 | 88.2±2.2*
Diverse | AntMaze-medium | 53.7 | 84.8* | 83.8* | 82.0 | - | - | 78.6 | 70.0 | 87.8±6.5*
Play | AntMaze-large | 15.8 | 63.5* | 66.6* | 59.3 | - | - | 46.4 | 39.6 | 65.6±3.6*
Diverse | AntMaze-large | 14.9 | 67.9* | 64.8* | 45.5 | - | - | 57.3 | 47.5 | 63.6±3.9*
Average (AntMaze) | | 50.6 | 79.1 | 78.3 | 74.2 | - | - | 69.8 | 63.0 | 81.67*
# Diffusion steps | | - | 5 | 15 | 15 | 100 | 100 | 5 | - | 5
Theorem 3.11.

Let $\pi_\theta$ be an $\epsilon$-universal parametrization of $\mathcal{P}(\mathcal{S})$ according to Definition 3.10. Suppose that $r$ is bounded by a constant $R>0$, let $B=\max_{\bm{s},\bm{a}}\rho^{\pi_b}(\bm{s},\bm{a})/\rho(\bm{s},\bm{a})$, and let $\gamma\in(0,1)$ be the discount factor. Then, under Assumptions 3.4 and 3.5 and for any $\varepsilon>0$, the sequence of updates of Algorithm 1 with step size $\eta$ converges in $K>0$ steps, with

$$K\leq\left\|\lambda_0-\lambda_\theta^\star\right\|^2\big/\,2\eta\varepsilon, \quad (20)$$

to a neighborhood of $P^\star$ (the solution of Eq. (12)), satisfying

$$-\frac{\left(R+\left\|\lambda^\star\right\|_1 B\right)\epsilon}{1-\gamma}\leq d_\theta(\lambda_K)-P^\star\leq\eta\frac{C}{2}+\delta+\varepsilon, \quad (21)$$

where $C=\left(\kappa_0+\int_{\mathcal{S}\times\mathcal{A}}\rho^{\pi_b}(\bm{s},\bm{a})\log\pi_b(\bm{a}|\bm{s})\,d\bm{a}\,d\bm{s}\right)^2$, $\delta$ denotes the error between the parameterized local optimal solution and the optimal solution of Eq. (17), and $\lambda^\star$ is the solution to the dual problem associated with Eq. (12).

Remark 3.12.

Notice that the suboptimal solution to which the primal-dual algorithm converges depends on the quality of the solution of Eq. (17) and on the representation ability of the parametrized policy, and expressiveness is precisely the greatest strength of diffusion models.

In theory, we need to solve Eq. (17) and Eq. (18) exactly. In practice, however, we can resort to SGD and a parametrized policy to solve them. In this way, we can recursively optimize Eq. (12) through Eq. (17) and Eq. (18). The policy improvement described in Eq. (17) and the solution of Eq. (18) constitute the core of DiffCPS.

4 Experiments

We evaluate DiffCPS on the D4RL (Fu et al., 2020) benchmark in Section 4.1. We then conduct ablation experiments to assess the contribution of the different parts of DiffCPS in Section 4.2. Due to space limitations, we place the experimental setup, implementation details, and additional experimental results in Appendix C.

4.1 Results

In Table 1, we compare the performance of DiffCPS to other offline RL methods on D4RL (Fu et al., 2020) tasks. We only illustrate the results of the MuJoCo locomotion and AntMaze tasks due to the page limit. On the traditional MuJoCo tasks, DiffCPS outperforms other methods, including recent diffusion-based methods (Wang et al., 2022; Lu et al., 2023; Hansen-Estruch et al., 2023), by large margins in most tasks, especially HalfCheetah medium. The medium datasets are collected by an online SAC (Haarnoja et al., 2018a) agent trained to approximately 1/3 of expert performance. Hence, they contain many suboptimal trajectories, which makes them hard for offline RL algorithms to learn from.

Compared to the locomotion tasks, the AntMaze tasks are more challenging since the datasets consist of sparse rewards and suboptimal trajectories. Even so, DiffCPS achieves competitive or SOTA results compared with other methods. For simpler tasks like umaze, DiffCPS achieves a $100\pm 0\%$ success rate on some seeds, which shows its powerful ability to learn from suboptimal trajectories. On the other AntMaze tasks, DiffCPS also shows competitive performance compared to other SOTA diffusion-based approaches. Overall, the experiments demonstrate the effectiveness of DiffCPS compared to several existing offline RL methods.

Table 2: Ablation study of diffusion steps. We conduct an ablation study to investigate the impact of diffusion steps on different algorithms. We only show the average score due to the page limit. The results of DiffCPS are obtained from three random seeds, while the results of SfBC are derived from the original SfBC paper.
D4RL Tasks | DiffCPS (T=5) | DiffCPS (T=15) | DiffCPS (T=20) | SfBC (T=10) | SfBC (T=15) | SfBC (T=25)
Locomotion | 92.0 | 87.5 | 87.6 | 72.9 | 75.6 | 74.4
AntMaze | 80.0 | 60.7 | 66.7 | 65.7 | 74.2 | 73.0
Figure 2: Ablation studies of diffusion steps $T$ on selected Gym tasks (three random seeds). We observe that as $T$ increases, training stability improves, but the final performance drops.

4.2 Ablation Study

In this section, we analyze quantitatively why DiffCPS outperforms the other methods on D4RL tasks. We conduct an ablation study to investigate the impact of three parts of DiffCPS, i.e., the diffusion steps, the minimum value of the Lagrange multiplier $\lambda_{\text{clip}}$, and the policy evaluation interval.

Diffusion Steps. We show the effect of the number of diffusion steps $T$, which is a vital hyperparameter in all diffusion-based methods. In SfBC the best $T$ is 15, while $T=5$ works best for DiffCPS. Table 2 shows the average performance for different diffusion steps. Figure 2 shows the training curves of selected D4RL tasks over different diffusion steps $T$.

We also note that a large $T$ works better in the bandit experiment. However, for D4RL tasks, a large $T$ leads to a performance drop. The reason is that, compared to the bandit task, D4RL datasets contain a significant amount of suboptimal trajectories. A larger $T$ implies a stronger behavior-cloning ability, which can lead the policy to overfit the suboptimal data, especially when combined with actor-critic methods. Poor policies result in erroneous value estimates, and vice versa, creating a vicious cycle that degrades policy performance for large $T$.

The Minimum Value of the Lagrange Multiplier $\lambda_{\text{clip}}$. In our method, $\lambda$ serves as the coefficient of the policy constraint term, where a larger $\lambda$ implies a stronger policy constraint. Although we need to restrict $\lambda\geq 0$ according to the definition of the Lagrange multiplier, we notice that we obtain better results in the AntMaze tasks by clipping $\lambda\geq c$, where $c$ is a positive number; see the full $\lambda$ ablation results in Figure 3 for details.

Figure 3: Ablation studies of $\lambda_{\text{clip}}$ on AntMaze and MuJoCo tasks. We observe that $\lambda_{\text{clip}}$ has little impact on the MuJoCo tasks but significantly influences the AntMaze tasks, especially the larger AntMaze datasets. The reason is that the sparse rewards and suboptimal trajectories in the AntMaze datasets make the critic network prone to estimation errors, which leads to learning poor policies. Therefore, there is a need to strengthen learning from the original dataset, which means we should increase $\lambda$ or strengthen the KL constraint. We find that increasing $\lambda_{\text{clip}}$ while maintaining a moderate KL constraint achieves the best results. All results are obtained by evaluating three random seeds.
Figure 4: Ablation studies of the policy evaluation interval on AntMaze and MuJoCo tasks. Delayed policy updates have a relatively minor impact on the MuJoCo locomotion tasks. However, for large-scale sparse-reward datasets like AntMaze Large, choosing an appropriate update frequency can greatly improve the final results. The MuJoCo results are obtained with 2 million training steps (three random seeds), while the AntMaze results are obtained with 1 million training steps (three random seeds).

Policy Evaluation Interval. We include the ablation of the policy evaluation interval in Figure 4. We find that delayed policy updates (Fujimoto et al., 2018) have a significant impact on the AntMaze large tasks. For the other tasks, however, they do not have much effect and may even lead to a slight performance decline. The reason is that infrequent policy updates reduce the variance of value estimates, which is most effective in tasks where sparse rewards and suboptimal trajectories lead to significant errors in value function estimation, such as the AntMaze large tasks.

In a nutshell, the ablation study shows that the combination of these three components collectively leads to the good performance of DiffCPS.

5 Related Work

Offline Reinforcement Learning. Offline RL algorithms need to avoid extrapolation error. Prior works usually address this problem through policy regularization (Fujimoto et al., 2019; Kumar et al., 2019; Wu et al., 2019; Fujimoto & Gu, 2021), value pessimism about unseen actions (Kumar et al., 2020; Kostrikov et al., 2021a), or implicit TD backups (Kostrikov et al., 2021b; Ma et al., 2021) to avoid out-of-distribution actions. Another line of research solves the offline RL problem through weighted regression (Peters et al., 2010; Peng et al., 2019; Nair et al., 2020) from the perspective of CPS. The derivation of DiffCPS is related to these works but is formulated with a diffusion-model policy.

Diffusion Models in RL. Several works introduce diffusion models to RL. Diffuser (Janner et al., 2022) uses a diffusion model to directly generate trajectories guided by gradients or rewards. Diffusion-QL (Wang et al., 2022) uses a diffusion model as the actor and optimizes it with a TD3+BC-style objective, using a coefficient $\eta$ to balance the two terms. AdaptDiffuser (Liang et al., 2023) uses a diffusion model to generate extra trajectories and a discriminator to select desired data to add to the training set, enhancing the adaptability of the diffusion model. DD (Ajay et al., 2022) uses a conditional diffusion model to generate trajectories and compose skills. Unlike Diffuser, DD diffuses only states and trains an inverse dynamics model to predict actions. QGPO (Lu et al., 2023) uses an energy function to guide the sampling process and proves that the proposed CEP training method yields an unbiased estimate of the gradient of the energy function under unlimited model capacity and data samples. IDQL (Hansen-Estruch et al., 2023) reinterprets IQL as an actor-critic method and extracts the policy by sampling from a diffusion-parameterized behavior policy with weights computed from the IQL-style critic. EDP (Kang & Ma) focuses on boosting sampling speed through approximated actions. SRPO (Chen et al.) uses a Gaussian policy whose gradient is regularized by a pretrained diffusion model to recover the IQL-style policy. DiffCPS is distinct from these methods because we derive it from Eq. (12).

The closest to our work are methods that combine AWR and diffusion models. SfBC (Chen et al., 2022) uses a diffusion model to generate candidate actions and uses the regression weights to select the best action. Our method differs in that we directly solve the limited policy expressivity problem through the primal-dual method for DiffCPS without resorting to AWR. This makes DiffCPS simple to implement and easy to tune: as shown in Table 4, we can achieve SOTA results in most tasks by merely tuning one hyperparameter.

6 Conclusion

In this work, we address the limited expressivity problem of weighted regression with a diffusion model. We first simplify the CPS problem using the action distribution of the diffusion-based policy. Then we prove that strong duality holds for diffusion-based CPS problems and solve them through the primal-dual method with function approximation. Experimental results on the D4RL benchmark illustrate the superiority of our method, which outperforms previous SOTA algorithms in most tasks; moreover, DiffCPS is easy to tune, requiring only the constraint $\kappa$ to be adjusted in most tasks. We hope our work can inspire researchers to utilize powerful generative models, especially diffusion models, for offline RL and decision-making.

7 Impact statements

Offline RL aims to learn a policy from an offline dataset, much like supervised learning, but due to extrapolation error and function approximation, offline reinforcement learning is much more challenging than supervised learning and pre-training. The DiffCPS algorithm proposed in this paper utilizes a diffusion-based policy and the primal-dual method to address the CPS problem in offline reinforcement learning. Our method may promote the development of offline reinforcement learning, thus facilitating its deployment in practical scenarios such as robotic control, and consequently bring about the corresponding ethical and societal considerations.

References

  • Ajay et al. (2022) Ajay, A., Du, Y., Gupta, A., Tenenbaum, J. B., Jaakkola, T. S., and Agrawal, P. Is conditional generative modeling all you need for decision making? In The Eleventh International Conference on Learning Representations, 2022.
  • Borkar (1988) Borkar, V. S. A convex analytic approach to markov decision processes. Probability Theory and Related Fields, 78(4):583–602, 1988.
  • Boyd & Vandenberghe (2004) Boyd, S. P. and Vandenberghe, L. Convex optimization. Cambridge university press, 2004.
  • (4) Chen, H., Lu, C., Wang, Z., Su, H., and Zhu, J. Score regularized policy optimization through diffusion behavior. URL http://arxiv.org/abs/2310.07297.
  • Chen et al. (2022) Chen, H., Lu, C., Ying, C., Su, H., and Zhu, J. Offline reinforcement learning via high-fidelity generative behavior modeling. In The Eleventh International Conference on Learning Representations, 2022.
  • Chen et al. (2021) Chen, L., Lu, K., Rajeswaran, A., Lee, K., Grover, A., Laskin, M., Abbeel, P., Srinivas, A., and Mordatch, I. Decision transformer: Reinforcement learning via sequence modeling. Advances in neural information processing systems, 34:15084–15097, 2021.
  • Chen et al. (2020) Chen, X., Zhou, Z., Wang, Z., Wang, C., Wu, Y., and Ross, K. Bail: Best-action imitation learning for batch deep reinforcement learning. Advances in Neural Information Processing Systems, 33:18353–18363, 2020.
  • Dhariwal & Nichol (2021) Dhariwal, P. and Nichol, A. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
  • Fu et al. (2020) Fu, J., Kumar, A., Nachum, O., Tucker, G., and Levine, S. D4rl: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
  • Fujimoto & Gu (2021) Fujimoto, S. and Gu, S. S. A minimalist approach to offline reinforcement learning. Advances in neural information processing systems, 34:20132–20145, 2021.
  • Fujimoto et al. (2018) Fujimoto, S., Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pp.  1587–1596. PMLR, 2018.
  • Fujimoto et al. (2019) Fujimoto, S., Meger, D., and Precup, D. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pp.  2052–2062. PMLR, 2019.
  • Ghasemipour et al. (2020) Ghasemipour, S. K. S., Schuurmans, D., and Gu, S. S. Emaq: Expected-max q-learning operator for simple yet effective offline and online rl. arXiv preprint arXiv:2007.11091, 2020.
  • (14) Gürtler, N., Blaes, S., Kolev, P., Widmaier, F., Wüthrich, M., Bauer, S., Schölkopf, B., and Martius, G. Benchmarking offline reinforcement learning on real-robot hardware.
  • Haarnoja et al. (2018a) Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pp.  1861–1870. PMLR, 2018a.
  • Haarnoja et al. (2018b) Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., and Abbeel, P. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018b.
  • Hansen-Estruch et al. (2023) Hansen-Estruch, P., Kostrikov, I., Janner, M., Kuba, J. G., and Levine, S. Idql: Implicit q-learning as an actor-critic method with diffusion policies. arXiv preprint arXiv:2304.10573, 2023.
  • Ho & Salimans (2021) Ho, J. and Salimans, T. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
  • Ho et al. (2020) Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  • Janner et al. (2022) Janner, M., Du, Y., Tenenbaum, J. B., and Levine, S. Planning with diffusion for flexible behavior synthesis. arXiv preprint arXiv:2205.09991, 2022.
  • (21) Kang, B. and Ma, X. Efficient diffusion policies for offline reinforcement learning.
  • Kidambi et al. (2020) Kidambi, R., Rajeswaran, A., Netrapalli, P., and Joachims, T. Morel: Model-based offline reinforcement learning. Advances in neural information processing systems, 33:21810–21823, 2020.
  • Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kingma & Welling (2013) Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv e-prints, pp.  arXiv–1312, 2013.
  • Kostrikov et al. (2021a) Kostrikov, I., Fergus, R., Tompson, J., and Nachum, O. Offline reinforcement learning with fisher divergence critic regularization. In International Conference on Machine Learning, pp.  5774–5783. PMLR, 2021a.
  • Kostrikov et al. (2021b) Kostrikov, I., Nair, A., and Levine, S. Offline reinforcement learning with implicit q-learning. In International Conference on Learning Representations, 2021b.
  • Kumar et al. (2019) Kumar, A., Fu, J., Soh, M., Tucker, G., and Levine, S. Stabilizing off-policy q-learning via bootstrapping error reduction. Advances in Neural Information Processing Systems, 32, 2019.
  • Kumar et al. (2020) Kumar, A., Zhou, A., Tucker, G., and Levine, S. Conservative q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:1179–1191, 2020.
  • Levine et al. (2020) Levine, S., Kumar, A., Tucker, G., and Fu, J. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
  • Liang et al. (2023) Liang, Z., Mu, Y., Ding, M., Ni, F., Tomizuka, M., and Luo, P. Adaptdiffuser: Diffusion models as adaptive self-evolving planners. arXiv preprint arXiv:2302.01877, 2023.
  • Lillicrap et al. (2015) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
  • Lu et al. (2023) Lu, C., Chen, H., Chen, J., Su, H., Li, C., and Zhu, J. Contrastive energy prediction for exact energy-guided diffusion sampling in offline reinforcement learning. arXiv preprint arXiv:2304.12824, 2023.
  • Ma et al. (2021) Ma, X., Yang, Y., Hu, H., Liu, Q., Yang, J., Zhang, C., Zhao, Q., and Liang, B. Offline reinforcement learning with value-based episodic memory. arXiv preprint arXiv:2110.09796, 2021.
  • (34) Matsushima, T., Furuta, H., Matsuo, Y., Nachum, O., and Gu, S. S. Deployment-efficient reinforcement learning via model-based offline optimization.
  • Nair et al. (2020) Nair, A., Gupta, A., Dalal, M., and Levine, S. Awac: Accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359, 2020.
  • Nie et al. (2021) Nie, X., Brunskill, E., and Wager, S. Learning when-to-treat policies. Journal of the American Statistical Association, 116(533):392–409, 2021.
  • Paternain et al. (2019) Paternain, S., Chamon, L., Calvo-Fullana, M., and Ribeiro, A. Constrained reinforcement learning has zero duality gap. Advances in Neural Information Processing Systems, 32, 2019.
  • Pearce et al. (2023) Pearce, T., Rashid, T., Kanervisto, A., Bignell, D., Sun, M., Georgescu, R., Macua, S. V., Tan, S. Z., Momennejad, I., Hofmann, K., et al. Imitating human behaviour with diffusion models. arXiv preprint arXiv:2301.10677, 2023.
  • Peng et al. (2019) Peng, X. B., Kumar, A., Zhang, G., and Levine, S. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019.
  • Peters et al. (2010) Peters, J., Mulling, K., and Altun, Y. Relative entropy policy search. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 24, pp.  1607–1612, 2010.
  • Rashidinejad et al. (2021) Rashidinejad, P., Zhu, B., Ma, C., Jiao, J., and Russell, S. Bridging offline reinforcement learning and imitation learning: A tale of pessimism. Advances in Neural Information Processing Systems, 34:11702–11716, 2021.
  • Rhinehart et al. (2018) Rhinehart, N., McAllister, R., and Levine, S. Deep imitative models for flexible inference, planning, and control. arXiv preprint arXiv:1810.06544, 2018.
  • Rockafellar (1970) Rockafellar, R. Convex analysis. Princeton Math. Series, 28, 1970.
  • Shafiullah et al. (2022) Shafiullah, N. M., Cui, Z., Altanzaya, A. A., and Pinto, L. Behavior transformers: Cloning $k$ modes with one stone. Advances in neural information processing systems, 35:22955–22968, 2022.
  • Sohl-Dickstein et al. (2015) Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp.  2256–2265. PMLR, 2015.
  • Song & Ermon (2019) Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 32, 2019.
  • Song et al. (2020) Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
  • Suh et al. (2023) Suh, H., Chou, G., Dai, H., Yang, L., Gupta, A., and Tedrake, R. Fighting uncertainty with gradients: Offline reinforcement learning via diffusion score matching. arXiv preprint arXiv:2306.14079, 2023.
  • Sutton & Barto (2018) Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT press, 2018.
  • Tseng et al. (2017) Tseng, H.-H., Luo, Y., Cui, S., Chien, J.-T., Ten Haken, R. K., and Naqa, I. E. Deep reinforcement learning for automated radiation adaptation in lung cancer. Medical physics, 44(12):6690–6705, 2017.
  • Wang et al. (2020) Wang, Z., Novikov, A., Zolna, K., Merel, J. S., Springenberg, J. T., Reed, S. E., Shahriari, B., Siegel, N., Gulcehre, C., and Heess, N. Critic regularized regression. Advances in Neural Information Processing Systems, 33:7768–7778, 2020.
  • Wang et al. (2022) Wang, Z., Hunt, J. J., and Zhou, M. Diffusion policies as an expressive policy class for offline reinforcement learning. arXiv preprint arXiv:2208.06193, 2022.
  • Wu et al. (2019) Wu, Y., Tucker, G., and Nachum, O. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361, 2019.
  • Xiao et al. (2021) Xiao, Z., Kreis, K., and Vahdat, A. Tackling the generative learning trilemma with denoising diffusion gans. arXiv preprint arXiv:2112.07804, 2021.
  • Yu et al. (2020) Yu, T., Thomas, G., Yu, L., Ermon, S., Zou, J. Y., Levine, S., Finn, C., and Ma, T. Mopo: Model-based offline policy optimization. Advances in Neural Information Processing Systems, 33:14129–14142, 2020.
  • Yurtsever et al. (2020) Yurtsever, E., Lambert, J., Carballo, A., and Takeda, K. A survey of autonomous driving: Common practices and emerging technologies. IEEE access, 8:58443–58469, 2020.

In the appendix, we provide the proofs of the main theorems in the paper (Appendix A), the notation table (Appendix B), details of the experiments (Appendix C), some additional experiments (Appendices D, F, E), and a discussion on the use of the augmented Lagrangian method (Appendix G).

Appendix A Proofs

A.1 Theorem 3.1

We provide the proof of Theorem 3.1.

Proof.

(1): Since the pointwise KL constraint in Eq. (1) at all states is intractable, we follow Peng et al. (2019) and relax the constraint by enforcing it only in expectation:

$$\mathbb{E}_{\bm{s}\sim\rho^{\pi_b}(\bm{s})}\left[D_{\mathrm{KL}}(\pi_b(\cdot|\bm{s})\,\|\,\mu(\cdot|\bm{s}))\right]\leq\epsilon.$$

The LHS can be transformed to

$$\mathbb{E}_{\bm{s}\sim\rho^{\pi_b}(\bm{s})}\left[D_{\mathrm{KL}}(\pi_b(\cdot|\bm{s})\,\|\,\mu(\cdot|\bm{s}))\right]=\mathbb{E}_{\bm{s}\sim\rho^{\pi_b}(\bm{s}),\,\bm{a}\sim\pi_b(\cdot|\bm{s})}\left[\log\pi_b(\bm{a}|\bm{s})-\log\mu(\bm{a}|\bm{s})\right], \quad (22)$$

where $\mathbb{E}_{\bm{s}\sim\rho^{\pi_b}(\bm{s}),\,\bm{a}\sim\pi_b(\cdot|\bm{s})}\left[\log\pi_b(\bm{a}|\bm{s})\right]$ is the negative entropy of $\pi_b$, which is a constant with respect to $\mu$. So the constraint in Eq. (7) can be transformed to

$$H(\pi_b,\mu)=-\mathbb{E}_{\bm{s}\sim\rho^{\pi_b}(\bm{s}),\,\bm{a}\sim\pi_b(\cdot|\bm{s})}\left[\log\mu(\bm{a}|\bm{s})\right]\leq\kappa_0, \quad (23)$$

where $\kappa_0=\epsilon-\mathbb{E}_{\bm{s}\sim\rho^{\pi_b}(\bm{s}),\,\bm{a}\sim\pi_b(\cdot|\bm{s})}\left[\log\pi_b(\bm{a}|\bm{s})\right]$.

(2): According to the definition of the diffusion model in Eq. (9):

$$\int_{\bm{a}}\mu(\bm{a}|\bm{s})\,d\bm{a}=\int_{\bm{a}}\mu(\bm{a}^0|\bm{s})\,d\bm{a}^0=\int\int\mu(\bm{a}^{0:T}|\bm{s})\,d\bm{a}^{1:T}\,d\bm{a}^0=\int\mu(\bm{a}^{0:T}|\bm{s})\,d\bm{a}^{0:T}=1. \quad (24)$$

Eq. (24) implies the constraint Eq. (8) always holds for the diffusion model, so we can omit this constraint.

According to the above two properties, we can rewrite the diffusion-based CPS problem Eq. (6)-Eq. (8) as

$$\mu^*=\operatorname*{arg\,max}_{\mu} J(\mu)=\operatorname*{arg\,max}_{\mu}\int_{\mathcal{S}}d_0(\bm{s})\int_{\mathcal{A}}Q^{\mu}(\bm{s},\bm{a})\,d\bm{a}\,d\bm{s} \quad (25)$$
$$\text{s.t.}\quad H(\pi_b,\mu)\leq\kappa_0.$$

A.2 Theorem 3.7

We now prove Theorem 3.7.

The perturbation function associated to Eq. (25) is defined as

P(\xi)=\max_{\mu}{\mathcal{J}}(\mu)=\max_{\mu}\int_{\mathcal{S}}d_{0}({\bm{s}})\int_{\mathcal{A}}Q^{\mu}({\bm{s}},{\bm{a}})d{\bm{a}}d{\bm{s}} (26)
\text{s.t.}\quad -H(\pi_{b},\mu)\geq-\kappa_{0}+\xi.
Lemma A.1.

If (i) r is bounded; (ii) Slater's condition holds for Eq. (25); and (iii) its perturbation function P(\xi) is concave, then strong duality holds for Eq. (25).

Proof.

See, e.g., Rockafellar (1970)[Cor. 30.2.2] ∎

Proof.

(The proof skeleton is essentially based on Theorem 1 of Paternain et al. (2019).)

Conditions (i) and (ii) are satisfied by Assumption 3.4. To prove strong duality for Eq. (25), it then suffices to show that the perturbation function is concave [condition (iii)], i.e., that for every \xi^{1},\xi^{2}\in\mathbb{R} and t\in(0,1),

P\left[t\xi^{1}+(1-t)\xi^{2}\right]\geq tP\left(\xi^{1}\right)+(1-t)P\left(\xi^{2}\right). (27)

If for either perturbation \xi^{1} or \xi^{2} the problem becomes infeasible, then P(\xi^{1})=-\infty or P(\xi^{2})=-\infty, and thus Eq. (27) holds trivially. For perturbations which keep the problem feasible, suppose P(\xi^{1}) and P(\xi^{2}) are achieved by the policies \mu_{1}\in{\mathcal{P}}({\mathcal{S}}) and \mu_{2}\in{\mathcal{P}}({\mathcal{S}}), respectively. Then P(\xi^{1})={\mathcal{J}}(\mu_{1}) with -H(\pi_{b},\mu_{1})+\kappa_{0}\geq\xi^{1}, and P(\xi^{2})={\mathcal{J}}(\mu_{2}) with -H(\pi_{b},\mu_{2})+\kappa_{0}\geq\xi^{2}. To establish Eq. (27), it suffices to show that for every t\in(0,1) there exists a policy \mu_{t} such that -H(\pi_{b},\mu_{t})+\kappa_{0}\geq t\xi^{1}+(1-t)\xi^{2} and {\mathcal{J}}(\mu_{t})=t{\mathcal{J}}(\mu_{1})+(1-t){\mathcal{J}}(\mu_{2}). Notice that any policy \mu_{t} satisfying the previous conditions is a feasible policy for the slack -\kappa_{0}+t\xi^{1}+(1-t)\xi^{2}. Hence, by the definition of the perturbation function in Eq. (26), it follows that

P\left[t\xi^{1}+(1-t)\xi^{2}\right]\geq{\mathcal{J}}(\mu_{t})=t{\mathcal{J}}(\mu_{1})+(1-t){\mathcal{J}}(\mu_{2})=tP\left(\xi^{1}\right)+(1-t)P\left(\xi^{2}\right). (28)

If such a policy exists, the previous equation implies Eq. (27). Thus, to complete the proof of the result, we need to establish its existence. To do so, we start by formulating the objective {\mathcal{J}}(\mu) as a linear function.

After sufficient training, we can regard Q_{\phi}({\bm{s}},{\bm{a}}) as an unbiased approximation of Q^{\mu}({\bm{s}},{\bm{a}}):

Q_{\phi}({\bm{s}},{\bm{a}})\approx Q^{\mu}({\bm{s}},{\bm{a}})=\mathbb{E}_{\tau\sim d_{0},\mu,{\mathcal{T}}}\left[\sum_{j=0}^{J}\gamma^{j}r({\bm{s}}_{j},{\bm{a}}_{j}|{\bm{s}}_{0}={\bm{s}},{\bm{a}}_{0}={\bm{a}})\right], (29)

where \tau denotes the trajectory generated by the policy \mu({\bm{a}}|{\bm{s}}), the initial state distribution d_{0}, and the environment dynamics {\mathcal{T}}({\bm{s}}^{\prime}|{\bm{s}},{\bm{a}}).

So the objective function Eq. (6) can be written as

J(\mu) =\int_{\mathcal{S}}d_{0}({\bm{s}})\int_{\mathcal{A}}Q^{\mu}({\bm{s}},{\bm{a}})d{\bm{a}}d{\bm{s}} (30)
=\sum_{t=0}^{\infty}\gamma^{t}\mathbb{E}_{{\bm{s}}_{t}\sim p_{\mu}({\bm{s}}_{t}={\bm{s}})}\left[\mathbb{E}_{{\bm{a}}_{t}\sim\mu({\bm{a}}_{t}|{\bm{s}}_{t})}r({\bm{s}}_{t},{\bm{a}}_{t})\right]
=\sum_{t=0}^{\infty}\gamma^{t}\int_{{\mathcal{S}}\times{\mathcal{A}}}p_{\mu}({\bm{s}}_{t}={\bm{s}})\mu({\bm{a}}_{t}={\bm{a}}|{\bm{s}}_{t}={\bm{s}})r({\bm{s}},{\bm{a}})d{\bm{a}}d{\bm{s}}
=\int_{{\mathcal{S}}\times{\mathcal{A}}}\sum_{t=0}^{\infty}\gamma^{t}p_{\mu}({\bm{s}}_{t}={\bm{s}})\mu({\bm{a}}_{t}={\bm{a}}|{\bm{s}}_{t}={\bm{s}})r({\bm{s}},{\bm{a}})d{\bm{a}}d{\bm{s}}
=\frac{1}{1-\gamma}\mathbb{E}_{{\bm{s}},{\bm{a}}\sim\rho^{\mu}({\bm{s}},{\bm{a}})}\left[r({\bm{s}},{\bm{a}})\right]
=\frac{1}{1-\gamma}\int_{{\mathcal{S}}\times{\mathcal{A}}}\rho^{\mu}({\bm{s}},{\bm{a}})r({\bm{s}},{\bm{a}})d{\bm{a}}d{\bm{s}},

which implies that the objective function Eq. (6) is equivalent to

\max_{\rho\in{\mathcal{R}}}\frac{1}{1-\gamma}\int_{{\mathcal{S}}\times{\mathcal{A}}}\rho^{\mu}({\bm{s}},{\bm{a}})r({\bm{s}},{\bm{a}})d{\bm{a}}d{\bm{s}}, (31)

where {\mathcal{R}} is the set of occupation measures, which is convex and compact [Borkar (1988), Theorem 3.1].

So Eq. (31) is a linear function of \rho({\bm{s}},{\bm{a}}). Let \rho_{1}({\bm{s}},{\bm{a}}),\rho_{2}({\bm{s}},{\bm{a}})\in{\mathcal{R}} be the occupation measures associated to \mu_{1} and \mu_{2}. Since {\mathcal{R}} is convex, there exists a policy \mu_{t} whose occupation measure is \rho_{t}({\bm{s}},{\bm{a}})=t\rho_{1}({\bm{s}},{\bm{a}})+(1-t)\rho_{2}({\bm{s}},{\bm{a}})\in{\mathcal{R}}, and {\mathcal{J}}(\mu_{t})=t{\mathcal{J}}(\mu_{1})+(1-t){\mathcal{J}}(\mu_{2}). To prove Eq. (27), it suffices to show that such a \mu_{t} satisfies -H(\pi_{b},\mu_{t})+\kappa_{0}\geq t\xi^{1}+(1-t)\xi^{2}. Using Assumption 3.5, we get

-H(\pi_{b},\mu_{t}) =\int_{{\mathcal{S}}\times{\mathcal{A}}}\rho^{\pi_{b}}({\bm{s}},{\bm{a}})\log\frac{\rho_{t}({\bm{s}},{\bm{a}})}{\rho^{\pi_{b}}({\bm{s}})}d{\bm{a}}d{\bm{s}} (32)
=\int_{{\mathcal{S}}\times{\mathcal{A}}}\rho^{\pi_{b}}({\bm{s}},{\bm{a}})\log\frac{t\rho_{1}({\bm{s}},{\bm{a}})+(1-t)\rho_{2}({\bm{s}},{\bm{a}})}{\rho^{\pi_{b}}({\bm{s}})}d{\bm{a}}d{\bm{s}}
\geq t\int_{{\mathcal{S}}\times{\mathcal{A}}}\rho^{\pi_{b}}({\bm{s}},{\bm{a}})\log\frac{\rho_{1}({\bm{s}},{\bm{a}})}{\rho^{\pi_{b}}({\bm{s}})}d{\bm{a}}d{\bm{s}}+(1-t)\int_{{\mathcal{S}}\times{\mathcal{A}}}\rho^{\pi_{b}}({\bm{s}},{\bm{a}})\log\frac{\rho_{2}({\bm{s}},{\bm{a}})}{\rho^{\pi_{b}}({\bm{s}})}d{\bm{a}}d{\bm{s}} \quad\text{(Jensen's inequality)}
\geq t\xi^{1}+(1-t)\xi^{2}-\kappa_{0}.

This completes the proof that the perturbation function is concave; hence, by Lemma A.1, strong duality holds for Eq. (12).

Finally, we prove that Eq. (12) has the same optimal policy \mu as Eq. (15).

For Eq. (15), if strong duality holds and a dual optimal solution \lambda^{*} exists, then any primal optimal point of Eq. (12) is also a maximizer of {\mathcal{L}}(\mu,\lambda^{*}); this follows from the KKT conditions [Boyd & Vandenberghe (2004), Ch. 5.5.5].

So if we can prove that the maximizer of {\mathcal{L}}(\mu,\lambda^{*}) is unique, we can compute the optimal policy of Eq. (12) from the dual problem Eq. (15):

{\mathcal{L}}(\mu,\lambda^{*}) =J(\mu)+\lambda^{*}(\kappa_{0}-H(\pi_{b},\mu)) (33)
=\frac{1}{1-\gamma}\left[\int_{{\mathcal{S}}\times{\mathcal{A}}}\rho^{\mu}({\bm{s}},{\bm{a}})r({\bm{s}},{\bm{a}})d{\bm{a}}d{\bm{s}}+\lambda^{*}\left(\bar{\kappa}_{0}-\int_{{\mathcal{S}}\times{\mathcal{A}}}\rho^{\pi_{b}}({\bm{s}},{\bm{a}})\log\frac{\rho^{\mu}({\bm{s}},{\bm{a}})}{\rho^{\pi_{b}}({\bm{s}})}d{\bm{a}}d{\bm{s}}\right)\right], \quad\text{(using Eq. (30))}

According to Eq. (30), J(\mu) is a linear function of \rho^{\mu}, and the negative relative entropy is concave. This implies that {\mathcal{L}}(\mu,\lambda^{*}) is a strictly concave function of \rho^{\mu} with a unique maximizer. Since the occupancy measure \rho^{\mu}({\bm{s}},{\bm{a}}) has a one-to-one relationship with the policy \mu, we can obtain the unique \mu corresponding to \rho^{\mu}. ∎

A.3 Proposition 3.8

Proof.
H(\pi_{b},\mu) =-\mathbb{E}_{{\bm{s}}\sim\rho^{\pi_{b}}({\bm{s}}),{\bm{a}}\sim\pi_{b}(\cdot|{\bm{s}})}\left[\log\mu(\cdot|{\bm{s}})\right] (34)
=-\mathbb{E}_{{\bm{s}}\sim\rho^{\pi_{b}}({\bm{s}})}\left[\int\pi_{b}({\bm{a}}|{\bm{s}})d{\bm{a}}\log\int\pi_{b}({\bm{a}}^{1:T}|{\bm{a}},{\bm{s}})\frac{\mu({\bm{a}}^{0:T}|{\bm{s}})}{\pi_{b}({\bm{a}}^{1:T}|{\bm{a}},{\bm{s}})}d{\bm{a}}^{1:T}\right]
\leq\mathbb{E}_{{\bm{s}}\sim\rho^{\pi_{b}}({\bm{s}})}\left[\int\pi_{b}({\bm{a}}^{0:T})\log\frac{\pi_{b}({\bm{a}}^{1:T}|{\bm{a}},{\bm{s}})}{\mu({\bm{a}}^{0:T}|{\bm{s}})}d{\bm{a}}^{0:T}\right]=K. \quad\text{(Jensen's inequality)}

H(\pi_{b},\mu)=K holds if and only if \pi_{b}({\bm{a}}^{1:T}|{\bm{a}}^{0},{\bm{s}})=\mu({\bm{a}}^{1:T}|{\bm{a}}^{0},{\bm{s}}). By making the KL divergence between \pi_{b} and \mu small enough, we can approximate H(\pi_{b},\mu) with K. Furthermore, following Ho et al. (2020), we can decompose the variational bound K as:

H(\pi_{b},\mu)\approx K =\mathbb{E}_{{\bm{s}}\sim\rho^{\pi_{b}}({\bm{s}})}\left[\mathbb{E}_{\pi_{b}({\bm{a}}^{0:T})}\Big[\log\frac{\pi_{b}({\bm{a}}^{1:T}|{\bm{a}},{\bm{s}})}{\mu({\bm{a}}^{0:T}|{\bm{s}})}\Big]\right]=\mathbb{E}_{{\bm{s}}\sim\rho^{\pi_{b}}({\bm{s}})}\left[\mathbb{E}_{\pi_{b}({\bm{a}}^{0:T})}\Big[\log\frac{\pi_{b}({\bm{a}}^{1:T}|{\bm{a}}^{0},{\bm{s}})}{\mu({\bm{a}}^{0:T}|{\bm{s}})}\Big]\right]
=\mathbb{E}_{\rho^{\pi_{b}}({\bm{s}},{\bm{a}})}\Big[\log\frac{\prod_{t=1}^{T}\pi_{b}({\bm{a}}^{t}|{\bm{a}}^{t-1})}{\mu({\bm{a}}^{T})\prod_{t=1}^{T}\mu({\bm{a}}^{t-1}|{\bm{a}}^{t})}\Big]
=\mathbb{E}_{\rho^{\pi_{b}}({\bm{s}},{\bm{a}})}\Big[-\log\mu({\bm{a}}^{T})+\sum_{t=1}^{T}\log\frac{\pi_{b}({\bm{a}}^{t}|{\bm{a}}^{t-1})}{\mu({\bm{a}}^{t-1}|{\bm{a}}^{t})}\Big]
=\mathbb{E}_{\rho^{\pi_{b}}({\bm{s}},{\bm{a}})}\Big[-\log\mu({\bm{a}}^{T})+\sum_{t=2}^{T}\log\frac{\pi_{b}({\bm{a}}^{t}|{\bm{a}}^{t-1})}{\mu({\bm{a}}^{t-1}|{\bm{a}}^{t})}+\log\frac{\pi_{b}({\bm{a}}^{1}|{\bm{a}}^{0})}{\mu({\bm{a}}^{0}|{\bm{a}}^{1})}\Big]
=\mathbb{E}_{\rho^{\pi_{b}}({\bm{s}},{\bm{a}})}\Big[-\log\mu({\bm{a}}^{T})+\sum_{t=2}^{T}\log\Big(\frac{\pi_{b}({\bm{a}}^{t-1}|{\bm{a}}^{t},{\bm{a}}^{0})}{\mu({\bm{a}}^{t-1}|{\bm{a}}^{t})}\cdot\frac{\pi_{b}({\bm{a}}^{t}|{\bm{a}}^{0})}{\pi_{b}({\bm{a}}^{t-1}|{\bm{a}}^{0})}\Big)+\log\frac{\pi_{b}({\bm{a}}^{1}|{\bm{a}}^{0})}{\mu({\bm{a}}^{0}|{\bm{a}}^{1})}\Big]
=\mathbb{E}_{\rho^{\pi_{b}}({\bm{s}},{\bm{a}})}\Big[-\log\mu({\bm{a}}^{T})+\sum_{t=2}^{T}\log\frac{\pi_{b}({\bm{a}}^{t-1}|{\bm{a}}^{t},{\bm{a}}^{0})}{\mu({\bm{a}}^{t-1}|{\bm{a}}^{t})}+\sum_{t=2}^{T}\log\frac{\pi_{b}({\bm{a}}^{t}|{\bm{a}}^{0})}{\pi_{b}({\bm{a}}^{t-1}|{\bm{a}}^{0})}+\log\frac{\pi_{b}({\bm{a}}^{1}|{\bm{a}}^{0})}{\mu({\bm{a}}^{0}|{\bm{a}}^{1})}\Big]
=\mathbb{E}_{\rho^{\pi_{b}}({\bm{s}},{\bm{a}})}\Big[-\log\mu({\bm{a}}^{T})+\sum_{t=2}^{T}\log\frac{\pi_{b}({\bm{a}}^{t-1}|{\bm{a}}^{t},{\bm{a}}^{0})}{\mu({\bm{a}}^{t-1}|{\bm{a}}^{t})}+\log\frac{\pi_{b}({\bm{a}}^{T}|{\bm{a}}^{0})}{\pi_{b}({\bm{a}}^{1}|{\bm{a}}^{0})}+\log\frac{\pi_{b}({\bm{a}}^{1}|{\bm{a}}^{0})}{\mu({\bm{a}}^{0}|{\bm{a}}^{1})}\Big]
=\mathbb{E}_{\rho^{\pi_{b}}({\bm{s}},{\bm{a}})}\Big[\log\frac{\pi_{b}({\bm{a}}^{T}|{\bm{a}}^{0})}{\mu({\bm{a}}^{T})}+\sum_{t=2}^{T}\log\frac{\pi_{b}({\bm{a}}^{t-1}|{\bm{a}}^{t},{\bm{a}}^{0})}{\mu({\bm{a}}^{t-1}|{\bm{a}}^{t})}-\log\mu({\bm{a}}^{0}|{\bm{a}}^{1})\Big]
=\mathbb{E}_{\rho^{\pi_{b}}({\bm{s}},{\bm{a}})}\Big[\underbrace{D_{\mathrm{KL}}(\pi_{b}({\bm{a}}^{T}|{\bm{a}}^{0})\parallel\mu({\bm{a}}^{T}))}_{L_{T}}+\sum_{t=2}^{T}\underbrace{D_{\mathrm{KL}}(\pi_{b}({\bm{a}}^{t-1}|{\bm{a}}^{t},{\bm{a}}^{0})\parallel\mu({\bm{a}}^{t-1}|{\bm{a}}^{t}))}_{L_{t-1}}\underbrace{-\log\mu({\bm{a}}^{0}|{\bm{a}}^{1})}_{L_{0}}\Big].

So we can approximate the entropy constraint in Eq. (12):

H(\pi_{b},\mu)\approx K=L_{T}+\sum_{t=2}^{T}L_{t-1}+L_{0}. (35)

Following DDPM (Ho et al., 2020), we use random sampling to approximate \sum_{t=2}^{T}L_{t-1}+L_{0}, obtaining

\sum_{t=2}^{T}L_{t-1}+L_{0}\approx{\mathcal{L}}_{c}(\pi_{b},\mu) (36)
=\mathbb{E}_{i\sim[1:N],\,\epsilon\sim{\mathcal{N}}(\mathbf{0},{\bm{I}}),\,({\bm{s}},{\bm{a}})\sim{\mathcal{D}}}\left[\|\epsilon-\epsilon_{\theta}(\sqrt{\bar{\alpha}_{i}}{\bm{a}}+\sqrt{1-\bar{\alpha}_{i}}\epsilon,{\bm{s}},i)\|^{2}\right].

Letting L_{T}=c and combining Eq. (35) with Eq. (36), we get

H(\pi_{b},\mu)\approx c+{\mathcal{L}}_{c}(\pi_{b},\mu). (37)
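For concreteness, the sketch below shows how the loss in Eq. (36) can be estimated from a sampled mini-batch in PyTorch. It is a minimal illustration, not our exact implementation: the noise-prediction network eps_model, the cumulative schedule alpha_bar, and the tensor shapes are assumptions described in the comments.

```python
import torch

def diffusion_bc_loss(eps_model, alpha_bar, states, actions, n_timesteps):
    """Monte Carlo estimate of L_c in Eq. (36).

    eps_model: assumed noise-prediction network eps_theta(noisy_action, state, i)
    alpha_bar: assumed 1-D tensor of length n_timesteps with cumulative products alpha_bar_i
    states, actions: mini-batch (s, a) ~ D
    """
    batch_size = actions.shape[0]
    # Sample a diffusion step i ~ Uniform{1, ..., N} for each batch element.
    i = torch.randint(1, n_timesteps + 1, (batch_size,), device=actions.device)
    # Sample Gaussian noise epsilon ~ N(0, I).
    eps = torch.randn_like(actions)
    # Perturb the dataset action: sqrt(alpha_bar_i) * a + sqrt(1 - alpha_bar_i) * eps.
    a_bar = alpha_bar[i - 1].unsqueeze(-1)
    noisy_actions = a_bar.sqrt() * actions + (1.0 - a_bar).sqrt() * eps
    # Regress the predicted noise onto the injected noise with an L2 loss.
    eps_pred = eps_model(noisy_actions, states, i)
    return ((eps - eps_pred) ** 2).sum(dim=-1).mean()
```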

A.4 Theorem 3.11

The proof of Theorem 3.11 is based on Paternain et al. (2019)[Appendix A].

We first state Lemma A.2, which shows that the error introduced by the parametrization is bounded by a constant.

Lemma A.2.

Let \rho and \rho_{\theta} be the occupation measures induced by the policies \pi\in{\mathcal{P}}({\mathcal{S}}) and \pi_{\theta}, respectively, where \pi_{\theta} is an \epsilon-parametrization of \pi. Then, it follows that

\int_{{\mathcal{S}}\times{\mathcal{A}}}|\rho(s,a)-\rho_{\theta}(s,a)|\,ds\,da\leq\frac{\epsilon}{1-\gamma}. (38)

See proof in Paternain et al. (2019)[Appendix A].

We next introduce Theorem A.3, which shows that the duality gap introduced after parameterization is bounded by a linear function of the model's expressive capability.

Theorem A.3.

Suppose that r is bounded by a constant R>0 and let B=\max_{{\bm{s}},{\bm{a}}}\rho^{\pi_{b}}({\bm{s}},{\bm{a}})/\rho({\bm{s}},{\bm{a}}). Let \lambda^{\star} be the solution to the dual problem associated with Eq. (12). Then, under the hypotheses of Theorem 3.7, it follows that

P^{\star}\geq D_{\theta}^{\star}\geq P^{\star}-\left(R+\|\lambda^{\star}\|_{1}B\right)\frac{\epsilon}{1-\gamma}, (39)

where P^{\star} is the optimal value of Eq. (12), and D_{\theta}^{\star} is the value of the parametrized dual problem Eq. (18).

Note that Theorem A.3 is actually based on Theorem 2 of Paternain et al. (2019), and the proof is as follows.

Proof of Theorem A.3.

Notice that the dual functions d(\lambda) and d_{\theta}(\lambda) associated to the problems Eq. (12) and Eq. (17), respectively, satisfy d_{\theta}(\lambda)\leq d(\lambda) for every \lambda. The latter follows from the fact that the set of parametrized policies over which the Lagrangian is maximized is contained in the set of non-parametrized policies. In particular, this holds for \lambda^{\star}, the solution of the dual problem associated to Eq. (12). Hence, we have the following sequence of inequalities:

D^{\star}=d(\lambda^{\star})\geq d_{\theta}(\lambda^{\star})\geq D^{\star}_{\theta}, (40)

where the last inequality follows from the fact that D^{\star}_{\theta} is the minimum of Eq. (18). The zero duality gap established in Theorem 3.7 completes the proof of the upper bound on D_{\theta}^{\star}. We next work towards the lower bound on D_{\theta}^{\star}. Let us write the dual function of the parametrized problem Eq. (17) as

d_{\theta}(\lambda)=d(\lambda)-\left(\max_{\mu\in{\mathcal{P}}({\mathcal{S}})}{\mathcal{L}}(\mu,\lambda)-\max_{\theta}{\mathcal{L}}_{\theta}(\theta,\lambda)\right). (41)

Let \mu^{\star}\triangleq\operatorname*{arg\,max}_{\mu\in{\mathcal{P}}({\mathcal{S}})}{\mathcal{L}}(\mu,\lambda) and let \theta^{\star} be an \epsilon-approximation of \mu^{\star}. Then, by definition of the maximum, it follows that

d_{\theta}(\lambda)\geq d(\lambda)-\left({\mathcal{L}}(\mu^{\star},\lambda)-{\mathcal{L}}_{\theta}(\theta^{\star},\lambda)\right). (42)

We next work towards a bound for {\mathcal{L}}(\mu^{\star},\lambda)-{\mathcal{L}}_{\theta}(\theta^{\star},\lambda). To do so, notice that we can write the difference in terms of the occupation measures \rho^{\star} and \rho_{\theta}^{\star} associated to the policies \mu^{\star} and \mu_{\theta^{\star}}:

{\mathcal{L}}(\mu^{\star},\lambda)-{\mathcal{L}}_{\theta}(\theta^{\star},\lambda)=\int_{{\mathcal{S}}\times{\mathcal{A}}}r\left(d\rho^{\star}(\lambda)-d\rho_{\theta}^{\star}(\lambda)\right)+\lambda\rho^{\pi_{b}}({\bm{s}},{\bm{a}})\,d(\log\rho^{\star}(\lambda)-\log\rho_{\theta}^{\star}(\lambda)). (43)

Since \mu_{\theta^{\star}} is by definition an \epsilon-approximation of \mu^{\star}, it follows from Lemma A.2 that

\int_{{\mathcal{S}}\times{\mathcal{A}}}\left|d\rho^{\star}(\lambda)-d\rho_{\theta}^{\star}(\lambda)\right|\leq\frac{\epsilon}{1-\gamma}. (44)

Using the bounds on the reward function and the occupation measures, we can upper bound the difference {\mathcal{L}}(\mu^{\star},\lambda)-{\mathcal{L}}_{\theta}(\theta^{\star},\lambda) by

{\mathcal{L}}(\mu^{\star},\lambda)-{\mathcal{L}}_{\theta}(\theta^{\star},\lambda)\leq\left(R+\|\lambda\|_{1}B\right)\frac{\epsilon}{1-\gamma}. (45)

The bound on the second term follows from the inequality (a-b)/a\leq\log a-\log b\leq(a-b)/b for a\geq b>0. Combining the previous bound with Eq. (42), we can lower bound d_{\theta}(\lambda) as

d_{\theta}(\lambda)\geq d(\lambda)-\left(R+\|\lambda\|_{1}B\right)\frac{\epsilon}{1-\gamma}. (46)

Since the previous expression holds for every \lambda, in particular it holds for \lambda_{\theta}^{\star}, the dual solution of the parametrized problem Eq. (18). Thus, we have

D_{\theta}^{\star}\geq d(\lambda)-\left(R+\|\lambda\|_{1}B\right)\frac{\epsilon}{1-\gamma}. (47)

Recall that \lambda^{\star}=\operatorname*{arg\,min}_{\lambda}d(\lambda), and use the definition of the dual function to lower bound D_{\theta}^{\star} by

D_{\theta}^{\star}\geq\max_{\mu\in{\mathcal{P}}({\mathcal{S}})}{\mathcal{J}}(\mu)+\lambda^{\star}\left(\kappa_{0}-H(\pi_{b},\mu)\right)-\left(R+\|\lambda\|_{1}B\right)\frac{\epsilon}{1-\gamma}. (48)

By definition of the maximum, we can lower bound the previous expression by substituting any \mu\in{\mathcal{P}}({\mathcal{S}}). In particular, we select \mu^{\star}, the solution to Eq. (12):

D_{\theta}^{\star}\geq{\mathcal{J}}(\mu^{\star})+\lambda^{\star}\left(\kappa_{0}-H(\pi_{b},\mu^{\star})\right)-\left(R+\|\lambda\|_{1}B\right)\frac{\epsilon}{1-\gamma}. (49)

Since \mu^{\star} is the optimal solution to Eq. (12), it follows that \kappa_{0}-H(\pi_{b},\mu^{\star})\geq 0, and since \lambda^{\star}\geq 0, the previous expression reduces to

D_{\theta}^{\star}\geq{\mathcal{J}}(\mu^{\star})-\left(R+\|\lambda\|_{1}B\right)\frac{\epsilon}{1-\gamma}=P^{\star}-\left(R+\|\lambda\|_{1}B\right)\frac{\epsilon}{1-\gamma}, (50)

which completes the proof of the result. ∎

Next, we bound the subgradient approximation error in the dual update (line 14 in Algorithm 1) through Proposition A.5.

Assumption A.4.

Let \mu_{\theta} be a parametrization of functions in {\mathcal{P}}({\mathcal{S}}) and let {\mathcal{L}}_{\theta}(\theta,\lambda) with \lambda\in\mathbb{R}^{m}_{+} be the Lagrangian associated to Eq. (17). Denote by \theta^{\star}(\lambda),\theta^{\dagger}(\lambda)\in\mathbb{R}^{P} the maximizer of {\mathcal{L}}_{\theta}(\theta,\lambda) and a local maximizer achieved by a generic reinforcement learning algorithm, respectively. Then, there exists \delta>0 such that for all \lambda\in\mathbb{R}^{m}_{+} it holds that {\mathcal{L}}_{\theta}(\theta^{\star}(\lambda),\lambda)\leq{\mathcal{L}}_{\theta}(\theta^{\dagger}(\lambda),\lambda)+\delta.

Proposition A.5.

Under Assumption A.4, the constraint in Eq. (17) evaluated at a local maximizer \theta^{\dagger}(\lambda) of the Lagrangian approximates the subgradient of the dual function Eq. (18). In particular, it follows that

d_{\theta}(\lambda)-d_{\theta}(\lambda_{\theta}^{\star})\leq\left(\lambda-\lambda_{\theta}^{\star}\right)\left(\kappa_{0}-H(\pi_{b},\mu)\right)+\delta. (51)

See proof in Paternain et al. (2019) (Appendix A).

Finally, we prove Theorem 3.11.

Proof of Theorem 3.11.

We start by showing the lower bound, which in fact holds for any \lambda. Notice that for any \lambda, by definition of the dual problem, it follows that d_{\theta}(\lambda)\geq D^{\star}_{\theta}. Combining this bound with the result of Theorem A.3, it follows that

d_{\theta}(\lambda)\geq P^{\star}-\left(R+\|\lambda^{\star}\|_{1}B\right)\frac{\epsilon}{1-\gamma}. (52)

To show the upper bound, we start by writing the difference between the dual multiplier at iteration k+1 and the solution of Eq. (18) in terms of the iterate at time k. Since \lambda_{\theta}^{\star}\in\mathbb{R}^{+}, using the non-expansiveness of the projection it follows that

\left\|\lambda_{k+1}-\lambda^{\star}_{\theta}\right\|^{2}\leq\left\|\lambda_{k}-\eta\left(\kappa_{0}-H(\pi_{b},\theta^{\dagger}(\lambda_{k}))\right)-\lambda^{\star}_{\theta}\right\|^{2}. (53)

Expanding the square and using that the squared norm of \kappa_{0}-H(\pi_{b},\theta^{\dagger}(\lambda_{k})) is bounded by C=\left(\kappa_{0}+\int_{{\mathcal{S}}\times{\mathcal{A}}}\rho^{\pi_{b}}({\bm{s}},{\bm{a}})\log\pi_{b}({\bm{a}}|{\bm{s}})d{\bm{a}}d{\bm{s}}\right)^{2}, which follows from the nonnegativity of the KL divergence in Eq. (7), it follows that

\left\|\lambda_{k+1}-\lambda^{\star}_{\theta}\right\|^{2}\leq\left\|\lambda_{k}-\lambda_{\theta}^{\star}\right\|^{2}-2\eta\left(\lambda_{k}-\lambda_{\theta}^{\star}\right)\left(\kappa_{0}-H(\pi_{b},\theta^{\dagger}(\lambda_{k}))\right)+\eta^{2}C. (54)

Using the result of Proposition A.5, we can further upper bound the inner product in the previous expression by the difference of the dual function evaluated at \lambda_{k} and \lambda_{\theta}^{\star}, plus \delta, the error in the solution of the primal maximization:

\left\|\lambda_{k+1}-\lambda^{\star}_{\theta}\right\|^{2}\leq\left\|\lambda_{k}-\lambda_{\theta}^{\star}\right\|^{2}+2\eta\left(\delta+d_{\theta}(\lambda_{\theta}^{\star})-d_{\theta}(\lambda_{k})\right)+\eta^{2}C. (55)

Defining \alpha_{k}=2(\delta+d_{\theta}(\lambda_{\theta}^{\star})-d_{\theta}(\lambda_{k}))+\eta C and unrolling the previous expression recursively yields

\left\|\lambda_{k+1}-\lambda^{\star}_{\theta}\right\|^{2}\leq\left\|\lambda_{0}-\lambda_{\theta}^{\star}\right\|^{2}+\eta\sum_{j=0}^{k}\alpha_{j}. (56)

Since d_{\theta}(\lambda_{\theta}^{\star}) is the minimum of the dual function, the difference d_{\theta}(\lambda_{\theta}^{\star})-d_{\theta}(\lambda_{k}) is always nonpositive. Thus, when \lambda_{k} is not close to the solution of the dual problem, \alpha_{k} is negative. The latter implies that the distance between \lambda_{k} and \lambda_{\theta}^{\star} is reduced by virtue of Eq. (56). Formally, for any \varepsilon>0, when \alpha_{j}>-2\varepsilon we have

d_{\theta}(\lambda_{j})-d_{\theta}(\lambda_{\theta}^{\star})\leq\eta\frac{C}{2}+\delta+\varepsilon. (57)

Using the result of Theorem A.3, we can upper bound D_{\theta}^{\star} by P^{\star}, which establishes the neighborhood defined in Eq. (21). It remains to show that the number of iterations required to do so is bounded by

K\leq\frac{\left\|\lambda_{0}-\lambda_{\theta}^{\star}\right\|^{2}}{2\eta\varepsilon}. (58)

To do so, let K>0 be the first iterate in the neighborhood Eq. (21); formally, K=\min\{j\in\mathbb{N}:\alpha_{j}>-2\varepsilon\}. Then it follows from the recursion that

\left\|\lambda_{K}-\lambda^{\star}_{\theta}\right\|^{2}\leq\left\|\lambda_{0}-\lambda_{\theta}^{\star}\right\|^{2}-2K\eta\varepsilon. (59)

Since \left\|\lambda_{K}-\lambda_{\theta}^{\star}\right\|^{2} is nonnegative, the previous expression reduces to 2K\eta\varepsilon\leq\left\|\lambda_{0}-\lambda_{\theta}^{\star}\right\|^{2}, which completes the proof of the result. ∎
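As an illustration of the dual iteration analyzed above, the sketch below performs one projected dual update \lambda_{k+1}=\left[\lambda_{k}-\eta\left(\kappa_{0}-H(\pi_{b},\theta^{\dagger}(\lambda_{k}))\right)\right]_{+}. The helper entropy_estimate (an approximation of H(\pi_{b},\mu_{\theta}) via Eq. (37)) and the lower clip lambda_clip (cf. Table 4) are assumptions for illustration, not our exact implementation.

```python
def dual_ascent_step(lam, kappa_0, entropy_estimate, eta, lambda_clip=0.0):
    """One projected (sub)gradient step on the dual variable lambda.

    lam:              current multiplier lambda_k
    kappa_0:          entropy threshold from Eq. (23)
    entropy_estimate: estimate of H(pi_b, mu_theta), e.g. via Eq. (37)
    eta:              dual step size
    lambda_clip:      lower bound so that lambda >= lambda_clip (cf. Table 4)
    """
    # The (sub)gradient of the dual function at lambda_k is kappa_0 - H.
    lam = lam - eta * (kappa_0 - entropy_estimate)
    # Project back onto the feasible set {lambda >= lambda_clip}.
    return max(lam, lambda_clip)
```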

Appendix B Notation Table

Table 3: Notation used in the paper.
H(\pi_{b},\mu) \triangleq -\mathbb{E}_{{\bm{s}}\sim\rho^{\pi_{b}}({\bm{s}}),{\bm{a}}\sim\pi_{b}(\cdot|{\bm{s}})}\left[\log\mu(\cdot|{\bm{s}})\right].
\mu, \mu({\bm{a}}|{\bm{s}}) or \mu(\cdot|{\bm{s}}) \triangleq policy.
\pi_{b}, \pi_{b}({\bm{a}}|{\bm{s}}) or \pi_{b}(\cdot|{\bm{s}}) \triangleq behavior policy.
p_{\mu}({\bm{s}}_{t}={\bm{s}}) \triangleq probability of landing in state {\bm{s}} at time t when following policy \mu from an initial state sampled from d_{0}, in an environment with transition dynamics {\mathcal{T}}.
{\mathcal{T}}, {\mathcal{P}}({\bm{s}}^{\prime}|{\bm{s}},{\bm{a}}) or p({\bm{s}}^{\prime}|{\bm{s}},{\bm{a}}) \triangleq transition dynamics.
d_{0}({\bm{s}}) \triangleq initial state distribution of the behavior policy.
\rho^{\mu}({\bm{s}}) \triangleq \sum_{t=0}^{\infty}\gamma^{t}p_{\mu}({\bm{s}}_{t}={\bm{s}}), the unnormalized discounted state visitation frequencies.
\rho^{\mu}({\bm{s}},{\bm{a}}) \triangleq (1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}p_{\mu}({\bm{s}}_{t}={\bm{s}})\mu({\bm{a}}|{\bm{s}})=(1-\gamma)\rho^{\mu}({\bm{s}})\mu({\bm{a}}|{\bm{s}}), the occupation measure of policy \mu.
Q_{\phi}({\bm{s}},{\bm{a}}) \triangleq parameterized state-action value function.
Q_{\phi^{\prime}}({\bm{s}},{\bm{a}}) \triangleq parameterized target state-action value function.
\mu_{\theta}({\bm{a}}|{\bm{s}}) \triangleq parameterized policy.
\mu_{\theta^{\prime}}({\bm{a}}|{\bm{s}}) \triangleq parameterized target policy.
\epsilon_{\theta}({\bm{x}}_{i},i) \triangleq parameterized Gaussian noise prediction network in the diffusion model.
f_{\phi}({\bm{y}}|{\bm{x}}_{i}) \triangleq parameterized classifier.

Appendix C EXPERIMENTAL DETAILS

We follow Wang et al. (2022) and build our policy with an MLP-based conditional diffusion model. Following DDPM, we recover the action from the residual (noise-prediction) network \epsilon({\bm{a}}^{i},{\bm{s}},i), and we model \epsilon_{\theta} as a 3-layer MLP with Mish activations and 256 hidden units per layer. We also follow Fujimoto et al. (2018) and build two Q-networks with the same MLP settings as our critic networks. All networks are optimized with Adam (Kingma & Ba, 2014). We provide all the hyperparameters in Table 4.
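A minimal PyTorch sketch of this architecture is given below. It is only an illustration of the description above: the timestep embedding and the exact layer arrangement are assumptions, and the actual implementation may differ.

```python
import torch
import torch.nn as nn

class NoisePredictor(nn.Module):
    """3-layer Mish MLP that predicts the injected noise eps_theta(a^i, s, i)."""
    def __init__(self, state_dim, action_dim, hidden=256, n_timesteps=5):
        super().__init__()
        # Simple learned timestep embedding (an assumption for illustration).
        self.time_embed = nn.Embedding(n_timesteps + 1, hidden)
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + hidden, hidden), nn.Mish(),
            nn.Linear(hidden, hidden), nn.Mish(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, noisy_action, state, timestep):
        t_emb = self.time_embed(timestep)
        return self.net(torch.cat([noisy_action, state, t_emb], dim=-1))

class DoubleCritic(nn.Module):
    """Two independent Q-networks, following Fujimoto et al. (2018)."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        def q_net():
            return nn.Sequential(
                nn.Linear(state_dim + action_dim, hidden), nn.Mish(),
                nn.Linear(hidden, hidden), nn.Mish(),
                nn.Linear(hidden, 1),
            )
        self.q1, self.q2 = q_net(), q_net()

    def forward(self, state, action):
        sa = torch.cat([state, action], dim=-1)
        return self.q1(sa), self.q2(sa)
```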

We train for 1000 epochs (2000 for Gym tasks). Each epoch consists of 1000 gradient steps with batch size 256. Training on the MuJoCo locomotion tasks is usually quite stable, as shown in Figure 2. However, training on AntMaze is relatively unstable due to its sparse reward setting and the suboptimal trajectories in the offline datasets.

Table 4: Hyperparameter settings of all selected tasks. Reward Tune with CQL means that we use the reward tuning method in Kumar et al. (2020). We also use max Q backup to improve performance in AntMaze tasks. In offline reinforcement learning, it is common to fine-tune rewards for the AntMaze tasks. Additionally, we have observed that some other diffusion-based methods, such as SfBC, perform more reward engineering compared to our method. \lambda_{\text{clip}} in our paper means that \lambda\geq\lambda_{\text{clip}}.
Dataset Environment Learning Rate \kappa Reward Tune max Q backup Policy evaluation interval \lambda_{\text{clip}}
Medium-Expert HalfCheetah 3e-4 0.04 None False 2 0
Medium-Expert Hopper 3e-4 0.03 None False 2 0
Medium-Expert Walker2d 3e-4 0.04 None False 2 0
Medium HalfCheetah 3e-4 0.06 None False 2 0
Medium Hopper 3e-4 0.05 None False 2 0
Medium Walker2d 3e-4 0.03 None False 2 0
Medium-Replay HalfCheetah 3e-4 0.06 None False 2 0
Medium-Replay Hopper 3e-4 0.03 None False 2 0
Medium-Replay Walker2d 3e-4 0.03 None False 2 0
Default AntMaze-umaze 3e-4 0.2 CQL False 2 0.3
Diverse AntMaze-umaze 3e-4 0.09 CQL True 2 0.3
Play AntMaze-medium 1e-3 0.3 CQL True 2 0.3
Diverse AntMaze-medium 3e-4 0.2 CQL True 2 0.3
Play AntMaze-large 3e-4 0.2 CQL True 4 0.3
Diverse AntMaze-large 3e-4 0.2 CQL True 4 0.3

Runtime. We test the runtime of DiffCPS on an RTX 3050 GPU. For training, the Gym locomotion tasks take about 4h26min for 2000 epochs (2e6 gradient steps); see Table 5 for details.

Table 5: Runtime of different diffusion-based offline RL methods.
D4RL Tasks DiffCPS (T=5) DiffusionQL (T=5) SfBC (T=5)
Locomotion Runtime (1 epoch) 7.68s 5.1s 8.42s
AntMaze Runtime (1 epoch) 9.96s 10.5s 10.53s

From Table 5, it is evident that the runtime of DiffCPS is comparable to that of other diffusion-based methods. Further optimizations, such as using JAX or incorporating other diffusion model acceleration techniques, could improve the runtime of DiffCPS.

Appendix D More Toy experiments

In this appendix, we provide further details about the toy experiments. The noisy circle consists of the unit circle with added Gaussian noise {\mathcal{N}}(0,0.05). The offline data {\mathcal{D}}=\{{\bm{a}}_{i}\}_{i=1}^{5000} are collected from the noisy circle. We train all methods for 20,000 steps to ensure convergence. The network architecture of SfBC remains consistent with the original SfBC paper, and both DQL and DiffCPS use the same MLP architecture. The code for DQL and SfBC is sourced from the official implementations provided by the authors, while the code for AWR is adapted from a PyTorch implementation available on GitHub.
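A minimal sketch of how such a dataset can be generated is shown below; the per-coordinate noise parameterization and the random seed are assumptions for illustration.

```python
import numpy as np

def make_noisy_circle_dataset(n_samples=5000, noise_std=0.05, seed=0):
    """Sample 2D actions on the unit circle and perturb them with N(0, 0.05) noise."""
    rng = np.random.default_rng(seed)
    angles = rng.uniform(0.0, 2.0 * np.pi, size=n_samples)
    actions = np.stack([np.cos(angles), np.sin(angles)], axis=-1)
    actions += rng.normal(scale=noise_std, size=actions.shape)
    return actions.astype(np.float32)
```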

Figure 5: Evaluation performance of DiffCPS and other baselines on toy bandit experiments. The dashed line represents the score of AWR. We also observe that as T increases, diffusion-based algorithms all experience a certain degree of performance decline, especially SfBC. The reason could be that as T increases, the increased model capacity leads to overfitting the data in the dataset. In the case of SfBC, the presence of sampling errors exacerbates this phenomenon.

Figure 6 shows the full results of the diffusion-based methods, and Figure 5 displays the scores of the diffusion-based algorithms under different diffusion steps T. The score is a modified Jaccard index, which computes how many generated actions fall into the offline dataset (i.e., the minimum distance to a dataset action is less than 1e-8). The dashed line represents the score of AWR.
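A sketch of this score computation under the description above is given below; normalizing by the number of generated actions is an assumption, since the exact normalization is not spelled out here.

```python
import numpy as np

def modified_jaccard_score(generated_actions, dataset_actions, tol=1e-8):
    """Fraction of generated actions whose nearest dataset action lies within tol."""
    # Pairwise Euclidean distances between generated and dataset actions.
    dists = np.linalg.norm(
        generated_actions[:, None, :] - dataset_actions[None, :, :], axis=-1
    )
    min_dists = dists.min(axis=1)
    return float((min_dists < tol).mean())
```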

Figure 6: Effect of different diffusion steps on diffusion-based algorithms.
Figure 7: Evaluation of prior offline RL methods.
Figure 8: Results of diffusion-based behavior cloning.

In Figure 7, we compare DiffCPS with prior offline RL methods, including TD3+BC (Fujimoto & Gu, 2021) with a Gaussian policy, BCQ (Fujimoto et al., 2019), and BEAR (Kumar et al., 2019). Note that both BCQ and BEAR use a VAE to model the behavior policy and then obtain the policy by regularizing toward the cloned behavior policy. However, the results in Figure 7 show that the VAE cannot accurately model the multi-modal data in the offline dataset. TD3+BC with a Gaussian policy also fails to model the optimal behavior due to the limited policy expressivity described in Section 3.1.

In Figure 8, we show the results of diffusion-based behavior cloning. The results in Figures 7 and 8 show that the diffusion model can capture the multi-modal distribution, while the other methods struggle to model the multi-modal behavior policy.

Appendix E Extra Related Work

Offline Model-based RL. Model-based RL methods represent another approach to offline reinforcement learning. Similar to model-free offline RL, model-based RL also needs to address extrapolation error. Kidambi et al. (2020) and Yu et al. (2020) perform pessimistic planning in the learned model, which penalizes the reward for uncertainty. BREMEN (Matsushima et al., ) uses a trust-region constraint to update the policy within an ensemble of dynamics models to avoid extrapolation error. VL-LCB (Rashidinejad et al., 2021) employs offline value iteration with a lower confidence bound to address extrapolation error. SGP (Suh et al., 2023) uses the score function of the diffusion model to approximate the gradient of the proposed uncertainty estimate. Our method differs from these approaches in that DiffCPS tackles the diffusion-based constrained policy search problem in offline RL through the Lagrangian dual and recursive optimization.

Appendix F Extra Experiments

F.1 Trajectory Stitching

The offline dataset is depicted in Figure 9 (a). The reward is the negative Euclidean distance from the point (1,1) plus a constant, with both actions and states represented as 2D coordinates. The dataset consists only of trajectories forming an X shape. During evaluation, the starting point is (-1,1). To achieve the maximum reward, the policy needs to stitch together offline data to reach the goal. The trajectories generated by our policy are shown in Figure 9 (b). We note that DiffCPS successfully combines trajectories from the dataset to reach the vicinity of the target point (1,1) while avoiding low-reward areas.
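For reference, a minimal sketch of this reward function is shown below; the constant offset is an assumed placeholder, since its exact value is not specified above.

```python
import numpy as np

GOAL = np.array([1.0, 1.0])
REWARD_OFFSET = 2.0  # assumed constant; the exact value is not specified in the text

def stitching_reward(state):
    """Reward = constant minus the Euclidean distance of the 2D state from the goal (1, 1)."""
    return REWARD_OFFSET - np.linalg.norm(np.asarray(state) - GOAL)
```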

(a) Offline Dataset.
(b) Trajectory of DiffCPS.
Figure 9: Trajectory Stitching.

F.2 TriFinger Dataset Experiment

The TriFinger dataset (Gürtler et al., ) is an offline dataset for the TriFinger robotic platform that includes both real and simulated data. The dataset primarily focuses on two tasks: Push and Lift. We use the Push-Sim-Expert dataset to train DiffCPS, with the network architecture consistent with the details provided in Appendix C. The goal of the push task is to move the cube to a target location. This task does not require the agent to align the orientation of the cube; the reward is based only on the desired and achieved positions.

When trained for 2e6 gradient steps on the Sim-Expert dataset, DiffCPS (T=45, target_kl=0.01, \lambda_{\text{min}}=0) achieves a success rate of 0.906\pm 0.001 on the push task over 100 evaluation episodes and 5 random seeds, as detailed in Table 6.

TriFinger-Push BC CRR IQL DiffCPS (Ours)
Sim-Expert 0.83\pm 0.02 \bf{0.94\pm 0.04} 0.88\pm 0.04 \bf{0.906\pm 0.001}
Table 6: The success rate of DiffCPS and other baselines on TriFinger Push task. Results for other baselines are sourced from Gürtler et al. and have been carefully tuned. It’s worth noting that our DiffCPS still utilizes a simple MLP to represent the policy, suggesting potential benefits from more powerful network architectures.

Appendix G Augmented Lagrangian Method

In Algorithm 1, we use alternating optimization (dual ascent) to solve the dual problem in Eq. (15). In practice, we can employ the Augmented Lagrangian Method (ALM) to enhance the stability of the algorithm. ALM introduces a strongly convex term (a penalty function) into the Lagrangian to improve convergence. Our experimental results suggest that ALM can enhance training stability but may lead to a performance decrease. This could be attributed to (i) the penalty function altering the optimal value of the KL divergence constraint and (ii) the penalty function making the policy too conservative, thereby preventing the adoption of high-reward actions. We believe that incorporating ALM and other optimization algorithms is a promising direction for further improving our algorithm; a sketch of such an update follows below.
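To make the discussion concrete, the sketch below shows an ALM-style treatment of the constraint H(\pi_{b},\mu_{\theta})\leq\kappa_{0}: a shifted quadratic penalty is added to the (minimized) policy loss and the multiplier is updated with the constraint residual. The penalty coefficient rho and the helper entropy_estimate are assumptions for illustration, not our exact implementation.

```python
def augmented_lagrangian_terms(lam, kappa_0, entropy_estimate, rho=10.0):
    """ALM-style penalty for the inequality constraint g = H(pi_b, mu_theta) - kappa_0 <= 0.

    Uses the standard shifted-penalty form for inequality constraints:
      penalty = (rho / 2) * (max(0, g + lam / rho) ** 2 - (lam / rho) ** 2)
      lam    <- max(0, lam + rho * g)
    """
    g = entropy_estimate - kappa_0
    shifted = max(0.0, g + lam / rho)
    penalty = 0.5 * rho * (shifted ** 2 - (lam / rho) ** 2)  # added to the minimized policy loss
    new_lam = max(0.0, lam + rho * g)                        # multiplier update
    return penalty, new_lam
```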