
* Equal contribution

Generating Behaviorally Diverse Policies with Latent Diffusion Models

Shashank Hegde*
University of Southern California
[email protected]
Sumeet Batra*
University of Southern California
[email protected]
K.R. Zentner
University of Southern California
[email protected]
Gaurav S. Sukhatme
University of Southern California
[email protected]
Sukhatme holds concurrent appointments as a Professor at USC and as an Amazon Scholar. This paper describes work performed at USC and is not associated with Amazon.
Abstract

Recent progress in Quality Diversity Reinforcement Learning (QD-RL) has enabled learning a collection of behaviorally diverse, high performing policies. However, these methods typically involve storing thousands of policies, which results in high space-complexity and poor scaling to additional behaviors. Condensing the archive into a single model while retaining the performance and coverage of the original collection of policies has proved challenging. In this work, we propose using diffusion models to distill the archive into a single generative model over policy parameters. We show that our method achieves a compression ratio of 13x while recovering 98% of the original rewards and 89% of the original coverage. Further, the conditioning mechanism of diffusion models allows for flexibly selecting and sequencing behaviors, including using language.
Project website: https://sites.google.com/view/policydiffusion/home.

1 Introduction

Quality Diversity (QD) is an emerging field in which collections of high-performing, behaviorally diverse solutions are trained. QD methods perform what is often referred to as illumination or divergent search, in that they attempt to illuminate the search space rather than optimize towards a single point. QD algorithms have shown success in learning robot controllers capable of adapting to damage, solving hard-exploration problems, and generating diverse scenarios in the procedural content generation (PCG) domain [5] [7] [2]. The foundational method, MAP-Elites [20], maintains an archive of solutions where each cell in the archive corresponds to a solution with a score given by the task objective $f$ and a behavior specified by measure functions $m_1, \ldots, m_k$, which map to a low-dimensional behavior space. The measure functions $m_1, \ldots, m_k$ specify which cell each solution belongs to in the $k$-dimensional archive. New solutions are evolved using evolutionary methods and inserted into the archive only if they outperform existing ones with the same behavior.
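As a concrete sketch of this insertion rule, the following minimal MAP-Elites loop keeps only the best solution per cell; the toy objective, measure functions, bounds, and resolution are placeholders standing in for an actual task, not part of any method described in this paper.

```python
import numpy as np

def measures_to_cell(m, lower, upper, resolution):
    """Map a k-dimensional measure vector to a discrete archive cell index."""
    frac = (np.asarray(m) - lower) / (upper - lower)
    return tuple(np.clip((frac * resolution).astype(int), 0, resolution - 1))

def insert(archive, theta, f, m, lower, upper, resolution):
    """MAP-Elites rule: keep a solution only if its cell is empty or it
    outperforms the current occupant of that cell."""
    cell = measures_to_cell(m, lower, upper, resolution)
    incumbent = archive.get(cell)
    if incumbent is None or f > incumbent[0]:
        archive[cell] = (f, theta)
        return True
    return False

# Toy usage with a random "evaluation" standing in for an RL rollout.
rng = np.random.default_rng(0)
archive = {}  # cell index -> (objective, solution parameters)
for _ in range(1000):
    theta = rng.normal(size=8)
    f, m = -np.sum(theta**2), theta[:2]   # stand-in objective / measures
    insert(archive, theta, f, m, lower=-3.0, upper=3.0, resolution=10)
print(len(archive), "cells occupied")
```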

A promising subclass of methods, Quality Diversity Reinforcement Learning (QD-RL), combines the optimization capabilities of RL with the illumination capabilities of QD to find high-performing and diverse solutions. In the robotics domain, where the environment and the objective and measure functions $f$ and $\mathbf{m}$ are assumed to be non-differentiable, RL can be leveraged to estimate the gradients of $f$ and/or $\mathbf{m}$ and provide a powerful performance-improvement operator on the solutions in the archive. QD-RL methods that combine QD with on-policy and off-policy RL algorithms have shown promising results on a variety of locomotion tasks and are capable of finding a plethora of high-quality gaits [21] [27] [22] [1]. One of several drawbacks of existing QD-RL methods is that they must maintain a collection of hundreds, if not thousands, of policies in order to cover the behavior space, which leads to poor space-complexity and difficulty in real-world deployment. MAP-Elites-based QD methods show poor scaling properties and suffer from the curse of dimensionality, quite literally, in that as the dimensionality $k$ of the archive increases, the number of solutions one needs to store increases exponentially. Prior methods have attempted to scale MAP-Elites to higher-dimensional problems by using Centroidal Voronoi Tessellations to divide the archive into a small number of evenly spaced geometric regions. However, these methods require recomputing the Voronoi cells periodically, resulting in worse runtime performance, and they try to keep the number of niches small in order to effectively exploit local competition. In order to smoothly interpolate between behaviors of different solutions with a discrete archive, one must upsample the archive resolution to (tens of) thousands of policies, often resulting in a discretization finer than the actual granularity of distinct behaviors, while further worsening the space and time complexity of the algorithm.

An alluring idea is to distill the archive into a single, expressive model that completely covers the behavior space and maintains high performance. A single model representing the archive reduces space-complexity and potentially allows for smooth interpolation in the behavior space, making it easier to deploy and use in downstream tasks. Prior methods have shown promising results in distilling the archive into a single policy [8] and in learning a generative model over the policies in the archive via adversarial training [15]. We wish to improve on these methods by maintaining, or even improving, the overall performance of the original archive during the distillation phase, and by scaling generative models to represent policies parameterized by deep neural networks rather than a low-dimensional 1D vector of parameters.

To this end, we utilize the powerful generative and conditioning mechanisms of diffusion models to distill the archive into a single, expressive model that can generate a policy with any behavior from the behavior space mapped by the original archive. This generative process can be conditioned on the desired behavior measures, and even on language descriptions of the desired behavior. Diffusion models have shown great success in computer vision, achieving high image quality and diversity [14] [6]. Latent diffusion models accelerate training by compressing the image dataset into a compact, expressive latent space and training a diffusion model on this lower-dimensional space [24]. They proceed in two stages, first compressing away imperceptible, high-frequency details via a learned dimensionality reduction, and then learning the semantic details of the images via the diffusion model itself. Similarly, here we show that one can compress a collection of policies parameterized by deep neural networks into a lower-dimensional space using a variational autoencoder (VAE), and then learn the semantic or behavioral details of the policy distribution via latent diffusion. Our experiments show evidence for the manifold hypothesis, or the elite hypervolume [31]: that all high-performing policies lie on a low-dimensional manifold. We summarize our contributions below.

1. We compress an archive of policies parameterized by deep neural networks and trained via PPGA, a state-of-the-art QD-RL method, into a single, expressive model while maintaining the performance of the policies in the original dataset.

2. We use the iterative conditioning mechanism of diffusion models to reconstruct policies with precise locations in measure space, and demonstrate how language conditioning can be used to flexibly generate policies with different behaviors.

3. We showcase our model's ability to sequentially compose completely different behaviors together, and additionally show that language conditioning can be used to dramatically improve the performance and consistency of sequential behavior composition.

2 Related Work

Quality Diversity

QD optimization attempts to illuminate the search space with high-performing solutions. The optimization problem is formulated as follows. Given an objective $f(\cdot)$ to maximize and $k$ measure functions $\mathbf{m} = \langle m_1(\cdot), \ldots, m_k(\cdot) \rangle$ that map a solution $\theta_i$ to a low-dimensional behavior space, the QD problem is to find the highest-performing solution $\theta_i$ for every value of $\mathbf{m}$. Since $\mathbf{m}$ is a continuous variable, estimating a good solution for every point in behavior space requires infinite memory and is intractable. The QD problem is usually relaxed by discretizing $\mathbf{m}$ into a finite number of cells $M$, represented as a $k$-dimensional archive $\mathcal{A}$. The optimization problem then becomes $\max \sum_{i=0}^{M} f(\theta_i)$, where $\theta_i$ is a solution whose measures $\mathbf{m}(\theta_i)$ fall into cell $i$. Differentiable Quality Diversity (DQD) [9] considers the problem where the objective and measure functions are differentiable, which provides gradients $\nabla f(\cdot)$ and $\nabla \mathbf{m}(\cdot)$. Quality Diversity Reinforcement Learning (QD-RL) considers a subclass of problems that can be framed as sequential decision-making tasks with exploitable Markov Decision Process (MDP) structure. Instead of optimizing for a single optimal policy, the goal is to find a collection of high-performing policies that are diverse with respect to embedding functions $\mathbf{m}$ that encode behavior information in a low-dimensional space. QD-RL algorithms vary in implementation and leverage recent works in both on-policy and off-policy RL [21, 29, 28, 22]. Proximal Policy Gradient Arborescence (PPGA) [1], on which we build here, is a state-of-the-art QD-RL method that combines on-policy reinforcement learning with DQD. It maintains a current search policy corresponding to some policy $\pi_{\theta_\mu}$ in the archive. The objective and measure gradients $\nabla f$ and $\nabla \mathbf{m}$ are estimated for this policy and used to branch off solutions into nearby cells. The information on which branched policies most improved the archive, where policies that land in new cells are favored, is used to derive a first-order gradient estimate of maximum archive improvement. On-policy RL is used to walk the search policy towards those promising new regions of the archive that have yet to be explored. PPGA has produced state-of-the-art results on challenging locomotion tasks. A particularly nice property of this method is that the first-order approximation of the gradient w.r.t. archive improvement improves with higher archive resolution. Since training diffusion models requires large datasets, upsampling the archive resolution in PPGA generally results in better performance and allows us to produce more data for the diffusion model.

Archive Distillation

Archive distillation is the process by which a collection of solutions is distilled into a single model. This is particularly useful in the QD-RL domain, since having a single policy with full coverage of the behavior space, the ability to interpolate between points in the behavior space, and the ability to compose different behaviors to produce new ones makes the model more versatile, memory efficient, and easily deployable in downstream tasks. Prior works predict that a form of the manifold hypothesis (the Elite Hypervolume) exists because policies that map to the same low-dimensional behavioral space, despite occupying different niches, may share certain traits. Follow-up works attempt to either find such low-dimensional representations or illuminate them by searching over the manifold directly [11, 23]. Contemporary work in the QD-RL domain has shown success in archive distillation on difficult RL problems. [8] jointly produces an archive using the state-of-the-art QD-RL method Policy Gradient Assisted MAP-Elites [21] and distills the archive into a single behavior-conditioned model. [19] uses a variant of MAP-Elites to produce a collection of policies that perform well in uncertain environments, and distills these policies into a single Decision Transformer [3]. Prior methods have also applied Generative Adversarial Networks to generate a diverse collection of policies for a robot ball-throwing task [15]. Here, we aim to improve on generative models applied in the QD-RL domain by scaling the representational capacity of our model to collections of deep neural networks, while simultaneously maintaining the performance and diversity of the original archive.

Diffusion

Diffusion models have become state of the art in image generation. Denoising Diffusion Probabilistic Models (DDPMs) are a class of generative models that iteratively denoise samples drawn from an isotropic Gaussian. The iterative denoising process is a powerful mechanism that has been shown to produce state-of-the-art results on computer vision benchmarks [6]. Numerous methods have improved on DDPMs and address some of their shortcomings compared to other generative methods. [6] shows that classifier guidance can be applied at test time to improve the quality of the generated samples. [25] showed that, by relaxing the Markov assumption in the forward diffusion process, one can significantly improve inference time by reducing the number of diffusion steps while maintaining most of the sample quality. Multiple refined methods for sampling from a diffusion process have been proposed, including [26], [18], and [16]. However, here we are not particularly concerned with sampling efficiency, and thus use the method proposed in [25]. [24] showed that diffusion can be performed on the latent space of a pretrained variational autoencoder.

Graph Hypernetworks

Hypernetworks are models capable of estimating the weights of a secondary network [12]. When conditioned on task identities, these can achieve continual learning by rehearsing task specific weight realizations [32]. Graph hypernetworks were originally introduced for architecture search in image classification [17] and have been shown to be trainable with RL to estimate variable architecture policies for locomotion and manipulation [13].

3 Method

Figure 1: Structure of our model as an encoder (left) and decoder (right). During encoding, policies are split into layers and encoded separately. The encodings are concatenated together and fed into a final layer to produce a latent representation. The conditional diffusion model samples a latent code $z$ from the latent representation. During decoding, a graph hypernetwork jointly decodes the weight and bias parameters from $z$ and the policy network architecture graph $g$, while normalization parameters are directly decoded from $z$.

Policy Compression

Following [24], we compress the archive $\mathcal{A}$ into a lower-dimensional space using a variational autoencoder (VAE). A policy consists of $l$ layers, each containing a weight matrix $W_i$ and bias vector $b_i$, $1 \leq i \leq l$. In the encoder $\mathcal{E}$, the features of each $W_i$ and $b_i$ are extracted using a convolutional neural network and a fully connected network, respectively. These features are concatenated together and fed into a final fully connected layer that produces an intermediate latent representation $z \in \mathbb{R}^{h \times w \times c}$. The decoder $\mathcal{D}$ contains a conditional graph hypernetwork and an observation-normalizer decoder $d_n$, which takes in the latent code $z$ and produces the reconstructed policy $\pi'_i = \mathcal{D}(z_i)$. The conditional graph hypernetwork estimates the policy network's parameters, while $d_n$ estimates the observation-normalizing mean and variance parameters. Our conditional graph hypernetwork is based on the implementation in [13]. While the original graph hypernetwork is meant to estimate the parameters of variable architectures (represented as graphs), we freeze the input architecture graph $g$. This is set to be the architecture graph of all the networks in the archive, which in our case is represented as $\{0, 128, 128, a\}$. Here 0 indicates the input node, the following 128s represent the hidden-layer nodes, and $a$ is the output node and equals the action-space dimension. Further, we add a latent encoder $e_z$, and together with the graph encoder $e_s$, a concatenated encoding is fed to the gated graph network [33]. This mechanism lets us condition the parameter estimation on the latent $z$. Together, $\mathcal{E}$ and $\mathcal{D}$ form the VAE architecture, which optimizes the objective

$L_{VAE} = L_{rec}(\pi'(a|s), \pi(a|s)) + D_{KL}(\mathcal{E}_{\phi}(z|\pi) \,\|\, \mathcal{N}(0, I))$   (1)
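A minimal PyTorch sketch of this encoder and of the objective in Eq. 1 is given below. The layer sizes, latent shape, and module names (`PolicyEncoder`, `vae_loss`) are illustrative assumptions rather than the exact implementation; the graph-hypernetwork decoder and observation-normalizer decoder are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyEncoder(nn.Module):
    """Schematic encoder E: per-layer feature extractors (conv for weight matrices,
    MLP for bias vectors) are concatenated and projected to a Gaussian over z."""
    def __init__(self, layer_shapes, feat_dim=64, z_channels=4, z_hw=8):
        super().__init__()
        self.weight_encoders = nn.ModuleList()
        self.bias_encoders = nn.ModuleList()
        for out_dim, in_dim in layer_shapes:
            self.weight_encoders.append(nn.Sequential(
                nn.Conv2d(1, 8, kernel_size=3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(4), nn.Flatten(), nn.Linear(8 * 16, feat_dim)))
            self.bias_encoders.append(nn.Sequential(nn.Linear(out_dim, feat_dim), nn.ReLU()))
        self.to_latent = nn.Linear(2 * feat_dim * len(layer_shapes),
                                   2 * z_channels * z_hw * z_hw)  # mean and log-variance
        self.z_shape = (z_channels, z_hw, z_hw)

    def forward(self, weights, biases):
        feats = []
        for enc_w, enc_b, W, b in zip(self.weight_encoders, self.bias_encoders, weights, biases):
            feats.append(enc_w(W.unsqueeze(1)))  # treat each weight matrix as a 1-channel image
            feats.append(enc_b(b))
        mu, logvar = self.to_latent(torch.cat(feats, dim=-1)).chunk(2, dim=-1)
        z = mu + (0.5 * logvar).exp() * torch.randn_like(mu)  # reparameterization trick
        return z.view(-1, *self.z_shape), mu, logvar

def vae_loss(actions_recon, actions_orig, mu, logvar, kl_coef=1e-6):
    """Eq. 1: action-space reconstruction plus a small KL penalty toward N(0, I)."""
    rec = F.mse_loss(actions_recon, actions_orig)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl_coef * kl

# Toy usage: a batch of 2 Walker2d-style {17, 128, 128, 6} policies.
shapes = [(128, 17), (128, 128), (6, 128)]
enc = PolicyEncoder(shapes)
W = [torch.randn(2, o, i) for o, i in shapes]
b = [torch.randn(2, o) for o, _ in shapes]
z, mu, logvar = enc(W, b)
print(z.shape)  # torch.Size([2, 4, 8, 8])
```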

Policy Diffusion

We hypothesize that the denoising process can be used to produce high-quality policies given a dataset of behaviorally diverse policies parameterized by deep neural networks. We describe how the DDPM formulation can be applied to such datasets. Diffusion models consist of a forward process $q$ that iteratively applies noise $\epsilon_t$ to a sample $x_0$ from the original data distribution, $x_0 \sim q(\mathbf{x})$. The noise at each timestep is applied according to a variance schedule $\{\beta_t\}_{t=1}^{T}$,

$q(x_t | x_{t-1}) = \mathcal{N}(x_t;\, \sqrt{1 - \beta_t}\, x_{t-1},\, \beta_t \mathbf{I})$   (2)

making the forward process Markovian. The reverse or generative process, starting from $p(x_T)$, reverts noise from an isotropic Gaussian into a sample $x_0$ from $q(\mathbf{x})$ and has a similar Markov structure:

$p_\theta(x_0) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} | x_t), \quad p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1};\, \mu_\theta(x_t, t),\, \Sigma_\theta(x_t, t))$   (3)

Here, $x_0$ is a latent code $z$ representing some policy $\pi_\theta(a|s)$, rather than the policy parameters themselves. Thus, the diffusion model instead learns to capture the distribution of the much lower-dimensional $z$, analogous to the Elite Hypervolume hypothesized in [31].

[14] makes a connection between the reverse process and Langevin dynamics, where $p_\theta(x_{t-1}|x_t)$ is the learned gradient of the data density. When $x$ represents neural network parameters, the diffusion model learns the gradient of the policy-distribution score function and iteratively refines the noisy policy parameters $x_t$ towards this distribution. When conditioning the policy $x$ to match a specific behavior $\mathbf{m}$, i.e., $p_\theta(x_{t-1}|x_t, \mathbf{m})$, this gradient can be thought of as the gradient of the maximum a posteriori (MAP) objective over the distribution of policies that exhibit behavior $\mathbf{m}$ with respect to the policy parameters $x_t$. Thus, our diffusion formulation draws inspiration from Bayesian methods, where $p(x_T, \mathbf{m}) \prod_{t=1}^{T} p_\theta(x_{t-1}|x_t, \mathbf{m})$ resembles the iterative training process of a neural network $x$ towards the mode of the posterior distribution over high-performing policies with behavior $\mathbf{m}$.
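For concreteness, below is a compact sketch of the forward noising and reverse (ancestral) sampling steps of Eqs. 2 and 3. The linear beta schedule, tensor shapes, and dummy noise predictor are illustrative assumptions; in our setting the reverse process runs on VAE latents and a trained conditional U-Net plays the role of `eps_model`.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 2e-2, T)       # assumed linear variance schedule
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

def forward_noise(x0, t):
    """q(x_t | x_0): the closed form obtained by iterating Eq. 2 for t steps."""
    eps = torch.randn_like(x0)
    xt = alpha_bar[t].sqrt() * x0 + (1 - alpha_bar[t]).sqrt() * eps
    return xt, eps

@torch.no_grad()
def reverse_sample(eps_model, shape, cond=None):
    """Ancestral sampling with p_theta(x_{t-1} | x_t) (Eq. 3), starting from pure noise."""
    x = torch.randn(shape)
    for t in reversed(range(T)):
        eps_hat = eps_model(x, torch.tensor([t]), cond)
        mean = (x - betas[t] / (1 - alpha_bar[t]).sqrt() * eps_hat) / alphas[t].sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise
    return x  # here, a latent z that the VAE decoder D turns into a policy

# Toy usage with a dummy noise predictor standing in for a trained conditional U-Net.
dummy_eps_model = lambda x, t, c: torch.zeros_like(x)
z = reverse_sample(dummy_eps_model, shape=(1, 4, 8, 8))
```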

Training Procedure

We follow the training procedure in [24]. We first train an autoencoder according to the objective in Eq. 1. A random batch of policies and their observation normalizers $(\pi_\theta, \eta)$ is sampled from the archive and fed into the encoder $\mathcal{E}$ to produce latents $\mathbf{z} = \mathcal{E}(\pi_\theta, \eta)$. The decoder $\mathcal{D}$ then reconstructs the policies and their respective observation normalizers from the latents, $(\pi_\theta', \eta') = \mathcal{D}(\mathbf{z})$. To simplify training, on some tasks we normalize the archive dataset by subtracting the per-parameter mean and dividing by the per-parameter standard deviation. This results in an autoencoder over parameter residuals relative to the original per-parameter statistics, which we keep for decoding. For training the latent diffusion model, we sample a batch of policies and their respective observation normalizers, measures, and text labels $(\pi_\theta, \eta, \mathbf{m}, \mathbf{y})$. The policies and measures are first encoded into latent vectors and measure embeddings, $\mathbf{z} = \mathcal{E}(\pi_\theta, \eta)$ and $\tau_{\psi_m}(\mathbf{m})$, where $\tau_{\psi_m}$ is a trainable encoder. These are subsequently fed into the diffusion model, where the latents are conditioned on the measure embeddings using the cross-attention mechanism. We uniformly sample $t$ from $\{1, \ldots, T\}$ for the batch and regress the predicted noise vectors according to the latent diffusion training objective

$L_{LDM} := \mathbb{E}_{\mathcal{E}(\pi_\theta),\, \epsilon \sim \mathcal{N}(0,1),\, t}\left[\,\| \epsilon - \epsilon_\theta(z_t, t, \tau_{\psi_m}(\mathbf{m})) \|_2^2\,\right]$   (4)

In the case of language-conditioned diffusion, the measures are replaced with text labels that are encoded using a Flan-T5-Small encoder [4], which is fine-tuned end-to-end using the loss in Eq. 4 to produce text embeddings $\tau_{\psi_y}(\mathbf{y})$ that condition the diffusion process via $\epsilon_\theta(z_t, t, \tau_{\psi_y}(\mathbf{y}))$.
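A sketch of one measure-conditioned optimization step for Eq. 4 is shown below, assuming the VAE encoder, the measure encoder $\tau_{\psi_m}$, the U-Net noise predictor, and the `alpha_bar` schedule from the sketch above are supplied; the names and signatures are illustrative, not the actual training code.

```python
import torch
import torch.nn.functional as F

def ldm_training_step(eps_model, encoder, measure_encoder, optimizer,
                      policies, obs_normalizers, measures, alpha_bar):
    """One measure-conditioned step of Eq. 4: encode policies into latents, noise
    them at a random timestep, and regress the predicted noise. `encoder` is the
    (pretrained) VAE encoder E, `measure_encoder` is tau_psi_m, and `eps_model`
    is the conditional U-Net; all are stand-ins for the actual modules."""
    z0 = encoder(policies, obs_normalizers)                  # z = E(pi, eta)
    cond = measure_encoder(measures)                         # measure embeddings
    t = torch.randint(0, alpha_bar.shape[0], (z0.shape[0],))
    eps = torch.randn_like(z0)
    ab = alpha_bar[t].view(-1, *([1] * (z0.dim() - 1)))
    zt = ab.sqrt() * z0 + (1 - ab).sqrt() * eps              # forward-noised latents
    loss = F.mse_loss(eps_model(zt, t, cond), eps)           # Eq. 4
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

For language conditioning, `measures` and `measure_encoder` would simply be replaced by the text labels and the Flan-T5-Small encoder.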

4 Experiments

In our experiments, we wish to analyze our model's performance on the following: 1. archive compression while maintaining original performance, 2. measure and language conditioning to produce policies with specific behaviors, and 3. sequential behavior composition to produce new behaviors. Since PPGA was evaluated on the Brax [10] environments Humanoid, Walker2D, Halfcheetah, and Ant, we evaluate our model on the same four environments. For each environment, the reward function encourages stable forward locomotion while minimizing energy consumption, and the observation and action spaces are continuous. The dimensions of the (observation, action) spaces are: Humanoid (227, 17); Walker2d (17, 6); Halfcheetah (18, 6); Ant (87, 8). Every policy in the archive has two hidden layers of 128 neurons each, followed by an output layer matching the action-space dimension. While there are recent works that perform archive distillation on these tasks [8, 19], they produce very different datasets of policies using different underlying QD-RL methods. The Quality Diversity Transformer, for example, uses evolutionary strategies with explicit optimization towards policies with low-variance behaviors, whereas PPGA uses first-order gradient approximations and makes no such explicit optimization towards low behavior variance. As any comparison of distillation methods is relative to the archive being distilled, we are unable to make a direct comparison to these methods.

Performance and Accuracy Experiments

We evaluate our model's ability to reconstruct the performance and behavior of policies in the archive to high precision. Following [19], we first downsample our trained archives for each task into 50 equally spaced geometric regions using Centroidal Voronoi Tessellation MAP-Elites (CVT-ME) [30]. Each region has a policy $\pi_\theta$ from the original archive and a corresponding behavior $\langle m_1, \ldots, m_k \rangle$, for a $k$-dimensional archive, that lies at the center of that region. These policies' behaviors are used as conditions to produce 50 measure-conditioned policies $\pi_{\theta_1}', \ldots, \pi_{\theta_{50}}'$. Each policy is then rolled out 50 times, and the objective and measure functions $f(\pi_{\theta_i}')$ and $\mathbf{m}(\pi_{\theta_i}')$ are computed as averages over the 50 episodes. These values are then used to compute the reward ratio, which is the average performance of the generated policies over the original ones: $r = \frac{\sum_{i=1}^{50} f(\pi_{\theta_i}')}{\sum_{i=1}^{50} f(\pi_{\theta_i})}$.
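A sketch of how the 50 evaluation regions and the reward ratio could be computed is shown below. Approximating CVT centroids with k-means over uniform samples of the measure space is a common CVT-ME construction and is assumed here, as are the measure bounds and dimensions in the example.

```python
import numpy as np
from sklearn.cluster import KMeans

def cvt_centroids(n_regions, lower, upper, n_samples=10_000, seed=0):
    """Approximate CVT centroids via k-means over uniform samples of the measure space."""
    rng = np.random.default_rng(seed)
    pts = rng.uniform(lower, upper, size=(n_samples, len(lower)))
    return KMeans(n_clusters=n_regions, n_init=1, random_state=seed).fit(pts).cluster_centers_

def reward_ratio(f_generated, f_original):
    """r = sum_i f(pi'_i) / sum_i f(pi_i); each entry is a return averaged over 50 rollouts."""
    return float(np.sum(f_generated) / np.sum(f_original))

# Toy usage with placeholder 2D measure bounds in [0, 1].
centroids = cvt_centroids(50, lower=[0.0, 0.0], upper=[1.0, 1.0])
print(centroids.shape)  # (50, 2)
```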

The reward ratio alone can be misleading if all the generated policies have high performance but incorrect measures w.r.t. the measure conditions. For example, generating 50 best-performing policies with the same measures despite sampling different measure conditions would lead to a large reward ratio but no diversity. Thus, we also track the Jensen-Shannon (JS) divergence between the distributions of measures produced by the generated and original policies. We refer to this as the measure divergence. We report the JS divergence instead of the KL divergence because there is no clear preference between the two KL divergence directions, and because some experiments produce policies with near-zero measure-distribution overlap, for which the JS divergence is upper bounded by $\ln(2)$ while the KL divergence is arbitrarily large. We perform this evaluation once every 10 epochs over the course of training and present the results in Fig. 2. Humanoid, Walker2D, and Halfcheetah achieve a reward ratio of nearly 1.0 while reaching a measure divergence of $10^{-2}$. Ant achieves a reward ratio of roughly 0.75 with a measure divergence of roughly 0.1. We expect a higher measure divergence on Ant given that it is a four-legged locomotor and thus has twice as many measures as the other environments.
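As a sketch, the measure divergence could be computed by histogramming both sets of rollout measures over the archive bounds and comparing the resulting discrete distributions; the binning choices and bounds below are illustrative assumptions.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions (natural log,
    so values are bounded above by ln(2))."""
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def measure_divergence(m_generated, m_original, bins=10, low=0.0, high=1.0):
    """Histogram both sets of realized measures over the archive bounds and compare."""
    edges = [np.linspace(low, high, bins + 1)] * m_original.shape[1]
    p, _ = np.histogramdd(m_generated, bins=edges)
    q, _ = np.histogramdd(m_original, bins=edges)
    return js_divergence(p.ravel(), q.ravel())

# Toy usage with random 2D measures standing in for rollout results.
rng = np.random.default_rng(0)
print(measure_divergence(rng.uniform(size=(50, 2)), rng.uniform(size=(50, 2))))
```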

Archive Reconstruction

At test time, we analyze the ability of the latent diffusion model to reconstruct the entire original archive. We take the measure vector $\langle m_1, \ldots, m_k \rangle$ corresponding to cell $c_i$ and use it as a condition to produce a policy $\pi_{\theta_i}'$ for every $c_i \in \mathcal{A}$. If the resolution of archive $\mathcal{A}$ is $d^k$, where $d$ is the discretization level and $k$ is the number of measures, this gives us a collection of $d^k$ policies. These are rolled out to compute $f$ and $\mathbf{m}$, and inserted into a new, empty archive $\mathcal{A}'$ according to the standard insertion rules in QD optimization, where only the best solution for a cell $i$ is stored when two solutions map to the same location in behavior space. The reconstructed archive will thus have some number of unique solutions such that $|\mathcal{A}'| \leq d^k$, with a QD score of $\sum_{i=0}^{|\mathcal{A}'|} f(\pi_{\theta_i}')$. To make an informative comparison between the reconstructed and original archives, we plot their cumulative distribution functions (CDFs) (Fig. 4). These not only encapsulate coverage and QD-score information, but also tell us how the policies are distributed with respect to the objective.

On all tasks, the policy distribution of the reconstructed archive closely matches that of the original one. On Halfcheetah, we are able to almost exactly reproduce the original archive. On all other tasks, we lose some density in the low-performing regions of the archive, but consistently match policy density in the higher-performing regions. Fig. 3 tells a strikingly similar story, in that our model first fills the central and often higher-performing regions of the archive before moving on to the fringe regions to improve diversity. Both results suggest that the diffusion model first learns common features corresponding to high performance across all policies, and then proceeds to learn the aspects of individual policies that make them behaviorally unique. Table 1 provides a quantitative view of the CDF plots. We report the QD scores and coverage of both the original and reconstructed archives for all tasks. Following [19], we report the Mean Error in Measure (MEM), the average error between the measures of the generated policies in $\mathcal{A}'$ and the corresponding original policies in $\mathcal{A}$: MEM $= \mathbb{E}\left[\| \mathbf{m}(\pi_{\theta_i}) - \mathbf{m}(\pi_{\theta_i}') \|_2^2 \right]$.
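The archive-reconstruction metrics reported in Table 1 can be computed along the following lines; `rollout`, the measure bounds, and the cell discretization are stand-ins, and the squared-error form follows the MEM expression above.

```python
import numpy as np

def discretize(m, lower, upper, resolution):
    """Map realized measures to a cell index in a d^k archive grid."""
    frac = (np.asarray(m) - np.asarray(lower)) / (np.asarray(upper) - np.asarray(lower))
    return tuple(np.clip((frac * resolution).astype(int), 0, resolution - 1))

def reconstruction_metrics(conditions, generated_policies, rollout, lower, upper, resolution):
    """QD score, coverage, and MEM for the reconstructed archive A'.

    `conditions[i]` is the measure vector used to condition generation for cell c_i,
    and `rollout` is a stand-in returning (average return, realized measures)."""
    new_archive, sq_errors = {}, []
    for m_target, policy in zip(conditions, generated_policies):
        f, m_real = rollout(policy)
        sq_errors.append(np.sum((np.asarray(m_target) - np.asarray(m_real)) ** 2))
        cell = discretize(m_real, lower, upper, resolution)
        # Standard QD insertion rule: keep only the best solution per cell.
        if cell not in new_archive or f > new_archive[cell]:
            new_archive[cell] = f
    qd_score = sum(new_archive.values())
    coverage = len(new_archive) / (resolution ** len(lower))
    mem = float(np.mean(sq_errors))   # squared-error form of the MEM expression
    return qd_score, coverage, mem
```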

Figure 2: Performance of our method on four benchmark environments. Higher reward ratios correspond to better performing policies. Lower JS divergence corresponds to more precise measure reconstruction. See Section 4 for a description of the experimental method and limits of these performance metrics.
[Four heatmap panels: Epoch 0, Epoch 1, Epoch 2, Epoch 3]
Figure 3: Heatmaps for Humanoid as training proceeds. Coverage and average reward increase rapidly in early epochs to approximately match the original archive. Reward ratios and JS divergence continue to improve slowly with more epochs, as shown in Figure 2.
Figure 4: CDF of policies for all tasks. For each task, the y-axis represents the percentage of the archive grid that achieves at least the return shown on the x-axis when rolled out.
Task | QD Score (×10^7) | Coverage (%) | Reconstructed QD Score (×10^7) | Reconstructed Coverage (%) | MEM
Humanoid | 8.08 | 90.79 | 7.4 ± 0.11 | 84.35 ± 1.01 | 0.237 ± 0.03
Walker2D | 3.12 | 85.68 | 3.00 ± 0.02 | 83.65 ± 0.14 | 0.269 ± 0.07
Halfcheetah | 11.4 | 96.94 | 11.1 ± 0.01 | 94.55 ± 0.11 | 0.15 ± 0.01
Ant | 3.44 | 71.03 | 2.54 ± 0.07 | 52.66 ± 1.46 | 0.57 ± 0.08
Table 1: Latent diffusion QD metrics. The QD Score and Coverage columns are calculated by rolling out the policies in the original archive training dataset. The Reconstructed QD Score and Reconstructed Coverage are calculated by rolling out the policies generated by our model, conditioned on the measures from the original archive.

KL Coefficient Ablation

Figure 5: Reward ratio and JS divergence of VAE-reconstructed policies for different KL coefficients. A very large KL coefficient results in our method producing policies very close to the average policy. This results in a reward ratio above one, at the cost of a complete loss of precision in measure space, with a measure (JS) divergence near the maximal value of $\ln(2)$. Smaller KL coefficients result in more precise policy reconstruction while maintaining a reward ratio near 1.

A KL penalty (Eq. 1) is used to regularize the latent space. [24] used a relatively small penalty coefficient of $10^{-6}$ to prevent information loss due to overly compacting the latent space. We wish to similarly quantify the information density of our dataset and the effects of stronger latent-space regularization on VAE model performance. Fig. 5 shows the VAE reward ratio and JS divergence for larger values of the KL coefficient. Overall, our findings are in line with [24]: stronger regularization results in a loss of information, thus reducing our VAE's ability to reproduce policies from the original dataset. For all other experiments, we fix the KL coefficient to $10^{-6}$.

GHN Size Ablation

We examine the effect of model size on our model's ability to reproduce the archive. We chose three different values for the number of neurons in the hidden layers of the hypernetwork in the decoder and keep the diffusion model size fixed. The results are shown in Table 2 for the Humanoid environment. The QD ratio is the QD score of the reconstructed archive divided by the original archive's QD score. The compression ratio is calculated as the total number of parameters of all policies in the original archive divided by the number of parameters in the decoder plus the number of parameters in the diffusion model. In general, we find that the MEM decreases and the QD ratio increases with larger model size, at the expense of compression ratio. Nonetheless, even the largest diffusion model (43.7 million parameters) achieves a compression ratio of 8 to 1 while reproducing 94% of the original archive with low measure error. In situations where model size is not a significant constraint, picking the largest model may be the best option, as it nearly recovers the original archive's performance and covers all relevant parts of the behavior space covered by the original dataset.

Model | Decoder Parameters | QD Ratio | Compression Ratio | MEM
GHN8 | 18.3M | 0.77 | 19:1 | 0.267 ± 0.091
GHN16 | 26.8M | 0.87 | 13:1 | 0.242 ± 0.101
GHN32 | 43.7M | 0.94 | 8:1 | 0.129 ± 0.005
Table 2: GHN size ablation on the humanoid archive. Compression Ratio is rounded to the nearest integer. QD Ratio and Mean Error in Measure improve with larger GHN sizes.
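As a worked sketch of the compression-ratio arithmetic defined above; the policy architecture, archive size, and model parameter counts below are hypothetical and do not reproduce the exact counts behind Table 2.

```python
def compression_ratio(params_per_policy, num_policies, decoder_params, diffusion_params):
    """Total archive parameters divided by generative-model parameters."""
    return (params_per_policy * num_policies) / (decoder_params + diffusion_params)

# Hypothetical example: a {17, 128, 128, 6} Walker2d-style policy has
# 17*128 + 128*128 + 128*6 + 128 + 128 + 6 = 19,590 parameters (ignoring
# observation-normalizer and action log-std parameters).
params_per_policy = 17 * 128 + 128 * 128 + 128 * 6 + 128 + 128 + 6
print(round(compression_ratio(params_per_policy, num_policies=10_000,
                              decoder_params=26_800_000, diffusion_params=5_000_000)))
```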

Sequential Behavior Composition

To test our model's ability to compose an arbitrary chain of generated behavior policies without triggering the termination criteria within a single episode, we design an experiment where the episode, which naturally terminates at $T = 1000$, is divided into four time intervals of 250 timesteps each. For each time interval, we randomly sample a measure condition $\mathbf{m}$ and use our conditional diffusion model combined with the decoder to produce an agent that exhibits behavior $\mathbf{m}$: $\pi_\phi(a|s, \mathbf{m}) = \mathcal{D}(\epsilon_\theta(z_T, T, \mathbf{m}))$, $z_T \sim \mathcal{N}(0, 1)$. An experiment is successful if the episode terminates no earlier than $t = 800$, implying that our model has produced all four behavior policies and successfully composed at least three of them together. We consider the trajectory length for one experiment to be the average trajectory length over 50 parallel environments, and we repeat this experiment 10 times. With this protocol, the success rate is 80%.
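A sketch of this sequencing protocol is shown below; `env`, `decoder`, and `sample_latent` are stand-ins for the task environment, the VAE decoder $\mathcal{D}$, and the measure-conditioned reverse diffusion process, and the gym-style step API is an assumption made for illustration.

```python
def compose_behaviors(env, decoder, sample_latent, measure_conditions,
                      interval=250, max_steps=1000):
    """Every `interval` steps, sample a new measure condition, generate a policy
    with the conditional diffusion model, and hand control to it until the next
    switch or episode termination. Returns the achieved trajectory length."""
    obs = env.reset()
    policy = decoder(sample_latent(measure_conditions[0]))
    for t in range(max_steps):
        if t > 0 and t % interval == 0:
            m = measure_conditions[t // interval]
            policy = decoder(sample_latent(m))       # switch behavior mid-episode
        obs, reward, done, info = env.step(policy(obs))
        if done:
            return t                                  # success iff t >= 800
    return max_steps
```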

Language Conditioned Sequential Behavior Composition

To evaluate how well our language-conditioned diffusion model is able to produce specific behaviors, we repeat the above protocol with encoded text labels in place of $\mathbf{m}$. We run three related experiments. First, we uniformly sample 100 sequences of four text labels from the 128 available text labels and perform the evaluation described above, reaching a success rate of 38%. Second, we sample sequences of text labels, filtering out those that contain the word "fall", which indicates unsuccessful policies. This allows us to select better policies that don't fall over on their own, which increases the success rate of sequences to 59%. Finally, we sample sequences using only text labels that contain the term "quickly", which indicates policies that move forward quickly. This raises the success rate of sequencing four different behaviors to 79%, more than double the success rate of sampling text labels uniformly. The text labels of one such successful episode, overlaid on a heatmap of the archive on which the LDM was trained, along with a sequence of rendered frames from the same episode, are shown in Figure 7. The improved success rate of sequences of filtered labels demonstrates one approach to using text conditioning to extract more specific behaviors from an archive than can be found using measure conditioning alone. This approach is similar to the use of "quality tags" in some diffusion models.

Compute Resources

Each VAE and diffusion experiment was run on a SLURM cluster where each job was allocated 6 cores of an Intel Xeon Gold 6154 3.00 GHz CPU, an NVIDIA GeForce RTX 2080 Ti GPU, and 108 GB of RAM. Each VAE experiment took about 16 hours; each measure-conditioned diffusion model took about 3 hours to train, while the language-conditioned diffusion runs took 4 hours.

Figure 6: Visualization of the measure values resulting from a sequence of four randomly selected desired measures (left) and text labels (right). On the left, the measure values from a policy sequence (Section 4) are shown as a function of time; the experiment was run 10 times and the corresponding error plots are shown. The close match between the desired values, which are used for conditioning, and the reconstructed values shows the effectiveness of our conditioning. On the right, the measure values from a policy sequence as described in Section 4 are shown as a function of time, with the text labels used to produce the policy sequence shown at the bottom of the figure. Large changes in measure values show that text conditioning is able to produce and sequence highly diverse policies. In both cases, the desired behavior is changed four times throughout the episode.
Figure 7: Temporal behavior sequencing from text labels. A humanoid (left) controlled by a policy sequence beginning with “slide forward on your right foot while kicking with your left foot” (top left), then “run forward on left foot while dragging right foot”, then “quickly shuffle forward on your left foot”, and finally “wildly hop forward on left foot while lifting your right foot up” (bottom right). Heatmap of the archive (right) showing the sequence of text labels overlaid on the measure space.

5 Limitations

Scaling with measure dimension warrants further investigation (e.g., on Ant we see high MEM). Further experimentation with diffusion hyperparameters might produce better reconstruction of policies. The language descriptions we use are currently limited, and could be expanded to a much larger corpus of descriptions. Language can sometimes underdetermine the desired behavior, so some descriptions can lead to undesirable outcomes; for example, Figure 6 shows higher variance in the reconstructed measures when conditioning on language. Training the diffusion model requires us to first construct the policy archive using a QD algorithm. An interesting alternative would be to train this model in an online regime, bypassing the archive construction step completely.

6 Conclusion

We proposed a method that uses diffusion models to distill the archive into a single generative model over policy parameters, achieving a compression ratio of 13x while recovering 98% of the original reward and maintaining 89% of the original coverage. Our models can flexibly generate specific behaviors using measures or text, and these behaviors can be sequenced together with surprising consistency. A pleasantly surprising finding was additional evidence supporting the Elite Hypervolume hypothesis proposed in [31]: from our training results and heatmap evolution over time, we see that the diffusion model first learns the general structure of what comprises a "good" policy across behavior space, and then proceeds to branch out to different areas of behavior space, implying that it learns what makes each policy behaviorally unique. Finally, we look forward to exploring the connections this work has to other subfields, such as Meta-RL and Bayesian learning.

References

  • [1] Sumeet Batra, Bryon Tjanaka, Matthew C Fontaine, Aleksei Petrenko, Stefanos Nikolaidis, and Gaurav Sukhatme. Proximal policy gradient arborescence for quality diversity reinforcement learning. arXiv preprint arXiv:2305.13795, 2023.
  • [2] Varun Bhatt, Bryon Tjanaka, Matthew C. Fontaine, and Stefanos Nikolaidis. Deep surrogate assisted generation of environments. CoRR, abs/2206.04199, 2022.
  • [3] Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, P. Abbeel, A. Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. In Neural Information Processing Systems, 2021.
  • [4] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei. Scaling instruction-finetuned language models, 2022.
  • [5] Antoine Cully, Jeff Clune, Danesh Tarapore, and Jean-Baptiste Mouret. Robots that can adapt like animals. Nat., 521(7553):503–507, 2015.
  • [6] Prafulla Dhariwal and Alexander Quinn Nichol. Diffusion models beat gans on image synthesis. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 8780–8794, 2021.
  • [7] Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O. Stanley, and Jeff Clune. Go-explore: a new approach for hard-exploration problems. CoRR, abs/1901.10995, 2019.
  • [8] Maxence Faldor, Félix Chalumeau, Manon Flageat, and Antoine Cully. Map-elites with descriptor-conditioned gradients and archive distillation into a single policy. CoRR, abs/2303.03832, 2023.
  • [9] Matthew Fontaine and Stefanos Nikolaidis. Differentiable quality diversity. Advances in Neural Information Processing Systems, 34:10040–10052, 2021.
  • [10] C. Daniel Freeman, Erik Frey, Anton Raichuk, Sertan Girgin, Igor Mordatch, and Olivier Bachem. Brax - a differentiable physics engine for large scale rigid body simulation, 2021.
  • [11] Adam Gaier, Alexander Asteroth, and Jean-Baptiste Mouret. Discovering representations for black-box optimization. Proceedings of the 2020 Genetic and Evolutionary Computation Conference, 2020.
  • [12] David Ha, Andrew M. Dai, and Quoc V. Le. Hypernetworks. CoRR, abs/1609.09106, 2016.
  • [13] Shashank Hegde and Gaurav S Sukhatme. Efficiently learning small policies for locomotion and manipulation. arXiv preprint arXiv:2210.00140, 2022.
  • [14] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
  • [15] Marija Jegorova, Stéphane Doncieux, and Timothy M. Hospedales. Generative adversarial policy networks for behavioural repertoire. CoRR, abs/1811.02945, 2018.
  • [16] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. ArXiv, abs/2206.00364, 2022.
  • [17] Boris Knyazev, Michal Drozdzal, Graham W Taylor, and Adriana Romero. Parameter prediction for unseen deep architectures. In Advances in Neural Information Processing Systems, 2021.
  • [18] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. ArXiv, abs/2206.00927, 2022.
  • [19] Valentin Macé, Raphael Boige, Félix Chalumeau, Thomas Pierrot, Guillaume Richard, and Nicolas Perrin-Gilbert. The quality-diversity transformer: Generating behavior-conditioned trajectories with decision transformers. ArXiv, abs/2303.16207, 2023.
  • [20] Jean-Baptiste Mouret and Jeff Clune. Illuminating search spaces by mapping elites. CoRR, abs/1504.04909, 2015.
  • [21] Olle Nilsson and Antoine Cully. Policy gradient assisted map-elites. In Francisco Chicano and Krzysztof Krawiec, editors, GECCO ’21: Genetic and Evolutionary Computation Conference, Lille, France, July 10-14, 2021, pages 866–875. ACM, 2021.
  • [22] Thomas Pierrot, Valentin Macé, Félix Chalumeau, Arthur Flajolet, Geoffrey Cideron, Karim Beguir, Antoine Cully, Olivier Sigaud, and Nicolas Perrin-Gilbert. Diversity policy gradient for sample efficient quality-diversity optimization. In Jonathan E. Fieldsend and Markus Wagner, editors, GECCO ’22: Genetic and Evolutionary Computation Conference, Boston, Massachusetts, USA, July 9 - 13, 2022, pages 1075–1083. ACM, 2022.
  • [23] Nemanja Rakicevic, Antoine Cully, and Petar Kormushev. Policy manifold search: exploring the manifold hypothesis for diversity-based neuroevolution. Proceedings of the Genetic and Evolutionary Computation Conference, 2021.
  • [24] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 10674–10685. IEEE, 2022.
  • [25] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
  • [26] Yang Song, Jascha Narain Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. ArXiv, abs/2011.13456, 2020.
  • [27] Bryon Tjanaka, Matthew C. Fontaine, Julian Togelius, and Stefanos Nikolaidis. Approximating gradients for differentiable quality diversity in reinforcement learning. In Jonathan E. Fieldsend and Markus Wagner, editors, GECCO ’22: Genetic and Evolutionary Computation Conference, Boston, Massachusetts, USA, July 9 - 13, 2022, pages 1102–1111. ACM, 2022.
  • [28] Bryon Tjanaka, Matthew C Fontaine, Julian Togelius, and Stefanos Nikolaidis. Approximating gradients for differentiable quality diversity in reinforcement learning. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 1102–1111, 2022.
  • [29] Bryon Tjanaka, Matthew Christopher Fontaine, Aniruddha Kalkar, and Stefanos Nikolaidis. Scaling covariance matrix adaptation map-annealing to high-dimensional controllers. In Deep Reinforcement Learning Workshop NeurIPS 2022.
  • [30] Vassilis Vassiliades, Konstantinos I. Chatzilygeroudis, and Jean-Baptiste Mouret. Using centroidal voronoi tessellations to scale up the multidimensional archive of phenotypic elites algorithm. IEEE Trans. Evol. Comput., 22(4):623–630, 2018.
  • [31] Vassilis Vassiliades and Jean-Baptiste Mouret. Discovering the elite hypervolume by leveraging interspecies correlation. Proceedings of the Genetic and Evolutionary Computation Conference, 2018.
  • [32] Johannes von Oswald, Christian Henning, João Sacramento, and Benjamin F. Grewe. Continual learning with hypernetworks. CoRR, abs/1906.00695, 2019.
  • [33] Chris Zhang, Mengye Ren, and Raquel Urtasun. Graph hypernetworks for neural architecture search. In 7th International Conference on Learning Representations, ICLR 2019, 2019.

Appendix A Model Details and Hyperparameters

A.1 Variational Autoencoder

GHNx refers to a diffusion model whose decoder (the GHN) has a hidden layer with x neurons. The main paper shows results across tasks with a GHN16 model.

Name | Value
$z$ dimension | 64
Encoder hidden dimension | 64
Obs normalizer encoder hidden dimension | 64
KL coefficient | 1e-6
Gradient clipping | True
Learning rate | 1e-4
Training batch size | 32
GHN hidden layer size | 16
Table 3: Hyperparameters used to train the VAE

We normalize the archive dataset by subtracting the per-parameter mean and dividing by the per-parameter standard deviation for each policy in the archive. This is equivalent to calculating a "mean policy" with parameters $\theta_\mu$ and subtracting it from each policy in the archive. The task for the VAE then reduces to learning the parameter residuals. We enable archive normalization on Humanoid and Ant, where we saw the greatest improvement in performance and training stability, and leave it disabled on Halfcheetah and Walker2d, where we observed negligible improvements.
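A minimal sketch of this per-parameter normalization, assuming each policy is flattened into a single parameter vector:

```python
import numpy as np

def normalize_archive(policy_params):
    """Subtract the elementwise "mean policy" and divide by the elementwise std.
    Returns the residuals plus the statistics needed to de-normalize decoded policies."""
    params = np.stack(policy_params)          # (num_policies, num_params)
    theta_mu = params.mean(axis=0)
    theta_std = params.std(axis=0) + 1e-8
    residuals = (params - theta_mu) / theta_std
    return residuals, theta_mu, theta_std

def denormalize(residual, theta_mu, theta_std):
    """Recover full policy parameters from a decoded residual."""
    return residual * theta_std + theta_mu
```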

A.2 Latent Diffusion Model

We utilize a U-Net backbone as the architecture for the latent diffusion model. We use a single ResNet block to downsample the inputs into a condensed embedding space. A spatial transformer is used to perform cross-attention with condition embeddings in the embedding space. Finally, a single ResNet block is used to upsample the embeddings back to the dimensionality of the inputs. A single ResNet block consists of two convolutional layers plus an embedding layer for sinusoidal position embeddings. The encoder and decoder used for latent diffusion remain the same as described above.
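A sketch of the cross-attention step inside the spatial transformer is shown below; the module structure and sizes are assumptions meant to illustrate how latent tokens attend over condition embeddings (measure or text), not the exact block used here.

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """Latent feature-map tokens attend over condition embeddings via multi-head attention."""
    def __init__(self, channels, cond_dim, heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads, kdim=cond_dim,
                                          vdim=cond_dim, batch_first=True)

    def forward(self, x, cond):
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)                   # (B, H*W, C)
        attended, _ = self.attn(self.norm(tokens), cond, cond)  # cross-attention
        return (tokens + attended).transpose(1, 2).view(b, c, h, w)

# Toy usage: 4x8x8 latents attending over one condition embedding per sample.
block = CrossAttentionBlock(channels=4, cond_dim=32, heads=4)
out = block(torch.randn(2, 4, 8, 8), torch.randn(2, 1, 32))
print(out.shape)  # torch.Size([2, 4, 8, 8])
```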

Name | Value
No. of ResNet blocks in U-Net | 1
U-Net activation | SiLU
Transformer heads in middle part of U-Net | 4
Gradient clipping | True
Learning rate | 1e-4
Training batch size | 32
Table 4: Hyperparameters used to train the latent diffusion model

Appendix B Rollout Videos

The entire list of rollout videos for the results shown in the paper can be found at the project site.

Appendix C Heatmaps for all tasks

The following are the heatmaps for the original archive generated by PPGA and the archive reconstructed by the LDM. Since the measure space for the Ant environment is 4-dimensional, it cannot be visualized with our current tools. Below are the plots for Halfcheetah, Walker2d, and Humanoid. These heatmaps are obtained at the end of the 200th epoch of training.

Task | Original Archive | Reconstructed Archive
HalfCheetah | [heatmap] | [heatmap]
Walker2d | [heatmap] | [heatmap]
Humanoid | [heatmap] | [heatmap]