Panoptic Diffusion Models: co-generation of images and segmentation maps

Yinghan Long, Kaushik Roy
Abstract

Recently, diffusion models have demonstrated impressive capabilities in text-guided and image-conditioned image generation. However, existing diffusion models cannot simultaneously generate a segmentation map of objects and a corresponding image from the prompt. Previous attempts either generate segmentation maps based on the images or provide maps as input conditions to control image generation, limiting their functionality to given inputs. Incorporating an inherent understanding of the scene layouts can improve the creativity and realism of diffusion models. To address this limitation, we present Panoptic Diffusion Model (PDM), the first model designed to generate both images and panoptic segmentation maps concurrently. PDM bridges the gap between image and text by constructing segmentation layouts that provide detailed, built-in guidance throughout the generation process. This ensures the inclusion of categories mentioned in text prompts and enriches the diversity of segments within the background. We demonstrate the effectiveness of PDM across two architectures: a unified diffusion transformer and a two-stream transformer with a pretrained backbone. To facilitate co-generation with fewer sampling steps, we incorporate a fast diffusion solver into PDM. Additionally, when ground-truth maps are available, PDM can function as a text-guided image-to-image generation model. Finally, we propose a novel metric for evaluating the quality of generated maps and show that PDM achieves state-of-the-art results in image generation with implicit scene control.

1 Introduction

Diffusion models have recently outperformed other generative models, demonstrating a strong ability to generate high-quality, photorealistic images and creative videos with high fidelity (Dhariwal and Nichol 2021; Saharia et al. 2022; Ramesh et al. 2022; Rombach et al. 2022; Nichol et al. 2021; Brooks et al. 2024; Ho et al. 2022a, b; Bar-Tal et al. 2024; Singer et al. 2022). Their success has drawn significant attention to generative AI, marking it as the next frontier following the achievements of AI in classification tasks. However, text-guided image generation often lacks control over the spatial positioning and structure of objects and backgrounds within the image (Zhang, Rao, and Agrawala 2023). Current diffusion models lack an understanding of objects and shapes because the diffusion process is uniformly applied to every pixel, without regard to the segment it belongs to. As a result, they may generate objects with unrealistic shapes and miss components mentioned in the text, leading to images that are perceived as artificial, as shown in the left column of Fig. 1.

To address this issue, we propose teaching diffusion models to understand objects and scenes through segmentation maps, which provide detailed information about the image background that complements text prompts. Recent works, such as ControlNet, have demonstrated that using images with complex layouts as conditions, in addition to text prompts, can precisely control the generation process (Zhang, Rao, and Agrawala 2023). These studies show that image-guided generation can better align with users’ specific imaginings expressed through both text and image prompts. Inspired by this, we anticipate that if diffusion models generate segmentation maps alongside images to provide inherent guidance, they can utilize spatial composition information to create more realistic images.

Specifically, we train diffusion models using panoptic segmentation maps, which unify object classes and background categories, providing information about both countable objects in the foreground and background elements (Kirillov et al. 2018). With advanced segmentation models like Segment Anything (Kirillov et al. 2023) easily segmenting images, segmentation maps hold potential as alternative or complementary training data for image generation tasks.

The co-generation of images and masks is nontrivial and challenging because it represents a dual problem. Unlike previous approaches that rely on either a clean image or a segmentation map as a stable condition to generate the other, our model tackles the complex task of simultaneously denoising both an image and its corresponding map (Zhang, Rao, and Agrawala 2023; Chen et al. 2023). To address this, we designed a new paradigm to solve the dual diffusion problem. Compared to using predefined segmentation maps, co-generation preserves the diversity and flexibility of the images. By generating panoptic segmentation maps, Panoptic Diffusion Models provide intrinsic control over image generation, while the images in turn ensure that the map generation remains coherent. Since the generation of both segmentation maps and images is guided by text, the model learns the correlation between text, images, and maps. With its enhanced scene understanding capabilities, Panoptic Diffusion Models represent a significant step towards photorealistic image generation.

Refer to caption
(a) “An upside down stop sign by the road.”
Refer to caption
(b) PDM generated image of a stop sign.
Refer to caption
(c) PDM generated octagon mask for the stop sign.
Refer to caption
(d) “A fire hydrant on the side of the street. ”
Refer to caption
(e) PDM generated image of a fire hydrant.
Refer to caption
(f) PDM generated mask for a fire hydrant.
Refer to caption
(g) “A man with a wet suit on standing on a surfboard in the water.”
Refer to caption
(h) PDM generated image of a man surfing in the water.
Refer to caption
(i) PDM generated masks for person, sky, and sea
Refer to caption
(j) “Several elephants walking together in a line near water.”
Refer to caption
(k) PDM generated image of several elephants near a river.
Refer to caption
(l) PDM generated masks for elephants, river, grass, and sand.
Figure 1: Left: images generated by a regular diffusion model (U-ViT) based on the text prompt. Right: images and masks generated by a Panoptic Diffusion Model based on the same text.

We design both a one-stream Panoptic Diffusion Model and a two-stream model that incorporates a pretrained image generation stream. For training the two-stream model, we fix the image stream and efficiently fine-tune the segmentation stream. An alternative approach to map-guided image generation involves using two separate models: one to generate a segmentation map and another to generate an image based on that map. However, this method has several disadvantages compared to a unified model. First, few datasets provide paired images, descriptions, and masks, making a two-stream model with a pretrained backbone advantageous due to its ability to leverage more abundant image datasets. Second, it only allows for single-direction control from maps to images. Third, using two separate models is less efficient as they cannot run in parallel.

Another advantage of our model is that the readily available segmentation maps can benefit downstream computer vision tasks, such as autonomous driving. Additionally, the generated segmentation maps and image latents can be used as input conditions for larger diffusion models to produce higher-resolution images.

The major contributions are listed below:

1. We propose a unified diffusion model that generates both images and panoptic segmentation maps. This model inherently understands scene structures through collaborative training with multimodal data, requiring no priors and providing self-control.

2. We adapt the ODE solver for image denoising to facilitate simultaneous image and map generation. The iterative denoising of images and maps is interlinked, ensuring consistency between them.

3. We develop a two-stream diffusion model and apply efficient fine-tuning techniques. This approach leverages pretrained diffusion models and extends their capabilities by incorporating segmentation maps.

4. Our model directly provides segmentation maps for downstream tasks without the need for a separate segmentation model. These maps can scale up to four times the latent size without requiring a super-resolution model. We also introduce a new metric for evaluating the quality of the generated maps.

Refer to caption
Figure 2: Pipeline of Panoptic Diffusion Models

2 Related works

2.1 Diffusion Models for Image Generation

One of the initial works in this area, Denoising Diffusion Probabilistic Models (DDPM), use a Markov chain to gradually add scheduled noises to images in the forward process (Ho, Jain, and Abbeel 2020). The transition of the Markov chain is then parameterized by a neural network trained to predict the noise. During inference, a diffusion model starts from random noise and gradually reverses it to reconstruct the image.

A well-known drawback of diffusion models is that they require a large number of steps to generate samples iteratively. To address this issue and improve efficiency, researchers have proposed various modifications to diffusion models (Nichol and Dhariwal 2021). For instance, DDIM demonstrates that diffusion models can operate in a non-Markovian manner, resulting in shorter generative chains (Song, Meng, and Ermon 2021). Additionally, distillation algorithms have been introduced to further accelerate the multi-step inference process by progressively distilling a teacher model into a student model(Salimans and Ho 2022; Berthelot et al. 2023; Ren et al. 2024).

The backbone neural network for a diffusion model is typically a UNet, which is composed of convolutional layers and attention blocks, or a diffusion transformer that relies solely on attention mechanisms (Rombach et al. 2022; Peebles and Xie 2022). Another variant, UViT, is a type of diffusion transformer that retains skip connections, allowing later layers to access information from earlier layers, thereby enhancing alignment (Bao et al. 2023).

There are three main methods for applying conditions to a diffusion model. The first approach, used in stable diffusion, involves cross-attention between the image and the conditions (Rombach et al. 2022). The second method appends condition embeddings as tokens to the image patches (Bao et al. 2023). The third approach uses an adaptive norm layer to integrate conditions with the hidden states (Peebles and Xie 2022). In our panoptic diffusion models, we opt for the second method because the transformer can leverage self-attention to learn the relationships between images and maps, treating them as conditions for each other.

The solver for our panoptic diffusion model is a modified version of DPM-Solver++ (Lu et al. 2023). Solving the reverse diffusion process is equivalent to solving an ordinary differential equation (ODE), which can be decoupled into a linear part that is computed exactly and a non-linear part that is approximated by neural networks (Lu et al. 2022).

During inference, we apply classifier-free guidance similar to Nichol et al. (2021) and Ho and Salimans (2022). The diffusion model runs twice, once in an unconditioned setting and once in a conditioned setting, and the final output is obtained by taking a weighted sum of the unconditioned and conditioned outputs.

2.2 Panoptic Segmentation

Object detection requires generating bounding boxes and fine-grained masks, tasks traditionally accomplished by convolutional neural networks such as Fast R-CNN (Girshick 2015) and Mask R-CNN (He et al. 2017). In Carion et al. (2020), researchers introduced the use of transformers to generate binary masks by inputting object queries. Building on this, Cheng et al. (2022) proposed a collaboration between an image encoder backbone and a masked transformer to generate masks, where masked attention replaces cross attention.

Recently, there has been growing interest in applying diffusion models to panoptic segmentation masks. For example, in Chen et al. (2023), a diffusion model comprising an image encoder and a mask decoder is used to extract image features and apply cross attention between these features and the masks. To address the challenge of handling discrete data with diffusion models, Chen, Zhang, and Hinton (2022) proposed converting panoptic masks into analog bits during preprocessing. On the other hand, Baranchuk et al. (2021) suggest that the intermediate features of diffusion models can capture semantic information useful for label-efficient segmentation. Similarly, DiffuMask (Wu et al. 2024) generates a synthetic image and a corresponding segmentation mask of an object using attention maps. However, directly extracting masks from attention maps lacks the ability to control the generated image in return. In contrast, our approach aims to co-generate pixel-level panoptic segmentation maps and images, allowing them to influence and control each other.

While previous studies use diffusion models for panoptic segmentation based on given images, our work leverages an additional dataset of panoptic maps to train a model capable of generating both maps and images. The generated maps are then used to condition the image generation, producing a photorealistic result.

2.3 Image Guided Image Generation

Image-guided image generation enables more precise control over the structure of the image and ensures faithfulness to users’ illustrative inputs. The guidance input can take various forms, such as segmentation maps and layouts (Rombach et al. 2022; Zhang, Rao, and Agrawala 2023). Stochastic Differential Editing (SDEdit) perturbs user inputs with Gaussian noise and then synthesizes images by reversing the SDE (Meng et al. 2022). They show that when the reverse SDE is solved not from the end point but from a particular timestep, the generated images can achieve a good balance between faithfulness and realism. Make-A-Scene introduces scene-based conditioning for image generation by optionally providing tokens from segmentation maps (Gafni et al. 2022), but this method relies heavily on explicit strategies for handling panoptic, human, and face semantics. SpaText (Avrahami et al. 2023) employs CLIP (Radford et al. 2021) to convert local text prompts that describe segments into image space and concatenates them to the channel dimension of the noise. ControlNet accepts user inputs such as Canny edges and segmentation masks for conditional control of image generation (Zhang, Rao, and Agrawala 2023). Prompt-to-Prompt image editing controls generation through cross-attention to ensure similarity between images generated from similar prompts (Hertz et al. 2022). InstructPix2Pix combines the Prompt-to-Prompt method with Stable Diffusion to generate pairs of images from pairs of captions for training, then trains the model to modify image pixels following the instructions (Brooks, Holynski, and Efros 2023).

These approaches demonstrate that providing various forms of guidance can more accurately control the structure of generated images. Building on this insight, our method assumes that such guidance is crucial for enhancing image quality. Additionally, panoptic diffusion models inherently generate segmentation maps alongside images, offering built-in guidance without the need for additional user input beyond the text prompt.

2.4 Efficient Finetuning

To reduce the number of trained parameters or adapt the model to a new domain, previous works have designed adaptive blocks to fine-tune convolutional neural networks or transformers (Houlsby et al. 2019; Long et al. 2021; Mou et al. 2023). In our two-stream panoptic diffusion model, the map stream functions similarly to an adapter. To prevent any negative impact on the pretrained weights, we employ zero-initialized convolutional blocks as proposed in Zhang, Rao, and Agrawala (2023).

3 Panoptic Diffusion

3.1 Preprocessing and Postprocessing of Segmentation Maps

As shown in Fig. 2, we process the panoptic segmentation maps through several steps before feeding them into the diffusion model. Instead of using a binary mask for each object, we load pixel-level panoptic annotations. In a segmentation map $M_0$, each pixel's value is set to the corresponding category ID if it belongs to a segment; otherwise, its value is zero. We then convert these pixel values into analog bits (Chen, Zhang, and Hinton 2022). Analog bits are necessary because a standard diffusion model can only generate continuous data, while segmentation classes are discrete or categorical. Since the range of category IDs is from 1 to 200, each pixel is represented by 8 binary bits. Prior to noise scheduling, these bits are scaled to the range $[-1, 1]$, matching the range of the latent input to the diffusion model. To ensure that the noise can effectively flip the bits, its absolute value must exceed one. Therefore, we set the noise added to the maps as $\epsilon_M \sim \mathcal{N}(0, 2\mathbf{I})$.
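
As a concrete illustration, a minimal sketch of this conversion (assuming PyTorch; the helper name int2bits mirrors Algorithm 1, but the paper's exact implementation may differ) could look as follows:

import torch

def int2bits(category_map: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Convert a (H, W) map of integer category IDs (0-200) into analog bits in [-1, 1]."""
    bit_positions = torch.arange(num_bits, device=category_map.device)
    # Unpack each ID into num_bits binary digits: (H, W) -> (H, W, num_bits)
    bits = (category_map.long().unsqueeze(-1) >> bit_positions) & 1
    # Scale {0, 1} bits to {-1, +1} to match the range of the image latents
    return bits.float() * 2.0 - 1.0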

Latent diffusion models use latent representations of images encoded by an autoencoder as inputs. However, no autoencoder exists for encoding and decoding high-resolution segmentation maps into latents. We address this issue by pooling and using a larger patch size for the maps. To achieve high-resolution maps and enable more precise control, we first pool the maps to match one, two, or four times the height and width of the image latents. We use min pooling to prioritize smaller category numbers, as the COCO dataset annotations categorize 1-91 as thing categories and 92-200 as stuff categories. Next, we set the patch size of the maps to be one, two, or four times that of the images. This approach ensures that, after patchifying, the sizes of the image and map features align. Given that images have three RGB channels while maps have only one channel for the category ID before preprocessing, using a larger patch size is effective for extracting hidden features from segmentation maps. Consequently, this method allows us to generate higher-resolution maps without the need for an additional autoencoder or a larger latent size.
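
The pooling and patch-size bookkeeping could be arranged roughly as in the sketch below, assuming 256x256 annotation maps, a 32x32 latent grid, and an image patch size of 2 (illustrative values, not the paper's exact code):

import torch
import torch.nn.functional as F

def prepare_map(category_map: torch.Tensor, latent_hw: int = 32,
                patch_factor: int = 2, image_patch_size: int = 2):
    """Min-pool the full-resolution map to patch_factor x the latent size and
    pick a map patch size so that map tokens align with image tokens."""
    target = latent_hw * patch_factor                        # 32, 64, or 128
    m = category_map.float().view(1, 1, *category_map.shape)
    # Min pooling (negated max pooling) keeps the smaller ID, prioritising thing categories (1-91)
    kernel = category_map.shape[-1] // target
    m = -F.max_pool2d(-m, kernel_size=kernel)
    map_patch_size = image_patch_size * patch_factor
    # target / map_patch_size == latent_hw / image_patch_size, so the token grids match
    return m.view(target, target), map_patch_size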

For postprocessing, the output values predicted by the diffusion model are thresholded at zero. Negative values are treated as zero bits, while positive values are considered one bits. Subsequently, these output bits are converted back into category numbers.
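
Correspondingly, the postprocessing could be sketched as follows (again assuming PyTorch and the illustrative helper name bits2int):

import torch

def bits2int(analog_bits: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Threshold predicted analog bits at zero and recombine them into category IDs."""
    bits = (analog_bits > 0).long()                                   # negative -> 0, positive -> 1
    weights = 2 ** torch.arange(num_bits, device=analog_bits.device)  # 1, 2, 4, ..., 128
    return (bits * weights).sum(dim=-1)                               # (H, W) map of category IDs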

3.2 Forward Diffusion Process

In the forward pass of the diffusion process (Ho, Jain, and Abbeel 2020), random noise $\epsilon \sim \mathcal{N}(0, \mathbf{I})$ is added to the image latent $x_0$ according to the noise scheduler. With a total of $n$ steps, each step updates the noisy image $x_t$ from the previous step $x_{t-1}$, using scaling factors $\alpha$ and $\beta$ provided by the noise scheduler. This process forms a Markov chain. Consequently, the noisy image $x_t$ can be simplified and calculated directly from $x_0$.

$x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{\beta_t}\, \epsilon$ (1)
$x_t = \sqrt{\bar{\alpha}}\, x_0 + \sigma_t \epsilon$ (2)

where $\alpha_t$ are close to 1 and $\beta_t = 1 - \alpha_t$. The cumulative factor is $\bar{\alpha} = \prod_{i=1}^{t} \alpha_i$, and the noise is scaled by $\sigma_t = \sqrt{1 - \bar{\alpha}}$.

To learn to denoise panoptic segmentation maps, we create another random Gaussian noise $\epsilon_M \sim \mathcal{N}(0, 2\mathbf{I})$ and add it to the ground-truth maps $M_0$. The same noise scheduler is used to add noise to the maps.

$M_t = \sqrt{\bar{\alpha}}\, M_0 + \sigma_t \epsilon_M$ (3)

where $M_t$ is the noised map at timestep $t$.
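
A minimal sketch of this forward noising for one training sample, assuming PyTorch tensors and reading $\mathcal{N}(0, 2\mathbf{I})$ as variance 2 (i.e., standard deviation $\sqrt{2}$), might look like this:

import torch

def forward_noise(x0, M0_bits, alpha_bar_t):
    """Apply Eqs. 2-3: noise the image latent and the analog-bit map at timestep t."""
    alpha_bar_t = torch.as_tensor(alpha_bar_t, dtype=x0.dtype)
    sigma_t = torch.sqrt(1.0 - alpha_bar_t)
    eps = torch.randn_like(x0)                        # epsilon ~ N(0, I)
    eps_m = torch.randn_like(M0_bits) * (2.0 ** 0.5)  # epsilon_M ~ N(0, 2I), std sqrt(2) (assumed reading)
    x_t = torch.sqrt(alpha_bar_t) * x0 + sigma_t * eps
    M_t = torch.sqrt(alpha_bar_t) * M0_bits + sigma_t * eps_m
    return x_t, M_t, eps, eps_m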

3.3 Reverse Diffusion Process

The panoptic diffusion model outputs $\epsilon_\theta$, which estimates the noise $\epsilon$. Using this estimated noise, we compute the predicted image $\tilde{x}_0$. When incorporating the map as an additional input to the diffusion model, the equation for predicting the image is given by Eq. 4. To accelerate inference, we utilize a fast DPM solver to compute $x_{t_{i-1}}$ from $x_{t_i}$ (Lu et al. 2022, 2023). By using discontinuous time steps $t_i$ and $t_{i-1}$, this method can skip intermediate steps, reducing the total number of sampling steps required. The first-order solver is described in Equation 5, where $h_i$ represents the difference in the log signal-to-noise ratio between steps, $h_i = \log(\alpha_{t_i}/\sigma_{t_i}) - \log(\alpha_{t_{i-1}}/\sigma_{t_{i-1}})$. Details on a third-order solver can be found in Appendix A.

$\tilde{x}_0(x_{t_i}, M_{t_i}, C, t_i) = \dfrac{x_{t_i} - \sigma_{t_i}\, \epsilon_\theta(x_{t_i}, M_{t_i}, C, t_i)}{\sqrt{\bar{\alpha}}}$ (4)
$x_{t_{i-1}} = \dfrac{\sigma_{t_{i-1}}}{\sigma_{t_i}}\, x_{t_i} - \alpha_{t_i}(e^{-h_i} - 1)\, \tilde{x}_0(x_{t_i}, M_{t_i}, C, t_i)$ (5)

The other output of a panoptic diffusion model is $M_\theta$, which is a prediction of $M_0$. Drawing inspiration from DPM-Solver++, we use the following equation to estimate $M_{t_{i-1}}$ from the previous step. It is important to note that the model directly estimates $M_0$ rather than the noise added to the segmentation map, as predicting $\epsilon_M$ does not provide effective guidance for the images. By training the diffusion model with panoptic segmentation maps, it incorporates intrinsic self-control into the image generation process.

$M_{t_{i-1}} = \dfrac{\sigma_{t_{i-1}}}{\sigma_{t_i}}\, M_{t_i} - \alpha_{t_i}(e^{-h_i} - 1)\, M_\theta(x_{t_i}, M_{t_i}, C, t_i)$
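
The following sketch shows how one such first-order update could be applied jointly to the image and the map; it mirrors the dpmFirstSolver pseudocode in Appendix A, and the variable names are illustrative:

import math

def first_order_step(x_ti, M_ti, x0_pred, M0_pred, sigma_prev, sigma_cur, alpha_cur, h_i):
    """One first-order DPM-Solver++-style update for the image (Eq. 5) and the map (its analogue)."""
    ratio = sigma_prev / sigma_cur              # sigma_{t_{i-1}} / sigma_{t_i}
    coeff = alpha_cur * (math.exp(-h_i) - 1.0)  # alpha_{t_i} * (e^{-h_i} - 1)
    x_prev = ratio * x_ti - coeff * x0_pred     # x_{t_{i-1}}
    M_prev = ratio * M_ti - coeff * M0_pred     # M_{t_{i-1}}
    return x_prev, M_prev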

In a special case where ground truth maps are provided as conditions, the diffusion model will focus solely on predicting the images. This allows users to have customized control for generating desired images, similar to existing methods (Zhang, Rao, and Agrawala 2023). However, this approach limits the diversity of the generated images.

Since the generation of $x_{t-1}$ and $M_{t-1}$ relies on $x_t$ and $M_t$, they form a dual problem. Improvements in the quality of the generated masks and images influence each other. Consequently, according to the scaling law, a larger diffusion model can produce more accurate masks, which in turn provides better control and further enhances image quality.

3.4 Dual training and generation

Let the inputs to a panoptic diffusion model at each timestep be the image latent $x_t$, the mask $M_t$, the text condition $C$ encoded by a text encoder, and the timestep $t$. The conditional probability of $x_{t-1}$ and $M_0$ is given by

$P(x_{t-1}, M_0 \mid x_t, M_t, c) = P(x_{t-1} \mid x_t, M_t, M_0, c) \cdot P(M_0 \mid x_t, M_t, c)$ (6)

Equation 6 shows that it is feasible to predict the segmentation map $M_0$ first, then use it as a condition to predict $x_{t-1}$. However, when using a unified model to predict both $x_{t-1}$ and $M_0$, the intermediate features already contain the segmentation information used to predict $M_0$. Through self-attention, the map features can inherently condition $x_{t-1}$. Therefore, it is reasonable to predict $x_{t-1}$ and $M_0$ simultaneously. By taking the logarithm of the probability, we can optimize the model by combining the losses associated with image denoising and segmentation map generation.

$\log P(x_{t-1}, M_0 \mid x_t, M_t, c) = \log P(x_{t-1} \mid x_t, M_t, M_0, c) + \log P(M_0 \mid x_t, M_t, c)$ (7)

The training algorithm is outlined in Algorithm 1. We use Mean Squared Error (MSE) loss to optimize the predicted noises for both image and segmentation map denoising. Specifically, the target for image denoising is the noise $\epsilon$, while the target for mask generation is the ground-truth $M_0$. The losses for images and maps are summed to perform gradient backpropagation. During inference, the diffusion model iteratively denoises both images and maps, as detailed in Algorithm 2.

Classifier-free Map Guidance

Classifier-free diffusion guidance was introduced to balance sample quality and diversity without relying on a classifier (Ho and Salimans 2022). This approach involves alternating between an unconditional and a conditional diffusion model during training, and using a weighted sum of the results from both models during inference. For panoptic diffusion models, we only remove the text conditions while keeping the map conditions active. Specifically, we set the context condition to empty text with a probability of 0.1 during training ($C = \varnothing$). When the context is empty, the diffusion model is guided solely by the bidirectional control between images and segmentation maps. This setup allows the map generator to provide classifier-free guidance and enhance diversity. Let $\theta_1$ represent the output with regular conditioning and $\theta_2$ represent the output with empty text. During inference, these outputs are weighted by $\gamma$, which is set to 1.0 by default.

$\epsilon_\theta = \epsilon_{\theta_1} + \gamma(\epsilon_{\theta_1} - \epsilon_{\theta_2})$ (8)
$M_\theta = M_{\theta_1} + \gamma(M_{\theta_1} - M_{\theta_2})$ (9)
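
In code, this guidance step could be sketched as follows; the model and empty_text names are illustrative placeholders, and the map input is passed in both calls so that map conditioning is never dropped:

def classifier_free_guidance(model, x_t, M_t, text_cond, empty_text, t, gamma=1.0):
    """Run the model with and without text and combine the outputs (Eqs. 8-9)."""
    eps_cond, M_cond = model(x_t, M_t, text_cond, t)    # theta_1: text + map conditioning
    eps_free, M_free = model(x_t, M_t, empty_text, t)   # theta_2: empty text, map conditioning kept
    eps_out = eps_cond + gamma * (eps_cond - eps_free)  # Eq. 8
    M_out = M_cond + gamma * (M_cond - M_free)          # Eq. 9
    return eps_out, M_out
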
Input: Ground-truth masks $M_0$; images $x_0$; text condition $C$; total number of steps $T$
Output: Predicted noise $\epsilon_\theta$; predicted mask $M_\theta$
$\epsilon$ = normal(mean=0, std=1)
$\epsilon_m$ = normal(mean=0, std=2)
$M_0$ = int2bits($M_0$)
$t$ = randint(1, $T$)
$x_t$ = scheduler($x_0$, $\epsilon$, $t$)
$M_t$ = scheduler($M_0$, $\epsilon_m$, $t$)
$\epsilon_\theta$, $M_\theta$ = DiffusionModel($x_t$, $M_t$, $C$, $t$)
$loss_x$ = MSE($\epsilon$, $\epsilon_\theta$)
$loss_m$ = MSE($M_0$, $M_\theta$)
$loss = loss_x + loss_m$
Algorithm 1: Training of the Panoptic Diffusion Model
Input: Text $C$; total number of steps $T$
Output: Generated image $x_0$; generated mask $M_0$
$x_t$ = normal(mean=0, std=1)
$M_t$ = normal(mean=0, std=1)
Sample a set of timesteps from $T$ down to 0
for $t$ in the sampled timesteps do
       # Run the diffusion model
       $\epsilon_\theta$, $M_\theta$ = DiffusionModel($x_t$, $M_t$, $C$, $t$)
       # Update predicted images and masks
       $x_0 = \dfrac{x_t - \sigma_t \epsilon_\theta}{\sqrt{\bar{\alpha}}}$
       $x_t$, $M_t$ = dpmSolver($x_0$, $M_\theta$, $x_t$, $M_t$, $t$)
end for
Algorithm 2: Inference of the Panoptic Diffusion Model using the DPM solver

3.5 Architecture of Panoptic Diffusion Models

One-stream Panoptic Diffusion Models

We first modify a U-ViT into a panoptic diffusion model (Bao et al. 2023). We start by patchifying the map input $M_t$ using a convolutional layer and adding positional embeddings. These map embeddings are then concatenated with the image, text, and time embeddings and processed through attention blocks. Since U-ViT treats all inputs as tokens and applies self-attention among them, the segmentation maps can be treated as tokens in the same manner. At the end of the transformer, we separate the features related to images and segmentation maps, using distinct convolutional layers to unpatchify and predict the outputs.
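
A minimal sketch of this token construction is given below; positional embeddings and the U-ViT blocks are omitted, and the dimensions and token ordering are assumptions rather than the paper's exact code:

import torch
import torch.nn as nn

class PanopticTokens(nn.Module):
    """Patchify the noisy map and concatenate its tokens with time, text, and image tokens."""
    def __init__(self, dim=768, img_patch=2, map_patch=4, latent_ch=4, map_bits=8):
        super().__init__()
        self.img_proj = nn.Conv2d(latent_ch, dim, kernel_size=img_patch, stride=img_patch)
        self.map_proj = nn.Conv2d(map_bits, dim, kernel_size=map_patch, stride=map_patch)

    def forward(self, x_t, M_t, text_tokens, time_token):
        img_tok = self.img_proj(x_t).flatten(2).transpose(1, 2)  # (B, N, dim)
        map_tok = self.map_proj(M_t).flatten(2).transpose(1, 2)  # (B, N, dim), same N as the image
        # Self-attention over the joint sequence lets maps and images condition each other
        return torch.cat([time_token, text_tokens, img_tok, map_tok], dim=1)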

In the special case where ground-truth maps are provided, only the image loss is used for optimization. To ensure that map features are included in the gradient backpropagation, they are added to the image features before the final output convolutional layer.

Two-stream Panoptic Diffusion Models

To leverage a pretrained model as the backbone, we design a two-stream diffusion model consisting of a pretrained image stream and a segmentation map stream, as illustrated in Fig. 3. During fine-tuning, the transformer layers of the image stream are kept frozen while the map stream is adjusted. The map stream processes image features and conditions from the previous block, then concatenates them with map features. Through self-attention, the map features and image features become interrelated within the map stream. The auxiliary image feature output from the map stream is added back to the image stream via a zero-convolution layer. This setup ensures specific control over the image stream and allows gradients to be backpropagated from the loss of image generation. The zero-convolution layer has zero initial weights and no bias (Zhang, Rao, and Agrawala 2023). Unlike ControlNet, which uses only the encoder part of the map stream to generate control signals, our model employs encoder-decoder U-shaped transformers in both streams to co-generate images and segmentation maps.
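
A sketch of the zero-convolution connection is shown below; names such as map_block are illustrative placeholders, not the paper's code:

import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    """1x1 convolution with zero-initialised weights and no bias, so the map stream
    initially adds nothing to the frozen image stream but can learn to contribute."""
    conv = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
    nn.init.zeros_(conv.weight)
    return conv

# Illustrative wiring inside one two-stream block (assumed names):
#   img_hidden = frozen_image_block(img_hidden)                    # pretrained, weights frozen
#   map_hidden, aux_img = map_block(map_hidden, img_hidden, cond)  # fine-tuned map stream
#   img_hidden = img_hidden + zero_conv_layer(aux_img)             # gradients still reach the map stream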

Refer to caption
Figure 3: Two-stream panoptic diffusion model. A pretrained image stream is on the left and a fine-tuned segmentation map stream is on the right.

3.6 Evaluation metric for generated maps

We propose a new metric to evaluate the quality of generated segmentation maps by measuring the difference in the number of pixels labeled as each category. While Panoptic Quality (Kirillov et al. 2018) uses Intersection over Union (IoU) to assess segmentation maps based on the weighted sum of true positives, false positives, and false negatives, this approach is not suitable for maps generated by diffusion models. These models produce maps probabilistically based on text prompts and co-generated images, making it impractical to compute IoU with ground-truth maps due to inherent differences in the generated images. Instead, we introduce the Mean Count Difference (MCD) metric. MCD evaluates the quality of generated maps by counting the frequency $f$ of each category in both the ground-truth and generated maps, then summing their absolute differences. This sum is divided by the total number of pixels, calculated as the product of the height and width. Given that object locations on the generated map are not fixed, comparing category frequencies rather than direct pixel values provides a more meaningful assessment. The metric ranges over $[0, 2]$, where zero indicates identical segmentation maps and larger values indicate greater differences.

$f = \mathrm{bincount}(M_0); \quad f' = \mathrm{bincount}(M_\theta)$
$\mathrm{MCD} = \dfrac{\sum |f - f'|}{H \cdot W}$
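
A reference implementation of MCD could be as simple as the following sketch, assuming NumPy arrays of category IDs with the same height and width:

import numpy as np

def mean_count_difference(gt_map: np.ndarray, gen_map: np.ndarray, num_classes: int = 201) -> float:
    """Mean Count Difference between two category-ID maps of equal size."""
    f = np.bincount(gt_map.ravel(), minlength=num_classes)          # per-category pixel counts, ground truth
    f_prime = np.bincount(gen_map.ravel(), minlength=num_classes)   # per-category counts, generated map
    h, w = gt_map.shape
    return float(np.abs(f - f_prime).sum() / (h * w))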

4 Experiments

We train our model using the COCO2017 dataset (Lin et al. 2015), which includes both panoptic segmentation maps and image captions. The COCO2017 dataset comprises 118k training samples and 5k validation samples. Images are projected into latent space using a VAE model provided by Stable Diffusion (Rombach et al. 2022; Gu et al. 2021), while text conditions are encoded using the CLIP encoder from OpenAI (clip-vit-large-patch14) (Radford et al. 2021). We implement both one-stream and two-stream panoptic diffusion models (PDM) based on U-ViT (Bao et al. 2023). In contrast to commercial models with billions of parameters, our models are significantly smaller. The one-stream PDM has 45 million parameters, while the two-stream PDM has 95 million parameters. The image latent size is $32 \times 32 \times 4$, with a height and width of 32 and a latent channel count of 4. The segmentation map's height and width can be 32, 64, or 128, depending on the patch factor, and it has 8 channels, representing 8 analog bits after conversion. The diffusion model's output image latents are decoded by a VAE decoder to produce $256 \times 256$ images.
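
The encoding setup described above could be wired up roughly as follows, assuming the Hugging Face diffusers and transformers libraries; the VAE checkpoint name and the scaling constant are our assumptions, not necessarily the exact ones used in the paper:

import torch
from diffusers import AutoencoderKL
from transformers import CLIPTextModel, CLIPTokenizer

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")           # assumed SD-compatible VAE
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def encode(image, caption):
    """Encode a 256x256 image into a 32x32x4 latent and a caption into CLIP context tokens."""
    latents = vae.encode(image).latent_dist.sample() * 0.18215             # Stable Diffusion latent scaling
    tokens = tokenizer(caption, padding="max_length", truncation=True, return_tensors="pt")
    context = text_encoder(tokens.input_ids).last_hidden_state             # (B, 77, 768)
    return latents, context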

Method                         FID($\downarrow$)  CLIP($\uparrow$)  Patch  MCD
GLIDE (2021)                   12.24              ~28               -      -
Imagen (2022)                  7.27               ~27               -      -
U-ViT (2023)                   8.29               27.37             -      -
One-stream PDM                 18.52              26.32             2      1.638
Two-stream PDM                 11.29              27.08             1      1.522
Two-stream PDM                 10.99              27.53             2      1.592
Two-stream PDM                 30.91              25.87             4      1.638
One-stream PDM given maps      8.21               28.40             1      -
Two-stream PDM given maps      11.61              28.19             2      -
Table 1: Quantitative evaluation results on the COCO dataset.

4.1 Quantitative Evaluation

We evaluate the quality of generated images using FID (Heusel et al. 2017) and CLIP scores (Hessel et al. 2022). FID assesses the quality and fidelity of the generated images by employing an Inception model, while CLIP scores gauge how well the generated images correspond to the text prompts. For CLIP scores, we use the ViT-B/32 model (Radford et al. 2021). We generate 30,000 images and segmentation maps from 5,000 text files in the COCO dataset’s validation set, with each text file containing five captions describing the same scene. We compute the average CLIP scores by comparing these five captions with the generated images.

In Table 1, we compare the FID and CLIP scores of our models with those of state-of-the-art methods. The results indicate that while our panoptic diffusion models (PDMs) are trained with a combined loss of images and segmentation maps, they achieve comparable fidelity (FID scores) and improved relevance between image and text (higher CLIP scores). This improvement is due to the enhanced connectivity between the image, text, and segmentation map. The two-stream PDM performs better due to its pretrained stream and larger number of parameters. When ground-truth maps are provided, the model performs optimally because it focuses solely on optimizing image generation.

Increasing the patch factor results in a higher MCD because generating higher-resolution maps with a fixed number of latents becomes more challenging. This creates a trade-off between map resolution and quality. We find that a patch factor of 2 offers the best balance, yielding the best FID and CLIP scores. However, increasing the patch factor to 4 results in worse performance, suggesting that unbalanced patch sizes for maps and images are detrimental.

4.2 Qualitative Evaluation

In Fig. 1, we compare the images and masks generated by PDM with images generated by U-ViT. By training with segmentation masks, PDM learns that the shape of a stop sign should be an octagon, whereas U-ViT cannot guarantee an octagonal stop sign. Similarly, PDM reliably generates correct shapes for a fire hydrant and a human. In the last row of Fig. 1, PDM generates masks not only for the elephants but also for the river, while a regular diffusion model misses this required component of the text prompt.
Figure 4 displays images generated with either ground-truth segmentation maps or co-generated maps. The generated maps in the bottom left show objects of the same categories and similar shapes as the ground-truth maps. The images on the right are conditioned on these segmentation maps, demonstrating the PDM’s ability to simultaneously generate correlated images and maps. While images generated with ground-truth maps exhibit slightly better quality, co-generation removes the need for a segmentation input and produces diverse maps and images. Note that the pixel values in the generated segmentation maps correspond to category IDs (1-200), which are mapped to random RGB colors for visualization. The color map used is detailed in Appendix C.

Additional examples generated by PDMs are provided in Appendix B. Zero-shot results on the CIFAR-10 dataset demonstrate that our model can generate segmentation maps for various categories across different image datasets.

Refer to caption
(a) Ground-truth segmentation maps
Refer to caption
(b) Images generated based on ground-truth maps
Refer to caption
(c) Generated maps
Refer to caption
(d) Images co-generated with maps
Figure 4: Image-map co-generation. Prompts are: 1) a small copper vase with some flowers in it; 2) A giraffe examining the back of another giraffe; 3) A utility truck is parked in the street beside traffic cones; 4) A white yellow and blue train at an empty train station.

4.3 Ablation study

Refer to caption
(a) 32x32 maps if $\epsilon_M$ is $\mathcal{N}(0, 2\mathbf{I})$
Refer to caption
(b) 32x32 maps if $\epsilon_M$ is $\mathcal{N}(0, \mathbf{I})$
Refer to caption
(c) 64x64 maps
Refer to caption
(d) 128x128 maps
Figure 5: Generated maps of different resolutions. Prompts are: 1) Three people are playing with a red kick ball; 2) A woman walking next to a man riding a pink bike; 3) An old man is flying his kite in the middle of no where; 4) A large lizard sitting on stone steps with three birds; 5) A girl is playing a game system while other kids look on; 6) A living room that has some couches and tables in it.

Effect of the patch factor

We evaluate the impact of different patch sizes on map resolution, as illustrated in Figure 5. When the patch size for segmentation maps is set to four times that of the images, the resulting maps have a resolution of 128x128. However, these larger maps may include hallucinated details that could misguide image generation. This issue arises due to the disparity in patch sizes and the model’s limited hidden dimension of 768, which complicates accurate prediction for a 128x128 map.

Replacing noisy map inputs with zero

To assess whether PDMs learn to denoise the segmentation map or merely extract it from the image latent, we replace the noisy map inputs $M_t$ with zero inputs during training. The results reveal that while a two-stream model can still generate images (FID = 18.94), it cannot generate readable maps. This indicates that a panoptic diffusion model does not solely depend on image features for map generation, unlike the approach in DiffuMask (Wu et al. 2024). Hence, the noisy map inputs $M_t$ are crucial for predicting $M_0$.

Noise scale for segmentation maps

As previously mentioned, the noise added to segmentation maps must have a magnitude greater than one to effectively flip the analog bits. If the noise variance is smaller than one, it fails to convert the training signal to noise at any timestep, resulting in the model's inability to denoise maps adequately. Figure 5(b) demonstrates that maps are not properly denoised when $\epsilon_M \sim \mathcal{N}(0, \mathbf{I})$.

5 Conclusion

In conclusion, we introduce the Panoptic Diffusion Model (PDM), a pioneering approach that simultaneously generates images and panoptic segmentation maps from a given prompt. Unlike previous diffusion models that either depend on pre-existing segmentation maps or generate them based on images, PDM inherently understands and constructs scene layouts during the generation process. This innovation enables PDM to produce more creative and realistic images by leveraging segmentation layouts as intrinsic guidance. This research lays the groundwork for future advancements in diffusion models, offering a robust framework for co-generation of images and segmentation maps.

References

  • Avrahami et al. (2023) Avrahami, O.; Hayes, T.; Gafni, O.; Gupta, S.; Taigman, Y.; Parikh, D.; Lischinski, D.; Fried, O.; and Yin, X. 2023. SpaText: Spatio-Textual Representation for Controllable Image Generation. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE.
  • Bao et al. (2023) Bao, F.; Nie, S.; Xue, K.; Cao, Y.; Li, C.; Su, H.; and Zhu, J. 2023. All are Worth Words: A ViT Backbone for Diffusion Models. In CVPR.
  • Bar-Tal et al. (2024) Bar-Tal, O.; Chefer, H.; Tov, O.; Herrmann, C.; Paiss, R.; Zada, S.; Ephrat, A.; Hur, J.; Liu, G.; Raj, A.; Li, Y.; Rubinstein, M.; Michaeli, T.; Wang, O.; Sun, D.; Dekel, T.; and Mosseri, I. 2024. Lumiere: A Space-Time Diffusion Model for Video Generation. arXiv:2401.12945.
  • Baranchuk et al. (2021) Baranchuk, D.; Rubachev, I.; Voynov, A.; Khrulkov, V.; and Babenko, A. 2021. Label-Efficient Semantic Segmentation with Diffusion Models. arXiv:2112.03126.
  • Berthelot et al. (2023) Berthelot, D.; Autef, A.; Lin, J.; Yap, D. A.; Zhai, S.; Hu, S.; Zheng, D.; Talbott, W.; and Gu, E. 2023. TRACT: Denoising Diffusion Models with Transitive Closure Time-Distillation. arXiv:2303.04248.
  • Brooks, Holynski, and Efros (2023) Brooks, T.; Holynski, A.; and Efros, A. A. 2023. InstructPix2Pix: Learning to Follow Image Editing Instructions. arXiv:2211.09800.
  • Brooks et al. (2024) Brooks, T.; Peebles, B.; Holmes, C.; DePue, W.; Guo, Y.; Jing, L.; Schnurr, D.; Taylor, J.; Luhman, T.; Luhman, E.; Ng, C.; Wang, R.; and Ramesh, A. 2024. Video generation models as world simulators.
  • Carion et al. (2020) Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; and Zagoruyko, S. 2020. End-to-End Object Detection with Transformers. CoRR, abs/2005.12872.
  • Chen et al. (2023) Chen, T.; Li, L.; Saxena, S.; Hinton, G.; and Fleet, D. 2023. A Generalist Framework for Panoptic Segmentation of Images and Videos. 909–919.
  • Chen, Zhang, and Hinton (2022) Chen, T.; Zhang, R.; and Hinton, G. 2022. Analog bits: Generating discrete data using diffusion models with self-conditioning. arXiv preprint arXiv:2208.04202.
  • Cheng et al. (2022) Cheng, B.; Misra, I.; Schwing, A. G.; Kirillov, A.; and Girdhar, R. 2022. Masked-attention Mask Transformer for Universal Image Segmentation.
  • Dhariwal and Nichol (2021) Dhariwal, P.; and Nichol, A. 2021. Diffusion Models Beat GANs on Image Synthesis. CoRR, abs/2105.05233.
  • Gafni et al. (2022) Gafni, O.; Polyak, A.; Ashual, O.; Sheynin, S.; Parikh, D.; and Taigman, Y. 2022. Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors. arXiv:2203.13131.
  • Girshick (2015) Girshick, R. B. 2015. Fast R-CNN. CoRR, abs/1504.08083.
  • Gu et al. (2021) Gu, S.; Chen, D.; Bao, J.; Wen, F.; Zhang, B.; Chen, D.; Yuan, L.; and Guo, B. 2021. Vector Quantized Diffusion Model for Text-to-Image Synthesis. CoRR, abs/2111.14822.
  • He et al. (2017) He, K.; Gkioxari, G.; Dollár, P.; and Girshick, R. B. 2017. Mask R-CNN. CoRR, abs/1703.06870.
  • Hertz et al. (2022) Hertz, A.; Mokady, R.; Tenenbaum, J.; Aberman, K.; Pritch, Y.; and Cohen-Or, D. 2022. Prompt-to-Prompt Image Editing with Cross Attention Control. arXiv:2208.01626.
  • Hessel et al. (2022) Hessel, J.; Holtzman, A.; Forbes, M.; Bras, R. L.; and Choi, Y. 2022. CLIPScore: A Reference-free Evaluation Metric for Image Captioning. arXiv:2104.08718.
  • Heusel et al. (2017) Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; and Hochreiter, S. 2017. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In Guyon, I.; Luxburg, U. V.; Bengio, S.; Wallach, H.; Fergus, R.; Vishwanathan, S.; and Garnett, R., eds., Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
  • Ho et al. (2022a) Ho, J.; Chan, W.; Saharia, C.; Whang, J.; Gao, R.; Gritsenko, A.; Kingma, D. P.; Poole, B.; Norouzi, M.; Fleet, D. J.; et al. 2022a. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303.
  • Ho, Jain, and Abbeel (2020) Ho, J.; Jain, A.; and Abbeel, P. 2020. Denoising Diffusion Probabilistic Models. NIPS.
  • Ho and Salimans (2022) Ho, J.; and Salimans, T. 2022. Classifier-Free Diffusion Guidance. arXiv:2207.12598.
  • Ho et al. (2022b) Ho, J.; Salimans, T.; Gritsenko, A.; Chan, W.; Norouzi, M.; and Fleet, D. J. 2022b. Video Diffusion Models. arXiv:2204.03458.
  • Houlsby et al. (2019) Houlsby, N.; Giurgiu, A.; Jastrzebski, S.; Morrone, B.; de Laroussilhe, Q.; Gesmundo, A.; Attariyan, M.; and Gelly, S. 2019. Parameter-Efficient Transfer Learning for NLP. CoRR, abs/1902.00751.
  • Kirillov et al. (2018) Kirillov, A.; He, K.; Girshick, R. B.; Rother, C.; and Dollár, P. 2018. Panoptic Segmentation. CoRR, abs/1801.00868.
  • Kirillov et al. (2023) Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A. C.; Lo, W.-Y.; Dollár, P.; and Girshick, R. 2023. Segment Anything. arXiv:2304.02643.
  • Lin et al. (2015) Lin, T.-Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C. L.; and Dollár, P. 2015. Microsoft COCO: Common Objects in Context. arXiv:1405.0312.
  • Long et al. (2021) Long, Y.; Chakraborty, I.; Srinivasan, G.; and Roy, K. 2021. Complexity-aware Adaptive Training and Inference for Edge-Cloud Distributed AI Systems. In 2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS), 573–583.
  • Lu et al. (2022) Lu, C.; Zhou, Y.; Bao, F.; Chen, J.; Li, C.; and Zhu, J. 2022. DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps. NeurIPS.
  • Lu et al. (2023) Lu, C.; Zhou, Y.; Bao, F.; Chen, J.; Li, C.; and Zhu, J. 2023. DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models. arXiv:2211.01095.
  • Meng et al. (2022) Meng, C.; He, Y.; Song, Y.; Song, J.; Wu, J.; Zhu, J.-Y.; and Ermon, S. 2022. SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations. arXiv:2108.01073.
  • Mou et al. (2023) Mou, C.; Wang, X.; Xie, L.; Wu, Y.; Zhang, J.; Qi, Z.; Shan, Y.; and Qie, X. 2023. T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models. arXiv:2302.08453.
  • Nichol and Dhariwal (2021) Nichol, A.; and Dhariwal, P. 2021. Improved Denoising Diffusion Probabilistic Models. CoRR, abs/2102.09672.
  • Nichol et al. (2021) Nichol, A.; Dhariwal, P.; Ramesh, A.; Shyam, P.; Mishkin, P.; McGrew, B.; Sutskever, I.; and Chen, M. 2021. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. CoRR, abs/2112.10741.
  • Peebles and Xie (2022) Peebles, W.; and Xie, S. 2022. Scalable Diffusion Models with Transformers. arXiv preprint arXiv:2212.09748.
  • Radford et al. (2021) Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; Krueger, G.; and Sutskever, I. 2021. Learning Transferable Visual Models From Natural Language Supervision. arXiv:2103.00020.
  • Ramesh et al. (2022) Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; and Chen, M. 2022. Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv:2204.06125.
  • Ren et al. (2024) Ren, Y.; Xia, X.; Lu, Y.; Zhang, J.; Wu, J.; Xie, P.; Wang, X.; and Xiao, X. 2024. Hyper-SD: Trajectory Segmented Consistency Model for Efficient Image Synthesis. arXiv:2404.13686.
  • Rombach et al. (2022) Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; and Ommer, B. 2022. High-Resolution Image Synthesis with Latent Diffusion Models. CVPR, abs/2112.10752.
  • Saharia et al. (2022) Saharia, C.; Chan, W.; Saxena, S.; Li, L.; Whang, J.; Denton, E.; Ghasemipour, S. K. S.; Ayan, B. K.; Mahdavi, S. S.; Lopes, R. G.; Salimans, T.; Ho, J.; Fleet, D. J.; and Norouzi, M. 2022. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. arXiv:2205.11487.
  • Salimans and Ho (2022) Salimans, T.; and Ho, J. 2022. Progressive Distillation for Fast Sampling of Diffusion Models. ICLR, abs/2202.00512.
  • Singer et al. (2022) Singer, U.; Polyak, A.; Hayes, T.; Yin, X.; An, J.; Zhang, S.; Hu, Q.; Yang, H.; Ashual, O.; Gafni, O.; Parikh, D.; Gupta, S.; and Taigman, Y. 2022. Make-A-Video: Text-to-Video Generation without Text-Video Data. arXiv:2209.14792.
  • Song, Meng, and Ermon (2021) Song, J.; Meng, C.; and Ermon, S. 2021. Denoising Diffusion Implicit Models. ICLR, abs/2010.02502.
  • Wu et al. (2024) Wu, W.; Zhao, Y.; Shou, M. Z.; Zhou, H.; and Shen, C. 2024. DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic Segmentation Using Diffusion Models. arXiv:2303.11681.
  • Zhang, Rao, and Agrawala (2023) Zhang, L.; Rao, A.; and Agrawala, M. 2023. Adding Conditional Control to Text-to-Image Diffusion Models.

Appendix A Fast DPM solver for segmentation maps

We modify the first-order and third-order DPM-Solver++ to solve for the image and map of the previous step given $x_t$, $M_t$ and the predicted $x_0$, $M_0$ (Lu et al. 2023). The pseudocode for the solvers is listed below. For the details of the algorithm and the definitions of the parameters $\sigma, \alpha, \phi, s$, please refer to DPM-Solver++.

def dpmFirstSolver(self, x_0, m_0, x_t, m_t):
    # Update x[t-1] and M[t-1] based on x[t], M[t] and the predicted x_0, m_0
    x_t = (sigma_t / sigma_s) * x_t + (alpha_t * phi_1) * x_0
    m_t = (sigma_t / sigma_s) * m_t + (alpha_t * phi_1) * m_0
    return x_t, m_t

def dpmThirdSolver(self, x_t, m_t, C, t):
    # First step
    x_0, m_0 = diffusionModel(x_t, m_t, C, s)
    x_s1 = (sigma_s1 / sigma_s) * x_t + (alpha_s1 * phi_11) * x_0
    m_s1 = (sigma_s1 / sigma_s) * m_t + (alpha_s1 * phi_11) * m_0
    # Second step
    x_02, m_02 = diffusionModel(x_s1, m_s1, C, s1)
    x_s2 = (sigma_s2 / sigma_s) * x_t + (alpha_s2 * phi_12) * x_0 + \
           r2 / r1 * (alpha_s2 * phi_22) * (x_02 - x_0)
    m_s2 = (sigma_s2 / sigma_s) * m_t + (alpha_s2 * phi_12) * m_0 + \
           r2 / r1 * (alpha_s2 * phi_22) * (m_02 - m_0)
    # Third step
    x_03, m_03 = diffusionModel(x_s2, m_s2, C, s2)
    x_t = (sigma_t / sigma_s) * x_t + (alpha_t * phi_1) * x_0 + \
          (1. / r2) * (alpha_t * phi_2) * (x_03 - x_0)
    m_t = (sigma_t / sigma_s) * m_t + (alpha_t * phi_1) * m_0 + \
          (1. / r2) * (alpha_t * phi_2) * (m_03 - m_0)
    return x_t, m_t

Appendix B More examples of generated images and maps

B.1 Comparison between using ground-truth segmentation map and using co-generated maps

Refer to caption
(a) Ground-truth segmentation maps
Refer to caption
(b) Generated images based on ground-truth maps
Refer to caption
(c) Generated segmentation maps
Refer to caption
(d) Co-generated images
Refer to caption
(e) Images generated by U-ViT (baseline)
Figure 6: Generation with given segmentation maps vs. co-generation of images and segmentation maps

Fig. 6 shows more examples of generated images and segmentation maps. The prompts are randomly chosen from COCO2017 validation dataset, as listed below.
0 A woman stands in the dining area at the table.
1 A big burly grizzly bear is show with grass in the background.
2 Bedroom scene with a bookcase, blue comforter and window.
3 A stop sign is mounted upside-down on it’s post.
4 Three teddy bears, each a different color, snuggling together.
5 A woman posing for the camera standing on skis.
6 A kitchen with a refrigerator, stove and oven with cabinets.
7 A couple of baseball player standing on a field.
8 a male tennis player in white shorts is playing tennis
9 The people are posing for a group photo.
10 A beautiful woman taking a picture with her smart phone.
11 A woman holding a Hello Kitty phone on her hands.
12 some children are riding on a mini orange train
13 A meal is lying on a plate on a table.
14 A man in a wet suit stands on a surfboard and rows with a paddle.
15 A computer on a desk next to a laptop.
16 A street scene with focus on the street signs on an overpass.
17 The red, double decker bus is driving past other buses.
18 A cat resting on an open laptop computer.
19 Two planes flying in the sky over a bridge.
20 A zebra in the grass who is cleaning himself.
21 A bedroom with a bed and small table near by.
22 a big purple bus parked in a parking spot
23 A large white bowl of many green apples.
24 Batter preparing to swing at pitch during major game.
25 A plate of finger foods next to a blue and raspberry topped cake.
26 A man on a blue raft attempting to catch a ride on a large wave.
27 Many small children are posing together in the black and white photo.
28 A plate on a wooden table full of bread.
29 A man flying through the air while riding skis.
30 A person standing on top of a ski covered slope.
31 a close up of a banana and a doughnut in a plastic bag

Refer to caption
(a) Generated Segmentation Map
Refer to caption
(b) Generated Image
Figure 7: Zero-shot evaluation on CIFAR10
Refer to caption
Figure 8: Color map

B.2 Zero-shot results on CIFAR10

We apply the model trained on the COCO dataset to generate images with segmentation maps for CIFAR-10. The class labels are encoded by the text encoder as image captions. The zero-shot results show that our model is capable of generating segmentation maps for thing and stuff categories on other image datasets.

Appendix C Color map of panoptic categories of COCO dataset

Please see Fig. 8. This is a random color map provided only for reference. Although the COCO dataset uses 1-200 as class labels, there are only 133 classes.