
Detector Guidance for Multi-Object
Text-to-Image Generation

Luping Liu1, Zijian Zhang1, Yi Ren2, Rongjie Huang1, Xiang Yin2, Zhou Zhao1
1Zhejiang University   2ByteDance AI Lab
{luping.liu, ckczzj, rongjiehuang, zhaozhou}@zju.edu.cn,
{ren.yi, yinxiang.stephen}@bytedance.com
Corresponding author
Abstract

Diffusion models have demonstrated impressive performance in text-to-image generation. They utilize a text encoder and cross-attention blocks to infuse textual information into images at a pixel level. However, their ability to generate images from text containing multiple objects is still restricted. Previous works identify the problem of information mixing in the CLIP text encoder and introduce the T5 text encoder or incorporate strong prior knowledge to assist with the alignment. We find that mixing problems also occur on the image side and in the cross-attention blocks. The noisy images can cause different objects to appear similar, and the cross-attention blocks inject information at a pixel level, which weakens global object understanding and results in object mixing. In this paper, we introduce Detector Guidance (DG), which integrates a latent object detection model to separate different objects during the generation process. DG first performs latent object detection on cross-attention maps (CAMs) to obtain object information. Based on this information, DG then masks conflicting prompts and enhances related prompts by manipulating the following CAMs. We evaluate the effectiveness of DG using Stable Diffusion on COCO, CC, and a novel multi-related object benchmark, MRO. Human evaluations demonstrate that DG provides an 8-22% advantage in preventing the amalgamation of conflicting concepts and ensuring that each object possesses its unique region, without any human involvement or additional iterations. Our implementation is available at https://github.com/luping-liu/Detector-Guidance.

1 Introduction

Diffusion models [1, 2, 3, 4] have exhibited impressive performance in conditional generation, which requires that the generation results be not only realistic but also strongly correlated with the given conditions. Among various conditions, the text condition has attracted significant attention due to its user-friendly nature and has given rise to a number of influential works, such as DALL·E 2 [5], Imagen [6], and Stable Diffusion [7]. By utilizing billions of text-image pairs from the internet and employing well-designed model structures, these models ultimately achieve state-of-the-art text-to-image performance.

However, these models still exhibit relatively poor performance in generating multiple objects within a single image. Issues such as attribute mixing, object mixing, and object disappearance persist. Attribute mixing refers to the phenomenon where objects may be influenced by attributes that belong to other objects. Object mixing and disappearance refer to fusion that occurs at the object level, leading to the generation of strange multi-object hybrids and incorrect object counts.

Figure 1: Examples of the object mixing problem in text-to-image diffusion models can be observed in DALL·E, Midjourney v5, and Stable Diffusion 2.1. This issue can be resolved in Stable Diffusion 2.1 through the implementation of our Detector Guidance.

Prior research [8] has revealed that due to the causal attention masks in the text encoder, the semantics of tokens in the later part of a sequence get mixed with the token semantics before them. We further discover that a similar information mixing issue arises on the image side. The intermediate results of the diffusion model contain noise, which can cause different objects to appear similar. These two problems increase the difficulty of aligning different objects in prompts to the correct regions. Furthermore, diffusion models employ cross-attention blocks between text and images to incorporate text conditions into each pixel. In situations where the text conditions contain multiple objects, this creates a pixel-by-pixel competition for information from different objects. This can result in the fusion of conflicting information, such as 40% leopard and 60% tiger (e.g., the 1st row in Figure 1), or the division of a complete region by texts from different objects (e.g., the 3rd row in Figure 1). This underscores the weak global comprehension abilities of cross-attention blocks.

Previous works address this problem by incorporating strong prior knowledge, improving the correspondence between attributes and objects, or introducing better text encoders. The prior knowledge may include bounding boxes [9], masks [10], or small patches [11] of target objects. These data can aid cross-attention in achieving better alignment between the target prompt and image patches, thereby reducing undesired mixing. However, such solutions necessitate extensive human intervention and restrict the diversity of generation results. Feng et al. [8] utilize language parsers to associate attributes solely with the corresponding objects. However, this method is effective only when there is no issue of object mixing. Saharia et al. [6] utilize the T5 text encoder instead of CLIP, but this cannot address the problems on the image side and in the cross-attention blocks.

In this paper, our solution is to enable diffusion models to grasp the concept of objects, allowing them to assign regions globally and generate different objects simultaneously. To achieve this objective, we integrate a latent object detection model into pre-trained diffusion models. During the generation process, the latent object detection model generates bounding boxes based on the cross-attention maps (CAMs). By selecting CAMs as input, we can fully utilize the alignment results of the diffusion model, which increases the robustness and generalization of the detection. Once we obtain object information, we combine the bounding boxes and CAMs to further refine the boundaries and build masks. Then we mask conflicting text prompts and enhance the target text prompts by manipulating CAMs. Additionally, we use a smoothing strategy to ensure continuity and support high-order numerical methods. We refer to our approach as Detector Guidance (DG).

We evaluate the effectiveness of DG using Stable Diffusion on COCO [12], CC [8], and a new multi-related object benchmark (MRO). Based on our experiments, DG outperforms the original Stable Diffusion by 8-22% in human evaluation. DG accurately assigns attributes to the corresponding objects, prevents the combination of conflicting concepts, and ensures that each object has its unique region due to its clear understanding of objects. Our paper makes the following contributions:

  • We conduct a systematic analysis of the alignment issues in text-to-image diffusion models, which occur not only on the text encoding side but also on the image side and in the cross-attention blocks.

  • We propose a latent object detection method that fully utilizes diffusion model alignment information. Our detection model, trained on COCO, exhibits good generalization to unseen categories.

  • We introduce Detector Guidance to address the weak global comprehension of diffusion models, which provides a significant advantage without any human involvement or additional iterations.

2 Related Work

In this section, we introduce diffusion models and focus on Stable Diffusion, the basis of our method.

2.1 Diffusion Model

Denoising diffusion models [1, 2, 4] are a class of deep generative models that employ an iterative denoising process to generate samples. These models utilize noise-conditioned score networks, as described in [13, 14], and denoising score matching objectives, as described in [15, 16], at varying noise levels. They have demonstrated successful applications in various domains, such as text-to-image generation [6, 5, 7], natural language generation [17], time series prediction [18], audio synthesis [19, 20], 3D point cloud generation [21], and molecular conformation generation [22].

Text-to-Image Generation Diffusion models play an important role in text-to-image generation. To improve computational efficiency, diffusion models are typically trained on low-resolution images [6] or latent variables [7, 23], which are then transformed into high-resolution images through super-resolution diffusion models [24] or latent-to-image decoders [25]. The sampling process of diffusion models utilizes classifier-free guidance [26] as well as various sampling algorithms that use deterministic [3, 27, 28, 29, 30] or stochastic [31, 32, 33] iterations. In addition, several works [34, 35] retrieve additional images related to the text prompt from an external database and use them to condition generation, thereby enhancing performance.

Multi-Object Generation Multi-object generation requires text-to-image models to comprehend different objects during the generation process. Due to the difficulty of aligning text and image, prior studies have utilized bounding boxes [9], masks [10], or small patches [11] of target objects to enhance alignment. Other studies aim to improve generation results without the need for human involvement in each generation: Liu et al. [36] achieve concept conjunctions by adding multiple estimated scores for different objects, and Feng et al. [8] utilize language parsers to associate attributes solely with the corresponding objects. Balaji et al. [37] combine the T5 [38] and CLIP [39, 40] text encoders to improve the alignment on the text side.

Applications Text-to-image diffusion models have a significant impact on downstream applications. These models can be directly applied to various inverse problems, such as super-resolution [41, 42], inpainting [43, 44], and JPEG restoration [45, 46]. Text-to-image diffusion models can also be used for other semantic image editing tasks. For instance, SDEdit [47] enables editing of an existing image via colored strokes or image patches. DreamBooth [48] and Textual Inversion [49] allow for personalized model implementation by learning a subject-specific token from a few images. Similar image-editing capabilities can also be achieved by fine-tuning the model parameters [50, 51] or automatically searching for editing masks using denoisers [52]. Several text-to-video diffusion models are built on text-to-image ones and achieve high-quality video generation results [53, 54, 55, 56, 57]. Furthermore, the fitting capabilities of diffusion models have proven beneficial for the task of out-of-distribution detection [58].

2.2 Stable Diffusion

Our method is built on a state-of-the-art open-source text-to-image model, Stable Diffusion, which comprises an autoencoder and a diffusion model. During the training process, the pre-trained autoencoder first compresses the image distribution into a latent space, and the diffusion model attempts to fit this new distribution in the latent space. In the sampling process, the diffusion model first generates latent results based on the text prompts and then uses the autoencoder to decode the latent results and obtain the final images.

Stable Diffusion utilizes a pre-trained CLIP model to encode the text prompt and incorporates multiple cross-attention blocks to integrate text information into the target regions of images. Previous work [59] has shown that these cross-attention blocks contain rich spatial structure information and control the spatial layouts of the generated results. This observation opens up the possibility of using such spatial information to separate different objects and correcting the subsequent cross-attention blocks to solve the mixing problems.

3 Detector Guidance

In this section, we begin by discussing the structural information that can be obtained from the cross-attention maps (CAMs) of diffusion models and explain why additional models are necessary to facilitate generation. We then illustrate how a latent object detection model can be integrated into the diffusion models. Subsequently, we combine the results of bounding box detection and CAMs to achieve multiple object segmentation in a noisy latent space. We utilize the segmentation results to eliminate conflicting information and enhance target information by refining the subsequent CAMs. Finally, we present our detector guidance method that incorporates all of these components.

Figure 2: The pipeline of Detector Guidance comprises two stages: detection and correction. In the detection stage, CAMs of nouns are resized into squares, and latent object detection is performed on them. We then assign different regions to different objects to maximize the total confidence value. In the correction stage, we utilize the global alignment between regions and prompts by masking conflicting alignment and enhancing target alignment to generate new, corrected CAMs.

3.1 Cross-Attention Map

The multiple cross-attention blocks in Stable Diffusion exhibit remarkable alignment abilities in text-to-image generation. In a cross-attention block, the features $z_t$ of the noisy data are projected to a key $K = L_K(z_t)$, and the textual embedding $c$ is projected to a query $Q = L_Q(c)$ and a value $V = L_V(c)$, via learned linear projections. The cross-attention map (CAM) is then $M = KQ^T$, and the final cross-attention output of this block is $\text{Softmax}(M/\sqrt{d})V$. Here, $d$ is the projection dimension of the key and query. Among multiple CAMs at different resolutions, CAMs in the middle have rich spatial structure information [59]. Spatial information can still be clearly observed even when only 20% of the generation process has been completed, as shown in Figure 7.
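For illustration, the following is a minimal PyTorch sketch of a single cross-attention block that exposes its CAM; the layer names and tensor shapes are simplified assumptions rather than the exact Stable Diffusion implementation.

```python
import math
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """Minimal sketch of a cross-attention block that exposes its CAM.

    Shapes are illustrative: z_t is (B, HW, C_img) flattened image features,
    c is (B, N, C_txt) text embeddings; d is the shared projection dimension.
    """
    def __init__(self, c_img, c_txt, d):
        super().__init__()
        self.L_K = nn.Linear(c_img, d)  # image features -> key (paper's convention)
        self.L_Q = nn.Linear(c_txt, d)  # text embedding -> query
        self.L_V = nn.Linear(c_txt, d)  # text embedding -> value
        self.d = d

    def forward(self, z_t, c):
        K = self.L_K(z_t)                      # (B, HW, d)
        Q = self.L_Q(c)                        # (B, N, d)
        V = self.L_V(c)                        # (B, N, d)
        cam = K @ Q.transpose(-1, -2)          # (B, HW, N), the CAM M = K Q^T
        out = torch.softmax(cam / math.sqrt(self.d), dim=-1) @ V  # (B, HW, d)
        return out, cam                        # CAM is kept for detection/correction
```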

While such alignment is effective in many cases, it lacks global coordination, leading to disorderly competition. For instance, consider a text prompt “a leopard and a tiger” whose intermediate results correctly contain two objects. Ideally, the leopard prompt should align with one region, and the tiger prompt should align with the other. However, in practice, the leopard and tiger prompts may both attempt to align with both regions simultaneously, resulting in the mixing of conflicting information and the generation of leopard-tiger hybrids, as shown in Figure 7. Moreover, designing global coordination solely with a pre-trained diffusion model is challenging due to its local alignment strategy, which results in limited global comprehension abilities for distinguishing between different objects. Thus, an additional model is necessary to identify objects from the local alignment information of CAMs, which we refer to as latent object detection.

Although we incorporate an additional model, our latent object detection model can be simple for several reasons. Firstly, local alignment has already marked important areas, making it easier to extract features. Secondly, the resolution of the middle CAMs is only $16 \times 16$, resulting in a relatively small scale difference between objects. Thirdly, we typically avoid images where objects overlap each other, as our objective is to present the objects mentioned in the text as comprehensively as possible. These factors make the detection process relatively straightforward.

3.2 Latent Object Detection

In this paper, we use a simple YOLO [60] model for latent object detection and train it on the COCO dataset. Our training procedure is as follows: we add Gaussian noise to the images and use the labels of the original images as the prompts. We then feed the noisy images and prompts into a pre-trained diffusion model. We capture the outputs of the middle CAMs, which we subsequently employ as input to the latent object detection model. The latent object detection model infers the bounding boxes and confidence scores for each object based on the corresponding CAMs. We then use the predicted bounding boxes and the ground-truth bounding boxes to calculate the loss and update the latent object detection model accordingly. More details and analyses about our latent detection model can be found in Appendix A.1.
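A rough sketch of one training step is given below (a fuller pseudocode version is Algorithm 3 in Appendix A.3). The callables `encode_fn`, `diffusion_cam_fn`, and `yolo_loss_fn` are assumed placeholders for the frozen autoencoder, the frozen diffusion model with CAM hooks, and the YOLO-style loss; they are not part of an existing API.

```python
import torch

def detector_training_step(image, gt_boxes, label, detector,
                           encode_fn, diffusion_cam_fn, yolo_loss_fn,
                           alpha_bar, T=1000):
    """One sketched training step for the latent object detector.

    encode_fn, diffusion_cam_fn, and yolo_loss_fn are assumed callables standing
    in for the frozen autoencoder, the frozen diffusion UNet with CAM hooks, and
    the YOLOv1-style confidence + box loss, respectively.
    """
    z0 = encode_fn(image)                               # latent of the clean image
    t = torch.randint(0, T, (1,)).item()                # random noise level
    noise = torch.randn_like(z0)
    zt = alpha_bar[t].sqrt() * z0 + (1 - alpha_bar[t]).sqrt() * noise  # forward diffusion

    # Run the frozen diffusion model with the image label as the prompt and
    # capture the middle cross-attention maps for that label's token(s).
    cams = diffusion_cam_fn(zt, prompt=label, t=t)

    pred_boxes, pred_conf = detector(cams)              # latent object detection
    loss = yolo_loss_fn(pred_boxes, pred_conf, gt_boxes)  # confidence + bbox loss only
    loss.backward()
    return loss.detach()
```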

In the sampling procedure, we first utilize a language parser [61] to find the objects in prompts. More details about language parsers can be found in Appendix A.2. Then we generate bounding boxes for each object based on the corresponding CAMs, with the class for each bounding box being the noun corresponding to the input CAM. The next step is to assign bounding boxes to different objects within a single image. To achieve this, we first employ non-maximum suppression to eliminate redundant bounding boxes. We do not discard all information from the suppressed bounding boxes; rather, we incorporate their confidence scores and object classes into the corresponding retained bounding boxes. As a result, each retained bounding box may pertain to different objects with varying scores. We then employ the linear sum assignment algorithm from the SciPy library to assign bounding boxes to different objects, optimizing the total confidence score across all objects.
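A minimal sketch of this assignment step with SciPy is shown below; the construction of the objects-by-boxes confidence matrix is a simplified assumption of our pipeline.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_boxes(conf_matrix):
    """Assign one retained bounding box to each object so that the total
    confidence is maximized.

    conf_matrix[i, j] is the confidence that retained box j depicts object i
    (accumulated from the suppressed boxes merged into box j).
    Returns a dict mapping object index -> box index.
    """
    obj_idx, box_idx = linear_sum_assignment(conf_matrix, maximize=True)
    return dict(zip(obj_idx.tolist(), box_idx.tolist()))

# Example: 2 objects ("tiger", "leopard") and 3 retained boxes.
conf = np.array([[0.9, 0.4, 0.1],
                 [0.3, 0.8, 0.2]])
print(assign_boxes(conf))  # {0: 0, 1: 1}
```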

Despite the relatively limited number of categories in the COCO dataset compared to larger-scale datasets like Laion-5B [62], we find that our well-trained detection model demonstrates good generalization to previously unseen categories. The use of CAMs as input has played a significant role in this. Many previously unseen categories, which may exhibit substantial visual differences at the image level, are remarkably similar to certain known categories in terms of their CAMs as long as they share similar overall structures.

3.3 CAM Correction

After obtaining the desired bounding boxes for each object, we use three consecutive steps to correct the CAMs and a strategy to maintain continuity in the generation process.

Boundary Correction Theoretically, we can use segmentation models instead of detection models to obtain more precise object boundaries. However, we find that this is often unnecessary, as the CAMs already contain sufficient local boundary information for objects. Therefore, we first use the detection model to generate bounding boxes $\text{BBox}[*]$ and then further refine the boundaries using CAMs. The segmentation for a certain object $n$ is:

$S[n](i,j) = \{(i,j) \in \text{BBox}[n] \ \&\ \text{CAM}[\dots,n](i,j) \geq \sigma[n]\}.$    (1)

Here, the shape of CAM is $H \times W \times N$, and $N$ is the total number of tokens in a prompt. We only do detection and boundary correction on the nouns that represent objects, and the shape of a segmentation of object $n$ is $H \times W$. The threshold $\sigma$ is computed using Otsu's method [63]. For pixels belonging to more than one object, we assign them to the smallest object.
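A minimal sketch of Eq. (1) is given below. It assumes the CAM slice for each object's core noun is available as a 2D array, uses scikit-image's Otsu threshold, and resolves ties in favor of the smallest object as described above.

```python
import numpy as np
from skimage.filters import threshold_otsu

def boundary_correction(cams, bboxes):
    """Refine bounding boxes into per-object masks (a sketch of Eq. (1)).

    cams:   dict object_id -> (H, W) CAM slice of that object's core noun
    bboxes: dict object_id -> (x0, y0, x1, y1) in CAM coordinates
    Returns dict object_id -> boolean (H, W) mask.
    """
    masks = {}
    for n, cam in cams.items():
        x0, y0, x1, y1 = bboxes[n]
        inside = np.zeros_like(cam, dtype=bool)
        inside[y0:y1, x0:x1] = True              # pixels inside BBox[n]
        sigma = threshold_otsu(cam)              # per-object threshold sigma[n]
        masks[n] = inside & (cam >= sigma)

    # Pixels claimed by several objects go to the smallest object.
    order = sorted(masks, key=lambda n: masks[n].sum())  # smallest first
    taken = np.zeros_like(next(iter(masks.values())), dtype=bool)
    for n in order:
        masks[n] = masks[n] & ~taken
        taken |= masks[n]
    return masks
```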

Conflict Elimination We find that the alignment results in CAMs become meaningful after $t \leq 800$ (visual results for the meaningful CAMs can be found in Section 4.3), and the transition timestep is $I = 800$. Then, we can address the issue of information mixing. In each cross-attention block, we utilize the conditional prompt $c$ and the unconditional prompt $uc$ to generate two queries $Q_c$, $Q_{uc}$ and two CAMs $KQ_c^T$, $KQ_{uc}^T$, respectively. Here, we denote $\text{CAM}_0 = KQ_c^T$. For any region that has established its correspondence with the text describing a certain object, we eliminate the influence of other conflicting texts by replacing the values of the corresponding $KQ_c^T$ with those of $KQ_{uc}^T$ at the same position. Conflicting relationships can be obtained through human annotations or a language parser. Therefore, the corresponding mask is:

$\text{mask}(i,j,p) = \{\exists n \text{ s.t. } S[n](i,j) = 1 \ \&\ p \text{ conflicts with } n\}.$    (2)

Here, the shape of the mask is identical to that of a CAM. Other tokens in a noun phrase share the same mask as the core noun. The Conflict Elimination algorithm can be expressed as follows:

$\text{CAM}_1 = \text{CAM}_0 * (1 - \text{mask}) + KQ_{uc}^T * \text{mask}.$    (3)

Target Enhancement While the masking operator can prevent conflicting information from being mixed in the subsequent steps, it does not guarantee that the correct information can have sufficient influence in the target region to generate the right objects. One reason is that some mistakes may have already occurred in the previous steps, which can affect the alignment between pixels and the target text. To address this issue, we propose Target Enhancement, which strengthens the influence of the target text in such regions. We record the maximum value of the CAM for each pixel. If the maximum decreases after Conflict Elimination, it indicates that this pixel contains a large percentage of features belonging to other objects. In this case, we increase the value of the remaining CAMs at this pixel to enhance the injection of target information. The algorithm of Target Enhancement can be written as:

$\text{CAM}_2 = \text{CAM}_1 * \dfrac{\text{CAM}_0.\text{max}(\text{dim}=-1)}{\text{CAM}_1.\text{max}(\text{dim}=-1)}.$    (4)

Smooth Involvement Another issue with masking operators is that they can cause a sharp change in the outputs when first applied. Such discontinuous outputs are not suitable for high-order numerical acceleration methods of diffusion models, such as PNDM [28], DPM-Solver [27], and DEIS [29]. To address this, we propose a smoothing approach that gradually increases the impact of Detector Guidance. Specifically, we begin by using 25% $\text{CAM}_2$ and 75% of the original CAM and gradually increase the ratio to 50%:50%, 75%:25%, and 100%:0%.
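Putting Eqs. (3)-(4) and the smoothing step together, the sketch below operates on the raw (pre-softmax) CAM tensors of one cross-attention block; the tensor shapes and the construction of the conflict mask are simplified assumptions.

```python
import torch

def correct_cam(cam_c, cam_uc, conflict_mask, s):
    """CAM correction: Conflict Elimination, Target Enhancement, Smooth Involvement.

    cam_c, cam_uc:  (HW, N) conditional / unconditional CAMs (K Q_c^T, K Q_uc^T)
    conflict_mask:  (HW, N) binary mask, 1 where token p conflicts with the
                    object assigned to that pixel (Eq. (2))
    s:              smoothing factor in {0.25, 0.5, 0.75, 1.0}
    """
    cam0 = cam_c
    # Conflict Elimination (Eq. (3)): replace conflicting entries with the
    # unconditional CAM at the same positions.
    cam1 = cam0 * (1 - conflict_mask) + cam_uc * conflict_mask
    # Target Enhancement (Eq. (4)): rescale so the per-pixel maximum is restored.
    scale = cam0.max(dim=-1, keepdim=True).values / cam1.max(dim=-1, keepdim=True).values
    cam2 = cam1 * scale
    # Smooth Involvement: blend with the original CAM to avoid sharp output jumps.
    cam3 = cam2 * s + cam0 * (1 - s)
    return cam3
```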

Having introduced all the steps, the whole algorithm of our Detector Guidance can be found in Algorithms 1 and 2. A more comprehensive version can be found in Appendix A.3. Notably, our CAM correction has no hyperparameters to tune for different situations, making it easy and robust to use. We also find that the bounding boxes can be cached and reused in several successive steps without affecting alignment and quality, thereby improving the computational efficiency of our method.

Algorithm 1 Detection of Detector Guidance
1: initial noise $x_T$, text condition $c$
2: for $t = T, T-1, \cdots, T-I+1$ do
3:     $x_{t-1}$, CAM = DM($x_t$, $c$)
4: end for
5: for $t = T-I, T-I-1, \cdots, 1$ do
6:     $x_{t-1}$, CAM = DM($x_t$, $c$, $\text{BBox}_{t+1}$, $\sigma_{t+1}$)
7:     $\text{BBox}_t$ = Detection(CAM)
8:     $\sigma_t$ = Otsu(CAM)
9: end for
10: return image = Decoder($x_0$)

Algorithm 2 Correction of Detector Guidance
1: bounding boxes $\text{BBox}_{t+1}$, threshold $\sigma_{t+1}$, smoothing factor $s$
2: for block $\in$ cross-attention blocks do
3:     $\text{CAM}_0$, $\text{CAM}_{uc}$ = $KQ_c^T$, $KQ_{uc}^T$
4:     m = Mask($\text{CAM}_0$, $\text{BBox}_{t+1}$, $\sigma_{t+1}$)
5:     $\text{CAM}_1$ = $\text{CAM}_0 * (1 - \text{m}) + \text{CAM}_{uc} * \text{m}$
6:     $\text{CAM}_2 = \text{CAM}_1 * \frac{\text{CAM}_0.\text{max}(\text{dim}=-1)}{\text{CAM}_1.\text{max}(\text{dim}=-1)}$
7:     $\text{CAM}_3$ = $\text{CAM}_2 * s + \text{CAM}_0 * (1 - s)$
8:     output = $\text{Softmax}(\text{CAM}_3/\sqrt{d}) * V$
9: end for

4 Experiments

In this section, we present the setups of our method and compare it with the baselines on COCO, CC, and a new benchmark, MRO. Then we showcase the performance of our latent object detection model at various timesteps and the influence of modifying CAMs using Detector Guidance. After that, we conduct ablation studies to analyze the effectiveness of each step in Detector Guidance.

4.1 Setups

We evaluate Detector Guidance on pre-trained Stable Diffusion models v1.4 and v2.1. The additional latent detection model is YOLOv1 with a conv2d stride of 1 to suit the small input size. The latent detection model is trained on the COCO dataset, utilizing 2 RTX 3090 GPU-days, with nearly 70% of the time allocated to CAM computation.

Here, we evaluate the alignment of our DG method on three benchmarks. The first is the Concept Conjunction (CC) benchmark from [8], which comprises about 500 prompts following the structure of “a red car and a white sheep”. This is a suitable benchmark for Stable Diffusion v1.4 but too simplistic for Stable Diffusion v2.1. Consequently, we present a novel benchmark, the multi-related object (MRO) benchmark, which uses a similar sentence pattern. However, instead of using distinct objects such as a car and a sheep, we utilize GPT-4 [64] to generate 30 prompts that contain two related objects, such as a tiger and a leopard. Additionally, we generate 10 samples for each prompt instead of just 1. This benchmark evaluates the ability of generation models to solve mixing problems both at the attribute level and at the object level. The complete MRO list can be found in Appendix A.4. Moreover, the COCO validation set is a commonly used benchmark for zero-shot text-to-image generation. We utilize it to evaluate the performance under more complex prompt patterns.

Regarding evaluation metrics, we find that FID and CLIP-score are not effective indicators of the mixing problem. Therefore, we primarily rely on human evaluation. Nevertheless, we also evaluate FID and CLIP-score to ensure that our method does not lead to any performance degradation on these two metrics. In Figure 5, we also present the visual generation results on MRO and COCO.

4.2 Main Results

Figure 3: The trade-off curve between FID-5k and CLIP-score for guidance scales of $[2.0, \cdots, 10.0]$.
Model  | Guidance      | Benchmark | Who aligns better? (Base / Guidance / Delta)
SD 1.4 | Structure [8] | CC        | 27.4% / 29.4% / +2.0%
SD 1.4 | DG (ours)     | CC        | 15.9% / 23.5% / +7.6%
SD 1.4 | Str+DG (ours) | CC        | 24.4% / 34.3% / +9.9%
SD 2.1 | DG (ours)     | CC        | 14.3% / 15.0% / +0.7%
SD 2.1 | DG (ours)     | MRO       | 13.0% / 35.3% / +22%
SD 2.1 | DG (ours)     | COCO      | 12.8% / 23.6% / +11%
Figure 4: The human evaluation results of the alignment between text and image. Except for cases where the base or guidance method is better, the remaining results are tied.
Figure 5: Each pair of images compares Stable Diffusion v2.1 without (left) or with (right) DG. The first line shows DG’s effectiveness for attribute mixing, object mixing, and object disappearance, while the second line demonstrates its effectiveness with complex prompts from COCO.

CC Structured diffusion guidance [8] is another guidance method that focuses on the text side to address the mixing problem. This work is built on Stable Diffusion v1.4 and provides the CC benchmark. Thus, we compare it with our method on the same benchmark and model. Moreover, since the two methods address the mixing problem on the text and image sides, respectively, they can be combined. In Table 4, we find that the combination of our method and structured diffusion guidance can further enhance the performance. One issue with structured diffusion guidance is that it may introduce unnecessary adjustments and is more likely to produce guidance results that are worse than the original ones, about 10% more often than ours.

MRO We use the more challenging MRO benchmark to evaluate the performance of our method on Stable Diffusion v2.1. The results are shown in Table 4. We can see that our method still achieves huge improvements on Stable Diffusion v2.1. As shown in the first row of Figure 5, Detector Guidance successfully solves the attribute mixing, object mixing, and object disappearance problems.

COCO In our experiments, we randomly select 5,000 captions from all COCO captions and generate 5,000 corresponding image pairs with and without Detector Guidance. To perform latent object detection on COCO, we use a language parser to identify all the noun phrases within the captions. We plot the trade-off curves between FID and CLIP-score using different guidance scales, namely $[2.0, 3.0, \cdots, 10.0]$. Figure 3 shows that Detector Guidance slightly improves the results in both FID and CLIP-score when the guidance scale is larger than 3. To evaluate our method under complex prompt patterns, we conduct human evaluations on the 500 pairs of images with the largest L2 distance among the 5,000 pairs. The results also show a clear advantage of our method.

4.3 Analyses

Detection Based on our observations, we notice that independent objects can be discerned from CAMs when $t \leq 800$. This is in strong agreement with the outcomes we obtained using a latent object detection model that was well-trained on CAMs. In Figure 6, we find that the IOU results undergo significant changes when $t \geq 800$ and gradually converge to one when $t \leq 800$. Therefore, it can be deduced that for $t \geq 800$, the latent object detection model essentially makes random guesses regarding the object, while for $t \leq 800$, the model can effectively locate the object and subsequently refine its boundaries, ultimately converging to the final result.

Our detection model is only trained on the COCO dataset. Nonetheless, we observe that it exhibits good generalization to unseen categories. Notably, most objects in the demos of this paper do not belong to any category included in COCO. Our method can still detect these objects and align them with their prompts successfully.

Correction In Figure 7, we show an example of the CAMs associated with the tokens “tiger” and “leopard” at different timesteps with or without DG. Here, the whole prompt is “A striped tiger and a spotted leopard”. In the original diffusion model, the CAMs attempt to align the tokens “tiger” and “leopard” with both objects simultaneously. Consequently, the information from “tiger” and “leopard” amalgamates in the right object, resulting in an animal that resembles both a tiger and a leopard. Upon incorporating DG into the diffusion model, DG can accurately distinguish between the two objects and align each one with its corresponding prompt, producing a final result that is highly consistent with the conditions. Moreover, DG adheres to the principle of minimal intervention, as it successfully preserves the image elements, such as rocks, vegetation, and the tiger, except for the region where the mixing error occurred. This reduces the possibility of introducing unknown new problems through additional corrections.

4.4 Ablation Studies

Figure 6: The IOU between the predicted bounding boxes at different timesteps and at the last step.
Figure 7: The CAMs at different timesteps with or without DG. The prompt is “A striped tiger and a spotted leopard”. The initial noise is the same.

Boundary Correction Since bounding boxes of different objects may overlap with each other, some areas of one object may be mistakenly assigned to other objects. Boundary Correction can address this issue, as illustrated in Figure 8. In the second image, the bounding box of the tree overlaps with the bounding box of the giraffe, leading to poor generation results. Boundary Correction can refine the boundary of the tree and remove the overlapping area. As a result, the head of the giraffe can be generated correctly in the third image.

Figure 8: The results with or without Boundary Correction. The bounding box corresponds to the object “tree”, and the last two images are masks.

Figure 9: The image results and CAMs of the token “dog” with or without Target Enhancement.

Target Enhancement Since Conflict Elimination cannot guarantee proper alignment of the target prompt with the assigned area, we use Target Enhancement to strengthen the target object, as shown in Figure 9. In the first image, both the left and right sides of the image bear a resemblance to a cat. Without Target Enhancement, even though we remove the cat prompt on the right and improve the left cat in the second image, the dog prompt cannot effectively align with the right region. In the third image, Target Enhancement resolves this issue by enhancing the target object in the CAMs.

Smooth Involvement Smooth Involvement addresses an important theoretical issue, but we did not observe significant improvement in practice. A similar phenomenon occurs in the acceleration of diffusion models, where lower-order methods are used to start the generation and still yield good generation results.

5 Discussion

In this paper, we introduce Detector Guidance to aid diffusion models in comprehending global object information. In the detection stage, we do not perform detection directly on the image space. Instead, we utilize the local alignment information from Stable Diffusion and latent object detection to obtain robust detection results. In the correction stage, we adhere to the principle of minimal intervention and avoid the use of any hyperparameters to ensure the easy, robust, and safe application of our method. Additionally, we provide a new benchmark to assess the performance of generation models on the multi-object mixing problem.

Apart from text-to-image generation, Detector Guidance can be effortlessly implemented in other diffusion models that also utilize cross-attention blocks and encounter the problem of multi-object mixing. For instance, ImageBind [65] aligns additional modalities with images, such as text, audio, depth, and thermal data. The problem of audio guidance coming from different objects is similar to that of text prompts containing multiple objects. We believe that all of these modalities can be utilized as guidance in the future and can benefit from our Detector Guidance.

References

  • Sohl-Dickstein et al. [2015] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
  • Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, volume 33, pages 6840–6851, 2020.
  • Song et al. [2021a] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021a.
  • Song et al. [2021b] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021b.
  • Ramesh et al. [2022] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
  • Saharia et al. [2022a] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022a.
  • Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
  • Feng et al. [2022] Weixi Feng, Xuehai He, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Pradyumna Narayana, Sugato Basu, Xin Eric Wang, and William Yang Wang. Training-free structured diffusion guidance for compositional text-to-image synthesis. arXiv preprint arXiv:2212.05032, 2022.
  • Ma et al. [2023] Wan-Duo Kurt Ma, JP Lewis, W Bastiaan Kleijn, and Thomas Leung. Directed diffusion: Direct control of object placement through attention guidance. arXiv preprint arXiv:2302.13153, 2023.
  • Avrahami et al. [2022] Omri Avrahami, Thomas Hayes, Oran Gafni, Sonal Gupta, Yaniv Taigman, Devi Parikh, Dani Lischinski, Ohad Fried, and Xi Yin. Spatext: Spatio-textual representation for controllable image generation. arXiv preprint arXiv:2211.14305, 2022.
  • Sarukkai et al. [2023] Vishnu Sarukkai, Linden Li, Arden Ma, Christopher Ré, and Kayvon Fatahalian. Collage diffusion. arXiv preprint arXiv:2303.00262, 2023.
  • Lin et al. [2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
  • Song and Ermon [2019] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 32, 2019.
  • Song and Ermon [2020] Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. Advances in neural information processing systems, 33:12438–12448, 2020.
  • Hyvärinen and Dayan [2005] Aapo Hyvärinen and Peter Dayan. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4), 2005.
  • Vincent [2011] Pascal Vincent. A connection between score matching and denoising autoencoders. Neural computation, 23(7):1661–1674, 2011.
  • Li et al. [2022] Xiang Li, John Thickstun, Ishaan Gulrajani, Percy S Liang, and Tatsunori B Hashimoto. Diffusion-lm improves controllable text generation. Advances in Neural Information Processing Systems, 35:4328–4343, 2022.
  • Tashiro et al. [2021] Yusuke Tashiro, Jiaming Song, Yang Song, and Stefano Ermon. Csdi: Conditional score-based diffusion models for probabilistic time series imputation. Advances in Neural Information Processing Systems, 34:24804–24816, 2021.
  • Kong et al. [2020] Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. Diffwave: A versatile diffusion model for audio synthesis. arXiv preprint arXiv:2009.09761, 2020.
  • Liu et al. [2022a] Jinglin Liu, Chengxi Li, Yi Ren, Feiyang Chen, and Zhou Zhao. Diffsinger: Singing voice synthesis via shallow diffusion mechanism. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 11020–11028, 2022a.
  • Luo and Hu [2021] Shitong Luo and Wei Hu. Diffusion probabilistic models for 3d point cloud generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2837–2845, 2021.
  • Xu et al. [2022] Minkai Xu, Lantao Yu, Yang Song, Chence Shi, Stefano Ermon, and Jian Tang. Geodiff: A geometric diffusion model for molecular conformation generation. arXiv preprint arXiv:2203.02923, 2022.
  • Gu et al. [2022] Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quantized diffusion model for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10696–10706, 2022.
  • Ho et al. [2022a] Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. J. Mach. Learn. Res., 23(47):1–33, 2022a.
  • Sinha et al. [2021] Abhishek Sinha, Jiaming Song, Chenlin Meng, and Stefano Ermon. D2c: Diffusion-decoding models for few-shot conditional generation. Advances in Neural Information Processing Systems, 34:12533–12548, 2021.
  • Ho and Salimans [2022] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
  • Lu et al. [2022] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. In Advances in Neural Information Processing Systems, 2022.
  • Liu et al. [2022b] Luping Liu, Yi Ren, Zhijie Lin, and Zhou Zhao. Pseudo numerical methods for diffusion models on manifolds. In International Conference on Learning Representations, 2022b.
  • Zhang and Chen [2022] Qinsheng Zhang and Yongxin Chen. Fast sampling of diffusion models with exponential integrator. arXiv preprint arXiv:2204.13902, 2022.
  • Karras et al. [2022] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. arXiv preprint arXiv:2206.00364, 2022.
  • Bao et al. [2022] Fan Bao, Chongxuan Li, Jun Zhu, and Bo Zhang. Analytic-dpm: an analytic estimate of the optimal reverse variance in diffusion probabilistic models. arXiv preprint arXiv:2201.06503, 2022.
  • Dockhorn et al. [2021] Tim Dockhorn, Arash Vahdat, and Karsten Kreis. Score-based generative modeling with critically-damped langevin diffusion. arXiv preprint arXiv:2112.07068, 2021.
  • Zhang et al. [2022] Qinsheng Zhang, Molei Tao, and Yongxin Chen. gddim: Generalized denoising diffusion implicit models. arXiv preprint arXiv:2206.05564, 2022.
  • Blattmann et al. [2022] Andreas Blattmann, Robin Rombach, Kaan Oktay, Jonas Müller, and Björn Ommer. Retrieval-augmented diffusion models. Advances in Neural Information Processing Systems, 35:15309–15324, 2022.
  • Sheynin et al. [2022] Shelly Sheynin, Oron Ashual, Adam Polyak, Uriel Singer, Oran Gafni, Eliya Nachmani, and Yaniv Taigman. Knn-diffusion: Image generation via large-scale retrieval. arXiv preprint arXiv:2204.02849, 2022.
  • Liu et al. [2022c] Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and Joshua B Tenenbaum. Compositional visual generation with composable diffusion models. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XVII, pages 423–439. Springer, 2022c.
  • Balaji et al. [2022] Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, et al. ediffi: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022.
  • Raffel et al. [2020] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • Cherti et al. [2022] Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. arXiv preprint arXiv:2212.07143, 2022.
  • Saharia et al. [2022b] Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022b.
  • Dhariwal and Nichol [2021] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
  • Lugmayr et al. [2022] Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11461–11471, 2022.
  • Chung et al. [2022] Hyungjin Chung, Byeongsu Sim, Dohoon Ryu, and Jong Chul Ye. Improving diffusion models for inverse problems using manifold constraints. arXiv preprint arXiv:2206.00941, 2022.
  • Saharia et al. [2022c] Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, pages 1–10, 2022c.
  • Kawar et al. [2022a] Bahjat Kawar, Jiaming Song, Stefano Ermon, and Michael Elad. Jpeg artifact correction using denoising diffusion restoration models. arXiv preprint arXiv:2209.11888, 2022a.
  • Meng et al. [2021] Chenlin Meng, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021.
  • Ruiz et al. [2022] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv preprint arXiv:2208.12242, 2022.
  • Gal et al. [2022] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.
  • Kawar et al. [2022b] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. arXiv preprint arXiv:2210.09276, 2022b.
  • Valevski et al. [2022] Dani Valevski, Matan Kalman, Yossi Matias, and Yaniv Leviathan. Unitune: Text-driven image editing by fine tuning an image generation model on a single image. arXiv preprint arXiv:2210.09477, 2022.
  • Couairon et al. [2022] Guillaume Couairon, Jakob Verbeek, Holger Schwenk, and Matthieu Cord. Diffedit: Diffusion-based semantic image editing with mask guidance. arXiv preprint arXiv:2210.11427, 2022.
  • Ho et al. [2022b] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. arXiv preprint arXiv:2204.03458, 2022b.
  • Ho et al. [2022c] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022c.
  • Yang et al. [2022] Ruihan Yang, Prakhar Srivastava, and Stephan Mandt. Diffusion probabilistic modeling for video generation. arXiv preprint arXiv:2203.09481, 2022.
  • Singer et al. [2022] Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022.
  • Harvey et al. [2022] William Harvey, Saeid Naderiparizi, Vaden Masrani, Christian Weilbach, and Frank Wood. Flexible diffusion modeling of long videos. arXiv preprint arXiv:2205.11495, 2022.
  • Liu et al. [2022d] Luping Liu, Yi Ren, Xize Cheng, and Zhou Zhao. Diffusion denoising process for perceptron bias in out-of-distribution detection. arXiv preprint arXiv:2211.11255, 2022d.
  • Hertz et al. [2022] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
  • Redmon et al. [2016] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
  • Bird et al. [2009] Steven Bird, Ewan Klein, and Edward Loper. Natural language processing with Python: analyzing text with the natural language toolkit. " O’Reilly Media, Inc.", 2009.
  • Schuhmann et al. [2022] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade W Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.
  • Otsu [1979] Nobuyuki Otsu. A threshold selection method from gray-level histograms. IEEE transactions on systems, man, and cybernetics, 9(1):62–66, 1979.
  • OpenAI [2023] OpenAI. Gpt-4 technical report, 2023.
  • Girdhar et al. [2023] Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. Imagebind: One embedding space to bind them all. arXiv preprint arXiv:2305.05665, 2023.

Appendix A Supplementary Material

A.1 Latent Object Detection

Our latent object detection is built on YOLOv1 and Stable Diffusion v2.1. Specifically, we adopt an implementation of YOLOv1 from https://github.com/yjh0410/new-YOLOv1_PyTorch. More details are as follows:

  • Input: The input is the concatenation of 5 CAMs in the UNet, with a size of $20 \times 16 \times 16$, corresponding to the core noun, and we only use the mean of the CAMs along the head dimension. If the core noun consists of multiple tokens, we also average the CAMs across all these tokens.

  • Model Structure: To suit a small input size, we employ ResNet as the backbone with a conv2d stride of 1 and use several conv2d layers as the only head to simultaneously predict the bounding box and confidence (a rough sketch follows this list).

  • Loss Function: We exclude the classification loss and only retain the confidence loss and the $t_x t_y t_w t_h$ box-regression loss from YOLOv1.

  • Dataset: We utilize the COCO dataset, and adopt the official train/val split. We then use the validation set as the test set.

  • Augmentation: We employ several augmentations, including random horizontal flipping, random brightness adjustment, and random cropping.

  • Optimizer: We use the AdamW optimizer with a learning rate of 0.001 and a batch size of 32. We train the model for 100k steps and use the final model.

  • Evaluation Metric: We use the same evaluation metric as YOLOv1, which is the mean average precision (mAP) with an IoU threshold of 0.5.
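As referenced in the model-structure item above, the following is a rough sketch of adapting a ResNet backbone to the small $16 \times 16$ CAM input by reducing the stem stride and attaching a small convolutional head that predicts per-cell confidence and box offsets. This is an illustrative assumption, not our exact configuration; in particular, the placement of reduced strides in our model may differ.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class LatentYOLO(nn.Module):
    """Sketch of a YOLOv1-style detector on 16x16 CAM inputs (assumed config)."""
    def __init__(self, in_channels=20):
        super().__init__()
        backbone = resnet18(weights=None)
        # Adapt the stem for small inputs: accept CAM channels, use stride 1,
        # and drop the initial max-pooling so spatial resolution is preserved longer.
        backbone.conv1 = nn.Conv2d(in_channels, 64, kernel_size=3,
                                   stride=1, padding=1, bias=False)
        backbone.maxpool = nn.Identity()
        self.features = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool/fc
        # Head: per-cell confidence (1) + box offsets t_x, t_y, t_w, t_h (4).
        self.head = nn.Sequential(
            nn.Conv2d(512, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 5, kernel_size=1),
        )

    def forward(self, cams):                    # cams: (B, 20, 16, 16)
        return self.head(self.features(cams))   # (B, 5, H', W') per-cell predictions

out = LatentYOLO()(torch.randn(2, 20, 16, 16))
print(out.shape)
```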

Table 1 presents the results, which demonstrate that our latent object detection model achieves an mAP@0.5 of 58.4. This confirms that the CAMs contain rich semantic information and indicates that our approach can effectively detect objects in the latent space. As the noise level increases, the detection results decrease, because the noise confuses the model.

Table 1: The mAP@0.5 results of our latent object detection model on the COCO validation dataset at different timesteps.
timestep | 0    | 200  | 400  | 600  | 800
YOLOv1   | 56.5 | 58.4 | 54.8 | 38.1 | 18.0

A.2 Language Parser

We utilize the noun_chunks function in Spacy to identify the noun phrases in prompts and extract the noun within each noun phrase as the core noun. Certain noun phrases, such as time and location, do not correspond to any objects and are not relevant to our purpose. To filter out these phrases, we employ a stop-word list that includes terms such as:

top, bottom, beside, towards, front, left, right, center, middle, rear, edge, corner, periphery, interior, exterior, upstairs, downstairs, sideways, diagonal, opposite, adjacent, parallel, north, south, east, west, northeast, southeast, southwest, downward, inward, outward, lengthwise, crosswise, amidst, amongst, proximity, and vicinity.
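A minimal sketch of this parsing step with spaCy is shown below; the abbreviated stop-word set and the pipeline name `en_core_web_sm` are assumptions for illustration.

```python
import spacy

STOP_WORDS = {"top", "bottom", "front", "left", "right", "center", "middle",
              "edge", "corner", "north", "south", "east", "west"}  # abbreviated list

nlp = spacy.load("en_core_web_sm")  # assumed English pipeline with a parser

def extract_objects(prompt):
    """Return (noun_phrase, core_noun) pairs for object-like noun chunks."""
    doc = nlp(prompt)
    objects = []
    for chunk in doc.noun_chunks:
        core = chunk.root.text.lower()        # head noun of the noun phrase
        if core not in STOP_WORDS:
            objects.append((chunk.text, core))
    return objects

print(extract_objects("a striped tiger and a spotted leopard"))
# [('a striped tiger', 'tiger'), ('a spotted leopard', 'leopard')]
```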

We find that, on one hand, Spacy may make errors, while on the other hand, some noun phrases can be easily identified by the model without requiring additional guidance. Therefore, we also encourage users to annotate the noun phrases in the input prompt that require correction or fusion. This can further enhance the efficiency of the generation process and improve the quality of the resulting output.

A.3 Algorithm

Here, we give the full pseudo code for our training and sampling algorithm.

Algorithm 3 The training process of latent object detection.
1: image $x_0$, bounding boxes $bbox$, class label $c$
2: while not converged do
3:     $z_0$, prompt = Encoder($x_0$), $c$
4:     $t$, noise = random.choice($[0, T]$), random.randn(0, 1, $z_0$.shape)
5:     $z_t = \sqrt{\alpha_t} z_0 + \sqrt{1-\alpha_t}\,$noise
6:     _, CAM = DM($z_t$, prompt, $t$)  ▷ Store the CAM for latent object detection.
7:     $bbox_p$, $conf_p$ = Detector(CAM)
8:     $loss_{bbox}$ = $L_{bbox}(bbox_p, bbox)$
9:     $conf_{ij}$ = the center of $bbox$ is in the $ij$-th cell
10:     $loss_{conf}$ = $L_{conf}(conf_p, conf)$
11:     update Detector with $loss_{bbox} + \lambda\, loss_{conf}$  ▷ Only the bbox loss and conf loss are used.
12: end while

Algorithm 4 Detection of Detector Guidance for Stable Diffusion.
1: initial noise $x_T$, text condition $c$
2: core nouns $cn$, noun phrases $np$ = Parser($c$)  ▷ Extract noun phrases from the text condition.
3: for $t = T, T-1, \cdots, T-I+1$ do
4:     $x_{t-1}$, CAM = DM($x_t$, $c$)  ▷ Waiting for the CAM to become meaningful.
5: end for
6: for $t = T-I, T-I-1, \cdots, 1$ do
7:     $x_{t-1}$, CAM = DM($x_t$, $c$, $\text{BBox}_{t+1}$, $\sigma_{t+1}$)  ▷ The correction is made in this step.
8:     CAM = CAM[$cn$]  ▷ The CAMs for core nouns form a batch of input.
9:     $\text{BBox}_t$ = Detection(CAM)  ▷ Only do detection for the core nouns.
10:     $\sigma_t$ = Otsu(CAM)
11: end for
12: return image = Decoder($x_0$)

Algorithm 5 Correction of Detector Guidance for Stable Diffusion.
1: bounding boxes $\text{BBox}_{t+1}$, threshold $\sigma_{t+1}$, smoothing factor $s$
2: for block $\in$ Stable Diffusion do
3:     if block is a cross-attention block then
4:         $K, Q_c, Q_{uc}, V$ = block(input)
5:         $\text{CAM}_0$, $\text{CAM}_{uc}$ = $KQ_c^T$, $KQ_{uc}^T$
6:         m = Mask($\text{CAM}_0$, $\text{BBox}_{t+1}$, $\sigma_{t+1}$)  ▷ Boundary Correction
7:         $\text{CAM}_1$ = $\text{CAM}_0 * (1 - \text{m}) + \text{CAM}_{uc} * \text{m}$  ▷ Conflict Elimination
8:         $\text{CAM}_2 = \text{CAM}_1 * \frac{\text{CAM}_0.\text{max}(\text{dim}=-1)}{\text{CAM}_1.\text{max}(\text{dim}=-1)}$  ▷ Target Enhancement
9:         $\text{CAM}_3$ = $\text{CAM}_2 * s + \text{CAM}_0 * (1 - s)$  ▷ Smooth Involvement
10:         output = $\text{Softmax}(\text{CAM}_3/\sqrt{d}) * V$
11:     else
12:         output = block(input)
13:     end if
14:     input = output
15: end for

A.4 Multi-Related Object Benchmark

We use GPT-4 to generate text-to-image prompts. The prompt given to GPT-4 is:

“The pattern consists of visually similar yet contrasting pairs of animals or objects that can naturally coexist in a single image, emphasizing their unique attributes to create engaging and descriptive prompts. The output format should be adjective + noun and adjective + noun, such as a white cat and a black dog. Now please give me 20 prompts to meet the above requirements.”

Here, we show the whole list of the prompts in the Multi-Related Object benchmark (MRO).

  1. a fluffy sheep and a bare goat
  2. a friendly koala and a watchful kangaroo
  3. a howling wolf and a purring cat
  4. a white cat and a brown dog
  5. a golden retriever and a gray wolf
  6. a regal lion and a sly fox
  7. a striped tiger and a spotted leopard
  8. a wise owl and a nimble squirrel
  9. a wild mustang and a graceful deer
  10. a robust bison and a dainty gazelle
  11. a soft bunny and a spiky porcupine
  12. a swift cheetah and a lumbering bear
  13. a cunning coyote and a timid deer
  14. a towering giraffe and a sturdy elephant
  15. a sprightly hare and a slow-moving tortoise
  16. a spotted hyena and a striped zebra
  17. a fierce falcon and a gentle dove
  18. a swift hummingbird and a perching eagle
  19. a vibrant toucan and a modest pigeon
  20. a chatty parrot and a silent owl
  21. a luminescent jellyfish and a matte sea turtle
  22. a fierce crocodile and a docile manatee
  23. a beautiful butterfly and a fluffy bee
  24. a hovering dragonfly and a perched hummingbird
  25. a wispy dandelion and a dense sunflower
  26. a red apple and a green pear
  27. a ripe peach and a tangy orange
  28. a succulent pineapple and a crisp apple
  29. a rusty robot and a delicate Muppet
  30. a futuristic drone and a traditional kite

A.5 Human Evaluation

In Figure 10, we show the website used for human evaluation. The evaluator is required to choose one of three options - A is better, B is better, and Tie - based on the alignment between the prompt and the image.

Figure 10: The website used for human evaluation.

A.6 Project License

Here, we present the GitHub repository addresses and the project licenses for the main open-source projects used in this paper.

name       | GitHub                                          | license
YOLOv1     | https://github.com/yjh0410/new-YOLOv1_PyTorch   | n/a
DDPM       | https://github.com/hojonathanho/diffusion       | n/a
DDIM       | https://github.com/ermongroup/ddim              | MIT license
PNDM       | https://github.com/luping-liu/PNDM              | Apache-2.0 license
dpm-solver | https://github.com/LuChengTHU/dpm-solver        | MIT license
SD v1.4    | https://github.com/CompVis/stable-diffusion     | CreativeML Open RAIL-M
SD v2.1    | https://github.com/Stability-AI/stablediffusion | CreativeML Open RAIL++-M