
DivCon: Divide and Conquer for Progressive Text-to-Image Generation

Yuhao Jia1, Wenhan Tan2
Abstract

Diffusion-driven text-to-image (T2I) generation has achieved remarkable advancements in recent years. To further improve T2I models’ capability in numerical and spatial reasoning, layout is employed as an intermediate representation to bridge large language models and layout-based diffusion models. However, these methods still struggle to generate images from prompts with multiple objects and complicated spatial relationships. To tackle this challenge, we introduce a divide-and-conquer approach which decouples the generation task into multiple subtasks. First, the layout prediction stage is divided into numerical & spatial reasoning and bounding box prediction to better parse the text prompt and obtain accurate composition information. Second, the layout-to-image generation stage is divided into two steps to synthesize objects from easy ones to difficult ones. Experiments are conducted on the HRS and NSR-1K benchmarks, and our method outperforms previous approaches by notable margins. In addition, visual results and a user study demonstrate that our approach significantly improves perceptual quality, especially when generating multiple objects from complex textual prompts.

Code: https://github.com/DivCon-gen/DivCon

Introduction

Large-scale text-to-image generation (T2I) models demonstrate exceptional zero-shot capacity in efficiently generating small quantities and varieties of objects (Rombach et al. 2022; Chefer et al. 2023; Balaji et al. 2022; Kang et al. 2023; Ramesh et al. 2021; Saharia et al. 2022). However, these methods still suffer from limited capability when generating images from text prompts containing specified object counts, varying sizes, rich details, and complicated spatial relationships (Bakr et al. 2023; Chefer et al. 2023; Feng et al. 2023a; Chen, Laina, and Vedaldi 2023a; Xiao et al. 2023). Recently, combining large language models (LLMs) and layout-to-image diffusion models has significantly improved T2I models’ capability in numerical and spatial reasoning (Feng et al. 2023b; Phung, Ge, and Huang 2024). These methods leverage the strong reasoning and visual planning ability of LLMs to predict objects’ layouts from the text prompt, which provides additional informative conditions for diffusion models to produce higher-quality images.

Despite recent significant advancements, existing methods still suffer from limited accuracy when the text prompt contains multiple objects and complicated spatial relationships (Bakr et al. 2023; Phung, Ge, and Huang 2024). On the one hand, LLMs encounter great challenges when conducting simultaneous reasoning and visual planning from complicated text prompts (e.g., “five mice are scurrying around a wooden table, each nibbling on one of the five ripe apples that are laid out on top”). On the other hand, since the difficulty of synthesizing objects with different characteristics varies considerably, layout-to-image diffusion models exhibit varying levels of generative proficiency for objects with different sizes, shapes, and details. However, previous methods commonly conduct reasoning and visual planning together and generate all objects simultaneously without taking these differences into consideration. As a result, more difficult objects are often poorly synthesized in the images (e.g., the toaster and pizza in Fig. 2).

Figure 1: Layouts and images generated by our DivCon. DivCon enhances the capability of text-to-image diffusion models to understand complex numerical and spatial relationships in the text.

To address this issue, we propose DivCon, a training-free approach that applies the divide-and-conquer strategy to both the layout prediction and image generation stages. During layout prediction, the number of objects and their spatial relationships are first reasoned from the text prompt. Then, visual planning is conducted to predict the bounding boxes of the objects. By decoupling layout prediction into reasoning and visual planning, more accurate composition information can be obtained. During layout-to-image generation, the diffusion model is first executed to process the layout and generate all objects. Then, the consistency between the resultant objects and the text prompt is calculated. With objects of low consistency being highlighted, the diffusion model is executed a second time to encourage the network to focus more on these difficult objects. By dividing layout-to-image generation into two steps, objects with different levels of difficulty can be synthesized progressively for higher quality. Extensive experiments demonstrate that our DivCon significantly improves the quality of generated images and achieves state-of-the-art performance on the HRS (Bakr et al. 2023) and NSR-1K (Feng et al. 2023b) benchmark datasets.

Our main contributions can be summarized as follows:

  • We propose DivCon, a divide-and-conquer approach for layout-based text-to-image generation by dividing the complicated task into multiple subtasks.

  • To obtain accurate composition information from the text prompt, we divide the layout prediction stage into numerical & spatial reasoning and bounding box planning.

  • To generate high-fidelity images from the layout, we divide layout-to-image generation into two steps to synthesize objects with different levels of difficulty.

  • Experimental results show that our approach achieves significant performance gains, surpassing current state-of-the-art models.

Figure 2: A comparison of generation difficulty between “toaster” and “pizza”.
Figure 3: Comparison of easy and hard samples in a generated image.
Figure 4: The proposed DivCon framework. The layout prediction stage (stage 1) is divided into numerical & spatial reasoning and bounding box planning. The layout-to-image generation stage (stage 2) is divided into two steps to generate objects with different difficulty levels separately.

Related Work

In this section, we review three main categories of related methods: text-to-image (T2I) generation models, layout-conditioned T2I models, and text-to-layout-to-image generation models.

2.1 Text-to-Image (T2I) Generation Models

The evolution of large-scale T2I synthesis demonstrates a significant leap forward in the realm of generative models, enabling the creation of detailed and coherent images directly from textual descriptions (Balaji et al. 2022; Gafni et al. 2022; Ramesh et al. 2021; Kang et al. 2023; Saharia et al. 2022; Yu et al. 2022). Milestone methods in this area are Generative Adversarial Networks (GANs) (Kang et al. 2023; Sauer et al. 2023), Variational Autoencoders (VAEs) (Chang et al. 2023; Ding et al. 2022; Ramesh et al. 2021; Yu et al. 2022), and diffusion models (Balaji et al. 2022; Ho, Jain, and Abbeel 2020; Nichol et al. 2022; Ramesh et al. 2022; Saharia et al. 2022). GANs often struggle with training stability and mode collapse, whereas VAEs may sometimes produce overly smooth or blurry outputs due to their reconstruction loss. Recently, diffusion models have been highly valued for their robustness and ability to generate detailed and realistic images. Particularly, Stable Diffusion (Rombach et al. 2022), which utilizes text embeddings extracted by CLIP (Radford et al. 2021) as conditions, has demonstrated remarkable performance for open-world image synthesis. Despite these promising results, a text prompt alone usually provides limited fine-grained control over the generated images.

2.2 Layout-conditioned T2I Models

To improve the controllability of T2I models, additional conditions such as layouts are employed for T2I models. Existing layout-conditioned generation models can be categorized into three branches. One branch of methods trains external networks on layout-image pairs (Li et al. 2023b; Zhang, Rao, and Agrawala 2023; Nichol et al. 2022; Avrahami et al. 2023; Yang et al. 2022). For instance, GLIGEN incorporates a trainable gated self-attention layer that transforms bounding boxes into conditional input for Stable Diffusion. Another branch of methods, including attention-refocusing (Phung, Ge, and Huang 2024), layout-predictor (Wu et al. 2023), directed-diffusion (Ma et al. 2023), layout-guidance (Chen, Laina, and Vedaldi 2023b), and BoxDiff (Xie et al. 2023), directly manipulates attention maps to optimize the sampling process of the diffusion model. Since these methods generate all objects through a single forward pass of the diffusion model, objects with spatial relationships usually suffer from limited quality. To remedy this, the last branch of methods (including Mixture-of-Diffusion (Barbero Jiménez 2023), MultiDiffusion (Bar-Tal et al. 2023), and LLM-grounded Diffusion (Lian et al. 2023)) first synthesizes each object (i.e., bounding box) separately and then composes them to produce the final result. Nevertheless, the computational cost of these methods scales with the number of objects, leading to considerable overhead for text prompts containing many objects. In addition, the relationships between objects are disrupted and cannot be well maintained in the generated images.

2.3 Text-to-Layout-to-Image Generation Models

Current LLMs have demonstrated strong numerical and spatial reasoning capabilities in visual planning and layout generation. Inspired by this, several efforts have been made to integrate LLMs to generate layouts from text, and then utilize layout-conditioned diffusion models for image generation. Attention-refocusing (Phung, Ge, and Huang 2024) and LLM-grounded Diffusion (Lian et al. 2023) explore in-context learning of GPT-4 for layout prediction. LayoutGPT (Feng et al. 2023b) retrieves similar samples from a database as input examples for LLMs. LLM-layout-generator (Chen et al. 2023) leverages Chain-of-Thought (CoT) prompting (Wei et al. 2022) to improve layout prediction accuracy. However, in these methods, LLMs are expected to simultaneously perform numerical & spatial reasoning and visual planning of bounding boxes. As a result, inferior accuracy is produced, especially for complicated text prompts. In contrast to these methods, we propose a divide-and-conquer approach that conducts reasoning and layout prediction separately, which improves the accuracy of layout predictions and substantially advances the prompt fidelity of generated images. In addition, during layout-to-image generation, the objects are synthesized progressively according to their difficulty, which achieves not only superior fidelity but also significant computational cost savings.

Methods | Numerical Layout (Pre. / Rec. / F1 / Acc.) | Numerical Image (Pre. / Rec. / F1 / Acc.) | Spatial Layout Acc. | Spatial Image Acc.
Stable Diffusion 2.1 | - / - / - / - | 73.97 / 53.93 / 62.38 / 16.20 | - | 7.96
Attend-and-Excite | - / - / - / - | 76.57 / 55.94 / 64.65 / 18.37 | - | 9.23
Divide-and-Bind | - / - / - / - | 77.95 / 57.50 / 66.18 / 20.15 | - | 13.15
Our layout + Layout Guidance | - / - / - / - | 71.09 / 45.98 / 55.84 / 16.58 | - | 29.85
Our layout + GLIGEN | - / - / - / - | 75.34 / 62.14 / 69.01 / 23.72 | - | 48.41
Attention-Refocusing | 88.62 / 91.91 / 90.44 / 77.05 | 77.43 / 61.78 / 68.72 / 25.60 | 89.57 | 48.29
DivCon (Ours) | 94.17 / 89.53 / 91.80 / 80.73 | 78.65 / 63.41 / 70.22 / 29.97 | 91.87 | 54.21
Table 1: Comparison of our method with baselines on HRS benchmark. Text in bold shows the best results of each metric.

Method

As illustrated in Fig. 4, our framework comprises two stages. Given a text prompt, an LLM is first adopted to parse it and predict a layout. Then, the resultant layout is fed to the diffusion model, together with the text prompt, as an additional condition to achieve text-to-image generation. More specifically, each stage is further divided into two steps to handle simpler subtasks.
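
To make the overall workflow concrete, the sketch below outlines the two-stage pipeline in Python. All helper functions (llm_reason, llm_plan_boxes, generate, evaluate_consistency, refine) and the threshold value are hypothetical placeholders standing in for the components detailed in the following subsections, not the released implementation.

```python
# Minimal sketch of the DivCon pipeline; all helper functions and the threshold
# value are hypothetical placeholders for the components described below.

def divcon(prompt: str, clip_threshold: float = 0.25):
    # Stage 1, step 1: numerical & spatial reasoning with an LLM.
    object_descriptions = llm_reason(prompt)               # e.g., ["5 spoons", "4 oranges"]

    # Stage 1, step 2: bounding-box planning conditioned on the reasoning result.
    layout = llm_plan_boxes(prompt, object_descriptions)   # [(name, (l, t, r, b)), ...]

    # Stage 2, step 1: first-round layout-to-image generation.
    image, latents = generate(prompt, layout)

    # Consistency evaluation: split objects into easy ("good enough") and hard samples.
    easy, hard = evaluate_consistency(image, layout, clip_threshold)

    # Stage 2, step 2: second-round refinement focused on the hard objects,
    # while latents inside the easy boxes are preserved.
    if hard:
        image = refine(prompt, hard, latents, keep_regions=easy)
    return image
```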

3.1 Text-to-Layout Prediction

Intuitively, to predict the layout from a text prompt, the number of objects and their spatial relations must first be reasoned, and bounding boxes are then generated to describe the coordinates and sizes of the objects. However, previous methods commonly conduct these two steps simultaneously without considering their inherent correlation, which hinders them from understanding complicated text prompts well. To remedy this, we employ the divide-and-conquer strategy to improve the accuracy of predicted layouts. Specifically, the text prompt is first fed to an LLM (e.g., GPT-4 (OpenAI 2024)) to perform numerical and spatial reasoning and produce explicit instructions about the objects, as illustrated in Fig. 4(a). Then, these instructions are adopted as additional conditions to predict more accurate bounding boxes.

(1) Numerical and Spatial Reasoning.

Given a text prompt $C$ describing an image, an LLM is adopted to predict a set of strings $O=\{s_{j}|j=1,2,...,n\}$ that describe the objects in the prompt. If the prompt primarily focuses on counting relationships among objects, each string $s_{j}$ includes both the name and quantity of object category $j$. If the prompt mainly describes spatial relationships, each string $s_{j}$ contains the object name and its spatial position in the image. After numerical and spatial reasoning, the numerical information and spatial relationships can be explicitly parsed from the complicated text prompt as intermediate conditions.
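
The sketch below illustrates this reasoning step with the OpenAI Python client. The instruction wording and the line-based output format are assumptions for illustration; the paper only specifies that the LLM returns a set of strings describing each object together with its count or spatial position.

```python
# Illustrative sketch of the reasoning step (step 1 of layout prediction).
# The prompt wording and output convention are assumptions, not the paper's exact prompt.
from openai import OpenAI

client = OpenAI()

def llm_reason(prompt: str) -> list[str]:
    instruction = (
        "List every object category mentioned in the caption, one per line. "
        "For counting captions, write '<count> <object>'; for spatial captions, "
        "write '<object>: <position>'.\n"
        f"Caption: {prompt}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": instruction}],
    )
    lines = response.choices[0].message.content.splitlines()
    return [line.strip() for line in lines if line.strip()]   # e.g., ["5 mice", "5 apples"]
```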

(2) Bounding Box Prediction.

After numerical and spatial reasoning, both the meta prompt $C$ and the intermediate conditions $O$ are fed to the LLM again to predict a list of name-box pairs $P=\{p_{i}|i=1,2,...,n\}$. Each pair $p_{i}=\{o_{i}:(l_{i},t_{i},r_{i},b_{i})\}$ consists of the object name $o_{i}$ and its bounding box coordinates $(l_{i},t_{i},r_{i},b_{i})$. Meanwhile, in-context exemplars that include a process of correcting the generated bounding boxes are provided to the LLM to improve the quality of the predicted layouts.
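
A minimal sketch of this second call is given below. The canvas size, the JSON output convention, and the prompt wording are illustrative assumptions; in practice the in-context exemplar with a box-correction process mentioned above would also be included in the instruction.

```python
# Sketch of the bounding-box planning step: the original caption C and the
# reasoning output O are fed back to the LLM. Output format is an assumption.
import json
from openai import OpenAI

client = OpenAI()

def llm_plan_boxes(prompt: str, object_descriptions: list[str]):
    instruction = (
        "Place every object instance on a 512x512 canvas. Return a JSON list of "
        "[object_name, [left, top, right, bottom]] pairs, one entry per instance.\n"
        f"Caption: {prompt}\n"
        f"Objects: {'; '.join(object_descriptions)}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": instruction}],
    )
    pairs = json.loads(response.choices[0].message.content)
    return [(name, tuple(box)) for name, box in pairs]   # [(o_i, (l_i, t_i, r_i, b_i)), ...]
```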

3.2 Layout-to-Image Generation

After layout prediction, the resultant layout is employed as an additional condition to generate the image. As objects in the layout have different levels of difficulty to synthesize, a divide-and-conquer approach is adopted to reconstruct objects from easy ones to hard ones in a progressive manner, as shown in Fig. 4(b). First, the layout is fed to the diffusion model to synthesize all objects. Then, the consistency between the synthesized objects and the text prompt is calculated. Objects with low consistency are considered low-fidelity and fed to the diffusion model a second time, while the results of objects with high consistency are maintained. With low-fidelity objects being highlighted, the diffusion model is encouraged to focus on these difficult objects in the second forward pass to achieve higher quality.

(1) First-Round Generation.

We employ a layout-to-image diffusion model with sampling optimization (Phung, Ge, and Huang 2024) as the generative model to generate an image conditioned on the text prompt and the predicted layout. During the denoising process, random noise $z_{T}$ is fed to the diffusion model and gradually denoised to an image $z_{0}$ after $T$ time steps.

(2) Consistency Evaluation.

In the generated image $z_{0}$, each object is cropped using the bounding boxes in the layout and denoted as $x_{i}\in\mathbb{R}^{4\times h\times w}$, where $h\times w$ represents the height and width of the bounding box. Then, the CLIP similarity $S_{x_{i},o_{i}}\in(-1,1)$ is employed to evaluate the consistency between the cropped object and its name $o_{i}$ in the text prompt. The higher the similarity, the higher the fidelity of the corresponding object. Consequently, objects with similarities above a specified threshold are considered easy samples that are “good enough”. In contrast, objects with low similarities are considered hard samples that are more difficult to synthesize (e.g., “sheep” and “horse” in Fig. 3). Next, a binary mask $M$ is produced with the bounding boxes of easy objects set to 1 and the remaining regions set to 0.
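
The sketch below shows one way to implement this evaluation with CLIP via Hugging Face transformers. The checkpoint, the prompt template "a photo of a <name>", and the threshold value are illustrative assumptions; for simplicity it crops the decoded image rather than the latent.

```python
# Sketch of the consistency evaluation: CLIP cosine similarity between each cropped
# box and its object name, thresholded into easy ("good enough") and hard samples.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def evaluate_consistency(image: Image.Image, layout, threshold: float = 0.25):
    easy, hard = [], []
    for name, (l, t, r, b) in layout:
        crop = image.crop((l, t, r, b))
        inputs = processor(text=[f"a photo of a {name}"], images=crop,
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            img_feat = model.get_image_features(pixel_values=inputs["pixel_values"])
            txt_feat = model.get_text_features(input_ids=inputs["input_ids"],
                                               attention_mask=inputs["attention_mask"])
        sim = torch.cosine_similarity(img_feat, txt_feat).item()   # in (-1, 1)
        (easy if sim >= threshold else hard).append((name, (l, t, r, b)))
    return easy, hard
```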

(3) Second-Round Refinement.

The objective of the second-round refinement is to preserve the “good enough” objects while reproducing the remaining ones, easing the task of the diffusion model. To this end, a straightforward approach is to retrieve the latent representations of these objects from the first-round denoising process. Nevertheless, directly freezing the object regions during the denoising process produces inharmonious artifacts around the boundaries. To address this issue, we retrieve the latent representation $z_{T}$ at every time step of the first-round denoising process. Then, the binary mask $M$ is adopted to crop the high-fidelity regions in the latent representation. At each time step $T$ of the second-round forward pass, these cropped representations replace the corresponding regions in the noise:

$z^{\prime}_{T}=z_{T}\circ M+z^{\prime}_{T}\circ(1-M)$   (1)

where $\circ$ denotes element-wise multiplication and $z^{\prime}_{T}$ is the noise at time step $T$ of the second-round forward pass. With the latent representations of high-fidelity objects (masked by $M$) being retrieved, these objects can be well preserved, encouraging the model to focus on reconstructing the remaining ones. Meanwhile, the bounding boxes of the “bad samples” $P^{\prime}$ are passed to the diffusion model as layout conditions, enabling the model to concentrate on refining $P^{\prime}$ and further improving the overall quality of the generated image.
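
A minimal sketch of this refinement loop is given below. It implements Eq. (1) by blending the stored first-round latents into the second-round trajectory at every step; denoise_step is a hypothetical placeholder for one guided sampling step of the layout-to-image diffusion model.

```python
# Sketch of the second-round refinement: latents inside the high-fidelity boxes
# (mask M = 1) are copied from the first-round trajectory at every step, while the
# remaining regions are re-denoised with the layout condition restricted to P'.
import torch

def second_round(first_round_latents: list[torch.Tensor],  # [z_T, ..., z_0] from round 1
                 mask: torch.Tensor,                        # M: 1 inside easy boxes, 0 elsewhere
                 hard_layout,                               # P': boxes of low-fidelity objects
                 prompt: str) -> torch.Tensor:
    z = torch.randn_like(first_round_latents[0])            # fresh noise z'_T
    num_steps = len(first_round_latents) - 1
    for t in range(num_steps):
        # Eq. (1): z'_t = z_t * M + z'_t * (1 - M)
        z = first_round_latents[t] * mask + z * (1 - mask)
        # One denoising step, with attention guidance focused on the hard boxes P'.
        z = denoise_step(z, step=t, prompt=prompt, layout=hard_layout)
    return z                                                 # refined latent z'_0
```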

Methods | Numerical Layout (Pre. / Rec. / F1 / Acc.) | Numerical Image (Pre. / Rec. / F1 / Acc.) | Spatial Layout Acc. | Spatial Image Acc.
Stable Diffusion 2.1 | - / - / - / - | 78.94 / 75.42 / 77.41 / 33.99 | - | 22.97
Attend-and-Excite | - / - / - / - | 79.21 / 78.62 / 78.91 / 39.51 | - | 29.61
Divide-and-Bind | - / - / - / - | 84.99 / 79.13 / 81.96 / 32.35 | - | 24.00
Our layout + Layout Guidance | - / - / - / - | 81.97 / 75.41 / 78.55 / 43.13 | - | 52.88
Our layout + GLIGEN | - / - / - / - | 85.13 / 84.97 / 85.05 / 52.21 | - | 59.72
Attention-Refocusing | 98.51 / 99.11 / 98.81 / 94.31 | 84.61 / 85.64 / 85.12 / 56.10 | 95.05 | 69.28
DivCon (Ours) | 98.47 / 98.35 / 98.41 / 94.39 | 85.41 / 87.06 / 86.22 / 57.35 | 96.23 | 73.15
Table 2: Comparison of our method with baselines on NSR-1K benchmark. Text in bold shows the best results of each metric.

Experiments

In this section, we provide an evaluation of our method and compare it with state-of-the-art T2I models. Ablation experiments are then conducted to demonstrate the effectiveness of our divide-and-conquer approach in layout prediction and layout-to-image generation stages.

4.1 Experiment Setup

(1) Benchmarks and Baselines

To evaluate the performance of spatial and numerical reasoning in text-to-image generation tasks, we use the HRS (Bakr et al. 2023) and NSR-1K (Feng et al. 2023b) benchmarks. The HRS dataset contains 3,000 prompts (labeled with objects’ names and specific counts) for numerical relationships and 1,002 prompts (labeled with objects’ names and corresponding positions) for spatial relationships. The NSR-1K dataset consists of 762 prompts for numerical relationships and 283 prompts for spatial relationships, all labeled with names, counts, bounding boxes, and ground-truth images. Compared to NSR-1K, HRS exclusively employs natural language prompts, with a larger number of objects per sample and more intricate spatial relations. For these benchmarks, Stable Diffusion (Rombach et al. 2022), Attend-and-Excite (Chefer et al. 2023), Divide-and-Bind (Li et al. 2023a), GLIGEN (Li et al. 2023b), Layout-Guidance (Chen, Laina, and Vedaldi 2023b), and Attention-Refocusing (Phung, Ge, and Huang 2024) are included for comparison. In addition, we also include the dataset from (Lian et al. 2023) for evaluation, which covers four tasks: negation, generative numeracy, attribute binding, and spatial reasoning. As our method focuses on numerical and spatial reasoning, we select 100 prompts from these two tasks to construct a benchmark termed LLM-Grounded.

(2) Evaluation Metrics

Quality. To assess the quality of generated images, we use FID as the metric (Heusel et al. 2017). Lower FID represents higher similarity between generated images and ground truths.

Fidelity. To evaluate the fidelity of the generated images, grounding accuracy between layouts/images and the text prompt is calculated. Particularly, YOLOv8 (Jocher, Chaurasia, and Qiu 2023) is employed to obtain the bounding boxes of the objects in the generated images. For counting, the number of bounding boxes is compared to the ground truth to calculate precision, recall, F1 score, and accuracy (El-Nouby et al. 2019; Fu et al. 2020). For spatial relationships, the relative position between two bounding boxes is evaluated following PaintSkills (Cho, Zala, and Bansal 2022) and LayoutGPT (Feng et al. 2023b). Specifically, if the horizontal distance between two bounding boxes exceeds the vertical distance, they are considered to form a left-right relationship; otherwise, they are considered to form a top-bottom relationship.
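
The sketch below implements this spatial-relation rule. Measuring the separation between box centers (rather than box edges) is an assumption of this sketch; the paper only states that horizontal and vertical distances are compared.

```python
# Sketch of the spatial-relation rule: if the horizontal separation between two
# boxes exceeds the vertical one, the pair is a left/right relation, otherwise
# a top/bottom relation. Image coordinates are assumed (y increases downward).

def spatial_relation(box_a, box_b):
    """box = (left, top, right, bottom); returns e.g. 'a left of b' or 'a above b'."""
    ax, ay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    bx, by = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    dx, dy = bx - ax, by - ay
    if abs(dx) > abs(dy):                       # left-right relationship
        return "a left of b" if dx > 0 else "a right of b"
    return "a above b" if dy > 0 else "a below b"   # top-bottom relationship
```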

Efficiency. To validate the efficiency and scalability of our method, inference runtime is evaluated on an AWS g5.xlarge GPU instance.

Figure 5: Qualitative comparison of numerical & spatial reasoning between Stable Diffusion, Attend-and-Excite, Attention-Refocusing, and our DivCon.

4.2 Evaluation Results

(1) Quantitative Results

The HRS dataset. We present the results on the HRS benchmark in Tab. 1. As we can see, our DivCon outperforms previous T2I models by over 10% in numerical accuracy and about 45% in spatial accuracy, demonstrating its superior numerical and spatial reasoning performance. In addition, DivCon surpasses Attention Refocusing (Phung, Ge, and Huang 2024) on most metrics. Specifically, our layout prediction improves numerical accuracy by 3% and spatial accuracy by 2%, respectively. For image generation, the F1 score and accuracy achieve 2% and 4% gains, while the spatial accuracy is boosted by up to 6%. By dividing the T2I task into simple subtasks, our DivCon can better understand the text prompt to achieve higher fidelity.

The NSR-1K dataset. The results on the NSR-1K benchmark are presented in Tab. 2. It can be observed that DivCon outperforms other T2I models by over 20% in numerical accuracy and about 45% in spatial accuracy. Compared to Attention Refocusing (Phung, Ge, and Huang 2024), our DivCon produces competitive results in terms of layout accuracy. Meanwhile, DivCon achieves approximately 2% gains in precision, recall, and F1 score for numerical reasoning, and surpasses it by nearly 4% in terms of spatial accuracy. Quantitative results in terms of image quality are shown in Tab. 3. As we can see, our DivCon produces FID scores comparable to Stable Diffusion and slightly better than Attention-Refocusing. This further demonstrates that our approach produces images with competitive visual quality while maintaining high prompt fidelity.

Methods FID (↓)
Stable Diffusion 2.1 20.77
Attention-Refocusing 21.05
DivCon (Ours) 21.01
Table 3: FID results achieved on NSR-1K.

LLM-Grounded benchmark. Table 4 presents the accuracy achieved by our method and LLM-grounded Diffusion. It is clear that our method achieves performance comparable to LLM-Grounded Diffusion on the numerical reasoning task while outperforming it by 5% on the spatial reasoning task. This further validates the superiority of our method.

Methods Numerical Spatial
LLM-Grounded (GPT4) 84% 82%
DivCon (Ours) 84% 87%
Table 4: Comparison with LLM-Grounded

Runtime comparisons are presented in Tab. 5. Among techniques that require multiple forward passes to synthesize different regions separately, our DivCon produces notable efficiency gains. Particularly, LLM-Grounded takes 134.98 s to generate an image, while our method achieves a 1.87× speedup. Compared with methods using a single forward pass, our method (1 iter) produces competitive results with comparable efficiency.

Methods #Forward Pass Runtime
Stable Diffusion 1 15.81 s
GLIGEN 1 33.75 s
Attention-Refocusing 1 35.71 s
DivCon-1 iter (Ours) 1 37.35 s
Mixture-of-Diffusion 3 126.41 s
LLM-Grounded 3 134.98 s
Multi-Diffusion 3 121.68 s
DivCon-2 iter (Ours) 2 72.01 s
Table 5: Comparison of Inference Runtime.

(2) Qualitative Results

Figure 5 illustrates the qualitative comparison of DivCon and baselines. In all cases, our DivCon can accurately predict the object quantities and the spatial relations from the text prompts.

Large Quantity of Objects. Given a prompt containing more than one object category (Fig. 5), Stable Diffusion (Rombach et al. 2022) and Attend-and-Excite (Chefer et al. 2023) often generate an insufficient quantity of objects, while Attention-Refocusing (Phung, Ge, and Huang 2024) sometimes omits one category of objects or generates an excessive number of objects. For instance, the prompt in Fig. 5(b) contains five spoons and four oranges. However, Stable Diffusion and Attend-and-Excite cannot synthesize sufficient spoons or oranges. Meanwhile, Attention-Refocusing generates an excessive number of objects, with spoons appearing incomplete due to missing parts. Compared to these methods, our DivCon achieves superior numerical accuracy.

Complicated Spatial Relation. Given a prompt containing complex spatial relationships among more than two object categories, previous T2I models struggle to generate correct spatial relations. For example, Stable Diffusion and Attend-and-Excite miss certain object categories (Fig. 5(d)). Meanwhile, Attention-Refocusing tends to produce distortion or blending artifacts around the boundaries of objects (Fig. 5(c)(d)) or merges the color of the banana with the person (Fig. 5(d)). In contrast, our DivCon produces accurate object categories and spatial relationships, exhibiting superior performance in these challenging cases.

User Study. We conduct a user study to evaluate the images generated by different methods. Specifically, we recruit 17 people with different backgrounds and ask them to rank the images generated by different methods according to their fidelity to the original prompt. Lower ranks indicate better alignment. A total of 20 examples are randomly selected from the NSR-1K benchmark. As shown in Fig. 6, our method achieves the best performance in terms of average rank, which demonstrates the superior perceptual quality of our results.

Figure 6: User Study: Comparison of image quality.

4.3 Ablation Study

In this sub-section, we conduct ablation experiments to evaluate the effectiveness of our divide-and-conquer approach in both layout prediction and image generation. Results are presented in Tab. 6 and Attention-Refocusing (Phung, Ge, and Huang 2024) is used as the baseline.

Divide-and-Conquer in Layout Prediction. To demonstrate the effectiveness of our divide-and-conquer approach in the layout prediction stage, we develop a variant that conducts numerical & spatial reasoning and bounding box prediction in a single step. As we can see, our divide-and-conquer approach produces consistent improvements across all metrics, with an increase of 4% in both numerical and spatial accuracy. We further compare the layouts predicted by our DivCon and Attention-Refocusing in Fig. 7. Compared to Attention-Refocusing, layouts generated by our DivCon not only exhibit higher reasoning accuracy (Fig. 7(a)) but also are better organized with less overlap (Fig. 7(b)).

Setting | Layout w/ DivCon | Image w/ DivCon | Numerical (Precision / Recall / F1 / Acc.) | Spatial Acc.
(1) | - | - | 77.43 / 61.78 / 68.72 / 25.60 | 48.29
(2) | ✓ | - | 78.26 / 62.61 / 69.57 / 29.46 | 52.74
(3) | - | ✓ | 78.18 / 62.36 / 69.38 / 27.83 | 53.17
(4) | ✓ | ✓ | 79.13 / 63.53 / 70.48 / 29.62 | 53.87
Table 6: Ablation study of our DivCon on the HRS benchmark. “Layout w/ DivCon”: layout prediction with the divide-and-conquer strategy. “Image w/ DivCon”: layout-to-image generation with the divide-and-conquer strategy.
Figure 7: Comparison of the predicted layouts.
Figure 8: Ablation study of our image generation.

Divide-and-Conquer in Image Generation. To study the effectiveness of our divide-and-conquer approach in the image generation stage, we develop a variant that synthesizes the image in a single forward pass. It can be observed from Tab. 6 that the divide-and-conquer approach leads to notable performance gains on all metrics. We further compare the results produced by our method using prompts containing multiple objects in Fig. 8. As we can see, the divide-and-conquer approach enables our method to better reconstruct the objects. For example, DivCon accurately reconstructs the small objects (Fig. 8(a)) and avoids generating the dog outside its bounding box (Fig. 8(b)).

Overall, with our divide-and-conquer approach in both the layout prediction and image generation stages, our DivCon produces the best accuracy and surpasses Attention-Refocusing by approximately 2% in numerical F1 score, 4% in numerical accuracy, and 6% in spatial accuracy (Tab. 6).

4.4 Limitations and Discussions

As illustrated in Fig. 9, when the text prompt describes objects with extreme overlap, the diffusion model may fail to generate all objects accurately. This limitation stems from the limited capacity of layout-conditioned image generation models to handle overlapping boxes. Despite these challenges, our approach still demonstrates improved quality in layout prediction compared to the base model.

Figure 9: An illustration of a failure case.

Conclusion

In this work, we introduce DivCon, a novel divide-and-conquer approach that divides the text-to-image generation task into multiple subtasks to achieve higher image generation quality. Specifically, layout prediction is divided into two steps that conduct numerical & spatial reasoning and bounding box prediction, respectively. Then, layout-to-image generation is performed in a progressive manner to synthesize objects in an easy-to-hard fashion. Comprehensive experiments demonstrate that our DivCon outperforms existing state-of-the-art grounded text-to-image models in terms of both image quality and prompt fidelity.

References

  • Avrahami et al. (2023) Avrahami, O.; Hayes, T.; Gafni, O.; Gupta, S.; Taigman, Y.; Parikh, D.; Lischinski, D.; Fried, O.; and Yin, X. 2023. Spatext: Spatio-Textual Representation for Controllable Image Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • Bakr et al. (2023) Bakr, E. M.; Sun, P.; Shen, X.; Khan, F. F.; Li, L. E.; and Elhoseiny, M. 2023. HRS-Bench: Holistic, Reliable and Scalable Benchmark for Text-to-Image Models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 20041–20053. Paris, France: IEEE.
  • Balaji et al. (2022) Balaji, Y.; Nah, S.; Huang, X.; Vahdat, A.; Song, J.; Zhang, Q.; Kreis, K.; Aittala, M.; Aila, T.; Laine, S.; Catanzaro, B.; Karras, T.; and Liu, M.-Y. 2022. eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers. https://arxiv.org/abs/2211.01324.
  • Bar-Tal et al. (2023) Bar-Tal, O.; Yariv, L.; Lipman, Y.; and Dekel, T. 2023. Multidiffusion: Fusing Diffusion Paths for Controlled Image Generation.
  • Barbero Jiménez (2023) Barbero Jiménez, A. 2023. Mixture of Diffusers for Scene Composition and High-Resolution Image Generation. https://arxiv.org/abs/2302.02412.
  • Chang et al. (2023) Chang, H.; Zhang, H.; Barber, J.; Maschinot, A.; Lezama, J.; Jiang, L.; Yang, M.-H.; Murphy, K.; Freeman, W. T.; Rubinstein, M.; Li, Y.; and Krishnan, D. 2023. Muse: Text-to-Image Generation via Masked Generative Transformers. https://arxiv.org/abs/2301.00704.
  • Chefer et al. (2023) Chefer, H.; Alaluf, Y.; Vinker, Y.; Wolf, L.; and Cohen-Or, D. 2023. Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models. ACM Transactions on Graphics (TOG), 42(4): 1–10.
  • Chen, Laina, and Vedaldi (2023a) Chen, M.; Laina, I.; and Vedaldi, A. 2023a. Training-Free Layout Control with Cross-Attention Guidance. https://arxiv.org/abs/2304.03373.
  • Chen, Laina, and Vedaldi (2023b) Chen, M.; Laina, I.; and Vedaldi, A. 2023b. Training-Free Layout Control with Cross-Attention Guidance. https://arxiv.org/abs/2304.03373.
  • Chen et al. (2023) Chen, X.; Liu, Y.; Yang, Y.; Yuan, J.; You, Q.; Liu, L.-P.; and Yang, H. 2023. Reason out Your Layout: Evoking the Layout Master from Large Language Models for Text-to-Image Synthesis. https://arxiv.org/abs/2311.17126.
  • Cho, Zala, and Bansal (2022) Cho, J.; Zala, A.; and Bansal, M. 2022. DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers. https://arxiv.org/abs/2202.04053.
  • Ding et al. (2022) Ding, M.; Zheng, W.; Hong, W.; and Tang, J. 2022. CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers. https://arxiv.org/abs/2204.14217.
  • El-Nouby et al. (2019) El-Nouby, A.; Sharma, S.; Schulz, H.; Hjelm, D.; El Asri, L.; Ebrahimi Kahou, S.; Bengio, Y.; and Taylor, G. W. 2019. Tell, Draw, and Repeat: Generating and Modifying Images Based on Continual Linguistic Instruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
  • Feng et al. (2023a) Feng, W.; He, X.; Fu, T.-J.; Jampani, V.; Akula, A.; Narayana, P.; Basu, S.; Wang, X. E.; and Wang, W. Y. 2023a. Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis. In Proceedings of the International Conference on Learning Representations (ICLR). Virtual: ICLR.
  • Feng et al. (2023b) Feng, W.; Zhu, W.; Fu, T.-j.; Jampani, V.; Akula, A.; He, X.; Basu, S.; Wang, X. E.; and Wang, W. Y. 2023b. LayoutGPT: Compositional Visual Planning and Generation with Large Language Models. In Proceedings of the 37th Annual Conference on Neural Information Processing Systems (NeurIPS). New Orleans, LA: NeurIPS Foundation.
  • Fu et al. (2020) Fu, T.-J.; Wang, X. E.; Grafton, S.; Eckstein, M.; and Wang, W. Y. 2020. SSCR: Iterative Language-Based Image Editing via Self-Supervised Counterfactual Reasoning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
  • Gafni et al. (2022) Gafni, O.; Polyak, A.; Ashual, O.; Sheynin, S.; Parikh, D.; and Taigman, Y. 2022. Make-a-Scene: Scene-Based Text-to-Image Generation with Human Priors. In Proceedings of the European Conference on Computer Vision (ECCV), 89–106. Tel Aviv, Israel: Springer.
  • Heusel et al. (2017) Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; and Hochreiter, S. 2017. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In Advances in Neural Information Processing Systems (NeurIPS), volume 30.
  • Ho, Jain, and Abbeel (2020) Ho, J.; Jain, A.; and Abbeel, P. 2020. Denoising Diffusion Probabilistic Models. In Advances in Neural Information Processing Systems (NeurIPS), 6840–6851.
  • Jocher, Chaurasia, and Qiu (2023) Jocher, G.; Chaurasia, A.; and Qiu, J. 2023. Ultralytics YOLO.
  • Kang et al. (2023) Kang, M.; Zhu, J.-Y.; Zhang, R.; Park, J.; Shechtman, E.; Paris, S.; and Park, T. 2023. Scaling Up GANs for Text-to-Image Synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10124–10134. Vancouver, BC: IEEE.
  • Li et al. (2023a) Li, Y.; Keuper, M.; Zhang, D.; and Khoreva, A. 2023a. Divide & Bind Your Attention for Improved Generative Semantic Nursing. In 34th British Machine Vision Conference 2023, BMVC 2023.
  • Li et al. (2023b) Li, Y.; Liu, H.; Wu, Q.; Mu, F.; Yang, J.; Gao, J.; Li, C.; and Lee, Y. J. 2023b. GLIGEN: Open-Set Grounded Text-to-Image Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • Lian et al. (2023) Lian, L.; Li, B.; Yala, A.; and Darrell, T. 2023. LLM-Grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models. https://arxiv.org/abs/2305.13655.
  • Ma et al. (2023) Ma, W.-D. K.; Lewis, J.; Kleijn, W. B.; and Leung, T. 2023. Directed Diffusion: Direct Control of Object Placement Through Attention Guidance. https://arxiv.org/abs/2302.13153.
  • Nichol et al. (2022) Nichol, A. Q.; Dhariwal, P.; Ramesh, A.; Shyam, P.; Mishkin, P.; McGrew, B.; Sutskever, I.; and Chen, M. 2022. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. In Proceedings of the 39th International Conference on Machine Learning (ICML), 16784–16804. PMLR.
  • OpenAI (2024) OpenAI. 2024. GPT-4 Technical Report. arXiv:2303.08774.
  • Phung, Ge, and Huang (2024) Phung, Q.; Ge, S.; and Huang, J.-B. 2024. Grounded Text-to-Image Synthesis with Attention Refocusing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 7932–7942. IEEE.
  • Radford et al. (2021) Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; Krueger, G.; and Sutskever, I. 2021. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML), 8748–8763. PMLR.
  • Ramesh et al. (2022) Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; and Chen, M. 2022. Hierarchical Text-Conditional Image Generation with CLIP Latents. https://arxiv.org/abs/2204.06125.
  • Ramesh et al. (2021) Ramesh, A.; Pavlov, M.; Goh, G.; Gray, S.; Voss, C.; Radford, A.; Chen, M.; and Sutskever, I. 2021. Zero-Shot Text-to-Image Generation. In Proceedings of the 38th International Conference on Machine Learning (ICML), 8821–8831. Virtual: PMLR.
  • Rombach et al. (2022) Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; and Ommer, B. 2022. High-resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10684–10695. New Orleans, LA: IEEE.
  • Saharia et al. (2022) Saharia, C.; Chan, W.; Saxena, S.; Li, L.; Whang, J.; Denton, E.; Seyed Ghasemipour, S. K.; Ayan, B. K.; Mahdavi, S. S.; Lopes, R. G.; Salimans, T.; Ho, J.; Fleet, D. J.; and Norouzi, M. 2022. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. In Advances in Neural Information Processing Systems (NeurIPS), volume 35, 36479–36494. Virtual: NeurIPS.
  • Sauer et al. (2023) Sauer, A.; Karras, T.; Laine, S.; Geiger, A.; and Aila, T. 2023. StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis. https://arxiv.org/abs/2301.09515.
  • Wei et al. (2022) Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q. V.; Zhou, D.; et al. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information Processing Systems (NeurIPS), volume 35, 24824–24837.
  • Wu et al. (2023) Wu, Q.; Liu, Y.; Zhao, H.; Bui, T.; Lin, Z.; Zhang, Y.; and Chang, S. 2023. Harnessing the Spatial-Temporal Attention of Diffusion Models for High-Fidelity Text-to-Image Synthesis. https://arxiv.org/abs/2304.03869.
  • Xiao et al. (2023) Xiao, G.; Yin, T.; Freeman, W. T.; Durand, F.; and Han, S. 2023. FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention. arXiv preprint arXiv:2305.10431. Accessed: 2023-08-13.
  • Xie et al. (2023) Xie, J.; Li, Y.; Huang, Y.; Liu, H.; Zhang, W.; Zheng, Y.; and Shou, M. Z. 2023. BoxDiff: Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion. https://arxiv.org/abs/2307.10816.
  • Yang et al. (2022) Yang, Z.; Wang, J.; Gan, Z.; Li, L.; Lin, K.; Wu, C.; Duan, N.; Liu, Z.; Liu, C.; Zeng, M.; et al. 2022. ReCo: Region-Controlled Text-to-Image Generation. https://arxiv.org/abs/2211.15518.
  • Yu et al. (2022) Yu, J.; Xu, Y.; Koh, J. Y.; Luong, T.; Baid, G.; Wang, Z.; Vasudevan, V.; Ku, A.; Yang, Y.; Ayan, B. K.; Hutchinson, B.; Wei, H.; Parekh, Z.; Li, X.; Zhang, H.; Baldridge, J.; and Wu, Y. 2022. Scaling Autoregressive Models for Content-Rich Text-to-Image Generation. Transactions on Machine Learning Research.
  • Zhang, Rao, and Agrawala (2023) Zhang, L.; Rao, A.; and Agrawala, M. 2023. Adding Conditional Control to Text-to-Image Diffusion Models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 3836–3847.