MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis

Dewei Zhou¹^∗ You Li¹^∗ Fan Ma¹ Xiaoting Zhang² Yi Yang¹^†
¹ReLER, CCAI, Zhejiang University, Zhejiang, China ²Huawei Technologies Ltd., China
^∗ Equal contribution ^† Corresponding author
{zdw1999, uli2000, mafan, yangyics}@zju.edu.cn, [email protected]
https://migcproject.github.io/

Abstract

We present a Multi-Instance Generation (MIG) task, simultaneously generating multiple instances with diverse controls in one image. Given a set of predefined coordinates and their corresponding descriptions, the task is to ensure that generated instances are accurately at the designated locations and that all instances’ attributes adhere to their corresponding description. This broadens the scope of current research on Single-instance generation, elevating it to a more versatile and practical dimension. Inspired by the idea of divide and conquer, we introduce an innovative approach named Multi-Instance Generation Controller (MIGC) to address the challenges of the MIG task. Initially, we break down the MIG task into several subtasks, each involving the shading of a single instance. To ensure precise shading for each instance, we introduce an instance enhancement attention mechanism. Lastly, we aggregate all the shaded instances to provide the necessary information for accurately generating multiple instances in stable diffusion (SD). To evaluate how well generation models perform on the MIG task, we provide a COCO-MIG benchmark along with an evaluation pipeline. Extensive experiments were conducted on the proposed COCO-MIG benchmark, as well as on various commonly used benchmarks. The evaluation results illustrate the exceptional control capabilities of our model in terms of quantity, position, attribute, and interaction. Code and demos will be released at https://migcproject.github.io/.

1 Introduction

Refer to caption — Figure 1: Overview of our MIGC. Stable diffusion’s UNet inputs text description and image features into the Cross-Attention layer to obtain the residual feature and then adds it to the image features to determine generated content, which is like a shading process (i.e., coloring with parallel pencil lines or a block of color). In this view, MIG can be considered multi-instance shading on image features, and MIGC comprises three steps: (a) Divide MIG into single-instance shading subtasks. (b) Conquer single-instance shading with Enhancement Attention. (c) Combine shading results through Layout Attention and Shading Aggregation Controller.

Stable diffusion [38] has exhibited extraordinary capabilities in wild scenarios, including photography, painting, and other area [55, 11, 29]. The current research mainly focuses on Single-Instance Generation, where the generated content is only required to align with the single description, including image editing, personalized image generation, 3D generation [45, 39, 8, 32, 12, 44, 21, 24, 22, 29], etc. However, more practical cases where multiple instances are simultaneously generated in one image with diverse controls have been rarely explored. In this research, we delve into a more general task, termed as Multi-Instance Generation (MIG), incorporating all factors such as quantity, position, attribute, and interaction control into one-time generation.

Challenges in MIG. MIG not only requires the instance to comply with the user-given description and layout but also ensures global alignment among all instances. Incorporating this information directly into the stable diffusion [38] often leads to failure. On the one hand, the current text encoder, like CLIP [36], struggles to differentiate each singular attribute from prompts containing multiple attributes [14]. On the other hand, Cross-Attention [46] layers in stable diffusion lack the ability to control position [6, 31, 27], resulting in difficulties when generating multiple instances within a specified region.

Motivated by the divide and conquer strategy, we propose the Multi-Instance Generation Controller (MIGC) approach. This approach aims to decompose MIG into multiple subtasks and then combines the results of those subtasks. Although the direct application of stable diffusion in MIG is still a challenge, the outstanding capacity of stable diffusion in Single-Instance Generation could facilitate this task. Illustrated in Fig. 1, MIGC comprises three steps: 1) Divide: MIGC decomposes MIG into multiple instance-shading subtasks only in the Cross-Attention layers of SD to speed up the resolution of each subtask and make the generated images more harmonious. 2) Conquer: MIGC employs an Enhancement Attention Layer to enhance the shading results obtained through the frozen Cross-Attention, ensuring successful shading for each instance. 3) Combine: MIGC obtains the shading template through a Layout Attention layer and then inputs it, together with the shading background and shading instances, into a Shading Aggregation Controller to obtain the final shading result.

Benchmark for MIG. To evaluate how well generation models perform on the MIG task, we propose a COCO-MIG benchmark based on the COCO dataset [28], and this benchmark requires generation models to achieve strong control on position, attribute, and quantity simultaneously.

We conducted comprehensive experiments on the proposed COCO-MIG and the widely recognized COCO [28] and DrawBench [41] benchmarks. When applied to the COCO-MIG benchmark, our method substantially enhanced the Instance Success Rate, increasing it from 32.39% to 58.43%. Transitioning to the COCO benchmark, our approach exhibited noteworthy improvements in Average Precision (AP), elevating it from 40.68/68.26/42.85 to 54.69/84.17/61.71. Similarly, on DrawBench, our method demonstrated advancements across position, attribute, and count, particularly elevating the attribute success rate from 48.20% to 97.50%. Moreover, MIGC maintains an inference speed close to the original stable diffusion.

Our contributions are summarized as follows:

1)

To advance the development of vision generation, we present the MIG task to address prevailing challenges in both academic and industrial domains. Meanwhile, we propose the COCO-MIG benchmark to evaluate the inherent MIG capabilities of generative models.
2)

Inspired by the principle of divide and conquer, we introduce a novel MIGC approach that enhances pre-trained stable diffusion with improved MIG capabilities.
3)

We conducted extensive experiments on three benchmarks, indicating that our MIGC significantly surpassed the previous SOTA method while ensuring the inference speed was close to the original stable diffusion.

2 Related work

2.1 Text-to-Image Generation

Text-to-image (T2I) Generation aims to generate high-quality images based on text descriptions. Conditional GANs [37, 50, 54, 53] were initially used for T2I Generation, while diffusion models [33, 38, 41, 7, 2, 13, 19, 57, 43] and autoregressive models [4, 10, 52] gradually replaced GANs as the foundational generator due to their more stable training and higher image quality.

2.2 Layout-to-Image Generation

As text cannot precisely control the position of generated instances. Some Layout-to-Image methods [33, 6, 49, 56, 31] extend the pre-trained T2I model [38] to integrate layout information into the generation and achieve control of instances’ position. However, they struggle to isolate the attributes of multiple instances, thus generating images with mixed attributes. This paper proposes a novel MIGC approach to achieve precise position-and-attribute control.

3 Method

3.1 Preliminaries

Stable diffusion [38] is one of the most popular T2I models, and it uses the CLIP [36] text encoder to project texts into sequence embedding and integrate textual conditions into the generation process via Cross-Attention [46] layers.

Attention layers [46] play key roles in the interaction between various modal features. Omitting the reshape operation, the attention layer on 2D space can be expressed as:

\vspace{-0.2em}\mathbf{R}=\text{Softmax}(\frac{\mathbf{Q}\mathbf{K}^{T}}{\sqrt{d}})\mathbf{V},\mathbf{R}\in{\mathbb{R}^{(H,W,C)}}

(1)

where $\mathbf{R}$ represents the output residual, and $\mathbf{Q},\mathbf{K},\mathbf{V}$ separately represents the Query, Key, and Value in attention layers, which are projected by linear layers.

3.2 Overview

Problem Definition. In the Multi-Instance Generation (MIG), users will give generation models the global prompt $\mathcal{P}$ , instance layout bounding boxes $\mathbb{B}=\{\mathbf{b}^{1},...,\mathbf{b}^{N}\}$ , where $\mathbf{b}^{i}=[x_{1}^{i},y_{1}^{i},x_{2}^{i},y_{2}^{i}]$ , and corresponding descriptions $\mathbb{D}=\{\mathbf{d}^{1},...,\mathbf{d}^{N}\}$ . According to user-provided inputs, the model needs to generate an image $\mathcal{I}$ , in which the instance within the box $\mathbf{b}^{i}$ should adhere to the instance description $\mathbf{d}^{i}$ , and global alignment is ensured in all instances.

Difficulties in MIG. When dealing with Multi-Instance prompts, stable diffusion struggles with attribute leakage, i.e., 1) Textual Leakage. Due to the causal attention masks used in the CLIP encoder, the latter instance tokens may exhibit semantic confusion [14]. 2) Spatial Leakage. The Cross-Attention lacks precise position control [6], and instances will affect the generation of each others’ region.

Motivation. Divide and conquer is an ancient but wise idea. It first divides a complex task into several simpler subtasks, then conquers these subtasks respectively, and finally obtains the solution to the original task by combining the solutions of the subtasks. This idea is highly applicable to MIG. For example, MIG is a complex task for most T2I models, while Single-Instace Generation is a simpler subtask that T2I models can solve well [39, 8, 51, 32, 45]. Based on this idea, we proposed our MIGC, which extends the stable diffusion with stronger MIG ability. We will introduce the technology details by telling “how to divide,” “how to conquer,” and “how to combine.”

3.3 Divide MIG into Instance Shading Subtasks

Instance shading subtasks in Cross-Attention space. Cross-Attention is the only way for text and image features to interact in stable diffusion, and the output determines the generated content, which looks like a shading operation on image features. In this view, the MIG task can be defined as performing correct Multi-Instance shading on image features, and $subtask_{i}$ can be defined as finding a single instance shading result $\mathbf{R}^{i}$ to satisfy the following:

\mathbf{R}^{i}=\arg\min_{\mathbf{R}^{i}}(\left\|\mathbf{R}^{i}-\mathbf{R}^{correct}\right\|_{2}\cdot\textbf{M}^{i}),\vspace{-3mm}

(2)

where $\mathbf{R}^{correct}$ represents an objectively existing correct feature, and $\mathbf{M}^{i}$ is an instance mask generated according to the box $\mathbf{b}^{i}$ , with the values inside the box region are set to 1, and the rest of the positions are set to 0. That is to say, each shading instance should have the correct textual semantic in its corresponding area.

Two benefits of division in the Cross-Attention space. i.e., 1) Conquer more efficiently: For N-instance generation, MIGC conquers N subtasks solely on Cross-Attention layers instead of the entire Unet network, which will be more efficient; 2) Combine more harmoniously: Combining subtasks in the middle layer enhances the overall cohesiveness of the generated image compared to combining at the final output of the network.

Method	Instance Success Rate(%) $\uparrow$						mIoU $\uparrow$						Time(s) $\downarrow$
Level	${L_{2}}$	${L_{3}}$	${L_{4}}$	${L_{5}}$	${L_{6}}$	Avg	${L_{2}}$	${L_{3}}$	${L_{4}}$	${L_{5}}$	${L_{6}}$	Avg	Time(s) $\downarrow$
Stable Diffusion	6.87	5.01	3.45	3.27	2.21	3.61	18.92	17.44	15.85	15.17	14.42	15.80	9.18
TFLCG	20.47	12.71	8.36	6.72	4.36	8.62	29.34	25.06	20.82	18.81	17.86	20.92	19.92
BOX-Diffusion	24.61	19.22	14.20	11.92	9.31	13.96	32.64	29.88	25.39	23.81	21.19	25.14	44.17
Multi Diffusion	24.88	22.14	19.88	18.97	18.60	20.12	29.41	28.06	25.59	24.83	24.71	25.89	25.15
GLIGEN	42.30	35.55	32.66	28.18	30.84	32.39	37.58	32.34	29.95	26.60	27.70	32.25	22.00
Ours	67.70	59.61	58.09	56.16	56.88	58.43	59.39	52.73	51.45	49.52	49.89	51.48	15.61

Table 1: Quantitative results in our proposed COCO-MIG benchmark. According to the count of generated instances, COCO-MIG is divided into five levels:

L_{2}

L_{3}

L_{4}

L_{5}

, and

L_{6}

L_{i}

means that the count of instances needed to generate in the image is i.

3.4 Conquer Instance Shading

Shading stage 1: shading results of Cross-Attention. The pre-trained Cross-Attention will notice regions with high attention weight and perform shading according to the textual semantics [48, 9]. As shown in Fig. 1, MIGC uses the masked Cross-Attention output as the first shading results:

\vspace{-0.2em}\mathbf{R}^{i}_{f}=\text{Softmax}(\frac{\mathbf{Q}\mathbf{K^{i}}^{T}}{\sqrt{d}})\mathbf{V^{i}}\cdot\textbf{M}^{i},

(3)

where $\mathbf{K}^{i}$ and $\mathbf{V}^{i}$ are obtained from text embedding of $\mathbf{d}^{i}$ , and $\mathbf{Q}$ is obtained from the image feature map.

Two issues of Cross-Attention shading results. 1) Instance Merge. According to Eq. (3), for two instances with the same description, they will get the same $\mathbf{K}$ and $\mathbf{V}$ in the Cross-Attention layer. If their boxes are close or even overlap, the network will easily merge the two instances; 2) Instance Missing. The initial edit method [31] shows that the initial noise of SD largely determines the layout of the generated image, i.e., specific regions prefer to generate specific instances or nothing. If the initial noise does not tend to generate an instance according to description $\mathbf{d}^{i}$ in box $\mathbf{b}^{i}$ , the $\mathbf{R}_{f}^{i}$ will be weak, leading to the instance missing.

Grounded phrase token for solving instance merge. To identify instances with the same description but different boxes, MIGC extends the text tokens of each instance to a combination of text and position tokens. As shown in Fig. 2(a), MIGC first projects the bounding box information to the Fourier embedding, then uses a MLP layer to get position tokens. MIGC concatenates the text tokens with position tokens to obtain the grounded phrase tokens:

\vspace{-0.2em}\mathbf{G}^{i}=[\text{CLIP}(\mathbf{d}^{i}),\text{MLP}(\text{Fourier}(\mathbf{b}^{i}))],

(4)

where $[\cdot]$ represents the concatenation.

Shading stage 2: Enhancement Attention for solving instance missing. Illustrated in Fig. 1, MIGC uses a trainable Enhancement-Attention (EA) Layer to enhance the shading result. Specifically, as shown in Fig. 2(a), after obtaining the grounded phrase token, EA uses a new trainable Cross-Attention layer to obtain an enhanced shading result and adds it to the first shading result $\mathbf{R}^{i}_{f}$ :

\mathbf{R}^{i}_{s}=\mathbf{R}^{i}_{f}+\text{softmax}(\frac{\mathbf{Q}_{ea}{\mathbf{K}^{i}_{ea}}^{T}}{\sqrt{d}})\mathbf{V}^{i}_{ea}\cdot\textbf{M}^{i},

(5)

where $\mathbf{K}^{i}_{ea}$ and $\mathbf{V}^{i}_{ea}$ are obtained from the grounded phrase token $\mathbf{G}^{i}$ , and $\mathbf{Q}_{ea}$ is obtained from the image feature map. During the training period, since $\mathbf{M}^{i}$ ensures precise spatial positioning, the instance shading result output by EA exclusively impacts the correct region, so the EA can easily learn: no matter what the image feature is, the EA should perform enhanced shading to satisfy the textual semantic of $\mathbf{d}^{i}$ and solve the issue of instance missing.

Finally, MIGC treats the second shading result $\mathbf{R}^{i}_{s}$ as the solution of the $subtask_{i}$ .

3.5 Combine Shading Results

Global prompt residual as shading background. Obtaining N-instance shading results as shading foreground, the next step of MIGC is to get the shading background. Illustrated in Fig. 1(c), MIGC utilizes global prompt $\mathcal{P}$ to obtain the shading background result $\mathbf{R}^{bg}$ in a manner similar to Eq.(3), with the background mask $\mathbf{M}^{bg}$ , in which positions containing the instance are assigned a value of 0, while all other positions are marked as 1.

Layout Attention residuals as shading template. A certain gap exists between shading instances $\{\mathbf{R}_{s}^{1},\ldots,\mathbf{R}_{s}^{N}\}$ and the shading background $\mathbf{R}^{bg}$ , as their shading process is relatively independent. To bridge these shading results and minimize the gap, MIGC needs to learn a shading template according to the image feature maps’ information. As shown in Fig. 1(c), a Layout Attention layer is used in MIGC to achieve the above goal. Illustrated in Fig. 2(b), Layout Attention performs similarly to the Self-Attention [42, 40] while instance masks $\mathbb{M}_{inst}=\{\mathbf{M}^{bg},\mathbf{M}^{1},\ldots,\mathbf{M}^{N}\}$ are used to construct attention masks:

\vspace{-0.8em}\mathbf{A}_{(a,b),(c,d)}=\begin{cases}1,\text{if}\,\exists\ \mathbf{m}\in\mathbb{M}_{inst},\mathbf{m}_{a,b}=\mathbf{m}_{c,d}=1\\ -inf,\ \ \ \ \ \ \text{otherwise}\end{cases}

(6)

\mathbf{R}_{LA}=\text{Softmax}(\frac{\mathbf{Q}_{LA}{\mathbf{K}_{LA}}^{T}}{\sqrt{d}}\odot\mathbf{A})\mathbf{V}_{LA},

(7)

where $\odot$ represents the Hadamard product, and $\mathbf{A}\in\mathbb{R}^{((H,W),(H,W))}$ represents attention masks, in which $\mathbf{A}_{(a,b)(c,d)}$ determines whether pixel (a, b) should attend to pixel (c, d). The constructed attention mask A ensures one pixel can only attend to other pixels in the same instance region, which avoids attribute leakage between instances.

Shading Aggregation Controller for the final fusion. To summarize, in all the above operations, MIGC can get $\mathbb{R}_{s}=\{\mathbf{R}_{s}^{1},\ldots,\mathbf{R}^{N}_{s},\mathbf{R}^{bg},\mathbf{R}_{LA}\}\in\mathbb{R}^{(N+2,C,H,W)}$ and $\mathbb{M}=\{\mathbf{M}^{1},\ldots,\mathbf{M}^{N},\mathbf{M}^{bg},\mathbf{M}_{LA}\}\in\mathbb{R}^{(N+2,1,H,W)}$ , where $\mathbf{M}_{LA}$ is the all-1 guidance mask corresponding to $\mathbf{R}_{LA}$ . In order to dynamically aggregate shading results at different timesteps of the generation process, we propose the Shading Aggregation Controller (SAC). As shown in Fig.2(c), SAC sequentially performs instance intra-attention and inter-attention, and aggregation weights summing to 1 are assigned to shading results on each spacial pixel through the softmax function, resulting in the final shading.

\mathbf{R}_{final}=SAC(\mathbb{R}_{s},\mathbb{M}),\mathbf{R}_{final}\in\mathbb{R}^{H,W,C}

(8)

For more details, please refer to supplementary materials.

Method	Spatial Accuracy(%)					Image Text Consistency		Image Quality
Method	R $\uparrow$	mIoU $\uparrow$	AP $\uparrow$	AP50 $\uparrow$	AP75 $\uparrow$	CLIP $\uparrow$	Local CLIP $\uparrow$	FID-6K $\downarrow$
Real Image	83.75	85.49	65.97	79.11	71.22	24.22	19.74	-
Stable Diffusion	5.95	21.60	0.8	2.71	0.42	25.69	17.34	23.56
TFLCG	13.54	28.01	1.75	6.77	0.56	25.07	17.97	24.65
BOX-Diffusion	17.84	33.38	3.29	12.27	1.08	23.79	18.70	25.15
Multi Diffusion	23.86	38,82	6.72	18.65	3.63	22.10	19.13	33.20
Layout Diffusion	50.53	57.49	23.45	48.10	20.70	18.28	19.08	25.94
GLIGEN	70.52	71.61	40.68	68.26	42.85	24.61	19.69	26.80
Ours	80.29	77.38	54.69	84.17	61.71	24.66	20.25	24.52

Table 2: Quantitative results on the COCO-Position. “R” means the success rate, which checks whether all instances in one image are position-correctly generated.

3.6 Summary

Training Loss. We use the original denoising loss [38, 20]:

\min_{\theta^{\prime}}\mathcal{L}_{\text{LDM}}=\mathbb{E}_{z,\epsilon\sim\mathcal{N}(0,I),t}[||\epsilon-f_{\theta,\theta^{\prime}}(\mathbf{z}_{t},t,\mathcal{P},\mathbb{B},\mathbb{D})||_{2}^{2}],

(9)

where $\theta$ represents the frozen parameters of the pre-trained stable diffusion, and $\theta^{\prime}$ means the parameter of our MIGC.

Besides, to constrain generated instances within their regions and prevent the generation of additional objects in the background, we design an inhibition loss to avoid high attention weight in the background region:

\vspace{-0.2em}\min_{\theta^{\prime}}\mathcal{L}_{\text{ihbt}}=\sum_{i=1}^{i=N}\left|\mathbf{A}_{c}^{i}-\text{DNR}(\mathbf{A}_{c}^{i})\right|\odot\mathbf{M}^{bg},

(10)

where $\mathbf{A}_{c}^{i}$ denotes the attention maps for the ith instance in the frozen $16\times{16}$ Cross-Attention layer of the Unet decoder [31], and DNR( $\cdot$ ) means the denoising (e.g., we use the average operation) of the background region. The final training loss is designed as follows:

\vspace{-0.2em}\min_{\theta^{\prime}}\mathcal{L}=\mathcal{L}_{\text{LDM}}+\lambda\mathcal{L}_{\text{ihbt}},

(11)

we set the loss weight $\lambda$ as 0.1.

Implementation Details. We only deploy MIGC on the mid-layers (i.e., $8\times{8}$ ) and the lowest-resolution decoder layers (i.e., $16\times{16}$ ) of UNet, which greatly determine the generated image’s layout and semantic information [32, 6]. In other Cross-Attention layers, we use the global prompt for global shading. We use COCO 2014 [28] to train MIGC. To get the instance descriptions and their bounding boxes, we use stanza [35] to split the global prompt and detect the instances with the Grounding-DINO[30] model. We train our MIGC based on the pre-trained stable diffusion v1.4. We use AdamW [25] optimizer with a constant learning rate of $1e^{-4}$ , and train the model for 300 epochs with batch size 320, which requires 15 hours on 40 V100 GPUs with 16GB VRAM each. For inference, we use EulerDiscreteScheduler [23] with 50 sample steps and use our MIGC in the first 25 steps. We select the CFG scale[19] as 7.5. For more details, please refer to supplementary materials.

4 Experiments

4.1 Benchmarks

We evaluate models’ performance on three benchmarks: COCO-MIG, COCO-Position [28], and DrawBench [41]. We use 8 seeds to generate images for each prompt.

In COCO-MIG, we pay attention to position, color, and quantity. To construct it, we randomly sampled 800 COCO images and assigned a color to each instance while keeping the original layout. Furthermore, We reconstruct the global prompts in the format of ’ a $<$ attr1 $>$ $<$ obj1 $>$ and a $<$ attr2 $>$ $<$ obj2 $>$ and a … ’, and we divide this benchmark into five levels based on the number of instances in the generated image. Each method will generate 6400 images.

In COCO-Position, we sampled 800 images, using the captions as the global prompts, labels as instance descriptions, and bounding boxes as layouts to generate 6400 images.

Drawbench is a challenging T2I benchmark. We use GPT4 [34, 15] to extract all instance descriptions and generate the layouts for each prompt. We use a total of 64 prompts, of which 25 are related to color, 19 are related to counting, and 20 are related to position, ultimately generating 512 images.

4.2 Evaluation Metrics

Position Evaluation. We use Grounding-DINO [30] to detect each instance and calculate the maximum IoU between the detection boxes and the Ground Truth box. If the above IoU is higher than the threshold $t$ =0.5, we mark it as Position Correctly Generated.

Attribute Evaluation. For a Position Correctly Generated instance, we use the Grounded-SAM model [30, 26] to segment it and calculate the percentage of the target color in the HSV color space. If the above percentage exceeds the threshold $S$ =0.2, we denote it as Fully Correctly Generated.

Metrics on COCO-MIG. We primarily measure the Instance Success Rate and mIoU. The Instance Success Rate calculates the probability that each instance is Fully Correctly Generated, and mIoU calculates the mean of the maximum IoU for all instances. Note that if the color attribute is incorrect, we set the IoU value as 0.

Metrics on COCO-Position. We use Success Rate, mIoU and Grounding-DINO AP score to measure the Spatial Accuracy. The Success Rate represents whether all instances in one image are Position Correctly Generated. Besides, we use the Fréchet Inception Distance (FID) [18] to evaluate Image Quality. To measure Image-Text Consistency, we use CLIP score and Local CLIP score[1].

Metrics on DrawBench. We evaluate the Success Rate for images related to the position and count by checking whether all instances in each image are Position Correctly Generated. For color-related images, we check whether all instances are Fully Correctly Generated. In addition to automated evaluations, a manual evaluation is conducted.

Method	Spatial(%) $\uparrow$		Attribute(%) $\uparrow$		Count(%) $\uparrow$
Method	R	Human	R	Human	R	Human
SD1.4	-	13.30	-	57.52	-	23.70
AAE	-	23.13	-	51.50	-	30.92
Struc-D	-	13.12	-	56.5	-	30.26
Box-D	11.88	50.00	28.50	57.50	9.21	39.47
TFLCG	9.38	53.13	35.00	60.00	15.79	31.58
Multi-D	10.63	55.63	18.5	65.50	17.76	36.18
GLIGEN	61.25	78.80	51.00	48.20	44.08	55.90
Ours	69.38	93.13	79.00	97.50	67.76	67.50

Table 3: Evaluation on the drawbench.

4.3 Baselines

We compare our method with some SOTA layout-to-image methods: Multi-Diffusion[3], Layout Diffusion[56], GLIGEN[27], TFLCG[6], and Box-Diffusion[49]. Since Layout Diffusion cannot control color, we only run it on COCO-Position. In Drawbench, we also compare our method with some SOTA T2I methods: stable diffusion v1.4[38], AAE[5], Structure Diffusion[14]. All methods are executed using the official code and default configuration.

4.4 Quantitative Results

COCO-MIG. Tab.1 shows results in COCO-MIG. MIGC improves the Instance Success Rate from 32.39% to 58.43% and mIoU from 32.25 to 51.48. Improvements are consistently observed across all count-division levels, underscoring the robust control capabilities of MIGC on position, quantity, and attributes. Furthermore, MIGC runs at almost the same speed as the original stable diffusion, thanks to MIGC dividing MIG in the Cross-Attention Space, accelerating the conquering and combing of subtasks.

COCO-Position. Tab.2 shows quantitative results in COCO-Position, indicating that MIGC brings significant improvement in Spatial Accuracy: increased the Success Rate from 70.52% to 80.29%, mIoU from 71.61 to 77.38, and AP score from 40.68/68.26/42.85 to 54.69/84.17/61.71. MIGC also achieves similar FID scores compared to the stable diffusion, highlighting that MIGC can enhance position control capabilities without destroying image quality.

DrawBench. Tab.3 shows the results in drawbench. MIGC achieves the best performance in both mechanical metrics and human evaluation. Human evaluation doesn’t rely on IoU to determine position correctness.

SAC	EA	LA	R(%) $\uparrow$	mIoU $\uparrow$	AP $\uparrow$	AP50 $\uparrow$	AP75 $\uparrow$	numb.
			7.66	22.71	0.91	3.18	0.35	①
✔			12.10	29.55	1.89	7.64	0.49	②
✔		✔	34.70	44.08	11.02	28.64	6.83	③
✔	✔		80.16	76.63	53.03	84.05	58.67	④
	✔	✔	78.12	75.47	52.05	83.48	57.16	⑤
✔	✔	✔	80.29	77.38	54.69	84.17	61.71	⑥

Table 4: Ablation on COCO-Position of Shading Aggregation Controller(SAC), Enhancement Attention (EA), Layout Attention (LA).

Config	R(%) $\uparrow$	mIoU $\uparrow$	AP $\uparrow$	AP50 $\uparrow$	AP75 $\uparrow$	FID $\downarrow$
w/o loss	80.20	77.03	52.46	82.65	58.05	24.73
w/ loss 1.0	80.61	77.79	55.62	84.48	62.85	26.94
w/ loss 0.1	80.29	77.38	54.69	84.17	61.71	24.52

Table 5: Ablation on COCO-Position of Inhibition loss. We conducted an ablation study on three configurations: w/o loss, loss weight 1.0, and loss weight 0.1.

4.5 Qualitative Results

Fig.3 shows qualitative results in COCO-MIG. MIGC demonstrates effective position-and-attributes control over all instances, even in complex scenarios. Fig.4 shows qualitative results in COCO-Position. MIGC achieves more precise control, ensuring all instances are generated strictly within their designated boxes without instances missing or merging. The qualitative results for DrawBench will be presented in the supplementary materials.

4.6 Analysis of Shading Aggregation Controller

We generate each image with 50 steps while using MIGC in the first 25 steps. Fig.5 shows SAC aggregation weights at T=50, 40, and 30 (i.e., T=50 means the first step). In the early time steps, the SAC assigns more weight to the EA layer’s shading instances in the foreground while giving more weight to the LA layer’s shading template in the background. In the later time steps, the SAC gradually increases the attention to the global context in the background.

4.7 Ablation Study

The ablation focuses on four components: (1) Enhancement Attention Layer. (2) Layout Attention Layer. (3) Shading Aggregation Controller. (4) The inhibition loss. Experiments are performed on COCO-Position and COCO-MIG.

Shading Aggregation Controller. From Tab.4, we find that using SAC improves the performance metrics(compare ⑤ with ⑥ and ① with ②), which is also reflected in the ablation experiments on COCO-MIG in Fig.6(a).

Enhancement Attention Layer. In Tab.4, the EA Layer significantly improves the Success Rate from 12.10% to 80.16%, mIoU from 29.55 to 76.63, and AP from 1.89 / 7.64 / 0.49 to 53.03 / 84.05 / 58.67 (Compare ② with ④). We also observe significant improvement in Fig.6(a).

Layout Attention Layer. The results of ④ and ⑥ in Tab.4 show that LA Layer can improve the AP. We find that SAC+LA, compared to SAC alone, has improved mIoU to some extent in Fig.6(a).

Inhibition Loss. We also conducted the ablation study on the Inhibition loss, with 0.1 and 1.0 loss weight. We show the results in Tab.5 and Fig.6(b). Tab.5 indicates that inhibition loss can significantly improve the AP metric in COCO-Position. We find that setting the loss function weight to 1.0 can further improve the AP metric, but it comes at the cost of a slight decrease in image quality (i.e., FID). So we finally choose loss weight as 0.1. Fig.6 (b) shows the comparison between w/ loss 0.1 and w/o loss on COCO-MIG, and we observe that the inhibition loss can improve the mIoU, especially when generating images with large instance quantity.

Qualitative Results. We show qualitative results in Fig.7. The first column indicates that the EA Layer can effectively alleviate instance missing. The second column illustrates that the LA Layer can significantly improve generated image quality. The third column suggests that the SAC also aids in better aggregation of shading instances. The fourth column demonstrates that inhibition loss enhances the model’s control capabilities. The fifth column demonstrates that position tokens effectively alleviate instance merging.

5 Conclusion

In this work, we define a practical and challenging MIG task and propose a MIGC approach to improve the stable diffusion’s MIG ability. We divide the complex MIG task into simpler Single-Instance shading subtasks, conquer each instance shading with an Enhancement Attention layer, and combine the final shading result through a Layout Attention layer and Shading Aggregation Controller. Comprehensive experiments are conducted on our proposed COCO-MIG and popular COCO-Position and Drawbench benchmarks. Experiment results verify the efficiency and effectiveness of our MIGC. In the future, we will further explore the control of interactive relationships between instances.

References

Avrahami et al. [2023] Omri Avrahami, Thomas Hayes, Oran Gafni, Sonal Gupta, Yaniv Taigman, Devi Parikh, Dani Lischinski, Ohad Fried, and Xi Yin. SpaText: Spatio-textual representation for controllable image generation. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2023.
Balaji et al. [2022] Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Qinsheng Zhang, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, Tero Karras, and Ming-Yu Liu. ediff-i: Text-to-image diffusion models with ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022.
Bar-Tal et al. [2023] Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. Multidiffusion: Fusing diffusion paths for controlled image generation. arXiv preprint arXiv:2302.08113, 2023.
Chang et al. [2023] Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, José Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin P. Murphy, William T. Freeman, Michael Rubinstein, Yuanzhen Li, and Dilip Krishnan. Muse: Text-to-image generation via masked generative transformers. 2023.
Chefer et al. [2023] Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models, 2023.
Chen et al. [2023a] Minghao Chen, Iro Laina, and Andrea Vedaldi. Training-free layout control with cross-attention guidance. arXiv preprint arXiv:2304.03373, 2023a.
Chen et al. [2022] Wenhu Chen, Hexiang Hu, Chitwan Saharia, and William W. Cohen. Re-imagen: Retrieval-augmented text-to-image generator, 2022.
Chen et al. [2023b] Xi Chen, Lianghua Huang, Yu Liu, Yujun Shen, Deli Zhao, and Hengshuang Zhao. Anydoor: Zero-shot object-level image customization. arXiv preprint arXiv:2307.09481, 2023b.
Couairon et al. [2023] Guillaume Couairon, Jakob Verbeek, Holger Schwenk, and Matthieu Cord. Diffedit: Diffusion-based semantic image editing with mask guidance. In ICLR 2023 (Eleventh International Conference on Learning Representations), 2023.
Ding et al. [2021] Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, and Jie Tang. Cogview: Mastering text-to-image generation via transformers. arXiv preprint arXiv:2105.13290, 2021.
Ding et al. [2023] Zheng Ding, Xuaner Zhang, Zhihao Xia, Lars Jebe, Zhuowen Tu, and Xiuming Zhang. Diffusionrig: Learning personalized priors for facial appearance editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12736–12746, 2023.
Epstein et al. [2023] Dave Epstein, Allan Jabri, Ben Poole, Alexei A. Efros, and Aleksander Holynski. Diffusion self-guidance for controllable image generation, 2023.
et al [2022] Aditya Ramesh et al. Hierarchical text-conditional image generation with clip latents, 2022.
Feng et al. [2023a] Weixi Feng, Xuehai He, Tsu-Jui Fu, Varun Jampani, Arjun Reddy Akula, Pradyumna Narayana, Sugato Basu, Xin Eric Wang, and William Yang Wang. Training-free structured diffusion guidance for compositional text-to-image synthesis. In The Eleventh International Conference on Learning Representations, 2023a.
Feng et al. [2023b] Weixi Feng, Wanrong Zhu, Tsu-jui Fu, Varun Jampani, Arjun Akula, Xuehai He, Sugato Basu, Xin Eric Wang, and William Yang Wang. Layoutgpt: Compositional visual planning and generation with large language models. arXiv preprint arXiv:2305.15393, 2023b.
Guerreiro et al. [2023] Julian Jorge Andrade Guerreiro, Mitsuru Nakazawa, and Björn Stenger. Pct-net: Full resolution image harmonization using pixel-wise color transformations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5917–5926, 2023.
Gupta and Kembhavi [2023] Tanmay Gupta and Aniruddha Kembhavi. Visual programming: Compositional visual reasoning without training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14953–14962, 2023.
Heusel et al. [2018] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium, 2018.
Ho [2022] Jonathan Ho. Classifier-free diffusion guidance. ArXiv, abs/2207.12598, 2022.
Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
Huang et al. [2023] Wenjing Huang, Shikui Tu, and Lei Xu. Pfb-diff: Progressive feature blending diffusion for text-driven image editing, 2023.
Karnewar et al. [2023] Animesh Karnewar, Andrea Vedaldi, David Novotny, and Niloy J Mitra. Holodiffusion: Training a 3d diffusion model using 2d images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18423–18433, 2023.
Karras et al. [2022] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models, 2022.
Kawar et al. [2023] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models, 2023.
Kingma and Ba [2017] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017.
Kirillov et al. [2023] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything. arXiv:2304.02643, 2023.
Li et al. [2023] Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. CVPR, 2023.
Lin et al. [2015] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft coco: Common objects in context, 2015.
Liu et al. [2023a] Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3d object, 2023a.
Liu et al. [2023b] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023b.
Mao et al. [2023] Jiafeng Mao, Xueting Wang, and Kiyoharu Aizawa. Guided image synthesis via initial image editing in diffusion model. In Proceedings of the 31st ACM International Conference on Multimedia. ACM, 2023.
Mou et al. [2023] Chong Mou, Xintao Wang, Jiechong Song, Ying Shan, and Jian Zhang. Dragondiffusion: Enabling drag-style manipulation on diffusion models. arXiv preprint arXiv:2307.02421, 2023.
Nichol et al. [2022] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models, 2022.
OpenAI [2023] OpenAI. Gpt-4 technical report, 2023.
Qi et al. [2020] Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. Stanza: A Python natural language processing toolkit for many human languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2020.
Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, A. Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, 2021.
Reed et al. [2016] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis, 2016.
Rombach et al. [2021] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021.
Ruiz et al. [2023] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023.
Saharia et al. [2022a] Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, pages 1–10, 2022a.
Saharia et al. [2022b] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L. Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, Seyedeh Sara Mahdavi, Raphael Gontijo Lopes, Tim Salimans, Jonathan Ho, David Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. 2022b.
Shaw et al. [2018] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. arXiv preprint arXiv:1803.02155, 2018.
Shen et al. [2023] Xiaolong Shen, Jianxin Ma, Chang Zhou, and Zongxin Yang. Controllable 3d face generation with conditional style code diffusion. arXiv preprint arXiv:2312.13941, 2023.
Shi et al. [2023a] Jing Shi, Wei Xiong, Zhe Lin, and Hyun Joon Jung. Instantbooth: Personalized text-to-image generation without test-time finetuning, 2023a.
Shi et al. [2023b] Yujun Shi, Chuhui Xue, Jiachun Pan, Wenqing Zhang, Vincent YF Tan, and Song Bai. Dragdiffusion: Harnessing diffusion models for interactive point-based image editing. arXiv preprint arXiv:2306.14435, 2023b.
Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
Woo et al. [2018] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. Cbam: Convolutional block attention module, 2018.
Wu et al. [2023] Weijia Wu, Yuzhong Zhao, Mike Zheng Shou, Hong Zhou, and Chunhua Shen. Diffumask: Synthesizing images with pixel-level annotations for semantic segmentation using diffusion models. arXiv preprint arXiv:2303.11681, 2023.
Xie et al. [2023] Jinheng Xie, Yuexiang Li, Yawen Huang, Haozhe Liu, Wentian Zhang, Yefeng Zheng, and Mike Zheng Shou. Boxdiff: Text-to-image synthesis with training-free box-constrained diffusion. arXiv preprint arXiv:2307.10816, 2023.
Xu et al. [2017] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. Attngan: Fine-grained text to image generation with attentional generative adversarial networks, 2017.
Yang et al. [2023] Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, and Fang Wen. Paint by example: Exemplar-based image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18381–18391, 2023.
Yu et al. [2022] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. Scaling autoregressive models for content-rich text-to-image generation, 2022.
Zhang et al. [2017] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks, 2017.
Zhang et al. [2022] Han Zhang, Jing Yu Koh, Jason Baldridge, Honglak Lee, and Yinfei Yang. Cross-modal contrastive learning for text-to-image generation, 2022.
Zhang et al. [2023] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
Zheng et al. [2023] Guangcong Zheng, Xianpan Zhou, Xuewei Li, Zhongang Qi, Ying Shan, and Xi Li. Layoutdiffusion: Controllable diffusion model for layout-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22490–22499, 2023.
Zhou et al. [2023] Dewei Zhou, Zongxin Yang, and Yi Yang. Pyramid diffusion models for low-light image enhancement. arXiv preprint arXiv:2305.10028, 2023.

\thetitle

Supplementary Material

A Construction Process of COCO-MIG Benchmark

Overview. COCO-MIG benchmark uses the layout of COCO-position benchmark [28] and assigns a specific color attribute to each instance. COCO-MIG requires that each instance generated not only meet the position requirements but also meet the attribute (i.e., color) requirements.

Step 1: Sampling layouts from COCO. We sample layouts from the COCO-position [28], filter out instances with side lengths less than 1/8 of the original image size, and further filter out those layouts with less than two instances. To test the model’s ability to control quantity, we divided these layouts into five levels, $L_{2}$ - $L_{6}$ , based on the number of instances, where $L_{i}$ indicates that there are i instances in the target-generated image. A total of 160 layouts are sampled for each level. Notably, in the process of sampling layouts for level $L_{i}$ , if the number of instances surpasses $i$ , we selectively choose the initial $i$ instances with the largest area. Conversely, if the number of instances is less than $i$ , a resampling procedure is employed.

Step 2: Assigning color attribute to each instance. On the basis of each sampled layout, we assign each instance a specific color from eight colors, i.e., red, yellow, green, blue, white, black, and brown. At the same time, we write the global prompt as ’a $<$ attr1 $>$ $<$ obj1 $>$ , a $<$ attr2 $>$ $<$ obj2 $>$ , …, and a …’.

B Difference between COCO-MIG and COCO-position

Fig. 1 shows a specific example. In this example, COCO-MIG assigns a specific color to each “donut” based on the COCO-position layout. Fig. 1(b) shows the results of the state-of-the-art layout-to-image method GLIGEN [27]. It can be seen that the results generated by GLIGEN meet the position requirements, so this will be judged as correctly generated in the COCO-positon benchmark. However, COCO-MIG not only requires the generated instances to meet position requirements but also attributes requirements. From this perspective, COCO-MIG will determine that the results generated by GLIGEN are incorrectly generated because there are “donuts” of incorrectly generated color attributes. Finally, it can be seen that using our proposed MIGC guarantees that the position and attributes of each generated instance are correct.

C More MIG Results

Fig. 2 and Fig. 3 show more results obtained using MIGC for Multi-Instance Generation. Even with complex layouts and rich attribute descriptions, MIGC can ensure that each instance is generated at the correct position and has the correct attributes. At the same time, if the relationship between each instance is specified in the global prompt (e.g., action relationship), MIGC can further control the interaction between instances.

D More Qualitative Results on COCO-MIG

More qualitative results on our proposed COCO-MIG benchmark are shown in Fig. 4. Compared with previous state-of-the-art methods, our proposed MIGC approach can better control the position, attributes, and quantity simultaneously.

E Qualitative Results on DrawBench

Implementation details. DrawBench [41] is a challenging T2I benchmark. On this benchmark, we compare our proposed MIGC with state-of-the-art text-to-image (i.e., AAE [5], Struc-D [14]) and layout-to-image methods (i.e., Box-D [49], TFLCG [6], Multi-D [3], GLIGEN [27]). For the text-to-image methods, we directly input the DrawBench’s prompt into the pipeline. For the layout-to-image methods and our proposed MIGC, we first use GPT-4 [34, 15] to generate the layout and then input it into the network, forming a two-stage text-to-image pipeline.

Qualitative results. Fig. 5 shows the qualitative comparison on DrawBench. The first row shows that obvious attribute leakage problems (i.e., confusion between the “yellow” and “red” attributes) occur in the results of previous state-of-the-art methods, while our MIGC can control the attributes of each instance very precisely. The second row shows that previous state-of-the-art methods cannot correctly generate a “black apple,” which is counterfactual, while our MIGC can achieve good generation. The third row indicates that MIGC can control the position more accurately than the previous state-of-the-art method and can effectively solve the problem of extra generation (e.g., both Multi-D and GLIGEN have the phenomenon of excessive generation of “carrots”), mainly due to the inhibition loss used in MIGC. The fourth row shows that MIGC can achieve accurate quantity control while other methods generate wrong quantities. The results in the fifth row show that when it comes to quantity control of multiple categories, stronger attribute control (e.g., it can avoid cat attributes from leaking to the dogs’ region) makes MIGC achieve more accurate quantity control.

F Details of Shading Aggregation Controller

Overview. Illustrated in Fig. 6, after obtaining the shading results $\mathbb{R}_{s}=\{\mathbf{R}_{s}^{1},\ldots,\mathbf{R}^{N}_{s},\mathbf{R}^{bg},\mathbf{R}_{LA}\}\in\mathbb{R}^{(N+2,C,H,W)}$ and guidance masks $\mathbb{M}=\{\mathbf{M}^{1},\ldots,\mathbf{M}^{N},\mathbf{M}^{bg},\mathbf{M}_{LA}\}\in\mathbb{R}^{(N+2,1,H,W)}$ , the Shading Aggregation Controller (SAC) sequentially performs instance intra-attention and inter-attention to dynamically aggregate shading results, and aggregation weights summing to 1 are assigned to shading results on each spacial pixel through the softmax function, resulting in the final shading $\mathbf{R}_{final}\in\mathbb{R}^{(H,W,C)}$ .

Instance Intra-Attention. As shown in Fig. 6, after SAC concats the shading results and guidance masks in the channel dimension, it will perform instance intra-attention through a stack of Conv-CBAM-Conv layers, in which Conv layers are mainly used to change the channel number, and the CBAM [47] sequentially performs channel-wise and spatial-wise attention in each instance’s feature map.

Instance Inter-Attention. As shown in Fig. 6, after instance intra-attention, the SAC further performs instance inter-attention, in which the SAC reshapes the features to change the dimension order of the feature map and then uses the CBAM to perform instance-wise attention.

G Details of Evaluation Pipeline

Overview. The flowchart in Fig. 7 shows the details of the evaluation pipeline, and we will introduce it by telling how to check whether “a red vase” is accurately generated.

Position Evaluation. First, we input the generated image into Grounding-DINO [30] to detect the bounding box of the “vase” and then calculate the IoU with the target layout’s bounding box. If the IoU $\geq$ 0.5, this ”vase” is determined to be Position Correctly Generated. Note that if multiple bounding boxes are detected in the generated image, we will select the one closest to the target layout’s bounding box to calculate IoU.

Attribute Evaluation. After checking that the “a red vase” is Position Correctly Generated, we will further check whether its attribute (i.e., “red” color) is generated accurately. Specifically, we will use Grounded-SAM [26] to segment the “vase” region in the generated image and mark its area as M. Then we will calculate the area in M that meets the “red” requirement on the HSV color space and mark it as O. If the percentage O/M $\geq$ 0.2, we can consider that this “red vase” has the correct attribute and mark it as Fully Correctly Generated.

Evaluation for Different Benchmarks. Benchmarks requiring both attribute and position control, such as our proposed COCO-MIG, require each instance to be Fully Correctly Generated. Benchmarks requiring only position control, such as COCO-positon, require each instance to be Position Correctly Generated.

H Manual Evaluation on DrawBench

We also perform a manual evaluation on DrawBench [41] to check whether the generated images adhere to the input text description in color, position, and count dimensions. Specifically, ten people will participate in the evaluation, and each generated image will be judged as ”correctly generated” or ”wrongly generated.” We show the average accuracy calculated based on the evaluation results of ten people.

Different from automated evaluation, which strictly considers the mIoU to determine whether the local generation is successful or not, manual evaluation mainly checks whether the generated image satisfies the text description globally.

I More Implementation Details

Training. We only deploy MIGC on the mid-layers (i.e., $8\times{8}$ ) and the lowest-resolution decoder layers of UNet(i.e., $16\times{16}$ ), which greatly determine the generated image’s layout and semantic information [32, 6]. In other Cross-Attention layers, we use the global prompt for global shading. We use COCO 2014 [28] to train MIGC. To get the instance descriptions and their bounding boxes, we use stanza [35] to split the global prompt and detect the instances with the Grounding-DINO[30] model. To put the data in the same batch, we fix the number of instances to 6 during training, i.e., if data contains more than $6$ instances, 6 of them will be randomly selected. If data contains less than $6$ instances, we complete it with null text and coordinates [0.0, 0.0, 0.0, 0.0]. We train our MIGC based on the pre-trained stable diffusion v1.4. We use AdamW [25] optimizer with a constant learning rate of $1e^{-4}$ , and train the model for 300 epochs with batch size 320, which requires 15 hours on 40 V100 GPUs with 16GB VRAM each.

Inference. We use EulerDiscreteScheduler [23] with 50 sample steps and use our MIGC in the first 25 steps. We select the CFG scale[19] as 7.5. As shown in Fig. 6, the channel number of the second CBAM layer (i.e., CBAM in Instance Inter Attention) in the Shading Aggregation Controller is related to the number of input instances. In order to allow our MIGC to handle different numbers of instances, we set the channel number of the second CBAM layer as $max\_num+2$ (e.g., our default setting is 28+2, which can satisfy almost all practical applications). In actual inference, we assume that the number of instances to be processed is $n_{infer}\leq max\_num$ , and we get $f_{intra}\in\mathbb{R}^{(1,n_{infer}+2,h,w)}$ through Instance Intra Attention. Next, we need to pad the number of channels of $f_{intra}$ to $max\_num+2$ . Specifically, we enter an all-0 features $f_{zero}\in{\mathbb{R}^{1,c,h,w}}$ (i.e., this is consistent with the shading result of null text during the training) into Instance Intra Attention to get $f_{padding}=InstanceIntraAttenion(f_{zero}),f_{padding}\in{\mathbb{R}^{1,1,h,w}}$ , and we use $f=concat([f_{intra}]+[f_{padding}]*(max\_num-n_{infer}),dim=1),f\in{\mathbb{R}^{1,max\_num+2,h,w}}$ as the input of the CBAM layer in Instance Inter Attention. In order to allow the network to notice the later-ordered shading instances during actual inference, we will randomly shuffle the above $f_{padding}$ and $f_{intra}$ during training, while the shading background and shading template will not participate in the above shuffle process. At the same time, we can observe that since Instance Intra Attention has eliminated the larger number of original feature channels C (e.g., 1280), the computational complexity will be very low even when processing a larger number of shading instances in Instance Inter Attention.

J More Baselines

Based on the divide-and-conquer idea, we also designed two other baselines. The qualitative comparisons are shown in Fig.9

1)PCT-Net Pipeline. As shown in the first row of Fig.9, we first independently generate each individual instance and the background. Then, we use PCT-Net [16], a state-of-the-art image fusion network, to merge all instances with the background. Using PCT-Net to fuse the pre-generated images ensures the correctness of attributes, which verifies the effectiveness of our divide-and-conquer idea. However, this pipeline incurs significant inference time costs, and the generated images may lack harmony.

2)Visual Programming Pipeline. As shown in the second row of Fig.9, Visual Programming [17] utilizes the GPT model to parse user input commands and generate a series of predefined operations, thereby achieving functionalities such as image editing. Here, we employ this method to sequentially perform editing on each object in the pre-generated images from left to right, aiming to correct the attributes of each object as much as possible. This pipeline is capable of correcting erroneously generated attributes. However, it uses text to locate and edit the instance, which lacks precise positioning capabilities and faces challenges in deployment to real scenes with complex layouts. For example, the 1st and 4th steps locate and affect the incorrect cat. In addition, these methods utilize GPT to coordinate image generation, making large-scale generation expensive and challenging.

K Limitation

Inspired by the idea of divide and conquer, MIGC maximizes the use of the powerful Single-Instance Generation capability of pre-trained stable diffusion and extends it to MIG tasks. However, for a specific instance that stable diffusion cannot generate well, our MIGC will also encounter difficulties when generating this instance or its combination with other instances. As Fig.8(a) shows, stable diffusion has difficulty generating individual letters accurately. Therefore, when using MIGC to generate the words ‘CVPR’ in the layout of Fig.8(c), we see that although MIGC correctly controls the color attribute of each letter, the content of the actual letters is wrong, causing the entire sample to fail, as shown in Fig.8(b).