
DSE-GAN: Dynamic Semantic Evolution Generative Adversarial Network for Text-to-Image Generation

Mengqi Huang, University of Science and Technology of China, China, [email protected]; Zhendong Mao, University of Science and Technology of China, China, [email protected]; Penghui Wang, University of Science and Technology of China, China, [email protected]; Quan Wang, MOE Key Laboratory of Trustworthy Distributed Computing and Service, Beijing University of Posts and Telecommunications, China, [email protected]; and Yongdong Zhang, University of Science and Technology of China; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, China, [email protected]
(2022)
Abstract.

Text-to-image generation aims at generating realistic images that are semantically consistent with the given text. Previous works mainly adopt a multi-stage architecture that stacks generator-discriminator pairs to engage in multiple rounds of adversarial training, where the text semantics used to provide generation guidance remain static across all stages. This work argues that text features at each stage should be adaptively re-composed conditioned on the status of the historical stage (i.e., the historical stage’s text and image features) to provide diversified and accurate semantic guidance during the coarse-to-fine generation process. We thereby propose a novel Dynamic Semantic Evolution GAN (DSE-GAN) to re-compose each stage’s text features under a novel single adversarial multi-stage architecture. Specifically, we design (1) a Dynamic Semantic Evolution (DSE) module, which first aggregates historical image features to summarize the generative feedback, and then dynamically selects words required to be re-composed at each stage as well as re-composes them by dynamically enhancing or suppressing different granularity subspaces’ semantics; and (2) a Single Adversarial Multi-stage Architecture (SAMA), which extends the previous structure by eliminating the complicated multiple adversarial training requirements, therefore allows more stages of text-image interactions, and finally facilitates the DSE module. We conduct comprehensive experiments and show that DSE-GAN achieves 7.48% and 37.8% relative FID improvement on two widely used benchmarks, i.e., CUB-200 and MSCOCO, respectively.

Text-to-Image, Generative Adversarial Network, Dynamic Network
journalyear: 2022; copyright: acmcopyright; conference: Proceedings of the 30th ACM International Conference on Multimedia, October 10–14, 2022, Lisbon, Portugal; booktitle: Proceedings of the 30th ACM International Conference on Multimedia (MM ’22), October 10–14, 2022, Lisbon, Portugal; price: 15.00; isbn: 978-1-4503-9203-7/22/10; doi: 10.1145/3503161.3547881; ccs: Computing methodologies — Image representations

1. Introduction

Text-to-Image generation (T2I) bridges the gap between the two most prevalent modalities of vision and language. Compared with other kinds of inputs (e.g., sketches or object layouts) for generating images, text descriptions are a flexible and convenient form to represent visual concepts. Therefore, T2I has a variety of potential applications, such as art generation(zhi2017pixelbrush, ) and computer-aided design(chen2018text2shape, ). The key of T2I lies in generating realistic images with rich and vivid details that are semantically consistent with the text.

The last few years have witnessed the remarkable success of Generative Adversarial Networks (GANs) (goodfellow2014generative, ) for T2I(reed2016generative, ). Most existing approaches adopt the multi-stage architecture (zhang2017stackgan, ; zhang2018stackgan++, ), which stacks multiple generator-discriminator pairs to engage in multiple rounds of adversarial training. To be specific, each stage’s generator refines the previous stage’s results by increasing the image resolution and adding more details stage by stage. Based on this architecture, some recent works focus on introducing different cross-modal fusion mechanisms between stages to better align text and image(xu2018attngan, ; zhu2019dm, ; yin2019semantics, ; cheng2020rifegan, ; ruan2021dae, ; tao2020df, ), while others model intermediate representations (e.g., object layouts or segmentation maps) to bridge text and image smoothly (hinzsemantic, ; qiao2021r, ). However, most existing methods share one commonality: the image features are dynamically refined at each stage, whereas the text features remain static.

Refer to caption
Figure 1. A real case of semantic evolution at different stages of our method in a T2I generation process. Text features consist of word embeddings, and the semantics of each word are visualized by the top-3 words with the highest cosine similarity in the embedding space. Previous methods use static text features at all generation stages, so the word embeddings remain unchanged, while we dynamically re-compose text features conditioned on the historical stage to provide diversified and accurate semantic guidance for each stage. Take the word “bill” as an example: during the text feature evolution process, new and consistent semantics (the words in red, e.g., “pointy”, “white”) are automatically activated stage by stage, which finally leads to more detailed and vivid generation results. For more detailed visualization refer to Fig. 6.

This paper argues that in multi-stage T2I, the text features at each stage should be dynamically re-composed conditioned on the status of the historical stage (i.e., the historical stage’s text features and image features) to provide diversified and accurate semantic guidance. The reason is that multi-stage T2I is, by nature, a coarse-to-fine generation process, with the image resolution gradually increased and details gradually added stage by stage. During this process, text features should also evolve synchronously to provide semantic guidance from coarse-grained to fine-grained (e.g., from “plant” to “flowers” and then to “red rose”), so as to better guide the image generation at each stage. Using such evolving text features brings another benefit: by dynamically re-composing text features at different stages, we are able to suppress previously used semantic information and activate new, consistent semantics during the generation process, which prevents the same semantics from being generated repeatedly and alleviates the repeated rendering problem. Take Fig. 1 as an example. When text features are adaptively re-composed, each word that constitutes the text features evolves with new and consistent semantics stage by stage (e.g., the semantics of “pointy” and “white” are gradually activated for the word “bill”), which leads to more detailed and vivid generation results.

With this motivation, we propose a novel T2I framework dubbed Dynamic Semantic Evolution Generative Adversarial Network (DSE-GAN), which dynamically re-composes text features conditioned on the status of the historical stage within a novel single adversarial multi-stage architecture. On the one hand, we devise a unique Dynamic Semantic Evolution (DSE) module, which not only dynamically selects the words required to be re-composed at each stage but also dynamically re-composes them by enhancing or suppressing different granularity subspaces’ semantics. On the other hand, to better facilitate DSE modules, we further propose a novel single adversarial multi-stage structure (SAMA), which simplifies traditional multi-stage generation by eliminating the complicated multiple adversarial training requirements and therefore allows more stages of text-image interactions. To be specific, as the core module, DSE first summarizes the generative feedback to prepare for the subsequent re-composition by predicting weights on historical image features and aggregating them into a small number of representative vectors. Second, DSE calculates the correlation of each word with the images’ aggregated vectors to select the words required to be re-composed through a dynamic element router, which is implemented by a gating function. Third, the selected words are further divided into subspaces of multiple granularities (i.e., different feature dimensions) to construct a complete semantic space, where we configure a dynamic subspace router to generate the stage-aware semantic path and apply an attention mechanism to re-compose different subspaces’ semantics w.r.t. the attention scores. In summary, we achieve a cross-modal sequential generation process over both text and images that provides diversified and accurate semantic guidance for image generation at each stage. In this way, our method can generate images with richer and more vivid details that are semantically consistent with the text.

Our contributions can be summarized as follows:

\bullet We propose a novel sequential generation framework on both text and images for T2I, i.e., DSE-GAN, which dynamically re-composes text features based on the historical stage. To the best of our knowledge, this is the first framework in T2I that adaptively re-composes text features at each stage.

\bullet We propose the Dynamic Semantic Evolution (DSE) module, which dynamically re-composes text features at different stages, providing diversified and accurate coarse-to-fine semantic guidance and suppressing repeated rendering.

\bullet We propose a novel Single Adversarial Multi-stage Architecture (SAMA), which increases the number of stages to allow sufficient text-image interactions without additional training cost.

\bullet Extensive experiments have demonstrated the superiority of our method. On CUB-200, we achieve 4.48%, 7.48%, and 3.12% relative improvement on IS, FID, and R-precision, respectively. On MSCOCO, we achieve 7.36% and a dramatic 37.8% relative improvement on R-precision and FID, respectively.

Refer to caption
Figure 2. (a) An overview of the proposed DSE-GAN framework, which consists of a pre-trained text encoder and a single generator-discriminator pair. The generator consists of several Dynamic Semantic Evolution (DSE) modules and a single adversarial multi-stage structure (SAMA). Here, each DSE module is responsible for re-composing the word features conditioned on the previous stage’s image and word features, while each sub-generator in SAMA further generates the next stage’s image features under the re-composed semantic guidance. (b) The design of each sub-generator.

2. Related Work

2.1. Text-to-Image Generation

Generative Adversarial Network (GAN) (goodfellow2014generative, ; reed2016generative, ) is the most popular model for T2I. StackGAN (zhang2017stackgan, ; zhang2018stackgan++, ) proposed a multi-stage generation architecture that stacks several generators as well as discriminators and engages in multiple rounds of adversarial training. Many works(xu2018attngan, ; zhu2019dm, ; yin2019semantics, ; qiao2019mirrorgan, ; li2019controllable, ; qiao2019learn, ) follow this architecture, e.g., AttnGAN(xu2018attngan, ) introduces an attention mechanism to generate fine-grained details at the word level, DM-GAN(zhu2019dm, ) adaptively refines generated images with a memory module that writes and reads text and image features, and MirrorGAN(qiao2019mirrorgan, ) develops a text-to-image-to-text cycle framework. Besides, there are also many works(johnson2018image, ; li2019object, ; hinz2019semantic, ; hinz2019generating, ; qiao2021r, ) focusing on introducing an intermediate representation (e.g., object bounding boxes) to smoothly bridge the text and the image. To avoid the cumbersome multiple adversarial training, Tao et al. propose DFGAN(tao2020df, ), which has only one generator-discriminator pair. XMC-GAN(zhang2021cross, ) maximizes the mutual information between image and text via contrastive learning. Our framework also follows the single-adversarial scheme, but we use more fine-grained word features and dynamically re-compose them at each stage to provide diversified and accurate semantic guidance.

Recently, auto-regressive models such as DALL-E(ramesh2021zero, ) and CogView(ding2021cogview, ) show impressive results on T2I. The general paradigm is to first use a convolutional auto-encoder to learn a codebook (oord2017neural, ) and then apply a transformer(vaswani2017attention, ) to jointly model the distribution of both text and image tokens. However, compared with GAN-based methods, the auto-regressive methods require a huge amount of data and resources to achieve reasonable results.

2.2. Dynamic Network

Dynamic neural networks have been a hot research topic in deep learning; they can adapt their structures or parameters to the given example during inference and therefore yield better representation power, adaptiveness, compatibility, and generality(han2021dynamic, ). In the literature, the research on dynamic networks can be mainly categorized into three directions, i.e., dynamic depth for network early exiting(bolukbasi2017adaptive, ) or layer skipping(lin2017runtime, ; veit2018convolutional, ), dynamic width for skipping neurons(bengio2015conditional, ) or channels(liu2017learning, ), and dynamic routing for multi-branch or tree-structured networks(huang2017multi, ; li2020learning, ; yang2020resolution, ). Our method belongs to the last direction. To the best of our knowledge, the dynamic mechanism has never been studied in the field of T2I. Our model is the first to introduce the dynamic mechanism to determine the words required to be re-composed at each stage and, more importantly, to re-compose the selected words’ features by dynamically enhancing or suppressing different granularity subspaces’ semantics.

3. Method

The overall framework of the proposed DSE-GAN is depicted in Fig. 2(a), which consists of a pre-trained text encoder and a single generator-discriminator pair. The pre-trained text encoder(xu2018attngan, ) first extracts both sentence and word features, following previous works(zhu2019dm, ; tao2020df, ; yin2019semantics, ; tan2019semantics, ). The generator consists of several Dynamic Semantic Evolution (DSE) modules and the proposed single adversarial multi-stage structure (SAMA). Here, each DSE module is responsible for re-composing word features conditioned on historical image and text features, while each sub-generator in SAMA further generates the next stage’s image features under the re-composed semantic guidance. As for the discriminator, we adopt the one-way discriminator proposed in (tao2020df, ) for its effectiveness and simplicity. In this section, we first elaborate on our core module, i.e., the DSE module, and then introduce SAMA. The objective functions for the generator and the discriminator are presented last.

Notations. Formally, $\bar{T}\in\mathbb{R}^{D_{t}}$ and $T_{0}\in\mathbb{R}^{L_{t}\times D_{t}}$ denote original sentence and word features respectively. $L_{t}$ is the number of words and $D_{t}$ is the textual feature dimension. During the generation process, $I_{i}$ denotes the output image features of stage $i$ and $T_{i}$ denotes the word features used as semantic guidance for stage $i$. $\odot$ denotes the Hadamard product. $A^{T}$ denotes the transpose of matrix $A$.

3.1. Dynamic Semantic Evolution Module

As shown in Fig. 3, in each generation stage $i$, the DSE module takes historical word features $T_{i-1}\in\mathbb{R}^{L_{t}\times D_{t}}$ and image features $I_{i-1}\in\mathbb{R}^{L_{i-1}\times D_{i-1}}$ as input, and outputs the re-composed word features $T_{i}\in\mathbb{R}^{L_{t}\times D_{t}}$ to provide diversified and accurate semantic guidance for the refinement of the current stage’s image features. Here, $D_{i-1}$ is the number of channels of the image features generated by stage $i-1$ and $L_{i-1}$ is the number of image features. The re-composition of word features is achieved by three sub-modules: a features aggregation sub-module that aggregates image features to summarize the generative feedback for subsequent re-composition, a dynamic element routing sub-module that dynamically selects the words required to be re-composed at the current stage, and a dynamic subspace routing sub-module that re-composes the selected word features by dynamically enhancing or suppressing different granularity subspaces’ semantics.
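
To make the data flow concrete, the following PyTorch-style sketch outlines one DSE step under our notation; the function name and the three callables are illustrative placeholders for the sub-modules detailed (and sketched) below, not our exact implementation, and the batch dimension is omitted.

```python
# Illustrative sketch of one DSE step (batch dimension omitted); the three
# callables correspond to the sub-modules described in the following
# paragraphs and are assumptions of this sketch.
def dse_step(T_prev, I_prev, aggregate, element_route, subspace_route):
    """T_prev: (Lt, Dt) word features; I_prev: (L_{i-1}, D_{i-1}) image features."""
    I_agg = aggregate(I_prev)                         # (K, Dt): summarized generative feedback
    T_gated = element_route(T_prev, I_agg)            # (Lt, Dt): unselected words zeroed out
    T_next = subspace_route(T_prev, T_gated, I_agg)   # (Lt, Dt): residual re-composition
    return T_next
```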

Features Aggregation. Historical image features represent the current generation status and are important clues for the current stage’s word feature re-composition. However, we found that directly using the original image features $I_{i-1}$ is neither effective nor efficient, due to the modal gap and the quadratic growth of the number of image features $L_{i-1}$. Therefore, we first introduce a weighted pooling on image features to obtain a small number of $K$ image vectors that summarize the generative feedback for subsequent re-composition. Concretely, we first project the image features with a learnable weight $W^{a}\in\mathbb{R}^{D_{i-1}\times K}$ to obtain the aggregation weight $A_{agg}$:

(1) A_{agg}=I_{i-1}W^{a}\in\mathbb{R}^{L_{i-1}\times K},

and also project the image features into the same embedding space of word features with another learnable weight $W^{c}\in\mathbb{R}^{D_{i-1}\times D_{t}}$:

(2) I^{\prime}_{i-1}=I_{i-1}W^{c}.

The aggregation weight $A_{agg}$ is then normalized via a softmax operation to obtain the final result $I^{\prime\prime}_{i-1}$:

(3) I^{\prime\prime}_{i-1}=(\text{softmax}(A_{agg}))^{T}I^{\prime}_{i-1}\in\mathbb{R}^{K\times D_{t}}.

Compared with the original image features $I_{i-1}$, the aggregated image features $I^{\prime\prime}_{i-1}$ not only require much less computation since $K\ll L_{i-1}$, but also provide better re-composition information for the word features’ evolution by being projected into the same embedding space; thus both efficiency and effectiveness are achieved.
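
As a concrete illustration, a minimal PyTorch sketch of this sub-module is given below; module and variable names are ours, the batch dimension is omitted, and the sketch is meant to mirror Eqs. 1–3 rather than reproduce our exact implementation.

```python
# A minimal sketch of the features aggregation sub-module (Eqs. 1-3).
import torch
import torch.nn as nn

class FeatureAggregation(nn.Module):
    def __init__(self, d_img, d_txt, K=4):
        super().__init__()
        self.W_a = nn.Linear(d_img, K, bias=False)      # aggregation weight W^a (Eq. 1)
        self.W_c = nn.Linear(d_img, d_txt, bias=False)  # projection W^c into the word space (Eq. 2)

    def forward(self, I_prev):                          # I_prev: (L_img, d_img)
        A_agg = self.W_a(I_prev)                        # (L_img, K)
        I_proj = self.W_c(I_prev)                       # (L_img, d_txt)
        # softmax over image locations, then weighted pooling (Eq. 3)
        return torch.softmax(A_agg, dim=0).t() @ I_proj # (K, d_txt)

# e.g. FeatureAggregation(d_img=32, d_txt=256, K=4)(torch.randn(64, 32)).shape == (4, 256)
```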

Dynamic Element Routing. After obtaining the aggregated image features $I^{\prime\prime}_{i-1}$, we further need to select the words required to be re-composed. On the one hand, not all words are visually meaningful (e.g., “I”, “the” and “have”), and unnecessary emphasis on these meaningless words could negatively impact other important words’ re-composition. On the other hand, different generation stages have different important words since they focus on generating different aspects of images, and therefore some important words may not need to be re-composed at a certain stage. Mathematically, given the previous stage’s word features $T_{i-1}$ and the aggregated image features $I^{\prime\prime}_{i-1}$, we first calculate the cross-modal correlation as:

(4) A_{cross}=T_{i-1}(I^{\prime\prime}_{i-1})^{T}\in\mathbb{R}^{L_{t}\times K}.

Then we perform mean-pooling on $A_{cross}$ to acquire the correlation of each aggregated image feature with the entire word features, $A^{\prime}_{cross}$:

(5) a^{\prime}_{k}=\frac{\sum_{l=0}^{L_{t}-1}a_{lk}}{L_{t}}.

Here, $a_{lk}$ denotes the element in the $l^{th}$ row and $k^{th}$ column of $A_{cross}$, and $A^{\prime}_{cross}=\{a^{\prime}_{k}\in\mathbb{R}^{1}|k=0,1,...,(K-1)\}\in\mathbb{R}^{1\times K}$. We then apply an extra softmax operation to $A^{\prime}_{cross}$. The dynamic element routing is realized via a gating function that determines whether a word is required to be re-composed in the next step:

(6) R_{e}=\max(0,\alpha\,\text{Tanh}(W_{e}(T_{i-1}\oplus(A^{\prime}_{cross}I^{\prime\prime}_{i-1})))).

Here, $W_{e}\in\mathbb{R}^{1\times(2\times D_{t})}$ is a learnable weight. $\oplus$ denotes the concatenation operation along the channel dimension. $\alpha$ is a learnable parameter that controls the strength of the gating function, which we initialize as $1$. Note that if the aggregated image feature number $K=1$, the term $A^{\prime}_{cross}I^{\prime\prime}_{i-1}$ is identical to $I^{\prime\prime}_{i-1}$. The output range of the dynamic element router $R_{e}$ is $[0,\alpha]$, and the word features are updated as $T^{\prime}_{i-1}=R_{e}T_{i-1}$, where the features of words that do not need to be re-composed are set to 0.
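
A PyTorch-style sketch of the element router is given below; the naming and the omission of the batch dimension are ours, and the pooled image context is broadcast to every word before concatenation, which the equations above leave implicit.

```python
# A sketch of the dynamic element router (Eqs. 4-6); batch dimension omitted.
import torch
import torch.nn as nn

class ElementRouter(nn.Module):
    def __init__(self, d_txt):
        super().__init__()
        self.W_e = nn.Linear(2 * d_txt, 1, bias=False)  # gate weight W_e
        self.alpha = nn.Parameter(torch.tensor(1.0))    # gate strength, initialized to 1

    def forward(self, T_prev, I_agg):                   # (Lt, Dt), (K, Dt)
        A_cross = T_prev @ I_agg.t()                    # (Lt, K), Eq. 4
        a = torch.softmax(A_cross.mean(dim=0), dim=0)   # (K,), Eq. 5 plus the extra softmax
        ctx = (a.unsqueeze(0) @ I_agg).expand_as(T_prev)  # (Lt, Dt) pooled image context
        gate = self.alpha * torch.tanh(self.W_e(torch.cat([T_prev, ctx], dim=-1)))
        R_e = torch.relu(gate)                          # (Lt, 1), values in [0, alpha], Eq. 6
        return R_e * T_prev                             # words not selected are zeroed out
```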

Refer to caption
Figure 3. Illustration of the proposed DSE module, which consists of a features aggregation sub-module for summarizing generative feedback, a dynamic element routing sub-module for selecting the words required to be re-composed, and a dynamic subspace routing sub-module for re-composing the selected word features by dynamically enhancing or suppressing different granularity subspaces’ semantics.

Dynamic Subspace Routing. Taking a word feature $T_{\text{example}}\in\mathbb{R}^{D_{t}}$ as an example, it can be divided into different representation subspaces, i.e., $T_{\text{example}}\in\mathbb{R}^{h\times\frac{D_{t}}{h}}$, where $h$ denotes the number of subspaces and $\frac{D_{t}}{h}$ denotes each divided subspace’s dimension (i.e., subspace granularity). As pointed out by a number of previous works on multi-view similarity(veit2017conditional, ; qu2020context, ; plummer2018conditional, ) for text-image matching and on multi-head attention(vaswani2017attention, ), different representation subspaces contain different semantics. In this paper, instead of treating $h$ as a hyper-parameter that only re-composes word features at a fixed granularity, we first divide the word features into subspaces of multiple granularities to construct a complete semantic space and then configure a dynamic subspace router to generate stage-aware paths, which brings more accurate and diversified semantic re-composition results. Generally, given a list of subspace numbers $H=\{h_{j}|j=0,1,2,..,N-1\}$, for each $h_{j}$, its corresponding divided word features are $T^{\prime}_{i-1,j}\in\mathbb{R}^{L_{t}\times h_{j}\times\frac{D_{t}}{h_{j}}}$, and the re-composition of the semantics represented by the different divided subspaces is calculated via an attention mechanism as:

(7) Q_{j}=T^{\prime}_{i-1}W_{q_{j}},\quad K_{j}=I^{\prime\prime}_{i-1}W_{k_{j}},\quad V_{j}=I^{\prime\prime}_{i-1}W_{v_{j}},
(8) O_{j}=\text{Tanh}(\text{Attention}(Q_{j},K_{j},V_{j}))\in\mathbb{R}^{L_{t}\times h_{j}\times 1},
(9) T_{i,j}=\text{Reverse}(T^{\prime}_{i-1,j}\odot O_{j}).

Here, $W_{q_{j}},W_{k_{j}}\in\mathbb{R}^{D_{t}\times\frac{D_{t}}{h_{j}}}$ and $W_{v_{j}}\in\mathbb{R}^{D_{t}\times h_{j}}$ are learnable weights. $O_{j}$ is the attention output after the tanh function. “Reverse” denotes turning the divided word features back to the original size, i.e., $\mathbb{R}^{L_{t}\times h_{j}\times\frac{D_{t}}{h_{j}}}\rightarrow\mathbb{R}^{L_{t}\times D_{t}}$. Here, we treat the semantics of each divided subspace as a whole, i.e., each $\frac{D_{t}}{h_{j}}$-dimensional slice is re-composed as a whole. Since the value range of $O_{j}$ is $[-1,1]$, the different semantics of the word features are suppressed or enhanced to varying degrees, which realizes the re-composition. To achieve dynamic granularity subspace routing, we add another learnable weight $W_{r_{j}}\in\mathbb{R}^{D_{t}\times 1}$ to generate the routing parameters:

(10) V_{r,j}=I^{\prime\prime}_{i-1}W_{r_{j}},
(11) R_{j}=\text{Attention}(Q_{j},K_{j},V_{r,j})\in\mathbb{R}^{L_{t}}.

Therefore, the routing parameters for all granularities are represented as $\hat{R}=\{\text{mean}(R_{j})|j=0,1,..,N-1\}\in\mathbb{R}^{N}$. Different from the hard routers used in previous works(huang2017multi, ; li2020learning, ; yang2020resolution, ), we consider a soft version that generates continuous values as the re-composition probabilities of different granularity subspaces. Specifically, the softmax function is applied to relax the categorical choice to a continuous and differentiable operation. The final re-composed word features $T_{i}$ are computed as:

(12) \hat{R}^{\prime}=\text{softmax}(\hat{R}),
(13) T_{i}=T_{i-1}+\sum_{j=0}^{N-1}(\hat{R}^{\prime}_{j}T_{i,j}).
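
The naive per-granularity form of this router can be sketched as follows (the efficient masked version is described next); the class name, the default granularity list, and the omission of the batch dimension are choices of this sketch, and $D_{t}$ is assumed to be divisible by every $h_{j}$.

```python
# A sketch of the dynamic subspace router (Eqs. 7-13), naive per-granularity form.
import math
import torch
import torch.nn as nn

def attention(Q, K, V):                                  # scaled dot-product attention
    return torch.softmax(Q @ K.t() / math.sqrt(Q.size(-1)), dim=-1) @ V

class SubspaceRouter(nn.Module):
    def __init__(self, d_txt, granularities=(2, 4, 8)):  # each h_j must divide d_txt
        super().__init__()
        self.hs = granularities
        self.W_q = nn.ModuleList([nn.Linear(d_txt, d_txt // h, bias=False) for h in self.hs])
        self.W_k = nn.ModuleList([nn.Linear(d_txt, d_txt // h, bias=False) for h in self.hs])
        self.W_v = nn.ModuleList([nn.Linear(d_txt, h, bias=False) for h in self.hs])
        self.W_r = nn.ModuleList([nn.Linear(d_txt, 1, bias=False) for h in self.hs])

    def forward(self, T_prev, T_gated, I_agg):           # (Lt, Dt), (Lt, Dt), (K, Dt)
        Lt, Dt = T_gated.shape
        outs, routes = [], []
        for j, h in enumerate(self.hs):
            Q, K_att = self.W_q[j](T_gated), self.W_k[j](I_agg)
            O = torch.tanh(attention(Q, K_att, self.W_v[j](I_agg)))       # (Lt, h), Eq. 8
            T_j = T_gated.view(Lt, h, Dt // h) * O.unsqueeze(-1)          # per-subspace scaling, Eq. 9
            outs.append(T_j.reshape(Lt, Dt))
            R_j = attention(Q, K_att, self.W_r[j](I_agg)).squeeze(-1)     # (Lt,), Eq. 11
            routes.append(R_j.mean())                                     # one scalar per granularity
        w = torch.softmax(torch.stack(routes), dim=0)                     # soft router, Eq. 12
        return T_prev + sum(w[j] * outs[j] for j in range(len(self.hs)))  # Eq. 13
```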
Refer to caption
Figure 4. An example to illustrate the adjacency masks.

Efficient implementation. In fact, instead of performing the attention calculation for each granularity separately, we can integrate the multi-grained re-composition into a single attention by placing adjacency masks on the attention, at barely any additional training cost. To be specific, we replace Eqs. 7, 8, 10 and 11 with

(14) Q=T^{\prime}_{i-1}W_{q},\quad K=I^{\prime\prime}_{i-1}W_{k},\quad V=I^{\prime\prime}_{i-1}W_{v},\quad R=I^{\prime\prime}_{i-1}W_{r},
(15) O=\tanh(\text{Attention}(QM_{Q},K,VM_{V})),
(16) R=\text{Attention}(QM_{Q},K,R).

Here, we set the sum of the different granularity subspace dimensions as $D_{sum}=\sum_{j=0}^{N-1}\frac{D_{t}}{h_{j}}$ and the sum of the different granularity subspace numbers as $h_{sum}=\sum_{j=0}^{N-1}h_{j}$. Therefore, $W_{q},W_{k}\in\mathbb{R}^{D_{t}\times D_{sum}}$, $W_{v}\in\mathbb{R}^{D_{t}\times h_{sum}}$ and $W_{r}\in\mathbb{R}^{D_{t}\times N}$ are the learnable weights. The corresponding adjacency masks for $Q$ and $V$ are $M_{Q}\in\mathbb{R}^{N\times L_{t}\times D_{sum}}$ and $M_{V}\in\mathbb{R}^{N\times K\times h_{sum}}$, respectively. As illustrated in Fig. 4, taking $M_{Q}=\{M_{Q,j}\in\mathbb{R}^{L_{t}\times D_{sum}}|j=0,...,N-1\}$ as an example, for each $M_{Q,j}$ we only set $M_{Q,j}[:,\sum_{j^{\prime}=0}^{j-1}\frac{D_{t}}{h_{j^{\prime}}}:\sum_{j^{\prime}=0}^{j}\frac{D_{t}}{h_{j^{\prime}}}]=1$ while the other entries are set to 0. The binary adjacency mask $M_{V}$ is analogous. Therefore, the re-compositions at different granularities are calculated separately.
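
For illustration, the adjacency mask $M_{Q}$ could be constructed as in the sketch below; the granularity list here is only an example and does not match the full list used in our experiments.

```python
# A sketch of building the binary adjacency mask M_Q for the efficient
# single-attention implementation; M_V is constructed analogously.
import torch

def build_mq(Lt, Dt, granularities=(2, 4, 8)):
    dims = [Dt // h for h in granularities]       # per-granularity query widths D_t / h_j
    M_Q = torch.zeros(len(granularities), Lt, sum(dims))
    start = 0
    for j, d in enumerate(dims):
        M_Q[j, :, start:start + d] = 1.0          # expose only granularity j's slice of Q
        start += d
    return M_Q

# e.g. build_mq(Lt=18, Dt=256).shape == (3, 18, 224)
```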

Methods | IS↑ (CUB-200) | IS↑ (MSCOCO) | FID↓ (CUB-200) | FID↓ (MSCOCO) | R-precision↑(%) (CUB-200) | R-precision↑(%) (MSCOCO)
StackGAN++(zhang2018stackgan++, ) | 4.04 ± .06 | 8.30 ± .10 | 15.30 | 81.59 | - | -
AttnGAN(xu2018attngan, ) | 4.26 ± .03 | 25.89 ± .47 | 23.98 | 35.49 | 21.65 | 55.13
DMGAN(zhu2019dm, ) | 4.75 ± .07 | 30.49 ± .57 | 16.09 | 32.64 | 48.72 | 71.08
KT-GAN(tan2020kt, ) | 4.85 ± .04 | 31.67 ± .36 | 17.32 | 30.73 | - | -
DFGAN(tao2020df, ) | 4.86 ± .04 | 18.61 ± .12 † | 19.24 | 28.92 | 25.89 | 42.61
MAGAN(yang2021multi, ) | 4.76 ± .09 | 21.66 | - | - | - | -
TIME(liu2021time, ) | 4.91 ± 0.03 | 30.85 ± 0.7 | 14.3 | 31.14 | - | -
DAE-GAN(ruan2021dae, ) | 4.42 ± .04 | 35.08 ± 1.16 | 15.29 | 28.12 | 51.64 | 70.17
XMC-GAN(zhang2021cross, ) | - | 17.25 ± 0.04 ‡ | - | 50.08 ‡ | - | -
R-GAN(qiao2021r, ) | - | - | - | 24.60 | - | -
OP-GAN(hinz2020semantic, ) | - | 27.88 ± .12 | - | 24.70 | - | 67.92
DALLE(ramesh2021zero, ) | - | - | 56.10 | 27.50 | - | -
Cogview(ding2021cogview, ) | - | - | - | 27.10 | - | -
DSE-GAN (our method) | 5.13 ± .03 | 26.71 ± 0.38 | 13.23 | 15.30 | 53.25 | 76.31
Table 1. Quantitative comparison between our method and other state-of-the-art methods. † indicates scores computed from images generated by the open-sourced models. ‡ indicates scores computed from the model trained under the same experimental setting as ours (i.e., four RTX-3090 GPUs and 120 epochs on MSCOCO) using their open-source code. For CUB-200, our proposed DSE-GAN achieves 4.48%, 7.48%, and 3.12% relative improvement on IS, FID, and R-precision, respectively. For MSCOCO, our proposed DSE-GAN achieves 37.8% and 7.36% relative improvement on FID and R-precision, respectively.

3.2. Dynamic Semantic Evolution Generator

To allow more stages of text-image interactions and thereby facilitate our proposed DSE modules, we further propose a novel Single Adversarial Multi-stage Architecture (SAMA) that eliminates the complicated multiple adversarial training requirements. On the one hand, since each stage’s discriminator is removed and each sub-generator therefore cannot get immediate generative feedback, we adopt the grid transformer block (jiang2021transgan, ) as our sub-generator’s basic structure for its better expressivity and learning ability, thanks to the built-in self-attention operations. See Fig. 2(b) for details. On the other hand, we introduce a weighted hierarchical image integration to form the final images, which allows each sub-generator to focus only on adding corresponding details based on previous stages’ generation results instead of outputting complete images. As shown in Fig. 2(a), the proposed framework has $M+1$ stages of sub-generators ($G_{0},...,G_{M}$), each of which takes the previous stage’s output image features ($I_{0},...,I_{M-1}$) as input and generates images of small-to-large scales ($x_{0},...,x_{M}$). Specifically, for the initial stage:

(17) I_{0},x_{0}=G_{0}(z,F^{ca}(\bar{T})),

where $F^{ca}$ is the Conditioning Augmentation (CA) (zhang2017stackgan, ) to augment the sentence feature and avoid overfitting by resampling the input sentence feature from an independent Gaussian distribution. $z$ is the random Gaussian noise vector. For the following stages:

(18) T_{i}=\text{DSE}(I_{i-1},T_{i-1}),\quad I_{i},x_{i}=G_{i}(I_{i-1},T_{i}),\quad i=1,...,M.

The final target images $\mathcal{I}$ are formed by a weighted sum of all stages’ RGB outputs through skip-connections:

(19) \mathcal{I}=\text{sum}(\alpha_{0}x_{0},\alpha_{1}x_{1},...,\alpha_{M}x_{M}).

Here, $\{\alpha_{i}|i=0,1,...,M\}$ are learnable weights, all initialized as $\frac{1}{M+1}$. During experiments, we found that this small modification makes the training process more stable.
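
A minimal sketch of this integration is given below; we assume each stage’s RGB output is upsampled to the final resolution before the weighted sum, which Eq. 19 leaves implicit, and the module name is ours.

```python
# A sketch of the weighted hierarchical image integration (Eq. 19).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageIntegration(nn.Module):
    def __init__(self, num_stages):
        super().__init__()
        # learnable per-stage weights alpha_i, all initialized to 1 / (M + 1)
        self.alpha = nn.Parameter(torch.full((num_stages,), 1.0 / num_stages))

    def forward(self, stage_rgbs):                # list of (B, 3, H_i, W_i), small to large
        size = stage_rgbs[-1].shape[-2:]
        out = 0.0
        for a, x in zip(self.alpha, stage_rgbs):  # upsample each output, then weighted sum
            out = out + a * F.interpolate(x, size=size, mode="bilinear", align_corners=False)
        return out
```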

3.3. Objective Function

We adopt the hinge adversarial loss (lim2017geometric, ) and apply the Matching-Aware zero-centered Gradient Penalty (MA-GP) (tao2020df, ) to penalize the discriminator for deviating from the Nash-equilibrium. The discriminator’s objective function is presented as:

(20) \mathcal{L}^{D}=E_{x\sim p_{data}}[\max(0,1-D(x,s))]+\frac{1}{2}E_{x\sim p_{G}}[\max(0,1+D(\hat{x},s))]+\frac{1}{2}E_{x\sim p_{data}}[\max(0,1+D(x,\hat{s}))]+\lambda_{MA}E_{x\sim p_{data}}\left[\left(\left\|\nabla_{x}D(x,s)\right\|_{2}+\left\|\nabla_{s}D(x,s)\right\|_{2}\right)^{p}\right],

where $s$ is the matched sentence feature while $\hat{s}$ is a mismatched one, $x$ is the real image corresponding to $s$, and $\hat{x}$ is the generated image. $D(\cdot)$ is the decision given by the discriminator on whether the input image matches the input sentence. The variables $\lambda_{MA}$ and $p$ are the hyper-parameters of the MA-GP loss.
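
A PyTorch-style sketch of this objective is shown below; `D` stands for any discriminator returning one matching logit per image-sentence pair, the generated images are detached for the discriminator update, and the default hyper-parameters follow Sec. 4.1.3. This is an illustrative sketch, not our released implementation.

```python
# A sketch of the discriminator loss (Eq. 20): hinge terms plus MA-GP.
import torch

def d_loss(D, real, fake, sent, sent_mis, lambda_ma=2.0, p=6):
    loss = torch.relu(1.0 - D(real, sent)).mean()                        # real image, matched text
    loss = loss + 0.5 * torch.relu(1.0 + D(fake.detach(), sent)).mean()  # generated image
    loss = loss + 0.5 * torch.relu(1.0 + D(real, sent_mis)).mean()       # real image, mismatched text
    # MA-GP: penalize D's gradients w.r.t. the real image and the matched sentence
    real_req = real.detach().requires_grad_(True)
    sent_req = sent.detach().requires_grad_(True)
    out = D(real_req, sent_req)
    g_img, g_sent = torch.autograd.grad(out.sum(), (real_req, sent_req), create_graph=True)
    gp = (g_img.flatten(1).norm(2, dim=1) + g_sent.flatten(1).norm(2, dim=1)) ** p
    return loss + lambda_ma * gp.mean()
```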

The generator’s adversarial objective function is presented as:

(21) \mathcal{L}^{G}_{adv}=-E_{x\sim p_{G}}[D(\hat{x},s)].

Following (xu2018attngan, ; zhu2019dm, ; tan2019semantics, ; qiao2019mirrorgan, ), we further utilize the DAMSM loss (xu2018attngan, ) to compute the matching degree between images and text descriptions, denoted as $\mathcal{L}_{DAMSM}$. The Conditioning Augmentation loss is defined as the Kullback-Leibler divergence between the standard Gaussian distribution and the Gaussian distribution of the training text, i.e.,

(22) \mathcal{L}_{CA}=D_{KL}\left(\mathcal{N}\left(\mu(\mathbf{s}),\Sigma(\mathbf{s})\right)\|\mathcal{N}(0,I)\right).

The final objective function of the generator networks is composed of the aforementioned three terms:

(23) \mathcal{L}^{G}=\mathcal{L}^{G}_{adv}+\lambda_{1}\mathcal{L}_{CA}+\lambda_{2}\mathcal{L}_{DAMSM}.
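
The generator objective can be sketched as follows; `damsm_loss` is a placeholder for the precomputed DAMSM term from AttnGAN, the CA module is assumed to output a mean and a log-variance of a diagonal Gaussian, and the default weights follow Sec. 4.1.3.

```python
# A sketch of the generator objective (Eqs. 21-23).
import torch

def kl_ca_loss(mu, logvar):
    # KL( N(mu, diag(exp(logvar))) || N(0, I) ) for conditioning augmentation, Eq. 22
    return 0.5 * torch.sum(mu.pow(2) + logvar.exp() - 1.0 - logvar, dim=1).mean()

def g_loss(D, fake, sent, mu, logvar, damsm_loss, lambda1=1.0, lambda2=0.1):
    adv = -D(fake, sent).mean()                   # Eq. 21, hinge generator term
    return adv + lambda1 * kl_ca_loss(mu, logvar) + lambda2 * damsm_loss   # Eq. 23
```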

4. Experiments

Refer to caption
Figure 5. Examples of images synthesized by previous methods and our method on the CUB-200 benchmark (1st–4th columns) and the MSCOCO benchmark (5th–8th columns). The input text descriptions are given in the first row, and the corresponding generated images from different methods are shown in the same column. Compared with previous methods, our DSE-GAN produces much more realistic results with rich and vivid details on both benchmarks. Best viewed in color and zoomed in.

4.1. Experiment Settings

4.1.1. Datasets.

Following previous works, experiments are conducted on two benchmarks, i.e., the CUB-200(wah2011caltech, ) and MSCOCO(lin2014microsoft, ).

4.1.2. Evaluation Metrics.

Following previous work, we report validation results by generating images for 30,000 random captions.

\bullet Inception Score (IS), which is obtained by the pre-trained Inception-v3 network(szegedy2016rethinking, ) to compute the KL-divergence between the conditional class distribution and the marginal class distribution. A large IS indicates better generation diversity and generation quality. However, as pointed out by previous work(li2019object, ; zhang2021cross, ; tao2020df, ), the IS completely fails in evaluating the semantic layout of the synthesized image and thus fails to evaluate the generation quality on MSCOCO. In this paper, we still report the IS score on MSCOCO to give a comprehensive comparison.

\bullet Fréchet Inception Distance (FID), which calculates the Fréchet distance between two multivariate Gaussians fitted to the Inception features (szegedy2016rethinking, ) of generated and real images. A lower FID score indicates better generation quality. Compared with IS, FID is more robust and aligns better with manual evaluation.

\bullet R-precision assesses whether the entire image matches the text description by conducting a retrieval experiment, i.e., generated images are used to query their corresponding text descriptions. However, we notice that previous works compute R-precision using the image-text encoders from AttnGAN (xu2018attngan, ), and many of them use these encoders as part of their optimization objective during training. This skews the results: many generative models report R-precision scores significantly higher than real images. To alleviate this, we finetune CLIP (radford2021learning, ) on CUB-200 and MSCOCO respectively, and use the finetuned CLIP to calculate R-precision, which is disjoint from training and correlates better with human judgments. The effectiveness of using CLIP for evaluating text-image alignment is also validated by recent studies(park2021benchmark, ).
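
For reference, R-precision can be computed from pre-extracted, unit-normalized CLIP features as sketched below; the 1-true-vs-99-mismatched-candidates protocol is the common convention inherited from AttnGAN and is an assumption of this sketch.

```python
# A sketch of R-precision from unit-normalized image/text features.
import torch

def r_precision(img_feats, true_txt, mis_txt):
    """img_feats: (N, d); true_txt: (N, d); mis_txt: (N, 99, d)."""
    cand = torch.cat([true_txt.unsqueeze(1), mis_txt], dim=1)   # (N, 100, d) candidate captions
    sims = torch.einsum("nd,ncd->nc", img_feats, cand)          # cosine similarities
    return (sims.argmax(dim=1) == 0).float().mean().item()      # fraction ranking the true caption first
```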

4.1.3. Implementation Details.

Our model is implemented in PyTorch (paszke2019pytorch, ). Adam(kingma2014adam, ) is utilized for optimization with $\beta_{1}=0$ and $\beta_{2}=0.99$. The learning rate is set to 0.0001 for the generator and 0.0004 for the discriminator according to the Two Timescale Update Rule (TTUR) (heusel2017gans, ). The hyper-parameters $p$, $\lambda_{MA}$, $\lambda_{1}$, and $\lambda_{2}$ are set to 6, 2, 1, and 0.1, respectively. Following previous works(xu2018attngan, ; zhu2019dm, ; tao2020df, ; hinzsemantic, ; tan2019semantics, ; qiao2019mirrorgan, ), the main results are obtained by training for 600 epochs on CUB-200 and 120 epochs on MSCOCO on four RTX 3090 GPUs. The aggregated image number $K$ is set to 4, and the subspace number list is set to $\{256,128,64,32,16,8,4,2\}$ to construct a complete semantic space, since the dimension of the text features is $256$.
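
The optimizer configuration above amounts to the following sketch; `generator` and `discriminator` are placeholders for the actual networks.

```python
# A sketch of the Adam + TTUR optimizer setup described above.
import torch
import torch.nn as nn

generator = nn.Linear(1, 1)       # placeholder for the actual generator
discriminator = nn.Linear(1, 1)   # placeholder for the actual discriminator

g_opt = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.0, 0.99))
d_opt = torch.optim.Adam(discriminator.parameters(), lr=4e-4, betas=(0.0, 0.99))
```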

4.2. Comparison with state-of-the-art methods

4.2.1. Quantitative Results.

We compare our proposed method with various kinds of state-of-the-art methods, including direct text-to-image(zhang2018stackgan++, ; xu2018attngan, ; zhu2019dm, ; tan2020kt, ; tao2020df, ; liu2021time, ; yang2021multi, ), text-to-layout-to-image(hinz2020semantic, ; qiao2021r, ) and recent auto-regressive methods(ding2021cogview, ; ramesh2021zero, ), on both benchmarks and for all automated metrics, as shown in Tab. 1. For the CUB-200 benchmark, our proposed DSE-GAN outperforms previous works on all metrics, achieving 4.48% (from 4.91 to 5.13), 7.48% (from 14.3 to 13.23), and 3.12% (from 51.64% to 53.25%) relative improvement on IS, FID, and R-precision, respectively. As for the more challenging MSCOCO benchmark, DSE-GAN dramatically improves FID from 24.60 to 15.30, a 37.8% relative improvement over the next best model, R-GAN(qiao2021r, ), which requires extra object bounding box labels. DSE-GAN also outperforms the others on R-precision (from 71.08 to 76.31, a 7.36% relative improvement), which indicates the better text-image alignment of our model. Although our DSE-GAN is inferior to some previous works (e.g., DMGAN(zhu2019dm, ) and DAE-GAN(ruan2021dae, )) on IS for MSCOCO, the IS score is known to fail at evaluating generation quality on MSCOCO (zhang2021cross, ; li2019object, ; tao2020df, ), and the visual inspection in Fig. 5 indicates that DSE-GAN’s image quality is much higher than that of the others.

4.2.2. Qualitative Results.

We qualitatively compare the images generated by our method with three recent state-of-the-art GAN methods, i.e., DMGAN(zhu2019dm, ), DFGAN(tao2020df, ) and DAE-GAN(ruan2021dae, ). For the CUB-200 benchmark, as shown in the first four columns of Fig. 5, our DSE-GAN generates images with many rich and vivid details that are consistent with the given textual descriptions. For example, in the 3rd and 4th columns, compared with previous methods, the birds generated by our DSE-GAN have more realistic feather textures. For the MSCOCO benchmark, which contains more objects and is more challenging, as shown in the last four columns of Fig. 5, our DSE-GAN can still generate images with reasonable object shapes and diverse backgrounds. For example, our DSE-GAN is the only model that successfully generates the “giraffe” in the 5th column, the “plane” in the 6th column, and the “keyboard” and “display screen” in the 7th column. Meanwhile, our DSE-GAN also generates a much more realistic layout for the complex scene, e.g., the “dining room” that clearly “looks out” through “glass” in the 8th column.

Refer to caption
Figure 6. Two generation cases conditioned on the same text are presented to illustrate the text’s adaptive semantic evolution during the generation process. We present each stage’s output and the corresponding semantic visualization (top-3 most similar words in the embedding space) of “bill” and “brown”. Newly activated semantics are highlighted in red. The final images are formed as the sum of all stages’ outputs. As can be seen, the upper image flow activates “bill” with the semantics of “dagger”, “open”, etc., while the lower image flow activates “bill” with “straight”, “pointy”, etc. Both evolving semantics provide diversified and accurate guidance for each stage’s generation and present different but semantically consistent appearances in the final images. The same conclusion also holds for “brown”.

4.3. Ablations

ID | Components | IS↑ | FID↓
$\mathcal{A}$ | baseline (SAMA) | 4.84 ± .04 | 15.84
$\mathcal{B}$ | $\mathcal{A}$ + element routing | 4.83 ± .03 | 15.09
$\mathcal{C}$ | $\mathcal{A}$ + subspace routing | 4.93 ± .07 | 14.66
$\mathcal{D}$ | $\mathcal{A}$ + subspace routing (hard) | 4.89 ± .01 | 14.79
$\mathcal{E}$ | $\mathcal{A}$ + element & subspace routing | 5.00 ± .06 | 14.02
$\mathcal{F}$ | $\mathcal{E}$ + image features aggregation (K=4) | 5.09 ± .04 | 13.44
Table 2. Effectiveness of different components.

We thoroughly evaluate the different components of DSE-GAN and analyze their impact. All ablation results are reported on CUB-200 using two RTX-3090 GPUs. We verify each part of our proposed DSE module, as shown in Tab. 2. Element routing and subspace routing each bring improvements over our proposed SAMA baseline, since element routing alone can be regarded as dynamically emphasizing different words at each stage, while subspace routing provides diversified as well as accurate semantic guidance at each stage. Adding element routing further improves the performance of subspace routing, which validates the necessity of avoiding unnecessary word re-composition at each stage. As for subspace routing, we also investigate a hard-version router (huang2017multi, ; li2020learning, ; yang2020resolution, ), i.e., replacing the softmax in Eq. 12 with Gumbel softmax, which we denote as subspace routing (hard). We found that the soft-version subspace routing is better than the hard version, and we hypothesize that this is because the soft router can combine the re-composition results of different granularity subspaces and thus becomes more diversified and accurate. The features aggregation sub-module brings a further improvement since it summarizes the generative feedback. Compared with directly using the original image features, the aggregated ones provide better re-composition information for the word features’ evolution by being projected into the same embedding space as well as capturing the high-order statistics of the image features, which is both effective and efficient.

4.4. Visualization and Case Study

To understand the adaptive evolution process of text semantics, we present two generation cases conditioned on the same text, as shown in Fig. 6. We present each stage’s output and the corresponding semantic visualization (top-3 most similar words in the embedding space) of “bill” and “brown”. Newly activated semantics are highlighted in red. The final images are formed as the sum of all stages’ outputs. As can be seen, the upper image flow activates “bill” with the semantics of “dagger”, “open”, etc., while the lower image flow activates “bill” with “straight”, “pointy”, etc. Both evolving semantics provide diversified and accurate guidance for each stage’s generation and present different but semantically consistent appearances in the final images. Meanwhile, we can observe that whether an important word is required to be re-composed is also dynamically determined by the historical stage, e.g., the word “bill” is not re-composed at stage 1 since the generative summary of stage 0 is too coarse to infer any information for adding details to render the “bill”. The same conclusion also holds for “brown”.

5. Conclusion

In this paper, we propose a novel Dynamic Semantic Evolution Generative Adversarial Network (DSE-GAN) for text-to-image generation. Different from previous methods that use static text features across all generation stages, our method dynamically re-composes text features at different stages conditioned on the status of the historical stage: the novel Dynamic Semantic Evolution (DSE) modules first aggregate historical image features to summarize the generative feedback, then dynamically select the words required to be re-composed at each stage, and finally re-compose them by dynamically enhancing or suppressing different granularity subspaces’ semantics. Moreover, the novel single adversarial multi-stage architecture (SAMA) further facilitates DSE by eliminating the complicated multiple adversarial training requirements, therefore allowing more stages of text-image interactions. Comprehensive experiments on two widely used benchmarks demonstrate the superiority of our DSE-GAN.

6. Acknowledgments

This work is supported in part by National Natural Science Foundation of China under Grants 62222212, U19A2057, and 61876223, Science Fund for Creative Research Groups under Grant 62121002, and Fundamental Research Funds for the Central Universities under Grants WK3480000008 and WK3480000010.

References

  • (1) H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas, “Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 5907–5915.
  • (2) T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He, “Attngan: Fine-grained text to image generation with attentional generative adversarial networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 1316–1324.
  • (3) M. Zhu, P. Pan, W. Chen, and Y. Yang, “Dm-gan: Dynamic memory generative adversarial networks for text-to-image synthesis,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 5802–5810.
  • (4) W. Li, P. Zhang, L. Zhang, Q. Huang, X. He, S. Lyu, and J. Gao, “Object-driven text-to-image synthesis via adversarial training,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 12 174–12 182.
  • (5) J. Cheng, F. Wu, Y. Tian, L. Wang, and D. Tao, “Rifegan: Rich feature generation for text-to-image synthesis from prior knowledge,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10 911–10 920.
  • (6) A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever, “Zero-shot text-to-image generation,” arXiv preprint arXiv:2102.12092, 2021.
  • (7) J. Cho, J. Lu, D. Schwenk, H. Hajishirzi, and A. Kembhavi, “X-lxmert: Paint, caption and answer questions with multi-modal transformers,” arXiv preprint arXiv:2009.11278, 2020.
  • (8) M. Ding, Z. Yang, W. Hong, W. Zheng, C. Zhou, D. Yin, J. Lin, X. Zou, Z. Shao, H. Yang et al., “Cogview: Mastering text-to-image generation via transformers,” arXiv preprint arXiv:2105.13290, 2021.
  • (9) H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas, “Stackgan++: Realistic image synthesis with stacked generative adversarial networks,” IEEE transactions on pattern analysis and machine intelligence, vol. 41, no. 8, pp. 1947–1962, 2018.
  • (10) T. Hinz, S. Heinrich, and S. Wermter, “Semantic object accuracy for generative text-to-image synthesis,” arXiv preprint arXiv:1910.13321, 2019.
  • (11) G. Yin, B. Liu, L. Sheng, N. Yu, X. Wang, and J. Shao, “Semantics disentangling for text-to-image generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 2327–2336.
  • (12) M. Tao, H. Tang, S. Wu, N. Sebe, F. Wu, and X.-Y. Jing, “Df-gan: Deep fusion generative adversarial networks for text-to-image synthesis,” arXiv e-prints, pp. arXiv–2008, 2020.
  • (13) B. Li, X. Qi, P. H. Torr, and T. Lukasiewicz, “Lightweight generative adversarial networks for text-guided image manipulation,” arXiv preprint arXiv:2010.12136, 2020.
  • (14) T. Hinz, S. Heinrich, and S. Wermter, “Semantic object accuracy for generative text-to-image synthesis,” IEEE transactions on pattern analysis and machine intelligence, 2019.
  • (15) B. Liu, K. Song, Y. Zhu, G. de Melo, and A. Elgammal, “Time: Text and image mutual-translation adversarial networks,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, 2021, pp. 2082–2090.
  • (16) P. Esser, R. Rombach, and B. Ommer, “Taming transformers for high-resolution image synthesis,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 12 873–12 883.
  • (17) H. Tan, X. Liu, X. Li, Y. Zhang, and B. Yin, “Semantics-enhanced adversarial nets for text-to-image synthesis,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV).   IEEE Computer Society, 2019, pp. 10 500–10 509.
  • (18) T. Qiao, J. Zhang, D. Xu, and D. Tao, “Mirrorgan: Learning text-to-image generation by redescription,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 1505–1514.
  • (19) T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila, “Analyzing and improving the image quality of stylegan,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 8110–8119.
  • (20) H. Zhang, J. Y. Koh, J. Baldridge, H. Lee, and Y. Yang, “Cross-modal contrastive learning for text-to-image generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 833–842.
  • (21) J. Zhi, “Pixelbrush: Art generation from text with gans,” in Cl. Proj. Stanford CS231N Convolutional Neural Networks Vis. Recognition, Sprint 2017, 2017, p. 256.
  • (22) K. Chen, C. B. Choy, M. Savva, A. X. Chang, T. Funkhouser, and S. Savarese, “Text2shape: Generating shapes from natural language by learning joint embeddings,” in Asian Conference on Computer Vision.   Springer, 2018, pp. 100–116.
  • (23) J. Johnson, A. Gupta, and L. Fei-Fei, “Image generation from scene graphs,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 1219–1228.
  • (24) N. Chomsky, “The architecture of language,” 2000.
  • (25) C. Liu, Z. Mao, A.-A. Liu, T. Zhang, B. Wang, and Y. Zhang, “Focus your attention: A bidirectional focal attention network for image-text matching,” in Proceedings of the 27th ACM International Conference on Multimedia, 2019, pp. 3–11.
  • (26) C. Thomas and A. Kovashka, “Preserving semantic neighborhoods for robust cross-modal retrieval,” in European Conference on Computer Vision.   Springer, 2020, pp. 317–335.
  • (27) T. Qiao, J. Zhang, D. Xu, and D. Tao, “Learn, imagine and create: Text-to-image generation from prior knowledge,” Advances in Neural Information Processing Systems, vol. 32, pp. 887–897, 2019.
  • (28) A. D. Seshadri and B. Ravindran, “Multi-tailed, multi-headed, spatial dynamic memory refined text-to-image synthesis,” arXiv preprint arXiv:2110.08143, 2021.
  • (29) Y. Qiao, Q. Chen, C. Deng, N. Ding, Y. Qi, M. Tan, X. Ren, and Q. Wu, “R-gan: Exploring human-like way for reasonable text-to-image synthesis via generative adversarial networks,” in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 2085–2093.
  • (30) I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” Advances in neural information processing systems, vol. 27, 2014.
  • (31) M. Artetxe, G. Labaka, I. Lopez-Gazpio, and E. Agirre, “Uncovering divergent linguistic information in word embeddings with lessons for intrinsic and extrinsic evaluation,” in Proceedings of the 22nd Conference on Computational Natural Language Learning, 2018, pp. 282–291.
  • (32) M. Zhao, P. Dufter, Y. Yaghoobzadeh, and H. Schütze, “Quantifying the contextualization of word representations with semantic class probing,” arXiv preprint arXiv:2004.12198, 2020.
  • (33) S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, “Generative adversarial text to image synthesis,” in International Conference on Machine Learning.   PMLR, 2016, pp. 1060–1069.
  • (34) B. Li, X. Qi, T. Lukasiewicz, and P. H. Torr, “Controllable text-to-image generation,” arXiv preprint arXiv:1909.07083, 2019.
  • (35) J. Gauthier, “Conditional generative adversarial networks for convolutional face generation,” 2015.
  • (36) T. Hinz, S. Heinrich, and S. Wermter, “Generating multiple objects at spatially distinct locations,” arXiv preprint arXiv:1901.00686, 2019.
  • (37) A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
  • (38) A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving language understanding by generative pre-training,” 2018.
  • (39) J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
  • (40) A. v. d. Oord, O. Vinyals, and K. Kavukcuoglu, “Neural discrete representation learning,” arXiv preprint arXiv:1711.00937, 2017.
  • (41) Y. Huang, H. Xue, B. Liu, and Y. Lu, “Unifying multimodal transformer for bi-directional image and text generation,” in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 1138–1147.
  • (42) Y. Jiang, S. Chang, and Z. Wang, “Transgan: Two transformers can make one strong gan,” arXiv preprint arXiv:2102.07074, vol. 1, no. 3, 2021.
  • (43) L. Zhao, Z. Zhang, T. Chen, D. N. Metaxas, and H. Zhang, “Improved transformer for high-resolution gans,” arXiv preprint arXiv:2106.07631, 2021.
  • (44) R. Durall, S. Frolov, J. Hees, F. Raue, F.-J. Pfreundt, A. Dengel, and J. Keuper, “Combining transformer generators with convolutional discriminators,” in German Conference on Artificial Intelligence (Künstliche Intelligenz).   Springer, 2021, pp. 67–79.
  • (45) O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention.   Springer, 2015, pp. 234–241.
  • (46) A. Karnewar and O. Wang, “Msg-gan: Multi-scale gradients for generative adversarial networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 7799–7808.
  • (47) R. Xiong, Y. Yang, D. He, K. Zheng, S. Zheng, C. Xing, H. Zhang, Y. Lan, L. Wang, and T. Liu, “On layer normalization in the transformer architecture,” in International Conference on Machine Learning.   PMLR, 2020, pp. 10 524–10 533.
  • (48) L. Mescheder, A. Geiger, and S. Nowozin, “Which training methods for gans do actually converge?” in International conference on machine learning.   PMLR, 2018, pp. 3481–3490.
  • (49) C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, “The caltech-ucsd birds-200-2011 dataset,” 2011.
  • (50) T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision.   Springer, 2014, pp. 740–755.
  • (51) C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2818–2826.
  • (52) A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” arXiv preprint arXiv:2103.00020, 2021.
  • (53) S. Ruan, Y. Zhang, K. Zhang, Y. Fan, F. Tang, Q. Liu, and E. Chen, “Dae-gan: Dynamic aspect-aware gan for text-to-image synthesis,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 13 960–13 969.
  • (54) Y. Yang, L. Wang, D. Xie, C. Deng, and D. Tao, “Multi-sentence auxiliary adversarial networks for fine-grained text-to-image synthesis,” IEEE Transactions on Image Processing, vol. 30, pp. 2798–2809, 2021.
  • (55) P. Shaw, J. Uszkoreit, and A. Vaswani, “Self-attention with relative position representations,” arXiv preprint arXiv:1803.02155, 2018.
  • (56) D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • (57) M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” Advances in neural information processing systems, vol. 30, 2017.
  • (58) X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7794–7803.
  • (59) S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • (60) T. Mikolov, M. Karafiát, L. Burget, J. Cernockỳ, and S. Khudanpur, “Recurrent neural network based language model.” in Interspeech, vol. 2, no. 3.   Makuhari, 2010, pp. 1045–1048.
  • (61) J. H. Lim and J. C. Ye, “Geometric gan,” arXiv preprint arXiv:1705.02894, 2017.
  • (62) S. Barratt and R. Sharma, “A note on the inception score,” arXiv preprint arXiv:1801.01973, 2018.
  • (63) A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “Pytorch: An imperative style, high-performance deep learning library,” 2019.
  • (64) T. Karras, T. Aila, S. Laine, and J. Lehtinen, “Progressive growing of gans for improved quality, stability, and variation,” arXiv preprint arXiv:1710.10196, 2017.
  • (65) T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 4401–4410.
  • (66) T.-Y. Lin, A. RoyChowdhury, and S. Maji, “Bilinear cnn models for fine-grained visual recognition,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1449–1457.
  • (67) Y. Chen, Y. Kalantidis, J. Li, S. Yan, and J. Feng, “A^2-Nets: Double attention networks,” Advances in neural information processing systems, vol. 31, 2018.
  • (68) D. A. Hudson and L. Zitnick, “Generative adversarial transformers,” in International Conference on Machine Learning.   PMLR, 2021, pp. 4487–4499.
  • (69) Y. Han, G. Huang, S. Song, L. Yang, H. Wang, and Y. Wang, “Dynamic neural networks: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
  • (70) T. Bolukbasi, J. Wang, O. Dekel, and V. Saligrama, “Adaptive neural networks for efficient inference,” in International Conference on Machine Learning.   PMLR, 2017, pp. 527–536.
  • (71) J. Lin, Y. Rao, J. Lu, and J. Zhou, “Runtime neural pruning,” Advances in neural information processing systems, vol. 30, 2017.
  • (72) A. Veit and S. Belongie, “Convolutional networks with adaptive inference graphs,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 3–18.
  • (73) E. Bengio, P.-L. Bacon, J. Pineau, and D. Precup, “Conditional computation in neural networks for faster models,” arXiv preprint arXiv:1511.06297, 2015.
  • (74) Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, “Learning efficient convolutional networks through network slimming,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2736–2744.
  • (75) G. Huang, D. Chen, T. Li, F. Wu, L. Van Der Maaten, and K. Q. Weinberger, “Multi-scale dense networks for resource efficient image classification,” arXiv preprint arXiv:1703.09844, 2017.
  • (76) Y. Li, L. Song, Y. Chen, Z. Li, X. Zhang, X. Wang, and J. Sun, “Learning dynamic routing for semantic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 8553–8562.
  • (77) L. Yang, Y. Han, X. Chen, S. Song, J. Dai, and G. Huang, “Resolution adaptive networks for efficient inference,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 2369–2378.
  • (78) A. Veit, S. Belongie, and T. Karaletsos, “Conditional similarity networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 830–838.
  • (79) L. Qu, M. Liu, D. Cao, L. Nie, and Q. Tian, “Context-aware multi-view summarization network for image-text matching,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 1047–1055.
  • (80) B. A. Plummer, P. Kordas, M. H. Kiapour, S. Zheng, R. Piramuthu, and S. Lazebnik, “Conditional image-text embedding networks,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 249–264.
  • (81) N. Parmar, A. Vaswani, J. Uszkoreit, L. Kaiser, N. Shazeer, A. Ku, and D. Tran, “Image transformer,” in International Conference on Machine Learning.   PMLR, 2018, pp. 4055–4064.
  • (82) P. Ramachandran, N. Parmar, A. Vaswani, I. Bello, A. Levskaya, and J. Shlens, “Stand-alone self-attention in vision models,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • (83) E. Jang, S. Gu, and B. Poole, “Categorical reparameterization with gumbel-softmax,” arXiv preprint arXiv:1611.01144, 2016.
  • (84) H. Tan, X. Liu, M. Liu, B. Yin, and X. Li, “Kt-gan: knowledge-transfer generative adversarial network for text-to-image synthesis,” IEEE Transactions on Image Processing, vol. 30, pp. 1275–1290, 2020.
  • (85) T. Hinz, S. Heinrich, and S. Wermter, “Semantic object accuracy for generative text-to-image synthesis,” IEEE transactions on pattern analysis and machine intelligence, 2020.
  • (86) D. H. Park, S. Azadi, X. Liu, T. Darrell, and A. Rohrbach, “Benchmark for compositional text-to-image synthesis,” in Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021.