
DR-GAN: Distribution Regularization for Text-to-Image Generation

Hongchen Tan, Xiuping Liu, Baocai Yin and Xin Li* (Corresponding author: Xin Li). Hongchen Tan and Baocai Yin are with the Artificial Intelligence Research Institute, Beijing University of Technology, Beijing 100124, China (e-mail: [email protected]; [email protected]). Xin Li is with the School of Electrical Engineering & Computer Science, and Center for Computation & Technology, Louisiana State University, Baton Rouge, LA 70808, USA (e-mail: [email protected]). Xiuping Liu is with the School of Mathematical Sciences, Dalian University of Technology, Dalian 116024, China (e-mail: [email protected]).
Abstract

This paper presents a new Text-to-Image generation model, named Distribution Regularization Generative Adversarial Network (DR-GAN), which generates images from text descriptions through improved distribution learning. In DR-GAN, we introduce two novel modules: a Semantic Disentangling Module (SDM) and a Distribution Normalization Module (DNM). SDM combines a spatial self-attention mechanism with a new Semantic Disentangling Loss (SDL) to help the generator distill key semantic information for image generation. DNM uses a Variational Auto-Encoder (VAE) to normalize and denoise the image latent distribution, which helps the discriminator better distinguish synthesized images from real images. DNM also adopts a Distribution Adversarial Loss (DAL) to guide the generator to align with normalized real image distributions in the latent space. Extensive experiments on two public datasets demonstrate that DR-GAN achieves competitive performance on the Text-to-Image task. Code: https://github.com/Tan-H-C/DR-GAN-Distribution-Regularization-for-Text-to-Image-Generation

Index Terms:
Generative Adversarial Network, Distribution Normalization, Text-to-Image Generation, Semantic Disentanglement Mechanism

I Introduction

Generating photographic images from text descriptions (known as Text-to-Image generation, T2I) is a challenging cross-modal generation technique and a core component of many computer vision tasks such as Image Editing [28, 51], Story Visualization [53], and Multimedia Retrieval [19]. Compared with image generation [26, 17, 22] and image processing [6, 5, 23] tasks within a single modality, it is difficult to build a heterogeneous semantic bridge between text and image [54, 48, 40]. Many state-of-the-art T2I algorithms [31, 25, 36, 9, 3, 42] first extract text features and then use Generative Adversarial Networks (GANs) [7] to generate the corresponding image. In essence, they map a text feature distribution to an image distribution. However, two factors prevent GAN-based T2I methods from capturing the real image distribution: (1) the abstractness and ambiguity of text descriptions make it difficult for the generator to capture the key semantic information for image generation [54, 34]; (2) the diversity of visual information makes the image distribution so complex that it is difficult for GAN-based T2I models to capture the real image distribution from the text feature distribution [8]. Thus, this work explores better distribution learning strategies to enhance GAN-based T2I models.

Among multi-modal perceptual information, the semantics of a text description is usually abstract and ambiguous, whereas image information is usually concrete and carries rich spatial structure. Text and image information are expressed in different patterns, which makes it difficult to establish semantic correlation based on feature vectors or tensors. Thus, it is hard for the generator to accurately capture key semantics from text descriptions for image generation. As a result, in the intermediate stage of generation, the image features contain a lot of non-key semantics. Such inaccurate semantics often lead to ineffective image distribution learning, and the generated images are then often semantically inconsistent and exhibit chaotic structures and details. To alleviate this issue, our first strategy is to design an information disentangling mechanism on the intermediate features, to better distill key information before performing cross-modal distribution learning.

In addition, images often contain diverse visual information, messy backgrounds, and other non-key visual content, so their latent distribution is often complex. Moreover, the distribution of images is difficult to model explicitly [8], which means that we cannot directly and explicitly learn the target image distribution from the text features. As an outstanding image generation model, GANs [7] learn the target data distribution implicitly by sampling data from the true or fake data distribution. However, such a complex image distribution makes it difficult for the discriminator in GANs to tell whether the current input image is sampled from the real image distribution or the generated image distribution. Our second strategy is therefore to design an effective distribution normalization mechanism to normalize the image latent distribution. This mechanism aims to help the discriminator better learn the distribution decision boundary between generated and real images.

Based on the above two strategies, we build a new Text-to-Image generation model, the Distribution Regularization Generative Adversarial Network (DR-GAN). DR-GAN contains two novel modules: the Semantic Disentangling Module (SDM) and the Distribution Normalization Module (DNM). In SDM, we introduce a spatial self-attention mechanism and propose a new Semantic Disentangling Loss (SDL) to help the generator better distill key information from texts and images while capturing the image distribution. In DNM, we introduce a Variational Auto-Encoder (VAE) [24] into the GAN-based T2I pipeline to normalize image distributions in the latent space. We also propose a Distribution Adversarial Loss (DAL) to align the learned distribution with the real distribution in the normalized latent space. With DNM and SDM, DR-GAN can generate an image latent distribution that better matches the real image distribution, and thus generate higher-quality images. The main contributions are summarized as follows:

  • (i)

    We propose a Semantic Disentangling Module (SDM) to help the generator distill key information (and filter out the non-key information) from both text and image features.

  • (ii)

    We design a new Distribution Normalization Module (DNM) which introduces the VAE into the GAN-based T2I pipeline so that it can more effectively normalize and denoise image latent distributions.

  • (iii)

    Extensive experimental results and analysis show the efficacy of DR-GAN on two benchmarks: CUB-Bird [47] and large-scale MS-COCO [27] over four metrics.

II Related Work

II-A GANs in Text-to-Image Generation

With the recent successes of GANs [7], a large number of GAN-based T2I methods [37, 55, 11, 50, 25, 36, 3, 39, 9, 31, 42] have boosted the performance of the T2I task. Reed et al. [37] first introduced the adversarial process to generate images from text descriptions. However, their model could only generate images at $64\times 64$ resolution, and the image quality was limited. StackGAN/StackGAN++ [55, 11] and HDGAN [56] then adopted multi-stage generation patterns to progressively enhance the quality and detail of synthesized images. Thereafter, the multi-stage generation framework has been widely used in GAN-based T2I methods. Built on this framework, AttnGAN [50], DMGAN [31], and CPGAN [18] adopted word-level or object-level attention mechanisms to help the generator enhance the semantics of local regions or objects. MirrorGAN [36] combined Text-to-Image generation and Image-to-Text generation to improve the global semantic consistency between the text description and the generated image. SDGAN [9] and SEGAN [39] combined the Siamese Network and the contrastive loss to enhance the semantics of the synthesized image. Like these T2I methods, we also adopt the multi-stage generation framework to build our DR-GAN. Different from them, however, we help the generator better distill key information for distribution learning.

II-B GANs in distribution learning

GANs [7], as a latent distribution learning strategy, have been widely adopted in various generation tasks [37, 1, 22, 35, 51, 53]. However, GANs tend to suffer from unstable training, mode collapse, and uncontrollable behavior, which hinder them from effectively modeling the real data distribution. Recently, many approaches [33, 11, 4, 29, 32, 55] have been explored to overcome these issues. LSGANs [29] overcome the vanishing gradient problem by replacing the entropy loss with a least squares loss. WGAN [30] introduced the Earth Mover (EM) distance to improve the stability of distribution learning and provide meaningful learning curves useful for hyperparameter search and debugging. MDGAN [4] improved the stability of distribution learning by utilizing an encoder $E(x):x\rightarrow z$ to produce the latent variable $z$ for the generator $G$. SN-GANs [32] proposed a novel weight normalization method named spectral normalization to better stabilize the training of the discriminator. F-GAN [33] adopted the Kullback-Leibler (KL) divergence to help align the generated image distribution with the real data distribution. In the T2I task, many GAN-based methods introduce various strategies, such as multi-stage generation [11, 55], attention mechanisms [50, 31], and cycle-consistency mechanisms [36], to help match the synthesized distribution with the real image distribution. However, diverse visual information, messy backgrounds, and other non-key visual content usually make the image distribution complicated, which makes distribution learning more difficult. Thus, our idea is to explore an effective distribution normalization strategy to overcome this challenge.

III DR-GAN for text-to-image generation

Figure 1: The framework of proposed Distribution Regularization Generative Adversarial Network (DR-GAN).

III-A Overview

Most recent GAN-based T2I methods [50, 43, 25, 36, 39, 31, 9] adopted a multi-stage generation framework to progressively map the text embedding distribution to the image distribution, to synthesize high-quality images. Like all these methods, we also adopt such a generation pattern from AttnGAN [50] as our baseline to build the DR-GAN.

As shown in Fig. 1, DR-GAN has a Text Encoder [50], a conditioning augmentation module [55] $F^{ca}$, $m$ generation modules $G_{i}^{o}$ ($i=0,1,\ldots,m-1$), and two new designs: $m$ Semantic Disentangling Modules (SDMs) $SDM_{i}$ and $m$ Distribution Normalization Modules (DNMs) $DNM_{i}$, $i=0,1,\ldots,m-1$.

The Text Encoder transforms the input text description (a single sentence) into the sentence feature $s^{\prime}$ and word features $W$. $F^{ca}$ [55] converts the sentence feature $s^{\prime}$ to a conditioned sentence feature $s$. The SDM distills key information from the text or image features in the intermediate stages of generation, for better approximating the real image distribution. In the testing stage, the SDMs take noise $z\sim N(0,1)$, the sentence feature $s$, and the word features $W$ to produce a series of hidden features $H_{i}$ ($i=0,1,\ldots,m-1$); in the training stage, besides $z$, $s$, and $W$, the $i$-th $SDM_{i}$ also takes the $i$-th scale real image $I^{*}_{i}$ as input to generate $H_{i}$. Then, $G_{i}^{o}$ takes $H_{i}$ to generate the $i$-th scale image $\hat{I}_{i}$. $DNM_{i}$ contains a Variational Auto-Encoder (VAE) module and a discriminator: the former normalizes image latent distributions, and the latter distinguishes between real and synthesized images. The generation-stage information flow is formulated as

SDM_{0}: H_{0}=SDM_{0}(z, F^{ca}(s^{\prime}), I_{0}^{*}) \quad \textrm{(training stage)};
SDM_{i}: H_{i}=SDM_{i}(H_{i-1}, W, I_{i}^{*}) \quad \textrm{(training stage)};
SDM_{0}: H_{0}=SDM_{0}(z, F^{ca}(s^{\prime})) \quad \textrm{(testing stage)};
SDM_{i}: H_{i}=SDM_{i}(H_{i-1}, W) \quad \textrm{(testing stage)};
G_{i}^{o}: \hat{I}_{i}=G_{i}^{o}(H_{i}), \quad i=0,1,\ldots,m-1.   (1)

Note that $SDM_{0}$ only contains a series of convolution layers and upsampling modules; the full Semantic Disentangling Module (SDM) described below is adopted in $SDM_{i}$, $i=1,2,\ldots,m-1$.
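To make the information flow of Eq. (1) concrete, the following PyTorch-style sketch traces one training-stage forward pass. It is only an illustration under assumed interfaces: the module handles (text_encoder, f_ca, sdm, g_out), the noise dimension of 100, and the batched tensor shapes are assumptions, not the authors' released implementation.

```python
import torch

# Assumed module handles, mirroring Fig. 1 and Eq. (1):
#   text_encoder : sentence -> (s_prime, W)     (Text Encoder [50])
#   f_ca         : s_prime  -> s                (conditioning augmentation [55])
#   sdm[0]       : (z, s, I0*) -> H0            (convolution + upsampling only)
#   sdm[i]       : (H_{i-1}, W, Ii*) -> Hi      (full SDM, i >= 1)
#   g_out[i]     : Hi -> generated image at the i-th scale

def dr_gan_forward(text_encoder, f_ca, sdm, g_out, sentence, real_images, m=3):
    """Training-stage generation flow of Eq. (1); real_images[i] is the i-th scale real image."""
    s_prime, W = text_encoder(sentence)                 # sentence feature s', word features W
    s = f_ca(s_prime)                                   # conditioned sentence feature s
    z = torch.randn(s.size(0), 100, device=s.device)    # noise z ~ N(0, 1); dim 100 is an assumption

    hidden, fakes = [], []
    H = sdm[0](z, s, real_images[0])                    # SDM_0
    hidden.append(H)
    fakes.append(g_out[0](H))                           # \hat{I}_0
    for i in range(1, m):
        H = sdm[i](H, W, real_images[i])                # SDM_i
        hidden.append(H)
        fakes.append(g_out[i](H))                       # \hat{I}_i
    return hidden, fakes
```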

III-B Semantic Disentangling Module (SDM)

Figure 2: The architecture of the proposed Semantic Disentangling Module (SDM). WAM: Word-level Attention Mechanism (WAM) [50]; RIRM: Real Image Reconstruction Module; SD Loss: Semantic Disentangling Loss.

The semantic abstractness and ambiguity of text descriptions often make it difficult for the generator to capture the accurate semantics of a given sentence. Such inaccurate semantics are not conducive to distribution learning, and then lead to incorrect spatial structures and semantics in synthesized images. To this end, we propose a Semantic Disentangling Module (SDM) (Fig. 2) to help the generator suppress irrelevant spatial information and highlight relevant spatial information for generating high-quality images.

We build our SDM based on the widely adopted Cascaded Attentional Generative Model (CAGM) introduced by AttnGAN [50], because the Word-level Attention Mechanism (WAM) in CAGM can effectively enhance semantic details of the generated images.

Initially, the SDM was designed to directly extract the key and non-key information from text features. However, because text and image belong to heterogeneous domains, it is difficult to achieve reasonable semantic matching in the feature space [54]. Compared with the text features, the word-level context features from the WAM contain more structural and semantic information and match the image features better semantically. Besides, the semantics of the word-level context features and image features come from the input word features $W$ and sentence feature $s$. Therefore, the SDM is designed to distill key information and filter out non-key information on the word-level context features and image features of the CAGM.

In this section, we first revisit the WAM [50] to acquire the word-level context features and image features. Second, we introduce the Spatial Self-Attention Mechanism (SSAM) to represent the key and non-key information. Finally, we introduce the Semantic Disentangling Loss (SDL) to drive the SDM to conduct the semantic disentangling.

III-B1 Word-level Attention Mechanism (WAM) [50]

In cases where there is no ambiguity, we omit subscripts in this description of the WAM. As shown in Fig. 2, the WAM has two inputs: the word features $W\in\mathbb{R}^{D\times T}$ and the image features $H\in\mathbb{R}^{\hat{D}\times N}$ from the previous hidden layer.

First, the word features are mapped to the same latent semantic space as the image features, i.e., $W^{\prime}=UW$, $W^{\prime}=\{w^{\prime}_{i}\in\mathbb{R}^{\hat{D}}\,|\,i=1,2,\ldots,T\}$, where $U\in\mathbb{R}^{\hat{D}\times D}$ is a perceptual layer. Each column of $H=\{h_{j}\in\mathbb{R}^{\hat{D}}\,|\,j=1,2,\ldots,N\}$ (hidden features) is a feature vector of an image sub-region.

Second, for the $j$-th image sub-region, the dynamic representation of word vectors w.r.t. $h_{j}$ is

q_{j}=\sum_{i=1}^{T}\theta_{j,i}w_{i}^{\prime},\quad \textrm{where}\ \theta_{j,i}=\frac{\exp(S_{j,i}^{\prime})}{\sum_{k=1}^{T}\exp(S_{j,k}^{\prime})}.   (2)

Here $S^{\prime}_{j,i}=h^{T}_{j}w^{\prime}_{i}$, and $\theta_{j,i}$ indicates the weight the model assigns to the $i$-th word when generating the $j$-th sub-region of the image.

Third, the word-level context feature for $H$ is denoted by $Q^{\prime}=(q_{1},q_{2},\ldots,q_{N})\in\mathbb{R}^{\hat{D}\times N}$.

As shown in Fig. 2, the WAM generates a word-level context feature $Q^{\prime}$ from the given word features $W$ and image feature $H$. Here, $Q^{\prime}$ is a weighted combination of word features that expresses the image feature $H$, and it can effectively enrich the semantics of image details [50]. Due to the abstractness and ambiguity of the text description, the generator is prone to parsing wrong or inaccurate semantics, and such inaccurate semantic extraction leads to incorrect semantics and structures in $Q^{\prime}$ and $H$. Thus, it is necessary to distill the key information from the word-level context feature $Q^{\prime}$ and the intermediate image feature $H$ for image generation. Next, we use the Spatial Self-Attention Mechanism to represent the key and non-key features of $Q^{\prime}$ and $H$, respectively.
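A minimal sketch of the word-level attention in Eq. (2), assuming batch-first tensors (word features of shape (B, T, D) and sub-region features of shape (B, N, D̂)); the class name and the use of a single linear layer for the perceptron U are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordLevelAttention(nn.Module):
    """Word-level attention of Eq. (2): builds a word-context feature Q' for H."""
    def __init__(self, word_dim, hidden_dim):
        super().__init__()
        self.U = nn.Linear(word_dim, hidden_dim)   # maps word features into the image feature space

    def forward(self, W, H):
        # W: (B, T, D) word features; H: (B, N, D_hat) sub-region features
        W_prime = self.U(W)                                    # (B, T, D_hat)
        scores = torch.bmm(H, W_prime.transpose(1, 2))         # S'_{j,i} = h_j^T w'_i, shape (B, N, T)
        theta = F.softmax(scores, dim=-1)                      # attention over words for each sub-region
        Q_prime = torch.bmm(theta, W_prime)                    # q_j = sum_i theta_{j,i} w'_i, (B, N, D_hat)
        return Q_prime
```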

III-B2 Spatial Self-Attention Mechanism

As shown in Fig. 2, we use a Spatial Self-Attention Mechanism to represent the key and non-key information of the word-level context feature $Q^{\prime}_{i}$ and the intermediate image feature $H_{i-1}$, respectively.

First, we represent the key and non-key information of the image feature $H_{i-1}$. The spatial attention mask $Mask_{i}^{H}$ of the feature $H_{i-1}$ is defined as

Mask_{i}^{H}=Sig.(Conv_{1\times 1}^{2}(ReLU(Conv_{3\times 3}^{1}(H_{i-1})))),   (3)

where $H_{i-1}$ is followed by a $3\times 3$ convolutional layer $Conv_{3\times 3}^{1}(\cdot)$ and a $1\times 1$ convolutional layer $Conv_{1\times 1}^{2}(\cdot)$, and ReLU and Sigmoid (Sig.) are used as activation functions. We use $H^{+}_{i}=Mask_{i}^{H}\odot H_{i-1}$ to express the key spatial information of $H_{i-1}$, and $H^{-}_{i}=H_{i-1}-H^{+}_{i}$ to express the non-key information.

Second, we represent the key and non-key information of the word-level context feature $Q^{\prime}_{i}$. Although $Q^{\prime}_{i}$ can reflect the semantic matching between words and image sub-regions, $Q^{\prime}_{i}$ is still defined in the text embedding space and lacks the necessary spatial structure information. Thus, we design a convolution module "ResBlock" to convert the word-context matrix $Q^{\prime}_{i}$ to the refined feature $Q_{i}$. The spatial attention mask $Mask_{i}^{Q}$ of the feature $Q_{i}$ is defined as

Mask_{i}^{Q}=Sig.(Conv_{1\times 1}^{2}(ReLU(Conv_{3\times 3}^{1}(Q_{i})))).   (4)

We use $Q^{+}_{i}=Mask_{i}^{Q}\odot Q_{i}$ to express the key spatial information of $Q_{i}$, and $Q^{-}_{i}=Q_{i}-Q^{+}_{i}$ to express the non-key information.
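The mask computation of Eqs. (3)-(4) and the key/non-key split can be sketched as below. The single-channel mask broadcast over channels and the placeholder channel counts are assumptions; the same module would be applied to $H_{i-1}$ and to the refined word-context feature $Q_{i}$.

```python
import torch
import torch.nn as nn

class SpatialSelfAttention(nn.Module):
    """Predicts a spatial mask (Eqs. (3)-(4)) and splits a feature map into key / non-key parts."""
    def __init__(self, channels, mid_channels=None):
        super().__init__()
        mid_channels = mid_channels or channels
        self.conv1 = nn.Conv2d(channels, mid_channels, kernel_size=3, padding=1)  # Conv_{3x3}^1
        self.conv2 = nn.Conv2d(mid_channels, 1, kernel_size=1)                    # Conv_{1x1}^2

    def forward(self, feat):
        # feat: (B, C, h, w), e.g. H_{i-1} or the refined word-context feature Q_i
        mask = torch.sigmoid(self.conv2(torch.relu(self.conv1(feat))))  # (B, 1, h, w)
        key = mask * feat            # H_i^+ / Q_i^+ : key spatial information
        non_key = feat - key         # H_i^- / Q_i^- : non-key information
        return key, non_key, mask
```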

III-B3 Semantic Disentangling Loss

To drive the SDM to better distinguish between the key and non-key information of $Q_{i}$ and $H_{i-1}$, we further design a new Semantic Disentangling Loss (SDL) term. In the generation task, the generated image distribution and the real image distribution are assumed to be the same type of distribution [12]; if the mean and variance of two such distributions are the same, then the two distributions are identical. Therefore, we use constraints on the mean and variance to separate the key information from the non-key information of $Q_{i}$ and $H_{i-1}$ when constructing the SDL. Specifically, we push the mean and variance of the key information to approximate those of real images in the latent space, and vice versa. The SDL on $H_{i-1}$ is defined as

\mathcal{L}_{SDL}^{H_{i}}= SP(||\mu(H^{+}_{i})-\mu(H^{*}_{i})||-||\mu(H^{-}_{i})-\mu(H^{*}_{i})||)
+ SP(||\sigma(H^{+}_{i})-\sigma(H^{*}_{i})||-||\sigma(H^{-}_{i})-\sigma(H^{*}_{i})||).   (5)

Similarly, the SDL on the feature map $Q_{i}$ is denoted as

\mathcal{L}_{SDL}^{Q_{i}}= SP(||\mu(Q^{+}_{i})-\mu(H^{*}_{i})||-||\mu(Q^{-}_{i})-\mu(H^{*}_{i})||)
+ SP(||\sigma(Q^{+}_{i})-\sigma(H^{*}_{i})||-||\sigma(Q^{-}_{i})-\sigma(H^{*}_{i})||).   (6)

Here, $\mu(\cdot)$ and $\sigma(\cdot)$ compute the mean and variance of feature maps within a batch, and $SP(x)=\ln(1+e^{x})$.

$H^{*}_{i}$ is the feature map of the corresponding real image $I^{*}_{i}$. We introduce the Real Image Reconstruction Module (RIRM), shown in Fig. 2, to acquire the real image features $H^{*}_{i}$. The RIRM contains an encoder and a decoder. The encoder takes the real image $I^{*}_{i}$ as input and outputs the real image feature $H^{*}_{i}$. The decoder takes the real image feature $H^{*}_{i}$ and reconstructs the real image under the reconstruction loss $||RIRM(I^{*}_{i})-I^{*}_{i}||_{1}$. Note that the decoder and the generation modules $G^{o}$ form a Siamese Network, which provides high-quality real image features for the SDM. The loss functions in the SDM are summarized as

\mathcal{L}_{SDL_{i}}=\lambda_{1}\mathcal{L}_{SDL}^{H_{i}}+\lambda_{2}\mathcal{L}_{SDL}^{Q_{i}}+\lambda_{3}||RIRM(I^{*}_{i})-I^{*}_{i}||_{1}.   (7)

Here, $||RIRM(I^{*}_{i})-I^{*}_{i}||_{1}$ is the reconstruction loss in the RIRM.

Finally, the purified key features $Q^{+}_{i}$ and $H^{+}_{i}$ are concatenated, then fused and upsampled (by a ResBlock and an up-sampling module) to obtain the next-stage feature map $H_{i}$.
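A hedged sketch of Eqs. (5)-(7): it uses softplus for SP(·), scalar batch statistics for μ(·) and σ(·) (the exact norm and the per-feature-map statistics used by the authors may differ), and the default weights quoted later in Sec. IV-D.

```python
import torch
import torch.nn.functional as F

def sdl_term(feat_key, feat_nonkey, feat_real):
    """One SDL term (Eq. (5)/(6)): pull the key statistics toward the real image feature H*."""
    mu_k, mu_n, mu_r = feat_key.mean(), feat_nonkey.mean(), feat_real.mean()
    sd_k, sd_n, sd_r = feat_key.std(), feat_nonkey.std(), feat_real.std()
    loss = F.softplus((mu_k - mu_r).abs() - (mu_n - mu_r).abs())
    loss = loss + F.softplus((sd_k - sd_r).abs() - (sd_n - sd_r).abs())
    return loss

def sdm_loss(H_key, H_nonkey, Q_key, Q_nonkey, H_real, I_real, rirm,
             lam1=1e-3, lam2=1e-1, lam3=1e-5):
    """Eq. (7): SDL on H and Q plus the RIRM reconstruction loss (weights from Sec. IV-D)."""
    rec = F.l1_loss(rirm(I_real), I_real)          # ||RIRM(I*) - I*||_1
    return lam1 * sdl_term(H_key, H_nonkey, H_real) \
         + lam2 * sdl_term(Q_key, Q_nonkey, H_real) \
         + lam3 * rec
```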

Figure 3: The image feature $H_{1}$ (Stage-1), word-context matrix $Q^{\prime}_{2}$ (Stage-2), image feature $H_{2}$ (Stage-2), and $H_{2}$'s corresponding generated image $\hat{I}_{2}$ (Stage-2). The examples in the first row are from our Baseline (Base., AttnGAN [50]); the examples in the second row are from Baseline+SDM (Base.+SDM). The warmer the color, the stronger the feature response.

In Fig. 3, we show the image feature $H_{1}$ (stage-1), the word-level context feature $Q^{\prime}_{2}$ (stage-2), the image feature $H_{2}$ (stage-2), and the synthesized image $\hat{I}_{2}$ (stage-2). The first row is the visualization of the Baseline (Base.) and the second row is the visualization of Baseline+SDM (Base.+SDM).

In the first row (Baseline) of Fig. 3: due to the abstractness and ambiguity of text descriptions, it is difficult for the generator to accurately capture key semantics, and some non-key semantics get mixed up in the generation process; strange structures and details therefore easily appear in the generated image features such as $H_{1}$. The calculation of $Q^{\prime}_{2}$ requires the participation of $H_{1}$, which further leads to strange structures and details in $Q^{\prime}_{2}$; as a result, the structure of the generated image $\hat{I}_{2}$ is chaotic. In the second row (Base.+SDM) of Fig. 3: with the key-information selection strategy driven by the SDM, the non-key structural information in $H_{1}$ and $Q^{\prime}_{2}$ is better filtered out. Consequently, the structure and semantics of the image feature $H_{2}$ become more reasonable, and so does the structure of the synthesized image $\hat{I}_{2}$.

III-C Distribution Normalization Module (DNM)

The aforementioned SDM can help the generator distill key semantic information for image generation. However, diverse visual information, messy backgrounds, and other non-key visual content in images usually make the image distribution more complicated. On the discriminator's side, such complicated image distributions make it harder to distinguish real from synthesized images; the discriminator may fail to effectively identify the synthesized image.

Figure 4: The architecture of the DNM. The input $x$ is a generated image $\hat{I}_{i}$ or a real image $I^{*}_{i}$. The discriminator $D_{i}$ connects an encoder $E^{D}_{i}(\cdot)$ to a logical classifier $\psi_{i}(\cdot)$. The VAE module consists of a variational encoder (which stacks $E^{D}_{i}(\cdot)$ and a variational sampling module $\varphi_{i}(\cdot)$) and a decoder $D^{E}_{i}(\cdot)$.

Data normalization mechanisms [16, 2, 45, 49] can reduce the noise and internal covariate shift of data and further improve manifold learning efficiency in the deep learning community. In the discrimination stage of a GAN, images are first sampled from the real image distribution and the synthesized image distribution; the discriminator then determines whether the current input image is sampled from the real or the synthesized distribution, i.e., True or False. Due to the diversity of images and the randomness of synthesized images, the distributions of synthesized and real images are complicated. Such complex image distributions make it difficult for the discriminator to tell from which distribution the current input image was sampled, and difficult for the generator to align the generated distribution with the real image distribution. Therefore, it is necessary to reduce the complexity of the distribution. Normalization is an effective strategy for denoising and reducing complexity, so we introduce normalization of the latent distribution.

As a generative model, the Variational Auto-Encoder (VAE) [24] can effectively denoise the latent distribution and reduce its complexity. The VAE assumes that the latent embedding vector of an image follows a Gaussian distribution $N(\tilde{\mu},\tilde{\sigma})$, and then normalizes $N(\tilde{\mu},\tilde{\sigma})$ toward a standard normal distribution $N(0,1)$. Thanks to the image reconstruction in the VAE, the normalized embedding vector can preserve key semantic visual information. Thus, we build a VAE module in the DNM to normalize the image latent distributions and help the discriminator better distinguish between "True" and "Fake" images.

The structure of the $i$-th DNM is shown in Fig. 4. The DNM contains two sub-modules: the discriminator $D_{i}$ and the VAE module $A_{i}$. To simplify the notation, we omit the subscript $i$. $x$ can be a generated image $\hat{I}$ or a real image $I^{*}$. The discriminator is composed of an encoder $E^{D}(\cdot)$ and a logical classifier $\psi(\cdot)$. $E^{D}(\cdot)$ encodes the image $x$ into an embedding vector $v$. The embedding $v$, combined with the text embedding $s$, is fed to the logical classifier $\psi(\cdot)$, which identifies whether $x$ is a real or generated image. As mentioned above, diverse visual information, messy backgrounds, and other non-key visual content in images make the distribution of the embedding vectors $v$ complicated and make the identification of $x$ harder. Thus, we adopt a VAE module to normalize and denoise the latent distribution of the embedding vectors $v$. In addition to reducing the complexity of the image latent distribution, the VAE also pushes the encoded image feature vector $v$ to record important image semantics (through reconstruction).

Our VAE module $A$ adopts a standard Variational Auto-Encoder architecture [24]. As shown in Fig. 4, $A$ has a variational encoder (which consists of the encoder $E^{D}(\cdot)$ and a variational sampling module $\varphi(\cdot)$) and a decoder $D^{E}(\cdot)$.

The information flow of the VAE module $A$ is as follows.

(1) Given an image $x$, $x$ is first fed to the encoder $E^{D}(\cdot)$, which outputs the image latent embedding $v$.

(2) $\varphi(\cdot)$ infers the mean and variance of $v$ and builds a Gaussian distribution $N(\tilde{\mu}(\varphi(v)),\tilde{\sigma}(\varphi(v)))$, which is further normalized toward a normal distribution by $KL(N(\tilde{\mu}(\varphi(v)),\tilde{\sigma}(\varphi(v)))\,||\,N(0,1))$. To make this procedure differentiable, the re-sampling trick [24] is adopted to get $z^{*}=z\cdot\tilde{\sigma}(\varphi(v))+\tilde{\mu}(\varphi(v))$, $z\sim N(0,1)$.

(3) $z^{*}$ and the text embedding $s$ are concatenated and then fed to the decoder $D^{E}(\cdot)$ to reconstruct the image $x^{*}$. Note that the decoder takes both $s$ and $z^{*}$ for reconstruction, because the image generation here is conditioned on the text description.
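Steps (1)-(3) can be summarized in the following sketch, assuming the shared encoder $E^{D}$ returns a flat embedding $v$ and that $\varphi(\cdot)$ is realized as two linear heads predicting the mean and log-variance; the reparameterization and KL term follow the standard VAE formulation [24].

```python
import torch
import torch.nn as nn

class DNMVae(nn.Module):
    """VAE branch of the DNM: encode x, normalize the latent toward N(0, 1), decode conditioned on s."""
    def __init__(self, encoder, decoder, embed_dim, latent_dim):
        super().__init__()
        self.encoder = encoder                               # shared with the discriminator: E^D(x) -> v
        self.decoder = decoder                               # D^E([z*, s]) -> reconstructed image x*
        self.to_mu = nn.Linear(embed_dim, latent_dim)        # phi(.): mean head
        self.to_logvar = nn.Linear(embed_dim, latent_dim)    # phi(.): log-variance head

    def forward(self, x, s):
        v = self.encoder(x)                                  # (1) image latent embedding v
        mu, logvar = self.to_mu(v), self.to_logvar(v)        # (2) Gaussian N(mu, sigma)
        z = torch.randn_like(mu)
        z_star = mu + torch.exp(0.5 * logvar) * z            #     re-sampling trick: z* = z * sigma + mu
        x_rec = self.decoder(torch.cat([z_star, s], dim=1))  # (3) decode conditioned on the sentence s
        # KL(N(mu, sigma^2) || N(0, 1)), averaged over the batch
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1).mean()
        return x_rec, kl
```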

III-C1 Distribution Adversarial Loss

Following the alternating optimization mechanism of GANs, our VAE module is trained together with the discriminator. Based on the variational lower bound of the VAE [24], the loss function of the VAE module in $DNM_{i}$ can be defined as

\mathcal{L}_{D_{i}^{D}}= ||\hat{I}_{i}-D^{E}_{i}(\varphi_{i}(E^{D}(\hat{I}_{i})),s)||_{1}+||I_{i}^{*}-D^{E}_{i}(\varphi_{i}(E^{D}(I_{i}^{*})),s)||_{1}
+ KL(N(\tilde{\mu}_{i}(\varphi_{i}(E^{D}(\hat{I}_{i}))),\tilde{\sigma}_{i}(\varphi_{i}(E^{D}(\hat{I}_{i}))))\,||\,N(0,1))
+ KL(N(\tilde{\mu}_{i}(\varphi_{i}(E^{D}(I_{i}^{*}))),\tilde{\sigma}_{i}(\varphi_{i}(E^{D}(I_{i}^{*}))))\,||\,N(0,1)).   (8)

In each generator training step, the generated images have an unnormalized distribution, while the distributions of real images have already been normalized in the discriminator's training step. Hence, it is difficult for the generator to produce distributions that approximate the normalized real image distribution. Therefore, (1) we want to normalize the generated image distribution in the VAE module during the generator's training step, and (2) we want to align the normalized generated distribution with the normalized real image distribution. To achieve these two goals, we define a distribution consistency loss, i.e.,

\mathcal{L}_{G_{i}^{D}}= KL(N(\tilde{\mu}_{i}(\varphi_{i}(E^{D}(\hat{I}_{i}))),\tilde{\sigma}_{i}(\varphi_{i}(E^{D}(\hat{I}_{i}))))\,||\,N(0,1))
+ ||I_{i}^{*}-D^{E}_{i}(\varphi_{i}(E^{D}(\hat{I}_{i})),s)||_{1},   (9)

where the first term is designed for our first goal, and the second term is designed for our second goal.

We denote the two loss functions $\mathcal{L}_{G_{i}^{D}}$ and $\mathcal{L}_{D_{i}^{D}}$ as the Distribution Adversarial Loss (DAL) terms. In the discriminator's training stage, $\mathcal{L}_{D_{i}^{D}}$ helps the discriminator better distinguish synthesized images from real images and better learn the decision boundary between the generated and real image latent distributions. In the generator's training stage, $\mathcal{L}_{G_{i}^{D}}$ helps the generator learn and capture the real image distribution in the normalized latent space.

Our DNM module, combining the VAE and the DAL, can effectively reduce the complexity of the distribution constructed by the image embedding $v$ and enrich the high-level semantic information of $v$. Such a normalized embedding $v$ helps the discriminator better distinguish between "Fake" and "True" images. Consequently, the generator can also better align the generated distribution with the real image distribution.
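Reusing the VAE sketch above, the two DAL terms of Eqs. (8)-(9) could be assembled as follows; note that the generator-side reconstruction pulls the decoding of the fake image's latent toward the corresponding real image, as in Eq. (9). The function names are illustrative.

```python
import torch.nn.functional as F

def dal_discriminator_loss(vae, fake_img, real_img, s):
    """Eq. (8): reconstruct both images and normalize both latents during the discriminator step."""
    fake_rec, kl_fake = vae(fake_img.detach(), s)   # generator is frozen in this step
    real_rec, kl_real = vae(real_img, s)
    return (F.l1_loss(fake_rec, fake_img) + F.l1_loss(real_rec, real_img)
            + kl_fake + kl_real)

def dal_generator_loss(vae, fake_img, real_img, s):
    """Eq. (9): normalize the fake latent and pull its reconstruction toward the real image."""
    fake_rec, kl_fake = vae(fake_img, s)
    return kl_fake + F.l1_loss(fake_rec, real_img)
```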

III-D Objective Functions in DR-GAN

Combining the above modules, at the $i$-th stage of DR-GAN, the generative loss $\mathcal{L}_{G_{i}}$ and the discriminative loss $\mathcal{L}_{D_{i}}$ are defined as

\mathcal{L}_{G_{i}}=\underbrace{-\frac{1}{2}\mathbb{E}_{\hat{I}_{i}\sim P_{G_{i}}}[\log D_{i}(\hat{I}_{i})]}_{\text{unconditional loss}}-\underbrace{\frac{1}{2}\mathbb{E}_{\hat{I}_{i}\sim P_{G_{i}}}[\log D_{i}(\hat{I}_{i},s)]}_{\text{conditional loss}},   (10)

where the unconditional loss drives the generated images toward the real image distribution to fool the discriminator, and the conditional loss drives the generated images to better match the text descriptions. $D_{i}(\cdot)=\hat{\psi}(E^{D}_{i}(\cdot))$ is the unconditional discriminator and $D_{i}(\cdot,\cdot)=\psi(E^{D}_{i}(\cdot),s)$ is the conditional discriminator, where $\hat{\psi}(\cdot)$ and $\psi(\cdot)$ are the unconditional and conditional logical classifiers, respectively.

The discriminator $D_{i}$ is trained to classify the input image into the "Fake" or "True" class by minimizing the cross-entropy loss

\mathcal{L}_{D_{i}}= \underbrace{-\frac{1}{2}\mathbb{E}_{I^{*}_{i}\sim P_{data_{i}}}[\log D_{i}(I^{*}_{i})]-\frac{1}{2}\mathbb{E}_{\hat{I}_{i}\sim P_{G_{i}}}[\log(1-D_{i}(\hat{I}_{i}))]}_{\text{unconditional loss}}
+\underbrace{-\frac{1}{2}\mathbb{E}_{I^{*}_{i}\sim P_{data_{i}}}[\log D_{i}(I^{*}_{i},s)]-\frac{1}{2}\mathbb{E}_{\hat{I}_{i}\sim P_{G_{i}}}[\log(1-D_{i}(\hat{I}_{i},s))]}_{\text{conditional loss}},   (11)

where $I^{*}_{i}$ is drawn from the real image distribution $P_{data_{i}}$ at the $i$-th scale, and $\hat{I}_{i}$ is drawn from the distribution $P_{G_{i}}$ of generated images at the same scale.

To generate realistic images, the final objective functions in the generation training stage ($\mathcal{L}_{G}$) and the discrimination training stage ($\mathcal{L}_{D}$) are respectively defined as

\mathcal{L}_{G}=\sum_{i=0}^{m-1}(\mathcal{L}_{G_{i}}+\lambda_{4}\mathcal{L}_{G_{i}^{D}}+\mathcal{L}_{SDL_{i}})+\alpha\mathcal{L}_{DAMSM},   (12)
\mathcal{L}_{D}=\sum_{i=0}^{m-1}(\mathcal{L}_{D_{i}}+\lambda_{5}\mathcal{L}_{D_{i}^{D}}).   (13)

The loss function $\mathcal{L}_{DAMSM}$ [50] measures the matching degree between images and text descriptions; the DAMSM loss makes the generated images better conditioned on the text descriptions. Like most recent GAN-based T2I methods [31, 3, 42, 36, 25, 50], DR-GAN uses three generation stages ($m=3$).
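A compact sketch of how the per-stage terms are aggregated in Eqs. (12)-(13); the per-stage adversarial, DAL, SDL, and DAMSM losses are assumed to be precomputed elsewhere, and the weights are the defaults reported in Sec. IV-D.

```python
def total_generator_loss(adv_g_losses, dal_g_losses, sdl_losses, damsm_loss,
                         lam4=1.0, alpha=5.0):
    """Eq. (12): sum of per-stage generator terms plus the DAMSM term."""
    loss = sum(g + lam4 * d + s for g, d, s in zip(adv_g_losses, dal_g_losses, sdl_losses))
    return loss + alpha * damsm_loss

def total_discriminator_loss(adv_d_losses, dal_d_losses, lam5=1.0):
    """Eq. (13): sum of per-stage discriminator adversarial and DAL terms."""
    return sum(d + lam5 * dd for d, dd in zip(adv_d_losses, dal_d_losses))
```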

IV Experimental Results

TABLE I: IS \uparrow, FID \downarrow, MS \uparrow, R-Precision \uparrow, and Human Perceptual score (H.P. score) \uparrow of some SOTA GAN-based T2I models and our DR-GAN on the CUB-Bird and MS-COCO test sets. † indicates scores computed from images generated by the open-sourced models. * indicates scores reported in DMGAN [31]. ** indicates scores reported in AttnGAN+O.P. [43]. Other results were reported in the original papers. Bold indicates the best result.
Method CUB-Bird MS-COCO
IS \uparrow FID \downarrow MS \uparrow R-Precision \uparrow H.P. score \uparrow IS \uparrow FID \downarrow MS \uparrow R-Precision \uparrow H.P. score \uparrow
StackGANv2 [11] | 3.93±0.06 | 29.64† | 4.10† | - | - | 8.30±0.10 | 81.59** | - | - | -
AttnGAN [50] | 4.36±0.02 | 23.98* | 4.30† | 52.62%† | 16.14%† | 25.89±0.19 | 35.49* | 23.71† | 61.34%† | 20.97%†
Obj-GAN [25] | - | - | - | - | - | 30.29±0.33 | 25.64 | - | - | -
DM-GAN [31] | 4.75±0.07 | 16.09 | 4.62† | 59.21%† | 32.47%† | 30.49±0.57 | 32.64 | 29.94† | 70.63%† | 37.51%†
RiFeGAN [21] | 5.23±0.09 | - | - | - | - | 31.70 | - | - | - | -
DR-GAN (Ours) | 4.90±0.05 | 14.96 | 4.86 | 61.17% | 51.39% | 34.59±0.51 | 27.80 | 33.96 | 77.27% | 41.52%

In this section, we perform extensive experiments to evaluate the proposed DR-GAN. First, we compare DR-GAN with other SOTA GAN-based T2I methods [11, 50, 25, 31, 21]. Second, we discuss the effectiveness of each new module introduced in DR-GAN: the Semantic Disentangling Module (SDM) and the Distribution Normalization Module (DNM). All experiments are performed on one RTX 2080 Ti GPU using the PyTorch toolbox.

IV-A Experiment Settings

Datasets. We conduct experiments on two widely-used datasets, CUB-Bird [47] and MS-COCO [27]. The CUB-Bird dataset [47] contains 11,788 bird images, and each image has 10 sentences describing fine-grained visual details. The MS-COCO dataset [27] contains 80k training images and 40k test images, and each image has 5 sentences describing the visual information of the scene. We pre-process and split the images of the two datasets following the same settings as [37], [55].

Evaluation. We compare DR-GAN with other GAN-based T2I approaches from three aspects: Image Diversity, Distribution Consistency, and Semantic Consistency. Each model generates 30,000 images conditioned on text descriptions from the unseen test set for evaluation. The \uparrow means that a higher value indicates better performance, and vice versa.

(I) Image Diversity. Following almost all T2I approaches, we adopt the fine-tuned Inception models [55] to calculate the Inception Score (IS \uparrow), which measures image diversity.

(II) Distribution Consistency. We use the Fréchet Inception Distance (FID \downarrow) [13] and Mode Score (MS \uparrow) [44] to evaluate the distribution consistency between generated images and real images. The image features for FID and MS are extracted by a pre-trained Inception-V3 network [38].
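For reference, a standard FID computation over Inception-V3 features is sketched below with NumPy/SciPy; this is the usual definition of the metric, not the authors' evaluation code.

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(feat_real, feat_fake):
    """FID between two feature sets of shape (N, d), e.g. Inception-V3 pool features."""
    mu_r, mu_f = feat_real.mean(axis=0), feat_fake.mean(axis=0)
    cov_r = np.cov(feat_real, rowvar=False)
    cov_f = np.cov(feat_fake, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r.dot(cov_f), disp=False)
    if np.iscomplexobj(covmean):          # numerical noise can produce tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff.dot(diff) + np.trace(cov_r + cov_f - 2.0 * covmean))
```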

(III) Semantic Consistency. We use the R-precision \uparrow and Human Perceptual score (H.P. score \uparrow) to evaluate the semantic consistency between the text description and the synthesized image.

R-precision. Following [50], we also use R-precision to evaluate semantic consistency. Given a pre-trained image-to-text retrieval model, we use generated images to query their corresponding text descriptions. Specifically, given a generated image $\hat{x}$ conditioned on a sentence $s$ and 99 randomly sampled sentences $\{s_{i}^{\prime}:1\leq i\leq 99\}$, we rank these 100 sentences with the pre-trained image-to-text retrieval model. If the ground-truth sentence $s$ is ranked highest, we count this as a successful retrieval. We perform this retrieval once for every image in the test set and report the percentage of successful retrievals as the R-precision score. Higher R-precision means greater semantic consistency.
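The R-precision protocol can be summarized by the following sketch, where retrieval_score is an assumed interface to the pre-trained image-to-text retrieval model.

```python
import random

def r_precision(retrieval_score, generated, gt_sentences, all_sentences, n_distractors=99):
    """Fraction of generated images whose ground-truth sentence ranks first among 100 candidates."""
    hits = 0
    for img, gt in zip(generated, gt_sentences):
        distractors = random.sample([s for s in all_sentences if s != gt], n_distractors)
        candidates = [gt] + distractors
        scores = [retrieval_score(img, s) for s in candidates]   # higher score = better match
        if max(range(len(candidates)), key=lambda k: scores[k]) == 0:
            hits += 1
    return hits / len(generated)
```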

Human Perceptual score (H.P. score). To obtain the H.P. score, we randomly select 2000 text descriptions from the CUB-Bird test set and 2000 text descriptions from the MS-COCO test set. Given the same text description, 30 volunteers (not including any author) are asked to rank the images generated by the different methods. The average ratio of images ranked best by the human users is used to evaluate the compared methods.

IV-B Comparison with state-of-the-arts

Figure 5: Images of 256×256 resolution generated by our DR-GAN, DM-GAN [31], and AttnGAN [50] conditioned on text descriptions from the MS-COCO (upper part) and CUB-Bird (bottom part) test sets.

Image Diversity. We use IS to evaluate image diversity. As shown in Table I, DR-GAN achieves the second-highest IS score on the CUB-Bird test set and the highest IS score on the MS-COCO test set. The IS score (5.23) of RiFeGAN [21] is higher than ours (4.90) on the CUB-Bird dataset; however, RiFeGAN [21] uses 10 sentences to train the generator on CUB-Bird, while DR-GAN only uses 1 sentence, following the standard T2I problem formulation. On the larger-scale and more challenging MS-COCO dataset, RiFeGAN [21] uses 5 sentences to train the model but obtains a significantly lower IS score (31.70) than DR-GAN (34.59). When the given database only contains images with a single sentence (which is common in practical tasks such as Story Visualization [53], Text-to-Video [52], and other text-guided image generation [51]), RiFeGAN [21] cannot be used, whereas methods such as DR-GAN, AttnGAN [50], and DMGAN [31], which only need one sentence per image, still apply.

Distribution Consistency. We use FID and MS to evaluate the distribution consistency between the generated image distribution and the real image distribution. As shown in Table I, compared with these GAN-based T2I methods, DR-GAN achieves competitive performance on the CUB-Bird and MS-COCO test sets under both FID and MS. The FID of Obj-GAN [25] is lower than that of DR-GAN, because Obj-GAN [25] and similar methods [3, 43, 15, 20, 14] require additional information, including object bounding boxes and shapes, for training and synthesis. This additional information helps the generator better capture the object layout; however, although available in MS-COCO, it is often unavailable for other datasets such as CUB-Bird. In general, producing additional descriptions or object bounding boxes and shapes for a new database to train the generator is expensive, which limits scalability and usability in more general text-guided image generation. Overall, DR-GAN still achieves competitive performance in the distribution consistency evaluation.

Semantic Consistency. We use R-precision and the Human Perceptual score (H.P.) to evaluate semantic consistency. As shown in Table I, compared with AttnGAN [50] and DMGAN [31], DR-GAN achieves the best performance in the semantic consistency evaluation on both datasets.

TABLE II: The performance of applying SDM and DNM on other GAN-based T2I methods on CUB-Bird test set. The measures include IS\uparrow, FID \downarrow and MS \uparrow. AttnGAN- indicates that we remove the Word-level Attention Mechanism (WAM) in AttnGAN [50]. {\dagger} indicates the scores are computed from images generated by the open-sourced models. Other results were reported in the original paper.
Method CUB-Bird
IS \uparrow FID \downarrow MS \uparrow
StackGANv2 [11] | 3.93±0.06 | 29.64† | 4.10†
StackGANv2+SDM+DNM | 4.58±0.04 | 22.94 | 4.63
AttnGAN- [50] | 4.11±0.08 | 25.23† | 4.01†
AttnGAN-+SDM+DNM | 4.65±0.06 | 20.01 | 4.54
DMGAN [31] | 4.75±0.07 | 16.09 | 4.62†
DMGAN+SDM+DNM | 4.84±0.04 | 15.26 | 4.72
Figure 6: Images of 256×256 resolution generated by StackGAN v2 [11], StackGAN v2+SDM+DNM (StackGAN v2∗), AttnGAN- [50], AttnGAN-+SDM+DNM (AttnGAN-∗), DMGAN [31], and DMGAN+SDM+DNM (DMGAN∗) conditioned on text descriptions from the CUB-Bird test set.

Generalization. To evaluate the generalizability of our proposed SDM and DNM, we integrate the SDM and DNM modules into several well-known/SOTA GAN-based T2I models, including StackGANv2 [11], AttnGAN- [50], and DMGAN [31]. Here, AttnGAN- denotes AttnGAN [50] with its Word-level Attention Mechanism (WAM) removed. As shown in Table II, SDM and DNM help all these GAN-based T2I models achieve better performance on all three measures, indicating that SDM and DNM can be used as general mechanisms in the T2I generation task.

Visualization. We qualitatively evaluate DR-GAN and some GAN-based T2I methods by image visualization. As shown in Fig. 5, compared with AttnGAN [50] and DMGAN [31], our DR-GAN synthesizes higher-quality images on the CUB-Bird and MS-COCO test sets.

On the CUB-Bird test set: (i) compared with DR-GAN, the birds synthesized by AttnGAN and DMGAN contain some strange or incomplete details and structures; because SDM can better distill key information for image generation, DR-GAN performs better on structure generation. (ii) Due to the introduction of DNM, DR-GAN can better capture the real image distribution; therefore, compared with AttnGAN and DMGAN, the bird images generated by DR-GAN are more realistic and contain fuller semantic details. In particular, the birds' wings in Fig. 5 (b), (c), (d), and (k) have very detailed textures and full colors. On the MS-COCO test set: the semantics of the text descriptions are rather sparse, so it is very difficult for the generator to capture sufficient semantics to generate high-quality images, and most current methods cannot synthesize vivid images; compared with AttnGAN and DMGAN, our DR-GAN is more reasonable in object layout and richer in semantic details.

Besides, in Fig. 6, we show the images synthesized by StackGAN v2 [11], AttnGAN- [50], and DMGAN [31], and by StackGAN v2+SDM+DNM (StackGAN v2∗), AttnGAN-+SDM+DNM (AttnGAN-∗), and DMGAN+SDM+DNM (DMGAN∗). As shown in Fig. 6, our SDM and DNM further help these methods improve the structure and enrich the semantic details of the synthesized birds, making the generated images more realistic.

Model Cost. As shown in Table III, we compare our proposed DR-GAN with other SOTA T2I methods under four model cost measures: Training Time, Training Epochs, Model Size, and Testing Time. Taking the MS-COCO dataset as an example, DR-GAN sits between AttnGAN (Baseline) and DM-GAN in terms of model cost while achieving the highest performance.

TABLE III: The Training Time, Training Epoch, Model Size, Testing Time of our DR-GAN and other SOTA T2I methods on the MS-COCO dataset.
Method Training Time Training Epoch Model Size Testing Time
AttnGAN (Base.) [50] | ~8 Days | 120 | ~55.5M | ~1200s
DM-GAN [31] | ~14 Days | 200 | ~89.7M | ~1800s
DR-GAN (Ours) | ~10 Days | 150 | ~73.2M | ~1400s

IV-C Ablation Study

IV-C1 Effectiveness of New Modules

Figure 7: Images of 256×256 resolution generated by our Baseline (Base.), Base.+SDM, Base.+DNM, and DR-GAN conditioned on text descriptions from the CUB-Bird test set.
Figure 8: The top-3 word-guided attention maps and synthesized images at different stages from AttnGAN, DMGAN, and Base.+SDM (AttnGAN+SDM) on the CUB-Bird test set.
TABLE IV: IS\uparrow, FID \downarrow and MS \uparrow produced by combining different components of the DR-GAN on the CUB-Bird and MS-COCO test sets. Our Baseline (Base.) is AttnGAN [50]. DR-GAN=Base.+SDM+DNM.
Method CUB-Bird MS-COCO
IS \uparrow FID \downarrow MS \uparrow IS \uparrow FID \downarrow MS \uparrow
Base. [50] | 4.36±0.02 | 23.98 | 4.30 | 25.89±0.19 | 35.49 | 23.71
Base.+SDM | 4.70±0.02 | 15.50 | 4.62 | 30.89±0.57 | 31.42 | 30.64
Base.+DNM | 4.79±0.03 | 15.12 | 4.80 | 31.26±0.42 | 29.73 | 31.28
DR-GAN | 4.90±0.05 | 14.96 | 4.86 | 34.59±0.51 | 27.80 | 33.96

In this subsection, we evaluate the effectiveness of each new component qualitatively and quantitatively. The numerical results are documented in Table IV. The visualization results are shown in Fig. 7, Fig. 8 and Fig. 9.

Quantitative results. We evaluated the effectiveness of two new components, SDM and DNM, in terms of three measures. Our Baseline (Base.) is AttnGAN [50]. As shown in Table IV, both SDM and DNM can effectively improve the performance of the baseline on these two datasets over three measures.

First, we introduce the SDM into the baseline (Base.), i.e., Base.+SDM. As shown in Table IV, Base.+SDM brings 7.80% and 19.31% improvement in IS, 35.36% and 11.47% improvement in FID, and 7.44% and 29.23% improvement in MS on the CUB-Bird and MS-COCO test sets, respectively.

Second, we introduce the DNM into the baseline (Base.), i.e., Base.+DNM. As shown in Table IV, Base.+DNM brings 9.86% and 20.74% improvement in IS, 36.95% and 16.23% improvement in FID, and 11.63% and 31.92% improvement in MS on the CUB-Bird and MS-COCO test sets, respectively.

Finally, when we introduce both the SDM and the DNM into the baseline (Base.), i.e., DR-GAN, our DR-GAN obtains 12.38% and 33.60% improvement over the baseline in IS, 37.61% and 21.67% improvement in FID, and 13.02% and 43.23% improvement in MS on the CUB-Bird and MS-COCO test sets, respectively. DR-GAN achieves 4.90 and 34.59 in IS, 14.96 and 27.80 in FID, and 4.86 and 33.96 in MS on the CUB-Bird and MS-COCO test sets, respectively.

Qualitative Results. We also qualitatively evaluate the effectiveness of each component by image visualization (Fig. 7, Fig. 8, and Fig. 9).

In Fig. 7, we can clearly see that the DNM and SDM each effectively improve the quality of the generated images. SDM: Compared with the Baseline, introducing SDM better improves the bird body structure on the CUB-Bird test set, and also improves the target layout in complex scenes to some extent. This is because SDM can directly extract key features and filter non-key features from the spatial perspective of feature maps. Constrained by the statistics of the real image feature distribution, SDM tries to filter out the non-key structural information that is not conducive to distribution learning. For the COCO data, the excessive sparsity and abstraction of text semantics prevent the generator from producing vivid images, but with SDM, the overall layout and structure of the generated images are significantly improved compared with the Baseline.

DNM: Compared with the Baseline, introducing DNM better improves the visual representation and the semantic expression of details to some extent. This is because DNM normalizes the latent distribution of real and generated images, which drives the generator to better approximate the real image distribution; therefore, the generated images are better in terms of visual representation and detail semantics. Compared with SDM, DNM lacks direct intervention on the image features, so the structure of images generated under Base.+DNM is slightly worse than that under Base.+SDM.

SDM+DNM: When we introduce both SDM and DNM into the Baseline (Base.), i.e., DR-GAN, the respective advantages of SDM and DNM combine, and DR-GAN performs better than the Baseline in visual semantics, structure, and layout.

Besides, we present the top-3 word-guided attention maps and the three stages of generated images ($\hat{I}_{0}$ at 64×64, $\hat{I}_{1}$ at 128×128, $\hat{I}_{2}$ at 256×256) from AttnGAN, DMGAN, and Base.+SDM (AttnGAN+SDM) on the CUB-Bird test set. To facilitate display, we rescale the generated images to the same size in Fig. 8. For AttnGAN and DMGAN, we observe that when the quality of the image $\hat{I}_{0}$ generated in the initial stage is very poor, $\hat{I}_{0}$ confuses the word attention regions. Due to the lack of direct intervention on features, the confused attention and image features continue to corrupt $\hat{I}_{1}$ and the attention maps in the second-stage module ($SDM_{2}$ in our notation). In contrast, the proposed SDM gradually captures key information and filters out the non-key information for image generation in the subsequent stages. As a result, the attention information and the resulting images gradually become more reasonable.

Finally, we use t-SNE [46] to visualize the two-dimensional distribution of generated images and real images. As shown in Fig. 9, in the initial stage (200 epochs on the CUB-Bird dataset), the difference between the distribution of the images generated by the Baseline and the real image distribution is very large. When we introduce SDM into the Baseline, i.e., Baseline+SDM, we observe a narrowing of the gap between the generated and real image distributions. When we further introduce DNM into Baseline+SDM, i.e., DR-GAN, we observe an additional narrowing of the gap between the two-dimensional distributions of generated and real images, and DNM makes the scatter plot area more compact. The t-SNE visualization shows that DNM and SDM are effective for distribution learning.
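A minimal sketch of the t-SNE visualization in Fig. 9, assuming real and generated image features have already been extracted into NumPy arrays; colors follow the figure's convention (green for real, red for generated).

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(real_feats, fake_feats, out_path="tsne.png"):
    """Project real (green) and generated (red) image features to 2-D, as in Fig. 9."""
    feats = np.concatenate([real_feats, fake_feats], axis=0)
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(feats)
    n_real = len(real_feats)
    plt.scatter(emb[:n_real, 0], emb[:n_real, 1], s=4, c="green", label="real")
    plt.scatter(emb[n_real:, 0], emb[n_real:, 1], s=4, c="red", label="generated")
    plt.legend()
    plt.savefig(out_path)
```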

Figure 9: Visualization of the synthesized images and real images under t-SNE [46]. The green scattered points are the real images, and the red scattered points are the generated images.

IV-C2 The validity of different parts of SDM.

The IS \uparrow, FID \downarrow and MS \uparrow of the Baseline (Base.) are 4.36±0.02, 23.98, and 4.30, respectively. As shown in Table V, we set $\lambda_{3}=0$ so that the RIRM does not participate in the SDM. The SD loss then cannot obtain high-quality real image features, the constraint on the distribution statistics deviates from the feature distribution of real images, and the image quality deteriorates severely. Thus, compared with Base. and Base.+SDM, the performance of Base.+SDM ($\lambda_{3}=0$) is poor. When we remove the SDL, the performance of the model also deteriorates, because the attention mechanism lacks the loss needed to drive it to extract key information. Overall, SDM effectively improves the quality of the generated images.

TABLE V: IS\uparrow, FID \downarrow and MS \uparrow produced by different parts in SDM on the CUB-Bird test set.
Method CUB-Bird
IS \uparrow FID \downarrow MS \uparrow
Base. [50] | 4.36±0.02 | 23.98 | 4.30
Base.+SDM | 4.70±0.02 | 15.50 | 4.62
Base.+SDM w/o SDL | 4.41±0.03 | 23.56 | 4.38
Base.+SDM ($\lambda_{3}=0$) | 4.21±0.04 | 27.34 | 4.11

IV-C3 Impact of self-encoding mode on performance.

We modify the VAE module to observe its impact on model performance. In the first case in Fig. 10, we remove the variational sampling module $\varphi(\cdot)$ from the VAE module, and the Distribution Adversarial Loss is rewritten as Eq. 14 and Eq. 15.

Figure 10: The architecture of the DNM. The first case is the VAE without the variational sampling module, i.e., general encoding and decoding (AE). The second case is the VAE framework we adopt.
\mathcal{L}_{D_{i}^{D}}=||\hat{I}_{i}-D^{E}_{i}(\varphi_{i}(E^{D}(\hat{I}_{i})),s)||_{1}+||I_{i}^{*}-D^{E}_{i}(\varphi_{i}(E^{D}(I_{i}^{*})),s)||_{1},   (14)
\mathcal{L}_{G_{i}^{D}}=||I_{i}^{*}-D^{E}_{i}(\varphi_{i}(E^{D}(\hat{I}_{i})),s)||_{1}.   (15)

We name this AE-based model DNM (AE). As shown in Table VI, we compare Base.+DNM (AE) and Base.+DNM on the CUB-Bird dataset under the three measures. Compared with Base.+DNM (which contains the VAE), the performance of Base.+DNM (AE) decreases significantly; however, Base.+DNM (AE) still performs better than the Baseline, because the image encoding and decoding in the AE enhance the expression of image semantics in $v_{i}$. Overall, compared with a general AE, adopting the VAE lets the generator obtain higher performance.

TABLE VI: IS \uparrow, FID \downarrow and MS \uparrow produced by Base.+DNM (AE) and Base.+DNM on the CUB-Bird test set.
Method CUB-Bird
IS \uparrow FID \downarrow MS \uparrow
Base. [50] | 4.36±0.02 | 23.98 | 4.30
Base.+DNM (AE) | 4.53±0.05 | 19.86 | 4.55
Base.+DNM | 4.79±0.03 | 15.12 | 4.80

IV-C4 Discussion of semantic consistency strategies.

We compare the performance of SDM with some outstanding semantic consistency strategies. Several strong methods (such as CSM-GAN [41], SE-GAN [39], and MirrorGAN [36]) introduce various semantic constraints to improve the quality of image generation. As shown in Table VII, we present the performance of the semantic constraint strategies in these methods under the IS score on the CUB-Bird dataset. SD-GAN [9] and SE-GAN [39] constrain the semantic consistency between synthesized images and real images. CSM-GAN [41] constrains the semantic consistency between the synthesized images and the text descriptions. MirrorGAN [36] enforces semantic consistency between the text descriptions re-generated from the image and the real text descriptions. Compared with these methods, the distribution statistic constraints (Baseline+SDM) obtain better performance under the IS score. We believe that the randomness of images and texts leads to the instability of semantic constraints, whereas the constraints on mean and variance can guide the learning direction of distributions for generators, helping the generator better align the generated image distribution with the real image distribution.

TABLE VII: Comparison with the semantic consistency strategies of other outstanding T2I methods under the IS score. "*" indicates the performance of the semantic constraint strategy of each method under the IS score.

Method                   CUB-Bird
                         IS ↑
Baseline [50]            4.36±0.02
CSM-GAN [41]             4.58±0.05
SE-GAN [39]              4.44±0.03
MirrorGAN [36]           4.47±0.07
SD-GAN [9]               4.51±0.07
Baseline+SDM (Ours)      4.70±0.02

IV-D Parametric Sensitivity Analysis.

In this subsection, we analyze the sensitivity of the hyper-parameters λ₁, λ₂, λ₃, λ₄, λ₅, and α. In Table I to Table VII (excluding Table III), the six parameters are set to the values at which DR-GAN's performance is around the median in Fig. 11, i.e., λ₁=0.001, λ₂=0.1, λ₃=0.00001, λ₄=1.0, λ₅=1.0, and α=5.0. FID is used as the evaluation metric for the quality of synthesized images on the CUB-Bird dataset. The FID scores of DR-GAN under different values of λ₁, λ₂, λ₃, λ₄, λ₅, and α are shown in Fig. 11.
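
For reference, FID [13] measures the distance between Gaussian fits to Inception features of real and generated images, so a lower score indicates that the generated distribution is closer to the real one:

FID = ||μ_r − μ_g||₂² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^{1/2}),

where (μ_r, Σ_r) and (μ_g, Σ_g) are the feature mean and covariance of real and generated images, respectively.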

The hyper-parameters λ₁ and λ₂: In the Semantic Disentangling Loss (Eq. 7), the losses in Eq. 5 and Eq. 6 drive the SDM to better distill the key information from the image feature H and the word-level context feature Q for image generation. λ₁ and λ₂ are the corresponding balance parameters in Eq. 7.

As shown in Fig. 11: (i) The Semantic Disentangling Loss has a strong influence on the overall performance of DR-GAN. When λ₁=0 or λ₂=0, the FID score of DR-GAN increases, i.e., the quality of the generated images deteriorates. (ii) As λ₁ or λ₂ grows too large, the FID score also increases. A larger weight means that training pays more attention to the regression of the distribution statistics. These statistics only reflect global information about the distribution; the statistic constraint is meant to assist the generator in approximating the real image distribution, while the accurate distribution itself still has to be learned by the GAN through implicit sampling. Therefore, the weight of the SDL should be chosen appropriately. Based on the results in Fig. 11, a suitable range for λ₁ is {0.001, 0.01, 0.1}, and a suitable range for λ₂ is {0.001, 0.1}.

The hyper-parameter λ₃: In the Semantic Disentangling Loss (Eq. 7), λ₃ adjusts the weight of the reconstruction loss ||RIRM(I_i^*) − I_i^*||_1, which provides the real image feature H* for the other terms in the Semantic Disentangling Loss. When λ₃=0, the reconstruction mechanism (i.e., RIRM) does not work, and the performance of DR-GAN drops significantly: without this loss, it is difficult for the SDM to estimate valid means and variances of real image features. Consequently, the real image features are mixed with semantic information irrelevant to image generation, or important image semantics are suppressed; such poor real image features mislead the generator toward a poor image distribution and lower the quality of the generated images. We also found that performance decreases as λ₃ increases. The decoder in RIRM and the generation module G^o form a Siamese network, designed so that the real image features H* and the generated image features H are mapped to the same semantic space. Increasing the weight makes this shared module pay more attention to reconstructing real images from real image features and weakens image generation; that is, the Siamese network becomes imbalanced between its two feature sources during training. Overall, based on the results in Fig. 11, a suitable range for λ₃ is about [0.00001, 1].
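
Read this way, the Siamese design amounts to weight sharing: one decoding module is applied both to real image features H* (for reconstruction inside RIRM) and to generated features H (inside the generation module G^o), so both feature sources are mapped into the same semantic space. The sketch below is a hedged illustration with hypothetical layer sizes, not DR-GAN's actual architecture.

```python
import torch
import torch.nn as nn

class SharedDecoder(nn.Module):
    """A single set of decoding weights reused by the RIRM reconstruction
    branch and the generation branch G^o (illustrative layer sizes)."""
    def __init__(self, feat_ch: int, out_ch: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(feat_ch, feat_ch // 2, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch // 2, out_ch, kernel_size=3, padding=1),
            nn.Tanh(),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.net(features)

decoder = SharedDecoder(feat_ch=64)
# RIRM branch: rec_real = decoder(h_real), penalized by lambda3 * ||rec_real - I_real||_1
# Generation branch: fake = decoder(h_fake)
# A large lambda3 pushes these shared weights toward reconstruction and away
# from generation, which is the imbalance described above.
```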

The hyper-parameters λ₄ and λ₅: λ₄ balances the weight of Eq. 9 in the generation-stage loss of DR-GAN, and λ₅ balances the weight of Eq. 8; together, Eq. 8 and Eq. 9 form the Distribution Adversarial Loss. When λ₄=0 or λ₅=0, the FID score increases and the generated image quality decreases, which shows that both loss terms of the Distribution Adversarial Loss effectively improve the distribution quality of the generated images. When λ₄ or λ₅ becomes too large, the quality of the image distribution tends to decline: an overly large weight over-normalizes the image latent distribution and makes the discriminator very powerful, and, as in GAN theory [8], an overly strong discriminator is not conducive to generator optimization. Based on the results in Fig. 11, a suitable range for λ₄ and λ₅ is [0.1, 2].

The hyper-parameter α: In the training stage of DR-GAN, we also utilize the DAMSM loss [50] to better condition the generated images on the text descriptions. Fig. 11 shows the performance of DR-GAN under different values of α. When α=0, the FID score increases and the generated image quality decreases, which indicates that the semantic matching constraint helps improve the quality of image generation. Moreover, as α varies, the performance of DR-GAN changes only moderately and remains relatively good overall. Based on the results in Fig. 11, a suitable range for α is about [1, 20].
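
As a compact summary of how these weights enter training, the sketch below assembles the generation-stage objective from already-computed loss tensors. The dictionary keys and the function itself are placeholders we introduce for illustration; the exact terms follow Eqs. 5–9 and the DAMSM loss of [50].

```python
from typing import Dict, Sequence
import torch

def generation_stage_loss(terms: Dict[str, torch.Tensor],
                          lambdas: Sequence[float],
                          alpha: float) -> torch.Tensor:
    """Illustrative weighting of the generation-stage objective of DR-GAN."""
    l1, l2, l3, l4, l5 = lambdas
    # Semantic Disentangling Loss (Eq. 7): statistic terms (Eq. 5, Eq. 6)
    # plus the RIRM reconstruction term weighted by lambda3.
    sdl = l1 * terms["eq5"] + l2 * terms["eq6"] + l3 * terms["rirm_rec"]
    # Distribution Adversarial Loss: Eq. 9 weighted by lambda4, Eq. 8 by lambda5.
    dal = l4 * terms["eq9"] + l5 * terms["eq8"]
    # Conventional adversarial loss plus the DAMSM matching loss [50], scaled by alpha.
    return terms["adv"] + sdl + dal + alpha * terms["damsm"]

# Median-performance setting used for Tables I-VII (excluding Table III):
# lambdas = (0.001, 0.1, 0.00001, 1.0, 1.0), alpha = 5.0
```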

Figure 11: FID↓ scores of our DR-GAN under different values of the hyper-parameters λ₁, λ₂, λ₃, λ₄, λ₅, and α on the CUB-Bird test set.

IV-E Limitation and Discussion

Figure 12: Failure cases generated by DR-GAN on CUB-Bird test set (top row) and MS-COCO test set (bottom row).

The experiments show the effectiveness of our DR-GAN in T2I generation; however, there are a few failure cases. Some failure cases from the CUB-Bird dataset are shown in the top row of Fig. 12: distribution normalization in SDM can sometimes lead to missing local spatial structures or parts such as heads, feet, and necks. Some failure cases from the MS-COCO dataset are shown in the second row. As in most existing T2I models [50, 25, 36, 3, 39, 9, 31], it is difficult for the text encoder to infer reasonable object locations in a scene when they are not explicitly described in the sentence, so the generator in DR-GAN tends to place some objects randomly and sometimes unreasonably.

In future work, we will explore objects' location relationships using text knowledge graphs. We will also explore applying SDM and DNM to other, broader GAN-based image generation tasks, such as Image-to-Image Translation [22, 35] and Virtual Try-On [10, 1].

V Conclusions

We proposed a novel Semantic Disentangling Module (SDM) and a Distribution Normalization Module (DNM) for GAN-based T2I models, and built a Distribution Regularization Generative Adversarial Network (DR-GAN) for Text-to-Image (T2I) generation. The SDM helps the generator distill key information and filter out non-key information for image generation. The DNM normalizes and reduces the complexity of the image latent distribution, helping GAN-based T2I methods better capture the real image distribution from the text feature distribution. Extensive experimental results and analysis demonstrate the effectiveness of DR-GAN and its improved performance over previous outstanding methods. Moreover, SDM and DNM are general mechanisms that can help other GAN-based T2I models achieve better performance on the Text-to-Image generation task.

Acknowledgments

This work is supported by National Key R&D Program of China (2021ZD0111900), National Natural Science Foundation of China (61976040), National Science Foundation of USA (OIA-1946231, CBET-2115405), Chinese Postdoctoral Science Foundation (No. 2021M700303). No conflict of interest: Hongchen Tan, Xiuping Liu, Baocai Yin, and Xin Li declare that they have no conflict of interest.

References

  • [1] Neuberger Assaf, Borenstein Eran, Hilleli Bar, Oks Eduard, and Alpert Sharon. Image based virtual try-on network from unpaired data. In CVPR, 2020.
  • [2] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. 2016.
  • [3] Li Bowen, Qi Xiaojuan, Lukasiewicz Thomas, and H. S. Torr Philip. Controllable text-to-image generation. In NeurIPS, 2019.
  • [4] Tong Che, Yanran Li, Athul Paul Jacob, Yoshua Bengio, and Wenjie Li. Mode regularized generative adversarial networks. In ICLR, 2017.
  • [5] Cheng Deng, Zhao Li, Xinbo Gao, and Dacheng Tao. Deep multi-scale discriminative networks for double JPEG compression forensics. volume abs/1904.02520, 2019.
  • [6] Xinxia Fan, Yanhua Yang, Cheng Deng, Jie Xua, and Xinbo Gao. Compressed multi-scale feature fusion network for single image super-resolution. Signal Processing, 146(MAY):50–60, 2017.
  • [7] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Xu Bing, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, 2014.
  • [8] Jie Gui, Zhenan Sun, Yonggang Wen, Dacheng Tao, and Jieping Ye. A review on generative adversarial networks: Algorithms, theory, and applications. In arXiv:2001.06937, 2020.
  • [9] Yin Guojun, Liu Bin, Sheng Lu, Yu Nenghai, Wang Xiaogang, and Shao Jing. Semantics disentangling for text-to-image generation. In CVPR, 2019.
  • [10] Yang Han, Zhang Ruimao, Guo Xiaobao, Liu Wei, Zuo Wangmeng, and Luo Ping. Towards photo-realistic virtual try-on by adaptively generating-preserving image content. In CVPR, 2020.
  • [11] Zhang Han, Xu Tao, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris Metaxas. Stackgan++: Realistic image synthesis with stacked generative adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
  • [12] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Conference and Workshop on Neural Information Processing Systems(NeurIPs), 2017.
  • [13] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS, 2017.
  • [14] Tobias Hinz, Stefan Heinrich, and Stefan Ge Wermter. Semantic object accuracy for generative text-to-image synthesis. In IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
  • [15] Seunghoon Hong, Dingdong Yang, Jongwook Choi, and Honglak Lee. Inferring semantic layout for hierarchical text-to-image synthesis. In CVPR, 2018.
  • [16] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, page 448–456. JMLR.org, 2015.
  • [17] Phillip Isola, Jun Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
  • [18] Liang Jiadong, Pei Wenjie, and Lu Feng. Cpgan: Full-spectrum content-parsing generative adversarial networks for text-to-image synthesis. In ECCV, 2020.
  • [19] Gu Jiuxiang, Cai Jianfei, Joty Shafiq, Niu Li, and Wang Gang. Look, imagine and match: Improving textual-visual cross-modal retrieval with generative models. In CVPR, 2018.
  • [20] Justin Johnson, Agrim Gupta, and Fei Fei Li. Image generation from scene graphs. In CVPR, 2018.
  • [21] Cheng Jun, Wu Fuxiang, Tian Yanling, Wang Lei, and Tao Dapeng. Rifegan: Rich feature generation for text-to-image synthesis from prior knowledge. In CVPR, 2020.
  • [22] Zhu Jun-Yan, Park Taesung, Isola Phillip, and A. Efros Alexei. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.
  • [23] Mahdi M. Kalayeh, Emrah Basaran, Muhittin Gokmen, Mustafa E. Kamasak, and Mubarak Shah. Human semantic parsing for person re-identification. In CVPR, 2018.
  • [24] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In arXiv 1312.6114, 2014.
  • [25] Wenbo Li, Pengchuan Zhang, Lei Zhang, Qiuyuan Huang, Xiaodong He, Siwei Lyu, and Jianfeng Gao. Object-driven text-to-image synthesis via adversarial training. In CVPR, 2019.
  • [26] Yuheng Li, Krishna Kumar Singh, Utkarsh Ojha, and Yong Jae Lee. Mixnmatch: Multifactor disentanglement and encoding for conditional image generation. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • [27] Tsung Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
  • [28] Zhang Lisai, Chen Qingcai, Hu Baotian, and Jiang Shuoran. Text-guided neural image inpainting. In ACM MM, 2020.
  • [29] X. Mao, Q. Li, H. Xie, R. K. Lau, Z. Wang, and S. Smolley. Least squares generative adversarial networks. In ICCV, 2017.
  • [30] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In ICML, 2017.
  • [31] Zhu Minfeng, Pan Pingbo, Chen Wei, and Yang Yi. Dm-gan: Dynamic memory generative adversarial networks for text-to-image synthesis. In CVPR, 2019.
  • [32] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In ICLR, 2018.
  • [33] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-gan: Training generative neural samplers using variational divergence minimization. In NeurIPs, 2016.
  • [34] Yuxin Peng, Wenwu Zhu, Yao Zhao, Changsheng Xu, Qingming Huang, Hanqing Lu, Qinghua Zheng, Tiejun Huang, and Wen Gao. Cross-media analysis and reasoning: advances and directions. pages 18(1):44–57, 2017.
  • [35] Isola Phillip, Zhu Jun-Yan, Zhou Tinghui, and Efros Alexei A. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
  • [36] Tingting Qiao, Jing Zhang, Duanqing Xu, and Dacheng Tao. Mirrorgan: Learning text-to-image generation by redescription. In CVPR, 2019.
  • [37] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In ICML, 2016.
  • [38] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
  • [39] Hongchen Tan, Xiuping Liu, Xin Li, Yi Zhang, and Baocai Yin. Semantics-enhanced adversarial nets for text-to-image synthesis. In IEEE International Conference on Computer Vision (ICCV), pages 10500–10509, 2019.
  • [40] Hongchen Tan, Xiuping Liu, Meng Liu, Baocai Yin, and Xin Li. Kt-gan: Knowledge-transfer generative adversarial network for text-to-image synthesis. IEEE Transactions on Image Processing, 30:1275–1290, 2021.
  • [41] Hongchen Tan, Xiuping Liu, Baocai Yin, and Xin Li. Cross-modal semantic matching generative adversarial networks for text-to-image synthesis. IEEE Transactions on Multimedia, pages 1–1, 2021.
  • [42] Qiao Tingting, Zhang Jing, Xu Duanqing, and Tao Dacheng. Learn, imagine and create: Text-to-image generation from prior knowledge. In NeurIPS, 2019.
  • [43] Hinz Tobias, Heinrich Stefan, and Wermter Stefan. Generating multiple objects at spatially distinct locations. In ICLR, 2019.
  • [44] Che Tong, Li Yanran, Jacob Athul Paul, Bengio Yoshua, and Li Wenjie. Mode regularized generative adversarial networks. In ICLR, 2016.
  • [45] Dmitry Ulyanov, Andrea Vedaldi, and Victor S. Lempitsky. Instance normalization: The missing ingredient for fast stylization. volume abs/1607.08022, 2016.
  • [46] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
  • [47] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
  • [48] Hao Wang, Cheng Deng, Fan Ma, and Yi Yang. Context modulated dynamic networks for actor and action video segmentation with language queries. In AAAI, pages 12152–12159, 2020.
  • [49] Yuxin Wu and Kaiming He. Group normalization. In European Conference on Computer Vision (ECCV), volume abs/1803.08494, 2018.
  • [50] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In CVPR, 2018.
  • [51] Liu Yahui, Nadai Marco De, Cai Deng, Li Huayang, and Lepri Bruno. Describe what to change: A text-guided unsupervised image-to-image translation approach. In ACM MM, 2020.
  • [52] Li Yitong, Min Martin Renqiang, Shen Dinghan, Carlson David, and Carin Lawrence. Video generation from text. In AAAI, 2018.
  • [53] Li Yitong, Gan Zhe, Shen Yelong, Liu Jingjing, Cheng Yu, Wu Yuexin, Carin Lawrence, Carlson David, and Gao Jianfeng. Storygan: A sequential conditional gan for story visualization. In CVPR, 2019.
  • [54] Mingkuan Yuan and Yuxin Peng. Text-to-image synthesis via symmetrical distillation networks. In ACM MM, pages 1407–1415, 2018.
  • [55] Han Zhang, Tao Xu, and Li Hongsheng. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017.
  • [56] Zizhao Zhang, Yuanpu Xie, and Yang Lin. Photographic text-to-image synthesis with a hierarchically-nested adversarial network. In CVPR, 2018.
Hongchen Tan is a Lecturer at the Artificial Intelligence Research Institute, Beijing University of Technology. He received his Ph.D. degree in computational mathematics from the Dalian University of Technology in 2021. His research interests are Person Re-identification, Image Synthesis, and Referring Segmentation. Various parts of his work have been published in top conferences and journals, such as IEEE ICCV/TIP/TNNLS/TMM/TCSVT, and Neurocomputing.
Xiuping Liu is a Professor in the School of Mathematical Sciences at Dalian University of Technology. She received her Ph.D. degree in computational mathematics from Dalian University of Technology. Her research interests include shape modeling and analysis, and computer vision.
Baocai Yin is a Professor of Artificial Intelligence Research Institute at Beijing University of Technology. He is also a Researcher with the Beijing Key Laboratory of Multimedia and Intelligent Software Technology and the Beijing Advanced Innovation Center for Future Internet Technology. He received the M.S. and Ph.D. degrees in computational mathematics from the Dalian University of Technology, Dalian, China, in 1988 and 1993, respectively. His research interests include multimedia, image processing, computer vision, and pattern recognition.
Xin Li is a Professor at the Division of Electrical & Computer Engineering, Louisiana State University, USA. He received his B.E. degree in Computer Science from the University of Science and Technology of China in 2003, and his M.S. and Ph.D. degrees in Computer Science from the State University of New York at Stony Brook in 2005 and 2008. His research interests are in Geometric and Visual Data Computing, Processing, and Understanding, Computer Vision, and Virtual Reality.