
DR-GAN: Distribution Regularization for Text-to-Image Generation

Hongchen Tan, Xiuping Liu, Baocai Yin and Xin Li* (Corresponding author: Xin Li). Hongchen Tan and Baocai Yin are with the Artificial Intelligence Research Institute, Beijing University of Technology, Beijing 100124, China (e-mail: [email protected]; [email protected]). Xin Li is with the School of Electrical Engineering & Computer Science, and Center for Computation & Technology, Louisiana State University, Baton Rouge, LA 70808, USA (e-mail: [email protected]). Xiuping Liu is with the School of Mathematical Sciences, Dalian University of Technology, Dalian 116024, China (e-mail: [email protected]).
Abstract

This paper presents a new Text-to-Image generation model, named Distribution Regularization Generative Adversarial Network (DR-GAN), which generates images from text descriptions through improved distribution learning. In DR-GAN, we introduce two novel modules: a Semantic Disentangling Module (SDM) and a Distribution Normalization Module (DNM). SDM combines a spatial self-attention mechanism with a new Semantic Disentangling Loss (SDL) to help the generator distill key semantic information for image generation. DNM uses a Variational Auto-Encoder (VAE) to normalize and denoise the image latent distribution, which helps the discriminator better distinguish synthesized images from real images. DNM also adopts a Distribution Adversarial Loss (DAL) to guide the generator to align with normalized real image distributions in the latent space. Extensive experiments on two public datasets demonstrate that DR-GAN achieves competitive performance on the Text-to-Image task. Code: https://github.com/Tan-H-C/DR-GAN-Distribution-Regularization-for-Text-to-Image-Generation

Index Terms:
Generative Adversarial Network, Distribution Normalization, Text-to-Image Generation, Semantic Disentanglement Mechanism

I Introduction

Generating photographic images from text descriptions (known as Text-to-Image generation, T2I) is a challenging cross-modal generation technique and a core component of many computer vision tasks such as Image Editing [28, 51], Story Visualization [53], and Multimedia Retrieval [19]. Compared with image generation [26, 17, 22] and image processing [6, 5, 23] tasks within a single modality, it is difficult to build a heterogeneous semantic bridge between text and image [54, 48, 40]. Many state-of-the-art T2I algorithms [31, 25, 36, 9, 3, 42] first extract text features and then use Generative Adversarial Networks (GANs) [7] to generate the corresponding image. In essence, they map a text feature distribution to an image distribution. However, two factors prevent GAN-based T2I methods from capturing the real image distribution: (1) the abstractness and ambiguity of text descriptions make it difficult for the generator to capture the key semantic information for image generation [54, 34]; (2) the diversity of visual information makes the image distribution so complex that it is difficult for GAN-based T2I models to capture the real image distribution from the text feature distribution [8]. Thus, this work explores better distribution learning strategies to enhance GAN-based T2I models.

Among multi-modal perceptual information, the semantics of a text description is usually abstract and ambiguous, whereas image information is usually concrete and carries rich spatial structure. Text and image information are expressed in different patterns, which makes it difficult to establish semantic correlation based on feature vectors or tensors. Thus, it is hard for the generator to accurately capture key semantics from text descriptions for image generation. As a result, in the intermediate stage of generation, the image features contain a lot of non-key semantics. Such inaccurate semantics often lead to ineffective image distribution learning, and the generated images are then often semantically inconsistent and exhibit chaotic structures and details. To alleviate this issue, our first strategy is to design an information disentangling mechanism on the intermediate features, to better distill key information before performing cross-modal distribution learning.

In addition, images often contain diverse visual information, messy backgrounds, and other non-key visual content, so their latent distribution is often complex. Moreover, the distribution of images is difficult to model explicitly [8], which means that we cannot directly and explicitly learn the target image distribution from the text features. As an outstanding image generation model, GANs [7] learn the target data distribution implicitly by sampling data from the true or fake data distribution. However, such a complex image distribution makes it difficult for the discriminator in GANs to tell whether the current input image is sampled from the real image distribution or the generated image distribution. Our second strategy is therefore to design an effective distribution normalization mechanism to normalize the image latent distribution. This mechanism aims to help the discriminator better learn the distribution decision boundary between generated and real images.

Based on the above two strategies, we build a new Text-to-Image generation model, the Distribution Regularization Generative Adversarial Network (DR-GAN). DR-GAN contains two novel modules: the Semantic Disentangling Module (SDM) and the Distribution Normalization Module (DNM). In SDM, we introduce a spatial self-attention mechanism and propose a new Semantic Disentangling Loss (SDL) to help the generator better distill key information from texts and images while capturing the image distribution. In DNM, we introduce a Variational Auto-Encoder (VAE) [24] into the GAN-based T2I pipeline to normalize image distributions in the latent space. We also propose a Distribution Adversarial Loss (DAL) to align the learned distribution with the real distribution in the normalized latent space. With DNM and SDM, DR-GAN can generate an image latent distribution that better matches the real image distribution, and thus generate higher-quality images. The main contributions are summarized as follows:

  • (i)

    We propose a Semantic Disentangling Module (SDM) to help the generator distill key information (and filter out the non-key information) from both text and image features.

  • (ii)

    We design a new Distribution Normalization Module (DNM) which introduces the VAE into the GAN-based T2I pipeline so that it can more effectively normalize and denoise image latent distributions.

  • (iii)

    Extensive experimental results and analysis show the efficacy of DR-GAN on two benchmarks: CUB-Bird [47] and large-scale MS-COCO [27] over four metrics.

II Related Work

II-A GANs in Text-to-Image Generation

With the recent successes of GANs [7], a large number of GAN-based T2I methods [37, 55, 11, 50, 25, 36, 3, 39, 9, 31, 42] have boosted the performance of the T2I task. Reed et al. [37] first introduced the adversarial process to generate images from text descriptions. However, their model could only generate images at $64\times 64$ resolution, and the image quality was limited. StackGAN/StackGAN++ [55, 11] and HDGAN [56] then adopted multi-stage generation patterns to progressively enhance the quality and detail of synthesized images. Thereafter, the multi-stage generation framework has been widely used in GAN-based T2I methods. Built on this framework, AttnGAN [50], DMGAN [31], and CPGAN [18] adopted word-level or object-level attention mechanisms to help the generator enhance the semantics of local regions or objects. MirrorGAN [36] combined Text-to-Image generation and Image-to-Text generation to improve the global semantic consistency between the text description and the generated image. SDGAN [9] and SEGAN [39] combined the Siamese Network and the contrastive loss to enhance the semantics of the synthesized image. Like these T2I methods, we also adopt the multi-stage generation framework to build our DR-GAN. Different from them, however, we help the generator better distill key information for distribution learning.

II-B GANs in distribution learning

GANs [7], as a latent distribution learning strategy, have been widely adopted in various generation tasks [37, 1, 22, 35, 51, 53]. However, GANs tend to suffer from unstable training, mode collapse, and uncontrollable behavior, which hinder them from effectively modeling the real data distribution. Recently, many approaches [33, 11, 4, 29, 32, 55] have been explored to overcome these issues. LSGANs [29] overcome the vanishing gradient problem by replacing the entropy loss with a least squares loss. WGAN [30] introduced the Earth Mover (EM) distance to improve the stability of distribution learning and provide meaningful learning curves useful for hyperparameter search and debugging. MDGAN [4] improved the stability of distribution learning by utilizing an encoder $E(x):x\rightarrow z$ to produce the latent variable $z$ for the generator $G$. SN-GANs [32] proposed a novel weight normalization method named spectral normalization to better stabilize the training of the discriminator. F-GAN [33] adopted the Kullback-Leibler (KL) divergence to help align the generated image distribution with the real data distribution. In the T2I task, many GAN-based methods introduce various strategies, such as multi-stage generation [11, 55], attention mechanisms [50, 31], and cycle-consistency mechanisms [36], to help match the synthesized distribution with the real image distribution. However, diverse visual information, messy backgrounds, and other non-key visual content usually make the image distribution complicated, which makes distribution learning more difficult. Thus, our idea is to explore an effective distribution normalization strategy to overcome this challenge.

III DR-GAN for text-to-image generation

Figure 1: The framework of proposed Distribution Regularization Generative Adversarial Network (DR-GAN).

III-A Overview

Most recent GAN-based T2I methods [50, 43, 25, 36, 39, 31, 9] adopted a multi-stage generation framework to progressively map the text embedding distribution to the image distribution, to synthesize high-quality images. Like all these methods, we also adopt such a generation pattern from AttnGAN [50] as our baseline to build the DR-GAN.

As shown in Fig. 1, DR-GAN has a Text Encoder [50], a conditioning augmentation module [55] $F^{ca}$, $m$ generation modules $G_{i}^{o}$ ($i=0,1,\ldots,m-1$), and two new designs: $m$ Semantic Disentangling Modules (SDMs) $SDM_{i}$ and $m$ Distribution Normalization Modules (DNMs) $DNM_{i}$, $i=0,1,\ldots,m-1$.

The Text Encoder transforms the input text description (a single sentence) into the sentence feature $s^{\prime}$ and word features $W$. $F^{ca}$ [55] converts the sentence feature $s^{\prime}$ to a conditioned sentence feature $s$. The SDM distills key information from the text or image features in the intermediate stages of generation, for better approximating the real image distribution. In the testing stage, the SDMs take noise $z\sim N(0,1)$, the sentence feature $s$, and the word features $W$ to produce a series of hidden features $H_{i}$ ($i=0,1,\ldots,m-1$); in the training stage, besides $z$, $s$, and $W$, the $i$-th $SDM_{i}$ also takes the $i$-th scale real image $I^{*}_{i}$ as input to generate $H_{i}$. Then, $G_{i}^{o}$ takes $H_{i}$ to generate the $i$-th scale image $\hat{I}_{i}$. $DNM_{i}$ contains a Variational Auto-Encoder (VAE) module and a discriminator: the former normalizes image latent distributions, and the latter distinguishes between real and synthesized images. The generation-stage information flow is formulated as

SDM_{0}: H_{0}=SDM_{0}(z, F^{ca}(s^{\prime}), I_{0}^{*}) \quad \textrm{(training stage)};
SDM_{i}: H_{i}=SDM_{i}(H_{i-1}, W, I_{i}^{*}) \quad \textrm{(training stage)};
SDM_{0}: H_{0}=SDM_{0}(z, F^{ca}(s^{\prime})) \quad \textrm{(testing stage)};
SDM_{i}: H_{i}=SDM_{i}(H_{i-1}, W) \quad \textrm{(testing stage)};
G_{i}^{o}: \hat{I}_{i}=G_{i}^{o}(H_{i}), \quad i=0,1,\ldots,m-1.   (1)

Note that $SDM_{0}$ only contains a series of convolution layers and upsampling modules; the full Semantic Disentangling Module (SDM) described below is adopted in $SDM_{i}$, $i=1,2,\ldots,m-1$.
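To make the information flow of Eq. (1) concrete, the following PyTorch-style sketch traces one training-stage forward pass. It is only an illustration under assumed interfaces: the module handles (text_encoder, f_ca, sdm, g_out), the noise dimension of 100, and the batched tensor shapes are assumptions, not the authors' released implementation.

```python
import torch

# Assumed module handles, mirroring Fig. 1 and Eq. (1):
#   text_encoder : sentence -> (s_prime, W)     (Text Encoder [50])
#   f_ca         : s_prime  -> s                (conditioning augmentation [55])
#   sdm[0]       : (z, s, I0*) -> H0            (convolution + upsampling only)
#   sdm[i]       : (H_{i-1}, W, Ii*) -> Hi      (full SDM, i >= 1)
#   g_out[i]     : Hi -> generated image at the i-th scale

def dr_gan_forward(text_encoder, f_ca, sdm, g_out, sentence, real_images, m=3):
    """Training-stage generation flow of Eq. (1); real_images[i] is the i-th scale real image."""
    s_prime, W = text_encoder(sentence)                 # sentence feature s', word features W
    s = f_ca(s_prime)                                   # conditioned sentence feature s
    z = torch.randn(s.size(0), 100, device=s.device)    # noise z ~ N(0, 1); dim 100 is an assumption

    hidden, fakes = [], []
    H = sdm[0](z, s, real_images[0])                    # SDM_0
    hidden.append(H)
    fakes.append(g_out[0](H))                           # \hat{I}_0
    for i in range(1, m):
        H = sdm[i](H, W, real_images[i])                # SDM_i
        hidden.append(H)
        fakes.append(g_out[i](H))                       # \hat{I}_i
    return hidden, fakes
```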

III-B Semantic Disentangling Module (SDM)

Figure 2: The architecture of the proposed Semantic Disentangling Module (SDM). WAM: Word-level Attention Mechanism (WAM) [50]; RIRM: Real Image Reconstruction Module; SD Loss: Semantic Disentangling Loss.

The semantic abstractness and ambiguity of text descriptions often make it difficult for the generator to capture the accurate semantics of a given sentence. Such inaccurate semantics are not conducive to distribution learning, and then lead to incorrect spatial structures and semantics in synthesized images. To this end, we propose a Semantic Disentangling Module (SDM) (Fig. 2) to help the generator suppress irrelevant spatial information and highlight relevant spatial information for generating high-quality images.

We build our SDM based on the widely adopted Cascaded Attentional Generative Model (CAGM) introduced by AttnGAN [50], because the Word-level Attention Mechanism (WAM) in CAGM can effectively enhance semantic details of the generated images.

Initially, the SDM was designed to directly extract the key and non-key information from text features. However, because text and image belong to heterogeneous domains, it is difficult to achieve reasonable semantic matching in the feature space [54]. Compared with the text features, the word-level context features from the WAM contain more structural and semantic information and match the image features better semantically. Besides, the semantics of the word-level context features and image features come from the input word features $W$ and sentence feature $s$. Therefore, the SDM is designed to distill key information and filter out non-key information on the word-level context features and image features of the CAGM.

In this section, we first revisit the WAM [50] to acquire the word-level context features and image features. Second, we introduce the Spatial Self-Attention Mechanism (SSAM) to represent the key and non-key information. Finally, we introduce the Semantic Disentangling Loss (SDL) to drive the SDM to conduct the semantic disentangling.

III-B1 Word-level Attention Mechanism (WAM) [50]

In cases where there is no ambiguity, we omit subscripts in this description of the WAM. As shown in Fig. 2, the WAM has two inputs: the word features $W\in\mathbb{R}^{D\times T}$ and the image features $H\in\mathbb{R}^{\hat{D}\times N}$ from the previous hidden layer.

First, the word features are mapped to the same latent semantic space as the image features, i.e., $W^{\prime}=UW$, $W^{\prime}=\{w^{\prime}_{i}\in\mathbb{R}^{\hat{D}}\,|\,i=1,2,\ldots,T\}$, where $U\in\mathbb{R}^{\hat{D}\times D}$ is a perceptual layer. Each column of $H=\{h_{j}\in\mathbb{R}^{\hat{D}}\,|\,j=1,2,\ldots,N\}$ (hidden features) is a feature vector of an image sub-region.

Second, for the $j$-th image sub-region, the dynamic representation of word vectors w.r.t. $h_{j}$ is

q_{j}=\sum_{i=1}^{T}\theta_{j,i}w_{i}^{\prime},\quad \textrm{where}\ \theta_{j,i}=\frac{\exp(S_{j,i}^{\prime})}{\sum_{k=1}^{T}\exp(S_{j,k}^{\prime})}.   (2)

Here $S^{\prime}_{j,i}=h^{T}_{j}w^{\prime}_{i}$, and $\theta_{j,i}$ indicates the weight the model assigns to the $i$-th word when generating the $j$-th sub-region of the image.

Third, the word-level context feature for $H$ is denoted by $Q^{\prime}=(q_{1},q_{2},\ldots,q_{N})\in\mathbb{R}^{\hat{D}\times N}$.

As shown in Fig. 2, the WAM generates a word-level context feature $Q^{\prime}$ from the given word features $W$ and image feature $H$. Here, $Q^{\prime}$ is a weighted combination of word features that expresses the image feature $H$, and it can effectively enrich the semantics of image details [50]. Due to the abstractness and ambiguity of the text description, the generator is prone to parsing wrong or inaccurate semantics, and such inaccurate semantic extraction leads to incorrect semantics and structures in $Q^{\prime}$ and $H$. Thus, it is necessary to distill the key information from the word-level context feature $Q^{\prime}$ and the intermediate image feature $H$ for image generation. Next, we use the Spatial Self-Attention Mechanism to represent the key and non-key features of $Q^{\prime}$ and $H$, respectively.
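A minimal sketch of the word-level attention in Eq. (2), assuming batch-first tensors (word features of shape (B, T, D) and sub-region features of shape (B, N, D̂)); the class name and the use of a single linear layer for the perceptron U are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordLevelAttention(nn.Module):
    """Word-level attention of Eq. (2): builds a word-context feature Q' for H."""
    def __init__(self, word_dim, hidden_dim):
        super().__init__()
        self.U = nn.Linear(word_dim, hidden_dim)   # maps word features into the image feature space

    def forward(self, W, H):
        # W: (B, T, D) word features; H: (B, N, D_hat) sub-region features
        W_prime = self.U(W)                                    # (B, T, D_hat)
        scores = torch.bmm(H, W_prime.transpose(1, 2))         # S'_{j,i} = h_j^T w'_i, shape (B, N, T)
        theta = F.softmax(scores, dim=-1)                      # attention over words for each sub-region
        Q_prime = torch.bmm(theta, W_prime)                    # q_j = sum_i theta_{j,i} w'_i, (B, N, D_hat)
        return Q_prime
```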

III-B2 Spatial Self-Attention Mechanism

As shown in Fig. 2, we use a Spatial Self-Attention Mechanism to represent the key and non-key information of the word-level context feature $Q^{\prime}_{i}$ and the intermediate image feature $H_{i-1}$, respectively.

First, we represent the key and non-key information of the image feature $H_{i-1}$. The spatial attention mask $Mask_{i}^{H}$ of the feature $H_{i-1}$ is defined as

Mask_{i}^{H}=Sig.(Conv_{1\times 1}^{2}(ReLU(Conv_{3\times 3}^{1}(H_{i-1})))),   (3)

where $H_{i-1}$ is followed by a $3\times 3$ convolutional layer $Conv_{3\times 3}^{1}(\cdot)$ and a $1\times 1$ convolutional layer $Conv_{1\times 1}^{2}(\cdot)$, and ReLU and Sigmoid (Sig.) are used as activation functions. We use $H^{+}_{i}=Mask_{i}^{H}\odot H_{i-1}$ to express the key spatial information of $H_{i-1}$, and $H^{-}_{i}=H_{i-1}-H^{+}_{i}$ to express the non-key information.

Second, we represent the key and non-key information of the word-level context feature $Q^{\prime}_{i}$. Although $Q^{\prime}_{i}$ can reflect the semantic matching between words and image sub-regions, $Q^{\prime}_{i}$ is still defined in the text embedding space and lacks the necessary spatial structure information. Thus, we design a convolution module "ResBlock" to convert the word-context matrix $Q^{\prime}_{i}$ to the refined feature $Q_{i}$. The spatial attention mask $Mask_{i}^{Q}$ of the feature $Q_{i}$ is defined as

Mask_{i}^{Q}=Sig.(Conv_{1\times 1}^{2}(ReLU(Conv_{3\times 3}^{1}(Q_{i})))).   (4)

We use $Q^{+}_{i}=Mask_{i}^{Q}\odot Q_{i}$ to express the key spatial information of $Q_{i}$, and $Q^{-}_{i}=Q_{i}-Q^{+}_{i}$ to express the non-key information.
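The mask computation of Eqs. (3)-(4) and the key/non-key split can be sketched as below. The single-channel mask broadcast over channels and the placeholder channel counts are assumptions; the same module would be applied to $H_{i-1}$ and to the refined word-context feature $Q_{i}$.

```python
import torch
import torch.nn as nn

class SpatialSelfAttention(nn.Module):
    """Predicts a spatial mask (Eqs. (3)-(4)) and splits a feature map into key / non-key parts."""
    def __init__(self, channels, mid_channels=None):
        super().__init__()
        mid_channels = mid_channels or channels
        self.conv1 = nn.Conv2d(channels, mid_channels, kernel_size=3, padding=1)  # Conv_{3x3}^1
        self.conv2 = nn.Conv2d(mid_channels, 1, kernel_size=1)                    # Conv_{1x1}^2

    def forward(self, feat):
        # feat: (B, C, h, w), e.g. H_{i-1} or the refined word-context feature Q_i
        mask = torch.sigmoid(self.conv2(torch.relu(self.conv1(feat))))  # (B, 1, h, w)
        key = mask * feat            # H_i^+ / Q_i^+ : key spatial information
        non_key = feat - key         # H_i^- / Q_i^- : non-key information
        return key, non_key, mask
```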

III-B3 Semantic Disentangling Loss

To drive the SDM to better distinguish between the key and non-key information of $Q_{i}$ and $H_{i-1}$, we further design a new Semantic Disentangling Loss (SDL) term. In the generation task, the generated image distribution and the real image distribution are assumed to be the same type of distribution [12]; if the mean and variance of two such distributions are the same, then the two distributions are identical. Therefore, we use constraints on the mean and variance to separate the key information from the non-key information of $Q_{i}$ and $H_{i-1}$ when constructing the SDL. Specifically, we push the mean and variance of the key information to approximate those of real images in the latent space, and vice versa. The SDL on $H_{i-1}$ is defined as

\mathcal{L}_{SDL}^{H_{i}}= SP(||\mu(H^{+}_{i})-\mu(H^{*}_{i})||-||\mu(H^{-}_{i})-\mu(H^{*}_{i})||)
+ SP(||\sigma(H^{+}_{i})-\sigma(H^{*}_{i})||-||\sigma(H^{-}_{i})-\sigma(H^{*}_{i})||).   (5)

Similarly, the SDL on the feature map $Q_{i}$ is denoted as

\mathcal{L}_{SDL}^{Q_{i}}= SP(||\mu(Q^{+}_{i})-\mu(H^{*}_{i})||-||\mu(Q^{-}_{i})-\mu(H^{*}_{i})||)
+ SP(||\sigma(Q^{+}_{i})-\sigma(H^{*}_{i})||-||\sigma(Q^{-}_{i})-\sigma(H^{*}_{i})||).   (6)

Here, $\mu(\cdot)$ and $\sigma(\cdot)$ compute the mean and variance of feature maps within a batch, and $SP(x)=\ln(1+e^{x})$.

$H^{*}_{i}$ is the feature map of the corresponding real image $I^{*}_{i}$. We introduce the Real Image Reconstruction Module (RIRM), shown in Fig. 2, to acquire the real image features $H^{*}_{i}$. The RIRM contains an encoder and a decoder. The encoder takes the real image $I^{*}_{i}$ as input and outputs the real image feature $H^{*}_{i}$. The decoder takes the real image feature $H^{*}_{i}$ and reconstructs the real image under the reconstruction loss $||RIRM(I^{*}_{i})-I^{*}_{i}||_{1}$. Note that the decoder and the generation modules $G^{o}$ form a Siamese Network, which provides high-quality real image features for the SDM. The loss functions in the SDM are summarized as

\mathcal{L}_{SDL_{i}}=\lambda_{1}\mathcal{L}_{SDL}^{H_{i}}+\lambda_{2}\mathcal{L}_{SDL}^{Q_{i}}+\lambda_{3}||RIRM(I^{*}_{i})-I^{*}_{i}||_{1}.   (7)

Here, $||RIRM(I^{*}_{i})-I^{*}_{i}||_{1}$ is the reconstruction loss in the RIRM.

Finally, the purified key features $Q^{+}_{i}$ and $H^{+}_{i}$ are concatenated, then fused and upsampled (by a ResBlock and an up-sampling module) to obtain the next-stage feature map $H_{i}$.
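A hedged sketch of Eqs. (5)-(7): it uses softplus for SP(·), scalar batch statistics for μ(·) and σ(·) (the exact norm and the per-feature-map statistics used by the authors may differ), and the default weights quoted later in Sec. IV-D.

```python
import torch
import torch.nn.functional as F

def sdl_term(feat_key, feat_nonkey, feat_real):
    """One SDL term (Eq. (5)/(6)): pull the key statistics toward the real image feature H*."""
    mu_k, mu_n, mu_r = feat_key.mean(), feat_nonkey.mean(), feat_real.mean()
    sd_k, sd_n, sd_r = feat_key.std(), feat_nonkey.std(), feat_real.std()
    loss = F.softplus((mu_k - mu_r).abs() - (mu_n - mu_r).abs())
    loss = loss + F.softplus((sd_k - sd_r).abs() - (sd_n - sd_r).abs())
    return loss

def sdm_loss(H_key, H_nonkey, Q_key, Q_nonkey, H_real, I_real, rirm,
             lam1=1e-3, lam2=1e-1, lam3=1e-5):
    """Eq. (7): SDL on H and Q plus the RIRM reconstruction loss (weights from Sec. IV-D)."""
    rec = F.l1_loss(rirm(I_real), I_real)          # ||RIRM(I*) - I*||_1
    return lam1 * sdl_term(H_key, H_nonkey, H_real) \
         + lam2 * sdl_term(Q_key, Q_nonkey, H_real) \
         + lam3 * rec
```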

Figure 3: The image feature $H_{1}$ (Stage-1), word-context matrix $Q^{\prime}_{2}$ (Stage-2), image feature $H_{2}$ (Stage-2), and $H_{2}$'s corresponding generated image $\hat{I}_{2}$ (Stage-2). The examples in the first row are from our Baseline (Base., AttnGAN [50]); the examples in the second row are from Baseline+SDM (Base.+SDM). The warmer the color, the stronger the feature response.

In Fig. 3, we show the image feature $H_{1}$ (stage-1), the word-level context feature $Q^{\prime}_{2}$ (stage-2), the image feature $H_{2}$ (stage-2), and the synthesized image $\hat{I}_{2}$ (stage-2). The first row is the visualization of the Baseline (Base.) and the second row is the visualization of Baseline+SDM (Base.+SDM).

In the first row (Baseline) of Fig. 3: due to the abstractness and ambiguity of text descriptions, it is difficult for the generator to accurately capture key semantics, and some non-key semantics get mixed up in the generation process; strange structures and details therefore easily appear in the generated image features such as $H_{1}$. The calculation of $Q^{\prime}_{2}$ requires the participation of $H_{1}$, which further leads to strange structures and details in $Q^{\prime}_{2}$; as a result, the structure of the generated image $\hat{I}_{2}$ is chaotic. In the second row (Base.+SDM) of Fig. 3: with the key-information selection strategy driven by the SDM, the non-key structural information in $H_{1}$ and $Q^{\prime}_{2}$ is better filtered out. Consequently, the structure and semantics of the image feature $H_{2}$ become more reasonable, and so does the structure of the synthesized image $\hat{I}_{2}$.

III-C Distribution Normalization Module (DNM)

The aforementioned SDM can help the generator distill key semantic information for image generation. However, diverse visual information, messy backgrounds, and other non-key visual content in images usually make the image distribution more complicated. On the discriminator's side, such complicated image distributions make it harder to distinguish real from synthesized images; the discriminator may fail to effectively identify the synthesized image.

Figure 4: The architecture of the DNM. The input $x$ is a generated image $\hat{I}_{i}$ or a real image $I^{*}_{i}$. The discriminator $D_{i}$ connects an encoder $E^{D}_{i}(\cdot)$ to a logical classifier $\psi_{i}(\cdot)$. The VAE module consists of a variational encoder (which stacks $E^{D}_{i}(\cdot)$ and a variational sampling module $\varphi_{i}(\cdot)$) and a decoder $D^{E}_{i}(\cdot)$.

Data normalization mechanisms [16, 2, 45, 49] can reduce the noise and internal covariate shift of data and further improve manifold learning efficiency in the deep learning community. In the discrimination stage of a GAN, images are first sampled from the real image distribution and the synthesized image distribution; the discriminator then determines whether the current input image is sampled from the real or the synthesized distribution, i.e., True or False. Due to the diversity of images and the randomness of synthesized images, the distributions of synthesized and real images are complicated. Such complex image distributions make it difficult for the discriminator to tell from which distribution the current input image was sampled, and difficult for the generator to align the generated distribution with the real image distribution. Therefore, it is necessary to reduce the complexity of the distribution. Normalization is an effective strategy for denoising and reducing complexity, so we introduce normalization of the latent distribution.

As a generative model, the Variational Auto-Encoder (VAE) [24] can effectively denoise the latent distribution and reduce its complexity. The VAE assumes that the latent embedding vector of an image follows a Gaussian distribution $N(\tilde{\mu},\tilde{\sigma})$, and then normalizes $N(\tilde{\mu},\tilde{\sigma})$ toward a standard normal distribution $N(0,1)$. Thanks to the image reconstruction in the VAE, the normalized embedding vector can preserve key semantic visual information. Thus, we build a VAE module in the DNM to normalize the image latent distributions and help the discriminator better distinguish between "True" and "Fake" images.

The structure of the $i$-th DNM is shown in Fig. 4. The DNM contains two sub-modules: the discriminator $D_{i}$ and the VAE module $A_{i}$. To simplify the notation, we omit the subscript $i$. $x$ can be a generated image $\hat{I}$ or a real image $I^{*}$. The discriminator is composed of an encoder $E^{D}(\cdot)$ and a logical classifier $\psi(\cdot)$. $E^{D}(\cdot)$ encodes the image $x$ into an embedding vector $v$. The embedding $v$, combined with the text embedding $s$, is fed to the logical classifier $\psi(\cdot)$, which identifies whether $x$ is a real or generated image. As mentioned above, diverse visual information, messy backgrounds, and other non-key visual content in images make the distribution of the embedding vectors $v$ complicated and make the identification of $x$ harder. Thus, we adopt a VAE module to normalize and denoise the latent distribution of the embedding vectors $v$. In addition to reducing the complexity of the image latent distribution, the VAE also pushes the encoded image feature vector $v$ to record important image semantics (through reconstruction).

Our VAE module $A$ adopts a standard Variational Auto-Encoder architecture [24]. As shown in Fig. 4, $A$ has a variational encoder (which consists of the encoder $E^{D}(\cdot)$ and a variational sampling module $\varphi(\cdot)$) and a decoder $D^{E}(\cdot)$.

The information flow of the VAE module $A$ is as follows.

(1) Given an image $x$, $x$ is first fed to the encoder $E^{D}(\cdot)$, which outputs the image latent embedding $v$.

(2) $\varphi(\cdot)$ infers the mean and variance of $v$ and builds a Gaussian distribution $N(\tilde{\mu}(\varphi(v)),\tilde{\sigma}(\varphi(v)))$, which is further normalized toward a normal distribution by $KL(N(\tilde{\mu}(\varphi(v)),\tilde{\sigma}(\varphi(v)))\,||\,N(0,1))$. To make this procedure differentiable, the re-sampling trick [24] is adopted to get $z^{*}=z\cdot\tilde{\sigma}(\varphi(v))+\tilde{\mu}(\varphi(v))$, $z\sim N(0,1)$.

(3) $z^{*}$ and the text embedding $s$ are concatenated and then fed to the decoder $D^{E}(\cdot)$ to reconstruct the image $x^{*}$. Note that the decoder takes both $s$ and $z^{*}$ for reconstruction, because the image generation here is conditioned on the text description.
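Steps (1)-(3) can be summarized in the following sketch, assuming the shared encoder $E^{D}$ returns a flat embedding $v$ and that $\varphi(\cdot)$ is realized as two linear heads predicting the mean and log-variance; the reparameterization and KL term follow the standard VAE formulation [24].

```python
import torch
import torch.nn as nn

class DNMVae(nn.Module):
    """VAE branch of the DNM: encode x, normalize the latent toward N(0, 1), decode conditioned on s."""
    def __init__(self, encoder, decoder, embed_dim, latent_dim):
        super().__init__()
        self.encoder = encoder                               # shared with the discriminator: E^D(x) -> v
        self.decoder = decoder                               # D^E([z*, s]) -> reconstructed image x*
        self.to_mu = nn.Linear(embed_dim, latent_dim)        # phi(.): mean head
        self.to_logvar = nn.Linear(embed_dim, latent_dim)    # phi(.): log-variance head

    def forward(self, x, s):
        v = self.encoder(x)                                  # (1) image latent embedding v
        mu, logvar = self.to_mu(v), self.to_logvar(v)        # (2) Gaussian N(mu, sigma)
        z = torch.randn_like(mu)
        z_star = mu + torch.exp(0.5 * logvar) * z            #     re-sampling trick: z* = z * sigma + mu
        x_rec = self.decoder(torch.cat([z_star, s], dim=1))  # (3) decode conditioned on the sentence s
        # KL(N(mu, sigma^2) || N(0, 1)), averaged over the batch
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1).mean()
        return x_rec, kl
```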

III-C1 Distribution Adversarial Loss

Following the alternating optimization mechanism of GANs, our VAE module is trained together with the discriminator. Based on the variational lower bound of the VAE [24], the loss function of the VAE module in $DNM_{i}$ can be defined as

\mathcal{L}_{D_{i}^{D}}= ||\hat{I}_{i}-D^{E}_{i}(\varphi_{i}(E^{D}(\hat{I}_{i})),s)||_{1}+||I_{i}^{*}-D^{E}_{i}(\varphi_{i}(E^{D}(I_{i}^{*})),s)||_{1}
+ KL(N(\tilde{\mu}_{i}(\varphi_{i}(E^{D}(\hat{I}_{i}))),\tilde{\sigma}_{i}(\varphi_{i}(E^{D}(\hat{I}_{i}))))\,||\,N(0,1))
+ KL(N(\tilde{\mu}_{i}(\varphi_{i}(E^{D}(I_{i}^{*}))),\tilde{\sigma}_{i}(\varphi_{i}(E^{D}(I_{i}^{*}))))\,||\,N(0,1)).   (8)

In each generator training step, the generated images have an unnormalized distribution, while the distributions of real images have already been normalized in the discriminator's training step. Hence, it is difficult for the generator to produce distributions that approximate the normalized real image distribution. Therefore, (1) we want to normalize the generated image distribution in the VAE module during the generator's training step, and (2) we want to align the normalized generated distribution with the normalized real image distribution. To achieve these two goals, we define a distribution consistency loss, i.e.,

\mathcal{L}_{G_{i}^{D}}= KL(N(\tilde{\mu}_{i}(\varphi_{i}(E^{D}(\hat{I}_{i}))),\tilde{\sigma}_{i}(\varphi_{i}(E^{D}(\hat{I}_{i}))))\,||\,N(0,1))
+ ||I_{i}^{*}-D^{E}_{i}(\varphi_{i}(E^{D}(\hat{I}_{i})),s)||_{1},   (9)

where the first term is designed for our first goal, and the second term is designed for our second goal.

We denote the two loss functions $\mathcal{L}_{G_{i}^{D}}$ and $\mathcal{L}_{D_{i}^{D}}$ as the Distribution Adversarial Loss (DAL) terms. In the discriminator's training stage, $\mathcal{L}_{D_{i}^{D}}$ helps the discriminator better distinguish synthesized images from real images and better learn the decision boundary between the generated and real image latent distributions. In the generator's training stage, $\mathcal{L}_{G_{i}^{D}}$ helps the generator learn and capture the real image distribution in the normalized latent space.

Our DNM module, combining the VAE and the DAL, can effectively reduce the complexity of the distribution constructed by the image embedding $v$ and enrich the high-level semantic information of $v$. Such a normalized embedding $v$ helps the discriminator better distinguish between "Fake" and "True" images. Consequently, the generator can also better align the generated distribution with the real image distribution.
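Reusing the VAE sketch above, the two DAL terms of Eqs. (8)-(9) could be assembled as follows; note that the generator-side reconstruction pulls the decoding of the fake image's latent toward the corresponding real image, as in Eq. (9). The function names are illustrative.

```python
import torch.nn.functional as F

def dal_discriminator_loss(vae, fake_img, real_img, s):
    """Eq. (8): reconstruct both images and normalize both latents during the discriminator step."""
    fake_rec, kl_fake = vae(fake_img.detach(), s)   # generator is frozen in this step
    real_rec, kl_real = vae(real_img, s)
    return (F.l1_loss(fake_rec, fake_img) + F.l1_loss(real_rec, real_img)
            + kl_fake + kl_real)

def dal_generator_loss(vae, fake_img, real_img, s):
    """Eq. (9): normalize the fake latent and pull its reconstruction toward the real image."""
    fake_rec, kl_fake = vae(fake_img, s)
    return kl_fake + F.l1_loss(fake_rec, real_img)
```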

III-D Objective Functions in DR-GAN

Combining the above modules, at the $i$-th stage of DR-GAN, the generative loss $\mathcal{L}_{G_{i}}$ and the discriminative loss $\mathcal{L}_{D_{i}}$ are defined as

\mathcal{L}_{G_{i}}=\underbrace{-\frac{1}{2}\mathbb{E}_{\hat{I}_{i}\sim P_{G_{i}}}[\log D_{i}(\hat{I}_{i})]}_{\text{unconditional loss}}-\underbrace{\frac{1}{2}\mathbb{E}_{\hat{I}_{i}\sim P_{G_{i}}}[\log D_{i}(\hat{I}_{i},s)]}_{\text{conditional loss}},   (10)

where the unconditional loss drives the generated images toward the real image distribution to fool the discriminator, and the conditional loss drives the generated images to better match the text descriptions. $D_{i}(\cdot)=\hat{\psi}(E^{D}_{i}(\cdot))$ is the unconditional discriminator and $D_{i}(\cdot,\cdot)=\psi(E^{D}_{i}(\cdot),s)$ is the conditional discriminator, where $\hat{\psi}(\cdot)$ and $\psi(\cdot)$ are the unconditional and conditional logical classifiers, respectively.

The discriminator $D_{i}$ is trained to classify the input image into the "Fake" or "True" class by minimizing the cross-entropy loss

\mathcal{L}_{D_{i}}= \underbrace{-\frac{1}{2}\mathbb{E}_{I^{*}_{i}\sim P_{data_{i}}}[\log D_{i}(I^{*}_{i})]-\frac{1}{2}\mathbb{E}_{\hat{I}_{i}\sim P_{G_{i}}}[\log(1-D_{i}(\hat{I}_{i}))]}_{\text{unconditional loss}}
+\underbrace{-\frac{1}{2}\mathbb{E}_{I^{*}_{i}\sim P_{data_{i}}}[\log D_{i}(I^{*}_{i},s)]-\frac{1}{2}\mathbb{E}_{\hat{I}_{i}\sim P_{G_{i}}}[\log(1-D_{i}(\hat{I}_{i},s))]}_{\text{conditional loss}},   (11)

where $I^{*}_{i}$ is drawn from the real image distribution $P_{data_{i}}$ at the $i$-th scale, and $\hat{I}_{i}$ is drawn from the distribution $P_{G_{i}}$ of generated images at the same scale.

To generate realistic images, the final objective functions in the generation training stage ($\mathcal{L}_{G}$) and the discrimination training stage ($\mathcal{L}_{D}$) are respectively defined as

\mathcal{L}_{G}=\sum_{i=0}^{m-1}(\mathcal{L}_{G_{i}}+\lambda_{4}\mathcal{L}_{G_{i}^{D}}+\mathcal{L}_{SDL_{i}})+\alpha\mathcal{L}_{DAMSM},   (12)
\mathcal{L}_{D}=\sum_{i=0}^{m-1}(\mathcal{L}_{D_{i}}+\lambda_{5}\mathcal{L}_{D_{i}^{D}}).   (13)

The loss function $\mathcal{L}_{DAMSM}$ [50] measures the matching degree between images and text descriptions; the DAMSM loss makes the generated images better conditioned on the text descriptions. Like most recent GAN-based T2I methods [31, 3, 42, 36, 25, 50], DR-GAN uses three generation stages ($m=3$).
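A compact sketch of how the per-stage terms are aggregated in Eqs. (12)-(13); the per-stage adversarial, DAL, SDL, and DAMSM losses are assumed to be precomputed elsewhere, and the weights are the defaults reported in Sec. IV-D.

```python
def total_generator_loss(adv_g_losses, dal_g_losses, sdl_losses, damsm_loss,
                         lam4=1.0, alpha=5.0):
    """Eq. (12): sum of per-stage generator terms plus the DAMSM term."""
    loss = sum(g + lam4 * d + s for g, d, s in zip(adv_g_losses, dal_g_losses, sdl_losses))
    return loss + alpha * damsm_loss

def total_discriminator_loss(adv_d_losses, dal_d_losses, lam5=1.0):
    """Eq. (13): sum of per-stage discriminator adversarial and DAL terms."""
    return sum(d + lam5 * dd for d, dd in zip(adv_d_losses, dal_d_losses))
```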

IV Experimental Results

TABLE I: IS \uparrow, FID \downarrow, MS \uparrow, R-Precision \uparrow, and Human Perceptual score (H.P. score) \uparrow of some SOTA GAN-based T2I models and our DR-GAN on the CUB-Bird and MS-COCO test sets. † indicates scores computed from images generated by the open-sourced models. * indicates scores reported in DMGAN [31]. ** indicates scores reported in AttnGAN+O.P. [43]. Other results were reported in the original papers. Bold indicates the best result.
Method CUB-Bird MS-COCO
IS \uparrow FID \downarrow MS \uparrow R-Precision \uparrow H.P. score \uparrow IS \uparrow FID \downarrow MS \uparrow R-Precision \uparrow H.P. score \uparrow
StackGANv2 [11] | 3.93±0.06 | 29.64† | 4.10† | - | - | 8.30±0.10 | 81.59** | - | - | -
AttnGAN [50] | 4.36±0.02 | 23.98* | 4.30† | 52.62%† | 16.14%† | 25.89±0.19 | 35.49* | 23.71† | 61.34%† | 20.97%†
Obj-GAN [25] | - | - | - | - | - | 30.29±0.33 | 25.64 | - | - | -
DM-GAN [31] | 4.75±0.07 | 16.09 | 4.62† | 59.21%† | 32.47%† | 30.49±0.57 | 32.64 | 29.94† | 70.63%† | 37.51%†
RiFeGAN [21] | 5.23±0.09 | - | - | - | - | 31.70 | - | - | - | -
DR-GAN (Ours) | 4.90±0.05 | 14.96 | 4.86 | 61.17% | 51.39% | 34.59±0.51 | 27.80 | 33.96 | 77.27% | 41.52%

In this section, we perform extensive experiments to evaluate the proposed DR-GAN. First, we compare DR-GAN with other SOTA GAN-based T2I methods [11, 50, 25, 31, 21]. Second, we discuss the effectiveness of each new module introduced in DR-GAN: the Semantic Disentangling Module (SDM) and the Distribution Normalization Module (DNM). All experiments are performed on one RTX 2080 Ti GPU using the PyTorch toolbox.

IV-A Experiment Settings

Datasets. We conduct experiments on two widely-used datasets, CUB-Bird [47] and MS-COCO [27]. The CUB-Bird dataset [47] contains 11,788 bird images, and each image has 10 sentences describing fine-grained visual details. The MS-COCO dataset [27] contains 80k training images and 40k test images, and each image has 5 sentences describing the visual information of the scene. We pre-process and split the images of the two datasets following the same settings as [37], [55].

Evaluation. We compare DR-GAN with other GAN-based T2I approaches from three aspects: Image Diversity, Distribution Consistency, and Semantic Consistency. Each model generates 30,000 images conditioned on text descriptions from the unseen test set for evaluation. The \uparrow means that a higher value indicates better performance, and vice versa.

(I) Image Diversity. Following almost all T2I approaches, we adopt the fine-tuned Inception models [55] to calculate the Inception Score (IS \uparrow), which measures image diversity.

(II) Distribution Consistency. We use the Fréchet Inception Distance (FID \downarrow) [13] and Mode Score (MS \uparrow) [44] to evaluate the distribution consistency between generated images and real images. The image features for FID and MS are extracted by a pre-trained Inception-V3 network [38].
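For reference, a standard FID computation over Inception-V3 features is sketched below with NumPy/SciPy; this is the usual definition of the metric, not the authors' evaluation code.

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(feat_real, feat_fake):
    """FID between two feature sets of shape (N, d), e.g. Inception-V3 pool features."""
    mu_r, mu_f = feat_real.mean(axis=0), feat_fake.mean(axis=0)
    cov_r = np.cov(feat_real, rowvar=False)
    cov_f = np.cov(feat_fake, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r.dot(cov_f), disp=False)
    if np.iscomplexobj(covmean):          # numerical noise can produce tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff.dot(diff) + np.trace(cov_r + cov_f - 2.0 * covmean))
```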

(III) Semantic Consistency. We use the R-precision \uparrow and Human Perceptual score (H.P. score \uparrow) to evaluate the semantic consistency between the text description and the synthesized image.

R-precision. Following [50], we also use R-precision to evaluate semantic consistency. Given a pre-trained image-to-text retrieval model, we use generated images to query their corresponding text descriptions. Specifically, given a generated image $\hat{x}$ conditioned on a sentence $s$ and 99 randomly sampled sentences $\{s_{i}^{\prime}:1\leq i\leq 99\}$, we rank these 100 sentences with the pre-trained image-to-text retrieval model. If the ground-truth sentence $s$ is ranked highest, we count this as a successful retrieval. We perform this retrieval once for every image in the test set and report the percentage of successful retrievals as the R-precision score. Higher R-precision means greater semantic consistency.
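The R-precision protocol can be summarized by the following sketch, where retrieval_score is an assumed interface to the pre-trained image-to-text retrieval model.

```python
import random

def r_precision(retrieval_score, generated, gt_sentences, all_sentences, n_distractors=99):
    """Fraction of generated images whose ground-truth sentence ranks first among 100 candidates."""
    hits = 0
    for img, gt in zip(generated, gt_sentences):
        distractors = random.sample([s for s in all_sentences if s != gt], n_distractors)
        candidates = [gt] + distractors
        scores = [retrieval_score(img, s) for s in candidates]   # higher score = better match
        if max(range(len(candidates)), key=lambda k: scores[k]) == 0:
            hits += 1
    return hits / len(generated)
```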

Human Perceptual score (H.P. score). To obtain the H.P. score, we randomly select 2000 text descriptions from the CUB-Bird test set and 2000 text descriptions from the MS-COCO test set. Given the same text description, 30 volunteers (not including any author) are asked to rank the images generated by the different methods. The average ratio of images ranked best by the human users is used to evaluate the compared methods.

IV-B Comparison with state-of-the-arts

Figure 5: Images of 256×256 resolution generated by our DR-GAN, DM-GAN [31], and AttnGAN [50] conditioned on text descriptions from the MS-COCO (upper part) and CUB-Bird (bottom part) test sets.

Image Diversity. We use IS to evaluate image diversity. As shown in Table I, DR-GAN achieves the second-highest IS score on the CUB-Bird test set and the highest IS score on the MS-COCO test set. The IS score (5.23) of RiFeGAN [21] is higher than ours (4.90) on the CUB-Bird dataset; however, RiFeGAN [21] uses 10 sentences to train the generator on CUB-Bird, while DR-GAN only uses 1 sentence, following the standard T2I problem formulation. On the larger-scale and more challenging MS-COCO dataset, RiFeGAN [21] uses 5 sentences to train the model but obtains a significantly lower IS score (31.70) than DR-GAN (34.59). When the given database only contains images with a single sentence (which is common in practical tasks such as Story Visualization [53], Text-to-Video [52], and other text-guided image generation [51]), RiFeGAN [21] cannot be used, whereas methods such as DR-GAN, AttnGAN [50], and DMGAN [31], which only need one sentence per image, still apply.

Distribution Consistency. We use FID and MS to evaluate the distribution consistency between the generated image distribution and the real image distribution. As shown in Table I, compared with these GAN-based T2I methods, DR-GAN achieves competitive performance on the CUB-Bird and MS-COCO test sets under both FID and MS. The FID of Obj-GAN [25] is lower than that of DR-GAN, because Obj-GAN [25] and similar methods [3, 43, 15, 20, 14] require additional information, including object bounding boxes and shapes, for training and synthesis. This additional information helps the generator better capture the object layout; however, although available in MS-COCO, it is often unavailable for other datasets such as CUB-Bird. In general, producing additional descriptions or object bounding boxes and shapes for a new database to train the generator is expensive, which limits scalability and usability in more general text-guided image generation. Overall, DR-GAN still achieves competitive performance in the distribution consistency evaluation.

Semantic Consistency. We use R-precision and the Human Perceptual score (H.P.) to evaluate semantic consistency. As shown in Table I, compared with AttnGAN [50] and DMGAN [31], DR-GAN achieves the best performance in the semantic consistency evaluation on both datasets.

TABLE II: The performance of applying SDM and DNM on other GAN-based T2I methods on CUB-Bird test set. The measures include IS\uparrow, FID \downarrow and MS \uparrow. AttnGAN- indicates that we remove the Word-level Attention Mechanism (WAM) in AttnGAN [50]. {\dagger} indicates the scores are computed from images generated by the open-sourced models. Other results were reported in the original paper.
Method CUB-Bird
IS \uparrow FID \downarrow MS \uparrow
StackGANv2 [11] | 3.93±0.06 | 29.64† | 4.10†
StackGANv2+SDM+DNM | 4.58±0.04 | 22.94 | 4.63
AttnGAN- [50] | 4.11±0.08 | 25.23† | 4.01†
AttnGAN-+SDM+DNM | 4.65±0.06 | 20.01 | 4.54
DMGAN [31] | 4.75±0.07 | 16.09 | 4.62†
DMGAN+SDM+DNM | 4.84±0.04 | 15.26 | 4.72
Figure 6: Images of 256×256 resolution generated by StackGAN v2 [11], StackGAN v2+SDM+DNM (StackGAN v2∗), AttnGAN- [50], AttnGAN-+SDM+DNM (AttnGAN-∗), DMGAN [31], and DMGAN+SDM+DNM (DMGAN∗) conditioned on text descriptions from the CUB-Bird test set.

Generalization. To evaluate the generalizability of our proposed SDM and DNM, we integrate the SDM and DNM modules into several well-known/SOTA GAN-based T2I models, including StackGANv2 [11], AttnGAN- [50], and DMGAN [31]. Here, AttnGAN- denotes AttnGAN [50] with its Word-level Attention Mechanism (WAM) removed. As shown in Table II, SDM and DNM help all these GAN-based T2I models achieve better performance on all three measures, indicating that SDM and DNM can be used as general mechanisms in the T2I generation task.

Visualization. We qualitatively evaluate DR-GAN and some GAN-based T2I methods by image visualization. As shown in Fig. 5, compared with AttnGAN [50] and DMGAN [31], our DR-GAN synthesizes higher-quality images on the CUB-Bird and MS-COCO test sets.

On the CUB-Bird test set: (i) compared with DR-GAN, the birds synthesized by AttnGAN and DMGAN contain some strange or incomplete details and structures; because SDM can better distill key information for image generation, DR-GAN performs better on structure generation. (ii) Due to the introduction of DNM, DR-GAN can better capture the real image distribution; therefore, compared with AttnGAN and DMGAN, the bird images generated by DR-GAN are more realistic and contain fuller semantic details. In particular, the birds' wings in Fig. 5 (b), (c), (d), and (k) have very detailed textures and full colors. On the MS-COCO test set: the semantics of the text descriptions are rather sparse, so it is very difficult for the generator to capture sufficient semantics to generate high-quality images, and most current methods cannot synthesize vivid images; compared with AttnGAN and DMGAN, our DR-GAN is more reasonable in object layout and richer in semantic details.

Besides, in Fig. 6, we show the images synthesized by StackGAN v2 [11], AttnGAN- [50], and DMGAN [31], and by StackGAN v2+SDM+DNM (StackGAN v2∗), AttnGAN-+SDM+DNM (AttnGAN-∗), and DMGAN+SDM+DNM (DMGAN∗). As shown in Fig. 6, our SDM and DNM further help these methods improve the structure and enrich the semantic details of the synthesized birds, making the generated images more realistic.

Model Cost. As shown in Table III, we compare our proposed DR-GAN with other SOTA T2I methods under four model cost measures: Training Time, Training Epochs, Model Size, and Testing Time. Taking the MS-COCO dataset as an example, DR-GAN sits between AttnGAN (Baseline) and DM-GAN in terms of model cost while achieving the highest performance.

TABLE III: The Training Time, Training Epoch, Model Size, Testing Time of our DR-GAN and other SOTA T2I methods on the MS-COCO dataset.
Method Training Time Training Epoch Model Size Testing Time
AttnGAN (Base.) [50] | ~8 Days | 120 | ~55.5M | ~1200s
DM-GAN [31] | ~14 Days | 200 | ~89.7M | ~1800s
DR-GAN (Ours) | ~10 Days | 150 | ~73.2M | ~1400s

IV-C Ablation Study

IV-C1 Effectiveness of New Modules

Figure 7: Images of 256×256 resolution generated by our Baseline (Base.), Base.+SDM, Base.+DNM, and DR-GAN conditioned on text descriptions from the CUB-Bird test set.
Figure 8: The top-3 word-guided attention maps and synthesized images at different stages from AttnGAN, DMGAN, and Base.+SDM (AttnGAN+SDM) on the CUB-Bird test set.
TABLE IV: IS\uparrow, FID \downarrow and MS \uparrow produced by combining different components of the DR-GAN on the CUB-Bird and MS-COCO test sets. Our Baseline (Base.) is AttnGAN [50]. DR-GAN=Base.+SDM+DNM.
Method CUB-Bird MS-COCO
IS \uparrow FID \downarrow MS \uparrow IS \uparrow FID \downarrow MS \uparrow
Base. [50] | 4.36±0.02 | 23.98 | 4.30 | 25.89±0.19 | 35.49 | 23.71
Base.+SDM | 4.70±0.02 | 15.50 | 4.62 | 30.89±0.57 | 31.42 | 30.64
Base.+DNM | 4.79±0.03 | 15.12 | 4.80 | 31.26±0.42 | 29.73 | 31.28
DR-GAN | 4.90±0.05 | 14.96 | 4.86 | 34.59±0.51 | 27.80 | 33.96

In this subsection, we evaluate the effectiveness of each new component qualitatively and quantitatively. The numerical results are documented in Table IV. The visualization results are shown in Fig. 7, Fig. 8 and Fig. 9.

Quantitative results. We evaluated the effectiveness of two new components, SDM and DNM, in terms of three measures. Our Baseline (Base.) is AttnGAN [50]. As shown in Table IV, both SDM and DNM can effectively improve the performance of the baseline on these two datasets over three measures.

First, we introduce the SDM into the baseline (Base.), i.e., Base.+SDM. As shown in Table IV, Base.+SDM brings 7.80% and 19.31% improvement in IS, 35.36% and 11.47% improvement in FID, and 7.44% and 29.23% improvement in MS on the CUB-Bird and MS-COCO test sets, respectively.

Second, we introduce the DNM into the baseline (Base.), i.e., Base.+DNM. As shown in Table IV, Base.+DNM brings 9.86% and 20.74% improvement in IS, 36.95% and 16.23% improvement in FID, and 11.63% and 31.92% improvement in MS on the CUB-Bird and MS-COCO test sets, respectively.

Finally, when we introduce both the SDM and the DNM into the baseline (Base.), i.e., DR-GAN, our DR-GAN obtains 12.38% and 33.60% improvement over the baseline in IS, 37.61% and 21.67% improvement in FID, and 13.02% and 43.23% improvement in MS on the CUB-Bird and MS-COCO test sets, respectively. DR-GAN achieves 4.90 and 34.59 in IS, 14.96 and 27.80 in FID, and 4.86 and 33.96 in MS on the CUB-Bird and MS-COCO test sets, respectively.

Qualitative Results. We also qualitatively evaluate the effectiveness of each component by image visualization (Fig. 7, Fig. 8, and Fig. 9).

In Fig. 7, we can clearly see that the DNM and SDM each effectively improve the quality of the generated images. SDM: Compared with the Baseline, introducing SDM better improves the bird body structure on the CUB-Bird test set, and also improves the target layout in complex scenes to some extent. This is because SDM can directly extract key features and filter non-key features from the spatial perspective of feature maps. Constrained by the statistics of the real image feature distribution, SDM tries to filter out the non-key structural information that is not conducive to distribution learning. For the COCO data, the excessive sparsity and abstraction of text semantics prevent the generator from producing vivid images, but with SDM, the overall layout and structure of the generated images are significantly improved compared with the Baseline.

DNM: Compared with the Baseline, introducing DNM better improves the visual representation and the semantic expression of details to some extent. This is because DNM normalizes the latent distribution of real and generated images, which drives the generator to better approximate the real image distribution; therefore, the generated images are better in terms of visual representation and detail semantics. Compared with SDM, DNM lacks direct intervention on the image features, so the structure of images generated under Base.+DNM is slightly worse than that under Base.+SDM.

SDM+DNM: When we introduce both SDM and DNM into the Baseline (Base.), i.e., DR-GAN, the respective advantages of SDM and DNM combine, and DR-GAN performs better than the Baseline in visual semantics, structure, and layout.

Besides, we present the top-3 word-guided attention maps and the three stages of generated images ($\hat{I}_{0}$ at 64×64, $\hat{I}_{1}$ at 128×128, $\hat{I}_{2}$ at 256×256) from AttnGAN, DMGAN, and Base.+SDM (AttnGAN+SDM) on the CUB-Bird test set. To facilitate display, we rescale the generated images to the same size in Fig. 8. For AttnGAN and DMGAN, we observe that when the quality of the image $\hat{I}_{0}$ generated in the initial stage is very poor, $\hat{I}_{0}$ confuses the word attention regions. Due to the lack of direct intervention on features, the confused attention and image features continue to corrupt $\hat{I}_{1}$ and the attention maps in the second-stage module ($SDM_{2}$ in our notation). In contrast, the proposed SDM gradually captures key information and filters out the non-key information for image generation in the subsequent stages. As a result, the attention information and the resulting images gradually become more reasonable.

Finally, we use t-SNE [46] to visualize the two-dimensional distribution of generated images and real images. As shown in Fig. 9, in the initial stage (200 epochs on the CUB-Bird dataset), the difference between the distribution of the images generated by the Baseline and the real image distribution is very large. When we introduce SDM into the Baseline, i.e., Baseline+SDM, we observe a narrowing of the gap between the generated and real image distributions. When we further introduce DNM into Baseline+SDM, i.e., DR-GAN, we observe an additional narrowing of the gap between the two-dimensional distributions of generated and real images, and DNM makes the scatter plot area more compact. The t-SNE visualization shows that DNM and SDM are effective for distribution learning.
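A minimal sketch of the t-SNE visualization in Fig. 9, assuming real and generated image features have already been extracted into NumPy arrays; colors follow the figure's convention (green for real, red for generated).

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(real_feats, fake_feats, out_path="tsne.png"):
    """Project real (green) and generated (red) image features to 2-D, as in Fig. 9."""
    feats = np.concatenate([real_feats, fake_feats], axis=0)
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(feats)
    n_real = len(real_feats)
    plt.scatter(emb[:n_real, 0], emb[:n_real, 1], s=4, c="green", label="real")
    plt.scatter(emb[n_real:, 0], emb[n_real:, 1], s=4, c="red", label="generated")
    plt.legend()
    plt.savefig(out_path)
```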

Figure 9: Visualization of the synthesized images and real images under t-SNE [46]. The green scattered points are the real images, and the red scattered points are the generated images.

IV-C2 The validity of different parts of SDM.

The IS \uparrow, FID \downarrow and MS \uparrow of the Baseline (Base.) are 4.36±0.02, 23.98, and 4.30, respectively. As shown in Table V, we set $\lambda_{3}=0$ so that the RIRM does not participate in the SDM. The SD loss then cannot obtain high-quality real image features, the constraint on the distribution statistics deviates from the feature distribution of real images, and the image quality deteriorates severely. Thus, compared with Base. and Base.+SDM, the performance of Base.+SDM ($\lambda_{3}=0$) is poor. When we remove the SDL, the performance of the model also deteriorates, because the attention mechanism lacks the loss needed to drive it to extract key information. Overall, SDM effectively improves the quality of the generated images.

TABLE V: IS\uparrow, FID \downarrow and MS \uparrow produced by different parts in SDM on the CUB-Bird test set.
Method CUB-Bird
IS \uparrow FID \downarrow MS \uparrow
Base. [50] | 4.36±0.02 | 23.98 | 4.30
Base.+SDM | 4.70±0.02 | 15.50 | 4.62
Base.+SDM w/o SDL | 4.41±0.03 | 23.56 | 4.38
Base.+SDM ($\lambda_{3}=0$) | 4.21±0.04 | 27.34 | 4.11

IV-C3 Impact of self-encoding mode on performance.

We modify the VAE module to observe its impact on model performance. In the first case in Fig. 10, we remove the variational sampling module $\varphi(\cdot)$ from the VAE module, and the Distribution Adversarial Loss is rewritten as Eq. 14 and Eq. 15.

Figure 10: The architecture of the DNM. The first case is the VAE without the variational sampling module, i.e., general encoding and decoding (AE). The second case is the VAE framework we adopt.
\mathcal{L}_{D_{i}^{D}}=||\hat{I}_{i}-D^{E}_{i}(\varphi_{i}(E^{D}(\hat{I}_{i})),s)||_{1}+||I_{i}^{*}-D^{E}_{i}(\varphi_{i}(E^{D}(I_{i}^{*})),s)||_{1},   (14)
\mathcal{L}_{G_{i}^{D}}=||I_{i}^{*}-D^{E}_{i}(\varphi_{i}(E^{D}(\hat{I}_{i})),s)||_{1}.   (15)

We name this AE-based model DNM (AE). As shown in Table VI, we compare Base.+DNM (AE) and Base.+DNM on the CUB-Bird dataset under the three measures. Compared with Base.+DNM (which contains the VAE), the performance of Base.+DNM (AE) decreases significantly; however, Base.+DNM (AE) still performs better than the Baseline, because the image encoding and decoding in the AE enhance the expression of image semantics in $v_{i}$. Overall, compared with a general AE, adopting the VAE lets the generator obtain higher performance.

TABLE VI: IS \uparrow, FID \downarrow and MS \uparrow produced by Base.+DNM (AE) and Base.+DNM on the CUB-Bird test set.
Method CUB-Bird
IS \uparrow FID \downarrow MS \uparrow
Base. [50] | 4.36±0.02 | 23.98 | 4.30
Base.+DNM (AE) | 4.53±0.05 | 19.86 | 4.55
Base.+DNM | 4.79±0.03 | 15.12 | 4.80

IV-C4 Discussion of semantic consistency strategies.

We compare the performance of SDM with some outstanding semantic consistency strategies. Several strong methods (such as CSM-GAN [41], SE-GAN [39], and MirrorGAN [36]) introduce various semantic constraints to improve the quality of image generation. As shown in Table VII, we present the performance of the semantic constraint strategies in these methods under the IS score on the CUB-Bird dataset. SD-GAN [9] and SE-GAN [39] constrain the semantic consistency between synthesized images and real images. CSM-GAN [41] constrains the semantic consistency between the synthesized images and the text descriptions. MirrorGAN [36] enforces semantic consistency between the text descriptions re-generated from the image and the real text descriptions. Compared with these methods, the distribution statistic constraints (Baseline+SDM) obtain better performance under the IS score. We believe that the randomness of images and texts leads to the instability of semantic constraints, whereas the constraints on mean and variance can guide the learning direction of distributions for generators, helping the generator better align the generated image distribution with the real image distribution.

TABLE VII: Comparison with the semantic consistency strategies of other outstanding T2I methods under the IS score. "*" indicates the performance of the semantic constraint strategy of each method under the IS score.

Method                   CUB-Bird
                         IS ↑
Baseline [50]            4.36±0.02
CSM-GAN [41]             4.58±0.05
SE-GAN [39]              4.44±0.03
MirrorGAN [36]           4.47±0.07
SD-GAN [9]               4.51±0.07
Baseline+SDM (Ours)      4.70±0.02

IV-D Parametric Sensitivity Analysis.

In this subsection, we analyze the sensitivity of the hyper-parameters λ₁, λ₂, λ₃, λ₄, λ₅, and α. In Table I to Table VII (excluding Table III), the six parameters are set to the values at which DR-GAN's performance is around the median in Fig. 11, i.e., λ₁=0.001, λ₂=0.1, λ₃=0.00001, λ₄=1.0, λ₅=1.0, and α=5.0. FID is used as the evaluation metric for the quality of synthesized images on the CUB-Bird dataset. The FID scores of DR-GAN under different values of λ₁, λ₂, λ₃, λ₄, λ₅, and α are shown in Fig. 11.
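
For reference, FID [13] measures the distance between Gaussian fits to Inception features of real and generated images, so a lower score indicates that the generated distribution is closer to the real one:

FID = ||μ_r − μ_g||₂² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^{1/2}),

where (μ_r, Σ_r) and (μ_g, Σ_g) are the feature mean and covariance of real and generated images, respectively.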

The hyper-parameters λ₁ and λ₂: In the Semantic Disentangling Loss (Eq. 7), the losses in Eq. 5 and Eq. 6 drive the SDM to better distill the key information from the image feature H and the word-level context feature Q for image generation. λ₁ and λ₂ are the corresponding balance parameters in Eq. 7.

As shown in Fig. 11: (i) The Semantic Disentangling Loss has a strong influence on the overall performance of DR-GAN. When λ₁=0 or λ₂=0, the FID score of DR-GAN increases, i.e., the quality of the generated images deteriorates. (ii) As λ₁ or λ₂ grows too large, the FID score also increases. A larger weight means that training pays more attention to the regression of the distribution statistics. These statistics only reflect global information about the distribution; the statistic constraint is meant to assist the generator in approximating the real image distribution, while the accurate distribution itself still has to be learned by the GAN through implicit sampling. Therefore, the weight of the SDL should be chosen appropriately. Based on the results in Fig. 11, a suitable range for λ₁ is {0.001, 0.01, 0.1}, and a suitable range for λ₂ is {0.001, 0.1}.

The hyper-parameter λ₃: In the Semantic Disentangling Loss (Eq. 7), λ₃ adjusts the weight of the reconstruction loss ||RIRM(I_i^*) − I_i^*||_1, which provides the real image feature H* for the other terms in the Semantic Disentangling Loss. When λ₃=0, the reconstruction mechanism (i.e., RIRM) does not work, and the performance of DR-GAN drops significantly: without this loss, it is difficult for the SDM to estimate valid means and variances of real image features. Consequently, the real image features are mixed with semantic information irrelevant to image generation, or important image semantics are suppressed; such poor real image features mislead the generator toward a poor image distribution and lower the quality of the generated images. We also found that performance decreases as λ₃ increases. The decoder in RIRM and the generation module G^o form a Siamese network, designed so that the real image features H* and the generated image features H are mapped to the same semantic space. Increasing the weight makes this shared module pay more attention to reconstructing real images from real image features and weakens image generation; that is, the Siamese network becomes imbalanced between its two feature sources during training. Overall, based on the results in Fig. 11, a suitable range for λ₃ is about [0.00001, 1].
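
Read this way, the Siamese design amounts to weight sharing: one decoding module is applied both to real image features H* (for reconstruction inside RIRM) and to generated features H (inside the generation module G^o), so both feature sources are mapped into the same semantic space. The sketch below is a hedged illustration with hypothetical layer sizes, not DR-GAN's actual architecture.

```python
import torch
import torch.nn as nn

class SharedDecoder(nn.Module):
    """A single set of decoding weights reused by the RIRM reconstruction
    branch and the generation branch G^o (illustrative layer sizes)."""
    def __init__(self, feat_ch: int, out_ch: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(feat_ch, feat_ch // 2, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch // 2, out_ch, kernel_size=3, padding=1),
            nn.Tanh(),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.net(features)

decoder = SharedDecoder(feat_ch=64)
# RIRM branch: rec_real = decoder(h_real), penalized by lambda3 * ||rec_real - I_real||_1
# Generation branch: fake = decoder(h_fake)
# A large lambda3 pushes these shared weights toward reconstruction and away
# from generation, which is the imbalance described above.
```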

The hyper-parameters λ₄ and λ₅: λ₄ balances the weight of Eq. 9 in the generation-stage loss of DR-GAN, and λ₅ balances the weight of Eq. 8; together, Eq. 8 and Eq. 9 form the Distribution Adversarial Loss. When λ₄=0 or λ₅=0, the FID score increases and the generated image quality decreases, which shows that both loss terms of the Distribution Adversarial Loss effectively improve the distribution quality of the generated images. When λ₄ or λ₅ becomes too large, the quality of the image distribution tends to decline: an overly large weight over-normalizes the image latent distribution and makes the discriminator very powerful, and, as in GAN theory [8], an overly strong discriminator is not conducive to generator optimization. Based on the results in Fig. 11, a suitable range for λ₄ and λ₅ is [0.1, 2].

The hyper-parameter α: In the training stage of DR-GAN, we also utilize the DAMSM loss [50] to better condition the generated images on the text descriptions. Fig. 11 shows the performance of DR-GAN under different values of α. When α=0, the FID score increases and the generated image quality decreases, which indicates that the semantic matching constraint helps improve the quality of image generation. Moreover, as α varies, the performance of DR-GAN changes only moderately and remains relatively good overall. Based on the results in Fig. 11, a suitable range for α is about [1, 20].
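
As a compact summary of how these weights enter training, the sketch below assembles the generation-stage objective from already-computed loss tensors. The dictionary keys and the function itself are placeholders we introduce for illustration; the exact terms follow Eqs. 5–9 and the DAMSM loss of [50].

```python
from typing import Dict, Sequence
import torch

def generation_stage_loss(terms: Dict[str, torch.Tensor],
                          lambdas: Sequence[float],
                          alpha: float) -> torch.Tensor:
    """Illustrative weighting of the generation-stage objective of DR-GAN."""
    l1, l2, l3, l4, l5 = lambdas
    # Semantic Disentangling Loss (Eq. 7): statistic terms (Eq. 5, Eq. 6)
    # plus the RIRM reconstruction term weighted by lambda3.
    sdl = l1 * terms["eq5"] + l2 * terms["eq6"] + l3 * terms["rirm_rec"]
    # Distribution Adversarial Loss: Eq. 9 weighted by lambda4, Eq. 8 by lambda5.
    dal = l4 * terms["eq9"] + l5 * terms["eq8"]
    # Conventional adversarial loss plus the DAMSM matching loss [50], scaled by alpha.
    return terms["adv"] + sdl + dal + alpha * terms["damsm"]

# Median-performance setting used for Tables I-VII (excluding Table III):
# lambdas = (0.001, 0.1, 0.00001, 1.0, 1.0), alpha = 5.0
```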

Figure 11: FID↓ scores of our DR-GAN under different values of the hyper-parameters λ₁, λ₂, λ₃, λ₄, λ₅, and α on the CUB-Bird test set.

IV-E Limitation and Discussion

Figure 12: Failure cases generated by DR-GAN on CUB-Bird test set (top row) and MS-COCO test set (bottom row).

The experiments show the effectiveness of our DR-GAN in T2I generation; however, there are a few failure cases. Some failure cases from the CUB-Bird dataset are shown in the top row of Fig. 12: distribution normalization in SDM can sometimes lead to missing local spatial structures or parts such as heads, feet, and necks. Some failure cases from the MS-COCO dataset are shown in the second row. As in most existing T2I models [50, 25, 36, 3, 39, 9, 31], it is difficult for the text encoder to infer reasonable object locations in a scene when they are not explicitly described in the sentence, so the generator in DR-GAN tends to place some objects randomly and sometimes unreasonably.

In future work, we will explore objects' location relationships using text knowledge graphs. We will also explore applying SDM and DNM to other, broader GAN-based image generation tasks, such as Image-to-Image Translation [22, 35] and Virtual Try-On [10, 1].

V Conclusions

We proposed a novel Semantic Disentangling Module (SDM) and a Distribution Normalization Module (DNM) for GAN-based T2I models, and built a Distribution Regularization Generative Adversarial Network (DR-GAN) for Text-to-Image (T2I) generation. The SDM helps the generator distill key information and filter out non-key information for image generation. The DNM normalizes and reduces the complexity of the image latent distribution, helping GAN-based T2I methods better capture the real image distribution from the text feature distribution. Extensive experimental results and analysis demonstrate the effectiveness of DR-GAN and its improved performance over previous outstanding methods. Moreover, SDM and DNM are general mechanisms that can help other GAN-based T2I models achieve better performance on the Text-to-Image generation task.

Acknowledgments

This work is supported by National Key R&D Program of China (2021ZD0111900), National Natural Science Foundation of China (61976040), National Science Foundation of USA (OIA-1946231, CBET-2115405), Chinese Postdoctoral Science Foundation (No. 2021M700303). No conflict of interest: Hongchen Tan, Xiuping Liu, Baocai Yin, and Xin Li declare that they have no conflict of interest.

References

  • [1] Neuberger Assaf, Borenstein Eran, Hilleli Bar, Oks Eduard, and Alpert Sharon. Image based virtual try-on network from unpaired data. In CVPR, 2020.
  • [2] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. 2016.
  • [3] Li Bowen, Qi Xiaojuan, Lukasiewicz Thomas, and H. S. Torr Philip. Controllable text-to-image generation. In NeurIPS, 2019.
  • [4] Tong Che, Yanran Li, Athul Paul Jacob, Yoshua Bengio, and Wenjie Li. Mode regularized generative adversarial networks. In ICLR, 2017.
  • [5] Cheng Deng, Zhao Li, Xinbo Gao, and Dacheng Tao. Deep multi-scale discriminative networks for double JPEG compression forensics. volume abs/1904.02520, 2019.
  • [6] Xinxia Fan, Yanhua Yang, Cheng Deng, Jie Xua, and Xinbo Gao. Compressed multi-scale feature fusion network for single image super-resolution. Signal Processing, 146(MAY):50–60, 2017.
  • [7] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Xu Bing, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, 2014.
  • [8] Jie Gui, Zhenan Sun, Yonggang Wen, Dacheng Tao, and Jieping Ye. A review on generative adversarial networks: Algorithms, theory, and applications. In arXiv:2001.06937, 2020.
  • [9] Yin Guojun, Liu Bin, Sheng Lu, Yu Nenghai, Wang Xiaogang, and Shao Jing. Semantics disentangling for text-to-image generation. In CVPR, 2019.
  • [10] Yang Han, Zhang Ruimao, Guo Xiaobao, Liu Wei, Zuo Wangmeng, and Luo Ping. Towards photo-realistic virtual try-on by adaptively generating-preserving image content. In CVPR, 2020.
  • [11] Zhang Han, Xu Tao, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris Metaxas. Stackgan++: Realistic image synthesis with stacked generative adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
  • [12] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Conference and Workshop on Neural Information Processing Systems(NeurIPs), 2017.
  • [13] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS, 2017.
  • [14] Tobias Hinz, Stefan Heinrich, and Stefan Ge Wermter. Semantic object accuracy for generative text-to-image synthesis. In IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
  • [15] Seunghoon Hong, Dingdong Yang, Jongwook Choi, and Honglak Lee. Inferring semantic layout for hierarchical text-to-image synthesis. In CVPR, 2018.
  • [16] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, page 448–456. JMLR.org, 2015.
  • [17] Phillip Isola, Jun Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
  • [18] Liang Jiadong, Pei Wenjie, and Lu Feng. Cpgan: Full-spectrum content-parsing generative adversarial networks for text-to-image synthesis. In ECCV, 2020.
  • [19] Gu Jiuxiang, Cai Jianfei, Joty Shafiq, Niu Li, and Wang Gang. Look, imagine and match: Improving textual-visual cross-modal retrieval with generative models. In CVPR, 2018.
  • [20] Justin Johnson, Agrim Gupta, and Fei Fei Li. Image generation from scene graphs. In CVPR, 2018.
  • [21] Cheng Jun, Wu Fuxiang, Tian Yanling, Wang Lei, and Tao Dapeng. Rifegan: Rich feature generation for text-to-image synthesis from prior knowledge. In CVPR, 2020.
  • [22] Zhu Jun-Yan, Park Taesung, Isola Phillip, and A. Efros Alexei. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.
  • [23] Mahdi M. Kalayeh, Emrah Basaran, Muhittin Gokmen, Mustafa E. Kamasak, and Mubarak Shah. Human semantic parsing for person re-identification. In CVPR, 2018.
  • [24] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In arXiv 1312.6114, 2014.
  • [25] Wenbo Li, Pengchuan Zhang, Lei Zhang, Qiuyuan Huang, Xiaodong He, Siwei Lyu, and Jianfeng Gao. Object-driven text-to-image synthesis via adversarial training. In CVPR, 2019.
  • [26] Yuheng Li, Krishna Kumar Singh, Utkarsh Ojha, and Yong Jae Lee. Mixnmatch: Multifactor disentanglement and encoding for conditional image generation. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • [27] Tsung Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
  • [28] Zhang Lisai, Chen Qingcai, Hu Baotian, and Jiang Shuoran. Text-guided neural image inpainting. In ACM MM, 2020.
  • [29] X. Mao, Q. Li, H. Xie, R. K. Lau, Z. Wang, and S. Smolley. Least squares generative adversarial networks. In ICCV, 2017.
  • [30] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In ICML, 2017.
  • [31] Zhu Minfeng, Pan Pingbo, Chen Wei, and Yang Yi. Dm-gan: Dynamic memory generative adversarial networks for text-to-image synthesis. In CVPR, 2019.
  • [32] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In ICLR, 2018.
  • [33] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-gan: Training generative neural samplers using variational divergence minimization. In NeurIPs, 2016.
  • [34] Yuxin Peng, Wenwu Zhu, Yao Zhao, Changsheng Xu, Qingming Huang, Hanqing Lu, Qinghua Zheng, Tiejun Huang, and Wen Gao. Cross-media analysis and reasoning: advances and directions. pages 18(1):44–57, 2017.
  • [35] Isola Phillip, Zhu Jun-Yan, Zhou Tinghui, and Efros Alexei A. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
  • [36] Tingting Qiao, Jing Zhang, Duanqing Xu, and Dacheng Tao. Mirrorgan: Learning text-to-image generation by redescription. In CVPR, 2019.
  • [37] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In ICML, 2016.
  • [38] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
  • [39] Hongchen Tan, Xiuping Liu, Xin Li, Yi Zhang, and Baocai Yin. Semantics-enhanced adversarial nets for text-to-image synthesis. In IEEE International Conference on Computer Vision (ICCV), pages 10500–10509, 2019.
  • [40] Hongchen Tan, Xiuping Liu, Meng Liu, Baocai Yin, and Xin Li. Kt-gan: Knowledge-transfer generative adversarial network for text-to-image synthesis. IEEE Transactions on Image Processing, 30:1275–1290, 2021.
  • [41] Hongchen Tan, Xiuping Liu, Baocai Yin, and Xin Li. Cross-modal semantic matching generative adversarial networks for text-to-image synthesis. IEEE Transactions on Multimedia, pages 1–1, 2021.
  • [42] Qiao Tingting, Zhang Jing, Xu Duanqing, and Tao Dacheng. Learn, imagine and create: Text-to-image generation from prior knowledge. In NeurIPS, 2019.
  • [43] Hinz Tobias, Heinrich Stefan, and Wermter Stefan. Generating multiple objects at spatially distinct locations. In ICLR, 2019.
  • [44] Che Tong, Li Yanran, Jacob Athul Paul, Bengio Yoshua, and Li Wenjie. Mode regularized generative adversarial networks. In ICLR, 2016.
  • [45] Dmitry Ulyanov, Andrea Vedaldi, and Victor S. Lempitsky. Instance normalization: The missing ingredient for fast stylization. volume abs/1607.08022, 2016.
  • [46] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
  • [47] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
  • [48] Hao Wang, Cheng Deng, Fan Ma, and Yi Yang. Context modulated dynamic networks for actor and action video segmentation with language queries. In AAAI, pages 12152–12159, 2020.
  • [49] Yuxin Wu and Kaiming He. Group normalization. In European Conference on Computer Vision (ECCV), volume abs/1803.08494, 2018.
  • [50] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In CVPR, 2018.
  • [51] Liu Yahui, Nadai Marco De, Cai Deng, Li Huayang, and Lepri Bruno. Describe what to change: A text-guided unsupervised image-to-image translation approach. In ACM MM, 2020.
  • [52] Li Yitong, Min Martin Renqiang, Shen Dinghan, Carlson David, and Carin Lawrence. Video generation from text. In AAAI, 2018.
  • [53] Li Yitong, Gan Zhe, Shen Yelong, Liu Jingjing, Cheng Yu, Wu Yuexin, Carin Lawrence, Carlson David, and Gao Jianfeng. Storygan: A sequential conditional gan for story visualization. In CVPR, 2019.
  • [54] Mingkuan Yuan and Yuxin Peng. Text-to-image synthesis via symmetrical distillation networks. In ACM MM, pages 1407–1415, 2018.
  • [55] Han Zhang, Tao Xu, and Li Hongsheng. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017.
  • [56] Zizhao Zhang, Yuanpu Xie, and Yang Lin. Photographic text-to-image synthesis with a hierarchically-nested adversarial network. In CVPR, 2018.
Hongchen Tan is a Lecturer at the Artificial Intelligence Research Institute, Beijing University of Technology. He received his Ph.D. degree in computational mathematics from the Dalian University of Technology in 2021. His research interests are Person Re-identification, Image Synthesis, and Referring Segmentation. Various parts of his work have been published in top conferences and journals, such as IEEE ICCV/TIP/TNNLS/TMM/TCSVT, and Neurocomputing.
Xiuping Liu is a Professor in the School of Mathematical Sciences at Dalian University of Technology. She received her Ph.D. degree in computational mathematics from Dalian University of Technology. Her research interests include shape modeling and analysis, and computer vision.
Baocai Yin is a Professor of Artificial Intelligence Research Institute at Beijing University of Technology. He is also a Researcher with the Beijing Key Laboratory of Multimedia and Intelligent Software Technology and the Beijing Advanced Innovation Center for Future Internet Technology. He received the M.S. and Ph.D. degrees in computational mathematics from the Dalian University of Technology, Dalian, China, in 1988 and 1993, respectively. His research interests include multimedia, image processing, computer vision, and pattern recognition.
Xin Li is a Professor at the Division of Electrical & Computer Engineering, Louisiana State University, USA. He received his B.E. degree in Computer Science from the University of Science and Technology of China in 2003, and his M.S. and Ph.D. degrees in Computer Science from the State University of New York at Stony Brook in 2005 and 2008. His research interests are in Geometric and Visual Data Computing, Processing, and Understanding, Computer Vision, and Virtual Reality.