
Memory-Driven Text-to-Image Generation

Bowen Li, Philip H. S. Torr, Thomas Lukasiewicz
University of Oxford
Abstract

We introduce a memory-driven semi-parametric approach to text-to-image generation, which is based on both parametric and non-parametric techniques. The non-parametric component is a memory bank of image features constructed from a training set of images. The parametric component is a generative adversarial network. Given a new text description at inference time, the memory bank is used to selectively retrieve image features that are provided as basic information of target images, which enables the generator to produce realistic synthetic results. We also incorporate the content information into the discriminator, together with semantic features, allowing the discriminator to make a more reliable prediction. Experimental results demonstrate that the proposed memory-driven semi-parametric approach produces more realistic images than purely parametric approaches, in terms of both visual fidelity and text-image semantic consistency.

1 Introduction

How to effectively produce realistic images from given natural language descriptions with semantic alignment has drawn much attention, because of its tremendous potential applications in art, design, and video games, to name a few. Recently, with the vast development of generative adversarial networks (Goodfellow et al., 2014; Gauthier, 2015; Mirza & Osindero, 2014) in realistic image generation, text-to-image generation has made much progress. This progress has been mainly driven by parametric models, i.e., deep networks that use their weights to represent all data concerning realistic appearance (Zhang et al., 2017; 2018; Xu et al., 2018; Li et al., 2019a; Qiao et al., 2019b; Zhu et al., 2019; Hinz et al., 2019; Cheng et al., 2020; Qiao et al., 2019a).

Although these approaches can produce realistic results on well-structured datasets that contain a specific class of objects at the image center with fine-grained descriptions, such as birds (Wah et al., 2011) and flowers (Nilsback & Zisserman, 2008), there is still much room for improvement. Besides, they usually fail on more complex datasets that contain multiple objects with diverse backgrounds, e.g., COCO (Lin et al., 2014). This is likely because, for COCO, the generation process involves a large variety of objects (e.g., pose, shape, and location), backgrounds, and scenery settings. Thus, these approaches tend to produce only text-semantic-matched appearances instead of capturing the difficult geometric structure. As shown in Fig. 1, current approaches are only capable of producing the required appearances that semantically match the given descriptions (e.g., white and black stripes for the zebra), but the objects are unrealistic, with distorted shapes. Furthermore, these approaches are in contrast to earlier works on image synthesis, which were based on non-parametric techniques that could make use of large datasets of images at inference time (Chen et al., 2009; Hays & Efros, 2007; Isola & Liu, 2013; Zhu et al., 2015; Lalonde et al., 2007). Although parametric approaches enable the benefits of end-to-end training of highly expressive models, they lose a strength of earlier non-parametric techniques, as they fail to make use of large datasets of images at inference time.

In this paper, we introduce a memory-driven semi-parametric approach to text-to-image generation, where the approach takes the advantage of both parametric and non-parametric techniques. The non-parametric component is a memory bank of disentangled image features constructed from a training set of real images. The parametric component is a generative adversarial network. Given a novel text description at inference time, the memory bank is used to selectively retrieve compatible image features that are provided as basic information, allowing the generator to directly draw clues of target images, and thus to produce realistic synthetic results.

Besides, to further improve the differentiation ability of the discriminator, we incorporate content information into it. This is because, to make a prediction, the discriminator usually relies on semantic features, extracted from a given image using a series of convolution operators with local receptive fields. However, as the discriminator goes deeper, fewer content details are preserved, including the exact geometric structure information (Gatys et al., 2016; Johnson et al., 2016). We think that the loss of content details is likely one of the reasons why current approaches fail to produce realistic shapes for objects on difficult datasets, such as COCO. Thus, the adoption of content information allows the model to exploit content details and thereby improves the discriminator, making its final prediction more reliable.

Finally, an extensive experimental analysis is performed, which demonstrates that our memory-driven semi-parametric method can generate more realistic images from natural language, compared with purely parametric models, in terms of both visual appearances and geometric structure.

[Figure 1: image grid comparing generation results. Example descriptions: "A zebra is standing on the grassy field." and "A white and blue bus is driving down a street." Columns: Given Text, StackGAN++ (Zhang et al., 2018), AttnGAN (Xu et al., 2018), DF-GAN (Tao et al., 2020), Ours.]

Figure 1: Examples of text-to-image generation on COCO. Current approaches only generate low-quality images with unrealistic objects. In contrast, our method can produce realistic images, in terms of both visual appearances and geometric structure.

2 Related Work

Text-to-image generation has made much progress because of the success of generative adversarial networks (GANs) (Goodfellow et al., 2014) in realistic image generation. Zhang et al. (2017) proposed a multi-stage architecture to generate realistic images progressively. Then, attention-based methods (Xu et al., 2018; Li et al., 2019a) were proposed to further improve the results. Zhu et al. (2019) introduced a dynamic memory module to refine image contents. Qiao et al. (2019a) proposed text-visual co-embeddings to replace input text with corresponding visual features. Cheng et al. (2020) introduced a rich-feature-generating approach to text-to-image synthesis. Besides, extra information has been adopted in the text-to-image generation process, such as scene graphs (Johnson et al., 2018; Ashual & Wolf, 2019) and layout (e.g., bounding boxes or segmentation masks) (Hong et al., 2018; Li et al., 2019b; Hinz et al., 2019). However, none of the above approaches adopt non-parametric techniques to make use of large datasets of images at inference time, nor do they feed content information into the discriminator to enable finer training feedback. Also, our method does not make use of any additional semantic information, e.g., scene graphs and layout.

Text-guided image manipulation is related to our work, as the task also takes natural language descriptions and real images as inputs, but it aims to modify the images using the given texts to achieve semantic consistency (Nam et al., 2018; Dong et al., 2017; Li et al., 2020). Differently from it, our work focuses mainly on generating novel images, instead of editing some attributes of the given images. Also, the real images in the text-guided image manipulation task act as a condition, where the synthetic results should reconstruct all text-irrelevant attributes from the given real images. In contrast, the real images in our work mainly provide the generator with additional cues about the target images, in order to ease the whole generation process.

Memory Bank. Qi et al. (2018) introduced a semi-parametric approach to realistic image generation from semantic layouts. Li et al. (2019c) used stored image crops to determine the appearance of objects. Tseng et al. (2020) used a differentiable retrieval process to select mutually compatible image patches. Li et al. (2021) studied conditional image extrapolation to synthesize new images guided by input structured text. Differently, instead of using a concise semantic representation (a scene graph as input), which is less user-friendly and has limited context compared to general descriptions, we use natural language descriptions as input. Also, Liang et al. (2020) designed a memory structure to parse the textual content. In contrast, our method simply uses a deep network to extract image features, instead of involving complex image preprocessing to build a memory bank.

3 Overview

Figure 2: The design of the memory bank to provide image features at inference time. Note that we use the corresponding real training image to represent image features for a better visualization.

Given a sentence $S$, we aim to generate a fake image $I'$ that is semantically aligned with the given $S$. The proposed model is trained on a set of paired text descriptions and corresponding real image features $v$, denoted by $(S, v)$. This set is also used to generate a memory bank $M$ of disentangled image features $v$ for different categories, where the image features are extracted from the training images by using a pretrained VGG-16 network (Simonyan & Zisserman, 2014) (see Fig. 2). Each element in $M$ is an image feature extracted from a training image, associated with its semantically matched text descriptions from the training dataset.
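
As a concrete illustration, the sketch below shows how such a memory bank could be built with a pretrained VGG-16; the use of torchvision, the relu5_3 cut-off index, and the dictionary layout are our own assumptions for illustration, not the authors' exact implementation.

```python
# Minimal sketch of building the memory bank M (assumptions: torchvision's
# pretrained VGG-16, cut after relu5_3, i.e. features[:30]; data layout and
# key naming are illustrative).
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

vgg = models.vgg16(pretrained=True).features[:30].eval()  # up to relu5_3

preprocess = T.Compose([
    T.Resize((256, 256)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_features(image_path):
    """Return a D x H x W feature map v_i for one training image."""
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    return vgg(img).squeeze(0)          # e.g. 512 x 16 x 16 for a 256x256 input

def build_memory_bank(samples):
    """samples: list of (image_path, [caption_1, ..., caption_5]) pairs."""
    memory_bank = []
    for image_path, captions in samples:
        memory_bank.append({
            "features": extract_features(image_path),  # v_i
            "captions": captions,                      # keys used at retrieval
        })
    return memory_bank
```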

At inference time, we are given a novel text description $S$ that was not seen during training. Then, $S$ is used to retrieve semantically aligned image features from the memory bank $M$, based on the designed matching algorithms (more details are given in Sec. 4.2). Next, the retrieved image features $v$, together with the given text description $S$, are fed into the generator to synthesize the output image (see Fig. 3). The generator utilizes the information from the image features, fuses it with the hidden features produced from the given text description $S$, and generates realistic images semantically aligned with $S$. The architecture and training of the network are described in Sec. 5.

To incorporate image features into the generation pipeline, we borrow from the text-guided image manipulation literature (Li et al., 2020), and redesign the architecture to make full use of the given image features in text-to-image generation, shown in Fig. 3.

Figure 3: The architecture of our proposed method. The red box indicates the inference pipeline, which retrieves image features from the memory bank according to a given description $S$; during training, we directly feed image features from the text-paired training image. $z$ is a random vector drawn from a Gaussian distribution; ACM denotes the text-image affine combination module.

4 Memory Bank

4.1 Representation

The memory bank $M$ is a set of image features $v_i$ extracted from training-set images, and each image feature $v_i$ is associated with the matched text descriptions provided in the dataset; e.g., in COCO, each image has five matched text descriptions. These descriptions are used in the matching algorithms, allowing a given text to find the most compatible image features at inference time.

4.2 Retrieval

Given a new text description, in order to effectively retrieve the most compatible image features from the memory bank $M$, we have designed several matching algorithms and explored the effectiveness of each of them. A detailed comparison between the different algorithms is shown in the supplementary material.

4.2.1 Sentence-Sentence Matching

Here, we use the image features' associated sentences $S'_i$ as keys to find the most compatible image features $v_i$ for a given unseen sentence $S$ at inference time. First, we feed both $S$ and $S'_i$ into a pretrained text encoder (Xu et al., 2018) to produce sentence features $s\in\mathbb{R}^{D\times 1}$ and $s'_i\in\mathbb{R}^{D\times 1}$, respectively, where $D$ is the feature dimension. Then, for the given sentence $S$, we select the most compatible image features $v_i$ in $M$ based on a cosine similarity score:

$$\alpha_{i}=\frac{(s)^{T}s^{\prime}_{i}}{\left\|s\right\|\left\|s^{\prime}_{i}\right\|}. \qquad (1)$$

Finally, we fetch the image features $v_i$ using the key $S'_i$ with the highest similarity score $\alpha_i$.
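
A minimal sketch of this retrieval step (Eq. 1) is given below, assuming the sentence features have already been produced by the pretrained text encoder; tensor shapes and names are illustrative.

```python
# Sentence-sentence retrieval (Eq. 1): pick the stored entry whose key
# sentence feature has the highest cosine similarity with the query.
import torch
import torch.nn.functional as F

def retrieve_by_sentence(s, bank_sentence_feats, memory_bank):
    """
    s:                   (D,)   feature of the query sentence S
    bank_sentence_feats: (K, D) features s'_i of the stored key sentences
    memory_bank:         list of K entries holding image features v_i
    """
    alpha = F.cosine_similarity(s.unsqueeze(0), bank_sentence_feats, dim=1)  # (K,)
    best = int(torch.argmax(alpha))
    return memory_bank[best]["features"]   # v_i with the highest alpha_i
```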

4.2.2 Sentence-Image Matching

Instead of using the associated sentences as keys, we can calculate the similarity between the sentence feature $s\in\mathbb{R}^{D\times 1}$ and the image features $v_i\in\mathbb{R}^{D\times H\times W}$ stored in $M$, where $D$ is the number of channels, $H$ is the height, and $W$ is the width. To directly calculate the similarity, we first average the image features over the spatial dimensions to get a global image feature $v_{Gi}\in\mathbb{R}^{D\times 1}$. Then, for a given unseen $S$, we select the most compatible image features $v_i$ in $M$ based on $\beta_i$:

$$\beta_{i}=\frac{(s)^{T}v_{Gi}}{\left\|s\right\|\left\|v_{Gi}\right\|}. \qquad (2)$$
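
The sentence-image variant differs only in how the key is formed, as in the following sketch; it assumes the sentence-feature dimension matches the number of image-feature channels, and all names are illustrative.

```python
# Sentence-image matching (Eq. 2): spatially average each stored feature map
# to a global vector v_Gi, then score it against the sentence feature s.
import torch
import torch.nn.functional as F

def retrieve_by_image(s, memory_bank):
    # s: (D,) sentence feature of the query S
    scores = []
    for entry in memory_bank:
        v = entry["features"]                   # (D, H, W)
        v_g = v.mean(dim=(1, 2))                # global image feature v_Gi, (D,)
        scores.append(F.cosine_similarity(s, v_g, dim=0))
    beta = torch.stack(scores)                  # (K,)
    return memory_bank[int(torch.argmax(beta))]["features"]
```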

4.2.3 Words-Words Matching

Moreover, we can use a more fine-grained text representation, namely word embeddings, as keys to find the most compatible image features $v_i$ stored in $M$ for a given unseen sentence $S$. At inference time, we first feed both $S$ and $S'_i$ into a pretrained text encoder (Xu et al., 2018) to generate word embeddings $w\in\mathbb{R}^{N\times D}$ and $w'_i\in\mathbb{R}^{N\times D}$, respectively, where $N$ is the number of words and $D$ is the feature dimension. Then, we reshape both $w$ and $w'_i$ to $\mathbb{R}^{(D\ast N)\times 1}$. So, to find the most compatible image features, the cosine similarity score can be defined as follows:

$$\delta_{i}=\frac{(w)^{T}w^{\prime}_{i}}{\left\|w\right\|\left\|w^{\prime}_{i}\right\|}. \qquad (3)$$

However, different words in a sentence are not equally important. Thus, if we simply combine all words of a sentence to calculate the similarity (as above), the similarity score may be less precise. To solve this issue, during training, we reweight each word in a sentence by its importance. We first use convolutional layers to remap the word embeddings, and then calculate the importance $\lambda$ (and $\lambda'_i$) for each word in the word embeddings $w\in\mathbb{R}^{N\times D}$ (and $w'_i\in\mathbb{R}^{N\times D}$), denoted by $\lambda=\text{Softmax}(ww^{T})$ and $\lambda'_i=\text{Softmax}(w'_i w^{\prime T}_{i})$, respectively.

Each element in $\lambda$ represents the correlation between different words in a sentence. Then, $\lambda w$ (and $\lambda'_i w'_i$) reweights the word embeddings of each word based on its correlation with the other words. Using these reweighted word embeddings, we can achieve a more precise similarity calculation between two sets of word embeddings. At inference time, after reshaping both $\lambda w$ and $\lambda'_i w'_i$ to $\mathbb{R}^{(D\ast N)\times 1}$, the new similarity is defined as follows:

$$\delta_{i}=\frac{(\lambda w)^{T}\lambda^{\prime}_{i}w^{\prime}_{i}}{\left\|\lambda w\right\|\left\|\lambda^{\prime}_{i}w^{\prime}_{i}\right\|}. \qquad (4)$$
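
A minimal sketch of the reweighted words-words score (Eqs. 3 and 4) is shown below; the convolutional remapping of the word embeddings is omitted, and the helper names are illustrative.

```python
# Importance-reweighted words-words matching: lambda = Softmax(w w^T)
# correlates every word with every other word, and the reweighted
# embeddings (lambda w) are flattened before the cosine score.
import torch
import torch.nn.functional as F

def reweight(w):
    # w: (N, D) word embeddings -> importance-reweighted embeddings (N, D)
    lam = torch.softmax(w @ w.t(), dim=-1)      # (N, N) word-word correlations
    return lam @ w

def words_words_score(w, w_i):
    # w, w_i: (N, D) word embeddings of the query and of a stored key sentence
    a = reweight(w).reshape(-1)                 # (N*D,)
    b = reweight(w_i).reshape(-1)
    return F.cosine_similarity(a, b, dim=0)     # delta_i in Eq. 4
```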

4.2.4 Words-Image Matching

Furthermore, we can use the word embeddings $w\in\mathbb{R}^{N\times D}$ and image features $v_i\in\mathbb{R}^{D\times H\times W}$ to directly calculate the similarity score between them. To achieve this, we first reshape the image features to $v_i\in\mathbb{R}^{D\times(H\ast W)}$. Then, a correlation matrix $c_i\in\mathbb{R}^{N\times(H\ast W)}$ can be obtained via $c_i=\text{Softmax}(wv_i)$, where each element in $c_i$ represents the correlation between each word and each image spatial location. Then, a reweighted word embedding $\tilde{w_i}\in\mathbb{R}^{N\times D}$ containing image information is obtained by $\tilde{w_i}=c_i v_i^{T}$. So, to find the most compatible image features, we first reshape both $w$ and $\tilde{w_i}$ to $\mathbb{R}^{(D\ast N)\times 1}$, and the similarity score is defined as follows:

$$\gamma_{i}=\frac{(w)^{T}\tilde{w_{i}}}{\left\|w\right\|\left\|\tilde{w_{i}}\right\|}. \qquad (5)$$

Similarly, we can also reweight the word embeddings $w$ and the image features $v_i$ based on their importance (see Sec. 4.2.3) to achieve a more precise calculation.
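
The words-image score of Eq. 5 can be sketched as follows; names and shapes are illustrative, and the additional importance reweighting mentioned above is omitted.

```python
# Words-image matching (Eq. 5): the correlation matrix c_i = Softmax(w v_i)
# aligns every word with every spatial location, and the image-aware word
# embeddings w~_i = c_i v_i^T are compared with w.
import torch
import torch.nn.functional as F

def words_image_score(w, v):
    # w: (N, D) word embeddings; v: (D, H, W) stored image features
    v_flat = v.reshape(v.size(0), -1)           # (D, H*W)
    c = torch.softmax(w @ v_flat, dim=-1)       # (N, H*W) word-location correlation
    w_tilde = c @ v_flat.t()                    # (N, D) reweighted word embeddings
    return F.cosine_similarity(w.reshape(-1), w_tilde.reshape(-1), dim=0)  # gamma_i
```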

5 Generative Adversarial Networks

To generate high-quality synthetic images from natural language descriptions, we propose to incorporate the image features $v$, along with the given sentence $S$, into the generator. As noted in Sec. 3, to incorporate image features into the generation pipeline, we borrow from the text-guided image manipulation literature (Li et al., 2020) and redesign the architecture to make full use of the given image features in text-to-image generation, as shown in Fig. 3.

5.1 Generator with Image Features

To avoid an identity mapping and also to make full use of the image features $v$ in the generator, we first average $v$ over each channel to filter out potential content details (e.g., the overall spatial structure) contained in $v$, obtaining a global image feature $v_G$, where $v_G$ only keeps basic information of the corresponding real image $I$, serving as a basic image prior. By doing this, the model can effectively avoid copying and pasting from $I$ and greatly ensures the diversity of the output results, especially at the first stage. This is because the following stages focus more on refining the basic images produced by the first stage by adding more details and improving their resolution, as shown in Fig. 3.

However, if we only feed the global image feature $v_G$ at the beginning of the network, the model may fail to fully utilize the cues contained in the image features $v$. Thus, we further incorporate the image features $v$ at each stage of the network. The reason to feed the image features $v$ rather than the global feature $v_G$ at the following stages is that $v$ contains more information about the desired output image, such as image contents and the geometric structure of objects, and these details can serve as candidate information for the main generation pipeline to select from. To enable this regional selection effect, we adopt the text-image affine combination module (ACM) (Li et al., 2020), which is able to selectively fuse text-required image information within $v$ into the hidden features $h$, where $h$ is generated from the given text description $S$. However, simply fusing the image features $v$ into the generation pipeline may introduce constraints on producing diverse and novel synthetic results, because different image information (e.g., objects and visual attributes) in $v$ may be entangled, which means, for example, that if the model only wants to generate one object, the corresponding entangled parts (e.g., other objects and attributes) may be produced as well. This may cause an additional generation of text-irrelevant objects and attributes. Thus, to avoid these drawbacks, inspired by Karras et al. (2019), we use several fully connected layers to disentangle the image features $v$, obtaining disentangled image features $v_D$, which allows the model to disconnect the relations between different objects and attributes. By doing this, the model is able to avoid the constraints introduced by the image features $v$ and then selectively choose text-required image information within $v_D$, where this information is effectively disentangled without strong connections.
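
To make this concrete, the sketch below shows one plausible form of the disentangling layers and of the ACM fusion (following the affine form described in Li et al. (2020)); the layer count, channel sizes, and the per-location application of the fully connected layers are our assumptions, not the authors' exact configuration.

```python
# Hedged sketch: an MLP applied channel-wise to disentangle v into v_D, and an
# affine combination module that fuses v_D with the text-derived hidden
# features h as h * W(v_D) + b(v_D).
import torch
import torch.nn as nn

class Disentangler(nn.Module):
    def __init__(self, channels=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels), nn.ReLU(inplace=True),
            nn.Linear(channels, channels), nn.ReLU(inplace=True),
            nn.Linear(channels, channels),
        )

    def forward(self, v):                 # v: (B, C, H, W)
        b, c, h, w = v.shape
        v = v.permute(0, 2, 3, 1).reshape(-1, c)          # apply the MLP per location
        v = self.mlp(v)
        return v.reshape(b, h, w, c).permute(0, 3, 1, 2)  # v_D: (B, C, H, W)

class ACM(nn.Module):
    """Affine combination of hidden features h and image features v_D."""
    def __init__(self, channels=512):
        super().__init__()
        self.gamma = nn.Conv2d(channels, channels, 3, padding=1)
        self.beta = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, h, v_d):            # both (B, C, H, W)
        return h * self.gamma(v_d) + self.beta(v_d)
```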

Why does the generator with image features work better? Ideally, the generator produces a sample, e.g., an image, from a latent code, and the distribution of these samples should be indistinguishable from the training distribution, which is actually drawn from the real samples in the training dataset. Based on this, incorporating image features from real images in the training dataset into the generator allows the generator to directly draw cues from the desired distribution that it eventually needs to generate. Besides, the global feature $v_G$ and the disentangled image features $v_D$ can provide basic information about the target results in advance, and also work as candidate information, allowing the model to selectively choose text-required information without having to generate it by itself, thus easing the whole generation process. To some extent, the global feature $v_G$ can be seen as the meta-data of target images, which may contain information about what kind of object to generate, e.g., a zebra or a bus, while $v_D$ is able to provide basic information about objects, e.g., the spatial structure such as four legs and one head for the zebra and the rectangular shape for the bus.

5.2 Discriminator with Content Information

Figure 4: The architecture of the proposed discriminator with the incorporation of content information.

To further improve the discriminator so that it makes a more reliable prediction, with respect to both visual appearances and geometric structure, we propose to incorporate content information into it. This is mainly because, in a deep convolutional neural network, as the network goes deeper, fewer content details are preserved, including the exact shape of objects (Gatys et al., 2016; Johnson et al., 2016). We think that the loss of content details may prevent the discriminator from providing fine-grained shape-quality feedback to the generator, which may make it difficult for the generator to produce realistic geometric structure. Also, Zhou et al. (2014) showed that the empirical receptive field of a deep convolutional neural network is much smaller than the theoretical one, especially on deep layers. This means that, using convolution operators with a local receptive field only, the network may fail to capture the spatial structure of objects when the size of the objects exceeds the receptive field.

To incorporate the content details, we propose to generate a series of image content features, $\{a_{128}, a_{64}, a_{32}, \ldots, a_{4}\}$, by aggregating different image regions via pooling operators applied to the given real or fake features. The size of these content features ranges from $a_{128}\in\mathbb{R}^{C\times 128\times 128}$ to $a_{4}\in\mathbb{R}^{C\times 4\times 4}$, where $C$ represents the number of channels, and the width and height of each content feature are $1/2$ those of the previous one. Thus, the given image is pooled into representations for different regions, from fine ($a_{128}$) to coarse scale ($a_{4}$), which preserves content information of different subregions, such as the spatial structure of objects. Then, these features are concatenated with the corresponding hidden features along the channel dimension, incorporating the content information into the discriminator.

The number of different-scale content features can be modified, depending on the size of the given images. These features aggregate different image subregions by repeatedly applying a fixed-size pooling kernel with a small stride. Thus, the content features maintain a reasonably small gap in image information between scales. For the type of pooling operation, max vs. average, we perform comparison studies to show the difference in Sec. 6.2.
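
The following sketch illustrates how such a content-feature pyramid might be computed; the kernel size and the exact concatenation points are assumptions.

```python
# Hedged sketch of the content-information branch: a fixed-size pooling kernel
# with stride 2 is applied repeatedly, producing content features
# a_128, a_64, ..., a_4 that are concatenated channel-wise with the hidden
# features at matching resolutions.
import torch
import torch.nn as nn

def content_pyramid(x, num_levels=6, kernel_size=2, stride=2):
    """x: (B, C, 256, 256) real or fake features -> [a_128, a_64, ..., a_4]."""
    pool = nn.AvgPool2d(kernel_size=kernel_size, stride=stride)
    feats = []
    for _ in range(num_levels):
        x = pool(x)                      # halves the spatial size each step
        feats.append(x)                  # (B, C, 128, 128), (B, C, 64, 64), ...
    return feats

# Inside the discriminator, each a_k would be concatenated with the hidden
# feature map h_k of the same resolution before the next convolution:
#   h_k = torch.cat([h_k, a_k], dim=1)
```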

Why does the discriminator with content information work better? Basically, the discriminator in a generative adversarial network is simply a classifier (Goodfellow et al., 2014). It tries to distinguish real data from the data created by the generator (note that, in our method, we implement the minimax loss in the loss function, instead of the Wasserstein loss (Arjovsky et al., 2017)). Also, the use of content information has shown great effectiveness in classification (Lazebnik et al., 2006; He et al., 2015) and semantic segmentation (Liu et al., 2015; Zhao et al., 2017). Based on this, incorporating content information into the discriminator is helpful, allowing the discriminator to make a more reliable prediction on complex datasets, especially datasets with complex image scenery settings, such as COCO.

5.3 Training

To train the network, we follow Li et al. (2020) and adopt adversarial training. There are three stages in the model, and each stage has a generator network and a discriminator network. The generator and discriminator are trained alternately by minimizing the generator loss $\mathcal{L}_G$ and the discriminator loss $\mathcal{L}_D$. Please see the supplementary material for more details about the training objectives. We only highlight some training differences compared with Li et al. (2020).

Generator objective. The objective functions to train the generator are similar to those in Li et al. (2020), but, differently, the inputs to the generator are a pair $(S, v)$ and a noise $z$, denoted by $G_i(z, S, v)$, where $i$ indicates the stage number.

Discriminator objective. To improve the convergence of our GAN-based generation model, the $R_1$ regularization (Mescheder et al., 2018) is adopted in the discriminator:

$$R_{1}(\psi):=\frac{\gamma}{2}E_{p_{D}(x)}\!\left[\left\|\nabla D_{\psi}(x)\right\|^{2}\right], \qquad (6)$$

where $\psi$ represents the parameter values of the discriminator.
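
A minimal sketch of this regularizer, as commonly implemented for Eq. 6, is shown below; the discriminator interface and the value of $\gamma$ are illustrative.

```python
# R1 gradient penalty (Mescheder et al., 2018): penalize the squared gradient
# norm of the discriminator on real data only, added to the discriminator loss.
import torch

def r1_penalty(discriminator, real_images, gamma=10.0):
    real_images = real_images.detach().requires_grad_(True)
    scores = discriminator(real_images)                    # D_psi(x), per-sample scores
    grads, = torch.autograd.grad(
        outputs=scores.sum(), inputs=real_images, create_graph=True)
    penalty = grads.pow(2).reshape(grads.size(0), -1).sum(1).mean()
    return 0.5 * gamma * penalty                           # (gamma / 2) * E[||grad||^2]
```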

6 Experiments

Table 1: Quantitative comparison on CUB bird: Fréchet inception distance (FID) and R-precision (R-psr) of StackGAN++ (Zhang et al., 2018), AttnGAN (Xu et al., 2018), ControlGAN (Li et al., 2019a), DM-GAN (Zhu et al., 2019), DF-GAN (Tao et al., 2020), and our method. For FID, lower is better, while for R-precision, alignment, and realism, higher is better.
Metric StackGAN++ AttnGAN ControlGAN DM-GAN DF-GAN Ours
FID 15.30 23.98 13.92 16.09 14.81 10.49
R-psr 46.67 67.82 69.33 72.31 - 73.87
Alignment (%) - - - - 43 57
Realism (%) - - - - 31 69
Table 2: Quantitative comparison on COCO. Note that we also compare our method with OP-GAN (Hinz et al., 2019), which adopts bounding boxes in its method.
Metric StackGAN++ AttnGAN ControlGAN DM-GAN DF-GAN Ours OP-GAN
FID 81.59 32.32 33.58 32.64 21.42 19.47 24.70
R-psr 71.88 85.47 82.43 88.56 - 90.32 89.01
Alignment (%) - - - - 29 71 -
Realism (%) - - - - 22 78 -
Figure 5: Qualitative results on CUB and COCO. Top row: the given unseen sentences; middle row: the image features retrieved from the memory bank $M$ (we use the corresponding images to represent the image features for a better visualization); bottom row: the synthetic results.
Figure 6: Qualitative comparison between AttnGAN (Xu et al., 2018), DF-GAN (Tao et al., 2020), and our method on COCO.

To verify the effectiveness of our proposed method in realistic image generation from text descriptions, we conduct extensive experiments on the CUB bird (Wah et al., 2011) dataset and more complex COCO (Lin et al., 2014) dataset, where COCO contains multiple objects with diverse backgrounds.

Evaluation metrics. We adopt the Fréchet inception distance (FID) (Heusel et al., 2017) as the primary metric to quantitatively evaluate image quality and diversity. In our experiments, we use 30K synthetic images vs. 30K real test images to calculate the FID value. However, as FID cannot reflect the relevance between an image and a text description, we also use R-precision (Xu et al., 2018) to measure the correlation between a generated image and its corresponding text.
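
For reference, the R-precision protocol of Xu et al. (2018) can be sketched as follows: each generated image is scored against its ground-truth caption and 99 mismatched ones, and we count how often the ground truth ranks first; the encoder features are assumed to be precomputed, and the function name is illustrative.

```python
# Hedged sketch of R-precision: cosine similarity between generated-image
# features and 100 candidate caption features (1 ground truth + 99 mismatched).
import torch
import torch.nn.functional as F

def r_precision(img_feats, txt_feats, num_mismatched=99):
    """img_feats: (B, D) generated-image features; txt_feats: (B, D) caption features."""
    hits = 0
    for i in range(img_feats.size(0)):
        # candidate captions: the matching one first, then 99 random mismatches
        others = torch.randperm(txt_feats.size(0))
        others = others[others != i][:num_mismatched]
        cands = torch.cat([txt_feats[i:i + 1], txt_feats[others]], dim=0)
        sims = F.cosine_similarity(img_feats[i:i + 1], cands, dim=1)
        hits += int(torch.argmax(sims).item() == 0)
    return hits / img_feats.size(0)
```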

Human evaluation. To better verify the performance of our proposed method, we conducted a user study comparing the current state-of-the-art method DF-GAN (Tao et al., 2020) and ours on CUB and COCO. We randomly selected 100 text descriptions from the test dataset. Then, we asked 5 workers to compare the results, after looking at the output images and the given text descriptions, based on two criteria: (1) alignment: whether the synthetic image is semantically aligned with the given description, and (2) realism: whether the synthetic image looks realistic. The results are shown in Tables 1 and 2. Please see the supplementary material for more details about the human evaluation.

Implementation. There are three stages in the model, and each stage has a generator network and a discriminator network. The number of stages can be modified, depending on the resolution of the output image. We utilize the layer relu5_3 of a pretrained VGG-16 to extract the image features $v$, which filters content details in $I$ and keeps more semantic information. In the discriminator, the number of different-scale image content features can be modified, depending on the size of the given image. A same-size pooling kernel with a small stride ($\text{stride}=2$) is repeatedly applied to the image features, to maximize the preservation of the content information. For the type of pooling operation, average pooling is adopted. For the matching algorithm, words-image matching with reweighting based on importance is adopted. The resolution of the synthetic results is $256\times 256$. Our method and its variants are trained on a single Quadro RTX 6000 GPU, using the Adam optimizer (Kingma & Ba, 2014) with a learning rate of $0.0002$. The hyperparameter $\lambda$ is set to $5$. We preprocess the datasets according to the method used in Xu et al. (2018). No attention module is implemented in the whole architecture.

6.1 Comparison with Other Approaches

Quantitative comparison. Quantitative results are shown in Tables 1 and 2. As we can see, compared to other approaches, our method achieves better FID and R-precision scores on both datasets, and even performs better than OP-GAN, which adopts bounding boxes. This indicates that (1) our method can produce more realistic images from given text descriptions, in terms of image quality and diversity, and (2) the synthetic results produced by our method are more semantically aligned with the given text descriptions. Besides, in the human evaluation, our method achieves better alignment and realism scores compared with DF-GAN, which indicates that our results are preferred by the workers and further verifies the better performance of our method with respect to semantic alignment and image realism.

Figure 7: Diversity. The top row shows the fixed sentence and image features, where we use the corresponding images to represent the image features for a better visualization. The bottom row presents diverse synthetic images produced by only changing the input noise $z$.

Qualitative comparison. In Fig. 6, we present synthetic examples produced by our method at $256\times 256$, along with the corresponding retrieved images that provide the image features. As we can see, our method is able to produce high-quality results on CUB and COCO, with respect to realistic appearances and geometric structure, while also semantically matching the given text descriptions. Besides, the synthetic results differ from the retrieved image features, which indicates that there is no significant copy-and-paste problem in our method.

Diversity evaluation. To further evaluate the diversity of our method, we fix the given text description and the corresponding retrieved image features, and only change the given noise $z$ to generate output images, as shown in Fig. 7. When we fix the sentence and image features and only change the noise, our method generates clearly different images, which still semantically match the given sentence and also make use of information from the image features. More evaluations are shown in the supplementary material.

6.2 Component Analysis

Table 3: Ablation studies: "Ours w/o Feature" denotes not feeding image features into the generator, "Ours w/o Disen." denotes not using the fully connected layers to disentangle the image features $v$, "Ours w/o Disen.*" denotes the same model fed with mismatched sentence-image pairs, "Ours w/o Content" denotes not incorporating the content information into the discriminator, "Ours w/o Reg." denotes not using the regularization in the discriminator, "Ours w/ Max" denotes using max pooling to extract content information, and "Ours w/ Aver" denotes using average pooling.
Method FID R-psr
Ours w/o Feature 22.20 84.63
Ours w/o Disen. 18.82 92.17
Ours w/o Disen.* 18.80 67.05
Ours w/o Content 20.96 88.95
Ours w/o Reg. 27.12 82.97
Ours w/ Max 26.12 83.11
Ours w/ Aver (baseline) 19.47 90.32

Effectiveness of the image features. To better understand the effectiveness of the image features in the generator, we conduct an ablation study, shown in Table 3. Without image features, the model "Ours w/o Feature" achieves worse quantitative results on both FID and R-precision compared with the baseline, which verifies the effectiveness of the image features for high-quality image generation. Interestingly, without image features, our method becomes a pure text-to-image generation method, similar to the other baselines, but the FID of "Ours w/o Feature" is still competitive with them. This indicates that, even without image features fed into it, our method can still generate better synthetic results, with respect to image quality and diversity. We think this is mainly because, with the help of content information, our improved discriminator is able to make a more reliable prediction on complex datasets, which in turn encourages the generator to produce better synthetic images.

Effectiveness of the disentanglement. Here, we show the effectiveness of the fully connected layers applied to the image features $v$. Interestingly, from Table 3, the model "Ours w/o Disen." achieves better FID and R-precision compared with the baseline. This is likely because the model may suffer from an identity mapping problem. To verify this, we conduct another experiment in which we feed mismatched sentence and image pairs into the network without using the matching algorithms, denoted "Ours w/o Disen.*". As we can see, on mismatched pairs, although the FID is still low, the R-precision degrades significantly.

Effectiveness of the content information. To verify the effectiveness of the content information adopted in the discriminator, we conduct an ablation study, shown in Table 3. As we can see, FID and R-precision degrade when the discriminator does not adopt the content information. This indicates that the content information effectively strengthens the differentiation abilities of the discriminator. The improved discriminator is then able to provide the generator with fine-grained training feedback with respect to geometric structure, thus facilitating the training of a better generator that produces higher-quality synthetic results.

Comparison between different pooling types. Here, we conduct a comparison study on different pooling types (i.e., max and average) in Table 3. As we can see, the model with average pooling works better than the one with max pooling. We think this is likely because max pooling fails to capture the contextual information between neighboring pixels, as it only picks the maximum value within a region of pixels, while average pooling averages over them.

Effectiveness of the regularization. We evaluate the effectiveness of the regularization adopted in the discriminator. From Table 3, the model without the regularization has worse quantitative results, compared with the full model. We think this is because the regularization effectively improves GAN convergence by preventing the generator from training on junk feedback once the discriminator cannot easily tell the difference between real and fake.

7 Conclusion

We have introduced a memory-driven semi-parametric approach to text-to-image generation, which utilizes large datasets of images at inference time. Also, an alternative architecture is proposed for both the generator and the discriminator. Extensive experimental results on two datasets demonstrate the effectiveness of feeding retrieved image features into the generator and incorporating content information into the discriminator.

References

  • Arjovsky et al. (2017) Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
  • Ashual & Wolf (2019) Oron Ashual and Lior Wolf. Specifying object attributes and relations in interactive scene generation. In Proceedings of the IEEE International Conference on Computer Vision, pp.  4561–4569, 2019.
  • Chen et al. (2009) Tao Chen, Ming-Ming Cheng, Ping Tan, Ariel Shamir, and Shi-Min Hu. Sketch2Photo: Internet image montage. ACM Transactions on Graphics (TOG), 28(5):1–10, 2009.
  • Cheng et al. (2020) Jun Cheng, Fuxiang Wu, Yanling Tian, Lei Wang, and Dapeng Tao. RiFeGAN: Rich feature generation for text-to-image synthesis from prior knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  10911–10920, 2020.
  • Dong et al. (2017) Hao Dong, Simiao Yu, Chao Wu, and Yike Guo. Semantic image synthesis via adversarial learning. In Proceedings of the IEEE International Conference on Computer Vision, pp.  5706–5714, 2017.
  • Gatys et al. (2016) Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  2414–2423, 2016.
  • Gauthier (2015) Jon Gauthier. Conditional generative adversarial networks for convolutional face generation. Technical report, pp.  3, 2015.
  • Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
  • Hays & Efros (2007) James Hays and Alexei A. Efros. Scene completion using millions of photographs. ACM Transactions on Graphics (ToG), 26(3):4–es, 2007.
  • He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9):1904–1916, 2015.
  • Heusel et al. (2017) Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637, 2017.
  • Hinz et al. (2019) Tobias Hinz, Stefan Heinrich, and Stefan Wermter. Semantic object accuracy for generative text-to-image synthesis. arXiv preprint arXiv:1910.13321, 2019.
  • Hong et al. (2018) Seunghoon Hong, Dingdong Yang, Jongwook Choi, and Honglak Lee. Inferring semantic layout for hierarchical text-to-image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  7986–7994, 2018.
  • Isola & Liu (2013) Phillip Isola and Ce Liu. Scene collaging: Analysis and synthesis of natural images with semantic layers. In Proceedings of the IEEE International Conference on Computer Vision, pp.  3048–3055, 2013.
  • Johnson et al. (2016) Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In Proceedings of the European Conference on Computer Vision, pp.  694–711. Springer, 2016.
  • Johnson et al. (2018) Justin Johnson, Agrim Gupta, and Li Fei-Fei. Image generation from scene graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  1219–1228, 2018.
  • Karras et al. (2019) Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  4401–4410, 2019.
  • Kingma & Ba (2014) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Lalonde et al. (2007) Jean-François Lalonde, Derek Hoiem, Alexei A. Efros, Carsten Rother, John Winn, and Antonio Criminisi. Photo clip art. ACM Transactions on Graphics (TOG), 26(3):3–es, 2007.
  • Lazebnik et al. (2006) Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, pp.  2169–2178. IEEE, 2006.
  • Li et al. (2019a) Bowen Li, Xiaojuan Qi, Thomas Lukasiewicz, and Philip Torr. Controllable text-to-image generation. In Advances in Neural Information Processing Systems, pp. 2063–2073, 2019a.
  • Li et al. (2020) Bowen Li, Xiaojuan Qi, Thomas Lukasiewicz, and Philip Torr. ManiGAN: Text-guided image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  7880–7889, 2020.
  • Li et al. (2019b) Wenbo Li, Pengchuan Zhang, Lei Zhang, Qiuyuan Huang, Xiaodong He, Siwei Lyu, and Jianfeng Gao. Object-driven text-to-image synthesis via adversarial training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  12174–12182, 2019b.
  • Li et al. (2021) Yijun Li, Lu Jiang, and Ming-Hsuan Yang. Controllable and progressive image extrapolation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp.  2140–2149, 2021.
  • Li et al. (2019c) Yikang Li, Tao Ma, Yeqi Bai, Nan Duan, Sining Wei, and Xiaogang Wang. PasteGAN: A semi-parametric method to generate image from scene graph. Advances in Neural Information Processing Systems, 32:3948–3958, 2019c.
  • Liang et al. (2020) Jiadong Liang, Wenjie Pei, and Feng Lu. CPGAN: Content-parsing generative adversarial networks for text-to-image synthesis. In European Conference on Computer Vision, pp.  491–508. Springer, 2020.
  • Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision, pp.  740–755. Springer, 2014.
  • Liu et al. (2015) Wei Liu, Andrew Rabinovich, and Alexander C. Berg. ParseNet: Looking wider to see better. arXiv preprint arXiv:1506.04579, 2015.
  • Mescheder et al. (2018) Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for GANs do actually converge? arXiv preprint arXiv:1801.04406, 2018.
  • Mirza & Osindero (2014) Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
  • Nam et al. (2018) Seonghyeon Nam, Yunji Kim, and Seon Joo Kim. Text-adaptive generative adversarial networks: Manipulating images with natural language. In Advances in Neural Information Processing Systems, pp. 42–51, 2018.
  • Nilsback & Zisserman (2008) Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pp.  722–729. IEEE, 2008.
  • Qi et al. (2018) Xiaojuan Qi, Qifeng Chen, Jiaya Jia, and Vladlen Koltun. Semi-parametric image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  8808–8816, 2018.
  • Qiao et al. (2019a) Tingting Qiao, Jing Zhang, Duanqing Xu, and Dacheng Tao. Learn, imagine and create: Text-to-image generation from prior knowledge. Advances in Neural Information Processing Systems, 32:887–897, 2019a.
  • Qiao et al. (2019b) Tingting Qiao, Jing Zhang, Duanqing Xu, and Dacheng Tao. MirrorGAN: Learning text-to-image generation by redescription. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  1505–1514, 2019b.
  • Simonyan & Zisserman (2014) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • Tao et al. (2020) Ming Tao, Hao Tang, Songsong Wu, Nicu Sebe, Xiao-Yuan Jing, Fei Wu, and Bingkun Bao. DF-GAN: Deep fusion generative adversarial networks for text-to-image synthesis. arXiv preprint arXiv:2008.05865, 2020.
  • Tseng et al. (2020) Hung-Yu Tseng, Hsin-Ying Lee, Lu Jiang, Ming-Hsuan Yang, and Weilong Yang. RetrieveGAN: Image synthesis via differentiable patch retrieval. In European Conference on Computer Vision, pp.  242–257. Springer, 2020.
  • Wah et al. (2011) Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 dataset. 2011.
  • Xu et al. (2018) Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  1316–1324, 2018.
  • Zhang et al. (2017) Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N. Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp.  5907–5915, 2017.
  • Zhang et al. (2018) Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N. Metaxas. StackGAN++: Realistic image synthesis with stacked generative adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8):1947–1962, 2018.
  • Zhao et al. (2017) Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  2881–2890, 2017.
  • Zhou et al. (2014) Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Object detectors emerge in deep scene CNNs. arXiv preprint arXiv:1412.6856, 2014.
  • Zhu et al. (2015) Jun-Yan Zhu, Philipp Krahenbuhl, Eli Shechtman, and Alexei A. Efros. Learning a discriminative model for the perception of realism in composite images. In Proceedings of the IEEE International Conference on Computer Vision, pp.  3943–3951, 2015.
  • Zhu et al. (2019) Minfeng Zhu, Pingbo Pan, Wei Chen, and Yi Yang. DM-GAN: Dynamic memory generative adversarial networks for text-to-image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  5802–5810, 2019.