
Unrealistic Feature Suppression for Generative Adversarial Networks

Sanghun Kim
Department of Computer Science and Engineering, Kyunghee University
powerkei@naver.com

Seungkyu Lee
Department of Computer Science and Engineering, Kyunghee University
seungkyu@khu.ac.kr
Abstract

Due to the unstable nature of the minimax game between the generator and the discriminator, improving the performance of GANs is a challenging task. Recent studies have shown that selecting high-quality samples during training improves the performance of GANs. However, sampling approaches that discard samples show limitations in aspects such as training speed and network optimality. In this paper we propose an unrealistic feature suppression (UFS) module that keeps high-quality features and suppresses unrealistic features. The UFS module preserves the training stability of the networks and improves the quality of generated images. We demonstrate the effectiveness of the UFS module on various models such as WGAN-GP, SNGAN, and BigGAN. Using the UFS module, we achieve better Fréchet inception distance and inception score than various baseline models. We also visualize how effectively our UFS module suppresses unrealistic features through class activation maps.

1 Introduction

Generative Adversarial Networks (GANs) [5] have attracted explosive attention and achieved success in various research areas since they were introduced. GANs consist of two adversarial networks, a generator and a discriminator, that are trained alternately. The discriminator is trained to distinguish real samples from generated fake samples. The generator, in turn, is trained on feedback from the discriminator to produce realistic fake samples. Thanks to the practical performance of this adversarial training strategy, GANs have evolved into various image generation methods such as image-to-image translation [9, 18, 25], super resolution [12], and text-to-image generation [31]. Improving the performance of GANs is a challenging task due to the unstable nature of the minimax game. GANs sometimes fall into a Nash equilibrium early because of the difficulty of balancing generator and discriminator training.

One effort toward improving GANs is to employ attention blocks that have previously shown improved performance in various classification tasks. Squeeze-and-excitation [8] proposes a channel attention block, and CBAM [27] adds a spatial attention block to focus not only on channels but also on spatial features, showing that attention approaches are capable of improving the performance of convolutional neural networks. In generative models, AttnGAN [29] embeds encoded text into the networks through an attention module so that the networks focus more on the related image details. In image-to-image translation, SelectionGAN [24] proposes a method of combining multiple candidate images through multi-channel attention selection. DAGAN [15] performs instance-level translation by specifying the area to focus on in an image. Inspired by non-local neural networks [26], SAGAN [30] proposes a self-attention module that extracts feature similarity over the entire area of an image.

Another group of methods studies how to provide useful gradients during GAN training. LOGAN [28] proposes a gradient-based latent optimization scheme for GANs. Latent optimization uses a latent vector that is optimized by the gradient of the networks during training: it uses $G(z')$ obtained from the optimized latent vector $z'$ as fake samples rather than $G(z)$ obtained from a random latent $z$, decreasing the influence of the random distribution on the networks. To this end, LOGAN first forwards $z$ drawn from a random distribution to compute $D(G(z))$. After obtaining $\nabla z$ in the backward pass, LOGAN computes the optimized latent vector $z' = z + \alpha\nabla z$. Finally $z'$ is re-forwarded to compute $D(G(z'))$, which trains the networks. LOGAN shows that latent optimization not only generates high-quality samples but also gives a better direction toward the optimal generator. Since LOGAN performs two forward-backward passes per training step, its training is slow compared to other methods. Top-k GAN [22] argues that the more realistic samples produced by latent optimization are what improve network performance, claiming that the success of LOGAN is due to the high-quality samples produced by the optimized latent $z'$. Top-k GAN dismisses unrealistic samples, as judged by the critic (discriminator), from each batch and adopts only realistic samples in order to select useful gradients in the training process, thereby improving the performance of GANs. Top-k GAN experimentally shows that high-quality samples in the training process lead to improved performance on various networks such as WGAN-GP [6], SNGAN [16], SAGAN [30], and BigGAN [2]. Instance selection [3] increases training speed by reducing the amount of data required for training, rather than improving the network itself. The original data set is mapped to an embedded space by an embedding function $F$, and outliers of the data manifold are identified using a scoring function $H$; an example of an outlier is a sample whose background dominates the foreground. By excluding such outliers from the original data set, the training speed of GANs is dramatically improved.
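A minimal sketch of one LOGAN-style latent-optimization step, assuming PyTorch modules G (generator) and D (discriminator) and a step size alpha; the names are illustrative and LOGAN's natural-gradient variant is not reproduced here.

```python
import torch

def latent_optimized_scores(G, D, batch_size, latent_dim, alpha=0.9):
    # sample a random latent z and make it differentiable
    z = torch.randn(batch_size, latent_dim, requires_grad=True)
    d_out = D(G(z))                                   # first forward pass: D(G(z))
    grad_z, = torch.autograd.grad(d_out.sum(), z)     # backward pass w.r.t. z only
    z_prime = (z + alpha * grad_z).detach()           # z' = z + alpha * grad_z
    return D(G(z_prime))                              # second forward pass: D(G(z'))
```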

Even though Top-k GAN [22] has shown that selected high-quality samples improve the performance of GANs by producing useful gradients during training, this does not mean that all features of bottom samples are unrealistic. In early iterations of training, both top and bottom samples may be equally unrealistic. As training proceeds and some realistic fake samples appear, the remaining unrealistic fake samples are not totally random images; they still contain realistic characteristics to some extent. Therefore selecting useful gradients at the sample level may not be fully effective, and its performance varies with the quality composition of the generated fake samples. Furthermore, top-k sampling uses only a fraction of each batch, slowing down overall training. To alleviate this defect, an annealing scheme is adopted that starts with the full batch size and gradually reduces it over the course of training.
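A minimal sketch of Top-k sample selection with an annealed k, following the description above; the function names, the linear annealing schedule, and the WGAN-style sign convention are illustrative assumptions.

```python
import torch

def topk_generator_loss(critic_scores: torch.Tensor, k: int) -> torch.Tensor:
    # keep only the k most realistic fake samples (highest critic scores)
    topk_scores, _ = torch.topk(critic_scores.view(-1), k)
    # WGAN-style generator loss computed on the kept samples only
    return -topk_scores.mean()

def annealed_k(step: int, total_steps: int, full_batch: int, min_k: int) -> int:
    # linearly decay k from the full batch size down to min_k over training
    frac = min(step / max(total_steps, 1), 1.0)
    return max(min_k, int(round(full_batch - frac * (full_batch - min_k))))
```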

Figure 1: Training the generator with the UFS module in GANs: the UFS module is applied to the generator training process. It derives a linear suppression vector $S$ from the fake feature $Y_{Fake}$. After multiplying $Y_{Fake}$ by $S$, we forward the result to the discriminator's last linear layer $D_L$ to calculate $L_G$.

Gao et al. [4] introduce a feature boosting and suppression (FBS) module that performs per-sample channel pruning to save computational and memory resources. They argue that different samples have different salient features and that dynamic channel pruning is able to amplify salient channels and skip unimportant ones. Song et al. [23] also introduce an FBS module combined with a feature diversification module.

In this work, we propose to select useful gradients for improved GAN training in terms of features that contribute to either realistic or unrealistic samples (or both). To achieve this goal, we propose an unrealistic feature suppression (UFS) module that disregards channels contributing to the unrealistic parts of fake samples according to the suppression vector shown in Figure 1 during the generator training step. Compared to traditional channel attention blocks that add attention to preferred realistic features, the proposed UFS module suppresses selected unrealistic features of all generated fake samples. Different from feature pruning approaches [4], the UFS module does not discard the selected unrealistic features. Instead it assigns them less importance in the gradient calculation. In this way, the UFS module keeps the training stability of the original networks and improves the quality of generated images. In the experimental evaluation, the UFS module is embedded in various GANs such as WGAN-GP, SNGAN, and BigGAN and tested on benchmark data sets such as CIFAR-10, CelebA, and ImageNet.

2 Method

Figure 1 illustrates how the unrealistic feature suppression (UFS) module is incorporated into the training of GANs. Figure 2 shows the detailed structure of the UFS module with a visualized conceptual example.

(a) Detailed structure of the UFS module
(b) Conceptual feature space of a channel
Figure 2: (a) Unrealistic feature suppression (UFS) module: $\mu_{Real}$ and $\mu_{Fake}$ are features computed and stored during the discriminator training process. Using these two feature vectors, we calculate how far $\hat{Y}$ is located from the real feature mean $\mu_{Real}$ as a ratio of the real-fake margin $M$. (b) Distance ratio $R_c$ and margin $M_c$ of a channel $c$.

2.1 Training Generator with UFS module

Figure 1 shows a diagram of the forward flow using the suppression vector $S$ in the generator training process. $D$ denotes the discriminator. $D_L$ and $D_R$ denote the last layer and all preceding layers of the discriminator, respectively. We calculate the suppression vector $S$ by determining the realistic and unrealistic degree at the feature level. Fake samples $G(z)$ are mapped to the embedded feature $Y_{Fake}$ via $D_R$ ($Y_{Fake} = D_R(G(z))$). We forward the feature $Y_{Fake}$ to the UFS module to calculate the linear suppression vector $S$. The UFS module produces $S$ by comparing each feature of $Y_{Fake}$ with the corresponding means of real and fake features. The structure of the UFS module is described in detail in Section 2.2. In this way, the suppression vector $S$ maintains realistic features of $Y_{Fake}$ and suppresses unrealistic ones.

$\bar{y} = D_L(Y_{Fake} \otimes UFS(Y_{Fake}))$   (1)

where $UFS(\cdot)$ denotes the UFS module and $\otimes$ denotes the element-wise product. The generator loss $L_G$ is calculated as the expectation of $\bar{y}$.

$L_G = \mathbb{E}_{z \sim P_z}[\bar{y}]$   (2)

The suppression vector $S$ is enabled in generator training only, not in discriminator training. For the discriminator loss $L_D$, we can use either the WGAN adversarial loss [1] or the hinge adversarial loss [13].
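A minimal sketch of the generator update of Eqs. (1)-(2), assuming PyTorch modules G, D_R (all discriminator layers except the last), a final linear layer D_L = nn.Linear(C, 1), and a ufs callable as sketched in Section 2.2; treating S as a constant during backpropagation is our assumption.

```python
import torch

def generator_step(G, D_R, D_L, ufs, opt_G, z):
    y_fake = D_R(G(z))                      # Y_Fake = D_R(G(z)), shape (N, C)
    w = D_L.weight.view(-1)                 # weight vector of the last linear layer
    S = ufs(w, y_fake).detach()             # suppression vector S (assumed constant)
    y_bar = D_L(y_fake * S)                 # Eq. (1): D_L(Y_Fake ⊗ S)
    loss_G = y_bar.mean()                   # Eq. (2): E_z[ y_bar ], sign as in the paper
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
    return loss_G.item()
```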

2.2 Unrealistic Feature Suppression (UFS) module

Figure 2 shows how the linear suppression vector $S$ is extracted from the feature vector $Y_{Fake}$. In order to distinguish between realistic and unrealistic features, we investigate how the features of real and fake samples are distributed in the embedded latent space. In general, GANs alternately train the discriminator and the generator. In the discriminator training process, real data $x$ and fake data $G(z)$ are forwarded to the latent space, producing the respective feature vectors $Y_{Real}$ and $Y_{Fake}$.

$Y_{Real} = D_R(x), \quad Y_{Fake} = D_R(G(z))$   (3)

If we regard the last layer of the discriminator $D_L$ as a decision function that distinguishes real from fake, we can express the real-fake hyperplane through the multiplication between the embedded feature and $D_L$. However, what we need are the distributions of real and fake in the feature space, not at the sample level. We therefore calculate the average feature vectors $\mu_{Real}$ and $\mu_{Fake}$ through an element-wise product instead of a weighted sum.

$\mu_{Real} = \frac{1}{n}\sum(w \otimes Y_{Real}), \quad \mu_{Fake} = \frac{1}{n}\sum(w \otimes Y_{Fake})$   (4)

where $w$ denotes the weight vector of the linear layer $D_L$ and $n$ is the batch size. The UFS module calculates and stores the average feature vectors $\mu_{Real}$ and $\mu_{Fake}$ during the discriminator training process. In the generator training process, we forward fake samples $G(z)$ to calculate the embedded fake feature vector $Y_{Fake}$. $Y_{Fake}$ is then used to calculate $\hat{Y}$ through an element-wise product with $w$. Since our objective is to compare each feature individually, we do not average the fake features in the generator training process.

$\hat{Y} = w \otimes Y_{Fake}$   (5)

Now we have the criteria $\mu_{Real}$ and $\mu_{Fake}$ to distinguish realistic from unrealistic features. We calculate the margin vector $M$ between $\mu_{Real}$ and $\mu_{Fake}$ and the distance vector $D$ between $\hat{Y}$ and $\mu_{Real}$ in the feature space.

$M = \mu_{Real} - \mu_{Fake}$   (6)
$D = \mu_{Real} - \hat{Y}$   (7)

We obtain the distance ratio vector $R$ by dividing the distance vector $D$ by the margin vector $M$ channel-wise. $R$ represents, for each channel separately, how far the fake feature is located from the real mean relative to the real-fake margin. $R_c$ of a channel $c$ is defined as follows.

$R_c = \begin{cases} \dfrac{D_c}{M_c}, & \text{if } |D_c| \geq \gamma \\ 1, & \text{otherwise} \end{cases}$   (8)

where $c$ indicates a channel of the feature vector. The role of $\gamma$ is to prevent features that are already sufficiently close to the real mean from being considered unrealistic. Finally, the suppression vector $S$ is created based on $R_c$.

(a) $\alpha=0.5$, $\beta=1$, $\epsilon=1.5$
(b) $\alpha=0$, $\beta=1$, $\epsilon=1$
Figure 3: Examples of the $R_c$-$S_c$ graph. (a) $\alpha=0.5$, $\beta=1$, $\epsilon=1.5$: features with $R_c<0.5$ are maintained, and features with $R_c>1$ are suppressed by half. (b) $\alpha=0$, $\beta=1$, $\epsilon=1$: features with $R_c<0$ are maintained, and features with $R_c>1$ are dismissed.
$S_c = \begin{cases} -\alpha + \epsilon, & \text{if } R_c < \alpha \\ -R_c + \epsilon, & \text{if } \alpha \leq R_c \leq \beta \\ -\beta + \epsilon, & \text{if } R_c > \beta \end{cases}$   (9)

In order to have continuously varying scores in the suppression vector according to $R_c$ within a given range, rather than the binary values 0 or 1 (use or drop), we define lower and upper bounds $\alpha$ and $\beta$ that constrain $R_c$. $\epsilon$ is added to shift the constrained range so that the suppression vector takes appropriate values. Figure 3 shows the operating principles of these hyper-parameters.
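A minimal sketch of the UFS module of Eqs. (4)-(9) as a small PyTorch helper; w is the weight vector of D_L, mu_real and mu_fake are the per-channel means stored during discriminator training, and all names, shapes, and the handling of near-zero margins are illustrative assumptions.

```python
import torch

class UFSModule:
    def __init__(self, alpha=1.0, beta=1.5, eps=2.0, gamma=1e-4):
        self.alpha, self.beta, self.eps, self.gamma = alpha, beta, eps, gamma
        self.mu_real = None
        self.mu_fake = None

    @torch.no_grad()
    def update_means(self, w, y_real, y_fake):
        # Eq. (4): per-channel means of w ⊗ Y over the batch (discriminator step)
        self.mu_real = (w * y_real).mean(dim=0)
        self.mu_fake = (w * y_fake).mean(dim=0)

    @torch.no_grad()
    def __call__(self, w, y_fake):
        y_hat = w * y_fake                        # Eq. (5): Y_hat = w ⊗ Y_Fake, shape (N, C)
        M = self.mu_real - self.mu_fake           # Eq. (6): real-fake margin per channel
        D = self.mu_real - y_hat                  # Eq. (7): distance from the real mean
        # Eq. (8): distance ratio; margins are assumed non-zero in this sketch
        R = torch.where(D.abs() >= self.gamma, D / M, torch.ones_like(D))
        # Eq. (9): clamp R to [alpha, beta], then negate and shift by eps
        return -R.clamp(self.alpha, self.beta) + self.eps
```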

3 Experimental Evaluation

First, we run the original Top-k GAN [22] with top-k, bottom-k, and random-k samples to compare the three cases and demonstrate the necessity of unrealistic feature suppression rather than realistic sample selection. Experimental evaluation of the proposed method is performed on CelebA [14], CIFAR-10 [10], and ImageNet [19]. We implement the proposed UFS module on WGAN-GP [6], SNGAN [16], and BigGAN [2].

3.1 Top-k and Bottom-k Samples

Figure 4: Experiments on various sampling techniques. We train WGAN-GP on the CIFAR-10 data set. Unlike the other methods, which show fluctuation, top-k sampling shows stable convergence. However, the best FID [7] scores are similar to each other (bottom-k = 15.6 and top-k = 16.1).

The performance of Top-k GAN [22] depends highly on the hyper-parameter used to select high-quality samples. Top-k GAN has shown experimentally on a Gaussian mixture model that bottom-k samples produce gradients pointing away from the optimal generator. However, this does not mean that bottom samples always contribute bad gradients throughout generator training, especially when training on real image data sets with complex distributions. Assume we have a moderately trained GAN whose generated fake samples divide into relatively realistic and relatively unrealistic groups. We then have two strategies: training with only top-k samples (concentrating on making realistic samples more realistic), and training with only bottom-k samples (concentrating on increasing the quality of unrealistic samples).

We conduct an experiment assuming that bottom samples are able to provide good gradients. Figure 4 shows our experimental results on the CIFAR-10 data set. Top-k and bottom-k use only the $k$ top/bottom samples in the training process. The batch size is 64 and the sampling hyper-parameter $k$ is set to 32. A modified WGAN-GP [6] is trained for 500 epochs. The FID score is measured every 10 epochs using 10k real and fake samples. The modified WGAN-GP replaces the 1x1 discriminator output with an 8x8 output. This has a data-augmentation-like effect that allows the last layer of the discriminator $D_L$ to see more diverse features, leading to improved performance. In Figure 4, top-k sampling is more stable with less fluctuation than bottom-k sampling, but the final FID scores are similar. We also test random-k sampling to see whether the sampling method itself produces good results regardless of selecting top-k or bottom-k samples. Random-k sampling initially seems to converge quickly, but after large fluctuations in the middle it reaches a result similar to the original method. In this test, we observe experimentally that bottom-k samples are also capable of providing good gradients, even though they bring a risk of negative gradients, which appear as increased fluctuation during training.
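One plausible reading of the modified WGAN-GP critic head described above (the 1x1 score replaced by an 8x8 score map), sketched as a 1x1 convolution acting as the last linear layer at every spatial location; the exact architecture is not specified in the text, so this is an assumption rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class PatchCriticHead(nn.Module):
    def __init__(self, in_channels: int):
        super().__init__()
        # 1x1 convolution plays the role of the last linear layer D_L at every location
        self.score = nn.Conv2d(in_channels, 1, kernel_size=1)

    def forward(self, features_8x8: torch.Tensor) -> torch.Tensor:
        # features_8x8: (N, C, 8, 8) feature map from the critic trunk D_R
        return self.score(features_8x8)   # (N, 1, 8, 8) score map, averaged in the loss
```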

3.2 CIFAR-10

Verification on CIFAR-10 is conducted using both the modified WGAN-GP [6] and SNGAN [16]. SNGAN results are discussed in Section 4.1. The modified WGAN-GP used in this test shows improved performance, as noted in the previous subsection.

Experiments on WGAN-GP   The modified WGAN-GP described above is used as the baseline in this test, and Table 1 summarizes the experimental results. First, WGAN-GP is combined with Top-k, Bottom-k, and Random-k sample selection and compared to the proposed UFS module implemented on WGAN-GP; the proposed (WGAN-GP + UFS module) achieves an FID score of 15.67. We also combine WGAN-GP with the UFS module and the Top-k, Bottom-k, and Random-k sample selection methods, among which (WGAN-GP + Top-k + UFS module) shows the best FID score of 15.87. These combinations are possible because instance sampling methods (Top-k, Bottom-k, and Random-k) select useful gradients at the sample level after the forward step is performed, whereas our UFS module selects useful gradients in the latent feature space during the forward step. Therefore, the combined use of both schemes finds useful gradients at the sample and feature levels simultaneously. All experiments are trained for 500 epochs, and FID scores are measured using 10k fake and real samples every 10 epochs. The hyper-parameters of the UFS module are $\alpha=0$, $\beta=1$, $\epsilon=1$, and $\gamma=0.0001$.

CIFAR-10 FID score
WGAN-GP 17.56
WGAN-GP + Random-k 17.29
WGAN-GP + Bottom-k 15.62
WGAN-GP + Top-k [22] 16.13
WGAN-GP + UFS (Ours) 15.67
WGAN-GP + Random-k + UFS (Ours) 16.10
WGAN-GP + Bottom-k + UFS (Ours) 16.64
WGAN-GP + Top-k + UFS (Ours) 15.87
Table 1: FID scores of WGAN-GP trained on CIFAR-10

3.3 CelebA

The CelebA data set is also tested with the modified WGAN-GP as the baseline. Networks in all tests are trained for 100 epochs using the WGAN adversarial loss. The hyper-parameters of the UFS module are $\alpha=0$, $\beta=1$, $\epsilon=1$, $\gamma=0.0001$. The batch size is 64, and the hyper-parameter $k$ used in Top-k GAN is set adaptively, starting from 64 and gradually decreasing to 32. Table 2 compares FID scores, which are calculated every epoch using 10k fake and real samples. (WGAN-GP + UFS module) shows the best FID score of 6.51, compared to 6.90 for WGAN-GP and 7.96 for Top-k GAN.

CelebA FID score
WGAN-GP 6.90
WGAN-GP + Top-k [22] 7.96
WGAN-GP + Top-k + UFS (Ours) 7.81
WGAN-GP + UFS (Ours) 6.51
Table 2: FID scores of WGAN-GP trained on CelebA.

3.4 ImageNet

ImageNet is a large-scale data set containing 1.2M images of 1000 classes. Because of the huge size of the data set, training a network on the original ImageNet takes a long time. For example, training on the 128×128 data set takes 2 weeks with 8 NVIDIA V100 GPUs, and 256×256 requires much longer. ImageNet results are also sensitive to batch size: Brock et al. [2] reported that increasing the batch size from 256 to 2048 improves both FID and IS [21]. Instead of using all of the data with a large batch size, we follow the methods of Miyato et al. [16] and DeVries et al. [3]. Miyato et al. use a subset of ImageNet named ImageNet dog and cat. Each class of ImageNet dog and cat shows similar characteristics, so networks can be trained more easily than with all ImageNet classes. DeVries et al. introduce instance selection, which accelerates training by reducing the number of samples. We test BigGAN on 64×64 ImageNet and 128×128 ImageNet with 50% instance selection.

ImageNet dog and cat   SNGAN trains its networks by selecting 143 classes out of 1000 (180k images in total). All selected classes are animal classes corresponding to species of dog and cat, and the training parameters are the same as SNGAN's baseline. For all experiments we use a batch size of 64, and the hyper-parameters of our method are $\alpha=1$, $\beta=1.5$, $\epsilon=2$, and $\gamma=0.0001$. $k$ for the Top-k sampling experiments starts at 64 and gradually decreases to 32. We train all models for 250k iterations. Table 3 and Figure 5 show our experimental results. The experiment combining Top-k sampling and UFS achieves the best FID score (18.84).

ImageNet Dog and Cat FID score IS score
SNGAN [16] 20.02 11.01
SNGAN + UFS (Ours) 19.59 11.66
SNGAN + Top-k [22] 18.99 11.86
SNGAN + Top-k + UFS (Ours) 18.84 12.55
Table 3: FID and IS scores on ImageNet dog and cat: Baseline model is SNGAN and all models are trained for 250k iterations.
(a) SNGAN + Top-k + UFS
(b) SNGAN + Top-k [22]
Figure 5: Random samples from SNGAN trained on 64×64 dog and cat images

ImageNet 64×64   DeVries et al. [3] reduce the size of the data set by removing outliers. Reducing the training data set causes a small performance degradation but dramatically improves training speed. DeVries et al. also reported that instance selection with too large a batch size degrades performance. In this test, we train BigGAN with instance selection using a batch size of 256 instead of 2048. $k$ for the Top-k sampling experiments starts at 256 and gradually decreases to 128. For instance selection, we use Inception-v3 as the data embedding function $F$ and a Gaussian model as the scoring function $H$. The retention ratio is 50%, i.e., we use only 50% of the data for training. All experiments are trained until mode collapse occurs.
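A minimal sketch of the instance-selection step described above: embed the data set with F (e.g., precomputed Inception-v3 features), score each sample with a Gaussian model H fit to the embeddings, and keep the top 50%; the Mahalanobis-distance scoring and all names are illustrative assumptions rather than the exact procedure of DeVries et al.

```python
import numpy as np

def select_instances(embeddings: np.ndarray, retention_ratio: float = 0.5) -> np.ndarray:
    # Fit a Gaussian model H to the embedded data set
    mu = embeddings.mean(axis=0)
    cov = np.cov(embeddings, rowvar=False) + 1e-6 * np.eye(embeddings.shape[1])
    cov_inv = np.linalg.inv(cov)
    # Score each sample by negative squared Mahalanobis distance: higher = more typical
    diff = embeddings - mu
    scores = -np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    # Keep the highest-scoring fraction of the data set (drop manifold outliers)
    n_keep = int(len(embeddings) * retention_ratio)
    return np.argsort(scores)[::-1][:n_keep]
```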

Figure 6, Figure 7, and Table 4 show the results of BigGAN trained on 50% of ImageNet. (BigGAN + UFS module) achieves an FID score of 7.84, which is worse than the original BigGAN (7.58). Top-k sampling achieves a better FID score (7.17), but its training is very slow compared to the original method: while the original method takes 310k iterations to reach its best FID of 7.58, Top-k takes 460k iterations. On the other hand, (BigGAN + Top-k + UFS module) is much faster than Top-k sampling alone. As shown in Figure 6, its initial training is slower than the original but faster than Top-k sampling. To reach an FID score of 10, the original takes 128k iterations, (BigGAN + Top-k + UFS module) takes 154k, and Top-k takes 194k. Comparing the best FID scores, (BigGAN + Top-k + UFS module) achieves a much better FID score (6.73) than the other methods. It is noteworthy that the experiments without Top-k sampling fall into mode collapse, while the experiments using Top-k sampling show no mode collapse even when trained for more than twice as many iterations.

Figure 6: FID scores of the 64×64 ImageNet experiments: BigGAN and (BigGAN + UFS) fall into mode collapse before 400k iterations.

Since the FID score is highly sensitive to the variance of the model distribution, we also check other metrics. Precision & Recall [11, 20] and Density & Coverage [17] examine the distribution manifolds through k-NN in an embedded space. We choose the best models of (BigGAN + Top-k) and (BigGAN + Top-k + UFS module) based on FID score. Except for the density metric, (BigGAN + Top-k + UFS module) achieves better scores than Top-k: precision (0.8547 vs. 0.8456), recall (0.6086 vs. 0.5960), density (1.303 vs. 1.306), and coverage (0.9406 vs. 0.9325). (BigGAN + Top-k + UFS module) achieves better precision but lower density than (BigGAN + Top-k), so it is hard to say which one is better in terms of how much of the generated samples overlap the real data manifold. However, (BigGAN + Top-k + UFS module) achieves better recall and coverage, which indicates that it generates more diverse images that overlap the real data manifold.
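A usage sketch for the Precision/Recall/Density/Coverage numbers above, relying on the prdc package released with Naeem et al. [17]; the feature file names and nearest_k value are assumptions for illustration.

```python
import numpy as np
from prdc import compute_prdc

# hypothetical files holding precomputed image embeddings for real and generated sets
real_feats = np.load("real_features.npy")
fake_feats = np.load("fake_features.npy")

metrics = compute_prdc(real_features=real_feats,
                       fake_features=fake_feats,
                       nearest_k=5)
print(metrics)  # dict with 'precision', 'recall', 'density', 'coverage'
```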

ImageNet 64×64 FID score IS score
BigGAN [2] 7.58 49.24
BigGAN + UFS (Ours) 7.84 47.17
BigGAN + Top-k [22] 7.17 45.05
BigGAN + Top-k + UFS (Ours) 6.73 49.93
Table 4: FID and IS scores of BigGAN trained on 64×64 ImageNet: To accelerate all experiments, we use instance selection to reduce the amount of training data by 50%.
(a) BigGAN + Top-k + UFS
(b) BigGAN + Top-k [22]
Figure 7: Samples from BigGAN trained on 64×64 ImageNet. We pick classes that clearly show differences in generated image quality.

ImageNet 128×128   To check the performance of our method on higher-resolution images, we train BigGAN on 128×128 ImageNet. We use instance selection to speed up training, with the same instance-selection hyper-parameters as in the 64×64 ImageNet experiments. The batch size is 128, but we accumulate gradients twice, following the experiments of DeVries et al. [3]. $k$ starts at 128 and gradually decreases to 64. To accelerate training, we also anneal $\beta$ gradually from 1 to 1.5, which means that we use full features early in training and gradually suppress unrealistic features over epochs. The other hyper-parameters are the same as in the previous experiments: $\alpha=1$, $\epsilon=2$, $\gamma=0.0001$. We train all models for 800k iterations. Figure 8 and Table 5 show the results of BigGAN trained on 50% of 128×128 ImageNet. Similar to the 64×64 experiments, Top-k sampling trains very slowly, which results in insufficient convergence within 800k iterations. BigGAN and (BigGAN + UFS module) fall into mode collapse before 400k iterations. (BigGAN + Top-k + UFS module) shows the best FID score of 8.78 and inception score of 111.01. We also check the other metrics: (BigGAN + Top-k + UFS module) achieves better scores than Top-k sampling in precision (0.8963 vs. 0.8852), recall (0.5488 vs. 0.5395), density (1.4594 vs. 1.3917), and coverage (0.9297 vs. 0.9101).

ImageNet 128×128 FID score IS score
BigGAN [2] 9.88 108.40
BigGAN + UFS (Ours) 11.08 94.83
BigGAN + Top-k [22] 9.76 108.86
BigGAN + Top-k + UFS (Ours) 8.78 111.01
Table 5: FID and IS scores of BigGAN trained on 128×128 ImageNet: To accelerate all experiments, we use instance selection to reduce the amount of training data by 50%.
(a) BigGAN + Top-k + UFS
(b) BigGAN + Top-k [22]
Figure 8: Samples from BigGAN trained on 128×128 ImageNet. We pick classes that clearly show differences in generated image quality.

3.5 How does suppression vector work?

To understand intuitively how $S$ helps training, we need to visualize what the discriminator actually sees. A class activation map [32] visualizes which regions of an image a CNN looks at when making its judgment. We obtain class activation maps in three different ways from pretrained BigGAN networks, calculated with the equations below.

Figure 9: Visualization of how the suppression vector $S$ works on the discriminator. $CAM$ is the ordinary class activation map of the discriminator, which highlights the most realistic areas of the images. $CAM_{UFS}$ and $CAM_{SUP}$ are class activation maps of the discriminator with the suppression vector $S$ and the suppressed vector $1-S$, respectively.
$CAM^{i,j} = \langle \tilde{Y}_{i,j}, w \rangle$
$CAM_{UFS}^{i,j} = \langle \tilde{Y}_{i,j} \otimes S, w \rangle$
$CAM_{SUP}^{i,j} = \langle \tilde{Y}_{i,j} \otimes (1-S), w \rangle$   (10)

where $\tilde{Y}$ denotes the feature map before global sum pooling is applied. Since the WGAN-based adversarial loss drives $D(x)$ to be greater than $D(G(z))$, the class activation map highlights the areas the discriminator considers most realistic. $CAM_{SUP}$ shows the regions of the most unrealistic features according to our UFS module, i.e., the regions that will be suppressed by our method. Conversely, $CAM_{UFS}$ shows the class activation map after unrealistic features are suppressed and therefore highlights the regions of realistic features according to the UFS module.
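A minimal sketch of the three activation maps in Eq. (10), assuming a feature map of shape (N, C, H, W) before global sum pooling, the D_L weight vector w of shape (C,), and a per-sample suppression vector S of shape (N, C); names are illustrative.

```python
import torch

def activation_maps(feat: torch.Tensor, w: torch.Tensor, S: torch.Tensor):
    w_map = w.view(1, -1, 1, 1)                           # broadcast weights over spatial dims
    S_map = S.view(S.size(0), -1, 1, 1)                   # broadcast suppression per channel
    cam     = (feat * w_map).sum(dim=1)                   # CAM^{i,j}     = <Y~_{i,j}, w>
    cam_ufs = (feat * S_map * w_map).sum(dim=1)           # CAM_UFS^{i,j} = <Y~_{i,j} ⊗ S, w>
    cam_sup = (feat * (1 - S_map) * w_map).sum(dim=1)     # CAM_SUP^{i,j} = <Y~_{i,j} ⊗ (1-S), w>
    return cam, cam_ufs, cam_sup
```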

Figure 9 shows sample activation maps. $CAM$ covers the foreground with its unique and distinctive appearance as well as the background with relatively rough and ambiguous visual patterns. Since the appearance of such background is easier to learn, $CAM$ tends to include unrealistic background regions that are relatively easy to learn. However, as the $CAM_{UFS}$ examples in Figure 9 show, the UFS module suppresses features that are far from the average real feature $\mu_{Real}$, effectively suppressing the unrealistic background regions and pushing the discriminator to concentrate more on the realistic foreground. Similar to Top-k GAN, which ignores unrealistic samples and trains the networks to make realistic samples better, the UFS module ignores unrealistic features and trains the networks to make realistic features better.

4 Discussion and Conclusion

4.1 Why unrealistic feature suppression instead of dismission?

SNGAN + UFS module tested on CIFAR-10 shows unstable training. Figure 10 shows the generator and discriminator losses over the entire training process. While the discriminator is trained quickly, the generator loses its direction for training. We conjecture that when the discriminator becomes too powerful, i.e., it easily distinguishes between real and fake features, the UFS module effectively dismisses most unrealistic features. We therefore adjust the hyper-parameters to suppress unrealistic features rather than dismiss them: by setting $0 < \epsilon - \beta < 1$, unrealistic features are suppressed but never fully removed. Table 6 summarizes our experiments. We observe that dismissing unrealistic features gives rise to severe mode collapse, whereas suppressing them does not. The empirically optimal hyper-parameters are $\alpha=1$, $\beta=1.5$, and $\epsilon=2$.
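As a quick check of Eq. (9), added here for clarity, the two hyper-parameter settings discussed above bound $S_c$ as follows:

```latex
\[
  S_c \in [\epsilon-\beta,\ \epsilon-\alpha]
  \;\Rightarrow\;
  \begin{cases}
    \alpha=1,\ \beta=1.5,\ \epsilon=2: & S_c \in [0.5,\ 1] \quad \text{(suppression)}\\[2pt]
    \alpha=0,\ \beta=1,\ \epsilon=1:   & S_c \in [0,\ 1]   \quad \text{(dismission)}
  \end{cases}
\]
```

Under the suppression setting every channel therefore keeps at least half of its contribution, whereas the dismission setting can zero a channel out entirely.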

(a) $\alpha=0$, $\beta=1$, $\epsilon=1$
(b) $\alpha=1$, $\beta=1.5$, $\epsilon=2$
Figure 10: Training loss of the SNGAN experiments on CIFAR-10. (a) $\alpha=0$, $\beta=1$, $\epsilon=1$: the discriminator converges in early iterations, after which the generator loss becomes unstable and the networks go to mode collapse. (b) $\alpha=1$, $\beta=1.5$, $\epsilon=2$: this constrains $S_c$ to the range $[0.5, 1]$.
$\alpha$ $\beta$ $\epsilon$ FID score IS score UFS
0 1 1 43.37 6.20 Dismission
1 2 2.5 19.47 7.93 Suppression
1 2 3 19.54 8.26 Suppression
1 3 3 50.57 6.70 Dismission
1 1.2 2 18.07 8.09 Suppression
1 1.3 2 19.29 7.92 Suppression
1 1.4 2 18.82 8.09 Suppression
1 1.5 2 17.10 8.50 Suppression
Table 6: Experiments to find proper hyper-parameters for our method. The baseline model is SNGAN and the data set is CIFAR-10. We train all models for 250k iterations. Suppression means suppressing unrealistic features rather than dismissing them by setting $0 < \epsilon - \beta < 1$.

4.2 Conclusion

In this work, we have proposed an unrealistic feature suppression (UFS) module that suppresses unrealistic features during generator training. The effectiveness of the UFS module has been demonstrated through extensive experimental evaluations on various backbone networks such as WGAN-GP, SNGAN, and BigGAN. In the ImageNet experiments, we show that a method combining Top-k selection and the UFS module converges faster and better than prior methods.

References

  • [1] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International conference on machine learning, pages 214–223. PMLR, 2017.
  • [2] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2018.
  • [3] Terrance DeVries, Michal Drozdzal, and Graham W Taylor. Instance selection for gans. NIPS, 2020.
  • [4] Xitong Gao, Yiren Zhao, Łukasz Dudziak, Robert Mullins, and Cheng-zhong Xu. Dynamic channel pruning: Feature boosting and suppression. In International Conference on Learning Representations, 2018.
  • [5] Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
  • [6] Ishaan Gulrajani, Faruk Ahmed, Martín Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In NIPS, 2017.
  • [7] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NIPS, 2017.
  • [8] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018.
  • [9] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017.
  • [10] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • [11] Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. In NeurIPS, 2019.
  • [12] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4681–4690, 2017.
  • [13] Jae Hyun Lim and Jong Chul Ye. Geometric gan. arXiv preprint arXiv:1705.02894, 2017.
  • [14] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.
  • [15] Shuang Ma, Jianlong Fu, Chang Wen Chen, and Tao Mei. Da-gan: Instance-level image translation by deep attention generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5657–5666, 2018.
  • [16] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.
  • [17] Muhammad Ferjad Naeem, Seong Joon Oh, Youngjung Uh, Yunjey Choi, and Jaejun Yoo. Reliable fidelity and diversity metrics for generative models. In International Conference on Machine Learning, pages 7176–7185. PMLR, 2020.
  • [18] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2337–2346, 2019.
  • [19] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.
  • [20] Mehdi SM Sajjadi, Olivier Bachem, Mario Lucic, Olivier Bousquet, and Sylvain Gelly. Assessing generative models via precision and recall. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 5234–5243, 2018.
  • [21] Tim Salimans, Ian J Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In NIPS, 2016.
  • [22] Samarth Sinha, Zhengli Zhao, Anirudh Goyal, Colin Raffel, and Augustus Odena. Top-k training of gans: Improving gan performance by throwing away bad samples. NIPS, 2020.
  • [23] Jianwei Song and Ruoyu Yang. Feature boosting, suppression, and diversification for fine-grained visual classification. arXiv preprint arXiv:2103.02782, 2021.
  • [24] Hao Tang, Dan Xu, Nicu Sebe, Yanzhi Wang, Jason J Corso, and Yan Yan. Multi-channel attention selection gan with cascaded semantic guidance for cross-view image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2417–2426, 2019.
  • [25] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8798–8807, 2018.
  • [26] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7794–7803, 2018.
  • [27] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. Cbam: Convolutional block attention module. In Proceedings of the European conference on computer vision (ECCV), pages 3–19, 2018.
  • [28] Yan Wu, Jeff Donahue, David Balduzzi, Karen Simonyan, and Timothy Lillicrap. Logan: Latent optimisation for generative adversarial networks. arXiv preprint arXiv:1912.00953, 2019.
  • [29] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1316–1324, 2018.
  • [30] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. In International conference on machine learning, pages 7354–7363. PMLR, 2019.
  • [31] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 5907–5915, 2017.
  • [32] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2921–2929, 2016.