
Unrealistic Feature Suppression for Generative Adversarial Networks

Sanghun Kim
Department of Computer Science and Engineering, Kyunghee University
powerkei@naver.com

Seungkyu Lee
Department of Computer Science and Engineering, Kyunghee University
seungkyu@khu.ac.kr
Abstract

Due to the unstable nature of the minimax game between the generator and the discriminator, improving the performance of GANs is a challenging task. Recent studies have shown that selecting high-quality samples during training improves the performance of GANs. However, sampling approaches that discard samples show limitations in aspects such as training speed and network optimality. In this paper we propose an unrealistic feature suppression (UFS) module that keeps high-quality features and suppresses unrealistic features. The UFS module preserves the training stability of the networks and improves the quality of generated images. We demonstrate the effectiveness of the UFS module on various models such as WGAN-GP, SNGAN, and BigGAN. Using the UFS module, we achieve better Fréchet inception distance and inception score than various baseline models. We also visualize how effectively our UFS module suppresses unrealistic features through class activation maps.

1 Introduction

Generative Adversarial Networks (GANs) [5] have attracted explosive attention and achieved success in various research areas since they were introduced. GANs consist of two adversarial networks, a generator and a discriminator, that are trained alternately. The discriminator is trained to distinguish real samples from generated fake samples. The generator, in turn, is trained on feedback from the discriminator to produce realistic fake samples. Thanks to the practical performance of this adversarial training strategy, GANs have evolved into various image generation methods such as image-to-image translation [9, 18, 25], super resolution [12], and text-to-image generation [31]. Improving the performance of GANs is a challenging task due to the unstable nature of the minimax game. GANs sometimes fall into a Nash equilibrium early because of the difficulty of balancing generator and discriminator training.

One effort toward improving GANs is to employ attention blocks that have previously shown improved performance in various classification tasks. Squeeze-and-excitation [8] proposes a channel attention block, and CBAM [27] adds a spatial attention block to focus not only on channels but also on spatial features, showing that attention approaches are capable of improving the performance of convolutional neural networks. In generative models, AttnGAN [29] embeds encoded text into the networks through an attention module so that the networks focus more on the related image details. In image-to-image translation, SelectionGAN [24] proposes a method of combining multiple candidate images through multi-channel attention selection. DAGAN [15] performs instance-level translation by specifying the area to focus on in an image. Inspired by non-local neural networks [26], SAGAN [30] proposes a self-attention module that extracts feature similarity over the entire area of an image.

Another group of methods studies how to provide useful gradients during GAN training. LOGAN [28] proposes a gradient-based latent optimization scheme for GANs. Latent optimization uses a latent vector that is optimized by the gradient of the networks during training: it uses $G(z')$ obtained from the optimized latent vector $z'$ as fake samples rather than $G(z)$ obtained from a random latent $z$, decreasing the influence of the random distribution on the networks. To this end, LOGAN first forwards $z$ drawn from a random distribution to compute $D(G(z))$. After obtaining $\nabla z$ in the backward pass, LOGAN computes the optimized latent vector $z' = z + \alpha\nabla z$. Finally $z'$ is re-forwarded to compute $D(G(z'))$, which trains the networks. LOGAN shows that latent optimization not only generates high-quality samples but also gives a better direction toward the optimal generator. Since LOGAN performs two forward-backward passes per training step, its training is slow compared to other methods. Top-k GAN [22] argues that the more realistic samples produced by latent optimization are what improve network performance, claiming that the success of LOGAN is due to the high-quality samples produced by the optimized latent $z'$. Top-k GAN dismisses unrealistic samples, as judged by the critic (discriminator), from each batch and adopts only realistic samples in order to select useful gradients in the training process, thereby improving the performance of GANs. Top-k GAN experimentally shows that high-quality samples in the training process lead to improved performance on various networks such as WGAN-GP [6], SNGAN [16], SAGAN [30], and BigGAN [2]. Instance selection [3] increases training speed by reducing the amount of data required for training, rather than improving the network itself. The original data set is mapped to an embedded space by an embedding function $F$, and outliers of the data manifold are identified using a scoring function $H$; an example of an outlier is a sample whose background dominates the foreground. By excluding such outliers from the original data set, the training speed of GANs is dramatically improved.
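A minimal sketch of one LOGAN-style latent-optimization step, assuming PyTorch modules G (generator) and D (discriminator) and a step size alpha; the names are illustrative and LOGAN's natural-gradient variant is not reproduced here.

```python
import torch

def latent_optimized_scores(G, D, batch_size, latent_dim, alpha=0.9):
    # sample a random latent z and make it differentiable
    z = torch.randn(batch_size, latent_dim, requires_grad=True)
    d_out = D(G(z))                                   # first forward pass: D(G(z))
    grad_z, = torch.autograd.grad(d_out.sum(), z)     # backward pass w.r.t. z only
    z_prime = (z + alpha * grad_z).detach()           # z' = z + alpha * grad_z
    return D(G(z_prime))                              # second forward pass: D(G(z'))
```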

Even though Top-k GAN [22] has shown that selected high-quality samples improve the performance of GANs by producing useful gradients during training, this does not mean that all features of bottom samples are unrealistic. In early iterations of training, both top and bottom samples may be equally unrealistic. As training proceeds and some realistic fake samples appear, the remaining unrealistic fake samples are not totally random images; they still contain realistic characteristics to some extent. Therefore selecting useful gradients at the sample level may not be fully effective, and its performance varies with the quality composition of the generated fake samples. Furthermore, top-k sampling uses only a fraction of each batch, slowing down overall training. To alleviate this defect, an annealing scheme is adopted that starts with the full batch size and gradually reduces it over the course of training.
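A minimal sketch of Top-k sample selection with an annealed k, following the description above; the function names, the linear annealing schedule, and the WGAN-style sign convention are illustrative assumptions.

```python
import torch

def topk_generator_loss(critic_scores: torch.Tensor, k: int) -> torch.Tensor:
    # keep only the k most realistic fake samples (highest critic scores)
    topk_scores, _ = torch.topk(critic_scores.view(-1), k)
    # WGAN-style generator loss computed on the kept samples only
    return -topk_scores.mean()

def annealed_k(step: int, total_steps: int, full_batch: int, min_k: int) -> int:
    # linearly decay k from the full batch size down to min_k over training
    frac = min(step / max(total_steps, 1), 1.0)
    return max(min_k, int(round(full_batch - frac * (full_batch - min_k))))
```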

Figure 1: Training the generator with the UFS module in GANs: the UFS module is applied to the generator training process. It derives a linear suppression vector $S$ from the fake feature $Y_{Fake}$. After multiplying $Y_{Fake}$ by $S$, we forward the result to the discriminator's last linear layer $D_L$ to calculate $L_G$.

Gao et al. [4] introduce a feature boosting and suppression (FBS) module that performs per-sample channel pruning to save computational and memory resources. They argue that different samples have different salient features and that dynamic channel pruning is able to amplify salient channels and skip unimportant ones. Song et al. [23] also introduce an FBS module combined with a feature diversification module.

In this work, we propose to select useful gradients for improved GAN training in terms of features that contribute to either realistic or unrealistic samples (or both). To achieve this goal, we propose an unrealistic feature suppression (UFS) module that disregards channels contributing to the unrealistic parts of fake samples according to the suppression vector shown in Figure 1 during the generator training step. Compared to traditional channel attention blocks that add attention to preferred realistic features, the proposed UFS module suppresses selected unrealistic features of all generated fake samples. Different from feature pruning approaches [4], the UFS module does not discard the selected unrealistic features. Instead it assigns them less importance in the gradient calculation. In this way, the UFS module keeps the training stability of the original networks and improves the quality of generated images. In the experimental evaluation, the UFS module is embedded in various GANs such as WGAN-GP, SNGAN, and BigGAN and tested on benchmark data sets such as CIFAR-10, CelebA, and ImageNet.

2 Method

Figure 1 illustrates how the unrealistic feature suppression (UFS) module is incorporated into the training of GANs. Figure 2 shows the detailed structure of the UFS module with a visualized conceptual example.

(a) Detailed structure of the UFS module
(b) Conceptual feature space of a channel
Figure 2: (a) Unrealistic feature suppression (UFS) module: $\mu_{Real}$ and $\mu_{Fake}$ are features computed and stored during the discriminator training process. Using these two feature vectors, we calculate how far $\hat{Y}$ is located from the real feature mean $\mu_{Real}$ as a ratio of the real-fake margin $M$. (b) Distance ratio $R_c$ and margin $M_c$ of a channel $c$.

2.1 Training Generator with UFS module

Figure 1 shows a diagram of the forward flow using the suppression vector $S$ in the generator training process. $D$ denotes the discriminator. $D_L$ and $D_R$ denote the last layer and all preceding layers of the discriminator, respectively. We calculate the suppression vector $S$ by determining the realistic and unrealistic degree at the feature level. Fake samples $G(z)$ are mapped to the embedded feature $Y_{Fake}$ via $D_R$ ($Y_{Fake} = D_R(G(z))$). We forward the feature $Y_{Fake}$ to the UFS module to calculate the linear suppression vector $S$. The UFS module produces $S$ by comparing each feature of $Y_{Fake}$ with the corresponding means of real and fake features. The structure of the UFS module is described in detail in Section 2.2. In this way, the suppression vector $S$ maintains realistic features of $Y_{Fake}$ and suppresses unrealistic ones.

$\bar{y} = D_L(Y_{Fake} \otimes UFS(Y_{Fake}))$   (1)

where $UFS(\cdot)$ denotes the UFS module and $\otimes$ denotes the element-wise product. The generator loss $L_G$ is calculated as the expectation of $\bar{y}$.

$L_G = \mathbb{E}_{z \sim P_z}[\bar{y}]$   (2)

The suppression vector $S$ is enabled in generator training only, not in discriminator training. For the discriminator loss $L_D$, we can use either the WGAN adversarial loss [1] or the hinge adversarial loss [13].
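A minimal sketch of the generator update of Eqs. (1)-(2), assuming PyTorch modules G, D_R (all discriminator layers except the last), a final linear layer D_L = nn.Linear(C, 1), and a ufs callable as sketched in Section 2.2; treating S as a constant during backpropagation is our assumption.

```python
import torch

def generator_step(G, D_R, D_L, ufs, opt_G, z):
    y_fake = D_R(G(z))                      # Y_Fake = D_R(G(z)), shape (N, C)
    w = D_L.weight.view(-1)                 # weight vector of the last linear layer
    S = ufs(w, y_fake).detach()             # suppression vector S (assumed constant)
    y_bar = D_L(y_fake * S)                 # Eq. (1): D_L(Y_Fake ⊗ S)
    loss_G = y_bar.mean()                   # Eq. (2): E_z[ y_bar ], sign as in the paper
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
    return loss_G.item()
```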

2.2 Unrealistic Feature Suppression (UFS) module

Figure 2 shows how the linear suppression vector $S$ is extracted from the feature vector $Y_{Fake}$. In order to distinguish between realistic and unrealistic features, we investigate how the features of real and fake samples are distributed in the embedded latent space. In general, GANs alternately train the discriminator and the generator. In the discriminator training process, real data $x$ and fake data $G(z)$ are forwarded to the latent space, producing the respective feature vectors $Y_{Real}$ and $Y_{Fake}$.

$Y_{Real} = D_R(x), \quad Y_{Fake} = D_R(G(z))$   (3)

If we regard the last layer of the discriminator $D_L$ as a decision function that distinguishes real from fake, we can express the real-fake hyperplane through the multiplication between the embedded feature and $D_L$. However, what we need are the distributions of real and fake in the feature space, not at the sample level. We therefore calculate the average feature vectors $\mu_{Real}$ and $\mu_{Fake}$ through an element-wise product instead of a weighted sum.

$\mu_{Real} = \frac{1}{n}\sum(w \otimes Y_{Real}), \quad \mu_{Fake} = \frac{1}{n}\sum(w \otimes Y_{Fake})$   (4)

where $w$ denotes the weight vector of the linear layer $D_L$ and $n$ is the batch size. The UFS module calculates and stores the average feature vectors $\mu_{Real}$ and $\mu_{Fake}$ during the discriminator training process. In the generator training process, we forward fake samples $G(z)$ to calculate the embedded fake feature vector $Y_{Fake}$. $Y_{Fake}$ is then used to calculate $\hat{Y}$ through an element-wise product with $w$. Since our objective is to compare each feature individually, we do not average the fake features in the generator training process.

$\hat{Y} = w \otimes Y_{Fake}$   (5)

Now we have the criteria $\mu_{Real}$ and $\mu_{Fake}$ to distinguish realistic from unrealistic features. We calculate the margin vector $M$ between $\mu_{Real}$ and $\mu_{Fake}$ and the distance vector $D$ between $\hat{Y}$ and $\mu_{Real}$ in the feature space.

$M = \mu_{Real} - \mu_{Fake}$   (6)
$D = \mu_{Real} - \hat{Y}$   (7)

We obtain the distance ratio vector $R$ by dividing the distance vector $D$ by the margin vector $M$ channel-wise. $R$ represents, for each channel separately, how far the fake feature is located from the real mean relative to the real-fake margin. $R_c$ of a channel $c$ is defined as follows.

$R_c = \begin{cases} \dfrac{D_c}{M_c}, & \text{if } |D_c| \geq \gamma \\ 1, & \text{otherwise} \end{cases}$   (8)

where $c$ indicates a channel of the feature vector. The role of $\gamma$ is to prevent features that are already sufficiently close to the real mean from being considered unrealistic. Finally, the suppression vector $S$ is created based on $R_c$.

(a) $\alpha=0.5$, $\beta=1$, $\epsilon=1.5$
(b) $\alpha=0$, $\beta=1$, $\epsilon=1$
Figure 3: Examples of the $R_c$-$S_c$ graph. (a) $\alpha=0.5$, $\beta=1$, $\epsilon=1.5$: features with $R_c<0.5$ are maintained, and features with $R_c>1$ are suppressed by half. (b) $\alpha=0$, $\beta=1$, $\epsilon=1$: features with $R_c<0$ are maintained, and features with $R_c>1$ are dismissed.
$S_c = \begin{cases} -\alpha + \epsilon, & \text{if } R_c < \alpha \\ -R_c + \epsilon, & \text{if } \alpha \leq R_c \leq \beta \\ -\beta + \epsilon, & \text{if } R_c > \beta \end{cases}$   (9)

In order to have continuously varying scores in the suppression vector according to $R_c$ within a given range, rather than the binary values 0 or 1 (use or drop), we define lower and upper bounds $\alpha$ and $\beta$ that constrain $R_c$. $\epsilon$ is added to shift the constrained range so that the suppression vector takes appropriate values. Figure 3 shows the operating principles of these hyper-parameters.
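A minimal sketch of the UFS module of Eqs. (4)-(9) as a small PyTorch helper; w is the weight vector of D_L, mu_real and mu_fake are the per-channel means stored during discriminator training, and all names, shapes, and the handling of near-zero margins are illustrative assumptions.

```python
import torch

class UFSModule:
    def __init__(self, alpha=1.0, beta=1.5, eps=2.0, gamma=1e-4):
        self.alpha, self.beta, self.eps, self.gamma = alpha, beta, eps, gamma
        self.mu_real = None
        self.mu_fake = None

    @torch.no_grad()
    def update_means(self, w, y_real, y_fake):
        # Eq. (4): per-channel means of w ⊗ Y over the batch (discriminator step)
        self.mu_real = (w * y_real).mean(dim=0)
        self.mu_fake = (w * y_fake).mean(dim=0)

    @torch.no_grad()
    def __call__(self, w, y_fake):
        y_hat = w * y_fake                        # Eq. (5): Y_hat = w ⊗ Y_Fake, shape (N, C)
        M = self.mu_real - self.mu_fake           # Eq. (6): real-fake margin per channel
        D = self.mu_real - y_hat                  # Eq. (7): distance from the real mean
        # Eq. (8): distance ratio; margins are assumed non-zero in this sketch
        R = torch.where(D.abs() >= self.gamma, D / M, torch.ones_like(D))
        # Eq. (9): clamp R to [alpha, beta], then negate and shift by eps
        return -R.clamp(self.alpha, self.beta) + self.eps
```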

3 Experimental Evaluation

First, we run the original Top-k GAN [22] with top-k, bottom-k, and random-k samples to compare the three cases and demonstrate the necessity of unrealistic feature suppression rather than realistic sample selection. Experimental evaluation of the proposed method is performed on CelebA [14], CIFAR-10 [10], and ImageNet [19]. We implement the proposed UFS module on WGAN-GP [6], SNGAN [16], and BigGAN [2].

3.1 Top-k and Bottom-k Samples

Figure 4: Experiments on various sampling techniques. We train WGAN-GP on the CIFAR-10 data set. Unlike the other methods, which show fluctuation, top-k sampling shows stable convergence. However, the best FID [7] scores are similar to each other (bottom-k = 15.6 and top-k = 16.1).

The performance of Top-k GAN [22] depends highly on the hyper-parameter used to select high-quality samples. Top-k GAN has shown experimentally on a Gaussian mixture model that bottom-k samples produce gradients pointing away from the optimal generator. However, this does not mean that bottom samples always contribute bad gradients throughout generator training, especially when training on real image data sets with complex distributions. Assume we have a moderately trained GAN whose generated fake samples divide into relatively realistic and relatively unrealistic groups. We then have two strategies: training with only top-k samples (concentrating on making realistic samples more realistic), and training with only bottom-k samples (concentrating on increasing the quality of unrealistic samples).

We conduct an experiment assuming that bottom samples are able to provide good gradients. Figure 4 shows our experimental results on the CIFAR-10 data set. Top-k and bottom-k use only the $k$ top/bottom samples in the training process. The batch size is 64 and the sampling hyper-parameter $k$ is set to 32. A modified WGAN-GP [6] is trained for 500 epochs. The FID score is measured every 10 epochs using 10k real and fake samples. The modified WGAN-GP replaces the 1x1 discriminator output with an 8x8 output. This has a data-augmentation-like effect that allows the last layer of the discriminator $D_L$ to see more diverse features, leading to improved performance. In Figure 4, top-k sampling is more stable with less fluctuation than bottom-k sampling, but the final FID scores are similar. We also test random-k sampling to see whether the sampling method itself produces good results regardless of selecting top-k or bottom-k samples. Random-k sampling initially seems to converge quickly, but after large fluctuations in the middle it reaches a result similar to the original method. In this test, we observe experimentally that bottom-k samples are also capable of providing good gradients, even though they bring a risk of negative gradients, which appear as increased fluctuation during training.
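One plausible reading of the modified WGAN-GP critic head described above (the 1x1 score replaced by an 8x8 score map), sketched as a 1x1 convolution acting as the last linear layer at every spatial location; the exact architecture is not specified in the text, so this is an assumption rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class PatchCriticHead(nn.Module):
    def __init__(self, in_channels: int):
        super().__init__()
        # 1x1 convolution plays the role of the last linear layer D_L at every location
        self.score = nn.Conv2d(in_channels, 1, kernel_size=1)

    def forward(self, features_8x8: torch.Tensor) -> torch.Tensor:
        # features_8x8: (N, C, 8, 8) feature map from the critic trunk D_R
        return self.score(features_8x8)   # (N, 1, 8, 8) score map, averaged in the loss
```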

3.2 CIFAR-10

Verification on CIFAR-10 is conducted using both the modified WGAN-GP [6] and SNGAN [16]. SNGAN results are discussed in Section 4.1. The modified WGAN-GP used in this test shows improved performance, as noted in the previous subsection.

Experiments on WGAN-GP   The modified WGAN-GP described above is used as the baseline in this test, and Table 1 summarizes the experimental results. First, WGAN-GP is combined with Top-k, Bottom-k, and Random-k sample selection and compared to the proposed UFS module implemented on WGAN-GP; the proposed (WGAN-GP + UFS module) achieves an FID score of 15.67. We also combine WGAN-GP with the UFS module and the Top-k, Bottom-k, and Random-k sample selection methods, among which (WGAN-GP + Top-k + UFS module) shows the best FID score of 15.87. These combinations are possible because instance sampling methods (Top-k, Bottom-k, and Random-k) select useful gradients at the sample level after the forward step is performed, whereas our UFS module selects useful gradients in the latent feature space during the forward step. Therefore, the combined use of both schemes finds useful gradients at the sample and feature levels simultaneously. All experiments are trained for 500 epochs, and FID scores are measured using 10k fake and real samples every 10 epochs. The hyper-parameters of the UFS module are $\alpha=0$, $\beta=1$, $\epsilon=1$, and $\gamma=0.0001$.

CIFAR-10 FID score
WGAN-GP 17.56
WGAN-GP + Random-k 17.29
WGAN-GP + Bottom-k 15.62
WGAN-GP + Top-k [22] 16.13
WGAN-GP + UFS (Ours) 15.67
WGAN-GP + Random-k + UFS (Ours) 16.10
WGAN-GP + Bottom-k + UFS (Ours) 16.64
WGAN-GP + Top-k + UFS (Ours) 15.87
Table 1: FID scores of WGAN-GP trained on CIFAR-10

3.3 CelebA

The CelebA data set is also tested with the modified WGAN-GP as the baseline. Networks in all tests are trained for 100 epochs using the WGAN adversarial loss. The hyper-parameters of the UFS module are $\alpha=0$, $\beta=1$, $\epsilon=1$, $\gamma=0.0001$. The batch size is 64, and the hyper-parameter $k$ used in Top-k GAN is set adaptively, starting from 64 and gradually decreasing to 32. Table 2 compares FID scores, which are calculated every epoch using 10k fake and real samples. (WGAN-GP + UFS module) shows the best FID score of 6.51, compared to 6.90 for WGAN-GP and 7.96 for Top-k GAN.

CelebA FID score
WGAN-GP 6.90
WGAN-GP + Top-k [22] 7.96
WGAN-GP + Top-k + UFS (Ours) 7.81
WGAN-GP + UFS (Ours) 6.51
Table 2: FID scores of WGAN-GP trained on CelebA.

3.4 ImageNet

ImageNet is a large-scale data set containing 1.2M images of 1000 classes. Because of the huge size of the data set, training a network on the original ImageNet takes a long time. For example, training on the 128×128 data set takes 2 weeks with 8 NVIDIA V100 GPUs, and 256×256 requires much longer. ImageNet results are also sensitive to batch size: Brock et al. [2] reported that increasing the batch size from 256 to 2048 improves both FID and IS [21]. Instead of using all of the data with a large batch size, we follow the methods of Miyato et al. [16] and DeVries et al. [3]. Miyato et al. use a subset of ImageNet named ImageNet dog and cat. Each class of ImageNet dog and cat shows similar characteristics, so networks can be trained more easily than with all ImageNet classes. DeVries et al. introduce instance selection, which accelerates training by reducing the number of samples. We test BigGAN on 64×64 ImageNet and 128×128 ImageNet with 50% instance selection.

ImageNet dog and cat   SNGAN trains its networks by selecting 143 classes out of 1000 (180k images in total). All selected classes are animal classes corresponding to species of dog and cat, and the training parameters are the same as SNGAN's baseline. For all experiments we use a batch size of 64, and the hyper-parameters of our method are $\alpha=1$, $\beta=1.5$, $\epsilon=2$, and $\gamma=0.0001$. $k$ for the Top-k sampling experiments starts at 64 and gradually decreases to 32. We train all models for 250k iterations. Table 3 and Figure 5 show our experimental results. The experiment combining Top-k sampling and UFS achieves the best FID score (18.84).

ImageNet Dog and Cat FID score IS score
SNGAN [16] 20.02 11.01
SNGAN + UFS (Ours) 19.59 11.66
SNGAN + Top-k [22] 18.99 11.86
SNGAN + Top-k + UFS (Ours) 18.84 12.55
Table 3: FID and IS scores on ImageNet dog and cat: Baseline model is SNGAN and all models are trained for 250k iterations.
(a) SNGAN + Top-k + UFS
(b) SNGAN + Top-k [22]
Figure 5: Random samples from SNGAN trained on 64×64 dog and cat images

ImageNet 64×64   DeVries et al. [3] reduce the size of the data set by removing outliers. Reducing the training data set causes a small performance degradation but dramatically improves training speed. DeVries et al. also reported that instance selection with too large a batch size degrades performance. In this test, we train BigGAN with instance selection using a batch size of 256 instead of 2048. $k$ for the Top-k sampling experiments starts at 256 and gradually decreases to 128. For instance selection, we use Inception-v3 as the data embedding function $F$ and a Gaussian model as the scoring function $H$. The retention ratio is 50%, i.e., we use only 50% of the data for training. All experiments are trained until mode collapse occurs.
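A minimal sketch of the instance-selection step described above: embed the data set with F (e.g., precomputed Inception-v3 features), score each sample with a Gaussian model H fit to the embeddings, and keep the top 50%; the Mahalanobis-distance scoring and all names are illustrative assumptions rather than the exact procedure of DeVries et al.

```python
import numpy as np

def select_instances(embeddings: np.ndarray, retention_ratio: float = 0.5) -> np.ndarray:
    # Fit a Gaussian model H to the embedded data set
    mu = embeddings.mean(axis=0)
    cov = np.cov(embeddings, rowvar=False) + 1e-6 * np.eye(embeddings.shape[1])
    cov_inv = np.linalg.inv(cov)
    # Score each sample by negative squared Mahalanobis distance: higher = more typical
    diff = embeddings - mu
    scores = -np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    # Keep the highest-scoring fraction of the data set (drop manifold outliers)
    n_keep = int(len(embeddings) * retention_ratio)
    return np.argsort(scores)[::-1][:n_keep]
```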

Figure 6, Figure 7, and Table 4 show the results of BigGAN trained on 50% of ImageNet. (BigGAN + UFS module) achieves an FID score of 7.84, which is worse than the original BigGAN (7.58). Top-k sampling achieves a better FID score (7.17), but its training is very slow compared to the original method: while the original method takes 310k iterations to reach its best FID of 7.58, Top-k takes 460k iterations. On the other hand, (BigGAN + Top-k + UFS module) is much faster than Top-k sampling alone. As shown in Figure 6, its initial training is slower than the original but faster than Top-k sampling. To reach an FID score of 10, the original takes 128k iterations, (BigGAN + Top-k + UFS module) takes 154k, and Top-k takes 194k. Comparing the best FID scores, (BigGAN + Top-k + UFS module) achieves a much better FID score (6.73) than the other methods. It is noteworthy that the experiments without Top-k sampling fall into mode collapse, while the experiments using Top-k sampling show no mode collapse even when trained for more than twice as many iterations.

Figure 6: FID scores of the 64×64 ImageNet experiments: BigGAN and (BigGAN + UFS) fall into mode collapse before 400k iterations.

Since the FID score is highly sensitive to the variance of the model distribution, we also check other metrics. Precision & Recall [11, 20] and Density & Coverage [17] examine the distribution manifolds through k-NN in an embedded space. We choose the best models of (BigGAN + Top-k) and (BigGAN + Top-k + UFS module) based on FID score. Except for the density metric, (BigGAN + Top-k + UFS module) achieves better scores than Top-k: precision (0.8547 vs. 0.8456), recall (0.6086 vs. 0.5960), density (1.303 vs. 1.306), and coverage (0.9406 vs. 0.9325). (BigGAN + Top-k + UFS module) achieves better precision but lower density than (BigGAN + Top-k), so it is hard to say which one is better in terms of how much of the generated samples overlap the real data manifold. However, (BigGAN + Top-k + UFS module) achieves better recall and coverage, which indicates that it generates more diverse images that overlap the real data manifold.
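A usage sketch for the Precision/Recall/Density/Coverage numbers above, relying on the prdc package released with Naeem et al. [17]; the feature file names and nearest_k value are assumptions for illustration.

```python
import numpy as np
from prdc import compute_prdc

# hypothetical files holding precomputed image embeddings for real and generated sets
real_feats = np.load("real_features.npy")
fake_feats = np.load("fake_features.npy")

metrics = compute_prdc(real_features=real_feats,
                       fake_features=fake_feats,
                       nearest_k=5)
print(metrics)  # dict with 'precision', 'recall', 'density', 'coverage'
```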

ImageNet 64×64 FID score IS score
BigGAN [2] 7.58 49.24
BigGAN + UFS (Ours) 7.84 47.17
BigGAN + Top-k [22] 7.17 45.05
BigGAN + Top-k + UFS (Ours) 6.73 49.93
Table 4: FID and IS scores of BigGAN trained on 64×64 ImageNet: To accelerate all experiments, we use instance selection to reduce the amount of training data by 50%.
(a) BigGAN + Top-k + UFS
(b) BigGAN + Top-k [22]
Figure 7: Samples from BigGAN trained on 64×64 ImageNet. We pick classes that clearly show differences in generated image quality.

ImageNet 128×128   To check the performance of our method on higher-resolution images, we train BigGAN on 128×128 ImageNet. We use instance selection to speed up training, with the same instance-selection hyper-parameters as in the 64×64 ImageNet experiments. The batch size is 128, but we accumulate gradients twice, following the experiments of DeVries et al. [3]. $k$ starts at 128 and gradually decreases to 64. To accelerate training, we also anneal $\beta$ gradually from 1 to 1.5, which means that we use full features early in training and gradually suppress unrealistic features over epochs. The other hyper-parameters are the same as in the previous experiments: $\alpha=1$, $\epsilon=2$, $\gamma=0.0001$. We train all models for 800k iterations. Figure 8 and Table 5 show the results of BigGAN trained on 50% of 128×128 ImageNet. Similar to the 64×64 experiments, Top-k sampling trains very slowly, which results in insufficient convergence within 800k iterations. BigGAN and (BigGAN + UFS module) fall into mode collapse before 400k iterations. (BigGAN + Top-k + UFS module) shows the best FID score of 8.78 and inception score of 111.01. We also check the other metrics: (BigGAN + Top-k + UFS module) achieves better scores than Top-k sampling in precision (0.8963 vs. 0.8852), recall (0.5488 vs. 0.5395), density (1.4594 vs. 1.3917), and coverage (0.9297 vs. 0.9101).

ImageNet 128×128 FID score IS score
BigGAN [2] 9.88 108.40
BigGAN + UFS (Ours) 11.08 94.83
BigGAN + Top-k [22] 9.76 108.86
BigGAN + Top-k + UFS (Ours) 8.78 111.01
Table 5: FID and IS scores of BigGAN trained on 128×128 ImageNet: To accelerate all experiments, we use instance selection to reduce the amount of training data by 50%.
(a) BigGAN + Top-k + UFS
(b) BigGAN + Top-k [22]
Figure 8: Samples from BigGAN trained on 128×128 ImageNet. We pick classes that clearly show differences in generated image quality.

3.5 How does suppression vector work?

To understand intuitively how $S$ helps training, we need to visualize what the discriminator actually sees. A class activation map [32] visualizes which regions of an image a CNN looks at when making its judgment. We obtain class activation maps in three different ways from pretrained BigGAN networks, calculated with the equations below.

Figure 9: Visualization of how the suppression vector $S$ works on the discriminator. $CAM$ is the ordinary class activation map of the discriminator, which highlights the most realistic areas of the images. $CAM_{UFS}$ and $CAM_{SUP}$ are class activation maps of the discriminator with the suppression vector $S$ and the suppressed vector $1-S$, respectively.
$CAM^{i,j} = \langle \tilde{Y}_{i,j}, w \rangle$
$CAM_{UFS}^{i,j} = \langle \tilde{Y}_{i,j} \otimes S, w \rangle$
$CAM_{SUP}^{i,j} = \langle \tilde{Y}_{i,j} \otimes (1-S), w \rangle$   (10)

where $\tilde{Y}$ denotes the feature map before global sum pooling is applied. Since the WGAN-based adversarial loss drives $D(x)$ to be greater than $D(G(z))$, the class activation map highlights the areas the discriminator considers most realistic. $CAM_{SUP}$ shows the regions of the most unrealistic features according to our UFS module, i.e., the regions that will be suppressed by our method. Conversely, $CAM_{UFS}$ shows the class activation map after unrealistic features are suppressed and therefore highlights the regions of realistic features according to the UFS module.
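A minimal sketch of the three activation maps in Eq. (10), assuming a feature map of shape (N, C, H, W) before global sum pooling, the D_L weight vector w of shape (C,), and a per-sample suppression vector S of shape (N, C); names are illustrative.

```python
import torch

def activation_maps(feat: torch.Tensor, w: torch.Tensor, S: torch.Tensor):
    w_map = w.view(1, -1, 1, 1)                           # broadcast weights over spatial dims
    S_map = S.view(S.size(0), -1, 1, 1)                   # broadcast suppression per channel
    cam     = (feat * w_map).sum(dim=1)                   # CAM^{i,j}     = <Y~_{i,j}, w>
    cam_ufs = (feat * S_map * w_map).sum(dim=1)           # CAM_UFS^{i,j} = <Y~_{i,j} ⊗ S, w>
    cam_sup = (feat * (1 - S_map) * w_map).sum(dim=1)     # CAM_SUP^{i,j} = <Y~_{i,j} ⊗ (1-S), w>
    return cam, cam_ufs, cam_sup
```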

Figure 9 shows sample activation maps. $CAM$ covers the foreground with its unique and distinctive appearance as well as the background with relatively rough and ambiguous visual patterns. Since the appearance of such background is easier to learn, $CAM$ tends to include unrealistic background regions that are relatively easy to learn. However, as the $CAM_{UFS}$ examples in Figure 9 show, the UFS module suppresses features that are far from the average real feature $\mu_{Real}$, effectively suppressing the unrealistic background regions and pushing the discriminator to concentrate more on the realistic foreground. Similar to Top-k GAN, which ignores unrealistic samples and trains the networks to make realistic samples better, the UFS module ignores unrealistic features and trains the networks to make realistic features better.

4 Discussion and Conclusion

4.1 Why unrealistic feature suppression instead of dismission?

SNGAN + UFS module tested on CIFAR-10 shows unstable training. Figure 10 shows the generator and discriminator losses over the entire training process. While the discriminator is trained quickly, the generator loses its direction for training. We conjecture that when the discriminator becomes too powerful, i.e., it easily distinguishes between real and fake features, the UFS module effectively dismisses most unrealistic features. We therefore adjust the hyper-parameters to suppress unrealistic features rather than dismiss them: by setting $0 < \epsilon - \beta < 1$, unrealistic features are suppressed but never fully removed. Table 6 summarizes our experiments. We observe that dismissing unrealistic features gives rise to severe mode collapse, whereas suppressing them does not. The empirically optimal hyper-parameters are $\alpha=1$, $\beta=1.5$, and $\epsilon=2$.
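As a quick check of Eq. (9), added here for clarity, the two hyper-parameter settings discussed above bound $S_c$ as follows:

```latex
\[
  S_c \in [\epsilon-\beta,\ \epsilon-\alpha]
  \;\Rightarrow\;
  \begin{cases}
    \alpha=1,\ \beta=1.5,\ \epsilon=2: & S_c \in [0.5,\ 1] \quad \text{(suppression)}\\[2pt]
    \alpha=0,\ \beta=1,\ \epsilon=1:   & S_c \in [0,\ 1]   \quad \text{(dismission)}
  \end{cases}
\]
```

Under the suppression setting every channel therefore keeps at least half of its contribution, whereas the dismission setting can zero a channel out entirely.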

(a) $\alpha=0$, $\beta=1$, $\epsilon=1$
(b) $\alpha=1$, $\beta=1.5$, $\epsilon=2$
Figure 10: Training loss of the SNGAN experiments on CIFAR-10. (a) $\alpha=0$, $\beta=1$, $\epsilon=1$: the discriminator converges in early iterations, after which the generator loss becomes unstable and the networks go to mode collapse. (b) $\alpha=1$, $\beta=1.5$, $\epsilon=2$: this constrains $S_c$ to the range $[0.5, 1]$.
$\alpha$ $\beta$ $\epsilon$ FID score IS score UFS
0 1 1 43.37 6.20 Dismission
1 2 2.5 19.47 7.93 Suppression
1 2 3 19.54 8.26 Suppression
1 3 3 50.57 6.70 Dismission
1 1.2 2 18.07 8.09 Suppression
1 1.3 2 19.29 7.92 Suppression
1 1.4 2 18.82 8.09 Suppression
1 1.5 2 17.10 8.50 Suppression
Table 6: Experiments to find proper hyper-parameters for our method. The baseline model is SNGAN and the data set is CIFAR-10. We train all models for 250k iterations. Suppression means suppressing unrealistic features rather than dismissing them by setting $0 < \epsilon - \beta < 1$.

4.2 Conclusion

In this work, we have proposed an unrealistic feature suppression (UFS) module that suppresses unrealistic features during generator training. The effectiveness of the UFS module has been demonstrated through extensive experimental evaluations on various backbone networks such as WGAN-GP, SNGAN, and BigGAN. In the ImageNet experiments, we show that a method combining Top-k selection and the UFS module converges faster and better than prior methods.

References

  • [1] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International conference on machine learning, pages 214–223. PMLR, 2017.
  • [2] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2018.
  • [3] Terrance DeVries, Michal Drozdzal, and Graham W Taylor. Instance selection for gans. NIPS, 2020.
  • [4] Xitong Gao, Yiren Zhao, Łukasz Dudziak, Robert Mullins, and Cheng-zhong Xu. Dynamic channel pruning: Feature boosting and suppression. In International Conference on Learning Representations, 2018.
  • [5] Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
  • [6] Ishaan Gulrajani, Faruk Ahmed, Martín Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In NIPS, 2017.
  • [7] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NIPS, 2017.
  • [8] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018.
  • [9] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017.
  • [10] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • [11] Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. In NeurIPS, 2019.
  • [12] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4681–4690, 2017.
  • [13] Jae Hyun Lim and Jong Chul Ye. Geometric gan. arXiv preprint arXiv:1705.02894, 2017.
  • [14] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.
  • [15] Shuang Ma, Jianlong Fu, Chang Wen Chen, and Tao Mei. Da-gan: Instance-level image translation by deep attention generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5657–5666, 2018.
  • [16] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.
  • [17] Muhammad Ferjad Naeem, Seong Joon Oh, Youngjung Uh, Yunjey Choi, and Jaejun Yoo. Reliable fidelity and diversity metrics for generative models. In International Conference on Machine Learning, pages 7176–7185. PMLR, 2020.
  • [18] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2337–2346, 2019.
  • [19] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.
  • [20] Mehdi SM Sajjadi, Olivier Bachem, Mario Lucic, Olivier Bousquet, and Sylvain Gelly. Assessing generative models via precision and recall. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 5234–5243, 2018.
  • [21] Tim Salimans, Ian J Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In NIPS, 2016.
  • [22] Samarth Sinha, Zhengli Zhao, Anirudh Goyal, Colin Raffel, and Augustus Odena. Top-k training of gans: Improving gan performance by throwing away bad samples. NIPS, 2020.
  • [23] Jianwei Song and Ruoyu Yang. Feature boosting, suppression, and diversification for fine-grained visual classification. arXiv preprint arXiv:2103.02782, 2021.
  • [24] Hao Tang, Dan Xu, Nicu Sebe, Yanzhi Wang, Jason J Corso, and Yan Yan. Multi-channel attention selection gan with cascaded semantic guidance for cross-view image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2417–2426, 2019.
  • [25] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8798–8807, 2018.
  • [26] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7794–7803, 2018.
  • [27] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. Cbam: Convolutional block attention module. In Proceedings of the European conference on computer vision (ECCV), pages 3–19, 2018.
  • [28] Yan Wu, Jeff Donahue, David Balduzzi, Karen Simonyan, and Timothy Lillicrap. Logan: Latent optimisation for generative adversarial networks. arXiv preprint arXiv:1912.00953, 2019.
  • [29] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1316–1324, 2018.
  • [30] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. In International conference on machine learning, pages 7354–7363. PMLR, 2019.
  • [31] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 5907–5915, 2017.
  • [32] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2921–2929, 2016.