
A New Formulation of Lipschitz Constrained With Functional Gradient Learning for GANs

Chang Wan, Ke Fan, Xinwei Sun, Yanwei Fu, Minglu Li, Yunliang Jiang, Zhonglong Zheng [1] (corresponding author)

[1] School of Computer Science and Technology, Zhejiang Normal University, No. 688 Yingbin Avenue, Jinhua 321004, Zhejiang, China

[2] School of Data Science and MOE Frontiers Center for Brain Science, Fudan University, No. 220 Handan Road, Shanghai 200433, Shanghai, China

[3] Fudan ISTBI-JNU Algorithm Centre for Brain inspired Intelligence, Zhejiang Normal University, No. 688 Yingbin Avenue, Jinhua 321004, Zhejiang, China

Abstract

This paper introduces a promising alternative method for training Generative Adversarial Networks (GANs) on large-scale datasets with clear theoretical guarantees. GANs are typically learned through a minimax game between a generator and a discriminator, which is known to be empirically unstable. Previous learning paradigms have encountered mode collapse issues without a theoretical solution. To address these challenges, we propose a novel Lipschitz-constrained Functional Gradient GANs learning (Li-CFG) method that stabilizes GAN training and provides a theoretical foundation for effectively increasing the diversity of synthetic samples by reducing the neighborhood size of the latent vector. Specifically, we demonstrate that the neighborhood size of the latent vector can be reduced by increasing the norm of the discriminator gradient, resulting in enhanced diversity of synthetic samples. To efficiently enlarge the norm of the discriminator gradient, we introduce a novel $\boldsymbol{\varepsilon}$-centered gradient penalty that amplifies the norm of the discriminator gradient using the hyper-parameter $\boldsymbol{\varepsilon}$. In comparison to other constraints, our method enlarges the discriminator gradient norm, thus obtaining the smallest neighborhood size of the latent vector. Extensive experiments on benchmark datasets for image generation demonstrate the efficacy of the Li-CFG method and the $\boldsymbol{\varepsilon}$-centered gradient penalty. The results showcase improved stability and increased diversity of synthetic samples.

keywords:
Generative Adversarial Nets, Functional Gradient Methods, New Lipschitz Constraint, Synthesis Diversity
Figure 1: Highlighting Diversity. We underscore the significance of diversity in image synthesis. The left column (a) and right column (b) display horse label images generated using the CFG and Li-CFG methods trained on the CIFAR-10 dataset, respectively.

1 Introduction

GANs are designed to sample a random variable $z$ from a known distribution $p_{z}$ and transform it to approximate the underlying data distribution $p_{*}$. This learning process is modeled as a minimax game in which the generator and discriminator are iteratively optimized, as introduced by Goodfellow et al. [1]. The generator produces samples that mimic the true data distribution, while the discriminator distinguishes between generated data and real data samples. Despite numerous remarkable efforts [2, 3], GAN learning still suffers from training instability and mode collapse.

Recently, Composite Functional Gradient Learning (CFG), proposed by Johnson and Zhang [4], has gained attention. CFG utilizes a strong discriminator and functional gradient learning for the generator, leading to GAN learning that converges both theoretically and empirically. However, we have observed that CFG still has various hyper-parameters that may significantly impact the GAN learning process. Despite the advances in stability, one still needs to set these hyper-parameters carefully, and tuning them properly remains essential for obtaining a well-trained GAN. This issue hinders the widespread adoption of the CFG method for training GANs on real-world and large-scale datasets. One of the most effective mechanisms for addressing this issue in GAN training is the Lipschitz constraint, with the $0$-centered gradient penalty introduced by Mescheder et al. [5] and the $1$-centered gradient penalty introduced by Gulrajani et al. [3] being among the most well-known. Empirically, stable training of a GAN can result in more diverse synthesis results. However, the theoretical foundation for stable GAN training using the Lipschitz gradient penalty and for generating diverse synthetic samples is still unclear. It is therefore important to develop a new theoretical framework that accounts for both the Lipschitz constraint of the discriminator and the diversity of synthetic samples.

To address these challenges, we propose Lipschitz Constrained Functional Gradient GANs Learning (Li-CFG), an improved version of CFG. We present a comparative analysis of our Li-CFG and CFG methods in Fig. 2 and Example 1.1. Additionally, we emphasize the importance of diversity in synthetic samples through Fig. 1. As shown in Fig. 1, the horse images generated by Li-CFG display a more diverse range of gestures, colors, and textures than those produced by the CFG method. When synthetic samples lack diversity, they fail to capture essential image characteristics. While recent generative models have made notable progress in enhancing diversity, a comprehensive theoretical explanation of this diversity is still missing.

Figure 2: CFG method vs. Li-CFG. The left and right figures show the FID and the norm of the discriminator gradient $\|\nabla_{\boldsymbol{x}}D(\boldsymbol{x})\|$, respectively, for the CFG method and Li-CFG with different values of the hyper-parameter $\delta(\boldsymbol{x})$. FID is a metric that measures the diversity between synthetic samples and real samples. A lower score is better. More information about FID is presented in Section 5.1. The hyper-parameter $\delta(\boldsymbol{x})$, defined in Eq. (5), is important because it controls the gradient magnitude for the CFG method. Solid and dashed lines of the same color in both figures indicate the FID and $\|\nabla_{\boldsymbol{x}}D(\boldsymbol{x})\|$ results of the CFG method and Li-CFG with the same $\delta(\boldsymbol{x})$.
Example 1.1.

Fig. 2 illustrates that an overly large $\|\nabla_{\boldsymbol{x}}D(\boldsymbol{x})\|$ leads to an excessively small neighborhood of the latent vector, which leaves the model untrained, whereas an overly small $\|\nabla_{\boldsymbol{x}}D(\boldsymbol{x})\|$ leads to an overly large neighborhood of the latent vector, which degrades diversity. For instance, the blue dashed lines, with a reasonable $\|\nabla_{\boldsymbol{x}}D(\boldsymbol{x})\|$, achieve the best FID among all colors; the green solid lines, with a too-large $\|\nabla_{\boldsymbol{x}}D(\boldsymbol{x})\|$, yield an untrained result; and the red dashed lines, with a too-small $\|\nabla_{\boldsymbol{x}}D(\boldsymbol{x})\|$, yield worse diversity. The training results for different $\delta(\boldsymbol{x})$ show that the FID of the CFG method varies dramatically across $\delta(\boldsymbol{x})$ because of unstable changes in $\|\nabla_{\boldsymbol{x}}D(\boldsymbol{x})\|$. In contrast, the FID of our Li-CFG is more stable across $\delta(\boldsymbol{x})$, because the gradient penalty keeps $\|\nabla_{\boldsymbol{x}}D(\boldsymbol{x})\|$ smooth. By controlling $\|\nabla_{\boldsymbol{x}}D(\boldsymbol{x})\|$ through the $\boldsymbol{\varepsilon}$ value of our $\boldsymbol{\varepsilon}$-centered gradient penalty, we can adjust the neighborhood size of the latent vector and consequently influence the diversity of synthetic samples.

Our key insight is the introduction of Lipschitz continuity, a robust form of uniform continuity, into CFG. This provides a theoretical basis showing that synthetic sample diversity can be enhanced by reducing the latent vector neighborhood size through a discriminator constraint. For simplicity and clarity, we use latent N-size to denote the neighborhood size of the latent vector and constraint to refer to the discriminator constraint for the rest of the article. First, Li-CFG integrates the Lipschitz constraint into CFG to tackle training instability under dynamic theory. Second, we establish a theoretical link between the discriminator gradient norm and the latent N-size: increasing the discriminator gradient norm reduces the latent N-size, thus enhancing the diversity of synthetic samples. Lastly, to efficiently adjust the discriminator gradient norm, we introduce a novel Lipschitz constraint mechanism, the $\boldsymbol{\varepsilon}$-centered gradient penalty. This mechanism enables fine-tuning of the latent N-size by varying the hyper-parameter $\boldsymbol{\varepsilon}$. Through this approach, we aim to achieve more effective control over the discriminator gradient norm and further improve the diversity of generated samples.

We summarize the key contributions of this work as follows: (1) We introduce a novel Lipschitz constraint to CFG, the $\boldsymbol{\varepsilon}$-centered gradient penalty, addressing the hyperparameter sensitivity in regression-based GAN training. Our Li-CFG method enables stable GAN training, producing superior results compared to traditional CFG. (2) We present a new perspective on analyzing the relationship between the discriminator constraint and synthetic sample diversity. To the best of our knowledge, we are the first to explore this relationship, demonstrating that our $\boldsymbol{\varepsilon}$-centered gradient penalty allows for effective control of diversity during training. (3) Our empirical studies highlight the superiority of Li-CFG over CFG across a variety of datasets, including synthetic and real-world data such as MNIST, CIFAR10, LSUN, and ImageNet. In an ablation study, we show a trade-off between diversity and model trainability. Additionally, we demonstrate the generalizability of our $\boldsymbol{\varepsilon}$-centered gradient penalty across multiple GAN models, achieving better results compared to existing gradient penalties.

2 Preliminary

To enhance readability, we provide definitions for frequently used mathematical symbols in our theory. For CFG and dynamic theory, the symbols will be explained in the respective sections.

Notation. $G_{\theta}(z):\mathcal{Z}\rightarrow\mathcal{Y}$ represents a generator model that maps an element of the latent space $\mathcal{Z}$ to the image space $\mathcal{Y}$, where the parameters of the generator are denoted by $\theta$. Here $z$ is a sample drawn from a low-dimensional latent distribution $p_{z}$. We use $G_{\theta}$ and $G$ interchangeably to refer to the generator throughout this article.

$D_{\psi}:\mathcal{Y}\rightarrow\mathcal{C}$ represents a discriminator model that maps elements of the image space $\mathcal{Y}$ to the probability of belonging to the real data distribution or the generated data distribution, where the parameters of the discriminator are denoted by $\psi$. Real samples in $\mathcal{Y}$ are drawn from the real data distribution $p_{*}$. $\mathcal{C}$ consists of the scalar values output by the discriminator. The symbol $D$ in the CFG method and the notation $D_{\psi}$ used in both the gradient penalty and our neighborhood theory all represent the discriminator.

We use $R$ to denote the gradient penalty. Furthermore, we use $R_{1}$ to denote the $1$-centered gradient penalty, $R_{0}$ to denote the $0$-centered gradient penalty, and $R_{\boldsymbol{\varepsilon}}$ to denote the $\boldsymbol{\varepsilon}$-centered gradient penalty. $r_{R}$ stands for the latent N-size of the corresponding gradient penalty. The symbol $\hat{\epsilon}$ in the definition of our neighborhood method represents a small quantity in image space. It is distinct from the symbol $\boldsymbol{\varepsilon}$ used in our $\boldsymbol{\varepsilon}$-centered GP method.

Composite Functional Gradient GANs. We follow the definition in CFG [6]. CFG employs a discriminator that works as a logistic regressor to differentiate real samples from synthetic samples. Meanwhile, it uses functional gradient composition to learn the generator in the following form:

$$G_{m}(\boldsymbol{z})=G_{m-1}(\boldsymbol{z})+\eta_{m}g_{m}(G_{m-1}(\boldsymbol{z})),\quad m=1,\ldots,M, \tag{1}$$

to obtain $G(\boldsymbol{z})=G_{M}(\boldsymbol{z})$. Here $M$ is the number of steps the generator uses to approximate the distribution of real data samples. Each $g_{m}$ is a function to be estimated from data. It acts as a residual that gradually moves the generated samples of step $m-1$, $G_{m-1}(\boldsymbol{z})$, towards those of step $m$, $G_{m}(\boldsymbol{z})$. $g_{m}$ guarantees that the distance between the latent distribution $p_{z}$ and the real data distribution $p_{*}$ gradually decreases until it reaches zero, and $\eta_{m}$ is a small step size.
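The composition in Eq. (1) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the one-dimensional setting, the hand-written `grad_D` (the gradient of a hypothetical discriminator $D(x)=-(x-3)^2/2$, whose gradient points towards $x=3$), the fixed step size, and the names `compose_generator`, `eta`, and `M` are all introduced only for the example.

```python
def grad_D(x):
    # Gradient of a hypothetical discriminator D(x) = -(x - 3)^2 / 2.
    # It is an assumption for illustration; it points towards x = 3.
    return 3.0 - x

def compose_generator(g0, residual, eta, M):
    """Build G_M from G_0 via G_m(z) = G_{m-1}(z) + eta * g(G_{m-1}(z))."""
    def G(z):
        x = g0(z)
        for _ in range(M):
            x = x + eta * residual(x)  # one functional-gradient step
        return x
    return G

# Initial generator: identity map on a scalar latent sample (assumed).
G0 = lambda z: z
G = compose_generator(G0, grad_D, eta=0.1, M=50)

# Samples started from different latent values all drift towards x = 3.
samples = [G(z) for z in (-2.0, 0.0, 5.0)]
```

With a constant residual function the loop is a plain fixed-step iteration; in CFG the residual $g_m$ is re-estimated from the discriminator at every step.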

To simplify the analysis of this problem, we first turn the discrete step index $m$ into a continuous one. We rewrite Eq. (1) as $G_{m+\delta}(Z)=G_{m}(Z)+\delta g_{m}\left(G_{m}(Z)\right)$ by setting $\eta_{m}=\delta$, where $\delta$ is a small time step. Letting $\delta\rightarrow 0$, we obtain a generator that evolves continuously in time $m$ and satisfies the ordinary differential equation

$$\frac{d\left(G_{m}(Z)\right)}{dm}=g_{m}\left(G_{m}(Z)\right).$$

The goal is to learn $g_{m}:\mathbb{R}^{k}\rightarrow\mathbb{R}^{k}$ from data so that the probability density $p_{m}$ of $G_{m}(\boldsymbol{z})$, which continuously evolves by Eq. (1), becomes close to the density $p_{*}$ of real data as $m$ continuously increases. To measure this closeness, we let $L$ denote a distance measure between two distributions:

$$L(p_{m})=\int\ell\left(p_{*}(x),p_{m}(x)\right)dx,$$

where $\ell:\mathbb{R}^{2}\rightarrow\mathbb{R}$ is a pre-defined function such that $L(p_{m})=0$ if and only if $p_{m}=p_{*}$, and $L(p)\geq 0$ for any probability density function $p$.

From the above equation, we derive a choice of $g_{m}(\cdot)$ that guarantees that the transformation in Eq. (1) always reduces $L(\cdot)$. Let $p_{m}$ be the probability density of the random variable $G_{m}(\boldsymbol{z})$, and let $\ell_{2}^{\prime}\left(\rho_{*},\rho\right)=\partial\ell\left(\rho_{*},\rho\right)/\partial\rho$. Then we have

$$\frac{dL\left(p_{m}\right)}{dm}=\int p_{m}(x)\nabla_{x}\ell_{2}^{\prime}\left(p_{*}(x),p_{m}(x)\right)\cdot g_{m}(x)\,dx.$$

Given this expression for $\frac{dL\left(p_{m}\right)}{dm}$, we want it to stay negative so that the distance $L$ decreases. To achieve this, we choose $g_{m}(x)$ to be

$$g_{m}(\boldsymbol{x})=-s_{m}(\boldsymbol{x})\phi_{0}\left(\nabla_{x}\ell_{2}^{\prime}\left(p_{*}(\boldsymbol{x}),p_{m}(\boldsymbol{x})\right)\right), \tag{2}$$

where $s_{m}(x)>0$ is an arbitrary scaling factor and $\phi_{0}(u)$ is a vector function such that $\phi(u)=u\cdot\phi_{0}(u)\geq 0$, with $\phi(u)=0$ if and only if $u=0$. Two examples are $(\phi_{0}(u)=u,\ \phi(u)=\|u\|_{2}^{2})$ and $(\phi_{0}(u)=\operatorname{sign}(u),\ \phi(u)=\|u\|_{1})$.

With this choice of $g_{m}(x)$, we obtain

$$\frac{dL\left(p_{m}\right)}{dm}=-\int s_{m}(x)p_{m}(x)\phi\left(\nabla_{x}\ell_{2}^{\prime}\left(p_{*}(x),p_{m}(x)\right)\right)dx\leq 0, \tag{3}$$

that is, the distance $L$ is guaranteed to decrease unless the equality holds. Moreover, this implies that $\lim_{m\rightarrow\infty}\int s_{m}(x)p_{m}(x)\phi\left(\nabla_{x}\ell_{2}^{\prime}\left(p_{*}(x),p_{m}(x)\right)\right)dx=0$. (Otherwise, $L\left(p_{m}\right)$ would keep decreasing and become negative as $m$ increases, but $L\left(p_{m}\right)\geq 0$ by definition.) For simplicity, we omit the subscript $m$ in the following empirical settings.

Let us consider a case where the distance measure $L(\cdot)$ is an $f$-divergence. With a convex function $f:\mathbb{R}^{+}\rightarrow\mathbb{R}$ such that $f(1)=0$ and $f$ is strictly convex at $1$, $L\left(p_{m}\right)$ defined by

$$L\left(p_{m}\right)=\int p_{*}(x)f\left(r_{m}(x)\right)dx\quad\text{where}\quad r_{m}(x)=\frac{p_{m}(x)}{p_{*}(x)}$$

is called an $f$-divergence. Here we focus on the special case where $f$ is twice differentiable and strongly convex, so that the second-order derivative of $f$, denoted by $f^{\prime\prime}$, is always positive. For instance, for the KL divergence, $f(x)=-\ln x$, in which case $f^{\prime}(x)=-1/x$ and $f^{\prime\prime}(x)=1/x^{2}$. For the reverse KL divergence, $f(x)=x\ln x$, in which case $f^{\prime}(x)=\ln x+1$ and $f^{\prime\prime}(x)=1/x$.
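The two closed-form second derivatives quoted above can be checked numerically. The snippet below is only a sanity check under assumptions of our own: the central finite-difference helper `second_derivative`, its step size `h`, and the test points are illustrative choices, not part of the paper.

```python
import math

def second_derivative(f, x, h=1e-4):
    """Central finite-difference estimate of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

# Generator functions of the two f-divergences named in the text.
f_kl = lambda x: -math.log(x)        # KL divergence: f'' = 1 / x^2
f_rkl = lambda x: x * math.log(x)    # reverse KL divergence: f'' = 1 / x

# Pairs of (numerical estimate, closed form) at a few positive points.
checks = []
for x in (0.5, 1.0, 2.0):
    checks.append((second_derivative(f_kl, x), 1.0 / x ** 2))
    checks.append((second_derivative(f_rkl, x), 1.0 / x))
```

Both functions vanish at $x=1$ and have a strictly positive second derivative on the positive reals, which is the property the derivation relies on.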

With this definition of $f$ and with $L$ an $f$-divergence, $g_{m}(x)$ should be

$$\begin{aligned} g_{m}(\boldsymbol{x})&=-s_{m}(\boldsymbol{x})\phi_{0}\left(\nabla_{x}\ell_{2}^{\prime}\left(p_{*}(\boldsymbol{x}),p_{m}(\boldsymbol{x})\right)\right)\\ &=-s_{m}(\boldsymbol{x})\nabla_{x}\ell_{2}^{\prime}\left(p_{*}(\boldsymbol{x}),p_{m}(\boldsymbol{x})\right)\\ &=-s_{m}(\boldsymbol{x})f^{\prime\prime}\left(r_{m}(\boldsymbol{x})\right)\nabla r_{m}(\boldsymbol{x})\\ &\approx s_{m}(\boldsymbol{x})f^{\prime\prime}\left(\tilde{r}_{m}(\boldsymbol{x})\right)\tilde{r}_{m}(\boldsymbol{x})\nabla_{x}D(\boldsymbol{x}),\end{aligned} \tag{4}$$

where $s_{m}(x)>0$ is an arbitrary scaling factor, $\ell\left(\rho_{*},\rho\right)=\rho_{*}f\left(\rho/\rho_{*}\right)$, $\nabla_{x}\ell_{2}^{\prime}\left(p_{*}(x),p_{m}(x)\right)=f^{\prime\prime}\left(r_{m}(x)\right)\nabla r_{m}(x)$, $f^{\prime\prime}(x)=1/x^{2}$ for the KL divergence, and $\tilde{r}_{m}(\boldsymbol{x})=\exp(-D(\boldsymbol{x}))\approx p_{m}(x)/p_{*}(x)=r_{m}(\boldsymbol{x})$ when $D(x)\approx\ln\frac{p_{*}(x)}{p_{m}(x)}$, which is the analytic solution of the CFG discriminator.
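The sign change in the last line of Eq. (4) follows from the chain rule. The short derivation below spells it out, taking $\phi_{0}(u)=u$ and the approximation $\tilde{r}_{m}(\boldsymbol{x})=\exp(-D(\boldsymbol{x}))$ as given:

```latex
\begin{aligned}
\nabla_{x}\tilde{r}_{m}(\boldsymbol{x})
  &= \nabla_{x}\exp\left(-D(\boldsymbol{x})\right)
   = -\tilde{r}_{m}(\boldsymbol{x})\,\nabla_{x}D(\boldsymbol{x}),\\
-s_{m}(\boldsymbol{x})\,f^{\prime\prime}\left(\tilde{r}_{m}(\boldsymbol{x})\right)\nabla_{x}\tilde{r}_{m}(\boldsymbol{x})
  &= s_{m}(\boldsymbol{x})\,f^{\prime\prime}\left(\tilde{r}_{m}(\boldsymbol{x})\right)\tilde{r}_{m}(\boldsymbol{x})\,\nabla_{x}D(\boldsymbol{x}).
\end{aligned}
```

The generated samples therefore move along the discriminator gradient, scaled by a strictly positive factor.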

Empirically, we define the function in Eq. (4) as $g(\boldsymbol{x})=\delta(\boldsymbol{x})\nabla_{\boldsymbol{x}}D(\boldsymbol{x})$, where $\delta(\boldsymbol{x})$ can be computed as

$$\delta(\boldsymbol{x})=s_{m}(\boldsymbol{x})\tilde{r}_{m}(\boldsymbol{x})f^{\prime\prime}(\tilde{r}_{m}(\boldsymbol{x})), \tag{5}$$

where $s_{m}(\boldsymbol{x})$ is an arbitrary scaling factor, $f(x)=-\ln x$ corresponds to the KL divergence with $f^{\prime\prime}(x)=1/x^{2}$, and $\tilde{r}_{m}(\boldsymbol{x})=\exp(-D(\boldsymbol{x}))$. Since $\tilde{r}_{m}(\boldsymbol{x})f^{\prime\prime}(\tilde{r}_{m}(\boldsymbol{x}))>0$, it can be absorbed into $s_{m}(x)$, so the value of $\delta(\boldsymbol{x})$ is always greater than 0. For simplicity, we directly regard $\delta(\boldsymbol{x})$ as the scaling factor and set it to a fixed value as a hyper-parameter. We also show the results for different $\delta(\boldsymbol{x})$ values in Fig. 2.
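As a small worked instance of Eq. (5), the sketch below evaluates $\delta$ for the KL case from a scalar discriminator output. The function name `delta_kl` and the argument names are hypothetical. The paper itself fixes $\delta(\boldsymbol{x})$ as a constant hyper-parameter, so this only illustrates why the factor is positive.

```python
import math

def delta_kl(d_value, s=1.0):
    """delta = s * r_tilde * f''(r_tilde) for the KL case, with f''(r) = 1 / r^2."""
    r_tilde = math.exp(-d_value)       # r_tilde = exp(-D(x))
    f_second = 1.0 / r_tilde ** 2      # second derivative of f(r) = -ln r
    return s * r_tilde * f_second      # algebraically equal to s * exp(D(x))

# The factor is positive for any real discriminator output.
values = [delta_kl(d) for d in (-3.0, 0.0, 1.5)]
```

For the KL case the product simplifies to `s * exp(D(x))`, which is positive whenever `s` is.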

Gradient Penalty for GANs. Let us work with the most commonly used WGAN-GP [3]. Its regularization term is

$$R(\theta,\psi)=\frac{\gamma}{2}\mathrm{E}_{\hat{\boldsymbol{x}}}\left(\left\|\nabla_{\boldsymbol{x}}D_{\psi}(\hat{\boldsymbol{x}})\right\|-g_{0}\right)^{2}, \tag{6}$$

where $\hat{\boldsymbol{x}}$ is sampled uniformly on the line segment between two random points $\boldsymbol{x}_{1}\sim p_{\theta}\left(\boldsymbol{x}_{1}\right)$ and $\boldsymbol{x}_{2}\sim p_{\mathcal{D}}\left(\boldsymbol{x}_{2}\right)$. $R(\theta,\psi)$ denotes the regularization term for the generator with weights $\theta$ and the discriminator with weights $\psi$, and $\gamma$ is a coefficient that controls the magnitude of regularization. The value of $g_{0}$ is empirically set to 1, so we call this penalty the $g_{0}$-centered GP or $1$-centered GP. The alternative is the $0$-centered GP, proposed by Roth et al. [7] and Mescheder et al. [5]. Its formulation is

$$R(\theta,\psi)=\frac{\gamma}{2}\mathrm{E}_{\hat{\boldsymbol{x}}}\left[\left\|\nabla_{\hat{\boldsymbol{x}}}D_{\psi}(\hat{\boldsymbol{x}})\right\|^{2}\right], \tag{7}$$

where $\hat{\boldsymbol{x}}$ is sampled uniformly on the line segment between two random points $\boldsymbol{x}_{1}\sim p_{\theta}\left(\boldsymbol{x}_{1}\right)$ and $\boldsymbol{x}_{2}\sim p_{\mathcal{D}}\left(\boldsymbol{x}_{2}\right)$, as in WGAN-GP.
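Both penalties can be estimated by the same Monte Carlo routine, since Eq. (7) is Eq. (6) with $g_{0}=0$. The sketch below is a toy illustration under assumptions of our own: `toy_grad_D` is the analytic input gradient of a hypothetical quadratic discriminator, and the function names, the sample points, and the seed are made up for the example. A real implementation would obtain the gradient by automatic differentiation.

```python
import math
import random

def l2_norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

def toy_grad_D(x, w=(2.0, -1.0), c=0.5):
    # Analytic input gradient of a hypothetical D(x) = w.x + (c/2) * ||x||^2.
    return [wi + c * xi for wi, xi in zip(w, x)]

def centered_gp(fakes, reals, grad_fn, g0, gamma=10.0, seed=0):
    """Estimate (gamma / 2) * E[(||grad D(x_hat)|| - g0)^2] over sample pairs."""
    rng = random.Random(seed)
    total = 0.0
    for x_fake, x_real in zip(fakes, reals):
        t = rng.random()  # uniform position on the segment between the pair
        x_hat = [t * a + (1.0 - t) * b for a, b in zip(x_fake, x_real)]
        total += (l2_norm(grad_fn(x_hat)) - g0) ** 2
    return 0.5 * gamma * total / len(fakes)

fakes = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
reals = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
r1 = centered_gp(fakes, reals, toy_grad_D, g0=1.0)  # 1-centered GP, Eq. (6)
r0 = centered_gp(fakes, reals, toy_grad_D, g0=0.0)  # 0-centered GP, Eq. (7)
```

The 1-centered penalty is minimized when the gradient norm equals 1, while the 0-centered penalty is minimized when the gradient vanishes.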

When integrating the gradient penalty into the GAN model, it is expressed as a regularization term following the loss function of the GAN. The formulation is represented as

$$\min_{G_{\theta}}\max_{D_{\psi}}L_{GAN}(G_{\theta},D_{\psi})=\mathbb{E}_{\boldsymbol{z}\sim p_{z}}\left[f\left(D_{\psi}\left(G_{\theta}(\boldsymbol{z})\right)\right)\right]+\mathbb{E}_{\boldsymbol{x}\sim p_{*}}\left[f\left(-D_{\psi}(\boldsymbol{x})\right)\right]+\lambda R(\theta,\psi), \tag{8}$$

where $\lambda$ controls the importance of the regularization, $R(\theta,\psi)$ denotes the gradient penalty term, and $f$ represents a general function from [3] that takes different forms in various GAN models. This type of loss function indicates that the gradient penalty can influence the variability of the discriminator's gradient, thereby affecting the synthetic samples produced by the generator.

Dynamic Theory for CFG. We incorporate dynamic theory to gain a theoretical understanding of the equivalence between the CFG method and the common GAN theory. However, the CFG method also faces challenges related to unstable training and the lack of local convergence near the Nash-equilibrium point. A summary of the key outcomes is provided in Appendix B.

3 Methodology

Overview. This section is organized as follows. In Section 3.1, we present the definition of the latent N-size with the gradient penalty. Guided by this, in Section 3.2, we introduce our regularization, termed the $\boldsymbol{\varepsilon}$-centered gradient penalty. In Section 3.3, we present our main theorem to demonstrate the connection between the latent N-size and the various gradient penalties. Proofs of these theorems are provided in Appendix C.

3.1 Latent N-size with gradient penalty

In this section, our objective is to reveal the interrelation between the latent N-size and the gradient penalty. First, we will define the latent N-size, along with an intuitive explanation. Then, we will explain three basic definitions of the latent N-size by showing how it relates to the diversity of synthetic samples. Additionally, building on the previous step, we will discuss the expansion of the latent N-size by including the gradient penalty.

Latent N-size. We present the definition of latent N-size, which forms the basis for the subsequent theory.

Definition 3.1 (Latent Neighborhood Size).

Let $\boldsymbol{z}_{1}$, $\boldsymbol{z}_{2}$ be two samples in the latent space. Suppose $\boldsymbol{z}_{1}$ is attracted to the mode $\mathcal{M}_{i}$ by $\hat{\epsilon}$. Then there exists a neighborhood $\mathcal{N}_{r}\left(\boldsymbol{z}_{1}\right)$ of $\boldsymbol{z}_{1}$ such that $\boldsymbol{z}_{2}$ is distracted from $\mathcal{M}_{i}$ by $(\hat{\epsilon}/2-2\alpha)$ for all $\boldsymbol{z}_{2}\in\mathcal{N}_{r}\left(\boldsymbol{z}_{1}\right)$. The size of $\mathcal{N}_{r}\left(\boldsymbol{z}_{1}\right)$ can be arbitrarily large but is bounded by an open ball of radius $r$, defined as

$$r=\hat{\epsilon}\cdot\left(2\inf_{\boldsymbol{z}}\left\{\left(\frac{\left\|G_{\theta_{t}}\left(\boldsymbol{z}_{1}\right)-G_{\theta_{t}}(\boldsymbol{z})\right\|}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}\right\|}+\frac{\left\|G_{\theta_{t+1}}\left(\boldsymbol{z}_{1}\right)-G_{\theta_{t+1}}(\boldsymbol{z})\right\|}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}\right\|}\right)\right\}\right)^{-1},$$

where the mode $\mathcal{M}$, attracted, and distracted are defined in Definitions 3.2, 3.3, and 3.4, respectively.

According to this definition, the radius $r$ is inversely proportional to the discrepancy between the preceding and subsequent outputs of the generator, given a similar latent vector $\boldsymbol{z}$. A large value of $r$ corresponds to a small difference between the previous and subsequent generator outputs, leading to mode collapse. Conversely, a small value of $r$ corresponds to a large difference, resulting in diverse synthesis.
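The radius in Definition 3.1 can be approximated numerically by replacing the infimum with a minimum over a finite set of candidate latent points. The sketch below does this for scalar latents and scalar outputs. Everything in it is an assumption made for illustration: the function `latent_radius`, the linear toy generators, and the candidate grid are not taken from the paper.

```python
import math

def latent_radius(G_t, G_t1, z1, candidates, eps_hat):
    """Approximate r = eps_hat / (2 * inf_z (ratio_t(z) + ratio_t1(z)))."""
    ratios = []
    for z in candidates:
        dz = abs(z1 - z)
        if dz == 0.0:
            continue  # the difference quotient is undefined at z = z1
        ratios.append((abs(G_t(z1) - G_t(z)) + abs(G_t1(z1) - G_t1(z))) / dz)
    smallest = min(ratios)  # finite stand-in for the infimum
    return math.inf if smallest == 0.0 else eps_hat / (2.0 * smallest)

candidates = [-1.0, -0.5, 0.5, 1.0]

# Toy generators before and after one update (hypothetical linear maps).
r_flat = latent_radius(lambda z: 1.0 * z, lambda z: 3.0 * z, 0.0, candidates, eps_hat=0.8)

# A generator that reacts more strongly to latent changes gives a smaller radius.
r_steep = latent_radius(lambda z: 1.0 * z, lambda z: 7.0 * z, 0.0, candidates, eps_hat=0.8)
```

The comparison between `r_flat` and `r_steep` mirrors the reading given above: the more the generator output changes for nearby latent vectors, the smaller the neighborhood.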

Latent N-size and the diversity. In this paragraph, we discuss the relationship between the latent N-size and the gradient penalty. To begin, let us discuss the three notions referenced above, the mode $\mathcal{M}$, attracted, and distracted, which play a crucial role in the mode collapse phenomenon [8].

Additionally, we propose implementing the gradient penalty in the discriminator to adjust the latent N-size and alleviate the mode collapse phenomenon. If the neighborhood size is too large, a significant portion of the latent space vectors would be attracted to this specific image mode, leading to limited diversity in the synthetic samples. Conversely, if the neighborhood size is too small and contains only one vector, the latent space vectors cannot adequately cover all the modes in the image space.

The intuition behind our idea is illustrated in Fig. 3. The top and bottom rows indicate the latent N-size for the discriminator with and without gradient penalty, respectively. The yellow line in the top row indicates that a different latent vector is being drawn towards a new mode, distinct from that of $\boldsymbol{z}_{1}$. The blue line in the bottom row indicates that a latent vector in the neighborhood of $\boldsymbol{z}_{1}$ is attracted to the same mode as $\boldsymbol{z}_{1}$. The top row represents improved sample diversity, while the bottom row indicates a mode collapse phenomenon.

Figure 3: Main idea of the relationship between latent N-size and constraint.
Definition 3.2 (Modes in Image Space).

There exist modes $\mathcal{M}$ that cover the image space $\mathcal{Y}$. Mode $\mathcal{M}_{i}$ is a subset of $\mathcal{Y}$ satisfying $\max_{\boldsymbol{y_{k}},\boldsymbol{y_{j}}\in\mathcal{M}_{i}}\left\|\boldsymbol{y_{k}}-\boldsymbol{y_{j}}\right\|<\alpha$ and $\min_{\boldsymbol{y_{k}}\in\mathcal{M}_{i},\boldsymbol{y_{m}}\not\in\mathcal{M}_{i}}\alpha<\left\|\boldsymbol{y_{m}}-\boldsymbol{y_{k}}\right\|<2\alpha$, where $\boldsymbol{y_{k}}$ and $\boldsymbol{y_{j}}$ belong to the same mode $\mathcal{M}_{i}$, $\boldsymbol{y_{m}}$ and $\boldsymbol{y_{k}}$ belong to different modes, and $\alpha>0$.

Definition 3.2 asserts that images within the same mode exhibit minimal differences. Conversely, images belonging to different modes exhibit more significant differences.

Definition 3.3 (Modes Attracted).

Let $\boldsymbol{z}_{1}$ be a sample in latent space. We say $\boldsymbol{z}_{1}$ is attracted to a mode $\mathcal{M}_{i}$ by $\hat{\epsilon}$ from a gradient step if $\left\|\boldsymbol{y_{k}}-G_{\theta_{t+1}}\left(\boldsymbol{z}_{1}\right)\right\|+\hat{\epsilon}<\left\|\boldsymbol{y_{k}}-G_{\theta_{t}}\left(\boldsymbol{z}_{1}\right)\right\|$, where $\boldsymbol{y_{k}}\in\mathcal{M}_{i}$ is an image in the mode $\mathcal{M}_{i}$, $\hat{\epsilon}$ denotes a small quantity, and $\theta_{t}$ and $\theta_{t+1}$ are the generator parameters before and after the gradient update, respectively.

Definition 3.3 establishes when a latent vector $\boldsymbol{z}_{1}$ in the latent space is attracted to a specific mode $\mathcal{M}_{i}$. As training progresses, the output corresponding to $\boldsymbol{z}_{1}$ will exhibit only minor deviations from images within that mode.
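The inequality in Definition 3.3 is easy to test on concrete points. The sketch below is a hedged illustration only: `is_attracted`, the two-dimensional "images", and the value of `eps_hat` are assumptions chosen to make the check readable.

```python
import math

def dist(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def is_attracted(y_k, out_before, out_after, eps_hat):
    """True if ||y_k - G_{t+1}(z1)|| + eps_hat < ||y_k - G_t(z1)||."""
    return dist(y_k, out_after) + eps_hat < dist(y_k, out_before)

y_k = [1.0, 1.0]        # an image belonging to the mode (toy 2-D point)
before = [0.0, 0.0]     # generator output for z1 before the update

# The output moved halfway towards y_k, so it gains more than eps_hat.
moved = is_attracted(y_k, before, [0.5, 0.5], eps_hat=0.1)

# The output barely moved, so the gain does not exceed eps_hat.
stalled = is_attracted(y_k, before, [0.01, 0.01], eps_hat=0.1)
```

Here `eps_hat` acts as a margin: a gradient step counts as attraction only if it brings the output closer to the mode image by more than that margin.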

Definition 3.4 (Modes Distracted).

Let $\boldsymbol{z}_{2}$ be a sample in latent space. We say $\boldsymbol{z}_{2}$ is distracted from a mode $\mathcal{M}_{i}$ by $(\alpha+\hat{\epsilon})/2$ from a gradient step if $\left\|\boldsymbol{y_{m}}-G_{\theta_{t+1}}\left(\boldsymbol{z}_{2}\right)\right\|+(\hat{\epsilon}/2-2\alpha)<\left\|\boldsymbol{y_{k}}-G_{\theta_{t}}\left(\boldsymbol{z}_{2}\right)\right\|$, where $\boldsymbol{y_{k}}\in\mathcal{M}_{i}$ is an image in the mode $\mathcal{M}_{i}$, $\boldsymbol{y_{m}}\not\in\mathcal{M}_{i}$ is an image from other modes, $\alpha$ has the same meaning as in Definition 3.2, and $\theta_{t}$ and $\theta_{t+1}$ are the generator parameters before and after the gradient update, respectively.

Definition 3.4 explains that when a vector $\boldsymbol{z}_{2}$ close to $\boldsymbol{z}_{1}$ is drawn towards a particular mode in the image space, it is less likely to be attracted by a different mode. Therefore, it is crucial to decrease the latent N-size, as this encourages latent vectors to be attracted to various modes within the image space.

Latent N-size with gradient penalty. We demonstrate the relationship between the latent N-size and the gradient penalty. According to Proposition 3.5, as $\|\nabla_{\boldsymbol{x}}D(\boldsymbol{x})\|$ increases, the latent N-size decreases, and vice versa.

Proposition 3.5.

$\mathcal{N}_{r}\left(\boldsymbol{z}_{1}\right)$ can be defined with the discriminator gradient penalty as follows:

$$r=\hat{\epsilon}\cdot\left(2\inf_{\boldsymbol{z}}\left\{\left(\frac{2\left\|G_{\theta_{t}}\left(\boldsymbol{z}_{1}\right)-G_{\theta_{t}}(\boldsymbol{z})\right\|+\eta_{m}\delta(\boldsymbol{x})\sum\limits_{m=1}^{N}\left(\|\nabla_{\boldsymbol{x}}D_{m}(\mathcal{Y}_{2})\|+\|\nabla_{\boldsymbol{x}}D_{m}(\mathcal{Y})\|\right)}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}\right\|}\right)\right\}\right)^{-1},$$

where $\|\nabla_{\boldsymbol{x}}D_{m}(\mathcal{Y}_{2})\|=\|\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}_{2}))+R\|$ and $\|\nabla_{\boldsymbol{x}}D_{m}(\mathcal{Y})\|=\|\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))+R\|$. $R$ stands for the discriminator gradient penalty.

We show that $G_{\theta_{t}}$ can be computed iteratively as follows: $G_{\theta_{t}}=G_{\theta_{t-1}}+\eta_{m}\delta(\boldsymbol{x})\nabla_{x}D_{m}(G_{\theta_{t-1}}(\boldsymbol{z}))$. For the sake of simplicity, we present the expansion for the $t$-th term.

From Proposition 3.5, it’s crucial to maintain the latent N-size within a specific range to generate diverse synthetic samples. To achieve this, it is necessary to control the magnitude of the gradient norm with a gradient penalty.

Based on the corollary $\nabla_{\boldsymbol{x}}D(\boldsymbol{x})\leq 0$, we propose subtracting a small value $\boldsymbol{\varepsilon}$ from $\nabla_{\boldsymbol{x}}D(\boldsymbol{x})$ to control the magnitude of the gradient norm. This effectively enlarges the gradient norm, leading to a reduction in the latent N-size. Based on this insight, we propose the $\boldsymbol{\varepsilon}$-centered gradient penalty.

3.2 $\boldsymbol{\varepsilon}$-centered GP

We propose our $\boldsymbol{\varepsilon}$-centered gradient penalty in this section. We use the notation $\boldsymbol{\varepsilon}$ in the penalty name and equation to distinguish it from the hyper-parameter $\varepsilon^{\prime}$. The $\boldsymbol{\varepsilon}$-centered GP is

$$R(\theta,\psi)=\frac{\gamma}{2}\mathrm{E}_{\hat{\boldsymbol{x}}}\left(\left\|\nabla_{\boldsymbol{x}}D_{\psi}(\hat{\boldsymbol{x}})-\boldsymbol{\varepsilon}\right\|\right)^{2}, \tag{9}$$

where $\boldsymbol{\varepsilon}$ is a vector such that $\|\boldsymbol{\varepsilon}\|_{2}=\varepsilon^{\prime}$ with $\varepsilon^{\prime}=\sqrt{C\cdot N^{2}\cdot\boldsymbol{\varepsilon}^{2}}$, and $N$ and $C$ are the dimension and the number of channels of the real data, respectively. $\hat{\boldsymbol{x}}$ is sampled uniformly on the line segment between two random points $\boldsymbol{x}_{1}\sim p_{\theta}\left(\boldsymbol{x}_{1}\right)$ and $\boldsymbol{x}_{2}\sim p_{\mathcal{D}}\left(\boldsymbol{x}_{2}\right)$. Combined with the corollary $\nabla_{\boldsymbol{x}}D(\boldsymbol{x})\leq 0$, our $\boldsymbol{\varepsilon}$-centered GP increases $\|\nabla_{\boldsymbol{x}}D(\boldsymbol{x})\|$ so as to achieve a better latent N-size than the other two gradient penalties.
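As an illustration of Eq. (9), the sketch below estimates the penalty by Monte Carlo over interpolated points. It is a simplified sketch under stated assumptions, not the authors' code: the discriminator gradient is supplied as a plain function `grad_fn` (in practice it would come from automatic differentiation), and $\boldsymbol{\varepsilon}$ is assumed to be a constant vector with $C\cdot N^{2}$ equal entries scaled so that its L2 norm is $\varepsilon^{\prime}$. The helper names are hypothetical.

```python
import math
import random


def eps_vector(eps_prime, channels, side):
    """Constant vector with C * N^2 entries whose L2 norm equals eps_prime.

    Using equal entries is an assumed reading of the definition in the text.
    """
    d = channels * side * side
    return [eps_prime / math.sqrt(d)] * d


def interpolate(x_fake, x_real, rng):
    """Sample x_hat uniformly on the segment between a generated and a real point."""
    t = rng.random()
    return [t * a + (1.0 - t) * b for a, b in zip(x_fake, x_real)]


def eps_centered_gp(grad_fn, pairs, eps_vec, gamma, rng):
    """Monte Carlo estimate of (gamma / 2) * E ||grad_x D(x_hat) - eps||^2."""
    total = 0.0
    for x_fake, x_real in pairs:
        g = grad_fn(interpolate(x_fake, x_real, rng))
        total += sum((gi - ei) ** 2 for gi, ei in zip(g, eps_vec))
    return 0.5 * gamma * total / len(pairs)


if __name__ == "__main__":
    rng = random.Random(0)
    eps = eps_vector(0.3, channels=1, side=2)
    pairs = [([0.0] * 4, [1.0] * 4)] * 3
    # A linear toy discriminator D(x) = w . x has the constant gradient w.
    w = [0.5, -0.2, 0.1, 0.0]
    print(eps_centered_gp(lambda x: w, pairs, eps, gamma=1.0, rng=rng))
```

The penalty vanishes exactly when the discriminator gradient equals $\boldsymbol{\varepsilon}$ at every interpolated point, which is the sense in which the penalty is "centered" at $\boldsymbol{\varepsilon}$ rather than at zero.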

When training the GAN model with the loss functions in Eq. (8) and Eq. (9), our approach controls the gradient of the discriminator and results in more diverse synthesized samples.

3.3 Latent N-size with different gradient penalties

In this section, we combine Definition 3.1, Proposition 3.5, Lemma 3.6, and our $\boldsymbol{\varepsilon}$-centered gradient penalty to prove the main theorem. First, we establish the relationship among the latent N-sizes of the three gradient penalties in the following lemma.

Lemma 3.6.

The norms of the three gradient penalties, which determine the latent N-size, are defined as follows: $\|R_{1}\|=\|\left(\left\|\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))\right\|-g_{0}\right)\|$, $\|R_{0}\|=\|\left(\left\|\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))\right\|\right)\|$, and $\|R_{\boldsymbol{\varepsilon}}\|=\|\left(\left\|\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))\right\|+\|\boldsymbol{\varepsilon}\|\right)\|$, respectively. The ordering of the norms of the three gradient penalties is $\|R_{1}\|<\|R_{0}\|<\|R_{\boldsymbol{\varepsilon}}\|$. Consequently, the latent N-sizes of the three gradient penalties satisfy $r_{R_{1}}>r_{R_{0}}>r_{R_{\boldsymbol{\varepsilon}}}$.

Combined with Proposition 3.5, a larger value of $\|\nabla_{\boldsymbol{x}}D(\boldsymbol{x})\|$ leads to a smaller latent N-size, thus enhancing the diversity of synthetic samples.

Theorem 3.7.

Suppose $\boldsymbol{z}_{1}$ is attracted to the mode $\mathcal{M}_{i}$ by $\hat{\epsilon}$, then there exists a neighborhood $\mathcal{N}_{r}\left(\boldsymbol{z}_{1}\right)$ of $\boldsymbol{z}_{1}$ such that $\boldsymbol{z}_{2}$ is distracted to $\mathcal{M}_{i}$ by $(\hat{\epsilon}/2-2\alpha)$, for all $\boldsymbol{z}_{2}\in\mathcal{N}_{r}\left(\boldsymbol{z}_{1}\right)$. The size of $\mathcal{N}_{r}\left(\boldsymbol{z}_{1}\right)$ can be arbitrarily large but is bounded by an open ball of radius $r$, which is controlled by the gradient penalty terms of the discriminator. The latent N-sizes corresponding to the three gradient penalties satisfy $r_{R_{1}}>r_{R_{0}}>r_{R_{\boldsymbol{\varepsilon}}}$.

The theorem suggests that if a latent vector is pulled towards a specific mode in the image space, the neighborhood size of the latent vector should be kept reasonable. If the latent N-size is excessively large, it could prevent latent vectors from being attracted to other modes, potentially causing mode collapse. The gradient penalty can be used to effectively adjust the latent N-size. In descending order, the latent N-sizes corresponding to the three gradient penalties are $r_{R_{1}}>r_{R_{0}}>r_{R_{\boldsymbol{\varepsilon}}}$.

Figure 4: Results for CIFAR10 and MNIST. The leftmost four columns are the CFG method, the second four columns are Li-CFG with the $\boldsymbol{\varepsilon}$-centered GP (ours), the third four columns are Li-CFG with the 1-centered GP, and the rightmost four columns are Li-CFG with the 0-centered GP.
Figure 5: Results for LSUN tower, Church, B, BR+LR, T+B, and ImageNet. The methods from top to bottom are the CFG method, Li-CFG with the $\boldsymbol{\varepsilon}$-centered GP (ours), Li-CFG with the 1-centered GP, and Li-CFG with the 0-centered GP, with two rows each.

4 Related Work

Generative Adversarial Network. A GAN is trained as a minimax game between a discriminator and a generator, as formulated in [1]. There are various variants of GAN for image generation, such as DCGAN [9], SAGAN [10], Progressive Growing GAN [11], BigGAN [12] and StyleGAN [13]. In general, GANs are very difficult to train stably. Training a GAN may suffer from various issues, including vanishing gradients, mode collapse, and so on [14, 7, 15]. Numerous excellent works have addressed these issues. For example, by using the Wasserstein distance, WGAN [2] and its extensions [3, 15, 16, 17, 5] improve the training stability of GANs.

Lipschitz Constrained for GANs. Applying a Lipschitz constraint to CNNs has been widely explored [18, 19, 20, 21, 22, 23]. The idea of a Lipschitz constraint has also been introduced in Wasserstein GAN (WGAN). Recent theoretical results by Kim et al. [23] show that self-attention, which is widely used in the transformer model, is not Lipschitz continuous. Zhou et al. [20] proved that Lipschitz continuity is a general solution to make the gradient of the optimal discriminative function reliable. Unfortunately, directly applying the Lipschitz constraint to complicated neural networks is not easy. Previous works typically employ the mechanisms of gradient penalty and weight normalization. The gradient penalty proposed by Gulrajani et al. [3] adds a term to the loss function to control the variation of the discriminator gradient. Mescheder et al. [5] and Nagarajan and Kolter [16] argue that WGAN-GP is not stable near the Nash equilibrium and propose a new form of gradient penalty. Weight normalization, or spectral normalization, was first studied by Miyato et al. [24]. The normalization is built into the layers of the neural network to control the gradient of the discriminator. Recently, Bhaskara et al. [25] and Wu et al. [26] proposed new forms of normalization that behave better than spectral normalization.

Comparison with existing methods. Our method appears similar to AdvLatGAN-div [27] and MSGAN [28], which aim to increase the ratio of the distance in pixel space to the distance in latent space.

We clarify the differences between our method and AdvLatGAN-div and MSGAN as follows. First, we note that MSGAN, AdvLatGAN, and our method are all related to Eq. (10).

$$\frac{d_{\boldsymbol{I}}\left(G\left(\boldsymbol{c},\boldsymbol{z}_{1}\right),G\left(\boldsymbol{c},\boldsymbol{z}_{2}\right)\right)}{d_{\boldsymbol{z}}\left(\boldsymbol{z}_{1},\boldsymbol{z}_{2}\right)}, \tag{10}$$

where $\boldsymbol{c}$ is the condition vector for generation, and $d_{\boldsymbol{I}}$ and $d_{\boldsymbol{z}}$ are the distance metrics in the target (image) space and latent space, respectively.
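The ratio in Eq. (10) is simple to evaluate for a pair of latent vectors. The sketch below assumes Euclidean distance for both $d_{\boldsymbol{I}}$ and $d_{\boldsymbol{z}}$ and takes the generator as an arbitrary Python callable; both choices are assumptions made for illustration, and `distance_ratio` is a hypothetical helper name.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_ratio(generator, c, z1, z2):
    """Ratio of Eq. (10): d_I(G(c, z1), G(c, z2)) / d_z(z1, z2).

    Both distances are taken to be Euclidean here (an assumption).
    """
    return euclidean(generator(c, z1), generator(c, z2)) / euclidean(z1, z2)


if __name__ == "__main__":
    # Toy generators: one spreads latent codes apart, one ignores them.
    diverse = lambda c, z: [c + 2.0 * zi for zi in z]
    collapsed = lambda c, z: [c for _ in z]
    z1, z2 = [0.0, 0.0], [3.0, 4.0]
    print(distance_ratio(diverse, 1.0, z1, z2))
    print(distance_ratio(collapsed, 1.0, z1, z2))
```

A generator that maps distinct latent vectors to the same output yields a ratio of zero, which is the collapsed behaviour that maximizing the ratio is meant to discourage.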

Second, MSGAN maximizes Eq. (10) to train the generator, encouraging it to synthesize more diverse samples. Essentially, maximizing Eq. (10) means that for two different samples $\boldsymbol{z}_{1}$ and $\boldsymbol{z}_{2}$ from the latent space, the distance between their corresponding synthesized outputs $G(\boldsymbol{z}_{1})$ and $G(\boldsymbol{z}_{2})$ in the target space should be as large as possible. This increases the diversity of the generated samples. The AdvLatGAN-div method searches for hard sample pairs $z_{t}$, which are more likely to collapse in image space. This search is carried out by iteratively optimizing the latent samples $\boldsymbol{z}$ to minimize Eq. (10), similar to generating adversarial examples. The generator is then optimized to maximize Eq. (10), using these hard sample pairs $z_{t}$ in the latent space as input. This process helps to generate diverse samples in the image space and avoid mode collapse. Both MSGAN and AdvLatGAN-div, which utilize Eq. (10), are based on empirical observations and do not have a solid theoretical basis.

Unlike the previous two methods, we do not employ Eq. (10) as an additional loss function or an optimization target for training our model. Instead, we use Eq. (10) as the fundamental explanation of our theory. We build on Eq. (10) by using the CFG method, which lets us establish a connection between the distance between latent vectors $\boldsymbol{z}$ and the discriminator's constraints. We also explain how the Lipschitz constraint of the discriminator can enhance the generator's diversity based on our new theory.

5 Experiment

Datasets. To demonstrate the efficiency of our approach, we use both synthetic and real-world datasets. In line with [29, 30], we simulate two synthetic datasets: (1) Ring dataset. It consists of a mixture of eight 2-D Gaussians with means $(2\cos(i\pi/4),2\sin(i\pi/4))$, $1\leq i\leq 8$, and standard deviation 0.02. (2) Grid dataset. It consists of a mixture of twenty-five 2-D isotropic Gaussians with means $(2i,2j)$, $-2\leq i,j\leq 2$, and standard deviation 0.02. For real-world datasets, we use MNIST [31], CIFAR10 [32], the large-scale scene understanding (LSUN) dataset [33], and ImageNet [34], to make a fair comparison with the original CFG method. Note that we present ImageNet results for CFG methods for the first time. The CFG method employs a balanced two-class dataset using the same number of training images from the ‘bedroom’ class and the ‘living room’ class (LSUN BR+LR) and a balanced dataset from ‘tower’ and ‘bridge’ (LSUN T+B). In addition, we use ‘church’ (LSUN C) and ‘tower’ (LSUN T) in our experiments because the CFG method does not perform very well on these datasets. We choose CIFAR10 and ImageNet to demonstrate the generalization performance of our algorithm on richer categories and larger-scale datasets. We have included the high-resolution qualitative results of various datasets in the supplementary section.
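The two synthetic datasets can be simulated in a few lines. The sketch below follows the means and standard deviation given above; the equal mixture weights, the fixed seed, and the function names are assumptions made for illustration.

```python
import math
import random


def ring_means():
    """Eight means (2 cos(i*pi/4), 2 sin(i*pi/4)) for 1 <= i <= 8."""
    return [(2 * math.cos(i * math.pi / 4), 2 * math.sin(i * math.pi / 4))
            for i in range(1, 9)]


def grid_means():
    """Twenty-five means (2i, 2j) for -2 <= i, j <= 2."""
    return [(2 * i, 2 * j) for i in range(-2, 3) for j in range(-2, 3)]


def sample_mixture(means, n, std=0.02, seed=0):
    """Draw n points from an isotropic Gaussian mixture.

    Components are chosen with equal probability (an assumption).
    """
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        mx, my = rng.choice(means)
        points.append((rng.gauss(mx, std), rng.gauss(my, std)))
    return points


if __name__ == "__main__":
    ring = sample_mixture(ring_means(), 1000)
    grid = sample_mixture(grid_means(), 1000)
    print(len(ring), len(grid))
```

Because the standard deviation is small relative to the spacing of the means, every sample lies close to exactly one mode, which makes missing modes easy to spot in scatter plots.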

Implementation Details. To ensure a fair comparison between our method and the CFG method, we maintain identical implementation settings. All the experiments were done using a single NVIDIA Tesla V100 or a single NVIDIA Tesla A100 (the computations in this research were performed using the CFFF platform of Fudan University). The hyper-parameter values for the CFG method were fixed to those in Table 1 unless otherwise specified. The CFG method is sensitive to the setting of $\delta(\boldsymbol{x})$, which must be set to an appropriate value to generate images. In our work, we use the same $\delta(\boldsymbol{x})$ settings as the CFG method. The hyper-parameter $\gamma$ is set to 0.1, 1, or 10, and the $\boldsymbol{\varepsilon}$-centered hyper-parameter $\varepsilon^{\prime}$ is set to 0.1, 0.3, or 1 in the experiments.

The explanation and ablation study of $\varepsilon^{\prime}$ are in the supplementary. The base settings are presented in Table 1 and Table 2.

Baselines. As representative comparison methods, we tested WGAN with the gradient penalty (WGAN-GP) [3], the least squares GAN [35], the original GAN [1], and HingeGAN [36]. All of them are commonly used as baselines for other GAN models. For a fair comparison, we utilize the same network architecture as the CFG method. We use both a simple DCGAN [9] backbone and a more complex ResNet [37] backbone on each dataset. To evaluate the generalization of our $\boldsymbol{\varepsilon}$-centered method, we incorporate our $\boldsymbol{\varepsilon}$-centered gradient penalty into SOTA GAN models, such as BigGAN [12] and the Denoising Diffusion GAN (DDGAN) [38].

Table 1: Hyper-parameters.
NAME DESCRIPTION
B=64 training data batch size
U=1 discriminator updates per epoch
N=640 examples for updating G per epoch
M=15 number of generation steps in the CFG method
$\gamma$=0.1/1/10 hyper-parameter for the L constant
$\varepsilon^{\prime}$=0.1/0.3/1 hyper-parameter for the $\boldsymbol{\varepsilon}$-centered GP

Evaluation Metrics. It is known to be challenging to obtain reliable likelihood estimates for generative adversarial models. We therefore evaluated the visual quality of generated images by adopting the inception score [39], the Fréchet inception distance [40], and Precision/Recall [41]. The intuition behind the inception score is that high-quality generated images should be close to real images. The Fréchet inception distance indicates the similarity between the generated images and the real images. Moreover, by definition, precision is the fraction of the generated images that are realistic, and recall is the fraction of real images within the manifold covered by the generator.

Table 2: Hyper-parameters. The $\delta(\boldsymbol{x})$ value is the same as in the CFG method. $\gamma$ is the GP coefficient.
DATASET $\eta$ $\delta(\boldsymbol{x})$ $\gamma$ $\varepsilon^{\prime}$
MNIST 2.5e-4 0.5 0.1/1/10 0.1/0.3/1
CIFAR 2.5e-4 1 0.1/1/10 0.1/0.3/1
LSUN 2.5e-4 1 0.1/1/10 0.1/0.3/1
ImageNet 2.5e-4 1 0.1/1/10 0.1/0.3/1

We note that the inception score is limited, as it fails to detect mode collapse or missing modes. Apart from that, we found that it generally corresponds well to human perception.
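For reference, the inception score is commonly computed as $\exp\left(\mathrm{E}_{\boldsymbol{x}}\,\mathrm{KL}(p(y\mid\boldsymbol{x})\,\|\,p(y))\right)$, where $p(y\mid\boldsymbol{x})$ are class probabilities predicted by a pretrained classifier. The sketch below computes this quantity from a list of probability vectors; it assumes the classifier outputs are already available and is not the implementation used in the experiments.

```python
import math


def inception_score(probs):
    """exp of the mean KL(p(y|x) || p(y)) over samples.

    probs[i] is the predicted class distribution p(y | x_i) for sample i;
    p(y) is estimated as the average of these distributions.
    """
    n = len(probs)
    k = len(probs[0])
    marginal = [sum(p[j] for p in probs) / n for j in range(k)]
    mean_kl = 0.0
    for p in probs:
        # Terms with p_j = 0 contribute nothing to the KL divergence.
        mean_kl += sum(pj * math.log(pj / marginal[j])
                       for j, pj in enumerate(p) if pj > 0.0)
    return math.exp(mean_kl / n)


if __name__ == "__main__":
    confident = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    unsure = [[1 / 3, 1 / 3, 1 / 3]] * 3
    print(inception_score(confident), inception_score(unsure))
```

The score depends only on the predicted class distributions, so a generator that produces a single image per class can still score highly, which is consistent with the limitation noted above.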

In addition, we used the Fréchet inception distance (FID). FID measures the distance between the distribution of $f(x_{*})$ for real data $x_{*}$ and the distribution of $f(x)$ for generated data $x$, where the function $f$ extracts the features of an image. One advantage of this metric is that it would be high (poor) if mode collapse occurs, and a disadvantage is that its computation is relatively expensive. However, FID does not differentiate between the fidelity and diversity aspects of the generated images, so we used Precision/Recall to diagnose these properties. A high precision value indicates a high quality of generated samples, and a high recall value implies that the generator can generate many realistic samples that can be found in the "real" image distribution. In the results below, we call these metrics the (inception) score, the Fréchet distance, and the Precision/Recall.
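The Fréchet distance between two Gaussians is $\|\mu_{1}-\mu_{2}\|^{2}+\mathrm{Tr}\left(\Sigma_{1}+\Sigma_{2}-2(\Sigma_{1}\Sigma_{2})^{1/2}\right)$. The sketch below is a deliberately simplified version that assumes diagonal covariances, in which case the trace term reduces to a sum over per-dimension variances; standard FID implementations use full covariance matrices of Inception features. The function names are hypothetical.

```python
import math


def mean_and_var(features):
    """Per-dimension mean and population variance of a list of feature vectors."""
    n = len(features)
    d = len(features[0])
    mean = [sum(f[k] for f in features) / n for k in range(d)]
    var = [sum((f[k] - mean[k]) ** 2 for f in features) / n for k in range(d)]
    return mean, var


def frechet_distance_diag(feats_real, feats_gen):
    """Frechet distance between Gaussians fitted with diagonal covariances.

    Simplification: off-diagonal covariance terms are ignored.
    """
    mu_r, var_r = mean_and_var(feats_real)
    mu_g, var_g = mean_and_var(feats_gen)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    cov_term = sum(vr + vg - 2.0 * math.sqrt(vr * vg)
                   for vr, vg in zip(var_r, var_g))
    return mean_term + cov_term


if __name__ == "__main__":
    real = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
    shifted = [[x + 3.0, y] for x, y in real]
    print(frechet_distance_diag(real, shifted))
```

A generator that covers only part of the real distribution changes both the mean and the variance of the generated features, which is why the distance grows under mode collapse.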

5.1 Experimental Results

In this section, we present a detailed comparison between our Li-CFG experiments and three gradient penalty (GP) methods within the context of CFG and other GAN-based models. We report the results of our model using the same hyper-parameters as the CFG method. Our comparison shows that our method achieves better results. Furthermore, we demonstrate that the $\boldsymbol{\varepsilon}$-centered gradient penalty is versatile and can be effectively integrated into various GAN architectures.

Table 3: To assess the generalization of our $\boldsymbol{\varepsilon}$-centered gradient penalty, we apply it to various GAN baselines on various datasets. When compared to the original GAN, WGAN, LSGAN, and HingeGAN, each utilizing different gradient penalties, our $\boldsymbol{\varepsilon}$-centered gradient penalty consistently achieves the best Fréchet Inception Distance (FID) and Inception Score (IS). This outcome serves as evidence that our $\boldsymbol{\varepsilon}$-centered gradient penalty is not only applicable to the CFG mechanism but can also be effectively employed in common GAN models. The red font number indicates a correction made in the previous manuscript.
IS $\uparrow$ / FID $\downarrow$
MNIST origin GAN WGAN LSGAN HingeGAN Li-CFG
Unconstrained 2.18/34.28 2.23/31.05 2.22/36.83 2.21/30.72 2.29/4.04
ours ($\boldsymbol{\varepsilon}$-centered) 2.21/25.62 2.24/23.79 2.24/28.54 2.23/24.65 2.31/3.29
0-centered 2.21/30.28 2.22/24.98 2.2/27.32 2.19/26.89 2.31/3.54
1-centered 2.19/31.48 2.20/31.27 2.23/29.23 2.22/24.72 2.3/3.64
CIFAR10
Unconstrained 3.48/37.83 3.53/36.24 3.41/30.94 3.58/32.31 4.02/19.41
ours ($\boldsymbol{\varepsilon}$-centered) 3.82/30.52 3.87/28.45 3.92/24.73 4.01/20.63 4.83/14.96
0-centered 3.76/35.72 3.85/29.21 3.84/29.01 3.89/23.95 4.72/18.5
1-centered 3.78/31.48 3.75/31.27 3.80/29.23 3.82/24.72 4.71/19.3
LSUN B
Unconstrained 2.35/19.28 2.38/18.72 2.31/19.53 2.42/17.85 3.014/10.79
ours ($\boldsymbol{\varepsilon}$-centered) 2.58/17.38 2.61/16.98 2.67/16.21 2.73/15.43 8.78
0-centered 2.55/18.72 2.59/17.88 2.65/17.42 2.68/15.96 3.067/10.54
1-centered 2.56/18.89 2.57/18.29 2.66/17.69 2.65/16.83 3.154/11.5
LSUN T
Unconstrained 3.59/25.87 3.67/22.76 3.52/28.14 3.63/23.57 4.38/13.54
ours ($\boldsymbol{\varepsilon}$-centered) 3.93/20.62 4.06/17.89 3.89/22.67 4.02/18.72 4.6/11.41
0-centered 3.87/21.65 4.02/18.03 3.83/23.88 3.98/20.23 4.57/12.33
1-centered 3.85/22.03 3.99/18.49 3.81/24.33 3.94/20.81 4.47/12.81
LSUN C
Unconstrained 2.18/34.69 2.25/32.31 2.3/31.47 2.43/30.77 3.17/11.49
ours ($\boldsymbol{\varepsilon}$-centered) 2.51/29.48 2.63/26.46 2.61/27.82 2.69/25.58 3.18/9.38
0-centered 2.49/30.13 2.60/27.86 2.57/28.76 2.67/26.63 3.08/12.33
1-centered 2.45/30.79 2.58/28.79 2.53/29.15 2.65/26.97 3.08/12.81
LSUN B+L
Unconstrained 3.18/21.92 3.01/23.47 3.12/25.93 3.29/20.33 3.49/15.76
ours ($\boldsymbol{\varepsilon}$-centered) 3.40/18.39 3.38/19.34 3.31/21.82 3.42/17.87 3.50/14.09
0-centered 3.36/19.12 3.37/19.98 3.29/22.51 3.38/18.63 3.47/15.39
1-centered 3.30/19.87 3.35/20.65 3.26/23.32 3.31/19.48 3.42/15.53
LSUN T+B
Unconstrained 4.65/27.61 4.77/25.4 4.68/25.87 4.81/24.66 5.01/16.9
ours ($\boldsymbol{\varepsilon}$-centered) 4.78/23.72 4.85/21.84 4.89/21.9 4.92/19.83 5.07/15.72
0-centered 4.73/24.19 4.81/22.98 4.85/23.4 4.85/21.31 5.05/16.01
1-centered 4.74/24.55 4.76/23.59 4.79/24.02 4.80/22.76 5.04/16.32
ImageNet
Unconstrained 6.83/54.68 6.94/48.54 7.00/53.7 7.13/51.28 8.98/29.73
ours ($\boldsymbol{\varepsilon}$-centered) 7.38/45.84 7.69/40.72 7.52/41.25 7.66/40.36 9.15/29.05
0-centered 7.25/46.13 7.48/42.88 7.49/41.93 7.58/41.65 8.79/30.04
1-centered 7.16/47.28 7.21/44.71 7.44/42.84 7.47/42.37 8.96/29.55

Inception Score Results. Table 3 presents the Inception Score (IS) values for different GAN models. Since the exact code and models used to compute IS scores in Johnson and Zhang [4] are not publicly available, we employed the standard PyTorch implementation of the Inception Score function (https://github.com/sbarratt/inception-score-pytorch). The measured IS scores for the real datasets are as follows: 2.58 (MNIST), 9.56 (CIFAR10), 4.78 (LSUN T), 3.72 (LSUN B), 3.72 (LSUN B+L), 3.79 (LSUN T+B), 5.8 (LSUN C), 37.99 (ImageNet). We find that the scores of the CFG method and Li-CFG are very close, while WGAN and LSGAN perform worse on all the datasets. The relative differences also show that Li-CFG achieves better image quality than the CFG method on the same dataset.

Table 4: To assess the generalization of our $\boldsymbol{\varepsilon}$-centered gradient penalty, we apply it to various GAN baselines on various datasets. When compared to the original GAN, WGAN, LSGAN, and HingeGAN baselines, each utilizing different gradient penalties, our $\boldsymbol{\varepsilon}$-centered gradient penalty consistently achieves the best Recall score. This outcome serves as evidence that our $\boldsymbol{\varepsilon}$-centered gradient penalty is not only applicable to the CFG mechanism but can also be effectively employed in common GAN models.
Precision/Recall $\uparrow$
CIFAR10 origin GAN WGAN LSGAN HingeGAN Li-CFG
Unconstrained 0.78/0.45 0.78/0.41 0.80/0.49 0.81/0.47 0.77/0.59
ours ($\boldsymbol{\varepsilon}$-centered) 0.77/0.56 0.79/0.56 0.79/0.55 0.78/0.59 0.75/0.66
0-centered 0.77/0.53 0.79/0.53 0.78/0.53 0.78/0.58 0.76/0.64
1-centered 0.78/0.54 0.78/0.52 0.78/0.51 0.79/0.57 0.77/0.62
LSUN B
Unconstrained 0.75/0.37 0.75/0.39 0.77/0.38 0.73/0.4 0.65/0.49
ours ($\boldsymbol{\varepsilon}$-centered) 0.71/0.43 0.69/0.47 0.67/0.46 0.69/0.47 0.61/0.53
0-centered 0.73/0.41 0.71/0.45 0.67/0.45 0.71/0.47 0.62/0.51
1-centered 0.74/0.41 0.71/0.44 0.69/0.44 0.73/0.45 0.64/0.50
LSUN T
Unconstrained 0.80/0.45 0.79/0.49 0.81/0.43 0.79/0.47 0.71/0.58
ours ($\boldsymbol{\varepsilon}$-centered) 0.76/0.49 0.72/0.56 0.78/0.45 0.74/0.53 0.62/0.65
0-centered 0.77/0.48 0.73/0.54 0.79/0.44 0.75/0.51 0.64/0.63
1-centered 0.75/0.48 0.76/0.53 0.78/0.43 0.74/0.52 0.69/0.62
LSUN C
Unconstrained 0.82/0.36 0.84/0.32 0.83/0.35 0.79/0.41 0.75/0.51
ours ($\boldsymbol{\varepsilon}$-centered) 0.79/0.43 0.80/0.45 0.81/0.44 0.78/0.47 0.72/0.58
0-centered 0.78/0.42 0.81/0.44 0.82/0.43 0.79/0.45 0.76/0.55
1-centered 0.79/0.41 0.82/0.42 0.81/0.41 0.79/0.42 0.77/0.54
LSUN B+L
Unconstrained 0.81/0.46 0.85/0.43 0.86/0.42 0.81/0.47 0.77/0.51
ours ($\boldsymbol{\varepsilon}$-centered) 0.79/0.51 0.79/0.49 0.83/0.46 0.78/0.51 0.72/0.61
0-centered 0.79/0.49 0.80/0.47 0.85/0.45 0.80/0.48 0.74/0.59
1-centered 0.77/0.48 0.82/0.46 0.86/0.43 0.80/0.49 0.74/0.58
LSUN T+B
Unconstrained 0.82/0.38 0.82/0.39 0.84/0.37 0.81/0.42 0.78/0.46
ours ($\boldsymbol{\varepsilon}$-centered) 0.80/0.43 0.78/0.45 0.79/0.45 0.78/0.46 0.74/0.53
0-centered 0.83/0.41 0.79/0.43 0.81/0.42 0.79/0.44 0.75/0.52
1-centered 0.83/0.40 0.81/0.42 0.81/0.41 0.8/0.44 0.75/0.51
ImageNet
Unconstrained 0.79/0.39 0.72/0.45 0.76/0.41 0.76/0.42 0.62/0.61
ours ($\boldsymbol{\varepsilon}$-centered) 0.73/0.47 0.69/0.51 0.7/0.49 0.7/0.51 0.61/0.62
0-centered 0.75/0.46 0.7/0.47 0.72/0.47 0.71/0.50 0.62/0.62
1-centered 0.76/0.45 0.72/0.46 0.72/0.46 0.71/0.49 0.61/0.61

Fréchet Distance Results. We compute the Fréchet distance with 50k generated images and all real images from the datasets, using the standard implementation (https://github.com/mseitzer/pytorch-fid). However, we observed that the FID scores computed in our environment were significantly higher than those reported for the CFG method, even though we generated the images using the same CFG technique. The difference in FID scores suggests potential differences in environmental factors or implementation details. The results, presented in Table 3, show that Li-CFG achieves the best results on the MNIST, CIFAR10, and LSUN T, T+B, and B+L datasets. WGAN and LSGAN exhibit consistently weaker performance. The reason is that, for a fair comparison with other methods, we do not use tuning tricks, and these methods are also sensitive to varying hyper-parameters.

Table 5: Different $\varepsilon^{\prime}$ settings of our Li-CFG trained on MNIST. We use the FID and IS scores to compare the generation quality. The other two penalties do not have the parameter $\varepsilon^{\prime}$, so all of their cells hold the same value. "Untrained" means the loss function does not converge.
Columns 1-4: FID; columns 5-8: IS
MNIST $\varepsilon^{\prime}=0.1$ $\varepsilon^{\prime}=0.3$ $\varepsilon^{\prime}=1$ $\varepsilon^{\prime}=5$ $\varepsilon^{\prime}=0.1$ $\varepsilon^{\prime}=0.3$ $\varepsilon^{\prime}=1$ $\varepsilon^{\prime}=5$
ours ($\boldsymbol{\varepsilon}$-centered) 2.99 2.88 2.85 untrained 2.28 2.32 2.29 untrained
0-centered 3.54 3.54 3.54 3.54 2.31 2.31 2.31 2.31
1-centered 3.64 3.64 3.64 3.64 2.3 2.3 2.3 2.3
LSUN Bedroom
ours ($\boldsymbol{\varepsilon}$-centered) 9.94 8.78 9.73 untrained 2.97 2.94 2.97 untrained
0-centered 10.54 10.54 10.54 10.54 3.067 3.067 3.067 3.067
1-centered 11.5 11.5 11.5 11.5 3.154 3.154 3.154 3.154
Columns 1-3: FID; columns 4-6: IS
LSUN T+B $\gamma=0.1$ $\gamma=1$ $\gamma=10$ $\gamma=0.1$ $\gamma=1$ $\gamma=10$
ours ($\boldsymbol{\varepsilon}$-centered) 15.72 16.85 16.67 5.07 5.06 5.08
0-centered 16.01 17.4 19.79 5.05 5.08 5.05
1-centered 16.32 18.73 34.28 5.04 5.18 4.71

As we maintained the original network configurations for both models without applying any additional training optimizations, our results provide a fair comparison.

By maintaining the same network architecture as the CFG method, we were able to achieve the best scores in six out of seven datasets. The FID results demonstrate that our Li-CFG approach is effective and capable of generating high-quality images.

Precision and Recall Results. We compare 10,000 synthetic samples with 10,000 real samples to compute precision and recall, utilizing a public implementation of the Precision and Recall functions (https://github.com/blandocs/improved-precision-and-recall-metric-pytorch). In addition to the five GAN variants mentioned above, we also applied BigGAN on both CIFAR10 and ImageNet and DDGAN on the CIFAR10 dataset. We present these results in Table 4.
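The definitions of precision and recall given earlier can be made concrete with a k-nearest-neighbour estimate of each manifold. The sketch below is a small pure-Python illustration of that idea on raw points; it assumes Euclidean distance and operates directly on coordinates, whereas the implementation linked above works on deep features. The function names are hypothetical.

```python
import math


def knn_radii(points, k):
    """Distance from each point to its k-th nearest neighbour in the same set."""
    radii = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        radii.append(dists[k - 1])
    return radii


def coverage(queries, refs, radii):
    """Fraction of queries inside at least one reference point's k-NN ball."""
    hits = sum(
        1 for q in queries
        if any(math.dist(q, r) <= rad for r, rad in zip(refs, radii))
    )
    return hits / len(queries)


def precision_recall(real, fake, k=2):
    """Precision: fake samples covered by the real manifold; recall: the converse."""
    precision = coverage(fake, real, knn_radii(real, k))
    recall = coverage(real, fake, knn_radii(fake, k))
    return precision, recall


if __name__ == "__main__":
    mode_a = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
    mode_b = [(10.0, 10.0), (10.1, 10.0), (10.0, 10.1), (10.1, 10.1)]
    # A generator that only reproduces mode A: realistic but not diverse.
    print(precision_recall(mode_a + mode_b, list(mode_a)))
```

In this toy case the generated points are all realistic, so precision is high, while half of the real points are not covered, so recall drops; this is the pattern that recall is meant to expose under mode collapse.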

Ablation Study on $\varepsilon^{\prime}$ and $\delta(\boldsymbol{x})$. Our method introduces a controllable parameter that adjusts the gradient penalty within a specified range. As this parameter varies, the latent N-size adjusts accordingly. Through experiments, we demonstrate that a reasonable latent N-size is crucial, as shown in Table 5. A latent N-size that is too small, resulting from a large value of our parameter, causes the loss function to fail to converge. This reflects a trade-off between the diversity of synthetic samples and the model's training stability. For the values of $\delta(\boldsymbol{x})$ where the CFG method performs well, $\gamma=0.1$ consistently yields better results. On the other hand, for cases where the CFG method performs poorly, $\gamma=10$ tends to improve performance.

Generalization of the $\boldsymbol{\varepsilon}$-centered Gradient Penalty. The conception of our $\boldsymbol{\varepsilon}$-centered gradient penalty emerged from the CFG mechanism. The column ‘Li-CFG’ in Table 3 and Table 4 presents the effectiveness of the CFG method. However, a fundamental question arises: how well does our $\boldsymbol{\varepsilon}$-centered gradient penalty generalize beyond the CFG framework? To address this, we investigate the applicability of our mechanism to various GAN models. While the CFG method employs a distinct formulation, it is equivalent to common GANs in dynamic theory. This prompts us to assess the performance of our $\boldsymbol{\varepsilon}$-centered gradient penalty across various models. The results in Table 3 and Table 4 reveal that our $\boldsymbol{\varepsilon}$-centered gradient penalty consistently achieves the best FID score across all these GAN models.

To demonstrate the effectiveness and generalization of our $\boldsymbol{\varepsilon}$-centered gradient penalty, we compare various methods, focusing on the diversity of synthesized samples. The results of the compared methods are tabulated in Table 6.

We have discovered an interesting phenomenon where spectral normalization, weight gradient, and gradient penalty can effectively work together to improve the diversity of synthesized samples. However, it can be argued that all these methods rely on Lipschitz constraints and therefore may compete with each other. For instance, BigGAN and denoising diffusion GAN employ spectral normalization as a strong Lipschitz constraint on the weights of the neural networks. Given such a strong Lipschitz constraint, our gradient penalty should not affect the synthesized samples. However, when the gradient penalty is applied to both models, there is still a significant improvement in the diversity of the synthesized samples. This observation supports our theory. It suggests that interpreting the gradient penalty only as a Lipschitz constraint may not be sufficient. Our neighborhood theory provides a new perspective for understanding how the gradient penalty can improve the diversity of the model.

This evidence highlights the strong generalization capability of our proposed method. It showcases its broad applicability by seamlessly integrating with standard GAN models, offering advantages that extend beyond the CFG mechanism.

Figure 6: Column (a): Result for CIFAR10 from DDGAN with our $\boldsymbol{\varepsilon}$-centered gradient penalty. Column (b): Result for CIFAR10 from BigGAN with our $\boldsymbol{\varepsilon}$-centered gradient penalty.
Table 6: To assess the effectiveness of our $\boldsymbol{\varepsilon}$-centered gradient penalty with respect to diversity, we compare it to various GAN baselines focused on the diversity of synthesized samples. The improved FID refers to the difference in FID results between a method and its baseline. The symbol ‘-’ indicates that we did not recalculate the experimental results in our local computing environment. The ‘Baseline’ refers to the method that the other methods are compared to. To make a fair comparison, all the methods being compared are based on the same baseline. Our method is the one marked with the bold ($\boldsymbol{\varepsilon}$-centered).
CIFAR10 FID $\downarrow$ Improved FID $\uparrow$ Precision Recall $\uparrow$
Baseline: WGAN 36.24 - 0.78 0.41
AdvLatGAN-qua [27] 30.21 6.03 0.69 0.45
AdvLatGAN-qua+ [27] 29.73 6.51 0.69 0.46
WGAN-Unroll [29] 30.28 5.98 0.7 0.45
IID-GAN [30] 28.63 7.61 - -
WGAN ($\boldsymbol{\varepsilon}$-centered) 28.45 7.79 0.67 0.46
Baseline: Origin GAN 37.83 - 0.78 0.45
MSGAN [28] 32.38 5.45 0.77 0.51
AdvLatGAN-div+ [27] 30.92 6.91 0.78 0.54
Origin GAN ($\boldsymbol{\varepsilon}$-centered) 30.52 7.31 0.77 0.56
Baseline: SNGAN-RES [24] 15.93 - 0.8 0.75
AdvLatGAN-qua [27] 20.75 -4.62 0.82 0.68
AdvLatGAN-qua+ [27] 15.87 0.06 0.79 0.76
GN-GAN [26] 15.31 0.63 0.77 0.75
GraNC-GAN [25] 14.82 1.11 - -
aw-SN-GAN [42] 8.9 7.03 - -
SNGAN-RES ($\boldsymbol{\varepsilon}$-centered) 13.4 2.53 0.74 0.79
Baseline: SNGAN-CNN [24] 18.95 - 0.785 0.63
GN-GAN [26] 19.31 -0.36 0.81 0.59
SNGAN-CNN ($\boldsymbol{\varepsilon}$-centered) 17.92 1.03 0.79 0.64
Baseline: BigGAN 8.25 - 0.76 0.62
GN-BigGAN [26] 7.89 0.36 0.77 0.62
aw-BigGAN [42] 7.03 1.22 - -
BigGAN ($\boldsymbol{\varepsilon}$-centered) 5.18 3.07 0.75 0.67
DDGAN ($\boldsymbol{\varepsilon}$-centered) 2.38 5.87 0.74 0.69
ImageNet
Baseline: BigGAN 17.33 - 0.53 0.71
BigGAN ($\boldsymbol{\varepsilon}$-centered) 12.68 4.65 0.45 0.76
Figure 7: Results for the CFG with varying gradient penalties on the Ring dataset. From top to bottom, the sequence includes the CFG, the Li-CFG with 1-centered gradient penalty, the Li-CFG with 0-centered gradient penalty, and the Li-CFG with $\boldsymbol{\varepsilon}$-centered gradient penalty. Progressing from left to right, each column represents outcomes from different stages of training. The far-right column displays the ground truth data for comparison.
Figure 8: Results for the CFG with varying gradient penalties on the Grid dataset. From top to bottom, the sequence includes the CFG, the Li-CFG with 1-centered gradient penalty, the Li-CFG with 0-centered gradient penalty, and the Li-CFG with $\boldsymbol{\varepsilon}$-centered gradient penalty. Progressing from left to right, each column represents outcomes from different stages of training. The far-right column displays the ground truth data for comparison.

Visualization of synthesized images. The results of the experiments using the Li-CFG and CFG methods can be seen in Fig. 4 and Fig. 5. Our Li-CFG method achieves the same or better results than the CFG method. On some datasets, the images generated by the CFG method have already collapsed, while the images generated by Li-CFG remain of good quality. Furthermore, we also present synthesized samples generated with state-of-the-art GAN models on CIFAR10, as shown in Fig. 6. Additional synthesized samples from various datasets can be found in Section D.4 of the supplementary materials.

The results for the synthetic datasets are presented in Figs. 7 and 8. We observe that the unconstrained CFG method has difficulty converging to all modes of the ring or grid datasets. When supplemented with a gradient penalty, however, the method shows an enhanced ability to converge to the mixture of Gaussians. Among the three gradient penalties tested, the 1-centered gradient penalty showed inferior convergence compared with the 0-centered gradient penalty and our proposed $\boldsymbol{\varepsilon}$-centered gradient penalty. Notably, our $\boldsymbol{\varepsilon}$-centered gradient penalty drove more sample points to converge to the Gaussian modes than the 0-centered gradient penalty.

6 Conclusion

In this paper, we provide a novel perspective for analyzing the relationship between the constraint and the diversity of synthetic samples. We assert that the constraint can effectively influence the latent N-size, which is strongly associated with the mode collapse phenomenon. To modify the latent N-size efficiently, we propose a new form of gradient penalty called the $\boldsymbol{\varepsilon}$-centered GP. The experiments demonstrate that our method is more stable and achieves more diverse synthetic samples than the CFG method. Additionally, our method can be applied not only to the CFG method but also to common GAN models. Inception score, Fréchet Distance, Precision/Recall, and the visual quality of generated images show that our method is more effective. In future work, we plan to change the metric function of the generator from the KL divergence to the Wasserstein distance to achieve a more stable and efficient GAN architecture.

Limitations and future work. In this study, we achieve good FID results on the considered datasets. However, higher-resolution datasets might require a different choice of neural network architecture and hyper-parameters. In addition, although the hyper-parameter $\delta(\boldsymbol{x})$ has an important impact on the algorithm, we do not discuss the relationship between $\delta(\boldsymbol{x})$ in CFG and our constraint hyper-parameter $\boldsymbol{\varepsilon}$ from a theoretical perspective. These effects should be systematically studied in future work.

7 Declarations

declaration - statement

  • Conflicts of interest/Competing interests

    Not applicable

  • Funding

    This work was funded by the National Natural Science Foundation of China (U22A20102, 62272419) and the Natural Science Foundation of Zhejiang Province (ZJNSFLZ22F020010).

  • Ethics approval

    Not applicable

  • Consent to participate

    Not applicable

  • Consent for publication

    Not applicable

  • Availability of data and material

    All the data used in our work are sourced from publicly available datasets that have been established by prior research. References for all the data are provided in our main document.

  • Code availability

    Our code is custom code and is available for use.

  • Authors’ contributions

    We list the Authors’ contributions as follows:

    Conceptualization: [Chang Wan], [Yanwei Fu], [Xinwei Sun];

    Methodology: [Chang Wan], [Xinwei Sun], [Ke Fan];

    Experiments: [Chang Wan], [Ke Fan], [Yunliang Jiang];

    Writing ‐ original draft preparation: [Chang Wan], [Yanwei Fu];

    Writing ‐ review and editing: [Xinwei Sun], [Zhonglong Zheng], [Minglu Li];

    Funding acquisition: [Yunliang Jiang], [Zhonglong Zheng];

    Resources: [Zhonglong Zheng];

    Supervision: [Minglu Li], [Yunliang Jiang];

    All authors read and approved the final manuscript.

References

  • Goodfellow et al. [2014] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. Conference on Neural Information Processing Systems 27 (2014)
  • Arjovsky et al. [2017] Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein generative adversarial networks. In: International Conference on Machine Learning, pp. 214–223 (2017). PMLR
  • Gulrajani et al. [2017] Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.: Improved training of wasserstein gans. arXiv preprint arXiv:1704.00028 (2017)
  • Johnson and Zhang [2018] Johnson, R., Zhang, T.: Composite functional gradient learning of generative adversarial models. In: International Conference on Machine Learning, pp. 2371–2379 (2018). PMLR
  • Mescheder et al. [2018] Mescheder, L., Geiger, A., Nowozin, S.: Which training methods for gans do actually converge? In: International Conference on Machine Learning, pp. 3481–3490 (2018). PMLR
  • Johnson and Zhang [2019] Johnson, R., Zhang, T.: A framework of composite functional gradient methods for generative adversarial models. IEEE Transactions on Pattern Analysis and Machine Intelligence 43(1), 17–32 (2019)
  • Roth et al. [2017] Roth, K., Lucchi, A., Nowozin, S., Hofmann, T.: Stabilizing training of generative adversarial networks through regularization. arXiv preprint arXiv:1705.09367 (2017)
  • Yang et al. [2019] Yang, D., Hong, S., Jang, Y., Zhao, T., Lee, H.: Diversity-sensitive conditional generative adversarial networks. arXiv preprint arXiv:1901.09024 (2019)
  • Radford et al. [2015] Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015)
  • Zhang et al. [2019] Zhang, H., Goodfellow, I., Metaxas, D., Odena, A.: Self-attention generative adversarial networks. In: International Conference on Machine Learning, pp. 7354–7363 (2019). PMLR
  • Karras et al. [2017] Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196 (2017)
  • Brock et al. [2018] Brock, A., Donahue, J., Simonyan, K.: Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096 (2018)
  • Karras et al. [2019] Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: IEEE/CVF Computer Vision and Pattern Recognition Conference, pp. 4401–4410 (2019)
  • Che et al. [2016] Che, T., Li, Y., Jacob, A.P., Bengio, Y., Li, W.: Mode regularized generative adversarial networks. arXiv preprint arXiv:1612.02136 (2016)
  • Nowozin et al. [2016] Nowozin, S., Cseke, B., Tomioka, R.: f-gan: Training generative neural samplers using variational divergence minimization. In: Conference on Neural Information Processing Systems, pp. 271–279 (2016)
  • Nagarajan and Kolter [2017] Nagarajan, V., Kolter, J.Z.: Gradient descent gan optimization is locally stable. arXiv preprint arXiv:1706.04156 (2017)
  • Mescheder et al. [2017] Mescheder, L., Nowozin, S., Geiger, A.: The numerics of gans. arXiv preprint arXiv:1705.10461 (2017)
  • Oberman and Calder [2018] Oberman, A.M., Calder, J.: Lipschitz regularized deep neural networks converge and generalize. arXiv preprint arXiv:1808.09540 (2018)
  • Scaman and Virmaux [2018] Scaman, K., Virmaux, A.: Lipschitz regularity of deep neural networks: analysis and efficient estimation. arXiv preprint arXiv:1805.10965 (2018)
  • Zhou et al. [2018] Zhou, Z., Song, Y., Yu, L., Wang, H., Liang, J., Zhang, W., Zhang, Z., Yu, Y.: Understanding the effectiveness of lipschitz-continuity in generative adversarial nets. arXiv preprint arXiv:1807.00751 (2018)
  • Zhou et al. [2019] Zhou, Z., Liang, J., Song, Y., Yu, L., Wang, H., Zhang, W., Yu, Y., Zhang, Z.: Lipschitz generative adversarial nets. In: International Conference on Machine Learning, pp. 7584–7593 (2019). PMLR
  • Herrera et al. [2020] Herrera, C., Krach, F., Teichmann, J.: Estimating full lipschitz constants of deep neural networks. arXiv preprint arXiv:2004.13135 (2020)
  • Kim et al. [2021] Kim, H., Papamakarios, G., Mnih, A.: The lipschitz constant of self-attention. In: International Conference on Machine Learning, pp. 5562–5571 (2021). PMLR
  • Miyato et al. [2018] Miyato, T., Kataoka, T., Koyama, M., Yoshida, Y.: Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957 (2018)
  • Bhaskara et al. [2022] Bhaskara, V.S., Aumentado-Armstrong, T., Jepson, A.D., Levinshtein, A.: Gran-gan: Piecewise gradient normalization for generative adversarial networks. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 3821–3830 (2022)
  • Wu et al. [2021] Wu, Y.-L., Shuai, H.-H., Tam, Z.-R., Chiu, H.-Y.: Gradient normalization for generative adversarial networks. In: International Conference on Computer Vision, pp. 6373–6382 (2021)
  • Li et al. [2022] Li, Y., Mo, Y., Shi, L., Yan, J.: Improving generative adversarial networks via adversarial learning in latent space. Conference on Neural Information Processing Systems 35, 8868–8881 (2022)
  • Mao et al. [2019] Mao, Q., Lee, H.-Y., Tseng, H.-Y., Ma, S., Yang, M.-H.: Mode seeking generative adversarial networks for diverse image synthesis. In: IEEE/CVF Computer Vision and Pattern Recognition Conference, pp. 1429–1437 (2019)
  • Metz et al. [2016] Metz, L., Poole, B., Pfau, D., Sohl-Dickstein, J.: Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163 (2016)
  • Li et al. [2021] Li, Y., Shi, L., Yan, J.: Iid-gan: an iid sampling perspective for regularizing mode collapse. arXiv preprint arXiv:2106.00563 (2021)
  • LeCun et al. [1998] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998)
  • Krizhevsky et al. [2009] Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images. Master’s thesis, Department of Computer Science, University of Toronto (2009)
  • Yu et al. [2015] Yu, F., Seff, A., Zhang, Y., Song, S., Funkhouser, T., Xiao, J.: Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365 (2015)
  • Deng et al. [2009] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: IEEE/CVF Computer Vision and Pattern Recognition Conference, pp. 248–255 (2009). IEEE
  • Mao et al. [2017] Mao, X., Li, Q., Xie, H., Lau, R.Y., Wang, Z., Paul Smolley, S.: Least squares generative adversarial networks. In: International Conference on Computer Vision, pp. 2794–2802 (2017)
  • Lim and Ye [2017] Lim, J.H., Ye, J.C.: Geometric gan. arXiv preprint arXiv:1705.02894 (2017)
  • He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE/CVF Computer Vision and Pattern Recognition Conference, pp. 770–778 (2016)
  • Xiao et al. [2021] Xiao, Z., Kreis, K., Vahdat, A.: Tackling the generative learning trilemma with denoising diffusion gans. arXiv preprint arXiv:2112.07804 (2021)
  • Salimans et al. [2016] Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training gans. Conference on Neural Information Processing Systems 29, 2234–2242 (2016)
  • Heusel et al. [2017] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained by a two time-scale update rule converge to a local nash equilibrium. Conference on Neural Information Processing Systems 30 (2017)
  • Kynkäänniemi et al. [2019] Kynkäänniemi, T., Karras, T., Laine, S., Lehtinen, J., Aila, T.: Improved precision and recall metric for assessing generative models. Conference on Neural Information Processing Systems 32 (2019)
  • Zadorozhnyy et al. [2021] Zadorozhnyy, V., Cheng, Q., Ye, Q.: Adaptive weighted discriminator for training generative adversarial networks. In: IEEE/CVF Computer Vision and Pattern Recognition Conference, pp. 4781–4790 (2021)

Appendix

\parttoc

Appendix A Overview

In this section, we outline the contents of our supplementary material, which is divided into four main sections. First, this overview introduces the supplementary material. Second, we analyze the dynamic theory for the CFG method, covering essential concepts related to CFG, the Lipschitz constraint, and the dynamic theory of CFG. Third, the theoretical analysis section contains the proofs of our definition, lemma, and theorem. Finally, we present the results of our experiments in the last section.

In the analysis of dynamic theory for the CFG method section, we provide foundational knowledge about CFG and Lipschitz constraints. Furthermore, we provide theoretical proofs that establish certain equivalences between the CFG method and conventional GAN theory, specifically focusing on dynamic theory principles.

In the theoretical analysis section, we begin with the proofs of Definition 3.1 and Proposition 3.5. Definition 3.1 is the base definition for Proposition 3.5, so we prove them together. Proposition 3.5 gives the formulation of the latent N-size with gradient penalty. Next, we prove the corollary that $\nabla_{\boldsymbol{x}}D(\boldsymbol{x})\leq 0$, an important corollary from which our $\boldsymbol{\varepsilon}$-centered gradient penalty is derived. We then prove Lemma 3.6, which derives the relationship between the values of the discriminator norm under three different gradient penalties. Lastly, we prove our main Theorem 3.7, which summarizes the relationship between the latent N-size and the three gradient penalties under the above definitions and lemmas.

In the Experiments section, we provide additional experiments and present more results of our Li-CFG and $\boldsymbol{\varepsilon}$-centered method with common GAN models. For synthetic datasets, we include visual inspection figures comparing our $\boldsymbol{\varepsilon}$-centered method with other GAN models that use different gradient penalties. For real-world datasets, we provide visual comparison figures that evaluate the CFG method and Li-CFG on CelebA, LSUN Bedroom, and ImageNet at resolutions of 128x128 and 256x256 pixels. We also present visual comparison figures of our $\boldsymbol{\varepsilon}$-centered method with DDGAN and BigGAN on LSUN Church (256x256) and ImageNet (64x64).

Appendix B Analysis of the dynamic theory for the CFG

Dynamic Theory for the CFG. This section presents the motivation for, and the proofs of, the relationship between the CFG method and the common GAN method. In terms of the dynamic theory, the generator and discriminator in the CFG method function equivalently to the corresponding components in common GAN. Based on this analysis, the CFG method still suffers from instability and is not locally convergent at the Nash equilibrium point. We visualize the gradient vector of the discriminator for CFG and Li-CFG in Fig. 9. This motivates us to introduce the Lipschitz constraint into the CFG method to enforce convergence. The dynamic theory is adopted from Mescheder et al. [5] and Nagarajan and Kolter [16].

Figure 9: The left and right images show CFG and our Li-CFG, respectively. By Lemma B.1 and Lemma B.2, CFG, like common GAN, is not locally convergent near the Nash equilibrium in the dynamic theory. In contrast, our Li-CFG method can converge to a small neighborhood around the Nash equilibrium.
Lemma B.1.

The loss functions of the discriminators of CFG and common GAN learned by minimax games exhibit the same form and optimization objective. The form of the discriminator loss function is:

$$\mathbb{E}_{\boldsymbol{z}\sim p_{z}}\left[f\left(-D_{\psi}\left(G_{\theta}(\boldsymbol{z})\right)\right)\right]+\mathbb{E}_{\boldsymbol{x}\sim p_{*}}\left[f\left(D_{\psi}(\boldsymbol{x})\right)\right]$$
Proof.

We establish the equivalence in two parts. First, we analyze the CFG method. The loss function of its discriminator is

$$L_{CFG}=\mathbb{E}_{\boldsymbol{x}\sim p_{*}}\ln\left(1+\exp(-D(\boldsymbol{x}))\right)+\mathbb{E}_{\boldsymbol{x}\sim p_{z}}\ln\left(1+\exp(D(\boldsymbol{x}))\right). \tag{11}$$

To unify the notation, we write $\exp(D(\boldsymbol{x}))$ in place of the $e^{D(\boldsymbol{x})}$ used in the CFG article. The common GAN loss function is

$$L_{GAN}=\mathbb{E}_{\boldsymbol{x}\sim p_{*}}[\ln d(\boldsymbol{x})]+\mathbb{E}_{\boldsymbol{z}\sim p_{\boldsymbol{z}}}\ln[1-d(G(\boldsymbol{z}))]. \tag{12}$$

If we set $d(\boldsymbol{x})=\frac{1}{1+\exp(-D(\boldsymbol{x}))}$ and substitute it into the common GAN loss function, we obtain the loss function of the CFG method up to a sign. This demonstrates the equivalence between the two loss functions. For the second part, we use the notation of the loss function proposed by Nagarajan and Kolter [16] to prove the lemma. The loss function is

$$L_{COMM}(\theta,\psi)=\mathbb{E}_{\boldsymbol{z}\sim p_{z}}\left[f\left(-D_{\psi}\left(G_{\theta}(\boldsymbol{z})\right)\right)\right]+\mathbb{E}_{\boldsymbol{x}\sim p_{*}}\left[f\left(D_{\psi}(\boldsymbol{x})\right)\right]. \tag{13}$$

If we choose $f(t)=-\ln\left(1+\exp(-t)\right)$, it leads to the loss function of the CFG method. ∎

We now give a more detailed proof. First, we list the three discriminator loss functions for comparison. The discriminator loss function proposed by Nagarajan and Kolter [16] is

$$L_{COMM}(\theta,\psi)=\mathbb{E}_{\boldsymbol{z}\sim p_{z}}\left[f\left(-D_{\psi}\left(G_{\theta}(\boldsymbol{z})\right)\right)\right]+\mathbb{E}_{\boldsymbol{x}\sim p_{*}}\left[f\left(D_{\psi}(\boldsymbol{x})\right)\right]. \tag{14}$$

In the notation of Mescheder et al. [5], however, the loss function is

$$L_{COMM}(\theta,\psi)=\mathbb{E}_{\boldsymbol{z}\sim p_{z}}\left[f\left(D_{\psi}\left(G_{\theta}(\boldsymbol{z})\right)\right)\right]+\mathbb{E}_{\boldsymbol{x}\sim p_{*}}\left[f\left(-D_{\psi}(\boldsymbol{x})\right)\right]. \tag{15}$$

We regard both equations as correct and use Eq. (14) in our proof. For simplicity, we omit the symbol $(\theta,\psi)$ and use $D$ and $G$ to denote the discriminator and generator, respectively. The loss function of the discriminator of the CFG method is

$$L_{CFG}=\mathbb{E}_{\boldsymbol{x}\sim p_{*}}\ln\left(1+\exp(-D(\boldsymbol{x}))\right)+\mathbb{E}_{\boldsymbol{z}\sim p_{z}}\ln\left(1+\exp(D(G(\boldsymbol{z})))\right). \tag{16}$$

The discriminator loss function of common GAN is

$$L_{GAN}=\mathbb{E}_{\boldsymbol{x}\sim p_{*}}[\ln d(\boldsymbol{x})]+\mathbb{E}_{\boldsymbol{z}\sim p_{z}}\ln[1-d(G(\boldsymbol{z}))]. \tag{17}$$

Our goal is to prove that the three loss functions are equivalent: each can be transformed into the others under certain conditions.

Proof that Eq. (17) converts to Eq. (16).

We set $d(\boldsymbol{x})=\frac{1}{1+\exp(-D(\boldsymbol{x}))}$ and substitute it into Eq. (17). Then we have

$$\begin{aligned}
L_{GAN}&=\mathbb{E}_{\boldsymbol{x}\sim p_{*}}\ln\left[\frac{1}{1+\exp(-D(\boldsymbol{x}))}\right]+\mathbb{E}_{\boldsymbol{z}\sim p_{z}}\ln\left[1-\frac{1}{1+\exp(-D(G(\boldsymbol{z})))}\right]\\
&=\mathbb{E}_{\boldsymbol{x}\sim p_{*}}\ln\left[1+\exp(-D(\boldsymbol{x}))\right]^{-1}+\mathbb{E}_{\boldsymbol{z}\sim p_{z}}\ln\left[\frac{\exp(-D(G(\boldsymbol{z})))}{1+\exp(-D(G(\boldsymbol{z})))}\right]\\
&=\mathbb{E}_{\boldsymbol{x}\sim p_{*}}\ln\left[1+\exp(-D(\boldsymbol{x}))\right]^{-1}+\mathbb{E}_{\boldsymbol{z}\sim p_{z}}\ln\left[\frac{\frac{1}{\exp(D(G(\boldsymbol{z})))}}{\frac{1+\exp(D(G(\boldsymbol{z})))}{\exp(D(G(\boldsymbol{z})))}}\right]\\
&=\mathbb{E}_{\boldsymbol{x}\sim p_{*}}\ln\left[1+\exp(-D(\boldsymbol{x}))\right]^{-1}+\mathbb{E}_{\boldsymbol{z}\sim p_{z}}\ln\left[\frac{1}{1+\exp(D(G(\boldsymbol{z})))}\right]\\
&=\mathbb{E}_{\boldsymbol{x}\sim p_{*}}\ln\left[1+\exp(-D(\boldsymbol{x}))\right]^{-1}+\mathbb{E}_{\boldsymbol{z}\sim p_{z}}\ln\left[1+\exp(D(G(\boldsymbol{z})))\right]^{-1}\\
&=-\bigg[\mathbb{E}_{\boldsymbol{x}\sim p_{*}}\ln\left(1+\exp(-D(\boldsymbol{x}))\right)+\mathbb{E}_{\boldsymbol{z}\sim p_{z}}\ln\left(1+\exp(D(G(\boldsymbol{z})))\right)\bigg].
\end{aligned}$$

The expression in the brackets is the loss function of the CFG method, Eq. (16). Hence $\max L_{GAN}=-\min L_{CFG}$: maximizing $L_{GAN}$ has the same optimization objective as minimizing $L_{CFG}$, so the two loss functions are equivalent. Eq. (16) is transformed back into Eq. (17) in the same way by setting $\frac{1}{1+\exp(-D(\boldsymbol{x}))}=d(\boldsymbol{x})$.

Proof that Eq. (14) converts to Eq. (16).

We choose $f(t)=-\ln\left(1+\exp(-t)\right)$. This has the same form as $\ln d(\boldsymbol{x})=\ln\left[\frac{1}{1+\exp(-D(\boldsymbol{x}))}\right]=-\ln\left[1+\exp(-D(\boldsymbol{x}))\right]$. Substituting $f$ into Eq. (14) therefore leads, up to the same sign change, to the loss function of the CFG method, and the two equations are equivalent. This proves Lemma B.1.
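The equivalence in Lemma B.1 can also be checked numerically. The following Python sketch evaluates Eqs. (14), (16), and (17) on a handful of made-up discriminator outputs and confirms that $L_{GAN}=L_{COMM}=-L_{CFG}$ when $d=\mathrm{sigmoid}(D)$ and $f(t)=-\ln(1+\exp(-t))$. The logit values and helper names are illustrative assumptions, not part of the original method.

```python
import math


def softplus(t):
    # ln(1 + exp(t)), evaluated in a numerically stable way
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def loss_cfg(real_logits, fake_logits):
    # Eq. (16): E_real ln(1 + exp(-D(x))) + E_fake ln(1 + exp(D(G(z))))
    return (mean(softplus(-d) for d in real_logits)
            + mean(softplus(d) for d in fake_logits))


def loss_gan(real_logits, fake_logits):
    # Eq. (17) with d(x) = sigmoid(D(x))
    return (mean(math.log(sigmoid(d)) for d in real_logits)
            + mean(math.log(1.0 - sigmoid(d)) for d in fake_logits))


def loss_comm(real_logits, fake_logits):
    # Eq. (14) with f(t) = -ln(1 + exp(-t)) = -softplus(-t)
    def f(t):
        return -softplus(-t)

    return (mean(f(-d) for d in fake_logits)
            + mean(f(d) for d in real_logits))


# Hypothetical discriminator outputs D(x) on real and generated samples.
real = [1.3, 0.2, -0.4, 2.1]
fake = [-0.9, 0.5, -1.7, 0.1]

print(loss_cfg(real, fake), loss_gan(real, fake), loss_comm(real, fake))
```

The last two printed values agree with each other and are the negative of the first, which is the sign relation used in the proof.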

Lemma B.2.

The gradient vector fields of the generators of CFG and common GAN learned by minimax games exhibit the same form in the dynamic theory. The form of the gradient vector field is:

$$\begin{pmatrix}-\eta\,\delta(G(\boldsymbol{z}))\left[\nabla_{\boldsymbol{x}}D(G(\boldsymbol{z}))\right]\frac{\partial G(\boldsymbol{z})}{\partial\boldsymbol{\theta}}\\[4pt] 0\end{pmatrix}$$

We want to prove that $\widetilde{v}_{G}(\boldsymbol{\theta})=\overline{v}_{G}(\boldsymbol{\theta})$. Here $\widetilde{v}_{G}$ denotes the gradient vector field of the common GAN generator, while $\overline{v}_{G}$ denotes the generator gradient vector field of the CFG method. We write $\boldsymbol{\theta}$, instead of $\theta$, for the weight vector of the generator. If the two gradient vector fields are equal, Lemma B.2 is proved.

Proof.

We expand $\widetilde{v}_{G}(\boldsymbol{\theta})$, the gradient vector field of the generator for common GAN:

$$\widetilde{v}_{G}(\boldsymbol{\theta})=\begin{pmatrix}\nabla_{\boldsymbol{\theta}}\ln\left(1-d(G(\boldsymbol{z}))\right)\\[4pt] 0\end{pmatrix}. \tag{18}$$

The value of $d$ is $d(\boldsymbol{x})=\frac{1}{1+\exp(-D(\boldsymbol{x}))}$. Substituting $d(\boldsymbol{x})$ into $\widetilde{v}_{G}(\boldsymbol{\theta})$ gives

$$\widetilde{v}_{G}(\boldsymbol{\theta})=\begin{pmatrix}\nabla_{\boldsymbol{\theta}}\ln\frac{\exp(-D(G(\boldsymbol{z})))}{1+\exp(-D(G(\boldsymbol{z})))}\\[4pt] 0\end{pmatrix}. \tag{19}$$

We extract the term $\nabla_{\boldsymbol{\theta}}\ln\frac{\exp(-D(G(\boldsymbol{z})))}{1+\exp(-D(G(\boldsymbol{z})))}$ and derive it as follows:

$$\begin{aligned}
&\nabla_{\boldsymbol{\theta}}\ln\frac{\exp(-D(G(\boldsymbol{z})))}{1+\exp(-D(G(\boldsymbol{z})))}\\
&=\nabla_{\boldsymbol{\theta}}\ln\frac{\frac{1}{\exp(D(G(\boldsymbol{z})))}}{\frac{1+\exp(D(G(\boldsymbol{z})))}{\exp(D(G(\boldsymbol{z})))}}\\
&=\nabla_{\boldsymbol{\theta}}\ln\frac{1}{1+\exp(D(G(\boldsymbol{z})))}\\
&=-\frac{\exp(D(G(\boldsymbol{z})))}{1+\exp(D(G(\boldsymbol{z})))}\left[\nabla_{\boldsymbol{x}}D(G(\boldsymbol{z}))\right]\frac{\partial G(\boldsymbol{z})}{\partial\boldsymbol{\theta}}\\
&=-\frac{1}{1+\exp(-D(G(\boldsymbol{z})))}\left[\nabla_{\boldsymbol{x}}D(G(\boldsymbol{z}))\right]\frac{\partial G(\boldsymbol{z})}{\partial\boldsymbol{\theta}}.
\end{aligned} \tag{20}$$

Substituting this back into $\widetilde{v}_{G}(\boldsymbol{\theta})$, we get

$$\widetilde{v}_{G}(\boldsymbol{\theta})=\begin{pmatrix}-\frac{1}{1+\exp(-D(G(\boldsymbol{z})))}\left[\nabla_{\boldsymbol{x}}D(G(\boldsymbol{z}))\right]\frac{\partial G(\boldsymbol{z})}{\partial\boldsymbol{\theta}}\\[4pt] 0\end{pmatrix}. \tag{21}$$

This is the gradient vector field of common GAN.
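The chain-rule result in Eq. (20) can be sanity-checked with finite differences. The sketch below assumes a toy scalar setting that is not part of the paper: a one-parameter generator $G_{\theta}(z)=\theta z$ and a quadratic discriminator logit $D(x)=ax+bx^{2}$ with arbitrarily chosen coefficients. It compares the analytic gradient from the last line of Eq. (20) with a central-difference estimate.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


# Hypothetical toy discriminator logit D(x) = a*x + b*x^2 and its derivative.
A, B = 0.7, -0.3


def D(x):
    return A * x + B * x * x


def dD_dx(x):
    return A + 2.0 * B * x


def G(theta, z):
    # Hypothetical one-parameter generator, so dG/dtheta = z.
    return theta * z


def objective(theta, z):
    # ln(1 - d(G(z))) with d = sigmoid(D), the quantity differentiated in Eq. (18)
    return math.log(1.0 - sigmoid(D(G(theta, z))))


def analytic_grad(theta, z):
    # Last line of Eq. (20): -sigmoid(D(G(z))) * grad_x D(G(z)) * dG/dtheta
    x = G(theta, z)
    return -sigmoid(D(x)) * dD_dx(x) * z


def numeric_grad(theta, z, h=1e-6):
    # Central finite difference in theta
    return (objective(theta + h, z) - objective(theta - h, z)) / (2.0 * h)


theta0, z0 = 0.8, 1.5
print(analytic_grad(theta0, z0), numeric_grad(theta0, z0))
```

The two printed numbers should agree to several decimal places, reflecting that the scaling factor in front of $\nabla_{\boldsymbol{x}}D$ is $-\frac{1}{1+\exp(-D(G(\boldsymbol{z})))}$.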

Now we examine the gradient vector field of the CFG method. In the CFG method, a hyper-parameter $M$ controls the number of update steps of the generator:

$$G_{m+1}(\boldsymbol{z})=G_{m}(\boldsymbol{z})+\eta_{m}g_{m}\left(G_{m}(\boldsymbol{z})\right). \tag{22}$$

The variables $M$ and $m$ have the same meaning. We discuss $M$ in two cases to analyze the gradient vector field of the generator for the CFG method. The loss function of the generator for the CFG method is

$$L(\boldsymbol{\theta})=\frac{1}{2}\left[G_{m+1}(\boldsymbol{z})-G_{1}(\boldsymbol{z})\right]^{2}. \tag{23}$$

First, consider $M=1$. In this case, the loss function can be written as

$$L(\boldsymbol{\theta})=\frac{1}{2}\left[G_{2}(\boldsymbol{z})-G_{1}(\boldsymbol{z})\right]^{2}. \tag{24}$$

Then the gradient vector field can be written as

$$\begin{aligned}
\overline{v}_{G}(\boldsymbol{\theta})&=\begin{pmatrix}\left[G_{2}(\boldsymbol{z})-G_{1}(\boldsymbol{z})\right]\frac{\partial G(\boldsymbol{z})}{\partial\boldsymbol{\theta}}\\[4pt] 0\end{pmatrix}\\
&=\begin{pmatrix}\left[\eta_{1}g_{1}(G(\boldsymbol{z}))\right]\frac{\partial G(\boldsymbol{z})}{\partial\boldsymbol{\theta}}\\[4pt] 0\end{pmatrix}\\
&=\begin{pmatrix}-\eta_{1}\delta(G(\boldsymbol{z}))\left[\nabla_{\boldsymbol{x}}D(G(\boldsymbol{z}))\right]\frac{\partial G(\boldsymbol{z})}{\partial\boldsymbol{\theta}}\\[4pt] 0\end{pmatrix}.
\end{aligned} \tag{25}$$

Since $g(\boldsymbol{x})=\delta(\boldsymbol{x})\cdot\nabla_{\boldsymbol{x}}D(\boldsymbol{x})$, we obtain the above result. Now we compare the forms of $\widetilde{v}_{G}(\boldsymbol{\theta})$ and $\overline{v}_{G}(\boldsymbol{\theta})$. If we set $-\frac{1}{1+\exp(-D(G(\boldsymbol{z})))}=-\eta_{1}\delta(G(\boldsymbol{z}))$, both sides act as scaling factors, and the two equations have exactly the same form. This completes the proof for this case.

Next, consider the second case, $M>1$ and $M\rightarrow\infty$. Eq. (22) can be expanded as:

$$\begin{aligned}
G_{m+1}(\boldsymbol{z})=&\ G_{m}(\boldsymbol{z})+\eta_{m}g_{m}\left(G_{m}(\boldsymbol{z})\right)\\
&\ \vdots\\
G_{3}(\boldsymbol{z})=&\ G_{2}(\boldsymbol{z})+\eta_{2}g_{2}\left(G_{2}(\boldsymbol{z})\right)\\
G_{2}(\boldsymbol{z})=&\ G_{1}(\boldsymbol{z})+\eta_{1}g_{1}\left(G_{1}(\boldsymbol{z})\right).
\end{aligned} \tag{26}$$

Summing both sides of these equations, we express $G_{m+1}(\boldsymbol{z})$ in terms of $G_{1}$ as:

$$G_{m+1}(\boldsymbol{z})=G_{1}(\boldsymbol{z})+\sum_{m=1}^{M}\eta_{m}g_{m}\left(G_{m}(\boldsymbol{z})\right). \tag{27}$$

Now consider $\overline{v}_{G}(\boldsymbol{\theta})$. When $M>1$, the gradient vector field can be written as

$$\begin{aligned}
\overline{v}_{G}(\boldsymbol{\theta})&=\begin{pmatrix}\nabla_{\boldsymbol{\theta}}\frac{1}{2}\left[G_{m+1}(\boldsymbol{z})-G_{1}(\boldsymbol{z})\right]^{2}\\[4pt] 0\end{pmatrix}\\
&=\begin{pmatrix}\nabla_{\boldsymbol{\theta}}\frac{1}{2}\left[\sum\limits_{m=1}^{M}\eta_{m}g_{m}\left(G_{m}(\boldsymbol{z})\right)\right]^{2}\\[4pt] 0\end{pmatrix}\\
&=\begin{pmatrix}\left[\sum\limits_{m=1}^{M}\eta_{m}g_{m}\left(G_{m}(\boldsymbol{z})\right)\right]\nabla_{\boldsymbol{\theta}}\left[\sum\limits_{j=1}^{M}G_{j}(\boldsymbol{z})\right]\\[4pt] 0\end{pmatrix}\\
&=\begin{pmatrix}\left[\sum\limits_{m=1}^{M}\eta_{m}g_{m}\left(G_{m}(\boldsymbol{z})\right)\right]\left[\sum\limits_{j=1}^{M}\frac{\partial G_{j}(\boldsymbol{z})}{\partial\boldsymbol{\theta}}\right]\\[4pt] 0\end{pmatrix}\\
&=\begin{pmatrix}\sum\limits_{m=1}^{M}\sum\limits_{j=1}^{M}\eta_{m}g_{m}\left(G_{m}(\boldsymbol{z})\right)\frac{\partial G_{j}(\boldsymbol{z})}{\partial\boldsymbol{\theta}}\\[4pt] 0\end{pmatrix}\\
&=\begin{pmatrix}\sum\limits_{m=1}^{M}\sum\limits_{j=1}^{M}-\eta_{m}\delta(G_{m}(\boldsymbol{z}))\left[\nabla_{\boldsymbol{x}}D(G_{m}(\boldsymbol{z}))\right]\frac{\partial G_{j}(\boldsymbol{z})}{\partial\boldsymbol{\theta}}\\[4pt] 0\end{pmatrix}.
\end{aligned} \tag{28}$$

This gives the gradient vector field of the CFG method when $M>1$. Every term in the sum is a gradient vector of the same form as the gradient vector when $M=1$, which is equivalent to common GAN. The sum of these gradient vectors, which is the CFG method with $M>1$, is therefore also equivalent to common GAN. So Lemma B.2 is proved. ∎
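The telescoping step from Eq. (26) to Eq. (27) can be illustrated with a short Python sketch. It assumes a scalar sample, a constant scaling $\delta$ (in the paper $\delta(\boldsymbol{x})$ is a function of $\boldsymbol{x}$), and a hypothetical discriminator $D(x)=-\frac{1}{2}(x-2)^{2}$ whose gradient points toward $x=2$. None of these choices come from the paper. The functional gradient is taken as $g(x)=\delta\,\nabla_{x}D(x)$.

```python
def cfg_refine(x1, grad_D, etas, delta=1.0):
    """Apply the update of Eq. (22) once per step size in `etas`.

    Returns the trajectory [G_1(z), ..., G_{M+1}(z)] and the list of
    increments eta_m * g_m(G_m(z)), where g = delta * grad_D.
    """
    trajectory = [x1]
    increments = []
    for eta in etas:
        step = eta * delta * grad_D(trajectory[-1])
        increments.append(step)
        trajectory.append(trajectory[-1] + step)
    return trajectory, increments


def toy_grad_D(x):
    # Gradient of the hypothetical discriminator D(x) = -0.5 * (x - 2)^2
    return -(x - 2.0)


trajectory, increments = cfg_refine(x1=0.0, grad_D=toy_grad_D, etas=[0.5] * 4)

# Eq. (27): the final output equals G_1(z) plus the sum of all increments.
print(trajectory[-1], trajectory[0] + sum(increments))
```

Because each step only adds an increment to the previous output, the final sample differs from $G_{1}(\boldsymbol{z})$ by exactly the sum in Eq. (27), which is why the $M>1$ vector field decomposes into terms of the $M=1$ form.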

Appendix C Theoretical analysis of our theory

C.1 Latent N-size with gradient penalty

In this section, we will provide a detailed proof procedure for Definition 3.1 and Proposition 3.5. Since Definition 3.1 is the foundational definition for Proposition 3.5, the detailed proof procedure for Definition 3.1 will be presented in the proof of Proposition 3.5.

Latent N-size. We illustrate the concept of latent N-size, which serves as the foundational definition for the subsequent theory.

Definition 3.1.

Let $\boldsymbol{z}_{1}$, $\boldsymbol{z}_{2}$ be two samples in the latent space. Suppose $\boldsymbol{z}_{1}$ is attracted to the mode $\mathcal{M}_{i}$ by $\hat{\epsilon}$; then there exists a neighborhood $\mathcal{N}_{r}\left(\boldsymbol{z}_{1}\right)$ of $\boldsymbol{z}_{1}$ such that $\boldsymbol{z}_{2}$ is distracted to $\mathcal{M}_{i}$ by $(\hat{\epsilon}/2-2\alpha)$, for all $\boldsymbol{z}_{2}\in\mathcal{N}_{r}\left(\boldsymbol{z}_{1}\right)$. The size of $\mathcal{N}_{r}\left(\boldsymbol{z}_{1}\right)$ can be arbitrarily large but is bounded by an open ball of radius $r$. The radius $r$ is defined as

$$r=\hat{\epsilon}\cdot\left(2\inf_{\boldsymbol{z}}\left\{\left(\frac{\left\|G_{\theta_{t}}\left(\boldsymbol{z}_{1}\right)-G_{\theta_{t}}(\boldsymbol{z})\right\|}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}\right\|}+\frac{\left\|G_{\theta_{t+1}}\left(\boldsymbol{z}_{1}\right)-G_{\theta_{t+1}}(\boldsymbol{z})\right\|}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}\right\|}\right)\right\}\right)^{-1},$$

where the meanings of mode $\mathcal{M}$, attracted, and distracted are given in Definitions 3.2, 3.3, and 3.4, respectively.

According to this definition, the radius $r$ is inversely proportional to the discrepancy between the preceding and subsequent outputs of the generator, given a similar latent vector $\boldsymbol{z}$. A large value of $r$ corresponds to a small difference between the previous and subsequent generator outputs, which leads to mode collapse. Conversely, a small value of $r$ corresponds to a large difference, which results in diverse synthesis.
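As a rough numerical illustration of the radius in Definition 3.1, the following sketch approximates the infimum by a minimum over a finite grid of latent points. The scalar generators `g_before` and `g_after`, the grid, and the value of `eps_hat` are hypothetical choices made only for this example and do not come from the paper.

```python
def latent_radius(g_old, g_new, z1, candidates, eps_hat):
    """Approximate r = eps_hat / (2 * inf_z [q_t(z) + q_{t+1}(z)]).

    q_t and q_{t+1} are the difference quotients |G(z1) - G(z)| / |z1 - z|
    of the generator before and after one update. The infimum is replaced
    by a minimum over the finite list `candidates`.
    """
    sums = []
    for z in candidates:
        dz = abs(z1 - z)
        if dz == 0.0:
            continue  # the quotient is undefined at z = z1
        q_old = abs(g_old(z1) - g_old(z)) / dz
        q_new = abs(g_new(z1) - g_new(z)) / dz
        sums.append(q_old + q_new)
    return eps_hat / (2.0 * min(sums))


def g_before(z):
    return 1.0 * z  # hypothetical generator before the update


def g_after(z):
    return 3.0 * z  # hypothetical generator after the update


grid = [i / 10.0 for i in range(-20, 21)]
r = latent_radius(g_before, g_after, z1=0.0, candidates=grid, eps_hat=0.4)
print(r)  # close to 0.4 / (2 * (1 + 3)) = 0.05
```

In this toy setting, a steeper generator after the update increases the sum of difference quotients and therefore shrinks the estimated radius, which matches the inverse relationship described above.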

Latent N-size and the diversity. We offer three fundamental definitions of the latent N-size and demonstrate its relationship to the diversity of synthetic samples.

Definition 3.2 (Modes in Image Space).

There exist modes $\mathcal{M}$ that cover the image space $\mathcal{Y}$. A mode $\mathcal{M}_{i}$ is a subset of $\mathcal{Y}$ satisfying $\max_{\boldsymbol{y_{k,j}}\in\mathcal{M}_{i}}\|\boldsymbol{y_{k}}-\boldsymbol{y_{j}}\|<\alpha$ and $\min_{\boldsymbol{y_{k}}\in\mathcal{M}_{i},\boldsymbol{y_{m}}\notin\mathcal{M}_{i}}\alpha<\|\boldsymbol{y_{m}}-\boldsymbol{y_{k}}\|<2\alpha$, where $\boldsymbol{y_{k}}$ and $\boldsymbol{y_{j}}$ belong to the same mode $\mathcal{M}_{i}$, $\boldsymbol{y_{m}}$ and $\boldsymbol{y_{k}}$ belong to different modes, and $\alpha>0$.

Definition 3.3 (Modes Attracted).

Let $\boldsymbol{z}_{1}$ be a sample in latent space. We say $\boldsymbol{z}_{1}$ is attracted to a mode $\mathcal{M}_{i}$ by $\hat{\epsilon}$ from a gradient step if $\|\boldsymbol{y_{k}}-G_{\theta_{t+1}}(\boldsymbol{z}_{1})\|+\hat{\epsilon}<\|\boldsymbol{y_{k}}-G_{\theta_{t}}(\boldsymbol{z}_{1})\|$, where $\boldsymbol{y_{k}}\in\mathcal{M}_{i}$ is an image in the mode $\mathcal{M}_{i}$, $\hat{\epsilon}$ denotes a small quantity, and $\theta_{t}$ and $\theta_{t+1}$ are the generator parameters before and after the gradient update, respectively.

Definition 3.4 (Modes Distracted).

Let $\boldsymbol{z}_{2}$ be a sample in latent space. We say $\boldsymbol{z}_{2}$ is distracted to a mode $\mathcal{M}_{i}$ by $(\hat{\epsilon}/2-2\alpha)$ from a gradient step if $\|\boldsymbol{y_{m}}-G_{\theta_{t+1}}(\boldsymbol{z}_{2})\|+(\hat{\epsilon}/2-2\alpha)<\|\boldsymbol{y_{k}}-G_{\theta_{t}}(\boldsymbol{z}_{2})\|$, where $\boldsymbol{y_{k}}\in\mathcal{M}_{i}$ is an image in the mode $\mathcal{M}_{i}$, $\boldsymbol{y_{m}}\notin\mathcal{M}_{i}$ is an image from another mode, $\alpha$ has the same meaning as in Definition 3.2, and $\theta_{t}$ and $\theta_{t+1}$ are the generator parameters before and after the gradient update, respectively.
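The inequalities in Definitions 3.3 and 3.4 can be written as two small predicates. This is only a sketch: images are represented as flat lists of floats, and the helper names `dist`, `is_attracted`, and `is_distracted` as well as the sample points are hypothetical.

```python
import math


def dist(a, b):
    """Euclidean distance between two flat vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def is_attracted(y_k, g_old_z1, g_new_z1, eps_hat):
    """Definition 3.3: ||y_k - G_{t+1}(z1)|| + eps_hat < ||y_k - G_t(z1)||."""
    return dist(y_k, g_new_z1) + eps_hat < dist(y_k, g_old_z1)


def is_distracted(y_k, y_m, g_old_z2, g_new_z2, eps_hat, alpha):
    """Definition 3.4 with margin (eps_hat / 2 - 2 * alpha)."""
    margin = eps_hat / 2.0 - 2.0 * alpha
    return dist(y_m, g_new_z2) + margin < dist(y_k, g_old_z2)


# Hypothetical 2-D "images": y_k lies in mode M_i, y_m in another mode.
y_k, y_m = [0.0, 0.0], [1.5, 0.0]
print(is_attracted(y_k, [1.0, 0.0], [0.5, 0.0], eps_hat=0.2))
print(is_distracted(y_k, y_m, [1.0, 0.0], [0.6, 0.0], eps_hat=0.2, alpha=1.0))
```

The first call checks whether one update moved the generator output for $\boldsymbol{z}_{1}$ closer to $\boldsymbol{y_{k}}$ by more than $\hat{\epsilon}$. The second call evaluates the inequality of Definition 3.4 for a nearby latent point.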

Latent N-size with gradient penalty. We demonstrate the relationship between the latent N-size and the gradient penalty. According to Proposition 3.5, as $\|\nabla_{\boldsymbol{x}}D(\boldsymbol{x})\|$ increases, the latent N-size decreases, and vice versa.

Proposition 3.5.

$\mathcal{N}_{r}(\boldsymbol{z}_{1})$ can be defined with the discriminator gradient penalty as follows:

\[
r=\hat{\epsilon}\cdot\left(2\inf_{\boldsymbol{z}}\left\{\frac{2\|G_{\theta_{t}}(\boldsymbol{z}_{1})-G_{\theta_{t}}(\boldsymbol{z})\|+\eta_{m}\delta(\boldsymbol{x})\sum_{m=1}^{N}\left(\|\nabla_{\boldsymbol{x}}D_{m}(\mathcal{Y}_{2})\|+\|\nabla_{\boldsymbol{x}}D_{m}(\mathcal{Y})\|\right)}{\|\boldsymbol{z}_{1}-\boldsymbol{z}\|}\right\}\right)^{-1},
\]

where $\|\nabla_{\boldsymbol{x}}D_{m}(\mathcal{Y}_{2})\|=\|\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}_{2}))+R\|$ and $\|\nabla_{\boldsymbol{x}}D_{m}(\mathcal{Y})\|=\|\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))+R\|$. $R$ stands for the discriminator gradient penalty.

Proof.

With the gradient penalty, the CFG formulation of the generator update is

\begin{equation}
\begin{aligned}
G_{\theta_{t+1}}(\boldsymbol{z})&=G_{\theta_{t}}(\boldsymbol{z})+\sum_{m=1}^{N}\eta_{m}\delta(\boldsymbol{x})\left(\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))\right)\\
&=G_{\theta_{t}}(\boldsymbol{z})+\sum_{m=1}^{N}\eta_{m}\delta(\boldsymbol{x})\left(\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))+R\right).
\end{aligned}
\tag{29}
\end{equation}

We are now ready to prove Proposition 3.5.

\begin{equation}
\begin{aligned}
\|\boldsymbol{y_{m}}-G_{\theta_{t+1}}(\boldsymbol{z}_{2})\|
&\leq\|\boldsymbol{y_{m}}-\boldsymbol{y_{k}}\|+\|\boldsymbol{y_{k}}-G_{\theta_{t+1}}(\boldsymbol{z}_{2})\|\\
&\leq\|\boldsymbol{y_{m}}-\boldsymbol{y_{k}}\|+\|\boldsymbol{y_{k}}-G_{\theta_{t+1}}(\boldsymbol{z}_{1})\|+\|G_{\theta_{t+1}}(\boldsymbol{z}_{1})-G_{\theta_{t+1}}(\boldsymbol{z}_{2})\|\\
&<\|\boldsymbol{y_{m}}-\boldsymbol{y_{k}}\|+\|\boldsymbol{y_{k}}-G_{\theta_{t}}(\boldsymbol{z}_{1})\|+\|G_{\theta_{t+1}}(\boldsymbol{z}_{1})-G_{\theta_{t+1}}(\boldsymbol{z}_{2})\|-\hat{\epsilon}\\
&\leq\|\boldsymbol{y_{m}}-\boldsymbol{y_{k}}\|+\|\boldsymbol{y_{k}}-G_{\theta_{t}}(\boldsymbol{z}_{2})\|+\|G_{\theta_{t}}(\boldsymbol{z}_{2})-G_{\theta_{t}}(\boldsymbol{z}_{1})\|\\
&\quad+\|G_{\theta_{t+1}}(\boldsymbol{z}_{1})-G_{\theta_{t+1}}(\boldsymbol{z}_{2})\|-\hat{\epsilon}\\
&=\left(\frac{\|G_{\theta_{t}}(\boldsymbol{z}_{1})-G_{\theta_{t}}(\boldsymbol{z}_{2})\|}{\|\boldsymbol{z}_{1}-\boldsymbol{z}_{2}\|}+\frac{\|G_{\theta_{t+1}}(\boldsymbol{z}_{1})-G_{\theta_{t+1}}(\boldsymbol{z}_{2})\|}{\|\boldsymbol{z}_{1}-\boldsymbol{z}_{2}\|}\right)\|\boldsymbol{z}_{1}-\boldsymbol{z}_{2}\|\\
&\quad+\|\boldsymbol{y_{m}}-\boldsymbol{y_{k}}\|+\|\boldsymbol{y_{k}}-G_{\theta_{t}}(\boldsymbol{z}_{2})\|-\hat{\epsilon}.
\end{aligned}
\tag{30}
\end{equation}

Since $\|\boldsymbol{y_{m}}-\boldsymbol{y_{k}}\|<2\alpha$ by Definition 3.2, this implies

\[
\|\boldsymbol{y_{m}}-G_{\theta_{t+1}}(\boldsymbol{z}_{2})\|+\left(\frac{\hat{\epsilon}}{2}-2\alpha\right)<\|\boldsymbol{y_{k}}-G_{\theta_{t}}(\boldsymbol{z}_{2})\|,
\]

which is the condition of Definition 3.4, provided that

\[
\left(\frac{\|G_{\theta_{t}}(\boldsymbol{z}_{1})-G_{\theta_{t}}(\boldsymbol{z}_{2})\|}{\|\boldsymbol{z}_{1}-\boldsymbol{z}_{2}\|}+\frac{\|G_{\theta_{t+1}}(\boldsymbol{z}_{1})-G_{\theta_{t+1}}(\boldsymbol{z}_{2})\|}{\|\boldsymbol{z}_{1}-\boldsymbol{z}_{2}\|}\right)\|\boldsymbol{z}_{1}-\boldsymbol{z}_{2}\|\leq\frac{\hat{\epsilon}}{2}.
\]

We define the latent N-size of $\boldsymbol{z}_{1}$ as

\[
\mathcal{N}_{\tau}(\boldsymbol{z}_{1})=\left\{\boldsymbol{z}:\left(\frac{\|G_{\theta_{t}}(\boldsymbol{z}_{1})-G_{\theta_{t}}(\boldsymbol{z})\|}{\|\boldsymbol{z}_{1}-\boldsymbol{z}\|}+\frac{\|G_{\theta_{t+1}}(\boldsymbol{z}_{1})-G_{\theta_{t+1}}(\boldsymbol{z})\|}{\|\boldsymbol{z}_{1}-\boldsymbol{z}\|}\right)\leq\tau\right\}.
\]

Then the maximum latent N-size is

\begin{equation}
r=\hat{\epsilon}\cdot\left(2\inf_{\boldsymbol{z}}\left\{\frac{\|G_{\theta_{t}}(\boldsymbol{z}_{1})-G_{\theta_{t}}(\boldsymbol{z})\|}{\|\boldsymbol{z}_{1}-\boldsymbol{z}\|}+\frac{\|G_{\theta_{t+1}}(\boldsymbol{z}_{1})-G_{\theta_{t+1}}(\boldsymbol{z})\|}{\|\boldsymbol{z}_{1}-\boldsymbol{z}\|}\right\}\right)^{-1}.
\tag{31}
\end{equation}

This completes the proof of Definition 3.1.

Next, we substitute the CFG generator update of Eq. (29) into Eq. (31) and expand the sum inside the braces:

\begin{equation}
\begin{aligned}
&\frac{\|G_{\theta_{t}}(\boldsymbol{z}_{1})-G_{\theta_{t}}(\boldsymbol{z})\|}{\|\boldsymbol{z}_{1}-\boldsymbol{z}\|}+\frac{\|G_{\theta_{t+1}}(\boldsymbol{z}_{1})-G_{\theta_{t+1}}(\boldsymbol{z})\|}{\|\boldsymbol{z}_{1}-\boldsymbol{z}\|}\\
&=\frac{\|G_{\theta_{t}}(\boldsymbol{z}_{1})-G_{\theta_{t}}(\boldsymbol{z})\|}{\|\boldsymbol{z}_{1}-\boldsymbol{z}\|}\\
&\quad+\frac{\left\|G_{\theta_{t}}(\boldsymbol{z}_{1})-G_{\theta_{t}}(\boldsymbol{z})+\sum_{m=1}^{N}\eta_{m}\delta(\boldsymbol{x})\left[(\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}_{1}))+R)-(\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))+R)\right]\right\|}{\|\boldsymbol{z}_{1}-\boldsymbol{z}\|}\\
&\leq\frac{2\|G_{\theta_{t}}(\boldsymbol{z}_{1})-G_{\theta_{t}}(\boldsymbol{z})\|}{\|\boldsymbol{z}_{1}-\boldsymbol{z}\|}\\
&\quad+\frac{\left\|\sum_{m=1}^{N}\eta_{m}\delta(\boldsymbol{x})\left[(\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}_{1}))+R)-(\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))+R)\right]\right\|}{\|\boldsymbol{z}_{1}-\boldsymbol{z}\|}\\
&\leq\frac{2\|G_{\theta_{t}}(\boldsymbol{z}_{1})-G_{\theta_{t}}(\boldsymbol{z})\|}{\|\boldsymbol{z}_{1}-\boldsymbol{z}\|}\\
&\quad+\frac{\left\|\sum_{m=1}^{N}\eta_{m}\delta(\boldsymbol{x})(\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}_{1}))+R)\right\|+\left\|\sum_{m=1}^{N}\eta_{m}\delta(\boldsymbol{x})(\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))+R)\right\|}{\|\boldsymbol{z}_{1}-\boldsymbol{z}\|}.
\end{aligned}
\tag{32}
\end{equation}

For simplicity, we define $\|\nabla_{\boldsymbol{x}}D_{m}(\mathcal{Y}_{2})\|=\|\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}_{2}))+R\|$ and $\|\nabla_{\boldsymbol{x}}D_{m}(\mathcal{Y})\|=\|\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))+R\|$, and move the factor $\eta_{m}\delta(\boldsymbol{x})\sum_{m=1}^{N}$ outside the norm. This completes the proof of Proposition 3.5. ∎

C.2 $\boldsymbol{\varepsilon}$-centered GP

From Eq. (3) and Eq. (4) of the CFG method, we can derive the following conclusion: $\nabla_{\boldsymbol{x}}D(\boldsymbol{x})\leq 0$. This is an important corollary of the CFG method. With this corollary, we can prove in the following section that the latent N-size corresponding to our $\boldsymbol{\varepsilon}$-centered GP is the smallest among the three gradient penalties.

Proof.

Eq. (3) of CFG implies that, for $\frac{dL(p_{m})}{dm}$ to be negative so that the distance $L$ decreases, we should choose $g_{m}(\boldsymbol{x})$ to be

\[
g_{m}(\boldsymbol{x})=-s_{m}(\boldsymbol{x})\phi_{0}\left(\nabla_{\boldsymbol{x}}\ell_{2}^{\prime}\left(p_{*}(\boldsymbol{x}),p_{m}(\boldsymbol{x})\right)\right),
\]

where $s_{m}(\boldsymbol{x})>0$ is an arbitrary scaling factor and $\phi_{0}(u)$ is a vector function such that $\phi(u)=u\cdot\phi_{0}(u)\geq 0$ and $\phi(u)=0$ if and only if $u=0$, e.g., $\left(\phi_{0}(u)=u,\ \phi(u)=\|u\|_{2}^{2}\right)$ or $\left(\phi_{0}(u)=\operatorname{sign}(u),\ \phi(u)=\|u\|_{1}\right)$. With this choice of $g_{m}(\boldsymbol{x})$, we obtain

\[
\frac{dL(p_{m})}{dm}=-\int s_{m}(\boldsymbol{x})p_{m}(\boldsymbol{x})\phi\left(\nabla_{\boldsymbol{x}}\ell_{2}^{\prime}\left(p_{*}(\boldsymbol{x}),p_{m}(\boldsymbol{x})\right)\right)d\boldsymbol{x}\leq 0,
\]

that is, the distance $L$ is guaranteed to decrease unless equality holds. It follows that $g_{m}(\boldsymbol{x})\leq 0$. We then examine Eq. (4), from which we can derive

\[
\begin{aligned}
g_{m}(\boldsymbol{x})&=-s_{m}(\boldsymbol{x})\phi_{0}\left(\nabla_{x}\ell_{2}^{\prime}\left(p_{*}(\boldsymbol{x}),p_{m}(\boldsymbol{x})\right)\right)\\
&=-s_{m}(\boldsymbol{x})\nabla_{x}\ell_{2}^{\prime}\left(p_{*}(\boldsymbol{x}),p_{m}(\boldsymbol{x})\right)\\
&=-s_{m}(\boldsymbol{x})f^{\prime\prime}\left(r_{m}(\boldsymbol{x})\right)\nabla r_{m}(\boldsymbol{x})\\
&\approx s_{m}(\boldsymbol{x})f^{\prime\prime}\left(\tilde{r}_{m}(\boldsymbol{x})\right)\tilde{r}_{m}(\boldsymbol{x})\nabla_{x}D(\boldsymbol{x}),
\end{aligned}
\]

where $s_{m}(x)>0$ is an arbitrary scaling factor, $\ell_{2}(\rho_{*},\rho)=\rho_{*}f(\rho/\rho_{*})$, $\nabla_{x}\ell_{2}^{\prime}(p_{*}(x),p_{m}(x))=f^{\prime\prime}(r_{m}(x))\nabla r_{m}(x)$, $f^{\prime\prime}=1/x^{2}$ when $f$ is the KL-divergence, and $r_{m}(\boldsymbol{x})=\exp(-D(\boldsymbol{x}))\approx p_{m}(x)/p_{*}(x)=\tilde{r}_{m}(\boldsymbol{x})$ when $D(x)\approx\ln\frac{p_{*}(x)}{p_{m}(x)}$, which is the analytic solution of the CFG discriminator. From this formulation, $s(\boldsymbol{x})$ is a scaling function that is always greater than 0, $\tilde{r}(\boldsymbol{x})$ is an exponential function, and $f_{kl}^{\prime\prime}(\tilde{r}(\boldsymbol{x}))$ is a non-negative function of the exponential function $\tilde{r}(\boldsymbol{x})$. So if $g_{m}(\boldsymbol{x})\leq 0$ holds, then $\nabla_{\boldsymbol{x}}D(\boldsymbol{x})\leq 0$ also holds. ∎

We define our method as the $\boldsymbol{\varepsilon}$-centered Gradient Penalty. We use the bold symbol $\boldsymbol{\varepsilon}$ in the name and equation of the penalty to distinguish it from the scalar hyper-parameter $\varepsilon^{\prime}$. The $\boldsymbol{\varepsilon}$-centered GP is

\begin{equation}
R(\theta,\psi)=\frac{\gamma}{2}\mathrm{E}_{\hat{\boldsymbol{x}}}\left(\left\|\nabla_{\boldsymbol{x}}D_{\psi}(\hat{\boldsymbol{x}})-\boldsymbol{\varepsilon}\right\|\right)^{2},
\tag{33}
\end{equation}

where $\boldsymbol{\varepsilon}$ is a vector such that $\|\boldsymbol{\varepsilon}\|_{2}=\varepsilon^{\prime}$ with $\varepsilon^{\prime}=\sqrt{C\cdot N^{2}\cdot\varepsilon^{2}}$, $\varepsilon$ is the common value of the entries of $\boldsymbol{\varepsilon}$, and $N$ and $C$ are the side length in pixels and the number of channels of the real data, respectively. $\hat{\boldsymbol{x}}$ is sampled uniformly on the line segment between two random points $\boldsymbol{x}_{1}\sim p_{\theta}(\boldsymbol{x}_{1})$ and $\boldsymbol{x}_{2}\sim p_{\mathcal{D}}(\boldsymbol{x}_{2})$.
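To make Eq. (33) concrete, the sketch below estimates the $\boldsymbol{\varepsilon}$-centered penalty by Monte Carlo sampling on flattened vectors. It is an illustration under stated assumptions, not the authors' implementation: the discriminator gradient is supplied as a plain function `grad_fn`, the toy discriminator $D(x)=-\frac{1}{2}\|x\|^{2}$ is hypothetical, and a real training loop would obtain the gradient from an automatic differentiation framework instead.

```python
import math
import random


def eps_vector(eps_prime, dim):
    """Constant vector of length `dim` whose 2-norm equals eps_prime."""
    return [eps_prime / math.sqrt(dim)] * dim


def eps_centered_gp(grad_fn, fake_batch, real_batch, eps_prime, gamma, rng):
    """Monte Carlo estimate of (gamma / 2) * E ||grad D(x_hat) - eps||^2."""
    total = 0.0
    for x1, x2 in zip(fake_batch, real_batch):
        t = rng.random()  # uniform position on the segment between x1 and x2
        x_hat = [t * a + (1.0 - t) * b for a, b in zip(x1, x2)]
        grad = grad_fn(x_hat)
        eps = eps_vector(eps_prime, len(grad))
        total += sum((g - e) ** 2 for g, e in zip(grad, eps))
    return 0.5 * gamma * total / len(fake_batch)


def toy_grad(x):
    # Hypothetical discriminator D(x) = -0.5 * ||x||^2, so grad D(x) = -x.
    return [-v for v in x]


rng = random.Random(0)
fake = [[rng.gauss(0.0, 1.0) for _ in range(12)] for _ in range(8)]
real = [[rng.gauss(0.0, 1.0) for _ in range(12)] for _ in range(8)]
print(eps_centered_gp(toy_grad, fake, real, eps_prime=0.3, gamma=10.0, rng=rng))
```

The penalty vanishes exactly when the gradient equals $\boldsymbol{\varepsilon}$ at every sampled point, so minimizing it pulls the discriminator gradient toward a vector of norm $\varepsilon^{\prime}$ rather than toward zero.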

Combined with the corollary $\nabla_{\boldsymbol{x}}D(\boldsymbol{x})\leq 0$, our $\boldsymbol{\varepsilon}$-centered GP increases $\|\nabla_{\boldsymbol{x}}D(\boldsymbol{x})\|$ and thereby achieves a smaller latent N-size than the other two gradient penalties.

When training the GAN model with the loss function Eq. (8) and Eq. (9), our approach will control the gradient of the discriminator and result in a diversity of synthesized samples.

C.3 Latent N-size with different gradient penalty

In this section, we will derive our final theorem about the latent N-size with different gradient penalties.

First, we establish the relationship among the latent N-size for three gradient penalties in the following Lemma. We will substitute different types of gradient penalty into the Proposition 3.5.

Lemma 3.6.

The norms of the three gradient penalties, which dictate the latent N-size, are defined as $\|R_{1}\|=\left\|\left(\|\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))\|-g_{0}\right)\right\|$, $\|R_{0}\|=\left\|\left(\|\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))\|\right)\right\|$, and $\|R_{\boldsymbol{\varepsilon}}\|=\left\|\left(\|\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))\|+\|\boldsymbol{\varepsilon}\|\right)\right\|$, respectively. The norms of the three gradient penalties are ordered as $\|R_{1}\|<\|R_{0}\|<\|R_{\boldsymbol{\varepsilon}}\|$. Consequently, the latent N-sizes of the three gradient penalties satisfy $r_{R_{1}}>r_{R_{0}}>r_{R_{\boldsymbol{\varepsilon}}}$.

Proof.

We start from the result of Eq. (32). We insert the three gradient penalties into it and determine the latent N-size for each of them. We then focus on the primary component of the three resulting expressions and compare their values.

The $1$-centered gradient penalty is

\begin{equation}
R_{g_{0}}=\frac{\gamma}{2}\mathrm{E}_{\hat{\boldsymbol{x}}}\left(\left\|\nabla_{\boldsymbol{x}}D_{m}(\hat{\boldsymbol{x}})\right\|-g_{0}\right)^{2}.
\tag{34}
\end{equation}

The $0$-centered gradient penalty is

\begin{equation}
R_{0}=\frac{\gamma}{2}\mathrm{E}_{\hat{\boldsymbol{x}}}\left\|\nabla_{\hat{\boldsymbol{x}}}D_{\psi}(\hat{\boldsymbol{x}})\right\|^{2}.
\tag{35}
\end{equation}

The $\boldsymbol{\varepsilon}$-centered gradient penalty is

\begin{equation}
R_{\varepsilon}=\frac{\gamma}{2}\mathrm{E}_{\hat{\boldsymbol{x}}}\left\|\nabla_{\boldsymbol{x}}D_{\psi}(\hat{\boldsymbol{x}})-\boldsymbol{\varepsilon}\right\|^{2}.
\tag{36}
\end{equation}

We substitute these penalties into Eq. (29) and expand the gradient penalty expression, omitting the expectation symbol and focusing on the gradient. Plugging the results into Eq. (32) then gives, for each gradient penalty, the sum inside the braces of the latent N-size. For simplicity, we focus on this sum and omit the other symbols. We call it the determined part of the latent N-size.

The determined part of the latent N-size of the $1$-centered gradient penalty is

\begin{equation}
\begin{aligned}
&\frac{\|G_{\theta_{t}}(\boldsymbol{z}_{1})-G_{\theta_{t}}(\boldsymbol{z})\|}{\|\boldsymbol{z}_{1}-\boldsymbol{z}\|}+\frac{\|G_{\theta_{t+1}}(\boldsymbol{z}_{1})-G_{\theta_{t+1}}(\boldsymbol{z})\|}{\|\boldsymbol{z}_{1}-\boldsymbol{z}\|}\\
&\leq\frac{2\|G_{\theta_{t}}(\boldsymbol{z}_{1})-G_{\theta_{t}}(\boldsymbol{z})\|}{\|\boldsymbol{z}_{1}-\boldsymbol{z}\|}\\
&\quad+\frac{\left\|\sum_{m=1}^{N}\eta_{m}\delta(\boldsymbol{x})(\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}_{1}))+R)\right\|+\left\|\sum_{m=1}^{N}\eta_{m}\delta(\boldsymbol{x})(\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))+R)\right\|}{\|\boldsymbol{z}_{1}-\boldsymbol{z}\|}\\
&=\frac{2\|G_{\theta_{t}}(\boldsymbol{z}_{1})-G_{\theta_{t}}(\boldsymbol{z})\|}{\|\boldsymbol{z}_{1}-\boldsymbol{z}\|}\\
&\quad+\frac{\left\|\sum_{m=1}^{N}\eta_{m}\delta(\boldsymbol{x})\left(\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}_{1}))+\left(\|\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}_{1}))\|-g_{0}\right)^{2}\right)\right\|}{\|\boldsymbol{z}_{1}-\boldsymbol{z}\|}\\
&\quad+\frac{\left\|\sum_{m=1}^{N}\eta_{m}\delta(\boldsymbol{x})\left(\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))+\left(\|\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))\|-g_{0}\right)^{2}\right)\right\|}{\|\boldsymbol{z}_{1}-\boldsymbol{z}\|}.
\end{aligned}
\tag{37}
\end{equation}

The determined part of the latent N-size of the $0$-centered gradient penalty is

\begin{equation}
\begin{aligned}
&\frac{\|G_{\theta_{t}}(\boldsymbol{z}_{1})-G_{\theta_{t}}(\boldsymbol{z})\|}{\|\boldsymbol{z}_{1}-\boldsymbol{z}\|}+\frac{\|G_{\theta_{t+1}}(\boldsymbol{z}_{1})-G_{\theta_{t+1}}(\boldsymbol{z})\|}{\|\boldsymbol{z}_{1}-\boldsymbol{z}\|}\\
&\leq\frac{2\|G_{\theta_{t}}(\boldsymbol{z}_{1})-G_{\theta_{t}}(\boldsymbol{z})\|}{\|\boldsymbol{z}_{1}-\boldsymbol{z}\|}\\
&\quad+\frac{\left\|\sum_{m=1}^{N}\eta_{m}\delta(\boldsymbol{x})(\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}_{1}))+R)\right\|+\left\|\sum_{m=1}^{N}\eta_{m}\delta(\boldsymbol{x})(\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))+R)\right\|}{\|\boldsymbol{z}_{1}-\boldsymbol{z}\|}\\
&=\frac{2\|G_{\theta_{t}}(\boldsymbol{z}_{1})-G_{\theta_{t}}(\boldsymbol{z})\|}{\|\boldsymbol{z}_{1}-\boldsymbol{z}\|}\\
&\quad+\frac{\left\|\sum_{m=1}^{N}\eta_{m}\delta(\boldsymbol{x})\left(\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}_{1}))+\left(\|\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}_{1}))\|\right)^{2}\right)\right\|}{\|\boldsymbol{z}_{1}-\boldsymbol{z}\|}\\
&\quad+\frac{\left\|\sum_{m=1}^{N}\eta_{m}\delta(\boldsymbol{x})\left(\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))+\left(\|\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))\|\right)^{2}\right)\right\|}{\|\boldsymbol{z}_{1}-\boldsymbol{z}\|}.
\end{aligned}
\tag{38}
\end{equation}

With the prior knowledge from CFG that $\nabla_{x}D_{m}(x)\leq 0$, the determined part of the latent N-size of the $\boldsymbol{\varepsilon}$-centered gradient penalty is

\begin{equation}
\begin{aligned}
&\frac{\|G_{\theta_{t}}(\boldsymbol{z}_{1})-G_{\theta_{t}}(\boldsymbol{z})\|}{\|\boldsymbol{z}_{1}-\boldsymbol{z}\|}+\frac{\|G_{\theta_{t+1}}(\boldsymbol{z}_{1})-G_{\theta_{t+1}}(\boldsymbol{z})\|}{\|\boldsymbol{z}_{1}-\boldsymbol{z}\|}\\
&\leq\frac{2\|G_{\theta_{t}}(\boldsymbol{z}_{1})-G_{\theta_{t}}(\boldsymbol{z})\|}{\|\boldsymbol{z}_{1}-\boldsymbol{z}\|}\\
&\quad+\frac{\left\|\sum_{m=1}^{N}\eta_{m}\delta(\boldsymbol{x})(\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}_{1}))+R)\right\|+\left\|\sum_{m=1}^{N}\eta_{m}\delta(\boldsymbol{x})(\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))+R)\right\|}{\|\boldsymbol{z}_{1}-\boldsymbol{z}\|}\\
&=\frac{2\|G_{\theta_{t}}(\boldsymbol{z}_{1})-G_{\theta_{t}}(\boldsymbol{z})\|}{\|\boldsymbol{z}_{1}-\boldsymbol{z}\|}\\
&\quad+\frac{\left\|\sum_{m=1}^{N}\eta_{m}\delta(\boldsymbol{x})\left(\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}_{1}))+\left(\|\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}_{1}))-\boldsymbol{\varepsilon}\|\right)^{2}\right)\right\|}{\|\boldsymbol{z}_{1}-\boldsymbol{z}\|}\\
&\quad+\frac{\left\|\sum_{m=1}^{N}\eta_{m}\delta(\boldsymbol{x})\left(\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))+\left(\|\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))-\boldsymbol{\varepsilon}\|\right)^{2}\right)\right\|}{\|\boldsymbol{z}_{1}-\boldsymbol{z}\|}\\
&\leq\frac{2\|G_{\theta_{t}}(\boldsymbol{z}_{1})-G_{\theta_{t}}(\boldsymbol{z})\|}{\|\boldsymbol{z}_{1}-\boldsymbol{z}\|}\\
&\quad+\frac{\left\|\sum_{m=1}^{N}\eta_{m}\delta(\boldsymbol{x})\left(\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}_{1}))+\left(\|\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}_{1}))\|+\|\boldsymbol{\varepsilon}\|\right)^{2}\right)\right\|}{\|\boldsymbol{z}_{1}-\boldsymbol{z}\|}\\
&\quad+\frac{\left\|\sum_{m=1}^{N}\eta_{m}\delta(\boldsymbol{x})\left(\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))+\left(\|\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))\|+\|\boldsymbol{\varepsilon}\|\right)^{2}\right)\right\|}{\|\boldsymbol{z}_{1}-\boldsymbol{z}\|}.
\end{aligned}
\tag{39}
\end{equation}

For the squared terms in Eq. (37), Eq. (38), and Eq. (39), we have

\begin{equation}
\left\|\left(\|\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))\|+\|\boldsymbol{\varepsilon}\|\right)\right\|>\left\|\left(\|\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))\|\right)\right\|>\left\|\left(\|\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))\|-g_{0}\right)\right\|.
\tag{40}
\end{equation}

Based on this conclusion, the latent N-sizes with different gradient penalties satisfy $r_{R_{1}}>r_{R_{0}}>r_{R_{\boldsymbol{\varepsilon}}}$. ∎
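The ordering of the three quantities in Lemma 3.6 can be checked numerically for given norms. The sketch below is illustrative only; the function names and sample values are hypothetical. Note that $|a-g_{0}|<a$ for a gradient norm $a$ is equivalent to $0<g_{0}<2a$, so the sketch treats this range of $g_{0}$ as an assumption rather than something stated in the text.

```python
def penalty_terms(grad_norm, eps_norm, g0=1.0):
    """Return the three quantities compared in Lemma 3.6 for scalar norms."""
    r_one = abs(grad_norm - g0)      # 1-centered: | ||grad D|| - g0 |
    r_zero = grad_norm               # 0-centered: ||grad D||
    r_eps = grad_norm + eps_norm     # eps-centered bound: ||grad D|| + ||eps||
    return r_one, r_zero, r_eps


def ordering_holds(grad_norm, eps_norm, g0=1.0):
    """True if r_one < r_zero < r_eps for the given norms."""
    r_one, r_zero, r_eps = penalty_terms(grad_norm, eps_norm, g0)
    return r_one < r_zero < r_eps


# Assumed sample values: ||grad D|| = 0.8, ||eps|| = 0.3, g0 = 1.
print(penalty_terms(0.8, 0.3))
print(ordering_holds(0.8, 0.3))
print(ordering_holds(0.4, 0.3))  # g0 = 1 is not below 2 * 0.4 here
```

A larger value among these quantities enlarges the denominator of the radius in Proposition 3.5 and therefore corresponds to a smaller latent N-size.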

Theorem 3.7.

Suppose $\boldsymbol{z}_{1}$ is attracted to the mode $\mathcal{M}_{i}$ by $\hat{\epsilon}$. Then there exists a neighborhood $\mathcal{N}_{r}(\boldsymbol{z}_{1})$ of $\boldsymbol{z}_{1}$ such that $\boldsymbol{z}_{2}$ is distracted to $\mathcal{M}_{i}$ by $(\hat{\epsilon}/2-2\alpha)$ for all $\boldsymbol{z}_{2}\in\mathcal{N}_{r}(\boldsymbol{z}_{1})$. The size of $\mathcal{N}_{r}(\boldsymbol{z}_{1})$ can be arbitrarily large but is bounded by an open ball of radius $r$, which is controlled by the gradient penalty term of the discriminator. The radii for the three gradient penalties satisfy $r_{R_{1}}>r_{R_{0}}>r_{R_{\boldsymbol{\varepsilon}}}$.

Proof.

Combining Definition 3.1, Proposition 3.5, Lemma 3.6, and our $\boldsymbol{\varepsilon}$-centered gradient penalty yields this conclusion. ∎

This theorem encompasses three key implications. Firstly, when the latent vector is attracted to a mode within the image space, the corresponding latent N-size should not be overly large, which can be attributed to two distinct reasons. One, vectors within the neighborhood are more likely to be attracted to the same mode. Two, vectors within this neighborhood face challenges in being attracted to other modes within the image space, with the level of difficulty determined by an upper bound expressed as $(\hat{\epsilon}/2-2\alpha)$. This bound is constructed from the distances between different modes and within the same mode in the image space.

Secondly, the discriminator gradient penalty in the CFG formulation can regulate the latent N-size. This introduces a trade-off between diversity and training stability. For instance, the 0-centered gradient penalty ensures stable and convergent training near the Nash equilibrium, but it leads to a minimal discriminator norm, resulting in a larger latent N-size and reduced diversity.

Thirdly, by augmenting the norm of the discriminator gradient, our $\boldsymbol{\varepsilon}$-centered gradient penalty achieves the smallest latent N-size, consequently leading to the highest level of diversity.

Appendix D Experiments

In this section, we describe additional experiments and give further visual results. In addition to the datasets mentioned in the paper, we also trained on the CelebA and LSUN Bedroom datasets at resolutions of 128×128 and 256×256. We provide visual inspections and results for these additional datasets.

D.1 Hyper-parameter $\varepsilon^{\prime}$

In practice, $\boldsymbol{\varepsilon}$ is a constant vector of norm $\varepsilon^{\prime}$ with the same shape as $\nabla_{\boldsymbol{x}}D(\boldsymbol{x})$. The shape of $\nabla_{\boldsymbol{x}}D(\boldsymbol{x})$ is [C, H, W], so its flattened length is $H\cdot W\cdot C$. Our images are $N$ pixels high and $N$ pixels wide and have $C$ channels, so $H\cdot W\cdot C$ can be written as $CN^{2}$. $\|\cdot\|$ denotes the 2-norm. The value of $\|\boldsymbol{\varepsilon}\|$ can therefore be written as

\begin{equation}
\|\boldsymbol{\varepsilon}\|=\sqrt{\sum_{i=1}^{CN^{2}}\varepsilon_{i}^{2}}=\sqrt{CN^{2}\varepsilon^{2}}=\varepsilon^{\prime},
\tag{41}
\end{equation}

where $\varepsilon_{i}^{2}=\varepsilon^{2}$ and $\varepsilon^{2}=\frac{\varepsilon^{\prime 2}}{CN^{2}}$. The hyper-parameter $\varepsilon^{\prime}$ controls the tight bound of $\|\nabla_{\boldsymbol{x}}D(\boldsymbol{x})\|$. We set $\varepsilon^{\prime}=0.1,0.3,1,5$ in the ablation study to show its effectiveness. The settings $\varepsilon^{\prime}=0.1,0.3,1$ give a small enhancement of the discriminator norm and thus a smaller latent N-size. With $\varepsilon^{\prime}=5$, the range of $\|\nabla_{\boldsymbol{x}}D(\boldsymbol{x})\|$ becomes too large, which leads to a too-small latent N-size and to non-convergent training: the loss function does not converge and the synthetic samples turn into noise. We present the ablation results in Table 7 and Table 8.
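The relation between the per-entry value $\varepsilon$ and the hyper-parameter $\varepsilon^{\prime}$ in Eq. (41) can be checked with a few lines of code. This is a minimal sketch; the function names and the $3\times 64\times 64$ image shape are assumptions chosen only for illustration.

```python
import math


def per_entry_epsilon(eps_prime, channels, side):
    """Entry value eps such that a constant C*N*N vector has 2-norm eps_prime."""
    return eps_prime / math.sqrt(channels * side * side)


def constant_vector_norm(eps, channels, side):
    """2-norm of a constant vector with C*N*N entries equal to eps (Eq. 41)."""
    return math.sqrt(channels * side * side * eps ** 2)


# Hypothetical image shape: C = 3 channels, N = 64 pixels per side.
for eps_prime in (0.1, 0.3, 1.0, 5.0):
    eps = per_entry_epsilon(eps_prime, channels=3, side=64)
    print(eps_prime, eps, constant_vector_norm(eps, channels=3, side=64))
```

For each setting, the last printed value recovers $\varepsilon^{\prime}$ up to floating-point error, which shows that the per-entry value shrinks with the image size while the norm of $\boldsymbol{\varepsilon}$ stays fixed.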

Table 7: Different $\varepsilon^{\prime}$ settings of our Li-CFG trained on MNIST. We use the FID and IS scores to compare generation quality. The other two penalties do not have the parameter $\varepsilon^{\prime}$, so all of their cells hold the same value. "Untrained" means the loss function does not converge.
\begin{tabular}{l|cccc|cccc}
 & \multicolumn{4}{c|}{FID} & \multicolumn{4}{c}{IS} \\
MNIST & $\varepsilon^{\prime}=0.1$ & $\varepsilon^{\prime}=0.3$ & $\varepsilon^{\prime}=1$ & $\varepsilon^{\prime}=5$ & $\varepsilon^{\prime}=0.1$ & $\varepsilon^{\prime}=0.3$ & $\varepsilon^{\prime}=1$ & $\varepsilon^{\prime}=5$ \\
\hline
ours ($\boldsymbol{\varepsilon}$-centered) & 2.99 & 2.88 & 2.85 & untrained & 2.28 & 2.32 & 2.29 & untrained \\
$0$-centered & 3.54 & 3.54 & 3.54 & 3.54 & 2.31 & 2.31 & 2.31 & 2.31 \\
$1$-centered & 3.64 & 3.64 & 3.64 & 3.64 & 2.3 & 2.3 & 2.3 & 2.3 \\
\end{tabular}
Table 8: Different $\varepsilon^{\prime}$ settings of our Li-CFG trained on LSUN Bedroom. We use the FID and IS scores to compare generation quality. The other two penalties do not have the parameter $\varepsilon^{\prime}$, so all of their cells hold the same value. "Untrained" means the loss function does not converge.
\begin{tabular}{l|cccc|cccc}
 & \multicolumn{4}{c|}{FID} & \multicolumn{4}{c}{IS} \\
LSUN Bedroom & $\varepsilon^{\prime}=0.1$ & $\varepsilon^{\prime}=0.3$ & $\varepsilon^{\prime}=1$ & $\varepsilon^{\prime}=5$ & $\varepsilon^{\prime}=0.1$ & $\varepsilon^{\prime}=0.3$ & $\varepsilon^{\prime}=1$ & $\varepsilon^{\prime}=5$ \\
\hline
ours ($\boldsymbol{\varepsilon}$-centered) & 9.94 & 8.78 & 9.73 & untrained & 2.97 & 2.94 & 2.97 & untrained \\
$0$-centered & 10.54 & 10.54 & 10.54 & 10.54 & 3.067 & 3.067 & 3.067 & 3.067 \\
$1$-centered & 11.5 & 11.5 & 11.5 & 11.5 & 3.154 & 3.154 & 3.154 & 3.154 \\
\end{tabular}
Table 9: Different $\gamma$ settings with the same $\varepsilon^{\prime}=0.3$ for our Li-CFG trained on LSUN T. We use the FID and IS scores to compare generation quality. The other two penalties do not have the parameter $\varepsilon$, so all of their cells hold the same value. "Untrained" means the loss function does not converge.
\begin{tabular}{l|ccc|ccc}
 & \multicolumn{3}{c|}{FID} & \multicolumn{3}{c}{IS} \\
 & $\gamma^{\prime}=0.1$ & $\gamma^{\prime}=1$ & $\gamma^{\prime}=10$ & $\gamma^{\prime}=0.1$ & $\gamma^{\prime}=1$ & $\gamma^{\prime}=10$ \\
\hline
\multicolumn{7}{l}{LSUN T ($\delta(\boldsymbol{x})=1$)} \\
ours ($\boldsymbol{\varepsilon}$-centered) & 11.41 & 19.47 & 21.59 & 4.6 & 4.5 & 4.42 \\
$0$-centered & 12.33 & 19.94 & 21.71 & 4.57 & 4.6 & 4.53 \\
$1$-centered & 12.81 & 22.17 & 22.71 & 4.47 & 4.46 & 4.5 \\
CFG method & 13.54 & 13.54 & 13.54 & 4.38 & 4.38 & 4.38 \\
\hline
\multicolumn{7}{l}{LSUN T ($\delta(\boldsymbol{x})=5$)} \\
ours ($\boldsymbol{\varepsilon}$-centered) & 19.18 & 21.58 & 21.69 & 4.48 & 4.49 & 4.34 \\
$0$-centered & 20.29 & 24.77 & 23.02 & 4.47 & 4.43 & 4.43 \\
$1$-centered & 24.09 & 29.39 & untrained & 4.37 & 4.41 & untrained \\
CFG method & 22.3 & 22.3 & 22.3 & 4.34 & 4.34 & 4.34 \\
\hline
\multicolumn{7}{l}{LSUN T ($\delta(\boldsymbol{x})=10$)} \\
ours ($\boldsymbol{\varepsilon}$-centered) & 21.3 & 23 & 20.7 & 4.43 & 4.42 & 4.53 \\
$0$-centered & 23.68 & 19.74 & 20.7 & 4.39 & 4.46 & 4.38 \\
$1$-centered & 23.49 & untrained & untrained & 4.3 & untrained & untrained \\
CFG method & 25.53 & 25.53 & 25.53 & 4.2 & 4.2 & 4.2 \\
\end{tabular}

If the experimental setting does not satisfy these conditions, the synthetic samples degenerate into noise or collapse. Fig. 10 and Fig. 11 show visual results for different $\varepsilon^{\prime}$ settings on MNIST and LSUN Bedroom.

Figure 10: The top two rows use $\varepsilon^{\prime}=5$ and the bottom two rows use $\varepsilon^{\prime}=0.3$. The top two rows collapse, while the bottom two rows are normal.
Figure 11: The top two rows use $\varepsilon^{\prime}=5$, the next two rows $\varepsilon^{\prime}=1$, the following two rows $\varepsilon^{\prime}=0.1$, and the bottom two rows $\varepsilon^{\prime}=0.3$. If the hyper-parameter $\varepsilon^{\prime}$ is too large, the latent N-size becomes too small and training does not converge. The hyper-parameter $\varepsilon^{\prime}$ should therefore take a moderate value, neither too large nor too small, to obtain the best diversity of synthetic samples.
Table 10: Different $\gamma$ settings of our Li-CFG trained on LSUN T+B. We use the FID and IS scores to compare generation quality.

| LSUN T+B | FID ($\gamma=0.1$) | FID ($\gamma=1$) | FID ($\gamma=10$) | IS ($\gamma=0.1$) | IS ($\gamma=1$) | IS ($\gamma=10$) |
|---|---|---|---|---|---|---|
| ours ($\boldsymbol{\varepsilon}$-centered) | 15.72 | 16.85 | 16.67 | 5.07 | 5.06 | 5.08 |
| 0-centered | 16.01 | 17.4 | 19.79 | 5.05 | 5.08 | 5.05 |
| 1-centered | 16.32 | 18.73 | 34.28 | 5.04 | 5.18 | 4.71 |

D.2 Effect of $\gamma$ with different penalties

In most cases, a gradient penalty whose center lies in a small interval around 0 tends to yield better results. On some datasets, however, a gradient penalty centered around 1 may perform better. Our method provides a controllable parameter that lets the center of the gradient penalty vary within the range specified by that parameter, and we show experimentally that it yields better results. For settings of $\delta(\boldsymbol{x})$ where the CFG method performs well, $\gamma=0.1$ consistently yields improved results. Conversely, for settings of $\delta(\boldsymbol{x})$ where the CFG method trains poorly, $\gamma=10$ often leads to better overall performance. We compare the results of different $\gamma$ values for CFG and Li-CFG in Table 9 and Table 10.
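To make the notion of a penalty "center" concrete, the following minimal sketch evaluates a generic centered gradient penalty of the form $\gamma\cdot\mathrm{mean}\big((\|\nabla_{\boldsymbol{x}}D(\boldsymbol{x})\|-c)^{2}\big)$ on toy gradient vectors. This is an illustration under stated assumptions, not the penalty as defined in the main text: it treats $\gamma$ as the penalty weight, fixes the center at a single value $c$ rather than letting it vary within a range, and uses hand-picked gradients in place of a real discriminator. The function names are hypothetical.

```python
import math


def grad_norm(grad):
    """Euclidean norm of one gradient vector."""
    return math.sqrt(sum(g * g for g in grad))


def centered_gradient_penalty(grads, center, gamma):
    """gamma * mean over samples of (||grad|| - center)^2.

    Assumption: gamma acts as the penalty weight and `center` is a
    single fixed value.
    """
    terms = [(grad_norm(g) - center) ** 2 for g in grads]
    return gamma * sum(terms) / len(terms)


# Toy discriminator gradients with norms 5 and 1 (hypothetical values).
grads = [[3.0, 4.0], [0.0, 1.0]]

for name, center in (("0-centered", 0.0), ("1-centered", 1.0), ("0.3-centered", 0.3)):
    value = centered_gradient_penalty(grads, center, gamma=0.1)
    print(f"{name}: {value:.4f}")
```

A center of 0 penalizes any non-zero gradient norm, while a positive center leaves norms near that center unpenalized, which is the sense in which a larger center permits a larger discriminator gradient norm.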

D.3 Effect of Gradient Penalty with different $\delta(\boldsymbol{x})$

The results of the CFG method can vary significantly in response to minor changes in parameter values. Fig. 12 illustrates this: different choices of $\delta(\boldsymbol{x})$ lead to widely different outcomes. With the gradient penalty added, training is consistently more stable across $\delta(\boldsymbol{x})$ values. Notably, in cases where the CFG method struggles to converge, adding the gradient penalty leads to improved results.
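As a small numerical companion to this claim, the sketch below takes the FID values reported in Table 9 for the smallest $\gamma$ setting (0.1) and computes, for each method, the gap between the best and worst FID across $\delta(\boldsymbol{x})=1, 5, 10$. The "spread" measure is introduced here only for illustration; it is a crude proxy based on three settings and does not reproduce the curves in Fig. 12.

```python
# FID values from Table 9 at the smallest gamma setting (0.1),
# listed for delta(x) = 1, 5, 10.
fid = {
    "eps-centered": [11.41, 19.18, 21.3],
    "0-centered": [12.33, 20.29, 23.68],
    "1-centered": [12.81, 24.09, 23.49],
    "CFG": [13.54, 22.3, 25.53],
}


def spread(values):
    """Gap between the worst and best FID (illustrative measure only)."""
    return max(values) - min(values)


for method, values in fid.items():
    print(f"{method}: best={min(values):.2f}, worst={max(values):.2f}, "
          f"spread={spread(values):.2f}")
```

On these values the $\boldsymbol{\varepsilon}$-centered rows have both the lowest FID at each $\delta(\boldsymbol{x})$ and the smallest spread, with the CFG method showing the largest spread.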

Figure 12: FID values for three different $\delta(\boldsymbol{x})$ settings on the LSUN Tower dataset. Even when the FID of the CFG method reaches 200, which means that its training does not converge, our Li-CFG consistently achieves competitive FID scores.

D.4 Visual inspection of synthesized images

In this section, we present visual results for the CFG method, Li-CFG, and other GAN models. Figs. 27 to 33 are each split into two columns: real images on the left and generated images on the right, where the generated images use Li-CFG with different GP regularizations. The settings for the generated images are almost the same for the CFG method and Li-CFG. We also display generated images for CelebA, LSUN Bedroom, and ImageNet at resolutions of 128 and 256. The results on synthetic datasets are shown in Figs. 13 to 20, and the results of BigGAN and DDGAN on real-world datasets are shown in Figs. 21 to 23.

Figure 13: Results for the original GAN with varying gradient penalties on the Ring dataset. From top to bottom: the original GAN, the original GAN with the 1-centered gradient penalty, the original GAN with the 0-centered gradient penalty, and the original GAN with the $\boldsymbol{\varepsilon}$-centered gradient penalty. From left to right, each column shows a different stage of training. The far-right column shows the ground truth data for comparison.
Figure 14: Results for the WGAN with varying gradient penalties on the Ring dataset. From top to bottom: the WGAN, the WGAN with the 1-centered gradient penalty, the WGAN with the 0-centered gradient penalty, and the WGAN with the $\boldsymbol{\varepsilon}$-centered gradient penalty. From left to right, each column shows a different stage of training. The far-right column shows the ground truth data for comparison.
Figure 15: Results for the Hinge GAN with varying gradient penalties on the Ring dataset. From top to bottom: the Hinge GAN, the Hinge GAN with the 1-centered gradient penalty, the Hinge GAN with the 0-centered gradient penalty, and the Hinge GAN with the $\boldsymbol{\varepsilon}$-centered gradient penalty. From left to right, each column shows a different stage of training. The far-right column shows the ground truth data for comparison.
Figure 16: Results for the LSGAN with varying gradient penalties on the Ring dataset. From top to bottom: the LSGAN, the LSGAN with the 1-centered gradient penalty, the LSGAN with the 0-centered gradient penalty, and the LSGAN with the $\boldsymbol{\varepsilon}$-centered gradient penalty. From left to right, each column shows a different stage of training. The far-right column shows the ground truth data for comparison.
Figure 17: Results for the original GAN with varying gradient penalties on the Grid dataset. From top to bottom: the original GAN, the original GAN with the 1-centered gradient penalty, the original GAN with the 0-centered gradient penalty, and the original GAN with the $\boldsymbol{\varepsilon}$-centered gradient penalty. From left to right, each column shows a different stage of training. The far-right column shows the ground truth data for comparison.
Figure 18: Results for the WGAN with varying gradient penalties on the Grid dataset. From top to bottom: the WGAN, the WGAN with the 1-centered gradient penalty, the WGAN with the 0-centered gradient penalty, and the WGAN with the $\boldsymbol{\varepsilon}$-centered gradient penalty. From left to right, each column shows a different stage of training. The far-right column shows the ground truth data for comparison.
Figure 19: Results for the LSGAN with varying gradient penalties on the Grid dataset. From top to bottom: the LSGAN, the LSGAN with the 1-centered gradient penalty, the LSGAN with the 0-centered gradient penalty, and the LSGAN with the $\boldsymbol{\varepsilon}$-centered gradient penalty. From left to right, each column shows a different stage of training. The far-right column shows the ground truth data for comparison.
Figure 20: Results for the Hinge GAN with varying gradient penalties on the Grid dataset. From top to bottom: the Hinge GAN, the Hinge GAN with the 1-centered gradient penalty, the Hinge GAN with the 0-centered gradient penalty, and the Hinge GAN with the $\boldsymbol{\varepsilon}$-centered gradient penalty. From left to right, each column shows a different stage of training. The far-right column shows the ground truth data for comparison.
Figure 21: Result for ImageNet64 from BigGAN with our 𝜺\boldsymbol{\varepsilon}-centered gradient penalty.
Figure 22: Result for LSUN Church 256 from DDGAN with our 𝜺\boldsymbol{\varepsilon}-centered gradient penalty.
Figure 23: Result for LSUN Church 256 from DDGAN with our 𝜺\boldsymbol{\varepsilon}-centered gradient penalty.

Synthetic Datasets. On the synthetic datasets, we observe that unconstrained methods such as the original GAN, LSGAN, WGAN, and HingeGAN struggle to converge to all modes of the ring or grid datasets. When supplemented with a gradient penalty, these methods show an enhanced ability to converge to the mixture of Gaussians. Among the three gradient penalties tested, the 1-centered gradient penalty converges worse than the 0-centered gradient penalty and our proposed $\boldsymbol{\varepsilon}$-centered gradient penalty. Notably, our $\boldsymbol{\varepsilon}$-centered gradient penalty drives more sample points to the Gaussian modes than the 0-centered gradient penalty.
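For readers who wish to reproduce this kind of check, the sketch below builds ring and grid mixtures of 2-D Gaussians and counts how many modes a set of samples reaches. It is only an illustration: the number of modes (8 on the ring, 25 on the grid), the radius, the spacing, the standard deviation, and the 3-standard-deviation coverage rule are all hypothetical choices and are not the settings or the evaluation protocol used for Figs. 13 to 20.

```python
import math
import random


def ring_centers(num_modes=8, radius=1.0):
    """Mode centers evenly spaced on a circle (hypothetical layout)."""
    return [(radius * math.cos(2 * math.pi * k / num_modes),
             radius * math.sin(2 * math.pi * k / num_modes))
            for k in range(num_modes)]


def grid_centers(side=5, spacing=1.0):
    """Mode centers on a side x side grid centered at the origin."""
    offset = (side - 1) / 2
    return [((i - offset) * spacing, (j - offset) * spacing)
            for i in range(side) for j in range(side)]


def sample_mixture(centers, n, std, rng):
    """Draw n points from an equal-weight isotropic Gaussian mixture."""
    points = []
    for _ in range(n):
        cx, cy = rng.choice(centers)
        points.append((rng.gauss(cx, std), rng.gauss(cy, std)))
    return points


def covered_modes(points, centers, std, k=3.0):
    """Count modes that have at least one point within k * std."""
    hit = set()
    for x, y in points:
        idx = min(range(len(centers)),
                  key=lambda i: (x - centers[i][0]) ** 2 + (y - centers[i][1]) ** 2)
        cx, cy = centers[idx]
        if math.hypot(x - cx, y - cy) <= k * std:
            hit.add(idx)
    return len(hit)


rng = random.Random(0)
ring = ring_centers()
full = sample_mixture(ring, 2000, 0.05, rng)           # samples from all modes
collapsed = sample_mixture(ring[:2], 2000, 0.05, rng)  # mimics mode collapse
print("modes covered (all modes):", covered_modes(full, ring, 0.05))
print("modes covered (collapsed):", covered_modes(collapsed, ring, 0.05))
```

Applied to generator samples instead of the stand-in sets above, a count below the number of modes would indicate that some modes are missed.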

ImageNet. We present the experimental results of Li-CFG on the ImageNet dataset here. We compare the FID of Li-CFG and the CFG method using the same neural architecture with different numbers of parameters. The architecture simply uses DCGAN and res-blocks. Fig. 24 shows the FID scores of the CFG method, the CFG method with twice the number of parameters, and Li-CFG with twice the number of parameters. Fig. 25 presents visual results of these three models. Fig. 26 shows our Li-CFG results on the ImageNet dataset.

Figure 24: Results for ImageNet: The yellow line shows the FID of the CFG method at different training stages on the ImageNet dataset. The green and red lines show the FID of the CFG method and Li-CFG, respectively, with 2x parameters at different training stages. The figure indicates that the CFG method with more parameters performs much better than the CFG method with fewer parameters on ImageNet.
Figure 25: Visual results for ImageNet: The left and middle columns display the visual results of the CFG method and Li-CFG with 2x parameters, respectively. The right column shows the visual results of the CFG method.
Figure 26: Visual results of Li-CFG on the ImageNet dataset.
Figure 27: Results for LSUN T: The left column consists of real images. In the right column, the first two rows show the CFG method, the next two rows show Li-CFG with the $\boldsymbol{\varepsilon}$-centered gradient penalty (ours), the following two rows show Li-CFG with the 1-centered gradient penalty, and the last two rows show Li-CFG with the 0-centered gradient penalty. The parameter settings for Li-CFG are $\eta=2.5\times 10^{-4}$, $\gamma=0.1$, $\delta(\boldsymbol{x})=1$, and $\varepsilon^{\prime}=0.3$.
Figure 28: Results for BR+LR: The left column consists of real images. In the right column, the first two rows show the CFG method, the next two rows show Li-CFG with the $\boldsymbol{\varepsilon}$-centered gradient penalty (ours), the following two rows show Li-CFG with the 1-centered gradient penalty, and the last two rows show Li-CFG with the 0-centered gradient penalty. The parameter settings for Li-CFG are $\eta=2.5\times 10^{-4}$, $\gamma=0.1$, $\delta(\boldsymbol{x})=1$, and $\varepsilon^{\prime}=0.3$.
Figure 29: Results for LSUN Church: The left column consists of real images. In the right column, the first two rows show the CFG method, the next two rows show Li-CFG with the $\boldsymbol{\varepsilon}$-centered gradient penalty (ours), the following two rows show Li-CFG with the 1-centered gradient penalty, and the last two rows show Li-CFG with the 0-centered gradient penalty. The parameter settings for Li-CFG are $\eta=2.5\times 10^{-4}$, $\gamma=0.1$, $\delta(\boldsymbol{x})=1$, and $\varepsilon^{\prime}=0.3$.
Figure 30: Results for CelebA at a resolution of 256, obtained using our $\boldsymbol{\varepsilon}$-centered Li-CFG method. The parameter settings for Li-CFG are $\eta=2.5\times 10^{-4}$, $\gamma=10$, $\delta(\boldsymbol{x})=10$, and $\varepsilon^{\prime}=0.3$.
Figure 31: Results for LSUN Bedroom at a resolution of 256, obtained using our $\boldsymbol{\varepsilon}$-centered Li-CFG method. The parameter settings for Li-CFG are $\eta=2.5\times 10^{-4}$, $\gamma=10$, $\delta(\boldsymbol{x})=10$, and $\varepsilon^{\prime}=0.3$.
Figure 32: Results for ImageNet at a resolution of 256, obtained using our $\boldsymbol{\varepsilon}$-centered Li-CFG method. The parameter settings for Li-CFG are $\eta=2.5\times 10^{-4}$, $\gamma=0.1$, $\delta(\boldsymbol{x})=20$, and $\varepsilon^{\prime}=1$.
Figure 33: Results for CelebA at a resolution of 128, obtained using our $\boldsymbol{\varepsilon}$-centered Li-CFG method. The parameter settings for Li-CFG are $\eta=2.5\times 10^{-4}$, $\gamma=10$, $\delta(\boldsymbol{x})=5$, and $\varepsilon^{\prime}=0.3$.