

[1]\surZhonglong Zheng \equalcontThese authors should be nominated as the corresponding author.

[1]\orgdivSchool of Computer Science and Technology, \orgnameZhejiang Normal University, \orgaddress\streetNo. 688 Yingbin Avenue, \cityJinhua, \postcode321004, \stateZhejiang, \countryChina

[2]\orgdivSchool of Data Science and MOE Frontiers Center for Brain Science, \orgnameFudan University, \orgaddress\streetNo. 220 Handan Road, \cityShanghai, \postcode200433, \stateShanghai, \countryChina

[3]\orgdivFudan ISTBI-JNU Algorithm Centre for Brain inspired Intelligence, \orgnameZhejiang Normal University, \orgaddress\streetNo. 688 Yingbin Avenue, \cityJinhua, \postcode321004, \stateZhejiang, \countryChina

A New Formulation of Lipschitz Constrained With Functional Gradient Learning for GANs

\surChang Wan \surKe Fan \surXinwei Sun \surYanwei Fu \surMinglu Li \surYunliang Jiang

Abstract

This paper introduces a promising alternative method for training Generative Adversarial Networks (GANs) on large-scale datasets with clear theoretical guarantees. GANs are typically learned through a minimax game between a generator and a discriminator, which is known to be empirically unstable. Previous learning paradigms have encountered mode collapse issues without a theoretical solution. To address these challenges, we propose a novel Lipschitz-constrained Functional Gradient GANs learning (Li-CFG) method to stabilize the training of GANs and provide a theoretical foundation for effectively increasing the diversity of synthetic samples by reducing the neighborhood size of the latent vector. Specifically, we demonstrate that the neighborhood size of the latent vector can be reduced by increasing the norm of the discriminator gradient, resulting in enhanced diversity of synthetic samples. To efficiently enlarge the norm of the discriminator gradient, we introduce a novel $\boldsymbol{\varepsilon}$-centered gradient penalty that amplifies the norm of the discriminator gradient using the hyper-parameter $\boldsymbol{\varepsilon}$. In comparison to other constraints, our method enlarges the discriminator gradient norm and thus obtains the smallest neighborhood size of the latent vector. Extensive experiments on benchmark datasets for image generation demonstrate the efficacy of the Li-CFG method and the $\boldsymbol{\varepsilon}$-centered gradient penalty. The results showcase improved stability and increased diversity of synthetic samples.

keywords:

Generative Adversarial Nets, Functional Gradient Methods, New Lipschitz Constraint, Synthesis Diversity

Figure 1: Highlighting Diversity. We underscore the significance of diversity in image synthesis. The left column (a) and right column (b) display horse label images generated using the CFG and Li-CFG methods trained on the CIFAR-10 dataset, respectively.

1 Introduction

GANs sample a random variable $z$ from a known distribution $p_{z}$ and map it through a generator so that the resulting distribution approximates the underlying data distribution $p_{*}$. This learning process is modeled as a minimax game where the generator and discriminator are iteratively optimized, as introduced by Goodfellow et al. [1]. The generator produces samples that mimic the true data distribution, while the discriminator distinguishes between the generated data and real data samples. Despite numerous remarkable efforts [2, 3], GAN learning still suffers from training instability and mode collapse.

Recently, Composite Functional Gradient Learning (CFG), as proposed in Johnson and Zhang [4], has gained attention. CFG utilizes a strong discriminator and functional gradient learning for the generator, leading to convergent GAN learning theoretically and empirically. However, we have observed that CFG still has various hyper-parameters that may significantly impact the GAN learning process. Despite the improved stability, these hyper-parameters must still be set carefully to train a GAN successfully. This issue hinders the widespread adoption of the CFG method for training GANs on real-world and large-scale datasets. One of the most effective mechanisms for addressing this issue in GAN training is the Lipschitz constraint, with the $0$-centered gradient penalty introduced by Mescheder et al. [5] and the $1$-centered gradient penalty introduced by Gulrajani et al. [3] being among the most well-known. Empirically, stable training of a GAN can result in more diverse synthesis results. However, the theoretical foundation connecting stable GAN training under the Lipschitz gradient penalty to the generation of diverse synthetic samples is still unclear. It is therefore important to develop a new theoretical framework that can account for both the Lipschitz constraint of the discriminator and the diversity of synthetic samples.

To address these challenges, we propose Lipschitz Constrained Functional Gradient GANs Learning (Li-CFG), an improved version of CFG. We present a comparative analysis of our Li-CFG and CFG methods in Fig. 2 and Example 1.1. Additionally, we emphasize the importance of diversity in synthetic samples through Fig. 1. Remarkably, as shown in Fig. 1, the horse images generated by Li-CFG display a more diverse range of gestures, colors, and textures in comparison to those produced by the CFG method. When synthetic samples lack diversity, they fail to capture essential image characteristics. While recent generative models have made notable progress in enhancing diversity, there remains a gap in providing comprehensive theoretical explanations for certain aspects.

Example 1.1.

Fig. 2 illustrates how the magnitude of $\|\nabla_{\boldsymbol{x}}D(\boldsymbol{x})\|$ affects training. An excessively large $\|\nabla_{\boldsymbol{x}}D(\boldsymbol{x})\|$ leads to an excessively small neighborhood of the latent vector and hence an untrained model, whereas an excessively small $\|\nabla_{\boldsymbol{x}}D(\boldsymbol{x})\|$ leads to an overly large neighborhood of the latent vector and hence worse diversity. Specifically, the blue dashed lines, with a reasonable $\|\nabla_{\boldsymbol{x}}D(\boldsymbol{x})\|$, achieve the best FID among all colors; the green solid lines, with a too-large $\|\nabla_{\boldsymbol{x}}D(\boldsymbol{x})\|$, correspond to an untrained result; and the red dashed lines, with a too-small $\|\nabla_{\boldsymbol{x}}D(\boldsymbol{x})\|$, correspond to worse diversity. The figure also compares the training results of CFG and Li-CFG under different $\delta(\boldsymbol{x})$. The FID of CFG varies dramatically with $\delta(\boldsymbol{x})$ because $\|\nabla_{\boldsymbol{x}}D(\boldsymbol{x})\|$ changes unstably, whereas the FID of our Li-CFG varies more stably because the gradient penalty keeps $\|\nabla_{\boldsymbol{x}}D(\boldsymbol{x})\|$ smooth. By controlling $\|\nabla_{\boldsymbol{x}}D(\boldsymbol{x})\|$ through the $\boldsymbol{\varepsilon}$ value of our $\boldsymbol{\varepsilon}$-centered gradient penalty, we can adjust the neighborhood size of the latent vector and consequently influence the degree of diversity of synthetic samples.

Our key insight is the introduction of Lipschitz continuity, a robust form of uniform continuity, into CFG. This provides a theoretical basis showing that synthetic sample diversity can be enhanced by reducing the latent vector neighborhood size through a discriminator constraint. For simplicity and clarity, we will use latent N-size to denote the neighborhood size of the latent vector and constraint to refer to the discriminator constraint for the rest of the article. First, Li-CFG integrates the Lipschitz constraint into CFG to tackle instability in training under dynamic theory. Second, we establish a theoretical link between the discriminator gradient norm and the latent N-size. By increasing the discriminator gradient norm, the latent N-size is reduced, thus enhancing the diversity of synthetic samples. Lastly, to efficiently adjust the discriminator gradient norm, we introduce a novel Lipschitz constraint mechanism, the $\boldsymbol{\varepsilon}$-centered gradient penalty. This mechanism enables fine-tuning of the latent vector neighborhood size by varying the hyper-parameter $\boldsymbol{\varepsilon}$. Through this approach, we aim to achieve more effective control over the discriminator gradient norm and further improve the diversity of generated samples.

We summarize the key contributions of this work as follows: (1) We introduce a novel Lipschitz constraint to CFG, the $\boldsymbol{\varepsilon}$ -centered gradient penalty, addressing the hyperparameter sensitivity in regression-based GAN training. Our Li-CFG method enables stable GAN training, producing superior results compared to traditional CFG. (2) We present a new perspective on analyzing the relationship between the discriminator constraint and synthetic sample diversity. To the best of our knowledge, we are the first to explore this relationship, demonstrating that our $\boldsymbol{\varepsilon}$ -centered gradient penalty allows for effective control of diversity during training. (3) Our empirical studies highlight the superiority of Li-CFG over CFG across a variety of datasets, including synthetic and real-world data such as MNIST, CIFAR10, LSUN, and ImageNet. In an ablation study, we show a trade-off between diversity and model trainability. Additionally, we demonstrate the generalizability of our $\boldsymbol{\varepsilon}$ -centered gradient penalty across multiple GAN models, achieving better results compared to existing gradient penalties.

2 Preliminary

To enhance readability, we provide definitions for frequently used mathematical symbols in our theory. For CFG and dynamic theory, the symbols will be explained in the respective sections.

Notation. $G_{\theta}(z):\mathcal{Z}\rightarrow\mathcal{Y}$ represents a generator model that maps an element of the latent space $\mathcal{Z}$ to the image space $\mathcal{Y}$ , where the parameter of generator is denoted by $\theta$ . Here $z$ is a sample drawn from a low dimensional latent distribution $p_{z}$ . We will use $G_{\theta}$ and $G$ interchangeably to refer to the generator throughout this article.

$D_{\psi}:\mathcal{Y}\rightarrow\mathcal{C}$ represents a discriminator model that maps elements from the image space $\mathcal{Y}$ to the probability of belonging to the real data distribution or the generated data distribution, where the parameter of the discriminator is denoted by $\psi$. The image space $\mathcal{Y}$ contains the samples drawn from the real data distribution $p_{*}$. $\mathcal{C}$ consists of the scalar values output by the discriminator. The symbol $D$ in the CFG method, along with the notation $D_{\psi}$ used in both the gradient penalty and our neighborhood theory, all represent the discriminator.

We use $R$ to denote the gradient penalty. Furthermore, we use $R_{1}$ to denote the $1$ -centered Gradient Penalty, $R_{0}$ to denote the $0$ -centered Gradient Penalty and $R_{\boldsymbol{\varepsilon}}$ to denote the $\boldsymbol{\varepsilon}$ -centered Gradient Penalty. $r_{R}$ stands for the latent N-size of the corresponding gradient penalty. The symbol $\hat{\epsilon}$ in the definition of our neighborhood method represents a small quantity in image space. It is distinct from the symbol $\boldsymbol{\varepsilon}$ used in our $\boldsymbol{\varepsilon}$ -centered GP method.

Composite Functional Gradient GANs. We follow the definition in CFG [6]. CFG employs a discriminator that works as a logistic regressor to differentiate real samples from synthetic samples. Meanwhile, it learns the generator through a composition of functional gradient steps of the following form

\displaystyle\small\begin{aligned} G_{m}(\boldsymbol{z})=G_{m-1}(\boldsymbol{z})+\eta_{m}g_{m}(G_{m-1}(\boldsymbol{z})),\quad m=1,\ldots,M\end{aligned}

(1)

to obtain $G(\boldsymbol{z})=G_{M}(\boldsymbol{z})$. Here $M$ is the number of steps the generator uses to approximate the distribution of real data samples. Each $g_{m}$ is a function to be estimated from data. It acts as a residual that gradually moves the generated samples of step $m-1$, $G_{m-1}(\boldsymbol{z})$, towards those of step $m$, $G_{m}(\boldsymbol{z})$. $g_{m}$ is chosen so that the distance between the distribution of the generated samples $G_{m}(\boldsymbol{z})$, $\boldsymbol{z}\sim p_{z}$, and the real data distribution $p_{*}$ gradually decreases towards zero, and $\eta_{m}$ is a small step size.
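The update in Eq. (1) can be sketched in a few lines. This is only a toy illustration under stated assumptions: the one-dimensional discriminator $D(x)=-(x-\mu)^{2}/2$ is a hand-written stand-in rather than a learned model, and the names `grad_D` and `cfg_generate` are hypothetical.

```python
def grad_D(x, mu=2.0):
    """Gradient of the toy discriminator D(x) = -(x - mu)**2 / 2 (an assumption)."""
    return mu - x


def cfg_generate(x0, eta=0.1, steps=50):
    """Apply G_m = G_{m-1} + eta * g_m(G_{m-1}) with g_m = grad_D, for m = 1..steps."""
    x = x0
    for _ in range(steps):
        x = x + eta * grad_D(x)
    return x


start = -1.0
end = cfg_generate(start)
# Distance to mu before and after the composite update.
print(abs(start - 2.0), abs(end - 2.0))
```

In this sketch each step shrinks the distance to $\mu$ by the factor $(1-\eta)$, so the composed generator moves the sample towards the region the toy discriminator scores highest.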

To simplify the analysis of this problem, we first replace the discrete step index $m$ with a continuous one. We rewrite Eq. (1) as $G_{m+\delta}(Z)=G_{m}(Z)+\delta g_{m}\left(G_{m}(Z)\right)$ by setting $\eta_{m}=\delta$, where $\delta$ is a small time step. Letting $\delta\rightarrow 0$, we obtain a generator that evolves continuously in time $m$ and satisfies the ordinary differential equation

\frac{d\left(G_{m}(Z)\right)}{dm}=g_{m}\left(G_{m}(Z)\right).

The goal is to learn $g_{m}:\mathbb{R}^{k}\rightarrow\mathbb{R}^{k}$ from data so that the probability density $p_{m}$ of $G_{m}(\boldsymbol{z})$, which continuously evolves by Eq. (1), becomes close to the density $p_{*}$ of real data as $m$ continuously increases. To measure the ‘closeness’, we use $L$ to denote a distance measure between two distributions:

L(p_{m})=\int\ell\left(p_{*}(x),p_{m}(x)\right)dx,

where $\ell:\mathbb{R}^{2}\rightarrow\mathbb{R}$ is a pre-defined function such that $L$ satisfies $L(p_{m})=0$ if and only if $p_{m}=p_{*}$ and $L(p)\geq 0$ for any probability density function $p$.

From the above equation, we derive the choice of $g_{m}(\cdot)$ that guarantees that the transformation in Eq. (1) always reduces $L(\cdot)$. Let $p_{m}$ be the probability density of the random variable $G_{m}(\boldsymbol{z})$. Let $\ell_{2}^{\prime}\left(\rho_{*},\rho\right)=\partial\ell\left(\rho_{*},\rho\right)/\partial\rho$. Then we have

\frac{dL\left(p_{m}\right)}{dm}=\int p_{m}(x)\nabla_{x}\ell_{2}^{\prime}\left(p_{*}(x),p_{m}(x)\right)\cdot g_{m}(x)dx.

Given this expression for $\frac{dL\left(p_{m}\right)}{dm}$, we want $\frac{dL\left(p_{m}\right)}{dm}$ to be negative so that the distance $L$ decreases. To achieve this goal, we choose $g_{m}(x)$ to be:

\displaystyle\begin{aligned} g_{m}(\boldsymbol{x})=-s_{m}(\boldsymbol{x})\phi_{0}\left(\nabla_{x}\ell_{2}^{\prime}\left(p_{*}(\boldsymbol{x}),p_{m}(\boldsymbol{x})\right)\right),\end{aligned}

(2)

where $s_{m}(x)>0$ is an arbitrary scaling factor. $\phi_{0}(u)$ is a vector function such that $\phi(u)=u\cdot\phi_{0}(u)\geq 0$ and $\phi(u)=0$ if and only if $u=0$ . Here are two examples: $(\phi_{0}(u)=u,\phi(u)=\|u\|_{2}^{2})$ and $(\phi_{0}(u)=\operatorname{sign}(u),\phi(u)=\|u\|_{1})$ .
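As a quick numerical check of the two example pairs above, the following sketch evaluates $\phi(u)=u\cdot\phi_{0}(u)$ for both choices of $\phi_{0}$. The helper names are hypothetical and the vector `u` is an arbitrary illustration.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def sign(a):
    """Return -1, 0, or 1 according to the sign of a."""
    return (a > 0) - (a < 0)


def phi_identity(u):
    """phi(u) = u . phi_0(u) with phi_0(u) = u, i.e. the squared 2-norm."""
    return dot(u, u)


def phi_sign(u):
    """phi(u) = u . phi_0(u) with phi_0(u) = sign(u), i.e. the 1-norm."""
    return dot(u, [sign(a) for a in u])


u = [3.0, -4.0, 0.0]
print(phi_identity(u), phi_sign(u))
```

Both values are non-negative and vanish only for the zero vector, which is the property required of $\phi$.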

With this choice of $g_{m}(x)$ , we obtain

\displaystyle\begin{aligned} \frac{dL\left(p_{m}\right)}{dm}=-\int s_{m}(x)p_{m}(x)\phi\left(\nabla_{x}\ell_{2}^{\prime}\left(p_{*}(x),p_{m}(x)\right)\right)dx\leq 0,\end{aligned}

(3)

that is, the distance $L$ is guaranteed to decrease unless the equality holds. Moreover, this implies that we have $\lim_{m\rightarrow\infty}\int s_{m}(x)p_{m}(x)\phi\left(\nabla_{x}\ell_{2}^{\prime}\left(p_{*}(x),p_{m}(x)\right)\right)dx=0$ . (Otherwise, $L\left(p_{m}\right)$ would keep going down and become negative as $m$ increases, but $L\left(p_{m}\right)\geq 0$ by definition.) For simplicity, we omit the subscript $m$ in the following empirical settings.

Let us consider a case where the distance measure $L(\cdot)$ is an $f$-divergence. With a convex function $f:\mathbb{R}^{+}\rightarrow\mathbb{R}$ such that $f(1)=0$ and that $f$ is strictly convex at $1$, $L\left(p_{m}\right)$ defined by

L\left(p_{m}\right)=\int p_{*}(x)f\left(r_{m}(x)\right)dx\text{ where }r_{m}(x)=\frac{p_{m}(x)}{p_{*}(x)}

is called $f$ -divergence. Here we focus on a special case where $f$ is twice differentiable and strongly convex so that the second order derivative of $f$ , denoted here by $f^{\prime\prime}$ , is always positive. For instance, when we consider the KL divergence, $f$ can be represented by $-\ln x$ , in which case $f^{\prime}=-1/x$ and $f^{\prime\prime}=1/x^{2}$ . On the other hand, if we consider the reverse KL divergence, $f$ can be represented as $x\ln{x}$ , in which case $f^{\prime}=\ln{x}+1$ and $f^{\prime\prime}=1/x$ .
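The closed-form second derivatives quoted above can be checked numerically. The sketch below uses a central finite difference. The evaluation point $x=0.5$ and the step size are arbitrary choices, and the function names are hypothetical.

```python
import math


def second_derivative(f, x, h=1e-4):
    """Central finite-difference estimate of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)


def kl_f(x):
    """f(x) = -ln(x), the choice associated with the KL divergence."""
    return -math.log(x)


def reverse_kl_f(x):
    """f(x) = x ln(x), the choice associated with the reverse KL divergence."""
    return x * math.log(x)


x = 0.5
kl_num = second_derivative(kl_f, x)           # closed form: 1 / x**2
rkl_num = second_derivative(reverse_kl_f, x)  # closed form: 1 / x
print(kl_num, 1.0 / x**2)
print(rkl_num, 1.0 / x)
```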

With $L$ defined as an $f$-divergence with such a function $f$, $g_{m}(x)$ becomes

\displaystyle\begin{aligned} g_{m}(\boldsymbol{x})&=-s_{m}(\boldsymbol{x})\phi_{0}\left(\nabla_{x}\ell_{2}^{\prime}\left(p_{*}(\boldsymbol{x}),p_{m}(\boldsymbol{x})\right)\right)\\ &=-s_{m}(\boldsymbol{x})\nabla_{x}\ell_{2}^{\prime}\left(p_{*}(\boldsymbol{x}),p_{m}(\boldsymbol{x})\right)\\ &=-s_{m}(\boldsymbol{x})f^{\prime\prime}\left(r_{m}(\boldsymbol{x})\right)\nabla r_{m}(\boldsymbol{x})\\ &\approx s_{m}(\boldsymbol{x})f^{\prime\prime}\left(\tilde{r}_{m}(\boldsymbol{x})\right)\tilde{r}_{m}(\boldsymbol{x})\nabla_{x}D(\boldsymbol{x}),\end{aligned}

(4)

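where the second line takes $\phi_{0}(u)=u$, the third line uses $\ell_{2}^{\prime}\left(p_{*}(\boldsymbol{x}),p_{m}(\boldsymbol{x})\right)=f^{\prime}\left(r_{m}(\boldsymbol{x})\right)$ for an $f$-divergence, and the last line approximates $r_{m}$ by $\tilde{r}_{m}(\boldsymbol{x})=\exp(-D(\boldsymbol{x}))$, so that $\nabla\tilde{r}_{m}(\boldsymbol{x})=-\tilde{r}_{m}(\boldsymbol{x})\nabla_{x}D(\boldsymbol{x})$. Empirically, we define the function in Eq. (4) as $g(\boldsymbol{x})=\delta(\boldsymbol{x})\nabla_{\boldsymbol{x}}D(\boldsymbol{x})$. The $\delta(\boldsymbol{x})$ can be computed as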

\displaystyle\begin{aligned} \delta(\boldsymbol{x})=s_{m}(\boldsymbol{x})\tilde{r}_{m}(\boldsymbol{x})f^{\prime\prime}(\tilde{r}_{m}(\boldsymbol{x})),\end{aligned}

(5)

where $s_{m}(\boldsymbol{x})$ is an arbitrary scaling factor, $f(x)=-\ln x$ is the function associated with the KL divergence, with $f^{\prime\prime}(x)=1/x^{2}$, and $\tilde{r}_{m}(\boldsymbol{x})=\exp(-D(\boldsymbol{x}))$. Since $\tilde{r}_{m}(\boldsymbol{x})f^{\prime\prime}(\tilde{r}_{m}(\boldsymbol{x}))>0$, it can be absorbed into $s_{m}(\boldsymbol{x})$, so the value of $\delta(\boldsymbol{x})$ is always greater than 0. For simplicity, we directly regard $\delta(\boldsymbol{x})$ as the scaling factor and set it to a fixed value as a hyper-parameter. We also demonstrate the results of different $\delta(\boldsymbol{x})$ values in Fig. 2.
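To make the positivity of $\delta(\boldsymbol{x})$ concrete, the sketch below evaluates Eq. (5) in the KL case for a few hypothetical discriminator outputs. The function name `delta` and the default $s=1$ are assumptions made for illustration.

```python
import math


def delta(d_value, s=1.0):
    """delta(x) = s * r * f''(r) with r = exp(-D(x)) and f''(r) = 1 / r**2 (KL case)."""
    r = math.exp(-d_value)
    return s * r * (1.0 / r**2)


# Hypothetical discriminator outputs D(x).
for d_value in (-2.0, 0.0, 2.0):
    print(d_value, delta(d_value))
```

In the KL case the product simplifies to $s\cdot\exp(D(\boldsymbol{x}))$, which is positive for any positive $s$.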

Gradient Penalty for GANs. Let us work with the most commonly used WGAN-GP [3]. Its regularization term is commonly referred to as

\displaystyle R(\theta,\psi)=\frac{\gamma}{2}\mathrm{E}_{\hat{\boldsymbol{x}}}\left(\left\|\nabla_{\boldsymbol{x}}D_{\psi}(\hat{\boldsymbol{x}})\right\|-g_{0}\right)^{2},

(6)

where $\hat{\boldsymbol{x}}$ is sampled uniformly on the line segment between two random points $\boldsymbol{x}_{1}\sim p_{\theta}\left(\boldsymbol{x}_{1}\right)$ and $\boldsymbol{x}_{2}\sim p_{\mathcal{D}}\left(\boldsymbol{x}_{2}\right)$. $R(\theta,\psi)$ denotes the regularization term for the generator with weights $\theta$ and the discriminator with weights $\psi$, and $\gamma$ is a coefficient that controls the magnitude of the regularization. The value of $g_{0}$ is empirically set to 1, and we call this penalty the $g_{0}$-centered GP or $1$-centered GP. The alternative is the $0$-centered GP method, proposed by Roth et al. [7] and Mescheder et al. [5]. The formulation is

\displaystyle\begin{aligned} R(\theta,\psi)=\frac{\gamma}{2}\mathrm{E}_{(\hat{\boldsymbol{x}})}\left[\left\|\nabla_{\hat{\boldsymbol{x}}}D_{\psi}(\hat{\boldsymbol{x}})\right\|^{2}\right],\end{aligned}

(7)

When integrating the gradient penalty into the GAN model, it is expressed as a regularization term following the loss function of the GAN. The formulation is represented as

\displaystyle\mathop{\min_{G_{\theta}}}\mathop{\max_{D_{\psi}}}L_{GAN}(G_{\theta},D_{\psi})=\mathbb{E}_{\boldsymbol{z}\sim p_{z}}\left[f\left(D_{\psi}\left(G_{\theta}(\boldsymbol{z})\right)\right)\right]+\mathbb{E}_{\boldsymbol{x}\sim p_{*}}\left[f\left(-D_{\psi}(\boldsymbol{x})\right)\right]+\lambda R(\theta,\psi),

(8)

where $\lambda$ controls the importance of the regularization, $R(\theta,\psi)$ denotes the gradient penalty term, $f$ represents a general function from [3] with different forms in various GAN models. This specific type of loss function indicates that the gradient penalty can influence the variability of the discriminator’s gradient, thereby impacting the generation of synthetic samples by the generator.
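The two penalties in Eq. (6) and Eq. (7) can be compared on given gradient vectors. The sketch below assumes the gradients $\nabla_{\boldsymbol{x}}D_{\psi}(\hat{\boldsymbol{x}})$ have already been computed (for instance by automatic differentiation) and are passed in as plain lists. The function names, the gradient values, and the default $\gamma=10$ are hypothetical.

```python
import math


def norm(v):
    """Euclidean norm of a sequence of floats."""
    return math.sqrt(sum(a * a for a in v))


def one_centered_gp(grads, gamma=10.0, g0=1.0):
    """Eq. (6): gamma/2 * mean((||grad|| - g0)^2) over the sampled points."""
    return 0.5 * gamma * sum((norm(g) - g0) ** 2 for g in grads) / len(grads)


def zero_centered_gp(grads, gamma=10.0):
    """Eq. (7): gamma/2 * mean(||grad||^2) over the sampled points."""
    return 0.5 * gamma * sum(norm(g) ** 2 for g in grads) / len(grads)


# Hypothetical discriminator gradients at two interpolated points.
grads = [[0.6, 0.8], [3.0, 4.0]]
print(one_centered_gp(grads), zero_centered_gp(grads))
```

The first gradient has unit norm, so it contributes nothing to the $1$-centered penalty but still contributes to the $0$-centered one.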

Dynamic Theory for CFG. We incorporate dynamic theory to gain a theoretical understanding of the equivalence between the CFG method and the common GAN theory. However, the CFG method also faces challenges related to unstable training and the lack of local convergence near the Nash-equilibrium point. A summary of the key outcomes is provided in Appendix B.

3 Methodology

Overview. The main structure of this section is organized as follows: In Section 3.1, we present the definition of the latent N-size with the gradient penalty. Guided by this, in Section 3.2, we introduce our regularization termed as $\boldsymbol{\varepsilon}$ -centered gradient penalty. In Section 3.3, we present our main theorem to demonstrate the connection between the latent N-size and the various gradient penalties. Proofs of these theorems are provided in Appendix C.

3.1 Latent N-size with gradient penalty

In this section, our objective is to reveal the interrelation between the latent N-size and the gradient penalty. First, we will define the latent N-size, along with an intuitive explanation. Then, we will explain three basic definitions of the latent N-size by showing how it relates to the diversity of synthetic samples. Additionally, building on the previous step, we will discuss the expansion of the latent N-size by including the gradient penalty.

Latent N-size. We present the definition of latent N-size, which forms the basis for the subsequent theory.

Definition 3.1 (Latent Neighborhood Size).

Let $\boldsymbol{z}_{1}$ , $\boldsymbol{z}_{2}$ be two samples in the latent space. Suppose $\boldsymbol{z}_{1}$ is attracted to the mode $\mathcal{M}_{i}$ by $\hat{\epsilon}$ , then there exists a neighborhood $\mathcal{N}_{r}\left(\boldsymbol{z}_{1}\right)$ of $\boldsymbol{z}_{1}$ such that $\boldsymbol{z}_{2}$ is distracted to $\mathcal{M}_{i}$ by $(\hat{\epsilon}/2-2\alpha)$ , for all $\boldsymbol{z}_{2}\in\mathcal{N}_{r}\left(\boldsymbol{z}_{1}\right)$ . The size of $\mathcal{N}_{r}\left(\boldsymbol{z}_{1}\right)$ can be arbitrarily large but is bounded by an open ball of radius $r$ . The $r$ is defined as

r=\hat{\epsilon}\cdot\left(2\inf_{\boldsymbol{z}}\left\{\left(\frac{\left\|G_{\theta_{t}}\left(\boldsymbol{z}_{1}\right)-G_{\theta_{t}}(\boldsymbol{z})\right\|}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}\right\|}+\frac{\left\|G_{\theta_{t+1}}\left(\boldsymbol{z}_{1}\right)-G_{\theta_{t+1}}(\boldsymbol{z})\right\|}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}\right\|}\right)\right\}\right)^{-1},

where the mode $\mathcal{M}$, \textit{attracted}, and \textit{distracted} are defined in Definitions 3.2, 3.3, and 3.4, respectively.

According to this definition, the radius $r$ is inversely proportional to the discrepancy between the preceding and subsequent outputs of the generator, given a similar latent vector $\boldsymbol{z}$ . A large value of $r$ results in a small difference between the previous and subsequent generator outputs, leading to mode collapse. Conversely, a small value of $r$ leads to a large difference, resulting in diverse synthesis.
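The inverse relationship between $r$ and the generator's sensitivity can be illustrated with a toy computation of the radius in Definition 3.1. This is a sketch under strong assumptions: the generators are scalar functions, the infimum over $\boldsymbol{z}$ is approximated by a minimum over a finite grid of candidates, and all names are hypothetical.

```python
def latent_radius(g_t, g_t1, z1, candidates, eps_hat):
    """Approximate r = eps_hat / (2 * inf_z [ratio_t(z) + ratio_{t+1}(z)]) on a grid."""
    ratios = []
    for z in candidates:
        if z == z1:
            continue  # the ratio is undefined at z = z1
        dist = abs(z1 - z)
        ratios.append(abs(g_t(z1) - g_t(z)) / dist + abs(g_t1(z1) - g_t1(z)) / dist)
    return eps_hat / (2.0 * min(ratios))


z1 = 0.0
candidates = [i / 10.0 for i in range(-10, 11)]
# A generator that barely changes between steps versus one that reacts strongly.
flat = latent_radius(lambda z: 0.5 * z, lambda z: 0.5 * z, z1, candidates, eps_hat=0.1)
steep = latent_radius(lambda z: 2.0 * z, lambda z: 3.0 * z, z1, candidates, eps_hat=0.1)
print(flat, steep)
```

The more sensitive generator yields the smaller radius, matching the statement that a small $r$ corresponds to larger differences between outputs.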

Latent N-size and the diversity. In this paragraph, we discuss the relationship between the latent N-size and the gradient penalty. To begin, let us discuss the three definitions mentioned above, the mode $\mathcal{M}$, \textit{attracted}, and \textit{distracted}, which play a crucial role in the mode collapse phenomenon [8].

Additionally, we propose implementing the gradient penalty in the discriminator to adjust the latent N-size and alleviate the mode collapse phenomenon. If the neighborhood size is too large, a significant portion of the latent space vectors would be attracted to this specific image mode, leading to limited diversity in the synthetic samples. Conversely, if the neighborhood size is too small and contains only one vector, the latent space vectors cannot adequately cover all the modes in the image space.

The intuition behind our idea is illustrated in Fig. 3. The top and bottom rows indicate the latent N-size for the discriminator with or without gradient penalty, respectively. The yellow line at the top indicates that a different latent vector is being drawn towards a new mode, distinct from $z_{1}$ . The blue line at the bottom row indicates that the same latent vector in the neighborhood of $\boldsymbol{z}_{1}$ is attracted to the same mode as $\boldsymbol{z}_{1}$ . The top row represents improved sample diversity, while the bottom row indicates a mode collapse phenomenon.

Definition 3.2 (Modes in Image Space).

There exist modes $\mathcal{M}$ that cover the image space $\mathcal{Y}$. A mode $\mathcal{M}_{i}$ is a subset of $\mathcal{Y}$ satisfying $\max_{\boldsymbol{y_{k}},\boldsymbol{y_{j}}\in\mathcal{M}_{i}}\left\|\boldsymbol{y_{k}}-\boldsymbol{y_{j}}\right\|<\alpha$ and $\min_{\boldsymbol{y_{k}}\in\mathcal{M}_{i},\boldsymbol{y_{m}}\not\in\mathcal{M}_{i}}\alpha<\left\|\boldsymbol{y_{m}}-\boldsymbol{y_{k}}\right\|<2\alpha$, where $\boldsymbol{y_{k}}$ and $\boldsymbol{y_{j}}$ belong to the same mode $\mathcal{M}_{i}$, $\boldsymbol{y_{m}}$ and $\boldsymbol{y_{k}}$ belong to different modes, and $\alpha>0$.

Definition 3.2 asserts that images within the same mode exhibit minimal differences. Conversely, images belonging to different modes exhibit more significant differences.

Definition 3.3 (Modes Attracted).

Let $\boldsymbol{z}_{1}$ be a sample in the latent space. We say $\boldsymbol{z}_{1}$ is attracted to a mode $\mathcal{M}_{i}$ by $\hat{\epsilon}$ from a gradient step if $\left\|\boldsymbol{y_{k}}-G_{\theta_{t+1}}\left(\boldsymbol{z}_{1}\right)\right\|+\hat{\epsilon}<\left\|\boldsymbol{y_{k}}-G_{\theta_{t}}\left(\boldsymbol{z}_{1}\right)\right\|$, where $\boldsymbol{y_{k}}\in\mathcal{M}_{i}$ is an image in the mode $\mathcal{M}_{i}$, $\hat{\epsilon}$ denotes a small quantity, and $\theta_{t}$ and $\theta_{t+1}$ are the generator parameters before and after the gradient update, respectively.

Definition 3.3 establishes when a latent vector $\boldsymbol{z}_{1}$ is attracted to a specific mode $\mathcal{M}_{i}$ in the image space. As training progresses, the output corresponding to $\boldsymbol{z}_{1}$ will exhibit only minor deviations from images within that mode.

Definition 3.4 (Modes Distracted).

Let $\boldsymbol{z}_{2}$ be a sample in the latent space. We say $\boldsymbol{z}_{2}$ is distracted from a mode $\mathcal{M}_{i}$ by $(\hat{\epsilon}/2-2\alpha)$ from a gradient step if $\left\|\boldsymbol{y_{m}}-G_{\theta_{t+1}}\left(\boldsymbol{z}_{2}\right)\right\|+(\hat{\epsilon}/2-2\alpha)<\left\|\boldsymbol{y_{k}}-G_{\theta_{t}}\left(\boldsymbol{z}_{2}\right)\right\|$, where $\boldsymbol{y_{k}}\in\mathcal{M}_{i}$ is an image in the mode $\mathcal{M}_{i}$, $\boldsymbol{y_{m}}\not\in\mathcal{M}_{i}$ is an image from another mode, $\alpha$ has the same meaning as in Definition 3.2, and $\theta_{t}$ and $\theta_{t+1}$ are the generator parameters before and after the gradient update, respectively.

Definition 3.4 explains that when a vector $\boldsymbol{z}_{2}$ close to $\boldsymbol{z}_{1}$ is drawn towards a particular mode in the image space, it is less likely to be attracted by a different mode. Therefore, it is crucial to decrease the latent N-size, as this encourages latent vectors to be attracted to various modes within the image space.

Latent N-size with gradient penalty. We demonstrate the relationship between the latent N-size and the gradient penalty. According to Proposition 3.5, as $\|\nabla_{\boldsymbol{x}}D(\boldsymbol{x})\|$ increases, the latent N-size decreases, and vice versa.

Proposition 3.5.

$\mathcal{N}_{r}\left(\boldsymbol{z}_{1}\right)$ can be defined with discriminator gradient penalty as follows:

\small r=\hat{\epsilon}\cdot\left(2\inf_{\boldsymbol{z}}\left\{\left(\frac{2\left\|G_{\theta_{t}}\left(\boldsymbol{z}_{1}\right)-G_{\theta_{t}}(\boldsymbol{z})\right\|+\eta_{m}\delta(\boldsymbol{x})\sum\limits_{m=1}^{N}\left(\|\nabla_{\boldsymbol{x}}D_{m}(\mathcal{Y}_{2})\|+\|\nabla_{\boldsymbol{x}}D_{m}(\mathcal{Y})\|\right)}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}\right\|}\right)\right\}\right)^{-1}

where $\|\nabla_{\boldsymbol{x}}D_{m}(\mathcal{Y}_{2})\|=\|\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}_{2}))+R\|$ and $\|\nabla_{\boldsymbol{x}}D_{m}(\mathcal{Y})\|=\|\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))+R\|$. $R$ stands for the discriminator gradient penalty.

We show that $G_{\theta_{t}}$ can be computed iteratively as follows: $G_{\theta_{t}}=G_{\theta_{t-1}}+\eta_{m}\delta(\boldsymbol{x})\nabla_{x}D_{m}(G_{\theta_{t-1}}(\boldsymbol{z}))$. For the sake of simplicity, we present the expansion for the $t$-th term.

From Proposition 3.5, it’s crucial to maintain the latent N-size within a specific range to generate diverse synthetic samples. To achieve this, it is necessary to control the magnitude of the gradient norm with a gradient penalty.

Based on the corollary $\nabla_{\boldsymbol{x}}D(\boldsymbol{x})\leq 0$ , we propose subtracting a small value $\boldsymbol{\varepsilon}$ from $\nabla_{\boldsymbol{x}}D(\boldsymbol{x})$ to control the magnitude of the gradient norm. This will effectively enhance the gradient norm, leading to a reduction in the latent N-size. Additionally, we propose the $\boldsymbol{\varepsilon}$ -centered gradient penalty based on the above insight.

3.2 $\boldsymbol{\varepsilon}$ -centered GP

We propose our $\boldsymbol{\varepsilon}$ -centered gradient penalty in this section. We use notation $\boldsymbol{\varepsilon}$ in our penalty name and equation to differ from the hyper-parameter $\varepsilon^{\prime}$ . The $\boldsymbol{\varepsilon}$ -centered GP is

\displaystyle R(\theta,\psi)=\frac{\gamma}{2}\mathrm{E}_{\hat{\boldsymbol{x}}}\left(\left\|\nabla_{\boldsymbol{x}}D_{\psi}(\hat{\boldsymbol{x}})-\boldsymbol{\varepsilon}\right\|\right)^{2},

(9)

where $\boldsymbol{\varepsilon}$ is a vector such that $\|\boldsymbol{\varepsilon}\|_{2}=\varepsilon^{\prime}$ with $\varepsilon^{\prime}=\sqrt{C\cdot N^{2}\cdot\boldsymbol{\varepsilon}^{2}}$, and $N$ and $C$ are the dimensions and channels of the real data, respectively. $\hat{\boldsymbol{x}}$ is sampled uniformly on the line segment between two random points $\boldsymbol{x}_{1}\sim p_{\theta}\left(\boldsymbol{x}_{1}\right)$ and $\boldsymbol{x}_{2}\sim p_{\mathcal{D}}\left(\boldsymbol{x}_{2}\right)$. Combined with the corollary $\nabla_{\boldsymbol{x}}D(\boldsymbol{x})\leq 0$, our $\boldsymbol{\varepsilon}$-centered GP increases $\|\nabla_{\boldsymbol{x}}D(\boldsymbol{x})\|$ and thereby achieves a better latent N-size than the other two gradient penalties.

When training the GAN model with the loss function Eq. (8) and Eq. (9), our approach will control the gradient of the discriminator and result in a diversity of synthesized samples.
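A minimal sketch of Eq. (9) on pre-computed gradients is given below. It assumes, as one reading of the relation $\varepsilon^{\prime}=\sqrt{C\cdot N^{2}\cdot\boldsymbol{\varepsilon}^{2}}$, that the vector $\boldsymbol{\varepsilon}$ has $C\cdot N^{2}$ entries all equal to a scalar value. The function names, the gradient values, and the default $\gamma=10$ are hypothetical.

```python
import math


def eps_vector(eps, channels, size):
    """Constant vector with C * N * N entries equal to eps (an assumed construction)."""
    return [eps] * (channels * size * size)


def eps_centered_gp(grads, eps_vec, gamma=10.0):
    """Eq. (9): gamma/2 * mean(||grad - eps_vec||^2) over the sampled points."""
    total = 0.0
    for g in grads:
        total += sum((a - e) ** 2 for a, e in zip(g, eps_vec))
    return 0.5 * gamma * total / len(grads)


C, N, eps = 1, 2, 0.1
vec = eps_vector(eps, C, N)
vec_norm = math.sqrt(sum(e * e for e in vec))
print(vec_norm, math.sqrt(C * N**2 * eps**2))

# Hypothetical flattened discriminator gradients at two interpolated points.
grads = [[0.1, 0.1, 0.1, 0.1], [-0.1, -0.1, -0.1, -0.1]]
penalty = eps_centered_gp(grads, vec)
print(penalty)
```

Unlike the $1$-centered penalty, which depends only on the gradient norm, this penalty is centered on a vector, so two gradients with the same norm can be penalized differently.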

3.3 Latent N-size with different gradient penalties

In this section, we combine Definition 3.1, Proposition 3.5, Lemma 3.6, and our $\varepsilon$ -centered gradient penalty to prove the main theorem. First, we establish the relationship among the latent N-size with three gradient penalties in the following Lemma.

Lemma 3.6.

The norms of the three Gradient Penalties, which determine the latent N-size, are defined as follows: $\|R_{1}\|$ = $\|\left(\left\|\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))\right\|-g_{0}\right)\|$ , $\|R_{0}\|=\|\left(\left\|\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))\right\|\right)\|$ , $\|R_{\boldsymbol{\varepsilon}}\|$ = $\|\left(\left\|\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))\right\|+\|\boldsymbol{\varepsilon}\|\right)\|$ , respectively. The order of magnitude between the norms of three Gradient Penalty is $\|R_{1}\|<\|R_{0}\|<\|R_{\boldsymbol{\varepsilon}}\|$ . Consequently, the relationship between the latent N-size of three Gradient Penalty is $r_{R_{1}}>r_{R_{0}}>r_{R_{\boldsymbol{\varepsilon}}}$ .

By Proposition 3.5, a larger value of $\|\nabla_{\boldsymbol{x}}D(\boldsymbol{x})\|$ leads to a smaller latent N-size, thus enhancing the diversity of synthetic samples.
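The ordering in Lemma 3.6 can be checked numerically for a single gradient norm. The sketch below is only a scalar reading of the three expressions: the function name, the choice $g_{0}=1$ (suggested by the name "1-centered"), and the example values are assumptions. In this scalar reading, the first inequality $\|R_{1}\|<\|R_{0}\|$ holds only when the gradient norm exceeds $g_{0}/2$.

```python
def penalty_norms(grad_norm: float, g0: float = 1.0, eps_norm: float = 0.3) -> dict:
    """Evaluate the three quantities compared in Lemma 3.6 for one gradient norm.

    grad_norm stands for ||grad_x D_m(G(z))||; g0 and eps_norm are
    illustrative values chosen for this sketch.
    """
    return {
        "R1": abs(grad_norm - g0),          # 1-centered: | ||grad|| - g0 |
        "R0": abs(grad_norm),               # 0-centered: ||grad||
        "Reps": abs(grad_norm + eps_norm),  # eps-centered: ||grad|| + ||eps||
    }


norms = penalty_norms(grad_norm=0.8)
ordered = norms["R1"] < norms["R0"] < norms["Reps"]
```

With `grad_norm=0.8` the three values are approximately 0.2, 0.8, and 1.1, so `ordered` is `True`.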

Theorem 3.7.

Suppose $\boldsymbol{z}_{1}$ is attracted to the mode $\mathcal{M}_{i}$ by $\hat{\epsilon}$. Then there exists a neighborhood $\mathcal{N}_{r}\left(\boldsymbol{z}_{1}\right)$ of $\boldsymbol{z}_{1}$ such that $\boldsymbol{z}_{2}$ is attracted to $\mathcal{M}_{i}$ by $(\hat{\epsilon}/2-2\alpha)$ for all $\boldsymbol{z}_{2}\in\mathcal{N}_{r}\left(\boldsymbol{z}_{1}\right)$. The size of $\mathcal{N}_{r}\left(\boldsymbol{z}_{1}\right)$ can be arbitrarily large but is bounded by an open ball of radius $r$, where $r$ is controlled by the gradient penalty term of the discriminator. The latent N-sizes corresponding to the three gradient penalties satisfy $r_{R_{1}}>r_{R_{0}}>r_{R_{\boldsymbol{\varepsilon}}}$.

The theorem suggests that if a latent vector is pulled towards a specific mode in the image space, the size of its neighborhood (the latent N-size) should be kept reasonable. If the latent N-size is excessively large, it could prevent latent vectors from being attracted to other modes, potentially causing mode collapse. The gradient penalty can be used to adjust the latent N-size effectively. In descending order, the latent N-sizes corresponding to the three gradient penalties are: $r_{R_{1}}>r_{R_{0}}>r_{R_{\boldsymbol{\varepsilon}}}$.

4 Related Work

Generative Adversarial Network. A GAN is trained as a minimax game between a discriminator and a generator, as formulated in [1]. There are many GAN variants for image generation, such as DCGAN [9], SAGAN [10], Progressive Growing GAN [11], BigGAN [12], and StyleGAN [13]. In general, GANs are difficult to train stably. Training may suffer from various issues, including vanishing gradients and mode collapse [14, 7, 15]. Numerous works address these issues. For example, by using the Wasserstein distance, WGAN [2] and its extensions [3, 15, 16, 17, 5] improve the training stability of GANs.

Lipschitz Constraints for GANs. Applying Lipschitz constraints to CNNs has been widely explored [18, 19, 20, 21, 22, 23]. The idea of a Lipschitz constraint has also been introduced in Wasserstein GAN (WGAN). Recent theoretical results by Kim et al. [23] show that self-attention, which is widely used in transformer models, is not Lipschitz continuous. Zhou et al. [20] prove that Lipschitz continuity is a general solution for making the gradient of the optimal discriminative function reliable. Unfortunately, directly applying the Lipschitz constraint to complicated neural networks is not easy. Previous works typically employ gradient penalties and weight normalization. The gradient penalty proposed by Gulrajani et al. [3] adds a penalty term to the loss function to control the variation of the discriminator gradient. Mescheder et al. [5] and Nagarajan and Kolter [16] argue that WGAN-GP is not stable near the Nash equilibrium and propose a new form of gradient penalty. Weight normalization, or spectral normalization, was first studied by Miyato et al. [24]. The normalization constructs a layer of the neural network to control the gradient of the discriminator. Recently, Bhaskara et al. [25] and Wu et al. [26] proposed new forms of normalization that perform better than spectral normalization.

Comparison of our method with existing methods. Our method appears similar to AdvLatGAN-div [27] and MSGAN [28], which aim to increase the ratio of the distance in pixel space to the distance in latent space.

We clarify the differences between our method and AdvLatGAN-div and MSGAN as follows. First, we note that MSGAN, AdvLatGAN, and our method are all related to Eq. (10):

\begin{equation}
\frac{d_{\boldsymbol{I}}\left(G\left(\boldsymbol{c},\boldsymbol{z}_{1}\right),G\left(\boldsymbol{c},\boldsymbol{z}_{2}\right)\right)}{d_{\boldsymbol{z}}\left(\boldsymbol{z}_{1},\boldsymbol{z}_{2}\right)},
\tag{10}
\end{equation}

where $\boldsymbol{c}$ is the condition vector for generation, and $d_{\boldsymbol{I}}$ and $d_{\boldsymbol{z}}$ are the distance metrics in the target (image) space and latent space, respectively.
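For concreteness, the ratio in Eq. (10) can be evaluated for one pair of latent codes as in the following sketch. The mean absolute difference is used for both $d_{\boldsymbol{I}}$ and $d_{\boldsymbol{z}}$, the condition vector $\boldsymbol{c}$ is dropped, and the generator is a toy scaling map; all three choices, as well as the function names, are assumptions made for illustration only.

```python
import numpy as np


def distance_ratio(g_z1: np.ndarray, g_z2: np.ndarray,
                   z1: np.ndarray, z2: np.ndarray,
                   eps: float = 1e-12) -> float:
    """Ratio of output-space distance to latent-space distance (Eq. 10).

    Both distances are mean absolute differences, which is an assumed
    choice of metric. eps guards against division by zero.
    """
    d_img = np.mean(np.abs(g_z1 - g_z2))
    d_lat = np.mean(np.abs(z1 - z2))
    return float(d_img / (d_lat + eps))


def toy_generator(z: np.ndarray) -> np.ndarray:
    """Hypothetical generator that simply scales its input by 2."""
    return 2.0 * z


rng = np.random.default_rng(0)
z1 = rng.normal(size=(16,))
z2 = rng.normal(size=(16,))
ratio = distance_ratio(toy_generator(z1), toy_generator(z2), z1, z2)
```

For this toy generator the ratio is 2 up to the `eps` guard. A generator that maps distinct latent codes to nearly identical outputs would give a ratio close to 0.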

Secondly, MSGAN maximizes Eq. (10) to train the generator, encouraging it to synthesize more diverse samples. Essentially, maximizing Eq. (10) means that for two different samples $\boldsymbol{z_{1}}$ and $\boldsymbol{z_{2}}$ from the latent space, the distance between their corresponding synthesized outputs $G(\boldsymbol{z_{1}})$ and $G(\boldsymbol{z_{2}})$ in the target space should be as large as possible. This leads to an increase in the diversity of the generated samples. The AdvLatGAN-div method searches for pairs of $z_{t}$ that match hard sample pairs, which are more likely to collapse in image space. This search process is achieved by iteratively optimizing the latent samples $\boldsymbol{z}$ to minimize Eq. (10), similar to generating adversarial examples. Then, the objective is to maximize Eq. (10) by using these hard sample pairs $z_{t}$ in the latent space as input again to optimize the generator. This process helps to generate diverse samples in the image space and avoid mode collapse. Both MSGAN and AdvLatGAN-div, which utilize Eq. (10), are based on empirical observations and do not have a solid theoretical basis.

Unlike the previous two methods, we do not employ Eq. (10) as an additional loss function or as an optimization target for training our model. Instead, we use Eq. (10) as the starting point of our theory. We build on Eq. (10) by using the CFG method, which allows us to establish a connection between the distance among different $\boldsymbol{z}$ in the latent space and the constraint on the discriminator. Based on this theory, we also explain how the Lipschitz constraint of the discriminator can enhance the diversity of the generator.

5 Experiment

Datasets. To demonstrate the efficacy of our approach, we use both synthetic and real-world datasets. In line with [29, 30], we simulate two synthetic datasets: (1) Ring dataset. It consists of a mixture of eight 2-D Gaussians with means $(2\cos(i\pi/4),2\sin(i\pi/4))$, $1\leq i\leq 8$, and standard deviation 0.02. (2) Grid dataset. It consists of a mixture of twenty-five 2-D isotropic Gaussians with means $(2i,2j)$, $-2\leq i,j\leq 2$, and standard deviation 0.02. For real-world datasets, we use MNIST [31], CIFAR10 [32], the large-scale scene understanding (LSUN) dataset [33], and ImageNet [34], to allow a fair comparison with the original CFG method. Note that we present ImageNet results for CFG methods for the first time. The CFG method employs a balanced two-class dataset using the same number of training images from the ‘bedroom’ class and the ‘living room’ class (LSUN BR+LR) and a balanced dataset from ‘tower’ and ‘bridge’ (LSUN T+B). In addition, we use the ‘church’ class (LSUN C) and the ‘tower’ class (LSUN T) in our experiments because the CFG method does not perform well on these datasets. We choose CIFAR10 and ImageNet to demonstrate the generalization performance of our algorithm on richer categories and larger-scale datasets. High-resolution qualitative results for various datasets are included in the supplementary section.
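The two synthetic datasets are simple enough to reproduce in a few lines. The sketch below follows the means and standard deviation stated above; the equal mixture weights, the function names, and the sample count are assumptions of this example.

```python
import numpy as np


def ring_means() -> np.ndarray:
    """Eight means on a circle of radius 2: (2cos(i*pi/4), 2sin(i*pi/4)), i = 1..8."""
    return np.array([[2.0 * np.cos(i * np.pi / 4), 2.0 * np.sin(i * np.pi / 4)]
                     for i in range(1, 9)])


def grid_means() -> np.ndarray:
    """Twenty-five means on a 5x5 grid: (2i, 2j) for i, j = -2..2."""
    return np.array([[2.0 * i, 2.0 * j]
                     for i in range(-2, 3) for j in range(-2, 3)])


def sample_mixture(means: np.ndarray, n: int, std: float = 0.02, rng=None):
    """Sample n points from an isotropic Gaussian mixture.

    Components are chosen with equal probability (an assumption).
    Returns the samples and the index of the component of each sample.
    """
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.integers(0, len(means), size=n)
    points = means[idx] + rng.normal(scale=std, size=(n, 2))
    return points, idx


rng = np.random.default_rng(0)
ring_x, ring_idx = sample_mixture(ring_means(), n=1000, rng=rng)
grid_x, grid_idx = sample_mixture(grid_means(), n=1000, rng=rng)
```

Counting how many of the 8 or 25 components receive generated points close to their mean is one simple way to quantify mode coverage on these datasets.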

Implementation Details. To ensure a fair comparison between our method and the CFG method, we maintain identical implementation settings. All experiments were run on a single NVIDIA Tesla V100 or a single NVIDIA Tesla A100. (The computations in this research were performed using the CFFF platform of Fudan University.) The hyper-parameter values for the CFG method were fixed to those in Table 1 unless otherwise specified. The CFG method is sensitive to the setting of $\delta(\boldsymbol{x})$ when generating images, so an appropriate value must be chosen. In our work, we use the same $\delta(\boldsymbol{x})$ settings as the CFG method. The hyper-parameter $\gamma$ is set to $0.1$, $1$, or $10$, and the $\boldsymbol{\varepsilon}$-centered hyper-parameter $\varepsilon^{\prime}$ is set to $0.1$, $0.3$, or $1$ in the experiments.

The explanation and ablation study of $\varepsilon^{\prime}$ are in the supplementary material. The base settings are presented in Table 1 and Table 2.

Baselines. As representative comparison methods, we tested WGAN with gradient penalty (WGAN-GP) [3], the least-squares GAN [35], the original GAN [1], and HingeGAN [36]. All of them are commonly used as baselines for other GAN models. For a fair comparison, we use the same network architecture as the CFG method. We use both a simple DCGAN [9] backbone and a more complex ResNet [37] backbone on each dataset. To evaluate the generalization of our $\boldsymbol{\varepsilon}$-centered method, we combine SOTA GAN models, such as BigGAN [12] and the Denoising Diffusion GAN (DDGAN) [38], with our $\boldsymbol{\varepsilon}$-centered gradient penalty.

Table 1: Hyper-parameters.

NAME	DESCRIPTION
B=64	training data batch size
U=1	discriminator updates per epoch
N=640	examples for updating G per epoch
M=15	number of generation steps in the CFG method
$\gamma$ =0.1/1/10	hyper-parameter for the L constant
$\varepsilon^{\prime}$ =0.1/0.3/1	hyper-parameter for the $\boldsymbol{\varepsilon}$-centered penalty

Evaluation Metrics. It is known to be challenging to obtain reliable likelihood estimates for generative adversarial models. We therefore evaluated the visual quality of generated images using the inception score [39], the Fréchet inception distance [40], and Precision/Recall [41]. The intuition behind the inception score is that high-quality generated images should obtain a score close to that of the real images. The Fréchet inception distance indicates the similarity between the generated images and the real images. Moreover, by definition, precision is the fraction of the generated images that are realistic, and recall is the fraction of real images within the manifold covered by the generator.

Table 2: Hyper-parameters. The $\delta(\boldsymbol{x})$ value is the same as the CFG method. $\gamma$ is the parameter for the GP coefficient.

DATASET	$\eta$	$\delta(\boldsymbol{x})$	$\gamma$	$\varepsilon^{\prime}$
MNIST	2.5e-4	0.5	0.1/1/10	0.1/0.3/1
CIFAR	2.5e-4	1	0.1/1/10	0.1/0.3/1
LSUN	2.5e-4	1	0.1/1/10	0.1/0.3/1
ImageNet	2.5e-4	1	0.1/1/10	0.1/0.3/1

We note that the inception score is limited, as it fails to detect mode collapse or missing modes. Apart from that, we found that it generally corresponds well to human perception.
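To make this limitation concrete, the following sketch computes the inception score from a matrix of class probabilities using its standard definition, $\exp\left(\mathbb{E}_{x}\,\mathrm{KL}(p(y|x)\,\|\,p(y))\right)$. The classifier that produces the probabilities is outside the scope of the example, and the function name and toy inputs are assumptions.

```python
import numpy as np


def inception_score(probs: np.ndarray, eps: float = 1e-12) -> float:
    """Inception score from class probabilities.

    probs has shape (n_samples, n_classes); each row is p(y|x) and sums to 1.
    Returns exp(mean_x KL(p(y|x) || p(y))), where p(y) is the row average.
    """
    p_y = probs.mean(axis=0, keepdims=True)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))


k = 10
# Confident predictions spread evenly over all k classes: highest possible score.
spread = np.eye(k)[np.arange(1000) % k]
# Maximally uncertain predictions: lowest possible score.
uniform = np.full((1000, k), 1.0 / k)

score_spread = inception_score(spread)
score_uniform = inception_score(uniform)
```

Because the score depends only on the predicted class probabilities, a generator that produces a single image per class would obtain the same value as one that produces varied images within each class, which is consistent with the limitation noted above.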

In addition, we used the Fréchet inception distance (FID). FID measures the distance between the distribution of $f(x_{*})$ for real data $x_{*}$ and the distribution of $f(x)$ for generated data $x$, where the function $f$ extracts the features of an image. One advantage of this metric is that it would be high (poor) if mode collapse occurs; a disadvantage is that its computation is relatively expensive. However, FID does not differentiate between the fidelity and diversity of the generated images, so we used Precision/Recall to diagnose these properties. A high precision value indicates a high quality of generated samples, and a high recall value implies that the generator can generate many realistic samples that can be found in the ‘real’ image distribution. In the results below, we call these metrics the (inception) score, the Fréchet distance, and the Precision/Recall.
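As a minimal sketch of the distance itself, FID fits a Gaussian to each set of features and evaluates $\|\mu_{r}-\mu_{g}\|^{2}+\mathrm{Tr}\left(\Sigma_{r}+\Sigma_{g}-2(\Sigma_{r}\Sigma_{g})^{1/2}\right)$. The code below assumes the features $f(\cdot)$ have already been extracted, and it obtains the trace of the matrix square root from eigenvalues instead of calling a dedicated matrix square root routine; the function name and toy features are assumptions.

```python
import numpy as np


def frechet_distance(feat_real: np.ndarray, feat_gen: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two feature sets.

    Both inputs have shape (n_samples, n_features) with n_features >= 2.
    """
    mu_r, mu_g = feat_real.mean(axis=0), feat_gen.mean(axis=0)
    cov_r = np.cov(feat_real, rowvar=False)
    cov_g = np.cov(feat_gen, rowvar=False)
    # Tr((cov_r cov_g)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_g, which are real and non-negative up to
    # numerical error for covariance matrices.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)


rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 4))          # stand-in for f(x_*) on real data
fd_same = frechet_distance(feats, feats)   # identical sets
fd_shift = frechet_distance(feats, feats + 1.0)  # mean shifted by 1 per feature
```

Shifting every feature by 1 leaves the covariance unchanged, so `fd_shift` reduces to the squared mean difference, which is 4 for four features, while `fd_same` is 0 up to floating-point error.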

5.1 Experimental Results

In this section, we present a detailed comparison between our Li-CFG experiments and three gradient penalty (GP) methods within the context of CFG and other GAN-based models. We report the results of our model using the same hyper-parameters as the CFG method. Our comparison shows that our method achieves better results. Furthermore, we demonstrate that the $\boldsymbol{\varepsilon}$-centered gradient penalty is versatile and can be effectively integrated into various GAN architectures.

Table 3: To assess the generalization of our $\boldsymbol{\varepsilon}$-centered gradient penalty, we apply it to various GAN baselines on various datasets. When compared to the original GAN, WGAN, LSGAN, and HingeGAN, each utilizing different gradient penalties, our $\boldsymbol{\varepsilon}$-centered gradient penalty consistently achieves the best Fréchet Inception Distance (FID) score and the Inception Score (IS). This outcome serves as evidence that our $\boldsymbol{\varepsilon}$-centered gradient penalty is not only applicable to the CFG mechanism but can also be effectively employed in common GAN models. The red font number indicates a correction made in the previous manuscript.

IS $\uparrow$ / FID $\downarrow$
MNIST	origin GAN	WGAN	LSGAN	HingeGAN	Li-CFG
Unconstrained	2.18/34.28	2.23/31.05	2.22/36.83	2.21/30.72	2.29/4.04
ours( $\boldsymbol{\varepsilon}$ -centered)	2.21/25.62	2.24/23.79	2.24/28.54	2.23/24.65	2.31/3.29
$0$ -centered	2.21/30.28	2.22/24.98	2.2/27.32	2.19/26.89	2.31/3.54
$1$ -centered	2.19/31.48	2.20/31.27	2.23/29.23	2.22/24.72	2.3/3.64
CIFAR10
Unconstrained	3.48/37.83	3.53/36.24	3.41/30.94	3.58/32.31	4.02/19.41
ours( $\boldsymbol{\varepsilon}$ -centered)	3.82/30.52	3.87/28.45	3.92/24.73	4.01/20.63	4.83/14.96
$0$ -centered	3.76/35.72	3.85/29.21	3.84/29.01	3.89/23.95	4.72/18.5
$1$ -centered	3.78/31.48	3.75/31.27	3.80/29.23	3.82/24.72	4.71/19.3
LSUN B
Unconstrained	2.35/19.28	2.38/18.72	2.31/19.53	2.42/17.85	3.014/10.79
ours( $\boldsymbol{\varepsilon}$ -centered)	2.58/17.38	2.61/16.98	2.67/16.21	2.73/15.43	-/8.78
$0$ -centered	2.55/18.72	2.59/17.88	2.65/17.42	2.68/15.96	3.067/10.54
$1$ -centered	2.56/18.89	2.57/18.29	2.66/17.69	2.65/16.83	3.154/11.5
LSUN T
Unconstrained	3.59/25.87	3.67/22.76	3.52/28.14	3.63/23.57	4.38/13.54
ours( $\boldsymbol{\varepsilon}$ -centered)	3.93/20.62	4.06/17.89	3.89/22.67	4.02/18.72	4.6/11.41
$0$ -centered	3.87/21.65	4.02/18.03	3.83/23.88	3.98/20.23	4.57/12.33
$1$ -centered	3.85/22.03	3.99/18.49	3.81/24.33	3.94/20.81	4.47/12.81
LSUN C
Unconstrained	2.18/34.69	2.25/32.31	2.3/31.47	2.43/30.77	3.17/11.49
ours( $\boldsymbol{\varepsilon}$ -centered)	2.51/29.48	2.63/26.46	2.61/27.82	2.69/25.58	3.18/9.38
$0$ -centered	2.49/30.13	2.60/27.86	2.57/28.76	2.67/26.63	3.08/12.33
$1$ -centered	2.45/30.79	2.58/28.79	2.53/29.15	2.65/26.97	3.08/12.81
LSUN B+L
Unconstrained	3.18/21.92	3.01/23.47	3.12/25.93	3.29/20.33	3.49/15.76
ours( $\boldsymbol{\varepsilon}$ -centered)	3.40/18.39	3.38/19.34	3.31/21.82	3.42/17.87	3.50/14.09
$0$ -centered	3.36/19.12	3.37/19.98	3.29/22.51	3.38/18.63	3.47/15.39
$1$ -centered	3.30/19.87	3.35/20.65	3.26/23.32	3.31/19.48	3.42/15.53
LSUN T+B
Unconstrained	4.65/27.61	4.77/25.4	4.68/25.87	4.81/24.66	5.01/16.9
ours( $\boldsymbol{\varepsilon}$ -centered)	4.78/23.72	4.85/21.84	4.89/21.9	4.92/19.83	5.07/15.72
$0$ -centered	4.73/24.19	4.81/22.98	4.85/23.4	4.85/21.31	5.05/16.01
$1$ -centered	4.74/24.55	4.76/23.59	4.79/24.02	4.80/22.76	5.04/16.32
ImageNet
Unconstrained	6.83/54.68	6.94/48.54	7.00/53.7	7.13/51.28	8.98/29.73
ours( $\boldsymbol{\varepsilon}$ -centered)	7.38/45.84	7.69/40.72	7.52/41.25	7.66/40.36	9.15/29.05
$0$ -centered	7.25/46.13	7.48/42.88	7.49/41.93	7.58/41.65	8.79/30.04
$1$ -centered	7.16/47.28	7.21/44.71	7.44/42.84	7.47/42.37	8.96/29.55

Inception Score Results. Table 3 presents the Inception Score (IS) values for different GAN models. Since the exact code and models used to compute IS scores in Johnson and Zhang [4] are not publicly available, we employed the standard PyTorch implementation of the Inception Score function (https://github.com/sbarratt/inception-score-pytorch). The measured IS scores for the real datasets are as follows: 2.58 (MNIST), 9.56 (CIFAR10), 4.78 (LSUN T), 3.72 (LSUN B), 3.72 (LSUN B+L), 3.79 (LSUN T+B), 5.8 (LSUN C), and 37.99 (ImageNet). The scores of the CFG method and Li-CFG are very close, whereas WGAN and LSGAN perform worse on all datasets. The relative differences also show that Li-CFG achieves better image quality than the CFG method on the same dataset.

Table 4: To assess the generalization of our $\boldsymbol{\varepsilon}$-centered gradient penalty, we apply it to various GAN baselines on various datasets. When compared to the original GAN, WGAN, LSGAN, and HingeGAN baselines, each utilizing different gradient penalties, our $\boldsymbol{\varepsilon}$-centered gradient penalty consistently achieves the best Recall score. This outcome serves as evidence that our $\boldsymbol{\varepsilon}$-centered gradient penalty is not only applicable to the CFG mechanism but can also be effectively employed in common GAN models.

Precision/Recall $\uparrow$
CIFAR10	origin GAN	WGAN	LSGAN	HingeGAN	Li-CFG
Unconstrained	0.78/0.45	0.78/0.41	0.80/0.49	0.81/0.47	0.77/0.59
ours( $\boldsymbol{\varepsilon}$ -centered)	0.77/0.56	0.79/0.56	0.79/0.55	0.78/0.59	0.75/0.66
$0$ -centered	0.77/0.53	0.79/0.53	0.78/0.53	0.78/0.58	0.76/0.64
$1$ -centered	0.78/0.54	0.78/0.52	0.78/0.51	0.79/0.57	0.77/0.62
LSUN B
Unconstrained	0.75/0.37	0.75/0.39	0.77/0.38	0.73/0.4	0.65/0.49
ours( $\boldsymbol{\varepsilon}$ -centered)	0.71/0.43	0.69/0.47	0.67/0.46	0.69/0.47	0.61/0.53
$0$ -centered	0.73/0.41	0.71/0.45	0.67/0.45	0.71/0.47	0.62/0.51
$1$ -centered	0.74/0.41	0.71/0.44	0.69/0.44	0.73/0.45	0.64/0.50
LSUN T
Unconstrained	0.80/0.45	0.79/0.49	0.81/0.43	0.79/0.47	0.71/0.58
ours( $\boldsymbol{\varepsilon}$ -centered)	0.76/0.49	0.72/0.56	0.78/0.45	0.74/0.53	0.62/0.65
$0$ -centered	0.77/0.48	0.73/0.54	0.79/0.44	0.75/0.51	0.64/0.63
$1$ -centered	0.75/0.48	0.76/0.53	0.78/0.43	0.74/0.52	0.69/0.62
LSUN C
Unconstrained	0.82/0.36	0.84/0.32	0.83/0.35	0.79/0.41	0.75/0.51
ours( $\boldsymbol{\varepsilon}$ -centered)	0.79/0.43	0.80/0.45	0.81/0.44	0.78/0.47	0.72/0.58
$0$ -centered	0.78/0.42	0.81/0.44	0.82/0.43	0.79/0.45	0.76/0.55
$1$ -centered	0.79/0.41	0.82/0.42	0.81/0.41	0.79/0.42	0.77/0.54
LSUN B+L
Unconstrained	0.81/0.46	0.85/0.43	0.86/0.42	0.81/0.47	0.77/0.51
ours( $\boldsymbol{\varepsilon}$ -centered)	0.79/0.51	0.79/0.49	0.83/0.46	0.78/0.51	0.72/0.61
$0$ -centered	0.79/0.49	0.80/0.47	0.85/0.45	0.80/0.48	0.74/0.59
$1$ -centered	0.77/0.48	0.82/0.46	0.86/0.43	0.80/0.49	0.74/0.58
LSUN T+B
Unconstrained	0.82/0.38	0.82/0.39	0.84/0.37	0.81/0.42	0.78/0.46
ours( $\boldsymbol{\varepsilon}$ -centered)	0.80/0.43	0.78/0.45	0.79/0.45	0.78/0.46	0.74/0.53
$0$ -centered	0.83/0.41	0.79/0.43	0.81/0.42	0.79/0.44	0.75/0.52
$1$ -centered	0.83/0.40	0.81/0.42	0.81/0.41	0.8/0.44	0.75/0.51
ImageNet
Unconstrained	0.79/0.39	0.72/0.45	0.76/0.41	0.76/0.42	0.62/0.61
ours( $\boldsymbol{\varepsilon}$ -centered)	0.73/0.47	0.69/0.51	0.7/0.49	0.7/0.51	0.61/0.62
$0$ -centered	0.75/0.46	0.7/0.47	0.72/0.47	0.71/0.50	0.62/0.62
$1$ -centered	0.76/0.45	0.72/0.46	0.72/0.46	0.71/0.49	0.61/0.61

Fréchet Distance Results. We compute the Fréchet Distance with 50k generated images and all real images from each dataset using the standard implementation (https://github.com/mseitzer/pytorch-fid). However, we observed that the FID scores computed in our environment were significantly higher than those reported for the CFG method, even though we generated the images using the same CFG technique. The difference in FID scores suggests potential differences in environmental factors or implementation details. The results, presented in Table 3, show that Li-CFG achieves the best results on MNIST, CIFAR10, and the LSUN T, T+B, and B+L datasets. WGAN and LSGAN exhibit consistently weaker performance. The reason is that, for a fair comparison with other methods, we do not use tuning tricks, and these methods are also sensitive to hyper-parameter changes.

Table 5: Different $\varepsilon^{\prime}$ settings of our Li-CFG trained on MNIST. We use the FID and IS scores to compare the quality of the generated samples. The other two penalties do not have the parameter $\varepsilon^{\prime}$, so all of their cells contain the same value. ‘Untrained’ means that the loss function does not converge.

		FID			IS
MNIST	$\varepsilon^{\prime}=0.1$	$\varepsilon^{\prime}=0.3$	$\varepsilon^{\prime}=1$	$\varepsilon^{\prime}=5$	$\varepsilon^{\prime}=0.1$	$\varepsilon^{\prime}=0.3$	$\varepsilon^{\prime}=1$	$\varepsilon^{\prime}=5$
ours( $\boldsymbol{\varepsilon}$ -centered)	2.99	2.88	2.85	untrained	2.28	2.32	2.29	untrained
$0$ -centered	3.54	3.54	3.54	3.54	2.31	2.31	2.31	2.31
$1$ -centered	3.64	3.64	3.64	3.64	2.3	2.3	2.3	2.3
LSUN Bedroom
ours( $\boldsymbol{\varepsilon}$ -centered)	9.94	8.78	9.73	untrained	2.97	2.94	2.97	untrained
$0$ -centered	10.54	10.54	10.54	10.54	3.067	3.067	3.067	3.067
$1$ -centered	11.5	11.5	11.5	11.5	3.154	3.154	3.154	3.154
		FID			IS
LSUN T+B	$\gamma=0.1$	$\gamma=1$	$\gamma=10$		$\gamma=0.1$	$\gamma=1$	$\gamma=10$
ours( $\boldsymbol{\varepsilon}$ -centered)	15.72	16.85	16.67		5.07	5.06	5.08
$0$ -centered	16.01	17.4	19.79		5.05	5.08	5.05
$1$ -centered	16.32	18.73	34.28		5.04	5.18	4.71

As we maintained the original network configurations for both models without applying any additional training optimizations, our results provide a fair comparison.

By maintaining the same network architecture as the CFG method, we were able to achieve the best scores in six out of seven datasets. The FID results demonstrate that our Li-CFG approach is effective and capable of generating high-quality images.

Precision and Recall Results. We generate 10,000 synthetic samples and compare them with 10,000 real samples to compute precision and recall, using a public implementation of the Precision and Recall functions (https://github.com/blandocs/improved-precision-and-recall-metric-pytorch). In addition to the five GAN variants mentioned above, we also use BigGAN on both CIFAR10 and ImageNet and DDGAN on the CIFAR10 dataset. We present these results in Table 4.
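The sketch below illustrates the k-nearest-neighbour idea behind this kind of precision and recall estimate: a sample counts as covered if it falls inside the k-th-neighbour ball of at least one sample from the other set. It is a simplified NumPy version that works on precomputed feature vectors; the function names, the value $k=3$, and the toy features are assumptions, and it is not the referenced implementation.

```python
import numpy as np


def _pairwise_dist(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Euclidean distances between all rows of a and all rows of b."""
    sq = (np.sum(a ** 2, axis=1)[:, None]
          + np.sum(b ** 2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    return np.sqrt(np.clip(sq, 0.0, None))


def _knn_radii(feats: np.ndarray, k: int) -> np.ndarray:
    """Distance from each point to its k-th nearest neighbour in the same set."""
    d = _pairwise_dist(feats, feats)
    # After sorting, column 0 is the distance of a point to itself.
    return np.sort(d, axis=1)[:, k]


def _coverage(queries: np.ndarray, refs: np.ndarray, k: int) -> float:
    """Fraction of queries lying inside at least one k-NN ball of refs."""
    radii = _knn_radii(refs, k)
    d = _pairwise_dist(queries, refs)
    return float(np.mean(np.any(d <= radii[None, :], axis=1)))


def precision_recall(real: np.ndarray, fake: np.ndarray, k: int = 3):
    """Precision: generated samples covered by the real manifold estimate.
    Recall: real samples covered by the generated manifold estimate."""
    return _coverage(fake, real, k), _coverage(real, fake, k)


rng = np.random.default_rng(0)
real_feats = rng.normal(size=(200, 8))
p_same, r_same = precision_recall(real_feats, real_feats.copy())
p_far, r_far = precision_recall(real_feats, real_feats + 100.0)
```

Identical sets give precision and recall of 1, and sets that are far apart give 0 for both. A generator that covers only part of the real data would keep precision high while recall drops.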

Ablation Study on $\varepsilon^{\prime}$ and $\delta(\boldsymbol{x})$. Our method introduces a controllable parameter that adjusts the gradient penalty within a specified range. As this parameter varies, the latent N-size adjusts accordingly. Through experiments, we demonstrate that a reasonable latent N-size is crucial, as shown in Table 5. An excessively small latent N-size, resulting from a large value of our parameter, causes the loss function to fail to converge. This reflects a trade-off between the diversity of synthetic samples and the training stability of the model. For the values of $\delta(\boldsymbol{x})$ where the CFG method performs well, $\gamma=0.1$ consistently yields better results. On the other hand, for cases where the CFG method performs poorly, $\gamma=10$ tends to improve performance.

Generalization of $\boldsymbol{\varepsilon}$-centered Gradient Penalty. Our $\boldsymbol{\varepsilon}$-centered gradient penalty was originally conceived within the CFG mechanism. The ‘Li-CFG’ column in Table 3 and Table 4 shows its effectiveness when combined with the CFG method. However, a fundamental question arises: how well does our $\boldsymbol{\varepsilon}$-centered gradient penalty generalize beyond the CFG framework? To address this, we investigate the applicability of our mechanism to various GAN models. Although the CFG method employs a distinct formulation, it is equivalent to a common GAN from the perspective of dynamic theory. This prompts us to assess the performance of our $\boldsymbol{\varepsilon}$-centered gradient penalty across various models. The results in Table 3 and Table 4 reveal that our $\boldsymbol{\varepsilon}$-centered gradient penalty consistently achieves the best FID score across all these GAN models.

To demonstrate the effectiveness and generalization of our $\boldsymbol{\varepsilon}$-centered gradient penalty, we compare various methods, focusing on the diversity of synthesized samples. The results of the compared methods are tabulated in Table 6.

We have observed an interesting phenomenon: spectral normalization, weight gradient, and gradient penalty can work together effectively to improve the diversity of synthesized samples. However, it can be argued that all these methods rely on Lipschitz constraints and may therefore compete with each other. For instance, BigGAN and the denoising diffusion GAN employ spectral normalization as a strong Lipschitz constraint on the weights of the neural networks. Given such a strong Lipschitz constraint, our gradient penalty would not be expected to affect the synthesized samples. However, when the gradient penalty is applied to both models, there is still a significant improvement in the diversity of the synthesized samples. This observation supports our theory. It suggests that interpreting the gradient penalty only as a Lipschitz constraint may not be sufficient. Our neighborhood theory provides a new perspective for understanding how the gradient penalty can improve the diversity of the model.

This evidence highlights the strong generalization capability of our proposed method. It showcases its broad applicability by seamlessly integrating with standard GAN models, offering advantages that extend beyond the CFG mechanism.

Table 6: To assess the effectiveness of our $\boldsymbol{\varepsilon}$-centered gradient penalty with respect to diversity, we compare it to various GAN baselines that focus on the diversity of synthesized samples. The improved FID refers to the difference in FID between a method and its baseline. The symbol ‘-’ indicates that we did not recalculate the experimental result in our local computing environment. ‘Baseline’ refers to the method that the other methods are compared to. To make a fair comparison, all the methods being compared are based on the same baseline. Our method is the one shown in bold ($\boldsymbol{\varepsilon}$-centered).

CIFAR10	FID $\downarrow$	Improved $\uparrow$ FID	Precision	Recall $\uparrow$
Baseline: WGAN	36.24	-	0.78	0.41
AdvLatGAN-qua [27]	30.21	6.03	0.69	0.45
AdvLatGAN-qua+ [27]	29.73	6.51	0.69	0.46
WGAN-Unroll [29]	30.28	5.98	0.7	0.45
IID-GAN [30]	28.63	7.61	-	-
WGAN( $\boldsymbol{\varepsilon}$ -centered)	28.45	7.79	0.67	0.46
Baseline: Origin GAN	37.83	-	0.78	0.45
MSGAN [28]	32.38	5.45	0.77	0.51
AdvLatGAN-div+ [27]	30.92	6.91	0.78	0.54
Origin GAN( $\boldsymbol{\varepsilon}$ -centered)	30.52	7.31	0.77	0.56
Baseline: SNGAN-RES [24]	15.93	-	0.8	0.75
AdvLatGAN-qua [27]	20.75	-4.62	0.82	0.68
AdvLatGAN-qua+ [27]	15.87	0.06	0.79	0.76
GN-GAN [26]	15.31	0.63	0.77	0.75
GraNC-GAN [25]	14.82	1.11	-	-
aw-SN-GAN [42]	8.9	7.03	-	-
SNGAN-RES( $\boldsymbol{\varepsilon}$ -centered)	13.4	2.53	0.74	0.79
Baseline: SNGAN-CNN [24]	18.95	-	0.785	0.63
GN-GAN [26]	19.31	-0.36	0.81	0.59
SNGAN-CNN( $\boldsymbol{\varepsilon}$ -centered)	17.92	1.03	0.79	0.64
Baseline: BigGAN	8.25	-	0.76	0.62
GN-BigGAN [26]	7.89	0.36	0.77	0.62
aw-BigGAN [42]	7.03	1.22	-	-
BigGAN( $\boldsymbol{\varepsilon}$ -centered)	5.18	3.07	0.75	0.67
DDGAN( $\boldsymbol{\varepsilon}$ -centered)	2.38	5.87	0.74	0.69
ImageNet
Baseline: BigGAN	17.33	-	0.53	0.71
BigGAN( $\boldsymbol{\varepsilon}$ -centered)	12.68	4.65	0.45	0.76

Visualization of synthesized images. The results of the experiments using the Li-CFG and CFG methods can be seen in Fig. 4 and Fig. 5. Our Li-CFG method achieves the same or better results than the CFG method. On some datasets, the images generated by the CFG method have already collapsed, while the images generated by Li-CFG remain of good quality. Furthermore, we also present synthesized samples generated with the state-of-the-art GAN model on CIFAR10, as shown in Fig. 6. Additional synthesized samples from various datasets can be found in Section D.4 of the supplementary materials.

The results for the synthetic datasets are presented in Fig. 7 and Fig. 8. We observe that the unconstrained CFG method has difficulty converging to all modes of the ring or grid datasets. However, when supplemented with a gradient penalty, these methods show an enhanced ability to converge to the mixture of Gaussians. Of the three types of gradient penalty tested, the $1$-centered gradient penalty showed inferior convergence compared to the $0$-centered gradient penalty and our proposed $\boldsymbol{\varepsilon}$-centered gradient penalty. Notably, our $\boldsymbol{\varepsilon}$-centered gradient penalty drove more sample points to converge to the Gaussian modes than the $0$-centered gradient penalty.

6 Conclusion

In this paper, we provide a novel perspective for analyzing the relationship between the constraint and the diversity of synthetic samples. We assert that the constraint can effectively influence the latent N-size, which is strongly associated with the mode collapse phenomenon. To modify the latent N-size efficiently, we propose a new form of gradient penalty called the $\boldsymbol{\varepsilon}$-centered GP. The experiments demonstrate that our method is more stable and achieves more diverse synthetic samples than the CFG method. Additionally, our method can be applied not only within the CFG method but also in common GAN models. The inception score, Fréchet distance, Precision/Recall, and visual quality of generated images show that our method is more powerful. In future work, we plan to change the metric function of the generator from the KL divergence to the Wasserstein distance to achieve a more stable and efficient GAN architecture.

Limitations and future work. In this study, we achieve good FID results on the considered datasets. However, higher-resolution datasets might require a different choice of network architecture and hyper-parameters. In addition, although the hyper-parameter $\delta(\boldsymbol{x})$ has an important impact on the algorithm, we do not discuss the relationship between the hyper-parameter $\delta(\boldsymbol{x})$ in CFG and our constraint hyper-parameter $\boldsymbol{\varepsilon}$ from a theoretical perspective. These effects should be studied systematically in future work.

7 Declarations

• Conflicts of interest/Competing interests: Not applicable.

• Funding: This work was funded by the National Natural Science Foundation of China (U22A20102, 62272419) and the Natural Science Foundation of Zhejiang Province (ZJNSFLZ22F020010).

• Ethics approval: Not applicable.

• Consent to participate: Not applicable.

• Consent for publication: Not applicable.

• Availability of data and material: All the data used in our work are sourced from publicly available datasets that have been established by prior research. References for all the data are provided in our main document.

• Code availability: Our code is custom code and is available for use.

• Authors’ contributions: We list the authors’ contributions as follows:

Conceptualization: Chang Wan, Yanwei Fu, Xinwei Sun;

Methodology: Chang Wan, Xinwei Sun, Ke Fan;

Experiments: Chang Wan, Ke Fan, Yunliang Jiang;

Writing (original draft preparation): Chang Wan, Yanwei Fu;

Writing (review and editing): Xinwei Sun, Zhonglong Zheng, Minglu Li;

Funding acquisition: Yunliang Jiang, Zhonglong Zheng;

Resources: Zhonglong Zheng;

Supervision: Minglu Li, Yunliang Jiang.

All authors read and approved the final manuscript.

References

Goodfellow et al. [2014] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. Conference on Neural Information Processing Systems 27 (2014)
Arjovsky et al. [2017] Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein generative adversarial networks. In: International Conference on Machine Learning, pp. 214–223 (2017). PMLR
Gulrajani et al. [2017] Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.: Improved training of wasserstein gans. arXiv preprint arXiv:1704.00028 (2017)
Johnson and Zhang [2018] Johnson, R., Zhang, T.: Composite functional gradient learning of generative adversarial models. In: International Conference on Machine Learning, pp. 2371–2379 (2018). PMLR
Mescheder et al. [2018] Mescheder, L., Geiger, A., Nowozin, S.: Which training methods for gans do actually converge? In: International Conference on Machine Learning, pp. 3481–3490 (2018). PMLR
Johnson and Zhang [2019] Johnson, R., Zhang, T.: A framework of composite functional gradient methods for generative adversarial models. IEEE Transactions on Pattern Analysis and Machine Intelligence 43(1), 17–32 (2019)
Roth et al. [2017] Roth, K., Lucchi, A., Nowozin, S., Hofmann, T.: Stabilizing training of generative adversarial networks through regularization. arXiv preprint arXiv:1705.09367 (2017)
Yang et al. [2019] Yang, D., Hong, S., Jang, Y., Zhao, T., Lee, H.: Diversity-sensitive conditional generative adversarial networks. arXiv preprint arXiv:1901.09024 (2019)
Radford et al. [2015] Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015)
Zhang et al. [2019] Zhang, H., Goodfellow, I., Metaxas, D., Odena, A.: Self-attention generative adversarial networks. In: International Conference on Machine Learning, pp. 7354–7363 (2019). PMLR
Karras et al. [2017] Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196 (2017)
Brock et al. [2018] Brock, A., Donahue, J., Simonyan, K.: Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096 (2018)
Karras et al. [2019] Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: IEEE/CVF Computer Vision and Pattern Recognition Conference, pp. 4401–4410 (2019)
Che et al. [2016] Che, T., Li, Y., Jacob, A.P., Bengio, Y., Li, W.: Mode regularized generative adversarial networks. arXiv preprint arXiv:1612.02136 (2016)
Nowozin et al. [2016] Nowozin, S., Cseke, B., Tomioka, R.: f-gan: Training generative neural samplers using variational divergence minimization. In: Conference on Neural Information Processing Systems, pp. 271–279 (2016)
Nagarajan and Kolter [2017] Nagarajan, V., Kolter, J.Z.: Gradient descent gan optimization is locally stable. arXiv preprint arXiv:1706.04156 (2017)
Mescheder et al. [2017] Mescheder, L., Nowozin, S., Geiger, A.: The numerics of gans. arXiv preprint arXiv:1705.10461 (2017)
Oberman and Calder [2018] Oberman, A.M., Calder, J.: Lipschitz regularized deep neural networks converge and generalize. arXiv preprint arXiv:1808.09540 (2018)
Scaman and Virmaux [2018] Scaman, K., Virmaux, A.: Lipschitz regularity of deep neural networks: analysis and efficient estimation. arXiv preprint arXiv:1805.10965 (2018)
Zhou et al. [2018] Zhou, Z., Song, Y., Yu, L., Wang, H., Liang, J., Zhang, W., Zhang, Z., Yu, Y.: Understanding the effectiveness of lipschitz-continuity in generative adversarial nets. arXiv preprint arXiv:1807.00751 (2018)
Zhou et al. [2019] Zhou, Z., Liang, J., Song, Y., Yu, L., Wang, H., Zhang, W., Yu, Y., Zhang, Z.: Lipschitz generative adversarial nets. In: International Conference on Machine Learning, pp. 7584–7593 (2019). PMLR
Herrera et al. [2020] Herrera, C., Krach, F., Teichmann, J.: Estimating full lipschitz constants of deep neural networks. arXiv preprint arXiv:2004.13135 (2020)
Kim et al. [2021] Kim, H., Papamakarios, G., Mnih, A.: The lipschitz constant of self-attention. In: International Conference on Machine Learning, pp. 5562–5571 (2021). PMLR
Miyato et al. [2018] Miyato, T., Kataoka, T., Koyama, M., Yoshida, Y.: Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957 (2018)
Bhaskara et al. [2022] Bhaskara, V.S., Aumentado-Armstrong, T., Jepson, A.D., Levinshtein, A.: Gran-gan: Piecewise gradient normalization for generative adversarial networks. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 3821–3830 (2022)
Wu et al. [2021] Wu, Y.-L., Shuai, H.-H., Tam, Z.-R., Chiu, H.-Y.: Gradient normalization for generative adversarial networks. In: International Conference on Computer Vision, pp. 6373–6382 (2021)
Li et al. [2022] Li, Y., Mo, Y., Shi, L., Yan, J.: Improving generative adversarial networks via adversarial learning in latent space. Conference on Neural Information Processing Systems 35, 8868–8881 (2022)
Mao et al. [2019] Mao, Q., Lee, H.-Y., Tseng, H.-Y., Ma, S., Yang, M.-H.: Mode seeking generative adversarial networks for diverse image synthesis. In: IEEE/CVF Computer Vision and Pattern Recognition Conference, pp. 1429–1437 (2019)
Metz et al. [2016] Metz, L., Poole, B., Pfau, D., Sohl-Dickstein, J.: Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163 (2016)
Li et al. [2021] Li, Y., Shi, L., Yan, J.: Iid-gan: an iid sampling perspective for regularizing mode collapse. arXiv preprint arXiv:2106.00563 (2021)
LeCun et al. [1998] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998)
Krizhevsky et al. [2009] Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images. Master’s thesis, Department of Computer Science, University of Toronto (2009)
Yu et al. [2015] Yu, F., Seff, A., Zhang, Y., Song, S., Funkhouser, T., Xiao, J.: Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365 (2015)
Deng et al. [2009] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: IEEE/CVF Computer Vision and Pattern Recognition Conference, pp. 248–255 (2009). IEEE
Mao et al. [2017] Mao, X., Li, Q., Xie, H., Lau, R.Y., Wang, Z., Paul Smolley, S.: Least squares generative adversarial networks. In: International Conference on Computer Vision, pp. 2794–2802 (2017)
Lim and Ye [2017] Lim, J.H., Ye, J.C.: Geometric gan. arXiv preprint arXiv:1705.02894 (2017)
He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE/CVF Computer Vision and Pattern Recognition Conference, pp. 770–778 (2016)
Xiao et al. [2021] Xiao, Z., Kreis, K., Vahdat, A.: Tackling the generative learning trilemma with denoising diffusion gans. arXiv preprint arXiv:2112.07804 (2021)
Salimans et al. [2016] Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training gans. Conference on Neural Information Processing Systems 29, 2234–2242 (2016)
Heusel et al. [2017] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained by a two time-scale update rule converge to a local nash equilibrium. Conference on Neural Information Processing Systems 30 (2017)
Kynkäänniemi et al. [2019] Kynkäänniemi, T., Karras, T., Laine, S., Lehtinen, J., Aila, T.: Improved precision and recall metric for assessing generative models. Conference on Neural Information Processing Systems 32 (2019)
Zadorozhnyy et al. [2021] Zadorozhnyy, V., Cheng, Q., Ye, Q.: Adaptive weighted discriminator for training generative adversarial networks. In: IEEE/CVF Computer Vision and Pattern Recognition Conference, pp. 4781–4790 (2021)

Appendix

\parttoc

Appendix A Overview

In this section, we outline the contents of our supplementary material, which is divided into four main sections. First, we introduce the supplementary material. Second, we analyze the dynamic theory for the CFG method, covering essential concepts related to CFG, the Lipschitz constraint, and the dynamic theory of CFG. Third, the theoretical analysis section contains the proofs of our definition, lemma, and theorem. Finally, we present the results of our experiments in the last section.

In the analysis of dynamic theory for the CFG method section, we provide foundational knowledge about CFG and Lipschitz constraints. Furthermore, we provide theoretical proofs that establish certain equivalences between the CFG method and conventional GAN theory, specifically focusing on dynamic theory principles.

In the theoretical analysis section, we begin with the proofs of Definition 3.1 and Proposition 3.5. Definition 3.1 is the base definition for Proposition 3.5, so we prove them together. Proposition 3.5 gives the formulation of the latent N-size with gradient penalty. Next, we prove the corollary $\nabla_{\boldsymbol{x}}D(\boldsymbol{x})\leq 0$ , an important corollary from which we derive our $\boldsymbol{\varepsilon}$ -centered gradient penalty. Subsequently, we prove Lemma 3.6, which derives the relationship between the values of the discriminator norm under three different gradient penalties. Lastly, we prove our main Theorem 3.7, which summarizes the relationship between the latent N-size and the three gradient penalties under the above definitions and lemmas.

In the Experiments section, we provide additional experiments and present more results of our Li-CFG and $\boldsymbol{\varepsilon}$ -centered method with common GAN models. For synthetic datasets, we include visual inspection figures comparing our $\boldsymbol{\varepsilon}$ -centered method to other GAN models that use different gradient penalties. For real-world datasets, we provide visual comparison figures that evaluate the CFG method and Li-CFG on CelebA, LSUN Bedroom, and ImageNet at resolutions of 128x128 and 256x256 pixels. We also present visual comparison figures of our $\boldsymbol{\varepsilon}$ -centered method with DDGAN and BigGAN on LSUN Church (256x256) and ImageNet (64x64).

Appendix B Analysis of the dynamic theory for the CFG

Dynamic Theory for the CFG. This section presents the motivation behind, and the proofs for, the relationship between the CFG method and the common GAN method. In terms of the dynamic theory, the generator and discriminator in the CFG method function equivalently to the corresponding components in the common GAN. Based on this analysis, the CFG method still suffers from instability and is not locally convergent at the Nash equilibrium point. We visualize the gradient vector of the discriminator for CFG and Li-CFG in Fig. 9. This motivates us to introduce the Lipschitz constraint into the CFG method to enforce convergence. The dynamic theory is adopted from Mescheder et al. [5] and Nagarajan and Kolter [16].

Lemma B.1.

The loss functions of the discriminators of CFG and common GAN learned by minimax games exhibit the same form and optimization objective. The form of the discriminator loss function is:

\mathbb{E}_{\boldsymbol{z}\sim p_{z}}\left[f\left(-D_{\psi}\left(G_{\theta}(\boldsymbol{z})\right)\right)\right]+\mathbb{E}_{\boldsymbol{x}\sim p_{*}}\left[f\left(D_{\psi}(\boldsymbol{x})\right)\right]

Proof.

We can establish the equivalence through two main perspectives. Firstly, we analyze the CFG method. The loss function of the discriminator can be described as follows:

\displaystyle\begin{aligned} L_{CFG}=\mathbb{E}_{\boldsymbol{x}\sim p_{*}}\mathrm{ln}\left(1+\exp(-D(\boldsymbol{x}))\right)+\mathbb{E}_{\boldsymbol{x}\sim p_{z}}\mathrm{ln}\left(1+\exp(D(\boldsymbol{x}))\right).\end{aligned}

(11)

To unify the symbolic expression, we use $\exp(D(\boldsymbol{x}))$ instead of $e^{D(\boldsymbol{x})}$ from the CFG article in our paper. The common GAN loss function is

\displaystyle L_{GAN}=\mathbb{E}_{\boldsymbol{x}\sim p_{*}}[\ln d(\boldsymbol{x})]+\mathbb{E}_{\boldsymbol{z}\sim p_{\boldsymbol{z}}}\ln[1-d(G(\boldsymbol{z}))].

(12)

If we set $d(\boldsymbol{x})=\frac{1}{1+\exp(-D(\boldsymbol{x}))}$ and substitute it into the common GAN loss function, we obtain the loss function of the CFG method up to sign. This demonstrates the equivalence between the two loss functions. For the second perspective, we use the notation of the loss function proposed by Nagarajan and Kolter [16] to prove the lemma. The loss function is

\displaystyle\begin{aligned} L_{COMM}(\theta,\psi)=\mathbb{E}_{\boldsymbol{z}\sim p_{z}}\left[f\left(-D_{\psi}\left(G_{\theta}(\boldsymbol{z})\right)\right)\right]+\mathbb{E}_{\boldsymbol{x}\sim p_{*}}\left[f\left(D_{\psi}(\boldsymbol{x})\right)\right].\end{aligned}

(13)

If we choose the f function to be $f(t)=-\ln\left(1+\exp(-t)\right)$ , it leads to the loss function of the CFG method. ∎

We now give a more detailed proof. First, we list the three discriminator loss functions for comparison. The discriminator loss function proposed by Nagarajan and Kolter [16] is

\displaystyle L_{COMM}(\theta,\psi)=\mathbb{E}_{\boldsymbol{z}\sim p_{z}}\left[f\left(-D_{\psi}\left(G_{\theta}(\boldsymbol{z})\right)\right)\right]+\mathbb{E}_{\boldsymbol{x}\sim p_{*}}\left[f\left(D_{\psi}(\boldsymbol{x})\right)\right].

(14)

But in the notation of Mescheder et al. [5], the loss function is

\displaystyle L_{COMM}(\theta,\psi)=\mathbb{E}_{\boldsymbol{z}\sim p_{z}}\left[f\left(D_{\psi}\left(G_{\theta}(\boldsymbol{z})\right)\right)\right]+\mathbb{E}_{x\sim p_{*}}\left[f\left(-D_{\psi}(\boldsymbol{x})\right)\right].

(15)

We regard both equations as correct and adopt Eq. (14) in our proof. For simplicity, we omit the symbol $(\theta,\psi)$ . Furthermore, we use the symbols $D$ and $G$ to denote the discriminator and generator, respectively. The discriminator loss function of the CFG method is

\displaystyle L_{CFG}=\mathbb{E}_{\boldsymbol{x}\sim p_{*}}\mathrm{ln}\left(1+\exp(-D(\boldsymbol{x}))\right)+\mathbb{E}_{\boldsymbol{z}\sim p_{z}}\mathrm{ln}\left(1+\exp(D(G(\boldsymbol{z})))\right).

(16)

The discriminator loss function of common GAN is

\displaystyle L_{GAN}=\mathbb{E}_{x\sim p_{*}}[\ln d(\boldsymbol{x})]+\mathbb{E}_{\boldsymbol{z}\sim p_{z}}\ln[1-d(G(\boldsymbol{z}))].

(17)

Our goal is to prove that the three loss functions are equivalent, that is, each can be transformed into the others under certain conditions.

Proof that Eq. (17) converts to Eq. (16).

We set $d(\boldsymbol{x})=\frac{1}{1+\exp(-D(\boldsymbol{x}))}$ and substitute it into Eq. (17). Then we have

\displaystyle\begin{aligned} L_{GAN}&=\mathbb{E}_{\boldsymbol{x}\sim p_{*}}\ln\left[\frac{1}{1+\exp(-D(\boldsymbol{x}))}\right]+\mathbb{E}_{\boldsymbol{z}\sim p_{z}}\ln\left[1-\frac{1}{1+\exp(-D(G(\boldsymbol{z})))}\right]\\ &=\mathbb{E}_{\boldsymbol{x}\sim p_{*}}\ln\left[1+\exp(-D(\boldsymbol{x}))\right]^{-1}+\mathbb{E}_{\boldsymbol{z}\sim p_{z}}\ln\left[\frac{\exp(-D(G(\boldsymbol{z})))}{1+\exp(-D(G(\boldsymbol{z})))}\right]\\ &=\mathbb{E}_{\boldsymbol{x}\sim p_{*}}\ln\left[1+\exp(-D(\boldsymbol{x}))\right]^{-1}+\mathbb{E}_{\boldsymbol{z}\sim p_{z}}\ln\left[\frac{\frac{1}{\exp(D(G(\boldsymbol{z})))}}{\frac{1+\exp(D(G(\boldsymbol{z})))}{\exp(D(G(\boldsymbol{z})))}}\right]\\ &=\mathbb{E}_{\boldsymbol{x}\sim p_{*}}\ln\left[1+\exp(-D(\boldsymbol{x}))\right]^{-1}+\mathbb{E}_{\boldsymbol{z}\sim p_{z}}\ln\left[\frac{1}{1+\exp(D(G(\boldsymbol{z})))}\right]\\ &=\mathbb{E}_{\boldsymbol{x}\sim p_{*}}\ln\left[1+\exp(-D(\boldsymbol{x}))\right]^{-1}+\mathbb{E}_{\boldsymbol{z}\sim p_{z}}\ln\left[1+\exp(D(G(\boldsymbol{z})))\right]^{-1}\\ &=-\bigg{[}\mathbb{E}_{\boldsymbol{x}\sim p_{*}}\ln\left(1+\exp(-D(\boldsymbol{x}))\right)+\mathbb{E}_{\boldsymbol{z}\sim p_{z}}\ln\left(1+\exp(D(G(\boldsymbol{z})))\right)\bigg{]}.\end{aligned}

The expression in the brackets is the loss function of the CFG method, so after the substitution $L_{GAN}=-L_{CFG}$ . Maximizing $L_{GAN}$ therefore has the same optimization objective as minimizing $L_{CFG}$ , and the two loss functions are equivalent. Conversely, Eq. (16) is transformed into Eq. (17) by setting $\frac{1}{1+\exp(-D(\boldsymbol{x}))}=d(\boldsymbol{x})$ .

Proof that Eq. (14) converts to Eq. (16).

We choose the function $f$ to be $f(t)=-\ln\left(1+\exp(-t)\right)$ . This matches $\ln d(\boldsymbol{x})=\ln\left[\frac{1}{1+\exp(-D(\boldsymbol{x}))}\right]=-\ln\left[1+\exp(-D(\boldsymbol{x}))\right]$ . Substituting $f$ into Eq. (14) leads to the loss function of the CFG method up to sign, so both equations are equivalent. This proves Lemma B.1.
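As a numerical sanity check of the two conversions in this proof, the following minimal Python sketch evaluates the integrands of Eq. (17), Eq. (14), and Eq. (16) at a few discriminator outputs. It replaces the expectations by pointwise evaluation, and the logit values and helper names (`gan_integrand`, `comm_integrand`, `cfg_integrand`) are illustrative assumptions rather than part of the CFG implementation.

```python
import math


def sigmoid(t):
    """d = 1 / (1 + exp(-D)), the substitution used in the proof."""
    return 1.0 / (1.0 + math.exp(-t))


def f(t):
    """f(t) = -ln(1 + exp(-t)), the choice made for Eq. (14)."""
    return -math.log(1.0 + math.exp(-t))


def gan_integrand(d_real, d_fake):
    # Eq. (17) pointwise: ln d(x) + ln(1 - d(G(z))) with d = sigmoid(D)
    return math.log(sigmoid(d_real)) + math.log(1.0 - sigmoid(d_fake))


def comm_integrand(d_real, d_fake):
    # Eq. (14) pointwise: f(-D(G(z))) + f(D(x))
    return f(-d_fake) + f(d_real)


def cfg_integrand(d_real, d_fake):
    # Eq. (16) pointwise: ln(1 + exp(-D(x))) + ln(1 + exp(D(G(z))))
    return math.log(1.0 + math.exp(-d_real)) + math.log(1.0 + math.exp(d_fake))


# Hypothetical pairs (D(x), D(G(z))) chosen only for illustration.
samples = [(-2.0, 1.5), (0.0, 0.0), (0.7, -0.3), (3.0, 2.0)]

# Both objectives should equal the negated CFG loss, so each sum should vanish.
gan_gap = max(abs(gan_integrand(a, b) + cfg_integrand(a, b)) for a, b in samples)
comm_gap = max(abs(comm_integrand(a, b) + cfg_integrand(a, b)) for a, b in samples)
print(gan_gap, comm_gap)
```

Both gaps should be zero up to floating-point error, which reflects that the two objectives coincide with the negated CFG loss under these substitutions.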

Lemma B.2.

The gradient vector fields of the generators of CFG and common GAN learned by minimax games exhibit the same form in the dynamic theory. The form of the gradient vector field is:

\left(\begin{array}[]{cc}\left[-\eta\delta(G(\boldsymbol{z}))\left[\nabla_{\boldsymbol{x}}D(G(\boldsymbol{z}))\right]\frac{\partial G(\boldsymbol{z})}{\partial\boldsymbol{\theta}}\right]&\\ &\\ 0&\end{array}\right)

We want to prove that $\widetilde{v}_{G}(\boldsymbol{\theta})=\overline{v}_{G}(\boldsymbol{\theta})$ . Here $\widetilde{v}_{G}$ denotes the gradient vector field of the common GAN generator, while $\overline{v}_{G}$ represents the generator gradient vector field of the CFG method, and $\boldsymbol{\theta}$ (rather than $\theta$ ) denotes the weight vector of the generator. If the two gradient vector fields are equal, Lemma B.2 is proved.

Proof.

We expand $\widetilde{v}_{G}(\boldsymbol{\theta})$ , the gradient vector field of the generator for the common GAN:

\displaystyle\begin{aligned} \widetilde{v}_{G}(\boldsymbol{\theta})=\left(\begin{array}[]{cc}\nabla_{\boldsymbol{\theta}}\ln(1-d(G(\boldsymbol{z})))&\\ &\\ 0&\end{array}\right).\end{aligned}

(18)

The value of $d$ is $d(\boldsymbol{x})=\frac{1}{1+\exp(-D(\boldsymbol{x}))}$ . We substitute $d(\boldsymbol{x})$ into $\widetilde{v}_{G}(\boldsymbol{\theta})$ , then

\displaystyle\begin{aligned} \widetilde{v}_{G}(\boldsymbol{\theta})=\left(\begin{array}[]{cc}\left[\nabla_{\boldsymbol{\theta}}\ln\frac{\exp(-D(G(\boldsymbol{z})))}{1+\exp(-D(G(\boldsymbol{z})))}\right]&\\ &\\ 0&\end{array}\right).\\ \end{aligned}

(19)

We take the term $\nabla_{\boldsymbol{\theta}}\ln\frac{\exp(-D(G(\boldsymbol{z})))}{1+\exp(-D(G(\boldsymbol{z})))}$ and differentiate it as follows:

\displaystyle\begin{aligned} &\nabla_{\boldsymbol{\theta}}\ln\frac{\exp(-D(G(\boldsymbol{z})))}{1+\exp(-D(G(\boldsymbol{z})))}\\ &=\nabla_{\boldsymbol{\theta}}\ln\frac{\frac{1}{\exp(D(G(\boldsymbol{z})))}}{\frac{1+\exp(D(G(\boldsymbol{z})))}{\exp(D(G(\boldsymbol{z})))}}\\ &=\nabla_{\boldsymbol{\theta}}\ln\frac{1}{1+\exp(D(G(\boldsymbol{z})))}\\ &=-\frac{\exp(D(G(\boldsymbol{z})))}{1+\exp(D(G(\boldsymbol{z})))}\left[\nabla_{\boldsymbol{x}}D(G(\boldsymbol{z}))\right]\frac{\partial G(\boldsymbol{z})}{\partial\boldsymbol{\theta}}\\ &=-\frac{1}{1+\exp(-D(G(\boldsymbol{z})))}\left[\nabla_{\boldsymbol{x}}D(G(\boldsymbol{z}))\right]\frac{\partial G(\boldsymbol{z})}{\partial\boldsymbol{\theta}}.\\ \end{aligned}

(20)

Substituting this back into $\widetilde{v}_{G}(\boldsymbol{\theta})$ , we get

\displaystyle\begin{aligned} \widetilde{v}_{G}(\boldsymbol{\theta})=\left(\begin{array}[]{cc}\left[-\frac{1}{1+\exp(-D(G(\boldsymbol{z})))}\left[\nabla_{\boldsymbol{x}}D(G(\boldsymbol{z}))\right]\frac{\partial G(\boldsymbol{z})}{\partial\boldsymbol{\theta}}\right]&\\ &\\ 0&\end{array}\right).\\ \end{aligned}

(21)

This is the gradient vector field of common GAN.
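The scalar factor in Eq. (20) can be checked numerically. The sketch below, under the assumption that the discriminator output is treated as a single scalar logit, compares a central finite difference of $\ln(1-d)$ with the closed form $-\frac{1}{1+\exp(-D)}$ ; the chain-rule factors $\nabla_{\boldsymbol{x}}D$ and $\partial G/\partial\boldsymbol{\theta}$ are left out because they are common to both sides. The function names and the test logits are hypothetical.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def log_one_minus_d(t):
    """ln(1 - d) with d = sigmoid(D), viewed as a function of the logit D."""
    return math.log(1.0 - sigmoid(t))


def central_difference(fn, t, h=1e-5):
    """Second-order finite-difference estimate of fn'(t)."""
    return (fn(t + h) - fn(t - h)) / (2.0 * h)


# Illustrative logits only; any moderate values behave the same way.
logits = [-3.0, -0.5, 0.0, 1.2, 4.0]

# Eq. (20) predicts d/dD ln(1 - sigmoid(D)) = -sigmoid(D).
errors = [abs(central_difference(log_one_minus_d, t) + sigmoid(t)) for t in logits]
max_error = max(errors)
print(max_error)
```

The maximum error should be at the level of finite-difference noise, which supports reading $-\frac{1}{1+\exp(-D(G(\boldsymbol{z})))}$ as a pure scaling factor in front of the discriminator gradient.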

Now we examine the gradient vector field of the CFG method. In the CFG method, a hyper-parameter $M$ controls the number of generator update steps:

\displaystyle G_{m+1}(\boldsymbol{z})=G_{m}(\boldsymbol{z})+\eta_{m}g_{m}\left(G_{m}(\boldsymbol{z})\right).

(22)

The variables $M$ and $m$ have the same meaning. We discuss $M$ in two cases to analyze the gradient vector field of the generator for the CFG method. The loss function of the generator for the CFG method is

\displaystyle L(\boldsymbol{\theta})=\frac{1}{2}\left[G_{m+1}(\boldsymbol{z})-G_{1}(\boldsymbol{z})\right]^{2}.

(23)

First, consider $M=1$ . In this case, the loss function can be written as

\displaystyle L(\boldsymbol{\theta})=\frac{1}{2}\left[G_{2}(\boldsymbol{z})-G_{1}(\boldsymbol{z})\right]^{2}.

(24)

Then the gradient vector field can be written as:

\displaystyle\begin{aligned} \overline{v}_{G}(\boldsymbol{\theta})&=\left(\begin{array}[]{cc}\left[\left[G_{2}(\boldsymbol{z})-G_{1}(\boldsymbol{z})\right]\frac{\partial G(\boldsymbol{z})}{\partial\boldsymbol{\theta}}\right]&\\ &\\ 0&\end{array}\right)\\ &=\left(\begin{array}[]{cc}\left[\left[\eta_{1}g_{1}(G(\boldsymbol{z}))\right]\frac{\partial G(\boldsymbol{z})}{\partial\boldsymbol{\theta}}\right]&\\ &\\ 0&\end{array}\right)\\ &=\left(\begin{array}[]{cc}\left[-\eta_{1}\delta(G(\boldsymbol{z}))\left[\nabla_{\boldsymbol{x}}D(G(\boldsymbol{z}))\right]\frac{\partial G(\boldsymbol{z})}{\partial\boldsymbol{\theta}}\right]&\\ &\\ 0&\end{array}\right).\end{aligned}

(25)

With $g(\boldsymbol{x})=\delta(\boldsymbol{x})\cdot\nabla_{\boldsymbol{x}}D(\boldsymbol{x})$ , we obtain the above result. Now we compare the forms of $\widetilde{v}_{G}(\boldsymbol{\theta})$ and $\overline{v}_{G}(\boldsymbol{\theta})$ . If we set $-\frac{1}{1+\exp(-D(G(\boldsymbol{z})))}=-\eta_{1}\delta(G(\boldsymbol{z}))$ , both sides act as scaling factors, and the two expressions take exactly the same form. This proves the case $M=1$ .

Next, consider the second case, $M>1$ and $M\rightarrow\infty$ . The expansion of Eq. (22) can be written as:

\displaystyle\begin{aligned} G_{m+1}(\boldsymbol{z})=&G_{m}(\boldsymbol{z})+\eta_{m}g_{m}\left(G_{m}(\boldsymbol{z})\right)\\ &\vdots\\ G_{3}(\boldsymbol{z})=&G_{2}(\boldsymbol{z})+\eta_{2}g_{2}\left(G_{2}(\boldsymbol{z})\right)\\ G_{2}(\boldsymbol{z})=&G_{1}(\boldsymbol{z})+\eta_{1}g_{1}\left(G_{1}(\boldsymbol{z})\right).\end{aligned}

(26)

Summing both sides of the equations in this list and cancelling the common terms, we express $G_{m+1}(\boldsymbol{z})$ in terms of $G_{1}(\boldsymbol{z})$ as:

\displaystyle G_{m+1}(\boldsymbol{z})=G_{1}(\boldsymbol{z})+\sum_{m=1}^{M}\eta_{m}g_{m}\left(G_{m}(\boldsymbol{z})\right).

(27)
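The telescoping step from Eq. (22) to Eq. (27) can be illustrated with a small Python sketch. It assumes a scalar sample and a single hypothetical update function `toy_g` shared by all steps, whereas the CFG method uses a step-dependent $g_{m}$ built from the discriminator; the step sizes in `etas` are likewise arbitrary illustrative values.

```python
def run_cfg_steps(x1, etas, g):
    """Apply G_{m+1} = G_m + eta_m * g(G_m) and record every increment."""
    xs = [x1]
    increments = []
    for eta in etas:
        step = eta * g(xs[-1])      # eta_m * g_m(G_m(z))
        increments.append(step)
        xs.append(xs[-1] + step)    # G_{m+1}(z)
    return xs, increments


def toy_g(x):
    # Hypothetical stand-in for g_m; not derived from any discriminator.
    return -0.5 * x + 1.0


etas = [0.1, 0.2, 0.05, 0.3]
xs, increments = run_cfg_steps(2.0, etas, toy_g)

# Eq. (27): the final sample equals the initial one plus the sum of increments.
gap = abs(xs[-1] - (xs[0] + sum(increments)))
print(gap)
```

The gap should vanish up to rounding, since each intermediate $G_{m}(\boldsymbol{z})$ appears once on each side of the summed equations and cancels.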

Now we consider the $\overline{v}_{G}(\boldsymbol{\theta})$ . When $M>1$ , the gradient vector field can be written as

\displaystyle\begin{aligned} \overline{v}_{G}(\boldsymbol{\theta})&=\left(\begin{array}[]{cc}\left[\nabla_{\boldsymbol{\theta}}\frac{1}{2}\left[G_{m+1}(\boldsymbol{z})-G_{1}(\boldsymbol{z})\right]^{2}\right]&\\ &\\ 0&\end{array}\right)\\ &=\left(\begin{array}[]{cc}\left[\nabla_{\boldsymbol{\theta}}\frac{1}{2}\left[\sum\limits_{m=1}^{M}\eta_{m}g_{m}\left(G_{m}(\boldsymbol{z})\right)\right]^{2}\right]&\\ &\\ 0&\end{array}\right)\\ &=\left(\begin{array}[]{cc}\left[\sum\limits_{m=1}^{M}\eta_{m}g_{m}\left(G_{m}(\boldsymbol{z})\right)\right]\nabla_{\boldsymbol{\theta}}\left[\sum\limits_{j=1}^{M}G_{j}(\boldsymbol{z})\right]&\\ &\\ 0&\end{array}\right)\\ &=\left(\begin{array}[]{cc}\left[\sum\limits_{m=1}^{M}\eta_{m}g_{m}\left(G_{m}(\boldsymbol{z})\right)\right]\left[\sum\limits_{j=1}^{M}\frac{\partial G_{j}(\boldsymbol{z})}{\partial\boldsymbol{\theta}}\right]&\\ &\\ 0&\end{array}\right)\\ &=\left(\begin{array}[]{cc}\left[\sum\limits_{m=1}^{M}\sum\limits_{j=1}^{M}\eta_{m}g_{m}\left(G_{m}(\boldsymbol{z})\right)\frac{\partial G_{j}(\boldsymbol{z})}{\partial\boldsymbol{\theta}}\right]&\\ &\\ 0&\end{array}\right)\\ &=\left(\begin{array}[]{cc}\left[\sum\limits_{m=1}^{M}\sum\limits_{j=1}^{M}-\eta_{m}\delta(G_{m}(\boldsymbol{z}))\left[\nabla_{\boldsymbol{x}}D(G_{m}(\boldsymbol{z}))\right]\frac{\partial G_{j}(\boldsymbol{z})}{\partial\boldsymbol{\theta}}\right]&\\ &\\ 0&\end{array}\right).\\ \end{aligned}

(28)

So we get the gradient vector field of the CFG method when $M>1$ . Every term in the sum is a gradient vector that has the same form as the gradient vector when $M=1$ , which is equivalent to the common GAN. The sum of these gradient vectors gives the CFG method for $M>1$ , which is therefore also equivalent to the common GAN. So Lemma B.2 is proved. ∎

Appendix C Theoretical analysis of our theory

C.1 Latent N-size with gradient penalty

In this section, we will provide a detailed proof procedure for Definition 3.1 and Proposition 3.5. Since Definition 3.1 is the foundational definition for Proposition 3.5, the detailed proof procedure for Definition 3.1 will be presented in the proof of Proposition 3.5.

Latent N-size. We illustrate the concept of latent N-size, which serves as the foundational definition for the subsequent theory.

Definition 3.1.

r=\hat{\epsilon}\cdot\left(2\inf_{\boldsymbol{z}}\left\{\left(\frac{\left\|G_{\theta_{t}}\left(\boldsymbol{z}_{1}\right)-G_{\theta_{t}}(\boldsymbol{z})\right\|}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}\right\|}+\frac{\left\|G_{\theta_{t+1}}\left(\boldsymbol{z}_{1}\right)-G_{\theta_{t+1}}(\boldsymbol{z})\right\|}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}\right\|}\right)\right\}\right)^{-1},

where the meanings of mode $\mathcal{M}$ , $attracted$ , and $distracted$ are given in Definitions 3.2, 3.3, and 3.4, respectively.

Latent N-size and the diversity. We offer three fundamental definitions of the latent N-size and demonstrate its relationship to the diversity of synthetic samples.

Definition 3.2 (Modes in Image Space).

Definition 3.3 (Modes Attracted).

Definition 3.4 (Modes Distracted).

Let $\boldsymbol{z}_{2}$ be a sample in latent space, we say $\boldsymbol{z}_{2}$ is distracted to a mode $\mathcal{M}_{i}$ by $(\alpha+\hat{\epsilon})/2$ from a gradient step if $\left\|\boldsymbol{y_{m}}-G_{\theta_{t+1}}\left(\boldsymbol{z}_{2}\right)\right\|+(\hat{\epsilon}/2-2\alpha)<\left\|\boldsymbol{y_{k}}-G_{\theta_{t}}\left(\boldsymbol{z}_{2}\right)\right\|$ , where $\boldsymbol{y_{k}}\in\mathcal{M}_{i}$ is an image in a mode $\mathcal{M}_{i}$ , $\boldsymbol{y_{m}}\not\in\mathcal{M}_{i}$ is an image from other modes, $\alpha$ keeps the same meaning as in Definition 3.2, $\theta_{t}$ and $\theta_{t+1}$ are the generator parameters before and after the gradient updates respectively.

Proposition 3.5.

$\mathcal{N}_{r}\left(\boldsymbol{z}_{1}\right)$ can be defined with discriminator gradient penalty as follows:

r=\hat{\epsilon}\cdot\left(2\inf_{\boldsymbol{z}}\left\{\left(\frac{2\left\|G_{\theta_{t}}\left(\boldsymbol{z}_{1}\right)-G_{\theta_{t}}(\boldsymbol{z})\right\|+\eta_{m}\delta(\boldsymbol{x})\sum\limits_{m=1}^{N}\left(\|\nabla_{\boldsymbol{x}}D_{m}(\mathcal{Y}_{2})\|+\|\nabla_{\boldsymbol{x}}D_{m}(\mathcal{Y})\|\right)}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}\right\|}\right)\right\}\right)^{-1}

Proof.

With the gradient penalty of CFG formulation

\displaystyle\begin{aligned} G_{\theta_{t+1}}\left(\boldsymbol{z}\right)&=G_{\theta_{t}}\left(\boldsymbol{z}\right)+\sum\limits_{m=1}^{N}\eta_{m}\delta(\boldsymbol{x})\left(\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))\right)\\ &=G_{\theta_{t}}\left(\boldsymbol{z}\right)+\sum\limits_{m=1}^{N}\eta_{m}\delta(\boldsymbol{x})\left(\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))+R\right),\end{aligned}

(29)

we are now ready to prove the Proposition 3.5.

\displaystyle\begin{aligned} \left\|\boldsymbol{y_{m}}-G_{\theta_{t+1}}\left(\boldsymbol{z}_{2}\right)\right\|&\leq\left\|\boldsymbol{y_{m}}-\boldsymbol{y_{k}}\right\|+\left\|\boldsymbol{y_{k}}-G_{\theta_{t+1}}\left(\boldsymbol{z}_{2}\right)\right\|\\ &\leq\left\|\boldsymbol{y_{m}}-\boldsymbol{y_{k}}\right\|+\left\|\boldsymbol{y_{k}}-G_{\theta_{t+1}}\left(\boldsymbol{z}_{1}\right)\right\|+\left\|G_{\theta_{t+1}}\left(\boldsymbol{z}_{1}\right)-G_{\theta_{t+1}}\left(\boldsymbol{z}_{2}\right)\right\|\\ &<\left\|\boldsymbol{y_{m}}-\boldsymbol{y_{k}}\right\|+\left\|\boldsymbol{y_{k}}-G_{\theta_{t}}\left(\boldsymbol{z}_{1}\right)\right\|+\left\|G_{\theta_{t+1}}\left(\boldsymbol{z}_{1}\right)-G_{\theta_{t+1}}\left(\boldsymbol{z}_{2}\right)\right\|-\hat{\epsilon}\quad\\ &\leq\left\|\boldsymbol{y_{m}}-\boldsymbol{y_{k}}\right\|+\left\|\boldsymbol{y_{k}}-G_{\theta_{t}}\left(\boldsymbol{z}_{2}\right)\right\|+\left\|G_{\theta_{t}}\left(\boldsymbol{z}_{2}\right)-G_{\theta_{t}}\left(\boldsymbol{z}_{1}\right)\right\|\\ &+\left\|G_{\theta_{t+1}}\left(\boldsymbol{z}_{1}\right)-G_{\theta_{t+1}}\left(\boldsymbol{z}_{2}\right)\right\|-\hat{\epsilon}\\ &=\left(\frac{\left\|G_{\theta_{t}}\left(\boldsymbol{z}_{1}\right)-G_{\theta_{t}}\left(\boldsymbol{z}_{2}\right)\right\|}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}_{2}\right\|}+\frac{\left\|G_{\theta_{t+1}}\left(\boldsymbol{z}_{1}\right)-G_{\theta_{t+1}}\left(\boldsymbol{z}_{2}\right)\right\|}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}_{2}\right\|}\right)\left\|\boldsymbol{z}_{1}-\boldsymbol{z}_{2}\right\|\\ &+\left\|\boldsymbol{y_{m}}-\boldsymbol{y_{k}}\right\|+\left\|\boldsymbol{y_{k}}-G_{\theta_{t}}\left(\boldsymbol{z}_{2}\right)\right\|-\hat{\epsilon}.\end{aligned}

(30)

This implies

\left\|\boldsymbol{y_{m}}-G_{\theta_{t+1}}\left(\boldsymbol{z}_{2}\right)\right\|+(\frac{\hat{\epsilon}}{2}-2\alpha)<\left\|\boldsymbol{y_{k}}-G_{\theta_{t}}\left(\boldsymbol{z}_{2}\right)\right\|,

which is the content of Definition 3.4, and

\left(\frac{\left\|G_{\theta_{t}}\left(\boldsymbol{z}_{1}\right)-G_{\theta_{t}}\left(\boldsymbol{z}_{2}\right)\right\|}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}_{2}\right\|}+\frac{\left\|G_{\theta_{t+1}}\left(\boldsymbol{z}_{1}\right)-G_{\theta_{t+1}}\left(\boldsymbol{z}_{2}\right)\right\|}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}_{2}\right\|}\right)\left\|\boldsymbol{z}_{1}-\boldsymbol{z}_{2}\right\|\leq\frac{\hat{\epsilon}}{2}.

We define the latent N-size of $\boldsymbol{z}_{1}$ as

\mathcal{N}_{\tau}\left(\boldsymbol{z}_{1}\right)=\left\{\boldsymbol{z}:\left(\frac{\left\|G_{\theta_{t}}\left(\boldsymbol{z}_{1}\right)-G_{\theta_{t}}(\boldsymbol{z})\right\|}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}\right\|}+\frac{\left\|G_{\theta_{t+1}}\left(\boldsymbol{z}_{1}\right)-G_{\theta_{t+1}}(\boldsymbol{z})\right\|}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}\right\|}\right)\leq\tau\right\}.

Then the maximum latent N-size is

\displaystyle r=\hat{\epsilon}\cdot\left(2\inf_{\boldsymbol{z}}\left\{\left(\frac{\left\|G_{\theta_{t}}\left(\boldsymbol{z}_{1}\right)-G_{\theta_{t}}(\boldsymbol{z})\right\|}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}\right\|}+\frac{\left\|G_{\theta_{t+1}}\left(\boldsymbol{z}_{1}\right)-G_{\theta_{t+1}}(\boldsymbol{z})\right\|}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}\right\|}\right)\right\}\right)^{-1}.

(31)

This completes the proof of Definition 3.1.

Next, we substitute the CFG definition of the generator, Eq. (29), into Eq. (31) and expand the sum in parentheses as follows:

\displaystyle\begin{aligned} &\frac{\left\|G_{\theta_{t}}\left(\boldsymbol{z}_{1}\right)-G_{\theta_{t}}(\boldsymbol{z})\right\|}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}\right\|}+\frac{\left\|G_{\theta_{t+1}}\left(\boldsymbol{z}_{1}\right)-G_{\theta_{t+1}}(\boldsymbol{z})\right\|}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}\right\|}\\ &=\frac{\left\|G_{\theta_{t}}\left(\boldsymbol{z}_{1}\right)-G_{\theta_{t}}(\boldsymbol{z})\right\|}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}\right\|}\\ &+\frac{\left\|G_{\theta_{t}}\left(\boldsymbol{z}_{1}\right)-G_{\theta_{t}}(\boldsymbol{z})+\sum\limits_{m=1}^{N}\eta_{m}\delta(\boldsymbol{x})[(\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}_{1}))+R)-(\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))+R)]\right\|}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}\right\|}\\ &\leq\frac{2\left\|G_{\theta_{t}}\left(\boldsymbol{z}_{1}\right)-G_{\theta_{t}}(\boldsymbol{z})\right\|}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}\right\|}\\ &+\frac{\left\|\sum\limits_{m=1}^{N}\eta_{m}\delta(\boldsymbol{x})[(\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}_{1}))+R)-(\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))+R)]\right\|}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}\right\|}\\ &\leq\frac{2\left\|G_{\theta_{t}}\left(\boldsymbol{z}_{1}\right)-G_{\theta_{t}}(\boldsymbol{z})\right\|}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}\right\|}\\ &+\frac{\left\|\sum\limits_{m=1}^{N}\eta_{m}\delta(\boldsymbol{x})(\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}_{1}))+R)\right\|+\left\|\sum\limits_{m=1}^{N}\eta_{m}\delta(\boldsymbol{x})(\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))+R)\right\|}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}\right\|}.\\ \end{aligned}

(32)

For simplicity, we define $\|\nabla_{\boldsymbol{x}}D_{m}(\mathcal{Y}_{2})\|=$ $\|\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}_{2}))+R\|$ and $\|\nabla_{\boldsymbol{x}}D_{m}(\mathcal{Y})\|=$ $\|\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))+R\|$, and move the factor $\eta_{m}\delta(\boldsymbol{x})\sum\limits_{m=1}^{N}$ outside the norm. This completes the proof of Proposition 3.5. ∎
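To make the role of Eq. (31) concrete, the following is a minimal numerical sketch, not the authors' implementation. It assumes hypothetical linear toy generators $G(\boldsymbol{z})=a\boldsymbol{z}$ and approximates the infimum over $\boldsymbol{z}$ by a minimum over randomly sampled latent vectors. The names `latent_radius` and `linear_generator` are illustrative only.

```python
import math
import random


def l2_norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(c * c for c in v))


def latent_radius(eps_hat, gen_t, gen_t1, z1, candidates):
    """Approximate Eq. (31): r = eps_hat / (2 * inf_z [q_t(z) + q_{t+1}(z)]),
    where q(z) = ||G(z1) - G(z)|| / ||z1 - z||.
    The infimum is replaced by a minimum over a finite candidate set
    (an assumption of this sketch); candidates must differ from z1."""
    def quotient(gen, z):
        num = l2_norm([a - b for a, b in zip(gen(z1), gen(z))])
        den = l2_norm([a - b for a, b in zip(z1, z)])
        return num / den

    sums = [quotient(gen_t, z) + quotient(gen_t1, z) for z in candidates]
    return eps_hat / (2.0 * min(sums))


def linear_generator(a):
    """Hypothetical toy generator G(z) = a * z."""
    return lambda z: [a * c for c in z]


random.seed(0)
z1 = [0.0, 0.0]
candidates = [[random.uniform(0.1, 1.0), random.uniform(0.1, 1.0)] for _ in range(100)]

# The updated generator G_{t+1} changes faster in the second case,
# so the difference quotients are larger and the radius is smaller.
r_small_slope = latent_radius(1.0, linear_generator(1.0), linear_generator(1.5), z1, candidates)
r_large_slope = latent_radius(1.0, linear_generator(1.0), linear_generator(3.0), z1, candidates)
print(r_small_slope, r_large_slope)  # approximately 0.2 and 0.125
```

In this toy setting each difference quotient equals the slope of the generator, so a larger change in $G_{\theta_{t+1}}$ directly shrinks $r$. This is the mechanism the proposition exploits through the discriminator gradient term.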

C.2 $\boldsymbol{\varepsilon}$ -centered GP

From Eq. (3) and Eq. (4) of the CFG method, we can derive the following conclusion: $\nabla_{\boldsymbol{x}}D(\boldsymbol{x})\leq 0$. This is an important corollary of the CFG method. With it, we can prove in the following section that the latent N-size corresponding to our $\boldsymbol{\varepsilon}$ -centered GP is the smallest among the three gradient penalties.

Proof.

Eq. (3) from CFG implies that, for $\frac{dL\left(p_{m}\right)}{dm}$ to be negative so that the distance $L$ decreases, we should choose $g_{m}(\boldsymbol{x})$ to be

g_{m}(\boldsymbol{x})=-s_{m}(\boldsymbol{x})\phi_{0}\left(\nabla_{\boldsymbol{x}}\ell_{2}^{\prime}\left(p_{*}(\boldsymbol{x}),p_{m}(\boldsymbol{x})\right)\right),

where $s_{m}(\boldsymbol{x})>0$ is an arbitrary scaling factor. $\phi_{0}(u)$ is a vector function such that $\phi(u)=u\cdot\phi_{0}(u)\geq 0$ and that $\phi(u)=0$ if and only if $u=0$ , e.g., $\left(\phi_{0}(u)=u,\phi(u)=\|u\|_{2}^{2}\right)$ or $\left(\phi_{0}(u)=\operatorname{sign}(u),\phi(u)=\|u\|_{1}\right)$ . With this choice of $g_{m}(\boldsymbol{x})$ , we obtain

\frac{dL\left(p_{m}\right)}{dm}=-\int s_{m}(\boldsymbol{x})p_{m}(\boldsymbol{x})\phi\left(\nabla_{\boldsymbol{x}}\ell_{2}^{\prime}\left(p_{*}(\boldsymbol{x}),p_{m}(\boldsymbol{x})\right)\right)d\boldsymbol{x}\leq 0,

that is, the distance $L$ is guaranteed to decrease unless equality holds. Hence, $g_{m}(\boldsymbol{x})\leq 0$. We then examine Eq. (4), from which we can derive

\displaystyle\begin{aligned} g_{m}(\boldsymbol{x})&=-s_{m}(\boldsymbol{x})\phi_{0}\left(\nabla_{x}\ell_{2}^{\prime}\left(p_{*}(\boldsymbol{x}),p_{m}(\boldsymbol{x})\right)\right)\\ &=-s_{m}(\boldsymbol{x})\nabla_{x}\ell_{2}^{\prime}\left(p_{*}(\boldsymbol{x}),p_{m}(\boldsymbol{x})\right)\\ &=-s_{m}(\boldsymbol{x})f^{\prime\prime}\left(r_{m}(\boldsymbol{x})\right)\nabla r_{m}(\boldsymbol{x})\\ &\approx s_{m}(\boldsymbol{x})f^{\prime\prime}\left(\tilde{r}_{m}(\boldsymbol{x})\right)\tilde{r}_{m}(\boldsymbol{x})\nabla_{x}D(\boldsymbol{x}),\end{aligned}

where $s_{m}(x)>0$ is an arbitrary scaling factor, $\ell_{2}\left(\rho_{*},\rho\right)=\rho_{*}f\left(\rho/\rho_{*}\right)$, $\nabla_{x}\ell_{2}^{\prime}\left(p_{*}(x),p_{m}(x)\right)=f^{\prime\prime}\left(r_{m}(x)\right)\nabla r_{m}(x)$, $f^{\prime\prime}=1/x^{2}$ when $f$ is the KL-divergence, and $r_{m}(\boldsymbol{x})=\exp(-D(\boldsymbol{x}))\approx p_{m}(x)/p_{*}(x)=\tilde{r}_{m}(\boldsymbol{x})$ when $D(x)\approx\ln\frac{p_{*}(x)}{p_{m}(x)}$, which is the analytic solution of the CFG discriminator. From this formulation, $s(\boldsymbol{x})$ is a scaling function that is always greater than 0, $\tilde{r}(\boldsymbol{x})$ is an exponential function, and $f_{kl}^{\prime\prime}(\tilde{r}(\boldsymbol{x}))$ is a non-negative function of the exponential function $\tilde{r}(\boldsymbol{x})$. The coefficient multiplying $\nabla_{\boldsymbol{x}}D(\boldsymbol{x})$ is therefore positive, so if $g_{m}(\boldsymbol{x})\leq 0$ holds, then $\nabla_{\boldsymbol{x}}D(\boldsymbol{x})\leq 0$ also holds. ∎
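As a worked instance of the last step, the coefficient of $\nabla_{\boldsymbol{x}}D(\boldsymbol{x})$ can be written out explicitly. This uses the KL choice $f^{\prime\prime}(x)=1/x^{2}$ stated above and treats the approximation $\tilde{r}_{m}(\boldsymbol{x})\approx\exp(-D(\boldsymbol{x}))$ as exact, which is an assumption made here only for illustration:

```latex
\begin{aligned}
s_{m}(\boldsymbol{x})\,f_{kl}^{\prime\prime}\left(\tilde{r}_{m}(\boldsymbol{x})\right)\tilde{r}_{m}(\boldsymbol{x})
&=s_{m}(\boldsymbol{x})\,\frac{1}{\tilde{r}_{m}(\boldsymbol{x})^{2}}\,\tilde{r}_{m}(\boldsymbol{x})
=\frac{s_{m}(\boldsymbol{x})}{\tilde{r}_{m}(\boldsymbol{x})}
=s_{m}(\boldsymbol{x})\exp\left(D(\boldsymbol{x})\right)>0,\\
g_{m}(\boldsymbol{x})&\approx s_{m}(\boldsymbol{x})\exp\left(D(\boldsymbol{x})\right)\nabla_{\boldsymbol{x}}D(\boldsymbol{x}).
\end{aligned}
```

Because the coefficient is strictly positive, each component of $g_{m}(\boldsymbol{x})$ has the same sign as the corresponding component of $\nabla_{\boldsymbol{x}}D(\boldsymbol{x})$ within this approximation.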

We define our method as the $\boldsymbol{\varepsilon}$ -centered Gradient Penalty. We use the notation $\boldsymbol{\varepsilon}$ in the name and equation of our penalty to distinguish it from the hyper-parameter $\varepsilon^{\prime}$. The $\boldsymbol{\varepsilon}$ -centered GP is

\displaystyle R(\theta,\psi)=\frac{\gamma}{2}\mathrm{E}_{\hat{\boldsymbol{x}}}\left(\left\|\nabla_{\boldsymbol{x}}D_{\psi}(\hat{\boldsymbol{x}})-\boldsymbol{\varepsilon}\right\|\right)^{2}.

(33)

Combined with the corollary $\nabla_{\boldsymbol{x}}D(\boldsymbol{x})\leq 0$, our $\boldsymbol{\varepsilon}$ -centered GP increases $\|\nabla_{\boldsymbol{x}}D(\boldsymbol{x})\|$ and thereby achieves a smaller latent N-size than the other two gradient penalties.

When the GAN is trained with the loss functions in Eq. (8) and Eq. (9), our approach controls the gradient of the discriminator and increases the diversity of the synthesized samples.
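The penalty in Eq. (33) can be sketched numerically next to the $0$-centered and $1$-centered penalties it is compared against. This is a minimal sketch under stated assumptions, not the authors' training code. The per-sample gradients $\nabla_{\boldsymbol{x}}D_{\psi}(\hat{\boldsymbol{x}})$ are supplied as plain flattened lists, whereas in practice they would come from automatic differentiation. The function names are hypothetical, and the expectation is replaced by a batch mean.

```python
import math


def l2_norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(c * c for c in v))


def zero_centered_gp(grads, gamma):
    """(gamma / 2) * mean over the batch of ||g||^2."""
    return 0.5 * gamma * sum(l2_norm(g) ** 2 for g in grads) / len(grads)


def one_centered_gp(grads, gamma, g0=1.0):
    """(gamma / 2) * mean over the batch of (||g|| - g0)^2."""
    return 0.5 * gamma * sum((l2_norm(g) - g0) ** 2 for g in grads) / len(grads)


def eps_centered_gp(grads, gamma, eps_vec):
    """Eq. (33): (gamma / 2) * mean over the batch of ||g - eps||^2."""
    total = 0.0
    for g in grads:
        total += l2_norm([gi - ei for gi, ei in zip(g, eps_vec)]) ** 2
    return 0.5 * gamma * total / len(grads)


# Hypothetical batch of two flattened discriminator gradients.
grads = [[-0.3, -0.4], [0.0, -1.0]]
# Constant vector with L2 norm 0.3 (illustrative value only).
eps_vec = [0.3 / math.sqrt(2)] * 2
gamma = 1.0

print(zero_centered_gp(grads, gamma))
print(one_centered_gp(grads, gamma))
print(eps_centered_gp(grads, gamma, eps_vec))
```

The only difference between the three functions is the point around which the gradient is centered, which is why a single hyper-parameter is enough to move between them.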

C.3 Latent N-size with different gradient penalty

In this section, we derive our final theorem about the latent N-size with different gradient penalties.

First, we establish the relationship among the latent N-sizes for the three gradient penalties in the following lemma, by substituting the different types of gradient penalty into Proposition 3.5.

Lemma 3.6.

The norms of the three gradient penalties, which determine the latent N-size, are defined as follows: $\|R_{1}\|$ = $\|\left(\left\|\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))\right\|-g_{0}\right)\|$, $\|R_{0}\|=\|\left(\left\|\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))\right\|\right)\|$, and $\|R_{\boldsymbol{\varepsilon}}\|$ = $\|\left(\left\|\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))\right\|+\|\boldsymbol{\varepsilon}\|\right)\|$, respectively. The ordering of the norms of the three gradient penalties is $\|R_{1}\|<\|R_{0}\|<\|R_{\boldsymbol{\varepsilon}}\|$. Consequently, the latent N-sizes of the three gradient penalties satisfy $r_{R_{1}}>r_{R_{0}}>r_{R_{\boldsymbol{\varepsilon}}}$.

Proof.

Let us start with the result of Eq. (32). We insert the three distinct gradient penalties into it and determine the latent N-size for each of them. We then focus on the primary component of the three resulting expressions and compare their values.

The equation of the $1$ -centered gradient penalty is

\displaystyle\begin{aligned} R_{g_{0}}&=\frac{\gamma}{2}\mathrm{E}_{\hat{\boldsymbol{x}}}\left(\left\|\nabla_{\boldsymbol{x}}D_{m}(\hat{\boldsymbol{x}})\right\|-g_{0}\right)^{2}.\end{aligned}

(34)

The equation of $0$ -centered gradient penalty is

\displaystyle\begin{aligned} R_{0}&=\frac{\gamma}{2}\mathrm{E}_{\hat{\boldsymbol{x}}}\left\|\nabla_{\hat{\boldsymbol{x}}}D_{\psi}(\hat{\boldsymbol{x}})\right\|^{2}.\end{aligned}

(35)

The equation of $\varepsilon$ -centered gradient penalty is

\displaystyle\begin{aligned} R_{\varepsilon}&=\frac{\gamma}{2}\mathrm{E}_{\hat{\boldsymbol{x}}}\left\|\nabla_{\boldsymbol{x}}D_{\psi}(\hat{\boldsymbol{x}})-\boldsymbol{\varepsilon}\right\|^{2}.\end{aligned}

(36)

We substitute them back into Eq. (29) and expand the gradient penalty expression, omitting the expectation symbol and focusing on the gradient. Plugging the results into Eq. (32) then gives, for each gradient penalty, the sum in the parentheses of the latent N-size. For simplicity, we focus on this sum and omit the other symbols. We call it the determined part of the latent N-size.

The determined part of the latent N-size of $1$ -centered gradient penalty is

\displaystyle\begin{aligned} &\frac{\left\|G_{\theta_{t}}\left(\boldsymbol{z}_{1}\right)-G_{\theta_{t}}(\boldsymbol{z})\right\|}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}\right\|}+\frac{\left\|G_{\theta_{t+1}}\left(\boldsymbol{z}_{1}\right)-G_{\theta_{t+1}}(\boldsymbol{z})\right\|}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}\right\|}\\
&\leq\frac{2\left\|G_{\theta_{t}}\left(\boldsymbol{z}_{1}\right)-G_{\theta_{t}}(\boldsymbol{z})\right\|}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}\right\|}\\
&+\frac{\left\|\sum\limits_{m=1}^{N}\eta_{m}\delta(\boldsymbol{x})(\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}_{1}))+R)\right\|+\left\|\sum\limits_{m=1}^{N}\eta_{m}\delta(\boldsymbol{x})(\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))+R)\right\|}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}\right\|}\\
&=\frac{2\left\|G_{\theta_{t}}\left(\boldsymbol{z}_{1}\right)-G_{\theta_{t}}(\boldsymbol{z})\right\|}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}\right\|}\\
&+\frac{\left\|\sum\limits_{m=1}^{N}\eta_{m}\delta(\boldsymbol{x})\left(\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}_{1}))+\left(\left\|\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}_{1}))\right\|-g_{0}\right)^{2}\right)\right\|}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}\right\|}\\
&+\frac{\left\|\sum\limits_{m=1}^{N}\eta_{m}\delta(\boldsymbol{x})\left(\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))+\left(\left\|\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))\right\|-g_{0}\right)^{2}\right)\right\|}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}\right\|}.\\ \end{aligned}

(37)

The determined part of the latent N-size of $0$ -centered gradient penalty is

\displaystyle\begin{aligned} &\frac{\left\|G_{\theta_{t}}\left(\boldsymbol{z}_{1}\right)-G_{\theta_{t}}(\boldsymbol{z})\right\|}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}\right\|}+\frac{\left\|G_{\theta_{t+1}}\left(\boldsymbol{z}_{1}\right)-G_{\theta_{t+1}}(\boldsymbol{z})\right\|}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}\right\|}\\
&\leq\frac{2\left\|G_{\theta_{t}}\left(\boldsymbol{z}_{1}\right)-G_{\theta_{t}}(\boldsymbol{z})\right\|}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}\right\|}\\
&+\frac{\left\|\sum\limits_{m=1}^{N}\eta_{m}\delta(\boldsymbol{x})(\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}_{1}))+R)\right\|+\left\|\sum\limits_{m=1}^{N}\eta_{m}\delta(\boldsymbol{x})(\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))+R)\right\|}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}\right\|}\\
&=\frac{2\left\|G_{\theta_{t}}\left(\boldsymbol{z}_{1}\right)-G_{\theta_{t}}(\boldsymbol{z})\right\|}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}\right\|}\\
&+\frac{\left\|\sum\limits_{m=1}^{N}\eta_{m}\delta(\boldsymbol{x})\left(\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}_{1}))+\left(\left\|\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}_{1}))\right\|\right)^{2}\right)\right\|}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}\right\|}\\
&+\frac{\left\|\sum\limits_{m=1}^{N}\eta_{m}\delta(\boldsymbol{x})\left(\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))+\left(\left\|\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))\right\|\right)^{2}\right)\right\|}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}\right\|}.\\ \end{aligned}

(38)

With the prior knowledge from CFG that $\nabla_{x}D_{m}(x)\leq 0$, the determined part of the latent N-size of the $\boldsymbol{\varepsilon}$ -centered gradient penalty is

\displaystyle\begin{aligned} &\frac{\left\|G_{\theta_{t}}\left(\boldsymbol{z}_{1}\right)-G_{\theta_{t}}(\boldsymbol{z})\right\|}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}\right\|}+\frac{\left\|G_{\theta_{t+1}}\left(\boldsymbol{z}_{1}\right)-G_{\theta_{t+1}}(\boldsymbol{z})\right\|}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}\right\|}\\
&\leq\frac{2\left\|G_{\theta_{t}}\left(\boldsymbol{z}_{1}\right)-G_{\theta_{t}}(\boldsymbol{z})\right\|}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}\right\|}\\
&+\frac{\left\|\sum\limits_{m=1}^{N}\eta_{m}\delta(\boldsymbol{x})(\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}_{1}))+R)\right\|+\left\|\sum\limits_{m=1}^{N}\eta_{m}\delta(\boldsymbol{x})(\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))+R)\right\|}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}\right\|}\\
&=\frac{2\left\|G_{\theta_{t}}\left(\boldsymbol{z}_{1}\right)-G_{\theta_{t}}(\boldsymbol{z})\right\|}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}\right\|}\\
&+\frac{\left\|\sum\limits_{m=1}^{N}\eta_{m}\delta(\boldsymbol{x})\left(\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}_{1}))+\left(\left\|\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}_{1}))-\boldsymbol{\varepsilon}\right\|\right)^{2}\right)\right\|}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}\right\|}\\
&+\frac{\left\|\sum\limits_{m=1}^{N}\eta_{m}\delta(\boldsymbol{x})\left(\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))+\left(\left\|\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))-\boldsymbol{\varepsilon}\right\|\right)^{2}\right)\right\|}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}\right\|}\\
&\leq\frac{2\left\|G_{\theta_{t}}\left(\boldsymbol{z}_{1}\right)-G_{\theta_{t}}(\boldsymbol{z})\right\|}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}\right\|}\\
&+\frac{\left\|\sum\limits_{m=1}^{N}\eta_{m}\delta(\boldsymbol{x})\left(\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}_{1}))+\left(\left\|\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}_{1}))\right\|+\|\boldsymbol{\varepsilon}\|\right)^{2}\right)\right\|}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}\right\|}\\
&+\frac{\left\|\sum\limits_{m=1}^{N}\eta_{m}\delta(\boldsymbol{x})\left(\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))+\left(\left\|\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))\right\|+\|\boldsymbol{\varepsilon}\|\right)^{2}\right)\right\|}{\left\|\boldsymbol{z}_{1}-\boldsymbol{z}\right\|}.\\ \end{aligned}

(39)

For the squared terms in Eq. (37), Eq. (38), and Eq. (39), we have

\displaystyle\begin{aligned} &\|\left(\left\|\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))\right\|+\|\boldsymbol{\varepsilon}\|\right)\|>\|\left(\left\|\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))\right\|\right)\|>\|\left(\left\|\nabla_{\boldsymbol{x}}D_{m}(G_{\theta_{t}}(\boldsymbol{z}))\right\|-g_{0}\right)\|.\end{aligned}

(40)

Based on this conclusion, we can observe that the relationship among the latent N-size with different gradient penalties is as follows: $r_{R_{1}}>r_{R_{0}}>r_{R_{\boldsymbol{\varepsilon}}}$ . ∎
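The ordering in Eq. (40) can be checked numerically on a small example. The sketch below uses a hypothetical gradient that is non-positive in every component, in line with the corollary $\nabla_{\boldsymbol{x}}D(\boldsymbol{x})\leq 0$, and a constant positive $\boldsymbol{\varepsilon}$. The first inequality, $\left|\|\nabla_{\boldsymbol{x}}D\|-g_{0}\right|<\|\nabla_{\boldsymbol{x}}D\|$, holds only when $0<g_{0}<2\|\nabla_{\boldsymbol{x}}D\|$. This condition is an arithmetic observation assumed by the sketch, not a statement taken from the paper.

```python
import math


def l2_norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(c * c for c in v))


def penalty_norms(grad, eps_vec, g0):
    """Return the three quantities compared in Eq. (40):
    | ||g|| - g0 |,  ||g||,  ||g|| + ||eps||."""
    a = l2_norm(grad)
    return abs(a - g0), a, a + l2_norm(eps_vec)


grad = [-0.6, -0.8]                    # hypothetical gradient, all components <= 0
eps_vec = [0.3 / math.sqrt(2)] * 2     # constant positive vector with norm 0.3
g0 = 1.0                               # center of the 1-centered penalty

r1, r0, r_eps = penalty_norms(grad, eps_vec, g0)
print(r1 < r0 < r_eps)

# Because grad <= 0 and eps > 0 componentwise, |g_i - eps_i| = |g_i| + eps_i,
# so subtracting eps enlarges the norm; the triangle inequality used in
# Eq. (39) bounds the result from above by ||g|| + ||eps||.
shifted = l2_norm([g - e for g, e in zip(grad, eps_vec)])
print(r0 <= shifted <= r_eps)
```

The second check illustrates why the sign corollary matters: without it, subtracting $\boldsymbol{\varepsilon}$ could shrink the gradient norm rather than enlarge it.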

Theorem 3.7.

Suppose $\boldsymbol{z}_{1}$ is attracted to the mode $\mathcal{M}_{i}$ by $\hat{\epsilon}$, then there exists a neighborhood $\mathcal{N}_{r}\left(\boldsymbol{z}_{1}\right)$ of $\boldsymbol{z}_{1}$ such that $\boldsymbol{z}_{2}$ is distracted to $\mathcal{M}_{i}$ by $(\hat{\epsilon}/2-2\alpha)$, for all $\boldsymbol{z}_{2}\in\mathcal{N}_{r}\left(\boldsymbol{z}_{1}\right)$. The size of $\mathcal{N}_{r}\left(\boldsymbol{z}_{1}\right)$ can be arbitrarily large but is bounded by an open ball of radius $r$, where $r$ is controlled by the gradient penalty terms of the discriminator. The radii for the three gradient penalties satisfy $r_{R_{1}}>r_{R_{0}}>r_{R_{\boldsymbol{\varepsilon}}}$.

Proof.

By combining Definition 3.1, Proposition 3.5, Lemma 3.6, and our $\boldsymbol{\varepsilon}$ -centered gradient penalty, we can derive this conclusion. ∎

This theorem encompasses three key implications. Firstly, when the latent vector is attracted to a mode within the image space, the corresponding latent N-size should not be overly large, which can be attributed to two distinct reasons. One, vectors within the neighborhood are more likely to be attracted to the same mode. Two, vectors within this neighborhood face challenges in being attracted to other modes within the image space, with the level of difficulty determined by an upper bound expressed as $(\hat{\epsilon}/2-2\alpha)$ . This bound is constructed based on the distances between different modes and within the same mode within the image space.

Secondly, the discriminator gradient penalty in the CFG formulation can regulate the latent N-size. This introduces a trade-off between diversity and training stability. For instance, the $0$ -centered gradient penalty ensures stable and convergent training near the Nash equilibrium, but it leads to a minimal discriminator norm, resulting in a larger latent N-size and reduced diversity.

Thirdly, by augmenting the norm of the discriminator, our $\boldsymbol{\varepsilon}$ -centered gradient penalty achieves the smallest latent N-size, consequently leading to the highest level of diversity.

Appendix D Experiments

In this section, we describe additional experiments and provide more visual results. In addition to the datasets mentioned in the paper, we also trained on the CelebA and LSUN Bedroom datasets at resolutions of $128\times 128$ and $256\times 256$. We provide visual inspections and results for these additional datasets.

D.1 Hyper-parameter $\varepsilon^{\prime}$

In practice, $\boldsymbol{\varepsilon}$ is a constant vector, determined by the value $\varepsilon^{\prime}$, with the same dimension as $\nabla_{\boldsymbol{x}}D(\boldsymbol{x})$. The shape of $\nabla_{\boldsymbol{x}}D(\boldsymbol{x})$ is $[C,H,W]$, so its flattened dimension is $H*W*C$. Our image data has a height of $N$ pixels, a width of $N$ pixels, and $C$ channels, so $H*W*C$ can be written as $CN^{2}$. Here $\|\cdot\|$ denotes the $\ell_{2}$ norm. The value of $\|\boldsymbol{\varepsilon}\|$ can therefore be written as

\displaystyle\begin{aligned} \|\boldsymbol{\varepsilon}\|=\sqrt{\sum_{i=1}^{CN^{2}}\varepsilon_{i}^{2}}=\sqrt{CN^{2}\varepsilon^{2}}=\varepsilon^{\prime},\end{aligned}

(41)

where $\varepsilon_{i}^{2}=\varepsilon^{2}$ and $\varepsilon^{2}=\frac{\varepsilon^{\prime 2}}{CN^{2}}$. The hyper-parameter $\varepsilon^{\prime}$ controls the tightness of the bound on $\left\|\nabla_{\boldsymbol{x}}D(\boldsymbol{x})\right\|$. We set $\varepsilon^{\prime}=0.1,0.3,1,5$ in the ablation study to show its effect. The settings $\varepsilon^{\prime}=0.1,0.3,1$ give a small enhancement of the discriminator norm and thus a smaller latent N-size. With $\varepsilon^{\prime}=5$, the range over which $\|\nabla_{\boldsymbol{x}}D(\boldsymbol{x})\|$ varies is too large, which leads to a too-small latent N-size and hence to a non-convergent network: the training loss does not converge and the synthetic samples degenerate into noise. We present the ablation results in Table 7 and Table 8.
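Eq. (41) can be read as a recipe for building $\boldsymbol{\varepsilon}$ from $\varepsilon^{\prime}$: every component is set to $\varepsilon^{\prime}/\sqrt{CN^{2}}$ so that the vector has norm $\varepsilon^{\prime}$. The sketch below illustrates this for the four ablation values. The helper name `make_epsilon` and the choice of $C=3$, $N=64$ are assumptions made for illustration and are not taken from the paper.

```python
import math


def make_epsilon(eps_prime, channels, size):
    """Flattened constant vector of length C * N^2 whose L2 norm is eps_prime,
    following Eq. (41): each component equals eps_prime / sqrt(C * N^2)."""
    dim = channels * size * size
    value = eps_prime / math.sqrt(dim)
    return [value] * dim


# Ablation values of eps_prime from the text; C and N are hypothetical.
for eps_prime in (0.1, 0.3, 1.0, 5.0):
    eps = make_epsilon(eps_prime, channels=3, size=64)
    norm = math.sqrt(sum(v * v for v in eps))
    print(eps_prime, eps[0], norm)
```

Note that the per-component value shrinks as the resolution grows, while the norm, which is the quantity that enters the latent N-size bound, stays fixed at $\varepsilon^{\prime}$.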

Table 7: Different $\varepsilon^{\prime}$ settings of our Li-CFG trained on MNIST. We use FID and IS scores to compare generation quality. The other two penalties do not have the parameter $\varepsilon^{\prime}$, so all of their cells contain the same value. Untrained means that the loss function does not converge.

		FID			IS
MNIST	$\varepsilon^{\prime}=0.1$	$\varepsilon^{\prime}=0.3$	$\varepsilon^{\prime}=1$	$\varepsilon^{\prime}=5$	$\varepsilon^{\prime}=0.1$	$\varepsilon^{\prime}=0.3$	$\varepsilon^{\prime}=1$	$\varepsilon^{\prime}=5$
ours( $\boldsymbol{\varepsilon}$ -centered)	2.99	2.88	2.85	untrained	2.28	2.32	2.29	untrained
$0$ -centered	3.54	3.54	3.54	3.54	2.31	2.31	2.31	2.31
$1$ -centered	3.64	3.64	3.64	3.64	2.3	2.3	2.3	2.3

Table 8: Different $\varepsilon^{\prime}$ settings of our Li-CFG trained on LSUN Bedroom. We use FID and IS scores to compare generation quality. The other two penalties do not have the parameter $\varepsilon^{\prime}$, so all of their cells contain the same value. Untrained means that the loss function does not converge.

		FID			IS
LSUN Bedroom	$\varepsilon^{\prime}=0.1$	$\varepsilon^{\prime}=0.3$	$\varepsilon^{\prime}=1$	$\varepsilon^{\prime}=5$	$\varepsilon^{\prime}=0.1$	$\varepsilon^{\prime}=0.3$	$\varepsilon^{\prime}=1$	$\varepsilon^{\prime}=5$
ours( $\boldsymbol{\varepsilon}$ -centered)	9.94	8.78	9.73	untrained	2.97	2.94	2.97	untrained
$0$ -centered	10.54	10.54	10.54	10.54	3.067	3.067	3.067	3.067
$1$ -centered	11.5	11.5	11.5	11.5	3.154	3.154	3.154	3.154

Table 9: Different $\gamma$ settings, with the same $\varepsilon^{\prime}=0.3$, of our Li-CFG trained on LSUN T. We use FID and IS scores to compare generation quality. The other two penalties do not have the parameter $\varepsilon$, so all the cells are filled with the same value. Untrained means that the loss function does not converge.

		FID		IS
LSUN T ( $\delta(\boldsymbol{x})$ =1)	$\gamma^{\prime}=0.1$	$\gamma^{\prime}=1$	$\gamma^{\prime}=10$	$\gamma^{\prime}=0.1$	$\gamma^{\prime}=1$	$\gamma^{\prime}=10$
ours( $\boldsymbol{\varepsilon}$ -centered)	11.41	19.47	21.59	4.6	4.5	4.42
$0$ -centered	12.33	19.94	21.71	4.57	4.6	4.53
$1$ -centered	12.81	22.17	22.71	4.47	4.46	4.5
CFG method	13.54	13.54	13.54	4.38	4.38	4.38
LSUN T ( $\delta(\boldsymbol{x})$ =5)
ours( $\boldsymbol{\varepsilon}$ -centered)	19.18	21.58	21.69	4.48	4.49	4.34
$0$ -centered	20.29	24.77	23.02	4.47	4.43	4.43
$1$ -centered	24.09	29.39	untrained	4.37	4.41	untrained
CFG method	22.3	22.3	22.3	4.34	4.34	4.34
LSUN T ( $\delta(\boldsymbol{x})$ =10)
ours( $\boldsymbol{\varepsilon}$ -centered)	21.3	23	20.7	4.43	4.42	4.53
$0$ -centered	23.68	19.74	20.7	4.39	4.46	4.38
$1$ -centered	23.49	untrained	untrained	4.3	untrained	untrained
CFG method	25.53	25.53	25.53	4.2	4.2	4.2

If the experimental setting does not satisfy these conditions, the synthetic samples turn into noise or collapse. We show visual inspections of different $\varepsilon^{\prime}$ settings for MNIST and LSUN Bedroom in Fig. 10 and Fig. 11.

Table 10: Different $\gamma$ settings of our Li-CFG trained on LSUN T+B. We use FID and IS scores to compare generation quality. The other two penalties do not have the parameter $\varepsilon$, so all the cells are filled with the same value.

		FID		IS
LSUN T+B	$\gamma=0.1$	$\gamma=1$	$\gamma=10$	$\gamma=0.1$	$\gamma=1$	$\gamma=10$
ours( $\boldsymbol{\varepsilon}$ -centered)	15.72	16.85	16.67	5.07	5.06	5.08
$0$ -centered	16.01	17.4	19.79	5.05	5.08	5.05
$1$ -centered	16.32	18.73	34.28	5.04	5.18	4.71

D.2 Effect of $\gamma$ with different penalty

In most cases, a gradient penalty centered within a small interval around 0 tends to yield better results. However, on some datasets, a gradient penalty centered around 1 might perform better. Our method provides a controllable parameter that allows the center of the gradient penalty to vary within the range specified by the parameter, and we show experimentally that it yields better results. For settings of $\delta(\boldsymbol{x})$ where the CFG method performs well, using $\gamma=0.1$ consistently improves the results. Conversely, for settings of $\delta(\boldsymbol{x})$ where the CFG method trains poorly or inadequately, using $\gamma=10$ often leads to better overall performance. We compare the results of different $\gamma$ values for CFG and Li-CFG in Table 9 and Table 10.

D.3 Effect of Gradient Penalty with different $\delta(\boldsymbol{x})$

The results of the CFG method can exhibit significant variations in response to minor changes in parameter values. This phenomenon is illustrated in Fig. 12, where we observe that different choices of the $\delta(\boldsymbol{x})$ parameter lead to diverse outcomes. However, by introducing the gradient penalty, the CFG method consistently demonstrates increased stability across various $\delta(\boldsymbol{x})$ values. Notably, in cases where the CFG method struggles to converge, the addition of the gradient penalty leads to improved results.

D.4 Visual inspection of synthesized images

In this section, we present visual inspection figures for the CFG method, Li-CFG, and other GAN models. Figs. 27, 28, 29, 32, 30, 31, and 33 are each split into two columns: the real images are shown in the left column, and the generated images, obtained with different GP regularizations in Li-CFG, are shown in the right column. The settings for the generated images in the CFG method and Li-CFG are almost the same. We also display generated images for CelebA, LSUN Bedroom, and ImageNet at resolutions of 128 and 256. We present the results on synthetic datasets in Figs. 13, 14, 15, 16, 17, 18, 19, and 20. Moreover, we showcase the results of BigGAN and DDGAN on real-world datasets in Figs. 21, 22, and 23.

Synthetic Datasets. In the results on synthetic datasets, we observed that unconstrained methods such as the original GAN, LSGAN, WGAN, and HingeGAN struggle to converge to all modes of the ring or grid datasets. However, when supplemented with a gradient penalty, these methods show an enhanced ability to converge to a mixture of Gaussians. Among the three types of gradient penalties tested, the $1$ -centered gradient penalty exhibited inferior convergence compared to the $0$ -centered gradient penalty and our proposed $\boldsymbol{\varepsilon}$ -centered gradient penalty. Notably, our $\boldsymbol{\varepsilon}$ -centered gradient penalty drove more sample points to converge to the Gaussian points than the $0$ -centered gradient penalty.

ImageNet. We present the experimental results of Li-CFG on the ImageNet dataset here. We compare the FID results of the Li-CFG and CFG methods, which use the same neural architecture with different numbers of parameters. The neural architecture simply uses DCGAN and res-blocks. Fig. 24 shows the FID scores of the CFG method, the CFG method with twice the number of parameters, and Li-CFG with twice the number of parameters. Fig. 25 presents visualizations for these three models. Fig. 26 shows our Li-CFG results on the ImageNet dataset.