
On some theoretical limitations of Generative Adversarial Networks

Oriol, Benoît
[email protected]
École Polytechnique
Palaiseau, 91120, France
   Miot, Alexandre
[email protected]
Société Générale
La Défense, 92800, France
Abstract

Generative Adversarial Networks have become a core technique in Machine Learning to generate unknown distributions from data samples. They have been used in a wide range of contexts without paying much attention to the possible theoretical limitations of those models. Indeed, because of the universal approximation properties of Neural Networks, it is a general assumption that GANs can generate any probability distribution. Recently, people began to question this assumption and this article is in line with this thinking. We provide a new result based on Extreme Value Theory showing that GANs cannot generate heavy tailed distributions. The full proof of this result is given.

1 Introduction

The universal approximation property of neural networks (see [8] and [4]) might make us assume that GANs can simulate any distribution from a Gaussian prior. However, neural networks are by design almost everywhere differentiable functions with bounded derivatives, in order to limit exploding gradient phenomena (see [10]). By the Rademacher (see [7] for a proof) and mean value theorems, this is nearly equivalent to saying that neural network functions are Lipschitz continuous. This fact essentially sets the limits on which probability distributions GANs can express given a Gaussian prior. There are numerous definitions of the concept of "fat", "long" or "heavy" tailed distributions. They are usually not equivalent but all convey a sense of having a larger probability of being "big" compared to a Gaussian or Exponential distribution. Here we focus on two possible ways to define the concept. One, similarly to [12], focuses on finite samples and relies on classical concentration inequalities. The other is asymptotic and uses Extreme Value Theory to prove a new theorem in the continuity of the theoretical work of Huster et al. in [9] and the experimental approach of [6].

Notation. In the following we use the notation below:

  • $f$ is a Lipschitz function $\mathbb{R}^{n}\mapsto\mathbb{R}$ and $||f||_{l}=\sup_{x,y\in\mathbb{R}^{n},x\neq y}\frac{||f(x)-f(y)||}{||x-y||}$ its semi-norm,

  • $\gamma_{n}$ the Gaussian measure on $\mathbb{R}^{n}$,

  • $d$ the Euclidean distance in $\mathbb{R}^{n}$, i.e. $d(x,y)=||y-x||$,

  • for any set $S\subset\mathbb{R}^{n}$ and $\epsilon>0$, the $\epsilon$-neighbourhood $S_{\epsilon}=\{x\in\mathbb{R}^{n}\text{ such that }d(x,S)<\epsilon\}$,

  • $\bar{A}$ the complement of a subset $A\subset\mathbb{R}^{n}$,

  • $M$ a median of a mapping $f:\mathbb{R}^{n}\mapsto\mathbb{R}$ for the $\gamma_{n}$ measure, i.e. $\gamma_{n}(\{f\geq M\})\geq\frac{1}{2}$ and $\gamma_{n}(\{f\leq M\})\geq\frac{1}{2}$,

  • $X$ a standard Gaussian random variable in $\mathbb{R}^{n}$,

  • $\bar{\Psi}(x)=\frac{1}{\sqrt{2\pi}}\int_{x}^{+\infty}e^{-u^{2}/2}\,du$ the Gaussian tail function.

2 Limitations through sub-gaussianity

In this section we prove that given a Lipschitz function $f$ and a Gaussian prior $X$, $f(X)$ is sub-gaussian: a GAN with a Gaussian prior can only generate sub-gaussian distributions.

Definition 1.

A real valued random variable $Y$ is said to be sub-gaussian if it satisfies one of the two following equivalent properties:

\begin{array}{l}\exists K\in\mathbb{R},\ \forall t\geq 0,\ P(|Y|\geq t)\leq 2e^{-\frac{t^{2}}{K^{2}}}\\[4pt]\exists K^{\prime}\in\mathbb{R},\ \forall p\geq 1,\ ||Y||_{L^{p}}\leq K^{\prime}\sqrt{p}\text{ .}\end{array}

For a proof see [15].
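
As a quick numerical illustration (our addition, not part of the original paper), the following Python sketch contrasts the moment characterisation on simulated data: the $L^{p}$ norms of a Gaussian sample grow like $K^{\prime}\sqrt{p}$, while those of a Pareto sample do not. The sample size and the Pareto exponent are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
gaussian = rng.standard_normal(n)
pareto = rng.pareto(3.0, n) + 1.0  # heavy tailed: moments finite only for p < 3

def lp_norm(sample, p):
    """Empirical L^p norm E[|Y|^p]^(1/p)."""
    return np.mean(np.abs(sample) ** p) ** (1.0 / p)

for p in [1, 2, 4, 8, 16]:
    print(f"p={p:2d}"
          f"  ||gaussian||_p / sqrt(p) = {lp_norm(gaussian, p) / np.sqrt(p):.3f}"
          f"  ||pareto||_p / sqrt(p) = {lp_norm(pareto, p) / np.sqrt(p):.3f}")
# The Gaussian ratio stays bounded (sub-gaussian behaviour); the Pareto ratio
# blows up once p passes the Pareto exponent.
```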

If $G=f(X)$, where $f$ is Lipschitz, then by Lipschitz continuity and $\mathbb{E}(X)=0$

\forall x\geq 0\quad\mathbb{P}(|G-f(0)|\geq x)\leq\mathbb{P}\left(|X|\geq\frac{x}{||f||_{l}}\right)\text{ .}

In particular, if $X$ is one dimensional then, using a standard upper bound of the Gaussian tail function, $G$ is sub-gaussian as the sum of two independent sub-gaussian random variables, considering $f(0)$ as a constant random variable.

If $X$ is $n$-dimensional then

\mathbb{P}\left(|X|\geq\frac{x}{||f||_{l}}\right)=\mathbb{P}\left(|X|^{2}\geq\frac{x^{2}}{||f||_{l}^{2}}\right)\text{ ,}

and, $|X|^{2}$ following a $\chi^{2}$ distribution with $n$ degrees of freedom, the sub-gaussianity of $G$ would seem to depend on the dimension of $X$. Yet, this is not the case, as stated by the following remarkable result:

Theorem 1 (Gaussian concentration theorem [14], [13] and [3]).

Let $X$ be a standard Gaussian random variable on $\mathbb{R}^{n}$ and $f$ a Lipschitz function, then $f(X)-\mathbb{E}(f(X))$ is sub-gaussian. More precisely,

\forall\epsilon>0,\quad\mathbb{P}(|f(X)-\mathbb{E}(f(X))|\geq\epsilon)\leq 2e^{-\epsilon^{2}/(2||f||_{l}^{2})}\text{ .}

So in particular, a GAN with a Gaussian prior will not be able to generate realistic samples of a "fat tailed" distribution, even if it is trained on one. This is not the first time that concentration of measure gives strong theoretical limits to machine learning methods, see [5] or [11] for a more recent paper. Limitations of GANs have also been explored from a different perspective in [16].
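
To make this concrete, here is a small hedged experiment (our own sketch, not from the paper): Gaussian noise is pushed through a randomly initialised two-layer ReLU network, which is Lipschitz by construction, and the upper quantiles of its output are compared with those of a Student-t sample. The architecture, the degrees of freedom and the sample sizes are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, h = 500_000, 16, 64

# A random two-layer ReLU network: a Lipschitz map from R^d to R.
W1 = rng.standard_normal((d, h)) / np.sqrt(d)
W2 = rng.standard_normal((h, 1)) / np.sqrt(h)

def generator(z):
    return (np.maximum(z @ W1, 0.0) @ W2).ravel()

gan_like = generator(rng.standard_normal((n, d)))  # Gaussian prior through a Lipschitz map
student_t = rng.standard_t(df=2.5, size=n)         # a genuinely heavy-tailed sample

for q in [0.99, 0.999, 0.9999]:
    print(f"quantile {q}:"
          f"  Lipschitz(Gaussian) = {np.quantile(np.abs(gan_like), q):8.2f}"
          f"  Student-t = {np.quantile(np.abs(student_t), q):8.2f}")
# The network output keeps Gaussian-like quantiles while the Student-t
# quantiles grow much faster: the Lipschitz map cannot create the heavy tail.
```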

This theorem is also true when the mean $\mathbb{E}(f(X))$ is replaced by a median of $f(X)$. The original proof of the theorem can be found in [14]. That proof is quite technical; a more accessible one can be found in [2]. To get a sense of the concentration of measure phenomenon, we provide here a simple proof of the median version.

The Gaussian concentration theorem is a consequence of:

Theorem 2 (Gaussian Isoperimetric Theorem [3]).

Let $A$ be a Borel set in $\mathbb{R}^{n}$ and $H=\{x\in\mathbb{R}^{n}\text{ such that }x_{1}<a\}$ with $a\in\mathbb{R}$ such that $\gamma_{n}(A)=\gamma_{n}(H)$, then

\forall\epsilon\geq 0\quad\gamma_{n}(A_{\epsilon})\geq\gamma_{n}(H_{\epsilon})\text{ .}

It is easily seen that $\gamma_{n}(H_{r})=\Psi(a+r)$, where $\Psi$ is the cumulative distribution function of the one dimensional standard Gaussian distribution.
It is not obvious at first sight what the link is between this theorem and the Gaussian concentration theorem for Lipschitz functions. The link is made by defining the following 'isoperimetric function' for $a\in[0,1]$ and $\epsilon>0$

\begin{aligned}\eta_{a}(\epsilon)&=\sup_{A\text{ Borel set}\subset\mathbb{R}^{n}}\{\gamma_{n}(\bar{A_{\epsilon}})\;|\;\gamma_{n}(A)\geq a\}\\&=1-\inf_{A\text{ Borel set}\subset\mathbb{R}^{n}}\{\gamma_{n}(A_{\epsilon})\;|\;\gamma_{n}(A)\geq a\}\text{ .}\end{aligned}
Lemma 1.

Let $f:\mathbb{R}^{n}\mapsto\mathbb{R}$ be a Lipschitz function and $M$ a median of $f$ for the Gaussian measure, then

\forall\epsilon>0\quad\gamma_{n}(|f-M|>\epsilon)\leq 2\,\eta_{\frac{1}{2}}\left(\frac{\epsilon}{||f||_{l}}\right)\text{ .}
Proof.

Let $A=\{f\leq M\}$, $\epsilon>0$ and $x\in A_{\epsilon}$, then

\exists y\in A\text{ such that }d(x,A)\leq d(x,y)<\epsilon

so, $f$ being Lipschitz and $y\in A$,

|f(x)-f(y)|\leq||f||_{l}\;\epsilon\text{ .}

So $f(x)\leq M+||f||_{l}\;\epsilon$, i.e. $\{f>M+||f||_{l}\;\epsilon\}\subset\bar{A_{\epsilon}}$. Since $M$ is a median, $\gamma_{n}(A)\geq\frac{1}{2}$, so by definition of $\eta_{\frac{1}{2}}$ we have $\gamma_{n}(\bar{A_{\epsilon}})\leq\eta_{\frac{1}{2}}(\epsilon)$. Changing $\epsilon\rightarrow\frac{\epsilon}{||f||_{l}}$, we have proved that for any Lipschitz function $f$ of median $M$

\forall\epsilon>0\quad\gamma_{n}(\{f>M+\epsilon\})\leq\eta_{\frac{1}{2}}\left(\frac{\epsilon}{||f||_{l}}\right)\text{ .}

Noticing that $f$ is Lipschitz if and only if $-f$ is, that $||f||_{l}=||-f||_{l}$, and that if $M$ is a median of $f$ then $-M$ is a median of $-f$, applying what we just proved to $-f$ gives the same bound for $\gamma_{n}(\{f<M-\epsilon\})$; summing the two one-sided bounds yields the result. ∎

We can now prove the Gaussian concentration theorem.

Proof.

Let $A$ be a Borel set such that $\gamma_{n}(A)\geq\frac{1}{2}$. There exists a half-space $H=\mathbb{R}^{n-1}\times\left]-\infty,a\right[$ such that $\gamma_{n}(A)=\gamma_{n}(H)=\Psi(a)\geq\frac{1}{2}$. From the Gaussian Isoperimetric Theorem,

\forall\epsilon\geq 0,\quad\gamma_{n}(A_{\epsilon})\geq\gamma_{n}(H_{\epsilon})=\Psi(a+\epsilon)\geq\Psi(\epsilon)\text{ ,}

since $\Psi$ is non decreasing and $a\geq 0$ (as $\Psi(a)\geq\frac{1}{2}$). Taking the infimum on the left hand side over all Borel sets $A$ with $\gamma_{n}(A)\geq\frac{1}{2}$, and noticing that the infimum is attained for a half-space $H$ with $\gamma_{n}(H)=\frac{1}{2}$, i.e. $a=0$, we get

\forall\epsilon>0,\quad\inf_{A\text{ Borel set}\subset\mathbb{R}^{n}}\left\{\gamma_{n}(A_{\epsilon})\;|\;\gamma_{n}(A)\geq\tfrac{1}{2}\right\}=\Psi(\epsilon)\text{ .}

That is to say,

\eta_{\frac{1}{2}}=1-\Psi=\bar{\Psi}\text{ .}

Combined with Lemma 1, this yields $\gamma_{n}(|f-M|>\epsilon)\leq 2\bar{\Psi}\left(\frac{\epsilon}{||f||_{l}}\right)$ for any Lipschitz function $f$ with median $M$, which is the median form of the Gaussian concentration theorem. ∎
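
The median form can be checked by Monte Carlo. Below is a minimal sketch (our own illustration) with the 1-Lipschitz map $f(x)=||x||$: the empirical exceedance probability stays below the bound $2\bar{\Psi}(\epsilon)$ obtained above. The dimension and sample size are arbitrary.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n_dim, n_samples = 10, 1_000_000

x = rng.standard_normal((n_samples, n_dim))
f = np.linalg.norm(x, axis=1)   # f(x) = ||x|| is 1-Lipschitz
median = np.median(f)

for eps in [0.5, 1.0, 1.5, 2.0]:
    empirical = np.mean(np.abs(f - median) > eps)
    bound = 2 * norm.sf(eps)    # 2 * Psi_bar(eps / L) with L = 1
    print(f"eps={eps}:  P(|f - M| > eps) = {empirical:.5f}  <=  {bound:.5f}")
```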

3 Limitations through Extreme Value Theory

In this section, we prove the main result of this paper: given a Lipschitz function $f$ and a Gaussian prior $X$, if $f(X)$ is in the domain of attraction of an extreme value distribution of parameter $\xi$, then $\xi\leq 0$. In particular, $f(X)$ cannot be "heavy tailed". In fact, we prove the theorem for a wider range of prior distributions.
In the following, we use the notation of Extreme Value Theory introduced in [9].

Theorem 3.

Let $d\in\mathbb{N}^{*}$ and $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ an almost everywhere $\mathcal{C}^{1}$, Lipschitz function with semi-norm $L=||f||_{l}$. Let $G_{k}^{d}$ be an $\mathbb{R}^{d}$-valued random variable with probability density function $g_{G_{k}^{d}}(x)\propto\lVert x\rVert_{2}^{k}e^{-\frac{\lVert x\rVert_{2}^{2}}{2}}$, where $k\in\mathbb{N}$. If $f(G_{k}^{d})$ is in the domain of attraction of the extreme value distribution of parameter $\xi\in\mathbb{R}$, i.e. $f(G_{k}^{d})\in\mathcal{D}(\mathcal{H}_{\xi})$, then $\xi\leq 0$.
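
As an illustration of the statement (an assumption-laden sketch, not an experiment from the paper), one can estimate the shape parameter with a peaks-over-threshold fit of a generalized Pareto distribution to threshold exceedances: for the output of a Lipschitz network fed with Gaussian noise the fitted shape is close to or below 0, up to estimation noise, whereas a genuinely heavy-tailed sample yields a clearly positive shape. The network, threshold and sample size are arbitrary choices.

```python
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(3)
n, d, h = 500_000, 16, 64

# Random ReLU network: a Lipschitz map fed with a Gaussian prior.
W1 = rng.standard_normal((d, h)) / np.sqrt(d)
W2 = rng.standard_normal((h, 1)) / np.sqrt(h)
lipschitz_output = (np.maximum(rng.standard_normal((n, d)) @ W1, 0.0) @ W2).ravel()
heavy_tailed = rng.pareto(2.0, n) + 1.0   # Pareto(2) sample, heavy tailed

def pot_shape(sample, quantile=0.99):
    """Peaks-over-threshold estimate of the extreme value shape parameter."""
    u = np.quantile(sample, quantile)
    excesses = sample[sample > u] - u
    shape, _, _ = genpareto.fit(excesses, floc=0.0)
    return shape

print("fitted shape, Lipschitz(Gaussian prior):", round(pot_shape(lipschitz_output), 3))
print("fitted shape, Pareto sample            :", round(pot_shape(heavy_tailed), 3))
```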

Proof.

Case $d=1$. We prove by contradiction that $\xi\leq 0$. Suppose that $\xi>0$. By theorem 8.a of [1], for all $\gamma\in\left]0,\xi\right[$, $\mathbb{E}[f(G_{k}^{1})^{\gamma}]$ is finite and

c_{\gamma}=\lim_{t\rightarrow\infty}\mathbb{E}\left[\left(\frac{f(G_{k}^{1})}{t}\right)^{\gamma}\,|\,f(G_{k}^{1})>t\right]=\left(1-\frac{\gamma}{\xi}\right)^{-1}\text{ .}
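
A brief numerical aside (our addition): the conditional moment appearing in this limit can be estimated by Monte Carlo. For an exact Pareto sample with exponent $\alpha$ it stabilises at $\alpha/(\alpha-\gamma)>1$, while for a 1-Lipschitz image of Gaussian noise it drifts to 1 as $t$ grows, which is precisely the contradiction exploited below. The values of $\gamma$, $\alpha$ and $t$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
n, gamma, alpha = 2_000_000, 0.5, 3.0

pareto = rng.pareto(alpha, n) + 1.0         # heavy-tailed sample
lipschitz = np.abs(rng.standard_normal(n))  # 1-Lipschitz image of a Gaussian

def conditional_moment(sample, t, gamma):
    """Monte Carlo estimate of E[(Y/t)^gamma | Y > t]."""
    tail = sample[sample > t]
    return np.mean((tail / t) ** gamma)

for t in [1.5, 2.5, 3.5]:
    print(f"t={t}:  Pareto -> {conditional_moment(pareto, t, gamma):.3f}"
          f"  (exact value alpha/(alpha-gamma) = {alpha / (alpha - gamma):.3f}),"
          f"  |Gaussian| -> {conditional_moment(lipschitz, t, gamma):.3f}")
```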

Let γ]0,ξ[\gamma\in]0,\xi[. We are only interested in the behaviour of the previous integral when tt goes to ++\infty so we can suppose that t>f(0)+1t>f(0)+1 and t>|k1|+γL+|f(0)|t>\sqrt{|k-1|+\gamma}L+|f(0)|.
$f^{-1}(]t,\infty[)$ is an open set of $\mathbb{R}$ by continuity. Open intervals are a countable base of $\mathbb{R}$, so we can write $f^{-1}(]t,\infty[)=\bigcup_{i\in\mathcal{I}}]a_{i},b_{i}[$ where $\mathcal{I}$ is finite or countable, and we can take the intervals pairwise disjoint. So we can suppose that:

  • $f^{-1}(]t,\infty[)=\bigcup_{i\in\mathcal{I}}]a_{i},b_{i}[$ where $\mathcal{I}$ is finite or countable,

  • $0\notin]a_{i},b_{i}[$, $a_{i}\neq 0$, $b_{i}\neq 0$ and $f$ is strictly positive on each interval,

  • $-\infty\leq a_{0}<b_{0}\leq a_{1}<b_{1}\leq\ldots<b_{m^{*}}\leq+\infty$, where $m^{*}$ is equal to $m$ or $\infty$ according to the cardinality of $\mathcal{I}$,

  • $\forall i\in\mathcal{I}$, $f(a_{i})=t$ if $a_{i}>-\infty$, by continuity of $f$,

  • $\forall i\in\mathcal{I}$, $f(b_{i})=t$ if $b_{i}<\infty$, by continuity of $f$,

  • $\forall x\in]a_{i},b_{i}[$, $|x|>\frac{t-f(0)}{L}$, as $f$ is $L$-Lipschitz and $]a_{i},b_{i}[\subset f^{-1}(]t,+\infty[)$. In particular, noting $t^{*}=\min(t,t-f(0))$, $|x|>\frac{t^{*}}{L}$.

Also, since we are only interested in $t\to+\infty$, we can suppose $t^{*2}>|k-1|L^{2}$. If $\mathcal{I}=\emptyset$, the case is trivial: $\mathbb{E}\left[\left(\frac{f(G_{k}^{1})}{t}\right)^{\gamma}\,|\,f(G_{k}^{1})>t\right]$ is not defined, which is a contradiction.
Otherwise, the conditional expectation is well defined and finite and we have:

\mathbb{E}\left[\left(\frac{f(G_{k}^{1})}{t}\right)^{\gamma}\,|\,f(G_{k}^{1})>t\right]=\frac{\sum_{i\in\mathcal{I}}\int_{a_{i}}^{b_{i}}\left(\frac{f(x)}{t}\right)^{\gamma}|x|^{k}e^{-\frac{x^{2}}{2}}dx}{\sum_{i\in\mathcal{I}}\int_{a_{i}}^{b_{i}}|x|^{k}e^{-\frac{x^{2}}{2}}dx}\text{ .} (1)

Let $i\in\mathcal{I}$. For the numerator, integrating by parts:

\int_{a_{i}}^{b_{i}}\left(\frac{f(x)}{t}\right)^{\gamma}|x|^{k}e^{-\frac{x^{2}}{2}}dx=\left[-\frac{|x|^{k}}{x}\left(\frac{f(x)}{t}\right)^{\gamma}e^{-\frac{x^{2}}{2}}\right]_{a_{i}}^{b_{i}}+\int_{a_{i}}^{b_{i}}\underbrace{\left(\frac{k-1}{x^{2}}+\gamma\frac{f^{\prime}(x)}{xf(x)}\right)}_{M}\left(\frac{f(x)}{t}\right)^{\gamma}|x|^{k}e^{-\frac{x^{2}}{2}}dx\text{ .}

The bracketed term is equal to $\left[-\frac{|x|^{k}}{x}e^{-\frac{x^{2}}{2}}\right]_{a_{i}}^{b_{i}}$: as we have seen, at the finite interval bounds $f$ is equal to $t$, so $\left(\frac{f(x)}{t}\right)^{\gamma}=1$, and at infinite bounds $f=O_{\pm\infty}(x)$ so the exponential makes the term vanish. We can bound the $M$ term since $|x|\geq\frac{t^{*}}{L}$, $f(x)>t>0$ and $|f^{\prime}|\leq L$,

|M|\leq|k-1|\frac{L^{2}}{t^{*2}}+\gamma\frac{L^{2}}{t^{*2}}\text{ .}

We deduce the following inequalities:

\int_{a_{i}}^{b_{i}}\left(\frac{f(x)}{t}\right)^{\gamma}|x|^{k}e^{-\frac{x^{2}}{2}}dx\leq\left[-|x|^{k-1}e^{-\frac{x^{2}}{2}}\right]_{a_{i}}^{b_{i}}+\frac{(|k-1|+\gamma)L^{2}}{t^{*2}}\int_{a_{i}}^{b_{i}}\left(\frac{f(x)}{t}\right)^{\gamma}|x|^{k}e^{-\frac{x^{2}}{2}}dx\text{ ,} (2)

\int_{a_{i}}^{b_{i}}\left(\frac{f(x)}{t}\right)^{\gamma}|x|^{k}e^{-\frac{x^{2}}{2}}dx\geq\left[-|x|^{k-1}e^{-\frac{x^{2}}{2}}\right]_{a_{i}}^{b_{i}}-\frac{(|k-1|+\gamma)L^{2}}{t^{*2}}\int_{a_{i}}^{b_{i}}\left(\frac{f(x)}{t}\right)^{\gamma}|x|^{k}e^{-\frac{x^{2}}{2}}dx\text{ .} (3)

The denominator still has a dependence on $f$ through the domain of integration, so $|x|>\frac{t^{*}}{L}$ is still valid and similarly:

\int_{a_{i}}^{b_{i}}|x|^{k}e^{-\frac{x^{2}}{2}}dx\leq\left[-|x|^{k-1}e^{-\frac{x^{2}}{2}}\right]_{a_{i}}^{b_{i}}+\frac{|k-1|L^{2}}{t^{*2}}\int_{a_{i}}^{b_{i}}|x|^{k}e^{-\frac{x^{2}}{2}}dx\text{ ,} (4)

\int_{a_{i}}^{b_{i}}|x|^{k}e^{-\frac{x^{2}}{2}}dx\geq\left[-|x|^{k-1}e^{-\frac{x^{2}}{2}}\right]_{a_{i}}^{b_{i}}-\frac{|k-1|L^{2}}{t^{*2}}\int_{a_{i}}^{b_{i}}|x|^{k}e^{-\frac{x^{2}}{2}}dx\text{ .} (5)

Combining equations (3) and (4) in (1), we have:

\mathbb{E}\left[\left(\frac{f(G_{k}^{1})}{t}\right)^{\gamma}\,|\,f(G_{k}^{1})>t\right]\geq\frac{1-\frac{|k-1|L^{2}}{t^{*2}}}{1+\frac{(|k-1|+\gamma)L^{2}}{t^{*2}}}\text{ .}

And combining (2) and (5) in (1), as we chose $t>\sqrt{|k-1|+\gamma}\,L+|f(0)|$, we have:

\mathbb{E}\left[\left(\frac{f(G_{k}^{1})}{t}\right)^{\gamma}\,|\,f(G_{k}^{1})>t\right]\leq\frac{1+\frac{|k-1|L^{2}}{t^{*2}}}{1-\frac{(|k-1|+\gamma)L^{2}}{t^{*2}}}\text{ .}

So $c_{\gamma}=\lim_{t\rightarrow\infty}\mathbb{E}\left[\left(\frac{f(G_{k}^{1})}{t}\right)^{\gamma}\,|\,f(G_{k}^{1})>t\right]=1$.
But assuming $f(G_{k}^{1})\in\mathcal{D}(\mathcal{H}_{\xi})$ with $\xi>0$ entails $c_{\gamma}=(1-\frac{\gamma}{\xi})^{-1}$, hence $\gamma=0$, contradicting $\gamma>0$. We conclude that $\xi\leq 0$.

Case $d\in\mathbb{N}^{*}$. We prove that $\xi\leq 0$ by contradiction. If $\xi>0$ then, using theorem 8.a of [1], for all $0<\gamma<\xi$, $\mathbb{E}[f(G_{k}^{d})^{\gamma}]$ is finite and

c_{\gamma}=\lim_{t\rightarrow\infty}\mathbb{E}\left[\left(\frac{f(G_{k}^{d})}{t}\right)^{\gamma}\,|\,f(G_{k}^{d})>t\right]=\left(1-\frac{\gamma}{\xi}\right)^{-1}\text{ .}

Let γ]0,ξ[\gamma\in]0,\xi[ and t+t\in\mathbb{R}_{+}^{*} such that t>f(0d)t>f(0_{d}) and t>L|k+d2|+γL+|f(0)|t>L\sqrt{|k+d-2|+\gamma}L+|f(0)|. Using the hyperspherical coordinates, we introduce the operator H:([0,π]d2×[0,2π])H:\mathcal{L}([0,\pi]^{d-2}\times[0,2\pi])\mapsto\mathbb{R}:

H(\varphi)=\alpha\int_{0}^{\pi}\ldots\int_{0}^{\pi}\int_{0}^{2\pi}\prod_{i=1}^{d-2}\sin(\theta_{i})^{d-i-1}\,\varphi(\theta)\,d\theta\text{ ,}

with $\alpha$ the normalising constant of the $G_{k}^{d}$ distribution and $d\theta=d\theta_{1}\ldots d\theta_{d-1}$.
Then, we have:

\mathbb{E}\left[\left(\frac{f(G_{k}^{d})}{t}\right)^{\gamma}1_{f(G_{k}^{d})>t}\right]=H\left(\theta\mapsto\underbrace{\int_{0}^{+\infty}1_{f_{\theta}(r)>t}\left(\frac{f_{\theta}(r)}{t}\right)^{\gamma}r^{k+d-1}e^{-\frac{r^{2}}{2}}dr}_{\propto\;\mathbb{E}\left[\left(\frac{f_{\theta}(G_{k+d-1}^{1})}{t}\right)^{\gamma}1_{f_{\theta}(G_{k+d-1}^{1})>t}\right]}\right)

with $x=r(x_{1},\ldots,x_{d})$ and

\begin{aligned}x_{1}&=\cos(\theta_{1})\\ x_{2}&=\sin(\theta_{1})\cos(\theta_{2})\\ &\;\;\vdots\\ x_{d-1}&=\sin(\theta_{1})\sin(\theta_{2})\ldots\cos(\theta_{d-1})\\ x_{d}&=\sin(\theta_{1})\sin(\theta_{2})\ldots\sin(\theta_{d-1})\text{ .}\end{aligned}

For $\theta=(\theta_{1},\ldots,\theta_{d-1})$, $f_{\theta}:r\in\mathbb{R}_{+}\mapsto f(rx_{1},\ldots,rx_{d})$ is $L$-Lipschitz as $(x_{1},\ldots,x_{d})$ is on the unit sphere. Also, $f(0)=f_{\theta}(0)$. We can use the bounds from the 1-dimensional proof. We note:

M_{+}=\frac{1+\frac{|k+d-2|L^{2}}{t^{*2}}}{1-\frac{(|k+d-2|+\gamma)L^{2}}{t^{*2}}}\text{ ,}\qquad M_{-}=\frac{1-\frac{|k+d-2|L^{2}}{t^{*2}}}{1+\frac{(|k+d-2|+\gamma)L^{2}}{t^{*2}}}\text{ .}

We obtain:

\mathbb{E}\left[\left(\frac{f(G_{k}^{d})}{t}\right)^{\gamma}1_{f(G_{k}^{d})>t}\right]\leq M_{+}\,H\left(\theta\mapsto\int_{0}^{+\infty}1_{f_{\theta}(r)>t}\,r^{k+d-1}e^{-\frac{r^{2}}{2}}dr\right)\text{ .}

That is to say,

\mathbb{E}\left[\left(\frac{f(G_{k}^{d})}{t}\right)^{\gamma}1_{f(G_{k}^{d})>t}\right]\leq M_{+}\,\mathbb{P}(f(G_{k}^{d})>t)\text{ .}

Similarly, we have:

\mathbb{E}\left[\left(\frac{f(G_{k}^{d})}{t}\right)^{\gamma}1_{f(G_{k}^{d})>t}\right]\geq M_{-}\,\mathbb{P}(f(G_{k}^{d})>t)\text{ .}

Thus, we conclude as in the one dimensional case that $c_{\gamma}$ is well-defined, finite, and $c_{\gamma}=1$, that is $\gamma=0$, which is absurd as $\gamma>0$. ∎

4 Conclusion and future work

Because of the intrinsic Lipschitz nature of neural networks, the expressivity of GANs is limited. In particular, a Gaussian prior cannot be used to simulate heavy tailed distributions. In the EVT framework, the question of the existence of a tail index for the generated distribution, or of the conditions for its existence, remains. A partial theoretical answer is given in [9] for GANs with ReLU or Leaky-ReLU activation functions and a finite number of neurons. The general case is still an open question. Likewise, determining the thinnest tailed prior able to simulate samples exhibiting heavy tails is an important question needing further investigation.
Moreover, on the experimental side, the problem of training GANs with a heavy-tailed prior also remains: with such priors, GANs are hard to train and exhibit numerical instabilities.

References

  • [1] A. A. Balkema and Laurens de Haan. Residual life time at great age. The Annals of Probability, 1974.
  • [2] Sergei Bobkov. An isoperimetric inequality on the discrete cube, and an elementary proof of the isoperimetric inequality in Gauss space. The Annals of Probability, 1997.
  • [3] Christer Borell. The Brunn-Minkowski inequality in Gauss space. Inventiones mathematicae, 1975.
  • [4] George V. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 1989.
  • [5] Alhussein Fawzi, Hamza Fawzi, and Omar Fawzi. Adversarial vulnerability for any classifier. ArXiv, 2018.
  • [6] Richard Feder, Philippe Berger, and George Stein. Nonlinear 3D cosmic web simulation with heavy-tailed generative adversarial networks. Phys. Rev. D, 2020.
  • [7] Juha Heinonen. Lectures on Lipschitz analysis. Rep. Dept. Math. Stat., 2005.
  • [8] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 1989.
  • [9] Todd Huster, Jeremy Cohen, Zinan Lin, Kevin Chan, Charles Kamhoua, Nandi O. Leslie, Cho-Yu Jason Chiang, and Vyas Sekar. Pareto GAN: Extending the representational power of GANs to heavy-tailed distributions. In Proceedings of the 38th International Conference on Machine Learning, Proceedings of Machine Learning Research, 2021.
  • [10] Razvan Pascanu, Tomás Mikolov, and Yoshua Bengio. Understanding the exploding gradient problem. CoRR, 2012.
  • [11] Jack Prescott, Xiao Zhang, and David Evans. Improved estimation of concentration under $L_{p}$-norm distance metrics using half spaces. International Conference on Learning Representations, 2021.
  • [12] Mohamed El Amine Seddik, Cosme Louart, Mohamed Tamaazousti, and Romain Couillet. Random matrix theory proves that deep learning representations of GAN-data behave as Gaussian mixtures. In Proceedings of the 37th International Conference on Machine Learning, Proceedings of Machine Learning Research, 2020.
  • [13] Vladimir Sudakov and Boris Tsirelson. Extremal properties of half-spaces for spherically invariant measures. Journal of Soviet Mathematics, 1978.
  • [14] Boris Tsirelson, Ildar Ibragimov, and Vladimir Sudakov. Norms of Gaussian sample functions. Proceedings of the Third Japan-USSR Symposium on Probability Theory, 1976.
  • [15] Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018.
  • [16] Magnus Wiese, Robert Knobloch, and Ralf Korn. Copula & marginal flows: Disentangling the marginal from its joint. CoRR, 2019.