
On Sparsity in Overparametrised Shallow ReLU Networks

Jaume de Dios (UCLA, Los Angeles, California) and Joan Bruna (Courant Institute of Mathematical Sciences, New York University; Center for Data Science, New York University; Institute for Advanced Study, Princeton). This work is partially supported by the Alfred P. Sloan Foundation, NSF RI-1816753, NSF CAREER CIF 1845360, and the Institute for Advanced Study.
Abstract

The analysis of neural network training beyond their linearization regime remains an outstanding open question, even in the simplest setup of a single hidden-layer. The limit of infinitely wide networks provides an appealing route forward through the mean-field perspective, but a key challenge is to bring learning guarantees back to the finite-neuron setting, where practical algorithms operate.

Towards closing this gap, and focusing on shallow neural networks, in this work we study the ability of different regularisation strategies to capture solutions requiring only a finite number of neurons, even in the infinitely wide regime. Specifically, we consider (i) a form of implicit regularisation obtained by injecting noise into training targets [Blanc et al. 19], and (ii) the variation-norm regularisation [Bach 17], compatible with the mean-field scaling. Under mild assumptions on the activation function (satisfied for instance by ReLUs), we establish that both schemes are minimised by functions having only a finite number of neurons, irrespective of the amount of overparametrisation. We study the consequences of this property and describe the settings where one form of regularisation is favorable over the other.

1 Introduction

Supervised learning in high dimensions remains an active research area, in part due to a confluence of approximation, computational and statistical challenges, in all of which the curse of dimensionality needs to be simultaneously avoided. Linear learning based on RKHS theory suffers in the high-dimensional regime due to its lack of approximation power, which motivates the study of non-linear learning methods. Amongst them, arguably the simplest is the single hidden-layer neural network, trained with some form of gradient descent scheme. Despite its simplicity relative to its deep counterparts, learning with shallow neural networks is still not fully understood from a theoretical standpoint, specifically with regards to giving positive results that explain their advantage over linear learning methods in realistic settings.

An important theme in recent analyses of neural networks is overparametrisation. By letting the number of neurons grow to infinity, one can obtain optimization, approximation and generalization guarantees beyond the linear learning regime [CB20, MMN18, RVE18], at the expense of computational efficiency. In this context, a key goal is to exhibit the right combination of neural architecture and learning algorithm that preserves the guarantees of infinitely-wide neural networks with polynomially-sized networks. We note that our study is only tangentially related to the so-called double-descent phenomenon recently reported in those same models [BHMM19, BHX19, MM19], since we include regularisation from the outset. Regularisation can take two flavors: implicit (by injecting noise) or explicit (by adding a term to the empirical loss function). It is a key device for obtaining statistical generalisation guarantees, and in this work we study its potential computational benefit in the overparametrised regime.

The mean-field regime of overparametrised shallow networks defines solutions in terms of measures over parameters. A natural form of explicit regularisation is then given by the variation norm [Bac17], which penalises a continuous equivalent of an $L^{1}$ norm over weights, and is equivalent to the path norm [NTS15] and the Barron norm [MWE19], as well as the popular 'weight decay' for homogeneous activations such as the ones we consider here. Injecting noise into gradient descent dynamics is another classic form of implicit regularisation, under appropriate ergodicity conditions that enable the forgetting of initial conditions. In this work, and inspired by [BGVV19], we focus on a specific form of noise, namely noise added to the training labels. The authors interpreted the noisy gradient dynamics as an approximate Ornstein-Uhlenbeck process that 'walks' around the manifold of solutions with zero training error (which has positive dimension due to overparametrisation) and 'locks' into solutions that minimise a certain implicit regulariser, given by the average squared norm of the network gradients over the data. They leveraged this implicit regularisation to explain the 'simple' structure of solutions in specific cases, such as univariate inputs or a single datapoint.

We encapsulate these two regularisation instances into a general family of regularisers that, combined with mild assumptions on the activation function, produce optimisation problems over the space of probability measures that can only be minimised by measures with sparse support (Theorem 4.2). This gives a novel perspective on the overparametrised regime under these forms of regularisation, which connects it to a finite-dimensional linear program, and whereby the infinitude of parameters required to give optimization guarantees might be relaxed to a finite (albeit possibly exponentially large in dimension) quantity.

Related Work:

Our work fits into a recent line of work that attempts to uncover the interplay between neural network overparametrisation, optimization and generalisation, dating back at least to [Bar97]. The implicit bias brought by different flavors of gradient descent in models involving neural networks has since been thoroughly explored [SHN+18, GWB+17, SESS19, NTS15, LMZ17, BGVV19, CB20, NMB+18, AS17, BO97]. Authors have also leveraged the special structure of ReLU activations [MBG18, WTP+19]. Closest to our setup are [BGVV19, EP20], who separately study the two forms of regularisation considered in this work. Our characterisation is consistent with theirs, but extends to a general high-dimensional setting and provides a unified picture. Finally, [BELM20] recently studied sparse solutions of the TV-regularised problem, establishing that for generic data one can find solutions with nearly optimal cost and sparsity; we instead focus on the exact minimisers of the constrained and penalised objectives, and leave the connections for future work.

2 Preliminaries

2.1 Shallow Neural Networks and Variation-Norm Spaces

We consider a shallow neural network of hidden-layer width $m$ as the following function:

f(\bm{x};\bm{\theta}_{1},\dots,\bm{\theta}_{m}) := \frac{1}{m}\sum_{j=1}^{m}\phi(\bm{\theta}_{j},\bm{x})\,, \qquad (1)

where $\bm{x}\in\Omega\subseteq\mathbb{R}^{d}$ is the input feature, $\phi(\bm{\theta},\bm{x})=c\,\sigma(a^{\top}\bm{x}+b)$ is the neuron constructed from an activation function $\sigma$, and $\bm{\theta}_{j}=(c_{j},\theta_{j})\in\mathbb{R}\times\mathbb{R}^{d+1}=:\Theta$ with $\theta=(a,b)$ are the parameters of the $j$-th neuron. As observed by several authors [MMN18, CB18b, RVE18, SS18], such a neural network can be equivalently represented in terms of a probability measure over $\Theta$ as $f_{\mu^{(m)}}$, where

f_{\mu}(\bm{x}) := \int_{\Theta}\phi(\bm{\theta},\bm{x})\,\mu(d\bm{\theta})\,, \qquad (2)

and $\mu^{(m)}$ is the empirical measure of the parameters $\{\bm{\theta}_{j}\}_{j=1}^{m}$: $\mu^{(m)}=\frac{1}{m}\sum_{j=1}^{m}\delta_{\bm{\theta}_{j}}$.
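For concreteness, the following NumPy sketch (our illustration; names such as predict are not from the paper) evaluates the finite-width network (1) and checks that only the empirical measure of the parameters matters, as in (2).

import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def predict(X, A, b, c):
    """Shallow ReLU network f(x) = (1/m) * sum_j c_j * relu(a_j . x + b_j).

    X: (n, d) inputs, A: (m, d) inner weights, b: (m,) biases, c: (m,) outer weights.
    """
    m = A.shape[0]
    pre = X @ A.T + b          # (n, m): a_j . x_i + b_j
    return relu(pre) @ c / m   # mean over neurons = integral against mu^(m)

# Permuting the neurons leaves the empirical measure mu^(m), and hence f, unchanged.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
A, b, c = rng.normal(size=(4, 3)), rng.normal(size=4), rng.normal(size=4)
perm = rng.permutation(4)
assert np.allclose(predict(X, A, b, c), predict(X, A[perm], b[perm], c[perm]))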

Let us denote by $\mathcal{P}(\Theta)$ the space of probability measures over $\Theta$. In the following, we write $\langle\chi,\mu\rangle:=\int_{\Theta}\chi(\bm{\theta})\,\mu(d\bm{\theta})$ for a Borel-measurable test function $\chi$, and $\phi_{\bm{x}}:=\phi(\cdot,\bm{x})$. Let $V:\Theta\to\mathbb{R}^{+}$ be a non-negative potential function satisfying $\inf_{\alpha,c}V(\alpha a,\alpha b,c)>0$. The set of functions representable as (1) defines a metric space $\mathcal{F}_{V}$ with norm

\|f\|_{V} := \inf\left\{\int_{\Theta}V(\bm{\theta})\,\mu(d\bm{\theta})\ ;\ f(\bm{x})=\int_{\Theta}\phi(\bm{\theta},\bm{x})\,\mu(d\bm{\theta})\right\}\,. \qquad (3)

When $V(\bm{\theta})=|c|\,\|\theta\|$, the resulting space is the so-called variation-norm space $\mathcal{F}_{1}$ [Bac17]. One can verify [MWE19] that $V(\bm{\theta})=|c|^{q}\,\|\theta\|^{q}$ gives the same functional characterisation for any $q\geq 1$. Moreover, if $\phi$ is 2-homogeneous w.r.t. $\bm{\theta}$, i.e. $\phi(t\bm{\theta},\bm{x})=t^{2}\phi(\bm{\theta},\bm{x})$ (as is the case for the ReLU $\sigma(t)=\max(0,t)$), then one can also verify that $V(\bm{\theta})=\|\bm{\theta}\|^{2}$ defines the same space, by first projecting the measure to the unit sphere $\|\theta\|=1$ and then lifting it again as a point mass in the $c$-direction, using the fact that $\mu$ has mass $1$; see [MWE19], Theorem 1. Such apparent flexibility is a consequence of the fact that $\mu$, the underlying object parametrising the integral representation (3), is in fact a lifted version of a more 'fundamental' object $\nu=\mathrm{P}(\mu)\in\mathcal{M}(\mathbb{S}^{d})$, the space of signed Radon measures over the unit sphere of $\mathbb{R}^{d+1}$. The measures $\mu$ and $\nu$ are related via the projection

\int_{\mathbb{S}^{d}}\chi(\tilde{\theta})\,d\nu(\tilde{\theta}) = \int_{\Theta}c\,\|\theta\|\,\chi\!\left(\frac{\theta}{\|\theta\|}\right)d\mu(\bm{\theta}) \qquad (4)

for all continuous test functions $\chi:\mathbb{S}^{d}\to\mathbb{R}$. One can verify [Chi19] that for the previous choice of potential $V$, one has $\|f\|_{V}=\inf\{\|\nu\|_{\mathrm{TV}};\ f(\bm{x})=\int_{\mathbb{S}^{d}}\sigma(\langle\tilde{\theta},\tilde{\bm{x}}\rangle)\,d\nu(\tilde{\theta})\}$, where $\tilde{\bm{x}}=(\bm{x},1)$ and $\|\nu\|_{\mathrm{TV}}$ is the Total-Variation norm of $\nu$ [Bac17].
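To make the projection (4) concrete, the following worked single-neuron example (ours, not from [Bac17]) traces a point mass through it, assuming $c>0$:

\begin{align*}
&\text{Take }\mu=\delta_{\bm{\theta}_{0}}\text{ with }\bm{\theta}_{0}=(c,\theta_{0}),\ \theta_{0}=(a,b).\ \text{For every test function }\chi,\\
&\qquad \int_{\mathbb{S}^{d}}\chi\,d\nu \;=\; c\,\|\theta_{0}\|\,\chi\!\left(\tfrac{\theta_{0}}{\|\theta_{0}\|}\right)
\quad\Longrightarrow\quad \nu = c\,\|\theta_{0}\|\,\delta_{\theta_{0}/\|\theta_{0}\|},\qquad
\|\nu\|_{\mathrm{TV}} = c\,\|\theta_{0}\| = V(\bm{\theta}_{0}),\\
&\text{and, by 1-homogeneity of the ReLU,}\quad
f_{\mu}(\bm{x}) = c\,\sigma(\langle\theta_{0},\tilde{\bm{x}}\rangle)
= c\,\|\theta_{0}\|\,\sigma\!\left(\big\langle\tfrac{\theta_{0}}{\|\theta_{0}\|},\tilde{\bm{x}}\big\rangle\right),
\end{align*}

so the projected measure $\nu$ represents the same function at the same cost $V(\bm{\theta}_{0})=|c|\,\|\theta_{0}\|$.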

This variation-norm space contains any RKHS whose kernel is generated as an expectation over features, $k(\bm{x},\bm{x}^{\prime})=\int_{\Theta}\phi(\bm{\theta},\bm{x})\phi(\bm{\theta},\bm{x}^{\prime})\,\mu_{0}(d\bm{\theta})$ with a base measure $\mu_{0}$, but it provides crucial advantages over such RKHS at approximating certain non-smooth, high-dimensional functions having some 'hidden' low-dimensional structure [Bac17]. This motivates the study of overparametrised shallow networks with the scaling of (1), as opposed to the NTK scaling in $1/\sqrt{m}$ [JGH18].

As we will see in Section 3, other $\mathcal{F}_{V}$ spaces arise naturally when training overparametrised neural networks with noisy dynamics, corresponding to different choices of the potential function. In the following, we focus on the ReLU case, where $\sigma(t)=(t)_{+}=\max(0,t)$.

2.2 Empirical Risk Minimization and Representer Theorem

Given samples $\{(\bm{x}_{i},y_{i})\in\Omega\times\mathbb{R}\}_{i=1\dots n}$ for a regression task, learning in $\mathcal{F}_{V}$ in the interpolant regime (corresponding to overparametrised models) amounts to the following problem:

\min_{f\in\mathcal{F}_{V}}\|f\|_{V}\quad \text{s.t.}\quad f(\bm{x}_{i})=y_{i}\ \text{ for }i=1\dots n\,. \qquad (5)

This problem can be equivalently expressed as an optimization problem in $\mathcal{P}(\Theta)$:

\min_{\mu\in\mathcal{P}(\Theta)}\langle V,\mu\rangle\quad \text{s.t.}\quad \langle\phi_{\bm{x}_{i}},\mu\rangle=y_{i}\,,\ i=1\dots n\,. \qquad (6)

Even though $\mathcal{F}_{1}$ is not an RKHS, its convex structure also provides a Representer Theorem [Zuh48, FJ75, Bac17, BCC+19]: the extreme points of the solution set of (6) consist of point masses $\sum_{j=1}^{r}c_{j}\delta_{\bm{\theta}_{j}}$ with support size $r$ at most $n$. In other words, there exist minimisers of the variation norm that require at most $n$ neurons to interpolate the data. However, in contrast with the RKHS representer theorem, these neurons are not explicit in terms of the datapoints, and cannot be directly characterised from (6), since this problem generally lacks uniqueness.

In practice, (6) is typically solved in penalised form, by introducing the explicitly regularised empirical loss

\mathcal{L}_{\lambda}(\mu) = \frac{1}{n}\sum_{i=1}^{n}\ell(\langle\phi_{\bm{x}_{i}},\mu\rangle,y_{i}) + \lambda\langle V,\mu\rangle\,, \qquad (7)

where $\ell$ is convex w.r.t. its first argument, e.g. $\ell(x,y)=\frac{1}{2}|x-y|^{2}$, and $\lambda$ controls the regularisation strength, so that (6) can be recovered in the limit $\lambda\to 0^{+}$ of (7).

2.3 Training Overparametrised Neural Networks and Wasserstein Gradient Flows

Notice that for an empirical measure $\mu^{(m)}$ corresponding to a width-$m$ shallow network, the loss $\mathcal{L}(\mu^{(m)})$ is precisely the loss

L(\bm{\theta}_{1},\dots,\bm{\theta}_{m}) = \frac{1}{n}\sum_{i}\ell(f(\bm{x}_{i};\bm{\theta}_{1},\dots,\bm{\theta}_{m}),y_{i}) + \frac{\lambda}{m}\sum_{j=1}^{m}V(\bm{\theta}_{j})\,, \qquad (8)

with respect to the parameters $\{\bm{\theta}_{j}\}_{j\leq m}$. In that case, the regularisation term corresponds to weight decay for $V(\bm{\theta})=\|\bm{\theta}\|^{2}$, and to the so-called path norm [NTS15] for $V(\bm{\theta})=|c|\,\|\theta\|$. As also argued in [NTS14], these two regularisation terms capture the same implicit bias in functional space.
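As a sketch of how (7) specialises to the finite-width objective (8), the snippet below (illustrative names, assuming a squared loss) computes the regularised empirical loss for the two potentials just mentioned, weight decay and the path norm.

import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def regularised_loss(X, y, A, b, c, lam, potential="path"):
    m = A.shape[0]
    preds = relu(X @ A.T + b) @ c / m                  # f(x_i; theta_1 .. theta_m)
    data_term = 0.5 * np.mean((preds - y) ** 2)        # squared loss l
    if potential == "path":                            # V(theta) = |c| * ||(a, b)||
        V = np.abs(c) * np.sqrt(np.sum(A ** 2, axis=1) + b ** 2)
    else:                                              # V(theta) = ||theta||^2 (weight decay)
        V = np.sum(A ** 2, axis=1) + b ** 2 + c ** 2
    return data_term + lam * np.mean(V)                # (lambda / m) * sum_j V(theta_j)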

Performing gradient descent on $L$ with respect to $\{\bm{\theta}_{j}\}_{j}$ induces gradient dynamics on the functional $\mathcal{L}$ over $\mathcal{P}(\Theta)$ with respect to the Wasserstein metric [MMN18, CB18b, RVE18, SS18], thus mapping a Euclidean non-convex problem (8) to a non-Euclidean convex one (7). The gradient dynamics of overparametrised neural networks in the infinite-width limit can thus be analysed as the mean-field limit of such Wasserstein gradient flows, leading to global convergence for problems such as (7) under appropriate homogeneity assumptions on $\phi$ [CB18b, Chi19]. However, such convergence results are qualitative and only hold in the limit $m\to\infty$, absent extra structure in the minimisers of $\mathcal{L}(\mu)$ such as being point masses – precisely the structure that our main results from Section 4 provide.

2.4 Implicit Regularization

A popular alternative to the explicit variation-norm regularisation is to rely on the particular choice of optimization scheme to provide an implicit regularisation effect. While gradient descent enjoys powerful implicit regularisation for logistic regression [SHN+18, CB20] by favoring max-margin solutions, and for linear least-squares regression by choosing solutions of minimal $\ell_{2}$ norm, the situation for non-linear least-squares is less clear.

Indeed, in those cases the dynamics do not typically forget the initial conditions, making them very sensitive to initialisation and producing substantial qualitative changes, as illustrated by the lazy dynamics [CB18a]. A powerful alternative is thus given by algorithms that inject noise into the dynamics, such as SGD or Langevin dynamics. The latter was studied in the framework of overparametrised shallow networks in [MMN18], and leads to an associated cost of the form

\min_{\mu}\ \frac{1}{n}\sum_{i=1}^{n}\ell(\langle\phi_{\bm{x}_{i}},\mu\rangle,y_{i}) - \lambda\mathcal{H}(\mu)\,, \qquad (9)

where $\mathcal{H}$ is the relative entropy of $\mu$. Minimizers of this functional have a smooth density that solves the associated elliptic Euler-Lagrange problem; they are therefore not sparse, due to the entropic regularisation.

An alternative to Langevin dynamics and SGD was recently studied in [BGVV19], consisting of adding noise to the targets $y_{i}$ rather than to the parameters. As already hinted in [BGVV19], the resulting dynamics capture an implicit bias corresponding to an $\mathcal{F}_{V}$ space for a data-dependent potential function $V$.

3 Implicit Regularisation by Label Noise

The goal of this section is to establish a link between the type of noise-induced regularisation phenomenon established in [BGVV19, Theorem 2] and the variation-norm spaces defined in Section 2.1.

3.1 Results in Blanc et al.

Blanc et al. [BGVV19] propose Algorithm 1 for training a neural network on a dataset $(\bm{x}_{i},y_{i})_{i=1}^{n}$. The algorithm is an SGD-type algorithm with a small perturbation of the selected label $y_{i}$, sampled independently at each gradient step.

Denote by $\mathcal{D}=(\bm{x}_{1},\dots,\bm{x}_{n})$ the input data, and let $\mathbb{E}_{\mathcal{D}}$ denote the empirical average over the observed $n$ datapoints. Let $\vec{\bm{\theta}}=(\bm{\theta}_{1},\dots,\bm{\theta}_{m})$. The main result therein [BGVV19, Theorem 2] states that, for large times (depending, amongst others, on the noise size and the number of parameters), the solution concentrates at local minimizers (amongst the zero-error solutions) of the effective regularization loss

\Lambda(\vec{\bm{\theta}},\mathcal{D}) := \mathbb{E}_{\mathcal{D}}\left[\sum_{j}|\nabla_{\bm{\theta}_{j}}f(\bm{x},\vec{\bm{\theta}})|^{2}\right] \propto \frac{1}{m}\sum_{j}\mathbb{E}_{\mathcal{D}}\left[|\nabla_{\bm{\theta}_{j}}\phi(\bm{\theta}_{j},\bm{x})|^{2}\right]\,. \qquad (10)

The right-hand side of (10) is an integral against the empirical measure $\mu=\frac{1}{m}\sum_{j=1}^{m}\delta_{\bm{\theta}_{j}}$, and therefore

\Lambda(\vec{\bm{\theta}},\mathcal{D}) \propto \int\mathbb{E}_{\mathcal{D}}\left[|\nabla_{\bm{\theta}}\phi(\bm{\theta},\bm{x})|^{2}\right]d\mu(\bm{\theta}) =: \int\widetilde{V}_{b}(\bm{\theta})\,d\mu(\bm{\theta})\,. \qquad (11)

Algorithm 1: SGD with noise on the labels
Parameters: noise size $\eta$, step size $\epsilon$, noise distribution $\rho$
Input: data $(\bm{x}_{i},y_{i})_{i=1}^{n}$, initial parameters $\theta_{0}$
repeat
     Choose a random data point $(\bm{x}_{i_{t}},y_{i_{t}})$
     Generate noise $r\sim\rho$
     $\tilde{y}_{i_{t}} := y_{i_{t}} + \eta r$
     $\bm{\theta}_{t+1} = \bm{\theta}_{t} - \epsilon\nabla_{\bm{\theta}}\big[(f(\bm{x}_{i_{t}};\bm{\theta}) - \tilde{y}_{i_{t}})^{2}\big]$
until convergence
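A minimal NumPy rendering of Algorithm 1 for the shallow ReLU network (1) is sketched below; gradients are written out by hand, all names are ours, and the hyperparameters are placeholders rather than values from [BGVV19].

import numpy as np

def sgd_label_noise(X, y, m, eta=0.1, eps=1e-2, steps=10_000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    A = rng.normal(size=(m, d)); b = rng.normal(size=m); c = rng.normal(size=m)
    for _ in range(steps):
        i = rng.integers(n)                         # pick a random data point
        y_tilde = y[i] + eta * rng.normal()         # perturb the label: y_i + eta * r
        pre = A @ X[i] + b                          # (m,) pre-activations
        act = np.maximum(pre, 0.0)                  # ReLU outputs
        gate = (pre > 0).astype(float)              # ReLU derivative
        resid = act @ c / m - y_tilde               # f(x_i; theta) - y_tilde
        # gradients of (f(x_i; theta) - y_tilde)^2 w.r.t. (A, b, c)
        gA = 2 * resid * (c * gate)[:, None] * X[i][None, :] / m
        gb = 2 * resid * c * gate / m
        gc = 2 * resid * act / m
        A -= eps * gA; b -= eps * gb; c -= eps * gc
    return A, b, c

# usage on toy data:
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3)); y = rng.normal(size=20)
A, b, c = sgd_label_noise(X, y, m=50, steps=2000)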

3.2 An equivalent formulation in the ReLU case

As we have seen in the previous section, the training algorithm with label noise minimizes $\langle\widetilde{V}_{b},\mu\rangle$. In the ReLU case we can write explicitly:

\widetilde{V}_{b}(a,b,c) = \mathbb{E}_{\mathcal{D}}\left[\frac{|c^{2}|\,(a^{\top}\bm{x}+b)^{2}_{+}}{c^{2}} + (1+|\bm{x}|^{2})\,c^{2}\,\chi_{(a^{\top}\bm{x}+b)\geq 0}\right]\,. \qquad (12)

The key realization is that, in order to minimize the value of (11), the algorithm will exploit the degeneracy $(a,b,c)\sim(\lambda a,\lambda b,\lambda^{-1}c)$ for $\lambda>0$. In other words, we may substitute $\widetilde{V}_{b}$ by $V_{b}(a,b,c)=\min_{\lambda>0}\widetilde{V}_{b}(\lambda a,\lambda b,\lambda^{-1}c)$ without effectively changing the problem. A straightforward computation shows that $V_{b}(a,b,c)$ is equal to

V_{b}(a,b,c) = |c|\,\sqrt{\mathbb{E}_{\mathcal{D}}[(a^{\top}\bm{x}+b)_{+}^{2}]} =: |c|\,w_{b}(a,b)\,. \qquad (13)
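The following small sketch (illustrative names) computes the data-dependent weight $w_{b}$ of (13) and numerically checks the invariance of $V_{b}$ under the rescaling $(a,b,c)\mapsto(\lambda a,\lambda b,\lambda^{-1}c)$ exploited above.

import numpy as np

def w_b(a, b, X):
    """w_b(a, b) = sqrt(E_D[(a.x + b)_+^2]) over the empirical data X of shape (n, d)."""
    return np.sqrt(np.mean(np.maximum(X @ a + b, 0.0) ** 2))

def V_b(a, b, c, X):
    return np.abs(c) * w_b(a, b, X)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
a, b, c = rng.normal(size=3), 0.3, 1.7
for lam in (0.1, 1.0, 10.0):
    # 1-homogeneity of w_b in (a, b) cancels against the 1/lambda on c
    assert np.isclose(V_b(lam * a, lam * b, c / lam, X), V_b(a, b, c, X))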

There are two crucial properties of the potential function $V_{b}$. First, it is invariant under the $\lambda$-degeneracy described above: if we rescale a measure $\mu$ in this way to obtain $\tilde{\mu}$, both measures have the same loss. Second, the weight $w_{b}$ is convex and, as long as the neurons can 'see' enough of the dataset, effectively strictly convex. To make this precise, we start with two definitions:

Definition 3.1.

A neuron with parameters $(a,b)$ is at the bulk of the dataset $\mathcal{D}=(\bm{x}_{1},\dots,\bm{x}_{n})$ whenever the map $(a,b)\mapsto((a\cdot\bm{x}_{1}+b)_{+},(a\cdot\bm{x}_{2}+b)_{+},\dots,(a\cdot\bm{x}_{n}+b)_{+})$ is locally injective.

Definition 3.2.

A weight $w(a,b)$ that is 1-homogeneous in $(a,b)$ is effectively strictly convex if, whenever

w(\lambda a+(1-\lambda)a^{\prime},\lambda b+(1-\lambda)b^{\prime}) = \lambda w(a,b) + (1-\lambda)w(a^{\prime},b^{\prime}) \qquad (14)

holds for some parameters $a,a^{\prime},b,b^{\prime}$ and $\lambda\in(0,1)$, the parameters $(a,b)$ and $(a^{\prime},b^{\prime})$ are proportional to each other.

With these definitions in hand, we can state the main result of this section, a convexity result for the regularization loss (see the Appendix for a proof):

Proposition 3.3.

For any set of datapoints $\{\bm{x}_{i}\}_{i=1}^{n}$, the associated weight $w_{b}(a,b):=\sqrt{\mathbb{E}_{\mathcal{D}}[(a^{\top}\bm{x}+b)_{+}^{2}]}$ is convex in $(a,b)$. Moreover, if the neuron with parameters $(a,b)$ is at the bulk of the dataset, the weight $w_{b}$ is effectively strictly convex in a neighbourhood of $(a,b)$.

We will show next how such convexity structure in the potential function can be leveraged to characterise minimisers of (6) and (7).

4 Minimizers to the convex-type TV have sparse support

In this section we first focus on understanding the solutions of the regularised interpolation problem (6) for potential functions $V$ of the form $V(\bm{\theta})=|c|\,w(a,b)$, for a given regularization weight $w$. As discussed so far, this problem is motivated by the case $w=\sqrt{\|a\|^{2}+b^{2}}$, corresponding to variation-norm learning [Bac17] (which, as seen in [OWSS19], is also related to a Tikhonov-type regularization). Another example is the implicit regularization scheme discussed in Section 3.

By the rescaling trick (see the discussion in Section 3.2), it suffices to consider the case where $w(a,b)$ is homogeneous of degree 1, in the sense that $w(\lambda a,\lambda b)=\lambda w(a,b)$ for $\lambda>0$. Such weights can never be strictly convex (a desirable property when looking for uniqueness or sparsity of limits), because homogeneous functions are never strictly convex: $w(2a,2b)=\frac{1}{2}(w(a,b)+w(3a,3b))$. This obstruction, however, appears only from the parameter perspective and not from the neuron perspective, because proportional parameters also represent proportional neurons. This motivates the effective convexity definition in Section 3.2.

Proposition 3.3 in Section 3.2 shows that the implicit regularization term that appeared in [BGVV19] is also effectively strictly convex. The fact that the TV norm is effectively strictly convex is a direct consequence of Minkowski's inequality. The definition of effective strict convexity motivates an analogous definition of sparsity for lifted measures:

Definition 4.1.

A lifted measure $\mu(a,b,c)$ is effectively supported on $n$ points if the support of $\mu$ is contained in the union of at most $n$ half-planes of the form

H_{a,b} = \{(\beta a,\beta b,c):\ c\in\mathbb{R},\ \beta\in\mathbb{R}_{\geq 0}\}\,. \qquad (15)

If a lifted measure is effectively supported on $n$ points, then (by construction) there is a lifted measure supported on exactly $n$ points that represents the same function. With this definition in hand, we are ready to state the main result of this section (see Figure 1 for an illustration motivating the theorem):

Theorem 4.2.

Let $w(a,b)$ be a 1-homogeneous, effectively strictly convex regularization weight and $V(\bm{\theta})=|c|\,w(a,b)$ the associated potential. Then the solutions of problem (6) are all effectively supported on a finite number of points, and the number of points is bounded by $O(n)^{O(d)}$.

The proof of the theorem has two steps. In the first step, we split the parameter space $(a,b)$ into cells where the parameters interact in a particularly nice way. This cell decomposition is reminiscent of the "open sectors" described in [MBG18]; the cells defined in this work are the closures of the open sectors defined by Maennel et al. The second step is to use effective strict convexity to show that, on each of those cells, optimizing measures are supported on at most one point.

Cell decomposition of the parameter space

Fix a dataset $\mathcal{D}=\{\bm{x}_{i}\}_{i=1}^{n}$, with $\bm{x}_{i}\in\mathbb{R}^{d}$. The dataset splits the parameter space of $\theta$ into a finite number of convex cones $C_{1},\dots,C_{S}$, which we call cells, such that two neurons belong to the same cell iff they are active on the same datapoints. We consider closed cells, and therefore there is overlap at the boundaries of the cells. By identifying each datapoint with its dual hyperplane, there is a one-to-one correspondence between these cells and the resulting hyperplane arrangement, so that $S\simeq n^{d}$ for generic datasets.

A key property of the cells is that the map $\theta=(a,b)\mapsto\sigma(\theta,\bm{x})=(a^{\top}\bm{x}+b)_{+}$ is linear when restricted to a single cell. From the convexity, the cone property, and this restricted linearity, one can extract the following result:

Lemma 4.3 (Discrete version).

If two parameters $(a,b),(a^{\prime},b^{\prime})$ belong to the same cell, then for all datapoints $\bm{x}_{i}$ it holds that

(a^{\top}\bm{x}_{i}+b)_{+} + ((a^{\prime})^{\top}\bm{x}_{i}+b^{\prime})_{+} = ((a+a^{\prime})^{\top}\bm{x}_{i}+(b+b^{\prime}))_{+}\,. \qquad (16)
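The sketch below (our illustration) makes the cell structure and the discrete version of Lemma 4.3 concrete: two parameters lie in the same cell iff they share the activation pattern over the dataset, and in that case the ReLU responses add linearly on every datapoint.

import numpy as np

def activation_pattern(a, b, X):
    """Boolean vector recording on which datapoints the neuron (a, b) is active."""
    return X @ a + b >= 0

def same_cell(theta1, theta2, X):
    (a1, b1), (a2, b2) = theta1, theta2
    return np.array_equal(activation_pattern(a1, b1, X), activation_pattern(a2, b2, X))

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 2))
a1, b1 = rng.normal(size=2), 0.5
a2, b2 = a1 + 1e-6 * rng.normal(size=2), b1 + 1e-6   # tiny perturbation: same cell in general
if same_cell((a1, b1), (a2, b2), X):
    # within a cell the map theta -> ((a.x_i + b)_+)_i is linear, so (16) holds pointwise
    lhs = np.maximum(X @ a1 + b1, 0) + np.maximum(X @ a2 + b2, 0)
    rhs = np.maximum(X @ (a1 + a2) + (b1 + b2), 0)
    assert np.allclose(lhs, rhs)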

Using induction plus standard limiting arguments, this lemma can be upgraded to the measure-valued case:

Lemma 4.3 (Continuous version).

If two measures $\mu(a,b,c),\mu^{\prime}(a,b,c)$ are supported on the same cell $C$, then

\langle\phi_{\bm{x}_{i}},\mu\rangle = \langle\phi_{\bm{x}_{i}},\mu^{\prime}\rangle \qquad (17)

for all datapoints $\bm{x}_{i}$, whenever the following moment estimates hold:

\begin{cases}\ \int_{C}ca\,d\mu(a,b,c)=\int_{C}ca\,d\mu^{\prime}(a,b,c) & (\text{moment estimate for }a)\\ \ \int_{C}cb\,d\mu(a,b,c)=\int_{C}cb\,d\mu^{\prime}(a,b,c) & (\text{moment estimate for }b)\end{cases} \qquad (18)

Effective convexity - measure case.

Now we want to show that effective convexity concentrates the mass within each cell. We first study measures that are already concentrated on a single cell, and then patch the result together. A first step is that effective convexity of $w(a,b)$ induces a similar notion of effective convexity for $|c|\,w(a,b)$:

Lemma 4.4 (Lifting of the effective convexity).

Let $w(a,b)$ be a 1-homogeneous, effectively strictly convex regularization weight, let $(a,b,c),(a^{\prime},b^{\prime},c^{\prime})$ be two parameter triples, and let $\lambda\in(0,1)$. Then

|\lambda c+(1-\lambda)c^{\prime}|\,w(\lambda a+(1-\lambda)a^{\prime},\lambda b+(1-\lambda)b^{\prime}) = \lambda|c|\,w(a,b) + (1-\lambda)|c^{\prime}|\,w(a^{\prime},b^{\prime}) \qquad (19)

if and only if there exist constants $\alpha,\beta\in\mathbb{R}^{+}$ such that $(a^{\prime},b^{\prime},c^{\prime})=(\alpha a,\alpha b,\beta c)$.

In other words, the only scenario in which strict convexity fails is when testing on parameters that represent neurons proportional to each other, since the condition $(a^{\prime},b^{\prime},c^{\prime})=(\alpha a,\alpha b,\beta c)$ is equivalent to saying that the two ReLU neurons share the same hyperplane.

Strictly convex functions have a unique minimizer, and, in the same way we bootstrapped the discrete, single-variable information into its continuous counterpart in Lemma 4.3, we can see that:

Lemma 4.5.

Let $w(a,b)$ be a 1-homogeneous, effectively strictly convex regularization weight, and fix a data cell. Amongst all measures $\nu(a,b)$ which are supported on the fixed cell and have the same moment estimates (eq. (18)), the minimizers of $L_{w}(\nu)$ are effectively supported on one point.

Lemma 4.5 is the key ingredient for the proof of Theorem 4.2:

Proof of Theorem 4.2.

We can write our measure as $\mu=\sum_{i\in\text{Cells}}\mu_{i}$, where each $\mu_{i}$ is supported on the closure of a cell. By the previous results, on each cell the associated measure $\mu_{i}$ has to be supported on at most one point. ∎

Figure 1.a: Data-space perspective: each ReLU separates the data space into two half-spaces, the active one and the inactive one. Given a subset $\mathcal{D}^{\prime}\subseteq\mathcal{D}$ of datapoints (in orange), there is at most one ReLU active at all points of $\mathcal{D}^{\prime}$ but inactive at all points of $\mathcal{D}\setminus\mathcal{D}^{\prime}$.

Figure 1.b: Parameter-space perspective: each datapoint induces a hyperplane in parameter space, namely the set of ReLU parameters that have the datapoint at the hinge. There is at most one neuron (in red) in each cell generated by these hyperplanes (see 1.c).

Figure 1.c: Single-cell optimizers: if a cell contains more than one neuron, substituting all the neurons in the cell by a single neuron at the center of mass does not affect the predictions on the data, but yields a gain in the regularizer.
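The following sketch (our construction, with a TV-type weight $w(a,b)=\|(a,b)\|$) reproduces numerically the merging step of Figure 1.c and Lemma A.1: two same-cell neurons with outer weights of the same sign are replaced by one neuron at the $|c|$-weighted interpolation of their parameters, leaving the predictions unchanged while not increasing the regulariser.

import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

rng = np.random.default_rng(3)
a1, b1, c1 = np.array([1.0, 0.2]), 0.1, 0.8          # neuron 1
a2, b2, c2 = np.array([1.1, 0.3]), 0.2, 1.4          # neuron 2, same sign of c

# keep only datapoints on which the two neurons share an activation pattern (one cell)
cand = rng.normal(size=(200, 2))
mask = (cand @ a1 + b1 >= 0) == (cand @ a2 + b2 >= 0)
X = cand[mask]

# merge at the |c|-weighted interpolation of (a, b), summing the outer weights
lam = c1 / (c1 + c2)
theta_m = lam * np.array([*a1, b1]) + (1 - lam) * np.array([*a2, b2])
a_m, b_m, c_m = theta_m[:2], theta_m[2], c1 + c2

f_before = c1 * relu(X @ a1 + b1) + c2 * relu(X @ a2 + b2)
f_after = c_m * relu(X @ a_m + b_m)
assert np.allclose(f_before, f_after)                # predictions on the data are unchanged

w = lambda a, b: np.sqrt(a @ a + b ** 2)             # TV-type weight w(a, b) = ||(a, b)||
assert c_m * w(a_m, b_m) <= c1 * w(a1, b1) + c2 * w(a2, b2) + 1e-12   # regulariser does not grow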

Recovering Discrete Minimisers by Projecting into the Sphere:

Since the constrained optimization problem (6) is defined over probability measures in the lifted space $\Theta$, there is a price to pay to recover discrete minimizers, namely the 'effective' support of Definition 4.1. In the TV case, this subtlety disappears once one looks at the equivalent problem in the space $\mathcal{M}(\mathbb{S}^{d})$, as shown in the following corollary:

Corollary 4.6.

Consider the problem

\min_{\nu\in\mathcal{M}(\mathbb{S}^{d})}\|\nu\|_{\mathrm{TV}}\quad \text{s.t.}\quad \int_{\mathbb{S}^{d}}\sigma(\langle\tilde{\bm{x}}_{i},\theta\rangle)\,d\nu(\theta)=y_{i}\ \text{ for }i=1\dots n\,. \qquad (20)

Then the solutions of (20) are all supported on a finite number of points, with the same cardinality bound as in Theorem 4.2.

Unicity of Supports Across all minimisers:

We have so far proven that solutions to the constrained minimization have at most a single mass point in each cell. However, a strictly stronger result holds:

Theorem 4.7.

Let $C$ be a cell, and let $\mu,\mu^{\prime}$ be zero-error minimizers of $L_{\text{reg}}$ with non-zero support on $C$. Then both $\mu$ and $\mu^{\prime}$ are effectively supported on the same single point of $C$.

Proof sketch.

By contradiction, suppose $\mu,\mu^{\prime}$ have different (effective) supports on the same cell. By strict convexity, the average of $\mu$ and $\mu^{\prime}$ would also be a minimizer, supported on two points of $C$, which would contradict Lemma 4.5. ∎

Penalized minimization problem:

In a parallel fashion to the problem presented in this section, one can study the penalised version (7). All the sparsity results in this section (Theorems 4.2, 4.7 and Corollary 4.6 in particular, but also suitable variations of the intermediate lemmas) transfer to the penalised case for arbitrary $\lambda>0$, as explained in the appendix:

Corollary 4.8.

Under the same assumptions as Theorem 4.7 and for any $\lambda>0$, all minimisers of $\mathcal{L}_{\lambda}(\mu)$ are effectively supported on a finite number of points. For the corresponding problem expressed in terms of the projected measure $\mathrm{P}(\mu)\in\mathcal{M}(\mathbb{S}^{d})$, all minimisers are supported on a finite number of points.

Implicit versus TV regularisation:

We can make the following observations on the relationship between the two forms of regularisation:

  • The implicit regularization is translation invariant, while the TV regularization is not. One could remove the bias ($b$) dependence from the TV norm and consider $V(a,b,c)=|c|\,\|a\|$, but in this case Theorems 4.2 and 4.7 become false. The lack of translation invariance is not a drawback for all applications (for example in image processing, where a translation of the inputs corresponds to a color change).

  • Implicit regularization is (in principle) more costly to compute, especially when the number of datapoints is large.

From discrete solutions to sparse solutions:

Corollary 4.6 and Theorem 4.7 allow us to turn the original infinite-dimensional problem into an $S$-dimensional linear problem with linear constraints, where the only free parameters are the amounts of (signed) mass given to each cell. Recall that $S$ corresponds to the number of cells of the hyperplane arrangement determined by the datapoints in the projective space.

Focusing on the TV case, let $\nu=\sum_{s=1}^{S}z_{s}\delta_{\theta_{s}}\in\mathcal{M}(\mathbb{S}^{d})$ be a minimiser of (20). Observe that, as a consequence of Theorem 4.7, the locations $\theta_{1}\dots\theta_{S}$ are shared across every minimiser (but unknown a priori). Its coefficients $z=(z_{1}\dots z_{S})$ are thus solutions of the linear program

\min\|z\|_{1}\quad \text{s.t.}\quad \mathcal{A}z=y\,, \qquad (21)

where $\mathcal{A}\in\mathbb{R}^{n\times S}$ is the matrix $\mathcal{A}_{i,s}=(\langle\tilde{\bm{x}}_{i},\theta_{s}\rangle)_{+}$. From the Representer Theorem, we know that there exists a solution $z^{*}$ of (21) with support $\|z^{*}\|_{0}\leq n\ll S$, and in the generic case this solution is the sparsest one, i.e. $z^{*}$ solves the $\ell_{0}$ minimisation counterpart of (21).
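For illustration, the sketch below solves a small instance of (21) with SciPy's linprog via the standard split $z=z^{+}-z^{-}$; the dictionary used here is a random stand-in for the (a priori unknown) matrix $\mathcal{A}$, and all names are ours.

import numpy as np
from scipy.optimize import linprog

def l1_interpolate(A, y):
    """Solve min ||z||_1 s.t. A z = y, with A of shape (n, S)."""
    n, S = A.shape
    cost = np.ones(2 * S)                       # objective: sum of z_plus + z_minus
    A_eq = np.hstack([A, -A])                   # A (z_plus - z_minus) = y
    res = linprog(cost, A_eq=A_eq, b_eq=y, bounds=[(0, None)] * (2 * S), method="highs")
    return res.x[:S] - res.x[S:]

rng = np.random.default_rng(4)
n, S = 8, 40
A = np.maximum(rng.normal(size=(n, S)), 0.0)    # stand-in for A_{i,s} = (<x_i, theta_s>)_+
y = rng.normal(size=n)
z = l1_interpolate(A, y)
print(np.sum(np.abs(z) > 1e-8), "nonzero coefficients out of", S)   # typically at most n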

A natural question is thus to understand under which conditions on the data the $\ell_{1}$ and $\ell_{0}$ problems share the same minimisers. This is an instance of Compressed Sensing, and could be analysed by studying properties of the sensing matrix $\mathcal{A}$, such as the RIP (leveraging random datasets) or the Nullspace Property. This question has also recently been studied in [EP20]. It is out of the scope of the present work and is left for future research.

Consequences for Gradient-Descent Optimization:

With abuse of notation, let

\mathcal{L}(\nu) = \frac{1}{n}\sum_{i=1}^{n}\left|\int_{\mathbb{S}^{d}}\sigma(\langle\tilde{\bm{x}}_{i},\theta\rangle)\,d\nu(\theta) - y_{i}\right|^{2} + \lambda\|\nu\|_{\mathrm{TV}} \qquad (22)

be the penalized form of the regularised problem, expressed in terms of Radon measures on $\mathbb{S}^{d}$. The ability of gradient descent to minimise such problems was recently studied in [Chi19], leading to a positive result under certain assumptions, notably that minimisers are discrete and 'feel' a positive curvature around their support. The local convexity analysis that resulted in the discrete characterisation of solutions for this penalised problem (Lemma 4.5) provides precisely this geometric property around minimisers, suggesting a favorable case for gradient descent on sufficiently overparametrised (but finite) shallow ReLU networks. However, we note that (i) our ReLU setup does not have the smoothness required to directly apply the results from [Chi19], and (ii) the finite-particle guarantees would still be at best exponential in the dimension. We leave these considerations for future work.

5 Conclusions

In this work, we have studied the structure of the minimisers of overparametrised shallow ReLU networks under a broad umbrella of implicit and explicit regularisation strategies, including the popular weight decay and noisy gradient dynamics through label noise. Our main result leverages the underlying convexity generated by the regularisation to show that the resulting learning objectives only admit sparse minimisers, irrespective of the amount of overparametrisation.

This fact effectively maps the task of learning in variation-norm ReLU spaces (which are infinite-dimensional) into a finite-dimensional linear program. However, this linear program is determined by the hyperplane arrangement generated by the finite dataset after dualization, and is thus of size exponential in the dimension. As such, it does not directly remove the computational barrier present in the mean-field formulations of sparse measure optimization [Chi19], which is unavoidable in certain data regimes under the general SQ model [GGJ+20].

As future work, it would be interesting to further explore the connections with the recent ‘memorization’ constructions of [BELM20], and to study favorable conditions on the datapoints that could break the exponential size of the linear program.

References

  • [AS17] Madhu S Advani and Andrew M Saxe. High-dimensional dynamics of generalization error in neural networks. arXiv preprint arXiv:1710.03667, 2017.
  • [Bac17] Francis Bach. Breaking the curse of dimensionality with convex neural networks. The Journal of Machine Learning Research, 18(1):629–681, 2017.
  • [Bar97] Peter L Bartlett. For valid generalization the size of the weights is more important than the size of the network. In Advances in neural information processing systems, pages 134–140, 1997.
  • [BCC+19] Claire Boyer, Antonin Chambolle, Yohann De Castro, Vincent Duval, Frédéric De Gournay, and Pierre Weiss. On representer theorems and convex regularization. SIAM Journal on Optimization, 29(2):1260–1281, 2019.
  • [BELM20] Sébastien Bubeck, Ronen Eldan, Yin Tat Lee, and Dan Mikulincer. Network size and weights size for memorization with two-layers neural networks. arXiv preprint arXiv:2006.02855, 2020.
  • [BGVV19] Guy Blanc, Neha Gupta, Gregory Valiant, and Paul Valiant. Implicit regularization for deep neural networks driven by an Ornstein-Uhlenbeck like process. arXiv:1904.09080 [cs, stat], April 2019. arXiv: 1904.09080.
  • [BHMM19] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine learning practice and the bias-variance trade-off. arXiv:1812.11118 [cs, stat], September 2019. arXiv: 1812.11118.
  • [BHX19] Mikhail Belkin, Daniel Hsu, and Ji Xu. Two models of double descent for weak features. arXiv:1903.07571 [cs, stat], March 2019. arXiv: 1903.07571.
  • [BO97] Siegfried Bös and Manfred Opper. Dynamics of training. In Advances in Neural Information Processing Systems, pages 141–147, 1997.
  • [CB18a] Lenaic Chizat and Francis Bach. A note on lazy training in supervised differentiable programming. arXiv preprint arXiv:1812.07956, 2018.
  • [CB18b] Lenaic Chizat and Francis Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. In Advances in Neural Information Processing Systems, pages 3036–3046, 2018.
  • [CB20] Lénaïc Chizat and Francis Bach. Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss. arXiv preprint arXiv:2002.04486, 2020.
  • [Chi19] Lenaic Chizat. Sparse optimization on measures with over-parameterized gradient descent. arXiv preprint arXiv:1907.10300, 2019.
  • [EP20] Tolga Ergen and Mert Pilanci. Convex geometry and duality of over-parameterized neural networks. arXiv preprint arXiv:2002.11219, 2020.
  • [FJ75] SD Fisher and Joseph W Jerome. Spline solutions to l1 extremal problems in one and several variables. Journal of Approximation Theory, 13(1):73–83, 1975.
  • [GGJ+20] S. Goel, A. Gollakota, Z. Jin, S. Karmalkar, and A. Klivans. Superpolynomial lower bounds for learning one-layer neural networks using gradient descent. ICML, 2020.
  • [GWB+17] Suriya Gunasekar, Blake E Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Implicit regularization in matrix factorization. In Advances in Neural Information Processing Systems, pages 6151–6159, 2017.
  • [JGH18] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pages 8571–8580, 2018.
  • [LMZ17] Yuanzhi Li, Tengyu Ma, and Hongyang Zhang. Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations. arXiv preprint arXiv:1712.09203, 2017.
  • [MBG18] Hartmut Maennel, Olivier Bousquet, and Sylvain Gelly. Gradient Descent Quantizes ReLU Network Features. arXiv:1803.08367 [cs, stat], March 2018. arXiv: 1803.08367.
  • [MM19] Song Mei and Andrea Montanari. The generalization error of random features regression: Precise asymptotics and double descent curve. arXiv preprint arXiv:1908.05355, 2019.
  • [MMN18] Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018.
  • [MWE19] Chao Ma, Lei Wu, and Weinan E. Barron spaces and the compositional function spaces for neural network models. arXiv preprint arXiv:1906.08039, 2019.
  • [NMB+18] Brady Neal, Sarthak Mittal, Aristide Baratin, Vinayak Tantia, Matthew Scicluna, Simon Lacoste-Julien, and Ioannis Mitliagkas. A modern take on the bias-variance tradeoff in neural networks. arXiv preprint arXiv:1810.08591, 2018.
  • [NTS14] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614, 2014.
  • [NTS15] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-Based Capacity Control in Neural Networks. arXiv:1503.00036 [cs, stat], April 2015. arXiv: 1503.00036.
  • [OWSS19] Greg Ongie, Rebecca Willett, Daniel Soudry, and Nathan Srebro. A Function Space View of Bounded Norm Infinite Width ReLU Nets: The Multivariate Case. arXiv:1910.01635 [cs, stat], October 2019. arXiv: 1910.01635.
  • [RVE18] Grant Rotskoff and Eric Vanden-Eijnden. Parameters as interacting particles: long time convergence and asymptotic error scaling of neural networks. In Advances in Neural Information Processing Systems, pages 7146–7155, 2018.
  • [SESS19] Pedro Savarese, Itay Evron, Daniel Soudry, and Nathan Srebro. How do infinite width bounded norm networks look in function space? In Conference on Learning Theory, pages 2667–2690, 2019.
  • [SHN+18] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1):2822–2878, 2018.
  • [SS18] Justin Sirignano and Konstantinos Spiliopoulos. Dgm: A deep learning algorithm for solving partial differential equations. Journal of Computational Physics, 375:1339–1364, 2018.
  • [WTP+19] Francis Williams, Matthew Trager, Daniele Panozzo, Claudio Silva, Denis Zorin, and Joan Bruna. Gradient dynamics of shallow univariate relu networks. In Advances in Neural Information Processing Systems, pages 8376–8385, 2019.
  • [Zuh48] S. Zuhovickii. Remarks on problems in approximation theory. Mat. Zbirnik KDU, 1948.

Appendix A Additional Proofs

Proof of Proposition 3.3.

Proof.

Convexity follows from the fact that the $l^{2}$ average of non-negative convex functions is convex:

\left\|\left(\Big(\frac{a+a^{\prime}}{2}\Big)^{\top}\bm{x}_{j}+\frac{b+b^{\prime}}{2}\right)_{+}\right\|_{l^{2}(j)} \leq \left\|\frac{1}{2}(a^{\top}\bm{x}_{j}+b)_{+}+\frac{1}{2}((a^{\prime})^{\top}\bm{x}_{j}+b^{\prime})_{+}\right\|_{l^{2}(j)} \leq \frac{1}{2}\|(a^{\top}\bm{x}_{j}+b)_{+}\|_{l^{2}(j)}+\frac{1}{2}\|((a^{\prime})^{\top}\bm{x}_{j}+b^{\prime})_{+}\|_{l^{2}(j)}\,, \qquad (23)

where the first line follows from the convexity of the ReLU. Strict convexity follows similarly: local injectivity implies that, unless $(a,b)$ is proportional to $(a^{\prime},b^{\prime})$, the vector $[(a^{\top}\bm{x}_{j}+b)_{+}]_{j=1}^{n}$ is not proportional to the vector $[((a^{\prime})^{\top}\bm{x}_{j}+b^{\prime})_{+}]_{j=1}^{n}$. In that case, by the (sharp) Minkowski inequality, the inequality has to be strict. ∎

Proof of Lemma 4.3 (Discrete version)

Proof.

Let $\bm{x}_{i}$ be a datapoint. If $(a,b)$ and $(a^{\prime},b^{\prime})$ belong to the same cell, there are two options:

Both $(a^{\top}\bm{x}_{i}+b)\leq 0$ and $((a^{\prime})^{\top}\bm{x}_{i}+b^{\prime})\leq 0$. In this case, $0=(a^{\top}\bm{x}_{i}+b)_{+}+((a^{\prime})^{\top}\bm{x}_{i}+b^{\prime})_{+}=((a+a^{\prime})^{\top}\bm{x}_{i}+b+b^{\prime})_{+}$.
Both $(a^{\top}\bm{x}_{i}+b)\geq 0$ and $((a^{\prime})^{\top}\bm{x}_{i}+b^{\prime})\geq 0$. In this case, $(a^{\top}\bm{x}_{i}+b)_{+}+((a^{\prime})^{\top}\bm{x}_{i}+b^{\prime})_{+}=(a^{\top}\bm{x}_{i}+b)+((a^{\prime})^{\top}\bm{x}_{i}+b^{\prime})=((a+a^{\prime})^{\top}\bm{x}_{i}+(b+b^{\prime}))=((a+a^{\prime})^{\top}\bm{x}_{i}+(b+b^{\prime}))_{+}$. ∎

Proof of Lemma 4.3 (Continuous version)

Proof.

Let $\bm{x}_{i}$ be a datapoint. As in the discrete case, we have two options. In the trivial case the datapoint is inactive in the cell, in the sense that $c\,(a^{\top}\bm{x}_{i}+b)_{+}=0$ for all $(a,b,c)$ in the cell; in this case the equality always holds. In the non-zero (linear) case, all neurons in the cell are active on the datapoint. In this case we may omit the ReLU itself, and the equality we need to show is

\int c\,(a^{\top}\bm{x}_{i}+b)\,d\mu = \int c\,(a^{\top}\bm{x}_{i}+b)\,d\mu^{\prime}\,. \qquad (24)

By linearity of the integral, this is equivalent to

\bm{x}_{i}\cdot\int ca\,d\mu + \int cb\,d\mu = \bm{x}_{i}\cdot\int ca\,d\mu^{\prime} + \int cb\,d\mu^{\prime}\,, \qquad (25)

and now the result follows directly from the hypothesis. ∎

Proof of Lemma 4.4

Lemma A.1.

Let $(a,b,c),(a^{\prime},b^{\prime},c^{\prime})$ belong to the same cell and not represent the same neuron, and let $\mu=\frac{1}{2}(\delta_{a,b,c}+\delta_{a^{\prime},b^{\prime},c^{\prime}})$. Then there is a probability measure $\mu^{\prime}$ concentrated at a single point $(\alpha,\beta,\gamma)$ such that the estimates (18) hold for $\mu,\mu^{\prime}$, but $\langle V,\mu^{\prime}\rangle<\langle V,\mu\rangle$. (There is a typo in the statement of Lemma 4.4 in the main text in how the interpolation in the lifted space is performed; it has been addressed here.)

Proof.

Assume without loss of generality that $c,c^{\prime}$ have the same sign (since otherwise we can transfer mass from one neuron to the other). Since $(a,b,c)$ and $(a^{\prime},b^{\prime},c^{\prime})$ do not represent the same neuron, by the effective strict convexity of $w$ we have

w(\lambda a+(1-\lambda)a^{\prime},\lambda b+(1-\lambda)b^{\prime}) < \lambda w(a,b) + (1-\lambda)w(a^{\prime},b^{\prime}) \qquad (26)

for any $\lambda\in(0,1)$.

Pick $\lambda=\frac{|c|}{|c|+|c^{\prime}|}$. Then one easily verifies that $(\alpha,\beta)=\lambda(a,b)+(1-\lambda)(a^{\prime},b^{\prime})$ and $\gamma=\frac{c+c^{\prime}}{2}$ satisfy the moment estimates, and $2|\gamma|\,w(\alpha,\beta)<|c|\,w(a,b)+|c^{\prime}|\,w(a^{\prime},b^{\prime})$ by (26). ∎

Proof of Lemma 4.5

Proof.

Assume minimizers exist, as otherwise the result is vacuously true.

Given a minimizer $\tilde{\nu}$, there is a measure $\nu_{0}=\delta_{(\alpha,\beta,\pm 1)}$, concentrated at a single point (the center of mass), that has the same mass and first moments as $\tilde{\nu}$. By convexity (Jensen), $\nu_{0}$ has to be a minimizer of the loss amongst all measures with the same first moment estimates, so it suffices to show that $\nu_{0}$ is effectively unique. This is a consequence of effective convexity: if $\nu^{\prime}$ is another minimizer, $\nu^{\prime}$ has to be supported on parameters that represent the same neuron as $\nu_{0}$, because otherwise we can build an even better measure:

Let

\phi(a,b,c) := \left(\frac{1}{2}(\alpha+|c|a),\ \frac{1}{2}(\beta+|c|b),\ \pm 1\right).

Then, for any given measure $\mu$ with the same moment estimates as $\nu_{0}$, the pushforward $\phi_{*}\mu$ has the same moment estimates as $\mu$ and $\nu_{0}$; but unless $\mu$ is effectively supported at the same point as $\nu_{0}$, it must be (by strict convexity) that

\langle w,\nu_{0}\rangle > \langle w,\phi_{*}\mu\rangle\,, \qquad (27)

which contradicts the hypothesis that ν0\nu_{0} is a minimizer. ∎

Proof of Theorem 4.7

Proof.

Let $\nu,\nu^{\prime}$ be two minimizing measures. By convexity, $\frac{1}{2}(\nu+\nu^{\prime})$ must also be a minimizer. Lemma 4.5 then guarantees that $\frac{1}{2}(\nu+\nu^{\prime})$ is supported on at most one point per cell. Therefore, if $\nu,\nu^{\prime}$ have support on the same cell, it must be at exactly the same point. ∎

Proof of Corollary 4.8

Proof.

Let $\mu$ be a minimizer of the penalized problem

\min_{\nu\in\mathcal{P}}\ \langle|c|\,w(a,b),\nu(a,b,c)\rangle + L(\{|f_{\nu}(x_{i})-y_{i}|\}_{i=1}^{n}) \qquad (28)

for some similarity loss function $L$. Then $\mu$ must also be a minimizer of the following restricted problem (note that this is not the traditional penalised-constrained duality):

\min_{\nu\in\mathcal{P},\ f_{\nu}(x_{i})=f_{\mu}(x_{i})}\ \langle|c|\,w(a,b),\nu(a,b,c)\rangle + L(\{|f_{\nu}(x_{i})-y_{i}|\}_{i=1}^{n})\,; \qquad (29)

this is because a minimizer must remain a minimizer if we further constrain the feasible set to a subset that still contains it.

Problem (29) is again the constrained problem (6), using $f_{\mu}(x_{i})$ instead of $y_{i}$ (the additional loss term is constant on the restricted feasible set). In this case Theorem 4.2 states that solutions to this problem must be sparse. ∎