
How do Minimum-Norm Shallow Denoisers Look in Function Space?

Chen Zeno
Electrical and Computer Engineering, Technion

Greg Ongie
Department of Mathematical and Statistical Sciences, Marquette University

Yaniv Blumenfeld, Nir Weinberger, Daniel Soudry
Electrical and Computer Engineering, Technion

{chenzeno,yanivbl}@campus.technion.ac.il, [email protected]
[email protected], [email protected]
Abstract

Neural network (NN) denoisers are an essential building block in many common tasks, ranging from image reconstruction to image generation. However, the success of these models is not well understood from a theoretical perspective. In this paper, we aim to characterize the functions realized by shallow ReLU NN denoisers — in the common theoretical setting of interpolation (i.e., zero training loss) with a minimal representation cost (i.e., minimal \ell^{2} norm weights). First, for univariate data, we derive a closed form for the NN denoiser function, find it is contractive toward the clean data points, and prove it generalizes better than the empirical MMSE estimator at a low noise level. Next, for multivariate data, we find the NN denoiser functions in a closed form under various geometric assumptions on the training data: data contained in a low-dimensional subspace, data contained in a union of one-sided rays, or several types of simplexes. These functions decompose into a sum of simple rank-one piecewise linear interpolations aligned with edges and/or faces connecting training samples. We empirically verify this alignment phenomenon on synthetic data and real images.

1 Introduction

The ability to reconstruct an image from a noisy observation has been studied extensively in the last decades, as it is useful for many practical applications (e.g., Hasinoff et al. (2010)). In recent years, Neural Network (NN) denoisers commonly replace classical expert-based approaches as they achieve substantially better results than the classical approaches (e.g., Zhang et al. (2017)). Beyond this natural usage, NN denoisers also serve as essential building blocks in a variety of common computer vision tasks, such as solving inverse problems (Zhang et al., 2021) and image generation (Song and Ermon, 2019; Ho et al., 2020). To better understand the role of NN denoisers in such complex applications, we first wish to theoretically understand the type of solutions they converge to.

In practice, when training denoisers, we sample multiple noisy samples for each clean image and minimize the Mean Squared Error (MSE) loss for recovering the clean image. Since we sample numerous noisy samples per clean sample, the number of training samples is typically larger than the number of parameters in the network. Interestingly, even in such an under-parameterized regime, the loss has multiple global minima corresponding to distinct denoiser functions which achieve zero loss on the observed data. To characterize these functions, we study, similarly to previous works (Savarese et al., 2019; Ongie et al., 2020), the shallow NN solutions that interpolate the training data with minimal representation cost, i.e., where the \ell^{2}-norm of the weights (without biases and skip connections) is as small as possible. This is because we converge to such min-cost solutions when we minimize the loss with a vanishingly small \ell^{2} regularization on these weights.

We first examine the univariate input case: building on existing results (Hanin, 2021), we characterize the min-cost interpolating solution and its generalization to unseen data. Next, we aim to extend this analysis to the multivariate case. However, this is challenging, since, to the best of our knowledge, there are no results that explicitly characterize these min-cost solutions for general multivariate shallow NNs — except in two basic cases. In the first case, the input data is co-linear (Ergen and Pilanci, 2021). In the second case, the input samples are identical to their target outputs, so the trivial min-cost solution is the identity function. The NN denoisers’ training regime is ‘near’ the second case: there, the input samples are noisy versions of the clean target outputs. Interestingly, we find that this regime leads to non-trivial min-cost solutions far from identity — even with an infinitesimally small input noise. We analytically investigate these solutions here.

Our Contributions.

We study the NN solutions in the setting of interpolation of noisy samples with min-cost, in a practically relevant “low noise regime” where the noisy samples are well clustered. In the univariate case,

  • We find a closed-form solution for the minimum representation cost NN denoiser. Then, we prove this solution generalizes better than the empirical minimum MSE (eMMSE) denoiser.

  • We prove this min-cost NN solution is contractive toward the clean data points, that is, applying the denoiser necessarily reduces the distance of a noisy sample to one of the clean samples.

In the multivariate case,

  • We derive a closed-form solution for the min-cost NN denoiser in the multivariate case under various assumptions on the geometric configuration of the clean training samples. To the best of our knowledge, this is the first set of results to explicitly characterize a min-cost interpolating NN in a non-basic multivariate setting.

  • We illustrate a general alignment phenomenon of min-cost NN denoisers in the multivariate setting: the optimal NN denoiser decomposes into a sum of simple rank-one piecewise linear interpolations aligned with edges and/or faces connecting clean training samples.

2 Preliminaries and problem setting

The denoising problem.

Let \bm{y}\in\mathbb{R}^{d} be a noisy observation of \bm{x}\in\mathbb{R}^{d}, such that \bm{y}=\bm{x}+\bm{\epsilon}, where \bm{x} and \bm{\epsilon} are statistically independent and \mathbb{E}[\bm{\epsilon}]=\bm{0}. Commonly, this noise is Gaussian with covariance matrix \sigma^{2}\bm{I}. The ultimate goal of a denoiser \hat{\bm{x}}\left(\bm{y}\right) is to minimize the MSE loss over the joint probability distribution of the data and the noisy observation (the "population distribution"), i.e., to minimize

\mathcal{L}\left(\hat{\bm{x}}\right)=\mathbb{E}_{\bm{x},\bm{y}}\left\|\hat{\bm{x}}\left(\bm{y}\right)-\bm{x}\right\|^{2}\,.  (1)

The well-known optimal solution for (1) is the minimum mean square error (MMSE) denoiser, i.e.,

\hat{\bm{x}}^{*}(\bm{y})=\mathbb{E}_{\bm{x}|\bm{y}}[\bm{x}\mid\bm{y}]\in{\arg\min}_{\hat{\bm{x}}\left(\bm{y}\right)}\mathbb{E}_{\bm{x},\bm{y}}\left\|\bm{x}-\hat{\bm{x}}\left(\bm{y}\right)\right\|^{2}\,.  (2)

Since we do not have access to the distribution of the data, and hence not to the posterior distribution, we rely on a finite amount of clean data \{\bm{x}_{n}\}_{n=1}^{N} in order to learn a good approximation for the MMSE estimator. One approach is to assume an empirical data distribution and derive the optimal solution of (1), i.e., the empirical minimum mean square error (eMMSE) denoiser,

\hat{\bm{x}}^{\mathrm{eMMSE}}\left(\bm{y}\right)\in\underset{\hat{\bm{x}}\left(\bm{y}\right)}{\arg\min}\,\frac{1}{N}\sum_{n=1}^{N}\mathbb{E}_{\bm{y}|\bm{x}_{n}}\left\|\hat{\bm{x}}\left(\bm{y}\right)-\bm{x}_{n}\right\|^{2}\,.  (3)

If the noise is Gaussian with a covariance of \sigma^{2}\bm{I}, an explicit solution to the eMMSE is given by

\hat{\bm{x}}^{\mathrm{eMMSE}}\left(\bm{y}\right)=\frac{\sum_{n=1}^{N}\bm{x}_{n}\exp\left(-\frac{\left\|\bm{y}-\bm{x}_{n}\right\|^{2}}{2\sigma^{2}}\right)}{\sum_{n=1}^{N}\exp\left(-\frac{\left\|\bm{y}-\bm{x}_{n}\right\|^{2}}{2\sigma^{2}}\right)}\,.  (4)
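For illustration, the eMMSE denoiser (4) can be evaluated directly from the clean training set. The following NumPy sketch (ours, for illustration only; the function and variable names are not from the paper) computes (4) with a numerically stable softmax over squared distances.

import numpy as np

def emmse_denoiser(y, X, sigma):
    """Empirical MMSE denoiser of Eq. (4): a softmax-weighted average of the clean points.

    y     : (d,) noisy input
    X     : (N, d) clean training points x_1, ..., x_N
    sigma : noise standard deviation
    """
    d2 = np.sum((X - y) ** 2, axis=1)          # ||y - x_n||^2
    logits = -d2 / (2.0 * sigma ** 2)
    logits -= logits.max()                     # numerical stability
    w = np.exp(logits)
    w /= w.sum()
    return w @ X                               # convex combination of the clean points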

An alternative approach to computing the eMMSE directly is to draw M noisy samples for each clean data point, as \bm{y}_{n,m}=\bm{x}_{n}+\bm{\epsilon}_{n,m}, where \bm{\epsilon}_{n,m}\sim\mathcal{N}\left(\bm{0},\sigma^{2}\bm{I}\right) are independent and identically distributed, and to minimize the following loss function

\mathcal{L}_{\mathrm{offline},M}\left(\hat{\bm{x}}\right)=\frac{1}{MN}\sum_{m=1}^{M}\sum_{n=1}^{N}\left\|\hat{\bm{x}}\left(\bm{y}_{n,m}\right)-\bm{x}_{n}\right\|^{2}\,.  (5)
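The offline objective (5) only requires the finite set of noisy samples y_{n,m}. A minimal sketch of this setup (assuming Gaussian noise; the helper names and NumPy conventions are our own) is:

import numpy as np

def make_noisy_dataset(X, M, sigma, seed=0):
    """Draw M noisy copies y_{n,m} = x_n + eps_{n,m} of each clean point (setup of Eq. (5))."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    return X[None, :, :] + rng.normal(scale=sigma, size=(M, N, d))   # shape (M, N, d)

def offline_loss(denoiser, Y, X):
    """Offline MSE of Eq. (5): average of ||f(y_{n,m}) - x_n||^2 over all n and m."""
    M, N, _ = Y.shape
    preds = np.stack([[denoiser(Y[m, n]) for n in range(N)] for m in range(M)])
    return np.mean(np.sum((preds - X[None]) ** 2, axis=-1))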
Denoiser model and algorithms.

In practice, we approximate the optimal denoiser \hat{\bm{x}}\left(\bm{y}\right) using a parametric model \bm{h}_{\bm{\theta}}\left(\bm{y}\right), typically a NN. We focus on a shallow ReLU network model with a skip connection of the form

\bm{h}_{\theta}(\bm{y})=\sum_{k=1}^{K}{\bm{a}}_{k}[{\bm{w}}_{k}^{\top}\bm{y}+b_{k}]_{+}+{\bm{V}}\bm{y}+{\bm{c}}  (6)

where \theta=((\theta_{k})_{k=1}^{K};{\bm{c}},{\bm{V}}) with \theta_{k}=(b_{k},{\bm{a}}_{k},{\bm{w}}_{k})\in\mathbb{R}\times\mathbb{R}^{d}\times\mathbb{R}^{d} and {\bm{c}}\in\mathbb{R}^{d},{\bm{V}}\in\mathbb{R}^{d\times d}. We train the model on a finite set of clean data points \left\{\bm{x}_{n}\right\}_{n=1}^{N}. The common practical training method is based on an online approach. First, we sample a random batch (with replacement) from the data, \mathcal{B}\subseteq\{\bm{x}_{n}\}_{n=1}^{N}. Then, for each clean data point \bm{x}_{n}\in\mathcal{B}, we draw a noisy sample \bm{y}_{n}=\bm{x}_{n}+\bm{\epsilon}_{n}, where \bm{\epsilon}_{n}\sim\mathcal{N}\left(\bm{0},\sigma^{2}\bm{I}\right) are independent of the clean data points and other noise samples. At each iteration t out of T iterations, we update the model parameters according to a stochastic gradient descent rule, with a vanishingly small regularization term \lambda C\left(\bm{\theta}\right), that is,

\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta\nabla_{\bm{\theta}_{t}}\frac{1}{|\mathcal{B}|}\sum_{n\in\mathcal{B}}\left\|\bm{h}_{\bm{\theta}_{t}}\left(\bm{y}_{n}\right)-\bm{x}_{n}\right\|^{2}-\eta\lambda\nabla_{\bm{\theta}_{t}}C\left(\bm{\theta}_{t}\right)\,.  (7)

Another training method (Chen et al., 2014) is based on an offline approach. We sample M noisy samples for each clean data point and minimize (5) plus a regularization term

\mathcal{L}_{\mathrm{offline},M}\left(\bm{\theta}\right)=\frac{1}{MN}\sum_{m=1}^{M}\sum_{n=1}^{N}\left\|\bm{h}_{\theta}\left(\bm{y}_{n,m}\right)-\bm{x}_{n}\right\|^{2}+\lambda C\left(\bm{\theta}\right)\,.  (8)

Similarly to previous works (Savarese et al., 2019; Ongie et al., 2020), we assume an \ell^{2} penalty on the weights, but not on the biases and skip connections, i.e.,

C(\theta)=\frac{1}{2}\sum_{k=1}^{K}\left(\|{\bm{a}}_{k}\|^{2}+\|{\bm{w}}_{k}\|^{2}\right)\,.  (9)
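A minimal PyTorch sketch of the model (6) trained on the offline objective (8) with the penalty (9) is given below; the class and function names, optimizer usage, and hyperparameters are illustrative choices of ours, not the exact training code used in the experiments.

import torch
import torch.nn as nn

class ShallowDenoiser(nn.Module):
    """One-hidden-layer ReLU denoiser with a linear skip connection, as in Eq. (6)."""
    def __init__(self, d, K):
        super().__init__()
        self.hidden = nn.Linear(d, K)             # rows of .weight are w_k, biases are b_k
        self.out = nn.Linear(K, d)                # columns of .weight are a_k, bias is c
        self.skip = nn.Linear(d, d, bias=False)   # V

    def forward(self, y):
        return self.out(torch.relu(self.hidden(y))) + self.skip(y)

def penalty(model):
    """C(theta) of Eq. (9): l2 penalty on w_k and a_k only (no biases, no skip connection)."""
    return 0.5 * (model.hidden.weight.pow(2).sum() + model.out.weight.pow(2).sum())

def offline_step(model, opt, Y, X, lam=1e-5):
    """One gradient step on the offline objective (8); Y holds noisy inputs, X the clean targets."""
    opt.zero_grad()
    loss = ((model(Y) - X) ** 2).sum(dim=1).mean() + lam * penalty(model)
    loss.backward()
    opt.step()
    return loss.item()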
Low noise regime.

In this paper, we study the solution of the NN denoiser when the clusters of noisy samples around each clean point are well-separated, a setting which we refer to as the "low noise regime". This is a practically relevant regime since denoisers are typically used when the noise level is mild. Indeed, common image-denoising benchmarks test on low (but not negligible) noise levels. For instance, in the commonly used denoising benchmark BSD68 (Roth and Black, 2009), the noise level \sigma=0.1 is in the low noise regime.[1] Moreover, this setting is important, for example, in diffusion-based image generation, since at the end of the reverse denoising process, new images are sampled by denoising smaller and smaller noise levels.[2]
[1] The minimum distance between two images in BSD68 is about 97, while the image resolution is d=481\times 321. Also, the norm of the noise concentrates around the value \sqrt{d}\sigma\approx\sqrt{481\cdot 321}\cdot 0.1\approx 40<97. Therefore, the clusters of noisy samples around each clean point are generally well-separated.
[2] Interestingly, it was suggested that the "useful" part of the diffusion dynamics happens only below some critical noise level (Raya and Ambrogioni, 2023).

3 Basic properties of neural network denoisers

Offline vs. online NN solutions.

Figure 1: NN denoiser vs. eMMSE denoiser. We trained a one-hidden-layer ReLU network with a skip connection on a denoising task. The clean dataset has four points equally spaced in the interval [-5,5], and the noisy samples are generated by adding zero-mean Gaussian noise with \sigma=1.5. We use \lambda=10^{-5} in both settings. The figure shows the denoiser output as a function of its input for: (1) the NN denoiser trained online using (7) for 100K iterations, (2) the NN denoiser trained offline using (8) with M=9000 and 20K epochs, and (3) the eMMSE denoiser (4).

NN denoisers are traditionally trained in an online fashion (7), using a finite number of T iterations. Consequently, only a finite number of noisy samples are used for each clean data point. We empirically observe that the solutions in the offline and online settings are similar. Specifically, in the univariate case, we show in Figure 1 that denoisers based on the offline and online loss functions converge to indistinguishable solutions. For the multivariate case, we observe (Figure 11 in Appendix D) that the offline and online solutions achieve approximately the same test MSE when trained on a subset of the MNIST dataset. The comparison uses the same number of iterations for both training methods, while using far fewer noisy samples in the offline setting. Evidently, this lower number of samples does not significantly affect the generalization error. Hence, in the rest of the paper, we focus on offline training (i.e., minimizing the offline loss \mathcal{L}_{\mathrm{offline},M}), as it defines an explicit loss function with solutions that can be theoretically analyzed, as in (Savarese et al., 2019; Ongie et al., 2020).

The empirical MMSE denoiser.

The law of large numbers implies that the denoiser minimizing the offline loss \mathcal{L}_{\mathrm{offline},M} approaches the eMMSE estimator in the limit of infinitely many noisy samples,

\hat{\bm{x}}^{\mathrm{eMMSE}}\left(\bm{y}\right)\in{\arg\min}_{\hat{\bm{x}}}\lim_{M\to\infty}\mathcal{L}_{\mathrm{offline},M}\left(\hat{\bm{x}}\right)\,.  (10)

Therefore, it may seem that for a reasonable number of noisy samples M, a large enough model, and small enough regularization, the denoiser we get by minimizing the offline loss (8) will also be similar to the eMMSE estimator. However, Figure 1 shows that the eMMSE solution and the NN solutions (both online and offline) are quite different. The eMMSE denoiser has a much sharper transition and maps almost all inputs to the value of a clean data point. This is because in the case of low noise the eMMSE denoiser (4) approximates the one-nearest-neighbor (1-NN) classifier, i.e.,

\lim_{\sigma\rightarrow 0^{+}}\hat{\bm{x}}^{\mathrm{eMMSE}}\left(\bm{y}\right)=\underset{\bm{x}\in\left\{\bm{x}_{i}\right\}_{i=1}^{N}}{\arg\min}\left\|\bm{y}-\bm{x}\right\|\,.  (11)
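The limit (11) is easy to verify numerically from the closed form (4): as \sigma decreases, the softmax weights concentrate on the nearest clean point. The points below are arbitrary and chosen only for illustration.

import numpy as np

X = np.array([[-5.0], [-1.0], [2.0], [6.0]])   # clean points (d = 1)
y = np.array([0.9])                            # query lying between two clean points
for sigma in [2.0, 0.5, 0.1, 0.02]:
    logits = -np.sum((X - y) ** 2, axis=1) / (2 * sigma ** 2)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    print(sigma, float(w @ X[:, 0]))           # tends to 2.0, the nearest clean point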

In contrast, the NN denoiser maps each noisy sample to its corresponding clean sample only in a limited “noise ball” around the clean point, and interpolates near-linearly between the “noise balls”. Hence, we may expect that the smoother NN denoiser typically generalizes better than the eMMSE denoiser. We prove that this is indeed true for the univariate case in Section 4.

Why does the NN denoiser not converge to the eMMSE denoiser? Note that the limit in (10) is not practically relevant in the low noise regime, since we need an exponentially large M in order to converge in this limit. For example, in the case of univariate Gaussian noise, we have P\left(|\epsilon|>t\right)\leq 2\exp\left(-\frac{t^{2}}{2\sigma^{2}}\right) for all t>0. Therefore, during training, we effectively observe only noisy samples that lie in a bounded interval of size 2\sigma\sqrt{\log M} around each clean sample (see Figure 1). In other words, in the low noise regime and for non-exponential M, there is no way to distinguish whether the noise is sampled from some distribution with limited support or from a Gaussian distribution. The denoiser minimizing the loss with respect to a bounded-support distribution can be radically different from the eMMSE denoiser in the regions outside the "noise balls" surrounding the clean samples, where the denoiser function is not constrained by the loss. This leads to a large difference between the NN denoiser and the MMSE estimator.

Alternatively, one may suggest that the NN denoiser does not converge to the eMMSE denoiser due to an approximation error (i.e., the shallow NN's model capacity is too small to approximate the MMSE denoiser). Nevertheless, we provide empirical evidence indicating this is not the case. Specifically, recall that in the low noise regime, the eMMSE denoiser tends to the nearest-neighbor classifier, and such a solution does not generalize well to test data. Thus, had the NN denoiser converged to the eMMSE solution, its test error would have increased with the network size, in contrast to what we observe in Figure 12 (Appendix D).

Therefore, in order to approximate the eMMSE with a NN, it seems we must have an exponentially large M. Alternatively, we may converge to the eMMSE if we use a loss function marginalized over the Gaussian noise. This idea was previously suggested by Chen et al. (2014), with the goal of effectively increasing the number of noisy samples and thus improving the training performance of denoising autoencoders. Therein, this improvement was obtained by approximating the marginalized loss function by a Taylor series expansion. However, for shallow denoisers, we may actually obtain an explicit expression for this marginalized loss, without any approximation. Specifically, if we assume, for simplicity, that the network does not have a linear unit ({\bm{V}}=\bm{0}) and its bias terms are zero ({\bm{c}}=\bm{0},\,b_{k}=0), then the marginalized loss for Gaussian noise, derived in Appendix A, is given by

\mathcal{L}\left(\bm{\theta},\sigma\right)=\mathbb{E}_{\bm{x},\bm{y}}\left\|\bm{h}_{\theta}\left(\bm{y}\right)-\bm{x}\right\|^{2}=\mathbb{E}_{\bm{x},\bm{y}}\left\|\sum_{k=1}^{K}\bm{a}_{k}[\bm{w}_{k}^{\top}\bm{y}]_{+}-\bm{x}\right\|^{2}=\mathbb{E}_{\bm{x}}\mathbb{E}_{\bm{y}|\bm{x}}\left\|\sum_{k=1}^{K}\bm{a}_{k}[\bm{w}_{k}^{\top}\bm{y}]_{+}-\bm{x}\right\|^{2}
=\mathbb{E}_{\bm{x}}\left\|\sum_{k=1}^{K}\bm{a}_{k}\tilde{\phi}\left(\bm{\hat{w}}_{k}^{\top}\bm{x},\left\|\bm{w}_{k}\right\|,\sigma\right)-\bm{x}\right\|^{2}+\sum_{i=1}^{K}\sum_{j=1}^{K}\bm{a}_{i}^{\top}\bm{a}_{j}\mathbf{H}_{ij}\left(\bm{w}_{i},\bm{w}_{j},\sigma^{2}\right)\,,  (12)

where \bm{w}_{k}=\bm{\hat{w}}_{k}\left\|\bm{w}_{k}\right\| and \mathbf{H}\succeq 0, \tilde{\phi}\left(\cdot\right) are defined in Appendix A. NN denoisers trained over this loss function will thus tend to the eMMSE solution as the network size is increased. However, as we explained above, this is not necessarily desirable, so we only mention (12) to show exact marginalization is feasible.

Regularization biases toward specific neural network denoisers.

To further explore the converged solution for offline training, we note that the offline loss function \mathcal{L}_{\mathrm{offline},M}\left(\bm{\theta}\right) allows the network to converge to a zero-loss solution. This is in contrast to online training, for which each batch leads to new realizations of noisy samples, and thus the training error is never exactly zero. Specifically, consider the low noise regime (well-separated noisy clusters). Then, the network can perfectly fit all the noisy samples using a finite number of neurons (see Section 4 for a more accurate description in the univariate case). Importantly, there are multiple ways to cluster the noisy data points with such neurons, and so there are multiple global training loss minima that the network can achieve with zero loss, each with a different generalization capability. In contrast to the standard case considered in the literature, this holds even in the under-parameterized case (where NM, the total number of noisy samples, is larger than the number of parameters).

Since there are many minima that perfectly fit the training data, we converge to specific minima which also minimize the \ell^{2} regularization we use (even though we assumed it is vanishing). Specifically, in the limit of vanishing regularization C\left(\theta\right), the minimizers of \mathcal{L}_{\mathrm{offline},M}\left(\bm{\theta}\right) also minimize the representation cost.

Definition 1.

Let \bm{h}_{\theta}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} denote a shallow ReLU network of the form (6). For any function \bm{f}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} realizable as a shallow ReLU network, we define its representation cost as

R(\bm{f})=\inf_{\theta:\,\bm{f}=\bm{h}_{\theta}}C\left(\theta\right)=\inf_{\theta}\sum_{k=1}^{K}\|{\bm{a}}_{k}\|\quad\mathrm{s.t.}\quad\|\bm{w}_{k}\|=1~\forall k,\ \bm{f}=\bm{h}_{\theta}\,,  (13)

and a minimizer of this cost, i.e., the 'min-cost' solution, as

\bm{f}^{*}\in\arg\min_{\bm{f}}R(\bm{f})\quad\mathrm{s.t.}\quad{\bm{f}}(\bm{y}_{n,m})={\bm{x}}_{n}~~\forall n,m\,,  (14)

where the second equality in (13) holds due to the 1-homogeneity of the ReLU activation function, and since the bias terms are not regularized (see Savarese et al. (2019), Appendix A). In the next sections, we examine which function we obtain by minimizing the representation cost R(\bm{f}) in various settings.
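Because of the 1-homogeneity argument behind the second equality in (13), the cost of a given parameterization can be evaluated after balancing each unit; equivalently, the balanced cost equals \sum_{k}\|{\bm{a}}_{k}\|\|{\bm{w}}_{k}\|, which is invariant to per-unit rescaling. A short sketch (the helper and array conventions are our own):

import numpy as np

def representation_cost(W, A):
    """Balanced form of Eq. (13).

    W : (K, d) inner-layer weights w_k (rows)
    A : (d, K) outer-layer weights a_k (columns)

    Rescaling a_k -> a_k / c_k and w_k -> c_k * w_k leaves h_theta unchanged, and
    0.5 * (||a_k||^2 / c_k^2 + c_k^2 * ||w_k||^2) is minimized at ||a_k|| * ||w_k||,
    so the minimal C(theta) over such rescalings is sum_k ||a_k|| * ||w_k||.
    """
    return np.sum(np.linalg.norm(A, axis=0) * np.linalg.norm(W, axis=1))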

4 Closed form solution for the NN denoiser function — univariate data

In this section, we prove that NN denoisers that minimize R(f) for univariate data have the specific piecewise linear form observed in Figure 1, and we discuss the properties of this form. We observe N clean univariate data points \left\{x_{n}\right\}_{n=1}^{N}, s.t. -\infty<x_{1}<x_{2}<\cdots<x_{N}<\infty, and M noisy samples (drawn from some known distribution) for each clean data point, such that y_{n,m}=x_{n}+\epsilon_{n,m}. We denote by \epsilon^{\mathrm{max}}_{n} the maximal noise seen for data point x_{n}, and by \epsilon^{\mathrm{min}}_{n} the minimal noise seen for data point x_{n}, i.e.,

\epsilon^{\mathrm{max}}_{n}\equiv\max_{m}\epsilon_{n,m},\quad\epsilon^{\mathrm{min}}_{n}\equiv\min_{m}\epsilon_{n,m}\,,  (15)

and assume the following,

Assumption 1.

Assume the data \left\{x_{n}\right\}_{n=1}^{N} is well-separated after the addition of noise, i.e.,

\forall n\in[N-1]:\,x_{n}+\epsilon^{\mathrm{max}}_{n}<x_{n+1}+\epsilon^{\mathrm{min}}_{n+1}\,,  (16)

and \epsilon^{\mathrm{max}}_{n}>0,\ \epsilon^{\mathrm{min}}_{n}<0.

So we can state the following,

Proposition 1.

For all datasets such that Assumption 1 holds, the unique minimizer of R(f) is

f^{*}_{1D}(y)=\begin{cases}x_{1},&y<x_{1}+\epsilon^{\mathrm{min}}_{1}\\ x_{n},&x_{n}+\epsilon^{\mathrm{min}}_{n}\leq y\leq x_{n}+\epsilon^{\mathrm{max}}_{n}\\ \frac{x_{n+1}-x_{n}}{x_{n+1}+\epsilon^{\mathrm{min}}_{n+1}-\left(x_{n}+\epsilon^{\mathrm{max}}_{n}\right)}\left(y-\left(x_{n}+\epsilon^{\mathrm{max}}_{n}\right)\right)+x_{n},&x_{n}+\epsilon^{\mathrm{max}}_{n}<y<x_{n+1}+\epsilon^{\mathrm{min}}_{n+1}\\ x_{N},&y>x_{N}+\epsilon^{\mathrm{max}}_{N}\end{cases}\,.  (17)

The proof (which is based on Theorem 1.2 in Hanin (2021)) can be found in Appendix B.1. As can be seen from Figure 1, the empirical simulation matches Proposition 1.[3] Proposition 1 states that (17) is a closed-form solution for (8) with minimal representation cost. Notice that the minimal number of neurons needed to represent f^{*}_{1D} using h_{\theta}\left(y\right) is 2N-2, which is less than the total number of training samples NM for M\geq 2.
[3] Notice that the training points shown in Figure 1 are those used in the online setting (7); in the offline setting (8) we observe fewer noisy samples.
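A direct implementation of the closed form (17) is straightforward; the sketch below (assuming sorted clean points and well-separated clusters, per Assumption 1; the function name is ours) evaluates f^{*}_{1D} at a single input.

import numpy as np

def f_star_1d(y, x, eps_min, eps_max):
    """Min-cost univariate denoiser of Eq. (17).

    x        : sorted clean points x_1 < ... < x_N
    eps_min  : per-point minimal observed noise (negative values)
    eps_max  : per-point maximal observed noise (positive values)
    """
    lo, hi = x + eps_min, x + eps_max            # edges of each noisy cluster
    if y <= lo[0]:
        return x[0]
    if y >= hi[-1]:
        return x[-1]
    for n in range(len(x)):
        if lo[n] <= y <= hi[n]:
            return x[n]                          # constant on the n-th noisy cluster
        if n + 1 < len(x) and hi[n] < y < lo[n + 1]:
            slope = (x[n + 1] - x[n]) / (lo[n + 1] - hi[n])
            return x[n] + slope * (y - hi[n])    # linear interpolation between clusters
    raise ValueError("clusters must be well-separated (Assumption 1)")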

In the case of univariate data, we can prove that the representation cost minimizer f^{*}_{1D} (linear interpolation) generalizes better than the optimal estimator over the empirical distribution (eMMSE) for low noise levels.

Theorem 1.

Let y=x+\epsilon where x\sim p_{x}\left(x\right) and \epsilon\sim\mathcal{N}\left(0,\sigma^{2}\right), with x and \epsilon statistically independent. Then for all datasets such that Assumption 1 holds, and for all probability density functions p_{x}\left(x\right) with bounded second moment such that p_{x}\left(x\right)>0 for all x\in\left[\min_{n}{x_{n}},\max_{n}{x_{n}}\right], the following holds,

\lim_{\sigma\to 0^{+}}\mathrm{MSE}\left(\hat{x}^{\mathrm{eMMSE}}\left(y\right)\right)>\lim_{\sigma\to 0^{+}}\mathrm{MSE}\left(f^{*}_{1D}\left(y\right)\right)\,.

See Appendix B.2 for the proof. We may deduce from Theorem 1 that for each probability density p\left(x\right) there exists a critical noise level below which the representation cost minimizer f^{*}_{1D} has strictly lower MSE than the eMMSE (this is because the MSE is a continuous function of \sigma). The critical noise level can change significantly depending on p\left(x\right). For example, if p\left(x\right) has a high "mass" in between the training points, then the critical noise level is large. However, if the density function has a low "mass" between the training points, then the critical noise level is small. In Appendix D we show the MSE vs. the noise level of the NN denoiser and the eMMSE denoiser on MNIST (Figure 13). As can be seen there, the critical noise level in this case is not small (\sim 5).

Intuitively, the difference between the NN denoiser and the eMMSE denoiser is how they operate on inputs that are not close to any of the clean samples (relative to the noise standard deviation). For such a point, the eMMSE denoiser does not take into account that the empirical distribution of the clean samples does not approximate their true distribution well. Thus, for small noise, it insists on "assigning" the input to the closest clean sample point. By contrast, the NN denoiser generalizes better since it takes into account that, far from the clean samples, the data distribution is not well approximated by the empirical sample distribution. Thus, its operation there is near the identity function, with a small contraction toward the clean points, as we discuss next.

Minimal norm leads to contractive solutions on univariate data.

Radhakrishnan et al. (2018) have empirically shown that Auto-Encoders (AE, i.e., NN denoisers without input noise) are locally contractive toward the training samples. Specifically, they showed that the clean dataset can be recovered by iterating the AE output multiple times until convergence. Additionally, they showed that, as we increase the width or the depth of the NN, the network becomes more contractive toward the training examples. In addition, Radhakrishnan et al. (2018) proved that 2-layer AE models are locally contractive under strong assumptions (the weights of the input layer are fixed and the number of neurons goes to infinity). Next, we prove that a univariate shallow NN denoiser is globally contractive toward the clean data points without the assumptions used by Radhakrishnan et al. (2018) (i.e., our minimizer optimizes over both layers and has a finite number of neurons).

Definition 2.

We say that {\bm{f}}:\mathbb{R}^{d}\to\mathbb{R}^{d} is contractive toward a set of points \{\bm{x}_{n}\}_{n=1}^{N} on \mathcal{Y}\subseteq\mathbb{R}^{d} if there exists a real number 0\leq\alpha<1 such that for any \bm{y}\in\mathcal{Y} there exists i\in[N] so that

\left\|\bm{f}\left(\bm{y}\right)-\bm{f}\left(\bm{x}_{i}\right)\right\|\leq\alpha\left\|\bm{y}-\bm{x}_{i}\right\|\,.  (18)
Lemma 1.

f^{*}_{1D}\left(y\right) is contractive toward the clean training points \{\bm{x}_{n}\}_{n=1}^{N} on \mathcal{Y}=\mathbb{R}\setminus\bigcup_{n\in[N-1]}\Bigl\{\frac{x_{n+1}\epsilon^{\mathrm{max}}_{n}-x_{n}\epsilon^{\mathrm{min}}_{n+1}}{\epsilon^{\mathrm{max}}_{n}-\epsilon^{\mathrm{min}}_{n+1}}\Bigr\}.

The proof can be found in Appendix B.3.
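Definition 2 can also be checked numerically. The sketch below builds f^{*}_{1D} by piecewise-linear interpolation (assuming, for simplicity, symmetric noise extremes) and reports the worst ratio of Definition 2 over a grid; the specific points and radius are arbitrary choices of ours.

import numpy as np

# Sorted clean points and a symmetric noise radius; the noisy clusters are well-separated.
x = np.array([-3.0, 0.0, 2.5])
rho = 0.4
knots = np.sort(np.concatenate([x - rho, x + rho]))
vals = np.repeat(x, 2)                            # f* equals x_n on the n-th cluster

def f_star(y):                                    # Eq. (17) as piecewise-linear interpolation
    return np.interp(y, knots, vals)

# Definition 2: for every y there should be some clean x_i with |f(y) - x_i| <= alpha |y - x_i|.
worst = 0.0
for y in np.linspace(-6.0, 6.0, 2000):
    ratios = [abs(f_star(y) - xi) / max(abs(y - xi), 1e-12) for xi in x]
    worst = max(worst, min(ratios))
print("worst contraction ratio over the grid:", worst)   # below 1; it approaches 1 only near
                                                         # the points excluded in Lemma 1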

5 Minimal norm leads to alignment phenomenon on multivariate data

In the multivariate case, min-cost solutions of (14) are difficult to explicitly characterize. Even in the setting of fitting scalar-valued shallow ReLU networks, explicitly characterizing min-cost solutions under interpolation constraints remains an open problem, except in some basic cases (e.g., co-linear data (Ergen and Pilanci, 2021)).

As an approximation to (14) that is more mathematically tractable, we assume the functions being fit are constant and equal to {\bm{x}}_{n} on a closed ball of radius \rho centered at each {\bm{x}}_{n}, i.e., \bm{f}({\bm{y}})={\bm{x}}_{n} for all \|{\bm{y}}-{\bm{x}}_{n}\|\leq\rho, such that the balls do not overlap. Letting B({\bm{x}}_{n},\rho) denote the ball of radius \rho centered at {\bm{x}}_{n}, we can write this constraint more compactly as \bm{f}(B({\bm{x}}_{n},\rho))=\{{\bm{x}}_{n}\}. Consider minimizing the representation cost under this constraint:

\min_{\bm{f}}R(\bm{f})\quad\mathrm{s.t.}\quad{\bm{f}}(B({\bm{x}}_{n},\rho))=\{{\bm{x}}_{n}\}~~\forall n\in[N]\,.  (19)

However, even with this approximation, explicitly describing minimizers of (19) for an arbitrary collection of training samples remains challenging. Instead, to gain intuition, we describe minimizers of (19) assuming the training samples belong to simple geometric structures that yield explicit solutions. Our results reveal a general alignment phenomenon, such that the weights of the representation cost minimizer align themselves with edges and/or faces connecting data points. We also show that approximate solutions of (14) obtained numerically by training a NN denoiser with weight decay match well with the solutions of (19) having exact closed-form expressions.

5.1 Training data on a subspace

In the event that the clean training samples belong to a subspace, we show the representation cost minimizer depends only on the projection of the inputs onto the subspace containing the training data, and its output is also constrained to this subspace.

Theorem 2.

Assume the training samples \{{\bm{x}}_{n}\}_{n=1}^{N} belong to a linear subspace {\mathcal{S}}\subset\mathbb{R}^{d}, and let {\bm{P}}_{\mathcal{S}}\in\mathbb{R}^{d\times d} denote the orthogonal projector onto {\mathcal{S}}. Then any minimizer \bm{f}^{*} of (19) satisfies \bm{f}^{*}({\bm{y}})={\bm{P}}_{\mathcal{S}}\bm{f}^{*}({\bm{P}}_{\mathcal{S}}{\bm{y}}) for all {\bm{y}}\in\mathbb{R}^{d}.

The proof of this result and all others in this section is given in Appendix C.

Note that the assumption that the dataset lies on a subspace is practically relevant, since, in general, large datasets are (approximately) low rank, i.e., lie on a linear subspace (Udell and Townsend, 2019). In Appendix D we also validate that common image datasets are (approximately) low rank (Table 1).

Specializing to the case of co-linear training data (i.e., training samples belonging to a one-dimensional subspace), the min-cost solution is unique and is described by the following corollary:

Corollary 1.

Assume the training samples \{{\bm{x}}_{n}\}_{n=1}^{N} are co-linear, i.e., {\bm{x}}_{n}=c_{n}{\bm{u}} for some scalars c_{1}<c_{2}<\cdots<c_{N}, where {\bm{u}}\in\mathbb{R}^{d} is a unit vector. Then the minimizer \bm{f}^{*} of (19) is unique and given by \bm{f}^{*}({\bm{y}})={\bm{u}}\phi({\bm{u}}^{\top}{\bm{y}}), where \phi:\mathbb{R}\rightarrow\mathbb{R} has the same form as the 1-D minimizer (17) f^{*}_{1D} with x_{n}=c_{n} and \epsilon_{n}^{\mathrm{max}}=-\epsilon_{n}^{\mathrm{min}}=\rho.

In other words, the min-cost solution has a particularly simple form in this case: \bm{f}^{*}({\bm{y}})={\bm{u}}\phi({\bm{u}}^{\top}{\bm{y}}), where \phi is a monotonic piecewise linear function. We call any function of this form a rank-one piecewise linear interpolator. Below we show that many other min-cost solutions can be expressed as superpositions of rank-one piecewise linear interpolators.

5.2 Training data on rays

As an extension of the previous setting, we now consider training data belonging to a union of one-sided rays sharing a common origin. Assuming the rays are well-separated (in a sense made precise below) we prove that the representation cost minimizer decomposes into a sum of rank-one piecewise linear interpolators aligned with each ray.

Theorem 3.

Suppose the training samples X belong to a union of L rays plus a sample at the origin: X=\{\bm{0}\}\cup\{{\bm{x}}_{n}^{(1)}\}_{n=1}^{N_{1}}\cup\cdots\cup\{{\bm{x}}_{n}^{(L)}\}_{n=1}^{N_{L}}, where {\bm{x}}_{n}^{(\ell)}=c^{(\ell)}_{n}{\bm{u}}_{\ell} for some unit vector {\bm{u}}_{\ell} and constants 0<c^{(\ell)}_{1}<c^{(\ell)}_{2}<\cdots<c^{(\ell)}_{N_{\ell}}. Assume that the rays make obtuse angles with each other (i.e., {\bm{u}}_{\ell}^{\top}{\bm{u}}_{k}<0 for all \ell\neq k). Then the minimizer \bm{f}^{*} of (19) is unique and is given by

\bm{f}^{*}({\bm{y}})={\bm{u}}_{1}\phi_{1}({\bm{u}}_{1}^{\top}{\bm{y}})+\cdots+{\bm{u}}_{L}\phi_{L}({\bm{u}}_{L}^{\top}{\bm{y}})\,,  (20)

where \phi_{\ell}:\mathbb{R}\rightarrow\mathbb{R} has the form of the 1-D minimizer (17) f^{*}_{1D} with x_{n}=c_{n}^{(\ell)}, \epsilon_{n}^{\mathrm{max}}=-\epsilon_{n}^{\mathrm{min}}=\rho.
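The structural form (20) is easy to instantiate: each ray contributes a rank-one piecewise-linear interpolator whose 1-D profile is built by the same interpolation rule as (17). In the sketch below (ours, including the example geometry) we include the origin among each profile's knots, which accounts for the sample at the origin and is consistent with Proposition 2 below.

import numpy as np

def make_phi(c, rho):
    """1-D profile of the form (17) with knots c (sorted) and eps_max = -eps_min = rho."""
    c = np.sort(np.asarray(c, dtype=float))
    knots = np.stack([c - rho, c + rho], axis=1).ravel()   # c_1-rho, c_1+rho, c_2-rho, ...
    return lambda t: np.interp(t, knots, np.repeat(c, 2))

def rank_one_sum(y, U, profiles):
    """Evaluate f(y) = sum_l u_l * phi_l(u_l^T y), the structure of Eq. (20)."""
    return sum(U[l] * profiles[l](U[l] @ y) for l in range(len(U)))

# Example: two obtuse rays in R^2 (u_1^T u_2 < 0), one sample per ray plus the origin.
U = np.array([[1.0, 0.0], [-0.6, 0.8]])
profiles = [make_phi([0.0, 3.0], rho=0.2), make_phi([0.0, 2.0], rho=0.2)]
print(rank_one_sum(np.array([2.0, 0.5]), U, profiles))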

Figure 2: Predicted (top row) and empirical (bottom row) min-cost NN denoisers for N=3 clean training samples in d=2 dimensions. The empirical NN denoisers were trained with weight decay parameter \lambda=10^{-5} and M=100 noisy samples. As predicted by our theory, the ReLU boundaries align either perpendicular to the triangle edges in the obtuse case (left panel), or parallel to the triangle edges (right panel).

Additionally, in Appendix C.2.1 we show that this min-cost solution is stable with respect to small perturbations of the data. In particular, if the training data is perturbed from the rays, the functional form of the min-cost solution only changes slightly, such that the inner- and outer-layer weight vectors align with the line segments connecting consecutive data points.

5.3 Special case: training data forming a simplex

Here, we study the representation cost minimizers for N\leq d+1 training points that form the vertices of a (N-1)-simplex, i.e., a (N-1)-dimensional simplex in \mathbb{R}^{d} (e.g., a 2-simplex is a triangle, a 3-simplex is a tetrahedron, etc.). As we will show, the angles between vertices of the simplex (e.g., an acute versus obtuse triangle for N=3) influence the functional form of the min-cost solution.

Our first result considers one extreme, where the simplex has one vertex that makes an obtuse angle with all other vertices (e.g., an obtuse triangle for N=3).

Proposition 2.

Suppose the convex hull of the training points \{{\bm{x}}_{1},{\bm{x}}_{2},...,{\bm{x}}_{N}\}\subset\mathbb{R}^{d} is a (N-1)-simplex such that {\bm{x}}_{1} forms an obtuse angle with all other vertices, i.e., ({\bm{x}}_{j}-{\bm{x}}_{1})^{\top}({\bm{x}}_{i}-{\bm{x}}_{1})<0 for all i\neq j with i,j>1. Then the minimizer \bm{f}^{*} of (19) is unique, and is given by

\bm{f}^{*}({\bm{y}})={\bm{x}}_{1}+\sum_{n=2}^{N}{\bm{u}}_{n}\phi_{n}({\bm{u}}_{n}^{\top}({\bm{y}}-{\bm{x}}_{1}))  (21)

where {\bm{u}}_{n}=\frac{{\bm{x}}_{n}-{\bm{x}}_{1}}{\|{\bm{x}}_{n}-{\bm{x}}_{1}\|}, \phi_{n}(t)=s_{n}([t-a_{n}]_{+}-[t-b_{n}]_{+}), with a_{n}=\rho, b_{n}=\|{\bm{x}}_{n}-{\bm{x}}_{1}\|-\rho, and s_{n}=\|{\bm{x}}_{n}-{\bm{x}}_{1}\|/(b_{n}-a_{n}) for all n=2,...,N.

This result is essentially a corollary of Theorem 3, since after translating 𝒙1{\bm{x}}_{1} to be the origin, the vertices of the simplex belong to rays making obtuse angles with each other, where there is exactly one sample per ray. Details of the proof are given in Appendix C.2.
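The construction (21) is simple to evaluate: each edge from x_1 contributes a clipped ramp along its unit direction. A minimal sketch (helper name and conventions ours):

import numpy as np

def obtuse_simplex_denoiser(y, X, rho):
    """Min-cost denoiser of Eq. (21); X[0] is the vertex x_1 that is obtuse to all others."""
    x1 = X[0]
    out = x1.copy()
    for xn in X[1:]:
        dist = np.linalg.norm(xn - x1)
        u = (xn - x1) / dist                         # unit vector along the edge x_1 -> x_n
        a, b = rho, dist - rho
        s = dist / (b - a)
        t = u @ (y - x1)
        out = out + u * s * (np.clip(t, a, b) - a)   # equals s * ([t-a]_+ - [t-b]_+)
    return out

Note that at y = x_n the ramp along the n-th edge saturates at \|{\bm{x}}_{n}-{\bm{x}}_{1}\| while, by obtuseness, all other ramps are still zero, so the sketch indeed returns x_n at each training point.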

At the opposite extreme, we consider the case where every vertex of the simplex is acute, meaning for all n=1,...,N we have ({\bm{x}}_{i}-{\bm{x}}_{n})^{\top}({\bm{x}}_{j}-{\bm{x}}_{n})>0 for all i,j\neq n. In this case, we make the following conjecture: the min-cost solution is instead a sum of N rank-one piecewise linear interpolators, each aligned orthogonal to a different (N-2)-dimensional face of the simplex.

Conjecture 1.

Suppose the convex hull of the training points \{{\bm{x}}_{1},{\bm{x}}_{2},...,{\bm{x}}_{N}\}\subset\mathbb{R}^{d} is a (N-1)-simplex where every vertex of the simplex is acute. Then the minimizer {\bm{f}}^{*} of (19) is unique, and is given by

\bm{f}^{*}({\bm{y}})=\overline{{\bm{x}}}+\sum_{n=1}^{N}{\bm{v}}_{n}\phi_{n}({\bm{u}}_{n}^{\top}({\bm{y}}-{\bm{z}}_{n}))  (22)

where: {\bm{z}}_{n} is the projection of {\bm{x}}_{n} onto the unique (N-2)-dimensional face of the simplex not containing {\bm{x}}_{n}; \overline{{\bm{x}}} is the weighted geometric median of the vertices specified by

\overline{{\bm{x}}}=\operatorname*{arg\,min}_{{\bm{x}}\in\mathbb{R}^{d}}\sum_{n=1}^{N}\frac{\|{\bm{x}}_{n}-{\bm{x}}\|}{\|{\bm{x}}_{n}-{\bm{z}}_{n}\|};

{\bm{u}}_{n}=\frac{{\bm{x}}_{n}-{\bm{z}}_{n}}{\|{\bm{x}}_{n}-{\bm{z}}_{n}\|}, {\bm{v}}_{n}=\frac{{\bm{x}}_{n}-\overline{{\bm{x}}}}{\|{\bm{x}}_{n}-\overline{{\bm{x}}}\|}; and \phi_{n}(t)=s_{n}([t-a_{n}]_{+}-[t-b_{n}]_{+}) with a_{n}=\rho, b_{n}=\|{\bm{x}}_{n}-{\bm{z}}_{n}\|-\rho, and s_{n}=\|{\bm{x}}_{n}-\overline{{\bm{x}}}\|/(b_{n}-a_{n}).
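The two geometric ingredients of Conjecture 1, the projections z_n onto the opposite faces and the weighted geometric median \overline{{\bm{x}}}, can be computed as below. The Weiszfeld-style fixed-point iteration is a standard way to approximate a weighted geometric median; it is our choice of solver, not an algorithm from the paper.

import numpy as np

def project_onto_face(xn, F):
    """Orthogonal projection of xn onto the affine hull of the rows of F."""
    base, D = F[0], (F[1:] - F[0]).T              # affine basis of the face
    if D.size == 0:
        return base.copy()
    coef, *_ = np.linalg.lstsq(D, xn - base, rcond=None)
    return base + D @ coef

def weighted_geometric_median(X, weights, iters=200):
    """Weiszfeld-style iteration for argmin_x sum_n weights[n] * ||X[n] - x||."""
    x = X.mean(axis=0)
    for _ in range(iters):
        d = np.maximum(np.linalg.norm(X - x, axis=1), 1e-12)
        w = weights / d
        x = (w[:, None] * X).sum(axis=0) / w.sum()
    return x

# Acute triangle in R^2: compute z_n, the weights 1/||x_n - z_n||, and the median of (22).
X = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 1.6]])
Z = np.array([project_onto_face(X[n], np.delete(X, n, axis=0)) for n in range(len(X))])
x_bar = weighted_geometric_median(X, 1.0 / np.linalg.norm(X - Z, axis=1))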

Justification for this conjecture is given in Appendix C.3.2. In particular, we prove that the interpolator {\bm{f}}^{*} given in (86) is a min-cost solution in the special case of three training points whose convex hull is an equilateral triangle. If true in general, this would imply a phase transition behavior in the min-cost solution when the simplex changes from having one obtuse vertex to all acute vertices, such that ReLU boundaries go from being aligned orthogonal to the edges connecting vertices, to being aligned parallel with the simplex faces. Figure 2 illustrates this for N=3 training points forming a triangle in d=2 dimensions. Moreover, Figure 2 shows that the empirical minimizer obtained using noisy samples and weight decay regularization agrees well with the form of the exact min-cost solution predicted by Proposition 2 and Conjecture 1.

In general, any given vertex of a simplex may make acute angles with some vertices and obtuse angles with others. This case is not covered by the above results. Currently, we do not have a conjectured form of the min-cost solution in this case, and we leave this as an open problem for future work.

6 Related works

Numerous methods have been proposed for image denoising. In the last decade, NN-based methods have achieved state-of-the-art results (Zhang et al., 2017; 2021). See Elad et al. (2023) for a comprehensive review of image denoising. Sonthalia and Nadakuditi (2023) empirically showed a double descent behavior in NN denoisers, and theoretically proved it in a linear model. Similar to a denoiser, an Auto-Encoder (AE) is a NN model whose output dimension equals its input dimension, and which is trained to match the output to the input. For an AE, the typical goal is to learn an efficient lower-dimensional representation of the samples. Radhakrishnan et al. (2018) proved that a single hidden-layer AE that interpolates the training data (i.e., achieves zero loss) projects the input onto a nonlinear span of the training data. In addition, Radhakrishnan et al. (2018) empirically demonstrated that a multi-layer ReLU AE is locally contractive toward the training samples by iterating the AE and showing that the points converge to one of the training samples. Denoising autoencoders inject noise into the input data in order to learn a good representation (Alain and Bengio, 2014). The marginalized denoising autoencoder, proposed by Chen et al. (2014), approximates the loss marginalized over the noise (which is equivalent to observing infinitely many noisy samples) using a Taylor approximation. Chen et al. (2014) demonstrated that by using the approximate marginalized loss we can achieve a substantial speedup in training and improved representations compared to a standard denoising AE.

Many recent works aim to characterize function-space properties of interpolating NNs with minimal representation cost (i.e., min-cost solutions). Building on the connection between weight decay and path-norm regularization identified in Neyshabur et al. (2015, 2017), Savarese et al. (2019) showed that the representation cost of a function realizable as a univariate two-layer ReLU network coincides with the L^{1}-norm of the second derivative of the function. Extensions to the multivariate setting were studied in Ongie et al. (2020), which identified the representation cost of a multivariate function with its R-norm, a Banach space semi-norm defined in terms of the Radon transform. Related work has extended the R-norm to other activation functions (Parhi and Nowak, 2021), vector-valued networks (Shenouda et al., 2023), and deeper architectures (Parhi and Nowak, 2022). A separate line of research studies min-cost solutions from a convex duality perspective (Ergen and Pilanci, 2021), including two-layer CNN denoising AEs (Sahiner et al., 2021). Recent work also studies properties of min-cost solutions in the case of arbitrarily deep NNs with ReLU activation (Jacot, 2022; Jacot et al., 2022).

Despite these advances in understanding min-cost solutions, there are few results explicitly characterizing their functional form. One exception is Hanin (2021), which gives a complete characterization of min-cost solutions in the case of shallow univariate ReLU networks with unregularized bias. This characterization is possible because the univariate representation cost is defined in terms of the second derivative, which acts locally. Therefore, global minimizers can be found by minimizing the representation cost locally over intervals between data points. An extension of these results to the case of regularized bias is studied in Boursier and Flammarion (2023). In the multivariate setting, the representation cost involves the Radon transform of the function, a highly non-local operation, which complicates the analysis. Parhi and Nowak (2021) prove a representer theorem showing that there always exists a min-cost solution realizable as a shallow ReLU network with finitely many neurons, and Ergen and Pilanci (2021) give an implicit characterization of min-cost NNs as the solution to a convex optimization problem, together with explicit solutions in the case of co-linear training features. However, to the best of our knowledge, there are no results explicitly characterizing min-cost solutions in the case of non-colinear multivariate inputs, even for networks having scalar outputs.

7 Discussions

Conclusions.

We have explored the elementary properties of NN solutions to the denoising problem, focusing on offline training of a one-hidden-layer ReLU network. When the noisy clusters of the data samples are well-separated, there are multiple networks with zero loss, each with a different representation cost, and this holds even in the under-parameterized case. In contrast, previous theoretical works focused on the over-parameterized regime. In the univariate case, we derived a closed-form solution to such global minima with minimum representation cost. We also showed that the univariate NN solution generalizes better than the eMMSE denoiser. In the multivariate case, we showed that the interpolating solution with minimal representation cost is aligned with the edges and/or faces connecting the clean data points in several cases.

Limitations.

One limitation of our analysis in the multivariate case is that we assume the denoiser interpolates data on a full d-dimensional ball centered at each clean training sample, where d is the input dimension. In practical settings, often the number of noisy samples M\ll d. A more accurate model would be to assume that the denoiser interpolates over an (M-1)-dimensional disc centered at each training sample. This model may still be a tractable alternative to assuming interpolation of finitely many noisy samples. Also, our results relate to NN denoisers trained with explicit weight decay regularization, which is not always used in practice. However, recent work shows that stable minimizers of SGD must have low representation cost (Mulayoff et al., 2021; Nacson et al., 2023), so some of our analysis may provide insight for unregularized training as well. Finally, for mathematical tractability, we focused on the case of fully-connected ReLU networks with one hidden layer. Extending our analysis to deeper architectures and convolutional neural networks is an important direction for future work.

Acknowledgments

We thank Itay Hubara for his technical advice and valuable comments on the manuscript. The research of DS was funded by the European Union (ERC, A-B-C-Deep, 101039436). Views and opinions expressed are however those of the author only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency (ERCEA). Neither the European Union nor the granting authority can be held responsible for them. DS also acknowledges the support of the Schmidt Career Advancement Chair in AI. The research of NW was supported by the Israel Science Foundation (ISF), grant no. 1782/22. GO was supported by the National Science Foundation (NSF) CRII award CCF-2153371.

References

  • Alain and Bengio [2014] Guillaume Alain and Yoshua Bengio. What regularized auto-encoders learn from the data-generating distribution. The Journal of Machine Learning Research, 15(1):3563–3593, 2014.
  • Boursier and Flammarion [2023] Etienne Boursier and Nicolas Flammarion. Penalising the biases in norm regularisation enforces sparsity. arXiv preprint arXiv:2303.01353, 2023.
  • Chen et al. [2014] Minmin Chen, Kilian Weinberger, Fei Sha, and Yoshua Bengio. Marginalized denoising auto-encoders for nonlinear representations. In Eric P. Xing and Tony Jebara, editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 1476–1484, Bejing, China, 22–24 Jun 2014. PMLR. URL https://proceedings.mlr.press/v32/cheng14.html.
  • Elad et al. [2023] Michael Elad, Bahjat Kawar, and Gregory Vaksman. Image denoising: The deep learning revolution and beyond—a survey paper. SIAM Journal on Imaging Sciences, 16(3):1594–1654, 2023. doi: 10.1137/23M1545859. URL https://doi.org/10.1137/23M1545859.
  • Ergen and Pilanci [2021] Tolga Ergen and Mert Pilanci. Convex geometry and duality of over-parameterized neural networks. The Journal of Machine Learning Research, 22(1):9646–9708, 2021.
  • Hanin [2021] Boris Hanin. Ridgeless interpolation with shallow relu networks in 1d is nearest neighbor curvature extrapolation and provably generalizes on lipschitz functions. arXiv preprint arXiv:2109.12960, 2021.
  • Hasinoff et al. [2010] Samuel W Hasinoff, Frédo Durand, and William T Freeman. Noise-optimal capture for high dynamic range photography. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 553–560. IEEE, 2010.
  • Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
  • Horrace [2015] William C Horrace. Moments of the truncated normal distribution. Journal of Productivity Analysis, 43:133–138, 2015.
  • Jacot [2022] Arthur Jacot. Implicit bias of large depth networks: a notion of rank for nonlinear functions. In International Conference on Learning Representations (ICLR), 2022.
  • Jacot et al. [2022] Arthur Jacot, Eugene Golikov, Clement Hongler, and Franck Gabriel. Feature learning in l_2-regularized dnns: Attraction/repulsion and sparsity. In Advances in Neural Information Processing Systems (NeurIPS), volume 35, pages 6763–6774, 2022.
  • Manjunath and Wilhelm [2021] B. G. Manjunath and Stefan Wilhelm. Moments calculation for the doubly truncated multivariate normal density. Journal of Behavioral Data Science, 1(1):17–33, May 2021. doi: 10.35566/jbds/v1n1/p2. URL https://jbds.isdsa.org/index.php/jbds/article/view/9.
  • Mulayoff et al. [2021] Rotem Mulayoff, Tomer Michaeli, and Daniel Soudry. The implicit bias of minima stability: A view from function space. Advances in Neural Information Processing Systems (NeurIPS), 34:17749–17761, 2021.
  • Nacson et al. [2023] Mor Shpigel Nacson, Rotem Mulayoff, Greg Ongie, Tomer Michaeli, and Daniel Soudry. The implicit bias of minima stability in multivariate shallow relu networks. In International Conference on Learning Representations (ICLR), 2023.
  • Neyshabur et al. [2015] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In Conference on Learning Theory (COLT), pages 1376–1401. PMLR, 2015.
  • Neyshabur et al. [2017] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. Advances in Neural Information Processing Systems (NeurIPS), 30, 2017.
  • Ongie et al. [2020] Greg Ongie, Rebecca Willett, Daniel Soudry, and Nathan Srebro. A function space view of bounded norm infinite width relu nets: The multivariate case. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=H1lNPxHKDH.
  • Parhi and Nowak [2021] Rahul Parhi and Robert D Nowak. Banach space representer theorems for neural networks and ridge splines. Journal of Machine Learning Research, 22(1):1960–1999, 2021.
  • Parhi and Nowak [2022] Rahul Parhi and Robert D Nowak. What kinds of functions do deep neural networks learn? insights from variational spline theory. SIAM Journal on Mathematics of Data Science, 4(2):464–489, 2022.
  • Radhakrishnan et al. [2018] Adityanarayanan Radhakrishnan, Karren Yang, Mikhail Belkin, and Caroline Uhler. Memorization in overparameterized autoencoders. arXiv preprint arXiv:1810.10333, 2018.
  • Raya and Ambrogioni [2023] Gabriel Raya and Luca Ambrogioni. Spontaneous symmetry breaking in generative diffusion models. arXiv preprint arXiv:2305.19693, 2023.
  • Roth and Black [2009] Stefan Roth and Michael J. Black. Fields of experts. International Journal of Computer Vision, 82(2):205–229, 2009. doi: 10.1007/s11263-008-0197-6. URL https://doi.org/10.1007/s11263-008-0197-6.
  • Sahiner et al. [2021] Arda Sahiner, Morteza Mardani, Batu Ozturkler, Mert Pilanci, and John M. Pauly. Convex regularization behind neural reconstruction. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=VErQxgyrbfn.
  • Savarese et al. [2019] Pedro Savarese, Itay Evron, Daniel Soudry, and Nathan Srebro. How do infinite width bounded norm networks look in function space? In Conference on Learning Theory, pages 2667–2690. PMLR, 2019.
  • Shenouda et al. [2023] Joseph Shenouda, Rahul Parhi, Kangwook Lee, and Robert D Nowak. Vector-valued variation spaces and width bounds for DNNs: Insights on weight decay regularization. arXiv preprint arXiv:2305.16534, 2023.
  • Song and Ermon [2019] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 32, 2019.
  • Sonthalia and Nadakuditi [2023] Rishi Sonthalia and Raj Rao Nadakuditi. Training data size induced double descent for denoising feedforward neural networks and the role of training noise. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=FdMWtpVT1I.
  • Udell and Townsend [2019] Madeleine Udell and Alex Townsend. Why are big data matrices approximately low rank? SIAM Journal on Mathematics of Data Science, 1(1):144–160, 2019.
  • Zhang et al. [2017] Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang. Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising. IEEE transactions on image processing, 26(7):3142–3155, 2017.
  • Zhang et al. [2021] Kai Zhang, Yawei Li, Wangmeng Zuo, Lei Zhang, Luc Van Gool, and Radu Timofte. Plug-and-play image restoration with deep denoiser prior. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10):6360–6376, 2021.

Appendix A Marginalized loss

In this section, we derive the marginalized loss for the case of a one-hidden-layer ReLU neural network. The loss function is

\mathcal{L}\left(\bm{\theta},\sigma\right) =\mathbb{E}_{\bm{x},\bm{y}}\left\|\sum_{k=1}^{K}\bm{a}_{k}[\bm{w}_{k}^{\top}\bm{y}]_{+}-\bm{x}\right\|^{2}
=\mathbb{E}_{\bm{x},\bm{y}}\left[\sum_{i=1}^{K}\sum_{j=1}^{K}\bm{a}_{i}^{\top}\bm{a}_{j}[\bm{w}_{i}^{\top}\bm{y}]_{+}[\bm{w}_{j}^{\top}\bm{y}]_{+}-2\sum_{i=1}^{K}\bm{x}^{\top}\bm{a}_{i}[\bm{w}_{i}^{\top}\bm{y}]_{+}+\left\|\bm{x}\right\|^{2}\right]
=\sum_{i=1}^{K}\sum_{j=1}^{K}\bm{a}_{i}^{\top}\bm{a}_{j}\mathbb{E}_{\bm{x},\bm{y}}\left[[\bm{w}_{i}^{\top}\bm{y}]_{+}[\bm{w}_{j}^{\top}\bm{y}]_{+}\right]-2\sum_{i=1}^{K}\mathbb{E}_{\bm{x},\bm{y}}\left[\bm{x}^{\top}\bm{a}_{i}[\bm{w}_{i}^{\top}\bm{y}]_{+}\right]+\mathbb{E}\left\|\bm{x}\right\|^{2}
=\sum_{i=1}^{K}\sum_{j=1}^{K}\bm{a}_{i}^{\top}\bm{a}_{j}\mathbb{E}_{\bm{x}}\left[\mathbb{E}_{\bm{y}|\bm{x}}\left[[\bm{w}_{i}^{\top}\bm{y}]_{+}[\bm{w}_{j}^{\top}\bm{y}]_{+}\right]\right]-2\sum_{i=1}^{K}\mathbb{E}_{\bm{x}}\left[\mathbb{E}_{\bm{y}|\bm{x}}\left[\bm{x}^{\top}\bm{a}_{i}[\bm{w}_{i}^{\top}\bm{y}]_{+}\right]\right]+\mathbb{E}\left\|\bm{x}\right\|^{2}
=\sum_{i=1}^{K}\sum_{j=1}^{K}\bm{a}_{i}^{\top}\bm{a}_{j}\mathbb{E}_{\bm{x}}\left[\mathbb{E}_{\bm{y}|\bm{x}}\left[[\bm{w}_{i}^{\top}\bm{y}]_{+}\right]\mathbb{E}_{\bm{y}|\bm{x}}\left[[\bm{w}_{j}^{\top}\bm{y}]_{+}\right]\right]-2\sum_{i=1}^{K}\mathbb{E}_{\bm{x}}\left[\mathbb{E}_{\bm{y}|\bm{x}}\left[\bm{x}^{\top}\bm{a}_{i}[\bm{w}_{i}^{\top}\bm{y}]_{+}\right]\right]+\mathbb{E}\left\|\bm{x}\right\|^{2}
\quad+\sum_{i=1}^{K}\sum_{j=1}^{K}\bm{a}_{i}^{\top}\bm{a}_{j}\mathbb{E}_{\bm{x}}\left[\mathbb{E}_{\bm{y}|\bm{x}}\left[[\bm{w}_{i}^{\top}\bm{y}]_{+}[\bm{w}_{j}^{\top}\bm{y}]_{+}\right]\right]-\sum_{i=1}^{K}\sum_{j=1}^{K}\bm{a}_{i}^{\top}\bm{a}_{j}\mathbb{E}_{\bm{x}}\left[\mathbb{E}_{\bm{y}|\bm{x}}\left[[\bm{w}_{i}^{\top}\bm{y}]_{+}\right]\mathbb{E}_{\bm{y}|\bm{x}}\left[[\bm{w}_{j}^{\top}\bm{y}]_{+}\right]\right]
=\mathbb{E}_{\bm{x}}\left\|\sum_{i=1}^{K}\bm{a}_{i}\tilde{\phi}\left(\bm{w}_{i}^{\top}\bm{x}\right)-\bm{x}\right\|^{2}+\sum_{i=1}^{K}\sum_{j=1}^{K}\bm{a}_{i}^{\top}\bm{a}_{j}\mathbb{E}_{\bm{x}}\left[\mathbb{E}_{\bm{y}|\bm{x}}\left[[\bm{w}_{i}^{\top}\bm{y}]_{+}[\bm{w}_{j}^{\top}\bm{y}]_{+}\right]\right]
\quad-\sum_{i=1}^{K}\sum_{j=1}^{K}\bm{a}_{i}^{\top}\bm{a}_{j}\mathbb{E}_{\bm{x}}\left[\mathbb{E}_{\bm{y}|\bm{x}}\left[[\bm{w}_{i}^{\top}\bm{y}]_{+}\right]\mathbb{E}_{\bm{y}|\bm{x}}\left[[\bm{w}_{j}^{\top}\bm{y}]_{+}\right]\right]\,.

We denote by

𝐇ij(𝒘i,𝒘j,σ2)=𝔼𝒙[𝔼𝒚|𝒙[[𝒘i𝒚]+[𝒘j𝒚]+]𝔼𝒚|𝒙[[𝒘i𝒚]+]𝔼𝒚|𝒙[[𝒘j𝒚]+]].\displaystyle\mathbf{H}_{ij}\left(\bm{w}_{i},\bm{w}_{j},\sigma^{2}\right)=\mathbb{E}_{\bm{x}}\biggl{[}\mathbb{E}_{\bm{y}|\bm{x}}\left[[\bm{w}_{i}^{\top}\bm{y}]_{+}[\bm{w}_{j}^{\top}\bm{y}]_{+}\right]-\mathbb{E}_{\bm{y}|\bm{x}}\left[[\bm{w}_{i}^{\top}\bm{y}]_{+}\right]\mathbb{E}_{\bm{y}|\bm{x}}\left[[\bm{w}_{j}^{\top}\bm{y}]_{+}\right]\biggr{]}\,.

Note that 𝐇0\mathbf{H}\succeq 0 since 𝐇\mathbf{H} is a covariance matrix. Thus we get,

(𝜽,σ)=𝔼𝒙k=1K𝒂kϕ~(𝒘^k𝒙,𝒘k,σ)𝒙2+i=1Kj=1K𝒂i𝒂j𝐇ij(𝒘i,𝒘j,σ2).\displaystyle\mathcal{L}\left(\bm{\theta},\sigma\right)=\mathbb{E}_{\bm{x}}\left\|\sum_{k=1}^{K}\bm{a}_{k}\tilde{\phi}\left(\hat{\bm{w}}_{k}^{\top}\bm{x},\left\|\bm{w}_{k}\right\|,\sigma\right)-\bm{x}\right\|^{2}+\sum_{i=1}^{K}\sum_{j=1}^{K}\bm{a}_{i}^{\top}\bm{a}_{j}\mathbf{H}_{ij}\left(\bm{w}_{i},\bm{w}_{j},\sigma^{2}\right)\,.
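As a sanity check, the decomposition above can be verified numerically. The following is a minimal sketch we added (not part of the paper's experiments) for a single-unit univariate network with Gaussian noise: it estimates the marginalized loss directly by Monte Carlo and compares it with the clean-input term plus the a^2 H variance term, with the conditional moments also estimated by Monte Carlo. All names and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
a, w, sigma = 0.7, 1.3, 0.5                          # single-unit weights and noise level
x = rng.uniform(-2.0, 2.0, size=20000)               # clean samples x ~ p_x
y = x + sigma * rng.standard_normal((200, x.size))   # 200 noisy copies per clean sample
relu = np.maximum(w * y, 0.0)

# Direct Monte-Carlo estimate of the marginalized loss E_{x,y}(a [w y]_+ - x)^2
loss = np.mean((a * relu - x) ** 2)

# Decomposition: E_x (a * phi_tilde(x) - x)^2 + a^2 * E_x Var_{y|x}([w y]_+)
phi_tilde = relu.mean(axis=0)       # conditional mean E_{y|x}[w y]_+ (MC estimate)
h_term = relu.var(axis=0).mean()    # H_11 = E_x Var_{y|x}([w y]_+)
decomposed = np.mean((a * phi_tilde - x) ** 2) + a ** 2 * h_term

print(f"direct loss     : {loss:.4f}")
print(f"decomposed loss : {decomposed:.4f}")   # identical up to floating-point error
```

Because the same samples are used for both estimates, the two printed numbers agree up to floating-point error, which is exactly the conditional bias-variance identity behind the decomposition.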
Lemma 2.

In the case of the ReLU activation function and Gaussian noise the following holds,

\displaystyle\tilde{\phi}\left(\hat{\bm{w}}_{i}^{\top}\bm{x},\left\|\bm{w}_{i}\right\|,\sigma\right)=\left\|\bm{w}_{i}\right\|\left(\left(1-\Phi\left(-\frac{\hat{\bm{w}}_{i}^{\top}\bm{x}}{\sigma}\right)\right)\hat{\bm{w}}_{i}^{\top}\bm{x}+\sigma\varphi\left(-\frac{\hat{\bm{w}}_{i}^{\top}\bm{x}}{\sigma}\right)\right)

where \varphi and \Phi are the density and cumulative distribution function of the standard normal distribution, respectively, and

𝐇ij\displaystyle\mathbf{H}_{ij} =𝔼𝒙[Ψ(𝒙,𝒘i,𝒘j,σ2)ϕ~(𝒘^i𝒙,𝒘i,σ)ϕ~(𝒘^j𝒙,𝒘j,σ)]\displaystyle=\mathbb{E}_{\bm{x}}\left[\Psi\left(\bm{x},\bm{w}_{i},\bm{w}_{j},\sigma^{2}\right)-\tilde{\phi}\left(\hat{\bm{w}}_{i}^{\top}\bm{x},\left\|\bm{w}_{i}\right\|,\sigma\right)\tilde{\phi}\left(\hat{\bm{w}}_{j}^{\top}\bm{x},\left\|\bm{w}_{j}\right\|,\sigma\right)\right]

where,

Ψ(𝒙,𝒘i,𝒘j,σ2)\displaystyle\Psi\left(\bm{x},\bm{w}_{i},\bm{w}_{j},\sigma^{2}\right) =P(z1>0,z2>0;𝝁(i,j),𝚺(i,j))(σ12+det(𝚺)f(𝝁(i,j);𝟎,𝚺(i,j))P(z1>μ1,z2>μ2;𝟎,𝚺(i,j))\displaystyle=P\left(z_{1}>0,z_{2}>0;\bm{\mu}^{(i,j)},\bm{\Sigma}^{(i,j)}\right)\biggl{(}\sigma_{12}+\mathrm{det}\left(\bm{\Sigma}\right)\frac{f\left(-\bm{\mu}^{\left(i,j\right)};\bm{0},\bm{\Sigma}^{\left(i,j\right)}\right)}{P(z_{1}>-\mu_{1},z_{2}>-\mu_{2};\bm{0},\bm{\Sigma}^{(i,j)})}
+σ11F1(μ1)μ2+μ1σ22F2(μ2)+μ1μ2)\displaystyle\quad+\sigma_{11}F_{1}\left(-\mu_{1}\right)\mu_{2}+\mu_{1}\sigma_{22}F_{2}\left(-\mu_{2}\right)+\mu_{1}\mu_{2}\biggr{)}
\displaystyle F_{2}\left(-\mu_{2}\right)=\frac{1}{\sqrt{\sigma_{22}}}\cdot\frac{\varphi\left(-\frac{\mu_{2}}{\sqrt{\sigma_{22}}}\right)P\left(z>-\mu_{1};\frac{\Lambda_{12}\mu_{2}}{\Lambda_{11}},\frac{1}{\Lambda_{11}}\right)}{P(z_{1}>-\mu_{1},z_{2}>-\mu_{2};\bm{0},\bm{\Sigma}^{(i,j)})}
\displaystyle F_{1}\left(-\mu_{1}\right)=\frac{1}{\sqrt{\sigma_{11}}}\cdot\frac{\varphi\left(-\frac{\mu_{1}}{\sqrt{\sigma_{11}}}\right)P\left(z>-\mu_{2};\frac{\Lambda_{21}\mu_{1}}{\Lambda_{22}},\frac{1}{\Lambda_{22}}\right)}{P(z_{1}>-\mu_{1},z_{2}>-\mu_{2};\bm{0},\bm{\Sigma}^{(i,j)})}

and

𝝁(i,j)\displaystyle\bm{\mu}^{\left(i,j\right)} =(μ1μ2)=(𝒘i𝒙𝒘j𝒙)\displaystyle=\begin{pmatrix}\mu_{1}\\ \mu_{2}\end{pmatrix}=\begin{pmatrix}\bm{w}_{i}^{\top}\bm{x}\\ \bm{w}_{j}^{\top}\bm{x}\end{pmatrix}
𝚺(i,j)\displaystyle\bm{\Sigma}^{\left(i,j\right)} =(σ11σ12σ12σ22)=(σ2𝒘i2σ2𝒘i𝒘jσ2𝒘j𝒘iσ2𝒘j2)\displaystyle=\begin{pmatrix}\sigma_{11}&\sigma_{12}\\ \sigma_{12}&\sigma_{22}\end{pmatrix}=\begin{pmatrix}\sigma^{2}\left\|\bm{w}_{i}\right\|^{2}&\sigma^{2}\bm{w}_{i}^{\top}\bm{w}_{j}\\ \sigma^{2}\bm{w}_{j}^{\top}\bm{w}_{i}&\sigma^{2}\left\|\bm{w}_{j}\right\|^{2}\end{pmatrix}
𝚲\displaystyle\bm{\Lambda} =𝚺1.\displaystyle=\bm{\Sigma}^{-1}\,.

Notice that we denote by P(z1>0,z2>0;𝛍(i,j),𝚺(i,j))P\left(z_{1}>0,z_{2}>0;\bm{\mu}^{(i,j)},\bm{\Sigma}^{(i,j)}\right) the probability that z1>0,z2>0z_{1}>0,z_{2}>0 where (z1,z2)𝒩(𝛍(i,j),𝚺(i,j))\left(z_{1},z_{2}\right)^{\top}\sim\mathcal{N}\left(\bm{\mu}^{(i,j)},\bm{\Sigma}^{(i,j)}\right).

Proof.

Let x𝒩(μ,σ2)x\sim\mathcal{N}\left(\mu,\sigma^{2}\right), then

\displaystyle E\left[[x]_{+}\right]=E\left[[x]_{+}\mid x>0\right]P\left(x>0\right)+E\left[[x]_{+}\mid x<0\right]P\left(x<0\right)
=E[x|x>0]P(x>0)\displaystyle=E\left[x|x>0\right]P\left(x>0\right)
=E[x|x>0](1Φ(μσ))\displaystyle=E\left[x|x>0\right]\left(1-\Phi\left(-\frac{\mu}{\sigma}\right)\right)

Note that given x>0x>0 the distribution of xx is truncated normal [horrace2015moments]. Therefore,

𝔼[x|x>0]\displaystyle\mathbb{E}\left[x|x>0\right] =μ+σφ(μσ)1Φ(μσ)\displaystyle=\mu+\sigma\frac{\varphi\left(-\frac{\mu}{\sigma}\right)}{1-\Phi\left(-\frac{\mu}{\sigma}\right)}
𝔼[[x]+]\displaystyle\mathbb{E}\left[[x]_{+}\right] =(1Φ(μσ))μ+σφ(μσ).\displaystyle=\left(1-\Phi\left(-\frac{\mu}{\sigma}\right)\right)\mu+\sigma\varphi\left(-\frac{\mu}{\sigma}\right)\,. (23)

Note that given 𝒙\bm{x}, 𝒚\bm{y} is a Gaussian random vector. Therefore, given 𝒙\bm{x},

𝒘k𝒚𝒩(𝒘k𝒙,σ2𝒘k2)\displaystyle\bm{w}_{k}^{\top}\bm{y}\sim\mathcal{N}\left(\bm{w}_{k}^{\top}\bm{x},\sigma^{2}\left\|\bm{w}_{k}\right\|^{2}\right)

and we obtain,

𝔼𝒚|𝒙[[𝒘k𝒚]+]\displaystyle\mathbb{E}_{\bm{y}|\bm{x}}\left[[\bm{w}_{k}^{\top}\bm{y}]_{+}\right] =𝒘k𝔼𝒚|𝒙[[𝒘^k𝒚]+]\displaystyle=\left\|\bm{w}_{k}\right\|\mathbb{E}_{\bm{y}|\bm{x}}\left[[\hat{\bm{w}}_{k}^{\top}\bm{y}]_{+}\right]
\displaystyle=\left\|\bm{w}_{k}\right\|\left(\left(1-\Phi\left(-\frac{\hat{\bm{w}}_{k}^{\top}\bm{x}}{\sigma}\right)\right)\hat{\bm{w}}_{k}^{\top}\bm{x}+\sigma\varphi\left(-\frac{\hat{\bm{w}}_{k}^{\top}\bm{x}}{\sigma}\right)\right)\,.
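The closed form (23), and hence the expression above, can be checked against samples, as reported in Figure 3. Below is a minimal sketch of such a check (ours, not the authors' code; the figure's exact protocol and sample sizes may differ).

```python
import numpy as np
from scipy.stats import norm

def relu_mean(mu, sigma):
    """Closed form (23): E[ReLU(x)] for x ~ N(mu, sigma^2)."""
    return (1.0 - norm.cdf(-mu / sigma)) * mu + sigma * norm.pdf(-mu / sigma)

rng = np.random.default_rng(1)
for mu, sigma in [(1.0, 5.0), (-1.0, 5.0)]:
    samples = np.maximum(rng.normal(mu, sigma, size=10_000), 0.0)
    E, E_bar = relu_mean(mu, sigma), samples.mean()
    print(f"mu={mu:+.0f}: analytical E={E:.4f}, sample average={E_bar:.4f}, "
          f"normalized error={abs(E - E_bar) / E:.2%}")
```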
Refer to caption
(a)
Refer to caption
(b)
Figure 3: Numerical evaluation of (23). Histogram of the sample average of \mathrm{ReLU}(x) over 10000 Monte-Carlo samples. We denote by E the analytical expectation and by \bar{E} the sample average. Figure (a) is for \mu=1,\ \sigma=5; the normalized error is \frac{|E-\bar{E}|}{E}=0.0032\%. Figure (b) is for \mu=-1,\ \sigma=5; the normalized error is \frac{|E-\bar{E}|}{E}=0.0059\%.

Let (z1,z2)\left(z_{1},z_{2}\right)^{\top} be a Gaussian random vector with444Note that σii=σi2\sigma_{ii}=\sigma_{i}^{2}.

𝝁\displaystyle\bm{\mu} =(μ1μ2)\displaystyle=\begin{pmatrix}\mu_{1}\\ \mu_{2}\end{pmatrix}
𝚺\displaystyle\bm{\Sigma} =(σ11σ12σ12σ22)\displaystyle=\begin{pmatrix}\sigma_{11}&\sigma_{12}\\ \sigma_{12}&\sigma_{22}\end{pmatrix}

then,

𝔼[[z1]+[z2]+]\displaystyle\mathbb{E}\left[[z_{1}]_{+}[z_{2}]_{+}\right] =𝔼[[z1]+[z2]+|z1>0,z2>0]P(z1>0,z2>0)\displaystyle=\mathbb{E}\left[[z_{1}]_{+}[z_{2}]_{+}|z_{1}>0,z_{2}>0\right]P\left(z_{1}>0,z_{2}>0\right)
=𝔼[z1z2|z1>0,z2>0]P(z1>0,z2>0).\displaystyle=\mathbb{E}\left[z_{1}z_{2}|z_{1}>0,z_{2}>0\right]P\left(z_{1}>0,z_{2}>0\right)\,.

Given z_{1}>0, z_{2}>0, the distribution of \left(z_{1},z_{2}\right)^{\top} is a truncated multivariate normal distribution [Manjunath_Wilhelm_2021]; therefore,

𝔼[z1z2|z1>0,z2>0]\displaystyle\mathbb{E}\left[z_{1}z_{2}|z_{1}>0,z_{2}>0\right] =σ12+σ21(μ1)F1(μ1)+σ12(μ2)F2(μ2)\displaystyle=\sigma_{12}+\sigma_{21}\left(-\mu_{1}\right)F_{1}\left(-\mu_{1}\right)+\sigma_{12}\left(-\mu_{2}\right)F_{2}\left(-\mu_{2}\right)
+(σ11σ22σ12σ21)f(𝝁;𝟎,𝚺)p(z1>μ1,z2>μ2;𝟎,𝚺)\displaystyle+\left(\sigma_{11}\sigma_{22}-\sigma_{12}\sigma_{21}\right)\frac{f\left(-\bm{\mu};\bm{0},\bm{\Sigma}\right)}{p(z_{1}>-\mu_{1},z_{2}>-\mu_{2};\bm{0},\bm{\Sigma})}
(σ11F1(μ1)+σ12F2(μ2))(σ21F1(μ1)+σ22F2(μ2))\displaystyle-\left(\sigma_{11}F_{1}\left(-\mu_{1}\right)+\sigma_{12}F_{2}\left(-\mu_{2}\right)\right)\left(\sigma_{21}F_{1}\left(-\mu_{1}\right)+\sigma_{22}F_{2}\left(-\mu_{2}\right)\right)
+(σ11F1(μ1)+σ12F2(μ2)+μ1)(σ21F1(μ1)+σ22F2(μ2)+μ2)\displaystyle+\left(\sigma_{11}F_{1}\left(-\mu_{1}\right)+\sigma_{12}F_{2}\left(-\mu_{2}\right)+\mu_{1}\right)\left(\sigma_{21}F_{1}\left(-\mu_{1}\right)+\sigma_{22}F_{2}\left(-\mu_{2}\right)+\mu_{2}\right)
=σ12+det(𝚺)f(𝝁;𝟎,𝚺)p(z1>μ1,z2>μ2;𝟎,𝚺)\displaystyle=\sigma_{12}+\mathrm{det}\left(\bm{\Sigma}\right)\frac{f\left(-\bm{\mu};\bm{0},\bm{\Sigma}\right)}{p(z_{1}>-\mu_{1},z_{2}>-\mu_{2};\bm{0},\bm{\Sigma})}
+σ11F1(μ1)μ2+μ1σ22F2(μ2)+μ1μ2\displaystyle+\sigma_{11}F_{1}\left(-\mu_{1}\right)\mu_{2}+\mu_{1}\sigma_{22}F_{2}\left(-\mu_{2}\right)+\mu_{1}\mu_{2} (24)

where f\left(-\bm{\mu};\bm{0},\bm{\Sigma}\right) is the density of a Gaussian random vector with mean vector \bm{0} and covariance matrix \bm{\Sigma}, evaluated at the point -\bm{\mu}, and

F2(z2)\displaystyle F_{2}\left(z_{2}\right) =μ1f(z1,z2;𝟎,𝚺)𝑑z1P(Z1>μ1,Z2>μ2;𝟎,𝚺)\displaystyle=\frac{\int_{-\mu_{1}}^{\infty}f\left(z_{1},z_{2};\bm{0},\bm{\Sigma}\right)dz_{1}}{P(Z_{1}>-\mu_{1},Z_{2}>-\mu_{2};\bm{0},\bm{\Sigma})}
F1(z1)\displaystyle F_{1}\left(z_{1}\right) =μ2f(z1,z2;𝟎,𝚺)𝑑z2P(Z1>μ1,Z2>μ2;𝟎,𝚺).\displaystyle=\frac{\int_{-\mu_{2}}^{\infty}f\left(z_{1},z_{2};\bm{0},\bm{\Sigma}\right)dz_{2}}{P(Z_{1}>-\mu_{1},Z_{2}>-\mu_{2};\bm{0},\bm{\Sigma})}\,.

We denote by 𝚲=𝚺1\bm{\Lambda}=\bm{\Sigma}^{-1} thus,

μ1exp(12(Λ11z12+2Λ12z1z2+Λ22z22))𝑑z1=\displaystyle\int_{-\mu_{1}}^{\infty}\exp\left(-\frac{1}{2}\left(\Lambda_{11}z_{1}^{2}+2\Lambda_{12}z_{1}z_{2}+\Lambda_{22}z_{2}^{2}\right)\right)dz_{1}=
exp(12Λ22z22)μ1exp(12Λ11z12Λ12z1z2)𝑑z1=\displaystyle\exp\left(-\frac{1}{2}\Lambda_{22}z_{2}^{2}\right)\int_{-\mu_{1}}^{\infty}\exp\left(-\frac{1}{2}\Lambda_{11}z_{1}^{2}-\Lambda_{12}z_{1}z_{2}\right)dz_{1}=
exp(12Λ22z22)μ1exp(121Λ11(z1+Λ12z2Λ11)2+Λ122z222Λ11)𝑑z1=\displaystyle\exp\left(-\frac{1}{2}\Lambda_{22}z_{2}^{2}\right)\int_{-\mu_{1}}^{\infty}\exp\left(-\frac{1}{2\frac{1}{\Lambda_{11}}}\left(z_{1}+\frac{\Lambda_{12}z_{2}}{\Lambda_{11}}\right)^{2}+\frac{\Lambda_{12}^{2}z_{2}^{2}}{2\Lambda_{11}}\right)dz_{1}=
exp(12Λ22z22+Λ122z222Λ11)μ1exp(121Λ11(z1+Λ12z2Λ11)2)𝑑z1=\displaystyle\exp\left(-\frac{1}{2}\Lambda_{22}z_{2}^{2}+\frac{\Lambda_{12}^{2}z_{2}^{2}}{2\Lambda_{11}}\right)\int_{-\mu_{1}}^{\infty}\exp\left(-\frac{1}{2\frac{1}{\Lambda_{11}}}\left(z_{1}+\frac{\Lambda_{12}z_{2}}{\Lambda_{11}}\right)^{2}\right)dz_{1}=
exp(12Λ22z22+Λ122z222Λ11)2πΛ11P(z>μ1;Λ12z2Λ11,1Λ11).\displaystyle\exp\left(-\frac{1}{2}\Lambda_{22}z_{2}^{2}+\frac{\Lambda_{12}^{2}z_{2}^{2}}{2\Lambda_{11}}\right)\sqrt{\frac{2\pi}{\Lambda_{11}}}P\left(z>-\mu_{1};-\frac{\Lambda_{12}z_{2}}{\Lambda_{11}},\frac{1}{\Lambda_{11}}\right)\,.

Thus,

μ1f(z1,z2;0,𝚺)𝑑z1\displaystyle\int_{-\mu_{1}}^{\infty}f\left(z_{1},z_{2};0,\bm{\Sigma}\right)dz_{1} =12πdet(Σ)exp(12Λ22z22+Λ122z222Λ11)2πΛ11P(z>μ1;Λ12z2Λ11,1Λ11)\displaystyle=\frac{1}{2\pi\sqrt{\det\left(\Sigma\right)}}\exp\left(-\frac{1}{2}\Lambda_{22}z_{2}^{2}+\frac{\Lambda_{12}^{2}z_{2}^{2}}{2\Lambda_{11}}\right)\sqrt{\frac{2\pi}{\Lambda_{11}}}P\left(z>-\mu_{1};-\frac{\Lambda_{12}z_{2}}{\Lambda_{11}},\frac{1}{\Lambda_{11}}\right)
=12πσ22exp(12σ22z22)P(z>μ1;Λ12z2Λ11,1Λ11).\displaystyle=\frac{1}{\sqrt{2\pi\sigma_{22}}}\exp\left(-\frac{1}{2\sigma_{22}}z_{2}^{2}\right)P\left(z>-\mu_{1};-\frac{\Lambda_{12}z_{2}}{\Lambda_{11}},\frac{1}{\Lambda_{11}}\right)\,.

Similarly, we obtain,

μ2f(z1,z2;0,𝚺)𝑑z2\displaystyle\int_{-\mu_{2}}^{\infty}f\left(z_{1},z_{2};0,\bm{\Sigma}\right)dz_{2} =12πσ11exp(12σ11z12)P(z>μ2;Λ21z1Λ22,1Λ22).\displaystyle=\frac{1}{\sqrt{2\pi\sigma_{11}}}\exp\left(-\frac{1}{2\sigma_{11}}z_{1}^{2}\right)P\left(z>-\mu_{2};-\frac{\Lambda_{21}z_{1}}{\Lambda_{22}},\frac{1}{\Lambda_{22}}\right)\,.

Therefore,

\displaystyle F_{2}\left(z_{2}\right)=\frac{1}{\sqrt{\sigma_{22}}}\cdot\frac{\varphi\left(\frac{z_{2}}{\sqrt{\sigma_{22}}}\right)P\left(z>-\mu_{1};-\frac{\Lambda_{12}z_{2}}{\Lambda_{11}},\frac{1}{\Lambda_{11}}\right)}{P(Z_{1}>-\mu_{1},Z_{2}>-\mu_{2};\bm{0},\bm{\Sigma})}
\displaystyle F_{1}\left(z_{1}\right)=\frac{1}{\sqrt{\sigma_{11}}}\cdot\frac{\varphi\left(\frac{z_{1}}{\sqrt{\sigma_{11}}}\right)P\left(z>-\mu_{2};-\frac{\Lambda_{21}z_{1}}{\Lambda_{22}},\frac{1}{\Lambda_{22}}\right)}{P(Z_{1}>-\mu_{1},Z_{2}>-\mu_{2};\bm{0},\bm{\Sigma})}\,.
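These expressions make (24) directly computable. The following sketch (added for illustration; not the authors' code) evaluates (24) with the F_1, F_2 above, using scipy's bivariate normal CDF for the orthant probability, and compares it with a Monte-Carlo estimate at the parameter values used in Figure 4.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def relu_cross_moment(mu, Sigma, n_mc=2_000_000, seed=0):
    """Compare E[ReLU(z1)ReLU(z2)] for z ~ N(mu, Sigma): closed form (24) vs Monte Carlo."""
    mu, Sigma = np.asarray(mu, float), np.asarray(Sigma, float)
    Lam = np.linalg.inv(Sigma)
    s11, s12, s22 = Sigma[0, 0], Sigma[0, 1], Sigma[1, 1]

    # P(z1 > 0, z2 > 0) = P(Z1 > -mu1, Z2 > -mu2) for Z ~ N(0, Sigma)
    #                   = P(Z1 < mu1, Z2 < mu2) by central symmetry of N(0, Sigma)
    P = multivariate_normal(mean=[0.0, 0.0], cov=Sigma).cdf(mu)
    f = multivariate_normal(mean=[0.0, 0.0], cov=Sigma).pdf(-mu)

    # Marginal densities of the truncated distribution, F1(-mu1) and F2(-mu2)
    F1 = (norm.pdf(-mu[0] / np.sqrt(s11)) / np.sqrt(s11)
          * norm.sf(-mu[1], loc=Lam[1, 0] * mu[0] / Lam[1, 1],
                    scale=np.sqrt(1.0 / Lam[1, 1])) / P)
    F2 = (norm.pdf(-mu[1] / np.sqrt(s22)) / np.sqrt(s22)
          * norm.sf(-mu[0], loc=Lam[0, 1] * mu[1] / Lam[0, 0],
                    scale=np.sqrt(1.0 / Lam[0, 0])) / P)

    cond_moment = (s12 + np.linalg.det(Sigma) * f / P
                   + s11 * F1 * mu[1] + mu[0] * s22 * F2 + mu[0] * mu[1])
    analytic = P * cond_moment          # E[ReLU(z1)ReLU(z2)] = P * E[z1 z2 | z1>0, z2>0]

    z = np.random.default_rng(seed).multivariate_normal(mu, Sigma, size=n_mc)
    mc = np.mean(np.maximum(z[:, 0], 0) * np.maximum(z[:, 1], 0))
    return analytic, mc

for mu, Sigma in [([-4, 17], [[13, -9], [-9, 8]]), ([6, 2], [[10, 2], [2, 1]])]:
    E, E_bar = relu_cross_moment(mu, Sigma)
    print(f"mu={mu}: analytic={E:.4f}, Monte-Carlo={E_bar:.4f}")
```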

Next, we present in Figures 3 and 4 a numerical evaluation of (23) and (24). As can be seen, the Monte-Carlo simulations verify the analytical results.

Refer to caption
(a)
Refer to caption
(b)
Figure 4: Numerical evaluation of (24). Histogram of the sample average of \mathrm{ReLU}(z_{1})\mathrm{ReLU}(z_{2}) over 10000 Monte-Carlo samples. We denote by E the analytical expectation and by \bar{E} the sample average. Figure (a) is for \bm{\mu}=\begin{pmatrix}-4\\ 17\end{pmatrix},\ \bm{\Sigma}=\begin{pmatrix}13&-9\\ -9&8\end{pmatrix}; the normalized error is \frac{|E-\bar{E}|}{E}=0.0093\%. Figure (b) is for \bm{\mu}=\begin{pmatrix}6\\ 2\end{pmatrix},\ \bm{\Sigma}=\begin{pmatrix}10&2\\ 2&1\end{pmatrix}; the normalized error is \frac{|E-\bar{E}|}{E}=0.0044\%.

Appendix B Proofs of Results in Section 4

B.1 Proof of Proposition 1

Proof.

Let

\mathcal{D}=\{(y_{n,m}=x_{n}+\epsilon_{n,m},\,x_{n})\},\quad n\in[N],\ m\in[M],\qquad-\infty<x_{1}<x_{2}<\cdots<x_{N}<\infty

such that Assumption 1 holds. We define a reduced dataset, which only contains the noisy samples with the most extreme noises

𝒟¯={(xn+ϵnmin,xn),(xn+ϵnmax,xn)},n[N].\displaystyle\mathcal{\bar{D}}=\{(x_{n}+\epsilon^{\mathrm{min}}_{n},x_{n}),(x_{n}+\epsilon^{\mathrm{max}}_{n},x_{n})\},n\in[N]\,.

We define the 2\ell^{2} penalty on the weights,

C(θ,K)=k=1K(𝒂k2+𝒘k2).\displaystyle C\left(\theta,K\right)=\sum_{k=1}^{K}\left(\|{\bm{a}}_{k}\|^{2}+\|{\bm{w}}_{k}\|^{2}\right)\,.

According to Theorem 1.2 in [hanin2021ridgeless], since we have opposite discrete curvature on the intervals [x_{1}+\epsilon_{1}^{\mathrm{max}},x_{2}+\epsilon_{2}^{\mathrm{min}}],\dots,[x_{N-1}+\epsilon_{N-1}^{\mathrm{max}},x_{N}+\epsilon_{N}^{\mathrm{min}}],

{hθ(y)hθ(y)=x(y,x)𝒟¯,C(θ,K)=C}={f1D(y)},\displaystyle\{h_{\theta}(y)\mid h_{\theta}(y)=x\;\forall(y,x)\in\mathcal{\bar{D}},C\left(\theta,K\right)=C_{*}\}=\{f^{*}_{1D}\left(y\right)\}\,,

where

C=infθ,K{C(θ,K)(y,x)𝒟¯hθ(y)=x}.\displaystyle C_{*}=\inf_{{\theta},K}\{C\left(\theta,K\right)\mid\forall(y,x)\in\mathcal{\bar{D}}\;\;h_{\theta}(y)=x\}\,.

Also note that,

{hθ(y)hθ(y)=x(y,x)𝒟,C(θ,K)=C}={f1D(y)}\displaystyle\{h_{\theta}(y)\mid h_{\theta}(y)=x\;\forall(y,x)\in\mathcal{D},C\left(\theta,K\right)=C_{*}\}=\{f^{*}_{1D}\left(y\right)\}

since f1D(y)f^{*}_{1D}\left(y\right) interpolates all the points in 𝒟\mathcal{D}, and if we assume by contradiction that C>infθ,K{C(θ,K)(y,x)𝒟hθ(y)=x}C_{*}>\inf_{{\theta},K}\{C\left(\theta,K\right)\mid\forall(y,x)\in\mathcal{D}\;\;h_{\theta}(y)=x\} then Cinfθ,K{C(θ,K)(y,x)𝒟¯hθ(y)=x}C_{*}\neq\inf_{{\theta},K}\{C\left(\theta,K\right)\mid\forall(y,x)\in\mathcal{\bar{D}}\;\;h_{\theta}(y)=x\}, thus contradicting Theorem 1.2. in [hanin2021ridgeless]. Note that the minimal number of neurons needed to represent f1D(y)f^{*}_{1D}\left(y\right) is 2N22N-2. So, if K2N2K\geq 2N-2 then f1Df^{*}_{1D} is the minimizer of the representation cost (i.e., f1DargminfR(f)f^{*}_{1D}\in\arg\min_{f}R(f)). ∎
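For concreteness, f^{*}_{1D} is straightforward to evaluate directly from the clean points and the extreme noise values: it is the piecewise-linear interpolation through the knots (x_{n}+\epsilon_{n}^{\mathrm{min}},x_{n}) and (x_{n}+\epsilon_{n}^{\mathrm{max}},x_{n}), with constant extension outside the data range. Below is a small sketch we added (assuming the form in (17); names and values are illustrative).

```python
import numpy as np

def f_1d(y, x, eps_min, eps_max):
    """Min-cost 1D denoiser: constant x_n on [x_n+eps_min_n, x_n+eps_max_n],
    linear in between, constant beyond the extreme training inputs."""
    x, eps_min, eps_max = map(np.asarray, (x, eps_min, eps_max))
    knots_y = np.ravel(np.column_stack([x + eps_min, x + eps_max]))  # sorted under Assumption 1
    knots_f = np.repeat(x, 2)
    return np.interp(y, knots_y, knots_f)   # np.interp extrapolates with the end values

# Example: three clean points with their most extreme observed noises
x = np.array([-1.0, 0.5, 2.0])
eps_min = np.array([-0.1, -0.15, -0.05])
eps_max = np.array([0.2, 0.1, 0.1])
y = np.linspace(-2.0, 3.0, 7)
print(f_1d(y, x, eps_min, eps_max))         # denoised values, contracted toward {x_n}
```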

B.2 Proof of Theorem 1

First we prove the following lemmas.

Lemma 3.

Let y=x+σϵy=x+\sigma\epsilon where x,ϵ,σ>0x,\epsilon\in\mathbb{R},\sigma>0 then,

limσ0+x^eMMSE(y)=x^1NN(y)\displaystyle\lim_{\sigma\to 0^{+}}\hat{x}^{\mathrm{eMMSE}}\left(y\right)=\hat{x}^{1-\mathrm{NN}}\left(y\right)
Proof.

Notice that the following holds,

\displaystyle\frac{\exp\left(-\frac{\left|y-x_{i}\right|^{2}}{2\sigma^{2}}\right)}{\sum_{i=1}^{N}\exp\left(-\frac{\left|y-x_{i}\right|^{2}}{2\sigma^{2}}\right)}=\frac{\exp\left(-\frac{\left|y-x_{i}\right|^{2}}{2\sigma^{2}}\right)\exp\left(\frac{\min_{n}\left|y-x_{n}\right|^{2}}{2\sigma^{2}}\right)}{\sum_{i=1}^{N}\exp\left(-\frac{\left|y-x_{i}\right|^{2}}{2\sigma^{2}}\right)\exp\left(\frac{\min_{n}\left|y-x_{n}\right|^{2}}{2\sigma^{2}}\right)}
=exp(|yxi|2minn|yxn|22σ2)i=1Nexp(|yxi|2minn|yxn|22σ2).\displaystyle=\frac{\exp\left(-\frac{\left|y-x_{i}\right|^{2}-\min_{n}\left|y-x_{n}\right|^{2}}{2\sigma^{2}}\right)}{\sum_{i=1}^{N}\exp\left(-\frac{\left|y-x_{i}\right|^{2}-\min_{n}\left|y-x_{n}\right|^{2}}{2\sigma^{2}}\right)}\,.

In addition,

limσ0+exp(|yxi|2minn|yxn|22σ2)\displaystyle\lim_{\sigma\to 0^{+}}\exp\left(-\frac{\left|y-x_{i}\right|^{2}-\min_{n}\left|y-x_{n}\right|^{2}}{2\sigma^{2}}\right) ={1|yxi|2=minn|yxn|20|yxi|2minn|yxn|2\displaystyle=\begin{cases}1&\left|y-x_{i}\right|^{2}=\min_{n}\left|y-x_{n}\right|^{2}\\ 0&\left|y-x_{i}\right|^{2}\neq\min_{n}\left|y-x_{n}\right|^{2}\end{cases}

so we obtain,

limσ0+exp(|yxi|22σ2)i=1Nexp(|yxi|22σ2)={1|yxi|2=minn|yxn|20|yxi|2minn|yxn|2.\displaystyle\lim_{\sigma\to 0^{+}}\frac{\exp\left(-\frac{\left|y-x_{i}\right|^{2}}{2\sigma^{2}}\right)}{\sum_{i=1}^{N}\exp\left(-\frac{\left|y-x_{i}\right|^{2}}{2\sigma^{2}}\right)}=\begin{cases}1&\left|y-x_{i}\right|^{2}=\min_{n}\left|y-x_{n}\right|^{2}\\ 0&\left|y-x_{i}\right|^{2}\neq\min_{n}\left|y-x_{n}\right|^{2}\end{cases}\,.

Therefore,

\displaystyle\lim_{\sigma\to 0^{+}}\hat{x}^{\mathrm{eMMSE}}\left(y\right)=\lim_{\sigma\to 0^{+}}\frac{\sum_{i=1}^{N}x_{i}\exp\left(-\frac{\left|y-x_{i}\right|^{2}}{2\sigma^{2}}\right)}{\sum_{i=1}^{N}\exp\left(-\frac{\left|y-x_{i}\right|^{2}}{2\sigma^{2}}\right)}=x_{\arg\min_{n}|y-x_{n}|^{2}}
\displaystyle=x_{\arg\min_{n}|y-x_{n}|}=\hat{x}^{1-\mathrm{NN}}\left(y\right)\,.
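The limit in Lemma 3 is easy to observe numerically. The sketch below (ours) evaluates the softmax-weighted average defining \hat{x}^{\mathrm{eMMSE}} for decreasing \sigma; subtracting the minimal squared distance before exponentiating mirrors the manipulation in the proof and keeps the computation numerically stable.

```python
import numpy as np

def x_emmse(y, x, sigma):
    """Empirical MMSE denoiser: softmax-weighted average of the clean points."""
    d2 = (y - x) ** 2
    w = np.exp(-(d2 - d2.min()) / (2 * sigma ** 2))   # subtract the min for stability
    return np.sum(w * x) / np.sum(w)

x = np.array([-1.0, 0.3, 2.0])   # clean training points (illustrative)
y = 0.9                          # query
x_1nn = x[np.argmin(np.abs(y - x))]
for sigma in [1.0, 0.3, 0.1, 0.03]:
    print(f"sigma={sigma:>4}: eMMSE={x_emmse(y, x, sigma):+.4f}  (1-NN={x_1nn:+.1f})")
```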

Lemma 4.

Let y=x+σϵy=x+\sigma\epsilon where x,σ>0x\in\mathbb{R},\sigma>0 and ϵ𝒩(0,1)\epsilon\sim\mathcal{N}\left(0,1\right) then,

limσ0+MSE(x^eMMSE(y))=limσ0+MSE(x^1NN(y))\displaystyle\lim_{\sigma\to 0^{+}}\mathrm{MSE}\left(\hat{x}^{\mathrm{eMMSE}}\left(y\right)\right)=\lim_{\sigma\to 0^{+}}\mathrm{MSE}\left(\hat{x}^{1-\mathrm{NN}}\left(y\right)\right)
Proof.
MSE(x^eMMSE(y))\displaystyle\mathrm{MSE}\left(\hat{x}^{\mathrm{eMMSE}}\left(y\right)\right) =𝔼[(x^eMMSE(y)x)2]\displaystyle=\mathbb{E}\left[\left(\hat{x}^{\mathrm{eMMSE}}\left(y\right)-x\right)^{2}\right]
=𝔼[(x^eMMSE(y)x^1NN(y)+x^1NN(y)x)2]\displaystyle=\mathbb{E}\left[\left(\hat{x}^{\mathrm{eMMSE}}\left(y\right)-\hat{x}^{1-\mathrm{NN}}\left(y\right)+\hat{x}^{1-\mathrm{NN}}\left(y\right)-x\right)^{2}\right]
=MSE(x^1NN(y))+𝔼[(x^eMMSE(y)x^1NN(y))2]\displaystyle=\mathrm{MSE}\left(\hat{x}^{1-\mathrm{NN}}\left(y\right)\right)+\mathbb{E}\left[\left(\hat{x}^{\mathrm{eMMSE}}\left(y\right)-\hat{x}^{1-\mathrm{NN}}\left(y\right)\right)^{2}\right]
+2𝔼[(x^1NN(y)x)(x^eMMSE(y)x^1NN(y))]\displaystyle\quad+2\mathbb{E}\left[\left(\hat{x}^{1-\mathrm{NN}}\left(y\right)-x\right)\left(\hat{x}^{\mathrm{eMMSE}}\left(y\right)-\hat{x}^{1-\mathrm{NN}}\left(y\right)\right)\right]

Note that,

x^eMMSE(y)\displaystyle\hat{x}^{\mathrm{eMMSE}}\left(y\right) =i=1Nxiexp(|yxi|22σ2)i=1Nexp(|yxi|22σ2)i=1Nmaxj{xj}exp(|yxi|22σ2)i=1Nexp(|yxi|22σ2)\displaystyle=\frac{\sum_{i=1}^{N}x_{i}\exp\left(-\frac{\left|y-x_{i}\right|^{2}}{2\sigma^{2}}\right)}{\sum_{i=1}^{N}\exp\left(-\frac{\left|y-x_{i}\right|^{2}}{2\sigma^{2}}\right)}\leq\frac{\sum_{i=1}^{N}\max_{j}\{x_{j}\}\exp\left(-\frac{\left|y-x_{i}\right|^{2}}{2\sigma^{2}}\right)}{\sum_{i=1}^{N}\exp\left(-\frac{\left|y-x_{i}\right|^{2}}{2\sigma^{2}}\right)}
=maxj{xj}i=1Nexp(|yxi|22σ2)i=1Nexp(|yxi|22σ2)=maxj{xj},\displaystyle=\max_{j}\{x_{j}\}\frac{\sum_{i=1}^{N}\exp\left(-\frac{\left|y-x_{i}\right|^{2}}{2\sigma^{2}}\right)}{\sum_{i=1}^{N}\exp\left(-\frac{\left|y-x_{i}\right|^{2}}{2\sigma^{2}}\right)}=\max_{j}\{x_{j}\}\,,
x^eMMSE(y)\displaystyle\hat{x}^{\mathrm{eMMSE}}\left(y\right) =i=1Nxiexp(|yxi|22σ2)i=1Nexp(|yxi|22σ2)i=1Nminj{xj}exp(|yxi|22σ2)i=1Nexp(|yxi|22σ2)\displaystyle=\frac{\sum_{i=1}^{N}x_{i}\exp\left(-\frac{\left|y-x_{i}\right|^{2}}{2\sigma^{2}}\right)}{\sum_{i=1}^{N}\exp\left(-\frac{\left|y-x_{i}\right|^{2}}{2\sigma^{2}}\right)}\geq\frac{\sum_{i=1}^{N}\min_{j}\{x_{j}\}\exp\left(-\frac{\left|y-x_{i}\right|^{2}}{2\sigma^{2}}\right)}{\sum_{i=1}^{N}\exp\left(-\frac{\left|y-x_{i}\right|^{2}}{2\sigma^{2}}\right)}
=minj{xj}i=1Nexp(|yxi|22σ2)i=1Nexp(|yxi|22σ2)=minj{xj}.\displaystyle=\min_{j}\{x_{j}\}\frac{\sum_{i=1}^{N}\exp\left(-\frac{\left|y-x_{i}\right|^{2}}{2\sigma^{2}}\right)}{\sum_{i=1}^{N}\exp\left(-\frac{\left|y-x_{i}\right|^{2}}{2\sigma^{2}}\right)}=\min_{j}\{x_{j}\}\,.

Similarly,

\displaystyle\hat{x}^{1-\mathrm{NN}}\left(y\right)=x_{\arg\min_{i}|y-x_{i}|}\leq\max_{i}\{x_{i}\}
\displaystyle\hat{x}^{1-\mathrm{NN}}\left(y\right)=x_{\arg\min_{i}|y-x_{i}|}\geq\min_{i}\{x_{i}\}

thus M0>0σ>0\exists M_{0}>0\;\forall\sigma>0

|x^eMMSE(y)x^1NN(y)|<M0.\displaystyle|\hat{x}^{\mathrm{eMMSE}}\left(y\right)-\hat{x}^{1-\mathrm{NN}}\left(y\right)|<M_{0}\,.

According to Lemma 3

limσ0+x^eMMSE(y)=x^1NN(y)\displaystyle\lim_{\sigma\to 0^{+}}\hat{x}^{\mathrm{eMMSE}}\left(y\right)=\hat{x}^{1-\mathrm{NN}}\left(y\right)

almost surely. Note that,

𝔼x,y[(x^eMMSE(y)x^1NN(y))2]=𝔼x,ϵ[(x^eMMSE(x+σϵ)x^1NN(x+σϵ))2]\displaystyle\mathbb{E}_{x,y}\left[\left(\hat{x}^{\mathrm{eMMSE}}\left(y\right)-\hat{x}^{1-\mathrm{NN}}\left(y\right)\right)^{2}\right]=\mathbb{E}_{x,\epsilon}\left[\left(\hat{x}^{\mathrm{eMMSE}}\left(x+\sigma\epsilon\right)-\hat{x}^{1-\mathrm{NN}}\left(x+\sigma\epsilon\right)\right)^{2}\right]

Therefore, by the Dominated convergence theorem we obtain

limσ0+𝔼x,ϵ[(x^eMMSE(x+σϵ)x^1NN(x+σϵ))2]=\displaystyle\lim_{\sigma\to 0^{+}}\mathbb{E}_{x,\epsilon}\left[\left(\hat{x}^{\mathrm{eMMSE}}\left(x+\sigma\epsilon\right)-\hat{x}^{1-\mathrm{NN}}\left(x+\sigma\epsilon\right)\right)^{2}\right]=
𝔼x,ϵ[limσ0+(x^eMMSE(x+σϵ)x^1NN(x+σϵ))2]=0\displaystyle\mathbb{E}_{x,\epsilon}\left[\lim_{\sigma\to 0^{+}}\left(\hat{x}^{\mathrm{eMMSE}}\left(x+\sigma\epsilon\right)-\hat{x}^{1-\mathrm{NN}}\left(x+\sigma\epsilon\right)\right)^{2}\right]=0

Similarly,

limσ0+𝔼x,ϵ[(x^1NN(x+σϵ)x)(x^eMMSE(x+σϵ)x^1NN(x+σϵ))]=\displaystyle\lim_{\sigma\to 0^{+}}\mathbb{E}_{x,\epsilon}\left[\left(\hat{x}^{1-\mathrm{NN}}\left(x+\sigma\epsilon\right)-x\right)\left(\hat{x}^{\mathrm{eMMSE}}\left(x+\sigma\epsilon\right)-\hat{x}^{1-\mathrm{NN}}\left(x+\sigma\epsilon\right)\right)\right]=
𝔼x,ϵ[limσ0+(x^1NN(x+σϵ)x)(x^eMMSE(x+σϵ)x^1NN(x+σϵ))]=0\displaystyle\mathbb{E}_{x,\epsilon}\left[\lim_{\sigma\to 0^{+}}\left(\hat{x}^{1-\mathrm{NN}}\left(x+\sigma\epsilon\right)-x\right)\left(\hat{x}^{\mathrm{eMMSE}}\left(x+\sigma\epsilon\right)-\hat{x}^{1-\mathrm{NN}}\left(x+\sigma\epsilon\right)\right)\right]=0

since \hat{x}^{\mathrm{eMMSE}}\left(y\right) and \hat{x}^{1-\mathrm{NN}}\left(y\right) are bounded and \mathbb{E}\left[x\right]<\infty. Therefore, we get

limσ0+MSE(x^eMMSE(y))=limσ0+MSE(x^1NN(y))\displaystyle\lim_{\sigma\to 0^{+}}\mathrm{MSE}\left(\hat{x}^{\mathrm{eMMSE}}\left(y\right)\right)=\lim_{\sigma\to 0^{+}}\mathrm{MSE}\left(\hat{x}^{1-\mathrm{NN}}\left(y\right)\right)

Next, we prove Theorem 1.

Proof.

Assume, without loss of generality, that N=2N=2. Let px(x)p_{x}\left(x\right) be a probability density function with bounded second moment such that px(x)>0p_{x}\left(x\right)>0 for all x[x1,x2]x\in\left[x_{1},x_{2}\right]. According to Lemma 4

limσ0+MSE(x^eMMSE(y))=limσ0+MSE(x^1NN(y))\displaystyle\lim_{\sigma\to 0^{+}}\mathrm{MSE}\left(\hat{x}^{\mathrm{eMMSE}}\left(y\right)\right)=\lim_{\sigma\to 0^{+}}\mathrm{MSE}\left(\hat{x}^{1-\mathrm{NN}}\left(y\right)\right)

So we need to prove that

limσ0+MSE(x^1NN(y))\displaystyle\lim_{\sigma\to 0^{+}}\mathrm{MSE}\left(\hat{x}^{1-\mathrm{NN}}\left(y\right)\right) >limσ0+MSE(f1D(y))\displaystyle>\lim_{\sigma\to 0^{+}}\mathrm{MSE}\left(f^{*}_{1D}\left(y\right)\right)

For the case of N=2N=2, the training set includes two data points {x1,x2}\{x_{1},x_{2}\}. So we get,

x^1NN(y)=\displaystyle\hat{x}^{1-\mathrm{NN}}\left(y\right)= {x1y<x1+x22x2x1+x22y\displaystyle\begin{cases}x_{1}&y<\frac{x_{1}+x_{2}}{2}\\ x_{2}&\frac{x_{1}+x_{2}}{2}\leq y\end{cases}
f1D(y)=\displaystyle f^{*}_{1D}\left(y\right)= {x1y<x1+Δ1x2x1x2x1+Δ2Δ1(yx1Δ1)+x1x1+Δ1yx2+Δ2x2y>x2+Δ2\displaystyle\begin{cases}x_{1}&y<x_{1}+\Delta_{1}\\ \frac{x_{2}-x_{1}}{x_{2}-x_{1}+\Delta_{2}-\Delta_{1}}\left(y-x_{1}-\Delta_{1}\right)+x_{1}&x_{1}+\Delta_{1}\leq y\leq x_{2}+\Delta_{2}\\ x_{2}&y>x_{2}+\Delta_{2}\end{cases}

where \max_{m\in[M]}\epsilon_{1,m}=\Delta_{1}>0,\ \min_{m\in[M]}\epsilon_{2,m}=\Delta_{2}<0. Note that \hat{x}^{1-\mathrm{NN}}\left(y\right)=f^{*}_{1D}\left(y\right) for all y\in(-\infty,x_{1}+\Delta_{1})\cup(x_{2}+\Delta_{2},\infty), and \mathbb{E}\left[x\right]<\infty,\ \mathbb{E}\left[x^{2}\right]<\infty. Therefore,

MSE(x^1NN(y))MSE(f1D(y))=\displaystyle\mathrm{MSE}\left(\hat{x}^{1-\mathrm{NN}}\left(y\right)\right)-\mathrm{MSE}\left(f^{*}_{1D}\left(y\right)\right)=
dxx1+Δ1x2+Δ2dy(x^1NN(y)x)2py|x(y|x)px(x)\displaystyle\int_{-\infty}^{\infty}\mathrm{d}x\int_{x_{1}+\Delta_{1}}^{x_{2}+\Delta_{2}}\mathrm{d}y\left(\hat{x}^{1-\mathrm{NN}}\left(y\right)-x\right)^{2}p_{y|x}\left(y|x\right)p_{x}\left(x\right) (25)
dxx1+Δ1x2+Δ2dy(f1D(y)x)2py|x(y|x)px(x)\displaystyle-\int_{-\infty}^{\infty}\mathrm{d}x\int_{x_{1}+\Delta_{1}}^{x_{2}+\Delta_{2}}\mathrm{d}y\left(f^{*}_{1D}\left(y\right)-x\right)^{2}p_{y|x}\left(y|x\right)p_{x}\left(x\right) (26)

First, we derive (25):

dxx1+Δ1x2+Δ2dy(x^1NN(y)x)2py|x(y|x)px(x)=\displaystyle\int_{-\infty}^{\infty}\mathrm{d}x\int_{x_{1}+\Delta_{1}}^{x_{2}+\Delta_{2}}\mathrm{d}y\left(\hat{x}^{1-\mathrm{NN}}\left(y\right)-x\right)^{2}p_{y|x}\left(y|x\right)p_{x}\left(x\right)=
x1+Δ1x1+x22(x1x)2py|x(y|x)px(x)dydx+x1+x22x2+Δ2(x2x)2py|x(y|x)px(x)dydx=\displaystyle\int_{-\infty}^{\infty}\int_{x_{1}+\Delta_{1}}^{\frac{x_{1}+x_{2}}{2}}\left(x_{1}-x\right)^{2}p_{y|x}\left(y|x\right)p_{x}\left(x\right)\mathrm{\mathrm{d}yd}x+\int_{-\infty}^{\infty}\int_{\frac{x_{1}+x_{2}}{2}}^{x_{2}+\Delta_{2}}\left(x_{2}-x\right)^{2}p_{y|x}\left(y|x\right)p_{x}\left(x\right)\mathrm{\mathrm{d}yd}x=
𝔼x[P(y[x1+Δ1,x1+x22]|x)(x1x)2]+𝔼x[P(y[x1+x22,x2+Δ2]|x)(x2x)2].\displaystyle\mathbb{E}_{x}\left[P\left(y\in\left[x_{1}+\Delta_{1},\frac{x_{1}+x_{2}}{2}\right]\biggr{|}x\right)\left(x_{1}-x\right)^{2}\right]+\mathbb{E}_{x}\left[P\left(y\in\left[\frac{x_{1}+x_{2}}{2},x_{2}+\Delta_{2}\right]\biggr{|}x\right)\left(x_{2}-x\right)^{2}\right]\,.

Note that,

𝔼[P(y[x1+Δ1,x1+x22]|x)(x1x)2]\displaystyle\mathbb{E}\left[P\left(y\in\left[x_{1}+\Delta_{1},\frac{x_{1}+x_{2}}{2}\right]\biggr{|}x\right)\left(x_{1}-x\right)^{2}\right] <𝔼[(x1x)2]<\displaystyle<\mathbb{E}\left[\left(x_{1}-x\right)^{2}\right]<\infty
𝔼[P(y[x1+x22,x2+Δ2]|x)(x2x)2]\displaystyle\mathbb{E}\left[P\left(y\in\left[\frac{x_{1}+x_{2}}{2},x_{2}+\Delta_{2}\right]\biggr{|}x\right)\left(x_{2}-x\right)^{2}\right] <𝔼[(x2x)2]<\displaystyle<\mathbb{E}\left[\left(x_{2}-x\right)^{2}\right]<\infty

thus by the Dominated convergence theorem we obtain

limσ0+dxx1+Δ1x2+Δ2dy(x^1NN(y)x)2py|x(y|x)px(x)=\displaystyle\lim_{\sigma\to 0^{+}}\int_{-\infty}^{\infty}\mathrm{d}x\int_{x_{1}+\Delta_{1}}^{x_{2}+\Delta_{2}}\mathrm{d}y\left(\hat{x}^{1-\mathrm{NN}}\left(y\right)-x\right)^{2}p_{y|x}\left(y|x\right)p_{x}\left(x\right)=
𝔼x[limσ0+P(y[x1+Δ1,x1+x22]|x)(x1x)2]+𝔼x[limσ0+P(y[x1+x22,x2+Δ2]|x)(x2x)2]=\displaystyle\mathbb{E}_{x}\left[\lim_{\sigma\to 0^{+}}P\left(y\in\left[x_{1}+\Delta_{1},\frac{x_{1}+x_{2}}{2}\right]\biggr{|}x\right)\left(x_{1}-x\right)^{2}\right]+\mathbb{E}_{x}\left[\lim_{\sigma\to 0^{+}}P\left(y\in\left[\frac{x_{1}+x_{2}}{2},x_{2}+\Delta_{2}\right]\biggr{|}x\right)\left(x_{2}-x\right)^{2}\right]=
𝔼x[(xx1)21{x[x1,x1+x22]}]+𝔼x[(xx2)21{x[x1+x22,x2]}]>C>0\displaystyle\mathbb{E}_{x}\left[\left(x-x_{1}\right)^{2}1\left\{x\in\left[x_{1},\frac{x_{1}+x_{2}}{2}\right]\right\}\right]+\mathbb{E}_{x}\left[\left(x-x_{2}\right)^{2}1\left\{x\in\left[\frac{x_{1}+x_{2}}{2},x_{2}\right]\right\}\right]>C>0

since px(x)>0p_{x}\left(x\right)>0 for all x[x1,x2]x\in\left[x_{1},x_{2}\right]. Next, we derive (26)

dxx1+Δ1x2+Δ2dy(f1D(y)x)2py|x(y|x)px(x)=\displaystyle\int_{-\infty}^{\infty}\mathrm{d}x\int_{x_{1}+\Delta_{1}}^{x_{2}+\Delta_{2}}\mathrm{d}y\left(f^{*}_{1D}\left(y\right)-x\right)^{2}p_{y|x}\left(y|x\right)p_{x}\left(x\right)=
𝔼x,y[1{y[x1+Δ1,x2+Δ2]}(f1D(y)x)2]=\displaystyle\mathbb{E}_{x,y}\left[1\left\{y\in\left[x_{1}+\Delta_{1},x_{2}+\Delta_{2}\right]\right\}\left(f^{*}_{1D}\left(y\right)-x\right)^{2}\right]=
𝔼x,ϵ[1{x+σϵ[x1+Δ1,x2+Δ2]}(f1D(x+σϵ)x)2]\displaystyle\mathbb{E}_{x,\epsilon}\left[1\left\{x+\sigma\epsilon\in\left[x_{1}+\Delta_{1},x_{2}+\Delta_{2}\right]\right\}\left(f^{*}_{1D}\left(x+\sigma\epsilon\right)-x\right)^{2}\right]

Note that,

𝔼x,ϵ[1{x+σϵ[x1+Δ1,x2+Δ2]}(f1D(x+σϵ)x)2]<𝔼x,ϵ[(f1D(x+σϵ)x)2]<\displaystyle\mathbb{E}_{x,\epsilon}\left[1\left\{x+\sigma\epsilon\in\left[x_{1}+\Delta_{1},x_{2}+\Delta_{2}\right]\right\}\left(f^{*}_{1D}\left(x+\sigma\epsilon\right)-x\right)^{2}\right]<\mathbb{E}_{x,\epsilon}\left[\left(f^{*}_{1D}\left(x+\sigma\epsilon\right)-x\right)^{2}\right]<\infty

thus by the Dominated convergence theorem we obtain

\displaystyle\lim_{\sigma\to 0^{+}}\int_{-\infty}^{\infty}\mathrm{d}x\int_{x_{1}+\Delta_{1}}^{x_{2}+\Delta_{2}}\mathrm{d}y\left(f^{*}_{1D}\left(y\right)-x\right)^{2}p_{y|x}\left(y|x\right)p_{x}\left(x\right)=
\displaystyle\lim_{\sigma\to 0^{+}}\mathbb{E}_{x,\epsilon}\left[1\left\{x+\sigma\epsilon\in\left[x_{1}+\Delta_{1},x_{2}+\Delta_{2}\right]\right\}\left(f^{*}_{1D}\left(x+\sigma\epsilon\right)-x\right)^{2}\right]=
\displaystyle\mathbb{E}_{x,\epsilon}\left[\lim_{\sigma\to 0^{+}}1\left\{x+\sigma\epsilon\in\left[x_{1}+\Delta_{1},x_{2}+\Delta_{2}\right]\right\}\left(f^{*}_{1D}\left(x+\sigma\epsilon\right)-x\right)^{2}\right]=0

since

\displaystyle\lim_{\sigma\to 0^{+}}1\left\{x+\sigma\epsilon\in\left[x_{1}+\Delta_{1},x_{2}+\Delta_{2}\right]\right\}=1\left\{x\in\left[x_{1},x_{2}\right]\right\}
\displaystyle\lim_{\sigma\to 0^{+}}f^{*}_{1D}\left(x+\sigma\epsilon\right)=\begin{cases}x_{1}&x<x_{1}\\ x&x_{1}\leq x\leq x_{2}\\ x_{2}&x>x_{2}\end{cases}
Combining the limits of (25) and (26), we obtain \lim_{\sigma\to 0^{+}}\mathrm{MSE}\left(\hat{x}^{1-\mathrm{NN}}\left(y\right)\right)-\lim_{\sigma\to 0^{+}}\mathrm{MSE}\left(f^{*}_{1D}\left(y\right)\right)\geq C>0, which completes the proof. ∎
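To illustrate the resulting gap numerically for N=2, the following sketch (ours, with an illustrative uniform prior p_x on [x_1,x_2] and arbitrary \Delta_1>0, \Delta_2<0) estimates both MSEs by Monte Carlo at a small noise level.

```python
import numpy as np

x1, x2 = 0.0, 1.0
d1, d2 = 0.05, -0.05          # Delta_1 > 0, Delta_2 < 0 (illustrative)
sigma = 0.02
rng = np.random.default_rng(0)

x = rng.uniform(x1, x2, size=500_000)          # p_x: uniform on [x1, x2]
y = x + sigma * rng.standard_normal(x.size)

x_1nn = np.where(y < (x1 + x2) / 2, x1, x2)
# f*_1D for N=2: clamp of the linear segment joining the two plateaus
slope = (x2 - x1) / ((x2 + d2) - (x1 + d1))
f_star = np.clip(x1 + slope * (y - (x1 + d1)), x1, x2)

print("MSE 1-NN  :", np.mean((x_1nn - x) ** 2))
print("MSE f*_1D :", np.mean((f_star - x) ** 2))   # markedly smaller at small sigma
```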

B.3 Proof of Lemma 1

Proof.

First we prove that for all yn[N1]{xn+1ϵnmaxxnϵn+1minϵnmaxϵn+1min}f1D(y)=yy\in\cup_{n\in[N-1]}\Bigl{\{}\frac{x_{n+1}\epsilon^{\mathrm{max}}_{n}-x_{n}\epsilon^{\mathrm{min}}_{n+1}}{\epsilon^{\mathrm{max}}_{n}-\epsilon^{\mathrm{min}}_{n+1}}\Bigr{\}}\,\,f^{*}_{1D}(y)=y. We find the intersection between the line f1D(y)=yf^{*}_{1D}(y)=y and the linear part of (17) (the third branch)

xn+1xnxn+1+ϵn+1min(xn+ϵnmax)(y(xn+ϵnmax))+xn\displaystyle\frac{x_{n+1}-x_{n}}{x_{n+1}+\epsilon^{\mathrm{min}}_{n+1}-\left(x_{n}+\epsilon^{\mathrm{max}}_{n}\right)}\left(y-\left(x_{n}+\epsilon^{\mathrm{max}}_{n}\right)\right)+x_{n} =y\displaystyle=y (27)
(xn+1xn)(y(xn+ϵnmax))+xn(xn+1+ϵn+1min(xn+ϵnmax))\displaystyle\left(x_{n+1}-x_{n}\right)\left(y-\left(x_{n}+\epsilon^{\mathrm{max}}_{n}\right)\right)+x_{n}\left(x_{n+1}+\epsilon^{\mathrm{min}}_{n+1}-\left(x_{n}+\epsilon^{\mathrm{max}}_{n}\right)\right) =y(xn+1+ϵn+1min(xn+ϵnmax))\displaystyle=y\left(x_{n+1}+\epsilon^{\mathrm{min}}_{n+1}-\left(x_{n}+\epsilon^{\mathrm{max}}_{n}\right)\right) (28)
xn(xn+1+ϵn+1min(xn+ϵnmax))(xn+1xn)(xn+ϵnmax)\displaystyle x_{n}\left(x_{n+1}+\epsilon^{\mathrm{min}}_{n+1}-\left(x_{n}+\epsilon^{\mathrm{max}}_{n}\right)\right)-\left(x_{n+1}-x_{n}\right)\left(x_{n}+\epsilon^{\mathrm{max}}_{n}\right) =y(ϵn+1minϵnmax)\displaystyle=y\left(\epsilon^{\mathrm{min}}_{n+1}-\epsilon^{\mathrm{max}}_{n}\right) (29)
xn(xn+1+ϵn+1min)xn+1(xn+ϵnmax)\displaystyle x_{n}\left(x_{n+1}+\epsilon^{\mathrm{min}}_{n+1}\right)-x_{n+1}\left(x_{n}+\epsilon^{\mathrm{max}}_{n}\right) =y(ϵn+1minϵnmax)\displaystyle=y\left(\epsilon^{\mathrm{min}}_{n+1}-\epsilon^{\mathrm{max}}_{n}\right) (30)
xnϵn+1minxn+1ϵnmax\displaystyle x_{n}\epsilon^{\mathrm{min}}_{n+1}-x_{n+1}\epsilon^{\mathrm{max}}_{n} =y(ϵn+1minϵnmax)\displaystyle=y\left(\epsilon^{\mathrm{min}}_{n+1}-\epsilon^{\mathrm{max}}_{n}\right) (31)
y=xn+1ϵnmaxxnϵn+1minϵnmaxϵn+1min.\displaystyle y=\frac{x_{n+1}\epsilon^{\mathrm{max}}_{n}-x_{n}\epsilon^{\mathrm{min}}_{n+1}}{\epsilon^{\mathrm{max}}_{n}-\epsilon^{\mathrm{min}}_{n+1}}\,. (32)

Note that

xn+ϵnmax<xn+1ϵnmaxxnϵn+1minϵnmaxϵn+1min<xn+1+ϵn+1min\displaystyle x_{n}+\epsilon^{\mathrm{max}}_{n}<\frac{x_{n+1}\epsilon^{\mathrm{max}}_{n}-x_{n}\epsilon^{\mathrm{min}}_{n+1}}{\epsilon^{\mathrm{max}}_{n}-\epsilon^{\mathrm{min}}_{n+1}}<x_{n+1}+\epsilon^{\mathrm{min}}_{n+1} (33)

Since,

xn+ϵnmax\displaystyle x_{n}+\epsilon^{\mathrm{max}}_{n} <xn+1ϵnmaxxnϵn+1minϵnmaxϵn+1min\displaystyle<\frac{x_{n+1}\epsilon^{\mathrm{max}}_{n}-x_{n}\epsilon^{\mathrm{min}}_{n+1}}{\epsilon^{\mathrm{max}}_{n}-\epsilon^{\mathrm{min}}_{n+1}} (34)
(xn+ϵnmax)(ϵnmaxϵn+1min)\displaystyle\left(x_{n}+\epsilon^{\mathrm{max}}_{n}\right)\left(\epsilon^{\mathrm{max}}_{n}-\epsilon^{\mathrm{min}}_{n+1}\right) <xn+1ϵnmaxxnϵn+1min\displaystyle<x_{n+1}\epsilon^{\mathrm{max}}_{n}-x_{n}\epsilon^{\mathrm{min}}_{n+1} (35)
(xn+ϵnmaxϵn+1min)ϵnmax\displaystyle\left(x_{n}+\epsilon^{\mathrm{max}}_{n}-\epsilon^{\mathrm{min}}_{n+1}\right)\epsilon^{\mathrm{max}}_{n} <xn+1ϵnmax\displaystyle<x_{n+1}\epsilon^{\mathrm{max}}_{n} (36)
xn+ϵnmaxϵn+1min\displaystyle x_{n}+\epsilon^{\mathrm{max}}_{n}-\epsilon^{\mathrm{min}}_{n+1} <xn+1\displaystyle<x_{n+1} (37)
xn+ϵnmax\displaystyle x_{n}+\epsilon^{\mathrm{max}}_{n} <xn+1+ϵn+1min\displaystyle<x_{n+1}+\epsilon^{\mathrm{min}}_{n+1} (38)

which holds according to Assumption 1.

xn+1ϵnmaxxnϵn+1minϵnmaxϵn+1min\displaystyle\frac{x_{n+1}\epsilon^{\mathrm{max}}_{n}-x_{n}\epsilon^{\mathrm{min}}_{n+1}}{\epsilon^{\mathrm{max}}_{n}-\epsilon^{\mathrm{min}}_{n+1}} <xn+1+ϵn+1min\displaystyle<x_{n+1}+\epsilon^{\mathrm{min}}_{n+1} (39)
xn+1ϵnmaxxnϵn+1min\displaystyle x_{n+1}\epsilon^{\mathrm{max}}_{n}-x_{n}\epsilon^{\mathrm{min}}_{n+1} <(xn+1+ϵn+1min)(ϵnmaxϵn+1min)\displaystyle<\left(x_{n+1}+\epsilon^{\mathrm{min}}_{n+1}\right)\left(\epsilon^{\mathrm{max}}_{n}-\epsilon^{\mathrm{min}}_{n+1}\right) (40)
xnϵn+1min\displaystyle-x_{n}\epsilon^{\mathrm{min}}_{n+1} <ϵn+1min(xn+1ϵnmax+ϵn+1min)\displaystyle<-\epsilon^{\mathrm{min}}_{n+1}\left(x_{n+1}-\epsilon^{\mathrm{max}}_{n}+\epsilon^{\mathrm{min}}_{n+1}\right) (41)
xn\displaystyle x_{n} <xn+1ϵnmax+ϵn+1min\displaystyle<x_{n+1}-\epsilon^{\mathrm{max}}_{n}+\epsilon^{\mathrm{min}}_{n+1} (42)
xn+ϵnmax\displaystyle x_{n}+\epsilon^{\mathrm{max}}_{n} <xn+1+ϵn+1min\displaystyle<x_{n+1}+\epsilon^{\mathrm{min}}_{n+1} (43)

which holds according to Assumption 1.

Refer to caption
Figure 5: Illustration of f1D(y)f_{1D}^{*}(y).

Next we prove that f^{*}_{1D}(y) is contractive toward the set of clean data points \{x_{n}\}_{n=1}^{N} on \mathbb{R}\setminus\cup_{n\in[N-1]}\Bigl\{\frac{x_{n+1}\epsilon^{\mathrm{max}}_{n}-x_{n}\epsilon^{\mathrm{min}}_{n+1}}{\epsilon^{\mathrm{max}}_{n}-\epsilon^{\mathrm{min}}_{n+1}}\Bigr\}. In the case where y\in(-\infty,x_{1}+\epsilon^{\mathrm{min}}_{1}] or y\in\cup_{n\in[N]}[x_{n}+\epsilon^{\mathrm{min}}_{n},x_{n}+\epsilon^{\mathrm{max}}_{n}] or y\in[x_{N}+\epsilon^{\mathrm{max}}_{N},\infty), we have f^{*}_{1D}(y)\in\{x_{n}\}_{n=1}^{N}, so (18) holds for all 0\leq\alpha<1. In the case where y\in\cup_{n\in[N-1]}[x_{n}+\epsilon^{\mathrm{max}}_{n},\frac{x_{n+1}\epsilon^{\mathrm{max}}_{n}-x_{n}\epsilon^{\mathrm{min}}_{n+1}}{\epsilon^{\mathrm{max}}_{n}-\epsilon^{\mathrm{min}}_{n+1}}) we choose i=n,

|f1D(y)f(xn)|=xn+1xnxn+1+ϵn+1min(xn+ϵnmax)(y(xn+ϵnmax))\displaystyle\left|f^{*}_{1D}\left(y\right)-f\left(x_{n}\right)\right|=\frac{x_{n+1}-x_{n}}{x_{n+1}+\epsilon^{\mathrm{min}}_{n+1}-\left(x_{n}+\epsilon^{\mathrm{max}}_{n}\right)}\left(y-\left(x_{n}+\epsilon^{\mathrm{max}}_{n}\right)\right) (44)

There exists 0<γ1<10<\gamma_{1}<1 such that

xnxn+1xnxn+1+ϵn+1min(xn+ϵnmax)(y(xn+ϵnmax))+xnγ1y\displaystyle x_{n}\leq\frac{x_{n+1}-x_{n}}{x_{n+1}+\epsilon^{\mathrm{min}}_{n+1}-\left(x_{n}+\epsilon^{\mathrm{max}}_{n}\right)}\left(y-\left(x_{n}+\epsilon^{\mathrm{max}}_{n}\right)\right)+x_{n}\leq\gamma_{1}y (45)

since for y\in\cup_{n\in[N-1]}[x_{n}+\epsilon^{\mathrm{max}}_{n},\frac{x_{n+1}\epsilon^{\mathrm{max}}_{n}-x_{n}\epsilon^{\mathrm{min}}_{n+1}}{\epsilon^{\mathrm{max}}_{n}-\epsilon^{\mathrm{min}}_{n+1}}),\ f^{*}_{1D}(y) is below the line f(y)=y, because f^{*}_{1D}(y) is an affine function with slope larger than 1 on this interval and f^{*}_{1D}\bigl(\frac{x_{n+1}\epsilon^{\mathrm{max}}_{n}-x_{n}\epsilon^{\mathrm{min}}_{n+1}}{\epsilon^{\mathrm{max}}_{n}-\epsilon^{\mathrm{min}}_{n+1}}\bigr)=\frac{x_{n+1}\epsilon^{\mathrm{max}}_{n}-x_{n}\epsilon^{\mathrm{min}}_{n+1}}{\epsilon^{\mathrm{max}}_{n}-\epsilon^{\mathrm{min}}_{n+1}}. Therefore there exists 0<\alpha_{1}<1 such that

|f1D(y)f(xn)|=xn+1xnxn+1+ϵn+1min(xn+ϵnmax)(y(xn+ϵnmax))γ1yxnα1(yxn)\displaystyle\left|f^{*}_{1D}\left(y\right)-f\left(x_{n}\right)\right|=\frac{x_{n+1}-x_{n}}{x_{n+1}+\epsilon^{\mathrm{min}}_{n+1}-\left(x_{n}+\epsilon^{\mathrm{max}}_{n}\right)}\left(y-\left(x_{n}+\epsilon^{\mathrm{max}}_{n}\right)\right)\leq\gamma_{1}y-x_{n}\leq\alpha_{1}\left(y-x_{n}\right) (46)

since,

γ1yxn\displaystyle\gamma_{1}y-x_{n} α1(yxn)\displaystyle\leq\alpha_{1}\left(y-x_{n}\right) (47)
α1\displaystyle\alpha_{1} γ1yxnyxn.\displaystyle\geq\frac{\gamma_{1}y-x_{n}}{y-x_{n}}\,. (48)

In the case where yn[N1](xn+1ϵnmaxxnϵn+1minϵnmaxϵn+1min,xn+1+ϵn+1min]y\in\cup_{n\in[N-1]}(\frac{x_{n+1}\epsilon^{\mathrm{max}}_{n}-x_{n}\epsilon^{\mathrm{min}}_{n+1}}{\epsilon^{\mathrm{max}}_{n}-\epsilon^{\mathrm{min}}_{n+1}},x_{n+1}+\epsilon^{\mathrm{min}}_{n+1}] we choose i=n+1i=n+1,

|f1D(y)f(xn+1)|=xn+1xn+1xnxn+1+ϵn+1min(xn+ϵnmax)(y(xn+ϵnmax))xn.\displaystyle\left|f^{*}_{1D}\left(y\right)-f\left(x_{n+1}\right)\right|=x_{n+1}-\frac{x_{n+1}-x_{n}}{x_{n+1}+\epsilon^{\mathrm{min}}_{n+1}-\left(x_{n}+\epsilon^{\mathrm{max}}_{n}\right)}\left(y-\left(x_{n}+\epsilon^{\mathrm{max}}_{n}\right)\right)-x_{n}\,. (49)

There exists 0<γ2<10<\gamma_{2}<1 such that

xn+1xn+1xnxn+1+ϵn+1min(xn+ϵnmax)(y(xn+ϵnmax))+xn1γ2y\displaystyle x_{n+1}\geq\frac{x_{n+1}-x_{n}}{x_{n+1}+\epsilon^{\mathrm{min}}_{n+1}-\left(x_{n}+\epsilon^{\mathrm{max}}_{n}\right)}\left(y-\left(x_{n}+\epsilon^{\mathrm{max}}_{n}\right)\right)+x_{n}\geq\frac{1}{\gamma_{2}}y (50)

since for yn[N1](xn+1ϵnmaxxnϵn+1minϵnmaxϵn+1min,xn+1+ϵn+1min]y\in\cup_{n\in[N-1]}(\frac{x_{n+1}\epsilon^{\mathrm{max}}_{n}-x_{n}\epsilon^{\mathrm{min}}_{n+1}}{\epsilon^{\mathrm{max}}_{n}-\epsilon^{\mathrm{min}}_{n+1}},x_{n+1}+\epsilon^{\mathrm{min}}_{n+1}]\,\, f1D(y)f^{*}_{1D}(y) is above the line f(y)=yf(y)=y because f1D(y)f^{*}_{1D}(y) is an affine function with slope larger than 1 and f1D(xn+1ϵnmaxxnϵn+1minϵnmaxϵn+1min)=xn+1ϵnmaxxnϵn+1minϵnmaxϵn+1minf^{*}_{1D}(\frac{x_{n+1}\epsilon^{\mathrm{max}}_{n}-x_{n}\epsilon^{\mathrm{min}}_{n+1}}{\epsilon^{\mathrm{max}}_{n}-\epsilon^{\mathrm{min}}_{n+1}})=\frac{x_{n+1}\epsilon^{\mathrm{max}}_{n}-x_{n}\epsilon^{\mathrm{min}}_{n+1}}{\epsilon^{\mathrm{max}}_{n}-\epsilon^{\mathrm{min}}_{n+1}}. Therefore there exists 0<α2<10<\alpha_{2}<1

|f1D(y)f(xn+1)|\displaystyle\left|f^{*}_{1D}\left(y\right)-f\left(x_{n+1}\right)\right| =xn+1xn+1xnxn+1+ϵn+1min(xn+ϵnmax)(y(xn+ϵnmax))xn\displaystyle=x_{n+1}-\frac{x_{n+1}-x_{n}}{x_{n+1}+\epsilon^{\mathrm{min}}_{n+1}-\left(x_{n}+\epsilon^{\mathrm{max}}_{n}\right)}\left(y-\left(x_{n}+\epsilon^{\mathrm{max}}_{n}\right)\right)-x_{n} (51)
xn+11γ2yα2(xn+1y)\displaystyle\leq x_{n+1}-\frac{1}{\gamma_{2}}y\leq\alpha_{2}\left(x_{n+1}-y\right) (52)

since,

xn+11γ2y\displaystyle x_{n+1}-\frac{1}{\gamma_{2}}y α2(xn+1y)\displaystyle\leq\alpha_{2}\left(x_{n+1}-y\right) (53)
α2\displaystyle\alpha_{2} xn+11γ2yxn+1y.\displaystyle\geq\frac{x_{n+1}-\frac{1}{\gamma_{2}}y}{x_{n+1}-y}\,. (54)

Therefore (18) holds for α=max{α1,α2}\alpha=\max\{\alpha_{1},\alpha_{2}\}. ∎
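Both the fixed points y=\frac{x_{n+1}\epsilon^{\mathrm{max}}_{n}-x_{n}\epsilon^{\mathrm{min}}_{n+1}}{\epsilon^{\mathrm{max}}_{n}-\epsilon^{\mathrm{min}}_{n+1}} and the contraction property can be spot-checked numerically with the same piecewise-linear construction sketched after the proof of Proposition 1 (a sketch we added; values are illustrative).

```python
import numpy as np

x = np.array([-1.0, 0.5, 2.0])
eps_min = np.array([-0.1, -0.15, -0.05])
eps_max = np.array([0.2, 0.1, 0.1])

knots_y = np.ravel(np.column_stack([x + eps_min, x + eps_max]))
knots_f = np.repeat(x, 2)
f = lambda y: np.interp(y, knots_y, knots_f)

# Fixed points of f*_1D between consecutive plateaus, per Lemma 1
y_star = (x[1:] * eps_max[:-1] - x[:-1] * eps_min[1:]) / (eps_max[:-1] - eps_min[1:])
print("f(y*) - y* :", f(y_star) - y_star)          # ~0 up to floating point

# Contraction toward a clean point (here the nearest one) at a few test inputs
y = np.array([-0.7, 0.0, 0.9, 1.8])
nearest = x[np.argmin(np.abs(y[:, None] - x[None, :]), axis=1)]
print("|f(y)-x_i| <= |y-x_i| :", np.abs(f(y) - nearest) <= np.abs(y - nearest))
```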

Appendix C Proofs of Results in Section 5

Let 𝒇:dd{\bm{f}}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} be any function realizable as a shallow ReLU network. Consider any parametrization of 𝒇{\bm{f}} given by 𝒇(𝒚)=k=1K𝒂k[𝒘k𝒚+bk]++𝑽𝒚+𝒄{\bm{f}}({\bm{y}})=\sum_{k=1}^{K}{\bm{a}}_{k}[{\bm{w}}_{k}^{\top}{\bm{y}}+b_{k}]_{+}+{\bm{V}}{\bm{y}}+{\bm{c}}. We say such a parametrization is a minimal representative of 𝐟{\bm{f}} if 𝒘k2=1\|{\bm{w}}_{k}\|_{2}=1 and 𝒂k0{\bm{a}}_{k}\neq 0 for all k=1,,Kk=1,...,K, and R(𝒇)=k=1K𝒂k2R({\bm{f}})=\sum_{k=1}^{K}\|{\bm{a}}_{k}\|_{2}. In particular, the units making up a minimal representative must be distinct in the sense that the hyperplanes describing ReLU boundaries Hk={𝒙d:𝒘k𝒙+bk=0}H_{k}=\{{\bm{x}}\in\mathbb{R}^{d}:{\bm{w}}_{k}^{\top}{\bm{x}}+b_{k}=0\} are distinct, which implies that no units can be cancelled or combined.

We will also make use of the following lemma, which shows that representation costs are invariant to a translation of the training samples, assuming the function is suitably translated.

Lemma 5.

Let 𝐟:dd{\bm{f}}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} be any function realizable as a shallow ReLU net satisfying norm-ball interpolation constraints 𝐟(B(𝐱n;ρ))={𝐱n}{\bm{f}}(B({\bm{x}}_{n};\rho))=\{{\bm{x}}_{n}\} for all n=1,,Nn=1,...,N. Let 𝐱0d{\bm{x}}_{0}\in\mathbb{R}^{d}. Then the function 𝐠(𝐲)=𝐟(𝐲𝐱0)+𝐱0{\bm{g}}({\bm{y}})={\bm{f}}({\bm{y}}-{\bm{x}}_{0})+{\bm{x}}_{0} is such that R(𝐠)=R(𝐟)R({\bm{g}})=R({\bm{f}}) and 𝐠(B(𝐱n+𝐱0;ρ))={𝐱n+𝐱0}{\bm{g}}(B({\bm{x}}_{n}+{\bm{x}}_{0};\rho))=\{{\bm{x}}_{n}+{\bm{x}}_{0}\} for all n=1,,Nn=1,...,N.

Proof.

First we show 𝒈(B(𝒙n+𝒙0;ρ))={𝒙n+𝒙0}{\bm{g}}(B({\bm{x}}_{n}+{\bm{x}}_{0};\rho))=\{{\bm{x}}_{n}+{\bm{x}}_{0}\} for all n=1,,Nn=1,...,N. Fix any nn, and let 𝒚B(𝒙n+𝒙0;ρ){\bm{y}}\in B({\bm{x}}_{n}+{\bm{x}}_{0};\rho). Then 𝒚=𝒚+𝒙0{\bm{y}}={\bm{y}}^{\prime}+{\bm{x}}_{0} for some 𝒚B(𝒙n;ρ){\bm{y}}^{\prime}\in B({\bm{x}}_{n};\rho), and so 𝒈(𝒚)=𝒇(𝒚)+𝒙0=𝒙n+𝒙0{\bm{g}}({\bm{y}})={\bm{f}}({\bm{y}}^{\prime})+{\bm{x}}_{0}={\bm{x}}_{n}+{\bm{x}}_{0}, as claimed.

To show that R(𝒈)=R(𝒇)R({\bm{g}})=R({\bm{f}}), let 𝒇(𝒚)=k=1K𝒂k[𝒘k𝒚+bk]++𝑽𝒚+𝒄{\bm{f}}({\bm{y}})=\sum_{k=1}^{K}{\bm{a}}_{k}[{\bm{w}}_{k}^{\top}{\bm{y}}+b_{k}]_{+}+{\bm{V}}{\bm{y}}+{\bm{c}} be any minimal representative of 𝒇{\bm{f}}. Then

𝒈(𝒚)\displaystyle{\bm{g}}({\bm{y}}) =k=1K𝒂k[𝒘k(𝒚𝒙0)+bk]++𝑽(𝒚𝒙0)+𝒄+𝒙0\displaystyle=\sum_{k=1}^{K}{\bm{a}}_{k}[{\bm{w}}_{k}^{\top}({\bm{y}}-{\bm{x}}_{0})+b_{k}]_{+}+{\bm{V}}({\bm{y}}-{\bm{x}}_{0})+{\bm{c}}+{\bm{x}}_{0} (55)
=k=1K𝒂k[𝒘k𝒚+b~k]++𝑽𝒚+𝒄~\displaystyle=\sum_{k=1}^{K}{\bm{a}}_{k}[{\bm{w}}_{k}^{\top}{\bm{y}}+\tilde{b}_{k}]_{+}+{\bm{V}}{\bm{y}}+\tilde{{\bm{c}}} (56)

with b~k=bk𝒘k𝒙0\tilde{b}_{k}=b_{k}-{\bm{w}}_{k}^{\top}{\bm{x}}_{0} and 𝒄~=𝒄+𝒙0𝑽𝒙0\tilde{{\bm{c}}}={\bm{c}}+{\bm{x}}_{0}-{\bm{V}}{\bm{x}}_{0}. And so R(𝒈)k=1K𝒂k2=R(𝒇)R({\bm{g}})\leq\sum_{k=1}^{K}\|{\bm{a}}_{k}\|_{2}=R({\bm{f}}). A parallel argument with the roles of 𝒈{\bm{g}} and 𝒇{\bm{f}} reversed shows that R(𝒇)R(𝒈)R({\bm{f}})\leq R({\bm{g}}), and so R(𝒇)=R(𝒈)R({\bm{f}})=R({\bm{g}}). ∎

In particular, this lemma shows that if 𝒇(𝒚){\bm{f}}^{*}({\bm{y}}) is a norm-ball interpolating representation cost minimizer of the training samples {𝒙n}n=1N\{{\bm{x}}_{n}\}_{n=1}^{N}, then 𝒈(𝒚)=𝒇(𝒚𝒙0)+𝒙0{\bm{g}}^{*}({\bm{y}})={\bm{f}}^{*}({\bm{y}}-{\bm{x}}_{0})+{\bm{x}}_{0} is a norm-ball interpolating min-cost solution of the translated training samples {𝒙n+𝒙0}n=1N\{{\bm{x}}_{n}+{\bm{x}}_{0}\}_{n=1}^{N}.
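The reparametrization used in the proof (\tilde{b}_{k}=b_{k}-{\bm{w}}_{k}^{\top}{\bm{x}}_{0} and \tilde{{\bm{c}}}={\bm{c}}+{\bm{x}}_{0}-{\bm{V}}{\bm{x}}_{0}, leaving {\bm{a}}_{k}, {\bm{w}}_{k}, {\bm{V}} untouched) is easy to verify numerically; below is a minimal sketch with a random shallow ReLU network (illustrative, not the authors' code).

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 4, 6
W = rng.standard_normal((K, d)); b = rng.standard_normal(K)
A = rng.standard_normal((d, K)); V = rng.standard_normal((d, d)); c = rng.standard_normal(d)

def net(Y, W, b, A, V, c):
    """Shallow ReLU net f(y) = sum_k a_k [w_k^T y + b_k]_+ + V y + c, batched over rows of Y."""
    return np.maximum(Y @ W.T + b, 0.0) @ A.T + Y @ V.T + c

x0 = rng.standard_normal(d)
b_t = b - W @ x0                  # shifted biases
c_t = c + x0 - V @ x0             # shifted output offset

Y = rng.standard_normal((5, d))
g_direct = net(Y - x0, W, b, A, V, c) + x0       # g(y) = f(y - x0) + x0
g_reparam = net(Y, W, b_t, A, V, c_t)            # same inner/outer weights, shifted b, c
print(np.allclose(g_direct, g_reparam))          # True: the weights (hence R) are unchanged
```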

C.1 Training data belonging to a subspace

The proof of Theorem 2 is a direct consequence of the following two lemmas:

Lemma 6.

Suppose the clean training samples \{{\bm{x}}_{n}\}_{n=1}^{N} belong to an r-dimensional subspace {\mathcal{S}}\subset\mathbb{R}^{d}, and let {\bm{P}}_{\mathcal{S}} denote the orthogonal projector onto {\mathcal{S}}. Let {\bm{f}}({\bm{y}}) be any function realizable as a shallow ReLU network satisfying {\bm{f}}(B({\bm{x}}_{n},\rho))=\{{\bm{x}}_{n}\} for all n=1,...,N. Define {\bm{g}}({\bm{y}})={\bm{P}}_{\mathcal{S}}{\bm{f}}({\bm{y}}). Then {\bm{g}}(B({\bm{x}}_{n},\rho))=\{{\bm{x}}_{n}\} for all n=1,...,N, and R({\bm{g}})\leq R({\bm{f}}), with strict inequality if {\bm{f}}\neq{\bm{g}}.

Proof.

First, for any 𝒚B(𝒙n,ρ){\bm{y}}\in B({\bm{x}}_{n},\rho) we have 𝒈(𝒚)=𝑷𝒮𝒇(𝒚)=𝑷𝒮𝒙n=𝒙n{\bm{g}}({\bm{y}})={\bm{P}}_{\mathcal{S}}{\bm{f}}({\bm{y}})={\bm{P}}_{\mathcal{S}}{\bm{x}}_{n}={\bm{x}}_{n}. Therefore, 𝒈(B(𝒙n,ρ))={𝒙n}{\bm{g}}(B({\bm{x}}_{n},\rho))=\{{\bm{x}}_{n}\} for all n=1,,Nn=1,...,N.

Now, let {\bm{f}}({\bm{y}})=\sum_{k=1}^{K}{\bm{a}}_{k}[{\bm{w}}_{k}^{\top}{\bm{y}}+b_{k}]_{+}+{\bm{V}}{\bm{y}}+{\bm{c}} be a minimal representative of {\bm{f}}. Then {\bm{g}}({\bm{y}})=\sum_{k=1}^{K}{\bm{P}}_{\mathcal{S}}{\bm{a}}_{k}[{\bm{w}}_{k}^{\top}{\bm{y}}+b_{k}]_{+}+{\bm{P}}_{\mathcal{S}}({\bm{V}}{\bm{y}}+{\bm{c}}). Since \|{\bm{P}}_{\mathcal{S}}{\bm{a}}_{k}\|\leq\|{\bm{a}}_{k}\| for all k, we have R({\bm{g}})\leq\sum_{k=1}^{K}\|{\bm{P}}_{\mathcal{S}}{\bm{a}}_{k}\|\leq\sum_{k=1}^{K}\|{\bm{a}}_{k}\|=R({\bm{f}}).

Now we show 𝒇𝒈{\bm{f}}\neq{\bm{g}} implies R(𝒈)<R(𝒇)R({\bm{g}})<R({\bm{f}}). Observe that if any of the outer-layer weight vectors 𝒂k𝒮{\bm{a}}_{k}\not\in{\mathcal{S}} then 𝑷𝒮𝒂k<𝒂k\|{\bm{P}}_{\mathcal{S}}{\bm{a}}_{k}\|<\|{\bm{a}}_{k}\|, which implies R(𝒈)<R(𝒇)R({\bm{g}})<R({\bm{f}}). Hence, it suffices to show that 𝒇𝒈{\bm{f}}\neq{\bm{g}} implies some 𝒂k𝒮{\bm{a}}_{k}\not\in{\mathcal{S}}.

First, consider the case where {\bm{P}}_{\mathcal{S}}{\bm{V}}={\bm{V}} and {\bm{P}}_{\mathcal{S}}{\bm{c}}={\bm{c}}. Then in this case {\bm{f}}\neq{\bm{g}} if and only if \sum_{k=1}^{K}({\bm{P}}_{\mathcal{S}}{\bm{a}}_{k}-{\bm{a}}_{k})[{\bm{w}}_{k}^{\top}{\bm{y}}+b_{k}]_{+}\neq 0 for some {\bm{y}}\in\mathbb{R}^{d}, which implies {\bm{P}}_{\mathcal{S}}{\bm{a}}_{k}\neq{\bm{a}}_{k} for some k, or equivalently {\bm{a}}_{k}\not\in{\mathcal{S}}.

Next, assume either 𝑷𝒮𝑽𝑽{\bm{P}}_{\mathcal{S}}{\bm{V}}\neq{\bm{V}} or 𝑷𝒮𝒄𝒄{\bm{P}}_{\mathcal{S}}{\bm{c}}\neq{\bm{c}}. Fix any training sample index nn, and let AnA_{n} denote the index set of units active over the ball B(𝒙n,ρ)B({\bm{x}}_{n},\rho). Then, since 𝒇(𝒚){\bm{f}}({\bm{y}}) is constant for all 𝒚B(𝒙n,ρ){\bm{y}}\in B({\bm{x}}_{n},\rho), the Jacobian 𝒇(𝒚)=kAn𝒂k𝒘k+𝑽=𝟎\bm{\partial}{\bm{f}}({\bm{y}})=\sum_{k\in A_{n}}{\bm{a}}_{k}{\bm{w}}_{k}^{\top}+{\bm{V}}=\bm{0} for all 𝒚B(𝒙n,ρ){\bm{y}}\in B({\bm{x}}_{n},\rho). This gives 𝑽=kAn𝒂k𝒘k{\bm{V}}=-\sum_{k\in A_{n}}{\bm{a}}_{k}{\bm{w}}_{k}^{\top}. Therefore, if 𝑷𝒮𝑽𝑽{\bm{P}}_{\mathcal{S}}{\bm{V}}\neq{\bm{V}} then at least one of the 𝒂k𝑺{\bm{a}}_{k}\not\in{\bm{S}}. On the other hand, if 𝑷𝒮𝒄𝒄{\bm{P}}_{\mathcal{S}}{\bm{c}}\neq{\bm{c}} then for all 𝒚B(𝒙n,ρ){\bm{y}}\in B({\bm{x}}_{n},\rho) we have

𝒇(𝒚)=kAn𝒂k(𝒘k𝒚+bk)+𝑽𝒚+𝒄=kAn(𝒂k𝒘k+𝑽)𝒚+kAnbk𝒂k+𝒄=kAnbk𝒂k+𝒄=𝒙n,{\bm{f}}({\bm{y}})=\sum_{k\in A_{n}}{\bm{a}}_{k}({\bm{w}}_{k}^{\top}{\bm{y}}+b_{k})+{\bm{V}}{\bm{y}}+{\bm{c}}=\sum_{k\in A_{n}}({\bm{a}}_{k}{\bm{w}}_{k}^{\top}+{\bm{V}}){\bm{y}}+\sum_{k\in A_{n}}b_{k}{\bm{a}}_{k}+{\bm{c}}=\sum_{k\in A_{n}}b_{k}{\bm{a}}_{k}+{\bm{c}}={\bm{x}}_{n},

which implies 𝒄=𝒙nkAnbk𝒂k{\bm{c}}={\bm{x}}_{n}-\sum_{k\in A_{n}}b_{k}{\bm{a}}_{k}, and since 𝒙n𝒮{\bm{x}}_{n}\in{\mathcal{S}} this implies some 𝒂k𝒮{\bm{a}}_{k}\not\in{\mathcal{S}}, proving the claim. ∎

Lemma 7.

Suppose the clean training samples {𝐱n}n=1N\{{\bm{x}}_{n}\}_{n=1}^{N} belong to an rr-dimensional subspace 𝒮d{\mathcal{S}}\subset\mathbb{R}^{d}, and let 𝐏𝒮{\bm{P}}_{\mathcal{S}} denote the orthogonal projector onto 𝒮{\mathcal{S}}. Let 𝐟{\bm{f}} be any network satisfying 𝐟(B(𝐱n,ρ))={𝐱n}{\bm{f}}(B({\bm{x}}_{n},\rho))=\{{\bm{x}}_{n}\}. Define 𝐡(𝐲)=𝐟(𝐏𝒮𝐲){\bm{h}}({\bm{y}})={\bm{f}}({\bm{P}}_{\mathcal{S}}{\bm{y}}). Then 𝐡(B(𝐱n,ρ))={𝐱n}{\bm{h}}(B({\bm{x}}_{n},\rho))=\{{\bm{x}}_{n}\} and R(𝐡)R(𝐟)R({\bm{h}})\leq R({\bm{f}}), with strict inequality if 𝐡𝐟{\bm{h}}\neq{\bm{f}}.

Proof.

Define {\bm{P}}_{\mathcal{S}}^{-1}(B({\bm{x}}_{n},\rho)):=\{{\bm{y}}\in\mathbb{R}^{d}:{\bm{P}}_{\mathcal{S}}{\bm{y}}\in B({\bm{x}}_{n},\rho)\}, i.e., the set of points in \mathbb{R}^{d} whose projection onto {\mathcal{S}} is contained in B({\bm{x}}_{n},\rho). Then clearly {\bm{h}}({\bm{P}}_{\mathcal{S}}^{-1}(B({\bm{x}}_{n},\rho)))=\{{\bm{x}}_{n}\}. Also, since {\bm{x}}_{n}\in{\mathcal{S}}, for any {\bm{y}}\in B({\bm{x}}_{n},\rho) we have \|{\bm{P}}_{\mathcal{S}}{\bm{y}}-{\bm{x}}_{n}\|=\|{\bm{P}}_{\mathcal{S}}({\bm{y}}-{\bm{x}}_{n})\|\leq\|{\bm{y}}-{\bm{x}}_{n}\|\leq\rho, so B({\bm{x}}_{n},\rho)\subset{\bm{P}}_{\mathcal{S}}^{-1}(B({\bm{x}}_{n},\rho)). Therefore, {\bm{h}}(B({\bm{x}}_{n},\rho))=\{{\bm{x}}_{n}\} for all n=1,...,N.

Next, let 𝒇(𝒚)=k=1K𝒂k[𝒘k𝒚+bk]++𝑽𝒚+𝒄{\bm{f}}({\bm{y}})=\sum_{k=1}^{K}{\bm{a}}_{k}[{\bm{w}}_{k}^{\top}{\bm{y}}+b_{k}]_{+}+{\bm{V}}{\bm{y}}+{\bm{c}} be a minimal representative of 𝒇{\bm{f}}. Then h(𝒚)=k=1K𝒂k[(𝑷𝒮𝒘k)𝒚+bk]++𝑽𝑷𝒮𝒚+𝒄h({\bm{y}})=\sum_{k=1}^{K}{\bm{a}}_{k}[({\bm{P}}_{\mathcal{S}}{\bm{w}}_{k})^{\top}{\bm{y}}+b_{k}]_{+}+{\bm{V}}{\bm{P}}_{\mathcal{S}}{\bm{y}}+{\bm{c}}. Since 𝑷𝒮𝒘k𝒘k\|{\bm{P}}_{\mathcal{S}}{\bm{w}}_{k}\|\leq\|{\bm{w}}_{k}\| for all kk, we have R(𝒉)k=1K𝒂k𝑷𝒮𝒘kk=1K𝒂k𝒘k=R(𝒇)R({\bm{h}})\leq\sum_{k=1}^{K}\|{\bm{a}}_{k}\|\|{\bm{P}}_{\mathcal{S}}{\bm{w}}_{k}\|\leq\sum_{k=1}^{K}\|{\bm{a}}_{k}\|\|{\bm{w}}_{k}\|=R({\bm{f}}).

Finally, we show that if {\bm{f}}\neq{\bm{h}}, then R({\bm{h}})<R({\bm{f}}). Observe that if any of the inner-layer weight vectors {\bm{w}}_{k}\not\in{\mathcal{S}} then \|{\bm{P}}_{\mathcal{S}}{\bm{w}}_{k}\|<\|{\bm{w}}_{k}\|, which implies R({\bm{h}})<R({\bm{f}}). Hence, it suffices to show that {\bm{f}}\neq{\bm{h}} implies some {\bm{w}}_{k}\not\in{\mathcal{S}}. First, consider the case {\bm{V}}{\bm{P}}_{\mathcal{S}}={\bm{V}}. Then {\bm{f}}\neq{\bm{h}} if and only if {\bm{P}}_{\mathcal{S}}{\bm{w}}_{k}\neq{\bm{w}}_{k} for some k, or equivalently, {\bm{w}}_{k}\not\in{\mathcal{S}}. Next, consider the case {\bm{V}}{\bm{P}}_{\mathcal{S}}\neq{\bm{V}}. Fix any training sample index n, and let A_{n} denote the index set of units active over B({\bm{x}}_{n},\rho). Then, since {\bm{f}}({\bm{y}}) is constant for all {\bm{y}}\in B({\bm{x}}_{n},\rho), the Jacobian \bm{\partial}{\bm{f}}({\bm{y}})=\sum_{k\in A_{n}}{\bm{a}}_{k}{\bm{w}}_{k}^{\top}+{\bm{V}}=\bm{0} for all {\bm{y}}\in B({\bm{x}}_{n},\rho). This gives {\bm{V}}=-\sum_{k\in A_{n}}{\bm{a}}_{k}{\bm{w}}_{k}^{\top}. Therefore, if {\bm{V}}{\bm{P}}_{\mathcal{S}}\neq{\bm{V}} (i.e., at least one row of {\bm{V}} is not contained in {\mathcal{S}}) then at least one of the {\bm{w}}_{k}\not\in{\mathcal{S}}, proving the claim. ∎

We now give the proofs of Theorem 2 and Corollary 1.

Proof of Theorem 2.

Let {\bm{f}}^{*} be any min-cost solution. Applying Lemmas 6 and 7, we see that it must be the case that {\bm{f}}^{*}({\bm{y}})={\bm{P}}_{\mathcal{S}}{\bm{f}}^{*}({\bm{P}}_{\mathcal{S}}{\bm{y}}) for all {\bm{y}}\in\mathbb{R}^{d}, since otherwise the representation cost of {\bm{f}}^{*} could be strictly reduced. ∎

Proof of Corollary 1.

Suppose 𝒮{\mathcal{S}} is one-dimensional, i.e., 𝒮=span{𝒖}{\mathcal{S}}=\text{span}\{{\bm{u}}\} for some unit vector 𝒖d{\bm{u}}\in\mathbb{R}^{d}. Theorem 2 shows we can express any min-cost solution 𝒇{\bm{f}}^{*} as

𝒇(𝒚)=k=1Kak𝒖[sk𝒖𝒚+bk]++v𝒖𝒖𝒚+c𝒖=𝒖ϕ(𝒖𝒚){\bm{f}}^{*}({\bm{y}})=\sum_{k=1}^{K}a_{k}{\bm{u}}[s_{k}{\bm{u}}^{\top}{\bm{y}}+b_{k}]_{+}+v{\bm{u}}{\bm{u}}^{\top}{\bm{y}}+c{\bm{u}}={\bm{u}}\phi({\bm{u}}^{\top}{\bm{y}})

where

ϕ(t)=kak[𝒔kt+bk]++vt+c\phi(t)=\sum_{k}a_{k}[{\bm{s}}_{k}t+b_{k}]_{+}+vt+c

is such that R({\bm{f}}^{*})=R(\phi). Therefore, minimizing the representation cost subject to the norm-ball interpolation constraints reduces to a univariate problem:

minϕR(ϕ)s.t.ϕ((cnρ,cn+ρ))=cn\min_{\phi}R(\phi)~{}~{}s.t.~{}~{}\phi((c_{n}-\rho,c_{n}+\rho))=c_{n}

where cn=𝒖𝒙nc_{n}={\bm{u}}^{\top}{\bm{x}}_{n}. The minimizing ϕ\phi^{*} is unique and coincides with the 1-D denoiser f1Df^{*}_{1D} in (17) with xn=cnx_{n}=c_{n} and ϵnmax=ϵnmin=ρ\epsilon_{n}^{\mathrm{max}}=-\epsilon_{n}^{\mathrm{min}}=\rho. ∎
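In code, the resulting rank-one denoiser can be assembled directly from the 1-D solution. The sketch below (ours) builds {\bm{f}}^{*}({\bm{y}})={\bm{u}}\phi({\bm{u}}^{\top}{\bm{y}}) for clean points on a line through the origin, reusing the piecewise-linear 1-D construction with \epsilon_{n}^{\mathrm{max}}=-\epsilon_{n}^{\mathrm{min}}=\rho, and checks that it maps points inside the balls B({\bm{x}}_{n},\rho) back to the clean points.

```python
import numpy as np

d, rho = 5, 0.1
rng = np.random.default_rng(0)
u = rng.standard_normal(d); u /= np.linalg.norm(u)   # direction of the 1-D subspace S
c = np.array([-1.5, -0.5, 0.8, 1.7])                 # 1-D coordinates c_n = u^T x_n
X = c[:, None] * u                                   # clean points on the line

# 1-D min-cost denoiser phi with eps_max = -eps_min = rho (cf. Corollary 1)
knots_t = np.ravel(np.column_stack([c - rho, c + rho]))
knots_f = np.repeat(c, 2)
phi = lambda t: np.interp(t, knots_t, knots_f)
f_star = lambda Y: phi(Y @ u)[:, None] * u           # f*(y) = u * phi(u^T y)

# Noisy queries inside the balls B(x_n, rho) are mapped exactly to the clean points
noise = rng.standard_normal(X.shape)
noise *= 0.9 * rho / np.linalg.norm(noise, axis=1, keepdims=True)
print(np.allclose(f_star(X + noise), X))             # True
```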

C.2 Data along rays

We begin with a key lemma that is central to the proof of Theorem 3.

Lemma 8.

Suppose \{{\bm{u}}_{i}\}_{i=1}^{n}\subset\mathbb{R}^{d} is a collection of n>1 unit vectors such that {\bm{u}}_{i}^{\top}{\bm{u}}_{j}<0 for all i\neq j, and {\bm{w}} is a unit vector such that {\bm{u}}_{i}^{\top}{\bm{w}}>0 for all i=1,...,n. Let {\bm{a}}\in\mathbb{R}^{d} be any nonzero vector. Then,

i=1n|𝒖i𝒂|(𝒖i𝒘)<𝒂.\sum_{i=1}^{n}|{\bm{u}}_{i}^{\top}{\bm{a}}|({\bm{u}}_{i}^{\top}{\bm{w}})<\|{\bm{a}}\|.

Before giving the proof of Lemma 8, we first prove an auxiliary result.

Lemma 9.

Let 𝐚d{\bm{a}}\in\mathbb{R}^{d} be a unit vector, and suppose {𝐮i}i=1nd\{{\bm{u}}_{i}\}_{i=1}^{n}\subset\mathbb{R}^{d} is a collection of unit vectors such that 𝐮i𝐚>0{\bm{u}}_{i}^{\top}{\bm{a}}>0 for all ii and 𝐮i𝐮j<0{\bm{u}}_{i}^{\top}{\bm{u}}_{j}<0 for all iji\neq j. Let 𝐛=i=1n𝐮i𝐮i𝐚{\bm{b}}=\sum_{i=1}^{n}{\bm{u}}_{i}{\bm{u}}_{i}^{\top}{\bm{a}}. Then 𝐛𝐚/21/2\|{\bm{b}}-{\bm{a}}/2\|\leq 1/2 with strict inequality if n>1n>1.

Proof.

It suffices to show 𝒃𝒂𝒃𝒃{\bm{b}}^{\top}{\bm{a}}\geq{\bm{b}}^{\top}{\bm{b}}, since if this is the case then

𝒃𝒂/2\displaystyle\|{\bm{b}}-{\bm{a}}/2\| =(𝒃𝒂/2)(𝒃𝒂/2)=𝒃𝒃𝒃𝒂+14𝒂𝒂12,\displaystyle=\sqrt{({\bm{b}}-{\bm{a}}/2)^{\top}({\bm{b}}-{\bm{a}}/2)}=\sqrt{{\bm{b}}^{\top}{\bm{b}}-{\bm{b}}^{\top}{\bm{a}}+\frac{1}{4}{\bm{a}}^{\top}{\bm{a}}}\leq\frac{1}{2}, (57)

which also holds with strict inequality when 𝒃𝒂>𝒃𝒃{\bm{b}}^{\top}{\bm{a}}>{\bm{b}}^{\top}{\bm{b}}.

First, if n=1, then {\bm{b}}={\bm{u}}_{1}{\bm{u}}_{1}^{\top}{\bm{a}}, and so {\bm{b}}^{\top}{\bm{b}}={\bm{a}}^{\top}{\bm{u}}_{1}{\bm{u}}_{1}^{\top}{\bm{u}}_{1}{\bm{u}}_{1}^{\top}{\bm{a}}={\bm{a}}^{\top}{\bm{u}}_{1}{\bm{u}}_{1}^{\top}{\bm{a}}={\bm{b}}^{\top}{\bm{a}}. Now assume n>1. Then we have

𝒃𝒂\displaystyle{\bm{b}}^{\top}{\bm{a}} =i=1n𝒂𝒖i𝒖i𝒂\displaystyle=\sum_{i=1}^{n}{\bm{a}}^{\top}{\bm{u}}_{i}{\bm{u}}_{i}^{\top}{\bm{a}} (58)
=i=1n𝒂𝒖i𝒖i𝒖i𝒖i𝒂\displaystyle=\sum_{i=1}^{n}{\bm{a}}^{\top}{\bm{u}}_{i}{\bm{u}}_{i}^{\top}{\bm{u}}_{i}{\bm{u}}_{i}^{\top}{\bm{a}} (59)
\displaystyle>\sum_{i=1}^{n}\sum_{j=1}^{n}{\bm{a}}^{\top}{\bm{u}}_{i}{\bm{u}}_{i}^{\top}{\bm{u}}_{j}{\bm{u}}_{j}^{\top}{\bm{a}} (60)
=𝒃𝒃\displaystyle={\bm{b}}^{\top}{\bm{b}} (61)

The first and last equalities hold by definition. The second equality holds because each 𝒖i{\bm{u}}_{i} is a unit vector. The last inequality holds because for iji\neq j

(𝒂𝒖i)(𝒖i𝒖j)(𝒖j𝒂)<0,({\bm{a}}^{\top}{\bm{u}}_{i})({\bm{u}}_{i}^{\top}{\bm{u}}_{j})({\bm{u}}_{j}^{\top}{\bm{a}})<0,

since 𝒖i𝒖j<0{\bm{u}}_{i}^{\top}{\bm{u}}_{j}<0 and 𝒂𝒖i,𝒂𝒖j>0{\bm{a}}^{\top}{\bm{u}}_{i},{\bm{a}}^{\top}{\bm{u}}_{j}>0 by assumption. ∎

Now we give the proof of Lemma 8.

Proof of Lemma 8.

Let 𝒗=i=1n|𝒖i𝒂|𝒖i{\bm{v}}=\sum_{i=1}^{n}|{\bm{u}}_{i}^{\top}{\bm{a}}|{\bm{u}}_{i}. Then we have

S=i=1n|𝒖i𝒂|(𝒖i𝒘)=𝒗𝒘S=\sum_{i=1}^{n}|{\bm{u}}_{i}^{\top}{\bm{a}}|({\bm{u}}_{i}^{\top}{\bm{w}})={\bm{v}}^{\top}{\bm{w}}

WLOG, we may assume 𝒂=1\|{\bm{a}}\|=1, in which case the lemma reduces to showing S=𝒗𝒘<1S={\bm{v}}^{\top}{\bm{w}}<1. By the Cauchy-Schwartz inequality, it suffices to show 𝒗<1\|{\bm{v}}\|<1. Toward this end, let us write 𝒗=𝒗++𝒗{\bm{v}}={\bm{v}}_{+}+{\bm{v}}_{-} where 𝒗+=i:𝒖i𝒂>0|𝒖i𝒂|𝒖i=i:𝒖i𝒂>0𝒖i𝒖i𝒂{\bm{v}}_{+}=\sum_{i:{\bm{u}}_{i}^{\top}{\bm{a}}>0}|{\bm{u}}_{i}^{\top}{\bm{a}}|{\bm{u}}_{i}=\sum_{i:{\bm{u}}_{i}^{\top}{\bm{a}}>0}{\bm{u}}_{i}{\bm{u}}_{i}^{\top}{\bm{a}} and 𝒗=i:𝒖i𝒂<0|𝒖i𝒂|𝒖i=i:𝒖i𝒂<0𝒖i𝒖i(𝒂){\bm{v}}_{-}=\sum_{i:{\bm{u}}_{i}^{\top}{\bm{a}}<0}|{\bm{u}}_{i}^{\top}{\bm{a}}|{\bm{u}}_{i}=\sum_{i:{\bm{u}}_{i}^{\top}{\bm{a}}<0}{\bm{u}}_{i}{\bm{u}}_{i}^{\top}(-{\bm{a}}). By Lemma 9 we have 𝒗+𝒂/21/2\|{\bm{v}}_{+}-{\bm{a}}/2\|\leq 1/2 and 𝒗+𝒂/21/2\|{\bm{v}}_{-}+{\bm{a}}/2\|\leq 1/2, and so

𝒗\displaystyle\|{\bm{v}}\| =𝒗++𝒗\displaystyle=\|{\bm{v}}_{+}+{\bm{v}}_{-}\| (62)
=𝒗+𝒂/2+𝒗+𝒂/2\displaystyle=\|{\bm{v}}_{+}-{\bm{a}}/2+{\bm{v}}_{-}+{\bm{a}}/2\| (63)
𝒗+𝒂/2+𝒗+𝒂/2\displaystyle\leq\|{\bm{v}}_{+}-{\bm{a}}/2\|+\|{\bm{v}}_{-}+{\bm{a}}/2\| (64)
1.\displaystyle\leq 1. (65)

Also, if either of the index sets I_{+}:=\{i:{\bm{u}}_{i}^{\top}{\bm{a}}>0\} or I_{-}:=\{i:{\bm{u}}_{i}^{\top}{\bm{a}}<0\} has cardinality greater than one, then Lemma 9 guarantees \|{\bm{v}}_{+}-{\bm{a}}/2\|<1/2 or \|{\bm{v}}_{-}+{\bm{a}}/2\|<1/2, and so \|{\bm{v}}\|<1, which gives S<1.

It remains to show $S<1$ when both $I_{+}$ and $I_{-}$ have cardinality at most one, i.e., $|I_{+}|\leq 1$ and $|I_{-}|\leq 1$. We consider the various possibilities:

Case 1: |I+|=0|I_{+}|=0 & |I|=0|I_{-}|=0. Then 𝒗=𝟎{\bm{v}}=\bm{0} and so S=𝒗𝒘=0<1S={\bm{v}}^{\top}{\bm{w}}=0<1.

Case 2: |I+|=1|I_{+}|=1 & |I|=0|I_{-}|=0 or |I+|=0|I_{+}|=0 & |I|=1|I_{-}|=1. Then 𝒗=|𝒖i𝒂|𝒖i{\bm{v}}=|{\bm{u}}_{i}^{\top}{\bm{a}}|{\bm{u}}_{i} for some ii, and |𝒖j𝒂|=0|{\bm{u}}_{j}^{\top}{\bm{a}}|=0 for all jij\neq i. By way of contradiction, assume that S=𝒗𝒘=|𝒖i𝒂|𝒖i𝒘=1S={\bm{v}}^{\top}{\bm{w}}=|{\bm{u}}_{i}^{\top}{\bm{a}}|{\bm{u}}_{i}^{\top}{\bm{w}}=1. Since 𝒖i{\bm{u}}_{i}, 𝒂{\bm{a}}, and 𝒘{\bm{w}} are all unit vectors, the only way this is possible is if |𝒖i𝒂|=1|{\bm{u}}_{i}^{\top}{\bm{a}}|=1 and 𝒖i𝒘=1{\bm{u}}_{i}^{\top}{\bm{w}}=1, which implies 𝒂=±𝒘{\bm{a}}=\pm{\bm{w}} and 𝒖i=𝒘{\bm{u}}_{i}={\bm{w}}, and so 𝒂=±𝒖i{\bm{a}}=\pm{\bm{u}}_{i}. However, this shows that 𝒖j𝒖i=0{\bm{u}}_{j}^{\top}{\bm{u}}_{i}=0 for all jij\neq i, contradicting our assumption that 𝒖j𝒖i<0{\bm{u}}_{j}^{\top}{\bm{u}}_{i}<0 for all iji\neq j. Therefore, S<1S<1 in this case as well.

Case 3: |I+|=1|I_{+}|=1 & |I|=1|I_{-}|=1. Then 𝒗=|𝒖i𝒂|𝒖i+|𝒖j𝒂|𝒖j{\bm{v}}=|{\bm{u}}_{i}^{\top}{\bm{a}}|{\bm{u}}_{i}+|{\bm{u}}_{j}^{\top}{\bm{a}}|{\bm{u}}_{j} for some iI+i\in I_{+} and jIj\in I_{-}, and so

𝒗\displaystyle\|{\bm{v}}\| =|𝒖i𝒂|𝒖i+|𝒖j𝒂|𝒖j\displaystyle=\||{\bm{u}}_{i}^{\top}{\bm{a}}|{\bm{u}}_{i}+|{\bm{u}}_{j}^{\top}{\bm{a}}|{\bm{u}}_{j}\| (66)
=(𝒖i𝒖i𝒖j𝒖j)𝒂\displaystyle=\|({\bm{u}}_{i}{\bm{u}}_{i}^{\top}-{\bm{u}}_{j}{\bm{u}}_{j}^{\top}){\bm{a}}\| (67)
𝒖i𝒖i𝒖j𝒖j\displaystyle\leq\|{\bm{u}}_{i}{\bm{u}}_{i}^{\top}-{\bm{u}}_{j}{\bm{u}}_{j}^{\top}\| (68)

where the last inequality follows from the fact that $\|{\bm{a}}\|=1$. Finally, since the matrix ${\bm{u}}_{i}{\bm{u}}_{i}^{\top}-{\bm{u}}_{j}{\bm{u}}_{j}^{\top}$ is symmetric, its operator norm $\|{\bm{u}}_{i}{\bm{u}}_{i}^{\top}-{\bm{u}}_{j}{\bm{u}}_{j}^{\top}\|$ coincides with its maximum eigenvalue (in absolute value). Eigenvectors in the span of ${\bm{u}}_{i}$ and ${\bm{u}}_{j}$ have the form ${\bm{z}}=c_{i}{\bm{u}}_{i}+c_{j}{\bm{u}}_{j}$, where $c_{i},c_{j}$ satisfy the equation

(𝒖i𝒖i𝒖j𝒖j)(ci𝒖i+cj𝒖j)=λ(ci𝒖i+cj𝒖j)({\bm{u}}_{i}{\bm{u}}_{i}^{\top}-{\bm{u}}_{j}{\bm{u}}_{j}^{\top})(c_{i}{\bm{u}}_{i}+c_{j}{\bm{u}}_{j})=\lambda(c_{i}{\bm{u}}_{i}+c_{j}{\bm{u}}_{j}) (69)

where λ\lambda\in\mathbb{R} is the corresponding eigenvalue. Expanding and equating coefficients gives the system

[1λ𝒖i𝒖j𝒖i𝒖j1+λ][cicj]=[00]\begin{bmatrix}1-\lambda&{\bm{u}}_{i}^{\top}{\bm{u}}_{j}\\ {\bm{u}}_{i}^{\top}{\bm{u}}_{j}&1+\lambda\end{bmatrix}\begin{bmatrix}c_{i}\\ c_{j}\end{bmatrix}=\begin{bmatrix}0\\ 0\end{bmatrix} (70)

which has non-trivial solutions iff

(1λ)(1+λ)(𝒖i𝒖j)2=0λ=±1(𝒖i𝒖j)2(1-\lambda)(1+\lambda)-({\bm{u}}_{i}^{\top}{\bm{u}}_{j})^{2}=0~{}~{}\iff~{}~{}\lambda=\pm\sqrt{1-({\bm{u}}_{i}^{\top}{\bm{u}}_{j})^{2}} (71)

and so |λ|<1|\lambda|<1. Therefore, 𝒗𝒖i𝒖i𝒖j𝒖j<1\|{\bm{v}}\|\leq\|{\bm{u}}_{i}{\bm{u}}_{i}^{\top}-{\bm{u}}_{j}{\bm{u}}_{j}^{\top}\|<1 as claimed. And so S𝒗<1S\leq\|{\bm{v}}\|<1. ∎

Figure 6: Illustration of “unit splitting” technique used in proof of Theorem 3. If a ReLU unit is active along two rays (left panel), it can be split into two units with lower representation cost by projecting its inner-layer weight vector 𝒘{\bm{w}} onto the two rays separately (right panel).
Proof of Theorem 3.

Let 𝒇{\bm{f}}^{*} be any representation cost minimizer. We prove the claim by showing that if 𝒇{\bm{f}}^{*} fails to have the form specified in (20) then it is possible to construct a norm-ball interpolant 𝒉{\bm{h}} whose units are aligned with the rays that has strictly smaller representation cost than 𝒇{\bm{f}}^{*}.

First, let ${\bm{f}}^{*}({\bm{y}})=\sum_{k=1}^{K}{\bm{a}}_{k}[{\bm{w}}_{k}^{\top}{\bm{y}}+b_{k}]_{+}+{\bm{V}}{\bm{y}}+{\bm{c}}$ be any minimal representative of ${\bm{f}}^{*}$. By properties of minimal representatives, none of the ReLU boundary sets $\{{\bm{y}}\in\mathbb{R}^{d}:{\bm{w}}_{k}^{\top}{\bm{y}}+b_{k}=0\}$ can intersect any of the norm-balls centered at the training samples, since otherwise ${\bm{f}}^{*}$ would be non-constant on one of the norm-balls. Also, we may alter this parameterization in such a way that the active set of every unit avoids the ball centered at the origin, $B_{0}:=B(\bm{0},\rho)$, without changing the representation cost. In particular, suppose the active set of the $k$th unit contains $B_{0}$. Then, using the identity $[t]_{+}=t-[-t]_{+}$ for all $t\in\mathbb{R}$, we have

𝒂k[𝒘k𝒚+bk]+=𝒂k(𝒘k𝒚+bk)𝒂k[𝒘k𝒚bk]+{\bm{a}}_{k}[{\bm{w}}_{k}^{\top}{\bm{y}}+b_{k}]_{+}={\bm{a}}_{k}({\bm{w}}_{k}^{\top}{\bm{y}}+b_{k})-{\bm{a}}_{k}[-{\bm{w}}_{k}^{\top}{\bm{y}}-b_{k}]_{+}

for all ${\bm{y}}\in\mathbb{R}^{d}$. The active set of the reversed unit $-{\bm{a}}_{k}[-{\bm{w}}_{k}^{\top}{\bm{y}}-b_{k}]_{+}$ does not intersect $B_{0}$. Therefore, after applying this transformation to all units whose active sets contain $B_{0}$, we can write ${\bm{f}}^{*}={\bm{g}}^{*}+\bm{\ell}$ where ${\bm{g}}^{*}$ is a sum of ReLU units whose active sets do not contain $B_{0}$ and $\bm{\ell}$ is an affine function that combines the original affine part ${\bm{V}}{\bm{y}}+{\bm{c}}$ with a sum of linear terms of the form ${\bm{a}}_{k}({\bm{w}}_{k}^{\top}{\bm{y}}+b_{k})$ arising from the transformation above. However, because of the interpolation constraint ${\bm{f}}^{*}(B_{0})=\{\bm{0}\}$, and by the fact that no units in ${\bm{g}}^{*}$ are active over $B_{0}$, for all ${\bm{y}}\in B_{0}$ we have

𝒇(𝒚)=𝒈(𝒚)+(𝒚)=(𝒚)=𝟎{\bm{f}}^{*}({\bm{y}})={\bm{g}}^{*}({\bm{y}})+\bm{\ell}({\bm{y}})=\bm{\ell}({\bm{y}})=\bm{0} (72)

which implies $\bm{\ell}\equiv\bm{0}$ (an affine function that vanishes on an open ball must vanish identically), i.e., ${\bm{f}}^{*}({\bm{y}})={\bm{g}}^{*}({\bm{y}})$ for all ${\bm{y}}$. Finally, note that the inner- and outer-layer weight vectors of the reversed units change only in sign. Therefore, this re-parameterization does not change the representation cost, and so it is also a minimal representative.

Now we construct a new norm-ball interpolant 𝒉{\bm{h}} as follows. Let 𝒇{\bm{f}}_{\ell} be the function defined as the sum of all units making up the re-parametrized 𝒇{\bm{f}}^{*} that are active on at least one norm-ball centered on the \ellth ray. Also, let 𝑷=𝒖𝒖d×d{\bm{P}}_{\ell}={\bm{u}}_{\ell}{\bm{u}}_{\ell}^{\top}\in\mathbb{R}^{d\times d} denote the orthogonal projector onto the \ellth ray. Define 𝒉==1L𝒉{\bm{h}}=\sum_{\ell=1}^{L}{\bm{h}}_{\ell} where 𝒉(𝒚):=𝑷𝒇(𝑷𝒚){\bm{h}}_{\ell}({\bm{y}}):={\bm{P}}_{\ell}{\bm{f}}_{\ell}({\bm{P}}_{\ell}{\bm{y}}). Put in words, 𝒉{\bm{h}} is constructed by “splitting” any units active over multiple rays into a sum of multiple units aligned with each ray; see Figure 6 for an illustration.

First, we prove that 𝒉{\bm{h}} satisfies norm-ball interpolation constraints. Let 𝒂[𝒘𝒚+b]+{\bm{a}}[{\bm{w}}^{\top}{\bm{y}}+b]_{+} denote any unit belonging to 𝒇{\bm{f}}_{\ell}. Since this unit is active on a norm-ball centered on ray \ell, this implies that 𝒘𝒖>0{\bm{w}}^{\top}{\bm{u}}_{\ell}>0, and since the unit is inactive on B0B_{0} we must have b<0b<0. Also, because we assume 𝒖𝒖m<0{\bm{u}}_{\ell}^{\top}{\bm{u}}_{m}<0 for any mm\neq\ell, we see that the projected unit 𝑷𝒂[𝒘𝑷𝒚+b]+=𝑷𝒂[(𝒘𝒖)𝒖𝒚+b]+{\bm{P}}_{\ell}{\bm{a}}[{\bm{w}}^{\top}{\bm{P}}_{\ell}{\bm{y}}+b]_{+}={\bm{P}}_{\ell}{\bm{a}}[({\bm{w}}^{\top}{\bm{u}}_{\ell}){\bm{u}}_{\ell}^{\top}{\bm{y}}+b]_{+} is active on norm-balls centered on ray \ell and inactive on norm-balls centered on any other ray. In particular, if 𝒚{\bm{y}} belongs to a norm-ball centered on ray \ell, then 𝒉m(𝒚)=0{\bm{h}}_{m}({\bm{y}})=0 for all mm\neq\ell. Therefore, if 𝒚B(𝒙n(),ρ){\bm{y}}\in B({\bm{x}}_{n}^{(\ell)},\rho) where 𝒙n(){\bm{x}}_{n}^{(\ell)} denotes a training sample along ray \ell, then

𝒉(𝒚)\displaystyle{\bm{h}}({\bm{y}}) =m=1L𝒉m(𝒚)\displaystyle=\sum_{m=1}^{L}{\bm{h}}_{m}({\bm{y}}) (73)
=𝒉(𝒚)\displaystyle={\bm{h}}_{\ell}({\bm{y}}) (74)
=𝑷𝒇(𝑷𝒚)\displaystyle={\bm{P}}_{\ell}{\bm{f}}_{\ell}({\bm{P}}_{\ell}{\bm{y}}) (75)
=𝑷𝒇(𝑷𝒚)\displaystyle={\bm{P}}_{\ell}{\bm{f}}({\bm{P}}_{\ell}{\bm{y}}) (76)
=𝑷𝒙n()\displaystyle={\bm{P}}_{\ell}{\bm{x}}_{n}^{(\ell)} (77)
=𝒙n()\displaystyle={\bm{x}}_{n}^{(\ell)} (78)

where the fourth equality holds because ${\bm{P}}_{\ell}{\bm{y}}\in B({\bm{x}}_{n}^{(\ell)},\rho)$, so that every unit of ${\bm{f}}^{*}$ outside ${\bm{f}}_{\ell}$ is inactive at ${\bm{P}}_{\ell}{\bm{y}}$, and the last two equalities follow from the interpolation constraint ${\bm{f}}^{*}(B({\bm{x}}_{n}^{(\ell)},\rho))=\{{\bm{x}}_{n}^{(\ell)}\}$ together with ${\bm{P}}_{\ell}{\bm{x}}_{n}^{(\ell)}={\bm{x}}_{n}^{(\ell)}$. This shows that ${\bm{h}}$ satisfies the norm-ball interpolation constraints, as claimed.

Next, we show that if ${\bm{h}}\neq{\bm{f}}^{*}$, then ${\bm{h}}$ has strictly smaller representation cost. Let ${\bm{u}}({\bm{y}})={\bm{a}}[{\bm{w}}^{\top}{\bm{y}}+b]_{+}$ denote any ReLU unit belonging to ${\bm{f}}^{*}$. Since we assume $\|{\bm{w}}\|=1$, the contribution of the unit ${\bm{u}}$ to the representation cost of ${\bm{f}}^{*}$ is $\|{\bm{a}}\|$. Let ${\mathcal{R}}\subset\{1,2,...,L\}$ index the subset of rays $\ell$ for which the unit ${\bm{u}}$ is active over at least one norm-ball centered on that ray. In constructing the interpolant ${\bm{h}}$, the unit ${\bm{u}}$ gets mapped to a sum of $|{\mathcal{R}}|$ units:

𝑷𝒂[(𝑷𝒘)𝒚+b]+\sum_{\ell\in{\mathcal{R}}}{\bm{P}}_{\ell}{\bm{a}}[({\bm{P}}_{\ell}{\bm{w}})^{\top}{\bm{y}}+b]_{+}

whose contribution to the representation cost of 𝒉{\bm{h}} is bounded above by

C=𝑷𝒂𝑷𝒘=|𝒖𝒂||𝒖𝒘|.C=\sum_{\ell\in{\mathcal{R}}}\|{\bm{P}}_{\ell}{\bm{a}}\|\|{\bm{P}}_{\ell}{\bm{w}}\|=\sum_{\ell\in{\mathcal{R}}}|{\bm{u}}_{\ell}^{\top}{\bm{a}}||{\bm{u}}_{\ell}^{\top}{\bm{w}}|.

If $|{\mathcal{R}}|>1$, then Lemma 8 guarantees $C<\|{\bm{a}}\|$. And if $|{\mathcal{R}}|=1$, then $C=|{\bm{u}}_{\ell}^{\top}{\bm{a}}||{\bm{u}}_{\ell}^{\top}{\bm{w}}|\leq\|{\bm{a}}\|$, with strict inequality unless ${\bm{w}},{\bm{a}}$ belong to the span of ${\bm{u}}_{\ell}$. This implies that any min-cost solution must have all its units aligned with the rays, i.e., ${\bm{f}}^{*}$ must have the form ${\bm{f}}^{*}({\bm{y}})=\sum_{\ell=1}^{L}{\bm{u}}_{\ell}\phi_{\ell}({\bm{u}}_{\ell}^{\top}{\bm{y}})$, where each $\phi_{\ell}$ is a univariate function. The representation cost of any function ${\bm{f}}^{*}$ in this form is the sum of the (1-D) representation costs of the $\phi_{\ell}$. Therefore, each $\phi_{\ell}$ must also be the 1-D min-cost solution of the norm-ball constraints projected onto ray $\ell$, and so must have the form specified by (17). ∎
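To illustrate the cost reduction behind the unit-splitting step of Figure 6, the following is a minimal numerical sketch (our own illustration, not part of the proof; the random configuration is an arbitrary choice). A unit that is active along two rays must have ${\bm{w}}^{\top}{\bm{u}}_{\ell}>0$ for both ray directions; splitting it into its per-ray projections replaces its cost contribution $\|{\bm{a}}\|$ (with $\|{\bm{w}}\|=1$) by $\sum_{\ell}|{\bm{u}}_{\ell}^{\top}{\bm{a}}|({\bm{u}}_{\ell}^{\top}{\bm{w}})$, which Lemma 8 guarantees is strictly smaller:

# Numerical illustration of the unit-splitting cost comparison: the split
# units' cost sum_l |u_l^T a| * (u_l^T w) is strictly below ||a|| * ||w||.
import numpy as np

rng = np.random.default_rng(1)
d = 3

def unit(v):
    return v / np.linalg.norm(v)

# Ray directions with u1^T u2 < 0 and a unit inner weight w active toward
# both rays (w^T u1 > 0 and w^T u2 > 0); resample until all conditions hold.
while True:
    u1, u2, w = (unit(rng.standard_normal(d)) for _ in range(3))
    if u1 @ u2 < 0 and w @ u1 > 0 and w @ u2 > 0:
        break

a = rng.standard_normal(d)                                # outer-layer weight vector

cost_original = np.linalg.norm(a)                         # ||a|| * ||w||, with ||w|| = 1
cost_split = sum(abs(u @ a) * (u @ w) for u in (u1, u2))  # cost of the projected units

print(f"original unit cost: {cost_original:.4f}")
print(f"split units cost  : {cost_split:.4f}")
assert cost_split < cost_original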

C.2.1 Perturbations of samples along rays

As an extension of the above setting, suppose the training samples along rays 𝒙n(){{\bm{x}}}_{n}^{(\ell)} have been slightly perturbed, i.e., we consider training samples 𝒙~n()=𝒙n()+ϵn()\widetilde{{\bm{x}}}_{n}^{(\ell)}={\bm{x}}_{n}^{(\ell)}+{\bm{\epsilon}}_{n}^{(\ell)} for some vectors ϵn(){\bm{\epsilon}}_{n}^{(\ell)} with ϵn()<δ\|{\bm{\epsilon}}_{n}^{(\ell)}\|<\delta for some sufficiently small δ>0\delta>0. In particular, we make the following two assumptions:

  • (A1)

    Let Δ:={𝒙~n()𝒙~n1()}n=1N\Delta_{\ell}:=\{\widetilde{{\bm{x}}}_{n}^{(\ell)}-\widetilde{{\bm{x}}}_{n-1}^{(\ell)}\}_{n=1}^{N_{\ell}} be the collection of difference vectors between successive perturbed points along the \ellth ray (with 𝒙~0():=𝟎\widetilde{{\bm{x}}}_{0}^{(\ell)}:=\bm{0}). Assume that for all 𝒗Δ{\bm{v}}_{\ell}\in\Delta_{\ell} and for all 𝒗kΔk{\bm{v}}_{k}\in\Delta_{k} with kk\neq\ell, we have 𝒗𝒗k<0{\bm{v}}_{\ell}^{\top}{\bm{v}}_{k}<0.

  • (A2)

    If HH is any halfspace not containing the origin that contains the ball B(𝒙~n();ρ)B(\tilde{{\bm{x}}}^{(\ell)}_{n};\rho), then HH also contains all successive balls along that ray, i.e., HH contains B(𝒙~m();ρ)B(\tilde{{\bm{x}}}^{(\ell)}_{m};\rho) for mnm\geq n.

Note that if the original points 𝒙n(){{\bm{x}}}_{n}^{(\ell)} belong to rays that satisfy the conditions of Theorem 3, then for sufficiently small δ\delta the perturbed samples 𝒙~n()\widetilde{{\bm{x}}}_{n}^{(\ell)} will satisfy assumptions (A1)&(A2) above.

We show in this case the norm-ball interpolating representation cost minimizer is a perturbed version of the min-cost solution for the unperturbed samples as identified in Theorem 3. In particular, the ReLU boundaries align with the line segments connecting successive points along the rays. See Figure 7 for an illustration in the case of L=2L=2 rays.

Figure 7: Change in unit alignment of minimal representation cost solutions under perturbations of training samples belonging to rays. The red lines represent the ReLU boundaries of the representation cost minimizing solution satisfying norm-ball interpolation constraints. For the unperturbed samples in (a), all ReLU boundaries align perpendicular to the rays. For the perturbed samples in (b), the ReLU boundaries align perpendicular to the line segments connecting successive samples.
Proposition 3.

Suppose the training samples $X$ are a perturbation of data belonging to a union of $L$ rays plus a sample at the origin, i.e., $X=\{\bm{0}\}\cup\{\widetilde{{\bm{x}}}_{n}^{(1)}\}_{n=1}^{N_{1}}\cup\cdots\cup\{\widetilde{{\bm{x}}}_{n}^{(L)}\}_{n=1}^{N_{L}}$ where $\widetilde{{\bm{x}}}_{n}^{(\ell)}=c^{(\ell)}_{n}{\bm{u}}_{\ell}+{\bm{\epsilon}}^{(\ell)}_{n}$ for some unit vector ${\bm{u}}_{\ell}$, constants $0<c^{(\ell)}_{1}<c^{(\ell)}_{2}<\cdots<c^{(\ell)}_{N_{\ell}}$, and vectors ${\bm{\epsilon}}^{(\ell)}_{n}$. Assume (A1)&(A2) above hold. Then the minimizer ${\bm{f}}^{*}$ of (19) is unique and is given by

𝒇(𝒚)==1Ln=1N𝒖n()ϕn()((𝒖n())(𝒚𝒙~n1())){\bm{f}}^{*}({\bm{y}})=\sum_{\ell=1}^{L}\sum_{n=1}^{N_{\ell}}{\bm{u}}_{n}^{(\ell)}\phi_{n}^{(\ell)}(({\bm{u}}_{n}^{(\ell)})^{\top}({\bm{y}}-\widetilde{{\bm{x}}}_{n-1}^{(\ell)})) (79)

where ${\bm{u}}_{n}^{(\ell)}=\frac{\widetilde{{\bm{x}}}_{n}^{(\ell)}-\widetilde{{\bm{x}}}_{n-1}^{(\ell)}}{\|\widetilde{{\bm{x}}}_{n}^{(\ell)}-\widetilde{{\bm{x}}}_{n-1}^{(\ell)}\|}$ and $\phi_{n}^{(\ell)}(t)=s_{n}^{(\ell)}([t-a_{n}^{(\ell)}]_{+}-[t-b_{n}^{(\ell)}]_{+})$ with $a_{n}^{(\ell)}=\rho$, $b_{n}^{(\ell)}=\|\widetilde{{\bm{x}}}_{n}^{(\ell)}-\widetilde{{\bm{x}}}_{n-1}^{(\ell)}\|-\rho$, and $s_{n}^{(\ell)}=\|\widetilde{{\bm{x}}}_{n}^{(\ell)}-\widetilde{{\bm{x}}}_{n-1}^{(\ell)}\|/(b_{n}^{(\ell)}-a_{n}^{(\ell)})$.

Proof.

Let 𝒇{\bm{f}}^{*} be any min-cost solution. Following the same steps as in the proof of Theorem 3, there is a minimal representative of 𝒇{\bm{f}}^{*} having the form 𝒇(𝒚)=k𝒂k[𝒘k𝒚+bk]+{\bm{f}}^{*}({\bm{y}})=\sum_{k}{\bm{a}}_{k}[{\bm{w}}_{k}^{\top}{\bm{y}}+b_{k}]_{+} where the active sets of all units in this representation have empty intersection with the ball B(𝟎,ρ)B(\bm{0},\rho).

Let ${\bm{f}}_{n}^{(\ell)}$ denote the sum of all units in ${\bm{f}}^{*}$ whose active set boundary $\{{\bm{y}}\in\mathbb{R}^{d}:{\bm{w}}_{k}^{\top}{\bm{y}}+b_{k}=0\}$ intersects the line segment connecting the training samples $\widetilde{{\bm{x}}}_{n-1}^{(\ell)}$ and $\widetilde{{\bm{x}}}_{n}^{(\ell)}$. By assumption (A2), this is equivalent to the sum of units that are active over the balls $B(\widetilde{{\bm{x}}}_{m}^{(\ell)},\rho)$ for $m\geq n$, and inactive over the balls with $m<n$. In particular, for ${\bm{y}}\in B(\widetilde{{\bm{x}}}_{n}^{(\ell)},\rho)$, we have ${\bm{f}}^{*}({\bm{y}})=\sum_{m\leq n}{\bm{f}}_{m}^{(\ell)}({\bm{y}})$. This implies ${\bm{f}}_{1}^{(\ell)}({\bm{y}})=\widetilde{{\bm{x}}}_{1}^{(\ell)}$ for all ${\bm{y}}\in B(\widetilde{{\bm{x}}}_{1}^{(\ell)},\rho)$. Likewise, ${\bm{f}}_{2}^{(\ell)}({\bm{y}})=\widetilde{{\bm{x}}}_{2}^{(\ell)}-\widetilde{{\bm{x}}}_{1}^{(\ell)}$ for all ${\bm{y}}\in B(\widetilde{{\bm{x}}}_{2}^{(\ell)};\rho)$, and so on, such that for all $n=1,...,N_{\ell}$ we have ${\bm{f}}_{n}^{(\ell)}({\bm{y}})=\widetilde{{\bm{x}}}_{n}^{(\ell)}-\widetilde{{\bm{x}}}_{n-1}^{(\ell)}$ for all ${\bm{y}}\in B(\widetilde{{\bm{x}}}_{n}^{(\ell)};\rho)$.

Now we show how to construct a new interpolating function ${\bm{h}}$ having representation cost less than or equal to that of ${\bm{f}}^{*}$. Let ${\bm{u}}_{n}^{(\ell)}=\frac{\widetilde{{\bm{x}}}_{n}^{(\ell)}-\widetilde{{\bm{x}}}_{n-1}^{(\ell)}}{\|\widetilde{{\bm{x}}}_{n}^{(\ell)}-\widetilde{{\bm{x}}}_{n-1}^{(\ell)}\|}$ and define ${\bm{P}}_{n}^{(\ell)}={\bm{u}}_{n}^{(\ell)}({\bm{u}}_{n}^{(\ell)})^{\top}$, the orthogonal projector onto the span of the difference vector $\widetilde{{\bm{x}}}_{n}^{(\ell)}-\widetilde{{\bm{x}}}_{n-1}^{(\ell)}$. Note that the map ${\bm{y}}\mapsto{\bm{P}}_{n}^{(\ell)}{\bm{y}}+\widetilde{{\bm{x}}}_{n-1}^{(\ell)}$ maps $\mathbb{R}^{d}$ onto the affine line through $\widetilde{{\bm{x}}}_{n-1}^{(\ell)}$ in the direction ${\bm{u}}_{n}^{(\ell)}$, i.e., the line connecting the samples $\widetilde{{\bm{x}}}_{n-1}^{(\ell)}$ and $\widetilde{{\bm{x}}}_{n}^{(\ell)}$. Consider the function

𝒉==1Ln=1N𝒉n(){\bm{h}}=\sum_{\ell=1}^{L}\sum_{n=1}^{N_{\ell}}{\bm{h}}^{(\ell)}_{n}

where

𝒉n()(𝒚)=𝑷n()𝒇n()(𝑷n()𝒚+𝒙~n1()).{\bm{h}}^{(\ell)}_{n}({\bm{y}})={\bm{P}}_{n}^{(\ell)}{\bm{f}}_{n}^{(\ell)}({\bm{P}}_{n}^{(\ell)}{\bm{y}}+\widetilde{{\bm{x}}}_{n-1}^{(\ell)}).

Put in words, 𝒉{\bm{h}} is constructed by aligning all inner-layer and outer-layer weights of units in 𝒇{\bm{f}}^{*} with the line segment connecting successive data points over which that unit first activates. See Figure 8 for an illustration.

Figure 8: Illustration of “unit alignment” technique used in the proof of Proposition 3. The left panel shows an interpolant 𝒇{\bm{f}} whose ReLU boundaries (shown here as dark blue and dark red lines) are not aligned with the line segments S1S_{1} and S2S_{2} connecting successive training samples. The right panel shows how a new interpolant 𝒉{\bm{h}} can be constructed by projecting units in 𝒇{\bm{f}} onto the line segments S1S_{1} and S2S_{2}, which reduces the representation cost.

Now we show that ${\bm{h}}$ satisfies the norm-ball interpolation constraints. By assumption (A1), the function ${\bm{h}}_{n}^{(\ell)}$ is non-zero only on the norm-balls $B(\widetilde{{\bm{x}}}_{m}^{(\ell)},\rho)$ for $m\geq n$. This implies that for all ${\bm{y}}\in B(\widetilde{{\bm{x}}}_{n}^{(\ell)};\rho)$ we have ${\bm{h}}({\bm{y}})=\sum_{m\leq n}{\bm{h}}_{m}^{(\ell)}({\bm{y}})$.

Observe that 𝒉(B(𝟎,ρ))={𝟎}{\bm{h}}(B(\bm{0},\rho))=\{\bm{0}\} since all functions 𝒉n(){\bm{h}}_{n}^{(\ell)} are zero on B(𝟎,ρ)B(\bm{0},\rho). Also, for all 𝒚B(𝒙~1();ρ){\bm{y}}\in B(\widetilde{{\bm{x}}}_{1}^{(\ell)};\rho) we have 𝑷1()𝒚B(𝒙~1();ρ){\bm{P}}_{1}^{(\ell)}{\bm{y}}\in B(\widetilde{{\bm{x}}}_{1}^{(\ell)};\rho), and so

𝒉(𝒚)=𝒉1()(𝒚)=𝑷1𝒇1()(𝑷1()𝒚)=𝑷1()𝒙~1()=𝒙~1().{\bm{h}}({\bm{y}})={\bm{h}}_{1}^{(\ell)}({\bm{y}})={\bm{P}}_{1}^{\ell}{\bm{f}}_{1}^{(\ell)}({\bm{P}}_{1}^{(\ell)}{\bm{y}})={\bm{P}}_{1}^{(\ell)}\widetilde{{\bm{x}}}_{1}^{(\ell)}=\widetilde{{\bm{x}}}_{1}^{(\ell)}.

Similarly, the only terms in ${\bm{h}}$ active for ${\bm{y}}\in B(\widetilde{{\bm{x}}}_{2}^{(\ell)};\rho)$ are ${\bm{h}}_{1}^{(\ell)}$ and ${\bm{h}}_{2}^{(\ell)}$, for which ${\bm{h}}_{1}^{(\ell)}({\bm{y}})=\widetilde{{\bm{x}}}_{1}^{(\ell)}$, and ${\bm{h}}_{2}^{(\ell)}({\bm{y}})={\bm{P}}_{2}^{(\ell)}{\bm{f}}_{2}^{(\ell)}({\bm{P}}_{2}^{(\ell)}{\bm{y}}+\widetilde{{\bm{x}}}_{1}^{(\ell)})={\bm{P}}_{2}^{(\ell)}(\widetilde{{\bm{x}}}_{2}^{(\ell)}-\widetilde{{\bm{x}}}_{1}^{(\ell)})=\widetilde{{\bm{x}}}_{2}^{(\ell)}-\widetilde{{\bm{x}}}_{1}^{(\ell)}$. This gives

𝒉(𝒚)\displaystyle{\bm{h}}({\bm{y}}) =𝒉1()(𝒚)+𝒉2()(𝒚)=𝒙~2().\displaystyle={\bm{h}}_{1}^{(\ell)}({\bm{y}})+{\bm{h}}_{2}^{(\ell)}({\bm{y}})=\widetilde{{\bm{x}}}_{2}^{(\ell)}.

Iterating this procedure for all $n=1,...,N_{\ell}$, we see that if ${\bm{y}}\in B(\widetilde{{\bm{x}}}_{n}^{(\ell)},\rho)$ then ${\bm{h}}_{n}^{(\ell)}({\bm{y}})=\widetilde{{\bm{x}}}_{n}^{(\ell)}-\widetilde{{\bm{x}}}_{n-1}^{(\ell)}$, and so ${\bm{h}}({\bm{y}})=\sum_{m\leq n}{\bm{h}}_{m}^{(\ell)}({\bm{y}})=\widetilde{{\bm{x}}}_{n}^{(\ell)}$, as claimed.

Following steps similar to the proof of Theorem 3, we can show that R(𝒉)R(𝒇)R({\bm{h}})\leq R({\bm{f}}^{*}) and with strict inequality if 𝒉𝒇{\bm{h}}\neq{\bm{f}}^{*}. First, if any unit 𝒖(𝒚)=𝒂[𝒘𝒚+b]+{\bm{u}}({\bm{y}})={\bm{a}}[{\bm{w}}^{\top}{\bm{y}}+b]_{+} in 𝒇{\bm{f}}^{*} is active over balls centered on multiple rays {1,2,,L}{\mathcal{R}}\subset\{1,2,...,L\}, then in the construction of 𝒉{\bm{h}} the unit 𝒖{\bm{u}} gets mapped to a sum of multiple units:

𝒖n()(𝒖n())𝒂[𝒘𝒖n()(𝒖n())(𝒚+𝒙~n1())+b]+\sum_{\ell\in{\mathcal{R}}}{\bm{u}}_{n_{\ell}}^{(\ell)}({\bm{u}}_{n_{\ell}}^{(\ell)})^{\top}{\bm{a}}[{\bm{w}}^{\top}{\bm{u}}_{n_{\ell}}^{(\ell)}({\bm{u}}_{n_{\ell}}^{(\ell)})^{\top}({\bm{y}}+\widetilde{{\bm{x}}}_{{n_{\ell}}-1}^{(\ell)})+b]_{+}

for some $n_{\ell}$ with $1\leq n_{\ell}\leq N_{\ell}$. The contribution of the sum of these units to the representation cost of ${\bm{h}}$ is less than or equal to $C=\sum_{\ell\in{\mathcal{R}}}|({\bm{u}}_{n_{\ell}}^{(\ell)})^{\top}{\bm{a}}||({\bm{u}}_{n_{\ell}}^{(\ell)})^{\top}{\bm{w}}|$. By assumption (A1), we know $({\bm{u}}_{m}^{(\ell)})^{\top}{\bm{u}}_{n}^{(p)}<0$ for all $p\neq\ell$ and for all $m,n$. Therefore, by Lemma 8 we know $C<\|{\bm{a}}\|$. This shows ${\bm{f}}^{*}$ cannot have any units active over balls centered on different rays, since otherwise ${\bm{h}}$ would have strictly smaller representation cost. Therefore, we have shown that ${\bm{f}}^{*}$ decomposes as ${\bm{f}}^{*}=\sum_{\ell=1}^{L}\sum_{n=1}^{N_{\ell}}{\bm{f}}_{n}^{(\ell)}$. Additionally, we see that ${\bm{h}}_{n}^{(\ell)}={\bm{f}}_{n}^{(\ell)}$, i.e., all inner- and outer-layer weight vectors of units in ${\bm{f}}_{n}^{(\ell)}$ are aligned with ${\bm{u}}_{n}^{(\ell)}$, since otherwise the representation cost of ${\bm{f}}^{*}$ could be reduced. Therefore, each ${\bm{f}}_{n}^{(\ell)}$ must have the form ${\bm{f}}_{n}^{(\ell)}({\bm{y}})={\bm{u}}_{n}^{(\ell)}\psi_{n}^{(\ell)}(({\bm{u}}_{n}^{(\ell)})^{\top}{\bm{y}})$. Minimizing the $\psi_{n}^{(\ell)}$ separately under interpolation constraints, we arrive at the unique solution given in (79). ∎
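As a numerical illustration of the closed form (79) (our own sketch on an arbitrary synthetic configuration; it is not part of the proof), the following builds ${\bm{f}}^{*}$ for two slightly perturbed rays in $\mathbb{R}^{2}$ and checks that it maps each perturbed training sample to itself and the origin to zero, which are the values the norm-ball constraints prescribe at the ball centers:

# Construct the closed-form minimizer (79) for two perturbed rays in R^2
# and verify the interpolation values at the ball centers.
import numpy as np

rng = np.random.default_rng(2)
rho = 0.05

# Two rays with negatively correlated directions, three samples per ray,
# with perturbations kept much smaller than the gaps between samples.
dirs = [np.array([1.0, 0.2]), np.array([-1.0, 0.3])]
dirs = [u / np.linalg.norm(u) for u in dirs]
X = [[c * u + 0.01 * rng.standard_normal(2) for c in (1.0, 2.0, 3.0)] for u in dirs]

def f_star(y):
    out = np.zeros(2)
    for samples in X:
        prev = np.zeros(2)                        # x_tilde_0 := 0
        for x in samples:
            diff = x - prev
            u = diff / np.linalg.norm(diff)       # u_n^(l)
            a, b = rho, np.linalg.norm(diff) - rho
            s = np.linalg.norm(diff) / (b - a)
            t = u @ (y - prev)
            out += u * s * (np.maximum(t - a, 0.0) - np.maximum(t - b, 0.0))
            prev = x
    return out

assert np.allclose(f_star(np.zeros(2)), 0.0, atol=1e-10)
for samples in X:
    for x in samples:
        assert np.allclose(f_star(x), x, atol=1e-8)
print("f* maps every perturbed sample to itself and the origin to zero")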

C.3 Simplex Data

C.3.1 Simplex with one obtuse vertex

Proof of Proposition 2.

Consider training samples ${\bm{x}}_{1},{\bm{x}}_{2},...,{\bm{x}}_{N}\in\mathbb{R}^{d}$ whose convex hull is an $(N-1)$-simplex such that ${\bm{x}}_{1}$ makes an obtuse angle with all other vertices, i.e., $({\bm{x}}_{n}-{\bm{x}}_{1})^{\top}{\bm{x}}_{1}<0$ for all $n=2,...,N$. By Lemma 5, ${\bm{f}}^{*}$ is a norm-ball interpolating min-cost solution if and only if ${\bm{g}}^{*}({\bm{y}})={\bm{f}}^{*}({\bm{y}}+{\bm{x}}_{1})-{\bm{x}}_{1}$ is a norm-ball interpolating min-cost solution for the translated points $\bm{0},{\bm{x}}_{2}-{\bm{x}}_{1},...,{\bm{x}}_{N}-{\bm{x}}_{1}\in\mathbb{R}^{d}$. The latter configuration satisfies the hypotheses of Theorem 3 with a single training sample per ray. Therefore, the min-cost solution ${\bm{g}}^{*}$ of the translated points has units whose inner- and outer-layer weight vectors are aligned with ${\bm{x}}_{n}-{\bm{x}}_{1}$, $n=2,...,N$, and likewise for ${\bm{f}}^{*}$, since it is a translation of ${\bm{g}}^{*}$. ∎

C.3.2 Simplex with all acute vertices

Justification of Conjecture 1.

For concreteness we focus on the case of three points 𝒙1,𝒙2,𝒙3d{\bm{x}}_{1},{\bm{x}}_{2},{\bm{x}}_{3}\in\mathbb{R}^{d} whose convex hull is an acute triangle, and let B1,B2,B3B_{1},B_{2},B_{3} be open balls of radius ρ\rho centered at 𝒙1,𝒙2,𝒙3{\bm{x}}_{1},{\bm{x}}_{2},{\bm{x}}_{3} (respectively).

Let ${\bm{f}}$ be any norm-ball interpolating min-cost solution, and let ${\bm{f}}({\bm{y}})=\sum_{k}{\bm{a}}_{k}[{\bm{w}}_{k}^{\top}{\bm{y}}+b_{k}]_{+}+{\bm{V}}{\bm{y}}+{\bm{c}}$ be any minimal representative of ${\bm{f}}$.

First, by properties of minimal representatives, none of the ReLU boundary sets $H_{k}=\{{\bm{y}}\in\mathbb{R}^{d}:{\bm{w}}_{k}^{\top}{\bm{y}}+b_{k}=0\}$ intersect any of the balls centered at the training samples. Also, the active set of each unit in ${\bm{f}}$ must contain either one or two norm-balls, since otherwise the unit is either inactive over all balls or active over all balls, in which case the unit can be removed or absorbed into the unregularized linear part while strictly reducing the representation cost. By “reversing units”, as in the proof of Theorem 3, we may transform the parameterization in such a way that the active set of every unit contains exactly one ball, and do so without changing the representation cost.

After this transformation, we may write

𝒇=𝒇1+𝒇2+𝒇3+{\bm{f}}={\bm{f}}_{1}+{\bm{f}}_{2}+{\bm{f}}_{3}+\bm{\ell}

where 𝒇i{\bm{f}}_{i} is a sum of units active only on the ball BiB_{i} and no others, and where (𝒚)=𝑨𝒚+𝒃\bm{\ell}({\bm{y}})={\bm{A}}{\bm{y}}+{\bm{b}} is an affine function. Then we have

𝒚1B1,𝒇(𝒚1)\displaystyle\forall{\bm{y}}_{1}\in B_{1},\quad{\bm{f}}({\bm{y}}_{1}) =𝒇1(𝒚1)+𝑨𝒚1+𝒃=𝒙1\displaystyle={\bm{f}}_{1}({\bm{y}}_{1})+{\bm{A}}{\bm{y}}_{1}+{\bm{b}}={\bm{x}}_{1} (80)
𝒚2B2,𝒇(𝒚2)\displaystyle\forall{\bm{y}}_{2}\in B_{2},\quad{\bm{f}}({\bm{y}}_{2}) =𝒇2(𝒚2)+𝑨𝒚2+𝒃=𝒙2\displaystyle={\bm{f}}_{2}({\bm{y}}_{2})+{\bm{A}}{\bm{y}}_{2}+{\bm{b}}={\bm{x}}_{2} (81)
𝒚3B3,𝒇(𝒚3)\displaystyle\forall{\bm{y}}_{3}\in B_{3},\quad{\bm{f}}({\bm{y}}_{3}) =𝒇3(𝒚3)+𝑨𝒚3+𝒃=𝒙3\displaystyle={\bm{f}}_{3}({\bm{y}}_{3})+{\bm{A}}{\bm{y}}_{3}+{\bm{b}}={\bm{x}}_{3} (82)

and so,

𝒚1B1,𝒇1(𝒚1)\displaystyle\forall{\bm{y}}_{1}\in B_{1},\quad{\bm{f}}_{1}({\bm{y}}_{1}) =𝒙1𝒃𝑨𝒚1\displaystyle={\bm{x}}_{1}-{\bm{b}}-{\bm{A}}{\bm{y}}_{1} (83)
𝒚2B2,𝒇2(𝒚2)\displaystyle\forall{\bm{y}}_{2}\in B_{2},\quad{\bm{f}}_{2}({\bm{y}}_{2}) =𝒙2𝒃𝑨𝒚2\displaystyle={\bm{x}}_{2}-{\bm{b}}-{\bm{A}}{\bm{y}}_{2} (84)
𝒚3B3,𝒇3(𝒚3)\displaystyle\forall{\bm{y}}_{3}\in B_{3},\quad{\bm{f}}_{3}({\bm{y}}_{3}) =𝒙3𝒃𝑨𝒚3\displaystyle={\bm{x}}_{3}-{\bm{b}}-{\bm{A}}{\bm{y}}_{3} (85)

Finding the min-cost solution then amounts to minimizing the representation cost of ${\bm{f}}_{1}$, ${\bm{f}}_{2}$, ${\bm{f}}_{3}$ subject to the above constraints. This is made challenging by the fact that the constraints are coupled together by the parameters ${\bm{A}}$ and ${\bm{b}}$ of the affine part. Under the assumption ${\bm{A}}=\bm{0}$, we prove below that the min-cost solution must have the conjectured form ${\bm{f}}^{*}$ given in (86). However, since we cannot a priori assume ${\bm{A}}=\bm{0}$, this does not constitute a full proof.

Before proceeding, we give a lemma that will be used to lower bound R(𝒇)R({\bm{f}}) (assuming 𝑨=𝟎{\bm{A}}=\bm{0}).

Lemma 10.

Suppose 𝐠{\bm{g}} is a sum of ReLU units such that 𝐠=𝟎{\bm{g}}=\bm{0} on a closed convex region CdC\subset\mathbb{R}^{d} and 𝐠{\bm{g}} is constant and equal to 𝐜{\bm{c}} on a closed ball BdB\subset\mathbb{R}^{d}, and 𝐠{\bm{g}} has a minimal representative where all its units are active on BB. Then

R(𝒈)2𝒄dist(B,C)R({\bm{g}})\geq\frac{2\|{\bm{c}}\|}{\text{dist}(B,C)}

where dist(B,C)=min𝐲B,𝐱C𝐲𝐱\text{dist}(B,C)=\min_{{\bm{y}}\in B,{\bm{x}}\in C}\|{\bm{y}}-{\bm{x}}\|.

Proof.

Let 𝒈(𝒚)=k𝒂k[𝒘k𝒚+bk]+{\bm{g}}({\bm{y}})=\sum_{k}{\bm{a}}_{k}[{\bm{w}}_{k}^{\top}{\bm{y}}+b_{k}]_{+} be a minimal representative where each unit is active on BB. For 𝒈{\bm{g}} to be constant on BB it must be the case that k𝒂k𝒘k=𝟎\sum_{k}{\bm{a}}_{k}{\bm{w}}_{k}^{\top}=\bm{0}. Let 𝒚=𝒚0{\bm{y}}={\bm{y}}_{0} be any value for which 𝒈(𝒚0)\|\bm{\partial}{\bm{g}}({\bm{y}}_{0})\|, the operator norm of the Jacobian of 𝒈{\bm{g}}, is maximized. Since 𝒈{\bm{g}} is piecewise linear, we see that its Lipschitz constant Lip(𝒈)=𝒈(𝒚0)\text{Lip}({\bm{g}})=\|\bm{\partial}{\bm{g}}({\bm{y}}_{0})\|. Let I0I_{0} be the set of indices of active units at 𝒚0{\bm{y}}_{0} so that 𝒈(𝒚0)=kI0𝒂k𝒘k\bm{\partial}{\bm{g}}({\bm{y}}_{0})=\sum_{k\in I_{0}}{\bm{a}}_{k}{\bm{w}}_{k}^{\top}, and let I1I_{1} be the complementary index set. Then because k𝒂k𝒘k=𝟎\sum_{k}{\bm{a}}_{k}{\bm{w}}_{k}^{\top}=\bm{0}, we have kI0𝒂k𝒘k=kI1𝒂k𝒘k\sum_{k\in I_{0}}{\bm{a}}_{k}{\bm{w}}_{k}^{\top}=-\sum_{k\in I_{1}}{\bm{a}}_{k}{\bm{w}}_{k}^{\top}, and so kI0𝒂k𝒘k=kI1𝒂k𝒘k\|\sum_{k\in I_{0}}{\bm{a}}_{k}{\bm{w}}_{k}^{\top}\|=\|\sum_{k\in I_{1}}{\bm{a}}_{k}{\bm{w}}_{k}^{\top}\|. Therefore,

2Lip(𝒈)=2𝒈(𝒚0)=2kI0𝒂k𝒘k=kI0𝒂k𝒘k+kI1𝒂k𝒘kk𝒂k𝒘k=R(𝒈).2\,\text{Lip}({\bm{g}})=2\|\bm{\partial}{\bm{g}}({\bm{y}}_{0})\|=2\left\|\sum_{k\in I_{0}}{\bm{a}}_{k}{\bm{w}}_{k}^{\top}\right\|=\left\|\sum_{k\in I_{0}}{\bm{a}}_{k}{\bm{w}}_{k}^{\top}\right\|+\left\|\sum_{k\in I_{1}}{\bm{a}}_{k}{\bm{w}}_{k}^{\top}\right\|\\ \leq\sum_{k}\|{\bm{a}}_{k}\|\|{\bm{w}}_{k}\|=R({\bm{g}}).

Finally,

Lip(𝒈)max𝒙C,𝒚B𝒈(𝒚)𝒈(𝒙)𝒚𝒙=𝒄min𝒙C,𝒚B𝒚𝒙=𝒄dist(B,C).\text{Lip}({\bm{g}})\geq\max_{{\bm{x}}\in C,{\bm{y}}\in B}\frac{\|{\bm{g}}({\bm{y}})-{\bm{g}}({\bm{x}})\|}{\|{\bm{y}}-{\bm{x}}\|}=\frac{\|{\bm{c}}\|}{\min_{{\bm{x}}\in C,{\bm{y}}\in B}\|{\bm{y}}-{\bm{x}}\|}=\frac{\|{\bm{c}}\|}{\text{dist}(B,C)}.

Combining this with the previous inequality gives the claim. ∎

Now, returning to the proof of the main claim, for n=1,2,3n=1,2,3, let CnC_{n} be the closed convex region given by the intersection of all closed half-planes containing the balls BjB_{j} and BkB_{k}, jnj\neq n , and knk\neq n, jkj\neq k. By assumption, 𝒇n{\bm{f}}_{n} vanishes on CnC_{n} and 𝒇n{\bm{f}}_{n} is constant and equal to 𝒙n𝒃{\bm{x}}_{n}-{\bm{b}} on the closed ball BnB_{n}. So, by Lemma 10, we see that

R(𝒇n)2𝒙n𝒃δnR({\bm{f}}_{n})\geq\frac{2\|{\bm{x}}_{n}-{\bm{b}}\|}{\delta_{n}}

where δn=dist(Bn,Cn)\delta_{n}=\text{dist}(B_{n},C_{n}). Therefore, for all 𝒃d{\bm{b}}\in\mathbb{R}^{d} we have

R(𝒇)=R(𝒇1)+R(𝒇2)+R(𝒇3)n=132𝒙n𝒃δnR({\bm{f}})=R({\bm{f}}_{1})+R({\bm{f}}_{2})+R({\bm{f}}_{3})\geq\sum_{n=1}^{3}\frac{2\|{\bm{x}}_{n}-{\bm{b}}\|}{\delta_{n}}

Let 𝒙¯d\overline{{\bm{x}}}\in\mathbb{R}^{d} be the minimizer of

min𝒃dn=132𝒙n𝒃δn.\min_{{\bm{b}}\in\mathbb{R}^{d}}\sum_{n=1}^{3}\frac{2\|{\bm{x}}_{n}-{\bm{b}}\|}{\delta_{n}}.

Then plugging in 𝒃=𝒙¯{\bm{b}}=\overline{{\bm{x}}} above, we have the lower bound

R(𝒇)n=132𝒙n𝒙¯δn,R({\bm{f}})\geq\sum_{n=1}^{3}\frac{2\|{\bm{x}}_{n}-\overline{{\bm{x}}}\|}{\delta_{n}},

which is independent of 𝒃{\bm{b}}.

Finally, simple calculations show that the conjectured min-cost solution 𝒇{\bm{f}}^{*} specified in (86) has representation cost achieving this lower bound. Therefore, under the assumption 𝑨=𝟎{\bm{A}}=\bm{0}, 𝒇{\bm{f}}^{*} is a min-cost solution.

Now we prove that, in the special case where the convex hull of the training points is an equilateral triangle, the norm-ball interpolator 𝒇{\bm{f}}^{*} identified in Conjecture 1 is a min-cost solution (though not necessarily the unique min-cost solution). In this case, we may also give 𝒇{\bm{f}}^{*} a more explicit form, as detailed below.

Proposition 4.

Suppose the convex hull of the training points 𝐱1,𝐱2,𝐱3d{\bm{x}}_{1},{\bm{x}}_{2},{\bm{x}}_{3}\in\mathbb{R}^{d} is an equilateral triangle. Assume the norm-balls Bn:=B(𝐱n,ρ)B_{n}:=B({\bm{x}}_{n},\rho) centered at each training point have radius ρ<𝐱n𝐱0/2\rho<\|{\bm{x}}_{n}-{\bm{x}}_{0}\|/2, n=1,2,3n=1,2,3, where 𝐱0=13(𝐱1+𝐱2+𝐱3){\bm{x}}_{0}=\frac{1}{3}({\bm{x}}_{1}+{\bm{x}}_{2}+{\bm{x}}_{3}) is the centroid of the triangle. Then a minimizer 𝐟{\bm{f}}^{*} of (19) is given by

𝒇(𝒚)=𝒖1ϕ1(𝒖1(𝒚𝒙0))+𝒖2ϕ2(𝒖2(𝒚𝒙0))+𝒖3ϕ3(𝒖3(𝒚𝒙0))+𝒙0,{\bm{f}}^{*}({\bm{y}})={\bm{u}}_{1}\phi_{1}({\bm{u}}_{1}^{\top}({\bm{y}}-{\bm{x}}_{0}))+{\bm{u}}_{2}\phi_{2}({\bm{u}}_{2}^{\top}({\bm{y}}-{\bm{x}}_{0}))+{\bm{u}}_{3}\phi_{3}({\bm{u}}_{3}^{\top}({\bm{y}}-{\bm{x}}_{0}))+{\bm{x}}_{0}, (86)

where ϕn(t)=sn([tan]+[tbn]+)\phi_{n}(t)=s_{n}([t-a_{n}]_{+}-[t-b_{n}]_{+}) with 𝐮n=𝐱n𝐱0𝐱n𝐱0{\bm{u}}_{n}=\frac{{\bm{x}}_{n}-{\bm{x}}_{0}}{\|{\bm{x}}_{n}-{\bm{x}}_{0}\|}, an=12𝐱n𝐱0+ρa_{n}=-\frac{1}{2}\|{\bm{x}}_{n}-{\bm{x}}_{0}\|+\rho, bn=𝐱n𝐱0ρb_{n}=\|{\bm{x}}_{n}-{\bm{x}}_{0}\|-\rho, and sn=𝐱n𝐱0/(bnan)s_{n}=\|{\bm{x}}_{n}-{\bm{x}}_{0}\|/(b_{n}-a_{n}).

Proof.

By translation and scale invariance of min-cost solutions, without loss of generality we may assume 𝒙1,𝒙2,𝒙3d{\bm{x}}_{1},{\bm{x}}_{2},{\bm{x}}_{3}\in\mathbb{R}^{d} are unit-norm vectors and mean-zero (i.e., the triangle centroid 𝒙0=𝟎{\bm{x}}_{0}=\bm{0}). In this case, the assumption on ρ\rho translates to ρ<1/2\rho<1/2. Additionally, it suffices to prove the claim in the case d=2d=2. This is because, if 𝒙1,𝒙2,𝒙3d{\bm{x}}_{1},{\bm{x}}_{2},{\bm{x}}_{3}\in\mathbb{R}^{d} are the vertices of an equilateral triangle whose centroid is at the origin, then these points are contained in a two-dimensional subspace 𝒮d{\mathcal{S}}\subset\mathbb{R}^{d}. And if we let 𝑷d×2{\bm{P}}\in\mathbb{R}^{d\times 2} be a matrix whose columns are an orthonormal basis for 𝒮{\mathcal{S}}, then by Theorem 2, 𝒇{\bm{f}} is a min-cost solution if and only if 𝒇(𝒚)=𝑷𝒇0(𝑷𝒚){\bm{f}}({\bm{y}})={\bm{P}}{\bm{f}}_{0}({\bm{P}}^{\top}{\bm{y}}), where 𝒇0:22{\bm{f}}_{0}:\mathbb{R}^{2}\rightarrow\mathbb{R}^{2} is a min-cost solution under the constraints 𝒇0(B(𝑷𝒙n,ρ))={𝑷𝒙n}{\bm{f}}_{0}(B({\bm{P}}^{\top}{\bm{x}}_{n},\rho))=\{{\bm{P}}^{\top}{\bm{x}}_{n}\} for all n=1,2,3n=1,2,3. Therefore, the problem reduces to finding a min-cost solution of the projected points 𝑷𝒙1,𝑷𝒙2,𝑷𝒙32{\bm{P}}^{\top}{\bm{x}}_{1},{\bm{P}}^{\top}{\bm{x}}_{2},{\bm{P}}^{\top}{\bm{x}}_{3}\in\mathbb{R}^{2} whose convex hull is an equilateral triangle in 2\mathbb{R}^{2}.

So now let 𝒇:22{\bm{f}}:\mathbb{R}^{2}\rightarrow\mathbb{R}^{2} be any min-cost solution under the assumption 𝒙1,𝒙2,𝒙32{\bm{x}}_{1},{\bm{x}}_{2},{\bm{x}}_{3}\in\mathbb{R}^{2} are unit-norm, have zero mean, and ρ<1/2\rho<1/2. By reversing units as necessary, there exists a minimal representative of 𝒇{\bm{f}} that can be put in the form 𝒇(𝒚)=𝒇1(𝒚)+𝒇2(𝒚)+𝒇3(𝒚)+𝑨𝒚+𝒄{\bm{f}}({\bm{y}})={\bm{f}}_{1}({\bm{y}})+{\bm{f}}_{2}({\bm{y}})+{\bm{f}}_{3}({\bm{y}})+{\bm{A}}{\bm{y}}+{\bm{c}}, such that 𝒇n(𝒚){\bm{f}}_{n}({\bm{y}}) is a sum of ReLU units, all of which are active on BnB_{n}, and all of which are inactive on BjB_{j} for jnj\neq n.

Let ${\bm{Q}}$ be the reflection matrix ${\bm{Q}}=2{\bm{x}}_{1}{\bm{x}}_{1}^{\top}-{\bm{I}}$, which reflects points across the line spanned by ${\bm{x}}_{1}$. In particular, if ${\bm{y}}\in B_{1}$ then ${\bm{Q}}{\bm{y}}\in B_{1}$, while if ${\bm{y}}\in B_{2}$ then ${\bm{Q}}{\bm{y}}\in B_{3}$ (and vice versa). Also, ${\bm{Q}}{\bm{x}}_{1}={\bm{x}}_{1}$, ${\bm{Q}}{\bm{x}}_{2}={\bm{x}}_{3}$, and ${\bm{Q}}{\bm{x}}_{3}={\bm{x}}_{2}$. Define ${\bm{g}}({\bm{y}})=\frac{1}{2}\left({\bm{f}}({\bm{y}})+{\bm{Q}}^{-1}{\bm{f}}({\bm{Q}}{\bm{y}})\right)$. Then it is easy to check that ${\bm{g}}$ satisfies the interpolation constraints, and since ${\bm{Q}}$ is unitary, we have $R({\bm{g}})\leq\frac{1}{2}R({\bm{f}})+\frac{1}{2}R({\bm{Q}}\circ{\bm{f}}\circ{\bm{Q}}^{-1})=\frac{1}{2}R({\bm{f}})+\frac{1}{2}R({\bm{f}})=R({\bm{f}})$. Furthermore, since ${\bm{f}}$ is a min-cost solution, we must have $R({\bm{g}})=R({\bm{f}})$. Additionally, since none of the units making up ${\bm{f}}$ are active over more than one ball, neither are the units making up ${\bm{g}}$, which implies no pair of units belonging to ${\bm{g}}$ combine to form an affine function. Therefore, we may write ${\bm{g}}({\bm{y}})={\bm{g}}_{1}({\bm{y}})+{\bm{g}}_{2}({\bm{y}})+{\bm{g}}_{3}({\bm{y}})+{\bm{B}}{\bm{y}}+{\bm{v}}$, where ${\bm{B}}=\frac{1}{2}({\bm{A}}+{\bm{Q}}^{-1}{\bm{A}}{\bm{Q}})$, such that each ${\bm{g}}_{n}$ is a sum of ReLU units, all of which are active on $B_{n}$ and all of which are inactive on $B_{j}$ for $j\neq n$.

Let ${\bm{U}}$ be the rotation matrix by $120$ degrees such that ${\bm{x}}_{2}={\bm{U}}{\bm{x}}_{1}$, ${\bm{x}}_{3}={\bm{U}}{\bm{x}}_{2}$, and ${\bm{x}}_{1}={\bm{U}}{\bm{x}}_{3}$. Consider the symmetrized version of ${\bm{g}}$ given by ${\bm{h}}({\bm{y}}):=\frac{1}{3}({\bm{g}}({\bm{y}})+{\bm{U}}^{-1}{\bm{g}}({\bm{U}}{\bm{y}})+{\bm{U}}^{-2}{\bm{g}}({\bm{U}}^{2}{\bm{y}}))$. Since for all ${\bm{y}}\in B_{n}$ we have ${\bm{U}}{\bm{y}}\in B_{n+1}$ (with indices understood modulo $3$), it is easy to verify that ${\bm{h}}$ satisfies the interpolation constraints. Also, since ${\bm{U}}$ is unitary, we have $R({\bm{h}})\leq R({\bm{g}})\leq R({\bm{f}})$, which implies $R({\bm{h}})=R({\bm{f}})$ since ${\bm{f}}$ is a min-cost solution. Again, since none of the units making up ${\bm{g}}$ are active over more than one ball, neither are the units making up ${\bm{h}}$, which implies no pair of units belonging to ${\bm{h}}$ combine to form an affine function. And so, we may write ${\bm{h}}({\bm{y}})={\bm{h}}_{1}({\bm{y}})+{\bm{h}}_{2}({\bm{y}})+{\bm{h}}_{3}({\bm{y}})+{\bm{C}}{\bm{y}}+{\bm{u}}$, such that each ${\bm{h}}_{n}$ is a sum of ReLU units, all of which are active on $B_{n}$ and all of which are inactive on $B_{j}$ for $j\neq n$.

Observe that 𝒖=13(𝑰+𝑼1+𝑼2)𝒗=𝟎{\bm{u}}=\frac{1}{3}({\bm{I}}+{\bm{U}}^{-1}+{\bm{U}}^{-2}){\bm{v}}=\bm{0}. Also, we have 𝑪=13(𝑩+𝑼1𝑩𝑼+𝑼2𝑩𝑼2){\bm{C}}=\frac{1}{3}({\bm{B}}+{\bm{U}}^{-1}{\bm{B}}{\bm{U}}+{\bm{U}}^{-2}{\bm{B}}{\bm{U}}^{2}). This implies 𝑪{\bm{C}} commutes with 𝑼{\bm{U}}, because it is easy to see that 𝑪=𝑼1𝑪𝑼{\bm{C}}={\bm{U}}^{-1}{\bm{C}}{\bm{U}}. Additionally, 𝑪{\bm{C}} commutes with 𝑸{\bm{Q}}, because it is easy to see 𝑩=𝑸1𝑩𝑸{\bm{B}}={\bm{Q}}^{-1}{\bm{B}}{\bm{Q}}, and by properties of rotations/reflections we have 𝑼𝑸=𝑸𝑼1{\bm{U}}{\bm{Q}}={\bm{Q}}{\bm{U}}^{-1}, which implies

𝑸1𝑪𝑸\displaystyle{\bm{Q}}^{-1}{\bm{C}}{\bm{Q}} =13(𝑸1𝑩𝑸+𝑸1𝑼1𝑩𝑼𝑸+𝑸1𝑼2𝑩𝑼2𝑸)\displaystyle=\frac{1}{3}({\bm{Q}}^{-1}{\bm{B}}{\bm{Q}}+{\bm{Q}}^{-1}{\bm{U}}^{-1}{\bm{B}}{\bm{U}}{\bm{Q}}+{\bm{Q}}^{-1}{\bm{U}}^{-2}{\bm{B}}{\bm{U}}^{2}{\bm{Q}})
=13(𝑸1𝑩𝑸+(𝑼𝑸)1𝑩(𝑼𝑸)+(𝑼𝑸)1𝑼1𝑩𝑼(𝑼𝑸))\displaystyle=\frac{1}{3}({\bm{Q}}^{-1}{\bm{B}}{\bm{Q}}+({\bm{U}}{\bm{Q}})^{-1}{\bm{B}}({\bm{U}}{\bm{Q}})+({\bm{U}}{\bm{Q}})^{-1}{\bm{U}}^{-1}{\bm{B}}{\bm{U}}({\bm{U}}{\bm{Q}}))
=13(𝑸1𝑩𝑸+𝑼𝑸1𝑩𝑸𝑼1+𝑼𝑸1𝑼1𝑩𝑼𝑸𝑼1)\displaystyle=\frac{1}{3}({\bm{Q}}^{-1}{\bm{B}}{\bm{Q}}+{\bm{U}}{\bm{Q}}^{-1}{\bm{B}}{\bm{Q}}{\bm{U}}^{-1}+{\bm{U}}{\bm{Q}}^{-1}{\bm{U}}^{-1}{\bm{B}}{\bm{U}}{\bm{Q}}{\bm{U}}^{-1})
=13(𝑸1𝑩𝑸+𝑼𝑸1𝑩𝑸𝑼1+𝑼2𝑸1𝑩𝑸𝑼2)\displaystyle=\frac{1}{3}({\bm{Q}}^{-1}{\bm{B}}{\bm{Q}}+{\bm{U}}{\bm{Q}}^{-1}{\bm{B}}{\bm{Q}}{\bm{U}}^{-1}+{\bm{U}}^{2}{\bm{Q}}^{-1}{\bm{B}}{\bm{Q}}{\bm{U}}^{-2})
=13(𝑩+𝑼𝑩𝑼1+𝑼2𝑩𝑼2)\displaystyle=\frac{1}{3}({\bm{B}}+{\bm{U}}{\bm{B}}{\bm{U}}^{-1}+{\bm{U}}^{2}{\bm{B}}{\bm{U}}^{-2})
=13(𝑩+𝑼2𝑩𝑼2+𝑼1𝑩𝑼)\displaystyle=\frac{1}{3}({\bm{B}}+{\bm{U}}^{-2}{\bm{B}}{\bm{U}}^{2}+{\bm{U}}^{-1}{\bm{B}}{\bm{U}})
=𝑪,\displaystyle={\bm{C}},

which shows ${\bm{C}}$ commutes with ${\bm{Q}}$. Now we show that any matrix ${\bm{C}}$ that commutes with both ${\bm{U}}$ and ${\bm{Q}}$ is a scaled identity matrix. Let ${\bm{x}}_{1}^{\perp}$ denote a unit vector perpendicular to ${\bm{x}}_{1}$, such that $\{{\bm{x}}_{1},{\bm{x}}_{1}^{\perp}\}$ form an orthonormal basis for $\mathbb{R}^{2}$. First, we know ${\bm{Q}}$ has eigenvectors ${\bm{x}}_{1}$ and ${\bm{x}}_{1}^{\perp}$ with eigenvalues $1$ and $-1$, respectively. And so ${\bm{Q}}{\bm{C}}{\bm{x}}_{1}={\bm{C}}{\bm{Q}}{\bm{x}}_{1}={\bm{C}}{\bm{x}}_{1}$, while ${\bm{Q}}{\bm{C}}{\bm{x}}_{1}^{\perp}={\bm{C}}{\bm{Q}}{\bm{x}}_{1}^{\perp}=-{\bm{C}}{\bm{x}}_{1}^{\perp}$, which implies ${\bm{C}}{\bm{x}}_{1}=a{\bm{x}}_{1}$ and ${\bm{C}}{\bm{x}}_{1}^{\perp}=b{\bm{x}}_{1}^{\perp}$ for some $a,b\in\mathbb{R}$, i.e., ${\bm{x}}_{1}$ and ${\bm{x}}_{1}^{\perp}$ are eigenvectors of ${\bm{C}}$ with eigenvalues $a$ and $b$, respectively. Furthermore, we have ${\bm{C}}{\bm{U}}{\bm{x}}_{1}={\bm{U}}{\bm{C}}{\bm{x}}_{1}=a{\bm{U}}{\bm{x}}_{1}$ and ${\bm{C}}{\bm{U}}{\bm{x}}_{1}^{\perp}={\bm{U}}{\bm{C}}{\bm{x}}_{1}^{\perp}=b{\bm{U}}{\bm{x}}_{1}^{\perp}$. This implies ${\bm{U}}{\bm{x}}_{1}$ and ${\bm{U}}{\bm{x}}_{1}^{\perp}$ are also eigenvectors of ${\bm{C}}$ with eigenvalues $a$ and $b$, respectively. But since ${\bm{x}}_{1}$ and ${\bm{U}}{\bm{x}}_{1}$ are linearly independent, and likewise so are ${\bm{x}}_{1}^{\perp}$ and ${\bm{U}}{\bm{x}}_{1}^{\perp}$, the only way this could be true is if $a=b$, i.e., every vector in $\mathbb{R}^{2}$ is an eigenvector of ${\bm{C}}$ with the same eigenvalue, which implies ${\bm{C}}=\lambda{\bm{I}}$ for some $\lambda\in\mathbb{R}$.

Figure 9: Illustration of ReLU boundaries (in red) of original interpolant 𝒇{\bm{f}} (left), the function 𝒈{\bm{g}} (middle) obtained by enforcing reflection symmetry, and the function 𝒉{\bm{h}} (right) obtained by enforcing both reflection and rotation symmetry.

Therefore, we have shown that every min-cost solution 𝒇{\bm{f}} maps to a min-cost solution 𝒉{\bm{h}} of the form

{\bm{h}}({\bm{y}})={\bm{h}}_{1}({\bm{y}})+{\bm{h}}_{2}({\bm{y}})+{\bm{h}}_{3}({\bm{y}})+\lambda{\bm{y}}

for some λ\lambda\in\mathbb{R}, where each 𝒉n{\bm{h}}_{n} is a sum of ReLU units, all of which are active on BnB_{n} and all of which are inactive on BjB_{j}, jnj\neq n. In particular, 𝒉n(𝒚)=𝒙nλ𝒚{\bm{h}}_{n}({\bm{y}})={\bm{x}}_{n}-\lambda{\bm{y}} for all 𝒚Bn{\bm{y}}\in B_{n} and 𝒉n(𝒚)=𝟎{\bm{h}}_{n}({\bm{y}})=\bm{0} for all 𝒚Cn{\bm{y}}\in C_{n}, where CnC_{n} is the intersection of all half-planes containing the balls BjB_{j}, jnj\neq n.

This means we can write ${\bm{h}}({\bm{y}})=\sum_{k=1}^{K}{\bm{a}}_{k}[{\bm{w}}_{k}^{\top}{\bm{y}}+b_{k}]_{+}+\lambda{\bm{y}}$ and ${\bm{h}}_{n}({\bm{y}})=\sum_{k\in{\mathcal{A}}_{n}}{\bm{a}}_{k}[{\bm{w}}_{k}^{\top}{\bm{y}}+b_{k}]_{+}$ for some index sets ${\mathcal{A}}_{1},{\mathcal{A}}_{2},{\mathcal{A}}_{3}$ partitioning $\{1,...,K\}$, so that $R({\bm{h}})=\sum_{k=1}^{K}\|{\bm{a}}_{k}\|=\sum_{n=1}^{3}\sum_{k\in{\mathcal{A}}_{n}}\|{\bm{a}}_{k}\|$. Let $R_{0}(\cdot)$ denote the representation cost of a function computed without an unregularized linear part, i.e., define $R_{0}$ analogously to $R$ except that the model class ${\bm{h}}_{\bm{\theta}}$ in (6) is constrained to have ${\bm{V}}=\bm{0}$. Then, since the realizations of the ${\bm{h}}_{n}$ functions considered above do not have a linear part, we see that $\sum_{k\in{\mathcal{A}}_{n}}\|{\bm{a}}_{k}\|\geq R_{0}({\bm{h}}_{n})$, and so $R({\bm{h}})=\sum_{n=1}^{3}\sum_{k\in{\mathcal{A}}_{n}}\|{\bm{a}}_{k}\|\geq R_{0}({\bm{h}}_{1})+R_{0}({\bm{h}}_{2})+R_{0}({\bm{h}}_{3})$.

Now we show how to lower bound $R_{0}({\bm{h}}_{n})$ for all $n=1,2,3$. Let ${\bm{x}}_{n}^{\perp}$ denote a unit vector perpendicular to ${\bm{x}}_{n}$, such that $\{{\bm{x}}_{n},{\bm{x}}_{n}^{\perp}\}$ form an orthonormal basis for $\mathbb{R}^{2}$. Consider the univariate functions $h_{n}^{\parallel}(t):={\bm{x}}_{n}^{\top}{\bm{h}}_{n}({\bm{x}}_{n}t)$ and $h_{n}^{\perp}(t):=({\bm{x}}_{n}^{\perp})^{\top}{\bm{h}}_{n}({\bm{x}}_{n}^{\perp}t+{\bm{x}}_{n})$. Here $h_{n}^{\parallel}$ is the restriction of ${\bm{h}}_{n}$ to the line spanned by ${\bm{x}}_{n}$, projected onto the ${\bm{x}}_{n}$ direction, and $h_{n}^{\perp}$ is the restriction of ${\bm{h}}_{n}$ to the line perpendicular to ${\bm{x}}_{n}$ passing through the point ${\bm{x}}_{n}$, projected onto the ${\bm{x}}_{n}^{\perp}$ direction. In particular, by the constraints on ${\bm{h}}_{n}$, we see that $h_{n}^{\parallel}$ and $h_{n}^{\perp}$ satisfy the constraints

hn(t)={0ift<1/2+ρ1λtift>1ρh_{n}^{\parallel}(t)=\begin{cases}0&\text{if}~{}t<-1/2+\rho\\ 1-\lambda t&\text{if}~{}t>1-\rho\end{cases} (87)

and hn(t)=λtif|t|ρh_{n}^{\perp}(t)=-\lambda t~{}\text{if}~{}|t|\leq\rho.

Claim 1: For all n=1,2,3n=1,2,3, R0(𝒉n)R0(hn)+R0(hn)R_{0}({\bm{h}}_{n})\geq R_{0}(h_{n}^{\parallel})+R_{0}(h_{n}^{\perp}). Proof: Let 𝒉n(𝒚)=k𝒂k[𝒘k𝒚+bk]++𝒄{\bm{h}}_{n}({\bm{y}})=\sum_{k}{\bm{a}}_{k}[{\bm{w}}_{k}^{\top}{\bm{y}}+b_{k}]_{+}+{\bm{c}} be any realization of 𝒉n{\bm{h}}_{n}, whose representation cost is C=k12(𝒂k2+𝒘k2)C=\sum_{k}\frac{1}{2}\left(\|{\bm{a}}_{k}\|^{2}+\|{\bm{w}}_{k}\|^{2}\right). Then realizations of hnh_{n}^{\parallel} and hnh_{n}^{\perp} are given by

\displaystyle h_{n}^{\parallel}(t) \displaystyle={\bm{x}}_{n}^{\top}{\bm{h}}_{n}({\bm{x}}_{n}t)=\sum_{k}({\bm{x}}_{n}^{\top}{\bm{a}}_{k})[({\bm{x}}_{n}^{\top}{\bm{w}}_{k})t+b_{k}]_{+}+{\bm{x}}_{n}^{\top}{\bm{c}}. (88)
hn(t)\displaystyle h_{n}^{\perp}(t) =(𝒙n)𝒉n(𝒙nt+𝒙n)=k((𝒙n)𝒂k)[((𝒙n)𝒘k)t+bk+𝒘k𝒙n]++(𝒙n)𝒄,\displaystyle=({\bm{x}}_{n}^{\perp})^{\top}{\bm{h}}_{n}({\bm{x}}_{n}^{\perp}t+{\bm{x}}_{n})=\sum_{k}(({\bm{x}}_{n}^{\perp})^{\top}{\bm{a}}_{k})[(({\bm{x}}_{n}^{\perp})^{\top}{\bm{w}}_{k})t+b_{k}+{\bm{w}}_{k}^{\top}{\bm{x}}_{n}]_{+}+({\bm{x}}_{n}^{\perp})^{\top}{\bm{c}}, (89)

whose representation costs CC^{\parallel} and CC^{\perp}, respectively, are given by

C\displaystyle C^{\parallel} =k12((𝒙n𝒂k)2+(𝒙n𝒘k)2)\displaystyle=\sum_{k}\frac{1}{2}\left(({\bm{x}}_{n}^{\top}{\bm{a}}_{k})^{2}+({\bm{x}}_{n}^{\top}{\bm{w}}_{k})^{2}\right) (90)
C\displaystyle C^{\perp} =k12(((𝒙n)𝒂k)2+((𝒙n)𝒘k)2)\displaystyle=\sum_{k}\frac{1}{2}\left((({\bm{x}}_{n}^{\perp})^{\top}{\bm{a}}_{k})^{2}+(({\bm{x}}_{n}^{\perp})^{\top}{\bm{w}}_{k})^{2}\right) (91)

and by the Pythagorean Theorem we see that C=C+CC=C^{\parallel}+C^{\perp}. Therefore, CR0(hn)+R0(hn)C\geq R_{0}(h_{n}^{\parallel})+R_{0}(h_{n}^{\perp}) and finally minimizing over all realizations of 𝒉n{\bm{h}}_{n} gives the claim.

Claim 2: For all $n=1,2,3$, $R_{0}(h_{n}^{\parallel})\geq\left|\frac{1-\lambda\beta}{\beta-\alpha}\right|+\left|\frac{1-\lambda\alpha}{\beta-\alpha}\right|$, where $\beta=1-\rho$ and $\alpha=-1/2+\rho$. Proof: By results in Savarese et al. (2019), $R_{0}(h_{n}^{\parallel})\geq R_{0}(p)$ where $p(t)$ is the function satisfying the same constraints as $h_{n}^{\parallel}$ given in (87) while linearly interpolating over the interval $t\in[\alpha,\beta]$. From the formula $R_{0}(p)=\max\{\int|p^{\prime\prime}(t)|dt,\,|p^{\prime}(\infty)+p^{\prime}(-\infty)|\}$ established in Savarese et al. (2019), we can show directly that $R_{0}(p)=\left|\frac{1-\lambda\beta}{\beta-\alpha}\right|+\left|\frac{1-\lambda\alpha}{\beta-\alpha}\right|$, which gives the claimed bound.
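For completeness, here is the short computation behind the last step (a sketch that uses only the piecewise-linear form of $p$ described above). The slopes of $p$ are $0$ for $t<\alpha$, $m:=\frac{1-\lambda\beta}{\beta-\alpha}$ on $(\alpha,\beta)$, and $-\lambda$ for $t>\beta$, so

\int|p^{\prime\prime}(t)|dt=|m-0|+|-\lambda-m|=\left|\frac{1-\lambda\beta}{\beta-\alpha}\right|+\left|\frac{1-\lambda\alpha}{\beta-\alpha}\right|,

where we used $\lambda+m=\frac{\lambda(\beta-\alpha)+1-\lambda\beta}{\beta-\alpha}=\frac{1-\lambda\alpha}{\beta-\alpha}$. On the other hand, $|p^{\prime}(\infty)+p^{\prime}(-\infty)|=|-\lambda+0|=|\lambda|$, and by the triangle inequality $|1-\lambda\beta|+|1-\lambda\alpha|\geq|(1-\lambda\alpha)-(1-\lambda\beta)|=|\lambda|(\beta-\alpha)$, so the maximum defining $R_{0}(p)$ is attained by the $\int|p^{\prime\prime}|$ term, which equals the claimed expression.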

Claim 3: For all n=1,2,3n=1,2,3, R0(hn)|λ|R_{0}(h_{n}^{\perp})\geq|\lambda|. Proof: The function q(t)q(t) with minimal R0R_{0}-cost satisfying the constraint q(t)=λtq(t)=-\lambda t for |t|ρ|t|\leq\rho is a single ReLU unit plus a constant: q(t)=λ[t+ρ]++λρq(t)=-\lambda[t+\rho]_{+}+\lambda\rho, which has R0R_{0}-cost |λ||\lambda|.

Putting the above claims together, we see that

R0(𝒉n)\displaystyle R_{0}({\bm{h}}_{n}) |1λββα|+|1λαβα|+|λ|\displaystyle\geq\left|\frac{1-\lambda\beta}{\beta-\alpha}\right|+\left|\frac{1-\lambda\alpha}{\beta-\alpha}\right|+|\lambda| (92)
|2λ(β+α)βα|+|λ|\displaystyle\geq\left|\frac{2-\lambda(\beta+\alpha)}{\beta-\alpha}\right|+|\lambda| (93)
2|λ|(β+α)βα+|λ|\displaystyle\geq\frac{2-|\lambda|(\beta+\alpha)}{\beta-\alpha}+|\lambda| (94)
=2βα+2αβα|λ|\displaystyle=\frac{2}{\beta-\alpha}+\frac{-2\alpha}{\beta-\alpha}|\lambda| (95)
2βα\displaystyle\geq\frac{2}{\beta-\alpha} (96)

where in the last inequality we used the fact that 2αβα>0\frac{-2\alpha}{\beta-\alpha}>0 since α=1/2+ρ<0\alpha=-1/2+\rho<0 and βα=3/22ρ>0\beta-\alpha=3/2-2\rho>0.

Therefore, R(𝒇)=R(𝒉)R0(𝒉1)+R0(𝒉2)+R0(𝒉3)6βαR({\bm{f}})=R({\bm{h}})\geq R_{0}({\bm{h}}_{1})+R_{0}({\bm{h}}_{2})+R_{0}({\bm{h}}_{3})\geq\frac{6}{\beta-\alpha}. Also, the function 𝒇{\bm{f}}^{*} given by

𝒇=𝒇1+𝒇2+𝒇3{\bm{f}}^{*}={\bm{f}}_{1}^{*}+{\bm{f}}_{2}^{*}+{\bm{f}}_{3}^{*}

where

𝒇n(𝒚)=𝒙nβα([𝒙n𝒚α]+[𝒙n𝒚β]+){\bm{f}}_{n}^{*}({\bm{y}})=\frac{{\bm{x}}_{n}}{\beta-\alpha}([{\bm{x}}_{n}^{\top}{\bm{y}}-\alpha]_{+}-[{\bm{x}}_{n}^{\top}{\bm{y}}-\beta]_{+})

satisfies norm-ball interpolation constraints and R(𝒇)=6βαR({\bm{f}}^{*})=\frac{6}{\beta-\alpha}. Hence, 𝒇{\bm{f}}^{*} is a min-cost solution. ∎
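As a numerical sanity check of Proposition 4 (our own sketch, not part of the proof), one can instantiate (86) for a concrete equilateral triangle in $\mathbb{R}^{2}$, verify the interpolation constraints at the three vertices, and confirm that the explicit realization with two units per vertex has cost $6/(\beta-\alpha)$:

# Verify the equilateral-triangle minimizer (86): f*(x_n) = x_n at the
# vertices, and the explicit realization has cost 6 / (beta - alpha).
import numpy as np

rho = 0.2                                                  # requires rho < 1/2
angles = np.array([0.0, 2 * np.pi / 3, 4 * np.pi / 3])
X = np.stack([np.cos(angles), np.sin(angles)], axis=1)     # unit-norm vertices, centroid 0

alpha, beta = -0.5 + rho, 1.0 - rho
s = 1.0 / (beta - alpha)

def f_star(y):
    out = np.zeros(2)                                      # here u_n = x_n and x_0 = 0
    for x in X:
        t = x @ y
        out += x * s * (np.maximum(t - alpha, 0.0) - np.maximum(t - beta, 0.0))
    return out

for x in X:
    assert np.allclose(f_star(x), x, atol=1e-12)
# The centroid is mapped to itself (a symmetry consequence, not a constraint).
assert np.allclose(f_star(np.zeros(2)), 0.0, atol=1e-12)

# Each phi_n uses two ReLU units with unit-norm inner weights and outer
# weights of norm s, so the total representation cost is 3 * 2 * s.
cost = 3 * 2 * s
assert np.isclose(cost, 6.0 / (beta - alpha))
print("interpolation verified; representation cost =", cost)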

Appendix D Additional simulations

We train a one-hidden-layer ReLU network without a skip connection, using the same setting as in Figure 1. As can be seen from Figure 10, we get a similar result for the NN denoiser with and without the skip connection. Therefore, in Appendix D.1 we use a one-hidden-layer ReLU network without a skip connection.

Figure 10: NN denoiser vs. eMMSE denoiser. We trained a one-hidden-layer ReLU network on a denoising task. The clean dataset has four points equally spaced in the interval $[-5,5]$, and the noisy samples are generated by adding zero-mean Gaussian noise with $\sigma=1.5$. We use $\lambda=10^{-5}$ in both settings. The figure shows the denoiser output as a function of its input for: (1) the NN denoiser trained online using (7) for $100K$ iterations, (2) the NN denoiser trained offline using (8) with $M=9000$ and $20K$ epochs, and (3) the eMMSE denoiser (4).
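For reference, the following is a minimal sketch of the online 1-D training setup behind this figure (an illustrative paraphrase rather than the exact training code; we assume the online objective (7) is the MSE between the network output on a freshly noised sample and the corresponding clean point, with the $\ell^{2}$ penalty applied to the weights only; the hidden width and learning rate below are placeholder values):

# Sketch of online denoiser training in 1-D with a small weight decay.
import torch

torch.manual_seed(0)
clean = torch.linspace(-5.0, 5.0, 4).unsqueeze(1)       # four clean points in [-5, 5]
sigma, weight_decay, width = 1.5, 1e-5, 500             # sigma and lambda as in the caption

lin1 = torch.nn.Linear(1, width)
lin2 = torch.nn.Linear(width, 1)
skip = torch.nn.Linear(1, 1, bias=False)                # linear skip connection

def model(y):
    return lin2(torch.relu(lin1(y))) + skip(y)

# Regularize only the inner/outer weights (not the biases or the skip).
opt = torch.optim.Adam([
    {"params": [lin1.weight, lin2.weight], "weight_decay": weight_decay},
    {"params": [lin1.bias, lin2.bias, skip.weight], "weight_decay": 0.0},
], lr=1e-3)

for step in range(100_000):
    noisy = clean + sigma * torch.randn_like(clean)     # fresh noise every iteration
    loss = torch.mean((model(noisy) - clean) ** 2)
    opt.zero_grad()
    loss.backward()
    opt.step()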

D.1 MNIST

We use the MNIST dataset to verify several properties. First, we verify that the offline and online solutions achieve approximately the same test MSE when trained on a subset of the MNIST dataset (Figure 11). Second, we show that the failure of the NN denoiser to converge to the eMMSE denoiser is not due to approximation error (Figure 12). Lastly, we present the critical noise level below which the representation cost minimizer $f^{*}_{\mathrm{1D}}$ has strictly lower MSE than the eMMSE denoiser (Figure 13).

Figure 11: Online setting vs. offline setting for the MNIST denoiser. We train a one-hidden-layer ReLU network on a subset of $N=100$ MNIST images for 10K iterations. We use zero-mean Gaussian noise with $\sigma=0.1$.
Figure 12: Test loss vs. layer width for the MNIST denoiser. We train a one-hidden-layer ReLU network on the MNIST denoising task using (7) for $93K$ iterations with a fixed learning rate. We use zero-mean Gaussian noise with $\sigma=0.1$. The figure shows the test loss vs. the layer width.
Figure 13: MSE vs. noise std. We train a one-hidden-layer ReLU network on a subset of $N=100$ MNIST images (the range of each pixel is $[0,1]$) for 10K iterations with a fixed learning rate. We use zero-mean Gaussian noise. The figure shows the MSE vs. the noise std ($\sigma$) for the NN denoiser (orange line) and for the eMMSE denoiser (blue line). Note that the eMMSE depends on $\sigma$ (4). For low noise levels, the eMMSE output is one of the training set images. For moderate noise levels, the eMMSE output is a weighted sum of the training set images. For high noise levels, the eMMSE output is the mean of the training set images.
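For reference, here is a minimal sketch of the eMMSE baseline used in Figures 10-13 (our own implementation; we assume (4) is the posterior mean over the clean training set under i.i.d. zero-mean Gaussian noise):

# Empirical MMSE denoiser: a softmax-weighted average of the clean
# training points, with weights proportional to exp(-||y - x_i||^2 / (2 sigma^2)).
import numpy as np

def emmse_denoise(y, clean_points, sigma):
    # y: (d,) noisy input; clean_points: (N, d); returns the (d,) estimate.
    d2 = np.sum((clean_points - y) ** 2, axis=1)   # squared distances
    logits = -d2 / (2.0 * sigma ** 2)
    logits -= logits.max()                         # stabilize before exponentiating
    w = np.exp(logits)
    w /= w.sum()
    return w @ clean_points

# As sigma -> 0 the weights concentrate on the nearest training image;
# as sigma -> infinity they become uniform and the output tends to the
# mean of the training set, matching the behavior described above.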

D.2 Three non-colinear training samples

We show in Figures 14 and 15 that, for $N=3$ training points from the MNIST dataset forming a triangle in a $d=2$-dimensional plane, the empirical minimizer obtained using noisy samples and weight-decay regularization agrees well with the form of the exact representation cost minimizer predicted by Proposition 2 and Conjecture 1.

Figure 14: Function space view of a denoiser trained on $3$ MNIST data points (Acute Angle). Here, we compare empirical results (left) with our theoretical results (right). We compare the function-space view for inputs from the data plane, with respect to the model output in each of the data directions. For the empirical results, we choose $3$ random MNIST data points with the same label, under the condition that they form an acute triangle ($64^{\circ}$, for this figure). We trained a single-layer FC ReLU network with a linear residual connection for 1M epochs, with weight decay of $10^{-8}$ (as described in our model), and the Adam optimizer with learning rate $10^{-5}$.
Figure 15: Function space view of a denoiser trained on 3 MNIST data points (Obtuse Angle). Here, we compare empirical results (left) with our theoretical results (right). We compare the function-space view for inputs from the data plane, with respect to the model output in each of the data directions. For the empirical results, we added a single data point to the two previously chosen for Figure 14, under the condition that the three points form an obtuse triangle ($95^{\circ}$, for this figure). We trained a single-layer FC ReLU network as described in Figure 14. As predicted, the function we converged to for data forming an obtuse triangle is noticeably different from the function we converged to when the data formed an acute triangle.

D.3 Empirical validation of the subspace assumption

We validated that the following image datasets are (approximately) low rank:

  • CIFAR10

  • CINIC10

  • Tiny ImageNet (a lower resolution version of ImageNet, enabling us to use SVD)

  • BSD (a denoising benchmark composed of $128\times 1600$ patches of size $40\times 40$ cropped from $400$ images (Zhang et al., 2017))

As can be seen from Table 1, all the datasets that we used are (approximately) low rank.

Table 1: We applied a Singular Value Decomposition (SVD) to each of the above datasets, and calculated the relative number of Singular Values (SV) needed to achieve a given percentile of the energy (for the average vector).
Dataset 95% 99% 99.9%
CIFAR10 0.8% 7.5% 30%
CINIC10 1% 23% 41%
Tiny ImageNet 1.6% 20% 36%
BSD 0.1% 1.6% 4.5%
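The following is a minimal sketch of the computation behind Table 1 (our own code; the exact preprocessing, e.g., whether the images are mean-subtracted before the SVD, is an implementation choice that we leave as an assumption):

# Fraction of singular values needed so that the retained squared singular
# values capture a given percentile of the total energy of a dataset.
import numpy as np

def sv_fraction_for_energy(data_matrix, percentiles=(0.95, 0.99, 0.999)):
    # data_matrix: (num_samples, dim), one vectorized image per row.
    s = np.linalg.svd(data_matrix, compute_uv=False)
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    return {p: (np.searchsorted(energy, p) + 1) / len(s) for p in percentiles}

# Example with synthetic, approximately low-rank data (illustration only):
rng = np.random.default_rng(0)
data = rng.standard_normal((1000, 50)) @ rng.standard_normal((50, 3072)) \
    + 0.01 * rng.standard_normal((1000, 3072))
print(sv_fraction_for_energy(data))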