
How do Minimum-Norm Shallow Denoisers Look in Function Space?

Chen Zeno
Electrical and Computer Engineering, Technion

Greg Ongie
Department of Mathematical and Statistical Sciences, Marquette University

Yaniv Blumenfeld, Nir Weinberger, Daniel Soudry
Electrical and Computer Engineering, Technion

{chenzeno,yanivbl}@campus.technion.ac.il, [email protected]
[email protected], [email protected]
Abstract

Neural network (NN) denoisers are an essential building block in many common tasks, ranging from image reconstruction to image generation. However, the success of these models is not well understood from a theoretical perspective. In this paper, we aim to characterize the functions realized by shallow ReLU NN denoisers — in the common theoretical setting of interpolation (i.e., zero training loss) with a minimal representation cost (i.e., minimal \ell^{2} norm weights). First, for univariate data, we derive a closed form for the NN denoiser function, find it is contractive toward the clean data points, and prove it generalizes better than the empirical MMSE estimator at a low noise level. Next, for multivariate data, we find the NN denoiser functions in a closed form under various geometric assumptions on the training data: data contained in a low-dimensional subspace, data contained in a union of one-sided rays, or several types of simplexes. These functions decompose into a sum of simple rank-one piecewise linear interpolations aligned with edges and/or faces connecting training samples. We empirically verify this alignment phenomenon on synthetic data and real images.

1 Introduction

The ability to reconstruct an image from a noisy observation has been studied extensively in the last decades, as it is useful for many practical applications (e.g., Hasinoff et al. (2010)). In recent years, Neural Network (NN) denoisers commonly replace classical expert-based approaches as they achieve substantially better results than the classical approaches (e.g., Zhang et al. (2017)). Beyond this natural usage, NN denoisers also serve as essential building blocks in a variety of common computer vision tasks, such as solving inverse problems (Zhang et al., 2021) and image generation (Song and Ermon, 2019; Ho et al., 2020). To better understand the role of NN denoisers in such complex applications, we first wish to theoretically understand the type of solutions they converge to.

In practice, when training denoisers, we sample multiple noisy samples for each clean image and minimize the Mean Squared Error (MSE) loss for recovering the clean image. Since we sample numerous noisy samples per clean sample, the number of training samples is typically larger than the number of parameters in the network. Interestingly, even in such an under-parameterized regime, the loss has multiple global minima corresponding to distinct denoiser functions which achieve zero loss on the observed data. To characterize these functions, we study, similarly to previous works (Savarese et al., 2019; Ongie et al., 2020), the shallow NN solutions that interpolate the training data with minimal representation cost, i.e., where the \ell^{2}-norm of the weights (without biases and skip connections) is as small as possible. This is because we converge to such min-cost solutions when we minimize the loss with a vanishingly small \ell^{2} regularization on these weights.

We first examine the univariate input case: building on existing results (Hanin, 2021), we characterize the min-cost interpolating solution and its generalization to unseen data. Next, we aim to extend this analysis to the multivariate case. However, this is challenging, since, to the best of our knowledge, there are no results that explicitly characterize these min-cost solutions for general multivariate shallow NNs — except in two basic cases. In the first case, the input data is co-linear (Ergen and Pilanci, 2021). In the second case, the input samples are identical to their target outputs, so the trivial min-cost solution is the identity function. The NN denoisers’ training regime is ‘near’ the second case: there, the input samples are noisy versions of the clean target outputs. Interestingly, we find that this regime leads to non-trivial min-cost solutions far from identity — even with an infinitesimally small input noise. We analytically investigate these solutions here.

Our Contributions.

We study the NN solutions in the setting of interpolation of noisy samples with min-cost, in a practically relevant “low noise regime” where the noisy samples are well clustered. In the univariate case,

  • We find a closed-form solution for the minimum representation cost NN denoiser. Then, we prove this solution generalizes better than the empirical minimum MSE (eMMSE) denoiser.

  • We prove this min-cost NN solution is contractive toward the clean data points, that is, applying the denoiser necessarily reduces the distance of a noisy sample to one of the clean samples.

In the multivariate case,

  • We derive a closed-form solution for the min-cost NN denoiser in the multivariate case under various assumptions on the geometric configuration of the clean training samples. To the best of our knowledge, this is the first set of results to explicitly characterize a min-cost interpolating NN in a non-basic multivariate setting.

  • We illustrate a general alignment phenomenon of min-cost NN denoisers in the multivariate setting: the optimal NN denoiser decomposes into a sum of simple rank-one piecewise linear interpolations aligned with edges and/or faces connecting clean training samples.

2 Preliminaries and problem setting

The denoising problem.

Let \bm{y}\in\mathbb{R}^{d} be a noisy observation of \bm{x}\in\mathbb{R}^{d}, such that \bm{y}=\bm{x}+\bm{\epsilon}, where \bm{x} and \bm{\epsilon} are statistically independent and \mathbb{E}[\bm{\epsilon}]=\bm{0}. Commonly, this noise is Gaussian with covariance matrix \sigma^{2}\bm{I}. The ultimate goal of a denoiser \hat{\bm{x}}\left(\bm{y}\right) is to minimize the MSE loss over the joint probability distribution of the data and the noisy observation (the "population distribution"), i.e., to minimize

\mathcal{L}\left(\hat{\bm{x}}\right)=\mathbb{E}_{\bm{x},\bm{y}}\left\|\hat{\bm{x}}\left(\bm{y}\right)-\bm{x}\right\|^{2}\,.  (1)

The well-known optimal solution for (1) is the minimum mean square error (MMSE) denoiser, i.e.,

\hat{\bm{x}}^{*}(\bm{y})=\mathbb{E}_{\bm{x}|\bm{y}}[\bm{x}\mid\bm{y}]\in{\arg\min}_{\hat{\bm{x}}\left(\bm{y}\right)}\mathbb{E}_{\bm{x},\bm{y}}\left\|\bm{x}-\hat{\bm{x}}\left(\bm{y}\right)\right\|^{2}\,.  (2)

Since we do not have access to the distribution of the data, and hence not to the posterior distribution, we rely on a finite amount of clean data \{\bm{x}_{n}\}_{n=1}^{N} in order to learn a good approximation for the MMSE estimator. One approach is to assume an empirical data distribution and derive the optimal solution of (1), i.e., the empirical minimum mean square error (eMMSE) denoiser,

\hat{\bm{x}}^{\mathrm{eMMSE}}\left(\bm{y}\right)\in\underset{\hat{\bm{x}}\left(\bm{y}\right)}{\arg\min}\,\frac{1}{N}\sum_{n=1}^{N}\mathbb{E}_{\bm{y}|\bm{x}_{n}}\left\|\hat{\bm{x}}\left(\bm{y}\right)-\bm{x}_{n}\right\|^{2}\,.  (3)

If the noise is Gaussian with a covariance of \sigma^{2}\bm{I}, an explicit solution to the eMMSE is given by

\hat{\bm{x}}^{\mathrm{eMMSE}}\left(\bm{y}\right)=\frac{\sum_{n=1}^{N}\bm{x}_{n}\exp\left(-\frac{\left\|\bm{y}-\bm{x}_{n}\right\|^{2}}{2\sigma^{2}}\right)}{\sum_{n=1}^{N}\exp\left(-\frac{\left\|\bm{y}-\bm{x}_{n}\right\|^{2}}{2\sigma^{2}}\right)}\,.  (4)
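For illustration, the eMMSE denoiser (4) can be evaluated directly from the clean training set. The following NumPy sketch (ours, for illustration only; the function and variable names are not from the paper) computes (4) with a numerically stable softmax over squared distances.

import numpy as np

def emmse_denoiser(y, X, sigma):
    """Empirical MMSE denoiser of Eq. (4): a softmax-weighted average of the clean points.

    y     : (d,) noisy input
    X     : (N, d) clean training points x_1, ..., x_N
    sigma : noise standard deviation
    """
    d2 = np.sum((X - y) ** 2, axis=1)          # ||y - x_n||^2
    logits = -d2 / (2.0 * sigma ** 2)
    logits -= logits.max()                     # numerical stability
    w = np.exp(logits)
    w /= w.sum()
    return w @ X                               # convex combination of the clean points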

An alternative approach to computing the eMMSE directly is to draw M noisy samples for each clean data point, as \bm{y}_{n,m}=\bm{x}_{n}+\bm{\epsilon}_{n,m}, where \bm{\epsilon}_{n,m}\sim\mathcal{N}\left(\bm{0},\sigma^{2}\bm{I}\right) are independent and identically distributed, and to minimize the following loss function

\mathcal{L}_{\mathrm{offline},M}\left(\hat{\bm{x}}\right)=\frac{1}{MN}\sum_{m=1}^{M}\sum_{n=1}^{N}\left\|\hat{\bm{x}}\left(\bm{y}_{n,m}\right)-\bm{x}_{n}\right\|^{2}\,.  (5)
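The offline objective (5) only requires the finite set of noisy samples y_{n,m}. A minimal sketch of this setup (assuming Gaussian noise; the helper names and NumPy conventions are our own) is:

import numpy as np

def make_noisy_dataset(X, M, sigma, seed=0):
    """Draw M noisy copies y_{n,m} = x_n + eps_{n,m} of each clean point (setup of Eq. (5))."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    return X[None, :, :] + rng.normal(scale=sigma, size=(M, N, d))   # shape (M, N, d)

def offline_loss(denoiser, Y, X):
    """Offline MSE of Eq. (5): average of ||f(y_{n,m}) - x_n||^2 over all n and m."""
    M, N, _ = Y.shape
    preds = np.stack([[denoiser(Y[m, n]) for n in range(N)] for m in range(M)])
    return np.mean(np.sum((preds - X[None]) ** 2, axis=-1))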
Denoiser model and algorithms.

In practice, we approximate the optimal denoiser \hat{\bm{x}}\left(\bm{y}\right) using a parametric model \bm{h}_{\bm{\theta}}\left(\bm{y}\right), typically a NN. We focus on a shallow ReLU network model with a skip connection of the form

\bm{h}_{\theta}(\bm{y})=\sum_{k=1}^{K}{\bm{a}}_{k}[{\bm{w}}_{k}^{\top}\bm{y}+b_{k}]_{+}+{\bm{V}}\bm{y}+{\bm{c}}  (6)

where \theta=((\theta_{k})_{k=1}^{K};{\bm{c}},{\bm{V}}) with \theta_{k}=(b_{k},{\bm{a}}_{k},{\bm{w}}_{k})\in\mathbb{R}\times\mathbb{R}^{d}\times\mathbb{R}^{d} and {\bm{c}}\in\mathbb{R}^{d},{\bm{V}}\in\mathbb{R}^{d\times d}. We train the model on a finite set of clean data points \left\{\bm{x}_{n}\right\}_{n=1}^{N}. The common practical training method is based on an online approach. First, we sample a random batch (with replacement) from the data, \mathcal{B}\subseteq\{\bm{x}_{n}\}_{n=1}^{N}. Then, for each clean data point \bm{x}_{n}\in\mathcal{B}, we draw a noisy sample \bm{y}_{n}=\bm{x}_{n}+\bm{\epsilon}_{n}, where \bm{\epsilon}_{n}\sim\mathcal{N}\left(\bm{0},\sigma^{2}\bm{I}\right) are independent of the clean data points and other noise samples. At each iteration t out of T iterations, we update the model parameters according to a stochastic gradient descent rule, with a vanishingly small regularization term \lambda C\left(\bm{\theta}\right), that is,

\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta\nabla_{\bm{\theta}_{t}}\frac{1}{|\mathcal{B}|}\sum_{n\in\mathcal{B}}\left\|\bm{h}_{\bm{\theta}_{t}}\left(\bm{y}_{n}\right)-\bm{x}_{n}\right\|^{2}-\eta\lambda\nabla_{\bm{\theta}_{t}}C\left(\bm{\theta}_{t}\right)\,.  (7)

Another training method (Chen et al., 2014) is based on an offline approach. We sample M noisy samples for each clean data point and minimize (5) plus a regularization term

\mathcal{L}_{\mathrm{offline},M}\left(\bm{\theta}\right)=\frac{1}{MN}\sum_{m=1}^{M}\sum_{n=1}^{N}\left\|\bm{h}_{\theta}\left(\bm{y}_{n,m}\right)-\bm{x}_{n}\right\|^{2}+\lambda C\left(\bm{\theta}\right)\,.  (8)

Similarly to previous works (Savarese et al., 2019; Ongie et al., 2020), we assume an \ell^{2} penalty on the weights, but not on the biases and skip connections, i.e.,

C(\theta)=\frac{1}{2}\sum_{k=1}^{K}\left(\|{\bm{a}}_{k}\|^{2}+\|{\bm{w}}_{k}\|^{2}\right)\,.  (9)
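A minimal PyTorch sketch of the model (6) trained on the offline objective (8) with the penalty (9) is given below; the class and function names, optimizer usage, and hyperparameters are illustrative choices of ours, not the exact training code used in the experiments.

import torch
import torch.nn as nn

class ShallowDenoiser(nn.Module):
    """One-hidden-layer ReLU denoiser with a linear skip connection, as in Eq. (6)."""
    def __init__(self, d, K):
        super().__init__()
        self.hidden = nn.Linear(d, K)             # rows of .weight are w_k, biases are b_k
        self.out = nn.Linear(K, d)                # columns of .weight are a_k, bias is c
        self.skip = nn.Linear(d, d, bias=False)   # V

    def forward(self, y):
        return self.out(torch.relu(self.hidden(y))) + self.skip(y)

def penalty(model):
    """C(theta) of Eq. (9): l2 penalty on w_k and a_k only (no biases, no skip connection)."""
    return 0.5 * (model.hidden.weight.pow(2).sum() + model.out.weight.pow(2).sum())

def offline_step(model, opt, Y, X, lam=1e-5):
    """One gradient step on the offline objective (8); Y holds noisy inputs, X the clean targets."""
    opt.zero_grad()
    loss = ((model(Y) - X) ** 2).sum(dim=1).mean() + lam * penalty(model)
    loss.backward()
    opt.step()
    return loss.item()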
Low noise regime.

In this paper, we study the solution of the NN denoiser when the clusters of noisy samples around each clean point are well-separated, a setting which we refer to as the "low noise regime". This is a practically relevant regime since denoisers are typically used when the noise level is mild. Indeed, common image-denoising benchmarks test on low (but not negligible) noise levels. For instance, in the commonly used denoising benchmark BSD68 (Roth and Black, 2009), the noise level \sigma=0.1 is in the low noise regime.[1] Moreover, this setting is important, for example, in diffusion-based image generation, since at the end of the reverse denoising process, new images are sampled by denoising smaller and smaller noise levels.[2]
[1] The minimum distance between two images in BSD68 is about 97, while the image resolution is d=481\times 321. Also, the norm of the noise concentrates around the value \sqrt{d}\sigma\approx\sqrt{481\cdot 321}\cdot 0.1\approx 40<97. Therefore, the clusters of noisy samples around each clean point are generally well-separated.
[2] Interestingly, it was suggested that the "useful" part of the diffusion dynamics happens only below some critical noise level (Raya and Ambrogioni, 2023).

3 Basic properties of neural network denoisers

Offline vs. online NN solutions.

Figure 1: NN denoiser vs. eMMSE denoiser. We trained a one-hidden-layer ReLU network with a skip connection on a denoising task. The clean dataset has four points equally spaced in the interval [-5,5], and the noisy samples are generated by adding zero-mean Gaussian noise with \sigma=1.5. We use \lambda=10^{-5} in both settings. The figure shows the denoiser output as a function of its input for: (1) the NN denoiser trained online using (7) for 100K iterations, (2) the NN denoiser trained offline using (8) with M=9000 and 20K epochs, and (3) the eMMSE denoiser (4).

NN denoisers are traditionally trained in an online fashion (7), using a finite number of T iterations. Consequently, only a finite number of noisy samples are used for each clean data point. We empirically observe that the solutions in the offline and online settings are similar. Specifically, in the univariate case, we show in Figure 1 that denoisers based on the offline and online loss functions converge to indistinguishable solutions. For the multivariate case, we observe (Figure 11 in Appendix D) that the offline and online solutions achieve approximately the same test MSE when trained on a subset of the MNIST dataset. The comparison uses the same number of iterations for both training methods, while using far fewer noisy samples in the offline setting. Evidently, this lower number of samples does not significantly affect the generalization error. Hence, in the rest of the paper, we focus on offline training (i.e., minimizing the offline loss \mathcal{L}_{\mathrm{offline},M}), as it defines an explicit loss function with solutions that can be theoretically analyzed, as in (Savarese et al., 2019; Ongie et al., 2020).

The empirical MMSE denoiser.

The law of large numbers implies that the denoiser minimizing the offline loss \mathcal{L}_{\mathrm{offline},M} approaches the eMMSE estimator in the limit of infinitely many noisy samples,

\hat{\bm{x}}^{\mathrm{eMMSE}}\left(\bm{y}\right)\in{\arg\min}_{\hat{\bm{x}}}\lim_{M\to\infty}\mathcal{L}_{\mathrm{offline},M}\left(\hat{\bm{x}}\right)\,.  (10)

Therefore, it may seem that for a reasonable number of noisy samples M, a large enough model, and small enough regularization, the denoiser we get by minimizing the offline loss (8) will also be similar to the eMMSE estimator. However, Figure 1 shows that the eMMSE solution and the NN solutions (both online and offline) are quite different. The eMMSE denoiser has a much sharper transition and maps almost all inputs to the value of a clean data point. This is because in the case of low noise the eMMSE denoiser (4) approximates the one-nearest-neighbor (1-NN) classifier, i.e.,

\lim_{\sigma\rightarrow 0^{+}}\hat{\bm{x}}^{\mathrm{eMMSE}}\left(\bm{y}\right)=\underset{\bm{x}\in\left\{\bm{x}_{i}\right\}_{i=1}^{N}}{\arg\min}\left\|\bm{y}-\bm{x}\right\|\,.  (11)
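The limit (11) is easy to verify numerically from the closed form (4): as \sigma decreases, the softmax weights concentrate on the nearest clean point. The points below are arbitrary and chosen only for illustration.

import numpy as np

X = np.array([[-5.0], [-1.0], [2.0], [6.0]])   # clean points (d = 1)
y = np.array([0.9])                            # query lying between two clean points
for sigma in [2.0, 0.5, 0.1, 0.02]:
    logits = -np.sum((X - y) ** 2, axis=1) / (2 * sigma ** 2)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    print(sigma, float(w @ X[:, 0]))           # tends to 2.0, the nearest clean point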

In contrast, the NN denoiser maps each noisy sample to its corresponding clean sample only in a limited “noise ball” around the clean point, and interpolates near-linearly between the “noise balls”. Hence, we may expect that the smoother NN denoiser typically generalizes better than the eMMSE denoiser. We prove that this is indeed true for the univariate case in Section 4.

Why does the NN denoiser not converge to the eMMSE denoiser? Note that the limit in (10) is not practically relevant in the low noise regime, since we need an exponentially large M in order to converge in this limit. For example, in the case of univariate Gaussian noise, we have P\left(|\epsilon|>t\right)\leq 2\exp\left(-\frac{t^{2}}{2\sigma^{2}}\right) for all t>0. Therefore, during training, we effectively observe only noisy samples that lie in a bounded interval of size 2\sigma\sqrt{\log M} around each clean sample (see Figure 1). In other words, in the low noise regime and for non-exponential M, there is no way to distinguish whether the noise is sampled from some distribution with limited support or from a Gaussian distribution. The denoiser minimizing the loss with respect to a bounded-support distribution can be radically different from the eMMSE denoiser in the regions outside the "noise balls" surrounding the clean samples, where the denoiser function is not constrained by the loss. This leads to a large difference between the NN denoiser and the MMSE estimator.

Alternatively, one may suggest that the NN denoiser does not converge to the eMMSE denoiser due to an approximation error (i.e., the shallow NN's model capacity is too small to approximate the MMSE denoiser). Nevertheless, we provide empirical evidence indicating this is not the case. Specifically, recall that in the low noise regime, the eMMSE denoiser tends to the nearest-neighbor classifier, and such a solution does not generalize well to test data. Thus, had the NN denoiser converged to the eMMSE solution, its test error would have increased with the network size, in contrast to what we observe in Figure 12 (Appendix D).

Therefore, in order to approximate the eMMSE with a NN, it seems we must have an exponentially large M. Alternatively, we may converge to the eMMSE if we use a loss function marginalized over the Gaussian noise. This idea was previously suggested by Chen et al. (2014), with the goal of effectively increasing the number of noisy samples and thus improving the training performance of denoising autoencoders. Therein, this improvement was obtained by approximating the marginalized loss function by a Taylor series expansion. However, for shallow denoisers, we may actually obtain an explicit expression for this marginalized loss, without any approximation. Specifically, if we assume, for simplicity, that the network does not have a linear unit ({\bm{V}}=\bm{0}) and its bias terms are zero ({\bm{c}}=\bm{0},\,b_{k}=0), then the marginalized loss for Gaussian noise, derived in Appendix A, is given by

\mathcal{L}\left(\bm{\theta},\sigma\right)=\mathbb{E}_{\bm{x},\bm{y}}\left\|\bm{h}_{\theta}\left(\bm{y}\right)-\bm{x}\right\|^{2}=\mathbb{E}_{\bm{x},\bm{y}}\left\|\sum_{k=1}^{K}\bm{a}_{k}[\bm{w}_{k}^{\top}\bm{y}]_{+}-\bm{x}\right\|^{2}=\mathbb{E}_{\bm{x}}\mathbb{E}_{\bm{y}|\bm{x}}\left\|\sum_{k=1}^{K}\bm{a}_{k}[\bm{w}_{k}^{\top}\bm{y}]_{+}-\bm{x}\right\|^{2}
=\mathbb{E}_{\bm{x}}\left\|\sum_{k=1}^{K}\bm{a}_{k}\tilde{\phi}\left(\bm{\hat{w}}_{k}^{\top}\bm{x},\left\|\bm{w}_{k}\right\|,\sigma\right)-\bm{x}\right\|^{2}+\sum_{i=1}^{K}\sum_{j=1}^{K}\bm{a}_{i}^{\top}\bm{a}_{j}\mathbf{H}_{ij}\left(\bm{w}_{i},\bm{w}_{j},\sigma^{2}\right)\,,  (12)

where \bm{w}_{k}=\bm{\hat{w}}_{k}\left\|\bm{w}_{k}\right\| and \mathbf{H}\succeq 0, \tilde{\phi}\left(\cdot\right) are defined in Appendix A. NN denoisers trained over this loss function will thus tend to the eMMSE solution as the network size is increased. However, as we explained above, this is not necessarily desirable, so we only mention (12) to show exact marginalization is feasible.

Regularization biases toward specific neural network denoisers.

To further explore the converged solution for offline training, we note that the offline loss function \mathcal{L}_{\mathrm{offline},M}\left(\bm{\theta}\right) allows the network to converge to a zero-loss solution. This is in contrast to online training, for which each batch leads to new realizations of noisy samples, and thus the training error is never exactly zero. Specifically, consider the low noise regime (well-separated noisy clusters). Then, the network can perfectly fit all the noisy samples using a finite number of neurons (see Section 4 for a more accurate description in the univariate case). Importantly, there are multiple ways to cluster the noisy data points with such neurons, and so there are multiple global training loss minima that the network can achieve with zero loss, each with a different generalization capability. In contrast to the standard case considered in the literature, this holds even in the under-parameterized case (where NM, the total number of noisy samples, is larger than the number of parameters).

Since there are many minima that perfectly fit the training data, we converge to specific minima which also minimize the \ell^{2} regularization we use (even though we assumed it is vanishing). Specifically, in the limit of vanishing regularization C\left(\theta\right), the minimizers of \mathcal{L}_{\mathrm{offline},M}\left(\bm{\theta}\right) also minimize the representation cost.

Definition 1.

Let \bm{h}_{\theta}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} denote a shallow ReLU network of the form (6). For any function \bm{f}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} realizable as a shallow ReLU network, we define its representation cost as

R(\bm{f})=\inf_{\theta:\,\bm{f}=\bm{h}_{\theta}}C\left(\theta\right)=\inf_{\theta}\sum_{k=1}^{K}\|{\bm{a}}_{k}\|\quad\mathrm{s.t.}\quad\|\bm{w}_{k}\|=1~\forall k,\ \bm{f}=\bm{h}_{\theta}\,,  (13)

and a minimizer of this cost, i.e., the 'min-cost' solution, as

\bm{f}^{*}\in\arg\min_{\bm{f}}R(\bm{f})\quad\mathrm{s.t.}\quad{\bm{f}}(\bm{y}_{n,m})={\bm{x}}_{n}~~\forall n,m\,,  (14)

where the second equality in (13) holds due to the 1-homogeneity of the ReLU activation function, and since the bias terms are not regularized (see Savarese et al. (2019), Appendix A). In the next sections, we examine which function we obtain by minimizing the representation cost R(\bm{f}) in various settings.
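Because of the 1-homogeneity argument behind the second equality in (13), the cost of a given parameterization can be evaluated after balancing each unit; equivalently, the balanced cost equals \sum_{k}\|{\bm{a}}_{k}\|\|{\bm{w}}_{k}\|, which is invariant to per-unit rescaling. A short sketch (the helper and array conventions are our own):

import numpy as np

def representation_cost(W, A):
    """Balanced form of Eq. (13).

    W : (K, d) inner-layer weights w_k (rows)
    A : (d, K) outer-layer weights a_k (columns)

    Rescaling a_k -> a_k / c_k and w_k -> c_k * w_k leaves h_theta unchanged, and
    0.5 * (||a_k||^2 / c_k^2 + c_k^2 * ||w_k||^2) is minimized at ||a_k|| * ||w_k||,
    so the minimal C(theta) over such rescalings is sum_k ||a_k|| * ||w_k||.
    """
    return np.sum(np.linalg.norm(A, axis=0) * np.linalg.norm(W, axis=1))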

4 Closed form solution for the NN denoiser function — univariate data

In this section, we prove that NN denoisers that minimize R(f) for univariate data have the specific piecewise linear form observed in Figure 1, and we discuss the properties of this form. We observe N clean univariate data points \left\{x_{n}\right\}_{n=1}^{N}, s.t. -\infty<x_{1}<x_{2}<\cdots<x_{N}<\infty, and M noisy samples (drawn from some known distribution) for each clean data point, such that y_{n,m}=x_{n}+\epsilon_{n,m}. We denote by \epsilon^{\mathrm{max}}_{n} the maximal noise seen for data point x_{n}, and by \epsilon^{\mathrm{min}}_{n} the minimal noise seen for data point x_{n}, i.e.,

\epsilon^{\mathrm{max}}_{n}\equiv\max_{m}\epsilon_{n,m},\quad\epsilon^{\mathrm{min}}_{n}\equiv\min_{m}\epsilon_{n,m}\,,  (15)

and assume the following,

Assumption 1.

Assume the data \left\{x_{n}\right\}_{n=1}^{N} is well-separated after the addition of noise, i.e.,

\forall n\in[N-1]:\,x_{n}+\epsilon^{\mathrm{max}}_{n}<x_{n+1}+\epsilon^{\mathrm{min}}_{n+1}\,,  (16)

and \epsilon^{\mathrm{max}}_{n}>0,\ \epsilon^{\mathrm{min}}_{n}<0.

So we can state the following,

Proposition 1.

For all datasets such that Assumption 1 holds, the unique minimizer of R(f) is

f^{*}_{1D}(y)=\begin{cases}x_{1},&y<x_{1}+\epsilon^{\mathrm{min}}_{1}\\ x_{n},&x_{n}+\epsilon^{\mathrm{min}}_{n}\leq y\leq x_{n}+\epsilon^{\mathrm{max}}_{n}\\ \frac{x_{n+1}-x_{n}}{x_{n+1}+\epsilon^{\mathrm{min}}_{n+1}-\left(x_{n}+\epsilon^{\mathrm{max}}_{n}\right)}\left(y-\left(x_{n}+\epsilon^{\mathrm{max}}_{n}\right)\right)+x_{n},&x_{n}+\epsilon^{\mathrm{max}}_{n}<y<x_{n+1}+\epsilon^{\mathrm{min}}_{n+1}\\ x_{N},&y>x_{N}+\epsilon^{\mathrm{max}}_{N}\end{cases}\,.  (17)

The proof (which is based on Theorem 1.2 in Hanin (2021)) can be found in Appendix B.1. As can be seen from Figure 1, the empirical simulation matches Proposition 1.[3] Proposition 1 states that (17) is a closed-form solution for (8) with minimal representation cost. Notice that the minimal number of neurons needed to represent f^{*}_{1D} using h_{\theta}\left(y\right) is 2N-2, which is less than the total number of training samples NM for M\geq 2.
[3] Notice that the training points shown in Figure 1 are those used in the online setting (7); in the offline setting (8) we observe fewer noisy samples.
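A direct implementation of the closed form (17) is straightforward; the sketch below (assuming sorted clean points and well-separated clusters, per Assumption 1; the function name is ours) evaluates f^{*}_{1D} at a single input.

import numpy as np

def f_star_1d(y, x, eps_min, eps_max):
    """Min-cost univariate denoiser of Eq. (17).

    x        : sorted clean points x_1 < ... < x_N
    eps_min  : per-point minimal observed noise (negative values)
    eps_max  : per-point maximal observed noise (positive values)
    """
    lo, hi = x + eps_min, x + eps_max            # edges of each noisy cluster
    if y <= lo[0]:
        return x[0]
    if y >= hi[-1]:
        return x[-1]
    for n in range(len(x)):
        if lo[n] <= y <= hi[n]:
            return x[n]                          # constant on the n-th noisy cluster
        if n + 1 < len(x) and hi[n] < y < lo[n + 1]:
            slope = (x[n + 1] - x[n]) / (lo[n + 1] - hi[n])
            return x[n] + slope * (y - hi[n])    # linear interpolation between clusters
    raise ValueError("clusters must be well-separated (Assumption 1)")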

In the case of univariate data, we can prove that the representation cost minimizer f^{*}_{1D} (linear interpolation) generalizes better than the optimal estimator over the empirical distribution (eMMSE) for low noise levels.

Theorem 1.

Let y=x+\epsilon where x\sim p_{x}\left(x\right) and \epsilon\sim\mathcal{N}\left(0,\sigma^{2}\right), with x and \epsilon statistically independent. Then for all datasets such that Assumption 1 holds, and for all probability density functions p_{x}\left(x\right) with bounded second moment such that p_{x}\left(x\right)>0 for all x\in\left[\min_{n}{x_{n}},\max_{n}{x_{n}}\right], the following holds,

\lim_{\sigma\to 0^{+}}\mathrm{MSE}\left(\hat{x}^{\mathrm{eMMSE}}\left(y\right)\right)>\lim_{\sigma\to 0^{+}}\mathrm{MSE}\left(f^{*}_{1D}\left(y\right)\right)\,.

See Appendix B.2 for the proof. We may deduce from Theorem 1 that for each probability density p\left(x\right) there exists a critical noise level below which the representation cost minimizer f^{*}_{1D} has strictly lower MSE than the eMMSE (this is because the MSE is a continuous function of \sigma). The critical noise level can change significantly depending on p\left(x\right). For example, if p\left(x\right) has a high "mass" in between the training points, then the critical noise level is large. However, if the density function has a low "mass" between the training points, then the critical noise level is small. In Appendix D we show the MSE vs. the noise level of the NN denoiser and the eMMSE denoiser on MNIST (Figure 13). As can be seen there, the critical noise level in this case is not small (\sim 5).

Intuitively, the difference between the NN denoiser and the eMMSE denoiser is how they operate on inputs that are not close to any of the clean samples (relative to the noise standard deviation). For such a point, the eMMSE denoiser does not take into account that the empirical distribution of the clean samples does not approximate their true distribution well. Thus, for small noise, it insists on "assigning" the input to the closest clean sample point. By contrast, the NN denoiser generalizes better since it takes into account that, far from the clean samples, the data distribution is not well approximated by the empirical sample distribution. Thus, its operation there is near the identity function, with a small contraction toward the clean points, as we discuss next.

Minimal norm leads to contractive solutions on univariate data.

Radhakrishnan et al. (2018) have empirically shown that Auto-Encoders (AE, i.e., NN denoisers without input noise) are locally contractive toward the training samples. Specifically, they showed that the clean dataset can be recovered by iterating the AE output multiple times until convergence. Additionally, they showed that, as we increase the width or the depth of the NN, the network becomes more contractive toward the training examples. In addition, Radhakrishnan et al. (2018) proved that 2-layer AE models are locally contractive under strong assumptions (the weights of the input layer are fixed and the number of neurons goes to infinity). Next, we prove that a univariate shallow NN denoiser is globally contractive toward the clean data points without the assumptions used by Radhakrishnan et al. (2018) (i.e., our minimizer optimizes over both layers and has a finite number of neurons).

Definition 2.

We say that {\bm{f}}:\mathbb{R}^{d}\to\mathbb{R}^{d} is contractive toward a set of points \{\bm{x}_{n}\}_{n=1}^{N} on \mathcal{Y}\subseteq\mathbb{R}^{d} if there exists a real number 0\leq\alpha<1 such that for any \bm{y}\in\mathcal{Y} there exists i\in[N] so that

\left\|\bm{f}\left(\bm{y}\right)-\bm{f}\left(\bm{x}_{i}\right)\right\|\leq\alpha\left\|\bm{y}-\bm{x}_{i}\right\|\,.  (18)
Lemma 1.

f^{*}_{1D}\left(y\right) is contractive toward the clean training points \{\bm{x}_{n}\}_{n=1}^{N} on \mathcal{Y}=\mathbb{R}\setminus\bigcup_{n\in[N-1]}\Bigl\{\frac{x_{n+1}\epsilon^{\mathrm{max}}_{n}-x_{n}\epsilon^{\mathrm{min}}_{n+1}}{\epsilon^{\mathrm{max}}_{n}-\epsilon^{\mathrm{min}}_{n+1}}\Bigr\}.

The proof can be found in Appendix B.3.
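Definition 2 can also be checked numerically. The sketch below builds f^{*}_{1D} by piecewise-linear interpolation (assuming, for simplicity, symmetric noise extremes) and reports the worst ratio of Definition 2 over a grid; the specific points and radius are arbitrary choices of ours.

import numpy as np

# Sorted clean points and a symmetric noise radius; the noisy clusters are well-separated.
x = np.array([-3.0, 0.0, 2.5])
rho = 0.4
knots = np.sort(np.concatenate([x - rho, x + rho]))
vals = np.repeat(x, 2)                            # f* equals x_n on the n-th cluster

def f_star(y):                                    # Eq. (17) as piecewise-linear interpolation
    return np.interp(y, knots, vals)

# Definition 2: for every y there should be some clean x_i with |f(y) - x_i| <= alpha |y - x_i|.
worst = 0.0
for y in np.linspace(-6.0, 6.0, 2000):
    ratios = [abs(f_star(y) - xi) / max(abs(y - xi), 1e-12) for xi in x]
    worst = max(worst, min(ratios))
print("worst contraction ratio over the grid:", worst)   # below 1; it approaches 1 only near
                                                         # the points excluded in Lemma 1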

5 Minimal norm leads to alignment phenomenon on multivariate data

In the multivariate case, min-cost solutions of (14) are difficult to explicitly characterize. Even in the setting of fitting scalar-valued shallow ReLU networks, explicitly characterizing min-cost solutions under interpolation constraints remains an open problem, except in some basic cases (e.g., co-linear data (Ergen and Pilanci, 2021)).

As an approximation to (14) that is more mathematically tractable, we assume the functions being fit are constant and equal to {\bm{x}}_{n} on a closed ball of radius \rho centered at each {\bm{x}}_{n}, i.e., \bm{f}({\bm{y}})={\bm{x}}_{n} for all \|{\bm{y}}-{\bm{x}}_{n}\|\leq\rho, such that the balls do not overlap. Letting B({\bm{x}}_{n},\rho) denote the ball of radius \rho centered at {\bm{x}}_{n}, we can write this constraint more compactly as \bm{f}(B({\bm{x}}_{n},\rho))=\{{\bm{x}}_{n}\}. Consider minimizing the representation cost under this constraint:

\min_{\bm{f}}R(\bm{f})\quad\mathrm{s.t.}\quad{\bm{f}}(B({\bm{x}}_{n},\rho))=\{{\bm{x}}_{n}\}~~\forall n\in[N]\,.  (19)

However, even with this approximation, explicitly describing minimizers of (19) for an arbitrary collection of training samples remains challenging. Instead, to gain intuition, we describe minimizers of (19) assuming the training samples belong to simple geometric structures that yield explicit solutions. Our results reveal a general alignment phenomenon, such that the weights of the representation cost minimizer align themselves with edges and/or faces connecting data points. We also show that approximate solutions of (14) obtained numerically by training a NN denoiser with weight decay match well with the solutions of (19) having exact closed-form expressions.

5.1 Training data on a subspace

In the event that the clean training samples belong to a subspace, we show the representation cost minimizer depends only on the projection of the inputs onto the subspace containing the training data, and its output is also constrained to this subspace.

Theorem 2.

Assume the training samples \{{\bm{x}}_{n}\}_{n=1}^{N} belong to a linear subspace {\mathcal{S}}\subset\mathbb{R}^{d}, and let {\bm{P}}_{\mathcal{S}}\in\mathbb{R}^{d\times d} denote the orthogonal projector onto {\mathcal{S}}. Then any minimizer \bm{f}^{*} of (19) satisfies \bm{f}^{*}({\bm{y}})={\bm{P}}_{\mathcal{S}}\bm{f}^{*}({\bm{P}}_{\mathcal{S}}{\bm{y}}) for all {\bm{y}}\in\mathbb{R}^{d}.

The proof of this result and all others in this section is given in Appendix C.

Note that the assumption that the dataset lies on a subspace is practically relevant, since, in general, large datasets are (approximately) low rank, i.e., lie on a linear subspace (Udell and Townsend, 2019). In Appendix D we also validate that common image datasets are (approximately) low rank (Table 1).

Specializing to the case of co-linear training data (i.e., training samples belonging to a one-dimensional subspace), the min-cost solution is unique and is described by the following corollary:

Corollary 1.

Assume the training samples \{{\bm{x}}_{n}\}_{n=1}^{N} are co-linear, i.e., {\bm{x}}_{n}=c_{n}{\bm{u}} for some scalars c_{1}<c_{2}<\cdots<c_{N}, where {\bm{u}}\in\mathbb{R}^{d} is a unit vector. Then the minimizer \bm{f}^{*} of (19) is unique and given by \bm{f}^{*}({\bm{y}})={\bm{u}}\phi({\bm{u}}^{\top}{\bm{y}}), where \phi:\mathbb{R}\rightarrow\mathbb{R} has the same form as the 1-D minimizer (17) f^{*}_{1D} with x_{n}=c_{n} and \epsilon_{n}^{\mathrm{max}}=-\epsilon_{n}^{\mathrm{min}}=\rho.

In other words, the min-cost solution has a particularly simple form in this case: \bm{f}^{*}({\bm{y}})={\bm{u}}\phi({\bm{u}}^{\top}{\bm{y}}), where \phi is a monotonic piecewise linear function. We call any function of this form a rank-one piecewise linear interpolator. Below we show that many other min-cost solutions can be expressed as superpositions of rank-one piecewise linear interpolators.

5.2 Training data on rays

As an extension of the previous setting, we now consider training data belonging to a union of one-sided rays sharing a common origin. Assuming the rays are well-separated (in a sense made precise below) we prove that the representation cost minimizer decomposes into a sum of rank-one piecewise linear interpolators aligned with each ray.

Theorem 3.

Suppose the training samples X belong to a union of L rays plus a sample at the origin: X=\{\bm{0}\}\cup\{{\bm{x}}_{n}^{(1)}\}_{n=1}^{N_{1}}\cup\cdots\cup\{{\bm{x}}_{n}^{(L)}\}_{n=1}^{N_{L}}, where {\bm{x}}_{n}^{(\ell)}=c^{(\ell)}_{n}{\bm{u}}_{\ell} for some unit vector {\bm{u}}_{\ell} and constants 0<c^{(\ell)}_{1}<c^{(\ell)}_{2}<\cdots<c^{(\ell)}_{N_{\ell}}. Assume that the rays make obtuse angles with each other (i.e., {\bm{u}}_{\ell}^{\top}{\bm{u}}_{k}<0 for all \ell\neq k). Then the minimizer \bm{f}^{*} of (19) is unique and is given by

\bm{f}^{*}({\bm{y}})={\bm{u}}_{1}\phi_{1}({\bm{u}}_{1}^{\top}{\bm{y}})+\cdots+{\bm{u}}_{L}\phi_{L}({\bm{u}}_{L}^{\top}{\bm{y}})\,,  (20)

where \phi_{\ell}:\mathbb{R}\rightarrow\mathbb{R} has the form of the 1-D minimizer (17) f^{*}_{1D} with x_{n}=c_{n}^{(\ell)}, \epsilon_{n}^{\mathrm{max}}=-\epsilon_{n}^{\mathrm{min}}=\rho.
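The structural form (20) is easy to instantiate: each ray contributes a rank-one piecewise-linear interpolator whose 1-D profile is built by the same interpolation rule as (17). In the sketch below (ours, including the example geometry) we include the origin among each profile's knots, which accounts for the sample at the origin and is consistent with Proposition 2 below.

import numpy as np

def make_phi(c, rho):
    """1-D profile of the form (17) with knots c (sorted) and eps_max = -eps_min = rho."""
    c = np.sort(np.asarray(c, dtype=float))
    knots = np.stack([c - rho, c + rho], axis=1).ravel()   # c_1-rho, c_1+rho, c_2-rho, ...
    return lambda t: np.interp(t, knots, np.repeat(c, 2))

def rank_one_sum(y, U, profiles):
    """Evaluate f(y) = sum_l u_l * phi_l(u_l^T y), the structure of Eq. (20)."""
    return sum(U[l] * profiles[l](U[l] @ y) for l in range(len(U)))

# Example: two obtuse rays in R^2 (u_1^T u_2 < 0), one sample per ray plus the origin.
U = np.array([[1.0, 0.0], [-0.6, 0.8]])
profiles = [make_phi([0.0, 3.0], rho=0.2), make_phi([0.0, 2.0], rho=0.2)]
print(rank_one_sum(np.array([2.0, 0.5]), U, profiles))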

Figure 2: Predicted (top row) and empirical (bottom row) min-cost NN denoisers for N=3 clean training samples in d=2 dimensions. The empirical NN denoisers were trained with weight decay parameter \lambda=10^{-5} and M=100 noisy samples. As predicted by our theory, the ReLU boundaries align either perpendicular to the triangle edges in the obtuse case (left panel), or parallel to the triangle edges (right panel).

Additionally, in Appendix C.2.1 we show that this min-cost solution is stable with respect to small perturbations of the data. In particular, if the training data is perturbed from the rays, the functional form of the min-cost solution only changes slightly, such that the inner- and outer-layer weight vectors align with the line segments connecting consecutive data points.

5.3 Special case: training data forming a simplex

Here, we study the representation cost minimizers for N\leq d+1 training points that form the vertices of a (N-1)-simplex, i.e., a (N-1)-dimensional simplex in \mathbb{R}^{d} (e.g., a 2-simplex is a triangle, a 3-simplex is a tetrahedron, etc.). As we will show, the angles between vertices of the simplex (e.g., an acute versus obtuse triangle for N=3) influence the functional form of the min-cost solution.

Our first result considers one extreme, where the simplex has one vertex that makes an obtuse angle with all other vertices (e.g., an obtuse triangle for N=3).

Proposition 2.

Suppose the convex hull of the training points \{{\bm{x}}_{1},{\bm{x}}_{2},...,{\bm{x}}_{N}\}\subset\mathbb{R}^{d} is a (N-1)-simplex such that {\bm{x}}_{1} forms an obtuse angle with all other vertices, i.e., ({\bm{x}}_{j}-{\bm{x}}_{1})^{\top}({\bm{x}}_{i}-{\bm{x}}_{1})<0 for all i\neq j with i,j>1. Then the minimizer \bm{f}^{*} of (19) is unique, and is given by

\bm{f}^{*}({\bm{y}})={\bm{x}}_{1}+\sum_{n=2}^{N}{\bm{u}}_{n}\phi_{n}({\bm{u}}_{n}^{\top}({\bm{y}}-{\bm{x}}_{1}))  (21)

where {\bm{u}}_{n}=\frac{{\bm{x}}_{n}-{\bm{x}}_{1}}{\|{\bm{x}}_{n}-{\bm{x}}_{1}\|}, \phi_{n}(t)=s_{n}([t-a_{n}]_{+}-[t-b_{n}]_{+}), with a_{n}=\rho, b_{n}=\|{\bm{x}}_{n}-{\bm{x}}_{1}\|-\rho, and s_{n}=\|{\bm{x}}_{n}-{\bm{x}}_{1}\|/(b_{n}-a_{n}) for all n=2,...,N.

This result is essentially a corollary of Theorem 3, since after translating 𝒙1{\bm{x}}_{1} to be the origin, the vertices of the simplex belong to rays making obtuse angles with each other, where there is exactly one sample per ray. Details of the proof are given in Appendix C.2.
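The construction (21) is simple to evaluate: each edge from x_1 contributes a clipped ramp along its unit direction. A minimal sketch (helper name and conventions ours):

import numpy as np

def obtuse_simplex_denoiser(y, X, rho):
    """Min-cost denoiser of Eq. (21); X[0] is the vertex x_1 that is obtuse to all others."""
    x1 = X[0]
    out = x1.copy()
    for xn in X[1:]:
        dist = np.linalg.norm(xn - x1)
        u = (xn - x1) / dist                         # unit vector along the edge x_1 -> x_n
        a, b = rho, dist - rho
        s = dist / (b - a)
        t = u @ (y - x1)
        out = out + u * s * (np.clip(t, a, b) - a)   # equals s * ([t-a]_+ - [t-b]_+)
    return out

Note that at y = x_n the ramp along the n-th edge saturates at \|{\bm{x}}_{n}-{\bm{x}}_{1}\| while, by obtuseness, all other ramps are still zero, so the sketch indeed returns x_n at each training point.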

At the opposite extreme, we consider the case where every vertex of the simplex is acute, meaning for all n=1,...,N we have ({\bm{x}}_{i}-{\bm{x}}_{n})^{\top}({\bm{x}}_{j}-{\bm{x}}_{n})>0 for all i,j\neq n. In this case, we make the following conjecture: the min-cost solution is instead a sum of N rank-one piecewise linear interpolators, each aligned orthogonal to a different (N-2)-dimensional face of the simplex.

Conjecture 1.

Suppose the convex hull of the training points \{{\bm{x}}_{1},{\bm{x}}_{2},...,{\bm{x}}_{N}\}\subset\mathbb{R}^{d} is a (N-1)-simplex where every vertex of the simplex is acute. Then the minimizer {\bm{f}}^{*} of (19) is unique, and is given by

\bm{f}^{*}({\bm{y}})=\overline{{\bm{x}}}+\sum_{n=1}^{N}{\bm{v}}_{n}\phi_{n}({\bm{u}}_{n}^{\top}({\bm{y}}-{\bm{z}}_{n}))  (22)

where: {\bm{z}}_{n} is the projection of {\bm{x}}_{n} onto the unique (N-2)-dimensional face of the simplex not containing {\bm{x}}_{n}; \overline{{\bm{x}}} is the weighted geometric median of the vertices specified by

\overline{{\bm{x}}}=\operatorname*{arg\,min}_{{\bm{x}}\in\mathbb{R}^{d}}\sum_{n=1}^{N}\frac{\|{\bm{x}}_{n}-{\bm{x}}\|}{\|{\bm{x}}_{n}-{\bm{z}}_{n}\|};

{\bm{u}}_{n}=\frac{{\bm{x}}_{n}-{\bm{z}}_{n}}{\|{\bm{x}}_{n}-{\bm{z}}_{n}\|}, {\bm{v}}_{n}=\frac{{\bm{x}}_{n}-\overline{{\bm{x}}}}{\|{\bm{x}}_{n}-\overline{{\bm{x}}}\|}; and \phi_{n}(t)=s_{n}([t-a_{n}]_{+}-[t-b_{n}]_{+}) with a_{n}=\rho, b_{n}=\|{\bm{x}}_{n}-{\bm{z}}_{n}\|-\rho, and s_{n}=\|{\bm{x}}_{n}-\overline{{\bm{x}}}\|/(b_{n}-a_{n}).
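The two geometric ingredients of Conjecture 1, the projections z_n onto the opposite faces and the weighted geometric median \overline{{\bm{x}}}, can be computed as below. The Weiszfeld-style fixed-point iteration is a standard way to approximate a weighted geometric median; it is our choice of solver, not an algorithm from the paper.

import numpy as np

def project_onto_face(xn, F):
    """Orthogonal projection of xn onto the affine hull of the rows of F."""
    base, D = F[0], (F[1:] - F[0]).T              # affine basis of the face
    if D.size == 0:
        return base.copy()
    coef, *_ = np.linalg.lstsq(D, xn - base, rcond=None)
    return base + D @ coef

def weighted_geometric_median(X, weights, iters=200):
    """Weiszfeld-style iteration for argmin_x sum_n weights[n] * ||X[n] - x||."""
    x = X.mean(axis=0)
    for _ in range(iters):
        d = np.maximum(np.linalg.norm(X - x, axis=1), 1e-12)
        w = weights / d
        x = (w[:, None] * X).sum(axis=0) / w.sum()
    return x

# Acute triangle in R^2: compute z_n, the weights 1/||x_n - z_n||, and the median of (22).
X = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 1.6]])
Z = np.array([project_onto_face(X[n], np.delete(X, n, axis=0)) for n in range(len(X))])
x_bar = weighted_geometric_median(X, 1.0 / np.linalg.norm(X - Z, axis=1))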

Justification for this conjecture is given in Appendix C.3.2. In particular, we prove that the interpolator {\bm{f}}^{*} given in (86) is a min-cost solution in the special case of three training points whose convex hull is an equilateral triangle. If true in general, this would imply a phase transition behavior in the min-cost solution when the simplex changes from having one obtuse vertex to all acute vertices, such that ReLU boundaries go from being aligned orthogonal to the edges connecting vertices, to being aligned parallel with the simplex faces. Figure 2 illustrates this for N=3 training points forming a triangle in d=2 dimensions. Moreover, Figure 2 shows that the empirical minimizer obtained using noisy samples and weight decay regularization agrees well with the form of the exact min-cost solution predicted by Proposition 2 and Conjecture 1.

In general, any given vertex of a simplex may make acute angles with some vertices and obtuse angles with others. This case is not covered by the above results. Currently, we do not have a conjectured form of the min-cost solution in this case, and we leave this as an open problem for future work.

6 Related works

Numerous methods have been proposed for image denoising. In the last decade, NN-based methods have achieved state-of-the-art results (Zhang et al., 2017; 2021). See Elad et al. (2023) for a comprehensive review of image denoising. Sonthalia and Nadakuditi (2023) empirically showed a double descent behavior in NN denoisers, and theoretically proved it in a linear model. Similar to a denoiser, an Auto-Encoder (AE) is a NN model whose output dimension equals its input dimension, and which is trained to match the output to the input. For an AE, the typical goal is to learn an efficient lower-dimensional representation of the samples. Radhakrishnan et al. (2018) proved that a single hidden-layer AE that interpolates the training data (i.e., achieves zero loss) projects the input onto a nonlinear span of the training data. In addition, Radhakrishnan et al. (2018) empirically demonstrated that a multi-layer ReLU AE is locally contractive toward the training samples by iterating the AE and showing that the points converge to one of the training samples. Denoising autoencoders inject noise into the input data in order to learn a good representation (Alain and Bengio, 2014). The marginalized denoising autoencoder, proposed by Chen et al. (2014), approximates the loss marginalized over the noise (which is equivalent to observing infinitely many noisy samples) using a Taylor approximation. Chen et al. (2014) demonstrated that by using the approximate marginalized loss we can achieve a substantial speedup in training and improved representations compared to a standard denoising AE.

Many recent works aim to characterize function-space properties of interpolating NNs with minimal representation cost (i.e., min-cost solutions). Building on the connection between weight decay and path-norm regularization identified in Neyshabur et al. (2015, 2017), Savarese et al. (2019) showed that the representation cost of a function realizable as a univariate two-layer ReLU network coincides with the L^{1}-norm of the second derivative of the function. Extensions to the multivariate setting were studied in Ongie et al. (2020), which identified the representation cost of a multivariate function with its R-norm, a Banach space semi-norm defined in terms of the Radon transform. Related work has extended the R-norm to other activation functions (Parhi and Nowak, 2021), vector-valued networks (Shenouda et al., 2023), and deeper architectures (Parhi and Nowak, 2022). A separate line of research studies min-cost solutions from a convex duality perspective (Ergen and Pilanci, 2021), including two-layer CNN denoising AEs (Sahiner et al., 2021). Recent work also studies properties of min-cost solutions in the case of arbitrarily deep NNs with ReLU activation (Jacot, 2022; Jacot et al., 2022).

Despite these advances in understanding min-cost solutions, there are few results explicitly characterizing their functional form. One exception is Hanin (2021), which gives a complete characterization of min-cost solutions in the case of shallow univariate ReLU networks with unregularized bias. This characterization is possible because the univariate representation cost is defined in terms of the second derivative, which acts locally. Therefore, global minimizers can be found by minimizing the representation cost locally over intervals between data points. An extension of these results to the case of regularized bias is studied in Boursier and Flammarion (2023). In the multivariate setting, the representation cost involves the Radon transform of the function, a highly non-local operation, which complicates the analysis. Parhi and Nowak (2021) prove a representer theorem showing that there always exists a min-cost solution realizable as a shallow ReLU network with finitely many neurons, and Ergen and Pilanci (2021) give an implicit characterization of min-cost NNs as the solution to a convex optimization problem, together with explicit solutions in the case of co-linear training features. However, to the best of our knowledge, there are no results explicitly characterizing min-cost solutions in the case of non-colinear multivariate inputs, even for networks having scalar outputs.

7 Discussions

Conclusions.

We have explored the elementary properties of NN solutions to the denoising problem, focusing on offline training of a one-hidden-layer ReLU network. When the noisy clusters of the data samples are well-separated, there are multiple networks with zero loss, each with a different representation cost, and this holds even in the under-parameterized case. In contrast, previous theoretical works focused on the over-parameterized regime. In the univariate case, we derived a closed-form solution to such global minima with minimum representation cost. We also showed that the univariate NN solution generalizes better than the eMMSE denoiser. In the multivariate case, we showed that the interpolating solution with minimal representation cost is aligned with the edges and/or faces connecting the clean data points in several cases.

Limitations.

One limitation of our analysis in the multivariate case is that we assume the denoiser interpolates data on a full d-dimensional ball centered at each clean training sample, where d is the input dimension. In practical settings, often the number of noisy samples M\ll d. A more accurate model would be to assume that the denoiser interpolates over an (M-1)-dimensional disc centered at each training sample. This model may still be a tractable alternative to assuming interpolation of finitely many noisy samples. Also, our results relate to NN denoisers trained with explicit weight decay regularization, which is not always used in practice. However, recent work shows that stable minimizers of SGD must have low representation cost (Mulayoff et al., 2021; Nacson et al., 2023), so some of our analysis may provide insight for unregularized training as well. Finally, for mathematical tractability, we focused on the case of fully-connected ReLU networks with one hidden layer. Extending our analysis to deeper architectures and convolutional neural networks is an important direction for future work.

Acknowledgments

We thank Itay Hubara for his technical advice and valuable comments on the manuscript. The research of DS was funded by the European Union (ERC, A-B-C-Deep, 101039436). Views and opinions expressed are however those of the author only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency (ERCEA). Neither the European Union nor the granting authority can be held responsible for them. DS also acknowledges the support of the Schmidt Career Advancement Chair in AI. The research of NW was supported by the Israel Science Foundation (ISF), grant no. 1782/22. GO was supported by the National Science Foundation (NSF) CRII award CCF-2153371.

References

  • Alain and Bengio [2014] Guillaume Alain and Yoshua Bengio. What regularized auto-encoders learn from the data-generating distribution. The Journal of Machine Learning Research, 15(1):3563–3593, 2014.
  • Boursier and Flammarion [2023] Etienne Boursier and Nicolas Flammarion. Penalising the biases in norm regularisation enforces sparsity. arXiv preprint arXiv:2303.01353, 2023.
  • Chen et al. [2014] Minmin Chen, Kilian Weinberger, Fei Sha, and Yoshua Bengio. Marginalized denoising auto-encoders for nonlinear representations. In Eric P. Xing and Tony Jebara, editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 1476–1484, Bejing, China, 22–24 Jun 2014. PMLR. URL https://proceedings.mlr.press/v32/cheng14.html.
  • Elad et al. [2023] Michael Elad, Bahjat Kawar, and Gregory Vaksman. Image denoising: The deep learning revolution and beyond—a survey paper. SIAM Journal on Imaging Sciences, 16(3):1594–1654, 2023. doi: 10.1137/23M1545859. URL https://doi.org/10.1137/23M1545859.
  • Ergen and Pilanci [2021] Tolga Ergen and Mert Pilanci. Convex geometry and duality of over-parameterized neural networks. The Journal of Machine Learning Research, 22(1):9646–9708, 2021.
  • Hanin [2021] Boris Hanin. Ridgeless interpolation with shallow relu networks in 1d is nearest neighbor curvature extrapolation and provably generalizes on lipschitz functions. arXiv preprint arXiv:2109.12960, 2021.
  • Hasinoff et al. [2010] Samuel W Hasinoff, Frédo Durand, and William T Freeman. Noise-optimal capture for high dynamic range photography. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 553–560. IEEE, 2010.
  • Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
  • Horrace [2015] William C Horrace. Moments of the truncated normal distribution. Journal of Productivity Analysis, 43:133–138, 2015.
  • Jacot [2022] Arthur Jacot. Implicit bias of large depth networks: a notion of rank for nonlinear functions. In International Conference on Learning Representations (ICLR), 2022.
  • Jacot et al. [2022] Arthur Jacot, Eugene Golikov, Clement Hongler, and Franck Gabriel. Feature learning in l_2-regularized dnns: Attraction/repulsion and sparsity. In Advances in Neural Information Processing Systems (NeurIPS), volume 35, pages 6763–6774, 2022.
  • Manjunath and Wilhelm [2021] B. G. Manjunath and Stefan Wilhelm. Moments calculation for the doubly truncated multivariate normal density. Journal of Behavioral Data Science, 1(1):17–33, May 2021. doi: 10.35566/jbds/v1n1/p2. URL https://jbds.isdsa.org/index.php/jbds/article/view/9.
  • Mulayoff et al. [2021] Rotem Mulayoff, Tomer Michaeli, and Daniel Soudry. The implicit bias of minima stability: A view from function space. Advances in Neural Information Processing Systems (NeurIPS), 34:17749–17761, 2021.
  • Nacson et al. [2023] Mor Shpigel Nacson, Rotem Mulayoff, Greg Ongie, Tomer Michaeli, and Daniel Soudry. The implicit bias of minima stability in multivariate shallow relu networks. In International Conference on Learning Representations (ICLR), 2023.
  • Neyshabur et al. [2015] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In Conference on Learning Theory (COLT), pages 1376–1401. PMLR, 2015.
  • Neyshabur et al. [2017] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. Advances in Neural Information Processing Systems (NeurIPS), 30, 2017.
  • Ongie et al. [2020] Greg Ongie, Rebecca Willett, Daniel Soudry, and Nathan Srebro. A function space view of bounded norm infinite width relu nets: The multivariate case. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=H1lNPxHKDH.
  • Parhi and Nowak [2021] Rahul Parhi and Robert D Nowak. Banach space representer theorems for neural networks and ridge splines. Journal of Machine Learning Research, 22(1):1960–1999, 2021.
  • Parhi and Nowak [2022] Rahul Parhi and Robert D Nowak. What kinds of functions do deep neural networks learn? insights from variational spline theory. SIAM Journal on Mathematics of Data Science, 4(2):464–489, 2022.
  • Radhakrishnan et al. [2018] Adityanarayanan Radhakrishnan, Karren Yang, Mikhail Belkin, and Caroline Uhler. Memorization in overparameterized autoencoders. arXiv preprint arXiv:1810.10333, 2018.
  • Raya and Ambrogioni [2023] Gabriel Raya and Luca Ambrogioni. Spontaneous symmetry breaking in generative diffusion models. arXiv preprint arXiv:2305.19693, 2023.
  • Roth and Black [2009] Stefan Roth and Michael J. Black. Fields of experts. International Journal of Computer Vision, 82(2):205–229, 2009. doi: 10.1007/s11263-008-0197-6. URL https://doi.org/10.1007/s11263-008-0197-6.
  • Sahiner et al. [2021] Arda Sahiner, Morteza Mardani, Batu Ozturkler, Mert Pilanci, and John M. Pauly. Convex regularization behind neural reconstruction. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=VErQxgyrbfn.
  • Savarese et al. [2019] Pedro Savarese, Itay Evron, Daniel Soudry, and Nathan Srebro. How do infinite width bounded norm networks look in function space? In Conference on Learning Theory, pages 2667–2690. PMLR, 2019.
  • Shenouda et al. [2023] Joseph Shenouda, Rahul Parhi, Kangwook Lee, and Robert D Nowak. Vector-valued variation spaces and width bounds for DNNs: Insights on weight decay regularization. arXiv preprint arXiv:2305.16534, 2023.
  • Song and Ermon [2019] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 32, 2019.
  • Sonthalia and Nadakuditi [2023] Rishi Sonthalia and Raj Rao Nadakuditi. Training data size induced double descent for denoising feedforward neural networks and the role of training noise. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=FdMWtpVT1I.
  • Udell and Townsend [2019] Madeleine Udell and Alex Townsend. Why are big data matrices approximately low rank? SIAM Journal on Mathematics of Data Science, 1(1):144–160, 2019.
  • Zhang et al. [2017] Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang. Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising. IEEE transactions on image processing, 26(7):3142–3155, 2017.
  • Zhang et al. [2021] Kai Zhang, Yawei Li, Wangmeng Zuo, Lei Zhang, Luc Van Gool, and Radu Timofte. Plug-and-play image restoration with deep denoiser prior. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10):6360–6376, 2021.

Appendix A Marginalized loss

In this section, we derive the marginalized loss for the case of a one-hidden-layer ReLU neural network. The loss function is

\mathcal{L}\left(\bm{\theta},\sigma\right) =\mathbb{E}_{\bm{x},\bm{y}}\left\|\sum_{k=1}^{K}\bm{a}_{k}[\bm{w}_{k}^{\top}\bm{y}]_{+}-\bm{x}\right\|^{2}
=\mathbb{E}_{\bm{x},\bm{y}}\left[\sum_{i=1}^{K}\sum_{j=1}^{K}\bm{a}_{i}^{\top}\bm{a}_{j}[\bm{w}_{i}^{\top}\bm{y}]_{+}[\bm{w}_{j}^{\top}\bm{y}]_{+}-2\sum_{i=1}^{K}\bm{x}^{\top}\bm{a}_{i}[\bm{w}_{i}^{\top}\bm{y}]_{+}+\left\|\bm{x}\right\|^{2}\right]
=\sum_{i=1}^{K}\sum_{j=1}^{K}\bm{a}_{i}^{\top}\bm{a}_{j}\mathbb{E}_{\bm{x},\bm{y}}\left[[\bm{w}_{i}^{\top}\bm{y}]_{+}[\bm{w}_{j}^{\top}\bm{y}]_{+}\right]-2\sum_{i=1}^{K}\mathbb{E}_{\bm{x},\bm{y}}\left[\bm{x}^{\top}\bm{a}_{i}[\bm{w}_{i}^{\top}\bm{y}]_{+}\right]+\mathbb{E}\left\|\bm{x}\right\|^{2}
=\sum_{i=1}^{K}\sum_{j=1}^{K}\bm{a}_{i}^{\top}\bm{a}_{j}\mathbb{E}_{\bm{x}}\left[\mathbb{E}_{\bm{y}|\bm{x}}\left[[\bm{w}_{i}^{\top}\bm{y}]_{+}[\bm{w}_{j}^{\top}\bm{y}]_{+}\right]\right]-2\sum_{i=1}^{K}\mathbb{E}_{\bm{x}}\left[\mathbb{E}_{\bm{y}|\bm{x}}\left[\bm{x}^{\top}\bm{a}_{i}[\bm{w}_{i}^{\top}\bm{y}]_{+}\right]\right]+\mathbb{E}\left\|\bm{x}\right\|^{2}
=\sum_{i=1}^{K}\sum_{j=1}^{K}\bm{a}_{i}^{\top}\bm{a}_{j}\mathbb{E}_{\bm{x}}\left[\mathbb{E}_{\bm{y}|\bm{x}}\left[[\bm{w}_{i}^{\top}\bm{y}]_{+}\right]\mathbb{E}_{\bm{y}|\bm{x}}\left[[\bm{w}_{j}^{\top}\bm{y}]_{+}\right]\right]-2\sum_{i=1}^{K}\mathbb{E}_{\bm{x}}\left[\mathbb{E}_{\bm{y}|\bm{x}}\left[\bm{x}^{\top}\bm{a}_{i}[\bm{w}_{i}^{\top}\bm{y}]_{+}\right]\right]+\mathbb{E}\left\|\bm{x}\right\|^{2}
\quad+\sum_{i=1}^{K}\sum_{j=1}^{K}\bm{a}_{i}^{\top}\bm{a}_{j}\mathbb{E}_{\bm{x}}\left[\mathbb{E}_{\bm{y}|\bm{x}}\left[[\bm{w}_{i}^{\top}\bm{y}]_{+}[\bm{w}_{j}^{\top}\bm{y}]_{+}\right]\right]-\sum_{i=1}^{K}\sum_{j=1}^{K}\bm{a}_{i}^{\top}\bm{a}_{j}\mathbb{E}_{\bm{x}}\left[\mathbb{E}_{\bm{y}|\bm{x}}\left[[\bm{w}_{i}^{\top}\bm{y}]_{+}\right]\mathbb{E}_{\bm{y}|\bm{x}}\left[[\bm{w}_{j}^{\top}\bm{y}]_{+}\right]\right]
=\mathbb{E}_{\bm{x}}\left\|\sum_{i=1}^{K}\bm{a}_{i}\tilde{\phi}\left(\bm{w}_{i}^{\top}\bm{x}\right)-\bm{x}\right\|^{2}+\sum_{i=1}^{K}\sum_{j=1}^{K}\bm{a}_{i}^{\top}\bm{a}_{j}\mathbb{E}_{\bm{x}}\left[\mathbb{E}_{\bm{y}|\bm{x}}\left[[\bm{w}_{i}^{\top}\bm{y}]_{+}[\bm{w}_{j}^{\top}\bm{y}]_{+}\right]\right]
\quad-\sum_{i=1}^{K}\sum_{j=1}^{K}\bm{a}_{i}^{\top}\bm{a}_{j}\mathbb{E}_{\bm{x}}\left[\mathbb{E}_{\bm{y}|\bm{x}}\left[[\bm{w}_{i}^{\top}\bm{y}]_{+}\right]\mathbb{E}_{\bm{y}|\bm{x}}\left[[\bm{w}_{j}^{\top}\bm{y}]_{+}\right]\right]\,.

We denote by

𝐇ij(𝒘i,𝒘j,σ2)=𝔼𝒙[𝔼𝒚|𝒙[[𝒘i𝒚]+[𝒘j𝒚]+]𝔼𝒚|𝒙[[𝒘i𝒚]+]𝔼𝒚|𝒙[[𝒘j𝒚]+]].\displaystyle\mathbf{H}_{ij}\left(\bm{w}_{i},\bm{w}_{j},\sigma^{2}\right)=\mathbb{E}_{\bm{x}}\biggl{[}\mathbb{E}_{\bm{y}|\bm{x}}\left[[\bm{w}_{i}^{\top}\bm{y}]_{+}[\bm{w}_{j}^{\top}\bm{y}]_{+}\right]-\mathbb{E}_{\bm{y}|\bm{x}}\left[[\bm{w}_{i}^{\top}\bm{y}]_{+}\right]\mathbb{E}_{\bm{y}|\bm{x}}\left[[\bm{w}_{j}^{\top}\bm{y}]_{+}\right]\biggr{]}\,.

Note that 𝐇0\mathbf{H}\succeq 0 since 𝐇\mathbf{H} is a covariance matrix. Thus we get,

(𝜽,σ)=𝔼𝒙k=1K𝒂kϕ~(𝒘^k𝒙,𝒘k,σ)𝒙2+i=1Kj=1K𝒂i𝒂j𝐇ij(𝒘i,𝒘j,σ2).\displaystyle\mathcal{L}\left(\bm{\theta},\sigma\right)=\mathbb{E}_{\bm{x}}\left\|\sum_{k=1}^{K}\bm{a}_{k}\tilde{\phi}\left(\hat{\bm{w}}_{k}^{\top}\bm{x},\left\|\bm{w}_{k}\right\|,\sigma\right)-\bm{x}\right\|^{2}+\sum_{i=1}^{K}\sum_{j=1}^{K}\bm{a}_{i}^{\top}\bm{a}_{j}\mathbf{H}_{ij}\left(\bm{w}_{i},\bm{w}_{j},\sigma^{2}\right)\,.
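As a sanity check, the decomposition above can be verified numerically. The following is a minimal sketch we added (not part of the paper's experiments) for a single-unit univariate network with Gaussian noise: it estimates the marginalized loss directly by Monte Carlo and compares it with the clean-input term plus the a^2 H variance term, with the conditional moments also estimated by Monte Carlo. All names and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
a, w, sigma = 0.7, 1.3, 0.5                          # single-unit weights and noise level
x = rng.uniform(-2.0, 2.0, size=20000)               # clean samples x ~ p_x
y = x + sigma * rng.standard_normal((200, x.size))   # 200 noisy copies per clean sample
relu = np.maximum(w * y, 0.0)

# Direct Monte-Carlo estimate of the marginalized loss E_{x,y}(a [w y]_+ - x)^2
loss = np.mean((a * relu - x) ** 2)

# Decomposition: E_x (a * phi_tilde(x) - x)^2 + a^2 * E_x Var_{y|x}([w y]_+)
phi_tilde = relu.mean(axis=0)       # conditional mean E_{y|x}[w y]_+ (MC estimate)
h_term = relu.var(axis=0).mean()    # H_11 = E_x Var_{y|x}([w y]_+)
decomposed = np.mean((a * phi_tilde - x) ** 2) + a ** 2 * h_term

print(f"direct loss     : {loss:.4f}")
print(f"decomposed loss : {decomposed:.4f}")   # identical up to floating-point error
```

Because the same samples are used for both estimates, the two printed numbers agree up to floating-point error, which is exactly the conditional bias-variance identity behind the decomposition.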
Lemma 2.

In the case of the ReLU activation function and Gaussian noise the following holds,

\displaystyle\tilde{\phi}\left(\hat{\bm{w}}_{i}^{\top}\bm{x},\left\|\bm{w}_{i}\right\|,\sigma\right)=\left\|\bm{w}_{i}\right\|\left(\left(1-\Phi\left(-\frac{\hat{\bm{w}}_{i}^{\top}\bm{x}}{\sigma}\right)\right)\hat{\bm{w}}_{i}^{\top}\bm{x}+\sigma\varphi\left(-\frac{\hat{\bm{w}}_{i}^{\top}\bm{x}}{\sigma}\right)\right)

where \varphi and \Phi are the density and cumulative distribution function of the standard normal distribution, respectively, and

𝐇ij\displaystyle\mathbf{H}_{ij} =𝔼𝒙[Ψ(𝒙,𝒘i,𝒘j,σ2)ϕ~(𝒘^i𝒙,𝒘i,σ)ϕ~(𝒘^j𝒙,𝒘j,σ)]\displaystyle=\mathbb{E}_{\bm{x}}\left[\Psi\left(\bm{x},\bm{w}_{i},\bm{w}_{j},\sigma^{2}\right)-\tilde{\phi}\left(\hat{\bm{w}}_{i}^{\top}\bm{x},\left\|\bm{w}_{i}\right\|,\sigma\right)\tilde{\phi}\left(\hat{\bm{w}}_{j}^{\top}\bm{x},\left\|\bm{w}_{j}\right\|,\sigma\right)\right]

where,

Ψ(𝒙,𝒘i,𝒘j,σ2)\displaystyle\Psi\left(\bm{x},\bm{w}_{i},\bm{w}_{j},\sigma^{2}\right) =P(z1>0,z2>0;𝝁(i,j),𝚺(i,j))(σ12+det(𝚺)f(𝝁(i,j);𝟎,𝚺(i,j))P(z1>μ1,z2>μ2;𝟎,𝚺(i,j))\displaystyle=P\left(z_{1}>0,z_{2}>0;\bm{\mu}^{(i,j)},\bm{\Sigma}^{(i,j)}\right)\biggl{(}\sigma_{12}+\mathrm{det}\left(\bm{\Sigma}\right)\frac{f\left(-\bm{\mu}^{\left(i,j\right)};\bm{0},\bm{\Sigma}^{\left(i,j\right)}\right)}{P(z_{1}>-\mu_{1},z_{2}>-\mu_{2};\bm{0},\bm{\Sigma}^{(i,j)})}
+σ11F1(μ1)μ2+μ1σ22F2(μ2)+μ1μ2)\displaystyle\quad+\sigma_{11}F_{1}\left(-\mu_{1}\right)\mu_{2}+\mu_{1}\sigma_{22}F_{2}\left(-\mu_{2}\right)+\mu_{1}\mu_{2}\biggr{)}
\displaystyle F_{2}\left(-\mu_{2}\right)=\frac{1}{\sqrt{\sigma_{22}}}\cdot\frac{\varphi\left(-\frac{\mu_{2}}{\sqrt{\sigma_{22}}}\right)P\left(z>-\mu_{1};\frac{\Lambda_{12}\mu_{2}}{\Lambda_{11}},\frac{1}{\Lambda_{11}}\right)}{P(z_{1}>-\mu_{1},z_{2}>-\mu_{2};\bm{0},\bm{\Sigma}^{(i,j)})}
\displaystyle F_{1}\left(-\mu_{1}\right)=\frac{1}{\sqrt{\sigma_{11}}}\cdot\frac{\varphi\left(-\frac{\mu_{1}}{\sqrt{\sigma_{11}}}\right)P\left(z>-\mu_{2};\frac{\Lambda_{21}\mu_{1}}{\Lambda_{22}},\frac{1}{\Lambda_{22}}\right)}{P(z_{1}>-\mu_{1},z_{2}>-\mu_{2};\bm{0},\bm{\Sigma}^{(i,j)})}

and

𝝁(i,j)\displaystyle\bm{\mu}^{\left(i,j\right)} =(μ1μ2)=(𝒘i𝒙𝒘j𝒙)\displaystyle=\begin{pmatrix}\mu_{1}\\ \mu_{2}\end{pmatrix}=\begin{pmatrix}\bm{w}_{i}^{\top}\bm{x}\\ \bm{w}_{j}^{\top}\bm{x}\end{pmatrix}
𝚺(i,j)\displaystyle\bm{\Sigma}^{\left(i,j\right)} =(σ11σ12σ12σ22)=(σ2𝒘i2σ2𝒘i𝒘jσ2𝒘j𝒘iσ2𝒘j2)\displaystyle=\begin{pmatrix}\sigma_{11}&\sigma_{12}\\ \sigma_{12}&\sigma_{22}\end{pmatrix}=\begin{pmatrix}\sigma^{2}\left\|\bm{w}_{i}\right\|^{2}&\sigma^{2}\bm{w}_{i}^{\top}\bm{w}_{j}\\ \sigma^{2}\bm{w}_{j}^{\top}\bm{w}_{i}&\sigma^{2}\left\|\bm{w}_{j}\right\|^{2}\end{pmatrix}
𝚲\displaystyle\bm{\Lambda} =𝚺1.\displaystyle=\bm{\Sigma}^{-1}\,.

Notice that we denote by P(z1>0,z2>0;𝛍(i,j),𝚺(i,j))P\left(z_{1}>0,z_{2}>0;\bm{\mu}^{(i,j)},\bm{\Sigma}^{(i,j)}\right) the probability that z1>0,z2>0z_{1}>0,z_{2}>0 where (z1,z2)𝒩(𝛍(i,j),𝚺(i,j))\left(z_{1},z_{2}\right)^{\top}\sim\mathcal{N}\left(\bm{\mu}^{(i,j)},\bm{\Sigma}^{(i,j)}\right).

Proof.

Let x𝒩(μ,σ2)x\sim\mathcal{N}\left(\mu,\sigma^{2}\right), then

\displaystyle E\left[[x]_{+}\right]=E\left[[x]_{+}\mid x>0\right]P\left(x>0\right)+E\left[[x]_{+}\mid x<0\right]P\left(x<0\right)
=E[x|x>0]P(x>0)\displaystyle=E\left[x|x>0\right]P\left(x>0\right)
=E[x|x>0](1Φ(μσ))\displaystyle=E\left[x|x>0\right]\left(1-\Phi\left(-\frac{\mu}{\sigma}\right)\right)

Note that given x>0x>0 the distribution of xx is truncated normal [horrace2015moments]. Therefore,

𝔼[x|x>0]\displaystyle\mathbb{E}\left[x|x>0\right] =μ+σφ(μσ)1Φ(μσ)\displaystyle=\mu+\sigma\frac{\varphi\left(-\frac{\mu}{\sigma}\right)}{1-\Phi\left(-\frac{\mu}{\sigma}\right)}
𝔼[[x]+]\displaystyle\mathbb{E}\left[[x]_{+}\right] =(1Φ(μσ))μ+σφ(μσ).\displaystyle=\left(1-\Phi\left(-\frac{\mu}{\sigma}\right)\right)\mu+\sigma\varphi\left(-\frac{\mu}{\sigma}\right)\,. (23)

Note that given 𝒙\bm{x}, 𝒚\bm{y} is a Gaussian random vector. Therefore, given 𝒙\bm{x},

𝒘k𝒚𝒩(𝒘k𝒙,σ2𝒘k2)\displaystyle\bm{w}_{k}^{\top}\bm{y}\sim\mathcal{N}\left(\bm{w}_{k}^{\top}\bm{x},\sigma^{2}\left\|\bm{w}_{k}\right\|^{2}\right)

and we obtain,

𝔼𝒚|𝒙[[𝒘k𝒚]+]\displaystyle\mathbb{E}_{\bm{y}|\bm{x}}\left[[\bm{w}_{k}^{\top}\bm{y}]_{+}\right] =𝒘k𝔼𝒚|𝒙[[𝒘^k𝒚]+]\displaystyle=\left\|\bm{w}_{k}\right\|\mathbb{E}_{\bm{y}|\bm{x}}\left[[\hat{\bm{w}}_{k}^{\top}\bm{y}]_{+}\right]
\displaystyle=\left\|\bm{w}_{k}\right\|\left(\left(1-\Phi\left(-\frac{\hat{\bm{w}}_{k}^{\top}\bm{x}}{\sigma}\right)\right)\hat{\bm{w}}_{k}^{\top}\bm{x}+\sigma\varphi\left(-\frac{\hat{\bm{w}}_{k}^{\top}\bm{x}}{\sigma}\right)\right)\,.
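The closed form (23), and hence the expression above, can be checked against samples, as reported in Figure 3. Below is a minimal sketch of such a check (ours, not the authors' code; the figure's exact protocol and sample sizes may differ).

```python
import numpy as np
from scipy.stats import norm

def relu_mean(mu, sigma):
    """Closed form (23): E[ReLU(x)] for x ~ N(mu, sigma^2)."""
    return (1.0 - norm.cdf(-mu / sigma)) * mu + sigma * norm.pdf(-mu / sigma)

rng = np.random.default_rng(1)
for mu, sigma in [(1.0, 5.0), (-1.0, 5.0)]:
    samples = np.maximum(rng.normal(mu, sigma, size=10_000), 0.0)
    E, E_bar = relu_mean(mu, sigma), samples.mean()
    print(f"mu={mu:+.0f}: analytical E={E:.4f}, sample average={E_bar:.4f}, "
          f"normalized error={abs(E - E_bar) / E:.2%}")
```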
Refer to caption
(a)
Refer to caption
(b)
Figure 3: Numerical evaluation of (23). Histogram of the sample average of \mathrm{ReLU}(x) over 10000 Monte-Carlo samples. We denote by E the analytical expectation and by \bar{E} the sample average. Figure (a) is for \mu=1,\ \sigma=5; the normalized error is \frac{|E-\bar{E}|}{E}=0.0032\%. Figure (b) is for \mu=-1,\ \sigma=5; the normalized error is \frac{|E-\bar{E}|}{E}=0.0059\%.

Let (z1,z2)\left(z_{1},z_{2}\right)^{\top} be a Gaussian random vector with444Note that σii=σi2\sigma_{ii}=\sigma_{i}^{2}.

𝝁\displaystyle\bm{\mu} =(μ1μ2)\displaystyle=\begin{pmatrix}\mu_{1}\\ \mu_{2}\end{pmatrix}
𝚺\displaystyle\bm{\Sigma} =(σ11σ12σ12σ22)\displaystyle=\begin{pmatrix}\sigma_{11}&\sigma_{12}\\ \sigma_{12}&\sigma_{22}\end{pmatrix}

then,

𝔼[[z1]+[z2]+]\displaystyle\mathbb{E}\left[[z_{1}]_{+}[z_{2}]_{+}\right] =𝔼[[z1]+[z2]+|z1>0,z2>0]P(z1>0,z2>0)\displaystyle=\mathbb{E}\left[[z_{1}]_{+}[z_{2}]_{+}|z_{1}>0,z_{2}>0\right]P\left(z_{1}>0,z_{2}>0\right)
=𝔼[z1z2|z1>0,z2>0]P(z1>0,z2>0).\displaystyle=\mathbb{E}\left[z_{1}z_{2}|z_{1}>0,z_{2}>0\right]P\left(z_{1}>0,z_{2}>0\right)\,.

Given z_{1}>0, z_{2}>0, the distribution of \left(z_{1},z_{2}\right)^{\top} is a truncated multivariate normal distribution [Manjunath_Wilhelm_2021]; therefore,

𝔼[z1z2|z1>0,z2>0]\displaystyle\mathbb{E}\left[z_{1}z_{2}|z_{1}>0,z_{2}>0\right] =σ12+σ21(μ1)F1(μ1)+σ12(μ2)F2(μ2)\displaystyle=\sigma_{12}+\sigma_{21}\left(-\mu_{1}\right)F_{1}\left(-\mu_{1}\right)+\sigma_{12}\left(-\mu_{2}\right)F_{2}\left(-\mu_{2}\right)
+(σ11σ22σ12σ21)f(𝝁;𝟎,𝚺)p(z1>μ1,z2>μ2;𝟎,𝚺)\displaystyle+\left(\sigma_{11}\sigma_{22}-\sigma_{12}\sigma_{21}\right)\frac{f\left(-\bm{\mu};\bm{0},\bm{\Sigma}\right)}{p(z_{1}>-\mu_{1},z_{2}>-\mu_{2};\bm{0},\bm{\Sigma})}
(σ11F1(μ1)+σ12F2(μ2))(σ21F1(μ1)+σ22F2(μ2))\displaystyle-\left(\sigma_{11}F_{1}\left(-\mu_{1}\right)+\sigma_{12}F_{2}\left(-\mu_{2}\right)\right)\left(\sigma_{21}F_{1}\left(-\mu_{1}\right)+\sigma_{22}F_{2}\left(-\mu_{2}\right)\right)
+(σ11F1(μ1)+σ12F2(μ2)+μ1)(σ21F1(μ1)+σ22F2(μ2)+μ2)\displaystyle+\left(\sigma_{11}F_{1}\left(-\mu_{1}\right)+\sigma_{12}F_{2}\left(-\mu_{2}\right)+\mu_{1}\right)\left(\sigma_{21}F_{1}\left(-\mu_{1}\right)+\sigma_{22}F_{2}\left(-\mu_{2}\right)+\mu_{2}\right)
=σ12+det(𝚺)f(𝝁;𝟎,𝚺)p(z1>μ1,z2>μ2;𝟎,𝚺)\displaystyle=\sigma_{12}+\mathrm{det}\left(\bm{\Sigma}\right)\frac{f\left(-\bm{\mu};\bm{0},\bm{\Sigma}\right)}{p(z_{1}>-\mu_{1},z_{2}>-\mu_{2};\bm{0},\bm{\Sigma})}
+σ11F1(μ1)μ2+μ1σ22F2(μ2)+μ1μ2\displaystyle+\sigma_{11}F_{1}\left(-\mu_{1}\right)\mu_{2}+\mu_{1}\sigma_{22}F_{2}\left(-\mu_{2}\right)+\mu_{1}\mu_{2} (24)

where f\left(-\bm{\mu};\bm{0},\bm{\Sigma}\right) is the density of a Gaussian random vector with mean vector \bm{0} and covariance matrix \bm{\Sigma}, evaluated at the point -\bm{\mu}, and

F2(z2)\displaystyle F_{2}\left(z_{2}\right) =μ1f(z1,z2;𝟎,𝚺)𝑑z1P(Z1>μ1,Z2>μ2;𝟎,𝚺)\displaystyle=\frac{\int_{-\mu_{1}}^{\infty}f\left(z_{1},z_{2};\bm{0},\bm{\Sigma}\right)dz_{1}}{P(Z_{1}>-\mu_{1},Z_{2}>-\mu_{2};\bm{0},\bm{\Sigma})}
F1(z1)\displaystyle F_{1}\left(z_{1}\right) =μ2f(z1,z2;𝟎,𝚺)𝑑z2P(Z1>μ1,Z2>μ2;𝟎,𝚺).\displaystyle=\frac{\int_{-\mu_{2}}^{\infty}f\left(z_{1},z_{2};\bm{0},\bm{\Sigma}\right)dz_{2}}{P(Z_{1}>-\mu_{1},Z_{2}>-\mu_{2};\bm{0},\bm{\Sigma})}\,.

We denote by 𝚲=𝚺1\bm{\Lambda}=\bm{\Sigma}^{-1} thus,

μ1exp(12(Λ11z12+2Λ12z1z2+Λ22z22))𝑑z1=\displaystyle\int_{-\mu_{1}}^{\infty}\exp\left(-\frac{1}{2}\left(\Lambda_{11}z_{1}^{2}+2\Lambda_{12}z_{1}z_{2}+\Lambda_{22}z_{2}^{2}\right)\right)dz_{1}=
exp(12Λ22z22)μ1exp(12Λ11z12Λ12z1z2)𝑑z1=\displaystyle\exp\left(-\frac{1}{2}\Lambda_{22}z_{2}^{2}\right)\int_{-\mu_{1}}^{\infty}\exp\left(-\frac{1}{2}\Lambda_{11}z_{1}^{2}-\Lambda_{12}z_{1}z_{2}\right)dz_{1}=
exp(12Λ22z22)μ1exp(121Λ11(z1+Λ12z2Λ11)2+Λ122z222Λ11)𝑑z1=\displaystyle\exp\left(-\frac{1}{2}\Lambda_{22}z_{2}^{2}\right)\int_{-\mu_{1}}^{\infty}\exp\left(-\frac{1}{2\frac{1}{\Lambda_{11}}}\left(z_{1}+\frac{\Lambda_{12}z_{2}}{\Lambda_{11}}\right)^{2}+\frac{\Lambda_{12}^{2}z_{2}^{2}}{2\Lambda_{11}}\right)dz_{1}=
exp(12Λ22z22+Λ122z222Λ11)μ1exp(121Λ11(z1+Λ12z2Λ11)2)𝑑z1=\displaystyle\exp\left(-\frac{1}{2}\Lambda_{22}z_{2}^{2}+\frac{\Lambda_{12}^{2}z_{2}^{2}}{2\Lambda_{11}}\right)\int_{-\mu_{1}}^{\infty}\exp\left(-\frac{1}{2\frac{1}{\Lambda_{11}}}\left(z_{1}+\frac{\Lambda_{12}z_{2}}{\Lambda_{11}}\right)^{2}\right)dz_{1}=
exp(12Λ22z22+Λ122z222Λ11)2πΛ11P(z>μ1;Λ12z2Λ11,1Λ11).\displaystyle\exp\left(-\frac{1}{2}\Lambda_{22}z_{2}^{2}+\frac{\Lambda_{12}^{2}z_{2}^{2}}{2\Lambda_{11}}\right)\sqrt{\frac{2\pi}{\Lambda_{11}}}P\left(z>-\mu_{1};-\frac{\Lambda_{12}z_{2}}{\Lambda_{11}},\frac{1}{\Lambda_{11}}\right)\,.

Thus,

μ1f(z1,z2;0,𝚺)𝑑z1\displaystyle\int_{-\mu_{1}}^{\infty}f\left(z_{1},z_{2};0,\bm{\Sigma}\right)dz_{1} =12πdet(Σ)exp(12Λ22z22+Λ122z222Λ11)2πΛ11P(z>μ1;Λ12z2Λ11,1Λ11)\displaystyle=\frac{1}{2\pi\sqrt{\det\left(\Sigma\right)}}\exp\left(-\frac{1}{2}\Lambda_{22}z_{2}^{2}+\frac{\Lambda_{12}^{2}z_{2}^{2}}{2\Lambda_{11}}\right)\sqrt{\frac{2\pi}{\Lambda_{11}}}P\left(z>-\mu_{1};-\frac{\Lambda_{12}z_{2}}{\Lambda_{11}},\frac{1}{\Lambda_{11}}\right)
=12πσ22exp(12σ22z22)P(z>μ1;Λ12z2Λ11,1Λ11).\displaystyle=\frac{1}{\sqrt{2\pi\sigma_{22}}}\exp\left(-\frac{1}{2\sigma_{22}}z_{2}^{2}\right)P\left(z>-\mu_{1};-\frac{\Lambda_{12}z_{2}}{\Lambda_{11}},\frac{1}{\Lambda_{11}}\right)\,.

Similarly, we obtain,

μ2f(z1,z2;0,𝚺)𝑑z2\displaystyle\int_{-\mu_{2}}^{\infty}f\left(z_{1},z_{2};0,\bm{\Sigma}\right)dz_{2} =12πσ11exp(12σ11z12)P(z>μ2;Λ21z1Λ22,1Λ22).\displaystyle=\frac{1}{\sqrt{2\pi\sigma_{11}}}\exp\left(-\frac{1}{2\sigma_{11}}z_{1}^{2}\right)P\left(z>-\mu_{2};-\frac{\Lambda_{21}z_{1}}{\Lambda_{22}},\frac{1}{\Lambda_{22}}\right)\,.

Therefore,

\displaystyle F_{2}\left(z_{2}\right)=\frac{1}{\sqrt{\sigma_{22}}}\cdot\frac{\varphi\left(\frac{z_{2}}{\sqrt{\sigma_{22}}}\right)P\left(z>-\mu_{1};-\frac{\Lambda_{12}z_{2}}{\Lambda_{11}},\frac{1}{\Lambda_{11}}\right)}{P(Z_{1}>-\mu_{1},Z_{2}>-\mu_{2};\bm{0},\bm{\Sigma})}
\displaystyle F_{1}\left(z_{1}\right)=\frac{1}{\sqrt{\sigma_{11}}}\cdot\frac{\varphi\left(\frac{z_{1}}{\sqrt{\sigma_{11}}}\right)P\left(z>-\mu_{2};-\frac{\Lambda_{21}z_{1}}{\Lambda_{22}},\frac{1}{\Lambda_{22}}\right)}{P(Z_{1}>-\mu_{1},Z_{2}>-\mu_{2};\bm{0},\bm{\Sigma})}\,.
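These expressions make (24) directly computable. The following sketch (added for illustration; not the authors' code) evaluates (24) with the F_1, F_2 above, using scipy's bivariate normal CDF for the orthant probability, and compares it with a Monte-Carlo estimate at the parameter values used in Figure 4.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def relu_cross_moment(mu, Sigma, n_mc=2_000_000, seed=0):
    """Compare E[ReLU(z1)ReLU(z2)] for z ~ N(mu, Sigma): closed form (24) vs Monte Carlo."""
    mu, Sigma = np.asarray(mu, float), np.asarray(Sigma, float)
    Lam = np.linalg.inv(Sigma)
    s11, s12, s22 = Sigma[0, 0], Sigma[0, 1], Sigma[1, 1]

    # P(z1 > 0, z2 > 0) = P(Z1 > -mu1, Z2 > -mu2) for Z ~ N(0, Sigma)
    #                   = P(Z1 < mu1, Z2 < mu2) by central symmetry of N(0, Sigma)
    P = multivariate_normal(mean=[0.0, 0.0], cov=Sigma).cdf(mu)
    f = multivariate_normal(mean=[0.0, 0.0], cov=Sigma).pdf(-mu)

    # Marginal densities of the truncated distribution, F1(-mu1) and F2(-mu2)
    F1 = (norm.pdf(-mu[0] / np.sqrt(s11)) / np.sqrt(s11)
          * norm.sf(-mu[1], loc=Lam[1, 0] * mu[0] / Lam[1, 1],
                    scale=np.sqrt(1.0 / Lam[1, 1])) / P)
    F2 = (norm.pdf(-mu[1] / np.sqrt(s22)) / np.sqrt(s22)
          * norm.sf(-mu[0], loc=Lam[0, 1] * mu[1] / Lam[0, 0],
                    scale=np.sqrt(1.0 / Lam[0, 0])) / P)

    cond_moment = (s12 + np.linalg.det(Sigma) * f / P
                   + s11 * F1 * mu[1] + mu[0] * s22 * F2 + mu[0] * mu[1])
    analytic = P * cond_moment          # E[ReLU(z1)ReLU(z2)] = P * E[z1 z2 | z1>0, z2>0]

    z = np.random.default_rng(seed).multivariate_normal(mu, Sigma, size=n_mc)
    mc = np.mean(np.maximum(z[:, 0], 0) * np.maximum(z[:, 1], 0))
    return analytic, mc

for mu, Sigma in [([-4, 17], [[13, -9], [-9, 8]]), ([6, 2], [[10, 2], [2, 1]])]:
    E, E_bar = relu_cross_moment(mu, Sigma)
    print(f"mu={mu}: analytic={E:.4f}, Monte-Carlo={E_bar:.4f}")
```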

Next, we present in Figures 3 and 4 a numerical evaluation of (23) and (24). As can be seen, the Monte-Carlo simulations verify the analytical results.

Refer to caption
(a)
Refer to caption
(b)
Figure 4: Numerical evaluation of (24). Histogram of the sample average of \mathrm{ReLU}(z_{1})\mathrm{ReLU}(z_{2}) over 10000 Monte-Carlo samples. We denote by E the analytical expectation and by \bar{E} the sample average. Figure (a) is for \bm{\mu}=\begin{pmatrix}-4\\ 17\end{pmatrix},\ \bm{\Sigma}=\begin{pmatrix}13&-9\\ -9&8\end{pmatrix}; the normalized error is \frac{|E-\bar{E}|}{E}=0.0093\%. Figure (b) is for \bm{\mu}=\begin{pmatrix}6\\ 2\end{pmatrix},\ \bm{\Sigma}=\begin{pmatrix}10&2\\ 2&1\end{pmatrix}; the normalized error is \frac{|E-\bar{E}|}{E}=0.0044\%.

Appendix B Proofs of Results in Section 4

B.1 Proof of Proposition 1

Proof.

Let

\mathcal{D}=\{(y_{n,m}=x_{n}+\epsilon_{n,m},\,x_{n})\},\quad n\in[N],\ m\in[M],\qquad-\infty<x_{1}<x_{2}<\cdots<x_{N}<\infty

such that Assumption 1 holds. We define a reduced dataset, which only contains the noisy samples with the most extreme noises

𝒟¯={(xn+ϵnmin,xn),(xn+ϵnmax,xn)},n[N].\displaystyle\mathcal{\bar{D}}=\{(x_{n}+\epsilon^{\mathrm{min}}_{n},x_{n}),(x_{n}+\epsilon^{\mathrm{max}}_{n},x_{n})\},n\in[N]\,.

We define the 2\ell^{2} penalty on the weights,

C(θ,K)=k=1K(𝒂k2+𝒘k2).\displaystyle C\left(\theta,K\right)=\sum_{k=1}^{K}\left(\|{\bm{a}}_{k}\|^{2}+\|{\bm{w}}_{k}\|^{2}\right)\,.

According to Theorem 1.2 in [hanin2021ridgeless], since we have opposite discrete curvature on the intervals [x_{1}+\epsilon_{1}^{\mathrm{max}},x_{2}+\epsilon_{2}^{\mathrm{min}}],\dots,[x_{N-1}+\epsilon_{N-1}^{\mathrm{max}},x_{N}+\epsilon_{N}^{\mathrm{min}}],

{hθ(y)hθ(y)=x(y,x)𝒟¯,C(θ,K)=C}={f1D(y)},\displaystyle\{h_{\theta}(y)\mid h_{\theta}(y)=x\;\forall(y,x)\in\mathcal{\bar{D}},C\left(\theta,K\right)=C_{*}\}=\{f^{*}_{1D}\left(y\right)\}\,,

where

C=infθ,K{C(θ,K)(y,x)𝒟¯hθ(y)=x}.\displaystyle C_{*}=\inf_{{\theta},K}\{C\left(\theta,K\right)\mid\forall(y,x)\in\mathcal{\bar{D}}\;\;h_{\theta}(y)=x\}\,.

Also note that,

{hθ(y)hθ(y)=x(y,x)𝒟,C(θ,K)=C}={f1D(y)}\displaystyle\{h_{\theta}(y)\mid h_{\theta}(y)=x\;\forall(y,x)\in\mathcal{D},C\left(\theta,K\right)=C_{*}\}=\{f^{*}_{1D}\left(y\right)\}

since f1D(y)f^{*}_{1D}\left(y\right) interpolates all the points in 𝒟\mathcal{D}, and if we assume by contradiction that C>infθ,K{C(θ,K)(y,x)𝒟hθ(y)=x}C_{*}>\inf_{{\theta},K}\{C\left(\theta,K\right)\mid\forall(y,x)\in\mathcal{D}\;\;h_{\theta}(y)=x\} then Cinfθ,K{C(θ,K)(y,x)𝒟¯hθ(y)=x}C_{*}\neq\inf_{{\theta},K}\{C\left(\theta,K\right)\mid\forall(y,x)\in\mathcal{\bar{D}}\;\;h_{\theta}(y)=x\}, thus contradicting Theorem 1.2. in [hanin2021ridgeless]. Note that the minimal number of neurons needed to represent f1D(y)f^{*}_{1D}\left(y\right) is 2N22N-2. So, if K2N2K\geq 2N-2 then f1Df^{*}_{1D} is the minimizer of the representation cost (i.e., f1DargminfR(f)f^{*}_{1D}\in\arg\min_{f}R(f)). ∎
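For concreteness, f^{*}_{1D} is straightforward to evaluate directly from the clean points and the extreme noise values: it is the piecewise-linear interpolation through the knots (x_{n}+\epsilon_{n}^{\mathrm{min}},x_{n}) and (x_{n}+\epsilon_{n}^{\mathrm{max}},x_{n}), with constant extension outside the data range. Below is a small sketch we added (assuming the form in (17); names and values are illustrative).

```python
import numpy as np

def f_1d(y, x, eps_min, eps_max):
    """Min-cost 1D denoiser: constant x_n on [x_n+eps_min_n, x_n+eps_max_n],
    linear in between, constant beyond the extreme training inputs."""
    x, eps_min, eps_max = map(np.asarray, (x, eps_min, eps_max))
    knots_y = np.ravel(np.column_stack([x + eps_min, x + eps_max]))  # sorted under Assumption 1
    knots_f = np.repeat(x, 2)
    return np.interp(y, knots_y, knots_f)   # np.interp extrapolates with the end values

# Example: three clean points with their most extreme observed noises
x = np.array([-1.0, 0.5, 2.0])
eps_min = np.array([-0.1, -0.15, -0.05])
eps_max = np.array([0.2, 0.1, 0.1])
y = np.linspace(-2.0, 3.0, 7)
print(f_1d(y, x, eps_min, eps_max))         # denoised values, contracted toward {x_n}
```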

B.2 Proof of Theorem 1

First we prove the following lemmas.

Lemma 3.

Let y=x+σϵy=x+\sigma\epsilon where x,ϵ,σ>0x,\epsilon\in\mathbb{R},\sigma>0 then,

limσ0+x^eMMSE(y)=x^1NN(y)\displaystyle\lim_{\sigma\to 0^{+}}\hat{x}^{\mathrm{eMMSE}}\left(y\right)=\hat{x}^{1-\mathrm{NN}}\left(y\right)
Proof.

Notice that the following holds,

\displaystyle\frac{\exp\left(-\frac{\left|y-x_{i}\right|^{2}}{2\sigma^{2}}\right)}{\sum_{i=1}^{N}\exp\left(-\frac{\left|y-x_{i}\right|^{2}}{2\sigma^{2}}\right)}=\frac{\exp\left(-\frac{\left|y-x_{i}\right|^{2}}{2\sigma^{2}}\right)\exp\left(\frac{\min_{n}\left|y-x_{n}\right|^{2}}{2\sigma^{2}}\right)}{\sum_{i=1}^{N}\exp\left(-\frac{\left|y-x_{i}\right|^{2}}{2\sigma^{2}}\right)\exp\left(\frac{\min_{n}\left|y-x_{n}\right|^{2}}{2\sigma^{2}}\right)}
=exp(|yxi|2minn|yxn|22σ2)i=1Nexp(|yxi|2minn|yxn|22σ2).\displaystyle=\frac{\exp\left(-\frac{\left|y-x_{i}\right|^{2}-\min_{n}\left|y-x_{n}\right|^{2}}{2\sigma^{2}}\right)}{\sum_{i=1}^{N}\exp\left(-\frac{\left|y-x_{i}\right|^{2}-\min_{n}\left|y-x_{n}\right|^{2}}{2\sigma^{2}}\right)}\,.

In addition,

limσ0+exp(|yxi|2minn|yxn|22σ2)\displaystyle\lim_{\sigma\to 0^{+}}\exp\left(-\frac{\left|y-x_{i}\right|^{2}-\min_{n}\left|y-x_{n}\right|^{2}}{2\sigma^{2}}\right) ={1|yxi|2=minn|yxn|20|yxi|2minn|yxn|2\displaystyle=\begin{cases}1&\left|y-x_{i}\right|^{2}=\min_{n}\left|y-x_{n}\right|^{2}\\ 0&\left|y-x_{i}\right|^{2}\neq\min_{n}\left|y-x_{n}\right|^{2}\end{cases}

so we obtain,

limσ0+exp(|yxi|22σ2)i=1Nexp(|yxi|22σ2)={1|yxi|2=minn|yxn|20|yxi|2minn|yxn|2.\displaystyle\lim_{\sigma\to 0^{+}}\frac{\exp\left(-\frac{\left|y-x_{i}\right|^{2}}{2\sigma^{2}}\right)}{\sum_{i=1}^{N}\exp\left(-\frac{\left|y-x_{i}\right|^{2}}{2\sigma^{2}}\right)}=\begin{cases}1&\left|y-x_{i}\right|^{2}=\min_{n}\left|y-x_{n}\right|^{2}\\ 0&\left|y-x_{i}\right|^{2}\neq\min_{n}\left|y-x_{n}\right|^{2}\end{cases}\,.

Therefore,

\displaystyle\lim_{\sigma\to 0^{+}}\hat{x}^{\mathrm{eMMSE}}\left(y\right)=\lim_{\sigma\to 0^{+}}\frac{\sum_{i=1}^{N}x_{i}\exp\left(-\frac{\left|y-x_{i}\right|^{2}}{2\sigma^{2}}\right)}{\sum_{i=1}^{N}\exp\left(-\frac{\left|y-x_{i}\right|^{2}}{2\sigma^{2}}\right)}=x_{\arg\min_{n}|y-x_{n}|^{2}}
\displaystyle=x_{\arg\min_{n}|y-x_{n}|}=\hat{x}^{1-\mathrm{NN}}\left(y\right)\,.
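The limit in Lemma 3 is easy to observe numerically. The sketch below (ours) evaluates the softmax-weighted average defining \hat{x}^{\mathrm{eMMSE}} for decreasing \sigma; subtracting the minimal squared distance before exponentiating mirrors the manipulation in the proof and keeps the computation numerically stable.

```python
import numpy as np

def x_emmse(y, x, sigma):
    """Empirical MMSE denoiser: softmax-weighted average of the clean points."""
    d2 = (y - x) ** 2
    w = np.exp(-(d2 - d2.min()) / (2 * sigma ** 2))   # subtract the min for stability
    return np.sum(w * x) / np.sum(w)

x = np.array([-1.0, 0.3, 2.0])   # clean training points (illustrative)
y = 0.9                          # query
x_1nn = x[np.argmin(np.abs(y - x))]
for sigma in [1.0, 0.3, 0.1, 0.03]:
    print(f"sigma={sigma:>4}: eMMSE={x_emmse(y, x, sigma):+.4f}  (1-NN={x_1nn:+.1f})")
```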

Lemma 4.

Let y=x+σϵy=x+\sigma\epsilon where x,σ>0x\in\mathbb{R},\sigma>0 and ϵ𝒩(0,1)\epsilon\sim\mathcal{N}\left(0,1\right) then,

limσ0+MSE(x^eMMSE(y))=limσ0+MSE(x^1NN(y))\displaystyle\lim_{\sigma\to 0^{+}}\mathrm{MSE}\left(\hat{x}^{\mathrm{eMMSE}}\left(y\right)\right)=\lim_{\sigma\to 0^{+}}\mathrm{MSE}\left(\hat{x}^{1-\mathrm{NN}}\left(y\right)\right)
Proof.
MSE(x^eMMSE(y))\displaystyle\mathrm{MSE}\left(\hat{x}^{\mathrm{eMMSE}}\left(y\right)\right) =𝔼[(x^eMMSE(y)x)2]\displaystyle=\mathbb{E}\left[\left(\hat{x}^{\mathrm{eMMSE}}\left(y\right)-x\right)^{2}\right]
=𝔼[(x^eMMSE(y)x^1NN(y)+x^1NN(y)x)2]\displaystyle=\mathbb{E}\left[\left(\hat{x}^{\mathrm{eMMSE}}\left(y\right)-\hat{x}^{1-\mathrm{NN}}\left(y\right)+\hat{x}^{1-\mathrm{NN}}\left(y\right)-x\right)^{2}\right]
=MSE(x^1NN(y))+𝔼[(x^eMMSE(y)x^1NN(y))2]\displaystyle=\mathrm{MSE}\left(\hat{x}^{1-\mathrm{NN}}\left(y\right)\right)+\mathbb{E}\left[\left(\hat{x}^{\mathrm{eMMSE}}\left(y\right)-\hat{x}^{1-\mathrm{NN}}\left(y\right)\right)^{2}\right]
+2𝔼[(x^1NN(y)x)(x^eMMSE(y)x^1NN(y))]\displaystyle\quad+2\mathbb{E}\left[\left(\hat{x}^{1-\mathrm{NN}}\left(y\right)-x\right)\left(\hat{x}^{\mathrm{eMMSE}}\left(y\right)-\hat{x}^{1-\mathrm{NN}}\left(y\right)\right)\right]

Note that,

x^eMMSE(y)\displaystyle\hat{x}^{\mathrm{eMMSE}}\left(y\right) =i=1Nxiexp(|yxi|22σ2)i=1Nexp(|yxi|22σ2)i=1Nmaxj{xj}exp(|yxi|22σ2)i=1Nexp(|yxi|22σ2)\displaystyle=\frac{\sum_{i=1}^{N}x_{i}\exp\left(-\frac{\left|y-x_{i}\right|^{2}}{2\sigma^{2}}\right)}{\sum_{i=1}^{N}\exp\left(-\frac{\left|y-x_{i}\right|^{2}}{2\sigma^{2}}\right)}\leq\frac{\sum_{i=1}^{N}\max_{j}\{x_{j}\}\exp\left(-\frac{\left|y-x_{i}\right|^{2}}{2\sigma^{2}}\right)}{\sum_{i=1}^{N}\exp\left(-\frac{\left|y-x_{i}\right|^{2}}{2\sigma^{2}}\right)}
=maxj{xj}i=1Nexp(|yxi|22σ2)i=1Nexp(|yxi|22σ2)=maxj{xj},\displaystyle=\max_{j}\{x_{j}\}\frac{\sum_{i=1}^{N}\exp\left(-\frac{\left|y-x_{i}\right|^{2}}{2\sigma^{2}}\right)}{\sum_{i=1}^{N}\exp\left(-\frac{\left|y-x_{i}\right|^{2}}{2\sigma^{2}}\right)}=\max_{j}\{x_{j}\}\,,
x^eMMSE(y)\displaystyle\hat{x}^{\mathrm{eMMSE}}\left(y\right) =i=1Nxiexp(|yxi|22σ2)i=1Nexp(|yxi|22σ2)i=1Nminj{xj}exp(|yxi|22σ2)i=1Nexp(|yxi|22σ2)\displaystyle=\frac{\sum_{i=1}^{N}x_{i}\exp\left(-\frac{\left|y-x_{i}\right|^{2}}{2\sigma^{2}}\right)}{\sum_{i=1}^{N}\exp\left(-\frac{\left|y-x_{i}\right|^{2}}{2\sigma^{2}}\right)}\geq\frac{\sum_{i=1}^{N}\min_{j}\{x_{j}\}\exp\left(-\frac{\left|y-x_{i}\right|^{2}}{2\sigma^{2}}\right)}{\sum_{i=1}^{N}\exp\left(-\frac{\left|y-x_{i}\right|^{2}}{2\sigma^{2}}\right)}
=minj{xj}i=1Nexp(|yxi|22σ2)i=1Nexp(|yxi|22σ2)=minj{xj}.\displaystyle=\min_{j}\{x_{j}\}\frac{\sum_{i=1}^{N}\exp\left(-\frac{\left|y-x_{i}\right|^{2}}{2\sigma^{2}}\right)}{\sum_{i=1}^{N}\exp\left(-\frac{\left|y-x_{i}\right|^{2}}{2\sigma^{2}}\right)}=\min_{j}\{x_{j}\}\,.

Similarly,

\displaystyle\hat{x}^{1-\mathrm{NN}}\left(y\right)=x_{\arg\min_{i}|y-x_{i}|}\leq\max_{i}\{x_{i}\}
\displaystyle\hat{x}^{1-\mathrm{NN}}\left(y\right)=x_{\arg\min_{i}|y-x_{i}|}\geq\min_{i}\{x_{i}\}

thus M0>0σ>0\exists M_{0}>0\;\forall\sigma>0

|x^eMMSE(y)x^1NN(y)|<M0.\displaystyle|\hat{x}^{\mathrm{eMMSE}}\left(y\right)-\hat{x}^{1-\mathrm{NN}}\left(y\right)|<M_{0}\,.

According to Lemma 3

limσ0+x^eMMSE(y)=x^1NN(y)\displaystyle\lim_{\sigma\to 0^{+}}\hat{x}^{\mathrm{eMMSE}}\left(y\right)=\hat{x}^{1-\mathrm{NN}}\left(y\right)

almost surely. Note that,

𝔼x,y[(x^eMMSE(y)x^1NN(y))2]=𝔼x,ϵ[(x^eMMSE(x+σϵ)x^1NN(x+σϵ))2]\displaystyle\mathbb{E}_{x,y}\left[\left(\hat{x}^{\mathrm{eMMSE}}\left(y\right)-\hat{x}^{1-\mathrm{NN}}\left(y\right)\right)^{2}\right]=\mathbb{E}_{x,\epsilon}\left[\left(\hat{x}^{\mathrm{eMMSE}}\left(x+\sigma\epsilon\right)-\hat{x}^{1-\mathrm{NN}}\left(x+\sigma\epsilon\right)\right)^{2}\right]

Therefore, by the Dominated convergence theorem we obtain

limσ0+𝔼x,ϵ[(x^eMMSE(x+σϵ)x^1NN(x+σϵ))2]=\displaystyle\lim_{\sigma\to 0^{+}}\mathbb{E}_{x,\epsilon}\left[\left(\hat{x}^{\mathrm{eMMSE}}\left(x+\sigma\epsilon\right)-\hat{x}^{1-\mathrm{NN}}\left(x+\sigma\epsilon\right)\right)^{2}\right]=
𝔼x,ϵ[limσ0+(x^eMMSE(x+σϵ)x^1NN(x+σϵ))2]=0\displaystyle\mathbb{E}_{x,\epsilon}\left[\lim_{\sigma\to 0^{+}}\left(\hat{x}^{\mathrm{eMMSE}}\left(x+\sigma\epsilon\right)-\hat{x}^{1-\mathrm{NN}}\left(x+\sigma\epsilon\right)\right)^{2}\right]=0

Similarly,

limσ0+𝔼x,ϵ[(x^1NN(x+σϵ)x)(x^eMMSE(x+σϵ)x^1NN(x+σϵ))]=\displaystyle\lim_{\sigma\to 0^{+}}\mathbb{E}_{x,\epsilon}\left[\left(\hat{x}^{1-\mathrm{NN}}\left(x+\sigma\epsilon\right)-x\right)\left(\hat{x}^{\mathrm{eMMSE}}\left(x+\sigma\epsilon\right)-\hat{x}^{1-\mathrm{NN}}\left(x+\sigma\epsilon\right)\right)\right]=
𝔼x,ϵ[limσ0+(x^1NN(x+σϵ)x)(x^eMMSE(x+σϵ)x^1NN(x+σϵ))]=0\displaystyle\mathbb{E}_{x,\epsilon}\left[\lim_{\sigma\to 0^{+}}\left(\hat{x}^{1-\mathrm{NN}}\left(x+\sigma\epsilon\right)-x\right)\left(\hat{x}^{\mathrm{eMMSE}}\left(x+\sigma\epsilon\right)-\hat{x}^{1-\mathrm{NN}}\left(x+\sigma\epsilon\right)\right)\right]=0

since \hat{x}^{\mathrm{eMMSE}}\left(y\right) and \hat{x}^{1-\mathrm{NN}}\left(y\right) are bounded and \mathbb{E}\left[x\right]<\infty. Therefore, we get

limσ0+MSE(x^eMMSE(y))=limσ0+MSE(x^1NN(y))\displaystyle\lim_{\sigma\to 0^{+}}\mathrm{MSE}\left(\hat{x}^{\mathrm{eMMSE}}\left(y\right)\right)=\lim_{\sigma\to 0^{+}}\mathrm{MSE}\left(\hat{x}^{1-\mathrm{NN}}\left(y\right)\right)

Next, we prove Theorem 1.

Proof.

Assume, without loss of generality, that N=2N=2. Let px(x)p_{x}\left(x\right) be a probability density function with bounded second moment such that px(x)>0p_{x}\left(x\right)>0 for all x[x1,x2]x\in\left[x_{1},x_{2}\right]. According to Lemma 4

limσ0+MSE(x^eMMSE(y))=limσ0+MSE(x^1NN(y))\displaystyle\lim_{\sigma\to 0^{+}}\mathrm{MSE}\left(\hat{x}^{\mathrm{eMMSE}}\left(y\right)\right)=\lim_{\sigma\to 0^{+}}\mathrm{MSE}\left(\hat{x}^{1-\mathrm{NN}}\left(y\right)\right)

So we need to prove that

limσ0+MSE(x^1NN(y))\displaystyle\lim_{\sigma\to 0^{+}}\mathrm{MSE}\left(\hat{x}^{1-\mathrm{NN}}\left(y\right)\right) >limσ0+MSE(f1D(y))\displaystyle>\lim_{\sigma\to 0^{+}}\mathrm{MSE}\left(f^{*}_{1D}\left(y\right)\right)

For the case of N=2N=2, the training set includes two data points {x1,x2}\{x_{1},x_{2}\}. So we get,

x^1NN(y)=\displaystyle\hat{x}^{1-\mathrm{NN}}\left(y\right)= {x1y<x1+x22x2x1+x22y\displaystyle\begin{cases}x_{1}&y<\frac{x_{1}+x_{2}}{2}\\ x_{2}&\frac{x_{1}+x_{2}}{2}\leq y\end{cases}
f1D(y)=\displaystyle f^{*}_{1D}\left(y\right)= {x1y<x1+Δ1x2x1x2x1+Δ2Δ1(yx1Δ1)+x1x1+Δ1yx2+Δ2x2y>x2+Δ2\displaystyle\begin{cases}x_{1}&y<x_{1}+\Delta_{1}\\ \frac{x_{2}-x_{1}}{x_{2}-x_{1}+\Delta_{2}-\Delta_{1}}\left(y-x_{1}-\Delta_{1}\right)+x_{1}&x_{1}+\Delta_{1}\leq y\leq x_{2}+\Delta_{2}\\ x_{2}&y>x_{2}+\Delta_{2}\end{cases}

where \max_{m\in[M]}\epsilon_{1,m}=\Delta_{1}>0,\ \min_{m\in[M]}\epsilon_{2,m}=\Delta_{2}<0. Note that \hat{x}^{1-\mathrm{NN}}\left(y\right)=f^{*}_{1D}\left(y\right) for all y\in(-\infty,x_{1}+\Delta_{1})\cup(x_{2}+\Delta_{2},\infty), and \mathbb{E}\left[x\right]<\infty,\ \mathbb{E}\left[x^{2}\right]<\infty. Therefore,

MSE(x^1NN(y))MSE(f1D(y))=\displaystyle\mathrm{MSE}\left(\hat{x}^{1-\mathrm{NN}}\left(y\right)\right)-\mathrm{MSE}\left(f^{*}_{1D}\left(y\right)\right)=
dxx1+Δ1x2+Δ2dy(x^1NN(y)x)2py|x(y|x)px(x)\displaystyle\int_{-\infty}^{\infty}\mathrm{d}x\int_{x_{1}+\Delta_{1}}^{x_{2}+\Delta_{2}}\mathrm{d}y\left(\hat{x}^{1-\mathrm{NN}}\left(y\right)-x\right)^{2}p_{y|x}\left(y|x\right)p_{x}\left(x\right) (25)
dxx1+Δ1x2+Δ2dy(f1D(y)x)2py|x(y|x)px(x)\displaystyle-\int_{-\infty}^{\infty}\mathrm{d}x\int_{x_{1}+\Delta_{1}}^{x_{2}+\Delta_{2}}\mathrm{d}y\left(f^{*}_{1D}\left(y\right)-x\right)^{2}p_{y|x}\left(y|x\right)p_{x}\left(x\right) (26)

First, we derive (25):

dxx1+Δ1x2+Δ2dy(x^1NN(y)x)2py|x(y|x)px(x)=\displaystyle\int_{-\infty}^{\infty}\mathrm{d}x\int_{x_{1}+\Delta_{1}}^{x_{2}+\Delta_{2}}\mathrm{d}y\left(\hat{x}^{1-\mathrm{NN}}\left(y\right)-x\right)^{2}p_{y|x}\left(y|x\right)p_{x}\left(x\right)=
x1+Δ1x1+x22(x1x)2py|x(y|x)px(x)dydx+x1+x22x2+Δ2(x2x)2py|x(y|x)px(x)dydx=\displaystyle\int_{-\infty}^{\infty}\int_{x_{1}+\Delta_{1}}^{\frac{x_{1}+x_{2}}{2}}\left(x_{1}-x\right)^{2}p_{y|x}\left(y|x\right)p_{x}\left(x\right)\mathrm{\mathrm{d}yd}x+\int_{-\infty}^{\infty}\int_{\frac{x_{1}+x_{2}}{2}}^{x_{2}+\Delta_{2}}\left(x_{2}-x\right)^{2}p_{y|x}\left(y|x\right)p_{x}\left(x\right)\mathrm{\mathrm{d}yd}x=
𝔼x[P(y[x1+Δ1,x1+x22]|x)(x1x)2]+𝔼x[P(y[x1+x22,x2+Δ2]|x)(x2x)2].\displaystyle\mathbb{E}_{x}\left[P\left(y\in\left[x_{1}+\Delta_{1},\frac{x_{1}+x_{2}}{2}\right]\biggr{|}x\right)\left(x_{1}-x\right)^{2}\right]+\mathbb{E}_{x}\left[P\left(y\in\left[\frac{x_{1}+x_{2}}{2},x_{2}+\Delta_{2}\right]\biggr{|}x\right)\left(x_{2}-x\right)^{2}\right]\,.

Note that,

𝔼[P(y[x1+Δ1,x1+x22]|x)(x1x)2]\displaystyle\mathbb{E}\left[P\left(y\in\left[x_{1}+\Delta_{1},\frac{x_{1}+x_{2}}{2}\right]\biggr{|}x\right)\left(x_{1}-x\right)^{2}\right] <𝔼[(x1x)2]<\displaystyle<\mathbb{E}\left[\left(x_{1}-x\right)^{2}\right]<\infty
𝔼[P(y[x1+x22,x2+Δ2]|x)(x2x)2]\displaystyle\mathbb{E}\left[P\left(y\in\left[\frac{x_{1}+x_{2}}{2},x_{2}+\Delta_{2}\right]\biggr{|}x\right)\left(x_{2}-x\right)^{2}\right] <𝔼[(x2x)2]<\displaystyle<\mathbb{E}\left[\left(x_{2}-x\right)^{2}\right]<\infty

thus by the Dominated convergence theorem we obtain

limσ0+dxx1+Δ1x2+Δ2dy(x^1NN(y)x)2py|x(y|x)px(x)=\displaystyle\lim_{\sigma\to 0^{+}}\int_{-\infty}^{\infty}\mathrm{d}x\int_{x_{1}+\Delta_{1}}^{x_{2}+\Delta_{2}}\mathrm{d}y\left(\hat{x}^{1-\mathrm{NN}}\left(y\right)-x\right)^{2}p_{y|x}\left(y|x\right)p_{x}\left(x\right)=
𝔼x[limσ0+P(y[x1+Δ1,x1+x22]|x)(x1x)2]+𝔼x[limσ0+P(y[x1+x22,x2+Δ2]|x)(x2x)2]=\displaystyle\mathbb{E}_{x}\left[\lim_{\sigma\to 0^{+}}P\left(y\in\left[x_{1}+\Delta_{1},\frac{x_{1}+x_{2}}{2}\right]\biggr{|}x\right)\left(x_{1}-x\right)^{2}\right]+\mathbb{E}_{x}\left[\lim_{\sigma\to 0^{+}}P\left(y\in\left[\frac{x_{1}+x_{2}}{2},x_{2}+\Delta_{2}\right]\biggr{|}x\right)\left(x_{2}-x\right)^{2}\right]=
𝔼x[(xx1)21{x[x1,x1+x22]}]+𝔼x[(xx2)21{x[x1+x22,x2]}]>C>0\displaystyle\mathbb{E}_{x}\left[\left(x-x_{1}\right)^{2}1\left\{x\in\left[x_{1},\frac{x_{1}+x_{2}}{2}\right]\right\}\right]+\mathbb{E}_{x}\left[\left(x-x_{2}\right)^{2}1\left\{x\in\left[\frac{x_{1}+x_{2}}{2},x_{2}\right]\right\}\right]>C>0

since px(x)>0p_{x}\left(x\right)>0 for all x[x1,x2]x\in\left[x_{1},x_{2}\right]. Next, we derive (26)

dxx1+Δ1x2+Δ2dy(f1D(y)x)2py|x(y|x)px(x)=\displaystyle\int_{-\infty}^{\infty}\mathrm{d}x\int_{x_{1}+\Delta_{1}}^{x_{2}+\Delta_{2}}\mathrm{d}y\left(f^{*}_{1D}\left(y\right)-x\right)^{2}p_{y|x}\left(y|x\right)p_{x}\left(x\right)=
𝔼x,y[1{y[x1+Δ1,x2+Δ2]}(f1D(y)x)2]=\displaystyle\mathbb{E}_{x,y}\left[1\left\{y\in\left[x_{1}+\Delta_{1},x_{2}+\Delta_{2}\right]\right\}\left(f^{*}_{1D}\left(y\right)-x\right)^{2}\right]=
𝔼x,ϵ[1{x+σϵ[x1+Δ1,x2+Δ2]}(f1D(x+σϵ)x)2]\displaystyle\mathbb{E}_{x,\epsilon}\left[1\left\{x+\sigma\epsilon\in\left[x_{1}+\Delta_{1},x_{2}+\Delta_{2}\right]\right\}\left(f^{*}_{1D}\left(x+\sigma\epsilon\right)-x\right)^{2}\right]

Note that,

𝔼x,ϵ[1{x+σϵ[x1+Δ1,x2+Δ2]}(f1D(x+σϵ)x)2]<𝔼x,ϵ[(f1D(x+σϵ)x)2]<\displaystyle\mathbb{E}_{x,\epsilon}\left[1\left\{x+\sigma\epsilon\in\left[x_{1}+\Delta_{1},x_{2}+\Delta_{2}\right]\right\}\left(f^{*}_{1D}\left(x+\sigma\epsilon\right)-x\right)^{2}\right]<\mathbb{E}_{x,\epsilon}\left[\left(f^{*}_{1D}\left(x+\sigma\epsilon\right)-x\right)^{2}\right]<\infty

thus by the Dominated convergence theorem we obtain

\displaystyle\lim_{\sigma\to 0^{+}}\int_{-\infty}^{\infty}\mathrm{d}x\int_{x_{1}+\Delta_{1}}^{x_{2}+\Delta_{2}}\mathrm{d}y\left(f^{*}_{1D}\left(y\right)-x\right)^{2}p_{y|x}\left(y|x\right)p_{x}\left(x\right)=
\displaystyle\lim_{\sigma\to 0^{+}}\mathbb{E}_{x,\epsilon}\left[1\left\{x+\sigma\epsilon\in\left[x_{1}+\Delta_{1},x_{2}+\Delta_{2}\right]\right\}\left(f^{*}_{1D}\left(x+\sigma\epsilon\right)-x\right)^{2}\right]=
\displaystyle\mathbb{E}_{x,\epsilon}\left[\lim_{\sigma\to 0^{+}}1\left\{x+\sigma\epsilon\in\left[x_{1}+\Delta_{1},x_{2}+\Delta_{2}\right]\right\}\left(f^{*}_{1D}\left(x+\sigma\epsilon\right)-x\right)^{2}\right]=0

since

\displaystyle\lim_{\sigma\to 0^{+}}1\left\{x+\sigma\epsilon\in\left[x_{1}+\Delta_{1},x_{2}+\Delta_{2}\right]\right\}=1\left\{x\in\left[x_{1},x_{2}\right]\right\}
\displaystyle\lim_{\sigma\to 0^{+}}f^{*}_{1D}\left(x+\sigma\epsilon\right)=\begin{cases}x_{1}&x<x_{1}\\ x&x_{1}\leq x\leq x_{2}\\ x_{2}&x>x_{2}\end{cases}
Combining the limits of (25) and (26), we obtain \lim_{\sigma\to 0^{+}}\mathrm{MSE}\left(\hat{x}^{1-\mathrm{NN}}\left(y\right)\right)-\lim_{\sigma\to 0^{+}}\mathrm{MSE}\left(f^{*}_{1D}\left(y\right)\right)\geq C>0, which completes the proof. ∎
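To illustrate the resulting gap numerically for N=2, the following sketch (ours, with an illustrative uniform prior p_x on [x_1,x_2] and arbitrary \Delta_1>0, \Delta_2<0) estimates both MSEs by Monte Carlo at a small noise level.

```python
import numpy as np

x1, x2 = 0.0, 1.0
d1, d2 = 0.05, -0.05          # Delta_1 > 0, Delta_2 < 0 (illustrative)
sigma = 0.02
rng = np.random.default_rng(0)

x = rng.uniform(x1, x2, size=500_000)          # p_x: uniform on [x1, x2]
y = x + sigma * rng.standard_normal(x.size)

x_1nn = np.where(y < (x1 + x2) / 2, x1, x2)
# f*_1D for N=2: clamp of the linear segment joining the two plateaus
slope = (x2 - x1) / ((x2 + d2) - (x1 + d1))
f_star = np.clip(x1 + slope * (y - (x1 + d1)), x1, x2)

print("MSE 1-NN  :", np.mean((x_1nn - x) ** 2))
print("MSE f*_1D :", np.mean((f_star - x) ** 2))   # markedly smaller at small sigma
```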

B.3 Proof of Lemma 1

Proof.

First we prove that for all yn[N1]{xn+1ϵnmaxxnϵn+1minϵnmaxϵn+1min}f1D(y)=yy\in\cup_{n\in[N-1]}\Bigl{\{}\frac{x_{n+1}\epsilon^{\mathrm{max}}_{n}-x_{n}\epsilon^{\mathrm{min}}_{n+1}}{\epsilon^{\mathrm{max}}_{n}-\epsilon^{\mathrm{min}}_{n+1}}\Bigr{\}}\,\,f^{*}_{1D}(y)=y. We find the intersection between the line f1D(y)=yf^{*}_{1D}(y)=y and the linear part of (17) (the third branch)

xn+1xnxn+1+ϵn+1min(xn+ϵnmax)(y(xn+ϵnmax))+xn\displaystyle\frac{x_{n+1}-x_{n}}{x_{n+1}+\epsilon^{\mathrm{min}}_{n+1}-\left(x_{n}+\epsilon^{\mathrm{max}}_{n}\right)}\left(y-\left(x_{n}+\epsilon^{\mathrm{max}}_{n}\right)\right)+x_{n} =y\displaystyle=y (27)
(xn+1xn)(y(xn+ϵnmax))+xn(xn+1+ϵn+1min(xn+ϵnmax))\displaystyle\left(x_{n+1}-x_{n}\right)\left(y-\left(x_{n}+\epsilon^{\mathrm{max}}_{n}\right)\right)+x_{n}\left(x_{n+1}+\epsilon^{\mathrm{min}}_{n+1}-\left(x_{n}+\epsilon^{\mathrm{max}}_{n}\right)\right) =y(xn+1+ϵn+1min(xn+ϵnmax))\displaystyle=y\left(x_{n+1}+\epsilon^{\mathrm{min}}_{n+1}-\left(x_{n}+\epsilon^{\mathrm{max}}_{n}\right)\right) (28)
xn(xn+1+ϵn+1min(xn+ϵnmax))(xn+1xn)(xn+ϵnmax)\displaystyle x_{n}\left(x_{n+1}+\epsilon^{\mathrm{min}}_{n+1}-\left(x_{n}+\epsilon^{\mathrm{max}}_{n}\right)\right)-\left(x_{n+1}-x_{n}\right)\left(x_{n}+\epsilon^{\mathrm{max}}_{n}\right) =y(ϵn+1minϵnmax)\displaystyle=y\left(\epsilon^{\mathrm{min}}_{n+1}-\epsilon^{\mathrm{max}}_{n}\right) (29)
xn(xn+1+ϵn+1min)xn+1(xn+ϵnmax)\displaystyle x_{n}\left(x_{n+1}+\epsilon^{\mathrm{min}}_{n+1}\right)-x_{n+1}\left(x_{n}+\epsilon^{\mathrm{max}}_{n}\right) =y(ϵn+1minϵnmax)\displaystyle=y\left(\epsilon^{\mathrm{min}}_{n+1}-\epsilon^{\mathrm{max}}_{n}\right) (30)
xnϵn+1minxn+1ϵnmax\displaystyle x_{n}\epsilon^{\mathrm{min}}_{n+1}-x_{n+1}\epsilon^{\mathrm{max}}_{n} =y(ϵn+1minϵnmax)\displaystyle=y\left(\epsilon^{\mathrm{min}}_{n+1}-\epsilon^{\mathrm{max}}_{n}\right) (31)
y=xn+1ϵnmaxxnϵn+1minϵnmaxϵn+1min.\displaystyle y=\frac{x_{n+1}\epsilon^{\mathrm{max}}_{n}-x_{n}\epsilon^{\mathrm{min}}_{n+1}}{\epsilon^{\mathrm{max}}_{n}-\epsilon^{\mathrm{min}}_{n+1}}\,. (32)

Note that

xn+ϵnmax<xn+1ϵnmaxxnϵn+1minϵnmaxϵn+1min<xn+1+ϵn+1min\displaystyle x_{n}+\epsilon^{\mathrm{max}}_{n}<\frac{x_{n+1}\epsilon^{\mathrm{max}}_{n}-x_{n}\epsilon^{\mathrm{min}}_{n+1}}{\epsilon^{\mathrm{max}}_{n}-\epsilon^{\mathrm{min}}_{n+1}}<x_{n+1}+\epsilon^{\mathrm{min}}_{n+1} (33)

Since,

xn+ϵnmax\displaystyle x_{n}+\epsilon^{\mathrm{max}}_{n} <xn+1ϵnmaxxnϵn+1minϵnmaxϵn+1min\displaystyle<\frac{x_{n+1}\epsilon^{\mathrm{max}}_{n}-x_{n}\epsilon^{\mathrm{min}}_{n+1}}{\epsilon^{\mathrm{max}}_{n}-\epsilon^{\mathrm{min}}_{n+1}} (34)
(xn+ϵnmax)(ϵnmaxϵn+1min)\displaystyle\left(x_{n}+\epsilon^{\mathrm{max}}_{n}\right)\left(\epsilon^{\mathrm{max}}_{n}-\epsilon^{\mathrm{min}}_{n+1}\right) <xn+1ϵnmaxxnϵn+1min\displaystyle<x_{n+1}\epsilon^{\mathrm{max}}_{n}-x_{n}\epsilon^{\mathrm{min}}_{n+1} (35)
(xn+ϵnmaxϵn+1min)ϵnmax\displaystyle\left(x_{n}+\epsilon^{\mathrm{max}}_{n}-\epsilon^{\mathrm{min}}_{n+1}\right)\epsilon^{\mathrm{max}}_{n} <xn+1ϵnmax\displaystyle<x_{n+1}\epsilon^{\mathrm{max}}_{n} (36)
xn+ϵnmaxϵn+1min\displaystyle x_{n}+\epsilon^{\mathrm{max}}_{n}-\epsilon^{\mathrm{min}}_{n+1} <xn+1\displaystyle<x_{n+1} (37)
xn+ϵnmax\displaystyle x_{n}+\epsilon^{\mathrm{max}}_{n} <xn+1+ϵn+1min\displaystyle<x_{n+1}+\epsilon^{\mathrm{min}}_{n+1} (38)

which holds according to Assumption 1.

xn+1ϵnmaxxnϵn+1minϵnmaxϵn+1min\displaystyle\frac{x_{n+1}\epsilon^{\mathrm{max}}_{n}-x_{n}\epsilon^{\mathrm{min}}_{n+1}}{\epsilon^{\mathrm{max}}_{n}-\epsilon^{\mathrm{min}}_{n+1}} <xn+1+ϵn+1min\displaystyle<x_{n+1}+\epsilon^{\mathrm{min}}_{n+1} (39)
xn+1ϵnmaxxnϵn+1min\displaystyle x_{n+1}\epsilon^{\mathrm{max}}_{n}-x_{n}\epsilon^{\mathrm{min}}_{n+1} <(xn+1+ϵn+1min)(ϵnmaxϵn+1min)\displaystyle<\left(x_{n+1}+\epsilon^{\mathrm{min}}_{n+1}\right)\left(\epsilon^{\mathrm{max}}_{n}-\epsilon^{\mathrm{min}}_{n+1}\right) (40)
xnϵn+1min\displaystyle-x_{n}\epsilon^{\mathrm{min}}_{n+1} <ϵn+1min(xn+1ϵnmax+ϵn+1min)\displaystyle<-\epsilon^{\mathrm{min}}_{n+1}\left(x_{n+1}-\epsilon^{\mathrm{max}}_{n}+\epsilon^{\mathrm{min}}_{n+1}\right) (41)
xn\displaystyle x_{n} <xn+1ϵnmax+ϵn+1min\displaystyle<x_{n+1}-\epsilon^{\mathrm{max}}_{n}+\epsilon^{\mathrm{min}}_{n+1} (42)
xn+ϵnmax\displaystyle x_{n}+\epsilon^{\mathrm{max}}_{n} <xn+1+ϵn+1min\displaystyle<x_{n+1}+\epsilon^{\mathrm{min}}_{n+1} (43)

which holds according to Assumption 1.

Refer to caption
Figure 5: Illustration of f1D(y)f_{1D}^{*}(y).

Next we prove that f^{*}_{1D}(y) is contractive toward the set of clean data points \{x_{n}\}_{n=1}^{N} on \mathbb{R}\setminus\cup_{n\in[N-1]}\Bigl\{\frac{x_{n+1}\epsilon^{\mathrm{max}}_{n}-x_{n}\epsilon^{\mathrm{min}}_{n+1}}{\epsilon^{\mathrm{max}}_{n}-\epsilon^{\mathrm{min}}_{n+1}}\Bigr\}. In the case where y\in(-\infty,x_{1}+\epsilon^{\mathrm{min}}_{1}] or y\in\cup_{n\in[N]}[x_{n}+\epsilon^{\mathrm{min}}_{n},x_{n}+\epsilon^{\mathrm{max}}_{n}] or y\in[x_{N}+\epsilon^{\mathrm{max}}_{N},\infty), we have f^{*}_{1D}(y)\in\{x_{n}\}_{n=1}^{N}, so (18) holds for all 0\leq\alpha<1. In the case where y\in\cup_{n\in[N-1]}[x_{n}+\epsilon^{\mathrm{max}}_{n},\frac{x_{n+1}\epsilon^{\mathrm{max}}_{n}-x_{n}\epsilon^{\mathrm{min}}_{n+1}}{\epsilon^{\mathrm{max}}_{n}-\epsilon^{\mathrm{min}}_{n+1}}) we choose i=n,

|f1D(y)f(xn)|=xn+1xnxn+1+ϵn+1min(xn+ϵnmax)(y(xn+ϵnmax))\displaystyle\left|f^{*}_{1D}\left(y\right)-f\left(x_{n}\right)\right|=\frac{x_{n+1}-x_{n}}{x_{n+1}+\epsilon^{\mathrm{min}}_{n+1}-\left(x_{n}+\epsilon^{\mathrm{max}}_{n}\right)}\left(y-\left(x_{n}+\epsilon^{\mathrm{max}}_{n}\right)\right) (44)

There exists 0<γ1<10<\gamma_{1}<1 such that

xnxn+1xnxn+1+ϵn+1min(xn+ϵnmax)(y(xn+ϵnmax))+xnγ1y\displaystyle x_{n}\leq\frac{x_{n+1}-x_{n}}{x_{n+1}+\epsilon^{\mathrm{min}}_{n+1}-\left(x_{n}+\epsilon^{\mathrm{max}}_{n}\right)}\left(y-\left(x_{n}+\epsilon^{\mathrm{max}}_{n}\right)\right)+x_{n}\leq\gamma_{1}y (45)

since for y\in\cup_{n\in[N-1]}[x_{n}+\epsilon^{\mathrm{max}}_{n},\frac{x_{n+1}\epsilon^{\mathrm{max}}_{n}-x_{n}\epsilon^{\mathrm{min}}_{n+1}}{\epsilon^{\mathrm{max}}_{n}-\epsilon^{\mathrm{min}}_{n+1}}),\ f^{*}_{1D}(y) is below the line f(y)=y, because f^{*}_{1D}(y) is an affine function with slope larger than 1 on this interval and f^{*}_{1D}\bigl(\frac{x_{n+1}\epsilon^{\mathrm{max}}_{n}-x_{n}\epsilon^{\mathrm{min}}_{n+1}}{\epsilon^{\mathrm{max}}_{n}-\epsilon^{\mathrm{min}}_{n+1}}\bigr)=\frac{x_{n+1}\epsilon^{\mathrm{max}}_{n}-x_{n}\epsilon^{\mathrm{min}}_{n+1}}{\epsilon^{\mathrm{max}}_{n}-\epsilon^{\mathrm{min}}_{n+1}}. Therefore there exists 0<\alpha_{1}<1 such that

|f1D(y)f(xn)|=xn+1xnxn+1+ϵn+1min(xn+ϵnmax)(y(xn+ϵnmax))γ1yxnα1(yxn)\displaystyle\left|f^{*}_{1D}\left(y\right)-f\left(x_{n}\right)\right|=\frac{x_{n+1}-x_{n}}{x_{n+1}+\epsilon^{\mathrm{min}}_{n+1}-\left(x_{n}+\epsilon^{\mathrm{max}}_{n}\right)}\left(y-\left(x_{n}+\epsilon^{\mathrm{max}}_{n}\right)\right)\leq\gamma_{1}y-x_{n}\leq\alpha_{1}\left(y-x_{n}\right) (46)

since,

γ1yxn\displaystyle\gamma_{1}y-x_{n} α1(yxn)\displaystyle\leq\alpha_{1}\left(y-x_{n}\right) (47)
α1\displaystyle\alpha_{1} γ1yxnyxn.\displaystyle\geq\frac{\gamma_{1}y-x_{n}}{y-x_{n}}\,. (48)

In the case where yn[N1](xn+1ϵnmaxxnϵn+1minϵnmaxϵn+1min,xn+1+ϵn+1min]y\in\cup_{n\in[N-1]}(\frac{x_{n+1}\epsilon^{\mathrm{max}}_{n}-x_{n}\epsilon^{\mathrm{min}}_{n+1}}{\epsilon^{\mathrm{max}}_{n}-\epsilon^{\mathrm{min}}_{n+1}},x_{n+1}+\epsilon^{\mathrm{min}}_{n+1}] we choose i=n+1i=n+1,

|f1D(y)f(xn+1)|=xn+1xn+1xnxn+1+ϵn+1min(xn+ϵnmax)(y(xn+ϵnmax))xn.\displaystyle\left|f^{*}_{1D}\left(y\right)-f\left(x_{n+1}\right)\right|=x_{n+1}-\frac{x_{n+1}-x_{n}}{x_{n+1}+\epsilon^{\mathrm{min}}_{n+1}-\left(x_{n}+\epsilon^{\mathrm{max}}_{n}\right)}\left(y-\left(x_{n}+\epsilon^{\mathrm{max}}_{n}\right)\right)-x_{n}\,. (49)

There exists 0<γ2<10<\gamma_{2}<1 such that

xn+1xn+1xnxn+1+ϵn+1min(xn+ϵnmax)(y(xn+ϵnmax))+xn1γ2y\displaystyle x_{n+1}\geq\frac{x_{n+1}-x_{n}}{x_{n+1}+\epsilon^{\mathrm{min}}_{n+1}-\left(x_{n}+\epsilon^{\mathrm{max}}_{n}\right)}\left(y-\left(x_{n}+\epsilon^{\mathrm{max}}_{n}\right)\right)+x_{n}\geq\frac{1}{\gamma_{2}}y (50)

since for yn[N1](xn+1ϵnmaxxnϵn+1minϵnmaxϵn+1min,xn+1+ϵn+1min]y\in\cup_{n\in[N-1]}(\frac{x_{n+1}\epsilon^{\mathrm{max}}_{n}-x_{n}\epsilon^{\mathrm{min}}_{n+1}}{\epsilon^{\mathrm{max}}_{n}-\epsilon^{\mathrm{min}}_{n+1}},x_{n+1}+\epsilon^{\mathrm{min}}_{n+1}]\,\, f1D(y)f^{*}_{1D}(y) is above the line f(y)=yf(y)=y because f1D(y)f^{*}_{1D}(y) is an affine function with slope larger than 1 and f1D(xn+1ϵnmaxxnϵn+1minϵnmaxϵn+1min)=xn+1ϵnmaxxnϵn+1minϵnmaxϵn+1minf^{*}_{1D}(\frac{x_{n+1}\epsilon^{\mathrm{max}}_{n}-x_{n}\epsilon^{\mathrm{min}}_{n+1}}{\epsilon^{\mathrm{max}}_{n}-\epsilon^{\mathrm{min}}_{n+1}})=\frac{x_{n+1}\epsilon^{\mathrm{max}}_{n}-x_{n}\epsilon^{\mathrm{min}}_{n+1}}{\epsilon^{\mathrm{max}}_{n}-\epsilon^{\mathrm{min}}_{n+1}}. Therefore there exists 0<α2<10<\alpha_{2}<1

|f1D(y)f(xn+1)|\displaystyle\left|f^{*}_{1D}\left(y\right)-f\left(x_{n+1}\right)\right| =xn+1xn+1xnxn+1+ϵn+1min(xn+ϵnmax)(y(xn+ϵnmax))xn\displaystyle=x_{n+1}-\frac{x_{n+1}-x_{n}}{x_{n+1}+\epsilon^{\mathrm{min}}_{n+1}-\left(x_{n}+\epsilon^{\mathrm{max}}_{n}\right)}\left(y-\left(x_{n}+\epsilon^{\mathrm{max}}_{n}\right)\right)-x_{n} (51)
xn+11γ2yα2(xn+1y)\displaystyle\leq x_{n+1}-\frac{1}{\gamma_{2}}y\leq\alpha_{2}\left(x_{n+1}-y\right) (52)

since,

xn+11γ2y\displaystyle x_{n+1}-\frac{1}{\gamma_{2}}y α2(xn+1y)\displaystyle\leq\alpha_{2}\left(x_{n+1}-y\right) (53)
α2\displaystyle\alpha_{2} xn+11γ2yxn+1y.\displaystyle\geq\frac{x_{n+1}-\frac{1}{\gamma_{2}}y}{x_{n+1}-y}\,. (54)

Therefore (18) holds for α=max{α1,α2}\alpha=\max\{\alpha_{1},\alpha_{2}\}. ∎
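Both the fixed points y=\frac{x_{n+1}\epsilon^{\mathrm{max}}_{n}-x_{n}\epsilon^{\mathrm{min}}_{n+1}}{\epsilon^{\mathrm{max}}_{n}-\epsilon^{\mathrm{min}}_{n+1}} and the contraction property can be spot-checked numerically with the same piecewise-linear construction sketched after the proof of Proposition 1 (a sketch we added; values are illustrative).

```python
import numpy as np

x = np.array([-1.0, 0.5, 2.0])
eps_min = np.array([-0.1, -0.15, -0.05])
eps_max = np.array([0.2, 0.1, 0.1])

knots_y = np.ravel(np.column_stack([x + eps_min, x + eps_max]))
knots_f = np.repeat(x, 2)
f = lambda y: np.interp(y, knots_y, knots_f)

# Fixed points of f*_1D between consecutive plateaus, per Lemma 1
y_star = (x[1:] * eps_max[:-1] - x[:-1] * eps_min[1:]) / (eps_max[:-1] - eps_min[1:])
print("f(y*) - y* :", f(y_star) - y_star)          # ~0 up to floating point

# Contraction toward a clean point (here the nearest one) at a few test inputs
y = np.array([-0.7, 0.0, 0.9, 1.8])
nearest = x[np.argmin(np.abs(y[:, None] - x[None, :]), axis=1)]
print("|f(y)-x_i| <= |y-x_i| :", np.abs(f(y) - nearest) <= np.abs(y - nearest))
```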

Appendix C Proofs of Results in Section 5

Let 𝒇:dd{\bm{f}}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} be any function realizable as a shallow ReLU network. Consider any parametrization of 𝒇{\bm{f}} given by 𝒇(𝒚)=k=1K𝒂k[𝒘k𝒚+bk]++𝑽𝒚+𝒄{\bm{f}}({\bm{y}})=\sum_{k=1}^{K}{\bm{a}}_{k}[{\bm{w}}_{k}^{\top}{\bm{y}}+b_{k}]_{+}+{\bm{V}}{\bm{y}}+{\bm{c}}. We say such a parametrization is a minimal representative of 𝐟{\bm{f}} if 𝒘k2=1\|{\bm{w}}_{k}\|_{2}=1 and 𝒂k0{\bm{a}}_{k}\neq 0 for all k=1,,Kk=1,...,K, and R(𝒇)=k=1K𝒂k2R({\bm{f}})=\sum_{k=1}^{K}\|{\bm{a}}_{k}\|_{2}. In particular, the units making up a minimal representative must be distinct in the sense that the hyperplanes describing ReLU boundaries Hk={𝒙d:𝒘k𝒙+bk=0}H_{k}=\{{\bm{x}}\in\mathbb{R}^{d}:{\bm{w}}_{k}^{\top}{\bm{x}}+b_{k}=0\} are distinct, which implies that no units can be cancelled or combined.

We will also make use of the following lemma, which shows that representation costs are invariant to a translation of the training samples, assuming the function is suitably translated.

Lemma 5.

Let 𝐟:dd{\bm{f}}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} be any function realizable as a shallow ReLU net satisfying norm-ball interpolation constraints 𝐟(B(𝐱n;ρ))={𝐱n}{\bm{f}}(B({\bm{x}}_{n};\rho))=\{{\bm{x}}_{n}\} for all n=1,,Nn=1,...,N. Let 𝐱0d{\bm{x}}_{0}\in\mathbb{R}^{d}. Then the function 𝐠(𝐲)=𝐟(𝐲𝐱0)+𝐱0{\bm{g}}({\bm{y}})={\bm{f}}({\bm{y}}-{\bm{x}}_{0})+{\bm{x}}_{0} is such that R(𝐠)=R(𝐟)R({\bm{g}})=R({\bm{f}}) and 𝐠(B(𝐱n+𝐱0;ρ))={𝐱n+𝐱0}{\bm{g}}(B({\bm{x}}_{n}+{\bm{x}}_{0};\rho))=\{{\bm{x}}_{n}+{\bm{x}}_{0}\} for all n=1,,Nn=1,...,N.

Proof.

First we show 𝒈(B(𝒙n+𝒙0;ρ))={𝒙n+𝒙0}{\bm{g}}(B({\bm{x}}_{n}+{\bm{x}}_{0};\rho))=\{{\bm{x}}_{n}+{\bm{x}}_{0}\} for all n=1,,Nn=1,...,N. Fix any nn, and let 𝒚B(𝒙n+𝒙0;ρ){\bm{y}}\in B({\bm{x}}_{n}+{\bm{x}}_{0};\rho). Then 𝒚=𝒚+𝒙0{\bm{y}}={\bm{y}}^{\prime}+{\bm{x}}_{0} for some 𝒚B(𝒙n;ρ){\bm{y}}^{\prime}\in B({\bm{x}}_{n};\rho), and so 𝒈(𝒚)=𝒇(𝒚)+𝒙0=𝒙n+𝒙0{\bm{g}}({\bm{y}})={\bm{f}}({\bm{y}}^{\prime})+{\bm{x}}_{0}={\bm{x}}_{n}+{\bm{x}}_{0}, as claimed.

To show that R(𝒈)=R(𝒇)R({\bm{g}})=R({\bm{f}}), let 𝒇(𝒚)=k=1K𝒂k[𝒘k𝒚+bk]++𝑽𝒚+𝒄{\bm{f}}({\bm{y}})=\sum_{k=1}^{K}{\bm{a}}_{k}[{\bm{w}}_{k}^{\top}{\bm{y}}+b_{k}]_{+}+{\bm{V}}{\bm{y}}+{\bm{c}} be any minimal representative of 𝒇{\bm{f}}. Then

𝒈(𝒚)\displaystyle{\bm{g}}({\bm{y}}) =k=1K𝒂k[𝒘k(𝒚𝒙0)+bk]++𝑽(𝒚𝒙0)+𝒄+𝒙0\displaystyle=\sum_{k=1}^{K}{\bm{a}}_{k}[{\bm{w}}_{k}^{\top}({\bm{y}}-{\bm{x}}_{0})+b_{k}]_{+}+{\bm{V}}({\bm{y}}-{\bm{x}}_{0})+{\bm{c}}+{\bm{x}}_{0} (55)
=k=1K𝒂k[𝒘k𝒚+b~k]++𝑽𝒚+𝒄~\displaystyle=\sum_{k=1}^{K}{\bm{a}}_{k}[{\bm{w}}_{k}^{\top}{\bm{y}}+\tilde{b}_{k}]_{+}+{\bm{V}}{\bm{y}}+\tilde{{\bm{c}}} (56)

with b~k=bk𝒘k𝒙0\tilde{b}_{k}=b_{k}-{\bm{w}}_{k}^{\top}{\bm{x}}_{0} and 𝒄~=𝒄+𝒙0𝑽𝒙0\tilde{{\bm{c}}}={\bm{c}}+{\bm{x}}_{0}-{\bm{V}}{\bm{x}}_{0}. And so R(𝒈)k=1K𝒂k2=R(𝒇)R({\bm{g}})\leq\sum_{k=1}^{K}\|{\bm{a}}_{k}\|_{2}=R({\bm{f}}). A parallel argument with the roles of 𝒈{\bm{g}} and 𝒇{\bm{f}} reversed shows that R(𝒇)R(𝒈)R({\bm{f}})\leq R({\bm{g}}), and so R(𝒇)=R(𝒈)R({\bm{f}})=R({\bm{g}}). ∎

In particular, this lemma shows that if 𝒇(𝒚){\bm{f}}^{*}({\bm{y}}) is a norm-ball interpolating representation cost minimizer of the training samples {𝒙n}n=1N\{{\bm{x}}_{n}\}_{n=1}^{N}, then 𝒈(𝒚)=𝒇(𝒚𝒙0)+𝒙0{\bm{g}}^{*}({\bm{y}})={\bm{f}}^{*}({\bm{y}}-{\bm{x}}_{0})+{\bm{x}}_{0} is a norm-ball interpolating min-cost solution of the translated training samples {𝒙n+𝒙0}n=1N\{{\bm{x}}_{n}+{\bm{x}}_{0}\}_{n=1}^{N}.
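The reparametrization used in the proof (\tilde{b}_{k}=b_{k}-{\bm{w}}_{k}^{\top}{\bm{x}}_{0} and \tilde{{\bm{c}}}={\bm{c}}+{\bm{x}}_{0}-{\bm{V}}{\bm{x}}_{0}, leaving {\bm{a}}_{k}, {\bm{w}}_{k}, {\bm{V}} untouched) is easy to verify numerically; below is a minimal sketch with a random shallow ReLU network (illustrative, not the authors' code).

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 4, 6
W = rng.standard_normal((K, d)); b = rng.standard_normal(K)
A = rng.standard_normal((d, K)); V = rng.standard_normal((d, d)); c = rng.standard_normal(d)

def net(Y, W, b, A, V, c):
    """Shallow ReLU net f(y) = sum_k a_k [w_k^T y + b_k]_+ + V y + c, batched over rows of Y."""
    return np.maximum(Y @ W.T + b, 0.0) @ A.T + Y @ V.T + c

x0 = rng.standard_normal(d)
b_t = b - W @ x0                  # shifted biases
c_t = c + x0 - V @ x0             # shifted output offset

Y = rng.standard_normal((5, d))
g_direct = net(Y - x0, W, b, A, V, c) + x0       # g(y) = f(y - x0) + x0
g_reparam = net(Y, W, b_t, A, V, c_t)            # same inner/outer weights, shifted b, c
print(np.allclose(g_direct, g_reparam))          # True: the weights (hence R) are unchanged
```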

C.1 Training data belonging to a subspace

The proof of Theorem 2 is a direct consequence of the following two lemmas:

Lemma 6.

Suppose the clean training samples \{{\bm{x}}_{n}\}_{n=1}^{N} belong to an r-dimensional subspace {\mathcal{S}}\subset\mathbb{R}^{d}, and let {\bm{P}}_{\mathcal{S}} denote the orthogonal projector onto {\mathcal{S}}. Let {\bm{f}}({\bm{y}}) be any function realizable as a shallow ReLU network satisfying {\bm{f}}(B({\bm{x}}_{n},\rho))=\{{\bm{x}}_{n}\} for all n=1,...,N. Define {\bm{g}}({\bm{y}})={\bm{P}}_{\mathcal{S}}{\bm{f}}({\bm{y}}). Then {\bm{g}}(B({\bm{x}}_{n},\rho))=\{{\bm{x}}_{n}\} for all n=1,...,N, and R({\bm{g}})\leq R({\bm{f}}), with strict inequality if {\bm{f}}\neq{\bm{g}}.

Proof.

First, for any 𝒚B(𝒙n,ρ){\bm{y}}\in B({\bm{x}}_{n},\rho) we have 𝒈(𝒚)=𝑷𝒮𝒇(𝒚)=𝑷𝒮𝒙n=𝒙n{\bm{g}}({\bm{y}})={\bm{P}}_{\mathcal{S}}{\bm{f}}({\bm{y}})={\bm{P}}_{\mathcal{S}}{\bm{x}}_{n}={\bm{x}}_{n}. Therefore, 𝒈(B(𝒙n,ρ))={𝒙n}{\bm{g}}(B({\bm{x}}_{n},\rho))=\{{\bm{x}}_{n}\} for all n=1,,Nn=1,...,N.

Now, let {\bm{f}}({\bm{y}})=\sum_{k=1}^{K}{\bm{a}}_{k}[{\bm{w}}_{k}^{\top}{\bm{y}}+b_{k}]_{+}+{\bm{V}}{\bm{y}}+{\bm{c}} be a minimal representative of {\bm{f}}. Then {\bm{g}}({\bm{y}})=\sum_{k=1}^{K}{\bm{P}}_{\mathcal{S}}{\bm{a}}_{k}[{\bm{w}}_{k}^{\top}{\bm{y}}+b_{k}]_{+}+{\bm{P}}_{\mathcal{S}}({\bm{V}}{\bm{y}}+{\bm{c}}). Since \|{\bm{P}}_{\mathcal{S}}{\bm{a}}_{k}\|\leq\|{\bm{a}}_{k}\| for all k, we have R({\bm{g}})\leq\sum_{k=1}^{K}\|{\bm{P}}_{\mathcal{S}}{\bm{a}}_{k}\|\leq\sum_{k=1}^{K}\|{\bm{a}}_{k}\|=R({\bm{f}}).

Now we show 𝒇𝒈{\bm{f}}\neq{\bm{g}} implies R(𝒈)<R(𝒇)R({\bm{g}})<R({\bm{f}}). Observe that if any of the outer-layer weight vectors 𝒂k𝒮{\bm{a}}_{k}\not\in{\mathcal{S}} then 𝑷𝒮𝒂k<𝒂k\|{\bm{P}}_{\mathcal{S}}{\bm{a}}_{k}\|<\|{\bm{a}}_{k}\|, which implies R(𝒈)<R(𝒇)R({\bm{g}})<R({\bm{f}}). Hence, it suffices to show that 𝒇𝒈{\bm{f}}\neq{\bm{g}} implies some 𝒂k𝒮{\bm{a}}_{k}\not\in{\mathcal{S}}.

First, consider the case where {\bm{P}}_{\mathcal{S}}{\bm{V}}={\bm{V}} and {\bm{P}}_{\mathcal{S}}{\bm{c}}={\bm{c}}. Then in this case {\bm{f}}\neq{\bm{g}} if and only if \sum_{k=1}^{K}({\bm{P}}_{\mathcal{S}}{\bm{a}}_{k}-{\bm{a}}_{k})[{\bm{w}}_{k}^{\top}{\bm{y}}+b_{k}]_{+}\neq 0 for some {\bm{y}}\in\mathbb{R}^{d}, which implies {\bm{P}}_{\mathcal{S}}{\bm{a}}_{k}\neq{\bm{a}}_{k} for some k, or equivalently {\bm{a}}_{k}\not\in{\mathcal{S}}.

Next, assume either 𝑷𝒮𝑽𝑽{\bm{P}}_{\mathcal{S}}{\bm{V}}\neq{\bm{V}} or 𝑷𝒮𝒄𝒄{\bm{P}}_{\mathcal{S}}{\bm{c}}\neq{\bm{c}}. Fix any training sample index nn, and let AnA_{n} denote the index set of units active over the ball B(𝒙n,ρ)B({\bm{x}}_{n},\rho). Then, since 𝒇(𝒚){\bm{f}}({\bm{y}}) is constant for all 𝒚B(𝒙n,ρ){\bm{y}}\in B({\bm{x}}_{n},\rho), the Jacobian 𝒇(𝒚)=kAn𝒂k𝒘k+𝑽=𝟎\bm{\partial}{\bm{f}}({\bm{y}})=\sum_{k\in A_{n}}{\bm{a}}_{k}{\bm{w}}_{k}^{\top}+{\bm{V}}=\bm{0} for all 𝒚B(𝒙n,ρ){\bm{y}}\in B({\bm{x}}_{n},\rho). This gives 𝑽=kAn𝒂k𝒘k{\bm{V}}=-\sum_{k\in A_{n}}{\bm{a}}_{k}{\bm{w}}_{k}^{\top}. Therefore, if 𝑷𝒮𝑽𝑽{\bm{P}}_{\mathcal{S}}{\bm{V}}\neq{\bm{V}} then at least one of the 𝒂k𝑺{\bm{a}}_{k}\not\in{\bm{S}}. On the other hand, if 𝑷𝒮𝒄𝒄{\bm{P}}_{\mathcal{S}}{\bm{c}}\neq{\bm{c}} then for all 𝒚B(𝒙n,ρ){\bm{y}}\in B({\bm{x}}_{n},\rho) we have

𝒇(𝒚)=kAn𝒂k(𝒘k𝒚+bk)+𝑽𝒚+𝒄=kAn(𝒂k𝒘k+𝑽)𝒚+kAnbk𝒂k+𝒄=kAnbk𝒂k+𝒄=𝒙n,{\bm{f}}({\bm{y}})=\sum_{k\in A_{n}}{\bm{a}}_{k}({\bm{w}}_{k}^{\top}{\bm{y}}+b_{k})+{\bm{V}}{\bm{y}}+{\bm{c}}=\sum_{k\in A_{n}}({\bm{a}}_{k}{\bm{w}}_{k}^{\top}+{\bm{V}}){\bm{y}}+\sum_{k\in A_{n}}b_{k}{\bm{a}}_{k}+{\bm{c}}=\sum_{k\in A_{n}}b_{k}{\bm{a}}_{k}+{\bm{c}}={\bm{x}}_{n},

which implies 𝒄=𝒙nkAnbk𝒂k{\bm{c}}={\bm{x}}_{n}-\sum_{k\in A_{n}}b_{k}{\bm{a}}_{k}, and since 𝒙n𝒮{\bm{x}}_{n}\in{\mathcal{S}} this implies some 𝒂k𝒮{\bm{a}}_{k}\not\in{\mathcal{S}}, proving the claim. ∎

Lemma 7.

Suppose the clean training samples {𝐱n}n=1N\{{\bm{x}}_{n}\}_{n=1}^{N} belong to an rr-dimensional subspace 𝒮d{\mathcal{S}}\subset\mathbb{R}^{d}, and let 𝐏𝒮{\bm{P}}_{\mathcal{S}} denote the orthogonal projector onto 𝒮{\mathcal{S}}. Let 𝐟{\bm{f}} be any network satisfying 𝐟(B(𝐱n,ρ))={𝐱n}{\bm{f}}(B({\bm{x}}_{n},\rho))=\{{\bm{x}}_{n}\}. Define 𝐡(𝐲)=𝐟(𝐏𝒮𝐲){\bm{h}}({\bm{y}})={\bm{f}}({\bm{P}}_{\mathcal{S}}{\bm{y}}). Then 𝐡(B(𝐱n,ρ))={𝐱n}{\bm{h}}(B({\bm{x}}_{n},\rho))=\{{\bm{x}}_{n}\} and R(𝐡)R(𝐟)R({\bm{h}})\leq R({\bm{f}}), with strict inequality if 𝐡𝐟{\bm{h}}\neq{\bm{f}}.

Proof.

Define {\bm{P}}_{\mathcal{S}}^{-1}(B({\bm{x}}_{n},\rho)):=\{{\bm{y}}\in\mathbb{R}^{d}:{\bm{P}}_{\mathcal{S}}{\bm{y}}\in B({\bm{x}}_{n},\rho)\}, i.e., the set of points in \mathbb{R}^{d} whose projection onto {\mathcal{S}} is contained in B({\bm{x}}_{n},\rho). Then clearly {\bm{h}}({\bm{P}}_{\mathcal{S}}^{-1}(B({\bm{x}}_{n},\rho)))=\{{\bm{x}}_{n}\}. Also, since {\bm{x}}_{n}\in{\mathcal{S}}, for any {\bm{y}}\in B({\bm{x}}_{n},\rho) we have \|{\bm{P}}_{\mathcal{S}}{\bm{y}}-{\bm{x}}_{n}\|=\|{\bm{P}}_{\mathcal{S}}({\bm{y}}-{\bm{x}}_{n})\|\leq\|{\bm{y}}-{\bm{x}}_{n}\|\leq\rho, so B({\bm{x}}_{n},\rho)\subset{\bm{P}}_{\mathcal{S}}^{-1}(B({\bm{x}}_{n},\rho)). Therefore, {\bm{h}}(B({\bm{x}}_{n},\rho))=\{{\bm{x}}_{n}\} for all n=1,...,N.

Next, let 𝒇(𝒚)=k=1K𝒂k[𝒘k𝒚+bk]++𝑽𝒚+𝒄{\bm{f}}({\bm{y}})=\sum_{k=1}^{K}{\bm{a}}_{k}[{\bm{w}}_{k}^{\top}{\bm{y}}+b_{k}]_{+}+{\bm{V}}{\bm{y}}+{\bm{c}} be a minimal representative of 𝒇{\bm{f}}. Then h(𝒚)=k=1K𝒂k[(𝑷𝒮𝒘k)𝒚+bk]++𝑽𝑷𝒮𝒚+𝒄h({\bm{y}})=\sum_{k=1}^{K}{\bm{a}}_{k}[({\bm{P}}_{\mathcal{S}}{\bm{w}}_{k})^{\top}{\bm{y}}+b_{k}]_{+}+{\bm{V}}{\bm{P}}_{\mathcal{S}}{\bm{y}}+{\bm{c}}. Since 𝑷𝒮𝒘k𝒘k\|{\bm{P}}_{\mathcal{S}}{\bm{w}}_{k}\|\leq\|{\bm{w}}_{k}\| for all kk, we have R(𝒉)k=1K𝒂k𝑷𝒮𝒘kk=1K𝒂k𝒘k=R(𝒇)R({\bm{h}})\leq\sum_{k=1}^{K}\|{\bm{a}}_{k}\|\|{\bm{P}}_{\mathcal{S}}{\bm{w}}_{k}\|\leq\sum_{k=1}^{K}\|{\bm{a}}_{k}\|\|{\bm{w}}_{k}\|=R({\bm{f}}).

Finally, we show that if {\bm{f}}\neq{\bm{h}}, then R({\bm{h}})<R({\bm{f}}). Observe that if any of the inner-layer weight vectors {\bm{w}}_{k}\not\in{\mathcal{S}} then \|{\bm{P}}_{\mathcal{S}}{\bm{w}}_{k}\|<\|{\bm{w}}_{k}\|, which implies R({\bm{h}})<R({\bm{f}}). Hence, it suffices to show that {\bm{f}}\neq{\bm{h}} implies some {\bm{w}}_{k}\not\in{\mathcal{S}}. First, consider the case {\bm{V}}{\bm{P}}_{\mathcal{S}}={\bm{V}}. Then {\bm{f}}\neq{\bm{h}} if and only if {\bm{P}}_{\mathcal{S}}{\bm{w}}_{k}\neq{\bm{w}}_{k} for some k, or equivalently, {\bm{w}}_{k}\not\in{\mathcal{S}}. Next, consider the case {\bm{V}}{\bm{P}}_{\mathcal{S}}\neq{\bm{V}}. Fix any training sample index n, and let A_{n} denote the index set of units active over B({\bm{x}}_{n},\rho). Then, since {\bm{f}}({\bm{y}}) is constant for all {\bm{y}}\in B({\bm{x}}_{n},\rho), the Jacobian \bm{\partial}{\bm{f}}({\bm{y}})=\sum_{k\in A_{n}}{\bm{a}}_{k}{\bm{w}}_{k}^{\top}+{\bm{V}}=\bm{0} for all {\bm{y}}\in B({\bm{x}}_{n},\rho). This gives {\bm{V}}=-\sum_{k\in A_{n}}{\bm{a}}_{k}{\bm{w}}_{k}^{\top}. Therefore, if {\bm{V}}{\bm{P}}_{\mathcal{S}}\neq{\bm{V}} (i.e., at least one row of {\bm{V}} is not contained in {\mathcal{S}}) then at least one of the {\bm{w}}_{k}\not\in{\mathcal{S}}, proving the claim. ∎

We now give the proofs of Theorem 2 and Corollary 1.

Proof of Theorem 2.

Let {\bm{f}}^{*} be any min-cost solution. Applying Lemmas 6 and 7, we see that it must be the case that {\bm{f}}^{*}({\bm{y}})={\bm{P}}_{\mathcal{S}}{\bm{f}}^{*}({\bm{P}}_{\mathcal{S}}{\bm{y}}) for all {\bm{y}}\in\mathbb{R}^{d}, since otherwise the representation cost of {\bm{f}}^{*} could be strictly reduced. ∎

Proof of Corollary 1.

Suppose 𝒮{\mathcal{S}} is one-dimensional, i.e., 𝒮=span{𝒖}{\mathcal{S}}=\text{span}\{{\bm{u}}\} for some unit vector 𝒖d{\bm{u}}\in\mathbb{R}^{d}. Theorem 2 shows we can express any min-cost solution 𝒇{\bm{f}}^{*} as

𝒇(𝒚)=k=1Kak𝒖[sk𝒖𝒚+bk]++v𝒖𝒖𝒚+c𝒖=𝒖ϕ(𝒖𝒚){\bm{f}}^{*}({\bm{y}})=\sum_{k=1}^{K}a_{k}{\bm{u}}[s_{k}{\bm{u}}^{\top}{\bm{y}}+b_{k}]_{+}+v{\bm{u}}{\bm{u}}^{\top}{\bm{y}}+c{\bm{u}}={\bm{u}}\phi({\bm{u}}^{\top}{\bm{y}})

where

ϕ(t)=kak[𝒔kt+bk]++vt+c\phi(t)=\sum_{k}a_{k}[{\bm{s}}_{k}t+b_{k}]_{+}+vt+c

is such that R({\bm{f}}^{*})=R(\phi). Therefore, minimizing the representation cost subject to the norm-ball interpolation constraints reduces to a univariate problem:

minϕR(ϕ)s.t.ϕ((cnρ,cn+ρ))=cn\min_{\phi}R(\phi)~{}~{}s.t.~{}~{}\phi((c_{n}-\rho,c_{n}+\rho))=c_{n}

where cn=𝒖𝒙nc_{n}={\bm{u}}^{\top}{\bm{x}}_{n}. The minimizing ϕ\phi^{*} is unique and coincides with the 1-D denoiser f1Df^{*}_{1D} in (17) with xn=cnx_{n}=c_{n} and ϵnmax=ϵnmin=ρ\epsilon_{n}^{\mathrm{max}}=-\epsilon_{n}^{\mathrm{min}}=\rho. ∎
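In code, the resulting rank-one denoiser can be assembled directly from the 1-D solution. The sketch below (ours) builds {\bm{f}}^{*}({\bm{y}})={\bm{u}}\phi({\bm{u}}^{\top}{\bm{y}}) for clean points on a line through the origin, reusing the piecewise-linear 1-D construction with \epsilon_{n}^{\mathrm{max}}=-\epsilon_{n}^{\mathrm{min}}=\rho, and checks that it maps points inside the balls B({\bm{x}}_{n},\rho) back to the clean points.

```python
import numpy as np

d, rho = 5, 0.1
rng = np.random.default_rng(0)
u = rng.standard_normal(d); u /= np.linalg.norm(u)   # direction of the 1-D subspace S
c = np.array([-1.5, -0.5, 0.8, 1.7])                 # 1-D coordinates c_n = u^T x_n
X = c[:, None] * u                                   # clean points on the line

# 1-D min-cost denoiser phi with eps_max = -eps_min = rho (cf. Corollary 1)
knots_t = np.ravel(np.column_stack([c - rho, c + rho]))
knots_f = np.repeat(c, 2)
phi = lambda t: np.interp(t, knots_t, knots_f)
f_star = lambda Y: phi(Y @ u)[:, None] * u           # f*(y) = u * phi(u^T y)

# Noisy queries inside the balls B(x_n, rho) are mapped exactly to the clean points
noise = rng.standard_normal(X.shape)
noise *= 0.9 * rho / np.linalg.norm(noise, axis=1, keepdims=True)
print(np.allclose(f_star(X + noise), X))             # True
```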

C.2 Data along rays

We begin with a key lemma that is central to the proof of Theorem 3.

Lemma 8.

Suppose \{{\bm{u}}_{i}\}_{i=1}^{n}\subset\mathbb{R}^{d} is a collection of n>1 unit vectors such that {\bm{u}}_{i}^{\top}{\bm{u}}_{j}<0 for all i\neq j, and {\bm{w}} is a unit vector such that {\bm{u}}_{i}^{\top}{\bm{w}}>0 for all i=1,...,n. Let {\bm{a}}\in\mathbb{R}^{d} be any nonzero vector. Then,

i=1n|𝒖i𝒂|(𝒖i𝒘)<𝒂.\sum_{i=1}^{n}|{\bm{u}}_{i}^{\top}{\bm{a}}|({\bm{u}}_{i}^{\top}{\bm{w}})<\|{\bm{a}}\|.

Before giving the proof of Lemma 8, we first prove an auxiliary result.

Lemma 9.

Let 𝐚d{\bm{a}}\in\mathbb{R}^{d} be a unit vector, and suppose {𝐮i}i=1nd\{{\bm{u}}_{i}\}_{i=1}^{n}\subset\mathbb{R}^{d} is a collection of unit vectors such that 𝐮i𝐚>0{\bm{u}}_{i}^{\top}{\bm{a}}>0 for all ii and 𝐮i𝐮j<0{\bm{u}}_{i}^{\top}{\bm{u}}_{j}<0 for all iji\neq j. Let 𝐛=i=1n𝐮i𝐮i𝐚{\bm{b}}=\sum_{i=1}^{n}{\bm{u}}_{i}{\bm{u}}_{i}^{\top}{\bm{a}}. Then 𝐛𝐚/21/2\|{\bm{b}}-{\bm{a}}/2\|\leq 1/2 with strict inequality if n>1n>1.

Proof.

It suffices to show 𝒃𝒂𝒃𝒃{\bm{b}}^{\top}{\bm{a}}\geq{\bm{b}}^{\top}{\bm{b}}, since if this is the case then

𝒃𝒂/2\displaystyle\|{\bm{b}}-{\bm{a}}/2\| =(𝒃𝒂/2)(𝒃𝒂/2)=𝒃𝒃𝒃𝒂+14𝒂𝒂12,\displaystyle=\sqrt{({\bm{b}}-{\bm{a}}/2)^{\top}({\bm{b}}-{\bm{a}}/2)}=\sqrt{{\bm{b}}^{\top}{\bm{b}}-{\bm{b}}^{\top}{\bm{a}}+\frac{1}{4}{\bm{a}}^{\top}{\bm{a}}}\leq\frac{1}{2}, (57)

which also holds with strict inequality when 𝒃𝒂>𝒃𝒃{\bm{b}}^{\top}{\bm{a}}>{\bm{b}}^{\top}{\bm{b}}.

First, if n=1, then {\bm{b}}={\bm{u}}_{1}{\bm{u}}_{1}^{\top}{\bm{a}}, and so {\bm{b}}^{\top}{\bm{b}}={\bm{a}}^{\top}{\bm{u}}_{1}{\bm{u}}_{1}^{\top}{\bm{u}}_{1}{\bm{u}}_{1}^{\top}{\bm{a}}={\bm{a}}^{\top}{\bm{u}}_{1}{\bm{u}}_{1}^{\top}{\bm{a}}={\bm{b}}^{\top}{\bm{a}}. Now assume n>1. Then we have

𝒃𝒂\displaystyle{\bm{b}}^{\top}{\bm{a}} =i=1n𝒂𝒖i𝒖i𝒂\displaystyle=\sum_{i=1}^{n}{\bm{a}}^{\top}{\bm{u}}_{i}{\bm{u}}_{i}^{\top}{\bm{a}} (58)
=i=1n𝒂𝒖i𝒖i𝒖i𝒖i𝒂\displaystyle=\sum_{i=1}^{n}{\bm{a}}^{\top}{\bm{u}}_{i}{\bm{u}}_{i}^{\top}{\bm{u}}_{i}{\bm{u}}_{i}^{\top}{\bm{a}} (59)
\displaystyle>\sum_{i=1}^{n}\sum_{j=1}^{n}{\bm{a}}^{\top}{\bm{u}}_{i}{\bm{u}}_{i}^{\top}{\bm{u}}_{j}{\bm{u}}_{j}^{\top}{\bm{a}} (60)
=𝒃𝒃\displaystyle={\bm{b}}^{\top}{\bm{b}} (61)

The first and last equalities hold by definition. The second equality holds because each 𝒖i{\bm{u}}_{i} is a unit vector. The last inequality holds because for iji\neq j

(𝒂𝒖i)(𝒖i𝒖j)(𝒖j𝒂)<0,({\bm{a}}^{\top}{\bm{u}}_{i})({\bm{u}}_{i}^{\top}{\bm{u}}_{j})({\bm{u}}_{j}^{\top}{\bm{a}})<0,

since 𝒖i𝒖j<0{\bm{u}}_{i}^{\top}{\bm{u}}_{j}<0 and 𝒂𝒖i,𝒂𝒖j>0{\bm{a}}^{\top}{\bm{u}}_{i},{\bm{a}}^{\top}{\bm{u}}_{j}>0 by assumption. ∎

Now we give the proof of Lemma 8.

Proof of Lemma 8.

Let 𝒗=i=1n|𝒖i𝒂|𝒖i{\bm{v}}=\sum_{i=1}^{n}|{\bm{u}}_{i}^{\top}{\bm{a}}|{\bm{u}}_{i}. Then we have

S=i=1n|𝒖i𝒂|(𝒖i𝒘)=𝒗𝒘S=\sum_{i=1}^{n}|{\bm{u}}_{i}^{\top}{\bm{a}}|({\bm{u}}_{i}^{\top}{\bm{w}})={\bm{v}}^{\top}{\bm{w}}

WLOG, we may assume 𝒂=1\|{\bm{a}}\|=1, in which case the lemma reduces to showing S=𝒗𝒘<1S={\bm{v}}^{\top}{\bm{w}}<1. By the Cauchy-Schwartz inequality, it suffices to show 𝒗<1\|{\bm{v}}\|<1. Toward this end, let us write 𝒗=𝒗++𝒗{\bm{v}}={\bm{v}}_{+}+{\bm{v}}_{-} where 𝒗+=i:𝒖i𝒂>0|𝒖i𝒂|𝒖i=i:𝒖i𝒂>0𝒖i𝒖i𝒂{\bm{v}}_{+}=\sum_{i:{\bm{u}}_{i}^{\top}{\bm{a}}>0}|{\bm{u}}_{i}^{\top}{\bm{a}}|{\bm{u}}_{i}=\sum_{i:{\bm{u}}_{i}^{\top}{\bm{a}}>0}{\bm{u}}_{i}{\bm{u}}_{i}^{\top}{\bm{a}} and 𝒗=i:𝒖i𝒂<0|𝒖i𝒂|𝒖i=i:𝒖i𝒂<0𝒖i𝒖i(𝒂){\bm{v}}_{-}=\sum_{i:{\bm{u}}_{i}^{\top}{\bm{a}}<0}|{\bm{u}}_{i}^{\top}{\bm{a}}|{\bm{u}}_{i}=\sum_{i:{\bm{u}}_{i}^{\top}{\bm{a}}<0}{\bm{u}}_{i}{\bm{u}}_{i}^{\top}(-{\bm{a}}). By Lemma 9 we have 𝒗+𝒂/21/2\|{\bm{v}}_{+}-{\bm{a}}/2\|\leq 1/2 and 𝒗+𝒂/21/2\|{\bm{v}}_{-}+{\bm{a}}/2\|\leq 1/2, and so

𝒗\displaystyle\|{\bm{v}}\| =𝒗++𝒗\displaystyle=\|{\bm{v}}_{+}+{\bm{v}}_{-}\| (62)
=𝒗+𝒂/2+𝒗+𝒂/2\displaystyle=\|{\bm{v}}_{+}-{\bm{a}}/2+{\bm{v}}_{-}+{\bm{a}}/2\| (63)
𝒗+𝒂/2+𝒗+𝒂/2\displaystyle\leq\|{\bm{v}}_{+}-{\bm{a}}/2\|+\|{\bm{v}}_{-}+{\bm{a}}/2\| (64)
1.\displaystyle\leq 1. (65)

Also, if either of the index sets I_{+}:=\{i:{\bm{u}}_{i}^{\top}{\bm{a}}>0\} or I_{-}:=\{i:{\bm{u}}_{i}^{\top}{\bm{a}}<0\} has cardinality greater than one, then Lemma 9 guarantees \|{\bm{v}}_{+}-{\bm{a}}/2\|<1/2 or \|{\bm{v}}_{-}+{\bm{a}}/2\|<1/2, and so \|{\bm{v}}\|<1, which gives S<1.

It remains to show $S<1$ when both $I_{+}$ and $I_{-}$ have cardinality at most one, i.e., $|I_{+}|\leq 1$ and $|I_{-}|\leq 1$. We consider the various possibilities:

Case 1: |I+|=0|I_{+}|=0 & |I|=0|I_{-}|=0. Then 𝒗=𝟎{\bm{v}}=\bm{0} and so S=𝒗𝒘=0<1S={\bm{v}}^{\top}{\bm{w}}=0<1.

Case 2: |I+|=1|I_{+}|=1 & |I|=0|I_{-}|=0 or |I+|=0|I_{+}|=0 & |I|=1|I_{-}|=1. Then 𝒗=|𝒖i𝒂|𝒖i{\bm{v}}=|{\bm{u}}_{i}^{\top}{\bm{a}}|{\bm{u}}_{i} for some ii, and |𝒖j𝒂|=0|{\bm{u}}_{j}^{\top}{\bm{a}}|=0 for all jij\neq i. By way of contradiction, assume that S=𝒗𝒘=|𝒖i𝒂|𝒖i𝒘=1S={\bm{v}}^{\top}{\bm{w}}=|{\bm{u}}_{i}^{\top}{\bm{a}}|{\bm{u}}_{i}^{\top}{\bm{w}}=1. Since 𝒖i{\bm{u}}_{i}, 𝒂{\bm{a}}, and 𝒘{\bm{w}} are all unit vectors, the only way this is possible is if |𝒖i𝒂|=1|{\bm{u}}_{i}^{\top}{\bm{a}}|=1 and 𝒖i𝒘=1{\bm{u}}_{i}^{\top}{\bm{w}}=1, which implies 𝒂=±𝒘{\bm{a}}=\pm{\bm{w}} and 𝒖i=𝒘{\bm{u}}_{i}={\bm{w}}, and so 𝒂=±𝒖i{\bm{a}}=\pm{\bm{u}}_{i}. However, this shows that 𝒖j𝒖i=0{\bm{u}}_{j}^{\top}{\bm{u}}_{i}=0 for all jij\neq i, contradicting our assumption that 𝒖j𝒖i<0{\bm{u}}_{j}^{\top}{\bm{u}}_{i}<0 for all iji\neq j. Therefore, S<1S<1 in this case as well.

Case 3: |I+|=1|I_{+}|=1 & |I|=1|I_{-}|=1. Then 𝒗=|𝒖i𝒂|𝒖i+|𝒖j𝒂|𝒖j{\bm{v}}=|{\bm{u}}_{i}^{\top}{\bm{a}}|{\bm{u}}_{i}+|{\bm{u}}_{j}^{\top}{\bm{a}}|{\bm{u}}_{j} for some iI+i\in I_{+} and jIj\in I_{-}, and so

𝒗\displaystyle\|{\bm{v}}\| =|𝒖i𝒂|𝒖i+|𝒖j𝒂|𝒖j\displaystyle=\||{\bm{u}}_{i}^{\top}{\bm{a}}|{\bm{u}}_{i}+|{\bm{u}}_{j}^{\top}{\bm{a}}|{\bm{u}}_{j}\| (66)
=(𝒖i𝒖i𝒖j𝒖j)𝒂\displaystyle=\|({\bm{u}}_{i}{\bm{u}}_{i}^{\top}-{\bm{u}}_{j}{\bm{u}}_{j}^{\top}){\bm{a}}\| (67)
𝒖i𝒖i𝒖j𝒖j\displaystyle\leq\|{\bm{u}}_{i}{\bm{u}}_{i}^{\top}-{\bm{u}}_{j}{\bm{u}}_{j}^{\top}\| (68)

where the last inequality follows from the fact that $\|{\bm{a}}\|=1$. Finally, since the matrix ${\bm{u}}_{i}{\bm{u}}_{i}^{\top}-{\bm{u}}_{j}{\bm{u}}_{j}^{\top}$ is symmetric, its operator norm $\|{\bm{u}}_{i}{\bm{u}}_{i}^{\top}-{\bm{u}}_{j}{\bm{u}}_{j}^{\top}\|$ coincides with its maximum eigenvalue (in absolute value). Eigenvectors in the span of ${\bm{u}}_{i}$ and ${\bm{u}}_{j}$ have the form ${\bm{z}}=c_{i}{\bm{u}}_{i}+c_{j}{\bm{u}}_{j}$, where $c_{i},c_{j}$ satisfy the equation

(𝒖i𝒖i𝒖j𝒖j)(ci𝒖i+cj𝒖j)=λ(ci𝒖i+cj𝒖j)({\bm{u}}_{i}{\bm{u}}_{i}^{\top}-{\bm{u}}_{j}{\bm{u}}_{j}^{\top})(c_{i}{\bm{u}}_{i}+c_{j}{\bm{u}}_{j})=\lambda(c_{i}{\bm{u}}_{i}+c_{j}{\bm{u}}_{j}) (69)

where λ\lambda\in\mathbb{R} is the corresponding eigenvalue. Expanding and equating coefficients gives the system

[1λ𝒖i𝒖j𝒖i𝒖j1+λ][cicj]=[00]\begin{bmatrix}1-\lambda&{\bm{u}}_{i}^{\top}{\bm{u}}_{j}\\ {\bm{u}}_{i}^{\top}{\bm{u}}_{j}&1+\lambda\end{bmatrix}\begin{bmatrix}c_{i}\\ c_{j}\end{bmatrix}=\begin{bmatrix}0\\ 0\end{bmatrix} (70)

which has non-trivial solutions iff

(1λ)(1+λ)(𝒖i𝒖j)2=0λ=±1(𝒖i𝒖j)2(1-\lambda)(1+\lambda)-({\bm{u}}_{i}^{\top}{\bm{u}}_{j})^{2}=0~{}~{}\iff~{}~{}\lambda=\pm\sqrt{1-({\bm{u}}_{i}^{\top}{\bm{u}}_{j})^{2}} (71)

and so |λ|<1|\lambda|<1. Therefore, 𝒗𝒖i𝒖i𝒖j𝒖j<1\|{\bm{v}}\|\leq\|{\bm{u}}_{i}{\bm{u}}_{i}^{\top}-{\bm{u}}_{j}{\bm{u}}_{j}^{\top}\|<1 as claimed. And so S𝒗<1S\leq\|{\bm{v}}\|<1. ∎

Figure 6: Illustration of “unit splitting” technique used in proof of Theorem 3. If a ReLU unit is active along two rays (left panel), it can be split into two units with lower representation cost by projecting its inner-layer weight vector 𝒘{\bm{w}} onto the two rays separately (right panel).
Proof of Theorem 3.

Let 𝒇{\bm{f}}^{*} be any representation cost minimizer. We prove the claim by showing that if 𝒇{\bm{f}}^{*} fails to have the form specified in (20) then it is possible to construct a norm-ball interpolant 𝒉{\bm{h}} whose units are aligned with the rays that has strictly smaller representation cost than 𝒇{\bm{f}}^{*}.

First, let ${\bm{f}}^{*}({\bm{y}})=\sum_{k=1}^{K}{\bm{a}}_{k}[{\bm{w}}_{k}^{\top}{\bm{y}}+b_{k}]_{+}+{\bm{V}}{\bm{y}}+{\bm{c}}$ be any minimal representative of ${\bm{f}}^{*}$. By properties of minimal representatives, none of the ReLU boundary sets $\{{\bm{y}}\in\mathbb{R}^{d}:{\bm{w}}_{k}^{\top}{\bm{y}}+b_{k}=0\}$ can intersect any of the norm-balls centered at the training samples, since otherwise ${\bm{f}}^{*}$ would be non-constant on one of the norm-balls. Also, we may alter this parameterization in such a way that the active set of every unit avoids the ball centered at the origin, $B_{0}:=B(\bm{0},\rho)$, without changing the representation cost. In particular, suppose the active set of the $k$th unit contains $B_{0}$. Then, using the identity $[t]_{+}=t-[-t]_{+}$ for all $t\in\mathbb{R}$, we have

𝒂k[𝒘k𝒚+bk]+=𝒂k(𝒘k𝒚+bk)𝒂k[𝒘k𝒚bk]+{\bm{a}}_{k}[{\bm{w}}_{k}^{\top}{\bm{y}}+b_{k}]_{+}={\bm{a}}_{k}({\bm{w}}_{k}^{\top}{\bm{y}}+b_{k})-{\bm{a}}_{k}[-{\bm{w}}_{k}^{\top}{\bm{y}}-b_{k}]_{+}

for all ${\bm{y}}\in\mathbb{R}^{d}$. The active set of the reversed unit $-{\bm{a}}_{k}[-{\bm{w}}_{k}^{\top}{\bm{y}}-b_{k}]_{+}$ does not intersect $B_{0}$. Therefore, after applying this transformation to all units whose active sets contain $B_{0}$, we can write ${\bm{f}}^{*}={\bm{g}}^{*}+\bm{\ell}$ where ${\bm{g}}^{*}$ is a sum of ReLU units whose active sets do not contain $B_{0}$ and $\bm{\ell}$ is an affine function that combines the original affine part ${\bm{V}}{\bm{y}}+{\bm{c}}$ with a sum of linear terms of the form ${\bm{a}}_{k}({\bm{w}}_{k}^{\top}{\bm{y}}+b_{k})$ arising from the transformation above. However, because of the interpolation constraint ${\bm{f}}^{*}(B_{0})=\{\bm{0}\}$, and by the fact that no units in ${\bm{g}}^{*}$ are active over $B_{0}$, for all ${\bm{y}}\in B_{0}$ we have

𝒇(𝒚)=𝒈(𝒚)+(𝒚)=(𝒚)=𝟎{\bm{f}}^{*}({\bm{y}})={\bm{g}}^{*}({\bm{y}})+\bm{\ell}({\bm{y}})=\bm{\ell}({\bm{y}})=\bm{0} (72)

which implies $\bm{\ell}\equiv\bm{0}$ (an affine function that vanishes on an open ball must vanish identically), i.e., ${\bm{f}}^{*}({\bm{y}})={\bm{g}}^{*}({\bm{y}})$ for all ${\bm{y}}$. Finally, note that the inner- and outer-layer weight vectors of the reversed units change only in sign. Therefore, this re-parameterization does not change the representation cost, and so it is also a minimal representative.

Now we construct a new norm-ball interpolant 𝒉{\bm{h}} as follows. Let 𝒇{\bm{f}}_{\ell} be the function defined as the sum of all units making up the re-parametrized 𝒇{\bm{f}}^{*} that are active on at least one norm-ball centered on the \ellth ray. Also, let 𝑷=𝒖𝒖d×d{\bm{P}}_{\ell}={\bm{u}}_{\ell}{\bm{u}}_{\ell}^{\top}\in\mathbb{R}^{d\times d} denote the orthogonal projector onto the \ellth ray. Define 𝒉==1L𝒉{\bm{h}}=\sum_{\ell=1}^{L}{\bm{h}}_{\ell} where 𝒉(𝒚):=𝑷𝒇(𝑷𝒚){\bm{h}}_{\ell}({\bm{y}}):={\bm{P}}_{\ell}{\bm{f}}_{\ell}({\bm{P}}_{\ell}{\bm{y}}). Put in words, 𝒉{\bm{h}} is constructed by “splitting” any units active over multiple rays into a sum of multiple units aligned with each ray; see Figure 6 for an illustration.

First, we prove that 𝒉{\bm{h}} satisfies norm-ball interpolation constraints. Let 𝒂[𝒘𝒚+b]+{\bm{a}}[{\bm{w}}^{\top}{\bm{y}}+b]_{+} denote any unit belonging to 𝒇{\bm{f}}_{\ell}. Since this unit is active on a norm-ball centered on ray \ell, this implies that 𝒘𝒖>0{\bm{w}}^{\top}{\bm{u}}_{\ell}>0, and since the unit is inactive on B0B_{0} we must have b<0b<0. Also, because we assume 𝒖𝒖m<0{\bm{u}}_{\ell}^{\top}{\bm{u}}_{m}<0 for any mm\neq\ell, we see that the projected unit 𝑷𝒂[𝒘𝑷𝒚+b]+=𝑷𝒂[(𝒘𝒖)𝒖𝒚+b]+{\bm{P}}_{\ell}{\bm{a}}[{\bm{w}}^{\top}{\bm{P}}_{\ell}{\bm{y}}+b]_{+}={\bm{P}}_{\ell}{\bm{a}}[({\bm{w}}^{\top}{\bm{u}}_{\ell}){\bm{u}}_{\ell}^{\top}{\bm{y}}+b]_{+} is active on norm-balls centered on ray \ell and inactive on norm-balls centered on any other ray. In particular, if 𝒚{\bm{y}} belongs to a norm-ball centered on ray \ell, then 𝒉m(𝒚)=0{\bm{h}}_{m}({\bm{y}})=0 for all mm\neq\ell. Therefore, if 𝒚B(𝒙n(),ρ){\bm{y}}\in B({\bm{x}}_{n}^{(\ell)},\rho) where 𝒙n(){\bm{x}}_{n}^{(\ell)} denotes a training sample along ray \ell, then

𝒉(𝒚)\displaystyle{\bm{h}}({\bm{y}}) =m=1L𝒉m(𝒚)\displaystyle=\sum_{m=1}^{L}{\bm{h}}_{m}({\bm{y}}) (73)
=𝒉(𝒚)\displaystyle={\bm{h}}_{\ell}({\bm{y}}) (74)
=𝑷𝒇(𝑷𝒚)\displaystyle={\bm{P}}_{\ell}{\bm{f}}_{\ell}({\bm{P}}_{\ell}{\bm{y}}) (75)
=𝑷𝒇(𝑷𝒚)\displaystyle={\bm{P}}_{\ell}{\bm{f}}({\bm{P}}_{\ell}{\bm{y}}) (76)
=𝑷𝒙n()\displaystyle={\bm{P}}_{\ell}{\bm{x}}_{n}^{(\ell)} (77)
=𝒙n()\displaystyle={\bm{x}}_{n}^{(\ell)} (78)

where the fourth equality holds because ${\bm{P}}_{\ell}{\bm{y}}\in B({\bm{x}}_{n}^{(\ell)},\rho)$, so that every unit of ${\bm{f}}^{*}$ outside ${\bm{f}}_{\ell}$ is inactive at ${\bm{P}}_{\ell}{\bm{y}}$, and the last two equalities follow from the interpolation constraint ${\bm{f}}^{*}(B({\bm{x}}_{n}^{(\ell)},\rho))=\{{\bm{x}}_{n}^{(\ell)}\}$ together with ${\bm{P}}_{\ell}{\bm{x}}_{n}^{(\ell)}={\bm{x}}_{n}^{(\ell)}$. This shows that ${\bm{h}}$ satisfies the norm-ball interpolation constraints, as claimed.

Next, we show that if ${\bm{h}}\neq{\bm{f}}^{*}$, then ${\bm{h}}$ has strictly smaller representation cost. Let ${\bm{u}}({\bm{y}})={\bm{a}}[{\bm{w}}^{\top}{\bm{y}}+b]_{+}$ denote any ReLU unit belonging to ${\bm{f}}^{*}$. Since we assume $\|{\bm{w}}\|=1$, the contribution of the unit ${\bm{u}}$ to the representation cost of ${\bm{f}}^{*}$ is $\|{\bm{a}}\|$. Let ${\mathcal{R}}\subset\{1,2,...,L\}$ index the subset of rays $\ell$ for which the unit ${\bm{u}}$ is active over at least one norm-ball centered on that ray. In constructing the interpolant ${\bm{h}}$, the unit ${\bm{u}}$ gets mapped to a sum of $|{\mathcal{R}}|$ units:

𝑷𝒂[(𝑷𝒘)𝒚+b]+\sum_{\ell\in{\mathcal{R}}}{\bm{P}}_{\ell}{\bm{a}}[({\bm{P}}_{\ell}{\bm{w}})^{\top}{\bm{y}}+b]_{+}

whose contribution to the representation cost of 𝒉{\bm{h}} is bounded above by

C=𝑷𝒂𝑷𝒘=|𝒖𝒂||𝒖𝒘|.C=\sum_{\ell\in{\mathcal{R}}}\|{\bm{P}}_{\ell}{\bm{a}}\|\|{\bm{P}}_{\ell}{\bm{w}}\|=\sum_{\ell\in{\mathcal{R}}}|{\bm{u}}_{\ell}^{\top}{\bm{a}}||{\bm{u}}_{\ell}^{\top}{\bm{w}}|.

If $|{\mathcal{R}}|>1$, then Lemma 8 guarantees $C<\|{\bm{a}}\|$. And if $|{\mathcal{R}}|=1$, then $C=|{\bm{u}}_{\ell}^{\top}{\bm{a}}||{\bm{u}}_{\ell}^{\top}{\bm{w}}|\leq\|{\bm{a}}\|$, with strict inequality unless ${\bm{w}},{\bm{a}}$ belong to the span of ${\bm{u}}_{\ell}$. This implies that any min-cost solution must have all its units aligned with the rays, i.e., ${\bm{f}}^{*}$ must have the form ${\bm{f}}^{*}({\bm{y}})=\sum_{\ell=1}^{L}{\bm{u}}_{\ell}\phi_{\ell}({\bm{u}}_{\ell}^{\top}{\bm{y}})$, where each $\phi_{\ell}$ is a univariate function. The representation cost of any function ${\bm{f}}^{*}$ in this form is the sum of the (1-D) representation costs of the $\phi_{\ell}$. Therefore, each $\phi_{\ell}$ must also be the 1-D min-cost solution of the norm-ball constraints projected onto ray $\ell$, and so must have the form specified by (17). ∎
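To illustrate the cost reduction behind the unit-splitting step of Figure 6, the following is a minimal numerical sketch (our own illustration, not part of the proof; the random configuration is an arbitrary choice). A unit that is active along two rays must have ${\bm{w}}^{\top}{\bm{u}}_{\ell}>0$ for both ray directions; splitting it into its per-ray projections replaces its cost contribution $\|{\bm{a}}\|$ (with $\|{\bm{w}}\|=1$) by $\sum_{\ell}|{\bm{u}}_{\ell}^{\top}{\bm{a}}|({\bm{u}}_{\ell}^{\top}{\bm{w}})$, which Lemma 8 guarantees is strictly smaller:

# Numerical illustration of the unit-splitting cost comparison: the split
# units' cost sum_l |u_l^T a| * (u_l^T w) is strictly below ||a|| * ||w||.
import numpy as np

rng = np.random.default_rng(1)
d = 3

def unit(v):
    return v / np.linalg.norm(v)

# Ray directions with u1^T u2 < 0 and a unit inner weight w active toward
# both rays (w^T u1 > 0 and w^T u2 > 0); resample until all conditions hold.
while True:
    u1, u2, w = (unit(rng.standard_normal(d)) for _ in range(3))
    if u1 @ u2 < 0 and w @ u1 > 0 and w @ u2 > 0:
        break

a = rng.standard_normal(d)                                # outer-layer weight vector

cost_original = np.linalg.norm(a)                         # ||a|| * ||w||, with ||w|| = 1
cost_split = sum(abs(u @ a) * (u @ w) for u in (u1, u2))  # cost of the projected units

print(f"original unit cost: {cost_original:.4f}")
print(f"split units cost  : {cost_split:.4f}")
assert cost_split < cost_original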

C.2.1 Perturbations of samples along rays

As an extension of the above setting, suppose the training samples along rays 𝒙n(){{\bm{x}}}_{n}^{(\ell)} have been slightly perturbed, i.e., we consider training samples 𝒙~n()=𝒙n()+ϵn()\widetilde{{\bm{x}}}_{n}^{(\ell)}={\bm{x}}_{n}^{(\ell)}+{\bm{\epsilon}}_{n}^{(\ell)} for some vectors ϵn(){\bm{\epsilon}}_{n}^{(\ell)} with ϵn()<δ\|{\bm{\epsilon}}_{n}^{(\ell)}\|<\delta for some sufficiently small δ>0\delta>0. In particular, we make the following two assumptions:

  • (A1)

    Let Δ:={𝒙~n()𝒙~n1()}n=1N\Delta_{\ell}:=\{\widetilde{{\bm{x}}}_{n}^{(\ell)}-\widetilde{{\bm{x}}}_{n-1}^{(\ell)}\}_{n=1}^{N_{\ell}} be the collection of difference vectors between successive perturbed points along the \ellth ray (with 𝒙~0():=𝟎\widetilde{{\bm{x}}}_{0}^{(\ell)}:=\bm{0}). Assume that for all 𝒗Δ{\bm{v}}_{\ell}\in\Delta_{\ell} and for all 𝒗kΔk{\bm{v}}_{k}\in\Delta_{k} with kk\neq\ell, we have 𝒗𝒗k<0{\bm{v}}_{\ell}^{\top}{\bm{v}}_{k}<0.

  • (A2)

    If HH is any halfspace not containing the origin that contains the ball B(𝒙~n();ρ)B(\tilde{{\bm{x}}}^{(\ell)}_{n};\rho), then HH also contains all successive balls along that ray, i.e., HH contains B(𝒙~m();ρ)B(\tilde{{\bm{x}}}^{(\ell)}_{m};\rho) for mnm\geq n.

Note that if the original points 𝒙n(){{\bm{x}}}_{n}^{(\ell)} belong to rays that satisfy the conditions of Theorem 3, then for sufficiently small δ\delta the perturbed samples 𝒙~n()\widetilde{{\bm{x}}}_{n}^{(\ell)} will satisfy assumptions (A1)&(A2) above.

We show in this case the norm-ball interpolating representation cost minimizer is a perturbed version of the min-cost solution for the unperturbed samples as identified in Theorem 3. In particular, the ReLU boundaries align with the line segments connecting successive points along the rays. See Figure 7 for an illustration in the case of L=2L=2 rays.

Figure 7: Change in unit alignment of minimal representation cost solutions under perturbations of training samples belonging to rays. The red lines represent the ReLU boundaries of the representation cost minimizing solution satisfying norm-ball interpolation constraints. For the unperturbed samples in (a), all ReLU boundaries align perpendicular to the rays. For the perturbed samples in (b), the ReLU boundaries align perpendicular to the line segments connecting successive samples.
Proposition 3.

Suppose the training samples $X$ are a perturbation of data belonging to a union of $L$ rays plus a sample at the origin, i.e., $X=\{\bm{0}\}\cup\{\widetilde{{\bm{x}}}_{n}^{(1)}\}_{n=1}^{N_{1}}\cup\cdots\cup\{\widetilde{{\bm{x}}}_{n}^{(L)}\}_{n=1}^{N_{L}}$ where $\widetilde{{\bm{x}}}_{n}^{(\ell)}=c^{(\ell)}_{n}{\bm{u}}_{\ell}+{\bm{\epsilon}}^{(\ell)}_{n}$ for some unit vector ${\bm{u}}_{\ell}$, constants $0<c^{(\ell)}_{1}<c^{(\ell)}_{2}<\cdots<c^{(\ell)}_{N_{\ell}}$, and vectors ${\bm{\epsilon}}^{(\ell)}_{n}$. Assume (A1)&(A2) above hold. Then the minimizer ${\bm{f}}^{*}$ of (19) is unique and is given by

𝒇(𝒚)==1Ln=1N𝒖n()ϕn()((𝒖n())(𝒚𝒙~n1())){\bm{f}}^{*}({\bm{y}})=\sum_{\ell=1}^{L}\sum_{n=1}^{N_{\ell}}{\bm{u}}_{n}^{(\ell)}\phi_{n}^{(\ell)}(({\bm{u}}_{n}^{(\ell)})^{\top}({\bm{y}}-\widetilde{{\bm{x}}}_{n-1}^{(\ell)})) (79)

where ${\bm{u}}_{n}^{(\ell)}=\frac{\widetilde{{\bm{x}}}_{n}^{(\ell)}-\widetilde{{\bm{x}}}_{n-1}^{(\ell)}}{\|\widetilde{{\bm{x}}}_{n}^{(\ell)}-\widetilde{{\bm{x}}}_{n-1}^{(\ell)}\|}$ and $\phi_{n}^{(\ell)}(t)=s_{n}^{(\ell)}([t-a_{n}^{(\ell)}]_{+}-[t-b_{n}^{(\ell)}]_{+})$ with $a_{n}^{(\ell)}=\rho$, $b_{n}^{(\ell)}=\|\widetilde{{\bm{x}}}_{n}^{(\ell)}-\widetilde{{\bm{x}}}_{n-1}^{(\ell)}\|-\rho$, and $s_{n}^{(\ell)}=\|\widetilde{{\bm{x}}}_{n}^{(\ell)}-\widetilde{{\bm{x}}}_{n-1}^{(\ell)}\|/(b_{n}^{(\ell)}-a_{n}^{(\ell)})$.

Proof.

Let 𝒇{\bm{f}}^{*} be any min-cost solution. Following the same steps as in the proof of Theorem 3, there is a minimal representative of 𝒇{\bm{f}}^{*} having the form 𝒇(𝒚)=k𝒂k[𝒘k𝒚+bk]+{\bm{f}}^{*}({\bm{y}})=\sum_{k}{\bm{a}}_{k}[{\bm{w}}_{k}^{\top}{\bm{y}}+b_{k}]_{+} where the active sets of all units in this representation have empty intersection with the ball B(𝟎,ρ)B(\bm{0},\rho).

Let ${\bm{f}}_{n}^{(\ell)}$ denote the sum of all units in ${\bm{f}}^{*}$ whose active set boundary $\{{\bm{y}}\in\mathbb{R}^{d}:{\bm{w}}_{k}^{\top}{\bm{y}}+b_{k}=0\}$ intersects the line segment connecting the training samples $\widetilde{{\bm{x}}}_{n-1}^{(\ell)}$ and $\widetilde{{\bm{x}}}_{n}^{(\ell)}$. By assumption (A2), this is equivalent to the sum of units that are active over the balls $B(\widetilde{{\bm{x}}}_{m}^{(\ell)},\rho)$ for $m\geq n$, and inactive over the balls with $m<n$. In particular, for ${\bm{y}}\in B(\widetilde{{\bm{x}}}_{n}^{(\ell)},\rho)$, we have ${\bm{f}}^{*}({\bm{y}})=\sum_{m\leq n}{\bm{f}}_{m}^{(\ell)}({\bm{y}})$. This implies ${\bm{f}}_{1}^{(\ell)}({\bm{y}})=\widetilde{{\bm{x}}}_{1}^{(\ell)}$ for all ${\bm{y}}\in B(\widetilde{{\bm{x}}}_{1}^{(\ell)},\rho)$. Likewise, ${\bm{f}}_{2}^{(\ell)}({\bm{y}})=\widetilde{{\bm{x}}}_{2}^{(\ell)}-\widetilde{{\bm{x}}}_{1}^{(\ell)}$ for all ${\bm{y}}\in B(\widetilde{{\bm{x}}}_{2}^{(\ell)};\rho)$, and so on, such that for all $n=1,...,N_{\ell}$ we have ${\bm{f}}_{n}^{(\ell)}({\bm{y}})=\widetilde{{\bm{x}}}_{n}^{(\ell)}-\widetilde{{\bm{x}}}_{n-1}^{(\ell)}$ for all ${\bm{y}}\in B(\widetilde{{\bm{x}}}_{n}^{(\ell)};\rho)$.

Now we show how to construct a new interpolating function ${\bm{h}}$ having representation cost less than or equal to that of ${\bm{f}}^{*}$. Let ${\bm{u}}_{n}^{(\ell)}=\frac{\widetilde{{\bm{x}}}_{n}^{(\ell)}-\widetilde{{\bm{x}}}_{n-1}^{(\ell)}}{\|\widetilde{{\bm{x}}}_{n}^{(\ell)}-\widetilde{{\bm{x}}}_{n-1}^{(\ell)}\|}$ and define ${\bm{P}}_{n}^{(\ell)}={\bm{u}}_{n}^{(\ell)}({\bm{u}}_{n}^{(\ell)})^{\top}$, the orthogonal projector onto the span of the difference vector $\widetilde{{\bm{x}}}_{n}^{(\ell)}-\widetilde{{\bm{x}}}_{n-1}^{(\ell)}$. Note that the map ${\bm{y}}\mapsto{\bm{P}}_{n}^{(\ell)}{\bm{y}}+\widetilde{{\bm{x}}}_{n-1}^{(\ell)}$ maps $\mathbb{R}^{d}$ onto the affine line through $\widetilde{{\bm{x}}}_{n-1}^{(\ell)}$ in the direction ${\bm{u}}_{n}^{(\ell)}$, i.e., the line connecting the samples $\widetilde{{\bm{x}}}_{n-1}^{(\ell)}$ and $\widetilde{{\bm{x}}}_{n}^{(\ell)}$. Consider the function

𝒉==1Ln=1N𝒉n(){\bm{h}}=\sum_{\ell=1}^{L}\sum_{n=1}^{N_{\ell}}{\bm{h}}^{(\ell)}_{n}

where

𝒉n()(𝒚)=𝑷n()𝒇n()(𝑷n()𝒚+𝒙~n1()).{\bm{h}}^{(\ell)}_{n}({\bm{y}})={\bm{P}}_{n}^{(\ell)}{\bm{f}}_{n}^{(\ell)}({\bm{P}}_{n}^{(\ell)}{\bm{y}}+\widetilde{{\bm{x}}}_{n-1}^{(\ell)}).

Put in words, 𝒉{\bm{h}} is constructed by aligning all inner-layer and outer-layer weights of units in 𝒇{\bm{f}}^{*} with the line segment connecting successive data points over which that unit first activates. See Figure 8 for an illustration.

Figure 8: Illustration of “unit alignment” technique used in the proof of Proposition 3. The left panel shows an interpolant 𝒇{\bm{f}} whose ReLU boundaries (shown here as dark blue and dark red lines) are not aligned with the line segments S1S_{1} and S2S_{2} connecting successive training samples. The right panel shows how a new interpolant 𝒉{\bm{h}} can be constructed by projecting units in 𝒇{\bm{f}} onto the line segments S1S_{1} and S2S_{2}, which reduces the representation cost.

Now we show that ${\bm{h}}$ satisfies the norm-ball interpolation constraints. By assumption (A1), the function ${\bm{h}}_{n}^{(\ell)}$ is non-zero only on the norm-balls $B(\widetilde{{\bm{x}}}_{m}^{(\ell)},\rho)$ for $m\geq n$. This implies that for all ${\bm{y}}\in B(\widetilde{{\bm{x}}}_{n}^{(\ell)};\rho)$ we have ${\bm{h}}({\bm{y}})=\sum_{m\leq n}{\bm{h}}_{m}^{(\ell)}({\bm{y}})$.

Observe that 𝒉(B(𝟎,ρ))={𝟎}{\bm{h}}(B(\bm{0},\rho))=\{\bm{0}\} since all functions 𝒉n(){\bm{h}}_{n}^{(\ell)} are zero on B(𝟎,ρ)B(\bm{0},\rho). Also, for all 𝒚B(𝒙~1();ρ){\bm{y}}\in B(\widetilde{{\bm{x}}}_{1}^{(\ell)};\rho) we have 𝑷1()𝒚B(𝒙~1();ρ){\bm{P}}_{1}^{(\ell)}{\bm{y}}\in B(\widetilde{{\bm{x}}}_{1}^{(\ell)};\rho), and so

𝒉(𝒚)=𝒉1()(𝒚)=𝑷1𝒇1()(𝑷1()𝒚)=𝑷1()𝒙~1()=𝒙~1().{\bm{h}}({\bm{y}})={\bm{h}}_{1}^{(\ell)}({\bm{y}})={\bm{P}}_{1}^{\ell}{\bm{f}}_{1}^{(\ell)}({\bm{P}}_{1}^{(\ell)}{\bm{y}})={\bm{P}}_{1}^{(\ell)}\widetilde{{\bm{x}}}_{1}^{(\ell)}=\widetilde{{\bm{x}}}_{1}^{(\ell)}.

Similarly, the only terms in ${\bm{h}}$ active for ${\bm{y}}\in B(\widetilde{{\bm{x}}}_{2}^{(\ell)};\rho)$ are ${\bm{h}}_{1}^{(\ell)}$ and ${\bm{h}}_{2}^{(\ell)}$, for which ${\bm{h}}_{1}^{(\ell)}({\bm{y}})=\widetilde{{\bm{x}}}_{1}^{(\ell)}$, and ${\bm{h}}_{2}^{(\ell)}({\bm{y}})={\bm{P}}_{2}^{(\ell)}{\bm{f}}_{2}^{(\ell)}({\bm{P}}_{2}^{(\ell)}{\bm{y}}+\widetilde{{\bm{x}}}_{1}^{(\ell)})={\bm{P}}_{2}^{(\ell)}(\widetilde{{\bm{x}}}_{2}^{(\ell)}-\widetilde{{\bm{x}}}_{1}^{(\ell)})=\widetilde{{\bm{x}}}_{2}^{(\ell)}-\widetilde{{\bm{x}}}_{1}^{(\ell)}$. This gives

𝒉(𝒚)\displaystyle{\bm{h}}({\bm{y}}) =𝒉1()(𝒚)+𝒉2()(𝒚)=𝒙~2().\displaystyle={\bm{h}}_{1}^{(\ell)}({\bm{y}})+{\bm{h}}_{2}^{(\ell)}({\bm{y}})=\widetilde{{\bm{x}}}_{2}^{(\ell)}.

Iterating this procedure for all $n=1,...,N_{\ell}$, we see that if ${\bm{y}}\in B(\widetilde{{\bm{x}}}_{n}^{(\ell)},\rho)$ then ${\bm{h}}_{n}^{(\ell)}({\bm{y}})=\widetilde{{\bm{x}}}_{n}^{(\ell)}-\widetilde{{\bm{x}}}_{n-1}^{(\ell)}$, and so ${\bm{h}}({\bm{y}})=\sum_{m\leq n}{\bm{h}}_{m}^{(\ell)}({\bm{y}})=\widetilde{{\bm{x}}}_{n}^{(\ell)}$, as claimed.

Following steps similar to the proof of Theorem 3, we can show that R(𝒉)R(𝒇)R({\bm{h}})\leq R({\bm{f}}^{*}) and with strict inequality if 𝒉𝒇{\bm{h}}\neq{\bm{f}}^{*}. First, if any unit 𝒖(𝒚)=𝒂[𝒘𝒚+b]+{\bm{u}}({\bm{y}})={\bm{a}}[{\bm{w}}^{\top}{\bm{y}}+b]_{+} in 𝒇{\bm{f}}^{*} is active over balls centered on multiple rays {1,2,,L}{\mathcal{R}}\subset\{1,2,...,L\}, then in the construction of 𝒉{\bm{h}} the unit 𝒖{\bm{u}} gets mapped to a sum of multiple units:

𝒖n()(𝒖n())𝒂[𝒘𝒖n()(𝒖n())(𝒚+𝒙~n1())+b]+\sum_{\ell\in{\mathcal{R}}}{\bm{u}}_{n_{\ell}}^{(\ell)}({\bm{u}}_{n_{\ell}}^{(\ell)})^{\top}{\bm{a}}[{\bm{w}}^{\top}{\bm{u}}_{n_{\ell}}^{(\ell)}({\bm{u}}_{n_{\ell}}^{(\ell)})^{\top}({\bm{y}}+\widetilde{{\bm{x}}}_{{n_{\ell}}-1}^{(\ell)})+b]_{+}

for some $n_{\ell}$ with $1\leq n_{\ell}\leq N_{\ell}$. The contribution of the sum of these units to the representation cost of ${\bm{h}}$ is less than or equal to $C=\sum_{\ell\in{\mathcal{R}}}|({\bm{u}}_{n_{\ell}}^{(\ell)})^{\top}{\bm{a}}||({\bm{u}}_{n_{\ell}}^{(\ell)})^{\top}{\bm{w}}|$. By assumption (A1), we know $({\bm{u}}_{m}^{(\ell)})^{\top}{\bm{u}}_{n}^{(p)}<0$ for all $p\neq\ell$ and for all $m,n$. Therefore, by Lemma 8 we know $C<\|{\bm{a}}\|$. This shows ${\bm{f}}^{*}$ cannot have any units active over balls centered on different rays, since otherwise ${\bm{h}}$ would have strictly smaller representation cost. Therefore, we have shown that ${\bm{f}}^{*}$ decomposes as ${\bm{f}}^{*}=\sum_{\ell=1}^{L}\sum_{n=1}^{N_{\ell}}{\bm{f}}_{n}^{(\ell)}$. Additionally, we see that ${\bm{h}}_{n}^{(\ell)}={\bm{f}}_{n}^{(\ell)}$, i.e., all inner- and outer-layer weight vectors of units in ${\bm{f}}_{n}^{(\ell)}$ are aligned with ${\bm{u}}_{n}^{(\ell)}$, since otherwise the representation cost of ${\bm{f}}^{*}$ could be reduced. Therefore, each ${\bm{f}}_{n}^{(\ell)}$ must have the form ${\bm{f}}_{n}^{(\ell)}({\bm{y}})={\bm{u}}_{n}^{(\ell)}\psi_{n}^{(\ell)}(({\bm{u}}_{n}^{(\ell)})^{\top}{\bm{y}})$. Minimizing the $\psi_{n}^{(\ell)}$ separately under interpolation constraints, we arrive at the unique solution given in (79). ∎
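As a numerical illustration of the closed form (79) (our own sketch on an arbitrary synthetic configuration; it is not part of the proof), the following builds ${\bm{f}}^{*}$ for two slightly perturbed rays in $\mathbb{R}^{2}$ and checks that it maps each perturbed training sample to itself and the origin to zero, which are the values the norm-ball constraints prescribe at the ball centers:

# Construct the closed-form minimizer (79) for two perturbed rays in R^2
# and verify the interpolation values at the ball centers.
import numpy as np

rng = np.random.default_rng(2)
rho = 0.05

# Two rays with negatively correlated directions, three samples per ray,
# with perturbations kept much smaller than the gaps between samples.
dirs = [np.array([1.0, 0.2]), np.array([-1.0, 0.3])]
dirs = [u / np.linalg.norm(u) for u in dirs]
X = [[c * u + 0.01 * rng.standard_normal(2) for c in (1.0, 2.0, 3.0)] for u in dirs]

def f_star(y):
    out = np.zeros(2)
    for samples in X:
        prev = np.zeros(2)                        # x_tilde_0 := 0
        for x in samples:
            diff = x - prev
            u = diff / np.linalg.norm(diff)       # u_n^(l)
            a, b = rho, np.linalg.norm(diff) - rho
            s = np.linalg.norm(diff) / (b - a)
            t = u @ (y - prev)
            out += u * s * (np.maximum(t - a, 0.0) - np.maximum(t - b, 0.0))
            prev = x
    return out

assert np.allclose(f_star(np.zeros(2)), 0.0, atol=1e-10)
for samples in X:
    for x in samples:
        assert np.allclose(f_star(x), x, atol=1e-8)
print("f* maps every perturbed sample to itself and the origin to zero")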

C.3 Simplex Data

C.3.1 Simplex with one obtuse vertex

Proof of Proposition 2.

Consider training samples ${\bm{x}}_{1},{\bm{x}}_{2},...,{\bm{x}}_{N}\in\mathbb{R}^{d}$ whose convex hull is an $(N-1)$-simplex such that ${\bm{x}}_{1}$ makes an obtuse angle with all other vertices, i.e., $({\bm{x}}_{n}-{\bm{x}}_{1})^{\top}{\bm{x}}_{1}<0$ for all $n=2,...,N$. By Lemma 5, ${\bm{f}}^{*}$ is a norm-ball interpolating min-cost solution if and only if ${\bm{g}}^{*}({\bm{y}})={\bm{f}}^{*}({\bm{y}}+{\bm{x}}_{1})-{\bm{x}}_{1}$ is a norm-ball interpolating min-cost solution for the translated points $\bm{0},{\bm{x}}_{2}-{\bm{x}}_{1},...,{\bm{x}}_{N}-{\bm{x}}_{1}\in\mathbb{R}^{d}$. The latter configuration satisfies the hypotheses of Theorem 3 with a single training sample per ray. Therefore, the min-cost solution ${\bm{g}}^{*}$ of the translated points has units whose inner- and outer-layer weight vectors are aligned with ${\bm{x}}_{n}-{\bm{x}}_{1}$, $n=2,...,N$, and likewise for ${\bm{f}}^{*}$, since it is a translation of ${\bm{g}}^{*}$. ∎

C.3.2 Simplex with all acute vertices

Justification of Conjecture 1.

For concreteness we focus on the case of three points 𝒙1,𝒙2,𝒙3d{\bm{x}}_{1},{\bm{x}}_{2},{\bm{x}}_{3}\in\mathbb{R}^{d} whose convex hull is an acute triangle, and let B1,B2,B3B_{1},B_{2},B_{3} be open balls of radius ρ\rho centered at 𝒙1,𝒙2,𝒙3{\bm{x}}_{1},{\bm{x}}_{2},{\bm{x}}_{3} (respectively).

Let ${\bm{f}}$ be any norm-ball interpolating min-cost solution, and let ${\bm{f}}({\bm{y}})=\sum_{k}{\bm{a}}_{k}[{\bm{w}}_{k}^{\top}{\bm{y}}+b_{k}]_{+}+{\bm{V}}{\bm{y}}+{\bm{c}}$ be any minimal representative of ${\bm{f}}$.

First, by properties of minimal representatives, none of the ReLU boundary sets $H_{k}=\{{\bm{y}}\in\mathbb{R}^{d}:{\bm{w}}_{k}^{\top}{\bm{y}}+b_{k}=0\}$ intersect any of the balls centered at the training samples. Also, the active set of each unit in ${\bm{f}}$ must contain either one or two norm-balls, since otherwise the unit is either inactive over all balls or active over all balls, in which case the unit can be removed or absorbed into the unregularized linear part while strictly reducing the representation cost. By “reversing units”, as in the proof of Theorem 3, we may transform the parameterization in such a way that the active set of every unit contains exactly one ball, and do so without changing the representation cost.

After this transformation, we may write

𝒇=𝒇1+𝒇2+𝒇3+{\bm{f}}={\bm{f}}_{1}+{\bm{f}}_{2}+{\bm{f}}_{3}+\bm{\ell}

where 𝒇i{\bm{f}}_{i} is a sum of units active only on the ball BiB_{i} and no others, and where (𝒚)=𝑨𝒚+𝒃\bm{\ell}({\bm{y}})={\bm{A}}{\bm{y}}+{\bm{b}} is an affine function. Then we have

𝒚1B1,𝒇(𝒚1)\displaystyle\forall{\bm{y}}_{1}\in B_{1},\quad{\bm{f}}({\bm{y}}_{1}) =𝒇1(𝒚1)+𝑨𝒚1+𝒃=𝒙1\displaystyle={\bm{f}}_{1}({\bm{y}}_{1})+{\bm{A}}{\bm{y}}_{1}+{\bm{b}}={\bm{x}}_{1} (80)
𝒚2B2,𝒇(𝒚2)\displaystyle\forall{\bm{y}}_{2}\in B_{2},\quad{\bm{f}}({\bm{y}}_{2}) =𝒇2(𝒚2)+𝑨𝒚2+𝒃=𝒙2\displaystyle={\bm{f}}_{2}({\bm{y}}_{2})+{\bm{A}}{\bm{y}}_{2}+{\bm{b}}={\bm{x}}_{2} (81)
𝒚3B3,𝒇(𝒚3)\displaystyle\forall{\bm{y}}_{3}\in B_{3},\quad{\bm{f}}({\bm{y}}_{3}) =𝒇3(𝒚3)+𝑨𝒚3+𝒃=𝒙3\displaystyle={\bm{f}}_{3}({\bm{y}}_{3})+{\bm{A}}{\bm{y}}_{3}+{\bm{b}}={\bm{x}}_{3} (82)

and so,

𝒚1B1,𝒇1(𝒚1)\displaystyle\forall{\bm{y}}_{1}\in B_{1},\quad{\bm{f}}_{1}({\bm{y}}_{1}) =𝒙1𝒃𝑨𝒚1\displaystyle={\bm{x}}_{1}-{\bm{b}}-{\bm{A}}{\bm{y}}_{1} (83)
𝒚2B2,𝒇2(𝒚2)\displaystyle\forall{\bm{y}}_{2}\in B_{2},\quad{\bm{f}}_{2}({\bm{y}}_{2}) =𝒙2𝒃𝑨𝒚2\displaystyle={\bm{x}}_{2}-{\bm{b}}-{\bm{A}}{\bm{y}}_{2} (84)
𝒚3B3,𝒇3(𝒚3)\displaystyle\forall{\bm{y}}_{3}\in B_{3},\quad{\bm{f}}_{3}({\bm{y}}_{3}) =𝒙3𝒃𝑨𝒚3\displaystyle={\bm{x}}_{3}-{\bm{b}}-{\bm{A}}{\bm{y}}_{3} (85)

Finding the min-cost solution then amounts to minimizing the representation cost of ${\bm{f}}_{1}$, ${\bm{f}}_{2}$, ${\bm{f}}_{3}$ subject to the above constraints. This is made challenging by the fact that the constraints are coupled together by the parameters ${\bm{A}}$ and ${\bm{b}}$ of the affine part. Under the assumption ${\bm{A}}=\bm{0}$, we prove below that the min-cost solution must have the conjectured form ${\bm{f}}^{*}$ given in (86). However, since we cannot a priori assume ${\bm{A}}=\bm{0}$, this does not constitute a full proof.

Before proceeding, we give a lemma that will be used to lower bound R(𝒇)R({\bm{f}}) (assuming 𝑨=𝟎{\bm{A}}=\bm{0}).

Lemma 10.

Suppose 𝐠{\bm{g}} is a sum of ReLU units such that 𝐠=𝟎{\bm{g}}=\bm{0} on a closed convex region CdC\subset\mathbb{R}^{d} and 𝐠{\bm{g}} is constant and equal to 𝐜{\bm{c}} on a closed ball BdB\subset\mathbb{R}^{d}, and 𝐠{\bm{g}} has a minimal representative where all its units are active on BB. Then

R(𝒈)2𝒄dist(B,C)R({\bm{g}})\geq\frac{2\|{\bm{c}}\|}{\text{dist}(B,C)}

where dist(B,C)=min𝐲B,𝐱C𝐲𝐱\text{dist}(B,C)=\min_{{\bm{y}}\in B,{\bm{x}}\in C}\|{\bm{y}}-{\bm{x}}\|.

Proof.

Let 𝒈(𝒚)=k𝒂k[𝒘k𝒚+bk]+{\bm{g}}({\bm{y}})=\sum_{k}{\bm{a}}_{k}[{\bm{w}}_{k}^{\top}{\bm{y}}+b_{k}]_{+} be a minimal representative where each unit is active on BB. For 𝒈{\bm{g}} to be constant on BB it must be the case that k𝒂k𝒘k=𝟎\sum_{k}{\bm{a}}_{k}{\bm{w}}_{k}^{\top}=\bm{0}. Let 𝒚=𝒚0{\bm{y}}={\bm{y}}_{0} be any value for which 𝒈(𝒚0)\|\bm{\partial}{\bm{g}}({\bm{y}}_{0})\|, the operator norm of the Jacobian of 𝒈{\bm{g}}, is maximized. Since 𝒈{\bm{g}} is piecewise linear, we see that its Lipschitz constant Lip(𝒈)=𝒈(𝒚0)\text{Lip}({\bm{g}})=\|\bm{\partial}{\bm{g}}({\bm{y}}_{0})\|. Let I0I_{0} be the set of indices of active units at 𝒚0{\bm{y}}_{0} so that 𝒈(𝒚0)=kI0𝒂k𝒘k\bm{\partial}{\bm{g}}({\bm{y}}_{0})=\sum_{k\in I_{0}}{\bm{a}}_{k}{\bm{w}}_{k}^{\top}, and let I1I_{1} be the complementary index set. Then because k𝒂k𝒘k=𝟎\sum_{k}{\bm{a}}_{k}{\bm{w}}_{k}^{\top}=\bm{0}, we have kI0𝒂k𝒘k=kI1𝒂k𝒘k\sum_{k\in I_{0}}{\bm{a}}_{k}{\bm{w}}_{k}^{\top}=-\sum_{k\in I_{1}}{\bm{a}}_{k}{\bm{w}}_{k}^{\top}, and so kI0𝒂k𝒘k=kI1𝒂k𝒘k\|\sum_{k\in I_{0}}{\bm{a}}_{k}{\bm{w}}_{k}^{\top}\|=\|\sum_{k\in I_{1}}{\bm{a}}_{k}{\bm{w}}_{k}^{\top}\|. Therefore,

2Lip(𝒈)=2𝒈(𝒚0)=2kI0𝒂k𝒘k=kI0𝒂k𝒘k+kI1𝒂k𝒘kk𝒂k𝒘k=R(𝒈).2\,\text{Lip}({\bm{g}})=2\|\bm{\partial}{\bm{g}}({\bm{y}}_{0})\|=2\left\|\sum_{k\in I_{0}}{\bm{a}}_{k}{\bm{w}}_{k}^{\top}\right\|=\left\|\sum_{k\in I_{0}}{\bm{a}}_{k}{\bm{w}}_{k}^{\top}\right\|+\left\|\sum_{k\in I_{1}}{\bm{a}}_{k}{\bm{w}}_{k}^{\top}\right\|\\ \leq\sum_{k}\|{\bm{a}}_{k}\|\|{\bm{w}}_{k}\|=R({\bm{g}}).

Finally,

Lip(𝒈)max𝒙C,𝒚B𝒈(𝒚)𝒈(𝒙)𝒚𝒙=𝒄min𝒙C,𝒚B𝒚𝒙=𝒄dist(B,C).\text{Lip}({\bm{g}})\geq\max_{{\bm{x}}\in C,{\bm{y}}\in B}\frac{\|{\bm{g}}({\bm{y}})-{\bm{g}}({\bm{x}})\|}{\|{\bm{y}}-{\bm{x}}\|}=\frac{\|{\bm{c}}\|}{\min_{{\bm{x}}\in C,{\bm{y}}\in B}\|{\bm{y}}-{\bm{x}}\|}=\frac{\|{\bm{c}}\|}{\text{dist}(B,C)}.

Combining this with the previous inequality gives the claim. ∎

Now, returning to the proof of the main claim, for n=1,2,3n=1,2,3, let CnC_{n} be the closed convex region given by the intersection of all closed half-planes containing the balls BjB_{j} and BkB_{k}, jnj\neq n , and knk\neq n, jkj\neq k. By assumption, 𝒇n{\bm{f}}_{n} vanishes on CnC_{n} and 𝒇n{\bm{f}}_{n} is constant and equal to 𝒙n𝒃{\bm{x}}_{n}-{\bm{b}} on the closed ball BnB_{n}. So, by Lemma 10, we see that

R(𝒇n)2𝒙n𝒃δnR({\bm{f}}_{n})\geq\frac{2\|{\bm{x}}_{n}-{\bm{b}}\|}{\delta_{n}}

where δn=dist(Bn,Cn)\delta_{n}=\text{dist}(B_{n},C_{n}). Therefore, for all 𝒃d{\bm{b}}\in\mathbb{R}^{d} we have

R(𝒇)=R(𝒇1)+R(𝒇2)+R(𝒇3)n=132𝒙n𝒃δnR({\bm{f}})=R({\bm{f}}_{1})+R({\bm{f}}_{2})+R({\bm{f}}_{3})\geq\sum_{n=1}^{3}\frac{2\|{\bm{x}}_{n}-{\bm{b}}\|}{\delta_{n}}

Let 𝒙¯d\overline{{\bm{x}}}\in\mathbb{R}^{d} be the minimizer of

min𝒃dn=132𝒙n𝒃δn.\min_{{\bm{b}}\in\mathbb{R}^{d}}\sum_{n=1}^{3}\frac{2\|{\bm{x}}_{n}-{\bm{b}}\|}{\delta_{n}}.

Then plugging in 𝒃=𝒙¯{\bm{b}}=\overline{{\bm{x}}} above, we have the lower bound

R(𝒇)n=132𝒙n𝒙¯δn,R({\bm{f}})\geq\sum_{n=1}^{3}\frac{2\|{\bm{x}}_{n}-\overline{{\bm{x}}}\|}{\delta_{n}},

which is independent of 𝒃{\bm{b}}.

Finally, simple calculations show that the conjectured min-cost solution 𝒇{\bm{f}}^{*} specified in (86) has representation cost achieving this lower bound. Therefore, under the assumption 𝑨=𝟎{\bm{A}}=\bm{0}, 𝒇{\bm{f}}^{*} is a min-cost solution.

Now we prove that, in the special case where the convex hull of the training points is an equilateral triangle, the norm-ball interpolator 𝒇{\bm{f}}^{*} identified in Conjecture 1 is a min-cost solution (though not necessarily the unique min-cost solution). In this case, we may also give 𝒇{\bm{f}}^{*} a more explicit form, as detailed below.

Proposition 4.

Suppose the convex hull of the training points 𝐱1,𝐱2,𝐱3d{\bm{x}}_{1},{\bm{x}}_{2},{\bm{x}}_{3}\in\mathbb{R}^{d} is an equilateral triangle. Assume the norm-balls Bn:=B(𝐱n,ρ)B_{n}:=B({\bm{x}}_{n},\rho) centered at each training point have radius ρ<𝐱n𝐱0/2\rho<\|{\bm{x}}_{n}-{\bm{x}}_{0}\|/2, n=1,2,3n=1,2,3, where 𝐱0=13(𝐱1+𝐱2+𝐱3){\bm{x}}_{0}=\frac{1}{3}({\bm{x}}_{1}+{\bm{x}}_{2}+{\bm{x}}_{3}) is the centroid of the triangle. Then a minimizer 𝐟{\bm{f}}^{*} of (19) is given by

𝒇(𝒚)=𝒖1ϕ1(𝒖1(𝒚𝒙0))+𝒖2ϕ2(𝒖2(𝒚𝒙0))+𝒖3ϕ3(𝒖3(𝒚𝒙0))+𝒙0,{\bm{f}}^{*}({\bm{y}})={\bm{u}}_{1}\phi_{1}({\bm{u}}_{1}^{\top}({\bm{y}}-{\bm{x}}_{0}))+{\bm{u}}_{2}\phi_{2}({\bm{u}}_{2}^{\top}({\bm{y}}-{\bm{x}}_{0}))+{\bm{u}}_{3}\phi_{3}({\bm{u}}_{3}^{\top}({\bm{y}}-{\bm{x}}_{0}))+{\bm{x}}_{0}, (86)

where ϕn(t)=sn([tan]+[tbn]+)\phi_{n}(t)=s_{n}([t-a_{n}]_{+}-[t-b_{n}]_{+}) with 𝐮n=𝐱n𝐱0𝐱n𝐱0{\bm{u}}_{n}=\frac{{\bm{x}}_{n}-{\bm{x}}_{0}}{\|{\bm{x}}_{n}-{\bm{x}}_{0}\|}, an=12𝐱n𝐱0+ρa_{n}=-\frac{1}{2}\|{\bm{x}}_{n}-{\bm{x}}_{0}\|+\rho, bn=𝐱n𝐱0ρb_{n}=\|{\bm{x}}_{n}-{\bm{x}}_{0}\|-\rho, and sn=𝐱n𝐱0/(bnan)s_{n}=\|{\bm{x}}_{n}-{\bm{x}}_{0}\|/(b_{n}-a_{n}).

Proof.

By translation and scale invariance of min-cost solutions, without loss of generality we may assume 𝒙1,𝒙2,𝒙3d{\bm{x}}_{1},{\bm{x}}_{2},{\bm{x}}_{3}\in\mathbb{R}^{d} are unit-norm vectors and mean-zero (i.e., the triangle centroid 𝒙0=𝟎{\bm{x}}_{0}=\bm{0}). In this case, the assumption on ρ\rho translates to ρ<1/2\rho<1/2. Additionally, it suffices to prove the claim in the case d=2d=2. This is because, if 𝒙1,𝒙2,𝒙3d{\bm{x}}_{1},{\bm{x}}_{2},{\bm{x}}_{3}\in\mathbb{R}^{d} are the vertices of an equilateral triangle whose centroid is at the origin, then these points are contained in a two-dimensional subspace 𝒮d{\mathcal{S}}\subset\mathbb{R}^{d}. And if we let 𝑷d×2{\bm{P}}\in\mathbb{R}^{d\times 2} be a matrix whose columns are an orthonormal basis for 𝒮{\mathcal{S}}, then by Theorem 2, 𝒇{\bm{f}} is a min-cost solution if and only if 𝒇(𝒚)=𝑷𝒇0(𝑷𝒚){\bm{f}}({\bm{y}})={\bm{P}}{\bm{f}}_{0}({\bm{P}}^{\top}{\bm{y}}), where 𝒇0:22{\bm{f}}_{0}:\mathbb{R}^{2}\rightarrow\mathbb{R}^{2} is a min-cost solution under the constraints 𝒇0(B(𝑷𝒙n,ρ))={𝑷𝒙n}{\bm{f}}_{0}(B({\bm{P}}^{\top}{\bm{x}}_{n},\rho))=\{{\bm{P}}^{\top}{\bm{x}}_{n}\} for all n=1,2,3n=1,2,3. Therefore, the problem reduces to finding a min-cost solution of the projected points 𝑷𝒙1,𝑷𝒙2,𝑷𝒙32{\bm{P}}^{\top}{\bm{x}}_{1},{\bm{P}}^{\top}{\bm{x}}_{2},{\bm{P}}^{\top}{\bm{x}}_{3}\in\mathbb{R}^{2} whose convex hull is an equilateral triangle in 2\mathbb{R}^{2}.

So now let 𝒇:22{\bm{f}}:\mathbb{R}^{2}\rightarrow\mathbb{R}^{2} be any min-cost solution under the assumption 𝒙1,𝒙2,𝒙32{\bm{x}}_{1},{\bm{x}}_{2},{\bm{x}}_{3}\in\mathbb{R}^{2} are unit-norm, have zero mean, and ρ<1/2\rho<1/2. By reversing units as necessary, there exists a minimal representative of 𝒇{\bm{f}} that can be put in the form 𝒇(𝒚)=𝒇1(𝒚)+𝒇2(𝒚)+𝒇3(𝒚)+𝑨𝒚+𝒄{\bm{f}}({\bm{y}})={\bm{f}}_{1}({\bm{y}})+{\bm{f}}_{2}({\bm{y}})+{\bm{f}}_{3}({\bm{y}})+{\bm{A}}{\bm{y}}+{\bm{c}}, such that 𝒇n(𝒚){\bm{f}}_{n}({\bm{y}}) is a sum of ReLU units, all of which are active on BnB_{n}, and all of which are inactive on BjB_{j} for jnj\neq n.

Let ${\bm{Q}}$ be the reflection matrix ${\bm{Q}}=2{\bm{x}}_{1}{\bm{x}}_{1}^{\top}-{\bm{I}}$, which reflects points across the line spanned by ${\bm{x}}_{1}$. In particular, if ${\bm{y}}\in B_{1}$ then ${\bm{Q}}{\bm{y}}\in B_{1}$, while if ${\bm{y}}\in B_{2}$ then ${\bm{Q}}{\bm{y}}\in B_{3}$ (and vice versa). Also, ${\bm{Q}}{\bm{x}}_{1}={\bm{x}}_{1}$, ${\bm{Q}}{\bm{x}}_{2}={\bm{x}}_{3}$, and ${\bm{Q}}{\bm{x}}_{3}={\bm{x}}_{2}$. Define ${\bm{g}}({\bm{y}})=\frac{1}{2}\left({\bm{f}}({\bm{y}})+{\bm{Q}}^{-1}{\bm{f}}({\bm{Q}}{\bm{y}})\right)$. Then it is easy to check that ${\bm{g}}$ satisfies the interpolation constraints, and since ${\bm{Q}}$ is unitary, we have $R({\bm{g}})\leq\frac{1}{2}R({\bm{f}})+\frac{1}{2}R({\bm{Q}}\circ{\bm{f}}\circ{\bm{Q}}^{-1})=\frac{1}{2}R({\bm{f}})+\frac{1}{2}R({\bm{f}})=R({\bm{f}})$. Furthermore, since ${\bm{f}}$ is a min-cost solution, we must have $R({\bm{g}})=R({\bm{f}})$. Additionally, since none of the units making up ${\bm{f}}$ are active over more than one ball, neither are the units making up ${\bm{g}}$, which implies no pair of units belonging to ${\bm{g}}$ combine to form an affine function. Therefore, we may write ${\bm{g}}({\bm{y}})={\bm{g}}_{1}({\bm{y}})+{\bm{g}}_{2}({\bm{y}})+{\bm{g}}_{3}({\bm{y}})+{\bm{B}}{\bm{y}}+{\bm{v}}$, where ${\bm{B}}=\frac{1}{2}({\bm{A}}+{\bm{Q}}^{-1}{\bm{A}}{\bm{Q}})$, such that each ${\bm{g}}_{n}$ is a sum of ReLU units, all of which are active on $B_{n}$ and all of which are inactive on $B_{j}$ for $j\neq n$.

Let ${\bm{U}}$ be the rotation matrix by $120$ degrees such that ${\bm{x}}_{2}={\bm{U}}{\bm{x}}_{1}$, ${\bm{x}}_{3}={\bm{U}}{\bm{x}}_{2}$, and ${\bm{x}}_{1}={\bm{U}}{\bm{x}}_{3}$. Consider the symmetrized version of ${\bm{g}}$ given by ${\bm{h}}({\bm{y}}):=\frac{1}{3}({\bm{g}}({\bm{y}})+{\bm{U}}^{-1}{\bm{g}}({\bm{U}}{\bm{y}})+{\bm{U}}^{-2}{\bm{g}}({\bm{U}}^{2}{\bm{y}}))$. Since for all ${\bm{y}}\in B_{n}$ we have ${\bm{U}}{\bm{y}}\in B_{n+1}$ (with indices understood modulo $3$), it is easy to verify that ${\bm{h}}$ satisfies the interpolation constraints. Also, since ${\bm{U}}$ is unitary, we have $R({\bm{h}})\leq R({\bm{g}})\leq R({\bm{f}})$, which implies $R({\bm{h}})=R({\bm{f}})$ since ${\bm{f}}$ is a min-cost solution. Again, since none of the units making up ${\bm{g}}$ are active over more than one ball, neither are the units making up ${\bm{h}}$, which implies no pair of units belonging to ${\bm{h}}$ combine to form an affine function. And so, we may write ${\bm{h}}({\bm{y}})={\bm{h}}_{1}({\bm{y}})+{\bm{h}}_{2}({\bm{y}})+{\bm{h}}_{3}({\bm{y}})+{\bm{C}}{\bm{y}}+{\bm{u}}$, such that each ${\bm{h}}_{n}$ is a sum of ReLU units, all of which are active on $B_{n}$ and all of which are inactive on $B_{j}$ for $j\neq n$.

Observe that 𝒖=13(𝑰+𝑼1+𝑼2)𝒗=𝟎{\bm{u}}=\frac{1}{3}({\bm{I}}+{\bm{U}}^{-1}+{\bm{U}}^{-2}){\bm{v}}=\bm{0}. Also, we have 𝑪=13(𝑩+𝑼1𝑩𝑼+𝑼2𝑩𝑼2){\bm{C}}=\frac{1}{3}({\bm{B}}+{\bm{U}}^{-1}{\bm{B}}{\bm{U}}+{\bm{U}}^{-2}{\bm{B}}{\bm{U}}^{2}). This implies 𝑪{\bm{C}} commutes with 𝑼{\bm{U}}, because it is easy to see that 𝑪=𝑼1𝑪𝑼{\bm{C}}={\bm{U}}^{-1}{\bm{C}}{\bm{U}}. Additionally, 𝑪{\bm{C}} commutes with 𝑸{\bm{Q}}, because it is easy to see 𝑩=𝑸1𝑩𝑸{\bm{B}}={\bm{Q}}^{-1}{\bm{B}}{\bm{Q}}, and by properties of rotations/reflections we have 𝑼𝑸=𝑸𝑼1{\bm{U}}{\bm{Q}}={\bm{Q}}{\bm{U}}^{-1}, which implies

𝑸1𝑪𝑸\displaystyle{\bm{Q}}^{-1}{\bm{C}}{\bm{Q}} =13(𝑸1𝑩𝑸+𝑸1𝑼1𝑩𝑼𝑸+𝑸1𝑼2𝑩𝑼2𝑸)\displaystyle=\frac{1}{3}({\bm{Q}}^{-1}{\bm{B}}{\bm{Q}}+{\bm{Q}}^{-1}{\bm{U}}^{-1}{\bm{B}}{\bm{U}}{\bm{Q}}+{\bm{Q}}^{-1}{\bm{U}}^{-2}{\bm{B}}{\bm{U}}^{2}{\bm{Q}})
=13(𝑸1𝑩𝑸+(𝑼𝑸)1𝑩(𝑼𝑸)+(𝑼𝑸)1𝑼1𝑩𝑼(𝑼𝑸))\displaystyle=\frac{1}{3}({\bm{Q}}^{-1}{\bm{B}}{\bm{Q}}+({\bm{U}}{\bm{Q}})^{-1}{\bm{B}}({\bm{U}}{\bm{Q}})+({\bm{U}}{\bm{Q}})^{-1}{\bm{U}}^{-1}{\bm{B}}{\bm{U}}({\bm{U}}{\bm{Q}}))
=13(𝑸1𝑩𝑸+𝑼𝑸1𝑩𝑸𝑼1+𝑼𝑸1𝑼1𝑩𝑼𝑸𝑼1)\displaystyle=\frac{1}{3}({\bm{Q}}^{-1}{\bm{B}}{\bm{Q}}+{\bm{U}}{\bm{Q}}^{-1}{\bm{B}}{\bm{Q}}{\bm{U}}^{-1}+{\bm{U}}{\bm{Q}}^{-1}{\bm{U}}^{-1}{\bm{B}}{\bm{U}}{\bm{Q}}{\bm{U}}^{-1})
=13(𝑸1𝑩𝑸+𝑼𝑸1𝑩𝑸𝑼1+𝑼2𝑸1𝑩𝑸𝑼2)\displaystyle=\frac{1}{3}({\bm{Q}}^{-1}{\bm{B}}{\bm{Q}}+{\bm{U}}{\bm{Q}}^{-1}{\bm{B}}{\bm{Q}}{\bm{U}}^{-1}+{\bm{U}}^{2}{\bm{Q}}^{-1}{\bm{B}}{\bm{Q}}{\bm{U}}^{-2})
=13(𝑩+𝑼𝑩𝑼1+𝑼2𝑩𝑼2)\displaystyle=\frac{1}{3}({\bm{B}}+{\bm{U}}{\bm{B}}{\bm{U}}^{-1}+{\bm{U}}^{2}{\bm{B}}{\bm{U}}^{-2})
=13(𝑩+𝑼2𝑩𝑼2+𝑼1𝑩𝑼)\displaystyle=\frac{1}{3}({\bm{B}}+{\bm{U}}^{-2}{\bm{B}}{\bm{U}}^{2}+{\bm{U}}^{-1}{\bm{B}}{\bm{U}})
=𝑪,\displaystyle={\bm{C}},

which shows ${\bm{C}}$ commutes with ${\bm{Q}}$. Now we show that any matrix ${\bm{C}}$ that commutes with both ${\bm{U}}$ and ${\bm{Q}}$ is a scaled identity matrix. Let ${\bm{x}}_{1}^{\perp}$ denote a unit vector perpendicular to ${\bm{x}}_{1}$, such that $\{{\bm{x}}_{1},{\bm{x}}_{1}^{\perp}\}$ form an orthonormal basis for $\mathbb{R}^{2}$. First, we know ${\bm{Q}}$ has eigenvectors ${\bm{x}}_{1}$ and ${\bm{x}}_{1}^{\perp}$ with eigenvalues $1$ and $-1$, respectively. And so ${\bm{Q}}{\bm{C}}{\bm{x}}_{1}={\bm{C}}{\bm{Q}}{\bm{x}}_{1}={\bm{C}}{\bm{x}}_{1}$, while ${\bm{Q}}{\bm{C}}{\bm{x}}_{1}^{\perp}={\bm{C}}{\bm{Q}}{\bm{x}}_{1}^{\perp}=-{\bm{C}}{\bm{x}}_{1}^{\perp}$, which implies ${\bm{C}}{\bm{x}}_{1}=a{\bm{x}}_{1}$ and ${\bm{C}}{\bm{x}}_{1}^{\perp}=b{\bm{x}}_{1}^{\perp}$ for some $a,b\in\mathbb{R}$, i.e., ${\bm{x}}_{1}$ and ${\bm{x}}_{1}^{\perp}$ are eigenvectors of ${\bm{C}}$ with eigenvalues $a$ and $b$, respectively. Furthermore, we have ${\bm{C}}{\bm{U}}{\bm{x}}_{1}={\bm{U}}{\bm{C}}{\bm{x}}_{1}=a{\bm{U}}{\bm{x}}_{1}$ and ${\bm{C}}{\bm{U}}{\bm{x}}_{1}^{\perp}={\bm{U}}{\bm{C}}{\bm{x}}_{1}^{\perp}=b{\bm{U}}{\bm{x}}_{1}^{\perp}$. This implies ${\bm{U}}{\bm{x}}_{1}$ and ${\bm{U}}{\bm{x}}_{1}^{\perp}$ are also eigenvectors of ${\bm{C}}$ with eigenvalues $a$ and $b$, respectively. But since ${\bm{x}}_{1}$ and ${\bm{U}}{\bm{x}}_{1}$ are linearly independent, and likewise so are ${\bm{x}}_{1}^{\perp}$ and ${\bm{U}}{\bm{x}}_{1}^{\perp}$, the only way this could be true is if $a=b$, i.e., every vector in $\mathbb{R}^{2}$ is an eigenvector of ${\bm{C}}$ with the same eigenvalue, which implies ${\bm{C}}=\lambda{\bm{I}}$ for some $\lambda\in\mathbb{R}$.

Figure 9: Illustration of ReLU boundaries (in red) of original interpolant 𝒇{\bm{f}} (left), the function 𝒈{\bm{g}} (middle) obtained by enforcing reflection symmetry, and the function 𝒉{\bm{h}} (right) obtained by enforcing both reflection and rotation symmetry.

Therefore, we have shown that every min-cost solution 𝒇{\bm{f}} maps to a min-cost solution 𝒉{\bm{h}} of the form

{\bm{h}}({\bm{y}})={\bm{h}}_{1}({\bm{y}})+{\bm{h}}_{2}({\bm{y}})+{\bm{h}}_{3}({\bm{y}})+\lambda{\bm{y}}

for some λ\lambda\in\mathbb{R}, where each 𝒉n{\bm{h}}_{n} is a sum of ReLU units, all of which are active on BnB_{n} and all of which are inactive on BjB_{j}, jnj\neq n. In particular, 𝒉n(𝒚)=𝒙nλ𝒚{\bm{h}}_{n}({\bm{y}})={\bm{x}}_{n}-\lambda{\bm{y}} for all 𝒚Bn{\bm{y}}\in B_{n} and 𝒉n(𝒚)=𝟎{\bm{h}}_{n}({\bm{y}})=\bm{0} for all 𝒚Cn{\bm{y}}\in C_{n}, where CnC_{n} is the intersection of all half-planes containing the balls BjB_{j}, jnj\neq n.

This means we can write ${\bm{h}}({\bm{y}})=\sum_{k=1}^{K}{\bm{a}}_{k}[{\bm{w}}_{k}^{\top}{\bm{y}}+b_{k}]_{+}+\lambda{\bm{y}}$ and ${\bm{h}}_{n}({\bm{y}})=\sum_{k\in{\mathcal{A}}_{n}}{\bm{a}}_{k}[{\bm{w}}_{k}^{\top}{\bm{y}}+b_{k}]_{+}$ for some index sets ${\mathcal{A}}_{1},{\mathcal{A}}_{2},{\mathcal{A}}_{3}$ partitioning $\{1,...,K\}$, so that $R({\bm{h}})=\sum_{k=1}^{K}\|{\bm{a}}_{k}\|=\sum_{n=1}^{3}\sum_{k\in{\mathcal{A}}_{n}}\|{\bm{a}}_{k}\|$. Let $R_{0}(\cdot)$ denote the representation cost of a function computed without an unregularized linear part, i.e., define $R_{0}$ analogously to $R$ except that the model class ${\bm{h}}_{\bm{\theta}}$ in (6) is constrained to have ${\bm{V}}=\bm{0}$. Then, since the realizations of the ${\bm{h}}_{n}$ functions considered above do not have a linear part, we see that $\sum_{k\in{\mathcal{A}}_{n}}\|{\bm{a}}_{k}\|\geq R_{0}({\bm{h}}_{n})$, and so $R({\bm{h}})=\sum_{n=1}^{3}\sum_{k\in{\mathcal{A}}_{n}}\|{\bm{a}}_{k}\|\geq R_{0}({\bm{h}}_{1})+R_{0}({\bm{h}}_{2})+R_{0}({\bm{h}}_{3})$.

Now we show how to lower bound $R_{0}({\bm{h}}_{n})$ for all $n=1,2,3$. Let ${\bm{x}}_{n}^{\perp}$ denote a unit vector perpendicular to ${\bm{x}}_{n}$, such that $\{{\bm{x}}_{n},{\bm{x}}_{n}^{\perp}\}$ form an orthonormal basis for $\mathbb{R}^{2}$. Consider the univariate functions $h_{n}^{\parallel}(t):={\bm{x}}_{n}^{\top}{\bm{h}}_{n}({\bm{x}}_{n}t)$ and $h_{n}^{\perp}(t):=({\bm{x}}_{n}^{\perp})^{\top}{\bm{h}}_{n}({\bm{x}}_{n}^{\perp}t+{\bm{x}}_{n})$. Here $h_{n}^{\parallel}$ is the restriction of ${\bm{h}}_{n}$ to the line spanned by ${\bm{x}}_{n}$, projected onto the ${\bm{x}}_{n}$ direction, and $h_{n}^{\perp}$ is the restriction of ${\bm{h}}_{n}$ to the line perpendicular to ${\bm{x}}_{n}$ passing through the point ${\bm{x}}_{n}$, projected onto the ${\bm{x}}_{n}^{\perp}$ direction. In particular, by the constraints on ${\bm{h}}_{n}$, we see that $h_{n}^{\parallel}$ and $h_{n}^{\perp}$ satisfy the constraints

hn(t)={0ift<1/2+ρ1λtift>1ρh_{n}^{\parallel}(t)=\begin{cases}0&\text{if}~{}t<-1/2+\rho\\ 1-\lambda t&\text{if}~{}t>1-\rho\end{cases} (87)

and hn(t)=λtif|t|ρh_{n}^{\perp}(t)=-\lambda t~{}\text{if}~{}|t|\leq\rho.

Claim 1: For all n=1,2,3n=1,2,3, R0(𝒉n)R0(hn)+R0(hn)R_{0}({\bm{h}}_{n})\geq R_{0}(h_{n}^{\parallel})+R_{0}(h_{n}^{\perp}). Proof: Let 𝒉n(𝒚)=k𝒂k[𝒘k𝒚+bk]++𝒄{\bm{h}}_{n}({\bm{y}})=\sum_{k}{\bm{a}}_{k}[{\bm{w}}_{k}^{\top}{\bm{y}}+b_{k}]_{+}+{\bm{c}} be any realization of 𝒉n{\bm{h}}_{n}, whose representation cost is C=k12(𝒂k2+𝒘k2)C=\sum_{k}\frac{1}{2}\left(\|{\bm{a}}_{k}\|^{2}+\|{\bm{w}}_{k}\|^{2}\right). Then realizations of hnh_{n}^{\parallel} and hnh_{n}^{\perp} are given by

\displaystyle h_{n}^{\parallel}(t) \displaystyle={\bm{x}}_{n}^{\top}{\bm{h}}_{n}({\bm{x}}_{n}t)=\sum_{k}({\bm{x}}_{n}^{\top}{\bm{a}}_{k})[({\bm{x}}_{n}^{\top}{\bm{w}}_{k})t+b_{k}]_{+}+{\bm{x}}_{n}^{\top}{\bm{c}}. (88)
hn(t)\displaystyle h_{n}^{\perp}(t) =(𝒙n)𝒉n(𝒙nt+𝒙n)=k((𝒙n)𝒂k)[((𝒙n)𝒘k)t+bk+𝒘k𝒙n]++(𝒙n)𝒄,\displaystyle=({\bm{x}}_{n}^{\perp})^{\top}{\bm{h}}_{n}({\bm{x}}_{n}^{\perp}t+{\bm{x}}_{n})=\sum_{k}(({\bm{x}}_{n}^{\perp})^{\top}{\bm{a}}_{k})[(({\bm{x}}_{n}^{\perp})^{\top}{\bm{w}}_{k})t+b_{k}+{\bm{w}}_{k}^{\top}{\bm{x}}_{n}]_{+}+({\bm{x}}_{n}^{\perp})^{\top}{\bm{c}}, (89)

whose representation costs CC^{\parallel} and CC^{\perp}, respectively, are given by

C\displaystyle C^{\parallel} =k12((𝒙n𝒂k)2+(𝒙n𝒘k)2)\displaystyle=\sum_{k}\frac{1}{2}\left(({\bm{x}}_{n}^{\top}{\bm{a}}_{k})^{2}+({\bm{x}}_{n}^{\top}{\bm{w}}_{k})^{2}\right) (90)
C\displaystyle C^{\perp} =k12(((𝒙n)𝒂k)2+((𝒙n)𝒘k)2)\displaystyle=\sum_{k}\frac{1}{2}\left((({\bm{x}}_{n}^{\perp})^{\top}{\bm{a}}_{k})^{2}+(({\bm{x}}_{n}^{\perp})^{\top}{\bm{w}}_{k})^{2}\right) (91)

and by the Pythagorean Theorem we see that C=C+CC=C^{\parallel}+C^{\perp}. Therefore, CR0(hn)+R0(hn)C\geq R_{0}(h_{n}^{\parallel})+R_{0}(h_{n}^{\perp}) and finally minimizing over all realizations of 𝒉n{\bm{h}}_{n} gives the claim.

Claim 2: For all $n=1,2,3$, $R_{0}(h_{n}^{\parallel})\geq\left|\frac{1-\lambda\beta}{\beta-\alpha}\right|+\left|\frac{1-\lambda\alpha}{\beta-\alpha}\right|$, where $\beta=1-\rho$ and $\alpha=-1/2+\rho$. Proof: By results in Savarese et al. (2019), $R_{0}(h_{n}^{\parallel})\geq R_{0}(p)$ where $p(t)$ is the function satisfying the same constraints as $h_{n}^{\parallel}$ given in (87) while linearly interpolating over the interval $t\in[\alpha,\beta]$. From the formula $R_{0}(p)=\max\{\int|p^{\prime\prime}(t)|dt,\,|p^{\prime}(\infty)+p^{\prime}(-\infty)|\}$ established in Savarese et al. (2019), we can show directly that $R_{0}(p)=\left|\frac{1-\lambda\beta}{\beta-\alpha}\right|+\left|\frac{1-\lambda\alpha}{\beta-\alpha}\right|$, which gives the claimed bound.
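For completeness, here is the short computation behind the last step (a sketch that uses only the piecewise-linear form of $p$ described above). The slopes of $p$ are $0$ for $t<\alpha$, $m:=\frac{1-\lambda\beta}{\beta-\alpha}$ on $(\alpha,\beta)$, and $-\lambda$ for $t>\beta$, so

\int|p^{\prime\prime}(t)|dt=|m-0|+|-\lambda-m|=\left|\frac{1-\lambda\beta}{\beta-\alpha}\right|+\left|\frac{1-\lambda\alpha}{\beta-\alpha}\right|,

where we used $\lambda+m=\frac{\lambda(\beta-\alpha)+1-\lambda\beta}{\beta-\alpha}=\frac{1-\lambda\alpha}{\beta-\alpha}$. On the other hand, $|p^{\prime}(\infty)+p^{\prime}(-\infty)|=|-\lambda+0|=|\lambda|$, and by the triangle inequality $|1-\lambda\beta|+|1-\lambda\alpha|\geq|(1-\lambda\alpha)-(1-\lambda\beta)|=|\lambda|(\beta-\alpha)$, so the maximum defining $R_{0}(p)$ is attained by the $\int|p^{\prime\prime}|$ term, which equals the claimed expression.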

Claim 3: For all n=1,2,3n=1,2,3, R0(hn)|λ|R_{0}(h_{n}^{\perp})\geq|\lambda|. Proof: The function q(t)q(t) with minimal R0R_{0}-cost satisfying the constraint q(t)=λtq(t)=-\lambda t for |t|ρ|t|\leq\rho is a single ReLU unit plus a constant: q(t)=λ[t+ρ]++λρq(t)=-\lambda[t+\rho]_{+}+\lambda\rho, which has R0R_{0}-cost |λ||\lambda|.

Putting the above claims together, we see that

R0(𝒉n)\displaystyle R_{0}({\bm{h}}_{n}) |1λββα|+|1λαβα|+|λ|\displaystyle\geq\left|\frac{1-\lambda\beta}{\beta-\alpha}\right|+\left|\frac{1-\lambda\alpha}{\beta-\alpha}\right|+|\lambda| (92)
|2λ(β+α)βα|+|λ|\displaystyle\geq\left|\frac{2-\lambda(\beta+\alpha)}{\beta-\alpha}\right|+|\lambda| (93)
2|λ|(β+α)βα+|λ|\displaystyle\geq\frac{2-|\lambda|(\beta+\alpha)}{\beta-\alpha}+|\lambda| (94)
=2βα+2αβα|λ|\displaystyle=\frac{2}{\beta-\alpha}+\frac{-2\alpha}{\beta-\alpha}|\lambda| (95)
2βα\displaystyle\geq\frac{2}{\beta-\alpha} (96)

where in the last inequality we used the fact that 2αβα>0\frac{-2\alpha}{\beta-\alpha}>0 since α=1/2+ρ<0\alpha=-1/2+\rho<0 and βα=3/22ρ>0\beta-\alpha=3/2-2\rho>0.

Therefore, R(𝒇)=R(𝒉)R0(𝒉1)+R0(𝒉2)+R0(𝒉3)6βαR({\bm{f}})=R({\bm{h}})\geq R_{0}({\bm{h}}_{1})+R_{0}({\bm{h}}_{2})+R_{0}({\bm{h}}_{3})\geq\frac{6}{\beta-\alpha}. Also, the function 𝒇{\bm{f}}^{*} given by

𝒇=𝒇1+𝒇2+𝒇3{\bm{f}}^{*}={\bm{f}}_{1}^{*}+{\bm{f}}_{2}^{*}+{\bm{f}}_{3}^{*}

where

𝒇n(𝒚)=𝒙nβα([𝒙n𝒚α]+[𝒙n𝒚β]+){\bm{f}}_{n}^{*}({\bm{y}})=\frac{{\bm{x}}_{n}}{\beta-\alpha}([{\bm{x}}_{n}^{\top}{\bm{y}}-\alpha]_{+}-[{\bm{x}}_{n}^{\top}{\bm{y}}-\beta]_{+})

satisfies norm-ball interpolation constraints and R(𝒇)=6βαR({\bm{f}}^{*})=\frac{6}{\beta-\alpha}. Hence, 𝒇{\bm{f}}^{*} is a min-cost solution. ∎
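As a numerical sanity check of Proposition 4 (our own sketch, not part of the proof), one can instantiate (86) for a concrete equilateral triangle in $\mathbb{R}^{2}$, verify the interpolation constraints at the three vertices, and confirm that the explicit realization with two units per vertex has cost $6/(\beta-\alpha)$:

# Verify the equilateral-triangle minimizer (86): f*(x_n) = x_n at the
# vertices, and the explicit realization has cost 6 / (beta - alpha).
import numpy as np

rho = 0.2                                                  # requires rho < 1/2
angles = np.array([0.0, 2 * np.pi / 3, 4 * np.pi / 3])
X = np.stack([np.cos(angles), np.sin(angles)], axis=1)     # unit-norm vertices, centroid 0

alpha, beta = -0.5 + rho, 1.0 - rho
s = 1.0 / (beta - alpha)

def f_star(y):
    out = np.zeros(2)                                      # here u_n = x_n and x_0 = 0
    for x in X:
        t = x @ y
        out += x * s * (np.maximum(t - alpha, 0.0) - np.maximum(t - beta, 0.0))
    return out

for x in X:
    assert np.allclose(f_star(x), x, atol=1e-12)
# The centroid is mapped to itself (a symmetry consequence, not a constraint).
assert np.allclose(f_star(np.zeros(2)), 0.0, atol=1e-12)

# Each phi_n uses two ReLU units with unit-norm inner weights and outer
# weights of norm s, so the total representation cost is 3 * 2 * s.
cost = 3 * 2 * s
assert np.isclose(cost, 6.0 / (beta - alpha))
print("interpolation verified; representation cost =", cost)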

Appendix D Additional simulations

We train a one-hidden-layer ReLU network without a skip connection, using the same setting as in Figure 1. As can be seen from Figure 10, we get a similar result for the NN denoiser with and without the skip connection. Therefore, in Appendix D.1 we use a one-hidden-layer ReLU network without a skip connection.

Figure 10: NN denoiser vs. eMMSE denoiser. We trained a one-hidden-layer ReLU network on a denoising task. The clean dataset has four points equally spaced in the interval $[-5,5]$, and the noisy samples are generated by adding zero-mean Gaussian noise with $\sigma=1.5$. We use $\lambda=10^{-5}$ in both settings. The figure shows the denoiser output as a function of its input for: (1) the NN denoiser trained online using (7) for $100K$ iterations, (2) the NN denoiser trained offline using (8) with $M=9000$ and $20K$ epochs, and (3) the eMMSE denoiser (4).
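For reference, the following is a minimal sketch of the online 1-D training setup behind this figure (an illustrative paraphrase rather than the exact training code; we assume the online objective (7) is the MSE between the network output on a freshly noised sample and the corresponding clean point, with the $\ell^{2}$ penalty applied to the weights only; the hidden width and learning rate below are placeholder values):

# Sketch of online denoiser training in 1-D with a small weight decay.
import torch

torch.manual_seed(0)
clean = torch.linspace(-5.0, 5.0, 4).unsqueeze(1)       # four clean points in [-5, 5]
sigma, weight_decay, width = 1.5, 1e-5, 500             # sigma and lambda as in the caption

lin1 = torch.nn.Linear(1, width)
lin2 = torch.nn.Linear(width, 1)
skip = torch.nn.Linear(1, 1, bias=False)                # linear skip connection

def model(y):
    return lin2(torch.relu(lin1(y))) + skip(y)

# Regularize only the inner/outer weights (not the biases or the skip).
opt = torch.optim.Adam([
    {"params": [lin1.weight, lin2.weight], "weight_decay": weight_decay},
    {"params": [lin1.bias, lin2.bias, skip.weight], "weight_decay": 0.0},
], lr=1e-3)

for step in range(100_000):
    noisy = clean + sigma * torch.randn_like(clean)     # fresh noise every iteration
    loss = torch.mean((model(noisy) - clean) ** 2)
    opt.zero_grad()
    loss.backward()
    opt.step()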

D.1 MNIST

We use the MNIST dataset to verify several properties. First, we verify that the offline and online solutions achieve approximately the same test MSE when trained on a subset of the MNIST dataset (Figure 11). Second, we show that the failure of the NN denoiser to converge to the eMMSE denoiser is not due to approximation error (Figure 12). Lastly, we present the critical noise level below which the representation cost minimizer $f^{*}_{\mathrm{1D}}$ has strictly lower MSE than the eMMSE denoiser (Figure 13).

Figure 11: Online setting vs. offline setting for the MNIST denoiser. We train a one-hidden-layer ReLU network on a subset of $N=100$ MNIST images for 10K iterations. We use zero-mean Gaussian noise with $\sigma=0.1$.
Figure 12: Test loss vs. layer width for the MNIST denoiser. We train a one-hidden-layer ReLU network on the MNIST denoising task using (7) for $93K$ iterations with a fixed learning rate. We use zero-mean Gaussian noise with $\sigma=0.1$. The figure shows the test loss vs. the layer width.
Figure 13: MSE vs. noise std. We train a one-hidden-layer ReLU network on a subset of $N=100$ MNIST images (the range of each pixel is $[0,1]$) for 10K iterations with a fixed learning rate. We use zero-mean Gaussian noise. The figure shows the MSE vs. the noise std ($\sigma$) for the NN denoiser (orange line) and for the eMMSE denoiser (blue line). Note that the eMMSE depends on $\sigma$ (4). For low noise levels, the eMMSE output is one of the training set images. For moderate noise levels, the eMMSE output is a weighted sum of the training set images. For high noise levels, the eMMSE output is the mean of the training set images.
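For reference, here is a minimal sketch of the eMMSE baseline used in Figures 10-13 (our own implementation; we assume (4) is the posterior mean over the clean training set under i.i.d. zero-mean Gaussian noise):

# Empirical MMSE denoiser: a softmax-weighted average of the clean
# training points, with weights proportional to exp(-||y - x_i||^2 / (2 sigma^2)).
import numpy as np

def emmse_denoise(y, clean_points, sigma):
    # y: (d,) noisy input; clean_points: (N, d); returns the (d,) estimate.
    d2 = np.sum((clean_points - y) ** 2, axis=1)   # squared distances
    logits = -d2 / (2.0 * sigma ** 2)
    logits -= logits.max()                         # stabilize before exponentiating
    w = np.exp(logits)
    w /= w.sum()
    return w @ clean_points

# As sigma -> 0 the weights concentrate on the nearest training image;
# as sigma -> infinity they become uniform and the output tends to the
# mean of the training set, matching the behavior described above.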

D.2 Three non-colinear training samples

We show in Figures 14 and 15 that, for $N=3$ training points from the MNIST dataset forming a triangle in a $d=2$-dimensional plane, the empirical minimizer obtained using noisy samples and weight-decay regularization agrees well with the form of the exact representation cost minimizer predicted by Proposition 2 and Conjecture 1.

Figure 14: Function space view of a denoiser trained on $3$ MNIST data points (Acute Angle). Here, we compare empirical results (left) with our theoretical results (right). We compare the function-space view for inputs from the data plane, with respect to the model output in each of the data directions. For the empirical results, we choose $3$ random MNIST data points with the same label, under the condition that they form an acute triangle ($64^{\circ}$, for this figure). We trained a single-layer FC ReLU network with a linear residual connection for 1M epochs, with weight decay of $10^{-8}$ (as described in our model), and the Adam optimizer with learning rate $10^{-5}$.
Figure 15: Function space view of a denoiser trained on 3 MNIST data points (Obtuse Angle). Here, we compare empirical results (left) with our theoretical results (right). We compare the function-space view for inputs from the data plane, with respect to the model output in each of the data directions. For the empirical results, we added a single data point to the two previously chosen for Figure 14, under the condition that the three points form an obtuse triangle ($95^{\circ}$, for this figure). We trained a single-layer FC ReLU network as described in Figure 14. As predicted, the function we converged to for data forming an obtuse triangle is noticeably different from the function we converged to when the data formed an acute triangle.

D.3 Empirical validation of the subspace assumption

We validated that the following image datasets are (approximately) low rank:

  • CIFAR10

  • CINIC10

  • Tiny ImageNet (a lower resolution version of ImageNet, enabling us to use SVD)

  • BSD (a denoising benchmark composed of $128\times 1600$ patches of size $40\times 40$ cropped from $400$ images (Zhang et al., 2017))

As can be seen from Table 1, all the datasets that we used are (approximately) low rank.

Table 1: We applied a Singular Value Decomposition (SVD) to each of the above datasets, and calculated the relative number of Singular Values (SV) needed to achieve a given percentile of the energy (for the average vector).
Dataset 95% 99% 99.9%
CIFAR10 0.8% 7.5% 30%
CINIC10 1% 23% 41%
Tiny ImageNet 1.6% 20% 36%
BSD 0.1% 1.6% 4.5%
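The following is a minimal sketch of the computation behind Table 1 (our own code; the exact preprocessing, e.g., whether the images are mean-subtracted before the SVD, is an implementation choice that we leave as an assumption):

# Fraction of singular values needed so that the retained squared singular
# values capture a given percentile of the total energy of a dataset.
import numpy as np

def sv_fraction_for_energy(data_matrix, percentiles=(0.95, 0.99, 0.999)):
    # data_matrix: (num_samples, dim), one vectorized image per row.
    s = np.linalg.svd(data_matrix, compute_uv=False)
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    return {p: (np.searchsorted(energy, p) + 1) / len(s) for p in percentiles}

# Example with synthetic, approximately low-rank data (illustration only):
rng = np.random.default_rng(0)
data = rng.standard_normal((1000, 50)) @ rng.standard_normal((50, 3072)) \
    + 0.01 * rng.standard_normal((1000, 3072))
print(sv_fraction_for_energy(data))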