Learning Neural Networks with Distribution Shift:
Efficiently Certifiable Guarantees
Abstract
We give the first provably efficient algorithms for learning neural networks with distribution shift. We work in the Testable Learning with Distribution Shift framework (TDS learning) of [KSV24b], where the learner receives labeled examples from a training distribution and unlabeled examples from a test distribution and must either output a hypothesis with low test error or reject if distribution shift is detected. No assumptions are made on the test distribution.
All prior work in TDS learning focuses on classification, while here we must handle the setting of nonconvex regression. Our results apply to real-valued networks with arbitrary Lipschitz activations and work whenever the training distribution has strictly sub-exponential tails. For training distributions that are bounded and hypercontractive, we give a fully polynomial-time algorithm for TDS learning one-hidden-layer networks with sigmoid activations. We achieve this by importing classical kernel methods into the TDS framework using data-dependent feature maps and a type of kernel matrix that couples samples from both train and test distributions.
1 Introduction
Understanding when a model will generalize from a known training distribution to an unknown test distribution is a critical challenge in trustworthy machine learning and domain adaptation. Traditional approaches to this problem prove generalization bounds in terms of various notions of distance between train and test distributions [BDBCP06, BDBC+10, MMR09] but do not provide efficient algorithms. Recent work due to [KSV24b] departs from this paradigm and defines the model of Testable Learning with Distribution Shift (TDS learning), where a learner may reject altogether if significant distribution shift is detected. When the learner accepts, however, it outputs a classifier and a proof that the classifier has nearly optimal test error.
A sequence of works has given the first set of efficient algorithms in the TDS learning model for well-studied function classes where no assumptions are taken on the test distribution [KSV24b, KSV24a, CKK+24, GSSV24]. These results, however, hold for classification and therefore do not apply to (nonconvex) regression problems and in particular to a long line of work giving provably efficient algorithms for learning simple classes of neural networks under natural distributional assumptions on the training marginal [GK19, DGK+20, DKKZ20, DKTZ22, CKM22, CDG+23, WZDD23, GGKS24, DK24].
The main contribution of this work is the first set of efficient TDS learning algorithms for broad classes of (nonconvex) regression problems. Our results apply to neural networks with arbitrary Lipschitz activations of any constant depth. As one example, we obtain a fully polynomial-time algorithm for learning one-hidden-layer neural networks with sigmoid activations with respect to any bounded and hypercontractive training distribution. For bounded training distributions, the running times of our algorithms match the best known running times for ordinary PAC or agnostic learning (without distribution shift). We emphasize that unlike all prior work in domain adaptation, we make no assumptions on the test distribution.
Regression Setting. We assume the learner has access to labeled examples from the training distribution and unlabeled examples from the marginal of the test distribution. We consider the squared loss, i.e., the expected squared difference between the label and the prediction. The error benchmark is analogous to the benchmark for TDS learning in classification [KSV24b] and depends on two quantities: the optimum training error achievable by a classifier in the learnt class, denoted $\mathsf{opt}$, and the best joint error achievable by a single classifier on both the training and test distributions, denoted $\lambda$. Achieving test error on the order of $\lambda$ (up to additive slack) is the standard goal in domain adaptation [BDBCP06, BCK+07, MMR09]. We now formally define the TDS learning framework for regression.
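Concretely, one standard way to formalize these quantities (our notation here; the formal guarantee is stated in Definition 1.1 below) is
\[
\mathcal{L}_{\mathcal{D}}(h) = \mathbf{E}_{(\mathbf{x},y)\sim\mathcal{D}}\big[(y - h(\mathbf{x}))^2\big], \qquad
\mathsf{opt} = \min_{f\in\mathcal{F}} \mathcal{L}_{\mathcal{D}}(f), \qquad
\lambda = \min_{f\in\mathcal{F}} \big(\mathcal{L}_{\mathcal{D}}(f) + \mathcal{L}_{\mathcal{D}'}(f)\big),
\]
where $\mathcal{D}$ and $\mathcal{D}'$ denote the training and test distributions and $\mathcal{F}$ the function class.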
Definition 1.1 (Testable Regression with Distribution Shift).
For parameters $\epsilon, \delta \in (0,1)$ and a function class $\mathcal{F}$, the learner receives i.i.d. labeled examples from some unknown training distribution $\mathcal{D}$ over $\mathbb{R}^d \times \mathbb{R}$ and i.i.d. unlabeled examples from the marginal $\mathcal{D}'_{\mathbf{x}}$ of another unknown test distribution $\mathcal{D}'$ over $\mathbb{R}^d \times \mathbb{R}$. The learner either rejects, or it accepts and outputs a hypothesis $h : \mathbb{R}^d \to \mathbb{R}$ such that the following are true.
1. (Soundness) With probability at least $1 - \delta$: if the algorithm accepts, then the output $h$ satisfies the test-error benchmark described above, i.e., $\mathcal{L}_{\mathcal{D}'}(h)$ is bounded in terms of $\mathsf{opt}$ and $\lambda$, up to an additive $\epsilon$.
2. (Completeness) If the training and test marginals are identical, i.e., $\mathcal{D}_{\mathbf{x}} = \mathcal{D}'_{\mathbf{x}}$, then the algorithm accepts with probability at least $1 - \delta$.
1.1 Our Results
Our results hold for classes of Lipschitz neural networks. In particular, we consider functions of the following form. Let $\sigma : \mathbb{R} \to \mathbb{R}$ be an activation function. Let $\mathbf{W} = (W^{(1)}, \dots, W^{(t)})$ with $W^{(i)} \in \mathbb{R}^{k_i \times k_{i-1}}$ be the tuple of weight matrices. Here, $k_0 = d$ is the input dimension and $k_t = 1$. Define recursively the function $f_i : \mathbb{R}^d \to \mathbb{R}^{k_i}$ as $f_i(\mathbf{x}) = \sigma(W^{(i)} f_{i-1}(\mathbf{x}))$, with $f_0(\mathbf{x}) = \mathbf{x}$. The function computed by the neural network is defined as $f_{\mathbf{W}}(\mathbf{x}) = W^{(t)} f_{t-1}(\mathbf{x})$. The depth of this network is $t$.
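As a concrete illustration of this recursive definition, here is a minimal NumPy sketch (our own illustrative code; the activation, depth, and dimensions below are arbitrary choices, not fixed by our results):

```python
import numpy as np

def neural_net(x, weights, activation=np.tanh):
    """Evaluate f_W(x) = W^(t) f_{t-1}(x), where f_0(x) = x and
    f_i(x) = activation(W^(i) f_{i-1}(x)) for the hidden layers."""
    h = x
    for W in weights[:-1]:          # hidden layers 1, ..., t-1
        h = activation(W @ h)
    return float(weights[-1] @ h)   # output layer W^(t) has shape 1 x k_{t-1}

# Example: depth t = 2 (one hidden layer) with d = 3 inputs and k_1 = 4 hidden units.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((1, 4))
print(neural_net(np.array([0.5, -1.0, 2.0]), [W1, W2]))
```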
We now present our main results on TDS learning for neural networks.
Function Class | Runtime (Bounded) | Runtime (Subgaussian)
---|---|---
One-hidden-layer Sigmoid Net | |
Single ReLU | |
Sigmoid Nets | |
$L$-Lipschitz Nets | |
From the above table, we highlight that in the cases of bounded distributions with (1) one-hidden-layer Sigmoid Nets and (2) a single ReLU (in the parameter regime indicated in Table 1), we obtain TDS algorithms that run in polynomial time in all parameters. Moreover, for the last row, regarding Lipschitz Nets, each neuron is allowed to have a different and unknown Lipschitz activation. In particular, our results capture the class of single-index models (see, e.g., [KKSK11, GGKS24]).
In the results of Table 1, we assume bounded labels for both the training and test distributions. This assumption can be relaxed to a bound on a constant-degree moment of the labels of sufficiently high degree (see Corollary D.2). In fact, such an assumption is necessary, as we show in Proposition D.1.
1.2 Our Techniques
TDS Learning via Kernel Methods. The major technical contribution of this work is devoted to importing classical kernel methods into the TDS learning framework. A first attempt at testing distribution shift with respect to a fixed feature map would be to form two corresponding covariance matrices of the expanded features, one from samples drawn from the training distribution and the other from samples drawn from the test distribution, and test if these two matrices have similar eigendecompositions. This approach only yields efficient algorithms for linear kernels, however, as here we are interested in spectral properties of covariance matrices in the feature space corresponding to low-degree polynomials, whose dimension is too large.
Instead, we form a new data-dependent and concise reference feature map $\Phi$ that depends on examples from both the training and the test marginals. We show that this feature map approximately represents the ground truth, i.e., some function with both low training and test error (this is due to the representer theorem, see Proposition 3.7). To certify that error bounds transfer from the training distribution to the test distribution, we require relative-error closeness between the covariance matrix of the feature expansion over the test marginal and the corresponding matrix over the training marginal. We draw fresh sets of verification examples and show how the kernel trick can be used to efficiently achieve these approximations even though $\Phi$ is a nonstandard feature map. We provide a more detailed technical overview and a formal proof in Section 3.1.
By instantiating the above results using a type of polynomial kernel, we can reduce the problem of TDS learning neural networks to the problem of obtaining an appropriate polynomial approximator. Our final training algorithm (as opposed to the testing phase) will essentially be kernelized polynomial regression.
TDS Learning and Uniform Approximation. Prior work in TDS learning has established connections between polynomial approximation theory and efficient algorithms in the TDS setting. In particular, the existence of low-degree sandwiching approximators for a concept class is known to imply dimension-efficient TDS learning algorithms for binary classification. The notion of sandwiching approximators for a function $f$ refers to a pair of low-degree polynomials $p_{\mathrm{up}}, p_{\mathrm{down}}$ with two main properties: (1) $p_{\mathrm{down}} \le f \le p_{\mathrm{up}}$ everywhere and (2) the expected absolute distance between $p_{\mathrm{up}}$ and $p_{\mathrm{down}}$ over some reference distribution is small. The first property is of particular importance in the TDS setting, since it holds everywhere and, therefore, it holds for any test distribution unconditionally.
Here we make the simple observation that the incomparable notion of uniform approximation suffices for TDS learning. A uniform approximator is a polynomial $p$ that approximates a function $f$ pointwise, meaning that $|p(\mathbf{x}) - f(\mathbf{x})|$ is small at every point $\mathbf{x}$ within a ball around the origin (there is no known direct relationship between sandwiching and uniform approximators). In our setting, uniform approximation is more convenient, due to the existence of powerful tools from polynomial approximation theory for Lipschitz and analytic functions.
Contrary to the sandwiching property, the uniform approximation property cannot hold everywhere if the approximated function class contains high-(or infinite-) degree functions. When the training distribution has strictly sub-exponential tails, however, the expected error of approximation outside the radius of approximation is negligible. Importantly, this property can be certified for the test distribution by using a moment-matching tester. See Section 4.2 for a more detailed technical overview and for the full proof.
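For intuition, a moment-matching tester of the kind referred to above can be sketched as follows (a minimal illustration; the names, the choice of `degree`, and the `slack` threshold are hypothetical here, whereas the actual algorithm sets them according to the analysis in Section 4.2):

```python
import numpy as np
from itertools import combinations_with_replacement

def low_degree_moments(X, degree):
    """Empirical moments E[x^alpha] for all monomials of total degree <= degree."""
    n, d = X.shape
    moments = {}
    for k in range(1, degree + 1):
        for alpha in combinations_with_replacement(range(d), k):
            vals = np.ones(n)
            for j in alpha:
                vals *= X[:, j]
            moments[alpha] = vals.mean()
    return moments

def moment_matching_test(X_train, X_test, degree, slack):
    """Accept iff every low-degree empirical moment of the test sample is
    within `slack` of the corresponding moment of the training sample."""
    m_tr = low_degree_moments(X_train, degree)
    m_te = low_degree_moments(X_test, degree)
    return all(abs(m_tr[a] - m_te[a]) <= slack for a in m_tr)
```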
1.3 Related Work
Learning with Distribution Shift. The field of domain adaptation has been studying the distribution shift problem for almost two decades [BDBCP06, BCK+07, BDBC+10, MMR09, DLLP10, MKFAS20, RMH+20, KZZ24, HK19, HK24, ACM24], providing useful insights regarding the information-theoretic (im)possibilities for learning with distribution shift. The first efficient end-to-end algorithms for non-trivial concept classes with distribution shift were given for TDS learning in [KSV24b, KSV24a, CKK+24] and for PQ learning, originally defined by [GKKM20], in [GSSV24]. These works focus on binary classification for classes like halfspaces, halfspace intersections, and geometric concepts. In the regression setting, we need to handle unbounded loss functions, but we are also able to use Lipschitz properties of real-valued networks to obtain results even for deeper architectures. For the special case of linear regression, efficient algorithms for learning with distribution shift are known to exist (see, e.g., [LHL21]), but our results capture much broader classes.
Another distinction between the existing works in TDS learning and our work is that our results require significantly milder assumptions on the training distribution. In particular, while all prior works on TDS learning require both concentration and anti-concentration for the training marginal [KSV24b, KSV24a, CKK+24], we only assume strictly subexponential concentration in every direction. This is possible because the function classes we consider are Lipschitz, which is not the case for binary classification.
Testable Learning. More broadly, TDS learning is related to the notion of testable learning [RV23, GKK23, GKSV24b, DKK+23, GKSV24a, DKLZ24, STW24], originally defined by [RV23] for standard agnostic learning, aiming to certify optimal performance for learning algorithms without relying directly on any distributional assumptions. The main difference between testable agnostic learning and TDS learning is that in TDS learning, we allow for distribution shift, while in testable agnostic learning the training and test distributions are the same. Because of this, TDS learning remains challenging even in the absence of label noise, in which case testable learning becomes trivial [KSV24b].
Efficient Learning of Neural Networks. Many works have focused on providing upper and lower bounds on the computational complexity of learning neural networks in the standard (distribution-shift-free) setting [GKKT17, GK19, GGJ+20, GGK20, DGK+20, DKZ20, DKKZ20, DKTZ22, CGKM22, CKM22, CDG+23, WZDD23, GGKS24, DK24, LMZ20, GMOV19, ZYWG19, VW19, AZLL19, BJW19, MR18, GKLW19, GLM18, DLT18, GKM18, Tia17, LY17, BG17, ZSJ+17, ZLJ16a, JSA15]. The majority of the upper bounds either require noiseless labels and shallow architectures or work only under Gaussian training marginals. Our results not only hold in the presence of distribution shift, but also capture deeper architectures, under any strictly subexponential training marginal and allow adversarial label noise.
The upper bounds that are closest to our work are those given by [GKKT17]. They consider ReLU as well as sigmoid networks, allow for adversarial label noise and assume that the training marginal is bounded but otherwise arbitrary. Our results in Section 3 extend all of the results in [GKKT17] to the TDS setting, by assuming additionally that the training distribution is hypercontractive (see Definition 3.9). This additional assumption is important to ensure that our tests will pass when there is no distribution shift. For a more thorough technical comparison with [GKKT17], see Section 3.
In Section 4, we provide upper bounds for TDS learning of Lipschitz networks even when the training marginal is an arbitrary strictly subexponential distribution. In particular, our results imply new bounds for standard agnostic learning of single ReLU neurons, where we achieve a comparable runtime under any strictly subexponential marginal; the only previously known upper bounds work under the Gaussian marginal [DGK+20] and achieve a similar runtime. In fact, in the statistical query framework [Kea98], a runtime of this order is known to be necessary for agnostically learning the ReLU, even under the Gaussian distribution [DKZ20, GGK20].
2 Preliminaries
We use standard vector and matrix notation. We denote with $\mathbb{R}$ and $\mathbb{N}$ the sets of real and natural numbers, respectively. We denote with $\mathcal{D}, \mathcal{D}'$ labeled distributions over $\mathbb{R}^d \times \mathbb{R}$ and with $\mathcal{D}_{\mathbf{x}}, \mathcal{D}'_{\mathbf{x}}$ the corresponding marginals on the features in $\mathbb{R}^d$. For a set $S$ of points in $\mathbb{R}^d$, we define the empirical probabilities (resp. expectations) as $\mathbf{Pr}_{\mathbf{x}\sim S}[\cdot]$ (resp. $\mathbf{E}_{\mathbf{x}\sim S}[\cdot]$), i.e., with respect to the uniform distribution over $S$. We denote with $\bar{S}$ the labeled version of $S$, and we define the clipping function $\mathrm{cl}_M : \mathbb{R} \to [-M, M]$, which maps a number $t$ to itself if $|t| \le M$, and to $\mathrm{sign}(t)\cdot M$ otherwise.
Loss function. Throughout this work, we denote with $\mathcal{L}_{\mathcal{D}}(h)$ the squared loss of a hypothesis $h : \mathbb{R}^d \to \mathbb{R}$ with respect to a labeled distribution $\mathcal{D}$, i.e., $\mathcal{L}_{\mathcal{D}}(h) = \mathbf{E}_{(\mathbf{x},y)\sim\mathcal{D}}[(y - h(\mathbf{x}))^2]$. Moreover, for any real-valued function $f$ over $\mathbb{R}^d$, we denote with $\|f\|_{\mathcal{D}_{\mathbf{x}}}$ the quantity $(\mathbf{E}_{\mathbf{x}\sim\mathcal{D}_{\mathbf{x}}}[f(\mathbf{x})^2])^{1/2}$. For a set of labeled examples $\bar{S}$, we denote with $\mathcal{L}_{\bar{S}}(h)$ the empirical loss on $\bar{S}$, i.e., the squared loss with respect to the uniform distribution over $\bar{S}$, and similarly for the empirical analogue of $\|\cdot\|_{\mathcal{D}_{\mathbf{x}}}$.
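In code, the clipped empirical squared loss used throughout reads roughly as follows (an illustrative snippet in our notation; the function names are ours):

```python
import numpy as np

def clip(t, M):
    """cl_M: map t to itself if |t| <= M, and to sign(t) * M otherwise."""
    return np.clip(t, -M, M)

def empirical_squared_loss(h, labeled_sample, M=None):
    """Average of (y - h(x))^2 over a labeled sample; optionally clip h's predictions."""
    X, y = labeled_sample
    preds = np.array([h(x) for x in X])
    if M is not None:
        preds = clip(preds, M)
    return float(np.mean((y - preds) ** 2))
```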
Distributional Assumptions. In order to obtain efficient algorithms, we will either assume that the training marginal is bounded and hypercontractive (Section 3) or that it has strictly subexponential tails in every direction (Section 4). We make no assumptions on the test marginal .
Regarding the labels, we assume some mild bound on the moments of the training and the test labels, e.g., (a) a bound on some constant-degree moment of the labels, or (b) that the labels are bounded almost surely, for both $\mathcal{D}$ and $\mathcal{D}'$. Although, ideally, we would want to avoid any assumptions on the test distribution, as we show in Proposition D.1, a bound on some constant-degree moment of the test labels is necessary.
3 Bounded Training Marginals
We begin with the scenario where the training distribution is known to be bounded. In this case, it is known that one-hidden-layer sigmoid networks can be agnostically learned (in the classical sense, without distribution shift) in fully polynomial time, and single ReLU neurons can be agnostically learned in time polynomial in the dimension (with runtime depending on the target error) [GKKT17]. These results are based on a kernel-based approach, combined with results from polynomial approximation theory. While polynomial approximations can reduce the nonconvex agnostic learning problem to a convex one through polynomial feature expansions, the kernel trick enables further pruning of the search space, which is important for obtaining polynomial-time algorithms. Our work demonstrates another useful implication of the kernel trick: it leads to efficient algorithms for testing distribution shift.
We will require the following standard notions:
Definition 3.1 (Kernels [Mer09]).
A function $\mathcal{K} : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is a kernel. If for any set of points $\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(m)}$ in $\mathbb{R}^d$ the matrix $(\mathcal{K}(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}))_{i,j \in [m]}$ is positive semidefinite, we say that the kernel is positive definite. The kernel is symmetric if $\mathcal{K}(\mathbf{x}, \mathbf{x}') = \mathcal{K}(\mathbf{x}', \mathbf{x})$ for all $\mathbf{x}, \mathbf{x}' \in \mathbb{R}^d$.
Any PDS kernel is associated with some Hilbert space $\mathcal{H}$ and some feature map $\psi$ from $\mathbb{R}^d$ to $\mathcal{H}$.
Fact 3.2 (Reproducing Kernel Hilbert Space).
For any positive definite and symmetric (PDS) kernel $\mathcal{K}$, there is a Hilbert space $\mathcal{H}$, equipped with an inner product $\langle\cdot,\cdot\rangle_{\mathcal{H}}$, and a function $\psi : \mathbb{R}^d \to \mathcal{H}$ such that $\mathcal{K}(\mathbf{x}, \mathbf{x}') = \langle\psi(\mathbf{x}), \psi(\mathbf{x}')\rangle_{\mathcal{H}}$ for all $\mathbf{x}, \mathbf{x}' \in \mathbb{R}^d$. We call $\mathcal{H}$ the reproducing kernel Hilbert space (RKHS) for $\mathcal{K}$ and $\psi$ the feature map for $\mathcal{K}$.
There are three main properties of the kernel method. First, although the associated feature map $\psi$ may correspond to a vector in an infinite-dimensional space, the kernel may still be efficiently evaluated, due to its analytic expression in terms of the inputs $\mathbf{x}$ and $\mathbf{x}'$. Second, the class of bounded-norm linear functions over the feature space has Rademacher complexity independent of the dimension of $\mathcal{H}$, as long as the maximum value of $\mathcal{K}(\mathbf{x}, \mathbf{x})$ for $\mathbf{x}$ in the domain is bounded (Thm. 6.12 in [MRT18]). Third, the time complexity of finding the function in this class that best fits a given dataset is polynomial in the size of the dataset, due to the representer theorem (Thm. 6.11 in [MRT18]). Taken together, these properties constitute the basis of the kernel method, implying learners with runtime independent of the effective dimension of the learning problem.
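To make these three properties concrete, here is a minimal kernel ridge regression sketch (standard textbook material, not the exact program solved by Algorithm 1; the polynomial kernel and regularization constant are arbitrary illustrative choices). The fit touches only the $m \times m$ kernel matrix, and predictions use only kernel evaluations, so the runtime never depends on the dimension of the feature space.

```python
import numpy as np

def poly_kernel(x, z, degree=4):
    """A simple PDS kernel; its feature space has dimension ~ d^degree,
    but evaluating it costs only O(d)."""
    return (1.0 + np.dot(x, z)) ** degree

def fit_kernel_ridge(X, y, kernel, reg=1e-3):
    """Representer theorem: the best-fitting RKHS element is a combination of
    the feature maps of the data points, so we only solve for m coefficients."""
    K = np.array([[kernel(a, b) for b in X] for a in X])
    alpha = np.linalg.solve(K + reg * np.eye(len(X)), y)
    return lambda x_new: float(alpha @ np.array([kernel(a, x_new) for a in X]))

# Usage: fit on m samples in R^d; the cost is poly(m), independent of d^degree.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 10))
y = np.tanh(X @ rng.standard_normal(10))
h = fit_kernel_ridge(X, y, poly_kernel)
print(h(X[0]), y[0])
```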
In order to apply the kernel method to learn some function class , it suffices to show that the class can be represented sufficiently well by the class . We give the following definition.
Definition 3.3 (Approximate Representation).
Let $\mathcal{F}$ be a function class over $\mathbb{R}^d$ and $\mathcal{K}$ a PDS kernel, where $\mathcal{H}$ is the corresponding RKHS and $\psi$ the feature map for $\mathcal{K}$. We say that $\mathcal{F}$ can be $\epsilon$-approximately represented within radius $R$ with respect to $\mathcal{K}$ if for any $f \in \mathcal{F}$, there is $v \in \mathcal{H}$ with bounded norm $\|v\|_{\mathcal{H}} \le B$ such that $|f(\mathbf{x}) - \langle v, \psi(\mathbf{x})\rangle_{\mathcal{H}}| \le \epsilon$ for all $\mathbf{x}$ with $\|\mathbf{x}\|_2 \le R$.
For the purposes of TDS learning, we will also require the training marginal to be hypercontractive with respect to the kernel at hand. This is important to ensure that our test will accept whenever there is no distribution shift. More formally, we require the following.
Definition 3.4 (Hypercontractivity).
Let $\mathcal{D}$ be some distribution over $\mathbb{R}^d$, let $\mathcal{H}$ be a Hilbert space and let $\psi : \mathbb{R}^d \to \mathcal{H}$. We say that $\mathcal{D}$ is $(C, \ell)$-hypercontractive with respect to $\psi$ if for any $v \in \mathcal{H}$ and any $t \in \mathbb{N}$:
$\mathbf{E}_{\mathbf{x}\sim\mathcal{D}}\big[\langle v, \psi(\mathbf{x})\rangle_{\mathcal{H}}^{2t}\big] \le (Ct)^{\ell t}\cdot\big(\mathbf{E}_{\mathbf{x}\sim\mathcal{D}}[\langle v, \psi(\mathbf{x})\rangle_{\mathcal{H}}^{2}]\big)^{t}.$
If $\mathcal{K}$ is the PDS kernel corresponding to $\psi$, we also say that $\mathcal{D}$ is $(C, \ell)$-hypercontractive with respect to $\mathcal{K}$.
3.1 TDS Regression via the Kernel Method
We now give a general theorem on TDS regression for bounded distributions, under the following assumptions. Note that, although we assume that the training and test labels are bounded, this assumption can be relaxed in a black-box manner and bounding some constant-degree moment of the distribution of the labels suffices, as we show in Corollary D.2.
Assumption 3.5.
For a function class , and training and test distributions , over , we assume the following.
1. $\mathcal{F}$ is approximately represented within some radius $R$ w.r.t. a PDS kernel $\mathcal{K}$ (as per Definition 3.3), with appropriate accuracy and norm parameters.
2. The training marginal $\mathcal{D}_{\mathbf{x}}$ (1) is bounded within the ball of radius $R$ and (2) is hypercontractive with respect to $\mathcal{K}$ (as per Definition 3.4).
3. The training and test labels are both bounded in $[-M, M]$ for some $M > 0$.
Consider the function class $\mathcal{F}$, the kernel $\mathcal{K}$, and the parameters as defined in the assumption above. Then, we obtain the following theorem.
Theorem 3.6 (TDS Learning via the Kernel Method).
Under Assumption 3.5, Algorithm 1 learns the class $\mathcal{F}$ in the TDS regression setting up to excess error $\epsilon$ and with probability of failure at most $\delta$. The time complexity is polynomial in the number of samples drawn and in the evaluation time of the kernel $\mathcal{K}$.
Algorithm 1: TDS Regression via the Kernel Method (a kernelized least-squares fit on the training reference examples, followed by the spectral test described below).
The main ideas of our proof are the following.
Obtaining a concise reference feature map. The algorithm first draws reference sets from both the training and the test distributions. The representer theorem, combined with the approximate representation assumption (Definition 3.3), ensures that the reference examples define a new feature map $\Phi$ such that the ground truth $f^*$ can be approximately represented as a linear combination of the features in $\Phi$ with respect to both the training and the test marginals, i.e., both $\mathbf{E}_{\mathbf{x}\sim\mathcal{D}_{\mathbf{x}}}[(f^*(\mathbf{x}) - \mathbf{a}\cdot\Phi(\mathbf{x}))^2]$ and $\mathbf{E}_{\mathbf{x}\sim\mathcal{D}'_{\mathbf{x}}}[(f^*(\mathbf{x}) - \mathbf{a}\cdot\Phi(\mathbf{x}))^2]$ are small for some coefficient vector $\mathbf{a}$. In particular, we have the following.
Proposition 3.7 (Representer Theorem, modification of Theorem 6.11 in [MRT18]).
Suppose that a function $f$ can be $\epsilon$-approximately represented within radius $R$ w.r.t. some PDS kernel $\mathcal{K}$ (as per Definition 3.3). Then, for any set $S$ of examples in $\mathbb{R}^d$, there is a vector of coefficients $(a_{\mathbf{x}})_{\mathbf{x}\in S}$ such that the kernel hypothesis $p(\cdot) = \sum_{\mathbf{x}\in S} a_{\mathbf{x}}\,\mathcal{K}(\mathbf{x}, \cdot)$ satisfies the same pointwise approximation guarantee for $f$ on $S$, together with a corresponding bound on its RKHS norm.
Proof.
We first observe that there is some $v \in \mathcal{H}$ with bounded norm such that $|f(\mathbf{x}) - \langle v, \psi(\mathbf{x})\rangle_{\mathcal{H}}| \le \epsilon$ for all $\mathbf{x} \in S$, because, by Definition 3.3, there is a pointwise approximator for $f$ with respect to $\psi$. By Theorem 6.11 in [MRT18], this implies the existence of the desired coefficients. ∎
Note that, since the evaluation of the kernel hypothesis only involves kernel evaluations, we never need to compute the initial feature expansion $\psi$, which could be overly expensive.
Forming a candidate output hypothesis. We know that the reference feature map approximately represents the ground truth. However, having no access to test labels, we cannot directly hope to find the corresponding coefficient vector. Instead, we use only the training reference examples to find a candidate hypothesis $\hat{h}$ with close-to-optimal performance on the training distribution, which can also be expressed in terms of the reference feature map $\Phi$. It then suffices to test the quality of $\Phi$ on the test distribution.
Testing the quality of the reference feature map on the test distribution. We know that some function expressible in terms of $\Phi$ performs well on the test distribution (since it is close to the ground truth on a reference test set). We also know that the candidate output $\hat{h}$ performs well on the training distribution. Therefore, in order to ensure that $\hat{h}$ performs well on the test distribution, it suffices to show that the distance between $\hat{h}$ and this function under the test distribution is small. In fact, it suffices to bound this distance by the corresponding one under the training distribution, because $\hat{h}$ fits the training data well and the training-distribution distance is indeed small. Since we do not know the relevant coefficient vectors in advance, we need to run a test on $\Phi$ that certifies the desired bound for every choice of coefficients.
Using the spectral tester. We observe that, for any coefficient vector $\mathbf{v}$, the squared distance between the corresponding kernel hypotheses under the test marginal can be written as a quadratic form $\mathbf{v}^\top \Phi' \mathbf{v}$, where $\Phi' = \mathbf{E}_{\mathbf{x}\sim\mathcal{D}'_{\mathbf{x}}}[\Phi(\mathbf{x})\Phi(\mathbf{x})^\top]$, and similarly as $\mathbf{v}^\top \Phi^* \mathbf{v}$ under the training marginal, where $\Phi^* = \mathbf{E}_{\mathbf{x}\sim\mathcal{D}_{\mathbf{x}}}[\Phi(\mathbf{x})\Phi(\mathbf{x})^\top]$. Since we want to obtain a bound for all coefficient vectors, we essentially want to ensure that for any $\mathbf{v}$ we have $\mathbf{v}^\top \Phi' \mathbf{v} \le (1+\rho)\,\mathbf{v}^\top \Phi^* \mathbf{v}$ for some small $\rho$. Having a multiplicative bound is important because we do not have any bound on the norm of $\mathbf{v}$.
To implement the test, and since we cannot access $\Phi^*$ and $\Phi'$ directly, we draw fresh verification examples from the training and test marginals and run a spectral test on the corresponding empirical versions of these matrices. To ensure that the test will accept when there is no distribution shift, we use the following lemma (originally from [GSSV24]) on multiplicative spectral concentration for the empirical second-moment matrix, for which the hypercontractivity assumption (Definition 3.4) is important.
Lemma 3.8 (Multiplicative Spectral Concentration, Lemma B.1 in [GSSV24], modified).
Let $\mathcal{D}$ be a distribution over $\mathbb{R}^d$ and $\Phi : \mathbb{R}^d \to \mathbb{R}^m$ a feature map such that $\mathcal{D}$ is hypercontractive with respect to $\Phi$ (in the sense of Definition 3.4). Suppose that $S$ consists of i.i.d. examples from $\mathcal{D}$, and let $\hat{\Phi} = \mathbf{E}_{\mathbf{x}\sim S}[\Phi(\mathbf{x})\Phi(\mathbf{x})^\top]$ and $\Phi^* = \mathbf{E}_{\mathbf{x}\sim\mathcal{D}}[\Phi(\mathbf{x})\Phi(\mathbf{x})^\top]$. For any $\epsilon, \delta \in (0,1)$, if $|S|$ is sufficiently large, then with probability at least $1-\delta$ we have that, for every $\mathbf{v} \in \mathbb{R}^m$, $(1-\epsilon)\,\mathbf{v}^\top\Phi^*\mathbf{v} \le \mathbf{v}^\top\hat{\Phi}\mathbf{v} \le (1+\epsilon)\,\mathbf{v}^\top\Phi^*\mathbf{v}$.
Note that the multiplicative spectral concentration lemma requires access to independent samples. However, the reference feature map $\Phi$ depends on the reference examples. This is the reason why we do not reuse the reference examples but rather draw fresh verification examples. For the proof of Lemma 3.8, see Appendix A.
We now provide the full formal proof of Theorem 3.6. The full proof involves appropriate uniform convergence bounds for kernel hypotheses, which are important in order to shift from the reference to the verification examples and back.
Proof of Theorem 3.6.
Consider the reference feature map with . Let and . By 3.5, we know that there are functions with and , that uniformly approximate and within the ball of radius , i.e., and . Moreover, .
By Proposition 3.7, there is such that for with we have and . Let be a matrix in such that for . We additionally have that . Therefore, for any we have
where we used the Cauchy-Schwarz inequality. For with , we, hence, have (recall that ).
Similarly, by applying the representer theorem (Theorem 6.11 in [MRT18]) for , we have that there exists such that for with we have and . Since in Algorithm 1 is formed by solving a convex program whose search space includes , we have
(3.1)
In the following, we abuse the notation and consider to be a vector in , by appending zeroes, one for each of the elements of . Note that we then have , and, also, for all with .
Soundness. Suppose first that the algorithm has accepted. In what follows, we will use the triangle inequality of the norms to bound for functions the quantity by . We also use the inequality , as well as the fact that . We bound the test error of the output hypothesis of Algorithm 1 as follows.
Since for all and the hypothesis does not depend on the set , by a Hoeffding bound and the fact that is large enough, we obtain that , with probability at least . Moreover, we have . We have already argued that .
In order to bound the quantity , we observe that while the function does not depend on , the function does depend on and, therefore, standard concentration arguments fail to bound the in terms of . However, since we have clipped , and is of the form , we may obtain a bound using standard results from generalization theory (i.e., bounds on the Rademacher complexity of kernel-based hypotheses like Theorem 6.12 in [MRT18] and uniform convergence bounds for classes with bounded Rademacher complexity under Lipschitz and bounded losses like Theorem 11.3 in [MRT18]). In particular, we have that with probability at least
The corresponding requirement for is determined by the bounds on the Lipschitz constant of the loss function , with and , which is at most , the overall bound on this loss function, which is at most , as well as the bounds and (which give bounds on the Rademacher complexity).
By applying the Hoeffding bound, we are able to further bound the quantity by , with probability at least . We have effectively managed to bound the quantity by . This is important, because the set is a fresh set of examples and, therefore, independent from . Our goal is now to use the fact that our spectral tester has accepted. We have the following for the matrix with .
Since our test has accepted, we know that , for the matrix with . We note here that having a multiplicative bound of this form is important, because we do not have any upper bound on the norms of and . Instead, we only have bounds on distorted versions of these vectors, e.g., on , which does not imply any bound on the norm of , because could have very small singular values.
Overall, we have that
By using results from generalization theory once more, we obtain that with probability at least we have . This step is important, because the only fact we know about the quality of is that it outperforms every polynomial on the sample (not necessarily over the entire training distribution). We once more may use bounds on the values of and , this time without requiring clipping, since we know that the training marginal is bounded and, hence, the values of and are bounded as well. This was not true for the test distribution, since we did not make any assumptions about it.
In order to bound , we have the following.
(by Equation 3.1)
The first term above is bounded as , where the second term is at most (by the definition of ) and the first term can be bounded by , with probability at least , due to an application of the Hoeffding bound.
For the term we can similarly use the Hoeffding bound to obtain, with probability at least that .
Finally, for the term , we have that , as argued above.
Overall, we obtain a bound of the form , with probability at least , as desired.
Completeness. For the completeness criterion, we assume that the test marginal is equal to the training marginal. Then, by Lemma 3.8 (where we observe that any -hypercontractive distribution is also -hypercontractive), with probability at least , we have that for all , , because and the matrices are sums of independent samples of , where . It is crucial here that (which recall is formed by using ) does not depend on the verification samples and , which is why we chose them to be fresh. Therefore, the test will accept with probability at least .
Efficient Implementation. To compute the candidate hypothesis, we may run a least-squares program, in time polynomial in the number of reference examples. For the spectral tester, we first compute the SVD of the empirical training matrix and check that any vector in its kernel is also in the kernel of the empirical test matrix (this can be checked without computing the SVD of the latter); otherwise, we reject. Then, we let $Q^{\dagger/2}$ be the square root of the Moore–Penrose pseudoinverse of the empirical training matrix and find the maximum singular value of the matrix $Q^{\dagger/2}\,\hat{\Phi}'\,Q^{\dagger/2}$, where $\hat{\Phi}'$ is the empirical test matrix. If this value is higher than the acceptance threshold, we reject. Note that this is equivalent to solving the eigenvalue problem described in Algorithm 1. ∎
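The testing step just described can be sketched as follows (an illustrative implementation; the function names, the threshold `rho`, and the numerical tolerance are our own choices rather than the exact constants of Algorithm 1; the reference feature map is realized purely through kernel evaluations against the reference points):

```python
import numpy as np

def reference_feature_map(x, X_ref, kernel):
    """Phi(x): vector of kernel evaluations against the reference examples."""
    return np.array([kernel(r, x) for r in X_ref])

def second_moment(X, X_ref, kernel):
    F = np.array([reference_feature_map(x, X_ref, kernel) for x in X])
    return F.T @ F / len(X)

def spectral_test(X_ver_train, X_ver_test, X_ref, kernel, rho):
    """Accept iff v' B v <= (1 + rho) v' A v for all v, where A and B are the
    empirical train/test second-moment matrices of the reference features."""
    A = second_moment(X_ver_train, X_ref, kernel)   # empirical training matrix
    B = second_moment(X_ver_test, X_ref, kernel)    # empirical test matrix
    # Any direction in the kernel (nullspace) of A must also be in the kernel of B.
    U, s, _ = np.linalg.svd(A, hermitian=True)
    tol = 1e-10 * max(s.max(), 1.0)
    null = U[:, s <= tol]
    if null.size and np.linalg.norm(null.T @ B @ null) > tol:
        return False
    # Compare on the row space of A via the square root of its pseudoinverse.
    A_pinv_half = U @ np.diag([1 / np.sqrt(v) if v > tol else 0.0 for v in s]) @ U.T
    top = np.linalg.norm(A_pinv_half @ B @ A_pinv_half, 2)  # max singular value
    return top <= 1.0 + rho
```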
3.2 Applications
Having obtained a general theorem for TDS learning under Assumption 3.5, we can now instantiate it to obtain TDS learning algorithms for learning neural networks with Lipschitz activations. In particular, we recover all of the bounds of [GKKT17], using the additional assumption that the training distribution is hypercontractive in the following standard sense.
Definition 3.9 (Hypercontractivity).
We say that $\mathcal{D}_{\mathbf{x}}$ is $C$-hypercontractive if for all polynomials $p$ of degree $k$ and all $t \in \mathbb{N}$, we have that
$\mathbf{E}_{\mathbf{x}\sim\mathcal{D}_{\mathbf{x}}}\big[p(\mathbf{x})^{2t}\big] \le (Ct)^{kt}\cdot\big(\mathbf{E}_{\mathbf{x}\sim\mathcal{D}_{\mathbf{x}}}[p(\mathbf{x})^{2}]\big)^{t}.$
Note that many common distributions, like log-concave distributions or the uniform distribution over the hypercube, are known to be hypercontractive with some constant $C$ (see [CW01] and [O'D14]).
Function Class | Degree ($\ell$) | … (remaining parameters of Assumption 3.5)
---|---|---
Sigmoid Nets | |
$L$-Lipschitz Nets | |
In Table 2, we provide bounds on the parameters in Assumption 3.5 for sigmoid networks and $L$-Lipschitz networks, whose proofs we postpone to Appendix C (see Theorems C.17, C.19 and C.14). Combining the bounds from Table 2 with Theorem 3.6, we obtain the results in the middle column of Table 1.
4 Unbounded Distributions
We showed that the kernel method provides runtime improvements for TDS learning, because it can be used to obtain a concise reference feature map, whose spectral properties on the test distribution are all we need to check to certify low test error. A similar approach would not provide any runtime improvements for the case of unbounded distributions, because the dimension of the reference feature space would not be significantly smaller than the dimension of the multinomial feature expansion. Therefore, we can follow the standard moment-matching testing approach commonly used in TDS learning [KSV24b] and testable agnostic learning [RV23, GKK23].
4.1 Additional Preliminaries
We define the notion of subspace juntas, namely, functions that only depend on a low-dimensional projection of their input vector.
Definition 4.1 (Subspace Junta).
A function $f : \mathbb{R}^d \to \mathbb{R}$ is a $k$-subspace junta (where $k \le d$) if there exists a matrix $W \in \mathbb{R}^{k\times d}$ with orthonormal rows and a function $g : \mathbb{R}^k \to \mathbb{R}$ such that $f(\mathbf{x}) = g(W\mathbf{x})$ for all $\mathbf{x} \in \mathbb{R}^d$.
Note that by taking $k = d$ and letting $W$ be the identity matrix, the definition covers all functions $f : \mathbb{R}^d \to \mathbb{R}$.
Note that neural networks are $k$-subspace juntas, where $k$ is the number of neurons in the first hidden layer. We also define the following notion of a uniform polynomial approximation within a ball of a certain radius.
Definition 4.2 ($(\epsilon, R)$-Uniform Approximation).
For $\epsilon > 0$ and $R > 0$, we say that $p$ is an $(\epsilon, R)$-uniform approximation polynomial for a function $g$ if $|p(\mathbf{x}) - g(\mathbf{x})| \le \epsilon$ for all $\mathbf{x}$ with $\|\mathbf{x}\|_2 \le R$.
We obtain the following corollary, which gives the analogous uniform approximation guarantee for a $k$-subspace junta $f$, given a uniform approximation to the corresponding function $g$.
Corollary 4.3.
Let $f : \mathbb{R}^d \to \mathbb{R}$ be a $k$-subspace junta with $f(\mathbf{x}) = g(W\mathbf{x})$, and consider the corresponding function $g : \mathbb{R}^k \to \mathbb{R}$. Let $q$ be an $(\epsilon, R)$-uniform approximation polynomial for $g$, and define $p$ as $p(\mathbf{x}) = q(W\mathbf{x})$. Then $|p(\mathbf{x}) - f(\mathbf{x})| \le \epsilon$ for all $\mathbf{x}$ with $\|\mathbf{x}\|_2 \le R$ (indeed, since $W$ has orthonormal rows, $\|W\mathbf{x}\|_2 \le \|\mathbf{x}\|_2 \le R$, so the uniform guarantee for $q$ applies at $W\mathbf{x}$).
Finally, we consider any distribution with strictly subexponential tails in every direction, which we define as follows.
Definition 4.4 (Strictly Sub-exponential Distribution).
A distribution $\mathcal{D}_{\mathbf{x}}$ on $\mathbb{R}^d$ is $\gamma$-strictly subexponential if there exist constants $C, c > 0$ such that for all unit vectors $\mathbf{v} \in \mathbb{R}^d$ and all $r \ge 0$,
$\mathbf{Pr}_{\mathbf{x}\sim\mathcal{D}_{\mathbf{x}}}\big[|\mathbf{v}\cdot\mathbf{x}| \ge r\big] \le C\,e^{-c\,r^{1+\gamma}}.$
4.2 TDS Regression via Uniform Approximation
We will now present our main results on TDS regression for unbounded training marginals. We require the following assumptions.
Assumption 4.5.
For a function class $\mathcal{F}$ consisting of $k$-subspace juntas, and training and test distributions $\mathcal{D}, \mathcal{D}'$ over $\mathbb{R}^d \times \mathbb{R}$, we assume the following.
1. For every $f \in \mathcal{F}$ and every radius $R > 0$, there exists a uniform approximation polynomial (in the sense of Definition 4.2) for the corresponding function $g$, of degree at most $\ell$, where $\ell$ depends only on the class $\mathcal{F}$, the target accuracy, and $R$, and grows at most almost linearly with $R$.
2. For every $f \in \mathcal{F}$, the function values are bounded by a constant $b$ within the radius of approximation.
3. The training marginal $\mathcal{D}_{\mathbf{x}}$ is a $\gamma$-strictly subexponential distribution for some $\gamma > 0$.
4. The training and test labels are both bounded in $[-M, M]$ for some $M > 0$.
Consider the function class $\mathcal{F}$ and the parameters as defined in the assumption above. Then, we obtain the following theorem.
Theorem 4.6 (TDS Learning via Uniform Approximation).
Under Assumption 4.5, Algorithm 2 learns the class $\mathcal{F}$ in the TDS regression setting up to excess error $\epsilon$ and with probability of failure at most $\delta$. The time complexity is dominated by $d^{O(\ell')}$, where $\ell'$ is the polynomial degree chosen in the runtime analysis at the end of the proof.
Algorithm 2: TDS Regression via Uniform Approximation (a moment-matching test on the test sample, followed by low-degree polynomial regression over the training sample).
Note that Assumption 4.5 involves a low-degree uniform approximation assumption, which only holds within some bounded-radius ball. Since we work under unbounded distributions, we also need to handle the error outside the ball. To this end, we use the following lemma, which follows from results in [BDBGK18].
Lemma 4.7.
Suppose $g$ and its uniform approximator $p$ satisfy parts 1 and 2 of Assumption 4.5. Then, for every $\mathbf{x}$ with $\|\mathbf{x}\|_2 > R$, the approximator satisfies a growth bound of the form $|p(\mathbf{x})| \le C\,b\,(2\|\mathbf{x}\|_2/R)^{\ell}$, for an absolute constant $C$.
The lemma above gives a bound on the values of a low-degree uniform approximator outside the region of approximation. Therefore, we can hope to control the approximation error outside the ball by taking advantage of the tails of our target distribution, as well as by picking the radius $R$ sufficiently large. In order for the strictly subexponential tails to suffice, the quantitative dependence of the degree $\ell$ on $R$ is important. This is why we assume (see Assumption 4.5) that $\ell$ grows at most almost linearly with $R$. In particular, in order to bound the error contribution from the region outside the ball, we use Lemma 4.7, the Cauchy–Schwarz inequality, and the subexponential tail and moment bounds for the training marginal. Substituting the almost-linear dependence of $\ell$ on $R$, we observe that the overall bound on this contribution decays with $R$ whenever $\gamma$ is strictly positive. Therefore, the overall bound can be made arbitrarily small with an appropriate choice of $R$ (and therefore of $\ell$).
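To illustrate why an almost-linear degree suffices, here is a back-of-the-envelope sketch under an assumed growth bound $|p(\mathbf{x})| \le b\,(2\|\mathbf{x}\|_2/R)^{\ell}$ for $\|\mathbf{x}\|_2 > R$ (the shape of bound provided by Lemma 4.7) and an assumed tail bound $\mathbf{Pr}[\|\mathbf{x}\|_2 > r] \le e^{-(r/C)^{1+\gamma}}$ for the norm of the relevant low-dimensional projection of $\mathbf{x}$ (the precise constants are those of Lemma 4.7 and Definition 4.4). Splitting the region $\|\mathbf{x}\|_2 > R$ into dyadic shells,
\[
\mathbf{E}\big[p(\mathbf{x})^2\,\mathbb{1}\{\|\mathbf{x}\|_2 > R\}\big]
\;\le\; \sum_{j\ge 0} b^2\,\big(2^{\,j+2}\big)^{2\ell}\;\mathbf{Pr}\big[\|\mathbf{x}\|_2 > 2^{j}R\big]
\;\le\; \sum_{j\ge 0} b^2\,\exp\!\big(2\ell\,(j+2)\ln 2 \;-\; (2^{j}R/C)^{1+\gamma}\big),
\]
and the right-hand side is $e^{-\Omega(R^{1+\gamma})}$ once $R^{1+\gamma}$ dominates $\ell$ (up to constants). With $\ell = \tilde{O}(R)$, this holds for all sufficiently large $R$ precisely when $\gamma > 0$, which is why the almost-linear dependence of $\ell$ on $R$ matters.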
Apart from the careful manipulations described above, the proof of Theorem 4.6 follows the lines of the corresponding results for TDS learning through sandwiching polynomials [KSV24b].
The following lemma allows us to relate the squared loss of the difference of polynomials under a set and under , as long as we have a bound on the coefficients of the polynomials. This will be convenient in the analysis of the algorithm.
Lemma 4.8 (Transfer Lemma for Square Loss, see [KSV24b]).
Let $\mathcal{D}_{\mathbf{x}}$ be a distribution over $\mathbb{R}^d$ and let $S$ be a set of points in $\mathbb{R}^d$. If every empirical moment of $S$ of degree at most $2\ell$ is within some additive slack of the corresponding moment of $\mathcal{D}_{\mathbf{x}}$, then for any degree-$\ell$ polynomials $p_1, p_2$ with coefficients absolutely bounded by $B$, the empirical squared loss $\mathbf{E}_{\mathbf{x}\sim S}[(p_1(\mathbf{x}) - p_2(\mathbf{x}))^2]$ and the population squared loss $\mathbf{E}_{\mathbf{x}\sim\mathcal{D}_{\mathbf{x}}}[(p_1(\mathbf{x}) - p_2(\mathbf{x}))^2]$ differ by an additive error that scales with the slack, with $B^2$, and with the number of monomials of degree at most $2\ell$.
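A minimal sketch of the standard argument behind such a transfer bound: writing $q = (p_1 - p_2)^2 = \sum_\alpha c_\alpha \mathbf{x}^\alpha$, a polynomial of degree at most $2\ell$ whose coefficient norm $\sum_\alpha |c_\alpha|$ is controlled by $B$ and the number of monomials,
\[
\Big|\,\mathbf{E}_{\mathbf{x}\sim S}[q(\mathbf{x})] - \mathbf{E}_{\mathbf{x}\sim\mathcal{D}_{\mathbf{x}}}[q(\mathbf{x})]\,\Big|
= \Big|\sum_{\alpha} c_\alpha\,\big(\mathbf{E}_{\mathbf{x}\sim S}[\mathbf{x}^{\alpha}] - \mathbf{E}_{\mathbf{x}\sim\mathcal{D}_{\mathbf{x}}}[\mathbf{x}^{\alpha}]\big)\Big|
\le \Big(\sum_{\alpha}|c_\alpha|\Big)\cdot\max_{|\alpha|\le 2\ell}\Big|\mathbf{E}_{\mathbf{x}\sim S}[\mathbf{x}^{\alpha}] - \mathbf{E}_{\mathbf{x}\sim\mathcal{D}_{\mathbf{x}}}[\mathbf{x}^{\alpha}]\Big|,
\]
so matching all moments of degree at most $2\ell$ up to a small additive slack transfers the squared loss between $S$ and $\mathcal{D}_{\mathbf{x}}$.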
We are now ready to prove Theorem 4.6.
Proof of Theorem 4.6.
We will prove soundness and completeness separately.
Soundness. Suppose the algorithm accepts and outputs . Let and . By the uniform approximation assumption in 4.5, there are polynomials which are -uniform approximations for and , respectively. Let and have the corresponding matrices , respectively. Denote and . Note that for any , “unclipping” both functions will not increase their squared loss under any distribution, i.e. , which can be seen through casework on and when are in or not. Recalling that the training and test labels are bounded, we can use this fact as we bound the error of the hypothesis on .
The second inequality follows from unclipping the first term and by applying Hoeffding’s inequality, so that for , the second term is bounded with probability . Proceeding with more unclipping and using the triangle inequality:
(4.1)
We first bound . Since is an -uniform approximation to , we separately consider when we fall in the region of good approximation () or not.
Then by applying Cauchy-Schwarz, (and similarly for ):
By definition, . So it suffices to bound , since we now have
(4.2)
In order to bound this probability of the test samples falling outside the region of good approximation, we use the property that the first moments of are close to the moments of (as tested by the algorithm). Applying Markov’s inequality, we have
Write , where is a degree polynomial with each coefficient bounded in absolute value by (noting that since , then ). Let denote the coefficients of . Applying Lemma C.7, . By linearity of expectation, we also have , where . Since is -strictly subexponential, then by B.1, . Then, we can bound the numerator for some large constant . So we have that
Setting and for large enough makes the above probability at most so that . Thus, from Equation 4.2, we have that
(4.3)
We now bound the second term . By Lemma B.2, the first moments of will concentrate around those of whenever , and similarly the first moments of match with because the algorithm accepted. Using the transfer lemma (Lemma 4.8) when considering , along with the triangle inequality, we get:
where we note that we can bound , the sum of the magnitudes of the coefficients, by using Lemma C.6. Recall that by definition is an -approximate solution to the optimization problem in Algorithm 2, so . Plugging this in, we obtain
(4.4)
By applying Hoeffding’s inequality, we get that which holds with probability when . By unclipping , this is at most . Similarly, with probability , . It remains to bound and . The analysis for both is similar to how we bounded , except since we do not clip or we will instead take advantage of the bound on on (respectively on ). We show how to bound :
(4.5)
We can bound the first expectation term, , with since the same analysis holds for bounding , except instead of matching the first moments of with , we match the first moments of with . We use the strictly subexponential tails of to bound the second term. Cauchy-Schwarz gives
Note that by definition of and using that is an -uniform approximation of , then when . By Lemma C.6, for sufficiently large constant . Then since , . Then we have
where using B.1 we bound on similar to above, which can be upper bounded with for a sufficiently large constant. Take . We bound as follows:
where the first inequality follows from a union bound. Since is a degree polynomial, we can view as a degree-2 PTF. The class of these functions has VC dimension at most (e.g. by viewing it as the class of halfspaces in dimensions). Using standard VC arguments, whenever for some sufficiently large universal constant , with probability we have
Using the strictly subexponential tails of , we have
Choose . Putting it together:
We want to bound the first part with . Equivalently, we need to show that the exponent is . Substituting , we get that . Thus, it suffices to show that
This is satisfied when . Then, we have that
So, plugging this into Eq. 4.5, we have
The same argument will also give
Combining Eq. 4.3 and the above two bounds into Eq. 4.4, we have from Eq. 4.1 that
The result holds with probability at least (taking a union bound over bad events).
Completeness. For completeness, it is sufficient to ensure that for in Lemma B.2. This is because when , our test samples are in fact being drawn from the subexponential distribution . Then the moment concentration of subexponential distributions (Lemma B.2) gives that the empirical moments of are close to the moments of with probability . This is the only condition for acceptance, so when , the probability of acceptance is at least , as required.
Runtime. The runtime of the algorithm is , where . The two lower bounds on required in the proof are satisfied by setting . Note that setting satisfies the lower bounds on required in the proof. For we required that and also for in Lemma B.2. This is satisfied by choosing . Putting this altogether, we see that the runtime is where . ∎
4.3 Applications
In order to obtain end-to-end results for classes of neural networks (see the rightmost column of Table 1), we need to prove the existence of uniform polynomial approximators whose degree scales almost linearly with respect to the radius of approximation, for the reasons described above. For arbitrary Lipschitz nets (see Theorem C.17), we use a general tool from polynomial approximation theory, the multivariate Jackson's theorem (Theorem C.9). This gives us a polynomial with degree scaling linearly in the radius $R$ and polynomially in the remaining parameters, including the number of hidden units $k$ in the first layer.
For sigmoid nets, a more careful derivation yields improved bounds (see Theorem C.19), which have a poly-logarithmic (rather than polynomial) dependence on the corresponding parameters. Our construction involves composing approximators for the activations at each layer. Naively, the degree of this composition would be superlinear in the radius $R$. To get around this, we use the key property that the size of the output of a sigmoid network at any layer is memoryless, i.e., it has no dependence on the radius $R$. This follows from the fact that the sigmoid is uniformly bounded. Using this, we obtain an approximator with an almost-linear degree dependence on $R$. For more details, see Section C.5.
References
- [ACM24] Pranjal Awasthi, Corinna Cortes, and Mehryar Mohri. Best-effort adaptation. Annals of Mathematics and Artificial Intelligence, 92(2):393–438, 2024.
- [AZLL19] Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. Advances in neural information processing systems, 32, 2019.
- [BCK+07] John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman. Learning bounds for domain adaptation. Advances in neural information processing systems, 20, 2007.
- [BDBC+10] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine learning, 79:151–175, 2010.
- [BDBCP06] Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. Advances in neural information processing systems, 19, 2006.
- [BDBGK18] Shalev Ben-David, Adam Bouland, Ankit Garg, and Robin Kothari. Classical lower bounds from quantum upper bounds. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 339–349. IEEE, 2018.
- [BG17] Alon Brutzkus and Amir Globerson. Globally optimal gradient descent for a convnet with gaussian inputs. In International conference on machine learning, pages 605–614. PMLR, 2017.
- [BJW19] Ainesh Bakshi, Rajesh Jayaram, and David P Woodruff. Learning two layer rectified neural networks in polynomial time. In Conference on Learning Theory, pages 195–268. PMLR, 2019.
- [CDG+23] Sitan Chen, Zehao Dou, Surbhi Goel, Adam Klivans, and Raghu Meka. Learning narrow one-hidden-layer relu networks. In The Thirty Sixth Annual Conference on Learning Theory, pages 5580–5614. PMLR, 2023.
- [CGKM22] Sitan Chen, Aravind Gollakota, Adam Klivans, and Raghu Meka. Hardness of noise-free learning for two-hidden-layer neural networks. Advances in Neural Information Processing Systems, 35:10709–10724, 2022.
- [CKK+24] Gautam Chandrasekaran, Adam R Klivans, Vasilis Kontonis, Konstantinos Stavropoulos, and Arsen Vasilyan. Efficient discrepancy testing for learning with distribution shift. arXiv preprint arXiv:2406.09373, 2024.
- [CKM22] Sitan Chen, Adam R Klivans, and Raghu Meka. Learning deep relu networks is fixed-parameter tractable. In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pages 696–707. IEEE, 2022.
- [CW01] Anthony Carbery and James Wright. Distributional and $L^q$ norm inequalities for polynomials over convex bodies in $\mathbb{R}^n$. Mathematical Research Letters, 8(3):233–248, 2001.
- [DGK+20] Ilias Diakonikolas, Surbhi Goel, Sushrut Karmalkar, Adam R Klivans, and Mahdi Soltanolkotabi. Approximation schemes for relu regression. In Conference on learning theory, pages 1452–1485. PMLR, 2020.
- [DK24] Ilias Diakonikolas and Daniel M Kane. Efficiently learning one-hidden-layer relu networks via schur polynomials. In The Thirty Seventh Annual Conference on Learning Theory, pages 1364–1378. PMLR, 2024.
- [DKK+23] Ilias Diakonikolas, Daniel Kane, Vasilis Kontonis, Sihan Liu, and Nikos Zarifis. Efficient testable learning of halfspaces with adversarial label noise. Advances in Neural Information Processing Systems, 36, 2023.
- [DKKZ20] Ilias Diakonikolas, Daniel M Kane, Vasilis Kontonis, and Nikos Zarifis. Algorithms and sq lower bounds for pac learning one-hidden-layer relu networks. In Conference on Learning Theory, pages 1514–1539. PMLR, 2020.
- [DKLZ24] Ilias Diakonikolas, Daniel Kane, Sihan Liu, and Nikos Zarifis. Testable learning of general halfspaces with adversarial label noise. In The Thirty Seventh Annual Conference on Learning Theory, pages 1308–1335. PMLR, 2024.
- [DKTZ22] Ilias Diakonikolas, Vasilis Kontonis, Christos Tzamos, and Nikos Zarifis. Learning a single neuron with adversarial label noise via gradient descent. In Conference on Learning Theory, pages 4313–4361. PMLR, 2022.
- [DKZ20] Ilias Diakonikolas, Daniel Kane, and Nikos Zarifis. Near-optimal sq lower bounds for agnostically learning halfspaces and relus under gaussian marginals. Advances in Neural Information Processing Systems, 33:13586–13596, 2020.
- [DLLP10] Shai Ben David, Tyler Lu, Teresa Luu, and Dávid Pál. Impossibility theorems for domain adaptation. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 129–136. JMLR Workshop and Conference Proceedings, 2010.
- [DLT18] Simon S Du, Jason D Lee, and Yuandong Tian. When is a convolutional filter easy to learn? In 6th International Conference on Learning Representations, ICLR 2018, 2018.
- [Fer14] Dietmar Ferger. Optimal constants in the marcinkiewicz–zygmund inequalities. Statistics & Probability Letters, 84:96–101, 2014.
- [GGJ+20] Surbhi Goel, Aravind Gollakota, Zhihan Jin, Sushrut Karmalkar, and Adam Klivans. Superpolynomial lower bounds for learning one-layer neural networks using gradient descent. In International Conference on Machine Learning, pages 3587–3596. PMLR, 2020.
- [GGK20] Surbhi Goel, Aravind Gollakota, and Adam Klivans. Statistical-query lower bounds via functional gradients. Advances in Neural Information Processing Systems, 33:2147–2158, 2020.
- [GGKS24] Aravind Gollakota, Parikshit Gopalan, Adam Klivans, and Konstantinos Stavropoulos. Agnostically learning single-index models using omnipredictors. Advances in Neural Information Processing Systems, 36, 2024.
- [GK19] Surbhi Goel and Adam R Klivans. Learning neural networks with two nonlinear layers in polynomial time. In Conference on Learning Theory, pages 1470–1499. PMLR, 2019.
- [GKK23] Aravind Gollakota, Adam R Klivans, and Pravesh K Kothari. A moment-matching approach to testable learning and a new characterization of rademacher complexity. Proceedings of the fifty-fifth annual ACM Symposium on Theory of Computing, 2023.
- [GKKM20] Shafi Goldwasser, Adam Tauman Kalai, Yael Kalai, and Omar Montasser. Beyond perturbations: Learning guarantees with arbitrary adversarial test examples. Advances in Neural Information Processing Systems, 33:15859–15870, 2020.
- [GKKT17] Surbhi Goel, Varun Kanade, Adam Klivans, and Justin Thaler. Reliably learning the relu in polynomial time. In Satyen Kale and Ohad Shamir, editors, Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 1004–1042. PMLR, 07–10 Jul 2017.
- [GKLW19] Rong Ge, Rohith Kuditipudi, Zhize Li, and Xiang Wang. Learning two-layer neural networks with symmetric inputs. In International Conference on Learning Representations, 2019.
- [GKM18] Surbhi Goel, Adam Klivans, and Raghu Meka. Learning one convolutional layer with overlapping patches. In International conference on machine learning, pages 1783–1791. PMLR, 2018.
- [GKSV24a] Aravind Gollakota, Adam Klivans, Konstantinos Stavropoulos, and Arsen Vasilyan. Tester-learners for halfspaces: Universal algorithms. Advances in Neural Information Processing Systems, 36, 2024.
- [GKSV24b] Aravind Gollakota, Adam R Klivans, Konstantinos Stavropoulos, and Arsen Vasilyan. An efficient tester-learner for halfspaces. The Twelfth International Conference on Learning Representations, 2024.
- [GLM18] Rong Ge, Jason D Lee, and Tengyu Ma. Learning one-hidden-layer neural networks with landscape design. In 6th International Conference on Learning Representations, ICLR 2018, 2018.
- [GMOV19] Weihao Gao, Ashok V Makkuva, Sewoong Oh, and Pramod Viswanath. Learning one-hidden-layer neural networks under general input distributions. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1950–1959. PMLR, 2019.
- [GSSV24] Surbhi Goel, Abhishek Shetty, Konstantinos Stavropoulos, and Arsen Vasilyan. Tolerant algorithms for learning with arbitrary covariate shift. arXiv preprint arXiv:2406.02742, 2024.
- [HK19] Steve Hanneke and Samory Kpotufe. On the value of target data in transfer learning. Advances in Neural Information Processing Systems, 32, 2019.
- [HK24] Steve Hanneke and Samory Kpotufe. A more unified theory of transfer learning. arXiv preprint arXiv:2408.16189, 2024.
- [JSA15] Majid Janzamin, Hanie Sedghi, and Anima Anandkumar. Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods. arXiv preprint arXiv:1506.08473, 2015.
- [Kea98] Michael Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM (JACM), 45(6):983–1006, 1998.
- [KKSK11] Sham M Kakade, Varun Kanade, Ohad Shamir, and Adam Kalai. Efficient learning of generalized linear and single index models with isotonic regression. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 24. Curran Associates, Inc., 2011.
- [KSV24a] Adam R Klivans, Konstantinos Stavropoulos, and Arsen Vasilyan. Learning intersections of halfspaces with distribution shift: Improved algorithms and sq lower bounds. The Thirty Seventh Annual Conference on Learning Theory, 2024.
- [KSV24b] Adam R Klivans, Konstantinos Stavropoulos, and Arsen Vasilyan. Testable learning with distribution shift. The Thirty Seventh Annual Conference on Learning Theory, 2024.
- [KZZ24] Alkis Kalavasis, Ilias Zadik, and Manolis Zampetakis. Transfer learning beyond bounded density ratios. arXiv preprint arXiv:2403.11963, 2024.
- [LHL21] Qi Lei, Wei Hu, and Jason Lee. Near-optimal linear regression under distribution shift. In International Conference on Machine Learning, pages 6164–6174. PMLR, 2021.
- [LMZ20] Yuanzhi Li, Tengyu Ma, and Hongyang R Zhang. Learning over-parametrized two-layer neural networks beyond ntk. In Conference on learning theory, pages 2613–2682. PMLR, 2020.
- [LSSS14] Roi Livni, Shai Shalev-Shwartz, and Ohad Shamir. On the computational efficiency of training neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1, NIPS’14, page 855–863, Cambridge, MA, USA, 2014. MIT Press.
- [LY17] Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with relu activation. Advances in neural information processing systems, 30, 2017.
- [Mer09] James Mercer. Functions of positive and negative type, and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society A, 209:415–446, 1909.
- [MKFAS20] Mohammadreza Mousavi Kalan, Zalan Fabian, Salman Avestimehr, and Mahdi Soltanolkotabi. Minimax lower bounds for transfer learning with linear and one-hidden layer neural networks. Advances in Neural Information Processing Systems, 33:1959–1969, 2020.
- [MMR09] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In Proceedings of The 22nd Annual Conference on Learning Theory (COLT 2009), Montréal, Canada, 2009.
- [MR18] Pasin Manurangsi and Daniel Reichman. The computational complexity of training relu (s). arXiv preprint arXiv:1810.04207, 2018.
- [MRT18] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT press, second edition, 2018.
- [NS64] D. J. Newman and H. S. Shapiro. Jackson’s Theorem in Higher Dimensions, pages 208–219. Springer Basel, Basel, 1964.
- [O’D14] Ryan O’Donnell. Analysis of boolean functions. Cambridge University Press, 2014.
- [RMH+20] Ievgen Redko, Emilie Morvant, Amaury Habrard, Marc Sebban, and Younès Bennani. A survey on domain adaptation theory: learning bounds and theoretical guarantees. arXiv preprint arXiv:2004.11829, 2020.
- [RV23] Ronitt Rubinfeld and Arsen Vasilyan. Testing distributional assumptions of learning algorithms. Proceedings of the fifty-fifth annual ACM Symposium on Theory of Computing, 2023.
- [STW24] Lucas Slot, Stefan Tiegel, and Manuel Wiedmer. Testably learning polynomial threshold functions. arXiv preprint arXiv:2406.06106, 2024.
- [Tia17] Yuandong Tian. An analytical formula of population gradient for two-layered relu network and its applications in convergence and critical point analysis. In International conference on machine learning, pages 3404–3413. PMLR, 2017.
- [Ver18] Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.
- [VW19] Santosh Vempala and John Wilmes. Gradient descent for one-hidden-layer neural networks: Polynomial convergence and sq lower bounds. In Conference on Learning Theory, pages 3115–3117. PMLR, 2019.
- [WZDD23] Puqian Wang, Nikos Zarifis, Ilias Diakonikolas, and Jelena Diakonikolas. Robustly learning a single neuron via sharpness. In International Conference on Machine Learning, pages 36541–36577. PMLR, 2023.
- [ZLJ16a] Yuchen Zhang, Jason D Lee, and Michael I Jordan. l1-regularized neural networks are improperly learnable in polynomial time. In International Conference on Machine Learning, pages 993–1001. PMLR, 2016.
- [ZLJ16b] Yuchen Zhang, Jason D. Lee, and Michael I. Jordan. L1-regularized neural networks are improperly learnable in polynomial time. In Maria Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 993–1001, New York, New York, USA, 20–22 Jun 2016. PMLR.
- [ZSJ+17] Kai Zhong, Zhao Song, Prateek Jain, Peter L Bartlett, and Inderjit S Dhillon. Recovery guarantees for one-hidden-layer neural networks. In International conference on machine learning, pages 4140–4149. PMLR, 2017.
- [ZYWG19] Xiao Zhang, Yaodong Yu, Lingxiao Wang, and Quanquan Gu. Learning one-hidden-layer relu networks via gradient descent. In The 22nd international conference on artificial intelligence and statistics, pages 1524–1534. PMLR, 2019.
Appendix A Proof of Multiplicative Spectral Concentration Lemma
Here, we restate and prove the multiplicative spectral concentration lemma (Lemma 3.8).
Lemma A.1 (Multiplicative Spectral Concentration, Lemma B.1 in [GSSV24], modified).
Let $\mathcal{D}$ be a distribution over $\mathbb{R}^d$ and $\Phi : \mathbb{R}^d \to \mathbb{R}^m$ a feature map such that $\mathcal{D}$ is hypercontractive with respect to $\Phi$ (in the sense of Definition 3.4). Suppose that $S$ consists of i.i.d. examples from $\mathcal{D}$, and let $\hat{\Phi} = \mathbf{E}_{\mathbf{x}\sim S}[\Phi(\mathbf{x})\Phi(\mathbf{x})^\top]$ and $\Phi^* = \mathbf{E}_{\mathbf{x}\sim\mathcal{D}}[\Phi(\mathbf{x})\Phi(\mathbf{x})^\top]$. For any $\epsilon, \delta \in (0,1)$, if $|S|$ is sufficiently large, then with probability at least $1-\delta$ we have that, for every $\mathbf{v} \in \mathbb{R}^m$, $(1-\epsilon)\,\mathbf{v}^\top\Phi^*\mathbf{v} \le \mathbf{v}^\top\hat{\Phi}\mathbf{v} \le (1+\epsilon)\,\mathbf{v}^\top\Phi^*\mathbf{v}$.
Proof of Lemma 3.8.
Let be the compact SVD of (i.e., is square with dimension equal to the rank of and is not necessarily square). Note that such a decomposition exists (where the row and column spaces are both spanned by the same basis ), because , by definition. Moreover, note that is an orthogonal projection matrix that projects points in on the span of the rows of . We also have that, .
Consider and . Our proof consists of two parts. We first show that it is sufficient to prove that with probability at least and then we give a bound on the probability of this event.
Claim.
Suppose that for we have . Then, for any :
Proof.
Let , , and (i.e., , where is the component of lying in the nullspace of ). We have that .
Moreover, for , we have that and, hence, almost surely over . Therefore, we also have , with probability . Therefore, .
Observe, now, that and, hence, , because is a projection matrix. Overall, we obtain the following
Since and , we have that , which concludes the proof of the claim. ∎
It remains to show that for the matrix defined in the previous claim, we have with probability at least . The randomness of depends on the random choice of from . In the rest of the proof, therefore, consider all probabilities and expectations to be over . We have the following for .
We will now bound the expectation of . To this end, we define for . We have the following, by using Jensen’s inequality appropriately.
In order to bound the term above, we may use the Marcinkiewicz–Zygmund inequality (see [Fer14]) to exploit the independence of the samples in $S$ and obtain the following.
We now observe that , which is at most equal to . Therefore, we have and, by the hypercontractivity property (which we assume to be with respect to the standard inner product in ), we have . We can bound by applying the Cauchy-Schwarz inequality and using the bound for . In total, we have the following bound.
We choose such that and so that the bound is at most . ∎
Appendix B Moment Concentration of Subexponential Distributions
We prove the following bounds on the moments of subexponential distributions, which allows us to control error outside the region of good approximation.
Fact B.1 (see [Ver18]).
Let on be a -strictly subexponential distribution. Then for all , there exists a constant such that
In fact, the two conditions are equivalent.
We use the following bounds on the concentration of subexponential moments in the analysis of our algorithm. This will be useful in establishing the sample complexity required for the empirical moments of the sample to concentrate around the moments of the training marginal .
Lemma B.2 (Moment Concentration of Subexponential Distributions).
Let be a distribution over such that for any with and any we have for some . For , we denote with the quantity , where . Then, for any , if is a set of at least i.i.d. examples from for some sufficiently large universal constant , we have that with probability at least , the following is true.
Proof.
Let with . Consider the random variable . We have that and also the following.
where the last inequality follows from the Marcinkiewicz–Zygmund inequality (see [Fer14]). We have that . Since , we have that , which yields the desired result, due to the choice of and after a union bound over all the possible choices of (at most ). ∎
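As an informal complement to Lemma B.2, the following Monte Carlo sketch checks that empirical directional moments of a subexponential marginal concentrate around their population values. The Gaussian marginal, the moment order, and the set of directions are illustrative assumptions.

```python
"""
Monte Carlo sketch for moment concentration (Lemma B.2); the marginal,
directions, and moment order are illustrative assumptions.
"""
import numpy as np

rng = np.random.default_rng(1)
d, k, n = 5, 4, 50_000

X = rng.normal(size=(n, d))                      # strictly subexponential marginal
V = rng.normal(size=(20, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)    # unit directions, ||v|| = 1

emp = ((X @ V.T) ** k).mean(axis=0)              # empirical E[(v . x)^k]
true = np.prod(np.arange(k - 1, 0, -2))          # Gaussian moment (k-1)!! for even k
print("max deviation over directions:", np.abs(emp - true).max())
```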
Appendix C Polynomial Approximations of Neural Networks
In this section we derive the polynomial approximations of neural networks with Lipschitz activations needed to instantiate Theorem 3.6 for bounded distributions and Theorem 4.6 for unbounded distributions.
Recall the definition of a neural network.
Definition C.1 (Neural Network).
Let be an activation function with . Let with be the tuple of weight matrices. Here, is the input dimension and . Define recursively the function as with . The function computed by the neural network is defined as . We denote . The depth of this network is .
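For concreteness, the following sketch evaluates the recursion of Definition C.1 under the (assumed) reading f_1(x) = W_1 x and f_{i+1}(x) = W_{i+1} sigma(f_i(x)), for a Lipschitz activation vanishing at zero. The layer widths, the weights, and the choice of activation are illustrative assumptions.

```python
"""
Sketch of the recursive network definition (Definition C.1), under the
assumed reading f_1(x) = W_1 x, f_{i+1}(x) = W_{i+1} sigma(f_i(x)).
Widths, weights, and activation are illustrative choices.
"""
import numpy as np

def network(x, weights, sigma):
    """Evaluate the depth-t network defined by the list of weight matrices."""
    z = weights[0] @ x                    # f_1(x) = W_1 x
    for W in weights[1:]:
        z = W @ sigma(z)                  # f_{i+1}(x) = W_{i+1} sigma(f_i(x))
    return z

rng = np.random.default_rng(2)
relu = lambda z: np.maximum(z, 0.0)       # 1-Lipschitz activation with relu(0) = 0
weights = [rng.normal(size=(8, 10)), rng.normal(size=(4, 8)), rng.normal(size=(1, 4))]
x = rng.normal(size=10)
print("network output:", network(x, weights, relu))
```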
We also introduce some notation and basic facts that will be useful for this section.
C.1 Useful Notation and Facts
Given a univariate function on and a vector , the vector is defined as the vector with coordinate equal to . For a matrix , we use the following notation:
• ,
• ,
• .
Fact C.2.
Given a matrix , we have that
1. ,
2. .
Proof.
We first prove (1). We have that for an with ,
where the second inequality follows from the Cauchy–Schwarz inequality and the last inequality follows from the fact that for any vector , . We now prove (2). We have that
where the second inequality follows from the Cauchy–Schwarz inequality and the last inequality holds by definition. ∎
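The following is a quick numerical check, not a proof, of the standard norm inequalities the proof of Fact C.2 relies on (the operator norm bounds the stretch of any vector and is itself bounded by the Frobenius norm); the exact statement of Fact C.2 is as above, and the matrix and vector below are arbitrary illustrative choices.

```python
"""
Numerical check of the standard norm inequalities used around Fact C.2;
the matrix and vector are illustrative choices.
"""
import numpy as np

rng = np.random.default_rng(3)
M = rng.normal(size=(6, 4))
x = rng.normal(size=4)

op = np.linalg.norm(M, 2)         # operator (spectral) norm
fro = np.linalg.norm(M, 'fro')    # Frobenius norm
assert np.linalg.norm(M @ x) <= op * np.linalg.norm(x) + 1e-9   # ||Mx|| <= ||M||_op ||x||
assert op <= fro + 1e-9                                         # ||M||_op <= ||M||_F
print("operator norm:", op, " Frobenius norm:", fro)
```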
C.2 Results from Approximation Theory
The following are useful facts about the coefficients of approximating polynomials.
Fact C.3 (Lemma 23 from [GKKT17]).
Let be a polynomial of degree such that for . Then, the sum of squares of all its coefficients is at most .
Lemma C.4.
Let be a polynomial of degree such that for . Then, the sum of squares of all its coefficients is at most when .
Proof.
Consider . Clearly, for all . Thus, the sum of squares of its coefficients is at most by Fact C.3. Now, has coefficients bounded by when . ∎
Fact C.5 ([BDBGK18]).
Let be a polynomial with real coefficients on variables with degree such that for all , . Then the magnitude of any coefficient of is at most and the sum of the magnitudes of all coefficients of is at most .
Lemma C.6.
Let be a polynomial with real coefficients on variables with degree such that for all with , . Then the sum of the magnitudes of all coefficients of is at most for .
Proof.
Consider the polynomial . Then for , or equivalently for all . In particular, since the unit cube is contained in the radius ball, we have for . By Fact C.5, the sum of the magnitudes of the coefficients of is at most . Since , the sum of the magnitudes of the coefficients of is at most . ∎
Lemma C.7.
Let be a degree polynomial in such that each coefficient is bounded in absolute value by . Then the sum of the magnitudes of the coefficients of is at most .
Proof.
Note that has at most terms. Expanding gives at most terms, where any monomial is formed from a product of terms in . Then the coefficients of are bounded in absolute value by . Summing over all monomials gives the bound. ∎
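The rescaling step used in Lemmas C.4 and C.6 can be seen concretely in the following sketch: if q is bounded on the unit interval, then p(x) = q(x/R) is bounded on the radius-R interval, and the degree-j coefficient of p is that of q divided by R^j. The particular polynomial and radius are illustrative assumptions.

```python
"""
Sketch of the coefficient-rescaling trick behind Lemmas C.4 and C.6;
the polynomial and radius are illustrative choices.
"""
import numpy as np
from numpy.polynomial import polynomial as P

R = 4.0
q = np.array([0.2, -0.5, 0.1, 0.3])          # q(x) = 0.2 - 0.5x + 0.1x^2 + 0.3x^3, bounded on [-1, 1]
p = q / R ** np.arange(len(q))               # p(x) = q(x / R), bounded on [-R, R]

xs = np.linspace(-R, R, 7)
assert np.allclose(P.polyval(xs, p), P.polyval(xs / R, q))
print("sum of squared coefficients of q:", np.sum(q ** 2))
print("sum of squared coefficients of p:", np.sum(p ** 2))
```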
In the following lemma, we bound the magnitude of approximating polynomials for subspace juntas outside the radius of approximation.
Lemma C.8.
Let , and be a -subspace junta, and consider the corresponding function . Let be an -uniform approximation polynomial for , and define as . Let . Then
Proof.
Since is an -uniform approximation for , we have for . Let . Then for , and so for , or equivalently for . Write . By Lemma C.6, . Then for ,
where the second inequality holds because for all , and the last inequality holds because for when . Then since , we have for . ∎
The following is an important theorem that we use later to obtain uniform approximators for Lipschitz neural networks.
Theorem C.9 ([NS64]).
Let be a function continuous on the unit sphere . Let be the function defined as for any . Then, we have that there exists a polynomial of degree such that where is a universal constant.
This implies the following corollary.
Corollary C.10.
Let be an -Lipschitz function for and let . Then, for any , there exists a polynomial of degree such that is an -uniform approximation polynomial for .
Proof.
Consider the function . Then, we have that is -Lipschitz. From the statement of Theorem C.9, we have that . Thus, from Theorem C.9, there exists a polynomial of degree such that . Hence, we have that . Then is the required polynomial of degree . ∎
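As an illustration of Corollary C.10, the sketch below constructs uniform polynomial approximations of a 1-Lipschitz function on a bounded interval and reports the sup error for several degrees. Chebyshev interpolation is used purely as a convenient constructive stand-in for the approximation guaranteed by Theorem C.9; the function, radius, and degrees are illustrative assumptions.

```python
"""
Uniform polynomial approximation of a Lipschitz function (cf. Corollary C.10).
Chebyshev interpolation is an illustrative stand-in for Theorem C.9.
"""
import numpy as np
from numpy.polynomial import chebyshev as C

f = np.abs                  # 1-Lipschitz, a classically hard case for polynomials
R = 1.0                     # radius of approximation

for deg in (5, 20, 80):
    coeffs = C.chebinterpolate(f, deg)          # interpolate at Chebyshev nodes on [-1, 1]
    xs = np.linspace(-R, R, 2001)
    err = np.max(np.abs(C.chebval(xs, coeffs) - f(xs)))
    print(f"degree {deg:3d}: sup error on [-1, 1] = {err:.4f}")
```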
C.3 Kernel Representations
We now state and prove facts about kernel representations that we require. First, we recall the multinomial kernel from [GKKT17].
Definition C.11.
Consider the mapping , where is indexed by tuples for such that the value of at index is equal to . The kernel is defined as
We denote the corresponding RKHS as .
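Assuming, as in [GKKT17], that the feature map of Definition C.11 is indexed by ordered tuples of length at most d, the kernel collapses to a sum of powers of the inner product. The sketch below checks this identity against the explicit feature map on a small example; the dimension and degree are illustrative assumptions.

```python
"""
Sketch of the multinomial kernel (Definition C.11, following [GKKT17]):
kernel as a sum of powers of the inner product versus the explicit
feature map over ordered tuples.  Dimension and degree are illustrative.
"""
import numpy as np
from itertools import product

def multinomial_kernel(x, xp, d):
    """MK_d(x, x') = sum_{j=0}^{d} <x, x'>^j."""
    ip = float(np.dot(x, xp))
    return sum(ip ** j for j in range(d + 1))

def multinomial_features(x, d):
    """Explicit feature map: one coordinate per ordered tuple of length <= d."""
    feats = [1.0]                                      # empty tuple, degree 0
    for j in range(1, d + 1):
        feats += [np.prod([x[i] for i in t]) for t in product(range(len(x)), repeat=j)]
    return np.array(feats)

rng = np.random.default_rng(4)
x, xp, d = rng.normal(size=3), rng.normal(size=3), 3
assert np.isclose(multinomial_kernel(x, xp, d),
                  multinomial_features(x, d) @ multinomial_features(xp, d))
print("kernel value:", multinomial_kernel(x, xp, d))
```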
We now prove that polynomial approximators of subspace juntas can be represented as elements of .
Lemma C.12.
Let and . Let be a -subspace junta such that where is a function on and is a projection matrix from . Suppose, there exists a polynomial of degree such that and the sum of squares of coefficients of is bounded above by . Then, is -approximately represented within radius with respect to .
Proof.
We argue that there exists a vector such that and for all . Consider the polynomial of degree such that . We argue that for some and that . Let . From our assumption on , we have that . For , we define as . Given a multi-index , for any , we define as the number such that . We now compute the entry of indexed by . By expanding the expression for , we obtain that
We are now ready to bound . We have that
Here, the first inequality follows from the Cauchy–Schwarz inequality and the second follows by rearranging terms. The third inequality follows from the fact that the number of multi-indices of size from a set of elements is at most . The final inequality follows from the fact that the sum of the squares of the coefficients of is at most . ∎
We introduce an extension of the multinomial kernel that will be useful for our application to sigmoid-nets.
Definition C.13 (Composed multinomial kernel).
Let be a tuple in . We denote a sequence of mappings on inductively as follows:
1.
2. .
Let denote the number of coordinates in . This induces a sequence of kernels defined as
and a corresponding sequence of RKHS denoted by .
Observe that the multinomial kernel is an instantiation of the composed multinomial kernel.
We now state some properties of the composed multinomial kernel.
Lemma C.14.
Let be a tuple in and . Then, the following hold:
1. ,
2. For any , can be computed in time ,
3. For any and , we have that is a polynomial in of degree .
Proof.
We assume without loss of generality that as the kernel function is increasing in norm. To prove (1), observe that for any , we have that
We also have that . Thus, unrolling the recurrence gives us that .
The running time follows from the fact that and thus can be computed from with additions and exponentiation operations. Recursing gives the final runtime.
The fact that follows immediately from the fact that the entries of arise from the multinomial kernel and hence are polynomials in . The degree is at most . ∎
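The recursive evaluation described in the proof of Lemma C.14 can be sketched as follows: each level applies the degree-d_i multinomial kernel to the kernel value of the previous level, so the composed kernel is computed with a constant number of arithmetic operations per level. The base case (the plain inner product), the degree tuple, and the unit-norm inputs are illustrative assumptions.

```python
"""
Sketch of the recursive evaluation of the composed multinomial kernel
(Definition C.13 / Lemma C.14), assuming the base level is the plain
inner product; the degree tuple is an illustrative choice.
"""
import numpy as np

def composed_multinomial_kernel(x, xp, degrees):
    """K^{(0)} = <x, x'>;  K^{(i)} = sum_{j=0}^{d_i} (K^{(i-1)})^j."""
    k = float(np.dot(x, xp))
    for d in degrees:
        k = sum(k ** j for j in range(d + 1))
    return k

rng = np.random.default_rng(5)
x, xp = rng.normal(size=4), rng.normal(size=4)
x, xp = x / np.linalg.norm(x), xp / np.linalg.norm(xp)   # keep norms bounded by 1
print("composed kernel value:", composed_multinomial_kernel(x, xp, degrees=(3, 2)))
```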
We now argue that a distribution that is hypercontractive with respect to polynomials is hypercontractive with respect to the multinomial kernel.
Lemma C.15.
Let be a distribution on that is -hypercontractive for some constant . Then, is also -hypercontractive.
Proof.
The proof immediately follows from Definition 3.4 and Lemma C.14(3). ∎
C.4 Nets with Lipschitz activations
We are now ready to prove our theorem about uniform approximators for neural networks with Lipschitz activations. First, we prove that such networks describe a Lipschitz function.
Lemma C.16.
Let be the function computed by an -layer neural network with -Lipschitz activation function and weight matrices . Suppose that for and that the first hidden layer has neurons. Then we have that is -Lipschitz.
Proof.
First, observe from Fact C.2 that for all , (since ) and . Recall from Definition C.1 that we have the functions where and . We prove by induction on that . For the base case, observe that
where the second inequality follows from the Lipschitzness of and the final inequality follows from the definition of operator norm. We now proceed to the inductive step. Assume by induction that is at most . Thus, we have
where the third inequality follows from the Lipschitzness of and the inductive hypothesis. Thus, we get that . ∎
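The following sanity check illustrates Lemma C.16 numerically: for a network with 1-Lipschitz activations vanishing at zero, the ratio ||f(x) - f(y)|| / ||x - y|| never exceeds the product of the operator norms of the weight matrices. The architecture, weights, and sampling scheme are illustrative assumptions.

```python
"""
Numerical sanity check in the spirit of Lemma C.16: empirical Lipschitz
ratios versus the product of operator norms.  Architecture and weights
are illustrative choices.
"""
import numpy as np

rng = np.random.default_rng(6)
relu = lambda z: np.maximum(z, 0.0)                       # 1-Lipschitz, relu(0) = 0
weights = [rng.normal(size=(16, 10)), rng.normal(size=(8, 16)), rng.normal(size=(1, 8))]

def net(x):
    z = weights[0] @ x
    for W in weights[1:]:
        z = W @ relu(z)
    return z

lip_bound = np.prod([np.linalg.norm(W, 2) for W in weights])   # activation Lipschitz constant L = 1

worst = 0.0
for _ in range(2000):
    x, y = rng.normal(size=10), rng.normal(size=10)
    ratio = np.linalg.norm(net(x) - net(y)) / np.linalg.norm(x - y)
    worst = max(worst, ratio)
print(f"empirical Lipschitz ratio {worst:.2f} <= bound {lip_bound:.2f}")
```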
We now state our theorem regarding the uniform approximation of Lipschitz nets. We also prove that the approximators can be represented by low-norm vectors in for an appropriately chosen degree .
Theorem C.17.
Let . Let be a neural network with an -Lipschitz activation function , depth and weight matrices where . Let be the number of neurons in the first hidden layer. Then, there exists a polynomial of degree that is an -uniform approximation polynomial for . Furthermore, is -approximately represented within radius with respect to . In fact, when , it holds that is -approximately represented within with respect to .
Proof.
We can express as where is a projection matrix and is a neural network with input size . We observe that the Lipschitz constant of is the same as the Lipschitz constant of since is a projection matrix. From Lemma C.16, we have that is -Lipschitz. From Corollary C.10, we have that there exists a polynomial of degree that is an -uniform approximation for . From Lemma C.6, we have that the sum of squares of magnitudes of coefficients of is bounded by . Now, applying Lemma C.12 yields the result. When , we apply Lemma C.4 to obtain that the sum of squares of magnitudes of coefficients of is bounded by . ∎
C.5 Sigmoids and Sigmoid-nets
We now give a custom proof for the case of neural networks with sigmoid activations. We do this because we can hope to obtain degree for our polynomial approximation. We largely follow the proof technique of [GKKT17] and [ZLJ16b]. The modifications we make are to handle the case where the radius of approximation is a variable instead of a constant. We require (for our applications to strictly subexponential distributions) that the degree of approximation scale linearly in , a property that does not follow directly from the analysis given in [GKKT17]. We modify their analysis to achieve this linear dependence.
We first state a result regarding polynomial approximations for a single sigmoid activation.
Theorem C.18 ([LSSS14]).
Let denote the function . Let . Then, there exists a polynomial of degree such that . Also, the sum of the squares of the coefficients of is bounded above by .
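The degree scaling in Theorem C.18 can be observed empirically: the sketch below finds, for several radii, the smallest degree at which a polynomial approximation of the sigmoid achieves a fixed sup error over the interval. Chebyshev interpolation is used here purely as a convenient stand-in for the construction of [LSSS14], and the radii and target error are illustrative assumptions.

```python
"""
Empirical look at the degree needed to uniformly approximate the sigmoid
on a radius-R interval (cf. Theorem C.18); Chebyshev interpolation is an
illustrative stand-in for [LSSS14].
"""
import numpy as np
from numpy.polynomial import chebyshev as C

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

for R in (1.0, 4.0, 16.0):
    xs = np.linspace(-R, R, 4001)
    for deg in range(1, 200):
        cheb = C.chebinterpolate(lambda u: sigmoid(R * u), deg)   # approximate on [-1, 1], rescaled
        err = np.max(np.abs(C.chebval(xs / R, cheb) - sigmoid(xs)))
        if err <= 0.01:
            print(f"radius {R:5.1f}: degree {deg} suffices for sup error 0.01")
            break
```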
We now present a construction of a uniform approximation for neural networks with sigmoid activations. The construction is similar to the one in [GKKT17], but the analysis deviates since linear dependence on the radius of approximation is important to us.
Theorem C.19.
Let . Let on be a neural network with sigmoid activations, depth and weight matrices where . Also, let . Then, there exists a polynomial of degree that is an -uniform approximation polynomial for . Furthermore, is -approximately represented within radius with respect to , where is a tuple of degrees whose product is bounded by . Here, .
Proof.
First, let be the polynomial guaranteed by Theorem C.18 that -approximates the sigmoid in an interval of radius . Denote the degree of as . For all , let be the polynomial that -approximates the sigmoid up to radius . These have degree equal to . Let . For all , let . We know that .
We now construct the polynomial that approximates . For , define with . Define . Recall that is a vector of polynomials. We prove the following by induction: for every ,
1. ,
2. For each , we have that with .
where the function is as defined in Definition C.1.
The above holds trivially for and is an exact approximator. Also, from the definition of . Clearly, . We now prove that the above holds for , assuming it holds for .
We first prove (1). For , we have that
For the second inequality, we analyse the cases and separately. When , we have that and when . For , from the inductive hypothesis, we have that . The second term in the second inequality is bounded since is -Lipschitz.
We are now ready to prove that is representable by small norm vectors in for all . We have that
From the inductive hypothesis, we have that . Thus, we have that
We expand each term in the above sum. We obtain,
The second inequality follows from expanding the equation. indexed by for has entries given by . Putting things together, we obtain that
Thus, we have proved that is representable in . We now prove that the norm of the representation is small. We have that
We bound . For any , from the definition of and the inductive hypothesis, we have that
We analyse the case and separately. When , we have from the bound on the base case. When , we have
which completes the induction. We are ready to calculate the bound on the degree.
We have . Also, for , we have . Thus, the total degree is . The square of the norm of the kernel representation is bounded by where
This concludes the proof. ∎
C.6 Applications for Bounded Distributions
We first state and prove our end-to-end results on TDS learning sigmoid and Lipschitz nets over bounded marginals that are -hypercontractive for some constant .
Theorem C.20 (TDS Learning for Nets with Sigmoid Activation).
Let on be the class of neural networks with sigmoid activations, depth and weight matrices such that . Let . Suppose the training and test distributions over are such that the following are true:
1. is bounded within and is -hypercontractive for ,
2. The training and test labels are bounded in for some .
Then, Algorithm 1 learns the class in the TDS regression setting up to excess error and probability of failure . The time and sample complexity is
where .
Proof.
From Theorem C.19, we have that is -approximately represented within radius with respect to , where is a degree vector whose product is equal to . Also, from Lemma C.14, we have that , and the entries of the kernel can be computed in time . From Lemma C.15, we have that is hypercontractive. Now, we obtain the result by applying Theorem 3.6. ∎
The following corollary on TDS learning two layer sigmoid networks in polynomial time readily follows.
Corollary C.21.
Let on be the class of two-layer neural networks with weight matrices and sigmoid activations. Let and . Suppose the training and test distributions satisfy the assumptions from Theorem C.20 with . Then, Algorithm 1 learns the class in the TDS regression setting up to excess error and probability of failure in time and sample complexity .
Proof.
The proof immediately follows from Theorem C.20 by setting and the other parameters to the appropriate constants. ∎
Theorem C.22 (TDS Learning for Nets with Lipschitz Activation).
Let on be the class of neural networks with -Lipschitz activations, depth and weight matrices such that . Let . Suppose the training and test distributions over are such that the following are true:
1. is bounded within and is -hypercontractive for ,
2. The training and test labels are bounded in for some .
Then, Algorithm 1 learns the class in the TDS regression setting up to excess error and probability of failure . The time and sample complexity is , where . In particular, when , we have that the time and sample complexity is where
Proof.
From Theorem C.17, for we have that is -approximately represented within radius with respect to , where is a degree vector whose product is . For , we have that is -approximately represented within radius with respect to , where is a degree vector whose product is equal to . Also, from Lemma C.14, we have that , and the entries of the kernel can be computed in time . From Lemma C.15, we have that is hypercontractive. Now, we obtain the result by applying Theorem 3.6. ∎
The above theorem implies the following corollary about TDS learning the class of ReLUs.
Corollary C.23.
Let on be the class of ReLU functions with unit weight vectors. Suppose the training and test distributions satisfy the assumptions from Theorem C.22 with . Then, Algorithm 1 learns the class in the TDS regression setting up to excess error and probability of failure in time and sample complexity .
Proof.
The proof immediately follows from Theorem C.22 by setting and the activation to be the ReLU function. ∎
In particular, this implies that the class of ReLUs is TDS learnable in polynomial time when .
C.7 Applications for Unbounded Distributions
We are now ready to state our theorem for TDS learning neural networks with sigmoid activations.
Theorem C.24 (TDS Learning for Nets with Sigmoid Activation and Strictly Subexponential Marginals).
Let on be the class of neural networks with sigmoid activations, depth and weight matrices such that . Let . Suppose the training and test distributions over are such that the following are true:
1. is -strictly subexponential,
2. The training and test labels are bounded in for some .
Then, Algorithm 2 learns the class in the TDS regression setting up to excess error and probability of failure . The time and sample complexity is at most
where
Proof.
From Theorem C.19, we have that there is an -uniform approximation polynomial for with degree . Here, let . We also have that from the Lipschitzness of the sigmoid nets (Lemma C.16) and the fact that the sigmoid evaluated at has value . The theorem now directly follows from Theorem 4.6. ∎
We now state our theorem on TDS learning neural networks with arbitrary Lipschitz activations.
Theorem C.25 (TDS Learning for Nets with Lipschitz Activation and Strictly Subexponential Marginals).
Let on be the class of neural networks with -Lipschitz activations, depth and weight matrices such that . Let . Suppose the training and test distributions over are such that the following are true:
1. is -strictly subexponential,
2. The training and test labels are bounded in for some .
Then, Algorithm 2 learns the class in the TDS regression setting up to excess error and probability of failure . The time and sample complexity is at most
where .
Proof.
From Theorem C.17, we have that there is an -uniform approximation polynomial for with degree . Here, let . We also have that from the Lipschitz constant (Lemma C.16) and the fact that each individual activation has value at most when evaluated at (see Definition C.1). The theorem now directly follows from Theorem 4.6. ∎
Appendix D Assumptions on the Labels
Our main theorems involve assumptions on the labels of both the training and test distributions. Ideally, one would want to avoid any assumptions on the test distribution. However, we demonstrate that this is not possible, even when the training marginal and the training labels are bounded, and the test labels have bounded second moment. On the other hand, we show that obtaining algorithms that work for bounded labels is sufficient even in the unbounded labels case, as long as some moment of the labels (strictly higher than the second moment) is bounded.
We begin with the lower bound, which we state for the class of linear functions, but would also hold for the class of single ReLU neurons, as well as other unbounded classes.
Proposition D.1 (Label Assumption Necessity).
Let be the class of linear functions over , i.e., . Even if we assume that the training marginal is bounded within , that the training labels are bounded in , and that for the test labels we have where , no TDS regression algorithm with finite sample complexity can achieve excess error less than and probability of failure less than for .
The proof is based on the observation that, because we cannot make any assumption on the test marginal, the test distribution could take some very large value with very small probability while still being consistent with some linear function. The training distribution, on the other hand, gives no information about the ground truth and is information-theoretically indistinguishable from the constructed test distribution. Therefore, the tester must accept, and its output will have large excess error. The bound on the second moment of the labels does imply a bound on the excess error, but this bound cannot be made arbitrarily small by drawing more samples.
Proof of Proposition D.1.
Suppose, for contradiction, that we have a TDS regression algorithm for with excess error and probability of failure . Let be the sample complexity of the algorithm and such that . We consider three distributions over . First, outputs with probability . Second, outputs with probability and with probability , for some with . Third, outputs with probability and with probability .
We consider two instances of the TDS regression problem. The first instance corresponds to the case and . The second corresponds to the case and . Note that the assumptions we asserted regarding the test distribution and the test labels are true for both instances. For , in particular, we have . Moreover, in each of the cases, there is a hypothesis in that is consistent with all of the examples (either the hypothesis or ), so .
Note that the total variation distance between and is and similarly between and . Therefore, by the completeness criterion, as well as the fact that sampling only increases total variation distance at a linear rate, i.e., , we have that in each of the two instances, the algorithm will accept with probability at least (due to the definition of total variation distance: we know that the algorithm would accept with probability at least if the set of test examples was drawn from ; since is -close to , no algorithm can have different behavior if we substitute with , except with probability ; hence, any algorithm must accept with probability at least ).
Suppose that the algorithm accepts in both instances (which happens w.p. at least ). By the soundness criterion, with overall probability at least , we have the following.
The inequalities above cannot be satisfied simultaneously, so we have arrived at a contradiction. It only remains to argue that , which is true if we choose . Therefore, such a TDS regression algorithm cannot exist. ∎
The lower bound of Proposition D.1 demonstrates that, in the worst case, the best possible excess error scales with the second moment of the distribution of the test labels. In contrast, we show that a bound on any strictly higher moment is sufficient.
Corollary D.2.
Suppose that for any , we have an algorithm that learns a class in the TDS setting up to excess error , assuming that both the training and test labels are bounded in . Let and be the corresponding time and sample complexity upper bounds.
Then, in the same setting, there is an algorithm that learns up to excess error under the relaxed assumption that for both training and test labels we have for some and some strictly increasing, positive-valued and unbounded function. The corresponding time and sample complexity upper bounds are and .
The proof is based on the observation that the effect of clipping on the labels, as measured by the squared loss, can be controlled by drawing enough samples, whenever a moment that is strictly higher than the second moment is bounded.
Lemma D.3.
Let and be strictly increasing and surjective. Let be a random variable over such that . Then, for any , if , we have .
Proof of Lemma D.3.
We have that , because and always have the same sign, so and also if . Since is non-zero whenever , we have . We now use the fact that is increasing to conclude that . By choosing , we obtain the desired bound. ∎
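The effect quantified by Lemma D.3 is easy to observe numerically: for a label distribution with a bounded moment strictly higher than the second, the squared-loss contribution of clipping at threshold M decays as M grows. The heavy-tailed label distribution and thresholds below are illustrative assumptions.

```python
"""
Monte Carlo illustration of Lemma D.3: clipping heavy-tailed labels at a
large threshold changes them very little in squared loss.  Distribution
and thresholds are illustrative choices.
"""
import numpy as np

rng = np.random.default_rng(7)
y = rng.standard_t(df=5, size=1_000_000)          # moments strictly above the second are finite

for M in (2.0, 5.0, 10.0, 20.0):
    clipped = np.clip(y, -M, M)
    print(f"M = {M:5.1f}: E[(y - clip_M(y))^2] = {np.mean((y - clipped) ** 2):.2e}")
```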
We are now ready to prove Corollary D.2, by reducing TDS learning with moment-bounded labels to TDS learning with bounded labels.
Proof of Corollary D.2.
The idea is to reduce the problem under the relaxed label assumptions to a corresponding bounded-label problem for . In particular, consider a new training distribution and a new test distribution , where the samples are formed by drawing a sample from the corresponding original distribution and clipping the label to . Note that whenever we have access to i.i.d. examples from , we also have access to i.i.d. examples from and similarly for . Therefore, we may solve the corresponding TDS problem for and , to either reject or obtain some hypothesis such that
Our algorithm either rejects when the algorithm for the bounded labels case rejects or accepts and outputs . It suffices to show , because the marginal distributions do not change and completeness is, therefore, satisfied directly.
It suffices to show that for any distribution , we have . To this end, note that . We have the following.
The first inequality follows from an application of the triangle inequality for the -norm and the second inequality follows from Lemma D.3. The other side follows analogously. ∎