
Learning Neural Networks with Distribution Shift:
Efficiently Certifiable Guarantees

Gautam Chandrasekaran
UT Austin
[email protected]. Supported by the NSF AI Institute for Foundations of Machine Learning (IFML).
   Adam R. Klivans
UT Austin
[email protected]. Supported by NSF award AF-1909204 and the NSF AI Institute for Foundations of Machine Learning (IFML).
   Lin Lin Lee
UT Austin
[email protected]. Supported by the NSF AI Institute for Foundations of Machine Learning (IFML).
   Konstantinos Stavropoulos
UT Austin
[email protected]. Supported by the NSF AI Institute for Foundations of Machine Learning (IFML) and by scholarships from Bodossaki Foundation and Leventis Foundation.
Abstract

We give the first provably efficient algorithms for learning neural networks with distribution shift. We work in the Testable Learning with Distribution Shift framework (TDS learning) of [KSV24b], where the learner receives labeled examples from a training distribution and unlabeled examples from a test distribution and must either output a hypothesis with low test error or reject if distribution shift is detected. No assumptions are made on the test distribution.

All prior work in TDS learning focuses on classification, while here we must handle the setting of nonconvex regression. Our results apply to real-valued networks with arbitrary Lipschitz activations and work whenever the training distribution has strictly sub-exponential tails. For training distributions that are bounded and hypercontractive, we give a fully polynomial-time algorithm for TDS learning one hidden-layer networks with sigmoid activations. We achieve this by importing classical kernel methods into the TDS framework using data-dependent feature maps and a type of kernel matrix that couples samples from both train and test distributions.

1 Introduction

Understanding when a model will generalize from a known training distribution to an unknown test distribution is a critical challenge in trustworthy machine learning and domain adaptation. Traditional approaches to this problem prove generalization bounds in terms of various notions of distance between train and test distributions [BDBCP06, BDBC+10, MMR09] but do not provide efficient algorithms. Recent work due to [KSV24b] departs from this paradigm and defines the model of Testable Learning with Distribution Shift (TDS learning), where a learner may reject altogether if significant distribution shift is detected. When the learner accepts, however, it outputs a classifier and a proof that the classifier has nearly optimal test error.

A sequence of works has given the first set of efficient algorithms in the TDS learning model for well-studied function classes where no assumptions are taken on the test distribution [KSV24b, KSV24a, CKK+24, GSSV24]. These results, however, hold for classification and therefore do not apply to (nonconvex) regression problems and in particular to a long line of work giving provably efficient algorithms for learning simple classes of neural networks under natural distributional assumptions on the training marginal [GK19, DGK+20, DKKZ20, DKTZ22, CKM22, CDG+23, WZDD23, GGKS24, DK24].

The main contribution of this work is the first set of efficient TDS learning algorithms for broad classes of (nonconvex) regression problems. Our results apply to neural networks with arbitrary Lipschitz activations of any constant depth. As one example, we obtain a fully polynomial-time algorithm for learning one hidden-layer neural networks with sigmoid activations with respect to any bounded and hypercontractive training distribution. For bounded training distributions, the running times of our algorithms match the best known running times for ordinary PAC or agnostic learning (without distribution shift). We emphasize that unlike all prior work in domain adaptation, we make no assumptions on the test distribution.

Regression Setting. We assume the learner has access to labeled examples from the training distribution and unlabeled examples from the marginal of the test distribution. We consider the squared loss $\mathcal{L}_{\mathcal{D}}(h)=\sqrt{\mathbb{E}_{(\bm{x},y)\sim\mathcal{D}}[(y-h(\bm{x}))^{2}]}$. The error benchmark is analogous to the benchmark for TDS learning in classification [KSV24b] and depends on two quantities: the optimum training error achievable by a classifier in the learnt class, $\mathrm{opt}=\min_{f\in\mathcal{F}}[\mathcal{L}_{\mathcal{D}}(f)]$, and the best joint error achievable by a single classifier on both the training and test distributions, $\lambda=\min_{f'\in\mathcal{F}}[\mathcal{L}_{\mathcal{D}}(f')+\mathcal{L}_{\mathcal{D}'}(f')]$. Achieving an error of $\mathrm{opt}+\lambda$ is the standard goal in domain adaptation [BDBCP06, BCK+07, MMR09]. We now formally define the TDS learning framework for regression.

Definition 1.1 (Testable Regression with Distribution Shift).

For $\epsilon,\delta\in(0,1)$ and a function class $\mathcal{F}\subseteq\{\mathbb{R}^{d}\to\mathbb{R}\}$, the learner receives iid labeled examples from some unknown training distribution $\mathcal{D}$ over $\mathbb{R}^{d}\times\mathbb{R}$ and iid unlabeled examples from the marginal $\mathcal{D}_{\bm{x}}'$ of another unknown test distribution $\mathcal{D}'$ over $\mathbb{R}^{d}\times\mathbb{R}$. The learner either rejects, or it accepts and outputs a hypothesis $h:\mathbb{R}^{d}\to\mathbb{R}$ such that the following are true.

  1. (Soundness) With probability at least $1-\delta$, if the algorithm accepts, then the output $h$ satisfies $\mathcal{L}_{\mathcal{D}'}(h)\leq\min_{f\in\mathcal{F}}[\mathcal{L}_{\mathcal{D}}(f)]+\min_{f'\in\mathcal{F}}[\mathcal{L}_{\mathcal{D}}(f')+\mathcal{L}_{\mathcal{D}'}(f')]+\epsilon$.

  2. (Completeness) If $\mathcal{D}_{\bm{x}}=\mathcal{D}_{\bm{x}}'$, then the algorithm accepts with probability at least $1-\delta$.

1.1 Our Results

Our results hold for classes of Lipschitz neural networks. In particular, we consider functions $f$ of the following form. Let $\sigma:\mathbb{R}\to\mathbb{R}$ be an activation function. Let $\mathbf{W}=(W^{(1)},\ldots,W^{(t)})$ with $W^{(i)}\in\mathbb{R}^{s_{i}\times s_{i-1}}$ be the tuple of weight matrices. Here, $s_{0}=d$ is the input dimension and $s_{t}=1$. Define recursively the function $f_{i}:\mathbb{R}^{d}\to\mathbb{R}^{s_{i}}$ as $f_{i}(\bm{x})=W^{(i)}\cdot\sigma(f_{i-1}(\bm{x}))$ with $f_{1}(\bm{x})=W^{(1)}\cdot\bm{x}$. The function $f:\mathbb{R}^{d}\to\mathbb{R}$ computed by the neural network $(\mathbf{W},\sigma)$ is defined as $f(\bm{x})\coloneqq f_{t}(\bm{x})$. The depth of this network is $t$.
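To make the recursive definition concrete, here is a minimal sketch (in Python with NumPy; the function and variable names are ours, not the paper's) of evaluating a network $(\mathbf{W},\sigma)$ of depth $t$:

```python
import numpy as np

def forward(Ws, sigma, x):
    """Evaluate f(x) for the network (W, sigma) defined recursively above.

    Ws    : list [W1, ..., Wt] with W_i of shape (s_i, s_{i-1}), s_0 = d, s_t = 1
    sigma : elementwise activation, e.g. a sigmoid or any Lipschitz function
    x     : input vector of shape (d,)
    """
    z = Ws[0] @ x                      # f_1(x) = W^(1) x  (no activation on the input layer)
    for W in Ws[1:]:
        z = W @ sigma(z)               # f_i(x) = W^(i) sigma(f_{i-1}(x))
    return z.item()                    # s_t = 1, so the output is a scalar

# Example: a one-hidden-layer sigmoid network with k = 4 hidden units in d = 3 dimensions.
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((4, 3)), rng.standard_normal((1, 4))]
print(forward(Ws, lambda u: 1.0 / (1.0 + np.exp(-u)), rng.standard_normal(3)))
```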

We now present our main results on TDS learning for neural networks.

Function Class | Runtime (Bounded) | Runtime (Subgaussian)
One-hidden-layer Sigmoid Net | $\mathrm{poly}(d,M,1/\epsilon)$ | $d^{\mathrm{poly}(k\log(M/\epsilon))}$
Single ReLU | $\mathrm{poly}(d,M)\cdot 2^{O(1/\epsilon)}$ | $d^{\mathrm{poly}(k\log M/\epsilon)}$
Sigmoid Nets | $\mathrm{poly}(d,M)\cdot 2^{O((\log(1/\epsilon))^{t-1})}$ | $d^{\mathrm{poly}(k\log M(\log(1/\epsilon))^{t-1})}$
$1$-Lipschitz Nets | $\mathrm{poly}(d,M)\cdot 2^{\tilde{O}(k\sqrt{k}2^{t-1}/\epsilon)}$ | $d^{\mathrm{poly}(k2^{t-1}\log M/\epsilon)}$

Table 1: In the above table, $k$ denotes the number of neurons in the first hidden layer and $M$ denotes a bound on the labels of the train and test distributions. One-hidden-layer Sigmoid Nets refers to depth-$2$ neural networks with sigmoid activation. The bounded distributions considered in the above table have support on the unit ball. We assume that all relevant parameters of the neural network are bounded by constants. For more detailed statements and proofs, see (1) Corollaries C.21, C.23, C.20 and C.22 for the bounded case, and (2) Theorems C.24 and C.25 for the Subgaussian case.

From the above table, we highlight that in the cases of bounded distributions with (1) one-hidden-layer Sigmoid Nets, and (2) a single ReLU with $\epsilon\geq 1/\log d$, we obtain TDS algorithms that run in polynomial time in all parameters. Moreover, for the last row, regarding Lipschitz Nets, each neuron is allowed to have a different and unknown Lipschitz activation. Therefore, in particular, our results capture the class of single-index models (see, e.g., [KKSK11, GGKS24]).

In the results of Table 1, we assume bounded labels for both the training and test distributions. This assumption can be relaxed to a bound on any moment whose degree is strictly higher than $2$ (see Corollary D.2). In fact, such an assumption is necessary, as we show in Proposition D.1.

1.2 Our Techniques

TDS Learning via Kernel Methods. The major technical contribution of this work is devoted to importing classical kernel methods into the TDS learning framework. A first attempt at testing distribution shift with respect to a fixed feature map would be to form two corresponding covariance matrices of the expanded features, one from samples drawn from the training distribution and the other from samples drawn from the test distribution, and test if these two matrices have similar eigendecompositions. This approach only yields efficient algorithms for linear kernels, however, as here we are interested in spectral properties of covariance matrices in the feature space corresponding to low-degree polynomials, whose dimension is too large.

Instead, we form a new, concise, data-dependent reference feature map $\phi$ that depends on examples from both $\mathcal{D}_{\bm{x}}$ and $\mathcal{D}_{\bm{x}}'$. We show that this feature map approximately represents the ground truth, i.e., some function with both low training and test error (this is due to the representer theorem, see Proposition 3.7). To certify that error bounds transfer from $\mathcal{D}_{\bm{x}}$ to $\mathcal{D}_{\bm{x}}'$, we require relative-error closeness between the covariance matrix $\Phi'=\mathbb{E}_{\bm{x}\sim\mathcal{D}_{\bm{x}}'}[\phi(\bm{x})\phi(\bm{x})^{\top}]$ of the feature expansion $\phi$ over the test marginal and the corresponding matrix $\Phi=\mathbb{E}_{\bm{x}\sim\mathcal{D}_{\bm{x}}}[\phi(\bm{x})\phi(\bm{x})^{\top}]$ over the training marginal. We draw fresh sets of verification examples and show how the kernel trick can be used to efficiently achieve these approximations even though $\phi$ is a nonstandard feature map. We provide a more detailed technical overview and a formal proof in Section 3.1.
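As an illustration of this step, the following sketch (our own simplification, with an arbitrary polynomial kernel standing in for the kernels used later in the paper) builds the data-dependent feature map $\phi(\bm{x})=(\mathcal{K}(\bm{x},\bm{z}))_{\bm{z}\in S_{\mathrm{ref}}\cup S_{\mathrm{ref}}'}$ and the two empirical second-moment matrices that the spectral test compares:

```python
import numpy as np

def poly_kernel(X, Z, degree=3):
    # One simple PDS kernel choice (illustrative): K(x, z) = (1 + <x, z>)^degree.
    return (1.0 + X @ Z.T) ** degree

def reference_feature_map(S_ref, S_ref_prime, kernel):
    """phi(x) = (K(x, z)) for z in S_ref U S'_ref  -- a 2m-dimensional data-dependent map."""
    Z = np.vstack([S_ref, S_ref_prime])
    return lambda X: kernel(np.atleast_2d(X), Z)   # rows are phi(x)^T

def empirical_second_moment(phi, X):
    """Empirical version of Phi = E_x[ phi(x) phi(x)^T ] from a sample X."""
    F = phi(X)
    return F.T @ F / len(X)

# Toy usage with stand-in samples (all names here are illustrative).
rng = np.random.default_rng(1)
S_ref, S_ref_prime = rng.standard_normal((20, 5)), rng.standard_normal((20, 5))
phi = reference_feature_map(S_ref, S_ref_prime, poly_kernel)
Phi_hat = empirical_second_moment(phi, rng.standard_normal((500, 5)))        # from D_x
Phi_hat_prime = empirical_second_moment(phi, rng.standard_normal((500, 5)))  # from D'_x
```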

By instantiating the above results using a type of polynomial kernel, we can reduce the problem of TDS learning neural networks to the problem of obtaining an appropriate polynomial approximator. Our final training algorithm (as opposed to the testing phase) will essentially be kernelized polynomial regression.

TDS Learning and Uniform Approximation. Prior work in TDS learning has established connections between polynomial approximation theory and efficient algorithms in the TDS setting. In particular, the existence of low-degree sandwiching approximators for a concept class is known to imply dimension-efficient TDS learning algorithms for binary classification. The notion of sandwiching approximators for a function $f$ refers to a pair of low-degree polynomials $p_{\mathrm{up}},p_{\mathrm{down}}$ with two main properties: (1) $p_{\mathrm{down}}\leq f\leq p_{\mathrm{up}}$ everywhere and (2) the expected absolute distance between $p_{\mathrm{up}}$ and $p_{\mathrm{down}}$ over some reference distribution is small. The first property is of particular importance in the TDS setting, since it holds everywhere and, therefore, it holds for any test distribution unconditionally.

Here we make the simple observation that the incomparable notion of uniform approximation suffices for TDS learning. A uniform approximator is a polynomial $p$ that approximates a function $f$ pointwise, meaning that $|p-f|$ is small at every point within a ball around the origin (there is no known direct relationship between sandwiching and uniform approximators). In our setting, uniform approximation is more convenient, due to the existence of powerful tools from polynomial approximation theory regarding Lipschitz and analytic functions.

Contrary to the sandwiching property, the uniform approximation property cannot hold everywhere if the approximated function class contains high- (or infinite-)degree functions. When the training distribution has strictly sub-exponential tails, however, the expected approximation error outside the radius of approximation is negligible. Importantly, this property can be certified for the test distribution by using a moment-matching tester. See Section 4.2 for a more detailed technical overview and for the full proof.
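For intuition, a moment-matching check of the kind referenced above could look as follows; this is a simplified stand-in (the thresholds, degrees, and exact statistics used in the paper differ), meant only to illustrate the idea of comparing low-degree moments of the two marginals:

```python
import numpy as np
from itertools import combinations_with_replacement

def moment_matching_test(X_train, X_test, max_degree, slack):
    """Reject (return False) if some low-degree monomial moment of the test sample
    deviates from the corresponding training-sample moment by more than `slack`."""
    d = X_train.shape[1]
    for deg in range(1, max_degree + 1):
        for alpha in combinations_with_replacement(range(d), deg):
            m_tr = np.mean(np.prod(X_train[:, list(alpha)], axis=1))
            m_te = np.mean(np.prod(X_test[:, list(alpha)], axis=1))
            if abs(m_te - m_tr) > slack:
                return False
    return True

# Usage: with matching marginals the test should accept (up to sampling error).
rng = np.random.default_rng(2)
print(moment_matching_test(rng.standard_normal((50_000, 3)),
                           rng.standard_normal((50_000, 3)), max_degree=4, slack=0.25))
```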

1.3 Related Work

Learning with Distribution Shift. The field of domain adaptation has been studying the distribution shift problem for almost two decades [BDBCP06, BCK+07, BDBC+10, MMR09, DLLP10, MKFAS20, RMH+20, KZZ24, HK19, HK24, ACM24], providing useful insights regarding the information-theoretic (im)possibilities for learning with distribution shift. The first efficient end-to-end algorithms for non-trivial concept classes with distribution shift were given for TDS learning in [KSV24b, KSV24a, CKK+24] and for PQ learning, originally defined by [GKKM20], in [GSSV24]. These works focus on binary classification for classes like halfspaces, halfspace intersections, and geometric concepts. In the regression setting, we need to handle unbounded loss functions, but we are also able to use Lipschitz properties of real-valued networks to obtain results even for deeper architectures. For the special case of linear regression, efficient algorithms for learning with distribution shift are known to exist (see, e.g., [LHL21]), but our results capture much broader classes.

Another distinction between the existing works in TDS learning and our work is that our results require significantly milder assumptions on the training distribution. In particular, while all prior works on TDS learning require both concentration and anti-concentration for the training marginal [KSV24b, KSV24a, CKK+24], we only assume strictly subexponential concentration in every direction. This is possible because the function classes we consider are Lipschitz, which is not the case for binary classification.

Testable Learning. More broadly, TDS learning is related to the notion of testable learning [RV23, GKK23, GKSV24b, DKK+23, GKSV24a, DKLZ24, STW24], originally defined by [RV23] for standard agnostic learning, aiming to certify optimal performance for learning algorithms without relying directly on any distributional assumptions. The main difference between testable agnostic learning and TDS learning is that in TDS learning, we allow for distribution shift, while in testable agnostic learning the training and test distributions are the same. Because of this, TDS learning remains challenging even in the absence of label noise, in which case testable learning becomes trivial [KSV24b].

Efficient Learning of Neural Networks. Many works have focused on providing upper and lower bounds on the computational complexity of learning neural networks in the standard (distribution-shift-free) setting [GKKT17, GK19, GGJ+20, GGK20, DGK+20, DKZ20, DKKZ20, DKTZ22, CGKM22, CKM22, CDG+23, WZDD23, GGKS24, DK24, LMZ20, GMOV19, ZYWG19, VW19, AZLL19, BJW19, MR18, GKLW19, GLM18, DLT18, GKM18, Tia17, LY17, BG17, ZSJ+17, ZLJ16a, JSA15]. The majority of the upper bounds either require noiseless labels and shallow architectures or work only under Gaussian training marginals. Our results not only hold in the presence of distribution shift, but also capture deeper architectures, under any strictly subexponential training marginal and allow adversarial label noise.

The upper bounds that are closest to our work are those given by [GKKT17]. They consider ReLU as well as sigmoid networks, allow for adversarial label noise and assume that the training marginal is bounded but otherwise arbitrary. Our results in Section 3 extend all of the results in [GKKT17] to the TDS setting, by assuming additionally that the training distribution is hypercontractive (see Definition 3.9). This additional assumption is important to ensure that our tests will pass when there is no distribution shift. For a more thorough technical comparison with [GKKT17], see Section 3.

In Section 4, we provide upper bounds for TDS learning of Lipschitz networks even when the training marginal is an arbitrary strictly subexponential distribution. In particular, our results imply new bounds for standard agnostic learning of single ReLU neurons, where we achieve runtime $d^{\mathrm{poly}(1/\epsilon)}$. The only known upper bounds work under the Gaussian marginal [DGK+20], achieving similar runtime. In fact, in the statistical query framework [Kea98], it is known that $d^{\mathrm{poly}(1/\epsilon)}$ runtime is necessary for agnostically learning the ReLU, even under the Gaussian distribution [DKZ20, GGK20].

2 Preliminaries

We use standard vector and matrix notation. We denote with $\mathbb{R},\mathbb{N}$ the sets of real and natural numbers accordingly. We denote with $\mathcal{D}$ labeled distributions over $\mathbb{R}^{d}\times\mathbb{R}$ and with $\mathcal{D}_{\bm{x}}$ the marginal of $\mathcal{D}$ on the features in $\mathbb{R}^{d}$. For a set $S$ of points in $\mathbb{R}^{d}$, we define the empirical probabilities (resp. expectations) as $\mathbf{Pr}_{\bm{x}\sim S}[E(\bm{x})]=\frac{1}{|S|}\sum_{\bm{x}\in S}\mathbbm{1}\{E(\bm{x})\}$ (resp. $\mathbb{E}_{\bm{x}\sim S}[f(\bm{x})]=\frac{1}{|S|}\sum_{\bm{x}\in S}f(\bm{x})$). We denote with $\bar{S}$ the labeled version of $S$, and we define the clipping function $\mathrm{cl}_{M}:\mathbb{R}\to[-M,M]$ that maps a number $t\in\mathbb{R}$ either to itself if $t\in[-M,M]$, or to $M\cdot\operatorname{sign}(t)$ otherwise.

Loss function. Throughout this work, we denote with $\mathcal{L}_{\mathcal{D}}(h)$ the squared loss of a hypothesis $h:\mathbb{R}^{d}\to\mathbb{R}$ with respect to a labeled distribution $\mathcal{D}$, i.e., $\mathcal{L}_{\mathcal{D}}(h)=\sqrt{\mathbb{E}_{(\bm{x},y)\sim\mathcal{D}}[(y-h(\bm{x}))^{2}]}$. Moreover, for any function $f:\mathbb{R}^{d}\to\mathbb{R}$, we denote with $\|f\|_{\mathcal{D}}$ the quantity $\|f\|_{\mathcal{D}}=\sqrt{\mathbb{E}_{\bm{x}\sim\mathcal{D}_{\bm{x}}}[(f(\bm{x}))^{2}]}$. For a set of labeled examples $\bar{S}$, we denote with $\mathcal{L}_{\bar{S}}(h)$ the empirical loss on $\bar{S}$, i.e., $\mathcal{L}_{\bar{S}}(h)=\sqrt{\frac{1}{|\bar{S}|}\sum_{(\bm{x},y)\in\bar{S}}(y-h(\bm{x}))^{2}}$, and similarly for $\|f\|_{S}$.
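For concreteness, the empirical quantities above translate directly into code; the following helpers (illustrative, in Python/NumPy, with names of our choosing) compute $\mathrm{cl}_{M}$, $\mathcal{L}_{\bar{S}}(h)$ and $\|f\|_{S}$:

```python
import numpy as np

def clip_M(t, M):
    """cl_M(t): clip a real number (or array) to the interval [-M, M]."""
    return np.clip(t, -M, M)

def empirical_loss(h, S_labeled):
    """L_S(h) = sqrt( (1/|S|) * sum (y - h(x))^2 ) for a labeled sample S = (X, y)."""
    X, y = S_labeled
    return np.sqrt(np.mean((y - h(X)) ** 2))

def empirical_norm(f, S):
    """||f||_S = sqrt( (1/|S|) * sum f(x)^2 ) for an unlabeled sample S."""
    return np.sqrt(np.mean(f(S) ** 2))
```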

Distributional Assumptions. In order to obtain efficient algorithms, we will either assume that the training marginal $\mathcal{D}_{\bm{x}}$ is bounded and hypercontractive (Section 3) or that it has strictly subexponential tails in every direction (Section 4). We make no assumptions on the test marginal $\mathcal{D}_{\bm{x}}'$.

Regarding the labels, we assume some mild bound on the moments of the training and the test labels, e.g., (a) that $\mathbb{E}_{y\sim\mathcal{D}_{y}}[y^{4}],\mathbb{E}_{y\sim\mathcal{D}'_{y}}[y^{4}]\leq M$ or (b) that $y\in[-M,M]$ a.s. for both $\mathcal{D}$ and $\mathcal{D}'$. Although, ideally, we want to avoid any assumptions on the test distribution, as we show in Proposition D.1, a bound on some constant-degree moment of the test labels is necessary.

3 Bounded Training Marginals

We begin with the scenario where the training distribution is known to be bounded. In this case, it is known that one-hidden-layer sigmoid networks can be agnostically learned (in the classical sense, without distribution shift) in fully polynomial time and single ReLU neurons can be learned up to error $O(\frac{1}{\log d})$ in polynomial time [GKKT17]. These results are based on a kernel-based approach, combined with results from polynomial approximation theory. While polynomial approximations can reduce the nonconvex agnostic learning problem to a convex one through polynomial feature expansions, the kernel trick enables further pruning of the search space, which is important for obtaining polynomial-time algorithms. Our work demonstrates another useful implication of the kernel trick: it leads to efficient algorithms for testing distribution shift.

We will require the following standard notions:

Definition 3.1 (Kernels [Mer09]).

A function $\mathcal{K}:\mathbb{R}^{d}\times\mathbb{R}^{d}\to\mathbb{R}$ is a kernel. If for any set of $m$ points $\bm{x}_{1},\dots,\bm{x}_{m}$ in $\mathbb{R}^{d}$, the matrix $(\mathcal{K}(\bm{x}_{i},\bm{x}_{j}))_{(i,j)\in[m]}$ is positive semidefinite, we say that the kernel $\mathcal{K}$ is positive definite. The kernel $\mathcal{K}$ is symmetric if for all $\bm{x},\bm{x}'\in\mathbb{R}^{d}$, $\mathcal{K}(\bm{x},\bm{x}')=\mathcal{K}(\bm{x}',\bm{x})$.

Any PDS kernel is associated with some Hilbert space $\mathbb{H}$ and some feature map from $\mathbb{R}^{d}$ to $\mathbb{H}$.

Fact 3.2 (Reproducing Kernel Hilbert Space).

For any positive definite and symmetric (PDS) kernel $\mathcal{K}$, there is a Hilbert space $\mathbb{H}$, equipped with an inner product $\langle\cdot,\cdot\rangle:\mathbb{H}\times\mathbb{H}\to\mathbb{R}$, and a function $\psi:\mathbb{R}^{d}\to\mathbb{H}$ such that $\mathcal{K}(\bm{x},\bm{x}')=\langle\psi(\bm{x}),\psi(\bm{x}')\rangle$ for all $\bm{x},\bm{x}'\in\mathbb{R}^{d}$. We call $\mathbb{H}$ the reproducing kernel Hilbert space (RKHS) for $\mathcal{K}$ and $\psi$ the feature map for $\mathcal{K}$.

There are three main properties of the kernel method. First, although the associated feature map $\psi$ may correspond to a vector in an infinite-dimensional space, the kernel $\mathcal{K}(\bm{x},\bm{x}')$ may still be efficiently evaluated, due to its analytic expression in terms of $\bm{x}$, $\bm{x}'$. Second, the function class $\mathcal{F}_{\mathcal{K}}=\{\bm{x}\mapsto\langle\bm{v},\psi(\bm{x})\rangle:\bm{v}\in\mathbb{H},\langle\bm{v},\bm{v}\rangle\leq B\}$ has Rademacher complexity independent of the dimension of $\mathbb{H}$, as long as the maximum value of $\mathcal{K}(\bm{x},\bm{x})$ for $\bm{x}$ in the domain is bounded (Thm. 6.12 in [MRT18]). Third, the time complexity of finding the function in $\mathcal{F}_{\mathcal{K}}$ that best fits a dataset is polynomial in the size of the dataset, due to the representer theorem (Thm. 6.11 in [MRT18]). Taken together, these properties constitute the basis of the kernel method, implying learners with runtime independent of the effective dimension of the learning problem.
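The first property is easy to see for polynomial kernels; the sketch below (illustrative, not from the paper) checks numerically that the $O(d)$-time kernel evaluation $\langle\bm{x},\bm{z}\rangle^{\mathrm{degree}}$ equals the inner product of the corresponding explicit monomial feature maps, whose dimension grows as $d^{\mathrm{degree}}$:

```python
import numpy as np
from itertools import product

def poly_kernel(x, z, degree):
    # Evaluated in O(d) time even though the implicit feature space has d^degree coordinates.
    return (x @ z) ** degree

def explicit_features(x, degree):
    # Explicit feature map for <x,z>^degree: all ordered monomials of the given degree.
    d = len(x)
    return np.array([np.prod(x[list(idx)]) for idx in product(range(d), repeat=degree)])

rng = np.random.default_rng(3)
x, z, degree = rng.standard_normal(4), rng.standard_normal(4), 3
lhs = poly_kernel(x, z, degree)                                      # cheap kernel evaluation
rhs = explicit_features(x, degree) @ explicit_features(z, degree)    # 4^3-dimensional inner product
print(np.isclose(lhs, rhs))   # the two agree: the kernel computes the same inner product
```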

In order to apply the kernel method to learn some function class $\mathcal{F}$, it suffices to show that the class $\mathcal{F}$ can be represented sufficiently well by the class $\mathcal{F}_{\mathcal{K}}$. We give the following definition.

Definition 3.3 (Approximate Representation).

Let $\mathcal{F}$ be a function class over $\mathbb{R}^{d}$ and $\mathcal{K}:\mathbb{R}^{d}\times\mathbb{R}^{d}\to\mathbb{R}$ a PDS kernel, where $\mathbb{H}$ is the corresponding RKHS and $\psi$ the feature map for $\mathcal{K}$. We say that $\mathcal{F}$ can be $(\epsilon,B)$-approximately represented within radius $R$ with respect to $\mathcal{K}$ if for any $f\in\mathcal{F}$, there is $\bm{v}\in\mathbb{H}$ with $\langle\bm{v},\bm{v}\rangle\leq B$ such that $|f(\bm{x})-\langle\bm{v},\psi(\bm{x})\rangle|\leq\epsilon$ for all $\bm{x}\in\mathbb{R}^{d}$ with $\|\bm{x}\|_{2}\leq R$.

For the purposes of TDS learning, we will also require the training marginal to be hypercontractive with respect to the kernel at hand. This is important to ensure that our test will accept whenever there is no distribution shift. More formally, we require the following.

Definition 3.4 (Hypercontractivity).

Let $\mathcal{D}_{\bm{x}}$ be some distribution over $\mathbb{R}^{d}$, let $\mathbb{H}$ be a Hilbert space and let $\psi:\mathbb{R}^{d}\to\mathbb{H}$. We say that $\mathcal{D}_{\bm{x}}$ is $(\psi,C,\ell)$-hypercontractive if for any $t\in\mathbb{N}$ and $\bm{v}\in\mathbb{H}$:

\[\mathbb{E}_{\bm{x}\sim\mathcal{D}_{\bm{x}}}[\langle\bm{v},\psi(\bm{x})\rangle^{2t}]\leq(Ct)^{2\ell t}\bigl(\mathbb{E}_{\bm{x}\sim\mathcal{D}_{\bm{x}}}[\langle\bm{v},\psi(\bm{x})\rangle^{2}]\bigr)^{t}\]

If $\mathcal{K}$ is the PDS kernel corresponding to $\psi$, we also say that $\mathcal{D}_{\bm{x}}$ is $(\mathcal{K},C,\ell)$-hypercontractive.

3.1 TDS Regression via the Kernel Method

We now give a general theorem on TDS regression for bounded distributions, under the following assumptions. Note that, although we assume that the training and test labels are bounded, this assumption can be relaxed in a black-box manner and bounding some constant-degree moment of the distribution of the labels suffices, as we show in Corollary D.2.

Assumption 3.5.

For a function class $\mathcal{F}\subseteq\{\mathbb{R}^{d}\to\mathbb{R}\}$, and training and test distributions $\mathcal{D}$, $\mathcal{D}'$ over $\mathbb{R}^{d}\times\mathbb{R}$, we assume the following.

  1. $\mathcal{F}$ is $(\epsilon,B)$-approximately represented within radius $R$ w.r.t. a PDS kernel $\mathcal{K}:\mathbb{R}^{d}\times\mathbb{R}^{d}\to\mathbb{R}$, for some $\epsilon\in(0,1)$ and $B,R\geq 1$, and let $A=\sup_{\bm{x}:\|\bm{x}\|_{2}\leq R}\mathcal{K}(\bm{x},\bm{x})$.

  2. The training marginal $\mathcal{D}_{\bm{x}}$ (1) is bounded within $\{\bm{x}:\|\bm{x}\|_{2}\leq R\}$ and (2) is $(\mathcal{K},C,\ell)$-hypercontractive for some $C,\ell\geq 1$.

  3. The training and test labels are both bounded in $[-M,M]$ for some $M\geq 1$.

Consider the function class $\mathcal{F}$, the kernel $\mathcal{K}$ and the parameters $\epsilon,A,B,C,M,\ell$ as defined in the assumption above, and let $\delta\in(0,1)$. Then, we obtain the following theorem.

Theorem 3.6 (TDS Learning via the Kernel Method).

Under Assumption 3.5, Algorithm 1 learns the class $\mathcal{F}$ in the TDS regression setting up to excess error $5\epsilon$ and probability of failure $\delta$. The time complexity is $O(T)\cdot\mathrm{poly}(d,\frac{1}{\epsilon},(\log(1/\delta))^{\ell},A,B,C^{\ell},2^{\ell},M)$, where $T$ is the evaluation time of $\mathcal{K}$.

Input: Parameters $M,R,B,A,C,\ell\geq 1$, $\epsilon,\delta\in(0,1)$ and sample access to $\mathcal{D}$, $\mathcal{D}_{\bm{x}}'$

Set $m=c\frac{(ABM)^{4}}{\epsilon^{4}}\log(\frac{1}{\delta})$ and $N=cm^{2}\frac{ABC}{\epsilon^{4}}(4C\log(\frac{4}{\delta}))^{4\ell+1}$, where $c$ is a large enough constant.

Draw $m$ i.i.d. labeled examples $\bar{S}_{\mathrm{ref}}$ from $\mathcal{D}$ and $m$ i.i.d. unlabeled examples $S_{\mathrm{ref}}'$ from $\mathcal{D}_{\bm{x}}'$.

If for some $\bm{x}\in S_{\mathrm{ref}}'$ we have $\|\bm{x}\|_{2}>R$, then reject and terminate.

Let $\hat{\bm{a}}=(\hat{a}_{\bm{z}})_{\bm{z}\in S_{\mathrm{ref}}}$ be the optimal solution to the following convex program:

\[\min_{\bm{a}\in\mathbb{R}^{m}}\ \sum_{(\bm{x},y)\in\bar{S}_{\mathrm{ref}}}\Bigl(y-\sum_{\bm{z}\in S_{\mathrm{ref}}}a_{\bm{z}}\mathcal{K}(\bm{z},\bm{x})\Bigr)^{2}\quad\text{s.t.}\quad\sum_{\bm{z},\bm{w}\in S_{\mathrm{ref}}}a_{\bm{z}}a_{\bm{w}}\mathcal{K}(\bm{z},\bm{w})\leq B,\ \text{where }\bm{a}=(a_{\bm{z}})_{\bm{z}\in S_{\mathrm{ref}}}\]

Draw $N$ i.i.d. unlabeled examples $S_{\mathrm{ver}}$ from $\mathcal{D}_{\bm{x}}$ and $N$ unlabeled examples $S_{\mathrm{ver}}'$ from $\mathcal{D}_{\bm{x}}'$.

If for some $\bm{x}\in S_{\mathrm{ver}}'$ we have $\|\bm{x}\|_{2}>R$, then reject and terminate.

Compute the matrix $\hat{\Phi}=(\hat{\Phi}_{\bm{z},\bm{w}})_{\bm{z},\bm{w}\in S_{\mathrm{ref}}\cup S_{\mathrm{ref}}'}$ with $\hat{\Phi}_{\bm{z},\bm{w}}=\frac{1}{N}\sum_{\bm{x}\in S_{\mathrm{ver}}}\mathcal{K}(\bm{x},\bm{z})\mathcal{K}(\bm{x},\bm{w})$.

Compute the matrix $\hat{\Phi}'=(\hat{\Phi}'_{\bm{z},\bm{w}})_{\bm{z},\bm{w}\in S_{\mathrm{ref}}\cup S_{\mathrm{ref}}'}$ with $\hat{\Phi}'_{\bm{z},\bm{w}}=\frac{1}{N}\sum_{\bm{x}\in S_{\mathrm{ver}}'}\mathcal{K}(\bm{x},\bm{z})\mathcal{K}(\bm{x},\bm{w})$.

Let $\rho$ be the value of the following eigenvalue problem:

\[\max_{\bm{a}\in\mathbb{R}^{2m}}\ \bm{a}^{\top}\hat{\Phi}'\bm{a}\quad\text{s.t.}\quad\bm{a}^{\top}\hat{\Phi}\bm{a}\leq 1\]

If $\rho>1+\frac{\epsilon^{2}}{50AB}$, then reject and terminate.

Otherwise, accept and output $h:\bm{x}\mapsto h(\bm{x})=\mathrm{cl}_{M}(\hat{p}(\bm{x}))$, where $\hat{p}(\bm{x})=\sum_{\bm{z}\in S_{\mathrm{ref}}}\hat{a}_{\bm{z}}\mathcal{K}(\bm{z},\bm{x})$.

Algorithm 1: TDS Regression via the Kernel Method
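The convex program in Algorithm 1 is a kernelized least-squares problem with a norm constraint. One simple way to solve it (a sketch under our own choices: a ridge-style Lagrangian with bisection over the multiplier, not the paper's prescribed solver) is:

```python
import numpy as np

def constrained_kernel_least_squares(K, y, B, lam_min=1e-6):
    """Solve  min_a ||y - K a||^2  s.t.  a^T K a <= B   (K is the m x m kernel Gram matrix).

    Uses the Lagrangian / kernel-ridge form a(lam) = (K + lam*I)^{-1} y and bisects over lam
    so that the norm constraint a^T K a <= B holds.
    """
    m = K.shape[0]
    solve = lambda lam: np.linalg.solve(K + lam * np.eye(m), y)
    norm_sq = lambda a: float(a @ K @ a)

    a = solve(lam_min)                   # (near-)unconstrained solution
    if norm_sq(a) <= B:
        return a
    lo, hi = lam_min, 1.0
    while norm_sq(solve(hi)) > B:        # grow lam until the constraint is satisfied
        hi *= 2.0
    for _ in range(100):                 # bisection: a^T K a is decreasing in lam
        mid = 0.5 * (lo + hi)
        if norm_sq(solve(mid)) > B:
            lo = mid
        else:
            hi = mid
    return solve(hi)

# Usage with a toy Gram matrix: X are the reference points, kernel (1 + <x,z>)^2.
rng = np.random.default_rng(4)
X, y = rng.standard_normal((15, 5)), rng.standard_normal(15)
K = (1.0 + X @ X.T) ** 2
a_hat = constrained_kernel_least_squares(K, y, B=10.0)
print(a_hat @ K @ a_hat <= 10.0 + 1e-6)
```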

The main ideas of our proof are the following.

Obtaining a concise reference feature map. The algorithm first draws reference sets $S_{\mathrm{ref}},S_{\mathrm{ref}}'$ from both the training and the test distributions. The representer theorem, combined with the approximate representation assumption (Definition 3.3), ensures that the reference examples define a new feature map $\phi:\mathbb{R}^{d}\to\mathbb{R}^{2m}$ with $\phi(\bm{x})=(\mathcal{K}(\bm{x},\bm{z}))_{\bm{z}\in S_{\mathrm{ref}}\cup S_{\mathrm{ref}}'}$ such that the ground truth $f^{*}=\arg\min_{f\in\mathcal{F}}[\mathcal{L}_{\mathcal{D}}(f)+\mathcal{L}_{\mathcal{D}'}(f)]$ can be approximately represented as a linear combination of the features in $\phi$ with respect to both $S_{\mathrm{ref}}$ and $S_{\mathrm{ref}}'$, i.e., $\|f^{*}-(\bm{a}^{*})^{\top}\phi\|_{S_{\mathrm{ref}}}$ and $\|f^{*}-(\bm{a}^{*})^{\top}\phi\|_{S_{\mathrm{ref}}'}$ are both small for some $\bm{a}^{*}\in\mathbb{R}^{2m}$. In particular, we have the following.

Proposition 3.7 (Representer Theorem, modification of Theorem 6.11 in [MRT18]).

Suppose that a function $f:\mathbb{R}^{d}\to\mathbb{R}$ can be $(\epsilon,B)$-approximately represented within radius $R$ w.r.t. some PDS kernel $\mathcal{K}$ (as per Definition 3.3). Then, for any set of examples $S$ in $\{\bm{x}\in\mathbb{R}^{d}:\|\bm{x}\|_{2}\leq R\}$, there is $\bm{a}=(a_{\bm{x}})_{\bm{x}\in S}\in\mathbb{R}^{|S|}$ such that for $\tilde{p}(\bm{x})=\sum_{\bm{z}\in S}a_{\bm{z}}\mathcal{K}(\bm{z},\bm{x})$ we have:

\[\|f-\tilde{p}\|_{S}\leq\epsilon\quad\text{and}\quad\sum_{\bm{x},\bm{z}\in S}a_{\bm{x}}a_{\bm{z}}\mathcal{K}(\bm{z},\bm{x})\leq B\]
Proof.

We first observe that there is some $\bm{v}\in\mathbb{H}$ such that $\langle\bm{v},\bm{v}\rangle\leq B$ and for $p(\bm{x})=\langle\bm{v},\psi(\bm{x})\rangle$ we have $\|f-p\|_{S}\leq\epsilon$, because by Definition 3.3, there is a pointwise approximator for $f$ with respect to $\mathcal{K}$. By Theorem 6.11 in [MRT18], this implies the existence of $\tilde{p}$ as desired. ∎

Note that since the evaluation of $\phi(\bm{x})$ only involves kernel evaluations, we never need to compute the initial feature expansion $\psi(\bm{x})$, which could be overly expensive.

Forming a candidate output hypothesis. We know that the reference feature map approximately represents the ground truth. However, having no access to test labels, we cannot directly hope to find the corresponding coefficient $\bm{a}^{*}\in\mathbb{R}^{2m}$. Instead, we use only the training reference examples to find a candidate hypothesis $\hat{p}$ with close-to-optimal performance on the training distribution which can also be expressed in terms of the reference feature map $\phi$, as $\hat{p}=\hat{\bm{a}}^{\top}\phi$. It then suffices to test the quality of $\phi$ on the test distribution.

Testing the quality of the reference feature map on the test distribution. We know that the function $\tilde{p}^{*}=(\bm{a}^{*})^{\top}\phi$ performs well on the test distribution (since it is close to $f^{*}$ on a reference test set). We also know that the candidate output $\hat{\bm{a}}^{\top}\phi$ performs well on the training distribution. Therefore, in order to ensure that $\hat{p}$ performs well on the test distribution, it suffices to show that the distance between $\hat{p}$ and $\tilde{p}^{*}$ under the test distribution, i.e., $\|\hat{\bm{a}}^{\top}\phi-(\bm{a}^{*})^{\top}\phi\|_{\mathcal{D}_{\bm{x}}'}$, is small. In fact, it suffices to bound this distance by the corresponding one under the training distribution, because $\hat{p}$ fits the training data well and $\|\hat{\bm{a}}^{\top}\phi-(\bm{a}^{*})^{\top}\phi\|_{\mathcal{D}_{\bm{x}}}$ is indeed small. Since we do not know $\bm{a}^{*}$, we need to run a test on $\phi$ that certifies the desired bound for any $\bm{a}^{*}$.

Using the spectral tester. We observe that $\|\hat{\bm{a}}^{\top}\phi-(\bm{a}^{*})^{\top}\phi\|_{\mathcal{D}_{\bm{x}}}^{2}=(\hat{\bm{a}}-\bm{a}^{*})^{\top}\Phi(\hat{\bm{a}}-\bm{a}^{*})$, where $\Phi=\mathbb{E}_{\bm{x}\sim\mathcal{D}_{\bm{x}}}[\phi(\bm{x})\phi(\bm{x})^{\top}]$, and similarly $\|\hat{\bm{a}}^{\top}\phi-(\bm{a}^{*})^{\top}\phi\|_{\mathcal{D}_{\bm{x}}'}^{2}=(\hat{\bm{a}}-\bm{a}^{*})^{\top}\Phi'(\hat{\bm{a}}-\bm{a}^{*})$. Since we want to obtain a bound for all $\bm{a}^{*}$, we essentially want to ensure that for any $\bm{a}\in\mathbb{R}^{2m}$ we have $\bm{a}^{\top}\Phi'\bm{a}\leq(1+\rho)\bm{a}^{\top}\Phi\bm{a}$ for some small $\rho$. Having a multiplicative bound is important because we do not have any bound on the norm $\|\hat{\bm{a}}-\bm{a}^{*}\|_{2}$.

To implement the test, and since we cannot test $\Phi$ and $\Phi'$ directly, we draw fresh verification examples $S_{\mathrm{ver}},S'_{\mathrm{ver}}$ from $\mathcal{D}_{\bm{x}}$ and $\mathcal{D}_{\bm{x}}'$ and run a spectral test on the corresponding empirical versions $\hat{\Phi},\hat{\Phi}'$ of the matrices $\Phi,\Phi'$. To ensure that the test will accept when there is no distribution shift, we use the following lemma (originally from [GSSV24]) on multiplicative spectral concentration for $\hat{\Phi}$, where the hypercontractivity assumption (Definition 3.4) is important.

Lemma 3.8 (Multiplicative Spectral Concentration, Lemma B.1 in [GSSV24], modified).

Let $\mathcal{D}_{\bm{x}}$ be a distribution over $\mathbb{R}^{d}$ and $\phi:\mathbb{R}^{d}\to\mathbb{R}^{m}$ such that $\mathcal{D}_{\bm{x}}$ is $(\phi,C,\ell)$-hypercontractive for some $C,\ell\geq 1$. Suppose that $S$ consists of $N$ i.i.d. examples from $\mathcal{D}_{\bm{x}}$ and let $\Phi=\mathbb{E}_{\bm{x}\sim\mathcal{D}_{\bm{x}}}[\phi(\bm{x})\phi(\bm{x})^{\top}]$ and $\hat{\Phi}=\frac{1}{N}\sum_{\bm{x}\in S}\phi(\bm{x})\phi(\bm{x})^{\top}$. For any $\epsilon,\delta\in(0,1)$, if $N\geq\frac{64Cm^{2}}{\epsilon^{2}}(4C\log_{2}(\frac{4}{\delta}))^{4\ell+1}$, then with probability at least $1-\delta$, we have that

\[\text{for any }\bm{a}\in\mathbb{R}^{m}:\quad\bm{a}^{\top}\hat{\Phi}\bm{a}\in\bigl[(1-\epsilon)\bm{a}^{\top}\Phi\bm{a},\,(1+\epsilon)\bm{a}^{\top}\Phi\bm{a}\bigr]\]

Note that the multiplicative spectral concentration lemma requires access to independent samples. However, the reference feature map $\phi$ depends on the reference examples $S_{\mathrm{ref}},S_{\mathrm{ref}}'$. This is the reason why we do not reuse $S_{\mathrm{ref}},S_{\mathrm{ref}}'$, but rather draw fresh verification examples. For the proof of Lemma 3.8, see Appendix A.

We now provide the full formal proof of Theorem 3.6. The full proof involves appropriate uniform convergence bounds for kernel hypotheses, which are important in order to shift from the reference to the verification examples and back.

Proof of Theorem 3.6.

Consider the reference feature map $\phi:\mathbb{R}^{d}\to\mathbb{R}^{2m}$ with $\phi(\bm{x})=(\mathcal{K}(\bm{x},\bm{z}))_{\bm{z}\in S_{\mathrm{ref}}\cup S_{\mathrm{ref}}'}$. Let $f^{*}=\arg\min_{f\in\mathcal{F}}[\mathcal{L}_{\mathcal{D}}(f)+\mathcal{L}_{\mathcal{D}'}(f)]$ and $f_{\mathrm{opt}}=\arg\min_{f\in\mathcal{F}}[\mathcal{L}_{\mathcal{D}}(f)]$. By Assumption 3.5, we know that there are functions $p^{*},p_{\mathrm{opt}}:\mathbb{R}^{d}\to\mathbb{R}$ with $p^{*}(\bm{x})=\langle\bm{v}^{*},\psi(\bm{x})\rangle$ and $p_{\mathrm{opt}}(\bm{x})=\langle\bm{v}_{\mathrm{opt}},\psi(\bm{x})\rangle$ that uniformly approximate $f^{*}$ and $f_{\mathrm{opt}}$ within the ball of radius $R$, i.e., $\sup_{\bm{x}:\|\bm{x}\|_{2}\leq R}|f^{*}(\bm{x})-p^{*}(\bm{x})|\leq\epsilon$ and $\sup_{\bm{x}:\|\bm{x}\|_{2}\leq R}|f_{\mathrm{opt}}(\bm{x})-p_{\mathrm{opt}}(\bm{x})|\leq\epsilon$. Moreover, $\langle\bm{v}^{*},\bm{v}^{*}\rangle,\langle\bm{v}_{\mathrm{opt}},\bm{v}_{\mathrm{opt}}\rangle\leq B$.

By Proposition 3.7, there is $\bm{a}^{*}\in\mathbb{R}^{2m}$ such that for $\tilde{p}^{*}:\mathbb{R}^{d}\to\mathbb{R}$ with $\tilde{p}^{*}(\bm{x})=(\bm{a}^{*})^{\top}\phi(\bm{x})$ we have $\|f^{*}-\tilde{p}^{*}\|_{S_{\mathrm{ref}}}\leq 3\epsilon/2$ and $\|f^{*}-\tilde{p}^{*}\|_{S_{\mathrm{ref}}'}\leq 3\epsilon/2$. Let $\bm{K}$ be a matrix in $\mathbb{R}^{2m\times 2m}$ such that $\bm{K}_{\bm{z},\bm{w}}=\mathcal{K}(\bm{z},\bm{w})$ for $\bm{z},\bm{w}\in S_{\mathrm{ref}}\cup S_{\mathrm{ref}}'$. We additionally have that $(\bm{a}^{*})^{\top}\bm{K}\bm{a}^{*}\leq B$. Therefore, for any $\bm{x}\in\mathbb{R}^{d}$ we have

\begin{align*}
(\tilde{p}^{*}(\bm{x}))^{2} &=\Bigl(\Bigl\langle\sum_{\bm{z}\in S_{\mathrm{ref}}\cup S_{\mathrm{ref}}'}a^{*}_{\bm{z}}\psi(\bm{z}),\,\psi(\bm{x})\Bigr\rangle\Bigr)^{2} \\
&\leq\Bigl\langle\sum_{\bm{z}\in S_{\mathrm{ref}}\cup S_{\mathrm{ref}}'}a^{*}_{\bm{z}}\psi(\bm{z}),\,\sum_{\bm{z}\in S_{\mathrm{ref}}\cup S_{\mathrm{ref}}'}a^{*}_{\bm{z}}\psi(\bm{z})\Bigr\rangle\cdot\langle\psi(\bm{x}),\psi(\bm{x})\rangle \\
&=(\bm{a}^{*})^{\top}\bm{K}\bm{a}^{*}\cdot\mathcal{K}(\bm{x},\bm{x})\leq B\cdot\mathcal{K}(\bm{x},\bm{x})\,,
\end{align*}

where we used the Cauchy-Schwarz inequality. For $\bm{x}$ with $\|\bm{x}\|_{2}\leq R$, we hence have $(\tilde{p}^{*}(\bm{x}))^{2}\leq AB$ (recall that $A=\max_{\|\bm{x}\|_{2}\leq R}\mathcal{K}(\bm{x},\bm{x})$).

Similarly, by applying the representer theorem (Theorem 6.11 in [MRT18]) for $p_{\mathrm{opt}}$, we have that there exists $\bm{a}^{\mathrm{opt}}=(a^{\mathrm{opt}}_{\bm{z}})_{\bm{z}\in S_{\mathrm{ref}}}\in\mathbb{R}^{m}$ such that for $\tilde{p}_{\mathrm{opt}}:\mathbb{R}^{d}\to\mathbb{R}$ with $\tilde{p}_{\mathrm{opt}}(\bm{x})=\sum_{\bm{z}\in S_{\mathrm{ref}}}a^{\mathrm{opt}}_{\bm{z}}\mathcal{K}(\bm{z},\bm{x})$ we have $\mathcal{L}_{\bar{S}_{\mathrm{ref}}}(\tilde{p}_{\mathrm{opt}})\leq\mathcal{L}_{\bar{S}_{\mathrm{ref}}}(p_{\mathrm{opt}})$ and $\sum_{\bm{z},\bm{w}\in S_{\mathrm{ref}}}a^{\mathrm{opt}}_{\bm{z}}a^{\mathrm{opt}}_{\bm{w}}\mathcal{K}(\bm{z},\bm{w})\leq B$. Since $\hat{p}$ in Algorithm 1 is formed by solving a convex program whose search space includes $\tilde{p}_{\mathrm{opt}}$, we have

\[\mathcal{L}_{\bar{S}_{\mathrm{ref}}}(\hat{p})\leq\mathcal{L}_{\bar{S}_{\mathrm{ref}}}(\tilde{p}_{\mathrm{opt}})\leq\mathcal{L}_{\bar{S}_{\mathrm{ref}}}(p_{\mathrm{opt}})\tag{3.1}\]

In the following, we abuse notation and consider $\hat{\bm{a}}$ to be a vector in $\mathbb{R}^{2m}$, by appending $m$ zeroes, one for each of the elements of $S'_{\mathrm{ref}}$. Note that we then have $\hat{\bm{a}}^{\top}\bm{K}\hat{\bm{a}}\leq B$ and, also, $(\hat{p}(\bm{x}))^{2}\leq A\cdot B$ for all $\bm{x}$ with $\|\bm{x}\|_{2}\leq R$.

Soundness. Suppose first that the algorithm has accepted. In what follows, we will use the triangle inequality of norms to bound, for functions $h_{1},h_{2},h_{3}$, the quantity $\|h_{1}-h_{2}\|_{\mathcal{D}}$ by $\|h_{1}-h_{3}\|_{\mathcal{D}}+\|h_{2}-h_{3}\|_{\mathcal{D}}$. We also use the inequality $\mathcal{L}_{\mathcal{D}}(h_{1})\leq\mathcal{L}_{\mathcal{D}}(h_{2})+\|h_{1}-h_{2}\|_{\mathcal{D}}$, as well as the fact that $\|\mathrm{cl}_{M}\circ h_{1}-\mathrm{cl}_{M}\circ h_{2}\|_{\mathcal{D}}\leq\|\mathrm{cl}_{M}\circ h_{1}-h_{2}\|_{\mathcal{D}}\leq\|h_{1}-h_{2}\|_{\mathcal{D}}$. We bound the test error of the output hypothesis $h:\mathbb{R}^{d}\to[-M,M]$ of Algorithm 1 as follows.

\[\mathcal{L}_{\mathcal{D}'}(h)\leq\|h-\mathrm{cl}_{M}\circ f^{*}\|_{\mathcal{D}_{\bm{x}}'}+\mathcal{L}_{\mathcal{D}'}(f^{*})\]

Since $(h(\bm{x})-\mathrm{cl}_{M}(f^{*}(\bm{x})))^{2}\leq 4M^{2}$ for all $\bm{x}$ and the hypothesis $h$ does not depend on the set $S_{\mathrm{ref}}'$, by a Hoeffding bound and the fact that $m$ is large enough, we obtain that $\|h-\mathrm{cl}_{M}\circ f^{*}\|_{\mathcal{D}_{\bm{x}}'}\leq\|h-\mathrm{cl}_{M}\circ f^{*}\|_{S_{\mathrm{ref}}'}+\epsilon/10$, with probability at least $1-\delta/10$. Moreover, we have $\|h-\mathrm{cl}_{M}\circ f^{*}\|_{S_{\mathrm{ref}}'}\leq\|h-\mathrm{cl}_{M}\circ\tilde{p}^{*}\|_{S_{\mathrm{ref}}'}+\|\tilde{p}^{*}-f^{*}\|_{S_{\mathrm{ref}}'}$. We have already argued that $\|\tilde{p}^{*}-f^{*}\|_{S_{\mathrm{ref}}'}\leq 3\epsilon/2$.

In order to bound the quantity $\|h-\mathrm{cl}_{M}\circ\tilde{p}^{*}\|_{S_{\mathrm{ref}}'}$, we observe that while the function $h$ does not depend on $S_{\mathrm{ref}}'$, the function $\tilde{p}^{*}$ does depend on $S_{\mathrm{ref}}'$ and, therefore, standard concentration arguments fail to bound $\|h-\mathrm{cl}_{M}\circ\tilde{p}^{*}\|_{S_{\mathrm{ref}}'}$ in terms of $\|h-\mathrm{cl}_{M}\circ\tilde{p}^{*}\|_{\mathcal{D}_{\bm{x}}'}$. However, since we have clipped $\tilde{p}^{*}$, and $\tilde{p}^{*}$ is of the form $\langle\bm{v}^{*},\psi\rangle$, we may obtain a bound using standard results from generalization theory (i.e., bounds on the Rademacher complexity of kernel-based hypotheses like Theorem 6.12 in [MRT18] and uniform convergence bounds for classes with bounded Rademacher complexity under Lipschitz and bounded losses like Theorem 11.3 in [MRT18]). In particular, we have that with probability at least $1-\delta/10$

\[\|h-\mathrm{cl}_{M}\circ\tilde{p}^{*}\|_{S_{\mathrm{ref}}'}\leq\|h-\mathrm{cl}_{M}\circ\tilde{p}^{*}\|_{\mathcal{D}_{\bm{x}}'}+\epsilon/10\]

The corresponding requirement for $m=|S_{\mathrm{ref}}'|$ is determined by the bounds on the Lipschitz constant of the loss function $(y,t)\mapsto(y-\mathrm{cl}_{M}(t))^{2}$, with $y\in[-M,M]$ and $t\in\mathbb{R}$, which is at most $5M$, the overall bound on this loss function, which is at most $4M^{2}$, as well as the bounds $A=\max_{\bm{x}:\|\bm{x}\|_{2}\leq R}\mathcal{K}(\bm{x},\bm{x})$ and $(\bm{a}^{*})^{\top}\bm{K}\bm{a}^{*}\leq B$ (which give bounds on the Rademacher complexity).

By applying the Hoeffding bound, we are able to further bound the quantity $\|h-\mathrm{cl}_{M}\circ\tilde{p}^{*}\|_{\mathcal{D}_{\bm{x}}'}$ by $\|h-\mathrm{cl}_{M}\circ\tilde{p}^{*}\|_{S_{\mathrm{ver}}'}+\epsilon/10$, with probability at least $1-\delta$. We have effectively managed to bound the quantity $\|h-\mathrm{cl}_{M}\circ\tilde{p}^{*}\|_{S_{\mathrm{ref}}'}$ by $\|h-\mathrm{cl}_{M}\circ\tilde{p}^{*}\|_{S_{\mathrm{ver}}'}+\epsilon/5$. This is important, because the set $S_{\mathrm{ver}}'$ is a fresh set of examples and, therefore, independent from $\tilde{p}^{*}$. Our goal is now to use the fact that our spectral tester has accepted. We have the following for the matrix $\hat{\Phi}'=(\hat{\Phi}'_{\bm{z},\bm{w}})_{\bm{z},\bm{w}\in S_{\mathrm{ref}}\cup S_{\mathrm{ref}}'}$ with $\hat{\Phi}'_{\bm{z},\bm{w}}=\frac{1}{N}\sum_{\bm{x}\in S_{\mathrm{ver}}'}\mathcal{K}(\bm{x},\bm{z})\mathcal{K}(\bm{x},\bm{w})$.

\begin{align*}
\|h-\mathrm{cl}_{M}\circ\tilde{p}^{*}\|_{S_{\mathrm{ver}}'}^{2} &\leq\|\hat{p}-\tilde{p}^{*}\|_{S_{\mathrm{ver}}'}^{2} \\
&=(\hat{\bm{a}}-\bm{a}^{*})^{\top}\hat{\Phi}'(\hat{\bm{a}}-\bm{a}^{*})
\end{align*}

Since our test has accepted, we know that $(\hat{\bm{a}}-\bm{a}^{*})^{\top}\hat{\Phi}'(\hat{\bm{a}}-\bm{a}^{*})\leq(1+\rho)(\hat{\bm{a}}-\bm{a}^{*})^{\top}\hat{\Phi}(\hat{\bm{a}}-\bm{a}^{*})$, for the matrix $\hat{\Phi}=(\hat{\Phi}_{\bm{z},\bm{w}})_{\bm{z},\bm{w}\in S_{\mathrm{ref}}\cup S_{\mathrm{ref}}'}$ with $\hat{\Phi}_{\bm{z},\bm{w}}=\frac{1}{N}\sum_{\bm{x}\in S_{\mathrm{ver}}}\mathcal{K}(\bm{x},\bm{z})\mathcal{K}(\bm{x},\bm{w})$. We note here that having a multiplicative bound of this form is important, because we do not have any upper bound on the norms of $\hat{\bm{a}}$ and $\bm{a}^{*}$. Instead, we only have bounds on distorted versions of these vectors, e.g., on $\hat{\bm{a}}^{\top}\bm{K}\hat{\bm{a}}$, which does not imply any bound on the norm of $\hat{\bm{a}}$, because $\bm{K}$ could have very small singular values.

Overall, we have that

\begin{align*}
\|\hat{p}-\tilde{p}^{*}\|_{S_{\mathrm{ver}}'}-\|\hat{p}-\tilde{p}^{*}\|_{S_{\mathrm{ver}}} &\leq\sqrt{\rho\bigl(2\|\hat{p}\|_{S_{\mathrm{ver}}}^{2}+2\|\tilde{p}^{*}\|_{S_{\mathrm{ver}}}^{2}\bigr)} \\
&\leq\sqrt{4AB\rho}\leq\frac{3\epsilon}{10}.
\end{align*}

By using results from generalization theory once more, we obtain that with probability at least $1-\delta/5$ we have $\|\hat{p}-\tilde{p}^{*}\|_{S_{\mathrm{ver}}}\leq\|\hat{p}-\tilde{p}^{*}\|_{S_{\mathrm{ref}}}+\epsilon/5$. This step is important, because the only fact we know about the quality of $\hat{p}$ is that it outperforms every polynomial on the sample $S_{\mathrm{ref}}$ (not necessarily over the entire training distribution). We may once more use bounds on the values of $\hat{p}$ and $\tilde{p}^{*}$, this time without requiring clipping, since we know that the training marginal is bounded and, hence, the values of $\hat{p}$ and $\tilde{p}^{*}$ are bounded as well. This was not true for the test distribution, since we did not make any assumptions about it.

In order to bound $\|\hat{p}-\tilde{p}^{*}\|_{S_{\mathrm{ref}}}$, we have the following.

\begin{align*}
\|\hat{p}-\tilde{p}^{*}\|_{S_{\mathrm{ref}}} &\leq\mathcal{L}_{\bar{S}_{\mathrm{ref}}}(\hat{p})+\mathcal{L}_{\bar{S}_{\mathrm{ref}}}(\mathrm{cl}\circ f^{*})+\|f^{*}-\tilde{p}^{*}\|_{S_{\mathrm{ref}}} \\
&\leq\mathcal{L}_{\bar{S}_{\mathrm{ref}}}(\tilde{p}_{\mathrm{opt}})+\mathcal{L}_{\bar{S}_{\mathrm{ref}}}(\mathrm{cl}\circ f^{*})+\|f^{*}-\tilde{p}^{*}\|_{S_{\mathrm{ref}}} && \text{(by equation 3.1)} \\
&\leq\mathcal{L}_{\bar{S}_{\mathrm{ref}}}(p_{\mathrm{opt}})+\mathcal{L}_{\bar{S}_{\mathrm{ref}}}(\mathrm{cl}\circ f^{*})+\|f^{*}-\tilde{p}^{*}\|_{S_{\mathrm{ref}}}
\end{align*}

The first term above is bounded as $\mathcal{L}_{\bar{S}_{\mathrm{ref}}}(p_{\mathrm{opt}})\leq\mathcal{L}_{\bar{S}_{\mathrm{ref}}}(\mathrm{cl}_{M}\circ f_{\mathrm{opt}})+\|p_{\mathrm{opt}}-f_{\mathrm{opt}}\|_{S_{\mathrm{ref}}}$, where the second term is at most $\epsilon$ (by the definition of $p_{\mathrm{opt}}$) and the first term can be bounded by $\mathcal{L}_{\mathcal{D}}(f_{\mathrm{opt}})+\epsilon/10=\mathrm{opt}+\epsilon/10$, with probability at least $1-\delta/10$, due to an application of the Hoeffding bound.

For the term $\mathcal{L}_{\bar{S}_{\mathrm{ref}}}(\mathrm{cl}\circ f^{*})$, we can similarly use the Hoeffding bound to obtain, with probability at least $1-\delta/10$, that $\mathcal{L}_{\bar{S}_{\mathrm{ref}}}(\mathrm{cl}\circ f^{*})\leq\mathcal{L}_{\mathcal{D}}(f^{*})+\epsilon/10$.

Finally, for the term $\|f^{*}-\tilde{p}^{*}\|_{S_{\mathrm{ref}}}$, we have that $\|f^{*}-\tilde{p}^{*}\|_{S_{\mathrm{ref}}}\leq 3\epsilon/2$, as argued above.

Overall, we obtain a bound of the form $\mathcal{L}_{\mathcal{D}'}(h)\leq\mathcal{L}_{\mathcal{D}}(f^{*})+\mathcal{L}_{\mathcal{D}'}(f^{*})+\mathcal{L}_{\mathcal{D}}(f_{\mathrm{opt}})+5\epsilon$, with probability at least $1-\delta$, as desired.

Completeness. For the completeness criterion, we assume that the test marginal is equal to the training marginal. Then, by Lemma 3.8 (where we observe that any $(\psi,C,\ell)$-hypercontractive distribution is also $(\phi,C,\ell)$-hypercontractive), with probability at least $1-\delta$, we have that for all $\bm{a}\in\mathbb{R}^{2m}$, $\bm{a}^{\top}\hat{\Phi}'\bm{a}\leq\frac{1+(\rho/4)}{1-(\rho/4)}\bm{a}^{\top}\hat{\Phi}\bm{a}\leq(1+\rho)\bm{a}^{\top}\hat{\Phi}\bm{a}$, because $\mathbb{E}[\hat{\Phi}]=\mathbb{E}[\hat{\Phi}']$ and the matrices are sums of independent samples of $\phi(\bm{x})\phi(\bm{x})^{\top}$, where $\bm{x}\sim\mathcal{D}_{\bm{x}}$. It is crucial here that $\phi$ (which, recall, is formed using $S_{\mathrm{ref}},S_{\mathrm{ref}}'$) does not depend on the verification samples $S_{\mathrm{ver}}$ and $S'_{\mathrm{ver}}$, which is why we chose them to be fresh. Therefore, the test will accept with probability at least $1-\delta$.

Efficient Implementation. To compute $\hat{\bm{a}}$, we may run a least-squares program, in time polynomial in $m$. For the spectral tester, we first compute the SVD of $\hat{\Phi}$ and check that any vector in the kernel of $\hat{\Phi}$ is also in the kernel of $\hat{\Phi}'$ (this can be checked without computing the SVD of $\hat{\Phi}'$); otherwise, we reject. Then, let $\hat{\Phi}^{\frac{\dagger}{2}}$ be the root of the Moore-Penrose pseudoinverse of $\hat{\Phi}$ and find the maximum singular value of the matrix $\hat{\Phi}^{\frac{\dagger}{2}}\hat{\Phi}'\hat{\Phi}^{\frac{\dagger}{2}}$. If the value is higher than $1+\rho$, we reject. Note that this is equivalent to solving the eigenvalue problem described in Algorithm 1. ∎
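A minimal sketch of this spectral test (our own illustrative implementation of the steps just described, with a generic PSD matrix standing in for $\hat{\Phi}$) is:

```python
import numpy as np

def spectral_shift_test(Phi_hat, Phi_hat_prime, rho, tol=1e-9):
    """Accept iff  a^T Phi_hat' a <= (1 + rho) * a^T Phi_hat a  for all a.

    Follows the implementation sketch above: reject if ker(Phi_hat) is not contained in
    ker(Phi_hat'), then compare against the top eigenvalue of Phi^{+/2} Phi' Phi^{+/2}.
    """
    # Eigendecomposition of the (symmetric PSD) training matrix.
    evals, evecs = np.linalg.eigh(Phi_hat)
    null = evecs[:, evals <= tol]                     # basis of the (numerical) kernel of Phi_hat
    if null.size and np.linalg.norm(Phi_hat_prime @ null) > tol:
        return False                                  # some direction is null for train but not for test
    # Root of the Moore-Penrose pseudoinverse, restricted to the non-null eigendirections.
    pos, V = evals[evals > tol], evecs[:, evals > tol]
    P_half = V @ np.diag(pos ** -0.5) @ V.T
    top = np.max(np.linalg.eigvalsh(P_half @ Phi_hat_prime @ P_half))
    return top <= 1.0 + rho

# With Phi_hat' == Phi_hat the test accepts for any rho >= 0.
A = np.random.default_rng(5).standard_normal((6, 6))
Phi = A @ A.T
print(spectral_shift_test(Phi, Phi, rho=0.01))
```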

3.2 Applications

Having obtained a general theorem for TDS learning under Assumption 3.5, we can now instantiate it to obtain TDS learning algorithms for learning neural networks with Lipschitz activations. In particular, we recover all of the bounds of [GKKT17], using the additional assumption that the training distribution is hypercontractive in the following standard sense.

Definition 3.9 (Hypercontractivity).

We say that $\mathcal{D}$ is $C$-hypercontractive if for all polynomials $p$ of degree $\ell$ and all $t\in\mathbb{N}$, we have that

\[\mathbb{E}_{\bm{x}\sim\mathcal{D}}\bigl[p(\bm{x})^{2t}\bigr]\leq(Ct)^{2\ell t}\bigl(\mathbb{E}_{\bm{x}\sim\mathcal{D}}\bigl[p(\bm{x})^{2}\bigr]\bigr)^{t}.\]

Note that many common distributions, such as log-concave distributions or the uniform distribution over the hypercube, are known to be hypercontractive for some constant $C$ (see [CW01] and [O'D14]).

Function Class | Degree ($\ell$) | Representation Bound ($B$) | Kernel Bound ($A$)
Sigmoid Nets | $O(RW^{t-2}(t\log(\frac{W}{\epsilon}))^{t-1}\log R)$ | $2^{\ell}\cdot W^{\tilde{O}(Wt\log(\frac{1}{\epsilon}))^{t-2}}$ | $(2R)^{2^{t}\ell}$
$L$-Lipschitz Nets | $O((WL)^{t-1}Rk\sqrt{k}/\epsilon)$ | $(k+\ell)^{O(\ell)}$ | $R^{O(\ell)}$

Table 2: We instantiate the parameters relevant to Assumption 3.5 for Sigmoid and Lipschitz Nets. We have: (1) $t$ denotes a bound on the depth of the network, (2) $W$ is a bound on the sum of network weights in all layers other than the first, (3) $(\epsilon,B)$ and radius $R$ are the approximate representation parameters, (4) $k$ is the number of hidden units in the first layer. The kernel function can be evaluated in time $\mathrm{poly}(d,\ell)$. For each of the classes, we assume that the maximum two-norm of any row of the matrix corresponding to the weights of the first layer is bounded by $1$. The kernel we use is the composed multinomial kernel $\mathsf{MK}_{\bm{\ell}}^{(t)}$ with an appropriately chosen degree vector $\bm{\ell}$. Here, $\ell$ equals the product of the entries of $\bm{\ell}$. Any $C$-hypercontractive distribution is also $(\mathsf{MK}_{\bm{\ell}}^{(t)},C,\ell)$-hypercontractive for $\ell$ as specified in the table. For the case of $k=1$, the bound $B$ in the second row can be improved to $2^{O(\ell)}$.

In Table 2, we provide bounds on the parameters in Assumption 3.5 for sigmoid networks and $L$-Lipschitz networks; the proofs are postponed to Appendix C (see Theorems C.17, C.19 and C.14). Combining the bounds from Table 2 with Theorem 3.6, we obtain the results of the middle column of Table 1.

4 Unbounded Distributions

We showed that the kernel method provides runtime improvements for TDS learning, because it can be used to obtain a concise reference feature map, whose spectral properties on the test distribution are all we need to check to certify low test error. A similar approach would not provide any runtime improvements for the case of unbounded distributions, because the dimension of the reference feature space would not be significantly smaller than the dimension of the multinomial feature expansion. Therefore, we can follow the standard moment-matching testing approach commonly used in TDS learning [KSV24b] and testable agnostic learning [RV23, GKK23].

4.1 Additional Preliminaries

We define the notion of subspace juntas, namely, functions that only depend on a low-dimensional projection of their input vector.

Definition 4.1 (Subspace Junta).

A function $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is a $k$-subspace junta (where $k\leq d$) if there exists $W\in\mathbb{R}^{k\times d}$ with $\|W\|_{2}=1$ and $WW^{\top}=I_{k}$, and a function $g:\mathbb{R}^{k}\rightarrow\mathbb{R}$, such that

\[f(\bm{x})=f_{W}(\bm{x})=g(W\bm{x})\quad\forall\bm{x}\in\mathbb{R}^{d}.\]

Note that taking k=dk=d and W=IdW=I_{d} recovers all functions f:df:\mathbb{R}^{d}\rightarrow\mathbb{R}.
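To make Definition 4.1 concrete, here is a minimal Python sketch (the weights, widths, and activation are hypothetical choices, not taken from the paper) of a one-hidden-layer network written explicitly in the form f(x) = g(Wx), so that it is a k-subspace junta with k equal to the number of first-layer hidden units.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 10, 3                      # ambient dimension and number of first-layer units (hypothetical)

# Rows of W are orthonormal, so that W W^T = I_k as required by Definition 4.1.
Q, _ = np.linalg.qr(rng.standard_normal((d, k)))
W = Q.T                           # shape (k, d)

a = rng.standard_normal(k)        # second-layer weights (hypothetical)

def g(z):
    """Low-dimensional part: a one-hidden-layer sigmoid net on R^k."""
    return a @ (1.0 / (1.0 + np.exp(-z)))

def f(x):
    """The k-subspace junta f(x) = g(Wx) on R^d."""
    return g(W @ x)

x = rng.standard_normal(d)
# f depends on x only through its projection Wx:
x_perturbed = x + 0.5 * (np.eye(d) - W.T @ W) @ rng.standard_normal(d)  # move x within the nullspace of W
print(np.isclose(f(x), f(x_perturbed)))   # True: perturbations orthogonal to the rows of W do not change f
```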

Note that neural networks are kk-subspace juntas, where kk is the number of neurons in the first hidden layer. We also define the following notion of a uniform polynomial approximation within a ball of a certain radius.

Definition 4.2 ((ϵ,R)(\epsilon,R)-Uniform Approximation).

For ϵ>0,R1,\epsilon>0,R\geq 1, and g:kg:\mathbb{R}^{k}\rightarrow\mathbb{R}, we say that q:kq:\mathbb{R}^{k}\rightarrow\mathbb{R} is an (ϵ,R)(\epsilon,R)-uniform approximation polynomial for gg if

|q(𝒙)g(𝒙)|ϵx2R.|q({\bm{x}})-g({\bm{x}})|\leq\epsilon\quad\forall\left\|x\right\|_{2}\leq R.

We obtain the following corollary which gives the analogous bound on the (ϵ,R)(\epsilon,R)-uniform approximation to a kk-subspace junta, given the (ϵ,R)(\epsilon,R)-uniform approximation to the corresponding function gg.

Corollary 4.3.

Let ϵ>0,R1\epsilon>0,R\geq 1, and let f:df:\mathbb{R}^{d}\rightarrow\mathbb{R} be a kk-subspace junta with corresponding matrix Wk×dW\in\mathbb{R}^{k\times d} and function g:kg:\mathbb{R}^{k}\rightarrow\mathbb{R}, so that f(𝒙)=g(W𝒙)f({\bm{x}})=g(W{\bm{x}}). Let q:kq:\mathbb{R}^{k}\rightarrow\mathbb{R} be an (ϵ,R)(\epsilon,R)-uniform approximation polynomial for gg, and define p:dp:\mathbb{R}^{d}\rightarrow\mathbb{R} as p(𝒙):=q(W𝒙)p({\bm{x}}):=q(W{\bm{x}}). Then |p(𝒙)f(𝒙)|ϵ|p({\bm{x}})-f({\bm{x}})|\leq\epsilon for all 𝒙{\bm{x}} with W𝒙2R\|W{\bm{x}}\|_{2}\leq R.

Finally, we consider any distribution with strictly subexponential tails in every direction, which we define as follows.

Definition 4.4 (Strictly Sub-exponential Distribution).

A distribution 𝒟{\mathcal{D}} on d\mathbb{R}^{d} is γ\gamma-strictly subexponential for some γ(0,1]\gamma\in(0,1] if there exists a constant C>0C>0 such that for all 𝒘d{\bm{w}}\in\mathbb{R}^{d} with 𝒘=1\left\|{\bm{w}}\right\|=1 and all t0t\geq 0,

𝐏𝐫𝒙𝒟[|𝒘,𝒙|>t]eCt1+γ.\mathbf{Pr}_{{\bm{x}}\sim{\mathcal{D}}}[|\langle{\bm{w}},{\bm{x}}\rangle|>t]\leq e^{-Ct^{1+\gamma}}.

4.2 TDS Regression via Uniform Approximation

We will now present our main results on TDS regression for unbounded training marginals. We require the following assumptions.

Assumption 4.5.

For a function class {d}{\mathcal{F}}\subseteq\{\mathbb{R}^{d}\to\mathbb{R}\} consisting of kk-subspace juntas, and training and test distributions 𝒟,𝒟{\mathcal{D}},{\mathcal{D}}^{\prime} over d×\mathbb{R}^{d}\times\mathbb{R}, we assume the following.

  1. For ff\in{\mathcal{F}}, there exists an (ϵ,R)(\epsilon,R)-uniform approximation polynomial for ff with degree at most =RlogRg(ϵ)\ell=R\log R\cdot g_{\mathcal{F}}(\epsilon), where g(ϵ)g_{\mathcal{F}}(\epsilon) is a function depending only on the class {\mathcal{F}} and ϵ\epsilon.

  2. For ff\in{\mathcal{F}}, the value rf:=supW𝒙2R|f(x)|r_{f}:=\sup_{\left\|W{\bm{x}}\right\|_{2}\leq R}|f(x)| is bounded by a constant r>0r>0.

  3. The training marginal 𝒟𝒙{\mathcal{D}}_{{\bm{x}}} is a γ\gamma-strictly subexponential distribution for γ(0,1]\gamma\in(0,1].

  4. The training and test labels are both bounded in [M,M][-M,M] for some M1M\geq 1.

Consider the function class {\mathcal{F}}, and the parameters ϵ,γ,M,k,{\epsilon},\gamma,M,k,\ell as defined in the assumption above and let δ(0,1)\delta\in(0,1). Then, we obtain the following theorem.

Theorem 4.6 (TDS Learning via Uniform Approximation).

Under 4.5, Algorithm 2 learns the class {\mathcal{F}} in the TDS regression setting up to excess error 5ϵ5\epsilon and probability of failure δ\delta. The time complexity is poly(ds,1/ϵ,log(1/δ))\mathrm{poly}(d^{s},{1}/{\epsilon},\log(1/\delta)^{\ell}) where s=(log(M/ϵ))O(1/γ)s=(\ell\log(M/\epsilon))^{O({1}/{\gamma})}.

Input: Parameters ϵ>0,δ(0,1){\epsilon}>0,\delta\in(0,1), R1R\geq 1, M1M\geq 1, and sample access to 𝒟,𝒟𝒙{\mathcal{D}},{\mathcal{D}}_{{\bm{x}}}^{\prime}
Set ϵ=ϵ/11{\epsilon}^{\prime}=\epsilon/11, δ=δ/5\delta^{\prime}=\delta/5, =RlogRg(ϵ)\ell=R\log R\cdot g_{\mathcal{F}}(\epsilon), t=2log(2Mϵ)t=2\log\left(\frac{2M}{\epsilon^{\prime}}\right), B=r(2(k+))3B=r(2(k+\ell))^{3\ell}, Δ=ϵ24B2d2t\Delta=\frac{\epsilon^{\prime 2}}{4B^{2}d^{2\ell t}}
Set mtrain=mtest=poly(M,ln(1/δ),1/ϵ,d,r)m_{\mathrm{train}}=m_{\mathrm{test}}=\mathrm{poly}(M,\ln(1/\delta)^{\ell},1/\epsilon,d^{\ell},r) and draw mtrainm_{\mathrm{train}} i.i.d. labeled examples SS from 𝒟{\mathcal{D}} and mtestm_{\mathrm{test}} i.i.d. unlabeled examples SS^{\prime} from 𝒟𝒙{\mathcal{D}}_{{\bm{x}}}^{\prime}.
For each αd\alpha\in\mathbb{N}^{d} with α12max(,t)\|\alpha\|_{1}\leq 2\max(\ell,t), compute the quantity M^α=𝔼𝒙S[𝒙α]=𝔼𝒙S[i[d]xiαi]\widehat{\mathrm{M}}_{\alpha}=\mathbb{E}_{{\bm{x}}\sim S^{\prime}}[{\bm{x}}^{\alpha}]=\mathbb{E}_{{\bm{x}}\sim S^{\prime}}\left[\prod_{i\in[d]}x_{i}^{\alpha_{i}}\right]
Reject and terminate if |M^α𝔼𝒙𝒟𝒙[𝒙α]|>Δ|\widehat{\mathrm{M}}_{\alpha}-\mathbb{E}_{{\bm{x}}\sim{\mathcal{D}}_{{\bm{x}}}}[{\bm{x}}^{\alpha}]|>\Delta for some α\alpha with α12max(,t)\|\alpha\|_{1}\leq 2\max(\ell,t).
Otherwise, solve the following least squares problem on SS up to error ϵ{\epsilon}^{\prime}
minp\displaystyle\min_{p}\quad 𝔼(𝒙,y)S[(yp(𝒙))2]\displaystyle\mathbb{E}_{({\bm{x}},y)\sim S}\left[(y-p({\bm{x}}))^{2}\right]
s.t. p is a polynomial with degree at most \displaystyle p\text{ is a polynomial with degree at most }\ell
each coefficient of p is absolutely bounded by B\displaystyle\text{each coefficient of }p\text{ is absolutely bounded by }B
Let p^\hat{p} be an ϵ2{\epsilon}^{\prime 2}-approximate solution to the above optimization problem.
Accept and output clM(p^(𝒙))\mathrm{cl}_{M}(\hat{p}({\bm{x}})).
Algorithm 2 TDS Regression via Uniform Approximation
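Below is a hedged Python sketch of Algorithm 2 under 4.5. The feature construction, the box-constrained solver, and all parameter choices (the degree \ell, the moment order tt, the thresholds Δ\Delta and BB) are simplified stand-ins for the quantities set above, so this illustrates the moment-matching test plus constrained polynomial regression rather than reproducing the paper's exact guarantees.

```python
import itertools
import numpy as np
from scipy.optimize import lsq_linear

def monomial_features(X, degree):
    """All monomials x^alpha with |alpha|_1 <= degree (dense; practical only for small d and degree)."""
    n, d = X.shape
    alphas = [a for a in itertools.product(range(degree + 1), repeat=d) if sum(a) <= degree]
    feats = np.column_stack([np.prod(X ** np.array(a), axis=1) for a in alphas])
    return feats, alphas

def tds_regression(X_train, y_train, X_test, train_moments, ell, t, B, Delta, M):
    """Moment-matching test on the test sample, then coefficient-bounded least squares on the train sample.

    train_moments: dict mapping each multi-index alpha (|alpha|_1 <= 2*max(ell, t))
                   to E_{x ~ D_x}[x^alpha], assumed known exactly as in Algorithm 2.
    Returns ("reject", None) or ("accept", predictor).
    """
    # Step 1: reject if some low-degree empirical test moment deviates from the training moment.
    F_test, alphas = monomial_features(X_test, 2 * max(ell, t))
    for j, a in enumerate(alphas):
        if abs(F_test[:, j].mean() - train_moments[a]) > Delta:
            return "reject", None

    # Step 2: least squares over degree-<=ell polynomials with coefficients absolutely bounded by B.
    F_train, _ = monomial_features(X_train, ell)
    res = lsq_linear(F_train, y_train, bounds=(-B, B))   # box-constrained least squares
    coef = res.x

    def predictor(X):
        F, _ = monomial_features(X, ell)
        return np.clip(F @ coef, -M, M)                   # output cl_M(p_hat(x))

    return "accept", predictor
```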

Note that 4.5 involves a low-degree uniform approximation assumption, which only holds within some bounded-radius ball. Since we work under unbounded distributions, we also need to handle the errors outside the ball. To this end, we use the following lemma, which follows from results in [BDBGK18].

Lemma 4.7.

Suppose f=fWf=f_{W} and qq satisfy parts 1 and 2 of 4.5, and let p(𝒙):=q(W𝒙)p({\bm{x}}):=q(W{\bm{x}}). Then

|p(𝒙)|(k)O()W𝒙2, for all W𝒙2R.|p({\bm{x}})|\leq(k\ell)^{O(\ell)}\left\|{W{\bm{x}}}\right\|_{2}^{\ell}\text{, for all }\left\|W{\bm{x}}\right\|_{2}\geq R.

The lemma above gives a bound on the values of a low-degree uniform approximator outside the region of approximation. Therefore, we can hope to control the error of approximation outside this region by taking advantage of the tails of the training distribution, as well as by picking RR sufficiently large. In order for the strictly subexponential tails to suffice, the quantitative dependence of \ell on RR is important. This is why we assume (see 4.5) that =O~(R)\ell=\tilde{O}(R). In particular, in order to bound the quantity 𝔼𝒙𝒟𝒙[p2(𝒙)𝟙{W𝒙2R}]\mathbb{E}_{{\bm{x}}\sim{\mathcal{D}}_{{\bm{x}}}}[p^{2}({\bm{x}})\mathbbm{1}\{\|W{\bm{x}}\|_{2}\geq R\}], we use Lemma 4.7, the Cauchy–Schwarz inequality, and the bounds 𝔼𝒙𝒟𝒙[W𝒙24](k)O()\mathbb{E}_{{\bm{x}}\sim{\mathcal{D}}_{{\bm{x}}}}[\left\|{W{\bm{x}}}\right\|_{2}^{4\ell}]\leq(k\ell)^{O(\ell)} and 𝐏𝐫𝒙𝒟𝒙[W𝒙2R]exp(Ω(R/k)1+γ)\mathbf{Pr}_{{\bm{x}}\sim{\mathcal{D}}_{{\bm{x}}}}[\|W{\bm{x}}\|_{2}\geq R]\leq\exp(-\Omega(R/k)^{1+\gamma}). Substituting for =O~(R)\ell=\tilde{O}(R), we observe that the overall bound on the quantity 𝔼𝒙𝒟𝒙[p2(𝒙)𝟙{W𝒙2R}]\mathbb{E}_{{\bm{x}}\sim{\mathcal{D}}_{{\bm{x}}}}[p^{2}({\bm{x}})\mathbbm{1}\{\|W{\bm{x}}\|_{2}\geq R\}] decays with RR, whenever γ\gamma is strictly positive. Therefore, the overall bound can be made arbitrarily small with an appropriate choice of RR (and therefore \ell).

Apart from the careful manipulations described above, the proof of Theorem 4.6 follows along the lines of the corresponding results for TDS learning via sandwiching polynomials [KSV24b].

The following lemma allows us to relate the squared loss between two polynomials over a finite sample SS and under 𝒟{\mathcal{D}}, as long as we have a bound on the coefficients of the polynomials. This will be convenient in the analysis of the algorithm.

Lemma 4.8 (Transfer Lemma for Square Loss, see [KSV24b]).

Let 𝒟{\mathcal{D}} be a distribution over d\mathbb{R}^{d} and SS be a set of points in d\mathbb{R}^{d}. If |𝔼𝐱S[𝐱α]𝔼x𝒟[𝐱α]|Δ|\mathbb{E}_{{\bm{x}}\sim S}[{\bm{x}}^{\alpha}]-\mathbb{E}_{x\sim{\mathcal{D}}}[{\bm{x}}^{\alpha}]|\leq\Delta for all αd\alpha\in\mathbb{N}^{d} with α12\left\|\alpha\right\|_{1}\leq 2\ell, then for any degree \ell polynomials p1,p2p_{1},p_{2} with coefficients absolutely bounded by BB, it holds that

|𝔼𝒙S[(p1(𝒙)p2(𝒙))2]𝔼x𝒟[(p1(𝒙)p2(𝒙))2]|4B2d2Δ\left|\mathbb{E}_{{\bm{x}}\sim S}[(p_{1}({\bm{x}})-p_{2}({\bm{x}}))^{2}]-\mathbb{E}_{x\sim{\mathcal{D}}}[(p_{1}({\bm{x}})-p_{2}({\bm{x}}))^{2}]\right|\leq 4B^{2}d^{2\ell}\Delta
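The next snippet is a small Python sketch of the quantity controlled by Lemma 4.8, with hypothetical dimension, degree, coefficient bound, and sample sizes; a very large auxiliary sample stands in for the distribution 𝒟{\mathcal{D}}. It compares the empirical and "population" squared-loss gap between two bounded-coefficient polynomials against the moment-deviation bound of the lemma.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
d, ell, B = 4, 2, 1.0                     # hypothetical dimension, degree, and coefficient bound
alphas = [a for a in itertools.product(range(2 * ell + 1), repeat=d) if sum(a) <= 2 * ell]
deg_alphas = [a for a in alphas if sum(a) <= ell]

def moments(X):
    return np.array([np.prod(X ** np.array(a), axis=1).mean() for a in alphas])

def sq_loss_gap(X, Y, c1, c2):
    diff_coef = c1 - c2
    def poly_diff(Z):
        feats = np.column_stack([np.prod(Z ** np.array(a), axis=1) for a in deg_alphas])
        return feats @ diff_coef
    return abs(np.mean(poly_diff(X) ** 2) - np.mean(poly_diff(Y) ** 2))

c1 = rng.uniform(-B, B, size=len(deg_alphas))   # coefficients of p1, absolutely bounded by B
c2 = rng.uniform(-B, B, size=len(deg_alphas))   # coefficients of p2, absolutely bounded by B

X_pop = rng.standard_normal((500_000, d))        # large sample standing in for D
S = rng.standard_normal((500, d))                # small sample S

Delta = np.max(np.abs(moments(S) - moments(X_pop)))   # max moment deviation over |alpha|_1 <= 2*ell
gap = sq_loss_gap(S, X_pop, c1, c2)                   # | E_S[(p1 - p2)^2] - E_D[(p1 - p2)^2] |
print(gap <= 4 * B ** 2 * d ** (2 * ell) * Delta)     # the inequality of Lemma 4.8
```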

We are now ready to prove Theorem 4.6.

Proof of Theorem 4.6.

We will prove soundness and completeness separately.

Soundness. Suppose the algorithm accepts and outputs clM(p^)\mathrm{cl}_{M}(\hat{p}). Let f=argminf[𝒟(f)+𝒟(f)]f^{*}=\arg\min_{f\in{\mathcal{F}}}[{\mathcal{L}}_{{\mathcal{D}}}(f)+{\mathcal{L}}_{{\mathcal{D}}^{\prime}}(f)] and fopt=argminf[𝒟(f)]f_{\mathrm{opt}}=\arg\min_{f\in{\mathcal{F}}}[{\mathcal{L}}_{{\mathcal{D}}}(f)]. By the uniform approximation assumption in 4.5, there are polynomials p,poptp^{*},p_{\mathrm{opt}} which are (ϵ,R)(\epsilon,R)-uniform approximations for ff^{*} and foptf_{\mathrm{opt}}, respectively. Let ff^{*} and foptf_{\mathrm{opt}} have the corresponding matrices W,Woptk×dW^{*},W_{\mathrm{opt}}\in\mathbb{R}^{k\times d}, respectively. Denote λtrain=𝒟(f)\lambda_{\mathrm{train}}={\mathcal{L}}_{{\mathcal{D}}}(f^{*}) and λtest=𝒟(f)\lambda_{\mathrm{test}}={\mathcal{L}}_{{\mathcal{D}}^{\prime}}(f^{*}). Note that for any f,g:df,g:\mathbb{R}^{d}\rightarrow\mathbb{R}, clipping both functions does not increase their squared distance under any distribution, i.e. clM(f)clM(g)𝒟fg𝒟\left\|\mathrm{cl}_{M}(f)-\mathrm{cl}_{M}(g)\right\|_{{\mathcal{D}}}\leq\left\|f-g\right\|_{{\mathcal{D}}}, which can be seen through casework on 𝒙{\bm{x}}, depending on whether f(𝒙),g(𝒙)f({\bm{x}}),g({\bm{x}}) lie in [M,M][-M,M] or not; hence, we may “unclip” to obtain upper bounds. Recalling that the training and test labels are bounded, we can use this fact as we bound the error of the hypothesis on 𝒟{\mathcal{D}}^{\prime}.

𝒟(clM(p^))\displaystyle{\mathcal{L}}_{{\mathcal{D}}^{\prime}}(\mathrm{cl}_{M}(\hat{p})) 𝒟(clM(f))+clM(f)clM(p^)𝒟\displaystyle\leq{\mathcal{L}}_{{\mathcal{D}}^{\prime}}(\mathrm{cl}_{M}(f^{*}))+\left\|\mathrm{cl}_{M}(f^{*})-\mathrm{cl}_{M}(\hat{p})\right\|_{{\mathcal{D}}^{\prime}}
𝒟(f)+clM(f)clM(p^)S+ϵ.\displaystyle\leq{\mathcal{L}}_{{\mathcal{D}}^{\prime}}(f^{*})+\left\|\mathrm{cl}_{M}(f^{*})-\mathrm{cl}_{M}(\hat{p})\right\|_{S^{\prime}}+\epsilon^{\prime}.

The second inequality follows from unclipping the first term and from applying Hoeffding’s inequality, which guarantees that for mtest8M4ln(2/δ)ϵ4m_{\mathrm{test}}\geq\frac{8M^{4}\ln(2/\delta^{\prime})}{\epsilon^{\prime 4}}, the population norm in the second term exceeds its empirical counterpart over SS^{\prime} by at most ϵ\epsilon^{\prime} with probability 1δ\geq 1-\delta^{\prime}. Proceeding with more unclipping and using the triangle inequality:

𝒟(clM(p^))\displaystyle{\mathcal{L}}_{{\mathcal{D}}^{\prime}}(\mathrm{cl}_{M}(\hat{p})) λtest+clM(f)clM(p)S+clM(p)clM(p^)S+ϵ\displaystyle\leq\lambda_{\mathrm{test}}+\left\|\mathrm{cl}_{M}(f^{*})-\mathrm{cl}_{M}(p^{*})\right\|_{S^{\prime}}+\left\|\mathrm{cl}_{M}(p^{*})-\mathrm{cl}_{M}(\hat{p})\right\|_{S^{\prime}}+\epsilon^{\prime}
λtest+clM(f)clM(p)S+pp^S+ϵ.\displaystyle\leq\lambda_{\mathrm{test}}+\left\|\mathrm{cl}_{M}(f^{*})-\mathrm{cl}_{M}(p^{*})\right\|_{S^{\prime}}+\left\|p^{*}-\hat{p}\right\|_{S^{\prime}}+\epsilon^{\prime}. (4.1)

We first bound clM(f)clM(p)S=𝔼𝒙S[(clM(f(𝒙))clM(p(𝒙)))2]\left\|\mathrm{cl}_{M}(f^{*})-\mathrm{cl}_{M}(p^{*})\right\|_{S^{\prime}}=\sqrt{\mathbb{E}_{{\bm{x}}\sim S^{\prime}}[(\mathrm{cl}_{M}(f^{*}({\bm{x}}))-\mathrm{cl}_{M}(p^{*}({\bm{x}})))^{2}]}. Since p(𝒙)p^{*}({\bm{x}}) is an (ϵ,R)(\epsilon,R)-uniform approximation to f(𝒙)f^{*}({\bm{x}}), we separately consider when we fall in the region of good approximation (W𝒙R\left\|W^{*}{\bm{x}}\right\|\leq R) or not.

𝔼𝒙S\displaystyle\mathbb{E}_{{\bm{x}}\sim S^{\prime}} [(clM(f(𝒙))clM(p(𝒙)))2]\displaystyle[(\mathrm{cl}_{M}(f^{*}({\bm{x}}))-\mathrm{cl}_{M}(p^{*}({\bm{x}})))^{2}]
=𝔼𝒙S[(clM(f(𝒙))clM(p(𝒙)))2𝟙[W𝒙R]\displaystyle=\mathbb{E}_{{\bm{x}}\sim S^{\prime}}[(\mathrm{cl}_{M}(f^{*}({\bm{x}}))-\mathrm{cl}_{M}(p^{*}({\bm{x}})))^{2}\cdot\mathbbm{1}[\left\|W^{*}{\bm{x}}\right\|\leq R]
+𝔼𝒙S[(clM(f(𝒙))clM(p(𝒙)))2𝟙[W𝒙>R]]\displaystyle+\mathbb{E}_{{\bm{x}}\sim S^{\prime}}[(\mathrm{cl}_{M}(f^{*}({\bm{x}}))-\mathrm{cl}_{M}(p^{*}({\bm{x}})))^{2}\cdot\mathbbm{1}[\left\|W^{*}{\bm{x}}\right\|>R]]
ϵ2+𝔼𝒙S[2(clM(f(𝒙))2+clM(p(𝒙))2)𝟙[W𝒙>R]]\displaystyle\leq\epsilon^{2}+\mathbb{E}_{{\bm{x}}\sim S^{\prime}}[2(\mathrm{cl}_{M}(f^{*}({\bm{x}}))^{2}+\mathrm{cl}_{M}(p^{*}({\bm{x}}))^{2})\cdot\mathbbm{1}[\left\|W^{*}{\bm{x}}\right\|>R]]

Then, applying the Cauchy–Schwarz inequality (and arguing similarly for clM(p)\mathrm{cl}_{M}(p^{*})):

𝔼𝒙S[clM(f(𝒙))2𝟙[W𝒙>R]]𝔼𝒙S[clM(f(𝒙))4]𝐏𝐫𝒙S[W𝒙>R]].\mathbb{E}_{{\bm{x}}\sim S^{\prime}}[\mathrm{cl}_{M}(f^{*}({\bm{x}}))^{2}\cdot\mathbbm{1}[\left\|W^{*}{\bm{x}}\right\|>R]]\leq\sqrt{\mathbb{E}_{{\bm{x}}\sim S^{\prime}}[\mathrm{cl}_{M}(f^{*}({\bm{x}}))^{4}]}\cdot\sqrt{\mathbf{Pr}_{{\bm{x}}\sim S^{\prime}}[\left\|W^{*}{\bm{x}}\right\|>R]]}.

By definition, clM(p)2,clM(f)2M2\mathrm{cl}_{M}(p^{*})^{2},\mathrm{cl}_{M}(f^{*})^{2}\leq M^{2}. So it suffices to bound 𝐏𝐫𝒙S[W𝒙>R]]\mathbf{Pr}_{{\bm{x}}\sim S^{\prime}}[\left\|W^{*}{\bm{x}}\right\|>R]], since we now have

𝔼𝒙S[(clM(f(𝒙))clM(p(𝒙)))2]ϵ2+4M2𝐏𝐫𝒙S[W𝒙>R]].\mathbb{E}_{{\bm{x}}\sim S^{\prime}}[(\mathrm{cl}_{M}(f^{*}({\bm{x}}))-\mathrm{cl}_{M}(p^{*}({\bm{x}})))^{2}]\leq\epsilon^{2}+4M^{2}\sqrt{\mathbf{Pr}_{{\bm{x}}\sim S^{\prime}}[\left\|W^{*}{\bm{x}}\right\|>R]]}. (4.2)

In order to bound this probability of the test samples falling outside the region of good approximation, we use the property that the first 2t2t moments of SS^{\prime} are close to the moments of 𝒟{\mathcal{D}} (as tested by the algorithm). Applying Markov’s inequality, we have

𝐏𝐫𝒙S[W𝒙>R]]𝔼𝒙S[W𝒙2t]R2t.\mathbf{Pr}_{{\bm{x}}\sim S^{\prime}}[\left\|W^{*}{\bm{x}}\right\|>R]]\leq\frac{\mathbb{E}_{{\bm{x}}\sim S^{\prime}}[\left\|W^{*}{\bm{x}}\right\|^{2t}]}{R^{2t}}.

Write W𝒙2t=(i=1kWi,𝒙2)t\left\|W^{*}{\bm{x}}\right\|^{2t}=\left(\sum_{i=1}^{k}\langle W_{i}^{*},{\bm{x}}\rangle^{2}\right)^{t}, where i=1kWi,𝒙2=i=1k(j=1dWijxj)2\sum_{i=1}^{k}\langle W_{i}^{*},{\bm{x}}\rangle^{2}=\sum_{i=1}^{k}\left(\sum_{j=1}^{d}W_{ij}^{*}x_{j}\right)^{2} is a degree 22 polynomial with each coefficient bounded in absolute value by 2k2k (noting that since W(W)=IkW^{*}(W^{*})^{\top}=I_{k}, then |Wij|1|W^{*}_{ij}|\leq 1). Let aαa_{\alpha} denote the coefficients of W𝒙2t\left\|W^{*}{\bm{x}}\right\|^{2t}. Applying Lemma C.7, α12t|aα|(2k)td2tdO(t)\sum_{\left\|\alpha\right\|_{1}\leq 2t}|a_{\alpha}|\leq(2k)^{t}d^{2t}\leq d^{O(t)}. By linearity of expectation, we also have |𝔼𝒙S[W𝒙2t]𝔼x𝒟[W𝒙2t]|α12t|aα|ΔdO(t)Δϵ|\mathbb{E}_{{\bm{x}}\sim S^{\prime}}[\left\|W^{*}{\bm{x}}\right\|^{2t}]-\mathbb{E}_{x\sim{\mathcal{D}}}[\left\|W^{*}{\bm{x}}\right\|^{2t}]|\leq\sum_{\left\|\alpha\right\|_{1}\leq 2t}|a_{\alpha}|\cdot\Delta\leq d^{O(t)}\cdot\Delta\leq\epsilon^{\prime}, since ΔϵdΩ(t)\Delta\leq\epsilon^{\prime}\cdot d^{-\Omega(t)}. Since 𝒟{\mathcal{D}} is γ\gamma-strictly subexponential, by B.1 we have 𝔼x𝒟[Wi,𝒙2t](2Ct)2t1+γ\mathbb{E}_{x\sim{\mathcal{D}}}[\langle W_{i}^{*},{\bm{x}}\rangle^{2t}]\leq(2C^{\prime}t)^{\frac{2t}{1+\gamma}}. Then, we can bound the numerator 𝔼𝒙S[W𝒙2t]𝔼x𝒟[W𝒙2t]+ϵ(Ckt)2t1+γ\mathbb{E}_{{\bm{x}}\sim S^{\prime}}[\left\|W^{*}{\bm{x}}\right\|^{2t}]\leq\mathbb{E}_{x\sim{\mathcal{D}}}[\left\|W^{*}{\bm{x}}\right\|^{2t}]+\epsilon^{\prime}\leq(Ckt)^{\frac{2t}{1+\gamma}} for some large constant CC. So we have that

𝐏𝐫𝒙S[W𝒙>R]](Ckt)2t1+γR2t.\mathbf{Pr}_{{\bm{x}}\sim S^{\prime}}[\left\|W^{*}{\bm{x}}\right\|>R]]\leq\frac{(Ckt)^{\frac{2t}{1+\gamma}}}{R^{2t}}.

Setting tC(log(M/ϵ))t\geq C^{\prime}(\log(M/\epsilon)) and RC(kt)Cklog(M/ϵ)R\geq C^{\prime}(kt)\geq C^{\prime}k\log(M/\epsilon) for large enough CC^{\prime} makes the above probability at most 16ϵ4/M416\epsilon^{\prime 4}/M^{4} so that 4M2𝐏𝐫𝒙S[W𝒙>R]]ϵ24M^{2}\sqrt{\mathbf{Pr}_{{\bm{x}}\sim S^{\prime}}[\left\|W^{*}{\bm{x}}\right\|>R]]}\leq\epsilon^{\prime 2}. Thus, from Equation 4.2, we have that

clM(f)clM(p)Sϵ+ϵ.\left\|\mathrm{cl}_{M}(f^{*})-\mathrm{cl}_{M}(p^{*})\right\|_{S^{\prime}}\leq\epsilon+\epsilon^{\prime}. (4.3)

We now bound the second term clM(p)clM(p^)S\left\|\mathrm{cl}_{M}(p^{*})-\mathrm{cl}_{M}(\hat{p})\right\|_{S^{\prime}}. By Lemma B.2, the first 22\ell moments of SS will concentrate around those of 𝒟𝒙{\mathcal{D}}_{{\bm{x}}} whenever mtrain1Δ2(Cc)48+1(log(20d/δ))4+1m_{\mathrm{train}}\geq\frac{1}{\Delta^{2}}{(Cc)^{4\ell}\ell^{8\ell+1}(\log(20d/\delta))^{4\ell+1}}, and similarly the first 22\ell moments of SS^{\prime} match those of 𝒟𝒙{\mathcal{D}}_{{\bm{x}}} because the algorithm accepted. Applying the transfer lemma (Lemma 4.8) to the polynomials pp^{*} and p^\hat{p} (once for SS^{\prime} and 𝒟𝒙{\mathcal{D}}_{{\bm{x}}}, and once for SS and 𝒟𝒙{\mathcal{D}}_{{\bm{x}}}), along with the triangle inequality, we get:

p(𝒙)p^(𝒙)S\displaystyle\left\|p^{*}({\bm{x}})-\hat{p}({\bm{x}})\right\|_{S^{\prime}} p(𝒙)p^(𝒙)𝒟+4B2d2Δ\displaystyle\leq\left\|p^{*}({\bm{x}})-\hat{p}({\bm{x}})\right\|_{{\mathcal{D}}}+\sqrt{4B^{2}d^{2\ell}\Delta}
p(𝒙)p^(𝒙)S+2ϵ\displaystyle\leq\left\|p^{*}({\bm{x}})-\hat{p}({\bm{x}})\right\|_{S}+2\epsilon^{\prime}
S(p)+S(p^)+2ϵ,\displaystyle\leq{\mathcal{L}}_{S}({p^{*}})+{\mathcal{L}}_{S}(\hat{p})+2\epsilon^{\prime},

where we note that we can bound BB, the sum of the magnitudes of the coefficients, by r(2(k+))3r(2(k+\ell))^{3\ell} using Lemma C.6. Recall that by definition p^\hat{p} is an ϵ2\epsilon^{\prime 2}-approximate solution to the optimization problem in Algorithm 2, so S(p^)S(popt)+ϵ{\mathcal{L}}_{S}(\hat{p})\leq{\mathcal{L}}_{S}(p_{\mathrm{opt}})+\epsilon^{\prime}. Plugging this in, we obtain

p(𝒙)p^(𝒙)S\displaystyle\left\|p^{*}({\bm{x}})-\hat{p}({\bm{x}})\right\|_{S^{\prime}} S(p)+S(popt)+3ϵ\displaystyle\leq{\mathcal{L}}_{S}(p^{*})+{\mathcal{L}}_{S}(p_{\mathrm{opt}})+3\epsilon^{\prime}
pclM(f)S+S(clM(f))\displaystyle\leq\left\|p^{*}-\mathrm{cl}_{M}(f^{*})\right\|_{S}+{\mathcal{L}}_{S}(\mathrm{cl}_{M}(f^{*}))
+popt(𝒙)clM(fopt(𝒙))S+S(clM(fopt))+3ϵ.\displaystyle\quad+\left\|p_{\mathrm{opt}}({\bm{x}})-\mathrm{cl}_{M}(f_{\mathrm{opt}}({\bm{x}}))\right\|_{S}+{\mathcal{L}}_{S}(\mathrm{cl}_{M}(f_{\mathrm{opt}}))+3\epsilon^{\prime}. (4.4)

By applying Hoeffding’s inequality, we get that S(clM(f))clM(f)y𝒟+ϵ{\mathcal{L}}_{S}(\mathrm{cl}_{M}(f^{*}))\leq\left\|\mathrm{cl}_{M}(f^{*})-y\right\|_{\mathcal{D}}+\epsilon^{\prime}, which holds with probability 1δ\geq 1-\delta^{\prime} when mtrain8M4ln(2/δ)ϵ4m_{\mathrm{train}}\geq\frac{8M^{4}\ln(2/\delta^{\prime})}{\epsilon^{\prime 4}}. By unclipping clM(f)\mathrm{cl}_{M}(f^{*}), this is at most λtrain+ϵ\lambda_{\mathrm{train}}+\epsilon^{\prime}. Similarly, with probability 1δ\geq 1-\delta^{\prime}, S(clM(fopt))opt+ϵ{\mathcal{L}}_{S}(\mathrm{cl}_{M}(f_{\mathrm{opt}}))\leq\mathrm{opt}+\epsilon^{\prime}. It remains to bound p(𝒙)clM(f)S\left\|p^{*}({\bm{x}})-\mathrm{cl}_{M}(f^{*})\right\|_{S} and poptclM(fopt(𝒙))S\left\|p_{\mathrm{opt}}-\mathrm{cl}_{M}(f_{\mathrm{opt}}({\bm{x}}))\right\|_{S}. The analysis for both is similar to how we bounded clM(p)clM(f)S\left\|\mathrm{cl}_{M}(p^{*})-\mathrm{cl}_{M}(f^{*})\right\|_{S^{\prime}}, except that, since we do not clip pp^{*} or poptp_{\mathrm{opt}}, we instead take advantage of the bound on p(𝒙)p^{*}({\bm{x}}) when W𝒙>R\left\|W^{*}{\bm{x}}\right\|>R (respectively, on popt(𝒙)p_{\mathrm{opt}}({\bm{x}}) when Wopt𝒙>R\left\|W_{\mathrm{opt}}{\bm{x}}\right\|>R). We show how to bound p(𝒙)clM(f)S\left\|p^{*}({\bm{x}})-\mathrm{cl}_{M}(f^{*})\right\|_{S}:

𝔼𝒙S[(clM(f(𝒙))p(𝒙))2]\displaystyle\mathbb{E}_{{\bm{x}}\sim S}[(\mathrm{cl}_{M}(f^{*}({\bm{x}}))-p^{*}({\bm{x}}))^{2}] =𝔼𝒙S[(clM(f(𝒙))p(𝒙))2𝟙[W𝒙R]\displaystyle=\mathbb{E}_{{\bm{x}}\sim S}[(\mathrm{cl}_{M}(f^{*}({\bm{x}}))-p^{*}({\bm{x}}))^{2}\cdot\mathbbm{1}[\left\|W^{*}{\bm{x}}\right\|\leq R]
+𝔼𝒙S[(clM(f(𝒙))p(𝒙))2𝟙[W𝒙>R]]\displaystyle+\mathbb{E}_{{\bm{x}}\sim S}[(\mathrm{cl}_{M}(f^{*}({\bm{x}}))-p^{*}({\bm{x}}))^{2}\cdot\mathbbm{1}[\left\|W^{*}{\bm{x}}\right\|>R]]
ϵ2+2𝔼𝒙S[clM(f(𝒙))2𝟙[W𝒙>R]]\displaystyle\leq\epsilon^{2}+2\mathbb{E}_{{\bm{x}}\sim S}[\mathrm{cl}_{M}(f^{*}({\bm{x}}))^{2}\cdot\mathbbm{1}[\left\|W^{*}{\bm{x}}\right\|>R]]
+2𝔼𝒙S[p(𝒙)2𝟙[W𝒙>R]].\displaystyle\hskip 22.0pt+2\mathbb{E}_{{\bm{x}}\sim S}[p^{*}({\bm{x}})^{2}\cdot\mathbbm{1}[\left\|W^{*}{\bm{x}}\right\|>R]]. (4.5)

We can bound the first expectation term, 𝔼𝒙S[clM(f(𝒙))2𝟙[W𝒙>R]]\mathbb{E}_{{\bm{x}}\sim S}[\mathrm{cl}_{M}(f^{*}({\bm{x}}))^{2}\cdot\mathbbm{1}[\left\|W^{*}{\bm{x}}\right\|>R]], with ϵ2/4\epsilon^{\prime 2}/4 since the same analysis holds for bounding 𝔼𝒙S[clM(f(𝒙))2𝟙[W𝒙>R]]\mathbb{E}_{{\bm{x}}\sim S^{\prime}}[\mathrm{cl}_{M}(f^{*}({\bm{x}}))^{2}\cdot\mathbbm{1}[\left\|W^{*}{\bm{x}}\right\|>R]], except instead of matching the first 2t2t moments of SS^{\prime} with 𝒟𝒙{\mathcal{D}}_{{\bm{x}}}, we match the first 22\ell moments of SS with 𝒟𝒙{\mathcal{D}}_{{\bm{x}}}. We use the strictly subexponential tails of 𝒟𝒙{\mathcal{D}}_{{\bm{x}}} to bound the second term. Cauchy-Schwarz gives

𝔼𝒙S[p(𝒙)2𝟙[W𝒙>R]]𝔼𝒙S[p(𝒙)4]𝐏𝐫𝒙S[W𝒙>R]]\mathbb{E}_{{\bm{x}}\sim S}[p^{*}({\bm{x}})^{2}\cdot\mathbbm{1}[\left\|W^{*}{\bm{x}}\right\|>R]]\leq\sqrt{\mathbb{E}_{{\bm{x}}\sim S}[p^{*}({\bm{x}})^{4}]\cdot\mathbf{Pr}_{{\bm{x}}\sim S}[\left\|W^{*}{\bm{x}}\right\|>R]]}

Note that by definition of rr and using that pp^{*} is an (ϵ,R)(\epsilon,R)-uniform approximation of ff^{*}, we have |p(𝒙)|(r+ϵ)|p^{*}({\bm{x}})|\leq(r+\epsilon) when W𝒙R\left\|W^{*}{\bm{x}}\right\|\leq R. By Lemma C.6, |p(𝒙)|(r+ϵ)(2k)O()(W𝒙)/R|p^{*}({\bm{x}})|\leq(r+\epsilon)\cdot(2k\ell)^{O(\ell)}\left\|(W^{*}{\bm{x}})/R\right\|^{\ell}. Then, since R1R\geq 1, p(𝒙)4(r+ϵ)4(2k)c1W𝒙4p^{*}({\bm{x}})^{4}\leq(r+\epsilon)^{4}\cdot(2k\ell)^{c_{1}\ell}\left\|W^{*}{\bm{x}}\right\|^{4\ell} for a sufficiently large constant c1>0c_{1}>0. Then we have

𝔼𝒙S[p(𝒙)4]\displaystyle\mathbb{E}_{{\bm{x}}\sim S}[p^{*}({\bm{x}})^{4}] (r+ϵ)4(2k)c1𝔼𝒙S[W𝒙4]\displaystyle\leq(r+\epsilon)^{4}\cdot(2k\ell)^{c_{1}\ell}\cdot\mathbb{E}_{{\bm{x}}\sim S}[\left\|W^{*}{\bm{x}}\right\|^{4\ell}]
(r+ϵ)4(2k)c1(𝔼x𝒟𝒙[W𝒙4]+1)\displaystyle\leq(r+\epsilon)^{4}\cdot(2k\ell)^{c_{1}\ell}\cdot(\mathbb{E}_{x\sim{\mathcal{D}}_{{\bm{x}}}}[\left\|W^{*}{\bm{x}}\right\|^{4\ell}]+1)
(r+ϵ)4(2k)c\displaystyle\leq(r+\epsilon)^{4}\cdot(2k\ell)^{c\ell}

where, using B.1 similarly to above, we bound 𝔼x𝒟𝒙[W𝒙4]k2(4C)41+γ\mathbb{E}_{x\sim{\mathcal{D}}_{{\bm{x}}}}[\left\|W^{*}{\bm{x}}\right\|^{4\ell}]\leq k^{2\ell}(4C^{\prime}\ell)^{\frac{4\ell}{1+\gamma}}, which can be upper bounded by (2k)c2(2k\ell)^{c_{2}\ell} for a sufficiently large constant c2>0c_{2}>0. Take c=c1+c2c=c_{1}+c_{2}. We bound 𝐏𝐫𝒙S[W𝒙>R]\mathbf{Pr}_{{\bm{x}}\sim S}[\left\|W^{*}{\bm{x}}\right\|>R] as follows:

𝐏𝐫𝒙S[W𝒙>R]]\displaystyle\mathbf{Pr}_{{\bm{x}}\sim S}[\left\|W^{*}{\bm{x}}\right\|>R]] =𝐏𝐫𝒙S[i=1kWi,𝒙2>R2]\displaystyle=\mathbf{Pr}_{{\bm{x}}\sim S}\left[\sum_{i=1}^{k}\langle W_{i}^{*},{\bm{x}}\rangle^{2}>R^{2}\right]
i=1k𝐏𝐫𝒙S[Wi,𝒙2>R2/k]\displaystyle\leq\sum_{i=1}^{k}\mathbf{Pr}_{{\bm{x}}\sim S}[\langle W_{i}^{*},{\bm{x}}\rangle^{2}>R^{2}/k]
ksup𝒘2=1𝐏𝐫𝒙S[W,𝒙2>R2/k],\displaystyle\leq k\sup_{\|{\bm{w}}\|_{2}=1}\mathbf{Pr}_{{\bm{x}}\sim S}[\langle W,{\bm{x}}\rangle^{2}>R^{2}/k],

where the first inequality follows from a union bound. Since 𝒘,𝒙2\langle{\bm{w}},{\bm{x}}\rangle^{2} is a degree 22 polynomial, we can view sign(𝒘,𝒙2R2/k)\operatorname{sign}(\langle{\bm{w}},{\bm{x}}\rangle^{2}-R^{2}/k) as a degree-2 PTF. The class of these functions has VC dimension at most d2d^{2} (e.g. by viewing it as the class of halfspaces in d2d^{2} dimensions). Using standard VC arguments, whenever mtrainCd2+log(1/δ)(ϵ′′/k)2m_{\mathrm{train}}\geq C\cdot\frac{d^{2}+\log(1/\delta^{\prime})}{(\epsilon^{\prime\prime}/k)^{2}} for some sufficiently large universal constant C>0C>0, with probability 1δ\geq 1-\delta^{\prime} we have

𝐏𝐫𝒙S[𝒘,𝒙2>R2/k]𝐏𝐫x𝒟𝒙[𝒘,𝒙2>R2/k]+ϵ′′/k.\mathbf{Pr}_{{\bm{x}}\sim S}[\langle{\bm{w}},{\bm{x}}\rangle^{2}>R^{2}/k]\leq\mathbf{Pr}_{x\sim{\mathcal{D}}_{{\bm{x}}}}[\langle{\bm{w}},{\bm{x}}\rangle^{2}>R^{2}/k]+\epsilon^{\prime\prime}/k.

Using the strictly subexponential tails of 𝒟𝒙{\mathcal{D}}_{{\bm{x}}}, we have

𝐏𝐫𝒙S[W𝒙>R]]\displaystyle\mathbf{Pr}_{{\bm{x}}\sim S}[\left\|W^{*}{\bm{x}}\right\|>R]] k(supw=1𝐏𝐫x𝒟𝒙[𝒘,𝒙2>R2/k]+ϵ′′/k)\displaystyle\leq k\left(\sup_{\left\|w\right\|=1}\mathbf{Pr}_{x\sim{\mathcal{D}}_{{\bm{x}}}}[\langle{\bm{w}},{\bm{x}}\rangle^{2}>R^{2}/k]+\epsilon^{\prime\prime}/k\right)
2kexp((R/k)1+γ)+ϵ′′.\displaystyle\leq 2k\cdot\exp\left(-\left(R/k\right)^{1+\gamma}\right)+\epsilon^{\prime\prime}.

Choose ϵ′′=ϵ4(r+ϵ)4(2k)c\epsilon^{\prime\prime}=\frac{\epsilon^{\prime 4}}{(r+\epsilon)^{4}(2k\ell)^{c\ell}}. Putting it together:

𝔼𝒙S[p(𝒙)4]𝐏𝐫𝒙S[W𝒙>R]]\displaystyle\mathbb{E}_{{\bm{x}}\sim S}[p^{*}({\bm{x}})^{4}]\cdot\mathbf{Pr}_{{\bm{x}}\sim S}[\left\|W^{*}{\bm{x}}\right\|>R]] (r+ϵ)4(2k)ce(R/k)1+γ+ϵ4\displaystyle\leq(r+\epsilon)^{4}\cdot(2k\ell)^{c\ell}e^{-(R/k)^{1+\gamma}}+\epsilon^{\prime 4}
(r+ϵ)4exp(clog(2k)(R/k)1+γ)+ϵ4.\displaystyle\leq(r+\epsilon)^{4}\cdot\exp\left(c\ell\log(2k\ell)-(R/k)^{1+\gamma}\right)+\epsilon^{\prime 4}.

We want to bound the first part with ϵ4\epsilon^{\prime 4}. Equivalently, we need to show that the exponent is 4lnϵr+ϵ\leq 4\ln\frac{\epsilon^{\prime}}{r+\epsilon}. Substituting =RlogRg(ϵ)\ell=R\log R\cdot g_{\mathcal{F}}(\epsilon), we get that clog(2k)cg(ϵ)R(logR)2log(2kg(ϵ))c\ell\log(2k\ell)\leq cg_{\mathcal{F}}(\epsilon)R(\log R)^{2}\log(2kg_{\mathcal{F}}(\epsilon)). Thus, it suffices to show that

(Rk)1+γ\displaystyle\left(\frac{R}{k}\right)^{1+\gamma} cg(ϵ)R(logR)2(2kg(ϵ))4lnϵr+ϵ.\displaystyle\geq cg_{\mathcal{F}}(\epsilon)R(\log R)^{2}(2kg_{\mathcal{F}}(\epsilon))-4\ln{\frac{\epsilon^{\prime}}{r+\epsilon}}.

This is satisfied when Rpoly((kg(ϵ)log(r)log(M/ϵ))1+1γ)R\geq\mathrm{poly}\left(\left(kg_{\mathcal{F}}(\epsilon)\log(r)\log(M/\epsilon)\right)^{1+\frac{1}{\gamma}}\right). Then, we have that

𝔼𝒙S[p(𝒙)2𝟙[W𝒙>R]]ϵ22.\mathbb{E}_{{\bm{x}}\sim S}[p^{*}({\bm{x}})^{2}\cdot\mathbbm{1}[\left\|W^{*}{\bm{x}}\right\|>R]]\leq\epsilon^{\prime 2}\sqrt{2}.

So, plugging this into Eq. 4.5, we have

clM(f)pSϵ2+2ϵ2/4+2ϵ22ϵ+2ϵ.\left\|\mathrm{cl}_{M}(f^{*})-p^{*}\right\|_{S}\leq\sqrt{\epsilon^{2}+2\cdot\epsilon^{\prime 2}/4+2\epsilon^{\prime 2}\sqrt{2}}\leq\epsilon+2\epsilon^{\prime}.

The same argument will also give

clM(fopt(𝒙))popt(𝒙)Sϵ+2ϵ.\left\|\mathrm{cl}_{M}(f_{\mathrm{opt}}({\bm{x}}))-p_{\mathrm{opt}}({\bm{x}})\right\|_{S}\leq\epsilon+2\epsilon^{\prime}.

Combining Eq. 4.3 and the above two bounds into Eq. 4.4, we have from Eq. 4.1 that

𝒟(clM(p^))λ+opt+3ϵ+11ϵλ+opt+4ϵ.{\mathcal{L}}_{{\mathcal{D}}^{\prime}}(\mathrm{cl}_{M}(\hat{p}))\leq\lambda+\mathrm{opt}+3\epsilon+11\epsilon^{\prime}\leq\lambda+\mathrm{opt}+4\epsilon.

The result holds with probability at least 15δ=1δ1-5\delta^{\prime}=1-\delta (taking a union bound over 55 bad events).

Completeness. For completeness, it is sufficient to ensure that mtestNm_{\mathrm{test}}\geq N for NN in Lemma B.2. This is because when 𝒟𝒙=𝒟𝒙{\mathcal{D}}_{{\bm{x}}}={\mathcal{D}}_{{\bm{x}}}^{\prime}, our test samples SS^{\prime} are in fact being drawn from the subexponential distribution 𝒟𝒙{\mathcal{D}}_{{\bm{x}}}. Then the moment concentration of subexponential distributions (Lemma B.2) gives that the empirical moments of SS^{\prime} are close to the moments of 𝒟𝒙{\mathcal{D}}_{{\bm{x}}} with probability 1δ\geq 1-\delta^{\prime}. This is the only condition for acceptance, so when 𝒟𝒙=𝒟𝒙{\mathcal{D}}_{{\bm{x}}}={\mathcal{D}}_{{\bm{x}}}^{\prime}, the probability of acceptance is at least 1δ1-\delta, as required.

Runtime. The runtime of the algorithm is poly(d,mtrain,mtest)\mathrm{poly}(d^{\ell},m_{\mathrm{train}},m_{\mathrm{test}}), where =RlogRg(ϵ)\ell=R\log R\cdot g_{\mathcal{F}}(\epsilon). The two lower bounds on RR required in the proof are satisfied by setting R(kg(ϵ)log(r)log(M/ϵ))O(1γ)R\geq\left(kg_{\mathcal{F}}(\epsilon)\log(r)\log(M/\epsilon)\right)^{O(\frac{1}{\gamma})}. Note that setting mtrain=poly(M,ln(1/δ),1/ϵ,d,r)m_{\mathrm{train}}=\mathrm{poly}(M,\ln(1/\delta)^{\ell},1/\epsilon,d^{\ell},r) satisfies the lower bounds on mtrainm_{\mathrm{train}} required in the proof. For mtestm_{\mathrm{test}} we required that mtest8M4ln(2/δ)ϵ4m_{\mathrm{test}}\geq\frac{8M^{4}\ln(2/\delta^{\prime})}{\epsilon^{\prime 4}} and also mtestNm_{\mathrm{test}}\geq N for NN in Lemma B.2. Both are satisfied by choosing mtest=mtrainm_{\mathrm{test}}=m_{\mathrm{train}}. Putting this all together, we see that the runtime is poly(ds,ln(1/δ),1/ϵ)\mathrm{poly}(d^{s},\ln(1/\delta)^{\ell},1/\epsilon) where s=(kg(ϵ)log(r)log(M/ϵ))O(1/γ)s=\left(kg_{\mathcal{F}}(\epsilon)\log(r)\log(M/\epsilon)\right)^{O(1/\gamma)}. ∎

4.3 Applications

In order to obtain end-to-end results for classes of neural networks (see the rightmost column of Table 1), we need to prove the existence of uniform polynomial approximators whose degree scales almost linearly with respect to the radius of approximation, for the reasons described above. For arbitrary Lipschitz nets (see Theorem C.17), we use a general tool from polynomial approximation theory, the multivariate Jackson’s theorem (Theorem C.9). This gives us a polynomial whose degree scales linearly in RR and polynomially in 1ϵ\frac{1}{\epsilon} and in the number of hidden units kk in the first layer.

For sigmoid nets, a more careful derivation yields improved bounds (see Theorem C.19), which have a poly-logarithmic dependence on 1ϵ\frac{1}{\epsilon}. Our construction involves composing approximators for the activations at each layer. Naively, the degree of this composition would be superlinear in RR. To get around this, we use the key property that the magnitude of the output of a sigmoid network at any intermediate layer is bounded independently of RR; this follows from the fact that the sigmoid is bounded in [0,1][0,1]. Using this, we obtain an approximator with almost-linear degree dependence on RR. For more details see Section C.5.
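To illustrate the notion of an (ϵ,R)-uniform approximation that these constructions produce, the following hedged Python sketch fits Chebyshev polynomials of increasing degree to a single sigmoid on [−R, R] by least squares on a dense grid and reports the empirical uniform error. The radius and the degrees are arbitrary choices for illustration, and the numerical fit is only a stand-in for the explicit constructions of Appendix C.

```python
import numpy as np
from numpy.polynomial import chebyshev as cheb

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

R = 10.0                                   # radius of approximation (hypothetical)
grid = np.linspace(-R, R, 4001)            # dense grid standing in for the sup over [-R, R]
target = sigmoid(grid)

for deg in (5, 10, 20, 40):
    coeffs = cheb.chebfit(grid / R, target, deg)     # fit on [-1, 1] for numerical stability
    approx = cheb.chebval(grid / R, coeffs)
    eps = np.max(np.abs(approx - target))            # empirical (eps, R)-uniform approximation error
    print(f"degree {deg:3d}: sup error on [-R, R] ~ {eps:.2e}")
```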

References

  • [ACM24] Pranjal Awasthi, Corinna Cortes, and Mehryar Mohri. Best-effort adaptation. Annals of Mathematics and Artificial Intelligence, 92(2):393–438, 2024.
  • [AZLL19] Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. Advances in neural information processing systems, 32, 2019.
  • [BCK+07] John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman. Learning bounds for domain adaptation. Advances in neural information processing systems, 20, 2007.
  • [BDBC+10] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine learning, 79:151–175, 2010.
  • [BDBCP06] Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. Advances in neural information processing systems, 19, 2006.
  • [BDBGK18] Shalev Ben-David, Adam Bouland, Ankit Garg, and Robin Kothari. Classical lower bounds from quantum upper bounds. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 339–349. IEEE, 2018.
  • [BG17] Alon Brutzkus and Amir Globerson. Globally optimal gradient descent for a convnet with gaussian inputs. In International conference on machine learning, pages 605–614. PMLR, 2017.
  • [BJW19] Ainesh Bakshi, Rajesh Jayaram, and David P Woodruff. Learning two layer rectified neural networks in polynomial time. In Conference on Learning Theory, pages 195–268. PMLR, 2019.
  • [CDG+23] Sitan Chen, Zehao Dou, Surbhi Goel, Adam Klivans, and Raghu Meka. Learning narrow one-hidden-layer relu networks. In The Thirty Sixth Annual Conference on Learning Theory, pages 5580–5614. PMLR, 2023.
  • [CGKM22] Sitan Chen, Aravind Gollakota, Adam Klivans, and Raghu Meka. Hardness of noise-free learning for two-hidden-layer neural networks. Advances in Neural Information Processing Systems, 35:10709–10724, 2022.
  • [CKK+24] Gautam Chandrasekaran, Adam R Klivans, Vasilis Kontonis, Konstantinos Stavropoulos, and Arsen Vasilyan. Efficient discrepancy testing for learning with distribution shift. arXiv preprint arXiv:2406.09373, 2024.
  • [CKM22] Sitan Chen, Adam R Klivans, and Raghu Meka. Learning deep relu networks is fixed-parameter tractable. In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pages 696–707. IEEE, 2022.
  • [CW01] Anthony Carbery and James Wright. Distributional and lq norm inequalities for polynomials over convex bodies in rn. Mathematical research letters, 8(3):233–248, 2001.
  • [DGK+20] Ilias Diakonikolas, Surbhi Goel, Sushrut Karmalkar, Adam R Klivans, and Mahdi Soltanolkotabi. Approximation schemes for relu regression. In Conference on learning theory, pages 1452–1485. PMLR, 2020.
  • [DK24] Ilias Diakonikolas and Daniel M Kane. Efficiently learning one-hidden-layer relu networks via schurpolynomials. In The Thirty Seventh Annual Conference on Learning Theory, pages 1364–1378. PMLR, 2024.
  • [DKK+23] Ilias Diakonikolas, Daniel Kane, Vasilis Kontonis, Sihan Liu, and Nikos Zarifis. Efficient testable learning of halfspaces with adversarial label noise. Advances in Neural Information Processing Systems, 36, 2023.
  • [DKKZ20] Ilias Diakonikolas, Daniel M Kane, Vasilis Kontonis, and Nikos Zarifis. Algorithms and sq lower bounds for pac learning one-hidden-layer relu networks. In Conference on Learning Theory, pages 1514–1539. PMLR, 2020.
  • [DKLZ24] Ilias Diakonikolas, Daniel Kane, Sihan Liu, and Nikos Zarifis. Testable learning of general halfspaces with adversarial label noise. In The Thirty Seventh Annual Conference on Learning Theory, pages 1308–1335. PMLR, 2024.
  • [DKTZ22] Ilias Diakonikolas, Vasilis Kontonis, Christos Tzamos, and Nikos Zarifis. Learning a single neuron with adversarial label noise via gradient descent. In Conference on Learning Theory, pages 4313–4361. PMLR, 2022.
  • [DKZ20] Ilias Diakonikolas, Daniel Kane, and Nikos Zarifis. Near-optimal sq lower bounds for agnostically learning halfspaces and relus under gaussian marginals. Advances in Neural Information Processing Systems, 33:13586–13596, 2020.
  • [DLLP10] Shai Ben David, Tyler Lu, Teresa Luu, and Dávid Pál. Impossibility theorems for domain adaptation. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 129–136. JMLR Workshop and Conference Proceedings, 2010.
  • [DLT18] Simon S Du, Jason D Lee, and Yuandong Tian. When is a convolutional filter easy to learn? In 6th International Conference on Learning Representations, ICLR 2018, 2018.
  • [Fer14] Dietmar Ferger. Optimal constants in the marcinkiewicz–zygmund inequalities. Statistics & Probability Letters, 84:96–101, 2014.
  • [GGJ+20] Surbhi Goel, Aravind Gollakota, Zhihan Jin, Sushrut Karmalkar, and Adam Klivans. Superpolynomial lower bounds for learning one-layer neural networks using gradient descent. In International Conference on Machine Learning, pages 3587–3596. PMLR, 2020.
  • [GGK20] Surbhi Goel, Aravind Gollakota, and Adam Klivans. Statistical-query lower bounds via functional gradients. Advances in Neural Information Processing Systems, 33:2147–2158, 2020.
  • [GGKS24] Aravind Gollakota, Parikshit Gopalan, Adam Klivans, and Konstantinos Stavropoulos. Agnostically learning single-index models using omnipredictors. Advances in Neural Information Processing Systems, 36, 2024.
  • [GK19] Surbhi Goel and Adam R Klivans. Learning neural networks with two nonlinear layers in polynomial time. In Conference on Learning Theory, pages 1470–1499. PMLR, 2019.
  • [GKK23] Aravind Gollakota, Adam R Klivans, and Pravesh K Kothari. A moment-matching approach to testable learning and a new characterization of rademacher complexity. Proceedings of the fifty-fifth annual ACM Symposium on Theory of Computing, 2023.
  • [GKKM20] Shafi Goldwasser, Adam Tauman Kalai, Yael Kalai, and Omar Montasser. Beyond perturbations: Learning guarantees with arbitrary adversarial test examples. Advances in Neural Information Processing Systems, 33:15859–15870, 2020.
  • [GKKT17] Surbhi Goel, Varun Kanade, Adam Klivans, and Justin Thaler. Reliably learning the relu in polynomial time. In Satyen Kale and Ohad Shamir, editors, Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 1004–1042. PMLR, 07–10 Jul 2017.
  • [GKLW19] Rong Ge, Rohith Kuditipudi, Zhize Li, and Xiang Wang. Learning two-layer neural networks with symmetric inputs. In International Conference on Learning Representations, 2019.
  • [GKM18] Surbhi Goel, Adam Klivans, and Raghu Meka. Learning one convolutional layer with overlapping patches. In International conference on machine learning, pages 1783–1791. PMLR, 2018.
  • [GKSV24a] Aravind Gollakota, Adam Klivans, Konstantinos Stavropoulos, and Arsen Vasilyan. Tester-learners for halfspaces: Universal algorithms. Advances in Neural Information Processing Systems, 36, 2024.
  • [GKSV24b] Aravind Gollakota, Adam R Klivans, Konstantinos Stavropoulos, and Arsen Vasilyan. An efficient tester-learner for halfspaces. The Twelfth International Conference on Learning Representations, 2024.
  • [GLM18] Rong Ge, Jason D Lee, and Tengyu Ma. Learning one-hidden-layer neural networks with landscape design. In 6th International Conference on Learning Representations, ICLR 2018, 2018.
  • [GMOV19] Weihao Gao, Ashok V Makkuva, Sewoong Oh, and Pramod Viswanath. Learning one-hidden-layer neural networks under general input distributions. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1950–1959. PMLR, 2019.
  • [GSSV24] Surbhi Goel, Abhishek Shetty, Konstantinos Stavropoulos, and Arsen Vasilyan. Tolerant algorithms for learning with arbitrary covariate shift. arXiv preprint arXiv:2406.02742, 2024.
  • [HK19] Steve Hanneke and Samory Kpotufe. On the value of target data in transfer learning. Advances in Neural Information Processing Systems, 32, 2019.
  • [HK24] Steve Hanneke and Samory Kpotufe. A more unified theory of transfer learning. arXiv preprint arXiv:2408.16189, 2024.
  • [JSA15] Majid Janzamin, Hanie Sedghi, and Anima Anandkumar. Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods. arXiv preprint arXiv:1506.08473, 2015.
  • [Kea98] Michael Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM (JACM), 45(6):983–1006, 1998.
  • [KKSK11] Sham M Kakade, Varun Kanade, Ohad Shamir, and Adam Kalai. Efficient learning of generalized linear and single index models with isotonic regression. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 24. Curran Associates, Inc., 2011.
  • [KSV24a] Adam R Klivans, Konstantinos Stavropoulos, and Arsen Vasilyan. Learning intersections of halfspaces with distribution shift: Improved algorithms and sq lower bounds. The Thirty Seventh Annual Conference on Learning Theory, 2024.
  • [KSV24b] Adam R Klivans, Konstantinos Stavropoulos, and Arsen Vasilyan. Testable learning with distribution shift. The Thirty Seventh Annual Conference on Learning Theory, 2024.
  • [KZZ24] Alkis Kalavasis, Ilias Zadik, and Manolis Zampetakis. Transfer learning beyond bounded density ratios. arXiv preprint arXiv:2403.11963, 2024.
  • [LHL21] Qi Lei, Wei Hu, and Jason Lee. Near-optimal linear regression under distribution shift. In International Conference on Machine Learning, pages 6164–6174. PMLR, 2021.
  • [LMZ20] Yuanzhi Li, Tengyu Ma, and Hongyang R Zhang. Learning over-parametrized two-layer neural networks beyond ntk. In Conference on learning theory, pages 2613–2682. PMLR, 2020.
  • [LSSS14] Roi Livni, Shai Shalev-Shwartz, and Ohad Shamir. On the computational efficiency of training neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1, NIPS’14, page 855–863, Cambridge, MA, USA, 2014. MIT Press.
  • [LY17] Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with relu activation. Advances in neural information processing systems, 30, 2017.
  • [Mer09] James Mercer. Functions of positive and negative type, and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society A, 209:415–446, 1909.
  • [MKFAS20] Mohammadreza Mousavi Kalan, Zalan Fabian, Salman Avestimehr, and Mahdi Soltanolkotabi. Minimax lower bounds for transfer learning with linear and one-hidden layer neural networks. Advances in Neural Information Processing Systems, 33:1959–1969, 2020.
  • [MMR09] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In Proceedings of The 22nd Annual Conference on Learning Theory (COLT 2009), Montréal, Canada, 2009.
  • [MR18] Pasin Manurangsi and Daniel Reichman. The computational complexity of training relu (s). arXiv preprint arXiv:1810.04207, 2018.
  • [MRT18] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT press, second edition, 2018.
  • [NS64] D. J. Newman and H. S. Shapiro. Jackson’s Theorem in Higher Dimensions, pages 208–219. Springer Basel, Basel, 1964.
  • [O’D14] Ryan O’Donnell. Analysis of boolean functions. Cambridge University Press, 2014.
  • [RMH+20] Ievgen Redko, Emilie Morvant, Amaury Habrard, Marc Sebban, and Younès Bennani. A survey on domain adaptation theory: learning bounds and theoretical guarantees. arXiv preprint arXiv:2004.11829, 2020.
  • [RV23] Ronitt Rubinfeld and Arsen Vasilyan. Testing distributional assumptions of learning algorithms. Proceedings of the fifty-fifth annual ACM Symposium on Theory of Computing, 2023.
  • [STW24] Lucas Slot, Stefan Tiegel, and Manuel Wiedmer. Testably learning polynomial threshold functions. arXiv preprint arXiv:2406.06106, 2024.
  • [Tia17] Yuandong Tian. An analytical formula of population gradient for two-layered relu network and its applications in convergence and critical point analysis. In International conference on machine learning, pages 3404–3413. PMLR, 2017.
  • [Ver18] Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.
  • [VW19] Santosh Vempala and John Wilmes. Gradient descent for one-hidden-layer neural networks: Polynomial convergence and sq lower bounds. In Conference on Learning Theory, pages 3115–3117. PMLR, 2019.
  • [WZDD23] Puqian Wang, Nikos Zarifis, Ilias Diakonikolas, and Jelena Diakonikolas. Robustly learning a single neuron via sharpness. In International Conference on Machine Learning, pages 36541–36577. PMLR, 2023.
  • [ZLJ16a] Yuchen Zhang, Jason D Lee, and Michael I Jordan. l1-regularized neural networks are improperly learnable in polynomial time. In International Conference on Machine Learning, pages 993–1001. PMLR, 2016.
  • [ZLJ16b] Yuchen Zhang, Jason D. Lee, and Michael I. Jordan. L1-regularized neural networks are improperly learnable in polynomial time. In Maria Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 993–1001, New York, New York, USA, 20–22 Jun 2016. PMLR.
  • [ZSJ+17] Kai Zhong, Zhao Song, Prateek Jain, Peter L Bartlett, and Inderjit S Dhillon. Recovery guarantees for one-hidden-layer neural networks. In International conference on machine learning, pages 4140–4149. PMLR, 2017.
  • [ZYWG19] Xiao Zhang, Yaodong Yu, Lingxiao Wang, and Quanquan Gu. Learning one-hidden-layer relu networks via gradient descent. In The 22nd international conference on artificial intelligence and statistics, pages 1524–1534. PMLR, 2019.

Appendix A Proof of Multiplicative Spectral Concentration Lemma

Here, we restate and prove the multiplicative spectral concentration lemma (Lemma 3.8).

Lemma A.1 (Multiplicative Spectral Concentration, Lemma B.1 in [GSSV24], modified).

Let 𝒟𝐱{\mathcal{D}}_{{\bm{x}}} be a distribution over d\mathbb{R}^{d} and ϕ:dm\phi:\mathbb{R}^{d}\to\mathbb{R}^{m} such that 𝒟𝐱{\mathcal{D}}_{{\bm{x}}} is (ϕ,C,)(\phi,C,\ell)-hypercontractive for some C,1C,\ell\geq 1. Suppose that SS consists of NN i.i.d. examples from 𝒟𝐱{\mathcal{D}}_{{\bm{x}}} and let Φ=𝔼𝐱𝒟𝐱[ϕ(𝐱)ϕ(𝐱)]\Phi=\mathbb{E}_{{\bm{x}}\sim{\mathcal{D}}_{{\bm{x}}}}[\phi({\bm{x}})\phi({\bm{x}})^{\top}], and Φ^=1N𝐱Sϕ(𝐱)ϕ(𝐱)\hat{\Phi}=\frac{1}{N}\sum_{{\bm{x}}\in S}\phi({\bm{x}})\phi({\bm{x}})^{\top}. For any ϵ,δ(0,1){\epsilon},\delta\in(0,1), if N64Cm2ϵ2(4Clog2(4δ))4+1N\geq\frac{64Cm^{2}}{{\epsilon}^{2}}(4C\log_{2}(\frac{4}{\delta}))^{4\ell+1}, then with probability at least 1δ1-\delta, we have that

For any 𝒂m:𝒂Φ^𝒂[(1ϵ)𝒂Φ𝒂,(1+ϵ)𝒂Φ𝒂]\text{For any }{\bm{a}}\in\mathbb{R}^{m}:{\bm{a}}^{\top}\hat{\Phi}{\bm{a}}\in[{(1-{\epsilon})}{\bm{a}}^{\top}\Phi{\bm{a}},(1+{\epsilon}){\bm{a}}^{\top}\Phi{\bm{a}}]
Proof of Lemma 3.8.

Let Φ=UDU\Phi=UDU^{\top} be the compact SVD of Φ\Phi (i.e., DD is square with dimension equal to the rank of Φ\Phi and UU is not necessarily square). Note that such a decomposition exists (where the row and column spaces are both spanned by the same basis UU), because Φ=Φ\Phi=\Phi^{\top}, by definition. Moreover, note that UUTUU^{T} is an orthogonal projection matrix that projects points in m\mathbb{R}^{m} on the span of the rows of Φ\Phi. We also have that, UU=IU^{\top}U=I.

Consider Φ=UD1U\Phi^{\dagger}=UD^{-1}U^{\top} and Φ2=UD12U\Phi^{\frac{\dagger}{2}}=UD^{-\frac{1}{2}}U^{\top}. Our proof consists of two parts. We first show that it is sufficient to prove that Φ2ΦΦ2Φ2Φ^Φ22ϵ\|\Phi^{\frac{\dagger}{2}}\Phi\Phi^{\frac{\dagger}{2}}-\Phi^{\frac{\dagger}{2}}\hat{\Phi}\Phi^{\frac{\dagger}{2}}\|_{2}\leq{\epsilon} with probability at least 1δ1-\delta and then we give a bound on the probability of this event.

Claim.

Suppose that for 𝐀=Φ2ΦΦ2Φ2Φ^Φ2{\bm{A}}=\Phi^{\frac{\dagger}{2}}\Phi\Phi^{\frac{\dagger}{2}}-\Phi^{\frac{\dagger}{2}}\hat{\Phi}\Phi^{\frac{\dagger}{2}} we have 𝐀2ϵ\|{\bm{A}}\|_{2}\leq{\epsilon}. Then, for any 𝐚m{\bm{a}}\in\mathbb{R}^{m}:

𝒂Φ^𝒂[(1ϵ)𝒂Φ𝒂,(1+ϵ)𝒂Φ𝒂]{\bm{a}}^{\top}\hat{\Phi}{\bm{a}}\in[{(1-{\epsilon})}{\bm{a}}^{\top}\Phi{\bm{a}},(1+{\epsilon}){\bm{a}}^{\top}\Phi{\bm{a}}]
Proof.

Let 𝒂m{\bm{a}}\in\mathbb{R}^{m}, 𝒂+=UU𝒂{\bm{a}}_{+}=UU^{\top}{\bm{a}}, and 𝒂0=(IUU)𝒂{\bm{a}}_{0}=(I-UU^{\top}){\bm{a}} (i.e., 𝒂=𝒂0+𝒂+{\bm{a}}={\bm{a}}_{0}+{\bm{a}}_{+}, where 𝒂0{\bm{a}}_{0} is the component of 𝒂{\bm{a}} lying in the nullspace of Φ\Phi). We have that 𝒂Φ𝒂=𝒂+Φ𝒂+{\bm{a}}^{\top}\Phi{\bm{a}}={\bm{a}}^{\top}_{+}\Phi{\bm{a}}_{+}.

Moreover, for 𝒂0{\bm{a}}_{0}, we have that 0=𝒂0Φ𝒂0=𝔼𝒙𝒟𝒙[(ϕ(𝒙)𝒂0)2]0={\bm{a}}_{0}^{\top}\Phi{\bm{a}}_{0}=\mathbb{E}_{{\bm{x}}\sim{\mathcal{D}}_{{\bm{x}}}}[(\phi({\bm{x}})^{\top}{\bm{a}}_{0})^{2}] and, hence, ϕ(𝒙)𝒂0=0\phi({\bm{x}})^{\top}{\bm{a}}_{0}=0 almost surely over 𝒟𝒙{\mathcal{D}}_{{\bm{x}}}. Therefore, we also have 𝒂0Φ^𝒂0=1N𝒙S(ϕ(𝒙)𝒂0)2=0{\bm{a}}_{0}^{\top}\hat{\Phi}{\bm{a}}_{0}=\frac{1}{N}\sum_{{\bm{x}}\in S}(\phi({\bm{x}})^{\top}{\bm{a}}_{0})^{2}=0, with probability 11. Therefore, 𝒂Φ^𝒂=𝒂+Φ^𝒂+{\bm{a}}^{\top}\hat{\Phi}{\bm{a}}={\bm{a}}_{+}^{\top}\hat{\Phi}{\bm{a}}_{+}.

Observe, now, that Φ12Φ2=UD12UUD12U=UU\Phi^{\frac{1}{2}}\Phi^{\frac{\dagger}{2}}=UD^{\frac{1}{2}}U^{\top}UD^{-\frac{1}{2}}U^{\top}=UU^{\top} and, hence, Φ12Φ2𝒂+=(UU)2𝒂=UU𝒂=𝒂+\Phi^{\frac{1}{2}}\Phi^{\frac{\dagger}{2}}{\bm{a}}_{+}=(UU^{\top})^{2}{\bm{a}}=UU^{\top}{\bm{a}}={\bm{a}}_{+}, because UUUU^{\top} is a projection matrix. Overall, we obtain the following

𝒂Φ^𝒂\displaystyle{\bm{a}}^{\top}\hat{\Phi}{\bm{a}} =𝒂Φ𝒂+𝒂+(Φ^Φ)𝒂+\displaystyle={\bm{a}}^{\top}\Phi{\bm{a}}+{\bm{a}}_{+}^{\top}(\hat{\Phi}-\Phi){\bm{a}}_{+}
=𝒂Φ𝒂+𝒂+Φ12(Φ2Φ^Φ2Φ2ΦΦ2)Φ12𝒂+\displaystyle={\bm{a}}^{\top}\Phi{\bm{a}}+{\bm{a}}_{+}^{\top}\Phi^{\frac{1}{2}}(\Phi^{\frac{\dagger}{2}}\hat{\Phi}\Phi^{\frac{\dagger}{2}}-\Phi^{\frac{\dagger}{2}}\Phi\Phi^{\frac{\dagger}{2}})\Phi^{\frac{1}{2}}{\bm{a}}_{+}
=𝒂Φ𝒂+𝒂+Φ12AΦ12𝒂+\displaystyle={\bm{a}}^{\top}\Phi{\bm{a}}+{\bm{a}}_{+}^{\top}\Phi^{\frac{1}{2}}A\Phi^{\frac{1}{2}}{\bm{a}}_{+}

Since 𝑨2ϵ\|{\bm{A}}\|_{2}\leq{\epsilon} and Φ12Φ12=Φ\Phi^{\frac{1}{2}}\Phi^{\frac{1}{2}}=\Phi, we have that |𝒂+Φ12AΦ12𝒂+|ϵ|𝒂+Φ𝒂+|=ϵ|𝒂Φ𝒂||{\bm{a}}_{+}^{\top}\Phi^{\frac{1}{2}}A\Phi^{\frac{1}{2}}{\bm{a}}_{+}|\leq{\epsilon}|{\bm{a}}_{+}^{\top}\Phi{\bm{a}}_{+}|={\epsilon}|{\bm{a}}^{\top}\Phi{\bm{a}}|, which concludes the proof of the claim. ∎

It remains to show that for the matrix 𝑨{\bm{A}} defined in the previous claim, we have 𝑨2ϵ\|{\bm{A}}\|_{2}\leq{\epsilon} with probability at least 1δ1-\delta. The randomness of 𝑨{\bm{A}} depends on the random choice of SS from 𝒟𝒙N{\mathcal{D}}_{{\bm{x}}}^{\otimes N}. In the rest of the proof, therefore, consider all probabilities and expectations to be over S𝒟𝒙NS\sim{\mathcal{D}}_{{\bm{x}}}^{\otimes N}. We have the following for t=log2(4/δ)t=\log_{2}(4/\delta).

𝐏𝐫[𝑨2>ϵ]\displaystyle\mathbf{Pr}[\|{\bm{A}}\|_{2}>{\epsilon}] 𝐏𝐫[𝑨F>ϵ]𝔼[𝑨F2t]ϵ2t\displaystyle\leq\mathbf{Pr}[\|{\bm{A}}\|_{F}>{\epsilon}]\leq\frac{\mathbb{E}[\|{\bm{A}}\|_{F}^{2t}]}{{\epsilon}^{2t}}

We will now bound 𝔼[𝑨F2t]\mathbb{E}[\|{\bm{A}}\|_{F}^{2t}]. To this end, we define 𝒂i=Φ2𝒆im{\bm{a}}_{i}=\Phi^{\frac{\dagger}{2}}{\bm{e}}_{i}\in\mathbb{R}^{m} for i[m]i\in[m]. We have the following, by using Jensen’s inequality appropriately.

𝔼[𝑨F2t]\displaystyle\mathbb{E}[\|{\bm{A}}\|_{F}^{2t}] =𝔼[(i,j[m](𝒂iΦ𝒂j𝒂iΦ^𝒂j)2)t]\displaystyle=\mathbb{E}\Bigr{[}\Bigr{(}\sum_{i,j\in[m]}({\bm{a}}_{i}^{\top}\Phi{\bm{a}}_{j}-{\bm{a}}_{i}^{\top}\hat{\Phi}{\bm{a}}_{j})^{2}\Bigr{)}^{t}\Bigr{]}
m2(t1)i,j[m]𝔼[(𝒂iΦ𝒂j𝒂iΦ^𝒂j)2t]\displaystyle\leq m^{2(t-1)}\sum_{i,j\in[m]}\mathbb{E}[({\bm{a}}_{i}^{\top}\Phi{\bm{a}}_{j}-{\bm{a}}_{i}^{\top}\hat{\Phi}{\bm{a}}_{j})^{2t}]
m2tmaxi,j[m]𝔼[(𝒂iΦ𝒂j𝒂iΦ^𝒂j)2t]\displaystyle\leq m^{2t}\max_{i,j\in[m]}\mathbb{E}[({\bm{a}}_{i}^{\top}\Phi{\bm{a}}_{j}-{\bm{a}}_{i}^{\top}\hat{\Phi}{\bm{a}}_{j})^{2t}]

In order to bound the term above, we may use the Marcinkiewicz–Zygmund inequality (see [Fer14]) to exploit the independence of the samples in SS and obtain the following.

𝔼[(𝒂iΦ𝒂j𝒂iΦ^𝒂j)2t]\displaystyle\mathbb{E}[({\bm{a}}_{i}^{\top}\Phi{\bm{a}}_{j}-{\bm{a}}_{i}^{\top}\hat{\Phi}{\bm{a}}_{j})^{2t}] 2(4t)tNt𝔼𝒙𝒟𝒙[(𝒂iΦ𝒂j𝒂iϕ(𝒙)ϕ(𝒙)𝒂j)2t]\displaystyle\leq\frac{2(4t)^{t}}{N^{t}}\mathbb{E}_{{\bm{x}}\sim{\mathcal{D}}_{{\bm{x}}}}[({\bm{a}}_{i}^{\top}\Phi{\bm{a}}_{j}-{\bm{a}}_{i}^{\top}\phi({\bm{x}})\phi({\bm{x}})^{\top}{\bm{a}}_{j})^{2t}]
2(4t)tNt(22t(𝒂iΦ𝒂j)2t+22t𝔼𝒙𝒟𝒙[(𝒂iϕ(𝒙)ϕ(𝒙)𝒂j)2t])\displaystyle\leq\frac{2(4t)^{t}}{N^{t}}\bigr{(}2^{2t}({\bm{a}}_{i}^{\top}\Phi{\bm{a}}_{j})^{2t}+2^{2t}\mathbb{E}_{{\bm{x}}\sim{\mathcal{D}}_{{\bm{x}}}}[({\bm{a}}_{i}^{\top}\phi({\bm{x}})\phi({\bm{x}})^{\top}{\bm{a}}_{j})^{2t}]\bigr{)}

We now observe that 𝔼𝒙𝒟𝒙[𝒂iϕ(𝒙)ϕ(𝒙)𝒂j]=𝒂iΦ𝒂j=𝒆iΦ2ΦΦ2𝒆j=𝒆iUUT𝒆j\mathbb{E}_{{\bm{x}}\sim{\mathcal{D}}_{{\bm{x}}}}[{\bm{a}}_{i}^{\top}\phi({\bm{x}})\phi({\bm{x}})^{\top}{\bm{a}}_{j}]={\bm{a}}_{i}^{\top}\Phi{\bm{a}}_{j}={\bm{e}}_{i}^{\top}\Phi^{\frac{\dagger}{2}}\Phi\Phi^{\frac{\dagger}{2}}{\bm{e}}_{j}={\bm{e}}_{i}^{\top}UU^{T}{\bm{e}}_{j}, which is at most equal to 11. Therefore, we have 𝔼𝒙𝒟𝒙[(𝒂iϕ(𝒙))2]1\mathbb{E}_{{\bm{x}}\sim{\mathcal{D}}_{{\bm{x}}}}[({\bm{a}}_{i}^{\top}\phi({\bm{x}}))^{2}]\leq 1 and, by the hypercontractivity property (which we assume to be with respect to the standard inner product in m\mathbb{R}^{m}), we have 𝔼𝒙𝒟𝒙[(𝒂iϕ(𝒙))4t](4Ct)4t\mathbb{E}_{{\bm{x}}\sim{\mathcal{D}}_{{\bm{x}}}}[({\bm{a}}_{i}^{\top}\phi({\bm{x}}))^{4t}]\leq(4Ct)^{4\ell t}. We can bound 𝔼𝒙𝒟𝒙[(𝒂iϕ(𝒙)ϕ(𝒙)𝒂j)2t]\mathbb{E}_{{\bm{x}}\sim{\mathcal{D}}_{{\bm{x}}}}[({\bm{a}}_{i}^{\top}\phi({\bm{x}})\phi({\bm{x}})^{\top}{\bm{a}}_{j})^{2t}] by applying the Cauchy-Schwarz inequality and using the bound for 𝔼𝒙𝒟𝒙[(𝒂iϕ(𝒙))4t]\mathbb{E}_{{\bm{x}}\sim{\mathcal{D}}_{{\bm{x}}}}[({\bm{a}}_{i}^{\top}\phi({\bm{x}}))^{4t}]. In total, we have the following bound.

𝐏𝐫[𝑨2>ϵ]4(16m2t(4Ct)4Nϵ2)t\mathbf{Pr}[\|{\bm{A}}\|_{2}>{\epsilon}]\leq 4\Bigr{(}\frac{16m^{2}t(4Ct)^{4\ell}}{N{\epsilon}^{2}}\Bigr{)}^{t}

We choose NN such that 16m2t(4Ct)4Nϵ212\frac{16m^{2}t(4Ct)^{4\ell}}{N{\epsilon}^{2}}\leq\frac{1}{2} and t=log2(4/δ)t=\log_{2}(4/\delta) so that the bound is at most δ\delta. ∎
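For intuition, here is a hedged Python sketch of the quantity driving the proof above: it forms Φ and Φ̂ for a simple quadratic feature map under a Gaussian (the feature map, dimension, and sample sizes are hypothetical choices) and computes the spectral norm of Φ^{†/2}(Φ − Φ̂)Φ^{†/2}, i.e., the matrix 𝑨 whose norm controls the multiplicative guarantee.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4

def phi(X):
    """A simple degree-2 feature map on R^d (hypothetical): [1, x, all pairwise products x_i x_j]."""
    n = X.shape[0]
    quad = np.einsum("ni,nj->nij", X, X).reshape(n, -1)
    return np.hstack([np.ones((n, 1)), X, quad])

def second_moment(F):
    return F.T @ F / F.shape[0]

# "Population" matrix Phi approximated by a very large sample; empirical matrix Phi_hat from N samples.
Phi = second_moment(phi(rng.standard_normal((200_000, d))))
Phi_hat = second_moment(phi(rng.standard_normal((2_000, d))))

# Compact SVD of Phi and the matrix A = Phi^{dagger/2} (Phi - Phi_hat) Phi^{dagger/2}.
U, s, _ = np.linalg.svd(Phi, hermitian=True)
keep = s > 1e-10 * s.max()                      # Phi is rank-deficient here (duplicate quadratic features)
P_half_dag = U[:, keep] @ np.diag(s[keep] ** -0.5) @ U[:, keep].T
A = P_half_dag @ (Phi - Phi_hat) @ P_half_dag

eps = np.linalg.norm(A, 2)
print(f"||A||_2 = {eps:.3f}; then a^T Phi_hat a lies in [(1-eps), (1+eps)] * a^T Phi a for every a")
```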

Appendix B Moment Concentration of Subexponential Distributions

We prove the following bounds on the moments of subexponential distributions, which allow us to control the error outside the region of good approximation.

Fact B.1 (see [Ver18]).

Let 𝒟{\mathcal{D}} on d\mathbb{R}^{d} be a γ\gamma-strictly subexponential distribution. Then there exists a constant CC^{\prime} such that for all 𝒘d{\bm{w}}\in\mathbb{R}^{d} with 𝒘=1\left\|{\bm{w}}\right\|=1 and all p1p\geq 1,

𝔼x𝒟[|𝒘,𝒙|p](Cp)p1+γ.\mathbb{E}_{x\sim{\mathcal{D}}}[|\langle{\bm{w}},{\bm{x}}\rangle|^{p}]\leq(C^{\prime}p)^{\frac{p}{1+\gamma}}.

In fact, the two conditions are equivalent.

We use the following bounds on the concentration of subexponential moments in the analysis of our algorithm. This will be useful in determining the sample complexity NN required for the empirical moments of the sample SS to concentrate around the moments of the training marginal 𝒟𝒙{\mathcal{D}}_{{\bm{x}}}.

Lemma B.2 (Moment Concentration of Subexponential Distributions).

Let 𝒟𝐱{\mathcal{D}}_{{\bm{x}}} be a distribution over d\mathbb{R}^{d} such that for any 𝐰d{\bm{w}}\in\mathbb{R}^{d} with 𝐰2=1\|{\bm{w}}\|_{2}=1 and any tt\in{\mathbb{N}} we have 𝔼𝐱𝒟𝐱[|𝐰𝐱|t](Ct)t\mathbb{E}_{{\bm{x}}\sim{\mathcal{D}}_{{\bm{x}}}}[|{\bm{w}}\cdot{\bm{x}}|^{t}]\leq(Ct)^{t} for some C1C\geq 1. For α=(αi)i[d]d\alpha=(\alpha_{i})_{i\in[d]}\in{\mathbb{N}}^{d}, we denote with 𝐱α{\bm{x}}^{\alpha} the quantity 𝐱α=i=1dxiαi{\bm{x}}^{\alpha}=\prod_{i=1}^{d}x_{i}^{\alpha_{i}}, where 𝐱=(xi)i[d]{\bm{x}}=(x_{i})_{i\in[d]}. Then, for any Δ,δ(0,1)\Delta,\delta\in(0,1), if SS is a set of at least N=1Δ2(Cc)48+1(log(20d/δ))4+1N=\frac{1}{\Delta^{2}}{(Cc)^{4\ell}\ell^{8\ell+1}(\log(20d/\delta))^{4\ell+1}} i.i.d. examples from 𝒟𝐱{\mathcal{D}}_{{\bm{x}}} for some sufficiently large universal constant c2c\geq 2, we have that with probability at least 1δ1-\delta, the following is true.

For any αd with α12 we have |𝔼𝒙S[𝒙α]𝔼𝒙𝒟𝒙[𝒙α]|Δ.\text{For any $\alpha\in{\mathbb{N}}^{d}$ with $\|\alpha\|_{1}\leq 2\ell$ we have }|\mathbb{E}_{{\bm{x}}\sim S}[{\bm{x}}^{\alpha}]-\mathbb{E}_{{\bm{x}}\sim{\mathcal{D}}_{{\bm{x}}}}[{\bm{x}}^{\alpha}]|\leq\Delta.
Proof.

Let α=(αi)i[d]d\alpha=(\alpha_{i})_{i\in[d]}\in{\mathbb{N}}^{d} with α12\|\alpha\|_{1}\leq 2\ell. Consider the random variable X=1N𝒙S𝒙α=1N𝒙Si[d]xiαiX=\frac{1}{N}\sum_{{\bm{x}}\in S}{\bm{x}}^{\alpha}=\frac{1}{N}\sum_{{\bm{x}}\in S}\prod_{i\in[d]}x_{i}^{\alpha_{i}}. We have that 𝔼[X]=𝔼𝒙𝒟𝒙[𝒙α]\mathbb{E}[X]=\mathbb{E}_{{\bm{x}}\sim{\mathcal{D}}_{{\bm{x}}}}[{\bm{x}}^{\alpha}] and also the following.

𝐏𝐫[|X𝔼[X]|>Δ]\displaystyle\mathbf{Pr}[|X-\mathbb{E}[X]|>\Delta] 𝔼[(X𝔼[X])2t]Δ2t\displaystyle\leq\frac{\mathbb{E}[(X-\mathbb{E}[X])^{2t}]}{\Delta^{2t}}
2(4t)t(NΔ2)t𝔼[(𝒙α𝔼[𝒙α])2t]\displaystyle\leq\frac{2(4t)^{t}}{(N\Delta^{2})^{t}}\mathbb{E}[({\bm{x}}^{\alpha}-\mathbb{E}[{\bm{x}}^{\alpha}])^{2t}]

where the last inequality follows from the Marcinkiewicz–Zygmund inequality (see [Fer14]). We have that 𝔼[(𝒙α𝔼[𝒙α])2t]4t𝔼[(𝒙α)2t]\mathbb{E}[({\bm{x}}^{\alpha}-\mathbb{E}[{\bm{x}}^{\alpha}])^{2t}]\leq 4^{t}\mathbb{E}[({\bm{x}}^{\alpha})^{2t}]. Since α12\|\alpha\|_{1}\leq 2\ell, we have that 𝔼[(𝒙α)2t]sup𝒘2=1[𝔼[(𝒘𝒙)4t]](4Ct)4t\mathbb{E}[({\bm{x}}^{\alpha})^{2t}]\leq\sup_{\|{\bm{w}}\|_{2}=1}[\mathbb{E}[({\bm{w}}\cdot{\bm{x}})^{4t\ell}]]\leq(4Ct\ell)^{4t\ell}, which yields the desired result, due to the choice of NN and after a union bound over all the possible choices of α\alpha (at most d2d^{2\ell}). ∎

Appendix C Polynomial Approximations of Neural Networks

In this section we derive the polynomial approximations of neural networks with Lipschitz activations needed to instantiate Theorem 3.6 for bounded distributions and Theorem 4.6 for unbounded distributions.

Recall the definition of a neural network.

Definition C.1 (Neural Network).

Let σ:\sigma:\mathbb{R}\to\mathbb{R} be an activation function with σ(0)1\sigma(0)\leq 1. Let 𝐖=(W(1),W(t))\mathbf{W}=\left(W^{(1)},\ldots W^{(t)}\right) with W(i)si×si1W^{(i)}\in\mathbb{R}^{s_{i}\times s_{i-1}} be the tuple of weight matrices. Here, s0=ds_{0}=d is the input dimension and st=1s_{t}=1. Define recursively the function fi:dsif_{i}:\mathbb{R}^{d}\to\mathbb{R}^{s_{i}} as fi(𝒙)=W(i)σ(fi1(𝒙))f_{i}({\bm{x}})=W^{(i)}\cdot\sigma\bigl{(}f_{i-1}({\bm{x}})\bigr{)} with f1(𝒙)=W(1)𝒙f_{1}({\bm{x}})=W^{(1)}\cdot{\bm{x}}. The function f:df:\mathbb{R}^{d}\to\mathbb{R} computed by the neural network (𝐖,σ)(\mathbf{W},\sigma) is defined as f(𝒙):-ft(𝒙)f({\bm{x}})\coloneq f_{t}({\bm{x}}). We denote 𝐖1=i=2tW(i)1\|\mathbf{W}\|_{1}=\sum_{i=2}^{t}\|W^{(i)}\|_{1}. The depth of this network is tt.
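The recursive definition above can be transcribed directly into code; the following Python sketch (with hypothetical layer sizes and random weights, and the sigmoid as the activation) evaluates f_t(x) exactly as in Definition C.1.

```python
import numpy as np

rng = np.random.default_rng(4)

def sigma(z):
    """Activation applied coordinate-wise; the sigmoid satisfies sigma(0) <= 1."""
    return 1.0 / (1.0 + np.exp(-z))

def network(weights, x):
    """Evaluate f_t(x) for weights W = (W^(1), ..., W^(t)) as in Definition C.1:
    f_1(x) = W^(1) x and f_i(x) = W^(i) sigma(f_{i-1}(x)) for i >= 2."""
    f = weights[0] @ x
    for W in weights[1:]:
        f = W @ sigma(f)
    return f.item()          # s_t = 1, so the output is a scalar

# Hypothetical architecture: d = 6 inputs, hidden widths 4 and 3, scalar output (depth t = 3).
sizes = [6, 4, 3, 1]
weights = [rng.standard_normal((sizes[i + 1], sizes[i])) for i in range(len(sizes) - 1)]

x = rng.standard_normal(6)
print(network(weights, x))
```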

We also introduce some notation and basic facts that will be useful for this section.

C.1 Useful Notation and Facts

Given a univariate function gg on \mathbb{R} and a vector 𝒙=(x1,,xd)d{\bm{x}}=(x_{1},\ldots,x_{d})\in\mathbb{R}^{d}, the vector g(𝒙)dg({\bm{x}})\in\mathbb{R}^{d} is defined as the vector with ithi^{th} co-ordinate equal to g(xi)g(x_{i}). For a matrix Am×nA\in\mathbb{R}^{m\times n}, we use the following notation:

  • A2:-supx2=1Ax2\|A\|_{2}\coloneq\sup_{\|x\|_{2}=1}\|Ax\|_{2},

  • A2:-maxi[m]j=1n(Aij)2\|A\|_{2}^{\infty}\coloneq\sqrt{\max_{i\in[m]}\sum_{j=1}^{n}(A_{ij})^{2}},

  • \|A\|_{1}\coloneq\sum_{(i,j)\in[m]\times[n]}|A_{ij}|.

Fact C.2.

Given a matrix A\in\mathbb{R}^{m\times n}, we have that

  1. 1.

    A2A1\|A\|_{2}\leq\|A\|_{1},

  2. 2.

    A2mA2\|A\|_{2}\leq\sqrt{m}\cdot\|A\|_{2}^{\infty}.

Proof.

We first prove (1). We have that for an 𝒙n{\bm{x}}\in\mathbb{R}^{n} with 𝒙2=1\|{\bm{x}}\|_{2}=1,

A𝒙2i=1m(Ai𝒙)2i=1mj=1n(Aij)2A1\displaystyle\|A{\bm{x}}\|_{2}\leq\sqrt{\sum_{i=1}^{m}(A_{i}\cdot{\bm{x}})^{2}}\leq\sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n}(A_{ij})^{2}}\leq\|A\|_{1}

where the second inequality follows from Cauchy–Schwarz and the last inequality follows from the fact that for any vector {\bm{v}}, \|{\bm{v}}\|_{2}\leq\|{\bm{v}}\|_{1}. We now prove (2). We have that

A𝒙2i=1m(Ai𝒙)2mmaxi[m]j=1n(Aij)2mA2\displaystyle\|A{\bm{x}}\|_{2}\leq\sqrt{\sum_{i=1}^{m}(A_{i}\cdot{\bm{x}})^{2}}\leq\sqrt{m\max_{i\in[m]}\sum_{j=1}^{n}(A_{ij})^{2}}\leq\sqrt{m}\|A\|_{2}^{\infty}

where the second inequality follows from Cauchy–Schwarz and the last inequality is by the definition of \|A\|_{2}^{\infty}. ∎
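As a quick numerical sanity check of Fact C.2 (illustrative only, on a random matrix), one can compare the three norms directly:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(6, 4))

op_norm  = np.linalg.norm(A, 2)                        # ||A||_2 (spectral norm)
one_norm = np.abs(A).sum()                             # ||A||_1 (entrywise sum)
row_norm = np.sqrt((A ** 2).sum(axis=1).max())         # ||A||_2^infty

assert op_norm <= one_norm + 1e-9                          # item (1)
assert op_norm <= np.sqrt(A.shape[0]) * row_norm + 1e-9    # item (2)
print(op_norm, one_norm, np.sqrt(A.shape[0]) * row_norm)
```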

C.2 Results from Approximation Theory

The following are useful facts about the coefficients of approximating polynomials.

Fact C.3 (Lemma 23 from [GKKT17]).

Let pp be a polynomial of degree \ell such that |p(x)|b|p(x)|\leq b for |x|1|x|\leq 1. Then, the sum of squares of all its coefficients is at most b22O()b^{2}\cdot 2^{O(\ell)}.

Lemma C.4.

Let pp be a polynomial of degree \ell such that |p(𝐱)|b|p({\bm{x}})|\leq b for |x|R|x|\leq R. Then, the sum of squares of all its coefficients is at most b22O()b^{2}\cdot 2^{O(\ell)} when R1R\geq 1.

Proof.

Consider q(x)=p(Rx). Clearly, |q(x)|\leq b for all |x|\leq 1. Thus, the sum of squares of the coefficients of q is at most b^{2}\cdot 2^{O(\ell)} by Fact C.3. Now, p(x)=q(x/R), so each coefficient of p equals the corresponding coefficient of q divided by a nonnegative power of R; since R\geq 1, the sum of squares of the coefficients of p is also at most b^{2}\cdot 2^{O(\ell)}. ∎

Fact C.5 ([BDBGK18]).

Let qq be a polynomial with real coefficients on kk variables with degree \ell such that for all 𝐱[0,1]k{\bm{x}}\in[0,1]^{k}, |q(𝐱)|1|q({\bm{x}})|\leq 1. Then the magnitude of any coefficient of qq is at most (2k(k+))(2k\ell(k+\ell))^{\ell} and the sum of the magnitudes of all coefficients of qq is at most (2(k+))3(2(k+\ell))^{3\ell}.

Lemma C.6.

Let qq be a polynomial with real coefficients on kk variables with degree \ell such that for all 𝐱k{\bm{x}}\in\mathbb{R}^{k} with 𝐱2R\left\|{\bm{x}}\right\|_{2}\leq R, |q(𝐱)|b|q({\bm{x}})|\leq b. Then the sum of the magnitudes of all coefficients of qq is at most b(2(k+))3k/2b(2(k+\ell))^{3\ell}k^{\ell/2} for R1R\geq 1.

Proof.

Consider the polynomial h(𝒙)=1/bq(R𝒙/k)h({\bm{x}})=1/b\cdot q(R{\bm{x}}/\sqrt{k}). Then |h(𝒙)|=1/b|q(R𝒙/k)|1|h({\bm{x}})|=1/b\cdot|q(R{\bm{x}}/\sqrt{k})|\leq 1 for 𝒙R/k2R\|{\bm{x}}R/\sqrt{k}\|_{2}\leq R, or equivalently for all x2k\left\|x\right\|_{2}\leq\sqrt{k}. In particular, since the unit cube [0,1]k[0,1]^{k} is contained in the k\sqrt{k} radius ball, then |h(𝒙)|1|h({\bm{x}})|\leq 1 for 𝒙[0,1]k{\bm{x}}\in[0,1]^{k}. By C.5, the sum of the magnitudes of the coefficients of hh is at most (2(k+))3(2(k+\ell))^{3\ell}. Since q(𝒙)=bh(𝒙k/R)q({\bm{x}})=b\cdot h({\bm{x}}\sqrt{k}/R), then the sum of the magnitudes of the coefficients of qq is at most b(2(k+))3k/2b(2(k+\ell))^{3\ell}k^{\ell/2}. ∎

Lemma C.7.

Let p(𝐱)p({\bm{x}}) be a degree \ell polynomial in 𝐱d{\bm{x}}\in\mathbb{R}^{d} such that each coefficient is bounded in absolute value by bb. Then the sum of the magnitudes of the coefficients of p(𝐱)tp({\bm{x}})^{t} is at most btdtb^{t}d^{t\ell}.

Proof.

Note that p({\bm{x}}) has at most d^{\ell} monomials. Expanding p({\bm{x}})^{t} expresses it as a sum of at most d^{t\ell} products, each a product of t monomials of p({\bm{x}}) and hence carrying a coefficient of magnitude at most b^{t}. Summing the magnitudes over all such products gives the bound. ∎

In the following lemma, we bound the magnitude of approximating polynomials for subspace juntas outside the radius of approximation.

Lemma C.8.

Let ϵ>0,R1\epsilon>0,R\geq 1, and f:df:\mathbb{R}^{d}\rightarrow\mathbb{R} be a kk-subspace junta, and consider the corresponding function g(W𝐱)g(W{\bm{x}}). Let q:kq:\mathbb{R}^{k}\rightarrow\mathbb{R} be an (ϵ,R)(\epsilon,R)-uniform approximation polynomial for gg, and define p:dp:\mathbb{R}^{d}\rightarrow\mathbb{R} as p(𝐱):=q(W𝐱)p({\bm{x}}):=q(W{\bm{x}}). Let r:=supW𝐱2R|g(W𝐱)|r:=\sup_{\left\|W{\bm{x}}\right\|_{2}\leq R}|g(W{\bm{x}})|. Then

|p(𝒙)|(r+ϵ)(2(k+))3k/2W𝒙R2W𝒙2R.|p({\bm{x}})|\leq(r+\epsilon)(2(k+\ell))^{3\ell}k^{\ell/2}\left\|\frac{W{\bm{x}}}{R}\right\|_{2}^{\ell}\quad\forall\left\|W{\bm{x}}\right\|_{2}\geq R.
Proof.

Since q(𝒙)q({\bm{x}}) is an (ϵ,R)(\epsilon,R)-uniform approximation for gg, then |q(𝒙)g(𝒙)|ϵ|q({\bm{x}})-g({\bm{x}})|\leq\epsilon for 𝒙2R\left\|{\bm{x}}\right\|_{2}\leq R. Let h(𝒙)=q(R𝒙)h({\bm{x}})=q(R{\bm{x}}). Then |h(𝒙/R)g(𝒙)|ϵ|h({\bm{x}}/R)-g({\bm{x}})|\leq\epsilon for 𝒙2R\left\|{\bm{x}}\right\|_{2}\leq R, and so |h(𝒙/R)|r+ϵ|h({\bm{x}}/R)|\leq r+\epsilon for 𝒙2R\left\|{\bm{x}}\right\|_{2}\leq R, or equivalently |h(𝒙)|r+ϵ|h({\bm{x}})|\leq r+\epsilon for 𝒙21\left\|{\bm{x}}\right\|_{2}\leq 1. Write h(𝒙)=α1hαx1α1xkαkh({\bm{x}})=\sum_{\left\|\alpha\right\|_{1}\leq\ell}h_{\alpha}x_{1}^{\alpha_{1}}\ldots x_{k}^{\alpha_{k}}. By Lemma C.6, α1|hα|(r+ϵ)(2(k+))3k/2\sum_{\left\|\alpha\right\|_{1}\leq\ell}|h_{\alpha}|\leq(r+\epsilon)(2(k+\ell))^{3\ell}\cdot k^{\ell/2}. Then for x21\left\|x\right\|_{2}\geq 1,

|h(𝒙)|\displaystyle|h({\bm{x}})| α1|hα||x1α1xkαk|\displaystyle\leq\sum_{\left\|\alpha\right\|_{1}\leq\ell}|h_{\alpha}||x_{1}^{\alpha_{1}}\ldots x_{k}^{\alpha_{k}}|
α1|hα|𝒙2α1\displaystyle\leq\sum_{\left\|\alpha\right\|_{1}\leq\ell}|h_{\alpha}|\left\|{\bm{x}}\right\|_{2}^{\left\|\alpha\right\|_{1}}
𝒙2α1|hα|,\displaystyle\leq\left\|{\bm{x}}\right\|_{2}^{\ell}\cdot\sum_{\left\|\alpha\right\|_{1}\leq\ell}|h_{\alpha}|,

where the second inequality holds because |xi|𝒙2|x_{i}|\leq\left\|{\bm{x}}\right\|_{2} for all ii, and the last inequality holds because 𝒙2𝒙2α1\left\|{\bm{x}}\right\|_{2}^{\ell}\geq\left\|{\bm{x}}\right\|_{2}^{\left\|\alpha\right\|_{1}} for α1\left\|\alpha\right\|_{1}\leq\ell when 𝒙21\left\|{\bm{x}}\right\|_{2}\geq 1. Then since p(𝒙)=q(W𝒙)=h(W𝒙/R)p({\bm{x}})=q(W{\bm{x}})=h(W{\bm{x}}/R), we have |p(𝒙)|W𝒙R2(r+ϵ)(2(k+))3k/2|p({\bm{x}})|\leq\left\|\frac{W{\bm{x}}}{R}\right\|_{2}^{\ell}(r+\epsilon)(2(k+\ell))^{3\ell}k^{\ell/2} for W𝒙2R\left\|W{\bm{x}}\right\|_{2}\geq R. ∎

The following is an important theorem that we use later to obtain uniform approximators for Lipschitz neural networks.

Theorem C.9 ([NS64]).

Let f:\mathbb{R}^{k}\to\mathbb{R} be a function continuous on the unit ball \{{\bm{x}}\in\mathbb{R}^{k}:\|{\bm{x}}\|_{2}\leq 1\}. Let \omega_{f} be the function defined as \omega_{f}(t)\coloneq\sup_{\|{\bm{x}}\|_{2},\|{\bm{y}}\|_{2}\leq 1,\ \|{\bm{x}}-{\bm{y}}\|_{2}\leq t}|f({\bm{x}})-f({\bm{y}})| for any t\geq 0. Then, there exists a polynomial p of degree \ell such that \sup_{\|{\bm{x}}\|_{2}\leq 1}|f({\bm{x}})-p({\bm{x}})|\leq C\cdot\omega_{f}(k/\ell), where C is a universal constant.

This implies the following corollary.

Corollary C.10.

Let f:kf:\mathbb{R}^{k}\to\mathbb{R} be an LL-Lipschitz function for L0L\geq 0 and let R0R\geq 0. Then, for any ϵ0\epsilon\geq 0, there exists a polynomial pp of degree O(LRk/ϵ)O(LRk/\epsilon) such that pp is an (ϵ,R)(\epsilon,R)-uniform approximation polynomial for ff.

Proof.

Consider the function g({\bm{x}})\coloneq f(R{\bm{x}}). Then g is RL-Lipschitz, and therefore \omega_{g}(t)\leq RLt. Thus, from Theorem C.9, there exists a polynomial q of degree O(LRk/\epsilon) such that \sup_{\|{\bm{y}}\|_{2}\leq 1}|g({\bm{y}})-q({\bm{y}})|\leq\epsilon. Consequently, \sup_{\|{\bm{x}}\|_{2}\leq R}|f({\bm{x}})-q({\bm{x}}/R)|=\sup_{\|{\bm{x}}\|_{2}\leq R}|g({\bm{x}}/R)-q({\bm{x}}/R)|=\sup_{\|{\bm{y}}\|_{2}\leq 1}|g({\bm{y}})-q({\bm{y}})|\leq\epsilon, and hence p({\bm{x}})\coloneq q({\bm{x}}/R) is the required polynomial of degree O(LRk/\epsilon). ∎
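Corollary C.10 can be illustrated numerically in one dimension: a Chebyshev interpolant of degree on the order of LR/\epsilon already drives the uniform error on [-R,R] well below \epsilon for a Lipschitz (but non-smooth) function. This is only a sketch under heuristic constants, not the construction of Theorem C.9.

```python
import numpy as np
from numpy.polynomial import chebyshev as cheb

L, R, eps = 1.0, 5.0, 0.05          # Lipschitz constant, radius, target error
f = np.abs                          # |x| is 1-Lipschitz but not smooth at 0

degree = int(4 * L * R / eps)       # heuristic stand-in for O(LR/eps)
# Interpolate at Chebyshev points of the first kind, rescaled to [-R, R].
nodes = R * np.cos(np.pi * (np.arange(degree + 1) + 0.5) / (degree + 1))
coeffs = cheb.chebfit(nodes / R, f(nodes), degree)

xs = np.linspace(-R, R, 4001)
print("uniform error:", np.max(np.abs(f(xs) - cheb.chebval(xs / R, coeffs))))
```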

C.3 Kernel Representations

We now state and prove facts about Kernel Representations that we require. First, we recall the multinomial kernel from [GKKT17].

Definition C.11.

Consider the mapping \psi_{\ell}:\mathbb{R}^{d}\to\mathbb{R}^{N_{\ell}}, where N_{\ell}=\sum_{j=0}^{\ell}d^{j}, indexed by tuples (i_{1},i_{2},\ldots,i_{j})\in[d]^{j} for j\in\{0,1,\ldots,\ell\} (the empty tuple corresponds to a constant coordinate equal to 1), such that the value of \psi_{\ell}({\bm{x}}) at index (i_{1},i_{2},\ldots,i_{j}) is equal to \prod_{t=1}^{j}x_{i_{t}}. The kernel \mathsf{MK}_{\ell} is defined as

\mathsf{MK}_{\ell}({\bm{x}},{\bm{y}})=\langle\psi_{\ell}({\bm{x}}),\psi_{\ell}({\bm{y}})\rangle=\sum_{j=0}^{\ell}({\bm{x}}\cdot{\bm{y}})^{j}.

We denote the corresponding RKHS as {\mathbb{H}}_{\mathsf{MK}_{\ell}}.
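A minimal sketch of this definition: the kernel value can be computed directly from the inner product {\bm{x}}\cdot{\bm{y}} without ever materializing the feature map \psi_{\ell}; the explicit feature map is built below only for a tiny example to check the identity (the constant, degree-zero coordinate is included, matching the composed kernel of Definition C.13).

```python
import itertools
import numpy as np

def multinomial_kernel(x, y, ell):
    """MK_ell(x, y) = sum_{j=0}^{ell} (x . y)^j."""
    s = float(np.dot(x, y))
    return sum(s ** j for j in range(ell + 1))

def feature_map(x, ell):
    """psi_ell(x): one coordinate per tuple (i_1, ..., i_j), j = 0, ..., ell."""
    feats = [1.0]                                   # empty tuple -> constant 1
    for j in range(1, ell + 1):
        for idx in itertools.product(range(len(x)), repeat=j):
            feats.append(float(np.prod(x[list(idx)])))
    return np.array(feats)

rng = np.random.default_rng(3)
x, y, ell = rng.normal(size=3), rng.normal(size=3), 3
assert np.isclose(feature_map(x, ell) @ feature_map(y, ell),
                  multinomial_kernel(x, y, ell))
```

The point of the identity is exactly the usual kernel trick: algorithms only ever need inner products \langle\psi_{\ell}({\bm{x}}),\psi_{\ell}({\bm{y}})\rangle, which cost \mathrm{poly}(d,\ell) to evaluate even though \psi_{\ell} has d^{\Theta(\ell)} coordinates.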

We now prove that polynomial approximators of subspace juntas can be represented as elements of 𝖬𝖪{\mathbb{H}}_{\mathsf{MK}_{\ell}}.

Lemma C.12.

Let kk\in\mathbb{N} and ϵ,R0\epsilon,R\geq 0. Let f:df:\mathbb{R}^{d}\to\mathbb{R} be a kk-subspace junta such that f(𝐱)=g(W𝐱)f({\bm{x}})=g(W{\bm{x}}) where gg is a function on k\mathbb{R}^{k} and WW is a projection matrix from k×d\mathbb{R}^{k\times d}. Suppose, there exists a polynomial qq of degree \ell such that sup𝐲2R|g(𝐲)q(𝐲)|ϵ\sup_{\|{\bm{y}}\|_{2}\leq R}|g({\bm{y}})-q({\bm{y}})|\leq\epsilon and the sum of squares of coefficients of qq is bounded above by B2B^{2}. Then, ff is (ϵ,B2(k+1))(\epsilon,B^{2}\cdot(k+1)^{\ell})-approximately represented within radius RR with respect to 𝖬𝖪{\mathbb{H}}_{\mathsf{MK}_{\ell}}.

Proof.

We argue that there exists a vector {\bm{v}}\in{\mathbb{H}}_{\mathsf{MK}_{\ell}} such that \langle{\bm{v}},{\bm{v}}\rangle\leq B^{2}(k+1)^{\ell} and |f({\bm{x}})-\langle{\bm{v}},\psi_{\ell}({\bm{x}})\rangle|\leq\epsilon for all \|{\bm{x}}\|_{2}\leq R. Consider the polynomial p of degree \ell defined by p({\bm{x}})=q(W{\bm{x}}); since \|W{\bm{x}}\|_{2}\leq\|{\bm{x}}\|_{2}\leq R for the projection matrix W, we have |f({\bm{x}})-p({\bm{x}})|=|g(W{\bm{x}})-q(W{\bm{x}})|\leq\epsilon whenever \|{\bm{x}}\|_{2}\leq R, so it suffices to show that p({\bm{x}})=\langle{\bm{v}},\psi_{\ell}({\bm{x}})\rangle for some {\bm{v}} with \langle{\bm{v}},{\bm{v}}\rangle\leq B^{2}(k+1)^{\ell}. Write q({\bm{y}})=\sum_{S\in{\mathbb{N}}^{k},|S|\leq\ell}q_{S}\prod_{j=1}^{k}y_{j}^{S_{j}}, where |S|\coloneq\sum_{j=1}^{k}S_{j}. By our assumption on q, we have \sum_{S}q_{S}^{2}\leq B^{2}. Given a multi-index S with |S|=t, for j\in[t] we define S(j)\in[k] to be the unique index m such that \sum_{n=1}^{m-1}S_{n}<j\leq\sum_{n=1}^{m}S_{n}; in other words, {\bm{y}}^{S}=\prod_{j=1}^{t}y_{S(j)}. We now compute the entry of {\bm{v}} indexed by (i_{1},i_{2},\ldots,i_{t}). By expanding the expression for p({\bm{x}}), we obtain that

vi1,,it=|S|=tqSj=1tWS(j),ij.v_{i_{1},\ldots,i_{t}}=\sum_{|S|=t}q_{S}\prod_{j=1}^{t}W_{S(j),i_{j}}.

We are now ready to bound 𝒗,𝒗\langle{\bm{v}},{\bm{v}}\rangle. We have that

\langle{\bm{v}},{\bm{v}}\rangle=\sum_{t=0}^{\ell}\sum_{(i_{1},\ldots,i_{t})\in[d]^{t}}(v_{i_{1},\ldots,i_{t}})^{2}=\sum_{t=0}^{\ell}\sum_{(i_{1},\ldots,i_{t})\in[d]^{t}}\left(\sum_{|S|=t}q_{S}\prod_{j=1}^{t}W_{S(j),i_{j}}\right)^{2}
\leq\sum_{t=0}^{\ell}\sum_{(i_{1},\ldots,i_{t})\in[d]^{t}}\left(\sum_{|S|=t}q^{2}_{S}\right)\left(\sum_{|S|=t}\prod_{j=1}^{t}W^{2}_{S(j),i_{j}}\right)
\leq\sum_{t=0}^{\ell}\left(\sum_{|S|=t}q^{2}_{S}\right)\left(\sum_{|S|=t}\prod_{j=1}^{t}\Big{(}\sum_{i=1}^{d}W^{2}_{S(j),i}\Big{)}\right)\leq\sum_{t=0}^{\ell}\left(\sum_{|S|=t}q^{2}_{S}\right)\cdot(k+1)^{t}
\leq\left(\sum_{|S|\leq\ell}q^{2}_{S}\right)\cdot(k+1)^{\ell}\leq B^{2}\cdot(k+1)^{\ell}.

Here, the first inequality follows from Cauchy–Schwarz and the second follows by rearranging the sums. The third inequality uses that each row of the projection matrix W has norm at most 1, i.e., \sum_{i=1}^{d}W_{m,i}^{2}\leq 1 for every m\in[k], together with the fact that the number of multi-indices S\in{\mathbb{N}}^{k} with |S|=t is at most (k+1)^{t}. The final inequality follows from the fact that the sum of the squares of the coefficients of q is at most B^{2}. ∎

We introduce an extension of the multinomial kernel that will be useful for our application to sigmoid-nets.

Definition C.13 (Composed multinomial kernel).

Let =(1,,t)\bm{\ell}=(\ell_{1},\ldots,\ell_{t}) be a tuple in t\mathbb{N}^{t}. We denote a sequence of mappings ψ(0),ψ(1),,ψ(t)\psi^{(0)}_{\bm{\ell}},\psi^{(1)}_{\bm{\ell}},\ldots,\psi^{(t)}_{\bm{\ell}} on d\mathbb{R}^{d} inductively as follows:

  1. 1.

    ψ(0)(𝒙)=𝒙\psi^{(0)}_{\bm{\ell}}({\bm{x}})={\bm{x}}

  2. 2.

    ψ(i)(𝒙)=ψi(ψ(i1)(𝒙))\psi^{(i)}_{\bm{\ell}}({\bm{x}})=\psi_{\ell_{i}}\left(\psi^{(i-1)}_{\bm{\ell}}({\bm{x}})\right).

Let N(i)N_{\bm{\ell}}^{(i)} denote the number of coordinates in ψ(i)\psi_{\bm{\ell}}^{(i)}. This induces a sequence of kernels 𝖬𝖪(0),𝖬𝖪(1),,𝖬𝖪(t)\mathsf{MK}_{\bm{\ell}}^{(0)},\mathsf{MK}_{\bm{\ell}}^{(1)},\ldots,\mathsf{MK}_{\bm{\ell}}^{(t)} defined as

𝖬𝖪(i)(𝒙,𝒚)=ψ(i)(𝒙),ψ(i)(𝒚)=j=0i(ψ(i1)(𝒙),ψ(i1)(𝒚)j)\mathsf{MK}_{\bm{\ell}}^{(i)}({\bm{x}},{\bm{y}})=\langle\psi_{\bm{\ell}}^{(i)}({\bm{x}}),\psi_{\bm{\ell}}^{(i)}({\bm{y}})\rangle=\sum_{j=0}^{\ell_{i}}\left(\langle\psi_{\bm{\ell}}^{(i-1)}({\bm{x}}),\psi_{\bm{\ell}}^{(i-1)}({\bm{y}})\rangle^{j}\right)

and a corresponding sequence of RKHS denoted by 𝖬𝖪(0),𝖬𝖪(1),𝖬𝖪(t)\mathcal{H}_{\mathsf{MK}_{\bm{\ell}}^{(0)}},\mathcal{H}_{\mathsf{MK}_{\bm{\ell}}^{(1)}},\ldots\mathcal{H}_{\mathsf{MK}_{\bm{\ell}}^{(t)}}.

Observe that the multinomial kernel \mathsf{MK}_{\ell}=\mathsf{MK}^{(1)}_{(\ell)} is an instantiation of the composed multinomial kernel.
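The recursion in Definition C.13 is also how one would evaluate the composed kernel in practice: each level needs only the previous kernel value, never the (exponentially large) feature map, which is the content of item (2) of Lemma C.14 below. A minimal sketch, with an illustrative degree tuple:

```python
import numpy as np

def composed_multinomial_kernel(x, y, degrees):
    """MK_ell^(t)(x, y) for ell = (ell_1, ..., ell_t), computed recursively."""
    k = float(np.dot(x, y))                          # MK^(0)(x, y) = <x, y>
    for ell_i in degrees:
        k = sum(k ** j for j in range(ell_i + 1))    # MK^(i) from MK^(i-1)
    return k

rng = np.random.default_rng(4)
x, y = rng.normal(size=10), rng.normal(size=10)
print(composed_multinomial_kernel(x, y, degrees=(3, 2)))
```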

We now state some properties of the composed multinomial kernel.

Lemma C.14.

Let =(1,,t)\bm{\ell}=(\ell_{1},\ldots,\ell_{t}) be a tuple in t\mathbb{N}^{t} and R0R\geq 0. Then, the following hold:

  1. 1.

    sup𝒙2R𝖬𝖪(t)(𝒙,𝒙)max{1,(2R)2ti=1ti}\sup_{\|{\bm{x}}\|_{2}\leq R}\mathsf{MK}_{\bm{\ell}}^{(t)}({\bm{x}},{\bm{x}})\leq\max\{1,(2R)^{2^{t}\prod_{i=1}^{t}\ell_{i}}\},

  2. 2.

    For any 𝒙,𝒚d{\bm{x}},{\bm{y}}\in\mathbb{R}^{d}, 𝖬𝖪(t)(𝒙,𝒚)\mathsf{MK}_{\bm{\ell}}^{(t)}({\bm{x}},{\bm{y}}) can be computed in time poly(d,(i=1ti))\mathrm{poly}\left(d,(\sum_{i=1}^{t}\ell_{i})\right),

  3. 3.

    For any 𝒗𝖬𝖪(t){\bm{v}}\in\mathcal{H}_{\mathsf{MK}_{\bm{\ell}}^{(t)}} and 𝒙d{\bm{x}}\in\mathbb{R}^{d}, we have 𝒗,ψ(t)(𝒙)\langle{\bm{v}},\psi_{\bm{\ell}}^{(t)}({\bm{x}})\rangle is a polynomial in 𝒙{\bm{x}} of degree i=1ti\prod_{i=1}^{t}\ell_{i}.

Proof.

We assume without loss of generality that R1R\geq 1 as the kernel function is increasing in norm. To prove (1), observe that for any 𝒙{\bm{x}}, we have that

\mathsf{MK}_{\bm{\ell}}^{(i)}({\bm{x}},{\bm{x}})=\sum_{j=0}^{\ell_{i}}\left(\mathsf{MK}_{\bm{\ell}}^{(i-1)}({\bm{x}},{\bm{x}})\right)^{j}\leq(\ell_{i}+1)\cdot\max\left\{1,\mathsf{MK}_{\bm{\ell}}^{(i-1)}({\bm{x}},{\bm{x}})\right\}^{\ell_{i}}\leq 2^{\ell_{i}}\cdot\max\left\{1,\mathsf{MK}_{\bm{\ell}}^{(i-1)}({\bm{x}},{\bm{x}})\right\}^{\ell_{i}}.

We also have that \sup_{\|{\bm{x}}\|_{2}\leq R}\mathsf{MK}_{\bm{\ell}}^{(0)}({\bm{x}},{\bm{x}})=\sup_{\|{\bm{x}}\|_{2}\leq R}{\bm{x}}\cdot{\bm{x}}=R^{2}. Thus, unrolling the recurrence (and using R\geq 1) gives us that \mathsf{MK}_{\bm{\ell}}^{(t)}({\bm{x}},{\bm{x}})\leq 2^{\sum_{i=1}^{t}\prod_{n=i}^{t}\ell_{n}}\cdot R^{2\prod_{i=1}^{t}\ell_{i}}\leq 2^{t\prod_{i=1}^{t}\ell_{i}}\cdot R^{2\prod_{i=1}^{t}\ell_{i}}\leq\max\{1,(2R)^{2^{t}\prod_{i=1}^{t}\ell_{i}}\}.

The runtime bound follows from the fact that \mathsf{MK}_{\bm{\ell}}^{(i)}({\bm{x}},{\bm{y}})=\sum_{j=0}^{\ell_{i}}\left(\mathsf{MK}_{\bm{\ell}}^{(i-1)}({\bm{x}},{\bm{y}})\right)^{j} and thus can be computed from \mathsf{MK}_{\bm{\ell}}^{(i-1)}({\bm{x}},{\bm{y}}) with \ell_{i} additions and exponentiation operations. Recursing gives the final runtime.

The fact that \langle{\bm{v}},\psi_{\bm{\ell}}^{(t)}({\bm{x}})\rangle is a polynomial in {\bm{x}} follows immediately from the fact that the entries of \psi_{\bm{\ell}}^{(t)}({\bm{x}}) arise from (compositions of) the multinomial feature map and hence are themselves polynomials in {\bm{x}}. The degree is at most \prod_{i=1}^{t}\ell_{i}. ∎

We now argue that a distribution that is hypercontractive with respect to polynomials is hypercontractive with respect to the multinomial kernel.

Lemma C.15.

Let 𝒟{\mathcal{D}} be a distribution on d\mathbb{R}^{d} that is CC-hypercontractive for some constant CC. Then, 𝒟{\mathcal{D}} is also (𝖬𝖪(t),C,i=1ti)(\mathsf{MK}_{\bm{\ell}}^{(t)},C,\prod_{i=1}^{t}\ell_{i})-hypercontractive.

Proof.

The proof immediately follows from Definition 3.4 and Lemma C.14(3). ∎

C.4 Nets with Lipschitz activations

We are now ready to prove our theorem about uniform approximators for neural networks with Lipschitz activations. First, we prove that such networks describe a Lipschitz function.

Lemma C.16.

Let f:\mathbb{R}^{d}\to\mathbb{R} be the function computed by a t-layer neural network with L-Lipschitz activation function \sigma and weight matrices \mathbf{W}. Suppose \|\mathbf{W}\|_{1}\leq W for some W\geq 0 and that the first hidden layer has k neurons. Then we have that f is \sqrt{k}\|W^{(1)}\|_{2}^{\infty}(WL)^{t-1}-Lipschitz.

Proof.

First, observe from Fact C.2 that for all 1<i\leq t, \|W^{(i)}\|_{2}\leq W (since \|\mathbf{W}\|_{1}\leq W) and \|W^{(1)}\|_{2}\leq\sqrt{k}\|W^{(1)}\|_{2}^{\infty}. Recall from Definition C.1 that we have the functions f_{1},\ldots,f_{t} where f_{i}({\bm{x}})=W^{(i)}\cdot\sigma\bigl{(}f_{i-1}({\bm{x}})\bigr{)} and f_{1}({\bm{x}})=W^{(1)}\cdot{\bm{x}}. We prove by induction on i that \|f_{i}({\bm{x}})-f_{i}({\bm{x}}+{\bm{u}})\|_{2}\leq\sqrt{k}\|W^{(1)}\|_{2}^{\infty}(WL)^{i-1}\|{\bm{u}}\|_{2}. For the base case, observe that

\|f_{1}({\bm{x}}+{\bm{u}})-f_{1}({\bm{x}})\|_{2}=\|W^{(1)}{\bm{u}}\|_{2}\leq\|W^{(1)}\|_{2}\|{\bm{u}}\|_{2}\leq\sqrt{k}\|W^{(1)}\|_{2}^{\infty}\|{\bm{u}}\|_{2},

where the first inequality follows from the definition of the operator norm and the second from Fact C.2. We now proceed to the inductive step. Assume by induction that \|f_{i}({\bm{x}})-f_{i}({\bm{x}}+{\bm{u}})\|_{2} is at most \sqrt{k}\|W^{(1)}\|_{2}^{\infty}(WL)^{i-1}\|{\bm{u}}\|_{2}. Thus, we have

\|f_{i+1}({\bm{x}}+{\bm{u}})-f_{i+1}({\bm{x}})\|_{2}=\sqrt{\sum_{j=1}^{s_{i+1}}\biggl{(}\langle W^{(i+1)}_{j},\sigma\left(f_{i}({\bm{x}})\right)\rangle-\langle W^{(i+1)}_{j},\sigma\left(f_{i}({\bm{x}}+{\bm{u}})\right)\rangle\biggr{)}^{2}}
W(i+1)2σ(fi(𝒙))σ(fi(𝒙+𝒖))2\displaystyle\leq\|W^{(i+1)}\|_{2}\|\sigma(f_{i}({\bm{x}}))-\sigma(f_{i}({\bm{x}}+{\bm{u}}))\|_{2}
(WL)kW(1)2(WL)i1𝒖2kW(1)2(LW)i𝒖2\displaystyle\leq(WL)\sqrt{k}\|W^{(1)}\|_{2}^{\infty}(WL)^{i-1}\|{\bm{u}}\|_{2}\leq\sqrt{k}\|W^{(1)}\|_{2}^{\infty}(LW)^{i}\|{\bm{u}}\|_{2}

where the first inequality follows from the definition of the operator norm, and the second from \|W^{(i+1)}\|_{2}\leq W, the L-Lipschitzness of \sigma, and the inductive hypothesis. Thus, we get that |f({\bm{x}}+{\bm{u}})-f({\bm{x}})|\leq\|f_{t}({\bm{x}}+{\bm{u}})-f_{t}({\bm{x}})\|_{2}\leq\sqrt{k}\|W^{(1)}\|_{2}^{\infty}(WL)^{t-1}\cdot\|{\bm{u}}\|_{2}. ∎
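One can sanity-check Lemma C.16 numerically by comparing finite differences of a randomly generated network against the bound \sqrt{k}\|W^{(1)}\|_{2}^{\infty}(WL)^{t-1}; the ReLU activation (so L=1) and Gaussian weights below are illustrative choices, not assumptions made anywhere in the lemma.

```python
import numpy as np

rng = np.random.default_rng(5)
d, k = 8, 6
Ws = [rng.normal(size=(k, d)), rng.normal(size=(4, k)), rng.normal(size=(1, 4))]
relu, L = (lambda v: np.maximum(v, 0.0)), 1.0

def f(x):
    z = Ws[0] @ x
    for W in Ws[1:]:
        z = W @ relu(z)
    return z.item()

W1_row = np.sqrt((Ws[0] ** 2).sum(axis=1).max())       # ||W^(1)||_2^infty
W_sum  = sum(np.abs(W).sum() for W in Ws[1:])          # ||W||_1 over layers i >= 2
bound  = np.sqrt(k) * W1_row * (W_sum * L) ** (len(Ws) - 1)

# Empirical Lipschitz estimate over random pairs of nearby points.
est = 0.0
for _ in range(2000):
    x, u = rng.normal(size=d), rng.normal(size=d)
    u *= 1e-3 / np.linalg.norm(u)
    est = max(est, abs(f(x + u) - f(x)) / np.linalg.norm(u))
print(f"empirical Lipschitz estimate {est:.2f} <= bound {bound:.2f}")
```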

We now state our theorem regarding the uniform approximation of Lipschitz nets. We also prove that the approximators can be represented by low-norm vectors in {\mathbb{H}}_{\mathsf{MK}_{\ell}} for an appropriately chosen degree \ell.

Theorem C.17.

Let \epsilon,R\geq 0. Let f:\mathbb{R}^{d}\rightarrow\mathbb{R} be a neural network with an L-Lipschitz activation function \sigma, depth t and weight matrices \mathbf{W}=(W^{(1)},\ldots,W^{(t)}) where W^{(i)}\in\mathbb{R}^{s_{i}\times s_{i-1}}, and suppose \|\mathbf{W}\|_{1}\leq W. Let k be the number of neurons in the first hidden layer. Then, there exists a polynomial p of degree \ell=O\left(\|W^{(1)}\|_{2}^{\infty}(WL)^{t-1}Rk\sqrt{k}/\epsilon\right) that is an (\epsilon,R)-uniform approximation polynomial for f. Furthermore, f is (\epsilon,(k+\ell)^{O(\ell)})-approximately represented within radius R with respect to {\mathbb{H}}_{\mathsf{MK}_{\ell}}=\mathbb{H}_{\mathsf{MK}^{(1)}_{(\ell)}}. In fact, when k=1, f is (\epsilon,2^{O(\ell)})-approximately represented within radius R with respect to \mathbb{H}_{\mathsf{MK}^{(1)}_{(\ell)}}.

Proof.

We can express f as f({\bm{x}})=g(P{\bm{x}}), where P\in\mathbb{R}^{k\times d} is a projection matrix and g is a neural network with input dimension k. We observe that the Lipschitz constant of g is the same as that of f, since P is a projection matrix. From Lemma C.16, we have that g is \sqrt{k}\|W^{(1)}\|_{2}^{\infty}(WL)^{t-1}-Lipschitz. From Corollary C.10, there exists a polynomial q of degree \ell=O\left(\|W^{(1)}\|_{2}^{\infty}(WL)^{t-1}Rk\sqrt{k}/\epsilon\right) that is an (\epsilon,R)-uniform approximation for g. From Lemma C.6, the sum of the squares of the coefficients of q is bounded by \mathrm{poly}\left(\sqrt{k}\|W^{(1)}\|_{2}^{\infty}(WL)^{t-1}R\right)\cdot(k+\ell)^{O(\ell)}\leq(k+\ell)^{O(\ell)}. Now, applying Lemma C.12 yields the result. When k=1, we apply Lemma C.4 to obtain that the sum of the squares of the coefficients of q is bounded by \mathrm{poly}\left(\|W^{(1)}\|_{2}^{\infty}(WL)^{t-1}R\right)\cdot 2^{O(\ell)}\leq 2^{O(\ell)}. ∎

C.5 Sigmoids and Sigmoid-nets

We now give a custom proof for the case of neural networks with sigmoid activations. We do this because in this case we can hope to obtain polynomial approximations of degree O(\log(1/\epsilon)) in the accuracy parameter. We largely follow the proof technique of [GKKT17] and [ZLJ16b]. The modifications we make are to handle the case where the radius of approximation is a variable R instead of a constant. We require (for our applications to strictly subexponential distributions) that the degree of approximation scale linearly in R, a property that does not follow directly from the analysis given in [GKKT17]; we modify their analysis to achieve this linear dependence.

We first state a result regarding polynomial approximations for a single sigmoid activation.

Theorem C.18 ([LSSS14]).

Let σ:\sigma:\mathbb{R}\to\mathbb{R} denote the function σ(x)=11+ex\sigma(x)=\frac{1}{1+e^{-x}}. Let R,ϵ0R,\epsilon\geq 0. Then, there exists a polynomial pp of degree =O(Rlog(R/ϵ))\ell=O(R\log(R/\epsilon)) such that sup|x|R|σ(x)p(x)|ϵ\sup_{|x|\leq R}|\sigma(x)-p(x)|\leq\epsilon. Also, the sum of the squares of the coefficients of pp is bounded above by 2O()2^{O(\ell)}.
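As a quick numerical illustration of Theorem C.18 (not a substitute for the construction of [LSSS14]), a Chebyshev interpolant of degree on the order of R\log(R/\epsilon) already approximates the sigmoid uniformly on [-R,R]; the constant 6 below is a heuristic choice.

```python
import numpy as np
from numpy.polynomial import chebyshev as cheb

R, eps = 8.0, 1e-3
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

degree = int(6 * R * np.log(R / eps))    # heuristic stand-in for O(R log(R/eps))
nodes = R * np.cos(np.pi * (np.arange(degree + 1) + 0.5) / (degree + 1))
coeffs = cheb.chebfit(nodes / R, sigmoid(nodes), degree)

xs = np.linspace(-R, R, 5001)
err = np.max(np.abs(sigmoid(xs) - cheb.chebval(xs / R, coeffs)))
print(f"degree {degree}, uniform error {err:.2e}")
```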

We now present a construction of a uniform approximation for neural networks with sigmoid activations. The construction is similar to the one in [GKKT17], but the analysis differs because the linear dependence on the radius of approximation is important to us.

Theorem C.19.

Let \epsilon,R\geq 0. Let f on \mathbb{R}^{d} be a neural network with sigmoid activations, depth t and weight matrices \mathbf{W}=(W^{(1)},\ldots,W^{(t)}) where W^{(i)}\in\mathbb{R}^{s_{i}\times s_{i-1}}. Also, let \|\mathbf{W}\|_{1}\leq W. Then, there exists a polynomial p of degree \ell=O\left((R\log R)\cdot(\|W^{(1)}\|_{2}^{\infty}W^{t-2})\cdot(t\log(W/\epsilon))^{t-1}\right) that is an (\epsilon,R)-uniform approximation polynomial for f. Furthermore, f is (\epsilon,B)-approximately represented within radius R with respect to \mathcal{H}_{\mathsf{MK}_{\bm{\ell}}^{(t)}}, where \bm{\ell}=(\ell_{1},\ldots,\ell_{t-1}) is a tuple of degrees whose product is bounded by \ell. Here, B\leq(2\|W^{(1)}\|_{2}^{\infty})^{\ell}\cdot W^{O\left(W^{t-2}(t\log(W/\epsilon))^{t-2}\right)}.

Proof.

First, let q_{1} be the polynomial guaranteed by Theorem C.18 that (\epsilon/(2W)^{t})-approximates the sigmoid on an interval of radius R\|W^{(1)}\|_{2}^{\infty}. Denote the degree of q_{1} by \ell_{1}=O\left(Rt\|W^{(1)}\|_{2}^{\infty}\log(RW/\epsilon)\right). For all 1<i<t, let q_{i} be the polynomial that (\epsilon/(2W)^{t})-approximates the sigmoid up to radius 2W. These have degree \ell_{i}=O\left(Wt\log(W/\epsilon)\right). Let \bm{\ell}=(\ell_{1},\ldots,\ell_{t-1}). For all i\in[t-1], write q_{i}(x)=\sum_{j=0}^{\ell_{i}}\beta^{(i)}_{j}x^{j}. We know that \sum_{j=0}^{\ell_{i}}(\beta^{(i)}_{j})^{2}\leq 2^{O(\ell_{i})}.

We now construct the polynomial pp that approximates ff. For i[t]i\in[t], define pi(𝒙)=W(i)qi1(pi1(𝒙))p_{i}({\bm{x}})=W^{(i)}\cdot q_{i-1}\left(p_{i-1}({\bm{x}})\right) with p1(𝒙)=W(1)𝒙p_{1}({\bm{x}})=W^{(1)}\cdot{\bm{x}}. Define p(𝒙)=pt(𝒙)p({\bm{x}})=p_{t}({\bm{x}}). Recall that pi(𝒙)p_{i}({\bm{x}}) is a vector of sis_{i} polynomials. We prove the following by induction: for every i[t]i\in[t],

  1. 1.

    pi(𝒙)fi(𝒙)ϵ/(2W)ti\|p_{i}({\bm{x}})-f_{i}({\bm{x}})\|_{\infty}\leq\epsilon/(2W)^{t-i},

  2. 2.

    For each j[si]j\in[s_{i}], we have that (pi)j(𝒙)=𝒗,ψ(i)(𝒙)(p_{i})_{j}({\bm{x}})=\langle{\bm{v}},\psi_{\bm{\ell}}^{(i)}({\bm{x}})\rangle with 𝒗,𝒗(2W(1)2)O(n=1i1n)WO(n=2i1n)\langle{\bm{v}},{\bm{v}}\rangle\leq(2\|W^{(1)}\|_{2}^{\infty})^{O(\prod_{n=1}^{i-1}\ell_{n})}\cdot W^{O(\prod_{n=2}^{i-1}\ell_{n})}.

where the function fif_{i} is as defined in Definition C.1.

The above holds trivially for i=1, since p_{1}({\bm{x}})=f_{1}({\bm{x}})=W^{(1)}{\bm{x}} exactly. Also, (p_{1})_{i}({\bm{x}})=\langle W^{(1)}_{i},{\bm{x}}\rangle=\langle W^{(1)}_{i},\psi_{\bm{\ell}}^{(1)}({\bm{x}})\rangle from the definition of \psi_{\bm{\ell}}^{(1)} (whose degree-one coordinates are exactly the coordinates of {\bm{x}}). Clearly, \langle W^{(1)}_{i},W^{(1)}_{i}\rangle\leq\left(\|W^{(1)}\|_{2}^{\infty}\right)^{2}. We now prove that the above holds for i+1\in[t] assuming it holds for i.

We first prove (1). For j[si+1]j\in[s_{i+1}], we have that

|(pi+1)j(𝒙)(fi+1)j(𝒙)|\displaystyle|(p_{i+1})_{j}({\bm{x}})-(f_{i+1})_{j}({\bm{x}})| =|Wj(i+1)(qi(pi(𝒙))σ(fi(𝒙)))|\displaystyle=|W^{(i+1)}_{j}\bigl{(}q_{i}(p_{i}({\bm{x}}))-\sigma(f_{i}({\bm{x}}))\bigr{)}|
|Wj(i+1)(qi(pi(𝒙))σ(pi(𝒙))|+|Wj(i+1)(σ(pi(𝒙))σ(fi(𝒙))|\displaystyle\leq|W^{(i+1)}_{j}\bigl{(}q_{i}(p_{i}({\bm{x}}))-\sigma(p_{i}({\bm{x}})\bigr{)}|+|W^{(i+1)}_{j}\bigl{(}\sigma(p_{i}({\bm{x}}))-\sigma(f_{i}({\bm{x}})\bigr{)}|
W(ϵ/(2W)t)+Wϵ/(2W)tiϵ/(2W)t(i+1).\displaystyle\leq W\cdot(\epsilon/(2W)^{t})+W\cdot\epsilon/(2W)^{t-i}\leq\epsilon/(2W)^{t-(i+1)}.

For the second inequality, we analyse the cases i=1 and i>1 separately. When i=1, we have that (p_{1})_{j}({\bm{x}})=(f_{1})_{j}({\bm{x}}) with |(f_{1})_{j}({\bm{x}})|\leq R\|W^{(1)}\|_{2}^{\infty}, and |\sigma(x)-q_{1}(x)|\leq\epsilon/(2W)^{t} whenever |x|\leq R\|W^{(1)}\|_{2}^{\infty}. For i>1, from the inductive hypothesis, each entry of p_{i}({\bm{x}}) satisfies |(p_{i})_{k}({\bm{x}})|\leq|(f_{i})_{k}({\bm{x}})|+\epsilon/(2W)^{t-i}\leq 2W, since |(f_{i})_{k}({\bm{x}})|=|W^{(i)}_{k}\cdot\sigma(f_{i-1}({\bm{x}}))|\leq\|W^{(i)}\|_{1}\leq W (the sigmoid takes values in (0,1)); hence the entries of p_{i}({\bm{x}}) lie within the radius 2W on which q_{i} approximates \sigma. The second term in the second inequality is bounded since \sigma is 1-Lipschitz.

We are now ready to prove that (p_{i+1})_{j} is representable by small-norm vectors in \mathcal{H}_{\mathsf{MK}_{\bm{\ell}}^{(i+1)}} for all j\in[s_{i+1}]. We have that

(pi+1)j(𝒙)=k=1siWjk(i+1)qi((pi)k(𝒙)).(p_{i+1})_{j}({\bm{x}})=\sum_{k=1}^{s_{i}}W^{(i+1)}_{jk}\cdot q_{i}\left((p_{i})_{k}({\bm{x}})\right).

From the inductive hypothesis, we have that (pi)k=𝒗(k),ψ(i)(p_{i})_{k}=\langle{\bm{v}}^{(k)},\psi_{\bm{\ell}}^{(i)}\rangle. Thus, we have that

(pi+1)j(𝒙)=k=1siWjk(i+1)qi(𝒗(k),ψ(i)).(p_{i+1})_{j}({\bm{x}})=\sum_{k=1}^{s_{i}}W^{(i+1)}_{jk}\cdot q_{i}\left(\langle{\bm{v}}^{(k)},\psi_{\bm{\ell}}^{(i)}\rangle\right).

We expand each term in the above sum. We obtain,

qi(𝒗(k),ψ(i))\displaystyle q_{i}\left(\langle{\bm{v}}^{(k)},\psi_{\bm{\ell}}^{(i)}\rangle\right) =n=0iβn(i)(𝒗(k),ψ(i))n\displaystyle=\sum_{n=0}^{\ell_{i}}\beta^{(i)}_{n}\left(\langle{\bm{v}}^{(k)},\psi_{\bm{\ell}}^{(i)}\rangle\right)^{n}
=n=0iβn(i)(m1,,mn)[N(i)]nvm1(k)vmn(k)(ψ(i)(𝒙))m1(ψ(i)(𝒙))mn\displaystyle=\sum_{n=0}^{\ell_{i}}\beta^{(i)}_{n}\sum_{(m_{1},\ldots,m_{n})\in[N_{\bm{\ell}}^{(i)}]^{n}}v^{(k)}_{m_{1}}\ldots v^{(k)}_{m_{n}}\left(\psi_{\bm{\ell}}^{(i)}({\bm{x}})\right)_{m_{1}}\ldots\left(\psi_{\bm{\ell}}^{(i)}({\bm{x}})\right)_{m_{n}}
=𝒖(k),ψi((ψ(i)(𝒙))=𝒖(k),ψ(i+1)(𝒙).\displaystyle=\langle{\bm{u}}^{(k)},\psi_{\ell_{i}}((\psi_{\bm{\ell}}^{(i)}({\bm{x}}))\rangle=\langle{\bm{u}}^{(k)},\psi_{\bm{\ell}}^{(i+1)}({\bm{x}})\rangle.

The second equality follows by expanding the n-th power. Here, {\bm{u}}^{(k)}, indexed by tuples (m_{1},\ldots,m_{n})\in[N^{(i)}_{\bm{\ell}}]^{n} for n\leq\ell_{i}, has entries given by u^{(k)}_{(m_{1},\ldots,m_{n})}=\beta^{(i)}_{n}v^{(k)}_{m_{1}}\ldots v^{(k)}_{m_{n}}. Putting things together, we obtain that

(pi+1)j(𝒙)\displaystyle(p_{i+1})_{j}({\bm{x}}) =k=1siWjk(i+1)𝒖(k),ψ(i+1)(𝒙)\displaystyle=\sum_{k=1}^{s_{i}}W^{(i+1)}_{jk}\cdot\langle{\bm{u}}^{(k)},\psi_{\bm{\ell}}^{(i+1)}({\bm{x}})\rangle
=k=1siWjk(i+1)𝒖(k),ψ(i+1)(𝒙).\displaystyle=\langle\sum_{k=1}^{s_{i}}W^{(i+1)}_{jk}{\bm{u}}^{(k)},\psi_{\bm{\ell}}^{(i+1)}({\bm{x}})\rangle.

Thus, we have proved that (pi+1)j(p_{i+1})_{j} is representable in 𝖬𝖪(i+1)\mathcal{H}_{\mathsf{MK}_{\bm{\ell}}^{(i+1)}}. We now prove that the norm of the representation is small. We have that

k=1siWjk(i+1)𝒖(k)2W(i+1)1maxk[si]𝒖(k)2Wmaxk[si]𝒖(k)2.\displaystyle\|\sum_{k=1}^{s_{i}}W^{(i+1)}_{jk}{\bm{u}}^{(k)}\|_{2}\leq\|W^{(i+1)}\|_{1}\max_{k\in[s_{i}]}\|{\bm{u}}^{(k)}\|_{2}\leq W\cdot\max_{k\in[s_{i}]}\|{\bm{u}}^{(k)}\|_{2}.

We bound maxk[si]𝒖(k)2\max_{k\in[s_{i}]}\|{\bm{u}}^{(k)}\|_{2}. For any kk, from the definition of 𝒖(k){\bm{u}}^{(k)} and the inductive hypothesis, we have that

\|{\bm{u}}^{(k)}\|_{2}^{2}=\sum_{n=0}^{\ell_{i}}\left(\beta^{(i)}_{n}\right)^{2}\cdot\sum_{(m_{1},\ldots,m_{n})\in[N^{(i)}_{\bm{\ell}}]^{n}}\prod_{j=1}^{n}\left({\bm{v}}^{(k)}_{m_{j}}\right)^{2}
=n=0i(βn(i))2𝒗(k)22n2O(i)𝒗(k)22i\displaystyle=\sum_{n=0}^{\ell_{i}}\left(\beta^{(i)}_{n}\right)^{2}\|{\bm{v}}^{(k)}\|_{2}^{2n}\leq 2^{O(\ell_{i})}\cdot\|{\bm{v}}^{(k)}\|_{2}^{2\ell_{i}}

We analyse the case i=1i=1 and i>1i>1 separately. When i=1i=1, we have 2O(1)𝒗(k)221(2W(1)2)O(1)2^{O(\ell_{1})}\|{\bm{v}}^{(k)}\|_{2}^{2\ell_{1}}\leq(2\|W^{(1)}\|_{2}^{\infty})^{O(\ell_{1})} from the bound on the base case. When i>1i>1, we have

k=1siWjk(i+1)𝒖(k)22\displaystyle\|\sum_{k=1}^{s_{i}}W^{(i+1)}_{jk}{\bm{u}}^{(k)}\|_{2}^{2} W22O(i)𝒗(k)22i\displaystyle\leq W^{2}2^{O(\ell_{i})}\|{\bm{v}}^{(k)}\|_{2}^{2\ell_{i}}
W22O(i)((2W(1)2)O(n=1i1n)WO(n=2i1n))2i\displaystyle\leq W^{2}2^{O(\ell_{i})}\left((2\|W^{(1)}\|_{2}^{\infty})^{O(\prod_{n=1}^{i-1}\ell_{n})}\cdot W^{O(\prod_{n=2}^{i-1}\ell_{n})}\right)^{2\ell_{i}}
(2W(1)2)O(n=1in)WO(n=2in)\displaystyle\leq(2\|W^{(1)}\|_{2}^{\infty})^{O(\prod_{n=1}^{i}\ell_{n})}\cdot W^{O(\prod_{n=2}^{i}\ell_{n})}

which completes the induction. We are ready to calculate the bound on the degree.

We have 1=O(RtW(1)2log(RW/ϵ))\ell_{1}=O(Rt\|W^{(1)}\|_{2}^{\infty}\log(RW/\epsilon)). Also, for i>1i>1, we have i=O(Wtlog(W/ϵ))\ell_{i}=O(Wt\log(W/\epsilon)). Thus, the total degree is i=1t1i=O((RlogR)(W(1)2Wt2)(tlog(W/ϵ))t1)\ell\leq\prod_{i=1}^{t-1}\ell_{i}=O\left((R\log R)\cdot(\|W^{(1)}\|_{2}^{\infty}W^{t-2})\cdot(t\log(W/\epsilon))^{t-1}\right). The square of the norm of the kernel representation is bounded by BB where

B\leq(2\|W^{(1)}\|_{2}^{\infty})^{\ell}\cdot W^{O\left(W^{t-2}(t\log(W/\epsilon))^{t-2}\right)}.

This concludes the proof. ∎

C.6 Applications for Bounded Distributions

We first state and prove our end-to-end results on TDS learning sigmoid and Lipschitz nets over bounded marginals that are C-hypercontractive for some constant C.

Theorem C.20 (TDS Learning for Nets with Sigmoid Activation).

Let \mathcal{F} on \mathbb{R}^{d} be the class of neural networks with sigmoid activations, depth t and weight matrices \mathbf{W}=(W^{(1)},\ldots,W^{(t)}) such that \|\mathbf{W}\|_{1}\leq W. Let \epsilon\in(0,1). Suppose the training and test distributions {\mathcal{D}},{\mathcal{D}}^{\prime} over \mathbb{R}^{d}\times\mathbb{R} are such that the following are true:

  1. 1.

    𝒟𝒙{\mathcal{D}}_{{\bm{x}}} is bounded within {𝒙:𝒙2R}\{{\bm{x}}:\|{\bm{x}}\|_{2}\leq R\} and is CC-hypercontractive for R,C1R,C\geq 1,

  2. 2.

    The training and test labels are bounded in [M,M][-M,M] for some M1M\geq 1.

Then, Algorithm 1 learns the class \mathcal{F} in the TDS regression setting up to excess error \epsilon and probability of failure \delta. The time and sample complexity is

poly(d,1ϵ,C,M,log(1/δ),(2R)2t,(2W(1)2)WO((Wtlog(W/ϵ))t2))\mathrm{poly}\left(d,\frac{1}{\epsilon},C^{\ell},M,\log(1/\delta)^{\ell},(2R)^{2^{t}\cdot\ell},(2\|W^{(1)}\|_{2}^{\infty})^{\ell}\cdot W^{O\left((Wt\log(W/\epsilon))^{t-2}\right)}\right)

where =O((RlogR)(W(1)2Wt2)(tlog(W/ϵ))t1)\ell=O\left((R\log R)\cdot(\|W^{(1)}\|_{2}^{\infty}W^{t-2})\cdot(t\log(W/\epsilon))^{t-1}\right).

Proof.

From Theorem C.19, we have that \mathcal{F} is \left(\epsilon,(2\|W^{(1)}\|_{2}^{\infty})^{\ell}W^{O\left(W^{t-2}(t\log(W/\epsilon))^{t-2}\right)}\right)-approximately represented within radius R with respect to \mathsf{MK}_{\bm{\ell}}^{(t)}, where \bm{\ell} is a degree vector whose product is \ell=O\left((R\log R)\cdot(\|W^{(1)}\|_{2}^{\infty}W^{t-2})\cdot(t\log(W/\epsilon))^{t-1}\right). Also, from Lemma C.14, we have that A\coloneq\sup_{\|{\bm{x}}\|_{2}\leq R}\mathsf{MK}_{\bm{\ell}}^{(t)}({\bm{x}},{\bm{x}})\leq(2R)^{2^{t}\ell}. From Lemma C.14, the entries of the kernel can be computed in \mathrm{poly}(d,\ell) time, and from Lemma C.15, we have that {\mathcal{D}}_{{\bm{x}}} is \left(\mathsf{MK}_{\bm{\ell}}^{(t)},C,\ell\right)-hypercontractive. Now, we obtain the result by applying Theorem 3.6. ∎

The following corollary on TDS learning two layer sigmoid networks in polynomial time readily follows.

Corollary C.21.

Let \mathcal{F} on d\mathbb{R}^{d} be the class of two-layer neural networks with weight matrices 𝐖=(W(1),W(2))\mathbf{W}=(W^{(1)},W^{(2)}) and sigmoid activations. Let W(1)2O(1)\|W^{(1)}\|_{2}^{\infty}\leq O(1) and 𝐖1W\|\mathbf{W}\|_{1}\leq W. Suppose the training and test distributions satisfy the assumptions from Theorem C.20 with R=O(1)R=O(1). Then, Algorithm 1 learns the class \mathcal{F} in the TDS regression setting up to excess error ϵ\epsilon and probability of failure 0.10.1 in time and sample complexity poly(d,1/ϵ,W,M)\mathrm{poly}(d,1/\epsilon,W,M).

Proof.

The proof immediately follows from Theorem C.20 by setting t=2t=2 and the other parameters to the appropriate constants. ∎

Theorem C.22 (TDS Learning for Nets with Lipschitz Activation).

Let \mathcal{F} on \mathbb{R}^{d} be the class of neural networks with L-Lipschitz activations, depth t and weight matrices \mathbf{W}=(W^{(1)},\ldots,W^{(t)}) such that \|\mathbf{W}\|_{1}\leq W. Let \epsilon\in(0,1). Suppose the training and test distributions {\mathcal{D}},{\mathcal{D}}^{\prime} over \mathbb{R}^{d}\times\mathbb{R} are such that the following are true:

  1. 1.

    𝒟𝒙{\mathcal{D}}_{{\bm{x}}} is bounded within {𝒙:𝒙2R}\{{\bm{x}}:\|{\bm{x}}\|_{2}\leq R\} and is CC-hypercontractive for R,C1R,C\geq 1,

  2. 2.

    The training and test labels are bounded in [M,M][-M,M] for some M1M\geq 1.

Then, Algorithm 1 learns the class \mathcal{F} in the TDS regression setting up to excess error \epsilon and probability of failure \delta. The time and sample complexity is \mathrm{poly}\left(d,\frac{1}{\epsilon},C^{\ell},M,\log(1/\delta)^{\ell},(2R(k+\ell))^{O(\ell)}\right), where \ell=O\left(\|W^{(1)}\|_{2}^{\infty}(WL)^{t-1}Rk\sqrt{k}/\epsilon\right). In particular, when k=1, the time and sample complexity is \mathrm{poly}\left(d,\frac{1}{\epsilon},C^{\ell},M,\log(1/\delta)^{\ell},(2R)^{O(\ell)}\right), where \ell=O\left(\|W^{(1)}\|_{2}^{\infty}(WL)^{t-1}R/\epsilon\right).

Proof.

From Theorem C.17, for k>1 we have that \mathcal{F} is (\epsilon,(k+\ell)^{O(\ell)})-approximately represented within radius R with respect to \mathsf{MK}_{\bm{\ell}}^{(1)}, where \bm{\ell}=(\ell) and \ell=O\left(\|W^{(1)}\|_{2}^{\infty}(WL)^{t-1}Rk\sqrt{k}/\epsilon\right). For k=1, we have that \mathcal{F} is (\epsilon,2^{O(\ell)})-approximately represented within radius R with respect to \mathsf{MK}_{\bm{\ell}}^{(1)}, where \bm{\ell}=(\ell) and \ell=O\left(\|W^{(1)}\|_{2}^{\infty}(WL)^{t-1}R/\epsilon\right). Also, from Lemma C.14, we have that A\coloneq\sup_{\|{\bm{x}}\|_{2}\leq R}\mathsf{MK}_{\bm{\ell}}^{(1)}({\bm{x}},{\bm{x}})\leq(2R)^{O(\ell)}. From Lemma C.14, the entries of the kernel can be computed in \mathrm{poly}(d,\ell) time, and from Lemma C.15, we have that {\mathcal{D}}_{{\bm{x}}} is \left(\mathsf{MK}_{\bm{\ell}}^{(1)},C,\ell\right)-hypercontractive. Now, we obtain the result by applying Theorem 3.6. ∎

The above theorem implies the following corollary about TDS learning the class of ReLUs.

Corollary C.23.

Let ={𝐱max(0,𝐰𝐱):𝐰2=1}\mathcal{F}=\{{\bm{x}}\rightarrow\max(0,{\bm{w}}\cdot{\bm{x}}):\|{\bm{w}}\|_{2}=1\} on d\mathbb{R}^{d} be the class of ReLU functions with unit weight vectors. Suppose the training and test distributions satisfy the assumptions from Theorem C.22 with R=O(1)R=O(1). Then, Algorithm 1 learns the class \mathcal{F} in the TDS regression setting up to excess error ϵ\epsilon and probability of failure 0.10.1 in time and sample complexity poly(d,2O(1/ϵ),M)\mathrm{poly}(d,2^{O(1/\epsilon)},M).

Proof.

The proof immediately follows from Theorem C.22 by setting t=2,𝐖=(𝒘)t=2,\mathbf{W}=({\bm{w}}) and the activation to be the ReLU function. ∎

In particular, this implies that the class of ReLUs is TDS learnable in polynomial time whenever \epsilon=\Omega(1/\log d).

C.7 Applications for Unbounded Distributions

We are now ready to state our theorem for TDS learning neural networks with sigmoid activations.

Theorem C.24 (TDS Learning for Nets with Sigmoid Activation and Strictly Subexponential Marginals).

Let \mathcal{F} on \mathbb{R}^{d} be the class of neural networks with sigmoid activations, depth t and weight matrices \mathbf{W}=(W^{(1)},\ldots,W^{(t)}) such that \|\mathbf{W}\|_{1}\leq W. Let \epsilon\in(0,1). Suppose the training and test distributions {\mathcal{D}},{\mathcal{D}}^{\prime} over \mathbb{R}^{d}\times\mathbb{R} are such that the following are true:

  1. 1.

    𝒟𝒙{\mathcal{D}}_{{\bm{x}}} is γ\gamma-strictly subexponential,

  2. 2.

    The training and test labels are bounded in [M,M][-M,M] for some M1M\geq 1.

Then, Algorithm 2 learns the class \mathcal{F} in the TDS regression setting up to excess error \epsilon and probability of failure \delta. The time and sample complexity is at most

poly(ds,log(1/δ)s),\mathrm{poly}(d^{s},\log(1/\delta)^{s}),

where s=(klogM(W(1)2Wt2)(tlog(W/ϵ))t1)O(1γ).s=\left(k\log M\cdot(\|W^{(1)}\|_{2}^{\infty}W^{t-2})\cdot(t\log(W/\epsilon))^{t-1}\right)^{O(\frac{1}{\gamma})}.

Proof.

From Theorem C.19, we have that for every f\in\mathcal{F} there is an (\epsilon,R)-uniform approximation polynomial of degree \ell=O\left((R\log R)\cdot(\|W^{(1)}\|_{2}^{\infty}W^{t-2})\cdot(t\log(W/\epsilon))^{t-1}\right). Here, let g_{\mathcal{F}}(\epsilon)\coloneq(\|W^{(1)}\|_{2}^{\infty}W^{t-2})\cdot(t\log(W/\epsilon))^{t-1}. We also have that r=\sup_{\|{\bm{x}}\|_{2}\leq R,f\in\mathcal{F}}|f({\bm{x}})|\leq\mathrm{poly}(Rk\|W^{(1)}\|_{2}^{\infty}W^{t-2}) from the Lipschitzness of sigmoid nets (Lemma C.16) and the fact that the sigmoid evaluated at 0 has value at most 1. The theorem now directly follows from Theorem 4.6. ∎

We now state our theorem on TDS learning neural networks with arbitrary Lipschitz activations.

Theorem C.25 (TDS Learning for Nets with Lipschitz Activation and Strictly Subexponential Marginals).

Let \mathcal{F} on \mathbb{R}^{d} be the class of neural networks with L-Lipschitz activations, depth t and weight matrices \mathbf{W}=(W^{(1)},\ldots,W^{(t)}) such that \|\mathbf{W}\|_{1}\leq W. Let \epsilon\in(0,1). Suppose the training and test distributions {\mathcal{D}},{\mathcal{D}}^{\prime} over \mathbb{R}^{d}\times\mathbb{R} are such that the following are true:

  1. 1.

    𝒟𝒙{\mathcal{D}}_{{\bm{x}}} is γ\gamma-strictly subexponential,

  2. 2.

    The training and test labels are bounded in [M,M][-M,M] for some M1M\geq 1.

Then, Algorithm 2 learns the class \mathcal{F} in the TDS regression setting up to excess error \epsilon and probability of failure \delta. The time and sample complexity is at most

\mathrm{poly}\left(d^{s},\log(1/\delta)^{s}\right),

where s=(klogMW(1)2(WL)t1/ϵ)O(1γ)s=\left(k\log M\cdot\|W^{(1)}\|_{2}^{\infty}(WL)^{t-1}/\epsilon\right)^{O(\frac{1}{\gamma})}.

Proof.

From Theorem C.17, we have that for every f\in\mathcal{F} there is an (\epsilon,R)-uniform approximation polynomial of degree \ell=O\left(Rk\sqrt{k}\cdot\|W^{(1)}\|_{2}^{\infty}(WL)^{t-1}/\epsilon\right). Here, let g_{\mathcal{F}}(\epsilon)\coloneq k\sqrt{k}\|W^{(1)}\|_{2}^{\infty}(WL)^{t-1}/\epsilon. We also have that r=\sup_{\|{\bm{x}}\|_{2}\leq R,f\in\mathcal{F}}|f({\bm{x}})|\leq\mathrm{poly}(Rk\|W^{(1)}\|_{2}^{\infty}(WL)^{t-1}) from the Lipschitz constant bound (Lemma C.16) and the fact that each individual activation has value at most 1 when evaluated at 0 (see Definition C.1). The theorem now directly follows from Theorem 4.6. ∎

Appendix D Assumptions on the Labels

Our main theorems involve assumptions on the labels of both the training and test distributions. Ideally, one would want to avoid any assumptions on the test distribution. However, we demonstrate that this is not possible, even when the training marginal and the training labels are bounded, and the test labels have bounded second moment. On the other hand, we show that obtaining algorithms that work for bounded labels is sufficient even in the unbounded labels case, as long as some moment of the labels (strictly higher than the second moment) is bounded.

We begin with the lower bound, which we state for the class of linear functions, but would also hold for the class of single ReLU neurons, as well as other unbounded classes.

Proposition D.1 (Label Assumption Necessity).

Let {\mathcal{F}} be the class of linear functions over d\mathbb{R}^{d}, i.e., ={𝐱𝐰𝐱:𝐰d,𝐰21}{\mathcal{F}}=\{{\bm{x}}\mapsto{\bm{w}}\cdot{\bm{x}}:{\bm{w}}\in\mathbb{R}^{d},\|{\bm{w}}\|_{2}\leq 1\}. Even if we assume that the training marginal is bounded within {𝐱d:𝐱21}\{{\bm{x}}\in\mathbb{R}^{d}:\|{\bm{x}}\|_{2}\leq 1\}, that the training labels are bounded in [0,1][0,1], and that for the test labels we have 𝔼y𝒟y[y2]Y\mathbb{E}_{y\sim{\mathcal{D}}^{\prime}_{y}}[y^{2}]\leq Y where Y>0Y>0, no TDS regression algorithm with finite sample complexity can achieve excess error less than Y/4Y/4 and probability of failure less than 1/41/4 for {\mathcal{F}}.

The proof is based on the observation that because we cannot make any assumption on the test marginal, the test distribution could take some very large value with very small probability, while still being consistent with some linear function. The training distribution, on the other hand, gives no information about the ground truth and is information theoretically indistinguishable from the constructed test distribution. Therefore, the tester must accept and its output will have large excess error. The bound on the second moment of the labels does imply a bound on excess error, but this bound cannot be made arbitrarily small by drawing more samples.

Proof of Proposition D.1.

Suppose, for contradiction that we have a TDS regression algorithm for {\mathcal{F}} with excess error ϵ<Y/4{\epsilon}<Y/4 and probability of failure δ<1/4\delta<1/4. Let mm\in{\mathbb{N}} be the sample complexity of the algorithm and p(0,1)p\in(0,1) such that m1/pm\ll 1/p. We consider three distributions over d×\mathbb{R}^{d}\times\mathbb{R}. First 𝒟(1){\mathcal{D}}^{(1)} outputs (0,0)(0,0) with probability 11. Second, 𝒟(2){\mathcal{D}}^{(2)} outputs (0,0)(0,0) with probability 1p1-p and (Yp𝒘,Yp)(\frac{\sqrt{Y}}{\sqrt{p}}{\bm{w}},\frac{\sqrt{Y}}{\sqrt{p}}) with probability pp, for some 𝒘d{\bm{w}}\in\mathbb{R}^{d} with 𝒘2=1\|{\bm{w}}\|_{2}=1. Third, 𝒟(3){\mathcal{D}}^{(3)} outputs (0,0)(0,0) with probability 1p1-p and (Yp𝒘,0)(\frac{\sqrt{Y}}{\sqrt{p}}{\bm{w}},0) with probability pp.

We consider two instances of the TDS regression problem. The first instance corresponds to the case 𝒟=𝒟(1){\mathcal{D}}={\mathcal{D}}^{(1)} and 𝒟=𝒟(2){\mathcal{D}}^{\prime}={\mathcal{D}}^{(2)}. The second corresponds to the case 𝒟=𝒟(1){\mathcal{D}}={\mathcal{D}}^{(1)} and 𝒟=𝒟(3){\mathcal{D}}^{\prime}={\mathcal{D}}^{(3)}. Note that the assumptions we asserted regarding the test distribution and the test labels are true for both instances. For 𝒟(2){\mathcal{D}}^{(2)}, in particular, we have 𝔼y𝒟y(2)[y2]=p(Y/p)2=Y\mathbb{E}_{y\sim{\mathcal{D}}^{(2)}_{y}}[y^{2}]=p\cdot(\sqrt{Y}/\sqrt{p})^{2}=Y. Moreover, in each of the cases, there is a hypothesis in {\mathcal{F}} that is consistent with all of the examples (either the hypothesis 𝒙0{\bm{x}}\mapsto 0 or 𝒙𝒘𝒙{\bm{x}}\mapsto{\bm{w}}\cdot{\bm{x}}), so opt:=minf[𝒟(f)]=0=minf[𝒟(f)+𝒟(f)]=:λ\mathrm{opt}:=\min_{f\in{\mathcal{F}}}[{\mathcal{L}}_{{\mathcal{D}}}(f)]=0=\min_{f^{\prime}\in{\mathcal{F}}}[{\mathcal{L}}_{{\mathcal{D}}}(f^{\prime})+{\mathcal{L}}_{{\mathcal{D}}^{\prime}}(f^{\prime})]=:\lambda.

Note that the total variation distance between {\mathcal{D}}^{(1)} and {\mathcal{D}}^{(2)} is p, and similarly between {\mathcal{D}}^{(1)} and {\mathcal{D}}^{(3)}. Therefore, by the completeness criterion, as well as the fact that sampling only increases total variation distance at a linear rate, i.e., \mathrm{d}_{\mathrm{tv}}(({\mathcal{D}})^{\otimes m},({\mathcal{D}}^{\prime})^{\otimes m})\leq m\cdot\mathrm{d}_{\mathrm{tv}}({\mathcal{D}},{\mathcal{D}}^{\prime})\leq m\cdot p, we have that in each of the two instances, the algorithm will accept with probability at least 1-m\cdot p-\delta. (Indeed, the algorithm would accept with probability at least 1-\delta if the set of test examples was drawn from ({\mathcal{D}}_{{\bm{x}}})^{\otimes m}. Since ({\mathcal{D}}_{{\bm{x}}}^{\prime})^{\otimes m} is (mp)-close to ({\mathcal{D}}_{{\bm{x}}})^{\otimes m} in total variation distance, no algorithm can behave differently if we substitute ({\mathcal{D}}_{{\bm{x}}})^{\otimes m} with ({\mathcal{D}}_{{\bm{x}}}^{\prime})^{\otimes m}, except with probability m\cdot p. Hence, any algorithm must accept with probability at least 1-m\cdot p-\delta.)

Suppose that the algorithm accepts in both instances (which happens with probability at least 1-2\delta-2mp). Let {\bm{x}}_{0}\coloneq\frac{\sqrt{Y}}{\sqrt{p}}{\bm{w}} denote the point on which the test marginal places mass p. By the soundness criterion, with overall probability at least 1-4\delta-2mp, we have the following.

p\cdot\left(h({\bm{x}}_{0})-0\right)^{2}<Y/4
p\cdot\left(h({\bm{x}}_{0})-\sqrt{Y}/\sqrt{p}\right)^{2}<Y/4

The inequalities above cannot be satisfied simultaneously: they would imply \sqrt{p}\,|h({\bm{x}}_{0})|+\sqrt{p}\,|h({\bm{x}}_{0})-\sqrt{Y}/\sqrt{p}|<\sqrt{Y}, while the triangle inequality gives \sqrt{p}\,|h({\bm{x}}_{0})|+\sqrt{p}\,|h({\bm{x}}_{0})-\sqrt{Y}/\sqrt{p}|\geq\sqrt{p}\cdot\sqrt{Y}/\sqrt{p}=\sqrt{Y}. Hence we have arrived at a contradiction. It only remains to argue that 1-4\delta-2mp>0, which is true if we choose p<\frac{1-4\delta}{2m}. Therefore, such a TDS regression algorithm cannot exist. ∎

The lower bound of Proposition D.1 demonstrates that, in the worst case, the best possible excess error scales with the second moment of the distribution of the test labels. In contrast, we show that a bound on any strictly higher moment is sufficient.

Corollary D.2.

Suppose that for any M>0M>0, we have an algorithm that learns a class {\mathcal{F}} in the TDS setting up to excess error ϵ(0,1){\epsilon}\in(0,1), assuming that both the training and test labels are bounded in [M,M][-M,M]. Let T(M)T(M) and m(M)m(M) be the corresponding time and sample complexity upper bounds.

Then, in the same setting, there is an algorithm that learns {\mathcal{F}} up to excess error 4ϵ4{\epsilon} under the relaxed assumption that for both training and test labels we have 𝔼[y2g(|y|)]Y\mathbb{E}[y^{2}g(|y|)]\leq Y for some Y>0Y>0 and gg some strictly increasing, positive-valued and unbounded function. The corresponding time and sample complexity upper bounds are T(g1(Y/ϵ2))T(g^{-1}(Y/{\epsilon}^{2})) and m(g1(Y/ϵ2))m(g^{-1}(Y/{\epsilon}^{2})).

The proof is based on the observation that the effect of clipping on the labels, as measured by the squared loss, can be controlled by drawing enough samples, whenever a moment that is strictly higher than the second moment is bounded.

Lemma D.3.

Let Y>0Y>0 and g:(0,)(0,)g:(0,\infty)\to(0,\infty) be strictly increasing and surjective. Let yy be a random variable over \mathbb{R} such that 𝔼[y2g(|y|)]Y\mathbb{E}[y^{2}g(|y|)]\leq Y. Then, for any ϵ(0,1){\epsilon}\in(0,1), if Mg1(Y/ϵ2)M\geq g^{-1}(Y/\epsilon^{2}), we have 𝔼[(yclM(y))2]ϵ\sqrt{\mathbb{E}[(y-\mathrm{cl}_{M}(y))^{2}]}\leq\epsilon.

Proof of Lemma D.3.

We have that \mathbb{E}[(y-\mathrm{cl}_{M}(y))^{2}]\leq\mathbb{E}[y^{2}\mathbbm{1}\{|y|>M\}], because |\mathrm{cl}_{M}(y)|\leq|y| and y, \mathrm{cl}_{M}(y) always have the same sign, so (y-\mathrm{cl}_{M}(y))^{2}\leq y^{2}, and moreover (y-\mathrm{cl}_{M}(y))^{2}=0 if |y|\leq M. Since g(|y|) is non-zero whenever y>0, we have \mathbb{E}[y^{2}\mathbbm{1}\{|y|>M\}]=\mathbb{E}[y^{2}\cdot\frac{g(|y|)}{g(|y|)}\cdot\mathbbm{1}\{|y|>M\}]. We now use the fact that g is increasing to conclude that \mathbb{E}[y^{2}\mathbbm{1}\{|y|>M\}]\leq\frac{\mathbb{E}[y^{2}g(|y|)]}{g(M)}\leq\frac{Y}{g(M)}. By choosing M\geq g^{-1}(Y/{\epsilon}^{2}), we obtain the desired bound. ∎
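A minimal numerical illustration of Lemma D.3: choosing g(u)=\sqrt{u} (strictly increasing and unbounded) and clipping at M=g^{-1}(Y/\epsilon^{2}) keeps the clipping error below \epsilon. The heavy-tailed label distribution and the parameters below are illustrative, and the moment bound Y is estimated empirically rather than assumed.

```python
import numpy as np

rng = np.random.default_rng(6)
eps = 0.1
g = lambda u: np.sqrt(u)                 # strictly increasing, unbounded
g_inv = lambda v: v ** 2

y = rng.standard_t(df=5, size=2_000_000) * 3.0    # heavy-ish tails, finite E[y^2 g(|y|)]
Y = np.mean(y ** 2 * g(np.abs(y)))                # empirical proxy for the moment bound
M = g_inv(Y / eps ** 2)                           # clipping threshold from the lemma

clipped = np.clip(y, -M, M)
clip_err = np.sqrt(np.mean((y - clipped) ** 2))
print(f"M = {M:.1f}, sqrt E[(y - cl_M(y))^2] = {clip_err:.4f} <= eps = {eps}")
```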

We are now ready to prove Corollary D.2, by reducing TDS learning with moment-bounded labels to TDS learning with bounded labels.

Proof of Corollary D.2.

The idea is to reduce the problem under the relaxed label assumptions to a corresponding bounded-label problem for M=g1(Y/ϵ2)M=g^{-1}(Y/{\epsilon}^{2}). In particular, consider a new training distribution clM𝒟\mathrm{cl}_{M}\circ{\mathcal{D}} and a new test distribution clM𝒟\mathrm{cl}_{M}\circ{\mathcal{D}}^{\prime}, where the samples are formed by drawing a sample (𝒙,y)({\bm{x}},y) from the corresponding original distribution and clipping the label yy to clM(y)\mathrm{cl}_{M}(y). Note that whenever we have access to i.i.d. examples from 𝒟{\mathcal{D}}, we also have access to i.i.d. examples from clM𝒟\mathrm{cl}_{M}\circ{\mathcal{D}} and similarly for (𝒟𝒙,clM𝒟𝒙)({\mathcal{D}}_{{\bm{x}}}^{\prime},\mathrm{cl}_{M}\circ{\mathcal{D}}_{{\bm{x}}}^{\prime}). Therefore, we may solve the corresponding TDS problem for clM𝒟\mathrm{cl}_{M}\circ{\mathcal{D}} and clM𝒟\mathrm{cl}_{M}\circ{\mathcal{D}}^{\prime}, to either reject or obtain some hypothesis hh such that

clM𝒟(h)minf[clM𝒟(f)]+minf[clM𝒟(f)+clM𝒟(f)]+ϵ{\mathcal{L}}_{\mathrm{cl}_{M}\circ{\mathcal{D}}^{\prime}}(h)\leq\min_{f\in{\mathcal{F}}}[{\mathcal{L}}_{\mathrm{cl}_{M}\circ{\mathcal{D}}}(f)]+\min_{f^{\prime}\in{\mathcal{F}}}[{\mathcal{L}}_{\mathrm{cl}_{M}\circ{\mathcal{D}}}(f^{\prime})+{\mathcal{L}}_{\mathrm{cl}_{M}\circ{\mathcal{D}}^{\prime}}(f^{\prime})]+{\epsilon}

Our algorithm either rejects when the algorithm for the bounded labels case rejects or accepts and outputs hh. It suffices to show 𝒟(h)minf[𝒟(f)]+minf[𝒟(f)+𝒟(f)]+4ϵ{\mathcal{L}}_{{\mathcal{D}}^{\prime}}(h)\leq\min_{f\in{\mathcal{F}}}[{\mathcal{L}}_{{\mathcal{D}}}(f)]+\min_{f^{\prime}\in{\mathcal{F}}}[{\mathcal{L}}_{{\mathcal{D}}}(f^{\prime})+{\mathcal{L}}_{{\mathcal{D}}^{\prime}}(f^{\prime})]+4{\epsilon}, because the marginal distributions do not change and completeness is, therefore, satisfied directly.

It suffices to show that for any distribution {\mathcal{D}} over \mathbb{R}^{d}\times\mathbb{R} whose labels satisfy the moment bound, and for any hypothesis h, we have |{\mathcal{L}}_{{\mathcal{D}}}(h)-{\mathcal{L}}_{\mathrm{cl}_{M}\circ{\mathcal{D}}}(h)|\leq{\epsilon}, since this allows us to replace each clipped loss term above by its unclipped counterpart. To this end, note that {\mathcal{L}}_{\mathrm{cl}_{M}\circ{\mathcal{D}}}(h)=\sqrt{\mathbb{E}_{({\bm{x}},y)\sim{\mathcal{D}}}[(\mathrm{cl}_{M}(y)-h({\bm{x}}))^{2}]}. We have the following.

clM𝒟(h)\displaystyle{\mathcal{L}}_{\mathrm{cl}_{M}\circ{\mathcal{D}}}(h) =𝔼(𝒙,y)𝒟[(clM(y)h(𝒙))2]\displaystyle=\sqrt{\mathbb{E}_{({\bm{x}},y)\sim{\mathcal{D}}}[(\mathrm{cl}_{M}(y)-h({\bm{x}}))^{2}]}
=𝔼(𝒙,y)𝒟[(clM(y)y+yh(𝒙))2]\displaystyle=\sqrt{\mathbb{E}_{({\bm{x}},y)\sim{\mathcal{D}}}[(\mathrm{cl}_{M}(y)-y+y-h({\bm{x}}))^{2}]}
𝔼(𝒙,y)𝒟[(clM(y)y)2]+𝔼(𝒙,y)𝒟[(yh(𝒙))2]\displaystyle\leq\sqrt{\mathbb{E}_{({\bm{x}},y)\sim{\mathcal{D}}}[(\mathrm{cl}_{M}(y)-y)^{2}]}+\sqrt{\mathbb{E}_{({\bm{x}},y)\sim{\mathcal{D}}}[(y-h({\bm{x}}))^{2}]}
ϵ+𝒟(h)\displaystyle\leq{\epsilon}+{\mathcal{L}}_{{\mathcal{D}}}(h)

The first inequality follows from an application of the triangle inequality for the 2{\mathcal{L}}_{2}-norm and the second inequality follows from Lemma D.3. The other side follows analogously. ∎
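The reduction in the proof above can be summarized as a small wrapper: clip the training labels at M=g^{-1}(Y/\epsilon^{2}) and hand the clipped sample to a TDS learner that assumes labels bounded in [-M,M]. The sketch below assumes a hypothetical interface bounded_label_learner(train, test_x, M); it is not an implementation of Algorithm 1 or Algorithm 2.

```python
import numpy as np

def tds_with_moment_bounded_labels(bounded_label_learner, train, test_x, Y, eps, g_inv):
    """Reduction of Corollary D.2 (sketch): clip labels at M = g^{-1}(Y / eps^2).

    bounded_label_learner(train, test_x, M) is an assumed interface returning
    either ("reject", None) or ("accept", hypothesis); the clipping of the test
    labels happens only in the analysis, since the learner never sees them.
    """
    M = g_inv(Y / eps ** 2)
    X, y = train
    clipped_train = (X, np.clip(y, -M, M))
    return bounded_label_learner(clipped_train, test_x, M)
```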