Learning Neural Networks with Distribution Shift:
Efficiently Certifiable Guarantees
Abstract
We give the first provably efficient algorithms for learning neural networks with distribution shift. We work in the Testable Learning with Distribution Shift framework (TDS learning) of [KSV24b], where the learner receives labeled examples from a training distribution and unlabeled examples from a test distribution and must either output a hypothesis with low test error or reject if distribution shift is detected. No assumptions are made on the test distribution.
All prior work in TDS learning focuses on classification, while here we must handle the setting of nonconvex regression. Our results apply to real-valued networks with arbitrary Lipschitz activations and work whenever the training distribution has strictly sub-exponential tails. For training distributions that are bounded and hypercontractive, we give a fully polynomial-time algorithm for TDS learning one-hidden-layer networks with sigmoid activations. We achieve this by importing classical kernel methods into the TDS framework using data-dependent feature maps and a type of kernel matrix that couples samples from both train and test distributions.
1 Introduction
Understanding when a model will generalize from a known training distribution to an unknown test distribution is a critical challenge in trustworthy machine learning and domain adaptation. Traditional approaches to this problem prove generalization bounds in terms of various notions of distance between train and test distributions [BDBCP06, BDBC+10, MMR09] but do not provide efficient algorithms. Recent work due to [KSV24b] departs from this paradigm and defines the model of Testable Learning with Distribution Shift (TDS learning), where a learner may reject altogether if significant distribution shift is detected. When the learner accepts, however, it outputs a classifier and a proof that the classifier has nearly optimal test error.
A sequence of works has given the first set of efficient algorithms in the TDS learning model for well-studied function classes where no assumptions are taken on the test distribution [KSV24b, KSV24a, CKK+24, GSSV24]. These results, however, hold for classification and therefore do not apply to (nonconvex) regression problems and in particular to a long line of work giving provably efficient algorithms for learning simple classes of neural networks under natural distributional assumptions on the training marginal [GK19, DGK+20, DKKZ20, DKTZ22, CKM22, CDG+23, WZDD23, GGKS24, DK24].
The main contribution of this work is the first set of efficient TDS learning algorithms for broad classes of (nonconvex) regression problems. Our results apply to neural networks with arbitrary Lipschitz activations of any constant depth. As one example, we obtain a fully polynomial-time algorithm for learning one-hidden-layer neural networks with sigmoid activations with respect to any bounded and hypercontractive training distribution. For bounded training distributions, the running times of our algorithms match the best known running times for ordinary PAC or agnostic learning (without distribution shift). We emphasize that unlike all prior work in domain adaptation, we make no assumptions on the test distribution.
Regression Setting. We assume the learner has access to labeled examples from the training distribution and unlabeled examples from the marginal of the test distribution. We consider the squared loss, i.e., the expected squared difference between the label and the prediction. The error benchmark is analogous to the benchmark for TDS learning in classification [KSV24b] and depends on two quantities: the optimum training error achievable by a classifier in the learnt class, denoted $\mathsf{opt}$, and the best joint error achievable by a single classifier on both the training and test distributions, denoted $\lambda$. Achieving test error on the order of $\lambda$ (up to additive slack) is the standard goal in domain adaptation [BDBCP06, BCK+07, MMR09]. We now formally define the TDS learning framework for regression.
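Concretely, one standard way to formalize these quantities (our notation here; the formal guarantee is stated in Definition 1.1 below) is
\[
\mathcal{L}_{\mathcal{D}}(h) = \mathbf{E}_{(\mathbf{x},y)\sim\mathcal{D}}\big[(y - h(\mathbf{x}))^2\big], \qquad
\mathsf{opt} = \min_{f\in\mathcal{F}} \mathcal{L}_{\mathcal{D}}(f), \qquad
\lambda = \min_{f\in\mathcal{F}} \big(\mathcal{L}_{\mathcal{D}}(f) + \mathcal{L}_{\mathcal{D}'}(f)\big),
\]
where $\mathcal{D}$ and $\mathcal{D}'$ denote the training and test distributions and $\mathcal{F}$ the function class.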
Definition 1.1 (Testable Regression with Distribution Shift).
For parameters $\epsilon, \delta \in (0,1)$ and a function class $\mathcal{F}$, the learner receives i.i.d. labeled examples from some unknown training distribution $\mathcal{D}$ over $\mathbb{R}^d \times \mathbb{R}$ and i.i.d. unlabeled examples from the marginal $\mathcal{D}'_{\mathbf{x}}$ of another unknown test distribution $\mathcal{D}'$ over $\mathbb{R}^d \times \mathbb{R}$. The learner either rejects, or it accepts and outputs a hypothesis $h : \mathbb{R}^d \to \mathbb{R}$ such that the following are true.
1. (Soundness) With probability at least $1 - \delta$: if the algorithm accepts, then the output $h$ satisfies the test-error benchmark described above, i.e., $\mathcal{L}_{\mathcal{D}'}(h)$ is bounded in terms of $\mathsf{opt}$ and $\lambda$, up to an additive $\epsilon$.
2. (Completeness) If the training and test marginals are identical, i.e., $\mathcal{D}_{\mathbf{x}} = \mathcal{D}'_{\mathbf{x}}$, then the algorithm accepts with probability at least $1 - \delta$.
1.1 Our Results
Our results hold for classes of Lipschitz neural networks. In particular, we consider functions of the following form. Let $\sigma : \mathbb{R} \to \mathbb{R}$ be an activation function. Let $\mathbf{W} = (W^{(1)}, \dots, W^{(t)})$ with $W^{(i)} \in \mathbb{R}^{k_i \times k_{i-1}}$ be the tuple of weight matrices. Here, $k_0 = d$ is the input dimension and $k_t = 1$. Define recursively the function $f_i : \mathbb{R}^d \to \mathbb{R}^{k_i}$ as $f_i(\mathbf{x}) = \sigma(W^{(i)} f_{i-1}(\mathbf{x}))$, with $f_0(\mathbf{x}) = \mathbf{x}$. The function computed by the neural network is defined as $f_{\mathbf{W}}(\mathbf{x}) = W^{(t)} f_{t-1}(\mathbf{x})$. The depth of this network is $t$.
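As a concrete illustration of this recursive definition, here is a minimal NumPy sketch (our own illustrative code; the activation, depth, and dimensions below are arbitrary choices, not fixed by our results):

```python
import numpy as np

def neural_net(x, weights, activation=np.tanh):
    """Evaluate f_W(x) = W^(t) f_{t-1}(x), where f_0(x) = x and
    f_i(x) = activation(W^(i) f_{i-1}(x)) for the hidden layers."""
    h = x
    for W in weights[:-1]:          # hidden layers 1, ..., t-1
        h = activation(W @ h)
    return float(weights[-1] @ h)   # output layer W^(t) has shape 1 x k_{t-1}

# Example: depth t = 2 (one hidden layer) with d = 3 inputs and k_1 = 4 hidden units.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((1, 4))
print(neural_net(np.array([0.5, -1.0, 2.0]), [W1, W2]))
```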
We now present our main results on TDS learning for neural networks.
Function Class | Runtime (Bounded) | Runtime (Subgaussian)
---|---|---
One-hidden-layer Sigmoid Net | |
Single ReLU | |
Sigmoid Nets | |
$L$-Lipschitz Nets | |
From the above table, we highlight that in the cases of bounded distributions with (1) one-hidden-layer Sigmoid Nets and (2) a single ReLU (in the parameter regime indicated in Table 1), we obtain TDS algorithms that run in polynomial time in all parameters. Moreover, for the last row, regarding Lipschitz Nets, each neuron is allowed to have a different and unknown Lipschitz activation. In particular, our results capture the class of single-index models (see, e.g., [KKSK11, GGKS24]).
In the results of Table 1, we assume bounded labels for both the training and test distributions. This assumption can be relaxed to a bound on a constant-degree moment of the labels of sufficiently high degree (see Corollary D.2). In fact, such an assumption is necessary, as we show in Proposition D.1.
1.2 Our Techniques
TDS Learning via Kernel Methods. The major technical contribution of this work is devoted to importing classical kernel methods into the TDS learning framework. A first attempt at testing distribution shift with respect to a fixed feature map would be to form two corresponding covariance matrices of the expanded features, one from samples drawn from the training distribution and the other from samples drawn from the test distribution, and test if these two matrices have similar eigendecompositions. This approach only yields efficient algorithms for linear kernels, however, as here we are interested in spectral properties of covariance matrices in the feature space corresponding to low-degree polynomials, whose dimension is too large.
Instead, we form a new data-dependent and concise reference feature map $\Phi$ that depends on examples from both the training and the test marginals. We show that this feature map approximately represents the ground truth, i.e., some function with both low training and test error (this is due to the representer theorem, see Proposition 3.7). To certify that error bounds transfer from the training distribution to the test distribution, we require relative-error closeness between the covariance matrix of the feature expansion over the test marginal and the corresponding matrix over the training marginal. We draw fresh sets of verification examples and show how the kernel trick can be used to efficiently achieve these approximations even though $\Phi$ is a nonstandard feature map. We provide a more detailed technical overview and a formal proof in Section 3.1.
By instantiating the above results using a type of polynomial kernel, we can reduce the problem of TDS learning neural networks to the problem of obtaining an appropriate polynomial approximator. Our final training algorithm (as opposed to the testing phase) will essentially be kernelized polynomial regression.
TDS Learning and Uniform Approximation. Prior work in TDS learning has established connections between polynomial approximation theory and efficient algorithms in the TDS setting. In particular, the existence of low-degree sandwiching approximators for a concept class is known to imply dimension-efficient TDS learning algorithms for binary classification. The notion of sandwiching approximators for a function $f$ refers to a pair of low-degree polynomials $p_{\mathrm{up}}, p_{\mathrm{down}}$ with two main properties: (1) $p_{\mathrm{down}} \le f \le p_{\mathrm{up}}$ everywhere and (2) the expected absolute distance between $p_{\mathrm{up}}$ and $p_{\mathrm{down}}$ over some reference distribution is small. The first property is of particular importance in the TDS setting, since it holds everywhere and, therefore, it holds for any test distribution unconditionally.
Here we make the simple observation that the incomparable notion of uniform approximation suffices for TDS learning. A uniform approximator is a polynomial $p$ that approximates a function $f$ pointwise, meaning that $|p(\mathbf{x}) - f(\mathbf{x})|$ is small at every point $\mathbf{x}$ within a ball around the origin (there is no known direct relationship between sandwiching and uniform approximators). In our setting, uniform approximation is more convenient, due to the existence of powerful tools from polynomial approximation theory for Lipschitz and analytic functions.
Contrary to the sandwiching property, the uniform approximation property cannot hold everywhere if the approximated function class contains high-(or infinite-) degree functions. When the training distribution has strictly sub-exponential tails, however, the expected error of approximation outside the radius of approximation is negligible. Importantly, this property can be certified for the test distribution by using a moment-matching tester. See Section 4.2 for a more detailed technical overview and for the full proof.
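For intuition, a moment-matching tester of the kind referred to above can be sketched as follows (a minimal illustration; the names, the choice of `degree`, and the `slack` threshold are hypothetical here, whereas the actual algorithm sets them according to the analysis in Section 4.2):

```python
import numpy as np
from itertools import combinations_with_replacement

def low_degree_moments(X, degree):
    """Empirical moments E[x^alpha] for all monomials of total degree <= degree."""
    n, d = X.shape
    moments = {}
    for k in range(1, degree + 1):
        for alpha in combinations_with_replacement(range(d), k):
            vals = np.ones(n)
            for j in alpha:
                vals *= X[:, j]
            moments[alpha] = vals.mean()
    return moments

def moment_matching_test(X_train, X_test, degree, slack):
    """Accept iff every low-degree empirical moment of the test sample is
    within `slack` of the corresponding moment of the training sample."""
    m_tr = low_degree_moments(X_train, degree)
    m_te = low_degree_moments(X_test, degree)
    return all(abs(m_tr[a] - m_te[a]) <= slack for a in m_tr)
```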
1.3 Related Work
Learning with Distribution Shift. The field of domain adaptation has been studying the distribution shift problem for almost two decades [BDBCP06, BCK+07, BDBC+10, MMR09, DLLP10, MKFAS20, RMH+20, KZZ24, HK19, HK24, ACM24], providing useful insights regarding the information-theoretic (im)possibilities for learning with distribution shift. The first efficient end-to-end algorithms for non-trivial concept classes with distribution shift were given for TDS learning in [KSV24b, KSV24a, CKK+24] and for PQ learning, originally defined by [GKKM20], in [GSSV24]. These works focus on binary classification for classes like halfspaces, halfspace intersections, and geometric concepts. In the regression setting, we need to handle unbounded loss functions, but we are also able to use Lipschitz properties of real-valued networks to obtain results even for deeper architectures. For the special case of linear regression, efficient algorithms for learning with distribution shift are known to exist (see, e.g., [LHL21]), but our results capture much broader classes.
Another distinction between the existing works in TDS learning and our work is that our results require significantly milder assumptions on the training distribution. In particular, while all prior works on TDS learning require both concentration and anti-concentration for the training marginal [KSV24b, KSV24a, CKK+24], we only assume strictly subexponential concentration in every direction. This is possible because the function classes we consider are Lipschitz, which is not the case for binary classification.
Testable Learning. More broadly, TDS learning is related to the notion of testable learning [RV23, GKK23, GKSV24b, DKK+23, GKSV24a, DKLZ24, STW24], originally defined by [RV23] for standard agnostic learning, aiming to certify optimal performance for learning algorithms without relying directly on any distributional assumptions. The main difference between testable agnostic learning and TDS learning is that in TDS learning, we allow for distribution shift, while in testable agnostic learning the training and test distributions are the same. Because of this, TDS learning remains challenging even in the absence of label noise, in which case testable learning becomes trivial [KSV24b].
Efficient Learning of Neural Networks. Many works have focused on providing upper and lower bounds on the computational complexity of learning neural networks in the standard (distribution-shift-free) setting [GKKT17, GK19, GGJ+20, GGK20, DGK+20, DKZ20, DKKZ20, DKTZ22, CGKM22, CKM22, CDG+23, WZDD23, GGKS24, DK24, LMZ20, GMOV19, ZYWG19, VW19, AZLL19, BJW19, MR18, GKLW19, GLM18, DLT18, GKM18, Tia17, LY17, BG17, ZSJ+17, ZLJ16a, JSA15]. The majority of the upper bounds either require noiseless labels and shallow architectures or work only under Gaussian training marginals. Our results not only hold in the presence of distribution shift, but also capture deeper architectures, under any strictly subexponential training marginal and allow adversarial label noise.
The upper bounds that are closest to our work are those given by [GKKT17]. They consider ReLU as well as sigmoid networks, allow for adversarial label noise and assume that the training marginal is bounded but otherwise arbitrary. Our results in Section 3 extend all of the results in [GKKT17] to the TDS setting, by assuming additionally that the training distribution is hypercontractive (see Definition 3.9). This additional assumption is important to ensure that our tests will pass when there is no distribution shift. For a more thorough technical comparison with [GKKT17], see Section 3.
In Section 4, we provide upper bounds for TDS learning of Lipschitz networks even when the training marginal is an arbitrary strictly subexponential distribution. In particular, our results imply new bounds for standard agnostic learning of single ReLU neurons, where we achieve a comparable runtime under any strictly subexponential marginal; the only previously known upper bounds work under the Gaussian marginal [DGK+20] and achieve a similar runtime. In fact, in the statistical query framework [Kea98], a runtime of this order is known to be necessary for agnostically learning the ReLU, even under the Gaussian distribution [DKZ20, GGK20].
2 Preliminaries
We use standard vector and matrix notation. We denote with $\mathbb{R}$ and $\mathbb{N}$ the sets of real and natural numbers, respectively. We denote with $\mathcal{D}, \mathcal{D}'$ labeled distributions over $\mathbb{R}^d \times \mathbb{R}$ and with $\mathcal{D}_{\mathbf{x}}, \mathcal{D}'_{\mathbf{x}}$ the corresponding marginals on the features in $\mathbb{R}^d$. For a set $S$ of points in $\mathbb{R}^d$, we define the empirical probabilities (resp. expectations) as $\mathbf{Pr}_{\mathbf{x}\sim S}[\cdot]$ (resp. $\mathbf{E}_{\mathbf{x}\sim S}[\cdot]$), i.e., with respect to the uniform distribution over $S$. We denote with $\bar{S}$ the labeled version of $S$, and we define the clipping function $\mathrm{cl}_M : \mathbb{R} \to [-M, M]$, which maps a number $t$ to itself if $|t| \le M$, and to $\mathrm{sign}(t)\cdot M$ otherwise.
Loss function. Throughout this work, we denote with $\mathcal{L}_{\mathcal{D}}(h)$ the squared loss of a hypothesis $h : \mathbb{R}^d \to \mathbb{R}$ with respect to a labeled distribution $\mathcal{D}$, i.e., $\mathcal{L}_{\mathcal{D}}(h) = \mathbf{E}_{(\mathbf{x},y)\sim\mathcal{D}}[(y - h(\mathbf{x}))^2]$. Moreover, for any real-valued function $f$ over $\mathbb{R}^d$, we denote with $\|f\|_{\mathcal{D}_{\mathbf{x}}}$ the quantity $(\mathbf{E}_{\mathbf{x}\sim\mathcal{D}_{\mathbf{x}}}[f(\mathbf{x})^2])^{1/2}$. For a set of labeled examples $\bar{S}$, we denote with $\mathcal{L}_{\bar{S}}(h)$ the empirical loss on $\bar{S}$, i.e., the squared loss with respect to the uniform distribution over $\bar{S}$, and similarly for the empirical analogue of $\|\cdot\|_{\mathcal{D}_{\mathbf{x}}}$.
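In code, the clipped empirical squared loss used throughout reads roughly as follows (an illustrative snippet in our notation; the function names are ours):

```python
import numpy as np

def clip(t, M):
    """cl_M: map t to itself if |t| <= M, and to sign(t) * M otherwise."""
    return np.clip(t, -M, M)

def empirical_squared_loss(h, labeled_sample, M=None):
    """Average of (y - h(x))^2 over a labeled sample; optionally clip h's predictions."""
    X, y = labeled_sample
    preds = np.array([h(x) for x in X])
    if M is not None:
        preds = clip(preds, M)
    return float(np.mean((y - preds) ** 2))
```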
Distributional Assumptions. In order to obtain efficient algorithms, we will either assume that the training marginal is bounded and hypercontractive (Section 3) or that it has strictly subexponential tails in every direction (Section 4). We make no assumptions on the test marginal .
Regarding the labels, we assume some mild bound on the moments of the training and the test labels, e.g., (a) a bound on some constant-degree moment of the labels, or (b) that the labels are bounded almost surely, for both $\mathcal{D}$ and $\mathcal{D}'$. Although, ideally, we would want to avoid any assumptions on the test distribution, as we show in Proposition D.1, a bound on some constant-degree moment of the test labels is necessary.
3 Bounded Training Marginals
We begin with the scenario where the training distribution is known to be bounded. In this case, it is known that one-hidden-layer sigmoid networks can be agnostically learned (in the classical sense, without distribution shift) in fully polynomial time, and single ReLU neurons can be agnostically learned in time polynomial in the dimension (with runtime depending on the target error) [GKKT17]. These results are based on a kernel-based approach, combined with results from polynomial approximation theory. While polynomial approximations can reduce the nonconvex agnostic learning problem to a convex one through polynomial feature expansions, the kernel trick enables further pruning of the search space, which is important for obtaining polynomial-time algorithms. Our work demonstrates another useful implication of the kernel trick: it leads to efficient algorithms for testing distribution shift.
We will require the following standard notions:
Definition 3.1 (Kernels [Mer09]).
A function $\mathcal{K} : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is a kernel. If for any set of points $\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(m)}$ in $\mathbb{R}^d$ the matrix $(\mathcal{K}(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}))_{i,j \in [m]}$ is positive semidefinite, we say that the kernel is positive definite. The kernel is symmetric if $\mathcal{K}(\mathbf{x}, \mathbf{x}') = \mathcal{K}(\mathbf{x}', \mathbf{x})$ for all $\mathbf{x}, \mathbf{x}' \in \mathbb{R}^d$.
Any PDS kernel is associated with some Hilbert space $\mathcal{H}$ and some feature map $\psi$ from $\mathbb{R}^d$ to $\mathcal{H}$.
Fact 3.2 (Reproducing Kernel Hilbert Space).
For any positive definite and symmetric (PDS) kernel $\mathcal{K}$, there is a Hilbert space $\mathcal{H}$, equipped with an inner product $\langle\cdot,\cdot\rangle_{\mathcal{H}}$, and a function $\psi : \mathbb{R}^d \to \mathcal{H}$ such that $\mathcal{K}(\mathbf{x}, \mathbf{x}') = \langle\psi(\mathbf{x}), \psi(\mathbf{x}')\rangle_{\mathcal{H}}$ for all $\mathbf{x}, \mathbf{x}' \in \mathbb{R}^d$. We call $\mathcal{H}$ the reproducing kernel Hilbert space (RKHS) for $\mathcal{K}$ and $\psi$ the feature map for $\mathcal{K}$.
There are three main properties of the kernel method. First, although the associated feature map $\psi$ may correspond to a vector in an infinite-dimensional space, the kernel may still be efficiently evaluated, due to its analytic expression in terms of the inputs $\mathbf{x}$ and $\mathbf{x}'$. Second, the class of bounded-norm linear functions over the feature space has Rademacher complexity independent of the dimension of $\mathcal{H}$, as long as the maximum value of $\mathcal{K}(\mathbf{x}, \mathbf{x})$ for $\mathbf{x}$ in the domain is bounded (Thm. 6.12 in [MRT18]). Third, the time complexity of finding the function in this class that best fits a given dataset is polynomial in the size of the dataset, due to the representer theorem (Thm. 6.11 in [MRT18]). Taken together, these properties constitute the basis of the kernel method, implying learners with runtime independent of the effective dimension of the learning problem.
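To make these three properties concrete, here is a minimal kernel ridge regression sketch (standard textbook material, not the exact program solved by Algorithm 1; the polynomial kernel and regularization constant are arbitrary illustrative choices). The fit touches only the $m \times m$ kernel matrix, and predictions use only kernel evaluations, so the runtime never depends on the dimension of the feature space.

```python
import numpy as np

def poly_kernel(x, z, degree=4):
    """A simple PDS kernel; its feature space has dimension ~ d^degree,
    but evaluating it costs only O(d)."""
    return (1.0 + np.dot(x, z)) ** degree

def fit_kernel_ridge(X, y, kernel, reg=1e-3):
    """Representer theorem: the best-fitting RKHS element is a combination of
    the feature maps of the data points, so we only solve for m coefficients."""
    K = np.array([[kernel(a, b) for b in X] for a in X])
    alpha = np.linalg.solve(K + reg * np.eye(len(X)), y)
    return lambda x_new: float(alpha @ np.array([kernel(a, x_new) for a in X]))

# Usage: fit on m samples in R^d; the cost is poly(m), independent of d^degree.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 10))
y = np.tanh(X @ rng.standard_normal(10))
h = fit_kernel_ridge(X, y, poly_kernel)
print(h(X[0]), y[0])
```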
In order to apply the kernel method to learn some function class , it suffices to show that the class can be represented sufficiently well by the class . We give the following definition.
Definition 3.3 (Approximate Representation).
Let $\mathcal{F}$ be a function class over $\mathbb{R}^d$ and $\mathcal{K}$ a PDS kernel, where $\mathcal{H}$ is the corresponding RKHS and $\psi$ the feature map for $\mathcal{K}$. We say that $\mathcal{F}$ can be $\epsilon$-approximately represented within radius $R$ with respect to $\mathcal{K}$ if for any $f \in \mathcal{F}$, there is $v \in \mathcal{H}$ with bounded norm $\|v\|_{\mathcal{H}} \le B$ such that $|f(\mathbf{x}) - \langle v, \psi(\mathbf{x})\rangle_{\mathcal{H}}| \le \epsilon$ for all $\mathbf{x}$ with $\|\mathbf{x}\|_2 \le R$.
For the purposes of TDS learning, we will also require the training marginal to be hypercontractive with respect to the kernel at hand. This is important to ensure that our test will accept whenever there is no distribution shift. More formally, we require the following.
Definition 3.4 (Hypercontractivity).
Let $\mathcal{D}$ be some distribution over $\mathbb{R}^d$, let $\mathcal{H}$ be a Hilbert space and let $\psi : \mathbb{R}^d \to \mathcal{H}$. We say that $\mathcal{D}$ is $(C, \ell)$-hypercontractive with respect to $\psi$ if for any $v \in \mathcal{H}$ and any $t \in \mathbb{N}$:
$\mathbf{E}_{\mathbf{x}\sim\mathcal{D}}\big[\langle v, \psi(\mathbf{x})\rangle_{\mathcal{H}}^{2t}\big] \le (Ct)^{\ell t}\cdot\big(\mathbf{E}_{\mathbf{x}\sim\mathcal{D}}[\langle v, \psi(\mathbf{x})\rangle_{\mathcal{H}}^{2}]\big)^{t}.$
If $\mathcal{K}$ is the PDS kernel corresponding to $\psi$, we also say that $\mathcal{D}$ is $(C, \ell)$-hypercontractive with respect to $\mathcal{K}$.
3.1 TDS Regression via the Kernel Method
We now give a general theorem on TDS regression for bounded distributions, under the following assumptions. Note that, although we assume that the training and test labels are bounded, this assumption can be relaxed in a black-box manner and bounding some constant-degree moment of the distribution of the labels suffices, as we show in Corollary D.2.
Assumption 3.5.
For a function class , and training and test distributions , over , we assume the following.
1. $\mathcal{F}$ is approximately represented within some radius $R$ w.r.t. a PDS kernel $\mathcal{K}$ (as per Definition 3.3), with appropriate accuracy and norm parameters.
2. The training marginal $\mathcal{D}_{\mathbf{x}}$ (1) is bounded within the ball of radius $R$ and (2) is hypercontractive with respect to $\mathcal{K}$ (as per Definition 3.4).
3. The training and test labels are both bounded in $[-M, M]$ for some $M > 0$.
Consider the function class $\mathcal{F}$, the kernel $\mathcal{K}$, and the parameters as defined in the assumption above. Then, we obtain the following theorem.
Theorem 3.6 (TDS Learning via the Kernel Method).
Under Assumption 3.5, Algorithm 1 learns the class $\mathcal{F}$ in the TDS regression setting up to excess error $\epsilon$ and with probability of failure at most $\delta$. The time complexity is polynomial in the number of samples drawn and in the evaluation time of the kernel $\mathcal{K}$.
Algorithm 1: TDS Regression via the Kernel Method (a kernelized least-squares fit on the training reference examples, followed by the spectral test described below).
The main ideas of our proof are the following.
Obtaining a concise reference feature map. The algorithm first draws reference sets from both the training and the test distributions. The representer theorem, combined with the approximate representation assumption (Definition 3.3), ensures that the reference examples define a new feature map $\Phi$ such that the ground truth $f^*$ can be approximately represented as a linear combination of the features in $\Phi$ with respect to both the training and the test marginals, i.e., both $\mathbf{E}_{\mathbf{x}\sim\mathcal{D}_{\mathbf{x}}}[(f^*(\mathbf{x}) - \mathbf{a}\cdot\Phi(\mathbf{x}))^2]$ and $\mathbf{E}_{\mathbf{x}\sim\mathcal{D}'_{\mathbf{x}}}[(f^*(\mathbf{x}) - \mathbf{a}\cdot\Phi(\mathbf{x}))^2]$ are small for some coefficient vector $\mathbf{a}$. In particular, we have the following.
Proposition 3.7 (Representer Theorem, modification of Theorem 6.11 in [MRT18]).
Suppose that a function $f$ can be $\epsilon$-approximately represented within radius $R$ w.r.t. some PDS kernel $\mathcal{K}$ (as per Definition 3.3). Then, for any set $S$ of examples in $\mathbb{R}^d$, there is a vector of coefficients $(a_{\mathbf{x}})_{\mathbf{x}\in S}$ such that the kernel hypothesis $p(\cdot) = \sum_{\mathbf{x}\in S} a_{\mathbf{x}}\,\mathcal{K}(\mathbf{x}, \cdot)$ satisfies the same pointwise approximation guarantee for $f$ on $S$, together with a corresponding bound on its RKHS norm.
Proof.
We first observe that there is some $v \in \mathcal{H}$ with bounded norm such that $|f(\mathbf{x}) - \langle v, \psi(\mathbf{x})\rangle_{\mathcal{H}}| \le \epsilon$ for all $\mathbf{x} \in S$, because, by Definition 3.3, there is a pointwise approximator for $f$ with respect to $\psi$. By Theorem 6.11 in [MRT18], this implies the existence of the desired coefficients. ∎
Note that, since the evaluation of the kernel hypothesis only involves kernel evaluations, we never need to compute the initial feature expansion $\psi$, which could be overly expensive.
Forming a candidate output hypothesis. We know that the reference feature map approximately represents the ground truth. However, having no access to test labels, we cannot directly hope to find the corresponding coefficient vector. Instead, we use only the training reference examples to find a candidate hypothesis $\hat{h}$ with close-to-optimal performance on the training distribution, which can also be expressed in terms of the reference feature map $\Phi$. It then suffices to test the quality of $\Phi$ on the test distribution.
Testing the quality of the reference feature map on the test distribution. We know that some function expressible in terms of $\Phi$ performs well on the test distribution (since it is close to the ground truth on a reference test set). We also know that the candidate output $\hat{h}$ performs well on the training distribution. Therefore, in order to ensure that $\hat{h}$ performs well on the test distribution, it suffices to show that the distance between $\hat{h}$ and this function under the test distribution is small. In fact, it suffices to bound this distance by the corresponding one under the training distribution, because $\hat{h}$ fits the training data well and the training-distribution distance is indeed small. Since we do not know the relevant coefficient vectors in advance, we need to run a test on $\Phi$ that certifies the desired bound for every choice of coefficients.
Using the spectral tester. We observe that, for any coefficient vector $\mathbf{v}$, the squared distance between the corresponding kernel hypotheses under the test marginal can be written as a quadratic form $\mathbf{v}^\top \Phi' \mathbf{v}$, where $\Phi' = \mathbf{E}_{\mathbf{x}\sim\mathcal{D}'_{\mathbf{x}}}[\Phi(\mathbf{x})\Phi(\mathbf{x})^\top]$, and similarly as $\mathbf{v}^\top \Phi^* \mathbf{v}$ under the training marginal, where $\Phi^* = \mathbf{E}_{\mathbf{x}\sim\mathcal{D}_{\mathbf{x}}}[\Phi(\mathbf{x})\Phi(\mathbf{x})^\top]$. Since we want to obtain a bound for all coefficient vectors, we essentially want to ensure that for any $\mathbf{v}$ we have $\mathbf{v}^\top \Phi' \mathbf{v} \le (1+\rho)\,\mathbf{v}^\top \Phi^* \mathbf{v}$ for some small $\rho$. Having a multiplicative bound is important because we do not have any bound on the norm of $\mathbf{v}$.
To implement the test, and since we cannot access $\Phi^*$ and $\Phi'$ directly, we draw fresh verification examples from the training and test marginals and run a spectral test on the corresponding empirical versions of these matrices. To ensure that the test will accept when there is no distribution shift, we use the following lemma (originally from [GSSV24]) on multiplicative spectral concentration for the empirical second-moment matrix, for which the hypercontractivity assumption (Definition 3.4) is important.
Lemma 3.8 (Multiplicative Spectral Concentration, Lemma B.1 in [GSSV24], modified).
Let $\mathcal{D}$ be a distribution over $\mathbb{R}^d$ and $\Phi : \mathbb{R}^d \to \mathbb{R}^m$ a feature map such that $\mathcal{D}$ is hypercontractive with respect to $\Phi$ (in the sense of Definition 3.4). Suppose that $S$ consists of i.i.d. examples from $\mathcal{D}$, and let $\hat{\Phi} = \mathbf{E}_{\mathbf{x}\sim S}[\Phi(\mathbf{x})\Phi(\mathbf{x})^\top]$ and $\Phi^* = \mathbf{E}_{\mathbf{x}\sim\mathcal{D}}[\Phi(\mathbf{x})\Phi(\mathbf{x})^\top]$. For any $\epsilon, \delta \in (0,1)$, if $|S|$ is sufficiently large, then with probability at least $1-\delta$ we have that, for every $\mathbf{v} \in \mathbb{R}^m$, $(1-\epsilon)\,\mathbf{v}^\top\Phi^*\mathbf{v} \le \mathbf{v}^\top\hat{\Phi}\mathbf{v} \le (1+\epsilon)\,\mathbf{v}^\top\Phi^*\mathbf{v}$.
Note that the multiplicative spectral concentration lemma requires access to independent samples. However, the reference feature map $\Phi$ depends on the reference examples. This is the reason why we do not reuse the reference examples but rather draw fresh verification examples. For the proof of Lemma 3.8, see Appendix A.
We now provide the full formal proof of Theorem 3.6. The full proof involves appropriate uniform convergence bounds for kernel hypotheses, which are important in order to shift from the reference to the verification examples and back.
Proof of Theorem 3.6.
Consider the reference feature map with . Let and . By 3.5, we know that there are functions with and , that uniformly approximate and within the ball of radius , i.e., and . Moreover, .
By Proposition 3.7, there is such that for with we have and . Let be a matrix in such that for . We additionally have that . Therefore, for any we have
where we used the Cauchy-Schwarz inequality. For with , we, hence, have (recall that ).
Similarly, by applying the representer theorem (Theorem 6.11 in [MRT18]) for , we have that there exists such that for with we have and . Since in Algorithm 1 is formed by solving a convex program whose search space includes , we have
(3.1)
In the following, we abuse the notation and consider to be a vector in , by appending zeroes, one for each of the elements of . Note that we then have , and, also, for all with .
Soundness. Suppose first that the algorithm has accepted. In what follows, we will use the triangle inequality of the norms to bound for functions the quantity by . We also use the inequality , as well as the fact that . We bound the test error of the output hypothesis of Algorithm 1 as follows.
Since for all and the hypothesis does not depend on the set , by a Hoeffding bound and the fact that is large enough, we obtain that , with probability at least . Moreover, we have . We have already argued that .
In order to bound the quantity , we observe that while the function does not depend on , the function does depend on and, therefore, standard concentration arguments fail to bound the in terms of . However, since we have clipped , and is of the form , we may obtain a bound using standard results from generalization theory (i.e., bounds on the Rademacher complexity of kernel-based hypotheses like Theorem 6.12 in [MRT18] and uniform convergence bounds for classes with bounded Rademacher complexity under Lipschitz and bounded losses like Theorem 11.3 in [MRT18]). In particular, we have that with probability at least
The corresponding requirement for is determined by the bounds on the Lipschitz constant of the loss function , with and , which is at most , the overall bound on this loss function, which is at most , as well as the bounds and (which give bounds on the Rademacher complexity).
By applying the Hoeffding bound, we are able to further bound the quantity by , with probability at least . We have effectively managed to bound the quantity by . This is important, because the set is a fresh set of examples and, therefore, independent from . Our goal is now to use the fact that our spectral tester has accepted. We have the following for the matrix with .
Since our test has accepted, we know that , for the matrix with . We note here that having a multiplicative bound of this form is important, because we do not have any upper bound on the norms of and . Instead, we only have bounds on distorted versions of these vectors, e.g., on , which does not imply any bound on the norm of , because could have very small singular values.
Overall, we have that
By using results from generalization theory once more, we obtain that with probability at least we have . This step is important, because the only fact we know about the quality of is that it outperforms every polynomial on the sample (not necessarily over the entire training distribution). We once more may use bounds on the values of and , this time without requiring clipping, since we know that the training marginal is bounded and, hence, the values of and are bounded as well. This was not true for the test distribution, since we did not make any assumptions about it.
In order to bound , we have the following.
(by Equation 3.1)
The first term above is bounded as , where the second term is at most (by the definition of ) and the first term can be bounded by , with probability at least , due to an application of the Hoeffding bound.
For the term we can similarly use the Hoeffding bound to obtain, with probability at least that .
Finally, for the term , we have that , as argued above.
Overall, we obtain a bound of the form , with probability at least , as desired.
Completeness. For the completeness criterion, we assume that the test marginal is equal to the training marginal. Then, by Lemma 3.8 (where we observe that any -hypercontractive distribution is also -hypercontractive), with probability at least , we have that for all , , because and the matrices are sums of independent samples of , where . It is crucial here that (which recall is formed by using ) does not depend on the verification samples and , which is why we chose them to be fresh. Therefore, the test will accept with probability at least .
Efficient Implementation. To compute the candidate hypothesis, we may run a least-squares program, in time polynomial in the number of reference examples. For the spectral tester, we first compute the SVD of the empirical training matrix and check that any vector in its kernel is also in the kernel of the empirical test matrix (this can be checked without computing the SVD of the latter); otherwise, we reject. Then, we let $Q^{\dagger/2}$ be the square root of the Moore–Penrose pseudoinverse of the empirical training matrix and find the maximum singular value of the matrix $Q^{\dagger/2}\,\hat{\Phi}'\,Q^{\dagger/2}$, where $\hat{\Phi}'$ is the empirical test matrix. If this value is higher than the acceptance threshold, we reject. Note that this is equivalent to solving the eigenvalue problem described in Algorithm 1. ∎
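The testing step just described can be sketched as follows (an illustrative implementation; the function names, the threshold `rho`, and the numerical tolerance are our own choices rather than the exact constants of Algorithm 1; the reference feature map is realized purely through kernel evaluations against the reference points):

```python
import numpy as np

def reference_feature_map(x, X_ref, kernel):
    """Phi(x): vector of kernel evaluations against the reference examples."""
    return np.array([kernel(r, x) for r in X_ref])

def second_moment(X, X_ref, kernel):
    F = np.array([reference_feature_map(x, X_ref, kernel) for x in X])
    return F.T @ F / len(X)

def spectral_test(X_ver_train, X_ver_test, X_ref, kernel, rho):
    """Accept iff v' B v <= (1 + rho) v' A v for all v, where A and B are the
    empirical train/test second-moment matrices of the reference features."""
    A = second_moment(X_ver_train, X_ref, kernel)   # empirical training matrix
    B = second_moment(X_ver_test, X_ref, kernel)    # empirical test matrix
    # Any direction in the kernel (nullspace) of A must also be in the kernel of B.
    U, s, _ = np.linalg.svd(A, hermitian=True)
    tol = 1e-10 * max(s.max(), 1.0)
    null = U[:, s <= tol]
    if null.size and np.linalg.norm(null.T @ B @ null) > tol:
        return False
    # Compare on the row space of A via the square root of its pseudoinverse.
    A_pinv_half = U @ np.diag([1 / np.sqrt(v) if v > tol else 0.0 for v in s]) @ U.T
    top = np.linalg.norm(A_pinv_half @ B @ A_pinv_half, 2)  # max singular value
    return top <= 1.0 + rho
```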
3.2 Applications
Having obtained a general theorem for TDS learning under Assumption 3.5, we can now instantiate it to obtain TDS learning algorithms for learning neural networks with Lipschitz activations. In particular, we recover all of the bounds of [GKKT17], using the additional assumption that the training distribution is hypercontractive in the following standard sense.
Definition 3.9 (Hypercontractivity).
We say that $\mathcal{D}_{\mathbf{x}}$ is $C$-hypercontractive if for all polynomials $p$ of degree $k$ and all $t \in \mathbb{N}$, we have that
$\mathbf{E}_{\mathbf{x}\sim\mathcal{D}_{\mathbf{x}}}\big[p(\mathbf{x})^{2t}\big] \le (Ct)^{kt}\cdot\big(\mathbf{E}_{\mathbf{x}\sim\mathcal{D}_{\mathbf{x}}}[p(\mathbf{x})^{2}]\big)^{t}.$
Note that many common distributions, like log-concave distributions or the uniform distribution over the hypercube, are known to be hypercontractive with some constant $C$ (see [CW01] and [O'D14]).
Function Class | Degree ($\ell$) | … (remaining parameters of Assumption 3.5)
---|---|---
Sigmoid Nets | |
$L$-Lipschitz Nets | |
In Table 2, we provide bounds on the parameters in Assumption 3.5 for sigmoid networks and $L$-Lipschitz networks, whose proofs we postpone to Appendix C (see Theorems C.17, C.19 and C.14). Combining the bounds from Table 2 with Theorem 3.6, we obtain the results in the middle column of Table 1.
4 Unbounded Distributions
We showed that the kernel method provides runtime improvements for TDS learning, because it can be used to obtain a concise reference feature map, whose spectral properties on the test distribution are all we need to check to certify low test error. A similar approach would not provide any runtime improvements for the case of unbounded distributions, because the dimension of the reference feature space would not be significantly smaller than the dimension of the multinomial feature expansion. Therefore, we can follow the standard moment-matching testing approach commonly used in TDS learning [KSV24b] and testable agnostic learning [RV23, GKK23].
4.1 Additional Preliminaries
We define the notion of subspace juntas, namely, functions that only depend on a low-dimensional projection of their input vector.
Definition 4.1 (Subspace Junta).
A function $f : \mathbb{R}^d \to \mathbb{R}$ is a $k$-subspace junta (where $k \le d$) if there exists a matrix $W \in \mathbb{R}^{k\times d}$ with orthonormal rows and a function $g : \mathbb{R}^k \to \mathbb{R}$ such that $f(\mathbf{x}) = g(W\mathbf{x})$ for all $\mathbf{x} \in \mathbb{R}^d$.
Note that by taking $k = d$ and letting $W$ be the identity matrix, the definition covers all functions $f : \mathbb{R}^d \to \mathbb{R}$.
Note that neural networks are $k$-subspace juntas, where $k$ is the number of neurons in the first hidden layer. We also define the following notion of a uniform polynomial approximation within a ball of a certain radius.
Definition 4.2 ($(\epsilon, R)$-Uniform Approximation).
For $\epsilon > 0$ and $R > 0$, we say that $p$ is an $(\epsilon, R)$-uniform approximation polynomial for a function $g$ if $|p(\mathbf{x}) - g(\mathbf{x})| \le \epsilon$ for all $\mathbf{x}$ with $\|\mathbf{x}\|_2 \le R$.
We obtain the following corollary, which gives the analogous uniform approximation guarantee for a $k$-subspace junta $f$, given a uniform approximation to the corresponding function $g$.
Corollary 4.3.
Let $f : \mathbb{R}^d \to \mathbb{R}$ be a $k$-subspace junta with $f(\mathbf{x}) = g(W\mathbf{x})$, and consider the corresponding function $g : \mathbb{R}^k \to \mathbb{R}$. Let $q$ be an $(\epsilon, R)$-uniform approximation polynomial for $g$, and define $p$ as $p(\mathbf{x}) = q(W\mathbf{x})$. Then $|p(\mathbf{x}) - f(\mathbf{x})| \le \epsilon$ for all $\mathbf{x}$ with $\|\mathbf{x}\|_2 \le R$ (indeed, since $W$ has orthonormal rows, $\|W\mathbf{x}\|_2 \le \|\mathbf{x}\|_2 \le R$, so the uniform guarantee for $q$ applies at $W\mathbf{x}$).
Finally, we consider any distribution with strictly subexponential tails in every direction, which we define as follows.
Definition 4.4 (Strictly Sub-exponential Distribution).
A distribution $\mathcal{D}_{\mathbf{x}}$ on $\mathbb{R}^d$ is $\gamma$-strictly subexponential if there exist constants $C, c > 0$ such that for all unit vectors $\mathbf{v} \in \mathbb{R}^d$ and all $r \ge 0$,
$\mathbf{Pr}_{\mathbf{x}\sim\mathcal{D}_{\mathbf{x}}}\big[|\mathbf{v}\cdot\mathbf{x}| \ge r\big] \le C\,e^{-c\,r^{1+\gamma}}.$
4.2 TDS Regression via Uniform Approximation
We will now present our main results on TDS regression for unbounded training marginals. We require the following assumptions.
Assumption 4.5.
For a function class $\mathcal{F}$ consisting of $k$-subspace juntas, and training and test distributions $\mathcal{D}, \mathcal{D}'$ over $\mathbb{R}^d \times \mathbb{R}$, we assume the following.
1. For every $f \in \mathcal{F}$ and every radius $R > 0$, there exists a uniform approximation polynomial (in the sense of Definition 4.2) for the corresponding function $g$, of degree at most $\ell$, where $\ell$ depends only on the class $\mathcal{F}$, the target accuracy, and $R$, and grows at most almost linearly with $R$.
2. For every $f \in \mathcal{F}$, the function values are bounded by a constant $b$ within the radius of approximation.
3. The training marginal $\mathcal{D}_{\mathbf{x}}$ is a $\gamma$-strictly subexponential distribution for some $\gamma > 0$.
4. The training and test labels are both bounded in $[-M, M]$ for some $M > 0$.
Consider the function class $\mathcal{F}$ and the parameters as defined in the assumption above. Then, we obtain the following theorem.
Theorem 4.6 (TDS Learning via Uniform Approximation).
Under Assumption 4.5, Algorithm 2 learns the class $\mathcal{F}$ in the TDS regression setting up to excess error $\epsilon$ and with probability of failure at most $\delta$. The time complexity is dominated by $d^{O(\ell')}$, where $\ell'$ is the polynomial degree chosen in the runtime analysis at the end of the proof.
Algorithm 2: TDS Regression via Uniform Approximation (a moment-matching test on the test sample, followed by low-degree polynomial regression over the training sample).
Note that Assumption 4.5 involves a low-degree uniform approximation assumption, which only holds within some bounded-radius ball. Since we work under unbounded distributions, we also need to handle the error outside the ball. To this end, we use the following lemma, which follows from results in [BDBGK18].
Lemma 4.7.
Suppose $g$ and its uniform approximator $p$ satisfy parts 1 and 2 of Assumption 4.5. Then, for every $\mathbf{x}$ with $\|\mathbf{x}\|_2 > R$, the approximator satisfies a growth bound of the form $|p(\mathbf{x})| \le C\,b\,(2\|\mathbf{x}\|_2/R)^{\ell}$, for an absolute constant $C$.
The lemma above gives a bound on the values of a low-degree uniform approximator outside the region of approximation. Therefore, we can hope to control the approximation error outside the ball by taking advantage of the tails of our target distribution, as well as by picking the radius $R$ sufficiently large. In order for the strictly subexponential tails to suffice, the quantitative dependence of the degree $\ell$ on $R$ is important. This is why we assume (see Assumption 4.5) that $\ell$ grows at most almost linearly with $R$. In particular, in order to bound the error contribution from the region outside the ball, we use Lemma 4.7, the Cauchy–Schwarz inequality, and the subexponential tail and moment bounds for the training marginal. Substituting the almost-linear dependence of $\ell$ on $R$, we observe that the overall bound on this contribution decays with $R$ whenever $\gamma$ is strictly positive. Therefore, the overall bound can be made arbitrarily small with an appropriate choice of $R$ (and therefore of $\ell$).
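To illustrate why an almost-linear degree suffices, here is a back-of-the-envelope sketch under an assumed growth bound $|p(\mathbf{x})| \le b\,(2\|\mathbf{x}\|_2/R)^{\ell}$ for $\|\mathbf{x}\|_2 > R$ (the shape of bound provided by Lemma 4.7) and an assumed tail bound $\mathbf{Pr}[\|\mathbf{x}\|_2 > r] \le e^{-(r/C)^{1+\gamma}}$ for the norm of the relevant low-dimensional projection of $\mathbf{x}$ (the precise constants are those of Lemma 4.7 and Definition 4.4). Splitting the region $\|\mathbf{x}\|_2 > R$ into dyadic shells,
\[
\mathbf{E}\big[p(\mathbf{x})^2\,\mathbb{1}\{\|\mathbf{x}\|_2 > R\}\big]
\;\le\; \sum_{j\ge 0} b^2\,\big(2^{\,j+2}\big)^{2\ell}\;\mathbf{Pr}\big[\|\mathbf{x}\|_2 > 2^{j}R\big]
\;\le\; \sum_{j\ge 0} b^2\,\exp\!\big(2\ell\,(j+2)\ln 2 \;-\; (2^{j}R/C)^{1+\gamma}\big),
\]
and the right-hand side is $e^{-\Omega(R^{1+\gamma})}$ once $R^{1+\gamma}$ dominates $\ell$ (up to constants). With $\ell = \tilde{O}(R)$, this holds for all sufficiently large $R$ precisely when $\gamma > 0$, which is why the almost-linear dependence of $\ell$ on $R$ matters.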
Apart from the careful manipulations described above, the proof of Theorem 4.6 follows the lines of the corresponding results for TDS learning through sandwiching polynomials [KSV24b].
The following lemma allows us to relate the squared loss of the difference of polynomials under a set and under , as long as we have a bound on the coefficients of the polynomials. This will be convenient in the analysis of the algorithm.
Lemma 4.8 (Transfer Lemma for Square Loss, see [KSV24b]).
Let $\mathcal{D}_{\mathbf{x}}$ be a distribution over $\mathbb{R}^d$ and let $S$ be a set of points in $\mathbb{R}^d$. If every empirical moment of $S$ of degree at most $2\ell$ is within some additive slack of the corresponding moment of $\mathcal{D}_{\mathbf{x}}$, then for any degree-$\ell$ polynomials $p_1, p_2$ with coefficients absolutely bounded by $B$, the empirical squared loss $\mathbf{E}_{\mathbf{x}\sim S}[(p_1(\mathbf{x}) - p_2(\mathbf{x}))^2]$ and the population squared loss $\mathbf{E}_{\mathbf{x}\sim\mathcal{D}_{\mathbf{x}}}[(p_1(\mathbf{x}) - p_2(\mathbf{x}))^2]$ differ by an additive error that scales with the slack, with $B^2$, and with the number of monomials of degree at most $2\ell$.
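A minimal sketch of the standard argument behind such a transfer bound: writing $q = (p_1 - p_2)^2 = \sum_\alpha c_\alpha \mathbf{x}^\alpha$, a polynomial of degree at most $2\ell$ whose coefficient norm $\sum_\alpha |c_\alpha|$ is controlled by $B$ and the number of monomials,
\[
\Big|\,\mathbf{E}_{\mathbf{x}\sim S}[q(\mathbf{x})] - \mathbf{E}_{\mathbf{x}\sim\mathcal{D}_{\mathbf{x}}}[q(\mathbf{x})]\,\Big|
= \Big|\sum_{\alpha} c_\alpha\,\big(\mathbf{E}_{\mathbf{x}\sim S}[\mathbf{x}^{\alpha}] - \mathbf{E}_{\mathbf{x}\sim\mathcal{D}_{\mathbf{x}}}[\mathbf{x}^{\alpha}]\big)\Big|
\le \Big(\sum_{\alpha}|c_\alpha|\Big)\cdot\max_{|\alpha|\le 2\ell}\Big|\mathbf{E}_{\mathbf{x}\sim S}[\mathbf{x}^{\alpha}] - \mathbf{E}_{\mathbf{x}\sim\mathcal{D}_{\mathbf{x}}}[\mathbf{x}^{\alpha}]\Big|,
\]
so matching all moments of degree at most $2\ell$ up to a small additive slack transfers the squared loss between $S$ and $\mathcal{D}_{\mathbf{x}}$.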
We are now ready to prove Theorem 4.6.
Proof of Theorem 4.6.
We will prove soundness and completeness separately.
Soundness. Suppose the algorithm accepts and outputs . Let and . By the uniform approximation assumption in 4.5, there are polynomials which are -uniform approximations for and , respectively. Let and have the corresponding matrices , respectively. Denote and . Note that for any , “unclipping” both functions will not increase their squared loss under any distribution, i.e. , which can be seen through casework on and when are in or not. Recalling that the training and test labels are bounded, we can use this fact as we bound the error of the hypothesis on .
The second inequality follows from unclipping the first term and by applying Hoeffding’s inequality, so that for , the second term is bounded with probability . Proceeding with more unclipping and using the triangle inequality:
(4.1)
We first bound . Since is an -uniform approximation to , we separately consider when we fall in the region of good approximation () or not.
Then by applying Cauchy-Schwarz, (and similarly for ):
By definition, . So it suffices to bound , since we now have
(4.2)
In order to bound this probability of the test samples falling outside the region of good approximation, we use the property that the first moments of are close to the moments of (as tested by the algorithm). Applying Markov’s inequality, we have
Write , where is a degree polynomial with each coefficient bounded in absolute value by (noting that since , then ). Let denote the coefficients of . Applying Lemma C.7, . By linearity of expectation, we also have , where . Since is -strictly subexponential, then by B.1, . Then, we can bound the numerator for some large constant . So we have that
Setting and for large enough makes the above probability at most so that . Thus, from Equation 4.2, we have that
(4.3)
We now bound the second term . By Lemma B.2, the first moments of will concentrate around those of whenever , and similarly the first moments of match with because the algorithm accepted. Using the transfer lemma (Lemma 4.8) when considering , along with the triangle inequality, we get:
where we note that we can bound , the sum of the magnitudes of the coefficients, by using Lemma C.6. Recall that by definition is an -approximate solution to the optimization problem in Algorithm 2, so . Plugging this in, we obtain
(4.4)
By applying Hoeffding’s inequality, we get that which holds with probability when . By unclipping , this is at most . Similarly, with probability , . It remains to bound and . The analysis for both is similar to how we bounded , except since we do not clip or we will instead take advantage of the bound on on (respectively on ). We show how to bound :
(4.5)
We can bound the first expectation term, , with since the same analysis holds for bounding , except instead of matching the first moments of with , we match the first moments of with . We use the strictly subexponential tails of to bound the second term. Cauchy-Schwarz gives
Note that by definition of and using that is an -uniform approximation of , then when . By Lemma C.6, for sufficiently large constant . Then since , . Then we have
where using B.1 we bound on similar to above, which can be upper bounded with for a sufficiently large constant. Take . We bound as follows:
where the first inequality follows from a union bound. Since is a degree polynomial, we can view as a degree-2 PTF. The class of these functions has VC dimension at most (e.g. by viewing it as the class of halfspaces in dimensions). Using standard VC arguments, whenever for some sufficiently large universal constant , with probability we have
Using the strictly subexponential tails of , we have
Choose . Putting it together:
We want to bound the first part with . Equivalently, we need to show that the exponent is . Substituting , we get that . Thus, it suffices to show that
This is satisfied when . Then, we have that
So, plugging this into Eq. 4.5, we have
The same argument will also give
Combining Eq. 4.3 and the above two bounds into Eq. 4.4, we have from Eq. 4.1 that
The result holds with probability at least (taking a union bound over bad events).
Completeness. For completeness, it is sufficient to ensure that for in Lemma B.2. This is because when , our test samples are in fact being drawn from the subexponential distribution . Then the moment concentration of subexponential distributions (Lemma B.2) gives that the empirical moments of are close to the moments of with probability . This is the only condition for acceptance, so when , the probability of acceptance is at least , as required.
Runtime. The runtime of the algorithm is , where . The two lower bounds on required in the proof are satisfied by setting . Note that setting satisfies the lower bounds on required in the proof. For we required that and also for in Lemma B.2. This is satisfied by choosing . Putting this altogether, we see that the runtime is where . ∎
4.3 Applications
In order to obtain end-to-end results for classes of neural networks (see the rightmost column of Table 1), we need to prove the existence of uniform polynomial approximators whose degree scales almost linearly with respect to the radius of approximation, for the reasons described above. For arbitrary Lipschitz nets (see Theorem C.17), we use a general tool from polynomial approximation theory, the multivariate Jackson's theorem (Theorem C.9). This gives us a polynomial with degree scaling linearly in the radius $R$ and polynomially in the remaining parameters, including the number of hidden units $k$ in the first layer.
For sigmoid nets, a more careful derivation yields improved bounds (see Theorem C.19), which have a poly-logarithmic (rather than polynomial) dependence on the corresponding parameters. Our construction involves composing approximators for the activations at each layer. Naively, the degree of this composition would be superlinear in the radius $R$. To get around this, we use the key property that the size of the output of a sigmoid network at any layer is memoryless, i.e., it has no dependence on the radius $R$. This follows from the fact that the sigmoid is uniformly bounded. Using this, we obtain an approximator with an almost-linear degree dependence on $R$. For more details, see Section C.5.
References
- [ACM24] Pranjal Awasthi, Corinna Cortes, and Mehryar Mohri. Best-effort adaptation. Annals of Mathematics and Artificial Intelligence, 92(2):393–438, 2024.
- [AZLL19] Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. Advances in neural information processing systems, 32, 2019.
- [BCK+07] John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman. Learning bounds for domain adaptation. Advances in neural information processing systems, 20, 2007.
- [BDBC+10] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine learning, 79:151–175, 2010.
- [BDBCP06] Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. Advances in neural information processing systems, 19, 2006.
- [BDBGK18] Shalev Ben-David, Adam Bouland, Ankit Garg, and Robin Kothari. Classical lower bounds from quantum upper bounds. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 339–349. IEEE, 2018.
- [BG17] Alon Brutzkus and Amir Globerson. Globally optimal gradient descent for a convnet with gaussian inputs. In International conference on machine learning, pages 605–614. PMLR, 2017.
- [BJW19] Ainesh Bakshi, Rajesh Jayaram, and David P Woodruff. Learning two layer rectified neural networks in polynomial time. In Conference on Learning Theory, pages 195–268. PMLR, 2019.
- [CDG+23] Sitan Chen, Zehao Dou, Surbhi Goel, Adam Klivans, and Raghu Meka. Learning narrow one-hidden-layer relu networks. In The Thirty Sixth Annual Conference on Learning Theory, pages 5580–5614. PMLR, 2023.
- [CGKM22] Sitan Chen, Aravind Gollakota, Adam Klivans, and Raghu Meka. Hardness of noise-free learning for two-hidden-layer neural networks. Advances in Neural Information Processing Systems, 35:10709–10724, 2022.
- [CKK+24] Gautam Chandrasekaran, Adam R Klivans, Vasilis Kontonis, Konstantinos Stavropoulos, and Arsen Vasilyan. Efficient discrepancy testing for learning with distribution shift. arXiv preprint arXiv:2406.09373, 2024.
- [CKM22] Sitan Chen, Adam R Klivans, and Raghu Meka. Learning deep relu networks is fixed-parameter tractable. In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pages 696–707. IEEE, 2022.
- [CW01] Anthony Carbery and James Wright. Distributional and $L^q$ norm inequalities for polynomials over convex bodies in $\mathbb{R}^n$. Mathematical Research Letters, 8(3):233–248, 2001.
- [DGK+20] Ilias Diakonikolas, Surbhi Goel, Sushrut Karmalkar, Adam R Klivans, and Mahdi Soltanolkotabi. Approximation schemes for relu regression. In Conference on learning theory, pages 1452–1485. PMLR, 2020.
- [DK24] Ilias Diakonikolas and Daniel M Kane. Efficiently learning one-hidden-layer relu networks via schur polynomials. In The Thirty Seventh Annual Conference on Learning Theory, pages 1364–1378. PMLR, 2024.
- [DKK+23] Ilias Diakonikolas, Daniel Kane, Vasilis Kontonis, Sihan Liu, and Nikos Zarifis. Efficient testable learning of halfspaces with adversarial label noise. Advances in Neural Information Processing Systems, 36, 2023.
- [DKKZ20] Ilias Diakonikolas, Daniel M Kane, Vasilis Kontonis, and Nikos Zarifis. Algorithms and sq lower bounds for pac learning one-hidden-layer relu networks. In Conference on Learning Theory, pages 1514–1539. PMLR, 2020.
- [DKLZ24] Ilias Diakonikolas, Daniel Kane, Sihan Liu, and Nikos Zarifis. Testable learning of general halfspaces with adversarial label noise. In The Thirty Seventh Annual Conference on Learning Theory, pages 1308–1335. PMLR, 2024.
- [DKTZ22] Ilias Diakonikolas, Vasilis Kontonis, Christos Tzamos, and Nikos Zarifis. Learning a single neuron with adversarial label noise via gradient descent. In Conference on Learning Theory, pages 4313–4361. PMLR, 2022.
- [DKZ20] Ilias Diakonikolas, Daniel Kane, and Nikos Zarifis. Near-optimal sq lower bounds for agnostically learning halfspaces and relus under gaussian marginals. Advances in Neural Information Processing Systems, 33:13586–13596, 2020.
- [DLLP10] Shai Ben David, Tyler Lu, Teresa Luu, and Dávid Pál. Impossibility theorems for domain adaptation. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 129–136. JMLR Workshop and Conference Proceedings, 2010.
- [DLT18] Simon S Du, Jason D Lee, and Yuandong Tian. When is a convolutional filter easy to learn? In 6th International Conference on Learning Representations, ICLR 2018, 2018.
- [Fer14] Dietmar Ferger. Optimal constants in the marcinkiewicz–zygmund inequalities. Statistics & Probability Letters, 84:96–101, 2014.
- [GGJ+20] Surbhi Goel, Aravind Gollakota, Zhihan Jin, Sushrut Karmalkar, and Adam Klivans. Superpolynomial lower bounds for learning one-layer neural networks using gradient descent. In International Conference on Machine Learning, pages 3587–3596. PMLR, 2020.
- [GGK20] Surbhi Goel, Aravind Gollakota, and Adam Klivans. Statistical-query lower bounds via functional gradients. Advances in Neural Information Processing Systems, 33:2147–2158, 2020.
- [GGKS24] Aravind Gollakota, Parikshit Gopalan, Adam Klivans, and Konstantinos Stavropoulos. Agnostically learning single-index models using omnipredictors. Advances in Neural Information Processing Systems, 36, 2024.
- [GK19] Surbhi Goel and Adam R Klivans. Learning neural networks with two nonlinear layers in polynomial time. In Conference on Learning Theory, pages 1470–1499. PMLR, 2019.
- [GKK23] Aravind Gollakota, Adam R Klivans, and Pravesh K Kothari. A moment-matching approach to testable learning and a new characterization of rademacher complexity. Proceedings of the fifty-fifth annual ACM Symposium on Theory of Computing, 2023.
- [GKKM20] Shafi Goldwasser, Adam Tauman Kalai, Yael Kalai, and Omar Montasser. Beyond perturbations: Learning guarantees with arbitrary adversarial test examples. Advances in Neural Information Processing Systems, 33:15859–15870, 2020.
- [GKKT17] Surbhi Goel, Varun Kanade, Adam Klivans, and Justin Thaler. Reliably learning the relu in polynomial time. In Satyen Kale and Ohad Shamir, editors, Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 1004–1042. PMLR, 07–10 Jul 2017.
- [GKLW19] Rong Ge, Rohith Kuditipudi, Zhize Li, and Xiang Wang. Learning two-layer neural networks with symmetric inputs. In International Conference on Learning Representations, 2019.
- [GKM18] Surbhi Goel, Adam Klivans, and Raghu Meka. Learning one convolutional layer with overlapping patches. In International conference on machine learning, pages 1783–1791. PMLR, 2018.
- [GKSV24a] Aravind Gollakota, Adam Klivans, Konstantinos Stavropoulos, and Arsen Vasilyan. Tester-learners for halfspaces: Universal algorithms. Advances in Neural Information Processing Systems, 36, 2024.
- [GKSV24b] Aravind Gollakota, Adam R Klivans, Konstantinos Stavropoulos, and Arsen Vasilyan. An efficient tester-learner for halfspaces. The Twelfth International Conference on Learning Representations, 2024.
- [GLM18] Rong Ge, Jason D Lee, and Tengyu Ma. Learning one-hidden-layer neural networks with landscape design. In 6th International Conference on Learning Representations, ICLR 2018, 2018.
- [GMOV19] Weihao Gao, Ashok V Makkuva, Sewoong Oh, and Pramod Viswanath. Learning one-hidden-layer neural networks under general input distributions. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1950–1959. PMLR, 2019.
- [GSSV24] Surbhi Goel, Abhishek Shetty, Konstantinos Stavropoulos, and Arsen Vasilyan. Tolerant algorithms for learning with arbitrary covariate shift. arXiv preprint arXiv:2406.02742, 2024.
- [HK19] Steve Hanneke and Samory Kpotufe. On the value of target data in transfer learning. Advances in Neural Information Processing Systems, 32, 2019.
- [HK24] Steve Hanneke and Samory Kpotufe. A more unified theory of transfer learning. arXiv preprint arXiv:2408.16189, 2024.
- [JSA15] Majid Janzamin, Hanie Sedghi, and Anima Anandkumar. Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods. arXiv preprint arXiv:1506.08473, 2015.
- [Kea98] Michael Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM (JACM), 45(6):983–1006, 1998.
- [KKSK11] Sham M Kakade, Varun Kanade, Ohad Shamir, and Adam Kalai. Efficient learning of generalized linear and single index models with isotonic regression. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 24. Curran Associates, Inc., 2011.
- [KSV24a] Adam R Klivans, Konstantinos Stavropoulos, and Arsen Vasilyan. Learning intersections of halfspaces with distribution shift: Improved algorithms and sq lower bounds. The Thirty Seventh Annual Conference on Learning Theory, 2024.
- [KSV24b] Adam R Klivans, Konstantinos Stavropoulos, and Arsen Vasilyan. Testable learning with distribution shift. The Thirty Seventh Annual Conference on Learning Theory, 2024.
- [KZZ24] Alkis Kalavasis, Ilias Zadik, and Manolis Zampetakis. Transfer learning beyond bounded density ratios. arXiv preprint arXiv:2403.11963, 2024.
- [LHL21] Qi Lei, Wei Hu, and Jason Lee. Near-optimal linear regression under distribution shift. In International Conference on Machine Learning, pages 6164–6174. PMLR, 2021.
- [LMZ20] Yuanzhi Li, Tengyu Ma, and Hongyang R Zhang. Learning over-parametrized two-layer neural networks beyond ntk. In Conference on learning theory, pages 2613–2682. PMLR, 2020.
- [LSSS14] Roi Livni, Shai Shalev-Shwartz, and Ohad Shamir. On the computational efficiency of training neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1, NIPS’14, page 855–863, Cambridge, MA, USA, 2014. MIT Press.
- [LY17] Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with relu activation. Advances in neural information processing systems, 30, 2017.
- [Mer09] James Mercer. Functions of positive and negative type, and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society A, 209:415–446, 1909.
- [MKFAS20] Mohammadreza Mousavi Kalan, Zalan Fabian, Salman Avestimehr, and Mahdi Soltanolkotabi. Minimax lower bounds for transfer learning with linear and one-hidden layer neural networks. Advances in Neural Information Processing Systems, 33:1959–1969, 2020.
- [MMR09] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In Proceedings of The 22nd Annual Conference on Learning Theory (COLT 2009), Montréal, Canada, 2009.
- [MR18] Pasin Manurangsi and Daniel Reichman. The computational complexity of training relu (s). arXiv preprint arXiv:1810.04207, 2018.
- [MRT18] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT press, second edition, 2018.
- [NS64] D. J. Newman and H. S. Shapiro. Jackson’s Theorem in Higher Dimensions, pages 208–219. Springer Basel, Basel, 1964.
- [O’D14] Ryan O’Donnell. Analysis of boolean functions. Cambridge University Press, 2014.
- [RMH+20] Ievgen Redko, Emilie Morvant, Amaury Habrard, Marc Sebban, and Younès Bennani. A survey on domain adaptation theory: learning bounds and theoretical guarantees. arXiv preprint arXiv:2004.11829, 2020.
- [RV23] Ronitt Rubinfeld and Arsen Vasilyan. Testing distributional assumptions of learning algorithms. Proceedings of the fifty-fifth annual ACM Symposium on Theory of Computing, 2023.
- [STW24] Lucas Slot, Stefan Tiegel, and Manuel Wiedmer. Testably learning polynomial threshold functions. arXiv preprint arXiv:2406.06106, 2024.
- [Tia17] Yuandong Tian. An analytical formula of population gradient for two-layered relu network and its applications in convergence and critical point analysis. In International conference on machine learning, pages 3404–3413. PMLR, 2017.
- [Ver18] Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.
- [VW19] Santosh Vempala and John Wilmes. Gradient descent for one-hidden-layer neural networks: Polynomial convergence and sq lower bounds. In Conference on Learning Theory, pages 3115–3117. PMLR, 2019.
- [WZDD23] Puqian Wang, Nikos Zarifis, Ilias Diakonikolas, and Jelena Diakonikolas. Robustly learning a single neuron via sharpness. In International Conference on Machine Learning, pages 36541–36577. PMLR, 2023.
- [ZLJ16a] Yuchen Zhang, Jason D Lee, and Michael I Jordan. l1-regularized neural networks are improperly learnable in polynomial time. In International Conference on Machine Learning, pages 993–1001. PMLR, 2016.
- [ZLJ16b] Yuchen Zhang, Jason D. Lee, and Michael I. Jordan. L1-regularized neural networks are improperly learnable in polynomial time. In Maria Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 993–1001, New York, New York, USA, 20–22 Jun 2016. PMLR.
- [ZSJ+17] Kai Zhong, Zhao Song, Prateek Jain, Peter L Bartlett, and Inderjit S Dhillon. Recovery guarantees for one-hidden-layer neural networks. In International conference on machine learning, pages 4140–4149. PMLR, 2017.
- [ZYWG19] Xiao Zhang, Yaodong Yu, Lingxiao Wang, and Quanquan Gu. Learning one-hidden-layer relu networks via gradient descent. In The 22nd international conference on artificial intelligence and statistics, pages 1524–1534. PMLR, 2019.
Appendix A Proof of Multiplicative Spectral Concentration Lemma
Here, we restate and prove the multiplicative spectral concentration lemma (Lemma 3.8).
Lemma A.1 (Multiplicative Spectral Concentration, Lemma B.1 in [GSSV24], modified).
Let $\mathcal{D}$ be a distribution over $\mathbb{R}^d$ and $\Phi : \mathbb{R}^d \to \mathbb{R}^m$ a feature map such that $\mathcal{D}$ is hypercontractive with respect to $\Phi$ (in the sense of Definition 3.4). Suppose that $S$ consists of i.i.d. examples from $\mathcal{D}$, and let $\hat{\Phi} = \mathbf{E}_{\mathbf{x}\sim S}[\Phi(\mathbf{x})\Phi(\mathbf{x})^\top]$ and $\Phi^* = \mathbf{E}_{\mathbf{x}\sim\mathcal{D}}[\Phi(\mathbf{x})\Phi(\mathbf{x})^\top]$. For any $\epsilon, \delta \in (0,1)$, if $|S|$ is sufficiently large, then with probability at least $1-\delta$ we have that, for every $\mathbf{v} \in \mathbb{R}^m$, $(1-\epsilon)\,\mathbf{v}^\top\Phi^*\mathbf{v} \le \mathbf{v}^\top\hat{\Phi}\mathbf{v} \le (1+\epsilon)\,\mathbf{v}^\top\Phi^*\mathbf{v}$.
Proof of Lemma 3.8.
Let be the compact SVD of (i.e., is square with dimension equal to the rank of and is not necessarily square). Note that such a decomposition exists (where the row and column spaces are both spanned by the same basis ), because , by definition. Moreover, note that is an orthogonal projection matrix that projects points in on the span of the rows of . We also have that, .
Consider and . Our proof consists of two parts. We first show that it is sufficient to prove that with probability at least and then we give a bound on the probability of this event.
Claim.
Suppose that for we have . Then, for any :
Proof.
Let , , and (i.e., , where is the component of lying in the nullspace of ). We have that .
Moreover, for , we have that and, hence, almost surely over . Therefore, we also have , with probability . Therefore, .
Observe, now, that and, hence, , because is a projection matrix. Overall, we obtain the following
Since and , we have that , which concludes the proof of the claim. ∎
It remains to show that for the matrix defined in the previous claim, we have with probability at least . The randomness of depends on the random choice of from . In the rest of the proof, therefore, consider all probabilities and expectations to be over . We have the following for .
We will now bound the expectation of . To this end, we define for . We have the following, by using Jensen’s inequality appropriately.
In order to bound the term above, we may use the Marcinkiewicz–Zygmund inequality (see [Fer14]) to exploit the independence of the samples in $S$ and obtain the following.
We now observe that , which is at most equal to . Therefore, we have and, by the hypercontractivity property (which we assume to be with respect to the standard inner product in ), we have . We can bound by applying the Cauchy-Schwarz inequality and using the bound for . In total, we have the following bound.
We choose such that and so that the bound is at most . ∎
Appendix B Moment Concentration of Subexponential Distributions
We prove the following bounds on the moments of subexponential distributions, which allows us to control error outside the region of good approximation.
Fact B.1 (see [Ver18]).
Let on be a -strictly subexponential distribution. Then for all , there exists a constant such that
In fact, the two conditions are equivalent.
We use the following bounds on the concentration of subexponential moments in the analysis of our algorithm. This will be useful in establishing the sample complexity required for the empirical moments of the sample to concentrate around the moments of the training marginal .
Lemma B.2 (Moment Concentration of Subexponential Distributions).
Let be a distribution over such that for any with and any we have for some . For , we denote with the quantity , where . Then, for any , if is a set of at least i.i.d. examples from for some sufficiently large universal constant , we have that with probability at least , the following is true.
Proof.
Let with . Consider the random variable . We have that and also the following.
where the last inequality follows from the Marcinkiewicz–Zygmund inequality (see [Fer14]). We have that . Since , we have that , which yields the desired result, due to the choice of and after a union bound over all the possible choices of (at most ). ∎
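As an informal complement to Lemma B.2, the following Monte Carlo sketch checks that empirical directional moments of a subexponential marginal concentrate around their population values. The Gaussian marginal, the moment order, and the set of directions are illustrative assumptions.

```python
"""
Monte Carlo sketch for moment concentration (Lemma B.2); the marginal,
directions, and moment order are illustrative assumptions.
"""
import numpy as np

rng = np.random.default_rng(1)
d, k, n = 5, 4, 50_000

X = rng.normal(size=(n, d))                      # strictly subexponential marginal
V = rng.normal(size=(20, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)    # unit directions, ||v|| = 1

emp = ((X @ V.T) ** k).mean(axis=0)              # empirical E[(v . x)^k]
true = np.prod(np.arange(k - 1, 0, -2))          # Gaussian moment (k-1)!! for even k
print("max deviation over directions:", np.abs(emp - true).max())
```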
Appendix C Polynomial Approximations of Neural Networks
In this section we derive the polynomial approximations of neural networks with Lipschitz activations needed to instantiate Theorem 3.6 for bounded distributions and Theorem 4.6 for unbounded distributions.
Recall the definition of a neural network.
Definition C.1 (Neural Network).
Let be an activation function with . Let with be the tuple of weight matrices. Here, is the input dimension and . Define recursively the function as with . The function computed by the neural network is defined as . We denote . The depth of this network is .
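For concreteness, the following sketch evaluates the recursion of Definition C.1 under the (assumed) reading f_1(x) = W_1 x and f_{i+1}(x) = W_{i+1} sigma(f_i(x)), for a Lipschitz activation vanishing at zero. The layer widths, the weights, and the choice of activation are illustrative assumptions.

```python
"""
Sketch of the recursive network definition (Definition C.1), under the
assumed reading f_1(x) = W_1 x, f_{i+1}(x) = W_{i+1} sigma(f_i(x)).
Widths, weights, and activation are illustrative choices.
"""
import numpy as np

def network(x, weights, sigma):
    """Evaluate the depth-t network defined by the list of weight matrices."""
    z = weights[0] @ x                    # f_1(x) = W_1 x
    for W in weights[1:]:
        z = W @ sigma(z)                  # f_{i+1}(x) = W_{i+1} sigma(f_i(x))
    return z

rng = np.random.default_rng(2)
relu = lambda z: np.maximum(z, 0.0)       # 1-Lipschitz activation with relu(0) = 0
weights = [rng.normal(size=(8, 10)), rng.normal(size=(4, 8)), rng.normal(size=(1, 4))]
x = rng.normal(size=10)
print("network output:", network(x, weights, relu))
```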
We also introduce some notation and basic facts that will be useful for this section.
C.1 Useful Notation and Facts
Given a univariate function on and a vector , the vector is defined as the vector with coordinate equal to . For a matrix , we use the following notation:
• ,
• ,
• .
Fact C.2.
Given a matrix , we have that
1. ,
2. .
Proof.
We first prove (1). We have that for an with ,
where the second inequality follows from the Cauchy–Schwarz inequality and the last inequality follows from the fact that for any vector , . We now prove (2). We have that
where the second inequality follows from the Cauchy–Schwarz inequality and the last inequality holds by definition. ∎
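The following is a quick numerical check, not a proof, of the standard norm inequalities the proof of Fact C.2 relies on (the operator norm bounds the stretch of any vector and is itself bounded by the Frobenius norm); the exact statement of Fact C.2 is as above, and the matrix and vector below are arbitrary illustrative choices.

```python
"""
Numerical check of the standard norm inequalities used around Fact C.2;
the matrix and vector are illustrative choices.
"""
import numpy as np

rng = np.random.default_rng(3)
M = rng.normal(size=(6, 4))
x = rng.normal(size=4)

op = np.linalg.norm(M, 2)         # operator (spectral) norm
fro = np.linalg.norm(M, 'fro')    # Frobenius norm
assert np.linalg.norm(M @ x) <= op * np.linalg.norm(x) + 1e-9   # ||Mx|| <= ||M||_op ||x||
assert op <= fro + 1e-9                                         # ||M||_op <= ||M||_F
print("operator norm:", op, " Frobenius norm:", fro)
```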
C.2 Results from Approximation Theory
The following are useful facts about the coefficients of approximating polynomials.
Fact C.3 (Lemma 23 from [GKKT17]).
Let be a polynomial of degree such that for . Then, the sum of squares of all its coefficients is at most .
Lemma C.4.
Let be a polynomial of degree such that for . Then, the sum of squares of all its coefficients is at most when .
Proof.
Consider . Clearly, for all . Thus, the sum of squares of its coefficients is at most by Fact C.3. Now, has coefficients bounded by when . ∎
Fact C.5 ([BDBGK18]).
Let be a polynomial with real coefficients on variables with degree such that for all , . Then the magnitude of any coefficient of is at most and the sum of the magnitudes of all coefficients of is at most .
Lemma C.6.
Let be a polynomial with real coefficients on variables with degree such that for all with , . Then the sum of the magnitudes of all coefficients of is at most for .
Proof.
Consider the polynomial . Then for , or equivalently for all . In particular, since the unit cube is contained in the radius ball, we have for . By Fact C.5, the sum of the magnitudes of the coefficients of is at most . Since , the sum of the magnitudes of the coefficients of is at most . ∎
Lemma C.7.
Let be a degree polynomial in such that each coefficient is bounded in absolute value by . Then the sum of the magnitudes of the coefficients of is at most .
Proof.
Note that has at most terms. Expanding gives at most terms, where any monomial is formed from a product of terms in . Then the coefficients of are bounded in absolute value by . Summing over all monomials gives the bound. ∎
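The rescaling step used in Lemmas C.4 and C.6 can be seen concretely in the following sketch: if q is bounded on the unit interval, then p(x) = q(x/R) is bounded on the radius-R interval, and the degree-j coefficient of p is that of q divided by R^j. The particular polynomial and radius are illustrative assumptions.

```python
"""
Sketch of the coefficient-rescaling trick behind Lemmas C.4 and C.6;
the polynomial and radius are illustrative choices.
"""
import numpy as np
from numpy.polynomial import polynomial as P

R = 4.0
q = np.array([0.2, -0.5, 0.1, 0.3])          # q(x) = 0.2 - 0.5x + 0.1x^2 + 0.3x^3, bounded on [-1, 1]
p = q / R ** np.arange(len(q))               # p(x) = q(x / R), bounded on [-R, R]

xs = np.linspace(-R, R, 7)
assert np.allclose(P.polyval(xs, p), P.polyval(xs / R, q))
print("sum of squared coefficients of q:", np.sum(q ** 2))
print("sum of squared coefficients of p:", np.sum(p ** 2))
```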
In the following lemma, we bound the magnitude of approximating polynomials for subspace juntas outside the radius of approximation.
Lemma C.8.
Let , and be a -subspace junta, and consider the corresponding function . Let be an -uniform approximation polynomial for , and define as . Let . Then
Proof.
Since is an -uniform approximation for , we have for . Let . Then for , and so for , or equivalently for . Write . By Lemma C.6, . Then for ,
where the second inequality holds because for all , and the last inequality holds because for when . Then since , we have for . ∎
The following is an important theorem that we use later to obtain uniform approximators for Lipschitz neural networks.
Theorem C.9 ([NS64]).
Let be a function continuous on the unit sphere . Let be the function defined as for any . Then, we have that there exists a polynomial of degree such that where is a universal constant.
This implies the following corollary.
Corollary C.10.
Let be an -Lipschitz function for and let . Then, for any , there exists a polynomial of degree such that is an -uniform approximation polynomial for .
Proof.
Consider the function . Then, we have that is -Lipschitz. From the statement of Theorem C.9, we have that . Thus, from Theorem C.9, there exists a polynomial of degree such that . Hence, we have that . Then is the required polynomial of degree . ∎
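As an illustration of Corollary C.10, the sketch below constructs uniform polynomial approximations of a 1-Lipschitz function on a bounded interval and reports the sup error for several degrees. Chebyshev interpolation is used purely as a convenient constructive stand-in for the approximation guaranteed by Theorem C.9; the function, radius, and degrees are illustrative assumptions.

```python
"""
Uniform polynomial approximation of a Lipschitz function (cf. Corollary C.10).
Chebyshev interpolation is an illustrative stand-in for Theorem C.9.
"""
import numpy as np
from numpy.polynomial import chebyshev as C

f = np.abs                  # 1-Lipschitz, a classically hard case for polynomials
R = 1.0                     # radius of approximation

for deg in (5, 20, 80):
    coeffs = C.chebinterpolate(f, deg)          # interpolate at Chebyshev nodes on [-1, 1]
    xs = np.linspace(-R, R, 2001)
    err = np.max(np.abs(C.chebval(xs, coeffs) - f(xs)))
    print(f"degree {deg:3d}: sup error on [-1, 1] = {err:.4f}")
```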
C.3 Kernel Representations
We now state and prove facts about kernel representations that we require. First, we recall the multinomial kernel from [GKKT17].
Definition C.11.
Consider the mapping , where is indexed by tuples for such that the value of at index is equal to . The kernel is defined as
We denote the corresponding RKHS as .
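Assuming, as in [GKKT17], that the feature map of Definition C.11 is indexed by ordered tuples of length at most d, the kernel collapses to a sum of powers of the inner product. The sketch below checks this identity against the explicit feature map on a small example; the dimension and degree are illustrative assumptions.

```python
"""
Sketch of the multinomial kernel (Definition C.11, following [GKKT17]):
kernel as a sum of powers of the inner product versus the explicit
feature map over ordered tuples.  Dimension and degree are illustrative.
"""
import numpy as np
from itertools import product

def multinomial_kernel(x, xp, d):
    """MK_d(x, x') = sum_{j=0}^{d} <x, x'>^j."""
    ip = float(np.dot(x, xp))
    return sum(ip ** j for j in range(d + 1))

def multinomial_features(x, d):
    """Explicit feature map: one coordinate per ordered tuple of length <= d."""
    feats = [1.0]                                      # empty tuple, degree 0
    for j in range(1, d + 1):
        feats += [np.prod([x[i] for i in t]) for t in product(range(len(x)), repeat=j)]
    return np.array(feats)

rng = np.random.default_rng(4)
x, xp, d = rng.normal(size=3), rng.normal(size=3), 3
assert np.isclose(multinomial_kernel(x, xp, d),
                  multinomial_features(x, d) @ multinomial_features(xp, d))
print("kernel value:", multinomial_kernel(x, xp, d))
```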
We now prove that polynomial approximators of subspace juntas can be represented as elements of .
Lemma C.12.
Let and . Let be a -subspace junta such that where is a function on and is a projection matrix from . Suppose, there exists a polynomial of degree such that and the sum of squares of coefficients of is bounded above by . Then, is -approximately represented within radius with respect to .
Proof.
We argue that there exists a vector such that and for all . Consider the polynomial of degree such that . We argue that for some and that . Let . From our assumption on , we have that . For , we define as . Given a multi-index , for any , we define as the number such that . We now compute the entry of indexed by . By expanding the expression for , we obtain that
We are now ready to bound . We have that
Here, the first inequality follows from the Cauchy–Schwarz inequality and the second follows by rearranging terms. The third inequality follows from the fact that the number of multi-indices of size from a set of elements is at most . The final inequality follows from the fact that the sum of the squares of the coefficients of is at most . ∎
We introduce an extension of the multinomial kernel that will be useful for our application to sigmoid-nets.
Definition C.13 (Composed multinomial kernel).
Let be a tuple in . We denote a sequence of mappings on inductively as follows:
1.
2. .
Let denote the number of coordinates in . This induces a sequence of kernels defined as
and a corresponding sequence of RKHS denoted by .
Observe that the multinomial kernel is an instantiation of the composed multinomial kernel.
We now state some properties of the composed multinomial kernel.
Lemma C.14.
Let be a tuple in and . Then, the following hold:
1. ,
2. For any , can be computed in time ,
3. For any and , we have that is a polynomial in of degree .
Proof.
We assume without loss of generality that as the kernel function is increasing in norm. To prove (1), observe that for any , we have that
We also have that . Thus, unrolling the recurrence gives us that .
The running time follows from the fact that and thus can be computed from with additions and exponentiation operations. Recursing gives the final runtime.
The fact that follows immediately from the fact that the entries of arise from the multinomial kernel and hence are polynomials in . The degree is at most . ∎
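The recursive evaluation described in the proof of Lemma C.14 can be sketched as follows: each level applies the degree-d_i multinomial kernel to the kernel value of the previous level, so the composed kernel is computed with a constant number of arithmetic operations per level. The base case (the plain inner product), the degree tuple, and the unit-norm inputs are illustrative assumptions.

```python
"""
Sketch of the recursive evaluation of the composed multinomial kernel
(Definition C.13 / Lemma C.14), assuming the base level is the plain
inner product; the degree tuple is an illustrative choice.
"""
import numpy as np

def composed_multinomial_kernel(x, xp, degrees):
    """K^{(0)} = <x, x'>;  K^{(i)} = sum_{j=0}^{d_i} (K^{(i-1)})^j."""
    k = float(np.dot(x, xp))
    for d in degrees:
        k = sum(k ** j for j in range(d + 1))
    return k

rng = np.random.default_rng(5)
x, xp = rng.normal(size=4), rng.normal(size=4)
x, xp = x / np.linalg.norm(x), xp / np.linalg.norm(xp)   # keep norms bounded by 1
print("composed kernel value:", composed_multinomial_kernel(x, xp, degrees=(3, 2)))
```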
We now argue that a distribution that is hypercontractive with respect to polynomials is hypercontractive with respect to the multinomial kernel.
Lemma C.15.
Let be a distribution on that is -hypercontractive for some constant . Then, is also -hypercontractive.
Proof.
The proof immediately follows from Definition 3.4 and Lemma C.14(3). ∎
C.4 Nets with Lipschitz activations
We are now ready to prove our theorem about uniform approximators for neural networks with Lipschitz activations. First, we prove that such networks describe a Lipschitz function.
Lemma C.16.
Let be the function computed by an -layer neural network with -Lipschitz activation function and weight matrices . Suppose that for and that the first hidden layer has neurons. Then we have that is -Lipschitz.
Proof.
First, observe from Fact C.2 that for all , (since ) and . Recall from Definition C.1 that we have the functions where and . We prove by induction on that . For the base case, observe that
where the second inequality follows from the Lipschitzness of and the final inequality follows from the definition of operator norm. We now proceed to the inductive step. Assume by induction that is at most . Thus, we have
where the third inequality follows from the Lipschitzness of and the inductive hypothesis. Thus, we get that . ∎
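The following sanity check illustrates Lemma C.16 numerically: for a network with 1-Lipschitz activations vanishing at zero, the ratio ||f(x) - f(y)|| / ||x - y|| never exceeds the product of the operator norms of the weight matrices. The architecture, weights, and sampling scheme are illustrative assumptions.

```python
"""
Numerical sanity check in the spirit of Lemma C.16: empirical Lipschitz
ratios versus the product of operator norms.  Architecture and weights
are illustrative choices.
"""
import numpy as np

rng = np.random.default_rng(6)
relu = lambda z: np.maximum(z, 0.0)                       # 1-Lipschitz, relu(0) = 0
weights = [rng.normal(size=(16, 10)), rng.normal(size=(8, 16)), rng.normal(size=(1, 8))]

def net(x):
    z = weights[0] @ x
    for W in weights[1:]:
        z = W @ relu(z)
    return z

lip_bound = np.prod([np.linalg.norm(W, 2) for W in weights])   # activation Lipschitz constant L = 1

worst = 0.0
for _ in range(2000):
    x, y = rng.normal(size=10), rng.normal(size=10)
    ratio = np.linalg.norm(net(x) - net(y)) / np.linalg.norm(x - y)
    worst = max(worst, ratio)
print(f"empirical Lipschitz ratio {worst:.2f} <= bound {lip_bound:.2f}")
```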
We now state our theorem regarding the uniform approximation of Lipschitz nets. We also prove that the approximators can be represented by low-norm vectors in for an appropriately chosen degree .
Theorem C.17.
Let . Let be a neural network with an -Lipschitz activation function , depth and weight matrices where . Let be the number of neurons in the first hidden layer. Then, there exists a polynomial of degree that is an -uniform approximation polynomial for . Furthermore, is -approximately represented within radius with respect to . In fact, when , it holds that is -approximately represented within with respect to .
Proof.
We can express as where is a projection matrix and is a neural network with input size . We observe that the Lipschitz constant of is the same as the Lipschitz constant of since is a projection matrix. From Lemma C.16, we have that is -Lipschitz. From Corollary C.10, we have that there exists a polynomial of degree that is an -uniform approximation for . From Lemma C.6, we have that the sum of squares of magnitudes of coefficients of is bounded by . Now, applying Lemma C.12 yields the result. When , we apply Lemma C.4 to obtain that the sum of squares of magnitudes of coefficients of is bounded by . ∎
C.5 Sigmoids and Sigmoid-nets
We now give a custom proof for the case of neural networks with sigmoid activations. We do this because we can hope to obtain degree for our polynomial approximation. We largely follow the proof technique of [GKKT17] and [ZLJ16b]. The modifications we make are to handle the case where the radius of approximation is a variable instead of a constant. We require (for our applications to strictly subexponential distributions) that the degree of approximation scale linearly in , a property that does not follow directly from the analysis given in [GKKT17]. We modify their analysis to achieve this linear dependence.
We first state a result regarding polynomial approximations for a single sigmoid activation.
Theorem C.18 ([LSSS14]).
Let denote the function . Let . Then, there exists a polynomial of degree such that . Also, the sum of the squares of the coefficients of is bounded above by .
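The degree scaling in Theorem C.18 can be observed empirically: the sketch below finds, for several radii, the smallest degree at which a polynomial approximation of the sigmoid achieves a fixed sup error over the interval. Chebyshev interpolation is used here purely as a convenient stand-in for the construction of [LSSS14], and the radii and target error are illustrative assumptions.

```python
"""
Empirical look at the degree needed to uniformly approximate the sigmoid
on a radius-R interval (cf. Theorem C.18); Chebyshev interpolation is an
illustrative stand-in for [LSSS14].
"""
import numpy as np
from numpy.polynomial import chebyshev as C

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

for R in (1.0, 4.0, 16.0):
    xs = np.linspace(-R, R, 4001)
    for deg in range(1, 200):
        cheb = C.chebinterpolate(lambda u: sigmoid(R * u), deg)   # approximate on [-1, 1], rescaled
        err = np.max(np.abs(C.chebval(xs / R, cheb) - sigmoid(xs)))
        if err <= 0.01:
            print(f"radius {R:5.1f}: degree {deg} suffices for sup error 0.01")
            break
```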
We now present a construction of a uniform approximation for neural networks with sigmoid activations. The construction is similar to the one in [GKKT17], but the analysis deviates since linear dependence on the radius of approximation is important to us.
Theorem C.19.
Let . Let on be a neural network with sigmoid activations, depth and weight matrices where . Also, let . Then, there exists a polynomial of degree that is an -uniform approximation polynomial for . Furthermore, is -approximately represented within radius with respect to , where is a tuple of degrees whose product is bounded by . Here, .
Proof.
First, let be the polynomial guaranteed by Theorem C.18 that -approximates the sigmoid in an interval of radius . Denote the degree of as . For all , let be the polynomial that -approximates the sigmoid up to radius . These have degree equal to . Let . For all , let . We know that .
We now construct the polynomial that approximates . For , define with . Define . Recall that is a vector of polynomials. We prove the following by induction: for every ,
1. ,
2. For each , we have that with .
where the function is as defined in Definition C.1.
The above holds trivially for and is an exact approximator. Also, from the definition of . Clearly, . We now prove that the above holds for , assuming it holds for .
We first prove (1). For , we have that
For the second inequality, we analyse the cases and separately. When , we have that and when . For , from the inductive hypothesis, we have that . The second term in the second inequality is bounded since is -Lipschitz.
We are now ready to prove that is representable by small norm vectors in for all . We have that
From the inductive hypothesis, we have that . Thus, we have that
We expand each term in the above sum. We obtain,
The second inequality follows from expanding the equation. indexed by for has entries given by . Putting things together, we obtain that
Thus, we have proved that is representable in . We now prove that the norm of the representation is small. We have that
We bound . For any , from the definition of and the inductive hypothesis, we have that
We analyse the case and separately. When , we have from the bound on the base case. When , we have
which completes the induction. We are ready to calculate the bound on the degree.
We have . Also, for , we have . Thus, the total degree is . The square of the norm of the kernel representation is bounded by where
This concludes the proof. ∎
C.6 Applications for Bounded Distributions
We first state and prove our end-to-end results on TDS learning sigmoid and Lipschitz nets over bounded marginals that are -hypercontractive for some constant .
Theorem C.20 (TDS Learning for Nets with Sigmoid Activation).
Let on be the class of neural networks with sigmoid activations, depth and weight matrices such that . Let . Suppose the training and test distributions over are such that the following are true:
1. is bounded within and is -hypercontractive for ,
2. The training and test labels are bounded in for some .
Then, Algorithm 1 learns the class in the TDS regression setting up to excess error and probability of failure . The time and sample complexity is
where .
Proof.
From Theorem C.19, we have that is -approximately represented within radius with respect to , where is a degree vector whose product is equal to . Also, from Lemma C.14, we have that , and the entries of the kernel can be computed in time . From Lemma C.15, we have that is hypercontractive. Now, we obtain the result by applying Theorem 3.6. ∎
The following corollary on TDS learning two layer sigmoid networks in polynomial time readily follows.
Corollary C.21.
Let on be the class of two-layer neural networks with weight matrices and sigmoid activations. Let and . Suppose the training and test distributions satisfy the assumptions from Theorem C.20 with . Then, Algorithm 1 learns the class in the TDS regression setting up to excess error and probability of failure in time and sample complexity .
Proof.
The proof immediately follows from Theorem C.20 by setting and the other parameters to the appropriate constants. ∎
Theorem C.22 (TDS Learning for Nets with Lipschitz Activation).
Let on be the class of neural networks with -Lipschitz activations, depth and weight matrices such that . Let . Suppose the training and test distributions over are such that the following are true:
1. is bounded within and is -hypercontractive for ,
2. The training and test labels are bounded in for some .
Then, Algorithm 1 learns the class in the TDS regression setting up to excess error and probability of failure . The time and sample complexity is , where . In particular, when , we have that the time and sample complexity is where
Proof.
From Theorem C.17, for we have that is -approximately represented within radius with respect to , where is a degree vector whose product is . For , we have that is -approximately represented within radius with respect to , where is a degree vector whose product is equal to . Also, from Lemma C.14, we have that , and the entries of the kernel can be computed in time . From Lemma C.15, we have that is hypercontractive. Now, we obtain the result by applying Theorem 3.6. ∎
The above theorem implies the following corollary about TDS learning the class of ReLUs.
Corollary C.23.
Let on be the class of ReLU functions with unit weight vectors. Suppose the training and test distributions satisfy the assumptions from Theorem C.22 with . Then, Algorithm 1 learns the class in the TDS regression setting up to excess error and probability of failure in time and sample complexity .
Proof.
The proof immediately follows from Theorem C.22 by setting and the activation to be the ReLU function. ∎
In particular, this implies that the class of ReLUs is TDS learnable in polynomial time when .
C.7 Applications for Unbounded Distributions
We are now ready to state our theorem for TDS learning neural networks with sigmoid activations.
Theorem C.24 (TDS Learning for Nets with Sigmoid Activation and Strictly Subexponential Marginals).
Let on be the class of neural networks with sigmoid activations, depth and weight matrices such that . Let . Suppose the training and test distributions over are such that the following are true:
1. is -strictly subexponential,
2. The training and test labels are bounded in for some .
Then, Algorithm 2 learns the class in the TDS regression setting up to excess error and probability of failure . The time and sample complexity is at most
where
Proof.
From Theorem C.19, we have that there is an -uniform approximation polynomial for with degree . Here, let . We also have that from the Lipschitzness of the sigmoid nets (Lemma C.16) and the fact that the sigmoid evaluated at has value . The theorem now directly follows from Theorem 4.6. ∎
We now state our theorem on TDS learning neural networks with arbitrary Lipschitz activations.
Theorem C.25 (TDS Learning for Nets with Lipschitz Activation and Strictly Subexponential Marginals).
Let on be the class of neural networks with -Lipschitz activations, depth and weight matrices such that . Let . Suppose the training and test distributions over are such that the following are true:
1. is -strictly subexponential,
2. The training and test labels are bounded in for some .
Then, Algorithm 2 learns the class in the TDS regression setting up to excess error and probability of failure . The time and sample complexity is at most
where .
Proof.
From Theorem C.17, we have that there is an -uniform approximation polynomial for with degree . Here, let . We also have that from the Lipschitz constant (Lemma C.16) and the fact that each individual activation has value at most when evaluated at (see Definition C.1). The theorem now directly follows from Theorem 4.6. ∎
Appendix D Assumptions on the Labels
Our main theorems involve assumptions on the labels of both the training and test distributions. Ideally, one would want to avoid any assumptions on the test distribution. However, we demonstrate that this is not possible, even when the training marginal and the training labels are bounded, and the test labels have bounded second moment. On the other hand, we show that obtaining algorithms that work for bounded labels is sufficient even in the unbounded labels case, as long as some moment of the labels (strictly higher than the second moment) is bounded.
We begin with the lower bound, which we state for the class of linear functions, but would also hold for the class of single ReLU neurons, as well as other unbounded classes.
Proposition D.1 (Label Assumption Necessity).
Let be the class of linear functions over , i.e., . Even if we assume that the training marginal is bounded within , that the training labels are bounded in , and that for the test labels we have where , no TDS regression algorithm with finite sample complexity can achieve excess error less than and probability of failure less than for .
The proof is based on the observation that, because we cannot make any assumption on the test marginal, the test distribution could take some very large value with very small probability while still being consistent with some linear function. The training distribution, on the other hand, gives no information about the ground truth and is information-theoretically indistinguishable from the constructed test distribution. Therefore, the tester must accept, and its output will have large excess error. The bound on the second moment of the labels does imply a bound on the excess error, but this bound cannot be made arbitrarily small by drawing more samples.
Proof of Proposition D.1.
Suppose, for contradiction, that we have a TDS regression algorithm for with excess error and probability of failure . Let be the sample complexity of the algorithm and such that . We consider three distributions over . First, outputs with probability . Second, outputs with probability and with probability , for some with . Third, outputs with probability and with probability .
We consider two instances of the TDS regression problem. The first instance corresponds to the case and . The second corresponds to the case and . Note that the assumptions we asserted regarding the test distribution and the test labels are true for both instances. For , in particular, we have . Moreover, in each of the cases, there is a hypothesis in that is consistent with all of the examples (either the hypothesis or ), so .
Note that the total variation distance between and is and similarly between and . Therefore, by the completeness criterion, as well as the fact that sampling only increases total variation distance at a linear rate, i.e., , we have that in each of the two instances, the algorithm will accept with probability at least (due to the definition of total variation distance: we know that the algorithm would accept with probability at least if the set of test examples was drawn from ; since is -close to , no algorithm can have different behavior if we substitute with , except with probability ; hence, any algorithm must accept with probability at least ).
Suppose that the algorithm accepts in both instances (which happens w.p. at least ). By the soundness criterion, with overall probability at least , we have the following.
The inequalities above cannot be satisfied simultaneously, so we have arrived at a contradiction. It only remains to argue that , which is true if we choose . Therefore, such a TDS regression algorithm cannot exist. ∎
The lower bound of Proposition D.1 demonstrates that, in the worst case, the best possible excess error scales with the second moment of the distribution of the test labels. In contrast, we show that a bound on any strictly higher moment is sufficient.
Corollary D.2.
Suppose that for any , we have an algorithm that learns a class in the TDS setting up to excess error , assuming that both the training and test labels are bounded in . Let and be the corresponding time and sample complexity upper bounds.
Then, in the same setting, there is an algorithm that learns up to excess error under the relaxed assumption that for both training and test labels we have for some and some strictly increasing, positive-valued and unbounded function. The corresponding time and sample complexity upper bounds are and .
The proof is based on the observation that the effect of clipping on the labels, as measured by the squared loss, can be controlled by drawing enough samples, whenever a moment that is strictly higher than the second moment is bounded.
Lemma D.3.
Let and be strictly increasing and surjective. Let be a random variable over such that . Then, for any , if , we have .
Proof of Lemma D.3.
We have that , because and always have the same sign, so and also if . Since is non-zero whenever , we have . We now use the fact that is increasing to conclude that . By choosing , we obtain the desired bound. ∎
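The effect quantified by Lemma D.3 is easy to observe numerically: for a label distribution with a bounded moment strictly higher than the second, the squared-loss contribution of clipping at threshold M decays as M grows. The heavy-tailed label distribution and thresholds below are illustrative assumptions.

```python
"""
Monte Carlo illustration of Lemma D.3: clipping heavy-tailed labels at a
large threshold changes them very little in squared loss.  Distribution
and thresholds are illustrative choices.
"""
import numpy as np

rng = np.random.default_rng(7)
y = rng.standard_t(df=5, size=1_000_000)          # moments strictly above the second are finite

for M in (2.0, 5.0, 10.0, 20.0):
    clipped = np.clip(y, -M, M)
    print(f"M = {M:5.1f}: E[(y - clip_M(y))^2] = {np.mean((y - clipped) ** 2):.2e}")
```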
We are now ready to prove Corollary D.2, by reducing TDS learning with moment-bounded labels to TDS learning with bounded labels.
Proof of Corollary D.2.
The idea is to reduce the problem under the relaxed label assumptions to a corresponding bounded-label problem for . In particular, consider a new training distribution and a new test distribution , where the samples are formed by drawing a sample from the corresponding original distribution and clipping the label to . Note that whenever we have access to i.i.d. examples from , we also have access to i.i.d. examples from and similarly for . Therefore, we may solve the corresponding TDS problem for and , to either reject or obtain some hypothesis such that
Our algorithm either rejects when the algorithm for the bounded labels case rejects or accepts and outputs . It suffices to show , because the marginal distributions do not change and completeness is, therefore, satisfied directly.
It suffices to show that for any distribution , we have . To this end, note that . We have the following.
The first inequality follows from an application of the triangle inequality for the -norm and the second inequality follows from Lemma D.3. The other side follows analogously. ∎