Reliable Learning of Halfspaces under Gaussian Marginals
Abstract
We study the problem of PAC learning halfspaces in the reliable agnostic model of Kalai et al. (2012). The reliable PAC model captures learning scenarios where one type of error is costlier than the other. Our main positive result is a new algorithm for reliable learning of Gaussian halfspaces on $\mathbb{R}^d$ with sample and computational complexity
where is the excess error and is the bias of the optimal halfspace. We complement our upper bound with a Statistical Query lower bound suggesting that the dependence is best possible. Conceptually, our results imply a strong computational separation between reliable agnostic learning and standard agnostic learning of halfspaces in the Gaussian setting.
1 Introduction
Halfspaces (or Linear Threshold Functions) are the class of functions of the form $h_{\mathbf{w},t}(\mathbf{x}) = \mathrm{sign}(\langle \mathbf{w}, \mathbf{x}\rangle + t)$, where $\mathbf{w} \in \mathbb{R}^d$ is called the weight vector and $t \in \mathbb{R}$ is called the threshold. The problem of learning halfspaces is one of the classical and most well-studied problems in machine learning—going back to the Perceptron algorithm [Ros58]—and has had great impact on many other influential techniques, including SVMs [Vap98] and AdaBoost [FS97].
Here we focus on learning halfspaces from random labeled examples. The computational complexity of this task crucially depends on the choice of the underlying model. For example, in the realizable PAC model (i.e., with clean labels), the problem is known to be efficiently solvable (see, e.g., [MT94]) via a reduction to linear programming. Unfortunately, this method is quite fragile and breaks down in the presence of noisy labels. In the noisy setting, the computational complexity of the problem depends on the choice of noise model and distributional assumptions. In this work, we study the problem of distribution-specific PAC learning of halfspaces, with respect to Gaussian marginals, in the reliable agnostic model of [KKM12]. Formally, we have the following definition.
Definition 1.1 ((Positive) Reliable Learning of Gaussian Halfspaces).
Let $\mathcal{C}$ be the class of halfspaces on $\mathbb{R}^d$. Given $\epsilon > 0$ and i.i.d. samples from a distribution $D$ supported on $\mathbb{R}^d \times \{\pm 1\}$, where the marginal $D_{\mathbf{x}}$ is the standard Gaussian $\mathcal{N}(0, I_d)$, we say that an algorithm reliably learns $\mathcal{C}$ to error $\epsilon$ if the algorithm with high probability outputs a hypothesis $h$ such that $\Pr_{(\mathbf{x},y)\sim D}[h(\mathbf{x})=1 \wedge y=-1] \leq \epsilon$ and $\Pr_{(\mathbf{x},y)\sim D}[h(\mathbf{x})=-1 \wedge y=1] \leq \mathrm{OPT} + \epsilon$, where $\mathrm{OPT} = \min_{f \in \mathcal{C}^+(D)} \Pr_{(\mathbf{x},y)\sim D}[f(\mathbf{x})=-1 \wedge y=1]$, with $\mathcal{C}^+(D) = \{f \in \mathcal{C} : \Pr_{(\mathbf{x},y)\sim D}[f(\mathbf{x})=1 \wedge y=-1] = 0\}$.
We say that the minimizer $f^* \in \mathcal{C}^+(D)$ is the optimal halfspace on distribution $D$, and when a hypothesis $h$ satisfies the conditions above, we say that $h$ is $\epsilon$-reliable with respect to $\mathcal{C}$ on distribution $D$.
In other words, a (positive) reliable learner makes almost no false positive errors, while also achieving essentially the best possible false negative error—as compared to any hypothesis with no false positive errors. Note that if there is no hypothesis (in the target class) that makes no false positive errors, then essentially there is no requirement on the false negative error of the returned hypothesis. The algorithm can then simply return the constant $-1$ hypothesis.
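To make the two error types concrete, the following minimal Python sketch (our own illustration, not part of the paper's algorithm) estimates the false positive and false negative rates of a candidate halfspace from labeled samples; the toy data-generating halfspace, the dimension, and the tolerance `eps` are placeholders.

```python
import numpy as np

def halfspace(w, t):
    """The LTF x -> sign(<w, x> + t), with sign(0) taken as +1."""
    return lambda X: np.where(X @ w + t >= 0, 1, -1)

def error_rates(h, X, y):
    """Empirical false-positive and false-negative rates of h on labeled data."""
    preds = h(X)
    fp = np.mean((preds == 1) & (y == -1))   # predicted +1 on a point labeled -1
    fn = np.mean((preds == -1) & (y == 1))   # predicted -1 on a point labeled +1
    return fp, fn

# Toy instance: labels come from a biased halfspace, which therefore has zero
# false-positive error, so a reliable learner must (essentially) match its
# false-negative error while keeping its own false-positive error below eps.
rng = np.random.default_rng(0)
d, n, eps = 10, 50_000, 0.01
w_star = np.zeros(d)
w_star[0] = 1.0
t_star = -1.0                                # positive region has Gaussian mass ~16%
X = rng.standard_normal((n, d))              # standard Gaussian marginal
y = np.where(X @ w_star + t_star >= 0, 1, -1)
fp, fn = error_rates(halfspace(w_star, t_star), X, y)
print(f"false-positive rate = {fp:.4f}, false-negative rate = {fn:.4f}")
```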
The reliable agnostic PAC model was introduced by [KKM12] as a one-sided analogue of the classical agnostic model [Hau92, KSS94], and has since been extensively studied in a range of works; see, e.g., [KKM12, KT14, GKKT17, DJ19, GKKM20, KK21]. The underlying motivation for this definition comes from learning situations where mistakes of one type (e.g., false positive) should be avoided at all costs. Typical examples include spam detection—where incorrectly detecting spam emails is much less costly than mislabeling an important email as spam—and detecting network failures—where false negatives may be particularly harmful. Such scenarios motivate the study of reliable learning, which can be viewed as minimizing a loss function with different costs for false positive and false negative errors (see, e.g., [Dom99, Elk01]). In a historical context, reliable learning is related to the Neyman-Pearson criterion [NP33] in hypothesis testing, which shows that the optimal strategy to minimize one type of error subject to the other type of error being bounded is to threshold the likelihood ratio function. More recently, reliable learning has been shown [KK21] to be equivalent to the PQ-learning model [GKKM20]—a recently introduced learning model motivated by covariate shift.
The algorithmic task of agnostic reliable learning can be quite challenging. While reliable learning can be viewed as minimizing a loss function with different costs for false positive and false negative errors, in general, such a loss function will result in a non-convex optimization problem. As pointed out in [KKM12], distribution-specific reliable learning can be efficiently reduced to distribution-specific agnostic learning (this reduction preserves the marginal distribution). A natural question, serving as a motivation for this work, is whether reliable learning is qualitatively easier computationally—for natural concept classes, and for halfspaces in particular.
Before we state our contributions, we briefly summarize prior work on agnostically learning Gaussian halfspaces. Recall that, in agnostic learning, the goal is to output a hypothesis with error $\mathrm{OPT}+\epsilon$, where $\mathrm{OPT}$ is the optimal 0-1 error within the target class. Prior work has essentially characterized the computational complexity of agnostically learning Gaussian halfspaces. Specifically, it is known that the polynomial regression algorithm of [KKMS08] is an agnostic learner with complexity $d^{\tilde{O}(1/\epsilon^2)}$ [DGJ09, DKN10]. Moreover, there is strong evidence that this complexity upper bound is tight, both in the Statistical Query (SQ) model [DKZ20, GGK20, DKPZ21] and under plausible cryptographic assumptions [DKR23, Tie23]. It is worth noting that the aforementioned hardness results hold even for the subclass of homogeneous halfspaces.
Given the aforementioned reduction of reliable learning to agnostic learning [KKM12], one can use $L_1$-regression as a reliable agnostic learner for Gaussian halfspaces, leading to complexity $d^{\tilde{O}(1/\epsilon^2)}$. Prior to this work, this was the best (and only known) reliable halfspace learner. Given the fundamental nature of this problem, it is natural to ask whether one can do better for reliable learning.
Is it possible to develop faster algorithms for reliably learning Gaussian halfspaces
compared to agnostic learning?
In this work, we provide an affirmative answer to this question.
To formally state our main result, we need the notion of the bias of a Boolean function.
Definition 1.2 (Bias of Boolean Function).
We define the bias of under the Gaussian distribution as .
The main result of this paper is encapsulated in the following theorem:
Theorem 1.3 (Main Result).
Let be a joint distribution of supported on with marginal and let be the bias of the optimal halfspace on distribution (with respect to Definition 1.1). There is an algorithm that uses many samples from , runs in time, and with high probability returns a hypothesis that is -reliable with respect to . Moreover, for , any SQ algorithm for the problem requires complexity .
For more detailed theorem statements of our upper and lower bounds, see Appendix B for the algorithmic result and Appendix C for the SQ hardness result.
Theorem 1.3 gives a significantly more efficient algorithm (in terms of both sample complexity and computational complexity) for reliably learning Gaussian halfspaces, compared to agnostic learning. Specifically, as long as is a universal constant, the overall complexity is polynomial in and quasi-polynomial in —as opposed to in the agnostic setting. While the complexity of our algorithm (namely, the multiplicative factor that is independent of ) increases as the optimal halfspace becomes more biased, it is always bounded above by the complexity of agnostic learning.
Finally, we note that our algorithm also applies to the fully reliable learning model [KKM12], since one can easily reduce fully reliable learning to positive reliable and negative reliable learning (as observed in [KKM12]).
1.1 Overview of Techniques
Instead of directly considering the optimal halfspace for a distribution (as defined in Definition 1.1), we introduce the following definition as a relaxation.
Definition 1.4 (Reliability Condition).
We say that a distribution $D$ supported on $\mathbb{R}^d \times \{\pm 1\}$ satisfies the reliability condition with respect to a function $f: \mathbb{R}^d \to \{\pm 1\}$ if $\Pr_{(\mathbf{x},y)\sim D}[f(\mathbf{x})=1 \wedge y=-1] = 0$, i.e., if $f$ makes no false positive errors on $D$.
Notice that a hypothesis being $\epsilon$-reliable on distribution $D$ is equivalent to it having false positive error at most $\epsilon$ and false negative error at most $\epsilon$ worse than that of any halfspace $f$ such that $D$ satisfies the reliability condition with respect to $f$ (instead of comparing only with the optimal halfspace). At the same time, given any fixed $f$, such a definition allows the adversary to arbitrarily corrupt the negative labels, thus allowing for a more manageable analysis.
We start with a brief overview of our SQ lower bound construction, followed by the high-level idea enabling our nearly matching algorithm.
SQ Lower Bound
As follows from Definition 1.4 and the subsequent discussion, the adversary in reliable learning can arbitrarily corrupt the negative labels. We leverage this idea to construct an instance where the corruption level suffices to match many moments, meaning that the labels are uncorrelated with any polynomial of low degree. To prove this, we construct an (infinite) feasibility Linear Program (LP) with the following property: if the LP is feasible, there exists an instance for which no low-degree polynomial correlates with the labels; see Lemma 3.3. To show the existence of such an instance, we leverage a technique from [DKPZ21] that exploits (an infinite-dimensional generalization of) LP duality. Specifically, we show that it is possible to add noise to the labels whose uncorrupted value is negative so that all low-degree moments are uncorrelated with the observed labels. This implies that no algorithm relying only on low-degree polynomials can succeed for reliable learning, and it allows us to show that SQ algorithms fail unless they use queries with tolerance at most or exponentially many queries.
Reliable Learning Algorithm
As discussed in the previous paragraph, an adversary can arbitrarily corrupt the negative labels, which can make various algorithmic approaches fail. In more detail, the adversary can corrupt the low-degree moments; that is, any SQ algorithm needs to rely on higher-order moment information. One of the main difficulties of this setting is that there may exist weight vectors far from the optimal one for which the 0-1 error of the corresponding halfspace is nearly the same as that of the optimal halfspace. This suggests that there may be no way to verify that an update decreases the angle between the current weight vector and the optimal vector. To overcome this obstacle, we develop an algorithm that performs a random walk over a “low-dimensional” subspace. To establish the correctness of our algorithm, we prove that at some point during its execution, the algorithm will find a weight vector sufficiently close to the optimal vector.
To show that such a low-dimensional subspace exists, we proceed as follows: We first prove that there exists a non-trivial polynomial of degree that correlates with the negative labels (by the term nontrivial, we are referring to the requirement that we cannot use the constant polynomial, as this trivially correlates with the labels); see Claim B.5. This implies that the low-degree moment tensors, i.e., for some between and , correlate non-trivially with , where is the target weight vector. Thus, we can use these moment tensors to construct a low-dimensional subspace. We leverage the structure of the problem, namely that the (observed) negative labels are uncorrupted, to show that the correlation is in fact almost constant. As a consequence, this allows us to construct a subspace whose dimension depends only on the degree of the polynomial (which is ) and the desired excess error .
We then proceed as follows: in each round, as long as the current solution is not optimal, i.e., there are negative samples on the region , our algorithm conditions on a thin strip, projects the points on the orthogonal complement of , and reapplies the above structure result. This leads (with constant probability) to an increase in the correlation between and . Assuming that the correlation is increased by in each round with probability at least , we roughly need at most successful updates. Therefore, if we run our algorithm for steps, it is guaranteed that with probability at least we will find a vector almost parallel to .
Prior Algorithmic Techniques
Roughly speaking, prior techniques for reliable learning relied on some variant of $L_1$-regression. In that sense, our algorithmic approach departs from direct polynomial approximation. Concretely, for distribution-free reliable learning, the algorithm from [KT14] uses a one-sided variant of regression—instead of approximating the target function in the $L_1$ norm, they minimize the hinge loss. For distribution-specific (and, in particular, Gaussian) reliable learning of halfspaces, the only previous approach uses the reduction of [KKM12] to agnostic learning. While our algorithm also leverages polynomial approximation, our approach employs significantly different ideas, and we believe it provides a novel perspective on this problem.
Technical Comparison with Prior Work
Our algorithm shares similarities with the algorithm of [DKK22] for learning Gaussian halfspaces with Massart noise (a weaker, semi-random noise model). Both algorithms perform a random walk in order to converge to the target halfspace. That said, our algorithm is fundamentally different from theirs for the following reasons. The algorithm of [DKK22] partitions the space into sufficiently fine angular regions and searches in each one of them for a direction along which to update the current hypothesis at random. Our algorithm instead conditions on , which are the points that are uncorrupted, and uses the guarantee (that we establish) that the first moments of the distribution (conditioned on ) correlate with the unknown optimal hypothesis. This leads to an algorithm with significantly faster runtime.
Our SQ lower bound leverages techniques from [DKPZ21]. Essentially, we formulate a linear program to construct a noise function that makes all low-degree polynomials uncorrelated with the labels . This condition suffices to show that no SQ algorithm can solve the problem without relying on higher moments. To prove the existence of such a noise function, we use (a generalization of) LP duality of an infinite LP, which provides the necessary conditions for the existence of such a function.
1.2 Related Work
This work is part of the broader research program of characterizing the efficient learnability of natural concept classes with respect to challenging noise models. The complexity of this general learning task depends on the underlying class, the distributional assumptions and the choice of noise model. A long line of prior work has made substantial progress in this direction for the class of halfspaces under Gaussians (and other natural distributions), and for both semi-random [ABHU15, DKTZ20a, ZSA20, DKTZ20b, DKK20, DKK21b, DKK22] and adversarial noise [KKMS08, KOS08, KLS09, ABL17, DKS18, DKTZ20c, DKK21a, DKTZ22]. An arguably surprising conceptual implication of our results is that the complexity of reliable learning for Gaussian halfspaces is qualitatively closer to the semi-random case.
1.3 Preliminaries
We use lowercase boldface letters for vectors and capitalized boldface letters for matrices and tensors. We use for the inner product between . For , we use for the projection of on the direction. Similarly, we use for the projection of on the orthogonal complement of . Additionally, let be the projection of on the subspace and reparameterized on . More precisely, let be the matrix whose columns form an (arbitrary) orthonormal basis for the subspace , and let . For and , we use to denote the -norm of . For a matrix or tensor , we denote by the Frobenius norm of .
We use to denote the standard normal distribution , where is the -dimensional zero vector and is the identity matrix. We use to denote a random variable with distribution . For a random variable (resp. a distribution ), we use (resp. ) to denote the probability density function or probability mass function of the random variable (resp. distribution ). We also use to denote the cdf function of . We use to denote the indicator function.
For a boolean function and a distribution supported on , we use (resp. ) to denote the false positive (resp. false negative) 0-1 error.
2 Algorithm for Reliably Learning Gaussian Halfspaces
In this section, we describe and analyze our algorithm establishing Theorem 1.3. Due to space limitations, some proofs have been deferred to Appendix B. For convenience, we will assume that (the bias of the optimal halfspace) is known to the algorithm and that the excess error satisfies . These assumptions are without loss of generality for the following reasons: First, one can efficiently reduce the unknown case to the case that is known, by guessing the value of . Second, if is large, there is a straightforward algorithm for the problem (simply output the best constant hypothesis).
Notation: We use the notation for the set of all LTFs on whose bias is equal to . Given the above assumptions, it suffices for us to give a reliable learning algorithm for instead of .
The high-level idea of the algorithm is as follows. Without loss of generality, we assume that there exists an -biased halfspace that correctly classifies all the points with label $-1$—since otherwise the algorithm can just return the constant $-1$ hypothesis.
Let be the optimal halfspace and let be our current guess of . Assuming that is not sufficiently close to and the hypothesis that classifies all the points as negative is not optimal, we show that there exists a low-degree polynomial of the form with correlation at least with the negative labels. By leveraging this structural result, we use a spectral algorithm to find a direction that is non-trivially correlated with with at least some constant probability.
Unfortunately, it is not easy to verify whether is non-trivial. However, conditioned on the algorithm always getting a that has good correlation, we show that it only takes at most steps to get sufficiently close to . Therefore, repeating this process many times will eventually find an accurate approximation to .
The structure of this section is as follows: In Section 2.1, we give our algorithm for finding a good direction. Section 2.2 describes and analyzes our random walk procedure.
2.1 Finding a Non-Trivial Direction
Here we present an algorithm that finds a direction that correlates non-trivially with the unknown optimal vector. We first show that there exists a zero-mean -degree polynomial that sign-matches the optimal hypothesis. Furthermore, using the fact that the optimal hypothesis always correctly guesses the sign of the negative points, this gives us that the polynomial correlates with the negative (clean) points. Using this intuition, our algorithm estimates the first moments of the distribution conditioned on . This guarantees that at least one moment correlates with the optimal hypothesis, as a linear combination of the moments generates the sign-matching polynomial. Then, by taking a random vector that lies in the high-influence directions (which form a low-dimensional subspace), we guarantee that with constant probability, this vector correlates well with the unknown optimal vector. The main result of the section is the following.
Proposition 2.1.
Let be a joint distribution of supported on with marginal and . Suppose satisfies the reliability condition with respect to with and . Then there is an algorithm that draws samples, has runtime, and with probability at least returns a unit vector such that .
We start by showing that for any distribution satisfying the reliability condition with respect to some , there exists a degree- zero-mean polynomial of the form that has correlation at least with the labels.
Lemma 2.2 (Correlation with an Orthonormal Polynomial).
Let be a joint distribution of supported on with marginal . Suppose satisfies the reliability condition with respect to . Then there exists a polynomial of degree at most such that , and .
Proof Sketch of Lemma 2.2.
Note that since is a zero-mean polynomial with respect to the Gaussian measure, its correlation with the labels is determined, up to a constant factor, by its expectation over the points with negative labels. Then, using the fact that the distribution satisfies the reliability condition, the statement boils down to showing that for any -mass inside the interval , the expectation of on this mass is at least . To prove this statement, it suffices to construct a polynomial that is non-negative on and such that the -tail of in that interval is at least . To achieve this, we show that the sign-matching polynomial used in [DKK22] meets our purpose. ∎
Our algorithm uses the following normalized Hermite tensor.
Definition 2.3 (Hermite Tensor).
For and , we define the -th Hermite tensor as
Given that there is such a polynomial, we show that if we take a flattened version of the Chow-parameter tensors (which turns them into matrices) and look at the space spanned by their top singular vectors, then a non-trivial fraction of must lie inside this subspace. We prove the following lemma, which is similar to Lemma 5.10 in [DKK22]. See Appendix B for the proof.
Lemma 2.4.
Let be the joint distribution of supported on with marginal . Let be a univariate, mean zero and unit variance polynomial of degree such that for some unit vector it holds for some . Let be an approximation of the order- Chow-parameter tensor such that . Denote by the subspace spanned by the left singular vectors of flattened whose singular values are greater than . Moreover, denote by the union of . Then, for , we have that
1. [a bound on the dimension of the resulting subspace], and
2. [a lower bound on the component of the optimal direction lying in that subspace].
By Lemma 2.4, taking a random unit vector in will give us . We are ready to prove Proposition 2.1. For the algorithm pseudocode, see Appendix B.
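The following Python sketch (our own simplification, not the paper's procedure) illustrates the direction-finding step: estimate low-degree moment tensors of the negatively labeled points, flatten them into matrices, keep the left singular vectors with large singular values, and output a random unit vector from the resulting subspace. The paper works with normalized Hermite (Chow-parameter) tensors and specific degree and threshold choices; here raw centered moments of degree one and two, and an ad-hoc threshold, stand in for them.

```python
import numpy as np

def candidate_direction(X, y, sv_threshold=0.05, rng=None):
    """Return a random unit vector in the span of the large left singular vectors of
    (flattened) low-degree moment tensors of the negatively labeled points.

    Simplification: raw centered moments of degree one and two stand in for the
    normalized Hermite (Chow-parameter) tensors used in the paper."""
    rng = np.random.default_rng() if rng is None else rng
    d = X.shape[1]
    neg = (y == -1).astype(float)                          # indicator of negative labels
    T1 = (neg[:, None] * X).mean(axis=0)                   # ~ E[1{y=-1} x]
    T2 = np.einsum('n,ni,nj->ij', neg, X, X) / len(X)      # ~ E[1{y=-1} x x^T]
    T2 -= neg.mean() * np.eye(d)                           # remove the isotropic part
    basis = []
    for M in (T1.reshape(d, 1), T2):
        U, s, _ = np.linalg.svd(M, full_matrices=False)
        basis.append(U[:, s > sv_threshold])               # keep high-influence directions
    B = np.concatenate(basis, axis=1)
    if B.shape[1] == 0:                                    # nothing above the threshold
        g = rng.standard_normal(d)
    else:
        Q, _ = np.linalg.qr(B)                             # orthonormal basis of the union subspace
        g = Q @ rng.standard_normal(Q.shape[1])
    return g / np.linalg.norm(g)

# Toy run: labels come from a biased halfspace along coordinate 0, so the recovered
# subspace (approximately) contains that coordinate direction.
rng = np.random.default_rng(0)
X = rng.standard_normal((100_000, 10))
y = np.where(X[:, 0] - 1.0 >= 0, 1, -1)
print(np.abs(candidate_direction(X, y, rng=rng)[0]))
```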
Proof Sketch of Proposition 2.1 .
2.2 Random Walk to Update Current Guess
We now describe how we use Proposition 2.1 to construct an algorithm for our learning problem. Let be the current guess for . For convenience, we assume that . Let be the distribution of conditioned on , where . We next show that the -marginals of are standard Gaussian and that satisfies the reliability condition with respect to the halfspace with and .
Lemma 2.5.
Let be the joint distribution of supported on with marginal and . Suppose satisfies the reliability condition with respect to . Suppose that with and does not satisfy . Let and be the distribution of given , then
1. has marginal distribution ,
2. satisfies the reliability condition with respect to , where and , and
3. .
Proof Sketch of Lemma 2.5.
Item 1 follows by definition. For Item 2, we consider two cases: ; and . For the case , we prove that the distribution satisfies the reliability condition with respect to , where . For the case , we prove that the distribution satisfies the reliability condition with respect to , where and . Then Item 3 follows from the fact that does not satisfy . ∎
By Lemma 2.5, given any current guess such that does not satisfy , the corresponding distribution satisfies the reliability condition with respect to and . Therefore, satisfies the assumptions of Proposition 2.1. So, if we apply the algorithm in Proposition 2.1, we will with probability at least get a unit vector such that and .
The following fact shows that by updating our current guess in the direction of with appropriate step size, we can get an updated guess with increased correlation with .
Fact 2.6 (Correlation Improvement, Lemma 5.13 in [DKK22]).
Fix unit vectors . Let such that , and with . Then, for , with , we have that .
Notice that because is unknown, we cannot always choose the optimal step size . Instead, we will use the same to do sufficiently many update steps such that after that many updates, we are certain that . We then take the new step size and repeat this process, until and are sufficiently close to each other.
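The geometric content of Fact 2.6 can be checked numerically: if the update direction has a nontrivial component along the part of the optimal vector orthogonal to the current guess, then for a suitably small step size the normalized update is better correlated with the optimal vector. A small numpy check (our own illustration; the correlation value 0.6 and step size 0.1 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 50
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)     # unknown target direction
w = rng.standard_normal(d)
w /= np.linalg.norm(w)               # current guess

# perp: the (unit-normalized) component of w_star orthogonal to w.
perp = w_star - (w_star @ w) * w
perp /= np.linalg.norm(perp)
# noise: a unit vector orthogonal to both w and w_star.
noise = rng.standard_normal(d)
noise -= (noise @ w) * w
noise -= (noise @ perp) * perp
noise /= np.linalg.norm(noise)

u = 0.6 * perp + 0.8 * noise         # unit update direction, correlation 0.6 with perp
lam = 0.1                            # step size (the paper tunes this by halving)
w_new = w + lam * u
w_new /= np.linalg.norm(w_new)
print(f"before: <w, w*> = {w @ w_star:+.4f}   after: <w', w*> = {w_new @ w_star:+.4f}")
```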
We are now ready to describe our algorithm (Algorithm 1) and prove its correctness.
Notice that given that we know the bias , must be either or . For convenience, we assume that . To account for the case that , we can simply run Algorithm 1 twice and pick the output halfspace with the smallest value (or even run a different efficient algorithm, since as explained in Appendix B). For convenience, we also assume . To account for the other case, we can simply initialize the step size for a sufficiently large constant instead of .
A more detailed version of Algorithm 1 is deferred to Appendix B.
1. Check if (with sufficiently small constant failure probability). If so, return the constant hypothesis. Set the initial step size , where is a sufficiently large universal constant and .
2. Initialize to be a random unit vector in . Let the update step size and repeat the following process until .
(a) Use samples from to check if the hypothesis satisfies . If so, go to Step (3).
(b) With probability, let . Let , and let be the distribution of for given . Use the algorithm of Proposition 2.1 on to find a unit vector such that and . Then update as follows:
3. Check if satisfies . If so, return and terminate. Repeat Step (2) many times where is a sufficiently large constant.
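The following Python sketch (our own schematic) mirrors the overall structure of Algorithm 1: maintain a current direction, stop as soon as the corresponding halfspace has small empirical false positive error, and otherwise take a random-signed step along a candidate direction computed from the currently misclassified negative points, renormalizing and periodically shrinking the step size, with several independent restarts. The direction-finding stub below (the mean of the false positive points projected orthogonally to the current guess) is a crude stand-in for the subroutine of Proposition 2.1, and all constants are placeholders rather than the values used in the paper.

```python
import numpy as np

def reliable_halfspace_random_walk(X, y, t, eps=0.01, rounds=200, restarts=5, seed=0):
    """Random-walk sketch in the spirit of Algorithm 1 (all constants are placeholders).

    X: samples with Gaussian marginal, shape (n, d); y: labels in {-1, +1};
    t: threshold of the target biased halfspace, assumed known.
    Returns the first direction w whose halfspace sign(<w, x> + t) has empirical
    false-positive error at most eps, or the best direction encountered otherwise."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    best = None
    for _ in range(restarts):
        w = rng.standard_normal(d)
        w /= np.linalg.norm(w)
        lam = 0.5                                    # initial step size (placeholder)
        for i in range(rounds):
            fp_mask = (X @ w + t >= 0) & (y == -1)   # false positives of the current guess
            fp = fp_mask.mean()
            if fp <= eps:
                return w                             # current halfspace is already reliable
            if best is None or fp < best[1]:
                best = (w.copy(), fp)
            # Crude stand-in for the subroutine of Proposition 2.1: the mean of the
            # falsely positive points, projected orthogonally to the current direction.
            u = X[fp_mask].mean(axis=0)
            u -= (u @ w) * w
            if np.linalg.norm(u) < 1e-12:
                break
            u /= np.linalg.norm(u)
            # The subroutine's direction is only guaranteed to correlate up to sign,
            # so the walk moves with a random sign and renormalizes.
            w = w + rng.choice([-1.0, 1.0]) * lam * u
            w /= np.linalg.norm(w)
            if (i + 1) % 50 == 0:
                lam /= 2                             # periodically shrink the step size
    return best[0] if best is not None else None
```

In the paper, the candidate direction comes from higher-order moment tensors and the conditioning is on a thin strip rather than on the false positive region; the outer loop structure and the random-signed, renormalized updates are the shared ingredients.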
Proof Sketch of the algorithmic part of Theorem 1.3.
Let be the optimal halfspace with bias. We need to show that with high probability Algorithm 1 returns a hypothesis such that and .
To do so, it suffices to show that ; given , follows from our choice of . For convenience, we can assume never satisfies in Step (2a) (otherwise, we are done). We can also assume that the subroutine in Proposition 2.1 always succeeds, since the algorithm repeats Step (2) sufficiently many times. Given the above conditions, using Fact 2.6, one can show that each time after many updates in Step (2b), we must have . Therefore, when we have , then , which implies . ∎
3 Nearly Matching SQ Lower Bound
In this section, we establish the SQ hardness result of Theorem 1.3. Due to space limitations, some proofs have been deferred to Appendix C.
Proof Overview
To establish our SQ lower bound for reliable learning, we first prove an SQ lower bound for a natural decision version of reliably learning -biased LTFs. We define the following decision problem over distributions.
Definition 3.1 (Decision Problem over Distributions).
Let be a fixed distribution and be a distribution family. We denote by the decision problem in which the input distribution is promised to satisfy either (a) or (b) , and the goal is to distinguish the two cases with high probability.
We show that given SQ access to a joint distribution of supported on with marginal , it is hard to solve the problem with the following distributions.
(a) Null hypothesis: is the distribution so that with probability independent of .
(b) Alternative hypothesis: , where is a family of distributions such that for any distribution , there exists an -biased LTF such that .
In order to construct such a family of distributions , we start by constructing a joint distribution of over such that the marginal distribution of is and the conditional distributions and both match many moments with the standard Gaussian . Moreover, there is probability mass on the positive side of the marginal distribution of that is purely associated with (i.e., where ). We then embed this distribution along a hidden direction inside the joint distribution of on to construct a family of hard-to-distinguish distributions using the “hidden-direction” framework developed in [DKS17] and enhanced in [DKPZ21, DKRS23].
We can now proceed with the details of the proof. We start by defining the pairwise correlation between distributions.
Definition 3.2 (Pairwise Correlation).
The pairwise correlation of two distributions with pdfs with respect to a distribution with density , where the support of contains the support of and , is defined as . Furthermore, the -squared divergence of to is defined as .
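For concreteness, the snippet below (our own illustration) numerically evaluates the standard pairwise-correlation quantity from [FGR17], $\chi_{D}(D_1, D_2) = \int D_1(x) D_2(x)/D(x)\,dx - 1$, for simple one-dimensional examples relative to the standard Gaussian; the specific test distributions are arbitrary.

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm

def pairwise_correlation(p1, p2, p_ref, lo=-12.0, hi=12.0):
    """chi_D(D1, D2): the integral of p1(x) * p2(x) / p_ref(x) over x, minus 1
    (the pairwise correlation of [FGR17]), computed by numerical integration."""
    val, _ = integrate.quad(lambda x: p1(x) * p2(x) / p_ref(x), lo, hi)
    return val - 1.0

std = norm(0, 1).pdf
shifted = norm(0.3, 1).pdf                              # a slightly shifted Gaussian
print(pairwise_correlation(std, std, std))              # ~0: identical to the reference
print(pairwise_correlation(shifted, shifted, std))      # the chi-squared divergence, > 0
print(pairwise_correlation(shifted, std, std))          # ~0: second argument is the reference
```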
In particular, the framework in [DKS17] allows us to construct a family of distributions on whose pairwise correlation is where is the number of matching moments has with the standard Gaussian. Then, using standard SQ dimension techniques, this gives an SQ lower bound for the distinguishing problem. After that, we reduce the distinguishing problem to the problem of reliably learning -biased LTFs under Gaussian marginals with additive error .
In order to construct such a distribution of supported on , we reparameterize as . For a function , we use for its norm. We let denote the set of all functions that have finite -norm.
We use linear programming over one-dimensional functions in space to establish the existence of such a function. Specifically, we show the following:
Lemma 3.3.
For any sufficiently large , there exists a function such that satisfies the following properties:
-
(i)
for all , where , and
-
(ii)
for all .
Proof Sketch of Lemma 3.3.
We let denote the set of all polynomials of degree at most and let denote the set of all nonnegative functions in . Then, using linear programming, we will get the following primal:
find a function such that the moment conditions (i) and the pointwise constraints (ii) of Lemma 3.3 hold.
Then, using (infinite-dimensional) LP duality, we get that the above primal is feasible if and only if there is no polynomial of degree such that .
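To make the LP viewpoint concrete, the sketch below (our own simplified discretization; the exact primal of the paper is omitted above) searches for conditional label means $g(z) = \mathbf{E}[y \mid z] \in [-1,1]$ on the region $z \le t$ (labels are fixed to $+1$ above the threshold, as the reliability condition requires) such that the observed labels are uncorrelated with the first $k$ Hermite polynomials, while a prescribed amount of probability mass still carries label $-1$. The script only reports whether the finite LP is feasible for the chosen parameters; the threshold, degree, and mass values are placeholders.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermeval
from scipy.optimize import linprog
from scipy.stats import norm

def moment_matching_corruption(t=1.5, k=4, neg_mass=0.3, grid=2001, span=6.0):
    """Discretized LP feasibility sketch (a simplification of the primal above):
    look for conditional label means g(z) = E[y | z] in [-1, 1] on z <= t, with
    labels fixed to +1 on z > t as the reliability condition requires, such that
    the labels are uncorrelated with He_1, ..., He_k under the Gaussian and at
    least `neg_mass` probability mass carries label -1."""
    z = np.linspace(-span, span, grid)
    w = norm.pdf(z) * (z[1] - z[0])                       # Gaussian quadrature weights
    neg, pos = z <= t, z > t
    He = lambda i, x: hermeval(x, [0.0] * i + [1.0])      # probabilist's Hermite He_i
    # Uncorrelatedness: sum_{z<=t} w g He_i = -sum_{z>t} w He_i for i = 1, ..., k.
    A_eq = np.stack([w[neg] * He(i, z[neg]) for i in range(1, k + 1)])
    b_eq = np.array([-np.sum(w[pos] * He(i, z[pos])) for i in range(1, k + 1)])
    # Negative mass: sum_{z<=t} w (1 - g) / 2 >= neg_mass, i.e. w.g <= sum(w) - 2*neg_mass.
    A_ub = w[neg][None, :]
    b_ub = np.array([w[neg].sum() - 2.0 * neg_mass])
    res = linprog(c=np.zeros(int(neg.sum())), A_ub=A_ub, b_ub=b_ub,
                  A_eq=A_eq, b_eq=b_eq, bounds=[(-1.0, 1.0)] * int(neg.sum()),
                  method="highs")
    return res.success

# Report whether a feasible corruption exists for this discretization and degree.
for k in (2, 4, 8):
    print(k, moment_matching_corruption(k=k))
```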
We have proven the existence of the function in Lemma 3.3. Now, we construct a joint distribution of on such that is exactly , as we discussed in the proof outline. For a joint distribution of supported on , we will use to denote the conditional distribution of given ; and for the distribution of given .
Lemma 3.4.
For any sufficiently large , there exists a distribution on such that for :
(i) the marginal distribution ;
(ii) for all ;
(iii) and for all ;
(iv) .
Proof Sketch for Lemma 3.4.
The properties here directly follow from the properties of . ∎
Using the framework introduced in [DKS17, DKPZ21], we can construct a set of alternative hypothesis distributions on , where is a set of exponentially many pairwise nearly-orthogonal vectors and the marginal distribution of each on direction is the distribution in Lemma 3.4. This effectively embeds the distribution in a hidden direction . The size of the family is exponential in , and the distributions in it have small pairwise correlations. The details of the construction are deferred to Appendix C. Now, we are ready to give a proof sketch for the SQ hardness part of our main theorem, Theorem 1.3.
Proof Sketch of the SQ hardness part of Theorem 1.3.
Let be the set of distributions discussed above. We also let be the joint distribution of such that and independent of . Then, using standard SQ dimension techniques, one can show that any SQ algorithm that solves must either use queries of tolerance at most or make at least queries. By reducing the decision problem to reliably learning -biased LTFs with accuracy, we get the lower bound part of the statement in Theorem 1.3. ∎
4 Conclusions and Open Problems
In this paper, we study the problem of learning halfspaces under Gaussian marginals in the reliable learning model. Our main contribution is the design of the first efficient learner for this task whose complexity beats the complexity of agnostically learning this concept class. Moreover, we provide rigorous evidence, via an SQ lower bound, that no fully-polynomial time algorithm exists for general halfspaces. The obvious open question is whether the dependence on in the complexity of our algorithm can be improved. Specifically, is it possible to design a reliable learner with complexity ? Is it possible to obtain similarly efficient reliable learners under more general marginal distributions (e.g., strongly log-concave or discrete distributions)? More broadly, it would be interesting to characterize the computational separation between (distribution-specific) reliable and agnostic learning for other natural concept classes.
References
- [ABHU15] P. Awasthi, M. F. Balcan, N. Haghtalab, and R. Urner. Efficient learning of linear separators under bounded noise. In Proceedings of The 28th Conference on Learning Theory, COLT 2015, pages 167–190, 2015.
- [ABL17] P. Awasthi, M. F. Balcan, and P. M. Long. The power of localization for efficiently learning linear separators with noise. J. ACM, 63(6):50:1–50:27, 2017.
- [Bog98] V. Bogachev. Gaussian measures. Mathematical surveys and monographs, vol. 62, 1998.
- [DGJ09] I. Diakonikolas, P. Gopalan, R. Jaiswal, R. Servedio, and E. Viola. Bounded independence fools halfspaces. In Proc. 50th IEEE Symposium on Foundations of Computer Science (FOCS), pages 171–180, 2009.
- [DJ19] A. Durgin and B. Juba. Hardness of improper one-sided learning of conjunctions for all uniformly falsifiable csps. In Algorithmic Learning Theory, ALT, volume 98 of Proceedings of Machine Learning Research, pages 369–382, 2019.
- [DKK20] I. Diakonikolas, D. M. Kane, V. Kontonis, C. Tzamos, and N. Zarifis. A polynomial time algorithm for learning halfspaces with Tsybakov noise. arXiv, 2020.
- [DKK21a] I. Diakonikolas, D. M. Kane, V. Kontonis, C. Tzamos, and N. Zarifis. Agnostic proper learning of halfspaces under gaussian marginals. In Proceedings of The 34th Conference on Learning Theory, COLT, 2021.
- [DKK21b] I. Diakonikolas, D. M. Kane, V. Kontonis, C. Tzamos, and N. Zarifis. Efficiently learning halfspaces with Tsybakov noise. STOC, 2021.
- [DKK22] I. Diakonikolas, D. M. Kane, V. Kontonis, C. Tzamos, and N. Zarifis. Learning general halfspaces with general Massart noise under the gaussian distribution. In STOC ’22: 54th Annual ACM SIGACT Symposium on Theory of Computing, Rome, Italy, June 20 - 24, 2022, pages 874–885. ACM, 2022.
- [DKN10] I. Diakonikolas, D. M. Kane, and J. Nelson. Bounded independence fools degree-2 threshold functions. In FOCS, pages 11–20, 2010.
- [DKPZ21] I. Diakonikolas, D. M. Kane, T. Pittas, and N. Zarifis. The optimality of polynomial regression for agnostic learning under gaussian marginals. In Proceedings of The 34th Conference on Learning Theory, COLT, 2021.
- [DKR23] I. Diakonikolas, D. M. Kane, and L. Ren. Near-optimal cryptographic hardness of agnostically learning halfspaces and relu regression under gaussian marginals. In International Conference on Machine Learning, ICML, volume 202 of Proceedings of Machine Learning Research, pages 7922–7938, 2023.
- [DKRS23] I. Diakonikolas, D. Kane, L. Ren, and Y. Sun. SQ lower bounds for non-gaussian component analysis with weaker assumptions. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, 2023.
- [DKS17] I. Diakonikolas, D. M. Kane, and A. Stewart. Statistical query lower bounds for robust estimation of high-dimensional gaussians and gaussian mixtures. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 73–84, 2017.
- [DKS18] I. Diakonikolas, D. M. Kane, and A. Stewart. Learning geometric concepts with nasty noise. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, pages 1061–1073, 2018.
- [DKTZ20a] I. Diakonikolas, V. Kontonis, C. Tzamos, and N. Zarifis. Learning halfspaces with Massart noise under structured distributions. In Conference on Learning Theory, COLT, 2020.
- [DKTZ20b] I. Diakonikolas, V. Kontonis, C. Tzamos, and N. Zarifis. Learning halfspaces with Tsybakov noise. arXiv, 2020.
- [DKTZ20c] I. Diakonikolas, V. Kontonis, C. Tzamos, and N. Zarifis. Non-convex SGD learns halfspaces with adversarial label noise. In Advances in Neural Information Processing Systems, NeurIPS, 2020.
- [DKTZ22] I. Diakonikolas, V. Kontonis, C. Tzamos, and N. Zarifis. Learning general halfspaces with adversarial label noise via online gradient descent. In International Conference on Machine Learning, ICML 2022, volume 162 of Proceedings of Machine Learning Research, pages 5118–5141. PMLR, 2022.
- [DKZ20] I. Diakonikolas, D. M. Kane, and N. Zarifis. Near-optimal SQ lower bounds for agnostically learning halfspaces and ReLUs under Gaussian marginals. In Advances in Neural Information Processing Systems, NeurIPS, 2020.
- [Dom99] P. M. Domingos. Metacost: A general method for making classifiers cost-sensitive. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 155–164, 1999.
- [Elk01] C. Elkan. The foundations of cost-sensitive learning. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, IJCAI, pages 973–978, 2001.
- [Fan68] K. Fan. On infinite systems of linear inequalities. Journal of Mathematical Analysis and Applications, 21(3):475 – 478, 1968.
- [FGR17] V. Feldman, E. Grigorescu, L. Reyzin, S. Vempala, and Y. Xiao. Statistical algorithms and a lower bound for detecting planted cliques. J. ACM, 64(2):8:1–8:37, 2017.
- [FS97] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119–139, 1997.
- [GGK20] S. Goel, A. Gollakota, and A. R. Klivans. Statistical-query lower bounds via functional gradients. In Advances in Neural Information Processing Systems, NeurIPS, 2020.
- [GKKM20] S. Goldwasser, A. T. Kalai, Y. Kalai, and O. Montasser. Beyond perturbations: Learning guarantees with arbitrary adversarial test examples. In Advances in Neural Information Processing Systems, NeurIPS, 2020.
- [GKKT17] S. Goel, V. Kanade, A. R. Klivans, and J. Thaler. Reliably learning the relu in polynomial time. In Proceedings of the 30th Conference on Learning Theory, COLT 2017, volume 65 of Proceedings of Machine Learning Research, pages 1004–1042, 2017.
- [Hau92] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100:78–150, 1992.
- [KK21] A. T. Kalai and V. Kanade. Efficient learning with arbitrary covariate shift. In Algorithmic Learning Theory, 2021, volume 132 of Proceedings of Machine Learning Research, pages 850–864, 2021.
- [KKM12] A. T. Kalai, V. Kanade, and Y. Mansour. Reliable agnostic learning. J. Comput. Syst. Sci., 78(5):1481–1495, 2012.
- [KKMS08] A. Kalai, A. Klivans, Y. Mansour, and R. Servedio. Agnostically learning halfspaces. SIAM Journal on Computing, 37(6):1777–1805, 2008. Special issue for FOCS 2005.
- [KLS09] A. Klivans, P. Long, and R. Servedio. Learning Halfspaces with Malicious Noise. Journal of Machine Learning Research, 10:2715–2740, 2009.
- [KOS08] A. Klivans, R. O’Donnell, and R. Servedio. Learning geometric concepts via Gaussian surface area. In Proc. 49th IEEE Symposium on Foundations of Computer Science (FOCS), pages 541–550, Philadelphia, Pennsylvania, 2008.
- [KSS94] M. Kearns, R. Schapire, and L. Sellie. Toward Efficient Agnostic Learning. Machine Learning, 17(2/3):115–141, 1994.
- [KT14] V. Kanade and J. Thaler. Distribution-independent reliable learning. In Proceedings of The 27th Conference on Learning Theory, COLT, volume 35 of JMLR Workshop and Conference Proceedings, pages 3–24, 2014.
- [MT94] W. Maass and G. Turan. How fast can a threshold gate learn? In S. Hanson, G. Drastal, and R. Rivest, editors, Computational Learning Theory and Natural Learning Systems, pages 381–414. MIT Press, 1994.
- [Nel73] E. Nelson. The free markoff field. Journal of Functional Analysis, 12(2):211–227, 1973.
- [NP33] J. Neyman and E. S. Pearson. On the problem of the most efficient tests for statistical hypotheses. Philos. Trans. R. Soc. Lond. Ser. A Contain. Pap. Math. Phys. Character, 231:281–337, 1933.
- [Ros58] F. Rosenblatt. The Perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–407, 1958.
- [Tie23] S. Tiegel. Hardness of agnostically learning halfspaces from worst-case lattice problems. In The Thirty Sixth Annual Conference on Learning Theory, COLT, volume 195 of Proceedings of Machine Learning Research, pages 3029–3064, 2023.
- [Vap98] V. Vapnik. Statistical learning theory. Wiley, 1998.
- [ZSA20] C. Zhang, J. Shen, and P. Awasthi. Efficient active learning of sparse halfspaces with arbitrary bounded noise. In Advances in Neural Information Processing Systems, NeurIPS, 2020.
Appendix
Organization
The supplementary material is structured as follows: Appendix A includes additional preliminaries. Appendix B includes the omitted proofs from Section 2. Finally, Appendix C includes the omitted proofs from Section 3.
Appendix A Additional Preliminaries
We use to denote the -dimensional unit sphere, i.e., .
Basics of the SQ Model
In order to give an SQ lower bound for reliable learning, we first present the basics of the SQ model. We define the pairwise correlation, which is used for the SQ lower bound.
Definition A.1 (Pairwise Correlation).
The pairwise correlation of two distributions with probability density function with respect to a distribution with density , where the support of contains the support of and , is defined as . Furthermore, the -squared divergence of to is defined as .
We require the following definitions and lemmata from [FGR17] connecting pairwise correlation and SQ lower bounds.
Definition A.2.
We say that a set of distributions over is -correlated relative to a distribution if for all , and for .
Definition A.3 (Statistical Query Dimension).
For , a decision problem , where is a fixed distribution and is a family of distributions, let be the maximum integer such that there exists a finite set of distributions such that is -correlated relative to and . The Statistical Query dimension with pairwise correlations of is defined to be , and denoted by .
Lemma A.4.
Let be a decision problem, where is the reference distribution and is a class of distributions. For , let . For any , any SQ algorithm for must either use queries of tolerance at most or make at least queries.
Basics of Hermite Polynomials
We require the following definitions used in our work.
Definition A.5 (Normalized Hermite Polynomial).
For , we define the -th probabilist's Hermite polynomial as . We define the -th normalized Hermite polynomial as .
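For concreteness, the probabilist's Hermite polynomials satisfy the three-term recurrence $He_{k+1}(z) = z\,He_k(z) - k\,He_{k-1}(z)$ with $He_0 = 1$ and $He_1(z) = z$, and dividing by $\sqrt{k!}$ makes them orthonormal under the standard Gaussian. The short Python check below (our own illustration) generates them this way and verifies orthonormality by Monte Carlo.

```python
import math
import numpy as np

def hermite_prob(k, z):
    """Probabilist's Hermite polynomial He_k(z) via the recurrence
    He_{k+1}(z) = z * He_k(z) - k * He_{k-1}(z)."""
    h_prev, h = np.ones_like(z, dtype=float), z.astype(float)
    if k == 0:
        return h_prev
    for j in range(1, k):
        h_prev, h = h, z * h - j * h_prev
    return h

def hermite_normalized(k, z):
    """Normalized Hermite polynomial h_k = He_k / sqrt(k!), orthonormal under N(0, 1)."""
    return hermite_prob(k, z) / math.sqrt(math.factorial(k))

# Check orthonormality E[h_j(z) h_k(z)] = 1{j = k} under z ~ N(0, 1) by Monte Carlo.
rng = np.random.default_rng(0)
z = rng.standard_normal(2_000_000)
for j, k in [(2, 2), (3, 3), (2, 3), (1, 4)]:
    print(j, k, np.mean(hermite_normalized(j, z) * hermite_normalized(k, z)).round(3))
```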
Furthermore, we use multivariate Hermite polynomials in the form of Hermite tensors (as the entries in the Hermite tensors are re-scaled multivariate Hermite polynomials). We define the Hermite tensor as follows.
Definition A.6 (Hermite Tensor).
For and , we define the -th Hermite tensor as
We also point out that fully reliable learning (see [KKM12] for the definition) reduces to one-sided reliable learning (Definition 1.1). The proof is implicit in the proof of Theorem 3 in [KKM12] and we include it here for completeness.
Fact A.7.
For a joint distribution of supported on , let (resp. ) be a hypothesis that is -reliable with respect to the concept class on distribution for positive labels (resp. for negative labels). Let hypothesis be defined as if , if and otherwise. Then is close to the best fully reliable sandwich hypothesis for on distribution .
Proof Sketch.
We first show that . Notice that
The same holds for negative labels.
To show that is -suboptimal, assume (for the purpose of contradiction) that there is an such that and . Notice that
Thus,
(1)
Then, define as if and otherwise ( is defined similarly). We get . Using Equation 1 and , we get
Therefore, either or
Thus, either or is not -suboptimal. This gives a contradiction. ∎
Appendix B Omitted Proofs from Section 2
We define the bias of a function as , and define to be the set of all LTFs whose bias is . Let be the optimal reliable hypothesis and be the bias of .
B.1 Reliably Learning Halfspaces with Gaussian Marginals when
We show that for the problem in Theorem 1.3, in the case of , the algorithm can just return one of the constant hypotheses. Let (resp. ) be the (resp. ) constant hypothesis.
Fact B.1.
For the problem defined in Theorem 1.3, the case where can be efficiently solved by returning if and returning otherwise.
Proof.
When the algorithm returns , it is easy to see that satisfies the reliable learning requirement in Definition 1.1. So we only consider the case that the algorithm returns .
Given the algorithm returns , either there is an -biased LTF such that or none of the -biased LTFs satisfies . For the case that none of the -biased LTFs satisfies , it is easy to see that is a valid answer by Definition 1.1. For the case that there is an -biased LTF such that , since the algorithm did not return , it must be the case that . Therefore, we have . Thus, we get and . This shows is reliable with respect to all biased LTFs. This completes the proof. ∎
B.2 Reliably Learning Halfspaces with Gaussian Marginals: Case of Unknown
Here we establish our main result:
Theorem B.2 (Main Algorithmic Result).
Let and let be a joint distribution of supported on with marginal . Let be the optimal halfspace and be its bias which is unknown. Then there is an algorithm that uses many samples from , runs in time and with high probability returns a hypothesis that is -reliable with respect to the class of LTFs.
To prove Theorem B.2, we need to use the algorithm in the following lemma as a black box. The lemma shows that if the optimal halfspace satisfies , then the problem can be solved efficiently.
Lemma B.3.
Let be the joint distribution of supported on with marginal and . Suppose the optimal halfspace satisfies . Then there is an algorithm that reliably learns LTFs on using many samples and running time.
Proof.
The algorithm is the following:
1. First check if with sufficiently small constant failure probability. If so, return the +1 constant hypothesis.
2. Otherwise, draw many samples conditioned on . By Step 1, the sampling efficiency here is with high probability.
3. Solve the following semidefinite program for and : minimize such that . Return the hypothesis , where and .
Let be the optimal hypothesis. We prove that is such that and . Since is consistent with all the negative samples, we have . From the assumption that , we have . Notice that and is a feasible solution to the above program; therefore, and we must have , which implies that . Therefore,
This completes the proof. ∎
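The following convex-programming sketch (our own instantiation using cvxpy; the paper's exact objective and constraints are stated in the omitted program above) captures the spirit of this step: over samples conditioned on label $-1$, find a unit-norm halfspace that classifies all of them as negative while making its positive region as large as possible, here by maximizing the threshold. The toy target halfspace and sample sizes are placeholders.

```python
import cvxpy as cp
import numpy as np

def fit_reliable_halfspace(X_neg):
    """Over the negatively labeled samples, find a unit-norm (w, t) that classifies
    all of them as negative while maximizing the threshold t (a hedged stand-in for
    the convex program of Lemma B.3; the paper's exact program is omitted above)."""
    d = X_neg.shape[1]
    w, t = cp.Variable(d), cp.Variable()
    constraints = [X_neg @ w + t <= 0, cp.norm(w, 2) <= 1]
    cp.Problem(cp.Maximize(t), constraints).solve()
    return w.value, t.value

# Toy run: labels from a heavily biased halfspace sign(x_0 + 1.5); negatives are clean.
rng = np.random.default_rng(0)
X = rng.standard_normal((20_000, 5))
y = np.where(X[:, 0] + 1.5 >= 0, 1, -1)
w_hat, t_hat = fit_reliable_halfspace(X[y == -1])
fp = np.mean((X @ w_hat + t_hat > 0) & (y == -1))       # false positives on the sample
print(f"recovered threshold ~ {t_hat:.2f}, sample false-positive rate = {fp:.4f}")
```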
We now prove Theorem B.2.
Proof of Theorem B.2.
The algorithm is the following:
1. Run the algorithm in Lemma B.3 with . If the output hypothesis satisfies with a sufficiently small constant failure probability, then output and terminate.
2. Set and run the algorithm in Theorem B.4 with . If the output hypothesis satisfies with a sufficiently small constant failure probability, then output and terminate. Otherwise, update as . Repeat Step 2 until the algorithm terminates.
Let be the optimal halfspace and be its bias which is unknown. Suppose the algorithm terminates. Let be the bias of the output hypothesis . It is easy to see that with high probability; therefore, it only remains to show . Notice that if , the algorithm will terminate in Step 1. Therefore, without loss of generality, we assume . Then, by Theorem B.4, it is easy to see that since given any , Step 2 is guaranteed to terminate. Therefore, we have
Thus,
This completes the proof. ∎
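The reduction from unknown to known bias can be organized as a guess-and-verify loop. The sketch below (our own schematic; `learn_with_known_bias` and the sampling oracle are stubs for the subroutine of Theorem B.4 and fresh data, and the halving schedule is an assumption) tries geometrically decreasing guesses and returns the first hypothesis whose empirical false positive error validates.

```python
import numpy as np

def false_positive_rate(h, X, y):
    """Empirical false-positive error of hypothesis h on a labeled sample."""
    return np.mean((h(X) == 1) & (y == -1))

def learn_unknown_bias(draw_sample, learn_with_known_bias, eps, tau_min=1e-6):
    """Guess-and-verify loop for the unknown-bias reduction (schematic): try
    geometrically decreasing guesses of the bias, run the known-bias learner, and
    return the first hypothesis whose false-positive error passes validation."""
    X_val, y_val = draw_sample()                        # fresh validation samples
    tau_guess = 0.5
    while tau_guess >= tau_min:
        h = learn_with_known_bias(tau_guess, eps)       # e.g., the algorithm of Theorem B.4
        if false_positive_rate(h, X_val, y_val) <= eps:
            return h
        tau_guess /= 2                                  # refine the guess and retry
    return lambda X: -np.ones(len(X), dtype=int)        # fall back to the all-negative hypothesis
```

In the actual reduction, Step 1 first runs the algorithm of Lemma B.3 and the validation is performed with a sufficiently small constant failure probability; the halving schedule above is only meant to convey the guess-and-verify structure.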
B.3 Reliably Learning Halfspaces with Gaussian Marginals for the Case
For convenience, we assume that is known. This can be assumed without loss of generality as we discussed in Theorem B.2.
We show the following:
Theorem B.4.
Let and let be a joint distribution of supported on with marginal . Assume that there is a halfspace in that is reliable with respect to . Then, there is an algorithm that uses many samples from , runs in time and with high probability returns a hypothesis that is -reliable with respect to the class of .
Below we provide the omitted proofs from Section 2.
B.4 Proof of Lemma 2.2
Notice that for any polynomial such that , we have
Therefore, it suffices for us to show that there is a zero mean and unit variance polynomial such that
Here we will consider the sign-matching polynomial from [DKK22], which satisfies the following properties:
Claim B.5.
Let and . There exists a zero mean and unit variance polynomial of degree such that
1. The sign of matches , i.e. .
2. For any , we have .
Proof.
We consider the polynomial defined as:
where , and is a sufficiently large odd integer such that . We then take .
For convenience, we first note that from Stirling’s approximation for ,
To prove that , we first show that . Since , we just need to show . Notice , since and , thus . Therefore, considering and ,
must be monotone increasing for and must also be monotone increasing for . Notice that ; therefore, for . To show for , we prove it for the cases and . For , notice that due to and the definition of and ,
Then, for , we show that is monotone increasing. Notice that the derivative , for , this is positive if and only if . If we can show , then it is immediate that for any , . The condition can be further simplified to . Then considering and , we have
thus the condition holds. Therefore, for and must be monotone increasing for . Then combined with the condition , which is implied from the previous case (), we have for .
For the Property 2, we show that for , , and separately. For notice that is convex for ; therefore, it suffices to show . Note that , and we have . Therefore, since , we have . This completes the proof of the case . For the case , it immediately follows from the fact that by the definition. For the case , we have . Then, the case immediately follows from the fact that is monotone increasing in this interval. This completes the proof of Property 2. ∎
B.5 Proof of Lemma 2.4
We start by bounding the Frobenius norm of , i.e., . Notice that by taking to be any degree- polynomial such that and , then . We leverage the following standard concentration fact for polynomials, known as the Bonami-Beckner inequality or simply Gaussian hypercontractivity.
Fact B.6 (Hypercontractivity Concentration Inequality [Bog98, Nel73]).
Consider any -degree, unit variance polynomial with respect to . Then, for any
where is an absolute constant.
To prove Property 1, note that since , there can be at most many singular vectors with singular values greater than . Therefore, is at most .
For Property 2, suppose ; then we have, for any ,
However, since for some , we know there must be an such that
Therefore, we can write
which contradicts . This completes the proof.
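The counting step behind Property 1 rests on the elementary fact that a matrix with Frobenius norm at most $B$ has at most $(B/\sigma)^2$ singular values exceeding $\sigma$, since the squared singular values sum to $B^2$. A quick numpy check (our own illustration, with an arbitrary test matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((200, 100)) / 40.0           # test matrix with moderate Frobenius norm
B, sigma = np.linalg.norm(M, "fro"), 0.2
count = np.sum(np.linalg.svd(M, compute_uv=False) > sigma)
print(count, "<=", (B / sigma) ** 2)                 # squared singular values sum to B^2
```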
B.6 Proof of Proposition 2.1
The pseudocode for the algorithm establishing Proposition 2.1 is given in Algorithm 2.
1. Let be a sufficiently large universal constant and be a sufficiently large universal constant depending on . Let be a set of many samples from . For , with 1/2 probability, take to be the empirical Chow-tensor on negative samples. Otherwise, take to be the empirical Chow-tensor on all samples. Let be the empirical estimation of with error at most with a sufficiently small constant failure probability.
2. Take to be the flattened version of , and let be the subspace spanned by the left singular vectors of whose singular values are greater than , where is a sufficiently large constant. Let be the union of all and output to be a random unit vector chosen from .
We can now prove Proposition 2.1.
Proof of Proposition 2.1.
We first introduce the following fact about learning Chow tensors.
Fact B.7.
Fix , and . Let be a distribution on with standard normal marginals. There is an algorithm that with samples and runtime, outputs an approximation of the order- Chow-parameter tensor such that with probability , it holds .
Proof.
Let be the empirical estimation of using for sufficiently large universal constant . We start by showing . Notice that,
Notice that each entry of must be some where and is a unit variance Hermite polynomial. Therefore, and thus .
Using the above, we have
Then, applying Chebyshev’s inequality gives . ∎
We prove Proposition 2.1 for two cases: (a) ; and (b) . We first prove Case (a) as the other case is similar. For Case (a), it suffices for us to prove that happens with probability given the algorithm chooses (which happens with 1/2 probability). Let and . By Fact B.7, we have . Then from Lemma 2.2 and Lemma 2.4, if we take to be a random vector in , we will with constant probability have,
Since , we have . Also, considering , we get .
For Case (b), the proof remains the same, except we take and use Fact B.8 below instead of Lemma 2.4.
Fact B.8 (Lemma 5.10 in [DKK22]).
Let be the joint distribution of supported on with the marginal . Let be a univariate, mean zero, unit variance polynomial of degree such that for some unit vector it holds for some . Let be an approximation of the order- Chow-parameter tensor such that . Denote by the subspace spanned by the left singular vectors of flattened whose singular values are greater than . Moreover, denote by the union of . Then we have that
1. [a bound on the dimension of the resulting subspace], and
2. [a lower bound on the component of the optimal direction lying in that subspace].
Then the same argument will give . The above described algorithm uses at most samples and runtime. This completes the proof of Proposition 2.1. ∎
B.7 Proof of Lemma 2.5
Item 1 follows from the definition of and the fact that has marginal distribution .
For proving Item 2, we consider two cases: The first case is when and the second one when . For the case , we prove that the distribution satisfies the reliability condition with respect to where . Notice that for any such that and , we have