Learning Constant-Depth Circuits in
Malicious Noise Models
Abstract
The seminal work of Linial, Mansour, and Nisan gave a quasipolynomial-time algorithm for learning constant-depth circuits ($\mathsf{AC}^0$) with respect to the uniform distribution on the hypercube. Extending their algorithm to the setting of malicious noise, where both covariates and labels can be adversarially corrupted, has remained open. Here we achieve such a result, inspired by recent work on learning with distribution shift. Our running time essentially matches that of their algorithm, which is known to be optimal assuming various cryptographic primitives.
Our proof uses a simple outlier-removal method combined with Braverman’s theorem for fooling constant-depth circuits. We attain the best possible dependence on the noise rate and succeed in the harshest possible noise model (i.e., contamination or so-called “nasty noise”).
1 Introduction
In their famous paper, Linial, Mansour, and Nisan [LMN93] introduced the “low-degree” algorithm for learning Boolean functions with respect to the uniform distribution on $\{\pm 1\}^n$. The running time and sample complexity of their algorithm scales in terms of the Fourier concentration of the underlying concept class, and, using this framework, they obtained a quasipolynomial-time algorithm for learning constant-depth, polynomial-size circuits ($\mathsf{AC}^0$).
Prior work [KKMS08] had extended their result to the agnostic setting, where the labels can be adversarially corrupted, but the marginal distribution on inputs must still be uniform over $\{\pm 1\}^n$. Remarkably, there had been no progress on this problem over the last three decades for malicious noise models, where both covariates and labels can be adversarially corrupted [Val85, KL93].
In this paper, we completely resolve this problem and obtain a quasipolynomial-time algorithm for learning $\mathsf{AC}^0$ in the harshest possible noise model, the so-called “nasty noise” model of [BEK02]. We define this model below and refer to it simply as learning with contamination, in line with recent work in computationally efficient robust statistics (see e.g., [DK23]).
Definition 1.1 (Learning from Contaminated Samples).
A set $\bar{S}$ of labeled examples is an $\eta$-contaminated (uniform) sample with respect to some class $\mathcal{F}$ of concepts $\{\pm 1\}^n \to \{\pm 1\}$, where $\eta \in (0, 1/2)$, if it is formed by an adversary as follows.
1. The adversary receives a set $\bar{S}_{\mathrm{clean}}$ of clean i.i.d. labeled examples $(x, f(x))$, where $x$ is drawn from the uniform distribution over $\{\pm 1\}^n$ and the labels are given by some unknown concept $f$ in $\mathcal{F}$.
2. The adversary removes an arbitrary set $\bar{S}_{\mathrm{rem}}$ of at most $\eta \cdot |\bar{S}_{\mathrm{clean}}|$ labeled examples from $\bar{S}_{\mathrm{clean}}$ and substitutes it with an adversarial set $\bar{S}_{\mathrm{adv}}$ of labeled examples of the same size.
Namely, $\bar{S} = (\bar{S}_{\mathrm{clean}} \setminus \bar{S}_{\mathrm{rem}}) \cup \bar{S}_{\mathrm{adv}}$. For the corresponding unlabeled set $S$, we say that it is an $\eta$-contaminated (uniform) sample.
In this model, the goal of the learner is to output (with probability at least $1 - \delta$) a hypothesis $h$ such that $\Pr_{x \sim \mathcal{U}}[h(x) \neq f(x)] \le 2\eta + \epsilon$, where $\mathcal{U}$ denotes the uniform distribution over $\{\pm 1\}^n$. The factor $2$ on the noise rate is known to be the best possible constant achievable by any algorithm [BEK02].
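For concreteness, the following minimal Python simulation illustrates Definition 1.1; the toy target concept, the crude all-ones corruption pattern, and all names are our own illustrative choices, not part of the model:

```python
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    # toy stand-in for a small constant-depth circuit: a read-once DNF
    return 1.0 if (x[0] == 1 and x[1] == 1) or (x[2] == 1 and x[3] == 1) else -1.0

def contaminated_sample(m, n, eta):
    """Draw m clean uniform labeled examples, then let an adversary replace
    an eta fraction of them (covariates AND labels) arbitrarily."""
    X = rng.choice([-1.0, 1.0], size=(m, n))      # uniform over {-1,+1}^n
    y = np.array([target(x) for x in X])          # clean labels f(x)
    k = int(eta * m)
    idx = rng.choice(m, size=k, replace=False)    # adversary picks S_rem
    X[idx] = 1.0                                  # S_adv: all-ones covariates...
    y[idx] = -target(np.ones(n))                  # ...with flipped labels
    return X, y

X, y = contaminated_sample(m=1000, n=20, eta=0.05)
```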
Although there is now a long line of research giving computationally efficient algorithms for learning Boolean function classes in malicious noise models, these algorithms primarily apply to geometric concept classes and continuous marginal distributions, such as halfspaces or intersections of halfspaces with respect to Gaussian or log-concave densities [KKMS08, KLS09, ABL17, DKS18, SZ21]. In particular, nothing was known for the case of $\mathsf{AC}^0$.
Our main theorem is as follows:
Theorem 1.2.
For any $\epsilon, \delta \in (0,1)$ and $s, d \in \mathbb{N}$, there is an algorithm that learns the class of circuits of size $s$ and depth $d$ and achieves error $2\eta + \epsilon$, with running time and sample complexity $n^{O(\ell)}$, where $\ell = (\log(s/\epsilon))^{O(d)}$, from contaminated samples of any noise rate $\eta$.
Our running time essentially matches the Linial, Mansour, and Nisan result, which is known to be optimal assuming various cryptographic primitives [Kha95].
More generally, we prove that any concept class that admits $\epsilon$-sandwiching polynomials of degree $\ell$ can be learned in time $n^{O(\ell)}$ from contaminated samples. Recent work due to [GSSV24] had obtained a similar result, achieving a weaker error bound (with a suboptimal dependence on the noise rate $\eta$) for learning functions with $\ell_2$-sandwiching polynomials. Crucially, it remains unclear how to obtain such $\ell_2$-sandwiching approximators for constant-depth circuits (Braverman’s celebrated result on $\mathsf{AC}^0$ [Bra08] obtains only $\ell_1$-sandwiching), and so their result does not apply here.
In 2005, Kalai et al. [KKMS08] showed that $\ell_1$-approximation suffices for agnostic learning. Here we complete the analogy for malicious learning, showing that $\ell_1$-sandwiching implies learnability with respect to contamination.
Proof Overview.
The input set $\bar{S}$ is $\eta$-contaminated. This might make it hard to find a hypothesis with near-optimal error on $\bar{S}$. However, we are only interested in finding a hypothesis with low error on the clean distribution, which is structured (in particular, the marginal distribution on the features is uniform over $\{\pm 1\}^n$). In order to take advantage of the structure of the clean distribution despite only having access to the contaminated sample, we make use of the notion of sandwiching polynomials:
Definition 1.3 (Sandwiching polynomials).
Let $f : \{\pm 1\}^n \to \{\pm 1\}$. We say that the ($\ell_1$-) $\epsilon$-sandwiching degree of $f$ with respect to the uniform distribution over the hypercube is at most $\ell$ if there are polynomials $p_{\mathrm{up}}, p_{\mathrm{down}}$ of degree at most $\ell$ such that (1) $p_{\mathrm{down}}(x) \le f(x) \le p_{\mathrm{up}}(x)$ for all $x \in \{\pm 1\}^n$ and (2) $\mathbb{E}_{x \sim \mathcal{U}}[p_{\mathrm{up}}(x) - p_{\mathrm{down}}(x)] \le \epsilon$.
The $\epsilon$-sandwiching degree of size-$s$, depth-$d$ circuits is bounded by $(\log(s/\epsilon))^{O(d)}$, due to the result of Braverman on fooling constant-depth circuits (see Theorem 4.2; [Bra08, Tal17, HS19]). Suppose that $S'$ is a subset of $S$ that approximately preserves the expectations of low-degree, non-negative polynomials (e.g., $p_{\mathrm{up}} - p_{\mathrm{down}}$) compared to the uniform distribution. Under this condition, low-degree polynomial regression gives a hypothesis with near-optimal error on $S'$ (see Section 4).
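In symbols, using the sandwiching pair $p_{\mathrm{down}} \le f \le p_{\mathrm{up}}$ for the target $f$ and writing $\hat{\mathbb{E}}$ for empirical expectation (the precise constants appear in the proof of Theorem 1.2), the point is the following one-line calculation:
$$\hat{\mathbb{E}}_{x \sim S'}\big[|f(x) - p_{\mathrm{up}}(x)|\big] = \hat{\mathbb{E}}_{x \sim S'}\big[p_{\mathrm{up}}(x) - f(x)\big] \le \hat{\mathbb{E}}_{x \sim S'}\big[p_{\mathrm{up}}(x) - p_{\mathrm{down}}(x)\big] \le O\big(\mathbb{E}_{x \sim \mathcal{U}}[p_{\mathrm{up}}(x) - p_{\mathrm{down}}(x)] + \epsilon\big) \le O(\epsilon),$$
where the middle inequality uses that $S'$ preserves the expectation of the non-negative, low-degree polynomial $p_{\mathrm{up}} - p_{\mathrm{down}}$. Hence $f$ admits a low-degree $\ell_1$-approximator over $S'$, which is exactly what polynomial regression needs.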
We show in Lemma 3.1 that a simple procedure that iteratively removes samples from $S$ can be used to form such a set $S'$ (one that preserves the expectations of non-negative, degree-$\ell$, low-expectation polynomials) and, moreover, this procedure removes at least as many contaminated points as clean points. The last property is important, because it implies that $S'$ is representative of the ground-truth distribution, i.e., any near-optimal hypothesis for $S'$ will also have near-optimal error on the ground truth.
This is possible because the only way the adversary can significantly increase the expectation of a non-negative polynomial $p$ is by inserting examples on which $p$ is unreasonably large compared to the typical values of $p$ under the uniform distribution. Our algorithm iteratively finds the non-negative polynomial with the largest empirical expectation over the current set through a simple linear program and then removes the points for which this polynomial is large.
Our iterative outlier removal procedure is inspired by prior work on TDS learning (Testable Learning with Distribution Shift) and PQ learning [KSV24, GSSV24] as well as the work of [DKS18] on learning geometric concepts from contaminated examples. Both of these works use outlier removal procedures that give bounds on the variance of polynomials rather than the expectation of non-negative polynomials and, instead of linear programming, they use spectral algorithms.
2 Notation
Throughout this work, when we refer to a set $S$ of examples from the hypercube $\{\pm 1\}^n$, we consider every example in $S$ to be a unique and separate instance of the corresponding element of $\{\pm 1\}^n$ (i.e., $S$ is a multiset). Moreover, we denote with $\bar{S}$ the corresponding labeled set of examples in $\{\pm 1\}^n \times \{\pm 1\}$.
Recall that polynomials over $\{\pm 1\}^n$ are functions of the form $p(x) = \sum_{T \subseteq [n]} p_T\, x_T$, where $p_T \in \mathbb{R}$ and $x_T$ denotes the quantity $\prod_{i \in T} x_i$. We say that the degree of $p$ is at most $\ell$ if for any $T$ with $|T| > \ell$, we have $p_T = 0$. For a polynomial $p$, we denote with $\|p\|_1$ the $\ell_1$ norm of its coefficients, i.e., $\|p\|_1 = \sum_{T \subseteq [n]} |p_T|$.
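The following small Python helper (ours, for illustration) makes this notation concrete: a degree-$\ell$ polynomial is a linear function of the multilinear monomial features, and $\|p\|_1$ is the $\ell_1$ norm of its coefficient vector.

```python
from itertools import combinations
import numpy as np

def monomial_features(X, degree):
    """Columns are the multilinear monomials x_T = prod_{i in T} x_i, |T| <= degree."""
    m, n = X.shape
    cols = [np.ones(m)]                      # the empty monomial x_{emptyset} = 1
    for k in range(1, degree + 1):
        for T in combinations(range(n), k):
            cols.append(np.prod(X[:, list(T)], axis=1))
    return np.column_stack(cols)

# A polynomial p is a coefficient vector c: p(x_i) = (monomial_features(X, l) @ c)[i],
# and its coefficient norm ||p||_1 is np.abs(c).sum().
```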
3 Removing the Outliers
The input set $S$ includes an $\eta$ fraction of contaminated examples. It is, of course, impossible to identify the exact subset of $S$ that is contaminated. However, we show how to remove contaminated examples that lead to inflation of the expectations of low-degree non-negative polynomials, which we call “outliers.” We remove only a relatively small number of clean examples from $S$, as we show in the following lemma.
Lemma 3.1 (Outlier removal).
Let $S$ be an $\eta$-contaminated uniform sample (see Definition 1.1) of size $m = |S|$. For any choice of the parameters $\ell \in \mathbb{N}$ and $\epsilon, \delta \in (0,1)$, the output $S' \subseteq S$ of Algorithm 1 satisfies the following, whenever $m \ge (n^{\ell} \log(1/\delta)/\epsilon)^{C}$ for some sufficiently large constant $C$.
1. With probability at least $1 - \delta$, the number of clean examples that are removed from $S$ is at most equal to the number of adversarial examples that are removed from $S$ (see Figure 1). Namely, $|(S \setminus S') \cap S_{\mathrm{clean}}| \le |(S \setminus S') \cap S_{\mathrm{adv}}|$.
2. For any fixed non-negative polynomial $p$ over $\{\pm 1\}^n$ with degree at most $\ell$ and $\mathbb{E}_{x \sim \mathcal{U}}[p(x)] \le \epsilon$, we have, with probability at least $1 - \delta$,
$$\hat{\mathbb{E}}_{x \sim S'}[p(x)] \le 12\,\epsilon.$$
Our Algorithm 1 is similar in spirit to outlier removal procedures that have been used previously in the context of learning with contaminated samples [DKS18] and tolerant learning with distribution shift [GSSV24]: we iteratively find the non-negative polynomial with the largest empirical expectation and remove the examples that give this polynomial unusually large values. Here we focus on the expectations of non-negative polynomials, while in all previous works, the guarantees after outlier removal concerned the variance of arbitrary polynomials. In this sense, our guarantees are stronger, but only hold for non-negative polynomials. Our algorithm solves, in every iteration, one linear program in place of the usual spectral techniques from prior work: having drawn a fresh reference sample $S_{\mathrm{ref}}$ of i.i.d. uniform (unlabeled) points, it finds the optimum $p^*$ of
$$\max_{\deg(p) \le \ell}\ \hat{\mathbb{E}}_{x \sim S'}[p(x)] \quad \text{s.t.} \quad p(x) \ge 0 \ \ \forall x \in S' \cup S_{\mathrm{ref}}, \qquad \hat{\mathbb{E}}_{x \sim S_{\mathrm{ref}}}[p(x)] \le 2\epsilon, \qquad \|p\|_1 \le B, \tag{P}$$
where $B := \epsilon\,(2n)^{\ell/2} e^{\ell}$ and $S'$ is the current (filtered) version of $S$. If the optimum value exceeds $12\epsilon$, the algorithm removes from $S'$ all points $x$ with $p^*(x) > t$ for a suitable threshold $t$, and otherwise it terminates and outputs $S'$. Note that (P) is indeed a linear program in the $n^{O(\ell)}$ coefficients of $p$, since the non-negativity constraints are only imposed on the finitely many available points.
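The following Python sketch illustrates the structure of the procedure under our simplifications: the LP encoding via `scipy.optimize.linprog` and the threshold rule (dropping points above the largest reference value) are our own crude choices, cruder than the analysis requires, but the constraints mirror (P). The `monomial_features` helper from the Section 2 sketch is repeated for self-containment.

```python
import numpy as np
from itertools import combinations
from scipy.optimize import linprog

def monomial_features(X, degree):
    m, n = X.shape
    cols = [np.ones(m)]
    for k in range(1, degree + 1):
        for T in combinations(range(n), k):
            cols.append(np.prod(X[:, list(T)], axis=1))
    return np.column_stack(cols)

def remove_outliers(S, S_ref, degree, eps, B, stop_at):
    """Iterative outlier removal in the spirit of Algorithm 1 / program (P)."""
    S = S.copy()
    A_ref = monomial_features(S_ref, degree)
    while len(S) > 0:
        A = monomial_features(S, degree)
        M = A.shape[1]
        # Encode p by c = c_pos - c_neg with c_pos, c_neg >= 0, so that
        # ||p||_1 <= sum(c_pos + c_neg) becomes a linear constraint.
        obj = np.concatenate([-A.mean(axis=0), A.mean(axis=0)])   # maximize mean_{S'}[p]
        rows, rhs = [], []
        P_all = np.vstack([A, A_ref])                  # (a) p(x) >= 0 on S' and S_ref
        rows.append(np.hstack([-P_all, P_all])); rhs.append(np.zeros(len(P_all)))
        r = A_ref.mean(axis=0)                         # (b) mean_{S_ref}[p] <= 2*eps
        rows.append(np.hstack([r, -r])[None, :]);  rhs.append([2 * eps])
        rows.append(np.ones((1, 2 * M)));          rhs.append([B])  # (c) ||p||_1 <= B
        res = linprog(obj, A_ub=np.vstack(rows), b_ub=np.concatenate(rhs),
                      bounds=(0, None), method="highs")
        c = res.x[:M] - res.x[M:]
        vals = A @ c
        if vals.mean() <= stop_at:                     # terminating iteration
            return S
        # Simplified threshold rule: drop points above the largest reference value.
        t = (A_ref @ c).max()
        keep = vals <= t
        if keep.all():                                 # safeguard against stalling
            return S
        S = S[keep]
    return S
```

With $B = \epsilon\,(2n)^{\ell/2} e^{\ell}$ and `stop_at` set to $12\epsilon$, the parameters match those used in the analysis below.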
The proof idea is that whenever there is a non-negative polynomial $p$ with an unreasonably large empirical expectation, there have to be many outliers that give unusually large values to $p$. By removing all the points where $p$ is large, we can, therefore, be confident that we remove at least as many outliers as clean examples (part 1 of Lemma 3.1). When the algorithm terminates, all non-negative polynomials with low expectation under the uniform distribution will also have low expectation under the remaining set of examples (part 2 of Lemma 3.1).
For part 1, we analyze the non-terminating iterations and show that for each clean point that is filtered out by the procedure, at least one adversarial point is filtered out as well. We first show that in such an iteration, there is a threshold $t \in [0, B]$ such that $\Pr_{x \sim S'}[p^*(x) > t] > 4\,\Pr_{x \sim S_{\mathrm{ref}}}[p^*(x) > t] + 4\epsilon/B$. In particular, $\Pr_{x \sim S'}[p^*(x) > t] > 0$, which implies that in every non-terminating iteration at least one point is removed and, therefore, some iteration will satisfy the stopping criterion and terminate (there are only $m$ points in total).
Claim.
In any non-terminating iteration (i.e. an iteration where $\hat{\mathbb{E}}_{x \sim S'}[p^*(x)] > 12\epsilon$), there is $t \in [0, B]$ such that $\Pr_{x \sim S'}[p^*(x) > t] > 4\,\Pr_{x \sim S_{\mathrm{ref}}}[p^*(x) > t] + 4\epsilon/B$.
Proof.
Suppose, for contradiction, that for all $t \in [0, B]$ we have
$$\Pr_{x \sim S'}[p^*(x) > t] \le 4\,\Pr_{x \sim S_{\mathrm{ref}}}[p^*(x) > t] + 4\epsilon/B.$$
We may integrate both sides of the above inequality over $t \in [0, B]$, since the corresponding functions of $t$ are piecewise constant with a finite number of discontinuities (at most equal to $|S'| + |S_{\mathrm{ref}}|$):
$$\int_0^B \Pr_{x \sim S'}[p^*(x) > t]\, dt \;\le\; 4 \int_0^B \Pr_{x \sim S_{\mathrm{ref}}}[p^*(x) > t]\, dt \;+\; 4\epsilon. \tag{3.1}$$
We will now substitute the integrals above with expectations, i.e., $\int_0^B \Pr_{x \sim S'}[p^*(x) > t]\, dt = \hat{\mathbb{E}}_{x \sim S'}[p^*(x)]$ and likewise for $S_{\mathrm{ref}}$. We use the simple fact that for any non-negative random variable $X$ with values in $[0, B]$, we have $\mathbb{E}[X] = \int_0^B \Pr[X > t]\, dt$.
We first set $X = p^*(x)$, where $x$ is drawn uniformly from $S'$, and observe that (1) $p^*(x) \ge 0$ for all $x \in S'$, and also that (2) $p^*(x) \le B$ for all $x$, since $p^*$ satisfies $\|p^*\|_1 \le B$ according to the constraints of (P) and $|p^*(x)| \le \|p^*\|_1$, since $x \in \{\pm 1\}^n$ and therefore $|x_T| = 1$ for every monomial. This shows that $X \in [0, B]$ almost surely over $x$. Using an analogous argument for $x$ drawn uniformly from $S_{\mathrm{ref}}$, we overall obtain the following.
$$\int_0^B \Pr_{x \sim S'}[p^*(x) > t]\, dt = \hat{\mathbb{E}}_{x \sim S'}[p^*(x)], \qquad \int_0^B \Pr_{x \sim S_{\mathrm{ref}}}[p^*(x) > t]\, dt = \hat{\mathbb{E}}_{x \sim S_{\mathrm{ref}}}[p^*(x)]. \tag{3.2}$$
We may now substitute (3.2) in the inequality (3.1), and use the fact that $\hat{\mathbb{E}}_{x \sim S_{\mathrm{ref}}}[p^*(x)] \le 2\epsilon$ (by the constraints of (P)) to conclude that
$$\hat{\mathbb{E}}_{x \sim S'}[p^*(x)] \le 4 \cdot 2\epsilon + 4\epsilon = 12\epsilon.$$
We have reached a contradiction, since the iteration is non-terminating, i.e., $\hat{\mathbb{E}}_{x \sim S'}[p^*(x)] > 12\epsilon$, and, therefore, the desired $t$ exists. ∎
We still need to show that whenever the procedure filters out clean examples, it also filters out at least as many adversarial examples. Let $L_k$ be the set of points that are filtered out during iteration $k$. We can write $L_k$ as a disjoint union $L_k = L_k^{\mathrm{cl}} \sqcup L_k^{\mathrm{adv}}$, where $L_k^{\mathrm{cl}}$ are the clean examples that are removed and $L_k^{\mathrm{adv}}$ are the adversarial examples that are removed.
Claim.
With probability at least $1 - \delta$, we have that for all non-terminating iterations $k$, $|L_k^{\mathrm{cl}}| \le |L_k^{\mathrm{adv}}|$.
Proof.
By the previous claim, we know that the threshold $t_k$ (which defines the set $L_k$) exists and has the property that $\Pr_{x \sim S'}[p^*(x) > t_k] > 4\,\Pr_{x \sim S_{\mathrm{ref}}}[p^*(x) > t_k] + 4\epsilon/B$, where $S'$ denotes the set maintained at the beginning of iteration $k$ and $p^*$ the corresponding optimum of (P).
We first focus on the quantity $\Pr_{x \sim S_{\mathrm{ref}}}[p^*(x) > t_k]$, which is proportional to the number of reference examples that would be removed by the thresholding operation $x \mapsto \mathbb{1}\{p^*(x) > t_k\}$. However, we are interested in the number of actual clean examples that would be removed. The reference examples can be shown to provide an estimate of the number of removed clean examples, through uniform convergence. In particular, the thresholding operation corresponds to a polynomial threshold function of degree at most $\ell$ and, therefore, by standard VC dimension arguments (and uniformly for all iterations) we have that $\Pr_{x \sim S_{\mathrm{clean}}}[p^*(x) > t_k] \le \Pr_{x \sim S_{\mathrm{ref}}}[p^*(x) > t_k] + \epsilon/B$, except with probability $\delta$, as long as the sample size is at least $n^{C\ell} (B/\epsilon)^2 \log(1/\delta)$ for a sufficiently large constant $C$. This is because both $S_{\mathrm{clean}}$ and $S_{\mathrm{ref}}$ consist of i.i.d. samples from the uniform distribution.
Overall, we have that $\Pr_{x \sim S'}[p^*(x) > t_k] > 4\,\Pr_{x \sim S_{\mathrm{clean}}}[p^*(x) > t_k]$. We can write the empirical probabilities in terms of the sizes of the removed sets to obtain the following, where we also use the fact that $|S'| \ge m/2$ (inductively, at most $2\eta m \le m/2$ points have been removed so far, since the procedure removes at least as many adversarial as clean points, there are at most $\eta m$ adversarial points, and we may assume $\eta \le 1/4$, as otherwise the guarantee of Theorem 1.2 is trivial) and that $m \cdot \Pr_{x \sim S_{\mathrm{clean}}}[p^*(x) > t_k]$ is at least the number of clean examples removed by the $k$-th filtering operation (some clean examples could already have been removed, either by the adversary or by some previous iteration, and these are not contained in $L_k^{\mathrm{cl}}$):
$$|L_k| = |S'| \cdot \Pr_{x \sim S'}[p^*(x) > t_k] > \frac{m}{2} \cdot 4\,\Pr_{x \sim S_{\mathrm{clean}}}[p^*(x) > t_k] \ge 2\,|L_k^{\mathrm{cl}}|,$$
and therefore $|L_k^{\mathrm{adv}}| = |L_k| - |L_k^{\mathrm{cl}}| \ge |L_k^{\mathrm{cl}}|$.
Overall, if we sum over all iterations $k$, we obtain that the number of clean examples that are removed by the procedure is at most equal to the number of adversarial examples that are removed by the procedure.
For part 2 of Lemma 3.1, we observe that, upon termination, $\hat{\mathbb{E}}_{x \sim S'}[p(x)] \le 12\epsilon$ for every polynomial $p$ that satisfies all the constraints of the program (P), since the objective value of any feasible $p$ is at most the optimum, which is at most $12\epsilon$ by the stopping criterion. It suffices to prove the following claim.
Claim.
Any non-negative polynomial $p$ over $\{\pm 1\}^n$ with degree at most $\ell$ and $\mathbb{E}_{x \sim \mathcal{U}}[p(x)] \le \epsilon$ satisfies all the constraints of the program (P) with probability at least $1 - \delta$.
Proof.
The degree bound and non-negativity are satisfied directly by the assumptions on $p$ (in particular, $p(x) \ge 0$ for every $x \in \{\pm 1\}^n$, hence on $S' \cup S_{\mathrm{ref}}$). We now need to show that $\|p\|_1 \le B$. Recall that $p(x) = \sum_T p_T\, x_T$, where $p_T \in \mathbb{R}$ for any $T \subseteq [n]$ with $|T| \le \ell$, and $\|p\|_1 = \sum_T |p_T|$. By viewing $(p_T)_T$ as a vector with $M \le (2n)^{\ell}$ dimensions (assuming $\ell \le n$), we have, by the Cauchy–Schwarz inequality and Parseval's identity, that $\|p\|_1 \le \sqrt{M}\, \big(\sum_T p_T^2\big)^{1/2} = \sqrt{M}\, \big(\mathbb{E}_{x \sim \mathcal{U}}[p(x)^2]\big)^{1/2}$.
Since the uniform distribution is hypercontractive (see Theorem 9.22 in [O’D14]), we have
$$\big(\mathbb{E}_{x \sim \mathcal{U}}[p(x)^2]\big)^{1/2} \le e^{\ell}\; \mathbb{E}_{x \sim \mathcal{U}}[|p(x)|]. \tag{3.3}$$
Recall that the polynomial $p$ is non-negative. This implies that $|p(x)| = p(x)$ for all $x \in \{\pm 1\}^n$ and therefore $\mathbb{E}_{x \sim \mathcal{U}}[|p(x)|] = \mathbb{E}_{x \sim \mathcal{U}}[p(x)]$. Overall, we have
$$\|p\|_1 \le \sqrt{M}\, e^{\ell}\, \mathbb{E}_{x \sim \mathcal{U}}[p(x)]. \tag{3.4}$$
Recall, now, that $\mathbb{E}_{x \sim \mathcal{U}}[p(x)] \le \epsilon$. We obtain the desired bound $\|p\|_1 \le (2n)^{\ell/2}\, e^{\ell}\, \epsilon = B$.
It remains to show that, with probability at least $1 - \delta$, we have $\hat{\mathbb{E}}_{x \sim S_{\mathrm{ref}}}[p(x)] \le 2\epsilon$. Consider the random variable $Y = p(x)$, where $x$ is drawn from $\mathcal{U}$ (each point of $S_{\mathrm{ref}}$ is an i.i.d. copy). We have that $\mathbb{E}[Y] = \mathbb{E}_{x \sim \mathcal{U}}[p(x)] \le \epsilon$. Moreover, $Y \in [0, B]$, since $0 \le p(x) \le \|p\|_1 \le B$ for all $x \in \{\pm 1\}^n$, and, from a standard Hoeffding bound on the random variable $Y/B$, we obtain that $\hat{\mathbb{E}}_{x \sim S_{\mathrm{ref}}}[p(x)] \le \epsilon + \epsilon = 2\epsilon$, except with probability $\exp(-2|S_{\mathrm{ref}}|\, \epsilon^2/B^2)$. Due to the choice of $|S_{\mathrm{ref}}| \ge (B/\epsilon)^2 \log(1/\delta)$, we have that the probability of failure is bounded by $\delta$, as desired. ∎
4 Finding a Low-Error Hypothesis
The outlier removal process of Lemma 3.1 enables us to find a subset $S'$ of the input set $S$ such that all non-negative and low-degree polynomials with small expectation under the uniform distribution also have small empirical expectation under $S'$. Moreover, the number of clean examples removed to form $S'$ is smaller than the number of removed outliers (see Figure 1). We show that these two properties are all we need in order to learn constant-depth circuits with contamination (Definition 1.1).
In order to take advantage of Lemma 3.1, we will use two main tools. The first one is the following theorem originally proposed by [KKMS08] to show that polynomial regression implies agnostic learning for classes that can be approximated by low-degree polynomials.
Theorem 4.1 (Learning through polynomial regression [KKMS08]).
Let $D$ be any distribution over $\{\pm 1\}^n \times \{\pm 1\}$ and $\mathcal{F}$ some class of concepts from $\{\pm 1\}^n$ to $\{\pm 1\}$. If for each $f \in \mathcal{F}$ there is some polynomial $p$ over $\{\pm 1\}^n$ of degree at most $\ell$ such that $\mathbb{E}_{(x,y) \sim D}[|f(x) - p(x)|] \le \epsilon$, then there is an algorithm (based on degree-$\ell$ polynomial regression) which outputs a degree-$\ell$ polynomial threshold function $h$ such that $\Pr_{(x,y) \sim D}[h(x) \neq y] \le \mathsf{opt} + O(\epsilon)$, where $\mathsf{opt} = \min_{f \in \mathcal{F}} \Pr_{(x,y) \sim D}[f(x) \neq y]$, in time $\mathrm{poly}(n^{\ell}, 1/\epsilon)$.
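The following is a self-contained sketch of the polynomial regression primitive behind Theorem 4.1, in the same style as the earlier sketches; the LP encoding of the $\ell_1$ objective and the exhaustive threshold search are standard techniques, but the specific code and names are our own illustration.

```python
import numpy as np
from itertools import combinations
from scipy.optimize import linprog

def monomial_features(X, degree):
    m, n = X.shape
    cols = [np.ones(m)]
    for k in range(1, degree + 1):
        for T in combinations(range(n), k):
            cols.append(np.prod(X[:, list(T)], axis=1))
    return np.column_stack(cols)

def l1_poly_regression(X, y, degree):
    """Find coefficients c minimizing sum_i |p(x_i) - y_i|, with one LP slack
    variable z_i per example: p(x_i) - y_i <= z_i and y_i - p(x_i) <= z_i."""
    A = monomial_features(X, degree)
    m, M = A.shape
    obj = np.concatenate([np.zeros(M), np.ones(m)])   # minimize the sum of slacks
    G = np.vstack([np.hstack([A, -np.eye(m)]),
                   np.hstack([-A, -np.eye(m)])])
    h = np.concatenate([y, -y])
    res = linprog(obj, A_ub=G, b_ub=h,
                  bounds=[(None, None)] * M + [(0, None)] * m, method="highs")
    return res.x[:M]

def fit_ptf(X, y, degree):
    """Output the PTF sign(p(x) - theta): run L1 regression, then pick the
    threshold theta with the smallest empirical error, as in Theorem 4.1."""
    c = l1_poly_regression(X, y, degree)
    vals = monomial_features(X, degree) @ c
    best = min((np.mean(np.where(vals >= t, 1.0, -1.0) != y), t)
               for t in np.append(vals, vals.max() + 1.0))
    return c, best[1]
```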
Our overall learning algorithm will first filter the input set of examples using Algorithm 1 and then run the algorithm of Theorem 4.1 on the uniform distribution over the filtered labeled set $\bar{S}'$. All we need to show is that there is a low-degree polynomial $p$ with $\hat{\mathbb{E}}_{x \sim S'}[|f(x) - p(x)|] \le O(\epsilon)$, where $f$ is the ground-truth concept. This is ensured by combining part 2 of Lemma 3.1 with the sandwiching approximators for constant-depth circuits originally proposed by [Bra08] in the context of pseudorandomness.
Theorem 4.2 (Sandwiching polynomials for $\mathsf{AC}^0$ [Bra08, Tal17, HS19]).
Let $f$ be any circuit of size $s$ and depth $d$ and let $\epsilon \in (0,1)$. Then, there are polynomials $p_{\mathrm{up}}, p_{\mathrm{down}}$ over $\{\pm 1\}^n$, each of degree at most $(\log(s/\epsilon))^{O(d)}$, such that (1) $p_{\mathrm{down}}(x) \le f(x) \le p_{\mathrm{up}}(x)$ for all $x \in \{\pm 1\}^n$ and (2) $\mathbb{E}_{x \sim \mathcal{U}}[p_{\mathrm{up}}(x) - p_{\mathrm{down}}(x)] \le \epsilon$.
Proof of Theorem 1.2.
Consider the polynomial $p = p_{\mathrm{up}} - p_{\mathrm{down}}$, where $p_{\mathrm{up}}, p_{\mathrm{down}}$ are the $\epsilon$-sandwiching polynomials of Theorem 4.2 for the (unknown) target circuit $f$ of size $s$ and depth $d$; their degree is at most $\ell = (\log(s/\epsilon))^{O(d)}$. Observe that $p$ is non-negative and $\mathbb{E}_{x \sim \mathcal{U}}[p(x)] \le \epsilon$. Therefore, according to part 2 of Lemma 3.1, we have $\hat{\mathbb{E}}_{x \sim S'}[p(x)] \le 12\epsilon$. Since $p_{\mathrm{down}} \le f \le p_{\mathrm{up}}$ pointwise, we also have $\hat{\mathbb{E}}_{x \sim S'}[|f(x) - p_{\mathrm{up}}(x)|] \le 12\epsilon$. By Theorem 4.1, applied to the uniform distribution over $\bar{S}'$, we find $h$ with $\Pr_{(x,y) \sim \bar{S}'}[h(x) \neq y] \le \mathsf{opt}_{\bar{S}'} + O(\epsilon)$, or equivalently
$$\big|\{(x,y) \in \bar{S}' : h(x) \neq y\}\big| \le \big(\mathsf{opt}_{\bar{S}'} + O(\epsilon)\big) \cdot |S'|, \tag{4.1}$$
where $\mathsf{opt}_{\bar{S}'} \le \Pr_{(x,y) \sim \bar{S}'}[f(x) \neq y]$; since the clean labels are consistent with $f$, we have $\mathsf{opt}_{\bar{S}'} \cdot |S'| \le |S' \cap S_{\mathrm{adv}}|$.
The error of the hypothesis $h$ on the clean set $\bar{S}_{\mathrm{clean}}$ gives a bound on its error under the uniform distribution with high probability, due to classical VC theory, as long as $m \ge n^{C\ell}/\epsilon^2$ for a sufficiently large constant $C$, because $h$ is a PTF of degree at most $\ell$. We can provide an upper bound for the error on $\bar{S}_{\mathrm{clean}}$ in terms of the sizes of the sets depicted in Figure 1. In particular, we give a high-probability upper bound on the number of mistakes that $h$ makes on $\bar{S}_{\mathrm{clean}}$.
1. The points in $S_{\mathrm{rem}}$ that were removed by the adversary are not taken into account while forming $S'$, so, in the worst case, $h$ classifies all of them incorrectly. This gives at most $|S_{\mathrm{rem}}| \le \eta m$ mistakes.
2. Similarly, $h$ makes at most $|L^{\mathrm{cl}}|$ mistakes corresponding to the clean points that are removed during the outlier removal process, where $L^{\mathrm{cl}} = \bigcup_k L_k^{\mathrm{cl}}$ (and likewise $L^{\mathrm{adv}} = \bigcup_k L_k^{\mathrm{adv}}$).
3. Finally, $h$ will make at most $(\mathsf{opt}_{\bar{S}'} + O(\epsilon)) \cdot |S'| \le |S' \cap S_{\mathrm{adv}}| + O(\epsilon) \cdot |S'|$ mistakes on $\bar{S}'$, according to the inequality (4.1), where the first term corresponds to the adversarially corrupted points that were not removed during the outlier removal process. In the worst case, all of these mistakes are made in the part of $\bar{S}_{\mathrm{clean}}$ that intersects $\bar{S}'$.
The overall number of mistakes on $\bar{S}_{\mathrm{clean}}$ is therefore at most $|S_{\mathrm{rem}}| + |L^{\mathrm{cl}}| + |S' \cap S_{\mathrm{adv}}| + O(\epsilon) \cdot |S'|$. According to part 1 of Lemma 3.1, we have $|L^{\mathrm{cl}}| \le |L^{\mathrm{adv}}|$, and clearly $|L^{\mathrm{adv}}| + |S' \cap S_{\mathrm{adv}}| \le |S_{\mathrm{adv}}|$. Moreover, by Definition 1.1, we have that $|S_{\mathrm{rem}}| = |S_{\mathrm{adv}}| \le \eta m$. The error bound we obtain overall is $2\eta + O(\epsilon)$, as desired. ∎
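In the notation above, the mistake accounting of the proof can be summarized in a single line (conditioned on the events of Lemma 3.1 and the VC bounds):
$$m \cdot \Pr_{(x,y) \sim \bar{S}_{\mathrm{clean}}}[h(x) \neq y] \;\le\; \underbrace{|S_{\mathrm{rem}}|}_{\le\, \eta m} \;+\; \underbrace{|L^{\mathrm{cl}}|}_{\le\, |L^{\mathrm{adv}}|} \;+\; \underbrace{|S' \cap S_{\mathrm{adv}}| + O(\epsilon)\,|S'|}_{\text{by (4.1)}} \;\le\; 2\eta m + O(\epsilon)\, m,$$
using that $|L^{\mathrm{adv}}| + |S' \cap S_{\mathrm{adv}}| \le |S_{\mathrm{adv}}| \le \eta m$.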
Acknowledgments.
We thank Mark Braverman and Sasha Razborov for useful conversations.
References
- [ABL17] Pranjal Awasthi, Maria Florina Balcan, and Philip M Long. The power of localization for efficiently learning linear separators with noise. Journal of the ACM (JACM), 63(6):1–27, 2017.
- [BEK02] Nader H. Bshouty, Nadav Eiron, and Eyal Kushilevitz. PAC learning with nasty noise. Theoretical Computer Science, 288(2):255–275, 2002. Algorithmic Learning Theory.
- [Bra08] Mark Braverman. Polylogarithmic independence fools AC0 circuits. Journal of the ACM (JACM), 57(5):1–10, 2008.
- [DK23] Ilias Diakonikolas and Daniel M. Kane. Algorithmic High-Dimensional Robust Statistics. Cambridge University Press, 2023.
- [DKS18] Ilias Diakonikolas, Daniel M Kane, and Alistair Stewart. Learning geometric concepts with nasty noise. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 1061–1073, 2018.
- [GSSV24] Surbhi Goel, Abhishek Shetty, Konstantinos Stavropoulos, and Arsen Vasilyan. Tolerant algorithms for learning with arbitrary covariate shift. arXiv preprint arXiv:2406.02742, 2024.
- [HS19] Prahladh Harsha and Srikanth Srinivasan. On polynomial approximations to AC0. Random Structures & Algorithms, 54(2):289–303, 2019.
- [Kha95] Michael Kharitonov. Cryptographic lower bounds for learnability of boolean functions on the uniform distribution. J. Comput. Syst. Sci., 50(3):600–610, 1995.
- [KKMS08] Adam Tauman Kalai, Adam R Klivans, Yishay Mansour, and Rocco A Servedio. Agnostically learning halfspaces. SIAM Journal on Computing, 37(6):1777–1805, 2008.
- [KL93] Michael J. Kearns and Ming Li. Learning in the presence of malicious errors. SIAM J. Comput., 22(4):807–837, 1993.
- [KLS09] Adam R Klivans, Philip M Long, and Rocco A Servedio. Learning halfspaces with malicious noise. Journal of Machine Learning Research, 10(12), 2009.
- [KSV24] Adam Klivans, Konstantinos Stavropoulos, and Arsen Vasilyan. Testable learning with distribution shift. In Shipra Agrawal and Aaron Roth, editors, Proceedings of Thirty Seventh Conference on Learning Theory, volume 247 of Proceedings of Machine Learning Research, pages 2887–2943. PMLR, 30 Jun–03 Jul 2024.
- [LMN93] Nathan Linial, Yishay Mansour, and Noam Nisan. Constant depth circuits, Fourier transform, and learnability. Journal of the ACM (JACM), 40(3):607–620, 1993.
- [O’D14] Ryan O’Donnell. Analysis of boolean functions. Cambridge University Press, 2014.
- [SZ21] Jie Shen and Chicheng Zhang. Attribute-efficient learning of halfspaces with malicious noise: Near-optimal label complexity and noise tolerance. In Vitaly Feldman, Katrina Ligett, and Sivan Sabato, editors, Algorithmic Learning Theory, 16-19 March 2021, Virtual Conference, Worldwide, volume 132 of Proceedings of Machine Learning Research, pages 1072–1113. PMLR, 2021.
- [Tal17] Avishay Tal. Tight Bounds on the Fourier Spectrum of AC0. In Ryan O’Donnell, editor, 32nd Computational Complexity Conference (CCC 2017), volume 79 of Leibniz International Proceedings in Informatics (LIPIcs), pages 15:1–15:31, Dagstuhl, Germany, 2017. Schloss Dagstuhl – Leibniz-Zentrum für Informatik.
- [Val85] Leslie G. Valiant. Learning disjunction of conjunctions. In Aravind K. Joshi, editor, Proceedings of the 9th International Joint Conference on Artificial Intelligence. Los Angeles, CA, USA, August 1985, pages 560–566. Morgan Kaufmann, 1985.