
Learning Constant-Depth Circuits in
Malicious Noise Models

Adam R. Klivans
UT Austin
[email protected]. Supported by NSF award AF-1909204 and the NSF AI Institute for Foundations of Machine Learning (IFML).
   Konstantinos Stavropoulos
UT Austin
[email protected]. Supported by the NSF AI Institute for Foundations of Machine Learning (IFML) and by scholarships from Bodossaki Foundation and Leventis Foundation.
   Arsen Vasilyan
UC Berkeley
[email protected]. Supported in part by NSF awards CCF-2006664, DMS-2022448, CCF-1565235, CCF-1955217, CCF-2310818, Big George Fellowship and Fintech@CSAIL. Part of this work was conducted while the author was visiting the Simons Institute for the Theory of Computing. Work done in part while visiting UT Austin.
Abstract

The seminal work of Linial, Mansour, and Nisan gave a quasipolynomial-time algorithm for learning constant-depth circuits ($\mathsf{AC}^{0}$) with respect to the uniform distribution on the hypercube. Extending their algorithm to the setting of malicious noise, where both covariates and labels can be adversarially corrupted, has remained open. Here we achieve such a result, inspired by recent work on learning with distribution shift. Our running time essentially matches their algorithm, which is known to be optimal assuming various cryptographic primitives.

Our proof uses a simple outlier-removal method combined with Braverman’s theorem for fooling constant-depth circuits. We attain the best possible dependence on the noise rate and succeed in the harshest possible noise model (i.e., contamination or so-called “nasty noise”).

1 Introduction

In their famous paper, Linial, Mansour, and Nisan [LMN93] introduced the “low-degree” algorithm for learning Boolean functions with respect to the uniform distribution on $\{\pm 1\}^{d}$. The running time and sample complexity of their algorithm scale with the Fourier concentration of the underlying concept class, and, using this framework, they obtained a quasipolynomial-time algorithm for learning constant-depth, polynomial-size circuits ($\mathsf{AC}^{0}$).

Prior work [KKMS08] extended their result to the agnostic setting, where the labels can be adversarially corrupted but the marginal distribution on inputs must still be uniform over $\{\pm 1\}^{d}$. Remarkably, there has been no progress on this problem over the last three decades for malicious noise models, where both covariates and labels can be adversarially corrupted [Val85, KL93].

In this paper, we completely resolve this problem and obtain a quasipolynomial-time algorithm for learning $\mathsf{AC}^{0}$ in the harshest possible noise model, the so-called “nasty noise” model of [BEK02]. We define this model below and refer to it simply as learning with contamination, in line with recent work in computationally efficient robust statistics (see e.g., [DK23]).

Definition 1.1 (Learning from Contaminated Samples).

A set of $N$ labeled examples $\bar{S}_{\mathrm{inp}}$ is an $\eta$-contaminated (uniform) sample with respect to some class $\mathcal{C}\subseteq\{\{\pm 1\}^{d}\to\{\pm 1\}\}$, where $N\in\mathbb{N}$ and $\eta\in(0,1)$, if it is formed by an adversary as follows.

  1. The adversary receives a set of $N$ clean i.i.d. labeled examples $\bar{S}_{\mathrm{cln}}$, drawn from the uniform distribution over $\{\pm 1\}^{d}$ and labeled by some unknown concept $f^{*}$ in $\mathcal{C}$.

  2. The adversary removes an arbitrary set $\bar{S}_{\mathrm{rem}}$ of $\lfloor\eta N\rfloor$ labeled examples from $\bar{S}_{\mathrm{cln}}$ and substitutes it with an adversarial set of $\lfloor\eta N\rfloor$ labeled examples $\bar{S}_{\mathrm{adv}}$.

Namely, $\bar{S}_{\mathrm{inp}}=(\bar{S}_{\mathrm{cln}}\setminus\bar{S}_{\mathrm{rem}})\cup\bar{S}_{\mathrm{adv}}$. For the corresponding unlabeled set $S_{\mathrm{inp}}$, we say that it is an $\eta$-contaminated (uniform) sample.

In this model, the goal of the learner is to output (with probability $1-\delta$) a hypothesis $h:\{\pm 1\}^{d}\to\{\pm 1\}$ such that $\operatorname*{\mathbb{P}}_{\mathbf{x}\sim\mathrm{Unif}(\{\pm 1\}^{d})}[h(\mathbf{x})\neq f^{*}(\mathbf{x})]\leq 2\eta+\epsilon$. The factor $2$ is known to be the best possible constant achievable by any algorithm [BEK02].
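To make the model concrete, the following is a minimal Python sketch (ours, not part of the paper) of how an $\eta$-contaminated sample could be generated; the function name and the adversary interface are hypothetical placeholders.

import numpy as np

# Sketch of Definition 1.1: the adversary sees the clean sample and replaces
# floor(eta*N) labeled examples of its choosing with arbitrary labeled points.
def contaminated_sample(f_star, d, N, eta, adversary, seed=None):
    rng = np.random.default_rng(seed)
    X = rng.choice([-1, 1], size=(N, d))          # clean uniform covariates
    y = np.array([f_star(x) for x in X])          # labels from the unknown concept f*
    m = int(np.floor(eta * N))
    # The (placeholder) adversary returns the indices it removes and the
    # m adversarial labeled examples it inserts in their place.
    remove_idx, X_adv, y_adv = adversary(X, y, m)
    X, y = X.copy(), y.copy()
    X[remove_idx], y[remove_idx] = X_adv, y_adv
    return X, y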

Although there is now a long line of research giving computationally efficient algorithms for learning Boolean function classes in malicious noise models, these algorithms primarily apply to geometric concept classes and continuous marginal distributions, such as halfspaces or intersections of halfspaces with respect to Gaussian or log-concave densities [KKMS08, KLS09, ABL17, DKS18, SZ21]. In particular, nothing was known for the case of $\mathsf{AC}^{0}$.

Our main theorem is as follows:

Theorem 1.2.

For any $s,\ell,d\in\mathbb{N}$ and $\epsilon,\delta\in(0,1)$, there is an algorithm that learns the class of $\mathsf{AC}^{0}$ circuits of size $s$ and depth $\ell$ and achieves error $2\eta+\epsilon$, with running time and sample complexity $d^{O(k)}\log(1/\delta)$, where $k=(\log(s))^{O(\ell)}\log(1/\epsilon)$, from contaminated samples of any noise rate $\eta$.

Our running time essentially matches the Linial, Mansour, and Nisan result, which is known to be optimal assuming various cryptographic primitives [Kha95].

More generally, we prove that any concept class $\mathcal{C}$ that admits $\ell_{1}$-sandwiching polynomials of degree $k$ can be learned in time $d^{O(k)}$ from contaminated samples. Recent work due to [GSSV24] obtained a similar result achieving the weaker bound of $O(\eta)+\epsilon$ for learning functions with $\ell_{2}$-sandwiching polynomials. Crucially, it remains unclear how to obtain such $\ell_{2}$-sandwiching approximators for constant-depth circuits (Braverman’s celebrated result on $\mathsf{AC}^{0}$ [Bra08] obtains only $\ell_{1}$-sandwiching), and so their result does not apply here.

In 2005, Kalai et al. [KKMS08] showed that $\ell_{1}$-approximation suffices for agnostic learning. Here we complete the analogy for malicious learning, showing that $\ell_{1}$-sandwiching implies learnability with respect to contamination.

Proof Overview.

The input set $\bar{S}_{\mathrm{inp}}$ is $\eta$-contaminated. This might make it hard to find a hypothesis with near-optimal error on $\bar{S}_{\mathrm{inp}}$. However, we are only interested in finding a hypothesis with error $2\eta+\epsilon$ on the clean distribution, which is structured (in particular, the marginal distribution on the features is uniform over $\{\pm 1\}^{d}$). In order to take advantage of the structure of the clean distribution despite only having access to the contaminated sample, we make use of the notion of sandwiching polynomials:

Definition 1.3 (Sandwiching polynomials).

Let $f:\{\pm 1\}^{d}\to\{\pm 1\}$. We say that the ($\ell_{1}$) $\epsilon$-sandwiching degree of $f$ with respect to the uniform distribution over the hypercube $\{\pm 1\}^{d}$ is $k$ if there are polynomials $p_{\mathrm{up}},p_{\mathrm{down}}:\{\pm 1\}^{d}\to\mathbb{R}$ of degree at most $k$ such that (1) $p_{\mathrm{down}}(\mathbf{x})\leq f(\mathbf{x})\leq p_{\mathrm{up}}(\mathbf{x})$ for all $\mathbf{x}\in\{\pm 1\}^{d}$ and (2) $\operatorname*{\mathbb{E}}_{\mathbf{x}\sim\operatorname{Unif}(\{\pm 1\}^{d})}[p_{\mathrm{up}}(\mathbf{x})-p_{\mathrm{down}}(\mathbf{x})]\leq\epsilon$.
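As a sanity check of the definition, here is a small brute-force verifier (an illustrative sketch of ours; `is_eps_sandwiching`, `f`, `p_up`, `p_down` are hypothetical callables on points of $\{\pm 1\}^{d}$), feasible only for small $d$.

import itertools
import numpy as np

def is_eps_sandwiching(f, p_up, p_down, d, eps):
    cube = [np.array(x) for x in itertools.product([-1, 1], repeat=d)]
    # Condition (1): pointwise sandwiching on every point of the hypercube.
    pointwise = all(p_down(x) <= f(x) <= p_up(x) for x in cube)
    # Condition (2): the gap has expectation at most eps under the uniform distribution.
    gap = np.mean([p_up(x) - p_down(x) for x in cube])
    return pointwise and gap <= eps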

The sandwiching degree of size-$s$ depth-$\ell$ $\mathsf{AC}^{0}$ circuits is bounded by $k=(\log(s))^{O(\ell)}\log(1/\epsilon)$, due to the result of Braverman on fooling constant-depth circuits (see Theorem 4.2 from [Bra08, Tal17, HS19]). Suppose that $\bar{S}$ is a subset of $\bar{S}_{\mathrm{inp}}$ that preserves the expectations of low-degree and non-negative polynomials (e.g., $p_{\mathrm{up}}-p_{\mathrm{down}}$) compared to the uniform distribution. Under this condition, low-degree polynomial regression gives a hypothesis with near-optimal error on $\bar{S}$ (see Section 4).

We show in Lemma 3.1 that a simple procedure that iteratively removes samples from $\bar{S}_{\mathrm{inp}}$ can be used to form such a set $\bar{S}$ (one that preserves the expectations of non-negative, degree-$k$, low-expectation polynomials) and, moreover, that this procedure removes more contaminated points than clean points. The last property is important, because it implies that $\bar{S}$ is representative of the ground-truth distribution, i.e., any near-optimal hypothesis for $\bar{S}$ will also have error $2\eta+\epsilon$ on the ground truth.

This is possible because the only way the adversary can significantly increase the expectation of a non-negative polynomial $p$ is by inserting examples $\mathbf{x}$ where $p(\mathbf{x})$ is unreasonably large compared to the typical values of $p$ over the uniform distribution. Our algorithm iteratively finds the non-negative polynomial $q$ with the largest expectation over a given set through a simple linear program and then removes the points $\mathbf{x}$ for which $q(\mathbf{x})$ is large.

Our iterative outlier removal procedure is inspired by prior work on TDS learning (Testable Learning with Distribution Shift) and PQ learning [KSV24, GSSV24], as well as the work of [DKS18] on learning geometric concepts from contaminated examples. Both lines of work use outlier removal procedures that bound the variance of polynomials rather than the expectation of non-negative polynomials and, instead of linear programming, they use spectral algorithms.

2 Notation

Throughout this work, when we refer to a set $S$ of examples from the hypercube $\{\pm 1\}^{d}$, we consider every example in $S$ to be a unique and separate instance of the corresponding element in $\{\pm 1\}^{d}$. Moreover, we denote with $\bar{S}$ the corresponding labeled set of examples in $\{\pm 1\}^{d}\times\{\pm 1\}$.

Recall that polynomials over $\{\pm 1\}^{d}$ are functions of the form $p(\mathbf{x})=\sum_{\mathcal{I}\subseteq[d]}c_{p}(\mathcal{I})\prod_{i\in\mathcal{I}}x_{i}$, where $\mathbf{x}=(x_{i})_{i\in[d]}$ and $c_{p}(\mathcal{I})\in\mathbb{R}$. We denote with $\mathbf{x}^{\mathcal{I}}$ the quantity $\prod_{i\in\mathcal{I}}x_{i}$. We say that the degree of $p$ is at most $k$ if for any $\mathcal{I}\subseteq[d]$ with $|\mathcal{I}|>k$, we have $c_{p}(\mathcal{I})=0$. For a polynomial $p$, we denote with $\|p\|_{\mathrm{coef}}$ the $\ell_{1}$ norm of its coefficients, i.e., $\|p\|_{\mathrm{coef}}=\sum_{\mathcal{I}\subseteq[d]}|c_{p}(\mathcal{I})|$.
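For illustration, a multilinear polynomial in this notation can be stored as a map from subsets $\mathcal{I}$ to coefficients; the short sketch below (our own helper names) evaluates $p$ and computes $\|p\|_{\mathrm{coef}}$.

import numpy as np

def evaluate(coeffs, x):
    # p(x) = sum_I c_p(I) * prod_{i in I} x_i, with I stored as a frozenset of indices.
    return sum(c * np.prod([x[i] for i in I]) for I, c in coeffs.items())

def coef_norm(coeffs):
    # ||p||_coef: the l1 norm of the coefficient vector.
    return sum(abs(c) for c in coeffs.values())

# Example: p(x) = 0.5 + 2 x_0 x_2 has degree 2 and ||p||_coef = 2.5.
p = {frozenset(): 0.5, frozenset({0, 2}): 2.0}
x = np.array([1, -1, -1])
assert np.isclose(evaluate(p, x), 0.5 + 2.0 * (1 * -1))
assert coef_norm(p) == 2.5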

3 Removing the Outliers

The input set $\bar{S}_{\mathrm{inp}}$ includes an $\eta$ fraction of contaminated examples. It is, of course, impossible to identify the exact subset of $\bar{S}_{\mathrm{inp}}$ that is contaminated. However, we show how to remove contaminated examples that lead to inflation of the expectations of low-degree non-negative polynomials, which we call “outliers.” We remove only a relatively small number of clean examples from $\bar{S}_{\mathrm{inp}}$, as we show in the following lemma.

Lemma 3.1 (Outlier removal).

Let $S_{\mathrm{inp}}$ be an $\eta$-contaminated uniform sample (see Definition 1.1) with size $N$. For any choice of the parameters $\epsilon,\delta\in(0,1)$ and $k\in\mathbb{N}$, the output $S_{\mathrm{filt}}$ of Algorithm 1 satisfies the following, whenever $N\geq C\frac{(3d)^{2k}}{\epsilon^{2}}\log(1/\delta)$, for some sufficiently large constant $C\geq 1$.

  1. With probability at least $1-\delta$, the number of clean examples in $S_{\mathrm{inp}}$ that are removed from $S_{\mathrm{filt}}$ is at most equal to the number of adversarial examples that are removed from $S_{\mathrm{filt}}$ (see Figure 1). Namely, $|(S_{\mathrm{inp}}\cap S_{\mathrm{cln}})\setminus S_{\mathrm{filt}}|\leq|S_{\mathrm{adv}}\setminus S_{\mathrm{filt}}|$.

  2. For any non-negative polynomial $p$ over $\{\pm 1\}^{d}$ with degree at most $k$ and $\operatorname*{\mathbb{E}}_{\mathbf{x}\sim\mathrm{Unif}_{d}}[p(\mathbf{x})]\leq\frac{\epsilon}{8}$, we have $\sum_{\mathbf{x}\in S_{\mathrm{filt}}}p(\mathbf{x})\leq\epsilon N$ with probability at least $1-\delta$.

Input: Set $S_{\mathrm{inp}}\subseteq\{\pm 1\}^{d}$ of size $N$ and parameters $\epsilon\in(0,1)$ and $k\in\mathbb{N}$
Output: Filtered set $S_{\mathrm{filt}}\subseteq S_{\mathrm{inp}}$.
Let $B=3^{k}d^{k/2}$ and $\Delta=\frac{\epsilon}{2B}$.
Let $S_{\mathrm{ref}}$ be a fresh reference set of $N$ i.i.d. examples drawn from $\mathrm{Unif}(\{\pm 1\}^{d})$.
$S^{(0)}\leftarrow S_{\mathrm{inp}}$
for $i=0,1,2,\dots,N$ do
       Let $p^{*}$ be the solution of the following linear program (P) and $\lambda^{*}=\frac{1}{N}\sum_{\mathbf{x}\in S^{(i)}}p^{*}(\mathbf{x})$.
       (P):  $\max_{p}\ \sum_{\mathbf{x}\in S^{(i)}}p(\mathbf{x})$
             s.t.  $p$ polynomial, $\deg(p)\leq k$ and $\|p\|_{\mathrm{coef}}\leq B$,
                   $p(\mathbf{x})\geq 0$ for all $\mathbf{x}\in S_{\mathrm{ref}}\cup S_{\mathrm{inp}}$,
                   $\frac{1}{N}\sum_{\mathbf{x}\in S_{\mathrm{ref}}}p(\mathbf{x})\leq\epsilon/4$
       if $\lambda^{*}\leq\epsilon$ then output $S_{\mathrm{filt}}\leftarrow S^{(i)}$ and terminate;
       else
            let $\tau^{*}\geq 0$ be the smallest value such that $\frac{|S^{(i)}|}{N}\operatorname*{\mathbb{P}}_{\mathbf{x}\sim S^{(i)}}[p^{*}(\mathbf{x})>\tau^{*}]\geq 2\operatorname*{\mathbb{P}}_{\mathbf{x}\sim S_{\mathrm{ref}}}[p^{*}(\mathbf{x})>\tau^{*}]+\Delta$
            $S^{(i+1)}\leftarrow S^{(i)}\setminus\{\mathbf{x}\in S^{(i)}:p^{*}(\mathbf{x})>\tau^{*}\}$

Algorithm 1 Outlier removal through Linear Programming
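To make the linear program (P) and the removal step concrete, here is a rough Python sketch of a single iteration of Algorithm 1 using scipy.optimize.linprog; the constraint $\|p\|_{\mathrm{coef}}\leq B$ is encoded with auxiliary variables $t_{\mathcal{I}}\geq|c_{\mathcal{I}}|$. The function names and data layout are our own choices, so treat this as a schematic rather than the paper’s implementation.

import itertools
import numpy as np
from scipy.optimize import linprog

def monomial_features(X, d, k):
    # Rows: points of X; columns: monomials x^I over all subsets I with |I| <= k.
    subsets = [I for r in range(k + 1) for I in itertools.combinations(range(d), r)]
    Phi = np.array([[np.prod(x[list(I)]) if I else 1.0 for I in subsets] for x in X])
    return Phi, subsets

def outlier_removal_iteration(S_cur, S_inp, S_ref, N, d, k, eps):
    B = 3.0**k * d**(k / 2)
    Delta = eps / (2 * B)
    Phi_cur, subsets = monomial_features(S_cur, d, k)
    Phi_pos, _ = monomial_features(np.vstack([S_ref, S_inp]), d, k)   # nonnegativity points
    Phi_ref, _ = monomial_features(S_ref, d, k)
    m = len(subsets)
    # Variables z = [c (m coefficients), t (m auxiliaries)] with |c_I| <= t_I and sum_I t_I <= B.
    obj = np.concatenate([-Phi_cur.sum(axis=0), np.zeros(m)])         # maximize sum_{x in S^(i)} p(x)
    A_ub = np.vstack([
        np.hstack([-Phi_pos, np.zeros((len(Phi_pos), m))]),           # p(x) >= 0 on S_ref union S_inp
        np.hstack([Phi_ref.sum(axis=0)[None, :] / N, np.zeros((1, m))]),  # (1/N) sum_{S_ref} p(x) <= eps/4
        np.hstack([np.eye(m), -np.eye(m)]),                           #  c_I <= t_I
        np.hstack([-np.eye(m), -np.eye(m)]),                          # -c_I <= t_I
        np.hstack([np.zeros((1, m)), np.ones((1, m))]),               # sum_I t_I <= B
    ])
    b_ub = np.concatenate([np.zeros(len(Phi_pos)), [eps / 4], np.zeros(2 * m), [B]])
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * m + [(0, None)] * m)
    c_star = res.x[:m]
    vals_cur = Phi_cur @ c_star
    if vals_cur.sum() / N <= eps:                                     # stopping criterion: lambda* <= eps
        return S_cur, True
    vals_ref = Phi_ref @ c_star
    # Find a threshold tau at which the removed fraction dominates twice the
    # reference fraction plus Delta (scanning candidate values of p*).
    for tau in np.sort(np.unique(np.concatenate([[0.0], vals_cur, vals_ref]))):
        if (vals_cur > tau).sum() / N >= 2 * (vals_ref > tau).mean() + Delta:
            return S_cur[vals_cur <= tau], False                      # this is S^(i+1)
    return S_cur, True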

Our Algorithm 1 is similar in spirit to outlier removal procedures that have been used previously in the context of learning with contaminated samples [DKS18] and tolerant learning with distribution shift [GSSV24]: we iteratively find the non-negative polynomial with the largest expectation and remove the examples on which this polynomial takes unusually large values. Here we focus on the expectations of non-negative polynomials, while in all previous works the guarantees after outlier removal concerned the variance of arbitrary polynomials. In this sense, our guarantees are stronger, but they only hold for non-negative polynomials. Our algorithm solves one linear program (P) in every iteration, in place of the usual spectral techniques from prior work.
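Continuing the sketch above, the outer loop of Algorithm 1 simply repeats the single-iteration step until the stopping criterion $\lambda^{*}\leq\epsilon$ fires; by the first claim below, each non-terminating iteration removes at least one point, so at most $N$ iterations are needed.

# Sketch of the outer loop (again using the hypothetical helper from the previous sketch).
def remove_outliers(S_inp, S_ref, d, k, eps):
    N = len(S_inp)
    S_cur, done = S_inp, False
    for _ in range(N + 1):                 # at most N non-terminating iterations
        S_cur, done = outlier_removal_iteration(S_cur, S_inp, S_ref, N, d, k, eps)
        if done:
            break
    return S_cur                           # this is S_filt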

The proof idea is that whenever there is a non-negative polynomial $p^{*}$ with unreasonably large expectation, there have to be many outliers that give unusually large values to $p^{*}$. By removing all the points where $p^{*}$ is large, we can therefore be confident that we remove at least as many outliers as clean examples (part 1 of Lemma 3.1). When the algorithm terminates, all non-negative polynomials with low expectation under the uniform distribution will also have low expectation under the remaining set of examples (part 2 of Lemma 3.1).

For part 1, we analyze the non-terminating iterations and we show that for each clean point that is filtered out by the procedure, at least one adversarial point is filtered out as well. We first show that in such an iteration there is a $\tau\in[0,B]$ such that $\frac{|S^{(i)}|}{N}\operatorname*{\mathbb{P}}_{\mathbf{x}\sim S^{(i)}}[p^{*}(\mathbf{x})>\tau]\geq 2\operatorname*{\mathbb{P}}_{\mathbf{x}\sim S_{\mathrm{ref}}}[p^{*}(\mathbf{x})>\tau]+\Delta>0$. This implies that in every non-terminating iteration at least one point is removed and, therefore, some iteration $i\leq N$ will satisfy the stopping criterion and terminate (there are only $N$ points in total).

Claim.

In any non-terminating iteration (i.e., an iteration where $\lambda^{*}>\epsilon$), there is $\tau^{*}\in[0,B]$ such that

\frac{|S^{(i)}|}{N}\operatorname*{\mathbb{P}}_{\mathbf{x}\sim S^{(i)}}[p^{*}(\mathbf{x})>\tau^{*}]\geq 2\operatorname*{\mathbb{P}}_{\mathbf{x}\sim S_{\mathrm{ref}}}[p^{*}(\mathbf{x})>\tau^{*}]+\Delta.
Proof.

Suppose, for contradiction, that for all $\tau\in[0,B]$ we have

\frac{|S^{(i)}|}{N}\operatorname*{\mathbb{P}}_{\mathbf{x}\sim S^{(i)}}[p^{*}(\mathbf{x})>\tau]<2\operatorname*{\mathbb{P}}_{\mathbf{x}\sim S_{\mathrm{ref}}}[p^{*}(\mathbf{x})>\tau]+\Delta.

We may integrate both sides of the above inequality over $\tau\in[0,B]$, since the corresponding functions of $\tau$ have a finite number of discontinuities (at most $|S^{(i)}|+|S_{\mathrm{ref}}|$).

\frac{|S^{(i)}|}{N}\int_{\tau=0}^{B}\operatorname*{\mathbb{P}}_{\mathbf{x}\sim S^{(i)}}[p^{*}(\mathbf{x})>\tau]\,d\tau<2\int_{\tau=0}^{B}\operatorname*{\mathbb{P}}_{\mathbf{x}\sim S_{\mathrm{ref}}}[p^{*}(\mathbf{x})>\tau]\,d\tau+\Delta B   (3.1)

We will now substitute the integrals above with expectations, i.e., $\int_{\tau=0}^{B}\operatorname*{\mathbb{P}}_{\mathbf{x}\sim S^{(i)}}[p^{*}(\mathbf{x})>\tau]\,d\tau=\operatorname*{\mathbb{E}}_{\mathbf{x}\sim S^{(i)}}[p^{*}(\mathbf{x})]$ and $\int_{\tau=0}^{B}\operatorname*{\mathbb{P}}_{\mathbf{x}\sim S_{\mathrm{ref}}}[p^{*}(\mathbf{x})>\tau]\,d\tau=\operatorname*{\mathbb{E}}_{\mathbf{x}\sim S_{\mathrm{ref}}}[p^{*}(\mathbf{x})]$. We use the simple fact that for any non-negative random variable $X$ with values in $[0,B]$, we have $\operatorname*{\mathbb{E}}[X]=\int_{\tau=0}^{B}\operatorname*{\mathbb{P}}[X>\tau]\,d\tau$.

We first set $X=p^{*}(\mathbf{x})$, where $\mathbf{x}\sim S^{(i)}$, and observe that (1) $p^{*}(\mathbf{x})\geq 0$ for all $\mathbf{x}\in S^{(i)}$, and also that (2) $p^{*}(\mathbf{x})\leq\|p^{*}\|_{\mathrm{coef}}\leq B$ for all $\mathbf{x}\in\{\pm 1\}^{d}\supseteq S^{(i)}$, since $p^{*}$ satisfies $\|p^{*}\|_{\mathrm{coef}}\leq B$ according to the constraints of (P) and $p^{*}(\mathbf{x})=\sum_{\mathcal{I}\subseteq[d]}c_{p^{*}}(\mathcal{I})\mathbf{x}^{\mathcal{I}}\leq\sum_{\mathcal{I}\subseteq[d]}|c_{p^{*}}(\mathcal{I})|\cdot|\mathbf{x}^{\mathcal{I}}|=\sum_{\mathcal{I}\subseteq[d]}|c_{p^{*}}(\mathcal{I})|=\|p^{*}\|_{\mathrm{coef}}$, since $\mathbf{x}\in\{\pm 1\}^{d}$ and therefore $|\mathbf{x}^{\mathcal{I}}|=1$. This shows that $X\in[0,B]$ almost surely over $\mathbf{x}\sim S^{(i)}$. Using an analogous argument for $\mathbf{x}\sim S_{\mathrm{ref}}$, we overall obtain the following.

\int_{\tau=0}^{B}\operatorname*{\mathbb{P}}_{\mathbf{x}\sim S^{(i)}}[p^{*}(\mathbf{x})>\tau]\,d\tau=\operatorname*{\mathbb{E}}_{\mathbf{x}\sim S^{(i)}}[p^{*}(\mathbf{x})]\;\;\text{ and }\;\;\int_{\tau=0}^{B}\operatorname*{\mathbb{P}}_{\mathbf{x}\sim S_{\mathrm{ref}}}[p^{*}(\mathbf{x})>\tau]\,d\tau=\operatorname*{\mathbb{E}}_{\mathbf{x}\sim S_{\mathrm{ref}}}[p^{*}(\mathbf{x})]   (3.2)

We may now substitute (3.2) into the inequality (3.1) and use the fact that $\operatorname*{\mathbb{E}}_{\mathbf{x}\sim S_{\mathrm{ref}}}[p^{*}(\mathbf{x})]\leq\epsilon/4$ (by the constraints of (P)), together with $\Delta B=\epsilon/2$, to conclude that

\lambda^{*}=\frac{1}{N}\sum_{\mathbf{x}\in S^{(i)}}p^{*}(\mathbf{x})=\frac{|S^{(i)}|}{N}\operatorname*{\mathbb{E}}_{\mathbf{x}\sim S^{(i)}}[p^{*}(\mathbf{x})]\leq 2\epsilon/4+\epsilon/2=\epsilon.

We have reached a contradiction, since $\lambda^{*}>\epsilon$, and, therefore, $\tau^{*}$ exists. ∎
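As a quick numerical illustration of the layer-cake identity used above (ours, not from the paper), one can check $\operatorname*{\mathbb{E}}[X]=\int_{0}^{B}\operatorname*{\mathbb{P}}[X>\tau]\,d\tau$ for a random variable that is uniform over a finite multiset of values in $[0,B]$, which is exactly how $p^{*}(\mathbf{x})$ behaves over an empirical sample.

import numpy as np

vals = np.array([0.3, 1.7, 2.0, 0.0, 4.5])      # values of p*(x) over a small sample
B = 5.0
taus = np.linspace(0.0, B, 20001)
integral = np.mean([(vals > t).mean() for t in taus]) * B   # Riemann sum of P[X > tau] over [0, B]
assert np.isclose(integral, vals.mean(), atol=1e-3)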

We still need to show that whenever the procedure filters out clean examples, it also filters out at least as many adversarial examples. Let $S_{r}^{(i)}=\{\mathbf{x}\in S^{(i)}:p^{*}(\mathbf{x})>\tau^{*}\}$ be the set of points that are filtered out during iteration $i$. We can write $S_{r}^{(i)}$ as a disjoint union $S_{r,\mathrm{cln}}^{(i)}\cup S_{r,\mathrm{adv}}^{(i)}$, where $S_{r,\mathrm{cln}}^{(i)}=S_{r}^{(i)}\cap S_{\mathrm{cln}}$ are the clean examples that are removed and $S_{r,\mathrm{adv}}^{(i)}=S_{r}^{(i)}\cap S_{\mathrm{adv}}$ are the adversarial examples that are removed.

Claim.

With probability at least $1-\delta$, we have that for all non-terminating iterations, $|S_{r,\mathrm{cln}}^{(i)}|\leq|S_{r,\mathrm{adv}}^{(i)}|$.

Proof.

By the previous claim, we know that $\tau^{*}$ (which defines the set $S_{r}^{(i)}$) exists and has the property that $\frac{|S^{(i)}|}{N}\operatorname*{\mathbb{P}}_{\mathbf{x}\sim S^{(i)}}[p^{*}(\mathbf{x})>\tau^{*}]\geq 2\operatorname*{\mathbb{P}}_{\mathbf{x}\sim S_{\mathrm{ref}}}[p^{*}(\mathbf{x})>\tau^{*}]+\Delta$.

We first focus on the quantity $\operatorname*{\mathbb{P}}_{\mathbf{x}\sim S_{\mathrm{ref}}}[p^{*}(\mathbf{x})>\tau^{*}]$, which is proportional to the number of reference examples that would be removed by the thresholding operation $p^{*}(\mathbf{x})>\tau^{*}$. However, we are interested in the number of actual clean examples that would be removed. The reference examples can be shown to provide an estimate of the number of removed clean examples, through uniform convergence. In particular, the thresholding operation corresponds to a polynomial threshold function of degree at most $k$, a class of VC dimension at most $d^{k}$, and, therefore, by standard VC dimension arguments (and uniformly for all iterations) we have that $\operatorname*{\mathbb{P}}_{\mathbf{x}\sim S_{\mathrm{ref}}}[p^{*}(\mathbf{x})>\tau^{*}]\geq\operatorname*{\mathbb{P}}_{\mathbf{x}\sim S_{\mathrm{cln}}}[p^{*}(\mathbf{x})>\tau^{*}]-\Delta/2$, except with probability $\delta$, as long as the sample size is $N\geq C^{\prime}\frac{d^{k}+\log(1/\delta)}{\Delta^{2}}$. This is because both $S_{\mathrm{ref}}$ and $S_{\mathrm{cln}}$ consist of $N$ i.i.d. samples from the uniform distribution.

Overall, we have that $\frac{|S^{(i)}|}{N}\operatorname*{\mathbb{P}}_{\mathbf{x}\sim S^{(i)}}[p^{*}(\mathbf{x})>\tau^{*}]\geq 2\operatorname*{\mathbb{P}}_{\mathbf{x}\sim S_{\mathrm{cln}}}[p^{*}(\mathbf{x})>\tau^{*}]$. We can write the empirical probabilities in terms of the sizes of the removed sets to obtain the following, where we also use the fact that $|S_{r}^{(i)}|=|S_{r,\mathrm{cln}}^{(i)}|+|S_{r,\mathrm{adv}}^{(i)}|$ and that $|S_{r,\mathrm{cln}}^{(i)}|$ is at most equal to the number of clean examples that would be removed by the $i$-th filtering operation (some clean examples could already have been removed either by the adversary or by some previous iteration and these will not be contained in $S_{r,\mathrm{cln}}^{(i)}$).

\frac{|S^{(i)}|}{N}\cdot\frac{|S_{r}^{(i)}|}{|S^{(i)}|}\geq 2\frac{|S_{r,\mathrm{cln}}^{(i)}|}{N}\;\;\text{ or }\;\;|S_{r,\mathrm{cln}}^{(i)}|+|S_{r,\mathrm{adv}}^{(i)}|\geq 2|S_{r,\mathrm{cln}}^{(i)}|\;\;\text{ or }\;\;|S_{r,\mathrm{adv}}^{(i)}|\geq|S_{r,\mathrm{cln}}^{(i)}|

This concludes the proof of the claim. ∎

Overall, if we sum over $i\in[N]$, we obtain that the number of clean examples that are removed by the procedure is at most equal to the number of adversarial examples that are removed by the procedure.

For part 2 of Lemma 3.1, we observe that $\sum_{\mathbf{x}\in S_{\mathrm{filt}}}p(\mathbf{x})\leq\lambda^{*}N\leq\epsilon N$, as long as $p$ satisfies all the constraints of the program (P). It suffices to prove the following claim.

Claim.

Any non-negative polynomial $p$ with degree at most $k$ and $\operatorname*{\mathbb{E}}_{\mathbf{x}\sim\operatorname{Unif}_{d}}[p(\mathbf{x})]\leq\epsilon/8$ satisfies all the constraints of the program (P) with probability at least $1-\delta$.

Proof.

The degree bound and non-negativity are satisfied directly by the definition of $p$. We now need to show that $\|p\|_{\mathrm{coef}}\leq 3^{k}d^{k/2}$. Recall that $p(\mathbf{x})=\sum_{\mathcal{I}\subseteq[d]}c_{p}(\mathcal{I})\mathbf{x}^{\mathcal{I}}$, where $c_{p}(\mathcal{I})=0$ for any $|\mathcal{I}|>k$ and $\|p\|_{\mathrm{coef}}=\sum_{\mathcal{I}\subseteq[d]}|c_{p}(\mathcal{I})|$. By viewing $c_{p}$ as a vector with $\sum_{j=0}^{k}\binom{d}{j}\leq d^{k}$ dimensions (assuming $2\leq k\leq d$), we have that $\|p\|_{\mathrm{coef}}=\|c_{p}\|_{1}\leq d^{k/2}\|c_{p}\|_{2}=d^{k/2}(\operatorname*{\mathbb{E}}_{\mathbf{x}\sim\operatorname{Unif}_{d}}[(p(\mathbf{x}))^{2}])^{1/2}$.

Since the uniform distribution is $(2,1)$-hypercontractive (see Theorem 9.22 in [O’D14]), we have

(\operatorname*{\mathbb{E}}_{\mathbf{x}\sim\operatorname{Unif}_{d}}[(p(\mathbf{x}))^{2}])^{1/2}\leq e^{k}\operatorname*{\mathbb{E}}_{\mathbf{x}\sim\operatorname{Unif}_{d}}[|p(\mathbf{x})|]   (3.3)

Recall that the polynomial $p$ is non-negative. This implies that $|p(\mathbf{x})|=p(\mathbf{x})$ for all $\mathbf{x}\in\{\pm 1\}^{d}$ and therefore $\operatorname*{\mathbb{E}}_{\mathbf{x}\sim\operatorname{Unif}_{d}}[|p(\mathbf{x})|]=\operatorname*{\mathbb{E}}_{\mathbf{x}\sim\operatorname{Unif}_{d}}[p(\mathbf{x})]\leq\epsilon/8$. Overall, we have

(\operatorname*{\mathbb{E}}_{\mathbf{x}\sim\operatorname{Unif}_{d}}[(p(\mathbf{x}))^{2}])^{1/2}\leq e^{k}\epsilon/8\leq 3^{k}   (3.4)

Recall, now, that $\|p\|_{\mathrm{coef}}\leq d^{k/2}(\operatorname*{\mathbb{E}}_{\mathbf{x}\sim\operatorname{Unif}_{d}}[(p(\mathbf{x}))^{2}])^{1/2}$. We obtain the desired bound $\|p\|_{\mathrm{coef}}\leq 3^{k}d^{k/2}$.

It remains to show that, with probability at least $1-\delta$, we have $\frac{1}{N}\sum_{\mathbf{x}\in S_{\mathrm{ref}}}p(\mathbf{x})\leq\epsilon/4$. Consider the random variable $X=\frac{1}{N}\sum_{\mathbf{x}\in S_{\mathrm{ref}}}p(\mathbf{x})$, where $S_{\mathrm{ref}}$ is drawn i.i.d. from $\operatorname{Unif}_{d}$. We have that $\operatorname*{\mathbb{E}}[X]=\operatorname*{\mathbb{E}}_{\mathbf{x}\sim\operatorname{Unif}_{d}}[p(\mathbf{x})]\leq\epsilon/8$. Moreover, $p(\mathbf{x})\leq\|p\|_{\mathrm{coef}}\leq 3^{k}d^{k/2}$ for all $\mathbf{x}\in\{\pm 1\}^{d}$ and, from a standard Hoeffding bound on the random variable $X$, we obtain that $\frac{1}{N}\sum_{\mathbf{x}\in S_{\mathrm{ref}}}p(\mathbf{x})\leq\epsilon/4$ with probability at least $1-\exp(-\frac{\epsilon^{2}N}{64\,d^{k}9^{k}})$. Due to the choice of $N\geq C^{\prime}\frac{d^{k}9^{k}}{\epsilon^{2}}\log(1/\delta)$, the probability of failure is bounded by $\delta$, as desired. ∎
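The two norm facts used in this proof (Parseval over the uniform distribution, and the Cauchy–Schwarz step $\|c_{p}\|_{1}\leq d^{k/2}\|c_{p}\|_{2}$) hold for every degree-$k$ polynomial, not just non-negative ones; the short brute-force check below (a sketch of ours, on a random polynomial with small $d$) illustrates both.

import itertools
import numpy as np

d, k = 6, 2
rng = np.random.default_rng(0)
subsets = [I for r in range(k + 1) for I in itertools.combinations(range(d), r)]
coeffs = rng.normal(size=len(subsets))                       # a random degree-k polynomial
cube = np.array(list(itertools.product([-1, 1], repeat=d)))  # the whole hypercube {-1,+1}^d
Phi = np.array([[np.prod(x[list(I)]) if I else 1.0 for I in subsets] for x in cube])
vals = Phi @ coeffs
# Parseval: the l2 norm of the coefficients equals E[p(x)^2]^{1/2} under Unif_d.
assert np.isclose(np.linalg.norm(coeffs), np.sqrt(np.mean(vals**2)))
# Cauchy-Schwarz: ||c_p||_1 <= d^{k/2} ||c_p||_2 (there are at most d^k nonzero coefficients).
assert np.abs(coeffs).sum() <= d**(k / 2) * np.linalg.norm(coeffs) + 1e-9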

4 Finding a Low-Error Hypothesis

The outlier removal process of Lemma 3.1 enables us to find a subset $S_{\mathrm{filt}}$ of the input set such that all non-negative, low-degree polynomials with small expectation under the uniform distribution also have small empirical expectation under $S_{\mathrm{filt}}$. Moreover, the number of clean examples removed to form $S_{\mathrm{filt}}$ is at most the number of removed outliers (see Figure 1). We show that these two properties are all we need in order to learn constant-depth circuits with contamination (Definition 1.1).

In order to take advantage of Lemma 3.1, we will use two main tools. The first one is the following theorem originally proposed by [KKMS08] to show that $\mathcal{L}_{1}$ polynomial regression implies agnostic learning for classes that can be approximated by low-degree polynomials.

Theorem 4.1 (Learning through 1\mathcal{L}_{1} polynomial regression [KKMS08]).

Let $\mathcal{D}$ be any distribution over $\{\pm 1\}^{d}\times\{\pm 1\}$ and $\mathcal{C}$ some class of concepts from $\{\pm 1\}^{d}$ to $\{\pm 1\}$. If for each $f\in\mathcal{C}$ there is some polynomial $p$ over $\{\pm 1\}^{d}$ of degree at most $k$ such that $\operatorname*{\mathbb{E}}_{\mathbf{x}\sim\mathcal{D}_{\mathbf{x}}}[|f(\mathbf{x})-p(\mathbf{x})|]\leq\epsilon$, then there is an algorithm (based on degree-$k$ $\mathcal{L}_{1}$ polynomial regression) which outputs a degree-$k$ polynomial threshold function $h$ such that $\operatorname*{\mathbb{P}}_{(\mathbf{x},y)\sim\mathcal{D}}[y\neq h(\mathbf{x})]\leq\min_{f\in\mathcal{C}}\operatorname*{\mathbb{P}}_{(\mathbf{x},y)\sim\mathcal{D}}[y\neq f(\mathbf{x})]+2\epsilon$, in time $O(\frac{1}{\epsilon^{2}})d^{O(k)}\log(1/\delta)$.
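For intuition, degree-$k$ $\mathcal{L}_{1}$ polynomial regression is itself a linear program: minimize the empirical $\ell_{1}$ error over the coefficients of a degree-$k$ polynomial and then threshold. The sketch below is our own simplified version (the threshold is chosen by plain enumeration here; [KKMS08] analyze the rounding step more carefully) and uses scipy.optimize.linprog.

import itertools
import numpy as np
from scipy.optimize import linprog

def monomial_feats(X, d, k):
    subs = [I for r in range(k + 1) for I in itertools.combinations(range(d), r)]
    return np.array([[np.prod(x[list(I)]) if I else 1.0 for I in subs] for x in X])

def l1_poly_regression(X, y, d, k):
    Phi = monomial_feats(X, d, k)
    N, m = Phi.shape
    # Variables [c (m), e (N)]: minimize sum(e) subject to |y_i - <Phi_i, c>| <= e_i.
    obj = np.concatenate([np.zeros(m), np.ones(N)])
    A_ub = np.vstack([np.hstack([ Phi, -np.eye(N)]),     # p(x_i) - y_i <= e_i
                      np.hstack([-Phi, -np.eye(N)])])    # y_i - p(x_i) <= e_i
    b_ub = np.concatenate([y, -y])
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * m + [(0, None)] * N)
    c = res.x[:m]
    vals = Phi @ c
    # Choose the threshold minimizing the empirical error (simple enumeration).
    candidates = np.concatenate([vals, [vals.min() - 1.0]])
    t = min(candidates, key=lambda s: np.mean(np.where(vals > s, 1, -1) != y))
    return lambda x: 1 if monomial_feats(np.array([x]), d, k)[0] @ c > t else -1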

Our overall learning algorithm will first filter the input set of examples $\bar{S}_{\mathrm{inp}}$ using Algorithm 1 and then run the algorithm of Theorem 4.1 on the uniform distribution over the filtered set $\bar{S}_{\mathrm{filt}}$. All we need to show is that there is a low-degree polynomial $p$ with $\operatorname*{\mathbb{E}}_{\mathbf{x}\sim\bar{S}_{\mathrm{filt}}}[|f(\mathbf{x})-p(\mathbf{x})|]\leq\epsilon$. This is ensured by combining part 2 of Lemma 3.1 with the sandwiching approximators for constant-depth circuits originally proposed by [Bra08] in the context of pseudorandomness.

Theorem 4.2 (Sandwiching polynomials for 𝖠𝖢0\mathsf{AC}^{0} [Bra08, Tal17, HS19]).

Let $f:\{\pm 1\}^{d}\to\{\pm 1\}$ be any $\mathsf{AC}^{0}$ circuit of size $s$ and depth $\ell$ and let $\epsilon\in(0,1)$. Then, there are polynomials $p_{\mathrm{up}},p_{\mathrm{down}}$ over $\{\pm 1\}^{d}$, each of degree at most $k=(\log(s))^{O(\ell)}\cdot\log(1/\epsilon)$, such that (1) $p_{\mathrm{up}}(\mathbf{x})\geq f(\mathbf{x})\geq p_{\mathrm{down}}(\mathbf{x})$ for all $\mathbf{x}\in\{\pm 1\}^{d}$ and (2) $\operatorname*{\mathbb{E}}_{\mathbf{x}\sim\operatorname{Unif}_{d}}[p_{\mathrm{up}}(\mathbf{x})-p_{\mathrm{down}}(\mathbf{x})]\leq\epsilon$.

Proof of Theorem 1.2.

Consider the polynomial $p=p_{\mathrm{up}}-p_{\mathrm{down}}$, where $p_{\mathrm{up}},p_{\mathrm{down}}$ are the $(\epsilon/8)$-sandwiching polynomials of some circuit $f$ of size $s$ and depth $\ell$. Observe that $p$ is non-negative and $\operatorname*{\mathbb{E}}_{\mathbf{x}\sim\operatorname{Unif}_{d}}[p(\mathbf{x})]\leq\epsilon/8$. Therefore, according to part 2 of Lemma 3.1, we have $\sum_{\mathbf{x}\in S_{\mathrm{filt}}}p(\mathbf{x})\leq\epsilon N$. Since $p_{\mathrm{up}}\geq f\geq p_{\mathrm{down}}$, we also have $\operatorname*{\mathbb{E}}_{\mathbf{x}\sim S_{\mathrm{filt}}}[|f(\mathbf{x})-p_{\mathrm{down}}(\mathbf{x})|]\leq\epsilon N/|\bar{S}_{\mathrm{filt}}|$. By Theorem 4.1, we find $h:\{\pm 1\}^{d}\to\{\pm 1\}$ with $\operatorname*{\mathbb{P}}_{(\mathbf{x},y)\sim\bar{S}_{\mathrm{filt}}}[y\neq h(\mathbf{x})]\leq\min_{f\in\mathcal{C}}\operatorname*{\mathbb{P}}_{(\mathbf{x},y)\sim\bar{S}_{\mathrm{filt}}}[y\neq f(\mathbf{x})]+2\epsilon N/|\bar{S}_{\mathrm{filt}}|$, or equivalently

\sum_{(\mathbf{x},y)\in\bar{S}_{\mathrm{filt}}}\operatorname{\mathbbm{1}}\{y\neq h(\mathbf{x})\}\leq\min_{f\in\mathcal{C}}\sum_{(\mathbf{x},y)\in\bar{S}_{\mathrm{filt}}}\operatorname{\mathbbm{1}}\{y\neq f(\mathbf{x})\}+2\epsilon N   (4.1)

The error of the hypothesis $h$ on the set $\bar{S}_{\mathrm{cln}}$ gives a bound on its error under the uniform distribution with high probability, due to classical VC theory, as long as $N\geq C^{\prime}\frac{d^{k}+\log(1/\delta)}{\epsilon^{2}}$, because $h$ is a PTF of degree at most $k$. We can provide an upper bound for $\operatorname*{\mathbb{P}}_{(\mathbf{x},y)\sim\bar{S}_{\mathrm{cln}}}[y\neq h(\mathbf{x})]$ in terms of the sizes of the sets depicted in Figure 1. In particular, we give a high-probability upper bound on the number of mistakes that $h$ makes on $\bar{S}_{\mathrm{cln}}$.

  1. The points in $\bar{S}_{\mathrm{cln}}$ that are removed by the adversary are not taken into account while forming $h$, so, in the worst case, $h$ classifies them incorrectly. This gives at most $|S_{\mathrm{cln}}\setminus S_{\mathrm{inp}}|$ mistakes.

  2. Similarly, $h$ makes at most $|S_{3}|$ mistakes corresponding to the clean points that are removed during the outlier removal process.

  3. Finally, $h$ will make at most $|S_{2}|+2\epsilon N$ mistakes on $S_{\mathrm{filt}}$, according to the inequality (4.1), corresponding to the adversarially corrupted points that were not removed during the outlier removal process. In the worst case, all of these mistakes are made in the part of $S_{\mathrm{filt}}$ that intersects $S_{\mathrm{cln}}$.

Figure 1: The diagram shows the input set of points $S_{\mathrm{inp}}$ (red circle), the clean points $S_{\mathrm{cln}}$ (green circle), the output $S_{\mathrm{filt}}$ (black circle) of Algorithm 1, and the sets $S_{1}$ (yellow region), $S_{2}$ (blue region), $S_{3}$ (pink region). The set $S_{\mathrm{inp}}$ consists of clean points, except for an $\eta$ fraction of adversarial points. $S_{1}$ contains the adversarial points that are filtered out by the outlier removal process and $S_{2}$ contains the adversarial points that were not removed and are kept in $S_{\mathrm{filt}}$. $S_{3}$ contains the clean points that were filtered out during outlier removal. Lemma 3.1 states that $|S_{3}|\leq|S_{1}|$ w.h.p.

The overall error is at most $\frac{1}{N}(|S_{\mathrm{cln}}\setminus S_{\mathrm{inp}}|+|S_{3}|+|S_{2}|)+O(\epsilon)$. According to part 1 of Lemma 3.1, we have $|S_{3}|\leq|S_{1}|$. Moreover, by Definition 1.1, we have that $|S_{\mathrm{cln}}\setminus S_{\mathrm{inp}}|=|S_{\mathrm{inp}}\setminus S_{\mathrm{cln}}|=|S_{1}|+|S_{2}|=\eta N$. The error bound we obtain overall is $2\eta+O(\epsilon)$, as desired. ∎
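For clarity, the accounting in the last step combines $|S_{3}|\leq|S_{1}|$ (part 1 of Lemma 3.1) with $|S_{1}|+|S_{2}|=\eta N$ (Definition 1.1):

\frac{1}{N}\left(|S_{\mathrm{cln}}\setminus S_{\mathrm{inp}}|+|S_{3}|+|S_{2}|\right)\leq\frac{1}{N}\left(\eta N+|S_{1}|+|S_{2}|\right)=\frac{1}{N}\left(\eta N+\eta N\right)=2\eta.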

Acknowledgments.

We thank Mark Braverman and Sasha Razborov for useful conversations.

References

  • [ABL17] Pranjal Awasthi, Maria Florina Balcan, and Philip M Long. The power of localization for efficiently learning linear separators with noise. Journal of the ACM (JACM), 63(6):1–27, 2017.
  • [BEK02] Nader H. Bshouty, Nadav Eiron, and Eyal Kushilevitz. PAC learning with nasty noise. Theoretical Computer Science, 288(2):255–275, 2002. Algorithmic Learning Theory.
  • [Bra08] Mark Braverman. Polylogarithmic independence fools $AC^{0}$ circuits. Journal of the ACM (JACM), 57(5):1–10, 2008.
  • [DK23] Ilias Diakonikolas and Daniel M. Kane. Algorithmic High-Dimensional Robust Statistics. Cambridge University Press, 2023.
  • [DKS18] Ilias Diakonikolas, Daniel M Kane, and Alistair Stewart. Learning geometric concepts with nasty noise. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 1061–1073, 2018.
  • [GSSV24] Surbhi Goel, Abhishek Shetty, Konstantinos Stavropoulos, and Arsen Vasilyan. Tolerant algorithms for learning with arbitrary covariate shift. arXiv preprint arXiv:2406.02742, 2024.
  • [HS19] Prahladh Harsha and Srikanth Srinivasan. On polynomial approximations to $AC^{0}$. Random Structures & Algorithms, 54(2):289–303, 2019.
  • [Kha95] Michael Kharitonov. Cryptographic lower bounds for learnability of boolean functions on the uniform distribution. J. Comput. Syst. Sci., 50(3):600–610, 1995.
  • [KKMS08] Adam Tauman Kalai, Adam R Klivans, Yishay Mansour, and Rocco A Servedio. Agnostically learning halfspaces. SIAM Journal on Computing, 37(6):1777–1805, 2008.
  • [KL93] Michael J. Kearns and Ming Li. Learning in the presence of malicious errors. SIAM J. Comput., 22(4):807–837, 1993.
  • [KLS09] Adam R Klivans, Philip M Long, and Rocco A Servedio. Learning halfspaces with malicious noise. Journal of Machine Learning Research, 10(12), 2009.
  • [KSV24] Adam Klivans, Konstantinos Stavropoulos, and Arsen Vasilyan. Testable learning with distribution shift. In Shipra Agrawal and Aaron Roth, editors, Proceedings of Thirty Seventh Conference on Learning Theory, volume 247 of Proceedings of Machine Learning Research, pages 2887–2943. PMLR, 30 Jun–03 Jul 2024.
  • [LMN93] Nathan Linial, Yishay Mansour, and Noam Nisan. Constant depth circuits, fourier transform, and learnability. Journal of the ACM (JACM), 40(3):607–620, 1993.
  • [O’D14] Ryan O’Donnell. Analysis of boolean functions. Cambridge University Press, 2014.
  • [SZ21] Jie Shen and Chicheng Zhang. Attribute-efficient learning of halfspaces with malicious noise: Near-optimal label complexity and noise tolerance. In Vitaly Feldman, Katrina Ligett, and Sivan Sabato, editors, Algorithmic Learning Theory, 16-19 March 2021, Virtual Conference, Worldwide, volume 132 of Proceedings of Machine Learning Research, pages 1072–1113. PMLR, 2021.
  • [Tal17] Avishay Tal. Tight bounds on the Fourier spectrum of $AC^{0}$. In Ryan O’Donnell, editor, 32nd Computational Complexity Conference (CCC 2017), volume 79 of Leibniz International Proceedings in Informatics (LIPIcs), pages 15:1–15:31, Dagstuhl, Germany, 2017. Schloss Dagstuhl – Leibniz-Zentrum für Informatik.
  • [Val85] Leslie G. Valiant. Learning disjunction of conjunctions. In Aravind K. Joshi, editor, Proceedings of the 9th International Joint Conference on Artificial Intelligence. Los Angeles, CA, USA, August 1985, pages 560–566. Morgan Kaufmann, 1985.