
Reliable Learning of Halfspaces under Gaussian Marginals

Ilias Diakonikolas
University of Wisconsin-Madison
[email protected]
Supported in part by NSF Medium Award CCF-2107079 and an H.I. Romnes Faculty Fellowship.
   Lisheng Ren
University of Wisconsin-Madison
[email protected]
Supported in part by NSF Medium Award CCF-2107079.
   Nikos Zarifis
University of Wisconsin-Madison
[email protected]
Supported in part by NSF Medium Award CCF-2107079.
Abstract

We study the problem of PAC learning halfspaces in the reliable agnostic model of Kalai et al. (2012). The reliable PAC model captures learning scenarios where one type of error is costlier than the others. Our main positive result is a new algorithm for reliable learning of Gaussian halfspaces on $\mathbb{R}^{d}$ with sample and computational complexity

$$d^{O(\log(\min\{1/\alpha,1/\epsilon\}))}\min\left(2^{\log(1/\epsilon)^{O(\log(1/\alpha))}},2^{\mathrm{poly}(1/\epsilon)}\right)\;,$$

where $\epsilon$ is the excess error and $\alpha$ is the bias of the optimal halfspace. We complement our upper bound with a Statistical Query lower bound suggesting that the $d^{\Omega(\log(1/\alpha))}$ dependence is best possible. Conceptually, our results imply a strong computational separation between reliable agnostic learning and standard agnostic learning of halfspaces in the Gaussian setting.

1 Introduction

Halfspaces (or Linear Threshold Functions) form the class of functions $f:\mathbb{R}^{d}\to\{\pm 1\}$ of the form $f(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w},\mathbf{x}\rangle-t)$, where $\mathbf{w}\in\mathbb{R}^{d}$ is called the weight vector and $t\in\mathbb{R}$ is called the threshold. The problem of learning halfspaces is one of the classical and most well-studied problems in machine learning—going back to the Perceptron algorithm [Ros58]—and has had great impact on many other influential techniques, including SVMs [Vap98] and AdaBoost [FS97].
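Concretely, evaluating a halfspace is a one-liner; a minimal Python sketch (illustrative only, with $\mathrm{sign}(0)$ taken as $+1$):

```python
import numpy as np

def halfspace(w, t):
    """Return the function f(x) = sign(<w, x> - t), with sign(0) := +1."""
    w = np.asarray(w, dtype=float)
    def f(x):
        return 1 if float(np.dot(w, x)) - t >= 0.0 else -1
    return f

f = halfspace(w=[1.0, 0.0], t=0.5)
print(f([1.0, 0.0]))  # → 1   (<w, x> = 1.0 >= 0.5)
print(f([0.0, 3.0]))  # → -1  (<w, x> = 0.0 <  0.5)
```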

Here we focus on learning halfspaces from random labeled examples. The computational complexity of this task crucially depends on the choice of the underlying model. For example, in the realizable PAC model (i.e., with clean labels), the problem is known to be efficiently solvable (see, e.g., [MT94]) via a reduction to linear programming. Unfortunately, this method is quite fragile and breaks down in the presence of noisy labels. In the noisy setting, the computational complexity of the problem depends on the choice of noise model and distributional assumptions. In this work, we study the problem of distribution-specific PAC learning of halfspaces, with respect to Gaussian marginals, in the reliable agnostic model of [KKM12]. Formally, we have the following definition.

Definition 1.1 ((Positive) Reliable Learning of Gaussian Halfspaces).

Let $\mathcal{H}_{d}$ be the class of halfspaces on $\mathbb{R}^{d}$. Given $0<\epsilon<1$ and i.i.d. samples $(\mathbf{x},y)$ from a distribution $D$ supported on $\mathbb{R}^{d}\times\{\pm 1\}$, where the marginal $D_{\mathbf{x}}$ is the standard Gaussian $\mathcal{N}(\mathbf{0},\mathbf{I})$, we say that an algorithm reliably learns $\mathcal{H}_{d}$ to error $\epsilon$ if with high probability it outputs a hypothesis $h:\mathbb{R}^{d}\to\{\pm 1\}$ such that $\mathbf{Pr}_{(\mathbf{x},y)\sim D}[h(\mathbf{x})=1\land y=-1]\leq\epsilon$ and $\mathbf{Pr}_{(\mathbf{x},y)\sim D}[h(\mathbf{x})=-1\land y=1]\leq\mathrm{OPT}+\epsilon$, where $\mathrm{OPT}\stackrel{\mathrm{def}}{=}\min_{f\in\mathcal{G}(D)}\mathbf{Pr}_{(\mathbf{x},y)\sim D}[f(\mathbf{x})=-1\land y=1]$, with

$$\mathcal{G}(D)=\{f\in\mathcal{H}_{d}:\mathbf{Pr}_{(\mathbf{x},y)\sim D}[f(\mathbf{x})=1\land y=-1]=0\}\cup\{f(\mathbf{x})\equiv-1\}\;.$$

We say that $f=\operatorname{argmin}_{f\in\mathcal{G}(D)}\mathbf{Pr}_{(\mathbf{x},y)\sim D}[f(\mathbf{x})=-1\land y=1]$ is the optimal halfspace on distribution $D$; when a hypothesis $h$ satisfies the conditions above, we say that $h$ is $\epsilon$-reliable with respect to $\mathcal{H}_{d}$ on distribution $D$.

In other words, a (positive) reliable learner makes almost no false positive errors, while also maintaining the best possible false negative error—as compared to any hypothesis with no false positive errors. Note that if there is no hypothesis (in the target class) that makes no false positive errors, then essentially we have no requirement on the false negative error of the returned hypothesis; the algorithm can then simply return the constant $-1$ hypothesis.

The reliable agnostic PAC model was introduced by [KKM12] as a one-sided analogue of the classical agnostic model [Hau92, KSS94], and has since been extensively studied in a range of works; see, e.g., [KKM12, KT14, GKKT17, DJ19, GKKM20, KK21]. The underlying motivation for this definition comes from learning situations where mistakes of one type (e.g., false positive) should be avoided at all costs. Typical examples include spam detection—where incorrectly detecting spam emails is much less costly than mislabeling an important email as spam—and detecting network failures—where false negatives may be particularly harmful. Such scenarios motivate the study of reliable learning, which can be viewed as minimizing a loss function that assigns different costs to false positive and false negative errors (see, e.g., [Dom99, Elk01]). In a historical context, reliable learning is related to the Neyman-Pearson criterion [NP33] in hypothesis testing, which shows that the optimal strategy to minimize one type of error, subject to the other type of error being bounded, is to threshold the likelihood ratio function. More recently, reliable learning has been shown [KK21] to be equivalent to the PQ-learning model [GKKM20]—a recently introduced learning model motivated by covariate shift.

The algorithmic task of agnostic reliable learning can be quite challenging. While reliable learning can be viewed as minimizing a loss function with different costs for false positive and false negative errors, in general such a loss function results in a non-convex optimization problem. As pointed out in [KKM12], distribution-specific reliable learning can be efficiently reduced to distribution-specific agnostic learning (the reduction preserves the marginal distribution). A natural question, serving as a motivation for this work, is whether reliable learning is qualitatively easier computationally—for natural concept classes, and halfspaces in particular.

Before we state our contributions, we briefly summarize prior work on agnostically learning Gaussian halfspaces. Recall that, in agnostic learning, the goal is to output a hypothesis with error $\mathrm{OPT}+\epsilon$, where $\mathrm{OPT}$ is the optimal 0-1 error within the target class. Prior work has essentially characterized the computational complexity of agnostically learning Gaussian halfspaces. Specifically, it is known that the $L_{1}$ polynomial regression algorithm of [KKMS08] is an agnostic learner with complexity $d^{O(1/\epsilon^{2})}$ [DGJ+09, DKN10]. Moreover, there is strong evidence that this complexity upper bound is tight, both in the Statistical Query (SQ) model [DKZ20, GGK20, DKPZ21] and under plausible cryptographic assumptions [DKR23, Tie23]. It is worth noting that the aforementioned hardness results hold even for the subclass of homogeneous halfspaces.

Given the aforementioned reduction of reliable learning to agnostic learning [KKM12], one can use $L_{1}$-regression as a reliable agnostic learner for Gaussian halfspaces, leading to complexity $d^{O(1/\epsilon^{2})}$. Prior to this work, this was the best (and only known) reliable halfspace learner. Given the fundamental nature of this problem, it is natural to ask if one can do better for reliable learning.

Is it possible to develop faster algorithms for reliably learning Gaussian halfspaces
compared to agnostic learning?

In this work, we provide an affirmative answer to this question.

To formally state our main result, we need the notion of the bias of a Boolean function.

Definition 1.2 (Bias of Boolean Function).

We define the bias $\alpha\in[0,1/2]$ of $h:\mathbb{R}^{d}\to\{\pm 1\}$ under the Gaussian distribution as $\alpha=\alpha(h)\stackrel{\mathrm{def}}{=}\min\left(\mathbf{Pr}_{\mathbf{x}\sim\mathcal{N}_{d}}[h(\mathbf{x})=1],\mathbf{Pr}_{\mathbf{x}\sim\mathcal{N}_{d}}[h(\mathbf{x})=-1]\right)$.
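For a halfspace $h(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w},\mathbf{x}\rangle-t)$ with $\|\mathbf{w}\|_{2}=1$, the projection $\langle\mathbf{w},\mathbf{x}\rangle$ is a standard one-dimensional Gaussian, so the bias admits the closed form $\alpha(h)=\min(\Phi(t),1-\Phi(t))$. A small sketch (the function name is ours, not the paper's):

```python
from statistics import NormalDist

def halfspace_bias(t):
    """Bias alpha(h) of h(x) = sign(<w, x> - t) with ||w||_2 = 1 under N(0, I).

    <w, x> ~ N(0, 1), so Pr[h = 1] = 1 - Phi(t), and alpha is the smaller of
    the two class probabilities.
    """
    p_pos = 1.0 - NormalDist().cdf(t)
    return min(p_pos, 1.0 - p_pos)

print(halfspace_bias(0.0))  # → 0.5 (unbiased halfspace)
print(halfspace_bias(2.0))  # ≈ 0.0228, i.e. 1 - Phi(2)
```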

The main result of this paper is encapsulated in the following theorem:

Theorem 1.3 (Main Result).

Let $D$ be a joint distribution of $(\mathbf{x},y)$ supported on $\mathbb{R}^{d}\times\{\pm 1\}$ with marginal $D_{\mathbf{x}}=\mathcal{N}_{d}$, and let $\alpha$ be the bias of the optimal halfspace on distribution $D$ (with respect to Definition 1.1). There is an algorithm that uses $N=d^{O(\log(\min\{1/\alpha,1/\epsilon\}))}\min\left(2^{\log(1/\epsilon)^{O(\log(1/\alpha))}},2^{\mathrm{poly}(1/\epsilon)}\right)$ many samples from $D$, runs in $\mathrm{poly}(N,d)$ time, and with high probability returns a hypothesis $h(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w},\mathbf{x}\rangle-t)$ that is $\epsilon$-reliable with respect to $\mathcal{H}_{d}$. Moreover, for $\epsilon<\alpha/2$, any SQ algorithm for the problem requires complexity $d^{\Omega(\log(1/\alpha))}$.

For more detailed theorem statements of our upper and lower bounds, see Appendix B for the algorithmic result and the appendix on SQ hardness for the lower bound.

Theorem 1.3 gives a significantly more efficient algorithm (in terms of both sample complexity and computational complexity) for reliably learning Gaussian halfspaces, compared to agnostic learning. Specifically, as long as $\alpha>0$ is a universal constant, the overall complexity is polynomial in $d$ and quasi-polynomial in $1/\epsilon$—as opposed to $d^{\mathrm{poly}(1/\epsilon)}$ in the agnostic setting. While the complexity of our algorithm (namely, the multiplicative factor that is independent of $d$) increases as the optimal halfspace becomes more biased, it is always bounded above by the complexity of agnostic learning.

Finally, we note that our algorithm also applies to the fully reliable learning model [KKM12], since fully reliable learning easily reduces to positive reliable and negative reliable learning (as observed in [KKM12]).

1.1 Overview of Techniques

Instead of directly considering the optimal halfspace for a distribution $D$ (as defined in Definition 1.1), we introduce the following definition as a relaxation.

Definition 1.4 (Reliability Condition).

We say that a distribution $D$ supported on $X\times\{\pm 1\}$ satisfies the reliability condition with respect to $f:X\to\{\pm 1\}$ if $\mathbf{Pr}_{(\mathbf{x},y)\sim D}[f(\mathbf{x})=+1\land y=-1]=0$.

Notice that $h$ being $\epsilon$-reliable on distribution $D$ is equivalent to $h$ having competitive false negative error compared with any $f$ such that $D$ satisfies the reliability condition with respect to $f$ (instead of comparing only with the optimal halfspace). At the same time, for any fixed $f$, this definition allows the adversary to arbitrarily corrupt the negative labels, thus allowing for a more manageable analysis.

We start with a brief overview of our SQ lower bound construction, followed by the high-level idea enabling our nearly matching algorithm.

SQ Lower Bound

As follows from Definition 1.4 and the subsequent discussion, the adversary in reliable learning can arbitrarily corrupt the negative labels. We leverage this idea to construct an instance where the corruption level suffices to match many moments, meaning that the labels $y$ are uncorrelated with any polynomial of degree at most $\log(1/\alpha)$. To prove this, we construct an (infinite) feasibility Linear Program (LP) with the following property: if the LP is feasible, there exists an instance for which no low-degree polynomial correlates with the labels; see Lemma 3.3. To show the existence of such an instance, we leverage a technique from [DKPZ21] that exploits (an infinite generalization of) LP duality. Specifically, we show that it is possible to add noise to the labels whose uncorrupted value is negative so that all low-degree moments are uncorrelated with the observed labels $y$. This implies that no algorithm that relies on low-degree polynomials can succeed for reliable learning. It follows that any SQ algorithm must either use queries of tolerance at most $1/d^{\Omega(\log(1/\alpha))}$ or make exponentially many queries.

Reliable Learning Algorithm

As discussed in the previous paragraph, an adversary can arbitrarily corrupt the negative labels, which can make various algorithmic approaches fail. In more detail, the adversary can corrupt $\Omega(\log(1/\alpha))$ moments; that is, any SQ algorithm needs to rely on higher-order moment information. One of the main difficulties of this setting is that there may exist weight vectors $\mathbf{w},\mathbf{w}'$ with $\|\mathbf{w}-\mathbf{w}'\|_{2}\geq\Omega(1)$ while at the same time the 0-1 errors of $\mathbf{w}$ and $\mathbf{w}'$ are nearly the same. This might suggest that no approach can verify that the algorithm decreases the angle between the current weight vector and the optimal vector. To overcome this obstacle, we develop an algorithm that performs a random walk over a "low-dimensional" subspace. To establish the correctness of our algorithm, we prove that at some point during its execution, the algorithm will find a weight vector sufficiently close to the optimal vector.

To show that such a low-dimensional subspace exists, we proceed as follows. We first prove that there exists a non-trivial polynomial $p$ of degree $O(\log(1/\alpha))$ that correlates with the negative labels (by non-trivial, we mean that $p$ cannot be the constant polynomial, as this trivially correlates with the labels); see B.5. This implies that the low-degree moment tensors, i.e., $\mathbf{E}_{(\mathbf{x},y)\sim D}[\mathbf{1}\{y=-1\}\mathbf{x}^{\otimes k}]$ for some $k$ between $1$ and $O(\log(1/\alpha))$, correlate non-trivially with $(\mathbf{w}^{*})^{\otimes k}$, where $\mathbf{w}^{*}$ is the target weight vector. Thus, we can use these moment tensors to construct a low-dimensional subspace. We leverage the structure of the problem, namely that the (observed) negative labels are uncorrupted, to show that the correlation is in fact almost constant. As a consequence, we can construct a subspace whose dimension depends only on the degree of the polynomial $p$ (which is $O(\log(1/\alpha))$) and the desired excess error $\epsilon$.

We then proceed as follows: in each round, as long as the current solution $\mathbf{w}$ is not optimal, i.e., there are negative samples in the region $\{\mathbf{x}\in\mathbb{R}^{d}:\mathbf{w}\cdot\mathbf{x}-t\geq 0\}$, our algorithm conditions on a thin strip, projects the points $\mathbf{x}$ on the orthogonal complement of $\mathbf{w}$, and reapplies the structural result above. This leads (with constant probability) to an increase in the correlation between $\mathbf{w}$ and $\mathbf{w}^{*}$. Assuming that the correlation increases by $\beta$ in each round with probability at least $1/3$, we roughly need at most $1/\beta$ successful updates. Therefore, if we run our algorithm for $3^{1/\beta}$ steps, it is guaranteed that with probability at least $1/3$ we will find a vector almost parallel to $\mathbf{w}^{*}$.
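The restart logic above can be sketched as follows; `try_improve_direction` (standing in for the update step of Proposition 2.1) and `is_good` (the final verification) are caller-supplied placeholders, not the paper's actual subroutines—this is a minimal sketch of the control flow only.

```python
import math
import random

def random_restart_search(try_improve_direction, is_good, beta, rng=random.Random(0)):
    """Random-restart wrapper for the random walk described above.

    Each call to try_improve_direction is assumed to increase the correlation
    with the target by beta with probability at least 1/3, so one run of
    ceil(1/beta) steps succeeds with probability >= (1/3)^ceil(1/beta);
    repeating 3^ceil(1/beta) independent runs makes overall success likely.
    """
    steps_per_run = math.ceil(1.0 / beta)   # successful updates needed
    num_runs = 3 ** steps_per_run           # enough restarts for a lucky run
    for _ in range(num_runs):
        w = None                            # fresh random start each run
        for _ in range(steps_per_run):
            w = try_improve_direction(w, rng)
        if is_good(w):
            return w
    return None
```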

Prior Algorithmic Techniques

Roughly speaking, prior techniques for reliable learning relied on some variant of $L_{1}$-regression. In that sense, our algorithmic approach departs from direct polynomial approximation. Concretely, for distribution-free reliable learning, the algorithm of [KT14] uses a one-sided variant of $L_{1}$-regression—instead of approximating the target function in $L_{1}$-norm, it uses the hinge loss. For distribution-specific (and Gaussian in particular) reliable learning of halfspaces, the only previous approach uses the reduction of [KKM12] to agnostic learning. While our algorithm also leverages polynomial approximation, our approach employs significantly different ideas, and we believe it provides a novel perspective on this problem.

Technical Comparison with Prior Work

Our algorithm shares similarities with the algorithm of [DKK+22] for learning Gaussian halfspaces with Massart noise (a weaker, semi-random noise model). Both algorithms perform a random walk in order to converge to the target halfspace. That said, our algorithm is fundamentally different from theirs for the following reasons. The algorithm of [DKK+22] partitions $\mathbb{R}^{d}$ into sufficiently small angular regions and searches in each of them for a direction along which to update the current hypothesis at random. Our algorithm instead conditions on $y=-1$, i.e., on the points that are uncorrupted, and uses the guarantee (which we establish) that the first $\Omega(\log(1/\alpha))$ moments of the distribution conditioned on $y=-1$ correlate with the unknown optimal hypothesis. This leads to an algorithm with significantly faster runtime.

Our SQ lower bound leverages techniques from [DKPZ21]. Essentially, we formulate a linear program to construct a noise function that makes all low-degree polynomials uncorrelated with the labels $y$. This condition suffices to show that no SQ algorithm can solve the problem without relying on higher moments. To prove the existence of such a noise function, we use (a generalization of) LP duality for an infinite LP, which provides the necessary conditions for the existence of such a function.

1.2 Related Work

This work is part of the broader research program of characterizing the efficient learnability of natural concept classes with respect to challenging noise models. The complexity of this general learning task depends on the underlying class, the distributional assumptions, and the choice of noise model. A long line of prior work has made substantial progress in this direction for the class of halfspaces under Gaussians (and other natural distributions), and for both semi-random [ABHU15, DKTZ20a, ZSA20, DKTZ20b, DKK+20, DKK+21b, DKK+22] and adversarial noise [KKMS08, KOS08, KLS09, ABL17, DKS18, DKTZ20c, DKK+21a, DKTZ22]. An arguably surprising conceptual implication of our results is that the complexity of reliable learning for Gaussian halfspaces is qualitatively closer to the semi-random case.

1.3 Preliminaries

We use lowercase boldface letters for vectors and capital boldface letters for matrices and tensors. We use $\langle\mathbf{x},\mathbf{y}\rangle$ for the inner product between $\mathbf{x},\mathbf{y}\in\mathbb{R}^{d}$. For $\mathbf{x},\mathbf{s}\in\mathbb{R}^{d}$, we use $\mathrm{proj}_{\mathbf{s}}(\mathbf{x})\stackrel{\mathrm{def}}{=}\frac{\langle\mathbf{x},\mathbf{s}\rangle\mathbf{s}}{\|\mathbf{s}\|_{2}^{2}}$ for the projection of $\mathbf{x}$ on the direction of $\mathbf{s}$. Similarly, we use $\mathrm{proj}_{\perp\mathbf{s}}(\mathbf{x})=\mathbf{x}-\mathrm{proj}_{\mathbf{s}}(\mathbf{x})$ for the projection of $\mathbf{x}$ on the orthogonal complement of $\mathbf{s}$. Additionally, let $\mathbf{x}^{V}\in\mathbb{R}^{\dim(V)}$ be the projection of $\mathbf{x}$ on the subspace $V$, reparameterized on $\mathbb{R}^{\dim(V)}$. More precisely, let $\mathbf{B}_{V}\in\mathbb{R}^{d\times\dim(V)}$ be the matrix whose columns form an (arbitrary) orthonormal basis for the subspace $V$, and let $\mathbf{x}^{V}\stackrel{\mathrm{def}}{=}(\mathbf{B}_{V})^{\intercal}\mathbf{x}$. For $p\geq 1$ and $\mathbf{x}\in\mathbb{R}^{d}$, we use $\|\mathbf{x}\|_{p}\stackrel{\mathrm{def}}{=}\left(\sum_{i=1}^{d}|\mathbf{x}_{i}|^{p}\right)^{1/p}$ to denote the $\ell_{p}$-norm of $\mathbf{x}$. For a matrix or tensor $\mathbf{T}$, we denote by $\|\mathbf{T}\|_{F}$ its Frobenius norm.
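The projection notation can be made concrete; a small numpy sketch (helper names are ours, for illustration only):

```python
import numpy as np

def proj(s, x):
    """proj_s(x) = <x, s> s / ||s||_2^2, the projection of x on direction s."""
    s = np.asarray(s, dtype=float)
    x = np.asarray(x, dtype=float)
    return (x @ s) * s / (s @ s)

def proj_perp(s, x):
    """proj_{perp s}(x) = x - proj_s(x), projection on the complement of s."""
    return np.asarray(x, dtype=float) - proj(s, x)

def to_subspace_coords(B_V, x):
    """x^V = B_V^T x, where the columns of B_V are an orthonormal basis of V."""
    return np.asarray(B_V, dtype=float).T @ np.asarray(x, dtype=float)

s = np.array([2.0, 0.0])
x = np.array([3.0, 4.0])
print(proj(s, x))       # → [3. 0.]
print(proj_perp(s, x))  # → [0. 4.]
```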

We use $\mathcal{N}_{d}$ to denote the standard normal distribution $\mathcal{N}(\mathbf{0},\mathbf{I})$, where $\mathbf{0}$ is the $d$-dimensional zero vector and $\mathbf{I}$ is the $d\times d$ identity matrix. We use $\mathbf{x}\sim D$ to denote a random variable with distribution $D$. For a random variable $\mathbf{x}$ (resp. a distribution $D$), we use $P_{\mathbf{x}}$ (resp. $P_{D}$) to denote the probability density function or probability mass function of $\mathbf{x}$ (resp. $D$). We also use $\Phi:\mathbb{R}\to[0,1]$ to denote the cdf of $\mathcal{N}_{1}$, and $\mathbf{1}$ to denote the indicator function.

For a Boolean function $h:\mathbb{R}^{d}\to\{\pm 1\}$ and a distribution $D$ supported on $\mathbb{R}^{d}\times\{\pm 1\}$, we use $R_{+}(h;D)\stackrel{\mathrm{def}}{=}\mathbf{Pr}_{(\mathbf{x},y)\sim D}[h(\mathbf{x})=1\land y\neq 1]$ (resp. $R_{-}(h;D)\stackrel{\mathrm{def}}{=}\mathbf{Pr}_{(\mathbf{x},y)\sim D}[h(\mathbf{x})=-1\land y\neq -1]$) to denote the false positive (resp. false negative) 0-1 error.
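The empirical analogues of $R_{+}$ and $R_{-}$ on a finite sample are straightforward; a sketch (function names are ours):

```python
import numpy as np

def false_positive_error(h, X, y):
    """Empirical R_+(h; D): fraction of samples with h(x) = 1 and y != 1."""
    preds = np.array([h(x) for x in X])
    return float(np.mean((preds == 1) & (np.asarray(y) != 1)))

def false_negative_error(h, X, y):
    """Empirical R_-(h; D): fraction of samples with h(x) = -1 and y != -1."""
    preds = np.array([h(x) for x in X])
    return float(np.mean((preds == -1) & (np.asarray(y) != -1)))

h = lambda x: 1 if x[0] >= 0.0 else -1
X = [[1.0], [-1.0], [1.0], [-1.0]]
y = [1, -1, -1, 1]
print(false_positive_error(h, X, y))  # → 0.25
print(false_negative_error(h, X, y))  # → 0.25
```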

2 Algorithm for Reliably Learning Gaussian Halfspaces

In this section, we describe and analyze our algorithm establishing Theorem 1.3. Due to space limitations, some proofs are deferred to Appendix B. For convenience, we assume that $\alpha$ (the bias of the optimal halfspace) is known to the algorithm and that the excess error satisfies $\epsilon\leq\alpha/3$. These assumptions are without loss of generality for the following reasons. First, one can efficiently reduce the unknown-$\alpha$ case to the known-$\alpha$ case by guessing the value of $\alpha$. Second, if $\epsilon$ is large, there is a straightforward algorithm for the problem (simply output the best constant hypothesis).

Notation: We use $\mathcal{H}_{d}^{\alpha}$ for the set of all LTFs on $\mathbb{R}^{d}$ whose bias is equal to $\alpha$. Given the above assumptions, it suffices to give a reliable learning algorithm for $\mathcal{H}_{d}^{\alpha}$ instead of $\mathcal{H}_{d}$.

The high-level idea of the algorithm is as follows. Without loss of generality, we assume that there exists an $\alpha$-biased halfspace that correctly classifies all the points with label $y=-1$—since otherwise the algorithm can just return the hypothesis $h(\mathbf{x})\equiv -1$.

Let $f(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w}^{*},\mathbf{x}\rangle-t^{*})$ be the optimal halfspace and let $\mathbf{w}$ be our current guess of $\mathbf{w}^{*}$. Assuming that $\mathbf{w}^{*}$ is not sufficiently close to $\mathbf{w}$ and that the hypothesis classifying all points as negative is not optimal, we show that there exists a low-degree polynomial of the form $p(\langle\mathbf{w},\mathbf{x}\rangle)$ with correlation at least $2^{-O({t^{*}}^{2})}\epsilon$ with the negative labels. By leveraging this structural result, we use a spectral algorithm to find a direction $\mathbf{v}$ that is non-trivially correlated with $\mathrm{proj}_{\perp\mathbf{w}}(\mathbf{w}^{*})$ with at least some constant probability.

Unfortunately, it is not easy to verify whether $\langle\mathbf{v},\mathrm{proj}_{\perp\mathbf{w}}(\mathbf{w}^{*})\rangle$ is non-trivial. However, conditioned on the algorithm always obtaining a $\mathbf{v}$ with good correlation, we show that it takes at most $\log(1/\epsilon)^{O({t^{*}}^{2})}$ steps to get sufficiently close to $\mathbf{w}^{*}$. Therefore, repeating this process $2^{\log(1/\epsilon)^{O({t^{*}}^{2})}}$ many times will eventually find an accurate approximation to $\mathbf{w}^{*}$.

The structure of this section is as follows: In Section 2.1, we give our algorithm for finding a good direction. Section 2.2 describes and analyzes our random walk procedure.

2.1 Finding a Non-Trivial Direction

Here we present an algorithm that finds a direction correlating non-trivially with the unknown optimal vector. We first show that there exists a zero-mean polynomial of degree $O(\log(1/\alpha))$ that sign-matches the optimal hypothesis. Furthermore, using the fact that the optimal hypothesis always correctly guesses the sign of the negative points, this gives us that the polynomial correlates with the negative (clean) points. Using this intuition, our algorithm estimates the first $O(\log(1/\alpha))$ moments of the distribution $D_{\mathbf{x}}$ conditioned on $y=-1$. This guarantees that at least one moment correlates with the optimal hypothesis, as a linear combination of the moments generates the sign-matching polynomial. Then, by taking a random vector that lies in the high-influence directions (which form a low-dimensional subspace), we guarantee that, with constant probability, this vector correlates well with the unknown optimal vector. The main result of this section is the following.

Proposition 2.1.

Let $D$ be a joint distribution of $(\mathbf{x},y)$ supported on $\mathbb{R}^{d}\times\{\pm 1\}$ with marginal $D_{\mathbf{x}}=\mathcal{N}_{d}$, and let $\epsilon\in(0,1)$. Suppose $D$ satisfies the reliability condition with respect to $f(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w}^{*},\mathbf{x}\rangle-t^{*})$ with $t^{*}=O\left(\sqrt{\log(1/\epsilon)}\right)$ and $\mathbf{Pr}_{(\mathbf{x},y)\sim D}[y=-1]\geq\epsilon$. Then there is an algorithm that draws $N=d^{O({t^{*}}^{2})}/\epsilon^{2}$ samples, runs in $\mathrm{poly}(N)$ time, and with probability at least $\Omega(1)$ returns a unit vector $\mathbf{v}$ such that $\langle\mathbf{v},\mathbf{w}^{*}\rangle\geq\max\left(\log(1/\epsilon)^{-O({t^{*}}^{2})},\epsilon^{O(1)}\right)$.

We start by showing that for any distribution satisfying the reliability condition with respect to some $f(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w}^{*},\mathbf{x}\rangle-t^{*})$, there exists a degree-$O({t^{*}}^{2})$ zero-mean polynomial of the form $p(\langle\mathbf{w}^{*},\mathbf{x}\rangle)$ that has correlation at least $2^{-O({t^{*}}^{2})}\,\mathbf{Pr}_{(\mathbf{x},y)\sim D}[y=-1]$ with the labels.

Lemma 2.2 (Correlation with an Orthonormal Polynomial).

Let $D$ be a joint distribution of $(\mathbf{x},y)$ supported on $\mathbb{R}^{d}\times\{\pm 1\}$ with marginal $D_{\mathbf{x}}=\mathcal{N}_{d}$. Suppose $D$ satisfies the reliability condition with respect to $f(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w}^{*},\mathbf{x}\rangle-t^{*})$. Then there exists a polynomial $p:\mathbb{R}\to\mathbb{R}$ of degree at most $k=O({t^{*}}^{2}+1)$ such that $\mathbf{E}_{z\sim\mathcal{N}_{1}}[p(z)]=0$, $\mathbf{E}_{z\sim\mathcal{N}_{1}}[p^{2}(z)]=1$, and $\mathbf{E}_{(\mathbf{x},y)\sim D}[y\,p(\langle\mathbf{w}^{*},\mathbf{x}\rangle)]=2^{-O({t^{*}}^{2})}\,\mathbf{Pr}_{(\mathbf{x},y)\sim D}[y=-1]$.

Proof Sketch of Lemma 2.2.

Note that since $p$ is a zero-mean polynomial with respect to $D_{\mathbf{x}}$, we have that $\mathbf{E}_{(\mathbf{x},y)\sim D}[y\,p(\langle\mathbf{w}^{*},\mathbf{x}\rangle)]=-2\,\mathbf{E}_{(\mathbf{x},y)\sim D}[\mathbf{1}(y=-1)\,p(\langle\mathbf{w}^{*},\mathbf{x}\rangle)]$. Then, using the fact that the distribution $D$ satisfies the reliability condition, the statement boils down to showing that for any $\epsilon$-mass inside the interval $[t^{*},\infty)$, the expectation of $p$ on this $\epsilon$-mass is at least $2^{-O({t^{*}}^{2})}$. To prove this statement, it suffices to construct a polynomial $p$ that is non-negative on $[t^{*},\infty)$ and such that the $\epsilon/2$-tail of $p$ in that interval is at least $2^{-O({t^{*}}^{2})}$. To achieve this, we show that the sign-matching polynomial used in [DKK+22] meets our purpose. ∎
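The first identity in the sketch follows by expanding $y=1-2\cdot\mathbf{1}(y=-1)$ and using that $p$ has zero mean under the Gaussian marginal:

$$\mathbf{E}_{(\mathbf{x},y)\sim D}[y\,p(\langle\mathbf{w}^{*},\mathbf{x}\rangle)]=\mathbf{E}[p(\langle\mathbf{w}^{*},\mathbf{x}\rangle)]-2\,\mathbf{E}[\mathbf{1}(y=-1)\,p(\langle\mathbf{w}^{*},\mathbf{x}\rangle)]=-2\,\mathbf{E}_{(\mathbf{x},y)\sim D}[\mathbf{1}(y=-1)\,p(\langle\mathbf{w}^{*},\mathbf{x}\rangle)]\;,$$

since $D_{\mathbf{x}}=\mathcal{N}_{d}$ and $\mathbf{E}_{z\sim\mathcal{N}_{1}}[p(z)]=0$.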

Our algorithm uses the following normalized Hermite tensor.

Definition 2.3 (Hermite Tensor).

For $k\in\mathbb{N}$ and $\mathbf{x}\in\mathbb{R}^{d}$, we define the $k$-th Hermite tensor as

$$(\mathbf{H}_{k}(\mathbf{x}))_{i_{1},i_{2},\ldots,i_{k}}=\frac{1}{\sqrt{k!}}\sum_{\substack{\text{partitions }P\text{ of }[k]\\ \text{into sets of size 1 and 2}}}\;\bigotimes_{\{a,b\}\in P}(-\mathbf{I}_{i_{a},i_{b}})\bigotimes_{\{c\}\in P}\mathbf{x}_{i_{c}}\;.$$
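As an illustration, for the first two orders the definition specializes to $\mathbf{H}_{1}(\mathbf{x})=\mathbf{x}$ and $\mathbf{H}_{2}(\mathbf{x})=(\mathbf{x}\mathbf{x}^{\intercal}-\mathbf{I})/\sqrt{2}$. A small sketch covering only these low orders (the general-$k$ case is omitted):

```python
import math
import numpy as np

def hermite_tensor(k, x):
    """Normalized Hermite tensor H_k(x) of Definition 2.3, for k <= 2 only.

    k = 1: the single partition {{1}} gives H_1(x) = x.
    k = 2: partitions {{1},{2}} and {{1,2}} give
           H_2(x)_{ij} = (x_i x_j - I_{ij}) / sqrt(2!).
    """
    x = np.asarray(x, dtype=float)
    if k == 0:
        return np.array(1.0)
    if k == 1:
        return x.copy()
    if k == 2:
        return (np.outer(x, x) - np.eye(x.shape[0])) / math.sqrt(2.0)
    raise NotImplementedError("this sketch covers k <= 2 only")
```

For $d=1$ this recovers the normalized probabilists' Hermite polynomials, e.g. $\mathbf{H}_{2}(x)=(x^{2}-1)/\sqrt{2}$, which has unit variance under $\mathcal{N}_{1}$, consistent with the normalization of $p$ above.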

Given that such a polynomial exists, we show that if we take a flattened version of the Chow-parameter tensors (which turns them into matrices) and look at the space spanned by their top singular vectors, then a non-trivial fraction of $\mathbf{w}^{*}$ must lie inside this subspace. We prove the following lemma, which is similar to Lemma 5.10 of [DKK+22]. See Appendix B for the proof.

Lemma 2.4.

Let $D$ be the joint distribution of $(\mathbf{x},y)$ supported on $\mathbb{R}^{d}\times\{\pm 1\}$ with marginal $D_{\mathbf{x}}=\mathcal{N}_{d}$. Let $p:\mathbb{R}\to\mathbb{R}$ be a univariate, mean-zero, unit-variance polynomial of degree $k$ such that for some unit vector $\mathbf{v}^{*}\in\mathbb{R}^{d}$ it holds that $\mathbf{E}_{(\mathbf{x},y)\sim D}[\mathbf{1}(y=-1)\,p(\langle\mathbf{v}^{*},\mathbf{x}\rangle)]\geq\tau$ for some $\tau\in(0,1]$. Let $\mathbf{T}'^{m}$ be an approximation of the order-$m$ Chow-parameter tensor $\mathbf{T}^{m}=\mathbf{E}_{(\mathbf{x},y)\sim D}[\mathbf{1}(y=-1)\mathbf{H}_{m}(\mathbf{x})]$ such that $\|\mathbf{T}'^{m}-\mathbf{T}^{m}\|_{F}\leq\tau/(4\sqrt{k})$. Denote by $V_{m}$ the subspace spanned by the left singular vectors of the flattened $\mathbf{T}'^{m}$ whose singular values are greater than $\tau/(4\sqrt{k})$. Moreover, denote by $V$ the union of $V_{1},\ldots,V_{k}$. Then, for $\gamma=\mathbf{Pr}_{(\mathbf{x},y)\sim D}[y=-1]$, we have that

1. $\dim(V)=O(\gamma^{2}\log(1/\gamma)^{k}k^{2}/\tau^{2}+1)$, and

2. $\|\mathrm{proj}_{V}(\mathbf{v}^{*})\|_{2}=\Omega\left(\tau/\left(\sqrt{k}\,\gamma\log(1/\gamma)^{k/2}\right)\right)$.

By Lemma 2.4, taking a random unit vector $\mathbf{v}$ in $V$ gives, with constant probability, $\langle\mathbf{v},\mathbf{v}^{*}\rangle\geq\|\mathrm{proj}_{V}(\mathbf{v}^{*})\|_{2}/\sqrt{\dim(V)}$. We are now ready to prove Proposition 2.1. For the algorithm pseudocode, see Appendix B.
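The subspace extraction and the random choice of $\mathbf{v}$ can be sketched as follows (a toy illustration with numpy; the thresholding mirrors Lemma 2.4, but the flattened tensor here is a stand-in, not the paper's estimator):

```python
import numpy as np

def high_influence_subspace(T_flat, threshold):
    """Orthonormal basis (as columns) of the span of the left singular
    vectors of the flattened tensor whose singular values exceed threshold."""
    U, s, _ = np.linalg.svd(T_flat, full_matrices=False)
    return U[:, s > threshold]

def random_unit_vector_in(B, rng):
    """Uniformly random unit vector in the column span of B (B orthonormal)."""
    g = B @ rng.standard_normal(B.shape[1])
    return g / np.linalg.norm(g)

rng = np.random.default_rng(0)
T = np.diag([1.0, 0.01, 0.0])          # toy flattened tensor, dominant e_1
B = high_influence_subspace(T, threshold=0.1)
print(B.shape[1])                       # → 1 (only the top direction survives)
v = random_unit_vector_in(B, rng)       # here v = ±e_1
```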

Proof Sketch of Proposition 2.1.

The idea is that the empirical estimate $\mathbf{T}'^{m}$ obtained using $d^{O(\max\{{t^{*}}^{2},1\})}\log(1/\delta)/\epsilon^{2}$ many samples satisfies $\|\mathbf{T}'^{m}-\mathbf{T}^{m}\|_{F}\leq\tau/(4\sqrt{k})$ with high probability. Then, by Lemma 2.2 and Lemma 2.4, if we take $\mathbf{v}$ to be a random unit vector in $V$, with constant probability we have $\langle\mathbf{v},\mathbf{w}^{*}\rangle=\log(1/\epsilon)^{-O({t^{*}}^{2})}$. For $\langle\mathbf{v},\mathbf{w}^{*}\rangle=\epsilon^{O(1)}$, the proof relies on a different version of Lemma 2.4. ∎

2.2 Random Walk to Update Current Guess

We now describe how we use Proposition 2.1 to construct an algorithm for our learning problem. Let 𝐰\mathbf{w} be the current guess for 𝐰\mathbf{w}^{*}. For convenience, we assume that 𝐰,𝐰=Ω(1/d)\langle\mathbf{w},\mathbf{w}^{*}\rangle=\Omega(1/\sqrt{d}). Let DD^{\prime} be the distribution of 𝐱𝐰\mathbf{x}^{\perp\mathbf{w}} conditioned on 𝐱B\mathbf{x}\in B, where B={𝐱d:𝐰,𝐱t0}B=\{\mathbf{x}\in\mathbb{R}^{d}:\langle\mathbf{w},\mathbf{x}\rangle-t^{*}\geq 0\}. We next show that the 𝐱\mathbf{x}-marginals of DD^{\prime} are standard Gaussian and that DD^{\prime} satisfies the reliability condition with respect to the halfspace h(𝐱)=sign(𝐰,𝐱t)h(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w}^{\prime},\mathbf{x}\rangle-t^{\prime}) with 𝐰=𝐰𝐰\mathbf{w}^{\prime}={\mathbf{w}^{*}}^{\perp\mathbf{w}} and |t||t||t^{\prime}|\leq|t^{*}|.

Lemma 2.5.

Let DD be the joint distribution of (𝐱,y)(\mathbf{x},y) supported on d×{±1}\mathbb{R}^{d}\times\{\pm 1\} with marginal D𝐱=𝒩dD_{\mathbf{x}}=\mathcal{N}_{d} and ϵ(0,1)\epsilon\in(0,1). Suppose DD satisfies the reliability condition with respect to f(𝐱)=sign(𝐰,𝐱t)f(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w}^{*},\mathbf{x}\rangle-t^{*}). Suppose that h(𝐱)=sign(𝐰,𝐱t)h(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w},\mathbf{x}\rangle-t) with 𝐰,𝐰>0\langle\mathbf{w},\mathbf{w}^{*}\rangle>0 and tt[0,ϵ/100]t-t^{*}\in[0,\epsilon/100] does not satisfy R+(h;D)ϵ/2R_{+}(h;D)\leq\epsilon/2. Let B={𝐱d:𝐰,𝐱t0}B=\{\mathbf{x}\in\mathbb{R}^{d}:\langle\mathbf{w},\mathbf{x}\rangle-t\geq 0\} and DD^{\prime} be the distribution of (𝐱,y)=(𝐱𝐰,y)(\mathbf{x}^{\prime},y)=(\mathbf{x}^{\perp\mathbf{w}},y) given 𝐱B\mathbf{x}\in B. Then:

  1. DD^{\prime} has marginal distribution D𝐱=𝒩d1D_{\mathbf{x}^{\prime}}=\mathcal{N}_{d-1},

  2. DD^{\prime} satisfies the reliability condition with respect to h(𝐱)=sign(𝐰,𝐱t)h^{\prime}(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w}^{\prime},\mathbf{x}\rangle-t^{\prime}), where 𝐰=𝐰𝐰/𝐰𝐰2\mathbf{w}^{\prime}={\mathbf{w}^{*}}^{\perp\mathbf{w}}/\|{\mathbf{w}^{*}}^{\perp\mathbf{w}}\|_{2} and |t||t||t^{\prime}|\leq|t^{*}|, and

  3. 𝐏𝐫(𝐱,y)D[y=1]ϵ/2\operatorname*{\mathbf{Pr}}_{(\mathbf{x}^{\prime},y)\sim D^{\prime}}[y=-1]\geq\epsilon/2.

Proof Sketch of Lemma 2.5.

Item 1 follows by definition. For Item 2, we consider two cases: t>0t^{*}>0; and t0t^{*}\leq 0. For the case t0t^{*}\leq 0, we prove that the distribution DD^{\prime} satisfies the reliability condition with respect to h(𝐱)=sign(𝐰,𝐱)h^{\prime}(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w}^{\prime},\mathbf{x}\rangle), where 𝐰=𝐰𝐰/𝐰𝐰2\mathbf{w}^{\prime}={\mathbf{w}^{*}}^{\perp\mathbf{w}}/\|{\mathbf{w}^{*}}^{\perp\mathbf{w}}\|_{2}. For the case t>0t^{*}>0, we prove that DD^{\prime} satisfies the reliability condition with respect to h(𝐱)=sign(𝐰,𝐱t)h^{\prime}(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w}^{\prime},\mathbf{x}\rangle-t^{\prime}), where 𝐰=𝐰𝐰/𝐰𝐰2\mathbf{w}^{\prime}={\mathbf{w}^{*}}^{\perp\mathbf{w}}/\|{\mathbf{w}^{*}}^{\perp\mathbf{w}}\|_{2} and t=tt^{\prime}=t^{*}. Then Item 3 follows from the fact that hh does not satisfy R+(h;D)ϵ/2R_{+}(h;D)\leq\epsilon/2. ∎

By Lemma 2.5, given any current guess 𝐰\mathbf{w} such that h(𝐱)=sign(𝐰,𝐱t)h(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w},\mathbf{x}\rangle-t^{*}) does not satisfy R+(h;D)ϵ/2R_{+}(h;D)\leq\epsilon/2, the corresponding distribution DD^{\prime} satisfies the reliability condition with respect to hh^{\prime} and 𝐏𝐫(𝐱,y)D[y=1]ϵ/2\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D^{\prime}}[y=-1]\geq\epsilon/2. Therefore, DD^{\prime} satisfies the assumptions of Proposition 2.1. So, applying the algorithm of Proposition 2.1, we obtain with constant probability a unit vector 𝐯\mathbf{v} such that 𝐯,𝐰=0\langle\mathbf{v},\mathbf{w}\rangle=0 and 𝐯,𝐰=log(1/ϵ)O(t2)\langle\mathbf{v},\mathbf{w}^{*}\rangle=\log(1/\epsilon)^{-O({t^{*}}^{2})}.

The following fact shows that by updating our current guess in the direction of 𝐯\mathbf{v} with appropriate step size, we can get an updated guess with increased correlation with 𝐰\mathbf{w}^{*}.

Fact 2.6 (Correlation Improvement, Lemma 5.13 in [DKK+{}^{+}22]).

Fix unit vectors 𝐯,𝐯d\mathbf{v}^{*},\mathbf{v}\in\mathbb{R}^{d}. Let 𝐮d\mathbf{u}\in\mathbb{R}^{d} such that 𝐮,𝐯c\langle\mathbf{u},\mathbf{v}^{*}\rangle\geq c, 𝐮,𝐯=0\langle\mathbf{u},\mathbf{v}\rangle=0 and 𝐮21\|\mathbf{u}\|_{2}\leq 1 with c>0c>0. Then, for 𝐯=𝐯+λ𝐮𝐯+λ𝐮2\mathbf{v}^{\prime}=\frac{\mathbf{v}+\lambda\mathbf{u}}{\|\mathbf{v}+\lambda\mathbf{u}\|_{2}}, with λ=c/2\lambda=c/2, we have that 𝐯,𝐯𝐯,𝐯+λ2/2\langle\mathbf{v}^{\prime},\mathbf{v}^{*}\rangle\geq\langle\mathbf{v},\mathbf{v}^{*}\rangle+\lambda^{2}/2.

Notice that because proj𝐰(𝐰)2\|\mathrm{proj}_{\perp\mathbf{w}}(\mathbf{w}^{*})\|_{2} is unknown, we cannot always choose the optimal step size λ\lambda. Instead, we will use the same λ\lambda to do sufficiently many update steps such that after that many updates, we are certain that proj𝐰(𝐰)23λ\|\mathrm{proj}_{\perp\mathbf{w}}(\mathbf{w}^{*})\|_{2}\leq 3\lambda. We then take the new step size λupdate=λ/2\lambda_{\text{update}}=\lambda/2 and repeat this process, until 𝐰\mathbf{w} and 𝐰\mathbf{w}^{*} are sufficiently close to each other.
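The guarantee of Fact 2.6 can be sanity-checked numerically; the following sketch (with an arbitrary choice of 𝐯\mathbf{v}, 𝐯\mathbf{v}^{*}, and 𝐮\mathbf{u} satisfying the hypotheses, made up for the example) verifies the λ2/2\lambda^{2}/2 improvement:

```python
# Illustration (not the paper's code): the update of Fact 2.6 in R^3.
# With <u, v*> >= c, <u, v> = 0, ||u||_2 <= 1 and step lambda = c/2,
# the normalized update gains at least lambda^2 / 2 in correlation.
import math

def update(v, u, lam):
    w = [vi + lam * ui for vi, ui in zip(v, u)]
    n = math.sqrt(sum(x * x for x in w))
    return [x / n for x in w]

theta = 1.0                                       # angle between v and v*
v = [1.0, 0.0, 0.0]                               # current guess (unit)
v_star = [math.cos(theta), math.sin(theta), 0.0]  # target (unit)
u = [0.0, 1.0, 0.0]                               # unit vector with <u, v> = 0
c = math.sin(theta)                               # <u, v*> = sin(theta) = c > 0
lam = c / 2

v_new = update(v, u, lam)
old = sum(a * b for a, b in zip(v, v_star))       # cos(theta)
new = sum(a * b for a, b in zip(v_new, v_star))
assert new >= old + lam ** 2 / 2                  # Fact 2.6 guarantee
```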

We are now ready to describe our algorithm (Algorithm 1) and prove its correctness.

Notice that given that we know the bias α\alpha, tt must be either Φ1(α)-\Phi^{-1}(\alpha) or Φ1(α)\Phi^{-1}(\alpha). For convenience, we assume that t=Φ1(α)t=\Phi^{-1}(\alpha). To account for the case that t=Φ1(α)t=-\Phi^{-1}(\alpha), we can simply run Algorithm 1 twice and pick the output halfspace with the smallest tt value (or even run a different efficient algorithm, since t0t\leq 0 as explained in Appendix B). For convenience, we also assume min(2log(1/ϵ)O(log(1/α)),2poly(1/ϵ))=2log(1/ϵ)O(log(1/α))\min\left(2^{\log(1/\epsilon)^{O(\log(1/\alpha))}},2^{\mathrm{poly}(1/\epsilon)}\right)=2^{\log(1/\epsilon)^{O(\log(1/\alpha))}}. To account for the other case, we can simply initialize the step size ζ=ϵc\zeta=\epsilon^{c} for a sufficiently large constant cc instead of ζ=log(1/ϵ)ct2\zeta=\log(1/\epsilon)^{-c{t^{*}}^{2}}.

A more detailed version of Algorithm 1 is deferred to Appendix B.

Input: ϵ(0,1)\epsilon\in(0,1), α(0,1/2)\alpha\in(0,1/2) and sample access to a joint distribution DD of (𝐱,y)(\mathbf{x},y) supported on d×{±1}\mathbb{R}^{d}\times\{\pm 1\} with 𝐱\mathbf{x}-marginal D𝐱=𝒩dD_{\mathbf{x}}=\mathcal{N}_{d}.
Output: h(𝐱)=sign(𝐰,𝐱t)h(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w},\mathbf{x}\rangle-t) that is ϵ\epsilon-reliable with respect to the class dα\mathcal{H}_{d}^{\alpha}.
  1. Check if 𝐏𝐫(𝐱,y)D[y=1]ϵ/2\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[y=-1]\leq\epsilon/2 (with sufficiently small constant failure probability). If so, return the +1+1 constant hypothesis. Set the initial step size ζ=log(1/ϵ)ct2\zeta=\log(1/\epsilon)^{-c{t^{*}}^{2}}, where cc is a sufficiently large universal constant and t=Φ1(α)t^{*}=\Phi^{-1}(\alpha).

  2. Initialize 𝐰\mathbf{w} to be a random unit vector in d\mathbb{R}^{d}. Let the update step size λ=ζ\lambda=\zeta and repeat the following process until λϵ/100\lambda\leq\epsilon/100.

    (a) Use samples from DD to check if the hypothesis h(𝐱)=sign(𝐰,𝐱t)h(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w},\mathbf{x}\rangle-t) satisfies R+(h;D)ϵ/2R_{+}(h;D)\leq\epsilon/2. If so, go to Step (3).

    (b) With probability 1/21/2, let 𝐰=𝐰\mathbf{w}=-\mathbf{w}. Let B={𝐱d:𝐰,𝐱t0}B=\{\mathbf{x}\in\mathbb{R}^{d}:\langle\mathbf{w},\mathbf{x}\rangle-t\geq 0\}, and let DD^{\prime} be the distribution of (𝐱𝐰,y)(\mathbf{x}^{\perp\mathbf{w}},y) for (𝐱,y)D(\mathbf{x},y)\sim D given 𝐱B\mathbf{x}\in B. Use the algorithm of Proposition 2.1 on DD^{\prime} to find a unit vector 𝐯\mathbf{v} such that 𝐯,𝐰=0\langle\mathbf{v},\mathbf{w}\rangle=0 and 𝐯,proj𝐰(𝐰)proj𝐰(𝐰)2ζ\left\langle\mathbf{v},\frac{\mathrm{proj}_{\perp\mathbf{w}}(\mathbf{w}^{*})}{\|\mathrm{proj}_{\perp\mathbf{w}}(\mathbf{w}^{*})\|_{2}}\right\rangle\geq\zeta. Then update 𝐰\mathbf{w} as follows: 𝐰update=𝐰+λ𝐯𝐰+λ𝐯2.\mathbf{w}_{\mathrm{update}}=\frac{\mathbf{w}+\lambda\mathbf{v}}{\|\mathbf{w}+\lambda\mathbf{v}\|_{2}}\;.

    (c) Repeat Steps (2a) and (2b) c/ζ2c/\zeta^{2} times, where cc is a sufficiently large universal constant, with the same step size λ\lambda. After that, update the new step size as λupdate=λ/2\lambda_{\mathrm{update}}=\lambda/2.

  3. Check if h(𝐱)=sign(𝐰,𝐱t)h(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w},\mathbf{x}\rangle-t) satisfies R+(h;D)ϵ/2R_{+}(h;D)\leq\epsilon/2. If so, return hh and terminate. Repeat Step (2) 21/ζc2^{1/\zeta^{c}} many times, where cc is a sufficiently large constant.

Algorithm 1 Reliably Learning General Halfspaces with Gaussian Marginals.
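The outer loop of Algorithm 1 can be sketched schematically as follows. This is an idealized illustration, not the actual algorithm: the subroutine of Proposition 2.1 is replaced by a hypothetical oracle that returns the exact direction proj𝐰(𝐰)/proj𝐰(𝐰)2\mathrm{proj}_{\perp\mathbf{w}}(\mathbf{w}^{*})/\|\mathrm{proj}_{\perp\mathbf{w}}(\mathbf{w}^{*})\|_{2}, and all statistical checks are omitted.

```python
# Schematic sketch of Algorithm 1's update schedule (illustration only):
# the call to the subroutine of Proposition 2.1 is replaced by an
# idealized oracle that knows w_star, so only the geometry of the
# step-size halving schedule is exercised here.
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def oracle_direction(w, w_star):
    # hypothetical stand-in for Proposition 2.1: the normalized
    # component of w_star orthogonal to the current guess w
    a = sum(x * y for x, y in zip(w, w_star))
    u = [ws - a * wi for ws, wi in zip(w_star, w)]
    n = math.sqrt(sum(x * x for x in u))
    return None if n < 1e-12 else [x / n for x in u]

eps = 1e-2
w_star = normalize([3.0, 4.0])       # unknown target direction
w = [0.0, 1.0]                       # initial guess
lam = 0.5                            # initial step size (arbitrary for the demo)
while lam > eps / 100:               # halving schedule, as in Step (2c)
    for _ in range(32):              # fixed number of updates per step size
        v = oracle_direction(w, w_star)
        if v is None:
            break                    # already aligned with w_star
        w = normalize([wi + lam * vi for wi, vi in zip(w, v)])
    lam /= 2

corr = sum(x * y for x, y in zip(w, w_star))
assert corr > 1 - 1e-6               # w has essentially aligned with w_star
```

Each update rotates 𝐰\mathbf{w} toward 𝐰\mathbf{w}^{*} by an angle of roughly arctan(λ)\arctan(\lambda), so after the inner loop the residual angle is at most about λ\lambda, matching the invariant that the orthogonal component is O(λ)O(\lambda) before λ\lambda is halved.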
Proof Sketch of the algorithmic part of Theorem 1.3.

Let f(𝐱)=sign(𝐰,𝐱t)f(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w}^{*},\mathbf{x}\rangle-t^{*}) be the optimal halfspace with bias α\alpha. We need to show that with high probability Algorithm 1 returns a hypothesis h(𝐱)=sign(𝐰,𝐱t)h(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w},\mathbf{x}\rangle-t) such that R+(h;D)ϵR_{+}(h;D)\leq\epsilon and R(h;D)R(f;D)+ϵR_{-}(h;D)\leq R_{-}(f;D)+\epsilon.

To do so, it suffices to show that R+(h;D)ϵR_{+}(h;D)\leq\epsilon; given R+(h;D)ϵR_{+}(h;D)\leq\epsilon, the bound R(h;D)R(f;D)+ϵR_{-}(h;D)\leq R_{-}(f;D)+\epsilon follows from our choice of tt. For convenience, we can assume hh never satisfies R+(h;D)ϵ/2R_{+}(h;D)\leq\epsilon/2 in Step (2a) (otherwise, we are done). We can also assume that the subroutine in Proposition 2.1 always succeeds, since the algorithm repeats Step (2) sufficiently many times. Given the above conditions, using Fact 2.6, one can show that after each run of c/ζ2c/\zeta^{2} many updates in Step (2b), we must have proj𝐰𝐰23λ\|\mathrm{proj}_{\perp\mathbf{w}}\mathbf{w}^{*}\|_{2}\leq 3\lambda. Therefore, when λϵ/100\lambda\leq\epsilon/100, we have proj𝐰𝐰23ϵ/100\|\mathrm{proj}_{\perp\mathbf{w}}\mathbf{w}^{*}\|_{2}\leq 3\epsilon/100, which implies R+(h;D)ϵ/2R_{+}(h;D)\leq\epsilon/2. ∎

3 Nearly Matching SQ Lower Bound

In this section, we establish the SQ hardness result of Theorem 1.3. Due to space limitations, some proofs have been deferred to Appendix C.

Proof Overview

To establish our SQ lower bound for reliable learning, we first prove an SQ lower bound for a natural decision version of reliably learning α\alpha-biased LTFs. We define the following decision problem over distributions.

Definition 3.1 (Decision Problem over Distributions).

Let DD be a fixed distribution and 𝒟\cal D be a distribution family. We denote by (𝒟,D)\mathcal{B}(\mathcal{D},D) the decision problem in which the input distribution DD^{\prime} is promised to satisfy either (a) D=DD^{\prime}=D or (b) D𝒟D^{\prime}\in\mathcal{D}, and the goal is to distinguish the two cases with high probability.

We show that given SQ access to a joint distribution DD of (𝐱,y)(\mathbf{x},y) supported on d×{±1}\mathbb{R}^{d}\times\{\pm 1\} with marginal D𝐱=𝒩(𝟎,𝐈)D_{\mathbf{x}}=\mathcal{N}({\bf 0},\mathbf{I}), it is hard to solve the problem (𝒟,D)\mathcal{B}(\mathcal{D},D) with the following distributions.

  (a) Null hypothesis: DD is the distribution so that y=1y=1 with probability 1/21/2 independent of 𝐱\mathbf{x}.

  (b) Alternative hypothesis: D𝒟D\in\mathcal{D}, where 𝒟\mathcal{D} is a family of distributions such that for any distribution D𝒟D\in\mathcal{D}, there exists an α\alpha-biased LTF ff such that R+(f;D)=0R_{+}(f;D)=0.

In order to construct such a family of distributions 𝒟\mathcal{D}, we start by constructing a joint distribution DD^{\prime} of (z,y)(z,y) over ×{±1}\mathbb{R}\times\{\pm 1\} such that the marginal distribution of zz is 𝒩1\mathcal{N}_{1} and the conditional distributions zy=1z\mid y=1 and zy=1z\mid y=-1 both match many moments with the standard Gaussian 𝒩1\mathcal{N}_{1}. Moreover, there is α\alpha probability mass on the positive side of the marginal distribution of zz that is purely associated with y=1y=1 (i.e., 𝐄(z,y)D[yzc]=1\operatorname*{\mathbf{E}}_{(z,y)\sim D^{\prime}}[y\mid z\geq c]=1 where c=Φ1(1α)c=\Phi^{-1}(1-\alpha)). We then embed this distribution DD^{\prime} along a hidden direction inside the joint distribution of (𝐱,y)(\mathbf{x},y) on d×{±1}\mathbb{R}^{d}\times\{\pm 1\} to construct a family of hard-to-distinguish distributions using the “hidden-direction” framework developed in [DKS17] and enhanced in [DKPZ21, DKRS23].

We can now proceed with the details of the proof. We start by defining the pairwise correlation between distributions.

Definition 3.2 (Pairwise Correlation).

The pairwise correlation of two distributions with pdfs D1,D2:d+D_{1},D_{2}:\mathbb{R}^{d}\mapsto\mathbb{R}_{+} with respect to a distribution with density D:d+D:\mathbb{R}^{d}\mapsto\mathbb{R}_{+}, where the support of DD contains the support of D1D_{1} and D2D_{2}, is defined as χD(D1,D2)=defdD1(𝐱)D2(𝐱)/D(𝐱)d𝐱1\chi_{D}(D_{1},D_{2})\stackrel{{\scriptstyle{\mathrm{\footnotesize def}}}}{{=}}\int_{\mathbb{R}^{d}}D_{1}(\mathbf{x})D_{2}(\mathbf{x})/D(\mathbf{x})d\mathbf{x}-1. Furthermore, the χ\chi-squared divergence of D1D_{1} to DD is defined as χ2(D1,D)=defχD(D1,D1)\chi^{2}(D_{1},D)\stackrel{{\scriptstyle{\mathrm{\footnotesize def}}}}{{=}}\chi_{D}(D_{1},D_{1}).
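As a quick numerical illustration of this definition (not taken from the paper), for D=𝒩(0,1)D=\mathcal{N}(0,1) and D1=𝒩(μ,1)D_{1}=\mathcal{N}(\mu,1) the χ\chi-squared divergence has the closed form χ2(D1,D)=eμ21\chi^{2}(D_{1},D)=e^{\mu^{2}}-1, which can be checked by direct integration:

```python
# Numerical illustration (not from the paper): for D = N(0,1) and
# D1 = N(mu, 1), the chi-squared divergence defined above satisfies
# chi^2(D1, D) = e^{mu^2} - 1.
import math

def pdf(x, mu):
    return math.exp(-(x - mu) ** 2 / 2) / math.sqrt(2 * math.pi)

mu, lo, hi, n = 0.7, -10.0, 10.0, 20000
step = (hi - lo) / n
# chi_D(D1, D1) = \int D1(x)^2 / D(x) dx - 1, approximated by a midpoint sum
integral = sum(
    pdf(lo + (i + 0.5) * step, mu) ** 2 / pdf(lo + (i + 0.5) * step, 0.0) * step
    for i in range(n)
)
chi_sq = integral - 1
assert abs(chi_sq - (math.exp(mu ** 2) - 1)) < 1e-4
```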

In particular, the framework in [DKS17] allows us to construct a family of 2dΩ(1)2^{d^{\Omega(1)}} distributions on d×{±1}\mathbb{R}^{d}\times\{\pm 1\} whose pairwise correlation is dΩ(n)d^{-\Omega(n)} where nn is the number of matching moments DD^{\prime} has with the standard Gaussian. Then, using standard SQ dimension techniques, this gives an SQ lower bound for the distinguishing problem. After that, we reduce the distinguishing problem to the problem of reliably learning α\alpha-biased LTFs under Gaussian marginals with additive error ϵ<α/3\epsilon<\alpha/3.

In order to construct such a distribution DD^{\prime} of (z,y)(z,y) supported on ×{±1}\mathbb{R}\times\{\pm 1\}, we reparameterize 𝐄(z,y)D[y|z]\operatorname*{\mathbf{E}}_{(z,y)\sim D^{\prime}}[y|z] as g(z)g(z). For a function g:g:\mathbb{R}\to\mathbb{R}, we use gp=𝐄t𝒩1[|g(t)|p]1/p\|g\|_{p}=\operatorname*{\mathbf{E}}_{t\sim\mathcal{N}_{1}}[|g(t)|^{p}]^{1/p} for its LpL_{p} norm. We let L1()L^{1}(\mathbb{R}) denote the set of all functions g:g:\mathbb{R}\to\mathbb{R} that have finite L1L_{1}-norm.

We use linear programming over one-dimensional functions in L1()L^{1}(\mathbb{R}) space to establish the existence of such a function. Specifically, we show the following:

Lemma 3.3.

For any sufficiently large nn\in\mathbb{N}, there exists a function g:[1,+1]g:\mathbb{R}\mapsto[-1,+1] such that gg satisfies the following properties:

  • (i) g(z)=1g(z)=1 for all zcz\geq c, where c=Φ1(132n/4)c=\Phi^{-1}(1-3^{-2n}/4), and

  • (ii) 𝐄z𝒩1[g(z)zk]=0\operatorname*{\mathbf{E}}_{z\sim\mathcal{N}_{1}}[g(z)z^{k}]=0 for all k[n]k\in[n].

Proof Sketch of Lemma 3.3.

We let PnP_{n} denote the set of all polynomials p:p:\mathbb{R}\to\mathbb{R} of degree at most nn and let L+1()L^{1}_{+}(\mathbb{R}) denote the set of all nonnegative functions in L1()L^{1}(\mathbb{R}). We formulate the following primal linear program:

find gL1()g\in L^{1}(\mathbb{R})
such that 𝐄z𝒩1[p(z)g(z)]=0\operatorname*{\mathbf{E}}_{z\sim\mathcal{N}_{1}}[p(z)g(z)]=0 for all pPnp\in P_{n},
𝐄z𝒩1[g(z)h(z)𝟙{zc}]h(z)𝟙{zc}1\operatorname*{\mathbf{E}}_{z\sim\mathcal{N}_{1}}[g(z)h(z)\mathds{1}\{z\geq c\}]\geq\|h(z)\mathds{1}\{z\geq c\}\|_{1} for all hL+1()h\in L_{+}^{1}(\mathbb{R}),
𝐄z𝒩1[g(z)H(z)]H1\operatorname*{\mathbf{E}}_{z\sim\mathcal{N}_{1}}[g(z)H(z)]\leq\|H\|_{1} for all HL1()H\in L^{1}(\mathbb{R}).

Then, using (infinite-dimensional) LP duality, we get that the above primal is feasible if and only if there is no polynomial pp of degree at most nn such that 𝐄t𝒩1[|p(t)|𝟙(tc)]<𝐄t𝒩1[p(t)𝟙(tc)]\operatorname*{\mathbf{E}}_{t\sim\mathcal{N}_{1}}[|p(t)|\mathds{1}(t\leq c)]<\operatorname*{\mathbf{E}}_{t\sim\mathcal{N}_{1}}[p(t)\mathds{1}(t\geq c)].

Using Gaussian hypercontractivity (see [Bog98, Nel73]), one can show that any polynomial pp of degree at most nn satisfying 𝐄t𝒩1[|p(t)|𝟙(tc)]<𝐄t𝒩1[p(t)𝟙(tc)]\operatorname*{\mathbf{E}}_{t\sim\mathcal{N}_{1}}[|p(t)|\mathds{1}(t\leq c)]<\operatorname*{\mathbf{E}}_{t\sim\mathcal{N}_{1}}[p(t)\mathds{1}(t\geq c)] must also satisfy

𝐄z𝒩1[|p(z)|]<23n𝐄z𝒩1[|p(z)|](𝐏𝐫z𝒩1[zc])1/2.\operatorname*{\mathbf{E}}_{z\sim\mathcal{N}_{1}}[|p(z)|]<2\cdot 3^{n}\operatorname*{\mathbf{E}}_{z\sim\mathcal{N}_{1}}[|p(z)|]\left(\operatorname*{\mathbf{Pr}}_{z\sim\mathcal{N}_{1}}[z\geq c]\right)^{1/2}\;.

From our choice of the parameter cc, we have that 𝐏𝐫z𝒩1[zc]32n/4\operatorname*{\mathbf{Pr}}_{z\sim\mathcal{N}_{1}}[z\geq c]\leq 3^{-2n}/4; thus, 𝐄z𝒩1[|p(z)|]<𝐄z𝒩1[|p(z)|]\operatorname*{\mathbf{E}}_{z\sim\mathcal{N}_{1}}[|p(z)|]<\operatorname*{\mathbf{E}}_{z\sim\mathcal{N}_{1}}[|p(z)|], which is a contradiction. Therefore, no such polynomial pp exists, and by LP duality the primal is feasible. ∎

We have proven the existence of the function gg in Lemma 3.3. Now, we construct a joint distribution of (z,y)(z,y) on ×{±1}\mathbb{R}\times\{\pm 1\} such that 𝐄[y|z]\operatorname*{\mathbf{E}}[y|z] is exactly gg, as we discussed in the proof outline. For a joint distribution DD of (x,y)(x,y) supported on X×{±1}X\times\{\pm 1\}, we will use D+D_{+} to denote the conditional distribution of xx given y=1y=1; and DD_{-} for the distribution of xx given y=1y=-1.

Lemma 3.4.

For any sufficiently large nn\in\mathbb{N}, there exists a distribution DD on ×{±1}\mathbb{R}\times\{\pm 1\} such that for (z,y)D(z,y)\sim D:

  • (i) the marginal distribution Dz=𝒩1D_{z}=\mathcal{N}_{1};

  • (ii) 𝐄(z,y)D[y|z=z]=1\operatorname*{\mathbf{E}}_{(z,y)\sim D}[y|z=z^{\prime}]=1 for all zΦ1(132n/4)z^{\prime}\geq\Phi^{-1}(1-3^{-2n}/4);

  • (iii) 𝐄(z,y)D[y]=0\operatorname*{\mathbf{E}}_{(z,y)\sim D}[y]=0 and 𝐄(z,y)D[zk]=𝐄(z,y)D[zky=1]=𝐄(z,y)D[zky=1]\operatorname*{\mathbf{E}}_{(z,y)\sim D}[z^{k}]=\operatorname*{\mathbf{E}}_{(z,y)\sim D}[z^{k}\mid y=1]=\operatorname*{\mathbf{E}}_{(z,y)\sim D}[z^{k}\mid y=-1] for all k[n]k\in[n];

  • (iv) χ2(D+,𝒩1),χ2(D,𝒩1)=O(1)\chi^{2}(D_{+},\mathcal{N}_{1}),\chi^{2}(D_{-},\mathcal{N}_{1})=O(1).

Proof Sketch for Lemma 3.4.

The properties here directly follow from the properties of gg. ∎
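To make the construction concrete, the following toy computation (an illustration only; it handles just n=1n=1, whereas Lemma 3.4 matches all moments up to nn) builds a piecewise-constant gg with g(z)=1g(z)=1 on [c,)[c,\infty) and zero mean and first moment under 𝒩1\mathcal{N}_{1}, which then yields a valid conditional mean 𝐄[yz]=g(z)\operatorname*{\mathbf{E}}[y\mid z]=g(z):

```python
# Toy n = 1 instance of the construction (illustration, not the paper's g):
# g(z) = 1 on [c, inf), g = b1 on [0, c), g = b2 on (-inf, 0), with b1, b2
# chosen so that E[g(z)] = 0 and E[g(z) z] = 0 under z ~ N(0, 1).
# Setting Pr[y = 1 | z] = (1 + g(z))/2 then gives E[y] = 0, E[y z] = 0,
# and E[y | z >= c] = 1, matching items (i)-(iii) for n = 1.
import math

def Phi(x):   # standard normal CDF
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def phi(x):   # standard normal pdf; note E[z * 1(z >= c)] = phi(c)
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

c = 2.0
q = 1 - Phi(c)  # Pr[z >= c]
# zero mean:         q + b1 * (1/2 - q) + b2 * (1/2) = 0
# zero first moment: phi(c) + b1 * (phi(0) - phi(c)) - b2 * phi(0) = 0
a11, a12, r1 = 0.5 - q, 0.5, -q
a21, a22, r2 = phi(0) - phi(c), -phi(0), -phi(c)
det = a11 * a22 - a12 * a21
b1 = (r1 * a22 - a12 * r2) / det   # Cramer's rule
b2 = (a11 * r2 - a21 * r1) / det
assert abs(b1) <= 1 and abs(b2) <= 1   # so (1 + g)/2 is a valid probability

def g(z):
    return 1.0 if z >= c else (b1 if z >= 0 else b2)

# verify the two moment conditions by numerical integration
lo, hi, n_grid = -10.0, 10.0, 40000
step = (hi - lo) / n_grid
xs = [lo + (i + 0.5) * step for i in range(n_grid)]
m0 = sum(g(x) * phi(x) * step for x in xs)
m1 = sum(g(x) * x * phi(x) * step for x in xs)
assert abs(m0) < 1e-4 and abs(m1) < 1e-4
```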

Using the framework introduced in [DKS17, DKPZ21], we can construct a set of alternative hypothesis distributions 𝒟={D𝐯:𝐯V}\mathcal{D}=\{D_{\mathbf{v}}:\mathbf{v}\in V\} on d×{±1}\mathbb{R}^{d}\times\{\pm 1\}, where VV is a set of exponentially many pairwise nearly-orthogonal vectors and the marginal distribution of each D𝐯D_{\mathbf{v}} on direction 𝐯\mathbf{v} is the distribution DD in Lemma 3.4. This effectively embeds the distribution DD in a hidden direction 𝐯\mathbf{v}. The size of the family 𝒟\mathcal{D} is exponential in dd, and the distributions in it have small pairwise correlations. The details of 𝒟\mathcal{D} are deferred to Appendix C. We are now ready to give a proof sketch for the SQ hardness part of our main result, Theorem 1.3.

Proof Sketch of the SQ hardness part of Theorem 1.3.

Let 𝒟\mathcal{D} be the set of distributions discussed above. We also let DnullD_{\mathrm{null}} be the joint distribution of (𝐱,y)(\mathbf{x},y) such that 𝐱𝒩d\mathbf{x}\sim\mathcal{N}_{d} and yBern(1/2)y\sim\mathrm{Bern}(1/2) independent of 𝐱\mathbf{x}. Then, using standard SQ dimension techniques, one can show that any SQ algorithm that solves (𝒟,Dnull)\mathcal{B}(\mathcal{D},D_{\mathrm{null}}) requires either queries of tolerance at most dΩ(log1α)d^{-\Omega(\log\frac{1}{\alpha})} or at least 2dΩ(1)2^{d^{\Omega(1)}} many queries. By reducing the decision problem (𝒟,Dnull)\mathcal{B}(\mathcal{D},D_{\mathrm{null}}) to reliably learning α\alpha-biased LTFs with ϵ<α/3\epsilon<\alpha/3 accuracy, we get the lower bound part of the statement in Theorem 1.3. ∎

4 Conclusions and Open Problems

In this paper, we study the problem of learning halfspaces under Gaussian marginals in the reliable learning model. Our main contribution is the design of the first efficient learner for this task whose complexity beats the complexity of agnostically learning this concept class. Moreover, we provide rigorous evidence, via an SQ lower bound, that no fully polynomial-time algorithm exists for general halfspaces. The obvious open question is whether the dependence on ϵ\epsilon in the complexity of our algorithm can be improved. Specifically, is it possible to design a reliable learner with complexity dO(log(1/α))poly(1/ϵ)d^{O(\log(1/\alpha))}\mathrm{poly}(1/\epsilon)? Is it possible to obtain similarly efficient reliable learners under more general marginal distributions (e.g., strongly log-concave or discrete distributions)? More broadly, it would be interesting to characterize the computational separation between (distribution-specific) reliable and agnostic learning for other natural concept classes.

References

  • [ABHU15] P. Awasthi, M. F. Balcan, N. Haghtalab, and R. Urner. Efficient learning of linear separators under bounded noise. In Proceedings of The 28th Conference on Learning Theory, COLT 2015, pages 167–190, 2015.
  • [ABL17] P. Awasthi, M. F. Balcan, and P. M. Long. The power of localization for efficiently learning linear separators with noise. J. ACM, 63(6):50:1–50:27, 2017.
  • [Bog98] V. Bogachev. Gaussian measures. Mathematical surveys and monographs, vol. 62, 1998.
  • [DGJ+{}^{+}09] I. Diakonikolas, P. Gopalan, R. Jaiswal, R. Servedio, and E. Viola. Bounded independence fools halfspaces. In Proc. 50th IEEE Symposium on Foundations of Computer Science (FOCS), pages 171–180, 2009.
  • [DJ19] A. Durgin and B. Juba. Hardness of improper one-sided learning of conjunctions for all uniformly falsifiable csps. In Algorithmic Learning Theory, ALT, volume 98 of Proceedings of Machine Learning Research, pages 369–382, 2019.
  • [DKK+{}^{+}20] I. Diakonikolas, D. M. Kane, V. Kontonis, C. Tzamos, and N. Zarifis. A polynomial time algorithm for learning halfspaces with Tsybakov noise. arXiv, 2020.
  • [DKK+{}^{+}21a] I. Diakonikolas, D. M. Kane, V. Kontonis, C. Tzamos, and N. Zarifis. Agnostic proper learning of halfspaces under gaussian marginals. In Proceedings of The 34th Conference on Learning Theory, COLT, 2021.
  • [DKK+{}^{+}21b] I. Diakonikolas, D. M. Kane, V. Kontonis, C. Tzamos, and N. Zarifis. Efficiently learning halfspaces with Tsybakov noise. STOC, 2021.
  • [DKK+{}^{+}22] I. Diakonikolas, D. M. Kane, V. Kontonis, C. Tzamos, and N. Zarifis. Learning general halfspaces with general Massart noise under the gaussian distribution. In STOC ’22: 54th Annual ACM SIGACT Symposium on Theory of Computing, Rome, Italy, June 20 - 24, 2022, pages 874–885. ACM, 2022.
  • [DKN10] I. Diakonikolas, D. M. Kane, and J. Nelson. Bounded independence fools degree-22 threshold functions. In FOCS, pages 11–20, 2010.
  • [DKPZ21] I. Diakonikolas, D. M. Kane, T. Pittas, and N. Zarifis. The optimality of polynomial regression for agnostic learning under gaussian marginals. In Proceedings of The 34th Conference on Learning Theory, COLT, 2021.
  • [DKR23] I. Diakonikolas, D. M. Kane, and L. Ren. Near-optimal cryptographic hardness of agnostically learning halfspaces and relu regression under gaussian marginals. In International Conference on Machine Learning, ICML, volume 202 of Proceedings of Machine Learning Research, pages 7922–7938, 2023.
  • [DKRS23] I. Diakonikolas, D. Kane, L. Ren, and Y. Sun. SQ lower bounds for non-gaussian component analysis with weaker assumptions. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, 2023.
  • [DKS17] I. Diakonikolas, D. M. Kane, and A. Stewart. Statistical query lower bounds for robust estimation of high-dimensional gaussians and gaussian mixtures. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 73–84, 2017.
  • [DKS18] I. Diakonikolas, D. M. Kane, and A. Stewart. Learning geometric concepts with nasty noise. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, pages 1061–1073, 2018.
  • [DKTZ20a] I. Diakonikolas, V. Kontonis, C. Tzamos, and N. Zarifis. Learning halfspaces with Massart noise under structured distributions. In Conference on Learning Theory, COLT, 2020.
  • [DKTZ20b] I. Diakonikolas, V. Kontonis, C. Tzamos, and N. Zarifis. Learning halfspaces with Tsybakov noise. arXiv, 2020.
  • [DKTZ20c] I. Diakonikolas, V. Kontonis, C. Tzamos, and N. Zarifis. Non-convex SGD learns halfspaces with adversarial label noise. In Advances in Neural Information Processing Systems, NeurIPS, 2020.
  • [DKTZ22] I. Diakonikolas, V. Kontonis, C. Tzamos, and N. Zarifis. Learning general halfspaces with adversarial label noise via online gradient descent. In International Conference on Machine Learning, ICML 2022, volume 162 of Proceedings of Machine Learning Research, pages 5118–5141. PMLR, 2022.
  • [DKZ20] I. Diakonikolas, D. M. Kane, and N. Zarifis. Near-optimal SQ lower bounds for agnostically learning halfspaces and ReLUs under Gaussian marginals. In Advances in Neural Information Processing Systems, NeurIPS, 2020.
  • [Dom99] P. M. Domingos. Metacost: A general method for making classifiers cost-sensitive. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 155–164, 1999.
  • [Elk01] C. Elkan. The foundations of cost-sensitive learning. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, IJCAI, pages 973–978, 2001.
  • [Fan68] K. Fan. On infinite systems of linear inequalities. Journal of Mathematical Analysis and Applications, 21(3):475 – 478, 1968.
  • [FGR+{}^{+}17] V. Feldman, E. Grigorescu, L. Reyzin, S. Vempala, and Y. Xiao. Statistical algorithms and a lower bound for detecting planted cliques. J. ACM, 64(2):8:1–8:37, 2017.
  • [FS97] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119–139, 1997.
  • [GGK20] S. Goel, A. Gollakota, and A. R. Klivans. Statistical-query lower bounds via functional gradients. In Advances in Neural Information Processing Systems, NeurIPS, 2020.
  • [GKKM20] S. Goldwasser, A. T. Kalai, Y. Kalai, and O. Montasser. Beyond perturbations: Learning guarantees with arbitrary adversarial test examples. In Advances in Neural Information Processing Systems, NeurIPS, 2020.
  • [GKKT17] S. Goel, V. Kanade, A. R. Klivans, and J. Thaler. Reliably learning the relu in polynomial time. In Proceedings of the 30th Conference on Learning Theory, COLT 2017, volume 65 of Proceedings of Machine Learning Research, pages 1004–1042, 2017.
  • [Hau92] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100:78–150, 1992.
  • [KK21] A. T. Kalai and V. Kanade. Efficient learning with arbitrary covariate shift. In Algorithmic Learning Theory, 2021, volume 132 of Proceedings of Machine Learning Research, pages 850–864, 2021.
  • [KKM12] A. T. Kalai, V. Kanade, and Y. Mansour. Reliable agnostic learning. J. Comput. Syst. Sci., 78(5):1481–1495, 2012.
  • [KKMS08] A. Kalai, A. Klivans, Y. Mansour, and R. Servedio. Agnostically learning halfspaces. SIAM Journal on Computing, 37(6):1777–1805, 2008. Special issue for FOCS 2005.
  • [KLS09] A. Klivans, P. Long, and R. Servedio. Learning Halfspaces with Malicious Noise. Journal of Machine Learning Research, 10:2715–2740, 2009.
  • [KOS08] A. Klivans, R. O’Donnell, and R. Servedio. Learning geometric concepts via Gaussian surface area. In Proc. 49th IEEE Symposium on Foundations of Computer Science (FOCS), pages 541–550, Philadelphia, Pennsylvania, 2008.
  • [KSS94] M. Kearns, R. Schapire, and L. Sellie. Toward Efficient Agnostic Learning. Machine Learning, 17(2/3):115–141, 1994.
  • [KT14] V. Kanade and J. Thaler. Distribution-independent reliable learning. In Proceedings of The 27th Conference on Learning Theory, COLT, volume 35 of JMLR Workshop and Conference Proceedings, pages 3–24, 2014.
  • [MT94] W. Maass and G. Turan. How fast can a threshold gate learn? In S. Hanson, G. Drastal, and R. Rivest, editors, Computational Learning Theory and Natural Learning Systems, pages 381–414. MIT Press, 1994.
  • [Nel73] E. Nelson. The free markoff field. Journal of Functional Analysis, 12(2):211–227, 1973.
  • [NP33] J. Neyman and E. S. Pearson. On the problem of the most efficient tests for statistical hypotheses. Philos. Trans. R. Soc. Lond. Ser. A Contain. Pap. Math. Phys. Character, 231:281–337, 1933.
  • [Ros58] F. Rosenblatt. The Perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–407, 1958.
  • [Tie23] S. Tiegel. Hardness of agnostically learning halfspaces from worst-case lattice problems. In The Thirty Sixth Annual Conference on Learning Theory, COLT, volume 195 of Proceedings of Machine Learning Research, pages 3029–3064, 2023.
  • [Vap98] V. Vapnik. Statistical learning theory. Wiley, 1998.
  • [ZSA20] C. Zhang, J. Shen, and P. Awasthi. Efficient active learning of sparse halfspaces with arbitrary bounded noise. In Advances in Neural Information Processing Systems, NeurIPS, 2020.

Appendix

Organization

The supplementary material is structured as follows: Appendix A includes additional preliminaries. Appendix B includes the omitted proofs from Section 2. Finally, Appendix C includes the omitted proofs from Section 3.

Appendix A Additional Preliminaries

We use 𝕊d1\mathbb{S}^{d-1} to denote the dd-dimensional unit sphere, i.e., 𝕊d1=def{𝐱d:𝐱2=1}\mathbb{S}^{d-1}\stackrel{{\scriptstyle{\mathrm{\footnotesize def}}}}{{=}}\{\mathbf{x}\in\mathbb{R}^{d}:\|\mathbf{x}\|_{2}=1\}.

Basics of the SQ Model

In order to give an SQ lower bound for reliable learning, we first present the basics of the SQ model. We define the pairwise correlation, which is used for the SQ lower bound.

Definition A.1 (Pairwise Correlation).

The pairwise correlation of two distributions with probability density function D1,D2:d+D_{1},D_{2}:\mathbb{R}^{d}\mapsto\mathbb{R}_{+} with respect to a distribution with density D:d+D:\mathbb{R}^{d}\mapsto\mathbb{R}_{+}, where the support of DD contains the support of D1D_{1} and D2D_{2}, is defined as χD(D1,D2)=defdD1(𝐱)D2(𝐱)/D(𝐱)d𝐱1\chi_{D}(D_{1},D_{2})\stackrel{{\scriptstyle{\mathrm{\footnotesize def}}}}{{=}}\int_{\mathbb{R}^{d}}D_{1}(\mathbf{x})D_{2}(\mathbf{x})/D(\mathbf{x})d\mathbf{x}-1. Furthermore, the χ\chi-squared divergence of D1D_{1} to DD is defined as χ2(D1,D)=defχD(D1,D1)\chi^{2}(D_{1},D)\stackrel{{\scriptstyle{\mathrm{\footnotesize def}}}}{{=}}\chi_{D}(D_{1},D_{1}).

We require the following definitions and lemmata from [FGR+{}^{+}17] connecting pairwise correlation and SQ lower bounds.

Definition A.2.

We say that a set of ss distributions 𝒟={D1,,Ds}\mathcal{D}=\{D_{1},\cdots,D_{s}\} over d\mathbb{R}^{d} is (γ,β)(\gamma,\beta)-correlated relative to a distribution DD if χD(Di,Dj)γ\chi_{D}(D_{i},D_{j})\leq\gamma for all iji\neq j, and χD(Di,Dj)β\chi_{D}(D_{i},D_{j})\leq\beta for i=ji=j.

Definition A.3 (Statistical Query Dimension).

For β,γ>0\beta,\gamma>0 and a decision problem (𝒟,D)\mathcal{B}(\mathcal{D},D), where DD is a fixed distribution and 𝒟\mathcal{D} is a family of distributions, let ss be the maximum integer such that there exists a finite set of distributions 𝒟D𝒟\mathcal{D}_{D}\subseteq\mathcal{D} that is (γ,β)(\gamma,\beta)-correlated relative to DD with |𝒟D|s|\mathcal{D}_{D}|\geq s. The Statistical Query dimension with pairwise correlations (γ,β)(\gamma,\beta) of \mathcal{B} is defined to be ss and denoted by s=SD(,γ,β)s=\mathrm{SD}(\mathcal{B},\gamma,\beta).

Lemma A.4.

Let (𝒟,D)\mathcal{B}(\mathcal{D},D) be a decision problem, where DD is the reference distribution and 𝒟\mathcal{D} is a class of distributions. For γ,β>0\gamma,\beta>0, let s=SD(,γ,β)s=\mathrm{SD}(\mathcal{B},\gamma,\beta). For any γ>0\gamma^{\prime}>0, any SQ algorithm for \mathcal{B} requires either queries of tolerance at most γ+γ\sqrt{\gamma+\gamma^{\prime}} or at least sγ/(βγ)s\gamma^{\prime}/(\beta-\gamma) many queries.

Basics of Hermite Polynomials

We require the following definitions used in our work.

Definition A.5 (Normalized Hermite Polynomial).

For k\in\mathbb{N}, we define the k-th probabilist’s Hermite polynomial \mathrm{He}_{k}:\mathbb{R}\to\mathbb{R} as \mathrm{He}_{k}(t)=(-1)^{k}e^{t^{2}/2}\cdot\frac{\mathrm{d}^{k}}{\mathrm{d}t^{k}}e^{-t^{2}/2}. We define the k-th normalized Hermite polynomial h_{k}:\mathbb{R}\to\mathbb{R} as h_{k}(t)=\mathrm{He}_{k}(t)/\sqrt{k!}.
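For concreteness, the first few probabilist’s Hermite polynomials are \mathrm{He}_{0}(t)=1, \mathrm{He}_{1}(t)=t, \mathrm{He}_{2}(t)=t^{2}-1, and they satisfy the standard three-term recurrence \mathrm{He}_{k+1}(t)=t\,\mathrm{He}_{k}(t)-k\,\mathrm{He}_{k-1}(t). A minimal Python sketch of these definitions (the function names are ours, for illustration only):

```python
import math

def hermite_he(k, t):
    """Probabilist's Hermite polynomial He_k(t), computed via the
    three-term recurrence He_{k+1}(t) = t*He_k(t) - k*He_{k-1}(t),
    with He_0(t) = 1 and He_1(t) = t."""
    if k == 0:
        return 1.0
    prev, cur = 1.0, t  # He_0, He_1
    for j in range(1, k):
        prev, cur = cur, t * cur - j * prev
    return cur

def hermite_normalized(k, t):
    """Normalized Hermite polynomial h_k(t) = He_k(t) / sqrt(k!)."""
    return hermite_he(k, t) / math.sqrt(math.factorial(k))
```

For instance, \mathrm{He}_{3}(t)=t^{3}-3t, so `hermite_he(3, 2.0)` returns 2.0.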

Furthermore, we use multivariate Hermite polynomials in the form of Hermite tensors (as the entries in the Hermite tensors are re-scaled multivariate Hermite polynomials). We define the Hermite tensor as follows.

Definition A.6 (Hermite Tensor).

For kk\in\mathbb{N} and 𝐱d\mathbf{x}\in\mathbb{R}^{d}, we define the kk-th Hermite tensor as

(𝐇k(𝐱))i1,i2,,ik=1k!Partitions P of [k]into sets of size 1 and 2{a,b}P(𝐈ia,ib){c}P𝐱ic.(\mathbf{H}_{k}(\mathbf{x}))_{i_{1},i_{2},\ldots,i_{k}}=\frac{1}{\sqrt{k!}}\sum_{\begin{subarray}{c}\text{Partitions $P$ of $[k]$}\\ \text{into sets of size 1 and 2}\end{subarray}}\bigotimes_{\{a,b\}\in P}(-\mathbf{I}_{i_{a},i_{b}})\bigotimes_{\{c\}\in P}\mathbf{x}_{i_{c}}\;.
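For k=2, the only partitions of [2] into sets of size 1 and 2 are \{\{1,2\}\} (contributing -\mathbf{I}_{i_{1},i_{2}}) and \{\{1\},\{2\}\} (contributing \mathbf{x}_{i_{1}}\mathbf{x}_{i_{2}}), so (\mathbf{H}_{2}(\mathbf{x}))_{ij}=(\mathbf{x}_{i}\mathbf{x}_{j}-\mathbf{I}_{ij})/\sqrt{2}; in particular, the diagonal entries equal h_{2}(\mathbf{x}_{i}). A small Python sketch of this k=2 case (illustrative naming):

```python
import math

def hermite_tensor_2(x):
    """k = 2 case of the Hermite tensor: the partitions of [2] into sets
    of size 1 and 2 are {{1,2}} and {{1},{2}}, giving
    (H_2(x))_{ij} = (x_i * x_j - delta_{ij}) / sqrt(2)."""
    d = len(x)
    return [[(x[i] * x[j] - (1.0 if i == j else 0.0)) / math.sqrt(2.0)
             for j in range(d)] for i in range(d)]
```

The diagonal entry (\mathbf{H}_{2}(\mathbf{x}))_{ii}=(\mathbf{x}_{i}^{2}-1)/\sqrt{2}=h_{2}(\mathbf{x}_{i}), consistent with Definition A.5.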

We also point out that fully reliable learning (see [KKM12] for the definition) reduces to one-sided reliable learning (Definition 1.1). The proof is implicit in the proof of Theorem 3 in [KKM12] and we include it here for completeness.

Fact A.7.

For a joint distribution D of (\mathbf{x},y) supported on X\times\{\pm 1\}, let h_{+}:X\to\{\pm 1\} (resp. h_{-}:X\to\{\pm 1\}) be a hypothesis that is \epsilon/4-reliable with respect to the concept class C on distribution D for positive labels (resp. for negative labels). Define the hypothesis h by h(\mathbf{x})=1 if h_{+}(\mathbf{x})=h_{-}(\mathbf{x})=1, h(\mathbf{x})=-1 if h_{+}(\mathbf{x})=h_{-}(\mathbf{x})=-1, and h(\mathbf{x})=? otherwise. Then h is \epsilon-close to the best fully reliable sandwich hypothesis for C on distribution D.

Proof Sketch.

We first show that R+(h;D),R(h;D)ϵR_{+}(h;D),R_{-}(h;D)\leq\epsilon. Notice that

R+(h;D)=𝐏𝐫(𝐱,y)D[h(𝐱)=1y=1]𝐏𝐫(𝐱,y)D[h+(𝐱)=1y=1]ϵ.R_{+}(h;D)=\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h(\mathbf{x})=1\land y=-1]\leq\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h_{+}(\mathbf{x})=1\land y=-1]\leq\epsilon\;.

The same holds for negative labels.

To show that 𝐏𝐫(𝐱,y)D[h(𝐱)=?]\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h(\mathbf{x})=?] is ϵ\epsilon-suboptimal, assume (for the purpose of contradiction) that there is an hh^{\prime} such that R+(h;D),R(h;D)ϵR_{+}(h^{\prime};D),R_{-}(h^{\prime};D)\leq\epsilon and 𝐏𝐫(𝐱,y)D[h(𝐱)=?]<𝐏𝐫(𝐱,y)D[h(𝐱)=?]ϵ\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h^{\prime}(\mathbf{x})=?]<\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h(\mathbf{x})=?]-\epsilon. Notice that

𝐏𝐫(𝐱,y)D[h(𝐱)=?]\displaystyle\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h(\mathbf{x})=?]
=\displaystyle= 𝐏𝐫(𝐱,y)D[h+(𝐱)=1h(𝐱)=1]+𝐏𝐫(𝐱,y)D[h+(𝐱)=1h(𝐱)=1]\displaystyle\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h_{+}(\mathbf{x})=1\land h_{-}(\mathbf{x})=-1]+\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h_{+}(\mathbf{x})=-1\land h_{-}(\mathbf{x})=1]
\displaystyle\leq 1𝐏𝐫(𝐱,y)D[h+(𝐱)=1]𝐏𝐫(𝐱,y)D[h(𝐱)=1]+𝐏𝐫(𝐱,y)D[h+(𝐱)=1h(𝐱)=1]\displaystyle 1-\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h_{+}(\mathbf{x})=1]-\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h_{-}(\mathbf{x})=-1]+\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h_{+}(\mathbf{x})=1\land h_{-}(\mathbf{x})=-1]
\displaystyle\leq 1𝐏𝐫(𝐱,y)D[h+(𝐱)=1]𝐏𝐫(𝐱,y)D[h(𝐱)=1]+ϵ/2.\displaystyle 1-\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h_{+}(\mathbf{x})=1]-\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h_{-}(\mathbf{x})=-1]+\epsilon/2\;.

Thus,

𝐏𝐫(𝐱,y)D[h+(𝐱)=1]+𝐏𝐫(𝐱,y)D[h(𝐱)=1]1𝐏𝐫(𝐱,y)D[h(𝐱)=?]+ϵ/2.\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h_{+}(\mathbf{x})=1]+\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h_{-}(\mathbf{x})=-1]\leq 1-\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h(\mathbf{x})=?]+\epsilon/2\;. (1)

Then, define h+:X{±1}h_{+}^{\prime}:X\to\{\pm 1\} as h+(𝐱)=1h_{+}^{\prime}(\mathbf{x})=1 if h(𝐱)=1h^{\prime}(\mathbf{x})=1 and h+(𝐱)=1h_{+}^{\prime}(\mathbf{x})=-1 otherwise (hh_{-}^{\prime} is defined similarly). We get 𝐏𝐫(𝐱,y)D[h+(𝐱)=1]+𝐏𝐫(𝐱,y)D[h(𝐱)=1]=1𝐏𝐫(𝐱,y)D[h(𝐱)=?]\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h^{\prime}_{+}(\mathbf{x})=1]+\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h^{\prime}_{-}(\mathbf{x})=-1]=1-\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h^{\prime}(\mathbf{x})=?]. Using Equation 1 and 𝐏𝐫(𝐱,y)D[h(𝐱)=?]<𝐏𝐫(𝐱,y)D[h(𝐱)=?]ϵ\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h^{\prime}(\mathbf{x})=?]<\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h(\mathbf{x})=?]-\epsilon, we get

\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h_{+}(\mathbf{x})=1]+\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h_{-}(\mathbf{x})=-1]<\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h^{\prime}_{+}(\mathbf{x})=1]+\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h^{\prime}_{-}(\mathbf{x})=-1]-\epsilon/2\;.

Therefore, either 𝐏𝐫(𝐱,y)D[h+(𝐱)=1]<𝐏𝐫(𝐱,y)D[h+(𝐱)=1]ϵ/4\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h_{+}(\mathbf{x})=1]<\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h^{\prime}_{+}(\mathbf{x})=1]-\epsilon/4 or

𝐏𝐫(𝐱,y)D[h(𝐱)=1]<𝐏𝐫(𝐱,y)D[h(𝐱)=1]ϵ/4.\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h_{-}(\mathbf{x})=-1]<\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h^{\prime}_{-}(\mathbf{x})=-1]-\epsilon/4\;.

Thus, either h_{+} or h_{-} is more than \epsilon/4-suboptimal, contradicting the assumption that each is \epsilon/4-reliable. ∎
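The combining rule in Fact A.7 is purely pointwise, so it is cheap to implement on top of any pair of one-sided learners. A minimal Python sketch (hypothetical hypotheses passed as callables; abstention encoded by the string "?"):

```python
def sandwich(h_plus, h_minus):
    """Fact A.7 combiner: given a positive-reliable hypothesis h_plus and
    a negative-reliable hypothesis h_minus (each mapping x to +1 or -1),
    predict their common value and abstain with "?" on disagreement."""
    def h(x):
        a, b = h_plus(x), h_minus(x)
        return a if a == b else "?"
    return h
```

For example, combining the thresholds x > 0 and x > -1 yields a predictor that abstains exactly on the disagreement region (-1, 0].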

Appendix B Omitted Proofs from Section 2

We define the bias of a function h:\mathbb{R}^{d}\to\{\pm 1\} as \min(\operatorname*{\mathbf{Pr}}_{\mathbf{x}\sim\mathcal{N}_{d}}[h(\mathbf{x})=1],\operatorname*{\mathbf{Pr}}_{\mathbf{x}\sim\mathcal{N}_{d}}[h(\mathbf{x})=-1]), and define \mathcal{H}_{d}^{\alpha} to be the set of all LTFs whose bias is \alpha. Let f=\operatorname*{argmin}_{f\in\mathcal{H}_{d}^{\alpha}\;\land\;R_{+}(f;D)=0}R_{-}(f;D) be the optimal reliable hypothesis, where \alpha is the bias of f.

B.1 Reliably Learning Halfspaces with Gaussian Marginals when ϵ2α\epsilon\geq 2\alpha

We show that for the problem in Theorem 1.3, in the case of ϵ2α\epsilon\geq 2\alpha, the algorithm can just return one of the constant hypotheses. Let h+h_{+} (resp. hh_{-}) be the +1+1 (resp. 1-1) constant hypothesis.

Fact B.1.

For the problem defined in Theorem 1.3, the case where ϵ2α\epsilon\geq 2\alpha can be efficiently solved by returning h+h_{+} if R+(h+;D)3ϵ/4R_{+}(h_{+};D)\leq 3\epsilon/4 and returning hh_{-} otherwise.

Proof.

When the algorithm returns h+h_{+}, it is easy to see that h+h_{+} satisfies the reliable learning requirement in Definition 1.1. So we only consider the case that the algorithm returns hh_{-}.

Given that the algorithm returns h_{-}, either there exists an \alpha-biased LTF f with R_{+}(f;D)=0, or no \alpha-biased LTF f satisfies R_{+}(f;D)=0. In the latter case, h_{-} is a valid answer directly from Definition 1.1. In the former case, since the algorithm did not return h_{+}, it must be that \operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[y=-1]>3\epsilon/4>\alpha. Since R_{+}(f;D)=0 forces f(\mathbf{x})=-1 whenever y=-1, we get \operatorname*{\mathbf{Pr}}_{\mathbf{x}\sim\mathcal{N}_{d}}[f(\mathbf{x})=-1]\geq\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[y=-1]>\alpha, and hence \operatorname*{\mathbf{Pr}}_{\mathbf{x}\sim\mathcal{N}_{d}}[f(\mathbf{x})=-1]=1-\alpha. Thus, R_{+}(h_{-};D)=0 and R_{-}(h_{-};D)\leq R_{-}(f;D)+\alpha\leq R_{-}(f;D)+\epsilon/2. This shows that h_{-} is \epsilon-reliable with respect to all \alpha-biased LTFs, which completes the proof. ∎
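The selection rule of Fact B.1 only requires an empirical estimate of \operatorname*{\mathbf{Pr}}[y=-1], which equals R_{+} of the all-positive hypothesis. A minimal Python sketch (illustrative naming; the sample-size and confidence bookkeeping is omitted):

```python
def choose_constant_hypothesis(labels, eps):
    """Fact B.1 selection rule for the regime eps >= 2*alpha: the empirical
    negative-label mass estimates R_+ of the all-positive hypothesis h_+;
    return +1 (the h_+ constant hypothesis) if it is at most 3*eps/4,
    and -1 (the h_- constant hypothesis) otherwise.
    `labels` is a list of +/-1 labels sampled from D."""
    neg_mass = sum(1 for y in labels if y == -1) / len(labels)
    return 1 if neg_mass <= 3.0 * eps / 4.0 else -1
```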

B.2 Reliably Learning Halfspaces with Gaussian Marginals: Case of Unknown α\alpha

Here we establish our main result:

Theorem B.2 (Main Algorithmic Result).

Let ϵ(0,1/2)\epsilon\in(0,1/2) and let DD be a joint distribution of (𝐱,y)(\mathbf{x},y) supported on d×{±1}\mathbb{R}^{d}\times\{\pm 1\} with marginal D𝐱=𝒩dD_{\mathbf{x}}=\mathcal{N}_{d}. Let f=argminfdR+(f;D)=0R(f;D)f=\operatorname*{argmin}_{f\in\mathcal{H}_{d}\;\land\;R_{+}(f;D)=0}R_{-}(f;D) be the optimal halfspace and α\alpha be its bias which is unknown. Then there is an algorithm that uses N=dO(log(min{1/α,1/ϵ}))min(2log(1/ϵ)O(log(1/α)),2poly(1/ϵ))N=d^{O(\log(\min\{1/\alpha,1/\epsilon\}))}\min\left(2^{\log(1/\epsilon)^{O(\log(1/\alpha))}},2^{\mathrm{poly}(1/\epsilon)}\right) many samples from DD, runs in poly(N,d,1/ϵ)\mathrm{poly}(N,d,1/\epsilon) time and with high probability returns a hypothesis h(𝐱)=sign(𝐰,𝐱t)h(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w},\mathbf{x}\rangle-t) that is ϵ\epsilon-reliable with respect to the class of LTFs.

To prove Theorem B.2, we need to use the algorithm in the following lemma as a black box. The lemma shows that if the optimal halfspace ff satisfies 𝐏𝐫𝐱𝒩d[f(𝐱)=1]1/2\operatorname*{\mathbf{Pr}}_{\mathbf{x}\sim\mathcal{N}_{d}}[f(\mathbf{x})=-1]\leq 1/2, then the problem can be solved efficiently.

Lemma B.3.

Let DD be the joint distribution of (𝐱,y)(\mathbf{x},y) supported on d×{±1}\mathbb{R}^{d}\times\{\pm 1\} with marginal D𝐱=𝒩dD_{\mathbf{x}}=\mathcal{N}_{d} and ϵ(0,1)\epsilon\in(0,1). Suppose the optimal halfspace f=argminfdR+(f;D)=0R(f;D)f=\operatorname*{argmin}_{f\in\mathcal{H}_{d}\;\land\;R_{+}(f;D)=0}R_{-}(f;D) satisfies 𝐏𝐫𝐱𝒩d[f(𝐱)=1]1/2\operatorname*{\mathbf{Pr}}_{\mathbf{x}\sim\mathcal{N}_{d}}[f(\mathbf{x})=-1]\leq 1/2. Then there is an algorithm that reliably learns LTFs on DD using N=O(d/ϵ2)N=O(d/\epsilon^{2}) many samples and poly(N,d)\mathrm{poly}(N,d) running time.

Proof.

The algorithm is the following:

  1.

    First check whether \operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[y=-1]\leq\epsilon, with a sufficiently small constant failure probability. If so, return the +1 constant hypothesis.

  2.

    Otherwise, draw m=(d/\epsilon)^{c} many samples S=\{(\mathbf{x}_{1},y_{1}),\cdots,(\mathbf{x}_{m},y_{m})\} conditioned on y=-1. Since Step 1 did not return, \operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[y=-1]=\Omega(\epsilon) with high probability, so this conditional sampling has efficiency \Omega(\epsilon).

  3.

    Solve the following convex program for \mathbf{w}^{\prime} and t^{\prime}:

    minimize \quad t^{\prime}
    such that \quad \langle\mathbf{w}^{\prime},\mathbf{x}\rangle-t^{\prime}\leq 0\;, \quad \forall(\mathbf{x},y)\in S
    \phantom{such that} \quad \|\mathbf{w}^{\prime}\|_{2}\leq 1

    Return the hypothesis h(𝐱)=sign(𝐰,𝐱t)h(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w},\mathbf{x}\rangle-t), where 𝐰=𝐰/𝐰2\mathbf{w}=\mathbf{w}^{\prime}/\|\mathbf{w}^{\prime}\|_{2} and t=t/𝐰2t=t^{\prime}/\|\mathbf{w}^{\prime}\|_{2}.

Let f(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w}^{*},\mathbf{x}\rangle-t^{*}) be the optimal hypothesis. We prove that h(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w},\mathbf{x}\rangle-t) satisfies R_{+}(h;D)\leq\epsilon and R_{-}(h;D)\leq R_{-}(f;D)+\epsilon. Since h is consistent with all the negative samples in S, standard VC generalization bounds imply R_{+}(h;D)\leq\epsilon. From the assumption that \operatorname*{\mathbf{Pr}}_{\mathbf{x}\sim\mathcal{N}_{d}}[f(\mathbf{x})=-1]\leq 1/2, we have t^{*}\leq 0. Notice that \mathbf{w}^{*} and t^{*} are a feasible solution to the above program; therefore, t^{\prime}\leq t^{*}\leq 0. Since \|\mathbf{w}^{\prime}\|_{2}\leq 1 and t^{\prime}\leq 0, we must have t=t^{\prime}/\|\mathbf{w}^{\prime}\|_{2}\leq t^{\prime}\leq t^{*}, which implies that \operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h(\mathbf{x})=1]\geq\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[f(\mathbf{x})=1]. Therefore,

R(h;D)\displaystyle R_{-}(h;D) =𝐏𝐫(𝐱,y)D[y=1]𝐏𝐫(𝐱,y)D[y=1h(𝐱)=1]\displaystyle=\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[y=1]-\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[y=1\land h(\mathbf{x})=1]
=𝐏𝐫(𝐱,y)D[y=1]𝐏𝐫(𝐱,y)D[h(𝐱)=1]+R+(h;D)\displaystyle=\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[y=1]-\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h(\mathbf{x})=1]+R_{+}(h;D)
𝐏𝐫(𝐱,y)D[y=1]𝐏𝐫(𝐱,y)D[f(𝐱)=1]+ϵ\displaystyle\leq\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[y=1]-\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[f(\mathbf{x})=1]+\epsilon
=R(f;D)+ϵ.\displaystyle=R_{-}(f;D)+\epsilon\;.

This completes the proof. ∎
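For illustration, since the constraints force t^{\prime}\geq\langle\mathbf{w}^{\prime},\mathbf{x}\rangle for every negative sample \mathbf{x}, the program in Lemma B.3 is equivalent to \min_{\|\mathbf{w}\|_{2}\leq 1}\max_{i}\langle\mathbf{w},\mathbf{x}_{i}\rangle over the negative samples. The sketch below solves this min–max form by projected subgradient descent — a stand-in for the convex-programming step that the analysis assumes, not the paper’s solver:

```python
import math

def fit_reliable_direction(neg_xs, steps=2000):
    """Projected-subgradient sketch for the convex program of Lemma B.3,
    in the equivalent form  min_{||w||_2 <= 1}  max_i <w, x_i>  over the
    negatively labeled points x_i.  Returns (w, t) with
    t = max_i <w, x_i>, so every x_i satisfies <w, x_i> - t <= 0."""
    d = len(neg_xs[0])
    w = [1.0 / math.sqrt(d)] * d
    for step in range(1, steps + 1):
        # a subgradient of w -> max_i <w, x_i> is any maximizing x_i
        x_star = max(neg_xs, key=lambda x: sum(wi * xi for wi, xi in zip(w, x)))
        eta = 1.0 / math.sqrt(step)
        w = [wi - eta * xi for wi, xi in zip(w, x_star)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm > 1.0:  # project back onto the unit ball
            w = [wi / norm for wi in w]
    t = max(sum(wi * xi for wi, xi in zip(w, x)) for x in neg_xs)
    return w, t
```

On toy negatives lying in the halfspace x_{1}\leq-2, the method recovers a direction close to \mathbf{e}_{1} with a strictly negative threshold.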

We now prove Theorem B.2.

Proof of Theorem B.2.

The algorithm is the following:

  1.

    Run the algorithm in Lemma B.3 with \epsilon^{\prime}=\epsilon/2. Check (to within a sufficiently small constant failure probability) whether the output hypothesis h satisfies R_{+}(h;D)\leq\epsilon; if so, output h and terminate.

  2.

    Set \alpha=1/2-\epsilon/100 and run the algorithm in Theorem B.4 with \epsilon^{\prime}=\epsilon/2. Check (to within a sufficiently small constant failure probability) whether the output hypothesis h satisfies R_{+}(h;D)\leq\epsilon; if so, output h and terminate. Otherwise, update \alpha to \alpha-\epsilon/100 and repeat Step 2 until the algorithm terminates.

Let f=argminfdR+(f;D)=0R(f;D)f=\operatorname*{argmin}_{f\in\mathcal{H}_{d}\;\land\;R_{+}(f;D)=0}R_{-}(f;D) be the optimal halfspace and αf\alpha_{f} be its bias which is unknown. Suppose the algorithm terminates. Let α\alpha be the bias of the output hypothesis hh. It is easy to see that R+(h;D)ϵR_{+}(h;D)\leq\epsilon with high probability; therefore, it only remains to show R(h;D)R(f;D)+ϵR_{-}(h;D)\leq R_{-}(f;D)+\epsilon. Notice that if 𝐏𝐫𝐱𝒩d[f(𝐱)=1]1/2\operatorname*{\mathbf{Pr}}_{\mathbf{x}\sim\mathcal{N}_{d}}[f(\mathbf{x})=-1]\leq 1/2, the algorithm will terminate in Step 1. Therefore, without loss of generality, we assume 𝐏𝐫𝐱𝒩d[f(𝐱)=1]1/2\operatorname*{\mathbf{Pr}}_{\mathbf{x}\sim\mathcal{N}_{d}}[f(\mathbf{x})=-1]\geq 1/2. Then, by Theorem B.4, it is easy to see that ααfϵ/100\alpha\geq\alpha_{f}-\epsilon/100 since given any ααf\alpha\leq\alpha_{f}, Step 2 is guaranteed to terminate. Therefore, we have

𝐏𝐫𝐱𝒩d[h(𝐱)=1]1α1αf+ϵ/100=𝐏𝐫𝐱𝒩d[f(𝐱)=1]+ϵ/100.\operatorname*{\mathbf{Pr}}_{\mathbf{x}\sim\mathcal{N}_{d}}[h(\mathbf{x})=-1]\leq 1-\alpha\leq 1-\alpha_{f}+\epsilon/100=\operatorname*{\mathbf{Pr}}_{\mathbf{x}\sim\mathcal{N}_{d}}[f(\mathbf{x})=-1]+\epsilon/100\;.

Thus,

R(h;D)\displaystyle R_{-}(h;D) =𝐏𝐫(𝐱,y)D[y=1]𝐏𝐫(𝐱,y)D[y=1h(𝐱)=1]\displaystyle=\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[y=1]-\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[y=1\land h(\mathbf{x})=1]
=𝐏𝐫(𝐱,y)D[y=1]𝐏𝐫(𝐱,y)D[h(𝐱)=1]+R+(h;D)\displaystyle=\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[y=1]-\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h(\mathbf{x})=1]+R_{+}(h;D)
𝐏𝐫(𝐱,y)D[y=1]𝐏𝐫(𝐱,y)D[f(𝐱)=1]+ϵ\displaystyle\leq\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[y=1]-\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[f(\mathbf{x})=1]+\epsilon
=R(f;D)+ϵ.\displaystyle=R_{-}(f;D)+\epsilon\;.

This completes the proof. ∎

B.3 Reliably Learning Halfspaces with Gaussian Marginals for the Case ϵα/2\epsilon\leq\alpha/2

For convenience, we assume that \alpha is known. This is without loss of generality, as discussed in the proof of Theorem B.2.

We show the following:

Theorem B.4.

Let ϵ(0,1/2),α(0,1/2]\epsilon\in(0,1/2),\alpha\in(0,1/2] and let DD be a joint distribution of (𝐱,y)(\mathbf{x},y) supported on d×{±1}\mathbb{R}^{d}\times\{\pm 1\} with marginal D𝐱=𝒩dD_{\mathbf{x}}=\mathcal{N}_{d}. Assume that there is a halfspace in dα\mathcal{H}_{d}^{\alpha} that is reliable with respect to DD. Then, there is an algorithm that uses N=dO(log(min{1/α,1/ϵ}))min(2log(1/ϵ)O(log(1/α)),2poly(1/ϵ))N=d^{O(\log(\min\{1/\alpha,1/\epsilon\}))}\min\left(2^{\log(1/\epsilon)^{O(\log(1/\alpha))}},2^{\mathrm{poly}(1/\epsilon)}\right) many samples from DD, runs in poly(N,d,1/ϵ)\mathrm{poly}(N,d,1/\epsilon) time and with high probability returns a hypothesis h(𝐱)=sign(𝐰,𝐱t)h(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w},\mathbf{x}\rangle-t) that is ϵ\epsilon-reliable with respect to the class of dα\mathcal{H}_{d}^{\alpha}.

Below we provide the omitted proofs from Section 2.

B.4 Proof of Lemma 2.2

Notice that for any polynomial pp such that 𝐄z𝒩1[p(z)]=0\operatorname*{\mathbf{E}}_{z\sim{\mathcal{N}_{1}}}[p(z)]=0, we have

𝐄(𝐱,y)D[yp(𝐰,𝐱)]=2𝐄(𝐱,y)D[𝟏(y=1)p(𝐰,𝐱)].\operatorname*{\mathbf{E}}_{(\mathbf{x},y)\sim D}[y\,p(\langle\mathbf{w}^{*},\mathbf{x}\rangle)]=-2\operatorname*{\mathbf{E}}_{(\mathbf{x},y)\sim D}[\mathbf{1}(y=-1)\,p(\langle\mathbf{w}^{*},\mathbf{x}\rangle)]\;.

Therefore, it suffices for us to show that there is a zero mean and unit variance polynomial pp such that

𝐄(𝐱,y)D[𝟏(y=1)p(𝐰,𝐱)]=2O(t2)𝐏𝐫(𝐱,y)D[y=1].\operatorname*{\mathbf{E}}_{(\mathbf{x},y)\sim D}[\mathbf{1}(y=-1)\,p(\langle\mathbf{w}^{*},\mathbf{x}\rangle)]=2^{-O({t^{*}}^{2})}\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[y=-1]\;.

Here we will consider the sign-matching polynomial from [DKK+{}^{+}22], which satisfies the following properties:

Claim B.5.

Let b\in\mathbb{R} with b\geq 4. There exists a zero-mean, unit-variance polynomial p:\mathbb{R}\to\mathbb{R} of degree k=\Theta(b^{2}) such that

  1.

    The sign of p(z)p(z) matches sign(zb)\mathrm{sign}(z-b), i.e. sign(p(z))=sign(zb)\mathrm{sign}(p(z))=\mathrm{sign}(z-b).

  2.

    For any z\leq b/2, we have |p(z)|\geq 2^{-O(k)}.

Proof.

We consider the polynomial pp defined as:

p~(z)=q(z)q(b)r(b)r(z),\tilde{p}(z)=q(z)-\frac{q(b)}{r(b)}r(z)\;,

where q(z)=z3kq(z)=z^{3k}, r(z)=z2k(2k1)!!r(z)=z^{2k}-(2k-1)!! and kk is a sufficiently large odd integer such that 2|b|k4max(|b|,1)2|b|\leq\sqrt{k}\leq 4\max(|b|,1). We then take p(z)=p~(z)/𝐄u𝒩1[p~2(u)]p(z)=\tilde{p}(z)/\sqrt{\operatorname*{\mathbf{E}}_{u\sim\mathcal{N}_{1}}[\tilde{p}^{2}(u)]}.

For convenience, we first note that from Stirling’s approximation for m+m\in\mathbb{Z}^{+},

(m/2)m(2m1)!!(2m)m.(m/2)^{m}\leq(2m-1)!!\leq(2m)^{m}\;.

To prove that \mathrm{sign}(p(z))=\mathrm{sign}(z-b), we first show that \frac{q(b)}{r(b)}\leq 0. Since q(b)\geq 0, it suffices to show that r(b)\leq 0. Notice that r(b)=b^{2k}-(2k-1)!!; since 2|b|\leq\sqrt{k} and (2k-1)!!\geq(k/2)^{k}, we get r(b)\leq(\sqrt{k}/2)^{2k}-(k/2)^{k}\leq 0. Therefore, considering q(z)=z^{3k} and r(z)=z^{2k}-(2k-1)!!,

p~(z)=q(z)q(b)r(b)r(z)\tilde{p}(z)=q(z)-\frac{q(b)}{r(b)}r(z)

must be monotone increasing for z\geq b, and hence p(z) is also monotone increasing for z\geq b. Notice that p(b)=0; therefore, p(z)>0 for z>b. To show that p(z)<0 for z<b, we consider the cases z\in[-b,+b) and z<-b separately. For z\in[-b,+b), notice that since \frac{q(b)}{r(b)}\leq 0 and by the definitions of q(z) and r(z),

\tilde{p}(z)=q(z)-\frac{q(b)}{r(b)}r(z)<q(b)-\frac{q(b)}{r(b)}r(b)=\tilde{p}(b)=0\;.

Then, for zbz\leq-b, we show that p(z)p(z) is monotone increasing. Notice that the derivative p~(z)=kz2k1(3zk2q(b)r(b))\tilde{p}^{\prime}(z)=kz^{2k-1}(3z^{k}-2\frac{q(b)}{r(b)}), for zb<0z\leq-b<0, this is positive if and only if 3zk2q(b)r(b)<03z^{k}-2\frac{q(b)}{r(b)}<0. If we can show bk23q(b)/r(b)b^{k}\geq-\frac{2}{3}q(b)/r(b), then it is immediate that for any zbz\leq-b, 3zk2q(b)r(b)<03z^{k}-2\frac{q(b)}{r(b)}<0. The condition bk23q(b)/r(b)b^{k}\geq-\frac{2}{3}q(b)/r(b) can be further simplified to (2k1)!!/b2k5/3(2k-1)!!/b^{2k}\geq 5/3. Then considering 2b<k2b<\sqrt{k} and (2k1)!!(k/2)k(2k-1)!!\geq(k/2)^{k}, we have

(2k1)!!/b2k(k/2)k/(k/2)2k2k,(2k-1)!!/b^{2k}\geq(k/2)^{k}/(\sqrt{k}/2)^{2k}\geq 2^{k}\;,

thus the condition holds. Therefore, p~(z)>0\tilde{p}^{\prime}(z)>0 for z<bz<-b and p(z)p(z) must be monotone increasing for z<bz<-b. Then combined with the condition p(b)<0p(-b)<0, which is implied from the previous case (z[b,b]z\in[-b,b]), we have p(z)<0p(z)<0 for z<bz<-b.

For Property 2, we show that |p(z)|=2^{-\Theta(k)} for z\in[0,b/2], z\in[-b/2,0], z\in[-b,-b/2] and z\in(-\infty,-b] separately. For z\in[0,b/2], notice that p(z) is convex for z\in[0,b]; therefore, it suffices to show that |p(0)|=2^{-\Theta(k)}. Note that p(0)=\frac{q(b)}{r(b)}(2k-1)!!/\sqrt{\operatorname*{\mathbf{E}}_{u\sim\mathcal{N}_{1}}[\tilde{p}^{2}(u)]}, and we have \left|\frac{q(b)}{r(b)}\right|\geq b^{3k}/\max(b^{2k},(2k-1)!!)\geq\Omega(k)^{k/2}. Therefore, since \operatorname*{\mathbf{E}}_{u\sim\mathcal{N}_{1}}[\tilde{p}^{2}(u)]\leq(O(k))^{3k}, we have |p(0)|\geq 2^{-O(k)}. This completes the proof for the case z\in[0,b/2]. The case z\in[-b/2,0] follows immediately from the fact that |p(-z)|\geq|p(z)| for z\in[0,b], which holds by the definition of p. For the case z\in[-b,-b/2], we have p(z)\leq-(b/2)^{3k}/\sqrt{\operatorname*{\mathbf{E}}_{u\sim\mathcal{N}_{1}}[\tilde{p}^{2}(u)]}\leq-2^{-O(k)}. Finally, the case z\in(-\infty,-b] follows from the fact that p(z) is monotone increasing on this interval, so p(z)\leq p(-b)\leq-2^{-O(k)}. This completes the proof of Property 2. ∎
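The construction in the proof of Claim B.5 is concrete enough to check numerically. The sketch below evaluates the unnormalized polynomial \tilde{p} (normalizing by \operatorname*{\mathbf{E}}_{u\sim\mathcal{N}_{1}}[\tilde{p}^{2}(u)]^{1/2} only rescales and cannot change signs) for b=4 and k=65, which is odd and satisfies 2|b|\leq\sqrt{k}\leq 4\max(|b|,1):

```python
def double_factorial_odd(k):
    """(2k-1)!! = 1 * 3 * ... * (2k-1), as a float."""
    out = 1.0
    for j in range(1, 2 * k, 2):
        out *= float(j)
    return out

def sign_matching_poly(b, k):
    """Unnormalized sign-matching polynomial from the proof of Claim B.5:
    p~(z) = q(z) - (q(b)/r(b)) * r(z), with q(z) = z^{3k} and
    r(z) = z^{2k} - (2k-1)!!.  Requires k odd with 2|b| <= sqrt(k)."""
    df = double_factorial_odd(k)
    c = (b ** (3 * k)) / (b ** (2 * k) - df)
    return lambda z: z ** (3 * k) - c * (z ** (2 * k) - df)
```

Evaluating `sign_matching_poly(4.0, 65)` on a grid confirms that the sign of \tilde{p}(z) matches \mathrm{sign}(z-b).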

Given Claim B.5, let \bar{p} be the polynomial from the claim with b=\max(2|t^{*}|,4), and take p(z)=-\bar{p}(-z), so that \mathrm{sign}(p(z))=\mathrm{sign}(z+b) and |p(z)|\geq 2^{-O(k)} for z\geq-b/2. Then, given that D satisfies the reliability condition with respect to f, it is easy to see that the polynomial p(\langle\mathbf{w}^{*},\mathbf{x}\rangle) satisfies the requirements. This completes the proof of Lemma 2.2.

B.5 Proof of Lemma 2.4

We start by bounding the Frobenius norm of \mathbf{T}^{m}, i.e., \|\mathbf{T}^{m}\|_{F}. Notice that \|\mathbf{T}^{m}\|_{F}\leq\max_{p}\operatorname*{\mathbf{E}}_{(\mathbf{x},y)\sim D}[\mathbf{1}(y=-1)p(\mathbf{x})], where the maximum is over all degree-m polynomials p:\mathbb{R}^{d}\to\mathbb{R} with \operatorname*{\mathbf{E}}_{\mathbf{x}\sim\mathcal{N}_{d}}[p(\mathbf{x})]=0 and \operatorname*{\mathbf{E}}_{\mathbf{x}\sim\mathcal{N}_{d}}[p(\mathbf{x})^{2}]=1. We leverage the following standard concentration fact for polynomials, known as the Bonami–Beckner inequality or simply Gaussian hypercontractivity.

Fact B.6 (Hypercontractivity Concentration Inequality [Bog98, Nel73]).

Consider any degree-m polynomial p:\mathbb{R}^{d}\to\mathbb{R} with unit variance with respect to \mathcal{N}_{d}. Then, for any \lambda\geq 0,

𝐏𝐫𝐱𝒩d[|p(𝐱)𝐄𝐮[p(𝐮)]|λ]e2e(cλ2)1/m,\operatorname*{\mathbf{Pr}}_{\mathbf{x}\sim\mathcal{N}_{d}}\left[\left|p(\mathbf{x})-\operatorname*{\mathbf{E}}_{\mathbf{u}}[p(\mathbf{u})]\right|\geq\lambda\right]\leq e^{2}e^{-{\left(c\lambda^{2}\right)}^{1/m}}\;,

where cc is an absolute constant.

Using Fact B.6, we have that for any zero-mean, unit-variance polynomial p,

𝐄(𝐱,y)D[𝟏(y=1)p(𝐱)]\displaystyle\operatorname*{\mathbf{E}}_{(\mathbf{x},y)\sim D}[\mathbf{1}(y=-1)p(\mathbf{x})] 𝐄(𝐱,y)D[𝟏(y=1)|p(𝐱)|]0min(γ,e2e(cλ2)1/m)dλ\displaystyle\leq\operatorname*{\mathbf{E}}_{(\mathbf{x},y)\sim D}[\mathbf{1}(y=-1)|p(\mathbf{x})|]\leq\int_{0}^{\infty}\min\left(\gamma,e^{2}e^{-{\left(c\lambda^{2}\right)}^{1/m}}\right)d\lambda
=O(γlog(1/γ)m/2).\displaystyle=O\left(\gamma\log(1/\gamma)^{m/2}\right)\;.

Therefore, 𝐓mF=O(γlog(1/γ)m/2)\|\mathbf{T}^{m}\|_{F}=O\left(\gamma\log(1/\gamma)^{m/2}\right).
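Unpacking the last integral bound (a sketch with absolute constants absorbed; the incomplete-Gamma step assumes u_{0}\geq m, i.e., \gamma\leq 2^{-\Omega(m)}):

```latex
% Split the integral at \lambda_0, the point where the two terms in the
% minimum are equal, i.e., e^2 e^{-(c\lambda_0^2)^{1/m}} = \gamma, so that
% \lambda_0 = c^{-1/2}\log(e^2/\gamma)^{m/2} = O(\log(1/\gamma))^{m/2}:
\int_{0}^{\infty}\min\left(\gamma,e^{2}e^{-(c\lambda^{2})^{1/m}}\right)d\lambda
 \leq \gamma\lambda_{0}
 + \int_{\lambda_{0}}^{\infty}e^{2}e^{-(c\lambda^{2})^{1/m}}\,d\lambda\;.
% Substituting u = (c\lambda^2)^{1/m} (so \lambda = c^{-1/2}u^{m/2} and
% d\lambda = (m/2) c^{-1/2} u^{m/2-1} du) turns the tail into an
% incomplete Gamma integral, which for u_0 = \log(e^2/\gamma) \geq m
% satisfies
\int_{\lambda_{0}}^{\infty}e^{2}e^{-(c\lambda^{2})^{1/m}}\,d\lambda
 = \frac{e^{2}m}{2\sqrt{c}}\int_{u_{0}}^{\infty}u^{m/2-1}e^{-u}\,du
 = O\!\left(u_{0}^{m/2-1}e^{-u_{0}}\right)
 = O\!\left(\gamma\log(1/\gamma)^{m/2}\right)\;,
% so both terms are O(\gamma \log(1/\gamma)^{m/2}), as claimed.
```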

To prove Property 1, note that \|\mathbf{T}^{\prime m}\|_{F}\leq\|\mathbf{T}^{m}\|_{F}+\tau/(4\sqrt{k})=O\left(\gamma\log(1/\gamma)^{m/2}\right)+\tau/(4\sqrt{k}). Hence, \mathbf{T}^{\prime m} can have at most \|\mathbf{T}^{\prime m}\|_{F}^{2}/\left(\tau/(4\sqrt{k})\right)^{2}=O(\gamma^{2}\log(1/\gamma)^{m}k/\tau^{2}+1) many singular values greater than \tau/(4\sqrt{k}), so \dim(V_{m})=O(\gamma^{2}\log(1/\gamma)^{m}k/\tau^{2}+1). Summing over m=1,\ldots,k, \dim(V) is at most O(\gamma^{2}\log(1/\gamma)^{k}k^{2}/\tau^{2}+k).

For Property 2, suppose for the sake of contradiction that \|\mathrm{proj}_{V}(\mathbf{v^{*}})\|_{2}=o\left(\tau/\left(\sqrt{k}\gamma\log(1/\gamma)^{k/2}\right)\right). Then, for any m, we have

𝐯𝐓m2\displaystyle\|{\mathbf{v}^{*}}^{\top}\mathbf{T}^{\prime m}\|_{2} 𝐯𝐓m2+τ/(4k)\displaystyle\leq\|{\mathbf{v}^{*}}^{\top}\mathbf{T}^{m}\|_{2}+\tau/(4\sqrt{k})
𝐓mFprojV(𝐯)2+τ/(4k)projV(𝐯)2+τ/(4k)(5/8)τ/k.\displaystyle\leq\|\mathbf{T}^{m}\|_{F}\|\mathrm{proj}_{V}(\mathbf{v}^{*})\|_{2}+\tau/(4\sqrt{k})\|\mathrm{proj}_{\perp V}(\mathbf{v}^{*})\|_{2}+\tau/(4\sqrt{k})\leq(5/8)\tau/\sqrt{k}\;.

However, since \operatorname*{\mathbf{E}}_{(\mathbf{x},y)\sim D}[\mathbf{1}(y=-1)p(\langle\mathbf{v^{*},x}\rangle)]=\sum_{m=1}^{k}a_{m}{\mathbf{v}^{*}}^{\top}\mathbf{T}^{m}{\mathbf{v}^{*}}^{\otimes{m-1}}\geq\tau for some coefficients with a_{1}^{2}+\cdots+a_{k}^{2}=1, there must be an m such that

|𝐯𝐓m𝐯m1|τ/k.\left|{\mathbf{v}^{*}}^{\top}\mathbf{T}^{m}{\mathbf{v}^{*}}^{\otimes{m-1}}\right|\geq\tau/\sqrt{k}\;.

Therefore, we can write

𝐯𝐓m2𝐯𝐓m2τ/(4k)τ/kτ/(4k)=(3/4)τ/k,\|{\mathbf{v}^{*}}^{\top}\mathbf{T}^{\prime m}\|_{2}\geq\|{\mathbf{v}^{*}}^{\top}\mathbf{T}^{m}\|_{2}-\tau/(4\sqrt{k})\geq\tau/\sqrt{k}-\tau/(4\sqrt{k})=(3/4)\tau/\sqrt{k}\;,

which contradicts 𝐯𝐓m2(5/8)τ/k\|{\mathbf{v}^{*}}^{\top}\mathbf{T}^{\prime m}\|_{2}\leq(5/8)\tau/\sqrt{k}. This completes the proof.

B.6 Proof of Proposition 2.1

The pseudocode for the algorithm establishing Proposition 2.1 is given in Algorithm 2.

Input: Sample access to a joint distribution DD of (𝐱,y)(\mathbf{x},y) supported on d×{±1}\mathbb{R}^{d}\times\{\pm 1\} with the marginal D𝐱=𝒩dD_{\mathbf{x}}=\mathcal{N}_{d}. Let ϵ,δ(0,1)\epsilon,\delta\in(0,1) and suppose DD satisfies the reliability condition with respect to an LTF f(𝐱)=sign(𝐰,𝐱t)f(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w}^{*},\mathbf{x}\rangle-t^{*}) with t=O(log(1/ϵ))t^{*}=O\left(\sqrt{\log(1/\epsilon)}\right) and 𝐏𝐫(𝐱,y)D[y=1]ϵ\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[y=-1]\geq\epsilon.
Output: With probability at least Ω(1)\Omega(1), the algorithm outputs a unit vector 𝐯\mathbf{v} such that 𝐯,𝐰=max(log(1/ϵ)O(t2),ϵO(1))\langle\mathbf{v},\mathbf{w}^{*}\rangle=\max\left(\log(1/\epsilon)^{-O({t^{*}}^{2})},\epsilon^{O(1)}\right) .
  1.

    Let c_{1} be a sufficiently large universal constant and c_{2} a sufficiently large universal constant depending on c_{1}. Let S be a set of N=d^{c_{2}{\max({t^{*}}^{2},1)}}\log(1/\delta)/\epsilon^{2} many samples from D. With probability 1/2, for each m=1,\cdots,c_{1}{t^{*}}^{2}, take

    𝐓m=𝐄(𝐱,y)uS[𝟏(y=1)𝐇m(𝐱)],\mathbf{T}^{\prime m}=\operatorname*{\mathbf{E}}_{(\mathbf{x},y)\sim_{u}S}[\mathbf{1}(y=-1)\mathbf{H}^{m}(\mathbf{x})]\;,

    to be the empirical Chow-tensor on negative samples. Otherwise, take

    𝐓m=𝐄(𝐱,y)uS[y𝐇m(𝐱)],\mathbf{T}^{\prime m}=\operatorname*{\mathbf{E}}_{(\mathbf{x},y)\sim_{u}S}[y\mathbf{H}^{m}(\mathbf{x})]\;,

    to be the empirical Chow-tensor on all samples. Let γ\gamma be the empirical estimation of 𝐏𝐫(𝐱,y)D[y=1]\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[y=-1] with error at most ϵ/100\epsilon/100 with a sufficiently small constant failure probability.

  2.

    Take 𝐓~md×dm1\tilde{\mathbf{T}}^{\prime m}\in\mathbb{R}^{d\times{d^{m-1}}} to be the flattened version of 𝐓m\mathbf{T}^{\prime m}, and let VmV_{m} be the subspace spanned by the left singular vectors of 𝐓~m\tilde{\mathbf{T}}^{\prime m} whose singular values are greater than 2ct2γ2^{-c{t^{*}}^{2}}\gamma where cc is a sufficiently large constant. Let VV be the union of all VmV_{m} and output 𝐯\mathbf{v} to be a random unit vector chosen from VV.

Algorithm 2 Finding a Direction with High Correlation.
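For intuition, in the degree m=1 case Algorithm 2 reduces to estimating the negative-label Chow vector \mathbf{T}^{1}=\operatorname*{\mathbf{E}}[\mathbf{1}(y=-1)\mathbf{x}] and normalizing it (no SVD is needed, as V_{1} is one-dimensional). A self-contained Python sketch on toy data (our construction: \mathbf{w}^{*}=\mathbf{e}_{1}, t^{*}=-1, and y=-1 exactly when \langle\mathbf{w}^{*},\mathbf{x}\rangle-t^{*}\leq 0, so the reliability condition holds):

```python
import random

def chow_direction(samples):
    """Degree-1 case of Algorithm 2: estimate the negative-label Chow
    vector T^1 = E[1(y = -1) x] empirically and return the unit vector
    in its direction.  (The general algorithm uses the order-m Chow
    tensors and an SVD; for m = 1 the subspace V_1 is one-dimensional.)"""
    d = len(samples[0][0])
    chow = [0.0] * d
    for x, y in samples:
        if y == -1:
            for i in range(d):
                chow[i] += x[i] / len(samples)
    norm = sum(c * c for c in chow) ** 0.5
    return [c / norm for c in chow]

# toy data satisfying the reliability condition with w* = e_1, t* = -1
random.seed(0)
samples = []
for _ in range(20000):
    x = [random.gauss(0.0, 1.0) for _ in range(3)]
    samples.append((x, -1 if x[0] <= -1.0 else 1))
v = chow_direction(samples)  # close to -e_1, so |<v, w*>| is large
```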

We can now prove Proposition 2.1.

Proof of Proposition 2.1.

We first introduce the following fact about learning Chow tensors.

Fact B.7.

Fix m\in\mathbb{Z}_{+} and \epsilon,\delta\in(0,1). Let D be a distribution on \mathbb{R}^{d}\times[-1,1] with standard normal \mathbf{x}-marginal. There is an algorithm that, with N=d^{O(m)}\log(1/\delta)/\epsilon^{2} samples and \mathrm{poly}(d,N) runtime, outputs an approximation \mathbf{T}^{\prime m} of the order-m Chow-parameter tensor \mathbf{T}^{m}=\operatorname*{\mathbf{E}}_{(\mathbf{x},y)\sim D}[y\mathbf{H}^{m}(\mathbf{x})] such that with probability 1-\delta it holds that \|\mathbf{T}^{\prime m}-\mathbf{T}^{m}\|_{F}\leq\epsilon.

Proof.

Let \mathbf{T}^{\prime m} be the empirical estimate of \mathbf{T}^{m} using N=d^{cm}\log(1/\delta)/\epsilon^{2} samples, for a sufficiently large universal constant c. We start by showing that \operatorname*{\mathbf{E}}_{(\mathbf{x},y)\sim D}\left[\|y\mathbf{H}^{m}(\mathbf{x})\|_{F}^{2}\right]\leq d^{m}. Notice that,

\operatorname*{\mathbf{E}}_{(\mathbf{x},y)\sim D}\left[\|y\mathbf{H}^{m}(\mathbf{x})\|_{F}^{2}\right]=\operatorname*{\mathbf{E}}_{(\mathbf{x},y)\sim D}\left[y^{2}\|\mathbf{H}^{m}(\mathbf{x})\|_{F}^{2}\right]\leq\operatorname*{\mathbf{E}}_{(\mathbf{x},y)\sim D}\left[\|\mathbf{H}^{m}(\mathbf{x})\|_{F}^{2}\right]=\operatorname*{\mathbf{E}}_{\mathbf{x}\sim\mathcal{N}_{d}}\left[\|\mathbf{H}^{m}(\mathbf{x})\|_{F}^{2}\right]\;.

Notice that each entry of \mathbf{H}^{m}(\mathbf{x}) is of the form \beta h(\mathbf{x}), where \beta\in[0,1] and h(\mathbf{x}) is a unit-variance Hermite polynomial. Therefore, \operatorname*{\mathbf{E}}_{\mathbf{x}\sim\mathcal{N}_{d}}\left[\|\mathbf{H}^{m}(\mathbf{x})\|_{F}^{2}\right]\leq d^{m} and thus \operatorname*{\mathbf{E}}_{(\mathbf{x},y)\sim D}\left[\|y\mathbf{H}^{m}(\mathbf{x})\|_{F}^{2}\right]\leq d^{m}.

Using the above, we have

𝐄(𝐱,y)D[y𝐇m(𝐱)𝐓mF2]=𝐄(𝐱,y)D[y𝐇m(𝐱)F2]𝐓mF2dm.\operatorname*{\mathbf{E}}_{(\mathbf{x},y)\sim D}\left[\|y\mathbf{H}^{m}(\mathbf{x})-\mathbf{T}^{m}\|_{F}^{2}\right]=\operatorname*{\mathbf{E}}_{(\mathbf{x},y)\sim D}\left[\|y\mathbf{H}^{m}(\mathbf{x})\|_{F}^{2}\right]-\|\mathbf{T}^{m}\|_{F}^{2}\leq d^{m}\;.

Then, applying Chebyshev’s inequality shows that \|\mathbf{T}^{\prime m}-\mathbf{T}^{m}\|_{F}\leq\epsilon with constant probability, and standard amplification (taking a coordinate-wise median of O(\log(1/\delta)) independent estimates) boosts the confidence to 1-\delta. ∎
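The \log(1/\delta) factor in the sample bound of Fact B.7 comes from standard confidence amplification: Chebyshev makes a single empirical mean \epsilon-accurate with probability, say, 2/3, and a median of O(\log(1/\delta)) independent estimates is then \epsilon-accurate with probability 1-\delta. A scalar Python sketch of this median trick (for tensors, one applies it coordinate-wise or to whole estimates):

```python
import random
import statistics

def median_of_means(draw, n_per_batch, n_batches):
    """Confidence amplification behind the log(1/delta) factor in
    Fact B.7: by Chebyshev, each batch mean is eps-accurate with
    probability >= 2/3, so the median of O(log(1/delta)) independent
    batch means is eps-accurate with probability >= 1 - delta."""
    means = [sum(draw() for _ in range(n_per_batch)) / n_per_batch
             for _ in range(n_batches)]
    return statistics.median(means)

random.seed(1)
est = median_of_means(lambda: random.gauss(5.0, 1.0), 200, 15)
```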

We prove Proposition 2.1 in two cases: (a) \max(\log(1/\epsilon)^{-O({t^{*}}^{2})},\epsilon^{O(1)})=\log(1/\epsilon)^{-O({t^{*}}^{2})}; and (b) \max(\log(1/\epsilon)^{-O({t^{*}}^{2})},\epsilon^{O(1)})=\epsilon^{O(1)}. We first prove Case (a), as the other case is similar. For Case (a), it suffices to prove that \langle\mathbf{v},\mathbf{w}^{*}\rangle=\log(1/\epsilon)^{-O({t^{*}}^{2})} holds with \Omega(1) probability, given that the algorithm chooses \mathbf{T}^{\prime m}=\operatorname*{\mathbf{E}}_{(\mathbf{x},y)\sim_{u}S}[\mathbf{1}(y=-1)\mathbf{H}^{m}(\mathbf{x})] (which happens with probability 1/2). Let k=c_{1}{t^{*}}^{2} and \tau/(4\sqrt{k})=2^{-c{t^{*}}^{2}}\gamma. By Fact B.7, we have \|\mathbf{T}^{\prime m}-\mathbf{T}^{m}\|_{F}\leq\tau/(4\sqrt{k}). Then, from Lemma 2.2 and Lemma 2.4, if we take \mathbf{v} to be a random vector in V, then with constant probability,

𝐯,𝐰\displaystyle\langle\mathbf{v},\mathbf{w}^{*}\rangle =Ω(τ/(kγlog(1/γ)k/2))dim(V)1/2=Ω(τ2/(log(1/γ)kk3/2γ2))\displaystyle=\frac{\Omega\left(\tau/\left(\sqrt{k}\gamma\log(1/\gamma)^{k/2}\right)\right)}{\dim(V)^{1/2}}=\Omega\left(\tau^{2}/\left(\log(1/\gamma)^{k}k^{3/2}\gamma^{2}\right)\right)
=Ω(log(1/γ)O(t2)k3/2).\displaystyle=\Omega\left(\log(1/\gamma)^{-O({t^{*}}^{2})}{k}^{-3/2}\right)\;.

Since 𝐏𝐫(𝐱,y)D[y=1]ϵ\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[y=-1]\geq\epsilon, we have γ=Ω(ϵ)\gamma=\Omega(\epsilon). Also, considering k=O(max{t2,1})=O(log(1/ϵ))k=O(\max\{{t^{*}}^{2},1\})=O\left(\log(1/\epsilon)\right), we get 𝐯,𝐰=log(1/ϵ)O(t2)\langle\mathbf{v},\mathbf{w}^{*}\rangle=\log(1/\epsilon)^{-O({t^{*}}^{2})}.

For Case (b), the proof remains the same, except that we take \mathbf{T}^{\prime m}=\operatorname*{\mathbf{E}}_{(\mathbf{x},y)\sim_{u}S}[y\mathbf{H}^{m}(\mathbf{x})] and use Fact B.8 below instead of Lemma 2.4.

Fact B.8 (Lemma 5.10 in [DKK+22]).

Let D be the joint distribution of (\mathbf{x},y) supported on \mathbb{R}^{d}\times\{\pm 1\} with marginal D_{\mathbf{x}}=\mathcal{N}_{d}. Let p:\mathbb{R}\mapsto\mathbb{R} be a univariate, mean-zero, unit-variance polynomial of degree k such that for some unit vector \mathbf{v}^{*}\in\mathbb{R}^{d} it holds that \operatorname*{\mathbf{E}}_{(\mathbf{x},y)\sim D}[yp(\langle\mathbf{v}^{*},\mathbf{x}\rangle)]\geq\tau for some \tau\in(0,1]. Let \mathbf{T}^{\prime m} be an approximation of the order-m Chow-parameter tensor \mathbf{T}^{m}=\operatorname*{\mathbf{E}}_{(\mathbf{x},y)\sim D}[y\mathbf{H}^{m}(\mathbf{x})] such that \|\mathbf{T}^{\prime m}-\mathbf{T}^{m}\|_{F}\leq\tau/(4\sqrt{k}). Denote by V_{m} the subspace spanned by the left singular vectors of the flattened \mathbf{T}^{\prime m} whose singular values are greater than \tau/(4\sqrt{k}), and denote by V the span of V_{1}\cup\cdots\cup V_{k}. Then we have that

  1. \dim(V)\leq 4k/\tau^{2}, and

  2. \|\mathrm{proj}_{V}(\mathbf{v}^{*})\|_{2}\geq\tau/\left(4\sqrt{k}\right).

The same argument then gives \langle\mathbf{v},\mathbf{w}^{*}\rangle=\epsilon^{O(1)}. The algorithm described above uses at most N=d^{O({t^{*}}^{2})}/\epsilon^{2} samples and \mathrm{poly}\left(N,1/\epsilon\right) runtime. This completes the proof of Proposition 2.1. ∎
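For intuition, the SVD-based subspace extraction in Fact B.8 can be sketched numerically as follows. This is a minimal illustration, not the paper's implementation: the target halfspace \mathrm{sign}(x_{1}-1), the sample size, and the restriction to m=2 (where the relevant order-2 statistic is \operatorname*{\mathbf{E}}[y(\mathbf{x}\mathbf{x}^{\top}-I)], the degree-2 Hermite component up to normalization) are all assumptions made for the example.

```python
# Hedged sketch of the subspace-extraction step in Fact B.8 (m = 2 only):
# keep the left singular vectors of the estimated Chow matrix whose singular
# values exceed tau/(4 sqrt(k)); a random unit vector in the resulting
# subspace V then has nontrivial correlation with w*.  All parameters below
# are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
d, k, tau = 6, 2, 0.5
thresh = tau / (4 * np.sqrt(k))          # singular-value cutoff tau/(4 sqrt(k))

w_star = np.zeros(d); w_star[0] = 1.0    # hypothetical w*, with threshold t* = 1
X = rng.standard_normal((20000, d))
y = np.sign(X @ w_star - 1.0)

# Empirical order-2 Chow matrix E[y (x x^T - I)] (degree-2 Hermite component,
# up to normalization); for this biased halfspace it concentrates on w* w*^T.
T2 = (y[:, None, None] * (X[:, :, None] * X[:, None, :])).mean(axis=0) \
     - y.mean() * np.eye(d)

U, s, _ = np.linalg.svd(T2)
V = U[:, s > thresh]                     # basis of the extracted subspace
v = V @ rng.standard_normal(V.shape[1])
v /= np.linalg.norm(v)                   # random unit vector in V
corr = abs(v @ w_star)                   # nontrivial correlation with w*
```

Note that a zero-threshold halfspace would make this order-2 statistic vanish by symmetry, which is why the example uses t^{*}=1; the full procedure unions the subspaces over all orders m\leq k.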

B.7 Proof of Lemma 2.5

Item 1 follows from the definition of D^{\prime} and the fact that D has marginal distribution D_{\mathbf{x}}=\mathcal{N}_{d}.

To prove Item 2, we consider two cases: t^{*}>0 and t^{*}\leq 0. For the case t^{*}\leq 0, we prove that the distribution D^{\prime} satisfies the reliability condition with respect to h^{\prime}(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w}^{\prime},\mathbf{x}\rangle), where \mathbf{w}^{\prime}={\mathbf{w}^{*}}^{\perp\mathbf{w}}/\|{\mathbf{w}^{*}}^{\perp\mathbf{w}}\|_{2}. Notice that for any (\mathbf{x},y) such that \mathbf{x}\in B and h^{\prime}(\mathbf{x}^{\perp\mathbf{w}})=1, we have

f(\mathbf{x}) =\mathrm{sign}(\langle\mathbf{w}^{*},\mathbf{x}\rangle-t^{*})
=\mathrm{sign}(\langle\mathrm{proj}_{\perp\mathbf{w}}(\mathbf{w}^{*}),\mathbf{x}\rangle+\langle\mathrm{proj}_{\mathbf{w}}(\mathbf{w}^{*}),\mathbf{x}\rangle-t^{*})