
Reliable Learning of Halfspaces under Gaussian Marginals

Ilias Diakonikolas
University of Wisconsin-Madison
[email protected]
Supported in part by NSF Medium Award CCF-2107079 and an H.I. Romnes Faculty Fellowship.
   Lisheng Ren
University of Wisconsin-Madison
[email protected]
Supported in part by NSF Medium Award CCF-2107079.
   Nikos Zarifis
University of Wisconsin-Madison
[email protected]
Supported in part by NSF Medium Award CCF-2107079.
Abstract

We study the problem of PAC learning halfspaces in the reliable agnostic model of Kalai et al. (2012). The reliable PAC model captures learning scenarios where one type of error is costlier than the others. Our main positive result is a new algorithm for reliable learning of Gaussian halfspaces on $\mathbb{R}^{d}$ with sample and computational complexity

$$d^{O(\log(\min\{1/\alpha,1/\epsilon\}))}\min\left(2^{\log(1/\epsilon)^{O(\log(1/\alpha))}},2^{\mathrm{poly}(1/\epsilon)}\right)\;,$$

where $\epsilon$ is the excess error and $\alpha$ is the bias of the optimal halfspace. We complement our upper bound with a Statistical Query lower bound suggesting that the $d^{\Omega(\log(1/\alpha))}$ dependence is best possible. Conceptually, our results imply a strong computational separation between reliable agnostic learning and standard agnostic learning of halfspaces in the Gaussian setting.

1 Introduction

Halfspaces (or Linear Threshold Functions) form the class of functions $f:\mathbb{R}^{d}\to\{\pm 1\}$ of the form $f(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w},\mathbf{x}\rangle-t)$, where $\mathbf{w}\in\mathbb{R}^{d}$ is called the weight vector and $t\in\mathbb{R}$ is called the threshold. The problem of learning halfspaces is one of the classical and most well-studied problems in machine learning—going back to the Perceptron algorithm [Ros58]—and has had great impact on many other influential techniques, including SVMs [Vap98] and AdaBoost [FS97].
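Concretely, evaluating a halfspace is a one-liner; a minimal Python sketch (illustrative only, with $\mathrm{sign}(0)$ taken as $+1$):

```python
import numpy as np

def halfspace(w, t):
    """Return the function f(x) = sign(<w, x> - t), with sign(0) := +1."""
    w = np.asarray(w, dtype=float)
    def f(x):
        return 1 if float(np.dot(w, x)) - t >= 0.0 else -1
    return f

f = halfspace(w=[1.0, 0.0], t=0.5)
print(f([1.0, 0.0]))  # → 1   (<w, x> = 1.0 >= 0.5)
print(f([0.0, 3.0]))  # → -1  (<w, x> = 0.0 <  0.5)
```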

Here we focus on learning halfspaces from random labeled examples. The computational complexity of this task crucially depends on the choice of the underlying model. For example, in the realizable PAC model (i.e., with clean labels), the problem is known to be efficiently solvable (see, e.g., [MT94]) via a reduction to linear programming. Unfortunately, this method is quite fragile and breaks down in the presence of noisy labels. In the noisy setting, the computational complexity of the problem depends on the choice of noise model and distributional assumptions. In this work, we study the problem of distribution-specific PAC learning of halfspaces, with respect to Gaussian marginals, in the reliable agnostic model of [KKM12]. Formally, we have the following definition.

Definition 1.1 ((Positive) Reliable Learning of Gaussian Halfspaces).

Let $\mathcal{H}_{d}$ be the class of halfspaces on $\mathbb{R}^{d}$. Given $0<\epsilon<1$ and i.i.d. samples $(\mathbf{x},y)$ from a distribution $D$ supported on $\mathbb{R}^{d}\times\{\pm 1\}$, where the marginal $D_{\mathbf{x}}$ is the standard Gaussian $\mathcal{N}(\mathbf{0},\mathbf{I})$, we say that an algorithm reliably learns $\mathcal{H}_{d}$ to error $\epsilon$ if with high probability it outputs a hypothesis $h:\mathbb{R}^{d}\to\{\pm 1\}$ such that $\mathbf{Pr}_{(\mathbf{x},y)\sim D}[h(\mathbf{x})=1\land y=-1]\leq\epsilon$ and $\mathbf{Pr}_{(\mathbf{x},y)\sim D}[h(\mathbf{x})=-1\land y=1]\leq\mathrm{OPT}+\epsilon$, where $\mathrm{OPT}\stackrel{\mathrm{def}}{=}\min_{f\in\mathcal{G}(D)}\mathbf{Pr}_{(\mathbf{x},y)\sim D}[f(\mathbf{x})=-1\land y=1]$, with

$$\mathcal{G}(D)=\{f\in\mathcal{H}_{d}:\mathbf{Pr}_{(\mathbf{x},y)\sim D}[f(\mathbf{x})=1\land y=-1]=0\}\cup\{f(\mathbf{x})\equiv-1\}\;.$$

We say that $f=\operatorname{argmin}_{f\in\mathcal{G}(D)}\mathbf{Pr}_{(\mathbf{x},y)\sim D}[f(\mathbf{x})=-1\land y=1]$ is the optimal halfspace on distribution $D$; when a hypothesis $h$ satisfies the conditions above, we say that $h$ is $\epsilon$-reliable with respect to $\mathcal{H}_{d}$ on distribution $D$.

In other words, a (positive) reliable learner makes almost no false positive errors, while also maintaining the best possible false negative error—as compared to any hypothesis with no false positive errors. Note that if there is no hypothesis (in the target class) that makes no false positive errors, then essentially we have no requirement on the false negative error of the returned hypothesis; the algorithm can then simply return the constant $-1$ hypothesis.

The reliable agnostic PAC model was introduced by [KKM12] as a one-sided analogue of the classical agnostic model [Hau92, KSS94], and has since been extensively studied in a range of works; see, e.g., [KKM12, KT14, GKKT17, DJ19, GKKM20, KK21]. The underlying motivation for this definition comes from learning situations where mistakes of one type (e.g., false positive) should be avoided at all costs. Typical examples include spam detection—where incorrectly detecting spam emails is much less costly than mislabeling an important email as spam—and detecting network failures—where false negatives may be particularly harmful. Such scenarios motivate the study of reliable learning, which can be viewed as minimizing a loss function that assigns different costs to false positive and false negative errors (see, e.g., [Dom99, Elk01]). In a historical context, reliable learning is related to the Neyman-Pearson criterion [NP33] in hypothesis testing, which shows that the optimal strategy to minimize one type of error, subject to the other type of error being bounded, is to threshold the likelihood ratio function. More recently, reliable learning has been shown [KK21] to be equivalent to the PQ-learning model [GKKM20]—a recently introduced learning model motivated by covariate shift.

The algorithmic task of agnostic reliable learning can be quite challenging. While reliable learning can be viewed as minimizing a loss function with different costs for false positive and false negative errors, in general such a loss function results in a non-convex optimization problem. As pointed out in [KKM12], distribution-specific reliable learning can be efficiently reduced to distribution-specific agnostic learning (the reduction preserves the marginal distribution). A natural question, serving as a motivation for this work, is whether reliable learning is qualitatively easier computationally—for natural concept classes, and halfspaces in particular.

Before we state our contributions, we briefly summarize prior work on agnostically learning Gaussian halfspaces. Recall that, in agnostic learning, the goal is to output a hypothesis with error $\mathrm{OPT}+\epsilon$, where $\mathrm{OPT}$ is the optimal 0-1 error within the target class. Prior work has essentially characterized the computational complexity of agnostically learning Gaussian halfspaces. Specifically, it is known that the $L_{1}$ polynomial regression algorithm of [KKMS08] is an agnostic learner with complexity $d^{O(1/\epsilon^{2})}$ [DGJ+09, DKN10]. Moreover, there is strong evidence that this complexity upper bound is tight, both in the Statistical Query (SQ) model [DKZ20, GGK20, DKPZ21] and under plausible cryptographic assumptions [DKR23, Tie23]. It is worth noting that the aforementioned hardness results hold even for the subclass of homogeneous halfspaces.

Given the aforementioned reduction of reliable learning to agnostic learning [KKM12], one can use $L_{1}$-regression as a reliable agnostic learner for Gaussian halfspaces, leading to complexity $d^{O(1/\epsilon^{2})}$. Prior to this work, this was the best (and only known) reliable halfspace learner. Given the fundamental nature of this problem, it is natural to ask if one can do better for reliable learning.

Is it possible to develop faster algorithms for reliably learning Gaussian halfspaces
compared to agnostic learning?

In this work, we provide an affirmative answer to this question.

To formally state our main result, we need the notion of the bias of a Boolean function.

Definition 1.2 (Bias of Boolean Function).

We define the bias $\alpha\in[0,1/2]$ of $h:\mathbb{R}^{d}\to\{\pm 1\}$ under the Gaussian distribution as $\alpha=\alpha(h)\stackrel{\mathrm{def}}{=}\min\left(\mathbf{Pr}_{\mathbf{x}\sim\mathcal{N}_{d}}[h(\mathbf{x})=1],\mathbf{Pr}_{\mathbf{x}\sim\mathcal{N}_{d}}[h(\mathbf{x})=-1]\right)$.
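For a halfspace $h(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w},\mathbf{x}\rangle-t)$ with $\|\mathbf{w}\|_{2}=1$, the projection $\langle\mathbf{w},\mathbf{x}\rangle$ is a standard one-dimensional Gaussian, so the bias admits the closed form $\alpha(h)=\min(\Phi(t),1-\Phi(t))$. A small sketch (the function name is ours, not the paper's):

```python
from statistics import NormalDist

def halfspace_bias(t):
    """Bias alpha(h) of h(x) = sign(<w, x> - t) with ||w||_2 = 1 under N(0, I).

    <w, x> ~ N(0, 1), so Pr[h = 1] = 1 - Phi(t), and alpha is the smaller of
    the two class probabilities.
    """
    p_pos = 1.0 - NormalDist().cdf(t)
    return min(p_pos, 1.0 - p_pos)

print(halfspace_bias(0.0))  # → 0.5 (unbiased halfspace)
print(halfspace_bias(2.0))  # ≈ 0.0228, i.e. 1 - Phi(2)
```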

The main result of this paper is encapsulated in the following theorem:

Theorem 1.3 (Main Result).

Let $D$ be a joint distribution of $(\mathbf{x},y)$ supported on $\mathbb{R}^{d}\times\{\pm 1\}$ with marginal $D_{\mathbf{x}}=\mathcal{N}_{d}$, and let $\alpha$ be the bias of the optimal halfspace on distribution $D$ (with respect to Definition 1.1). There is an algorithm that uses $N=d^{O(\log(\min\{1/\alpha,1/\epsilon\}))}\min\left(2^{\log(1/\epsilon)^{O(\log(1/\alpha))}},2^{\mathrm{poly}(1/\epsilon)}\right)$ many samples from $D$, runs in $\mathrm{poly}(N,d)$ time, and with high probability returns a hypothesis $h(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w},\mathbf{x}\rangle-t)$ that is $\epsilon$-reliable with respect to $\mathcal{H}_{d}$. Moreover, for $\epsilon<\alpha/2$, any SQ algorithm for the problem requires complexity $d^{\Omega(\log(1/\alpha))}$.

For more detailed theorem statements of our upper and lower bounds, see Appendix B for the algorithmic result and the appendix on SQ hardness for the lower bound.

Theorem 1.3 gives a significantly more efficient algorithm (in terms of both sample complexity and computational complexity) for reliably learning Gaussian halfspaces, compared to agnostic learning. Specifically, as long as $\alpha>0$ is a universal constant, the overall complexity is polynomial in $d$ and quasi-polynomial in $1/\epsilon$—as opposed to $d^{\mathrm{poly}(1/\epsilon)}$ in the agnostic setting. While the complexity of our algorithm (namely, the multiplicative factor that is independent of $d$) increases as the optimal halfspace becomes more biased, it is always bounded above by the complexity of agnostic learning.

Finally, we note that our algorithm also applies to the fully reliable learning model [KKM12], since fully reliable learning easily reduces to positive reliable and negative reliable learning (as observed in [KKM12]).

1.1 Overview of Techniques

Instead of directly considering the optimal halfspace for a distribution $D$ (as defined in Definition 1.1), we introduce the following definition as a relaxation.

Definition 1.4 (Reliability Condition).

We say that a distribution $D$ supported on $X\times\{\pm 1\}$ satisfies the reliability condition with respect to $f:X\to\{\pm 1\}$ if $\mathbf{Pr}_{(\mathbf{x},y)\sim D}[f(\mathbf{x})=+1\land y=-1]=0$.

Notice that $h$ being $\epsilon$-reliable on distribution $D$ is equivalent to $h$ having competitive false negative error compared with any $f$ such that $D$ satisfies the reliability condition with respect to $f$ (instead of comparing only with the optimal halfspace). At the same time, for any fixed $f$, this definition allows the adversary to arbitrarily corrupt the negative labels, thus allowing for a more manageable analysis.

We start with a brief overview of our SQ lower bound construction, followed by the high-level idea enabling our nearly matching algorithm.

SQ Lower Bound

As follows from Definition 1.4 and the subsequent discussion, the adversary in reliable learning can arbitrarily corrupt the negative labels. We leverage this idea to construct an instance where the corruption level suffices to match many moments, meaning that the labels $y$ are uncorrelated with any polynomial of degree at most $\log(1/\alpha)$. To prove this, we construct an (infinite) feasibility Linear Program (LP) with the following property: if the LP is feasible, there exists an instance for which no low-degree polynomial correlates with the labels; see Lemma 3.3. To show the existence of such an instance, we leverage a technique from [DKPZ21] that exploits (an infinite generalization of) LP duality. Specifically, we show that it is possible to add noise to the labels whose uncorrupted value is negative so that all low-degree moments are uncorrelated with the observed labels $y$. This implies that no algorithm that relies on low-degree polynomials can succeed for reliable learning. It follows that any SQ algorithm must either use queries of tolerance at most $1/d^{\Omega(\log(1/\alpha))}$ or make exponentially many queries.

Reliable Learning Algorithm

As discussed in the previous paragraph, an adversary can arbitrarily corrupt the negative labels, which can make various algorithmic approaches fail. In more detail, the adversary can corrupt $\Omega(\log(1/\alpha))$ moments; that is, any SQ algorithm needs to rely on higher-order moment information. One of the main difficulties of this setting is that there may exist weight vectors $\mathbf{w},\mathbf{w}'$ with $\|\mathbf{w}-\mathbf{w}'\|_{2}\geq\Omega(1)$ while at the same time the 0-1 errors of $\mathbf{w}$ and $\mathbf{w}'$ are nearly the same. This might suggest that no approach can verify that the algorithm decreases the angle between the current weight vector and the optimal vector. To overcome this obstacle, we develop an algorithm that performs a random walk over a "low-dimensional" subspace. To establish the correctness of our algorithm, we prove that at some point during its execution, the algorithm will find a weight vector sufficiently close to the optimal vector.

To show that such a low-dimensional subspace exists, we proceed as follows. We first prove that there exists a non-trivial polynomial $p$ of degree $O(\log(1/\alpha))$ that correlates with the negative labels (by non-trivial, we mean that $p$ cannot be the constant polynomial, as this trivially correlates with the labels); see B.5. This implies that the low-degree moment tensors, i.e., $\mathbf{E}_{(\mathbf{x},y)\sim D}[\mathbf{1}\{y=-1\}\mathbf{x}^{\otimes k}]$ for some $k$ between $1$ and $O(\log(1/\alpha))$, correlate non-trivially with $(\mathbf{w}^{*})^{\otimes k}$, where $\mathbf{w}^{*}$ is the target weight vector. Thus, we can use these moment tensors to construct a low-dimensional subspace. We leverage the structure of the problem, namely that the (observed) negative labels are uncorrupted, to show that the correlation is in fact almost constant. As a consequence, we can construct a subspace whose dimension depends only on the degree of the polynomial $p$ (which is $O(\log(1/\alpha))$) and the desired excess error $\epsilon$.

We then proceed as follows: in each round, as long as the current solution $\mathbf{w}$ is not optimal, i.e., there are negative samples in the region $\{\mathbf{x}\in\mathbb{R}^{d}:\mathbf{w}\cdot\mathbf{x}-t\geq 0\}$, our algorithm conditions on a thin strip, projects the points $\mathbf{x}$ on the orthogonal complement of $\mathbf{w}$, and reapplies the structural result above. This leads (with constant probability) to an increase in the correlation between $\mathbf{w}$ and $\mathbf{w}^{*}$. Assuming that the correlation increases by $\beta$ in each round with probability at least $1/3$, we roughly need at most $1/\beta$ successful updates. Therefore, if we run our algorithm for $3^{1/\beta}$ steps, it is guaranteed that with probability at least $1/3$ we will find a vector almost parallel to $\mathbf{w}^{*}$.
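The restart logic above can be sketched as follows; `try_improve_direction` (standing in for the update step of Proposition 2.1) and `is_good` (the final verification) are caller-supplied placeholders, not the paper's actual subroutines—this is a minimal sketch of the control flow only.

```python
import math
import random

def random_restart_search(try_improve_direction, is_good, beta, rng=random.Random(0)):
    """Random-restart wrapper for the random walk described above.

    Each call to try_improve_direction is assumed to increase the correlation
    with the target by beta with probability at least 1/3, so one run of
    ceil(1/beta) steps succeeds with probability >= (1/3)^ceil(1/beta);
    repeating 3^ceil(1/beta) independent runs makes overall success likely.
    """
    steps_per_run = math.ceil(1.0 / beta)   # successful updates needed
    num_runs = 3 ** steps_per_run           # enough restarts for a lucky run
    for _ in range(num_runs):
        w = None                            # fresh random start each run
        for _ in range(steps_per_run):
            w = try_improve_direction(w, rng)
        if is_good(w):
            return w
    return None
```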

Prior Algorithmic Techniques

Roughly speaking, prior techniques for reliable learning relied on some variant of $L_{1}$-regression. In that sense, our algorithmic approach departs from direct polynomial approximation. Concretely, for distribution-free reliable learning, the algorithm of [KT14] uses a one-sided variant of $L_{1}$-regression—instead of approximating the target function in $L_{1}$-norm, it uses the hinge loss. For distribution-specific (and Gaussian in particular) reliable learning of halfspaces, the only previous approach uses the reduction of [KKM12] to agnostic learning. While our algorithm also leverages polynomial approximation, our approach employs significantly different ideas, and we believe it provides a novel perspective on this problem.

Technical Comparison with Prior Work

Our algorithm shares similarities with the algorithm of [DKK+22] for learning Gaussian halfspaces with Massart noise (a weaker, semi-random noise model). Both algorithms perform a random walk in order to converge to the target halfspace. That said, our algorithm is fundamentally different from theirs for the following reasons. The algorithm of [DKK+22] partitions $\mathbb{R}^{d}$ into sufficiently small angular regions and searches in each of them for a direction along which to update the current hypothesis at random. Our algorithm instead conditions on $y=-1$, i.e., on the points that are uncorrupted, and uses the guarantee (which we establish) that the first $\Omega(\log(1/\alpha))$ moments of the distribution conditioned on $y=-1$ correlate with the unknown optimal hypothesis. This leads to an algorithm with significantly faster runtime.

Our SQ lower bound leverages techniques from [DKPZ21]. Essentially, we formulate a linear program to construct a noise function that makes all low-degree polynomials uncorrelated with the labels $y$. This condition suffices to show that no SQ algorithm can solve the problem without relying on higher moments. To prove the existence of such a noise function, we use (a generalization of) LP duality for an infinite LP, which provides the necessary conditions for the existence of such a function.

1.2 Related Work

This work is part of the broader research program of characterizing the efficient learnability of natural concept classes with respect to challenging noise models. The complexity of this general learning task depends on the underlying class, the distributional assumptions, and the choice of noise model. A long line of prior work has made substantial progress in this direction for the class of halfspaces under Gaussians (and other natural distributions), and for both semi-random [ABHU15, DKTZ20a, ZSA20, DKTZ20b, DKK+20, DKK+21b, DKK+22] and adversarial noise [KKMS08, KOS08, KLS09, ABL17, DKS18, DKTZ20c, DKK+21a, DKTZ22]. An arguably surprising conceptual implication of our results is that the complexity of reliable learning for Gaussian halfspaces is qualitatively closer to the semi-random case.

1.3 Preliminaries

We use lowercase boldface letters for vectors and capital boldface letters for matrices and tensors. We use $\langle\mathbf{x},\mathbf{y}\rangle$ for the inner product between $\mathbf{x},\mathbf{y}\in\mathbb{R}^{d}$. For $\mathbf{x},\mathbf{s}\in\mathbb{R}^{d}$, we use $\mathrm{proj}_{\mathbf{s}}(\mathbf{x})\stackrel{\mathrm{def}}{=}\frac{\langle\mathbf{x},\mathbf{s}\rangle\mathbf{s}}{\|\mathbf{s}\|_{2}^{2}}$ for the projection of $\mathbf{x}$ on the direction of $\mathbf{s}$. Similarly, we use $\mathrm{proj}_{\perp\mathbf{s}}(\mathbf{x})=\mathbf{x}-\mathrm{proj}_{\mathbf{s}}(\mathbf{x})$ for the projection of $\mathbf{x}$ on the orthogonal complement of $\mathbf{s}$. Additionally, let $\mathbf{x}^{V}\in\mathbb{R}^{\dim(V)}$ be the projection of $\mathbf{x}$ on the subspace $V$, reparameterized on $\mathbb{R}^{\dim(V)}$. More precisely, let $\mathbf{B}_{V}\in\mathbb{R}^{d\times\dim(V)}$ be the matrix whose columns form an (arbitrary) orthonormal basis for the subspace $V$, and let $\mathbf{x}^{V}\stackrel{\mathrm{def}}{=}(\mathbf{B}_{V})^{\intercal}\mathbf{x}$. For $p\geq 1$ and $\mathbf{x}\in\mathbb{R}^{d}$, we use $\|\mathbf{x}\|_{p}\stackrel{\mathrm{def}}{=}\left(\sum_{i=1}^{d}|\mathbf{x}_{i}|^{p}\right)^{1/p}$ to denote the $\ell_{p}$-norm of $\mathbf{x}$. For a matrix or tensor $\mathbf{T}$, we denote by $\|\mathbf{T}\|_{F}$ its Frobenius norm.
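The projection notation can be made concrete; a small numpy sketch (helper names are ours, for illustration only):

```python
import numpy as np

def proj(s, x):
    """proj_s(x) = <x, s> s / ||s||_2^2, the projection of x on direction s."""
    s = np.asarray(s, dtype=float)
    x = np.asarray(x, dtype=float)
    return (x @ s) * s / (s @ s)

def proj_perp(s, x):
    """proj_{perp s}(x) = x - proj_s(x), projection on the complement of s."""
    return np.asarray(x, dtype=float) - proj(s, x)

def to_subspace_coords(B_V, x):
    """x^V = B_V^T x, where the columns of B_V are an orthonormal basis of V."""
    return np.asarray(B_V, dtype=float).T @ np.asarray(x, dtype=float)

s = np.array([2.0, 0.0])
x = np.array([3.0, 4.0])
print(proj(s, x))       # → [3. 0.]
print(proj_perp(s, x))  # → [0. 4.]
```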

We use $\mathcal{N}_{d}$ to denote the standard normal distribution $\mathcal{N}(\mathbf{0},\mathbf{I})$, where $\mathbf{0}$ is the $d$-dimensional zero vector and $\mathbf{I}$ is the $d\times d$ identity matrix. We use $\mathbf{x}\sim D$ to denote a random variable with distribution $D$. For a random variable $\mathbf{x}$ (resp. a distribution $D$), we use $P_{\mathbf{x}}$ (resp. $P_{D}$) to denote the probability density function or probability mass function of $\mathbf{x}$ (resp. $D$). We also use $\Phi:\mathbb{R}\to[0,1]$ to denote the cdf of $\mathcal{N}_{1}$, and $\mathbf{1}$ to denote the indicator function.

For a Boolean function $h:\mathbb{R}^{d}\to\{\pm 1\}$ and a distribution $D$ supported on $\mathbb{R}^{d}\times\{\pm 1\}$, we use $R_{+}(h;D)\stackrel{\mathrm{def}}{=}\mathbf{Pr}_{(\mathbf{x},y)\sim D}[h(\mathbf{x})=1\land y\neq 1]$ (resp. $R_{-}(h;D)\stackrel{\mathrm{def}}{=}\mathbf{Pr}_{(\mathbf{x},y)\sim D}[h(\mathbf{x})=-1\land y\neq -1]$) to denote the false positive (resp. false negative) 0-1 error.
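The empirical analogues of $R_{+}$ and $R_{-}$ on a finite sample are straightforward; a sketch (function names are ours):

```python
import numpy as np

def false_positive_error(h, X, y):
    """Empirical R_+(h; D): fraction of samples with h(x) = 1 and y != 1."""
    preds = np.array([h(x) for x in X])
    return float(np.mean((preds == 1) & (np.asarray(y) != 1)))

def false_negative_error(h, X, y):
    """Empirical R_-(h; D): fraction of samples with h(x) = -1 and y != -1."""
    preds = np.array([h(x) for x in X])
    return float(np.mean((preds == -1) & (np.asarray(y) != -1)))

h = lambda x: 1 if x[0] >= 0.0 else -1
X = [[1.0], [-1.0], [1.0], [-1.0]]
y = [1, -1, -1, 1]
print(false_positive_error(h, X, y))  # → 0.25
print(false_negative_error(h, X, y))  # → 0.25
```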

2 Algorithm for Reliably Learning Gaussian Halfspaces

In this section, we describe and analyze our algorithm establishing Theorem 1.3. Due to space limitations, some proofs are deferred to Appendix B. For convenience, we assume that $\alpha$ (the bias of the optimal halfspace) is known to the algorithm and that the excess error satisfies $\epsilon\leq\alpha/3$. These assumptions are without loss of generality for the following reasons. First, one can efficiently reduce the unknown-$\alpha$ case to the known-$\alpha$ case by guessing the value of $\alpha$. Second, if $\epsilon$ is large, there is a straightforward algorithm for the problem (simply output the best constant hypothesis).

Notation: We use $\mathcal{H}_{d}^{\alpha}$ for the set of all LTFs on $\mathbb{R}^{d}$ whose bias is equal to $\alpha$. Given the above assumptions, it suffices to give a reliable learning algorithm for $\mathcal{H}_{d}^{\alpha}$ instead of $\mathcal{H}_{d}$.

The high-level idea of the algorithm is as follows. Without loss of generality, we assume that there exists an $\alpha$-biased halfspace that correctly classifies all the points with label $y=-1$—since otherwise the algorithm can just return the hypothesis $h(\mathbf{x})\equiv -1$.

Let $f(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w}^{*},\mathbf{x}\rangle-t^{*})$ be the optimal halfspace and let $\mathbf{w}$ be our current guess of $\mathbf{w}^{*}$. Assuming that $\mathbf{w}^{*}$ is not sufficiently close to $\mathbf{w}$ and that the hypothesis classifying all points as negative is not optimal, we show that there exists a low-degree polynomial of the form $p(\langle\mathbf{w},\mathbf{x}\rangle)$ with correlation at least $2^{-O({t^{*}}^{2})}\epsilon$ with the negative labels. By leveraging this structural result, we use a spectral algorithm to find a direction $\mathbf{v}$ that is non-trivially correlated with $\mathrm{proj}_{\perp\mathbf{w}}(\mathbf{w}^{*})$ with at least some constant probability.

Unfortunately, it is not easy to verify whether $\langle\mathbf{v},\mathrm{proj}_{\perp\mathbf{w}}(\mathbf{w}^{*})\rangle$ is non-trivial. However, conditioned on the algorithm always obtaining a $\mathbf{v}$ with good correlation, we show that it takes at most $\log(1/\epsilon)^{O({t^{*}}^{2})}$ steps to get sufficiently close to $\mathbf{w}^{*}$. Therefore, repeating this process $2^{\log(1/\epsilon)^{O({t^{*}}^{2})}}$ many times will eventually find an accurate approximation to $\mathbf{w}^{*}$.

The structure of this section is as follows: In Section 2.1, we give our algorithm for finding a good direction. Section 2.2 describes and analyzes our random walk procedure.

2.1 Finding a Non-Trivial Direction

Here we present an algorithm that finds a direction correlating non-trivially with the unknown optimal vector. We first show that there exists a zero-mean polynomial of degree $O(\log(1/\alpha))$ that sign-matches the optimal hypothesis. Furthermore, using the fact that the optimal hypothesis always correctly guesses the sign of the negative points, this gives us that the polynomial correlates with the negative (clean) points. Using this intuition, our algorithm estimates the first $O(\log(1/\alpha))$ moments of the distribution $D_{\mathbf{x}}$ conditioned on $y=-1$. This guarantees that at least one moment correlates with the optimal hypothesis, as a linear combination of the moments generates the sign-matching polynomial. Then, by taking a random vector that lies in the high-influence directions (which form a low-dimensional subspace), we guarantee that, with constant probability, this vector correlates well with the unknown optimal vector. The main result of this section is the following.

Proposition 2.1.

Let $D$ be a joint distribution of $(\mathbf{x},y)$ supported on $\mathbb{R}^{d}\times\{\pm 1\}$ with marginal $D_{\mathbf{x}}=\mathcal{N}_{d}$, and let $\epsilon\in(0,1)$. Suppose $D$ satisfies the reliability condition with respect to $f(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w}^{*},\mathbf{x}\rangle-t^{*})$ with $t^{*}=O\left(\sqrt{\log(1/\epsilon)}\right)$ and $\mathbf{Pr}_{(\mathbf{x},y)\sim D}[y=-1]\geq\epsilon$. Then there is an algorithm that draws $N=d^{O({t^{*}}^{2})}/\epsilon^{2}$ samples, runs in $\mathrm{poly}(N)$ time, and with probability at least $\Omega(1)$ returns a unit vector $\mathbf{v}$ such that $\langle\mathbf{v},\mathbf{w}^{*}\rangle\geq\max\left(\log(1/\epsilon)^{-O({t^{*}}^{2})},\epsilon^{O(1)}\right)$.

We start by showing that for any distribution satisfying the reliability condition with respect to some $f(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w}^{*},\mathbf{x}\rangle-t^{*})$, there exists a degree-$O({t^{*}}^{2})$ zero-mean polynomial of the form $p(\langle\mathbf{w}^{*},\mathbf{x}\rangle)$ that has correlation at least $2^{-O({t^{*}}^{2})}\,\mathbf{Pr}_{(\mathbf{x},y)\sim D}[y=-1]$ with the labels.

Lemma 2.2 (Correlation with an Orthonormal Polynomial).

Let $D$ be a joint distribution of $(\mathbf{x},y)$ supported on $\mathbb{R}^{d}\times\{\pm 1\}$ with marginal $D_{\mathbf{x}}=\mathcal{N}_{d}$. Suppose $D$ satisfies the reliability condition with respect to $f(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w}^{*},\mathbf{x}\rangle-t^{*})$. Then there exists a polynomial $p:\mathbb{R}\to\mathbb{R}$ of degree at most $k=O({t^{*}}^{2}+1)$ such that $\mathbf{E}_{z\sim\mathcal{N}_{1}}[p(z)]=0$, $\mathbf{E}_{z\sim\mathcal{N}_{1}}[p^{2}(z)]=1$, and $\mathbf{E}_{(\mathbf{x},y)\sim D}[y\,p(\langle\mathbf{w}^{*},\mathbf{x}\rangle)]=2^{-O({t^{*}}^{2})}\,\mathbf{Pr}_{(\mathbf{x},y)\sim D}[y=-1]$.

Proof Sketch of Lemma 2.2.

Note that since $p$ is a zero-mean polynomial with respect to $D_{\mathbf{x}}$, we have that $\mathbf{E}_{(\mathbf{x},y)\sim D}[y\,p(\langle\mathbf{w}^{*},\mathbf{x}\rangle)]=-2\,\mathbf{E}_{(\mathbf{x},y)\sim D}[\mathbf{1}(y=-1)\,p(\langle\mathbf{w}^{*},\mathbf{x}\rangle)]$. Then, using the fact that the distribution $D$ satisfies the reliability condition, the statement boils down to showing that for any $\epsilon$-mass inside the interval $[t^{*},\infty)$, the expectation of $p$ on this $\epsilon$-mass is at least $2^{-O({t^{*}}^{2})}$. To prove this statement, it suffices to construct a polynomial $p$ that is non-negative on $[t^{*},\infty)$ and such that the $\epsilon/2$-tail of $p$ in that interval is at least $2^{-O({t^{*}}^{2})}$. To achieve this, we show that the sign-matching polynomial used in [DKK+22] meets our purpose. ∎
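The first identity in the sketch follows by expanding $y=1-2\cdot\mathbf{1}(y=-1)$ and using that $p$ has zero mean under the Gaussian marginal:

$$\mathbf{E}_{(\mathbf{x},y)\sim D}[y\,p(\langle\mathbf{w}^{*},\mathbf{x}\rangle)]=\mathbf{E}[p(\langle\mathbf{w}^{*},\mathbf{x}\rangle)]-2\,\mathbf{E}[\mathbf{1}(y=-1)\,p(\langle\mathbf{w}^{*},\mathbf{x}\rangle)]=-2\,\mathbf{E}_{(\mathbf{x},y)\sim D}[\mathbf{1}(y=-1)\,p(\langle\mathbf{w}^{*},\mathbf{x}\rangle)]\;,$$

since $D_{\mathbf{x}}=\mathcal{N}_{d}$ and $\mathbf{E}_{z\sim\mathcal{N}_{1}}[p(z)]=0$.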

Our algorithm uses the following normalized Hermite tensor.

Definition 2.3 (Hermite Tensor).

For $k\in\mathbb{N}$ and $\mathbf{x}\in\mathbb{R}^{d}$, we define the $k$-th Hermite tensor as

$$(\mathbf{H}_{k}(\mathbf{x}))_{i_{1},i_{2},\ldots,i_{k}}=\frac{1}{\sqrt{k!}}\sum_{\substack{\text{partitions }P\text{ of }[k]\\ \text{into sets of size 1 and 2}}}\;\bigotimes_{\{a,b\}\in P}(-\mathbf{I}_{i_{a},i_{b}})\bigotimes_{\{c\}\in P}\mathbf{x}_{i_{c}}\;.$$
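As an illustration, for the first two orders the definition specializes to $\mathbf{H}_{1}(\mathbf{x})=\mathbf{x}$ and $\mathbf{H}_{2}(\mathbf{x})=(\mathbf{x}\mathbf{x}^{\intercal}-\mathbf{I})/\sqrt{2}$. A small sketch covering only these low orders (the general-$k$ case is omitted):

```python
import math
import numpy as np

def hermite_tensor(k, x):
    """Normalized Hermite tensor H_k(x) of Definition 2.3, for k <= 2 only.

    k = 1: the single partition {{1}} gives H_1(x) = x.
    k = 2: partitions {{1},{2}} and {{1,2}} give
           H_2(x)_{ij} = (x_i x_j - I_{ij}) / sqrt(2!).
    """
    x = np.asarray(x, dtype=float)
    if k == 0:
        return np.array(1.0)
    if k == 1:
        return x.copy()
    if k == 2:
        return (np.outer(x, x) - np.eye(x.shape[0])) / math.sqrt(2.0)
    raise NotImplementedError("this sketch covers k <= 2 only")
```

For $d=1$ this recovers the normalized probabilists' Hermite polynomials, e.g. $\mathbf{H}_{2}(x)=(x^{2}-1)/\sqrt{2}$, which has unit variance under $\mathcal{N}_{1}$, consistent with the normalization of $p$ above.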

Given that such a polynomial exists, we show that if we take a flattened version of the Chow-parameter tensors (which turns them into matrices) and look at the space spanned by their top singular vectors, then a non-trivial fraction of $\mathbf{w}^{*}$ must lie inside this subspace. We prove the following lemma, which is similar to Lemma 5.10 of [DKK+22]. See Appendix B for the proof.

Lemma 2.4.

Let $D$ be the joint distribution of $(\mathbf{x},y)$ supported on $\mathbb{R}^{d}\times\{\pm 1\}$ with marginal $D_{\mathbf{x}}=\mathcal{N}_{d}$. Let $p:\mathbb{R}\to\mathbb{R}$ be a univariate, mean-zero, unit-variance polynomial of degree $k$ such that for some unit vector $\mathbf{v}^{*}\in\mathbb{R}^{d}$ it holds that $\mathbf{E}_{(\mathbf{x},y)\sim D}[\mathbf{1}(y=-1)\,p(\langle\mathbf{v}^{*},\mathbf{x}\rangle)]\geq\tau$ for some $\tau\in(0,1]$. Let $\mathbf{T}'^{m}$ be an approximation of the order-$m$ Chow-parameter tensor $\mathbf{T}^{m}=\mathbf{E}_{(\mathbf{x},y)\sim D}[\mathbf{1}(y=-1)\mathbf{H}_{m}(\mathbf{x})]$ such that $\|\mathbf{T}'^{m}-\mathbf{T}^{m}\|_{F}\leq\tau/(4\sqrt{k})$. Denote by $V_{m}$ the subspace spanned by the left singular vectors of the flattened $\mathbf{T}'^{m}$ whose singular values are greater than $\tau/(4\sqrt{k})$. Moreover, denote by $V$ the union of $V_{1},\ldots,V_{k}$. Then, for $\gamma=\mathbf{Pr}_{(\mathbf{x},y)\sim D}[y=-1]$, we have that

1. $\dim(V)=O(\gamma^{2}\log(1/\gamma)^{k}k^{2}/\tau^{2}+1)$, and

2. $\|\mathrm{proj}_{V}(\mathbf{v}^{*})\|_{2}=\Omega\left(\tau/\left(\sqrt{k}\,\gamma\log(1/\gamma)^{k/2}\right)\right)$.

By Lemma 2.4, taking a random unit vector $\mathbf{v}$ in $V$ gives, with constant probability, $\langle\mathbf{v},\mathbf{v}^{*}\rangle\geq\|\mathrm{proj}_{V}(\mathbf{v}^{*})\|_{2}/\sqrt{\dim(V)}$. We are now ready to prove Proposition 2.1. For the algorithm pseudocode, see Appendix B.
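The subspace extraction and the random choice of $\mathbf{v}$ can be sketched as follows (a toy illustration with numpy; the thresholding mirrors Lemma 2.4, but the flattened tensor here is a stand-in, not the paper's estimator):

```python
import numpy as np

def high_influence_subspace(T_flat, threshold):
    """Orthonormal basis (as columns) of the span of the left singular
    vectors of the flattened tensor whose singular values exceed threshold."""
    U, s, _ = np.linalg.svd(T_flat, full_matrices=False)
    return U[:, s > threshold]

def random_unit_vector_in(B, rng):
    """Uniformly random unit vector in the column span of B (B orthonormal)."""
    g = B @ rng.standard_normal(B.shape[1])
    return g / np.linalg.norm(g)

rng = np.random.default_rng(0)
T = np.diag([1.0, 0.01, 0.0])          # toy flattened tensor, dominant e_1
B = high_influence_subspace(T, threshold=0.1)
print(B.shape[1])                       # → 1 (only the top direction survives)
v = random_unit_vector_in(B, rng)       # here v = ±e_1
```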

Proof Sketch of Proposition 2.1.

The idea is that the empirical estimate $\mathbf{T}'^{m}$ obtained using $d^{O(\max\{{t^{*}}^{2},1\})}\log(1/\delta)/\epsilon^{2}$ many samples satisfies $\|\mathbf{T}'^{m}-\mathbf{T}^{m}\|_{F}\leq\tau/(4\sqrt{k})$ with high probability. Then, by Lemma 2.2 and Lemma 2.4, if we take $\mathbf{v}$ to be a random unit vector in $V$, with constant probability we have $\langle\mathbf{v},\mathbf{w}^{*}\rangle=\log(1/\epsilon)^{-O({t^{*}}^{2})}$. For $\langle\mathbf{v},\mathbf{w}^{*}\rangle=\epsilon^{O(1)}$, the proof relies on a different version of Lemma 2.4. ∎

2.2 Random Walk to Update Current Guess

We now describe how we use Proposition 2.1 to construct an algorithm for our learning problem. Let 𝐰\mathbf{w} be the current guess for 𝐰\mathbf{w}^{*}. For convenience, we assume that 𝐰,𝐰=Ω(1/d)\langle\mathbf{w},\mathbf{w}^{*}\rangle=\Omega(1/\sqrt{d}). Let DD^{\prime} be the distribution of 𝐱𝐰\mathbf{x}^{\perp\mathbf{w}} conditioned on 𝐱B\mathbf{x}\in B, where B={𝐱d:𝐰,𝐱t0}B=\{\mathbf{x}\in\mathbb{R}^{d}:\langle\mathbf{w},\mathbf{x}\rangle-t^{*}\geq 0\}. We next show that the 𝐱\mathbf{x}-marginals of DD^{\prime} are standard Gaussian and that DD^{\prime} satisfies the reliability condition with respect to the halfspace h(𝐱)=sign(𝐰,𝐱t)h(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w}^{\prime},\mathbf{x}\rangle-t^{\prime}) with 𝐰=𝐰𝐰\mathbf{w}^{\prime}={\mathbf{w}^{*}}^{\perp\mathbf{w}} and |t||t||t^{\prime}|\leq|t^{*}|.

Lemma 2.5.

Let DD be the joint distribution of (𝐱,y)(\mathbf{x},y) supported on d×{±1}\mathbb{R}^{d}\times\{\pm 1\} with marginal D𝐱=𝒩dD_{\mathbf{x}}=\mathcal{N}_{d} and ϵ(0,1)\epsilon\in(0,1). Suppose DD satisfies the reliability condition with respect to f(𝐱)=sign(𝐰,𝐱t)f(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w}^{*},\mathbf{x}\rangle-t^{*}). Suppose that h(𝐱)=sign(𝐰,𝐱t)h(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w},\mathbf{x}\rangle-t) with 𝐰,𝐰>0\langle\mathbf{w},\mathbf{w}^{*}\rangle>0 and tt[0,ϵ/100]t-t^{*}\in[0,\epsilon/100] does not satisfy R+(h;D)ϵ/2R_{+}(h;D)\leq\epsilon/2. Let B={𝐱d:𝐰,𝐱t0}B=\{\mathbf{x}\in\mathbb{R}^{d}:\langle\mathbf{w},\mathbf{x}\rangle-t\geq 0\} and DD^{\prime} be the distribution of (𝐱,y)=(𝐱𝐰,y)(\mathbf{x}^{\prime},y)=(\mathbf{x}^{\perp\mathbf{w}},y) given 𝐱B\mathbf{x}\in B. Then:

  1. DD^{\prime} has marginal distribution D𝐱=𝒩d1D_{\mathbf{x}^{\prime}}=\mathcal{N}_{d-1},

  2. DD^{\prime} satisfies the reliability condition with respect to h(𝐱)=sign(𝐰,𝐱t)h^{\prime}(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w}^{\prime},\mathbf{x}\rangle-t^{\prime}), where 𝐰=𝐰𝐰/𝐰𝐰2\mathbf{w}^{\prime}={\mathbf{w}^{*}}^{\perp\mathbf{w}}/\|{\mathbf{w}^{*}}^{\perp\mathbf{w}}\|_{2} and |t||t||t^{\prime}|\leq|t^{*}|, and

  3. 𝐏𝐫(𝐱,y)D[y=1]ϵ/2\operatorname*{\mathbf{Pr}}_{(\mathbf{x}^{\prime},y)\sim D^{\prime}}[y=-1]\geq\epsilon/2.

Proof Sketch of Lemma 2.5.

Item 1 follows by definition. For Item 2, we consider two cases: t>0t^{*}>0; and t0t^{*}\leq 0. For the case t0t^{*}\leq 0, we prove that the distribution DD^{\prime} satisfies the reliability condition with respect to h(𝐱)=sign(𝐰,𝐱)h^{\prime}(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w}^{\prime},\mathbf{x}\rangle), where 𝐰=𝐰𝐰/𝐰𝐰2\mathbf{w}^{\prime}={\mathbf{w}^{*}}^{\perp\mathbf{w}}/\|{\mathbf{w}^{*}}^{\perp\mathbf{w}}\|_{2}. For the case t>0t^{*}>0, we prove that DD^{\prime} satisfies the reliability condition with respect to h(𝐱)=sign(𝐰,𝐱t)h^{\prime}(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w}^{\prime},\mathbf{x}\rangle-t^{\prime}), where 𝐰=𝐰𝐰/𝐰𝐰2\mathbf{w}^{\prime}={\mathbf{w}^{*}}^{\perp\mathbf{w}}/\|{\mathbf{w}^{*}}^{\perp\mathbf{w}}\|_{2} and t=tt^{\prime}=t^{*}. Then Item 3 follows from the fact that hh does not satisfy R+(h;D)ϵ/2R_{+}(h;D)\leq\epsilon/2. ∎

By Lemma 2.5, given any current guess 𝐰\mathbf{w} such that h(𝐱)=sign(𝐰,𝐱t)h(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w},\mathbf{x}\rangle-t^{*}) does not satisfy R+(h;D)ϵ/2R_{+}(h;D)\leq\epsilon/2, the corresponding distribution DD^{\prime} satisfies the reliability condition with respect to hh^{\prime} and 𝐏𝐫(𝐱,y)D[y=1]ϵ/2\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D^{\prime}}[y=-1]\geq\epsilon/2. Therefore, DD^{\prime} satisfies the assumptions of Proposition 2.1. So, applying the algorithm of Proposition 2.1, we obtain with constant probability a unit vector 𝐯\mathbf{v} such that 𝐯,𝐰=0\langle\mathbf{v},\mathbf{w}\rangle=0 and 𝐯,𝐰=log(1/ϵ)O(t2)\langle\mathbf{v},\mathbf{w}^{*}\rangle=\log(1/\epsilon)^{-O({t^{*}}^{2})}.

The following fact shows that by updating our current guess in the direction of 𝐯\mathbf{v} with appropriate step size, we can get an updated guess with increased correlation with 𝐰\mathbf{w}^{*}.

Fact 2.6 (Correlation Improvement, Lemma 5.13 in [DKK+{}^{+}22]).

Fix unit vectors 𝐯,𝐯d\mathbf{v}^{*},\mathbf{v}\in\mathbb{R}^{d}. Let 𝐮d\mathbf{u}\in\mathbb{R}^{d} such that 𝐮,𝐯c\langle\mathbf{u},\mathbf{v}^{*}\rangle\geq c, 𝐮,𝐯=0\langle\mathbf{u},\mathbf{v}\rangle=0 and 𝐮21\|\mathbf{u}\|_{2}\leq 1 with c>0c>0. Then, for 𝐯=𝐯+λ𝐮𝐯+λ𝐮2\mathbf{v}^{\prime}=\frac{\mathbf{v}+\lambda\mathbf{u}}{\|\mathbf{v}+\lambda\mathbf{u}\|_{2}}, with λ=c/2\lambda=c/2, we have that 𝐯,𝐯𝐯,𝐯+λ2/2\langle\mathbf{v}^{\prime},\mathbf{v}^{*}\rangle\geq\langle\mathbf{v},\mathbf{v}^{*}\rangle+\lambda^{2}/2.

Notice that because proj𝐰(𝐰)2\|\mathrm{proj}_{\perp\mathbf{w}}(\mathbf{w}^{*})\|_{2} is unknown, we cannot always choose the optimal step size λ\lambda. Instead, we will use the same λ\lambda to do sufficiently many update steps such that after that many updates, we are certain that proj𝐰(𝐰)23λ\|\mathrm{proj}_{\perp\mathbf{w}}(\mathbf{w}^{*})\|_{2}\leq 3\lambda. We then take the new step size λupdate=λ/2\lambda_{\text{update}}=\lambda/2 and repeat this process, until 𝐰\mathbf{w} and 𝐰\mathbf{w}^{*} are sufficiently close to each other.
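The guarantee of Fact 2.6 can be sanity-checked numerically; the following sketch (with an arbitrary choice of 𝐯\mathbf{v}, 𝐯\mathbf{v}^{*}, and 𝐮\mathbf{u} satisfying the hypotheses, made up for the example) verifies the λ2/2\lambda^{2}/2 improvement:

```python
# Illustration (not the paper's code): the update of Fact 2.6 in R^3.
# With <u, v*> >= c, <u, v> = 0, ||u||_2 <= 1 and step lambda = c/2,
# the normalized update gains at least lambda^2 / 2 in correlation.
import math

def update(v, u, lam):
    w = [vi + lam * ui for vi, ui in zip(v, u)]
    n = math.sqrt(sum(x * x for x in w))
    return [x / n for x in w]

theta = 1.0                                       # angle between v and v*
v = [1.0, 0.0, 0.0]                               # current guess (unit)
v_star = [math.cos(theta), math.sin(theta), 0.0]  # target (unit)
u = [0.0, 1.0, 0.0]                               # unit vector with <u, v> = 0
c = math.sin(theta)                               # <u, v*> = sin(theta) = c > 0
lam = c / 2

v_new = update(v, u, lam)
old = sum(a * b for a, b in zip(v, v_star))       # cos(theta)
new = sum(a * b for a, b in zip(v_new, v_star))
assert new >= old + lam ** 2 / 2                  # Fact 2.6 guarantee
```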

We are now ready to describe our algorithm (Algorithm 1) and prove its correctness.

Notice that given that we know the bias α\alpha, tt must be either Φ1(α)-\Phi^{-1}(\alpha) or Φ1(α)\Phi^{-1}(\alpha). For convenience, we assume that t=Φ1(α)t=\Phi^{-1}(\alpha). To account for the case that t=Φ1(α)t=-\Phi^{-1}(\alpha), we can simply run Algorithm 1 twice and pick the output halfspace with the smallest tt value (or even run a different efficient algorithm, since t0t\leq 0 as explained in Appendix B). For convenience, we also assume min(2log(1/ϵ)O(log(1/α)),2poly(1/ϵ))=2log(1/ϵ)O(log(1/α))\min\left(2^{\log(1/\epsilon)^{O(\log(1/\alpha))}},2^{\mathrm{poly}(1/\epsilon)}\right)=2^{\log(1/\epsilon)^{O(\log(1/\alpha))}}. To account for the other case, we can simply initialize the step size ζ=ϵc\zeta=\epsilon^{c} for a sufficiently large constant cc instead of ζ=log(1/ϵ)ct2\zeta=\log(1/\epsilon)^{-c{t^{*}}^{2}}.

A more detailed version of Algorithm 1 is deferred to Appendix B.

Input: ϵ(0,1)\epsilon\in(0,1), α(0,1/2)\alpha\in(0,1/2) and sample access to a joint distribution DD of (𝐱,y)(\mathbf{x},y) supported on d×{±1}\mathbb{R}^{d}\times\{\pm 1\} with 𝐱\mathbf{x}-marginal D𝐱=𝒩dD_{\mathbf{x}}=\mathcal{N}_{d}.
Output: h(𝐱)=sign(𝐰,𝐱t)h(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w},\mathbf{x}\rangle-t) that is ϵ\epsilon-reliable with respect to the class dα\mathcal{H}_{d}^{\alpha}.
  1. Check if 𝐏𝐫(𝐱,y)D[y=1]ϵ/2\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[y=-1]\leq\epsilon/2 (with sufficiently small constant failure probability). If so, return the +1+1 constant hypothesis. Set the initial step size ζ=log(1/ϵ)ct2\zeta=\log(1/\epsilon)^{-c{t^{*}}^{2}}, where cc is a sufficiently large universal constant and t=Φ1(α)t^{*}=\Phi^{-1}(\alpha).

  2. Initialize 𝐰\mathbf{w} to be a random unit vector in d\mathbb{R}^{d}. Let the update step size λ=ζ\lambda=\zeta and repeat the following process until λϵ/100\lambda\leq\epsilon/100.

    (a) Use samples from DD to check if the hypothesis h(𝐱)=sign(𝐰,𝐱t)h(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w},\mathbf{x}\rangle-t) satisfies R+(h;D)ϵ/2R_{+}(h;D)\leq\epsilon/2. If so, go to Step (3).

    (b) With probability 1/21/2, let 𝐰=𝐰\mathbf{w}=-\mathbf{w}. Let B={𝐱d:𝐰,𝐱t0}B=\{\mathbf{x}\in\mathbb{R}^{d}:\langle\mathbf{w},\mathbf{x}\rangle-t\geq 0\}, and let DD^{\prime} be the distribution of (𝐱𝐰,y)(\mathbf{x}^{\perp\mathbf{w}},y) for (𝐱,y)D(\mathbf{x},y)\sim D given 𝐱B\mathbf{x}\in B. Use the algorithm of Proposition 2.1 on DD^{\prime} to find a unit vector 𝐯\mathbf{v} such that 𝐯,𝐰=0\langle\mathbf{v},\mathbf{w}\rangle=0 and 𝐯,proj𝐰(𝐰)proj𝐰(𝐰)2ζ\left\langle\mathbf{v},\frac{\mathrm{proj}_{\perp\mathbf{w}}(\mathbf{w}^{*})}{\|\mathrm{proj}_{\perp\mathbf{w}}(\mathbf{w}^{*})\|_{2}}\right\rangle\geq\zeta. Then update 𝐰\mathbf{w} as follows: 𝐰update=𝐰+λ𝐯𝐰+λ𝐯2.\mathbf{w}_{\mathrm{update}}=\frac{\mathbf{w}+\lambda\mathbf{v}}{\|\mathbf{w}+\lambda\mathbf{v}\|_{2}}\;.

    (c) Repeat Steps (2a) and (2b) c/ζ2c/\zeta^{2} times, where cc is a sufficiently large universal constant, with the same step size λ\lambda. After that, update the new step size as λupdate=λ/2\lambda_{\mathrm{update}}=\lambda/2.

  3. Check if h(𝐱)=sign(𝐰,𝐱t)h(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w},\mathbf{x}\rangle-t) satisfies R+(h;D)ϵ/2R_{+}(h;D)\leq\epsilon/2. If so, return hh and terminate. Repeat Step (2) 21/ζc2^{1/\zeta^{c}} many times, where cc is a sufficiently large constant.

Algorithm 1 Reliably Learning General Halfspaces with Gaussian Marginals.
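The outer loop of Algorithm 1 can be sketched schematically as follows. This is an idealized illustration, not the actual algorithm: the subroutine of Proposition 2.1 is replaced by a hypothetical oracle that returns the exact direction proj𝐰(𝐰)/proj𝐰(𝐰)2\mathrm{proj}_{\perp\mathbf{w}}(\mathbf{w}^{*})/\|\mathrm{proj}_{\perp\mathbf{w}}(\mathbf{w}^{*})\|_{2}, and all statistical checks are omitted.

```python
# Schematic sketch of Algorithm 1's update schedule (illustration only):
# the call to the subroutine of Proposition 2.1 is replaced by an
# idealized oracle that knows w_star, so only the geometry of the
# step-size halving schedule is exercised here.
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def oracle_direction(w, w_star):
    # hypothetical stand-in for Proposition 2.1: the normalized
    # component of w_star orthogonal to the current guess w
    a = sum(x * y for x, y in zip(w, w_star))
    u = [ws - a * wi for ws, wi in zip(w_star, w)]
    n = math.sqrt(sum(x * x for x in u))
    return None if n < 1e-12 else [x / n for x in u]

eps = 1e-2
w_star = normalize([3.0, 4.0])       # unknown target direction
w = [0.0, 1.0]                       # initial guess
lam = 0.5                            # initial step size (arbitrary for the demo)
while lam > eps / 100:               # halving schedule, as in Step (2c)
    for _ in range(32):              # fixed number of updates per step size
        v = oracle_direction(w, w_star)
        if v is None:
            break                    # already aligned with w_star
        w = normalize([wi + lam * vi for wi, vi in zip(w, v)])
    lam /= 2

corr = sum(x * y for x, y in zip(w, w_star))
assert corr > 1 - 1e-6               # w has essentially aligned with w_star
```

Each update rotates 𝐰\mathbf{w} toward 𝐰\mathbf{w}^{*} by an angle of roughly arctan(λ)\arctan(\lambda), so after the inner loop the residual angle is at most about λ\lambda, matching the invariant that the orthogonal component is O(λ)O(\lambda) before λ\lambda is halved.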
Proof Sketch of the algorithmic part of Theorem 1.3.

Let f(𝐱)=sign(𝐰,𝐱t)f(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w}^{*},\mathbf{x}\rangle-t^{*}) be the optimal halfspace with bias α\alpha. We need to show that with high probability Algorithm 1 returns a hypothesis h(𝐱)=sign(𝐰,𝐱t)h(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w},\mathbf{x}\rangle-t) such that R+(h;D)ϵR_{+}(h;D)\leq\epsilon and R(h;D)R(f;D)+ϵR_{-}(h;D)\leq R_{-}(f;D)+\epsilon.

To do so, it suffices to show that R+(h;D)ϵR_{+}(h;D)\leq\epsilon; given R+(h;D)ϵR_{+}(h;D)\leq\epsilon, the bound R(h;D)R(f;D)+ϵR_{-}(h;D)\leq R_{-}(f;D)+\epsilon follows from our choice of tt. For convenience, we can assume hh never satisfies R+(h;D)ϵ/2R_{+}(h;D)\leq\epsilon/2 in Step (2a) (otherwise, we are done). We can also assume that the subroutine in Proposition 2.1 always succeeds, since the algorithm repeats Step (2) sufficiently many times. Given the above conditions, using Fact 2.6, one can show that after each run of c/ζ2c/\zeta^{2} many updates in Step (2b), we must have proj𝐰𝐰23λ\|\mathrm{proj}_{\perp\mathbf{w}}\mathbf{w}^{*}\|_{2}\leq 3\lambda. Therefore, when λϵ/100\lambda\leq\epsilon/100, we have proj𝐰𝐰23ϵ/100\|\mathrm{proj}_{\perp\mathbf{w}}\mathbf{w}^{*}\|_{2}\leq 3\epsilon/100, which implies R+(h;D)ϵ/2R_{+}(h;D)\leq\epsilon/2. ∎

3 Nearly Matching SQ Lower Bound

In this section, we establish the SQ hardness result of Theorem 1.3. Due to space limitations, some proofs have been deferred to Appendix C.

Proof Overview

To establish our SQ lower bound for reliable learning, we first prove an SQ lower bound for a natural decision version of reliably learning α\alpha-biased LTFs. We define the following decision problem over distributions.

Definition 3.1 (Decision Problem over Distributions).

Let DD be a fixed distribution and 𝒟\cal D be a distribution family. We denote by (𝒟,D)\mathcal{B}(\mathcal{D},D) the decision problem in which the input distribution DD^{\prime} is promised to satisfy either (a) D=DD^{\prime}=D or (b) D𝒟D^{\prime}\in\mathcal{D}, and the goal is to distinguish the two cases with high probability.

We show that given SQ access to a joint distribution DD of (𝐱,y)(\mathbf{x},y) supported on d×{±1}\mathbb{R}^{d}\times\{\pm 1\} with marginal D𝐱=𝒩(𝟎,𝐈)D_{\mathbf{x}}=\mathcal{N}({\bf 0},\mathbf{I}), it is hard to solve the problem (𝒟,D)\mathcal{B}(\mathcal{D},D) with the following distributions.

  (a) Null hypothesis: DD is the distribution so that y=1y=1 with probability 1/21/2 independent of 𝐱\mathbf{x}.

  (b) Alternative hypothesis: D𝒟D\in\mathcal{D}, where 𝒟\mathcal{D} is a family of distributions such that for any distribution D𝒟D\in\mathcal{D}, there exists an α\alpha-biased LTF ff such that R+(f;D)=0R_{+}(f;D)=0.

In order to construct such a family of distributions 𝒟\mathcal{D}, we start by constructing a joint distribution DD^{\prime} of (z,y)(z,y) over ×{±1}\mathbb{R}\times\{\pm 1\} such that the marginal distribution of zz is 𝒩1\mathcal{N}_{1} and the conditional distributions zy=1z\mid y=1 and zy=1z\mid y=-1 both match many moments with the standard Gaussian 𝒩1\mathcal{N}_{1}. Moreover, there is α\alpha probability mass on the positive side of the marginal distribution of zz that is purely associated with y=1y=1 (i.e., 𝐄(z,y)D[yzc]=1\operatorname*{\mathbf{E}}_{(z,y)\sim D^{\prime}}[y\mid z\geq c]=1 where c=Φ1(1α)c=\Phi^{-1}(1-\alpha)). We then embed this distribution DD^{\prime} along a hidden direction inside the joint distribution of (𝐱,y)(\mathbf{x},y) on d×{±1}\mathbb{R}^{d}\times\{\pm 1\} to construct a family of hard-to-distinguish distributions using the “hidden-direction” framework developed in [DKS17] and enhanced in [DKPZ21, DKRS23].

We can now proceed with the details of the proof. We start by defining the pairwise correlation between distributions.

Definition 3.2 (Pairwise Correlation).

The pairwise correlation of two distributions with pdfs D1,D2:d+D_{1},D_{2}:\mathbb{R}^{d}\mapsto\mathbb{R}_{+} with respect to a distribution with density D:d+D:\mathbb{R}^{d}\mapsto\mathbb{R}_{+}, where the support of DD contains the support of D1D_{1} and D2D_{2}, is defined as χD(D1,D2)=defdD1(𝐱)D2(𝐱)/D(𝐱)d𝐱1\chi_{D}(D_{1},D_{2})\stackrel{{\scriptstyle{\mathrm{\footnotesize def}}}}{{=}}\int_{\mathbb{R}^{d}}D_{1}(\mathbf{x})D_{2}(\mathbf{x})/D(\mathbf{x})d\mathbf{x}-1. Furthermore, the χ\chi-squared divergence of D1D_{1} to DD is defined as χ2(D1,D)=defχD(D1,D1)\chi^{2}(D_{1},D)\stackrel{{\scriptstyle{\mathrm{\footnotesize def}}}}{{=}}\chi_{D}(D_{1},D_{1}).
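As a quick numerical illustration of this definition (not taken from the paper), for D=𝒩(0,1)D=\mathcal{N}(0,1) and D1=𝒩(μ,1)D_{1}=\mathcal{N}(\mu,1) the χ\chi-squared divergence has the closed form χ2(D1,D)=eμ21\chi^{2}(D_{1},D)=e^{\mu^{2}}-1, which can be checked by direct integration:

```python
# Numerical illustration (not from the paper): for D = N(0,1) and
# D1 = N(mu, 1), the chi-squared divergence defined above satisfies
# chi^2(D1, D) = e^{mu^2} - 1.
import math

def pdf(x, mu):
    return math.exp(-(x - mu) ** 2 / 2) / math.sqrt(2 * math.pi)

mu, lo, hi, n = 0.7, -10.0, 10.0, 20000
step = (hi - lo) / n
# chi_D(D1, D1) = \int D1(x)^2 / D(x) dx - 1, approximated by a midpoint sum
integral = sum(
    pdf(lo + (i + 0.5) * step, mu) ** 2 / pdf(lo + (i + 0.5) * step, 0.0) * step
    for i in range(n)
)
chi_sq = integral - 1
assert abs(chi_sq - (math.exp(mu ** 2) - 1)) < 1e-4
```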

In particular, the framework in [DKS17] allows us to construct a family of 2dΩ(1)2^{d^{\Omega(1)}} distributions on d×{±1}\mathbb{R}^{d}\times\{\pm 1\} whose pairwise correlation is dΩ(n)d^{-\Omega(n)} where nn is the number of matching moments DD^{\prime} has with the standard Gaussian. Then, using standard SQ dimension techniques, this gives an SQ lower bound for the distinguishing problem. After that, we reduce the distinguishing problem to the problem of reliably learning α\alpha-biased LTFs under Gaussian marginals with additive error ϵ<α/3\epsilon<\alpha/3.

In order to construct such a distribution DD^{\prime} of (z,y)(z,y) supported on ×{±1}\mathbb{R}\times\{\pm 1\}, we reparameterize 𝐄(z,y)D[y|z]\operatorname*{\mathbf{E}}_{(z,y)\sim D^{\prime}}[y|z] as g(z)g(z). For a function g:g:\mathbb{R}\to\mathbb{R}, we use gp=𝐄t𝒩1[|g(t)|p]1/p\|g\|_{p}=\operatorname*{\mathbf{E}}_{t\sim\mathcal{N}_{1}}[|g(t)|^{p}]^{1/p} for its LpL_{p} norm. We let L1()L^{1}(\mathbb{R}) denote the set of all functions g:g:\mathbb{R}\to\mathbb{R} that have finite L1L_{1}-norm.

We use linear programming over one-dimensional functions in L1()L^{1}(\mathbb{R}) space to establish the existence of such a function. Specifically, we show the following:

Lemma 3.3.

For any sufficiently large nn\in\mathbb{N}, there exists a function g:[1,+1]g:\mathbb{R}\mapsto[-1,+1] such that gg satisfies the following properties:

  • (i) g(z)=1g(z)=1 for all zcz\geq c, where c=Φ1(132n/4)c=\Phi^{-1}(1-3^{-2n}/4), and

  • (ii) 𝐄z𝒩1[g(z)zk]=0\operatorname*{\mathbf{E}}_{z\sim\mathcal{N}_{1}}[g(z)z^{k}]=0 for all k[n]k\in[n].

Proof Sketch of Lemma 3.3.

We let PnP_{n} denote the set of all polynomials p:p:\mathbb{R}\to\mathbb{R} of degree at most nn and let L+1()L^{1}_{+}(\mathbb{R}) denote the set of all nonnegative functions in L1()L^{1}(\mathbb{R}). We formulate the following primal linear program:

find gL1()g\in L^{1}(\mathbb{R})
such that 𝐄z𝒩1[p(z)g(z)]=0\operatorname*{\mathbf{E}}_{z\sim\mathcal{N}_{1}}[p(z)g(z)]=0 for all pPnp\in P_{n},
𝐄z𝒩1[g(z)h(z)𝟙{zc}]h(z)𝟙{zc}1\operatorname*{\mathbf{E}}_{z\sim\mathcal{N}_{1}}[g(z)h(z)\mathds{1}\{z\geq c\}]\geq\|h(z)\mathds{1}\{z\geq c\}\|_{1} for all hL+1()h\in L_{+}^{1}(\mathbb{R}),
𝐄z𝒩1[g(z)H(z)]H1\operatorname*{\mathbf{E}}_{z\sim\mathcal{N}_{1}}[g(z)H(z)]\leq\|H\|_{1} for all HL1()H\in L^{1}(\mathbb{R}).

Then, using (infinite-dimensional) LP duality, we get that the above primal is feasible if and only if there is no polynomial pp of degree at most nn such that 𝐄t𝒩1[|p(t)|𝟙(tc)]<𝐄t𝒩1[p(t)𝟙(tc)]\operatorname*{\mathbf{E}}_{t\sim\mathcal{N}_{1}}[|p(t)|\mathds{1}(t\leq c)]<\operatorname*{\mathbf{E}}_{t\sim\mathcal{N}_{1}}[p(t)\mathds{1}(t\geq c)].

Using Gaussian hypercontractivity (see [Bog98, Nel73]), one can show that any polynomial pp of degree at most nn satisfying 𝐄t𝒩1[|p(t)|𝟙(tc)]<𝐄t𝒩1[p(t)𝟙(tc)]\operatorname*{\mathbf{E}}_{t\sim\mathcal{N}_{1}}[|p(t)|\mathds{1}(t\leq c)]<\operatorname*{\mathbf{E}}_{t\sim\mathcal{N}_{1}}[p(t)\mathds{1}(t\geq c)] must also satisfy

𝐄z𝒩1[|p(z)|]<23n𝐄z𝒩1[|p(z)|](𝐏𝐫z𝒩1[zc])1/2.\operatorname*{\mathbf{E}}_{z\sim\mathcal{N}_{1}}[|p(z)|]<2\cdot 3^{n}\operatorname*{\mathbf{E}}_{z\sim\mathcal{N}_{1}}[|p(z)|]\left(\operatorname*{\mathbf{Pr}}_{z\sim\mathcal{N}_{1}}[z\geq c]\right)^{1/2}\;.

From our choice of the parameter cc, we have that 𝐏𝐫z𝒩1[zc]32n/4\operatorname*{\mathbf{Pr}}_{z\sim\mathcal{N}_{1}}[z\geq c]\leq 3^{-2n}/4; thus, 𝐄z𝒩1[|p(z)|]<𝐄z𝒩1[|p(z)|]\operatorname*{\mathbf{E}}_{z\sim\mathcal{N}_{1}}[|p(z)|]<\operatorname*{\mathbf{E}}_{z\sim\mathcal{N}_{1}}[|p(z)|], which is a contradiction. Therefore, no such polynomial pp exists, and by LP duality the primal is feasible. ∎

We have proven the existence of the function gg in Lemma 3.3. Now, we construct a joint distribution of (z,y)(z,y) on ×{±1}\mathbb{R}\times\{\pm 1\} such that 𝐄[y|z]\operatorname*{\mathbf{E}}[y|z] is exactly gg, as we discussed in the proof outline. For a joint distribution DD of (x,y)(x,y) supported on X×{±1}X\times\{\pm 1\}, we will use D+D_{+} to denote the conditional distribution of xx given y=1y=1; and DD_{-} for the distribution of xx given y=1y=-1.

Lemma 3.4.

For any sufficiently large nn\in\mathbb{N}, there exists a distribution DD on ×{±1}\mathbb{R}\times\{\pm 1\} such that for (z,y)D(z,y)\sim D:

  • (i) the marginal distribution Dz=𝒩1D_{z}=\mathcal{N}_{1};

  • (ii) 𝐄(z,y)D[y|z=z]=1\operatorname*{\mathbf{E}}_{(z,y)\sim D}[y|z=z^{\prime}]=1 for all zΦ1(132n/4)z^{\prime}\geq\Phi^{-1}(1-3^{-2n}/4);

  • (iii) 𝐄(z,y)D[y]=0\operatorname*{\mathbf{E}}_{(z,y)\sim D}[y]=0 and 𝐄(z,y)D[zk]=𝐄(z,y)D[zky=1]=𝐄(z,y)D[zky=1]\operatorname*{\mathbf{E}}_{(z,y)\sim D}[z^{k}]=\operatorname*{\mathbf{E}}_{(z,y)\sim D}[z^{k}\mid y=1]=\operatorname*{\mathbf{E}}_{(z,y)\sim D}[z^{k}\mid y=-1] for all k[n]k\in[n];

  • (iv) χ2(D+,𝒩1),χ2(D,𝒩1)=O(1)\chi^{2}(D_{+},\mathcal{N}_{1}),\chi^{2}(D_{-},\mathcal{N}_{1})=O(1).

Proof Sketch for Lemma 3.4.

The properties here directly follow from the properties of gg. ∎
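To make the construction concrete, the following toy computation (an illustration only; it handles just n=1n=1, whereas Lemma 3.4 matches all moments up to nn) builds a piecewise-constant gg with g(z)=1g(z)=1 on [c,)[c,\infty) and zero mean and first moment under 𝒩1\mathcal{N}_{1}, which then yields a valid conditional mean 𝐄[yz]=g(z)\operatorname*{\mathbf{E}}[y\mid z]=g(z):

```python
# Toy n = 1 instance of the construction (illustration, not the paper's g):
# g(z) = 1 on [c, inf), g = b1 on [0, c), g = b2 on (-inf, 0), with b1, b2
# chosen so that E[g(z)] = 0 and E[g(z) z] = 0 under z ~ N(0, 1).
# Setting Pr[y = 1 | z] = (1 + g(z))/2 then gives E[y] = 0, E[y z] = 0,
# and E[y | z >= c] = 1, matching items (i)-(iii) for n = 1.
import math

def Phi(x):   # standard normal CDF
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def phi(x):   # standard normal pdf; note E[z * 1(z >= c)] = phi(c)
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

c = 2.0
q = 1 - Phi(c)  # Pr[z >= c]
# zero mean:         q + b1 * (1/2 - q) + b2 * (1/2) = 0
# zero first moment: phi(c) + b1 * (phi(0) - phi(c)) - b2 * phi(0) = 0
a11, a12, r1 = 0.5 - q, 0.5, -q
a21, a22, r2 = phi(0) - phi(c), -phi(0), -phi(c)
det = a11 * a22 - a12 * a21
b1 = (r1 * a22 - a12 * r2) / det   # Cramer's rule
b2 = (a11 * r2 - a21 * r1) / det
assert abs(b1) <= 1 and abs(b2) <= 1   # so (1 + g)/2 is a valid probability

def g(z):
    return 1.0 if z >= c else (b1 if z >= 0 else b2)

# verify the two moment conditions by numerical integration
lo, hi, n_grid = -10.0, 10.0, 40000
step = (hi - lo) / n_grid
xs = [lo + (i + 0.5) * step for i in range(n_grid)]
m0 = sum(g(x) * phi(x) * step for x in xs)
m1 = sum(g(x) * x * phi(x) * step for x in xs)
assert abs(m0) < 1e-4 and abs(m1) < 1e-4
```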

Using the framework introduced in [DKS17, DKPZ21], we can construct a set of alternative hypothesis distributions 𝒟={D𝐯:𝐯V}\mathcal{D}=\{D_{\mathbf{v}}:\mathbf{v}\in V\} on d×{±1}\mathbb{R}^{d}\times\{\pm 1\}, where VV is a set of exponentially many pairwise nearly-orthogonal vectors and the marginal distribution of each D𝐯D_{\mathbf{v}} on direction 𝐯\mathbf{v} is the distribution DD in Lemma 3.4. This effectively embeds the distribution DD in a hidden direction 𝐯\mathbf{v}. The size of the family 𝒟\mathcal{D} is exponential in dd, and the distributions in it have small pairwise correlations. The details of 𝒟\mathcal{D} are deferred to Appendix C. We are now ready to give a proof sketch for the SQ hardness part of our main result, Theorem 1.3.

Proof Sketch of the SQ hardness part of Theorem 1.3.

Let 𝒟\mathcal{D} be the set of distributions discussed above. We also let DnullD_{\mathrm{null}} be the joint distribution of (𝐱,y)(\mathbf{x},y) such that 𝐱𝒩d\mathbf{x}\sim\mathcal{N}_{d} and yBern(1/2)y\sim\mathrm{Bern}(1/2) independent of 𝐱\mathbf{x}. Then, using standard SQ dimension techniques, one can show that any SQ algorithm that solves (𝒟,Dnull)\mathcal{B}(\mathcal{D},D_{\mathrm{null}}) requires either queries of tolerance at most dΩ(log1α)d^{-\Omega(\log\frac{1}{\alpha})} or at least 2dΩ(1)2^{d^{\Omega(1)}} many queries. By reducing the decision problem (𝒟,Dnull)\mathcal{B}(\mathcal{D},D_{\mathrm{null}}) to reliably learning α\alpha-biased LTFs with ϵ<α/3\epsilon<\alpha/3 accuracy, we get the lower bound part of the statement in Theorem 1.3. ∎

4 Conclusions and Open Problems

In this paper, we study the problem of learning halfspaces under Gaussian marginals in the reliable learning model. Our main contribution is the design of the first efficient learner for this task whose complexity beats the complexity of agnostically learning this concept class. Moreover, we provide rigorous evidence, via an SQ lower bound, that no fully polynomial-time algorithm exists for general halfspaces. The obvious open question is whether the dependence on ϵ\epsilon in the complexity of our algorithm can be improved. Specifically, is it possible to design a reliable learner with complexity dO(log(1/α))poly(1/ϵ)d^{O(\log(1/\alpha))}\mathrm{poly}(1/\epsilon)? Is it possible to obtain similarly efficient reliable learners under more general marginal distributions (e.g., strongly log-concave or discrete distributions)? More broadly, it would be interesting to characterize the computational separation between (distribution-specific) reliable and agnostic learning for other natural concept classes.

References

  • [ABHU15] P. Awasthi, M. F. Balcan, N. Haghtalab, and R. Urner. Efficient learning of linear separators under bounded noise. In Proceedings of The 28th Conference on Learning Theory, COLT 2015, pages 167–190, 2015.
  • [ABL17] P. Awasthi, M. F. Balcan, and P. M. Long. The power of localization for efficiently learning linear separators with noise. J. ACM, 63(6):50:1–50:27, 2017.
  • [Bog98] V. Bogachev. Gaussian measures. Mathematical surveys and monographs, vol. 62, 1998.
  • [DGJ+{}^{+}09] I. Diakonikolas, P. Gopalan, R. Jaiswal, R. Servedio, and E. Viola. Bounded independence fools halfspaces. In Proc. 50th IEEE Symposium on Foundations of Computer Science (FOCS), pages 171–180, 2009.
  • [DJ19] A. Durgin and B. Juba. Hardness of improper one-sided learning of conjunctions for all uniformly falsifiable csps. In Algorithmic Learning Theory, ALT, volume 98 of Proceedings of Machine Learning Research, pages 369–382, 2019.
  • [DKK+{}^{+}20] I. Diakonikolas, D. M. Kane, V. Kontonis, C. Tzamos, and N. Zarifis. A polynomial time algorithm for learning halfspaces with Tsybakov noise. arXiv, 2020.
  • [DKK+{}^{+}21a] I. Diakonikolas, D. M. Kane, V. Kontonis, C. Tzamos, and N. Zarifis. Agnostic proper learning of halfspaces under gaussian marginals. In Proceedings of The 34th Conference on Learning Theory, COLT, 2021.
  • [DKK+{}^{+}21b] I. Diakonikolas, D. M. Kane, V. Kontonis, C. Tzamos, and N. Zarifis. Efficiently learning halfspaces with Tsybakov noise. STOC, 2021.
  • [DKK+{}^{+}22] I. Diakonikolas, D. M. Kane, V. Kontonis, C. Tzamos, and N. Zarifis. Learning general halfspaces with general Massart noise under the gaussian distribution. In STOC ’22: 54th Annual ACM SIGACT Symposium on Theory of Computing, Rome, Italy, June 20 - 24, 2022, pages 874–885. ACM, 2022.
  • [DKN10] I. Diakonikolas, D. M. Kane, and J. Nelson. Bounded independence fools degree-22 threshold functions. In FOCS, pages 11–20, 2010.
  • [DKPZ21] I. Diakonikolas, D. M. Kane, T. Pittas, and N. Zarifis. The optimality of polynomial regression for agnostic learning under gaussian marginals. In Proceedings of The 34th Conference on Learning Theory, COLT, 2021.
  • [DKR23] I. Diakonikolas, D. M. Kane, and L. Ren. Near-optimal cryptographic hardness of agnostically learning halfspaces and relu regression under gaussian marginals. In International Conference on Machine Learning, ICML, volume 202 of Proceedings of Machine Learning Research, pages 7922–7938, 2023.
  • [DKRS23] I. Diakonikolas, D. Kane, L. Ren, and Y. Sun. SQ lower bounds for non-gaussian component analysis with weaker assumptions. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, 2023.
  • [DKS17] I. Diakonikolas, D. M. Kane, and A. Stewart. Statistical query lower bounds for robust estimation of high-dimensional gaussians and gaussian mixtures. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 73–84, 2017.
  • [DKS18] I. Diakonikolas, D. M. Kane, and A. Stewart. Learning geometric concepts with nasty noise. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, pages 1061–1073, 2018.
  • [DKTZ20a] I. Diakonikolas, V. Kontonis, C. Tzamos, and N. Zarifis. Learning halfspaces with Massart noise under structured distributions. In Conference on Learning Theory, COLT, 2020.
  • [DKTZ20b] I. Diakonikolas, V. Kontonis, C. Tzamos, and N. Zarifis. Learning halfspaces with Tsybakov noise. arXiv, 2020.
  • [DKTZ20c] I. Diakonikolas, V. Kontonis, C. Tzamos, and N. Zarifis. Non-convex SGD learns halfspaces with adversarial label noise. In Advances in Neural Information Processing Systems, NeurIPS, 2020.
  • [DKTZ22] I. Diakonikolas, V. Kontonis, C. Tzamos, and N. Zarifis. Learning general halfspaces with adversarial label noise via online gradient descent. In International Conference on Machine Learning, ICML 2022, volume 162 of Proceedings of Machine Learning Research, pages 5118–5141. PMLR, 2022.
  • [DKZ20] I. Diakonikolas, D. M. Kane, and N. Zarifis. Near-optimal SQ lower bounds for agnostically learning halfspaces and ReLUs under Gaussian marginals. In Advances in Neural Information Processing Systems, NeurIPS, 2020.
  • [Dom99] P. M. Domingos. Metacost: A general method for making classifiers cost-sensitive. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 155–164, 1999.
  • [Elk01] C. Elkan. The foundations of cost-sensitive learning. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, IJCAI, pages 973–978, 2001.
  • [Fan68] K. Fan. On infinite systems of linear inequalities. Journal of Mathematical Analysis and Applications, 21(3):475 – 478, 1968.
  • [FGR+{}^{+}17] V. Feldman, E. Grigorescu, L. Reyzin, S. Vempala, and Y. Xiao. Statistical algorithms and a lower bound for detecting planted cliques. J. ACM, 64(2):8:1–8:37, 2017.
  • [FS97] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119–139, 1997.
  • [GGK20] S. Goel, A. Gollakota, and A. R. Klivans. Statistical-query lower bounds via functional gradients. In Advances in Neural Information Processing Systems, NeurIPS, 2020.
  • [GKKM20] S. Goldwasser, A. T. Kalai, Y. Kalai, and O. Montasser. Beyond perturbations: Learning guarantees with arbitrary adversarial test examples. In Advances in Neural Information Processing Systems, NeurIPS, 2020.
  • [GKKT17] S. Goel, V. Kanade, A. R. Klivans, and J. Thaler. Reliably learning the relu in polynomial time. In Proceedings of the 30th Conference on Learning Theory, COLT 2017, volume 65 of Proceedings of Machine Learning Research, pages 1004–1042, 2017.
  • [Hau92] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100:78–150, 1992.
  • [KK21] A. T. Kalai and V. Kanade. Efficient learning with arbitrary covariate shift. In Algorithmic Learning Theory, 2021, volume 132 of Proceedings of Machine Learning Research, pages 850–864, 2021.
  • [KKM12] A. T. Kalai, V. Kanade, and Y. Mansour. Reliable agnostic learning. J. Comput. Syst. Sci., 78(5):1481–1495, 2012.
  • [KKMS08] A. Kalai, A. Klivans, Y. Mansour, and R. Servedio. Agnostically learning halfspaces. SIAM Journal on Computing, 37(6):1777–1805, 2008. Special issue for FOCS 2005.
  • [KLS09] A. Klivans, P. Long, and R. Servedio. Learning Halfspaces with Malicious Noise. Journal of Machine Learning Research, 10:2715–2740, 2009.
  • [KOS08] A. Klivans, R. O’Donnell, and R. Servedio. Learning geometric concepts via Gaussian surface area. In Proc. 49th IEEE Symposium on Foundations of Computer Science (FOCS), pages 541–550, Philadelphia, Pennsylvania, 2008.
  • [KSS94] M. Kearns, R. Schapire, and L. Sellie. Toward Efficient Agnostic Learning. Machine Learning, 17(2/3):115–141, 1994.
  • [KT14] V. Kanade and J. Thaler. Distribution-independent reliable learning. In Proceedings of The 27th Conference on Learning Theory, COLT, volume 35 of JMLR Workshop and Conference Proceedings, pages 3–24, 2014.
  • [MT94] W. Maass and G. Turan. How fast can a threshold gate learn? In S. Hanson, G. Drastal, and R. Rivest, editors, Computational Learning Theory and Natural Learning Systems, pages 381–414. MIT Press, 1994.
  • [Nel73] E. Nelson. The free markoff field. Journal of Functional Analysis, 12(2):211–227, 1973.
  • [NP33] J. Neyman and E. S. Pearson. On the problem of the most efficient tests for statistical hypotheses. Philos. Trans. R. Soc. Lond. Ser. A Contain. Pap. Math. Phys. Character, 231:281–337, 1933.
  • [Ros58] F. Rosenblatt. The Perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–407, 1958.
  • [Tie23] S. Tiegel. Hardness of agnostically learning halfspaces from worst-case lattice problems. In The Thirty Sixth Annual Conference on Learning Theory, COLT, volume 195 of Proceedings of Machine Learning Research, pages 3029–3064, 2023.
  • [Vap98] V. Vapnik. Statistical learning theory. Wiley, 1998.
  • [ZSA20] C. Zhang, J. Shen, and P. Awasthi. Efficient active learning of sparse halfspaces with arbitrary bounded noise. In Advances in Neural Information Processing Systems, NeurIPS, 2020.

Appendix

Organization

The supplementary material is structured as follows: Appendix A includes additional preliminaries. Appendix B includes the omitted proofs from Section 2. Finally, Appendix C includes the omitted proofs from Section 3.

Appendix A Additional Preliminaries

We use 𝕊d1\mathbb{S}^{d-1} to denote the dd-dimensional unit sphere, i.e., 𝕊d1=def{𝐱d:𝐱2=1}\mathbb{S}^{d-1}\stackrel{{\scriptstyle{\mathrm{\footnotesize def}}}}{{=}}\{\mathbf{x}\in\mathbb{R}^{d}:\|\mathbf{x}\|_{2}=1\}.

Basics of the SQ Model

In order to give an SQ lower bound for reliable learning, we first present the basics of the SQ model. We define the pairwise correlation, which is used for the SQ lower bound.

Definition A.1 (Pairwise Correlation).

The pairwise correlation of two distributions with probability density function D1,D2:d+D_{1},D_{2}:\mathbb{R}^{d}\mapsto\mathbb{R}_{+} with respect to a distribution with density D:d+D:\mathbb{R}^{d}\mapsto\mathbb{R}_{+}, where the support of DD contains the support of D1D_{1} and D2D_{2}, is defined as χD(D1,D2)=defdD1(𝐱)D2(𝐱)/D(𝐱)d𝐱1\chi_{D}(D_{1},D_{2})\stackrel{{\scriptstyle{\mathrm{\footnotesize def}}}}{{=}}\int_{\mathbb{R}^{d}}D_{1}(\mathbf{x})D_{2}(\mathbf{x})/D(\mathbf{x})d\mathbf{x}-1. Furthermore, the χ\chi-squared divergence of D1D_{1} to DD is defined as χ2(D1,D)=defχD(D1,D1)\chi^{2}(D_{1},D)\stackrel{{\scriptstyle{\mathrm{\footnotesize def}}}}{{=}}\chi_{D}(D_{1},D_{1}).

We require the following definitions and lemmata from [FGR+{}^{+}17] connecting pairwise correlation and SQ lower bounds.

Definition A.2.

We say that a set of ss distributions 𝒟={D1,,Ds}\mathcal{D}=\{D_{1},\cdots,D_{s}\} over d\mathbb{R}^{d} is (γ,β)(\gamma,\beta)-correlated relative to a distribution DD if χD(Di,Dj)γ\chi_{D}(D_{i},D_{j})\leq\gamma for all iji\neq j, and χD(Di,Dj)β\chi_{D}(D_{i},D_{j})\leq\beta for i=ji=j.

Definition A.3 (Statistical Query Dimension).

For β,γ>0\beta,\gamma>0 and a decision problem (𝒟,D)\mathcal{B}(\mathcal{D},D), where DD is a fixed distribution and 𝒟\mathcal{D} is a family of distributions, let ss be the maximum integer such that there exists a finite set of distributions 𝒟D𝒟\mathcal{D}_{D}\subseteq\mathcal{D} that is (γ,β)(\gamma,\beta)-correlated relative to DD with |𝒟D|s|\mathcal{D}_{D}|\geq s. The Statistical Query dimension with pairwise correlations (γ,β)(\gamma,\beta) of \mathcal{B} is defined to be ss and denoted by s=SD(,γ,β)s=\mathrm{SD}(\mathcal{B},\gamma,\beta).

Lemma A.4.

Let (𝒟,D)\mathcal{B}(\mathcal{D},D) be a decision problem, where DD is the reference distribution and 𝒟\mathcal{D} is a class of distributions. For γ,β>0\gamma,\beta>0, let s=SD(,γ,β)s=\mathrm{SD}(\mathcal{B},\gamma,\beta). For any γ>0\gamma^{\prime}>0, any SQ algorithm for \mathcal{B} requires either queries of tolerance at most γ+γ\sqrt{\gamma+\gamma^{\prime}} or at least sγ/(βγ)s\gamma^{\prime}/(\beta-\gamma) many queries.

Basics of Hermite Polynomials

We require the following definitions used in our work.

Definition A.5 (Normalized Hermite Polynomial).

For k\in\mathbb{N}, we define the k-th probabilist’s Hermite polynomial \mathrm{He}_{k}:\mathbb{R}\to\mathbb{R} as \mathrm{He}_{k}(t)=(-1)^{k}e^{t^{2}/2}\cdot\frac{\mathrm{d}^{k}}{\mathrm{d}t^{k}}e^{-t^{2}/2}. We define the k-th normalized Hermite polynomial h_{k}:\mathbb{R}\to\mathbb{R} as h_{k}(t)=\mathrm{He}_{k}(t)/\sqrt{k!}.
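For concreteness, the first few probabilist’s Hermite polynomials are \mathrm{He}_{0}(t)=1, \mathrm{He}_{1}(t)=t, \mathrm{He}_{2}(t)=t^{2}-1, and they satisfy the standard three-term recurrence \mathrm{He}_{k+1}(t)=t\,\mathrm{He}_{k}(t)-k\,\mathrm{He}_{k-1}(t). A minimal Python sketch of these definitions (the function names are ours, for illustration only):

```python
import math

def hermite_he(k, t):
    """Probabilist's Hermite polynomial He_k(t), computed via the
    three-term recurrence He_{k+1}(t) = t*He_k(t) - k*He_{k-1}(t),
    with He_0(t) = 1 and He_1(t) = t."""
    if k == 0:
        return 1.0
    prev, cur = 1.0, t  # He_0, He_1
    for j in range(1, k):
        prev, cur = cur, t * cur - j * prev
    return cur

def hermite_normalized(k, t):
    """Normalized Hermite polynomial h_k(t) = He_k(t) / sqrt(k!)."""
    return hermite_he(k, t) / math.sqrt(math.factorial(k))
```

For instance, \mathrm{He}_{3}(t)=t^{3}-3t, so `hermite_he(3, 2.0)` returns 2.0.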

Furthermore, we use multivariate Hermite polynomials in the form of Hermite tensors (as the entries in the Hermite tensors are re-scaled multivariate Hermite polynomials). We define the Hermite tensor as follows.

Definition A.6 (Hermite Tensor).

For kk\in\mathbb{N} and 𝐱d\mathbf{x}\in\mathbb{R}^{d}, we define the kk-th Hermite tensor as

(𝐇k(𝐱))i1,i2,,ik=1k!Partitions P of [k]into sets of size 1 and 2{a,b}P(𝐈ia,ib){c}P𝐱ic.(\mathbf{H}_{k}(\mathbf{x}))_{i_{1},i_{2},\ldots,i_{k}}=\frac{1}{\sqrt{k!}}\sum_{\begin{subarray}{c}\text{Partitions $P$ of $[k]$}\\ \text{into sets of size 1 and 2}\end{subarray}}\bigotimes_{\{a,b\}\in P}(-\mathbf{I}_{i_{a},i_{b}})\bigotimes_{\{c\}\in P}\mathbf{x}_{i_{c}}\;.
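For k=2, the only partitions of [2] into sets of size 1 and 2 are \{\{1,2\}\} (contributing -\mathbf{I}_{i_{1},i_{2}}) and \{\{1\},\{2\}\} (contributing \mathbf{x}_{i_{1}}\mathbf{x}_{i_{2}}), so (\mathbf{H}_{2}(\mathbf{x}))_{ij}=(\mathbf{x}_{i}\mathbf{x}_{j}-\mathbf{I}_{ij})/\sqrt{2}; in particular, the diagonal entries equal h_{2}(\mathbf{x}_{i}). A small Python sketch of this k=2 case (illustrative naming):

```python
import math

def hermite_tensor_2(x):
    """k = 2 case of the Hermite tensor: the partitions of [2] into sets
    of size 1 and 2 are {{1,2}} and {{1},{2}}, giving
    (H_2(x))_{ij} = (x_i * x_j - delta_{ij}) / sqrt(2)."""
    d = len(x)
    return [[(x[i] * x[j] - (1.0 if i == j else 0.0)) / math.sqrt(2.0)
             for j in range(d)] for i in range(d)]
```

The diagonal entry (\mathbf{H}_{2}(\mathbf{x}))_{ii}=(\mathbf{x}_{i}^{2}-1)/\sqrt{2}=h_{2}(\mathbf{x}_{i}), consistent with Definition A.5.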

We also point out that fully reliable learning (see [KKM12] for the definition) reduces to one-sided reliable learning (Definition 1.1). The proof is implicit in the proof of Theorem 3 in [KKM12] and we include it here for completeness.

Fact A.7.

For a joint distribution D of (\mathbf{x},y) supported on X\times\{\pm 1\}, let h_{+}:X\to\{\pm 1\} (resp. h_{-}:X\to\{\pm 1\}) be a hypothesis that is \epsilon/4-reliable with respect to the concept class C on distribution D for positive labels (resp. for negative labels). Define the hypothesis h by h(\mathbf{x})=1 if h_{+}(\mathbf{x})=h_{-}(\mathbf{x})=1, h(\mathbf{x})=-1 if h_{+}(\mathbf{x})=h_{-}(\mathbf{x})=-1, and h(\mathbf{x})=? otherwise. Then h is \epsilon-close to the best fully reliable sandwich hypothesis for C on distribution D.

Proof Sketch.

We first show that R+(h;D),R(h;D)ϵR_{+}(h;D),R_{-}(h;D)\leq\epsilon. Notice that

R+(h;D)=𝐏𝐫(𝐱,y)D[h(𝐱)=1y=1]𝐏𝐫(𝐱,y)D[h+(𝐱)=1y=1]ϵ.R_{+}(h;D)=\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h(\mathbf{x})=1\land y=-1]\leq\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h_{+}(\mathbf{x})=1\land y=-1]\leq\epsilon\;.

The same holds for negative labels.

To show that 𝐏𝐫(𝐱,y)D[h(𝐱)=?]\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h(\mathbf{x})=?] is ϵ\epsilon-suboptimal, assume (for the purpose of contradiction) that there is an hh^{\prime} such that R+(h;D),R(h;D)ϵR_{+}(h^{\prime};D),R_{-}(h^{\prime};D)\leq\epsilon and 𝐏𝐫(𝐱,y)D[h(𝐱)=?]<𝐏𝐫(𝐱,y)D[h(𝐱)=?]ϵ\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h^{\prime}(\mathbf{x})=?]<\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h(\mathbf{x})=?]-\epsilon. Notice that

𝐏𝐫(𝐱,y)D[h(𝐱)=?]\displaystyle\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h(\mathbf{x})=?]
=\displaystyle= 𝐏𝐫(𝐱,y)D[h+(𝐱)=1h(𝐱)=1]+𝐏𝐫(𝐱,y)D[h+(𝐱)=1h(𝐱)=1]\displaystyle\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h_{+}(\mathbf{x})=1\land h_{-}(\mathbf{x})=-1]+\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h_{+}(\mathbf{x})=-1\land h_{-}(\mathbf{x})=1]
\displaystyle\leq 1𝐏𝐫(𝐱,y)D[h+(𝐱)=1]𝐏𝐫(𝐱,y)D[h(𝐱)=1]+𝐏𝐫(𝐱,y)D[h+(𝐱)=1h(𝐱)=1]\displaystyle 1-\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h_{+}(\mathbf{x})=1]-\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h_{-}(\mathbf{x})=-1]+\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h_{+}(\mathbf{x})=1\land h_{-}(\mathbf{x})=-1]
\displaystyle\leq 1𝐏𝐫(𝐱,y)D[h+(𝐱)=1]𝐏𝐫(𝐱,y)D[h(𝐱)=1]+ϵ/2.\displaystyle 1-\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h_{+}(\mathbf{x})=1]-\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h_{-}(\mathbf{x})=-1]+\epsilon/2\;.

Thus,

𝐏𝐫(𝐱,y)D[h+(𝐱)=1]+𝐏𝐫(𝐱,y)D[h(𝐱)=1]1𝐏𝐫(𝐱,y)D[h(𝐱)=?]+ϵ/2.\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h_{+}(\mathbf{x})=1]+\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h_{-}(\mathbf{x})=-1]\leq 1-\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h(\mathbf{x})=?]+\epsilon/2\;. (1)

Then, define h+:X{±1}h_{+}^{\prime}:X\to\{\pm 1\} as h+(𝐱)=1h_{+}^{\prime}(\mathbf{x})=1 if h(𝐱)=1h^{\prime}(\mathbf{x})=1 and h+(𝐱)=1h_{+}^{\prime}(\mathbf{x})=-1 otherwise (hh_{-}^{\prime} is defined similarly). We get 𝐏𝐫(𝐱,y)D[h+(𝐱)=1]+𝐏𝐫(𝐱,y)D[h(𝐱)=1]=1𝐏𝐫(𝐱,y)D[h(𝐱)=?]\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h^{\prime}_{+}(\mathbf{x})=1]+\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h^{\prime}_{-}(\mathbf{x})=-1]=1-\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h^{\prime}(\mathbf{x})=?]. Using Equation 1 and 𝐏𝐫(𝐱,y)D[h(𝐱)=?]<𝐏𝐫(𝐱,y)D[h(𝐱)=?]ϵ\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h^{\prime}(\mathbf{x})=?]<\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h(\mathbf{x})=?]-\epsilon, we get

\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h_{+}(\mathbf{x})=1]+\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h_{-}(\mathbf{x})=-1]<\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h^{\prime}_{+}(\mathbf{x})=1]+\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h^{\prime}_{-}(\mathbf{x})=-1]-\epsilon/2\;.

Therefore, either 𝐏𝐫(𝐱,y)D[h+(𝐱)=1]<𝐏𝐫(𝐱,y)D[h+(𝐱)=1]ϵ/4\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h_{+}(\mathbf{x})=1]<\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h^{\prime}_{+}(\mathbf{x})=1]-\epsilon/4 or

𝐏𝐫(𝐱,y)D[h(𝐱)=1]<𝐏𝐫(𝐱,y)D[h(𝐱)=1]ϵ/4.\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h_{-}(\mathbf{x})=-1]<\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h^{\prime}_{-}(\mathbf{x})=-1]-\epsilon/4\;.

Thus, either h_{+} or h_{-} is more than \epsilon/4-suboptimal, contradicting the assumption that each is \epsilon/4-reliable. ∎
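The combining rule in Fact A.7 is purely pointwise, so it is cheap to implement on top of any pair of one-sided learners. A minimal Python sketch (hypothetical hypotheses passed as callables; abstention encoded by the string "?"):

```python
def sandwich(h_plus, h_minus):
    """Fact A.7 combiner: given a positive-reliable hypothesis h_plus and
    a negative-reliable hypothesis h_minus (each mapping x to +1 or -1),
    predict their common value and abstain with "?" on disagreement."""
    def h(x):
        a, b = h_plus(x), h_minus(x)
        return a if a == b else "?"
    return h
```

For example, combining the thresholds x > 0 and x > -1 yields a predictor that abstains exactly on the disagreement region (-1, 0].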

Appendix B Omitted Proofs from Section 2

We define the bias of a function h:\mathbb{R}^{d}\to\{\pm 1\} as \min(\operatorname*{\mathbf{Pr}}_{\mathbf{x}\sim\mathcal{N}_{d}}[h(\mathbf{x})=1],\operatorname*{\mathbf{Pr}}_{\mathbf{x}\sim\mathcal{N}_{d}}[h(\mathbf{x})=-1]), and define \mathcal{H}_{d}^{\alpha} to be the set of all LTFs whose bias is \alpha. Let f=\operatorname*{argmin}_{f\in\mathcal{H}_{d}^{\alpha}\;\land\;R_{+}(f;D)=0}R_{-}(f;D) be the optimal reliable hypothesis, where \alpha is the bias of f.

B.1 Reliably Learning Halfspaces with Gaussian Marginals when ϵ2α\epsilon\geq 2\alpha

We show that for the problem in Theorem 1.3, in the case of ϵ2α\epsilon\geq 2\alpha, the algorithm can just return one of the constant hypotheses. Let h+h_{+} (resp. hh_{-}) be the +1+1 (resp. 1-1) constant hypothesis.

Fact B.1.

For the problem defined in Theorem 1.3, the case where ϵ2α\epsilon\geq 2\alpha can be efficiently solved by returning h+h_{+} if R+(h+;D)3ϵ/4R_{+}(h_{+};D)\leq 3\epsilon/4 and returning hh_{-} otherwise.

Proof.

When the algorithm returns h+h_{+}, it is easy to see that h+h_{+} satisfies the reliable learning requirement in Definition 1.1. So we only consider the case that the algorithm returns hh_{-}.

Given that the algorithm returns h_{-}, either there exists an \alpha-biased LTF f with R_{+}(f;D)=0, or no \alpha-biased LTF f satisfies R_{+}(f;D)=0. In the latter case, h_{-} is a valid answer directly from Definition 1.1. In the former case, since the algorithm did not return h_{+}, it must be that \operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[y=-1]>3\epsilon/4>\alpha. Since R_{+}(f;D)=0 forces f(\mathbf{x})=-1 whenever y=-1, we get \operatorname*{\mathbf{Pr}}_{\mathbf{x}\sim\mathcal{N}_{d}}[f(\mathbf{x})=-1]\geq\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[y=-1]>\alpha, and hence \operatorname*{\mathbf{Pr}}_{\mathbf{x}\sim\mathcal{N}_{d}}[f(\mathbf{x})=-1]=1-\alpha. Thus, R_{+}(h_{-};D)=0 and R_{-}(h_{-};D)\leq R_{-}(f;D)+\alpha\leq R_{-}(f;D)+\epsilon/2. This shows that h_{-} is \epsilon-reliable with respect to all \alpha-biased LTFs, which completes the proof. ∎
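The selection rule of Fact B.1 only requires an empirical estimate of \operatorname*{\mathbf{Pr}}[y=-1], which equals R_{+} of the all-positive hypothesis. A minimal Python sketch (illustrative naming; the sample-size and confidence bookkeeping is omitted):

```python
def choose_constant_hypothesis(labels, eps):
    """Fact B.1 selection rule for the regime eps >= 2*alpha: the empirical
    negative-label mass estimates R_+ of the all-positive hypothesis h_+;
    return +1 (the h_+ constant hypothesis) if it is at most 3*eps/4,
    and -1 (the h_- constant hypothesis) otherwise.
    `labels` is a list of +/-1 labels sampled from D."""
    neg_mass = sum(1 for y in labels if y == -1) / len(labels)
    return 1 if neg_mass <= 3.0 * eps / 4.0 else -1
```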

B.2 Reliably Learning Halfspaces with Gaussian Marginals: Case of Unknown α\alpha

Here we establish our main result:

Theorem B.2 (Main Algorithmic Result).

Let ϵ(0,1/2)\epsilon\in(0,1/2) and let DD be a joint distribution of (𝐱,y)(\mathbf{x},y) supported on d×{±1}\mathbb{R}^{d}\times\{\pm 1\} with marginal D𝐱=𝒩dD_{\mathbf{x}}=\mathcal{N}_{d}. Let f=argminfdR+(f;D)=0R(f;D)f=\operatorname*{argmin}_{f\in\mathcal{H}_{d}\;\land\;R_{+}(f;D)=0}R_{-}(f;D) be the optimal halfspace and α\alpha be its bias which is unknown. Then there is an algorithm that uses N=dO(log(min{1/α,1/ϵ}))min(2log(1/ϵ)O(log(1/α)),2poly(1/ϵ))N=d^{O(\log(\min\{1/\alpha,1/\epsilon\}))}\min\left(2^{\log(1/\epsilon)^{O(\log(1/\alpha))}},2^{\mathrm{poly}(1/\epsilon)}\right) many samples from DD, runs in poly(N,d,1/ϵ)\mathrm{poly}(N,d,1/\epsilon) time and with high probability returns a hypothesis h(𝐱)=sign(𝐰,𝐱t)h(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w},\mathbf{x}\rangle-t) that is ϵ\epsilon-reliable with respect to the class of LTFs.

To prove Theorem B.2, we need to use the algorithm in the following lemma as a black box. The lemma shows that if the optimal halfspace ff satisfies 𝐏𝐫𝐱𝒩d[f(𝐱)=1]1/2\operatorname*{\mathbf{Pr}}_{\mathbf{x}\sim\mathcal{N}_{d}}[f(\mathbf{x})=-1]\leq 1/2, then the problem can be solved efficiently.

Lemma B.3.

Let DD be the joint distribution of (𝐱,y)(\mathbf{x},y) supported on d×{±1}\mathbb{R}^{d}\times\{\pm 1\} with marginal D𝐱=𝒩dD_{\mathbf{x}}=\mathcal{N}_{d} and ϵ(0,1)\epsilon\in(0,1). Suppose the optimal halfspace f=argminfdR+(f;D)=0R(f;D)f=\operatorname*{argmin}_{f\in\mathcal{H}_{d}\;\land\;R_{+}(f;D)=0}R_{-}(f;D) satisfies 𝐏𝐫𝐱𝒩d[f(𝐱)=1]1/2\operatorname*{\mathbf{Pr}}_{\mathbf{x}\sim\mathcal{N}_{d}}[f(\mathbf{x})=-1]\leq 1/2. Then there is an algorithm that reliably learns LTFs on DD using N=O(d/ϵ2)N=O(d/\epsilon^{2}) many samples and poly(N,d)\mathrm{poly}(N,d) running time.

Proof.

The algorithm is the following:

  1.

    First check whether \operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[y=-1]\leq\epsilon, with a sufficiently small constant failure probability. If so, return the +1 constant hypothesis.

  2.

    Otherwise, draw m=(d/\epsilon)^{c} many samples S=\{(\mathbf{x}_{1},y_{1}),\cdots,(\mathbf{x}_{m},y_{m})\} conditioned on y=-1. Since Step 1 did not return, \operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[y=-1]=\Omega(\epsilon) with high probability, so this conditional sampling has efficiency \Omega(\epsilon).

  3.

    Solve the following convex program for \mathbf{w}^{\prime} and t^{\prime}:

    minimize \quad t^{\prime}
    such that \quad \langle\mathbf{w}^{\prime},\mathbf{x}\rangle-t^{\prime}\leq 0\;, \quad \forall(\mathbf{x},y)\in S
    \phantom{such that} \quad \|\mathbf{w}^{\prime}\|_{2}\leq 1

    Return the hypothesis h(𝐱)=sign(𝐰,𝐱t)h(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w},\mathbf{x}\rangle-t), where 𝐰=𝐰/𝐰2\mathbf{w}=\mathbf{w}^{\prime}/\|\mathbf{w}^{\prime}\|_{2} and t=t/𝐰2t=t^{\prime}/\|\mathbf{w}^{\prime}\|_{2}.

Let f(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w}^{*},\mathbf{x}\rangle-t^{*}) be the optimal hypothesis. We prove that h(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w},\mathbf{x}\rangle-t) satisfies R_{+}(h;D)\leq\epsilon and R_{-}(h;D)\leq R_{-}(f;D)+\epsilon. Since h is consistent with all the negative samples in S, standard VC generalization bounds imply R_{+}(h;D)\leq\epsilon. From the assumption that \operatorname*{\mathbf{Pr}}_{\mathbf{x}\sim\mathcal{N}_{d}}[f(\mathbf{x})=-1]\leq 1/2, we have t^{*}\leq 0. Notice that \mathbf{w}^{*} and t^{*} are a feasible solution to the above program; therefore, t^{\prime}\leq t^{*}\leq 0. Since \|\mathbf{w}^{\prime}\|_{2}\leq 1 and t^{\prime}\leq 0, we must have t=t^{\prime}/\|\mathbf{w}^{\prime}\|_{2}\leq t^{\prime}\leq t^{*}, which implies that \operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h(\mathbf{x})=1]\geq\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[f(\mathbf{x})=1]. Therefore,

R(h;D)\displaystyle R_{-}(h;D) =𝐏𝐫(𝐱,y)D[y=1]𝐏𝐫(𝐱,y)D[y=1h(𝐱)=1]\displaystyle=\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[y=1]-\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[y=1\land h(\mathbf{x})=1]
=𝐏𝐫(𝐱,y)D[y=1]𝐏𝐫(𝐱,y)D[h(𝐱)=1]+R+(h;D)\displaystyle=\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[y=1]-\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h(\mathbf{x})=1]+R_{+}(h;D)
𝐏𝐫(𝐱,y)D[y=1]𝐏𝐫(𝐱,y)D[f(𝐱)=1]+ϵ\displaystyle\leq\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[y=1]-\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[f(\mathbf{x})=1]+\epsilon
=R(f;D)+ϵ.\displaystyle=R_{-}(f;D)+\epsilon\;.

This completes the proof. ∎
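For illustration, since the constraints force t^{\prime}\geq\langle\mathbf{w}^{\prime},\mathbf{x}\rangle for every negative sample \mathbf{x}, the program in Lemma B.3 is equivalent to \min_{\|\mathbf{w}\|_{2}\leq 1}\max_{i}\langle\mathbf{w},\mathbf{x}_{i}\rangle over the negative samples. The sketch below solves this min–max form by projected subgradient descent — a stand-in for the convex-programming step that the analysis assumes, not the paper’s solver:

```python
import math

def fit_reliable_direction(neg_xs, steps=2000):
    """Projected-subgradient sketch for the convex program of Lemma B.3,
    in the equivalent form  min_{||w||_2 <= 1}  max_i <w, x_i>  over the
    negatively labeled points x_i.  Returns (w, t) with
    t = max_i <w, x_i>, so every x_i satisfies <w, x_i> - t <= 0."""
    d = len(neg_xs[0])
    w = [1.0 / math.sqrt(d)] * d
    for step in range(1, steps + 1):
        # a subgradient of w -> max_i <w, x_i> is any maximizing x_i
        x_star = max(neg_xs, key=lambda x: sum(wi * xi for wi, xi in zip(w, x)))
        eta = 1.0 / math.sqrt(step)
        w = [wi - eta * xi for wi, xi in zip(w, x_star)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm > 1.0:  # project back onto the unit ball
            w = [wi / norm for wi in w]
    t = max(sum(wi * xi for wi, xi in zip(w, x)) for x in neg_xs)
    return w, t
```

On toy negatives lying in the halfspace x_{1}\leq-2, the method recovers a direction close to \mathbf{e}_{1} with a strictly negative threshold.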

We now prove Theorem B.2.

Proof of Theorem B.2.

The algorithm is the following:

  1.

    Run the algorithm in Lemma B.3 with \epsilon^{\prime}=\epsilon/2. Check (to within a sufficiently small constant failure probability) whether the output hypothesis h satisfies R_{+}(h;D)\leq\epsilon; if so, output h and terminate.

  2.

    Set \alpha=1/2-\epsilon/100 and run the algorithm in Theorem B.4 with \epsilon^{\prime}=\epsilon/2. Check (to within a sufficiently small constant failure probability) whether the output hypothesis h satisfies R_{+}(h;D)\leq\epsilon; if so, output h and terminate. Otherwise, update \alpha to \alpha-\epsilon/100 and repeat Step 2 until the algorithm terminates.

Let f=argminfdR+(f;D)=0R(f;D)f=\operatorname*{argmin}_{f\in\mathcal{H}_{d}\;\land\;R_{+}(f;D)=0}R_{-}(f;D) be the optimal halfspace and αf\alpha_{f} be its bias which is unknown. Suppose the algorithm terminates. Let α\alpha be the bias of the output hypothesis hh. It is easy to see that R+(h;D)ϵR_{+}(h;D)\leq\epsilon with high probability; therefore, it only remains to show R(h;D)R(f;D)+ϵR_{-}(h;D)\leq R_{-}(f;D)+\epsilon. Notice that if 𝐏𝐫𝐱𝒩d[f(𝐱)=1]1/2\operatorname*{\mathbf{Pr}}_{\mathbf{x}\sim\mathcal{N}_{d}}[f(\mathbf{x})=-1]\leq 1/2, the algorithm will terminate in Step 1. Therefore, without loss of generality, we assume 𝐏𝐫𝐱𝒩d[f(𝐱)=1]1/2\operatorname*{\mathbf{Pr}}_{\mathbf{x}\sim\mathcal{N}_{d}}[f(\mathbf{x})=-1]\geq 1/2. Then, by Theorem B.4, it is easy to see that ααfϵ/100\alpha\geq\alpha_{f}-\epsilon/100 since given any ααf\alpha\leq\alpha_{f}, Step 2 is guaranteed to terminate. Therefore, we have

𝐏𝐫𝐱𝒩d[h(𝐱)=1]1α1αf+ϵ/100=𝐏𝐫𝐱𝒩d[f(𝐱)=1]+ϵ/100.\operatorname*{\mathbf{Pr}}_{\mathbf{x}\sim\mathcal{N}_{d}}[h(\mathbf{x})=-1]\leq 1-\alpha\leq 1-\alpha_{f}+\epsilon/100=\operatorname*{\mathbf{Pr}}_{\mathbf{x}\sim\mathcal{N}_{d}}[f(\mathbf{x})=-1]+\epsilon/100\;.

Thus,

R(h;D)\displaystyle R_{-}(h;D) =𝐏𝐫(𝐱,y)D[y=1]𝐏𝐫(𝐱,y)D[y=1h(𝐱)=1]\displaystyle=\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[y=1]-\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[y=1\land h(\mathbf{x})=1]
=𝐏𝐫(𝐱,y)D[y=1]𝐏𝐫(𝐱,y)D[h(𝐱)=1]+R+(h;D)\displaystyle=\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[y=1]-\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[h(\mathbf{x})=1]+R_{+}(h;D)
𝐏𝐫(𝐱,y)D[y=1]𝐏𝐫(𝐱,y)D[f(𝐱)=1]+ϵ\displaystyle\leq\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[y=1]-\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[f(\mathbf{x})=1]+\epsilon
=R(f;D)+ϵ.\displaystyle=R_{-}(f;D)+\epsilon\;.

This completes the proof. ∎

B.3 Reliably Learning Halfspaces with Gaussian Marginals for the Case ϵα/2\epsilon\leq\alpha/2

For convenience, we assume that \alpha is known. This is without loss of generality, as discussed in the proof of Theorem B.2.

We show the following:

Theorem B.4.

Let ϵ(0,1/2),α(0,1/2]\epsilon\in(0,1/2),\alpha\in(0,1/2] and let DD be a joint distribution of (𝐱,y)(\mathbf{x},y) supported on d×{±1}\mathbb{R}^{d}\times\{\pm 1\} with marginal D𝐱=𝒩dD_{\mathbf{x}}=\mathcal{N}_{d}. Assume that there is a halfspace in dα\mathcal{H}_{d}^{\alpha} that is reliable with respect to DD. Then, there is an algorithm that uses N=dO(log(min{1/α,1/ϵ}))min(2log(1/ϵ)O(log(1/α)),2poly(1/ϵ))N=d^{O(\log(\min\{1/\alpha,1/\epsilon\}))}\min\left(2^{\log(1/\epsilon)^{O(\log(1/\alpha))}},2^{\mathrm{poly}(1/\epsilon)}\right) many samples from DD, runs in poly(N,d,1/ϵ)\mathrm{poly}(N,d,1/\epsilon) time and with high probability returns a hypothesis h(𝐱)=sign(𝐰,𝐱t)h(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w},\mathbf{x}\rangle-t) that is ϵ\epsilon-reliable with respect to the class of dα\mathcal{H}_{d}^{\alpha}.

Below we provide the omitted proofs from Section 2.

B.4 Proof of Lemma 2.2

Notice that for any polynomial pp such that 𝐄z𝒩1[p(z)]=0\operatorname*{\mathbf{E}}_{z\sim{\mathcal{N}_{1}}}[p(z)]=0, we have

𝐄(𝐱,y)D[yp(𝐰,𝐱)]=2𝐄(𝐱,y)D[𝟏(y=1)p(𝐰,𝐱)].\operatorname*{\mathbf{E}}_{(\mathbf{x},y)\sim D}[y\,p(\langle\mathbf{w}^{*},\mathbf{x}\rangle)]=-2\operatorname*{\mathbf{E}}_{(\mathbf{x},y)\sim D}[\mathbf{1}(y=-1)\,p(\langle\mathbf{w}^{*},\mathbf{x}\rangle)]\;.

Therefore, it suffices for us to show that there is a zero mean and unit variance polynomial pp such that

𝐄(𝐱,y)D[𝟏(y=1)p(𝐰,𝐱)]=2O(t2)𝐏𝐫(𝐱,y)D[y=1].\operatorname*{\mathbf{E}}_{(\mathbf{x},y)\sim D}[\mathbf{1}(y=-1)\,p(\langle\mathbf{w}^{*},\mathbf{x}\rangle)]=2^{-O({t^{*}}^{2})}\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[y=-1]\;.

Here we will consider the sign-matching polynomial from [DKK+{}^{+}22], which satisfies the following properties:

Claim B.5.

Let b\in\mathbb{R} with b\geq 4. There exists a zero-mean, unit-variance polynomial p:\mathbb{R}\to\mathbb{R} of degree k=\Theta(b^{2}) such that

  1.

    The sign of p(z)p(z) matches sign(zb)\mathrm{sign}(z-b), i.e. sign(p(z))=sign(zb)\mathrm{sign}(p(z))=\mathrm{sign}(z-b).

  2.

    For any z\leq b/2, we have |p(z)|\geq 2^{-O(k)}.

Proof.

We consider the polynomial pp defined as:

p~(z)=q(z)q(b)r(b)r(z),\tilde{p}(z)=q(z)-\frac{q(b)}{r(b)}r(z)\;,

where q(z)=z3kq(z)=z^{3k}, r(z)=z2k(2k1)!!r(z)=z^{2k}-(2k-1)!! and kk is a sufficiently large odd integer such that 2|b|k4max(|b|,1)2|b|\leq\sqrt{k}\leq 4\max(|b|,1). We then take p(z)=p~(z)/𝐄u𝒩1[p~2(u)]p(z)=\tilde{p}(z)/\sqrt{\operatorname*{\mathbf{E}}_{u\sim\mathcal{N}_{1}}[\tilde{p}^{2}(u)]}.

For convenience, we first note that from Stirling’s approximation for m+m\in\mathbb{Z}^{+},

(m/2)m(2m1)!!(2m)m.(m/2)^{m}\leq(2m-1)!!\leq(2m)^{m}\;.

To prove that \mathrm{sign}(p(z))=\mathrm{sign}(z-b), we first show that \frac{q(b)}{r(b)}\leq 0. Since q(b)\geq 0, it suffices to show that r(b)\leq 0. Notice that r(b)=b^{2k}-(2k-1)!!; since 2|b|\leq\sqrt{k} and (2k-1)!!\geq(k/2)^{k}, we get r(b)\leq(\sqrt{k}/2)^{2k}-(k/2)^{k}\leq 0. Therefore, considering q(z)=z^{3k} and r(z)=z^{2k}-(2k-1)!!,

p~(z)=q(z)q(b)r(b)r(z)\tilde{p}(z)=q(z)-\frac{q(b)}{r(b)}r(z)

must be monotone increasing for z\geq b, and hence p(z) is also monotone increasing for z\geq b. Notice that p(b)=0; therefore, p(z)>0 for z>b. To show that p(z)<0 for z<b, we consider the cases z\in[-b,+b) and z<-b separately. For z\in[-b,+b), notice that since \frac{q(b)}{r(b)}\leq 0 and by the definitions of q(z) and r(z),

\tilde{p}(z)=q(z)-\frac{q(b)}{r(b)}r(z)<q(b)-\frac{q(b)}{r(b)}r(b)=\tilde{p}(b)=0\;.

Then, for zbz\leq-b, we show that p(z)p(z) is monotone increasing. Notice that the derivative p~(z)=kz2k1(3zk2q(b)r(b))\tilde{p}^{\prime}(z)=kz^{2k-1}(3z^{k}-2\frac{q(b)}{r(b)}), for zb<0z\leq-b<0, this is positive if and only if 3zk2q(b)r(b)<03z^{k}-2\frac{q(b)}{r(b)}<0. If we can show bk23q(b)/r(b)b^{k}\geq-\frac{2}{3}q(b)/r(b), then it is immediate that for any zbz\leq-b, 3zk2q(b)r(b)<03z^{k}-2\frac{q(b)}{r(b)}<0. The condition bk23q(b)/r(b)b^{k}\geq-\frac{2}{3}q(b)/r(b) can be further simplified to (2k1)!!/b2k5/3(2k-1)!!/b^{2k}\geq 5/3. Then considering 2b<k2b<\sqrt{k} and (2k1)!!(k/2)k(2k-1)!!\geq(k/2)^{k}, we have

(2k1)!!/b2k(k/2)k/(k/2)2k2k,(2k-1)!!/b^{2k}\geq(k/2)^{k}/(\sqrt{k}/2)^{2k}\geq 2^{k}\;,

thus the condition holds. Therefore, p~(z)>0\tilde{p}^{\prime}(z)>0 for z<bz<-b and p(z)p(z) must be monotone increasing for z<bz<-b. Then combined with the condition p(b)<0p(-b)<0, which is implied from the previous case (z[b,b]z\in[-b,b]), we have p(z)<0p(z)<0 for z<bz<-b.

For Property 2, we show that |p(z)|=2^{-\Theta(k)} for z\in[0,b/2], z\in[-b/2,0], z\in[-b,-b/2] and z\in(-\infty,-b] separately. For z\in[0,b/2], notice that p(z) is convex for z\in[0,b]; therefore, it suffices to show that |p(0)|=2^{-\Theta(k)}. Note that p(0)=\frac{q(b)}{r(b)}(2k-1)!!/\sqrt{\operatorname*{\mathbf{E}}_{u\sim\mathcal{N}_{1}}[\tilde{p}^{2}(u)]}, and we have \left|\frac{q(b)}{r(b)}\right|\geq b^{3k}/\max(b^{2k},(2k-1)!!)\geq\Omega(k)^{k/2}. Therefore, since \operatorname*{\mathbf{E}}_{u\sim\mathcal{N}_{1}}[\tilde{p}^{2}(u)]\leq(O(k))^{3k}, we have |p(0)|\geq 2^{-O(k)}. This completes the proof for the case z\in[0,b/2]. The case z\in[-b/2,0] follows immediately from the fact that |p(-z)|\geq|p(z)| for z\in[0,b], which holds by the definition of p. For the case z\in[-b,-b/2], we have p(z)\leq-(b/2)^{3k}/\sqrt{\operatorname*{\mathbf{E}}_{u\sim\mathcal{N}_{1}}[\tilde{p}^{2}(u)]}\leq-2^{-O(k)}. Finally, the case z\in(-\infty,-b] follows from the fact that p(z) is monotone increasing on this interval, so p(z)\leq p(-b)\leq-2^{-O(k)}. This completes the proof of Property 2. ∎
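The construction in the proof of Claim B.5 is concrete enough to check numerically. The sketch below evaluates the unnormalized polynomial \tilde{p} (normalizing by \operatorname*{\mathbf{E}}_{u\sim\mathcal{N}_{1}}[\tilde{p}^{2}(u)]^{1/2} only rescales and cannot change signs) for b=4 and k=65, which is odd and satisfies 2|b|\leq\sqrt{k}\leq 4\max(|b|,1):

```python
def double_factorial_odd(k):
    """(2k-1)!! = 1 * 3 * ... * (2k-1), as a float."""
    out = 1.0
    for j in range(1, 2 * k, 2):
        out *= float(j)
    return out

def sign_matching_poly(b, k):
    """Unnormalized sign-matching polynomial from the proof of Claim B.5:
    p~(z) = q(z) - (q(b)/r(b)) * r(z), with q(z) = z^{3k} and
    r(z) = z^{2k} - (2k-1)!!.  Requires k odd with 2|b| <= sqrt(k)."""
    df = double_factorial_odd(k)
    c = (b ** (3 * k)) / (b ** (2 * k) - df)
    return lambda z: z ** (3 * k) - c * (z ** (2 * k) - df)
```

Evaluating `sign_matching_poly(4.0, 65)` on a grid confirms that the sign of \tilde{p}(z) matches \mathrm{sign}(z-b).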

Given Claim B.5, let \bar{p} be the polynomial from the claim with b=\max(2|t^{*}|,4), and take p(z)=-\bar{p}(-z), so that \mathrm{sign}(p(z))=\mathrm{sign}(z+b) and |p(z)|\geq 2^{-O(k)} for z\geq-b/2. Then, given that D satisfies the reliability condition with respect to f, it is easy to see that the polynomial p(\langle\mathbf{w}^{*},\mathbf{x}\rangle) satisfies the requirements. This completes the proof of Lemma 2.2.

B.5 Proof of Lemma 2.4

We start by bounding the Frobenius norm of \mathbf{T}^{m}, i.e., \|\mathbf{T}^{m}\|_{F}. Notice that \|\mathbf{T}^{m}\|_{F}\leq\max_{p}\operatorname*{\mathbf{E}}_{(\mathbf{x},y)\sim D}[\mathbf{1}(y=-1)p(\mathbf{x})], where the maximum is over all degree-m polynomials p:\mathbb{R}^{d}\to\mathbb{R} with \operatorname*{\mathbf{E}}_{\mathbf{x}\sim\mathcal{N}_{d}}[p(\mathbf{x})]=0 and \operatorname*{\mathbf{E}}_{\mathbf{x}\sim\mathcal{N}_{d}}[p(\mathbf{x})^{2}]=1. We leverage the following standard concentration fact for polynomials, known as the Bonami–Beckner inequality or simply Gaussian hypercontractivity.

Fact B.6 (Hypercontractivity Concentration Inequality [Bog98, Nel73]).

Consider any degree-m polynomial p:\mathbb{R}^{d}\to\mathbb{R} with unit variance with respect to \mathcal{N}_{d}. Then, for any \lambda\geq 0,

𝐏𝐫𝐱𝒩d[|p(𝐱)𝐄𝐮[p(𝐮)]|λ]e2e(cλ2)1/m,\operatorname*{\mathbf{Pr}}_{\mathbf{x}\sim\mathcal{N}_{d}}\left[\left|p(\mathbf{x})-\operatorname*{\mathbf{E}}_{\mathbf{u}}[p(\mathbf{u})]\right|\geq\lambda\right]\leq e^{2}e^{-{\left(c\lambda^{2}\right)}^{1/m}}\;,

where cc is an absolute constant.

Using Fact B.6, we have that for any zero-mean, unit-variance polynomial p,

𝐄(𝐱,y)D[𝟏(y=1)p(𝐱)]\displaystyle\operatorname*{\mathbf{E}}_{(\mathbf{x},y)\sim D}[\mathbf{1}(y=-1)p(\mathbf{x})] 𝐄(𝐱,y)D[𝟏(y=1)|p(𝐱)|]0min(γ,e2e(cλ2)1/m)dλ\displaystyle\leq\operatorname*{\mathbf{E}}_{(\mathbf{x},y)\sim D}[\mathbf{1}(y=-1)|p(\mathbf{x})|]\leq\int_{0}^{\infty}\min\left(\gamma,e^{2}e^{-{\left(c\lambda^{2}\right)}^{1/m}}\right)d\lambda
=O(γlog(1/γ)m/2).\displaystyle=O\left(\gamma\log(1/\gamma)^{m/2}\right)\;.

Therefore, 𝐓mF=O(γlog(1/γ)m/2)\|\mathbf{T}^{m}\|_{F}=O\left(\gamma\log(1/\gamma)^{m/2}\right).
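Unpacking the last integral bound (a sketch with absolute constants absorbed; the incomplete-Gamma step assumes u_{0}\geq m, i.e., \gamma\leq 2^{-\Omega(m)}):

```latex
% Split the integral at \lambda_0, the point where the two terms in the
% minimum are equal, i.e., e^2 e^{-(c\lambda_0^2)^{1/m}} = \gamma, so that
% \lambda_0 = c^{-1/2}\log(e^2/\gamma)^{m/2} = O(\log(1/\gamma))^{m/2}:
\int_{0}^{\infty}\min\left(\gamma,e^{2}e^{-(c\lambda^{2})^{1/m}}\right)d\lambda
 \leq \gamma\lambda_{0}
 + \int_{\lambda_{0}}^{\infty}e^{2}e^{-(c\lambda^{2})^{1/m}}\,d\lambda\;.
% Substituting u = (c\lambda^2)^{1/m} (so \lambda = c^{-1/2}u^{m/2} and
% d\lambda = (m/2) c^{-1/2} u^{m/2-1} du) turns the tail into an
% incomplete Gamma integral, which for u_0 = \log(e^2/\gamma) \geq m
% satisfies
\int_{\lambda_{0}}^{\infty}e^{2}e^{-(c\lambda^{2})^{1/m}}\,d\lambda
 = \frac{e^{2}m}{2\sqrt{c}}\int_{u_{0}}^{\infty}u^{m/2-1}e^{-u}\,du
 = O\!\left(u_{0}^{m/2-1}e^{-u_{0}}\right)
 = O\!\left(\gamma\log(1/\gamma)^{m/2}\right)\;,
% so both terms are O(\gamma \log(1/\gamma)^{m/2}), as claimed.
```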

To prove Property 1, note that \|\mathbf{T}^{\prime m}\|_{F}\leq\|\mathbf{T}^{m}\|_{F}+\tau/(4\sqrt{k})=O\left(\gamma\log(1/\gamma)^{m/2}\right)+\tau/(4\sqrt{k}). Hence, \mathbf{T}^{\prime m} can have at most \|\mathbf{T}^{\prime m}\|_{F}^{2}/\left(\tau/(4\sqrt{k})\right)^{2}=O(\gamma^{2}\log(1/\gamma)^{m}k/\tau^{2}+1) many singular values greater than \tau/(4\sqrt{k}), so \dim(V_{m})=O(\gamma^{2}\log(1/\gamma)^{m}k/\tau^{2}+1). Summing over m=1,\ldots,k, \dim(V) is at most O(\gamma^{2}\log(1/\gamma)^{k}k^{2}/\tau^{2}+k).

For Property 2, suppose for the sake of contradiction that \|\mathrm{proj}_{V}(\mathbf{v^{*}})\|_{2}=o\left(\tau/\left(\sqrt{k}\gamma\log(1/\gamma)^{k/2}\right)\right). Then, for any m, we have

𝐯𝐓m2\displaystyle\|{\mathbf{v}^{*}}^{\top}\mathbf{T}^{\prime m}\|_{2} 𝐯𝐓m2+τ/(4k)\displaystyle\leq\|{\mathbf{v}^{*}}^{\top}\mathbf{T}^{m}\|_{2}+\tau/(4\sqrt{k})
𝐓mFprojV(𝐯)2+τ/(4k)projV(𝐯)2+τ/(4k)(5/8)τ/k.\displaystyle\leq\|\mathbf{T}^{m}\|_{F}\|\mathrm{proj}_{V}(\mathbf{v}^{*})\|_{2}+\tau/(4\sqrt{k})\|\mathrm{proj}_{\perp V}(\mathbf{v}^{*})\|_{2}+\tau/(4\sqrt{k})\leq(5/8)\tau/\sqrt{k}\;.

However, since \operatorname*{\mathbf{E}}_{(\mathbf{x},y)\sim D}[\mathbf{1}(y=-1)p(\langle\mathbf{v^{*},x}\rangle)]=\sum_{m=1}^{k}a_{m}{\mathbf{v}^{*}}^{\top}\mathbf{T}^{m}{\mathbf{v}^{*}}^{\otimes{m-1}}\geq\tau for some coefficients with a_{1}^{2}+\cdots+a_{k}^{2}=1, there must be an m such that

|𝐯𝐓m𝐯m1|τ/k.\left|{\mathbf{v}^{*}}^{\top}\mathbf{T}^{m}{\mathbf{v}^{*}}^{\otimes{m-1}}\right|\geq\tau/\sqrt{k}\;.

Therefore, we can write

𝐯𝐓m2𝐯𝐓m2τ/(4k)τ/kτ/(4k)=(3/4)τ/k,\|{\mathbf{v}^{*}}^{\top}\mathbf{T}^{\prime m}\|_{2}\geq\|{\mathbf{v}^{*}}^{\top}\mathbf{T}^{m}\|_{2}-\tau/(4\sqrt{k})\geq\tau/\sqrt{k}-\tau/(4\sqrt{k})=(3/4)\tau/\sqrt{k}\;,

which contradicts 𝐯𝐓m2(5/8)τ/k\|{\mathbf{v}^{*}}^{\top}\mathbf{T}^{\prime m}\|_{2}\leq(5/8)\tau/\sqrt{k}. This completes the proof.

B.6 Proof of Proposition 2.1

The pseudocode for the algorithm establishing Proposition 2.1 is given in Algorithm 2.

Input: Sample access to a joint distribution DD of (𝐱,y)(\mathbf{x},y) supported on d×{±1}\mathbb{R}^{d}\times\{\pm 1\} with the marginal D𝐱=𝒩dD_{\mathbf{x}}=\mathcal{N}_{d}. Let ϵ,δ(0,1)\epsilon,\delta\in(0,1) and suppose DD satisfies the reliability condition with respect to an LTF f(𝐱)=sign(𝐰,𝐱t)f(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w}^{*},\mathbf{x}\rangle-t^{*}) with t=O(log(1/ϵ))t^{*}=O\left(\sqrt{\log(1/\epsilon)}\right) and 𝐏𝐫(𝐱,y)D[y=1]ϵ\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[y=-1]\geq\epsilon.
Output: With probability at least Ω(1)\Omega(1), the algorithm outputs a unit vector 𝐯\mathbf{v} such that 𝐯,𝐰=max(log(1/ϵ)O(t2),ϵO(1))\langle\mathbf{v},\mathbf{w}^{*}\rangle=\max\left(\log(1/\epsilon)^{-O({t^{*}}^{2})},\epsilon^{O(1)}\right) .
  1.

    Let c_{1} be a sufficiently large universal constant and c_{2} a sufficiently large universal constant depending on c_{1}. Let S be a set of N=d^{c_{2}{\max({t^{*}}^{2},1)}}\log(1/\delta)/\epsilon^{2} many samples from D. With probability 1/2, for each m=1,\cdots,c_{1}{t^{*}}^{2}, take

    𝐓m=𝐄(𝐱,y)uS[𝟏(y=1)𝐇m(𝐱)],\mathbf{T}^{\prime m}=\operatorname*{\mathbf{E}}_{(\mathbf{x},y)\sim_{u}S}[\mathbf{1}(y=-1)\mathbf{H}^{m}(\mathbf{x})]\;,

    to be the empirical Chow-tensor on negative samples. Otherwise, take

    𝐓m=𝐄(𝐱,y)uS[y𝐇m(𝐱)],\mathbf{T}^{\prime m}=\operatorname*{\mathbf{E}}_{(\mathbf{x},y)\sim_{u}S}[y\mathbf{H}^{m}(\mathbf{x})]\;,

    to be the empirical Chow-tensor on all samples. Let γ\gamma be the empirical estimation of 𝐏𝐫(𝐱,y)D[y=1]\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[y=-1] with error at most ϵ/100\epsilon/100 with a sufficiently small constant failure probability.

  2.

    Take 𝐓~md×dm1\tilde{\mathbf{T}}^{\prime m}\in\mathbb{R}^{d\times{d^{m-1}}} to be the flattened version of 𝐓m\mathbf{T}^{\prime m}, and let VmV_{m} be the subspace spanned by the left singular vectors of 𝐓~m\tilde{\mathbf{T}}^{\prime m} whose singular values are greater than 2ct2γ2^{-c{t^{*}}^{2}}\gamma where cc is a sufficiently large constant. Let VV be the union of all VmV_{m} and output 𝐯\mathbf{v} to be a random unit vector chosen from VV.

Algorithm 2 Finding a Direction with High Correlation.
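For intuition, in the degree m=1 case Algorithm 2 reduces to estimating the negative-label Chow vector \mathbf{T}^{1}=\operatorname*{\mathbf{E}}[\mathbf{1}(y=-1)\mathbf{x}] and normalizing it (no SVD is needed, as V_{1} is one-dimensional). A self-contained Python sketch on toy data (our construction: \mathbf{w}^{*}=\mathbf{e}_{1}, t^{*}=-1, and y=-1 exactly when \langle\mathbf{w}^{*},\mathbf{x}\rangle-t^{*}\leq 0, so the reliability condition holds):

```python
import random

def chow_direction(samples):
    """Degree-1 case of Algorithm 2: estimate the negative-label Chow
    vector T^1 = E[1(y = -1) x] empirically and return the unit vector
    in its direction.  (The general algorithm uses the order-m Chow
    tensors and an SVD; for m = 1 the subspace V_1 is one-dimensional.)"""
    d = len(samples[0][0])
    chow = [0.0] * d
    for x, y in samples:
        if y == -1:
            for i in range(d):
                chow[i] += x[i] / len(samples)
    norm = sum(c * c for c in chow) ** 0.5
    return [c / norm for c in chow]

# toy data satisfying the reliability condition with w* = e_1, t* = -1
random.seed(0)
samples = []
for _ in range(20000):
    x = [random.gauss(0.0, 1.0) for _ in range(3)]
    samples.append((x, -1 if x[0] <= -1.0 else 1))
v = chow_direction(samples)  # close to -e_1, so |<v, w*>| is large
```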

We can now prove Proposition 2.1.

Proof of Proposition 2.1.

We first introduce the following fact about learning Chow tensors.

Fact B.7.

Fix m\in\mathbb{Z}_{+} and \epsilon,\delta\in(0,1). Let D be a distribution on \mathbb{R}^{d}\times[-1,1] with standard normal \mathbf{x}-marginal. There is an algorithm that, with N=d^{O(m)}\log(1/\delta)/\epsilon^{2} samples and \mathrm{poly}(d,N) runtime, outputs an approximation \mathbf{T}^{\prime m} of the order-m Chow-parameter tensor \mathbf{T}^{m}=\operatorname*{\mathbf{E}}_{(\mathbf{x},y)\sim D}[y\mathbf{H}^{m}(\mathbf{x})] such that with probability 1-\delta it holds that \|\mathbf{T}^{\prime m}-\mathbf{T}^{m}\|_{F}\leq\epsilon.

Proof.

Let \mathbf{T}^{\prime m} be the empirical estimate of \mathbf{T}^{m} using N=d^{cm}\log(1/\delta)/\epsilon^{2} samples, for a sufficiently large universal constant c. We start by showing that \operatorname*{\mathbf{E}}_{(\mathbf{x},y)\sim D}\left[\|y\mathbf{H}^{m}(\mathbf{x})\|_{F}^{2}\right]\leq d^{m}. Notice that,

\operatorname*{\mathbf{E}}_{(\mathbf{x},y)\sim D}\left[\|y\mathbf{H}^{m}(\mathbf{x})\|_{F}^{2}\right]=\operatorname*{\mathbf{E}}_{(\mathbf{x},y)\sim D}\left[y^{2}\|\mathbf{H}^{m}(\mathbf{x})\|_{F}^{2}\right]\leq\operatorname*{\mathbf{E}}_{(\mathbf{x},y)\sim D}\left[\|\mathbf{H}^{m}(\mathbf{x})\|_{F}^{2}\right]=\operatorname*{\mathbf{E}}_{\mathbf{x}\sim\mathcal{N}_{d}}\left[\|\mathbf{H}^{m}(\mathbf{x})\|_{F}^{2}\right]\;.

Notice that each entry of \mathbf{H}^{m}(\mathbf{x}) is of the form \beta h(\mathbf{x}), where \beta\in[0,1] and h(\mathbf{x}) is a unit-variance Hermite polynomial. Therefore, \operatorname*{\mathbf{E}}_{\mathbf{x}\sim\mathcal{N}_{d}}\left[\|\mathbf{H}^{m}(\mathbf{x})\|_{F}^{2}\right]\leq d^{m} and thus \operatorname*{\mathbf{E}}_{(\mathbf{x},y)\sim D}\left[\|y\mathbf{H}^{m}(\mathbf{x})\|_{F}^{2}\right]\leq d^{m}.

Using the above, we have

𝐄(𝐱,y)D[y𝐇m(𝐱)𝐓mF2]=𝐄(𝐱,y)D[y𝐇m(𝐱)F2]𝐓mF2dm.\operatorname*{\mathbf{E}}_{(\mathbf{x},y)\sim D}\left[\|y\mathbf{H}^{m}(\mathbf{x})-\mathbf{T}^{m}\|_{F}^{2}\right]=\operatorname*{\mathbf{E}}_{(\mathbf{x},y)\sim D}\left[\|y\mathbf{H}^{m}(\mathbf{x})\|_{F}^{2}\right]-\|\mathbf{T}^{m}\|_{F}^{2}\leq d^{m}\;.

Then, applying Chebyshev’s inequality shows that \|\mathbf{T}^{\prime m}-\mathbf{T}^{m}\|_{F}\leq\epsilon with constant probability, and standard amplification (taking a coordinate-wise median of O(\log(1/\delta)) independent estimates) boosts the confidence to 1-\delta. ∎
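The \log(1/\delta) factor in the sample bound of Fact B.7 comes from standard confidence amplification: Chebyshev makes a single empirical mean \epsilon-accurate with probability, say, 2/3, and a median of O(\log(1/\delta)) independent estimates is then \epsilon-accurate with probability 1-\delta. A scalar Python sketch of this median trick (for tensors, one applies it coordinate-wise or to whole estimates):

```python
import random
import statistics

def median_of_means(draw, n_per_batch, n_batches):
    """Confidence amplification behind the log(1/delta) factor in
    Fact B.7: by Chebyshev, each batch mean is eps-accurate with
    probability >= 2/3, so the median of O(log(1/delta)) independent
    batch means is eps-accurate with probability >= 1 - delta."""
    means = [sum(draw() for _ in range(n_per_batch)) / n_per_batch
             for _ in range(n_batches)]
    return statistics.median(means)

random.seed(1)
est = median_of_means(lambda: random.gauss(5.0, 1.0), 200, 15)
```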

We prove Proposition 2.1 in two cases: (a) \max(\log(1/\epsilon)^{-O({t^{*}}^{2})},\epsilon^{O(1)})=\log(1/\epsilon)^{-O({t^{*}}^{2})}; and (b) \max(\log(1/\epsilon)^{-O({t^{*}}^{2})},\epsilon^{O(1)})=\epsilon^{O(1)}. We first prove Case (a), as the other case is similar. For Case (a), it suffices to prove that \langle\mathbf{v},\mathbf{w}^{*}\rangle=\log(1/\epsilon)^{-O({t^{*}}^{2})} holds with \Omega(1) probability, given that the algorithm chooses \mathbf{T}^{\prime m}=\operatorname*{\mathbf{E}}_{(\mathbf{x},y)\sim_{u}S}[\mathbf{1}(y=-1)\mathbf{H}^{m}(\mathbf{x})] (which happens with probability 1/2). Let k=c_{1}{t^{*}}^{2} and \tau/(4\sqrt{k})=2^{-c{t^{*}}^{2}}\gamma. By Fact B.7, we have \|\mathbf{T}^{\prime m}-\mathbf{T}^{m}\|_{F}\leq\tau/(4\sqrt{k}). Then, from Lemma 2.2 and Lemma 2.4, if we take \mathbf{v} to be a random vector in V, then with constant probability,

𝐯,𝐰\displaystyle\langle\mathbf{v},\mathbf{w}^{*}\rangle =Ω(τ/(kγlog(1/γ)k/2))dim(V)1/2=Ω(τ2/(log(1/γ)kk3/2γ2))\displaystyle=\frac{\Omega\left(\tau/\left(\sqrt{k}\gamma\log(1/\gamma)^{k/2}\right)\right)}{\dim(V)^{1/2}}=\Omega\left(\tau^{2}/\left(\log(1/\gamma)^{k}k^{3/2}\gamma^{2}\right)\right)
=Ω(log(1/γ)O(t2)k3/2).\displaystyle=\Omega\left(\log(1/\gamma)^{-O({t^{*}}^{2})}{k}^{-3/2}\right)\;.

Since 𝐏𝐫(𝐱,y)D[y=1]ϵ\operatorname*{\mathbf{Pr}}_{(\mathbf{x},y)\sim D}[y=-1]\geq\epsilon, we have γ=Ω(ϵ)\gamma=\Omega(\epsilon). Also, considering k=O(max{t2,1})=O(log(1/ϵ))k=O(\max\{{t^{*}}^{2},1\})=O\left(\log(1/\epsilon)\right), we get 𝐯,𝐰=log(1/ϵ)O(t2)\langle\mathbf{v},\mathbf{w}^{*}\rangle=\log(1/\epsilon)^{-O({t^{*}}^{2})}.

For Case (b), the proof remains the same, except that we take \mathbf{T}^{\prime m}=\operatorname*{\mathbf{E}}_{(\mathbf{x},y)\sim_{u}S}[y\mathbf{H}^{m}(\mathbf{x})] and use Fact B.8 below instead of Lemma 2.4.

Fact B.8 (Lemma 5.10 in [DKK+22]).

Let D be the joint distribution of (\mathbf{x},y) supported on \mathbb{R}^{d}\times\{\pm 1\} with marginal D_{\mathbf{x}}=\mathcal{N}_{d}. Let p:\mathbb{R}\mapsto\mathbb{R} be a univariate, mean-zero, unit-variance polynomial of degree k such that for some unit vector \mathbf{v}^{*}\in\mathbb{R}^{d} it holds that \operatorname*{\mathbf{E}}_{(\mathbf{x},y)\sim D}[yp(\langle\mathbf{v}^{*},\mathbf{x}\rangle)]\geq\tau for some \tau\in(0,1]. Let \mathbf{T}^{\prime m} be an approximation of the order-m Chow-parameter tensor \mathbf{T}^{m}=\operatorname*{\mathbf{E}}_{(\mathbf{x},y)\sim D}[y\mathbf{H}^{m}(\mathbf{x})] such that \|\mathbf{T}^{\prime m}-\mathbf{T}^{m}\|_{F}\leq\tau/(4\sqrt{k}). Denote by V_{m} the subspace spanned by the left singular vectors of the flattened \mathbf{T}^{\prime m} whose singular values are greater than \tau/(4\sqrt{k}), and denote by V the span of V_{1}\cup\cdots\cup V_{k}. Then we have that

  1. \dim(V)\leq 4k/\tau^{2}, and

  2. \|\mathrm{proj}_{V}(\mathbf{v}^{*})\|_{2}\geq\tau/\left(4\sqrt{k}\right).

The same argument then gives \langle\mathbf{v},\mathbf{w}^{*}\rangle=\epsilon^{O(1)}. The algorithm described above uses at most N=d^{O({t^{*}}^{2})}/\epsilon^{2} samples and \mathrm{poly}\left(N,1/\epsilon\right) runtime. This completes the proof of Proposition 2.1. ∎
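For intuition, the SVD-based subspace extraction in Fact B.8 can be sketched numerically as follows. This is a minimal illustration, not the paper's implementation: the target halfspace \mathrm{sign}(x_{1}-1), the sample size, and the restriction to m=2 (where the relevant order-2 statistic is \operatorname*{\mathbf{E}}[y(\mathbf{x}\mathbf{x}^{\top}-I)], the degree-2 Hermite component up to normalization) are all assumptions made for the example.

```python
# Hedged sketch of the subspace-extraction step in Fact B.8 (m = 2 only):
# keep the left singular vectors of the estimated Chow matrix whose singular
# values exceed tau/(4 sqrt(k)); a random unit vector in the resulting
# subspace V then has nontrivial correlation with w*.  All parameters below
# are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
d, k, tau = 6, 2, 0.5
thresh = tau / (4 * np.sqrt(k))          # singular-value cutoff tau/(4 sqrt(k))

w_star = np.zeros(d); w_star[0] = 1.0    # hypothetical w*, with threshold t* = 1
X = rng.standard_normal((20000, d))
y = np.sign(X @ w_star - 1.0)

# Empirical order-2 Chow matrix E[y (x x^T - I)] (degree-2 Hermite component,
# up to normalization); for this biased halfspace it concentrates on w* w*^T.
T2 = (y[:, None, None] * (X[:, :, None] * X[:, None, :])).mean(axis=0) \
     - y.mean() * np.eye(d)

U, s, _ = np.linalg.svd(T2)
V = U[:, s > thresh]                     # basis of the extracted subspace
v = V @ rng.standard_normal(V.shape[1])
v /= np.linalg.norm(v)                   # random unit vector in V
corr = abs(v @ w_star)                   # nontrivial correlation with w*
```

Note that a zero-threshold halfspace would make this order-2 statistic vanish by symmetry, which is why the example uses t^{*}=1; the full procedure unions the subspaces over all orders m\leq k.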

B.7 Proof of Lemma 2.5

Item 1 follows from the definition of D^{\prime} and the fact that D has marginal distribution D_{\mathbf{x}}=\mathcal{N}_{d}.

To prove Item 2, we consider two cases: t^{*}>0 and t^{*}\leq 0. For the case t^{*}\leq 0, we prove that the distribution D^{\prime} satisfies the reliability condition with respect to h^{\prime}(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w}^{\prime},\mathbf{x}\rangle), where \mathbf{w}^{\prime}={\mathbf{w}^{*}}^{\perp\mathbf{w}}/\|{\mathbf{w}^{*}}^{\perp\mathbf{w}}\|_{2}. Notice that for any (\mathbf{x},y) such that \mathbf{x}\in B and h^{\prime}(\mathbf{x}^{\perp\mathbf{w}})=1, we have

f(\mathbf{x}) =\mathrm{sign}(\langle\mathbf{w}^{*},\mathbf{x}\rangle-t^{*})
=\mathrm{sign}(\langle\mathrm{proj}_{\perp\mathbf{w}}(\mathbf{w}^{*}),\mathbf{x}\rangle+\langle\mathrm{proj}_{\mathbf{w}}(\mathbf{w}^{*}),\mathbf{x}\rangle-t^{*})