

Downsampling for Testing and Learning in Product Distributions

Nathaniel Harms
University of Waterloo, Canada
[email protected]
Partly supported by NSERC. Much of this research was done while the author was visiting the National Institute of Informatics, Japan.
   Yuichi Yoshida
National Institute of Informatics, Japan
[email protected]
Supported by JSPS KAKENHI Grant Number JP17H04676 and 18H05291.
Abstract

We study distribution-free property testing and learning problems where the unknown probability distribution is a product distribution over \mathbb{R}^{d}. For many important classes of functions, such as intersections of halfspaces, polynomial threshold functions, convex sets, and k-alternating functions, the known algorithms either have complexity that depends on the support size of the distribution, or are proven to work only for specific examples of product distributions. We introduce a general method, which we call downsampling, that resolves these issues. Downsampling uses a notion of "rectilinear isoperimetry" for product distributions, which further strengthens the connection between isoperimetry, testing, and learning. Using this technique, we attain new efficient distribution-free algorithms under product distributions on \mathbb{R}^{d}:

  1. A simpler proof for non-adaptive, one-sided monotonicity testing of functions [n]^{d}\to\{0,1\}, and improved sample complexity for testing monotonicity over unknown product distributions, from O(d^{7}) [Black, Chakrabarty, & Seshadhri, SODA 2020] to \widetilde{O}(d^{3}).

  2. Polynomial-time agnostic learning algorithms for functions of a constant number of halfspaces, and for constant-degree polynomial threshold functions.

  3. An \mathrm{exp}\left(O(d\log(dk))\right)-time agnostic learning algorithm, and an \mathrm{exp}\left(O(d\log(dk))\right)-sample tolerant tester, for functions of k convex sets; and a 2^{\widetilde{O}(d)}-sample one-sided tester for convex sets.

  4. An \mathrm{exp}\left(\widetilde{O}(k\sqrt{d})\right)-time agnostic learning algorithm for k-alternating functions, and a sample-based tolerant tester with the same complexity.

1 Introduction

In property testing and learning, the goal is to design algorithms that use as little information as possible about the input while still being correct (with high probability). This includes using as little information as possible about the probability distribution against which correctness is measured. Information about the probability distribution could be in the form of guarantees on this distribution (e.g. it is guaranteed to be uniform, or Gaussian), or in the form of samples from the distribution. So we want to minimize the requirements on this distribution, as well as the number of samples used by the algorithm.

Progress on high-dimensional property testing and learning problems is usually made by studying algorithms for the uniform distribution over the hypercube \{\pm 1\}^{d}, or the standard Gaussian distribution over \mathbb{R}^{d}, as the simplest case. For example, efficiently learning intersections of halfspaces is a major open problem in learning theory [DKS18, KOS04], and progress on this problem has been made by studying the uniform distribution over the hypercube \{\pm 1\}^{d} and the Gaussian as special cases [KKMS08, KOS08, Vem10a]. Another important example is the class of degree-k polynomial threshold functions (PTFs). Unlike intersections of halfspaces, these can be efficiently learned in the PAC model [KOS04], but agnostic learning is more challenging. (See the Glossary in Appendix B for standard definitions in learning theory and property testing.) Again, progress has been made by studying the hypercube [DHK+10]. An even more extreme example is the class of convex sets, which are not learnable in the distribution-free PAC model, because they have infinite VC dimension, but which become learnable under the Gaussian [KOS08]. The uniform distribution over the hypercube and the Gaussian are both examples of product distributions, so the next natural question to ask is: can these results be generalized to any unknown product distribution? A partial answer was given by Blais, O'Donnell, & Wimmer [BOW10] for some of these classes; in this paper we resolve this question.

Similar examples appear in the property testing literature. Distribution-free property testing and testing functions with domain \mathbb{R}^{d} are emerging trends in the field (e.g. [BBBY12, DMN19, Har19, CDS20, FY20, BFPJH21]). Testing monotonicity is one of the most well-studied problems in property testing, and recent work [BCS20] has extended this study to product distributions over domain \mathbb{R}^{d}. Work of Chakrabarty & Seshadhri [CS16], Khot, Minzer, & Safra [KMS18], and Black, Chakrabarty, & Seshadhri [BCS18, BCS20] has resulted in efficient o(d)-query algorithms for the hypercube \{\pm 1\}^{d} [KMS18] and the hypergrid [n]^{d}. Black, Chakrabarty, & Seshadhri [BCS20] showed that testing monotonicity over unknown product distributions on \mathbb{R}^{d} can be done with \widetilde{O}(d^{5/6}) queries and O(d^{7}) samples. Their "domain reduction" method is intricate and specialized for the problem of testing monotonicity. We improve the sample complexity to \widetilde{O}(d^{3}) using a much simpler proof. (An early version of this paper proved a weaker result, with two-sided error and worse sample complexity.) We also generalize the testers of [CFSS17, CGG+19] for convex sets and k-alternating functions, respectively, and provide new testers for arbitrary functions of convex sets.

This paper provides a general framework for designing distribution-free testing and learning algorithms under product distributions on \mathbb{R}^{d}, which may be finite or continuous. An algorithm is distribution-free under product distributions if it does not require any prior knowledge of the probability distribution, except the guarantee that it is a product distribution. The technique in this paper, which we call downsampling, improves upon previous methods (in particular, [BCS20, BOW10]) in a few ways. It is more general: it does not apply only to a specific type of algorithm [BOW10] or a specific problem [BCS20], and we use it to obtain many other results. It is conceptually simpler. And it allows quantitative improvements over both [BOW10] and [BCS20].

Organization.

In Section 1.1, we describe the main results of this paper in the context of related work. In Section 1.2, we briefly describe the main techniques of the paper. Section 2 presents the definitions and lemmas required by the main results. The following sections present the proofs of the results. For simplicity, the main body of the paper handles only continuous product distributions; finite distributions are handled in Appendix A. Definitions of standard terminology in property testing and machine learning can be found in the Glossary in Appendix B.

1.1 Results

See Table 1 for a summary of our results on property testing, and Table 2 for a summary of our results on learning.

1.1.1 Testing Monotonicity

Testing monotonicity is the problem of testing whether an unknown function f:X\to\{0,1\} is monotone, where X is a partial order. It is one of the most commonly studied problems in the field of property testing. Previous work on this problem has mostly focused on uniform probability distributions (exceptions include [AC06, HK07, CDJS17, BFPJH21]) and finite domains. However, there is growing interest in property testing for functions on domain \mathbb{R}^{d} ([BBBY12, DMN19, Har19, CDS20, FY20, BFPJH21]), and [BCS20] generalized the problem to this domain.

Testing monotonicity under product distributions has been studied a few times. Ailon & Chazelle [AC06] gave a distribution-free monotonicity tester for real-valued functions under product distributions on [n]^{d}, with query complexity O(\tfrac{1}{\epsilon}d2^{d}\log n). Chakrabarty et al. [CDJS17] improved this to O(\tfrac{1}{\epsilon}d\log n) and gave a matching lower bound. This lower bound applies to the real-valued case. For the boolean-valued case, monotonicity testers under the uniform distribution on \{\pm 1\}^{d} [CS16, KMS18] and [n]^{d} [BCS18, BCS20] are known with query complexity o(d). In [BCS20], an o(d)-query tester was given for domain \mathbb{R}^{d}. That paper showed that there is a one-sided, non-adaptive, distribution-free monotonicity tester under product distributions on \mathbb{R}^{d}, with query complexity O\left(\frac{d^{5/6}}{\epsilon^{4/3}}\operatorname{poly}\log(d/\epsilon)\right) and sample complexity O((d/\epsilon)^{7}). In this paper we improve the sample complexity to \widetilde{O}((d/\epsilon)^{3}), while greatly simplifying the proof.

Theorem.

There is a one-sided, non-adaptive \epsilon-tester for monotonicity of functions \mathbb{R}^{d}\to\{0,1\} that is distribution-free under (finite or continuous) product distributions, using

O\left(\frac{d^{5/6}}{\epsilon^{4/3}}\operatorname{poly}\log(d/\epsilon)\right)

queries and O(\frac{d^{3}}{\epsilon^{3}}\log(d/\epsilon)) samples.

The main result of [BCS20] is a "domain reduction" theorem, allowing a change of domain from [n]^{d} to [r]^{d} where r=\operatorname{poly}(d/\epsilon); by applying this theorem together with their earlier \widetilde{O}(\tfrac{d^{5/6}}{\epsilon^{4/3}}\operatorname{poly}\log(dn))-query tester for the uniform distribution on [n]^{d}, they obtain a tester for monotone functions with query complexity independent of n. Our result replaces this domain reduction method with a simpler and more general 2-page argument, and gives a different generalization to the distribution-free case. See Section 3 for the proofs.

Table 1: Testing results.

1-Sided Testing Monotonicity (query model):
  \mathsf{unif}(\{\pm 1\}^{d}): \widetilde{O}\left(\frac{\sqrt{d}}{\epsilon^{2}}\right) [KMS18]
  \mathsf{unif}([n]^{d}): \widetilde{O}\left(\frac{d^{5/6}}{\epsilon^{4/3}}\right) [BCS20]
  Gaussian: \widetilde{O}\left(\frac{d^{5/6}}{\epsilon^{4/3}}\right) [BCS20]
  \forall products: \widetilde{O}\left(\frac{d^{5/6}}{\epsilon^{4/3}}\right) queries, \widetilde{O}\left(\left(\frac{d}{\epsilon}\right)^{3}\right) samples (Thm. 1.1.1)

1-Sided Testing Convex Sets (sample model):
  Gaussian: \left(\frac{d}{\epsilon}\right)^{(1+o(1))d}, lower bound 2^{\Omega(d)} [CFSS17]
  \forall products: \left(\frac{d}{\epsilon}\right)^{(1+o(1))d} (Thm. 1.1.4)

Tolerant Testing Functions of k Convex Sets (sample model):
  \forall products: \left(\frac{dk}{\epsilon}\right)^{O(d)} (Thm. 1.1.4)

Tolerant Testing k-Alternating Functions (sample model):
  \mathsf{unif}([n]^{d}): \left(\frac{dk}{\tau}\right)^{O\left(\frac{k\sqrt{d}}{\tau^{2}}\right)}, \tau=\epsilon_{2}-3\epsilon_{1} [CGG+19]
  \forall products: \left(\frac{dk}{\tau}\right)^{O\left(\frac{k\sqrt{d}}{\tau^{2}}\right)}, \tau=\epsilon_{2}-\epsilon_{1} (Thm. 1.1.5)
Table 2: Learning results. All learning algorithms are agnostic except that of [Vem10a]. The PTF result for the Gaussian follows from the two cited works but is not stated in either. All statements are informal; see references for restrictions and qualifications. For PTFs, \psi(k,\epsilon):=\min\left\{O(\epsilon^{-2^{k+1}}),2^{O(k^{2})}\left(\log(1/\epsilon)/\epsilon^{2}\right)^{4k+2}\right\}.

Functions of k Convex Sets:
  \mathsf{unif}(\{\pm 1\}^{d}): \Omega(2^{d})
  Gaussian: d^{O\left(\frac{\sqrt{d}}{\epsilon^{4}}\right)}, lower bound 2^{\Omega(\sqrt{d})} [KOS08]
  \forall products: O\left(\frac{1}{\epsilon^{2}}\left(\frac{6dk}{\epsilon}\right)^{d}\right) (Thm. 1.1.4)

Functions of k Halfspaces:
  \mathsf{unif}(\{\pm 1\}^{d}): d^{O\left(\frac{k^{2}}{\epsilon^{4}}\right)} [KKMS08]
  \mathsf{unif}([n]^{d}): (dn)^{O\left(\frac{k^{2}}{\epsilon^{4}}\right)} [BOW10]
  Gaussian: d^{O\left(\frac{\log k}{\epsilon^{4}}\right)}, \operatorname{poly}\left(d,\left(\frac{k}{\epsilon}\right)^{k}\right) [KOS08, Vem10a] (intersections only)
  \forall products: \left(\frac{dk}{\epsilon}\right)^{O\left(\frac{k^{2}}{\epsilon^{4}}\right)} (Thm. 1.1.2)

Degree-k PTFs:
  \mathsf{unif}(\{\pm 1\}^{d}): d^{\psi(k,\epsilon)} [DHK+10]
  \mathsf{unif}([n]^{d}): (dn)^{\psi(k,\epsilon)} [DHK+10, BOW10]
  Gaussian: d^{\psi(k,\epsilon)} [DHK+10, BOW10]
  \forall products: \left(\frac{dk}{\epsilon}\right)^{\psi(k,\epsilon)} (Thm. 1.1.3)

k-Alternating Functions:
  \mathsf{unif}(\{\pm 1\}^{d}): 2^{\Theta\left(\frac{k\sqrt{d}}{\epsilon}\right)} [BCO+15]
  \mathsf{unif}([n]^{d}): \left(\frac{dk}{\tau}\right)^{O\left(\frac{k\sqrt{d}}{\tau^{2}}\right)} (testing) [CGG+19]
  \forall products: \left(\frac{dk}{\epsilon}\right)^{O\left(\frac{k\sqrt{d}}{\epsilon^{2}}\right)} (Thm. 1.1.5)

1.1.2 Learning Functions of Halfspaces

Intersections of k halfspaces have VC dimension \Theta(dk\log k) [BEHW89, CMK19], so the sample complexity of learning is known, but it is not possible to efficiently find k halfspaces whose intersection is correct on the sample, unless \mathsf{P}=\mathsf{NP} [BR92]. Therefore the goal is to find efficient "improper" algorithms that output a function other than an intersection of k halfspaces. Several learning algorithms for intersections of k halfspaces actually work for arbitrary functions of k halfspaces. We write \mathcal{B}_{k} for the set of all functions \{0,1\}^{k}\to\{0,1\}, and for any class \mathcal{F} of functions we write \mathcal{B}_{k}\circ\mathcal{F} for the set of all functions x\mapsto g(f_{1}(x),\dotsc,f_{k}(x)) where g\in\mathcal{B}_{k} and each f_{i}\in\mathcal{F}. Then, for \mathcal{H} the class of halfspaces, Klivans, O'Donnell, & Servedio [KOS04] gave a (non-agnostic) learning algorithm for \mathcal{B}_{k}\circ\mathcal{H} over the uniform distribution on \{\pm 1\}^{d} with complexity d^{O(k^{2}/\epsilon^{2})}, and Kalai, Klivans, Mansour, & Servedio [KKMS08] presented an agnostic algorithm with complexity d^{O(k^{2}/\epsilon^{4})} in the same setting using "polynomial regression".

Polynomial regression is a powerful technique, so it is important to understand when it can be applied. Blais, O'Donnell, & Wimmer [BOW10] studied how to generalize it to arbitrary product distributions. With their method, they obtained an agnostic learning algorithm for \mathcal{B}_{k}\circ\mathcal{H} with complexity (dn)^{O(k^{2}/\epsilon^{4})} for product distributions X_{1}\times\dotsm\times X_{d} where each |X_{i}|=n, and complexity d^{O(k^{2}/\epsilon^{4})} for "polynomially bounded" continuous distributions. This is not a complete generalization because, for example, on the grid [n]^{d} its complexity depends on n. This prevents a full generalization to the domain \mathbb{R}^{d}. Their algorithm also requires some prior knowledge of the support or support size. We use a different technique and fully generalize the polynomial regression algorithm to arbitrary product distributions. See Section 5 for the proof.

Theorem.

There is a distribution-free, improper agnostic learning algorithm for \mathcal{B}_{k}\circ\mathcal{H} under (continuous or finite) product distributions over \mathbb{R}^{d}, with time complexity

\min\left\{\left(\frac{dk}{\epsilon}\right)^{O\left(\frac{k^{2}}{\epsilon^{4}}\right)},O\left(\frac{1}{\epsilon^{2}}\left(\frac{3dk}{\epsilon}\right)^{d}\right)\right\}\,.

1.1.3 Learning Polynomial Threshold Functions

Degree-k PTFs are another generalization of halfspaces. A function f:\mathbb{R}^{d}\to\{\pm 1\} is a degree-k PTF if there is a degree-k polynomial p:\mathbb{R}^{d}\to\mathbb{R} such that f(x)=\operatorname{sign}(p(x)). Degree-k PTFs can be PAC learned in time d^{O(k)} using linear programming [KOS04], but agnostic learning is more challenging. Diakonikolas et al. [DHK+10] previously gave an agnostic learning algorithm for degree-k PTFs in the uniform distribution over \{\pm 1\}^{d} with time complexity d^{\psi(k,\epsilon)}, where

\psi(k,\epsilon):=\min\left\{O(\epsilon^{-2^{k+1}}),2^{O(k^{2})}\left(\log(1/\epsilon)/\epsilon^{2}\right)^{4k+2}\right\}\,.

The main result of that paper is an upper bound on the noise sensitivity of PTFs. Combined with the reduction of Blais et al. [BOW10], this implies an algorithm for the uniform distribution over [n]^{d} with complexity (dn)^{\psi(k,\epsilon)} and for the Gaussian distribution with complexity d^{\psi(k,\epsilon)}.

Our agnostic learning algorithm for degree-k PTFs eliminates the dependence on n and works for any unknown product distribution over \mathbb{R}^{d}, while matching the complexity of [DHK+10] for the uniform distribution over the hypercube. See Section 6 for the proof.

Theorem.

There is an improper agnostic learning algorithm for degree-k PTFs under (finite or continuous) product distributions over \mathbb{R}^{d}, with time complexity

\min\left\{\left(\frac{kd}{\epsilon}\right)^{\psi(k,\epsilon)}\;,\;O\left(\frac{1}{\epsilon^{2}}\left(\frac{9dk}{\epsilon}\right)^{d}\right)\right\}\,.

1.1.4 Testing & Learning Convex Sets

One of the first properties (sets) of functions \mathbb{R}^{d}\to\{0,1\} to be studied in the property testing literature is the set of indicator functions of convex sets, i.e. functions f:\mathbb{R}^{d}\to\{0,1\} where f^{-1}(1) is convex. Write \mathcal{C} for this class of functions. This problem has been studied in various models of testing [Ras03, RV04, CFSS17, BMR19, BB20]. In this paper we consider the sample-based model of testing, where the tester receives only random examples of the function and cannot make queries. This model of testing has received a lot of recent attention (e.g. [BBBY12, BMR19, BY19, CFSS17, GR16, Har19, RR21, BFPJH21]), partly because it matches the standard sample-based model for learning algorithms.

Chen et al. [CFSS17] gave a sample-based tester for \mathcal{C} under the Gaussian distribution on \mathbb{R}^{d} with one-sided error and sample complexity (d/\epsilon)^{O(d)}, along with a lower bound (for one-sided testers) of 2^{\Omega(d)}. We match their upper bound while generalizing the tester to be distribution-free under product distributions. See Section 4 for proofs.

Theorem.

There is a sample-based distribution-free one-sided \epsilon-tester for \mathcal{C} under (finite or continuous) product distributions that uses at most O\left(\left(\frac{6d}{\epsilon}\right)^{d}\right) samples.

A more powerful kind of tester is an (\epsilon_{1},\epsilon_{2})-tolerant tester, which must accept (with high probability) any function that is \epsilon_{1}-close to the property, while rejecting functions that are \epsilon_{2}-far. Tolerantly testing convex sets has been studied by [BMR16] for the uniform distribution over the 2-dimensional grid, but not (to the best of our knowledge) in higher dimensions. We obtain a sample-based tolerant tester (and distance approximator) for convex sets in high dimension. In fact, recall that \mathcal{B}_{k} is the set of all functions \{0,1\}^{k}\to\{0,1\}, and let \mathcal{B}^{\prime}\subset\mathcal{B}_{k} be any subset, so that \mathcal{B}^{\prime}\circ\mathcal{C} is an arbitrary property of functions of convex sets. We obtain a distance approximator for any such property:

Theorem.

Let \mathcal{B}^{\prime}\subset\mathcal{B}_{k}. There is a sample-based distribution-free algorithm under (finite or continuous) product distributions that approximates distance to \mathcal{B}^{\prime}\circ\mathcal{C} up to additive error \epsilon using O\left(\frac{1}{\epsilon^{2}}\left(\frac{3dk}{\epsilon}\right)^{d}\right) samples. Setting \epsilon=(\epsilon_{2}-\epsilon_{1})/2, we obtain an (\epsilon_{1},\epsilon_{2})-tolerant tester with sample complexity O\left(\frac{1}{(\epsilon_{2}-\epsilon_{1})^{2}}\left(\frac{6dk}{\epsilon_{2}-\epsilon_{1}}\right)^{d}\right).

General distribution-free learning of convex sets is not possible, since this class has infinite VC dimension. However, convex sets can be learned under the Gaussian distribution. Non-agnostic learning under the Gaussian was studied by Vempala [Vem10a, Vem10b]. Agnostic learning under the Gaussian was studied by Klivans, O'Donnell, & Servedio [KOS08], who presented a learning algorithm with complexity d^{O(\sqrt{d}/\epsilon^{4})} and a lower bound of 2^{\Omega(\sqrt{d})}.

Unlike the Gaussian case, there is a trivial lower bound of \Omega(2^{d}) for arbitrary product distributions, because any function f:\{\pm 1\}^{d}\to\{0,1\} belongs to this class. However, unlike the general distribution-free case, we show that convex sets (or any functions of convex sets) can be learned under unknown product distributions.

Theorem.

There is an agnostic learning algorithm for \mathcal{B}_{k}\circ\mathcal{C} under (finite or continuous) product distributions over \mathbb{R}^{d}, with time complexity O\left(\frac{1}{\epsilon^{2}}\cdot\left(\frac{6dk}{\epsilon}\right)^{d}\right).

1.1.5 Testing & Learning kk-Alternating Functions

A k-alternating function f:X\to\{\pm 1\} on a partial order X is one where, on any chain x_{1}<\cdots<x_{m} in X, f changes value at most k times. Learning k-alternating functions on domain \{\pm 1\}^{d} was studied by Blais et al. [BCO+15], motivated by the fact that these functions are computed by circuits with few negation gates. They show that 2^{\Theta(k\sqrt{d}/\epsilon)} samples are necessary and sufficient in this setting. Canonne et al. [CGG+19] later obtained an algorithm for (\epsilon_{1},\epsilon_{2})-tolerant testing of k-alternating functions, when \epsilon_{2}>3\epsilon_{1}, in the uniform distribution over [n]^{d}, with query complexity (kd/\tau)^{O(k\sqrt{d}/\tau^{2})}, where \tau=\epsilon_{2}-3\epsilon_{1}.

We obtain an agnostic learning algorithm for k-alternating functions that matches the query complexity of the tester in [CGG+19], and nearly matches the complexity of the (non-agnostic) learning algorithm of [BCO+15] for the uniform distribution over the hypercube. See Section 7 for proofs.

Theorem.

There is an agnostic learning algorithm for k-alternating functions under (finite or continuous) product distributions over \mathbb{R}^{d} that runs in time at most

\min\left\{\left(\frac{dk}{\epsilon}\right)^{O\left(\frac{k\sqrt{d}}{\epsilon^{2}}\right)},O\left(\frac{1}{\epsilon^{2}}\left(\frac{3kd}{\epsilon}\right)^{d}\right)\right\}\,.

We also generalize the tolerant tester of [CGG+19] to be distribution-free under product distributions, and eliminate the condition \epsilon_{2}>3\epsilon_{1}.

Theorem.

For any \epsilon_{2}>\epsilon_{1}>0 and \tau=(\epsilon_{2}-\epsilon_{1})/2, there is a sample-based (\epsilon_{1},\epsilon_{2})-tolerant tester for k-alternating functions using \left(\frac{dk}{\tau}\right)^{O\left(\frac{k\sqrt{d}}{\tau^{2}}\right)} samples, which is distribution-free under (finite or continuous) product distributions over \mathbb{R}^{d}.

1.2 Techniques

What connects these diverse problems is a notion of rectilinear surface area or isoperimetry that we call "block boundary size". There is a close connection between learning & testing and various notions of isoperimetry or surface area (e.g. [CS16, KOS04, KOS08, KMS18]). We show that testing or learning a class \mathcal{H} on product distributions over \mathbb{R}^{d} can be reduced to testing and learning on the uniform distribution over [r]^{d}, where r is determined by the block boundary size; we call this reduction "downsampling". The name downsampling is used in image and signal processing for the process of reducing the resolution of an image or reducing the number of discrete samples used to represent an analog signal. We adopt the name because our method can be described, by analogy to image or signal processing, as the following 2-step process:

  1. Construct a "digitized" or "pixellated" image of the function f:\mathbb{R}^{d}\to\{\pm 1\} by sampling from the distribution and constructing a grid in which each cell has roughly equal probability mass; and

  2. Learn or test the "low-resolution" pixellated function.

As long as the function f takes a constant value in the vast majority of "pixels", the low-resolution version seen by the algorithm is a good enough approximation for testing or learning. The block boundary size is, informally, the number of pixels on which f is not constant.
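For concreteness, here is a minimal numpy sketch of this two-step pipeline (the function names are ours, and the formal definitions appear in Section 2): build a grid whose cells have roughly equal mass from samples, evaluate f at one representative point per cell, and compare the "pixellated" function to f.

    import numpy as np

    def build_splits(samples, r):
        # Step 1a: from m i.i.d. samples (an (m, d) array), take every (m/r)-th order
        # statistic per dimension, so each cell has roughly equal probability mass.
        m, d = samples.shape
        assert m % r == 0
        order_stats = np.sort(samples, axis=0)
        return order_stats[(m // r) * np.arange(1, r) - 1, :].T    # shape (d, r-1)

    def block(x, splits):
        # Map a point x in R^d to its grid cell v in {0, ..., r-1}^d.
        return tuple(int(np.searchsorted(splits[i], xi)) for i, xi in enumerate(x))

    def blockpoint(v, splits):
        # One representative point per cell (its upper split, or a point past the last split).
        r = splits.shape[1] + 1
        return np.array([splits[i][v[i]] if v[i] < r - 1 else splits[i][-1] + 1.0
                         for i in range(len(v))])

    def coarsen(f, splits):
        # Step 1b/2: the "pixellated" function f^coarse(x) = f(blockpoint(block(x))).
        return lambda x: f(blockpoint(block(x, splits), splits))

    # Example with a hypothetical target function and product distribution:
    rng = np.random.default_rng(0)
    d, r, m = 3, 20, 2000
    f = lambda x: int(x.sum() >= 0.5)                              # a halfspace indicator
    splits = build_splits(rng.standard_normal((m, d)), r)
    f_coarse = coarsen(f, splits)
    test = rng.standard_normal((20000, d))
    print(np.mean([f(x) != f_coarse(x) for x in test]))            # empirical dist(f, f^coarse)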

This technique reduces distribution-free testing and learning problems to the uniform distribution in a way that is conceptually simpler than the prior work [BOW10, BCS20]. However, some technical challenges remain. The first is that it is not always easy to bound the number of "pixels" on which a function f is not constant; for example, this is nontrivial for PTFs. Second, unlike in the uniform distribution, the resulting downsampled function class on [r]^{d} is not necessarily "the same" as the original class; for example, halfspaces on \mathbb{R}^{d} are not downsampled to halfspaces on [r]^{d}, since the "pixels" are not of equal size. Thus, geometric arguments may not carry over, unlike the case for actual images.

A similar technique of constructing "low-resolution" representations of the input has been used and rediscovered ad hoc a few times in the property testing literature; in prior work, it was restricted to the uniform distribution over [n]^{d} [KR00, Ras03, FR10, BY19, CGG+19] (or the Gaussian in [CFSS17]). With this paper, we aim to provide a unified and generalized study of this simple and powerful technique.

1.3 Block Boundary Size

Informally, we define the r-block boundary size \mathsf{bbs}(\mathcal{H},r) of a class \mathcal{H} of functions \mathbb{R}^{d}\to\{0,1\} as the maximum number of grid cells on which a function f\in\mathcal{H} is non-constant, over all possible r\times\dotsm\times r grid partitions of \mathbb{R}^{d} (which are not necessarily evenly spaced); see Section 2 for formal definitions. Whether downsampling can be applied to \mathcal{H} depends on whether

\lim_{r\to\infty}\frac{\mathsf{bbs}(\mathcal{H},r)}{r^{d}}\to 0\,,

and the complexity of the algorithms depends on how large r must be for the non-constant blocks to vanish relative to the whole r^{d} grid. A general observation is that any function class \mathcal{H} to which downsampling can be applied can be learned under unknown product distributions with a finite number of samples; for example, this holds for convex sets even though their VC dimension is infinite.

Proposition 1.1 (Consequence of Lemma 4.5).

Let \mathcal{H} be any set of functions \mathbb{R}^{d}\to\{0,1\} (measurable with respect to continuous product distributions) such that

\lim_{r\to\infty}\frac{\mathsf{bbs}(\mathcal{H},r)}{r^{d}}=0\,.

Then there is some function \delta(d,\epsilon) such that \mathcal{H} is distribution-free learnable under product distributions, up to error \epsilon, with \delta(d,\epsilon) samples.

For convex sets, monotone functions, k-alternating functions, and halfspaces, \mathsf{bbs}(\mathcal{H},r) is easy to calculate. For degree-k PTFs, it is more challenging. We say that a function f:\mathbb{R}^{d}\to\{0,1\} induces a connected component S if for every x,y\in S there is a continuous curve in \mathbb{R}^{d} from x to y such that f(z)=f(x)=f(y) for all z on the curve, and S is a maximal such set. We prove a general lemma that bounds the block boundary size by the number of connected components induced by functions f\in\mathcal{H}.

Lemma 1.2 (Informal, see Lemma 6.6).

Suppose that for any axis-aligned affine subspace A of affine dimension n\leq d, and any function f\in\mathcal{H}, f induces at most k^{n} connected components in A. Then for r=\Omega(dk/\epsilon), \mathsf{bbs}(\mathcal{H},r)\leq\epsilon\cdot r^{d}.

This lemma in fact generalizes all computations of block boundary size in this paper (up to constant factors in r). Using a theorem of Warren [War68], we get that any degree-k polynomial \mathbb{R}^{d}\to\mathbb{R} achieves value 0 in at most \epsilon r^{d} grid cells, for sufficiently large r=\Omega(dk/\epsilon) (Corollary 6.8).
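As a quick worked example of how the lemma is applied (our illustration): a halfspace f(x)=\mathbf{1}[\langle w,x\rangle\geq t] restricted to any axis-aligned affine subspace A of dimension n is again a halfspace of A, so it induces at most 2\leq 2^{n} connected components in A; applying the lemma with k=2 then gives \mathsf{bbs}(f,r)\leq\epsilon r^{d} once r=\Omega(d/\epsilon).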

1.4 Polynomial Regression

The second step of downsampling is to find a testing or learning algorithm that works for the uniform distribution over the (not necessarily evenly-spaced) hypergrid. Most of our learning results use polynomial regression. This is a powerful technique introduced in [KKMS08] that performs linear regression over a vector space of functions that approximately spans the hypothesis class. This method is usually applied by using Fourier analysis to construct such an approximate basis for the hypothesis class [BOW10, DHK+10, CGG+19]. This was the method used, for example, by Blais, O'Donnell, & Wimmer [BOW10] to achieve the \operatorname{poly}(dn)-time algorithms for intersections of halfspaces.

We take the same approach but we use the Walsh basis for functions on domain [n]^{d} (see e.g. [BRY14]) instead of the bases used in the prior works. We show that if one can establish bounds on the noise sensitivity in the Fourier basis for the hypothesis class restricted to the uniform distribution over \{\pm 1\}^{d}, then one gets a bound on the number of Walsh functions required to approximately span the "downsampled" hypothesis class. In this way, we establish that if one can apply standard Fourier-analytic techniques to the hypothesis class over the uniform distribution on \{\pm 1\}^{d} and calculate the block boundary size, then the results for the hypercube essentially carry over to the distribution-free setting for product distributions on \mathbb{R}^{d}.

An advantage of this technique is that both noise sensitivity and block boundary size grow at most linearly under function composition: for functions f(x)=g(h_{1}(x),\dotsc,h_{k}(x)) where each h_{i} belongs to the class \mathcal{H}, the noise sensitivity and block boundary size grow at most linearly in k. Therefore learning results for \mathcal{H} obtained in this way are easy to extend to arbitrary compositions of functions from \mathcal{H}, which is how we get our result for intersections of halfspaces.
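For intuition only, here is a minimal sketch of a polynomial-regression learner on the hypercube (our illustration, under our own simplifications: it fits a least-squares polynomial over low-degree parities on \{\pm 1\}^{d} and thresholds the result, rather than the Walsh-basis regression over [r]^{d} developed in Section 5):

    import itertools
    import numpy as np

    def parity_features(X, degree):
        # X: (m, d) array with entries in {-1, +1}; columns are all monomials
        # prod_{i in S} x_i with |S| <= degree (the low-degree basis on the hypercube).
        m, d = X.shape
        cols = [np.ones(m)]
        for k in range(1, degree + 1):
            for S in itertools.combinations(range(d), k):
                cols.append(np.prod(X[:, list(S)], axis=1))
        return np.column_stack(cols)

    def low_degree_regression(X, y, degree):
        # Fit a low-degree polynomial to the labels by least squares, then threshold at 0.
        # (A simplified stand-in for the polynomial regression of [KKMS08].)
        Phi = parity_features(X, degree)
        coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)
        return lambda Xnew: np.where(parity_features(Xnew, degree) @ coef >= 0, 1, -1)

    # Example: learn a majority function on {-1, +1}^d from samples.
    rng = np.random.default_rng(1)
    d, m = 10, 4000
    X = rng.choice([-1, 1], size=(m, d))
    y = np.where(X.sum(axis=1) >= 0, 1, -1)
    h = low_degree_regression(X, y, degree=2)
    print(np.mean(h(X) == y))    # training accuracy of the polynomial hypothesis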

2 Downsampling

We will now introduce the main definitions, notation, and lemmas required by our main results. The purpose of this section is to establish the main conceptual component of the downsampling technique: that functions with small enough block boundary size can be efficiently well-approximated by a “coarsened” version of the function that is obtained by random sampling. See Figure 1 for an illustration of the following definitions.

Figure 1: Left: Random grid X (pale lines) with induced block partition (thick lines) and \mathsf{blockpoint} values (dots), superimposed on f^{-1}(1) (gray polygon). Right: f^{\mathsf{coarse}} (grey) compared to f (polygon outline).
Definition 2.1 (Block Partitions).

An r-block partition of \mathbb{R}^{d} is a pair of functions \mathsf{block}:\mathbb{R}^{d}\to[r]^{d} and \mathsf{blockpoint}:[r]^{d}\to\mathbb{R}^{d} obtained as follows. For each i\in[d],j\in[r-1] let a_{i,j}\in\mathbb{R} such that a_{i,j}<a_{i,j+1}, and define a_{i,0}=-\infty,a_{i,r}=\infty for each i. For each i\in[d],j\in[r] define the interval B_{i,j}=(a_{i,j-1},a_{i,j}] and a point b_{i,j}\in B_{i,j}. The function \mathsf{block}:\mathbb{R}^{d}\to[r]^{d} is defined by setting \mathsf{block}(x) to be the unique vector v\in[r]^{d} such that x_{i}\in B_{i,v_{i}} for each i\in[d]. The function \mathsf{blockpoint}:[r]^{d}\to\mathbb{R}^{d} is defined by setting \mathsf{blockpoint}(v)=(b_{1,v_{1}},\dotsc,b_{d,v_{d}}); note that \mathsf{blockpoint}(v)\in\mathsf{block}^{-1}(v) where \mathsf{block}^{-1}(v)=\{x\in\mathbb{R}^{d}:\mathsf{block}(x)=v\}.

Definition 2.2 (Block Functions and Coarse Functions).

For a function f:\mathbb{R}^{d}\to\{\pm 1\}, we define f^{\mathsf{block}}:[r]^{d}\to\{\pm 1\} as f^{\mathsf{block}}:=f\circ\mathsf{blockpoint}, and f^{\mathsf{coarse}}:\mathbb{R}^{d}\to\mathbb{R} as f^{\mathsf{coarse}}:=f^{\mathsf{block}}\circ\mathsf{block}=f\circ\mathsf{blockpoint}\circ\mathsf{block}. For any set \mathcal{H} of functions \mathbb{R}^{d}\to\{\pm 1\}, we define \mathcal{H}^{\mathsf{block}}:=\{f^{\mathsf{block}}\;|\;f\in\mathcal{H}\}. For a distribution \mu over \mathbb{R}^{d} and an r-block partition \mathsf{block}:\mathbb{R}^{d}\to[r]^{d}, we define the distribution \mathsf{block}(\mu) over [r]^{d} as the distribution of \mathsf{block}(x) for x\sim\mu.

Definition 2.3 (Induced Block Partitions).

When \mu is a product distribution over \mathbb{R}^{d}, a random grid X of length m is the grid obtained by sampling m points x_{1},\dotsc,x_{m}\in\mathbb{R}^{d} independently from \mu and, for each i\in[d],j\in[m], defining X_{i,j} to be the j^{\mathrm{th}}-smallest coordinate in dimension i among all sampled points. For any r that divides m we define an r-block partition depending on X by defining, for each i\in[d],j\in[r-1], the point a_{i,j}=X_{i,mj/r}, so that the intervals are B_{i,j}:=(X_{i,m(j-1)/r},X_{i,mj/r}] when j\in\{2,\dotsc,r-1\}, and B_{i,1}=(-\infty,X_{i,m/r}], B_{i,r}=(X_{i,m(r-1)/r},\infty); we let the points b_{i,j} defining \mathsf{blockpoint} be arbitrary. This is the r-block partition induced by X.

Definition 2.4 (Block Boundary Size).

For a block partition \mathsf{block}:\mathbb{R}^{d}\to[r]^{d}, a distribution \mu over \mathbb{R}^{d}, and a function f:\mathbb{R}^{d}\to\{\pm 1\}, we say f is non-constant on a block v\in[r]^{d} if there are sets S,T\subset\mathsf{block}^{-1}(v) such that \forall s\in S,t\in T:f(s)=1,f(t)=-1, and S,T have positive measure (in the product of Lebesgue measures). For a function f:\mathbb{R}^{d}\to\{\pm 1\} and a number r, we define the r-block boundary size \mathsf{bbs}(f,r) as the maximum number of blocks on which f is non-constant, where the maximum is taken over all r-block partitions \mathsf{block}:\mathbb{R}^{d}\to[r]^{d}. For a set \mathcal{H} of functions \mathbb{R}^{d}\to\{\pm 1\}, we define \mathsf{bbs}(\mathcal{H},r):=\max\{\mathsf{bbs}(f,r)\;|\;f\in\mathcal{H}\}.
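To make the definition concrete, here is a small sketch (ours, not from the paper) for the special case of a halfspace f(x)=\mathbf{1}[\langle w,x\rangle\geq t]: f is non-constant on an axis-aligned cell exactly when the minimum and maximum of \langle w,x\rangle over the cell straddle t (up to measure-zero boundary cases), and these extremes are found by choosing, per coordinate, the cell endpoint that minimizes or maximizes w_{i}x_{i}. Brute-force counting over all cells then gives the number of non-constant blocks for a fixed partition (feasible only for small d).

    import numpy as np
    from itertools import product

    def halfspace_nonconstant_on_cell(w, t, lo, hi):
        # Cell = product of intervals (lo[i], hi[i]]; f(x) = 1 iff <w, x> >= t.
        # f is non-constant on a set of positive measure iff t lies strictly between
        # the min and max of <w, x> over the cell (ignoring measure-zero cases).
        min_val = np.sum(np.where(w >= 0, w * lo, w * hi))
        max_val = np.sum(np.where(w >= 0, w * hi, w * lo))
        return min_val < t < max_val

    def count_nonconstant_cells(w, t, splits):
        # splits[i]: the r-1 interior split points in dimension i; pad with large finite ends.
        grids = [np.concatenate(([-1e9], s, [1e9])) for s in splits]
        count = 0
        for cell in product(*(range(len(g) - 1) for g in grids)):
            lo = np.array([grids[i][j] for i, j in enumerate(cell)])
            hi = np.array([grids[i][j + 1] for i, j in enumerate(cell)])
            count += halfspace_nonconstant_on_cell(w, t, lo, hi)
        return count

    # Example: a random halfspace against an evenly spaced 20x20x20 grid on [0, 1]^3.
    rng = np.random.default_rng(2)
    w, t = rng.standard_normal(3), 0.8
    splits = [np.linspace(0, 1, 21)[1:-1] for _ in range(3)]
    print(count_nonconstant_cells(w, t, splits), "of", 20 ** 3, "cells are non-constant")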

The total variation distance between two distributions \mu,\nu over a finite domain \mathcal{X} is defined as

\|\mu-\nu\|_{\mathsf{TV}}:=\frac{1}{2}\sum_{x\in\mathcal{X}}|\mu(x)-\nu(x)|=\max_{S\subseteq\mathcal{X}}|\mu(S)-\nu(S)|\,.

The essence of downsampling is apparent in the next proposition. It shows that the distance of f to its coarsened version f^{\mathsf{coarse}} is bounded by two quantities: the fraction of blocks in the r-block partition on which f is not constant, and the distance of the distribution \mathsf{block}(\mu) to uniform. When both quantities are small, testing or learning f can be done by testing or learning f^{\mathsf{coarse}} instead. The uniform distribution over a set S is denoted \mathsf{unif}(S):

Proposition 2.5.

Let \mu be a continuous product distribution over \mathbb{R}^{d}, let X be a random grid, and let \mathsf{block}:\mathbb{R}^{d}\to[r]^{d} be the induced r-block partition. Then, for any measurable f:\mathbb{R}^{d}\to\{\pm 1\}, the following holds with probability 1 over the choice of X:

\underset{x\sim\mu}{\mathbb{P}}\left[f(x)\neq f^{\mathsf{coarse}}(x)\right]\leq r^{-d}\cdot\mathsf{bbs}(f,r)+\|\mathsf{block}(\mu)-\mathsf{unif}([r]^{d})\|_{\mathsf{TV}}\,.
Proof.

We first establish that, with probability 1 over X and x\sim\mu, if f(x)\neq f^{\mathsf{coarse}}(x) then f is non-constant on \mathsf{block}(x). Fix X and suppose there exists a set Z of positive measure such that for each x\in Z, f(x)\neq f^{\mathsf{coarse}}(x) but f is not non-constant on \mathsf{block}(x), i.e. for V=\mathsf{block}^{-1}(\mathsf{block}(x)), either \mu(V\cap f^{-1}(1))=\mu(V) or \mu(V\cap f^{-1}(-1))=\mu(V). Then there is v\in[r]^{d} such that for V=\mathsf{block}^{-1}(v), \mu(Z\cap V)>0. Let y=\mathsf{blockpoint}(v). If \mu(V\cap f^{-1}(f(y)))=\mu(V) then \mu(Z\cap V)=0, so \mu(V\cap f^{-1}(f(y)))=0. But for random X, the probability that there exists v\in[r]^{d} such that \mu(V\cap f^{-1}(f(\mathsf{blockpoint}(v))))=0 is 0, since \mathsf{blockpoint}(v) is random within V.

Assuming that the above event occurs,

\underset{x\sim\mu}{\mathbb{P}}\left[f(x)\neq f^{\mathsf{coarse}}(x)\right]\leq\underset{x\sim\mu}{\mathbb{P}}\left[f\text{ is non-constant on }\mathsf{block}(x)\right]\leq\underset{v\sim[r]^{d}}{\mathbb{P}}\left[f\text{ is non-constant on }v\right]+\|\mathsf{block}(\mu)-\mathsf{unif}([r]^{d})\|_{\mathsf{TV}}\,.

Since v\sim[r]^{d} is uniform, the probability of hitting a non-constant block is at most r^{-d}\cdot\mathsf{bbs}(f,r). ∎

Next we give a bound on the number of samples required to ensure that \mathsf{block}(\mu) is close to uniform. We need the following lemma.

Lemma 2.6.

Let \mu be a continuous probability distribution over \mathbb{R}, let m,r\in\mathbb{N} be such that r divides m, and let \delta\in(0,1/2). Let X be a set of m points sampled independently from \mu. Write X=\{x_{1},\dotsc,x_{m}\}, labeled such that x_{1}<\dotsm<x_{m} (and write x_{0}=-\infty). Then for any i\in[r],

\mathbb{P}\left[\mu\left(x_{(i-1)(m/r)},x_{i(m/r)}\right]<\frac{1-\delta}{r}\right]\leq 4\cdot e^{-\frac{\delta^{2}m}{32r}}\,.
Proof.

We assume that i-1\leq r/2; if i-1>r/2 then we can repeat the following analysis with the opposite ordering on the points in X. Write x^{*}=x_{(i-1)\tfrac{m}{r}} and \beta=\mu(-\infty,x^{*}]. First suppose that (1-\delta/2)\frac{i-1}{r}<\beta<(1+\delta/2)\frac{i-1}{r}\leq(1+\delta/2)/2; we will bound the probability of this event later.

Let t\in\mathbb{R} be the point such that \mu(x^{*},t]=(1-\delta)/r (which must exist since \mu is continuous). Let \eta=\frac{\delta}{1-\delta}\geq\delta. Write X^{*}=\{x\in X:x>x^{*}\}. The expected value of |X^{*}\cap(x^{*},t]| is |X^{*}|\frac{1-\delta}{r(1-\beta)}=m\left(1-\frac{i-1}{r}\right)\frac{1-\delta}{r(1-\beta)}, where the factor 1-\beta in the denominator is due to the fact that each element of X^{*} is sampled from \mu conditioned on being larger than x^{*}. The event \mu(x^{*},x_{i(m/r)}]<(1-\delta)/r occurs if and only if |X^{*}\cap(x^{*},t]|>m/r, which occurs with probability

\mathbb{P}\left[|X^{*}\cap(x^{*},t]|>\frac{m}{r}\right]=\mathbb{P}\left[|X^{*}\cap(x^{*},t]|>m\left(1-\frac{i-1}{r}\right)\frac{1-\delta}{r(1-\beta)}(1+\eta)\right]

where

1+\eta=\frac{1-\beta}{(1-\delta)\left(1-\frac{i-1}{r}\right)}\geq\frac{1-(1+\delta/2)\frac{i-1}{r}}{(1-\delta)\left(1-\frac{i-1}{r}\right)}=\frac{1}{1-\delta}\left(1-\frac{(\delta/2)(i-1)}{r-(i-1)}\right)\geq\frac{1-\delta/2}{1-\delta}=1+\frac{\delta}{2(1-\delta)}\geq 1+\delta/2\,.

Since the expected value satisfies

|X^{*}|\frac{1-\delta}{r(1-\beta)}\geq\frac{m}{r}\left(1-\frac{i-1}{r}\right)\frac{2(1-\delta)}{1-\delta/2}\geq\frac{m}{r}(1-\delta/2)\geq\frac{m}{2r}\,,

the Chernoff bound gives

\mathbb{P}\left[|X^{*}\cap(x^{*},t]|>\frac{m}{r}\right]\leq\mathrm{exp}\left(-\frac{\delta^{2}|X^{*}|(1-\delta)}{3\cdot 4\cdot r(1-\beta)}\right)\leq e^{-\frac{\delta^{2}m}{3\cdot 4\cdot 2r}}\,.

Now let t\in\mathbb{R} be the point such that \mu(x^{*},t]=(1+\delta)/r. The expected value of |X^{*}\cap(x^{*},t]| is now |X^{*}|\frac{1+\delta}{r(1-\beta)}. The event \mu(x^{*},x_{i(m/r)}]>(1+\delta)/r occurs if and only if |X^{*}\cap(x^{*},t]|<m/r, which occurs with probability

\mathbb{P}\left[|X^{*}\cap(x^{*},t]|<\frac{m}{r}\right]=\mathbb{P}\left[|X^{*}\cap(x^{*},t]|<m\left(1-\frac{i-1}{r}\right)\frac{1+\delta}{r(1-\beta)}(1-\eta)\right]

where

1-\eta=\frac{1-\beta}{(1+\delta)\left(1-\tfrac{i-1}{r}\right)}\leq\frac{1-(1+\delta/2)\tfrac{i-1}{r}}{(1+\delta)\left(1-\frac{i-1}{r}\right)}=\frac{1}{1+\delta}\left(1+\frac{(\delta/2)(i-1)}{r-(i-1)}\right)\leq\frac{1+\delta/2}{1+\delta}=1-\frac{\delta/2}{1+\delta}\leq 1-\frac{\delta}{4}\,.

The expected value satisfies |X^{*}|\frac{1+\delta}{r(1-\beta)}>m/r, so the Chernoff bound gives

\mathbb{P}\left[|X^{*}\cap(x^{*},t]|<\frac{m}{r}\right]\leq\mathrm{exp}\left(-\frac{\delta^{2}|X^{*}|(1+\delta)}{2\cdot 4^{2}\cdot r(1-\beta)}\right)\leq e^{-\frac{\delta^{2}m}{2\cdot 4^{2}\cdot r}}\,.

It remains to bound the probability that (1-\delta/2)\frac{i-1}{r}<\beta<(1+\delta/2)\frac{i-1}{r}. Define t\in\mathbb{R} such that \mu(-\infty,t]=(1+\delta/2)\frac{i-1}{r}. Then \beta=\mu(-\infty,x^{*}]\geq(1+\delta/2)\frac{i-1}{r} if and only if x^{*}>t, i.e. |X\cap(-\infty,t]|<m\frac{i-1}{r}. The expected value of |X\cap(-\infty,t]| is m\frac{(1+\delta/2)(i-1)}{r}, so for \eta=\frac{\delta/2}{1+\delta/2}\geq\delta/3, the Chernoff bound implies

\mathbb{P}\left[|X\cap(-\infty,t]|<m\frac{i-1}{r}\right]=\mathbb{P}\left[|X\cap(-\infty,t]|<m\frac{(1+\delta/2)(i-1)}{r}(1-\eta)\right]\leq e^{-\frac{\delta^{2}m(1+\delta/2)(i-1)}{18r}}\leq e^{-\frac{\delta^{2}m}{18r}}\,.

Now define t\in\mathbb{R} such that \mu(-\infty,t]=(1-\delta/2)\frac{i-1}{r}. Then \beta=\mu(-\infty,x^{*}]\leq(1-\delta/2)\frac{i-1}{r} if and only if x^{*}<t, i.e. |X\cap(-\infty,t]|>m\frac{i-1}{r}. The expected value of |X\cap(-\infty,t]| is m\frac{(1-\delta/2)(i-1)}{r}, so for \eta=\frac{\delta}{2-\delta}\geq\delta/2,

\mathbb{P}\left[|X\cap(-\infty,t]|>m\frac{i-1}{r}\right]=\mathbb{P}\left[|X\cap(-\infty,t]|>m\frac{(1-\delta/2)(i-1)}{r}(1+\eta)\right]\leq e^{-\frac{\delta^{2}m(1-\delta/2)(i-1)}{2\cdot 4r}}\leq e^{-\frac{\delta^{2}m}{4^{2}r}}\,.

The conclusion then follows from the union bound over these four events. ∎

Lemma 2.7.

Let \mu=\mu_{1}\times\dotsm\times\mu_{d} be a product distribution over \mathbb{R}^{d} where each \mu_{i} is continuous. Let X be a random grid of length m sampled from \mu, and let \mathsf{block}:\mathbb{R}^{d}\to[r]^{d} be the r-block partition induced by X. Then

\underset{X}{\mathbb{P}}\left[\|\mathsf{block}(\mu)-\mathsf{unif}([r]^{d})\|_{\mathsf{TV}}>\epsilon\right]\leq 4rd\cdot e^{-\frac{\epsilon^{2}m}{18rd^{2}}}\,.
Proof.

For a fixed grid X and each i\in[d], let p_{i}:[r]\to[0,1] be the probability distribution on [r] with p_{i}(z)=\mu_{i}(B_{i,z}). Then \mathsf{block}(\mu)=p_{1}\times\dotsm\times p_{d}.

Let \delta=\frac{4\epsilon}{3d}. Suppose that for every (i,j)\in[d]\times[r] it holds that \frac{1-\delta}{r}\leq p_{i}(j)\leq\frac{1+\delta}{r}. Note that d\delta=\frac{4\epsilon}{3}\leq\ln(1+2\epsilon)\leq 2\epsilon. Then for every v\in[r]^{d},

\underset{u\sim\mu}{\mathbb{P}}\left[\mathsf{block}(u)=v\right]=\prod_{i=1}^{d}p_{i}(v_{i})\begin{cases}\leq{(1+\delta)}^{d}r^{-d}\leq e^{d\delta}r^{-d}\leq(1+2\epsilon)r^{-d}\\ \geq{(1-\delta)}^{d}r^{-d}\geq(1-d\delta)r^{-d}\geq(1-2\epsilon)r^{-d}\,.\end{cases}

So

\|\mathsf{block}(\mu)-\mathsf{unif}([r]^{d})\|_{\mathsf{TV}}=\frac{1}{2}\sum_{v\in[r]^{d}}\left|\underset{u\sim\mu}{\mathbb{P}}\left[\mathsf{block}(u)=v\right]-r^{-d}\right|\leq\frac{1}{2}\sum_{v\in[r]^{d}}2\epsilon r^{-d}=\epsilon\,.

By Lemma 2.6 and the union bound, the probability that there is some i\in[d],j\in[r] that satisfies p_{i}(j)<(1-\delta)/r is at most 4rd\cdot e^{-\frac{\epsilon^{2}m}{18rd^{2}}}. ∎

3 Testing Monotonicity

3.1 Testing Monotonicity on the Hypergrid

A good introduction to downsampling is the following short proof of the main result of Black, Chakrabarty, & Seshadhri [BCS20]. In an earlier work, [BCS18], they gave an O((d^{5/6}/\epsilon^{4/3})\operatorname{poly}\log(dn))-query tester for the domain [n]^{d}, and in the later work they showed how to reduce the domain [n]^{d} to [r]^{d} for r=\operatorname{poly}(d/\epsilon).

Our monotonicity tester will use as a subroutine the following tester for diagonal functions. For a hypergrid [n]^{d}, a diagonal is a subset of points \{x\in[n]^{d}:x=v+\lambda\vec{1},\lambda\in\mathbb{Z}\} defined by some v\in[n]^{d}. A function f:[n]^{d}\to\{0,1\} is a diagonal function if it has at most one 1-valued point in each diagonal.

Lemma 3.1.

There is an \epsilon-tester with one-sided error and query complexity O\left(\tfrac{1}{\epsilon}\log^{2}(1/\epsilon)\right) for diagonal functions on [n]^{d}.

Proof.

For each t\in[n] let D_{t} be the set of diagonals with length t. For any x\in[n]^{d} let \mathrm{diag}(x) be the unique diagonal that contains x. For input f:[n]^{d}\to\{0,1\} and any x\in[n]^{d}, let R(x)=\frac{|\{y\in\mathrm{diag}(x):f(y)=1\}|}{|\mathrm{diag}(x)|}.

Suppose that f is \epsilon-far from diagonal. Then f must have at least \epsilon n^{d} 1-valued points; otherwise we could set each 1-valued point to 0 to obtain the constant 0 function, which is diagonal. Now observe

\underset{x\sim[n]^{d}}{\mathbb{E}}\left[R(x)\right]=\underset{x\sim[n]^{d}}{\mathbb{E}}\left[\sum_{t=1}^{n}\sum_{L\in D_{t}}\mathbf{1}\left[\mathrm{diag}(x)=L\right]\frac{|\{y\in L:f(y)=1\}|}{t}\right]=\sum_{t=1}^{n}\sum_{L\in D_{t}}\underset{x\sim[n]^{d}}{\mathbb{P}}\left[x\in L\right]\frac{|\{y\in L:f(y)=1\}|}{t}=\sum_{t=1}^{n}\sum_{L\in D_{t}}\frac{t}{n^{d}}\frac{|\{y\in L:f(y)=1\}|}{t}=\frac{1}{n^{d}}|\{y\in[n]^{d}:f(y)=1\}|\geq\epsilon\,.

For each i, define A_{i}=\left\{x\in[n]^{d}:\frac{1}{2^{i}}<R(x)\leq\frac{1}{2^{i-1}}\right\}. Let k=\log(4/\epsilon). Then

\epsilon\leq\mathbb{E}\left[R(x)\right]\leq\sum_{i=1}^{\infty}\frac{|A_{i}|}{n^{d}}\max_{x\in A_{i}}R(x)\leq\sum_{i=1}^{\infty}\frac{|A_{i}|}{n^{d}2^{i-1}}\leq\sum_{i=1}^{k}\frac{|A_{i}|}{n^{d}2^{i-1}}+\sum_{i=k+1}^{\infty}\frac{1}{2^{i-1}}\leq\sum_{i=1}^{k}\frac{|A_{i}|}{n^{d}2^{i-1}}+\frac{1}{2^{k-1}}\leq\sum_{i=1}^{k}\frac{|A_{i}|}{n^{d}2^{i-1}}+\frac{\epsilon}{2}\,,

which implies \frac{\epsilon}{2}\leq\sum_{i=1}^{k}\frac{|A_{i}|}{n^{d}2^{i-1}}. Therefore there is some \ell\in[k] such that |A_{\ell}|\geq\frac{\epsilon n^{d}2^{\ell-1}}{2k}.

The tester is as follows. For each i\in[k]:

  1. Sample p=\frac{k}{\epsilon 2^{i-2}}\ln(6) points x_{1},\dotsc,x_{p}\sim[n]^{d}.

  2. For each j\in[p], sample q=2^{i+2}\ln(12) points y_{1},\dotsc,y_{q} from \mathrm{diag}(x_{j}) and reject if there are two distinct 1-valued points in the sample.

The query complexity of the tester is \sum_{i=1}^{k}4^{2}\ln(6)\ln(12)\frac{k}{\epsilon 2^{i}}2^{i}=O\left(\frac{1}{\epsilon}\log^{2}(1/\epsilon)\right).

The tester will clearly accept any diagonal function. Now suppose that f is \epsilon-far from having this property, and let \ell\in[k] be such that |A_{\ell}|\geq\frac{\epsilon n^{d}2^{\ell-2}}{k}. On iteration i=\ell, the algorithm samples p=\frac{k}{\epsilon 2^{\ell-2}}\ln(6) points x_{1},\dotsc,x_{p}. The probability that \forall j\in[p],x_{j}\notin A_{\ell} is at most

\left(1-\frac{|A_{\ell}|}{n^{d}}\right)^{p}\leq\left(1-\frac{\epsilon 2^{\ell-2}}{k}\right)^{p}\leq\mathrm{exp}\left(-\frac{\epsilon p2^{\ell-2}}{k}\right)\leq 1/6\,.

Now assume that there is some x_{j}\in A_{\ell}, so that R(x_{j})>2^{-\ell}. Let A,B\subset\mathrm{diag}(x_{j}) be disjoint subsets that partition the 1-valued points in \mathrm{diag}(x_{j}) into equally-sized parts. Then for y sampled uniformly at random from \mathrm{diag}(x_{j}), \mathbb{P}\left[y\in A\right],\mathbb{P}\left[y\in B\right]\geq 2^{-(\ell+1)}. The probability that there are at least 2 distinct 1-valued points in y_{1},\dotsc,y_{q} sampled by the algorithm is at least the probability that one of the first q/2 samples is in A and one of the last q/2 samples is in B. This fails to occur with probability at most 2(1-2^{-(\ell+1)})^{q/2}\leq 2e^{-q2^{-(\ell+2)}}\leq 1/6. So the total probability of failure is at most 2/6=1/3. ∎
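For concreteness, the tester above can be sketched directly in code (a minimal Python sketch; f is assumed to be a query oracle on the grid \{0,\dotsc,n-1\}^{d}, and the names and constants follow the proof):

    import math
    import random

    def diagonal_lambda_range(x, n):
        # The diagonal through x is { x + lam*(1,...,1) }; valid lam keep coordinates in [0, n-1].
        return -min(x), (n - 1) - max(x)

    def is_diagonal_tester(f, n, d, eps, rng=random):
        # One-sided tester from the proof of Lemma 3.1: reject only if some diagonal
        # is observed to contain two distinct 1-valued points.
        k = max(1, math.ceil(math.log2(4.0 / eps)))
        for i in range(1, k + 1):
            p = math.ceil(k * math.log(6) / (eps * 2 ** (i - 2)))
            q = math.ceil(2 ** (i + 2) * math.log(12))
            for _ in range(p):
                x = [rng.randrange(n) for _ in range(d)]
                lo, hi = diagonal_lambda_range(x, n)
                ones = set()
                for _ in range(q):
                    lam = rng.randint(lo, hi)
                    y = tuple(c + lam for c in x)
                    if f(y) == 1:
                        ones.add(y)
                        if len(ones) >= 2:
                            return False   # two distinct 1-valued points on one diagonal
        return True                        # always accepts genuine diagonal functions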

Theorem 3.2.

There is a non-adaptive monotonicity tester on domain [n]^{d} with one-sided error and query complexity \widetilde{O}\left(\frac{d^{5/6}}{\epsilon^{4/3}}\right).

Proof.

Set r=\lceil 4d/\epsilon\rceil, and assume without loss of generality that r divides n. Partition [n] into r intervals B_{i}=\{(i-1)(n/r)+1,\dotsc,i(n/r)\}. For each v\in[r]^{d} write B_{v}=B_{v_{1}}\times\dotsm\times B_{v_{d}}. Define \mathsf{block}:[n]^{d}\to[r]^{d} where \mathsf{block}(x) is the unique vector v\in[r]^{d} such that x\in B_{v}. Define \mathsf{block}^{-\downarrow}(v)=\min\{x\in B_{v}\} and \mathsf{block}^{-\uparrow}(v)=\max\{x\in B_{v}\}, where the minimum and maximum are with respect to the natural ordering on [n]^{d}. For f:[n]^{d}\to\{0,1\}, write f^{\mathsf{block}}:[r]^{d}\to\{0,1\}, f^{\mathsf{block}}(v)=f(\mathsf{block}^{-\downarrow}(v)). We may simulate queries v to f^{\mathsf{block}} by returning f(\mathsf{block}^{-\downarrow}(v)). We will call v\in[r]^{d} a boundary block if f(\mathsf{block}^{-\downarrow}(v))\neq f(\mathsf{block}^{-\uparrow}(v)).

The test proceeds as follows. On input f:[n]^{d}\to\{0,1\}, define the following functions:

g:[n]^{d}\to\{0,1\},\qquad g(x)=\begin{cases}f^{\mathsf{block}}(\mathsf{block}(x))&\text{ if }\mathsf{block}(x)\text{ is not a boundary block}\\ f(x)&\text{ if }\mathsf{block}(x)\text{ is a boundary block.}\end{cases}

b:[r]^{d}\to\{0,1\},\qquad b(v)=\begin{cases}0&\text{ if }v\text{ is not a boundary block}\\ 1&\text{ if }v\text{ is a boundary block.}\end{cases}

h:[r]^{d}\to\{0,1\},\qquad h(v)=\begin{cases}f^{\mathsf{block}}(v)&\text{ if }v\text{ is not a boundary block}\\ 0&\text{ if }v\text{ is a boundary block.}\end{cases}

Queries to each of these functions can be simulated by 2 or 3 queries to f (a code sketch of this simulation is given after the three steps below). The tester performs:

  1. Test whether g=f, or whether \mathsf{dist}(f,g)>\epsilon/4, using O(1/\epsilon) queries.

  2. Test whether b is diagonal, or is \epsilon/4-far from diagonal, using Lemma 3.1, with O\left(\tfrac{1}{\epsilon}\log^{2}(1/\epsilon)\right) queries.

  3. Test whether h is monotone or \epsilon/4-far from monotone, using the tester of Black, Chakrabarty, & Seshadhri with \widetilde{O}\left(\frac{d^{5/6}}{\epsilon^{4/3}}\right) queries.
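As a concrete illustration of the 2-or-3-query simulation (our sketch, with helper names of our own choosing), assuming query access to f on \{1,\dotsc,n\}^{d} with r dividing n:

    def make_simulators(f, n, r):
        # Blocks are indexed by v in {1,...,r}^d with B_v = prod_i {(v_i-1)(n/r)+1, ..., v_i(n/r)}.
        w = n // r
        lo = lambda v: tuple((vi - 1) * w + 1 for vi in v)     # block^down(v): minimum point of B_v
        hi = lambda v: tuple(vi * w for vi in v)               # block^up(v):   maximum point of B_v
        blk = lambda x: tuple((xi - 1) // w + 1 for xi in x)   # block(x)

        f_block = lambda v: f(lo(v))                           # 1 query
        b = lambda v: int(f(lo(v)) != f(hi(v)))                # boundary-block indicator, 2 queries

        def g(x):                                              # 2 or 3 queries
            v = blk(x)
            return f(x) if b(v) else f_block(v)

        def h(v):                                              # 2 queries
            return 0 if b(v) else f_block(v)

        return f_block, b, g, h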

Claim 3.3.

If f is monotone, the tester passes all 3 tests with probability 1.

Proof of claim.

To see that g=f, observe that if v=\mathsf{block}(x) is not a boundary block then f(\mathsf{block}^{-\downarrow}(v))=f(\mathsf{block}^{-\uparrow}(v)). If f(x)\neq f^{\mathsf{block}}(\mathsf{block}(x)) then f(x)\neq f(\mathsf{block}^{-\downarrow}(v)) and f(x)\neq f(\mathsf{block}^{-\uparrow}(v)) while \mathsf{block}^{-\downarrow}(v)\preceq x\preceq\mathsf{block}^{-\uparrow}(v), and this is a violation of the monotonicity of f. Therefore f will pass the first test with probability 1.

To see that f passes the second test with probability 1, observe that if f had 2 boundary blocks in some diagonal, then there would be boundary blocks u,v\in[r]^{d} such that \mathsf{block}^{-\uparrow}(u)\prec\mathsf{block}^{-\downarrow}(v). But then there are x,y\in[n]^{d} such that \mathsf{block}(x)=u,\mathsf{block}(y)=v and f(x)=1,f(y)=0; since x\preceq\mathsf{block}^{-\uparrow}(u)\prec\mathsf{block}^{-\downarrow}(v)\preceq y, this contradicts the monotonicity of f. So f has at most 1 boundary block in each diagonal.

To see that h is monotone, it is sufficient to consider the boundary blocks, since all other values are the same as f^{\mathsf{block}}. Let v\in[r]^{d} be a boundary block, so there exist x,y\in[n]^{d} such that \mathsf{block}(x)=\mathsf{block}(y)=v and f(x)=1,f(y)=0. Suppose u\prec v is not a boundary block (if it is a boundary block then h(u)=h(v)=0). If h(u)=1 then f(\mathsf{block}^{-\downarrow}(u))=1, but \mathsf{block}^{-\downarrow}(u)\prec\mathsf{block}^{-\downarrow}(v)\preceq y while f(\mathsf{block}^{-\downarrow}(u))>f(y), a contradiction. So it must be that h(u)=0 whenever u\prec v. For any block u\in[r]^{d} such that v\prec u, we have 0=h(v)\leq h(u), so monotonicity holds. Since the tester of Black, Chakrabarty, & Seshadhri has one-sided error, the test passes with probability 1. ∎

Claim 3.4.

If gg is ϵ/4\epsilon/4-close to ff, bb is ϵ/4\epsilon/4-close to diagonal, and hh is ϵ/4\epsilon/4-close to monotone, then ff is ϵ\epsilon-close to monotone.

Proof of claim.

Let h𝖼𝗈𝖺𝗋𝗌𝖾:[n]d{0,1}h^{\mathsf{coarse}}:[n]^{d}\to\{0,1\} be the function h𝖼𝗈𝖺𝗋𝗌𝖾(x)=h(𝖻𝗅𝗈𝖼𝗄(x))h^{\mathsf{coarse}}(x)=h(\mathsf{block}(x)). Suppose that f(x)h𝖼𝗈𝖺𝗋𝗌𝖾(x)f(x)\neq h^{\mathsf{coarse}}(x). If v=𝖻𝗅𝗈𝖼𝗄(x)v=\mathsf{block}(x) is not a boundary block of ff then h𝖼𝗈𝖺𝗋𝗌𝖾(x)=h(v)=f𝖻𝗅𝗈𝖼𝗄(v)=g(x)h^{\mathsf{coarse}}(x)=h(v)=f^{\mathsf{block}}(v)=g(x), so f(x)g(x)f(x)\neq g(x). If vv is a boundary block then h𝖼𝗈𝖺𝗋𝗌𝖾(x)=h(v)=0h^{\mathsf{coarse}}(x)=h(v)=0 so f(x)=1f(x)=1, and b(v)=1b(v)=1.

Suppose for contradiction that there are more than (ε/2)r^d boundary blocks v ∈ [r]^d, so that b has more than (ε/2)r^d 1-valued points. Any diagonal function has at most dr^{d-1} 1-valued points. Therefore the distance of b to the nearest diagonal function is more than

rd(ϵ2rddrd1)=ϵ2dr=ϵ2ϵ4=ϵ4,r^{-d}\left(\frac{\epsilon}{2}r^{d}-dr^{d-1}\right)=\frac{\epsilon}{2}-\frac{d}{r}=\frac{\epsilon}{2}-\frac{\epsilon}{4}=\frac{\epsilon}{4}\,,

a contradiction. So ff has at most ϵ2rd\tfrac{\epsilon}{2}r^{d} boundary blocks. Now

𝖽𝗂𝗌𝗍(f,h𝖼𝗈𝖺𝗋𝗌𝖾)\displaystyle\mathsf{dist}(f,h^{\mathsf{coarse}}) =𝖽𝗂𝗌𝗍(f,g)+x[n]d[f(x)=1,𝖻𝗅𝗈𝖼𝗄(x) is a boundary block]ϵ4+rdϵrd2=34ϵ.\displaystyle=\mathsf{dist}(f,g)+\underset{x\sim[n]^{d}}{\mathbb{P}}\left[f(x)=1,\mathsf{block}(x)\text{ is a boundary block}\right]\leq\frac{\epsilon}{4}+r^{-d}\cdot\frac{\epsilon r^{d}}{2}=\frac{3}{4}\epsilon\,.

Let p:[r]d{0,1}p:[r]^{d}\to\{0,1\} be a monotone function minimizing the distance to hh, and let p𝖼𝗈𝖺𝗋𝗌𝖾:[n]d{0,1}p^{\mathsf{coarse}}:[n]^{d}\to\{0,1\} be the function p𝖼𝗈𝖺𝗋𝗌𝖾(x)=p(𝖻𝗅𝗈𝖼𝗄(x))p^{\mathsf{coarse}}(x)=p(\mathsf{block}(x)). Then

𝖽𝗂𝗌𝗍(h𝖼𝗈𝖺𝗋𝗌𝖾,p𝖼𝗈𝖺𝗋𝗌𝖾)=x[n]d[h(𝖻𝗅𝗈𝖼𝗄(x))p(𝖻𝗅𝗈𝖼𝗄(x))]=v[r]d[h(v)p(v)]ϵ/4.\mathsf{dist}(h^{\mathsf{coarse}},p^{\mathsf{coarse}})=\underset{x\sim[n]^{d}}{\mathbb{P}}\left[h(\mathsf{block}(x))\neq p(\mathsf{block}(x))\right]=\underset{v\sim[r]^{d}}{\mathbb{P}}\left[h(v)\neq p(v)\right]\leq\epsilon/4\,.

Finally, the distance of ff to the nearest monotone function is at most

𝖽𝗂𝗌𝗍(f,p𝖼𝗈𝖺𝗋𝗌𝖾)𝖽𝗂𝗌𝗍(f,h𝖼𝗈𝖺𝗋𝗌𝖾)+𝖽𝗂𝗌𝗍(h𝖼𝗈𝖺𝗋𝗌𝖾,p𝖼𝗈𝖺𝗋𝗌𝖾)34ϵ+14ϵ=ϵ.\mathsf{dist}(f,p^{\mathsf{coarse}})\leq\mathsf{dist}(f,h^{\mathsf{coarse}})+\mathsf{dist}(h^{\mathsf{coarse}},p^{\mathsf{coarse}})\leq\frac{3}{4}\epsilon+\frac{1}{4}\epsilon=\epsilon\,.\qed

These two claims suffice to establish the theorem. ∎

3.2 Monotonicity Testing for Product Distributions

The previous section used a special case of downsampling, tailored for the uniform distribution over [n]^d. We will call a product distribution μ = μ_1 × ⋯ × μ_d over R^d continuous if each of its factors μ_i is continuous (i.e. absolutely continuous with respect to the Lebesgue measure). The proof for discrete distributions is in Appendix A.

See 1.1.1

Proof.

We follow the proof of Theorem 3.2, with some small changes. Let r=16d/ϵr=\lceil 16d/\epsilon\rceil. The tester first samples a grid XX with length m=O(rd2ϵ2log(rd))m=O\left(\frac{rd^{2}}{\epsilon^{2}}\log(rd)\right) and constructs the induced (r+2)(r+2)-block partition, with cells labeled {0,,r+1}d\{0,\dotsc,r+1\}^{d}. We call a block v{0,,r+1}dv\in\{0,\dotsc,r+1\}^{d} upper extreme if there is some i[d]i\in[d] such that vi=r+1v_{i}=r+1, and we call it lower extreme if there is some i[d]i\in[d] such that vi=0v_{i}=0 but vv is not upper extreme. Call the upper extreme blocks UU and the lower extreme blocks LL. Note that [r]d={0,,r+1}d(UL)[r]^{d}=\{0,\dotsc,r+1\}^{d}\setminus(U\cup L).
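For concreteness, one natural way to realize the induced block partition is to use, in each coordinate, empirical quantiles of the sampled grid as cell boundaries. The following sketch is our illustration of that idea; it omits the extreme cells labelled 0 and r+1 and is not the exact construction of Lemma 2.7.

```python
# Illustrative sketch: an r-block partition induced by a sampled grid X.
import numpy as np

def build_block_partition(X, r):
    """X: (m, d) array of samples from the product distribution.
    Returns block(), mapping a point of R^d to a cell in {0, ..., r-1}^d."""
    m, d = X.shape
    # In each coordinate, take r-1 interior thresholds at the empirical quantiles,
    # so each cell contains roughly a 1/r fraction of the sampled coordinates.
    thresholds = [np.quantile(X[:, i], [j / r for j in range(1, r)]) for i in range(d)]

    def block(x):
        # Number of thresholds strictly below x_i gives the cell index in coordinate i.
        return tuple(int(np.searchsorted(thresholds[i], x[i], side="left"))
                     for i in range(d))

    return block
```

Under a continuous product distribution, a fresh sample then lands in each cell of each coordinate with probability close to 1/r, which is the property that Lemma 2.7 quantifies.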

For each v[r]dv\in[r]^{d}, we again define 𝖻𝗅𝗈𝖼𝗄(v),𝖻𝗅𝗈𝖼𝗄(v)\mathsf{block}^{-\uparrow}(v),\mathsf{block}^{-\downarrow}(v) as, respectively, the supremal and infimal point xdx\in\mathbb{R}^{d} such that 𝖻𝗅𝗈𝖼𝗄(x)=v\mathsf{block}(x)=v. The algorithm will ignore the extreme blocks ULU\cup L, which do not have a supremal or an infimal point. Therefore it is not defined whether these blocks are boundary blocks.

By Lemma 2.7, with probability at least 5/6, we will have ‖block(μ) − unif({0,…,r+1}^d)‖_TV ≤ ε/8. We define b, h as before, with domain [r]^d. Define g similarly but with domain R^d and values

g(x) = 1 if block(x) ∈ U;   0 if block(x) ∈ L;   f(x) if block(x) ∈ [r]^d is a boundary block;   f^block(block(x)) otherwise.

If f is monotone, it may now be the case that f ≠ g, but we will have f(x) = g(x) for all x with block(x) ∈ [r]^d, which is where the algorithm makes its queries. The algorithm will test whether f(x) = g(x) on all x with block(x) ∈ [r]^d, or whether f is ε/8-far from this property, which can again be done with O(1/ε) samples. Note that if f is ε/8-close to having this property, then

dist_μ(f,g) ≤ P_{x∼μ}[ block(x) ∉ [r]^d ] + ε/8
 ≤ 2d(r+2)^{d−1}/(r+2)^d + ‖block(μ) − unif({0,…,r+1}^d)‖_TV + ε/8
 ≤ ε/8 + ε/4 + ε/8 ≤ ε/2.

The algorithm then proceeds as before, with error parameter ε/2. To test whether g = f, the algorithm samples from μ and throws away any sample x ∈ R^d with block(x) ∉ [r]^d. It then tests b and h using the uniform distribution on [r]^d. It suffices to prove the following claim, which replaces Claim 3.4.

Claim 3.5.

If gg is ϵ/2\epsilon/2-close to ff, bb is ϵ/16\epsilon/16-close to diagonal, and hh is ϵ/8\epsilon/8-close to monotone, then ff is ϵ\epsilon-close to monotone.

Proof of claim.

Let p:[r]d{0,1}p:[r]^{d}\to\{0,1\} be a monotone function minimizing the distance to hh. Then p(v)h(v)p(v)\neq h(v) on at most ϵrd8\frac{\epsilon r^{d}}{8} blocks v[r]dv\in[r]^{d}. Define p𝖼𝗈𝖺𝗋𝗌𝖾:d{0,1}p^{\mathsf{coarse}}:\mathbb{R}^{d}\to\{0,1\} as p𝖼𝗈𝖺𝗋𝗌𝖾(x)=p(𝖻𝗅𝗈𝖼𝗄(x))p^{\mathsf{coarse}}(x)=p(\mathsf{block}(x)) when 𝖻𝗅𝗈𝖼𝗄(x)[r]d\mathsf{block}(x)\in[r]^{d}, and p𝖼𝗈𝖺𝗋𝗌𝖾(x)=g(x)p^{\mathsf{coarse}}(x)=g(x) when 𝖻𝗅𝗈𝖼𝗄(x)UL\mathsf{block}(x)\in U\cup L. Note that p𝖼𝗈𝖺𝗋𝗌𝖾p^{\mathsf{coarse}} is monotone.

By the triangle inequality,

𝖽𝗂𝗌𝗍μ(f,p𝖼𝗈𝖺𝗋𝗌𝖾)𝖽𝗂𝗌𝗍μ(f,g)+𝖽𝗂𝗌𝗍μ(g,p𝖼𝗈𝖺𝗋𝗌𝖾).\mathsf{dist}_{\mu}(f,p^{\mathsf{coarse}})\leq\mathsf{dist}_{\mu}(f,g)+\mathsf{dist}_{\mu}(g,p^{\mathsf{coarse}})\,.

From above, we know 𝖽𝗂𝗌𝗍μ(f,g)ϵ/2\mathsf{dist}_{\mu}(f,g)\leq\epsilon/2. To bound the second term, observe that since bb is ϵ/16\epsilon/16-close to diagonal, there are at most

ϵ16rd+drd1ϵ16rd+drrdϵ16rd+ϵ16rd=ϵ8rd\frac{\epsilon}{16}r^{d}+dr^{d-1}\leq\frac{\epsilon}{16}r^{d}+\frac{d}{r}r^{d}\leq\frac{\epsilon}{16}r^{d}+\frac{\epsilon}{16}r^{d}=\frac{\epsilon}{8}r^{d}

boundary blocks. Then observe that if g(x)p𝖼𝗈𝖺𝗋𝗌𝖾(x)g(x)\neq p^{\mathsf{coarse}}(x) then 𝖻𝗅𝗈𝖼𝗄(x)[r]d\mathsf{block}(x)\in[r]^{d} and either 𝖻𝗅𝗈𝖼𝗄(x)\mathsf{block}(x) is a boundary block, or g(x)=f𝖻𝗅𝗈𝖼𝗄(𝖻𝗅𝗈𝖼𝗄(x))=h(𝖻𝗅𝗈𝖼𝗄(x))g(x)=f^{\mathsf{block}}(\mathsf{block}(x))=h(\mathsf{block}(x)) and h(𝖻𝗅𝗈𝖼𝗄(x))p(𝖻𝗅𝗈𝖼𝗄(x))h(\mathsf{block}(x))\neq p(\mathsf{block}(x)). Then

𝖽𝗂𝗌𝗍μ(g,p𝖼𝗈𝖺𝗋𝗌𝖾)\displaystyle\mathsf{dist}_{\mu}(g,p^{\mathsf{coarse}}) (1(r+2)dv[r]d𝟏[v is a boundary block, or h(v)p(v)])\displaystyle\leq\left(\frac{1}{(r+2)^{d}}\sum_{v\in[r]^{d}}\mathbf{1}\left[v\text{ is a boundary block, or }h(v)\neq p(v)\right]\right)
+𝖻𝗅𝗈𝖼𝗄(μ)𝗎𝗇𝗂𝖿({0,,r+1}d)𝖳𝖵\displaystyle\qquad+\|\mathsf{block}(\mu)-\mathsf{unif}(\{0,\dotsc,r+1\}^{d})\|_{\mathsf{TV}}
ϵrd8rd+ϵrd8rd+ϵ4ϵ2.\displaystyle\leq\frac{\epsilon r^{d}}{8r^{d}}+\frac{\epsilon r^{d}}{8r^{d}}+\frac{\epsilon}{4}\leq\frac{\epsilon}{2}\,.\qed

4 Learning and Testing Functions of Convex Sets

In this section we present our learning and testing results for functions of kk convex sets: an agnostic learning algorithm, a sample-based distance approximator, and a sample-based one-sided tester. All our algorithms will follow from more general results that actually hold for any class \mathcal{H} with bounded rr-block boundary size; this shows that bounded block-boundary size is sufficient to guarantee learnability in product distributions.

Let 𝒞\mathcal{C} be the set of functions f:d{±1}f:\mathbb{R}^{d}\to\{\pm 1\} such that f1(1)f^{-1}(1) is convex. Let k\mathcal{B}_{k} be the set of all Boolean functions h:{±1}k{±1}h:\{\pm 1\}^{k}\to\{\pm 1\}.

Definition 4.1 (Function Composition).

For a set \mathcal{H} of functions h:d{±1}h:\mathbb{R}^{d}\to\{\pm 1\}, we will define the composition k\mathcal{B}_{k}\circ\mathcal{H} as the set of functions of the form f(x)=g(h1(x),,hk(x))f(x)=g(h_{1}(x),\dotsc,h_{k}(x)) where gkg\in\mathcal{B}_{k} and each hih_{i} belongs to \mathcal{H}.

Proposition 4.2.

Let \mathcal{H} be any class of functions d{±1}\mathbb{R}^{d}\to\{\pm 1\} and fix any rr. Then 𝖻𝖻𝗌(k,r)k𝖻𝖻𝗌(,r)\mathsf{bbs}(\mathcal{B}_{k}\circ\mathcal{H},r)\leq k\cdot\mathsf{bbs}(\mathcal{H},r).

Proof.

If f()=g(h1(),,hk())f(\cdot)=g(h_{1}(\cdot),\dotsc,h_{k}(\cdot)) is not constant on 𝖻𝗅𝗈𝖼𝗄1(v)\mathsf{block}^{-1}(v) then one of the hih_{i} is not constant on that block. Therefore 𝖻𝖻𝗌(f,r)i=1k𝖻𝖻𝗌(hi,r)k𝖻𝖻𝗌(,r)\mathsf{bbs}(f,r)\leq\sum_{i=1}^{k}\mathsf{bbs}(h_{i},r)\leq k\cdot\mathsf{bbs}(\mathcal{H},r). ∎

Lemma 4.3.

For any rr, 𝖻𝖻𝗌(k𝒞,r)2dkrd1\mathsf{bbs}(\mathcal{B}_{k}\circ\mathcal{C},r)\leq 2dkr^{d-1}.

Proof.

We prove 𝖻𝖻𝗌(𝒞,r)2drd1\mathsf{bbs}(\mathcal{C},r)\leq 2dr^{d-1} by induction on dd; the result will hold by Proposition 4.2. Let 𝖻𝖻𝗌(𝒞,r,d)\mathsf{bbs}(\mathcal{C},r,d) be the rr-block boundary size in dimension dd. Recall that block v[r]dv\in[r]^{d} is the set Bv=B1,v1××Bd,vdB_{v}=B_{1,v_{1}}\times\dotsm\times B_{d,v_{d}} where Bi,j=(ai,j1,ai,j]B_{i,j}=(a_{i,j-1},a_{i,j}] for some ai,j1<ai,ja_{i,j-1}<a_{i,j}. Let f𝒞f\in\mathcal{C}.

For d=1d=1, if there are 3 intervals B1,i1,B1,i2,B1,i3B_{1,i_{1}},B_{1,i_{2}},B_{1,i_{3}}, i1<i2<i3i_{1}<i_{2}<i_{3}, on which ff is not constant, then within each interval the function takes both values {±1}\{\pm 1\}. Thus, there are points aB1,i1,bB1,i2,cB1,i3a\in B_{1,i_{1}},b\in B_{1,i_{2}},c\in B_{1,i_{3}} such that f(a)=1,f(b)=1,f(c)=1f(a)=1,f(b)=-1,f(c)=1, which is a contradiction.

For each block B_v, let A_v = {a_{1,v_1}} × B_{2,v_2} × ⋯ × B_{d,v_d} be the “upper face”. For d > 1, let P ⊆ [r]^d be the set of non-constant blocks B_v such that f is constant on the upper face, and let Q be the set of non-constant blocks that are non-constant on the upper face, so that bbs(f,r,d) = |P| + |Q|. We argue that |P| ≤ 2r^{d−1}: for a vector w ∈ [r]^{d−1} define the line L_w := {v ∈ [r]^d | ∀i>1, v_i = w_i}. If |P ∩ L_w| ≥ 3 then there are t,u,v ∈ L_w with t < u < v such that f is constant on A_t, A_u, A_v but non-constant on B_t, B_u, B_v. Let x, y, z be points in B_t, B_u, B_v respectively such that f(x) = f(y) = f(z) = 1. If f were constant −1 on A_t or on A_u this would be a contradiction, since the segment from x to y crosses A_t, the segment from y to z crosses A_u, and both segments lie in the convex set f^{-1}(1); so f is constant 1 on A_t and A_u. But then, since B_u is non-constant, there is a point q ∈ B_u with f(q) = −1, which is a contradiction since q lies within the convex hull of A_t ∪ A_u. So |L_w ∩ P| < 3; since there are at most r^{d−1} lines L_w, we get |P| ≤ 2r^{d−1}.

To bound |Q||Q|, observe that for each block vQ,fv\in Q,f is non-constant on the plane {a1,v1}×d1\{a_{1,v_{1}}\}\times\mathbb{R}^{d-1}, there are (r1)(r-1) such planes, ff is convex on each, and the rr-block partition induces an rr-block partition on the plane where ff is non-constant on the corresponding block. Then, by induction |Q|(r1)𝖻𝖻𝗌(𝒞,r,d1)2(d1)(r1)rd2|Q|\leq(r-1)\cdot\mathsf{bbs}(\mathcal{C},r,d-1)\leq 2(d-1)(r-1)r^{d-2}. So

𝖻𝖻𝗌(𝒞,r,d)2[(d1)(r1)rd2+rd1]<2drd1.\mathsf{bbs}(\mathcal{C},r,d)\leq 2\left[(d-1)(r-1)r^{d-2}+r^{d-1}\right]<2dr^{d-1}\,.\qed

The above two lemmas combine to show that rd𝖻𝖻𝗌(k𝒞,r)rd(2dkrd1)=2dk/rϵr^{-d}\cdot\mathsf{bbs}(\mathcal{B}_{k}\circ\mathcal{C},r)\leq r^{-d}(2dkr^{d-1})=2dk/r\leq\epsilon when r=2dk/ϵr=\lceil 2dk/\epsilon\rceil.

4.1 Sample-based One-sided Tester

First, we prove a one-sided sample-based tester for convex sets.

See 1.1.4

Proof.

We prove the result for continuous distributions. The proof for finite distributions is in Theorem A.18.

On input distribution μ\mu and function ff, let r=6d/ϵr=\lceil 6d/\epsilon\rceil so that rd𝖻𝖻𝗌(𝒞,r)ϵ/3r^{-d}\cdot\mathsf{bbs}(\mathcal{C},r)\leq\epsilon/3.

  1. 1.

    Sample a grid XX of size m=O(rd2ϵ2log(rd/ϵ))m=O\left(\frac{rd^{2}}{\epsilon^{2}}\log(rd/\epsilon)\right) large enough that Lemma 2.7 guarantees 𝖻𝗅𝗈𝖼𝗄(μ)𝗎𝗇𝗂𝖿([r]d)𝖳𝖵<ϵ/9\|\mathsf{block}(\mu)-\mathsf{unif}([r]^{d})\|_{\mathsf{TV}}<\epsilon/9 with probability 5/65/6.

  2. 2.

    Take q=O(rdϵ)q=O\left(\frac{r^{d}}{\epsilon}\right) samples QQ and accept if there exists h𝒞h\in\mathcal{C} such that f(x)=h𝖼𝗈𝖺𝗋𝗌𝖾(x)f(x)=h^{\mathsf{coarse}}(x) on all xQx\in Q that are not in a boundary block of hh.

This tester is one-sided since for any h ∈ C, h(x) = h^coarse(x) for all x ∈ Q that are not in a boundary block of h, regardless of whether the r-block decomposition induced by X satisfies ‖block(μ) − unif([r]^d)‖_TV < ε/9. Now suppose that dist_μ(f,C) > ε, and suppose that ‖block(μ) − unif([r]^d)‖_TV < ε/9. For h ∈ C, let B_h ⊆ [r]^d be the set of non-constant blocks of h. If there exists h ∈ C such that P_{x∼μ}[h^coarse(x) ≠ f(x) ∧ block(x) ∉ B_h] < ε/9, then

𝖽𝗂𝗌𝗍μ(f,h𝖼𝗈𝖺𝗋𝗌𝖾)\displaystyle\mathsf{dist}_{\mu}(f,h^{\mathsf{coarse}}) xμ[𝖻𝗅𝗈𝖼𝗄(x)Bh]+xμ[h𝖼𝗈𝖺𝗋𝗌𝖾(x)f(x)𝖻𝗅𝗈𝖼𝗄(x)Bh]\displaystyle\leq\underset{x\sim\mu}{\mathbb{P}}\left[\mathsf{block}(x)\in B_{h}\right]+\underset{x\sim\mu}{\mathbb{P}}\left[h^{\mathsf{coarse}}(x)\neq f(x)\wedge\mathsf{block}(x)\notin B_{h}\right]
rd𝖻𝖻𝗌(𝒞,r)+𝖻𝗅𝗈𝖼𝗄(μ)𝗎𝗇𝗂𝖿([r]d)𝖳𝖵+ϵ9\displaystyle\leq r^{-d}\cdot\mathsf{bbs}(\mathcal{C},r)+\|\mathsf{block}(\mu)-\mathsf{unif}([r]^{d})\|_{\mathsf{TV}}+\frac{\epsilon}{9}
(13+29)ϵ=59ϵ.\displaystyle\leq\left(\frac{1}{3}+\frac{2}{9}\right)\epsilon=\frac{5}{9}\cdot\epsilon\,.

Therefore

𝖽𝗂𝗌𝗍μ(f,h)\displaystyle\mathsf{dist}_{\mu}(f,h) 𝖽𝗂𝗌𝗍μ(f,h𝖼𝗈𝖺𝗋𝗌𝖾)+𝖽𝗂𝗌𝗍μ(h𝖼𝗈𝖺𝗋𝗌𝖾,h)\displaystyle\leq\mathsf{dist}_{\mu}(f,h^{\mathsf{coarse}})+\mathsf{dist}_{\mu}(h^{\mathsf{coarse}},h)
𝖽𝗂𝗌𝗍μ(f,h𝖼𝗈𝖺𝗋𝗌𝖾)+rd𝖻𝖻𝗌(𝒞,r)+𝖻𝗅𝗈𝖼𝗄(μ)𝗎𝗇𝗂𝖿([r]d)𝖳𝖵\displaystyle\leq\mathsf{dist}_{\mu}(f,h^{\mathsf{coarse}})+r^{-d}\cdot\mathsf{bbs}(\mathcal{C},r)+\|\mathsf{block}(\mu)-\mathsf{unif}([r]^{d})\|_{\mathsf{TV}}
59ϵ+13ϵ+19ϵ=ϵ,\displaystyle\leq\frac{5}{9}\epsilon+\frac{1}{3}\cdot\epsilon+\frac{1}{9}\epsilon=\epsilon\,,

a contradiction. So it must be that for every h ∈ C, P_{x∼μ}[f(x) ≠ h^coarse(x) ∧ block(x) ∉ B_h] ≥ ε/9. There are at most binom(r^d, (ε/3)r^d) ≤ (3e/ε)^{εr^d/3} choices of boundary set B. Because the 1-valued blocks must be the convex hull of the boundary points, for each boundary set B there are at most 2 choices of function h^coarse with boundary B (with a second choice occurring when the complement of h^coarse is also a convex set with the same boundary). Therefore, by the union bound, the probability that f is accepted is at most

(3e/ε)^{(ε/3)r^d} · (1 − ε/9)^q ≤ exp( (εr^d/3)·ln(3e/ε) − qε/9 ),

which is at most 1/6 for a suitable q = O(r^d log(1/ε) + 1/ε) = O(r^d/ε), consistent with the choice of q in step 2. ∎
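The computational core of step 2 is a consistency check against convex sets. As an illustration only, ignoring the block and boundary-block bookkeeping and working directly with the sample points, consistency with some convex set amounts to checking that no (−1)-labelled point lies in the convex hull of the (+1)-labelled points, a linear-programming feasibility question; the helper names below are ours.

```python
# Illustrative sketch of the convexity consistency check (not the exact test of step 2).
import numpy as np
from scipy.optimize import linprog

def in_convex_hull(q, P):
    """True iff q is a convex combination of the rows of P (LP feasibility)."""
    k = P.shape[0]
    A_eq = np.vstack([P.T, np.ones((1, k))])      # P^T lambda = q  and  sum(lambda) = 1
    b_eq = np.append(q, 1.0)
    res = linprog(c=np.zeros(k), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * k)
    return res.success

def consistent_with_some_convex_set(points, labels):
    pos = points[labels == 1]
    neg = points[labels == -1]
    if len(pos) == 0:
        return True
    return not any(in_convex_hull(q, pos) for q in neg)
```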

4.2 Sample-based Distance Approximator

Our sample-based distance approximator follows from the following general result.

Lemma 4.4.

For any set \mathcal{H} of functions d{±1}\mathbb{R}^{d}\to\{\pm 1\}, ϵ>0\epsilon>0, and rr satisfying rd𝖻𝖻𝗌(,r)ϵ/3r^{-d}\cdot\mathsf{bbs}(\mathcal{H},r)\leq\epsilon/3, there is a sample-based distribution-free algorithm for product distributions that approximates distance to \mathcal{H} up to additive error ϵ\epsilon using O(rdϵ2)O\left(\frac{r^{d}}{\epsilon^{2}}\right) samples.

Proof.

On input distribution μ and function f: R^d → {±1}, let r be as in the statement, so that r^{-d}·bbs(H,r) ≤ ε/3; then:

  1. 1.

    Sample a grid XX of size m=O(rd2ϵ2logrdϵ)m=O(\frac{rd^{2}}{\epsilon^{2}}\log\frac{rd}{\epsilon}) large enough that Lemma 2.7 guarantees 𝖻𝗅𝗈𝖼𝗄(μ)𝗎𝗇𝗂𝖿([r]d)𝖳𝖵<ϵ/3\|\mathsf{block}(\mu)-\mathsf{unif}([r]^{d})\|_{\mathsf{TV}}<\epsilon/3 with probability 5/65/6.

  2. 2.

    Let 𝖼𝗈𝖺𝗋𝗌𝖾\mathcal{H}^{\mathsf{coarse}} be the set of all functions h𝖼𝗈𝖺𝗋𝗌𝖾h^{\mathsf{coarse}} where hh\in\mathcal{H}; note that |𝖼𝗈𝖺𝗋𝗌𝖾|2rd|\mathcal{H}^{\mathsf{coarse}}|\leq 2^{r^{d}}.

  3. 3.

    Draw q=O(rdϵ2)q=O\left(\frac{r^{d}}{\epsilon^{2}}\right) samples QQ and output the distance on QQ to the nearest function in 𝖼𝗈𝖺𝗋𝗌𝖾\mathcal{H}^{\mathsf{coarse}}.

We argue that, with probability at least 5/6, H^coarse is a (5/6)ε-cover of H. With probability at least 5/6, ‖block(μ) − unif([r]^d)‖_TV < ε/3. Then by Proposition 2.5, for any h ∈ H,

P_{x∼μ}[ h(x) ≠ h^coarse(x) ] ≤ r^{-d}·bbs(h,r) + ‖block(μ) − unif([r]^d)‖_TV ≤ (1/3 + 1/3)ε = (2/3)ε ≤ (5/6)ε,

so H^coarse is a (5/6)ε-cover; assume this event occurs.

Write 𝖽𝗂𝗌𝗍Q(f,g):=1qxQ𝟏[f(x)g(x)]\mathsf{dist}_{Q}(f,g):=\frac{1}{q}\sum_{x\in Q}\mathbf{1}\left[f(x)\neq g(x)\right]. By the union bound and Hoeffding’s inequality, with qq samples we fail to get an estimate of 𝖽𝗂𝗌𝗍μ(f,𝖼𝗈𝖺𝗋𝗌𝖾)\mathsf{dist}_{\mu}(f,\mathcal{H}^{\mathsf{coarse}}) up to additive error 16ϵ\frac{1}{6}\epsilon with probability at most

|𝖼𝗈𝖺𝗋𝗌𝖾|maxh𝖼𝗈𝖺𝗋𝗌𝖾𝖼𝗈𝖺𝗋𝗌𝖾𝑄[|𝖽𝗂𝗌𝗍μ(f,h𝖼𝗈𝖺𝗋𝗌𝖾)𝖽𝗂𝗌𝗍Q(f,h𝖼𝗈𝖺𝗋𝗌𝖾)|>16ϵ]|𝖼𝗈𝖺𝗋𝗌𝖾|exp(2qϵ236)<16|\mathcal{H}^{\mathsf{coarse}}|\cdot\max_{h^{\mathsf{coarse}}\in\mathcal{H}^{\mathsf{coarse}}}\underset{Q}{\mathbb{P}}\left[\left|\mathsf{dist}_{\mu}(f,h^{\mathsf{coarse}})-\mathsf{dist}_{Q}(f,h^{\mathsf{coarse}})\right|>\frac{1}{6}\epsilon\right]\leq|\mathcal{H}^{\mathsf{coarse}}|\mathrm{exp}\left(-2\frac{q\epsilon^{2}}{36}\right)<\frac{1}{6}

for appropriately chosen q=O(1ϵ2log(|𝖼𝗈𝖺𝗋𝗌𝖾|))=O(rdϵ2)q=O\left(\frac{1}{\epsilon^{2}}\log(|\mathcal{H}^{\mathsf{coarse}}|)\right)=O\left(\frac{r^{d}}{\epsilon^{2}}\right). Assume this event occurs. We want to show that |𝖽𝗂𝗌𝗍Q(f,𝖼𝗈𝖺𝗋𝗌𝖾)𝖽𝗂𝗌𝗍μ(f,)|ϵ|\mathsf{dist}_{Q}(f,\mathcal{H}^{\mathsf{coarse}})-\mathsf{dist}_{\mu}(f,\mathcal{H})|\leq\epsilon. Let hh\in\mathcal{H} minimize 𝖽𝗂𝗌𝗍μ(f,h)\mathsf{dist}_{\mu}(f,h) so 𝖽𝗂𝗌𝗍μ(f,h)=𝖽𝗂𝗌𝗍μ(f,)\mathsf{dist}_{\mu}(f,h)=\mathsf{dist}_{\mu}(f,\mathcal{H}). Then

𝖽𝗂𝗌𝗍Q(f,𝖼𝗈𝖺𝗋𝗌𝖾)\displaystyle\mathsf{dist}_{Q}(f,\mathcal{H}^{\mathsf{coarse}}) 𝖽𝗂𝗌𝗍Q(f,h𝖼𝗈𝖺𝗋𝗌𝖾)𝖽𝗂𝗌𝗍μ(f,h𝖼𝗈𝖺𝗋𝗌𝖾)+ϵ6\displaystyle\leq\mathsf{dist}_{Q}(f,h^{\mathsf{coarse}})\leq\mathsf{dist}_{\mu}(f,h^{\mathsf{coarse}})+\frac{\epsilon}{6}
𝖽𝗂𝗌𝗍μ(f,h)+𝖽𝗂𝗌𝗍μ(h,h𝖼𝗈𝖺𝗋𝗌𝖾)+ϵ6𝖽𝗂𝗌𝗍μ(f,)+ϵ.\displaystyle\leq\mathsf{dist}_{\mu}(f,h)+\mathsf{dist}_{\mu}(h,h^{\mathsf{coarse}})+\frac{\epsilon}{6}\leq\mathsf{dist}_{\mu}(f,\mathcal{H})+\epsilon\,.

Now let gg\in\mathcal{H} minimize 𝖽𝗂𝗌𝗍Q(f,g𝖼𝗈𝖺𝗋𝗌𝖾)\mathsf{dist}_{Q}(f,g^{\mathsf{coarse}}) so 𝖽𝗂𝗌𝗍Q(f,g𝖼𝗈𝖺𝗋𝗌𝖾)=𝖽𝗂𝗌𝗍Q(f,𝖼𝗈𝖺𝗋𝗌𝖾)\mathsf{dist}_{Q}(f,g^{\mathsf{coarse}})=\mathsf{dist}_{Q}(f,\mathcal{H}^{\mathsf{coarse}}). Then

𝖽𝗂𝗌𝗍Q(f,𝖼𝗈𝖺𝗋𝗌𝖾)\displaystyle\mathsf{dist}_{Q}(f,\mathcal{H}^{\mathsf{coarse}}) =𝖽𝗂𝗌𝗍Q(f,g𝖼𝗈𝖺𝗋𝗌𝖾)𝖽𝗂𝗌𝗍μ(f,g𝖼𝗈𝖺𝗋𝗌𝖾)ϵ6𝖽𝗂𝗌𝗍μ(f,h𝖼𝗈𝖺𝗋𝗌𝖾)ϵ6\displaystyle=\mathsf{dist}_{Q}(f,g^{\mathsf{coarse}})\geq\mathsf{dist}_{\mu}(f,g^{\mathsf{coarse}})-\frac{\epsilon}{6}\geq\mathsf{dist}_{\mu}(f,h^{\mathsf{coarse}})-\frac{\epsilon}{6}
𝖽𝗂𝗌𝗍μ(f,h)𝖽𝗂𝗌𝗍μ(h,h𝖼𝗈𝖺𝗋𝗌𝖾)ϵ6𝖽𝗂𝗌𝗍μ(f,h)ϵ,\displaystyle\geq\mathsf{dist}_{\mu}(f,h)-\mathsf{dist}_{\mu}(h,h^{\mathsf{coarse}})-\frac{\epsilon}{6}\geq\mathsf{dist}_{\mu}(f,h)-\epsilon\,,

which concludes the proof. ∎
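In code, step 3 is a straightforward empirical minimization once H^coarse has been enumerated; the enumeration is the expensive part and is taken as given below, and the names are illustrative.

```python
# Illustrative sketch of step 3: distance of f to the nearest coarse hypothesis on the sample Q.
def estimate_distance(samples, labels, block, coarse_hypotheses):
    """samples: points x in Q; labels: f(x); block: the r-block partition;
    coarse_hypotheses: iterable of functions mapping cells of [r]^d to {+1, -1}."""
    q = len(samples)
    best = 1.0
    for h in coarse_hypotheses:
        disagreements = sum(1 for x, y in zip(samples, labels) if h(block(x)) != y)
        best = min(best, disagreements / q)
    return best
```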

Applying the bound on bbs(B_k ∘ C, r) from Lemma 4.3 to Lemma 4.4, we conclude: See 1.1.4

4.3 Agnostic Learning

We begin our learning results with an agnostic learning algorithm for functions of kk convex sets: the class k𝒞\mathcal{B}_{k}\circ\mathcal{C}. For a distribution 𝒟\mathcal{D} over d×{±1}\mathbb{R}^{d}\times\{\pm 1\} and an rr-block partition 𝖻𝗅𝗈𝖼𝗄:d[r]d\mathsf{block}:\mathbb{R}^{d}\to[r]^{d}, define the distribution 𝒟𝖻𝗅𝗈𝖼𝗄\mathcal{D}^{\mathsf{block}} over [r]d×{±1}[r]^{d}\times\{\pm 1\} as the distribution of (𝖻𝗅𝗈𝖼𝗄(x),b)(\mathsf{block}(x),b) when (x,b)𝒟(x,b)\sim\mathcal{D}.

Lemma 4.5.

Let H be any set of functions R^d → {±1}, let ε > 0, and suppose r satisfies r^{-d}·bbs(H,r) ≤ ε/3. Then there is a distribution-free agnostic learning algorithm for continuous product distributions that learns H with O((r^d + rd² log(rd/ε))/ε²) samples and time.

Proof.

On input distribution 𝒟\mathcal{D}:

  1. 1.

    Sample a grid XX of size m=O(rd2ϵ2log(rd/ϵ))m=O(\frac{rd^{2}}{\epsilon^{2}}\log(rd/\epsilon)) large enough that Lemma 2.7 guarantees 𝖻𝗅𝗈𝖼𝗄(μ)𝗎𝗇𝗂𝖿([r]d)𝖳𝖵<ϵ/3\|\mathsf{block}(\mu)-\mathsf{unif}([r]^{d})\|_{\mathsf{TV}}<\epsilon/3 with probability 5/65/6, where 𝖻𝗅𝗈𝖼𝗄:d[r]d\mathsf{block}:\mathbb{R}^{d}\to[r]^{d} is the induced rr-block partition.

  2. 2.

    Agnostically learn a function g:[r]d{±1}g:[r]^{d}\to\{\pm 1\} with error ϵ/3\epsilon/3 and success probability 5/65/6 using O(rd/ϵ2)O(r^{d}/\epsilon^{2}) samples from 𝒟𝖻𝗅𝗈𝖼𝗄\mathcal{D}^{\mathsf{block}}. Output the function g𝖻𝗅𝗈𝖼𝗄g\circ\mathsf{block}.

The second step is accomplished via standard learning results ([SSBD14] Theorem 6.8): the number of samples required for agnostic learning is bounded by O(1/ϵ2)O(1/\epsilon^{2}) multiplied by the logarithm of the number of functions in the class, and the number of functions [r]d{±1}[r]^{d}\to\{\pm 1\} is 2rd2^{r^{d}}. Assume that both steps succeed, which occurs with probability at least 2/32/3. Let ff\in\mathcal{H} minimize (x,b)𝒟[f(x)b]\underset{(x,b)\sim\mathcal{D}}{\mathbb{P}}\left[f(x)\neq b\right]. By Proposition 2.5,

P_{x∼μ}[ f(x) ≠ f^coarse(x) ] ≤ r^{-d}·bbs(f,r) + ‖block(μ) − unif([r]^d)‖_TV < 2ε/3.

Then

(x,b)𝒟[g(𝖻𝗅𝗈𝖼𝗄(x))b]\displaystyle\underset{(x,b)\sim\mathcal{D}}{\mathbb{P}}\left[g(\mathsf{block}(x))\neq b\right] =(v,b)𝒟𝖻𝗅𝗈𝖼𝗄[g(v)b](v,b)𝒟𝖻𝗅𝗈𝖼𝗄[f𝖻𝗅𝗈𝖼𝗄(v)b]+ϵ/3\displaystyle=\underset{(v,b)\sim\mathcal{D}^{\mathsf{block}}}{\mathbb{P}}\left[g(v)\neq b\right]\leq\underset{(v,b)\sim\mathcal{D}^{\mathsf{block}}}{\mathbb{P}}\left[f^{\mathsf{block}}(v)\neq b\right]+\epsilon/3
=(x,b)𝒟[f𝖼𝗈𝖺𝗋𝗌𝖾(x)b]+ϵ/3<(x,b)𝒟[f(x)b]+ϵ.\displaystyle=\underset{(x,b)\sim\mathcal{D}}{\mathbb{P}}\left[f^{\mathsf{coarse}}(x)\neq b\right]+\epsilon/3<\underset{(x,b)\sim\mathcal{D}}{\mathbb{P}}\left[f(x)\neq b\right]+\epsilon\,.\qed
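Since the class being learned in step 2 is all functions [r]^d → {±1}, empirical risk minimization over it reduces to a per-block majority vote over the observed labels, as in the following sketch (the default label for blocks that receive no samples is our arbitrary choice).

```python
# Illustrative sketch of step 2: ERM over all functions [r]^d -> {+1,-1} is a majority vote.
from collections import defaultdict

def learn_block_labels(block_samples):
    """block_samples: iterable of (v, b) pairs with v a cell of [r]^d and b in {+1,-1}.
    Returns g(v), the empirical majority label of cell v."""
    totals = defaultdict(int)
    for v, b in block_samples:
        totals[v] += b

    def g(v):
        return 1 if totals[v] >= 0 else -1        # unseen cells default to +1
    return g
```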

Lemma 4.5 then gives the following result for continuous product distributions, with the result for finite distributions following from Theorem A.18.

See 1.1.4

5 Learning Functions of Halfspaces

A halfspace, or linear threshold function, is any function h: R^d → {±1} such that for some w ∈ R^d and t ∈ R, h(x) = sign(⟨w,x⟩ − t), where sign(z) = 1 if z ≥ 0 and −1 otherwise. Let H be the set of halfspaces. Recall that downsampling reduces learning H in R^d to learning H^block over [r]^d, and H^block is not the set of halfspaces over [r]^d. Fortunately, agnostically learning a halfspace h is commonly done by giving a bound on the degree of a polynomial p that approximates h [KOS04, KOS08, KKMS08], and we will show that a similar idea also suffices for learning H^block. We first present a general algorithm based on "polynomial regression", and then introduce the Fourier analysis necessary to apply the general learning algorithm to halfspaces, polynomial threshold functions, and k-alternating functions.

5.1 A General Learning Algorithm

The learning algorithm in this section essentially replaces step 2 of the brute force algorithm (Lemma 4.5) with the “polynomial regression” algorithm of Kalai et al. [KKMS08]. Our general algorithm is inspired by an algorithm of Canonne et al. [CGG+19] for tolerantly testing kk-alternating functions over the uniform distribution on [n]d[n]^{d}; we state the regression algorithm as it appears in [CGG+19]. For a set \mathcal{F} of functions, 𝗌𝗉𝖺𝗇()\mathsf{span}(\mathcal{F}) is the set of all linear combinations of functions in \mathcal{F}:

Theorem 5.1 ([KKMS08, CGG+19]).

Let μ be a distribution over 𝒳, let H be a class of functions 𝒳 → {±1}, and let F be a collection of functions 𝒳 → R such that for every h ∈ H there exists f ∈ span(F) with E_{x∼μ}[(h(x) − f(x))²] ≤ ε². Then there is an algorithm that, for any distribution D over 𝒳 × {±1} with marginal μ over 𝒳, outputs a function g: 𝒳 → {±1} such that P_{(x,b)∼D}[g(x) ≠ b] ≤ inf_{h∈H} P_{(x,b)∼D}[h(x) ≠ b] + ε, with probability at least 11/12, using at most poly(|F|, 1/ε) samples and time.

Our general learning algorithm will apply to any hypothesis class that has small rr-block boundary size, and for which there is a set of functions \mathcal{F} that approximately span the class 𝖻𝗅𝗈𝖼𝗄\mathcal{H}^{\mathsf{block}}. This algorithm is improved to work for finite (rather than only continuous) product distributions in Lemma A.17.

Lemma 5.2.

Let ϵ>0\epsilon>0 and let \mathcal{H} be a set of measurable functions f:d{±1}f:\mathbb{R}^{d}\to\{\pm 1\} that satisfy:

  1. 1.

    There is some r=r(d,ϵ)r=r(d,\epsilon) such that 𝖻𝖻𝗌(,r)ϵ3rd\mathsf{bbs}(\mathcal{H},r)\leq\frac{\epsilon}{3}\cdot r^{d};

  2. 2.

    There is a set \mathcal{F} of functions [r]d[r]^{d}\to\mathbb{R} satisfying: f,g𝗌𝗉𝖺𝗇()\forall f\in\mathcal{H},\exists g\in\mathsf{span}(\mathcal{F}) such that for v[r]d,𝔼[(f𝖻𝗅𝗈𝖼𝗄(v)g(v))2]ϵ2/4v\sim[r]^{d},\mathbb{E}\left[(f^{\mathsf{block}}(v)-g(v))^{2}\right]\leq\epsilon^{2}/4.

Let n=poly(||,1/ϵ)n=\operatorname{poly}(|\mathcal{F}|,1/\epsilon) be the sample complexity of the algorithm in Theorem 5.1, with error parameter ϵ/2\epsilon/2. Then there is an agnostic learning algorithm for \mathcal{H} on continuous product distributions over d\mathbb{R}^{d}, that uses O(max(n2,1/ϵ2)rd2log(dr))O(\max(n^{2},1/\epsilon^{2})\cdot rd^{2}\log(dr)) samples and runs in time polynomial in the sample size.

Proof.

We will assume n>1/ϵn>1/\epsilon. Let μ\mu be the marginal of 𝒟\mathcal{D} on d\mathbb{R}^{d}. For an rr-block partition, let 𝒟𝖻𝗅𝗈𝖼𝗄\mathcal{D}^{\mathsf{block}} be the distribution of (𝖻𝗅𝗈𝖼𝗄(x),b)(\mathsf{block}(x),b) when (x,b)𝒟(x,b)\sim\mathcal{D}. We may simulate samples from 𝒟𝖻𝗅𝗈𝖼𝗄\mathcal{D}^{\mathsf{block}} by sampling (x,b)(x,b) from 𝒟\mathcal{D} and constructing (𝖻𝗅𝗈𝖼𝗄(x),b)(\mathsf{block}(x),b). The algorithm is as follows:

  1. 1.

    Sample a grid XX of length m=O(rd2n2log(rd))m=O(rd^{2}n^{2}\log(rd)) large enough that Lemma 2.7 guarantees 𝖻𝗅𝗈𝖼𝗄(μ)𝗎𝗇𝗂𝖿([r]d)𝖳𝖵<1/12n\|\mathsf{block}(\mu)-\mathsf{unif}([r]^{d})\|_{\mathsf{TV}}<1/12n with probability 5/65/6. Construct 𝖻𝗅𝗈𝖼𝗄:d[r]d\mathsf{block}:\mathbb{R}^{d}\to[r]^{d} induced by XX. We may construct the 𝖻𝗅𝗈𝖼𝗄\mathsf{block} function in time O(dmlogm)O(dm\log m) by sorting, and once constructed it takes time O(logr)O(\log r) to compute.

  2. 2.

    Run the algorithm of Theorem 5.1 on a sample of nn points from 𝒟𝖻𝗅𝗈𝖼𝗄\mathcal{D}^{\mathsf{block}} to learn the class 𝖻𝗅𝗈𝖼𝗄\mathcal{H}^{\mathsf{block}}; that algorithm returns a function g:[r]d{±1}g:[r]^{d}\to\{\pm 1\}. Output g𝖻𝗅𝗈𝖼𝗄g\circ\mathsf{block}.

Assume that step 1 succeeds, which occurs with probability at least 5/65/6. By condition 2, the algorithm in step 2 is guaranteed to work on samples (v,b)[r]d×{±1}(v,b)\in[r]^{d}\times\{\pm 1\} where the marginal of vv is 𝗎𝗇𝗂𝖿([r]d)\mathsf{unif}([r]^{d}); let 𝒟𝗎𝗇𝗂𝖿\mathcal{D}^{\mathsf{unif}} be the distribution of (v,b)(v,b) when v𝗎𝗇𝗂𝖿([r]d)v\sim\mathsf{unif}([r]^{d}) and bb is obtained by sampling (x,b)(𝒟x𝖻𝗅𝗈𝖼𝗄1(v))(x,b)\sim(\mathcal{D}\mid x\in\mathsf{block}^{-1}(v)). The algorithm of step 2 will succeed on 𝒟𝗎𝗇𝗂𝖿\mathcal{D}^{\mathsf{unif}}; we argue that it will also succeed on the actual input 𝒟𝖻𝗅𝗈𝖼𝗄\mathcal{D}^{\mathsf{block}} since these distributions are close. Observe that for samples (v,b)𝒟𝗎𝗇𝗂𝖿(v,b)\sim\mathcal{D}^{\mathsf{unif}} and (𝖻𝗅𝗈𝖼𝗄(x),b)𝒟𝖻𝗅𝗈𝖼𝗄(\mathsf{block}(x),b^{\prime})\sim\mathcal{D}^{\mathsf{block}}, if v=𝖻𝗅𝗈𝖼𝗄(x)v=\mathsf{block}(x) then b,bb,b^{\prime} each have the distribution of bb^{\prime} in (x,b)(𝒟𝖻𝗅𝗈𝖼𝗄(x)=v)(x,b^{\prime})\sim(\mathcal{D}\mid\mathsf{block}(x)=v). Therefore

𝒟𝗎𝗇𝗂𝖿𝒟𝖻𝗅𝗈𝖼𝗄𝖳𝖵\displaystyle\|\mathcal{D}^{\mathsf{unif}}-\mathcal{D}^{\mathsf{block}}\|_{\mathsf{TV}} =(v,b)(𝖻𝗅𝗈𝖼𝗄(x),b)𝖳𝖵=v𝖻𝗅𝗈𝖼𝗄(x)𝖳𝖵\displaystyle=\|(v,b)-(\mathsf{block}(x),b^{\prime})\|_{\mathsf{TV}}=\|v-\mathsf{block}(x)\|_{\mathsf{TV}}
=𝖻𝗅𝗈𝖼𝗄(μ)𝗎𝗇𝗂𝖿([r]d)𝖳𝖵<112n.\displaystyle=\|\mathsf{block}(\mu)-\mathsf{unif}([r]^{d})\|_{\mathsf{TV}}<\frac{1}{12n}\,.

It is a standard fact that for product distributions Pn,Qn,PnQn𝖳𝖵nPQ𝖳𝖵P^{n},Q^{n},\|P^{n}-Q^{n}\|_{\mathsf{TV}}\leq n\cdot\|P-Q\|_{\mathsf{TV}}; using this fact,

(𝒟𝗎𝗇𝗂𝖿)n(𝒟𝖻𝗅𝗈𝖼𝗄)n𝖳𝖵n𝒟𝗎𝗇𝗂𝖿𝒟𝖻𝗅𝗈𝖼𝗄𝖳𝖵<112.\|(\mathcal{D}^{\mathsf{unif}})^{n}-(\mathcal{D}^{\mathsf{block}})^{n}\|_{\mathsf{TV}}\leq n\cdot\|\mathcal{D}^{\mathsf{unif}}-\mathcal{D}^{\mathsf{block}}\|_{\mathsf{TV}}<\frac{1}{12}\,.

We will argue that step 2 succeeds with probability 5/65/6; i.e. that with probability 5/65/6,

(v,b)𝒟𝖻𝗅𝗈𝖼𝗄[g(v)b]<infh(v,b)𝒟𝖻𝗅𝗈𝖼𝗄[h𝖻𝗅𝗈𝖼𝗄(v)b]+ϵ/2.\underset{(v,b)\sim\mathcal{D}^{\mathsf{block}}}{\mathbb{P}}\left[g(v)\neq b\right]<\inf_{h\in\mathcal{H}}\underset{(v,b)\sim\mathcal{D}^{\mathsf{block}}}{\mathbb{P}}\left[h^{\mathsf{block}}(v)\neq b\right]+\epsilon/2\,.

Let E(S)E(S) be the event that success occurs given sample S([r]d×{±1})nS\in([r]^{d}\times\{\pm 1\})^{n}. The algorithm samples S(𝒟𝖻𝗅𝗈𝖼𝗄)nS\sim(\mathcal{D}^{\mathsf{block}})^{n} but the success guarantee of step 2 is for (𝒟𝗎𝗇𝗂𝖿)n(\mathcal{D}^{\mathsf{unif}})^{n}; this step will still succeed with probability 5/65/6:

P_{S∼(D^block)^n}[ E(S) ] ≥ P_{S∼(D^unif)^n}[ E(S) ] − ‖(D^unif)^n − (D^block)^n‖_TV > 11/12 − 1/12 = 5/6.

Assume that each step succeeds, which occurs with probability at least 1 − 2·(1/6) = 2/3. By Proposition 2.5, our condition 1, and the fact that n > 1/ε, we have for any h ∈ H that

xμ[h(x)h𝖼𝗈𝖺𝗋𝗌𝖾(x)]rd𝖻𝖻𝗌(,r)+𝖻𝗅𝗈𝖼𝗄(μ)𝗎𝗇𝗂𝖿([r]d)𝖳𝖵ϵ/3+112n<ϵ/2.\displaystyle\underset{x\sim\mu}{\mathbb{P}}\left[h(x)\neq h^{\mathsf{coarse}}(x)\right]\leq r^{-d}\cdot\mathsf{bbs}(\mathcal{H},r)+\|\mathsf{block}(\mu)-\mathsf{unif}([r]^{d})\|_{\mathsf{TV}}\leq\epsilon/3+\frac{1}{12n}<\epsilon/2\,.

The output of the algorithm is g𝖻𝗅𝗈𝖼𝗄g\circ\mathsf{block}, which for any hh\in\mathcal{H} satisfies:

(x,b)𝒟[g(𝖻𝗅𝗈𝖼𝗄(x))b]\displaystyle\underset{(x,b)\sim\mathcal{D}}{\mathbb{P}}\left[g(\mathsf{block}(x))\neq b\right] =(v,b)𝒟𝖻𝗅𝗈𝖼𝗄[g(v)b](v,b)𝒟𝖻𝗅𝗈𝖼𝗄[h𝖻𝗅𝗈𝖼𝗄(v)b]+ϵ/2\displaystyle=\underset{(v,b)\sim\mathcal{D}^{\mathsf{block}}}{\mathbb{P}}\left[g(v)\neq b\right]\leq\underset{(v,b)\sim\mathcal{D}^{\mathsf{block}}}{\mathbb{P}}\left[h^{\mathsf{block}}(v)\neq b\right]+\epsilon/2
=(x,b)𝒟[h𝖼𝗈𝖺𝗋𝗌𝖾(x)b]+ϵ/2\displaystyle=\underset{(x,b)\sim\mathcal{D}}{\mathbb{P}}\left[h^{\mathsf{coarse}}(x)\neq b\right]+\epsilon/2
(x,b)𝒟[h(x)b]+𝑥[h(x)h𝖼𝗈𝖺𝗋𝗌𝖾(x)]+ϵ/2\displaystyle\leq\underset{(x,b)\sim\mathcal{D}}{\mathbb{P}}\left[h(x)\neq b\right]+\underset{x}{\mathbb{P}}\left[h(x)\neq h^{\mathsf{coarse}}(x)\right]+\epsilon/2
<(x,b)𝒟[h(x)b]+ϵ.\displaystyle<\underset{(x,b)\sim\mathcal{D}}{\mathbb{P}}\left[h(x)\neq b\right]+\epsilon\,.

Then P_{(x,b)∼D}[g(block(x)) ≠ b] ≤ inf_{h∈H} P_{(x,b)∼D}[h(x) ≠ b] + ε, as desired. ∎
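The following sketch assembles the two steps, with one substitution made for brevity: ordinary least-squares regression onto span(F), followed by sign rounding, stands in for the ℓ1 polynomial regression of [KKMS08] (which also optimizes the rounding threshold). Here block is the partition built in step 1 and features is the list F, e.g. one of the spanning sets constructed in Section 5.2; the names are ours.

```python
# Illustrative sketch of Lemma 5.2 with least-squares in place of L1 regression.
import numpy as np

def learn_via_downsampling(samples, labels, block, features):
    """samples: points in R^d; labels: {+1,-1}; features: functions on [r]^d.
    Returns a hypothesis R^d -> {+1,-1}."""
    V = [block(x) for x in samples]
    A = np.array([[phi(v) for phi in features] for v in V])   # n x |F| design matrix
    y = np.array(labels, dtype=float)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)               # best approximation in span(F)

    def hypothesis(x):
        v = block(x)
        value = sum(c * phi(v) for c, phi in zip(coef, features))
        return 1 if value >= 0 else -1
    return hypothesis
```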

5.2 Fourier Analysis on [n]d{[n]}^{d}

We will show how to construct a spanning set F satisfying condition 2 of the general learning algorithm, using noise sensitivity and the Walsh basis. For any n, let u ∼ [n]^d uniformly at random and draw v ∈ [n]^d as follows: v_i = u_i with probability 1 − δ, and v_i is uniform in [n]∖{u_i} with probability δ. The δ-noise sensitivity of a function f: [n]^d → {±1} is defined as:

𝗇𝗌n,δ(f):=u,v[f(u)f(v)]=1212𝔼u,v[f(u)f(v)].\mathsf{ns}_{n,\delta}(f):=\underset{u,v}{\mathbb{P}}\left[f(u)\neq f(v)\right]=\frac{1}{2}-\frac{1}{2}\cdot\underset{u,v}{\mathbb{E}}\left[f(u)f(v)\right]\,.

Note that we include n in the subscript to indicate the size of the domain. We will use ns_{r,δ}(f) to obtain upper bounds on the spanning set, and we will obtain bounds on ns_{r,δ} by relating it to ns_{2,δ}, for which many bounds are known. For a function f: [n]^d → {±1}, two vectors u,v ∈ [n]^d, and x ∈ {±1}^d, define [u,v]^x ∈ [n]^d as the vector with [u,v]^x_i = u_i if x_i = 1 and v_i if x_i = −1. Then define f_{u,v}: {±1}^d → {±1} as the function f_{u,v}(x) = f([u,v]^x). The next lemma is essentially the same as the reduction in [BOW10].

Lemma 5.3.

Let \mathcal{H} be a set of functions f:d{±1}f:\mathbb{R}^{d}\to\{\pm 1\} such that for any linear transformation Ad×dA\in\mathbb{R}^{d\times d}, the function fAf\circ A\in\mathcal{H}, and let 𝖻𝗅𝗈𝖼𝗄:d[r]d\mathsf{block}:\mathbb{R}^{d}\to[r]^{d} be any rr-block partition. Let 𝗇𝗌2,δ()=supf𝗇𝗌2,δ(f)\mathsf{ns}_{2,\delta}(\mathcal{H})=\sup_{f\in\mathcal{H}}\mathsf{ns}_{2,\delta}(f) where 𝗇𝗌2,δ(f)\mathsf{ns}_{2,\delta}(f) is the δ\delta-noise sensitivity on domain {±1}d\{\pm 1\}^{d}. Then 𝗇𝗌r,δ(f𝖻𝗅𝗈𝖼𝗄)𝗇𝗌2,δ()\mathsf{ns}_{r,\delta}(f^{\mathsf{block}})\leq\mathsf{ns}_{2,\delta}(\mathcal{H}).

Proof.

Let u ∼ [r]^d and let v be uniform among the vectors in [r]^d with v_i ≠ u_i for all i. Now let x ∼ {±1}^d uniformly at random and let y be drawn such that y_i = x_i with probability 1 − δ and y_i = −x_i otherwise. Then [u,v]^x is uniform in [r]^d, because [u,v]^x_i is u_i or v_i with equal probability and the marginals of u_i, v_i are uniform, and [u,v]^y_i = [u,v]^x_i with probability 1 − δ and is otherwise uniform in [r]∖{[u,v]^x_i}. Let f: [r]^d → {±1} and δ ∈ [0,1]. Let (u′,v′) be a pair drawn as in the definition of noise sensitivity over [r]^d, so that ns_{r,δ}(f) = P[f(u′) ≠ f(v′)]. Now observe that ([u,v]^x, [u,v]^y) has the same distribution as (u′,v′), so:

𝔼u,v[𝗇𝗌2,δ(fu,v)]\displaystyle\underset{u,v}{\mathbb{E}}\left[\mathsf{ns}_{2,\delta}(f_{u,v})\right] =𝔼u,v[x,yδx[f([u,v]x)f([u,v]y)]]\displaystyle=\underset{u,v}{\mathbb{E}}\left[\underset{x,y\sim_{\delta}x}{\mathbb{P}}\left[f({[u,v]}^{x})\neq f({[u,v]}^{y})\right]\right]
=𝔼u,v,(x,y)δ[𝟏[f([u,v]x)f([u,v]y)]]\displaystyle=\underset{u,v,{(x,y)}_{\delta}}{\mathbb{E}}\left[\mathbf{1}\left[f({[u,v]}^{x})\neq f({[u,v]}^{y})\right]\right]
=𝔼u,v[𝟏[f(u)f(v)]]=𝗇𝗌r,δ(f𝖻𝗅𝗈𝖼𝗄).\displaystyle=\underset{u^{\prime},v^{\prime}}{\mathbb{E}}\left[\mathbf{1}\left[f(u^{\prime})\neq f(v^{\prime})\right]\right]=\mathsf{ns}_{r,\delta}(f^{\mathsf{block}})\,.

For any u,v[r]du,v\in[r]^{d}, define the function Φu,v:{±1}d[r]d\Phi_{u,v}:\{\pm 1\}^{d}\to[r]^{d} by Φu,v(x)=𝖻𝗅𝗈𝖼𝗄𝗉𝗈𝗂𝗇𝗍([u,v]x)\Phi_{u,v}(x)=\mathsf{blockpoint}([u,v]^{x}). This function maps {±1}d\{\pm 1\}^{d} to a set {b1,i1,b1,j1}××{bd,id,bd,jd}\{b_{1,i_{1}},b_{1,j_{1}}\}\times\dotsm\times\{b_{d,i_{d}},b_{d,j_{d}}\} and can be obtained by translation and scaling, which is a linear transformation. Therefore fu,v=fΦu,v1f_{u,v}=f\circ\Phi_{u,v}^{-1}, so we are guaranteed that fu,vf_{u,v}\in\mathcal{H}. So

𝗇𝗌r,δ(f)=𝔼u,v[𝗇𝗌2,δ(fu,v)]𝗇𝗌2,δ().\mathsf{ns}_{r,\delta}(f)=\underset{u,v}{\mathbb{E}}\left[\mathsf{ns}_{2,\delta}(f_{u,v})\right]\leq\mathsf{ns}_{2,\delta}(\mathcal{H})\,.\qed

We define the Walsh basis, an orthonormal basis of functions [n]d[n]^{d}\to\mathbb{R}; see e.g. [BRY14]. Suppose n=2mn=2^{m} for some positive integer mm. For two functions f,g:[n]df,g:{[n]}^{d}\to\mathbb{R}, define the inner product f,g=𝔼x[n]d[f(x)g(x)]\langle f,g\rangle=\mathbb{E}_{x\sim{[n]}^{d}}[f(x)g(x)]. The Walsh functions {ψ0,,ψm},ψi:[n]{±1}\{\psi_{0},\dotsc,\psi_{m}\},\psi_{i}:[n]\to\{\pm 1\} can be defined by ψ01\psi_{0}\equiv 1 and for i1i\geq 1, ψi(z):=(1)𝖻𝗂𝗍i(z1)\psi_{i}(z):=(-1)^{\mathsf{bit}_{i}(z-1)} where 𝖻𝗂𝗍i(z1)\mathsf{bit}_{i}(z-1) is the ithi^{\mathrm{th}} bit in the binary representation of z1z-1, where the first bit is the least significant (see e.g. [BRY14]). It is easy to verify that for all i,j{0,,m}i,j\in\{0,\dotsc,m\}, if iji\neq j then ψi,ψj=0\langle\psi_{i},\psi_{j}\rangle=0, and 𝔼x[n][ψi(x)]=0\mathbb{E}_{x\sim[n]}[\psi_{i}(x)]=0 when i1i\geq 1. For S[m]S\subseteq[m] define ψS=iSψi\psi_{S}=\prod_{i\in S}\psi_{i} and note that for any set S[m],SS\subseteq[m],S\neq\emptyset,

𝔼x[n][ψS(x)]=𝔼x[n][iSψi(x)]=𝔼x[n][(1)iS𝖻𝗂𝗍i(x1)]=0\underset{x\sim[n]}{\mathbb{E}}\left[\psi_{S}(x)\right]=\underset{x\sim[n]}{\mathbb{E}}\left[\prod_{i\in S}\psi_{i}(x)\right]=\underset{x\sim[n]}{\mathbb{E}}\left[(-1)^{\sum_{i\in S}\mathsf{bit}_{i}(x-1)}\right]=0 (1)

since each bit is uniform in {0,1}\{0,1\}, while ψ1\psi_{\emptyset}\equiv 1. For S,T[m]S,T\subseteq[m],

ψS,ψT=𝔼x[n][ψS(x)ψT(x)]=𝔼x[ψSΔT(x)],\langle\psi_{S},\psi_{T}\rangle=\mathbb{E}_{x\sim[n]}[\psi_{S}(x)\psi_{T}(x)]=\mathbb{E}_{x}[\psi_{S\Delta T}(x)]\,,

where SΔTS\Delta T is the symmetric difference, so this is 0 when SΔTS\Delta T\neq\emptyset (i.e. STS\neq T) and 1 otherwise; therefore {ψS:S[m]}\{\psi_{S}:S\subseteq[m]\} is an orthonormal basis of functions [n][n]\to\mathbb{R}. Identify each S[m]S\subseteq[m] with the number s{0,,n1}s\in\{0,\dotsc,n-1\} where 𝖻𝗂𝗍i(s)=𝟏[iS]\mathsf{bit}_{i}(s)=\mathbf{1}\left[i\in S\right]. Now for every α{0,,n1}d\alpha\in\{0,\dotsc,n-1\}^{d} define ψα:[n]d{±1}\psi_{\alpha}:[n]^{d}\to\{\pm 1\} as ψα(x)=i=1dψαi(xi)\psi_{\alpha}(x)=\prod_{i=1}^{d}\psi_{\alpha_{i}}(x_{i}) where ψαi\psi_{\alpha_{i}} is the Walsh function determined by the identity between subsets of [m][m] and the integer αi{0,,n1}\alpha_{i}\in\{0,\dotsc,n-1\}. It is easy to verify that the set {ψα:α{0,,n1}d}\{\psi_{\alpha}:\alpha\in{\{0,\dotsc,n-1\}}^{d}\} is an orthonormal basis. Every function f:[n]df:{[n]}^{d}\to\mathbb{R} has a unique representation f=α{0,,n1}df^(α)ψαf=\sum_{\alpha\in\{0,\dotsc,n-1\}^{d}}\hat{f}(\alpha)\psi_{\alpha} where f^(α)=f,ψα\hat{f}(\alpha)=\langle f,\psi_{\alpha}\rangle.

For each x[n]dx\in{[n]}^{d} and ρ[0,1]\rho\in[0,1] define Nρ(x)N_{\rho}(x) as the distribution over y[n]dy\in{[n]}^{d} where for each i[d]i\in[d], yi=xiy_{i}=x_{i} with probability ρ\rho and yiy_{i} is uniform in [n][n] with probability 1ρ1-\rho. Define Tρf(x):=𝔼yNρ(x)[f(y)]T_{\rho}f(x):=\underset{y\sim N_{\rho}(x)}{\mathbb{E}}\left[f(y)\right] and 𝗌𝗍𝖺𝖻ρ(f):=f,Tρf\mathsf{stab}_{\rho}(f):=\langle f,T_{\rho}f\rangle. For any α{0,,n1}d\alpha\in{\{0,\dotsc,n-1\}}^{d},

Tρψα(x)\displaystyle T_{\rho}\psi_{\alpha}(x) =𝔼yNρ(x)[ψα(y)]=𝔼yNρ(x)[i=1dψαi(yi)]=i=1d𝔼yiNρ(xi)[ψαi(yi)]\displaystyle=\underset{y\sim N_{\rho}(x)}{\mathbb{E}}\left[\psi_{\alpha}(y)\right]=\underset{y\sim N_{\rho}(x)}{\mathbb{E}}\left[\prod_{i=1}^{d}\psi_{\alpha_{i}}(y_{i})\right]=\prod_{i=1}^{d}\underset{y_{i}\sim N_{\rho}(x_{i})}{\mathbb{E}}\left[\psi_{\alpha_{i}}(y_{i})\right]
=i=1d[ρψαi(xi)+(1ρ)𝔼z[n][ψαi(z)]].\displaystyle=\prod_{i=1}^{d}\left[\rho\psi_{\alpha_{i}}(x_{i})+(1-\rho)\underset{z\sim[n]}{\mathbb{E}}\left[\psi_{\alpha_{i}}(z)\right]\right]\,.

If α_i ≥ 1 then E_{z∼[n]}[ψ_{α_i}(z)] = 0; otherwise, ψ_0 ≡ 1, so E_{y_i∼N_ρ(x_i)}[ψ_0(y_i)] = 1. Therefore

Tρψα(x)=ρ|α|ψα(x),T_{\rho}\psi_{\alpha}(x)=\rho^{|\alpha|}\psi_{\alpha}(x)\,,

where |α||\alpha| is the number of nonzero entries of α\alpha; so Tρf^(α)=ψα,Tρf=Tρψα,f=ρ|α|f^(α)\widehat{T_{\rho}f}(\alpha)=\langle\psi_{\alpha},T_{\rho}f\rangle=\langle T_{\rho}\psi_{\alpha},f\rangle=\rho^{|\alpha|}\widehat{f}(\alpha). Since TρT_{\rho} is a linear operator,

𝗌𝗍𝖺𝖻ρ(f)=f,Tρf=αρ|α|f^(α)2.\mathsf{stab}_{\rho}(f)=\langle f,T_{\rho}f\rangle=\sum_{\alpha}\rho^{|\alpha|}\hat{f}{(\alpha)}^{2}\,.

Note that for f:{±1}d{±1}f:\{\pm 1\}^{d}\to\{\pm 1\}, 𝗌𝗍𝖺𝖻ρ(f)\mathsf{stab}_{\rho}(f) is the usual notion of stability in the analysis of Boolean functions.

Proposition 5.4.

For any f:[n]d{±1}f:{[n]}^{d}\to\{\pm 1\} and any δ,ρ[0,1]\delta,\rho\in[0,1], 𝗇𝗌n,δ(f)=1212𝗌𝗍𝖺𝖻1nn1δ(f)\mathsf{ns}_{n,\delta}(f)=\frac{1}{2}-\frac{1}{2}\cdot\mathsf{stab}_{1-\frac{n}{n-1}\delta}(f).

Proof.

For vNρ(u),vi=uiv\sim N_{\rho}(u),v_{i}=u_{i} with probability ρ+1ρn\rho+\frac{1-\rho}{n}, so in the definition of noise sensitivity, vv is distributed as Nρ(u)N_{\rho}(u) where (1δ)=ρ+1ρn(1-\delta)=\rho+\frac{1-\rho}{n}, i.e. δ=1ρ1ρn=(11/n)ρ(11/n)=(1ρ)(11/n)\delta=1-\rho-\frac{1-\rho}{n}=(1-1/n)-\rho(1-1/n)=(1-\rho)(1-1/n); or, ρ=1nn1δ\rho=1-\frac{n}{n-1}\delta. By rearranging, we arrive at the conclusions. ∎
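As a quick numerical sanity check of this identity (and of the convention that each coordinate is re-randomized with probability δ), one can compare both sides exactly on a small domain; the following is purely illustrative.

```python
# Check ns_{n,delta}(f) = 1/2 - 1/2 * stab_{1 - n*delta/(n-1)}(f) on a random f.
import itertools, random

n, d, delta = 4, 2, 0.3
rho = 1 - n * delta / (n - 1)
points = list(itertools.product(range(1, n + 1), repeat=d))
f = {x: random.choice([-1, 1]) for x in points}

def K(u, v):                                      # kernel in the noise-sensitivity definition
    p = 1.0
    for ui, vi in zip(u, v):
        p *= (1 - delta) if ui == vi else delta / (n - 1)
    return p

def M(x, y):                                      # kernel of the noise operator N_rho
    p = 1.0
    for xi, yi in zip(x, y):
        p *= rho * (xi == yi) + (1 - rho) / n
    return p

ns = sum(K(u, v) * (f[u] != f[v]) for u in points for v in points) / len(points)
stab = sum(M(x, y) * f[x] * f[y] for x in points for y in points) / len(points)
assert abs(ns - (0.5 - 0.5 * stab)) < 1e-9
```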

Proposition 5.5.

For any f: [n]^d → {±1} and t = 2/δ, Σ_{α:|α|≥t} f̂(α)² ≤ 2.32·ns_{n,δ}(f).

Proof.

Following [KOS04]:

2𝗇𝗌n,δ(f)\displaystyle 2\mathsf{ns}_{n,\delta}(f) =1α(1nn1δ)|α|f^(α)21α(1δ)|α|f^(α)2\displaystyle=1-\sum_{\alpha}{\left(1-\frac{n}{n-1}\delta\right)}^{|\alpha|}\hat{f}{(\alpha)}^{2}\geq 1-\sum_{\alpha}{\left(1-\delta\right)}^{|\alpha|}\hat{f}{(\alpha)}^{2}
=α(1(1δ)|α|)f^(α)2α:|α|2/δ(1(1δ)|α|)f^(α)2\displaystyle=\sum_{\alpha}(1-{\left(1-\delta\right)}^{|\alpha|})\hat{f}{(\alpha)}^{2}\geq\sum_{\alpha:|\alpha|\geq 2/\delta}(1-{\left(1-\delta\right)}^{|\alpha|})\hat{f}{(\alpha)}^{2}
α:|α|2/δ(1(1δ)2/δ)f^(α)2(1e2)α:|α|2/δf^(α)2.\displaystyle\geq\sum_{\alpha:|\alpha|\geq 2/\delta}(1-{\left(1-\delta\right)}^{2/\delta})\hat{f}{(\alpha)}^{2}\geq(1-e^{-2})\sum_{\alpha:|\alpha|\geq 2/\delta}\hat{f}{(\alpha)}^{2}\,.

The result now holds since 2/(1e2)<2.322/(1-e^{-2})<2.32. ∎

Lemma 5.6.

Let H be a set of functions [n]^d → {±1} where n is a power of 2, let ε, δ > 0 be such that ns_{n,δ}(h) ≤ ε²/3 for all h ∈ H, and let t = ⌈2/δ⌉. Then there is a set F of functions [n]^d → R of size |F| ≤ (nd)^t such that, for any h ∈ H, there is a function p ∈ span(F) with E_{x∼[n]^d}[(h(x) − p(x))²] ≤ ε².

Proof.

Let h ∈ H and let p = Σ_{|α|<t} ĥ(α)ψ_α. Then by Proposition 5.5,

E[(p(x) − h(x))²] = E[( Σ_{|α|≥t} ĥ(α)ψ_α(x) )²] = Σ_{|α|≥t} Σ_{|β|≥t} ĥ(α)ĥ(β)⟨ψ_α, ψ_β⟩ = Σ_{|α|≥t} ĥ(α)² ≤ 2.32·ε²/3 ≤ ε².

Therefore p is a linear combination of functions ψ_α = ∏_{i=1}^d ψ_{α_i} in which fewer than t of the indices α_i ∈ {0,…,n−1} are nonzero. There are at most ((n−1)d)^t such products, since each non-constant factor ψ_{α_i} is specified by a choice of i ∈ [d] and α_i ∈ [n−1]. We may take F to be the set of these products. ∎
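A minimal sketch of enumerating this spanning set F (the low-degree part of the Walsh basis) for use with the general learning algorithm; it uses the bit-counting form of the Walsh functions from the sketch above, and the names are ours.

```python
# Illustrative sketch: all products psi_alpha with fewer than t nonzero coordinates.
import itertools

def psi(s, z):
    """Walsh function on [n]: (-1)^{number of bits shared by s and z-1}."""
    return -1 if bin(s & (z - 1)).count("1") % 2 else 1

def low_degree_walsh_features(n, d, t):
    """Yields one function [n]^d -> {+1,-1} for each alpha with |alpha| < t."""
    for k in range(t):                            # number of nonzero coordinates of alpha
        for coords in itertools.combinations(range(d), k):
            for vals in itertools.product(range(1, n), repeat=k):
                alpha = dict(zip(coords, vals))
                def phi(v, alpha=alpha):          # v is a point of [n]^d (1-indexed tuple)
                    out = 1
                    for i, a in alpha.items():
                        out *= psi(a, v[i])
                    return out
                yield phi
```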

5.3 Application

To apply Lemma 5.2, we must give bounds on 𝖻𝖻𝗌(k,r)\mathsf{bbs}(\mathcal{B}_{k}\circ\mathcal{H},r) and the noise sensitivity:

Lemma 5.7.

Fix any rr. Then 𝖻𝖻𝗌(k,r)dkrd1\mathsf{bbs}(\mathcal{B}_{k}\circ\mathcal{H},r)\leq dkr^{d-1}.

Proof.

Any halfspace h(x) = sign(⟨w,x⟩ − t) is unate, meaning there is a vector σ ∈ {±1}^d such that the function h^σ(x) := sign(⟨w, x^σ⟩ − t), where x^σ = (σ_1 x_1, …, σ_d x_d), is monotone. For any r-block partition block: R^d → [r]^d defined by values {a_{i,j}} for i ∈ [d], j ∈ [r−1], we can define block^σ: R^d → [r]^d as the block partition obtained by taking the values {σ_i · a_{i,j}}. The number of non-constant blocks of h in block is the same as that of h^σ in block^σ, but h^σ is monotone. Thus the bound on bbs for monotone functions applies, so bbs(H,r) ≤ dr^{d−1} by Lemma 7.1, and Proposition 4.2 gives bbs(B_k ∘ H, r) ≤ dkr^{d−1}. ∎

The bounds on noise sensitivity follow from known results for the hypercube.

Proposition 5.8.

Let h1,,hk:[n]d{±1}h_{1},\dotsc,h_{k}:{[n]}^{d}\to\{\pm 1\} and let g:{±1}k{±1}g:\{\pm 1\}^{k}\to\{\pm 1\}. Let f:=g(h1,,hk)f:=g\circ(h_{1},\dotsc,h_{k}). Then 𝗇𝗌δ(f)i=1k𝗇𝗌δ(hi)\mathsf{ns}_{\delta}(f)\leq\sum_{i=1}^{k}\mathsf{ns}_{\delta}(h_{i}).

Proof.

For u,vu,v drawn from [n]d{[n]}^{d} as in the definition of noise sensitivity, the union bound gives 𝗇𝗌δ(f)=u,v[f(u)f(v)]u,v[i:hi(u)hi(v)]i=1k𝗇𝗌δ(hi)\mathsf{ns}_{\delta}(f)=\underset{u,v}{\mathbb{P}}\left[f(u)\neq f(v)\right]\leq\underset{u,v}{\mathbb{P}}\left[\exists i:h_{i}(u)\neq h_{i}(v)\right]\leq\sum_{i=1}^{k}\mathsf{ns}_{\delta}(h_{i}). ∎

Lemma 5.9.

Let f=g(h1,,hk)kf=g\circ(h_{1},\dotsc,h_{k})\in\mathcal{B}_{k}\circ\mathcal{H}. For any rr-block partition 𝖻𝗅𝗈𝖼𝗄:d[r]d\mathsf{block}:\mathbb{R}^{d}\to[r]^{d}, and any δ[0,1],𝗇𝗌r,δ(f𝖻𝗅𝗈𝖼𝗄)=O(kδ)\delta\in[0,1],\mathsf{ns}_{r,\delta}(f^{\mathsf{block}})=O(k\sqrt{\delta}).

Proof.

It is known that ns_{2,δ}(H) = O(√δ) (Peres' theorem [O'D14]). Moreover, for any linear transformation A ∈ R^{d×d} and any h ∈ H we have h∘A ∈ H: for some w ∈ R^d and t ∈ R, h(Ax) = sign(⟨w, Ax⟩ − t) = sign(⟨A^⊤w, x⟩ − t), which is again a halfspace. Then Lemma 5.3 implies ns_{r,δ}(h^block) ≤ ns_{2,δ}(H) = O(√δ), and we conclude with Proposition 5.8. ∎

See 1.1.2

Proof.

Here we prove only the continuous distribution case; the finite case is proved in Appendix A.

For r = ⌈3dk/ε⌉, we have r^{-d}·bbs(B_k ∘ H, r) ≤ ε/3 by Lemma 5.7, so condition 1 of Lemma 5.2 holds. Lemma 5.9 guarantees that for any f ∈ B_k ∘ H, ns_{r,δ}(f^block) = O(k√δ). Setting δ = Θ(ε⁴/k²) so that ns_{r,δ}(f^block) ≤ ε²/3, we obtain via Lemma 5.6 a set F of size |F| ≤ (rd)^{O(k²/ε⁴)} satisfying condition 2 of Lemma 5.2. Then for n = poly(|F|, 1/ε) we apply Lemma 5.2 to get an algorithm with sample complexity

O(rd2n2log(rd))=O(d3kϵlog(dk/ϵ))(dkϵ)O(k2ϵ4).O\left(rd^{2}n^{2}\log(rd)\right)=O\left(\frac{d^{3}k}{\epsilon}\log(dk/\epsilon)\right)\cdot\left(\frac{dk}{\epsilon}\right)^{O\left(\frac{k^{2}}{\epsilon^{4}}\right)}\,.

The other time complexity follows from Lemma 4.5. ∎

6 Learning Polynomial Threshold Functions

A degree-k polynomial threshold function (PTF) is a function f: R^d → {±1} such that there is a degree-k polynomial p: R^d → R in d variables with f(x) = sign(p(x)); for example, a halfspace is a degree-1 PTF. Let P_k be the set of degree-k PTFs. As for halfspaces, we will give bounds on the noise sensitivity and the block boundary size and apply the general learning algorithm. The bound on noise sensitivity will follow from known results on the hypercube [DHK+10], but the bound on the block boundary size is much more difficult to obtain than for halfspaces.

6.1 Block-Boundary Size of PTFs

A theorem of Warren [War68] gives a bound on the number of connected components of d\mathbb{R}^{d} after removing the 0-set of a degree-kk polynomial. This bound (Theorem 6.7 below) will be our main tool.

A set S ⊆ R^d is connected (here we are using the fact that connected and path-connected are equivalent in R^d) if for every s,t ∈ S there is a continuous function p: [0,1] → S such that p(0) = s, p(1) = t. A subset S ⊆ X, where X ⊆ R^d, is a connected component of X if it is connected and there is no connected set T ⊆ X such that S ⊊ T. Write comp(X) for the number of connected components of X.

A function ρ: [d] → R ∪ {*} is called a restriction, and we will denote |ρ| = |{i ∈ [d] : ρ(i) = *}|. The affine subspace A_ρ induced by ρ is A_ρ := {x ∈ R^d | x_i = ρ(i) if ρ(i) ≠ *} and has affine dimension |ρ|.

For ndn\leq d, let 𝒜n\mathcal{A}_{n} be the set of affine subspaces AρA_{\rho} obtained by choosing a restriction ρ\rho with ρ(i)=\rho(i)=* when ini\leq n and ρ(i)\rho(i)\neq* when i>ni>n, so in particular 𝒜d={d}\mathcal{A}_{d}=\{\mathbb{R}^{d}\}.

Let f:d{±1}f:\mathbb{R}^{d}\to\{\pm 1\} be a measurable function and define the boundary of ff as:

f:={xd|ϵ>0,y:xy2<ϵ,f(y)f(x)}.\partial f:=\{x\in\mathbb{R}^{d}\;|\;\forall\epsilon>0,\exists y:\|x-y\|_{2}<\epsilon,f(y)\neq f(x)\}\,.

This is equivalent to the boundary of the set of +1+1-valued points, and the boundary of any set is closed. Each measurable f:d{±1}f:\mathbb{R}^{d}\to\{\pm 1\} induces a partition of df\mathbb{R}^{d}\setminus\partial f into some number of connected parts. For a set \mathcal{H} of functions f:d{±1}f:\mathbb{R}^{d}\to\{\pm 1\} and ndn\leq d, write

M(n):=maxfmaxA𝒜n𝖼𝗈𝗆𝗉(Af).M(n):=\max_{f\in\mathcal{H}}\max_{A\in\mathcal{A}_{n}}\mathsf{comp}(A\setminus\partial f)\,.

For each i[d]i\in[d] let 𝒫i\mathcal{P}_{i} be the set of hyperplanes of the form {xd|xi=a}\{x\in\mathbb{R}^{d}\;|\;x_{i}=a\} for some aa\in\mathbb{R}. An (r,n,m)(r,n,m)-arrangement for ff is any set A(fi=1mj=1r1Hi,j)A\setminus\left(\partial f\cup\bigcup_{i=1}^{m}\bigcup_{j=1}^{r-1}H_{i,j}\right) where A𝒜nA\in\mathcal{A}_{n} and Hi,j𝒫iH_{i,j}\in\mathcal{P}_{i} such that all Hi,jH_{i,j} are distinct. Write Rf(r,n,m)R_{f}(r,n,m) for the set of (r,n,m)(r,n,m)-arrangements for ff. Define

Pr(n,m):=maxfmax{𝖼𝗈𝗆𝗉(R)|RRf(r,n,m)}P_{r}(n,m):=\max_{f\in\mathcal{H}}\max\{\mathsf{comp}(R)\;|\;R\in R_{f}(r,n,m)\}

and observe that Pr(n,0)=M(n)P_{r}(n,0)=M(n).

Proposition 6.1.

For any set \mathcal{H} of functions f:d{±1}f:\mathbb{R}^{d}\to\{\pm 1\} and any r>0r>0, 𝖻𝖻𝗌(,r)Pr(d,d)rd\mathsf{bbs}(\mathcal{H},r)\leq P_{r}(d,d)-r^{d}.

Proof.

Consider any rr-block partition, which is obtained by choosing values ai,ja_{i,j}\in\mathbb{R} for each i[d],j[r1]i\in[d],j\in[r-1] and defining 𝖻𝗅𝗈𝖼𝗄:d[r]d\mathsf{block}:\mathbb{R}^{d}\to[r]^{d} by assigning each xdx\in\mathbb{R}^{d} the block v[r]dv\in[r]^{d} where viv_{i} is the unique value such that ai,vi1<xiai,via_{i,v_{i}-1}<x_{i}\leq a_{i,v_{i}}, where we define ai,0=,ai,r=a_{i,0}=-\infty,a_{i,r}=\infty. Suppose vv is a non-constant block, so there are x,y𝖻𝗅𝗈𝖼𝗄1(v)fx,y\in\mathsf{block}^{-1}(v)\setminus\partial f such that f(x)f(y)f(x)\neq f(y). Let Hi,j={xd|xi=ai,j}H_{i,j}=\{x\in\mathbb{R}^{d}\;|\;x_{i}=a_{i,j}\} and let B=fi,jHi,jB=\partial f\cup\bigcup_{i,j}H_{i,j}. Consider the set dB\mathbb{R}^{d}\setminus B. Since xfx\notin\partial f there exists some small open ball RxR_{x} around xx such that xRx,f(x)=f(x)\forall x^{\prime}\in R_{x},f(x^{\prime})=f(x); and since x𝖻𝗅𝗈𝖼𝗄1(v)x\in\mathsf{block}^{-1}(v), Rx𝖻𝗅𝗈𝖼𝗄1(v)R_{x}\cap\mathsf{block}^{-1}(v) is a set of positive Lebesgue measure. Since BB has Lebesgue measure 0, we conclude that (Rx𝖻𝗅𝗈𝖼𝗄1(v))B(R_{x}\cap\mathsf{block}^{-1}(v))\setminus B has positive measure, so there is x𝖻𝗅𝗈𝖼𝗄1(v)Bx^{\prime}\in\mathsf{block}^{-1}(v)\setminus B with f(x)=f(x)f(x^{\prime})=f(x). Likewise, there is y𝖻𝗅𝗈𝖼𝗄1(v)By^{\prime}\in\mathsf{block}^{-1}(v)\setminus B with f(y)=f(y)f(x)f(y^{\prime})=f(y)\neq f(x^{\prime}). Therefore x,yx^{\prime},y^{\prime} must belong to separate components, so 𝖻𝗅𝗈𝖼𝗄1(v)B\mathsf{block}^{-1}(v)\setminus B is partitioned into at least 2 components. Meanwhile, each constant block is partitioned into at least 1 component. So

Pr(d,d)2(# non-constant blocks)+(# constant blocks)=𝖻𝖻𝗌(,r)+rd.P_{r}(d,d)\geq 2\cdot(\#\text{ non-constant blocks})+(\#\text{ constant blocks})=\mathsf{bbs}(\mathcal{H},r)+r^{d}\,.\qed

The following fact must be well-known, but not to us:

Proposition 6.2.

Let AA be an affine subspace of d\mathbb{R}^{d}, let BAB\subset A, and for aa\in\mathbb{R} let H={xd|x1=a}H=\{x\in\mathbb{R}^{d}\;|\;x_{1}=a\}. Then

𝖼𝗈𝗆𝗉(A(HB))𝖼𝗈𝗆𝗉(AB)𝖼𝗈𝗆𝗉(HB).\mathsf{comp}(A\setminus(H\cup B))-\mathsf{comp}(A\setminus B)\leq\mathsf{comp}(H\setminus B)\,.
Proof.

Let GG be the graph with its vertices VV being the components of A(HB)A\setminus(H\cup B) and the edges EE being the pairs (S,T)(S,T) where S,TS,T are components of A(HB)A\setminus(H\cup B) such that sS,s1<a\forall s\in S,s_{1}<a, tT,t1>a\forall t\in T,t_{1}>a, and there exists a component UU of ABA\setminus B such that S,TUS,T\subset U. Clearly 𝖼𝗈𝗆𝗉(A(HB))=|V|\mathsf{comp}(A\setminus(H\cup B))=|V|; we will show that 𝖼𝗈𝗆𝗉(AB)\mathsf{comp}(A\setminus B) is the number of connected components of GG and that |E|𝖼𝗈𝗆𝗉(HB)|E|\leq\mathsf{comp}(H\setminus B). This suffices to prove the statement. We will use the following claim:

Claim 6.3.

Let UU be a connected component of ABA\setminus B. If S,TVS,T\in V and there is a path p:[0,1]Up:[0,1]\to U such that p(0)S,p(1)Tp(0)\in S,p(1)\in T and either λ,p(λ)1a\forall\lambda,p(\lambda)_{1}\leq a or λ,p(λ)1a\forall\lambda,p(\lambda)_{1}\geq a, then S=TS=T.

Proof of claim.

Assume without loss of generality that p(λ)1ap(\lambda)_{1}\leq a for all λ\lambda. Let P={p(λ)|λ[0,1]}P=\{p(\lambda)\;|\;\lambda\in[0,1]\}. Since UU is open we can define for each λ\lambda a ball B(λ)p(λ)B(\lambda)\ni p(\lambda) such that B(λ)UB(\lambda)\subset U. Consider the sets Ba(λ):={xB(λ)|x1<a}B_{a}(\lambda):=\{x\in B(\lambda)\;|\;x_{1}<a\}, which are open, and note that for all α,β[0,1]\alpha,\beta\in[0,1], if p(α)B(β)p(\alpha)\in B(\beta) then Ba(α)Ba(β)B_{a}(\alpha)\cap B_{a}(\beta)\neq\emptyset since p(α)1,p(β)1ap(\alpha)_{1},p(\beta)_{1}\leq a.

Assume for contradiction that there is λ\lambda such that Ba(λ)B_{a}(\lambda) is not connected to SS or TT; then let λ\lambda^{\prime} be the infimum of all such λ\lambda, which must satisfy λ>0\lambda^{\prime}>0 since p(0)Sp(0)\in S. For any α\alpha, if p(α)B(λ)p(\alpha)\in B(\lambda^{\prime}) and Ba(α)B_{a}(\alpha) is connected to SS or TT then, since Ba(λ)Ba(α)B_{a}(\lambda^{\prime})\cap B_{a}(\alpha) is nonempty, it must be that Ba(λ)B_{a}(\lambda^{\prime}) is connected as well; therefore Ba(α)B_{a}(\alpha) is not connected to either SS or TT. But since pp is continuous, there is α<λ\alpha<\lambda^{\prime} such that p(α)B(λ)p(\alpha)\in B(\lambda^{\prime}), so λ\lambda^{\prime} cannot be the infimum, which is a contradiction. Therefore every λ\lambda has Ba(λ)B_{a}(\lambda) connected to either SS or TT. If STS\neq T, this is a contradiction since there must then be α,β\alpha,\beta such that p(α)B(β)p(\alpha)\in B(\beta) but Ba(α),Ba(β)B_{a}(\alpha),B_{a}(\beta) are connected to S,TS,T respectively. Therefore S=TS=T. ∎

We first show that 𝖼𝗈𝗆𝗉(AB)\mathsf{comp}(A\setminus B) is the number of graph-connected components of GG. Suppose that vertices (S,T)(S,T) are connected, so there is a path S=S0,,Sn=TS=S_{0},\dotsc,S_{n}=T in GG. Then there are connected components UiU_{i} of ABA\setminus B such that Si1,SiUiS_{i-1},S_{i}\subset U_{i}; so SiUiUi+1S_{i}\subset U_{i}\cap U_{i+1}, which implies that iUiAB\bigcup_{i}U_{i}\subset A\setminus B is connected. Therefore we may define Φ\Phi as mapping each connected component {Si}\{S_{i}\} of GG to the unique component UU of ABA\setminus B with iSiU\bigcup_{i}S_{i}\subset U. Φ\Phi is surjective since for each component UU of ABA\setminus B there is some vertex SS (a component of A(HB)A\setminus(H\cup B)) such that SUS\subseteq U: this is UU itself if UH=U\cap H=\emptyset. For some connected component UU of ABA\setminus B, let S,TUS,T\subseteq U be vertices of GG, and let sS,tTs\in S,t\in T; since UU is connected, there is a path p:[0,1]Up:[0,1]\to U such that s=p(0),t=p(1)s=p(0),t=p(1). Let S=S0,,Sn=TS=S_{0},\dotsc,S_{n}=T be the multiset of vertices such that λ[0,1],i:p(λ)Si\forall\lambda\in[0,1],\exists i:p(\lambda)\in S_{i}; let ψ(λ){0,,n}\psi(\lambda)\in\{0,\dotsc,n\} be the index such that p(λ)Sψ(λ)p(\lambda)\in S_{\psi(\lambda)}, and order the sequence such that if α<β\alpha<\beta then ψ(α)ψ(β)\psi(\alpha)\leq\psi(\beta) (note that we may have Si=SjS_{i}=S_{j} for some i<ji<j if p(λ)p(\lambda) visits the same set more than once). Then for any ii, Si,Si+1US_{i},S_{i+1}\subseteq U since the path visits both and is contained in UU. If Si,Si+1S_{i},S_{i+1} are on opposite sides of HH, then there is an edge (Si,Si+1)(S_{i},S_{i+1}) in GG; otherwise, the above claim implies Si=Si+1S_{i}=S_{i+1}. Thus there is a path SS to TT in GG; this proves that Φ\Phi is injective, so 𝖼𝗈𝗆𝗉(AB)\mathsf{comp}(A\setminus B) is indeed the number of graph-connected components of GG.

Now let (S,T)E(S,T)\in E, so there is a component UU of ABA\setminus B such that S,TUS,T\subset U. For any sS,tTs\in S,t\in T there is a continuous path ps,t:[0,1]Up_{s,t}:[0,1]\to U where ps,t(0)=s,ps,t(1)=tp_{s,t}(0)=s,p_{s,t}(1)=t. There must be some z[0,1]z\in[0,1] such that ps,t(z)Hp_{s,t}(z)\in H, otherwise the path is a path in A(HB)A\setminus(H\cup B) and S=TS=T. Since ps,t(z)HUp_{s,t}(z)\in H\cap U, we have ps,t(z)Bp_{s,t}(z)\notin B, so there is some component Z𝒞HZ\in\mathcal{C}_{H} containing ps,t(z)p_{s,t}(z). We will map the edge (S,T)(S,T) to an arbitrary such ZZ, for any choice s,t,zs,t,z, and show that this mapping is injective. Suppose that (S,T),(S,T)(S,T),(S^{\prime},T^{\prime}) map to the same Z𝒞HZ\in\mathcal{C}_{H}. Without loss of generality we may assume that S,SS,S^{\prime} lie on the same side of HH and that xSS,x1<a\forall x\in S\cup S^{\prime},x_{1}<a. Then there are sS,sS,tT,tTs\in S,s^{\prime}\in S^{\prime},t\in T,t^{\prime}\in T^{\prime}, and z,z[0,1]z,z^{\prime}\in[0,1] such that ps,t(z),ps,t(z)Zp_{s,t}(z),p_{s^{\prime},t^{\prime}}(z^{\prime})\in Z. Then since ZZ is a connected component, we may take z,zz,z^{\prime} to be the least such values that ps,t(z),ps,t(z)Zp_{s,t}(z),p_{s^{\prime},t^{\prime}}(z^{\prime})\in Z, and connect ps,t(z),ps,t(z)p_{s,t}(z),p_{s^{\prime},t^{\prime}}(z^{\prime}) by a path in ZZ to obtain a path q:[0,1]Uq:[0,1]\to U such that q(0)=s,q(1)=sq(0)=s,q(1)=s^{\prime}, and λ,q(λ)1a\forall\lambda,q(\lambda)_{1}\leq a. Then by the above claim, S=SS=S^{\prime}; the same holds for T,TT,T^{\prime}, so the mapping is injective. This completes the proof of the proposition. ∎

Proposition 6.4.

For any set \mathcal{H} of measurable functions f:d{±1}f:\mathbb{R}^{d}\to\{\pm 1\} and any r>1r>1,

Pr(n,m)Pr(n,m1)+(r1)Pr(n1,m1).P_{r}(n,m)\leq P_{r}(n,m-1)+(r-1)\cdot P_{r}(n-1,m-1)\,.
Proof.

Let ff\in\mathcal{H}, A𝒜nA\in\mathcal{A}_{n}, and ai,ja_{i,j}\in\mathbb{R} for i[m],j[r1]i\in[m],j\in[r-1] be such that the number of connected components in ABA\setminus B, where B=fi,jHi,jB=\partial f\cup\bigcup_{i,j}H_{i,j} and Hi,j={xA|xi=ai,j}H_{i,j}=\{x\in A\;|\;x_{i}=a_{i,j}\}, is Pr(n,m)P_{r}(n,m). For 0kr10\leq k\leq r-1 let

Bk:=f(i=1m1j=1r1Hi,j)(j=1kHm,j),B_{k}:=\partial f\cup\left(\bigcup_{i=1}^{m-1}\bigcup_{j=1}^{r-1}H_{i,j}\right)\cup\left(\bigcup_{j=1}^{k}H_{m,j}\right)\,,

so that B=Br1B=B_{r-1} and Bk=Bk1Hm,kB_{k}=B_{k-1}\cup H_{m,k}. Since B0B_{0} is an (r,n,m1)(r,n,m-1)-arrangement, 𝖼𝗈𝗆𝗉(AB0)Pr(n,m1)\mathsf{comp}(A\setminus B_{0})\leq P_{r}(n,m-1). For k>0k>0, since BkB_{k} is obtained from Bk1B_{k-1} by adding a hyperplane Hm,kH_{m,k}, Proposition 6.2 implies

𝖼𝗈𝗆𝗉(ABk)𝖼𝗈𝗆𝗉(ABk1)+𝖼𝗈𝗆𝗉(Hm,kBk1)𝖼𝗈𝗆𝗉(ABk1)+Pr(n1,m1),\mathsf{comp}(A\setminus B_{k})\leq\mathsf{comp}(A\setminus B_{k-1})+\mathsf{comp}(H_{m,k}\setminus B_{k-1})\leq\mathsf{comp}(A\setminus B_{k-1})+P_{r}(n-1,m-1)\,,

because Hm,kBk1H_{m,k}\setminus B_{k-1} is an (r,n1,m1)(r,n-1,m-1)-arrangement. Iterating r1r-1 times, once for each added hyperplane, we arrive at

Pr(n,m)\displaystyle P_{r}(n,m) =𝖼𝗈𝗆𝗉(AB)\displaystyle=\mathsf{comp}(A\setminus B)
=𝖼𝗈𝗆𝗉(AB0)+k=1r1(𝖼𝗈𝗆𝗉(ABk)𝖼𝗈𝗆𝗉(ABk1))\displaystyle=\mathsf{comp}(A\setminus B_{0})+\sum_{k=1}^{r-1}\left(\mathsf{comp}(A\setminus B_{k})-\mathsf{comp}(A\setminus B_{k-1})\right)
Pr(n,m1)+(r1)Pr(n1,m1).\displaystyle\leq P_{r}(n,m-1)+(r-1)P_{r}(n-1,m-1)\,.\qed
Lemma 6.5.

For any set \mathcal{H} of measurable functions d{±1}\mathbb{R}^{d}\to\{\pm 1\} and any rr,

Pr(d,d)(r1)d+i=0d1(di)M(di)(r1)i.P_{r}(d,d)\leq(r-1)^{d}+\sum_{i=0}^{d-1}\binom{d}{i}\cdot M(d-i)\cdot(r-1)^{i}\,.
Proof.

Write s=r1s=r-1 for convenience. We will show by induction the more general statement that for any mndm\leq n\leq d,

Pr(n,m)i=0m(mi)M(ni)siP_{r}(n,m)\leq\sum_{i=0}^{m}\binom{m}{i}\cdot M(n-i)\cdot s^{i}

where we define M(0):=1M(0):=1. In the base case, note that Pr(n,0)=M(n)P_{r}(n,0)=M(n). Assume the statement holds for Pr(n,m)P_{r}(n^{\prime},m^{\prime}) when nn,m<mn^{\prime}\leq n,m^{\prime}<m. Then by Proposition 6.4,

Pr(n,m)\displaystyle P_{r}(n,m) Pr(n,m1)+sPr(n1,m1)\displaystyle\leq P_{r}(n,m-1)+s\cdot P_{r}(n-1,m-1)
i=0m1(m1i)M(ni)si+i=0m1(m1i)M(n1i)si+1\displaystyle\leq\sum_{i=0}^{m-1}{m-1\choose i}\cdot M(n-i)\cdot s^{i}+\sum_{i=0}^{m-1}{m-1\choose i}\cdot M(n-1-i)\cdot s^{i+1}
i=0m1(m1i)M(ni)si+i=1m(m1i1)M(ni)si\displaystyle\leq\sum_{i=0}^{m-1}{m-1\choose i}\cdot M(n-i)\cdot s^{i}+\sum_{i=1}^{m}{m-1\choose i-1}\cdot M(n-i)\cdot s^{i}
=M(n)+M(nm)sm+i=1m1((m1i)+(m1i1))M(ni)si\displaystyle=M(n)+M(n-m)\cdot s^{m}+\sum_{i=1}^{m-1}\left({m-1\choose i}+{m-1\choose i-1}\right)\cdot M(n-i)\cdot s^{i}
=i=0m(mi)M(ni)si.\displaystyle=\sum_{i=0}^{m}\binom{m}{i}\cdot M(n-i)\cdot s^{i}\,.\qed
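As a quick sanity check, the following short Python sketch iterates the recursion of Proposition 6.4 with the base case P_r(n,0)=M(n) and compares it against the closed form of Lemma 6.5. It is purely illustrative: M(n)=k^n is the assumption from Lemma 6.6, and the values of k, r, d are arbitrary.

```python
# Numeric sanity check (illustration only) for Proposition 6.4 and Lemma 6.5.
# M(n) = k**n is the assumption from Lemma 6.6; k, r, d are arbitrary small values.
from functools import lru_cache
from math import comb

k, r, d = 3, 7, 6
s = r - 1

def M(n):                      # assumed bound on the number of components, with M(0) = 1
    return 1 if n == 0 else k ** n

@lru_cache(maxsize=None)
def U(n, m):                   # iterate the recursion of Proposition 6.4 with equality
    return M(n) if m == 0 else U(n, m - 1) + s * U(n - 1, m - 1)

def closed_form(n, m):         # the bound of Lemma 6.5 in its general form
    return sum(comb(m, i) * M(n - i) * s ** i for i in range(m + 1))

for n in range(d + 1):
    for m in range(n + 1):
        assert U(n, m) == closed_form(n, m)
print("P_r(d,d) is at most", closed_form(d, d))
```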
Lemma 6.6.

Let \mathcal{H} be a set of functions f:d{±1}f:\mathbb{R}^{d}\to\{\pm 1\} such that for some k1k\geq 1, M(n)knM(n)\leq k^{n}. Then for any ϵ>0\epsilon>0 and r3dk/ϵr\geq 3dk/\epsilon, 𝖻𝖻𝗌(,r)<ϵrd\mathsf{bbs}(\mathcal{H},r)<\epsilon\cdot r^{d}.

Proof.

Write s=r1s=r-1. By Proposition 6.1 and Lemma 6.5, the fraction of non-constant blocks satisfies

𝖻𝖻𝗌(r)rd\displaystyle\frac{\mathsf{bbs}(r)}{r^{d}} rd(Pr(d,d)rd)rd(i=0d1[(di)M(di)si]+sdrd)\displaystyle\leq r^{-d}\left(P_{r}(d,d)-r^{d}\right)\leq r^{-d}\left(\sum_{i=0}^{d-1}\left[{d\choose i}\cdot M(d-i)\cdot s^{i}\right]+s^{d}-r^{d}\right)
rdi=0d1(di)M(di)si.\displaystyle\leq r^{-d}\sum_{i=0}^{d-1}{d\choose i}\cdot M(d-i)\cdot s^{i}\,.

Split the sum into two parts:

i=0d/2(di)M(di)sird+i=1d/21(di)M(i)sdird\displaystyle\sum_{i=0}^{\lfloor d/2\rfloor}{d\choose i}\cdot\frac{M(d-i)\cdot s^{i}}{r^{d}}+\sum_{i=1}^{\lceil d/2\rceil-1}{d\choose i}\cdot\frac{M(i)\cdot s^{d-i}}{r^{d}}
i=0d/2(di)kdirird+i=1d/21(di)kirdird\displaystyle\qquad\leq\sum_{i=0}^{\lfloor d/2\rfloor}{d\choose i}\cdot\frac{k^{d-i}\cdot r^{i}}{r^{d}}+\sum_{i=1}^{\lceil d/2\rceil-1}{d\choose i}\cdot\frac{k^{i}\cdot r^{d-i}}{r^{d}}
i=0d/2dikdirird+i=1d/21dikirdirdi=0d/2ϵdi3diddi+i=1d/21ϵi3i\displaystyle\qquad\leq\sum_{i=0}^{\lfloor d/2\rfloor}\frac{d^{i}k^{d-i}\cdot r^{i}}{r^{d}}+\sum_{i=1}^{\lceil d/2\rceil-1}\frac{d^{i}k^{i}\cdot r^{d-i}}{r^{d}}\qquad\leq\sum_{i=0}^{\lfloor d/2\rfloor}\frac{\epsilon^{d-i}}{3^{d-i}d^{d-i}}+\sum_{i=1}^{\lceil d/2\rceil-1}\frac{\epsilon^{i}}{3^{i}}
ϵ3+i=2d/21ϵi3i+d/2ϵd/23d/2dd/2ϵ3+ϵ3i=1ϵi3i+ϵd/23d/2ϵ.\displaystyle\qquad\leq\frac{\epsilon}{3}+\sum_{i=2}^{\lceil d/2\rceil-1}\frac{\epsilon^{i}}{3^{i}}+\lfloor d/2\rfloor\cdot\frac{\epsilon^{\lceil d/2\rceil}}{3^{\lceil d/2\rceil}d^{\lceil d/2\rceil}}\qquad\leq\frac{\epsilon}{3}+\frac{\epsilon}{3}\sum_{i=1}^{\infty}\frac{\epsilon^{i}}{3^{i}}+\frac{\epsilon^{\lceil d/2\rceil}}{3^{\lceil d/2\rceil}}\leq\epsilon\,.\qed

It is a standard fact that for degree-kk polynomials, M(1)kM(1)\leq k, and a special case of a theorem of Warren gives a bound for larger dimensions:

Theorem 6.7 ([War68]).

Polynomial threshold functions p:d{±1}p:\mathbb{R}^{d}\to\{\pm 1\} of degree kk have M(n)6(2k)nM(n)\leq 6(2k)^{n}.

Since M(1)24kM(1)\leq\sqrt{24}k and 6(2k)n(24k)n6(2k)^{n}\leq(\sqrt{24}k)^{n} for n>1n>1, we have M(n)(24k)nM(n)\leq(\sqrt{24}k)^{n} for all n1n\geq 1, so Lemma 6.6 gives us:

Corollary 6.8.

For r324dk/ϵr\geq 3\sqrt{24}dk/\epsilon, rd𝖻𝖻𝗌(𝒫k,r)<ϵr^{-d}\cdot\mathsf{bbs}(\mathcal{P}_{k},r)<\epsilon.

6.2 Application

As was the case for halfspaces, our reduction of noise sensitivity on [r]d[r]^{d} to {±1}d\{\pm 1\}^{d} requires that the class 𝒫k\mathcal{P}_{k} is invariant under linear transformations:

Proposition 6.9.

For any f𝒫kf\in\mathcal{P}_{k} and full-rank linear transformation Ad×dA\in\mathbb{R}^{d\times d}, fA𝒫kf\circ A\in\mathcal{P}_{k}.

Proof.

Let f(x)=sign(p(x))f(x)=\operatorname{sign}(p(x)) where pp is a degree-kk polynomial and let cqi=1dxiqic_{q}\prod_{i=1}^{d}x_{i}^{q_{i}} be a term of pp, where cqc_{q}\in\mathbb{R} and q0dq\in\mathbb{Z}_{\geq 0}^{d} such that iqik\sum_{i}q_{i}\leq k. Let AidA_{i}\in\mathbb{R}^{d} be the ithi^{\text{th}} row of AA. Then

i=1d(Ax)iqi=i=1d(j=1dAi,jxj)qi=pq(x)\prod_{i=1}^{d}(Ax)_{i}^{q_{i}}=\prod_{i=1}^{d}\left(\sum_{j=1}^{d}A_{i,j}x_{j}\right)^{q_{i}}=p_{q}(x)

where pq(x)p_{q}(x) is some polynomial of degree at most i=1dqik\sum_{i=1}^{d}q_{i}\leq k. Then pA=qcqpqp\circ A=\sum_{q}c_{q}p_{q} where qq ranges over 0d\mathbb{Z}_{\geq 0}^{d} with iqik\sum_{i}q_{i}\leq k, and each pqp_{q} has degree at most kk, so pAp\circ A is a degree-kk polynomial. ∎
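The following small sympy example illustrates this degree bound on a concrete instance; the polynomial and the matrix are arbitrary choices made for the check, not taken from the paper.

```python
# Illustration (not part of the proof): composing a degree-k polynomial with a
# linear map does not raise its total degree. Requires sympy; the polynomial and
# the matrix below are arbitrary examples.
import sympy as sp

x1, x2, x3 = sp.symbols("x1 x2 x3")

def p(u1, u2, u3):             # a degree-3 polynomial in three variables
    return 2*u1**2*u3 - u2**3 + 5*u1*u2 - 7

A = sp.Matrix([[1, 2, 0], [0, 1, -1], [3, 0, 1]])   # a full-rank linear map
y = A * sp.Matrix([x1, x2, x3])                      # the coordinates of Ax

deg_p  = sp.Poly(p(x1, x2, x3), x1, x2, x3).total_degree()
deg_pA = sp.Poly(sp.expand(p(y[0], y[1], y[2])), x1, x2, x3).total_degree()
print(deg_p, deg_pA)           # prints "3 3": the composition is again a degree-3 polynomial
```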

The last ingredient we need is the following bound of Diakonikolas et al. on the noise sensitivity:

Theorem 6.10 ([DHK+10]).

Let f:{±1}d{±1}f:\{\pm 1\}^{d}\to\{\pm 1\} be a degree-kk PTF. Then for any δ[0,1]\delta\in[0,1],

𝗇𝗌2,δ(f)\displaystyle\mathsf{ns}_{2,\delta}(f) O(δ1/2k)\displaystyle\leq O(\delta^{1/2^{k}})
𝗇𝗌2,δ(f)\displaystyle\mathsf{ns}_{2,\delta}(f) 2O(k)δ1/(4k+2)log(1/δ).\displaystyle\leq 2^{O(k)}\cdot\delta^{1/(4k+2)}\log(1/\delta)\,.

Putting everything together, we obtain a bound that is polynomial in dd for any fixed k,ϵk,\epsilon, and which matches the result of Diakonikolas et al. [DHK+10] for the uniform distribution over {±1}d\{\pm 1\}^{d}.

See 1.1.3

Proof.

We prove the continuous case here; the finite case is proved in Theorem A.18.

Let r=9dk/ϵr=\lceil 9dk/\epsilon\rceil, so that by Corollary 6.8, condition 1 of Lemma 5.2 is satisfied. Due to Proposition 6.9, we may apply Theorem 6.10 and Lemma 5.3 to conclude that for all f𝒫kf\in\mathcal{P}_{k}

𝗇𝗌r,δ(f𝖻𝗅𝗈𝖼𝗄)\displaystyle\mathsf{ns}_{r,\delta}(f^{\mathsf{block}}) O(δ1/2k)\displaystyle\leq O(\delta^{1/2^{k}})
𝗇𝗌r,δ(f𝖻𝗅𝗈𝖼𝗄)\displaystyle\mathsf{ns}_{r,\delta}(f^{\mathsf{block}}) 2O(k)δ1/(4k+2)log(1/δ).\displaystyle\leq 2^{O(k)}\cdot\delta^{1/(4k+2)}\log(1/\delta)\,.

In the first case, setting δ=O(ϵ2k+1)\delta=O(\epsilon^{2^{k+1}}) we get 𝗇𝗌r,δ(f𝖻𝗅𝗈𝖼𝗄)<ϵ2/3\mathsf{ns}_{r,\delta}(f^{\mathsf{block}})<\epsilon^{2}/3, so by Lemma 5.6 we get a set \mathcal{F} of functions [r]d[r]^{d}\to\mathbb{R} of size ||(rd)O(1ϵ2k+1)|\mathcal{F}|\leq(rd)^{O\left(\frac{1}{\epsilon^{2^{k+1}}}\right)} satisfying condition 2 of Lemma 5.2. For n=poly(||,1/ϵ)n=\operatorname{poly}(|\mathcal{F}|,1/\epsilon), Lemma 5.2 implies an algorithm with sample size

O(rd2n2log(rd))=O(d3kϵlog(dk/ϵ))(kdϵ)O(1ϵ2k+1).O(rd^{2}n^{2}\log(rd))=O\left(\frac{d^{3}k}{\epsilon}\log(dk/\epsilon)\right)\cdot\left(\frac{kd}{\epsilon}\right)^{O\left(\frac{1}{\epsilon^{2^{k+1}}}\right)}\,.

In the second case, setting δ=O((ϵ22O(k)log(2k/ϵ))4k+2)\delta=O\left(\left(\frac{\epsilon^{2}}{2^{O(k)}\log(2^{k}/\epsilon)}\right)^{4k+2}\right), we again obtain 𝗇𝗌r,δ(f𝖻𝗅𝗈𝖼𝗄)ϵ2/3\mathsf{ns}_{r,\delta}(f^{\mathsf{block}})\leq\epsilon^{2}/3 and get an algorithm with sample size

(kdϵ)2O(k2)(log(1/ϵ)ϵ2)4k+2.\left(\frac{kd}{\epsilon}\right)^{2^{O(k^{2})}\left(\frac{\log(1/\epsilon)}{\epsilon^{2}}\right)^{4k+2}}\,.

The final result is obtained by applying Lemma 4.5. ∎

7 Learning & Testing kk-Alternating Functions

A function f:𝒳{±1}f:\mathcal{X}\to\{\pm 1\} on a partial order 𝒳\mathcal{X} is kk-alternating if for every chain x1<<xk+2x_{1}<\dotsc<x_{k+2} there is i[k+1]i\in[k+1] such that f(xi)=f(xi+1)f(x_{i})=f(x_{i+1}). Monotone functions are examples of 1-alternating functions. We consider kk-alternating functions on d\mathbb{R}^{d} with the usual partial order: for x,ydx,y\in\mathbb{R}^{d} we say x<yx<y when xiyix_{i}\leq y_{i} for each i[d]i\in[d] and xyx\neq y. Once again, we must bound the block boundary size, which has been done already by Canonne et al. [CGG+19]. We include the proof because their work does not share our definition of block boundary size, and because we use it in our short proof of the monotonicity tester in Section 3.1.

Lemma 7.1 ([CGG+19]).

The rr-block boundary size of kk-alternating functions is at most kdrd1kdr^{d-1}.

Proof.

Let ff be kk-alternating, let 𝖻𝗅𝗈𝖼𝗄:d[r]d\mathsf{block}:\mathbb{R}^{d}\to{[r]}^{d} be any block partition and let v1,,vmv_{1},\dotsc,v_{m} be any chain in [r]d{[r]}^{d}. Suppose that there are k+1k+1 indices i1,,ik+1i_{1},\dotsc,i_{k+1} such that ff is not constant on 𝖻𝗅𝗈𝖼𝗄1(vij)\mathsf{block}^{-1}(v_{i_{j}}). Then there is a set of points x1,,xk+1x_{1},\dotsc,x_{k+1} such that xj𝖻𝗅𝗈𝖼𝗄1(vij)x_{j}\in\mathsf{block}^{-1}(v_{i_{j}}) and f(xj)f(xj+1)f(x_{j})\neq f(x_{j+1}) for each j[k]j\in[k]. But since vi1<<vik+1v_{i_{1}}<\dotsm<v_{i_{k+1}}, x1<<xk+1x_{1}<\dotsm<x_{k+1} also, which contradicts the fact that ff is kk-alternating. Then every chain in [r]d{[r]}^{d} has at most kk non-constant blocks, and we may partition [r]d{[r]}^{d} into at most drd1dr^{d-1} chains by taking the diagonals v+λ1v+\lambda\vec{1} where vv is any vector satisfying i:vi=1\exists i:v_{i}=1 and λ\lambda ranges over all integers. ∎
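The chain decomposition in this proof is easy to make concrete. The sketch below, for illustrative small values of r and d, enumerates the diagonals v+λ·(1,…,1) with min(v)=1, checks that they partition [r]^d, and verifies the d·r^{d−1} bound.

```python
# Illustration of the chain decomposition from the proof of Lemma 7.1: the
# diagonals v + lambda*(1,...,1) with min(v) = 1 partition [r]^d into at most
# d * r**(d-1) chains. Small r and d are chosen only for the check.
from itertools import product

r, d = 4, 3
starts = [v for v in product(range(1, r + 1), repeat=d) if min(v) == 1]

chains = []
for v in starts:
    chain, lam = [], 0
    while max(v) + lam <= r:
        chain.append(tuple(c + lam for c in v))
        lam += 1
    chains.append(chain)

cells = [w for chain in chains for w in chain]
assert sorted(cells) == sorted(product(range(1, r + 1), repeat=d))   # an exact partition
assert len(chains) <= d * r ** (d - 1)
print(len(chains), "chains cover all", r ** d, "blocks")
```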

Canonne et al. also use a noise sensitivity bound to obtain a spanning set \mathcal{F}; we quote their result.

Lemma 7.2 ([CGG+19]).

There is a set \mathcal{F} of functions [r]d{[r]}^{d}\to\mathbb{R}, with size

||exp(O(kdϵ2log(rd/ϵ))),|\mathcal{F}|\leq\mathrm{exp}\left(O\left(\frac{k\sqrt{d}}{\epsilon^{2}}\log(rd/\epsilon)\right)\right)\,,

such that for any kk-alternating function h:[r]d{±1}h:{[r]}^{d}\to\{\pm 1\}, there is g:[r]dg:{[r]}^{d}\to\mathbb{R} that is a linear combination of functions in \mathcal{F} and 𝔼x[r]d[(h(x)g(x))2]ϵ2\underset{x\sim{[r]}^{d}}{\mathbb{E}}\left[{(h(x)-g(x))}^{2}\right]\leq\epsilon^{2}.

See 1.1.5

Proof.

We prove the continuous case; for the finite case see Theorem A.18.

Let r=dk/ϵr=\lceil dk/\epsilon\rceil and let 𝖻𝗅𝗈𝖼𝗄:d[r]d\mathsf{block}:\mathbb{R}^{d}\to[r]^{d} be any rr-block partition. By Lemma 7.1, the first condition of Lemma 5.2 is satisfied. Now let ff\in\mathcal{H} and consider f𝖻𝗅𝗈𝖼𝗄f^{\mathsf{block}}. For any chain v1<v2<<vmv_{1}<v_{2}<\dotsm<v_{m} in [r]d[r]^{d}, it must be that 𝖻𝗅𝗈𝖼𝗄𝗉𝗈𝗂𝗇𝗍(v1)<𝖻𝗅𝗈𝖼𝗄𝗉𝗈𝗂𝗇𝗍(v2)<<𝖻𝗅𝗈𝖼𝗄𝗉𝗈𝗂𝗇𝗍(vm)\mathsf{blockpoint}(v_{1})<\mathsf{blockpoint}(v_{2})<\dotsm<\mathsf{blockpoint}(v_{m}) since every x𝖻𝗅𝗈𝖼𝗄1(vi),y𝖻𝗅𝗈𝖼𝗄1(vj)x\in\mathsf{block}^{-1}(v_{i}),y\in\mathsf{block}^{-1}(v_{j}) satisfy x<yx<y when vi<vjv_{i}<v_{j}; then ff alternates at most kk times on the chain 𝖻𝗅𝗈𝖼𝗄𝗉𝗈𝗂𝗇𝗍(v1)<<𝖻𝗅𝗈𝖼𝗄𝗉𝗈𝗂𝗇𝗍(vm)\mathsf{blockpoint}(v_{1})<\dotsm<\mathsf{blockpoint}(v_{m}) and, since f𝖻𝗅𝗈𝖼𝗄(vi)=f(𝖻𝗅𝗈𝖼𝗄𝗉𝗈𝗂𝗇𝗍(vi))f^{\mathsf{block}}(v_{i})=f(\mathsf{blockpoint}(v_{i})), f𝖻𝗅𝗈𝖼𝗄f^{\mathsf{block}} is also kk-alternating. Therefore the set \mathcal{F} of functions given by Lemma 7.2 satisfies condition 2 of Lemma 5.2, and we have n=poly(||,1/ϵ)=exp(O(kdϵ2log(rd/ϵ)))n=\operatorname{poly}(|\mathcal{F}|,1/\epsilon)=\mathrm{exp}\left(O\left(\frac{k\sqrt{d}}{\epsilon^{2}}\log(rd/\epsilon)\right)\right). Applying Lemma 5.2 gives an algorithm with sample complexity

O(rd2n2log(rd))=O(d3kϵlog(dk/ϵ)(dkϵ)O(kdϵ2))=(dkϵ)O(kdϵ2).O\left(rd^{2}n^{2}\log(rd)\right)=O\left(\frac{d^{3}k}{\epsilon}\log(dk/\epsilon)\cdot\left(\frac{dk}{\epsilon}\right)^{O\left(\frac{k\sqrt{d}}{\epsilon^{2}}\right)}\right)=\left(\frac{dk}{\epsilon}\right)^{O\left(\frac{k\sqrt{d}}{\epsilon^{2}}\right)}\,.

The other sample complexity follows from Lemma 4.5. ∎

See 1.1.5

Proof.

The following argument is for the continuous case, but generalizes to the finite case using the definitions in Appendix A.

Let \mathcal{H} be the class of kk-alternating functions. Suppose there is a set 𝒦\mathcal{K}\subset\mathcal{H}, known to the algorithm, that is a (τ/2)(\tau/2)-cover. Then, taking a set QQ of q=O(1τ2log|𝒦|)q=O(\tfrac{1}{\tau^{2}}\log|\mathcal{K}|) independent random samples from μ\mu and using Hoeffding’s inequality,

𝑄[h𝒦:|𝖽𝗂𝗌𝗍Q(f,h)𝖽𝗂𝗌𝗍μ(f,h)|>τ2]\displaystyle\underset{Q}{\mathbb{P}}\left[\exists h\in\mathcal{K}:|\mathsf{dist}_{Q}(f,h)-\mathsf{dist}_{\mu}(f,h)|>\frac{\tau}{2}\right] |𝒦|maxh𝒦[|𝖽𝗂𝗌𝗍Q(f,h)𝖽𝗂𝗌𝗍μ(f,h)|>τ2]\displaystyle\leq|\mathcal{K}|\cdot\max_{h\in\mathcal{K}}\mathbb{P}\left[|\mathsf{dist}_{Q}(f,h)-\mathsf{dist}_{\mu}(f,h)|>\frac{\tau}{2}\right]
|𝒦|2exp(qτ22)<1/6.\displaystyle\leq|\mathcal{K}|\cdot 2\mathrm{exp}\left(-\frac{q\tau^{2}}{2}\right)<1/6\,.

Then the tester accepts if 𝖽𝗂𝗌𝗍Q(f,𝒦)<ϵ1+τ\mathsf{dist}_{Q}(f,\mathcal{K})<\epsilon_{1}+\tau and rejects otherwise; we now prove that this is correct with high probability. Assume that the above estimation is accurate, which occurs with probability at least 5/65/6. If 𝖽𝗂𝗌𝗍μ(f,)ϵ1\mathsf{dist}_{\mu}(f,\mathcal{H})\leq\epsilon_{1} then, taking hh\in\mathcal{H} with 𝖽𝗂𝗌𝗍μ(f,h)ϵ1\mathsf{dist}_{\mu}(f,h)\leq\epsilon_{1}, we get 𝖽𝗂𝗌𝗍μ(f,𝒦)𝖽𝗂𝗌𝗍μ(f,h)+𝖽𝗂𝗌𝗍μ(h,𝒦)ϵ1+τ/2\mathsf{dist}_{\mu}(f,\mathcal{K})\leq\mathsf{dist}_{\mu}(f,h)+\mathsf{dist}_{\mu}(h,\mathcal{K})\leq\epsilon_{1}+\tau/2. Then for g𝒦g\in\mathcal{K} minimizing 𝖽𝗂𝗌𝗍μ(f,g)\mathsf{dist}_{\mu}(f,g),

𝖽𝗂𝗌𝗍Q(f,𝒦)𝖽𝗂𝗌𝗍Q(f,g)<𝖽𝗂𝗌𝗍μ(f,g)+τ2ϵ1+τ,\mathsf{dist}_{Q}(f,\mathcal{K})\leq\mathsf{dist}_{Q}(f,g)<\mathsf{dist}_{\mu}(f,g)+\frac{\tau}{2}\leq\epsilon_{1}+\tau\,,

so ff is accepted. Now suppose that ff is accepted, so 𝖽𝗂𝗌𝗍Q(f,𝒦)<ϵ1+τ\mathsf{dist}_{Q}(f,\mathcal{K})<\epsilon_{1}+\tau. Then, taking g𝒦g\in\mathcal{K} with 𝖽𝗂𝗌𝗍Q(f,g)<ϵ1+τ\mathsf{dist}_{Q}(f,g)<\epsilon_{1}+\tau,

𝖽𝗂𝗌𝗍μ(f,)𝖽𝗂𝗌𝗍μ(f,g)𝖽𝗂𝗌𝗍Q(f,g)+τ2<ϵ1+32τ=ϵ1+34(ϵ2ϵ1)ϵ2.\mathsf{dist}_{\mu}(f,\mathcal{H})\leq\mathsf{dist}_{\mu}(f,g)\leq\mathsf{dist}_{Q}(f,g)+\frac{\tau}{2}<\epsilon_{1}+\frac{3}{2}\tau=\epsilon_{1}+\frac{3}{4}(\epsilon_{2}-\epsilon_{1})\leq\epsilon_{2}\,.

What remains is to show how the tester constructs such a cover 𝒦\mathcal{K}.

Consider the learning algorithm of Theorem 1.1.5 with error parameter τ/12\tau/12, so r=12dk/τr=\lceil 12dk/\tau\rceil. Let XX be the grid constructed by that algorithm and let 𝖻𝗅𝗈𝖼𝗄:d[r]d\mathsf{block}:\mathbb{R}^{d}\to[r]^{d} be the induced rr-block partition. We may assume that with probability at least 5/65/6, 𝖻𝗅𝗈𝖼𝗄(μ)𝗎𝗇𝗂𝖿([r]d)𝖳𝖵<τ/12\|\mathsf{block}(\mu)-\mathsf{unif}([r]^{d})\|_{\mathsf{TV}}<\tau/12; suppose that this event occurs. The learner then takes m=(dkτ)O(kdτ2)m=\left(\frac{dk}{\tau}\right)^{O\left(\frac{k\sqrt{d}}{\tau^{2}}\right)} additional samples to learn the class 𝖻𝗅𝗈𝖼𝗄\mathcal{H}^{\mathsf{block}} with domain [r]d[r]^{d}. For every ff\in\mathcal{H} the learner has positive probability of outputting a function h:[r]d{0,1}h:[r]^{d}\to\{0,1\} with 𝑣[h(v)f𝖻𝗅𝗈𝖼𝗄(v)]<τ/12\underset{v}{\mathbb{P}}\left[h(v)\neq f^{\mathsf{block}}(v)\right]<\tau/12 (where vv is chosen from 𝖻𝗅𝗈𝖼𝗄(μ)\mathsf{block}(\mu)). Let 𝒦\mathcal{K}^{\prime} be the set of possible outputs of the learner; then 𝒦\mathcal{K}^{\prime} is a (τ/12)(\tau/12)-cover for 𝖻𝗅𝗈𝖼𝗄\mathcal{H}^{\mathsf{block}}. Construct a set 𝒦𝖻𝗅𝗈𝖼𝗄\mathcal{K}^{\mathsf{block}} by choosing, for each h𝒦h\in\mathcal{K}^{\prime}, the nearest function g𝖻𝗅𝗈𝖼𝗄g\in\mathcal{H}^{\mathsf{block}} with respect to the distribution 𝖻𝗅𝗈𝖼𝗄(μ)\mathsf{block}(\mu). Then 𝒦𝖻𝗅𝗈𝖼𝗄\mathcal{K}^{\mathsf{block}} is a (τ/6)(\tau/6)-cover, since for any function f𝖻𝗅𝗈𝖼𝗄𝖻𝗅𝗈𝖼𝗄f^{\mathsf{block}}\in\mathcal{H}^{\mathsf{block}}, if h𝒦h\in\mathcal{K}^{\prime} is the nearest output of the learner and g𝒦𝖻𝗅𝗈𝖼𝗄g\in\mathcal{K}^{\mathsf{block}} is nearest hh, then by the triangle inequality f𝖻𝗅𝗈𝖼𝗄f^{\mathsf{block}} has distance at most τ/6\tau/6 to gg with respect to 𝖻𝗅𝗈𝖼𝗄(μ)\mathsf{block}(\mu). Finally, construct a set 𝒦\mathcal{K}\subset\mathcal{H} by taking each function hh\in\mathcal{H} such that h𝖼𝗈𝖺𝗋𝗌𝖾=hh^{\mathsf{coarse}}=h and h𝖻𝗅𝗈𝖼𝗄𝒦𝖻𝗅𝗈𝖼𝗄h^{\mathsf{block}}\in\mathcal{K}^{\mathsf{block}} (note that there exists hh\in\mathcal{H} such that h𝖼𝗈𝖺𝗋𝗌𝖾=hh^{\mathsf{coarse}}=h since h𝖼𝗈𝖺𝗋𝗌𝖾h^{\mathsf{coarse}} is kk-alternating when h𝖻𝗅𝗈𝖼𝗄h^{\mathsf{block}} is kk-alternating). Then 𝒦\mathcal{K} is a (τ/2)(\tau/2)-cover since for any ff\in\mathcal{H}, when h𝒦h\in\mathcal{K} minimizes v𝖻𝗅𝗈𝖼𝗄(μ)[f𝖻𝗅𝗈𝖼𝗄(v)h𝖻𝗅𝗈𝖼𝗄(v)]\underset{v\sim\mathsf{block}(\mu)}{\mathbb{P}}\left[f^{\mathsf{block}}(v)\neq h^{\mathsf{block}}(v)\right],

𝖽𝗂𝗌𝗍μ(f,𝒦)\displaystyle\mathsf{dist}_{\mu}(f,\mathcal{K})
𝖽𝗂𝗌𝗍μ(f,f𝖼𝗈𝖺𝗋𝗌𝖾)+𝖽𝗂𝗌𝗍μ(f𝖼𝗈𝖺𝗋𝗌𝖾,𝒦)\displaystyle\qquad\leq\mathsf{dist}_{\mu}(f,f^{\mathsf{coarse}})+\mathsf{dist}_{\mu}(f^{\mathsf{coarse}},\mathcal{K})
rd𝖻𝖻𝗌(,r)+v𝖻𝗅𝗈𝖼𝗄(μ)[f𝖻𝗅𝗈𝖼𝗄(v)h𝖻𝗅𝗈𝖼𝗄(v)]+2𝖻𝗅𝗈𝖼𝗄(μ)𝗎𝗇𝗂𝖿([r]d)𝖳𝖵\displaystyle\qquad\leq r^{-d}\cdot\mathsf{bbs}(\mathcal{H},r)+\underset{v\sim\mathsf{block}(\mu)}{\mathbb{P}}\left[f^{\mathsf{block}}(v)\neq h^{\mathsf{block}}(v)\right]+2\|\mathsf{block}(\mu)-\mathsf{unif}([r]^{d})\|_{\mathsf{TV}}
<τ/6+τ/6+2τ/12τ/2.\displaystyle\qquad<\tau/6+\tau/6+2\tau/12\leq\tau/2\,.

Now we bound the size of 𝒦𝖻𝗅𝗈𝖼𝗄\mathcal{K}^{\mathsf{block}}. Since there are mm samples and each sample v𝖻𝗅𝗈𝖼𝗄(μ)v\sim\mathsf{block}(\mu) is in [r]d[r]^{d}, labelled by {0,1}\{0,1\}, there are at most (rd)m2m(r^{d})^{m}2^{m} possible sample sequences, so at most (2rd)m(2r^{d})^{m} outputs of the learner (after constructing XX), so |𝒦𝖻𝗅𝗈𝖼𝗄|(2rd)m|\mathcal{K}^{\mathsf{block}}|\leq(2r^{d})^{m}. Therefore, after constructing XX, the tester may construct 𝒦𝖻𝗅𝗈𝖼𝗄\mathcal{K}^{\mathsf{block}} and run the above estimation procedure, with q=O(1τ2dmlogr)=(dkτ)O(kdτ2)q=O\left(\frac{1}{\tau^{2}}dm\log r\right)=\left(\frac{dk}{\tau}\right)^{O\left(\frac{k\sqrt{d}}{\tau^{2}}\right)}. ∎
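For concreteness, the accept/reject rule analyzed above can be sketched as follows. This is a toy illustration only: the cover is assumed to be given as a list of hypotheses, sample_mu is a stand-in for sampling from μ, and the constants in the sample bound are illustrative rather than the ones from the proof.

```python
# Toy sketch of the accept/reject rule described above. The cover `cover`, the
# query function f, and the sampler `sample_mu` are assumed to be given; the
# constants in the sample bound q are illustrative, not the ones from the proof.
import math
import random

def tolerant_test(f, cover, sample_mu, eps1, eps2, rng=random):
    tau = (eps2 - eps1) / 2.0
    q = math.ceil(8 * math.log(6 * max(len(cover), 1)) / tau ** 2)   # Hoeffding-style choice
    Q = [sample_mu(rng) for _ in range(q)]
    labels = [f(x) for x in Q]
    dist_Q = min(
        sum(1 for x, y in zip(Q, labels) if h(x) != y) / q for h in cover
    )
    return dist_Q < eps1 + tau          # accept iff the empirical distance to the cover is small

if __name__ == "__main__":
    cover = [lambda x, t=t: 1 if x >= t else -1 for t in (0.25, 0.5, 0.75)]   # toy cover
    noisy = lambda x: (1 if x >= 0.5 else -1) * (-1 if random.random() < 0.05 else 1)
    print(tolerant_test(noisy, cover, lambda rng: rng.random(), 0.1, 0.3))    # usually True
```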

Appendix A Discrete Distributions

We will say that a distribution μi\mu_{i} over \mathbb{R} is finite if it is a distribution over a finite set XX\subset\mathbb{R}. In this section, we extend downsampling to work for finite product distributions: distributions μ=μ1××μd\mu=\mu_{1}\times\dotsm\times\mu_{d} such that all μi\mu_{i} are finite. As mentioned in the introduction, our algorithms have the advantage that they do not need to know in advance whether the distribution is continuous or finite, and if it is finite they do not need to know the support. This is in contrast to the algorithms of Blais et al. [BOW10], which work for arbitrary finite product distributions but must know the support (since they learn a function under the “one-out-of-kk encoding”). Our algorithms have superior time complexity for large supports.

We begin with an example of a pathological set of functions that illustrates some of the difficulties in the generalization.

Example A.1.

The Dirichlet function f:{±1}f:\mathbb{R}\to\{\pm 1\} is the function that takes value 11 on all rational numbers and value 1-1 on all irrational numbers. We will define the Dirichlet class of functions as the set of all functions f:d{±1}f:\mathbb{R}^{d}\to\{\pm 1\} such that f(x)=1f(x)=-1 on all xx with at least 1 irrational coordinate xix_{i}, and f(x)f(x) is arbitrary for any xx with all rational coordinates. Since the Lebesgue measure of the set of rational numbers is 0, in any continuous product distribution, any function ff in the Dirichlet class satisfies [f(x)1]=0\mathbb{P}\left[f(x)\neq-1\right]=0; therefore learning this class is trivial in any continuous product distribution since we may output the constant 1-1 function. And 𝖻𝖻𝗌(f,r)=0\mathsf{bbs}(f,r)=0 for this class since no block contains a set SS of positive measure containing 11-valued points. On the other hand, if μ\mu is a finitely supported product distribution, then it may be the case that it is supported only on points with all rational coordinates. In that case, the Dirichlet class of functions is the set of all functions on the support, which is impossible to learn when the size of the support is unknown (since the number of samples will depend on the support size). It is apparent that our former definition of 𝖻𝖻𝗌\mathsf{bbs} no longer suffices to bound the complexity of algorithms when we allow finitely supported distributions.

Another difficulty arises for finitely supported distributions with small support: for example, the hypercube {±1}d\{\pm 1\}^{d}. Consider what happens when we attempt to sample a uniform grid, as in the first step of the algorithms above. We will sample many points xx such that x1=1x_{1}=1 and many points such that x1=1x_{1}=-1. Essentially, the algorithm takes a small domain {±1}d\{\pm 1\}^{d} and constructs the larger domain [r]d[r]^{d}, which is antithetical to the downsampling method. A similar situation would occur in large domains [n]d[n]^{d} where some coordinates have exceptionally large probability densities and are sampled many times. Our algorithm must be able to handle such cases, so we must redefine the grid sampling step and block partitions to handle this situation. To do so, we introduce augmented samples: for every sample point xμx\sim\mu we will append a uniformly random value in [0,1]d[0,1]^{d}.

A.1 Augmented Samples & Constructing Uniform Partitions

For augmented points a¯,b¯×[0,1]\overline{a},\overline{b}\in\mathbb{R}\times[0,1], where a¯=(a,a),b¯=(b,b)\overline{a}=(a,a^{\prime}),\overline{b}=(b,b^{\prime}), we will define a total order by saying a¯<b¯\overline{a}<\overline{b} if a<ba<b, or a=ba=b and a<ba^{\prime}<b^{\prime}. Define the interval (a¯,b¯]:={c¯|a¯<c¯b¯}(\overline{a},\overline{b}]:=\{\overline{c}\;|\;\overline{a}<\overline{c}\leq\overline{b}\}. For convenience, when a¯×[0,1]\overline{a}\in\mathbb{R}\times[0,1] and a¯=(a,a)\overline{a}=(a,a^{\prime}) we will write ξ(a¯)=a\xi(\overline{a})=a. If x¯d×[0,1]d\overline{x}\in\mathbb{R}^{d}\times[0,1]^{d} is an augmented vector (i.e. each coordinate x¯i\overline{x}_{i} is an augmented point), we will write ξ(x¯)=(ξ(x¯1),,ξ(x¯d))\xi(\overline{x})=(\xi(\overline{x}_{1}),\dotsc,\xi(\overline{x}_{d})); and when Sd×[0,1]dS\subseteq\mathbb{R}^{d}\times[0,1]^{d} is a set of augmented points, we will write ξ(S)={ξ(x¯)|x¯S}\xi(S)=\{\xi(\overline{x})\;|\;\overline{x}\in S\}.

Definition A.2 (Augmented Block Partition).

An augmented rr-block partition of d\mathbb{R}^{d} is a pair of functions 𝖻𝗅𝗈𝖼𝗄¯:d×[0,1]d[r]d\overline{\mathsf{block}}:\mathbb{R}^{d}\times[0,1]^{d}\to[r]^{d} and 𝖻𝗅𝗈𝖼𝗄𝗉𝗈𝗂𝗇𝗍:[r]dd\mathsf{blockpoint}:[r]^{d}\to\mathbb{R}^{d} obtained as follows. For each i[d],j[r1]i\in[d],j\in[r-1] let a¯i,j×[0,1]\overline{a}_{i,j}\in\mathbb{R}\times[0,1] such that a¯i,j<a¯i,j+1\overline{a}_{i,j}<\overline{a}_{i,j+1} and define a¯i,0=(,0),a¯i,r=(,1)\overline{a}_{i,0}=(-\infty,0),\overline{a}_{i,r}=(\infty,1). For each i[d],j[r]i\in[d],j\in[r] define the interval Bi,j=(a¯i,j1,a¯i,j]B_{i,j}=(\overline{a}_{i,j-1},\overline{a}_{i,j}] and a point b¯i,j×[0,1]\overline{b}_{i,j}\in\mathbb{R}\times[0,1] such that a¯i,j1b¯i,ja¯i,j\overline{a}_{i,j-1}\leq\overline{b}_{i,j}\leq\overline{a}_{i,j}. The function 𝖻𝗅𝗈𝖼𝗄¯:d×[0,1]d[r]d\overline{\mathsf{block}}:\mathbb{R}^{d}\times[0,1]^{d}\to[r]^{d} is defined by setting 𝖻𝗅𝗈𝖼𝗄¯(x¯)\overline{\mathsf{block}}(\overline{x}) to be the unique vector v[r]dv\in[r]^{d} such that x¯iBi,vi\overline{x}_{i}\in B_{i,v_{i}} for each i[d]i\in[d]. Observe that 𝖻𝗅𝗈𝖼𝗄¯1(v):={x¯:𝖻𝗅𝗈𝖼𝗄¯(x¯)=v}\overline{\mathsf{block}}^{-1}(v):=\{\overline{x}:\overline{\mathsf{block}}(\overline{x})=v\} is a set of augmented points in d×[0,1]d\mathbb{R}^{d}\times[0,1]^{d} and that it is possible for two augmented points x¯,y¯\overline{x},\overline{y} to satisfy ξ(x¯)=ξ(y¯)\xi(\overline{x})=\xi(\overline{y}) while 𝖻𝗅𝗈𝖼𝗄¯(x¯)𝖻𝗅𝗈𝖼𝗄¯(y¯)\overline{\mathsf{block}}(\overline{x})\neq\overline{\mathsf{block}}(\overline{y}). The function 𝖻𝗅𝗈𝖼𝗄𝗉𝗈𝗂𝗇𝗍:[r]dd\mathsf{blockpoint}:[r]^{d}\to\mathbb{R}^{d} is defined by setting 𝖻𝗅𝗈𝖼𝗄𝗉𝗈𝗂𝗇𝗍(v)=(ξ(b¯1,v1),,ξ(b¯d,vd))\mathsf{blockpoint}(v)=(\xi(\overline{b}_{1,v_{1}}),\dotsc,\xi(\overline{b}_{d,v_{d}})); note that this is a non-augmented point.

Definition A.3 (Block Functions and Coarse Functions).

For a function f:d{±1}f:\mathbb{R}^{d}\to\{\pm 1\} we will define the functions f𝖻𝗅𝗈𝖼𝗄:[r]d{±1}f^{\mathsf{block}}:[r]^{d}\to\{\pm 1\} as f𝖻𝗅𝗈𝖼𝗄:=f𝖻𝗅𝗈𝖼𝗄𝗉𝗈𝗂𝗇𝗍f^{\mathsf{block}}:=f\circ\mathsf{blockpoint} and for each z[0,1]dz\in[0,1]^{d} we will define fz𝖼𝗈𝖺𝗋𝗌𝖾:d{±1}f^{\mathsf{coarse}}_{z}:\mathbb{R}^{d}\to\{\pm 1\} as fz𝖼𝗈𝖺𝗋𝗌𝖾(x):=f𝖻𝗅𝗈𝖼𝗄(𝖻𝗅𝗈𝖼𝗄¯(x,z))f^{\mathsf{coarse}}_{z}(x):=f^{\mathsf{block}}(\overline{\mathsf{block}}(x,z)). Unlike in the continuous setting, fz𝖼𝗈𝖺𝗋𝗌𝖾f^{\mathsf{coarse}}_{z} depends on an additional variable z[0,1]dz\in[0,1]^{d}, which is necessary because a single point xdx\in\mathbb{R}^{d} may be augmented differently to get different 𝖻𝗅𝗈𝖼𝗄¯\overline{\mathsf{block}} values. For a distribution μ\mu over d\mathbb{R}^{d} define the augmented distribution μ¯\overline{\mu} over d×[0,1]d\mathbb{R}^{d}\times[0,1]^{d} as the distribution of (x,z)(x,z) when xμx\sim\mu and zz is uniform in [0,1]d[0,1]^{d}. For an augmented rr-block partition 𝖻𝗅𝗈𝖼𝗄¯:d×[0,1]d[r]d\overline{\mathsf{block}}:\mathbb{R}^{d}\times[0,1]^{d}\to[r]^{d} we define the distribution 𝖻𝗅𝗈𝖼𝗄¯(μ)\overline{\mathsf{block}}(\mu) over [r]d[r]^{d} as the distribution of 𝖻𝗅𝗈𝖼𝗄¯(x¯)\overline{\mathsf{block}}(\overline{x}) for x¯μ¯\overline{x}\sim\overline{\mu}.

Definition A.4 (Augmented Random Grid).

An augmented random grid X¯\overline{X} of length mm is obtained by sampling mm augmented points x¯1,,x¯mμ¯\overline{x}_{1},\dotsc,\overline{x}_{m}\sim\overline{\mu} and for each i[d],j[m]i\in[d],j\in[m] defining X¯i,j\overline{X}_{i,j} to be the jthj^{\mathrm{th}} smallest coordinate in dimension ii by the augmented total order. For any rr that divides mm we define an augmented rr-block partition depending on X¯\overline{X} by defining for each i[d],j[r1]i\in[d],j\in[r-1] the points a¯i,j=X¯i,mj/r\overline{a}_{i,j}=\overline{X}_{i,mj/r}, (and a¯i,0=(,0),a¯i,r=(,1)\overline{a}_{i,0}=(-\infty,0),\overline{a}_{i,r}=(\infty,1)), so that the intervals are Bi,j=(X¯i,m(j1)/r,X¯i,mj/r]B_{i,j}=(\overline{X}_{i,m(j-1)/r},\overline{X}_{i,mj/r}] for j{2,,r1}j\in\{2,\dotsc,r-1\} and Bi,1=((,0),X¯i,m/r],Bi,r=(X¯i,m(r1)/r,(,1)]B_{i,1}=((-\infty,0),\overline{X}_{i,m/r}],B_{i,r}=(\overline{X}_{i,m(r-1)/r},(\infty,1)]. We set the points b¯i,j\overline{b}_{i,j} defining 𝖻𝗅𝗈𝖼𝗄𝗉𝗈𝗂𝗇𝗍:[r]dd\mathsf{blockpoint}:[r]^{d}\to\mathbb{R}^{d} to be b¯i,j=X¯i,k\overline{b}_{i,j}=\overline{X}_{i,k} for some X¯i,kBi,j\overline{X}_{i,k}\in B_{i,j}. This is the augmented rr-block partition induced by X¯\overline{X}.
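The construction of Definition A.4 is straightforward to implement. The sketch below builds an augmented r-block partition from m augmented samples; sample_coord is a hypothetical stand-in for one coordinate μᵢ, here a two-point (hypercube-like) distribution, which is exactly the situation the augmentation is designed to handle.

```python
# Sketch of Definition A.4 (illustration only): build an augmented r-block
# partition from m augmented samples. `sample_coord` is a hypothetical stand-in
# for one coordinate mu_i; a two-point distribution is used to mimic {+-1}^d.
import bisect
import random

def build_partition(r, m, d, sample_coord, rng=random):
    assert m % r == 0
    boundaries = []
    for i in range(d):
        col = sorted((sample_coord(rng), rng.random()) for _ in range(m))   # augmented, sorted
        boundaries.append([col[m * j // r - 1] for j in range(1, r)])       # a_{i,1},...,a_{i,r-1}
    def block(x, z):
        # coordinate i lands in block j iff (x_i, z_i) lies in (a_{i,j-1}, a_{i,j}]
        return tuple(bisect.bisect_left(boundaries[i], (x[i], z[i])) + 1 for i in range(d))
    return block

if __name__ == "__main__":
    d, r, m = 2, 4, 4000
    sample_coord = lambda rng: rng.choice([0.0, 1.0])
    block = build_partition(r, m, d, sample_coord)
    x, z = (0.0, 1.0), tuple(random.random() for _ in range(d))
    print(block(x, z))      # approximately uniform over [r]^d when x ~ mu and z ~ [0,1]^d
```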

Definition A.5 (Augmented Block Boundary).

For an augmented block partition 𝖻𝗅𝗈𝖼𝗄¯:d×[0,1]d[r]d\overline{\mathsf{block}}:\mathbb{R}^{d}\times[0,1]^{d}\to[r]^{d}, a distribution μ\mu over d\mathbb{R}^{d}, and a function f:d{±1}f:\mathbb{R}^{d}\to\{\pm 1\}, we say ff is non-constant on an augmented block 𝖻𝗅𝗈𝖼𝗄¯1(v)\overline{\mathsf{block}}^{-1}(v) if there are sets S¯,T¯𝖻𝗅𝗈𝖼𝗄¯1(v)\overline{S},\overline{T}\subset\overline{\mathsf{block}}^{-1}(v) such that μ(ξ(S¯)),μ(ξ(T¯))>0\mu(\xi(\overline{S})),\mu(\xi(\overline{T}))>0 and for all sξ(S¯),tξ(T¯):f(s)=1,f(t)=1s\in\xi(\overline{S}),t\in\xi(\overline{T}):f(s)=1,f(t)=-1. For a function f:d{±1}f:\mathbb{R}^{d}\to\{\pm 1\} and a number rr, we define the augmented rr-block boundary size 𝖻𝖻𝗌¯(f,r)\overline{\mathsf{bbs}}(f,r) as the maximum number of blocks on which ff is non-constant with respect to a distribution μ\mu, where the maximum is taken over all finite or continuous product distributions μ\mu and all augmented rr-block partitions.

The augmented block partitions satisfy analogous properties to the previously-defined block partitions:

Lemma A.6.

Let X¯\overline{X} be an augmented random grid with length mm sampled from a finite product distribution μ\mu, and let 𝖻𝗅𝗈𝖼𝗄¯:d×[0,1]d[r]d\overline{\mathsf{block}}:\mathbb{R}^{d}\times[0,1]^{d}\to[r]^{d} be the augmented rr-block partition induced by X¯\overline{X}. Then

X¯[𝖻𝗅𝗈𝖼𝗄¯(μ)𝗎𝗇𝗂𝖿([r]d)𝖳𝖵>ϵ]4rdexp(mϵ218rd2).\underset{\overline{X}}{\mathbb{P}}\left[\|\overline{\mathsf{block}}(\mu)-\mathsf{unif}([r]^{d})\|_{\mathsf{TV}}>\epsilon\right]\leq 4rd\cdot\mathrm{exp}\left(-\frac{m\epsilon^{2}}{18rd^{2}}\right)\,.
Proof.

Let μi\mu_{i} be a finitely supported distribution with support SS\subset\mathbb{R}. Let η=12mina,bS,ab|ab|\eta=\frac{1}{2}\min_{a,b\in S,a\neq b}|a-b|. Let μi\mu^{\prime}_{i} be the distribution of xi+ηzix_{i}+\eta z_{i} where xiμix_{i}\sim\mu_{i} and zi[0,1]z_{i}\sim[0,1] uniformly at random; note that μi\mu^{\prime}_{i} is a continuous distribution over \mathbb{R}. For x¯=(x,x),y¯=(y,y)S×[0,1]\overline{x}=(x,x^{\prime}),\overline{y}=(y,y^{\prime})\in S\times[0,1], observe that x¯<y¯\overline{x}<\overline{y} iff x+ηx<y+ηyx+\eta x^{\prime}<y+\eta y^{\prime}. Therefore,

x¯,y¯μ¯i[x¯<y¯]=x,yμi[x<y].\underset{\overline{x},\overline{y}\sim\overline{\mu}_{i}}{\mathbb{P}}\left[\overline{x}<\overline{y}\right]=\underset{x,y\sim\mu^{\prime}_{i}}{\mathbb{P}}\left[x<y\right]\,.

By replacing each finitely supported μi\mu_{i} with μi\mu^{\prime}_{i} we obtain a continuous product distribution μ\mu^{\prime} such that 𝖻𝗅𝗈𝖼𝗄¯(μ)\overline{\mathsf{block}}(\mu) is the same distribution as 𝖻𝗅𝗈𝖼𝗄(μ)\mathsf{block}(\mu^{\prime}), so by Lemma 2.7 the conclusion holds. ∎

Proposition A.7.

For any continuous or finite product distribution μ\mu over d\mathbb{R}^{d}, any augmented rr-block partition 𝖻𝗅𝗈𝖼𝗄¯:d×[0,1]d[r]d\overline{\mathsf{block}}:\mathbb{R}^{d}\times[0,1]^{d}\to[r]^{d} constructed from a random grid X¯\overline{X}, and any f:d{±1}f:\mathbb{R}^{d}\to\{\pm 1\},

xμ,z[0,1]d[f(x)fz𝖼𝗈𝖺𝗋𝗌𝖾(x)]rd𝖻𝖻𝗌¯(f,r)+𝖻𝗅𝗈𝖼𝗄¯(μ)𝗎𝗇𝗂𝖿([r]d)𝖳𝖵.\underset{x\sim\mu,z\sim[0,1]^{d}}{\mathbb{P}}\left[f(x)\neq f^{\mathsf{coarse}}_{z}(x)\right]\leq r^{-d}\cdot\overline{\mathsf{bbs}}(f,r)+\|\overline{\mathsf{block}}(\mu)-\mathsf{unif}([r]^{d})\|_{\mathsf{TV}}\,.
Proof.

The result for continuous product distributions holds by Proposition 2.5 and the fact that 𝖻𝖻𝗌(f,r)𝖻𝖻𝗌¯(f,r)\mathsf{bbs}(f,r)\leq\overline{\mathsf{bbs}}(f,r), so assume μ\mu is a finite product distribution, and let S=supp(μ)S=\operatorname{supp}(\mu).

Suppose that for (x,z)(x,z) sampled from μ¯\overline{\mu}, f(x)fz𝖼𝗈𝖺𝗋𝗌𝖾(x)f(x)\neq f^{\mathsf{coarse}}_{z}(x), and let v=𝖻𝗅𝗈𝖼𝗄¯(x,z)v=\overline{\mathsf{block}}(x,z). Then for y=𝖻𝗅𝗈𝖼𝗄𝗉𝗈𝗂𝗇𝗍(v)y=\mathsf{blockpoint}(v), f(x)f(y)f(x)\neq f(y) and x,yξ(𝖻𝗅𝗈𝖼𝗄¯1(v))x,y\in\xi(\overline{\mathsf{block}}^{-1}(v)). The points x,yx,y clearly have positive measure because μ\mu is finite, so vv is a non-constant block. Then

xμ,z[0,1]d[f(x)fz𝖼𝗈𝖺𝗋𝗌𝖾(x)]\displaystyle\underset{x\sim\mu,z\sim[0,1]^{d}}{\mathbb{P}}\left[f(x)\neq f^{\mathsf{coarse}}_{z}(x)\right] x,z[𝖻𝗅𝗈𝖼𝗄¯(x,z) is non-constant]\displaystyle\leq\underset{x,z}{\mathbb{P}}\left[\overline{\mathsf{block}}(x,z)\text{ is non-constant}\right]
v[r]d[v is non-constant]+𝖻𝗅𝗈𝖼𝗄¯(μ)𝗎𝗇𝗂𝖿([r]d)𝖳𝖵.\displaystyle\leq\underset{v\sim[r]^{d}}{\mathbb{P}}\left[v\text{ is non-constant}\right]+\|\overline{\mathsf{block}}(\mu)-\mathsf{unif}([r]^{d})\|_{\mathsf{TV}}\,.\qed

A.2 Augmented Block-Boundary Size and Noise Sensitivity

To obtain learning algorithms for kk-alternating functions, functions of kk convex sets, functions of kk halfspaces, and degree-kk PTFs, we must provide a bound on 𝖻𝖻𝗌¯\overline{\mathsf{bbs}}.

For a finite set XdX\subset\mathbb{R}^{d} and a function f:d{±1}f:\mathbb{R}^{d}\to\{\pm 1\}, we will call a function f:d{±1}f^{\prime}:\mathbb{R}^{d}\to\{\pm 1\} a blowup of ff (with respect to XX) if xX\forall x\in X there exists an open ball BxxB_{x}\ni x where yBx,f(y)=f(x)\forall y\in B_{x},f^{\prime}(y)=f(x). We will call a set \mathcal{H} of functions f:d{±1}f:\mathbb{R}^{d}\to\{\pm 1\} inflatable if for every finite product set X=X1××XdX=X_{1}\times\dotsm\times X_{d} and ff\in\mathcal{H}, there exists ff^{\prime}\in\mathcal{H} that is a blowup of ff with respect to XX.

Proposition A.8.

Let \mathcal{H} be an inflatable set of functions. Then 𝖻𝖻𝗌¯(,r)𝖻𝖻𝗌(,r)\overline{\mathsf{bbs}}(\mathcal{H},r)\leq\mathsf{bbs}(\mathcal{H},r).

Proof.

Let 𝖻𝗅𝗈𝖼𝗄¯:d×[0,1]d[r]d\overline{\mathsf{block}}:\mathbb{R}^{d}\times[0,1]^{d}\to[r]^{d} be an augmented rr-block partition defined by parameters b¯i,j×[0,1]\overline{b}_{i,j}\in\mathbb{R}\times[0,1] for i[d],j[r1]i\in[d],j\in[r-1], and write b¯i,j=(bi,j,bi,j)\overline{b}_{i,j}=(b_{i,j},b^{\prime}_{i,j}). Let X=X1××XdX=X_{1}\times\dotsm\times X_{d} be any finite product set, let ff\in\mathcal{H}, and let ff^{\prime}\in\mathcal{H} be a blowup of ff with respect to XX; we will bound the number of non-constant blocks. We construct a (non-augmented) rr-block partition as follows. Let η>0\eta>0 be sufficiently small that:

  • xX\forall x\in X, the rectangle Rx:=[x1,x1+η]××[xd,xd+η]R_{x}:=[x_{1},x_{1}+\eta]\times\dotsm\times[x_{d},x_{d}+\eta] is contained within BxB_{x},

  • i[d],[xi,xi+η]Xi={xi}\forall i\in[d],[x_{i},x_{i}+\eta]\cap X_{i}=\{x_{i}\}; and

  • i[d],bi,j+η<bi,j+1\forall i\in[d],b_{i,j}+\eta<b_{i,j+1} unless bi,j=bi,j+1b_{i,j}=b_{i,j+1}.

Such an η\eta exists since the number of constraints is finite. Then define 𝖻𝗅𝗈𝖼𝗄:d[r]d\mathsf{block}:\mathbb{R}^{d}\to[r]^{d} by the parameters ci,j=bi,j+ηbi,jc_{i,j}=b_{i,j}+\eta\cdot b^{\prime}_{i,j}. Note that ci,j=bi,j+ηbi,jbi,j+η<bi,j+1ci,j+1c_{i,j}=b_{i,j}+\eta\cdot b^{\prime}_{i,j}\leq b_{i,j}+\eta<b_{i,j+1}\leq c_{i,j+1}. Let v[r]dv\in[r]^{d} and suppose that ff is non-constant on 𝖻𝗅𝗈𝖼𝗄¯1(v)\overline{\mathsf{block}}^{-1}(v), so there are x¯,y¯𝖻𝗅𝗈𝖼𝗄¯1(v)(X×[0,1]d)\overline{x},\overline{y}\in\overline{\mathsf{block}}^{-1}(v)\cap(X\times[0,1]^{d}) such that f(x)f(y)f(x)\neq f(y), where x¯=(x,x),y¯=(y,y)\overline{x}=(x,x^{\prime}),\overline{y}=(y,y^{\prime}), and i[d],xi,yi(bi,vi1,bi,vi]\forall i\in[d],x_{i},y_{i}\in(b_{i,v_{i}-1},b_{i,v_{i}}] where we define (b,b]={b}(b,b]=\{b\}. Consider 𝖻𝗅𝗈𝖼𝗄1(v)=(c1,v11,c1,v1]××(cd,vd1,cd,vd]\mathsf{block}^{-1}(v)=(c_{1,v_{1}-1},c_{1,v_{1}}]\times\dotsm\times(c_{d,v_{d}-1},c_{d,v_{d}}].

Since x¯i(b¯i,vi1,b¯i,vi]\overline{x}_{i}\in(\overline{b}_{i,v_{i}-1},\overline{b}_{i,v_{i}}], xi(bi,vi1,bi,vi]x_{i}\in(b_{i,v_{i}-1},b_{i,v_{i}}] (where we define (b,b]={b}(b,b]=\{b\}) and xi(bi,vi1,bi,vi]x^{\prime}_{i}\in(b^{\prime}_{i,v_{i}-1},b^{\prime}_{i,v_{i}}]. Therefore xi+ηxibi,vi+ηbi,vi=ci,vix_{i}+\eta\cdot x^{\prime}_{i}\leq b_{i,v_{i}}+\eta\cdot b^{\prime}_{i,v_{i}}=c_{i,v_{i}} and xi+ηxi>bi,vi1+ηbi,vi1=ci,vi1x_{i}+\eta\cdot x^{\prime}_{i}>b_{i,v_{i}-1}+\eta\cdot b^{\prime}_{i,v_{i}-1}=c_{i,v_{i}-1} so x+ηx𝖻𝗅𝗈𝖼𝗄1(v)x+\eta\cdot x^{\prime}\in\mathsf{block}^{-1}(v). Also, x+ηxx+\eta\cdot x^{\prime} is in the rectangle RxBxR_{x}\subset B_{x} so there is a ball around x+ηxx+\eta\cdot x^{\prime}, containing only points with value f(x)=f(x)f^{\prime}(x)=f(x). Likewise, there is a ball around y+ηyy+\eta\cdot y^{\prime} inside 𝖻𝗅𝗈𝖼𝗄1(v)\mathsf{block}^{-1}(v) containing only points with value f(y)=f(y)f(x)f^{\prime}(y)=f(y)\neq f^{\prime}(x). Since these balls must intersect 𝖻𝗅𝗈𝖼𝗄1(v)\mathsf{block}^{-1}(v) on sets with positive measure (in the product of Lebesgue measures), ff^{\prime} is non-constant on 𝖻𝗅𝗈𝖼𝗄1(v)\mathsf{block}^{-1}(v), which proves the statement. ∎

Lemma A.9.

The set 𝒜k\mathcal{A}_{k} of kk-alternating functions is inflatable.

Proof.

Let f𝒜kf\in\mathcal{A}_{k} and let X=X1××XdX=X_{1}\times\dotsm\times X_{d} be a finite product set, where we use the standard ordering on d\mathbb{R}^{d}. Let udu\in\mathbb{R}^{d}. We claim that the set {xX:xu}\{x\in X:x\leq u\} has a unique maximum. Suppose otherwise, so there are x,yux,y\leq u that are each maximal. Let xy=(max(x1,y1),,max(xd,yd))x\vee y=(\max(x_{1},y_{1}),\dotsc,\max(x_{d},y_{d})). Then xyXx\vee y\in X and xy>x,yx\vee y>x,y but uxyu\geq x\vee y, a contradiction. For every udu\in\mathbb{R}^{d}, write uu^{\downarrow} for this unique maximum. Let η>0\eta>0 be small enough that xX,(x+η1)=x\forall x\in X,(x+\eta\cdot\vec{1})^{\downarrow}=x; such a value exists since XX is finite. Define the map ϕ(u)=(u+(η/2)1)\phi(u)=(u+(\eta/2)\cdot\vec{1})^{\downarrow} and for every udu\in\mathbb{R}^{d} define f(u):=f(ϕ(u))f^{\prime}(u):=f(\phi(u)); we argue that this satisfies the required properties. It is clear by our choice of η\eta that f(x)=f((x+(η/2)1))=f(x)f^{\prime}(x)=f((x+(\eta/2)\cdot\vec{1})^{\downarrow})=f(x). Since ϕ\phi is order-preserving (i.e. if u<vu<v then ϕ(u)ϕ(v)\phi(u)\leq\phi(v)), ff^{\prime} is kk-alternating. Now consider the ball B(x):={yd:yx2<η/2}B(x):=\{y\in\mathbb{R}^{d}:\|y-x\|_{2}<\eta/2\}. Since |yixi|<η/2|y_{i}-x_{i}|<\eta/2 for all yB(x)y\in B(x), we have ϕ(y)=(y1+η/2,,yd+η/2)(x1+η,,xd+η)=x\phi(y)=(y_{1}+\eta/2,\dotsc,y_{d}+\eta/2)^{\downarrow}\leq(x_{1}+\eta,\dotsc,x_{d}+\eta)^{\downarrow}=x, and ϕ(y)(x1,,xd)=x\phi(y)\geq(x_{1},\dotsc,x_{d})^{\downarrow}=x so f(y)=f(ϕ(y))=f(x)f^{\prime}(y)=f(\phi(y))=f(x). ∎
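The map φ(u)=(u+(η/2)·1⃗)^↓ used in this proof can be computed coordinate-wise when X is a product set. The sketch below is a simplified illustration (it clamps points below the grid to the smallest grid value, a case the proof does not need); grids and f are hypothetical inputs.

```python
# Sketch (illustration only) of the blowup map from the proof of Lemma A.9:
# phi(u) = (u + (eta/2)*1)^down over a finite product grid, with a simplifying
# clamp at the bottom of the grid. `grids` and `f` are hypothetical inputs.
import bisect

def make_blowup(grids, f):
    """grids: sorted lists X_1,...,X_d; f: function defined on the product grid."""
    gaps = [min(b - a for a, b in zip(X, X[1:])) if len(X) > 1 else 1.0 for X in grids]
    eta = min(gaps) / 2.0               # small enough that (x + eta*1)^down = x for grid points
    def phi(u):
        out = []
        for X, ui in zip(grids, u):
            j = bisect.bisect_right(X, ui + eta / 2) - 1   # largest grid value <= u_i + eta/2
            out.append(X[max(j, 0)])                       # clamp below the grid (simplification)
        return tuple(out)
    return lambda u: f(phi(u))

if __name__ == "__main__":
    grids = [[0.0, 1.0, 2.0], [0.0, 1.0]]
    f = lambda x: 1 if x[0] + x[1] >= 2 else -1            # a monotone (1-alternating) example
    f_prime = make_blowup(grids, f)
    print(f_prime((1.0, 1.0)), f_prime((1.1, 1.05)), f_prime((0.2, 0.3)))   # prints: 1 1 -1
```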

Lemma A.10.

The set 𝒞\mathcal{C} of indicator functions of convex sets is inflatable.

Proof.

Let f:d{±1}f:\mathbb{R}^{d}\to\{\pm 1\} indicate a closed convex set, let S=f1(1)S=f^{-1}(1) be this set, and write δ(x):=min{xy2:yS}\delta(x):=\min\{\|x-y\|_{2}:y\in S\} (this minimum exists since SS is closed). Let XX be any finite set and let δ=min{δ(x):xXS}\delta=\min\{\delta(x):x\in X\setminus S\}. Consider S={x:δ(x)δ/2}S^{\prime}=\{x:\delta(x)\leq\delta/2\}, and let ff^{\prime} be the indicator function for this set. Then f(x)=f(x)f^{\prime}(x)=f(x) for all xXx\in X. Finally, SS^{\prime} is closed, and it is convex since for any two points x,yx,y, it is well-known that the function λδ(λx+(1λ)y)\lambda\mapsto\delta(\lambda x+(1-\lambda)y) is convex for λ[0,1]\lambda\in[0,1]. ∎

Lemma A.11.

The set \mathcal{H} of halfspaces is inflatable.

Proof.

It suffices to show that for any finite set XX (not necessarily a product set) and any halfspace f(x)=sign(w,xt)f(x)=\operatorname{sign}(\langle w,x\rangle-t), there is a halfspace f(x)=sign(w,xt)f^{\prime}(x)=\operatorname{sign}(\langle w,x\rangle-t^{\prime}) such that f(x)=f(x)f(x)=f^{\prime}(x) for all xXx\in X but w,xt0\langle w,x\rangle-t^{\prime}\neq 0 for all xXx\in X; this is a commonly-used fact. Let δ=min{(w,xt):xX,w,xt<0}\delta=\min\{-(\langle w,x\rangle-t):x\in X,\langle w,x\rangle-t<0\}. It must be the case that δ>0\delta>0 (if no xXx\in X has w,xt<0\langle w,x\rangle-t<0, take any δ>0\delta>0). Then f(x)=sign(w,xt+δ/2)f^{\prime}(x)=\operatorname{sign}(\langle w,x\rangle-t+\delta/2) satisfies the condition. ∎

Lemma A.12.

The set 𝒫k\mathcal{P}_{k} of degree-kk PTFs is inflatable.

Proof.

This follows from the above proof for halfspaces, since for any finite XX we may map xXx\in X to the vector of all monomials in xx of degree at most kk, so that any degree-kk polynomial p(x)p(x) is a linear threshold function in the space of monomials. ∎

For a set \mathcal{H} of functions f:d{±1}f:\mathbb{R}^{d}\to\{\pm 1\} and an augmented rr-block partition 𝖻𝗅𝗈𝖼𝗄¯:d×[0,1]d[r]d\overline{\mathsf{block}}:\mathbb{R}^{d}\times[0,1]^{d}\to[r]^{d}, we will write ¯𝖻𝗅𝗈𝖼𝗄:={f𝖻𝗅𝗈𝖼𝗄:f}\overline{\mathcal{H}}^{\mathsf{block}}:=\{f^{\mathsf{block}}:f\in\mathcal{H}\} for the set of block functions f𝖻𝗅𝗈𝖼𝗄:[r]d{±1}f^{\mathsf{block}}:[r]^{d}\to\{\pm 1\}; note that this is not necessarily the same set of functions as 𝖻𝗅𝗈𝖼𝗄\mathcal{H}^{\mathsf{block}} defined for continuous distributions. We must show that the same learning algorithms used above for learning 𝖻𝗅𝗈𝖼𝗄\mathcal{H}^{\mathsf{block}} will work also for ¯𝖻𝗅𝗈𝖼𝗄\overline{\mathcal{H}}^{\mathsf{block}}. For the brute-force learning algorithm of Lemma 4.5, this is trivial, but for the regression algorithm in Lemma 5.2 we must show that there exists a set \mathcal{F} such that each f𝖻𝗅𝗈𝖼𝗄¯𝖻𝗅𝗈𝖼𝗄f^{\mathsf{block}}\in\overline{\mathcal{H}}^{\mathsf{block}} is close to a function g𝗌𝗉𝖺𝗇()g\in\mathsf{span}(\mathcal{F}). For functions of halfspaces and PTFs, we used the bound on noise sensitivity, Lemma 5.3, to construct a set \mathcal{F} of functions suitable for the regression algorithm. The proof for that lemma works without modification for augmented block partitions, so we have the following:

Lemma A.13.

Let \mathcal{H} be any family of functions f:d{±1}f:\mathbb{R}^{d}\to\{\pm 1\} such that, for any linear transformation A:ddA:\mathbb{R}^{d}\to\mathbb{R}^{d}, if ff\in\mathcal{H} then fAf\circ A\in\mathcal{H}. Let 𝖻𝗅𝗈𝖼𝗄¯:d×[0,1]d[r]d\overline{\mathsf{block}}:\mathbb{R}^{d}\times[0,1]^{d}\to[r]^{d} be any augmented rr-block partition. Let 𝗇𝗌2,δ():=supf𝗇𝗌2,δ(f)\mathsf{ns}_{2,\delta}(\mathcal{H}):=\sup_{f\in\mathcal{H}}\mathsf{ns}_{2,\delta}(f). Then for every ff\in\mathcal{H}, 𝗇𝗌r,δ(f𝖻𝗅𝗈𝖼𝗄)𝗇𝗌2,δ()\mathsf{ns}_{r,\delta}(f^{\mathsf{block}})\leq\mathsf{ns}_{2,\delta}(\mathcal{H}).

A.3 Rounding the Output

After learning a function g:[r]d{±1}g:[r]^{d}\to\{\pm 1\}, we must output a function g:d{±1}g^{\prime}:\mathbb{R}^{d}\to\{\pm 1\}. In the continuous setting, we simply output g𝖻𝗅𝗈𝖼𝗄g\circ\mathsf{block}. In the finite setting, we cannot simply output g𝖻𝗅𝗈𝖼𝗄¯g\circ\overline{\mathsf{block}} since 𝖻𝗅𝗈𝖼𝗄¯:d×[0,1]d[r]d\overline{\mathsf{block}}:\mathbb{R}^{d}\times[0,1]^{d}\to[r]^{d} requires an additional argument z[0,1]dz\in[0,1]^{d}. For example, if the distribution μ\mu is a finitely supported distribution on {±1}d\{\pm 1\}^{d}, then for each point x{±1}dx\in\{\pm 1\}^{d} there may be roughly (r/2)d(r/2)^{d} points v[r]dv\in[r]^{d} for which (x,z)𝖻𝗅𝗈𝖼𝗄¯1(v)(x,z)\in\overline{\mathsf{block}}^{-1}(v) for an appropriate choice of z[0,1]dz\in[0,1]^{d}, and these points vv may have different values in gg. The algorithm must choose a single value to output for each xx. We do so by approximating the function x𝔼𝑧[gz(x)]x\mapsto\underset{z}{\mathbb{E}}\left[g_{z}(x)\right] and then rounding it via the next lemma.

Lemma A.14.

Fix a domain 𝒳\mathcal{X}, let γ:𝒳[1,1]\gamma:\mathcal{X}\to[-1,1], and let ϵ,δ>0\epsilon,\delta>0. There is an algorithm that, given query access to γ\gamma and sample access to any distribution 𝒟\mathcal{D} over 𝒳×{±1}\mathcal{X}\times\{\pm 1\}, uses at most O(log(1/δ)/ϵ2)O\left(\log(1/\delta)/\epsilon^{2}\right) samples and queries and with probability at least 1δ1-\delta produces a value tt such that

(x,b)𝒟[sign(γ(x)t)b]12𝔼(x,b)𝒟[|γ(x)b|]+ϵ.\underset{(x,b)\sim\mathcal{D}}{\mathbb{P}}\left[\operatorname{sign}(\gamma(x)-t)\neq b\right]\leq\frac{1}{2}\underset{(x,b)\sim\mathcal{D}}{\mathbb{E}}\left[|\gamma(x)-b|\right]+\epsilon\,.
Proof.

Let 𝒯\mathcal{T} be the set of functions xsign(γ(x)t)x\mapsto\operatorname{sign}(\gamma(x)-t) for any choice of t[1,1]t\in[-1,1]. We will show that the VC dimension of 𝒯\mathcal{T} is 1. Suppose for contradiction that two points x,y𝒳x,y\in\mathcal{X} are shattered by 𝒯\mathcal{T}, so in particular there are s,ts,t\in\mathbb{R} such that sign(γ(x)s)=1\operatorname{sign}(\gamma(x)-s)=1 and sign(γ(y)s)=1\operatorname{sign}(\gamma(y)-s)=-1, while sign(γ(x)t)=1\operatorname{sign}(\gamma(x)-t)=-1 and sign(γ(y)t)=1\operatorname{sign}(\gamma(y)-t)=1. Without loss of generality, suppose s<ts<t. But then sign(γ(y)s)sign(γ(y)t)\operatorname{sign}(\gamma(y)-s)\geq\operatorname{sign}(\gamma(y)-t), which is a contradiction. Therefore, by standard VC dimension arguments ([SSBD14], Theorem 6.8), using O(log(1/δ)/ϵ2)O(\log(1/\delta)/\epsilon^{2}) samples and choosing tt to minimize the error on the samples, with probability at least 1δ1-\delta we will obtain a value tt such that

(x,b)𝒟[sign(γ(x)t)b](x,b)𝒟[sign(γ(x)t)b]+ϵ\underset{(x,b)\sim\mathcal{D}}{\mathbb{P}}\left[\operatorname{sign}(\gamma(x)-t)\neq b\right]\leq\underset{(x,b)\sim\mathcal{D}}{\mathbb{P}}\left[\operatorname{sign}(\gamma(x)-t^{*})\neq b\right]+\epsilon

where tt^{*} minimizes the latter quantity among all values in [1,1][-1,1]. Since the algorithm can restrict itself to those values t[1,1]t\in[-1,1] for which γ(x)=t\gamma(x)=t for some xx in the sample, the value minimizing the error on the sample can be computed in time polynomial in the number of samples. Next, we show that the minimizer tt^{*} satisfies the desired properties. Suppose that t[1,1]t\sim[-1,1] uniformly at random. For any y[1,1],b{±1}y\in[-1,1],b\in\{\pm 1\},

𝑡[sign(yt)b]={𝑡[t>y]=12|by| if b=1𝑡[ty]=12|yb| if b=1.\underset{t}{\mathbb{P}}\left[\operatorname{sign}(y-t)\neq b\right]=\begin{cases}\underset{t}{\mathbb{P}}\left[t>y\right]=\frac{1}{2}|b-y|&\text{ if }b=1\\ \underset{t}{\mathbb{P}}\left[t\leq y\right]=\frac{1}{2}|y-b|&\text{ if }b=-1\,.\end{cases}

Therefore

𝔼t[1,1][(x,b)𝒟[sign(γ(x)t)b]]=𝔼(x,b)𝒟[𝑡[sign(γ(x)t)b]]=12𝔼(x,b)𝒟[|γ(x)b|],\underset{t\sim[-1,1]}{\mathbb{E}}\left[\underset{(x,b)\sim\mathcal{D}}{\mathbb{P}}\left[\operatorname{sign}(\gamma(x)-t)\neq b\right]\right]=\underset{(x,b)\sim\mathcal{D}}{\mathbb{E}}\left[\underset{t}{\mathbb{P}}\left[\operatorname{sign}(\gamma(x)-t)\neq b\right]\right]=\frac{1}{2}\underset{(x,b)\sim\mathcal{D}}{\mathbb{E}}\left[|\gamma(x)-b|\right]\,,

so the minimizer tt^{*} satisfies

(x,b)𝒟[sign(γ(x)t)b]12𝔼(x,b)𝒟[|γ(x)b|].\underset{(x,b)\sim\mathcal{D}}{\mathbb{P}}\left[\operatorname{sign}(\gamma(x)-t^{*})\neq b\right]\leq\frac{1}{2}\underset{(x,b)\sim\mathcal{D}}{\mathbb{E}}\left[|\gamma(x)-b|\right]\,.\qed
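The threshold-selection step in this proof amounts to a one-dimensional empirical risk minimization. The following sketch implements it on a list of (γ(x), b) pairs, restricting to the candidate thresholds mentioned in the proof; the sample data are made up for illustration.

```python
# Sketch of the threshold-selection step in Lemma A.14: empirical risk
# minimization over one-dimensional thresholds, restricted to the observed
# gamma-values (plus -1). The data below are made up for illustration.
def sign(v):                            # sign(0) = +1
    return 1 if v >= 0 else -1

def round_threshold(pairs):
    """pairs: list of (gamma(x), b) with gamma(x) in [-1,1] and b in {+1,-1}."""
    candidates = sorted({g for g, _ in pairs} | {-1.0})
    def err(t):
        return sum(1 for g, b in pairs if sign(g - t) != b) / len(pairs)
    return min(candidates, key=err)

if __name__ == "__main__":
    data = [(0.9, 1), (0.7, 1), (0.1, -1), (-0.4, -1), (0.3, -1), (0.6, 1)]
    t = round_threshold(data)
    print(t, sum(1 for g, b in data if sign(g - t) != b))   # here a zero-error threshold exists
```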
Lemma A.15.

Let 𝖻𝗅𝗈𝖼𝗄¯:d×[0,1]d[r]d\overline{\mathsf{block}}:\mathbb{R}^{d}\times[0,1]^{d}\to[r]^{d} be an augmented rr-block partition. There is an algorithm which, given ϵ,δ>0\epsilon,\delta>0, query access to a function g:[r]d{±1}g:[r]^{d}\to\{\pm 1\} and sample access to a distribution 𝒟\mathcal{D} over d×{±1}\mathbb{R}^{d}\times\{\pm 1\}, outputs a function g:d{±1}g^{\prime}:\mathbb{R}^{d}\to\{\pm 1\} such that, with probability 1δ1-\delta,

(x,b)𝒟[g(x)b](x,b)𝒟,z[0,1]d[g(𝖻𝗅𝗈𝖼𝗄¯(x,z))b]+ϵ,\underset{(x,b)\sim\mathcal{D}}{\mathbb{P}}\left[g^{\prime}(x)\neq b\right]\leq\underset{(x,b)\sim\mathcal{D},z\sim[0,1]^{d}}{\mathbb{P}}\left[g(\overline{\mathsf{block}}(x,z))\neq b\right]+\epsilon\,,

uses at most O(dlog(r)ϵ2log1δ)O\left(\frac{d\log(r)}{\epsilon^{2}}\log\frac{1}{\delta}\right) samples and queries, and runs in time polynomial in the number of samples.

Proof.

For z[0,1]dz\in[0,1]^{d}, write gz(x)=g(𝖻𝗅𝗈𝖼𝗄¯(x,z))g_{z}(x)=g(\overline{\mathsf{block}}(x,z)). For any (x,b)(x,b),

|𝔼𝑧[gz(x)]b|=|b𝑧[gz(x)=b]b𝑧[gz(x)b]b|=|2b𝑧[gz(x)b]|=2𝑧[gz(x)b],|\underset{z}{\mathbb{E}}\left[g_{z}(x)\right]-b|=|b\underset{z}{\mathbb{P}}\left[g_{z}(x)=b\right]-b\underset{z}{\mathbb{P}}\left[g_{z}(x)\neq b\right]-b|=|-2b\underset{z}{\mathbb{P}}\left[g_{z}(x)\neq b\right]|=2\underset{z}{\mathbb{P}}\left[g_{z}(x)\neq b\right]\,,

so

12𝔼(x,b)𝒟[|𝔼𝑧[gz(x)]b|]=(x,b)𝒟,z[gz(x)b].\frac{1}{2}\underset{(x,b)\sim\mathcal{D}}{\mathbb{E}}\left[|\underset{z}{\mathbb{E}}\left[g_{z}(x)\right]-b|\right]=\underset{(x,b)\sim\mathcal{D},z}{\mathbb{P}}\left[g_{z}(x)\neq b\right]\,.

The algorithm will construct a function γ(x)𝔼𝑧[gz(x)]\gamma(x)\approx\underset{z}{\mathbb{E}}\left[g_{z}(x)\right] and then learn a suitable parameter tt for rounding γ(x)\gamma(x) to sign(γ(x)t)\operatorname{sign}(\gamma(x)-t).

First the algorithm samples a set Z[0,1]dZ\subset[0,1]^{d} of size m=2dln(r)ln(1/δ)ϵ2m=\frac{2d\ln(r)\ln(1/\delta)}{\epsilon^{2}} and constructs the function γ(x)=1mzZg(𝖻𝗅𝗈𝖼𝗄¯(x,z))\gamma(x)=\frac{1}{m}\sum_{z\in Z}g(\overline{\mathsf{block}}(x,z)). Fix Z[0,1]dZ\subset[0,1]^{d} and suppose xdx\in\mathbb{R}^{d} satisfies γ(x)𝔼𝑧[gz(x)]\gamma(x)\neq\underset{z}{\mathbb{E}}\left[g_{z}(x)\right]. Then there must be w,z[0,1]dw,z\in[0,1]^{d} such that 𝖻𝗅𝗈𝖼𝗄¯(x,z)𝖻𝗅𝗈𝖼𝗄¯(x,w)\overline{\mathsf{block}}(x,z)\neq\overline{\mathsf{block}}(x,w), otherwise gz(x)=gw(x)g_{z}(x)=g_{w}(x) for all z,wz,w so for all w,γ(x)=gw(x)=𝔼𝑧[gz(x)]w,\gamma(x)=g_{w}(x)=\underset{z}{\mathbb{E}}\left[g_{z}(x)\right]. There can be at most rdr^{d} values of xdx\in\mathbb{R}^{d} for which z,w[0,1]d:𝖻𝗅𝗈𝖼𝗄¯(x,z)𝖻𝗅𝗈𝖼𝗄¯(x,w)\exists z,w\in[0,1]^{d}:\overline{\mathsf{block}}(x,z)\neq\overline{\mathsf{block}}(x,w), so by the union bound and the Hoeffding bound,

𝑍[xd:|γ(x)𝔼𝑧[gz(x)]|>ϵ]\displaystyle\underset{Z}{\mathbb{P}}\left[\exists x\in\mathbb{R}^{d}:|\gamma(x)-\underset{z}{\mathbb{E}}\left[g_{z}(x)\right]|>\epsilon\right] rdmaxxX𝑍[|γ(x)𝔼𝑧[gz(x)]|>ϵ]\displaystyle\leq r^{d}\max_{x\in X}\underset{Z}{\mathbb{P}}\left[|\gamma(x)-\underset{z}{\mathbb{E}}\left[g_{z}(x)\right]|>\epsilon\right]
2rdexp(mϵ22)<δ.\displaystyle\leq 2r^{d}\mathrm{exp}\left(-\frac{m\epsilon^{2}}{2}\right)<\delta\,.

Therefore with probability at least 1δ/21-\delta/2, γ\gamma satisfies |γ(x)𝔼𝑧[gz(x)]|ϵ|\gamma(x)-\underset{z}{\mathbb{E}}\left[g_{z}(x)\right]|\leq\epsilon for all xx. Suppose this occurs. Then

12𝔼(x,b)𝒟[|γ(x)b|]\displaystyle\frac{1}{2}\underset{(x,b)\sim\mathcal{D}}{\mathbb{E}}\left[|\gamma(x)-b|\right] 12𝔼(x,b)𝒟[|𝔼𝑧[gz(x)]b|+|γ(x)𝔼𝑧[gz(x)]|]\displaystyle\leq\frac{1}{2}\underset{(x,b)\sim\mathcal{D}}{\mathbb{E}}\left[|\underset{z}{\mathbb{E}}\left[g_{z}(x)\right]-b|+|\gamma(x)-\underset{z}{\mathbb{E}}\left[g_{z}(x)\right]|\right]
(x,b)𝒟,z[gz(x)b]+ϵ2.\displaystyle\leq\underset{(x,b)\sim\mathcal{D},z}{\mathbb{P}}\left[g_{z}(x)\neq b\right]+\frac{\epsilon}{2}\,.

Now we apply Lemma A.14 with error ϵ/2\epsilon/2, using O(log(1/δ)/ϵ2)O(\log(1/\delta)/\epsilon^{2}) samples and polynomial time, to output a value tt such that with probability 1δ/21-\delta/2,

(x,b)𝒟[sign(γ(x)t)b]12𝔼(x,b)[|γ(x)b|]+ϵ2(x,b)𝒟,z[gz(x)b]+ϵ.\underset{(x,b)\sim\mathcal{D}}{\mathbb{P}}\left[\operatorname{sign}(\gamma(x)-t)\neq b\right]\leq\frac{1}{2}\underset{(x,b)}{\mathbb{E}}\left[|\gamma(x)-b|\right]+\frac{\epsilon}{2}\leq\underset{(x,b)\sim\mathcal{D},z}{\mathbb{P}}\left[g_{z}(x)\neq b\right]+\epsilon\,.\qed
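Combining the two steps of this proof gives a short procedure: estimate γ(x) by averaging g over sampled augmentations, then round with a threshold chosen as in Lemma A.14. The sketch below is illustrative; g, block, and the labeled sample are assumed to be supplied by the caller, and num_z is an arbitrary choice rather than the bound from the proof.

```python
# Sketch (illustration only) of Lemma A.15: estimate gamma(x) = E_z[g(block(x,z))]
# by averaging over sampled augmentations, then round with a threshold chosen on a
# labeled sample as in Lemma A.14. `g`, `block`, and `labeled_sample` are assumed
# to be supplied by the caller; num_z is an arbitrary choice.
import random

def build_g_prime(g, block, d, labeled_sample, num_z=200, rng=random):
    """g: function on [r]^d; block: augmented partition (x, z) -> [r]^d;
    labeled_sample: list of (x, b) pairs used only to pick the rounding threshold."""
    Z = [tuple(rng.random() for _ in range(d)) for _ in range(num_z)]
    def gamma(x):                        # empirical average over the sampled augmentations
        return sum(g(block(x, z)) for z in Z) / num_z
    vals = [(gamma(x), b) for x, b in labeled_sample]
    def err(t):                          # empirical error of x -> sign(gamma(x) - t)
        return sum(1 for v, b in vals if (1 if v - t >= 0 else -1) != b)
    t = min(sorted({v for v, _ in vals} | {-1.0}), key=err)
    return lambda x: 1 if gamma(x) - t >= 0 else -1
```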

A.4 Algorithms for Finite Distributions

We now state improved versions of our monotonicity tester and two general learning algorithms: the “brute force” learning algorithm (Lemma 4.5) and the “polynomial regression” algorithm (Lemma 5.2). Using these algorithms we obtain the same complexity bounds as for continuous product distributions, but the algorithms can now handle finite product distributions as well.

See 1.1.1

Proof.

The proof of Theorem 1.1.1 goes through as before, with 𝖻𝗅𝗈𝖼𝗄\mathsf{block} replaced by 𝖻𝗅𝗈𝖼𝗄¯\overline{\mathsf{block}}, 𝖻𝗅𝗈𝖼𝗄(v)\mathsf{block}^{-\downarrow}(v) replaced with 𝖻𝗅𝗈𝖼𝗄¯(v)\overline{\mathsf{block}}^{-\downarrow}(v), defined as the infimal element x¯\overline{x} such that 𝖻𝗅𝗈𝖼𝗄¯(x¯)=v\overline{\mathsf{block}}(\overline{x})=v, 𝖻𝗅𝗈𝖼𝗄(v)\mathsf{block}^{-\uparrow}(v) replaced with 𝖻𝗅𝗈𝖼𝗄¯(v)\overline{\mathsf{block}}^{-\uparrow}(v), defined as the supremal element x¯\overline{x} such that 𝖻𝗅𝗈𝖼𝗄¯(x¯)=v\overline{\mathsf{block}}(\overline{x})=v, and gg redefined appropriately. ∎

Next, we move on to the learning algorithms:

Lemma A.16.

Let \mathcal{H} be any set of functions d{±1}\mathbb{R}^{d}\to\{\pm 1\}, let ϵ>0\epsilon>0, and suppose rr satisfies rd𝖻𝖻𝗌¯(,r)ϵ/3r^{-d}\cdot\overline{\mathsf{bbs}}(\mathcal{H},r)\leq\epsilon/3. Then there is an agnostic learning algorithm for \mathcal{H} that uses O(rd+rd2log(rd/ϵ)ϵ2)O\left(\frac{r^{d}+rd^{2}\log(rd/\epsilon)}{\epsilon^{2}}\right) samples and time and works for any distribution 𝒟\mathcal{D} over d×{±1}\mathbb{R}^{d}\times\{\pm 1\} whose marginal on d\mathbb{R}^{d} is a finite or continuous product distribution.

Proof.

On input distribution 𝒟\mathcal{D}:

  1. 1.

    Sample a grid X¯\overline{X} of size m=O(rd2ϵ2log(rd/ϵ))m=O(\frac{rd^{2}}{\epsilon^{2}}\log(rd/\epsilon)) large enough that Lemma A.6 guarantees 𝖻𝗅𝗈𝖼𝗄¯(μ)𝗎𝗇𝗂𝖿([r]d)𝖳𝖵<ϵ/3\|\overline{\mathsf{block}}(\mu)-\mathsf{unif}([r]^{d})\|_{\mathsf{TV}}<\epsilon/3 with probability 5/65/6, where 𝖻𝗅𝗈𝖼𝗄¯:d×[0,1]d[r]d\overline{\mathsf{block}}:\mathbb{R}^{d}\times[0,1]^{d}\to[r]^{d} is the induced augmented rr-block partition.

  2. 2.

    Agnostically learn a function g:[r]d{±1}g:[r]^{d}\to\{\pm 1\} with error ϵ/3\epsilon/3 and success probability 5/65/6 using O(rd/ϵ2)O(r^{d}/\epsilon^{2}) samples from 𝒟𝖻𝗅𝗈𝖼𝗄\mathcal{D}^{\mathsf{block}}.

  3. 3.

    Run the algorithm of Lemma A.14 using O(dlogrϵ2)O\left(\frac{d\log r}{\epsilon^{2}}\right) samples to obtain gg^{\prime} and output gg^{\prime}.

The proof proceeds as in the case for continuous distributions (Lemma 4.5). Assume all steps succeed, which occurs with probability at least 2/32/3. After step 2 we obtain g:[r]d{±1}g:[r]^{d}\to\{\pm 1\} such that, for any hh\in\mathcal{H},

(v,b)𝒟𝖻𝗅𝗈𝖼𝗄[g(v)b](v,b)𝒟𝖻𝗅𝗈𝖼𝗄[h𝖻𝗅𝗈𝖼𝗄(v)b]+ϵ/3.\underset{(v,b)\sim\mathcal{D}^{\mathsf{block}}}{\mathbb{P}}\left[g(v)\neq b\right]\leq\underset{(v,b)\sim\mathcal{D}^{\mathsf{block}}}{\mathbb{P}}\left[h^{\mathsf{block}}(v)\neq b\right]+\epsilon/3\,.

By Lemma A.14 and Proposition A.7, the output satisfies,

(x,b)𝒟[g(x)b]\displaystyle\underset{(x,b)\sim\mathcal{D}}{\mathbb{P}}\left[g^{\prime}(x)\neq b\right] (x,b)𝒟,z[g(𝖻𝗅𝗈𝖼𝗄¯(x,z))b]+ϵ/3\displaystyle\leq\underset{(x,b)\sim\mathcal{D},z}{\mathbb{P}}\left[g(\overline{\mathsf{block}}(x,z))\neq b\right]+\epsilon/3
(x,b)𝒟,z[h𝖻𝗅𝗈𝖼𝗄(𝖻𝗅𝗈𝖼𝗄¯(x,z))b]+2ϵ/3\displaystyle\leq\underset{(x,b)\sim\mathcal{D},z}{\mathbb{P}}\left[h^{\mathsf{block}}(\overline{\mathsf{block}}(x,z))\neq b\right]+2\epsilon/3
(x,b)𝒟[h(x)b]+x,z[h(x)hz𝖼𝗈𝖺𝗋𝗌𝖾(x)]+2ϵ/3\displaystyle\leq\underset{(x,b)\sim\mathcal{D}}{\mathbb{P}}\left[h(x)\neq b\right]+\underset{x,z}{\mathbb{P}}\left[h(x)\neq h^{\mathsf{coarse}}_{z}(x)\right]+2\epsilon/3
(x,b)𝒟[h(x)b]+ϵ.\displaystyle\leq\underset{(x,b)\sim\mathcal{D}}{\mathbb{P}}\left[h(x)\neq b\right]+\epsilon\,.\qed
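As an illustration of the three steps above, here is a minimal Python sketch of the brute-force learner for a continuous product distribution. It is schematic only: all names are hypothetical, it reuses one labelled sample for both the grid and the learner, and it omits the random augmentation needed for finite distributions as well as the separate rounding step of Lemma A.14.

```python
import numpy as np

def brute_force_learner(xs, bs, r):
    """Schematic version of the learner in Lemma A.16 (continuous case).

    xs : (m, d) array of sample points, bs : (m,) labels in {-1, +1},
    r  : number of blocks per coordinate.
    Returns a hypothesis h : R^d -> {-1, +1}."""
    m, d = xs.shape

    # Step 1: empirical per-coordinate quantiles define the r-block partition.
    cuts = np.quantile(xs, np.linspace(0, 1, r + 1)[1:-1], axis=0)  # (r-1, d)

    def block(x):
        # cell of the grid [r]^d containing x
        return tuple(int(np.searchsorted(cuts[:, i], x[i])) for i in range(d))

    # Step 2: agnostically learn g on [r]^d -- here, a majority vote per cell.
    votes = {}
    for x, b in zip(xs, bs):
        v = block(x)
        votes[v] = votes.get(v, 0) + b
    g = {v: (1 if s >= 0 else -1) for v, s in votes.items()}

    # Step 3: the output hypothesis is g composed with the block map
    # (cells never seen in the sample get an arbitrary default label).
    return lambda x: g.get(block(x), 1)
```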

We now state the general learning algorithm from Lemma 5.2, improved to allow finite product distributions.

Lemma A.17.

Let ϵ>0\epsilon>0 and let \mathcal{H} be a set of measurable functions f:d{±1}f:\mathbb{R}^{d}\to\{\pm 1\} that satisfy:

  1. 1.

    There is some r=r(d,ϵ)r=r(d,\epsilon) such that 𝖻𝖻𝗌¯(,r)ϵrd\overline{\mathsf{bbs}}(\mathcal{H},r)\leq\epsilon\cdot r^{d};

  2. 2.

    There is a set \mathcal{F} of functions [r]d[r]^{d}\to\mathbb{R} satisfying: f,g𝗌𝗉𝖺𝗇()\forall f\in\mathcal{H},\exists g\in\mathsf{span}(\mathcal{F}) such that for v[r]d,𝔼[(f𝖻𝗅𝗈𝖼𝗄(v)g(v))2]ϵ2v\sim[r]^{d},\mathbb{E}\left[(f^{\mathsf{block}}(v)-g(v))^{2}\right]\leq\epsilon^{2}.

Let n=poly(||,1/ϵ)n=\operatorname{poly}(|\mathcal{F}|,1/\epsilon) be the sample complexity of the algorithm in Theorem 5.1. Then there is an agnostic learning algorithm for \mathcal{H} on finite and continuous product distributions over d\mathbb{R}^{d}, that uses O(max(n2,1/ϵ2)rd2log(dr))O(\max(n^{2},1/\epsilon^{2})\cdot rd^{2}\log(dr)) samples and runs in time polynomial in the sample size.

Proof.

Let 𝒟¯\overline{\mathcal{D}} be the augmented distribution, where (x¯,b)𝒟¯(\overline{x},b)\sim\overline{\mathcal{D}} is obtained by drawing (x,b)𝒟(x,b)\sim\mathcal{D} and augmenting xx with a uniformly random z[0,1]dz\in[0,1]^{d}, so that x¯=(x,z)\overline{x}=(x,z). We will assume n>1/ϵn>1/\epsilon. Let μ\mu be the marginal of 𝒟\mathcal{D} on d\mathbb{R}^{d}. For an augmented rr-block partition, let 𝒟𝖻𝗅𝗈𝖼𝗄\mathcal{D}^{\mathsf{block}} be the distribution of (𝖻𝗅𝗈𝖼𝗄¯(x¯),b)(\overline{\mathsf{block}}(\overline{x}),b) when (x¯,b)𝒟¯(\overline{x},b)\sim\overline{\mathcal{D}}. We may simulate samples from 𝒟𝖻𝗅𝗈𝖼𝗄\mathcal{D}^{\mathsf{block}} by sampling (x,b)(x,b) from 𝒟\mathcal{D}, drawing zz, and constructing (𝖻𝗅𝗈𝖼𝗄¯(x¯),b)(\overline{\mathsf{block}}(\overline{x}),b). The algorithm is as follows:

  1. 1.

    Sample a grid XX of size m=O(rd2n2log(rd))m=O(rd^{2}n^{2}\log(rd)); by Lemma A.6, this ensures that 𝖻𝗅𝗈𝖼𝗄¯(μ)𝗎𝗇𝗂𝖿([r]d)𝖳𝖵<1/12n\|\overline{\mathsf{block}}(\mu)-\mathsf{unif}([r]^{d})\|_{\mathsf{TV}}<1/12n with probability 5/65/6. Construct the induced augmented rr-block partition 𝖻𝗅𝗈𝖼𝗄¯:d×[0,1]d[r]d\overline{\mathsf{block}}:\mathbb{R}^{d}\times[0,1]^{d}\to[r]^{d}.

  2. 2.

    Run the algorithm of Theorem 5.1 on a sample of nn points from 𝒟𝖻𝗅𝗈𝖼𝗄\mathcal{D}^{\mathsf{block}}; that algorithm returns a function g:[r]d{±1}g:[r]^{d}\to\{\pm 1\}.

  3. 3.

    Run the algorithm of Lemma A.14 using O(dlogrϵ2)O\left(\frac{d\log r}{\epsilon^{2}}\right) samples to obtain gg^{\prime} and output gg^{\prime}.

The proof proceeds as in the case for continuous distributions (Lemma 5.2). Assume all steps succeed, which occurs with probability at least 2/32/3. After step 2 we obtain g:[r]d{±1}g:[r]^{d}\to\{\pm 1\} such that, for any hh\in\mathcal{H},

(v,b)𝒟𝖻𝗅𝗈𝖼𝗄[g(v)b](v,b)𝒟𝖻𝗅𝗈𝖼𝗄[h𝖻𝗅𝗈𝖼𝗄(v)b]+ϵ/3.\underset{(v,b)\sim\mathcal{D}^{\mathsf{block}}}{\mathbb{P}}\left[g(v)\neq b\right]\leq\underset{(v,b)\sim\mathcal{D}^{\mathsf{block}}}{\mathbb{P}}\left[h^{\mathsf{block}}(v)\neq b\right]+\epsilon/3\,.

By Lemma A.14 and Proposition A.7, the output satisfies,

(x,b)𝒟[g(x)b]\displaystyle\underset{(x,b)\sim\mathcal{D}}{\mathbb{P}}\left[g^{\prime}(x)\neq b\right] (x,b)𝒟,z[g(𝖻𝗅𝗈𝖼𝗄¯(x,z))b]+ϵ/3\displaystyle\leq\underset{(x,b)\sim\mathcal{D},z}{\mathbb{P}}\left[g(\overline{\mathsf{block}}(x,z))\neq b\right]+\epsilon/3
(x,b)𝒟,z[h𝖻𝗅𝗈𝖼𝗄(𝖻𝗅𝗈𝖼𝗄¯(x,z))b]+2ϵ/3\displaystyle\leq\underset{(x,b)\sim\mathcal{D},z}{\mathbb{P}}\left[h^{\mathsf{block}}(\overline{\mathsf{block}}(x,z))\neq b\right]+2\epsilon/3
(x,b)𝒟[h(x)b]+x,z[h(x)hz𝖼𝗈𝖺𝗋𝗌𝖾(x)]+2ϵ/3\displaystyle\leq\underset{(x,b)\sim\mathcal{D}}{\mathbb{P}}\left[h(x)\neq b\right]+\underset{x,z}{\mathbb{P}}\left[h(x)\neq h^{\mathsf{coarse}}_{z}(x)\right]+2\epsilon/3
(x,b)𝒟[h(x)b]+ϵ.\displaystyle\leq\underset{(x,b)\sim\mathcal{D}}{\mathbb{P}}\left[h(x)\neq b\right]+\epsilon\,.\qed
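The corresponding sketch for the regression-based learner replaces the per-cell majority vote with a regression of the labels against a feature set on [r]d[r]^{d}. The actual algorithm invokes the polynomial regression of Theorem 5.1 followed by the rounding step of Lemma A.14; the least-squares version below, with hypothetical names and a brute-force monomial basis that is only manageable for small dd and degree, is meant only to show where the downsampled grid enters.

```python
import numpy as np
from itertools import product

def regression_learner(xs, bs, r, degree):
    """Schematic version of the learner in Lemma A.17: downsample to [r]^d,
    regress the labels on low-degree monomials of the (rescaled) block
    indices, and take the sign of the fitted function."""
    m, d = xs.shape
    cuts = np.quantile(xs, np.linspace(0, 1, r + 1)[1:-1], axis=0)

    def block(x):  # cell of x in [r]^d, rescaled to [0, 1]^d
        return np.array([np.searchsorted(cuts[:, i], x[i]) for i in range(d)]) / r

    def features(v):  # monomials of total degree <= degree in v_1, ..., v_d
        return np.array([np.prod(v ** np.array(e))
                         for e in product(range(degree + 1), repeat=d)
                         if sum(e) <= degree])

    A = np.array([features(block(x)) for x in xs])
    coef, *_ = np.linalg.lstsq(A, np.asarray(bs, dtype=float), rcond=None)
    return lambda x: 1 if features(block(x)) @ coef >= 0 else -1
```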
Theorem A.18.

There are agnostic learning algorithms for functions of convex sets, functions of halfspaces, degree-kk PTFs, and kk-alternating functions achieving the sample and time complexity bounds in Theorems 1.1.4, 1.1.2, 1.1.3, and 1.1.5, that work for any finite or continuous product distribution over d\mathbb{R}^{d}.

Proof.

This follows from the same arguments as for each of those theorems, except with the bounds from Proposition A.8 and Lemmas A.10, A.11, A.12, and A.9 to bound 𝖻𝖻𝗌¯\overline{\mathsf{bbs}}; Lemma A.13 to bound the noise sensitivity; and the improved general algorithms of Lemmas A.16 and A.17. ∎

Appendix B Glossary

The notation f(n)=O~(g(n))f(n)=\widetilde{O}(g(n)) means that for some constant cc, f(n)=O(g(n)logc(g(n)))f(n)=O(g(n)\log^{c}(g(n))).

The natural ordering on the set [n]d[n]^{d} or d\mathbb{R}^{d} is the partial order where for x,ydx,y\in\mathbb{R}^{d}, x<yx<y iff i[d],xiyi\forall i\in[d],x_{i}\leq y_{i}, and xyx\neq y.

Property Testing.

For a set 𝒫\mathcal{P} of distributions over XX and a set \mathcal{H} of functions X{±1}X\to\{\pm 1\}, a distribution-free property testing algorithm for \mathcal{H} under 𝒫\mathcal{P} is a randomized algorithm that is given a parameter ϵ>0\epsilon>0. It has access to the input probability distribution 𝒟𝒫\mathcal{D}\in\mathcal{P} via a sample oracle, which returns an independent sample from 𝒟\mathcal{D}. It has access to the input function f:X{±1}f:X\to\{\pm 1\} via a query oracle, which given query xXx\in X returns the value f(x)f(x). A two-sided distribution-free testing algorithm must satisfy:

  1. 1.

    If ff\in\mathcal{H} then the algorithm accepts with probability at least 2/32/3;

  2. 2.

    If ff is ϵ\epsilon-far from \mathcal{H} with respect to 𝒟\mathcal{D} then the algorithm rejects with probability at least 2/32/3.

A one-sided algorithm must accept with probability 1 when ff\in\mathcal{H}. An (ϵ1,ϵ2)(\epsilon_{1},\epsilon_{2})-tolerant tester must accept with probability at least 2/32/3 when h\exists h\in\mathcal{H} such that x𝒟[f(x)h(x)]ϵ1\underset{x\sim\mathcal{D}}{\mathbb{P}}\left[f(x)\neq h(x)\right]\leq\epsilon_{1}, and must reject with probability at least 2/32/3 when ff is ϵ2\epsilon_{2}-far from \mathcal{H} with respect to 𝒟\mathcal{D}.

In the query model, the queries to the query oracle can be arbitrary. In the sample model, the tester queries a point xXx\in X if and only if xx was obtained from the sample oracle.

A tester in the query model is adaptive if it makes its choice of query based on the answers to previous queries. It is non-adaptive if it chooses its full set of queries in advance, before obtaining any of the answers.

The sample complexity of an algorithm is the number of samples requested from the sample oracle. The query complexity of an algorithm is the number of queries made to the query oracle.

Learning.

Let \mathcal{H} be a set of functions X{±1}X\to\{\pm 1\} and let 𝒫\mathcal{P} be a set of probability distributions over XX. A learning algorithm for \mathcal{H} under 𝒫\mathcal{P} (in the non-agnostic, or realizable, model) is a randomized algorithm that receives a parameter ϵ>0\epsilon>0 and has sample access to an input function ff\in\mathcal{H}. Sample access means that the algorithm may request an independent random example (x,f(x))(x,f(x)) where xx is sampled from some input distribution 𝒟𝒫\mathcal{D}\in\mathcal{P}. The algorithm is required to output a function g:X{±1}g:X\to\{\pm 1\} that, with probability 2/32/3, satisfies the condition

x𝒟[f(x)g(x)]ϵ.\underset{x\sim\mathcal{D}}{\mathbb{P}}\left[f(x)\neq g(x)\right]\leq\epsilon\,.

In the agnostic setting, the algorithm instead has sample access to an input distribution 𝒟\mathcal{D} over X×{±1}X\times\{\pm 1\} whose marginal over XX is in 𝒫\mathcal{P} (i.e. it receives samples of the form (x,b)X×{±1}(x,b)\in X\times\{\pm 1\}). The algorithm is required to output a function g:X{±1}g:X\to\{\pm 1\} that, with probability 2/32/3, satisfies the following condition: h\forall h\in\mathcal{H},

(x,b)𝒟[g(x)b](x,b)𝒟[h(x)b]+ϵ.\underset{(x,b)\sim\mathcal{D}}{\mathbb{P}}\left[g(x)\neq b\right]\leq\underset{(x,b)\sim\mathcal{D}}{\mathbb{P}}\left[h(x)\neq b\right]+\epsilon\,.

A proper learning algorithm is one whose output must also satisfy gg\in\mathcal{H}; otherwise it is improper.

VC Dimension.

For a set \mathcal{H} of functions X{±1}X\to\{\pm 1\}, a set SXS\subseteq X is shattered by \mathcal{H} if for all functions f:S{±1}f:S\to\{\pm 1\} there is a function hh\in\mathcal{H} such that xS,h(x)=f(x)\forall x\in S,h(x)=f(x). The VC dimension 𝖵𝖢()\mathsf{VC}(\mathcal{H}) of \mathcal{H} is the size of the largest set SXS\subseteq X that is shattered by \mathcal{H}.
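As a standard example, let \mathcal{H} be the class of threshold functions h_{t}:\mathbb{R}\to\{\pm 1\} with h_{t}(x)=+1 if and only if x\geq t. Any single point is shattered by \mathcal{H}, but no pair a<b is, since the labelling f(a)=+1, f(b)=-1 is not realized by any h_{t}; hence \mathsf{VC}(\mathcal{H})=1.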

References

  • [AC06] Nir Ailon and Bernard Chazelle. Information theory in property testing and monotonicity testing in higher dimension. Information and Computation, 204(11):1704–1717, 2006.
  • [BB20] Eric Blais and Abhinav Bommireddi. On testing and robust characterizations of convexity. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2020). Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2020.
  • [BBBY12] Maria-Florina Balcan, Eric Blais, Avrim Blum, and Liu Yang. Active property testing. In Proceedings of the IEEE 53rd Annual Symposium on Foundations of Computer Science (FOCS), pages 21–30, 2012.
  • [BCO+15] Eric Blais, Clément Canonne, Igor Oliveira, Rocco Servedio, and Li-Yang Tan. Learning circuits with few negations. Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, page 512, 2015.
  • [BCS18] Hadley Black, Deeparnab Chakrabarty, and C Seshadhri. A o(d)polylogno(d)\cdot\operatorname{poly}\log n monotonicity tester for boolean functions over the hypergrid [n]d[n]^{d}. In Proceedings of the 29th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2133–2151, 2018.
  • [BCS20] Hadley Black, Deeparnab Chakrabarty, and C Seshadhri. Domain reduction for monotonicity testing: A o(d)o(d) tester for boolean functions in dd-dimensions. In Proceedings of the 31st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1975–1994, 2020.
  • [BEHW89] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K Warmuth. Learnability and the vapnik-chervonenkis dimension. Journal of the ACM (JACM), 36(4):929–965, 1989.
  • [BFPJH21] Eric Blais, Renato Ferreira Pinto Jr, and Nathaniel Harms. VC dimension and distribution-free sample-based testing. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 504–517, 2021.
  • [BMR16] Piotr Berman, Meiram Murzabulatov, and Sofya Raskhodnikova. Tolerant testers of image properties. In 43rd International Colloquium on Automata, Languages, and Programming (ICALP 2016). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2016.
  • [BMR19] Piotr Berman, Meiram Murzabulatov, and Sofya Raskhodnikova. Testing convexity of figures under the uniform distribution. Random Structures & Algorithms, 54(3):413–443, 2019.
  • [BOW10] Eric Blais, Ryan O’Donnell, and Karl Wimmer. Polynomial regression under arbitrary product distributions. Machine Learning, 80(2-3):273–294, 2010.
  • [BR92] Avrim L Blum and Ronald L Rivest. Training a 3-node neural network is NP-complete. Neural Networks, 5(1):117–127, 1992.
  • [BRY14] Eric Blais, Sofya Raskhodnikova, and Grigory Yaroslavtsev. Lower bounds for testing properties of functions over hypergrid domains. In Proceedings of the IEEE 29th Conference on Computational Complexity (CCC), pages 309–320, 2014.
  • [BY19] Eric Blais and Yuichi Yoshida. A characterization of constant-sample testable properties. Random Structures & Algorithms, 55(1):73–88, 2019.
  • [CDJS17] Deeparnab Chakrabarty, Kashyap Dixit, Madhav Jha, and C Seshadhri. Property testing on product distributions: Optimal testers for bounded derivative properties. ACM Transactions on Algorithms (TALG), 13(2):1–30, 2017.
  • [CDS20] Xue Chen, Anindya De, and Rocco A Servedio. Testing noisy linear functions for sparsity. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 610–623, 2020.
  • [CFSS17] Xi Chen, Adam Freilich, Rocco A Servedio, and Timothy Sun. Sample-based high-dimensional convexity testing. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2017). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2017.
  • [CGG+19] Clément L. Canonne, Elena Grigorescu, Siyao Guo, Akash Kumar, and Karl Wimmer. Testing kk-monotonicity: The rise and fall of boolean functions. Theory of Computing, 15(1):1–55, 2019.
  • [CMK19] Mónika Csikós, Nabil H Mustafa, and Andrey Kupavskii. Tight lower bounds on the VC-dimension of geometric set systems. Journal of Machine Learning Research, 20(81):1–8, 2019.
  • [CS16] Deeparnab Chakrabarty and Comandur Seshadhri. An o(n)o(n) monotonicity tester for boolean functions over the hypercube. SIAM Journal on Computing, 45(2):461–472, 2016.
  • [DHK+10] Ilias Diakonikolas, Prahladh Harsha, Adam Klivans, Raghu Meka, Prasad Raghavendra, Rocco A Servedio, and Li-Yang Tan. Bounding the average sensitivity and noise sensitivity of polynomial threshold functions. In Proceedings of the 42nd ACM Symposium on Theory of Computing (STOC), pages 533–542, 2010.
  • [DKS18] Ilias Diakonikolas, Daniel M. Kane, and Alistair Stewart. Learning geometric concepts with nasty noise. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 1061–1073, 2018.
  • [DMN19] Anindya De, Elchanan Mossel, and Joe Neeman. Is your function low-dimensional? In Conference on Learning Theory, pages 979–993, 2019.
  • [FR10] Shahar Fattal and Dana Ron. Approximating the distance to monotonicity in high dimensions. ACM Transactions on Algorithms (TALG), 6(3):1–37, 2010.
  • [FY20] Noah Fleming and Yuichi Yoshida. Distribution-free testing of linear functions on n\mathbb{R}^{n}. In Proceedings of the 11th Innovations in Theoretical Computer Science Conference (ITCS). Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2020.
  • [GR16] Oded Goldreich and Dana Ron. On sample-based testers. ACM Transactions on Computation Theory (TOCT), 8(2):1–54, 2016.
  • [Har19] Nathaniel Harms. Testing halfspaces over rotation-invariant distributions. In Proceedings of the 30th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 694–713, 2019.
  • [HK07] Shirley Halevy and Eyal Kushilevitz. Distribution-free property-testing. SIAM Journal on Computing, 37(4):1107–1138, 2007.
  • [KKMS08] Adam T. Kalai, Adam R. Klivans, Yishay Mansour, and Rocco A Servedio. Agnostically learning halfspaces. SIAM Journal on Computing, 37(6):1777–1805, 2008.
  • [KMS18] Subhash Khot, Dor Minzer, and Muli Safra. On monotonicity testing and boolean isoperimetric-type theorems. SIAM Journal on Computing, 47(6):2238–2276, 2018.
  • [KOS04] Adam R Klivans, Ryan O’Donnell, and Rocco A Servedio. Learning intersections and thresholds of halfspaces. Journal of Computer and System Sciences, 68(4):808–840, 2004.
  • [KOS08] Adam R Klivans, Ryan O’Donnell, and Rocco A Servedio. Learning geometric concepts via gaussian surface area. In Proceedings of the 49th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 541–550, 2008.
  • [KR00] Michael Kearns and Dana Ron. Testing problems with sublearning sample complexity. Journal of Computer and System Sciences, 61(3):428–456, 2000.
  • [O’D14] Ryan O’Donnell. Analysis of Boolean Functions. Cambridge University Press, 2014.
  • [Ras03] Sofya Raskhodnikova. Approximate testing of visual properties. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 370–381. Springer, 2003.
  • [RR21] Dana Ron and Asaf Rosin. Optimal distribution-free sample-based testing of subsequence-freeness. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 337–256. SIAM, 2021.
  • [RV04] Luis Rademacher and Santosh Vempala. Testing geometric convexity. In International Conference on Foundations of Software Technology and Theoretical Computer Science, pages 469–480. Springer, 2004.
  • [SSBD14] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.
  • [Vem10a] Santosh Vempala. Learning convex concepts from gaussian distributions with PCA. In Proceedings of the IEEE 51st Annual Symposium on Foundations of Computer Science (FOCS), pages 124–130, 2010.
  • [Vem10b] Santosh Vempala. A random-sampling-based algorithm for learning intersections of halfspaces. Journal of the ACM, 57(6):1–14, 2010.
  • [War68] Hugh E. Warren. Lower bounds for approximation by nonlinear manifolds. Transactions of the American Mathematical Society, 133(1):167–178, 1968.