

Downsampling for Testing and Learning in Product Distributions

Nathaniel Harms
University of Waterloo, Canada
[email protected]
Partly supported by NSERC. Much of this research was done while the author was visiting the National Institute of Informatics, Japan.
   Yuichi Yoshida
National Institute of Informatics, Japan
[email protected]
Supported by JSPS KAKENHI Grant Number JP17H04676 and 18H05291.
Abstract

We study distribution-free property testing and learning problems where the unknown probability distribution is a product distribution over \mathbb{R}^{d}. For many important classes of functions, such as intersections of halfspaces, polynomial threshold functions, convex sets, and k-alternating functions, the known algorithms either have complexity that depends on the support size of the distribution, or are proven to work only for specific examples of product distributions. We introduce a general method, which we call downsampling, that resolves these issues. Downsampling uses a notion of "rectilinear isoperimetry" for product distributions, which further strengthens the connection between isoperimetry, testing, and learning. Using this technique, we attain new efficient distribution-free algorithms under product distributions on \mathbb{R}^{d}:

  1. A simpler proof for non-adaptive, one-sided monotonicity testing of functions [n]^{d}\to\{0,1\}, and improved sample complexity for testing monotonicity over unknown product distributions, from O(d^{7}) [Black, Chakrabarty, & Seshadhri, SODA 2020] to \widetilde{O}(d^{3}).

  2. Polynomial-time agnostic learning algorithms for functions of a constant number of halfspaces, and for constant-degree polynomial threshold functions.

  3. An \mathrm{exp}\left(O(d\log(dk))\right)-time agnostic learning algorithm, and an \mathrm{exp}\left(O(d\log(dk))\right)-sample tolerant tester, for functions of k convex sets; and a 2^{\widetilde{O}(d)}-sample one-sided tester for convex sets.

  4. An \mathrm{exp}\left(\widetilde{O}(k\sqrt{d})\right)-time agnostic learning algorithm for k-alternating functions, and a sample-based tolerant tester with the same complexity.

1 Introduction

In property testing and learning, the goal is to design algorithms that use as little information as possible about the input while still being correct (with high probability). This includes using as little information as possible about the probability distribution against which correctness is measured. Information about the probability distribution could be in the form of guarantees on this distribution (e.g. it is guaranteed to be uniform, or Gaussian), or in the form of samples from the distribution. So we want to minimize the requirements on this distribution, as well as the number of samples used by the algorithm.

Progress on high-dimensional property testing and learning problems is usually made by studying algorithms for the uniform distribution over the hypercube \{\pm 1\}^{d}, or the standard Gaussian distribution over \mathbb{R}^{d}, as the simplest case. For example, efficiently learning intersections of halfspaces is a major open problem in learning theory [DKS18, KOS04], and progress on this problem has been made by studying the uniform distribution over the hypercube \{\pm 1\}^{d} and the Gaussian as special cases [KKMS08, KOS08, Vem10a]. Another important example is the class of degree-k polynomial threshold functions (PTFs). Unlike intersections of halfspaces, these can be efficiently learned in the PAC model [KOS04], but agnostic learning is more challenging. (See the Glossary in Appendix B for standard definitions in learning theory and property testing.) Again, progress has been made by studying the hypercube [DHK+10]. An even more extreme example is the class of convex sets, which are not learnable in the distribution-free PAC model, because they have infinite VC dimension, but which become learnable under the Gaussian [KOS08]. The uniform distribution over the hypercube and the Gaussian are both examples of product distributions, so the next natural question to ask is: can these results be generalized to any unknown product distribution? A partial answer was given by Blais, O'Donnell, & Wimmer [BOW10] for some of these classes; in this paper we resolve this question.

Similar examples appear in the property testing literature. Distribution-free property testing and testing functions with domain \mathbb{R}^{d} are emerging trends in the field (e.g. [BBBY12, DMN19, Har19, CDS20, FY20, BFPJH21]). Testing monotonicity is one of the most well-studied problems in property testing, and recent work [BCS20] has extended this study to product distributions over domain \mathbb{R}^{d}. Work of Chakrabarty & Seshadhri [CS16], Khot, Minzer, & Safra [KMS18], and Black, Chakrabarty, & Seshadhri [BCS18, BCS20] has resulted in efficient o(d)-query algorithms for the hypercube \{\pm 1\}^{d} [KMS18] and the hypergrid [n]^{d}. Black, Chakrabarty, & Seshadhri [BCS20] showed that testing monotonicity over unknown product distributions on \mathbb{R}^{d} can be done with \widetilde{O}(d^{5/6}) queries and O(d^{7}) samples. Their "domain reduction" method is intricate and specialized for the problem of testing monotonicity. We improve the sample complexity to \widetilde{O}(d^{3}) using a much simpler proof. (An early version of this paper proved a weaker result, with two-sided error and worse sample complexity.) We also generalize the testers of [CFSS17, CGG+19] for convex sets and k-alternating functions, respectively, and provide new testers for arbitrary functions of convex sets.

This paper provides a general framework for designing distribution-free testing and learning algorithms under product distributions on \mathbb{R}^{d}, which may be finite or continuous. An algorithm is distribution-free under product distributions if it does not require any prior knowledge of the probability distribution, except the guarantee that it is a product distribution. The technique in this paper, which we call downsampling, improves upon previous methods (in particular, [BCS20, BOW10]) in a few ways. It is more general: it does not apply only to a specific type of algorithm [BOW10] or a specific problem [BCS20], and we use it to obtain many other results. It is conceptually simpler. And it allows quantitative improvements over both [BOW10] and [BCS20].

Organization.

In Section 1.1, we describe the main results of this paper in the context of related work. In Section 1.2, we briefly describe the main techniques of the paper. Section 2 presents the definitions and lemmas required by the main results. The following sections present the proofs of the results. For simplicity, the main body of the paper handles only continuous product distributions; finite distributions are handled in Appendix A. Definitions of standard terminology in property testing and machine learning can be found in the Glossary in Appendix B.

1.1 Results

See Table 1 for a summary of our results on property testing, and Table 2 for a summary of our results on learning.

1.1.1 Testing Monotonicity

Testing monotonicity is the problem of testing whether an unknown function f:X\to\{0,1\} is monotone, where X is a partial order. It is one of the most commonly studied problems in the field of property testing. Previous work on this problem has mostly focused on uniform probability distributions (exceptions include [AC06, HK07, CDJS17, BFPJH21]) and finite domains. However, there is growing interest in property testing for functions on domain \mathbb{R}^{d} ([BBBY12, DMN19, Har19, CDS20, FY20, BFPJH21]), and [BCS20] generalized the problem to this domain.

Testing monotonicity under product distributions has been studied a few times. Ailon & Chazelle [AC06] gave a distribution-free monotonicity tester for real-valued functions under product distributions on [n]^{d}, with query complexity O(\tfrac{1}{\epsilon}d2^{d}\log n). Chakrabarty et al. [CDJS17] improved this to O(\tfrac{1}{\epsilon}d\log n) and gave a matching lower bound. This lower bound applies to the real-valued case. For the boolean-valued case, monotonicity testers under the uniform distribution on \{\pm 1\}^{d} [CS16, KMS18] and [n]^{d} [BCS18, BCS20] are known with query complexity o(d). In [BCS20], an o(d)-query tester was given for domain \mathbb{R}^{d}. That paper showed that there is a one-sided, non-adaptive, distribution-free monotonicity tester under product distributions on \mathbb{R}^{d}, with query complexity O\left(\frac{d^{5/6}}{\epsilon^{4/3}}\operatorname{poly}\log(d/\epsilon)\right) and sample complexity O((d/\epsilon)^{7}). In this paper we improve the sample complexity to \widetilde{O}((d/\epsilon)^{3}), while greatly simplifying the proof.

Theorem.

There is a one-sided, non-adaptive \epsilon-tester for monotonicity of functions \mathbb{R}^{d}\to\{0,1\} that is distribution-free under (finite or continuous) product distributions, using

O\left(\frac{d^{5/6}}{\epsilon^{4/3}}\operatorname{poly}\log(d/\epsilon)\right)

queries and O(\frac{d^{3}}{\epsilon^{3}}\log(d/\epsilon)) samples.

The main result of [BCS20] is a "domain reduction" theorem, allowing a change of domain from [n]^{d} to [r]^{d} where r=\operatorname{poly}(d/\epsilon); by applying this theorem together with their earlier \widetilde{O}(\tfrac{d^{5/6}}{\epsilon^{4/3}}\operatorname{poly}\log(dn))-query tester for the uniform distribution on [n]^{d}, they obtain a tester for monotone functions with query complexity independent of n. Our result replaces this domain reduction method with a simpler and more general 2-page argument, and gives a different generalization to the distribution-free case. See Section 3 for the proofs.

Table 1: Testing results.

1-Sided Testing Monotonicity (query model):
  \mathsf{unif}(\{\pm 1\}^{d}): \widetilde{O}\left(\frac{\sqrt{d}}{\epsilon^{2}}\right) [KMS18]
  \mathsf{unif}([n]^{d}): \widetilde{O}\left(\frac{d^{5/6}}{\epsilon^{4/3}}\right) [BCS20]
  Gaussian: \widetilde{O}\left(\frac{d^{5/6}}{\epsilon^{4/3}}\right) [BCS20]
  \forall products: \widetilde{O}\left(\frac{d^{5/6}}{\epsilon^{4/3}}\right) queries, \widetilde{O}\left(\left(\frac{d}{\epsilon}\right)^{3}\right) samples (Thm. 1.1.1)

1-Sided Testing Convex Sets (sample model):
  Gaussian: \left(\frac{d}{\epsilon}\right)^{(1+o(1))d}, lower bound 2^{\Omega(d)} [CFSS17]
  \forall products: \left(\frac{d}{\epsilon}\right)^{(1+o(1))d} (Thm. 1.1.4)

Tolerant Testing Functions of k Convex Sets (sample model):
  \forall products: \left(\frac{dk}{\epsilon}\right)^{O(d)} (Thm. 1.1.4)

Tolerant Testing k-Alternating Functions (sample model):
  \mathsf{unif}([n]^{d}): \left(\frac{dk}{\tau}\right)^{O\left(\frac{k\sqrt{d}}{\tau^{2}}\right)}, \tau=\epsilon_{2}-3\epsilon_{1} [CGG+19]
  \forall products: \left(\frac{dk}{\tau}\right)^{O\left(\frac{k\sqrt{d}}{\tau^{2}}\right)}, \tau=\epsilon_{2}-\epsilon_{1} (Thm. 1.1.5)
Table 2: Learning results. All learning algorithms are agnostic except that of [Vem10a]. The PTF result for the Gaussian follows from the two cited works but is not stated in either. All statements are informal; see references for restrictions and qualifications. For PTFs, \psi(k,\epsilon):=\min\left\{O(\epsilon^{-2^{k+1}}),2^{O(k^{2})}\left(\log(1/\epsilon)/\epsilon^{2}\right)^{4k+2}\right\}.

Functions of k Convex Sets:
  \mathsf{unif}(\{\pm 1\}^{d}): \Omega(2^{d})
  Gaussian: d^{O\left(\frac{\sqrt{d}}{\epsilon^{4}}\right)}, lower bound 2^{\Omega(\sqrt{d})} [KOS08]
  \forall products: O\left(\frac{1}{\epsilon^{2}}\left(\frac{6dk}{\epsilon}\right)^{d}\right) (Thm. 1.1.4)

Functions of k Halfspaces:
  \mathsf{unif}(\{\pm 1\}^{d}): d^{O\left(\frac{k^{2}}{\epsilon^{4}}\right)} [KKMS08]
  \mathsf{unif}([n]^{d}): (dn)^{O\left(\frac{k^{2}}{\epsilon^{4}}\right)} [BOW10]
  Gaussian: d^{O\left(\frac{\log k}{\epsilon^{4}}\right)}, \operatorname{poly}\left(d,\left(\frac{k}{\epsilon}\right)^{k}\right) [KOS08, Vem10a] (intersections only)
  \forall products: \left(\frac{dk}{\epsilon}\right)^{O\left(\frac{k^{2}}{\epsilon^{4}}\right)} (Thm. 1.1.2)

Degree-k PTFs:
  \mathsf{unif}(\{\pm 1\}^{d}): d^{\psi(k,\epsilon)} [DHK+10]
  \mathsf{unif}([n]^{d}): (dn)^{\psi(k,\epsilon)} [DHK+10, BOW10]
  Gaussian: d^{\psi(k,\epsilon)} [DHK+10, BOW10]
  \forall products: \left(\frac{dk}{\epsilon}\right)^{\psi(k,\epsilon)} (Thm. 1.1.3)

k-Alternating Functions:
  \mathsf{unif}(\{\pm 1\}^{d}): 2^{\Theta\left(\frac{k\sqrt{d}}{\epsilon}\right)} [BCO+15]
  \mathsf{unif}([n]^{d}): \left(\frac{dk}{\tau}\right)^{O\left(\frac{k\sqrt{d}}{\tau^{2}}\right)} (testing) [CGG+19]
  \forall products: \left(\frac{dk}{\epsilon}\right)^{O\left(\frac{k\sqrt{d}}{\epsilon^{2}}\right)} (Thm. 1.1.5)

1.1.2 Learning Functions of Halfspaces

Intersections of k halfspaces have VC dimension \Theta(dk\log k) [BEHW89, CMK19], so the sample complexity of learning is known, but it is not possible to efficiently find k halfspaces whose intersection is correct on the sample, unless \mathsf{P}=\mathsf{NP} [BR92]. Therefore the goal is to find efficient "improper" algorithms that output a function other than an intersection of k halfspaces. Several learning algorithms for intersections of k halfspaces actually work for arbitrary functions of k halfspaces. We write \mathcal{B}_{k} for the set of all functions \{0,1\}^{k}\to\{0,1\}, and for any class \mathcal{F} of functions we write \mathcal{B}_{k}\circ\mathcal{F} for the set of all functions x\mapsto g(f_{1}(x),\dotsc,f_{k}(x)) where g\in\mathcal{B}_{k} and each f_{i}\in\mathcal{F}. Then, for \mathcal{H} the class of halfspaces, Klivans, O'Donnell, & Servedio [KOS04] gave a (non-agnostic) learning algorithm for \mathcal{B}_{k}\circ\mathcal{H} over the uniform distribution on \{\pm 1\}^{d} with complexity d^{O(k^{2}/\epsilon^{2})}, and Kalai, Klivans, Mansour, & Servedio [KKMS08] presented an agnostic algorithm with complexity d^{O(k^{2}/\epsilon^{4})} in the same setting using "polynomial regression".

Polynomial regression is a powerful technique, so it is important to understand when it can be applied. Blais, O'Donnell, & Wimmer [BOW10] studied how to generalize it to arbitrary product distributions. With their method, they obtained an agnostic learning algorithm for \mathcal{B}_{k}\circ\mathcal{H} with complexity (dn)^{O(k^{2}/\epsilon^{4})} for product distributions X_{1}\times\dotsm\times X_{d} where each |X_{i}|=n, and complexity d^{O(k^{2}/\epsilon^{4})} for "polynomially bounded" continuous distributions. This is not a complete generalization because, for example, on the grid [n]^{d} its complexity depends on n. This prevents a full generalization to the domain \mathbb{R}^{d}. Their algorithm also requires some prior knowledge of the support or support size. We use a different technique and fully generalize the polynomial regression algorithm to arbitrary product distributions. See Section 5 for the proof.

Theorem.

There is a distribution-free, improper agnostic learning algorithm for \mathcal{B}_{k}\circ\mathcal{H} under (continuous or finite) product distributions over \mathbb{R}^{d}, with time complexity

\min\left\{\left(\frac{dk}{\epsilon}\right)^{O\left(\frac{k^{2}}{\epsilon^{4}}\right)},O\left(\frac{1}{\epsilon^{2}}\left(\frac{3dk}{\epsilon}\right)^{d}\right)\right\}\,.

1.1.3 Learning Polynomial Threshold Functions

Degree-k PTFs are another generalization of halfspaces. A function f:\mathbb{R}^{d}\to\{\pm 1\} is a degree-k PTF if there is a degree-k polynomial p:\mathbb{R}^{d}\to\mathbb{R} such that f(x)=\operatorname{sign}(p(x)). Degree-k PTFs can be PAC learned in time d^{O(k)} using linear programming [KOS04], but agnostic learning is more challenging. Diakonikolas et al. [DHK+10] previously gave an agnostic learning algorithm for degree-k PTFs in the uniform distribution over \{\pm 1\}^{d} with time complexity d^{\psi(k,\epsilon)}, where

\psi(k,\epsilon):=\min\left\{O(\epsilon^{-2^{k+1}}),2^{O(k^{2})}\left(\log(1/\epsilon)/\epsilon^{2}\right)^{4k+2}\right\}\,.

The main result of that paper is an upper bound on the noise sensitivity of PTFs. Combined with the reduction of Blais et al. [BOW10], this implies an algorithm for the uniform distribution over [n]^{d} with complexity (dn)^{\psi(k,\epsilon)} and for the Gaussian distribution with complexity d^{\psi(k,\epsilon)}.

Our agnostic learning algorithm for degree-k PTFs eliminates the dependence on n and works for any unknown product distribution over \mathbb{R}^{d}, while matching the complexity of [DHK+10] for the uniform distribution over the hypercube. See Section 6 for the proof.

Theorem.

There is an improper agnostic learning algorithm for degree-k PTFs under (finite or continuous) product distributions over \mathbb{R}^{d}, with time complexity

\min\left\{\left(\frac{kd}{\epsilon}\right)^{\psi(k,\epsilon)}\;,\;O\left(\frac{1}{\epsilon^{2}}\left(\frac{9dk}{\epsilon}\right)^{d}\right)\right\}\,.

1.1.4 Testing & Learning Convex Sets

One of the first properties (sets) of functions \mathbb{R}^{d}\to\{0,1\} to be studied in the property testing literature is the set of indicator functions of convex sets, i.e. functions f:\mathbb{R}^{d}\to\{0,1\} where f^{-1}(1) is convex. Write \mathcal{C} for this class of functions. This problem has been studied in various models of testing [Ras03, RV04, CFSS17, BMR19, BB20]. In this paper we consider the sample-based model of testing, where the tester receives only random examples of the function and cannot make queries. This model of testing has received a lot of recent attention (e.g. [BBBY12, BMR19, BY19, CFSS17, GR16, Har19, RR21, BFPJH21]), partly because it matches the standard sample-based model for learning algorithms.

Chen et al. [CFSS17] gave a sample-based tester for \mathcal{C} under the Gaussian distribution on \mathbb{R}^{d} with one-sided error and sample complexity (d/\epsilon)^{O(d)}, along with a lower bound (for one-sided testers) of 2^{\Omega(d)}. We match their upper bound while generalizing the tester to be distribution-free under product distributions. See Section 4 for proofs.

Theorem.

There is a sample-based distribution-free one-sided \epsilon-tester for \mathcal{C} under (finite or continuous) product distributions that uses at most O\left(\left(\frac{6d}{\epsilon}\right)^{d}\right) samples.

A more powerful kind of tester is an (\epsilon_{1},\epsilon_{2})-tolerant tester, which must accept (with high probability) any function that is \epsilon_{1}-close to the property, while rejecting functions that are \epsilon_{2}-far. Tolerantly testing convex sets has been studied by [BMR16] for the uniform distribution over the 2-dimensional grid, but not (to the best of our knowledge) in higher dimensions. We obtain a sample-based tolerant tester (and distance approximator) for convex sets in high dimension. In fact, recall that \mathcal{B}_{k} is the set of all functions \{0,1\}^{k}\to\{0,1\}, and let \mathcal{B}^{\prime}\subset\mathcal{B}_{k} be any subset, so that \mathcal{B}^{\prime}\circ\mathcal{C} is an arbitrary property of functions of convex sets. We obtain a distance approximator for any such property:

Theorem.

Let \mathcal{B}^{\prime}\subset\mathcal{B}_{k}. There is a sample-based distribution-free algorithm under (finite or continuous) product distributions that approximates distance to \mathcal{B}^{\prime}\circ\mathcal{C} up to additive error \epsilon using O\left(\frac{1}{\epsilon^{2}}\left(\frac{3dk}{\epsilon}\right)^{d}\right) samples. Setting \epsilon=(\epsilon_{2}-\epsilon_{1})/2, we obtain an (\epsilon_{1},\epsilon_{2})-tolerant tester with sample complexity O\left(\frac{1}{(\epsilon_{2}-\epsilon_{1})^{2}}\left(\frac{6dk}{\epsilon_{2}-\epsilon_{1}}\right)^{d}\right).

General distribution-free learning of convex sets is not possible, since this class has infinite VC dimension. However, convex sets can be learned under the Gaussian distribution. Non-agnostic learning under the Gaussian was studied by Vempala [Vem10a, Vem10b]. Agnostic learning under the Gaussian was studied by Klivans, O'Donnell, & Servedio [KOS08], who presented a learning algorithm with complexity d^{O(\sqrt{d}/\epsilon^{4})} and a lower bound of 2^{\Omega(\sqrt{d})}.

Unlike the Gaussian case, there is a trivial lower bound of \Omega(2^{d}) for arbitrary product distributions, because any function f:\{\pm 1\}^{d}\to\{0,1\} belongs to this class. However, unlike the general distribution-free case, we show that convex sets (or any functions of convex sets) can be learned under unknown product distributions.

Theorem.

There is an agnostic learning algorithm for \mathcal{B}_{k}\circ\mathcal{C} under (finite or continuous) product distributions over \mathbb{R}^{d}, with time complexity O\left(\frac{1}{\epsilon^{2}}\cdot\left(\frac{6dk}{\epsilon}\right)^{d}\right).

1.1.5 Testing & Learning kk-Alternating Functions

A k-alternating function f:X\to\{\pm 1\} on a partial order X is one where, on any chain x_{1}<\cdots<x_{m} in X, f changes value at most k times. Learning k-alternating functions on domain \{\pm 1\}^{d} was studied by Blais et al. [BCO+15], motivated by the fact that these functions are computed by circuits with few negation gates. They show that 2^{\Theta(k\sqrt{d}/\epsilon)} samples are necessary and sufficient in this setting. Canonne et al. [CGG+19] later obtained an algorithm for (\epsilon_{1},\epsilon_{2})-tolerant testing of k-alternating functions, when \epsilon_{2}>3\epsilon_{1}, in the uniform distribution over [n]^{d}, with query complexity (kd/\tau)^{O(k\sqrt{d}/\tau^{2})}, where \tau=\epsilon_{2}-3\epsilon_{1}.

We obtain an agnostic learning algorithm for k-alternating functions that matches the query complexity of the tester in [CGG+19], and nearly matches the complexity of the (non-agnostic) learning algorithm of [BCO+15] for the uniform distribution over the hypercube. See Section 7 for proofs.

Theorem.

There is an agnostic learning algorithm for k-alternating functions under (finite or continuous) product distributions over \mathbb{R}^{d} that runs in time at most

\min\left\{\left(\frac{dk}{\epsilon}\right)^{O\left(\frac{k\sqrt{d}}{\epsilon^{2}}\right)},O\left(\frac{1}{\epsilon^{2}}\left(\frac{3kd}{\epsilon}\right)^{d}\right)\right\}\,.

We also generalize the tolerant tester of [CGG+19] to be distribution-free under product distributions, and eliminate the condition \epsilon_{2}>3\epsilon_{1}.

Theorem.

For any \epsilon_{2}>\epsilon_{1}>0 and \tau=(\epsilon_{2}-\epsilon_{1})/2, there is a sample-based (\epsilon_{1},\epsilon_{2})-tolerant tester for k-alternating functions using \left(\frac{dk}{\tau}\right)^{O\left(\frac{k\sqrt{d}}{\tau^{2}}\right)} samples, which is distribution-free under (finite or continuous) product distributions over \mathbb{R}^{d}.

1.2 Techniques

What connects these diverse problems is a notion of rectilinear surface area or isoperimetry that we call "block boundary size". There is a close connection between learning & testing and various notions of isoperimetry or surface area (e.g. [CS16, KOS04, KOS08, KMS18]). We show that testing or learning a class \mathcal{H} on product distributions over \mathbb{R}^{d} can be reduced to testing and learning on the uniform distribution over [r]^{d}, where r is determined by the block boundary size; we call this reduction "downsampling". The name downsampling is used in image and signal processing for the process of reducing the resolution of an image or reducing the number of discrete samples used to represent an analog signal. We adopt the name because our method can be described, by analogy to image or signal processing, as the following 2-step process:

  1. Construct a "digitized" or "pixellated" image of the function f:\mathbb{R}^{d}\to\{\pm 1\} by sampling from the distribution and constructing a grid in which each cell has roughly equal probability mass; and

  2. Learn or test the "low-resolution" pixellated function.

As long as the function f takes a constant value in the vast majority of "pixels", the low-resolution version seen by the algorithm is a good enough approximation for testing or learning. The block boundary size is, informally, the number of pixels on which f is not constant.
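For concreteness, here is a minimal numpy sketch of this two-step pipeline (the function names are ours, and the formal definitions appear in Section 2): build a grid whose cells have roughly equal mass from samples, evaluate f at one representative point per cell, and compare the "pixellated" function to f.

    import numpy as np

    def build_splits(samples, r):
        # Step 1a: from m i.i.d. samples (an (m, d) array), take every (m/r)-th order
        # statistic per dimension, so each cell has roughly equal probability mass.
        m, d = samples.shape
        assert m % r == 0
        order_stats = np.sort(samples, axis=0)
        return order_stats[(m // r) * np.arange(1, r) - 1, :].T    # shape (d, r-1)

    def block(x, splits):
        # Map a point x in R^d to its grid cell v in {0, ..., r-1}^d.
        return tuple(int(np.searchsorted(splits[i], xi)) for i, xi in enumerate(x))

    def blockpoint(v, splits):
        # One representative point per cell (its upper split, or a point past the last split).
        r = splits.shape[1] + 1
        return np.array([splits[i][v[i]] if v[i] < r - 1 else splits[i][-1] + 1.0
                         for i in range(len(v))])

    def coarsen(f, splits):
        # Step 1b/2: the "pixellated" function f^coarse(x) = f(blockpoint(block(x))).
        return lambda x: f(blockpoint(block(x, splits), splits))

    # Example with a hypothetical target function and product distribution:
    rng = np.random.default_rng(0)
    d, r, m = 3, 20, 2000
    f = lambda x: int(x.sum() >= 0.5)                              # a halfspace indicator
    splits = build_splits(rng.standard_normal((m, d)), r)
    f_coarse = coarsen(f, splits)
    test = rng.standard_normal((20000, d))
    print(np.mean([f(x) != f_coarse(x) for x in test]))            # empirical dist(f, f^coarse)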

This technique reduces distribution-free testing and learning problems to the uniform distribution in a way that is conceptually simpler than the prior work [BOW10, BCS20]. However, some technical challenges remain. The first is that it is not always easy to bound the number of "pixels" on which a function f is not constant; for example, this is nontrivial for PTFs. Second, unlike in the uniform distribution, the resulting downsampled function class on [r]^{d} is not necessarily "the same" as the original class; for example, halfspaces on \mathbb{R}^{d} are not downsampled to halfspaces on [r]^{d}, since the "pixels" are not of equal size. Thus, geometric arguments may not carry over, unlike the case for actual images.

A similar technique of constructing "low-resolution" representations of the input has been used and rediscovered ad hoc a few times in the property testing literature; in prior work, it was restricted to the uniform distribution over [n]^{d} [KR00, Ras03, FR10, BY19, CGG+19] (or the Gaussian in [CFSS17]). With this paper, we aim to provide a unified and generalized study of this simple and powerful technique.

1.3 Block Boundary Size

Informally, we define the r-block boundary size \mathsf{bbs}(\mathcal{H},r) of a class \mathcal{H} of functions \mathbb{R}^{d}\to\{0,1\} as the maximum number of grid cells on which a function f\in\mathcal{H} is non-constant, over all possible r\times\dotsm\times r grid partitions of \mathbb{R}^{d} (which are not necessarily evenly spaced); see Section 2 for formal definitions. Whether downsampling can be applied to \mathcal{H} depends on whether

\lim_{r\to\infty}\frac{\mathsf{bbs}(\mathcal{H},r)}{r^{d}}\to 0\,,

and the complexity of the algorithms depends on how large r must be for the non-constant blocks to vanish relative to the whole r^{d} grid. A general observation is that any function class \mathcal{H} to which downsampling can be applied can be learned under unknown product distributions with a finite number of samples; for example, this holds for convex sets even though their VC dimension is infinite.

Proposition 1.1 (Consequence of Lemma 4.5).

Let \mathcal{H} be any set of functions \mathbb{R}^{d}\to\{0,1\} (measurable with respect to continuous product distributions) such that

\lim_{r\to\infty}\frac{\mathsf{bbs}(\mathcal{H},r)}{r^{d}}=0\,.

Then there is some function \delta(d,\epsilon) such that \mathcal{H} is distribution-free learnable under product distributions, up to error \epsilon, with \delta(d,\epsilon) samples.

For convex sets, monotone functions, k-alternating functions, and halfspaces, \mathsf{bbs}(\mathcal{H},r) is easy to calculate. For degree-k PTFs, it is more challenging. We say that a function f:\mathbb{R}^{d}\to\{0,1\} induces a connected component S if for every x,y\in S there is a continuous curve in \mathbb{R}^{d} from x to y such that f(z)=f(x)=f(y) for all z on the curve, and S is a maximal such set. We prove a general lemma that bounds the block boundary size by the number of connected components induced by functions f\in\mathcal{H}.

Lemma 1.2 (Informal, see Lemma 6.6).

Suppose that for any axis-aligned affine subspace A of affine dimension n\leq d, and any function f\in\mathcal{H}, f induces at most k^{n} connected components in A. Then for r=\Omega(dk/\epsilon), \mathsf{bbs}(\mathcal{H},r)\leq\epsilon\cdot r^{d}.

This lemma in fact generalizes all computations of block boundary size in this paper (up to constant factors in r). Using a theorem of Warren [War68], we get that any degree-k polynomial \mathbb{R}^{d}\to\mathbb{R} achieves value 0 in at most \epsilon r^{d} grid cells, for sufficiently large r=\Omega(dk/\epsilon) (Corollary 6.8).
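As a quick worked example of how the lemma is applied (our illustration): a halfspace f(x)=\mathbf{1}[\langle w,x\rangle\geq t] restricted to any axis-aligned affine subspace A of dimension n is again a halfspace of A, so it induces at most 2\leq 2^{n} connected components in A; applying the lemma with k=2 then gives \mathsf{bbs}(f,r)\leq\epsilon r^{d} once r=\Omega(d/\epsilon).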

1.4 Polynomial Regression

The second step of downsampling is to find a testing or learning algorithm that works for the uniform distribution over the (not necessarily evenly-spaced) hypergrid. Most of our learning results use polynomial regression. This is a powerful technique introduced in [KKMS08] that performs linear regression over a vector space of functions that approximately spans the hypothesis class. This method is usually applied by using Fourier analysis to construct such an approximate basis for the hypothesis class [BOW10, DHK+10, CGG+19]. This was the method used, for example, by Blais, O'Donnell, & Wimmer [BOW10] to achieve the \operatorname{poly}(dn)-time algorithms for intersections of halfspaces.

We take the same approach but we use the Walsh basis for functions on domain [n]^{d} (see e.g. [BRY14]) instead of the bases used in the prior works. We show that if one can establish bounds on the noise sensitivity in the Fourier basis for the hypothesis class restricted to the uniform distribution over \{\pm 1\}^{d}, then one gets a bound on the number of Walsh functions required to approximately span the "downsampled" hypothesis class. In this way, we establish that if one can apply standard Fourier-analytic techniques to the hypothesis class over the uniform distribution on \{\pm 1\}^{d} and calculate the block boundary size, then the results for the hypercube essentially carry over to the distribution-free setting for product distributions on \mathbb{R}^{d}.

An advantage of this technique is that both noise sensitivity and block boundary size grow at most linearly under function composition: for functions f(x)=g(h_{1}(x),\dotsc,h_{k}(x)) where each h_{i} belongs to the class \mathcal{H}, the noise sensitivity and block boundary size grow at most linearly in k. Therefore learning results for \mathcal{H} obtained in this way are easy to extend to arbitrary compositions of functions from \mathcal{H}, which is how we get our result for intersections of halfspaces.
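For intuition only, here is a minimal sketch of a polynomial-regression learner on the hypercube (our illustration, under our own simplifications: it fits a least-squares polynomial over low-degree parities on \{\pm 1\}^{d} and thresholds the result, rather than the Walsh-basis regression over [r]^{d} developed in Section 5):

    import itertools
    import numpy as np

    def parity_features(X, degree):
        # X: (m, d) array with entries in {-1, +1}; columns are all monomials
        # prod_{i in S} x_i with |S| <= degree (the low-degree basis on the hypercube).
        m, d = X.shape
        cols = [np.ones(m)]
        for k in range(1, degree + 1):
            for S in itertools.combinations(range(d), k):
                cols.append(np.prod(X[:, list(S)], axis=1))
        return np.column_stack(cols)

    def low_degree_regression(X, y, degree):
        # Fit a low-degree polynomial to the labels by least squares, then threshold at 0.
        # (A simplified stand-in for the polynomial regression of [KKMS08].)
        Phi = parity_features(X, degree)
        coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)
        return lambda Xnew: np.where(parity_features(Xnew, degree) @ coef >= 0, 1, -1)

    # Example: learn a majority function on {-1, +1}^d from samples.
    rng = np.random.default_rng(1)
    d, m = 10, 4000
    X = rng.choice([-1, 1], size=(m, d))
    y = np.where(X.sum(axis=1) >= 0, 1, -1)
    h = low_degree_regression(X, y, degree=2)
    print(np.mean(h(X) == y))    # training accuracy of the polynomial hypothesis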

2 Downsampling

We will now introduce the main definitions, notation, and lemmas required by our main results. The purpose of this section is to establish the main conceptual component of the downsampling technique: that functions with small enough block boundary size can be efficiently well-approximated by a “coarsened” version of the function that is obtained by random sampling. See Figure 1 for an illustration of the following definitions.

Figure 1: Left: Random grid X (pale lines) with induced block partition (thick lines) and \mathsf{blockpoint} values (dots), superimposed on f^{-1}(1) (gray polygon). Right: f^{\mathsf{coarse}} (grey) compared to f (polygon outline).
Definition 2.1 (Block Partitions).

An r-block partition of \mathbb{R}^{d} is a pair of functions \mathsf{block}:\mathbb{R}^{d}\to[r]^{d} and \mathsf{blockpoint}:[r]^{d}\to\mathbb{R}^{d} obtained as follows. For each i\in[d],j\in[r-1] let a_{i,j}\in\mathbb{R} such that a_{i,j}<a_{i,j+1}, and define a_{i,0}=-\infty,a_{i,r}=\infty for each i. For each i\in[d],j\in[r] define the interval B_{i,j}=(a_{i,j-1},a_{i,j}] and a point b_{i,j}\in B_{i,j}. The function \mathsf{block}:\mathbb{R}^{d}\to[r]^{d} is defined by setting \mathsf{block}(x) to be the unique vector v\in[r]^{d} such that x_{i}\in B_{i,v_{i}} for each i\in[d]. The function \mathsf{blockpoint}:[r]^{d}\to\mathbb{R}^{d} is defined by setting \mathsf{blockpoint}(v)=(b_{1,v_{1}},\dotsc,b_{d,v_{d}}); note that \mathsf{blockpoint}(v)\in\mathsf{block}^{-1}(v) where \mathsf{block}^{-1}(v)=\{x\in\mathbb{R}^{d}:\mathsf{block}(x)=v\}.

Definition 2.2 (Block Functions and Coarse Functions).

For a function f:\mathbb{R}^{d}\to\{\pm 1\}, we define f^{\mathsf{block}}:[r]^{d}\to\{\pm 1\} as f^{\mathsf{block}}:=f\circ\mathsf{blockpoint}, and f^{\mathsf{coarse}}:\mathbb{R}^{d}\to\mathbb{R} as f^{\mathsf{coarse}}:=f^{\mathsf{block}}\circ\mathsf{block}=f\circ\mathsf{blockpoint}\circ\mathsf{block}. For any set \mathcal{H} of functions \mathbb{R}^{d}\to\{\pm 1\}, we define \mathcal{H}^{\mathsf{block}}:=\{f^{\mathsf{block}}\;|\;f\in\mathcal{H}\}. For a distribution \mu over \mathbb{R}^{d} and an r-block partition \mathsf{block}:\mathbb{R}^{d}\to[r]^{d}, we define the distribution \mathsf{block}(\mu) over [r]^{d} as the distribution of \mathsf{block}(x) for x\sim\mu.

Definition 2.3 (Induced Block Partitions).

When \mu is a product distribution over \mathbb{R}^{d}, a random grid X of length m is the grid obtained by sampling m points x_{1},\dotsc,x_{m}\in\mathbb{R}^{d} independently from \mu and, for each i\in[d],j\in[m], defining X_{i,j} to be the j^{\mathrm{th}}-smallest coordinate in dimension i among all sampled points. For any r that divides m we define an r-block partition depending on X by defining, for each i\in[d],j\in[r-1], the point a_{i,j}=X_{i,mj/r}, so that the intervals are B_{i,j}:=(X_{i,m(j-1)/r},X_{i,mj/r}] when j\in\{2,\dotsc,r-1\}, and B_{i,1}=(-\infty,X_{i,m/r}], B_{i,r}=(X_{i,m(r-1)/r},\infty); we let the points b_{i,j} defining \mathsf{blockpoint} be arbitrary. This is the r-block partition induced by X.

Definition 2.4 (Block Boundary Size).

For a block partition \mathsf{block}:\mathbb{R}^{d}\to[r]^{d}, a distribution \mu over \mathbb{R}^{d}, and a function f:\mathbb{R}^{d}\to\{\pm 1\}, we say f is non-constant on a block v\in[r]^{d} if there are sets S,T\subset\mathsf{block}^{-1}(v) such that \forall s\in S,t\in T:f(s)=1,f(t)=-1, and S,T have positive measure (in the product of Lebesgue measures). For a function f:\mathbb{R}^{d}\to\{\pm 1\} and a number r, we define the r-block boundary size \mathsf{bbs}(f,r) as the maximum number of blocks on which f is non-constant, where the maximum is taken over all r-block partitions \mathsf{block}:\mathbb{R}^{d}\to[r]^{d}. For a set \mathcal{H} of functions \mathbb{R}^{d}\to\{\pm 1\}, we define \mathsf{bbs}(\mathcal{H},r):=\max\{\mathsf{bbs}(f,r)\;|\;f\in\mathcal{H}\}.
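To make the definition concrete, here is a small sketch (ours, not from the paper) for the special case of a halfspace f(x)=\mathbf{1}[\langle w,x\rangle\geq t]: f is non-constant on an axis-aligned cell exactly when the minimum and maximum of \langle w,x\rangle over the cell straddle t (up to measure-zero boundary cases), and these extremes are found by choosing, per coordinate, the cell endpoint that minimizes or maximizes w_{i}x_{i}. Brute-force counting over all cells then gives the number of non-constant blocks for a fixed partition (feasible only for small d).

    import numpy as np
    from itertools import product

    def halfspace_nonconstant_on_cell(w, t, lo, hi):
        # Cell = product of intervals (lo[i], hi[i]]; f(x) = 1 iff <w, x> >= t.
        # f is non-constant on a set of positive measure iff t lies strictly between
        # the min and max of <w, x> over the cell (ignoring measure-zero cases).
        min_val = np.sum(np.where(w >= 0, w * lo, w * hi))
        max_val = np.sum(np.where(w >= 0, w * hi, w * lo))
        return min_val < t < max_val

    def count_nonconstant_cells(w, t, splits):
        # splits[i]: the r-1 interior split points in dimension i; pad with large finite ends.
        grids = [np.concatenate(([-1e9], s, [1e9])) for s in splits]
        count = 0
        for cell in product(*(range(len(g) - 1) for g in grids)):
            lo = np.array([grids[i][j] for i, j in enumerate(cell)])
            hi = np.array([grids[i][j + 1] for i, j in enumerate(cell)])
            count += halfspace_nonconstant_on_cell(w, t, lo, hi)
        return count

    # Example: a random halfspace against an evenly spaced 20x20x20 grid on [0, 1]^3.
    rng = np.random.default_rng(2)
    w, t = rng.standard_normal(3), 0.8
    splits = [np.linspace(0, 1, 21)[1:-1] for _ in range(3)]
    print(count_nonconstant_cells(w, t, splits), "of", 20 ** 3, "cells are non-constant")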

The total variation distance between two distributions \mu,\nu over a finite domain \mathcal{X} is defined as

\|\mu-\nu\|_{\mathsf{TV}}:=\frac{1}{2}\sum_{x\in\mathcal{X}}|\mu(x)-\nu(x)|=\max_{S\subseteq\mathcal{X}}|\mu(S)-\nu(S)|\,.

The essence of downsampling is apparent in the next proposition. It shows that the distance of f to its coarsened version f^{\mathsf{coarse}} is bounded by two quantities: the fraction of blocks in the r-block partition on which f is not constant, and the distance of the distribution \mathsf{block}(\mu) to uniform. When both quantities are small, testing or learning f can be done by testing or learning f^{\mathsf{coarse}} instead. The uniform distribution over a set S is denoted \mathsf{unif}(S):

Proposition 2.5.

Let \mu be a continuous product distribution over \mathbb{R}^{d}, let X be a random grid, and let \mathsf{block}:\mathbb{R}^{d}\to[r]^{d} be the induced r-block partition. Then, for any measurable f:\mathbb{R}^{d}\to\{\pm 1\}, the following holds with probability 1 over the choice of X:

\underset{x\sim\mu}{\mathbb{P}}\left[f(x)\neq f^{\mathsf{coarse}}(x)\right]\leq r^{-d}\cdot\mathsf{bbs}(f,r)+\|\mathsf{block}(\mu)-\mathsf{unif}([r]^{d})\|_{\mathsf{TV}}\,.
Proof.

We first establish that, with probability 1 over X and x\sim\mu, if f(x)\neq f^{\mathsf{coarse}}(x) then f is non-constant on \mathsf{block}(x). Fix X and suppose there exists a set Z of positive measure such that for each x\in Z, f(x)\neq f^{\mathsf{coarse}}(x) but f is not non-constant on \mathsf{block}(x), i.e. for V=\mathsf{block}^{-1}(\mathsf{block}(x)), either \mu(V\cap f^{-1}(1))=\mu(V) or \mu(V\cap f^{-1}(-1))=\mu(V). Then there is v\in[r]^{d} such that for V=\mathsf{block}^{-1}(v), \mu(Z\cap V)>0. Let y=\mathsf{blockpoint}(v). If \mu(V\cap f^{-1}(f(y)))=\mu(V) then \mu(Z\cap V)=0, so \mu(V\cap f^{-1}(f(y)))=0. But for random X, the probability that there exists v\in[r]^{d} such that \mu(V\cap f^{-1}(f(\mathsf{blockpoint}(v))))=0 is 0, since \mathsf{blockpoint}(v) is random within V.

Assuming that the above event occurs,

\underset{x\sim\mu}{\mathbb{P}}\left[f(x)\neq f^{\mathsf{coarse}}(x)\right]\leq\underset{x\sim\mu}{\mathbb{P}}\left[f\text{ is non-constant on }\mathsf{block}(x)\right]\leq\underset{v\sim[r]^{d}}{\mathbb{P}}\left[f\text{ is non-constant on }v\right]+\|\mathsf{block}(\mu)-\mathsf{unif}([r]^{d})\|_{\mathsf{TV}}\,.

Since v\sim[r]^{d} is uniform, the probability of hitting a non-constant block is at most r^{-d}\cdot\mathsf{bbs}(f,r). ∎

Next we give a bound on the number of samples required to ensure that \mathsf{block}(\mu) is close to uniform. We need the following lemma.

Lemma 2.6.

Let \mu be a continuous probability distribution over \mathbb{R}, let m,r\in\mathbb{N} be such that r divides m, and let \delta\in(0,1/2). Let X be a set of m points sampled independently from \mu. Write X=\{x_{1},\dotsc,x_{m}\}, labeled such that x_{1}<\dotsm<x_{m} (and write x_{0}=-\infty). Then for any i\in[r],

\mathbb{P}\left[\mu\left(x_{(i-1)(m/r)},x_{i(m/r)}\right]<\frac{1-\delta}{r}\right]\leq 4\cdot e^{-\frac{\delta^{2}m}{32r}}\,.
Proof.

We assume that i-1\leq r/2; if i-1>r/2 then we can repeat the following analysis with the opposite ordering on the points in X. Write x^{*}=x_{(i-1)\tfrac{m}{r}} and \beta=\mu(-\infty,x^{*}]. First suppose that (1-\delta/2)\frac{i-1}{r}<\beta<(1+\delta/2)\frac{i-1}{r}\leq(1+\delta/2)/2; we will bound the probability of this event later.

Let t\in\mathbb{R} be the point such that \mu(x^{*},t]=(1-\delta)/r (which must exist since \mu is continuous). Let \eta=\frac{\delta}{1-\delta}\geq\delta. Write X^{*}=\{x\in X:x>x^{*}\}. The expected value of |X^{*}\cap(x^{*},t]| is |X^{*}|\frac{1-\delta}{r(1-\beta)}=m\left(1-\frac{i-1}{r}\right)\frac{1-\delta}{r(1-\beta)}, where the factor 1-\beta in the denominator is due to the fact that each element of X^{*} is sampled from \mu conditioned on being larger than x^{*}. The event \mu(x^{*},x_{i(m/r)}]<(1-\delta)/r occurs if and only if |X^{*}\cap(x^{*},t]|>m/r, which occurs with probability

\mathbb{P}\left[|X^{*}\cap(x^{*},t]|>\frac{m}{r}\right]=\mathbb{P}\left[|X^{*}\cap(x^{*},t]|>m\left(1-\frac{i-1}{r}\right)\frac{1-\delta}{r(1-\beta)}(1+\eta)\right]

where

1+\eta=\frac{1-\beta}{(1-\delta)\left(1-\frac{i-1}{r}\right)}\geq\frac{1-(1+\delta/2)\frac{i-1}{r}}{(1-\delta)\left(1-\frac{i-1}{r}\right)}=\frac{1}{1-\delta}\left(1-\frac{(\delta/2)(i-1)}{r-(i-1)}\right)\geq\frac{1-\delta/2}{1-\delta}=1+\frac{\delta}{2(1-\delta)}\geq 1+\delta/2\,.

Since the expected value satisfies

|X^{*}|\frac{1-\delta}{r(1-\beta)}\geq\frac{m}{r}\left(1-\frac{i-1}{r}\right)\frac{2(1-\delta)}{1-\delta/2}\geq\frac{m}{r}(1-\delta/2)\geq\frac{m}{2r}\,,

the Chernoff bound gives

\mathbb{P}\left[|X^{*}\cap(x^{*},t]|>\frac{m}{r}\right]\leq\mathrm{exp}\left(-\frac{\delta^{2}|X^{*}|(1-\delta)}{3\cdot 4\cdot r(1-\beta)}\right)\leq e^{-\frac{\delta^{2}m}{3\cdot 4\cdot 2r}}\,.

Now let t\in\mathbb{R} be the point such that \mu(x^{*},t]=(1+\delta)/r. The expected value of |X^{*}\cap(x^{*},t]| is now |X^{*}|\frac{1+\delta}{r(1-\beta)}. The event \mu(x^{*},x_{i(m/r)}]>(1+\delta)/r occurs if and only if |X^{*}\cap(x^{*},t]|<m/r, which occurs with probability

\mathbb{P}\left[|X^{*}\cap(x^{*},t]|<\frac{m}{r}\right]=\mathbb{P}\left[|X^{*}\cap(x^{*},t]|<m\left(1-\frac{i-1}{r}\right)\frac{1+\delta}{r(1-\beta)}(1-\eta)\right]

where

1-\eta=\frac{1-\beta}{(1+\delta)\left(1-\tfrac{i-1}{r}\right)}\leq\frac{1-(1+\delta/2)\tfrac{i-1}{r}}{(1+\delta)\left(1-\frac{i-1}{r}\right)}=\frac{1}{1+\delta}\left(1+\frac{(\delta/2)(i-1)}{r-(i-1)}\right)\leq\frac{1+\delta/2}{1+\delta}=1-\frac{\delta/2}{1+\delta}\leq 1-\frac{\delta}{4}\,.

The expected value satisfies |X^{*}|\frac{1+\delta}{r(1-\beta)}>m/r, so the Chernoff bound gives

\mathbb{P}\left[|X^{*}\cap(x^{*},t]|<\frac{m}{r}\right]\leq\mathrm{exp}\left(-\frac{\delta^{2}|X^{*}|(1+\delta)}{2\cdot 4^{2}\cdot r(1-\beta)}\right)\leq e^{-\frac{\delta^{2}m}{2\cdot 4^{2}\cdot r}}\,.

It remains to bound the probability that (1-\delta/2)\frac{i-1}{r}<\beta<(1+\delta/2)\frac{i-1}{r}. Define t\in\mathbb{R} such that \mu(-\infty,t]=(1+\delta/2)\frac{i-1}{r}. Then \beta=\mu(-\infty,x^{*}]\geq(1+\delta/2)\frac{i-1}{r} if and only if x^{*}>t, i.e. |X\cap(-\infty,t]|<m\frac{i-1}{r}. The expected value of |X\cap(-\infty,t]| is m\frac{(1+\delta/2)(i-1)}{r}, so for \eta=\frac{\delta/2}{1+\delta/2}\geq\delta/3, the Chernoff bound implies

\mathbb{P}\left[|X\cap(-\infty,t]|<m\frac{i-1}{r}\right]=\mathbb{P}\left[|X\cap(-\infty,t]|<m\frac{(1+\delta/2)(i-1)}{r}(1-\eta)\right]\leq e^{-\frac{\delta^{2}m(1+\delta/2)(i-1)}{18r}}\leq e^{-\frac{\delta^{2}m}{18r}}\,.

Now define t\in\mathbb{R} such that \mu(-\infty,t]=(1-\delta/2)\frac{i-1}{r}. Then \beta=\mu(-\infty,x^{*}]\leq(1-\delta/2)\frac{i-1}{r} if and only if x^{*}<t, i.e. |X\cap(-\infty,t]|>m\frac{i-1}{r}. The expected value of |X\cap(-\infty,t]| is m\frac{(1-\delta/2)(i-1)}{r}, so for \eta=\frac{\delta}{2-\delta}\geq\delta/2,

\mathbb{P}\left[|X\cap(-\infty,t]|>m\frac{i-1}{r}\right]=\mathbb{P}\left[|X\cap(-\infty,t]|>m\frac{(1-\delta/2)(i-1)}{r}(1+\eta)\right]\leq e^{-\frac{\delta^{2}m(1-\delta/2)(i-1)}{2\cdot 4r}}\leq e^{-\frac{\delta^{2}m}{4^{2}r}}\,.

The conclusion then follows from the union bound over these four events. ∎

Lemma 2.7.

Let \mu=\mu_{1}\times\dotsm\times\mu_{d} be a product distribution over \mathbb{R}^{d} where each \mu_{i} is continuous. Let X be a random grid of length m sampled from \mu, and let \mathsf{block}:\mathbb{R}^{d}\to[r]^{d} be the r-block partition induced by X. Then

\underset{X}{\mathbb{P}}\left[\|\mathsf{block}(\mu)-\mathsf{unif}([r]^{d})\|_{\mathsf{TV}}>\epsilon\right]\leq 4rd\cdot e^{-\frac{\epsilon^{2}m}{18rd^{2}}}\,.
Proof.

For a fixed grid X and each i\in[d], let p_{i}:[r]\to[0,1] be the probability distribution on [r] with p_{i}(z)=\mu_{i}(B_{i,z}). Then \mathsf{block}(\mu)=p_{1}\times\dotsm\times p_{d}.

Let \delta=\frac{4\epsilon}{3d}. Suppose that for every (i,j)\in[d]\times[r] it holds that \frac{1-\delta}{r}\leq p_{i}(j)\leq\frac{1+\delta}{r}. Note that d\delta=\frac{4\epsilon}{3}\leq\ln(1+2\epsilon)\leq 2\epsilon. Then for every v\in[r]^{d},

\underset{u\sim\mu}{\mathbb{P}}\left[\mathsf{block}(u)=v\right]=\prod_{i=1}^{d}p_{i}(v_{i})\begin{cases}\leq{(1+\delta)}^{d}r^{-d}\leq e^{d\delta}r^{-d}\leq(1+2\epsilon)r^{-d}\\ \geq{(1-\delta)}^{d}r^{-d}\geq(1-d\delta)r^{-d}\geq(1-2\epsilon)r^{-d}\,.\end{cases}

So

\|\mathsf{block}(\mu)-\mathsf{unif}([r]^{d})\|_{\mathsf{TV}}=\frac{1}{2}\sum_{v\in[r]^{d}}\left|\underset{u\sim\mu}{\mathbb{P}}\left[\mathsf{block}(u)=v\right]-r^{-d}\right|\leq\frac{1}{2}\sum_{v\in[r]^{d}}2\epsilon r^{-d}=\epsilon\,.

By Lemma 2.6 and the union bound, the probability that there is some i\in[d],j\in[r] that satisfies p_{i}(j)<(1-\delta)/r is at most 4rd\cdot e^{-\frac{\epsilon^{2}m}{18rd^{2}}}. ∎

3 Testing Monotonicity

3.1 Testing Monotonicity on the Hypergrid

A good introduction to downsampling is the following short proof of the main result of Black, Chakrabarty, & Seshadhri [BCS20]. In an earlier work, [BCS18], they gave an O((d^{5/6}/\epsilon^{4/3})\operatorname{poly}\log(dn))-query tester for the domain [n]^{d}, and in the later work they showed how to reduce the domain [n]^{d} to [r]^{d} for r=\operatorname{poly}(d/\epsilon).

Our monotonicity tester will use as a subroutine the following tester for diagonal functions. For a hypergrid [n]^{d}, a diagonal is a subset of points \{x\in[n]^{d}:x=v+\lambda\vec{1},\lambda\in\mathbb{Z}\} defined by some v\in[n]^{d}. A function f:[n]^{d}\to\{0,1\} is a diagonal function if it has at most one 1-valued point in each diagonal.

Lemma 3.1.

There is an \epsilon-tester with one-sided error and query complexity O\left(\tfrac{1}{\epsilon}\log^{2}(1/\epsilon)\right) for diagonal functions on [n]^{d}.

Proof.

For each t\in[n] let D_{t} be the set of diagonals with length t. For any x\in[n]^{d} let \mathrm{diag}(x) be the unique diagonal that contains x. For input f:[n]^{d}\to\{0,1\} and any x\in[n]^{d}, let R(x)=\frac{|\{y\in\mathrm{diag}(x):f(y)=1\}|}{|\mathrm{diag}(x)|}.

Suppose that f is \epsilon-far from diagonal. Then f must have at least \epsilon n^{d} 1-valued points; otherwise we could set each 1-valued point to 0 to obtain the constant 0 function, which is diagonal. Now observe

\underset{x\sim[n]^{d}}{\mathbb{E}}\left[R(x)\right]=\underset{x\sim[n]^{d}}{\mathbb{E}}\left[\sum_{t=1}^{n}\sum_{L\in D_{t}}\mathbf{1}\left[\mathrm{diag}(x)=L\right]\frac{|\{y\in L:f(y)=1\}|}{t}\right]=\sum_{t=1}^{n}\sum_{L\in D_{t}}\underset{x\sim[n]^{d}}{\mathbb{P}}\left[x\in L\right]\frac{|\{y\in L:f(y)=1\}|}{t}=\sum_{t=1}^{n}\sum_{L\in D_{t}}\frac{t}{n^{d}}\frac{|\{y\in L:f(y)=1\}|}{t}=\frac{1}{n^{d}}|\{y\in[n]^{d}:f(y)=1\}|\geq\epsilon\,.

For each i, define A_{i}=\left\{x\in[n]^{d}:\frac{1}{2^{i}}<R(x)\leq\frac{1}{2^{i-1}}\right\}. Let k=\log(4/\epsilon). Then

\epsilon\leq\mathbb{E}\left[R(x)\right]\leq\sum_{i=1}^{\infty}\frac{|A_{i}|}{n^{d}}\max_{x\in A_{i}}R(x)\leq\sum_{i=1}^{\infty}\frac{|A_{i}|}{n^{d}2^{i-1}}\leq\sum_{i=1}^{k}\frac{|A_{i}|}{n^{d}2^{i-1}}+\sum_{i=k+1}^{\infty}\frac{1}{2^{i-1}}\leq\sum_{i=1}^{k}\frac{|A_{i}|}{n^{d}2^{i-1}}+\frac{1}{2^{k-1}}\leq\sum_{i=1}^{k}\frac{|A_{i}|}{n^{d}2^{i-1}}+\frac{\epsilon}{2}\,,

which implies \frac{\epsilon}{2}\leq\sum_{i=1}^{k}\frac{|A_{i}|}{n^{d}2^{i-1}}. Therefore there is some \ell\in[k] such that |A_{\ell}|\geq\frac{\epsilon n^{d}2^{\ell-1}}{2k}.

The tester is as follows. For each i\in[k]:

  1. Sample p=\frac{k}{\epsilon 2^{i-2}}\ln(6) points x_{1},\dotsc,x_{p}\sim[n]^{d}.

  2. For each j\in[p], sample q=2^{i+2}\ln(12) points y_{1},\dotsc,y_{q} from \mathrm{diag}(x_{j}) and reject if there are two distinct 1-valued points in the sample.

The query complexity of the tester is \sum_{i=1}^{k}4^{2}\ln(6)\ln(12)\frac{k}{\epsilon 2^{i}}2^{i}=O\left(\frac{1}{\epsilon}\log^{2}(1/\epsilon)\right).

The tester will clearly accept any diagonal function. Now suppose that f is \epsilon-far from having this property, and let \ell\in[k] be such that |A_{\ell}|\geq\frac{\epsilon n^{d}2^{\ell-2}}{k}. On iteration i=\ell, the algorithm samples p=\frac{k}{\epsilon 2^{\ell-2}}\ln(6) points x_{1},\dotsc,x_{p}. The probability that \forall j\in[p],x_{j}\notin A_{\ell} is at most

\left(1-\frac{|A_{\ell}|}{n^{d}}\right)^{p}\leq\left(1-\frac{\epsilon 2^{\ell-2}}{k}\right)^{p}\leq\mathrm{exp}\left(-\frac{\epsilon p2^{\ell-2}}{k}\right)\leq 1/6\,.

Now assume that there is some x_{j}\in A_{\ell}, so that R(x_{j})>2^{-\ell}. Let A,B\subset\mathrm{diag}(x_{j}) be disjoint subsets that partition the 1-valued points in \mathrm{diag}(x_{j}) into equally-sized parts. Then for y sampled uniformly at random from \mathrm{diag}(x_{j}), \mathbb{P}\left[y\in A\right],\mathbb{P}\left[y\in B\right]\geq 2^{-(\ell+1)}. The probability that there are at least 2 distinct 1-valued points in y_{1},\dotsc,y_{q} sampled by the algorithm is at least the probability that one of the first q/2 samples is in A and one of the last q/2 samples is in B. This fails to occur with probability at most 2(1-2^{-(\ell+1)})^{q/2}\leq 2e^{-q2^{-(\ell+2)}}\leq 1/6. So the total probability of failure is at most 2/6=1/3. ∎
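For concreteness, the tester above can be sketched directly in code (a minimal Python sketch; f is assumed to be a query oracle on the grid \{0,\dotsc,n-1\}^{d}, and the names and constants follow the proof):

    import math
    import random

    def diagonal_lambda_range(x, n):
        # The diagonal through x is { x + lam*(1,...,1) }; valid lam keep coordinates in [0, n-1].
        return -min(x), (n - 1) - max(x)

    def is_diagonal_tester(f, n, d, eps, rng=random):
        # One-sided tester from the proof of Lemma 3.1: reject only if some diagonal
        # is observed to contain two distinct 1-valued points.
        k = max(1, math.ceil(math.log2(4.0 / eps)))
        for i in range(1, k + 1):
            p = math.ceil(k * math.log(6) / (eps * 2 ** (i - 2)))
            q = math.ceil(2 ** (i + 2) * math.log(12))
            for _ in range(p):
                x = [rng.randrange(n) for _ in range(d)]
                lo, hi = diagonal_lambda_range(x, n)
                ones = set()
                for _ in range(q):
                    lam = rng.randint(lo, hi)
                    y = tuple(c + lam for c in x)
                    if f(y) == 1:
                        ones.add(y)
                        if len(ones) >= 2:
                            return False   # two distinct 1-valued points on one diagonal
        return True                        # always accepts genuine diagonal functions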

Theorem 3.2.

There is a non-adaptive monotonicity tester on domain [n]^{d} with one-sided error and query complexity \widetilde{O}\left(\frac{d^{5/6}}{\epsilon^{4/3}}\right).

Proof.

Set r=\lceil 4d/\epsilon\rceil, and assume without loss of generality that r divides n. Partition [n] into r intervals B_{i}=\{(i-1)(n/r)+1,\dotsc,i(n/r)\}. For each v\in[r]^{d} write B_{v}=B_{v_{1}}\times\dotsm\times B_{v_{d}}. Define \mathsf{block}:[n]^{d}\to[r]^{d} where \mathsf{block}(x) is the unique vector v\in[r]^{d} such that x\in B_{v}. Define \mathsf{block}^{-\downarrow}(v)=\min\{x\in B_{v}\} and \mathsf{block}^{-\uparrow}(v)=\max\{x\in B_{v}\}, where the minimum and maximum are with respect to the natural ordering on [n]^{d}. For f:[n]^{d}\to\{0,1\}, write f^{\mathsf{block}}:[r]^{d}\to\{0,1\}, f^{\mathsf{block}}(v)=f(\mathsf{block}^{-\downarrow}(v)). We may simulate queries v to f^{\mathsf{block}} by returning f(\mathsf{block}^{-\downarrow}(v)). We will call v\in[r]^{d} a boundary block if f(\mathsf{block}^{-\downarrow}(v))\neq f(\mathsf{block}^{-\uparrow}(v)).

The test proceeds as follows. On input f:[n]^{d}\to\{0,1\}, define the following functions:

g:[n]^{d}\to\{0,1\},\qquad g(x)=\begin{cases}f^{\mathsf{block}}(\mathsf{block}(x))&\text{ if }\mathsf{block}(x)\text{ is not a boundary block}\\ f(x)&\text{ if }\mathsf{block}(x)\text{ is a boundary block.}\end{cases}

b:[r]^{d}\to\{0,1\},\qquad b(v)=\begin{cases}0&\text{ if }v\text{ is not a boundary block}\\ 1&\text{ if }v\text{ is a boundary block.}\end{cases}

h:[r]^{d}\to\{0,1\},\qquad h(v)=\begin{cases}f^{\mathsf{block}}(v)&\text{ if }v\text{ is not a boundary block}\\ 0&\text{ if }v\text{ is a boundary block.}\end{cases}

Queries to each of these functions can be simulated by 2 or 3 queries to f (a code sketch of this simulation is given after the three steps below). The tester performs:

  1. Test whether g=f, or whether \mathsf{dist}(f,g)>\epsilon/4, using O(1/\epsilon) queries.

  2. Test whether b is diagonal, or is \epsilon/4-far from diagonal, using Lemma 3.1, with O\left(\tfrac{1}{\epsilon}\log^{2}(1/\epsilon)\right) queries.

  3. Test whether h is monotone or \epsilon/4-far from monotone, using the tester of Black, Chakrabarty, & Seshadhri with \widetilde{O}\left(\frac{d^{5/6}}{\epsilon^{4/3}}\right) queries.
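As a concrete illustration of the 2-or-3-query simulation (our sketch, with helper names of our own choosing), assuming query access to f on \{1,\dotsc,n\}^{d} with r dividing n:

    def make_simulators(f, n, r):
        # Blocks are indexed by v in {1,...,r}^d with B_v = prod_i {(v_i-1)(n/r)+1, ..., v_i(n/r)}.
        w = n // r
        lo = lambda v: tuple((vi - 1) * w + 1 for vi in v)     # block^down(v): minimum point of B_v
        hi = lambda v: tuple(vi * w for vi in v)               # block^up(v):   maximum point of B_v
        blk = lambda x: tuple((xi - 1) // w + 1 for xi in x)   # block(x)

        f_block = lambda v: f(lo(v))                           # 1 query
        b = lambda v: int(f(lo(v)) != f(hi(v)))                # boundary-block indicator, 2 queries

        def g(x):                                              # 2 or 3 queries
            v = blk(x)
            return f(x) if b(v) else f_block(v)

        def h(v):                                              # 2 queries
            return 0 if b(v) else f_block(v)

        return f_block, b, g, h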

Claim 3.3.

If f is monotone, the tester passes all 3 tests with probability 1.

Proof of claim.

To see that g=f, observe that if v=\mathsf{block}(x) is not a boundary block then f(\mathsf{block}^{-\downarrow}(v))=f(\mathsf{block}^{-\uparrow}(v)). If f(x)\neq f^{\mathsf{block}}(\mathsf{block}(x)) then f(x)\neq f(\mathsf{block}^{-\downarrow}(v)) and f(x)\neq f(\mathsf{block}^{-\uparrow}(v)) while \mathsf{block}^{-\downarrow}(v)\preceq x\preceq\mathsf{block}^{-\uparrow}(v), and this is a violation of the monotonicity of f. Therefore f will pass the first test with probability 1.

To see that f passes the second test with probability 1, observe that if f had 2 boundary blocks in some diagonal, then there would be boundary blocks u,v\in[r]^{d} such that \mathsf{block}^{-\uparrow}(u)\prec\mathsf{block}^{-\downarrow}(v). But then there are x,y\in[n]^{d} such that \mathsf{block}(x)=u,\mathsf{block}(y)=v and f(x)=1,f(y)=0; since x\preceq\mathsf{block}^{-\uparrow}(u)\prec\mathsf{block}^{-\downarrow}(v)\preceq y, this contradicts the monotonicity of f. So f has at most 1 boundary block in each diagonal.

To see that h is monotone, it is sufficient to consider the boundary blocks, since all other values are the same as f^{\mathsf{block}}. Let v\in[r]^{d} be a boundary block, so there exist x,y\in[n]^{d} such that \mathsf{block}(x)=\mathsf{block}(y)=v and f(x)=1,f(y)=0. Suppose u\prec v is not a boundary block (if it is a boundary block then h(u)=h(v)=0). If h(u)=1 then f(\mathsf{block}^{-\downarrow}(u))=1, but \mathsf{block}^{-\downarrow}(u)\prec\mathsf{block}^{-\downarrow}(v)\preceq y while f(\mathsf{block}^{-\downarrow}(u))>f(y), a contradiction. So it must be that h(u)=0 whenever u\prec v. For any block u\in[r]^{d} such that v\prec u, we have 0=h(v)\leq h(u), so monotonicity holds. Since the tester of Black, Chakrabarty, & Seshadhri has one-sided error, the test passes with probability 1. ∎

Claim 3.4.

If gg is ϵ/4\epsilon/4-close to ff, bb is ϵ/4\epsilon/4-close to diagonal, and hh is ϵ/4\epsilon/4-close to monotone, then ff is ϵ\epsilon-close to monotone.

Proof of claim.

Let h𝖼𝗈𝖺𝗋𝗌𝖾:[n]d{0,1}h^{\mathsf{coarse}}:[n]^{d}\to\{0,1\} be the function h𝖼𝗈𝖺𝗋𝗌𝖾(x)=h(𝖻𝗅𝗈𝖼𝗄(x))h^{\mathsf{coarse}}(x)=h(\mathsf{block}(x)). Suppose that f(x)h𝖼𝗈𝖺𝗋𝗌𝖾(x)f(x)\neq h^{\mathsf{coarse}}(x). If v=𝖻𝗅𝗈𝖼𝗄(x)v=\mathsf{block}(x) is not a boundary block of ff then h𝖼𝗈𝖺𝗋𝗌𝖾(x)=h(v)=f𝖻𝗅𝗈𝖼𝗄(v)=g(x)h^{\mathsf{coarse}}(x)=h(v)=f^{\mathsf{block}}(v)=g(x), so f(x)g(x)f(x)\neq g(x). If vv is a boundary block then h𝖼𝗈𝖺𝗋𝗌𝖾(x)=h(v)=0h^{\mathsf{coarse}}(x)=h(v)=0 so f(x)=1f(x)=1, and b(v)=1b(v)=1.

Suppose for contradiction that there are more than (ε/2)r^d boundary blocks v ∈ [r]^d, so that b has more than (ε/2)r^d 1-valued points. Any diagonal function has at most dr^{d-1} 1-valued points. Therefore the distance of b to the nearest diagonal function is more than

rd(ϵ2rddrd1)=ϵ2dr=ϵ2ϵ4=ϵ4,r^{-d}\left(\frac{\epsilon}{2}r^{d}-dr^{d-1}\right)=\frac{\epsilon}{2}-\frac{d}{r}=\frac{\epsilon}{2}-\frac{\epsilon}{4}=\frac{\epsilon}{4}\,,

a contradiction. So ff has at most ϵ2rd\tfrac{\epsilon}{2}r^{d} boundary blocks. Now

𝖽𝗂𝗌𝗍(f,h𝖼𝗈𝖺𝗋𝗌𝖾)\displaystyle\mathsf{dist}(f,h^{\mathsf{coarse}}) =𝖽𝗂𝗌𝗍(f,g)+x[n]d[f(x)=1,𝖻𝗅𝗈𝖼𝗄(x) is a boundary block]ϵ4+rdϵrd2=34ϵ.\displaystyle=\mathsf{dist}(f,g)+\underset{x\sim[n]^{d}}{\mathbb{P}}\left[f(x)=1,\mathsf{block}(x)\text{ is a boundary block}\right]\leq\frac{\epsilon}{4}+r^{-d}\cdot\frac{\epsilon r^{d}}{2}=\frac{3}{4}\epsilon\,.

Let p:[r]d{0,1}p:[r]^{d}\to\{0,1\} be a monotone function minimizing the distance to hh, and let p𝖼𝗈𝖺𝗋𝗌𝖾:[n]d{0,1}p^{\mathsf{coarse}}:[n]^{d}\to\{0,1\} be the function p𝖼𝗈𝖺𝗋𝗌𝖾(x)=p(𝖻𝗅𝗈𝖼𝗄(x))p^{\mathsf{coarse}}(x)=p(\mathsf{block}(x)). Then

𝖽𝗂𝗌𝗍(h𝖼𝗈𝖺𝗋𝗌𝖾,p𝖼𝗈𝖺𝗋𝗌𝖾)=x[n]d[h(𝖻𝗅𝗈𝖼𝗄(x))p(𝖻𝗅𝗈𝖼𝗄(x))]=v[r]d[h(v)p(v)]ϵ/4.\mathsf{dist}(h^{\mathsf{coarse}},p^{\mathsf{coarse}})=\underset{x\sim[n]^{d}}{\mathbb{P}}\left[h(\mathsf{block}(x))\neq p(\mathsf{block}(x))\right]=\underset{v\sim[r]^{d}}{\mathbb{P}}\left[h(v)\neq p(v)\right]\leq\epsilon/4\,.

Finally, the distance of ff to the nearest monotone function is at most

𝖽𝗂𝗌𝗍(f,p𝖼𝗈𝖺𝗋𝗌𝖾)𝖽𝗂𝗌𝗍(f,h𝖼𝗈𝖺𝗋𝗌𝖾)+𝖽𝗂𝗌𝗍(h𝖼𝗈𝖺𝗋𝗌𝖾,p𝖼𝗈𝖺𝗋𝗌𝖾)34ϵ+14ϵ=ϵ.\mathsf{dist}(f,p^{\mathsf{coarse}})\leq\mathsf{dist}(f,h^{\mathsf{coarse}})+\mathsf{dist}(h^{\mathsf{coarse}},p^{\mathsf{coarse}})\leq\frac{3}{4}\epsilon+\frac{1}{4}\epsilon=\epsilon\,.\qed

These two claims suffice to establish the theorem. ∎

3.2 Monotonicity Testing for Product Distributions

The previous section used a special case of downsampling, tailored for the uniform distribution over [n]^d. We will call a product distribution μ = μ_1 × ⋯ × μ_d over R^d continuous if each of its factors μ_i is continuous (i.e. absolutely continuous with respect to the Lebesgue measure). The proof for discrete distributions is in Appendix A.

See 1.1.1

Proof.

We follow the proof of Theorem 3.2, with some small changes. Let r=16d/ϵr=\lceil 16d/\epsilon\rceil. The tester first samples a grid XX with length m=O(rd2ϵ2log(rd))m=O\left(\frac{rd^{2}}{\epsilon^{2}}\log(rd)\right) and constructs the induced (r+2)(r+2)-block partition, with cells labeled {0,,r+1}d\{0,\dotsc,r+1\}^{d}. We call a block v{0,,r+1}dv\in\{0,\dotsc,r+1\}^{d} upper extreme if there is some i[d]i\in[d] such that vi=r+1v_{i}=r+1, and we call it lower extreme if there is some i[d]i\in[d] such that vi=0v_{i}=0 but vv is not upper extreme. Call the upper extreme blocks UU and the lower extreme blocks LL. Note that [r]d={0,,r+1}d(UL)[r]^{d}=\{0,\dotsc,r+1\}^{d}\setminus(U\cup L).
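For concreteness, one natural way to realize the induced block partition is to use, in each coordinate, empirical quantiles of the sampled grid as cell boundaries. The following sketch is our illustration of that idea; it omits the extreme cells labelled 0 and r+1 and is not the exact construction of Lemma 2.7.

```python
# Illustrative sketch: an r-block partition induced by a sampled grid X.
import numpy as np

def build_block_partition(X, r):
    """X: (m, d) array of samples from the product distribution.
    Returns block(), mapping a point of R^d to a cell in {0, ..., r-1}^d."""
    m, d = X.shape
    # In each coordinate, take r-1 interior thresholds at the empirical quantiles,
    # so each cell contains roughly a 1/r fraction of the sampled coordinates.
    thresholds = [np.quantile(X[:, i], [j / r for j in range(1, r)]) for i in range(d)]

    def block(x):
        # Number of thresholds strictly below x_i gives the cell index in coordinate i.
        return tuple(int(np.searchsorted(thresholds[i], x[i], side="left"))
                     for i in range(d))

    return block
```

Under a continuous product distribution, a fresh sample then lands in each cell of each coordinate with probability close to 1/r, which is the property that Lemma 2.7 quantifies.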

For each v[r]dv\in[r]^{d}, we again define 𝖻𝗅𝗈𝖼𝗄(v),𝖻𝗅𝗈𝖼𝗄(v)\mathsf{block}^{-\uparrow}(v),\mathsf{block}^{-\downarrow}(v) as, respectively, the supremal and infimal point xdx\in\mathbb{R}^{d} such that 𝖻𝗅𝗈𝖼𝗄(x)=v\mathsf{block}(x)=v. The algorithm will ignore the extreme blocks ULU\cup L, which do not have a supremal or an infimal point. Therefore it is not defined whether these blocks are boundary blocks.

By Lemma 2.7, with probability at least 5/6, we will have ‖block(μ) − unif({0,…,r+1}^d)‖_TV ≤ ε/8. We define b, h as before, with domain [r]^d. Define g similarly but with domain R^d and values

g(x) = 1 if block(x) ∈ U;   0 if block(x) ∈ L;   f(x) if block(x) ∈ [r]^d is a boundary block;   f^block(block(x)) otherwise.

If f is monotone, it may now be the case that f ≠ g, but we will have f(x) = g(x) for all x with block(x) ∈ [r]^d, which is where the algorithm makes its queries. The algorithm will test whether f(x) = g(x) on all x with block(x) ∈ [r]^d, or whether f is ε/8-far from this property, which can again be done with O(1/ε) samples. Note that if f is ε/8-close to having this property, then

dist_μ(f,g) ≤ P_{x∼μ}[ block(x) ∉ [r]^d ] + ε/8
 ≤ 2d(r+2)^{d−1}/(r+2)^d + ‖block(μ) − unif({0,…,r+1}^d)‖_TV + ε/8
 ≤ ε/8 + ε/4 + ε/8 ≤ ε/2.

The algorithm then proceeds as before, with error parameter ε/2. To test whether g = f, the algorithm samples from μ and throws away any sample x ∈ R^d with block(x) ∉ [r]^d. It then tests b and h using the uniform distribution on [r]^d. It suffices to prove the following claim, which replaces Claim 3.4.

Claim 3.5.

If gg is ϵ/2\epsilon/2-close to ff, bb is ϵ/16\epsilon/16-close to diagonal, and hh is ϵ/8\epsilon/8-close to monotone, then ff is ϵ\epsilon-close to monotone.

Proof of claim.

Let p:[r]d{0,1}p:[r]^{d}\to\{0,1\} be a monotone function minimizing the distance to hh. Then p(v)h(v)p(v)\neq h(v) on at most ϵrd8\frac{\epsilon r^{d}}{8} blocks v[r]dv\in[r]^{d}. Define p𝖼𝗈𝖺𝗋𝗌𝖾:d{0,1}p^{\mathsf{coarse}}:\mathbb{R}^{d}\to\{0,1\} as p𝖼𝗈𝖺𝗋𝗌𝖾(x)=p(𝖻𝗅𝗈𝖼𝗄(x))p^{\mathsf{coarse}}(x)=p(\mathsf{block}(x)) when 𝖻𝗅𝗈𝖼𝗄(x)[r]d\mathsf{block}(x)\in[r]^{d}, and p𝖼𝗈𝖺𝗋𝗌𝖾(x)=g(x)p^{\mathsf{coarse}}(x)=g(x) when 𝖻𝗅𝗈𝖼𝗄(x)UL\mathsf{block}(x)\in U\cup L. Note that p𝖼𝗈𝖺𝗋𝗌𝖾p^{\mathsf{coarse}} is monotone.

By the triangle inequality,

𝖽𝗂𝗌𝗍μ(f,p𝖼𝗈𝖺𝗋𝗌𝖾)𝖽𝗂𝗌𝗍μ(f,g)+𝖽𝗂𝗌𝗍μ(g,p𝖼𝗈𝖺𝗋𝗌𝖾).\mathsf{dist}_{\mu}(f,p^{\mathsf{coarse}})\leq\mathsf{dist}_{\mu}(f,g)+\mathsf{dist}_{\mu}(g,p^{\mathsf{coarse}})\,.

From above, we know 𝖽𝗂𝗌𝗍μ(f,g)ϵ/2\mathsf{dist}_{\mu}(f,g)\leq\epsilon/2. To bound the second term, observe that since bb is ϵ/16\epsilon/16-close to diagonal, there are at most

ϵ16rd+drd1ϵ16rd+drrdϵ16rd+ϵ16rd=ϵ8rd\frac{\epsilon}{16}r^{d}+dr^{d-1}\leq\frac{\epsilon}{16}r^{d}+\frac{d}{r}r^{d}\leq\frac{\epsilon}{16}r^{d}+\frac{\epsilon}{16}r^{d}=\frac{\epsilon}{8}r^{d}

boundary blocks. Then observe that if g(x)p𝖼𝗈𝖺𝗋𝗌𝖾(x)g(x)\neq p^{\mathsf{coarse}}(x) then 𝖻𝗅𝗈𝖼𝗄(x)[r]d\mathsf{block}(x)\in[r]^{d} and either 𝖻𝗅𝗈𝖼𝗄(x)\mathsf{block}(x) is a boundary block, or g(x)=f𝖻𝗅𝗈𝖼𝗄(𝖻𝗅𝗈𝖼𝗄(x))=h(𝖻𝗅𝗈𝖼𝗄(x))g(x)=f^{\mathsf{block}}(\mathsf{block}(x))=h(\mathsf{block}(x)) and h(𝖻𝗅𝗈𝖼𝗄(x))p(𝖻𝗅𝗈𝖼𝗄(x))h(\mathsf{block}(x))\neq p(\mathsf{block}(x)). Then

𝖽𝗂𝗌𝗍μ(g,p𝖼𝗈𝖺𝗋𝗌𝖾)\displaystyle\mathsf{dist}_{\mu}(g,p^{\mathsf{coarse}}) (1(r+2)dv[r]d𝟏[v is a boundary block, or h(v)p(v)])\displaystyle\leq\left(\frac{1}{(r+2)^{d}}\sum_{v\in[r]^{d}}\mathbf{1}\left[v\text{ is a boundary block, or }h(v)\neq p(v)\right]\right)
+𝖻𝗅𝗈𝖼𝗄(μ)𝗎𝗇𝗂𝖿({0,,r+1}d)𝖳𝖵\displaystyle\qquad+\|\mathsf{block}(\mu)-\mathsf{unif}(\{0,\dotsc,r+1\}^{d})\|_{\mathsf{TV}}
ϵrd8rd+ϵrd8rd+ϵ4ϵ2.\displaystyle\leq\frac{\epsilon r^{d}}{8r^{d}}+\frac{\epsilon r^{d}}{8r^{d}}+\frac{\epsilon}{4}\leq\frac{\epsilon}{2}\,.\qed

4 Learning and Testing Functions of Convex Sets

In this section we present our learning and testing results for functions of kk convex sets: an agnostic learning algorithm, a sample-based distance approximator, and a sample-based one-sided tester. All our algorithms will follow from more general results that actually hold for any class \mathcal{H} with bounded rr-block boundary size; this shows that bounded block-boundary size is sufficient to guarantee learnability in product distributions.

Let 𝒞\mathcal{C} be the set of functions f:d{±1}f:\mathbb{R}^{d}\to\{\pm 1\} such that f1(1)f^{-1}(1) is convex. Let k\mathcal{B}_{k} be the set of all Boolean functions h:{±1}k{±1}h:\{\pm 1\}^{k}\to\{\pm 1\}.

Definition 4.1 (Function Composition).

For a set \mathcal{H} of functions h:d{±1}h:\mathbb{R}^{d}\to\{\pm 1\}, we will define the composition k\mathcal{B}_{k}\circ\mathcal{H} as the set of functions of the form f(x)=g(h1(x),,hk(x))f(x)=g(h_{1}(x),\dotsc,h_{k}(x)) where gkg\in\mathcal{B}_{k} and each hih_{i} belongs to \mathcal{H}.

Proposition 4.2.

Let \mathcal{H} be any class of functions d{±1}\mathbb{R}^{d}\to\{\pm 1\} and fix any rr. Then 𝖻𝖻𝗌(k,r)k𝖻𝖻𝗌(,r)\mathsf{bbs}(\mathcal{B}_{k}\circ\mathcal{H},r)\leq k\cdot\mathsf{bbs}(\mathcal{H},r).

Proof.

If f()=g(h1(),,hk())f(\cdot)=g(h_{1}(\cdot),\dotsc,h_{k}(\cdot)) is not constant on 𝖻𝗅𝗈𝖼𝗄1(v)\mathsf{block}^{-1}(v) then one of the hih_{i} is not constant on that block. Therefore 𝖻𝖻𝗌(f,r)i=1k𝖻𝖻𝗌(hi,r)k𝖻𝖻𝗌(,r)\mathsf{bbs}(f,r)\leq\sum_{i=1}^{k}\mathsf{bbs}(h_{i},r)\leq k\cdot\mathsf{bbs}(\mathcal{H},r). ∎

Lemma 4.3.

For any rr, 𝖻𝖻𝗌(k𝒞,r)2dkrd1\mathsf{bbs}(\mathcal{B}_{k}\circ\mathcal{C},r)\leq 2dkr^{d-1}.

Proof.

We prove 𝖻𝖻𝗌(𝒞,r)2drd1\mathsf{bbs}(\mathcal{C},r)\leq 2dr^{d-1} by induction on dd; the result will hold by Proposition 4.2. Let 𝖻𝖻𝗌(𝒞,r,d)\mathsf{bbs}(\mathcal{C},r,d) be the rr-block boundary size in dimension dd. Recall that block v[r]dv\in[r]^{d} is the set Bv=B1,v1××Bd,vdB_{v}=B_{1,v_{1}}\times\dotsm\times B_{d,v_{d}} where Bi,j=(ai,j1,ai,j]B_{i,j}=(a_{i,j-1},a_{i,j}] for some ai,j1<ai,ja_{i,j-1}<a_{i,j}. Let f𝒞f\in\mathcal{C}.

For d=1d=1, if there are 3 intervals B1,i1,B1,i2,B1,i3B_{1,i_{1}},B_{1,i_{2}},B_{1,i_{3}}, i1<i2<i3i_{1}<i_{2}<i_{3}, on which ff is not constant, then within each interval the function takes both values {±1}\{\pm 1\}. Thus, there are points aB1,i1,bB1,i2,cB1,i3a\in B_{1,i_{1}},b\in B_{1,i_{2}},c\in B_{1,i_{3}} such that f(a)=1,f(b)=1,f(c)=1f(a)=1,f(b)=-1,f(c)=1, which is a contradiction.

For each block B_v, let A_v = {a_{1,v_1}} × B_{2,v_2} × ⋯ × B_{d,v_d} be the “upper face”. For d > 1, let P ⊆ [r]^d be the set of non-constant blocks B_v such that f is constant on the upper face, and let Q be the set of non-constant blocks that are non-constant on the upper face, so that bbs(f,r,d) = |P| + |Q|. We argue that |P| ≤ 2r^{d−1}: for a vector w ∈ [r]^{d−1} define the line L_w := {v ∈ [r]^d | ∀i>1, v_i = w_i}. If |P ∩ L_w| ≥ 3 then there are t,u,v ∈ L_w with t < u < v such that f is constant on A_t, A_u, A_v but non-constant on B_t, B_u, B_v. Let x, y, z be points in B_t, B_u, B_v respectively such that f(x) = f(y) = f(z) = 1. If f were constant −1 on A_t or on A_u this would be a contradiction, since the segment from x to y crosses A_t, the segment from y to z crosses A_u, and both segments lie in the convex set f^{-1}(1); so f is constant 1 on A_t and A_u. But then, since B_u is non-constant, there is a point q ∈ B_u with f(q) = −1, which is a contradiction since q lies within the convex hull of A_t ∪ A_u. So |L_w ∩ P| < 3; since there are at most r^{d−1} lines L_w, we get |P| ≤ 2r^{d−1}.

To bound |Q||Q|, observe that for each block vQ,fv\in Q,f is non-constant on the plane {a1,v1}×d1\{a_{1,v_{1}}\}\times\mathbb{R}^{d-1}, there are (r1)(r-1) such planes, ff is convex on each, and the rr-block partition induces an rr-block partition on the plane where ff is non-constant on the corresponding block. Then, by induction |Q|(r1)𝖻𝖻𝗌(𝒞,r,d1)2(d1)(r1)rd2|Q|\leq(r-1)\cdot\mathsf{bbs}(\mathcal{C},r,d-1)\leq 2(d-1)(r-1)r^{d-2}. So

𝖻𝖻𝗌(𝒞,r,d)2[(d1)(r1)rd2+rd1]<2drd1.\mathsf{bbs}(\mathcal{C},r,d)\leq 2\left[(d-1)(r-1)r^{d-2}+r^{d-1}\right]<2dr^{d-1}\,.\qed

The above two lemmas combine to show that rd𝖻𝖻𝗌(k𝒞,r)rd(2dkrd1)=2dk/rϵr^{-d}\cdot\mathsf{bbs}(\mathcal{B}_{k}\circ\mathcal{C},r)\leq r^{-d}(2dkr^{d-1})=2dk/r\leq\epsilon when r=2dk/ϵr=\lceil 2dk/\epsilon\rceil.

4.1 Sample-based One-sided Tester

First, we prove a one-sided sample-based tester for convex sets.

See 1.1.4

Proof.

We prove the result for continuous distributions. The proof for finite distributions is in Theorem A.18.

On input distribution μ\mu and function ff, let r=6d/ϵr=\lceil 6d/\epsilon\rceil so that rd𝖻𝖻𝗌(𝒞,r)ϵ/3r^{-d}\cdot\mathsf{bbs}(\mathcal{C},r)\leq\epsilon/3.

  1. 1.

    Sample a grid XX of size m=O(rd2ϵ2log(rd/ϵ))m=O\left(\frac{rd^{2}}{\epsilon^{2}}\log(rd/\epsilon)\right) large enough that Lemma 2.7 guarantees 𝖻𝗅𝗈𝖼𝗄(μ)𝗎𝗇𝗂𝖿([r]d)𝖳𝖵<ϵ/9\|\mathsf{block}(\mu)-\mathsf{unif}([r]^{d})\|_{\mathsf{TV}}<\epsilon/9 with probability 5/65/6.

  2. 2.

    Take q=O(rdϵ)q=O\left(\frac{r^{d}}{\epsilon}\right) samples QQ and accept if there exists h𝒞h\in\mathcal{C} such that f(x)=h𝖼𝗈𝖺𝗋𝗌𝖾(x)f(x)=h^{\mathsf{coarse}}(x) on all xQx\in Q that are not in a boundary block of hh.

This tester is one-sided since for any h ∈ C, h(x) = h^coarse(x) for all x ∈ Q that are not in a boundary block of h, regardless of whether the r-block decomposition induced by X satisfies ‖block(μ) − unif([r]^d)‖_TV < ε/9. Now suppose that dist_μ(f,C) > ε, and suppose that ‖block(μ) − unif([r]^d)‖_TV < ε/9. For h ∈ C, let B_h ⊆ [r]^d be the set of non-constant blocks of h. If there exists h ∈ C such that P_{x∼μ}[h^coarse(x) ≠ f(x) ∧ block(x) ∉ B_h] < ε/9, then

𝖽𝗂𝗌𝗍μ(f,h𝖼𝗈𝖺𝗋𝗌𝖾)\displaystyle\mathsf{dist}_{\mu}(f,h^{\mathsf{coarse}}) xμ[𝖻𝗅𝗈𝖼𝗄(x)Bh]+xμ[h𝖼𝗈𝖺𝗋𝗌𝖾(x)f(x)𝖻𝗅𝗈𝖼𝗄(x)Bh]\displaystyle\leq\underset{x\sim\mu}{\mathbb{P}}\left[\mathsf{block}(x)\in B_{h}\right]+\underset{x\sim\mu}{\mathbb{P}}\left[h^{\mathsf{coarse}}(x)\neq f(x)\wedge\mathsf{block}(x)\notin B_{h}\right]
rd𝖻𝖻𝗌(𝒞,r)+𝖻𝗅𝗈𝖼𝗄(μ)𝗎𝗇𝗂𝖿([r]d)𝖳𝖵+ϵ9\displaystyle\leq r^{-d}\cdot\mathsf{bbs}(\mathcal{C},r)+\|\mathsf{block}(\mu)-\mathsf{unif}([r]^{d})\|_{\mathsf{TV}}+\frac{\epsilon}{9}
(13+29)ϵ=59ϵ.\displaystyle\leq\left(\frac{1}{3}+\frac{2}{9}\right)\epsilon=\frac{5}{9}\cdot\epsilon\,.

Therefore

𝖽𝗂𝗌𝗍μ(f,h)\displaystyle\mathsf{dist}_{\mu}(f,h) 𝖽𝗂𝗌𝗍μ(f,h𝖼𝗈𝖺𝗋𝗌𝖾)+𝖽𝗂𝗌𝗍μ(h𝖼𝗈𝖺𝗋𝗌𝖾,h)\displaystyle\leq\mathsf{dist}_{\mu}(f,h^{\mathsf{coarse}})+\mathsf{dist}_{\mu}(h^{\mathsf{coarse}},h)
𝖽𝗂𝗌𝗍μ(f,h𝖼𝗈𝖺𝗋𝗌𝖾)+rd𝖻𝖻𝗌(𝒞,r)+𝖻𝗅𝗈𝖼𝗄(μ)𝗎𝗇𝗂𝖿([r]d)𝖳𝖵\displaystyle\leq\mathsf{dist}_{\mu}(f,h^{\mathsf{coarse}})+r^{-d}\cdot\mathsf{bbs}(\mathcal{C},r)+\|\mathsf{block}(\mu)-\mathsf{unif}([r]^{d})\|_{\mathsf{TV}}
59ϵ+13ϵ+19ϵ=ϵ,\displaystyle\leq\frac{5}{9}\epsilon+\frac{1}{3}\cdot\epsilon+\frac{1}{9}\epsilon=\epsilon\,,

a contradiction. So it must be that for every h ∈ C, P_{x∼μ}[f(x) ≠ h^coarse(x) ∧ block(x) ∉ B_h] ≥ ε/9. There are at most binom(r^d, (ε/3)r^d) ≤ (3e/ε)^{εr^d/3} choices of boundary set B. Because the 1-valued blocks must be the convex hull of the boundary points, for each boundary set B there are at most 2 choices of function h^coarse with boundary B (with a second choice occurring when the complement of h^coarse is also a convex set with the same boundary). Therefore, by the union bound, the probability that f is accepted is at most

(3e/ε)^{(ε/3)r^d} · (1 − ε/9)^q ≤ exp( (εr^d/3)·ln(3e/ε) − qε/9 ),

which is at most 1/6 for a suitable q = O(r^d log(1/ε) + 1/ε) = O(r^d/ε), consistent with the choice of q in step 2. ∎
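The computational core of step 2 is a consistency check against convex sets. As an illustration only, ignoring the block and boundary-block bookkeeping and working directly with the sample points, consistency with some convex set amounts to checking that no (−1)-labelled point lies in the convex hull of the (+1)-labelled points, a linear-programming feasibility question; the helper names below are ours.

```python
# Illustrative sketch of the convexity consistency check (not the exact test of step 2).
import numpy as np
from scipy.optimize import linprog

def in_convex_hull(q, P):
    """True iff q is a convex combination of the rows of P (LP feasibility)."""
    k = P.shape[0]
    A_eq = np.vstack([P.T, np.ones((1, k))])      # P^T lambda = q  and  sum(lambda) = 1
    b_eq = np.append(q, 1.0)
    res = linprog(c=np.zeros(k), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * k)
    return res.success

def consistent_with_some_convex_set(points, labels):
    pos = points[labels == 1]
    neg = points[labels == -1]
    if len(pos) == 0:
        return True
    return not any(in_convex_hull(q, pos) for q in neg)
```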

4.2 Sample-based Distance Approximator

Our sample-based distance approximator follows from the following general result.

Lemma 4.4.

For any set \mathcal{H} of functions d{±1}\mathbb{R}^{d}\to\{\pm 1\}, ϵ>0\epsilon>0, and rr satisfying rd𝖻𝖻𝗌(,r)ϵ/3r^{-d}\cdot\mathsf{bbs}(\mathcal{H},r)\leq\epsilon/3, there is a sample-based distribution-free algorithm for product distributions that approximates distance to \mathcal{H} up to additive error ϵ\epsilon using O(rdϵ2)O\left(\frac{r^{d}}{\epsilon^{2}}\right) samples.

Proof.

On input distribution μ and function f: R^d → {±1}, let r be as in the statement, so that r^{-d}·bbs(H,r) ≤ ε/3; then:

  1. 1.

    Sample a grid XX of size m=O(rd2ϵ2logrdϵ)m=O(\frac{rd^{2}}{\epsilon^{2}}\log\frac{rd}{\epsilon}) large enough that Lemma 2.7 guarantees 𝖻𝗅𝗈𝖼𝗄(μ)𝗎𝗇𝗂𝖿([r]d)𝖳𝖵<ϵ/3\|\mathsf{block}(\mu)-\mathsf{unif}([r]^{d})\|_{\mathsf{TV}}<\epsilon/3 with probability 5/65/6.

  2. 2.

    Let 𝖼𝗈𝖺𝗋𝗌𝖾\mathcal{H}^{\mathsf{coarse}} be the set of all functions h𝖼𝗈𝖺𝗋𝗌𝖾h^{\mathsf{coarse}} where hh\in\mathcal{H}; note that |𝖼𝗈𝖺𝗋𝗌𝖾|2rd|\mathcal{H}^{\mathsf{coarse}}|\leq 2^{r^{d}}.

  3. 3.

    Draw q=O(rdϵ2)q=O\left(\frac{r^{d}}{\epsilon^{2}}\right) samples QQ and output the distance on QQ to the nearest function in 𝖼𝗈𝖺𝗋𝗌𝖾\mathcal{H}^{\mathsf{coarse}}.

We argue that, with probability at least 5/6, H^coarse is a (5/6)ε-cover of H. With probability at least 5/6, ‖block(μ) − unif([r]^d)‖_TV < ε/3. Then by Proposition 2.5, for any h ∈ H,

P_{x∼μ}[ h(x) ≠ h^coarse(x) ] ≤ r^{-d}·bbs(h,r) + ‖block(μ) − unif([r]^d)‖_TV ≤ (1/3 + 1/3)ε = (2/3)ε ≤ (5/6)ε,

so H^coarse is a (5/6)ε-cover; assume this event occurs.

Write 𝖽𝗂𝗌𝗍Q(f,g):=1qxQ𝟏[f(x)g(x)]\mathsf{dist}_{Q}(f,g):=\frac{1}{q}\sum_{x\in Q}\mathbf{1}\left[f(x)\neq g(x)\right]. By the union bound and Hoeffding’s inequality, with qq samples we fail to get an estimate of 𝖽𝗂𝗌𝗍μ(f,𝖼𝗈𝖺𝗋𝗌𝖾)\mathsf{dist}_{\mu}(f,\mathcal{H}^{\mathsf{coarse}}) up to additive error 16ϵ\frac{1}{6}\epsilon with probability at most

|𝖼𝗈𝖺𝗋𝗌𝖾|maxh𝖼𝗈𝖺𝗋𝗌𝖾𝖼𝗈𝖺𝗋𝗌𝖾𝑄[|𝖽𝗂𝗌𝗍μ(f,h𝖼𝗈𝖺𝗋𝗌𝖾)𝖽𝗂𝗌𝗍Q(f,h𝖼𝗈𝖺𝗋𝗌𝖾)|>16ϵ]|𝖼𝗈𝖺𝗋𝗌𝖾|exp(2qϵ236)<16|\mathcal{H}^{\mathsf{coarse}}|\cdot\max_{h^{\mathsf{coarse}}\in\mathcal{H}^{\mathsf{coarse}}}\underset{Q}{\mathbb{P}}\left[\left|\mathsf{dist}_{\mu}(f,h^{\mathsf{coarse}})-\mathsf{dist}_{Q}(f,h^{\mathsf{coarse}})\right|>\frac{1}{6}\epsilon\right]\leq|\mathcal{H}^{\mathsf{coarse}}|\mathrm{exp}\left(-2\frac{q\epsilon^{2}}{36}\right)<\frac{1}{6}

for appropriately chosen q=O(1ϵ2log(|𝖼𝗈𝖺𝗋𝗌𝖾|))=O(rdϵ2)q=O\left(\frac{1}{\epsilon^{2}}\log(|\mathcal{H}^{\mathsf{coarse}}|)\right)=O\left(\frac{r^{d}}{\epsilon^{2}}\right). Assume this event occurs. We want to show that |𝖽𝗂𝗌𝗍Q(f,𝖼𝗈𝖺𝗋𝗌𝖾)𝖽𝗂𝗌𝗍μ(f,)|ϵ|\mathsf{dist}_{Q}(f,\mathcal{H}^{\mathsf{coarse}})-\mathsf{dist}_{\mu}(f,\mathcal{H})|\leq\epsilon. Let hh\in\mathcal{H} minimize 𝖽𝗂𝗌𝗍μ(f,h)\mathsf{dist}_{\mu}(f,h) so 𝖽𝗂𝗌𝗍μ(f,h)=𝖽𝗂𝗌𝗍μ(f,)\mathsf{dist}_{\mu}(f,h)=\mathsf{dist}_{\mu}(f,\mathcal{H}). Then

𝖽𝗂𝗌𝗍Q(f,𝖼𝗈𝖺𝗋𝗌𝖾)\displaystyle\mathsf{dist}_{Q}(f,\mathcal{H}^{\mathsf{coarse}}) 𝖽𝗂𝗌𝗍Q(f,h𝖼𝗈𝖺𝗋𝗌𝖾)𝖽𝗂𝗌𝗍μ(f,h𝖼𝗈𝖺𝗋𝗌𝖾)+ϵ6\displaystyle\leq\mathsf{dist}_{Q}(f,h^{\mathsf{coarse}})\leq\mathsf{dist}_{\mu}(f,h^{\mathsf{coarse}})+\frac{\epsilon}{6}
𝖽𝗂𝗌𝗍μ(f,h)+𝖽𝗂𝗌𝗍μ(h,h𝖼𝗈𝖺𝗋𝗌𝖾)+ϵ6𝖽𝗂𝗌𝗍μ(f,)+ϵ.\displaystyle\leq\mathsf{dist}_{\mu}(f,h)+\mathsf{dist}_{\mu}(h,h^{\mathsf{coarse}})+\frac{\epsilon}{6}\leq\mathsf{dist}_{\mu}(f,\mathcal{H})+\epsilon\,.

Now let gg\in\mathcal{H} minimize 𝖽𝗂𝗌𝗍Q(f,g𝖼𝗈𝖺𝗋𝗌𝖾)\mathsf{dist}_{Q}(f,g^{\mathsf{coarse}}) so 𝖽𝗂𝗌𝗍Q(f,g𝖼𝗈𝖺𝗋𝗌𝖾)=𝖽𝗂𝗌𝗍Q(f,𝖼𝗈𝖺𝗋𝗌𝖾)\mathsf{dist}_{Q}(f,g^{\mathsf{coarse}})=\mathsf{dist}_{Q}(f,\mathcal{H}^{\mathsf{coarse}}). Then

𝖽𝗂𝗌𝗍Q(f,𝖼𝗈𝖺𝗋𝗌𝖾)\displaystyle\mathsf{dist}_{Q}(f,\mathcal{H}^{\mathsf{coarse}}) =𝖽𝗂𝗌𝗍Q(f,g𝖼𝗈𝖺𝗋𝗌𝖾)𝖽𝗂𝗌𝗍μ(f,g𝖼𝗈𝖺𝗋𝗌𝖾)ϵ6𝖽𝗂𝗌𝗍μ(f,h𝖼𝗈𝖺𝗋𝗌𝖾)ϵ6\displaystyle=\mathsf{dist}_{Q}(f,g^{\mathsf{coarse}})\geq\mathsf{dist}_{\mu}(f,g^{\mathsf{coarse}})-\frac{\epsilon}{6}\geq\mathsf{dist}_{\mu}(f,h^{\mathsf{coarse}})-\frac{\epsilon}{6}
𝖽𝗂𝗌𝗍μ(f,h)𝖽𝗂𝗌𝗍μ(h,h𝖼𝗈𝖺𝗋𝗌𝖾)ϵ6𝖽𝗂𝗌𝗍μ(f,h)ϵ,\displaystyle\geq\mathsf{dist}_{\mu}(f,h)-\mathsf{dist}_{\mu}(h,h^{\mathsf{coarse}})-\frac{\epsilon}{6}\geq\mathsf{dist}_{\mu}(f,h)-\epsilon\,,

which concludes the proof. ∎
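In code, step 3 is a straightforward empirical minimization once H^coarse has been enumerated; the enumeration is the expensive part and is taken as given below, and the names are illustrative.

```python
# Illustrative sketch of step 3: distance of f to the nearest coarse hypothesis on the sample Q.
def estimate_distance(samples, labels, block, coarse_hypotheses):
    """samples: points x in Q; labels: f(x); block: the r-block partition;
    coarse_hypotheses: iterable of functions mapping cells of [r]^d to {+1, -1}."""
    q = len(samples)
    best = 1.0
    for h in coarse_hypotheses:
        disagreements = sum(1 for x, y in zip(samples, labels) if h(block(x)) != y)
        best = min(best, disagreements / q)
    return best
```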

Applying the bound on bbs(B_k ∘ C, r) from Lemma 4.3 to Lemma 4.4, we conclude: See 1.1.4

4.3 Agnostic Learning

We begin our learning results with an agnostic learning algorithm for functions of kk convex sets: the class k𝒞\mathcal{B}_{k}\circ\mathcal{C}. For a distribution 𝒟\mathcal{D} over d×{±1}\mathbb{R}^{d}\times\{\pm 1\} and an rr-block partition 𝖻𝗅𝗈𝖼𝗄:d[r]d\mathsf{block}:\mathbb{R}^{d}\to[r]^{d}, define the distribution 𝒟𝖻𝗅𝗈𝖼𝗄\mathcal{D}^{\mathsf{block}} over [r]d×{±1}[r]^{d}\times\{\pm 1\} as the distribution of (𝖻𝗅𝗈𝖼𝗄(x),b)(\mathsf{block}(x),b) when (x,b)𝒟(x,b)\sim\mathcal{D}.

Lemma 4.5.

Let H be any set of functions R^d → {±1}, let ε > 0, and suppose r satisfies r^{-d}·bbs(H,r) ≤ ε/3. Then there is a distribution-free agnostic learning algorithm for continuous product distributions that learns H with O((r^d + rd² log(rd/ε))/ε²) samples and time.

Proof.

On input distribution 𝒟\mathcal{D}:

  1. 1.

    Sample a grid XX of size m=O(rd2ϵ2log(rd/ϵ))m=O(\frac{rd^{2}}{\epsilon^{2}}\log(rd/\epsilon)) large enough that Lemma 2.7 guarantees 𝖻𝗅𝗈𝖼𝗄(μ)𝗎𝗇𝗂𝖿([r]d)𝖳𝖵<ϵ/3\|\mathsf{block}(\mu)-\mathsf{unif}([r]^{d})\|_{\mathsf{TV}}<\epsilon/3 with probability 5/65/6, where 𝖻𝗅𝗈𝖼𝗄:d[r]d\mathsf{block}:\mathbb{R}^{d}\to[r]^{d} is the induced rr-block partition.

  2. 2.

    Agnostically learn a function g:[r]d{±1}g:[r]^{d}\to\{\pm 1\} with error ϵ/3\epsilon/3 and success probability 5/65/6 using O(rd/ϵ2)O(r^{d}/\epsilon^{2}) samples from 𝒟𝖻𝗅𝗈𝖼𝗄\mathcal{D}^{\mathsf{block}}. Output the function g𝖻𝗅𝗈𝖼𝗄g\circ\mathsf{block}.

The second step is accomplished via standard learning results ([SSBD14] Theorem 6.8): the number of samples required for agnostic learning is bounded by O(1/ϵ2)O(1/\epsilon^{2}) multiplied by the logarithm of the number of functions in the class, and the number of functions [r]d{±1}[r]^{d}\to\{\pm 1\} is 2rd2^{r^{d}}. Assume that both steps succeed, which occurs with probability at least 2/32/3. Let ff\in\mathcal{H} minimize (x,b)𝒟[f(x)b]\underset{(x,b)\sim\mathcal{D}}{\mathbb{P}}\left[f(x)\neq b\right]. By Proposition 2.5,

P_{x∼μ}[ f(x) ≠ f^coarse(x) ] ≤ r^{-d}·bbs(f,r) + ‖block(μ) − unif([r]^d)‖_TV < 2ε/3.

Then

(x,b)𝒟[g(𝖻𝗅𝗈𝖼𝗄(x))b]\displaystyle\underset{(x,b)\sim\mathcal{D}}{\mathbb{P}}\left[g(\mathsf{block}(x))\neq b\right] =(v,b)𝒟𝖻𝗅𝗈𝖼𝗄[g(v)b](v,b)𝒟𝖻𝗅𝗈𝖼𝗄[f𝖻𝗅𝗈𝖼𝗄(v)b]+ϵ/3\displaystyle=\underset{(v,b)\sim\mathcal{D}^{\mathsf{block}}}{\mathbb{P}}\left[g(v)\neq b\right]\leq\underset{(v,b)\sim\mathcal{D}^{\mathsf{block}}}{\mathbb{P}}\left[f^{\mathsf{block}}(v)\neq b\right]+\epsilon/3
=(x,b)𝒟[f𝖼𝗈𝖺𝗋𝗌𝖾(x)b]+ϵ/3<(x,b)𝒟[f(x)b]+ϵ.\displaystyle=\underset{(x,b)\sim\mathcal{D}}{\mathbb{P}}\left[f^{\mathsf{coarse}}(x)\neq b\right]+\epsilon/3<\underset{(x,b)\sim\mathcal{D}}{\mathbb{P}}\left[f(x)\neq b\right]+\epsilon\,.\qed
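Since the class being learned in step 2 is all functions [r]^d → {±1}, empirical risk minimization over it reduces to a per-block majority vote over the observed labels, as in the following sketch (the default label for blocks that receive no samples is our arbitrary choice).

```python
# Illustrative sketch of step 2: ERM over all functions [r]^d -> {+1,-1} is a majority vote.
from collections import defaultdict

def learn_block_labels(block_samples):
    """block_samples: iterable of (v, b) pairs with v a cell of [r]^d and b in {+1,-1}.
    Returns g(v), the empirical majority label of cell v."""
    totals = defaultdict(int)
    for v, b in block_samples:
        totals[v] += b

    def g(v):
        return 1 if totals[v] >= 0 else -1        # unseen cells default to +1
    return g
```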

Lemma 4.5 then gives the following result for continuous product distributions, with the result for finite distributions following from Theorem A.18.

See 1.1.4

5 Learning Functions of Halfspaces

A halfspace, or linear threshold function, is any function h: R^d → {±1} such that for some w ∈ R^d and t ∈ R, h(x) = sign(⟨w,x⟩ − t), where sign(z) = 1 if z ≥ 0 and −1 otherwise. Let H be the set of halfspaces. Recall that downsampling reduces learning H in R^d to learning H^block over [r]^d, and H^block is not the set of halfspaces over [r]^d. Fortunately, agnostically learning a halfspace h is commonly done by giving a bound on the degree of a polynomial p that approximates h [KOS04, KOS08, KKMS08], and we will show that a similar idea also suffices for learning H^block. We first present a general algorithm based on "polynomial regression", and then introduce the Fourier analysis necessary to apply the general learning algorithm to halfspaces, polynomial threshold functions, and k-alternating functions.

5.1 A General Learning Algorithm

The learning algorithm in this section essentially replaces step 2 of the brute force algorithm (Lemma 4.5) with the “polynomial regression” algorithm of Kalai et al. [KKMS08]. Our general algorithm is inspired by an algorithm of Canonne et al. [CGG+19] for tolerantly testing kk-alternating functions over the uniform distribution on [n]d[n]^{d}; we state the regression algorithm as it appears in [CGG+19]. For a set \mathcal{F} of functions, 𝗌𝗉𝖺𝗇()\mathsf{span}(\mathcal{F}) is the set of all linear combinations of functions in \mathcal{F}:

Theorem 5.1 ([KKMS08, CGG+19]).

Let μ be a distribution over 𝒳, let H be a class of functions 𝒳 → {±1}, and let F be a collection of functions 𝒳 → R such that for every h ∈ H there exists f ∈ span(F) with E_{x∼μ}[(h(x) − f(x))²] ≤ ε². Then there is an algorithm that, for any distribution D over 𝒳 × {±1} with marginal μ over 𝒳, outputs a function g: 𝒳 → {±1} such that P_{(x,b)∼D}[g(x) ≠ b] ≤ inf_{h∈H} P_{(x,b)∼D}[h(x) ≠ b] + ε, with probability at least 11/12, using at most poly(|F|, 1/ε) samples and time.

Our general learning algorithm will apply to any hypothesis class that has small rr-block boundary size, and for which there is a set of functions \mathcal{F} that approximately span the class 𝖻𝗅𝗈𝖼𝗄\mathcal{H}^{\mathsf{block}}. This algorithm is improved to work for finite (rather than only continuous) product distributions in Lemma A.17.

Lemma 5.2.

Let ϵ>0\epsilon>0 and let \mathcal{H} be a set of measurable functions f:d{±1}f:\mathbb{R}^{d}\to\{\pm 1\} that satisfy:

  1. 1.

    There is some r=r(d,ϵ)r=r(d,\epsilon) such that 𝖻𝖻𝗌(,r)ϵ3rd\mathsf{bbs}(\mathcal{H},r)\leq\frac{\epsilon}{3}\cdot r^{d};

  2. 2.

    There is a set \mathcal{F} of functions [r]d[r]^{d}\to\mathbb{R} satisfying: f,g𝗌𝗉𝖺𝗇()\forall f\in\mathcal{H},\exists g\in\mathsf{span}(\mathcal{F}) such that for v[r]d,𝔼[(f𝖻𝗅𝗈𝖼𝗄(v)g(v))2]ϵ2/4v\sim[r]^{d},\mathbb{E}\left[(f^{\mathsf{block}}(v)-g(v))^{2}\right]\leq\epsilon^{2}/4.

Let n=poly(||,1/ϵ)n=\operatorname{poly}(|\mathcal{F}|,1/\epsilon) be the sample complexity of the algorithm in Theorem 5.1, with error parameter ϵ/2\epsilon/2. Then there is an agnostic learning algorithm for \mathcal{H} on continuous product distributions over d\mathbb{R}^{d}, that uses O(max(n2,1/ϵ2)rd2log(dr))O(\max(n^{2},1/\epsilon^{2})\cdot rd^{2}\log(dr)) samples and runs in time polynomial in the sample size.

Proof.

We will assume n>1/ϵn>1/\epsilon. Let μ\mu be the marginal of 𝒟\mathcal{D} on d\mathbb{R}^{d}. For an rr-block partition, let 𝒟𝖻𝗅𝗈𝖼𝗄\mathcal{D}^{\mathsf{block}} be the distribution of (𝖻𝗅𝗈𝖼𝗄(x),b)(\mathsf{block}(x),b) when (x,b)𝒟(x,b)\sim\mathcal{D}. We may simulate samples from 𝒟𝖻𝗅𝗈𝖼𝗄\mathcal{D}^{\mathsf{block}} by sampling (x,b)(x,b) from 𝒟\mathcal{D} and constructing (𝖻𝗅𝗈𝖼𝗄(x),b)(\mathsf{block}(x),b). The algorithm is as follows:

  1. 1.

    Sample a grid XX of length m=O(rd2n2log(rd))m=O(rd^{2}n^{2}\log(rd)) large enough that Lemma 2.7 guarantees 𝖻𝗅𝗈𝖼𝗄(μ)𝗎𝗇𝗂𝖿([r]d)𝖳𝖵<1/12n\|\mathsf{block}(\mu)-\mathsf{unif}([r]^{d})\|_{\mathsf{TV}}<1/12n with probability 5/65/6. Construct 𝖻𝗅𝗈𝖼𝗄:d[r]d\mathsf{block}:\mathbb{R}^{d}\to[r]^{d} induced by XX. We may construct the 𝖻𝗅𝗈𝖼𝗄\mathsf{block} function in time O(dmlogm)O(dm\log m) by sorting, and once constructed it takes time O(logr)O(\log r) to compute.

  2. 2.

    Run the algorithm of Theorem 5.1 on a sample of nn points from 𝒟𝖻𝗅𝗈𝖼𝗄\mathcal{D}^{\mathsf{block}} to learn the class 𝖻𝗅𝗈𝖼𝗄\mathcal{H}^{\mathsf{block}}; that algorithm returns a function g:[r]d{±1}g:[r]^{d}\to\{\pm 1\}. Output g𝖻𝗅𝗈𝖼𝗄g\circ\mathsf{block}.

Assume that step 1 succeeds, which occurs with probability at least 5/65/6. By condition 2, the algorithm in step 2 is guaranteed to work on samples (v,b)[r]d×{±1}(v,b)\in[r]^{d}\times\{\pm 1\} where the marginal of vv is 𝗎𝗇𝗂𝖿([r]d)\mathsf{unif}([r]^{d}); let 𝒟𝗎𝗇𝗂𝖿\mathcal{D}^{\mathsf{unif}} be the distribution of (v,b)(v,b) when v𝗎𝗇𝗂𝖿([r]d)v\sim\mathsf{unif}([r]^{d}) and bb is obtained by sampling (x,b)(𝒟x𝖻𝗅𝗈𝖼𝗄1(v))(x,b)\sim(\mathcal{D}\mid x\in\mathsf{block}^{-1}(v)). The algorithm of step 2 will succeed on 𝒟𝗎𝗇𝗂𝖿\mathcal{D}^{\mathsf{unif}}; we argue that it will also succeed on the actual input 𝒟𝖻𝗅𝗈𝖼𝗄\mathcal{D}^{\mathsf{block}} since these distributions are close. Observe that for samples (v,b)𝒟𝗎𝗇𝗂𝖿(v,b)\sim\mathcal{D}^{\mathsf{unif}} and (𝖻𝗅𝗈𝖼𝗄(x),b)𝒟𝖻𝗅𝗈𝖼𝗄(\mathsf{block}(x),b^{\prime})\sim\mathcal{D}^{\mathsf{block}}, if v=𝖻𝗅𝗈𝖼𝗄(x)v=\mathsf{block}(x) then b,bb,b^{\prime} each have the distribution of bb^{\prime} in (x,b)(𝒟𝖻𝗅𝗈𝖼𝗄(x)=v)(x,b^{\prime})\sim(\mathcal{D}\mid\mathsf{block}(x)=v). Therefore

𝒟𝗎𝗇𝗂𝖿𝒟𝖻𝗅𝗈𝖼𝗄𝖳𝖵\displaystyle\|\mathcal{D}^{\mathsf{unif}}-\mathcal{D}^{\mathsf{block}}\|_{\mathsf{TV}} =(v,b)(𝖻𝗅𝗈𝖼𝗄(x),b)𝖳𝖵=v𝖻𝗅𝗈𝖼𝗄(x)𝖳𝖵\displaystyle=\|(v,b)-(\mathsf{block}(x),b^{\prime})\|_{\mathsf{TV}}=\|v-\mathsf{block}(x)\|_{\mathsf{TV}}
=𝖻𝗅𝗈𝖼𝗄(μ)𝗎𝗇𝗂𝖿([r]d)𝖳𝖵<112n.\displaystyle=\|\mathsf{block}(\mu)-\mathsf{unif}([r]^{d})\|_{\mathsf{TV}}<\frac{1}{12n}\,.

It is a standard fact that for product distributions Pn,Qn,PnQn𝖳𝖵nPQ𝖳𝖵P^{n},Q^{n},\|P^{n}-Q^{n}\|_{\mathsf{TV}}\leq n\cdot\|P-Q\|_{\mathsf{TV}}; using this fact,

(𝒟𝗎𝗇𝗂𝖿)n(𝒟𝖻𝗅𝗈𝖼𝗄)n𝖳𝖵n𝒟𝗎𝗇𝗂𝖿𝒟𝖻𝗅𝗈𝖼𝗄𝖳𝖵<112.\|(\mathcal{D}^{\mathsf{unif}})^{n}-(\mathcal{D}^{\mathsf{block}})^{n}\|_{\mathsf{TV}}\leq n\cdot\|\mathcal{D}^{\mathsf{unif}}-\mathcal{D}^{\mathsf{block}}\|_{\mathsf{TV}}<\frac{1}{12}\,.

We will argue that step 2 succeeds with probability 5/65/6; i.e. that with probability 5/65/6,

(v,b)𝒟𝖻𝗅𝗈𝖼𝗄[g(v)b]<infh(v,b)𝒟𝖻𝗅𝗈𝖼𝗄[h𝖻𝗅𝗈𝖼𝗄(v)b]+ϵ/2.\underset{(v,b)\sim\mathcal{D}^{\mathsf{block}}}{\mathbb{P}}\left[g(v)\neq b\right]<\inf_{h\in\mathcal{H}}\underset{(v,b)\sim\mathcal{D}^{\mathsf{block}}}{\mathbb{P}}\left[h^{\mathsf{block}}(v)\neq b\right]+\epsilon/2\,.

Let E(S)E(S) be the event that success occurs given sample S([r]d×{±1})nS\in([r]^{d}\times\{\pm 1\})^{n}. The algorithm samples S(𝒟𝖻𝗅𝗈𝖼𝗄)nS\sim(\mathcal{D}^{\mathsf{block}})^{n} but the success guarantee of step 2 is for (𝒟𝗎𝗇𝗂𝖿)n(\mathcal{D}^{\mathsf{unif}})^{n}; this step will still succeed with probability 5/65/6:

P_{S∼(D^block)^n}[ E(S) ] ≥ P_{S∼(D^unif)^n}[ E(S) ] − ‖(D^unif)^n − (D^block)^n‖_TV > 11/12 − 1/12 = 5/6.

Assume that each step succeeds, which occurs with probability at least 1 − 2·(1/6) = 2/3. By Proposition 2.5, our condition 1, and the fact that n > 1/ε, we have for any h ∈ H that

xμ[h(x)h𝖼𝗈𝖺𝗋𝗌𝖾(x)]rd𝖻𝖻𝗌(,r)+𝖻𝗅𝗈𝖼𝗄(μ)𝗎𝗇𝗂𝖿([r]d)𝖳𝖵ϵ/3+112n<ϵ/2.\displaystyle\underset{x\sim\mu}{\mathbb{P}}\left[h(x)\neq h^{\mathsf{coarse}}(x)\right]\leq r^{-d}\cdot\mathsf{bbs}(\mathcal{H},r)+\|\mathsf{block}(\mu)-\mathsf{unif}([r]^{d})\|_{\mathsf{TV}}\leq\epsilon/3+\frac{1}{12n}<\epsilon/2\,.

The output of the algorithm is g𝖻𝗅𝗈𝖼𝗄g\circ\mathsf{block}, which for any hh\in\mathcal{H} satisfies:

(x,b)𝒟[g(𝖻𝗅𝗈𝖼𝗄(x))b]\displaystyle\underset{(x,b)\sim\mathcal{D}}{\mathbb{P}}\left[g(\mathsf{block}(x))\neq b\right] =(v,b)𝒟𝖻𝗅𝗈𝖼𝗄[g(v)b](v,b)𝒟𝖻𝗅𝗈𝖼𝗄[h𝖻𝗅𝗈𝖼𝗄(v)b]+ϵ/2\displaystyle=\underset{(v,b)\sim\mathcal{D}^{\mathsf{block}}}{\mathbb{P}}\left[g(v)\neq b\right]\leq\underset{(v,b)\sim\mathcal{D}^{\mathsf{block}}}{\mathbb{P}}\left[h^{\mathsf{block}}(v)\neq b\right]+\epsilon/2
=(x,b)𝒟[h𝖼𝗈𝖺𝗋𝗌𝖾(x)b]+ϵ/2\displaystyle=\underset{(x,b)\sim\mathcal{D}}{\mathbb{P}}\left[h^{\mathsf{coarse}}(x)\neq b\right]+\epsilon/2
(x,b)𝒟[h(x)b]+𝑥[h(x)h𝖼𝗈𝖺𝗋𝗌𝖾(x)]+ϵ/2\displaystyle\leq\underset{(x,b)\sim\mathcal{D}}{\mathbb{P}}\left[h(x)\neq b\right]+\underset{x}{\mathbb{P}}\left[h(x)\neq h^{\mathsf{coarse}}(x)\right]+\epsilon/2
<(x,b)𝒟[h(x)b]+ϵ.\displaystyle<\underset{(x,b)\sim\mathcal{D}}{\mathbb{P}}\left[h(x)\neq b\right]+\epsilon\,.

Then P_{(x,b)∼D}[g(block(x)) ≠ b] ≤ inf_{h∈H} P_{(x,b)∼D}[h(x) ≠ b] + ε, as desired. ∎
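The following sketch assembles the two steps, with one substitution made for brevity: ordinary least-squares regression onto span(F), followed by sign rounding, stands in for the ℓ1 polynomial regression of [KKMS08] (which also optimizes the rounding threshold). Here block is the partition built in step 1 and features is the list F, e.g. one of the spanning sets constructed in Section 5.2; the names are ours.

```python
# Illustrative sketch of Lemma 5.2 with least-squares in place of L1 regression.
import numpy as np

def learn_via_downsampling(samples, labels, block, features):
    """samples: points in R^d; labels: {+1,-1}; features: functions on [r]^d.
    Returns a hypothesis R^d -> {+1,-1}."""
    V = [block(x) for x in samples]
    A = np.array([[phi(v) for phi in features] for v in V])   # n x |F| design matrix
    y = np.array(labels, dtype=float)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)               # best approximation in span(F)

    def hypothesis(x):
        v = block(x)
        value = sum(c * phi(v) for c, phi in zip(coef, features))
        return 1 if value >= 0 else -1
    return hypothesis
```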

5.2 Fourier Analysis on [n]d{[n]}^{d}

We will show how to construct a spanning set F satisfying condition 2 of the general learning algorithm, using noise sensitivity and the Walsh basis. For any n, let u ∼ [n]^d uniformly at random and draw v ∈ [n]^d as follows: v_i = u_i with probability 1 − δ, and v_i is uniform in [n]∖{u_i} with probability δ. The δ-noise sensitivity of a function f: [n]^d → {±1} is defined as:

𝗇𝗌n,δ(f):=u,v[f(u)f(v)]=1212𝔼u,v[f(u)f(v)].\mathsf{ns}_{n,\delta}(f):=\underset{u,v}{\mathbb{P}}\left[f(u)\neq f(v)\right]=\frac{1}{2}-\frac{1}{2}\cdot\underset{u,v}{\mathbb{E}}\left[f(u)f(v)\right]\,.

Note that we include n in the subscript to indicate the size of the domain. We will use ns_{r,δ}(f) to obtain upper bounds on the spanning set, and we will obtain bounds on ns_{r,δ} by relating it to ns_{2,δ}, for which many bounds are known. For a function f: [n]^d → {±1}, two vectors u,v ∈ [n]^d, and x ∈ {±1}^d, define [u,v]^x ∈ [n]^d as the vector with [u,v]^x_i = u_i if x_i = 1 and v_i if x_i = −1. Then define f_{u,v}: {±1}^d → {±1} as the function f_{u,v}(x) = f([u,v]^x). The next lemma is essentially the same as the reduction in [BOW10].

Lemma 5.3.

Let \mathcal{H} be a set of functions f:d{±1}f:\mathbb{R}^{d}\to\{\pm 1\} such that for any linear transformation Ad×dA\in\mathbb{R}^{d\times d}, the function fAf\circ A\in\mathcal{H}, and let 𝖻𝗅𝗈𝖼𝗄:d[r]d\mathsf{block}:\mathbb{R}^{d}\to[r]^{d} be any rr-block partition. Let 𝗇𝗌2,δ()=supf𝗇𝗌2,δ(f)\mathsf{ns}_{2,\delta}(\mathcal{H})=\sup_{f\in\mathcal{H}}\mathsf{ns}_{2,\delta}(f) where 𝗇𝗌2,δ(f)\mathsf{ns}_{2,\delta}(f) is the δ\delta-noise sensitivity on domain {±1}d\{\pm 1\}^{d}. Then 𝗇𝗌r,δ(f𝖻𝗅𝗈𝖼𝗄)𝗇𝗌2,δ()\mathsf{ns}_{r,\delta}(f^{\mathsf{block}})\leq\mathsf{ns}_{2,\delta}(\mathcal{H}).

Proof.

Let u ∼ [r]^d and let v be uniform among the vectors in [r]^d with v_i ≠ u_i for all i. Now let x ∼ {±1}^d uniformly at random and let y be drawn such that y_i = x_i with probability 1 − δ and y_i = −x_i otherwise. Then [u,v]^x is uniform in [r]^d, because [u,v]^x_i is u_i or v_i with equal probability and the marginals of u_i, v_i are uniform, and [u,v]^y_i = [u,v]^x_i with probability 1 − δ and is otherwise uniform in [r]∖{[u,v]^x_i}. Let f: [r]^d → {±1} and δ ∈ [0,1]. Let (u′,v′) be a pair drawn as in the definition of noise sensitivity over [r]^d, so that ns_{r,δ}(f) = P[f(u′) ≠ f(v′)]. Now observe that ([u,v]^x, [u,v]^y) has the same distribution as (u′,v′), so:

𝔼u,v[𝗇𝗌2,δ(fu,v)]\displaystyle\underset{u,v}{\mathbb{E}}\left[\mathsf{ns}_{2,\delta}(f_{u,v})\right] =𝔼u,v[x,yδx[f([u,v]x)f([u,v]y)]]\displaystyle=\underset{u,v}{\mathbb{E}}\left[\underset{x,y\sim_{\delta}x}{\mathbb{P}}\left[f({[u,v]}^{x})\neq f({[u,v]}^{y})\right]\right]
=𝔼u,v,(x,y)δ[𝟏[f([u,v]x)f([u,v]y)]]\displaystyle=\underset{u,v,{(x,y)}_{\delta}}{\mathbb{E}}\left[\mathbf{1}\left[f({[u,v]}^{x})\neq f({[u,v]}^{y})\right]\right]
=𝔼u,v[𝟏[f(u)f(v)]]=𝗇𝗌r,δ(f𝖻𝗅𝗈𝖼𝗄).\displaystyle=\underset{u^{\prime},v^{\prime}}{\mathbb{E}}\left[\mathbf{1}\left[f(u^{\prime})\neq f(v^{\prime})\right]\right]=\mathsf{ns}_{r,\delta}(f^{\mathsf{block}})\,.

For any u,v[r]du,v\in[r]^{d}, define the function Φu,v:{±1}d[r]d\Phi_{u,v}:\{\pm 1\}^{d}\to[r]^{d} by Φu,v(x)=𝖻𝗅𝗈𝖼𝗄𝗉𝗈𝗂𝗇𝗍([u,v]x)\Phi_{u,v}(x)=\mathsf{blockpoint}([u,v]^{x}). This function maps {±1}d\{\pm 1\}^{d} to a set {b1,i1,b1,j1}××{bd,id,bd,jd}\{b_{1,i_{1}},b_{1,j_{1}}\}\times\dotsm\times\{b_{d,i_{d}},b_{d,j_{d}}\} and can be obtained by translation and scaling, which is a linear transformation. Therefore fu,v=fΦu,v1f_{u,v}=f\circ\Phi_{u,v}^{-1}, so we are guaranteed that fu,vf_{u,v}\in\mathcal{H}. So

𝗇𝗌r,δ(f)=𝔼u,v[𝗇𝗌2,δ(fu,v)]𝗇𝗌2,δ().\mathsf{ns}_{r,\delta}(f)=\underset{u,v}{\mathbb{E}}\left[\mathsf{ns}_{2,\delta}(f_{u,v})\right]\leq\mathsf{ns}_{2,\delta}(\mathcal{H})\,.\qed

We define the Walsh basis, an orthonormal basis of functions [n]d[n]^{d}\to\mathbb{R}; see e.g. [BRY14]. Suppose n=2mn=2^{m} for some positive integer mm. For two functions f,g:[n]df,g:{[n]}^{d}\to\mathbb{R}, define the inner product f,g=𝔼x[n]d[f(x)g(x)]\langle f,g\rangle=\mathbb{E}_{x\sim{[n]}^{d}}[f(x)g(x)]. The Walsh functions {ψ0,,ψm},ψi:[n]{±1}\{\psi_{0},\dotsc,\psi_{m}\},\psi_{i}:[n]\to\{\pm 1\} can be defined by ψ01\psi_{0}\equiv 1 and for i1i\geq 1, ψi(z):=(1)𝖻𝗂𝗍i(z1)\psi_{i}(z):=(-1)^{\mathsf{bit}_{i}(z-1)} where 𝖻𝗂𝗍i(z1)\mathsf{bit}_{i}(z-1) is the ithi^{\mathrm{th}} bit in the binary representation of z1z-1, where the first bit is the least significant (see e.g. [BRY14]). It is easy to verify that for all i,j{0,,m}i,j\in\{0,\dotsc,m\}, if iji\neq j then ψi,ψj=0\langle\psi_{i},\psi_{j}\rangle=0, and 𝔼x[n][ψi(x)]=0\mathbb{E}_{x\sim[n]}[\psi_{i}(x)]=0 when i1i\geq 1. For S[m]S\subseteq[m] define ψS=iSψi\psi_{S}=\prod_{i\in S}\psi_{i} and note that for any set S[m],SS\subseteq[m],S\neq\emptyset,

𝔼x[n][ψS(x)]=𝔼x[n][iSψi(x)]=𝔼x[n][(1)iS𝖻𝗂𝗍i(x1)]=0\underset{x\sim[n]}{\mathbb{E}}\left[\psi_{S}(x)\right]=\underset{x\sim[n]}{\mathbb{E}}\left[\prod_{i\in S}\psi_{i}(x)\right]=\underset{x\sim[n]}{\mathbb{E}}\left[(-1)^{\sum_{i\in S}\mathsf{bit}_{i}(x-1)}\right]=0 (1)

since each bit is uniform in {0,1}\{0,1\}, while ψ1\psi_{\emptyset}\equiv 1. For S,T[m]S,T\subseteq[m],

ψS,ψT=𝔼x[n][ψS(x)ψT(x)]=𝔼x[ψSΔT(x)],\langle\psi_{S},\psi_{T}\rangle=\mathbb{E}_{x\sim[n]}[\psi_{S}(x)\psi_{T}(x)]=\mathbb{E}_{x}[\psi_{S\Delta T}(x)]\,,

where SΔTS\Delta T is the symmetric difference, so this is 0 when SΔTS\Delta T\neq\emptyset (i.e. STS\neq T) and 1 otherwise; therefore {ψS:S[m]}\{\psi_{S}:S\subseteq[m]\} is an orthonormal basis of functions [n][n]\to\mathbb{R}. Identify each S[m]S\subseteq[m] with the number s{0,,n1}s\in\{0,\dotsc,n-1\} where 𝖻𝗂𝗍i(s)=𝟏[iS]\mathsf{bit}_{i}(s)=\mathbf{1}\left[i\in S\right]. Now for every α{0,,n1}d\alpha\in\{0,\dotsc,n-1\}^{d} define ψα:[n]d{±1}\psi_{\alpha}:[n]^{d}\to\{\pm 1\} as ψα(x)=i=1dψαi(xi)\psi_{\alpha}(x)=\prod_{i=1}^{d}\psi_{\alpha_{i}}(x_{i}) where ψαi\psi_{\alpha_{i}} is the Walsh function determined by the identity between subsets of [m][m] and the integer αi{0,,n1}\alpha_{i}\in\{0,\dotsc,n-1\}. It is easy to verify that the set {ψα:α{0,,n1}d}\{\psi_{\alpha}:\alpha\in{\{0,\dotsc,n-1\}}^{d}\} is an orthonormal basis. Every function f:[n]df:{[n]}^{d}\to\mathbb{R} has a unique representation f=α{0,,n1}df^(α)ψαf=\sum_{\alpha\in\{0,\dotsc,n-1\}^{d}}\hat{f}(\alpha)\psi_{\alpha} where f^(α)=f,ψα\hat{f}(\alpha)=\langle f,\psi_{\alpha}\rangle.

For each x[n]dx\in{[n]}^{d} and ρ[0,1]\rho\in[0,1] define Nρ(x)N_{\rho}(x) as the distribution over y[n]dy\in{[n]}^{d} where for each i[d]i\in[d], yi=xiy_{i}=x_{i} with probability ρ\rho and yiy_{i} is uniform in [n][n] with probability 1ρ1-\rho. Define Tρf(x):=𝔼yNρ(x)[f(y)]T_{\rho}f(x):=\underset{y\sim N_{\rho}(x)}{\mathbb{E}}\left[f(y)\right] and 𝗌𝗍𝖺𝖻ρ(f):=f,Tρf\mathsf{stab}_{\rho}(f):=\langle f,T_{\rho}f\rangle. For any α{0,,n1}d\alpha\in{\{0,\dotsc,n-1\}}^{d},

Tρψα(x)\displaystyle T_{\rho}\psi_{\alpha}(x) =𝔼yNρ(x)[ψα(y)]=𝔼yNρ(x)[i=1dψαi(yi)]=i=1d𝔼yiNρ(xi)[ψαi(yi)]\displaystyle=\underset{y\sim N_{\rho}(x)}{\mathbb{E}}\left[\psi_{\alpha}(y)\right]=\underset{y\sim N_{\rho}(x)}{\mathbb{E}}\left[\prod_{i=1}^{d}\psi_{\alpha_{i}}(y_{i})\right]=\prod_{i=1}^{d}\underset{y_{i}\sim N_{\rho}(x_{i})}{\mathbb{E}}\left[\psi_{\alpha_{i}}(y_{i})\right]
=i=1d[ρψαi(xi)+(1ρ)𝔼z[n][ψαi(z)]].\displaystyle=\prod_{i=1}^{d}\left[\rho\psi_{\alpha_{i}}(x_{i})+(1-\rho)\underset{z\sim[n]}{\mathbb{E}}\left[\psi_{\alpha_{i}}(z)\right]\right]\,.

If α_i ≥ 1 then E_{z∼[n]}[ψ_{α_i}(z)] = 0; otherwise, ψ_0 ≡ 1, so E_{y_i∼N_ρ(x_i)}[ψ_0(y_i)] = 1. Therefore

Tρψα(x)=ρ|α|ψα(x),T_{\rho}\psi_{\alpha}(x)=\rho^{|\alpha|}\psi_{\alpha}(x)\,,

where |α||\alpha| is the number of nonzero entries of α\alpha; so Tρf^(α)=ψα,Tρf=Tρψα,f=ρ|α|f^(α)\widehat{T_{\rho}f}(\alpha)=\langle\psi_{\alpha},T_{\rho}f\rangle=\langle T_{\rho}\psi_{\alpha},f\rangle=\rho^{|\alpha|}\widehat{f}(\alpha). Since TρT_{\rho} is a linear operator,

𝗌𝗍𝖺𝖻ρ(f)=f,Tρf=αρ|α|f^(α)2.\mathsf{stab}_{\rho}(f)=\langle f,T_{\rho}f\rangle=\sum_{\alpha}\rho^{|\alpha|}\hat{f}{(\alpha)}^{2}\,.

Note that for f:{±1}d{±1}f:\{\pm 1\}^{d}\to\{\pm 1\}, 𝗌𝗍𝖺𝖻ρ(f)\mathsf{stab}_{\rho}(f) is the usual notion of stability in the analysis of Boolean functions.

Proposition 5.4.

For any f:[n]d{±1}f:{[n]}^{d}\to\{\pm 1\} and any δ,ρ[0,1]\delta,\rho\in[0,1], 𝗇𝗌n,δ(f)=1212𝗌𝗍𝖺𝖻1nn1δ(f)\mathsf{ns}_{n,\delta}(f)=\frac{1}{2}-\frac{1}{2}\cdot\mathsf{stab}_{1-\frac{n}{n-1}\delta}(f).

Proof.

For vNρ(u),vi=uiv\sim N_{\rho}(u),v_{i}=u_{i} with probability ρ+1ρn\rho+\frac{1-\rho}{n}, so in the definition of noise sensitivity, vv is distributed as Nρ(u)N_{\rho}(u) where (1δ)=ρ+1ρn(1-\delta)=\rho+\frac{1-\rho}{n}, i.e. δ=1ρ1ρn=(11/n)ρ(11/n)=(1ρ)(11/n)\delta=1-\rho-\frac{1-\rho}{n}=(1-1/n)-\rho(1-1/n)=(1-\rho)(1-1/n); or, ρ=1nn1δ\rho=1-\frac{n}{n-1}\delta. By rearranging, we arrive at the conclusions. ∎
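As a quick numerical sanity check of this identity (and of the convention that each coordinate is re-randomized with probability δ), one can compare both sides exactly on a small domain; the following is purely illustrative.

```python
# Check ns_{n,delta}(f) = 1/2 - 1/2 * stab_{1 - n*delta/(n-1)}(f) on a random f.
import itertools, random

n, d, delta = 4, 2, 0.3
rho = 1 - n * delta / (n - 1)
points = list(itertools.product(range(1, n + 1), repeat=d))
f = {x: random.choice([-1, 1]) for x in points}

def K(u, v):                                      # kernel in the noise-sensitivity definition
    p = 1.0
    for ui, vi in zip(u, v):
        p *= (1 - delta) if ui == vi else delta / (n - 1)
    return p

def M(x, y):                                      # kernel of the noise operator N_rho
    p = 1.0
    for xi, yi in zip(x, y):
        p *= rho * (xi == yi) + (1 - rho) / n
    return p

ns = sum(K(u, v) * (f[u] != f[v]) for u in points for v in points) / len(points)
stab = sum(M(x, y) * f[x] * f[y] for x in points for y in points) / len(points)
assert abs(ns - (0.5 - 0.5 * stab)) < 1e-9
```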

Proposition 5.5.

For any f: [n]^d → {±1} and t = 2/δ, Σ_{α:|α|≥t} f̂(α)² ≤ 2.32·ns_{n,δ}(f).

Proof.

Following [KOS04]:

2𝗇𝗌n,δ(f)\displaystyle 2\mathsf{ns}_{n,\delta}(f) =1α(1nn1δ)|α|f^(α)21α(1δ)|α|f^(α)2\displaystyle=1-\sum_{\alpha}{\left(1-\frac{n}{n-1}\delta\right)}^{|\alpha|}\hat{f}{(\alpha)}^{2}\geq 1-\sum_{\alpha}{\left(1-\delta\right)}^{|\alpha|}\hat{f}{(\alpha)}^{2}
=α(1(1δ)|α|)f^(α)2α:|α|2/δ(1(1δ)|α|)f^(α)2\displaystyle=\sum_{\alpha}(1-{\left(1-\delta\right)}^{|\alpha|})\hat{f}{(\alpha)}^{2}\geq\sum_{\alpha:|\alpha|\geq 2/\delta}(1-{\left(1-\delta\right)}^{|\alpha|})\hat{f}{(\alpha)}^{2}
α:|α|2/δ(1(1δ)2/δ)f^(α)2(1e2)α:|α|2/δf^(α)2.\displaystyle\geq\sum_{\alpha:|\alpha|\geq 2/\delta}(1-{\left(1-\delta\right)}^{2/\delta})\hat{f}{(\alpha)}^{2}\geq(1-e^{-2})\sum_{\alpha:|\alpha|\geq 2/\delta}\hat{f}{(\alpha)}^{2}\,.

The result now holds since 2/(1e2)<2.322/(1-e^{-2})<2.32. ∎

Lemma 5.6.

Let H be a set of functions [n]^d → {±1} where n is a power of 2, let ε, δ > 0 be such that ns_{n,δ}(h) ≤ ε²/3 for all h ∈ H, and let t = ⌈2/δ⌉. Then there is a set F of functions [n]^d → R of size |F| ≤ (nd)^t such that, for any h ∈ H, there is a function p ∈ span(F) with E_{x∼[n]^d}[(h(x) − p(x))²] ≤ ε².

Proof.

Let h ∈ H and let p = Σ_{|α|<t} ĥ(α)ψ_α. Then by Proposition 5.5,

E[(p(x) − h(x))²] = E[( Σ_{|α|≥t} ĥ(α)ψ_α(x) )²] = Σ_{|α|≥t} Σ_{|β|≥t} ĥ(α)ĥ(β)⟨ψ_α, ψ_β⟩ = Σ_{|α|≥t} ĥ(α)² ≤ 2.32·ε²/3 ≤ ε².

Therefore p is a linear combination of functions ψ_α = ∏_{i=1}^d ψ_{α_i} in which fewer than t of the indices α_i ∈ {0,…,n−1} are nonzero. There are at most ((n−1)d)^t such products, since each non-constant factor ψ_{α_i} is specified by a choice of i ∈ [d] and α_i ∈ [n−1]. We may take F to be the set of these products. ∎
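A minimal sketch of enumerating this spanning set F (the low-degree part of the Walsh basis) for use with the general learning algorithm; it uses the bit-counting form of the Walsh functions from the sketch above, and the names are ours.

```python
# Illustrative sketch: all products psi_alpha with fewer than t nonzero coordinates.
import itertools

def psi(s, z):
    """Walsh function on [n]: (-1)^{number of bits shared by s and z-1}."""
    return -1 if bin(s & (z - 1)).count("1") % 2 else 1

def low_degree_walsh_features(n, d, t):
    """Yields one function [n]^d -> {+1,-1} for each alpha with |alpha| < t."""
    for k in range(t):                            # number of nonzero coordinates of alpha
        for coords in itertools.combinations(range(d), k):
            for vals in itertools.product(range(1, n), repeat=k):
                alpha = dict(zip(coords, vals))
                def phi(v, alpha=alpha):          # v is a point of [n]^d (1-indexed tuple)
                    out = 1
                    for i, a in alpha.items():
                        out *= psi(a, v[i])
                    return out
                yield phi
```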

5.3 Application

To apply Lemma 5.2, we must give bounds on 𝖻𝖻𝗌(k,r)\mathsf{bbs}(\mathcal{B}_{k}\circ\mathcal{H},r) and the noise sensitivity:

Lemma 5.7.

Fix any rr. Then 𝖻𝖻𝗌(k,r)dkrd1\mathsf{bbs}(\mathcal{B}_{k}\circ\mathcal{H},r)\leq dkr^{d-1}.

Proof.

Any halfspace h(x) = sign(⟨w,x⟩ − t) is unate, meaning there is a vector σ ∈ {±1}^d such that the function h^σ(x) := sign(⟨w, x^σ⟩ − t), where x^σ = (σ_1 x_1, …, σ_d x_d), is monotone. For any r-block partition block: R^d → [r]^d defined by values {a_{i,j}} for i ∈ [d], j ∈ [r−1], we can define block^σ: R^d → [r]^d as the block partition obtained by taking the values {σ_i · a_{i,j}}. The number of non-constant blocks of h in block is the same as that of h^σ in block^σ, but h^σ is monotone. Thus the bound on bbs for monotone functions applies, so bbs(H,r) ≤ dr^{d−1} by Lemma 7.1, and Proposition 4.2 gives bbs(B_k ∘ H, r) ≤ dkr^{d−1}. ∎

The bounds on noise sensitivity follow from known results for the hypercube.

Proposition 5.8.

Let h1,,hk:[n]d{±1}h_{1},\dotsc,h_{k}:{[n]}^{d}\to\{\pm 1\} and let g:{±1}k{±1}g:\{\pm 1\}^{k}\to\{\pm 1\}. Let f:=g(h1,,hk)f:=g\circ(h_{1},\dotsc,h_{k}). Then 𝗇𝗌δ(f)i=1k𝗇𝗌δ(hi)\mathsf{ns}_{\delta}(f)\leq\sum_{i=1}^{k}\mathsf{ns}_{\delta}(h_{i}).

Proof.

For u,vu,v drawn from [n]d{[n]}^{d} as in the definition of noise sensitivity, the union bound gives 𝗇𝗌δ(f)=u,v[f(u)f(v)]u,v[i:hi(u)hi(v)]i=1k𝗇𝗌δ(hi)\mathsf{ns}_{\delta}(f)=\underset{u,v}{\mathbb{P}}\left[f(u)\neq f(v)\right]\leq\underset{u,v}{\mathbb{P}}\left[\exists i:h_{i}(u)\neq h_{i}(v)\right]\leq\sum_{i=1}^{k}\mathsf{ns}_{\delta}(h_{i}). ∎

Lemma 5.9.

Let f=g(h1,,hk)kf=g\circ(h_{1},\dotsc,h_{k})\in\mathcal{B}_{k}\circ\mathcal{H}. For any rr-block partition 𝖻𝗅𝗈𝖼𝗄:d[r]d\mathsf{block}:\mathbb{R}^{d}\to[r]^{d}, and any δ[0,1],𝗇𝗌r,δ(f𝖻𝗅𝗈𝖼𝗄)=O(kδ)\delta\in[0,1],\mathsf{ns}_{r,\delta}(f^{\mathsf{block}})=O(k\sqrt{\delta}).

Proof.

It is known that ns_{2,δ}(H) = O(√δ) (Peres' theorem [O'D14]). Moreover, for any linear transformation A ∈ R^{d×d} and any h ∈ H we have h∘A ∈ H: for some w ∈ R^d and t ∈ R, h(Ax) = sign(⟨w, Ax⟩ − t) = sign(⟨A^⊤w, x⟩ − t), which is again a halfspace. Then Lemma 5.3 implies ns_{r,δ}(h^block) ≤ ns_{2,δ}(H) = O(√δ), and we conclude with Proposition 5.8. ∎

See 1.1.2

Proof.

Here we prove only the continuous distribution case; the finite case is proved in Appendix A.

For r = ⌈3dk/ε⌉, we have r^{-d}·bbs(B_k ∘ H, r) ≤ ε/3 by Lemma 5.7, so condition 1 of Lemma 5.2 holds. Lemma 5.9 guarantees that for any f ∈ B_k ∘ H, ns_{r,δ}(f^block) = O(k√δ). Setting δ = Θ(ε⁴/k²) so that ns_{r,δ}(f^block) ≤ ε²/3, we obtain via Lemma 5.6 a set F of size |F| ≤ (rd)^{O(k²/ε⁴)} satisfying condition 2 of Lemma 5.2. Then for n = poly(|F|, 1/ε) we apply Lemma 5.2 to get an algorithm with sample complexity

O(rd2n2log(rd))=O(d3kϵlog(dk/ϵ))(dkϵ)O(k2ϵ4).O\left(rd^{2}n^{2}\log(rd)\right)=O\left(\frac{d^{3}k}{\epsilon}\log(dk/\epsilon)\right)\cdot\left(\frac{dk}{\epsilon}\right)^{O\left(\frac{k^{2}}{\epsilon^{4}}\right)}\,.

The other time complexity follows from Lemma 4.5. ∎

6 Learning Polynomial Threshold Functions

A degree-k polynomial threshold function (PTF) is a function f: R^d → {±1} such that there is a degree-k polynomial p: R^d → R in d variables with f(x) = sign(p(x)); for example, a halfspace is a degree-1 PTF. Let P_k be the set of degree-k PTFs. As for halfspaces, we will give bounds on the noise sensitivity and the block boundary size and apply the general learning algorithm. The bound on noise sensitivity will follow from known results on the hypercube [DHK+10], but the bound on the block boundary size is much more difficult to obtain than for halfspaces.

6.1 Block-Boundary Size of PTFs

A theorem of Warren [War68] gives a bound on the number of connected components of d\mathbb{R}^{d} after removing the 0-set of a degree-kk polynomial. This bound (Theorem 6.7 below) will be our main tool.

A set S ⊆ R^d is connected (here we are using the fact that connected and path-connected are equivalent in R^d) if for every s,t ∈ S there is a continuous function p: [0,1] → S such that p(0) = s, p(1) = t. A subset S ⊆ X, where X ⊆ R^d, is a connected component of X if it is connected and there is no connected set T ⊆ X such that S ⊊ T. Write comp(X) for the number of connected components of X.

A function ρ: [d] → R ∪ {*} is called a restriction, and we will denote |ρ| = |{i ∈ [d] : ρ(i) = *}|. The affine subspace A_ρ induced by ρ is A_ρ := {x ∈ R^d | x_i = ρ(i) if ρ(i) ≠ *} and has affine dimension |ρ|.

For ndn\leq d, let 𝒜n\mathcal{A}_{n} be the set of affine subspaces AρA_{\rho} obtained by choosing a restriction ρ\rho with ρ(i)=\rho(i)=* when ini\leq n and ρ(i)\rho(i)\neq* when i>ni>n, so in particular 𝒜d={d}\mathcal{A}_{d}=\{\mathbb{R}^{d}\}.

Let f:d{±1}f:\mathbb{R}^{d}\to\{\pm 1\} be a measurable function and define the boundary of ff as:

f:={xd|ϵ>0,y:xy2<ϵ,f(y)f(x)}.\partial f:=\{x\in\mathbb{R}^{d}\;|\;\forall\epsilon>0,\exists y:\|x-y\|_{2}<\epsilon,f(y)\neq f(x)\}\,.

This is equivalent to the boundary of the set of +1+1-valued points, and the boundary of any set is closed. Each measurable f:d{±1}f:\mathbb{R}^{d}\to\{\pm 1\} induces a partition of df\mathbb{R}^{d}\setminus\partial f into some number of connected parts. For a set \mathcal{H} of functions f:d{±1}f:\mathbb{R}^{d}\to\{\pm 1\} and ndn\leq d, write

M(n):=maxfmaxA𝒜n𝖼𝗈𝗆𝗉(Af).M(n):=\max_{f\in\mathcal{H}}\max_{A\in\mathcal{A}_{n}}\mathsf{comp}(A\setminus\partial f)\,.

For each i[d]i\in[d] let 𝒫i\mathcal{P}_{i} be the set of hyperplanes of the form {xd|xi=a}\{x\in\mathbb{R}^{d}\;|\;x_{i}=a\} for some aa\in\mathbb{R}. An (r,n,m)(r,n,m)-arrangement for ff is any set A(fi=1mj=1r1Hi,j)A\setminus\left(\partial f\cup\bigcup_{i=1}^{m}\bigcup_{j=1}^{r-1}H_{i,j}\right) where A𝒜nA\in\mathcal{A}_{n} and Hi,j𝒫iH_{i,j}\in\mathcal{P}_{i} such that all Hi,jH_{i,j} are distinct. Write Rf(r,n,m)R_{f}(r,n,m) for the set of (r,n,m)(r,n,m)-arrangements for ff. Define

Pr(n,m):=maxfmax{𝖼𝗈𝗆𝗉(R)|RRf(r,n,m)}P_{r}(n,m):=\max_{f\in\mathcal{H}}\max\{\mathsf{comp}(R)\;|\;R\in R_{f}(r,n,m)\}

and observe that Pr(n,0)=M(n)P_{r}(n,0)=M(n).

Proposition 6.1.

For any set \mathcal{H} of functions f:d{±1}f:\mathbb{R}^{d}\to\{\pm 1\} and any r>0r>0, 𝖻𝖻𝗌(,r)Pr(d,d)rd\mathsf{bbs}(\mathcal{H},r)\leq P_{r}(d,d)-r^{d}.

Proof.

Consider any rr-block partition, which is obtained by choosing values ai,ja_{i,j}\in\mathbb{R} for each i[d],j[r1]i\in[d],j\in[r-1] and defining 𝖻𝗅𝗈𝖼𝗄:d[r]d\mathsf{block}:\mathbb{R}^{d}\to[r]^{d} by assigning each xdx\in\mathbb{R}^{d} the block v[r]dv\in[r]^{d} where viv_{i} is the unique value such that ai,vi1<xiai,via_{i,v_{i}-1}<x_{i}\leq a_{i,v_{i}}, where we define ai,0=,ai,r=a_{i,0}=-\infty,a_{i,r}=\infty. Suppose vv is a non-constant block, so there are x,y𝖻𝗅𝗈𝖼𝗄1(v)fx,y\in\mathsf{block}^{-1}(v)\setminus\partial f such that f(x)f(y)f(x)\neq f(y). Let Hi,j={xd|xi=ai,j}H_{i,j}=\{x\in\mathbb{R}^{d}\;|\;x_{i}=a_{i,j}\} and let B=fi,jHi,jB=\partial f\cup\bigcup_{i,j}H_{i,j}. Consider the set dB\mathbb{R}^{d}\setminus B. Since xfx\notin\partial f there exists some small open ball RxR_{x} around xx such that xRx,f(x)=f(x)\forall x^{\prime}\in R_{x},f(x^{\prime})=f(x); and since x𝖻𝗅𝗈𝖼𝗄1(v)x\in\mathsf{block}^{-1}(v), Rx𝖻𝗅𝗈𝖼𝗄1(v)R_{x}\cap\mathsf{block}^{-1}(v) is a set of positive Lebesgue measure. Since BB has Lebesgue measure 0, we conclude that (Rx𝖻𝗅𝗈𝖼𝗄1(v))B(R_{x}\cap\mathsf{block}^{-1}(v))\setminus B has positive measure, so there is x𝖻𝗅𝗈𝖼𝗄1(v)Bx^{\prime}\in\mathsf{block}^{-1}(v)\setminus B with f(x)=f(x)f(x^{\prime})=f(x). Likewise, there is y𝖻𝗅𝗈𝖼𝗄1(v)By^{\prime}\in\mathsf{block}^{-1}(v)\setminus B with f(y)=f(y)f(x)f(y^{\prime})=f(y)\neq f(x^{\prime}). Therefore x,yx^{\prime},y^{\prime} must belong to separate components, so 𝖻𝗅𝗈𝖼𝗄1(v)B\mathsf{block}^{-1}(v)\setminus B is partitioned into at least 2 components. Meanwhile, each constant block is partitioned into at least 1 component. So

Pr(d,d)2(# non-constant blocks)+(# constant blocks)=𝖻𝖻𝗌(,r)+rd.P_{r}(d,d)\geq 2\cdot(\#\text{ non-constant blocks})+(\#\text{ constant blocks})=\mathsf{bbs}(\mathcal{H},r)+r^{d}\,.\qed

The following fact must be well-known, but not to us:

Proposition 6.2.

Let AA be an affine subspace of d\mathbb{R}^{d}, let BAB\subset A, and for aa\in\mathbb{R} let H={xd|x1=a}H=\{x\in\mathbb{R}^{d}\;|\;x_{1}=a\}. Then

𝖼𝗈𝗆𝗉(A(HB))𝖼𝗈𝗆𝗉(AB)𝖼𝗈𝗆𝗉(HB).\mathsf{comp}(A\setminus(H\cup B))-\mathsf{comp}(A\setminus B)\leq\mathsf{comp}(H\setminus B)\,.
Proof.

Let GG be the graph with its vertices VV being the components of A(HB)A\setminus(H\cup B) and the edges EE being the pairs (S,T)(S,T) where S,TS,T are components of A(HB)A\setminus(H\cup B) such that sS,s1<a\forall s\in S,s_{1}<a, tT,t1>a\forall t\in T,t_{1}>a, and there exists a component UU of ABA\setminus B such that S,TUS,T\subset U. Clearly 𝖼𝗈𝗆𝗉(A(HB))=|V|\mathsf{comp}(A\setminus(H\cup B))=|V|; we will show that 𝖼𝗈𝗆𝗉(AB)\mathsf{comp}(A\setminus B) is the number of connected components of GG and that |E|𝖼𝗈𝗆𝗉(HB)|E|\leq\mathsf{comp}(H\setminus B). This suffices to prove the statement. We will use the following claim:

Claim 6.3.

Let UU be a connected component of ABA\setminus B. If S,TVS,T\in V and there is a path p:[0,1]Up:[0,1]\to U such that p(0)S,p(1)Tp(0)\in S,p(1)\in T and either λ,p(λ)1a\forall\lambda,p(\lambda)_{1}\leq a or λ,p(λ)1a\forall\lambda,p(\lambda)_{1}\geq a, then S=TS=T.

Proof of claim.

Assume without loss of generality that p(λ)1ap(\lambda)_{1}\leq a for all λ\lambda. Let P={p(λ)|λ[0,1]}P=\{p(\lambda)\;|\;\lambda\in[0,1]\}. Since UU is open we can define for each λ\lambda a ball B(λ)p(λ)B(\lambda)\ni p(\lambda) such that B(λ)UB(\lambda)\subset U. Consider the sets Ba(λ):={xB(λ)|x1<a}B_{a}(\lambda):=\{x\in B(\lambda)\;|\;x_{1}<a\}, which are open, and note that for all α,β[0,1]\alpha,\beta\in[0,1], if p(α)B(β)p(\alpha)\in B(\beta) then Ba(α)Ba(β)B_{a}(\alpha)\cap B_{a}(\beta)\neq\emptyset since p(α)1,p(β)1ap(\alpha)_{1},p(\beta)_{1}\leq a.

Assume for contradiction that there is λ\lambda such that Ba(λ)B_{a}(\lambda) is not connected to SS or TT; then let λ\lambda^{\prime} be the infimum of all such λ\lambda, which must satisfy λ>0\lambda^{\prime}>0 since p(0)Sp(0)\in S. For any α\alpha, if p(α)B(λ)p(\alpha)\in B(\lambda^{\prime}) and Ba(α)B_{a}(\alpha) is connected to SS or TT then, since Ba(λ)Ba(α)B_{a}(\lambda^{\prime})\cap B_{a}(\alpha) is nonempty, it must be that Ba(λ)B_{a}(\lambda^{\prime}) is connected as well; therefore Ba(α)B_{a}(\alpha) is not connected to either SS or TT. But since pp is continuous, there is α<λ\alpha<\lambda^{\prime} such that p(α)B(λ)p(\alpha)\in B(\lambda^{\prime}), so λ\lambda^{\prime} cannot be the infimum, which is a contradiction. Therefore every λ\lambda has Ba(λ)B_{a}(\lambda) connected to either SS or TT. If STS\neq T, this is a contradiction since there must then be α,β\alpha,\beta such that p(α)B(β)p(\alpha)\in B(\beta) but Ba(α),Ba(β)B_{a}(\alpha),B_{a}(\beta) are connected to S,TS,T respectively. Therefore S=TS=T. ∎

We first show that 𝖼𝗈𝗆𝗉(AB)\mathsf{comp}(A\setminus B) is the number of graph-connected components of GG. Suppose that vertices (S,T)(S,T) are connected, so there is a path S=S0,,Sn=TS=S_{0},\dotsc,S_{n}=T in GG. Then there are connected components UiU_{i} of ABA\setminus B such that Si1,SiUiS_{i-1},S_{i}\subset U_{i}; so SiUiUi+1S_{i}\subset U_{i}\cap U_{i+1}, which implies that iUiAB\bigcup_{i}U_{i}\subset A\setminus B is connected. Therefore we may define Φ\Phi as mapping each connected component {Si}\{S_{i}\} of GG to the unique component UU of ABA\setminus B with iSiU\bigcup_{i}S_{i}\subset U. Φ\Phi is surjective since for each component UU of ABA\setminus B there is some vertex SS (a component of A(HB)A\setminus(H\cup B)) such that SUS\subseteq U: this is UU itself if UH=U\cap H=\emptyset. For some connected component UU of ABA\setminus B, let S,TUS,T\subseteq U be vertices of GG, and let sS,tTs\in S,t\in T; since UU is connected, there is a path p:[0,1]Up:[0,1]\to U such that s=p(0),t=p(1)s=p(0),t=p(1). Let S=S0,,Sn=TS=S_{0},\dotsc,S_{n}=T be the multiset of vertices such that λ[0,1],i:p(λ)Si\forall\lambda\in[0,1],\exists i:p(\lambda)\in S_{i}; let ψ(λ){0,,n}\psi(\lambda)\in\{0,\dotsc,n\} be the index such that p(λ)Sψ(λ)p(\lambda)\in S_{\psi(\lambda)}, and order the sequence such that if α<β\alpha<\beta then ψ(α)ψ(β)\psi(\alpha)\leq\psi(\beta) (note that we may have Si=SjS_{i}=S_{j} for some i<ji<j if p(λ)p(\lambda) visits the same set more than once). Then for any ii, Si,Si+1US_{i},S_{i+1}\subseteq U since the path visits both and is contained in UU. If Si,Si+1S_{i},S_{i+1} are on opposite sides of HH, then there is an edge (Si,Si+1)(S_{i},S_{i+1}) in GG; otherwise, the above claim implies Si=Si+1S_{i}=S_{i+1}. Thus there is a path SS to TT in GG; this proves that Φ\Phi is injective, so 𝖼𝗈𝗆𝗉(AB)\mathsf{comp}(A\setminus B) is indeed the number of graph-connected components of GG.

Now let (S,T)E(S,T)\in E, so there is a component UU of ABA\setminus B such that S,TUS,T\subset U. For any sS,tTs\in S,t\in T there is a continuous path ps,t:[0,1]Up_{s,t}:[0,1]\to U where ps,t(0)=s,ps,t(1)=tp_{s,t}(0)=s,p_{s,t}(1)=t. There must be some z[0,1]z\in[0,1] such that ps,t(z)Hp_{s,t}(z)\in H, otherwise the path is a path in A(HB)A\setminus(H\cup B) and S=TS=T. Since ps,t(z)HUp_{s,t}(z)\in H\cap U, we have ps,t(z)Bp_{s,t}(z)\notin B, so there is some component Z𝒞HZ\in\mathcal{C}_{H} containing ps,t(z)p_{s,t}(z). We will map the edge (S,T)(S,T) to an arbitrary such ZZ, for any choice s,t,zs,t,z, and show that this mapping is injective. Suppose that (S,T),(S,T)(S,T),(S^{\prime},T^{\prime}) map to the same Z𝒞HZ\in\mathcal{C}_{H}. Without loss of generality we may assume that S,SS,S^{\prime} lie on the same side of HH and that xSS,x1<a\forall x\in S\cup S^{\prime},x_{1}<a. Then there are sS,sS,tT,tTs\in S,s^{\prime}\in S^{\prime},t\in T,t^{\prime}\in T^{\prime}, and z,z[0,1]z,z^{\prime}\in[0,1] such that ps,t(z),ps,t(z)Zp_{s,t}(z),p_{s^{\prime},t^{\prime}}(z^{\prime})\in Z. Then since ZZ is a connected component, we may take z,zz,z^{\prime} to be the least such values that ps,t(z),ps,t(z)Zp_{s,t}(z),p_{s^{\prime},t^{\prime}}(z^{\prime})\in Z, and connect ps,t(z),ps,t(z)p_{s,t}(z),p_{s^{\prime},t^{\prime}}(z^{\prime}) by a path in ZZ to obtain a path q:[0,1]Uq:[0,1]\to U such that q(0)=s,q(1)=sq(0)=s,q(1)=s^{\prime}, and λ,q(λ)1a\forall\lambda,q(\lambda)_{1}\leq a. Then by the above claim, S=SS=S^{\prime}; the same holds for T,TT,T^{\prime}, so the mapping is injective. This completes the proof of the proposition. ∎

Proposition 6.4.

For any set \mathcal{H} of measurable functions f:d{±1}f:\mathbb{R}^{d}\to\{\pm 1\} and any r>1r>1,

Pr(n,m)Pr(n,m1)+(r1)Pr(n1,m1).P_{r}(n,m)\leq P_{r}(n,m-1)+(r-1)\cdot P_{r}(n-1,m-1)\,.
Proof.

Let ff\in\mathcal{H}, A𝒜nA\in\mathcal{A}_{n}, and ai,ja_{i,j}\in\mathbb{R} for i[m],j[r1]i\in[m],j\in[r-1] be such that the number of connected components in ABA\setminus B, where B=fi,jHi,jB=\partial f\cup\bigcup_{i,j}H_{i,j} and Hi,j={xA|xi=ai,j}H_{i,j}=\{x\in A\;|\;x_{i}=a_{i,j}\}, is Pr(n,m)P_{r}(n,m). For 0kr10\leq k\leq r-1 let

Bk:=f(i=1m1j=1r1Hi,j)(j=1kHm,j),B_{k}:=\partial f\cup\left(\bigcup_{i=1}^{m-1}\bigcup_{j=1}^{r-1}H_{i,j}\right)\cup\left(\bigcup_{j=1}^{k}H_{m,j}\right)\,,

so that B=Br1B=B_{r-1} and Bk=Bk1Hm,kB_{k}=B_{k-1}\cup H_{m,k}. Since B0B_{0} is an (r,n,m1)(r,n,m-1)-arrangement, 𝖼𝗈𝗆𝗉(AB0)Pr(n,m1)\mathsf{comp}(A\setminus B_{0})\leq P_{r}(n,m-1). For k>0k>0, since BkB_{k} is obtained from Bk1B_{k-1} by adding a hyperplane Hm,kH_{m,k}, Proposition 6.2 implies

𝖼𝗈𝗆𝗉(ABk)𝖼𝗈𝗆𝗉(ABk1)+𝖼𝗈𝗆𝗉(Hm,kBk1)𝖼𝗈𝗆𝗉(ABk1)+Pr(n1,m1),\mathsf{comp}(A\setminus B_{k})\leq\mathsf{comp}(A\setminus B_{k-1})+\mathsf{comp}(H_{m,k}\setminus B_{k-1})\leq\mathsf{comp}(A\setminus B_{k-1})+P_{r}(n-1,m-1)\,,

because Hm,kBk1H_{m,k}\setminus B_{k-1} is an (r,n1,m1)(r,n-1,m-1)-arrangement. Iterating r1r-1 times, once for each added hyperplane, we arrive at

Pr(n,m)\displaystyle P_{r}(n,m) =𝖼𝗈𝗆𝗉(AB)\displaystyle=\mathsf{comp}(A\setminus B)
=𝖼𝗈𝗆𝗉(AB0)+k=1r1(𝖼𝗈𝗆𝗉(ABk)𝖼𝗈𝗆𝗉(ABk1))\displaystyle=\mathsf{comp}(A\setminus B_{0})+\sum_{k=1}^{r-1}\left(\mathsf{comp}(A\setminus B_{k})-\mathsf{comp}(A\setminus B_{k-1})\right)
Pr(n,m1)+(r1)Pr(n1,m1).\displaystyle\leq P_{r}(n,m-1)+(r-1)P_{r}(n-1,m-1)\,.\qed
Lemma 6.5.

For any set \mathcal{H} of measurable functions d{±1}\mathbb{R}^{d}\to\{\pm 1\} and any rr,

Pr(d,d)(r1)d+i=0d1(di)M(di)(r1)i.P_{r}(d,d)\leq(r-1)^{d}+\sum_{i=0}^{d-1}\binom{d}{i}\cdot M(d-i)\cdot(r-1)^{i}\,.
Proof.

Write s=r1s=r-1 for convenience. We will show by induction the more general statement that for any mndm\leq n\leq d,

Pr(n,m)i=0m(mi)M(ni)siP_{r}(n,m)\leq\sum_{i=0}^{m}\binom{m}{i}\cdot M(n-i)\cdot s^{i}

where we define M(0):=1M(0):=1. In the base case, note that Pr(n,0)=M(n)P_{r}(n,0)=M(n). Assume the statement holds for Pr(n,m)P_{r}(n^{\prime},m^{\prime}) when nn,m<mn^{\prime}\leq n,m^{\prime}<m. Then by Proposition 6.4,

Pr(n,m)\displaystyle P_{r}(n,m) Pr(n,m1)+sPr(n1,m1)\displaystyle\leq P_{r}(n,m-1)+s\cdot P_{r}(n-1,m-1)
i=0m1(m1i)M(ni)si+i=0m1(m1i)M(n1i)si+1\displaystyle\leq\sum_{i=0}^{m-1}{m-1\choose i}\cdot M(n-i)\cdot s^{i}+\sum_{i=0}^{m-1}{m-1\choose i}\cdot M(n-1-i)\cdot s^{i+1}
i=0m1(m1i)M(ni)si+i=1m(m1i1)M(ni)si\displaystyle\leq\sum_{i=0}^{m-1}{m-1\choose i}\cdot M(n-i)\cdot s^{i}+\sum_{i=1}^{m}{m-1\choose i-1}\cdot M(n-i)\cdot s^{i}
=M(n)+M(nm)sm+i=1m1((m1i)+(m1i1))M(ni)si\displaystyle=M(n)+M(n-m)\cdot s^{m}+\sum_{i=1}^{m-1}\left({m-1\choose i}+{m-1\choose i-1}\right)\cdot M(n-i)\cdot s^{i}
=i=0m(mi)M(ni)si.\displaystyle=\sum_{i=0}^{m}\binom{m}{i}\cdot M(n-i)\cdot s^{i}\,.\qed
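As a quick sanity check, the following short Python sketch iterates the recursion of Proposition 6.4 with the base case P_r(n,0)=M(n) and compares it against the closed form of Lemma 6.5. It is purely illustrative: M(n)=k^n is the assumption from Lemma 6.6, and the values of k, r, d are arbitrary.

```python
# Numeric sanity check (illustration only) for Proposition 6.4 and Lemma 6.5.
# M(n) = k**n is the assumption from Lemma 6.6; k, r, d are arbitrary small values.
from functools import lru_cache
from math import comb

k, r, d = 3, 7, 6
s = r - 1

def M(n):                      # assumed bound on the number of components, with M(0) = 1
    return 1 if n == 0 else k ** n

@lru_cache(maxsize=None)
def U(n, m):                   # iterate the recursion of Proposition 6.4 with equality
    return M(n) if m == 0 else U(n, m - 1) + s * U(n - 1, m - 1)

def closed_form(n, m):         # the bound of Lemma 6.5 in its general form
    return sum(comb(m, i) * M(n - i) * s ** i for i in range(m + 1))

for n in range(d + 1):
    for m in range(n + 1):
        assert U(n, m) == closed_form(n, m)
print("P_r(d,d) is at most", closed_form(d, d))
```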
Lemma 6.6.

Let \mathcal{H} be a set of functions f:d{±1}f:\mathbb{R}^{d}\to\{\pm 1\} such that for some k1k\geq 1, M(n)knM(n)\leq k^{n}. Then for any ϵ>0\epsilon>0 and r3dk/ϵr\geq 3dk/\epsilon, 𝖻𝖻𝗌(,r)<ϵrd\mathsf{bbs}(\mathcal{H},r)<\epsilon\cdot r^{d}.

Proof.

Write s=r1s=r-1. By Proposition 6.1 and Lemma 6.5, the fraction of non-constant blocks satisfies

𝖻𝖻𝗌(r)rd\displaystyle\frac{\mathsf{bbs}(r)}{r^{d}} rd(Pr(d,d)rd)rd(i=0d1[(di)M(di)si]+sdrd)\displaystyle\leq r^{-d}\left(P_{r}(d,d)-r^{d}\right)\leq r^{-d}\left(\sum_{i=0}^{d-1}\left[{d\choose i}\cdot M(d-i)\cdot s^{i}\right]+s^{d}-r^{d}\right)
rdi=0d1(di)M(di)si.\displaystyle\leq r^{-d}\sum_{i=0}^{d-1}{d\choose i}\cdot M(d-i)\cdot s^{i}\,.

Split the sum into two parts:

i=0d/2(di)M(di)sird+i=1d/21(di)M(i)sdird\displaystyle\sum_{i=0}^{\lfloor d/2\rfloor}{d\choose i}\cdot\frac{M(d-i)\cdot s^{i}}{r^{d}}+\sum_{i=1}^{\lceil d/2\rceil-1}{d\choose i}\cdot\frac{M(i)\cdot s^{d-i}}{r^{d}}
i=0d/2(di)kdirird+i=1d/21(di)kirdird\displaystyle\qquad\leq\sum_{i=0}^{\lfloor d/2\rfloor}{d\choose i}\cdot\frac{k^{d-i}\cdot r^{i}}{r^{d}}+\sum_{i=1}^{\lceil d/2\rceil-1}{d\choose i}\cdot\frac{k^{i}\cdot r^{d-i}}{r^{d}}
i=0d/2dikdirird+i=1d/21dikirdirdi=0d/2ϵdi3diddi+i=1d/21ϵi3i\displaystyle\qquad\leq\sum_{i=0}^{\lfloor d/2\rfloor}\frac{d^{i}k^{d-i}\cdot r^{i}}{r^{d}}+\sum_{i=1}^{\lceil d/2\rceil-1}\frac{d^{i}k^{i}\cdot r^{d-i}}{r^{d}}\qquad\leq\sum_{i=0}^{\lfloor d/2\rfloor}\frac{\epsilon^{d-i}}{3^{d-i}d^{d-i}}+\sum_{i=1}^{\lceil d/2\rceil-1}\frac{\epsilon^{i}}{3^{i}}
ϵ3+i=2d/21ϵi3i+d/2ϵd/23d/2dd/2ϵ3+ϵ3i=1ϵi3i+ϵd/23d/2ϵ.\displaystyle\qquad\leq\frac{\epsilon}{3}+\sum_{i=2}^{\lceil d/2\rceil-1}\frac{\epsilon^{i}}{3^{i}}+\lfloor d/2\rfloor\cdot\frac{\epsilon^{\lceil d/2\rceil}}{3^{\lceil d/2\rceil}d^{\lceil d/2\rceil}}\qquad\leq\frac{\epsilon}{3}+\frac{\epsilon}{3}\sum_{i=1}^{\infty}\frac{\epsilon^{i}}{3^{i}}+\frac{\epsilon^{\lceil d/2\rceil}}{3^{\lceil d/2\rceil}}\leq\epsilon\,.\qed

It is a standard fact that for degree-kk polynomials, M(1)kM(1)\leq k, and a special case of a theorem of Warren gives a bound for larger dimensions:

Theorem 6.7 ([War68]).

Polynomial threshold functions p:d{±1}p:\mathbb{R}^{d}\to\{\pm 1\} of degree kk have M(n)6(2k)nM(n)\leq 6(2k)^{n}.

Since M(1)24kM(1)\leq\sqrt{24}k and 6(2k)n(24k)n6(2k)^{n}\leq(\sqrt{24}k)^{n} for n>1n>1, we have M(n)(24k)nM(n)\leq(\sqrt{24}k)^{n} for all n1n\geq 1, so Lemma 6.6 gives us:

Corollary 6.8.

For r324dk/ϵr\geq 3\sqrt{24}dk/\epsilon, rd𝖻𝖻𝗌(𝒫k,r)<ϵr^{-d}\cdot\mathsf{bbs}(\mathcal{P}_{k},r)<\epsilon.

6.2 Application

As was the case for halfspaces, our reduction of noise sensitivity on [r]d[r]^{d} to {±1}d\{\pm 1\}^{d} requires that the class 𝒫k\mathcal{P}_{k} is invariant under linear transformations:

Proposition 6.9.

For any f𝒫kf\in\mathcal{P}_{k} and full-rank linear transformation Ad×dA\in\mathbb{R}^{d\times d}, fA𝒫kf\circ A\in\mathcal{P}_{k}.

Proof.

Let f(x)=sign(p(x))f(x)=\operatorname{sign}(p(x)) where pp is a degree-kk polynomial and let cqi=1dxiqic_{q}\prod_{i=1}^{d}x_{i}^{q_{i}} be a term of pp, where cqc_{q}\in\mathbb{R} and q0dq\in\mathbb{Z}_{\geq 0}^{d} such that iqik\sum_{i}q_{i}\leq k. Let AidA_{i}\in\mathbb{R}^{d} be the ithi^{\text{th}} row of AA. Then

i=1d(Ax)iqi=i=1d(j=1dAi,jxj)qi=pq(x)\prod_{i=1}^{d}(Ax)_{i}^{q_{i}}=\prod_{i=1}^{d}\left(\sum_{j=1}^{d}A_{i,j}x_{j}\right)^{q_{i}}=p_{q}(x)

where pq(x)p_{q}(x) is some polynomial of degree at most i=1dqik\sum_{i=1}^{d}q_{i}\leq k. Then pA=qcqpqp\circ A=\sum_{q}c_{q}p_{q} where qq ranges over 0d\mathbb{Z}_{\geq 0}^{d} with iqik\sum_{i}q_{i}\leq k, and each pqp_{q} has degree at most kk, so pAp\circ A is a degree-kk polynomial. ∎
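The following small sympy example illustrates this degree bound on a concrete instance; the polynomial and the matrix are arbitrary choices made for the check, not taken from the paper.

```python
# Illustration (not part of the proof): composing a degree-k polynomial with a
# linear map does not raise its total degree. Requires sympy; the polynomial and
# the matrix below are arbitrary examples.
import sympy as sp

x1, x2, x3 = sp.symbols("x1 x2 x3")

def p(u1, u2, u3):             # a degree-3 polynomial in three variables
    return 2*u1**2*u3 - u2**3 + 5*u1*u2 - 7

A = sp.Matrix([[1, 2, 0], [0, 1, -1], [3, 0, 1]])   # a full-rank linear map
y = A * sp.Matrix([x1, x2, x3])                      # the coordinates of Ax

deg_p  = sp.Poly(p(x1, x2, x3), x1, x2, x3).total_degree()
deg_pA = sp.Poly(sp.expand(p(y[0], y[1], y[2])), x1, x2, x3).total_degree()
print(deg_p, deg_pA)           # prints "3 3": the composition is again a degree-3 polynomial
```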

The last ingredient we need is the following bound of Diakonikolas et al. on the noise sensitivity:

Theorem 6.10 ([DHK+10]).

Let f:{±1}d{±1}f:\{\pm 1\}^{d}\to\{\pm 1\} be a degree-kk PTF. Then for any δ[0,1]\delta\in[0,1],

𝗇𝗌2,δ(f)\displaystyle\mathsf{ns}_{2,\delta}(f) O(δ1/2k)\displaystyle\leq O(\delta^{1/2^{k}})
𝗇𝗌2,δ(f)\displaystyle\mathsf{ns}_{2,\delta}(f) 2O(k)δ1/(4k+2)log(1/δ).\displaystyle\leq 2^{O(k)}\cdot\delta^{1/(4k+2)}\log(1/\delta)\,.

Putting everything together, we obtain a bound that is polynomial in dd for any fixed k,ϵk,\epsilon, and which matches the result of Diakonikolas et al. [DHK+10] for the uniform distribution over {±1}d\{\pm 1\}^{d}.

See 1.1.3

Proof.

We prove the continuous case here; the finite case is proved in Theorem A.18.

Let r=9dk/ϵr=\lceil 9dk/\epsilon\rceil, so that by Corollary 6.8, condition 1 of Lemma 5.2 is satisfied. Due to Proposition 6.9, we may apply Theorem 6.10 and Lemma 5.3 to conclude that for all f𝒫kf\in\mathcal{P}_{k}

𝗇𝗌r,δ(f𝖻𝗅𝗈𝖼𝗄)\displaystyle\mathsf{ns}_{r,\delta}(f^{\mathsf{block}}) O(δ1/2k)\displaystyle\leq O(\delta^{1/2^{k}})
𝗇𝗌r,δ(f𝖻𝗅𝗈𝖼𝗄)\displaystyle\mathsf{ns}_{r,\delta}(f^{\mathsf{block}}) 2O(k)δ1/(4k+2)log(1/δ).\displaystyle\leq 2^{O(k)}\cdot\delta^{1/(4k+2)}\log(1/\delta)\,.

In the first case, setting δ=O(ϵ2k+1)\delta=O(\epsilon^{2^{k+1}}) we get 𝗇𝗌r,δ(f𝖻𝗅𝗈𝖼𝗄)<ϵ2/3\mathsf{ns}_{r,\delta}(f^{\mathsf{block}})<\epsilon^{2}/3, so by Lemma 5.6 we get a set \mathcal{F} of functions [r]d[r]^{d}\to\mathbb{R} of size ||(rd)O(1ϵ2k+1)|\mathcal{F}|\leq(rd)^{O\left(\frac{1}{\epsilon^{2^{k+1}}}\right)} satisfying condition 2 of Lemma 5.2. For n=poly(||,1/ϵ)n=\operatorname{poly}(|\mathcal{F}|,1/\epsilon), Lemma 5.2 implies an algorithm with sample size

O(rd2n2log(rd))=O(d3kϵlog(dk/ϵ))(kdϵ)O(1ϵ2k+1).O(rd^{2}n^{2}\log(rd))=O\left(\frac{d^{3}k}{\epsilon}\log(dk/\epsilon)\right)\cdot\left(\frac{kd}{\epsilon}\right)^{O\left(\frac{1}{\epsilon^{2^{k+1}}}\right)}\,.

In the second case, setting δ=O((ϵ22O(k)log(2k/ϵ))4k+2)\delta=O\left(\left(\frac{\epsilon^{2}}{2^{O(k)}\log(2^{k}/\epsilon)}\right)^{4k+2}\right), we again obtain 𝗇𝗌r,δ(f𝖻𝗅𝗈𝖼𝗄)ϵ2/3\mathsf{ns}_{r,\delta}(f^{\mathsf{block}})\leq\epsilon^{2}/3 and get an algorithm with sample size

(kdϵ)2O(k2)(log(1/ϵ)ϵ2)4k+2.\left(\frac{kd}{\epsilon}\right)^{2^{O(k^{2})}\left(\frac{\log(1/\epsilon)}{\epsilon^{2}}\right)^{4k+2}}\,.

The final result is obtained by applying Lemma 4.5. ∎

7 Learning & Testing kk-Alternating Functions

A function f:𝒳{±1}f:\mathcal{X}\to\{\pm 1\} on a partial order 𝒳\mathcal{X} is kk-alternating if for every chain x1<<xk+2x_{1}<\dotsc<x_{k+2} there is i[k+1]i\in[k+1] such that f(xi)=f(xi+1)f(x_{i})=f(x_{i+1}). Monotone functions are examples of 1-alternating functions. We consider kk-alternating functions on d\mathbb{R}^{d} with the usual partial order: for x,ydx,y\in\mathbb{R}^{d} we say x<yx<y when xiyix_{i}\leq y_{i} for each i[d]i\in[d] and xyx\neq y. Once again, we must bound the block boundary size, which has been done already by Canonne et al. [CGG+19]. We include the proof because their work does not share our definition of block boundary size, and because we use it in our short proof of the monotonicity tester in Section 3.1.

Lemma 7.1 ([CGG+19]).

The rr-block boundary size of kk-alternating functions is at most kdrd1kdr^{d-1}.

Proof.

Let ff be kk-alternating, let 𝖻𝗅𝗈𝖼𝗄:d[r]d\mathsf{block}:\mathbb{R}^{d}\to{[r]}^{d} be any block partition and let v1,,vmv_{1},\dotsc,v_{m} be any chain in [r]d{[r]}^{d}. Suppose that there are k+1k+1 indices i1,,ik+1i_{1},\dotsc,i_{k+1} such that ff is not constant on 𝖻𝗅𝗈𝖼𝗄1(vij)\mathsf{block}^{-1}(v_{i_{j}}). Then there is a set of points x1,,xk+1x_{1},\dotsc,x_{k+1} such that xj𝖻𝗅𝗈𝖼𝗄1(vij)x_{j}\in\mathsf{block}^{-1}(v_{i_{j}}) and f(xj)f(xj+1)f(x_{j})\neq f(x_{j+1}) for each j[k]j\in[k]. But since vi1<<vik+1v_{i_{1}}<\dotsm<v_{i_{k+1}}, x1<<xk+1x_{1}<\dotsm<x_{k+1} also, which contradicts the fact that ff is kk-alternating. Then every chain in [r]d{[r]}^{d} has at most kk non-constant blocks, and we may partition [r]d{[r]}^{d} into at most drd1dr^{d-1} chains by taking the diagonals v+λ1v+\lambda\vec{1} where vv is any vector satisfying i:vi=1\exists i:v_{i}=1 and λ\lambda ranges over all integers. ∎
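The chain decomposition in this proof is easy to make concrete. The sketch below, for illustrative small values of r and d, enumerates the diagonals v+λ·(1,…,1) with min(v)=1, checks that they partition [r]^d, and verifies the d·r^{d−1} bound.

```python
# Illustration of the chain decomposition from the proof of Lemma 7.1: the
# diagonals v + lambda*(1,...,1) with min(v) = 1 partition [r]^d into at most
# d * r**(d-1) chains. Small r and d are chosen only for the check.
from itertools import product

r, d = 4, 3
starts = [v for v in product(range(1, r + 1), repeat=d) if min(v) == 1]

chains = []
for v in starts:
    chain, lam = [], 0
    while max(v) + lam <= r:
        chain.append(tuple(c + lam for c in v))
        lam += 1
    chains.append(chain)

cells = [w for chain in chains for w in chain]
assert sorted(cells) == sorted(product(range(1, r + 1), repeat=d))   # an exact partition
assert len(chains) <= d * r ** (d - 1)
print(len(chains), "chains cover all", r ** d, "blocks")
```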

Canonne et al. also use a noise sensitivity bound to obtain a spanning set \mathcal{F}; we quote their result.

Lemma 7.2 ([CGG+19]).

There is a set \mathcal{F} of functions [r]d{[r]}^{d}\to\mathbb{R}, with size

||exp(O(kdϵ2log(rd/ϵ))),|\mathcal{F}|\leq\mathrm{exp}\left(O\left(\frac{k\sqrt{d}}{\epsilon^{2}}\log(rd/\epsilon)\right)\right)\,,

such that for any kk-alternating function h:[r]d{±1}h:{[r]}^{d}\to\{\pm 1\}, there is g:[r]dg:{[r]}^{d}\to\mathbb{R} that is a linear combination of functions in \mathcal{F} and 𝔼x[r]d[(h(x)g(x))2]ϵ2\underset{x\sim{[r]}^{d}}{\mathbb{E}}\left[{(h(x)-g(x))}^{2}\right]\leq\epsilon^{2}.

See 1.1.5

Proof.

We prove the continuous case; for the finite case see Theorem A.18.

Let r=dk/ϵr=\lceil dk/\epsilon\rceil and let 𝖻𝗅𝗈𝖼𝗄:d[r]d\mathsf{block}:\mathbb{R}^{d}\to[r]^{d} be any rr-block partition. By Lemma 7.1, the first condition of Lemma 5.2 is satisfied. Now let ff\in\mathcal{H} and consider f𝖻𝗅𝗈𝖼𝗄f^{\mathsf{block}}. For any chain v1<v2<<vmv_{1}<v_{2}<\dotsm<v_{m} in [r]d[r]^{d}, it must be that 𝖻𝗅𝗈𝖼𝗄𝗉𝗈𝗂𝗇𝗍(v1)<𝖻𝗅𝗈𝖼𝗄𝗉𝗈𝗂𝗇𝗍(v2)<<𝖻𝗅𝗈𝖼𝗄𝗉𝗈𝗂𝗇𝗍(vm)\mathsf{blockpoint}(v_{1})<\mathsf{blockpoint}(v_{2})<\dotsm<\mathsf{blockpoint}(v_{m}) since every x𝖻𝗅𝗈𝖼𝗄1(vi),y𝖻𝗅𝗈𝖼𝗄1(vj)x\in\mathsf{block}^{-1}(v_{i}),y\in\mathsf{block}^{-1}(v_{j}) satisfy x<yx<y when vi<vjv_{i}<v_{j}; then ff alternates at most kk times on the chain 𝖻𝗅𝗈𝖼𝗄𝗉𝗈𝗂𝗇𝗍(v1)<<𝖻𝗅𝗈𝖼𝗄𝗉𝗈𝗂𝗇𝗍(vm)\mathsf{blockpoint}(v_{1})<\dotsm<\mathsf{blockpoint}(v_{m}) and, since f𝖻𝗅𝗈𝖼𝗄(vi)=f(𝖻𝗅𝗈𝖼𝗄𝗉𝗈𝗂𝗇𝗍(vi))f^{\mathsf{block}}(v_{i})=f(\mathsf{blockpoint}(v_{i})), f𝖻𝗅𝗈𝖼𝗄f^{\mathsf{block}} is also kk-alternating. Therefore the set \mathcal{F} of functions given by Lemma 7.2 satisfies condition 2 of Lemma 5.2, and we have n=poly(||,1/ϵ)=exp(O(kdϵ2log(rd/ϵ)))n=\operatorname{poly}(|\mathcal{F}|,1/\epsilon)=\mathrm{exp}\left(O\left(\frac{k\sqrt{d}}{\epsilon^{2}}\log(rd/\epsilon)\right)\right). Applying Lemma 5.2 gives an algorithm with sample complexity

O(rd2n2log(rd))=O(d3kϵlog(dk/ϵ)(dkϵ)O(kdϵ2))=(dkϵ)O(kdϵ2).O\left(rd^{2}n^{2}\log(rd)\right)=O\left(\frac{d^{3}k}{\epsilon}\log(dk/\epsilon)\cdot\left(\frac{dk}{\epsilon}\right)^{O\left(\frac{k\sqrt{d}}{\epsilon^{2}}\right)}\right)=\left(\frac{dk}{\epsilon}\right)^{O\left(\frac{k\sqrt{d}}{\epsilon^{2}}\right)}\,.

The other sample complexity follows from Lemma 4.5. ∎

See 1.1.5

Proof.

The following argument is for the continuous case, but generalizes to the finite case using the definitions in Appendix A.

Let \mathcal{H} be the class of kk-alternating functions. Suppose there is a set 𝒦\mathcal{K}\subset\mathcal{H}, known to the algorithm, that is a (τ/2)(\tau/2)-cover. Then, taking a set QQ of q=O(1τ2log|𝒦|)q=O(\tfrac{1}{\tau^{2}}\log|\mathcal{K}|) independent random samples from μ\mu and using Hoeffding’s inequality,

𝑄[h𝒦:|𝖽𝗂𝗌𝗍Q(f,h)𝖽𝗂𝗌𝗍μ(f,h)|>τ2]\displaystyle\underset{Q}{\mathbb{P}}\left[\exists h\in\mathcal{K}:|\mathsf{dist}_{Q}(f,h)-\mathsf{dist}_{\mu}(f,h)|>\frac{\tau}{2}\right] |𝒦|maxh𝒦[|𝖽𝗂𝗌𝗍Q(f,h)𝖽𝗂𝗌𝗍μ(f,h)|>τ2]\displaystyle\leq|\mathcal{K}|\cdot\max_{h\in\mathcal{K}}\mathbb{P}\left[|\mathsf{dist}_{Q}(f,h)-\mathsf{dist}_{\mu}(f,h)|>\frac{\tau}{2}\right]
|𝒦|2exp(qτ22)<1/6.\displaystyle\leq|\mathcal{K}|\cdot 2\mathrm{exp}\left(-\frac{q\tau^{2}}{2}\right)<1/6\,.

Then the tester accepts if 𝖽𝗂𝗌𝗍Q(f,𝒦)<ϵ1+τ\mathsf{dist}_{Q}(f,\mathcal{K})<\epsilon_{1}+\tau and rejects otherwise; we now prove that this is correct with high probability. Assume that the above estimation is accurate, which occurs with probability at least 5/65/6. If 𝖽𝗂𝗌𝗍μ(f,)ϵ1\mathsf{dist}_{\mu}(f,\mathcal{H})\leq\epsilon_{1} then, taking hh\in\mathcal{H} with 𝖽𝗂𝗌𝗍μ(f,h)ϵ1\mathsf{dist}_{\mu}(f,h)\leq\epsilon_{1}, we get 𝖽𝗂𝗌𝗍μ(f,𝒦)𝖽𝗂𝗌𝗍μ(f,h)+𝖽𝗂𝗌𝗍μ(h,𝒦)ϵ1+τ/2\mathsf{dist}_{\mu}(f,\mathcal{K})\leq\mathsf{dist}_{\mu}(f,h)+\mathsf{dist}_{\mu}(h,\mathcal{K})\leq\epsilon_{1}+\tau/2. Then for g𝒦g\in\mathcal{K} minimizing 𝖽𝗂𝗌𝗍μ(f,g)\mathsf{dist}_{\mu}(f,g),

𝖽𝗂𝗌𝗍Q(f,𝒦)𝖽𝗂𝗌𝗍Q(f,g)<𝖽𝗂𝗌𝗍μ(f,g)+τ2ϵ1+τ,\mathsf{dist}_{Q}(f,\mathcal{K})\leq\mathsf{dist}_{Q}(f,g)<\mathsf{dist}_{\mu}(f,g)+\frac{\tau}{2}\leq\epsilon_{1}+\tau\,,

so ff is accepted. Now suppose that ff is accepted, so 𝖽𝗂𝗌𝗍Q(f,𝒦)<ϵ1+τ\mathsf{dist}_{Q}(f,\mathcal{K})<\epsilon_{1}+\tau. Then, taking g𝒦g\in\mathcal{K} with 𝖽𝗂𝗌𝗍Q(f,g)<ϵ1+τ\mathsf{dist}_{Q}(f,g)<\epsilon_{1}+\tau,

𝖽𝗂𝗌𝗍μ(f,)𝖽𝗂𝗌𝗍μ(f,g)𝖽𝗂𝗌𝗍Q(f,g)+τ2<ϵ1+32τ=ϵ1+34(ϵ2ϵ1)ϵ2.\mathsf{dist}_{\mu}(f,\mathcal{H})\leq\mathsf{dist}_{\mu}(f,g)\leq\mathsf{dist}_{Q}(f,g)+\frac{\tau}{2}<\epsilon_{1}+\frac{3}{2}\tau=\epsilon_{1}+\frac{3}{4}(\epsilon_{2}-\epsilon_{1})\leq\epsilon_{2}\,.

What remains is to show how the tester constructs such a cover 𝒦\mathcal{K}.

Consider the learning algorithm of Theorem 1.1.5 with error parameter τ/12\tau/12, so r=12dk/τr=\lceil 12dk/\tau\rceil. Let XX be the grid constructed by that algorithm and let 𝖻𝗅𝗈𝖼𝗄:d[r]d\mathsf{block}:\mathbb{R}^{d}\to[r]^{d} be the induced rr-block partition. We may assume that with probability at least 5/65/6, 𝖻𝗅𝗈𝖼𝗄(μ)𝗎𝗇𝗂𝖿([r]d)𝖳𝖵<τ/12\|\mathsf{block}(\mu)-\mathsf{unif}([r]^{d})\|_{\mathsf{TV}}<\tau/12; suppose that this event occurs. The learner then takes m=(dkτ)O(kdτ2)m=\left(\frac{dk}{\tau}\right)^{O\left(\frac{k\sqrt{d}}{\tau^{2}}\right)} additional samples to learn the class 𝖻𝗅𝗈𝖼𝗄\mathcal{H}^{\mathsf{block}} with domain [r]d[r]^{d}. For every ff\in\mathcal{H} the learner has positive probability of outputting a function h:[r]d{0,1}h:[r]^{d}\to\{0,1\} with 𝑣[h(v)f𝖻𝗅𝗈𝖼𝗄(v)]<τ/12\underset{v}{\mathbb{P}}\left[h(v)\neq f^{\mathsf{block}}(v)\right]<\tau/12 (where vv is chosen from 𝖻𝗅𝗈𝖼𝗄(μ)\mathsf{block}(\mu)). Let 𝒦\mathcal{K}^{\prime} be the set of possible outputs of the learner; then 𝒦\mathcal{K}^{\prime} is a (τ/12)(\tau/12)-cover for 𝖻𝗅𝗈𝖼𝗄\mathcal{H}^{\mathsf{block}}. Construct a set 𝒦𝖻𝗅𝗈𝖼𝗄\mathcal{K}^{\mathsf{block}} by choosing, for each h𝒦h\in\mathcal{K}^{\prime}, the nearest function g𝖻𝗅𝗈𝖼𝗄g\in\mathcal{H}^{\mathsf{block}} with respect to the distribution 𝖻𝗅𝗈𝖼𝗄(μ)\mathsf{block}(\mu). Then 𝒦𝖻𝗅𝗈𝖼𝗄\mathcal{K}^{\mathsf{block}} is a (τ/6)(\tau/6)-cover, since for any function f𝖻𝗅𝗈𝖼𝗄𝖻𝗅𝗈𝖼𝗄f^{\mathsf{block}}\in\mathcal{H}^{\mathsf{block}}, if h𝒦h\in\mathcal{K}^{\prime} is the nearest output of the learner and g𝒦𝖻𝗅𝗈𝖼𝗄g\in\mathcal{K}^{\mathsf{block}} is nearest hh, then by the triangle inequality f𝖻𝗅𝗈𝖼𝗄f^{\mathsf{block}} has distance at most τ/6\tau/6 to gg with respect to 𝖻𝗅𝗈𝖼𝗄(μ)\mathsf{block}(\mu). Finally, construct a set 𝒦\mathcal{K}\subset\mathcal{H} by taking each function hh\in\mathcal{H} such that h𝖼𝗈𝖺𝗋𝗌𝖾=hh^{\mathsf{coarse}}=h and h𝖻𝗅𝗈𝖼𝗄𝒦𝖻𝗅𝗈𝖼𝗄h^{\mathsf{block}}\in\mathcal{K}^{\mathsf{block}} (note that there exists hh\in\mathcal{H} such that h𝖼𝗈𝖺𝗋𝗌𝖾=hh^{\mathsf{coarse}}=h since h𝖼𝗈𝖺𝗋𝗌𝖾h^{\mathsf{coarse}} is kk-alternating when h𝖻𝗅𝗈𝖼𝗄h^{\mathsf{block}} is kk-alternating). Then 𝒦\mathcal{K} is a (τ/2)(\tau/2)-cover since for any ff\in\mathcal{H}, when h𝒦h\in\mathcal{K} minimizes v𝖻𝗅𝗈𝖼𝗄(μ)[f𝖻𝗅𝗈𝖼𝗄(v)h𝖻𝗅𝗈𝖼𝗄(v)]\underset{v\sim\mathsf{block}(\mu)}{\mathbb{P}}\left[f^{\mathsf{block}}(v)\neq h^{\mathsf{block}}(v)\right],

𝖽𝗂𝗌𝗍μ(f,𝒦)\displaystyle\mathsf{dist}_{\mu}(f,\mathcal{K})
𝖽𝗂𝗌𝗍μ(f,f𝖼𝗈𝖺𝗋𝗌𝖾)+𝖽𝗂𝗌𝗍μ(f𝖼𝗈𝖺𝗋𝗌𝖾,𝒦)\displaystyle\qquad\leq\mathsf{dist}_{\mu}(f,f^{\mathsf{coarse}})+\mathsf{dist}_{\mu}(f^{\mathsf{coarse}},\mathcal{K})
rd𝖻𝖻𝗌(,r)+v𝖻𝗅𝗈𝖼𝗄(μ)[f𝖻𝗅𝗈𝖼𝗄(v)h𝖻𝗅𝗈𝖼𝗄(v)]+2𝖻𝗅𝗈𝖼𝗄(μ)𝗎𝗇𝗂𝖿([r]d)𝖳𝖵\displaystyle\qquad\leq r^{-d}\cdot\mathsf{bbs}(\mathcal{H},r)+\underset{v\sim\mathsf{block}(\mu)}{\mathbb{P}}\left[f^{\mathsf{block}}(v)\neq h^{\mathsf{block}}(v)\right]+2\|\mathsf{block}(\mu)-\mathsf{unif}([r]^{d})\|_{\mathsf{TV}}
<τ/6+τ/6+2τ/12τ/2.\displaystyle\qquad<\tau/6+\tau/6+2\tau/12\leq\tau/2\,.

Now we bound the size of 𝒦𝖻𝗅𝗈𝖼𝗄\mathcal{K}^{\mathsf{block}}. Since there are mm samples and each sample v𝖻𝗅𝗈𝖼𝗄(μ)v\sim\mathsf{block}(\mu) is in [r]d[r]^{d}, labelled by {0,1}\{0,1\}, there are at most (rd)m2m(r^{d})^{m}2^{m} possible sample sequences, so at most (2rd)m(2r^{d})^{m} outputs of the learner (after constructing XX), so |𝒦𝖻𝗅𝗈𝖼𝗄|(2rd)m|\mathcal{K}^{\mathsf{block}}|\leq(2r^{d})^{m}. Therefore, after constructing XX, the tester may construct 𝒦𝖻𝗅𝗈𝖼𝗄\mathcal{K}^{\mathsf{block}} and run the above estimation procedure, with q=O(1τ2dmlogr)=(dkτ)O(kdτ2)q=O\left(\frac{1}{\tau^{2}}dm\log r\right)=\left(\frac{dk}{\tau}\right)^{O\left(\frac{k\sqrt{d}}{\tau^{2}}\right)}. ∎
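For concreteness, the accept/reject rule analyzed above can be sketched as follows. This is a toy illustration only: the cover is assumed to be given as a list of hypotheses, sample_mu is a stand-in for sampling from μ, and the constants in the sample bound are illustrative rather than the ones from the proof.

```python
# Toy sketch of the accept/reject rule described above. The cover `cover`, the
# query function f, and the sampler `sample_mu` are assumed to be given; the
# constants in the sample bound q are illustrative, not the ones from the proof.
import math
import random

def tolerant_test(f, cover, sample_mu, eps1, eps2, rng=random):
    tau = (eps2 - eps1) / 2.0
    q = math.ceil(8 * math.log(6 * max(len(cover), 1)) / tau ** 2)   # Hoeffding-style choice
    Q = [sample_mu(rng) for _ in range(q)]
    labels = [f(x) for x in Q]
    dist_Q = min(
        sum(1 for x, y in zip(Q, labels) if h(x) != y) / q for h in cover
    )
    return dist_Q < eps1 + tau          # accept iff the empirical distance to the cover is small

if __name__ == "__main__":
    cover = [lambda x, t=t: 1 if x >= t else -1 for t in (0.25, 0.5, 0.75)]   # toy cover
    noisy = lambda x: (1 if x >= 0.5 else -1) * (-1 if random.random() < 0.05 else 1)
    print(tolerant_test(noisy, cover, lambda rng: rng.random(), 0.1, 0.3))    # usually True
```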

Appendix A Discrete Distributions

We will say that a distribution μi\mu_{i} over \mathbb{R} is finite if it is a distribution over a finite set XX\subset\mathbb{R}. In this section, we extend downsampling to work for finite product distributions: distributions μ=μ1××μd\mu=\mu_{1}\times\dotsm\times\mu_{d} such that all μi\mu_{i} are finite. As mentioned in the introduction, our algorithms have the advantage that they do not need to know in advance whether the distribution is continuous or finite, and if it is finite they do not need to know the support. This is in contrast to the algorithms of Blais et al. [BOW10], which work for arbitrary finite product distributions but must know the support (since they learn a function under the “one-out-of-kk encoding”). Our algorithms have superior time complexity for large supports.

We begin with an example of a pathological set of functions that illustrates some of the difficulties in the generalization.

Example A.1.

The Dirichlet function f:{±1}f:\mathbb{R}\to\{\pm 1\} is the function that takes value 11 on all rational numbers and value 1-1 on all irrational numbers. We will define the Dirichlet class of functions as the set of all functions f:d{±1}f:\mathbb{R}^{d}\to\{\pm 1\} such that f(x)=1f(x)=-1 on all xx with at least 1 irrational coordinate xix_{i}, and f(x)f(x) is arbitrary for any xx with all rational coordinates. Since the Lebesgue measure of the set of rational numbers is 0, in any continuous product distribution, any function ff in the Dirichlet class satisfies [f(x)1]=0\mathbb{P}\left[f(x)\neq-1\right]=0; therefore learning this class is trivial in any continuous product distribution since we may output the constant 1-1 function. And 𝖻𝖻𝗌(f,r)=0\mathsf{bbs}(f,r)=0 for this class since no block contains a set SS of positive measure containing 11-valued points. On the other hand, if μ\mu is a finitely supported product distribution, then it may be the case that it is supported only on points with all rational coordinates. In that case, the Dirichlet class of functions is the set of all functions on the support, which is impossible to learn when the size of the support is unknown (since the number of samples will depend on the support size). It is apparent that our former definition of 𝖻𝖻𝗌\mathsf{bbs} no longer suffices to bound the complexity of algorithms when we allow finitely supported distributions.

Another difficulty arises for finitely supported distributions with small support: for example, the hypercube {±1}d\{\pm 1\}^{d}. Consider what happens when we attempt to sample a uniform grid, as in the first step of the algorithms above. We will sample many points xx such that x1=1x_{1}=1 and many points such that x1=1x_{1}=-1. Essentially, the algorithm takes a small domain {±1}d\{\pm 1\}^{d} and constructs the larger domain [r]d[r]^{d}, which is antithetical to the downsampling method. A similar situation would occur in large domains [n]d[n]^{d} where some coordinates have exceptionally large probability densities and are sampled many times. Our algorithm must be able to handle such cases, so we must redefine the grid sampling step and block partitions to handle this situation. To do so, we introduce augmented samples: for every sample point xμx\sim\mu we will append a uniformly random value in [0,1]d[0,1]^{d}.

A.1 Augmented Samples & Constructing Uniform Partitions

For augmented points a¯,b¯×[0,1]\overline{a},\overline{b}\in\mathbb{R}\times[0,1], where a¯=(a,a),b¯=(b,b)\overline{a}=(a,a^{\prime}),\overline{b}=(b,b^{\prime}), we will define a total order by saying a¯<b¯\overline{a}<\overline{b} if a<ba<b, or a=ba=b and a<ba^{\prime}<b^{\prime}. Define the interval (a¯,b¯]:={c¯|a¯<c¯b¯}(\overline{a},\overline{b}]:=\{\overline{c}\;|\;\overline{a}<\overline{c}\leq\overline{b}\}. For convenience, when a¯×[0,1]\overline{a}\in\mathbb{R}\times[0,1] and a¯=(a,a)\overline{a}=(a,a^{\prime}) we will write ξ(a¯)=a\xi(\overline{a})=a. If x¯d×[0,1]d\overline{x}\in\mathbb{R}^{d}\times[0,1]^{d} is an augmented vector (i.e. each coordinate x¯i\overline{x}_{i} is an augmented point), we will write ξ(x¯)=(ξ(x¯1),,ξ(x¯d))\xi(\overline{x})=(\xi(\overline{x}_{1}),\dotsc,\xi(\overline{x}_{d})); and when Sd×[0,1]dS\subseteq\mathbb{R}^{d}\times[0,1]^{d} is a set of augmented points, we will write ξ(S)={ξ(x¯)|x¯S}\xi(S)=\{\xi(\overline{x})\;|\;\overline{x}\in S\}.

Definition A.2 (Augmented Block Partition).

An augmented rr-block partition of d\mathbb{R}^{d} is a pair of functions 𝖻𝗅𝗈𝖼𝗄¯:d×[0,1]d[r]d\overline{\mathsf{block}}:\mathbb{R}^{d}\times[0,1]^{d}\to[r]^{d} and 𝖻𝗅𝗈𝖼𝗄𝗉𝗈𝗂𝗇𝗍:[r]dd\mathsf{blockpoint}:[r]^{d}\to\mathbb{R}^{d} obtained as follows. For each i[d],j[r1]i\in[d],j\in[r-1] let a¯i,j×[0,1]\overline{a}_{i,j}\in\mathbb{R}\times[0,1] such that a¯i,j<a¯i,j+1\overline{a}_{i,j}<\overline{a}_{i,j+1} and define a¯i,0=(,0),a¯i,r=(,1)\overline{a}_{i,0}=(-\infty,0),\overline{a}_{i,r}=(\infty,1). For each i[d],j[r]i\in[d],j\in[r] define the interval Bi,j=(a¯i,j1,a¯i,j]B_{i,j}=(\overline{a}_{i,j-1},\overline{a}_{i,j}] and a point b¯i,j×[0,1]\overline{b}_{i,j}\in\mathbb{R}\times[0,1] such that a¯i,j1b¯i,ja¯i,j\overline{a}_{i,j-1}\leq\overline{b}_{i,j}\leq\overline{a}_{i,j}. The function 𝖻𝗅𝗈𝖼𝗄¯:d×[0,1]d[r]d\overline{\mathsf{block}}:\mathbb{R}^{d}\times[0,1]^{d}\to[r]^{d} is defined by setting 𝖻𝗅𝗈𝖼𝗄¯(x¯)\overline{\mathsf{block}}(\overline{x}) to be the unique vector v[r]dv\in[r]^{d} such that x¯iBi,vi\overline{x}_{i}\in B_{i,v_{i}} for each i[d]i\in[d]. Observe that 𝖻𝗅𝗈𝖼𝗄¯1(v):={x¯:𝖻𝗅𝗈𝖼𝗄¯(x¯)=v}\overline{\mathsf{block}}^{-1}(v):=\{\overline{x}:\overline{\mathsf{block}}(\overline{x})=v\} is a set of augmented points in d×[0,1]d\mathbb{R}^{d}\times[0,1]^{d} and that it is possible for two augmented points x¯,y¯\overline{x},\overline{y} to satisfy ξ(x¯)=ξ(y¯)\xi(\overline{x})=\xi(\overline{y}) while 𝖻𝗅𝗈𝖼𝗄¯(x¯)𝖻𝗅𝗈𝖼𝗄¯(y¯)\overline{\mathsf{block}}(\overline{x})\neq\overline{\mathsf{block}}(\overline{y}). The function 𝖻𝗅𝗈𝖼𝗄𝗉𝗈𝗂𝗇𝗍:[r]dd\mathsf{blockpoint}:[r]^{d}\to\mathbb{R}^{d} is defined by setting 𝖻𝗅𝗈𝖼𝗄𝗉𝗈𝗂𝗇𝗍(v)=(ξ(b¯1,v1),,ξ(b¯d,vd))\mathsf{blockpoint}(v)=(\xi(\overline{b}_{1,v_{1}}),\dotsc,\xi(\overline{b}_{d,v_{d}})); note that this is a non-augmented point.

Definition A.3 (Block Functions and Coarse Functions).

For a function f:d{±1}f:\mathbb{R}^{d}\to\{\pm 1\} we will define the functions f𝖻𝗅𝗈𝖼𝗄:[r]d{±1}f^{\mathsf{block}}:[r]^{d}\to\{\pm 1\} as f𝖻𝗅𝗈𝖼𝗄:=f𝖻𝗅𝗈𝖼𝗄𝗉𝗈𝗂𝗇𝗍f^{\mathsf{block}}:=f\circ\mathsf{blockpoint} and for each z[0,1]dz\in[0,1]^{d} we will define fz𝖼𝗈𝖺𝗋𝗌𝖾:d{±1}f^{\mathsf{coarse}}_{z}:\mathbb{R}^{d}\to\{\pm 1\} as fz𝖼𝗈𝖺𝗋𝗌𝖾(x):=f𝖻𝗅𝗈𝖼𝗄(𝖻𝗅𝗈𝖼𝗄¯(x,z))f^{\mathsf{coarse}}_{z}(x):=f^{\mathsf{block}}(\overline{\mathsf{block}}(x,z)). Unlike in the continuous setting, fz𝖼𝗈𝖺𝗋𝗌𝖾f^{\mathsf{coarse}}_{z} depends on an additional variable z[0,1]dz\in[0,1]^{d}, which is necessary because a single point xdx\in\mathbb{R}^{d} may be augmented differently to get different 𝖻𝗅𝗈𝖼𝗄¯\overline{\mathsf{block}} values. For a distribution μ\mu over d\mathbb{R}^{d} define the augmented distribution μ¯\overline{\mu} over d×[0,1]d\mathbb{R}^{d}\times[0,1]^{d} as the distribution of (x,z)(x,z) when xμx\sim\mu and zz is uniform in [0,1]d[0,1]^{d}. For an augmented rr-block partition 𝖻𝗅𝗈𝖼𝗄¯:d×[0,1]d[r]d\overline{\mathsf{block}}:\mathbb{R}^{d}\times[0,1]^{d}\to[r]^{d} we define the distribution 𝖻𝗅𝗈𝖼𝗄¯(μ)\overline{\mathsf{block}}(\mu) over [r]d[r]^{d} as the distribution of 𝖻𝗅𝗈𝖼𝗄¯(x¯)\overline{\mathsf{block}}(\overline{x}) for x¯μ¯\overline{x}\sim\overline{\mu}.

Definition A.4 (Augmented Random Grid).

An augmented random grid X¯\overline{X} of length mm is obtained by sampling mm augmented points x¯1,,x¯mμ¯\overline{x}_{1},\dotsc,\overline{x}_{m}\sim\overline{\mu} and for each i[d],j[m]i\in[d],j\in[m] defining X¯i,j\overline{X}_{i,j} to be the jthj^{\mathrm{th}} smallest coordinate in dimension ii by the augmented total order. For any rr that divides mm we define an augmented rr-block partition depending on X¯\overline{X} by defining for each i[d],j[r1]i\in[d],j\in[r-1] the points a¯i,j=X¯i,mj/r\overline{a}_{i,j}=\overline{X}_{i,mj/r}, (and a¯i,0=(,0),a¯i,r=(,1)\overline{a}_{i,0}=(-\infty,0),\overline{a}_{i,r}=(\infty,1)), so that the intervals are Bi,j=(X¯i,m(j1)/r,X¯i,mj/r]B_{i,j}=(\overline{X}_{i,m(j-1)/r},\overline{X}_{i,mj/r}] for j{2,,r1}j\in\{2,\dotsc,r-1\} and Bi,1=((,0),X¯i,m/r],Bi,r=(X¯i,m(r1)/r,(,1)]B_{i,1}=((-\infty,0),\overline{X}_{i,m/r}],B_{i,r}=(\overline{X}_{i,m(r-1)/r},(\infty,1)]. We set the points b¯i,j\overline{b}_{i,j} defining 𝖻𝗅𝗈𝖼𝗄𝗉𝗈𝗂𝗇𝗍:[r]dd\mathsf{blockpoint}:[r]^{d}\to\mathbb{R}^{d} to be b¯i,j=X¯i,k\overline{b}_{i,j}=\overline{X}_{i,k} for some X¯i,kBi,j\overline{X}_{i,k}\in B_{i,j}. This is the augmented rr-block partition induced by X¯\overline{X}.
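The construction of Definition A.4 is straightforward to implement. The sketch below builds an augmented r-block partition from m augmented samples; sample_coord is a hypothetical stand-in for one coordinate μᵢ, here a two-point (hypercube-like) distribution, which is exactly the situation the augmentation is designed to handle.

```python
# Sketch of Definition A.4 (illustration only): build an augmented r-block
# partition from m augmented samples. `sample_coord` is a hypothetical stand-in
# for one coordinate mu_i; a two-point distribution is used to mimic {+-1}^d.
import bisect
import random

def build_partition(r, m, d, sample_coord, rng=random):
    assert m % r == 0
    boundaries = []
    for i in range(d):
        col = sorted((sample_coord(rng), rng.random()) for _ in range(m))   # augmented, sorted
        boundaries.append([col[m * j // r - 1] for j in range(1, r)])       # a_{i,1},...,a_{i,r-1}
    def block(x, z):
        # coordinate i lands in block j iff (x_i, z_i) lies in (a_{i,j-1}, a_{i,j}]
        return tuple(bisect.bisect_left(boundaries[i], (x[i], z[i])) + 1 for i in range(d))
    return block

if __name__ == "__main__":
    d, r, m = 2, 4, 4000
    sample_coord = lambda rng: rng.choice([0.0, 1.0])
    block = build_partition(r, m, d, sample_coord)
    x, z = (0.0, 1.0), tuple(random.random() for _ in range(d))
    print(block(x, z))      # approximately uniform over [r]^d when x ~ mu and z ~ [0,1]^d
```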

Definition A.5 (Augmented Block Boundary).

For an augmented block partition 𝖻𝗅𝗈𝖼𝗄¯:d×[0,1]d[r]d\overline{\mathsf{block}}:\mathbb{R}^{d}\times[0,1]^{d}\to[r]^{d}, a distribution μ\mu over d\mathbb{R}^{d}, and a function f:d{±1}f:\mathbb{R}^{d}\to\{\pm 1\}, we say ff is non-constant on an augmented block 𝖻𝗅𝗈𝖼𝗄¯1(v)\overline{\mathsf{block}}^{-1}(v) if there are sets S¯,T¯𝖻𝗅𝗈𝖼𝗄¯1(v)\overline{S},\overline{T}\subset\overline{\mathsf{block}}^{-1}(v) such that μ(ξ(S¯)),μ(ξ(T¯))>0\mu(\xi(\overline{S})),\mu(\xi(\overline{T}))>0 and for all sξ(S¯),tξ(T¯):f(s)=1,f(t)=1s\in\xi(\overline{S}),t\in\xi(\overline{T}):f(s)=1,f(t)=-1. For a function f:d{±1}f:\mathbb{R}^{d}\to\{\pm 1\} and a number rr, we define the augmented rr-block boundary size 𝖻𝖻𝗌¯(f,r)\overline{\mathsf{bbs}}(f,r) as the maximum number of blocks on which ff is non-constant with respect to a distribution μ\mu, where the maximum is taken over all finite or continuous product distributions μ\mu and all augmented rr-block partitions.

The augmented block partitions satisfy analogous properties to the previously-defined block partitions:

Lemma A.6.

Let X¯\overline{X} be an augmented random grid with length mm sampled from a finite product distribution μ\mu, and let 𝖻𝗅𝗈𝖼𝗄¯:d×[0,1]d[r]d\overline{\mathsf{block}}:\mathbb{R}^{d}\times[0,1]^{d}\to[r]^{d} be the augmented rr-block partition induced by X¯\overline{X}. Then

X¯[𝖻𝗅𝗈𝖼𝗄¯(μ)𝗎𝗇𝗂𝖿([r]d)𝖳𝖵>ϵ]4rdexp(mϵ218rd2).\underset{\overline{X}}{\mathbb{P}}\left[\|\overline{\mathsf{block}}(\mu)-\mathsf{unif}([r]^{d})\|_{\mathsf{TV}}>\epsilon\right]\leq 4rd\cdot\mathrm{exp}\left(-\frac{m\epsilon^{2}}{18rd^{2}}\right)\,.
Proof.

Let μi\mu_{i} be a finitely supported distribution with support SS\subset\mathbb{R}. Let η=12mina,bS,ab|ab|\eta=\frac{1}{2}\min_{a,b\in S,a\neq b}|a-b|. Let μi\mu^{\prime}_{i} be the distribution of xi+ηzix_{i}+\eta z_{i} where xiμix_{i}\sim\mu_{i} and zi[0,1]z_{i}\sim[0,1] uniformly at random; note that μi\mu^{\prime}_{i} is a continuous distribution over \mathbb{R}. For x¯=(x,x),y¯=(y,y)S×[0,1]\overline{x}=(x,x^{\prime}),\overline{y}=(y,y^{\prime})\in S\times[0,1], observe that x¯<y¯\overline{x}<\overline{y} iff x+ηx<y+ηyx+\eta x^{\prime}<y+\eta y^{\prime}. Therefore,

x¯,y¯μ¯i[x¯<y¯]=x,yμi[x<y].\underset{\overline{x},\overline{y}\sim\overline{\mu}_{i}}{\mathbb{P}}\left[\overline{x}<\overline{y}\right]=\underset{x,y\sim\mu^{\prime}_{i}}{\mathbb{P}}\left[x<y\right]\,.

By replacing each finitely supported μi\mu_{i} with μi\mu^{\prime}_{i} we obtain a continuous product distribution μ\mu^{\prime} such that 𝖻𝗅𝗈𝖼𝗄¯(μ)\overline{\mathsf{block}}(\mu) is the same distribution as 𝖻𝗅𝗈𝖼𝗄(μ)\mathsf{block}(\mu^{\prime}), so by Lemma 2.7 the conclusion holds. ∎

Proposition A.7.

For any continuous or finite product distribution μ\mu over d\mathbb{R}^{d}, any augmented rr-block partition 𝖻𝗅𝗈𝖼𝗄¯:d×[0,1]d[r]d\overline{\mathsf{block}}:\mathbb{R}^{d}\times[0,1]^{d}\to[r]^{d} constructed from a random grid X¯\overline{X}, and any f:d{±1}f:\mathbb{R}^{d}\to\{\pm 1\},

xμ,z[0,1]d[f(x)fz𝖼𝗈𝖺𝗋𝗌𝖾(x)]rd𝖻𝖻𝗌¯(f,r)+𝖻𝗅𝗈𝖼𝗄¯(μ)𝗎𝗇𝗂𝖿([r]d)𝖳𝖵.\underset{x\sim\mu,z\sim[0,1]^{d}}{\mathbb{P}}\left[f(x)\neq f^{\mathsf{coarse}}_{z}(x)\right]\leq r^{-d}\cdot\overline{\mathsf{bbs}}(f,r)+\|\overline{\mathsf{block}}(\mu)-\mathsf{unif}([r]^{d})\|_{\mathsf{TV}}\,.
Proof.

The result for continuous product distributions holds by Proposition 2.5 and the fact that 𝖻𝖻𝗌(f,r)𝖻𝖻𝗌¯(f,r)\mathsf{bbs}(f,r)\leq\overline{\mathsf{bbs}}(f,r), so assume μ\mu is a finite product distribution, and let S=supp(μ)S=\operatorname{supp}(\mu).

Suppose that for (x,z)(x,z) sampled from μ¯\overline{\mu}, f(x)fz𝖼𝗈𝖺𝗋𝗌𝖾(x)f(x)\neq f^{\mathsf{coarse}}_{z}(x), and let v=𝖻𝗅𝗈𝖼𝗄¯(x,z)v=\overline{\mathsf{block}}(x,z). Then for y=𝖻𝗅𝗈𝖼𝗄𝗉𝗈𝗂𝗇𝗍(v)y=\mathsf{blockpoint}(v), f(x)f(y)f(x)\neq f(y) and x,yξ(𝖻𝗅𝗈𝖼𝗄¯1(v))x,y\in\xi(\overline{\mathsf{block}}^{-1}(v)). The points x,yx,y clearly have positive measure because μ\mu is finite, so vv is a non-constant block. Then

xμ,z[0,1]d[f(x)fz𝖼𝗈𝖺𝗋𝗌𝖾(x)]\displaystyle\underset{x\sim\mu,z\sim[0,1]^{d}}{\mathbb{P}}\left[f(x)\neq f^{\mathsf{coarse}}_{z}(x)\right] x,z[𝖻𝗅𝗈𝖼𝗄¯(x,z) is non-constant]\displaystyle\leq\underset{x,z}{\mathbb{P}}\left[\overline{\mathsf{block}}(x,z)\text{ is non-constant}\right]
v[r]d[v is non-constant]+𝖻𝗅𝗈𝖼𝗄¯(μ)𝗎𝗇𝗂𝖿([r]d)𝖳𝖵.\displaystyle\leq\underset{v\sim[r]^{d}}{\mathbb{P}}\left[v\text{ is non-constant}\right]+\|\overline{\mathsf{block}}(\mu)-\mathsf{unif}([r]^{d})\|_{\mathsf{TV}}\,.\qed

A.2 Augmented Block-Boundary Size and Noise Sensitivity

To obtain learning algorithms for kk-alternating functions, functions of kk convex sets, functions of kk halfspaces, and degree-kk PTFs, we must provide a bound on 𝖻𝖻𝗌¯\overline{\mathsf{bbs}}.

For a finite set XdX\subset\mathbb{R}^{d} and a function f:d{±1}f:\mathbb{R}^{d}\to\{\pm 1\}, we will call a function f:d{±1}f^{\prime}:\mathbb{R}^{d}\to\{\pm 1\} a blowup of ff (with respect to XX) if xX\forall x\in X there exists an open ball BxxB_{x}\ni x where yBx,f(y)=f(x)\forall y\in B_{x},f^{\prime}(y)=f(x). We will call a set \mathcal{H} of functions f:d{±1}f:\mathbb{R}^{d}\to\{\pm 1\} inflatable if for every finite product set X=X1××XdX=X_{1}\times\dotsm\times X_{d} and ff\in\mathcal{H}, there exists ff^{\prime}\in\mathcal{H} that is a blowup of ff with respect to XX.

Proposition A.8.

Let \mathcal{H} be an inflatable set of functions. Then 𝖻𝖻𝗌¯(,r)𝖻𝖻𝗌(,r)\overline{\mathsf{bbs}}(\mathcal{H},r)\leq\mathsf{bbs}(\mathcal{H},r).

Proof.

Let 𝖻𝗅𝗈𝖼𝗄¯:d×[0,1]d[r]d\overline{\mathsf{block}}:\mathbb{R}^{d}\times[0,1]^{d}\to[r]^{d} be an augmented rr-block partition defined by parameters b¯i,j×[0,1]\overline{b}_{i,j}\in\mathbb{R}\times[0,1] for i[d],j[r1]i\in[d],j\in[r-1], and write b¯i,j=(bi,j,bi,j)\overline{b}_{i,j}=(b_{i,j},b^{\prime}_{i,j}). Let X=X1××XdX=X_{1}\times\dotsm\times X_{d} be any finite product set, let ff\in\mathcal{H}, and let ff^{\prime}\in\mathcal{H} be a blowup of ff with respect to XX; we will bound the number of non-constant blocks. We construct a (non-augmented) rr-block partition as follows. Let η>0\eta>0 be sufficiently small that:

  • xX\forall x\in X, the rectangle Rx:=[x1,x1+η]××[xd,xd+η]R_{x}:=[x_{1},x_{1}+\eta]\times\dotsm\times[x_{d},x_{d}+\eta] is contained within BxB_{x},

  • i[d],[xi,xi+η]Xi={xi}\forall i\in[d],[x_{i},x_{i}+\eta]\cap X_{i}=\{x_{i}\}; and

  • i[d],bi,j+η<bi,j+1\forall i\in[d],b_{i,j}+\eta<b_{i,j+1} unless bi,j=bi,j+1b_{i,j}=b_{i,j+1}.

Such an η\eta exists since the number of constraints is finite. Then define 𝖻𝗅𝗈𝖼𝗄:d[r]d\mathsf{block}:\mathbb{R}^{d}\to[r]^{d} by the parameters ci,j=bi,j+ηbi,jc_{i,j}=b_{i,j}+\eta\cdot b^{\prime}_{i,j}. Note that ci,j=bi,j+ηbi,jbi,j+η<bi,j+1ci,j+1c_{i,j}=b_{i,j}+\eta\cdot b^{\prime}_{i,j}\leq b_{i,j}+\eta<b_{i,j+1}\leq c_{i,j+1}. Let v[r]dv\in[r]^{d} and suppose that ff is non-constant on 𝖻𝗅𝗈𝖼𝗄¯1(v)\overline{\mathsf{block}}^{-1}(v), so there are x¯,y¯𝖻𝗅𝗈𝖼𝗄¯1(v)(X×[0,1]d)\overline{x},\overline{y}\in\overline{\mathsf{block}}^{-1}(v)\cap(X\times[0,1]^{d}) such that f(x)f(y)f(x)\neq f(y), where x¯=(x,x),y¯=(y,y)\overline{x}=(x,x^{\prime}),\overline{y}=(y,y^{\prime}), and i[d],xi,yi(bi,vi1,bi,vi]\forall i\in[d],x_{i},y_{i}\in(b_{i,v_{i}-1},b_{i,v_{i}}] where we define (b,b]={b}(b,b]=\{b\}. Consider 𝖻𝗅𝗈𝖼𝗄1(v)=(c1,v11,c1,v1]××(cd,vd1,cd,vd]\mathsf{block}^{-1}(v)=(c_{1,v_{1}-1},c_{1,v_{1}}]\times\dotsm\times(c_{d,v_{d}-1},c_{d,v_{d}}].

Since x¯i(b¯i,vi1,b¯i,vi]\overline{x}_{i}\in(\overline{b}_{i,v_{i}-1},\overline{b}_{i,v_{i}}], xi(bi,vi1,bi,vi]x_{i}\in(b_{i,v_{i}-1},b_{i,v_{i}}] (where we define (b,b]={b}(b,b]=\{b\}) and xi(bi,vi1,bi,vi]x^{\prime}_{i}\in(b^{\prime}_{i,v_{i}-1},b^{\prime}_{i,v_{i}}]. Therefore xi+ηxibi,vi+ηbi,vi=ci,vix_{i}+\eta\cdot x^{\prime}_{i}\leq b_{i,v_{i}}+\eta\cdot b^{\prime}_{i,v_{i}}=c_{i,v_{i}} and xi+ηxi>bi,vi1+ηbi,vi1=ci,vi1x_{i}+\eta\cdot x^{\prime}_{i}>b_{i,v_{i}-1}+\eta\cdot b^{\prime}_{i,v_{i}-1}=c_{i,v_{i}-1} so x+ηx𝖻𝗅𝗈𝖼𝗄1(v)x+\eta\cdot x^{\prime}\in\mathsf{block}^{-1}(v). Also, x+ηxx+\eta\cdot x^{\prime} is in the rectangle RxBxR_{x}\subset B_{x} so there is a ball around x+ηxx+\eta\cdot x^{\prime}, containing only points with value f(x)=f(x)f^{\prime}(x)=f(x). Likewise, there is a ball around y+ηyy+\eta\cdot y^{\prime} inside 𝖻𝗅𝗈𝖼𝗄1(v)\mathsf{block}^{-1}(v) containing only points with value f(y)=f(y)f(x)f^{\prime}(y)=f(y)\neq f^{\prime}(x). Since these balls must intersect 𝖻𝗅𝗈𝖼𝗄1(v)\mathsf{block}^{-1}(v) on sets with positive measure (in the product of Lebesgue measures), ff^{\prime} is non-constant on 𝖻𝗅𝗈𝖼𝗄1(v)\mathsf{block}^{-1}(v), which proves the statement. ∎

Lemma A.9.

The set 𝒜k\mathcal{A}_{k} of kk-alternating functions is inflatable.

Proof.

Let f𝒜kf\in\mathcal{A}_{k} and let X=X1××XdX=X_{1}\times\dotsm\times X_{d} be a finite product set, where we use the standard ordering on d\mathbb{R}^{d}. Let udu\in\mathbb{R}^{d}. We claim that the set {xX:xu}\{x\in X:x\leq u\} has a unique maximum. Suppose otherwise, so there are x,yux,y\leq u that are each maximal. Let xy=(max(x1,y1),,max(xd,yd))x\vee y=(\max(x_{1},y_{1}),\dotsc,\max(x_{d},y_{d})). Then xyXx\vee y\in X and xy>x,yx\vee y>x,y but uxyu\geq x\vee y, a contradiction. For every udu\in\mathbb{R}^{d}, write uu^{\downarrow} for this unique maximum. Let η>0\eta>0 be small enough that xX,(x+η1)=x\forall x\in X,(x+\eta\cdot\vec{1})^{\downarrow}=x; such a value exists since XX is finite. Define the map ϕ(u)=(u+(η/2)1)\phi(u)=(u+(\eta/2)\cdot\vec{1})^{\downarrow} and for every udu\in\mathbb{R}^{d} define f(u):=f(ϕ(u))f^{\prime}(u):=f(\phi(u)); we argue that this satisfies the required properties. It is clear by our choice of η\eta that f(x)=f((x+(η/2)1))=f(x)f^{\prime}(x)=f((x+(\eta/2)\cdot\vec{1})^{\downarrow})=f(x). Since ϕ\phi is order-preserving (i.e. if u<vu<v then ϕ(u)ϕ(v)\phi(u)\leq\phi(v)), ff^{\prime} is kk-alternating. Now consider the ball B(x):={yd:yx2<η/2}B(x):=\{y\in\mathbb{R}^{d}:\|y-x\|_{2}<\eta/2\}. Since |yixi|<η/2|y_{i}-x_{i}|<\eta/2 for all yB(x)y\in B(x), we have ϕ(y)=(y1+η/2,,yd+η/2)(x1+η,,xd+η)=x\phi(y)=(y_{1}+\eta/2,\dotsc,y_{d}+\eta/2)^{\downarrow}\leq(x_{1}+\eta,\dotsc,x_{d}+\eta)^{\downarrow}=x, and ϕ(y)(x1,,xd)=x\phi(y)\geq(x_{1},\dotsc,x_{d})^{\downarrow}=x so f(y)=f(ϕ(y))=f(x)f^{\prime}(y)=f(\phi(y))=f(x). ∎
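The map φ(u)=(u+(η/2)·1⃗)^↓ used in this proof can be computed coordinate-wise when X is a product set. The sketch below is a simplified illustration (it clamps points below the grid to the smallest grid value, a case the proof does not need); grids and f are hypothetical inputs.

```python
# Sketch (illustration only) of the blowup map from the proof of Lemma A.9:
# phi(u) = (u + (eta/2)*1)^down over a finite product grid, with a simplifying
# clamp at the bottom of the grid. `grids` and `f` are hypothetical inputs.
import bisect

def make_blowup(grids, f):
    """grids: sorted lists X_1,...,X_d; f: function defined on the product grid."""
    gaps = [min(b - a for a, b in zip(X, X[1:])) if len(X) > 1 else 1.0 for X in grids]
    eta = min(gaps) / 2.0               # small enough that (x + eta*1)^down = x for grid points
    def phi(u):
        out = []
        for X, ui in zip(grids, u):
            j = bisect.bisect_right(X, ui + eta / 2) - 1   # largest grid value <= u_i + eta/2
            out.append(X[max(j, 0)])                       # clamp below the grid (simplification)
        return tuple(out)
    return lambda u: f(phi(u))

if __name__ == "__main__":
    grids = [[0.0, 1.0, 2.0], [0.0, 1.0]]
    f = lambda x: 1 if x[0] + x[1] >= 2 else -1            # a monotone (1-alternating) example
    f_prime = make_blowup(grids, f)
    print(f_prime((1.0, 1.0)), f_prime((1.1, 1.05)), f_prime((0.2, 0.3)))   # prints: 1 1 -1
```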

Lemma A.10.

The set 𝒞\mathcal{C} of indicator functions of convex sets is inflatable.

Proof.

Let f:d{±1}f:\mathbb{R}^{d}\to\{\pm 1\} indicate a closed convex set, let S=f1(1)S=f^{-1}(1) be this set, and write δ(x):=min{xy2:yS}\delta(x):=\min\{\|x-y\|_{2}:y\in S\} (this minimum exists since SS is closed). Let XX be any finite set and let δ=min{δ(x):xXS}\delta=\min\{\delta(x):x\in X\setminus S\}. Consider S={x:δ(x)δ/2}S^{\prime}=\{x:\delta(x)\leq\delta/2\}, and let ff^{\prime} be the indicator function for this set. Then f(x)=f(x)f^{\prime}(x)=f(x) for all xXx\in X. Finally, SS^{\prime} is closed, and it is convex since for any two points x,yx,y, it is well-known that the function λδ(λx+(1λ)y)\lambda\mapsto\delta(\lambda x+(1-\lambda)y) is convex for λ[0,1]\lambda\in[0,1]. ∎

Lemma A.11.

The set \mathcal{H} of halfspaces is inflatable.

Proof.

It suffices to show that for any finite set XX (not necessarily a product set) and any halfspace f(x)=sign(w,xt)f(x)=\operatorname{sign}(\langle w,x\rangle-t), there is a halfspace f(x)=sign(w,xt)f^{\prime}(x)=\operatorname{sign}(\langle w,x\rangle-t^{\prime}) such that f(x)=f(x)f(x)=f^{\prime}(x) for all xXx\in X but w,xt0\langle w,x\rangle-t^{\prime}\neq 0 for all xXx\in X; this is a commonly-used fact. Let δ=min{(w,xt):xX,w,xt<0}\delta=\min\{-(\langle w,x\rangle-t):x\in X,\langle w,x\rangle-t<0\}. It must be the case that δ>0\delta>0 (if no xXx\in X has w,xt<0\langle w,x\rangle-t<0, take any δ>0\delta>0). Then f(x)=sign(w,xt+δ/2)f^{\prime}(x)=\operatorname{sign}(\langle w,x\rangle-t+\delta/2) satisfies the condition. ∎

Lemma A.12.

The set 𝒫k\mathcal{P}_{k} of degree-kk PTFs is inflatable.

Proof.

This follows from the above proof for halfspaces, since for any finite XX we may map xXx\in X to the vector of all monomials in xx of degree at most kk, so that any degree-kk polynomial p(x)p(x) is a linear threshold function in the space of monomials. ∎

For a set \mathcal{H} of functions f:d{±1}f:\mathbb{R}^{d}\to\{\pm 1\} and an augmented rr-block partition 𝖻𝗅𝗈𝖼𝗄¯:d×[0,1]d[r]d\overline{\mathsf{block}}:\mathbb{R}^{d}\times[0,1]^{d}\to[r]^{d}, we will write ¯𝖻𝗅𝗈𝖼𝗄:={f𝖻𝗅𝗈𝖼𝗄:f}\overline{\mathcal{H}}^{\mathsf{block}}:=\{f^{\mathsf{block}}:f\in\mathcal{H}\} for the set of block functions f𝖻𝗅𝗈𝖼𝗄:[r]d{±1}f^{\mathsf{block}}:[r]^{d}\to\{\pm 1\}; note that this is not necessarily the same set of functions as 𝖻𝗅𝗈𝖼𝗄\mathcal{H}^{\mathsf{block}} defined for continuous distributions. We must show that the same learning algorithms used above for learning 𝖻𝗅𝗈𝖼𝗄\mathcal{H}^{\mathsf{block}} will work also for ¯𝖻𝗅𝗈𝖼𝗄\overline{\mathcal{H}}^{\mathsf{block}}. For the brute-force learning algorithm of Lemma 4.5, this is trivial, but for the regression algorithm in Lemma 5.2 we must show that there exists a set \mathcal{F} such that each f𝖻𝗅𝗈𝖼𝗄¯𝖻𝗅𝗈𝖼𝗄f^{\mathsf{block}}\in\overline{\mathcal{H}}^{\mathsf{block}} is close to a function g𝗌𝗉𝖺𝗇()g\in\mathsf{span}(\mathcal{F}). For functions of halfspaces and PTFs, we used the bound on noise sensitivity, Lemma 5.3, to construct a set \mathcal{F} of functions suitable for the regression algorithm. The proof for that lemma works without modification for augmented block partitions, so we have the following:

Lemma A.13.

Let \mathcal{H} be any family of functions f:d{±1}f:\mathbb{R}^{d}\to\{\pm 1\} such that, for any linear transformation A:ddA:\mathbb{R}^{d}\to\mathbb{R}^{d}, if ff\in\mathcal{H} then fAf\circ A\in\mathcal{H}. Let 𝖻𝗅𝗈𝖼𝗄¯:d×[0,1]d[r]d\overline{\mathsf{block}}:\mathbb{R}^{d}\times[0,1]^{d}\to[r]^{d} be any augmented rr-block partition. Let 𝗇𝗌2,δ():=supf𝗇𝗌2,δ(f)\mathsf{ns}_{2,\delta}(\mathcal{H}):=\sup_{f\in\mathcal{H}}\mathsf{ns}_{2,\delta}(f). Then for every ff\in\mathcal{H}, 𝗇𝗌r,δ(f𝖻𝗅𝗈𝖼𝗄)𝗇𝗌2,δ()\mathsf{ns}_{r,\delta}(f^{\mathsf{block}})\leq\mathsf{ns}_{2,\delta}(\mathcal{H}).

A.3 Rounding the Output

After learning a function g:[r]d{±1}g:[r]^{d}\to\{\pm 1\}, we must output a function g:d{±1}g^{\prime}:\mathbb{R}^{d}\to\{\pm 1\}. In the continuous setting, we simply output g𝖻𝗅𝗈𝖼𝗄g\circ\mathsf{block}. In the finite setting, we cannot simply output g𝖻𝗅𝗈𝖼𝗄¯g\circ\overline{\mathsf{block}} since 𝖻𝗅𝗈𝖼𝗄¯:d×[0,1]d[r]d\overline{\mathsf{block}}:\mathbb{R}^{d}\times[0,1]^{d}\to[r]^{d} requires an additional argument z[0,1]dz\in[0,1]^{d}. For example, if the distribution μ\mu is a finitely supported distribution on {±1}d\{\pm 1\}^{d}, then for each point x{±1}dx\in\{\pm 1\}^{d} there may be roughly (r/2)d(r/2)^{d} points v[r]dv\in[r]^{d} for which (x,z)𝖻𝗅𝗈𝖼𝗄¯1(v)(x,z)\in\overline{\mathsf{block}}^{-1}(v) for an appropriate choice of z[0,1]dz\in[0,1]^{d}, and these points vv may have different values in gg. The algorithm must choose a single value to output for each xx. We do so by approximating the function x𝔼𝑧[gz(x)]x\mapsto\underset{z}{\mathbb{E}}\left[g_{z}(x)\right] and then rounding it via the next lemma.

Lemma A.14.

Fix a domain 𝒳\mathcal{X}, let γ:𝒳[1,1]\gamma:\mathcal{X}\to[-1,1], and let ϵ,δ>0\epsilon,\delta>0. There is an algorithm that, given query access to γ\gamma and sample access to any distribution 𝒟\mathcal{D} over 𝒳×{±1}\mathcal{X}\times\{\pm 1\}, uses at most O(log(1/δ)/ϵ2)O\left(\log(1/\delta)/\epsilon^{2}\right) samples and queries and with probability at least 1δ1-\delta produces a value tt such that

(x,b)𝒟[sign(γ(x)t)b]12𝔼(x,b)𝒟[|γ(x)b|]+ϵ.\underset{(x,b)\sim\mathcal{D}}{\mathbb{P}}\left[\operatorname{sign}(\gamma(x)-t)\neq b\right]\leq\frac{1}{2}\underset{(x,b)\sim\mathcal{D}}{\mathbb{E}}\left[|\gamma(x)-b|\right]+\epsilon\,.
Proof.

Let 𝒯\mathcal{T} be the set of functions xsign(γ(x)t)x\mapsto\operatorname{sign}(\gamma(x)-t) for any choice of t[1,1]t\in[-1,1]. We will show that the VC dimension of 𝒯\mathcal{T} is 1. Suppose for contradiction that two points x,y𝒳x,y\in\mathcal{X} are shattered by 𝒯\mathcal{T}, so in particular there are s,ts,t\in\mathbb{R} such that sign(γ(x)s)=1\operatorname{sign}(\gamma(x)-s)=1 and sign(γ(y)s)=1\operatorname{sign}(\gamma(y)-s)=-1, while sign(γ(x)t)=1\operatorname{sign}(\gamma(x)-t)=-1 and sign(γ(y)t)=1\operatorname{sign}(\gamma(y)-t)=1. Without loss of generality, suppose s<ts<t. But then sign(γ(y)s)sign(γ(y)t)\operatorname{sign}(\gamma(y)-s)\geq\operatorname{sign}(\gamma(y)-t), which is a contradiction. Therefore, by standard VC dimension arguments ([SSBD14], Theorem 6.8), using O(log(1/δ)/ϵ2)O(\log(1/\delta)/\epsilon^{2}) samples and choosing tt to minimize the error on the samples, with probability at least 1δ1-\delta we will obtain a value tt such that

(x,b)𝒟[sign(γ(x)t)b](x,b)𝒟[sign(γ(x)t)b]+ϵ\underset{(x,b)\sim\mathcal{D}}{\mathbb{P}}\left[\operatorname{sign}(\gamma(x)-t)\neq b\right]\leq\underset{(x,b)\sim\mathcal{D}}{\mathbb{P}}\left[\operatorname{sign}(\gamma(x)-t^{*})\neq b\right]+\epsilon

where tt^{*} minimizes the latter quantity among all values in [1,1][-1,1]. Since the algorithm can restrict itself to those values t[1,1]t\in[-1,1] for which γ(x)=t\gamma(x)=t for some xx in the sample, the value minimizing the error on the sample can be computed in time polynomial in the number of samples. Next, we show that the minimizer tt^{*} satisfies the desired properties. Suppose that t[1,1]t\sim[-1,1] uniformly at random. For any y[1,1],b{±1}y\in[-1,1],b\in\{\pm 1\},

𝑡[sign(yt)b]={𝑡[t>y]=12|by| if b=1𝑡[ty]=12|yb| if b=1.\underset{t}{\mathbb{P}}\left[\operatorname{sign}(y-t)\neq b\right]=\begin{cases}\underset{t}{\mathbb{P}}\left[t>y\right]=\frac{1}{2}|b-y|&\text{ if }b=1\\ \underset{t}{\mathbb{P}}\left[t\leq y\right]=\frac{1}{2}|y-b|&\text{ if }b=-1\,.\end{cases}

Therefore

𝔼t[1,1][(x,b)𝒟[sign(γ(x)t)b]]=𝔼(x,b)𝒟[𝑡[sign(γ(x)t)b]]=12𝔼(x,b)𝒟[|γ(x)b|],\underset{t\sim[-1,1]}{\mathbb{E}}\left[\underset{(x,b)\sim\mathcal{D}}{\mathbb{P}}\left[\operatorname{sign}(\gamma(x)-t)\neq b\right]\right]=\underset{(x,b)\sim\mathcal{D}}{\mathbb{E}}\left[\underset{t}{\mathbb{P}}\left[\operatorname{sign}(\gamma(x)-t)\neq b\right]\right]=\frac{1}{2}\underset{(x,b)\sim\mathcal{D}}{\mathbb{E}}\left[|\gamma(x)-b|\right]\,,

so the minimizer tt^{*} satisfies

(x,b)𝒟[sign(γ(x)t)b]12𝔼(x,b)𝒟[|γ(x)b|].\underset{(x,b)\sim\mathcal{D}}{\mathbb{P}}\left[\operatorname{sign}(\gamma(x)-t^{*})\neq b\right]\leq\frac{1}{2}\underset{(x,b)\sim\mathcal{D}}{\mathbb{E}}\left[|\gamma(x)-b|\right]\,.\qed
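The threshold-selection step in this proof amounts to a one-dimensional empirical risk minimization. The following sketch implements it on a list of (γ(x), b) pairs, restricting to the candidate thresholds mentioned in the proof; the sample data are made up for illustration.

```python
# Sketch of the threshold-selection step in Lemma A.14: empirical risk
# minimization over one-dimensional thresholds, restricted to the observed
# gamma-values (plus -1). The data below are made up for illustration.
def sign(v):                            # sign(0) = +1
    return 1 if v >= 0 else -1

def round_threshold(pairs):
    """pairs: list of (gamma(x), b) with gamma(x) in [-1,1] and b in {+1,-1}."""
    candidates = sorted({g for g, _ in pairs} | {-1.0})
    def err(t):
        return sum(1 for g, b in pairs if sign(g - t) != b) / len(pairs)
    return min(candidates, key=err)

if __name__ == "__main__":
    data = [(0.9, 1), (0.7, 1), (0.1, -1), (-0.4, -1), (0.3, -1), (0.6, 1)]
    t = round_threshold(data)
    print(t, sum(1 for g, b in data if sign(g - t) != b))   # here a zero-error threshold exists
```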
Lemma A.15.

Let 𝖻𝗅𝗈𝖼𝗄¯:d×[0,1]d[r]d\overline{\mathsf{block}}:\mathbb{R}^{d}\times[0,1]^{d}\to[r]^{d} be an augmented rr-block partition. There is an algorithm which, given ϵ,δ>0\epsilon,\delta>0, query access to a function g:[r]d{±1}g:[r]^{d}\to\{\pm 1\} and sample access to a distribution 𝒟\mathcal{D} over d×{±1}\mathbb{R}^{d}\times\{\pm 1\}, outputs a function g:d{±1}g^{\prime}:\mathbb{R}^{d}\to\{\pm 1\} such that, with probability 1δ1-\delta,

(x,b)𝒟[g(x)b](x,b)𝒟,z[0,1]d[g(𝖻𝗅𝗈𝖼𝗄¯(x,z))b]+ϵ,\underset{(x,b)\sim\mathcal{D}}{\mathbb{P}}\left[g^{\prime}(x)\neq b\right]\leq\underset{(x,b)\sim\mathcal{D},z\sim[0,1]^{d}}{\mathbb{P}}\left[g(\overline{\mathsf{block}}(x,z))\neq b\right]+\epsilon\,,

uses at most O(dlog(r)ϵ2log1δ)O\left(\frac{d\log(r)}{\epsilon^{2}}\log\frac{1}{\delta}\right) samples and queries, and runs in time polynomial in the number of samples.

Proof.

For z[0,1]dz\in[0,1]^{d}, write gz(x)=g(𝖻𝗅𝗈𝖼𝗄¯(x,z))g_{z}(x)=g(\overline{\mathsf{block}}(x,z)). For any (x,b)(x,b),

|𝔼𝑧[gz(x)]b|=|b𝑧[gz(x)=b]b𝑧[gz(x)b]b|=|2b𝑧[gz(x)b]|=2𝑧[gz(x)b],|\underset{z}{\mathbb{E}}\left[g_{z}(x)\right]-b|=|b\underset{z}{\mathbb{P}}\left[g_{z}(x)=b\right]-b\underset{z}{\mathbb{P}}\left[g_{z}(x)\neq b\right]-b|=|-2b\underset{z}{\mathbb{P}}\left[g_{z}(x)\neq b\right]|=2\underset{z}{\mathbb{P}}\left[g_{z}(x)\neq b\right]\,,

so

12𝔼(x,b)𝒟[|𝔼𝑧[gz(x)]b|]=(x,b)𝒟,z[gz(x)b].\frac{1}{2}\underset{(x,b)\sim\mathcal{D}}{\mathbb{E}}\left[|\underset{z}{\mathbb{E}}\left[g_{z}(x)\right]-b|\right]=\underset{(x,b)\sim\mathcal{D},z}{\mathbb{P}}\left[g_{z}(x)\neq b\right]\,.

The algorithm will construct a function γ(x)𝔼𝑧[gz(x)]\gamma(x)\approx\underset{z}{\mathbb{E}}\left[g_{z}(x)\right] and then learn a suitable parameter tt for rounding γ(x)\gamma(x) to sign(γ(x)t)\operatorname{sign}(\gamma(x)-t).

First the algorithm samples a set Z[0,1]dZ\subset[0,1]^{d} of size m=2dln(r)ln(1/δ)ϵ2m=\frac{2d\ln(r)\ln(1/\delta)}{\epsilon^{2}} and constructs the function γ(x)=1mzZg(𝖻𝗅𝗈𝖼𝗄¯(x,z))\gamma(x)=\frac{1}{m}\sum_{z\in Z}g(\overline{\mathsf{block}}(x,z)). Fix Z[0,1]dZ\subset[0,1]^{d} and suppose xdx\in\mathbb{R}^{d} satisfies γ(x)𝔼𝑧[gz(x)]\gamma(x)\neq\underset{z}{\mathbb{E}}\left[g_{z}(x)\right]. Then there must be w,z[0,1]dw,z\in[0,1]^{d} such that 𝖻𝗅𝗈𝖼𝗄¯(x,z)𝖻𝗅𝗈𝖼𝗄¯(x,w)\overline{\mathsf{block}}(x,z)\neq\overline{\mathsf{block}}(x,w), otherwise gz(x)=gw(x)g_{z}(x)=g_{w}(x) for all z,wz,w so for all w,γ(x)=gw(x)=𝔼𝑧[gz(x)]w,\gamma(x)=g_{w}(x)=\underset{z}{\mathbb{E}}\left[g_{z}(x)\right]. There can be at most rdr^{d} values of xdx\in\mathbb{R}^{d} for which z,w[0,1]d:𝖻𝗅𝗈𝖼𝗄¯(x,z)𝖻𝗅𝗈𝖼𝗄¯(x,w)\exists z,w\in[0,1]^{d}:\overline{\mathsf{block}}(x,z)\neq\overline{\mathsf{block}}(x,w), so by the union bound and the Hoeffding bound,

𝑍[xd:|γ(x)𝔼𝑧[gz(x)]|>ϵ]\displaystyle\underset{Z}{\mathbb{P}}\left[\exists x\in\mathbb{R}^{d}:|\gamma(x)-\underset{z}{\mathbb{E}}\left[g_{z}(x)\right]|>\epsilon\right] rdmaxxX𝑍[|γ(x)𝔼𝑧[gz(x)]|>ϵ]\displaystyle\leq r^{d}\max_{x\in X}\underset{Z}{\mathbb{P}}\left[|\gamma(x)-\underset{z}{\mathbb{E}}\left[g_{z}(x)\right]|>\epsilon\right]
2rdexp(mϵ22)<δ.\displaystyle\leq 2r^{d}\mathrm{exp}\left(-\frac{m\epsilon^{2}}{2}\right)<\delta\,.

Therefore with probability at least 1δ/21-\delta/2, γ\gamma satisfies |γ(x)𝔼𝑧[gz(x)]|ϵ|\gamma(x)-\underset{z}{\mathbb{E}}\left[g_{z}(x)\right]|\leq\epsilon for all xx. Suppose this occurs. Then

12𝔼(x,b)𝒟[|γ(x)b|]\displaystyle\frac{1}{2}\underset{(x,b)\sim\mathcal{D}}{\mathbb{E}}\left[|\gamma(x)-b|\right] 12𝔼(x,b)𝒟[|𝔼𝑧[gz(x)]b|+|γ(x)𝔼𝑧[gz(x)]|]\displaystyle\leq\frac{1}{2}\underset{(x,b)\sim\mathcal{D}}{\mathbb{E}}\left[|\underset{z}{\mathbb{E}}\left[g_{z}(x)\right]-b|+|\gamma(x)-\underset{z}{\mathbb{E}}\left[g_{z}(x)\right]|\right]
(x,b)𝒟,z[gz(x)b]+ϵ2.\displaystyle\leq\underset{(x,b)\sim\mathcal{D},z}{\mathbb{P}}\left[g_{z}(x)\neq b\right]+\frac{\epsilon}{2}\,.

Now we apply Lemma A.14 with error ϵ/2\epsilon/2, using O(log(1/δ)/ϵ2)O(\log(1/\delta)/\epsilon^{2}) samples and polynomial time, to output a value tt such that with probability 1δ/21-\delta/2,

(x,b)𝒟[sign(γ(x)t)b]12𝔼(x,b)[|γ(x)b|]+ϵ2(x,b)𝒟,z[gz(x)b]+ϵ.\underset{(x,b)\sim\mathcal{D}}{\mathbb{P}}\left[\operatorname{sign}(\gamma(x)-t)\neq b\right]\leq\frac{1}{2}\underset{(x,b)}{\mathbb{E}}\left[|\gamma(x)-b|\right]+\frac{\epsilon}{2}\leq\underset{(x,b)\sim\mathcal{D},z}{\mathbb{P}}\left[g_{z}(x)\neq b\right]+\epsilon\,.\qed
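Combining the two steps of this proof gives a short procedure: estimate γ(x) by averaging g over sampled augmentations, then round with a threshold chosen as in Lemma A.14. The sketch below is illustrative; g, block, and the labeled sample are assumed to be supplied by the caller, and num_z is an arbitrary choice rather than the bound from the proof.

```python
# Sketch (illustration only) of Lemma A.15: estimate gamma(x) = E_z[g(block(x,z))]
# by averaging over sampled augmentations, then round with a threshold chosen on a
# labeled sample as in Lemma A.14. `g`, `block`, and `labeled_sample` are assumed
# to be supplied by the caller; num_z is an arbitrary choice.
import random

def build_g_prime(g, block, d, labeled_sample, num_z=200, rng=random):
    """g: function on [r]^d; block: augmented partition (x, z) -> [r]^d;
    labeled_sample: list of (x, b) pairs used only to pick the rounding threshold."""
    Z = [tuple(rng.random() for _ in range(d)) for _ in range(num_z)]
    def gamma(x):                        # empirical average over the sampled augmentations
        return sum(g(block(x, z)) for z in Z) / num_z
    vals = [(gamma(x), b) for x, b in labeled_sample]
    def err(t):                          # empirical error of x -> sign(gamma(x) - t)
        return sum(1 for v, b in vals if (1 if v - t >= 0 else -1) != b)
    t = min(sorted({v for v, _ in vals} | {-1.0}), key=err)
    return lambda x: 1 if gamma(x) - t >= 0 else -1
```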

A.4 Algorithms for Finite Distributions

We now state improved versions of our monotonicity tester and two general learning algorithms: the “brute force” learning algorithm (Lemma 4.5) and the “polynomial regression” algorithm (Lemma 5.2). Using these algorithms we obtain the same complexity bounds as for continuous product distributions, but the algorithms can now handle finite product distributions as well.

See 1.1.1

Proof.

The proof of Theorem 1.1.1 goes through as before, with 𝖻𝗅𝗈𝖼𝗄\mathsf{block} replaced by 𝖻𝗅𝗈𝖼𝗄¯\overline{\mathsf{block}}, 𝖻𝗅𝗈𝖼𝗄(v)\mathsf{block}^{-\downarrow}(v) replaced with 𝖻𝗅𝗈𝖼𝗄¯(v)\overline{\mathsf{block}}^{-\downarrow}(v), defined as the infimal element x¯\overline{x} such that 𝖻𝗅𝗈𝖼𝗄¯(x¯)=v\overline{\mathsf{block}}(\overline{x})=v, 𝖻𝗅𝗈𝖼𝗄(v)\mathsf{block}^{-\uparrow}(v) replaced with 𝖻𝗅𝗈𝖼𝗄¯(v)\overline{\mathsf{block}}^{-\uparrow}(v), defined as the supremal element x¯\overline{x} such that 𝖻𝗅𝗈𝖼𝗄¯(x¯)=v\overline{\mathsf{block}}(\overline{x})=v, and gg redefined appropriately. ∎

Next, we move on to the learning algorithms:

Lemma A.16.

Let \mathcal{H} be any set of functions d{±1}\mathbb{R}^{d}\to\{\pm 1\}, let ϵ>0\epsilon>0, and suppose rr satisfies rd𝖻𝖻𝗌¯(,r)ϵ/3r^{-d}\cdot\overline{\mathsf{bbs}}(\mathcal{H},r)\leq\epsilon/3. Then there is an agnostic learning algorithm for \mathcal{H} that uses O(rd+rd2log(rd/ϵ)ϵ2)O\left(\frac{r^{d}+rd^{2}\log(rd/\epsilon)}{\epsilon^{2}}\right) samples and time and works for any distribution 𝒟\mathcal{D} over d×{±1}\mathbb{R}^{d}\times\{\pm 1\} whose marginal on d\mathbb{R}^{d} is a finite or continuous product distribution.

Proof.

On input distribution 𝒟\mathcal{D}:

  1. 1.

    Sample a grid X¯\overline{X} of size m=O(rd2ϵ2log(rd/ϵ))m=O(\frac{rd^{2}}{\epsilon^{2}}\log(rd/\epsilon)) large enough that Lemma A.6 guarantees 𝖻𝗅𝗈𝖼𝗄¯(μ)𝗎𝗇𝗂𝖿([r]d)𝖳𝖵<ϵ/3\|\overline{\mathsf{block}}(\mu)-\mathsf{unif}([r]^{d})\|_{\mathsf{TV}}<\epsilon/3 with probability 5/65/6, where 𝖻𝗅𝗈𝖼𝗄¯:d×[0,1]d[r]d\overline{\mathsf{block}}:\mathbb{R}^{d}\times[0,1]^{d}\to[r]^{d} is the induced augmented rr-block partition.

  2. 2.

    Agnostically learn a function g:[r]d{±1}g:[r]^{d}\to\{\pm 1\} with error ϵ/3\epsilon/3 and success probability 5/65/6 using O(rd/ϵ2)O(r^{d}/\epsilon^{2}) samples from 𝒟𝖻𝗅𝗈𝖼𝗄\mathcal{D}^{\mathsf{block}}.

  3. 3.

    Run the algorithm of Lemma A.14 using O(dlogrϵ2)O\left(\frac{d\log r}{\epsilon^{2}}\right) samples to obtain gg^{\prime} and output gg^{\prime}.

The proof proceeds as in the case for continuous distributions (Lemma 4.5). Assume all steps succeed, which occurs with probability at least 2/32/3. After step 2 we obtain g:[r]d{±1}g:[r]^{d}\to\{\pm 1\} such that, for any hh\in\mathcal{H},

(v,b)𝒟𝖻𝗅𝗈𝖼𝗄[g(v)b](v,b)𝒟𝖻𝗅𝗈𝖼𝗄[h𝖻𝗅𝗈𝖼𝗄(v)b]+ϵ/3.\underset{(v,b)\sim\mathcal{D}^{\mathsf{block}}}{\mathbb{P}}\left[g(v)\neq b\right]\leq\underset{(v,b)\sim\mathcal{D}^{\mathsf{block}}}{\mathbb{P}}\left[h^{\mathsf{block}}(v)\neq b\right]+\epsilon/3\,.

By Lemma A.14 and Proposition A.7, the output satisfies,

(x,b)𝒟[g(x)b]\displaystyle\underset{(x,b)\sim\mathcal{D}}{\mathbb{P}}\left[g^{\prime}(x)\neq b\right] (x,b)𝒟,z[g(𝖻𝗅𝗈𝖼𝗄¯(x,z))b]+ϵ/3\displaystyle\leq\underset{(x,b)\sim\mathcal{D},z}{\mathbb{P}}\left[g(\overline{\mathsf{block}}(x,z))\neq b\right]+\epsilon/3
(x,b)𝒟,z[h𝖻𝗅𝗈𝖼𝗄(𝖻𝗅𝗈𝖼𝗄¯(x,z))b]+2ϵ/3\displaystyle\leq\underset{(x,b)\sim\mathcal{D},z}{\mathbb{P}}\left[h^{\mathsf{block}}(\overline{\mathsf{block}}(x,z))\neq b\right]+2\epsilon/3
(x,b)𝒟[h(x)b]+x,z[h(x)hz𝖼𝗈𝖺𝗋𝗌𝖾(x)]+2ϵ/3\displaystyle\leq\underset{(x,b)\sim\mathcal{D}}{\mathbb{P}}\left[h(x)\neq b\right]+\underset{x,z}{\mathbb{P}}\left[h(x)\neq h^{\mathsf{coarse}}_{z}(x)\right]+2\epsilon/3
(x,b)𝒟[h(x)b]+ϵ.\displaystyle\leq\underset{(x,b)\sim\mathcal{D}}{\mathbb{P}}\left[h(x)\neq b\right]+\epsilon\,.\qed
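As an illustration of the three steps above, here is a minimal Python sketch of the brute-force learner for a continuous product distribution. It is schematic only: all names are hypothetical, it reuses one labelled sample for both the grid and the learner, and it omits the random augmentation needed for finite distributions as well as the separate rounding step of Lemma A.14.

```python
import numpy as np

def brute_force_learner(xs, bs, r):
    """Schematic version of the learner in Lemma A.16 (continuous case).

    xs : (m, d) array of sample points, bs : (m,) labels in {-1, +1},
    r  : number of blocks per coordinate.
    Returns a hypothesis h : R^d -> {-1, +1}."""
    m, d = xs.shape

    # Step 1: empirical per-coordinate quantiles define the r-block partition.
    cuts = np.quantile(xs, np.linspace(0, 1, r + 1)[1:-1], axis=0)  # (r-1, d)

    def block(x):
        # cell of the grid [r]^d containing x
        return tuple(int(np.searchsorted(cuts[:, i], x[i])) for i in range(d))

    # Step 2: agnostically learn g on [r]^d -- here, a majority vote per cell.
    votes = {}
    for x, b in zip(xs, bs):
        v = block(x)
        votes[v] = votes.get(v, 0) + b
    g = {v: (1 if s >= 0 else -1) for v, s in votes.items()}

    # Step 3: the output hypothesis is g composed with the block map
    # (cells never seen in the sample get an arbitrary default label).
    return lambda x: g.get(block(x), 1)
```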

We now state the general learning algorithm from Lemma 5.2, improved to allow finite product distributions.

Lemma A.17.

Let ϵ>0\epsilon>0 and let \mathcal{H} be a set of measurable functions f:d{±1}f:\mathbb{R}^{d}\to\{\pm 1\} that satisfy:

  1. 1.

    There is some r=r(d,ϵ)r=r(d,\epsilon) such that 𝖻𝖻𝗌¯(,r)ϵrd\overline{\mathsf{bbs}}(\mathcal{H},r)\leq\epsilon\cdot r^{d};

  2. 2.

    There is a set \mathcal{F} of functions [r]d[r]^{d}\to\mathbb{R} satisfying: f,g𝗌𝗉𝖺𝗇()\forall f\in\mathcal{H},\exists g\in\mathsf{span}(\mathcal{F}) such that for v[r]d,𝔼[(f𝖻𝗅𝗈𝖼𝗄(v)g(v))2]ϵ2v\sim[r]^{d},\mathbb{E}\left[(f^{\mathsf{block}}(v)-g(v))^{2}\right]\leq\epsilon^{2}.

Let n=poly(||,1/ϵ)n=\operatorname{poly}(|\mathcal{F}|,1/\epsilon) be the sample complexity of the algorithm in Theorem 5.1. Then there is an agnostic learning algorithm for \mathcal{H} on finite and continuous product distributions over d\mathbb{R}^{d}, that uses O(max(n2,1/ϵ2)rd2log(dr))O(\max(n^{2},1/\epsilon^{2})\cdot rd^{2}\log(dr)) samples and runs in time polynomial in the sample size.

Proof.

Let 𝒟¯\overline{\mathcal{D}} be the augmented distribution, where (x¯,b)𝒟¯(\overline{x},b)\sim\overline{\mathcal{D}} is obtained by drawing (x,b)𝒟(x,b)\sim\mathcal{D} and augmenting xx with a uniformly random z[0,1]dz\in[0,1]^{d}, so that x¯=(x,z)\overline{x}=(x,z). We will assume n>1/ϵn>1/\epsilon. Let μ\mu be the marginal of 𝒟\mathcal{D} on d\mathbb{R}^{d}. For an augmented rr-block partition, let 𝒟𝖻𝗅𝗈𝖼𝗄\mathcal{D}^{\mathsf{block}} be the distribution of (𝖻𝗅𝗈𝖼𝗄¯(x¯),b)(\overline{\mathsf{block}}(\overline{x}),b) when (x¯,b)𝒟¯(\overline{x},b)\sim\overline{\mathcal{D}}. We may simulate samples from 𝒟𝖻𝗅𝗈𝖼𝗄\mathcal{D}^{\mathsf{block}} by sampling (x,b)(x,b) from 𝒟\mathcal{D}, drawing zz, and constructing (𝖻𝗅𝗈𝖼𝗄¯(x¯),b)(\overline{\mathsf{block}}(\overline{x}),b). The algorithm is as follows:

  1. 1.

    Sample a grid XX of size m=O(rd2n2log(rd))m=O(rd^{2}n^{2}\log(rd)); by Lemma A.6, this ensures that 𝖻𝗅𝗈𝖼𝗄¯(μ)𝗎𝗇𝗂𝖿([r]d)𝖳𝖵<1/12n\|\overline{\mathsf{block}}(\mu)-\mathsf{unif}([r]^{d})\|_{\mathsf{TV}}<1/12n with probability 5/65/6. Construct the induced augmented rr-block partition 𝖻𝗅𝗈𝖼𝗄¯:d×[0,1]d[r]d\overline{\mathsf{block}}:\mathbb{R}^{d}\times[0,1]^{d}\to[r]^{d}.

  2. 2.

    Run the algorithm of Theorem 5.1 on a sample of nn points from 𝒟𝖻𝗅𝗈𝖼𝗄\mathcal{D}^{\mathsf{block}}; that algorithm returns a function g:[r]d{±1}g:[r]^{d}\to\{\pm 1\}.

  3. 3.

    Run the algorithm of Lemma A.14 using O(dlogrϵ2)O\left(\frac{d\log r}{\epsilon^{2}}\right) samples to obtain gg^{\prime} and output gg^{\prime}.

The proof proceeds as in the case for continuous distributions (Lemma 5.2). Assume all steps succeed, which occurs with probability at least 2/32/3. After step 2 we obtain g:[r]d{±1}g:[r]^{d}\to\{\pm 1\} such that, for any hh\in\mathcal{H},

(v,b)𝒟𝖻𝗅𝗈𝖼𝗄[g(v)b](v,b)𝒟𝖻𝗅𝗈𝖼𝗄[h𝖻𝗅𝗈𝖼𝗄(v)b]+ϵ/3.\underset{(v,b)\sim\mathcal{D}^{\mathsf{block}}}{\mathbb{P}}\left[g(v)\neq b\right]\leq\underset{(v,b)\sim\mathcal{D}^{\mathsf{block}}}{\mathbb{P}}\left[h^{\mathsf{block}}(v)\neq b\right]+\epsilon/3\,.

By Lemma A.14 and Proposition A.7, the output satisfies,

(x,b)𝒟[g(x)b]\displaystyle\underset{(x,b)\sim\mathcal{D}}{\mathbb{P}}\left[g^{\prime}(x)\neq b\right] (x,b)𝒟,z[g(𝖻𝗅𝗈𝖼𝗄¯(x,z))b]+ϵ/3\displaystyle\leq\underset{(x,b)\sim\mathcal{D},z}{\mathbb{P}}\left[g(\overline{\mathsf{block}}(x,z))\neq b\right]+\epsilon/3
(x,b)𝒟,z[h𝖻𝗅𝗈𝖼𝗄(𝖻𝗅𝗈𝖼𝗄¯(x,z))b]+2ϵ/3\displaystyle\leq\underset{(x,b)\sim\mathcal{D},z}{\mathbb{P}}\left[h^{\mathsf{block}}(\overline{\mathsf{block}}(x,z))\neq b\right]+2\epsilon/3
(x,b)𝒟[h(x)b]+x,z[h(x)hz𝖼𝗈𝖺𝗋𝗌𝖾(x)]+2ϵ/3\displaystyle\leq\underset{(x,b)\sim\mathcal{D}}{\mathbb{P}}\left[h(x)\neq b\right]+\underset{x,z}{\mathbb{P}}\left[h(x)\neq h^{\mathsf{coarse}}_{z}(x)\right]+2\epsilon/3
(x,b)𝒟[h(x)b]+ϵ.\displaystyle\leq\underset{(x,b)\sim\mathcal{D}}{\mathbb{P}}\left[h(x)\neq b\right]+\epsilon\,.\qed
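The corresponding sketch for the regression-based learner replaces the per-cell majority vote with a regression of the labels against a feature set on [r]d[r]^{d}. The actual algorithm invokes the polynomial regression of Theorem 5.1 followed by the rounding step of Lemma A.14; the least-squares version below, with hypothetical names and a brute-force monomial basis that is only manageable for small dd and degree, is meant only to show where the downsampled grid enters.

```python
import numpy as np
from itertools import product

def regression_learner(xs, bs, r, degree):
    """Schematic version of the learner in Lemma A.17: downsample to [r]^d,
    regress the labels on low-degree monomials of the (rescaled) block
    indices, and take the sign of the fitted function."""
    m, d = xs.shape
    cuts = np.quantile(xs, np.linspace(0, 1, r + 1)[1:-1], axis=0)

    def block(x):  # cell of x in [r]^d, rescaled to [0, 1]^d
        return np.array([np.searchsorted(cuts[:, i], x[i]) for i in range(d)]) / r

    def features(v):  # monomials of total degree <= degree in v_1, ..., v_d
        return np.array([np.prod(v ** np.array(e))
                         for e in product(range(degree + 1), repeat=d)
                         if sum(e) <= degree])

    A = np.array([features(block(x)) for x in xs])
    coef, *_ = np.linalg.lstsq(A, np.asarray(bs, dtype=float), rcond=None)
    return lambda x: 1 if features(block(x)) @ coef >= 0 else -1
```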
Theorem A.18.

There are agnostic learning algorithms for functions of convex sets, functions of halfspaces, degree-kk PTFs, and kk-alternating functions achieving the sample and time complexity bounds in Theorems 1.1.4, 1.1.2, 1.1.3, and 1.1.5, that work for any finite or continuous product distribution over d\mathbb{R}^{d}.

Proof.

This follows from the same arguments as for each of those theorems, except with the bounds from Proposition A.8 and Lemmas A.10, A.11, A.12, and A.9 to bound 𝖻𝖻𝗌¯\overline{\mathsf{bbs}}; Lemma A.13 to bound the noise sensitivity; and the improved general algorithms of Lemmas A.16 and A.17. ∎

Appendix B Glossary

The notation f(n)=O~(g(n))f(n)=\widetilde{O}(g(n)) means that for some constant cc, f(n)=O(g(n)logc(g(n)))f(n)=O(g(n)\log^{c}(g(n))).

The natural ordering on the set [n]d[n]^{d} or d\mathbb{R}^{d} is the partial order where for x,ydx,y\in\mathbb{R}^{d}, x<yx<y iff i[d],xiyi\forall i\in[d],x_{i}\leq y_{i}, and xyx\neq y.

Property Testing.

For a set 𝒫\mathcal{P} of distributions over XX and a set \mathcal{H} of functions X{±1}X\to\{\pm 1\}, a distribution-free property testing algorithm for \mathcal{H} under 𝒫\mathcal{P} is a randomized algorithm that is given a parameter ϵ>0\epsilon>0. It has access to the input probability distribution 𝒟𝒫\mathcal{D}\in\mathcal{P} via a sample oracle, which returns an independent sample from 𝒟\mathcal{D}. It has access to the input function f:X{±1}f:X\to\{\pm 1\} via a query oracle, which given query xXx\in X returns the value f(x)f(x). A two-sided distribution-free testing algorithm must satisfy:

  1. 1.

    If ff\in\mathcal{H} then the algorithm accepts with probability at least 2/32/3;

  2. 2.

    If ff is ϵ\epsilon-far from \mathcal{H} with respect to 𝒟\mathcal{D} then the algorithm rejects with probability at least 2/32/3.

A one-sided algorithm must accept with probability 1 when ff\in\mathcal{H}. An (ϵ1,ϵ2)(\epsilon_{1},\epsilon_{2})-tolerant tester must accept with probability at least 2/32/3 when h\exists h\in\mathcal{H} such that x𝒟[f(x)h(x)]ϵ1\underset{x\sim\mathcal{D}}{\mathbb{P}}\left[f(x)\neq h(x)\right]\leq\epsilon_{1}, and must reject with probability at least 2/32/3 when ff is ϵ2\epsilon_{2}-far from \mathcal{H} with respect to 𝒟\mathcal{D}.

In the query model, the queries to the query oracle can be arbitrary. In the sample model, the tester queries a point xXx\in X if and only if xx was obtained from the sample oracle.

A tester in the query model is adaptive if it makes its choice of query based on the answers to previous queries. It is non-adaptive if it chooses its full set of queries in advance, before obtaining any of the answers.

The sample complexity of an algorithm is the number of samples requested from the sample oracle. The query complexity of an algorithm is the number of queries made to the query oracle.

Learning.

Let \mathcal{H} be a set of functions X{±1}X\to\{\pm 1\} and let 𝒫\mathcal{P} be a set of probability distributions over XX. A learning algorithm for \mathcal{H} under 𝒫\mathcal{P} (in the non-agnostic, or realizable, model) is a randomized algorithm that receives a parameter ϵ>0\epsilon>0 and has sample access to an input function ff\in\mathcal{H}. Sample access means that the algorithm may request an independent random example (x,f(x))(x,f(x)) where xx is sampled from some input distribution 𝒟𝒫\mathcal{D}\in\mathcal{P}. The algorithm is required to output a function g:X{±1}g:X\to\{\pm 1\} that, with probability 2/32/3, satisfies the condition

x𝒟[f(x)g(x)]ϵ.\underset{x\sim\mathcal{D}}{\mathbb{P}}\left[f(x)\neq g(x)\right]\leq\epsilon\,.

In the agnostic setting, the algorithm instead has sample access to an input distribution 𝒟\mathcal{D} over X×{±1}X\times\{\pm 1\} whose marginal over XX is in 𝒫\mathcal{P} (i.e. it receives samples of the form (x,b)X×{±1}(x,b)\in X\times\{\pm 1\}). The algorithm is required to output a function g:X{±1}g:X\to\{\pm 1\} that, with probability 2/32/3, satisfies the following condition: h\forall h\in\mathcal{H},

(x,b)𝒟[g(x)b](x,b)𝒟[h(x)b]+ϵ.\underset{(x,b)\sim\mathcal{D}}{\mathbb{P}}\left[g(x)\neq b\right]\leq\underset{(x,b)\sim\mathcal{D}}{\mathbb{P}}\left[h(x)\neq b\right]+\epsilon\,.

A proper learning algorithm is one whose output must also satisfy gg\in\mathcal{H}; otherwise it is improper.

VC Dimension.

For a set \mathcal{H} of functions X{±1}X\to\{\pm 1\}, a set SXS\subseteq X is shattered by \mathcal{H} if for all functions f:S{±1}f:S\to\{\pm 1\} there is a function hh\in\mathcal{H} such that xS,h(x)=f(x)\forall x\in S,h(x)=f(x). The VC dimension 𝖵𝖢()\mathsf{VC}(\mathcal{H}) of \mathcal{H} is the size of the largest set SXS\subseteq X that is shattered by \mathcal{H}.
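As a standard example, let \mathcal{H} be the class of threshold functions h_{t}:\mathbb{R}\to\{\pm 1\} with h_{t}(x)=+1 if and only if x\geq t. Any single point is shattered by \mathcal{H}, but no pair a<b is, since the labelling f(a)=+1, f(b)=-1 is not realized by any h_{t}; hence \mathsf{VC}(\mathcal{H})=1.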

References

  • [AC06] Nir Ailon and Bernard Chazelle. Information theory in property testing and monotonicity testing in higher dimension. Information and Computation, 204(11):1704–1717, 2006.
  • [BB20] Eric Blais and Abhinav Bommireddi. On testing and robust characterizations of convexity. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2020). Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2020.
  • [BBBY12] Maria-Florina Balcan, Eric Blais, Avrim Blum, and Liu Yang. Active property testing. In Proceedings of the IEEE 53rd Annual Symposium on Foundations of Computer Science (FOCS), pages 21–30, 2012.
  • [BCO+15] Eric Blais, Clément Canonne, Igor Oliveira, Rocco Servedio, and Li-Yang Tan. Learning circuits with few negations. Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, page 512, 2015.
  • [BCS18] Hadley Black, Deeparnab Chakrabarty, and C Seshadhri. A o(d)polylogno(d)\cdot\operatorname{poly}\log n monotonicity tester for boolean functions over the hypergrid [n]d[n]^{d}. In Proceedings of the 29th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2133–2151, 2018.
  • [BCS20] Hadley Black, Deeparnab Chakrabarty, and C Seshadhri. Domain reduction for monotonicity testing: A o(d)o(d) tester for boolean functions in dd-dimensions. In Proceedings of the 31st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1975–1994, 2020.
  • [BEHW89] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K Warmuth. Learnability and the vapnik-chervonenkis dimension. Journal of the ACM (JACM), 36(4):929–965, 1989.
  • [BFPJH21] Eric Blais, Renato Ferreira Pinto Jr, and Nathaniel Harms. VC dimension and distribution-free sample-based testing. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 504–517, 2021.
  • [BMR16] Piotr Berman, Meiram Murzabulatov, and Sofya Raskhodnikova. Tolerant testers of image properties. In 43rd International Colloquium on Automata, Languages, and Programming (ICALP 2016). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2016.
  • [BMR19] Piotr Berman, Meiram Murzabulatov, and Sofya Raskhodnikova. Testing convexity of figures under the uniform distribution. Random Structures & Algorithms, 54(3):413–443, 2019.
  • [BOW10] Eric Blais, Ryan O’Donnell, and Karl Wimmer. Polynomial regression under arbitrary product distributions. Machine Learning, 80(2-3):273–294, 2010.
  • [BR92] Avrim L Blum and Ronald L Rivest. Training a 3-node neural network is NP-complete. Neural Networks, 5(1):117–127, 1992.
  • [BRY14] Eric Blais, Sofya Raskhodnikova, and Grigory Yaroslavtsev. Lower bounds for testing properties of functions over hypergrid domains. In Proceedings of the IEEE 29th Conference on Computational Complexity (CCC), pages 309–320, 2014.
  • [BY19] Eric Blais and Yuichi Yoshida. A characterization of constant-sample testable properties. Random Structures & Algorithms, 55(1):73–88, 2019.
  • [CDJS17] Deeparnab Chakrabarty, Kashyap Dixit, Madhav Jha, and C Seshadhri. Property testing on product distributions: Optimal testers for bounded derivative properties. ACM Transactions on Algorithms (TALG), 13(2):1–30, 2017.
  • [CDS20] Xue Chen, Anindya De, and Rocco A Servedio. Testing noisy linear functions for sparsity. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 610–623, 2020.
  • [CFSS17] Xi Chen, Adam Freilich, Rocco A Servedio, and Timothy Sun. Sample-based high-dimensional convexity testing. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2017). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2017.
  • [CGG+19] Clément L. Canonne, Elena Grigorescu, Siyao Guo, Akash Kumar, and Karl Wimmer. Testing kk-monotonicity: The rise and fall of boolean functions. Theory of Computing, 15(1):1–55, 2019.
  • [CMK19] Mónika Csikós, Nabil H Mustafa, and Andrey Kupavskii. Tight lower bounds on the VC-dimension of geometric set systems. Journal of Machine Learning Research, 20(81):1–8, 2019.
  • [CS16] Deeparnab Chakrabarty and Comandur Seshadhri. An o(n)o(n) monotonicity tester for boolean functions over the hypercube. SIAM Journal on Computing, 45(2):461–472, 2016.
  • [DHK+10] Ilias Diakonikolas, Prahladh Harsha, Adam Klivans, Raghu Meka, Prasad Raghavendra, Rocco A Servedio, and Li-Yang Tan. Bounding the average sensitivity and noise sensitivity of polynomial threshold functions. In Proceedings of the 42nd ACM Symposium on Theory of Computing (STOC), pages 533–542, 2010.
  • [DKS18] Ilias Diakonikolas, Daniel M. Kane, and Alistair Stewart. Learning geometric concepts with nasty noise. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 1061–1073, 2018.
  • [DMN19] Anindya De, Elchanan Mossel, and Joe Neeman. Is your function low-dimensional? In Conference on Learning Theory, pages 979–993, 2019.
  • [FR10] Shahar Fattal and Dana Ron. Approximating the distance to monotonicity in high dimensions. ACM Transactions on Algorithms (TALG), 6(3):1–37, 2010.
  • [FY20] Noah Fleming and Yuichi Yoshida. Distribution-free testing of linear functions on n\mathbb{R}^{n}. In Proceedings of the 11th Innovations in Theoretical Computer Science Conference (ITCS). Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2020.
  • [GR16] Oded Goldreich and Dana Ron. On sample-based testers. ACM Transactions on Computation Theory (TOCT), 8(2):1–54, 2016.
  • [Har19] Nathaniel Harms. Testing halfspaces over rotation-invariant distributions. In Proceedings of the 30th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 694–713, 2019.
  • [HK07] Shirley Halevy and Eyal Kushilevitz. Distribution-free property-testing. SIAM Journal on Computing, 37(4):1107–1138, 2007.
  • [KKMS08] Adam T. Kalai, Adam R. Klivans, Yishay Mansour, and Rocco A Servedio. Agnostically learning halfspaces. SIAM Journal on Computing, 37(6):1777–1805, 2008.
  • [KMS18] Subhash Khot, Dor Minzer, and Muli Safra. On monotonicity testing and boolean isoperimetric-type theorems. SIAM Journal on Computing, 47(6):2238–2276, 2018.
  • [KOS04] Adam R Klivans, Ryan O’Donnell, and Rocco A Servedio. Learning intersections and thresholds of halfspaces. Journal of Computer and System Sciences, 68(4):808–840, 2004.
  • [KOS08] Adam R Klivans, Ryan O’Donnell, and Rocco A Servedio. Learning geometric concepts via gaussian surface area. In Proceedings of the 49th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 541–550, 2008.
  • [KR00] Michael Kearns and Dana Ron. Testing problems with sublearning sample complexity. Journal of Computer and System Sciences, 61(3):428–456, 2000.
  • [O’D14] Ryan O’Donnell. Analysis of Boolean Functions. Cambridge University Press, 2014.
  • [Ras03] Sofya Raskhodnikova. Approximate testing of visual properties. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 370–381. Springer, 2003.
  • [RR21] Dana Ron and Asaf Rosin. Optimal distribution-free sample-based testing of subsequence-freeness. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 337–256. SIAM, 2021.
  • [RV04] Luis Rademacher and Santosh Vempala. Testing geometric convexity. In International Conference on Foundations of Software Technology and Theoretical Computer Science, pages 469–480. Springer, 2004.
  • [SSBD14] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.
  • [Vem10a] Santosh Vempala. Learning convex concepts from gaussian distributions with PCA. In Proceedings of the IEEE 51st Annual Symposium on Foundations of Computer Science (FOCS), pages 124–130, 2010.
  • [Vem10b] Santosh Vempala. A random-sampling-based algorithm for learning intersections of halfspaces. Journal of the ACM, 57(6):1–14, 2010.
  • [War68] Hugh E. Warren. Lower bounds for approximation by nonlinear manifolds. Transactions of the American Mathematical Society, 133(1):167–178, 1968.