Downsampling for Testing and Learning in Product Distributions
Abstract
We study distribution-free property testing and learning problems where the unknown probability distribution is a product distribution over . For many important classes of functions, such as intersections of halfspaces, polynomial threshold functions, convex sets, and -alternating functions, the known algorithms either have complexity that depends on the support size of the distribution, or are proven to work only for specific examples of product distributions. We introduce a general method, which we call downsampling, that resolves these issues. Downsampling uses a notion of “rectilinear isoperimetry” for product distributions, which further strengthens the connection between isoperimetry, testing and learning. Using this technique, we attain new efficient distribution-free algorithms under product distributions on :
1. A simpler proof for non-adaptive, one-sided monotonicity testing of functions , and improved sample complexity for testing monotonicity over unknown product distributions, from [Black, Chakrabarty, & Seshadhri, SODA 2020] to .
2. Polynomial-time agnostic learning algorithms for functions of a constant number of halfspaces, and constant-degree polynomial threshold functions;
3. An -time agnostic learning algorithm, and an -sample tolerant tester, for functions of convex sets; and a sample-based one-sided tester for convex sets;
4. An -time agnostic learning algorithm for -alternating functions, and a sample-based tolerant tester with the same complexity.
1 Introduction
In property testing and learning, the goal is to design algorithms that use as little information as possible about the input while still being correct (with high probability). This includes using as little information as possible about the probability distribution against which correctness is measured. Information about the probability distribution could be in the form of guarantees on this distribution (e.g. it is guaranteed to be uniform, or Gaussian), or in the form of samples from the distribution. So we want to minimize the requirements on this distribution, as well as the number of samples used by the algorithm.
Progress on high-dimensional property testing and learning problems is usually made by studying algorithms for the uniform distribution over the hypercube , or the standard Gaussian distribution over , as the simplest case. For example, efficiently learning intersections of halfspaces is a major open problem in learning theory [DKS18, KOS04], and progress on this problem has been made by studying the uniform distribution over the hypercube and the Gaussian as special cases [KKMS08, KOS08, Vem10a]. Another important example is the class of degree- polynomial threshold functions (PTFs). Unlike intersections of halfspaces, these can be efficiently learned in the PAC model [KOS04], but agnostic learning (see the Glossary in Appendix B for standard definitions in learning theory and property testing) is more challenging. Again, progress has been made by studying the hypercube [DHK+10]. An even more extreme example is the class of convex sets, which are not learnable in the distribution-free PAC model, because they have infinite VC dimension, but which become learnable under the Gaussian [KOS08]. The uniform distribution over the hypercube and the Gaussian are both examples of product distributions, so the next natural question to ask is, can these results be generalized to any unknown product distribution? A partial answer was given by Blais, O’Donnell, & Wimmer [BOW10] for some of these classes; in this paper we resolve this question.
Similar examples appear in the property testing literature. Distribution-free property testing and testing functions with domain are emerging trends in the field (e.g. [BBBY12, DMN19, Har19, CDS20, FY20, BFPJH21]). Testing monotonicity is one of the most well-studied problems in property testing, and recent work [BCS20] has extended this study to product distributions over domain . Work of Chakrabarty & Seshadhri [CS16], Khot, Minzer, & Safra [KMS18], and Black, Chakrabarty, & Seshadhri [BCS18, BCS20] has resulted in efficient -query algorithms for the hypercube [KMS18] and the hypergrid . Black, Chakrabarty, & Seshadhri [BCS20] showed that testing monotonicity over unknown product distributions on could be done with queries and samples. Their “domain reduction” method is intricate and specialized for the problem of testing monotonicity. We improve the sample complexity to using a much simpler proof (an early version of this paper proved a weaker result, with two-sided error and worse sample complexity). We also generalize the testers of [CFSS17, CGG+19] for convex sets and -alternating functions, respectively, and provide new testers for arbitrary functions of convex sets.
This paper provides a general framework for designing distribution-free testing and learning algorithms under product distributions on , which may be finite or continuous. An algorithm is distribution-free under product distributions if it does not require any prior knowledge of the probability distribution, except the guarantee that it is a product distribution. The technique in this paper, which we call downsampling, improves upon previous methods (in particular, [BCS20, BOW10]), in a few ways. It is more general and does not apply only to a specific type of algorithm [BOW10] or a specific problem [BCS20], and we use it to obtain many other results. It is conceptually simpler. And it allows quantitative improvements over both [BOW10] and [BCS20].
Organization.
In Section 1.1, we describe the main results of this paper in the context of the related work. In Section 1.2, we briefly describe the main techniques in the paper. Section 2 presents the definitions and lemmas required by the main results. The following sections present the proofs of the results. For simplicity, the main body of the paper handles only continuous product distributions; finite distributions are handled in Appendix A. Definitions of standard terminology in property testing and machine learning can be found in the Glossary in Appendix B.
1.1 Results
See Table 1 for a summary of our results on property testing, and Table 2 for a summary of our results on learning.
1.1.1 Testing Monotonicity
Testing monotonicity is the problem of testing whether an unknown function is monotone, where is a partial order. It is one of the most commonly studied problems in the field of property testing. Previous work on this problem has mostly focused on uniform probability distributions (exceptions include [AC06, HK07, CDJS17, BFPJH21]) and finite domains. However, there is growing interest in property testing for functions on domain ([BBBY12, DMN19, Har19, CDS20, FY20, BFPJH21]) and [BCS20] generalized the problem to this domain.
Testing monotonicity under product distributions has been studied a few times. Ailon & Chazelle [AC06] gave a distribution-free monotonicity tester for real-valued functions under product distributions on , with query complexity . Chakrabarty et al. [CDJS17] improved this to and gave a matching lower bound. This lower bound applies to the real-valued case. For the boolean-valued case, monotonicity testers under the uniform distribution on [CS16, KMS18] and [BCS18, BCS20] are known with query complexity . In [BCS20], an -query tester was given for domain . That paper showed that there is a one-sided, non-adaptive, distribution-free monotonicity tester under product distributions on , with query complexity and sample complexity . In this paper we improve the sample complexity to , while greatly simplifying the proof.
Theorem. There is a one-sided, non-adaptive -tester for monotonicity of functions that is distribution-free under (finite or continuous) product distributions, using
queries and samples.
The main result of [BCS20] is a “domain reduction” theorem, allowing a change of domain from to where ; by applying this theorem together with their earlier -query tester for the uniform distribution on , they obtain a tester for monotone functions with query complexity independent of . Our result replaces this domain reduction method with a simpler and more general 2-page argument, and gives a different generalization to the distribution-free case. See Section 3 for the proofs.
Table 1: Summary of our results on property testing.

| | Hypercube | Hypergrid | Gaussian | Products (this work) |
| 1-Sided Testing Monotonicity (Query model) | [KMS18] | [BCS20] | [BCS20] | queries, samples (Thm. 1.1.1) |
| 1-Sided Testing Convex Sets (Sample model) | – | – | [CFSS17] | (Thm. 1.1.4) |
| Tolerant Testing Functions of Convex Sets (Sample model) | – | – | – | (Thm. 1.1.4) |
| Tolerant Testing -Alternating Functions (Sample model) | – | [CGG+19] | – | (Thm. 1.1.5) |
Table 2: Summary of our results on learning.

| | Hypercube | Hypergrid | Gaussian | Products (this work) |
| Functions of Convex Sets | – | – | [KOS08] | (Thm. 1.1.4) |
| Functions of Halfspaces | [KKMS08] | [BOW10] | [KOS08, Vem10a] (Intersections only) | (Thm. 1.1.2) |
| Degree- PTFs | [DHK+10] | [DHK+10, BOW10] | [DHK+10, BOW10] | (Thm. 1.1.3) |
| -Alternating Functions | [BCO+15] | (Testing) [CGG+19] | – | (Thm. 1.1.5) |
1.1.2 Learning Functions of Halfspaces
Intersections of halfspaces have VC dimension [BEHW89, CMK19], so the sample complexity of learning is known, but it is not possible to efficiently find halfspaces whose intersection is correct on the sample, unless [BR92]. Therefore the goal is to find efficient “improper” algorithms that output a function other than an intersection of halfspaces. Several learning algorithms for intersections of halfspaces actually work for arbitrary functions of halfspaces. We will write for the set of all functions , and for any class of functions we will write as the set of all functions where and each . Then for the class of halfspaces, Klivans, O’Donnell, & Servedio [KOS04] gave a (non-agnostic) learning algorithm for over the uniform distribution on with complexity ; Kalai, Klivans, Mansour, & Servedio [KKMS08] presented an agnostic algorithm with complexity in the same setting using “polynomial regression”.
Polynomial regression is a powerful technique, so it is important to understand when it can be applied. Blais, O’Donnell, & Servedio [BOW10] studied how to generalize it to arbitrary product distributions. With their method, they obtained an agnostic learning algorithm for with complexity for product distributions where each , and complexity for the “polynomially bounded” continuous distributions. This is not a complete generalization, because, for example, on the grid its complexity depends on . This prevents a full generalization to the domain . Their algorithm also requires some prior knowledge of the support or support size. We use a different technique and fully generalize the polynomial regression algorithm to arbitrary product distributions. See Section 5 for the proof.
Theorem. There is a distribution-free, improper agnostic learning algorithm for under (continuous or finite) product distributions over , with time complexity
1.1.3 Learning Polynomial Threshold Functions
Degree- PTFs are another generalization of halfspaces. A function is a degree- PTF if there is a degree- polynomial such that . Degree- PTFs can be PAC learned in time using linear programming [KOS04], but agnostic learning is more challenging. Diakonikolas et al. [DHK+10] previously gave an agnostic learning algorithm for degree- PTFs in the uniform distribution over with time complexity , where
The main result of that paper is an upper bound on the noise sensitivity of PTFs. Combined with the reduction of Blais et al. [BOW10], this implies an algorithm for the uniform distribution over with complexity and for the Gaussian distribution with complexity .
Our agnostic learning algorithm for degree- PTFs eliminates the dependence on and works for any unknown product distribution over , while matching the complexity of [DHK+10] for the uniform distribution over the hypercube. See Section 6 for the proof.
Theorem. There is an improper agnostic learning algorithm for degree- PTFs under (finite or continuous) product distributions over , with time complexity
1.1.4 Testing & Learning Convex Sets
One of the first properties (sets) of functions to be studied in the property testing literature is the set of indicator functions of convex sets, i.e. functions where is convex. Write for this class of functions. This problem has been studied in various models of testing [Ras03, RV04, CFSS17, BMR19, BB20]. In this paper we consider the sample-based model of testing, where the tester receives only random examples of the function and cannot make queries. This model of testing has received a lot of recent attention (e.g. [BBBY12, BMR19, BY19, CFSS17, GR16, Har19, RR21, BFPJH21]), partly because it matches the standard sample-based model for learning algorithms.
Chen et al. [CFSS17] gave a sample-based tester for under the Gaussian distribution on with one-sided error and sample complexity , along with a lower bound (for one-sided testers) of . We match their upper bound while generalizing the tester to be distribution-free under product distributions. See Section 4 for proofs.
Theorem. There is a sample-based distribution-free one-sided -tester for under (finite or continuous) product distributions that uses at most samples.
A more powerful kind of tester is an -tolerant tester, which must accept (with high probability) any function that is -close to the property, while rejecting functions that are -far. Tolerantly testing convex sets has been studied by [BMR16] for the uniform distribution over the 2-dimensional grid, but not (to the best of our knowledge) in higher dimensions. We obtain a sample-based tolerant tester (and distance approximator) for convex sets in high dimension. In fact, recall that is the set of all functions and any subset, so is any property of functions of convex sets. We obtain a distance approximator for any such property:
Theorem. Let . There is a sample-based distribution-free algorithm under (finite or continuous) product distributions that approximates distance to up to additive error using samples. Setting we obtain an -tolerant tester with sample complexity .
General distribution-free learning of convex sets is not possible, since this class has infinite VC dimension. However, they can be learned under the Gaussian distribution. Non-agnostic learning under the Gaussian was studied by Vempala [Vem10a, Vem10b]. Agnostic learning under the Gaussian was studied by Klivans, O’Donnell, & Servedio [KOS08] who presented a learning algorithm with complexity , and a lower bound of .
Unlike the Gaussian, there is a trivial lower bound of in arbitrary product distributions, because any function belongs to this class. However, unlike the general distribution-free case, we show that convex sets (or any functions of convex sets) can be learned under unknown product distributions.
Theorem. There is an agnostic learning algorithm for under (finite or continuous) product distributions over , with time complexity .
1.1.5 Testing & Learning -Alternating Functions
A -alternating function on a partial order is one where for any chain in , changes value at most times. Learning -alternating functions on domain was studied by Blais et al. [BCO+15], motivated by the fact that these functions are computed by circuits with few negation gates. They show that samples are necessary and sufficient in this setting. Canonne et al. [CGG+19] later obtained an algorithm for -tolerant testing -alternating functions, when , in the uniform distribution over , with query complexity , where .
We obtain an agnostic learning algorithm for -alternating functions that matches the query complexity of the tester in [CGG+19], and nearly matches the complexity of the (non-agnostic) learning algorithm of [BCO+15] for the uniform distribution over the hypercube. See Section 7 for proofs.
Theorem. There is an agnostic learning algorithm for -alternating functions under (finite or continuous) product distributions over that runs in time at most
We also generalize the tolerant tester of [CGG+19] to be distribution-free under product distributions, and eliminate the condition .
Theorem. For any , let . There is a sample-based -tolerant tester for -alternating functions using samples, which is distribution-free under (finite or continuous) product distributions over .
1.2 Techniques
What connects these diverse problems is a notion of rectilinear surface area or isoperimetry that we call “block boundary size”. There is a close connection between learning & testing and various notions of isoperimetry or surface area (e.g. [CS16, KOS04, KOS08, KMS18]). We show that testing or learning a class on product distributions over can be reduced to testing and learning on the uniform distribution over , where is determined by the block boundary size, and we call this reduction “downsampling”. The name downsampling is used in image and signal processing for the process of reducing the resolution of an image or reducing the number of discrete samples used to represent an analog signal. We adopt the name because our method can be described by analogy to image or signal processing as the following 2-step process:
1. Construct a “digitized” or “pixellated” image of the function by sampling from the distribution and constructing a grid in which each cell has roughly equal probability mass; and
2. Learn or test the “low-resolution” pixellated function.
As long as the function takes a constant value in the vast majority of “pixels”, the low resolution version seen by the algorithm is a good enough approximation for testing or learning. The block boundary size is, informally, the number of pixels on which is not constant.
This technique reduces distribution-free testing and learning problems to the uniform distribution in a way that is conceptually simpler than in the prior work [BOW10, BCS20]. However, some technical challenges remain. The first is that it is not always easy to bound the number of “pixels” on which a function is not constant – for example, for PTFs. Second, unlike in the uniform distribution, the resulting downsampled function class on is not necessarily “the same” as the original class – for example, halfspaces on are not downsampled to halfspaces on , since the “pixels” are not of equal size. Thus, geometric arguments may not work, unlike the case for actual images.
A similar technique of constructing “low-resolution” representations of the input has been used and rediscovered ad-hoc a few times in the property testing literature; in prior work, it was restricted to the uniform distribution over [KR00, Ras03, FR10, BY19, CGG+19] (or the Gaussian in [CFSS17]). With this paper, we aim to provide a unified and generalized study of this simple and powerful technique.
1.3 Block Boundary Size
Informally, we define the -block boundary size of a class of functions as the maximum number of grid cells on which a function is non-constant, over all possible grid partitions of (which are not necessarily evenly spaced) – see Section 2 for formal definitions. Whether downsampling can be applied to depends on whether
and the complexity of the algorithms depends on how large must be for the non-constant blocks to vanish relative to the whole grid. A general observation is that any function class where downsampling can be applied can be learned under unknown product distributions with a finite number of samples; for example, this holds for convex sets even though the VC dimension is infinite.
Proposition 1.1 (Consequence of Lemma 4.5).
Let be any set of functions (measurable with respect to continuous product distributions) such that
Then there is some function such that is distribution-free learnable under product distributions, up to error , with samples.
For convex sets, monotone functions, -alternating functions, and halfspaces, is easy to calculate. For degree- PTFs, it is more challenging. We say that a function induces a connected component if for every there is a continuous curve in from to such that for all on the curve, and is a maximal such set. Then we prove a general lemma that bounds the block boundary size by the number of connected components induced by functions .
Lemma 1.2 (Informal, see Lemma 6.6).
Suppose that for any axis-aligned affine subspace of affine dimension , and any function , induces at most connected components in . Then for , .
This lemma in fact generalizes all computations of block boundary size in this paper (up to constant factors in ). Using a theorem of Warren [War68], we get that any degree- polynomial achieves value 0 in at most grid cells, for sufficiently large (Corollary 6.8).
1.4 Polynomial Regression
The second step of downsampling is to find a testing or learning algorithm that works for the uniform distribution over the (not necessarily evenly-spaced) hypergrid. Most of our learning results use polynomial regression. This is a powerful technique introduced in [KKMS08] that performs linear regression over a vector space of functions that approximately spans the hypothesis class. This method is usually applied by using Fourier analysis to construct such an approximate basis for the hypothesis class [BOW10, DHK+10, CGG+19]. This was the method used, for example, by Blais, O’Donnell, & Wimmer [BOW10] to achieve the -time algorithms for intersections of halfspaces.
We take the same approach but we use the Walsh basis for functions on domain (see e.g. [BRY14]) instead of the bases used in the prior works. We show that if one can establish bounds on the noise sensitivity in the Fourier basis for the hypothesis class restricted to the uniform distribution over , then one gets a bound on the number of Walsh functions required to approximately span the “downsampled” hypothesis class. In this way, we establish that if one can apply standard Fourier-analytic techniques to the hypothesis class over the uniform distribution on and calculate the block boundary size, then the results for the hypercube essentially carry over to the distribution-free setting for product distributions on .
An advantage of this technique is that both noise sensitivity and block boundary size grow at most linearly during function composition: for functions where each belongs to the class , the noise sensitivity and block boundary size grow at most linearly in . Therefore learning results for obtained in this way are easy to extend to arbitrary compositions of , which is how we get our result for intersections of halfspaces.
2 Downsampling
We will now introduce the main definitions, notation, and lemmas required by our main results. The purpose of this section is to establish the main conceptual component of the downsampling technique: that functions with small enough block boundary size can be efficiently well-approximated by a “coarsened” version of the function that is obtained by random sampling. See Figure 1 for an illustration of the following definitions.
[Figure 1: an illustration of the block partition and coarsening definitions below.]
Definition 2.1 (Block Partitions).
An -block partition of is a pair of functions and obtained as follows. For each let such that and define for each . For each define the interval and a point . The function is defined by setting to be the unique vector such that for each . The function is defined by setting ; note that where .
Definition 2.2 (Block Functions and Coarse Functions).
For a function , we define as and as . For any set of functions , we define . For a distribution over and an -block partition we define the distribution over as the distribution of for .
Definition 2.3 (Induced Block Partitions).
When is a product distribution over , a random grid of length is the grid obtained by sampling points independently from and for each defining to be the -smallest coordinate in dimension among all sampled points. For any that divides we define an -block partition depending on by defining for each the point so that the intervals are when and ; we let the points defining be arbitrary. This is the -block partition induced by .
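To make Definition 2.3 concrete, the following is a minimal Python sketch (with hypothetical names) of building the block map induced by a random grid: sort the sample in each dimension, keep every (N/r)-th order statistic as a grid line, and map a point to the vector of its cell indices.

```python
import numpy as np

def induced_block_partition(samples, r):
    """Sketch of Definition 2.3: `samples` is an (N, d) array of i.i.d. draws from
    a product distribution, with r dividing N.  Returns block(x), mapping a point
    of R^d to its cell in {0, ..., r-1}^d."""
    N, d = samples.shape
    assert N % r == 0
    k = N // r
    # In each dimension, keep every k-th order statistic; the first r-1 of them
    # are the interior grid lines (the last one is the top of the grid).
    order_stats = np.sort(samples, axis=0)[k - 1 : N : k, :]   # shape (r, d)
    cuts = order_stats[:-1, :]                                   # shape (r-1, d)

    def block(x):
        # Cell coordinate i = number of interior grid lines at or below x_i.
        return tuple(int(np.searchsorted(cuts[:, i], x[i], side="right"))
                     for i in range(d))

    return block

# Example: block = induced_block_partition(np.random.default_rng(0).normal(size=(8000, 3)), r=20)
```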
Definition 2.4 (Block Boundary Size).
For a block partition , a distribution over , and a function , we say is non-constant on a block if there are sets such that ; and have positive measure (in the product of Lebesgue measures). For a function and a number , we define the -block boundary size as the maximum number of blocks on which is non-constant, where the maximum is taken over all -block partitions . For a set of functions , we define .
The total variation distance between two distributions $P, Q$ over a finite domain $X$ is defined as $d_{\mathrm{TV}}(P, Q) = \frac{1}{2} \sum_{x \in X} |P(x) - Q(x)|$.
The essence of downsampling is apparent in the next proposition. It shows that the distance of to its coarsened version is bounded by two quantities: the fraction of blocks in the -block partition on which is not constant, and the distance of the distribution to uniform. When both quantities are small, testing or learning can be done by testing or learning instead. The uniform distribution over a set is denoted :
Proposition 2.5.
Let be a continuous product distribution over , let be a random grid, and let be the induced -block partition. Then, for any measurable , the following holds with probability 1 over the choice of :
Proof.
We first establish that, with probability 1 over and , if then is non-constant on . Fix and suppose there exists a set of positive measure such that for each but is not non-constant on , i.e. for , either or . Then there is such that for , . Let . If then , so . But for random , the probability that there exists such that is 0, since is random within .
Assuming that the above event occurs,
Since is uniform, the probability of hitting a non-constant block is at most . ∎
Next we give a bound on the number of samples required to ensure that is close to uniform. We need the following lemma.
Lemma 2.6.
Let be continuous probability distribution over , such that divides , and . Let be a set of points sampled independently from . Write labeled such that (and write ). Then for any ,
Proof.
We assume that . If then we can repeat the following analysis with the opposite ordering on the points in . Write and . First suppose that ; we will bound the probability of this event later.
Let be the point such that (which must exist since is continuous). Let . Write . The expected value of is , where the factor in the denominator is due to the fact that each element of is sampled from conditional on being larger than . The event occurs if and only if , which occurs with probability
where
Since the expected value satisfies
the Chernoff bound gives
Now let be the point such that . The expected value of is now . The event occurs if and only if , which occurs with probability
where
The expected value satisfies , so the Chernoff bound gives
It remains to bound the probability that . Define such that . if and only if , i.e. . The expected value of is , so for , the Chernoff bound implies
Now define such that . if and only if , i.e. . The expected value of is , so for ,
The conclusion then follows from the union bound over these four events. ∎
Lemma 2.7.
Let be a product distribution over where each is continuous. Let be a random grid with length sampled from , and let be the -block partition induced by . Then
Proof.
For a fixed grid and each , write for the probability distribution on with . Then .
Let . Suppose that for every it holds that . Note that . Then for every ,
So
By Lemma 2.6 and the union bound, the probability that there is some that satisfies is at most . ∎
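As a sanity check of Lemma 2.7, the sketch below takes standard Gaussian marginals as an assumed example, builds a random grid, computes the exact mass of each of the r cells in every dimension from the Gaussian CDF, and sums the per-dimension total variation distances to uniform; this sum upper-bounds the distance of the downsampled product distribution from uniform and shrinks as the grid length grows.

```python
import numpy as np
from scipy.stats import norm

def downsampled_tv_bound(r, N, d, rng=np.random.default_rng(0)):
    """Per-dimension TV distance to uniform of the downsampled distribution, summed
    over dimensions, for a random grid of length N (with r dividing N) and standard
    Gaussian marginals; the sum bounds the TV distance of the product to uniform."""
    assert N % r == 0
    k = N // r
    total = 0.0
    for _ in range(d):
        pts = np.sort(rng.standard_normal(N))
        cuts = pts[k - 1 : N : k][:-1]                 # r-1 interior grid lines
        cdf = np.concatenate(([0.0], norm.cdf(cuts), [1.0]))
        masses = np.diff(cdf)                           # exact mass of each of the r cells
        total += 0.5 * np.abs(masses - 1.0 / r).sum()
    return total

# e.g. downsampled_tv_bound(r=16, N=16 * 500, d=5) is small, and decreases as N grows
```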
3 Testing Monotonicity
3.1 Testing Monotonicity on the Hypergrid
A good introduction to downsampling is the following short proof of the main result of Black, Chakrabarty, & Seshadhri [BCS20]. In an earlier work, [BCS18], they gave an tester for the domain , and in the later work they showed how to reduce the domain to for .
Our monotonicity tester will use as a subroutine the following tester for diagonal functions. For a hypergrid , a diagonal is a subset of points defined by some . A function is a diagonal function if it has at most one 1-valued point in each diagonal.
Lemma 3.1.
There is an -tester with one-sided error and query complexity for diagonal functions on .
Proof.
For each let be the set of diagonals with length . For any let be the unique diagonal that contains . For input and any , let .
Suppose that is -far from diagonal. Then must have at least 1-valued points; otherwise we could set each 1-valued point to 0 to obtain the constant 0 function. Now observe
For each , define . Let . Then
Therefore there is some such that .
The tester is as follows. For each :
1. Sample points .
2. For each , sample points from and reject if there are two distinct 1-valued points in the sample.
The query complexity of the tester is .
The tester will clearly accept any diagonal function. Now suppose that is -far from having this property, and let be such that . On iteration , the algorithm samples points . The probability that is at most
Now assume that there is some , so that . Let be disjoint subsets that partition the 1-valued points in into equally-sized parts. Then for sampled uniformly at random from , . The probability that there are at least 2 distinct 1-valued points in sampled by the algorithm is at least the probability that one of the first samples is in and one of the last samples is in . This fails to occur with probability at most . So the total probability of failure is at most . ∎
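For intuition, here is a minimal sketch of the rejection rule in the diagonal tester; the sample sizes are illustrative placeholders rather than the length-stratified budgets used in the proof of Lemma 3.1.

```python
import random

def diagonal_through(x, r):
    """All points of {0, ..., r-1}^d on the diagonal through x (direction all-ones)."""
    lo = min(x)
    base = tuple(xi - lo for xi in x)
    length = r - max(base)
    return [tuple(b + t for b in base) for t in range(length)]

def test_diagonal_sketch(f, r, d, num_starts=200, samples_per_diag=50):
    """One-sided check: reject only on a witness, i.e. two distinct 1-valued points
    on a common diagonal.  A diagonal function is never rejected."""
    for _ in range(num_starts):
        x = tuple(random.randrange(r) for _ in range(d))
        diag = diagonal_through(x, r)
        ones = {p for p in random.choices(diag, k=samples_per_diag) if f(p) == 1}
        if len(ones) >= 2:
            return "reject"
    return "accept"
```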
Theorem 3.2.
There is a non-adaptive monotonicity tester on domain with one-sided error and query complexity .
Proof.
Set , and assume without loss of generality that divides . Partition into intervals . For each write . Define where is the unique vector such that . Define and , where the minimum and maximum are with respect to the natural ordering on . For , write . We may simulate queries to by returning . We will call a boundary block if .
The test proceeds as follows: On input and a block , define the following functions:
Queries to each of these functions can be simulated by 2 or 3 queries to . The tester performs:
1. Test whether , or whether , using queries.
2. Test whether is diagonal, or is -far from diagonal, using Lemma 3.1, with queries.
3. Test whether is monotone or -far from monotone, using the tester of Black, Chakrabarty, & Seshadhri with queries.
Claim 3.3.
If is monotone, the tester passes all 3 tests with probability 1.
Proof of claim.
To see that , observe that if is not a boundary block then . If then and while , and this is a violation of the monotonicity of . Therefore will pass the first test with probability 1.
To see that passes the second test with probability 1, observe that if had 2 boundary blocks in some diagonal, then there are boundary blocks such that . But then there is such that and ; since , this contradicts the monotonicity of . So has at most 1 boundary block in each diagonal.
To see that is monotone, it is sufficient to consider the boundary blocks, since all other values are the same as . Let be a boundary block, so there exist such that and . Suppose is not a boundary block (if it is a boundary block then ). If then , but while , a contradiction. So it must be that whenever . For any block such that , we have , so monotonicity holds. Since the tester of Black, Chakrabarty, & Seshadhri has one-sided error, the test passes with probability 1. ∎
Claim 3.4.
If is -close to , is -close to diagonal, and is -close to monotone, then is -close to monotone.
Proof of claim.
Let be the function . Suppose that . If is not a boundary block of then , so . If is a boundary block then so , and .
Suppose for contradiction that there are more than boundary blocks , so there are more than 1-valued points of . Any diagonal function has at most 1-valued points. Therefore the distance of to diagonal is at least
a contradiction. So has at most boundary blocks. Now
Let be a monotone function minimizing the distance to , and let be the function . Then
Finally, the distance of to the nearest monotone function is at most
These two claims suffice to establish the theorem. ∎
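The block-level functions queried by the tester can be simulated from query access to f with one or two queries each; the following sketch (with hypothetical names) shows only this corner-query bookkeeping, not the full tester.

```python
def make_block_views(f, n, r):
    """Given query access to f on {0, ..., n-1}^d and r dividing n, expose block-level
    functions on {0, ..., r-1}^d as in the proof of Theorem 3.2."""
    w = n // r  # side length of each block

    def lo_corner(v):   # smallest point of block v
        return tuple(vi * w for vi in v)

    def hi_corner(v):   # largest point of block v
        return tuple((vi + 1) * w - 1 for vi in v)

    f_lo = lambda v: f(lo_corner(v))             # value of f at the block's minimum
    f_hi = lambda v: f(hi_corner(v))             # value of f at the block's maximum
    is_boundary = lambda v: f_lo(v) != f_hi(v)   # witnesses that f changes value in block v
    return f_lo, f_hi, is_boundary
```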
3.2 Monotonicity Testing for Product Distributions
The previous section used a special case of downsampling, tailored for the uniform distribution over . We will call a product distribution over continuous if each of its factors is continuous (i.e. absolutely continuous with respect to the Lebesgue measure). The proof for discrete distributions is in Appendix A.
Theorem 1.1.1 (restated).
Proof.
We follow the proof of Theorem 3.2, with some small changes. Let . The tester first samples a grid with length and constructs the induced -block partition, with cells labeled . We call a block upper extreme if there is some such that , and we call it lower extreme if there is some such that but is not upper extreme. Call the upper extreme blocks and the lower extreme blocks . Note that .
For each , we again define as, respectively, the supremal and infimal point such that . The algorithm will ignore the extreme blocks , which do not have a supremal or an infimal point. Therefore it is not defined whether these blocks are boundary blocks.
By Lemma 2.7, with probability at least , we will have . We define as before, with domain . Define similarly but with domain and values
If is monotone, it may now be the case , but we will have for all with , where the algorithm will make its queries. The algorithm will test whether on all with , or -far from this property, which can be again done with samples. Note that if is -close to having this property, then
The algorithm then proceeds as before, with error parameter . To test whether , the algorithm samples from and throws away any sample with . It then tests and using the uniform distribution on . It suffices to prove the following claim, which replaces Claim 3.4.
Claim 3.5.
If is -close to , is -close to diagonal, and is -close to monotone, then is -close to monotone.
Proof of claim.
Let be a monotone function minimizing the distance to . Then on at most blocks . Define as when , and when . Note that is monotone.
By the triangle inequality,
From above, we know . To bound the second term, observe that since is -close to diagonal, there are at most
boundary blocks. Then observe that if then and either is a boundary block, or and . Then
∎
4 Learning and Testing Functions of Convex Sets
In this section we present our learning and testing results for functions of convex sets: an agnostic learning algorithm, a sample-based distance approximator, and a sample-based one-sided tester. All our algorithms will follow from more general results that actually hold for any class with bounded -block boundary size; this shows that bounded block-boundary size is sufficient to guarantee learnability in product distributions.
Let be the set of functions such that is convex. Let be the set of all Boolean functions .
Definition 4.1 (Function Composition).
For a set of functions , we will define the composition as the set of functions of the form where and each belongs to .
Proposition 4.2.
Let be any class of functions and fix any . Then .
Proof.
If is not constant on then one of the is not constant on that block. Therefore . ∎
Lemma 4.3.
For any , .
Proof.
We prove by induction on ; the result will hold by Proposition 4.2. Let be the -block boundary size in dimension . Recall that block is the set where for some . Let .
For , if there are 3 intervals , , on which is not constant, then within each interval the function takes both values . Thus, there are points such that , which is a contradiction.
For each block , let be the “upper face”. For , let be the set of non-constant blocks such that is constant on the upper face and let be the set of non-constant blocks that are non-constant on the upper face, so that . We argue that : for a vector define the line . If then there are with such that is constant on but non-constant on . Let be points in respectively such that . If is constant on or then there is a contradiction since the lines through and pass through ; so is constant 1 on . But then there is a point with , which is a contradiction since it is within the convex hull of . So ; since there are at most lines , .
To bound , observe that for each block is non-constant on the plane , there are such planes, is convex on each, and the -block partition induces an -block partition on the plane where is non-constant on the corresponding block. Then, by induction . So
The above two lemmas combine to show that when .
4.1 Sample-based One-sided Tester
First, we prove a one-sided sample-based tester for convex sets.
Theorem 1.1.4 (restated).
Proof.
We prove the result for continuous distributions. The proof for finite distributions is in Theorem A.18.
On input distribution and function , let so that .
1. Sample a grid of size large enough that Lemma 2.7 guarantees with probability .
2. Take samples and accept if there exists such that on all that are not in a boundary block of .
This tester is one-sided since for any , for all that are not in a boundary block, regardless of whether the -block decomposition induced by satisfies . Now suppose that , and suppose that . For , let be the set of non-constant blocks. If such that , then
Therefore
a contradiction. So it must be that for every . There are at most choices of boundary set . Because the 1-valued blocks must be the convex hull of the boundary points, for each boundary set there are at most 2 choices of function with boundary (with a second choice occurring when the complement of is also a convex set with the same boundary). Therefore, by the union bound, the probability that is accepted is at most
which is at most for sufficiently large . ∎
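For intuition, one concrete way to instantiate a one-sided, sample-based consistency check for convexity (a sketch in the continuous domain, not the block-level test analyzed above) is to reject only when a 0-labeled sample lies in the convex hull of the 1-labeled samples; the indicator of a convex set can never produce such a witness.

```python
import numpy as np
from scipy.spatial import Delaunay

def one_sided_convexity_check(points, labels):
    """points: (N, d) array; labels: (N,) array with values in {0, 1}.
    Accepts every indicator of a convex set; rejects only on a witness."""
    ones = points[labels == 1]
    zeros = points[labels == 0]
    d = points.shape[1]
    if len(ones) < d + 1:
        return "accept"      # too few 1-points to span a full-dimensional hull
    try:
        hull = Delaunay(ones)
    except Exception:
        return "accept"      # degenerate (lower-dimensional) hull; a fuller sketch would project
    if len(zeros) and np.any(hull.find_simplex(zeros) >= 0):
        return "reject"      # a 0-labeled point inside the hull of the 1-labeled points
    return "accept"
```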
4.2 Sample-based Distance Approximator
Our sample-based distance approximator follows from the following general result.
Lemma 4.4.
For any set of functions , , and satisfying , there is a sample-based distribution-free algorithm for product distributions that approximates distance to up to additive error using samples.
Proof.
On input distribution and function , let , then:
1. Sample a grid of size large enough that Lemma 2.7 guarantees with probability .
2. Let be the set of all functions where ; note that .
3. Draw samples and output the distance on to the nearest function in .
We argue that with probability at least , is an -cover of . With probability at least , . Then by Proposition 2.5, for any ,
so is a -cover; assume this event occurs.
Write . By the union bound and Hoeffding’s inequality, with samples we fail to get an estimate of up to additive error with probability at most
for appropriately chosen . Assume this event occurs. We want to show that . Let minimize so . Then
Now let minimize so . Then
which concludes the proof. ∎
Applying the bound on we conclude: Theorem 1.1.4 (restated).
4.3 Agnostic Learning
We begin our learning results with an agnostic learning algorithm for functions of convex sets: the class . For a distribution over and an -block partition , define the distribution over as the distribution of when .
Lemma 4.5.
Let be any set of functions , let , and suppose satisfies . Then there is a distribution-free agnostic learning algorithm for continuous product distributions that learns in samples and time.
Proof.
On input distribution :
1. Sample a grid of size large enough that Lemma 2.7 guarantees with probability , where is the induced -block partition.
2. Agnostically learn a function with error and success probability using samples from . Output the function .
The second step is accomplished via standard learning results ([SSBD14] Theorem 6.8): the number of samples required for agnostic learning is bounded by multiplied by the logarithm of the number of functions in the class, and the number of functions is . Assume that both steps succeed, which occurs with probability at least . Let minimize . By Proposition 2.5,
Then
Lemma 4.5 then gives the following result for continuous product distributions, with the result for finite distributions following from Theorem A.18.
Theorem 1.1.4 (restated).
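The second step of the learner in Lemma 4.5 is plain empirical risk minimization over the finite downsampled class; a minimal sketch (with hypothetical names) of that step, and of lifting the learned coarse hypothesis back to the original domain:

```python
def erm_over_coarse_class(coarse_hypotheses, labeled_blocks):
    """coarse_hypotheses: a finite list of functions on block indices (the downsampled class);
    labeled_blocks: a list of (block(x), f(x)) pairs.  Returns the hypothesis with the
    fewest disagreements on the sample."""
    def empirical_error(h):
        return sum(h(v) != y for v, y in labeled_blocks) / len(labeled_blocks)
    return min(coarse_hypotheses, key=empirical_error)

def lift(g_coarse, block):
    """Compose the coarse hypothesis with the block map to get a hypothesis on the original domain."""
    return lambda x: g_coarse(block(x))
```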
5 Learning Functions of Halfspaces
A halfspace, or linear threshold function, is any function such that for some , where if and otherwise. Let be the set of halfspaces. Recall that downsampling reduces learning in to learning over , and is not the set of halfspaces over . Fortunately, agnostically learning halfspaces is commonly done by giving a bound on the degree of a polynomial that approximates [KOS04, KOS08, KKMS08], and we will show that a similar idea also suffices for learning . We first present a general algorithm based on “polynomial regression”, and then introduce the Fourier analysis necessary to apply the general learning algorithm to halfspaces, polynomial threshold functions, and -alternating functions.
5.1 A General Learning Algorithm
The learning algorithm in this section essentially replaces step 2 of the brute force algorithm (Lemma 4.5) with the “polynomial regression” algorithm of Kalai et al. [KKMS08]. Our general algorithm is inspired by an algorithm of Canonne et al. [CGG+19] for tolerantly testing -alternating functions over the uniform distribution on ; we state the regression algorithm as it appears in [CGG+19]. For a set of functions, is the set of all linear combinations of functions in :
Theorem 5.1 ([KKMS08, CGG+19]).
Let be a distribution over , let be a class of functions and a collection of functions such that for every , where . Then there is an algorithm that, for any distribution over with marginal over , outputs a function such that , with probability at least , using at most samples and time.
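The regression step can be sketched as an L1 linear regression over a given set of feature functions, solved as a linear program; this is a simplified illustration of the KKMS-style subroutine (the feature matrix, threshold, and rounding details are placeholders, not the exact procedure of Theorem 5.1).

```python
import numpy as np
from scipy.optimize import linprog

def l1_regression(Phi, y):
    """Minimize (1/N) * sum_i |Phi[i] @ c - y[i]| over coefficient vectors c.
    Phi: (N, m) matrix of feature values phi_j(x_i); y: (N,) labels in {-1, +1}."""
    N, m = Phi.shape
    # Variables [c (m entries), t (N entries)]; minimize sum(t) s.t. |Phi c - y| <= t.
    obj = np.concatenate([np.zeros(m), np.ones(N)])
    A_ub = np.block([[Phi, -np.eye(N)], [-Phi, -np.eye(N)]])
    b_ub = np.concatenate([y, -y])
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (m + N))
    return res.x[:m]

def round_to_boolean(Phi_test, c):
    """Round the real-valued predictor by its sign (the analysis uses a carefully chosen threshold)."""
    return np.sign(Phi_test @ c)
```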
Our general learning algorithm will apply to any hypothesis class that has small -block boundary size, and for which there is a set of functions that approximately span the class . This algorithm is improved to work for finite (rather than only continuous) product distributions in Lemma A.17.
Lemma 5.2.
Let and let be a set of measurable functions that satisfy:
1. There is some such that ;
2. There is a set of functions satisfying: such that for .
Let be the sample complexity of the algorithm in Theorem 5.1, with error parameter . Then there is an agnostic learning algorithm for on continuous product distributions over , that uses samples and runs in time polynomial in the sample size.
Proof.
We will assume . Let be the marginal of on . For an -block partition, let be the distribution of when . We may simulate samples from by sampling from and constructing . The algorithm is as follows:
1. Sample a grid of length large enough that Lemma 2.7 guarantees with probability . Construct induced by . We may construct the function in time by sorting, and once constructed it takes time to compute.
2. Run the algorithm of Theorem 5.1 on a sample of points from to learn the class ; that algorithm returns a function . Output .
Assume that step 1 succeeds, which occurs with probability at least . By condition 2, the algorithm in step 2 is guaranteed to work on samples where the marginal of is ; let be the distribution of when and is obtained by sampling . The algorithm of step 2 will succeed on ; we argue that it will also succeed on the actual input since these distributions are close. Observe that for samples and , if then each have the distribution of in . Therefore
It is a standard fact that for product distributions ; using this fact,
We will argue that step 2 succeeds with probability ; i.e. that with probability ,
Let be the event that success occurs given sample . The algorithm samples but the success guarantee of step 2 is for ; this step will still succeed with probability :
Assume that each step succeeds, which occurs with probability at least . By Proposition 2.5, our condition 1, and the fact that , we have for any that
The output of the algorithm is , which for any satisfies:
Then , as desired. ∎
5.2 Fourier Analysis on
We will show how to construct a spanning set to satisfy condition 2 of the general learning algorithm, by using noise sensitivity and the Walsh basis. For any , let uniformly at random and draw as follows: with probability , and is uniform in with probability . The noise sensitivity of functions is defined as:
Note that we include in the subscript to indicate the size of the domain. We will use to obtain upper bounds on the spanning set, and we will obtain bounds on by relating it to , for which many bounds are known. For a function , two vectors , and , define as the vector with if and if . Then define as the function . The next lemma is essentially the same as the reduction in [BOW10].
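For concreteness, a Monte Carlo sketch of the noise-sensitivity quantity defined above, assuming the standard noise model in which each coordinate of y independently equals the corresponding coordinate of x with probability 1 − ρ and is redrawn uniformly otherwise:

```python
import random

def noise_sensitivity(f, m, d, rho, trials=20000, rng=random.Random(0)):
    """Monte Carlo estimate of NS_rho(f) for f defined on {0, ..., m-1}^d."""
    disagree = 0
    for _ in range(trials):
        x = [rng.randrange(m) for _ in range(d)]
        y = [xi if rng.random() > rho else rng.randrange(m) for xi in x]
        disagree += f(tuple(x)) != f(tuple(y))
    return disagree / trials
```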
Lemma 5.3.
Let be a set of functions such that for any linear transformation , the function , and let be any -block partition. Let where is the -noise sensitivity on domain . Then .
Proof.
Let and let be uniform among the vectors where . Now let uniformly at random and let be drawn such that with probability and otherwise. Then is uniform in , because is or with equal probability and the marginals of are uniform. with probability and is otherwise uniform in . Let and . Let be an independent copy of and observe that . Now observe that has the same distribution as , so:
For any , define the function by . This function maps to a set and can be obtained by translation and scaling, which is a linear transformation. Therefore , so we are guaranteed that . So
We define the Walsh basis, an orthonormal basis of functions ; see e.g. [BRY14]. Suppose for some positive integer . For two functions , define the inner product . The Walsh functions can be defined by and for , where is the bit in the binary representation of , where the first bit is the least significant (see e.g. [BRY14]). It is easy to verify that for all , if then , and when . For define and note that for any set ,
(1) |
since each bit is uniform in , while . For ,
where is the symmetric difference, so this is 0 when (i.e. ) and 1 otherwise; therefore is an orthonormal basis of functions . Identify each with the number where . Now for every define as where is the Walsh function determined by the identity between subsets of and the integer . It is easy to verify that the set is an orthonormal basis. Every function has a unique representation where .
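A small sketch of the one-dimensional Walsh functions (written here in Hadamard ordering, ψ_j(x) = (−1)^popcount(j AND x); the text indexes them by subsets of bit positions, which yields the same family), together with a check of orthonormality under the uniform distribution:

```python
import numpy as np

def walsh_matrix(n):
    """Walsh functions on {0, ..., n-1} for n a power of two; row j is psi_j."""
    idx = np.arange(n)
    overlap = np.bitwise_and(idx[:, None], idx[None, :])
    popcount = np.vectorize(lambda v: bin(v).count("1"))(overlap)
    return (-1.0) ** popcount

n = 8
W = walsh_matrix(n)
# Orthonormality under the uniform distribution on n points: E[psi_i * psi_j] = [i == j].
assert np.allclose(W @ W.T / n, np.eye(n))
# On the d-dimensional grid one uses products psi_{j_1}(x_1) * ... * psi_{j_d}(x_d).
```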
For each and define as the distribution over where for each , with probability and is uniform in with probability . Define and . For any ,
If then ; otherwise, so . Therefore
where is the number of nonzero entries of ; so . Since is a linear operator,
Note that for , is the usual notion of stability in the analysis of Boolean functions.
Proposition 5.4.
For any and any , .
Proof.
For with probability , so in the definition of noise sensitivity, is distributed as where , i.e. ; or, . By rearranging, we arrive at the conclusions. ∎
Proposition 5.5.
For any and , .
Proof.
Lemma 5.6.
Let be a set of functions where is a power of 2, let such that , and let . Then there is a set of functions of size , such that for any , there is a function where .
Proof.
Let . Then by Proposition 5.5,
Therefore is a linear combination of functions where at most values are not 0. There are at most such products since for each non-constant we choose and . We may take to be the set of these products. ∎
5.3 Application
To apply Lemma 5.2, we must give bounds on and the noise sensitivity:
Lemma 5.7.
Fix any . Then .
Proof.
Any halfspace is unate, meaning there is a vector such that the function , where , is monotone. For any -block partition defined by values for , we can define as the block partition obtained by taking . The number of non-constant blocks of in is the same as that of in , but is monotone. Thus the bound on for monotone functions holds, so by Lemma 7.1, and Proposition 4.2 gives .
The bounds on noise sensitivity follow from known results for the hypercube.
Proposition 5.8.
Let and let . Let . Then .
Proof.
For drawn from as in the definition of noise sensitivity, the union bound gives . ∎
Lemma 5.9.
Let . For any -block partition , and any .
Proof.
Theorem 1.1.2 (restated).
Proof.
Here we prove only the continuous distribution case. The finite case is proved in Theorem A.18.
For , we have by Lemma 5.7, so condition 1 of Lemma 5.2 holds. Lemma 5.9 guarantees that for any . Setting so that , we obtain via Lemma 5.6 a set of size satisfying condition 2 of Lemma 5.2. Then for we apply Lemma 5.2 to get an algorithm with sample complexity
The other time complexity follows from Lemma 4.5. ∎
6 Learning Polynomial Threshold Functions
A degree- polynomial threshold function (PTF) is a function such that there is a degree- polynomial in variables where ; for example, a halfspace is a degree-1 PTF. Let be the set of degree- PTFs. As for halfspaces, we will give bounds on the noise sensitivity and block boundary size and apply the general learning algorithm. The bound on noise sensitivity will follow from known results on the hypercube [DHK+10], but the bound on the block boundary size is much more difficult to obtain than for halfspaces.
6.1 Block-Boundary Size of PTFs
A theorem of Warren [War68] gives a bound on the number of connected components of after removing the 0-set of a degree- polynomial. This bound (Theorem 6.7 below) will be our main tool.
A set is connected (here we use the fact that connected and path-connected are equivalent in ) if for every there is a continuous function such that . A subset where is a connected component of if it is connected and there is no connected set such that . Write for the number of connected components of .
A function is called a restriction and we will denote . The affine subspace induced by is and has affine dimension .
For , let be the set of affine subspaces obtained by choosing a restriction with when and when , so in particular .
Let be a measurable function and define the boundary of as:
This is equivalent to the boundary of the set of -valued points, and the boundary of any set is closed. Each measurable induces a partition of into some number of connected parts. For a set of functions and , write
For each let be the set of hyperplanes of the form for some . An -arrangement for is any set where and such that all are distinct. Write for the set of -arrangements for . Define
and observe that .
Proposition 6.1.
For any set of functions and any , .
Proof.
Consider any -block partition, which is obtained by choosing values for each and defining by assigning each the block where is the unique value such that , where we define . Suppose is a non-constant block, so there are such that . Let and let . Consider the set . Since there exists some small open ball around such that ; and since , is a set of positive Lebesgue measure. Since has Lebesgue measure 0, we conclude that has positive measure, so there is with . Likewise, there is with . Therefore must belong to separate components, so is partitioned into at least 2 components. Meanwhile, each constant block is partitioned into at least 1 component. So
The following fact must be well-known, but not to us:
Proposition 6.2.
Let be an affine subspace of , let , and for let . Then
Proof.
Let be the graph with its vertices being the components of and the edges being the pairs where are components of such that , , and there exists a component of such that . Clearly ; we will show that is the number of connected components of and that . This suffices to prove the statement. We will use the following claim:
Claim 6.3.
Let be a connected component of . If and there is a path such that and either or , then .
Proof of claim.
Assume without loss of generality that for all . Let . Since is open we can define for each a ball such that . Consider the sets , which are open, and note that for all , if then since .
Assume for contradiction that there is such that is not connected to or ; then let be the infimum of all such , which must satisfy since . For any , if and is connected to or then since it must be that is connected as well; therefore is not connected to either or . But since is continuous, there is such that , so cannot be the infimum, which is a contradiction. Therefore every has connected to either or . If , this is a contradiction since there must then be such that but are connected to respectively. Therefore . ∎
We first show that is the number of graph-connected components of . Suppose that vertices are connected, so there is a path in . Then there are connected components of such that ; so , which implies that is connected. Therefore we may define as mapping each connected component of to the unique component of with . is surjective since for each component of there is some vertex (a component of ) such that : this is itself if . For some connected component of , let be vertices of , and let ; since is connected, there is a path such that . Let be the multiset of vertices such that ; let be the index such that , and order the sequence such that if then (note that we may have for some if visits the same set more than once). Then for any , since the path visits both and is contained in . If are on opposite sides of , then there is an edge in ; otherwise, the above claim implies . Thus there is a path to in ; this proves that is injective, so is indeed the number of graph-connected components of .
Now let , so there is a component of such that . For any there is a continuous path where . There must be some such that , otherwise the path is a path in and . Since , so , there is some component containing . We will map the edge to an arbitrary such , for any choice , and show that it is injective. Suppose that map to the same . Without loss of generality we may assume that lie on the same side of and that . Then there are , and such that . Then since is a connected component, we may take to be the least such values that , and connected by a path in to obtain a path such that , and . Then by the above claim, ; the same holds for , so the mapping is injective. This completes the proof of the proposition. ∎
Proposition 6.4.
For any set of measurable functions and any ,
Proof.
Let and , such that the number of connected components in , where and , is . For let
so that and . Since is an -arrangement, . For , since is obtained from by adding a hyperplane , Proposition 6.2 implies
because is an -arrangement. Iterating times, once for each added hyperplane, we arrive at
Lemma 6.5.
For any set of measurable functions and any ,
Proof.
Write for convenience. We will show by induction the more general statement that for any ,
where we define . In the base case, note that . Assume the statement holds for when . Then by Proposition 6.4,
Lemma 6.6.
Let be a set of functions such that for some , . Then for any and , .
Proof.
It is a standard fact that for degree- polynomials, , and a special case of a theorem of Warren gives a bound for larger dimensions:
Theorem 6.7 ([War68]).
Polynomial threshold functions of degree have .
Since and , for , Lemma 6.6 gives us:
Corollary 6.8.
For , .
6.2 Application
As was the case for halfspaces, our reduction of noise sensitivity on to requires that the class is invariant under linear transformations:
Proposition 6.9.
For any and full-rank linear transformation , .
Proof.
Let where is a degree- polynomial and let be a term of , where and such that . Let be the row of . Then
where is some polynomial of degree at most . Then where ranges over with , and each has degree at most , so is a degree- polynomial. ∎
The last ingredient we need is the following bound of Diakonikolas et al. on the noise sensitivity:
Theorem 6.10 ([DHK+10]).
Let be a degree- PTF. Then for any ,
Putting everything together, we obtain a bound that is polynomial in for any fixed , and which matches the result of Diakonikolas et al. [DHK+10] for the uniform distribution over .
Theorem 1.1.3 (restated).
Proof.
We prove the continuous case here; the finite case is proved in Theorem A.18.
Let , so that by Corollary 6.8, condition 1 of Lemma 5.2 is satisfied. Due to Proposition 6.9, we may apply Theorem 6.10 and Lemma 5.3 to conclude that for all
In the first case, setting we get , so by Lemma 5.6 we get a set of functions of size satisfying condition 2 of Lemma 5.2. For , Lemma 5.2 implies an algorithm with sample size
In the second case, setting , we again obtain and get an algorithm with sample size
The final result is obtained by applying Lemma 4.5. ∎
7 Learning & Testing -Alternating Functions
A function on a partial order is -alternating if for every chain there is such that . Monotone functions are examples of 1-alternating functions. We consider -alternating functions on with the usual partial order: for we say when for each and . Once again, we must bound the block boundary size, which has already been done by Canonne et al. [CGG+19]. We include the proof because their work does not share our definition of block boundary size, and because we use it in our short proof of the monotonicity tester in Section 3.1.
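The defining quantity is easy to state operationally: f is k-alternating exactly when the number of value changes along every chain is at most k, as in this small sketch.

```python
def alternations_along_chain(f, chain):
    """Count the value changes of f along a chain x(1) < x(2) < ... < x(m)."""
    values = [f(x) for x in chain]
    return sum(a != b for a, b in zip(values, values[1:]))
```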
Lemma 7.1 ([CGG+19]).
The -block boundary size of -alternating functions is at most .
Proof.
Let be -alternating, let be any block partition and let be any chain in . Suppose that there are indices such that is not constant on . Then there is a set of points such that and for each . But since , also, which contradicts the fact that is -alternating. Then every chain in has at most non-constant blocks, and we may partition into at most chains by taking the diagonals where is any vector satisfying and ranges over all integers. ∎
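The diagonal decomposition at the end of this proof can be made concrete as follows (a small illustrative sketch; the grid size and the convention of anchoring chains at points touching the lower boundary are ours): the grid splits into disjoint increasing diagonal chains, and there are at most n·r^{n-1} of them.

from itertools import product

def diagonal_chains(r, n):
    """Partition the grid {1,...,r}^n into increasing diagonal chains
    x0, x0 + (1,...,1), x0 + 2*(1,...,1), ..., one chain per point x0
    with some coordinate equal to 1 (the lower boundary of the grid)."""
    chains = []
    for x0 in product(range(1, r + 1), repeat=n):
        if min(x0) == 1:
            chain, t = [], 0
            while max(x0) + t <= r:
                chain.append(tuple(c + t for c in x0))
                t += 1
            chains.append(chain)
    return chains

r, n = 4, 2
chains = diagonal_chains(r, n)
points = [p for c in chains for p in c]
assert set(points) == set(product(range(1, r + 1), repeat=n))  # the chains cover the grid
assert len(points) == r ** n                                   # ...and are pairwise disjoint
assert len(chains) <= n * r ** (n - 1)                         # at most n * r^(n-1) chains
print(len(chains))  # 7 for this grid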
Canonne et al. also use a noise sensitivity bound to obtain a spanning set ; we quote their result.
Lemma 7.2 ([CGG+19]).
There is a set of functions , with size
such that for any -alternating function , there is that is a linear combination of functions in and .
See 1.1.5
Proof.
We prove the continuous case; for the finite case see Theorem A.18.
Let and let be any -block partition. By Lemma 7.1, the first condition of Lemma 5.2 is satisfied. Now let and consider . For any chain in , it must be since every satisfy when ; then alternates at most times on the chain and, since , is also -alternating. Therefore the set of functions given by Lemma 7.2 satisfies condition 2 of Lemma 5.2, and we have . Applying Lemma 5.2 gives an algorithm with sample complexity
The other sample complexity follows from Lemma 4.5. ∎
See 1.1.5
Proof.
The following argument is for the continuous case, but generalizes to the finite case using the definitions in Appendix A.
Let be the class of -alternating functions. Suppose there is a set , known to the algorithm, that is a -cover. Then, taking a set of independent random samples from and using Hoeffding’s inequality,
Then the tester accepts if and rejects otherwise; we now prove that this is correct with high probability. Assume that the above estimation is accurate, which occurs with probability at least . If then . Then for minimizing ,
so is accepted. Now suppose that is accepted, so . Then
What remains is to show how the tester constructs such a cover .
Consider the learning algorithm of Theorem 1.1.5 with error parameter , so . Let be the grid constructed by that algorithm and let be the induced -block partition. We may assume that with probability at least , ; suppose that this event occurs. The learner then takes additional samples to learn the class with domain . For every the learner has positive probability of outputting a function with (where is chosen from ). Let be the set of possible outputs of the learner; then is a -cover for . Construct a set by choosing, for each , the nearest function with respect to the distribution . Then is a -cover, since for any function , if is the nearest output of the learner and is nearest , then by the triangle inequality has distance at most to with respect to . Finally, construct a set by taking each function such that and (note that there exists such that since is -alternating when is -alternating). Then is a -cover since for any , when minimizes ,
Now we bound the size of . Since there are samples and each sample is in , labelled by , there are at most possible sample sequences, so at most outputs of the learner (after constructing ), so . Therefore, after constructing , the tester may construct and run the above estimation procedure, with . ∎
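To make the estimation step of this tester concrete, the following sketch (with placeholder functions, domain, and thresholds of our own choosing, rather than the cover constructed above) estimates the distance from the input function to every member of a cover on a common sample and accepts exactly when the smallest estimate falls below the acceptance threshold; Hoeffding's inequality makes every estimate accurate once the sample size is logarithmic in the size of the cover.

import random

def estimate_distances(f, cover, samples):
    """Empirical distance from f to each function in the cover, on a common sample."""
    m = len(samples)
    return [sum(f(x) != g(x) for x in samples) / m for g in cover]

def tolerant_test(f, cover, sample_oracle, m, threshold):
    """Accept iff the smallest estimated distance to the cover is below the threshold.
    With m = O(log(|cover|) / eps^2) samples, Hoeffding's inequality and a union bound
    make every estimate accurate to within eps with high probability."""
    samples = [sample_oracle() for _ in range(m)]
    return min(estimate_distances(f, cover, samples)) <= threshold

# Toy usage over [0,1]^2 with placeholder functions (illustrative only).
cover = [lambda x: 1, lambda x: -1, lambda x: 1 if x[0] > 0.5 else -1]
f = lambda x: 1 if x[0] > 0.48 else -1
oracle = lambda: (random.random(), random.random())
print(tolerant_test(f, cover, oracle, m=2000, threshold=0.1))  # True: f is close to the cover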
Appendix A Discrete Distributions
We will say that a distribution over is finite if it is a distribution over a finite set . In this section, we extend downsampling to work for finite product distributions: distributions such that all are finite. As mentioned in the introduction, our algorithms have the advantage that they do not need to know in advance whether the distribution is continuous or finite, and if it is finite they do not need to know the support. This is in contrast to the algorithms of Blais et al. [BOW10], which work for arbitrary finite product distributions but must know the support (since they learn a function under the "one-out-of- encoding"). Our algorithms have superior time complexity for large supports.
We begin with an example of a pathological set of functions that illustrates some of the difficulties in the generalization.
Example A.1.
The Dirichlet function is the function that takes value on all rational numbers and value on all irrational numbers. We will define the Dirichlet class of functions as the set of all functions such that on all with at least 1 irrational coordinate , and is arbitrary for any with all rational coordinates. Since the Lebesgue measure of the set of rational numbers is 0, in any continuous product distribution, any function in the Dirichlet class satisfies ; therefore learning this class is trivial in any continuous product distribution since we may output the constant function. And for this class since no block contains a set of positive measure containing -valued points. On the other hand, if is a finitely supported product distribution, then it may be the case that it is supported only on points with all rational coordinates. In that case, the Dirichlet class of functions is the set of all functions on the support, which is impossible to learn when the size of the support is unknown (since the number of samples will depend on the support size). It is apparent that our former definition of no longer suffices to bound the complexity of algorithms when we allow finitely supported distributions.
Another difficulty arises for finitely supported distributions with small support: for example, the hypercube . Consider what happens when we attempt to sample a uniform grid, as in the first step of the algorithms above. We will sample many points such that and many points such that . Essentially, the algorithm takes a small domain and constructs the larger domain , which is antithetical to the downsampling method. A similar situation would occur in large domains where some coordinates have exceptionally large probability densities and are sampled many times. Our algorithm must be able to handle such cases, so we must redefine the grid sampling step and block partitions to handle this situation. To do so, we introduce augmented samples: for every sample point we will append a uniformly random value in .
A.1 Augmented Samples & Constructing Uniform Partitions
For augmented points , where , we will define a total order by saying if , or and . Define interval . For convenience, when and we will write . If is an augmented vector (i.e. each coordinate is an augmented point), we will write ; and when is a set of augmented points, we will write .
Definition A.2 (Augmented Block Partition).
An augmented -block partition of is a pair of functions and obtained as follows. For each let such that and define . For each define the interval and a point such that . The function is defined by setting to be the unique vector such that for each . Observe that is a set of augmented points in and that it is possible for two augmented points to satisfy while . The function is defined by setting ; note that this is a non-augmented point.
Definition A.3 (Block Functions and Coarse Functions).
For a function we will define the functions as and for each we will define as . Unlike in the continuous setting, depends on an additional variable , which is necessary because a single point may be augmented differently to get different values. For a distribution over define the augmented distribution over as the distribution of when and is uniform in . For an augmented -block partition we define the distribution over as the distribution of for .
Definition A.4 (Augmented Random Grid).
An augmented random grid of length is obtained by sampling augmented points and for each defining to be the smallest coordinate in dimension by the augmented partial order. For any that divides we define an augmented -block partition depending on by defining for each the points , (and ), so that the intervals are for and . We set the points defining to be for some . This is the augmented -block partition induced by .
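A minimal sketch of this construction (helper names, block-index conventions, and the toy distribution are ours): every coordinate of every sample is augmented with an independent uniform tie-breaker, the augmented coordinates are sorted within each dimension, and every r-th order statistic becomes a block boundary, yielding blocks of roughly equal empirical mass even when the raw coordinates have heavy ties.

import bisect
import random

def augment(x):
    """Append an independent uniform tie-breaker to every coordinate of a sample."""
    return tuple((xi, random.random()) for xi in x)

def induced_boundaries(augmented_points, r):
    """For each dimension, sort the augmented coordinates (tuples compare by value,
    then tie-breaker) and keep every r-th order statistic as a block boundary."""
    n = len(augmented_points[0])
    boundaries = []
    for i in range(n):
        coords = sorted(p[i] for p in augmented_points)
        boundaries.append(coords[r - 1::r])
    return boundaries

def block_of(aug_point, boundaries):
    """Block index of an augmented point in each dimension (last block is unbounded)."""
    return tuple(min(bisect.bisect_left(b, a), len(b) - 1)
                 for a, b in zip(aug_point, boundaries))

# Toy usage: a product distribution on {0,1} x {0,...,4} with heavy ties, m = 4 blocks, r = 3.
m, r = 4, 3
samples = [(random.randint(0, 1), random.randint(0, 4)) for _ in range(r * m)]
B = induced_boundaries([augment(x) for x in samples], r)
print(block_of(augment((1, 3)), B))  # block indices, each in {0,...,m-1}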
Definition A.5 (Augmented Block Boundary).
For an augmented block partition , a distribution over , and a function , we say is non-constant on an augmented block if there are sets such that and for all . For a function and a number , we define the augmented -block boundary size as the maximum number of blocks on which is non-constant with respect to a distribution , where the maximum is taken over all augmented -block partitions.
The augmented block partitions satisfy analogous properties to the previously-defined block partitions:
Lemma A.6.
Let be an augmented random grid with length sampled from a finite product distribution , and let be the augmented -block partition induced by . Then
Proof.
Let be a finitely supported distribution with support . Let . Let be the distribution of where and uniformly at random; note that is a continuous distribution over . For , observe that iff . Therefore,
By replacing each finitely supported with we obtain a continuous product distribution such that is the same distribution as , so by Lemma 2.7 the conclusion holds. ∎
Proposition A.7.
For any continuous or finite product distribution over , any augmented -block partition constructed from a random grid , and any ,
Proof.
The result for continuous product distributions holds by Proposition 2.5 and the fact that , so assume is a finite product distribution, and let .
Suppose that for sampled from , , and let . Then for , and . The points clearly have positive measure because is finite, so this is a non-constant block. Then
A.2 Augmented Block-Boundary Size and Noise Sensitivity
To obtain learning algorithms for -alternating functions, functions of convex sets, functions of halfspaces, and degree- PTFs, we must provide a bound on .
For a finite set and a function , we will call a function a blowup of (with respect to ) if there exists an open ball where . We will call a set of functions inflatable if for every finite product set and , there exists that is a blowup of with respect to .
Proposition A.8.
Let be an inflatable set of functions. Then .
Proof.
Let be an augmented -block partition defined by parameters for , and write . Let be any finite product set, and let ; we will bound the number of non-constant blocks. We construct a (non-augmented) -block partition as follows. Let be sufficiently small that:
• , the rectangle is contained within ,
• ; and
• unless .
Such an exists since the number of constraints is finite. Then define by the parameters . Note that . Let and suppose that is non-constant on , so there are such that , where , and where we define . Consider .
Since , (where we define ) and . Therefore and so . Also, is in the rectangle so there is a ball around , containing only points with value . Likewise, there is a ball around inside containing only points with value . Since these balls must intersect on sets with positive measure (in the product of Lebesgue measures), is non-constant on , which proves the statement. ∎
Lemma A.9.
The set of -alternating functions is inflatable.
Proof.
Let and let be a finite set, where we use the standard ordering on . Let . We claim that the set has a unique maximum. Suppose otherwise, so there are that are each maximal. Let . Then and but , a contradiction. For every , write for this unique maximum. Let be small enough that ; such a value exists since is finite. Define the map and , we define , and argue that this satisfies the required properties. It is clear by our choice of that . Since is order-preserving (i.e. if then ), is -alternating. Now consider the ball . Since for all , we have , and so . ∎
Lemma A.10.
The set of indicator functions of convex sets is inflatable.
Proof.
Let indicate a closed convex set, let be this set, and write (this minimum exists since is closed). Let be any finite set and let . Consider , and let be the indicator function for this set. Then for all . Finally, is closed, and it is convex since for any two points , it is well-known that the function is convex for . ∎
Lemma A.11.
The set of halfspaces is inflatable.
Proof.
It suffices to show that for any finite set (not necessarily a product set) and any halfspace , there is a halfspace such that for all but for all ; this is a commonly-used fact. Let . It must be the case that . Then satisfies the condition. ∎
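One standard way to realize this fact is to shift the threshold by half of the smallest slack among the points on the negative side, as in the following sketch (the halfspace, the point set, and the convention for boundary points are arbitrary choices of ours):

import numpy as np

def strict_halfspace(w, t, points):
    """Shift the threshold so that no point of the finite set lies on the boundary,
    while every label sign(w.x - t) on the set is preserved (boundary points keep
    the non-negative side). A standard perturbation realizing the fact cited above."""
    w = np.asarray(w, dtype=float)
    vals = np.array([w @ np.asarray(x, dtype=float) for x in points])
    below = vals[vals < t]
    delta = (t - below.max()) / 2 if below.size else 1.0
    return w, t - delta

# Toy usage (an arbitrary halfspace and point set, for illustration only).
S = [(0, 0), (1, 1), (2, 0), (1, 2)]
w, t = (1.0, 1.0), 2.0
w2, t2 = strict_halfspace(w, t, S)
old = [1 if np.dot(w, x) - t >= 0 else -1 for x in S]
new = [1 if np.dot(w2, x) - t2 > 0 else -1 for x in S]
print(old == new, all(abs(np.dot(w2, x) - t2) > 0 for x in S))  # True True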
Lemma A.12.
The set of degree- PTFs is inflatable.
Proof.
This follows from the above proof for halfspaces, since for any finite we may map to its vector of monomials, so that any polynomial is a linear threshold function in the space of monomials. ∎
For a set of functions and an augmented -block partition , we will write for the set of block functions ; note that this is not necessarily the same set of functions as defined for continuous distributions. We must show that the same learning algorithms used above for learning will work also for . For the brute-force learning algorithm of Lemma 4.5, this is trivial, but for the regression algorithm in Lemma 5.2 we must show that there exists a set such that each is close to a function . For functions of halfspaces and PTFs, we used the bound on noise sensitivity, Lemma 5.3, to construct a set of functions suitable for the regression algorithm. The proof for that lemma works without modification for augmented block partitions, so we have the following:
Lemma A.13.
Let be any family of functions such that, for any linear transformation , if then . Let be any augmented -block partition. Let . Then .
A.3 Rounding the Output
After learning a function , we must output a function . In the continuous setting, we simply output . In the finite setting, we cannot simply output since requires an additional argument . For example, if the distribution is a finitely supported distribution on , then for each point there may be roughly points for which for an appropriate choice of , and these points may have different values in . The algorithm must choose a single value to output for each . We do so by approximating the function and then rounding it via the next lemma.
Lemma A.14.
Fix a domain , let , and let . There is an algorithm that, given query access to and sample access to any distribution over , uses at most samples and queries and with probability at least produces a value such that
Proof.
Let be the set of functions for any choice of . We will show that the VC dimension of is 1. Suppose for contradiction that two points are shattered by , so in particular there are such that and , while and . Without loss of generality, suppose . But then , which is a contradiction. Therefore, by standard VC dimension arguments ([SSBD14], Theorem 6.8), using samples and choosing to minimize the error on the samples, with probability at least we will obtain a value such that
where minimizes the latter quantity among all values . Since the algorithm can restrict itself to those values for which for some in the sample, the value minimizing the error on the sample can be computed in time polynomial in the number of samples. Next, we show that the minimizer satisfies the desired properties. Suppose that uniformly at random. For any ,
Therefore
so we can conclude the lemma with
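The empirical-risk-minimization step of this proof can be sketched with a one-parameter threshold family in the spirit of the argument above (our stand-in function g, the labels, and the noise level are placeholders; the exact family is the one defined in the proof). Since the empirical loss only changes when the threshold crosses a sampled value, it suffices to evaluate candidates taken from the sample itself:

import random

def erm_threshold(g, labeled_sample):
    """Choose the threshold v minimizing the empirical error of x -> sign(g(x) - v).
    It suffices to try the values g(x) appearing in the sample, since the empirical
    loss only changes when v crosses one of them."""
    candidates = sorted({g(x) for x, _ in labeled_sample})
    def empirical_error(v):
        return sum((1 if g(x) >= v else -1) != y for x, y in labeled_sample)
    return min(candidates, key=empirical_error)

# Toy usage: g stands in for the learned real-valued function; labels come from a
# noisy threshold at 0.6 (all of this is illustrative, not the paper's instance).
g = lambda x: x
sample = []
for _ in range(500):
    x = random.random()
    label = 1 if x >= 0.6 else -1
    if random.random() < 0.1:
        label = -label
    sample.append((x, label))
print(round(erm_threshold(g, sample), 2))  # typically close to 0.6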
Lemma A.15.
Let be an augmented -block partition. There is an algorithm which, given , query access to a function and sample access to a distribution over , outputs a function such that, with probability ,
uses at most samples and queries, and runs in time polynomial in the number of samples.
Proof.
For , write . For any ,
so
The algorithm will construct a function and then learn a suitable parameter for rounding to .
First the algorithm samples a set of size and constructs the function . Fix and suppose satisfies . Then there must be such that , otherwise for all so for all . There can be at most values of for which , so by the union bound and the Hoeffding bound,
Therefore with probability at least , satisfies for all . Suppose this occurs. Then
Now we apply Lemma A.14 with error , using samples and polynomial time, to output a value such that with probability ,
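The first stage of this proof, estimating the average of the target on every block from samples, can be sketched as follows (the block indexing, the toy target, and the sample size are ours). Blocks that receive enough samples, i.e. blocks of non-negligible mass, obtain accurate estimates by Hoeffding's inequality and a union bound, and the remaining blocks contribute little to the overall error.

from collections import defaultdict
import random

def estimate_block_means(labeled_sample, block_of):
    """Empirical mean of f on each block. Every block receiving sufficiently many
    samples gets an estimate accurate to within the error parameter, with high
    probability, by Hoeffding's inequality and a union bound over the blocks."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x, fx in labeled_sample:
        b = block_of(x)
        sums[b] += fx
        counts[b] += 1
    return {b: sums[b] / counts[b] for b in counts}

# Toy usage: 4 blocks on [0,1], f is +1 on the right half and -1 on the left half.
block_of = lambda x: min(int(4 * x), 3)
f = lambda x: 1 if x >= 0.5 else -1
sample = [(x, f(x)) for x in (random.random() for _ in range(1000))]
print(estimate_block_means(sample, block_of))  # approximately {0: -1.0, 1: -1.0, 2: 1.0, 3: 1.0}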
A.4 Algorithms for Finite Distributions
We now state improved versions of our monotonicity tester and two general learning algorithms: the “brute force” learning algorithm (Lemma 4.5) and the “polynomial regression” algorithm (Lemma 5.2). Using these algorithms we obtain the same complexity bounds as for continuous product distributions, but the algorithms can now handle finite product distributions as well.
See 1.1.1
Proof.
The proof of Section 1.1.1 goes through as before, with replaced by , replaced with defined as the infimal element such that , and defined as the supremal element such that , and redefined appropriately. ∎
Next, we move on to the learning algorithms:
Lemma A.16.
Let be any set of functions , let , and suppose satisfies . Then there is an agnostic learning algorithm for that uses samples and time and works for any distribution over whose marginal on is a finite or continuous product distribution.
Proof.
On input distribution :
1. Sample a grid of size large enough that Lemma A.6 guarantees with probability , where is the induced augmented -block partition.
2. Agnostically learn a function with error and success probability using samples from .
3. Run the algorithm of Lemma A.14 using samples to obtain and output .
The proof proceeds as in the case for continuous distributions (Lemma 4.5). Assume all steps succeed, which occurs with probability at least . After step 3 we obtain such that, for any ,
By Lemma A.14 and Proposition A.7, the output satisfies,
We now state the general learning algorithm from Lemma 5.2, improved to allow finite product distributions.
Lemma A.17.
Let and let be a set of measurable functions that satisfy:
1. There is some such that ;
2. There is a set of functions satisfying: such that for .
Let be the sample complexity of the algorithm in Theorem 5.1. Then there is an agnostic learning algorithm for on finite and continuous product distributions over , that uses samples and runs in time polynomial in the sample size.
Proof.
Let be the augmented distribution, where is obtained by drawing and augmenting it with a uniformly random . We will assume . Let be the marginal of on . For an augmented -block partition, let be the distribution of when . We may simulate samples from by sampling from and constructing . The algorithm is as follows:
1. Sample a grid of length ; by Lemma A.6, this ensures that with probability . Construct induced by .
2. Run the algorithm of Theorem 5.1 on a sample of points from ; that algorithm returns a function .
3. Run the algorithm of Lemma A.14 using samples to obtain and output .
The proof proceeds as in the case for continuous distributions (Lemma 5.2). Assume all steps succeed, which occurs with probability at least . After step 3 we obtain such that, for any ,
By Lemma A.14 and Proposition A.7, the output satisfies,
Theorem A.18.
Appendix B Glossary
The notation means that for some constant , .
The natural ordering on the set or is the partial order where for , iff , and .
Property Testing.
For a set of distributions over and a set of functions , a distribution-free property testing algorithm for under is a randomized algorithm that is given a parameter . It has access to the input probability distribution via a sample oracle, which returns an independent sample from . It has access to the input function via a query oracle, which given query returns the value . A two-sided distribution-free testing algorithm must satisfy:
1. If then the algorithm accepts with probability at least ;
2. If is -far from with respect to then the algorithm rejects with probability at least .
A one-sided algorithm must accept with probability 1 when . An -tolerant tester must accept with probability at least when such that and reject when is -far from with respect to .
In the query model, the queries to the query oracle can be arbitrary. In the sample model, the tester queries a point if and only if was obtained from the sample oracle.
A tester in the query model is adaptive if it makes its choice of query based on the answers to previous queries. It is non-adaptive if it chooses its full set of queries in advance, before obtaining any of the answers.
The sample complexity of an algorithm is the number of samples requested from the sample oracle. The query complexity of an algorithm is the number of queries made to the query oracle.
Learning.
Let be a set of functions and let be a set of probability distributions over . A learning algorithm for under (in the non-agnostic, or realizable, model) is a randomized algorithm that receives a parameter and has sample access to an input function . Sample access means that the algorithm may request an independent random example where is sampled from some input distribution . The algorithm is required to output a function that, with probability , satisfies the condition
In the agnostic setting, the algorithm instead has sample access to an input distribution over whose marginal over is in (i.e. it receives samples of the form ). The algorithm is required to output a function that, with probability , satisfies the following condition: ,
A proper learning algorithm is one whose output must also satisfy ; otherwise it is improper.
VC Dimension.
For a set of functions , a set is shattered by if for all functions there is a function such that . The VC dimension of is the size of the largest set that is shattered by .
References
- [AC06] Nir Ailon and Bernard Chazelle. Information theory in property testing and monotonicity testing in higher dimension. Information and Computation, 204(11):1704–1717, 2006.
- [BB20] Eric Blais and Abhinav Bommireddi. On testing and robust characterizations of convexity. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2020). Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2020.
- [BBBY12] Maria-Florina Balcan, Eric Blais, Avrim Blum, and Liu Yang. Active property testing. In Proceedings of the IEEE 53rd Annual Symposium on Foundations of Computer Science (FOCS), pages 21–30, 2012.
- [BCO+15] Eric Blais, Clément Canonne, Igor Oliveira, Rocco Servedio, and Li-Yang Tan. Learning circuits with few negations. Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, page 512, 2015.
- [BCS18] Hadley Black, Deeparnab Chakrabarty, and C Seshadhri. A monotonicity tester for boolean functions over the hypergrid . In Proceedings of the 29th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2133–2151, 2018.
- [BCS20] Hadley Black, Deeparnab Chakrabarty, and C Seshadhri. Domain reduction for monotonicity testing: A tester for boolean functions in -dimensions. In Proceedings of the 31st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1975–1994, 2020.
- [BEHW89] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K Warmuth. Learnability and the vapnik-chervonenkis dimension. Journal of the ACM (JACM), 36(4):929–965, 1989.
- [BFPJH21] Eric Blais, Renato Ferreira Pinto Jr, and Nathaniel Harms. VC dimension and distribution-free sample-based testing. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 504–517, 2021.
- [BMR16] Piotr Berman, Meiram Murzabulatov, and Sofya Raskhodnikova. Tolerant testers of image properties. In 43rd International Colloquium on Automata, Languages, and Programming (ICALP 2016). Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2016.
- [BMR19] Piotr Berman, Meiram Murzabulatov, and Sofya Raskhodnikova. Testing convexity of figures under the uniform distribution. Random Structures & Algorithms, 54(3):413–443, 2019.
- [BOW10] Eric Blais, Ryan O’Donnell, and Karl Wimmer. Polynomial regression under arbitrary product distributions. Machine Learning, 80(2-3):273–294, 2010.
- [BR92] Avrim L Blum and Ronald L Rivest. Training a 3-node neural network is NP-complete. Neural Networks, 5(1):117–127, 1992.
- [BRY14] Eric Blais, Sofya Raskhodnikova, and Grigory Yaroslavtsev. Lower bounds for testing properties of functions over hypergrid domains. In Proceedings of the IEEE 29th Conference on Computational Complexity (CCC), pages 309–320, 2014.
- [BY19] Eric Blais and Yuichi Yoshida. A characterization of constant-sample testable properties. Random Structures & Algorithms, 55(1):73–88, 2019.
- [CDJS17] Deeparnab Chakrabarty, Kashyap Dixit, Madhav Jha, and C Seshadhri. Property testing on product distributions: Optimal testers for bounded derivative properties. ACM Transactions on Algorithms (TALG), 13(2):1–30, 2017.
- [CDS20] Xue Chen, Anindya De, and Rocco A Servedio. Testing noisy linear functions for sparsity. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 610–623, 2020.
- [CFSS17] Xi Chen, Adam Freilich, Rocco A Servedio, and Timothy Sun. Sample-based high-dimensional convexity testing. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2017). Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2017.
- [CGG+19] Clément L. Canonne, Elena Grigorescu, Siyao Guo, Akash Kumar, and Karl Wimmer. Testing -monotonicity: The rise and fall of boolean functions. Theory of Computing, 15(1):1–55, 2019.
- [CMK19] Mónika Csikós, Nabil H Mustafa, and Andrey Kupavskii. Tight lower bounds on the VC-dimension of geometric set systems. Journal of Machine Learning Research, 20(81):1–8, 2019.
- [CS16] Deeparnab Chakrabarty and Comandur Seshadhri. An monotonicity tester for boolean functions over the hypercube. SIAM Journal on Computing, 45(2):461–472, 2016.
- [DHK+10] Ilias Diakonikolas, Prahladh Harsha, Adam Klivans, Raghu Meka, Prasad Raghavendra, Rocco A Servedio, and Li-Yang Tan. Bounding the average sensitivity and noise sensitivity of polynomial threshold functions. In Proceedings of the 42nd ACM Symposium on Theory of Computing (STOC), pages 533–542, 2010.
- [DKS18] Ilias Diakonikolas, Daniel M. Kane, and Alistair Stewart. Learning geometric concepts with nasty noise. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 1061–1073, 2018.
- [DMN19] Anindya De, Elchanan Mossel, and Joe Neeman. Is your function low-dimensional? In Conference on Learning Theory, pages 979–993, 2019.
- [FR10] Shahar Fattal and Dana Ron. Approximating the distance to monotonicity in high dimensions. ACM Transactions on Algorithms (TALG), 6(3):1–37, 2010.
- [FY20] Noah Fleming and Yuichi Yoshida. Distribution-free testing of linear functions on . In Proceedings of the 11th Innovations in Theoretical Computer Science Conference (ITCS). Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2020.
- [GR16] Oded Goldreich and Dana Ron. On sample-based testers. ACM Transactions on Computation Theory (TOCT), 8(2):1–54, 2016.
- [Har19] Nathaniel Harms. Testing halfspaces over rotation-invariant distributions. In Proceedings of the 30th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 694–713, 2019.
- [HK07] Shirley Halevy and Eyal Kushilevitz. Distribution-free property-testing. SIAM Journal on Computing, 37(4):1107–1138, 2007.
- [KKMS08] Adam T. Kalai, Adam R. Klivans, Yishay Mansour, and Rocco A Servedio. Agnostically learning halfspaces. SIAM Journal on Computing, 37(6):1777–1805, 2008.
- [KMS18] Subhash Khot, Dor Minzer, and Muli Safra. On monotonicity testing and boolean isoperimetric-type theorems. SIAM Journal on Computing, 47(6):2238–2276, 2018.
- [KOS04] Adam R Klivans, Ryan O’Donnell, and Rocco A Servedio. Learning intersections and thresholds of halfspaces. Journal of Computer and System Sciences, 68(4):808–840, 2004.
- [KOS08] Adam R Klivans, Ryan O’Donnell, and Rocco A Servedio. Learning geometric concepts via gaussian surface area. In Proceedings of the 49th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 541–550, 2008.
- [KR00] Michael Kearns and Dana Ron. Testing problems with sublearning sample complexity. Journal of Computer and System Sciences, 61(3):428–456, 2000.
- [O’D14] Ryan O’Donnell. Analysis of Boolean Functions. Cambridge University Press, 2014.
- [Ras03] Sofya Raskhodnikova. Approximate testing of visual properties. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 370–381. Springer, 2003.
- [RR21] Dana Ron and Asaf Rosin. Optimal distribution-free sample-based testing of subsequence-freeness. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 337–356. SIAM, 2021.
- [RV04] Luis Rademacher and Santosh Vempala. Testing geometric convexity. In International Conference on Foundations of Software Technology and Theoretical Computer Science, pages 469–480. Springer, 2004.
- [SSBD14] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.
- [Vem10a] Santosh Vempala. Learning convex concepts from gaussian distributions with PCA. In Proceedings of the IEEE 51st Annual Symposium on Foundations of Computer Science (FOCS), pages 124–130, 2010.
- [Vem10b] Santosh Vempala. A random-sampling-based algorithm for learning intersections of halfspaces. Journal of the ACM, 57(6):1–14, 2010.
- [War68] Hugh E. Warren. Lower bounds for approximation by nonlinear manifolds. Transactions of the American Mathematical Society, 133(1):167–178, 1968.