A unified framework for learning with nonlinear model classes from arbitrary linear samples
Abstract
This work considers the fundamental problem of learning an unknown object from training data using a given model class. We introduce a unified framework that allows for objects in arbitrary Hilbert spaces, general types of (random) linear measurements as training data and general types of nonlinear model classes. We establish a series of learning guarantees for this framework. These guarantees provide explicit relations between the amount of training data and properties of the model class to ensure near-best generalization bounds. In doing so, we also introduce and develop the key notion of the variation of a model class with respect to a distribution of sampling operators. To exhibit the versatility of this framework, we show that it can accommodate many different types of well-known problems of interest. We present examples such as matrix sketching by random sampling, compressed sensing with isotropic vectors, active learning in regression and compressed sensing with generative models. In all cases, we show how known results become straightforward corollaries of our general learning guarantees. For compressed sensing with generative models, we also present a number of generalizations and improvements of recent results. In summary, our work not only introduces a unified way to study learning unknown objects from general types of data, but also establishes a series of general theoretical guarantees which consolidate and improve various known results.
1 Introduction
Learning an unknown object (e.g., a vector, matrix or function) from a finite set of training data is a fundamental problem in applied mathematics and computer science. Typically, in modern settings, one seeks to learn an approximate representation in a nonlinear model class (also known as an approximation space or hypothesis set). It is also common to generate the training data randomly according to some distribution. Of critical importance in this endeavour is the question of learning guarantees. In other words: how much training data suffices to ensure good generalization, and how is this influenced by, firstly, the choice of model class, and secondly, the random process generating the training data?
This question has often been addressed for specific types of training data. For instance, in the case of function regression, the training data consists of pointwise evaluations of some target function, or in the case of computational imaging, the training data may consist of samples of the Fourier or Radon transform of some target image. It is also commonly studied for specific model classes, e.g., (linear or nonlinear) polynomial spaces [5] or (nonlinear) spaces of sparse vectors [43, 63], spaces of low-rank matrices or tensors [31, 63], spaces defined by generative models [18], single- [44] or multi-layer neural networks [8, 2, 3], Fourier sparse functions [41], (sparse) random feature models [12, 47] and many more.
In this paper, we introduce a unified framework for learning with nonlinear model classes from arbitrary linear samples. The main features of this framework are:
- (i) the target object is an element of a separable Hilbert space;
- (ii) each measurement is taken randomly and independently according to a random linear operator (which may differ from measurement to measurement);
- (iii) the measurements may be scalar- or vector-valued, or, in general, may take values in a Hilbert space;
- (iv) the measurements can be completely multimodal, i.e., generated by different distributions of random linear operators, as long as a certain nondegeneracy condition holds;
- (v) the model class can be linear or nonlinear, but should be contained in (or covered by) a union of finite-dimensional subspaces;
- (vi) the resulting learning guarantees are given in terms of the variation of the model class with respect to the sampling distributions.
We present a series of examples to highlight the generality of this framework. In various cases, our resulting learning guarantees either include or improve known guarantees in the literature.
1.1 The framework
The setup we consider in this paper is the following.
- • is a separable Hilbert space with inner product and norm .
- • is a semi-normed vector space, termed the object space, with semi-norm .
- • is the unknown target object.
- • For each , is a Hilbert space with inner product and norm, termed the th measurement space.
- • For each , is a distribution of bounded linear operators . We term the th sampling operator. We also write for the space of bounded linear operators, so that .
- • We assume that the family of distributions is nondegenerate. Namely, there exist such that
(1.1)
- • is a set, termed the approximation space or model class. Our aim is to learn from its measurements with an element .
Now let , , be independent realizations from the distributions . Then we consider noisy measurements
(1.2)
In other words, we consider training data of the form
Our aim is to recover from this data using an element of . We do this via the empirical least squares. That is, we let
(1.3)
Later, we also allow for to be an approximate minimizer, to model the more practical scenario where the minimization problem is solved inexactly. Note that in this work we consider the agnostic learning setting, where and the noise can be adversarial (but small in norm).
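For concreteness, the following Python sketch illustrates the estimator (1.3) in a simple finite-dimensional instance of this setup. The choices made here (Euclidean object space, sampling operators given by independent random linear functionals, and a linear subspace as model class) are illustrative assumptions, not requirements of the framework.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 200, 50, 10                                # ambient dimension, samples, subspace dimension

Phi = np.linalg.qr(rng.standard_normal((n, k)))[0]   # orthonormal basis of the (linear) model class
x_true = Phi @ rng.standard_normal(k) + 0.01 * rng.standard_normal(n)   # target, close to the model class

A = rng.standard_normal((m, n)) / np.sqrt(m)         # rows = independent random sampling functionals
e = 0.01 * rng.standard_normal(m)                    # small noise
y = A @ x_true + e                                   # noisy measurements, as in (1.2)

# Empirical least squares (1.3): fit the best element of the model class to the data
c_hat, *_ = np.linalg.lstsq(A @ Phi, y, rcond=None)
x_hat = Phi @ c_hat
print("relative error:", np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))
```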
As we explain in §2, this framework contains many problems of interest. In particular, scalar-valued and vector-valued function regression from i.i.d. samples, matrix sketching for large least-squares problems, compressed sensing with isotropic vectors and compressed sensing with subsampled unitary matrices are all instances of this framework.
As we also discuss therein, it is often the case in practice that the training data is sampled from the same distribution, i.e., . However, having different distributions allows us to consider multimodal sampling problems, where data is obtained from different random processes. In §2 we also discuss examples where this arises.
1.2 Contributions
Besides the general framework described above, our main contributions are a series of learning guarantees that relate the amount of training data to properties of the sampling distributions . We now present simplified versions of our main results, covering the case where . The full case is presented in §3.
A key quantity in this analysis is the so-called variation of a (nonlinear) set with respect to a distribution of bounded linear operators in . We define this as the smallest constant such that
(1.4)
See also Definition 3.2. As we discuss in Example 3.2, the variation is effectively a generalization of the notion of coherence in classical compressed sensing (see, e.g., [22] or [9, Chpt. 5]). In the following, we also write for the unit sphere of in .
Theorem 1.1 (Main result A; slightly simplified).
Consider the setup of §1.1 with , let and suppose that the difference set is such that
- (i) is a cone (i.e., for any and ), and
- (ii) , where each is a subspace of dimension at most .
Suppose that, for some , either
- (a) , or
- (b) .
Let and . Then
(1.5)
for any minimizer of (1.3) with noisy measurements (1.2). Here .
Theorem 1.2 (Main result B; slightly simplified).
Consider the setup of §1.1 with and let be such that assumptions (i) and (ii) of Theorem 1.1 hold and also that
- (iii) , where is a finite set of size .
Then the same result holds if conditions (a) and (b) of Theorem 1.1 are replaced by
- (a) , where
or
- (b) , where
respectively.
These theorems are simplified versions of our main results, Theorems 3.6 and 3.7. In the full results, besides allowing for different sampling distributions , we also consider inexact minimizers of the empirical least-squares fit (1.3). Moreover, we provide a refinement of condition (a) in which the linear dependence on (or in the case of Theorem 1.2) is replaced by (respectively, ), at the expense of a more complicated log factor.
Having presented our main results in §3, we then describe how this framework unifies, generalizes and, in some cases, improves known results. In particular, we recover known bounds for compressed sensing with isotropic vectors and for matrix sketching by random sampling (see §3.4). We then discuss the application of this framework to two different problems.
- (a) Active learning in regression (§4). Here, we extend and improve recent work [6] by applying our main results to derive a random (importance) sampling strategy based on a certain Christoffel function (also known as the leverage score function). We show that this strategy, termed Christoffel sampling, leads to near-optimal sample complexity bounds in a number of important settings.
- (b) Compressed sensing with generative models (§5). Here we extend and improve recent work [15] by obtaining learning guarantees for general types of measurements in the case where is the range of a ReLU generative model. We also provide theoretical analysis for an active learning strategy first introduced in [6] and subsequently considered in [16].
1.3 Related work
Our sampling framework is inspired by previous work in compressed sensing, notably compressed sensing with isotropic vectors [22], which was later extended in [9, Chpt. 12] to compressed sensing with jointly isotropic vectors. This considers the specific case where , and is the set of -sparse vectors, i.e., the target object is an (approximately) sparse vector and the sampling operators are linear functionals. Note that isotropy would correspond to the case in (1.1). We relax this to allow . Within the compressed sensing literature, there are a number of works that allow for non-scalar valued measurements. See [17, 20] for an instance of vector-valued measurements (‘block sampling’) and [61] as well as [4, 33] for Hilbert-valued measurements.
Recovery guarantees in compressed sensing are generally derived for specific model classes, such as sparse vectors or various generalizations (e.g., joint sparse vectors, block sparse vectors, sparse in levels vectors, and so forth [10, 14, 19, 30, 38, 60]). Guarantees for general model classes are usually only presented in the case of (sub)Gaussian random measurements (see e.g., [14, 34]). These, while mathematically elegant, are typically not useful in practice. Our framework provides a unified set of recovery guarantees for very general types of measurements. It contains subsampled unitary matrices (a well-known measurement modality in compressed sensing, with practical relevance – see Example 2) as a special case, but also many others, including non-scalar valued measurements.
The proofs of our main results adopt similar ideas to those used in classical compressed sensing (see, e.g., [43, Chpt. 12] or [9, Chpt. 13]). In particular, they rely on Dudley’s inequality, Maurey’s lemma and a version of Talagrand’s theorem. Our main innovations involve the significant generalization of these arguments to, firstly, much broader classes of sampling problems (i.e., not just linear functionals of finite vectors), and secondly, to arbitrary model classes, rather than classes of (structured) sparse vectors. Our results broaden and strengthen recent results in the active learning context found in [6] and [39]. In particular, [6] assumes a union-of-subspaces model and then uses matrix Chernoff-type estimates. This is similar (although less general in terms of the type of measurements allowed) to our condition (b) in Theorem 1.1. Besides the generalization in terms of the measurements, our main effort is to derive the stronger condition (a) in this theorem, as well as Theorem 1.2. The work [39] makes very few assumptions on , then uses Hoeffding’s inequality and covering number arguments. As noted in [6, §A.3] the trade-off for this high level of generality is weaker theoretical guarantees.
Active learning is a large topic. For function regression in linear spaces, a now well-known solution involves sampling according to the Christoffel function [28] or leverage score function [12, 24, 25, 32, 41, 44, 50] of the subspace. A number of these works have extended this to certain nonlinear spaces, such as Fourier sparse functions [41] and single-neuron models [44]. In our previous work [6], we extended this to more general nonlinear spaces and other types of measurements. As noted above, this work improves the theoretical guarantees in [6]. In particular, we show the usefulness of Christoffel sampling for more general types of nonlinear model classes.
Compressed sensing with generative models was introduced in [18], and has proved useful in image reconstruction tasks such as Magnetic Resonance Imaging (MRI) (see [48] and references therein). Initial learning guarantees for generative models involved (sub)Gaussian random measurements [18]. Guarantees for randomly subsampled unitary matrices were established in [15] for ReLU generative networks, and later extended in [16] for the nonuniformly subsampled case. As noted above, our work refines and further generalizes this analysis to more general types of measurements.
1.4 Outline
The remainder of this paper proceeds as follows. First, in §2 we present a series of examples that showcase the generality of the framework introduced in §1.1. In the next section, §3, we present our main results. Next we discuss the application to active learning in regression (§4) and compressed sensing with generative models (§5). Finally, in §6 we give the proofs of our main theorems.
2 Examples
In this section, we present a series of different sampling problems that can be cast into this unified framework. We will return to these examples later having stated our main learning guarantees in the next section.
- Example 2.1 (Function regression from i.i.d. samples)
Let be a domain with a measure and consider the problem of learning an unknown function from pointwise samples , , drawn i.i.d. from some probability measure on . To cast this problem in the framework of §1.1, we make the (mild) assumption that is absolutely continuous with respect to with Radon–Nikodym derivative that is positive almost everywhere. Let , , with the Euclidean inner product and be defined by if for . Observe that nondegeneracy (1.1) holds with in this case, since
In this case, given an approximation space , the least-squares problem (1.3) becomes the (nonlinear) weighted least-squares fit
(2.1) Note that it is common to consider the case in such problems, in which case and (2.1) is an unweighted least-squares fit. However, this more general setup allows one to consider the active learning setting, where the sampling measure is chosen judiciously in terms of to improve the learning performance of . We discuss this in §4.
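The following sketch illustrates the weighted least-squares fit (2.1) under illustrative assumptions: the domain is [-1, 1], the underlying measure is the uniform probability measure, the sampling measure is the arcsine (Chebyshev) measure, and the model class is a linear space of Legendre polynomials. The weights are taken to be the reciprocal of the Radon–Nikodym derivative at the sample points.

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.exp(x) * np.sin(3 * x)       # unknown target function (for illustration only)
m, k = 200, 10                                # number of samples, dimension of the model class

# Draw samples from the arcsine (Chebyshev) measure on [-1, 1] via inverse transform sampling
u = rng.uniform(1e-9, 1 - 1e-9, m)
x = np.cos(np.pi * u)
nu = 2.0 / (np.pi * np.sqrt(1.0 - x**2))      # Radon-Nikodym derivative w.r.t. the uniform measure

y = f(x) + 0.01 * rng.standard_normal(m)      # noisy pointwise samples

# Weighted least-squares fit (2.1): weights are the reciprocal of the sampling density
V = np.polynomial.legendre.legvander(x, k - 1)
w = 1.0 / nu
c = np.linalg.lstsq(np.sqrt(w)[:, None] * V, np.sqrt(w) * y, rcond=None)[0]

x_test = np.linspace(-0.99, 0.99, 1000)
print("max test error:", np.max(np.abs(np.polynomial.legendre.legval(x_test, c) - f(x_test))))
```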
- Example 2.2 (Matrix sketching for large least-squares problems)
Let , , be a given matrix and a target vector. In many applications, it may be infeasible (due to computational constraints) to find a solution to the ‘full’ least-squares problem
Therefore, one aims to instead find a sketching matrix (a matrix with one nonzero per row) such that a minimizer
(2.2)
satisfies for some small . A particularly effective way to do this involves constructing a random sketch. Formally, let be a discrete probability distribution on , i.e., and . We also assume that for all . Then we draw integers and set
Therefore, consists of rows of scaled by the probabilities . See [51, 64] for further discussions on matrix sketching.
To cast into the above framework we let and both equipped with the Euclidean norm. Note that bounded linear operators are equivalent to column vectors (via the relation ). Hence we define to be a distribution of vectors in with if
Observe that nondegeneracy (1.1) holds with . Finally, we consider the model class
(2.3)
Then we readily see that (1.3) (with ) is equivalent to (2.2) in the sense that is a solution of (1.3) if and only if is a solution of (2.2). In particular, is precisely the -norm error of the estimator .
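A minimal sketch of this sketching procedure is given below, using uniform sampling probabilities for simplicity; the row scaling by 1/sqrt(m p_i) is one standard convention. A better choice of probabilities (leverage score sampling) is discussed in §3.4.

```python
import numpy as np

rng = np.random.default_rng(2)
N, n, m = 5000, 20, 200                       # tall matrix, number of columns, sketch size

A = rng.standard_normal((N, n))
b = A @ rng.standard_normal(n) + 0.1 * rng.standard_normal(N)

p = np.full(N, 1.0 / N)                       # sampling probabilities (uniform here)
idx = rng.choice(N, size=m, p=p)              # draw m row indices i.i.d. from p

scale = 1.0 / np.sqrt(m * p[idx])             # scale each selected row by 1/sqrt(m p_i)
x_sketch = np.linalg.lstsq(scale[:, None] * A[idx], scale * b[idx], rcond=None)[0]
x_full = np.linalg.lstsq(A, b, rcond=None)[0]
print("relative difference:", np.linalg.norm(x_sketch - x_full) / np.linalg.norm(x_full))
```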
- Example 2.3 (Function regression with vector-valued measurements)
Regression problems in various applications call for learning vector- as opposed to scalar-valued functions. These are readily incorporated into this framework by modifying Example 2.
Hilbert-valued functions. Let be a Hilbert space and consider the Hilbert-valued function . The problem of learning Hilbert-valued functions arises in several applications, including parametric Differential Equations (DEs) in computational Uncertainty Quantification (UQ). Here, represents the parameters-to-solution map of a DE involving parameters , which, for each , yields the Hilbert-valued output being the solution of the DE at those parameter values (see [5, 27, 33] and references therein for discussion).
To cast this problem into the general framework, we modify Example 2 by letting be the Bochner space of strongly -measurable functions , be the space of continuous, -valued functions, and be the distribution of bounded linear operators with if for and . Nondegeneracy (1.1) holds as before with .
Continuous-in-time sampling. Consider a function depending on a spatial variable and a time variable . In some applications, such as seismology, one assumes continuous (or dense) sampling in time with discrete (i.e., sparse) sampling in space. Thus, each measurement takes the form for fixed .
It transpires that we can cast this problem into a Hilbert-valued function approximation problem by letting so that . Hence, continuous-in-time sampling is also covered by the general framework.
Gradient-augmented measurements. Another extension of Example 2 involves gradient-augmented measurements. This can be considered in either the scalar- or Hilbert-valued setting. In this problem, each measurement takes the form , i.e., we sample both the function and its gradient with respect to the variable at each sample point. This problem arises in a number of applications, including parametric DEs and UQ [11, 45, 55, 54, 53], seismology and Physics-Informed Neural Networks (PINNs) for PDEs [42, 65] and deep learning [29]. Assuming the Hilbert-valued setup of Example 2, we can cast this into the general framework by letting be the Sobolev space , , with the Euclidean inner product and defining if for and . It is readily checked that nondegeneracy (1.1) holds with , as before.
In the next four examples, we show how this framework includes as special cases various general sampling models from the compressed sensing literature.
- Example 2.4 (Compressed sensing with isotropic vectors)
Classical compressed sensing concerns learning a sparse approximation to an unknown vector from linear measurements. A well-known model involves sampling with isotropic vectors [22] (see also [9, Chpt. 11]). We can cast this in the above framework as follows. Let equipped with Euclidean inner product and . As in Example 2, we consider to be a distribution of vectors in that are isotropic, i.e.,
(2.4) where is the identity matrix. Note that (1.1) holds with in this case. Moreover, the measurements (1.2) have the form
where . In matrix-vector notation, we can rewrite this as
(2.5) and (the division by is a convention: due to (2.4), it ensures that ).
As discussed in [22], this model includes not only the well-known case of subgaussian random matrices, in which case is a distribution of subgaussian random vectors, but also many other common sampling models used in signal and image processing applications. It is also a generalization of the concept of random sampling with bounded orthonormal systems (see, e.g., [43, Chpt. 12]). Moreover, if we slightly relax (2.4) to
so that (1.1) holds with the same values of and , then it also generalizes the bounded Riesz systems studied in [21].
- Example 2.5 (Compressed sensing with subsampled unitary matrices)
A particular case of interest within the previous example is the class of randomly subsampled unitary matrices (see also [9, Defn. 5.6 and Examp. 11.5]). Let be unitary, i.e., . For example, may be the matrix of the Discrete Fourier Transform (DFT) in a Fourier sensing problem. Let , where is the th canonical basis vector, and define the (discrete) distribution of vectors by if
One readily checks that (2.4) holds in this case. Hence this family is isotropic. Note that the corresponding measurement matrix (2.5) consists of rows of – hence the term ‘subsampled unitary matrix’.
This setup can be straightforwardly extended to the case where the rows of are sampled with different probabilities, termed a nonuniformly subsampled unitary matrix. Let be a discrete probability distribution on , i.e., and . We also assume that for all . Then we modify the distribution so that if
It is readily checked that (2.4) also holds in this case, making this family once more isotropic. Note that uniform subsampling is restored simply by setting , .
Nonuniform subsampling is important in various applications. For instance, in applications that involve Fourier measurements (e.g., MRI, NMR, Helium Atom Scattering and radio interferometry) it is usually desirable to sample rows corresponding to low frequencies more densely than rows corresponding to high frequencies. See, e.g., [9] and references therein.
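The following sketch constructs a nonuniformly subsampled unitary (here, DFT) measurement matrix of the form (2.5). The particular variable density and the row normalization by 1/sqrt(m p_j) are illustrative assumptions consistent with the conventions described above.

```python
import numpy as np

rng = np.random.default_rng(3)
N, m = 256, 64

U = np.fft.fft(np.eye(N)) / np.sqrt(N)        # unitary DFT matrix

freqs = np.abs(np.fft.fftfreq(N) * N)         # integer frequencies
p = 1.0 / (1.0 + freqs)                       # sample low frequencies more densely
p /= p.sum()

idx = rng.choice(N, size=m, p=p)              # m row indices drawn i.i.d. from p
A = U[idx] / np.sqrt(m * p[idx])[:, None]     # nonuniformly subsampled unitary matrix, cf. (2.5)

# Isotropy (2.4) holds by construction: E[(u_j/sqrt(p_j))(u_j/sqrt(p_j))^*] = U^* U = I,
# since the probabilities cancel and U is unitary.
x = np.zeros(N)
x[rng.choice(N, 8, replace=False)] = rng.standard_normal(8)   # an 8-sparse test vector
y = A @ x + 0.01 * rng.standard_normal(m)     # noisy measurements; a sparse solver would recover x
print(A.shape, y.shape)
```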
- Example 2.6 (Sampling without replacement)
The previous example considers random sampling of the rows of with replacement. In particular, repeats are possible. In practice, it is usually desirable to avoid repeats. One way to do this is to use Bernoulli selectors (see [9, §11.4.3] or [15]). Here, the decision of whether to include a row of or not is based on the outcome of a (biased) coin toss, i.e., a realization of a Bernoulli random variable.
Specifically, let satisfy and . Then, for each , define the distribution by if
Notice that the collection is nondegenerate. Indeed,
Hence, this model falls within the general setup. Later, in §5.2 and Corollary 5.4, we will see that this model admits essentially the same learning guarantees as the model considered in Example 2. Observe that the measurement matrix (2.5) in this case is , where the th row is proportional to the th row of if the th coin toss returns a heads, and zero otherwise. In particular, it is equivalent to a matrix, where is the number of heads. Notice that, unlike in the previous model, the number of measurements is a random variable with .
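A minimal sketch of Bernoulli sampling is given below. The inclusion probabilities are taken to be uniform, p_j = m/N, and kept rows are scaled by 1/sqrt(p_j); this scaling is an assumption chosen so that nondegeneracy holds as described above.

```python
import numpy as np

rng = np.random.default_rng(4)
N, m = 256, 64
U = np.fft.fft(np.eye(N)) / np.sqrt(N)

p = np.full(N, m / N)                 # inclusion probabilities; expected number of rows kept is m
heads = rng.random(N) < p             # one biased coin toss per row of U
A = (U / np.sqrt(p)[:, None])[heads]  # keep row j (scaled by 1/sqrt(p_j)) on heads, drop it otherwise
print("rows kept:", A.shape[0], "expected:", p.sum())
```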
The reader will notice that in the previous examples, the distributions were all equal. We now present several examples that justify considering different distributions.
- Example 2.7 (Half-half and multilevel sampling)
Recall that Example 2 arises in various applications involving Fourier measurements. In these applications, it is often desirable to fully sample the low-frequency regime [49] (see also [9]). The low frequencies of a signal or image contain much of its energy. Therefore, any purely random sampling scheme can miss important information with nonzero probability.
A simple way to avoid this problem is a half-half sampling scheme [10, 58]. Here the first measurements are set equal to the lowest frequencies (i.e., the first samples are deterministic) and the remaining samples are obtained by randomly sampling the remaining frequencies according to some distribution. This can be formulated within our framework as follows. Let be unitary (typically, is the DFT matrix, but this is not needed for the current discussion) and , as before. Then, for , we define the distribution by if
In other words, takes value with probability one. Now let be a discrete probability distribution on with for all . Then we define with if
We readily see that the family is nondegenerate with . Notice that the first rows of the measurement matrix obtained from this procedure are proportional to the first rows of , and its remaining rows are randomly sampled according to from the last rows of .
Half-half sampling schemes have been used in various imaging applications [10, 49, 23, 57, 58]. Note that such a scheme subdivides the indices into two levels: fully sampled and randomly subsampled. In practice, it is often desirable to have further control over the sampling, by subdividing into more than two levels. This is known as multilevel random subsampling [10] (see also [9, Chpts. 4 & 11]). It can be formulated within this framework in a similar way.
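The following sketch assembles a half-half measurement matrix: the first rows of a unitary matrix are included deterministically and the remainder are drawn at random. The precise normalizations of the deterministic and random rows are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
N, m0, m1 = 256, 32, 48               # m0 deterministic low frequencies, m1 random high frequencies
U = np.fft.fft(np.eye(N)) / np.sqrt(N)

A_low = U[:m0]                        # first m0 rows included deterministically (constant distributions)

rest = np.arange(m0, N)               # remaining rows sampled randomly according to pi
pi = np.ones(rest.size) / rest.size
idx = rng.choice(rest, size=m1, p=pi)
A_high = U[idx] / np.sqrt(m1 * pi[idx - m0])[:, None]

A = np.vstack([A_low, A_high])        # the half-half measurement matrix
print(A.shape)
```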
- Example 2.8 (Multimodal data)
Another benefit of allowing for different distributions is that it can model multimodal data. Consider the general setup of §1.1. We now assume that there are different types of data, with the th type generating samples via a distribution , , of bounded linear operators in . Let and define by
Thus, the first samples are generated by , the next samples by , and so forth. Notice that nondegeneracy (1.1) is now equivalent to the condition
Multimodal data was previously considered in [6], which, as noted in §1, is a special case of this work. As observed in [6], multimodal data arises in various applications, such as multi-sensor imaging systems [26] and PINNs for PDEs [46, 56]. This includes the important case of parallel MRI [52], which is used widely in medical practice. Another application involves an extension of the gradient-augmented learning problem in Example 2 where, due to cost or other constraints, one can only afford to measure gradients in addition to function values at some fraction of the total samples. See [6, §B.7].
3 Learning guarantees
In this section, we present our main results.
3.1 Additional notation
We first require additional notation. Let be the Hilbert space defined as the direct sum of the , i.e.,
Next, let be the distribution of bounded linear operators in induced by the family . In other words, if
where the are independent with for each (we include the factor for convenience). With this notation, observe that nondegeneracy (1.1) is equivalent to
(3.1)
and the least-squares problem (1.3) is equivalent to
(3.2)
where . For convenience, we also write .
In our main results, we consider approximate minimizers of (1.3) or, equivalently, (3.2). We now define the following.
Definition 3.1 (-minimizer).
Given we say that is a -minimizer of (1.3) if
Finally, we require several additional concepts. Given a set , we now let
(3.3)
Further, we say that is a cone if whenever and . Notice that in this case the set (3.3) is given by
i.e., it is the unit sphere of with respect to the norm on .
3.2 Variation
We now formally introduce the concept of variation, which is crucial to our analysis.
Definition 3.2 (Variation with respect to a distribution).
Let and be a Hilbert space. Consider a distribution of bounded linear operators in . The variation of with respect to is the smallest constant such that
(3.4)
If no such constant exists then we write
Note that we specify as the smallest constant such that (3.4) holds to ensure that it is well defined. However, in all our results can be any constant such that (3.4) holds. This is relevant in our examples, since it means we only need to derive upper bounds for the variation.
Recall (3.3). We will generally consider the variation of the set rather than in what follows. Note that in this case, is given by
In the following example, we show how the variation generalizes the classical notion of coherence in compressed sensing. Later, in the context of active learning in §4, we show that it also relates to the Christoffel function (also known as the leverage score function).
- Example 3.3 (Classical compressed sensing and coherence)
Coherence is a well-known concept in compressed sensing (see, e.g., [22]). Consider the setting of Example 2 and let . Classical compressed sensing considers the model class of -sparse vectors
(3.5) Recall that a vector is -sparse if it has at most nonzero entries. In this case, is the smallest constant such that
Define the coherence of the distribution (see [22] or [9, Defn. 11.16]) as the smallest constant such that
Then, by the Cauchy–Schwarz inequality, we have
We deduce that
(3.6) Thus, the variation is precisely the coherence multiplied by the sparsity .
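As a numerical illustration (under assumed notation), the following sketch computes the coherence of the uniformly subsampled unitary model for two choices of unitary matrix, together with the resulting bound on the variation of the set of s-sparse unit vectors via (3.6). The DFT is maximally incoherent, whereas the identity matrix is maximally coherent.

```python
import numpy as np

N, s = 256, 8
U_dft = np.fft.fft(np.eye(N)) / np.sqrt(N)    # DFT: incoherent
U_id = np.eye(N)                              # identity: maximally coherent
p = np.full(N, 1.0 / N)                       # uniform subsampling probabilities

for name, U in [("DFT", U_dft), ("identity", U_id)]:
    a = U / np.sqrt(p)[:, None]               # all possible realizations of the sampling vector
    mu = np.max(np.abs(a)**2)                 # coherence: largest squared entry over all realizations
    print(f"{name}: coherence = {mu:.1f}, variation of s-sparse unit vectors <= s * mu = {s * mu:.1f}")
```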
We now require several additional definitions.
Definition 3.4 (Constant distribution).
A distribution is constant if it takes only a single value with probability one, i.e., for some . Otherwise it is nonconstant.
Definition 3.5 (Variation with respect to a collection).
Let be the distribution of bounded linear operators induced by the family , where each is a distribution of bounded linear operators in , and let
(3.8)
We define the variation of with respect to as
(3.9)
and write if no such constant exists.
The motivation for these definitions is as follows. Often in applications, one has some measurements that should always be included. For example, in a function regression problem, it may be desirable to sample the function at certain locations, based on known behaviour. Similarly, in a Fourier sampling problem, as discussed in Example 2, one may wish to fully sample all low frequencies up to some given maximum.
Mathematically, as demonstrated in Example 2, we can incorporate a deterministic measurement by defining a distribution that takes a single value with probability one. Since these measurements are always included, it is unnecessary to account for them in the definition (3.9): as we will see, the variation is a tool that relates to the number of random measurements needed to ensure recovery. Later, when we consider sampling without replacement (Example 2), we will see that it is, in fact, vital that the variation be defined as a maximum over the nonconstant distributions only. See Corollary 5.4 and Remark 5.2.
3.3 Main results
We now present our main results.
Theorem 3.6.
This result holds under general assumptions on . However, as we discuss in Example 3.4 below, it may yield suboptimal bounds in certain cases, such as compressed sensing with sparse vectors. Fortunately, by making an additional assumption, we can resolve this shortcoming.
Theorem 3.7.
Consider the setup of §1.1 and let be such that assumptions (i) and (ii) of Theorem 3.6 hold and also that
- (iii) , where is a finite set of size .
Suppose that, for some , either
- (a)
- (b) where
or
- (c) , where
Let , and . Then
3.4 Examples
- Example 3.8 (Classical compressed sensing with isotropic vectors)
Consider the classical compressed sensing problem as in Example 2, with as in Example 3.2. Note that in this case the collection , where with as in Example 3.2 and recall that . Observe that in this case and moreover that assumption (i) of Theorem 3.6 holds, since is -sparse for any (in fact, ) whenever is -sparse. Also, assumption (ii) holds, since
Here is the support of . Observe that this is a union of
subspaces of dimension . Example 3.2 derives an expression (3.6) for the variation in terms of and . Using this, we deduce that and the measurement condition in Theorem 3.6(a) reduces to
(one can readily verify that the conditions (b) and (c) give the same result, up to constants). This bound is suboptimal, since it scales quadratically in . Fortunately, this can be overcome by using Theorem 3.7. It is well known that (see, e.g., [9, proof of Lem. 13.29] or [43, proof of Lem. 12.37])
This set has size . In order to apply Theorem 3.7 we estimate the variation
The first term is equal to (recall (3.6)). For the latter, we observe that, for any , we have, for some ,
Hence and therefore
Using this and the previous bounds for and we see that condition (a) of Theorem 3.7 (the same applies for conditions (b) and (c)) reduces to
In other words, the measurement condition scales linearly in , up to the coherence and a polylogarithmic factor in , , and . Note that this bound is similar to standard compressed sensing measurement conditions for isotropic vectors.
- Example 3.9 (Matrix sketching for large least-squares problems)
Consider the setup of Example 2, where for convenience we assume that has full column rank. In this problem, is the linear subspace (2.3) with , so we aim to apply condition (c) of Theorem 3.6 with . We first calculate its variation. Recalling the definition of and in this case and letting be an arbitrary element of , we notice that
Therefore,
where
are the so-called leverage scores of the matrix (note that it is a short argument to show that , where is the th row of ).
Having obtained an explicit expression for the variation, we can now choose a discrete probability distribution to minimize it. It is a short exercise to show that
Therefore, we now set
This is the well-known leverage score sampling technique for matrix sketching [64]. Observe that Theorem 3.6 yields the near-optimal sample complexity bound
which is also well known in the literature.
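A minimal sketch of leverage score sampling is given below: the leverage scores are computed from an orthonormal basis of the column span, the sampling probabilities are taken proportional to them, and the sketched problem (2.2) is then solved.

```python
import numpy as np

rng = np.random.default_rng(6)
N, n, m = 5000, 20, 200
A = rng.standard_normal((N, n)) * np.exp(rng.standard_normal(N))[:, None]   # nonuniform row importance
b = A @ rng.standard_normal(n) + 0.1 * rng.standard_normal(N)

Q, _ = np.linalg.qr(A)                        # orthonormal basis for the column span of A
lev = np.sum(Q**2, axis=1)                    # leverage scores; they sum to n
p = lev / lev.sum()                           # leverage score sampling distribution

idx = rng.choice(N, size=m, p=p)
scale = 1.0 / np.sqrt(m * p[idx])
x_sketch = np.linalg.lstsq(scale[:, None] * A[idx], scale * b[idx], rcond=None)[0]
x_full = np.linalg.lstsq(A, b, rcond=None)[0]
print("relative difference:", np.linalg.norm(x_sketch - x_full) / np.linalg.norm(x_full))
```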
The previous example describes a type of active learning technique for the (discrete) matrix sketching problem. We will discuss related techniques for function regression in §4.
3.5 Further discussion
We now provide some additional comments on our main results. First, notice that the estimator in (1.5) involves a truncation of a -minimizer of (1.3). This is a technical step used to establish the error bound in expectation. See §6.1.5 and Remark 6.2.
The main difference between Theorem 3.6 and Theorem 3.7 is the additional assumption (iii), which states that the unit ball of should be contained in the convex hull of a set that is not too large (since enters logarithmically in the measurement condition) and has small variation (since the variation is taken over a set that includes ). In practice, this assumption allows the dependence of the measurement condition on and to be reduced from essentially in Theorem 3.6 to . As seen in Example 3.4, this can make a significant difference to the resulting measurement condition in certain problems.
4 Application to active learning in regression
This and the next section treat the two main applications of our framework considered in this paper. In this section, we focus on active learning.
Active learning is a type of machine learning problem where one has the flexibility to choose where to sample the ground truth (or oracle) so as to enhance the generalization performance of the learning algorithm. In the standard regression problem, where the ground truth is a function , this means the training data takes the form
where one is free to choose the sample points . Building on Example 2, in this section we first show how our theoretical results lead naturally to an active learning strategy. This extends well-known leverage score sampling [12, 24, 25, 32, 41, 44, 50] to general nonlinear model classes.
4.1 Active learning via Christoffel sampling
Consider Example 2 and let be an arbitrary model class. In this case, the variation of the distribution defined therein is
(4.1)
For convenience, we now define the notation
(4.2)
This is sometimes termed the Christoffel function of the set . Up to a scaling factor, it is the same as the leverage score function of (see [6, §A.2]). Notice that
(4.3)
Recall that the measurement conditions in Theorem 3.6 scale linearly with , where . Therefore, in the active learning context, we seek to choose the sampling distribution – or, equivalently, its density – to minimize this quantity. The following result is immediate.
Proposition 4.1 (Christoffel sampling).
Suppose that is integrable and positive almost everywhere. Then (4.1) is minimized by the choice
(4.4)
In this case, one has the optimal value
(4.5)
This result states that to obtain the best measurement condition over all possible sampling measures one should choose a measure that is proportional to the Christoffel function – an approach termed Christoffel sampling in [6]. Combining this with Theorem 3.6, we deduce the following learning guarantees for Christoffel sampling.
Corollary 4.2 (Learning guarantees for Christoffel sampling).
This corollary describes three different variants of Christoffel sampling, depending on whether one chooses , or as the set over which to define the Christoffel function. Each choice leads to a slightly different measurement condition, but the key point is that they all scale linearly in the integral of the corresponding Christoffel function. We next discuss what this means in practice. We note in passing that it is also possible to develop a version of Corollary 4.2 under the additional assumption (iii) of Theorem 3.7. In this case, the relevant set should also include .
- Remark 4.3 (Improvement over inactive learning)
Suppose that the underlying measure is a probability measure. Inactive learning corresponds to the scenario where , i.e., the sampling distribution is taken equal to the underlying measure. This is sometimes termed Monte Carlo sampling. In this case, we have , and therefore
(4.6) Therefore, the improvement of Christoffel sampling over inactive learning is equivalent to the difference between the mean value of , i.e., (4.5), and its maximal value, i.e., (4.6). Since
Christoffel sampling always reduces the sample complexity bound. Practically, the extent of the improvement corresponds to how ‘spiky’ is over . If is fairly flat, then one expects very little benefit. On the other hand, if has sharp peaks, then one expects a significant improvement. As discussed in [28, 13, 7, 6, 41, 44], there are many cases where significant improvements are possible because of this phenomenon. However, this is not always the case [1].
- Remark 4.4 (Non-pointwise samples and generalized Christoffel functions)
One can extend the above discussion to more general types of sampling problems, beyond just pointwise samples of a scalar target function. This generalization was introduced in [6]. The main difference involves replacing (4.2) with a so-called generalized Christoffel function, which involves the sampling operator. For brevity, we omit the details. Note that a special case of this generalization is considered in §5 in the context of generative models.
4.2 Examples
To gain further insight, we now consider several examples.
- Example 4.5 (Near-optimal sampling for linear subspaces)
Let be a subspace of dimension with an orthonormal basis in the -norm. Notice that in this case. Moreover, it is a short exercise (see, e.g., [6, Lem. E.1]) to show that
In particular, due to orthonormality. Hence, the Christoffel sampling measure (4.4) becomes
and, using condition (c) of Corollary 4.2 with , we deduce the near-optimal measurement condition
This result is well known (see, e.g., [28]).
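The following sketch implements Christoffel sampling for a linear subspace, under illustrative assumptions: the domain is [-1, 1], the underlying measure is uniform, and the subspace is spanned by normalized Legendre polynomials. Samples are drawn from the density (4.4) by rejection sampling and the weighted least-squares fit (2.1) is then computed.

```python
import numpy as np

rng = np.random.default_rng(7)
k, m = 15, 200                                 # subspace dimension, number of samples
f = lambda x: 1.0 / (1.0 + 25 * x**2)          # target function (Runge), for illustration

def basis(x):
    # Legendre polynomials normalized to be orthonormal w.r.t. the uniform probability measure
    return np.polynomial.legendre.legvander(x, k - 1) * np.sqrt(2 * np.arange(k) + 1)

christoffel = lambda x: np.sum(basis(x)**2, axis=1)   # Christoffel function of the subspace
density = lambda x: christoffel(x) / k                # sampling density (4.4); integrates to one

def sample_christoffel(num):
    # Rejection sampling against the uniform proposal; the density is bounded above by k
    xs = []
    while len(xs) < num:
        x = rng.uniform(-1, 1, 4 * num)
        u = rng.uniform(0, k, 4 * num)
        xs.extend(x[u < density(x)])
    return np.array(xs[:num])

x = sample_christoffel(m)
y = f(x) + 0.001 * rng.standard_normal(m)
w = 1.0 / density(x)                                  # weights = reciprocal of the sampling density
B = basis(x)
c = np.linalg.lstsq(np.sqrt(w)[:, None] * B, np.sqrt(w) * y, rcond=None)[0]

xt = np.linspace(-1, 1, 2000)
print("max error:", np.max(np.abs(basis(xt) @ c - f(xt))))
```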
The previous example is highly informative, as it gives a class of problems where the active learning strategy of Christoffel sampling is provably optimal up to a log factor. However, many practical problems of interest involve nonlinear model classes. We now consider the nonlinear case in more detail. We commence with the case of unions of subspaces.
- Example 4.6 (Near-optimal sampling for a ‘small’ union of subspaces)
Suppose that is a union of subspaces of dimension at most . By definition,
Therefore, by a crude bound and the fact that (recall the previous example), we obtain
Hence, applying condition (c) of Corollary 4.2, we obtain the measurement condition
(4.7)
The main purpose of this example is to illustrate a class of problems involving nonlinear model classes for which Christoffel sampling is provably optimal, up to the log term. Indeed, if is small, then the sample complexity bound scales like .
Unfortunately, as we see in the next example (and also in the next section when considering generative models), in problems of interest the number is often not small.
4.3 Application to sparse regression
Let and be an orthonormal system (one can also consider Riesz systems with a few additional difficulties – see [6, § A.4]). Define the model class
(4.8)
We refer to the corresponding regression problem as a sparse regression problem. Notice that, as in Example 3.4, is a union of subspaces. Hence, in this case, the bound (4.7) becomes meaningless.
Moreover, another problem in this case is that even evaluating the Christoffel function is computationally intractable, since it involves a pointwise maximum over subspaces. Thus, it is not possible to implement Christoffel sampling directly.
Fortunately, both issues can be resolved by using the structure of the model class. In what follows, we recap ideas that were previously presented in [7]. Let , where . Then, by the Cauchy–Schwarz inequality and Parseval’s identity,
Using this, we deduce that
(4.9)
The idea now is to sample from a density proportional to . Note that this is usually feasible in practice, since the maximum only involves terms, as opposed to .
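As an illustrative sketch, the surrogate density described above (here assumed to be the pointwise maximum of the squared basis functions, consistent with the Cauchy–Schwarz bound) can be evaluated directly, and its normalizing constant, which enters the sample complexity in Corollary 4.7, can be estimated numerically. Sampling from the normalized density can then be performed, e.g., by rejection sampling as in the earlier sketch.

```python
import numpy as np

N_basis = 60                                  # total number of candidate basis functions
phi = lambda x: (np.polynomial.legendre.legvander(x, N_basis - 1)
                 * np.sqrt(2 * np.arange(N_basis) + 1))   # orthonormal Legendre basis on [-1, 1]

surrogate = lambda x: np.max(phi(x)**2, axis=1)   # pointwise maximum of the squared basis functions

xg = np.linspace(-1, 1, 100001)
Z = np.mean(surrogate(xg))                        # approximates its integral w.r.t. the uniform measure
print("normalizing constant of the surrogate density (approx.):", Z)
```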
Corollary 4.7 (Active learning for sparse regression).
Proof.
As in Example 3.4, notice that is contained in , where
Notice that
Consider the second term. We have
We also know from (4.3) and (4.9) that
Therefore, using the fact that and the definition of , we get
To apply Theorem 3.7 we also need to estimate the term . However, notice from the above derivation that
Therefore, recalling that our main results hold when the variation constant is replaced by an upper bound, we deduce that we may take in this instance. We now apply Theorem 3.7, recalling that , and in this case (see Example 3.4). ∎
This result describes an active learning strategy for sparse regression (previously derived in [7]) that is optimal up to the term , since the measurement condition scales linearly in up to log factors. As discussed in [7], one can numerically estimate this factor. In cases such as sparse polynomial regression, it is either bounded or very mildly growing in . Whether it is possible to derive an active learning strategy that avoids this factor is an open problem.
5 Application to compressed sensing with generative models
Compressed sensing with generative models involves replacing the classical model class in (3.5) with the range of a generative neural network that has been trained on some training data relevant to the learning problem at hand (e.g., brain images in the case of an MRI reconstruction problem). In this section, we demonstrate how our main results can be applied to this problem in the case of ReLU generative neural networks. The setup in this section is based on [15, 16]. We extend and generalize this work in the following ways:
- •
- •
- • We provide error bounds in expectation instead of probability.
5.1 Setup and main result
We consider the discrete setting, where . Following [15], we consider an -layer ReLU neural network given by
(5.1)
where is the ReLU, and , . Notice that this network has no bias terms. As discussed in [15, Rem. 2], the following theory can also be extended to the case of nonzero biases.
For sampling, we consider the general setup of §1.1, i.e., where the measurements are the result of arbitrary linear operators applied to the object . In particular, they may be scalar- or vector-valued.
We next state a general learning guarantee for generative models. For this, we require the following results from [15]. First, if , then the difference set
(5.2)
where each is a polyhedral cone of dimension at most and
See [15, Lem. S2.2 & Rem. S2.3]. Next, as in [15, Defn. 5], we define the piecewise linear expansion of as
(5.3)
Notice that is a union of subspaces of dimension at most .
Corollary 5.1.
Proof.
The set is a cone by construction. As shown above, we have , where the latter is a union of subspaces of dimension at most . We now apply Theorem 3.6, and specifically condition (a), with . ∎
This result shows that the sample complexity depends linearly on the dimension of the latent space of the generative neural network, up to log terms, multiplied by the variations and . In particular, if these variations are small and the network widths we obtain the measurement condition scaling linearly in .
5.2 Recovery guarantees for subsampled unitary matrices
Corollary 5.1 applies to the general measurement model introduced in §1.1. As mentioned, [15] and [16] consider measurements drawn from the rows of a subsampled unitary matrix . As described in Example 2, this is a special case of our general model. (Technically, [15, Defn. 1] considers a randomly subsampled unitary matrix model involving Bernoulli sampling; in other words, it is an instance of Example 2 with , .)
We now consider the setting of [16], i.e., Example 2. The analysis therein relies on the concept of local coherences. Following [16, Defn. 1.4], we say that the local coherences of with respect to a set are the entries of the vector , where
(5.5)
(Note that [15, Defn. 3] uses the notation . We use as is already used in the definition of nondegeneracy.) We now show how the local coherence relates to a special case of the variation (Definition 3.2). Let be the isotropic family of the subsampled unitary matrix model of Example 2. Then is readily seen to be
(5.6)
Using this and Corollary 5.1, we deduce the following.
Corollary 5.2 (Generative models with subsampled unitary matrices).
Consider the setup of Corollary 5.1 with the randomly subsampled unitary matrix model of Example 2. Let be the local coherences of with respect to and be the local coherences of with respect to (see (5.3)). Then the measurement condition (5.4) is implied by
(5.7)
where and . It is also implied by the condition
(5.8)
Proof.
Using this result, we can now derive optimized variable-density sampling strategies based on the local coherences.
Corollary 5.3 (Optimal variable-density sampling for generative models).
These two results are very similar to those in [16]. The second conditions (5.8) and (5.10) are identical to [16, Thms. A2.1 & 2.1], respectively. In the first conditions (5.7) and (5.9) we obtain a small improvement over [16, Thms. A2.1 & 2.1]. Specifically, the measurement condition is primarily influenced by the local coherences taken over , while the local coherences , which are taken over the piecewise linear expansion set , enter only logarithmically into the condition. This has relevance to practice, since the can be empirically estimated more efficiently than the . This was done in [6] for MRI reconstruction using generative models and later in [16]. As shown in these works, this can lead to substantial benefits over uniform random subsampling.
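The following sketch shows one way such an empirical estimate might be carried out for a small untrained ReLU network. It assumes the standard definition of the local coherences as scaled maximal inner products between the rows of the unitary matrix and normalized elements of the set; the network sizes, the number of Monte Carlo draws (which yields an empirical lower estimate of the true maximum) and the choice of probabilities proportional to the squared local coherences are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)
N, k, width = 128, 10, 40                      # signal dimension, latent dimension, hidden width
W1 = rng.standard_normal((width, k)) / np.sqrt(k)
W2 = rng.standard_normal((N, width)) / np.sqrt(width)
G = lambda z: W2 @ np.maximum(W1 @ z, 0.0)     # two-layer ReLU generative network without biases, cf. (5.1)

U = np.fft.fft(np.eye(N)) / np.sqrt(N)         # unitary sampling matrix

# Monte Carlo estimate of the local coherences over normalized differences G(z1) - G(z2)
sigma = np.zeros(N)
for _ in range(2000):
    v = G(rng.standard_normal(k)) - G(rng.standard_normal(k))
    nv = np.linalg.norm(v)
    if nv > 0:
        sigma = np.maximum(sigma, np.sqrt(N) * np.abs(U @ (v / nv)))
print("estimated local coherences: min", sigma.min(), "max", sigma.max())

p = sigma**2 / np.sum(sigma**2)                # variable-density sampling probabilities (one natural choice)
```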
Notice that Corollary 5.3 is a type of active learning result for learning with generative models. It differs from the setup of §4 in that the underlying object is a vector, not a function, and the measurements are not pointwise samples. It does, however, fall into the general active learning framework of [6] (recall Remark 4.1). Notice that the local coherences and are equivalent to the generalized Christoffel functions for this problem. Hence Corollary 5.3 is a type of Christoffel sampling.
The previous results use sampling with replacement. In the next result, we show that the same guarantees can be obtained without replacement by using Bernoulli sampling (Example 2).
Corollary 5.4 (Optimal variable-density Bernoulli sampling for generative models).
Let , where is a ReLU generative neural network as in (5.1) with layers and widths and consider the randomly subsampled unitary matrix model of Example 2. Let and be as in Corollary 5.2 and suppose that either of the following hold.
- (i) Let
where is the unique solution in to the nonlinear equation
and
- (ii) Let
where is the unique solution in to the nonlinear equation
and
Let , and . Then
Proof.
Let be the distribution defined in Example 2 and consider the first scenario (i). Then, observe that is a constant distribution if and only if . Hence
Now observe that the function is nonincreasing in and is bounded above by . Therefore , i.e., . We deduce that
Further, it is a short argument to see that
We now apply Theorem 3.6 with . Recall that the collection involves distributions. Hence condition (a) is implied by the condition
Rearranging now gives the result for scenario (i). The proof for scenario (ii) is similar. ∎
This result shows that sampling without replacement can be achieved under the same sample complexity bounds. This is subject to the standard caveat that is not the number of samples in the Bernoulli sampling model, but rather its expected value.
- Remark 5.5
The proof of this corollary also demonstrates why it is important to define the variation as a maximum over the nonconstant distributions only. The resulting measurement conditions could not be obtained had the variation been defined over all distributions.
5.3 Extension to block sampling
A desirable aspect of the general framework of §1.1 is that we can easily derive similar results for compressed sensing with generative models under a more general ‘block’ sampling model [17, 20]. Here, each measurement corresponds to a block of rows of a unitary matrix , rather than a single row. This model is highly relevant in applications such as MRI, where it is generally impossible to measure individual frequencies (i.e., individual rows). See [9, 52].
We now formalize this setup. Note that this was previously done in [6, §C.1] for the special case where was the 3D Discrete Fourier Transform (DFT) matrix and each block corresponded to a horizontal line of -space values. Here we consider the general setting.
Let be a (potentially overlapping) partition of . Let , and
be the number of partition elements to which the integer belongs. Now let be a discrete probability distribution on and define the linear maps by
where is the vector of zeros of size . Now define the distribution of bounded linear operators in by if
We readily deduce that is nondegenerate with . Indeed,
Now let and consider the measurements (1.2). Then this gives the desired block sampling model. Each measurement corresponds to a block of rows of (up to scaling factors to account for overlaps) from one of the partition elements, which is chosen according to the discrete probability distribution .
Note that the setting considered previously corresponds to the case and for . Much as in that case, we can readily derive a measurement condition. The only change we need to make is to redefine the local coherences in (5.5) by the following:
(5.11)
We now obtain the following result.
Corollary 5.6.
As observed, this block sampling model was used in [6] in the context of MRI reconstruction with generative models (see [6, App. C]). Note that the local coherences (5.11) are identical to the generalized Christoffel functions defined therein. Here we generalize this setup and provide theoretical guarantees. On the practical side, we note that the local coherences in the block case (5.11) can be estimated empirically, much as in the single-row case considered previously. See [6, §C.4].
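For illustration, the following sketch constructs block measurements from a non-overlapping partition of the rows of a unitary matrix (so the overlap counts are all one); the scaling by 1/sqrt(m pi_k) is an assumed normalization consistent with the single-row case considered previously.

```python
import numpy as np

rng = np.random.default_rng(9)
N, block, m = 128, 8, 8                        # signal dimension, block size, number of block measurements
U = np.fft.fft(np.eye(N)) / np.sqrt(N)

blocks = [np.arange(i, i + block) for i in range(0, N, block)]   # non-overlapping partition of the rows
K = len(blocks)
pi = np.ones(K) / K                            # probability distribution over the blocks

idx = rng.choice(K, size=m, p=pi)              # choose which blocks to measure
A = np.vstack([U[blocks[j]] / np.sqrt(m * pi[j]) for j in idx])  # each measurement is a block of rows
x = rng.standard_normal(N)
y = A @ x                                      # the stacked block measurements
print(A.shape, y.shape)
```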
6 Proofs
Consider a realization of the . We say that empirical nondegeneracy holds over a set with constants if
(6.1)
Recall from the notation introduced in §1.1 that this can be equivalently written as
Note that in the classical compressed sensing setup (see Example 2), this is equivalent to the measurement matrix satisfying the Restricted Isometry Property (RIP) [43, Chpt. 6]. As the following lemma shows, empirical nondegeneracy is crucial in establishing a recovery guarantee in the general case.
Lemma 6.1.
Although the proof is straightforward, we include it for completeness.
Proof.
Let . Then
as required. ∎
Notice that this result in fact only requires the lower inequality in (6.1). The upper inequality will be used later in the derivation of the error bound in expectation. Consequently, the remainder of the proofs is devoted to deriving measurement conditions under which (6.1) holds, then using this and the above lemma to derive error bounds in expectation.
6.1 Measurement conditions for empirical nondegeneracy
Theorem 6.2.
Note that these first two theorems will be used to establish Theorem 3.6. We split them into two statements as the proof techniques are quite different. For Theorem 3.7, we will use the following result.
Theorem 6.3.
Consider the setup of §1.1. Let and be such that assumptions (i) and (ii) of Theorem 6.2 hold, and also that
- (iii) , where is a finite set of size .
Suppose that either
- (a)
- (b) where
or
- (c) , where
Then, with probability at least , (6.1) holds for with constants and .
We now prove these results.
6.1.1 Setup
Let be such that assumption (ii) of Theorem 6.2 holds. As per §3.5, we will not assume that assumption (i) holds for the moment.
Nondegeneracy (1.1) implies that the quantity
defines a norm on that is equivalent to . Let
Then, for empirical nondegeneracy (6.1) to hold with constants and , it suffices to show that the random variable
(6.3)
satisfies with probability at least . Following a standard route, we show this by first bounding the expectation of and then by estimating the probability it exceeds .
6.1.2 Expectation bounds: setup
Recall that a Rademacher random variable is a random variable that takes the values and with probability . Recall also that is the set of such that the corresponding distribution is nonconstant.
Lemma 6.4.
Proof.
Since
(6.4)
we have
(6.5)
Here, in the second step we used the definition of . Assumption (ii) implies that , where is the finite-dimensional vector space . Since is a vector space, we also have . Consider as a Hilbert space with the inner product from . Then the map
is bounded. Indeed,
Here, in the first inequality we used the fact that the map is bounded by assumption and in the second step, we used the fact that is finite dimensional and all norms on finite-dimensional vector spaces are equivalent. Therefore, has a unique bounded adjoint
In particular, we may write
where is a random variable taking values in , the finite-dimensional vector space of bounded linear operators on . Using this we can write
(6.6)
where is the convex function
We now apply symmetrization (see, e.g., [9, Lem. 13.26]) to get
Now observe that
This completes the proof. ∎
The next several results involve the covering number (see, e.g., [9, Defn. 13.21]) and Dudley’s inequality (see, e.g., [9, Thm. 13.25]). Here we recall the notation .
Lemma 6.5.
Fix a realization of the , . Then
(6.7)
for all , where
(6.8)
is such that ,
(6.9)
and
(6.10)
Proof.
The left-hand side of (6.7) is the expected value of the supremum of the absolute value of the Rademacher process , where
The corresponding pseudometric is
Dudley’s inequality implies that
(6.11)
where . The remainder of the proof involves upper bounding the pseudometric and the constant .
First, for two sequences , , observe that
where is such that . Therefore
We deduce that
(6.12)
Second, consider the term . Then, defining additionally (since ), we see that
Now let , where . Then
Also, recall that , . We deduce that
(6.13)
This gives
Hence .
We now combine this with (6.12) and standard properties of covering numbers to obtain
The result now follows from a change of variables. ∎
The next step involves estimating the covering number. We do this in two ways via the following two lemmas. Their proofs rely on a number of standard properties of covering numbers. See, e.g., [9, §13.5.2].
Lemma 6.6 (First covering number bound).
Consider the setup of the previous lemma. Then
(6.14)
where
- (a) if satisfies assumption (ii) of Theorem 6.2; or
- (b) if satisfies assumptions (i) and (ii) of Theorem 6.2.
Proof.
Write , where
Then it suffices to prove the bound (6.14) first with in place of and then with in place of .
Step 1: showing (6.14) with in place of . By definition,
Therefore, the norm (6.10) satisfies
(6.15)
Notice that the right-hand side is equal to in case (a). Now consider case (b). Note that whenever is a cone. Therefore,
from which it follows that
Therefore, the right-hand side of (6.15) can be bounded by in case (b) as well. Using the equivalence between and once more, we deduce that
(6.16)
in either case (a) or (b). Now, using standard properties of covering numbers, we get
Now, assumption (ii) of Theorem 6.2 gives that , where is the unit ball of . Using further properties of covering numbers and (6.16), we get
We now take the logarithm and square root, and use the inequality , . This gives (6.14) with in place of .
Step 2: showing (6.14) with in place of . We now switch the order of the above arguments. First, we write
Now, since is a subspace, we have
and therefore
Since is the unit ball of with respect to , we deduce that
We now take the logarithm and square root of both sides, as before. ∎
To derive our second bound for the covering number, we first need to establish a version of Khintchine’s inequality in Hilbert spaces. For this, we need the following.
Lemma 6.7 (Hoeffding’s inequality in a Hilbert space).
Let be independent mean-zero random variables taking values in a separable Hilbert space such that and let . Then, for all ,
Note that this lemma can be proved directly from McDiarmid’s inequality. Using this, we now obtain the following.
Lemma 6.8 (Khintchine’s inequality in a Hilbert space).
Let , where is a separable Hilbert space and be independent Rademacher random variables. Then
Proof.
Let and note that our aim is to bound . Now let . Then, by the previous lemma,
whenever . In particular, if , then
Now fix . Then (see, e.g., [62, Ex. 1.2.3])
where . Now choose and take the th root to obtain
The moments (see, e.g., [62, Ex. 2.5.1]). Using this and the fact that for , we deduce the result. ∎
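For completeness, we record the two standard facts combined in this proof, in generic notation: the integral identity for the moments of a scalar random variable $Y$ ([62, Ex. 1.2.3]) and the exact second moment of a Rademacher sum with fixed coefficients $x_1,\dots,x_m$ in a Hilbert space, which follows by expanding the square and using $\mathbb{E}\, \epsilon_i \epsilon_j = \delta_{ij}$:
\[
\mathbb{E} |Y|^p = p \int_0^{\infty} t^{p-1}\, \mathbb{P}( |Y| > t )\, \mathrm{d} t \quad (p \geq 1), \qquad \mathbb{E} \Big\| \sum_{i=1}^{m} \epsilon_i x_i \Big\|^2 = \sum_{i=1}^{m} \| x_i \|^2 .
\]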
Lemma 6.9 (Second covering number bound).
Consider the setup of Lemma 6.5 and suppose that assumption (iii) of Theorem 6.3 holds. Then
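Covering number bounds of this type are commonly obtained via Maurey’s empirical method, which is where a Hilbert-space Khintchine inequality of the above form typically enters; we sketch the standard technique in generic notation, without claiming that it is the precise argument used here. To approximate an element $x = \sum_{j=1}^{N} \theta_j \phi_j$ of the convex hull of points $\phi_1,\dots,\phi_N$ with $\|\phi_j\| \leq R$ (so $\theta_j \geq 0$ and $\sum_j \theta_j = 1$), draw indices $J_1,\dots,J_s$ independently with $\mathbb{P}(J = j) = \theta_j$ and set $\hat{x} = \frac{1}{s} \sum_{k=1}^{s} \phi_{J_k}$. Symmetrization and Khintchine’s inequality give
\[
\mathbb{E} \| x - \hat{x} \| \;\leq\; \frac{2}{\sqrt{s}} \max_{j} \| \phi_j \| \;\leq\; \frac{2 R}{\sqrt{s}} ,
\]
and since there are at most $N^s$ possible values of $\hat{x}$, choosing $s \asymp (R/t)^2$ yields $\log \mathcal{N}(\mathrm{conv}\{\phi_1,\dots,\phi_N\}, \|\cdot\|, t) \lesssim (R/t)^2 \log N$ for $t \leq R$.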
6.1.3 Expectation bounds
We are now in a position to establish our first expectation bound.
Theorem 6.10 (First expectation bound).
Consider the setup of Section 1.1. Let and and suppose that
(6.17)
where
- (a) if satisfies assumption (ii) of Theorem 6.2; or
- (b) if satisfies assumptions (i) and (ii) of Theorem 6.2.
Then the random variable (6.3) satisfies .
Proof.
Lemmas 6.4, 6.5 and 6.6 imply that
where is as in (6.9) and is as in Lemma 6.6. We now use [43, Lem. C.9] to obtain
Using (6.13) and the definition of , we see that
Now consider the sum. Let . Then
where in the second step we used (6.5) and (6.4). Hence
Using the fact that and the Cauchy–Schwarz inequality, we deduce that
(6.18)
Combining this with the previous expression and using the definition of , we deduce that
Now set , where and notice that in this case. Observe that
Using this and the fact that (since ), we deduce that
Hence, we see that , provided
for a suitably small constant . Rearranging and noticing that
now gives the result. ∎
Theorem 6.11 (Second expectation bound).
Consider the setup of Section 1.1. Let and satisfy assumption (iii) of Theorem 6.3 and suppose that
where
and
- (a) if satisfies assumption (ii) of Theorem 6.2; or
- (b) if satisfies assumptions (i) and (ii) of Theorem 6.2.
Then the random variable (6.3) satisfies .
6.1.4 Probability bound
We now bound in probability.
Theorem 6.12.
Let and suppose that satisfies assumption (ii) of Theorem 6.2. Suppose also that . Then with probability at least , provided
Proof.
Our analysis is based on a version of Talagrand’s theorem (see, e.g., [9, Thm. 13.33]). First, let be a countable dense subset of . This exists since is separable. Then from (6.6) we have
where is the -valued random variable . Now consider the measurable space
and let be defined by . Observe that we can write
Notice also that . Next, notice that, for all ,
Therefore , .
Now observe that
Finally, observe that
in this case. In particular, .
Now suppose that . Then, using Talagrand’s theorem,
Due to the condition on , we deduce that , as required. ∎
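For orientation, a commonly used version of Talagrand’s theorem (Bousquet’s form; the constants and normalization in [9, Thm. 13.33] may differ) reads as follows. Let $X_1,\dots,X_m$ be independent, let $\mathcal{F}$ be a countable class of functions with $\mathbb{E} f(X_i) = 0$ and $|f(X_i)| \leq K$ almost surely for all $f$ and $i$, set $Z = \sup_{f \in \mathcal{F}} \sum_{i=1}^m f(X_i)$ and let $\sigma^2 \geq \sup_{f \in \mathcal{F}} \sum_{i=1}^m \mathbb{E} |f(X_i)|^2$. Then, for all $t \geq 0$,
\[
\mathbb{P}\big( Z \geq \mathbb{E} Z + t \big) \;\leq\; \exp\left( - \frac{t^2}{2 \big( \sigma^2 + 2 K\, \mathbb{E} Z \big) + \tfrac{2}{3} K t} \right),
\]
and the same bound applies to $\sup_{f \in \mathcal{F}} | \sum_{i=1}^m f(X_i) |$ upon replacing $\mathcal{F}$ by $\mathcal{F} \cup (-\mathcal{F})$.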
6.1.5 Proofs of Theorems 6.2 and 6.3
For the proof of Theorem 6.2, we divide the argument into two parts. The first deals with conditions (a) and (b), and the second deals with condition (c), which involves a different approach.
Proof of Theorem 6.2; conditions (a) and (b).
Consider condition (a). Recall that whenever is a cone. Therefore, since as well, we see that
(6.19)
Hence which implies that
The result subject to condition (a) now follows immediately from Theorems 6.10 and 6.12.
Moreover, using (6.19) and the inequality , we see that
(6.20)
Hence condition (b) implies condition (a), which implies the result. ∎
Remark 6.13 (On assumption (ii) of Theorem 6.2)
Notice that Theorem 6.12 does not require assumption (i) of Theorem 6.2. Moreover, in Theorem 6.10 we gave a bound for the expectation when only assumption (ii) holds. Therefore, it is straightforward to derive an extension of Theorem 6.2 in which only assumption (ii) holds. The only difference is the definition of in (6.2), which would now be given by the expression in Theorem 6.10, part (a). Note that the same considerations also apply to Theorem 3.6 since, as we show in its proof below, it follows from a direct application of Theorem 6.2 to the difference set .
Next, for condition (c), we follow a different and simpler approach based on the matrix Chernoff bound.
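We recall the lower tail of the matrix Chernoff bound in a standard form (the version used below may carry different constants). Let $Y_1,\dots,Y_m$ be independent, self-adjoint, nonnegative definite random $k \times k$ matrices with $\lambda_{\max}(Y_i) \leq R$ almost surely, and set $\mu_{\min} = \lambda_{\min} \big( \sum_{i=1}^m \mathbb{E}\, Y_i \big)$. Then, for all $\delta \in (0,1)$,
\[
\mathbb{P}\left( \lambda_{\min}\Big( \sum_{i=1}^{m} Y_i \Big) \leq (1 - \delta)\, \mu_{\min} \right) \;\leq\; k \left( \frac{\mathrm{e}^{-\delta}}{(1-\delta)^{1-\delta}} \right)^{\mu_{\min} / R} \;\leq\; k\, \mathrm{e}^{- \delta^2 \mu_{\min} / (2 R)} .
\]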
Proof of Theorem 6.2; condition (c).
Since is a union of subspaces of dimension , it suffices, by the union bound, to show that (6.1) holds for each subspace with probability at least .
Therefore, without loss of generality, we may now assume that is a single subspace of dimension and show (6.1) for this subspace. To do this, we follow a standard approach based on the matrix Chernoff bound. Let be an orthonormal basis for and write . Then
where is the self-adjoint random matrix
Therefore, (6.1) is equivalent to
Notice that and therefore is nonnegative definite. Also, we have
Hence, by (1.1), we have
Therefore, it suffices to show that
with high probability. Observe that
almost surely, and therefore
almost surely, for all . Thus, by the matrix Chernoff bound and the union bound, we have
Notice that . Hence the right-hand side is bounded below by
We deduce that if
then, with probability at least , (6.1) holds with and for the subspace . Replacing by now gives the result. ∎
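The final step is a routine rearrangement of the tail bound into a measurement condition. Schematically, if the failure probability for a single subspace takes the form $k\, \mathrm{e}^{-c m / \kappa}$ for some constant $c > 0$ and some quantity $\kappa$ depending on the variation and the nondegeneracy constants (the symbols $c$ and $\kappa$ are placeholders here, not quantities defined elsewhere in the paper), then requiring the union bound over the $d$ subspaces to be at most $\epsilon$ gives
\[
k\, d\, \mathrm{e}^{- c m / \kappa} \leq \epsilon \quad \Longleftrightarrow \quad m \geq c^{-1} \kappa \log\!\left( \frac{k d}{\epsilon} \right),
\]
which is the general shape of measurement conditions of this type.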
Proof of Theorem 6.3.
The result follows from Theorems 6.11 and 6.12. Recall that since assumption (i) holds. Therefore, we may bound the logarithmic term in Theorem 6.11 as
Hence, we immediately deduce the result under the condition (a). Now recall (6.19). Then, much as in (6.20), we deduce that
We conclude that condition (b) implies condition (a). This gives the result under condition (b). Finally, recalling (6.19) once more, we also have
Hence condition (c) also implies condition (a), and therefore we get the result under this condition as well. ∎
6.2 Proof of Theorems 3.6 and 3.7
Proof of Theorems 3.6 and 3.7.
We follow essentially the same ideas as in [6, Thm. 4.8]. Let (this value is arbitrary) and be the event that (6.1) holds over with and . Theorems 6.2–6.3 (with in place of ) and the various conditions on imply that . Now write
(6.21)
Observe that the mapping is a contraction. We also have and , where in the latter case, we used the fact that by assumption.
Fix and consider the first term. If the event occurs, then the properties of and Lemma 6.1 give that
By the Cauchy–Schwarz inequality, we deduce that
We now use (3.1) to deduce that
(6.22)
We next consider the second term of (6.21). Using the properties of , we see that
Substituting this and (6.22) into (6.21) and recalling that now gives the result. ∎
Remark 6.14 (On the error bounds in expectation)
The above proof explains why the truncation term is needed: namely, it bounds the error in the event where empirical nondegeneracy (6.1) fails. Modifying the estimator appears unavoidable if one is to obtain an error bound in expectation. A downside of the estimator is that it requires an a priori bound on . In some limited scenarios, one can avoid this by using a different estimator [35]. Indeed, let be the event in the above proof. Then define the conditional estimator
One readily deduces that this estimator obeys the same error bound (1.5) with replaced by . Unfortunately, computing this estimator involves computing the empirical nondegeneracy constants. This is possible in some limited scenarios, such as when is a linear subspace, as the constants then correspond to the maximum and minimum singular values of a certain matrix, as made explicit below. However, this is generally impossible in the nonlinear case. For instance, in the classical compressed sensing problem, we previously noted that (6.1) is equivalent to the RIP. It is well known that computing RIP constants is NP-hard [59].
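To make the linear-subspace case explicit: if the model class is a subspace with orthonormal basis $\phi_1,\dots,\phi_k$ and we write $A_i$ for the $i$th sampling operator (our notation here; the normalization should match that of (6.1) and the proof of Theorem 6.2 under condition (c)), then the empirical nondegeneracy constants correspond to the extreme eigenvalues of the $k \times k$ self-adjoint matrix
\[
G = \left( \frac{1}{m} \sum_{i=1}^{m} \langle A_i \phi_j, A_i \phi_l \rangle \right)_{j, l = 1}^{k},
\]
equivalently, to the squares of the extreme singular values of the matrix obtained by stacking the measurements of the basis elements. These quantities are computable for finite $k$ and $m$, which is what makes the conditional estimator feasible in this case.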
Remark 6.15
The observant reader may have noticed a small technical issue with Theorem 3.6 and its proof: the estimator (and therefore ) is nonunique, and therefore is not a well-defined random variable. To resolve this issue, one can replace by the maximum error between and any -minimizer . The error bound remains the same for this (well-defined) random variable.
References
- [1] B. Adcock and S. Brugiapaglia. Is Monte Carlo a bad sampling strategy for learning smooth functions in high dimensions? arXiv:2208.09045, 2022.
- [2] B. Adcock, S. Brugiapaglia, N. Dexter, and S. Moraga. Deep neural networks are effective at learning high-dimensional Hilbert-valued functions from limited data. In J. Bruna, J. S. Hesthaven, and L. Zdeborová, editors, Proceedings of The Second Annual Conference on Mathematical and Scientific Machine Learning, volume 145 of Proc. Mach. Learn. Res. (PMLR), pages 1–36. PMLR, 2021.
- [3] B. Adcock, S. Brugiapaglia, N. Dexter, and S. Moraga. Near-optimal learning of Banach-valued, high-dimensional functions via deep neural networks. arXiv:2211.12633, 2022.
- [4] B. Adcock, S. Brugiapaglia, N. Dexter, and S. Moraga. On efficient algorithms for computing near-best polynomial approximations to high-dimensional, Hilbert-valued functions from limited samples. arXiv:2203.13908, 2022.
- [5] B. Adcock, S. Brugiapaglia, and C. G. Webster. Sparse Polynomial Approximation of High-Dimensional Functions. Comput. Sci. Eng. Society for Industrial and Applied Mathematics, Philadelphia, PA, 2022.
- [6] B. Adcock, J. M. Cardenas, and N. Dexter. CS4ML: A general framework for active learning with arbitrary data based on Christoffel functions. arXiv:2306.00945, 2023.
- [7] B. Adcock, J. M. Cardenas, N. Dexter, and S. Moraga. Towards optimal sampling for learning sparse approximation in high dimensions, chapter 2, pages 9–77. Springer Optimization and Its Applications. Springer, 2022.
- [8] B. Adcock and N. Dexter. The gap between theory and practice in function approximation with deep neural networks. SIAM J. Math. Data Sci., 3(2):624–655, 2021.
- [9] B. Adcock and A. C. Hansen. Compressive Imaging: Structure, Sampling, Learning. Cambridge University Press, Cambridge, UK, 2021.
- [10] B. Adcock, A. C. Hansen, C. Poon, and B. Roman. Breaking the coherence barrier: a new theory for compressed sensing. Forum Math. Sigma, 5:e4, 2017.
- [11] A. K. Alekseev, I. M. Navon, and M. E. Zelentsov. The estimation of functional uncertainty using polynomial chaos and adjoint equations. Internat. J. Numer. Methods Fluids, 67(3):328–341, 2011.
- [12] H. Avron, M. Kapralov, C. Musco, C. Musco, A. Velingker, and A. Zandieh. Random Fourier features for kernel ridge regression: approximation bounds and statistical guarantees. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 253–262. PMLR, 2017.
- [13] H. Avron, M. Kapralov, C. Musco, C. Musco, A. Velingker, and A. Zandieh. A universal sampling method for reconstructing signals with simple Fourier transforms. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, pages 1051–1063, New York, NY, USA, 2019. Association for Computing Machinery.
- [14] R. G. Baraniuk, V. Cevher, M. F. Duarte, and C. Hedge. Model-based compressive sensing. IEEE Trans. Inform. Theory, 56(4):1982–2001, 2010.
- [15] A. Berk, S. Brugiapaglia, B. Joshi, Y. Plan, M. Scott, and O. Yilmaz. A coherence parameter characterizing generative compressed sensing with Fourier measurements. IEEE J. Sel. Areas Inf. Theory, 3(3):502–512, 2023.
- [16] A. Berk, S. Brugiapaglia, Y. Plan, M. Scott, X. Sheng, and O. Yilmaz. Model-adapted Fourier sampling for generative compressed sensing. arXiv:2310.04984, 2023.
- [17] J. Bigot, C. Boyer, and P. Weiss. An analysis of block sampling strategies in compressed sensing. IEEE Trans. Inform. Theory, 62(4):2125–2139, 2016.
- [18] A. Bora, A. Jalal, E. Price, and A. G. Dimakis. Compressed sensing using generative models. In International Conference on Machine Learning, pages 537–546, 2017.
- [19] A. Bourrier, M. E. Davies, T. Peleg, P. Pérez, and R. Gribonval. Fundamental performance limits for ideal decoders in high-dimensional linear inverse problems. IEEE Trans. Inform. Theory, 60(12):7928–7946, 2014.
- [20] C. Boyer, J. Bigot, and P. Weiss. Compressed sensing with structured sparsity and structured acquisition. Appl. Comput. Harmon. Anal., 46(2):312–350, 2019.
- [21] S. Brugiapaglia, S. Dirksen, H. C. Jung, and H. Rauhut. Sparse recovery in bounded Riesz systems with applications to numerical methods for PDEs. Appl. Comput. Harmon. Anal., 53:231–269, 2021.
- [22] E. J. Candès and Y. Plan. A probabilistic and RIPless theory of compressed sensing. IEEE Trans. Inform. Theory, 57(11):7235–7254, 2011.
- [23] N. Chauffert, P. Ciuciu, J. Kahn, and P. Weiss. Variable density sampling with continuous trajectories. SIAM J. Imaging Sci., 7(4):1962–1992, 2014.
- [24] S. Chen, R. Varma, A. Singh, and J. Kovačević. A statistical perspective of sampling scores for linear regression. In 2016 IEEE International Symposium on Information Theory (ISIT), pages 1556–1560, 2016.
- [25] X. Chen and E. Price. Active regression via linear-sample sparsification. In A. Beygelzimer and D. Hsu, editors, Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 663–695. PMLR, 2019.
- [26] I.-Y. Chun and B. Adcock. Compressed sensing and parallel acquisition. IEEE Trans. Inform. Theory, 63(8):4860–4882, 2017.
- [27] A. Cohen and R. A. DeVore. Approximation of high-dimensional parametric PDEs. Acta Numer., 24:1–159, 2015.
- [28] A. Cohen and G. Migliorati. Optimal weighted least-squares methods. SMAI J. Comput. Math., 3:181–203, 2017.
- [29] W. M. Czarnecki, S. Osindero, M. Jaderberg, G. Swirszcz, and R. Pascanu. Sobolev training for neural networks. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
- [30] M. A. Davenport, M. F. Duarte, Y. C. Eldar, and G. Kutyniok. Introduction to compressed sensing. In Y. C. Eldar and G. Kutyniok, editors, Compressed Sensing: Theory and Applications, pages 1–64. Cambridge University Press, Cambridge, UK, 2012.
- [31] M. A. Davenport and J. Romberg. An overview of low-rank matrix recovery from incomplete observations. IEEE J. Sel. Topics Signal Process., 10(4):602–622, 2016.
- [32] M. Derezinski, M. K. K. Warmuth, and D. J. Hsu. Leveraged volume sampling for linear regression. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
- [33] N. Dexter, H. Tran, and C. Webster. A mixed regularization approach for sparse simultaneous approximation of parameterized PDEs. ESAIM Math. Model. Numer. Anal., 53:2025–2045, 2019.
- [34] S. Dirksen. Dimensionality reduction with subgaussian matrices: a unified theory. Found. Comput. Math., 16:1367–1396, 2016.
- [35] M. Dolbeault and A. Cohen. Optimal sampling and Christoffel functions on general domains. Constr. Approx., 56(1):121–163, 2022.
- [36] D. L. Donoho and M. Elad. Optimally sparse representation in general (non-orthogonal) dictionaries via ℓ1 minimization. Proc. Natl. Acad. Sci. USA, 100:2197–2202, 2003.
- [37] D. L. Donoho and X. Huo. Uncertainty principles and ideal atomic decomposition. IEEE Trans. Inform. Theory, 47(7):2845–2862, 2001.
- [38] M. F. Duarte and Y. C. Eldar. Structured compressed sensing: from theory to applications. IEEE Trans. Signal Process., 59(9):4053–4085, 2011.
- [39] M. Eigel, R. Schneider, and P. Trunschke. Convergence bounds for empirical nonlinear least-squares. ESAIM Math. Model. Numer. Anal., 56(1):79–104, 2022.
- [40] M. Elad and A. M. Bruckstein. A generalized uncertainty principle and sparse representation in pairs of bases. IEEE Trans. Inform. Theory, 48(9):2558–2567, 2002.
- [41] T. Erdelyi, C. Musco, and C. Musco. Fourier sparse leverage scores and approximate kernel learning. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 109–122. Curran Associates, Inc., 2020.
- [42] X. Feng and L. Zeng. Gradient-enhanced deep neural network approximations. J. Mach. Learn. Model. Comput., 3(4):73–91, 2022.
- [43] S. Foucart and H. Rauhut. A Mathematical Introduction to Compressive Sensing. Appl. Numer. Harmon. Anal. Birkhäuser, New York, NY, 2013.
- [44] A. Gajjar, C. Musco, and C. Hegde. Active learning for single neuron models with Lipschitz non-linearities. In F. Ruiz, J. Dy, and J.-W. van de Meent, editors, Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, pages 4101–4113. PMLR, 2023.
- [45] L. Guo, A. Narayan, and T. Zhou. A gradient enhanced -minimization for sparse approximation of polynomial chaos expansions. J. Comput. Phys., 367:49–64, 2018.
- [46] J. Han, A. Jentzen, and W. E. Solving high-dimensional partial differential equations using deep learning. Proc. Natl. Acad. Sci. U.S.A., 115(34):8505–8510, 2018.
- [47] A. Hashemi, H. Schaeffer, R. Shi, U. Topcu, G. Tran, and R. Ward. Generalization bounds for sparse random feature expansions. Appl. Comput. Harmon. Anal., 62:310–330, 2023.
- [48] A. Jalal, M. Arvinte, G. Daras, E. Price, A. Dimakis, and J. Tamir. Robust compressed sensing MR imaging with deep generative priors. In NeurIPS 2021 Workshop on Deep Learning and Inverse Problems, 2021.
- [49] M. Lustig, D. L. Donoho, and J. M. Pauly. Sparse MRI: The application of compressed sensing for rapid MR imaging. Magn. Reson. Med., 58(6):1182–1195, 2007.
- [50] P. Ma, M. W. Mahoney, and B. Yu. A statistical perspective on algorithmic leveraging. J. Mach. Learn. Res., 16:861–911, 2015.
- [51] O. Malik, Y. Xu, N. Cheng, S. Becker, A. Doostan, and A. Narayan. Fast algorithms for monotone lower subsets of Kronecker least squares problems. arXiv:2209.05662, 2022.
- [52] D. W. McRobbie, E. A. Moore, M. J. Graves, and M. R. Prince. MRI: From Picture to Proton. Cambridge University Press, Cambridge, 2nd edition, 2006.
- [53] T. O’Leary-Roseberry, P. Chen, U. Villa, and O. Ghattas. Derivative-Informed Neural Operator: An efficient framework for high-dimensional parametric derivative learning. J. Comput. Phys., 496:112555, 2024.
- [54] T. O’Leary-Roseberry, U. Villa, P. Chen, and O. Ghattas. Derivative-informed projected neural networks for high-dimensional parametric maps governed by PDEs. Comput. Methods Appl. Mech. Engrg., 388(1):114199, 2022.
- [55] J. Peng, J. Hampton, and A. Doostan. On polynomial chaos expansion via gradient-enhanced -minimization. J. Comput. Phys., 310:440–458, 2016.
- [56] M. Raissi, P. Perdikaris, and G. E. Karniadakis. Physics-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys., 378:686–707, 2019.
- [57] J. Romberg. Imaging via compressive sampling. IEEE Signal Process. Mag., 25(2):14–20, 2008.
- [58] V. Studer, J. Bobin, M. Chahid, H. Moussavi, E. J. Candès, and M. Dahan. Compressive fluorescence microscopy for biological and hyperspectral imaging. Proc. Natl. Acad. Sci. USA, 109(26):1679–1687, 2011.
- [59] A. M. Tillmann and M. E. Pfetsch. The computational complexity of the restricted isometry property, the nullspace property, and related concepts in compressed sensing. IEEE Trans. Inform. Theory, 60(2):1248–1259, 2014.
- [60] Y. Traonmilin and R. Gribonval. Stable recovery of low-dimensional cones in Hilbert spaces: one RIP to rule them all. Appl. Comput. Harmon. Anal., 45(1):170–205, 2018.
- [61] Y. Traonmilin, G. Puy, R. Gribonval, and M. E. Davies. Compressed sensing in Hilbert spaces. In H. Boche, G. Caire, R. Calderbank, M. März, G. Kutyniok, and R. Mathar, editors, Compressed Sensing and its Applications: Second International MATHEON Conference 2015, Applied and Numerical Harmonic Analysis, pages 359–384. Birkhäuser, Cham, 2017.
- [62] R. Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, Cambridge, UK, 2018.
- [63] M. Vidyasagar. An Introduction to Compressed Sensing. Comput. Sci. Eng. Society for Industrial and Applied Mathematics, Philadelphia, PA, 2019.
- [64] D. P. Woodruff. Sketching as a tool for numerical linear algebra. Found. Trends Theor. Comput. Sci., 10(1–2):1–157, 2014.
- [65] J. Yu, L. Lu, X. Meng, and G. E. Karniadakis. Gradient-enhanced physics-informed neural networks for forward and inverse PDE problems. Comput. Methods Appl. Mech. Engrg., 393:114823, 2022.