
A unified framework for learning with nonlinear model classes from arbitrary linear samples

Ben Adcock
Department of Mathematics
Simon Fraser University
Canada
   Juan M. Cardenas
Ann and H. J. Smead Department of Aerospace Engineering Sciences
University of Colorado Boulder
USA
   Nick Dexter
Department of Scientific Computing
Florida State University
USA
Abstract

This work considers the fundamental problem of learning an unknown object from training data using a given model class. We introduce a unified framework that allows for objects in arbitrary Hilbert spaces, general types of (random) linear measurements as training data and general types of nonlinear model classes. We establish a series of learning guarantees for this framework. These guarantees provide explicit relations between the amount of training data and properties of the model class to ensure near-best generalization bounds. In doing so, we also introduce and develop the key notion of the variation of a model class with respect to a distribution of sampling operators. To exhibit the versatility of this framework, we show that it can accommodate many different types of well-known problems of interest. We present examples such as matrix sketching by random sampling, compressed sensing with isotropic vectors, active learning in regression and compressed sensing with generative models. In all cases, we show how known results become straightforward corollaries of our general learning guarantees. For compressed sensing with generative models, we also present a number of generalizations and improvements of recent results. In summary, our work not only introduces a unified way to study learning unknown objects from general types of data, but also establishes a series of general theoretical guarantees which consolidate and improve various known results.

1 Introduction

Learning an unknown object (e.g., a vector, matrix or function) from a finite set of training data is a fundamental problem in applied mathematics and computer science. Typically, in modern settings, one seeks to learn an approximate representation in a nonlinear model class (also known as an approximation space or hypothesis set). It is also common to generate the training data randomly according to some distribution. Of critical importance in this endeavour is the question of learning guarantees. In other words: how much training data suffices to ensure good generalization, and how is this influenced by, firstly, the choice of model class, and secondly, the random process generating the training data?

This question has often been addressed for specific types of training data. For instance, in the case of function regression, the training data consists of pointwise evaluations of some target function, or in the case of computational imaging, the training data may consist of samples of the Fourier or Radon transform of some target image. It is also commonly studied for specific model classes, e.g., (linear or nonlinear) polynomial spaces [5] or (nonlinear) spaces of sparse vectors [43, 63], spaces of low-rank matrices or tensors [31, 63], spaces defined by generative models [18], single- [44] or multi-layer neural networks [8, 2, 3], Fourier sparse functions [41], (sparse) random feature models [12, 47] and many more.

In this paper, we introduce a unified framework for learning with nonlinear model classes from arbitrary linear samples. The main features of this framework are:

  1. (i)

    the target object is an element of a separable Hilbert space;

  2. (ii)

    each measurement is taken randomly and independently according to a random linear operator (which may differ from measurement to measurement);

  3. (iii)

    the measurements may be scalar- or vector-valued, or, in general, may take values in a Hilbert space;

  4. (iv)

    the measurements can be completely multimodal, i.e., generated by different distributions of random linear operators, as long as a certain nondegeneracy condition holds;

  5. (v)

    the model class can be linear or nonlinear, but should be contained in (or covered by) a union of finite-dimensional subspaces;

  6. (vi)

    the resulting learning guarantees are given in terms of the variation of the model class with respect to the sampling distributions.

We present a series of examples to highlight the generality of this framework. In various cases, our resulting learning guarantees either include or improve known guarantees in the literature.

1.1 The framework

The setup we consider in this paper is the following.

  • 𝕏\mathbb{X} is a separable Hilbert space with inner product ,𝕏\langle\cdot,\cdot\rangle_{\mathbb{X}} and norm 𝕏{\left\|\cdot\right\|}_{\mathbb{X}}.

  • 𝕏0𝕏\mathbb{X}_{0}\subseteq\mathbb{X} is a semi-normed vector space, termed the object space, with semi-norm 𝕏0{\left\|\cdot\right\|}_{\mathbb{X}_{0}}.

  • x𝕏0x\in\mathbb{X}_{0} is the unknown target object.

  • For each i=1,,mi=1,\ldots,m, 𝕐i\mathbb{Y}_{i} is a Hilbert space with inner product ,𝕐i\langle\cdot,\cdot\rangle_{\mathbb{Y}_{i}} and norm 𝕐i{\left\|\cdot\right\|}_{\mathbb{Y}_{i}} termed the iith measurement space.

  • For each i=1,,mi=1,\ldots,m, 𝒜i\mathcal{A}_{i} is a distribution of bounded linear operators (𝕏0,𝕏0)(𝕐i,𝕐i)(\mathbb{X}_{0},{\left\|\cdot\right\|}_{\mathbb{X}_{0}})\rightarrow(\mathbb{Y}_{i},{\left\|\cdot\right\|}_{\mathbb{Y}_{i}}). We term Ai𝒜iA_{i}\sim\mathcal{A}_{i} the iith sampling operator. We also write (𝕏0,𝕐i)\mathcal{B}(\mathbb{X}_{0},\mathbb{Y}_{i}) for the space of bounded linear operators, so that Ai(𝕏0,𝕐i)A_{i}\in\mathcal{B}(\mathbb{X}_{0},\mathbb{Y}_{i}).

  • We assume that the family {𝒜i}i=1m\{\mathcal{A}_{i}\}^{m}_{i=1} of distributions is nondegenerate. Namely, there exist 0<αβ<0<\alpha\leq\beta<\infty such that

    αx𝕏21mi=1m𝔼Ai𝒜iAi(x)𝕐i2βx𝕏2,x𝕏0.\alpha{\|x\|}^{2}_{\mathbb{X}}\leq\frac{1}{m}\sum^{m}_{i=1}\mathbb{E}_{A_{i}\sim\mathcal{A}_{i}}{\|A_{i}(x)\|}^{2}_{\mathbb{Y}_{i}}\leq\beta{\|x\|}^{2}_{\mathbb{X}},\quad\forall x\in\mathbb{X}_{0}. (1.1)
  • 𝕌𝕏0\mathbb{U}\subseteq\mathbb{X}_{0} is a set, termed the approximation space or model class. Our aim is to learn x𝕏0x\in\mathbb{X}_{0} from its measurements with an element x^𝕌\hat{x}\in\mathbb{U}.

Now let Ai(𝕏0,𝕐i)A_{i}\in\mathcal{B}(\mathbb{X}_{0},\mathbb{Y}_{i}), i=1,,mi=1,\ldots,m, be independent realizations from the distributions 𝒜1,,𝒜m\mathcal{A}_{1},\ldots,\mathcal{A}_{m}. Then we consider noisy measurements

bi=Ai(x)+ei𝕐i,i=1,,m.b_{i}=A_{i}(x)+e_{i}\in\mathbb{Y}_{i},\quad i=1,\ldots,m. (1.2)

In other words, we consider training data of the form

{(Ai,bi)}i=1m,where (Ai,bi)(𝕏0,𝕐i)×𝕐i.\{(A_{i},b_{i})\}^{m}_{i=1},\qquad\text{where }(A_{i},b_{i})\in\mathcal{B}(\mathbb{X}_{0},\mathbb{Y}_{i})\times\mathbb{Y}_{i}.

Our aim is to recover xx from this data using an element of 𝕌\mathbb{U}. We do this via the empirical least squares. That is, we let

x^argminu𝕌1mi=1mbiAi(u)𝕐i2.\hat{x}\in{\underset{u\in\mathbb{U}}{\operatorname{argmin}}}\frac{1}{m}\sum^{m}_{i=1}{\|b_{i}-A_{i}(u)\|}^{2}_{\mathbb{Y}_{i}}. (1.3)

Later, we also allow for x^\hat{x} to be an approximate minimizer, to model the more practical scenario where the minimization problem is solved inexactly. Note that in this work we consider the agnostic learning setting, where the target xx need not belong to 𝕌\mathbb{U} and the noise eie_{i} can be adversarial (but small in norm).
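
To make the abstract setup concrete, the following is a minimal NumPy sketch of the measurement model (1.2) and the empirical least-squares fit (1.3) in the simplest special case: 𝕏=𝕏0=ℝ^N, scalar-valued measurements given by i.i.d. Gaussian functionals (so that (1.1) holds with α=β=1), and a linear model class 𝕌, for which (1.3) reduces to a standard linear least-squares problem. All specific choices (the basis V, the noise level, the dimensions) are illustrative only.

    import numpy as np

    rng = np.random.default_rng(0)
    N, n, m = 50, 5, 40

    # Linear model class U = range(V), with V an orthonormal basis (illustrative choice).
    V = np.linalg.qr(rng.standard_normal((N, n)))[0]
    x = V @ rng.standard_normal(n) + 0.01 * rng.standard_normal(N)   # target object, approximately in U

    # Sampling operators A_i(x) = a_i^T x with a_i i.i.d. standard Gaussian (isotropic, so alpha = beta = 1).
    A = rng.standard_normal((m, N))
    e = 0.01 * rng.standard_normal(m)      # adversarial-but-small noise
    b = A @ x + e                          # measurements (1.2)

    # Empirical least squares (1.3): since U is a linear subspace, minimize over coefficients c, u = V c.
    # (The 1/sqrt(m) scaling matches the 1/m weighting in (1.3); it does not change the minimizer.)
    c_hat, *_ = np.linalg.lstsq((A @ V) / np.sqrt(m), b / np.sqrt(m), rcond=None)
    x_hat = V @ c_hat
    print("relative X-norm error:", np.linalg.norm(x - x_hat) / np.linalg.norm(x))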

As we explain in §2, this framework contains many problems of interest. In particular, scalar-valued and vector-valued function regression from i.i.d. samples, matrix sketching for large least-squares problems, compressed sensing with isotropic vectors and compressed sensing with subsampled unitary matrices are all instances of this framework.

As we also discuss therein, it is often the case in practice that the training data is sampled from the same distribution, i.e., 𝒜1==𝒜m=𝒜\mathcal{A}_{1}=\cdots=\mathcal{A}_{m}=\mathcal{A}. However, having different distributions allows us to consider multimodal sampling problems, where data is obtained from different random processes. In §2 we also discuss examples where this arises.

1.2 Contributions

Besides the general framework described above, our main contributions are a series of learning guarantees that relate the amount of training data mm to properties of the sampling distributions {𝒜i}i=1m\{\mathcal{A}_{i}\}^{m}_{i=1}. We now present a simplified version of our main result covering the case where 𝒜1==𝒜m=𝒜\mathcal{A}_{1}=\cdots=\mathcal{A}_{m}=\mathcal{A}. The full case is presented in §3.

A key quantity in this analysis is the so-called variation of a (nonlinear) set 𝕍\mathbb{V} with respect to a distribution 𝒜\mathcal{A} of bounded linear operators in (𝕏0,𝕐)\mathcal{B}(\mathbb{X}_{0},\mathbb{Y}). We define this as the smallest constant Φ=Φ(𝕍;𝒜)\Phi=\Phi(\mathbb{V};\mathcal{A}) such that

A(v)𝕐2Φ,v𝕍,a.s. A𝒜.{\|A(v)\|}^{2}_{\mathbb{Y}}\leq\Phi,\ \forall v\in\mathbb{V},\quad\text{a.s.\ $A\sim\mathcal{A}$}. (1.4)

See also Definition 3.2. As we discuss in Example 3.2, the variation is effectively a generalization of the notion of coherence in classical compressed sensing (see, e.g., [22] or [9, Chpt. 5]). In the following, we also write S(𝕍)={v𝕍:v𝕏=1}S(\mathbb{V})=\left\{v\in\mathbb{V}:{\|v\|}_{\mathbb{X}}=1\right\} for the unit sphere of 𝕍\mathbb{V} in 𝕏\mathbb{X}.

Theorem 1.1 (Main result A; slightly simplified).

Consider the setup of §1.1 with 𝒜1==𝒜m=𝒜\mathcal{A}_{1}=\cdots=\mathcal{A}_{m}=\mathcal{A}, let 𝕌𝕏0\mathbb{U}\subseteq\mathbb{X}_{0} and suppose that the difference set 𝕌=𝕌𝕌\mathbb{U}^{\prime}=\mathbb{U}-\mathbb{U} is such that

  1. (i)

    𝕌\mathbb{U}^{\prime} is a cone (i.e., tu𝕌tu^{\prime}\in\mathbb{U}^{\prime} for any t0t\geq 0 and u𝕌u^{\prime}\in\mathbb{U}^{\prime}), and

  2. (ii)

    𝕌𝕍1𝕍d=:𝕍\mathbb{U}^{\prime}\subseteq\mathbb{V}_{1}\cup\cdots\cup\mathbb{V}_{d}=:\mathbb{V}, where each 𝕍i𝕏0\mathbb{V}_{i}\subseteq\mathbb{X}_{0} is a subspace of dimension at most nn.

Suppose that, for some 0<ϵ<10<\epsilon<1, either

  1. (a)

    mα1Φ(S(𝕌𝕌);𝒜)[log(2d/ϵ)+n]m\gtrsim\alpha^{-1}\cdot\Phi(S(\mathbb{U}^{\prime}-\mathbb{U}^{\prime});\mathcal{A})\cdot\left[\log(2d/\epsilon)+n\right], or

  2. (b)

    mα1Φ(S(𝕍);𝒜)log(2nd/ϵ)m\gtrsim\alpha^{-1}\cdot\Phi(S(\mathbb{V});\mathcal{A})\cdot\log(2nd/\epsilon).

Let x𝕏0x\in\mathbb{X}_{0} and θx𝕏\theta\geq{\|x\|}_{\mathbb{X}}. Then

𝔼xxˇ𝕏2βαinfu𝕌xu𝕏2+e𝕐¯2α,where xˇ=min{1,θ/x^𝕏}x^\mathbb{E}{\|x-\check{x}\|}^{2}_{\mathbb{X}}\lesssim\frac{\beta}{\alpha}\cdot\inf_{u\in\mathbb{U}}{\|x-u\|}^{2}_{\mathbb{X}}+\frac{{\|e\|}^{2}_{\overline{\mathbb{Y}}}}{\alpha},\qquad\text{where }\check{x}=\min\{1,\theta/{\|\hat{x}\|}_{\mathbb{X}}\}\hat{x} (1.5)

for any minimizer x^\hat{x} of (1.3) with noisy measurements (1.2). Here {\|e\|}^{2}_{\overline{\mathbb{Y}}}=\frac{1}{m}\sum^{m}_{i=1}{\|e_{i}\|}^{2}_{\mathbb{Y}_{i}}.

Theorem 1.2 (Main result B; slightly simplified).

Consider the setup of §1.1 with 𝒜1==𝒜m=𝒜\mathcal{A}_{1}=\cdots=\mathcal{A}_{m}=\mathcal{A} and let 𝕌𝕏0\mathbb{U}\subseteq\mathbb{X}_{0} be such that assumptions (i) and (ii) of Theorem 1.1 hold and also that

  1. (iii)

    {u𝕌:u𝕏1}conv(𝕎)\{u\in\mathbb{U}^{\prime}:{\|u\|}_{\mathbb{X}}\leq 1\}\subseteq\mathrm{conv}(\mathbb{W}), where 𝕎\mathbb{W} is a finite set of size |𝕎|=M|\mathbb{W}|=M.

Then the same result holds if conditions (a) and (b) of Theorem 1.1 are replaced by

  1. (a)

    mα1Φ(S(𝕌𝕌)𝕎;𝒜)Lm\gtrsim\alpha^{-1}\cdot\Phi(S(\mathbb{U}^{\prime}-\mathbb{U}^{\prime})\cup\mathbb{W};\mathcal{A})\cdot L, where

    L=log(2Φ(S(𝕌𝕌)𝕎;𝒜)/α)log(2M)log2(log(2d)+n)+log(ϵ1),L=\log\left(2\Phi(S(\mathbb{U}^{\prime}-\mathbb{U}^{\prime})\cup\mathbb{W};\mathcal{A})/\alpha\right)\cdot\log(2M)\cdot\log^{2}(\log(2d)+n)+\log(\epsilon^{-1}),

    or

  2. (b)

    mα1Φ(S(𝕍)𝕎;𝒜)Lm\gtrsim\alpha^{-1}\cdot\Phi(S(\mathbb{V})\cup\mathbb{W};\mathcal{A})\cdot L, where

    L=log(2Φ(S(𝕍)𝕎;𝒜)/α)log(2M)log2(log(2d)+n)+log(ϵ1),L=\log(2\Phi(S(\mathbb{V})\cup\mathbb{W};\mathcal{A})/\alpha)\cdot\log(2M)\cdot\log^{2}(\log(2d)+n)+\log(\epsilon^{-1}),

    respectively.

These theorems are simplified versions of our main results, Theorems 3.6 and 3.7. In the full results, besides allowing for different sampling distributions 𝒜1,,𝒜m\mathcal{A}_{1},\ldots,\mathcal{A}_{m}, we also consider inexact minimizers of the empirical least-squares fit (1.3). Moreover, we provide a refinement of condition (a) in which the linear dependence on Φ(S(𝕌𝕌);𝒜)\Phi(S(\mathbb{U}^{\prime}-\mathbb{U}^{\prime});\mathcal{A}) (or Φ(S(𝕌𝕌)𝕎;𝒜)\Phi(S(\mathbb{U}^{\prime}-\mathbb{U}^{\prime})\cup\mathbb{W};\mathcal{A}) in the case of Theorem 1.2) is replaced by Φ(S(𝕌);𝒜)\Phi(S(\mathbb{U}^{\prime});\mathcal{A}) (respectively, Φ(S(𝕌)𝕎;𝒜)\Phi(S(\mathbb{U}^{\prime})\cup\mathbb{W};\mathcal{A})), at the expense of a more complicated log factor.

Having presented our main results in §3, we then describe how this framework unifies, generalizes and, in some cases, improves known results. In particular, we recover known bounds for compressed sensing with isotropic vectors and for matrix sketching by random sampling (see §3.4). We then discuss the application of this framework to two different problems.

  1. (a)

    Active learning in regression (§4). Here, we extend and improve recent work [6] by applying our main results to derive a random (importance) sampling strategy based on a certain Christoffel function (also known as the leverage score function). We show that this strategy, termed Christoffel sampling, leads to near-optimal sample complexity bounds in a number of important settings.

  2. (b)

    Compressed sensing with generative models (§5). Here we extend and improve recent work [15] by obtaining learning guarantees for general types of measurements in the case where 𝕌\mathbb{U} is the range of a ReLU generative model. We also provide theoretical analysis for an active learning strategy first introduced in [6] and subsequently considered in [16].

1.3 Related work

Our sampling framework is inspired by previous work in compressed sensing, notably compressed sensing with isotropic vectors [22], which was later extended in [9, Chpt. 12] to compressed sensing with jointly isotropic vectors. This considers the specific case where 𝕏0=N\mathbb{X}_{0}=\mathbb{C}^{N}, 𝕐1==𝕐m=\mathbb{Y}_{1}=\cdots=\mathbb{Y}_{m}=\mathbb{C} and 𝕌\mathbb{U} is the set of ss-sparse vectors, i.e., the target object is an (approximately) sparse vector and the sampling operators are linear functionals. Note that isotropy would correspond to the case α=β=1\alpha=\beta=1 in (1.1). We relax this to allow αβ\alpha\neq\beta. Within the compressed sensing literature, there are a number of works that allow for non-scalar valued measurements. See [17, 20] for an instance of vector-valued measurements (‘block sampling’) and [61] as well as [4, 33] for Hilbert-valued measurements.

Recovery guarantees in compressed sensing are generally derived for specific model classes, such as sparse vectors or various generalizations (e.g., joint sparse vectors, block sparse vectors, sparse in levels vectors, and so forth [10, 14, 19, 30, 38, 60]). Guarantees for general model classes are usually only presented in the case of (sub)Gaussian random measurements (see e.g., [14, 34]). These, while mathematically elegant, are typically not useful in practice. Our framework provides a unified set of recovery guarantees for very general types of measurements. It contains subsampled unitary matrices (a well-known measurement modality in compressed sensing, with practical relevance – see Example 2) as a special case, but also many others, including non-scalar valued measurements.

The proofs of our main results adopt similar ideas to those used in classical compressed sensing (see, e.g., [43, Chpt. 12] or [9, Chpt. 13]). In particular, they rely on Dudley’s inequality, Maurey’s lemma and a version of Talagrand’s theorem. Our main innovations involve the significant generalization of these arguments to, firstly, much broader classes of sampling problems (i.e., not just linear functionals of finite vectors), and secondly, to arbitrary model classes, rather than classes of (structured) sparse vectors. Our results broaden and strengthen recent results in the active learning context found in [6] and [39]. In particular, [6] assumes a union-of-subspaces model and then uses matrix Chernoff-type estimates. This is similar (although less general in terms of the type of measurements allowed) to our condition (b) in Theorem 1.1. Besides the generalization in terms of the measurements, our main effort is to derive the stronger condition (a) in this theorem, as well as Theorem 1.2. The work [39] makes very few assumptions on 𝕌\mathbb{U}, then uses Hoeffding’s inequality and covering number arguments. As noted in [6, §A.3] the trade-off for this high level of generality is weaker theoretical guarantees.

Active learning is a large topic. For function regression in linear spaces, a now well-known solution involves sampling according to the Christoffel function [28] or leverage score function [12, 24, 25, 32, 41, 44, 50] of the subspace. A number of these works have extended this to certain nonlinear spaces, such as Fourier sparse functions [41] and single-neuron models [44]. In our previous work [6], we extended this to more general nonlinear spaces and other types of measurements. As noted above, this work improves the theoretical guarantees in [6]. In particular, we show the usefulness of Christoffel sampling for more general types of nonlinear model classes.

Compressed sensing with generative models was introduced in [18], and has proved useful in image reconstruction tasks such as Magnetic Resonance Imaging (MRI) (see [48] and references therein). Initial learning guarantees for generative models involved (sub)Gaussian random measurements [18]. Guarantees for randomly subsampled unitary matrices were established in [15] for ReLU generative networks, and later extended in [16] for the nonuniformly subsampled case. As noted above, our work refines and further generalizes this analysis to more general types of measurements.

1.4 Outline

The remainder of this paper proceeds as follows. First, in §2 we present a series of examples that showcase the generality of the framework introduced in §1.1. In the next section, §3, we present our main results. Next we discuss the application to active learning in regression (§4) and compressed sensing with generative models (§5). Finally, in §6 we give the proofs of our main theorems.

2 Examples

In this section, we present a series of different sampling problems that can be cast into this unified framework. We will return to these examples after stating our main learning guarantees in the next section.

  • Example 2.1 (Function regression from i.i.d. samples)

    Let DdD\subseteq\mathbb{R}^{d} be a domain with a measure ρ\rho and consider the problem of learning an unknown function fLρ2(D)f\in L^{2}_{\rho}(D) from mm pointwise samples (xi,f(xi))(x_{i},f(x_{i})), i=1,,mi=1,\ldots,m, drawn i.i.d. from some probability measure μ\mu on DD. To cast this problem in the framework of §1.1, we make the (mild) assumption that μ\mu is absolutely continuous with respect to ρ\rho with Radon–Nikodym derivative dμ/dρ=ν\,\mathrm{d}\mu/\,\mathrm{d}\rho=\nu that is positive almost everywhere. Let 𝕏=Lρ2(D)\mathbb{X}=L^{2}_{\rho}(D), 𝕏0=C(D¯)\mathbb{X}_{0}=C(\overline{D}), 𝕐1==𝕐m=𝕐=\mathbb{Y}_{1}=\cdots=\mathbb{Y}_{m}=\mathbb{Y}=\mathbb{R} with the Euclidean inner product and 𝒜1=𝒜m=𝒜\mathcal{A}_{1}=\cdots\mathcal{A}_{m}=\mathcal{A} be defined by A𝒜A\sim\mathcal{A} if A(f)=f(x)/ν(x)A(f)=f(x)/\sqrt{\nu(x)} for xμx\sim\mu. Observe that nondegeneracy (1.1) holds with α=β=1\alpha=\beta=1 in this case, since

    1mi=1m𝔼Ai𝒜iAi(f)𝕐i2=𝔼xμ|f(x)|2ν(x)=D|f(x)|2dρ(x)=f𝕏2,f𝕏0.\frac{1}{m}\sum^{m}_{i=1}\mathbb{E}_{A_{i}\sim\mathcal{A}_{i}}{\|A_{i}(f)\|}^{2}_{\mathbb{Y}_{i}}=\mathbb{E}_{x\sim\mu}\frac{|f(x)|^{2}}{\nu(x)}=\int_{D}|f(x)|^{2}\,\mathrm{d}\rho(x)={\|f\|}^{2}_{\mathbb{X}},\quad\forall f\in\mathbb{X}_{0}.

    In this case, given an approximation space 𝕌C(D¯)\mathbb{U}\subseteq C(\overline{D}), the least-squares problem (1.3) becomes the (nonlinear) weighted-least squares fit

    f^argminu𝕌1mi=1m1ν(xi)|f(xi)+ν(xi)eiu(xi)|2.\hat{f}\in{\underset{u\in\mathbb{U}}{\operatorname{argmin}}}\frac{1}{m}\sum^{m}_{i=1}\frac{1}{\nu(x_{i})}|f(x_{i})+\sqrt{\nu(x_{i})}e_{i}-u(x_{i})|^{2}. (2.1)

    Note that it is common to consider the case μ=ρ\mu=\rho in such problems, in which case ν=1\nu=1 and (2.1) is an unweighted least-squares fit. However, this more general setup allows one to consider the active learning setting, where the sampling measure μ\mu is chosen judiciously in terms of 𝕌\mathbb{U} to improve the learning performance of f^\hat{f}. We discuss this in §4.
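
    To illustrate, the following is a minimal NumPy sketch of the weighted least-squares fit (2.1) on D=[-1,1] with ρ the (normalized) uniform measure. The target function, the Chebyshev-type sampling density ν and the (linear) Legendre polynomial model class are purely illustrative choices; a principled way to choose μ is discussed in §4.

        import numpy as np

        rng = np.random.default_rng(1)
        m, n = 200, 10

        f = lambda t: np.exp(t) * np.sin(4 * t)                       # target function (illustrative)
        nu = lambda t: 2.0 / (np.pi * np.sqrt(1.0 - t**2 + 1e-12))    # density d(mu)/d(rho), Chebyshev-type

        x_samp = np.cos(np.pi * rng.random(m))                  # x_i ~ mu (arcsine / Chebyshev law)
        w = 1.0 / nu(x_samp)                                    # weights 1/nu(x_i), as in (2.1)

        # Model class: span of the first n Legendre polynomials (a simple linear stand-in for U).
        V = np.polynomial.legendre.legvander(x_samp, n - 1)

        e = 0.001 * rng.standard_normal(m)
        b = f(x_samp) + np.sqrt(nu(x_samp)) * e                 # noisy samples, as in (2.1)

        sw = np.sqrt(w)
        c_hat, *_ = np.linalg.lstsq(sw[:, None] * V, sw * b, rcond=None)   # weighted least squares

        t = np.linspace(-1, 1, 2000)
        err = np.sqrt(np.mean((f(t) - np.polynomial.legendre.legval(t, c_hat)) ** 2))
        print("approximate L2_rho error:", err)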

  • Example 2.2 (Matrix sketching for large least-squares problems)

Let XN×nX\in\mathbb{C}^{N\times n}, NnN\geq n, be a given matrix and xNx\in\mathbb{C}^{N} a target vector. In many applications, it may be infeasible (due to computational constraints) to find a solution to the ‘full’ least-squares problem

wargminznXzx2.w\in{\underset{z\in\mathbb{C}^{n}}{\operatorname{argmin}}}{\|Xz-x\|}_{\ell^{2}}.

Therefore, one aims to instead find a sketching matrix Sm×NS\in\mathbb{C}^{m\times N} (a matrix with one nonzero per row) such that a minimizer

w^argminznSXzSx2\hat{w}\in{\underset{z\in\mathbb{C}^{n}}{\operatorname{argmin}}}{\|SXz-Sx\|}_{\ell^{2}} (2.2)

satisfies Xw^x22(1+ϵ)Xwx22{\|X\hat{w}-x\|}^{2}_{\ell^{2}}\leq(1+\epsilon){\|Xw-x\|}^{2}_{\ell^{2}} for some small ϵ>0\epsilon>0. A particularly effective way to do this involves constructing a random sketch. Formally, let π={π1,,πN}\pi=\{\pi_{1},\ldots,\pi_{N}\} be a discrete probability distribution on {1,,N}\{1,\ldots,N\}, i.e., 0πi10\leq\pi_{i}\leq 1 and i=1Nπi=1\sum^{N}_{i=1}\pi_{i}=1. We also assume that πi>0\pi_{i}>0 for all ii. Then we draw mm integers j1,,jmi.i.d.πj_{1},\ldots,j_{m}\sim_{\mathrm{i.i.d.}}\pi and set

Sij={1/πjij=ji0otherwise.S_{ij}=\begin{cases}1/\sqrt{\pi_{j_{i}}}&j=j_{i}\\ 0&\text{otherwise}\end{cases}.

Therefore, SXm×nSX\in\mathbb{C}^{m\times n} consists of mm rows of XX, the iith of which is scaled by the factor 1/πji1/\sqrt{\pi_{j_{i}}}. See [51, 64] for further discussion of matrix sketching.

To cast this problem into the above framework, we let 𝕏=𝕏0=N\mathbb{X}=\mathbb{X}_{0}=\mathbb{C}^{N} and 𝕐=\mathbb{Y}=\mathbb{C}, both equipped with the Euclidean norm. Note that bounded linear operators 𝕏0𝕐\mathbb{X}_{0}\rightarrow\mathbb{Y} are equivalent to column vectors aNa\in\mathbb{C}^{N} (via the relation xaxx\mapsto a^{*}x). Hence we define 𝒜1==𝒜m=𝒜\mathcal{A}_{1}=\cdots=\mathcal{A}_{m}=\mathcal{A} to be a distribution of vectors in N\mathbb{C}^{N} with a𝒜a\sim\mathcal{A} if

(a=ei/πi)=πi,i=1,,N.\mathbb{P}(a=e_{i}/\sqrt{\pi_{i}})=\pi_{i},\quad i=1,\ldots,N.

Observe that nondegeneracy (1.1) holds with α=β=1\alpha=\beta=1. Finally, we consider the model class

𝕌={Xz:zn}.\mathbb{U}=\{Xz:z\in\mathbb{C}^{n}\}. (2.3)

Then we readily see that (1.3) (with bi=(Sx)ib_{i}=(Sx)_{i}) is equivalent to (2.2) in the sense that x^=Xw^\hat{x}=X\hat{w} is a solution of (1.3) if and only if w^\hat{w} is a solution of (2.2). In particular, Xw^x22=x^x22{\|X\hat{w}-x\|}^{2}_{\ell^{2}}={\|\hat{x}-x\|}^{2}_{\ell^{2}} is precisely the 𝕏\mathbb{X}-norm error of the estimator x^\hat{x}.
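
The following minimal NumPy sketch implements this random row-sampling sketch, here with the uniform choice π_i = 1/N for illustration; the leverage score choice of π, which yields near-optimal guarantees, is discussed in §3.4.

    import numpy as np

    rng = np.random.default_rng(2)
    N, n, m = 2000, 20, 200
    X = rng.standard_normal((N, n))
    x = X @ rng.standard_normal(n) + 0.05 * rng.standard_normal(N)   # target vector

    pi = np.full(N, 1.0 / N)                       # sampling distribution (uniform, for illustration)
    rows = rng.choice(N, size=m, p=pi)             # j_1, ..., j_m drawn i.i.d. from pi
    scale = 1.0 / np.sqrt(pi[rows])                # row scalings 1/sqrt(pi_{j_i})

    SX = scale[:, None] * X[rows, :]               # S X, without forming S explicitly
    Sx = scale * x[rows]

    w_hat, *_ = np.linalg.lstsq(SX, Sx, rcond=None)    # sketched problem (2.2)
    w_full, *_ = np.linalg.lstsq(X, x, rcond=None)     # full problem, for comparison
    print(np.linalg.norm(X @ w_hat - x), np.linalg.norm(X @ w_full - x))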

  • Example 2.3 (Function regression with vector-valued measurements)

    Regression problems in various applications call for learning vector- as opposed to scalar-valued functions. These are readily incorporated into this framework by modifying Example 2.

    Hilbert-valued functions. Let 𝕍\mathbb{V} be a Hilbert space and consider the Hilbert-valued function f:D𝕍f:D\rightarrow\mathbb{V}. The problem of learning Hilbert-valued functions arises in several applications, including parametric Differential Equations (DEs) in computational Uncertainty Quantification (UQ). Here, ff represents the parameters-to-solution map of a DE involving parameters xDx\in D, which, for each xDx\in D, yields the Hilbert-valued output f(x)𝕍f(x)\in\mathbb{V}, namely the solution of the DE at those parameter values (see [5, 27, 33] and references therein for discussion).

    To cast this problem into the general framework, we modify Example 2 by letting 𝕏=Lρ2(D;𝕍)\mathbb{X}=L^{2}_{\rho}(D;\mathbb{V}) be the Bochner space of strongly ρ\rho-measurable functions D𝕍D\rightarrow\mathbb{V}, 𝕏0=C(D¯;𝕍)\mathbb{X}_{0}=C(\overline{D};\mathbb{V}) be the space of continuous, 𝕍\mathbb{V}-valued functions, 𝕐1==𝕐m=𝕍\mathbb{Y}_{1}=\cdots=\mathbb{Y}_{m}=\mathbb{V} and 𝒜\mathcal{A} be the distribution of bounded linear operators 𝕏0𝕍\mathbb{X}_{0}\rightarrow\mathbb{V} with A𝒜A\sim\mathcal{A} if A(f)=f(x)/ν(x)𝕍A(f)=f(x)/\sqrt{\nu(x)}\in\mathbb{V} for f𝕏0f\in\mathbb{X}_{0} and xμx\sim\mu. Nondegeneracy holds (1.1) as before with α=β=1\alpha=\beta=1.

    Continuous-in-time sampling. Consider a function f:D×[0,T]f:D\times[0,T]\rightarrow\mathbb{R} depending on a spatial variable xDx\in D and a time variable t[0,T]t\in[0,T]. In some applications, such as seismology, one assumes continuous (or dense) sampling in time with discrete (i.e., sparse) sampling in space. Thus, each measurement takes the form {f(x,t):0tT}\{f(x,t):0\leq t\leq T\} for fixed xDx\in D.

    It transpires that we can cast this problem into a Hilbert-valued function approximation problem by letting 𝕍=Lσ2(0,T)\mathbb{V}=L^{2}_{\sigma}(0,T) so that 𝕏=Lρ2(D;𝕍)Lρ×σ2(D×[0,T])\mathbb{X}=L^{2}_{\rho}(D;\mathbb{V})\cong L^{2}_{\rho\times\sigma}(D\times[0,T]). Hence, continuous-in-time sampling is also covered by the general framework.

    Gradient-augmented measurements. Another extension of Example 2 involves gradient-augmented measurements. This can be considered both in the scalar- or Hilbert-valued setting. In this problem, each measurement takes the form (x,f(x),xf(x))(x,f(x),\nabla_{x}f(x)), i.e., we sample both the function and its gradient with respect to the variable xx at each sample point. This problem arises in a number of applications, including parametric DEs and UQ [11, 45, 55, 54, 53], seismology and Physics-Informed Neural Networks (PINNs) for PDEs [42, 65] and deep learning [29]. Assuming the Hilbert-valued setup of Example 2, we can cast this into the general framework by letting 𝕏\mathbb{X} be the Sobolev space Hρ1(D)H^{1}_{\rho}(D), 𝕏0=C1(D¯)\mathbb{X}_{0}=C^{1}(\overline{D}), 𝕐1==𝕐m=𝕐=𝕍d+1\mathbb{Y}_{1}=\cdots=\mathbb{Y}_{m}=\mathbb{Y}=\mathbb{V}^{d+1} with the Euclidean inner product and defining A𝒜A\sim\mathcal{A} if A(f)=(f(x),f(x))/ν(x)d+1A(f)=(f(x),\nabla f(x))/\sqrt{\nu(x)}\in\mathbb{R}^{d+1} for f𝕏0f\in\mathbb{X}_{0} and xμx\sim\mu. It is readily checked that nondegeneracy (1.1) holds with α=β=1\alpha=\beta=1, as before.
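
    As a simple illustration of gradient-augmented sampling, the following NumPy sketch fits a univariate function from samples of both its values and its derivative, assuming μ=ρ (so ν≡1) and a linear Legendre polynomial model class; all of these choices are illustrative, and the fit is the plain least squares obtained by stacking the value and derivative samples.

        import numpy as np

        rng = np.random.default_rng(3)
        m, n = 60, 8

        f  = lambda t: np.sin(3 * t)              # target function and its derivative (illustrative)
        df = lambda t: 3 * np.cos(3 * t)

        x_samp = 2 * rng.random(m) - 1            # x_i ~ mu = rho = uniform on [-1, 1]

        # Each sample contributes the pair (u(x_i), u'(x_i)) for u in the Legendre model class.
        leg = np.polynomial.legendre
        V  = leg.legvander(x_samp, n - 1)
        dV = np.stack([leg.legval(x_samp, leg.legder(np.eye(n)[j])) for j in range(n)], axis=1)

        A_stacked = np.vstack([V, dV])            # stacked measurement operator
        b = np.concatenate([f(x_samp), df(x_samp)])
        c_hat, *_ = np.linalg.lstsq(A_stacked, b, rcond=None)
        print("fitted Legendre coefficients:", np.round(c_hat, 3))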

In the next four examples, we show how this framework includes as special cases various general sampling models from the compressed sensing literature.

  • Example 2.4 (Compressed sensing with isotropic vectors)

    Classical compressed sensing concerns learning a sparse approximation to an unknown vector xNx\in\mathbb{C}^{N} from mm linear measurements. A well-known model involves sampling with isotropic vectors [22] (see also [9, Chpt. 11]). We can cast this in the above framework as follows. Let 𝕏=𝕏0=N\mathbb{X}=\mathbb{X}_{0}=\mathbb{C}^{N} equipped with the Euclidean inner product and 𝕐1==𝕐m=𝕐=\mathbb{Y}_{1}=\cdots=\mathbb{Y}_{m}=\mathbb{Y}=\mathbb{C}. As in Example 2, we consider 𝒜1==𝒜m=𝒜\mathcal{A}_{1}=\cdots=\mathcal{A}_{m}=\mathcal{A} to be a distribution of vectors in N\mathbb{C}^{N} that are isotropic, i.e.,

    𝔼a𝒜aa=I,\mathbb{E}_{a\sim\mathcal{A}}aa^{*}=I, (2.4)

    where II is the N×NN\times N identity matrix. Note that (1.1) holds with α=β=1\alpha=\beta=1 in this case. Moreover, the measurements (1.2) have the form

    bi=aix+ei,i=1,,m,b_{i}=a^{*}_{i}x+e_{i}\in\mathbb{C},\quad i=1,\ldots,m,

    where a1,,ami.i.d.𝒜a_{1},\ldots,a_{m}\sim_{\mathrm{i.i.d.}}\mathcal{A}. In matrix-vector notation, we can rewrite this as

    b=Ax+e,where A=1m[a1am]m×N,b=Ax+e,\quad\text{where }A=\frac{1}{\sqrt{m}}\begin{bmatrix}a^{*}_{1}\\ \vdots\\ a^{*}_{m}\end{bmatrix}\in\mathbb{C}^{m\times N}, (2.5)

    b=1m(bi)i=1mmb=\frac{1}{\sqrt{m}}(b_{i})^{m}_{i=1}\in\mathbb{C}^{m} and e=1m(ei)i=1me=\frac{1}{\sqrt{m}}(e_{i})^{m}_{i=1} (the factor 1/m1/\sqrt{m} is a convention: due to (2.4), it ensures that 𝔼(AA)=I\mathbb{E}(A^{*}A)=I).

    As discussed in [22], this model includes not only the well-known case of subgaussian random matrices, in which case 𝒜\mathcal{A} is a distribution of subgaussian random vectors, but also many other common sampling models used in signal and image processing applications. It is also a generalization of the concept of random sampling with bounded orthonormal systems (see, e.g., [43, Chpt. 12]). Moreover, if we slightly relax (2.4) to

    αI𝔼a𝒜aaβI,\alpha I\preceq\mathbb{E}_{a\sim\mathcal{A}}aa^{*}\preceq\beta I,

    so that (1.1) holds with the same values of α\alpha and β\beta, then it also generalizes the bounded Riesz systems studied in [21].

  • Example 2.5 (Compressed sensing with subsampled unitary matrices)

    A particular case of interest within the previous example is the class of randomly subsampled unitary matrices (see also [9, Defn. 5.6 and Examp. 11.5]). Let UN×NU\in\mathbb{C}^{N\times N} be unitary, i.e., UU=IU^{*}U=I. For example, UU may be the matrix of the Discrete Fourier Transform (DFT) in a Fourier sensing problem. Let ui=Ueiu_{i}=U^{*}e_{i}, where eie_{i} is the iith canonical basis vector, and define the (discrete) distribution of vectors 𝒜\mathcal{A} by a𝒜a\sim\mathcal{A} if

    (a=Nui)=1N,i=1,,N.\mathbb{P}(a=\sqrt{N}u_{i})=\frac{1}{N},\quad i=1,\ldots,N.

    One readily checks that (2.4) holds in this case. Hence this family is isotropic. Note that the corresponding measurement matrix (2.5) consists of rows of UU – hence the term ‘subsampled unitary matrix’.

    This setup can be straightforwardly extended to the case where the rows of UU are sampled with different probabilities, termed a nonuniformly subsampled unitary matrix. Let π=(π1,,πN)\pi=(\pi_{1},\ldots,\pi_{N}) be a discrete probability distribution on {1,,N}\{1,\ldots,N\}, i.e., 0πi10\leq\pi_{i}\leq 1 and i=1Nπi=1\sum^{N}_{i=1}\pi_{i}=1. We also assume that πi>0\pi_{i}>0 for all ii. Then we now modify the distribution 𝒜\mathcal{A} so that a𝒜a\sim\mathcal{A} if

    (a=ui/πi)=πi,i=1,,N.\mathbb{P}(a=u_{i}/\sqrt{\pi_{i}})=\pi_{i},\quad i=1,\ldots,N.

    It is readily checked that (2.4) also holds in this case, making this family once more isotropic. Note that uniform subsampling is restored simply by setting πi=1/N\pi_{i}=1/N, i\forall i.

    Nonuniform subsampling is important in various applications. For instance, in applications that involve Fourier measurements (e.g., MRI, NMR, Helium Atom Scattering and radio interferometry) it is usually desirable to sample rows corresponding to low frequencies more densely than rows corresponding to high frequencies. See, e.g., [9] and references therein.
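
    The following NumPy sketch constructs the measurement matrix (2.5) for a nonuniformly subsampled DFT matrix; the particular variable-density distribution π used here is an illustrative choice.

        import numpy as np

        rng = np.random.default_rng(4)
        N, m = 256, 64
        U = np.fft.fft(np.eye(N), norm="ortho")              # unitary DFT matrix

        # Variable-density sampling: lower frequencies drawn more often (illustrative choice of pi).
        dist = np.minimum(np.arange(N), N - np.arange(N))    # distance to the zero frequency
        pi = 1.0 / (1.0 + dist.astype(float))
        pi /= pi.sum()

        rows = rng.choice(N, size=m, p=pi)                   # sample row indices with replacement
        A = U[rows, :] / np.sqrt(m * pi[rows])[:, None]      # measurement matrix (2.5)

        # Isotropy (2.4): sum_i pi_i (u_i/sqrt(pi_i))(u_i/sqrt(pi_i))^* = U^* U = I, independently of pi.
        G = sum(np.outer(U[i, :].conj(), U[i, :]) for i in range(N))
        print(A.shape, np.allclose(G, np.eye(N)))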

  • Example 2.6 (Sampling without replacement)

    The previous example considers random sampling of the rows of UU with replacement. In particular, repeats are possible. In practice, it is usually desirable to avoid repeats. One way to do this is to use Bernoulli selectors (see [9, §11.4.3] or [15]). Here, the decision of whether to include a row of UU or not is based on the outcome of a (biased) coin toss, i.e., a realization of a Bernoulli random variable.

    Specifically, let π=(π1,,πN)\pi=(\pi_{1},\ldots,\pi_{N}) satisfy 0<πi1/m0<\pi_{i}\leq 1/m and i=1Nπi=1\sum^{N}_{i=1}\pi_{i}=1. Then, for each i=1,,Ni=1,\ldots,N, define the distribution 𝒜i\mathcal{A}_{i} by ai𝒜ia_{i}\sim\mathcal{A}_{i} if

    (ai=Nmπiui)=mπi,(ai=0)=1mπi.\mathbb{P}\left(a_{i}=\sqrt{\frac{N}{m\pi_{i}}}u_{i}\right)=m\pi_{i},\quad\mathbb{P}(a_{i}=0)=1-m\pi_{i}.

    Notice that the collection {𝒜i}i=1N\{\mathcal{A}_{i}\}^{N}_{i=1} is nondegenerate. Indeed,

    1Ni=1N𝔼ai𝒜iaiai=1Ni=1NmπiNmπiuiui=UU=I.\frac{1}{N}\sum^{N}_{i=1}\mathbb{E}_{a_{i}\sim\mathcal{A}_{i}}a_{i}a^{*}_{i}=\frac{1}{N}\sum^{N}_{i=1}m\pi_{i}\frac{N}{m\pi_{i}}u_{i}u^{*}_{i}=U^{*}U=I.

    Hence, this model falls within the general setup. Later, in §5.2 and Corollary 5.4, we will see that this model admits essentially the same learning guarantees as the model considered in Example 2. Observe that the measurement matrix (2.5) in this case is N×NN\times N, where the iith row is proportional to the iith row of UU if the iith coin toss returns a heads, and zero otherwise. In particular, it is equivalent to a q×Nq\times N matrix, where qq is the number of heads. Notice that, unlike in the previous model, the number of measurements qq is a random variable with 𝔼(q)=m\mathbb{E}(q)=m.

The reader will notice that in the previous examples, the distributions 𝒜1,,𝒜m\mathcal{A}_{1},\ldots,\mathcal{A}_{m} were all equal. We now present several examples that justify considering different distributions.

  • Example 2.7 (Half-half and multilevel sampling)

    Recall that Example 2 arises in various applications involving Fourier measurements. In these applications, it is often desirable to fully sample the low-frequency regime [49] (see also [9]). The low frequencies of a signal or image contain much of its energy. Therefore, any purely random sampling scheme can miss important information with nonzero probability.

    A simple way to avoid this problem is a half-half sampling scheme [10, 58]. Here the first m1m_{1} measurements are set equal to the lowest m1m_{1} frequencies (i.e., the first m1m_{1} samples are deterministic) and the remaining m2:=mm1m_{2}:=m-m_{1} samples are obtained by randomly sampling the remaining Nm1N-m_{1} frequencies according to some distribution. This can be formulated within our framework as follows. Let UN×NU\in\mathbb{C}^{N\times N} be unitary (typically, UU is the DFT matrix, but this is not needed for the current discussion) and ui=Ueiu_{i}=U^{*}e_{i}, as before. Then, for i=1,,m1i=1,\ldots,m_{1}, we define the distribution 𝒜i\mathcal{A}_{i} by a𝒜ia\sim\mathcal{A}_{i} if

    (a=mui)=1,i=1,,m1.\mathbb{P}(a=\sqrt{m}u_{i})=1,\quad i=1,\ldots,m_{1}.

    In other words, a𝒜ia\sim\mathcal{A}_{i} takes value mui\sqrt{m}u_{i} with probability one. Now let π=(πm1+1,,πN)\pi=(\pi_{m_{1}+1},\ldots,\pi_{N}) be a discrete probability distribution on {m1+1,,N}\{m_{1}+1,\ldots,N\} with 0<πi10<\pi_{i}\leq 1 for all ii. Then we define 𝒜m1+1==𝒜m=𝒜\mathcal{A}_{m_{1}+1}=\cdots=\mathcal{A}_{m}=\mathcal{A} with a𝒜a\sim\mathcal{A} if

    (a=m/(m2πi)ui)=πi,i=m1+1,,N.\mathbb{P}(a=\sqrt{m/(m_{2}\pi_{i})}u_{i})=\pi_{i},\quad i=m_{1}+1,\ldots,N.

    We readily see that the family {𝒜i}i=1m\{\mathcal{A}_{i}\}^{m}_{i=1} is nondegenerate with α=β=1\alpha=\beta=1. Notice that the first m1m_{1} rows of the measurement matrix obtained from this procedure are proportional to the first m1m_{1} rows of UU and the remaining m2=mm1m_{2}=m-m_{1} rows are randomly sampled according to π\pi from the last Nm1N-m_{1} rows of UU.

    Half-half sampling schemes have been used in various imaging applications [10, 49, 23, 57, 58]. Note that such a scheme subdivides the indices {1,,N}\{1,\ldots,N\} into two levels: fully sampled and randomly subsampled. In practice, it is often desirable to have further control over the sampling, by subdividing into more than two levels. This is known as multilevel random subsampling [10] (see also [9, Chpts. 4 & 11]). It can be formulated within this framework in a similar way.
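
    The following short sketch constructs the measurement matrix for a half-half scheme based on a DFT matrix, using a uniform distribution π on the remaining rows as an illustrative choice.

        import numpy as np

        rng = np.random.default_rng(6)
        N, m, m1 = 256, 80, 32                    # fully sample the m1 lowest frequencies
        m2 = m - m1
        U = np.fft.fft(np.eye(N), norm="ortho")

        A_det = U[:m1, :]                         # deterministic block: rows included with probability one
        pi = np.full(N - m1, 1.0 / (N - m1))      # distribution on the remaining rows (uniform here)
        rows = m1 + rng.choice(N - m1, size=m2, p=pi)
        A_rnd = U[rows, :] / np.sqrt(m2 * pi[rows - m1])[:, None]

        A = np.vstack([A_det, A_rnd])             # overall m x N measurement matrix
        print(A.shape)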

  • Example 2.8 (Multimodal data)

    Another benefit of allowing for different distributions is that it can model multimodal data. Consider the general setup of §1.1. We now assume that there are C>1C>1 different types of data, with the ccth type generating mcm_{c} samples via a distribution 𝒜(c)\mathcal{A}^{(c)}, c=1,,Cc=1,\ldots,C, of bounded linear operators in (𝕏0,𝕐(c))\mathcal{B}(\mathbb{X}_{0},\mathbb{Y}^{(c)}). Let m=m1++mCm=m_{1}+\cdots+m_{C} and define {𝒜i}i=1m\{\mathcal{A}_{i}\}^{m}_{i=1} by

    𝒜i=𝒜(c)if m1++mc1<im1++mc.\mathcal{A}_{i}=\mathcal{A}^{(c)}\quad\text{if }m_{1}+\cdots+m_{c-1}<i\leq m_{1}+\cdots+m_{c}.

    Thus, the first m1m_{1} samples are generated by 𝒜(1)\mathcal{A}^{(1)}, the next m2m_{2} samples by 𝒜(2)\mathcal{A}^{(2)}, and so forth. Notice that nondegeneracy (1.1) is now equivalent to the condition

    αx𝕏2c=1Cmcm𝔼A𝒜(c)A(x)𝕐(c)2βx𝕏2.\alpha{\|x\|}^{2}_{\mathbb{X}}\leq\sum^{C}_{c=1}\frac{m_{c}}{m}\mathbb{E}_{A\sim\mathcal{A}^{(c)}}{\|A(x)\|}^{2}_{\mathbb{Y}^{(c)}}\leq\beta{\|x\|}^{2}_{\mathbb{X}}.

    Multimodal data was previously considered in [6], which, as noted in §1 is a special case of this work. As observed in [6], multimodal data arises in various applications, such as multi-sensor imaging systems [26] and PINNs for PDEs [46, 56]. This includes the important case of parallel MRI [52], which is used widely in medical practice. Another application involves an extension of the gradient-augmented learning problem in Example 2 where, due to cost or other constraints, one can only afford to measure gradients in addition to function values at some fraction of the total samples. See [6, §B.7].

3 Learning guarantees

In this section, we present our main results.

3.1 Additional notation

We first require additional notation. Let (𝕐¯,,𝕐¯)(\overline{\mathbb{Y}},\langle\cdot,\cdot\rangle_{\overline{\mathbb{Y}}}) be the Hilbert space defined as the direct sum of the 𝕐i\mathbb{Y}_{i}, i.e.,

𝕐¯=𝕐1𝕐m.\overline{\mathbb{Y}}=\mathbb{Y}_{1}\oplus\cdots\oplus\mathbb{Y}_{m}.

Next, let 𝒜¯\bar{\mathcal{A}} be the distribution of bounded linear operators in (𝕏0,𝕐¯)\mathcal{B}(\mathbb{X}_{0},\overline{\mathbb{Y}}) induced by the family {𝒜i}i=1m\{\mathcal{A}_{i}\}^{m}_{i=1}. In other words, A¯𝒜¯\bar{A}\sim\bar{\mathcal{A}} if

A¯(x)=1m(A1(x),,Am(x)),\bar{A}(x)=\frac{1}{\sqrt{m}}(A_{1}(x),\ldots,A_{m}(x)),

where the AiA_{i} are independent with Ai𝒜iA_{i}\sim\mathcal{A}_{i} for each ii (we include the factor 1/m1/\sqrt{m} for convenience). With this notation, observe that nondegeneracy (1.1) is equivalent to

αx𝕏2𝔼A¯𝒜¯A¯(x)𝕐¯2βx𝕏2,x𝕏0\alpha{\|x\|}^{2}_{\mathbb{X}}\leq\mathbb{E}_{\bar{A}\sim\bar{\mathcal{A}}}{\|\bar{A}(x)\|}^{2}_{\overline{\mathbb{Y}}}\leq\beta{\|x\|}^{2}_{\mathbb{X}},\quad\forall x\in\mathbb{X}_{0} (3.1)

and the least-squares problem (1.3) is equivalent to

x^argminu𝕌b¯A¯(u)𝕐¯2,\hat{x}\in{\underset{u\in\mathbb{U}}{\operatorname{argmin}}}{\|\bar{b}-\bar{A}(u)\|}^{2}_{\overline{\mathbb{Y}}}, (3.2)

where b¯=1m(b1,,bm)𝕐¯\bar{b}=\frac{1}{\sqrt{m}}(b_{1},\ldots,b_{m})\in\overline{\mathbb{Y}}. For convenience, we also write e¯=1m(e1,,em)𝕐¯\bar{e}=\frac{1}{\sqrt{m}}(e_{1},\ldots,e_{m})\in\overline{\mathbb{Y}}.

In our main results, we consider approximate minimizers of (1.3) or, equivalently, (3.2). We now define the following.

Definition 3.1 (γ\gamma-minimizer).

Given γ0\gamma\geq 0 we say that x^𝕌\hat{x}\in\mathbb{U} is a γ\gamma-minimizer of (1.3) if

1mi=1mbiAi(x^)𝕐i2minu𝕌1mi=1mbiAi(u)𝕐i2+γ.\frac{1}{m}\sum^{m}_{i=1}{\|b_{i}-A_{i}(\hat{x})\|}^{2}_{\mathbb{Y}_{i}}\leq\min_{u\in\mathbb{U}}\frac{1}{m}\sum^{m}_{i=1}{\|b_{i}-A_{i}(u)\|}^{2}_{\mathbb{Y}_{i}}+\gamma.

Finally, we require several additional concepts. Given a set 𝕍𝕏\mathbb{V}\subseteq\mathbb{X}, we now let

S(𝕍)={vv𝕏:v𝕍\{0}}.S(\mathbb{V})=\left\{\frac{v}{{\|v\|}_{\mathbb{X}}}:v\in\mathbb{V}\backslash\{0\}\right\}. (3.3)

Further, we say that 𝕍\mathbb{V} is a cone if tv𝕍tv\in\mathbb{V} whenever t0t\geq 0 and v𝕍v\in\mathbb{V}. Notice that in this case the set (3.3) is given by

S(𝕍)={v𝕍:v𝕏=1}S(\mathbb{V})=\left\{v\in\mathbb{V}:{\|v\|}_{\mathbb{X}}=1\right\}

i.e., it is the unit sphere of 𝕍\mathbb{V} with respect to the norm on 𝕏\mathbb{X}.

3.2 Variation

We now formally introduce the concept of variation, which is crucial to our analysis.

Definition 3.2 (Variation with respect to a distribution).

Let 𝕍𝕏0\mathbb{V}\subseteq\mathbb{X}_{0} and 𝕐\mathbb{Y} be a Hilbert space. Consider a distribution 𝒜\mathcal{A} of bounded linear operators in (𝕏0,𝕐)\mathcal{B}(\mathbb{X}_{0},\mathbb{Y}). The variation of 𝕍\mathbb{V} with respect to 𝒜\mathcal{A} is the smallest constant Φ=Φ(𝕍;𝒜)\Phi=\Phi(\mathbb{V};\mathcal{A}) such that

A(v)𝕐2Φ,v𝕍,a.s. A𝒜.{\|A(v)\|}^{2}_{\mathbb{Y}}\leq\Phi,\ \forall v\in\mathbb{V},\quad\text{a.s.\ $A\sim\mathcal{A}$}. (3.4)

If no such constant exists then we write Φ(𝕍;𝒜)=+\Phi(\mathbb{V};\mathcal{A})=+\infty.

Note that we specify Φ\Phi as the smallest constant such that (3.4) holds to ensure that it is well defined. However, in all our results Φ\Phi can be any constant such that (3.4) holds. This is relevant in our examples, since it means we only need to derive upper bounds for the variation.

Recall (3.3). We will generally consider the variation of the set S(𝕍)S(\mathbb{V}) rather than 𝕍\mathbb{V} in what follows. Note that, in this case, Φ=Φ(S(𝕍);𝒜)\Phi=\Phi(S(\mathbb{V});\mathcal{A}) is the smallest constant such that

A(v)𝕐2Φv𝕏2,v𝕍,a.s. A𝒜.{\|A(v)\|}^{2}_{\mathbb{Y}}\leq\Phi{\|v\|}^{2}_{\mathbb{X}},\quad\forall v\in\mathbb{V},\quad\text{a.s. }A\sim\mathcal{A}.

In the following example, we show how the variation generalizes the classical notion of coherence in compressed sensing. Later, in the context of active learning in §4, we show that it also relates to the Christoffel function (also known as the leverage score function).

  • Example 3.3 (Classical compressed sensing and coherence)

    Coherence is a well-known concept in compressed sensing (see, e.g., [22]). Consider the setting of Example 2 and let 1sN1\leq s\leq N. Classical compressed sensing considers the model class of ss-sparse vectors

    𝕌=Σs={xN:x is s-sparse}.\mathbb{U}=\Sigma_{s}=\{x\in\mathbb{C}^{N}:\text{$x$ is $s$-sparse}\}. (3.5)

    Recall that a vector is ss-sparse if it has at most ss nonzero entries. In this case, Φ(S(𝕌);𝒜)\Phi(S(\mathbb{U});\mathcal{A}) is the smallest constant such that

    |av|2Φv22,v s-sparse,a.s. a𝒜.|a^{*}v|^{2}\leq\Phi{\|v\|}^{2}_{\ell^{2}},\quad\forall\text{$v$ $s$-sparse},\ \text{a.s.\ $a\sim\mathcal{A}$}.

    Define the coherence μ=μ(𝒜)\mu=\mu(\mathcal{A}) of the distribution 𝒜\mathcal{A} (see [22] or [9, Defn. 11.16]) as the smallest constant such that

    a2μ(𝒜),a.s. a𝒜.{\|a\|}^{2}_{\ell^{\infty}}\leq\mu(\mathcal{A}),\quad\text{a.s.\ $a\sim\mathcal{A}$}.

    Then, by the Cauchy–Schwarz inequality, we have

    |av|2μ(𝒜)sv22,v s-sparse.|a^{*}v|^{2}\leq\mu(\mathcal{A})s{\|v\|}^{2}_{\ell^{2}},\quad\forall\text{$v$ $s$-sparse}.

    We deduce that

    Φ(S(Σs);𝒜)=μ(𝒜)s.\Phi(S(\Sigma_{s});\mathcal{A})=\mu(\mathcal{A})s. (3.6)

    Thus, the variation is precisely the coherence μ(𝒜)\mu(\mathcal{A}) multiplied by the sparsity ss.

    In particular, consider Example 2, in which case 𝒜\mathcal{A} is a distribution of scaled rows of a unitary matrix U=(uij)i,j=1NU=(u_{ij})^{N}_{i,j=1}. We have

    μ(𝒜)=Nmaxij|uij|2=:μ(U),\mu(\mathcal{A})=N\max_{ij}|u_{ij}|^{2}=:\mu(U), (3.7)

    where μ(U)\mu(U) is the coherence of the matrix UU (see, e.g., [9, Defn. 5.8] or [36, 37, 40]). Note that 1μ(U)N1\leq\mu(U)\leq N. A matrix is said to be incoherent if μ(U)1\mu(U)\approx 1.
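
    To illustrate (3.6) and (3.7), the following NumPy sketch computes the coherence μ(U) of a unitary DFT matrix (incoherent, μ(U)=1) and of the identity matrix (maximally coherent, μ(U)=N), together with the corresponding variation of S(Σ_s).

        import numpy as np

        def coherence(U):
            """Coherence mu(U) = N * max_{ij} |u_ij|^2, as in (3.7)."""
            return U.shape[0] * np.max(np.abs(U)) ** 2

        N, s = 256, 10
        F = np.fft.fft(np.eye(N), norm="ortho")    # unitary DFT: mu(U) = 1
        I = np.eye(N)                              # identity: mu(U) = N

        for name, U in [("DFT", F), ("identity", I)]:
            mu = coherence(U)
            print(name, ": mu(U) =", mu, ", variation of S(Sigma_s) =", s * mu)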

We now require several additional definitions.

Definition 3.4 (Constant distribution).

A distribution 𝒟\mathcal{D} is constant if X𝒟X\sim\mathcal{D} takes only a single value with probability one, i.e., (X=c)=1\mathbb{P}(X=c)=1 for some cc. Otherwise it is nonconstant.

Definition 3.5 (Variation with respect to a collection).

Let 𝒜¯\bar{\mathcal{A}} be the distribution of bounded linear operators induced by the family {𝒜i}i=1m\{\mathcal{A}_{i}\}^{m}_{i=1}, where each 𝒜i\mathcal{A}_{i} is a distribution of bounded linear operators in (𝕏0,𝕐i)\mathcal{B}(\mathbb{X}_{0},\mathbb{Y}_{i}), and let

={i:𝒜i nonconstant}{1,,m}.\mathcal{I}=\{i:\text{$\mathcal{A}_{i}$ nonconstant}\}\subseteq\{1,\ldots,m\}. (3.8)

We define the variation of 𝕍\mathbb{V} with respect to 𝒜¯\bar{\mathcal{A}} as

Φ(𝕍;𝒜¯)=maxiΦ(𝕍;𝒜i),\Phi(\mathbb{V};\bar{\mathcal{A}})=\max_{i\in\mathcal{I}}\Phi(\mathbb{V};\mathcal{A}_{i}), (3.9)

and write Φ(𝕍;𝒜¯)=+\Phi(\mathbb{V};\bar{\mathcal{A}})=+\infty if no such constant exists.

The motivation for these definitions is as follows. Often in applications, one has some measurements that should always be included. For example, in a function regression problem, it may be desirable to sample the function at certain locations, based on known behaviour. Similarly, in a Fourier sampling problem, as discussed in Example 2, one may wish to fully sample all low frequencies up to some given maximum.

Mathematically, as demonstrated in Example 2, we can incorporate a deterministic measurement by defining a distribution that takes a single value with probability one. Since these measurements are always included, it is unnecessary to include them in the definition (3.9): as we will see, the variation is a tool that relates to the number of random measurements needed to ensure recovery. Later, when we consider sampling without replacement (Example 2), we will see that it is, in fact, vital that the variation be defined as a maximum over the nonconstant distributions only. See Corollary 5.4 and Remark 5.2.

3.3 Main results

We now present our main results.

Theorem 3.6.

Consider the setup of §1.1, let 𝕌𝕏0\mathbb{U}\subseteq\mathbb{X}_{0} and suppose that the difference set 𝕌=𝕌𝕌\mathbb{U}^{\prime}=\mathbb{U}-\mathbb{U} is such that

  1. (i)

    𝕌\mathbb{U}^{\prime} is a cone, and

  2. (ii)

    𝕌𝕍1𝕍d=:𝕍\mathbb{U}^{\prime}\subseteq\mathbb{V}_{1}\cup\cdots\cup\mathbb{V}_{d}=:\mathbb{V}, where each 𝕍i𝕏0\mathbb{V}_{i}\subseteq\mathbb{X}_{0} is a subspace of dimension at most nn.

Suppose that, for some 0<ϵ<10<\epsilon<1, either

  1. (a)

    mα1Φ(S(𝕌);𝒜¯)[log(2d/ϵ)+nlog(2γ(𝕌;𝒜¯))]m\gtrsim\alpha^{-1}\cdot\Phi(S(\mathbb{U}^{\prime});\bar{\mathcal{A}})\cdot\left[\log(2d/\epsilon)+n\log(2\gamma(\mathbb{U}^{\prime};\bar{\mathcal{A}}))\right], where

    γ(𝕌;𝒜¯)=min{Φ(S(𝕍);𝒜¯),Φ(S(𝕌𝕌);𝒜¯)}Φ(S(𝕌);𝒜¯),\gamma(\mathbb{U}^{\prime};\bar{\mathcal{A}})=\frac{\min\{\Phi(S(\mathbb{V});\bar{\mathcal{A}}),\Phi(S(\mathbb{U}^{\prime}-\mathbb{U}^{\prime});\bar{\mathcal{A}})\}}{\Phi(S(\mathbb{U}^{\prime});\bar{\mathcal{A}})},
  2. (b)

    mα1Φ(S(𝕌𝕌);𝒜¯)[log(2d/ϵ)+n]m\gtrsim\alpha^{-1}\cdot\Phi(S(\mathbb{U}^{\prime}-\mathbb{U}^{\prime});\bar{\mathcal{A}})\cdot\left[\log(2d/\epsilon)+n\right],

  3. (c)

    or mα1Φ(S(𝕍);𝒜¯)log(2nd/ϵ)m\gtrsim\alpha^{-1}\cdot\Phi(S(\mathbb{V});\bar{\mathcal{A}})\cdot\log(2nd/\epsilon).

Let x𝕏0x\in\mathbb{X}_{0}, θx𝕏\theta\geq{\|x\|}_{\mathbb{X}} and γ0\gamma\geq 0. Then

𝔼xxˇ𝕏2βαinfu𝕌xu𝕏2+θ2ϵ+γ2α+e𝕐¯2α,where xˇ=min{1,θ/x^𝕏}x^\mathbb{E}{\|x-\check{x}\|}^{2}_{\mathbb{X}}\lesssim\frac{\beta}{\alpha}\cdot\inf_{u\in\mathbb{U}}{\|x-u\|}^{2}_{\mathbb{X}}+\theta^{2}\epsilon+\frac{\gamma^{2}}{\alpha}+\frac{{\|e\|}^{2}_{\overline{\mathbb{Y}}}}{\alpha},\qquad\text{where }\check{x}=\min\{1,\theta/{\|\hat{x}\|}_{\mathbb{X}}\}\hat{x}

for any γ\gamma-minimizer x^\hat{x} of (1.3) with noisy measurements (1.2).

This result holds under general assumptions on 𝕌\mathbb{U}. However, as we discuss in Example 3.4 below, it may yield suboptimal bounds in certain cases, such as compressed sensing with sparse vectors. Fortunately, by making an additional assumption, we can resolve this shortcoming.

Theorem 3.7.

Consider the setup of §1.1 and let 𝕌𝕏0\mathbb{U}\subseteq\mathbb{X}_{0} be such that assumptions (i) and (ii) of Theorem 3.6 hold and also that

  1. (iii)

    {u𝕌:u𝕏1}conv(𝕎)\{u\in\mathbb{U}^{\prime}:{\|u\|}_{\mathbb{X}}\leq 1\}\subseteq\mathrm{conv}(\mathbb{W}), where 𝕎\mathbb{W} is a finite set of size |𝕎|=M|\mathbb{W}|=M.

Suppose that, for some 0<ϵ<10<\epsilon<1, either

  1. (a)

    mα1Φ(S(𝕌)𝕎;𝒜¯)Lm\gtrsim\alpha^{-1}\cdot\Phi(S(\mathbb{U}^{\prime})\cup\mathbb{W};\bar{\mathcal{A}})\cdot L, where

    L=log(2Φ(S(𝕌)𝕎;𝒜¯)/α)[log(2γ(𝕌;𝒜¯))+log(2M)log2(log(2d)+n)]+log(ϵ1)L=\log\left(2\Phi(S(\mathbb{U}^{\prime})\cup\mathbb{W};\bar{\mathcal{A}})/\alpha\right)\cdot\left[\log\left(2\gamma(\mathbb{U}^{\prime};\bar{\mathcal{A}})\right)+\log(2M)\cdot\log^{2}(\log(2d)+n)\right]+\log(\epsilon^{-1})

    and γ(𝕌;𝒜¯)\gamma(\mathbb{U}^{\prime};\bar{\mathcal{A}}) is as in (6.2);

  2. (b)

    mα1Φ(S(𝕌𝕌)𝕎;𝒜¯)L,m\gtrsim\alpha^{-1}\cdot\Phi(S(\mathbb{U}^{\prime}-\mathbb{U}^{\prime})\cup\mathbb{W};\bar{\mathcal{A}})\cdot L, where

    L=log(2Φ(S(𝕌)𝕎;𝒜¯)/α)log(2M)log2(log(2d)+n)+log(ϵ1);L=\log\left(2\Phi(S(\mathbb{U}^{\prime})\cup\mathbb{W};\bar{\mathcal{A}})/\alpha\right)\cdot\log(2M)\cdot\log^{2}(\log(2d)+n)+\log(\epsilon^{-1});

    or

  3. (c)

    mα1Φ(S(𝕍)𝕎;𝒜¯)Lm\gtrsim\alpha^{-1}\cdot\Phi(S(\mathbb{V})\cup\mathbb{W};\bar{\mathcal{A}})\cdot L, where

    L=log(2Φ(S(𝕍)𝕎;𝒜¯)/α)log(2M)log2(log(2d)+n)+log(ϵ1).L=\log\left(2\Phi(S(\mathbb{V})\cup\mathbb{W};\bar{\mathcal{A}})/\alpha\right)\cdot\log(2M)\cdot\log^{2}(\log(2d)+n)+\log(\epsilon^{-1}).

Let x𝕏0x\in\mathbb{X}_{0}, θx𝕏\theta\geq{\|x\|}_{\mathbb{X}} and γ0\gamma\geq 0. Then

𝔼xxˇ𝕏2βαinfu𝕌xu𝕏2+θ2ϵ+γ2α+e𝕐¯2α,where xˇ=min{1,θ/x^𝕏}x^\mathbb{E}{\|x-\check{x}\|}^{2}_{\mathbb{X}}\lesssim\frac{\beta}{\alpha}\cdot\inf_{u\in\mathbb{U}}{\|x-u\|}^{2}_{\mathbb{X}}+\theta^{2}\epsilon+\frac{\gamma^{2}}{\alpha}+\frac{{\|e\|}^{2}_{\overline{\mathbb{Y}}}}{\alpha},\qquad\text{where }\check{x}=\min\{1,\theta/{\|\hat{x}\|}_{\mathbb{X}}\}\hat{x}

for any γ\gamma-minimizer x^\hat{x} of (1.3) with noisy measurements (1.2).

We consider various applications of these results in §4 and §5. However, it is first informative to consider the implications of these results for several of our previous examples.

3.4 Examples

  • Example 3.8 (Classical compressed sensing with isotropic vectors)

    Consider the classical compressed sensing problem as in Example 2, with 𝕌=Σs\mathbb{U}=\Sigma_{s} as in Example 3.2. Note that in this case the collection is 𝒜¯={𝒜i}i=1m\bar{\mathcal{A}}=\{\mathcal{A}_{i}\}^{m}_{i=1}, where 𝒜1==𝒜m=𝒜\mathcal{A}_{1}=\cdots=\mathcal{A}_{m}=\mathcal{A} with 𝒜\mathcal{A} as in Example 3.2, and recall that α=β=1\alpha=\beta=1. Observe that 𝕌=𝕌𝕌=Σ2s\mathbb{U}^{\prime}=\mathbb{U}-\mathbb{U}=\Sigma_{2s} in this case and moreover that assumption (i) of Theorem 3.6 holds, since txtx is ss-sparse for any t0t\geq 0 (in fact, tt\in\mathbb{R}) whenever xx is ss-sparse. Also, assumption (ii) holds, since

    Σ2s=S[N]|S|=2s{xN:supp(x)S}.\Sigma_{2s}=\bigcup_{\begin{subarray}{c}S\subseteq[N]\\ |S|=2s\end{subarray}}\{x\in\mathbb{C}^{N}:\mathrm{supp}(x)\subseteq S\}.

    Here supp(x)={i:xi0}\mathrm{supp}(x)=\{i:x_{i}\neq 0\} is the support of x=(xi)i=1Nx=(x_{i})^{N}_{i=1}. Observe that this is a union of

    d=(N2s)(eN2s)2sd={N\choose 2s}\leq\left(\frac{\mathrm{e}N}{2s}\right)^{2s}

    subspaces of dimension 2s2s. Example 3.2 derives an expression (3.6) for the variation in terms of ss and μ(𝒜)\mu(\mathcal{A}). Using this, we deduce that γ(𝕌;𝒜¯)=2\gamma(\mathbb{U}^{\prime};\bar{\mathcal{A}})=2 and the measurement condition in Theorem 3.6(a) reduces to

    mμ(𝒜)s(slog(eN/s)+log(ϵ1))m\gtrsim\mu(\mathcal{A})\cdot s\cdot\left(s\log(\mathrm{e}N/s)+\log(\epsilon^{-1})\right)

    (one can readily verify that the conditions (b) and (c) give the same result, up to constants). This bound is suboptimal, since it scales quadratically in ss. Fortunately, this can be overcome by using Theorem 3.7. It is well known that (see, e.g., [9, proof of Lem. 13.29] or [43, proof of Lem. 12.37])

    {xΣs:x21}conv({±2sei,±2siei:i=1,,N})=:conv(𝕎).\{x\in\Sigma_{s}:{\|x\|}_{2}\leq 1\}\subseteq\mathrm{conv}\left(\left\{\pm\sqrt{2s}e_{i},\pm\sqrt{2s}\mathrm{i}e_{i}:i=1,\ldots,N\right\}\right)=:\mathrm{conv}(\mathbb{W}).

    This set has size M=|𝕎|=4NM=|\mathbb{W}|=4N. In order to apply Theorem 3.7 we estimate the variation

    Φ(S(𝕌)𝕎;𝒜¯)=max{Φ(S(𝕌);𝒜¯),Φ(𝕎;𝒜¯)}.\Phi(S(\mathbb{U}^{\prime})\cup\mathbb{W};\bar{\mathcal{A}})=\max\left\{\Phi(S(\mathbb{U}^{\prime});\bar{\mathcal{A}}),\Phi(\mathbb{W};\bar{\mathcal{A}})\right\}.

    The first term is equal to 2sμ(𝒜)2s\mu(\mathcal{A}) (recall (3.6)). For the latter, we observe that, for any x𝕎x\in\mathbb{W}, we have, for some 1iN1\leq i\leq N,

    |ax|2=2s|ai|22sμ(𝒜),a.s. a𝒜.|a^{*}x|^{2}=2s|a_{i}|^{2}\leq 2s\mu(\mathcal{A}),\quad\text{a.s. $a\sim\mathcal{A}$}.

    Hence Φ(𝕎;𝒜¯)2sμ(𝒜)\Phi(\mathbb{W};\bar{\mathcal{A}})\leq 2s\mu(\mathcal{A}) and therefore

    Φ(S(𝕌)𝕎;𝒜¯)=2sμ(𝒜).\Phi(S(\mathbb{U}^{\prime})\cup\mathbb{W};\bar{\mathcal{A}})=2s\mu(\mathcal{A}).

    Using this and the previous bounds for dd and γ(𝕌;𝒜¯)\gamma(\mathbb{U}^{\prime};\bar{\mathcal{A}}) we see that condition (a) of Theorem 3.7 (the same applies for conditions (b) and (c)) reduces to

    mμ(𝒜)s(log(2sμ(𝒜))log2(slog(N/s))log(N)+log(ϵ1)).m\gtrsim\mu(\mathcal{A})\cdot s\cdot\left(\log(2s\mu(\mathcal{A}))\cdot\log^{2}(s\log(N/s))\log(N)+\log(\epsilon^{-1})\right).

    In other words, the measurement condition scales linearly in ss, up to the coherence μ(𝒜)\mu(\mathcal{A}) and a polylogarithmic factor in ss, μ(𝒜)\mu(\mathcal{A}), NN and ϵ1\epsilon^{-1}. Note that this bound is similar to standard compressed sensing measurement conditions for isotropic vectors.

  • Example 3.9 (Matrix sketching for large least-squares problems)

    Consider the setup of Example 2, where for convenience we assume that XX has full column rank. In this problem, 𝕌\mathbb{U} is the linear subspace (2.3) with dim(𝕌)=n\dim(\mathbb{U})=n, so we strive to apply condition (c) of Theorem 3.6 with d=1d=1. We first calculate its variation. Recalling the definition of 𝒜=𝒜1==𝒜m\mathcal{A}=\mathcal{A}_{1}=\cdots=\mathcal{A}_{m} and 𝕐\mathbb{Y} in this case and letting u=Xzu=Xz be an arbitrary element of 𝕌\mathbb{U}, we notice that

    A(u)𝕐2maxi=1,,N1πi|eiXz|2{\|A(u)\|}^{2}_{\mathbb{Y}}\leq\max_{i=1,\ldots,N}\frac{1}{\pi_{i}}|e^{*}_{i}Xz|^{2}

    Therefore,

    Φ(S(𝕌);𝒜¯)=Φ(S(𝕌);𝒜)=maxi=1,,N{τ(X)(i)πi},\Phi(S(\mathbb{U});\bar{\mathcal{A}})=\Phi(S(\mathbb{U});\mathcal{A})=\max_{i=1,\ldots,N}\left\{\frac{\tau(X)(i)}{\pi_{i}}\right\},

    where

    τ(X)(i)=maxznXz0|(Xz)i|2Xz22,i=1,,N,\tau(X)(i)=\max_{\begin{subarray}{c}z\in\mathbb{C}^{n}\\ Xz\neq 0\end{subarray}}\frac{|(Xz)_{i}|^{2}}{{\|Xz\|}^{2}_{2}},\quad i=1,\ldots,N,

    are the so-called leverage scores of the matrix XX (note that it is a short argument to show that τ(X)(i)=xi(XX)1xi\tau(X)(i)=x^{*}_{i}(X^{*}X)^{-1}x_{i}, where xix_{i} is the iith row of XX).

    Having obtained an explicit expression for the variation, we can now choose a discrete probability distribution π={π1,,πN}\pi=\{\pi_{1},\ldots,\pi_{N}\} to minimize it. It is a short exercise to show that

    i=1Nτ(X)(i)=n.\sum^{N}_{i=1}\tau(X)(i)=n.

    Therefore, we now set

    πi=τ(X)(i)n,i=1,,N.\pi_{i}=\frac{\tau(X)(i)}{n},\quad i=1,\ldots,N.

    This is the well-known leverage score sampling technique for matrix sketching [64]. Observe that Theorem 3.6 yields the near-optimal sample complexity bound

    mnlog(2n/ϵ),m\gtrsim n\cdot\log(2n/\epsilon),

    which is also well known in the literature.
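
    The following NumPy sketch implements this leverage score sampling strategy, computing the leverage scores τ(X)(i) as the squared row norms of the Q-factor in a reduced QR factorization of X (equivalently, τ(X)(i)=x_i^*(X^*X)^{-1}x_i, as noted above); the particular test matrix, with strongly nonuniform row norms, is an illustrative choice.

        import numpy as np

        rng = np.random.default_rng(7)
        N, n, m = 5000, 20, 200
        X = rng.standard_normal((N, n)) * np.linspace(1, 50, N)[:, None]   # rows with very uneven norms
        x = X @ rng.standard_normal(n) + 0.05 * rng.standard_normal(N)

        Q, _ = np.linalg.qr(X)                     # reduced QR factorization
        tau = np.sum(np.abs(Q) ** 2, axis=1)       # leverage scores; they sum to n
        pi = tau / n                               # leverage score sampling distribution

        rows = rng.choice(N, size=m, p=pi)
        scale = 1.0 / np.sqrt(m * pi[rows])
        w_hat, *_ = np.linalg.lstsq(scale[:, None] * X[rows, :], scale * x[rows], rcond=None)
        w_full, *_ = np.linalg.lstsq(X, x, rcond=None)
        print(np.linalg.norm(X @ w_hat - x) / np.linalg.norm(X @ w_full - x))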

The previous example describes a type of active learning technique for the (discrete) matrix sketching problem. We will discuss related techniques for function regression in §4.

3.5 Further discussion

We now provide some additional comments on our main results. First, notice that the estimator in (1.5) involves a truncation xˇ\check{x} of a γ\gamma-minimizer x^\hat{x} of (1.3). This truncation is a technical device used to establish the error bound in expectation. See §6.1.5 and Remark 6.2.

The main difference between Theorem 3.6 and Theorem 3.7 is the additional assumption (iii), which states that the unit ball of 𝕌\mathbb{U}^{\prime} should be contained in the convex hull of a set 𝕎\mathbb{W} that is not too large (since MM enters logarithmically in the measurement condition) and has small variation (since the variation is taken over a set that includes 𝕎\mathbb{W}). In practice, this assumption allows the dependence of the measurement condition on dd and nn to be reduced from essentially log(2d)+n\log(2d)+n in Theorem 3.6 to log2(log(2d)+n)\log^{2}(\log(2d)+n). As seen in Example 3.4, this can make a significant difference to the resulting measurement condition in certain problems.

Finally, as we show in the proof, assumption (i) in Theorem 3.6 can also be removed. The only difference comes in the definition of γ(𝕌;𝒜¯)\gamma(\mathbb{U}^{\prime};\bar{\mathcal{A}}), which now involves a ratio of variations over slightly different sets. See Remark 6.1.5 for further details.

4 Application to active learning in regression

In this and the next section, we describe the two main applications of our framework considered in this paper. In this section, we focus on active learning.

Active learning is a type of machine learning problem where one has the flexibility to choose where to sample the ground truth (or oracle) so as to enhance the generalization performance of the learning algorithm. In the standard regression problem, where the ground truth is a function f:Df:D\rightarrow\mathbb{R}, this means the training data takes the form

{(xi,f(xi)+ei)}i=1m,\{(x_{i},f(x_{i})+e_{i})\}^{m}_{i=1},

where one is free to choose the sample points xix_{i}. Building on Example 2, in this section we first show how our theoretical results lead naturally to an active learning strategy. This extends well-known leverage score sampling [12, 24, 25, 32, 41, 44, 50] to general nonlinear model classes.

4.1 Active learning via Christoffel sampling

Consider Example 2 and let 𝕍𝕏0\mathbb{V}\subseteq\mathbb{X}_{0} be an arbitrary model class. In this case, the variation of S(𝕍)S(\mathbb{V}) with respect to the distribution 𝒜\mathcal{A} defined therein is

Φ(S(𝕍);𝒜)=esssupxρsup{|v(x)|2ν(x)vLρ2(D)2:v𝕍,v0}.\Phi(S(\mathbb{V});\mathcal{A})=\operatorname*{ess\,sup}_{x\sim\rho}\sup\left\{\frac{|v(x)|^{2}}{\nu(x){\|v\|}^{2}_{L^{2}_{\rho}(D)}}:v\in\mathbb{V},\ v\neq 0\right\}. (4.1)

For convenience, we now define the notation

K(𝕍)(x)=sup{|v(x)|2vLρ2(D)2:v𝕍,v0}.K(\mathbb{V})(x)=\sup\left\{\frac{|v(x)|^{2}}{{\|v\|}^{2}_{L^{2}_{\rho}(D)}}:v\in\mathbb{V},\ v\neq 0\right\}. (4.2)

This is sometimes termed the Christoffel function of the set 𝕍\mathbb{V}. Up to a scaling factor, it is the same as the leverage score function of 𝕍\mathbb{V} (see [6, §A.2]). Notice that

Φ(S(𝕍);𝒜)=esssupxρ{K(𝕍)(x)/ν(x)}.\Phi(S(\mathbb{V});\mathcal{A})=\operatorname*{ess\,sup}_{x\sim\rho}\left\{K(\mathbb{V})(x)/\nu(x)\right\}. (4.3)

Recall that the measurement conditions in Theorem 3.6 scale linearly with Φ(S(𝕌);𝒜)\Phi(S(\mathbb{U}^{\prime});\mathcal{A}), where 𝕌=𝕌𝕌\mathbb{U}^{\prime}=\mathbb{U}-\mathbb{U}. Therefore, in the active learning context, we look to choose the sampling distribution μ\mu – or, equivalently, its density ν=dμ/dρ\nu=\,\mathrm{d}\mu/\,\mathrm{d}\rho – to minimize this quantity. The following result is immediate.

Proposition 4.1 (Christoffel sampling).

Suppose that K(𝕍)K(\mathbb{V}) is integrable and positive almost everywhere. Then (4.1) is minimized by the choice

dμ(𝕍)(x)=K(𝕍)(x)DK(𝕍)(x)dρ(x)dρ(x).\,\mathrm{d}\mu^{\star}(\mathbb{V})(x)=\frac{K(\mathbb{V})(x)}{\int_{D}K(\mathbb{V})(x)\,\mathrm{d}\rho(x)}\,\mathrm{d}\rho(x). (4.4)

In this case, one has the optimal value

Φ(S(𝕍);𝒜)=DK(𝕍)(x)dρ(x).\Phi(S(\mathbb{V});\mathcal{A})=\int_{D}K(\mathbb{V})(x)\,\mathrm{d}\rho(x). (4.5)

This result states that to obtain the best measurement condition over all possible sampling measures μ\mu one should choose a measure that is proportional to the Christoffel function – an approach termed Christoffel sampling in [6]. Combining this with Theorem 3.6, we deduce the following learning guarantees for Christoffel sampling.

Corollary 4.2 (Learning guarantees for Christoffel sampling).

Consider Example 2 and let 𝕌𝕏0\mathbb{U}\subseteq\mathbb{X}_{0} and 𝕌=𝕌𝕌\mathbb{U}^{\prime}=\mathbb{U}-\mathbb{U} be such that (i) and (ii) of Theorem 3.6 hold. Suppose that, for some 0<ϵ<10<\epsilon<1, either

  1. (a)

    μ=μ(𝕌)\mu=\mu^{\star}(\mathbb{U}^{\prime}) in (4.4) and mDK(𝕌)(x)dρ(x)[log(2d/ϵ)+nlog(2γ(𝕌))]m\gtrsim\int_{D}K(\mathbb{U}^{\prime})(x)\,\mathrm{d}\rho(x)\cdot\left[\log(2d/\epsilon)+n\log(2\gamma(\mathbb{U}^{\prime}))\right], where

    γ(𝕌)=min{esssupxρ{K(𝕍)(x)/ν(x)},esssupxρ{K(𝕌𝕌)(x)/ν(x)}},\gamma(\mathbb{U}^{\prime})=\min\left\{\operatorname*{ess\,sup}_{x\sim\rho}\left\{K(\mathbb{V})(x)/\nu(x)\right\},\operatorname*{ess\,sup}_{x\sim\rho}\left\{K(\mathbb{U}^{\prime}-\mathbb{U}^{\prime})(x)/\nu(x)\right\}\right\},
  2. (b)

    μ=μ(𝕌𝕌)\mu=\mu^{\star}(\mathbb{U}^{\prime}-\mathbb{U}^{\prime}) and mDK(𝕌𝕌)(x)dρ(x)[log(2d/ϵ)+n]m\gtrsim\int_{D}K(\mathbb{U}^{\prime}-\mathbb{U}^{\prime})(x)\,\mathrm{d}\rho(x)\cdot\left[\log(2d/\epsilon)+n\right], or

  3. (c)

    μ=μ(𝕍)\mu=\mu^{\star}(\mathbb{V}) and mDK(𝕍)(x)dρ(x)log(2dn/ϵ)m\gtrsim\int_{D}K(\mathbb{V})(x)\,\mathrm{d}\rho(x)\cdot\log(2dn/\epsilon).

Let fC(D¯)f\in C(\overline{D}), θfLρ2(D)\theta\geq{\|f\|}_{L^{2}_{\rho}(D)} and γ0\gamma\geq 0. Then

𝔼ffˇLρ2(D)2infu𝕌fuLρ2(D)2+θ2ϵ+γ2+1me22,where fˇ=min{1,θ/f^Lρ2(D)}f^\mathbb{E}{\|f-\check{f}\|}^{2}_{L^{2}_{\rho}(D)}\lesssim\inf_{u\in\mathbb{U}}{\|f-u\|}^{2}_{L^{2}_{\rho}(D)}+\theta^{2}\epsilon+\gamma^{2}+\frac{1}{m}{\|e\|}^{2}_{2},\quad\text{where }\check{f}=\min\{1,\theta/{\|\hat{f}\|}_{L^{2}_{\rho}(D)}\}\hat{f}

for any γ\gamma-minimizer f^\hat{f} of (2.1), where e=(ei)i=1me=(e_{i})^{m}_{i=1}.

This corollary describes three different variants of Christoffel sampling, depending on whether one chooses 𝕌\mathbb{U}^{\prime}, 𝕌𝕌\mathbb{U}^{\prime}-\mathbb{U}^{\prime} or 𝕍\mathbb{V} as the set over which to define the Christoffel function. Each choice leads to a slightly different measurement condition, but the key point is that they all scale linearly in the integral of the corresponding Christoffel function. We next discuss what this means in practice. We note in passing that it is also possible to develop a version of Corollary 4.2 under the additional assumption (iii) of Theorem 3.7. In this case, the relevant set should also include 𝕎\mathbb{W}.

  • Remark 4.3 (Improvement over inactive learning)

    Suppose that the underlying measure ρ\rho is a probability measure. Inactive learning corresponds to the scenario where μ=ρ\mu=\rho, i.e., the sampling distribution is taken equal to the underlying measure. This is sometimes termed Monte Carlo sampling. In this case, we have ν=dμ/dρ=1\nu=\,\mathrm{d}\mu/\,\mathrm{d}\rho=1, and therefore

    Φ(S(𝕍);𝒜)=esssupxρK(𝕍)(x)\Phi(S(\mathbb{V});\mathcal{A})=\operatorname*{ess\,sup}_{x\sim\rho}K(\mathbb{V})(x) (4.6)

    Therefore, the improvement of Christoffel sampling over inactive learning is equivalent to the difference between the mean value of K(𝕍)K(\mathbb{V}), i.e., (4.5), and its maximal value, i.e., (4.6). Since

    esssupxρK(𝕍)(x)DK(𝕍)(x)dρ(x),\operatorname*{ess\,sup}_{x\sim\rho}K(\mathbb{V})(x)\geq\int_{D}K(\mathbb{V})(x)\,\mathrm{d}\rho(x),

    Christoffel sampling never increases (and typically reduces) the sample complexity bound. Practically, the extent of the improvement corresponds to how ‘spiky’ K(𝕍)(x)K(\mathbb{V})(x) is over xDx\in D. If K(𝕍)(x)K(\mathbb{V})(x) is fairly flat, then one expects very little benefit. On the other hand, if K(𝕍)(x)K(\mathbb{V})(x) has sharp peaks, then one expects a significant improvement (see the short numerical illustration after these remarks). As discussed in [28, 13, 7, 6, 41, 44], there are many cases where significant improvements are possible because of this phenomenon. However, this is not always the case [1].

  • Remark 4.4 (Non-pointwise samples and generalized Christoffel functions)

    One can extend the above discussion to more general types of sampling problems, beyond just pointwise samples of a scalar target function. This generalization was introduced in [6]. The main difference involves replacing (4.2) with a so-called generalized Christoffel function, which involves the sampling operator. For brevity, we omit the details. Note that a special case of this generalization is considered in §5 in the context of generative models.
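As a quick numerical illustration of the ‘spikiness’ phenomenon mentioned in Remark 4.3 (our own example, using the Legendre system on [-1,1] with ρ the uniform measure, an instance of the linear subspace setting of Example 4.5 below): for this degree-(n-1) polynomial space, the mean of K(𝕍) equals n while its maximum equals n², so the gap between Christoffel sampling and Monte Carlo sampling can be a factor of n.

```python
import numpy as np
from numpy.polynomial import legendre

n = 20
x = np.linspace(-1, 1, 10**5)
# Legendre basis, orthonormal w.r.t. the uniform probability measure on [-1, 1].
U = legendre.legvander(x, n - 1) * np.sqrt(2 * np.arange(n) + 1)
K = np.sum(U ** 2, axis=1)       # K(V)(x) evaluated on the grid
print(K.mean(), K.max())         # ≈ n versus ≈ n^2
```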

4.2 Examples

To gain further insight, we now consider several examples.

  • Example 4.5 (Near-optimal sampling for linear subspaces)

    Let 𝕌\mathbb{U} be a subspace of dimension nn with an orthonormal basis {ui}i=1n\{u_{i}\}^{n}_{i=1} in the Lρ2L^{2}_{\rho}-norm. Notice that 𝕌=𝕌𝕌=𝕌\mathbb{U}^{\prime}=\mathbb{U}-\mathbb{U}=\mathbb{U} in this case. Moreover, it is a short exercise (see, e.g., [6, Lem. E.1]) to show that

    K(𝕌)(x)=i=1n|ui(x)|2.K(\mathbb{U})(x)=\sum^{n}_{i=1}|u_{i}(x)|^{2}.

    In particular, DK(𝕌)(x)dρ(x)=n\int_{D}K(\mathbb{U})(x)\,\mathrm{d}\rho(x)=n due to orthonormality. Hence, the Christoffel sampling measure (4.4) becomes

    dμ(x)=i=1n|ui(x)|2ndρ(x),\,\mathrm{d}\mu^{\star}(x)=\frac{\sum^{n}_{i=1}|u_{i}(x)|^{2}}{n}\,\mathrm{d}\rho(x),

    and, using condition (c) of Corollary 4.2 with d=1d=1, we deduce the near-optimal measurement condition

    mnlog(2n/ϵ).m\gtrsim n\cdot\log(2n/\epsilon).

    This result is well known (see, e.g., [28]).
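As an illustration of Example 4.5, the following sketch implements Christoffel sampling and a corresponding weighted least-squares fit for a Legendre polynomial subspace on [-1,1] with ρ uniform. The discretization of μ⋆ onto a grid, the basis, and all sizes are our own illustrative choices, not prescribed by the theory.

```python
import numpy as np
from numpy.polynomial import legendre

def onb(x, n):
    # Legendre basis, orthonormal w.r.t. the uniform probability measure on [-1, 1].
    return legendre.legvander(x, n - 1) * np.sqrt(2 * np.arange(n) + 1)

def christoffel_fit(f, n, m, rng, grid_size=10**5):
    grid = np.linspace(-1, 1, grid_size)
    K = np.sum(onb(grid, n) ** 2, axis=1)        # K(U)(x) = sum_i |u_i(x)|^2
    x = rng.choice(grid, size=m, p=K / K.sum())  # draw from dmu* ∝ K(U) drho (discretized)
    nu = np.sum(onb(x, n) ** 2, axis=1) / n      # density nu = dmu*/drho at the samples
    w = 1.0 / np.sqrt(nu)                        # weights compensating for the change of measure
    coef, *_ = np.linalg.lstsq(w[:, None] * onb(x, n), w * f(x), rcond=None)
    return coef

rng = np.random.default_rng(0)
coef = christoffel_fit(np.exp, n=10, m=60, rng=rng)
print(coef[:3])   # leading Legendre coefficients of exp(x)
```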

The previous example is highly informative, as it gives a class of problems where the active learning strategy of Christoffel sampling is provably optimal up to a log factor. However, many practical problems of interest involve nonlinear model classes. We now consider the nonlinear case in more detail. We commence with the case of unions of subspaces.

  • Example 4.6 (Near-optimal sampling for a ‘small’ union of subspaces)

    Suppose that 𝕌=𝕌𝕌=𝕍1𝕍d\mathbb{U}^{\prime}=\mathbb{U}-\mathbb{U}=\mathbb{V}_{1}\cup\cdots\cup\mathbb{V}_{d} is a union of dd subspaces of dimension at most nn. By definition,

    K(𝕌)(x)=maxi=1,,dK(𝕍i)(x).K(\mathbb{U}^{\prime})(x)=\max_{i=1,\ldots,d}K(\mathbb{V}_{i})(x).

    Therefore, by a crude bound and the fact that DK(𝕍i)dρ(x)n\int_{D}K(\mathbb{V}_{i})\,\mathrm{d}\rho(x)\leq n (recall the previous example; the inequality accounts for subspaces of dimension less than nn), we obtain

    DK(𝕌)(x)dρ(x)i=1dDK(𝕍i)dρ(x)nd.\int_{D}K(\mathbb{U}^{\prime})(x)\,\mathrm{d}\rho(x)\leq\sum^{d}_{i=1}\int_{D}K(\mathbb{V}_{i})\,\mathrm{d}\rho(x)\leq nd.

    Hence, applying condition (c) of Corollary 4.2, we obtain the measurement condition

    mndlog(2nd/ϵ).m\gtrsim n\cdot d\cdot\log(2nd/\epsilon). (4.7)

The main purpose of this example is to illustrate a class of problems involving nonlinear model classes for which Christoffel sampling is provably optimal, up to the log term. Indeed, if dd is small, then the sample complexity bound scales like nlog(n/ϵ)n\log(n/\epsilon).

Unfortunately, as we see in the next example (and also in the next section when considering generative models), in problems of interest the number dd is often not small.

4.3 Application to sparse regression

Let s,Ns,N\in\mathbb{N} and Ψ={ψi}i=1NLρ2(D)\Psi=\{\psi_{i}\}^{N}_{i=1}\subseteq L^{2}_{\rho}(D) be an orthonormal system (one can also consider Riesz systems with few additional difficulties – see [6, § A.4]). Define the model class

𝕌=𝕌s={iSciψi:ci,i,S{1,,N},|S|=s}.\mathbb{U}=\mathbb{U}_{s}=\left\{\sum_{i\in S}c_{i}\psi_{i}:c_{i}\in\mathbb{C},\ \forall i,\ S\subseteq\{1,\ldots,N\},\ |S|=s\right\}. (4.8)

We refer to the corresponding regression problem as a sparse regression problem. Notice that, as in Example 3.4, 𝕌=𝕌s𝕌s=𝕌2s\mathbb{U}^{\prime}=\mathbb{U}_{s}-\mathbb{U}_{s}=\mathbb{U}_{2s} is a union of d=(N2s)d={N\choose 2s} subspaces. Hence, in this case, the bound (4.7) becomes meaningless.

Moreover, another problem in this case is that even evaluating the Christoffel function K(𝕌2s)(x)K(\mathbb{U}_{2s})(x) is computationally intractable, since it involves a pointwise maximum over d=(N2s)d={N\choose 2s} subspaces. Thus, it is not possible to implement Christoffel sampling directly.

Fortunately, both issues can be resolved by using the structure of the model class. In what follows, we recap ideas that were previously presented in [7]. Let u=iSciψi𝕌2su=\sum_{i\in S}c_{i}\psi_{i}\in\mathbb{U}_{2s}, where |S|2s|S|\leq 2s. Then, by the Cauchy–Schwarz inequality and Parseval’s identity,

|u(x)|22smaxi=1,,N|ψi(x)|2uLρ2(D)2.|u(x)|^{2}\leq 2s\max_{i=1,\ldots,N}|\psi_{i}(x)|^{2}{\|u\|}^{2}_{L^{2}_{\rho}(D)}.

Using this, we deduce that

K(𝕌2s)(x)2sK~(Ψ)(x),K~(Ψ)(x)=maxi=1,,N|ψi(x)|2.K(\mathbb{U}_{2s})(x)\leq 2s\widetilde{K}(\Psi)(x),\quad\widetilde{K}(\Psi)(x)=\max_{i=1,\ldots,N}|\psi_{i}(x)|^{2}. (4.9)

The idea now is to sample from a density proportional to K~(Ψ)(x)\widetilde{K}(\Psi)(x). Note that this is usually feasible in practice, since the maximum only involves NN terms, as opposed to d=(N2s)d={N\choose 2s}.
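A minimal sketch of this surrogate sampling strategy, again for the Legendre system on [-1,1] with ρ uniform (an illustrative choice of ours): we evaluate K̃(Ψ) on a grid, approximate θ(Ψ), and draw sample points from the density proportional to K̃(Ψ).

```python
import numpy as np
from numpy.polynomial import legendre

N = 50
grid = np.linspace(-1, 1, 10**5)
# Legendre system, orthonormal w.r.t. the uniform probability measure on [-1, 1].
Psi = legendre.legvander(grid, N - 1) * np.sqrt(2 * np.arange(N) + 1)

K_tilde = np.max(Psi ** 2, axis=1)     # K~(Psi)(x) = max_i |psi_i(x)|^2
theta = np.mean(K_tilde)               # ≈ ∫_D K~(Psi)(x) drho(x) (rho uniform)
print("theta(Psi) ≈", theta)

rng = np.random.default_rng(0)
x = rng.choice(grid, size=200, p=K_tilde / K_tilde.sum())  # sample points for regression
```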

Corollary 4.7 (Active learning for sparse regression).

Consider the setup of Example 2, where 𝕌=𝕌s\mathbb{U}=\mathbb{U}_{s} is as in (4.8), let μ\mu be given by

dμ(x)=K~(Ψ)(x)θ(Ψ)dρ(x),where θ(Ψ)=DK~(Ψ)(x)dρ(x)\,\mathrm{d}\mu(x)=\frac{\widetilde{K}(\Psi)(x)}{\theta(\Psi)}\,\mathrm{d}\rho(x),\quad\text{where }\theta(\Psi)=\int_{D}\widetilde{K}(\Psi)(x)\,\mathrm{d}\rho(x)

and suppose that

mθ(Ψ)s[log(2sθ(Ψ))log2(slog(N/s))log(N)+log(ϵ1)].m\gtrsim\theta(\Psi)\cdot s\cdot\left[\log(2s\theta(\Psi))\cdot\log^{2}(s\log(N/s))\log(N)+\log(\epsilon^{-1})\right].

Let fC(D¯)f\in C(\overline{D}), θfLρ2(D)\theta\geq{\|f\|}_{L^{2}_{\rho}(D)} and γ0\gamma\geq 0. Then

𝔼ffˇLρ2(D)2infu𝕌fuLρ2(D)2+θ2ϵ+γ2+1me22,where fˇ=min{1,θ/f^Lρ2(D)}f^\mathbb{E}{\|f-\check{f}\|}^{2}_{L^{2}_{\rho}(D)}\lesssim\inf_{u\in\mathbb{U}}{\|f-u\|}^{2}_{L^{2}_{\rho}(D)}+\theta^{2}\epsilon+\gamma^{2}+\frac{1}{m}{\|e\|}^{2}_{2},\quad\text{where }\check{f}=\min\{1,\theta/{\|\hat{f}\|}_{L^{2}_{\rho}(D)}\}\hat{f}

for any γ\gamma-minimizer f^\hat{f} of (2.1), where e=(ei)i=1me=(e_{i})^{m}_{i=1}.

Proof.

As in Example 3.4, notice that 𝕌=𝕌2s\mathbb{U}^{\prime}=\mathbb{U}_{2s} is contained in conv(𝕎)\mathrm{conv}(\mathbb{W}), where

𝕎={±2sψi,±2siψi,i=1,,N}.\mathbb{W}=\left\{\pm\sqrt{2s}\psi_{i},\pm\sqrt{2s}\mathrm{i}\psi_{i},\ i=1,\ldots,N\right\}.

Notice that

Φ(S(𝕌)𝕎;𝒜¯)=max{Φ(S(𝕌);𝒜¯),Φ(𝕎;𝒜¯)}.\Phi(S(\mathbb{U}^{\prime})\cup\mathbb{W};\bar{\mathcal{A}})=\max\{\Phi(S(\mathbb{U}^{\prime});\bar{\mathcal{A}}),\Phi(\mathbb{W};\bar{\mathcal{A}})\}.

Consider the second term. We have

Φ(𝕎;𝒜¯)=2sesssupxρmax{|ψi(x)|2ν(x):i=1,,N}=2sesssupxρ{K~(Ψ)(x)/ν(x)}.\Phi(\mathbb{W};\bar{\mathcal{A}})=2s\operatorname*{ess\,sup}_{x\sim\rho}\max\left\{\frac{|\psi_{i}(x)|^{2}}{\nu(x)}:i=1,\ldots,N\right\}=2s\operatorname*{ess\,sup}_{x\sim\rho}\left\{\widetilde{K}(\Psi)(x)/\nu(x)\right\}.

We also know from (4.3) and (4.9) that

Φ(S(𝕌);𝒜¯)2sesssupxρ{K~(Ψ)(x)/ν(x)}.\Phi(S(\mathbb{U}^{\prime});\bar{\mathcal{A}})\leq 2s\operatorname*{ess\,sup}_{x\sim\rho}\left\{\widetilde{K}(\Psi)(x)/\nu(x)\right\}.

Therefore, using the fact that ν=dμ/dρ\nu=\,\mathrm{d}\mu/\,\mathrm{d}\rho and the definition of μ\mu, we get

Φ(S(𝕌)𝕎;𝒜¯)2sesssupxρ{K~(Ψ)(x)/ν(x)}=2sDK~(Ψ)(x)dρ(x)=2sθ(Ψ).\Phi(S(\mathbb{U}^{\prime})\cup\mathbb{W};\bar{\mathcal{A}})\leq 2s\operatorname*{ess\,sup}_{x\sim\rho}\left\{\widetilde{K}(\Psi)(x)/\nu(x)\right\}=2s\int_{D}\widetilde{K}(\Psi)(x)\,\mathrm{d}\rho(x)=2s\theta(\Psi).

To apply Theorem 3.7 we also need to estimate the term γ(𝕌;𝒜¯)\gamma(\mathbb{U}^{\prime};\bar{\mathcal{A}}). However, notice from the above derivation that

Φ(S(𝕌𝕌);𝒜¯)=Φ(S(𝕌4s);𝒜¯)4sDK~(Ψ)(x)dρ(x)=4sθ(Ψ).\Phi(S(\mathbb{U}^{\prime}-\mathbb{U}^{\prime});\bar{\mathcal{A}})=\Phi(S(\mathbb{U}_{4s});\bar{\mathcal{A}})\leq 4s\int_{D}\widetilde{K}(\Psi)(x)\,\mathrm{d}\rho(x)=4s\theta(\Psi).

Therefore, recalling that our main results hold when the variation constant is replaced by an upper bound, we deduce that we may take γ(𝕌;𝒜¯)2\gamma(\mathbb{U}^{\prime};\bar{\mathcal{A}})\leq 2 in this instance. We now apply Theorem 3.7, recalling that n=sn=s, d=(Ns)d={N\choose s} and M=4NM=4N in this case (see Example 3.4). ∎

This result describes an active learning strategy for sparse regression (previously derived in [7]) that is optimal up to the term θ(Ψ)\theta(\Psi), since the measurement condition scales linearly in ss up to log factors. As discussed in [7], one can numerically estimate this factor. In cases such as sparse polynomial regression, it is either bounded or very mildly growing in NN. Whether it is possible to derive an active learning strategy that avoids this factor is an open problem.

5 Application to compressed sensing with generative models

Compressed sensing with generative models involves replacing the classical model class 𝕌=Σs\mathbb{U}=\Sigma_{s} in (3.5) with the range of a generative neural network that has been trained on some training data relevant to the learning problem at hand (e.g., brain images in the case of an MRI reconstruction problem). In this section, we demonstrate how our main results can be applied to this problem in the case of ReLU generative neural networks. The setup in this section is based on [15, 16]. We extend and generalize this work in the following ways:

  • [15, 16] consider only sampling with subsampled unitary matrices. We extend this to sampling with arbitrary linear operators, which may be scalar- or vector-valued.

  • We improve over [15, Thm. 1] and [16, Thm. 2.1] by requiring the ‘local coherences’ to be evaluated over a smaller set.

  • We provide error bounds in expectation instead of probability.

5.1 Setup and main result

We consider the discrete setting, where 𝕏0=𝕏=N\mathbb{X}_{0}=\mathbb{X}=\mathbb{R}^{N}. Following [15], we consider an \ell-layer ReLU neural network G:nNG:\mathbb{R}^{n}\rightarrow\mathbb{R}^{N} given by

G(z)=σ(Aσ(A1σ(A2σ(A1z)))),G(z)=\sigma(A_{\ell}\sigma(A_{\ell-1}\sigma(\cdots A_{2}\sigma(A_{1}z)\cdots))), (5.1)

where σ\sigma is the ReLU, Aipi×pi1A_{i}\in\mathbb{R}^{p_{i}\times p_{i-1}} and p0=np_{0}=n, p=Np_{\ell}=N. Notice that this network has no bias terms. As discussed in [15, Rem. 2], the following theory can also be extended to the case of nonzero biases.
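For concreteness, a ReLU network of the form (5.1) can be written in a few lines of code; the widths and random weights below are placeholders of ours, not a trained generator.

```python
import numpy as np

def relu_generator(z, weights):
    # G(z) = sigma(A_l sigma(A_{l-1} ... sigma(A_1 z)...)), sigma = ReLU, no biases.
    x = z
    for A in weights:
        x = np.maximum(A @ x, 0.0)
    return x

rng = np.random.default_rng(0)
widths = [4, 32, 64, 128]                  # p_0 = n = 4, ..., p_l = N = 128
weights = [rng.standard_normal((widths[i + 1], widths[i]))
           for i in range(len(widths) - 1)]
print(relu_generator(rng.standard_normal(widths[0]), weights).shape)  # (128,)
```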

For sampling, we consider the general setup of §1.1, i.e., where the measurements are the result of arbitrary linear operators applied to the object xNx\in\mathbb{R}^{N}. In particular, they may be scalar- or vector-valued.

We next state a general learning guarantee for generative models. For this, we require the following results from [15]. First, if 𝕌=Ran(G)\mathbb{U}=\mathrm{Ran}(G), then the difference set

𝕌𝕌=Ran(G)Ran(G)=i=1d𝒞i,\mathbb{U}-\mathbb{U}=\mathrm{Ran}(G)-\mathrm{Ran}(G)=\bigcup^{d}_{i=1}\mathcal{C}_{i}, (5.2)

where each 𝒞i\mathcal{C}_{i} is a polyhedral cone of dimension at most 2n2n and

log(d)2ni=11log(2epi/n).\log(d)\leq 2n\sum^{\ell-1}_{i=1}\log(2\mathrm{e}p_{i}/n).

See [15, Lem. S2.2 & Rem. S2.3]. Next, as in [15, Defn. 5], we define the piecewise linear expansion of 𝕌𝕌\mathbb{U}-\mathbb{U} as

Δ(𝕌𝕌)=i=1dspan(𝒞i).\Delta(\mathbb{U}-\mathbb{U})=\bigcup^{d}_{i=1}\mathrm{span}(\mathcal{C}_{i}). (5.3)

Notice that Δ(𝕌𝕌)\Delta(\mathbb{U}-\mathbb{U}) is a union of dd subspaces of dimension at most 2n2n.

Corollary 5.1.

Consider the setup of §1.1, where 𝕏0=𝕏=N\mathbb{X}_{0}=\mathbb{X}=\mathbb{R}^{N} with the Euclidean inner product and norm. Let 𝕌=Ran(G)\mathbb{U}=\mathrm{Ran}(G), where GG is a ReLU generative neural network as in (5.1) with ℓ\ell layers and widths n=p0p1,,p1,p=Nn=p_{0}\leq p_{1},\ldots,p_{\ell-1},p_{\ell}=N. Suppose that, for some 0<ϵ<10<\epsilon<1,

mα1Φ(S(𝕌𝕌);𝒜¯)nL,m\gtrsim\alpha^{-1}\cdot\Phi(S(\mathbb{U}-\mathbb{U});\bar{\mathcal{A}})\cdot n\cdot L, (5.4)

where

L=i=11log(2epi/n)+log(2/ϵ)+log(2Φ(Δ(𝕌𝕌);𝒜¯)Φ(𝕌𝕌;𝒜¯)).L=\sum^{\ell-1}_{i=1}\log(2\mathrm{e}p_{i}/n)+\log(2/\epsilon)+\log\left(2\frac{\Phi(\Delta(\mathbb{U}-\mathbb{U});\bar{\mathcal{A}})}{\Phi(\mathbb{U}-\mathbb{U};\bar{\mathcal{A}})}\right).

Let xNx\in\mathbb{R}^{N}, x𝕏1{\|x\|}_{\mathbb{X}}\leq 1 and γ0\gamma\geq 0. Then

𝔼xxˇ22βαinfu𝕌xu22+ϵ+γ2α+e𝕐¯2α,where xˇ=min{1,1/x^2}x^\mathbb{E}{\|x-\check{x}\|}^{2}_{\ell^{2}}\lesssim\frac{\beta}{\alpha}\cdot\inf_{u\in\mathbb{U}}{\|x-u\|}^{2}_{\ell^{2}}+\epsilon+\frac{\gamma^{2}}{\alpha}+\frac{{\|e\|}^{2}_{\overline{\mathbb{Y}}}}{\alpha},\qquad\text{where }\check{x}=\min\{1,1/{\|\hat{x}\|}_{\ell^{2}}\}\hat{x}

for any γ\gamma-minimizer x^\hat{x} of (1.3) with noisy measurements (1.2).

Proof.

The set 𝕌=𝕌𝕌=Ran(G)Ran(G)\mathbb{U}^{\prime}=\mathbb{U}-\mathbb{U}=\mathrm{Ran}(G)-\mathrm{Ran}(G) is a cone by construction. As shown above, we have 𝕌Δ(𝕌)\mathbb{U}^{\prime}\subseteq\Delta(\mathbb{U}^{\prime}), where the latter is a union of dd subspaces of dimension at most 2n2n. We now apply Theorem 3.6, and specifically condition (a), with 𝕍=Δ(𝕌)\mathbb{V}=\Delta(\mathbb{U}^{\prime}). ∎

This result shows that the sample complexity depends linearly on the dimension nn of the latent space of the generative neural network, up to log terms, multiplied by the variations Φ(S(𝕌𝕌);𝒜¯)\Phi(S(\mathbb{U}-\mathbb{U});\bar{\mathcal{A}}) and Φ(S(Δ(𝕌𝕌));𝒜¯)\Phi(S(\Delta(\mathbb{U}-\mathbb{U}));\bar{\mathcal{A}}). In particular, if these variations are small and the network widths p1,,p1Np_{1},\ldots,p_{\ell-1}\lesssim N we obtain the measurement condition mn(log(N/n)+log(2/ϵ))m\gtrsim n(\ell\log(N/n)+\log(2/\epsilon)) scaling linearly in nn.

5.2 Recovery guarantees for subsampled unitary matrices

Corollary 5.1 applies to the general measurement model introduced in §1.1. As mentioned, [15] and [16] consider measurements drawn from the rows of a subsampled unitary matrix UU. As described in Example 2, this is a special case of our general model. (Technically, [15, Defn. 1] considers a randomly-subsampled unitary matrix model involving Bernoulli sampling. In other words, it is an instance of Example 2 with πi=1/N\pi_{i}=1/N, i\forall i.)

We now consider the setting of [16], i.e., Example 2. The analysis therein relies on the concept of local coherences. Following [16, Defn. 1.4], we say that the local coherences of UU with respect to a set 𝕍N\mathbb{V}\subseteq\mathbb{C}^{N} are the entries of the vector σ=(σi)i=1N\sigma=(\sigma_{i})^{N}_{i=1}, where

σi=supv𝕍v2=1|uiv|,i=1,,N.\sigma_{i}=\sup_{\begin{subarray}{c}v\in\mathbb{V}\\ {\|v\|}_{\ell^{2}}=1\end{subarray}}|u^{*}_{i}v|,\quad i=1,\ldots,N. (5.5)

(Note that [15, Defn. 3] uses the notation α\alpha. We use σ\sigma as α\alpha is already used in the definition of nondegeneracy.) We now show how the local coherence relates to a special case of the variation (Definition 3.2). Let 𝒜¯\bar{\mathcal{A}} be the isotropic family of the subsampled unitary matrix model of Example 2. Then Φ(S(𝕍);𝒜¯)\Phi(S(\mathbb{V});\bar{\mathcal{A}}) is readily seen to be

Φ(S(𝕍);𝒜¯)=maxi=1,,Nsup{|uiv|2πiv22:v𝕍,v0}=maxi=1,,N(σi2πi).\Phi(S(\mathbb{V});\bar{\mathcal{A}})=\max_{i=1,\ldots,N}\sup\left\{\frac{|u^{*}_{i}v|^{2}}{\pi_{i}{\|v\|}^{2}_{\ell^{2}}}:v\in\mathbb{V},\ v\neq 0\right\}=\max_{i=1,\ldots,N}\left(\frac{\sigma^{2}_{i}}{\pi_{i}}\right). (5.6)

Using this and Corollary 5.1, we deduce the following.

Corollary 5.2 (Generative models with subsampled unitary matrices).

Consider the setup of Corollary 5.1 with the randomly subsampled unitary matrix model of Example 2. Let σ=(σi)i=1N\sigma=(\sigma_{i})^{N}_{i=1} be the local coherences of UU with respect to 𝕌𝕌\mathbb{U}-\mathbb{U} and σ~=(σ~i)i=1N\tilde{\sigma}=(\tilde{\sigma}_{i})^{N}_{i=1} be the local coherences of UU with respect to Δ(𝕌𝕌)\Delta(\mathbb{U}-\mathbb{U}) (see (5.3)). Then the measurement condition (5.4) is implied by

mCn(i=11log(2epi/n)+log(2/ϵ)+log(C~/C)),m\gtrsim C\cdot n\cdot\left(\sum^{\ell-1}_{i=1}\log(2\mathrm{e}p_{i}/n)+\log(2/\epsilon)+\log(\widetilde{C}/C)\right), (5.7)

where C=maxi=1,,N(σi2πi)C=\max_{i=1,\ldots,N}\left(\frac{\sigma^{2}_{i}}{\pi_{i}}\right) and C~=maxi=1,,N(σ~i2πi)\widetilde{C}=\max_{i=1,\ldots,N}\left(\frac{\tilde{\sigma}^{2}_{i}}{\pi_{i}}\right). It is also implied by the condition

mC~n(i=11log(2epi/n)+log(2/ϵ)).m\gtrsim\widetilde{C}\cdot n\cdot\left(\sum^{\ell-1}_{i=1}\log(2\mathrm{e}p_{i}/n)+\log(2/\epsilon)\right). (5.8)
Proof.

From (5.6), we see that

Φ(S(𝕌𝕌);𝒜¯)=maxi=1,,N(σi2πi)=C,Φ(Δ(𝕌𝕌);𝒜¯)=maxi=1,,N(σ~i2πi)=C~.\Phi(S(\mathbb{U}-\mathbb{U});\bar{\mathcal{A}})=\max_{i=1,\ldots,N}\left(\frac{\sigma^{2}_{i}}{\pi_{i}}\right)=C,\quad\Phi(\Delta(\mathbb{U}-\mathbb{U});\bar{\mathcal{A}})=\max_{i=1,\ldots,N}\left(\frac{\tilde{\sigma}^{2}_{i}}{\pi_{i}}\right)=\widetilde{C}.

That (5.7) implies (5.4) now follows immediately. Next, we use the fact that C~C\widetilde{C}\geq C (since 𝕌𝕌Δ(𝕌𝕌)\mathbb{U}-\mathbb{U}\subseteq\Delta(\mathbb{U}-\mathbb{U})) and the inequality log(x)x\log(x)\leq x (which gives Clog(C~/C)C~C\log(\widetilde{C}/C)\leq\widetilde{C}) to show that (5.8) implies (5.7), and therefore (5.4), as required. ∎

Using this result, we can now derive optimized variable-density sampling strategies based on the local coherences.

Corollary 5.3 (Optimal variable-density sampling for generative models).

Consider the setup of the previous corollary. If

πi=σi2σ22,i=1,,N,\pi_{i}=\frac{\sigma^{2}_{i}}{{\|\sigma\|}^{2}_{\ell^{2}}},\quad i=1,\ldots,N,

then the measurement condition (5.4) is implied by

mσ22n(i=11log(2epi/n)+log(2/ϵ)+maxi=1,,Nlog(σ~i/σi))m\gtrsim{\|\sigma\|}^{2}_{\ell^{2}}\cdot n\cdot\left(\sum^{\ell-1}_{i=1}\log(2\mathrm{e}p_{i}/n)+\log(2/\epsilon)+\max_{i=1,\ldots,N}\log(\tilde{\sigma}_{i}/\sigma_{i})\right) (5.9)

and if

πi=σ~i2σ~22,i=1,,N,\pi_{i}=\frac{\tilde{\sigma}^{2}_{i}}{{\|\tilde{\sigma}\|}^{2}_{\ell^{2}}},\quad i=1,\ldots,N,

then the measurement condition (5.4) is implied by

mσ~22n(i=11log(2epi/n)+log(2/ϵ)).m\gtrsim{\|\tilde{\sigma}\|}^{2}_{\ell^{2}}\cdot n\cdot\left(\sum^{\ell-1}_{i=1}\log(2\mathrm{e}p_{i}/n)+\log(2/\epsilon)\right). (5.10)

These two results are very similar to those in [16]. The second conditions (5.8) and (5.10) are identical to [16, Thms. A2.1 & 2.1], respectively. In the first conditions (5.7) and (5.9) we obtain a small improvement over [16, Thms. A2.1 & 2.1]. Specifically, the measurement condition is primarily influenced by the local coherences σi\sigma_{i} taken over 𝕌𝕌=Ran(G)Ran(G)\mathbb{U}-\mathbb{U}=\mathrm{Ran}(G)-\mathrm{Ran}(G), with the local coherences σ~i\tilde{\sigma}_{i}, which are taken over the piecewise linear expansion set Δ(𝕌𝕌)\Delta(\mathbb{U}-\mathbb{U}) only entering logarithmically into the condition. This has relevance to practice, since the σi\sigma_{i} can be empirically estimated more efficiently than the σ~i\tilde{\sigma}_{i}. This was done in [6] for MRI reconstruction using generative models and later in [16]. As shown in these works, this can lead to substantial benefits over uniform random subsampling.
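The following sketch shows one way such an empirical estimate might be computed: sample latent pairs, form differences G(z₁)-G(z₂) in Ran(G)-Ran(G), and record the largest normalized inner products with the rows of U. The toy generator, the orthogonal matrix standing in for U and the number of Monte Carlo draws are all illustrative assumptions of ours, and the resulting estimate is only a lower bound on the true σ_i.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy ReLU generator G: R^n -> R^N (no biases) and a toy (real) orthogonal matrix U.
n, N = 4, 128
A1, A2 = rng.standard_normal((64, n)), rng.standard_normal((N, 64))
G = lambda z: np.maximum(A2 @ np.maximum(A1 @ z, 0.0), 0.0)
U, _ = np.linalg.qr(rng.standard_normal((N, N)))

# Monte Carlo estimate (a lower bound) of the local coherences of U w.r.t. Ran(G) - Ran(G).
sigma = np.zeros(N)
for _ in range(2000):
    v = G(rng.standard_normal(n)) - G(rng.standard_normal(n))
    nrm = np.linalg.norm(v)
    if nrm > 0:
        sigma = np.maximum(sigma, np.abs(U @ v) / nrm)

pi = sigma**2 / np.sum(sigma**2)   # variable-density probabilities, as in Corollary 5.3
print(pi.max(), pi.min())
```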

Notice that Corollary 5.3 is a type of active learning result for learning with generative models. It differs from the setup of §4 in that the underlying object is a vector, not a function, and the measurements are not pointwise samples. It does, however, fall into the general active learning framework of [6] (recall Remark 4.4). Notice that the local coherences σ\sigma and σ~\tilde{\sigma} are equivalent to the generalized Christoffel functions for this problem. Hence Corollary 5.3 is a type of Christoffel sampling.

The previous results use sampling with replacement. In the next result, we show that the same guarantees can be obtained without replacement by using Bernoulli sampling (Example 2).

Corollary 5.4 (Optimal variable-density Bernoulli sampling for generative models).

Let 𝕌=Ran(G)\mathbb{U}=\mathrm{Ran}(G), where GG is a ReLU generative neural network as in (5.1) with ℓ\ell layers and widths n=p0p1,,p1,p=Nn=p_{0}\leq p_{1},\ldots,p_{\ell-1},p_{\ell}=N, and consider the randomly subsampled unitary matrix model of Example 2. Let σi\sigma_{i} and σ~i\tilde{\sigma}_{i} be as in Corollary 5.2 and suppose that either of the following holds.

  1. (i)

    Let

    πi=min{1/m,σi2/cπ},i=1,,N,\pi_{i}=\min\left\{1/m,\sigma^{2}_{i}/c_{\pi}\right\},\quad i=1,\ldots,N,

    where cπc_{\pi} is the unique solution in [0,)[0,\infty) to the nonlinear equation

    i=1Nmin{1/m,σi2/c}=1\sum^{N}_{i=1}\min\left\{1/m,\sigma^{2}_{i}/c\right\}=1

    and

    mσ22n(i=11log(2epi/n)+log(2/ϵ)+maxi=1,,Nlog(σ~i/σi))m\gtrsim{\|\sigma\|}^{2}_{\ell^{2}}\cdot n\cdot\left(\sum^{\ell-1}_{i=1}\log(2\mathrm{e}p_{i}/n)+\log(2/\epsilon)+\max_{i=1,\ldots,N}\log(\tilde{\sigma}_{i}/\sigma_{i})\right)
  2. (ii)

    Let

    πi=min{1/m,σ~i2/cπ},i=1,,N,\pi_{i}=\min\left\{1/m,\tilde{\sigma}^{2}_{i}/c_{\pi}\right\},\quad i=1,\ldots,N,

    where cπc_{\pi} is the unique solution in [0,)[0,\infty) to the nonlinear equation

    i=1Nmin{1/m,σ~i2/c}=1\sum^{N}_{i=1}\min\left\{1/m,\tilde{\sigma}^{2}_{i}/c\right\}=1

    and

    mσ~22n(i=11log(2epi/n)+log(2/ϵ)).m\gtrsim{\|\tilde{\sigma}\|}^{2}_{\ell^{2}}\cdot n\cdot\left(\sum^{\ell-1}_{i=1}\log(2\mathrm{e}p_{i}/n)+\log(2/\epsilon)\right).

Let xNx\in\mathbb{R}^{N}, x𝕏1{\|x\|}_{\mathbb{X}}\leq 1 and γ0\gamma\geq 0. Then

𝔼xxˇ22βαinfu𝕌xu22+ϵ+γ2α+e𝕐¯2α,where xˇ=min{1,1/x^2}x^\mathbb{E}{\|x-\check{x}\|}^{2}_{\ell^{2}}\lesssim\frac{\beta}{\alpha}\cdot\inf_{u\in\mathbb{U}}{\|x-u\|}^{2}_{\ell^{2}}+\epsilon+\frac{\gamma^{2}}{\alpha}+\frac{{\|e\|}^{2}_{\overline{\mathbb{Y}}}}{\alpha},\qquad\text{where }\check{x}=\min\{1,1/{\|\hat{x}\|}_{\ell^{2}}\}\hat{x}

for any γ\gamma-minimizer x^\hat{x} of (1.3) with noisy measurements (1.2).

Proof.

Let 𝒜¯={𝒜i}i=1N\bar{\mathcal{A}}=\{\mathcal{A}_{i}\}^{N}_{i=1} be the distribution defined in Example 2 and consider the first scenario (i). Then, observe that 𝒜i\mathcal{A}_{i} is a constant distribution if and only if πi=1/m\pi_{i}=1/m. Hence

Φ(S(𝕌𝕌);𝒜¯)=maxi:πi<1/msup{N|uiv|2mπiv22:v𝕌𝕌,v0}=maxi:πi<1/mNcπσi2mσi2=Ncπm.\Phi(S(\mathbb{U}-\mathbb{U});\bar{\mathcal{A}})=\max_{i:\pi_{i}<1/m}\sup\left\{\frac{N|u^{*}_{i}v|^{2}}{m\pi_{i}{\|v\|}^{2}_{\ell^{2}}}:v\in\mathbb{U}-\mathbb{U},\ v\neq 0\right\}=\max_{i:\pi_{i}<1/m}\frac{Nc_{\pi}\sigma^{2}_{i}}{m\sigma^{2}_{i}}=\frac{Nc_{\pi}}{m}.

Now observe that the function i=1Nmin{1/m,σi2/c}\sum^{N}_{i=1}\min\{1/m,\sigma^{2}_{i}/c\} is nonincreasing in cc and is bounded above by σ22/c{\|\sigma\|}^{2}_{\ell^{2}}/c. Therefore 1σ22/cπ1\leq{\|\sigma\|}^{2}_{\ell^{2}}/c_{\pi}, i.e., cπσ22c_{\pi}\leq{\|\sigma\|}^{2}_{\ell^{2}}. We deduce that

Φ(S(𝕌𝕌);𝒜¯)Nσ22m.\Phi(S(\mathbb{U}-\mathbb{U});\bar{\mathcal{A}})\leq\frac{N{\|\sigma\|}^{2}_{\ell^{2}}}{m}.

Further, it is a short argument to see that

Φ(S(Δ(𝕌𝕌));𝒜¯)Nσ22mmaxi=1,,N{σ~i/σi}.\Phi(S(\Delta(\mathbb{U}-\mathbb{U}));\bar{\mathcal{A}})\leq\frac{N{\|\sigma\|}^{2}_{\ell^{2}}}{m}\max_{i=1,\ldots,N}\{\tilde{\sigma}_{i}/\sigma_{i}\}.

We now apply Theorem 3.6 with 𝕍=Δ(𝕌𝕌)\mathbb{V}=\Delta(\mathbb{U}-\mathbb{U}). Recall that the collection 𝒜¯\bar{\mathcal{A}} involves NN distributions. Hence condition (a) is implied by the condition

NNσ22mn(i=11log(2epi/n)+log(2/ϵ)+maxi=1,,Nlog(σ~i/σi))N\gtrsim\frac{N{\|\sigma\|}^{2}_{\ell^{2}}}{m}\cdot n\cdot\left(\sum^{\ell-1}_{i=1}\log(2\mathrm{e}p_{i}/n)+\log(2/\epsilon)+\max_{i=1,\ldots,N}\log(\tilde{\sigma}_{i}/\sigma_{i})\right)

Rearranging now gives the result for scenario (i). The proof for scenario (ii) is similar. ∎

This result shows that sampling without replacement can be achieved under the same sample complexity bounds. This is subject to the standard caveat that mm is not the number of samples in the Bernoulli sampling model, but rather its expected value.
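In practice, the threshold c_π in Corollary 5.4 can be computed by exploiting the monotonicity of c ↦ Σ_i min{1/m, σ_i²/c}; a simple bisection, as in the sketch below (with synthetic σ_i and assuming N ≥ m so that a root exists), suffices.

```python
import numpy as np

def solve_c_pi(sigma, m, tol=1e-12):
    # Solve sum_i min(1/m, sigma_i^2 / c) = 1 for c > 0 by bisection.
    # The left-hand side is nonincreasing in c; we assume N >= m so that F(lo) >= 0.
    s2 = np.asarray(sigma, dtype=float) ** 2
    F = lambda c: np.sum(np.minimum(1.0 / m, s2 / c)) - 1.0
    lo, hi = 1e-16, s2.sum()              # F(lo) >= 0 and F(hi) <= 0
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if F(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

sigma = np.abs(np.random.default_rng(0).standard_normal(512))
m = 100
c_pi = solve_c_pi(sigma, m)
pi = np.minimum(1.0 / m, sigma**2 / c_pi)   # probabilities as in Corollary 5.4(i)
print(pi.sum())                             # ≈ 1 by construction
```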

  • Remark 5.5

    The proof of this corollary also demonstrates why it is important to define the variation as a maximum over the nonconstant distributions only. The resulting measurement conditions could not be obtained had the variation been defined over all distributions.

5.3 Extension to block sampling

A desirable aspect of the general framework of §1.1 is that we can easily derive similar results for compressed sensing with generative models under a more general ‘block’ sampling model [17, 20]. Here, each measurement corresponds to a block of rows of a unitary matrix UU, rather than a single row. This model is highly relevant in applications such as MRI, where it is generally impossible to measure individual frequencies (i.e., individual rows). See [9, 52].

We now formalize this setup. Note that this was previously done in [6, §C.1] for the special case where UU was the 3D Discrete Fourier Transform (DFT) matrix and each block corresponded to a horizontal line of kk-space values. Here we consider the general setting.

Let P1,,PMP_{1},\ldots,P_{M} be a (potentially overlapping) partition of {1,,N}\{1,\ldots,N\}. Let pi=|Pi|p_{i}=|P_{i}|, p=maxi=1,,M{pi}p=\max_{i=1,\ldots,M}\{p_{i}\} and

rj=|{i:jPi}|,j=1,,N,r_{j}=|\{i:j\in P_{i}\}|,\quad j=1,\ldots,N,

be the number of partition elements to which the integer jj belongs. Now let π=(π1,,πM)\pi=(\pi_{1},\ldots,\pi_{M}) be a discrete probability distribution on {1,,M}\{1,\ldots,M\} and define the linear maps Bi(N,p)B_{i}\in\mathcal{B}(\mathbb{C}^{N},\mathbb{C}^{p}) by

Bix=1πi[((Ux)j/rj)jPi0ppi],xN,B_{i}x=\frac{1}{\sqrt{\pi_{i}}}\begin{bmatrix}((Ux)_{j}/\sqrt{r_{j}})_{j\in P_{i}}\\ 0_{p-p_{i}}\end{bmatrix},\quad x\in\mathbb{C}^{N},

where 0ppi0_{p-p_{i}} is the vector of zeros of size (ppi)×1(p-p_{i})\times 1. Now define the distribution 𝒜\mathcal{A} of bounded linear operators in (N,p)\mathcal{B}(\mathbb{C}^{N},\mathbb{C}^{p}) by A𝒜A\sim\mathcal{A} if

(A=Bi)=πi,i=1,,M.\mathbb{P}(A=B_{i})=\pi_{i},\quad i=1,\ldots,M.

We readily deduce that 𝒜\mathcal{A} is nondegenerate with α=β=1\alpha=\beta=1. Indeed,

𝔼A𝒜Ax22=i=1Mπi1πijPi|(Ux)j|2rj=Ux22=x22.\mathbb{E}_{A\sim\mathcal{A}}{\|Ax\|}^{2}_{\ell^{2}}=\sum^{M}_{i=1}\pi_{i}\frac{1}{\pi_{i}}\sum_{j\in P_{i}}\frac{|(Ux)_{j}|^{2}}{r_{j}}={\|Ux\|}^{2}_{\ell^{2}}={\|x\|}^{2}_{\ell^{2}}.

Now let A1,,Ami.i.d.𝒜A_{1},\ldots,A_{m}\sim_{\mathrm{i.i.d.}}\mathcal{A} and consider the measurements (1.2). Then this gives the desired block sampling model. Each measurement bipb_{i}\in\mathbb{C}^{p} corresponds to a block of rows of UU (up to scaling factors to account for overlaps) from one of the partition elements, which is chosen according to the discrete probability distribution π\pi.
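A small self-contained check of this construction (with an arbitrary orthogonal matrix standing in for U, a simple overlapping cover and uniform block probabilities, all illustrative choices of ours): the operators B_i below satisfy Σ_i π_i‖B_i x‖² = ‖x‖², i.e., nondegeneracy with α = β = 1.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 32, 8
U, _ = np.linalg.qr(rng.standard_normal((N, N)))        # stand-in for a unitary matrix

blocks = [np.arange(4 * i, min(4 * i + 6, N)) for i in range(M)]  # overlapping cover
r = np.zeros(N)
for P in blocks:
    r[P] += 1.0                                          # r_j = #{i : j in P_i}
pi = np.full(M, 1.0 / M)
p = max(len(P) for P in blocks)

def B(i, x):
    # B_i x = pi_i^{-1/2} [ ((Ux)_j / sqrt(r_j))_{j in P_i} ; zero padding ]
    rows = (U[blocks[i], :] @ x) / np.sqrt(r[blocks[i]])
    return np.concatenate([rows, np.zeros(p - len(blocks[i]))]) / np.sqrt(pi[i])

x = rng.standard_normal(N)
lhs = sum(pi[i] * np.linalg.norm(B(i, x)) ** 2 for i in range(M))
print(np.isclose(lhs, np.linalg.norm(x) ** 2))           # True
```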

Note that the setting considered previously corresponds to the case N=MN=M and Pi={i}P_{i}=\{i\} for i=1,,Ni=1,\ldots,N. Much as in that case, we can readily derive a measurement condition. The only change we need to make is to redefine the local coherences σi\sigma_{i} in (5.5) by the following:

σi=supv𝕍v2=1jPi|ujv|2rj,i=1,,M.\sigma_{i}=\sup_{\begin{subarray}{c}v\in\mathbb{V}\\ {\|v\|}_{\ell^{2}}=1\end{subarray}}\sqrt{\sum_{j\in P_{i}}\frac{|u^{*}_{j}v|^{2}}{r_{j}}},\quad i=1,\ldots,M. (5.11)

We now obtain the following result.

Corollary 5.6.

Corollaries 5.2–5.4 all hold for the block sampling model with NN replaced by MM and taking (5.11) as the definition of the σi\sigma_{i} rather than (5.5).

As observed, this block sampling model was used in [6] in the context of MRI reconstruction with generative models (see [6, App. C]). Note that the local coherences (5.11) are identical to the generalized Christoffel functions defined therein. Here we generalize this setup and provide theoretical guarantees. On the practical side, we note that the local coherences in the block case (5.11) can be estimated empirically, much as in the single-row case considered previously. See [6, §C.4].

6 Proofs

Consider a realization of the AiA_{i}. We say that empirical nondegeneracy holds over a set 𝕌𝕏0\mathbb{U}\subseteq\mathbb{X}_{0} with constants 0<αβ<0<\alpha^{\prime}\leq\beta^{\prime}<\infty if

αu𝕏21mi=1mAi(u)𝕐i2βu𝕏2,u𝕌.\alpha^{\prime}{\|u\|}^{2}_{\mathbb{X}}\leq\frac{1}{m}\sum^{m}_{i=1}{\|A_{i}(u)\|}^{2}_{\mathbb{Y}_{i}}\leq\beta^{\prime}{\|u\|}^{2}_{\mathbb{X}},\quad\forall u\in\mathbb{U}. (6.1)

Recall from the notation introduced in §1.1 that this can be equivalently written as

αv𝕏2A¯(v)𝕐¯2βv𝕏2,v𝕌.\alpha^{\prime}{\|v\|}^{2}_{\mathbb{X}}\leq{\|\bar{A}(v)\|}^{2}_{\overline{\mathbb{Y}}}\leq\beta^{\prime}{\|v\|}^{2}_{\mathbb{X}},\quad\forall v\in\mathbb{U}.

Note that in the classical compressed sensing setup (see Example 2), this is equivalent to the measurement matrix AA satisfying the Restricted Isometry Property (RIP) [43, Chpt. 6]. As the following lemma shows, empirical nondegeneracy is crucial in establishing a recovery guarantee in the general case.
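For a linear subspace this connection is easy to see numerically. In the sketch below (an illustrative setting of ours, with i.i.d. Gaussian scalar measurements, which are nondegenerate with α = β = 1), the best constants in (6.1) over 𝕌 = range(V) are the extreme squared singular values of the normalized measurement matrix restricted to that subspace.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, m = 256, 10, 60
V, _ = np.linalg.qr(rng.standard_normal((N, n)))       # orthonormal basis of a subspace U
A = rng.standard_normal((m, N)) / np.sqrt(m)           # rows: scaled i.i.d. Gaussian functionals

# (1/m) sum_i |a_i^T u|^2 = ||A u||^2, so over unit-norm u in U = range(V) the
# extreme values are the extreme squared singular values of A @ V.
s = np.linalg.svd(A @ V, compute_uv=False)
alpha_prime, beta_prime = s[-1] ** 2, s[0] ** 2
print(alpha_prime, beta_prime)                         # both approach 1 as m/n grows
```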

Lemma 6.1.

Let Ai(𝕏0,𝕐i)A_{i}\in\mathcal{B}(\mathbb{X}_{0},\mathbb{Y}_{i}), i=1,,mi=1,\ldots,m, be such that (6.1) holds over the set 𝕌𝕌\mathbb{U}-\mathbb{U} with constants 0<αβ<0<\alpha^{\prime}\leq\beta^{\prime}<\infty. Let x𝕏0x\in\mathbb{X}_{0}, γ0\gamma\geq 0 and x^𝕌\hat{x}\in\mathbb{U} be a γ\gamma-minimizer of (1.3) based on noisy measurements (1.2). Then

xx^𝕏infu𝕌{2αA¯(xu)𝕐¯+xu𝕏}+2αe¯𝕐¯+γα.{\|x-\hat{x}\|}_{\mathbb{X}}\leq\inf_{u\in\mathbb{U}}\left\{\frac{2}{\sqrt{\alpha^{\prime}}}{\|\bar{A}(x-u)\|}_{\overline{\mathbb{Y}}}+{\|x-u\|}_{\mathbb{X}}\right\}+\frac{2}{\sqrt{\alpha^{\prime}}}{\|\bar{e}\|}_{\overline{\mathbb{Y}}}+\frac{\gamma}{\sqrt{\alpha^{\prime}}}.

Although the proof is straightforward, we include it for completeness.

Proof.

Let u𝕌u\in\mathbb{U}. Then

x^x𝕏\displaystyle{\|\hat{x}-x\|}_{\mathbb{X}} x^u𝕏+xu𝕏\displaystyle\leq{\|\hat{x}-u\|}_{\mathbb{X}}+{\|x-u\|}_{\mathbb{X}}
1/αA¯(x^u)𝕐¯+xu𝕏\displaystyle\leq 1/\sqrt{\alpha^{\prime}}{\|\bar{A}(\hat{x}-u)\|}_{\overline{\mathbb{Y}}}+{\|x-u\|}_{\mathbb{X}}
1/αA¯(x^)b¯𝕐¯+1/αb¯A¯(u)𝕐¯+xu𝕏\displaystyle\leq 1/\sqrt{\alpha^{\prime}}{\|\bar{A}(\hat{x})-\bar{b}\|}_{\overline{\mathbb{Y}}}+1/\sqrt{\alpha^{\prime}}{\|\bar{b}-\bar{A}(u)\|}_{\overline{\mathbb{Y}}}+{\|x-u\|}_{\mathbb{X}}
γ/α+2/αA¯(u)b¯𝕐¯+xu𝕏\displaystyle\leq\gamma/\sqrt{\alpha^{\prime}}+2/\sqrt{\alpha^{\prime}}{\|\bar{A}(u)-\bar{b}\|}_{\overline{\mathbb{Y}}}+{\|x-u\|}_{\mathbb{X}}
γ/α+2/αA¯(ux)𝕐¯+2/αe¯𝕐¯+xu𝕏,\displaystyle\leq\gamma/\sqrt{\alpha^{\prime}}+2/\sqrt{\alpha^{\prime}}{\|\bar{A}(u-x)\|}_{\overline{\mathbb{Y}}}+2/\sqrt{\alpha^{\prime}}{\|\bar{e}\|}_{\overline{\mathbb{Y}}}+{\|x-u\|}_{\mathbb{X}},

as required. ∎

Notice that this result in fact only requires the lower inequality in (6.1). The upper inequality will be used later in the derivation of the error bound in expectation. Consequently, the remainder of the proofs is devoted to deriving measurement conditions under which (6.1) holds, and then to using this and the above lemma to derive error bounds in expectation.

6.1 Measurement conditions for empirical nondegeneracy

Theorem 6.2.

Consider the setup of §1.1. Let 0<δ,ϵ<10<\delta,\epsilon<1 and 𝕌𝕏0\mathbb{U}\subseteq\mathbb{X}_{0} be such that

  1. (i)

    𝕌\mathbb{U} is a cone, and

  2. (ii)

    𝕌𝕍1𝕍d=:𝕍\mathbb{U}\subseteq\mathbb{V}_{1}\cup\cdots\cup\mathbb{V}_{d}=:\mathbb{V}, where each 𝕍i𝕏0\mathbb{V}_{i}\subseteq\mathbb{X}_{0} is a subspace of dimension at most nn.

Suppose that either

  1. (a)

    mδ2α1Φ(S(𝕌);𝒜¯)[log(2d/ϵ)+nlog(2γ(𝕌;𝒜¯))]m\gtrsim\delta^{-2}\cdot\alpha^{-1}\cdot\Phi(S(\mathbb{U});\bar{\mathcal{A}})\cdot\left[\log(2d/\epsilon)+n\log(2\gamma(\mathbb{U};\bar{\mathcal{A}}))\right], where

    γ(𝕌;𝒜¯)=min{Φ(S(𝕍);𝒜¯),Φ(S(𝕌𝕌);𝒜¯)}Φ(S(𝕌);𝒜¯),\gamma(\mathbb{U};\bar{\mathcal{A}})=\frac{\min\{\Phi(S(\mathbb{V});\bar{\mathcal{A}}),\Phi(S(\mathbb{U}-\mathbb{U});\bar{\mathcal{A}})\}}{\Phi(S(\mathbb{U});\bar{\mathcal{A}})}, (6.2)
  2. (b)

    mδ2α1Φ(S(𝕌𝕌);𝒜¯)[log(2d/ϵ)+n]m\gtrsim\delta^{-2}\cdot\alpha^{-1}\cdot\Phi(S(\mathbb{U}-\mathbb{U});\bar{\mathcal{A}})\cdot\left[\log(2d/\epsilon)+n\right],

  3. (c)

    or mδ2α1Φ(S(𝕍);𝒜¯)log(2nd/ϵ)m\gtrsim\delta^{-2}\cdot\alpha^{-1}\cdot\Phi(S(\mathbb{V});\bar{\mathcal{A}})\cdot\log(2nd/\epsilon).

Then, with probability at least 1ϵ1-\epsilon, (6.1) holds for 𝕌\mathbb{U} with constants α=(1δ)α\alpha^{\prime}=(1-\delta)\alpha and β=(1+δ)β\beta^{\prime}=(1+\delta)\beta.

Note that Theorem 6.2 will be used to establish Theorem 3.6, while Theorem 3.7 relies on the following result. We split the analysis into two statements as the proof techniques are quite different.

Theorem 6.3.

Consider the setup of §1.1. Let 0<δ,ϵ<10<\delta,\epsilon<1 and 𝕌𝕏0\mathbb{U}\subseteq\mathbb{X}_{0} be such that assumptions (i) and (ii) of Theorem 6.2 hold, and also that

  1. (iii)

    {u𝕌:u𝕏1}conv(𝕎)\{u\in\mathbb{U}:{\|u\|}_{\mathbb{X}}\leq 1\}\subseteq\mathrm{conv}(\mathbb{W}), where 𝕎\mathbb{W} is a finite set of size |𝕎|=M|\mathbb{W}|=M.

Suppose that either

  1. (a)

    mδ2α1Φ(S(𝕌)𝕎;𝒜¯)Lm\gtrsim\delta^{-2}\cdot\alpha^{-1}\cdot\Phi(S(\mathbb{U})\cup\mathbb{W};\bar{\mathcal{A}})\cdot L, where

    L=log(2Φ(S(𝕌)𝕎;𝒜¯)/α)[log(2γ(𝕌;𝒜¯))+log(2M)log2(log(2d)+n)]+log(ϵ1)L=\log\left(2\Phi(S(\mathbb{U})\cup\mathbb{W};\bar{\mathcal{A}})/\alpha\right)\cdot\left[\log\left(2\gamma(\mathbb{U};\bar{\mathcal{A}})\right)+\log(2M)\cdot\log^{2}(\log(2d)+n)\right]+\log(\epsilon^{-1})

    and γ(𝕌;𝒜¯)\gamma(\mathbb{U};\bar{\mathcal{A}}) is as in (6.2);

  2. (b)

    mδ2α1Φ(S(𝕌𝕌)𝕎;𝒜¯)L,m\gtrsim\delta^{-2}\cdot\alpha^{-1}\cdot\Phi(S(\mathbb{U}-\mathbb{U})\cup\mathbb{W};\bar{\mathcal{A}})\cdot L, where

    L=log(2Φ(S(𝕌)𝕎;𝒜¯)/α)log(2M)log2(log(2d)+n)+log(ϵ1);L=\log\left(2\Phi(S(\mathbb{U})\cup\mathbb{W};\bar{\mathcal{A}})/\alpha\right)\cdot\log(2M)\cdot\log^{2}(\log(2d)+n)+\log(\epsilon^{-1});

    or

  3. (c)

    mδ2α1Φ(S(𝕍)𝕎;𝒜¯)Lm\gtrsim\delta^{-2}\cdot\alpha^{-1}\cdot\Phi(S(\mathbb{V})\cup\mathbb{W};\bar{\mathcal{A}})\cdot L, where

    L=log(2Φ(S(𝕍)𝕎;𝒜¯)/α)log(2M)log2(log(2d)+n)+log(ϵ1).L=\log\left(2\Phi(S(\mathbb{V})\cup\mathbb{W};\bar{\mathcal{A}})/\alpha\right)\cdot\log(2M)\cdot\log^{2}(\log(2d)+n)+\log(\epsilon^{-1}).

Then, with probability at least 1ϵ1-\epsilon, (6.1) holds for 𝕌\mathbb{U} with constants α=(1δ)α\alpha^{\prime}=(1-\delta)\alpha and β=(1+δ)β\beta^{\prime}=(1+\delta)\beta.

We now prove these results.

6.1.1 Setup

Let 𝕌𝕏\mathbb{U}\subseteq\mathbb{X} be such that assumption (ii) of Theorem 6.2 holds. As per §3.5, we will not assume that assumption (i) holds for the moment.

Nondegeneracy (1.1) implies that the quantity

|x|𝕏=1mi=1m𝔼Ai𝒜iAi(x)𝕐i2=𝔼A¯𝒜¯A¯(x)𝕐¯2,x𝕏0,{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|x\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{\mathbb{X}}=\sqrt{\frac{1}{m}\sum^{m}_{i=1}\mathbb{E}_{A_{i}\sim\mathcal{A}_{i}}{\|A_{i}(x)\|}^{2}_{\mathbb{Y}_{i}}}=\sqrt{\mathbb{E}_{\bar{A}\sim\bar{\mathcal{A}}}{\|\bar{A}(x)\|}^{2}_{\overline{\mathbb{Y}}}},\quad x\in\mathbb{X}_{0},

defines a norm on 𝕏0\mathbb{X}_{0} that is equivalent to 𝕏{\left\|\cdot\right\|}_{\mathbb{X}}. Let

S~(𝕌)={u/|||u|||𝕏:u𝕌\{0}}.\widetilde{S}(\mathbb{U})=\{u/{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|u\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{\mathbb{X}}:u\in\mathbb{U}\backslash\{0\}\}.

Then, for empirical nondegeneracy (6.1) to hold with constants α=(1δ)α\alpha^{\prime}=(1-\delta)\alpha and β=(1+δ)β\beta^{\prime}=(1+\delta)\beta, it suffices to show that the random variable

δ𝕌=supuS~(𝕌)|1mi=1mAi(u)𝕐i21|\delta_{\mathbb{U}}=\sup_{u\in\widetilde{S}(\mathbb{U})}\left|\frac{1}{m}\sum^{m}_{i=1}{\|A_{i}(u)\|}^{2}_{\mathbb{Y}_{i}}-1\right| (6.3)

satisfies δ𝕌δ\delta_{\mathbb{U}}\leq\delta with probability at least 1ϵ1-\epsilon. Following a standard route, we show this by first bounding the expectation of δ𝕌\delta_{\mathbb{U}} and then by estimating the probability it exceeds δ\delta.

6.1.2 Expectation bounds: setup

Recall that a Rademacher random variable is a random variable that takes the values +1+1 and 1-1, each with probability 12\frac{1}{2}. Recall also that {1,,m}\mathcal{I}\subseteq\{1,\ldots,m\} is the set of ii such that the corresponding distribution 𝒜i\mathcal{A}_{i} is nonconstant.

Lemma 6.4.

Let \mathcal{I} be the index set (3.8) and {ϵi}i\{\epsilon_{i}\}_{i\in\mathcal{I}} be independent Rademacher random variables that are also independent of the random variables {Ai}i=1m\{A_{i}\}^{m}_{i=1}. Then the random variable (6.3) satisfies

𝔼𝒜¯(δ𝕌)2m1𝔼𝒜¯𝔼ϵ¯supuS~(𝕌)|iϵiAi(u)𝕐i2|.\mathbb{E}_{\bar{\mathcal{A}}}(\delta_{\mathbb{U}})\leq 2m^{-1}\mathbb{E}_{\bar{\mathcal{A}}}\mathbb{E}_{\overline{\epsilon}}\sup_{u\in\widetilde{S}(\mathbb{U})}\left|\sum_{i\in\mathcal{I}}\epsilon_{i}{\|A_{i}(u)\|}^{2}_{\mathbb{Y}_{i}}\right|.

Here 𝔼𝒜¯\mathbb{E}_{\bar{\mathcal{A}}} denotes expectation with respect to the variables {Ai}i=1m\{A_{i}\}^{m}_{i=1} and 𝔼ϵ¯\mathbb{E}_{\overline{\epsilon}} denotes expectation with respect to the {ϵi}i\{\epsilon_{i}\}_{i\in\mathcal{I}}.

Proof.

Since

1=|u|𝕏2=1mi=1m𝔼Ai𝒜iAi(u)𝕐i2,uS~(𝕌),1={\left|\kern-1.07639pt\left|\kern-1.07639pt\left|u\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{2}_{\mathbb{X}}=\frac{1}{m}\sum^{m}_{i=1}\mathbb{E}_{A_{i}\sim\mathcal{A}_{i}}{\|A_{i}(u)\|}^{2}_{\mathbb{Y}_{i}},\quad\forall u\in\widetilde{S}(\mathbb{U}), (6.4)

we have

δ𝕌=supuS~(𝕌)|1mi=1m(Ai(u)𝕐i2𝔼Ai𝒜iAi(u)𝕐i2)|=supuS~(𝕌)|1mi(Ai(u)𝕐i2𝔼Ai𝒜iAi(u)𝕐i2)|a.s.\begin{split}\delta_{\mathbb{U}}&=\sup_{u\in\widetilde{S}(\mathbb{U})}\left|\frac{1}{m}\sum^{m}_{i=1}\left({\|A_{i}(u)\|}^{2}_{\mathbb{Y}_{i}}-\mathbb{E}_{A_{i}\sim\mathcal{A}_{i}}{\|A_{i}(u)\|}^{2}_{\mathbb{Y}_{i}}\right)\right|\\ &=\sup_{u\in\widetilde{S}(\mathbb{U})}\left|\frac{1}{m}\sum_{i\in\mathcal{I}}\left({\|A_{i}(u)\|}^{2}_{\mathbb{Y}_{i}}-\mathbb{E}_{A_{i}\sim\mathcal{A}_{i}}{\|A_{i}(u)\|}^{2}_{\mathbb{Y}_{i}}\right)\right|\quad\text{a.s.}\end{split} (6.5)

Here, in the second step we used the definition of \mathcal{I}. Assumption (ii) implies that 𝕌𝕍¯\mathbb{U}\subseteq\overline{\mathbb{V}}, where 𝕍¯𝕏0\overline{\mathbb{V}}\subseteq\mathbb{X}_{0} is the finite-dimensional vector space 𝕍1++𝕍d\mathbb{V}_{1}+\cdots+\mathbb{V}_{d}. Since 𝕍¯\overline{\mathbb{V}} is a vector space, we also have S~(𝕌)𝕍¯\widetilde{S}(\mathbb{U})\subseteq\overline{\mathbb{V}}. Consider 𝕍¯\overline{\mathbb{V}} as a Hilbert space with the inner product from 𝕏\mathbb{X}. Then the map

Bi:(𝕍¯,𝕏)(𝕐i,𝕐i),vAi(v),B_{i}:(\overline{\mathbb{V}},{\left\|\cdot\right\|}_{\mathbb{X}})\rightarrow(\mathbb{Y}_{i},{\left\|\cdot\right\|}_{\mathbb{Y}_{i}}),\ v\mapsto A_{i}(v),

is bounded. Indeed,

Bi(v)𝕐iAi𝕏0𝕐iv𝕏0cAi𝕏0𝕐iv𝕏,v𝕍¯.{\|B_{i}(v)\|}_{\mathbb{Y}_{i}}\leq{\|A_{i}\|}_{\mathbb{X}_{0}\rightarrow\mathbb{Y}_{i}}{\|v\|}_{\mathbb{X}_{0}}\leq c{\|A_{i}\|}_{\mathbb{X}_{0}\rightarrow\mathbb{Y}_{i}}{\|v\|}_{\mathbb{X}},\quad\forall v\in\overline{\mathbb{V}}.

Here, in the first inequality we used the fact that the map Ai(𝕏0,𝕐i)A_{i}\in\mathcal{B}(\mathbb{X}_{0},\mathbb{Y}_{i}) is bounded by assumption and in the second step, we used the fact that 𝕍¯\overline{\mathbb{V}} is finite dimensional and all norms on finite-dimensional vector spaces are equivalent. Therefore, BiB_{i} has a unique bounded adjoint

Bi:(𝕐i,𝕐i)(𝕍¯,𝕏).B^{*}_{i}:(\mathbb{Y}_{i},{\left\|\cdot\right\|}_{\mathbb{Y}_{i}})\rightarrow(\overline{\mathbb{V}},{\left\|\cdot\right\|}_{\mathbb{X}}).

In particular, we may write

Ai(v)𝕐i2=BiBi(v),v𝕏=Ui(v),v𝕏,v𝕍¯,{\|A_{i}(v)\|}^{2}_{\mathbb{Y}_{i}}=\langle B^{*}_{i}B_{i}(v),v\rangle_{\mathbb{X}}=\langle U_{i}(v),v\rangle_{\mathbb{X}},\quad\forall v\in\overline{\mathbb{V}},

where Ui=BiBiU_{i}=B^{*}_{i}B_{i} is a random variable taking values in (𝕍¯)\mathcal{B}(\overline{\mathbb{V}}), the finite-dimensional vector space of bounded linear operators on (𝕍¯,𝕏)(\overline{\mathbb{V}},{\left\|\cdot\right\|}_{\mathbb{X}}). Using this we can write

δ𝕌=m1supuS~(𝕌)|iUi(u)𝔼(Ui(u)),u𝕏|=m1f(i(Ui𝔼(Ui))),\delta_{\mathbb{U}}=m^{-1}\sup_{u\in\widetilde{S}(\mathbb{U})}\left|\sum_{i\in\mathcal{I}}\left<U_{i}(u)-\mathbb{E}\left(U_{i}(u)\right),u\right>_{\mathbb{X}}\right|=m^{-1}f\left(\sum_{i\in\mathcal{I}}(U_{i}-\mathbb{E}(U_{i}))\right), (6.6)

where ff is the convex function

f:(𝕍¯),UsupuS~(𝕌)|U(u),u𝕏|.f:\mathcal{B}(\overline{\mathbb{V}})\rightarrow\mathbb{R},\quad U\mapsto\sup_{u\in\widetilde{S}(\mathbb{U})}|\langle U(u),u\rangle_{\mathbb{X}}|.

We now apply symmetrization (see, e.g., [9, Lem. 13.26]) to get

𝔼𝒜¯(δ𝕌)m1𝔼𝒜¯𝔼ϵ¯f(2iϵiUi).\mathbb{E}_{\bar{\mathcal{A}}}(\delta_{\mathbb{U}})\leq m^{-1}\mathbb{E}_{\bar{\mathcal{A}}}\mathbb{E}_{\overline{\epsilon}}f\left(2\sum_{i\in\mathcal{I}}\epsilon_{i}U_{i}\right).

Now observe that

f(2iϵiUi)=2supuS~(𝕌)|iϵiUi(u),u𝕏|=2supuS~(𝕌)|iϵiAi(u)𝕐i2|.f\left(2\sum_{i\in\mathcal{I}}\epsilon_{i}U_{i}\right)=2\sup_{u\in\widetilde{S}(\mathbb{U})}\left|\left<\sum_{i\in\mathcal{I}}\epsilon_{i}U_{i}(u),u\right>_{\mathbb{X}}\right|=2\sup_{u\in\widetilde{S}(\mathbb{U})}\left|\sum_{i\in\mathcal{I}}\epsilon_{i}{\|A_{i}(u)\|}^{2}_{\mathbb{Y}_{i}}\right|.

This completes the proof. ∎

The next several results involve the covering number 𝒩\mathcal{N} (see, e.g., [9, Defn. 13.21]) and Dudley’s inequality (see, e.g., [9, Thm. 13.25]). Here we recall the notation m~=||\widetilde{m}=|\mathcal{I}|.

Lemma 6.5.

Fix a realization of the AiA_{i}, i=1,,mi=1,\ldots,m. Then

𝔼ϵ¯supuS~(𝕌)|iϵiAi(u)𝕐i2|Rp,𝒜¯0χq/2log(2𝒩(S~(𝕌),𝒜¯,t))dt\mathbb{E}_{\overline{\epsilon}}\sup_{u\in\widetilde{S}(\mathbb{U})}\left|\sum_{i\in\mathcal{I}}\epsilon_{i}{\|A_{i}(u)\|}^{2}_{\mathbb{Y}_{i}}\right|\lesssim R_{p,\bar{\mathcal{A}}}\int^{\chi_{q}/2}_{0}\sqrt{\log(2\mathcal{N}(\widetilde{S}(\mathbb{U}),{\left\|\cdot\right\|}_{\bar{\mathcal{A}}},t))}\,\mathrm{d}t (6.7)

for all 1p<1\leq p<\infty, where

Rp,𝒜¯=supuS~(𝕌)(iAi(u)𝕐i2p)12p,R_{p,\bar{\mathcal{A}}}=\sup_{u\in\widetilde{S}(\mathbb{U})}\left(\sum_{i\in\mathcal{I}}{\|A_{i}(u)\|}^{2p}_{\mathbb{Y}_{i}}\right)^{\frac{1}{2p}}, (6.8)

1<q1<q\leq\infty is such that 1/p+1/q=11/p+1/q=1,

χq=m~1qΦ(S(𝕌),𝒜¯)α,\chi_{q}=\sqrt{\frac{\widetilde{m}^{\frac{1}{q}}\Phi(S(\mathbb{U}),\bar{\mathcal{A}})}{\alpha}}, (6.9)

and

x𝒜¯=(iAi(x)𝕐i2q)12q,x𝕏0.{\|x\|}_{\bar{\mathcal{A}}}=\left(\sum_{i\in\mathcal{I}}{\|A_{i}(x)\|}^{2q}_{\mathbb{Y}_{i}}\right)^{\frac{1}{2q}},\quad x\in\mathbb{X}_{0}. (6.10)
Proof.

The left-hand side of (6.7) is the expected value of the supremum of the absolute value of the Rademacher process {Xu:uS~(𝕌)}\{X_{u}:u\in\widetilde{S}(\mathbb{U})\}, where

Xu=iϵiAi(u)𝕐i2.X_{u}=\sum_{i\in\mathcal{I}}\epsilon_{i}{\|A_{i}(u)\|}^{2}_{\mathbb{Y}_{i}}.

The corresponding pseudometric is

d(u,v)=i(Ai(u)𝕐i2Ai(v)𝕐i2)2,u,vS~(𝕌).d(u,v)=\sqrt{\sum_{i\in\mathcal{I}}\left({\|A_{i}(u)\|}^{2}_{\mathbb{Y}_{i}}-{\|A_{i}(v)\|}^{2}_{\mathbb{Y}_{i}}\right)^{2}},\quad u,v\in\widetilde{S}(\mathbb{U}).

Dudley’s inequality implies that

𝔼ϵ¯supuS~(𝕌)|iϵiAi(u)𝕐i2|0ϖ/2log(2𝒩(S~(𝕌),d,z))dz,\mathbb{E}_{\overline{\epsilon}}\sup_{u\in\widetilde{S}(\mathbb{U})}\left|\sum_{i\in\mathcal{I}}\epsilon_{i}{\|A_{i}(u)\|}^{2}_{\mathbb{Y}_{i}}\right|\lesssim\int^{\varpi/2}_{0}\sqrt{\log(2\mathcal{N}(\widetilde{S}(\mathbb{U}),d,z))}\,\mathrm{d}z, (6.11)

where ϖ=supuS~(𝕌)𝔼ϵ¯|Xu|2\varpi=\sup_{u\in\widetilde{S}(\mathbb{U})}\sqrt{\mathbb{E}_{\overline{\epsilon}}|X_{u}|^{2}}. The remainder of the proof involves upper bounding the pseudometric dd and the constant ϖ\varpi.

First, for two sequences aia_{i}, bib_{i}, observe that

i(|ai|2|bi|2)2\displaystyle\sum_{i}(|a_{i}|^{2}-|b_{i}|^{2})^{2} =i(|ai|+|bi|)2(|ai||bi|)2\displaystyle=\sum_{i}(|a_{i}|+|b_{i}|)^{2}(|a_{i}|-|b_{i}|)^{2}
(i(|ai|+|bi|)2p)1p(i(|aibi|)2q)1q\displaystyle\leq\left(\sum_{i}(|a_{i}|+|b_{i}|)^{2p}\right)^{\frac{1}{p}}\left(\sum_{i}(|a_{i}-b_{i}|)^{2q}\right)^{\frac{1}{q}}

where qq is such that 1/p+1/q=11/p+1/q=1. Therefore

i(|ai|2|bi|2)22max{a2p,b2p}ab2q.\sqrt{\sum_{i}(|a_{i}|^{2}-|b_{i}|^{2})^{2}}\leq 2\max\{{\|a\|}_{2p},{\|b\|}_{2p}\}{\|a-b\|}_{2q}.

We deduce that

d(u,v)2Rp,𝒜¯uv𝒜¯,d(u,v)\leq 2R_{p,\bar{\mathcal{A}}}{\|u-v\|}_{\bar{\mathcal{A}}}, (6.12)

where RpR_{p} and 𝒜¯{\left\|\cdot\right\|}_{\bar{\mathcal{A}}} are as in (6.8) and (6.10), respectively.

Second, consider the term ϖ\varpi. Then, defining additionally X0=0X_{0}=0 (since 0S~(𝕌)0\notin\widetilde{S}(\mathbb{U})), we see that

ϖ=supuS~(𝕌)d(u,0)2Rp,𝒜¯supuS~(𝕌)u𝒜¯.\varpi=\sup_{u\in\widetilde{S}(\mathbb{U})}d(u,0)\leq 2R_{p,\bar{\mathcal{A}}}\sup_{u\in\widetilde{S}(\mathbb{U})}{\|u\|}_{\bar{\mathcal{A}}}.

Now let u=v/|v|𝕏S~(𝕌)u=v/{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|v\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{\mathbb{X}}\in\widetilde{S}(\mathbb{U}), where v𝕌v\in\mathbb{U}. Then

Ai(v)𝕐i2Φ(S(𝕌),𝒜i)v𝕏2{\|A_{i}(v)\|}^{2}_{\mathbb{Y}_{i}}\leq\Phi(S(\mathbb{U}),\mathcal{A}_{i}){\|v\|}^{2}_{\mathbb{X}}

Also, recall that αx𝕏|x|𝕏βx𝕏\sqrt{\alpha}{\|x\|}_{\mathbb{X}}\leq{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|x\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{\mathbb{X}}\leq\sqrt{\beta}{\|x\|}_{\mathbb{X}}, x𝕏0\forall x\in\mathbb{X}_{0}. We deduce that

Ai(u)𝕐i2Φ(S(𝕌);𝒜i)/α,uS~(𝕌).{\|A_{i}(u)\|}^{2}_{\mathbb{Y}_{i}}\leq\Phi(S(\mathbb{U});\mathcal{A}_{i})/\alpha,\quad\forall u\in\widetilde{S}(\mathbb{U}). (6.13)

This gives

u𝒜¯χq,uS~(𝕌).\displaystyle{\|u\|}_{\bar{\mathcal{A}}}\leq\chi_{q},\quad\forall u\in\widetilde{S}(\mathbb{U}).

Hence ϖ2Rp,𝒜¯χq\varpi\leq 2R_{p,\bar{\mathcal{A}}}\chi_{q}.

We now combine this with (6.12) and standard properties of covering numbers to obtain

𝔼ϵ¯supuS~(𝕌)|iϵiAi(u)𝕐i2|0Rp,𝒜¯χqlog(2𝒩(S~(𝕌),𝒜¯,z/(2Rp,𝒜¯)))dz.\mathbb{E}_{\overline{\epsilon}}\sup_{u\in\widetilde{S}(\mathbb{U})}\left|\sum_{i\in\mathcal{I}}\epsilon_{i}{\|A_{i}(u)\|}^{2}_{\mathbb{Y}_{i}}\right|\lesssim\int^{R_{p,\bar{\mathcal{A}}}\chi_{q}}_{0}\sqrt{\log(2\mathcal{N}(\widetilde{S}(\mathbb{U}),{\left\|\cdot\right\|}_{\bar{\mathcal{A}}},z/(2R_{p,\bar{\mathcal{A}}})))}\,\mathrm{d}z.

The result now follows from a change of variables. ∎

The next steps involve estimating the covering number. We do this in two ways via the following two lemmas. Their proofs rely on a number of standard properties of covering numbers. See, e.g., [9, §13.5.2].

Lemma 6.6 (First covering number bound).

Consider the setup of the previous lemma. Then

log(2𝒩(S~(𝕌),𝒜¯,t))log(2d)+nlog(1+4χq/t),\sqrt{\log(2\mathcal{N}(\widetilde{S}(\mathbb{U}),{\left\|\cdot\right\|}_{\bar{\mathcal{A}}},t))}\leq\sqrt{\log(2d)}+\sqrt{n}\sqrt{\log\left(1+4\chi^{\prime}_{q}/t\right)}, (6.14)

where

  1. (a)

    χq=m~1qmin{Φ(S(𝕍);𝒜¯),Φ(S(S~(𝕌)S~(𝕌));𝒜¯)}α\chi^{\prime}_{q}=\sqrt{\frac{\widetilde{m}^{\frac{1}{q}}\min\{\Phi(S(\mathbb{V});\bar{\mathcal{A}}),\Phi(S(\widetilde{S}(\mathbb{U})-\widetilde{S}(\mathbb{U}));\bar{\mathcal{A}})\}}{\alpha}} if 𝕌\mathbb{U} satisfies assumption (ii) of Theorem 6.2; or

  2. (b)

    χq=m~1qmin{Φ(S(𝕍);𝒜¯),Φ(S(𝕌𝕌);𝒜¯)}α\chi^{\prime}_{q}=\sqrt{\frac{\widetilde{m}^{\frac{1}{q}}\min\{\Phi(S(\mathbb{V});\bar{\mathcal{A}}),\Phi(S(\mathbb{U}-\mathbb{U});\bar{\mathcal{A}})\}}{\alpha}} if 𝕌\mathbb{U} satisfies assumptions (i) and (ii) of Theorem 6.2.

Proof.

Write χq=min{χq,1,χq,2}\chi^{\prime}_{q}=\min\{\chi^{\prime}_{q,1},\chi^{\prime}_{q,2}\}, where

χq,1={m~1qΦ(S(S~(𝕌)S~(𝕌));𝒜¯)αcase (a)m~1qΦ(S(𝕌𝕌);𝒜¯)αcase (b)andχq,2=m~1qΦ(S(𝕍);𝒜¯)α.\chi^{\prime}_{q,1}=\begin{cases}\sqrt{\frac{\widetilde{m}^{\frac{1}{q}}\Phi(S(\widetilde{S}(\mathbb{U})-\widetilde{S}(\mathbb{U}));\bar{\mathcal{A}})}{\alpha}}&\text{case (a)}\\ \sqrt{\frac{\widetilde{m}^{\frac{1}{q}}\Phi(S(\mathbb{U}-\mathbb{U});\bar{\mathcal{A}})}{\alpha}}&\text{case (b)}\end{cases}\quad\text{and}\quad\chi^{\prime}_{q,2}=\sqrt{\frac{\widetilde{m}^{\frac{1}{q}}\Phi(S(\mathbb{V});\bar{\mathcal{A}})}{\alpha}}.

Then it suffices to prove the bound (6.14) first with χq,1\chi^{\prime}_{q,1} in place of χq\chi^{\prime}_{q} and then with χq,2\chi^{\prime}_{q,2} in place of χq\chi^{\prime}_{q}.

Step 1: showing (6.14) with χq,1\chi^{\prime}_{q,1} in place of χq\chi^{\prime}_{q}. By definition,

Ai(uv)𝕐i2Φ(S(S~(𝕌)S~(𝕌));𝒜i)uv𝕏2,u,vS~(𝕌).{\|A_{i}(u-v)\|}^{2}_{\mathbb{Y}_{i}}\leq\Phi(S(\widetilde{S}(\mathbb{U})-\widetilde{S}(\mathbb{U}));\mathcal{A}_{i}){\|u-v\|}^{2}_{\mathbb{X}},\quad\forall u,v\in\widetilde{S}(\mathbb{U}).

Therefore, the norm (6.10) satisfies

uv𝒜¯m~1qΦ(S(S~(𝕌)S~(𝕌));𝒜¯)uv𝕏,u,vS~(𝕌).{\|u-v\|}_{\bar{\mathcal{A}}}\leq\sqrt{\widetilde{m}^{\frac{1}{q}}\Phi(S(\widetilde{S}(\mathbb{U})-\widetilde{S}(\mathbb{U}));\bar{\mathcal{A}})}{\|u-v\|}_{\mathbb{X}},\quad\forall u,v\in\widetilde{S}(\mathbb{U}). (6.15)

Notice that the right-hand side is equal to αχq,1uv𝕏\sqrt{\alpha}\chi^{\prime}_{q,1}{\|u-v\|}_{\mathbb{X}} in case (a). Now consider case (b). Note that S~(𝕌)={u/|||u|||𝕏:u𝕌\{0}}𝕌\widetilde{S}(\mathbb{U})=\{u/{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|u\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{\mathbb{X}}:u\in\mathbb{U}\backslash\{0\}\}\subseteq\mathbb{U} whenever 𝕌\mathbb{U} is a cone. Therefore,

S(S~(𝕌)S~(𝕌))S(𝕌𝕌)S(\widetilde{S}(\mathbb{U})-\widetilde{S}(\mathbb{U}))\subseteq S(\mathbb{U}-\mathbb{U})

from which it follows that

Φ(S(S~(𝕌)S~(𝕌));𝒜¯)Φ(S(𝕌𝕌);𝒜¯).\Phi(S(\widetilde{S}(\mathbb{U})-\widetilde{S}(\mathbb{U}));\bar{\mathcal{A}})\leq\Phi(S(\mathbb{U}-\mathbb{U});\bar{\mathcal{A}}).

Therefore, the right-hand side of (6.15) can be bounded by αχq,1uv𝕏\sqrt{\alpha}\chi^{\prime}_{q,1}{\|u-v\|}_{\mathbb{X}} in case (b) as well. Using the equivalence between 𝕏{\left\|\cdot\right\|}_{\mathbb{X}} and ||||||𝕏{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\cdot\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{\mathbb{X}} once more, we deduce that

uv𝒜¯χq,1|uv|𝕏,u,vS~(𝕌),{\|u-v\|}_{\bar{\mathcal{A}}}\leq\chi^{\prime}_{q,1}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|u-v\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{\mathbb{X}},\quad\forall u,v\in\widetilde{S}(\mathbb{U}), (6.16)

in either case (a) or (b). Now, using standard properties of covering numbers, we get

𝒩(S~(𝕌),𝒜¯,t)𝒩(S~(𝕌),χq,1||||||𝕏,t)=𝒩(S~(𝕌),||||||𝕏,t/χq,1).\mathcal{N}(\widetilde{S}(\mathbb{U}),{\left\|\cdot\right\|}_{\bar{\mathcal{A}}},t)\leq\mathcal{N}(\widetilde{S}(\mathbb{U}),\chi^{\prime}_{q,1}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\cdot\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{\mathbb{X}},t)=\mathcal{N}(\widetilde{S}(\mathbb{U}),{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\cdot\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{\mathbb{X}},t/\chi^{\prime}_{q,1}).

Now, assumption (ii) of Theorem 6.2 gives that S~(𝕌)B~(𝕍1)B~(𝕍d)\widetilde{S}(\mathbb{U})\subseteq\widetilde{B}(\mathbb{V}_{1})\cup\cdots\cup\widetilde{B}(\mathbb{V}_{d}), where B~(𝕍i)={vi𝕍i:|vi|𝕏1}\widetilde{B}(\mathbb{V}_{i})=\{v_{i}\in\mathbb{V}_{i}:{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|v_{i}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{\mathbb{X}}\leq 1\} is the unit ball of (𝕍i,||||||𝕏)(\mathbb{V}_{i},{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\cdot\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{\mathbb{X}}). Using further properties of covering numbers and (6.16), we get

𝒩(S~(𝕌),𝒜¯,t)\displaystyle\mathcal{N}(\widetilde{S}(\mathbb{U}),{\left\|\cdot\right\|}_{\bar{\mathcal{A}}},t) 𝒩(B~(𝕍1)B~(𝕍d),||||||𝕏,t/(2χq,1))\displaystyle\leq\mathcal{N}(\widetilde{B}(\mathbb{V}_{1})\cup\cdots\cup\widetilde{B}(\mathbb{V}_{d}),{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\cdot\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{\mathbb{X}},t/(2\chi^{\prime}_{q,1}))
i=1d𝒩(B~(𝕍i),||||||𝕏,t/(2χq,1))\displaystyle\leq\sum^{d}_{i=1}\mathcal{N}(\widetilde{B}(\mathbb{V}_{i}),{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\cdot\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{\mathbb{X}},t/(2\chi^{\prime}_{q,1}))
d(1+4χq,1/t)n.\displaystyle\leq d\left(1+4\chi^{\prime}_{q,1}/t\right)^{n}.

We now take the logarithm and square root, and use the inequality \sqrt{a+b}\leq\sqrt{a}+\sqrt{b}, a,b\geq 0. This gives (6.14) with \chi^{\prime}_{q,1} in place of \chi^{\prime}_{q}.

Step 2: showing (6.14) with χq,2\chi^{\prime}_{q,2} in place of χq\chi^{\prime}_{q}. We now switch the order of the above arguments. First, we write

𝒩(S~(𝕌),𝒜¯,t)i=1d𝒩(B~(𝕍i),𝒜¯,t/2).\mathcal{N}(\widetilde{S}(\mathbb{U}),{\left\|\cdot\right\|}_{\bar{\mathcal{A}}},t)\leq\sum^{d}_{i=1}\mathcal{N}(\widetilde{B}(\mathbb{V}_{i}),{\left\|\cdot\right\|}_{\bar{\mathcal{A}}},t/2).

Now, since each \mathbb{V}_{j} is a subspace, we have

{\|A_{i}(u-v)\|}^{2}_{\mathbb{Y}_{i}}\leq\Phi(S(\mathbb{V}_{j});\mathcal{A}_{i}){\|u-v\|}^{2}_{\mathbb{X}}\leq\Phi(S(\mathbb{V});\mathcal{A}_{i}){\|u-v\|}^{2}_{\mathbb{X}},\quad\forall u,v\in\mathbb{V}_{j},

and therefore

{\|u-v\|}_{\bar{\mathcal{A}}}\leq\chi^{\prime}_{q,2}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|u-v\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{\mathbb{X}},\quad\forall u,v\in\mathbb{V}_{j}.

Since B~(𝕍i)\widetilde{B}(\mathbb{V}_{i}) is the unit ball of 𝕍i\mathbb{V}_{i} with respect to ||||||𝕏{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\cdot\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{\mathbb{X}}, we deduce that

𝒩(S~(𝕌),𝒜¯,t)i=1d𝒩(B~(𝕍i),||||||𝕏,t/(2χq,2))d(1+4χq,2/t)n.\mathcal{N}(\widetilde{S}(\mathbb{U}),{\left\|\cdot\right\|}_{\bar{\mathcal{A}}},t)\leq\sum^{d}_{i=1}\mathcal{N}(\widetilde{B}(\mathbb{V}_{i}),{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\cdot\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{\mathbb{X}},t/(2\chi^{\prime}_{q,2}))\leq d(1+4\chi^{\prime}_{q,2}/t)^{n}.

We now take the logarithm and square root of both sides, as before. ∎
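The covering estimate d(1 + 4χ′_q/t)^n appearing in both steps of the proof is simple to evaluate numerically. The following Python sketch (with illustrative values of d, n, χ′ and t that do not come from the paper) computes its logarithm and checks the splitting √(a+b) ≤ √a + √b used to arrive at (6.14).

```python
import numpy as np

def log_covering_bound(d, n, chi_prime, t):
    """log of the union-of-balls covering estimate d * (1 + 4*chi'/t)**n."""
    return np.log(d) + n * np.log(1.0 + 4.0 * chi_prime / t)

# Check the splitting sqrt(a + b) <= sqrt(a) + sqrt(b) used to obtain (6.14),
# on a small grid of illustrative parameter values.
d, n, chi_prime = 8, 5, 3.0
for t in (0.1, 0.5, 1.0):
    lhs = np.sqrt(log_covering_bound(d, n, chi_prime, t))
    rhs = np.sqrt(np.log(d)) + np.sqrt(n * np.log(1.0 + 4.0 * chi_prime / t))
    assert lhs <= rhs + 1e-12
    print(f"t = {t:4.2f}: sqrt(log N) = {lhs:.3f} <= {rhs:.3f}")
```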

To derive our second bound for the covering number, we first need to establish a version of Khintchine’s inequality in Hilbert spaces. For this, we need the following.

Lemma 6.7 (Hoeffding’s inequality in a Hilbert space).

Let X_{1},\ldots,X_{n} be independent mean-zero random variables taking values in a separable Hilbert space \mathbb{H} such that {\|X_{i}\|}_{\mathbb{H}}\leq c/2 and let v=nc^{2}/4. Then, for all t\geq\sqrt{v},

\mathbb{P}\left({\left\|\sum^{n}_{i=1}X_{i}\right\|}_{\mathbb{H}}>t\right)\leq\mathrm{e}^{-(t-\sqrt{v})^{2}/(2v)}.
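Before proceeding, the following Python sketch gives a quick Monte Carlo illustration of this tail bound, with the X_i drawn uniformly from a sphere in a finite-dimensional space as a stand-in for a general separable Hilbert space; the dimension, sample size, radius c and number of trials are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(0)

def hoeffding_tail_check(n=40, dim=10, c=1.0, t_factor=1.5, trials=5000):
    """Monte Carlo check of the Hilbert-space Hoeffding bound of Lemma 6.7.

    The X_i are uniform on the sphere of radius c/2 in R^dim, hence mean-zero
    with norm exactly c/2.  Returns (empirical tail, stated bound) at
    t = t_factor * sqrt(v), where v = n * c**2 / 4.
    """
    v = n * c**2 / 4.0
    t = t_factor * np.sqrt(v)
    X = rng.standard_normal((trials, n, dim))
    X = (c / 2.0) * X / np.linalg.norm(X, axis=2, keepdims=True)
    norms = np.linalg.norm(X.sum(axis=1), axis=1)
    empirical = np.mean(norms > t)
    bound = np.exp(-(t - np.sqrt(v)) ** 2 / (2.0 * v))
    return empirical, bound

emp, bnd = hoeffding_tail_check()
print(f"empirical tail {emp:.4f} <= Hoeffding bound {bnd:.4f}")
```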

Note that this lemma can be proved directly from McDiarmid’s inequality. Using this, we now obtain the following.

Lemma 6.8 (Khintchine’s inequality in a Hilbert space).

Let x_{1},\ldots,x_{n}\in\mathbb{H}, where \mathbb{H} is a separable Hilbert space, and let \epsilon_{1},\ldots,\epsilon_{n} be independent Rademacher random variables. Then

\left(\mathbb{E}{\left\|\sum^{n}_{i=1}\epsilon_{i}x_{i}\right\|}^{p}_{\mathbb{H}}\right)^{\frac{1}{p}}\lesssim\sqrt{p}\sqrt{n}\max_{i=1,\ldots,n}{\|x_{i}\|}_{\mathbb{H}},\quad\forall p\geq 1.
Proof.

Let Z={\left\|\sum^{n}_{i=1}\epsilon_{i}x_{i}\right\|}_{\mathbb{H}} and note that our aim is to bound (\mathbb{E}(Z^{p}))^{\frac{1}{p}}. Now let \gamma=2\max_{i=1,\ldots,n}{\|x_{i}\|}_{\mathbb{H}}. Then, by the previous lemma,

(Zt)exp((tnγ/2)2nγ2/2)\mathbb{P}\left(Z\geq t\right)\leq\exp\left(-\frac{(t-\sqrt{n}\gamma/2)^{2}}{n\gamma^{2}/2}\right)

whenever tnγ/2t\geq\sqrt{n}\gamma/2. In particular, if tnγt\geq\sqrt{n}\gamma, then

(Zt)exp(t22nγ2).\mathbb{P}\left(Z\geq t\right)\leq\exp\left(-\frac{t^{2}}{2n\gamma^{2}}\right).

Now fix nγ/2τ<\sqrt{n}\gamma/2\leq\tau<\infty. Then (see, e.g., [62, Ex. 1.2.3])

𝔼(Zp)\displaystyle\mathbb{E}(Z^{p}) =0ptp1(Zt)dt\displaystyle=\int^{\infty}_{0}pt^{p-1}\mathbb{P}(Z\geq t)\,\mathrm{d}t
pτp+τptp1exp(t22nγ2)dt\displaystyle\leq p\tau^{p}+\int^{\infty}_{\tau}pt^{p-1}\exp\left(-\frac{t^{2}}{2n\gamma^{2}}\right)\,\mathrm{d}t
pτp+0ptp1exp(t22nγ2)dt\displaystyle\leq p\tau^{p}+\int^{\infty}_{0}pt^{p-1}\exp\left(-\frac{t^{2}}{2n\gamma^{2}}\right)\,\mathrm{d}t
=pτp+(nγ)p0ptp1exp(t22)dt\displaystyle=p\tau^{p}+(\sqrt{n}\gamma)^{p}\int^{\infty}_{0}pt^{p-1}\exp\left(-\frac{t^{2}}{2}\right)\,\mathrm{d}t
=pτp+π/2(nγ)pp|t|p112πexp(t22)dt\displaystyle=p\tau^{p}+\sqrt{\pi/2}(\sqrt{n}\gamma)^{p}p\int^{\infty}_{-\infty}|t|^{p-1}\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{t^{2}}{2}\right)\,\mathrm{d}t
=pτp+π/2(nγ)pp𝔼|X|p1,\displaystyle=p\tau^{p}+\sqrt{\pi/2}(\sqrt{n}\gamma)^{p}p\mathbb{E}|X|^{p-1},

where X𝒩(0,1)X\sim\mathcal{N}(0,1). Now choose τ=nγ\tau=\sqrt{n}\gamma and take the ppth root to obtain

(𝔼(Zp))1pp1pnγ(1+(𝔼|X|p1)1/p).(\mathbb{E}(Z^{p}))^{\frac{1}{p}}\lesssim p^{\frac{1}{p}}\sqrt{n}\gamma(1+(\mathbb{E}|X|^{p-1})^{1/p}).

The moments of a standard Gaussian satisfy (\mathbb{E}|X|^{q})^{1/q}\lesssim\sqrt{q} (see, e.g., [62, Ex. 2.5.1]). Using this and the fact that p^{\frac{1}{p}}\lesssim 1 for p\geq 1, we deduce the result. ∎
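As a quick sanity check of Lemma 6.8 (not used anywhere in the analysis), the following Python sketch compares the empirical p-th moment with √p·√n·max_i‖x_i‖ for a few values of p; the vectors and all sizes are illustrative, and the lemma asserts that the printed ratio stays bounded by an absolute constant.

```python
import numpy as np

rng = np.random.default_rng(1)

def khintchine_ratio(n=30, dim=15, p=4, trials=5000):
    """Empirical (E||sum_i eps_i x_i||^p)^(1/p) divided by sqrt(p*n)*max_i||x_i||."""
    x = rng.standard_normal((n, dim))                 # fixed vectors x_1, ..., x_n
    eps = rng.choice([-1.0, 1.0], size=(trials, n))   # Rademacher signs
    sums = eps @ x                                    # each row is sum_i eps_i x_i
    lhs = np.mean(np.linalg.norm(sums, axis=1) ** p) ** (1.0 / p)
    rhs = np.sqrt(p) * np.sqrt(n) * np.max(np.linalg.norm(x, axis=1))
    return lhs / rhs

for p in (1, 2, 4, 8):
    print(f"p = {p}: ratio = {khintchine_ratio(p=p):.3f}")
```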

Lemma 6.9 (Second covering number bound).

Consider the setup of Lemma 6.5 and suppose that assumption (iii) of Theorem 6.3 holds. Then

\sqrt{\log(\mathcal{N}(\widetilde{S}(\mathbb{U}),{\left\|\cdot\right\|}_{\bar{\mathcal{A}}},t))}\lesssim\sqrt{\frac{q\widetilde{m}^{\frac{1}{q}}\Phi(\mathbb{W};\bar{\mathcal{A}})\log(M)}{\alpha}}\cdot t^{-1}.
Proof.

Let v=u/|u|𝕏S~(𝕌)v=u/{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|u\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{\mathbb{X}}\in\widetilde{S}(\mathbb{U}). Then, by norm equivalence, v𝕏1/α{\|v\|}_{\mathbb{X}}\leq 1/\sqrt{\alpha}. Hence vB(𝕌)/αv\in B(\mathbb{U})/\sqrt{\alpha}, where B(𝕌)={u𝕌:u𝕏1}B(\mathbb{U})=\{u\in\mathbb{U}:{\|u\|}_{\mathbb{X}}\leq 1\}. We deduce that

S~(𝕌)B(𝕌)/αconv(𝕎)/α=conv(𝕎/α).\widetilde{S}(\mathbb{U})\subseteq B(\mathbb{U})/\sqrt{\alpha}\subseteq\mathrm{conv}(\mathbb{W})/\sqrt{\alpha}=\mathrm{conv}(\mathbb{W}/\sqrt{\alpha}).

Therefore

𝒩(S~(𝕌),𝒜¯,t)𝒩(conv(𝕎/α),𝒜¯,t/2).\mathcal{N}(\widetilde{S}(\mathbb{U}),{\left\|\cdot\right\|}_{\bar{\mathcal{A}}},t)\leq\mathcal{N}(\mathrm{conv}(\mathbb{W}/\sqrt{\alpha}),{\left\|\cdot\right\|}_{\bar{\mathcal{A}}},t/2).

Maurey’s lemma (see, e.g., [9, Lem. 13.30]) implies that

log(𝒩(S~(𝕌),𝒜¯,t))(C/t)2log(M),\log(\mathcal{N}(\widetilde{S}(\mathbb{U}),{\left\|\cdot\right\|}_{\bar{\mathcal{A}}},t))\lesssim(C/t)^{2}\log(M),

where CC is such that

𝔼ϵ¯l=1Lϵlwlα𝒜¯CL,L,w1,,wL𝕎.\mathbb{E}_{\overline{\epsilon}}{\left\|\sum^{L}_{l=1}\epsilon_{l}\frac{w_{l}}{\sqrt{\alpha}}\right\|}_{\bar{\mathcal{A}}}\leq C\sqrt{L},\quad\forall L\in\mathbb{N},\ w_{1},\ldots,w_{L}\in\mathbb{W}.

We now estimate CC. Let LL\in\mathbb{N} and w1,,wL𝕎w_{1},\ldots,w_{L}\in\mathbb{W}. By Hölder’s inequality and linearity,

𝔼ϵ¯l=1Lϵlwlα𝒜¯\displaystyle\mathbb{E}_{\overline{\epsilon}}{\left\|\sum^{L}_{l=1}\epsilon_{l}\frac{w_{l}}{\sqrt{\alpha}}\right\|}_{\bar{\mathcal{A}}} (𝔼ϵ¯l=1Lϵlwlα𝒜¯2q)12q\displaystyle\leq\left(\mathbb{E}_{\overline{\epsilon}}{\left\|\sum^{L}_{l=1}\epsilon_{l}\frac{w_{l}}{\sqrt{\alpha}}\right\|}^{2q}_{\bar{\mathcal{A}}}\right)^{\frac{1}{2q}}
=1α(i𝔼ϵ¯l=1LϵlAi(wl)𝕐i2q)12q\displaystyle=\frac{1}{\sqrt{\alpha}}\left(\sum_{i\in\mathcal{I}}\mathbb{E}_{\overline{\epsilon}}{\left\|\sum^{L}_{l=1}\epsilon_{l}A_{i}(w_{l})\right\|}^{2q}_{\mathbb{Y}_{i}}\right)^{\frac{1}{2q}}
m~12qαmaxi(𝔼ϵ¯l=1LϵlAi(wl)𝕐i2q)12q.\displaystyle\leq\frac{\widetilde{m}^{\frac{1}{2q}}}{\sqrt{\alpha}}\max_{i\in\mathcal{I}}\left(\mathbb{E}_{\overline{\epsilon}}{\left\|\sum^{L}_{l=1}\epsilon_{l}A_{i}(w_{l})\right\|}^{2q}_{\mathbb{Y}_{i}}\right)^{\frac{1}{2q}}.

By Lemma 6.8,

(𝔼ϵ¯l=1LϵlAi(wl)𝕐i2q)12qqLmaxl=1,,LAi(wl)𝕐iqLΦ(𝕎;𝒜¯).\displaystyle\left(\mathbb{E}_{\overline{\epsilon}}{\left\|\sum^{L}_{l=1}\epsilon_{l}A_{i}(w_{l})\right\|}^{2q}_{\mathbb{Y}_{i}}\right)^{\frac{1}{2q}}\lesssim\sqrt{qL}\max_{l=1,\ldots,L}{\|A_{i}(w_{l})\|}_{\mathbb{Y}_{i}}\leq\sqrt{qL\Phi(\mathbb{W};\bar{\mathcal{A}})}.

We deduce that

𝔼ϵ¯l=1Lϵlwlα𝒜¯qm~1qΦ(𝕎;𝒜¯)αL.\mathbb{E}_{\overline{\epsilon}}{\left\|\sum^{L}_{l=1}\epsilon_{l}\frac{w_{l}}{\sqrt{\alpha}}\right\|}_{\bar{\mathcal{A}}}\lesssim\sqrt{\frac{q\widetilde{m}^{\frac{1}{q}}\Phi(\mathbb{W};\bar{\mathcal{A}})}{\alpha}}\sqrt{L}.

The result now follows from Maurey’s lemma. ∎

6.1.3 Expectation bounds

We are now in a position to establish our first expectation bound.

Theorem 6.10 (First expectation bound).

Consider the setup of Section 1.1. Let 0<\delta<1 and \mathbb{U}\subseteq\mathbb{X}_{0}, and suppose that

m\gtrsim\delta^{-2}\cdot\alpha^{-1}\cdot\Phi(S(\mathbb{U});\bar{\mathcal{A}})\cdot\left[\log(2d)+n\log\left(2\left(1+\gamma(\mathbb{U};\bar{\mathcal{A}})\right)\right)\right], (6.17)

where

  1. (a)

    \gamma(\mathbb{U};\bar{\mathcal{A}})=\frac{\min\{\Phi(S(\mathbb{V});\bar{\mathcal{A}}),\Phi(S(\widetilde{S}(\mathbb{U})-\widetilde{S}(\mathbb{U}));\bar{\mathcal{A}})\}}{\Phi(S(\mathbb{U});\bar{\mathcal{A}})} if \mathbb{U} satisfies assumption (ii) of Theorem 6.2; or

  2. (b)

    \gamma(\mathbb{U};\bar{\mathcal{A}})=\frac{\min\{\Phi(S(\mathbb{V});\bar{\mathcal{A}}),\Phi(S(\mathbb{U}-\mathbb{U});\bar{\mathcal{A}})\}}{\Phi(S(\mathbb{U});\bar{\mathcal{A}})} if \mathbb{U} satisfies assumptions (i) and (ii) of Theorem 6.2.

Then the random variable (6.3) satisfies 𝔼(δ𝕌)δ\mathbb{E}(\delta_{\mathbb{U}})\leq\delta.

Proof.

Lemmas 6.4, 6.5 and 6.6 imply that

𝔼𝒜¯(δ𝕌)\displaystyle\mathbb{E}_{\bar{\mathcal{A}}}(\delta_{\mathbb{U}}) m1𝔼𝒜¯(Rp,𝒜¯)(χqlog(2d)+n0χq/2log(1+4χqt)dt)\displaystyle\lesssim m^{-1}\mathbb{E}_{\bar{\mathcal{A}}}(R_{p,\bar{\mathcal{A}}})\left(\chi_{q}\sqrt{\log(2d)}+\sqrt{n}\int^{\chi_{q}/2}_{0}\sqrt{\log\left(1+\frac{4\chi^{\prime}_{q}}{t}\right)}\,\mathrm{d}t\right)
=m1𝔼𝒜¯(Rp,𝒜¯)(χqlog(2d)+nχq0χq/(8χq)log(1+1/s)ds),\displaystyle=m^{-1}\mathbb{E}_{\bar{\mathcal{A}}}(R_{p,\bar{\mathcal{A}}})\left(\chi_{q}\sqrt{\log(2d)}+\sqrt{n}\chi^{\prime}_{q}\int^{\chi_{q}/(8\chi^{\prime}_{q})}_{0}\sqrt{\log\left(1+1/s\right)}\,\mathrm{d}s\right),

where \chi_{q} is as in (6.9) and \chi^{\prime}_{q} is as in Lemma 6.6. We now use [43, Lem. C.9] to obtain

𝔼𝒜¯(δ𝕌)m1𝔼𝒜¯(Rp,𝒜¯)χq(log(2d)+nlog(e(1+8χq/χq))).\mathbb{E}_{\bar{\mathcal{A}}}(\delta_{\mathbb{U}})\lesssim m^{-1}\mathbb{E}_{\bar{\mathcal{A}}}(R_{p,\bar{\mathcal{A}}})\chi_{q}\left(\sqrt{\log(2d)}+\sqrt{n}\sqrt{\log(\mathrm{e}(1+8\chi^{\prime}_{q}/\chi_{q}))}\right).

Using (6.13) and the definition of Rp,𝒜¯R_{p,\bar{\mathcal{A}}}, we see that

Rp,𝒜¯m~12p(Φ(S(𝕌);𝒜¯)α)1212psupuS~(𝕌)(1miAi(u)𝕐i2)12pR_{p,\bar{\mathcal{A}}}\leq\widetilde{m}^{\frac{1}{2p}}\left(\frac{\Phi(S(\mathbb{U});\bar{\mathcal{A}})}{\alpha}\right)^{\frac{1}{2}-\frac{1}{2p}}\sup_{u\in\widetilde{S}(\mathbb{U})}\left(\frac{1}{m}\sum_{i\in\mathcal{I}}{\|A_{i}(u)\|}^{2}_{\mathbb{Y}_{i}}\right)^{\frac{1}{2p}}

Now consider the sum. Let uS~(𝕌)u\in\widetilde{S}(\mathbb{U}). Then

1miAi(u)𝕐i2\displaystyle\frac{1}{m}\sum_{i\in\mathcal{I}}{\|A_{i}(u)\|}^{2}_{\mathbb{Y}_{i}} =1mi(Ai(u)𝕐i2𝔼Ai𝒜iAi(u)𝕐i2)+1mi𝔼Ai𝒜iAi(u)𝕐i2δ𝕌+1,\displaystyle=\frac{1}{m}\sum_{i\in\mathcal{I}}\left({\|A_{i}(u)\|}^{2}_{\mathbb{Y}_{i}}-\mathbb{E}_{A_{i}\sim\mathcal{A}_{i}}{\|A_{i}(u)\|}^{2}_{\mathbb{Y}_{i}}\right)+\frac{1}{m}\sum_{i\in\mathcal{I}}\mathbb{E}_{A_{i}\sim\mathcal{A}_{i}}{\|A_{i}(u)\|}^{2}_{\mathbb{Y}_{i}}\leq\delta_{\mathbb{U}}+1,

where in the second step we used (6.5) and (6.4). Hence

Rp,𝒜¯m12p(Φ(S(𝕌);𝒜¯)α)1212p(δ𝕌+1)12pR_{p,\bar{\mathcal{A}}}\leq m^{\frac{1}{2p}}\left(\frac{\Phi(S(\mathbb{U});\bar{\mathcal{A}})}{\alpha}\right)^{\frac{1}{2}-\frac{1}{2p}}(\delta_{\mathbb{U}}+1)^{\frac{1}{2p}}

Using the fact that p1p\geq 1 and the Cauchy–Schwarz inequality, we deduce that

𝔼𝒜¯(Rp,𝒜¯)m12p(Φ(S(𝕌);𝒜¯)α)1212p𝔼𝒜¯(δ𝕌)+1.\mathbb{E}_{\bar{\mathcal{A}}}(R_{p,\bar{\mathcal{A}}})\leq m^{\frac{1}{2p}}\left(\frac{\Phi(S(\mathbb{U});\bar{\mathcal{A}})}{\alpha}\right)^{\frac{1}{2}-\frac{1}{2p}}\sqrt{\mathbb{E}_{\bar{\mathcal{A}}}(\delta_{\mathbb{U}})+1}. (6.18)

Combining this with the previous expression and using the definition of χq\chi_{q}, we deduce that

𝔼𝒜¯(δ𝕌)m12p1m~12q(Φ(S(𝕌);𝒜¯)α)112p(log(2d)+nlog(e(1+8χq/χq)))𝔼𝒜¯(δ𝕌)+1.\mathbb{E}_{\bar{\mathcal{A}}}(\delta_{\mathbb{U}})\lesssim m^{\frac{1}{2p}-1}\widetilde{m}^{\frac{1}{2q}}\left(\frac{\Phi(S(\mathbb{U});\bar{\mathcal{A}})}{\alpha}\right)^{1-\frac{1}{2p}}\left(\sqrt{\log(2d)}+\sqrt{n}\sqrt{\log(\mathrm{e}(1+8\chi^{\prime}_{q}/\chi_{q}))}\right)\sqrt{\mathbb{E}_{\bar{\mathcal{A}}}(\delta_{\mathbb{U}})+1}.

Now set p=1+1/log(λ)p=1+1/\log(\lambda), where λ=1+Φ(S(𝕌);𝒜¯)/α\lambda=1+\Phi(S(\mathbb{U});\bar{\mathcal{A}})/\alpha and notice that q=1+log(λ)q=1+\log(\lambda) in this case. Observe that

(Φ(S(𝕌);𝒜¯)α)112p(Φ(S(𝕌);𝒜¯)α)12λ1212p(Φ(S(𝕌);𝒜¯)α)12e.\left(\frac{\Phi(S(\mathbb{U});\bar{\mathcal{A}})}{\alpha}\right)^{1-\frac{1}{2p}}\leq\left(\frac{\Phi(S(\mathbb{U});\bar{\mathcal{A}})}{\alpha}\right)^{\frac{1}{2}}\lambda^{\frac{1}{2}-\frac{1}{2p}}\leq\left(\frac{\Phi(S(\mathbb{U});\bar{\mathcal{A}})}{\alpha}\right)^{\frac{1}{2}}\sqrt{\mathrm{e}}.

Using this and the fact that m12p1m~12q=m1/2(m~/m)1/(2q)m1/2m^{\frac{1}{2p}-1}\widetilde{m}^{\frac{1}{2q}}=m^{-1/2}(\widetilde{m}/m)^{1/(2q)}\leq m^{-1/2} (since m~m\widetilde{m}\leq m), we deduce that

𝔼𝒜¯(δ𝕌)m1/2(Φ(S(𝕌);𝒜¯)α)12(log(2d)+nlog(e(1+8χq/χq)))𝔼𝒜¯(δ𝕌)+1.\mathbb{E}_{\bar{\mathcal{A}}}(\delta_{\mathbb{U}})\lesssim m^{-1/2}\left(\frac{\Phi(S(\mathbb{U});\bar{\mathcal{A}})}{\alpha}\right)^{\frac{1}{2}}\left(\sqrt{\log(2d)}+\sqrt{n}\sqrt{\log(\mathrm{e}(1+8\chi^{\prime}_{q}/\chi_{q}))}\right)\sqrt{\mathbb{E}_{\bar{\mathcal{A}}}(\delta_{\mathbb{U}})+1}.

Hence, we see that 𝔼𝒜¯(δ𝕌)δ\mathbb{E}_{\bar{\mathcal{A}}}(\delta_{\mathbb{U}})\leq\delta, provided

m1/2(Φ(S(𝕌);𝒜¯)α)12(log(2d)+nlog(2(1+χq/χq)))cδ,m^{-1/2}\left(\frac{\Phi(S(\mathbb{U});\bar{\mathcal{A}})}{\alpha}\right)^{\frac{1}{2}}\left(\sqrt{\log(2d)}+\sqrt{n}\sqrt{\log(2(1+\chi^{\prime}_{q}/\chi_{q}))}\right)\leq c\delta,

for suitably small constant c>0c>0. Rearranging and noticing that

χq/χq=γ(𝕌;𝒜¯)\chi^{\prime}_{q}/\chi_{q}=\sqrt{\gamma(\mathbb{U};\bar{\mathcal{A}})}

now gives the result. ∎
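As a quick numerical illustration (not part of the proof), the following Python sketch evaluates the right-hand side of the measurement condition (6.17). The constant C stands for the implicit universal constant hidden in the notation m ≳ …, and all parameter values below are illustrative placeholders.

```python
import numpy as np

def measurement_condition_617(delta, alpha, Phi_SU, d, n, gamma, C=1.0):
    """Right-hand side of (6.17):
    C * delta^-2 * alpha^-1 * Phi(S(U); A-bar) * [log(2d) + n*log(2*(1+gamma))]."""
    return (C / (delta**2 * alpha)) * Phi_SU * (np.log(2 * d) + n * np.log(2 * (1 + gamma)))

# Illustrative evaluation: d = 10 subspaces of dimension n = 5, variation
# Phi(S(U); A-bar) = 20, alpha = 1, gamma(U; A-bar) = 4 and delta = 1/2.
print(measurement_condition_617(delta=0.5, alpha=1.0, Phi_SU=20.0, d=10, n=5, gamma=4.0))
```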

Theorem 6.11 (Second expectation bound).

Consider the setup of Section 1.1. Let 0<\delta<1, let \mathbb{U}\subseteq\mathbb{X}_{0} satisfy assumption (iii) of Theorem 6.3, and suppose that

m\gtrsim\delta^{-2}\cdot\alpha^{-1}\cdot\Phi(S(\mathbb{U})\cup\mathbb{W};\bar{\mathcal{A}})\cdot L_{0}\cdot(L_{1}+L_{2}),

where

L_{0}=\log(2\Phi(S(\mathbb{U})\cup\mathbb{W};\bar{\mathcal{A}})/\alpha),
L_{1}=1+\log\left(1+\gamma(\mathbb{U};\bar{\mathcal{A}})(\log(2d)+n)\right),
L_{2}=\log(M)\log^{2}(\log(2d)+n),

and

  1. (a)

    \gamma(\mathbb{U};\bar{\mathcal{A}})=\frac{\min\{\Phi(S(\mathbb{V});\bar{\mathcal{A}}),\Phi(S(\widetilde{S}(\mathbb{U})-\widetilde{S}(\mathbb{U}));\bar{\mathcal{A}})\}}{\Phi(S(\mathbb{U});\bar{\mathcal{A}})} if \mathbb{U} satisfies assumption (ii) of Theorem 6.2; or

  2. (b)

    \gamma(\mathbb{U};\bar{\mathcal{A}})=\frac{\min\{\Phi(S(\mathbb{V});\bar{\mathcal{A}}),\Phi(S(\mathbb{U}-\mathbb{U});\bar{\mathcal{A}})\}}{\Phi(S(\mathbb{U});\bar{\mathcal{A}})} if \mathbb{U} satisfies assumptions (i) and (ii) of Theorem 6.2.

Then the random variable (6.3) satisfies 𝔼(δ𝕌)δ\mathbb{E}(\delta_{\mathbb{U}})\leq\delta.

Proof.

Lemmas 6.4 and 6.5 imply that

𝔼𝒜¯(δ𝕌)m1𝔼𝒜¯(Rp,𝒜¯)0χq/2log(2𝒩(S~(𝕌),𝒜¯,t))dt.\mathbb{E}_{\bar{\mathcal{A}}}(\delta_{\mathbb{U}})\lesssim m^{-1}\mathbb{E}_{\bar{\mathcal{A}}}(R_{p,\bar{\mathcal{A}}})\int^{\chi_{q}/2}_{0}\sqrt{\log(2\mathcal{N}(\widetilde{S}(\mathbb{U}),{\left\|\cdot\right\|}_{\bar{\mathcal{A}}},t))}\,\mathrm{d}t.

Let 0<τ<χq/20<\tau<\chi_{q}/2 and then use Lemmas 6.6 and 6.9 to obtain

0χq/2log(2𝒩(S~(𝕌),𝒜¯,t))dt\displaystyle\int^{\chi_{q}/2}_{0}\sqrt{\log(2\mathcal{N}(\widetilde{S}(\mathbb{U}),{\left\|\cdot\right\|}_{\bar{\mathcal{A}}},t))}\,\mathrm{d}t\lesssim log(2d)τ+n0τlog(1+2χq/t)dt\displaystyle\sqrt{\log(2d)}\tau+\sqrt{n}\int^{\tau}_{0}\sqrt{\log(1+2\chi^{\prime}_{q}/t)}\,\mathrm{d}t
+qm~1qΦ(𝕎;𝒜¯)log(M)ατχq/21/tdt.\displaystyle+\sqrt{\frac{q\widetilde{m}^{\frac{1}{q}}\Phi(\mathbb{W};\bar{\mathcal{A}})\log(M)}{\alpha}}\int^{\chi_{q}/2}_{\tau}1/t\,\mathrm{d}t.

A short exercise gives that

0χq/2log(2𝒩(S~(𝕌),𝒜¯,t))dt\displaystyle\int^{\chi_{q}/2}_{0}\sqrt{\log(2\mathcal{N}(\widetilde{S}(\mathbb{U}),{\left\|\cdot\right\|}_{\bar{\mathcal{A}}},t))}\,\mathrm{d}t\lesssim log(2d)τ+nτlog(e(1+2χq/τ))\displaystyle\sqrt{\log(2d)}\tau+\sqrt{n}\tau\sqrt{\log(\mathrm{e}(1+2\chi^{\prime}_{q}/\tau))}
+qm~1qΦ(𝕎;𝒜¯)log(M)αlog(χq/(2τ)).\displaystyle+\sqrt{\frac{q\widetilde{m}^{\frac{1}{q}}\Phi(\mathbb{W};\bar{\mathcal{A}})\log(M)}{\alpha}}\log(\chi_{q}/(2\tau)).

Now set τ=χq/(log(2d)+n)\tau=\chi_{q}/(\sqrt{\log(2d)}+\sqrt{n}) to obtain

0χq/2log(2𝒩(S~(𝕌),𝒜¯,t))dt\displaystyle\int^{\chi_{q}/2}_{0}\sqrt{\log(2\mathcal{N}(\widetilde{S}(\mathbb{U}),{\left\|\cdot\right\|}_{\bar{\mathcal{A}}},t))}\,\mathrm{d}t\lesssim χqlog(e(1+2χq/χq(log(2d)+n)))\displaystyle\chi_{q}\sqrt{\log(\mathrm{e}(1+2\chi^{\prime}_{q}/\chi_{q}(\sqrt{\log(2d)}+\sqrt{n})))}
+qm~1qΦ(𝕎;𝒜¯)log(M)αlog((log(2d)+n)).\displaystyle+\sqrt{\frac{q\widetilde{m}^{\frac{1}{q}}\Phi(\mathbb{W};\bar{\mathcal{A}})\log(M)}{\alpha}}\log((\sqrt{\log(2d)}+\sqrt{n})).
\displaystyle\lesssim χqL1+qm~1qΦ(𝕎;𝒜¯)αL2.\displaystyle\chi_{q}\sqrt{L_{1}}+\sqrt{\frac{q\widetilde{m}^{\frac{1}{q}}\Phi(\mathbb{W};\bar{\mathcal{A}})}{\alpha}}\sqrt{L_{2}}.

Now (6.9) and the fact that

max{Φ(S(𝕌);𝒜¯),Φ(𝕎;𝒜¯)}=Φ(S(𝕌)𝕎;𝒜¯)\max\{\Phi(S(\mathbb{U});\bar{\mathcal{A}}),\Phi(\mathbb{W};\bar{\mathcal{A}})\}=\Phi(S(\mathbb{U})\cup\mathbb{W};\bar{\mathcal{A}})

yield

0χq/2log(2𝒩(S~(𝕌),𝒜¯,t))dtqm~1qΦ(S(𝕌)𝕎;𝒜¯)αL1+L2\int^{\chi_{q}/2}_{0}\sqrt{\log(2\mathcal{N}(\widetilde{S}(\mathbb{U}),{\left\|\cdot\right\|}_{\bar{\mathcal{A}}},t))}\,\mathrm{d}t\lesssim\sqrt{\frac{q\widetilde{m}^{\frac{1}{q}}\Phi(S(\mathbb{U})\cup\mathbb{W};\bar{\mathcal{A}})}{\alpha}}\sqrt{L_{1}+L_{2}}

We now use (6.18) to obtain

𝔼𝒜¯(δ𝕌)m1/2(Φ(S(𝕌)𝕎;𝒜¯)α)112pqL1+L2𝔼𝒜¯(δ𝕌)+1.\mathbb{E}_{\bar{\mathcal{A}}}(\delta_{\mathbb{U}})\lesssim m^{-1/2}\left(\frac{\Phi(S(\mathbb{U})\cup\mathbb{W};\bar{\mathcal{A}})}{\alpha}\right)^{1-\frac{1}{2p}}\sqrt{q}\sqrt{L_{1}+L_{2}}\sqrt{\mathbb{E}_{\bar{\mathcal{A}}}(\delta_{\mathbb{U}})+1}.

We now choose p=1+1/log(λ)p=1+1/\log(\lambda), where λ=1+Φ(S(𝕌)𝕎;𝒜¯)/α\lambda=1+\Phi(S(\mathbb{U})\cup\mathbb{W};\bar{\mathcal{A}})/\alpha and proceed as before. This gives

𝔼𝒜¯(δ𝕌)m1/2(Φ(S(𝕌)𝕎;𝒜¯)α)12log(eΦ(S(𝕌)𝕎;𝒜¯)/α)L1+L2𝔼𝒜¯(δ𝕌)+1.\mathbb{E}_{\bar{\mathcal{A}}}(\delta_{\mathbb{U}})\lesssim m^{-1/2}\left(\frac{\Phi(S(\mathbb{U})\cup\mathbb{W};\bar{\mathcal{A}})}{\alpha}\right)^{\frac{1}{2}}\sqrt{\log(\mathrm{e}\Phi(S(\mathbb{U})\cup\mathbb{W};\bar{\mathcal{A}})/\alpha)}\sqrt{L_{1}+L_{2}}\sqrt{\mathbb{E}_{\bar{\mathcal{A}}}(\delta_{\mathbb{U}})+1}.

This completes the proof. ∎
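As with (6.17), the measurement condition of Theorem 6.11 is easy to evaluate numerically. The Python sketch below computes L_0, L_1, L_2 and the resulting right-hand side; C again denotes the implicit universal constant in the ≳ notation, and every parameter value is an illustrative placeholder.

```python
import numpy as np

def measurement_condition_thm611(delta, alpha, Phi_union, d, n, gamma, M, C=1.0):
    """Right-hand side of the condition of Theorem 6.11, with Phi_union playing
    the role of Phi(S(U) union W; A-bar)."""
    L0 = np.log(2 * Phi_union / alpha)
    L1 = 1 + np.log(1 + gamma * (np.log(2 * d) + n))
    L2 = np.log(M) * np.log(np.log(2 * d) + n) ** 2
    return (C / (delta**2 * alpha)) * Phi_union * L0 * (L1 + L2)

print(measurement_condition_thm611(delta=0.5, alpha=1.0, Phi_union=20.0,
                                   d=10, n=5, gamma=4.0, M=100))
```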

6.1.4 Probability bound

We now bound δ𝕌\delta_{\mathbb{U}} in probability.

Theorem 6.12.

Let 0<\delta,\epsilon<1 and suppose that \mathbb{U} satisfies assumption (ii) of Theorem 6.2. Suppose also that \mathbb{E}(\delta_{\mathbb{U}})\leq\delta/2. Then \delta_{\mathbb{U}}\leq\delta with probability at least 1-\epsilon, provided

m\gtrsim\delta^{-2}\cdot\alpha^{-1}\cdot\Phi(S(\mathbb{U});\bar{\mathcal{A}})\cdot\log(2/\epsilon).
Proof.

Our analysis is based on a version of Talagrand’s theorem (see, e.g., [9, Thm. 13.33]). First, let BB^{*} be a countable dense subset of S~(𝕌)\widetilde{S}(\mathbb{U}). This exists since 𝕏\mathbb{X} is separable. Then from (6.6) we have

Z=δ𝕌\displaystyle Z=\delta_{\mathbb{U}} =m1supuB|iUi(u)𝔼(Ui(u)),u𝕏|\displaystyle=m^{-1}\sup_{u\in B^{*}}\left|\sum_{i\in\mathcal{I}}\left<U_{i}(u)-\mathbb{E}\left(U_{i}(u)\right),u\right>_{\mathbb{X}}\right|
=supuBmax{+iVi(u),u𝕏,iVi(u),u𝕏},\displaystyle=\sup_{u\in B^{*}}\max\left\{+\sum_{i\in\mathcal{I}}\left<V_{i}(u),u\right>_{\mathbb{X}},-\sum_{i\in\mathcal{I}}\left<V_{i}(u),u\right>_{\mathbb{X}}\right\},

where ViV_{i} is the (𝕍)\mathcal{B}(\mathbb{V})-valued random variable Vi=m1(Ui𝔼(Ui))V_{i}=m^{-1}(U_{i}-\mathbb{E}(U_{i})). Now consider the measurable space

Ω={V(𝕍):V=V,supuB|V(u),u𝕏|K},K:=2Φ(S(𝕌);𝒜¯)/(αm)\Omega=\{V\in\mathcal{B}(\mathbb{V}):V=V^{*},\ \sup_{u\in B^{*}}|\langle V(u),u\rangle_{\mathbb{X}}|\leq K\},\qquad K:=2\Phi(S(\mathbb{U});\bar{\mathcal{A}})/(\alpha m)

and let fu±:Ωf^{\pm}_{u}:\Omega\rightarrow\mathbb{R} be defined by fu±(V)=±V(u),u𝕏f^{\pm}_{u}(V)=\pm\langle V(u),u\rangle_{\mathbb{X}}. Observe that we can write

δ𝕌=supuBmax{ifu+(Vi),ifu(Vi)}.\delta_{\mathbb{U}}=\sup_{u\in B^{*}}\max\left\{\sum_{i\in\mathcal{I}}f^{+}_{u}(V_{i}),\sum_{i\in\mathcal{I}}f^{-}_{u}(V_{i})\right\}.

Notice also that \mathbb{E}(f^{\pm}_{u}(V_{i}))=0. Next, notice that, for all u\in B^{*},

|Ui(u),u𝕏|=Ai(u)𝕐i2Φ(S(𝕌);𝒜¯)/α.|\langle U_{i}(u),u\rangle_{\mathbb{X}}|={\|A_{i}(u)\|}^{2}_{\mathbb{Y}_{i}}\leq\Phi(S(\mathbb{U});\bar{\mathcal{A}})/\alpha.

Therefore ViΩV_{i}\in\Omega, i=1,,mi=1,\ldots,m.

Now observe that

𝔼(i(fu±(Vi))2)\displaystyle\mathbb{E}\left(\sum_{i\in\mathcal{I}}(f^{\pm}_{u}(V_{i}))^{2}\right) m2i𝔼|Ui(u),u𝕏|2\displaystyle\leq m^{-2}\sum_{i\in\mathcal{I}}\mathbb{E}|\langle U_{i}(u),u\rangle_{\mathbb{X}}|^{2}
2Φ(S(𝕌);𝒜¯)αm2i𝔼Ai(u)𝕐i2\displaystyle\leq\frac{2\Phi(S(\mathbb{U});\bar{\mathcal{A}})}{\alpha m^{2}}\sum_{i\in\mathcal{I}}\mathbb{E}{\|A_{i}(u)\|}^{2}_{\mathbb{Y}_{i}}
=2Φ(S(𝕌);𝒜¯)αm|u|𝕏2\displaystyle=\frac{2\Phi(S(\mathbb{U});\bar{\mathcal{A}})}{\alpha m}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|u\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{2}_{\mathbb{X}}
=2Φ(S(𝕌);𝒜¯)αm=:σ2.\displaystyle=\frac{2\Phi(S(\mathbb{U});\bar{\mathcal{A}})}{\alpha m}=:\sigma^{2}.

Finally, observe that

Z¯=supuBmax{|ifu+(Vi)|,|ifu(Vi)|}=Z,\overline{Z}=\sup_{u\in B^{*}}\max\left\{\left|\sum_{i\in\mathcal{I}}f^{+}_{u}(V_{i})\right|,\left|\sum_{i\in\mathcal{I}}f^{-}_{u}(V_{i})\right|\right\}=Z,

in this case. In particular, 𝔼(Z¯)=𝔼(Z)=𝔼(δ𝕌)\mathbb{E}(\overline{Z})=\mathbb{E}(Z)=\mathbb{E}(\delta_{\mathbb{U}}).

Now suppose that 𝔼(δ𝕌)δ/2\mathbb{E}(\delta_{\mathbb{U}})\leq\delta/2. Then, using Talagrand’s theorem,

(δ𝕌δ)\displaystyle\mathbb{P}(\delta_{\mathbb{U}}\geq\delta) (|δ𝕌𝔼(δ𝕌)|δ/2)\displaystyle\leq\mathbb{P}\left(|\delta_{\mathbb{U}}-\mathbb{E}(\delta_{\mathbb{U}})|\geq\delta/2\right)
3exp(δαm4cΦ(S(𝕌);𝒜¯)log(1+δ2+δ))\displaystyle\leq 3\exp\left(-\frac{\delta\alpha m}{4c\Phi(S(\mathbb{U});\bar{\mathcal{A}})}\log\left(1+\frac{\delta}{2+\delta}\right)\right)
3exp(log(4/3)δ2αm4cΦ(S(𝕌);𝒜¯)).\displaystyle\leq 3\exp\left(-\frac{\log(4/3)\delta^{2}\alpha m}{4c\Phi(S(\mathbb{U});\bar{\mathcal{A}})}\right).

Due to the condition on mm, we deduce that (δ𝕌δ)ϵ\mathbb{P}(\delta_{\mathbb{U}}\geq\delta)\leq\epsilon, as required. ∎

6.1.5 Proofs of Theorems 6.2 and 6.3

For the proof of Theorem 6.2, we divide into two parts. The first deals with conditions (a) and (b), and the second deals with condition (c), which involves a different approach.

Proof of Theorem 6.2; conditions (a) and (b).

Consider condition (a). Recall that 𝕌𝕌𝕌\mathbb{U}\subseteq\mathbb{U}-\mathbb{U} whenever 𝕌\mathbb{U} is a cone. Therefore, since 𝕌𝕍\mathbb{U}\subseteq\mathbb{V} as well, we see that

Φ(S(𝕌𝕌);𝒜¯)Φ(S(𝕌);𝒜¯),Φ(S(𝕍);𝒜¯)Φ(S(𝕌);𝒜¯).\Phi(S(\mathbb{U}-\mathbb{U});\bar{\mathcal{A}})\geq\Phi(S(\mathbb{U});\bar{\mathcal{A}}),\quad\Phi(S(\mathbb{V});\bar{\mathcal{A}})\geq\Phi(S(\mathbb{U});\bar{\mathcal{A}}). (6.19)

Hence \gamma(\mathbb{U};\bar{\mathcal{A}})\geq 1, which implies that

\log(2(1+\gamma(\mathbb{U};\bar{\mathcal{A}})))\lesssim\log(2\gamma(\mathbb{U};\bar{\mathcal{A}})).

The result subject to condition (a) now follows immediately from Theorems 6.10 and 6.12.

Moreover, using (6.19) and the inequality log(x)x\log(x)\leq x, we see that

Φ(S(𝕌);𝒜¯)log(γ(𝕌;𝒜¯))Φ(S(𝕌);𝒜¯)log[Φ(S(𝕌𝕌);𝒜¯)Φ(S(𝕌);𝒜¯)]Φ(S(𝕌𝕌);𝒜¯).\Phi(S(\mathbb{U});\bar{\mathcal{A}})\log(\gamma(\mathbb{U};\bar{\mathcal{A}}))\leq\Phi(S(\mathbb{U});\bar{\mathcal{A}})\log\left[\frac{\Phi(S(\mathbb{U}-\mathbb{U});\bar{\mathcal{A}})}{\Phi(S(\mathbb{U});\bar{\mathcal{A}})}\right]\leq\Phi(S(\mathbb{U}-\mathbb{U});\bar{\mathcal{A}}). (6.20)

Hence condition (b) implies condition (a), which implies the result. ∎

  • Remark 6.13 (On assumption (ii) of Theorem 6.2)

    Notice that Theorem 6.12 does not require assumption (i) of Theorem 6.2. Moreover, in Theorem 6.10 we gave a bound for the expectation when only assumption (ii) holds. Therefore, it is straightforward to derive an extension of Theorem 6.2 in which only assumption (ii) holds. The only difference is the definition of γ(𝕌;𝒜¯)\gamma(\mathbb{U};\bar{\mathcal{A}}) in (6.2), which would now be given by the expression in Theorem 6.10, part (a). Note that the same considerations also apply to Theorem 3.6 since, as we show in its proof below, it follows from a direct application of Theorem 6.2 to the difference set 𝕌=𝕌𝕌\mathbb{U}^{\prime}=\mathbb{U}-\mathbb{U}.

Next, for condition (c), we follow a different and simpler approach based on the matrix Chernoff bound.

Proof of Theorem 6.2; condition (c).

Since \mathbb{U}\subseteq\mathbb{V}=\mathbb{V}_{1}\cup\cdots\cup\mathbb{V}_{d} is a union of d subspaces of dimension n, it suffices, by the union bound, to show that (6.1) holds for each subspace \mathbb{V}_{i} with probability at least 1-\epsilon/d.

Therefore, without loss of generality, we may now assume that 𝕍\mathbb{V} is a single subspace of dimension nn and show (6.1) for this subspace. To do this, we follow a standard approach based on the matrix Chernoff bound. Let {vi}i=1n\{v_{i}\}^{n}_{i=1} be an orthonormal basis for 𝕍\mathbb{V} and write v=i=1ncivi𝕍v=\sum^{n}_{i=1}c_{i}v_{i}\in\mathbb{V}. Then

1mi=1mAi(v)𝕐i2=i=1mcXic,\displaystyle\frac{1}{m}\sum^{m}_{i=1}{\|A_{i}(v)\|}^{2}_{\mathbb{Y}_{i}}=\sum^{m}_{i=1}c^{*}X_{i}c,

where XiX_{i} is the self-adjoint random matrix

Xi=1m(Ai(vj),Ai(vk)𝕐i)j,k=1nn×n.X_{i}=\frac{1}{m}\left(\langle A_{i}(v_{j}),A_{i}(v_{k})\rangle_{\mathbb{Y}_{i}}\right)^{n}_{j,k=1}\in\mathbb{C}^{n\times n}.

Therefore, (6.1) is equivalent to

(1δ)αλmin(i=1mXi)λmax(i=1mXi)(1+δ)β.(1-\delta)\alpha\leq\lambda_{\min}\left(\sum^{m}_{i=1}X_{i}\right)\leq\lambda_{\max}\left(\sum^{m}_{i=1}X_{i}\right)\leq(1+\delta)\beta.

Notice that c^{*}X_{i}c=m^{-1}{\|A_{i}(v)\|}^{2}_{\mathbb{Y}_{i}}, and therefore X_{i} is nonnegative definite. Also, we have

i=1mc𝔼Ai𝒜iXic=1mi=1m𝔼Ai𝒜iAi(v)𝕐i2.\sum^{m}_{i=1}c^{*}\mathbb{E}_{A_{i}\sim\mathcal{A}_{i}}X_{i}c=\frac{1}{m}\sum^{m}_{i=1}\mathbb{E}_{A_{i}\sim\mathcal{A}_{i}}{\|A_{i}(v)\|}^{2}_{\mathbb{Y}_{i}}.

Hence, by (1.1), we have

αλmin(i=1m𝔼Xi)λmax(i=1m𝔼Xi)β.\alpha\leq\lambda_{\min}\left(\sum^{m}_{i=1}\mathbb{E}X_{i}\right)\leq\lambda_{\max}\left(\sum^{m}_{i=1}\mathbb{E}X_{i}\right)\leq\beta.

Therefore, it suffices to show that

(1-\delta)\lambda_{\min}\left(\sum^{m}_{i=1}\mathbb{E}X_{i}\right)\leq\lambda_{\min}\left(\sum^{m}_{i=1}X_{i}\right)\leq\lambda_{\max}\left(\sum^{m}_{i=1}X_{i}\right)\leq(1+\delta)\lambda_{\max}\left(\sum^{m}_{i=1}\mathbb{E}X_{i}\right)

with high probability. Observe that

cXic=1mAi(v)𝕐i21mΦ(S(𝕍);𝒜¯)v𝕏2=1mΦ(S(𝕍);𝒜¯)c22,c^{*}X_{i}c=\frac{1}{m}{\|A_{i}(v)\|}^{2}_{\mathbb{Y}_{i}}\leq\frac{1}{m}\Phi(S(\mathbb{V});\bar{\mathcal{A}}){\|v\|}^{2}_{\mathbb{X}}=\frac{1}{m}\Phi(S(\mathbb{V});\bar{\mathcal{A}}){\|c\|}^{2}_{2},

almost surely, and therefore

\lambda_{\max}(X_{i})\leq\frac{1}{m}\Phi(S(\mathbb{V});\bar{\mathcal{A}}),

almost surely, for all i=1,\ldots,m. Thus, by the matrix Chernoff bound and the union bound, we have

((6.1) holds with α=α(1δ) and β=(1+δ)β)12nexp(αm((1+δ)log(1+δ)δ)Φ(S(𝕍);𝒜¯)).\mathbb{P}\left(\text{\eqref{empirical-nondegeneracy} holds with $\alpha^{\prime}=\alpha(1-\delta)$ and $\beta^{\prime}=(1+\delta)\beta$}\right)\geq 1-2n\exp\left(-\frac{\alpha m((1+\delta)\log(1+\delta)-\delta)}{\Phi(S(\mathbb{V});\bar{\mathcal{A}})}\right).

Notice that (1+δ)log(1+δ)δδ2/3(1+\delta)\log(1+\delta)-\delta\geq\delta^{2}/3. Hence the right-hand side is bounded below by

12nexp(αmδ2/3Φ(S(𝕍);𝒜¯)).1-2n\exp\left(-\frac{\alpha m\delta^{2}/3}{\Phi(S(\mathbb{V});\bar{\mathcal{A}})}\right).

We deduce that if

mδ2α1Φ(S(𝕍);𝒜¯)log(2n/ϵ),m\gtrsim\delta^{-2}\cdot\alpha^{-1}\cdot\Phi(S(\mathbb{V});\bar{\mathcal{A}})\cdot\log(2n/\epsilon),

then, with probability at least 1ϵ1-\epsilon, (6.1) holds with α=α(1δ)\alpha^{\prime}=\alpha(1-\delta) and β=(1+δ)β\beta^{\prime}=(1+\delta)\beta for the subspace 𝕍\mathbb{V}. Replacing ϵ\epsilon by ϵ/d\epsilon/d now gives the result. ∎
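To illustrate the matrix Chernoff step numerically, the following Python sketch takes the ambient space to be R^N, draws a random n-dimensional subspace and uses the isotropic family of scaled random row samples A_i(x) = √N x_{j_i} (so that α = β = 1), forms the matrices X_i from the proof and records the extreme eigenvalues of Σ_i X_i. All sizes are illustrative placeholders; this is a sketch, not part of the argument.

```python
import numpy as np

rng = np.random.default_rng(2)

def chernoff_extremes(N=200, n=5, m=400, trials=200):
    """Extreme eigenvalues of sum_i X_i for scaled random row sampling.

    Here X_i = (N/m) * Q[j_i, :]^T Q[j_i, :], where Q is an orthonormal basis
    of a random n-dimensional subspace of R^N and j_i is a uniform row index,
    so that sum_i E[X_i] is the n x n identity (alpha = beta = 1).
    """
    Q, _ = np.linalg.qr(rng.standard_normal((N, n)))   # orthonormal basis of V
    lmin, lmax = [], []
    for _ in range(trials):
        rows = rng.integers(0, N, size=m)
        G = (N / m) * Q[rows].T @ Q[rows]              # = sum_i X_i
        eigs = np.linalg.eigvalsh(G)
        lmin.append(eigs[0])
        lmax.append(eigs[-1])
    return np.mean(lmin), np.mean(lmax)

print("average extreme eigenvalues of sum_i X_i:", chernoff_extremes())
```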

Proof of Theorem 6.3.

The result follows from Theorems 6.11 and 6.12. Recall that γ(𝕌;𝒜¯)1\gamma(\mathbb{U};\bar{\mathcal{A}})\geq 1 since assumption (i) holds. Therefore, we may bound the logarithmic term in Theorem 6.11 as

L_{0}\cdot(L_{1}+L_{2})
\lesssim\log(2\Phi(S(\mathbb{U})\cup\mathbb{W};\bar{\mathcal{A}})/\alpha)\cdot\left[\log(2\gamma(\mathbb{U};\bar{\mathcal{A}})(\log(2d)+n))+\log(M)\log^{2}(\log(2d)+n)\right]
\lesssim\log(2\Phi(S(\mathbb{U})\cup\mathbb{W};\bar{\mathcal{A}})/\alpha)\cdot\left[\log(2\gamma(\mathbb{U};\bar{\mathcal{A}}))+\log(2M)\log^{2}(\log(2d)+n)\right].

Hence, we immediately deduce the result under the condition (a). Now recall (6.19). Then, much as in (6.20), we deduce that

\Phi(S(\mathbb{U})\cup\mathbb{W};\bar{\mathcal{A}})\log(\gamma(\mathbb{U};\bar{\mathcal{A}}))\leq\Phi(S(\mathbb{U}-\mathbb{U})\cup\mathbb{W};\bar{\mathcal{A}}).

We conclude that condition (b) implies condition (a). This gives the result under condition (b). Finally, recalling (6.19) once more, we also have

Φ(S(𝕌)𝕎;𝒜¯)log(γ(𝕌;𝒜¯))Φ(S(𝕍)𝕎;𝒜¯).\Phi(S(\mathbb{U})\cup\mathbb{W};\bar{\mathcal{A}})\log(\gamma(\mathbb{U};\bar{\mathcal{A}}))\leq\Phi(S(\mathbb{V})\cup\mathbb{W};\bar{\mathcal{A}}).

Hence condition (c) also implies condition (a), and therefore we get the result under this condition as well. ∎

6.2 Proof of Theorems 3.6 and 3.7

Proof of Theorems 3.6 and 3.7.

We follow essentially the same ideas as in [6, Thm. 4.8]. Let δ=1/2\delta=1/2 (this value is arbitrary) and EE be the event that (6.1) holds over 𝕌\mathbb{U}^{\prime} with α=(1δ)α\alpha^{\prime}=(1-\delta)\alpha and β=(1+δ)β\beta^{\prime}=(1+\delta)\beta. Theorems 6.26.3 (with 𝕌\mathbb{U}^{\prime} in place of 𝕌\mathbb{U}) and the various conditions on mm imply that (Ec)ϵ\mathbb{P}(E^{c})\leq\epsilon. Now write

𝔼xxˇ𝕏2=𝔼(xxˇ𝕏2|E)(E)+𝔼(xxˇ𝕏2|Ec)(Ec).\mathbb{E}{\|x-\check{x}\|}^{2}_{\mathbb{X}}=\mathbb{E}({\|x-\check{x}\|}^{2}_{\mathbb{X}}|E)\mathbb{P}(E)+\mathbb{E}({\|x-\check{x}\|}^{2}_{\mathbb{X}}|E^{c})\mathbb{P}(E^{c}). (6.21)

Observe that the mapping \mathcal{C}:\mathbb{X}\rightarrow\mathbb{X},\ x\mapsto\min\{1,\theta/{\|x\|}_{\mathbb{X}}\}x is a contraction. We also have \check{x}=\mathcal{C}(\hat{x}) and x=\mathcal{C}(x), where in the latter case we used the fact that \theta\geq{\|x\|}_{\mathbb{X}} by assumption.

Fix u𝕌u\in\mathbb{U} and consider the first term. If the event EE occurs, then the properties of 𝒞\mathcal{C} and Lemma 6.1 give that

xxˇ𝕏=𝒞(x)𝒞(x^)𝕏xx^𝕏{22αA¯(xu)𝕐¯+xu𝕏}+22αe¯𝕐¯+2γα.{\|x-\check{x}\|}_{\mathbb{X}}={\|\mathcal{C}(x)-\mathcal{C}(\hat{x})\|}_{\mathbb{X}}\leq{\|x-\hat{x}\|}_{\mathbb{X}}\leq\left\{\frac{2\sqrt{2}}{\sqrt{\alpha}}{\|\bar{A}(x-u)\|}_{\overline{\mathbb{Y}}}+{\|x-u\|}_{\mathbb{X}}\right\}+\frac{2\sqrt{2}}{\sqrt{\alpha}}{\|\bar{e}\|}_{\overline{\mathbb{Y}}}+\frac{\sqrt{2}\gamma}{\sqrt{\alpha}}.

By the Cauchy–Schwarz inequality, we deduce that

𝔼(xxˇ𝕏2|E)1α𝔼A¯(xu)𝕐¯2+xu𝕏2+1αe¯𝕐¯2+γ2α.\mathbb{E}({\|x-\check{x}\|}^{2}_{\mathbb{X}}|E)\lesssim\frac{1}{\alpha}\mathbb{E}{\|\bar{A}(x-u)\|}^{2}_{\overline{\mathbb{Y}}}+{\|x-u\|}^{2}_{\mathbb{X}}+\frac{1}{\alpha}{\|\bar{e}\|}^{2}_{\overline{\mathbb{Y}}}+\frac{\gamma^{2}}{\alpha}.

We now use (3.1) to deduce that

𝔼(xxˇ𝕏2|E)βαxu𝕏2+1αe¯𝕐¯2+γ2α.\mathbb{E}({\|x-\check{x}\|}^{2}_{\mathbb{X}}|E)\lesssim\frac{\beta}{\alpha}{\|x-u\|}^{2}_{\mathbb{X}}+\frac{1}{\alpha}{\|\bar{e}\|}^{2}_{\overline{\mathbb{Y}}}+\frac{\gamma^{2}}{\alpha}. (6.22)

We next consider the second term of (6.21). Using the properties of 𝒞\mathcal{C}, we see that

xxˇ𝕏2θ.{\|x-\check{x}\|}_{\mathbb{X}}\leq 2\theta.

Substituting this and (6.22) into (6.21) and recalling that (Ec)ϵ\mathbb{P}(E^{c})\leq\epsilon now gives the result. ∎
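For concreteness, the truncation map \mathcal{C} used in the above proof is simply the metric projection onto the ball of radius \theta, and is therefore 1-Lipschitz. The following Python sketch (in dimension 20 with \theta = 1, both illustrative choices) verifies this contraction property numerically.

```python
import numpy as np

rng = np.random.default_rng(3)

def truncate(x, theta):
    """The map C(x) = min{1, theta / ||x||} x, i.e. projection onto the theta-ball."""
    nrm = np.linalg.norm(x)
    return x if nrm <= theta else (theta / nrm) * x

theta, dim = 1.0, 20
for _ in range(1000):
    x, y = rng.standard_normal(dim), rng.standard_normal(dim)
    lhs = np.linalg.norm(truncate(x, theta) - truncate(y, theta))
    assert lhs <= np.linalg.norm(x - y) + 1e-12
print("contraction property of C verified on 1000 random pairs")
```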

  • Remark 6.14 (On the error bounds in expectation)

The above proof explains why the truncation term is needed: namely, it bounds the error in the event where empirical nondegeneracy (6.1) fails. Modifying the estimator appears unavoidable if one is to obtain an error bound in expectation. A downside of the estimator xˇ\check{x} is that it requires an a priori bound on x𝕏{\|x\|}_{\mathbb{X}}. In some limited scenarios, one can avoid this by using a different estimator [35]. Indeed, let EE be the event in the above proof. Then define the conditional estimator

\check{x}=\begin{cases}\hat{x}&\text{if $E$ occurs}\\ 0&\text{otherwise}\end{cases}.

One readily deduces that this estimator obeys the same error bound (1.5) with \theta replaced by {\|x\|}_{\mathbb{X}}. Unfortunately, computing this estimator involves computing the empirical nondegeneracy constants. This is possible in some limited scenarios, such as when \mathbb{U} is a linear subspace, as the constants then correspond to the maximum and minimum singular values of a certain matrix. However, this is generally impossible in the nonlinear case. For instance, in the classical compressed sensing problem, we previously noted that (6.1) is equivalent to the RIP, and it is well known that computing RIP constants is NP-hard [59].

  • Remark 6.15

    The observant reader may have noticed a small technical issue with Theorem 3.6 and its proof: the estimator x^\hat{x} (and therefore xˇ\check{x}) is nonunique, and therefore xxˇ𝕏{\|x-\check{x}\|}_{\mathbb{X}} is not a well-defined random variable. To resolve this issue, one can replace xxˇ𝕏=x𝒞(x^)𝕏{\|x-\check{x}\|}_{\mathbb{X}}={\|x-\mathcal{C}(\hat{x})\|}_{\mathbb{X}} by the maximum error between xx and any γ\gamma-minimizer x^\hat{x}. The error bound remains the same for this (well-defined) random variable.

References

  • [1] B. Adcock and S. Brugiapaglia. Is Monte Carlo a bad sampling strategy for learning smooth functions in high dimensions? arXiv:2208.09045, 2022.
  • [2] B. Adcock, S. Brugiapaglia, N. Dexter, and S. Moraga. Deep neural networks are effective at learning high-dimensional Hilbert-valued functions from limited data. In J. Bruna, J. S. Hesthaven, and L. Zdeborová, editors, Proceedings of The Second Annual Conference on Mathematical and Scientific Machine Learning, volume 145 of Proc. Mach. Learn. Res. (PMLR), pages 1–36. PMLR, 2021.
  • [3] B. Adcock, S. Brugiapaglia, N. Dexter, and S. Moraga. Near-optimal learning of Banach-valued, high-dimensional functions via deep neural networks. arXiv:2211.12633, 2022.
  • [4] B. Adcock, S. Brugiapaglia, N. Dexter, and S. Moraga. On efficient algorithms for computing near-best polynomial approximations to high-dimensional, Hilbert-valued functions from limited samples. arXiv:2203.13908, 2022.
  • [5] B. Adcock, S. Brugiapaglia, and C. G. Webster. Sparse Polynomial Approximation of High-Dimensional Functions. Comput. Sci. Eng. Society for Industrial and Applied Mathematics, Philadelphia, PA, 2022.
  • [6] B. Adcock, J. M. Cardenas, and N. Dexter. CS4ML: A general framework for active learning with arbitrary data based on Christoffel functions. arXiv:2306.00945, 2023.
  • [7] B. Adcock, J. M. Cardenas, N. Dexter, and S. Moraga. Towards optimal sampling for learning sparse approximation in high dimensions, chapter 2, pages 9–77. Springer Optimization and Its Applications. Springer, 2022.
  • [8] B. Adcock and N. Dexter. The gap between theory and practice in function approximation with deep neural networks. SIAM J. Math. Data Sci., 3(2):624–655, 2021.
  • [9] B. Adcock and A. C. Hansen. Compressive Imaging: Structure, Sampling, Learning. Cambridge University Press, Cambridge, UK, 2021.
  • [10] B. Adcock, A. C. Hansen, C. Poon, and B. Roman. Breaking the coherence barrier: a new theory for compressed sensing. Forum Math. Sigma, 5:e4, 2017.
  • [11] A. K. Alekseev, I. M. Navon, and M. E. Zelentsov. The estimation of functional uncertainty using polynomial chaos and adjoint equations. Internat. J. Numer. Methods Fluids, 67(3):328–341, 2011.
  • [12] H. Avron, M. Kapralov, C. Musco, C. Musco, A. Velingker, and A. Zandieh. Random Fourier features for kernel ridge regression: approximation bounds and statistical guarantees. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 253–262. PMLR, 2017.
  • [13] H. Avron, M. Kapralov, C. Musco, C. Musco, A. Velingker, and A. Zandieh. A universal sampling method for reconstructing signals with simple Fourier transforms. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, pages 1051–1063, New York, NY, USA, 2019. Association for Computing Machinery.
  • [14] R. G. Baraniuk, V. Cevher, M. F. Duarte, and C. Hedge. Model-based compressive sensing. IEEE Trans. Inform. Theory, 56(4):1982–2001, 2010.
  • [15] A. Berk, S. Brugiapaglia, B. Joshi, Y. Plan, M. Scott, and O. Yilmaz. A coherence parameter characterizing generative compressed sensing with Fourier measurements. IEEE J. Sel. Areas Inf. Theory, 3(3):502–512, 2023.
  • [16] A. Berk, S. Brugiapaglia, Y. Plan, M. Scott, X. Sheng, and O. Yilmaz. Model-adapted Fourier sampling for generative compressed sensing. arXiv:2310.04984, 2023.
  • [17] J. Bigot, C. Boyer, and P. Weiss. An analysis of block sampling strategies in compressed sensing. IEEE Trans. Inform. Theory, 62(4):2125–2139, 2016.
  • [18] A. Bora, A. Jalal, E. Price, and A. G. Dimakis. Compressed sensing using generative models. In International Conference on Machine Learning, pages 537–546, 2017.
  • [19] A. Bourrier, M. E. Davies, T. Peleg, P. Pérez, and R. Gribonval. Fundamental performance limits for ideal decoders in high-dimensional linear inverse problems. IEEE Trans. Inform. Theory, 60(12):7928–7946, 2014.
  • [20] C. Boyer, J. Bigot, and P. Weiss. Compressed sensing with structured sparsity and structured acquisition. Appl. Comput. Harmon. Anal., 46(2):312–350, 2019.
  • [21] S. Brugiapaglia, S. Dirksen, H. C. Jung, and H. Rauhut. Sparse recovery in bounded Riesz systems with applications to numerical methods for PDEs. Appl. Comput. Harmon. Anal., 53:231–269, 2021.
  • [22] E. J. Candès and Y. Plan. A probabilistic and RIPless theory of compressed sensing. IEEE Trans. Inform. Theory, 57(11):7235–7254, 2011.
  • [23] N. Chauffert, P. Ciuciu, J. Kahn, and P. Weiss. Variable density sampling with continuous trajectories. SIAM J. Imaging Sci., 7(4):1962–1992, 2014.
  • [24] S. Chen, R. Varma, A. Singh, and J. Kovačević. A statistical perspective of sampling scores for linear regression. In 2016 IEEE International Symposium on Information Theory (ISIT), pages 1556–1560, 2016.
  • [25] X. Chen and E. Price. Active regression via linear-sample sparsification. In A. Beygelzimer and D. Hsu, editors, Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 663–695. PMLR, 2019.
  • [26] I.-Y. Chun and B. Adcock. Compressed sensing and parallel acquisition. IEEE Trans. Inform. Theory, 63(8):4860–4882, 2017.
  • [27] A. Cohen and R. A. DeVore. Approximation of high-dimensional parametric PDEs. Acta Numer., 24:1–159, 2015.
  • [28] A. Cohen and G. Migliorati. Optimal weighted least-squares methods. SMAI J. Comput. Math., 3:181–203, 2017.
  • [29] W. M. Czarnecki, S. Osindero, M. Jaderberg, G. Swirszcz, and R. Pascanu. Sobolev training for neural networks. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
  • [30] M. A. Davenport, M. F. Duarte, Y. C. Eldar, and G. Kutyniok. Introduction to compressed sensing. In Y. C. Eldar and G. Kutyniok, editors, Compressed Sensing: Theory and Applications, pages 1–64. Cambridge University Press, Cambridge, UK, 2012.
  • [31] M. A. Davenport and J. Romberg. An overview of low-rank matrix recovery from incomplete observations. IEEE J. Sel. Topics Signal Process., 10(4):602–622, 2016.
  • [32] M. Derezinski, M. K. K. Warmuth, and D. J. Hsu. Leveraged volume sampling for linear regression. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
  • [33] N. Dexter, H. Tran, and C. Webster. A mixed 1\ell_{1} regularization approach for sparse simultaneous approximation of parameterized PDEs. ESAIM Math. Model. Numer. Anal., 53:2025–2045, 2019.
  • [34] S. Dirksen. Dimensionality reduction with subgaussian matrices: a unified theory. Found. Comput. Math., 16:1367–1396, 2016.
  • [35] M. Dolbeault and A. Cohen. Optimal sampling and Christoffel functions on general domains. Constr. Approx., 56(1):121–163, 2022.
  • [36] D. L. Donoho and M. Elad. Optimally sparse representation in general (non-orthogonal) dictionaries via \ell_{1} minimization. Proc. Natl. Acad. Sci. USA, 100:2197–2202, 2003.
  • [37] D. L. Donoho and X. Huo. Uncertainty principles and ideal atomic decomposition. IEEE Trans. Inform. Theory, 47(7):2845–2862, 2001.
  • [38] M. F. Duarte and Y. C. Eldar. Structured compressed sensing: from theory to applications. IEEE Trans. Signal Process., 59(9):4053–4085, 2011.
  • [39] M. Eigel, R. Schneider, and P. Trunschke. Convergence bounds for empirical nonlinear least-squares. ESAIM Math. Model. Numer. Anal., 56(1):79–104, 2022.
  • [40] M. Elad and A. M. Bruckstein. A generalized uncertainty principle and sparse representation in pairs of bases. IEEE Trans. Inform. Theory, 48(9):2558–2567, 2002.
  • [41] T. Erdelyi, C. Musco, and C. Musco. Fourier sparse leverage scores and approximate kernel learning. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 109–122. Curran Associates, Inc., 2020.
  • [42] X. Feng and L. Zeng. Gradient-enhanced deep neural network approximations. J. Mach. Learn. Model. Comput., 3(4):73–91, 2022.
  • [43] S. Foucart and H. Rauhut. A Mathematical Introduction to Compressive Sensing. Appl. Numer. Harmon. Anal. Birkhäuser, New York, NY, 2013.
  • [44] A. Gajjar, C. Musco, and C. Hegde. Active learning for single neuron models with Lipschitz non-linearities. In F. Ruiz, J. Dy, and J.-W. van de Meent, editors, Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, pages 4101–4113. PMLR, 2023.
  • [45] L. Guo, A. Narayan, and T. Zhou. A gradient enhanced 1\ell_{1}-minimization for sparse approximation of polynomial chaos expansions. J. Comput. Phys., 367:49–64, 2018.
  • [46] J. Han, A. Jentzen, and W. E. Solving high-dimensional partial differential equations using deep learning. Proc. Natl. Acad. Sci. U.S.A., 115(34):8505–8510, 2018.
  • [47] A. Hashemi, H. Schaeffer, R. Shi, U. Topcu, G. Tran, and R. Ward. Generalization bounds for sparse random feature expansions. Appl. Comput. Harmon. Anal., 62:310–330, 2023.
  • [48] A. Jalal, M. Arvinte, G. Daras, E. Price, A. Dimakis, and J. Tamir. Robust compressed sensing MR imaging with deep generative priors. In NeurIPS 2021 Workshop on Deep Learning and Inverse Problems, 2021.
  • [49] M. Lustig, D. L. Donoho, and J. M. Pauly. Sparse MRI: the application of compressed sensing for rapid MRI imaging. Magn. Reson. Med., 58(6):1182–1195, 2007.
  • [50] P. Ma, M. W. Mahoney, and B. Yu. A statistical perspective on algorithmic leveraging. J. Mach. Learn. Res., 16:861–911, 2015.
  • [51] O. Malik, Y. Xu, N. Cheng, S. Becker, A. Doostan, and A. Narayan. Fast algorithms for monotone lower subsets of Kronecker least squares problems. arXiv:2209.05662, 2022.
  • [52] D. W. McRobbie, E. A. Moore, M. J. Graves, and M. R. Prince. MRI: From Picture to Proton. Cambridge University Press, Cambridge, 2nd edition, 2006.
  • [53] T. O’Leary-Roseberry, P. Chen, U. Villa, and O. Ghattas. Derivative-Informed Neural Operator: An efficient framework for high-dimensional parametric derivative learning. Journal of Computational Physics, 496:112555, 2024.
  • [54] T. O’Leary-Roseberry, U. Villa, P. Chen, and O. Ghattas. Derivative-informed projected neural networks for high-dimensional parametric maps governed by PDEs. Comput. Methods Appl. Mech. Engrg., 388(1):114199, 2022.
  • [55] J. Peng, J. Hampton, and A. Doostan. On polynomial chaos expansion via gradient-enhanced l1l_{1}-minimization. J. Comput. Phys., 310:440–458, 2016.
  • [56] M. Raissi, P. Perdikaris, and G. E. Karniadakis. Physics-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys., 378:686–707, 2019.
  • [57] J. Romberg. Imaging via compressive sampling. IEEE Signal Process. Mag., 25(2):14–20, 2008.
  • [58] V. Studer, J. Bobin, M. Chahid, H. Moussavi, E. J. Candès, and M. Dahan. Compressive fluorescence microscopy for biological and hyperspectral imaging. Proc. Natl. Acad. Sci. USA, 109(26):1679–1687, 2011.
  • [59] A. M. Tillmann and M. E. Pfetsch. The computational complexity of the restricted isometry property, the nullspace property, and related concepts in compressed sensing. IEEE Trans. Inform. Theory, 60(2):1248–1259, 2014.
  • [60] Y. Traonmilin and R. Gribonval. Stable recovery of low-dimensional cones in Hilbert spaces: one RIP to rule them all. Appl. Comput. Harmon. Anal., 45(1):170–205, 2018.
  • [61] Y. Traonmilin, G. Puy, R. Gribonval, and M. E. Davies. Compressed sensing in Hilbert spaces. In H. Boche, G. Caire, R. Calderbank, M. März, G. Kutyniok, and R. Mathar, editors, Compressed Sensing and its Applications: Second International MATHEON Conference 2015, Applied and Numerical Harmonic Analysis, pages 359–384. Birkhäuser, Cham, 2017.
  • [62] R. Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, Cambridge, UK, 2018.
  • [63] M. Vidyasagar. An Introduction to Compressed Sensing. Comput. Sci. Eng. Society for Industrial and Applied Mathematics, Philadelphia, PA, 2019.
  • [64] D. P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1-2):1–157, 2014.
  • [65] J. Yu, L. Lu, X. Meng, and G. E. Karniadakis. Gradient-enhanced physics-informed neural networks for forward and inverse PDE problems. Comput. Methods Appl. Mech. Engrg., 393:114823, 2022.