Abstract
The matrix is -regular if for any -sparse vector ,
We show that if is -regular for , then by multiplying the columns of by independent random signs, the resulting random ensemble acts on an arbitrary subset (almost) as if it were gaussian, and with the optimal probability estimate: if is the gaussian mean-width of and , then with probability at least ,
where . This estimate is optimal for .
1 Introduction
Linear operators that act in an almost-isometric way on subsets of are of obvious importance. Although approximations of isometries are the only operators that almost preserve the Euclidean norm of any point in , one may consider a more flexible alternative: a random ensemble of operators such that, for any fixed , with high probability, “acts well” on every element of . Such random ensembles have been studied extensively over the years, following the path paved by the celebrated work of Johnson and Lindenstrauss in [5]. Here we formulate the Johnson-Lindenstrauss Lemma in one of its gaussian versions:
Theorem 1.1.
There exist absolute constants and such that the following holds. Let and set to be a random matrix whose entries are independent, standard gaussian random variables. Let be of cardinality at most . Then for , with probability at least , for every ,
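To make the mechanism of Theorem 1.1 concrete, the following minimal Python sketch draws a gaussian ensemble and checks the norm distortion over a finite set. It assumes the standard multiplicative form of the guarantee and the usual normalization Γ = m^{-1/2}G; the specific values of m, ε and |T| are arbitrary and are chosen only so that the condition of the theorem holds comfortably.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m, eps = 200, 400, 0.2          # ambient dimension, target dimension, accuracy
T = rng.standard_normal((50, n))   # a finite set T of 50 points in R^n

# gaussian random ensemble: Gamma = m^{-1/2} G with iid standard gaussian entries
G = rng.standard_normal((m, n))
Gamma = G / np.sqrt(m)

# check the multiplicative norm preservation (1 - eps) <= ||Gamma x|| / ||x|| <= (1 + eps)
ratios = np.linalg.norm(T @ Gamma.T, axis=1) / np.linalg.norm(T, axis=1)
print(ratios.min(), ratios.max())
```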
The scope of Theorem 1.1 can be extended to more general random ensembles than the gaussian one, e.g., to a random matrix whose rows are iid copies of a centred random vector that exhibits suitable decay properties (see, e.g. [4, 8]). It is far more challenging to construct a random ensemble that, on the one hand, satisfies a version of Theorem 1.1, and on the other is based on “few random bits” or is constructed using a heavy-tailed random vector.
A significant breakthrough towards more general “Johnson-Lindenstrauss transforms” came in [7], where it was shown that a matrix that satisfies a suitable version of the restricted isometry property can be converted to the wanted random ensemble by multiplying its columns by random signs. More accurately, let be independent, symmetric -valued random variables. Set and for a matrix define
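In code, the column randomization just described amounts to multiplying the j-th column of the matrix by an independent random sign, i.e., forming the product of the matrix with a diagonal sign matrix. A minimal sketch, assuming the matrix is given as a dense array:

```python
import numpy as np

def column_randomize(A, rng):
    """Multiply each column of A by an independent random sign.

    Equivalent to A @ D_eps, where D_eps is diagonal with iid
    symmetric {-1, 1} entries on the diagonal.
    """
    eps = rng.choice([-1.0, 1.0], size=A.shape[1])
    return A * eps          # broadcasting scales column j by eps[j]

rng = np.random.default_rng(1)
A = rng.standard_normal((64, 256)) / np.sqrt(64)
Gamma = column_randomize(A, rng)
```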
From here on we denote by the subset of consisting of vectors that are supported on at most coordinates.
Definition 1.2.
A matrix satisfies the restricted isometry property of order and level if
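Definition 1.2 is a uniform condition over all sparse vectors and cannot be certified by sampling; still, a Monte Carlo probe of the kind sketched below gives an empirical lower bound on the restricted isometry level of a given matrix. The helper and the squared-norm criterion it tests are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def empirical_rip_level(A, s, trials=2000, rng=None):
    """Monte Carlo lower bound on the RIP level of order s:
    max over sampled unit-norm s-sparse x of | ||Ax||_2^2 - ||x||_2^2 |."""
    rng = rng or np.random.default_rng()
    n = A.shape[1]
    worst = 0.0
    for _ in range(trials):
        support = rng.choice(n, size=s, replace=False)
        x = np.zeros(n)
        x[support] = rng.standard_normal(s)
        x /= np.linalg.norm(x)
        worst = max(worst, abs(np.linalg.norm(A @ x) ** 2 - 1.0))
    return worst

rng = np.random.default_rng(2)
m, n, s = 128, 512, 10
A = rng.standard_normal((m, n)) / np.sqrt(m)
print(empirical_rip_level(A, s, rng=rng))
```

Since only finitely many sparse vectors are sampled, the returned value can underestimate the true restricted isometry level; it is a sanity check rather than a certificate.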
Theorem 1.3.
[7]
There are absolute constants and such that the following holds.
Let and . Consider and let . If satisfies the restricted isometry property of order and at level , then with probability at least , for every ,
While Theorem 1.3 does not recover the probability estimate from Theorem 1.1, it does imply at the constant probability level that is an almost isometry in the random ensemble sense: if is a matrix that -preserves the norms of vectors that are sparse, then a typical realization of the random ensemble preserves the norms of all the elements in .
Various extensions of Theorem 1.1 that hold for arbitrary subsets of have been studied over the years. In such extensions the “complexity parameter” is replaced by more suitable counterparts. A rather general version of Theorem 1.1 follows from a functional Bernstein inequality (see, e.g., [3, 8, 2]), and to formulate that inequality in the gaussian case we require the following definition.
Definition 1.4.
Let be independent, standard gaussian random variables. For set
Let
be the critical dimension of the set .
The critical dimension appears naturally when studying the geometry of convex sets—for example, in the context of the Dvoretzky-Milman Theorem (see [1] and references therein for more details). It is the natural alternative to , which was suitable for finite subsets of the sphere .
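For a concrete finite set, the gaussian mean-width can be estimated numerically by Monte Carlo, and from it a proxy for the critical dimension. The sketch below uses the normalization (mean width / diameter)^2, which is a common convention adopted here purely for illustration; it is an assumption, not the definition given above.

```python
import numpy as np

def mean_width(T, draws=5000, rng=None):
    """Monte Carlo estimate of E sup_{t in T} <G, t> for a finite set T (rows of the array)."""
    rng = rng or np.random.default_rng()
    G = rng.standard_normal((draws, T.shape[1]))
    return (G @ T.T).max(axis=1).mean()

rng = np.random.default_rng(3)
T = rng.standard_normal((100, 50))            # 100 points in R^50
ell = mean_width(T, rng=rng)
diam = max(np.linalg.norm(t - s) for t in T for s in T)
d_T = (ell / diam) ** 2                        # assumed normalization of the critical dimension
print(ell, d_T)
```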
Let be the standard gaussian random vector in , set to be independent copies of and put
to be the random ensemble used in Theorem 1.1.
Theorem 1.5.
There exist absolute constants and such that the following holds. If and then with probability at least
for every ,
(1.1)
One may use Theorem 1.5 to ensure that the uniform error in (1.1) is at most . Indeed, if
then with probability at least ,
(1.2)
which is a natural counterpart of Theorem 1.1 once is replaced by .
As it happens, a version of Theorem 1.3 that is analogous to (1.2) was proved in [9], using the notion of a multi-scale RIP.
Definition 1.6.
Let . For and the matrix satisfies a multi-scale RIP with distortion and sparsity if, for every and every , one has
Definition 1.6 implies that if then
Example 1.7.
Let be a gaussian matrix as above and set . It is standard to verify (using, for example, Theorem 1.5 and a well-known estimate on ) that with probability at least ,
By the union bound over it follows that with nontrivial probability, satisfies a multi-scale RIP with and . Observe that the second term in the multi-scale RIP condition, namely , is not needed here.
The following theorem is the starting point of this note: an estimate on the error a typical realization of the random ensemble has when acting on an arbitrary , given that satisfies an appropriate multi-scale RIP.
Theorem 1.9.
[9]
There are absolute constants and such that the following holds.
Let and let be a matrix that satisfies a multi-scale RIP with sparsity level and distortion
(1.3)
Then for , with probability at least ,
(1.4)
To put Theorem 1.9 in some context, if the belief is that should exhibit the same behaviour as the gaussian matrix , then (keeping in mind that should scale like ), “a gaussian behaviour” as in Theorem 1.5 is that with high probability,
Observe that , implying by (1.3) that . Hence, the error in (1.4) in terms of is indeed
However, despite the “gaussian error”, the probability estimate in Theorem 1.9 is far weaker than in Theorem 1.5—it is just at the constant level.
Our main result is that, under a modified and seemingly less restrictive version of the multi-scale RIP, acts on as if it were a gaussian operator, achieving the same distortion and probability estimate as in Theorem 1.5.
Definition 1.10.
Let be a matrix. For let be the largest such that for every , is regular; that is, for every
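In the spirit of Definition 1.10, one can probe empirically for the largest sparsity level at which a given matrix still looks regular at a prescribed distortion. As with the RIP sketch earlier, random sampling only detects failures, so the value returned below is an optimistic estimate; it also assumes, purely for illustration, the same squared-norm criterion used in that sketch, which need not coincide with the regularity condition above.

```python
import numpy as np

def largest_regular_level(A, delta, s_max, trials=1000, rng=None):
    """Return the largest s <= s_max such that no sampled unit-norm sparse vector
    of sparsity at most s violates | ||Ax||_2^2 - 1 | <= delta."""
    rng = rng or np.random.default_rng()
    n = A.shape[1]
    for s in range(1, s_max + 1):
        for _ in range(trials):
            support = rng.choice(n, size=s, replace=False)
            x = np.zeros(n)
            x[support] = rng.standard_normal(s)
            x /= np.linalg.norm(x)
            if abs(np.linalg.norm(A @ x) ** 2 - 1.0) > delta:
                return s - 1
    return s_max

rng = np.random.default_rng(4)
A = rng.standard_normal((128, 512)) / np.sqrt(128)
print(largest_regular_level(A, delta=0.3, s_max=30, rng=rng))
```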
Theorem 1.11.
There exist absolute constants and such that the following holds. Let and set . If , and , then with probability at least
we have that
(1.5)
Theorem 1.11 clearly improves the probability estimate from Theorem 1.9. The other (virtual) improvement is that the matrix need only be -regular for , and the way acts on for is of no importance. The reason for calling that improvement “virtual” is the following observation:
Lemma 1.13.
If then for any ,
In other words, the second term in the multi-scale RIP condition follows automatically from the first one and the fact that is sufficiently large.
Proof. Let for , and let be a decomposition of the support of into coordinate blocks of cardinality . Set , that is, the projection of onto , and write . Note that and that
The vectors are orthogonal and so . Therefore,
For the first term, as each is supported on at most coordinates, it follows from the regularity condition that
As for the second term, since and are orthogonal, and
Thus, by the regularity of and as ,
Taking the sum over all pairs , , each factor appears at most times, and . Hence, using that
Clearly, Theorem 1.11 implies a suitable version of Theorem 1.9.
Corollary 1.14.
There exist absolute constants and such that the following holds. Let be as above, set and . Let . Then with probability at least ,
In Section 3 we present one simple application of Theorem 1.11. We show that column randomization of a typical realization of a Bernoulli circulant matrix (complete or partial) exhibits an almost gaussian behaviour (conditioned on the generating vector). In particular, only random bits ( from the generating Bernoulli vector and from the column randomization) are required if one wishes to create a random ensemble that is, effectively, an almost isometry.
The proof of Theorem 1.11 is based on a chaining argument. For more information on the generic chaining mechanism, see Talagrand’s treasured manuscript [10]. We only require relatively basic notions from generic chaining theory, as well as the celebrated majorizing measures theorem.
Definition 1.16.
Let . A collection of subsets of , , is an admissible sequence if and for , . For every denote by a nearest point to in with respect to the Euclidean distance. Set for and let .
The functional with respect to the metric is defined by
\[
\gamma_2(T,d) = \inf \sup_{t \in T} \sum_{s \ge 0} 2^{s/2}\, d(t, T_s),
\]
where the infimum is taken with respect to all admissible sequences of .
An application of Talagrand’s majorizing measures theorem to the gaussian process shows that and are equivalent:
Theorem 1.17.
There are absolute constants and such that for every ,
\[
c_1 \gamma_2(T, \|\cdot\|_2) \le \mathbb{E} \sup_{t \in T} \sum_{i=1}^n g_i t_i \le c_2 \gamma_2(T, \|\cdot\|_2).
\]
The proof of Theorem 1.17 can be found, for example, in [10].
2 Proof of Theorem 1.11
We begin the proof with a word about notation: throughout, absolute constants, that is, positive numbers that are independent of all the parameters involved in the problem, are denoted by , , , etc. Their value may change from line to line.
As noted previously, the proof is based on a chaining argument. Let be an optimal admissible sequence of . Set so that is the critical dimension of , i.e.,
(without loss of generality we may assume that equality holds). Let be a parameter to be specified in what follows and observe that
Writing , it follows that
(2.1)
and
Therefore, setting
we have that for every ,
and
(2.2)
Hence,
(2.3)
To estimate the final term, note that for every ,
By the definition of the functional and the majorizing measures theorem, for every integer and every ,
Thus,
and
(2.4)
Equation (2.4) and the wanted estimate in Theorem 1.11 hint at the choice of : it should be larger than . Recalling that and that , set
The reason behind the choice of the third term will become clear in what follows.
With that choice of ,
(2.5)
and the nontrivial part of the proof is to control and with high probability, which would lead to the wanted estimate on (2.3).
2.1 A decoupling argument
For every ,
(2.6)
By the assumption that is -regular,
and noting that ,
Next, we turn to the “off-diagonal” term in (2.6). For let
and let us obtain high probability estimates on for various sets .
The first step in that direction is decoupling: let be independent -valued random variables with mean . Set and observe that for every ,
(2.7)
Indeed, for every ,
and for every ,
Equation (2.7) naturally leads to the following definition:
Definition 2.1.
For and , set
(2.8)
Recall that is the nearest point to in and .
Lemma 2.2.
For every and every ,
(2.9)
Proof. Fix an integer . With the decoupling argument in mind, fix and observe that
Moreover,
and
Combining these observations, for every and ,
from which the claim follows immediately.
As part of the decoupling argument and to deal with the introduction of the random variables in (2.7), we will use the following elementary fact:
Lemma 2.3.
Let be a function of and . If then with -probability at least ,
(2.10)
Proof. For a nonnegative function we have that point-wise, . Let and note that
By Jensen’s inequality followed by Fubini’s Theorem,
and setting proves (2.10).
We shall use Lemma 2.3 in situations where we actually have more information—namely that for any , for a well chosen . As a result,
Taking into account Lemma 2.2 and Lemma 2.3, it follows that if one wishes to estimate using a chaining argument, it suffices to obtain, for every , bounds on moments of random variables of the form
(2.11)
as that results in high probability estimates on each of the terms in (2.2). And as there are at most random variables involved in this chaining argument at the -stage, the required moment in (2.11) is for the first two terms and for the third one.
2.2 Preliminary estimates
For let be the projection of onto . The key lemma in the proof of Theorem 1.11 is:
Lemma 2.4.
There exists an absolute constant such that the following holds.
Let and be as in (2.8). Set such that . Then for ,
Proof. Let be the Euclidean unit sphere in the subspace and let be a maximal -separated subset of . By a volumetric estimate (see, e.g. [1]), there is an absolute constant such that . Moreover, a standard approximation argument shows that for every ,
where is a suitable absolute constant. Therefore,
and it suffices to control, with high probability,
Fix , recall that is -regular for , and consider first the case .
Denote by the set of indices corresponding to the largest values of . If then set .
It is straightforward to verify (e.g., using Hoeffding's inequality) that there is an absolute constant such that
where we used the fact that for ,
Therefore,
Note that is supported in , while each ‘legal’ is supported in a subset of ; in particular, and are orthogonal, implying that for every such ,
Thus, by the -regularity of (as ),
and it follows that
Turning to the case , let be the set of indices corresponding to the largest coordinates of . Therefore,
(2.12)
using that for ,
Recall that and that . The same argument used previously shows that
hence,
and the estimate holds for each for that fixed .
Setting , it follows from Chebyshev's inequality that with probability at least ,
and by the union bound, recalling that , the same estimate holds uniformly for every , provided that . Thus, with probability at least ,
and the wanted estimate follows from tail integration.
The next observation deals with more refined estimates on random variables of the form
Once again, we use the fact that for any
(2.13)
As one might have guessed, the choice of will be vectors of the form . These are random vectors that are independent of the Bernoulli random variables involved in the definition of . At the same time, will be deterministic.
Without loss of generality assume that for every . Let be the set of indices corresponding to the largest coordinates of , the set corresponding to the next largest coordinates, and so on. The choice of will be specified in what follows.
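The block decomposition used here, splitting the coordinates of a vector into consecutive blocks of prescribed cardinality ordered by decreasing magnitude, is a standard peeling device. A short sketch, assuming the vector is given as a dense array:

```python
import numpy as np

def peel_blocks(x, s):
    """Split the indices of x into blocks I_1, I_2, ... of cardinality s,
    where I_1 holds the s largest coordinates in absolute value, I_2 the
    next s largest, and so on (the last block may be shorter)."""
    order = np.argsort(-np.abs(x))          # indices sorted by decreasing |x_i|
    return [order[i:i + s] for i in range(0, len(x), s)]

x = np.array([0.1, -3.0, 0.5, 2.0, -0.2, 1.0, 0.0, -0.7])
for block in peel_blocks(x, s=3):
    print(block, x[block])
```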
Note that for any ,
Set , and by (2.13),
(2.14)
And, in the case where for every , it follows that
(2.15)
2.3 Estimating
Recall that
and that for every ,
Theorem 2.5.
There are absolute constants and such that for , with probability at least ,
Proof. Let , and as noted previously, for every ,
Following the decoupling argument, one may fix . The core of the argument is to obtain satisfactory estimates on moments of the random variables
for that (arbitrary) fixed choice of . And, since , for a uniform estimate that holds for every random variable of the form , it is enough to control the -th moment of each for .
With that in mind, denote by the expectation taken with respect to , and by the expectation taken with respect to .
Let us apply (2.15), with the choice
Thus, for every ,
The sets are all of cardinality and so there are at most of them. By Lemma 2.4, for each one of the sets and ,
implying that
(2.16)
as long as . In particular, since , it suffices that to ensure that (2.3) holds.
Hence, for every ,
because .
By Jensen’s inequality, for every ,
Setting for and , Chebyshev's inequality implies that with -probability at least ,
By the union bound, this estimate holds for every and . Thus, there are absolute constants and such that with probability at least ,
as claimed.
2.4 Estimating
Next, recall that
Expanding as in (2.6), the “diagonal term” is at most , and one has to deal with the “off-diagonal” term
As observed in Lemma 2.2 combined with Lemma 2.3, it suffices to obtain, for every fixed , estimates on the moments of
(2.17)
We begin with a standard observation:
Lemma 2.7.
There is an absolute constant such that the following holds. Let be nonnegative random variables, let and set . If there is some such that for every , , then
Proof. By Chebyshev's inequality, for every and ,
Therefore,
implying that
Hence,
The analysis is split into two cases.
Case I: .
In this case, . For every we invoke (2.2) for sets of cardinality when , and of cardinality when .
Set and ; the treatment when is identical and is omitted. Let and put if . Finally, set .
Consider such that and observe that . By Lemma 2.4,
(2.18)
Therefore, by Lemma 2.7,
Next, if then and by Lemma 2.4,
(2.19)
There are at most sets in that range, and because , it is evident that
therefore, as ,
Thus, for every fixed ,
(2.20)
Now, by (2.15),
(2.21)
which, combined with (2.18) and (2.20), implies that
(2.22)
Clearly, there are at most random variables as in (2.22). With that in mind, set and let . By Lemma 2.2 and Lemma 2.3 followed by the union bound, we have that with probability at least , for every ,
(2.23)
And, by the union bound and recalling that , (2.23) holds uniformly for every and with probability at least
An identical argument shows that with probability at least
for every ,
Finally, invoking Lemma 2.2, we have that with probability at least , for every ,
as required.
Case II: .
The necessary modifications are minor and we shall only sketch them. In this range, , and the problem is that for each vector , the number of blocks of cardinality , namely , is larger than when . Therefore, setting for every , the uniform estimate on
can be obtained by bounding for . Indeed, by Lemma 2.4, for every and every , we have that
The rest of the argument is identical to the previous one and is omitted.
Proof of Theorem 1.11. Using the estimates we established, it follows that for , with probability at least
noting that . Since
the claim follows from a straightforward computation, by separating into the cases and .
3 Application - A circulant Bernoulli matrix
Let and be independent Bernoulli vectors. Set and put . Let be the circulant matrix generated by the random vector ; that is, is the matrix whose -th row is the shifted vector , where for every , .
Fix of cardinality and let
be the normalized partial circulant matrix. It follows from Theorem 3.1 in [6] and the estimates in Section 4 there that for a typical realization of , the matrix is -regular for :
Theorem 3.1.
[6]
There exist absolute constants and such that the following holds. For with probability at least
we have that
By Theorem 3.1 and the union bound for , there is an event with probability at least with respect to , on which, for every ,
This verifies the assumption needed in Theorem 1.11 on the event . Now fix and let be the resulting partial circulant matrix. Set and let . By Theorem 1.11, with probability at least
with respect to , we have that
where
Thus, a random matrix generated by independent random signs is a good embedding of an arbitrary subset of with the same accuracy (up to logarithmic factors) as a gaussian matrix. Moreover, conditioned on the choice of the circulant matrix , the probability estimate coincides with the estimate in the gaussian case.
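For concreteness, a minimal sketch of the construction described in this section: a full circulant matrix whose rows are cyclic shifts of the Bernoulli generating vector, followed by a random row selection and column randomization. The normalization by the square root of the number of selected rows is an assumption matching the usual convention and is used here only for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 256, 64

xi = rng.choice([-1.0, 1.0], size=n)                 # Bernoulli generating vector
C = np.stack([np.roll(xi, i) for i in range(n)])     # full circulant matrix: row i is the i-th cyclic shift of xi

I = rng.choice(n, size=m, replace=False)             # row subset of cardinality m
A = C[I] / np.sqrt(m)                                # partial circulant matrix (assumed 1/sqrt(m) normalization)

eps = rng.choice([-1.0, 1.0], size=n)                # column randomization
Gamma = A * eps                                      # the resulting random ensemble
```

Only the signs of the generating vector and the column signs are random here, in line with the count of random bits mentioned in the introduction.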