1 Introduction
The Johnson-Lindenstrauss lemma [JL84] concerns the approximate preservation of distances in a finite point set in Euclidean space.
Specifically, for a subset $X \subset \mathbb{R}^D$ of $n$ points and a tolerance $\epsilon \in (0,1)$, there exist matrices $\Pi \in \mathbb{R}^{k \times D}$ and a constant $C > 0$ for which
\[
(1-\epsilon)\|x - y\|_2^2 \;\le\; \|\Pi x - \Pi y\|_2^2 \;\le\; (1+\epsilon)\|x - y\|_2^2
\tag{JL}
\]
holds for all pairs of points $x, y \in X$ simultaneously, with probability at least $1 - \delta$, provided
\[
k \;\ge\; C\,\epsilon^{-2}\log(n^2/\delta).
\]
The matrices are drawn randomly, and much work has been done since the original paper to equip with special properties, such as allowing fast matrix multiplication, preserving sparsity, restricting the matrix entries to discrete distributions, and so forth; see [Nel20] for a recent review.
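As a concrete illustration of this mechanism (a minimal NumPy sketch, not any particular construction from the literature; the function names and parameter choices below are ours), one can draw a dense Gaussian $\Pi$, project a synthetic dataset, and measure the worst relative error in squared pairwise distances.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def gaussian_jl(X, k):
    """Project the rows of X (n x D) to k dimensions with a dense Gaussian map."""
    D = X.shape[1]
    Pi = rng.normal(size=(k, D)) / np.sqrt(k)   # scaled so E||Pi x||^2 = ||x||^2
    return X @ Pi.T

def max_distortion(X, Y):
    """Largest relative error in squared pairwise distances between X and its image Y."""
    worst = 0.0
    n = X.shape[0]
    for i in range(n):
        for j in range(i):
            d2 = np.sum((X[i] - X[j]) ** 2)
            e2 = np.sum((Y[i] - Y[j]) ** 2)
            worst = max(worst, abs(e2 - d2) / d2)
    return worst

X = rng.normal(size=(200, 1000))                   # n = 200 points in D = 1000 dimensions
print(max_distortion(X, gaussian_jl(X, k=400)))    # worst relative error over all pairs
\end{verbatim}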
The matrix provides a linear method for dimension reduction, which, at the very least, reduces the amount of space needed to store the dataset on the computer, provided one can work with approximate versions of the pairwise distances.
One would expect that “robust” downstream algorithms that depend on distance data, now working on the pointset , should still return answers that approximate those found on the original pointset , but now in shorter time, with less memory needed, etc.
People appreciate this lemma, in theory [Vem04], for the above reasons, and if the algorithm scales linearly in the ambient dimension, in time or in space, then processing instead of will yield a proportional improvement.
However, certain algorithms, including many nearest-neighbor algorithms, scale exponentially in the dimension, especially if they only use space linear in the number of points, due to packing arguments. For comparison, brute force already solves the all-pairs nearest-neighbor problem in time quadratic in the number of points (and linear in the ambient dimension) by scanning the points of the dataset with respect to their distance to each query point.
If the algorithm scales like , this scaling is only an improvement for , and even then one would really prefer , as is too expensive when is large.
Consequently, if we use multiplication by as a preprocessing step, the target dimension is much too large to be practical, as is now polynomial in with exponent at least with .
Apart from searching for better algorithms, it is then natural to ask whether there are matrices with target dimension much smaller than that would still satisfy equation (JL) for all pairs of points of .
However, Larsen and Nelson [LN14] showed no such matrix exists for general datasets – a union of the standard simplex and a sufficient number of Gaussian vectors forces at least one distance to be stretched or compressed too much.
On the other hand, if further assumptions are made on the pointset, smaller is possible, for instance, when already lies in a low-dimensional subspace [Sar06] or when its unit difference set has low Gaussian complexity [KM05], even while allowing many more points in the dataset.
If one considers a given algorithm “robust”, one would hope that failing to preserve a few distances should not markedly change the final output; though, one would ideally have to prove such behavior for that algorithm.
One is then led to ask: what happens to the other distances between the points of when is smaller than ?
To be concrete, can we approximately preserve of all the distances, for some fraction , ideally with ?
The results in this paper show this is possible when is not too small and the data has high or even moderate “intrinsic dimension”, in a sense to be defined later.
As a preview, for data sampled i.i.d. like a random vector , corollary 4.1.8 shows that if has i.i.d. coordinates (no moment assumptions are made), we can take
|
|
|
for preserving of all pairwise distances, with probability at least over the draw for and , provided and .
The matrix may have i.i.d. standard Gaussian, or more generally, zero-mean unit-variance sub-gaussian entries.
This estimate for is an “improvement” over the target dimension as soon as or say a fractional power of ; “improvement” must be in quotes, as we have only guaranteed “most” distances are approximately preserved, not all.
For general , the is replaced by a , with the intrinsic dimension of the unit normalized random vector for an independent copy of .
We have always, and we may estimate it using corollary 4.2.1.
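The preview above is easy to probe empirically. The following Monte Carlo sketch (our own parameters; the heavy-tailed Cauchy coordinates are chosen precisely because they have no moments) measures the fraction of pairwise squared distances preserved within a tolerance for a range of small target dimensions, with the preserved fraction climbing toward one well before the worst-case all-pairs bound is reached.
\begin{verbatim}
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)

def preserved_fraction(X, k, eps):
    """Fraction of pairwise squared distances preserved within a (1 +/- eps) factor."""
    Pi = rng.normal(size=(k, X.shape[1])) / np.sqrt(k)
    d2 = pdist(X, "sqeuclidean")            # original squared distances
    e2 = pdist(X @ Pi.T, "sqeuclidean")     # squared distances after projection
    return np.mean((e2 >= (1 - eps) * d2) & (e2 <= (1 + eps) * d2))

X = rng.standard_t(df=1, size=(500, 2000))  # i.i.d. Cauchy coordinates: no moments
for k in (10, 20, 40, 80):
    print(k, preserved_fraction(X, k, eps=0.25))
\end{verbatim}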
Both of our main results – theorems 2.0.6 for general datasets and theorem 4.1.5 for i.i.d. samples – rely on a dual viewpoint for the dimension reduction problem, namely, instead of asking how transforms the data , we ask how transforms the test matrix ; we can then exploit known properties of how general matrices act on or its columns.
Standard results like the Hanson-Wright inequality may be viewed in this light, and we indeed do so in this paper.
Treating as a test matrix with known properties has been done previously in randomized numerical linear algebra [HMT10]; however, unlike there, slow decay in singular values is actually a good case for us.
The rest of the paper is organized as follows.
The main argument allowing us to quantify how many distances can be preserved is in section 2 and leads to our first theorem 2.0.6, which, by itself, only suggests smaller target dimensions may be possible by considering “small” batches of difference vectors.
We then recall the Walecki construction in section 3, which gives us a way to choose these batches that works well for i.i.d. samples as well as the standard simplex.
Section 4 then presents the rest of our results specializing to data sampled i.i.d. from a given distribution, which may just be a larger dataset.
This section leads up to our second main theorem 4.1.5 allowing us to make depend on the intrinsic dimension mentioned above.
We also show how to estimate from the data.
The appendix contains proofs for the properties of that we use in the paper, namely, a variant of the Hanson-Wright inequality, and a corresponding independent proof in the Gaussian case in order to have decent explicit constants for .
1.1 Notation
Suppose $X = \{x_1, \ldots, x_n\} \subset \mathbb{R}^D$ is a point set of size $n$.
Given an arbitrary ordering of the points of $X$,
let $\mathcal{V}$ be the set of difference vectors
\[
\mathcal{V} = \{\, x_i - x_j \;:\; 1 \le j < i \le n \,\}
\]
and $\mathcal{U}$ be their unit normalized versions
\[
\mathcal{U} = \Big\{\, \tfrac{x_i - x_j}{\|x_i - x_j\|_2} \;:\; 1 \le j < i \le n \,\Big\}.
\]
The number of elements of a set $S$ is denoted by $|S|$.
We write $\langle x, y \rangle = x^\top y$ for the usual Euclidean inner product, with $\|x\|_2^2 = \langle x, x \rangle$.
We also denote by $0$ the origin in any $\mathbb{R}^d$.
Let $A$ be an $m \times n$ matrix, which we write as $A = [a_1 \mid \cdots \mid a_n]$. From [GVL13, page 76], the singular value decomposition (SVD) for $A$ is the factorization $A = U \Sigma V^\top$ with $U$ and $V$ orthogonal
and
\[
\Sigma = \operatorname{diag}(\sigma_1, \ldots, \sigma_{\min(m,n)}), \qquad \sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_{\min(m,n)} \ge 0,
\]
for the singular values $\sigma_i(A)$.
Let $\sigma(A) = (\sigma_1(A), \ldots, \sigma_{\min(m,n)}(A))$, viewed as a vector in $\mathbb{R}^{\min(m,n)}$.
We write $\|A\|_2$ for the operator norm of $A$, while $\|A\|_F$ is the Frobenius (or Hilbert-Schmidt) norm of $A$.
We may also compute $\|A\|_F^2$ via
\[
\|A\|_F^2 = \sum_i \|a_i\|_2^2,
\]
where the $a_i$ may be taken as the rows or the columns of $A$.
Vectors are treated as column vectors unless otherwise indicated.
The -stable rank of a matrix is the ratio
|
|
|
There are other variants of stable rank [NY17], but only the -stable rank will make an appearance in this paper:
|
|
|
So always, and both of these are at most , the rank of .
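For concreteness, the Frobenius-over-operator ratio is straightforward to compute; the helper below is ours and is meant only to fix ideas, with the printout confirming that this stable rank never exceeds the usual rank.
\begin{verbatim}
import numpy as np

def stable_rank(A):
    """Frobenius-over-operator stable rank ||A||_F^2 / ||A||_2^2."""
    return np.sum(A ** 2) / np.linalg.norm(A, 2) ** 2   # ord=2 gives the top singular value

rng = np.random.default_rng(2)
A = rng.normal(size=(100, 50))
print(stable_rank(A), np.linalg.matrix_rank(A))   # stable rank <= rank <= min(m, n)
\end{verbatim}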
We make use of the $\psi_2$-norm and $\psi_1$-norm, defined for a random variable $Y$ as
\[
\|Y\|_{\psi_2} = \inf\{\, t > 0 : \mathbb{E}\exp(Y^2/t^2) \le 2 \,\}
\qquad \text{and} \qquad
\|Y\|_{\psi_1} = \inf\{\, t > 0 : \mathbb{E}\exp(|Y|/t) \le 2 \,\}.
\]
See [Ver18, section 2.5 and 2.7].
1.1.1 Isotropic Random Vectors and the Intrinsic Dimension
A particular condition on the second moment matrix of a random vector will be useful in this paper.
Definition 1.1.1.
A random vector $Z$ in $\mathbb{R}^D$ is isotropic if $\mathbb{E}[Z Z^\top] = I_D$.
A working example is any vector with i.i.d. zero-mean unit-variance coordinates.
Basic properties of isotropic random vectors include $\mathbb{E}\|Z\|_2^2 = D$ via the cyclic property of the trace, and that isotropy is equivalent to $\mathbb{E}\langle Z, x \rangle^2 = \|x\|_2^2$ for any $x \in \mathbb{R}^D$. See [Ver18, pages 43-45] for more on such vectors.
We can assign a version of -stable rank to the distribution of a random vector via the intrinsic dimension.
Definition 1.1.2.
The intrinsic dimension of a random vector $Z$ in $\mathbb{R}^D$ is the ratio
\[
d(Z) = \frac{\mathbb{E}\|Z\|_2^2}{\|\mathbb{E}[Z Z^\top]\|_2}
     = \frac{\operatorname{tr}\big(\mathbb{E}[Z Z^\top]\big)}{\|\mathbb{E}[Z Z^\top]\|_2}.
\]
Like the stable ranks, the intrinsic dimension of a random vector in $\mathbb{R}^D$ is bounded by the ambient dimension, $d(Z) \le D$,
so isotropic random vectors realize the highest possible intrinsic dimension.
In the literature, $d(Z)$ is sometimes called the effective rank of the second moment matrix $\mathbb{E}[Z Z^\top]$, and is the stable rank of the matrix $(\mathbb{E}[Z Z^\top])^{1/2}$.
Isotropic random vectors behave well under orthogonal projection.
Lemma 1.1.3.
Let $P$ be a $k \times D$ matrix with orthonormal rows and $Z \in \mathbb{R}^D$ an isotropic random vector.
Then $PZ$ is also isotropic.
Proof.
By linearity of expectation:
$\mathbb{E}[(PZ)(PZ)^\top] = P\,\mathbb{E}[Z Z^\top]\,P^\top = P I_D P^\top = I_k$.
∎
Note if $Z$ is scaled by a nonzero constant $\alpha$, the intrinsic dimension is unchanged: $d(\alpha Z) = d(Z)$.
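A plug-in estimate of the intrinsic dimension is easy to compute from samples; the sketch below (our names and sample sizes) estimates $d(Z)$ for an isotropic Gaussian vector, whose population value is exactly the ambient dimension, and confirms the scale invariance just noted. The plug-in estimate is biased slightly downward at finite sample sizes, because the top eigenvalue of the empirical second moment matrix overshoots.
\begin{verbatim}
import numpy as np

def intrinsic_dimension(Z):
    """Empirical intrinsic dimension E||Z||^2 / ||E Z Z^T||_2 from an (N, D) sample."""
    second_moment = Z.T @ Z / Z.shape[0]          # empirical second moment matrix
    numerator = np.mean(np.sum(Z ** 2, axis=1))   # empirical E||Z||^2, its trace
    return numerator / np.linalg.norm(second_moment, 2)

rng = np.random.default_rng(3)
Z = rng.normal(size=(20_000, 50))                 # isotropic, population value is D = 50
print(intrinsic_dimension(Z), intrinsic_dimension(3.7 * Z))   # equal: scale invariant
\end{verbatim}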
2 Controlling Order Statistics
In this paper, we only study dimension reduction matrices with isotropic rows, that is,
for all and all rows .
So, .
We say more about isotropic random vectors in section 1.1.1.
To guarantee equation (JL) holds for a difference vector , the usual proof for the Johnson-Lindenstrauss lemma considers each vector individually, providing upper bounds for the failure probabilities
|
|
|
and
|
|
|
implicitly working with the formulation
|
|
|
() |
which is often easier to manage.
In lemma 5.0.5 of the appendix, we show how to choose in line with the original equation (JL).
Suppose , as it will for this paper.
If the distribution of is sufficiently nice, sub-gaussian for example, then one may show for each fixed , with a constant depending on the distribution of .
By the union bound (Boole’s inequality), the probability that a random fails for some is at most
|
|
|
provided .
The probability there is some that succeeds for all is then at least , so at least one such exists.
The argument above considers the vectors one at a time, making sure succeeds for each ; if we consider the unit normalized vectors , we may view this argument as controlling the extreme order statistics of the random variables , induced by , namely
|
|
|
If we only wish to preserve a fraction of the distances, say with hopefully small, we can consider controlling the intermediate or central order statistics of the instead.
We do so as follows.
If we divide the difference vectors into batches of size , and preserve distances there, then we still recover
|
|
|
We assume is a strictly positive integer here, and for simplicity of discussion, we also assume divides ; we shall return to this point later.
Let be a given batch of size .
Each given matrix also induces an ordering on the points of : with ,
|
|
|
As is random, this ordering is too, treating as fixed.
If we could guarantee that , then all with also have this upper bound, with an analogous statement for a lower bound of on , altogether guaranteeing of the vectors of have
|
|
|
We need to control the probabilities
|
|
|
|
(2.1) |
and |
|
|
|
|
(2.2) |
We can recast control of using the following lemma.
Let , that is, the unit normalized version of .
Lemma 2.0.1.
Let be a random matrix. With the notation above, and running through all -sized subsets of
|
|
|
and
|
|
|
with a diagonal matrix with strictly positive entries, chosen for each subset .
Remark 2.0.2.
For simplicity, the rest of the paper will take as itself, or its unit normalized version ; though there may be some improvement gained by the freedom in choosing each .
Proof.
For , let .
If , then is part of a subset of size for which all the have norms too large.
For any given subset , consider the following probabilities, with chosen for each ,
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
The second and third lines follow by the linearity of .
Passing to the sum above allows an important change of viewpoint using the Frobenius norm, as each is a column of , with the appropriate diagonal matrix:
|
|
|
Now, there are subsets of of size , so a union bound over such subsets gives
|
|
|
|
|
|
|
|
The argument for is similar, noting that
forces vectors to have squared norms too small, so their corresponding sum is too small as well.
∎
We now make assumptions on the matrix , allowing us to make use of the stable rank of the minibatches .
Corollary 2.0.3.
With the notation as in lemma 2.0.1,
, and , if has i.i.d. standard Gaussian entries, then
|
|
|
One may replace by on the right hand side.
One point we want to stress even now is the presence of the product in the exponent.
If one wants a fixed failure probability, need not be as large when is sizable.
We shall give several examples in this paper where is large.
Proof.
The proof follows immediately from lemma 5.0.4 in the appendix, with , recalling for the case.
With and , we improve the denominator from 8 to 4 by setting
as in lemma 5.0.5.
∎
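The role of the product of the target dimension and the stable rank in the exponent can be seen in a small simulation (entirely illustrative; the batch matrices and parameters below are ours): with the target dimension $k$ held fixed, the upper-deviation probability of $\|\Pi B\|_F^2$ for a Gaussian $\Pi$ drops sharply as the stable rank of the batch $B$ grows.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(4)

def upper_tail(k, B, eps, trials=20_000):
    """Empirical P( ||Pi B||_F^2 > (1 + eps) k ||B||_F^2 ) for Gaussian Pi."""
    D, b = B.shape
    fro2 = np.sum(B ** 2)
    hits = 0
    for _ in range(trials):
        Pi = rng.normal(size=(k, D))
        hits += np.sum((Pi @ B) ** 2) > (1 + eps) * k * fro2
    return hits / trials

D, k, eps = 64, 10, 0.3
B_rank_one = np.outer(rng.normal(size=D), np.ones(8))   # eight identical columns: stable rank 1
B_orthonormal = np.linalg.qr(rng.normal(size=(D, 8)))[0] # orthonormal columns: stable rank 8
print(upper_tail(k, B_rank_one, eps), upper_tail(k, B_orthonormal, eps))
\end{verbatim}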
Consider the term in lemma 2.0.1, taking as the identity.
(The following discussion will also hold for other scalings, such as .)
Labeling the rows of as , we can interchange the sums implicit in to our advantage:
|
|
|
Because we assume the rows are isotropic, ,
and we have transformed the bound on to involve the probabilities
|
|
|
and is now viewed as a fixed matrix acting on the random vectors .
We may thus take a dual viewpoint on the dimension reduction problem: instead of considering how the matrix acts on each difference vector, we consider how the transposed batch of difference vectors acts on the matrix , as mentioned in the introduction.
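The change of viewpoint rests on two elementary identities, which the following sketch verifies numerically (the names $\Pi$ and $B$ are ours): the batch sum of squared image norms equals $\|\Pi B\|_F^2$, which in turn equals the sum over the rows $r_i$ of $\Pi$ of $\|B^\top r_i\|_2^2$.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(5)
k, D, b = 20, 300, 16
Pi = rng.normal(size=(k, D))        # dimension reduction matrix
B = rng.normal(size=(D, b))         # batch: columns play the role of difference vectors
B /= np.linalg.norm(B, axis=0)      # unit-normalize the columns

batch_sum = sum(np.sum((Pi @ B[:, j]) ** 2) for j in range(b))
frobenius = np.sum((Pi @ B) ** 2)                               # ||Pi B||_F^2
row_sum   = sum(np.sum((B.T @ Pi[i]) ** 2) for i in range(k))   # rows of Pi against B^T
print(np.allclose(batch_sum, frobenius), np.allclose(frobenius, row_sum))   # True True
\end{verbatim}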
If we take the to be independent, with i.i.d. mean-zero unit-variance sub-gaussian entries, we can use lemma 5.0.3 in the appendix to harness both the independence of the and the Hanson-Wright inequality:
Lemma 2.0.4.
With the notation as in lemma 2.0.1 and , let be a random matrix with i.i.d. mean-zero unit-variance sub-gaussian entries, then
|
|
|
with .
One may replace by on the right hand side.
Proof.
To bound the probabilities on the right hand side of lemma 2.0.1, just take in lemma 5.0.3.
∎
Remark 2.0.5.
Recall always, so if , we can write
|
|
|
We now can control the probabilities in equations (2.1) for a given batch of size .
The control is in terms of or for subsets of size .
Because the target dimension is a global parameter, it needs to be in terms of global quantities, but the stable ranks above vary over minibatches.
To make this transition and to help with bookkeeping, recall that a decomposition of a graph is a partition of its edges.
Let be a decomposition of the complete graph on vertices, into batches of size .
If does not divide , that is, with , we can expand several of the batches to with .
For those batches, need not be an integer, so take as
|
|
|
(2.3) |
and set when the batch size is not .
Note smaller values only help us ensure a total fraction of distances is preserved.
For this decomposition, let
|
|
|
be the minimum stable rank of the sized minibatches from such batches.
We then have our first theorem, written in terms of in the original equation (JL).
Theorem 2.0.6.
Let , with , and let .
For a set of points in , as above, and a matrix with i.i.d. mean-zero unit-variance sub-gaussian entries, equation (JL) holds
for at least pairs , with probability at least , provided
|
|
|
Here, and is an absolute constant.
In the case of standard Gaussian entries, one can replace and by
|
|
|
respectively.
When the distribution for is understood, we denote or by
, and likewise by .
To make sense of the above, suppose , and then recall as the stable rank is always bounded above by the ambient dimension.
If , for bounded away from 0, the term attached to becomes constant, while the remaining becomes constant as soon as
.
Note depends on the decomposition , which we are free to choose at will.
The remainder of the paper will address how to choose (and the batch size ).
Proof.
We treat the upper and lower bounds on separately.
If
|
|
|
(2.4) |
then with probability at least none of the events defining hold, over all the batches of the decomposition.
So by equation (2.1),
with probability at least , we preserve at least (recalling when needed)
|
|
|
of the squared distances within a tolerance, and another of the squared distances within a tolerance.
By the pigeonhole principle, at least of the distances are approximately preserved on both sides.
It remains to choose to ensure equation (2.4) holds.
We always have, by lemma 5.0.6,
|
|
|
while ,
so by lemma 2.0.4
|
|
|
There are at most batches in the decomposition, expanding several of the batches to absorb any remainder if necessary, so we need to satisfy
|
|
|
Using lemma 5.0.5 with , we achieve the desired bound for in terms of by taking
|
|
|
To replace in the Gaussian case, use corollary 2.0.3 with lemma 5.0.5 on to find
|
|
|
From lemma 2.0.1 we are free to scale vectors in each batch to have unit norm.
So we can also define for a given decomposition
|
|
|
This normalization allows us to use linear algebra to control deterministically, for any of the -sized subsets of , once we have control of .
When the underlying data is random, we can thus avoid the need to take union bounds over the minibatches.
Lemma 2.0.7.
Let be a matrix with columns of constant norm.
If is a submatrix of , then
|
|
|
In particular, if , then .
To be useful, we need here.
Proof.
To control , partition the matrix as with a matrix.
Viewing the unit sphere as in ,
|
|
|
That is, .
Note for any nonzero scalar , so we may assume the columns of all have norm 1.
Under this assumption,
|
|
|
Recalling always completes the proof.
∎
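A quick numerical check of the lemma, in our own notation: for a matrix $B$ with unit-norm columns, any sub-batch $B_S$ of $b$ columns has operator norm at most $\|B\|_2$, and hence stable rank at least $b / \|B\|_2^2$.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(6)
D, m, b = 400, 60, 12
B = rng.normal(size=(D, m))
B /= np.linalg.norm(B, axis=0)              # unit-norm columns, so ||B||_F^2 = m

S = rng.choice(m, size=b, replace=False)    # a b-sized sub-batch of columns
B_S = B[:, S]

op_full, op_sub = np.linalg.norm(B, 2), np.linalg.norm(B_S, 2)
srank_sub = np.sum(B_S ** 2) / op_sub ** 2  # equals b / ||B_S||_2^2 here
print(op_sub <= op_full, srank_sub >= b / op_full ** 2)   # True True
\end{verbatim}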
If we set
|
|
|
Theorem 2.0.8.
Let , with , and .
For a set of points in , as above, and a matrix with i.i.d. mean-zero unit-variance sub-gaussian entries, equation () holds
for at least pairs , with probability at least , provided
|
|
|
Here, depends on .
In the case of independent standard Gaussian entries, .
Proof.
In the proof of theorem 2.0.6, we shall replace lemma 2.0.4 by
|
|
|
By lemma 2.0.7, , no matter the subset chosen.
We can then upper bound the right hand side of lemma 2.0.4, recalling .
∎
The choice of decomposition matters, yielding very different values, even with fixed, as seen in the following.
Example 2.1.
The difference vectors for (the vertices of) the standard simplex in are , with .
Suppose the decomposition involved “stars” made by subsets for .
Using these difference vectors as rows, the corresponding matrix looks like (relabeling as necessary)
|
|
|
with .
If , we have , so .
Because is bounded by a constant, this decomposition is of no use in theorem 2.0.8.
If we instead consider a decomposition involving “cycles” of length , that is, subsets
|
|
|
then the corresponding difference vectors form a circulant matrix
|
|
|
Viewing as a complex matrix, we can diagonalize it using the discrete Fourier transform matrix as . See [GVL13, page 222].
Here,
|
|
|
The squares of the singular values of are then the eigenvalues of , that is,
|
|
|
with equality achieved when is even at .
Because the cycle has length here, we have , for each such cycle.
If such a decomposition involved only such cycles, we could conclude (because the vectors have constant norm) , which would be very useful for theorem 2.0.8.
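The circulant computation above is easy to reproduce numerically; the sketch below (our code) builds the matrix of difference vectors along a cycle through simplex vertices and prints its stable rank, which is roughly half the cycle length, as claimed.
\begin{verbatim}
import numpy as np

def cycle_difference_matrix(m):
    """Rows are e_i - e_{i+1} (indices mod m) for simplex vertices e_0, ..., e_{m-1}."""
    return np.eye(m) - np.roll(np.eye(m), 1, axis=1)

for m in (8, 9, 64, 65):
    B = cycle_difference_matrix(m)
    fro2 = np.sum(B ** 2)              # = 2m, since each row has norm sqrt(2)
    op2 = np.linalg.norm(B, 2) ** 2    # = 4 sin^2(pi * floor(m/2) / m) <= 4
    print(m, fro2 / op2)               # stable rank, roughly m / 2
\end{verbatim}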
We review in the next section a construction originally due to Walecki that provides such cycle decompositions as above when is odd, and the next best thing when is even.
The construction will also be useful when our data set is drawn i.i.d.; in particular, we can address cases where the minimal stable ranks or are too pessimistic, but “most” batches have larger stable ranks.
3 The Walecki Construction
The sets and describe directions in space.
Even if the data generating these directions is sampled independently, the directions themselves are not independent; for example, is not independent of , while and are, assuming , and are drawn independently.
However, we are partitioning the directions into batches.
If we can guarantee that in each batch, the directions are independent or only “weakly” dependent, we can exploit these properties, ensuring the stable ranks of many batches are bounded below by given values.
Viewing or as corresponding to the edges of the complete graph on vertices, we are asking for a partition of the edges, a decomposition, such that each subset involves each vertex once, or at most twice (say).
Thankfully, Walecki provided such a construction in the 1880s;
see [Luc82, page 161, Sixième Récréation] for the original French and [Als08] for an English overview.
We use a different indexing scheme from the one presented in [Als08], as it is easier for us to work with.
We can arrange vertices in the complex plane as follows: vertices corresponding to the st roots of unity, and the remaining vertex at the origin .
Let .
We can thus label the nonzero vertices as .
We partition their corresponding edges based on their products , or more formally:
|
|
|
The sets represent 1-regular subgraphs of : each vertex has degree one, for if and are in , then
|
|
|
forcing because .
There are only vertices on the circle, so there are at most edges in .
The cyclic group generated by acts freely on the vertices of the circle via counterclockwise rotations, so also acts on the sets via
|
|
|
corresponding to the product .
Consequently, it is enough to discuss and .
With , the edges of are of the form , while those of are of the form .
To each edge in , there is a corresponding edge in , namely , so and includes the edge .
When is odd, the only nonzero vertex on the real line is 1; when is even, both vertices and are present.
Consequently, is determined by the number of vertices strictly in the upper half plane:
|
|
|
When is odd,
recall because is 1-regular, so by the above, , with avoiding the vertex , as its leftmost edge is
.
Form the augmented sets and ; each is a 1-factor, as it is 1-regular and spans all vertices.
We can thus form a cycle using and : explicitly, in cycle notation,
|
|
|
which has length as it should for covering all the vertices.
When is even, , so can contain at most one additional edge; it does, via with . If , there are at least two edges in , so
split into and , corresponding to those vertices with nonnegative and strictly negative real parts respectively.
When is even, that is, , both are of the same size , while in the other case, we have and .
Form the augmented sets and ; these sets are 1-regular, of sizes
|
|
|
while when , that is, when is odd,
|
|
|
and
|
|
|
The sets now form the cycle
|
|
|
which has length , again as it should.
Extending the group action to send the origin to itself,
we can thus form cycles of length using the above, recalling there are different sets.
Explicitly,
|
|
|
(3.1) |
When is even, all edges are covered, while when is odd, still remains, but is still a 1-factor.
We thus have, considering the parity of instead of ,
Lemma 3.0.1 (Walecki Construction).
The complete graph has a decomposition into cycles of length when is odd, and cycles of length and a 1-factor when is even.
For reference later, we also record
Corollary 3.0.2.
Consider the cycles in lemma 3.0.1, corresponding to equation (3.1).
When is even, these cycles split as a pair of 1-factors of size .
When is odd, the cycles split as three 1-regular subgraphs when .
When , their sizes are
|
|
|
while when , their sizes are
|
|
|
Returning to the sets of difference vectors with a set of points, we can assign each to a unique 1-regular subgraph by the correspondence when and for .
Most useful for us is the following lemma; note we shall be considering batches of size (at least) drawn from within these subgraphs.
Lemma 3.0.3.
Let be a set of points drawn i.i.d. from a common distribution.
With the correspondence above, the vectors within each subgraph from corollary 3.0.2 and lemma 3.0.1 are independent, as are their unit norm versions.
When is even, there are subgraphs involved.
When is odd, there are subgraphs involved.
Proof.
The subgraphs are 1-regular, so each vertex corresponding to a point of appears only once; independence follows as no two distinct edges share a common vertex.
In the unit norm case, we are applying the same function to independent vectors, so the results are independent too.
The rest of the lemma is immediate.
∎
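For readers who want to experiment, the classical round-robin (circle) pairing, which is closely related to Walecki's construction, already produces 1-regular subgraphs of the kind used above; the sketch below is ours and does not reproduce the exact indexing of this section, but it does partition the edges of the complete graph so that, for i.i.d. data, the difference vectors within each subgraph involve disjoint pairs of points and are therefore independent.
\begin{verbatim}
import numpy as np

def round_robin_matchings(n):
    """Partition the edges of K_n into 1-regular subgraphs.

    For even n this is a 1-factorization into n - 1 perfect matchings; for odd n we
    run the even case on n + 1 vertices and drop the phantom vertex, leaving
    near-1-factors that each miss exactly one vertex.
    """
    phantom = (n % 2 == 1)
    m = n + 1 if phantom else n
    matchings = []
    for r in range(m - 1):
        pairs = [(r, m - 1)]
        pairs += [((r + i) % (m - 1), (r - i) % (m - 1)) for i in range(1, m // 2)]
        if phantom:
            pairs = [p for p in pairs if m - 1 not in p]   # drop edges to the phantom
        matchings.append(pairs)
    return matchings

# Sanity check: every edge of K_n appears exactly once across the matchings.
n = 9
edges = [tuple(sorted(p)) for M in round_robin_matchings(n) for p in M]
assert len(edges) == len(set(edges)) == n * (n - 1) // 2
\end{verbatim}
Each such subgraph can then be cut into batches of the desired size, as in the discussion of section 4.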
Theorem 3.0.4.
For the standard simplex in , it suffices to take
|
|
|
for theorem 2.0.8.
Note is bounded independently of as soon as .
Proof.
Taking in theorem 2.0.8, we can use the Walecki construction 3.0.1 as is to control .
When is odd, the computations from example 2.1 show , while when is even, the 1-factor from the Walecki construction is of size , with mutually orthogonal vectors of constant norm, so its stable rank is equal to its rank, .
Thus for both parities of .
∎
4 Further Theorems for i.i.d. Samples
Let be a given random vector.
In this section, the following assumption 4.1 will be in play for the dimension reduction matrix and the dataset of points in .
The theorems will then be in terms of additional assumptions on the distribution of .
Assumption 4.1.
The matrix has i.i.d. zero-mean unit-variance sub-gaussian entries.
The dataset is drawn independently of , with .
In practice, datasets need not be drawn i.i.d. from some underlying distribution; however, if the number of points is very large, it may be useful or convenient to subsample the data in order to fit it in memory or to try to estimate properties of the data.
The theorems in this section may then be read as applying to an i.i.d. (sub)sample of the data, that is, drawing points uniformly with replacement from the underlying dataset, redefining to be the new sample size, and redefining to be drawn from the discrete uniform measure on the underlying data.
With the Walecki construction in hand and assumption 4.1, we can give refinements of theorems 2.0.6 and 2.0.8.
The theorem statements will have failure probabilities in terms of the draw of the pair .
(There should be no confusion with the dot product notation.)
Both of the theorems just mentioned give a failure probability for once is fixed, and this probability only depends on stable rank properties of the data set , or more accurately, the difference vector set .
These theorems in turn rely on lemma 2.0.4, which bounds the probabilities in equation (2.1) for acting on a given batch or its unit norm version in terms of stable ranks of minibatches .
In many of the examples that follow, the assumptions on the data only guarantee or is above some threshold most of the time, say for a fraction of all batches, instead of all of the time.
We can use lemma 2.0.4 on those “good” batches, and still conclude that of all distances are approximately preserved with high probability.
To be concrete, let be the event that is approximately preserved for of the vectors –
provided the batches considered have , for some threshold .
Let be the event that for at least of the batches considered in the decomposition .
We then have
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
In the rest of this section, we use the 1-regular subgraph version of the Walecki construction, lemma 3.0.3
to define the decomposition and the underlying batches .
Specifically, each 1-regular subgraph is at least of size , and we partition the subgraph edges into batches of size – each edge corresponding to a difference vector.
Under assumption 4.1, the edges within each subgraph are exchangeable, so they can be assigned to batches in an arbitrary manner as long as the batch size is respected.
For any remainder when does not divide the subgraph size ,
we can modify the definition of from equation (2.3) to apply with in place of ; any of the expanded batches are still of size at most .
Note when is odd, the subgraphs are not all of the same size, so the will vary accordingly.
We first present one case where is 0, that is, we are able to control for every minibatch, with high probability.
Theorem 4.0.1.
Under assumption 4.1, equation (JL) holds
for at least pairs , with probability at least over the draw of , provided
|
|
|
when is a strictly positive integer, with
|
|
|
To make some sense of the above, note that if , it suffices to take and
|
|
|
recalling too.
This target dimension is roughly the same as for the simplex in theorem 3.0.4 with there, despite the very different sparsity properties of these two datasets.
Proof.
With defined as in the theorem statement and an independent draw of , the difference vector set is drawn i.i.d. from , so it is immediately mean-zero with i.i.d. coordinates.
Because the stable ranks do not see constant scaling, we can work with , so that now has unit variance coordinates.
The sub-gaussian norm for a coordinate of is at most by the triangle inequality.
We control in theorem 2.0.6 by showing each batch has high stable rank.
Because , we shall control numerator and denominator separately.
We have .
Further, Bernstein’s inequality for the mean-zero sub-exponential random variables yields [Ver18, page 34]
|
|
|
with .
For us with , the above gives
|
|
|
To lock down , recall for constants , so
|
|
|
Recall lemma 2.0.7 which allowed us to state always, because the columns of had constant norm.
We can give a high probability replacement for that lemma in this context, but require that it must hold over all batches , that is,
|
|
|
|
|
|
|
|
|
|
|
|
Taking into account the cases, as mentioned before this proof, we require
|
|
|
For ambient dimensions at least this large,
every single batch satisfies
|
|
|
with total failure probability at most .
We now only need to control instead of , and we can use the known result [Ver18, page 85] for matrices with mean-zero independent sub-gaussian entries – recalling the -norm of each entry is now , as mentioned above –
|
|
|
with probability at least .
It is proved via an -net argument.
For us, with , the above gives
|
|
|
with probability at most .
Consequently,
|
|
|
(4.1) |
over all minibatches from all batches with total failure probability at most
|
|
|
So we may take the right hand side of equation (4.1) as our value.
Plugging into theorem 2.0.6 yields
|
|
|
Setting completes the proof.
∎
4.1 Controlling “Most” Batches
Unlike in theorem 4.0.1 above, in general the dataset need not be so well-behaved, to the point that controlling is not possible across all minibatches.
To make things more manageable, we shall now work with theorem 2.0.8, using the unit normalized batches , working on batches of size at least instead of .
For general random vectors , we shall not be able to guarantee that has high stable rank, but we shall guarantee that a fraction of the batches have stable rank comparable to a “typical” value associated with .
The columns of have constant unit norm, so it is enough to control in order to control .
It will be convenient to introduce some new notation for this purpose; we present it first in its unnormalized version.
Because we are still using the Walecki construction via lemma 3.0.3 to define the batches,
the columns in are independent, each drawn like the given random vector , with an independent copy of .
The corresponding second moment matrix
is twice the covariance matrix for , for if ,
|
|
|
|
Consequently, is the effective rank of , as the factors of 2 will cancel in the ratio.
Now may be approximated by its empirical version
|
|
|
The unit normalized versions will depend on , with , , and
.
The connection between and is not so immediate, but we shall return to this point in section 4.1.1.
The following is implicit in [Ver18, section 5.6];
we include the proof here, as we want explicit constants, and we plan to take , applying it to .
Controlling the deviation will be the way we eventually show “most” of the time.
Lemma 4.1.1.
Let be as above, with the , and assume for some , that
|
|
|
Then, with failure probability at most ,
|
|
|
Proof.
Because the matrices are symmetric, i.i.d., and mean-zero, we can use the matrix Bernstein inequality: for ,
|
|
|
with and .
We want the right hand side to be at most , that is,
|
|
|
The positive root is at
|
|
|
Because the other root of the quadratic is negative, taking any will suffice for us.
Because the square root function is subadditive, it is safe to take .
We now just need to estimate and .
Estimate via
|
|
|
|
|
|
|
|
For , the i.i.d. assumption already gives
.
Expanding the square,
|
|
|
while .
Taking expectations on with gives
|
|
|
Taking the maximum over gives
|
|
|
as is negative semidefinite.
Recalling our choice of and that , we find
|
|
|
|
|
|
|
|
|
|
|
|
With this value, we finally have
|
|
|
If we used the unit normalized , lemma 4.1.1 provides the following upper tail probability
|
|
|
(4.2) |
with a function of .
By the triangle inequality, we should immediately have
|
|
|
so that
with failure probability .
This tail probability is too weak to control for every batch with high probability, because will need to be too large to be useful, so we allow a fraction of the batches to fail.
We consider order statistics of real-valued functions of across batches, namely –
because the batches within each 1-regular subgraph are independent, we can
exploit that independence to inform the choice for using the following lemma.
Lemma 4.1.2.
Let be i.i.d. random variables.
Let with an integer.
If
|
|
|
Proof.
First note
|
|
|
that is, we are looking for the top of the to be larger than .
Because the are drawn i.i.d., all their sized subsets are equally likely, so we may conclude
|
|
|
|
|
|
|
|
using the independence of the in the last line.
∎
In lemma 4.1.2, we shall take to be the number of batches in a given subgraph, so that the random variables in question are independent.
The subgraphs from lemma 3.0.3
are of size at least and at most , but are not all of the same size when is odd.
If is one of these subgraphs, it contains batches, as we have enforced each batch to have size at least , and reassigned any remainder to those batches.
When is even, , and we only need to require .
When is odd, we adjust depending on the size of the subgraph .
Suppose we enforce to be an integer, recalling in all cases.
For any other subgraph of larger size, set
|
|
|
(4.3) |
In particular for all the subgraph sizes, if we assume , as .
Thus in lemma 4.1.2, we replace by when we consider the different subgraphs.
4.1.1 Retraction to the Sphere
Suppose does not have a second moment, that is, is undefined. As mentioned before, we can use lemma 4.1.1 on the unit normalized batches instead, provided we replace by and by .
By Cauchy-Schwarz, always: for any unit vector ,
|
|
|
A first question to ask is: in the presence of a second moment, how are the operator norms of and related?
The following lemma gives one such answer.
Lemma 4.1.3.
Let be a random vector in , with second moment matrix . If , then with , for any ,
|
|
|
Further, .
Remark 4.1.4.
Some dependence on the small ball (if ) or lower deviation (if ) probability
must be present, in general, due to the nature of the retraction to the sphere.
Specifically, suppose is distributed uniformly at random from a finite set in , having a cluster of points all pointing in roughly the direction, but of very small norm.
If all the other points are distributed uniformly on the unit sphere, one expects to be roughly 1/D, as the cluster points will not contribute much to the operator norm.
However, after the retraction, these cluster points now all have unit norm and are still pointing in roughly the direction, so if there are enough points in the cluster, could now be much closer to 1.
Proof.
The matrices and are symmetric positive semi-definite,
so their singular values correspond to their eigenvalues.
These eigenvalues involve the quadratic form or .
The latter quadratic form is just , which we may split as
|
|
|
For the first term,
|
|
|
When is a unit vector, always, so for such ,
|
|
|
|
|
|
|
|
Now is just the maximum of over the unit sphere, and the maximum is realized by some as the sphere is compact.
Consequently,
|
|
|
|
|
|
|
|
Recalling and the definition of and finishes the proof.
∎
Apart from assumption 4.1, we make no other assumptions on or in the following theorem.
Theorem 4.1.5.
Let , , , and .
Under assumption 4.1, equation (JL) holds
for at least pairs , with probability at least over the draw of , provided
|
|
|
when the quantities and are strictly positive integers, with
|
|
|
Remark 4.1.6.
To make sense of the above, consider as if from the original Johnson-Lindenstrauss lemmas.
We always have , so vanishes fairly quickly with increasing when is fixed or even slowly decaying.
Compared to theorem 4.0.1, we have an extra factor against , but it is not too big in that , and does not grow quickly with decreasing .
When is large, we can make small enough that the fraction is not that much worse than .
Finally, in light of lemma 4.1.3, we can replace by in the definition of at the expense of a bounded prefactor for provided the lower deviation or small ball probability is less than .
Proof.
From lemma 3.0.3, we have either or 1-regular subgraphs to consider when , and we choose (unit normalized) batches of size at least from these subgraphs.
Let .
Within each subgraph, the batches are independent, allowing us to use lemma 4.1.2 on the random variables
|
|
|
Take for that lemma and recall the discussion from equation (4.3).
We have to choose so that a union bound over all the subgraphs is still smaller than , so a safe value for would be
|
|
|
|
(4.4) |
|
|
|
|
(4.5) |
With this in hand,
we can apply lemma 4.1.1 with , for
|
|
|
|
|
|
|
|
Because we are interested in
, we should like to make the error term manageable, so choose
|
|
|
Because already depends on through in equation (4.5), there is a constraint on and that we need to address.
Write with .
Set to satisfy .
We then have
|
|
|
|
|
|
|
|
We can divide by provided .
|
|
|
Recalling , if we also require ,
it would be safe to require
|
|
|
We then have
|
|
|
and because the maps and are strictly increasing, a valid lower bound for is
|
|
|
Taking as this lower bound yields
|
|
|
With this choice of in hand, with probability at least ,
|
|
|
for at least of all batches .
Assuming this bound holds, we now ask that when is drawn, equation (JL) holds for all the vectors involved in at least these batches, with failure probability at most .
We run the argument of theorem 2.0.8, only for these batches , using in place of .
As we could still have “good” batches, we still must allow for all of them when we compute the union bound.
The ratio in theorem 2.0.8 now just becomes
|
|
|
Let be the coefficient of in the above.
We may then set as
|
|
|
(4.6) |
using in the term.
The choice of follows from theorem 2.0.8.
∎
In certain cases, we know exactly without relying on lemma 4.1.3.
Lemma 4.1.7.
Suppose is a random vector with i.i.d. coordinates, and is an independent copy of .
If , then the scaled unit vector
is mean-zero isotropic.
There are no moment assumptions on the coordinates here, so the lemma even applies to vectors with i.i.d. standard Cauchy coordinates.
Proof.
Both properties rely on the following observation.
For a fixed coordinate , the coordinate is a symmetric random variable: and have the same distribution.
Consequently, for any odd function , (using the symmetry in the 2nd equality)
|
|
|
so that for such odd functions .
If we temporarily freeze the other coordinates, the th coordinate of the unit vector is just
|
|
|
an odd function of , forcing to be mean-zero.
To check is isotropic, we must show the matrix
is the identity .
Because the are identically distributed,
|
|
|
so the diagonal entries of are all equal to .
Further, when , the independence of the now give
|
|
|
because the conditional expectation vanishes on the odd function of .
∎
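The lemma is easy to test by simulation even in the Cauchy case, where no moments exist; the following sketch (our sample sizes) checks that the empirical second moment matrix of the unit-normalized difference, rescaled by the dimension, is close to the identity.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(7)
D, N = 20, 200_000

Y  = rng.standard_cauchy(size=(N, D))    # i.i.d. coordinates with no moments at all
Yp = rng.standard_cauchy(size=(N, D))    # an independent copy
W  = Y - Yp
W /= np.linalg.norm(W, axis=1, keepdims=True)       # unit-normalized differences

second_moment = D * (W.T @ W) / N                    # should be close to the identity
print(np.max(np.abs(second_moment - np.eye(D))))     # small compared to 1
\end{verbatim}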
We now have an immediate corollary to theorem 4.1.5, which again we may compare to theorem 4.0.1.
Corollary 4.1.8.
In the setting of theorem 4.1.5, suppose has i.i.d. coordinates.
Then the corresponding conclusion still holds, with , namely it suffices to take and
|
|
|
Proof.
By lemma 4.1.7, the difference vector yields the isotropic vector .
Because , we compute
, as the intrinsic dimension does not see constant scalings.
We can then apply theorem 4.1.5.
∎
Remark 4.1.9.
The proof only uses that is isotropic.
By lemma 1.1.3, is still isotropic when has orthonormal rows.
Because equation (JL) is 1-homogeneous, the corollary still holds with replaced by , in particular when has a fast transform method available.
4.2 Estimating
The intrinsic dimension of , namely , enters into theorem 4.1.5 only as a parameter, so we are free to estimate it separately.
In particular,
Corollary 4.2.1.
Theorem 4.1.5 holds with replaced by
an empirical estimate using a batch of size , namely,
with failure probability at most ,
|
|
|
If , then computing will cost time polynomial in and ; however, because , we do not need very high accuracy when computing this top eigenvalue.
If we were working with drawn uniformly with replacement from a larger dataset, we should draw the first data points, and then sequentially pair them off for unit difference vectors to yield .
The uniform with replacement assumption assures that these data points used are as good as any other subset, even if some of the data points turn out to be copies of the same point in the larger dataset.
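In code, the estimator amounts to pairing off an i.i.d. sample, unit-normalizing the differences, and inverting the top eigenvalue of their empirical second moment matrix; the sketch below is a minimal version under our own naming, and a few steps of the power method would suffice in place of the full norm computation, since high accuracy in the top eigenvalue is not needed.
\begin{verbatim}
import numpy as np

def estimate_intrinsic_dimension(X, b, rng):
    """Plug-in intrinsic dimension of the unit-normalized difference vector,
    using b independent differences built from 2b points drawn with replacement."""
    idx = rng.choice(X.shape[0], size=2 * b, replace=True)
    U = X[idx[:b]] - X[idx[b:]]                      # pair the sample off sequentially
    U = U[np.linalg.norm(U, axis=1) > 0]             # drop any zero differences
    U /= np.linalg.norm(U, axis=1, keepdims=True)    # rows are now i.i.d. unit vectors
    second_moment = U.T @ U / U.shape[0]
    # After normalization E||u||^2 = 1, so only the top eigenvalue is needed.
    return 1.0 / np.linalg.norm(second_moment, 2)

rng = np.random.default_rng(8)
X = rng.normal(size=(10_000, 300)) @ rng.normal(size=(300, 300))   # correlated data
print(estimate_intrinsic_dimension(X, b=2000, rng=rng))
\end{verbatim}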
Proof.
For , we can find a batch of size with independent columns, by corollary 3.0.2 and lemma 3.0.3.
By lemma 4.1.1, we have,
with failure probability at most ,
|
|
|
By the triangle inequality, we then have
|
|
|
and
|
|
|
Consequently, as ,
|
|
|
Recalling always, we choose and so that
|
|
|
Acknowledgements
This research was performed while the author held an NRC Research Associateship award at the U.S. Air Force Research Laboratory.
I should like to thank Mary, Saint Joseph, and the Holy Trinity for helping me with this work.
Disclaimers
The views expressed are those of the author and do not reflect the official guidance or position of the United States Government, the Department of Defense, or of the United States Air Force.
Statement from the DoD: The appearance of external hyperlinks does not constitute endorsement by the United States Department of Defense (DoD) of the linked websites, or the information, products, or services contained therein. The DoD does not exercise any editorial, security, or other control over the information you may find at these locations.
5 Appendix
The following lemma shows how to modify the proof of the Hanson-Wright inequality from [RV13] (cf. [Ver18, chapter 6]) to a “bulk” version, looking at the sum of several i.i.d. quadratic forms.
Note here is in the main part of the paper.
Let be a matrix with entries drawn i.i.d. from a mean-zero unit-variance sub-gaussian distribution.
Write for the th column of .
Let be a matrix and
|
|
|
Note, with ,
|
|
|
(5.1) |
using the mean-zero and independence assumptions for the coordinates of , that is, for the when varies.
Lemma 5.0.1.
Let , , and be as above.
Then for all and either sign choice,
|
|
|
with an absolute constant (not depending on ) and .
Remark 5.0.2.
The key point is the additional factor of in the exponential, compared to the usual Hanson-Wright inequality where .
Here,
|
|
|
so in particular , and .
If we prove the result for , that is bound
|
|
|
then taking will give us the bound for the original .
Proof.
By equation 5.1,
|
|
|
By the union bound (Boole’s inequality), we can bound the probability
|
|
|
for if and , then .
We can now use the i.i.d. assumption for the columns, that is, for the when varies,
|
|
|
|
and
|
|
|
|
The terms and are the starting points for establishing a proof of the Hanson-Wright inequality [Ver18, page 133]; the former is for using Bernstein’s inequality, while the latter uses decoupling and comparison to the case when is a standard Gaussian random vector.
Consequently, we can use the bounds for and , which both are given by
|
|
|
with an absolute constant (not depending on the distribution of , as we already rescaled to have unit -norm entries).
Recalling the th powers, we finally have
|
|
|
Lemma 5.0.3.
Let be a random matrix with i.i.d. mean-zero unit-variance sub-gaussian entries.
Then for a matrix with columns in ,
|
|
|
with .
Proof.
We use lemma 5.0.1, with , and , for then the rows of are written as column vectors, so that
|
|
|
Using , we recover
|
|
|
Because
|
|
|
the choice yields the result.
∎
If the reader would prefer explicit constants, the following lemma may be convenient, and gives an alternative proof for lemma 5.0.3 in the Gaussian case, relying on the explicit moment generating function for the Gaussian distribution.
Lemma 5.0.4.
Let be a random matrix with i.i.d. standard Gaussian entries.
Then for a matrix with columns in ,
|
|
|
for .
Also, when ,
|
|
|
and
|
|
|
Note always.
When , there is a savings of one factor of in the upper tail; however, for our applications, we do not know the relative sizes of and , so the bound was included.
Proof.
Let be the SVD of , with the diagonal matrix of singular values.
So and by the rotation invariance of standard Gaussian vectors, acts on a row of as
|
|
|
and consequently
|
|
|
with the independent standard Gaussian.
We then have
|
|
|
and for to be determined
|
|
|
|
|
|
|
|
with standard Gaussian, having used the independence of the rows .
We can now use the independence of the columns, here via the coordinates of :
|
|
|
provided
via change of variables ,
|
|
|
On , the function is increasing from , while on it is decreasing and still greater than 1, as , so
|
|
|
leaving us to minimize
|
|
|
There will turn out to be two cases.
If we minimize directly, the minimizer is
|
|
|
This estimate still requires , so if we require , we force
|
|
|
Because we always have , we can use the value when and are “comparable” and .
For the other case, when , we have
|
|
|
and can upper bound by defined as
|
|
|
The minimizer for is
|
|
|
and this also satisfies whenever .
When , we can also avoid the distinction between the two cases by noting for all , so that
|
|
|
For the lower tail, with ,
|
|
|
|
|
|
Because , we can estimate the moment generating function in two ways.
From for , we find
|
|
|
while if we use that is decreasing to 1 for ,
|
|
|
Minimizing
|
|
|
yields
|
|
|
with corresponding to the Taylor expansion bound and corresponding to the function bound.
Putting everything together, and remembering the th power outside, we complete the lemma.
∎
The next lemma makes the connection between equation () and equation (JL) explicit, and is informed by the form of the target dimension derived from the tail bound rates above.
In the Gaussian case, and .
Lemma 5.0.5.
For and , the requirements
|
|
|
have solution with ,
|
|
|
Proof.
The first equation gives .
Taking times the second equation and adding it to the third gives the equation for .
Subtracting the second equation from the third yields
|
|
|
Conclude
|
|
|
Lemma 5.0.6.
For ,
|
|
|
Proof.
First note by
|
|
|
from our lower bound for .
∎