
Bulk Johnson-Lindenstrauss Lemmas

Michael P. Casey
Abstract

For a set $X$ of $N$ points in $\mathbb{R}^{D}$, the Johnson-Lindenstrauss lemma provides random linear maps that approximately preserve all pairwise distances in $X$ – up to multiplicative error $(1\pm\epsilon)$ with high probability – using a target dimension of $O(\epsilon^{-2}\log(N))$. Certain known point sets actually require a target dimension this large – any smaller dimension forces at least one distance to be stretched or compressed too much. What happens to the remaining distances? If we only allow a fraction $\eta$ of the distances to be distorted beyond tolerance $(1\pm\epsilon)$, we show a target dimension of $O(\epsilon^{-2}\log(4e/\eta)\log(N)/R)$ is sufficient for the remaining distances. With the stable rank of a matrix $A$ defined as $\|A\|_{F}^{2}/\|A\|^{2}$, the parameter $R$ is the minimal stable rank over certain $\log(N)$-sized subsets of $X-X$ or their unit normalized versions, involving each point of $X$ exactly once. The linear maps may be taken as random matrices with i.i.d. zero-mean unit-variance sub-gaussian entries. When the data is sampled i.i.d. as a given random vector $\xi$, refined statements are provided; the most improvement happens when $\xi$ or the unit normalized $\widehat{\xi-\xi^{\prime}}$ is isotropic, with $\xi^{\prime}$ an independent copy of $\xi$, and this includes the case of i.i.d. coordinates.

  • MSC: primary 68Q87, 68R12, 60B20; secondary 62G30, 68T09

  • keywords: dimension reduction, Johnson-Lindenstrauss lemma, Hanson-Wright inequality, stable rank, effective rank, intrinsic dimension, order statistics, Walecki construction, bulk, batch, minibatch, random projection

1 Introduction

The Johnson-Lindenstrauss lemma [JL84] concerns the approximate preservation of distances in a finite point set in Euclidean space. Specifically, for a subset $X\subset\mathbb{R}^{D}$ of $N$ points and a tolerance $\epsilon\in(0,1)$, there exist $k\times D$ matrices $Z$ and a constant $\gamma(\epsilon)$ for which

$$(1-\epsilon)\|x-x^{\prime}\|_{2}\leq\sqrt{\frac{\gamma(\epsilon)}{k}}\,\|Z(x-x^{\prime})\|_{2}\leq(1+\epsilon)\|x-x^{\prime}\|_{2}\qquad(\text{JL})$$

holds for all pairs of points $x,x^{\prime}\in X$ simultaneously, with probability at least $1-\delta$, provided

$$k=O(\epsilon^{-2}\log(N^{2}/\delta))=:D_{JL}(N)=:D_{JL}.$$

The matrices $Z$ are drawn randomly, and much work has been done since the original paper to equip $Z$ with special properties, such as allowing fast matrix multiplication, preserving sparsity, restricting the matrix entries to discrete distributions, and so forth; see [Nel20] for a recent review. The matrix $Z$ provides a linear method for dimension reduction, which, at the very least, reduces the amount of space needed to store the dataset $X$ on the computer, provided one can work with approximate versions of the pairwise distances. One would expect that “robust” downstream algorithms that depend on distance data, now working on the pointset $ZX$, should still return answers that approximate those found on the original pointset $X$, but now in shorter time, with less memory needed, etc. People appreciate this lemma, in theory [Vem04], for the above reasons, and if the algorithm scales linearly in the ambient dimension, in time or in space, then processing $ZX$ instead of $X$ will yield a proportional improvement.
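To make equation (JL) concrete, the following short Python sketch (ours, not part of the paper; the synthetic data, the constant 4 in the choice of $k$, and all parameter values are placeholders) draws a Gaussian $Z$ and reports the fraction of pairwise distances preserved within tolerance.

```python
# Minimal illustration of (JL): project with a Gaussian Z and check distortion.
import numpy as np

def pairwise_dists(A):
    # all pairwise Euclidean distances via the Gram matrix
    sq = np.sum(A * A, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * A @ A.T
    return np.sqrt(np.maximum(d2, 0.0))

rng = np.random.default_rng(0)
N, D, eps = 500, 1000, 0.5
k = int(np.ceil(4 * np.log(N**2) / eps**2))   # k = O(eps^-2 log(N^2)); constant 4 is illustrative

X = rng.standard_normal((N, D))               # placeholder data set
Z = rng.standard_normal((k, D))               # i.i.d. standard Gaussian entries
XZ = X @ Z.T / np.sqrt(k)                     # sqrt(gamma(eps)/k) Z x, with gamma ~ 1

iu = np.triu_indices(N, 1)                    # the (N choose 2) pairs
ratios = pairwise_dists(XZ)[iu] / pairwise_dists(X)[iu]
print("fraction within (1±eps):", np.mean((ratios >= 1 - eps) & (ratios <= 1 + eps)))
```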

However, certain algorithms, including many nearest-neighbor algorithms, scale exponentially in the dimension, especially if they only use space linear in $N$, due to packing arguments. For comparison, the brute-force all-pairs nearest-neighbor problem may already be solved in $O(DN^{2})$ time by scanning the points of $X$ with respect to their distance to each query $x$. If the algorithm scales like $DN\exp(D)$, this scaling is only an improvement for $D=\log(N)$, and even then one would really prefer $D=o(\log N)$, as $N^{2}$ is too expensive when $N$ is large. Consequently, if we use multiplication by $Z$ as a preprocessing step, the target dimension $D_{JL}=O(\epsilon^{-2}\log(N^{2}))$ is much too large to be practical, as $\exp(D_{JL})$ is now polynomial in $N$ with exponent at least $2/\epsilon^{2}$ with $\epsilon<1$.

Apart from searching for better algorithms, it is then natural to ask whether there are matrices $Z$ with target dimension $k$ much smaller than $D_{JL}$ that would still satisfy equation (JL) for all pairs of points of $X$. However, Larsen and Nelson [LN14] showed no such matrix exists for general datasets – a union of the standard simplex and a sufficient number of Gaussian vectors forces at least one distance to be stretched or compressed too much. On the other hand, if further assumptions are made on the pointset, smaller $k$ is possible, for instance, when $X$ already lies in a low-dimensional subspace [Sar06] or when its unit difference set $\widehat{X-X}$ has low Gaussian complexity [KM05], even while allowing many more points in the dataset.

If one considers a given algorithm “robust”, one would hope that failing to preserve a few distances should not markedly change the final output; though, one would ideally have to prove such behavior for that algorithm. One is then led to ask: what happens to the other distances between the points of $ZX$ when $k$ is smaller than $D_{JL}$? To be concrete, can we approximately preserve $(1-\eta)$ of all the distances, for some fraction $\eta\in(0,1/2)$, ideally with $k=o(D_{JL}(N))$? The results in this paper show this is possible when $\eta$ is not too small and the data has high or even moderate “intrinsic dimension”, in a sense to be defined later. As a preview, for data sampled i.i.d. like a random vector $\xi\in\mathbb{R}^{D}$, corollary 4.1.8 shows that if $\xi$ has i.i.d. coordinates (no moment assumptions are made), we can take

$$k=\frac{C^{\prime}}{\epsilon^{2}}\left(\log(4e/\eta)\log(6De/\zeta)+\frac{\log(N^{2}/\delta)}{\eta D}\right)$$

for preserving $(1-\eta)(1-\zeta)$ of all pairwise distances, with probability at least $1-2\delta$ over the draw for $Z$ and $X$, provided $N=\Omega(\zeta^{-1}D\log(N/\delta))$ and $N=\Omega(D\log(6De/\zeta))$. The matrix $Z$ may have i.i.d. standard Gaussian, or in general, zero-mean unit-variance sub-gaussian coordinates. This estimate for $k$ is an “improvement” over the $D_{JL}$ target dimension as soon as $\eta D=\omega(1)$, or say a fractional power of $\log(N)$; “improvement” must be in quotes, as we have only guaranteed “most” distances are approximately preserved, not all. For general $\xi$, the $1/\eta D$ is replaced by a $1/\eta\widehat{r}$, with $\widehat{r}$ the intrinsic dimension $1/\|\mathbb{E}\,\widehat{y}\widehat{y}^{\top}\|$ of the unit normalized random vector $\widehat{y}=\widehat{\xi-\xi^{\prime}}$, for an independent copy $\xi^{\prime}$ of $\xi$. We have $1\leq\widehat{r}\leq D$ always, and we may estimate it using corollary 4.2.1.

Both of our main results – theorem 2.0.6 for general datasets and theorem 4.1.5 for i.i.d. samples – rely on a dual viewpoint for the dimension reduction problem: instead of asking how $Z$ transforms the data $X$, we ask how $X^{\top}$ transforms the test matrix $Z^{\top}$; we can then exploit known properties of how general matrices act on $Z^{\top}$ or its columns. Standard results like the Hanson-Wright inequality may be viewed in this light, and we indeed do so in this paper. Treating $Z^{\top}$ as a test matrix with known properties has been done previously in randomized numerical linear algebra [HMT10]; however, unlike there, slow decay in singular values is actually a good case for us.

The rest of the paper is organized as follows. The main argument allowing us to quantify how many distances can be preserved is in section 2 and leads to our first theorem 2.0.6, which, by itself, only suggests smaller target dimensions $k$ may be possible by considering “small” batches of difference vectors. We then recall the Walecki construction in section 3, which gives us a way to choose these batches that works well for i.i.d. samples as well as the standard simplex. Section 4 then presents the rest of our results, specializing to data sampled i.i.d. from a given distribution, which may just be a larger dataset. This section leads up to our second main theorem 4.1.5, allowing us to make $k$ depend on the intrinsic dimension $\widehat{r}$ mentioned above. We also show how to estimate $\widehat{r}$ from the data. The appendix contains proofs for the properties of $Z$ that we use in the paper, namely, a variant of the Hanson-Wright inequality, and a corresponding independent proof in the Gaussian case in order to have decent explicit constants for $k$.

1.1 Notation

Suppose $X\subset\mathbb{R}^{D}$ is a point set of size $N$. Given an arbitrary ordering of the points of $X$, let $Y$ be the set of difference vectors

$$Y:=\left\{x_{i}-x_{j}\;|\;x_{i},x_{j}\in X,\ 0\leq i<j\leq N-1\right\}$$

and $\widehat{Y}$ be their unit normalized versions

$$\widehat{Y}:=\left\{y/\|y\|_{2}\;|\;y\in Y\right\}\subset S^{D-1}.$$

The number of elements of a set $\Upsilon$ is denoted by $|\Upsilon|$. We set $(x,y)=x^{\top}y$ for the usual Euclidean inner product, with $\|x\|_{2}^{2}=(x,x)$. We also denote by $o$ the origin $(0,\ldots,0)$ in any $\mathbb{R}^{j}$.

Let $A$ be a $D\times M$ matrix, which we write as $A\in\mathbb{R}^{D\times M}$. From [GVL13, page 76], the singular value decomposition (SVD) of $A$ is the factorization $A=U\Sigma V^{\top}$ with $U\in\mathbb{R}^{D\times D}$ and $V\in\mathbb{R}^{M\times M}$ orthogonal and

$$\Sigma=\mathrm{diag}(\sigma_{1},\ldots,\sigma_{p})\quad\text{with}\quad p=\min\{D,\,M\}$$

for $\sigma_{1}\geq\sigma_{2}\geq\ldots\geq\sigma_{p}\geq 0$. Let $\vec{\sigma}=(\sigma_{1},\ldots,\sigma_{p})$, viewed as a vector in $\mathbb{R}^{p}$. We write $\|A\|=\|\vec{\sigma}\|_{\infty}=\sigma_{1}$ for the operator norm of $A$, while $\|A\|_{F}=\|\vec{\sigma}\|_{2}$ is the Frobenius (or Hilbert-Schmidt) norm of $A$. We may also compute $\|A\|_{F}$ via

$$\|A\|_{F}^{2}=\sum_{i}\|A_{i}\|_{2}^{2}=\sum_{i,j}A_{ij}^{2},$$

where the $A_{i}$ may be taken as the rows or the columns of $A$. Vectors are treated as column vectors unless otherwise indicated.

The $\infty$-stable rank $r_{\infty}(A)$ of a matrix $A$ is the ratio

$$r_{\infty}(A):=\frac{\|A\|_{F}^{2}}{\|A\|^{2}}=\frac{\|\vec{\sigma}\|_{2}^{2}}{\|\vec{\sigma}\|_{\infty}^{2}}=\frac{\sum\sigma_{i}^{2}}{\sigma_{1}^{2}}.$$

There are other variants of stable rank [NY17], but only the $4$-stable rank $r_{4}(A)$ will make an appearance in this paper:

$$r_{4}(A):=\frac{\|A\|_{F}^{4}}{\|AA^{\top}\|_{F}^{2}}=\frac{\|\vec{\sigma}\|_{2}^{4}}{\|\vec{\sigma}\|_{4}^{4}}=\frac{\big(\sum\sigma_{i}^{2}\big)^{2}}{\sum\sigma_{i}^{4}}\geq\frac{\big(\sum\sigma_{i}^{2}\big)^{2}}{\sigma_{1}^{2}\sum\sigma_{i}^{2}}=r_{\infty}(A).$$

So $r_{4}(A)\geq r_{\infty}(A)$ always, and both of these are at most the rank of $A$, which is at most $p$.
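Both stable ranks are cheap to compute numerically from the singular values. The following helper (ours, for illustration only; the matrix is a random placeholder) does so and checks the chain $r_{\infty}(A)\leq r_{4}(A)\leq\operatorname{rank}(A)$.

```python
# Compute r_inf(A) = ||A||_F^2 / ||A||^2 and r_4(A) = ||A||_F^4 / ||A A^T||_F^2 from the SVD.
import numpy as np

def stable_ranks(A):
    s = np.linalg.svd(A, compute_uv=False)        # sigma_1 >= ... >= sigma_p >= 0
    r_inf = np.sum(s**2) / s[0]**2
    r_4 = np.sum(s**2)**2 / np.sum(s**4)
    return r_inf, r_4

rng = np.random.default_rng(1)
A = rng.standard_normal((100, 50))
r_inf, r_4 = stable_ranks(A)
assert r_inf <= r_4 + 1e-9 and r_4 <= min(A.shape) + 1e-9   # r_inf <= r_4 <= rank(A) <= p
print(r_inf, r_4)
```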

We make use of the $\psi_{2}$-norm and $\psi_{1}$-norm, defined for a random variable $\omega$ as

$$\|\omega\|_{\psi_{2}}:=\inf\left\{t>0\;|\;\mathbb{E}\exp(\omega^{2}/t^{2})\leq 2\right\},\quad\text{and}\quad\|\omega\|_{\psi_{1}}:=\inf\left\{t>0\;|\;\mathbb{E}\exp(|\omega|/t)\leq 2\right\}.$$

See [Ver18, sections 2.5 and 2.7].

1.1.1 Isotropic Random Vectors and the Intrinsic Dimension

A particular condition on the second moment matrix of a random vector will be useful in this paper.

Definition 1.1.1.

A random vector $\xi\in\mathbb{R}^{D}$ is isotropic if $\Sigma(\xi):=\mathbb{E}\xi\xi^{\top}=\operatorname{Id}_{D}$.

A working example is any vector $\xi$ with i.i.d. zero-mean unit-variance coordinates. Basic properties of isotropic random vectors include $\mathbb{E}\|\xi\|_{2}^{2}=D$, via the cyclic property of the trace, and that $\xi$ isotropic is equivalent to $\mathbb{E}(\xi,y)^{2}=\|y\|_{2}^{2}$ for any $y\in\mathbb{R}^{D}$. See [Ver18, pages 43-45] for more on such vectors.

We can assign a version of $\infty$-stable rank to the distribution of a random vector via the intrinsic dimension.

Definition 1.1.2.

The intrinsic dimension $r(\xi)$ of a random vector $\xi\in\mathbb{R}^{D}$ is the ratio

$$r(\xi):=\frac{\mathrm{tr}\,\Sigma(\xi)}{\|\Sigma(\xi)\|}=\frac{\mathrm{tr}\,\mathbb{E}\xi\xi^{\top}}{\|\Sigma(\xi)\|}=\frac{\mathbb{E}\|\xi\|_{2}^{2}}{\|\mathbb{E}\xi\xi^{\top}\|},\quad\text{and for }c\neq 0,\quad r(c\,\xi)=\frac{c^{2}\,\mathbb{E}\|\xi\|_{2}^{2}}{c^{2}\|\mathbb{E}\xi\xi^{\top}\|}=r(\xi).$$

Like the stable ranks, the intrinsic dimension of a vector in $\mathbb{R}^{D}$ is bounded by the ambient dimension $D$, so isotropic random vectors realize the highest possible intrinsic dimension. In the literature, $r(\xi)$ is sometimes called the effective rank of the second moment matrix $\Sigma(\xi)$, and it is the stable rank of the matrix $\Sigma^{1/2}(\xi)$.
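For illustration (this is not the estimator of corollary 4.2.1), one can gauge $r(\xi)=\mathrm{tr}\,\Sigma(\xi)/\|\Sigma(\xi)\|$ from an i.i.d. sample by plugging the empirical second moment matrix into the definition; in the sketch below, the isotropic example gives a value near $D$, while an anisotropic one (our arbitrary choice of variances) gives a smaller value.

```python
# Estimate the intrinsic dimension from i.i.d. samples of xi.
import numpy as np

def intrinsic_dimension(samples):
    # samples: (n, D) array of i.i.d. draws of xi
    Sigma_hat = samples.T @ samples / samples.shape[0]   # empirical E xi xi^T
    return np.trace(Sigma_hat) / np.linalg.norm(Sigma_hat, 2)   # tr / operator norm

rng = np.random.default_rng(2)
D, n = 50, 20000
iso = rng.standard_normal((n, D))                 # isotropic: i.i.d. N(0,1) coordinates
aniso = iso * np.linspace(1.0, 5.0, D)            # coordinates with unequal variances
print(intrinsic_dimension(iso))                   # close to D = 50
print(intrinsic_dimension(aniso))                 # strictly smaller than D
```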

Isotropic random vectors behave well under orthogonal projection.

Lemma 1.1.3.

Let $\Phi$ be a $d\times D$ matrix with orthonormal rows and $\xi\in\mathbb{R}^{D}$ an isotropic random vector. Then $\Phi(\xi)\in\mathbb{R}^{d}$ is also isotropic.

Proof.

By linearity of expectation: $\mathbb{E}(\Phi\xi)(\Phi\xi)^{\top}=\Phi(\mathbb{E}\xi\xi^{\top})\Phi^{\top}=\Phi\operatorname{Id}_{D}\Phi^{\top}=\operatorname{Id}_{d}$. ∎

Note if $\Phi$ is scaled by a constant $c\neq 0$, the intrinsic dimension is unchanged: $d=r(\Phi(\xi))=r(c\,\Phi(\xi))$.

2 Controlling Order Statistics

In this paper, we only study dimension reduction matrices $Z:\mathbb{R}^{D}\to\mathbb{R}^{k}$ with isotropic rows, that is, $\mathbb{E}(Z_{i},y)^{2}=\|y\|_{2}^{2}$ for all $y$ and all rows $Z_{i}$. So $\mathbb{E}\|Zy\|_{2}^{2}=k\|y\|_{2}^{2}$. We say more about isotropic random vectors in section 1.1.1. To guarantee equation (JL) holds for a difference vector $y\in Y$, the usual proof of the Johnson-Lindenstrauss lemma considers each vector individually, providing upper bounds for the failure probabilities

$$p_{+}(y):=\mathbb{P}_{Z}\left\{\|Zy\|_{2}^{2}>(1+\widetilde{\epsilon}_{+})k\|y\|_{2}^{2}\right\}$$

and

$$p_{-}(y):=\mathbb{P}_{Z}\left\{\|Zy\|_{2}^{2}<(1-\widetilde{\epsilon}_{-})k\|y\|_{2}^{2}\right\},$$

implicitly working with the formulation

$$(1-\widetilde{\epsilon}_{-})k\|x-x^{\prime}\|_{2}^{2}\leq\|Z(x-x^{\prime})\|_{2}^{2}\leq(1+\widetilde{\epsilon}_{+})k\|x-x^{\prime}\|_{2}^{2}\qquad(\text{JL}^{2})$$

which is often easier to manage. In lemma 5.0.5 of the appendix, we show how to choose $\widetilde{\epsilon}_{\pm}$ in line with the original equation (JL).

Suppose $\widetilde{\epsilon}_{-}\leq\widetilde{\epsilon}_{+}$, as will be the case throughout this paper. If the distribution of $Z$ is sufficiently nice, sub-gaussian for example, then one may show $p_{+}+p_{-}\leq 2\exp(-\widetilde{\epsilon}_{-}^{2}k/C)$ for each fixed $y$, with $C$ a constant depending on the distribution of $Z$. By the union bound (Boole's inequality), the probability that a random $Z$ fails for some $y\in Y$ is at most

$$\binom{N}{2}2\exp(-\widetilde{\epsilon}_{-}^{2}k/C)\leq N^{2}\exp(-\widetilde{\epsilon}_{-}^{2}k/C)<\delta,$$

provided $k>C\widetilde{\epsilon}_{-}^{-2}\log(N^{2}/\delta)=D_{JL}$. The probability that a random $Z$ succeeds for all $y\in Y$ is then at least $1-\delta$, so at least one such $Z$ exists.

The argument above considers the vectors $y\in Y$ one at a time, making sure $Z$ succeeds for each $y$; if we consider the unit normalized vectors $\widehat{y}\in\widehat{Y}$, we may view this argument as controlling the extreme order statistics of the random variables $\|Z\widehat{y}_{i}\|_{2}^{2}$, induced by $Z$, namely

$$(1-\widetilde{\epsilon}_{-})k\leq\|Z\widehat{y}_{(0)}\|_{2}^{2}\leq\ldots\leq\|Z\widehat{y}_{(j)}\|_{2}^{2}\leq\ldots\leq\|Z\widehat{y}_{(\binom{N}{2}-1)}\|_{2}^{2}\leq(1+\widetilde{\epsilon}_{+})k.$$

If we only wish to preserve a fraction of the distances, say $(1-\eta)$ with $\eta$ hopefully small, we can consider controlling the intermediate or central order statistics of the $\|Z\widehat{y}_{i}\|_{2}^{2}$ instead. We do so as follows.

If we divide the difference vectors into batches of size $M$, and preserve $(1-\eta)M$ distances in each batch, then we still recover

$$(1-\eta)M\binom{N}{2}/M=(1-\eta)\binom{N}{2}\quad\text{of the total distances.}$$

We assume $\eta M$ is a strictly positive integer here, and for simplicity of discussion, we also assume $M$ divides $N$; we shall return to this point later. Let $\Upsilon\subset Y$ be a given batch of size $M$. Each given matrix $Z$ also induces an ordering on the points of $\Upsilon$: with $\widehat{y}=y/\|y\|_{2}$,

$$\|Z\widehat{y}_{(0)}\|_{2}^{2}\leq\ldots\leq\|Z\widehat{y}_{(i)}\|_{2}^{2}\leq\ldots\leq\|Z\widehat{y}_{(M-1)}\|_{2}^{2}.$$

As $Z$ is random, this ordering is too, treating $Y$ as fixed. If we could guarantee that $\|Z\widehat{y}_{((1-\eta)M)}\|_{2}^{2}\leq(1+\widetilde{\epsilon}_{+})k$, then all $\|Z\widehat{y}_{(i)}\|_{2}^{2}$ with $i\leq(1-\eta)M$ also have this upper bound, with an analogous statement for a lower bound of $(1-\widetilde{\epsilon}_{-})k$ on $\|Z\widehat{y}_{(\eta M-1)}\|_{2}^{2}$, altogether guaranteeing $(1-2\eta)M+2>(1-2\eta)M$ of the vectors of $\Upsilon$ have

$$(1-\widetilde{\epsilon}_{-})k\leq\|Z\widehat{y}\|_{2}^{2}\leq(1+\widetilde{\epsilon}_{+})k.$$
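The following sketch (ours; the batch, tolerance, and sizes are arbitrary stand-ins) illustrates this order-statistics viewpoint on a single synthetic batch: sort the values $\|Z\widehat{y}\|_{2}^{2}/k$ and check that the central $(1-2\eta)M+2$ of them lie in the tolerance window.

```python
# Order statistics of ||Z yhat||^2 / k over one batch of M difference vectors.
import numpy as np

rng = np.random.default_rng(3)
D, k, M, eta, eps = 200, 50, 100, 0.05, 0.5

batch = rng.standard_normal((M, D))                        # stand-in for a batch of differences
yhat = batch / np.linalg.norm(batch, axis=1, keepdims=True)
Z = rng.standard_normal((k, D))

stats = np.sort(np.sum((yhat @ Z.T) ** 2, axis=1) / k)     # ordered ||Z yhat_(i)||^2 / k
m = int(eta * M)
central = stats[m - 1: (M - m) + 1]                        # the (1-2*eta)*M + 2 central values
print("central order statistics within (1±eps):",
      bool(np.all(central >= 1 - eps) and np.all(central <= 1 + eps)))   # typically True here
```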

We need to control the probabilities

$$p_{+}(\Upsilon):=\mathbb{P}_{Z}\left\{\|Z\widehat{y}_{((1-\eta)M)}\|_{2}^{2}>(1+\widetilde{\epsilon}_{+})k\right\}\qquad(2.1)$$

and

$$p_{-}(\Upsilon):=\mathbb{P}_{Z}\left\{\|Z\widehat{y}_{(\eta M-1)}\|_{2}^{2}<(1-\widetilde{\epsilon}_{-})k\right\}.\qquad(2.2)$$

We can recast control of $p_{\pm}(\Upsilon)$ using the following lemma. Let $\widehat{\Upsilon}=\{\widehat{y}_{0},\ldots,\widehat{y}_{M-1}\}$, that is, the unit normalized version of $\Upsilon$.

Lemma 2.0.1.

Let $Z$ be a random $k\times D$ matrix. With the notation above, and $\Upsilon(\eta)$ running through all $\eta M$-sized subsets of $\Upsilon$,

$$p_{+}(\Upsilon)\leq\max_{\Upsilon(\eta)}\min_{\Lambda_{\Upsilon(\eta)}}\binom{M}{\eta M}\,\mathbb{P}_{Z}\left\{\|Z\Upsilon(\eta)\Lambda_{\Upsilon(\eta)}\|_{F}^{2}>(1+\widetilde{\epsilon}_{+})k\,\|\Upsilon(\eta)\Lambda_{\Upsilon(\eta)}\|_{F}^{2}\right\}$$

and

$$p_{-}(\Upsilon)\leq\max_{\Upsilon(\eta)}\min_{\Lambda_{\Upsilon(\eta)}}\binom{M}{\eta M}\,\mathbb{P}_{Z}\left\{\|Z\Upsilon(\eta)\Lambda_{\Upsilon(\eta)}\|_{F}^{2}<(1-\widetilde{\epsilon}_{-})k\,\|\Upsilon(\eta)\Lambda_{\Upsilon(\eta)}\|_{F}^{2}\right\},$$

with $\Lambda_{\Upsilon(\eta)}$ a diagonal matrix with strictly positive entries, chosen for each subset $\Upsilon(\eta)$.

Remark 2.0.2.

For simplicity, the rest of the paper will take $\Upsilon(\eta)\Lambda_{\Upsilon(\eta)}$ as $\Upsilon(\eta)$ itself, or its unit normalized version $\widehat{\Upsilon}(\eta)$; though, there may potentially be some improvement gained by the freedom in choosing each $\Lambda_{\Upsilon(\eta)}$.

Proof.

For $p_{+}(\Upsilon)$, let $\widehat{y}_{\ast}=\widehat{y}_{((1-\eta)M)}$. If $\|Z\widehat{y}_{\ast}\|_{2}^{2}>(1+\widetilde{\epsilon}_{+})k$, then $\widehat{y}_{\ast}$ is part of a subset $\Upsilon(\eta)\subset\Upsilon$ of size $\eta M$ for which all the $Z\widehat{y}$ have norms too large. For any given subset $\Upsilon(\eta)$, consider the following probabilities, with $\lambda_{y}>0$ chosen for each $y$,

$$\begin{aligned}
\mathbb{P}_{Z}\left\{\mathcal{E}(\Upsilon(\eta))\right\}&:=\mathbb{P}_{Z}\left\{\|Z\widehat{y}\|_{2}^{2}>(1+\widetilde{\epsilon}_{+})k\text{ for all }y\in\Upsilon(\eta)\right\}\\
&=\mathbb{P}_{Z}\left\{\|Zy\|_{2}^{2}>(1+\widetilde{\epsilon}_{+})k\|y\|_{2}^{2}\text{ for all }y\in\Upsilon(\eta)\right\}\\
&=\mathbb{P}_{Z}\left\{\|Zy\lambda_{y}\|_{2}^{2}>(1+\widetilde{\epsilon}_{+})k\|y\lambda_{y}\|_{2}^{2}\text{ for all }y\in\Upsilon(\eta)\right\}\\
&\leq\mathbb{P}_{Z}\left\{\sum_{y\in\Upsilon(\eta)}\|Zy\lambda_{y}\|_{2}^{2}>(1+\widetilde{\epsilon}_{+})k\sum_{y\in\Upsilon(\eta)}\|y\lambda_{y}\|_{2}^{2}\right\}.
\end{aligned}$$

The second and third lines follow by the linearity of $Z$. Passing to the sum above allows an important change of viewpoint using the Frobenius norm, as each $Zy\lambda_{y}$ is a column of $Z(\Upsilon(\eta)\Lambda_{\Upsilon(\eta)})$, with $\Lambda_{\Upsilon(\eta)}$ the appropriate diagonal matrix:

$$\mathbb{P}_{Z}\left\{\mathcal{E}(\Upsilon(\eta))\right\}\leq\mathbb{P}_{Z}\left\{\|Z\Upsilon(\eta)\Lambda_{\Upsilon(\eta)}\|_{F}^{2}>(1+\widetilde{\epsilon}_{+})k\,\|\Upsilon(\eta)\Lambda_{\Upsilon(\eta)}\|_{F}^{2}\right\}.$$

Now, there are $\binom{M}{\eta M}$ subsets $\Upsilon(\eta)$ of $\Upsilon$ of size $\eta M$, so a union bound over such subsets gives

$$\begin{aligned}
p_{+}(\Upsilon)&\leq\binom{M}{\eta M}\max_{\Upsilon(\eta)}\mathbb{P}_{Z}\left\{\mathcal{E}(\Upsilon(\eta))\right\}\\
&\leq\binom{M}{\eta M}\max_{\Upsilon(\eta)}\mathbb{P}_{Z}\left\{\|Z\Upsilon(\eta)\Lambda_{\Upsilon(\eta)}\|_{F}^{2}>(1+\widetilde{\epsilon}_{+})k\,\|\Upsilon(\eta)\Lambda_{\Upsilon(\eta)}\|_{F}^{2}\right\}.
\end{aligned}$$

Since these bounds hold for every admissible choice of $\Lambda_{\Upsilon(\eta)}$, we may take the minimum over such choices. The argument for $p_{-}(\Upsilon)$ is similar, noting that $\|Z\widehat{y}_{(\eta M-1)}\|_{2}^{2}<(1-\widetilde{\epsilon}_{-})k$ forces $\eta M$ vectors $Z\widehat{y}$ to have squared norms too small, so their corresponding sum is too small as well. ∎

We now make assumptions on the matrix $Z$, allowing us to make use of the stable rank of the minibatches $\Upsilon(\eta)$.

Corollary 2.0.3.

With the notation as in lemma 2.0.1, $\widetilde{\epsilon}_{+}=\widetilde{\epsilon}_{-}\sqrt{2}$, and $\widetilde{\epsilon}_{-}\in(0,1)$, if $Z$ has i.i.d. standard Gaussian entries, then

$$\max\left\{p_{+}(\Upsilon),\,p_{-}(\Upsilon)\right\}\leq\binom{M}{\eta M}\max_{\Upsilon(\eta)}\exp\left(-\frac{k\widetilde{\epsilon}_{-}^{2}}{4}\,r_{\infty}(\Upsilon(\eta))\right).$$

One may replace $\Upsilon(\eta)$ by $\widehat{\Upsilon}(\eta)$ on the right hand side.

One point we want to stress even now is the presence of the product $k\,r_{\infty}$ in the exponent. If one wants a fixed failure probability, $k$ need not be as large when $r_{\infty}$ is sizable. We shall give several examples in this paper where $r_{\infty}$ is large.

Proof.

The proof follows immediately from lemma 5.0.4 in the appendix, with $A=\Upsilon(\eta)$, recalling $r_{4}\geq r_{\infty}$ for the $p_{-}(\Upsilon)$ case. With $C_{+}=8$ and $C_{-}=4$, we improve the denominator from 8 to 4 by setting $C_{+}/\widetilde{\epsilon}_{+}^{2}=C_{-}/\widetilde{\epsilon}_{-}^{2}$ as in lemma 5.0.5. ∎

Consider the term $\|Z\Upsilon(\eta)\|_{F}^{2}$ in lemma 2.0.1, taking $\Lambda_{\Upsilon(\eta)}$ as the identity. (The following discussion will also hold for other scalings, such as $\widehat{\Upsilon}(\eta)$.) Labeling the rows of $Z$ as $Z_{i}$, we can interchange the sums implicit in $\|Z\Upsilon(\eta)\|_{F}^{2}$ to our advantage:

$$\sum_{y\in\Upsilon(\eta)}\|Zy\|_{2}^{2}=\sum_{y\in\Upsilon(\eta)}\sum_{i=1}^{k}(Z_{i},y)^{2}=\sum_{i=1}^{k}\sum_{y\in\Upsilon(\eta)}(y,Z_{i})^{2}=\sum_{i=1}^{k}\|\Upsilon(\eta)^{\top}Z_{i}\|_{2}^{2}.$$
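As a quick numeric sanity check of the interchange above (ours, not part of the argument; the matrices below are arbitrary placeholders), the three quantities agree for any fixed $Z$ and minibatch:

```python
# sum_y ||Z y||^2  ==  ||Z Upsilon(eta)||_F^2  ==  sum_i ||Upsilon(eta)^T Z_i||^2
import numpy as np

rng = np.random.default_rng(4)
D, k, m = 100, 30, 12
U = rng.standard_normal((D, m))          # columns play the role of the minibatch Upsilon(eta)
Z = rng.standard_normal((k, D))

lhs = np.sum(np.linalg.norm(Z @ U, axis=0) ** 2)        # sum over columns y of ||Z y||^2
mid = np.linalg.norm(Z @ U, 'fro') ** 2                 # ||Z Upsilon(eta)||_F^2
rhs = np.sum(np.linalg.norm(U.T @ Z.T, axis=0) ** 2)    # sum over rows Z_i of ||Upsilon^T Z_i||^2
print(np.isclose(lhs, mid), np.isclose(mid, rhs))
```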

Because we assume the rows $Z_{i}$ are isotropic, $\mathbb{E}\|\Upsilon(\eta)^{\top}Z_{i}\|_{2}^{2}=\|\Upsilon(\eta)^{\top}\|_{F}^{2}=\|\Upsilon(\eta)\|_{F}^{2}$, and we have transformed the bound on $p_{+}(\Upsilon)$ to involve the probabilities

$$\mathbb{P}_{Z}\left\{\sum_{i=1}^{k}\|\Upsilon(\eta)^{\top}Z_{i}\|_{2}^{2}>(1+\widetilde{\epsilon}_{+})\sum_{i=1}^{k}\|\Upsilon(\eta)\|_{F}^{2}\right\},$$

and $\Upsilon(\eta)^{\top}$ is now viewed as a fixed matrix acting on the random vectors $Z_{i}$. We may thus take a dual viewpoint on the dimension reduction problem: instead of considering how the matrix $Z$ acts on each difference vector, we consider how the transposed batch of difference vectors $\Upsilon(\eta)^{\top}$ acts on the matrix $Z^{\top}$, as mentioned in the introduction. If we take the $Z_{i}$ to be independent, with i.i.d. mean-zero unit-variance sub-gaussian entries, we can use lemma 5.0.3 in the appendix to harness both the independence of the $Z_{i}$ and the Hanson-Wright inequality:

Lemma 2.0.4.

With the notation as in lemma 2.0.1 and $\widetilde{\epsilon}_{-}\leq\widetilde{\epsilon}_{+}$, let $Z$ be a $k\times D$ random matrix with i.i.d. mean-zero unit-variance sub-gaussian entries; then

$$p_{\pm}(\Upsilon)\leq\binom{M}{\eta M}\max_{\Upsilon(\eta)}\exp\left(-ck\min\left(\frac{\widetilde{\epsilon}_{-}^{2}\,r_{4}(\Upsilon(\eta))}{K^{4}},\frac{\widetilde{\epsilon}_{-}\,r_{\infty}(\Upsilon(\eta))}{K^{2}}\right)\right),$$

with $K=\|Z_{11}\|_{\psi_{2}}$. One may replace $\Upsilon(\eta)$ by $\widehat{\Upsilon}(\eta)$ on the right hand side.

Proof.

To bound the probabilities on the right hand side of lemma 2.0.1, just take $A=\Upsilon(\eta)$ in lemma 5.0.3. ∎

Remark 2.0.5.

Recall $r_{4}\geq r_{\infty}$ always, so if $K\geq 1$, we can write

$$p_{\pm}(\Upsilon)\leq\binom{M}{\eta M}\max_{\Upsilon(\eta)}\exp\left(-Ck\min(\widetilde{\epsilon}_{-}^{2},\widetilde{\epsilon}_{-})\,r_{\infty}(\Upsilon(\eta))\right).$$

We can now control the probabilities in equations (2.1) and (2.2) for a given batch $\Upsilon$ of size $M$. The control is in terms of $r_{\infty}(\Upsilon(\eta))$ or $r_{4}(\Upsilon(\eta))$ for subsets of size $\eta M$. Because the target dimension $k$ is a global parameter, it needs to be in terms of global quantities, but the stable ranks above vary over minibatches. To make this transition and to help with bookkeeping, recall that a decomposition of a graph is a partition of its edges.

Let $\mathcal{P}$ be a decomposition of the complete graph on $N$ vertices into batches $\Upsilon$ of size $M$. If $M$ does not divide $N$, that is, with $N=jM+n$, we can expand several of the batches to $M+s\geq M$ with $s=\lceil n/j\rceil<M$. For those batches, $\eta(M+s)$ need not be an integer, so take $\tilde{\eta}$ as

$$\tilde{\eta}=\eta\frac{M}{M+s}\quad\text{in order for}\quad\tilde{\eta}(M+s)=\eta M,\qquad(2.3)$$

and set $\Upsilon(\eta)=\Upsilon(\tilde{\eta})$ when the batch size is not $M$. Note smaller $\eta$ values only help us ensure a total fraction $(1-\eta)$ of distances is preserved. For this decomposition, let

$$R_{\infty}(\eta M):=R_{\infty}(\eta M;\mathcal{P}):=\min_{\Upsilon\in\mathcal{P}}\,\min_{\Upsilon(\eta)\subset\Upsilon}r_{\infty}(\Upsilon(\eta))$$

be the minimum stable rank of the $\eta M$-sized minibatches from such batches. We then have our first theorem, written in terms of $\epsilon$ in the original equation (JL).

Theorem 2.0.6.

Let $0<\eta<1$, with $\eta M\in\mathbb{N}$, and let $0<\epsilon\leq 2/3$. For a set $X$ of $N$ points in $\mathbb{R}^{D}$, $R_{\infty}(\eta M)$ as above, and $Z$ a $k\times D$ matrix with i.i.d. mean-zero unit-variance sub-gaussian entries, equation (JL) holds for at least $(1-2\eta)\binom{N}{2}$ pairs $x,x^{\prime}\in X$, with probability at least $1-\delta$, provided

$$k\geq C\,\frac{\max(K^{4},K^{2})}{\epsilon^{2}}\left(\log(2e/\eta)\frac{\eta M}{R_{\infty}(\eta M)}+\frac{\log(N^{2}/(M\delta))}{R_{\infty}(\eta M)}\right),\quad\text{and}\quad\gamma(\epsilon)=1+\epsilon^{2}.$$

Here, $K=\|Z_{11}\|_{\psi_{2}}$ and $C$ is an absolute constant. In the case of standard Gaussian entries, one can replace $C\max(K^{4},K^{2})$ and $\gamma(\epsilon)$ by

$$C(\epsilon):=4\left(\frac{(1+\epsilon)^{2}+(1-\epsilon)^{2}\sqrt{2}}{4}\right)^{2}<2.25\quad\text{and}\quad\gamma(\epsilon)=\frac{(1+\epsilon)^{2}+(1-\epsilon)^{2}\sqrt{2}}{1+\sqrt{2}},$$

respectively. When the distribution for $Z$ is understood, we denote $C\max(K^{4},K^{2})$ or $C(\epsilon)$ by $C_{[2.0.6]}$, and likewise $\gamma(\epsilon)$ by $\gamma_{[2.0.6]}$.

To make sense of the above, suppose $\eta M=D$, and then recall $1\leq R_{\infty}(\eta M)\leq D$, as the stable rank is always bounded above by the ambient dimension. If $R_{\infty}(\eta M)=cD$, for $c$ bounded away from 0, the term attached to $\log(2e/\eta)$ becomes constant, while the remaining term becomes constant as soon as $D\gtrsim\log(2N^{2}/(D\delta))$.

Note $R_{\infty}(\eta M)$ depends on the decomposition $\mathcal{P}$, which we are free to choose at will. The remainder of the paper will address how to choose $\mathcal{P}$ (and the batch size $M$).

Proof.

We treat the upper and lower bounds on $\|Z(x-x^{\prime})\|_{2}^{2}$ separately. If

$$\sum_{\Upsilon\in\mathcal{P}}(p_{+}(\Upsilon)+p_{-}(\Upsilon))\leq\delta,\qquad(2.4)$$

then with probability at least $1-\delta$ none of the events defining $p_{\pm}(\Upsilon)$ hold, over all the batches $\Upsilon$ of the decomposition. So by equation (2.1), with probability at least $1-\delta$, we preserve at least (recalling $\tilde{\eta}<\eta$ when needed)

$$\sum_{\Upsilon\in\mathcal{P}}(1-\eta)|\Upsilon|=(1-\eta)\binom{N}{2}$$

of the squared distances within a $(1+\widetilde{\epsilon}_{+})$ tolerance, and another $(1-\eta)\binom{N}{2}$ of the squared distances within a $(1-\widetilde{\epsilon}_{-})$ tolerance. By the pigeonhole principle, at least $(1-2\eta)\binom{N}{2}$ of the distances are approximately preserved on both sides.

It remains to choose $k$ to ensure equation (2.4) holds. We always have, by lemma 5.0.6,

$$\binom{M+s}{\tilde{\eta}(M+s)}\leq\left(\frac{e(M+s)}{\eta M}\right)^{\eta M}=\exp\left(\log\left(\frac{e(M+s)}{\eta M}\right)\eta M\right)<\exp(\log(2e/\eta)\eta M),$$

while $\binom{M}{\eta M}\leq\exp(\log(e/\eta)\eta M)$, so by lemma 2.0.4

$$p_{+}(\Upsilon)+p_{-}(\Upsilon)\leq 2\exp\left(-\frac{c}{\max(K^{4},K^{2})}k\widetilde{\epsilon}_{-}^{2}R_{\infty}(\eta M)+\log(2e/\eta)\eta M\right).$$

There are at most $\lfloor\binom{N}{2}/M\rfloor\leq N^{2}/(2M)$ batches $\Upsilon$ in the decomposition, expanding several of the batches to absorb any remainder $N\bmod M$ if necessary, so we need $k$ to satisfy

$$2(N^{2}/(2M))\exp\left(-\frac{c}{\max(K^{4},K^{2})}k\widetilde{\epsilon}_{-}^{2}R_{\infty}(\eta M)+\log(2e/\eta)\eta M\right)\leq\delta.$$

Using lemma 5.0.5 with $C_{+}=C_{-}=1$, we achieve the desired bound for $k$ in terms of $\epsilon$ by taking

$$C=\frac{1}{c}\left(\frac{1+\epsilon^{2}}{2}\right)^{2}<0.522\,\frac{1}{c}.$$

To replace $C\max(K^{4},K^{2})$ in the Gaussian case, use corollary 2.0.3 with lemma 5.0.5 on $C_{+}=2C_{-}$ to find

$$\frac{4}{\widetilde{\epsilon}_{-}^{2}}=\frac{C(\epsilon)}{\epsilon^{2}}.\qquad\blacksquare$$

From lemma 2.0.1 we are free to scale vectors in each batch $\Upsilon$ to have unit norm. So we can also define, for a given decomposition,

$$\widehat{R}_{\infty}(\eta M;\mathcal{P})=\min_{\widehat{\Upsilon}\in\mathcal{P}}\,\min_{\widehat{\Upsilon}(\eta)\subset\widehat{\Upsilon}}r_{\infty}(\widehat{\Upsilon}(\eta)).$$

This normalization allows us to use linear algebra to control $r_{\infty}(\widehat{\Upsilon}(\eta))$ deterministically, for any of the $\eta M$-sized subsets of $\widehat{\Upsilon}$, once we have control of $\sigma_{1}(\widehat{\Upsilon})$. When the underlying data is random, we can thus avoid the need to take union bounds over the minibatches.

Lemma 2.0.7.

Let $A$ be a $D\times M$ matrix with columns of constant norm. If $B$ is a $D\times m$ submatrix of $A$, then

$$r_{\infty}(B)\geq\max\left(\frac{m}{M}r_{\infty}(A),1\right).$$

In particular, if $r_{\infty}(A)\geq cM$, then $r_{\infty}(B)\geq\max(cm,1)$.

To be useful, we need $c\gg 1/m$ here.

Proof.

To control $\|B\|$, partition the matrix $A$ as $A=(B^{\prime}|B)$ with $B^{\prime}$ a $D\times(M-m)$ matrix. Viewing the unit sphere $S^{m-1}$ as $o\times S^{m-1}$ in $S^{M-1}$,

$$\|A\|=\max_{x\in S^{M-1}}\|(B^{\prime}|B)x\|_{2}\geq\max_{x\in S^{m-1}}\|(B^{\prime}|B)x\|_{2}=\max_{x\in S^{m-1}}\|Bx\|_{2}=\|B\|.$$

That is, $\|B\|\leq\|A\|$. Note $r_{\infty}(A)=r_{\infty}(cA)$ for any nonzero scalar $c$, so we may assume the columns of $A$ all have norm 1. Under this assumption,

$$r_{\infty}(B)=\frac{\sum\|B_{j}\|_{2}^{2}}{\|B\|^{2}}=\frac{m}{M}\,\frac{\sum\|A_{j}\|_{2}^{2}}{\|B\|^{2}}\geq\frac{m}{M}\,\frac{\sum\|A_{j}\|_{2}^{2}}{\|A\|^{2}}=\frac{m}{M}r_{\infty}(A).$$

Recalling $r_{\infty}\geq 1$ always completes the proof. ∎
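A small numerical check of lemma 2.0.7 (ours; the sizes are arbitrary) on a random matrix whose columns are rescaled to unit norm:

```python
# r_inf of a column submatrix B versus the (m/M) r_inf(A) lower bound of lemma 2.0.7.
import numpy as np

def r_inf(A):
    s = np.linalg.svd(A, compute_uv=False)
    return np.sum(s**2) / s[0]**2

rng = np.random.default_rng(5)
D, M, m = 80, 60, 10
A = rng.standard_normal((D, M))
A /= np.linalg.norm(A, axis=0)                  # columns of constant (unit) norm
cols = rng.choice(M, size=m, replace=False)
B = A[:, cols]                                  # a D x m column submatrix
print(r_inf(B), max(m / M * r_inf(A), 1.0))     # the first value dominates the second
```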

If we set

$$\widehat{R}_{\infty}(M):=\widehat{R}_{\infty}(M;\mathcal{P}):=\min_{\Upsilon\in\mathcal{P}}r_{\infty}(\widehat{\Upsilon}),$$

we can state the following theorem.

Theorem 2.0.8.

Let $0<\eta<1$, with $\eta M\in\mathbb{N}$, and $0<\epsilon\leq 2/3$. For a set $X$ of $N$ points in $\mathbb{R}^{D}$, $\widehat{R}_{\infty}(M)$ as above, and $Z$ a $k\times D$ matrix with i.i.d. mean-zero unit-variance sub-gaussian entries, equation (JL$^{2}$) holds for at least $(1-2\eta)\binom{N}{2}$ pairs $x,x^{\prime}\in X$, with probability at least $1-\delta$, provided

$$k\geq\frac{C_{[2.0.6]}}{\epsilon^{2}}\left(\log(2e/\eta)\frac{M}{\widehat{R}_{\infty}(M)}+\frac{\log(N^{2}/(M\delta))}{\max(\eta\widehat{R}_{\infty}(M),1)}\right)\quad\text{and}\quad\gamma(\epsilon)=\gamma_{[2.0.6]}.$$

Here, $C_{[2.0.6]}$ depends on $K=\|Z_{11}\|_{\psi_{2}}$. In the case of independent standard Gaussian entries, $C_{[2.0.6]}<2.25$.

Proof.

In the proof of theorem 2.0.6, we shall replace lemma 2.0.4 by

$$p_{\pm}(\widehat{\Upsilon})\leq\binom{M}{\eta M}\exp\left(-ck\eta\,r_{\infty}(\widehat{\Upsilon})\min\left(\frac{\widetilde{\epsilon}_{-}^{2}}{K^{4}},\frac{\widetilde{\epsilon}_{-}}{K^{2}}\right)\right).$$

By lemma 2.0.7, $r_{\infty}(\widehat{\Upsilon}(\eta))\geq\eta r_{\infty}(\widehat{\Upsilon})$, no matter the subset $\widehat{\Upsilon}(\eta)$ chosen. We can then upper bound the right hand side of lemma 2.0.4, recalling $r_{4}\geq r_{\infty}$. ∎

The choice of decomposition $\mathcal{P}$ matters, yielding very different $\widehat{R}_{\infty}(M)$ values, even with $M$ fixed, as seen in the following.

Example 2.1.

The difference vectors for (the vertices of) the standard simplex in $\mathbb{R}^{D}$ are $\{e_{i}-e_{j}\}_{i\neq j}$, with $1\leq i,j\leq D$. Suppose the decomposition $\mathcal{P}$ involved “stars” made by subsets $\{e_{j}-e_{1}\}$ for $2\leq j\leq M+1\leq D$. Using these difference vectors as rows, the corresponding matrix $\widehat{\Upsilon}$ looks like (relabeling as necessary)

$$\widehat{\Upsilon}=\frac{1}{\sqrt{2}}\begin{pmatrix}\operatorname{Id}_{M}&-\mathbb{I}\end{pmatrix}$$

with $\mathbb{I}=(1,\ldots,1)^{\top}\in\mathbb{R}^{M}$. If $z=(0,\ldots,0,1)^{\top}$, we have $\|\widehat{\Upsilon}\|^{2}\geq\|\widehat{\Upsilon}z\|_{2}^{2}=M/2$, so $\widehat{R}_{\infty}(M)\leq r_{\infty}(\widehat{\Upsilon})\leq M/(M/2)=2$. Because $\widehat{R}_{\infty}(M)$ is bounded by a constant, this decomposition is of no use in theorem 2.0.8.

If we instead consider a decomposition involving “cycles” of length $D=M$, that is, subsets

$$\left\{e_{i}-e_{j}\;|\;i,j\text{ appear exactly twice}\right\},$$

then the corresponding difference vectors form a circulant matrix

$$\Upsilon=\begin{pmatrix}1&0&\cdots&0&-1\\-1&1&&&0\\&-1&\ddots&&\vdots\\&&\ddots&1&0\\&&&-1&1\end{pmatrix}.$$

Viewing $\Upsilon$ as a complex matrix, we can diagonalize it using the discrete Fourier transform matrix $V=F_{D}$ as $V^{-1}\Upsilon V=\mathrm{diag}(\lambda_{1},\ldots,\lambda_{D})$. See [GVL13, page 222]. Here,

$$\lambda_{j}=\left(\bar{F}_{D}\begin{pmatrix}1\\-1\\o\end{pmatrix}\right)_{j}=1-\omega^{j-1}\quad\text{with}\quad\omega=\exp(2\pi i/D).$$

The squares of the singular values of $\Upsilon$ are then the eigenvalues of $\Upsilon^{\top}\Upsilon$, that is,

$$\sigma_{j}^{2}(\Upsilon)=\lambda_{j}^{\ast}\lambda_{j}=(1-\bar{\omega}^{j-1})(1-\omega^{j-1})=2-2\cos(2\pi(j-1)/D)\leq 4,$$

with equality achieved when $D$ is even at $j=1+D/2$. Because the cycle has length $D$ here, we have $r_{\infty}(\Upsilon)\geq(2D)/4=D/2$ for each such cycle.

If such a decomposition involved only such cycles, we could conclude (because the vectors have constant norm) $\widehat{R}_{\infty}(M)\geq D/2$, which would be very useful for theorem 2.0.8.
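The contrast between the two decompositions is easy to see numerically; the sketch below (ours, with the arbitrary choice $D=101$) computes $r_{\infty}$ for a “star” batch and for a “cycle” batch of simplex difference vectors.

```python
# Stable rank of a "star" batch (about 2) versus a "cycle" batch (at least D/2).
import numpy as np

def r_inf(A):
    s = np.linalg.svd(A, compute_uv=False)
    return np.sum(s**2) / s[0]**2

D = 101                                          # odd, so cycles of length D exist
I = np.eye(D)
star = np.array([(I[j] - I[0]) / np.sqrt(2) for j in range(1, D)])            # rows e_j - e_1
cycle = np.array([(I[j] - I[(j + 1) % D]) / np.sqrt(2) for j in range(D)])    # circulant rows
print("star :", r_inf(star))                     # bounded by 2
print("cycle:", r_inf(cycle), ">=", D / 2)       # at least D/2
```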

We review in the next section a construction originally due to Walecki that provides such cycle decompositions as above when $D$ is odd, and the next best thing when $D$ is even. The construction will also be useful when our data set is drawn i.i.d.; in particular, we can address cases where the minimal stable ranks $R_{\infty}$ or $\widehat{R}_{\infty}$ are too pessimistic, but “most” batches have larger stable ranks.

3 The Walecki Construction

The sets $Y$ and $\widehat{Y}$ describe $\binom{N}{2}$ directions in space. Even if the data generating these directions is sampled independently, the directions themselves are not independent; for example, $x-y$ is not independent of $x-z$, while $x-y$ and $z-w$ are, assuming $x,y,z$, and $w$ are drawn independently. However, we are partitioning the directions into batches. If we can guarantee that in each batch the directions are independent or only “weakly” dependent, we can exploit these properties, ensuring the stable ranks of many batches are bounded below by given values.

Viewing $Y$ or $\widehat{Y}$ as corresponding to the edges of the complete graph $K_{N}$ on $N$ vertices, we are asking for a partition of the edges, a decomposition, such that each subset involves each vertex once, or at most twice (say). Thankfully, Walecki provided such a construction in the 1880s; see [Luc82, page 161, Sixième Récréation] for the original French and [Als08] for an English overview. We use a different indexing scheme than the one presented in [Als08], which is easier for us to use.

We can arrange $N$ vertices in the complex plane as follows: $N-1$ vertices corresponding to the $(N-1)$st roots of unity, and the remaining vertex at the origin $o$. Let $\omega=\exp(2\pi i/(N-1))$. We can thus label the nonzero vertices as $\{\omega^{j}\}_{j=0}^{N-2}$. We partition their corresponding edges based on their products $W_{p}:=\{\{\omega^{j},\omega^{k}\}\;|\;\omega^{j}\omega^{k}=\omega^{p}\}$, or more formally:

$$W_{p}:=\left\{\{\omega^{j},\omega^{k}\}\;|\;0\leq j\neq k\leq N-2\text{ and }j+k\equiv p\bmod N-1\right\}$$

The $N-1$ sets $\{W_{p}\}$ represent 1-regular subgraphs of $K_{N}$: each vertex has degree at most one, for if $\{\omega^{j},\omega^{k}\}$ and $\{\omega^{j},\omega^{l}\}$ are in $W_{p}$, then

$$j+k\equiv p\equiv j+l,\quad\text{that is,}\quad k\equiv p-j\equiv l,$$

forcing $k=l$ because $0\leq k,l\leq N-2$. There are only $N-1$ vertices on the circle, so there are at most $\lfloor(N-1)/2\rfloor$ edges in $W_{p}$.

The cyclic group $\mathbb{Z}_{N-1}$ generated by $\omega$ acts freely on the vertices of the circle via counterclockwise rotations, so $\mathbb{Z}_{N-1}$ also acts on the $W_{p}$ sets via

$$\omega W_{p}:=\left\{\{\omega\omega^{j},\omega\omega^{k}\}\;|\;j+k\equiv p\bmod N-1\right\}=W_{(p+2)\bmod N-1},$$

corresponding to the product $(\omega\omega^{j})(\omega\omega^{k})=\omega^{2}\omega^{j+k}$. Consequently, it is enough to discuss $W_{0}$ and $W_{1}$.

With $k>0$, the edges of $W_{0}$ are of the form $\{\omega^{k},\omega^{-k}\}$, while those of $W_{1}$ are of the form $\{\omega^{k},\omega^{-(k-1)}\}$. To each edge $\{\omega^{k},\omega^{-k}\}$ in $W_{0}$, there is a corresponding edge in $W_{1}$, namely $\{\omega^{k},\omega^{-(k-1)}\}$, so $|W_{1}|\geq|W_{0}|$, and $W_{1}$ includes the edge $\{\omega,1\}$. When $N-1$ is odd, the only nonzero vertex on the real line is $1$; when $N-1$ is even, both vertices $1$ and $-1$ are present. Consequently, $|W_{0}|$ is determined by the number of vertices strictly in the upper half plane:

$$|W_{0}|=\begin{cases}(N-2)/2&\text{ if $N-1$ is odd},\\(N-3)/2&\text{ if $N-1$ is even}.\end{cases}$$

When $N-1$ is odd, recall $|W_{1}|\leq\lfloor(N-1)/2\rfloor$ because $W_{1}$ is 1-regular, so by the above, $|W_{1}|=|W_{0}|$, with $W_{1}$ avoiding the vertex $\omega^{-(N-2)/2}=\omega^{N/2}$, as its leftmost edge is $\{\omega^{(N-2)/2},\omega^{-(((N-2)/2)-1)}\}$. Form the augmented sets $\tilde{W}_{0}=W_{0}\cup\{o,1\}$ and $\tilde{W}_{1}=W_{1}\cup\{\omega^{-(N-2)/2},o\}$; each $\tilde{W}_{i}$ is a 1-factor, as it is 1-regular and spans all $N$ vertices. We can thus form a cycle using $\tilde{W}_{0}$ and $\tilde{W}_{1}$: explicitly, in cycle notation,

$$(o,1,\omega^{1},\omega^{-1},\ldots,\omega^{j},\omega^{-j},\ldots,\omega^{(N-2)/2},\omega^{-(N-2)/2}),$$

which has length $2(N-2)/2+2=N$, as it should for covering all the $N$ vertices.

When $N-1$ is even, $|W_{0}|=(N-3)/2=\lfloor(N-1)/2\rfloor-1$, so $W_{1}$ can contain at most one additional edge; it does, via $\{-1,\omega^{-(k_{\ast}-1)}\}$ with $k_{\ast}=(N-1)/2$. If $N\geq 7$, there are at least two edges in $W_{0}$, so split $W_{0}$ into $W_{0}^{+}$ and $W_{0}^{-}$, corresponding to those vertices with nonnegative and strictly negative real parts respectively. When $|W_{0}|=(N-3)/2$ is even, that is, $N\equiv 3\bmod 4$, both $W_{0}^{\pm}$ are of the same size $(N-3)/4$, while in the other case, we have $|W_{0}^{+}|=\lceil(N-3)/4\rceil$ and $|W_{0}^{-}|=\lfloor(N-3)/4\rfloor$. Form the augmented sets $\tilde{W}_{0}^{+}=W_{0}^{+}\cup\{o,1\}$ and $\tilde{W}_{0}^{-}=W_{0}^{-}\cup\{o,-1\}$; these sets are 1-regular, of sizes

$$|\tilde{W}_{0}^{\pm}|=1+|W_{0}^{\pm}|=1+(N-3)/4=(N+1)/4\quad\text{when}\quad N\equiv 3\bmod 4,$$

while when $N\equiv 1\bmod 4$, that is, when $(N-3)/2$ is odd,

$$|\tilde{W}_{0}^{-}|=1+|W_{0}^{-}|=1+\frac{1}{2}\left(\frac{N-3}{2}-1\right)=\frac{N-1}{4}$$

and

$$|\tilde{W}_{0}^{+}|=1+|W_{0}^{+}|=1+\frac{1}{2}\left(\frac{N-3}{2}+1\right)=\frac{N+3}{4}.$$

The sets $\tilde{W}_{0}^{+},\tilde{W}_{0}^{-},W_{1}$ now form the cycle

$$(o,1,\omega^{1},\omega^{-1},\ldots,\omega^{j},\omega^{-j},\ldots,\omega^{(N-3)/2},\omega^{-(N-3)/2},-1),$$

which has length $2(N-3)/2+3=N$, again as it should.

Extending the $\mathbb{Z}_{N-1}$ group action to send the origin $o$ to itself, we can thus form $\lfloor(N-1)/2\rfloor$ cycles $\mathcal{C}_{j}$ of length $N$ using the above, recalling there are $N-1$ different $W_{p}$ sets. Explicitly,

$$\mathcal{C}_{j}=\begin{cases}\tilde{W}_{2j}^{+}\coprod\tilde{W}_{2j}^{-}\coprod W_{2j+1}&\text{ if $N-1$ is even},\\\tilde{W}_{2j}\coprod\tilde{W}_{2j+1}&\text{ if $N-1$ is odd}.\end{cases}\qquad(3.1)$$

When $N-1$ is even, all $\binom{N}{2}$ edges are covered, while when $N-1$ is odd, $\tilde{W}_{N-2}$ still remains, but is still a 1-factor.

We thus have, considering the parity of $N$ instead of $N-1$,

Lemma 3.0.1 (Walecki Construction).

The complete graph $K_{N}$ has a decomposition into $(N-1)/2$ cycles of length $N$ when $N$ is odd, and $(N-2)/2$ cycles of length $N$ and a 1-factor when $N$ is even.
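The indexing behind the construction is short to implement; the sketch below (ours, with an arbitrary $N$) builds the sets $W_{p}$ on the nonzero vertices, checks that each is 1-regular, and reports their sizes, matching the counts above. It stops short of the augmentation by the origin and the splitting of $W_{0}$.

```python
# The W_p partition of the edges among the N-1 circle vertices, as in the text:
# W_p collects the edges {j, k}, j != k, with j + k = p (mod N-1).
import itertools

def walecki_subgraphs(N):
    n = N - 1                                     # nonzero vertices are 0, ..., N-2
    W = {p: [] for p in range(n)}
    for j, k in itertools.combinations(range(n), 2):
        W[(j + k) % n].append((j, k))
    return W

N = 12
W = walecki_subgraphs(N)
for p, edges in W.items():
    verts = [v for e in edges for v in e]
    assert len(verts) == len(set(verts))          # 1-regular: each vertex appears at most once
print({p: len(edges) for p, edges in W.items()})  # sizes (N-2)/2 or (N-3)/2, as computed above
```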

For reference later, we also record

Corollary 3.0.2.

Consider the cycles in lemma 3.0.1, corresponding to equation (3.1). When $N$ is even, these cycles split as a pair of 1-factors of size $N/2$. When $N$ is odd, the cycles split as three 1-regular subgraphs when $N\geq 7$. When $N\equiv 3\bmod 4$, their sizes are

$$|W_{2j+1}|=(N-1)/2\quad\text{and}\quad|\tilde{W}_{2j}^{\pm}|=(N+1)/4,$$

while when $N\equiv 1\bmod 4$, their sizes are

$$|W_{2j+1}|=(N-1)/2,\quad|\tilde{W}_{2j}^{-}|=(N-1)/4,\quad\text{and}\quad|\tilde{W}_{2j}^{+}|=(N+3)/4.$$

Returning to the sets of difference vectors $y_{ij}=x_{i}-x_{j}\in Y$ with $X\subset\mathbb{R}^{D}$ a set of $N$ points, we can assign each $y_{ij}$ to a unique 1-regular subgraph by the correspondence $y_{ij}\leftrightarrow\{\omega^{i},\omega^{j}\}$ when $0<i<j$ and $y_{0j}\leftrightarrow\{o,\omega^{j}\}$ for $0<j$. Most useful for us is the following lemma; note we shall be considering batches of size (at least) $M$ drawn from within these subgraphs.

Lemma 3.0.3.

Let $X\subset\mathbb{R}^{D}$ be a set of $N$ points drawn i.i.d. from a common distribution. With the correspondence above, the vectors within each subgraph from corollary 3.0.2 and lemma 3.0.1 are independent, as are their unit norm versions. When $N$ is even, there are $N-1$ subgraphs involved. When $N\geq 7$ is odd, there are $3(N-1)/2$ subgraphs involved.

Proof.

The subgraphs are 1-regular, so each vertex corresponding to a point of $X$ appears at most once; independence follows as no two distinct edges share a common vertex. In the unit norm case, we are applying the same function $v\mapsto v/\|v\|_{2}$ to independent vectors, so the results are independent too. The rest of the lemma is immediate. ∎

We can now prove

Theorem 3.0.4.

For $X$ the standard simplex in $\mathbb{R}^{D}$, it suffices to take

$$k=\frac{2C_{[2.0.6]}}{\epsilon^{2}}\left(\log(2e/\eta)+\frac{\log(D/\delta)}{\max(\eta D,1)}\right)$$

for theorem 2.0.8.

Note $k$ is bounded independently of $D$ as soon as $\eta D\gtrsim\log(D/\delta)$.

Proof.

Taking $M=D$ in theorem 2.0.8, we can use the Walecki construction 3.0.1 as is to control $\widehat{R}_{\infty}(D)$. When $D$ is odd, the computations from example 2.1 show $\widehat{R}_{\infty}(D)\geq D/2$, while when $D$ is even, the 1-factor from the Walecki construction is of size $D/2$, with mutually orthogonal vectors of constant norm, so its stable rank is equal to its rank, $D/2$. Thus $\widehat{R}_{\infty}(D)\geq D/2$ for both parities of $D$. ∎
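Plugging numbers into the bound of theorem 3.0.4 shows the plateau in $D$ numerically. In the sketch below (ours), the constant uses the Gaussian-case bound $C_{[2.0.6]}<2.25$ quoted in theorem 2.0.6, and the choices of $\epsilon$, $\eta$, and $\delta$ are arbitrary.

```python
# k from theorem 3.0.4 for the simplex: bounded in D once eta*D exceeds log(D/delta).
import numpy as np

def k_simplex(D, eps=0.5, eta=0.05, delta=0.01, C=2.25):
    return 2 * C / eps**2 * (np.log(2 * np.e / eta) + np.log(D / delta) / max(eta * D, 1.0))

for D in [10, 100, 1000, 10000, 100000]:
    print(D, round(k_simplex(D)))      # the values level off as D grows
```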

4 Further Theorems for i.i.d. Samples

Let $\xi\in\mathbb{R}^{D}$ be a given random vector. In this section, the following assumption 4.1 will be in play for the $k\times D$ dimension reduction matrix $Z$ and the dataset $X$ of $N$ points in $\mathbb{R}^{D}$. The theorems will then be stated in terms of additional assumptions on the distribution of $\xi$.

Assumption 4.1.

The matrix $Z$ has i.i.d. zero-mean unit-variance sub-gaussian entries. The dataset $X=\{x_{i}\}_{i=0}^{N-1}\subset\mathbb{R}^{D}$ is drawn independently of $Z$, with $x_{i}\overset{\text{i.i.d.}}{\sim}\xi$.

In practice, datasets need not be drawn i.i.d. from some underlying distribution; however, if the number of points $N$ is very large, it may be useful or convenient to subsample the data in order to fit it in memory or to try to estimate properties of the data. The theorems in this section may then be read as applying to an i.i.d. (sub)sample of the data, that is, drawing $N$ points uniformly with replacement from the underlying dataset, redefining $N$ to be the new sample size, and redefining $\xi$ to be drawn from the discrete uniform measure on the underlying data.

With the Walecki construction in hand and assumption 4.1, we can give refinements of theorems 2.0.6 and 2.0.8. The theorem statements will have failure probabilities in terms of the draw of the pair $(Z,X)$. (There should be no confusion with the dot product notation.) Both of the theorems just mentioned give a failure probability $\delta_{Z}$ for $Z$ once $X$ is fixed, and this probability only depends on stable rank properties of the data set $X$, or more accurately, the difference vector set $Y$. These theorems in turn rely on lemma 2.0.4, which bounds the probabilities in equation (2.1) for $Z$ acting on a given batch $\Upsilon$ or its unit norm version in terms of stable ranks of minibatches $\Upsilon(\eta)$. In many of the examples that follow, the assumptions on the data only guarantee $r_{\infty}(\Upsilon)$ or $r_{\infty}(\Upsilon(\eta))$ is above some threshold most of the time, say for a fraction $(1-\zeta)$ of all batches, instead of all of the time. We can use lemma 2.0.4 on those “good” batches, and still conclude that $(1-\zeta)(1-2\eta)$ of all distances are approximately preserved with high probability. To be concrete, let $\mathcal{E}_{Z}$ be the event that $\|Z\widehat{y}\|_{2}^{2}$ is approximately preserved for $(1-\zeta)(1-2\eta)$ of the vectors $y\in Y$ – provided the batches considered have $r_{\infty}(\Upsilon)\geq r_{0}$, for some threshold $r_{0}$. Let $\mathcal{E}_{X}$ be the event that $r_{\infty}(\Upsilon)\geq r_{0}$ for at least $(1-\zeta)$ of the batches considered in the decomposition $\mathcal{P}$. We then have

$$\begin{aligned}
\mathbb{P}_{Z\times X}\left\{\mathcal{E}_{Z}^{c}\right\}&=\mathbb{P}_{Z\times X}\left\{\mathcal{E}_{Z}^{c}\cap\mathcal{E}_{X}\right\}+\mathbb{P}_{Z\times X}\left\{\mathcal{E}_{Z}^{c}\cap\mathcal{E}_{X}^{c}\right\}\\
&=\mathbb{P}_{Z\times X}\left\{\mathcal{E}_{Z}^{c}\;|\;\mathcal{E}_{X}\right\}\mathbb{P}_{X}\left\{\mathcal{E}_{X}\right\}+\mathbb{P}_{Z\times X}\left\{\mathcal{E}_{Z}^{c}\cap\mathcal{E}_{X}^{c}\right\}\\
&\leq\mathbb{P}_{Z\times X}\left\{\mathcal{E}_{Z}^{c}\;|\;\mathcal{E}_{X}\right\}+\mathbb{P}_{X}\left\{\mathcal{E}_{X}^{c}\right\}\\
&\leq\delta_{Z}+\delta_{X}.
\end{aligned}$$

In the rest of this section, we use the 1-regular subgraph version of the Walecki construction, lemma 3.0.3, to define the decomposition $\mathcal{P}$ and the underlying batches $\Upsilon$. Specifically, each 1-regular subgraph $W$ is of size at least $(N-1)/4$, and we partition the subgraph edges into batches $\Upsilon$ of size $M$ – each edge corresponding to a difference vector. Under assumption 4.1, the edges within each subgraph are exchangeable, so they can be assigned to batches in an arbitrary manner as long as the batch size is respected. For any remainder when $M$ does not divide the subgraph size $|W|$, we can modify the definition of $\tilde{\eta}$ from equation (2.3) to apply with $|W|$ in place of $N$; any of the expanded batches are still of size at most $2M-1$. Note when $N$ is odd, the subgraphs are not all of the same size, so the $\tilde{\eta}$ will vary accordingly.

We first present one case where $\zeta$ is 0, that is, we are able to control $r_{\infty}(\Upsilon(\eta))$ for every minibatch, with high probability.

Theorem 4.0.1.

Under assumption 4.1, equation (JL) holds for at least $(1-2\eta)\binom{N}{2}$ pairs $x,x^{\prime}\in X$, with probability at least $1-3\delta$ over the draw of $(Z,X)$, provided

$$k\geq\frac{C_{[2.0.6]}}{\epsilon^{2}}\,\frac{4C^{2}\|\xi_{1}\|_{\psi_{2}}^{2}(1+(1+\alpha)^{2})}{(1-t)}\left(\log(2e/\eta)+\frac{\log(N^{2}/(D\delta))}{\eta D}\right)$$

when $\eta D$ is a strictly positive integer, with

$$N\geq D\geq\max\left(\frac{1}{\alpha^{2}}\log(N^{2}/(D\delta)),\,\frac{(2\|\xi_{1}\|_{\psi_{2}}^{2}+1/\log(2))^{2}}{ct^{2}}\left(\frac{\log(N^{2}/(D\delta))}{\eta D}+\log(2e/\eta)\right)\right).$$

To make some sense of the above, note that if $N\geq D\gtrsim k$, it suffices to take $t=\epsilon$ and

$$k\geq\frac{C_{[2.0.6]}}{\epsilon^{2}}\,\frac{4C^{2}\|\xi_{1}\|_{\psi_{2}}^{2}(1+(1+\alpha)^{2})}{(1-\epsilon)}\left(\log(2e/\eta)+\frac{\alpha^{2}}{\eta}\right),$$

recalling $D\geq\alpha^{-2}\log(N^{2}/(D\delta))$ too. This target dimension is roughly the same as for the simplex in theorem 3.0.4 with $D=N$ there, despite the very different sparsity properties of these two datasets.

Proof.

With XX defined as in the theorem statement and ξ\xi^{\prime} an independent draw of ξ\xi, the difference vector set YY is drawn i.i.d. from ξξ\xi-\xi^{\prime}, so it is immediately mean-zero with i.i.d. coordinates. Because the stable ranks do not see constant scaling, we can work with Y/2Y/\sqrt{2}, so that (ξξ)/2(\xi-\xi^{\prime})/\sqrt{2} now has unit variance coordinates. The sub-gaussian norm for a coordinate (ξ1ξ1)/2(\xi_{1}-\xi_{1}^{\prime})/\sqrt{2} of yY/2y\in Y/\sqrt{2} is at most 2ξ1ψ2\sqrt{2}\left\lVert\xi_{1}\right\rVert_{\psi_{2}} by the triangle inequality.

We control R(ηM)R_{\infty}(\eta M) in theorem 2.0.6 by showing each batch Υ(η)\Upsilon(\eta) has high stable rank. Because r(Υ(η))=Υ(η)F2/Υ(η)2r_{\infty}(\Upsilon(\eta))=\left\lVert\Upsilon(\eta)\right\rVert_{F}^{2}/\left\lVert\Upsilon(\eta)\right\rVert^{2}, we shall control numerator and denominator separately.

We have 𝔼Υ(η)F2=DηM\mathbb{E}\left\lVert\Upsilon(\eta)\right\rVert_{F}^{2}=D\eta M. Further, Bernstein’s inequality for the mean-zero sub-exponential random variables Υij21\Upsilon_{ij}^{2}-1 yields [Ver18, page 34]

{|1DηMΥ(η)F21|t}2exp(cmin(t2K2,tK)DηM)\mathbb{P}\left\{\left\lvert\frac{1}{D\eta M}\left\lVert\Upsilon(\eta)\right\rVert_{F}^{2}-1\right\rvert\geq t\right\}\leq 2\exp\left(-c\min\left(\frac{t^{2}}{K^{2}},\frac{t}{K}\right)D\eta M\right)

with K=Υ1121ψ1K=\left\lVert\Upsilon_{11}^{2}-1\right\rVert_{\psi_{1}}. For us with t(0,1)t\in(0,1), the above gives

{(1t)DηMΥ(η)F2}2exp(cDηMmax(K2,K)t2).\mathbb{P}\left\{(1-t)D\eta M\geq\left\lVert\Upsilon(\eta)\right\rVert_{F}^{2}\right\}\leq 2\exp\left(-\frac{cD\eta M}{\max(K^{2},K)}t^{2}\right).

To lock down KK, recall aψ1=|a|/log(2)\left\lVert a\right\rVert_{\psi_{1}}=\left\lvert a\right\rvert/\log(2) for constants aa, so

K=Υ1121ψ1Υ112ψ1+𝔼Υ112ψ1Υ11ψ22+𝔼Υ112log(2)=2ξ1ψ22+1log(2).\displaystyle K=\left\lVert\Upsilon_{11}^{2}-1\right\rVert_{\psi_{1}}\leq\left\lVert\Upsilon_{11}^{2}\right\rVert_{\psi_{1}}+\left\lVert\mathbb{E}\Upsilon_{11}^{2}\right\rVert_{\psi_{1}}\leq\left\lVert\Upsilon_{11}\right\rVert^{2}_{\psi_{2}}+\frac{\mathbb{E}\Upsilon_{11}^{2}}{\log(2)}=2\left\lVert\xi_{1}\right\rVert^{2}_{\psi_{2}}+\frac{1}{\log(2)}.

Recall lemma 2.0.7, which allowed us to state r(Υ^(η))ηr(Υ^)r_{\infty}(\widehat{\Upsilon}(\eta))\geq\eta r_{\infty}(\widehat{\Upsilon}) always, because the columns of Υ^\widehat{\Upsilon} had constant norm. We can give a high probability replacement for that lemma in this context, but we require that it hold over all (N2)/M<N2/(2M)\left\lfloor\binom{N}{2}/M\right\rfloor<N^{2}/(2M) batches Υ\Upsilon, that is,

2MδN2\displaystyle\frac{2M\delta}{N^{2}} (MηM)2exp(cDηMmax(K2,K)t2)\displaystyle\geq\binom{M}{\eta M}2\exp\left(-\frac{cD\eta M}{\max(K^{2},K)}t^{2}\right)
log(Mδ/N2)\displaystyle\log(M\delta/N^{2}) log(MηM)cDηMmax(K2,K)t2\displaystyle\geq\log\binom{M}{\eta M}-\frac{cD\eta M}{\max(K^{2},K)}t^{2}
cDηMmax(K2,K)t2\displaystyle\frac{cD\eta M}{\max(K^{2},K)}t^{2} log(N2/(Mδ))+log(MηM).\displaystyle\geq\log(N^{2}/(M\delta))+\log\binom{M}{\eta M}.

Taking into account the η~\tilde{\eta} cases, as mentioned before this proof, we require

D(2ξ1ψ22+1/log(2))2ct2(log(N2/(Mδ))ηM+log(2e/η)).D\geq\frac{(2\left\lVert\xi_{1}\right\rVert_{\psi_{2}}^{2}+1/\log(2))^{2}}{ct^{2}}\left(\frac{\log(N^{2}/(M\delta))}{\eta M}+\log(2e/\eta)\right).

For ambient dimensions DD at least this large, every single batch Υ\Upsilon satisfies

r(Υ(η))(1t)DηMΥ(η)2(1t)ηDMΥ2r_{\infty}(\Upsilon(\eta))\geq\frac{(1-t)D\eta M}{\left\lVert\Upsilon(\eta)\right\rVert^{2}}\geq(1-t)\eta\frac{DM}{\left\lVert\Upsilon\right\rVert^{2}}

with total failure probability at most δ\delta.

We now only need to control Υ2\left\lVert\Upsilon\right\rVert^{2} instead of Υ(η)2\left\lVert\Upsilon(\eta)\right\rVert^{2}, and we can use the known result [Ver18, page 85] for matrices with mean-zero independent sub-gaussian entries – recalling the ψ2\psi_{2}-norm of each entry is at most 2ξ1ψ22\left\lVert\xi_{1}\right\rVert_{\psi_{2}}, a weakening of the 2ξ1ψ2\sqrt{2}\left\lVert\xi_{1}\right\rVert_{\psi_{2}} bound mentioned above –

ΥC(2ξ1ψ2)(D+M+s)\left\lVert\Upsilon\right\rVert\leq C(2\left\lVert\xi_{1}\right\rVert_{\psi_{2}})(\sqrt{D}+\sqrt{M}+s)

with probability at least 12exp(s2)1-2\exp(-s^{2}). It is proved via an ϵ\epsilon-net argument. For us, with s=αDs=\alpha\sqrt{D}, the above gives

Υ2>4C2ξ1ψ22(M+(1+α)2D)\left\lVert\Upsilon\right\rVert^{2}>4C^{2}\left\lVert\xi_{1}\right\rVert_{\psi_{2}}^{2}(M+(1+\alpha)^{2}D)

with probability at most 2exp(α2D)2\exp(-\alpha^{2}D). Consequently,

r(Υ(η))(1t)ηDM4C2ξ1ψ22(M+(1+α)2D)=(1t)ηM4C2ξ1ψ22((M/D)+(1+α)2)r_{\infty}(\Upsilon(\eta))\geq\frac{(1-t)\eta DM}{4C^{2}\left\lVert\xi_{1}\right\rVert_{\psi_{2}}^{2}(M+(1+\alpha)^{2}D)}=\frac{(1-t)\eta M}{4C^{2}\left\lVert\xi_{1}\right\rVert_{\psi_{2}}^{2}((M/D)+(1+\alpha)^{2})} (4.1)

over all minibatches Υ(η)\Upsilon(\eta) from all batches Υ\Upsilon with total failure probability at most

N22M2exp(α2D)+δ2δprovidedD1α2log(N2/(Mδ))as well.\frac{N^{2}}{2M}2\exp(-\alpha^{2}D)+\delta\leq 2\delta\quad\text{provided}\quad D\geq\frac{1}{\alpha^{2}}\log(N^{2}/(M\delta))\quad\text{as well.}\quad

So we may take the right hand side of equation (4.1) as our R(ηM)R_{\infty}(\eta M) value. Plugging into theorem 2.0.6 yields

kC2.0.6ϵ24C2ξ1ψ22((M/D)+(1+α)2)(1t)(log(2e/η)+log(N2/(Mδ))ηM).k\geq\frac{C_{\ref{thm:FreeDecompBulkJL}}}{\epsilon^{2}}\frac{4C^{2}\left\lVert\xi_{1}\right\rVert_{\psi_{2}}^{2}((M/D)+(1+\alpha)^{2})}{(1-t)}\left(\log(2e/\eta)+\frac{\log(N^{2}/(M\delta))}{\eta M}\right).

Setting M=DM=D completes the proof. ∎

4.1 Controlling “Most” Batches

Unlike in theorem 4.0.1 above, the dataset XX need not in general be so well-behaved, and controlling r(Υ(η))r_{\infty}(\Upsilon(\eta)) across all minibatches may not be possible. To make things more manageable, we shall now work with theorem 2.0.8, using the unit normalized batches Υ^\widehat{\Upsilon} and working with batches of size at least MM instead of ηM\eta M. For general random vectors ξ\xi, we shall not be able to guarantee that Υ^\widehat{\Upsilon} has high stable rank, but we shall guarantee that a fraction (1ζ)(1-\zeta) of the batches have stable rank comparable to a “typical” value associated with ξ\xi.

The columns of Υ^\widehat{\Upsilon} have constant unit norm, so it is enough to control Υ^2=Υ^Υ^\left\lVert\widehat{\Upsilon}\right\rVert^{2}=\left\lVert\widehat{\Upsilon}\widehat{\Upsilon}^{\top}\right\rVert in order to control r(Υ^)r_{\infty}(\widehat{\Upsilon}). It will be convenient to introduce some new notation for this purpose; we present it first in its unnormalized version.

Because we are still using the Walecki construction via lemma 3.0.3 to define the batches, the columns in Υ\Upsilon are independent, each drawn like the given random vector y:=ξξDy:=\xi-\xi^{\prime}\in\mathbb{R}^{D}, with ξ\xi^{\prime} an independent copy of ξ\xi. The corresponding second moment matrix Σ:=Σ(y):=𝔼yy\Sigma:=\Sigma(y):=\mathbb{E}yy^{\top} is twice the covariance matrix for ξ\xi, for if 𝔼ξ=μ\mathbb{E}\xi=\mu,

𝔼yy\displaystyle\mathbb{E}yy^{\top} =𝔼(ξξ)(ξξ)=2(Σ(ξ)μμ)=2(𝔼(ξμ)(ξμ))=2Cov(ξ).\displaystyle=\mathbb{E}(\xi-\xi^{\prime})(\xi-\xi^{\prime})^{\top}=2(\Sigma(\xi)-\mu\mu^{\top})=2(\mathbb{E}(\xi-\mu)(\xi-\mu)^{\top})=2\,\mathrm{Cov}(\xi).

Consequently, r(y)r(y) is the effective rank of Cov(ξ)\mathrm{Cov}(\xi), as the factors of 2 will cancel in the ratio.

Now Σ(y)\Sigma(y) may be approximated by its empirical version

ΣM:=1MΥΥ=1Mi=1Myiyi.Recall also𝔼y22=𝔼tryy=𝔼tryy=trΣ.\Sigma_{M}:=\frac{1}{M}\Upsilon\Upsilon^{\top}=\frac{1}{M}\sum_{i=1}^{M}y_{i}y_{i}^{\top}.\quad\text{Recall also}\quad\mathbb{E}\left\lVert y\right\rVert_{2}^{2}=\mathbb{E}\,\mathrm{tr}\,{y^{\top}y}=\mathbb{E}\,\mathrm{tr}\,{yy^{\top}}=\,\mathrm{tr}\,\Sigma.

The unit normalized versions will depend on y^\widehat{y}, with Σ^:=𝔼y^y^\widehat{\Sigma}:=\mathbb{E}\widehat{y}\widehat{y}^{\top}, MΣ^M:=Υ^Υ^M\widehat{\Sigma}_{M}:=\widehat{\Upsilon}\widehat{\Upsilon}^{\top}, and r^:=r(y^)=1/Σ^\widehat{r}:=r(\widehat{y})=1/\left\lVert\widehat{\Sigma}\right\rVert. The connection between rr and r^\widehat{r} is not so immediate, but we shall return to this point in section 4.1.1.
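As a numerical illustration of these quantities (a Monte Carlo sketch only; the anisotropic toy distribution below is made up for the example), one can compare the effective rank of the covariance of the data vector with its unit normalized counterpart.

```python
import numpy as np

rng = np.random.default_rng(5)
D, n = 50, 100_000

scales = np.linspace(1.0, 10.0, D)            # independent coordinates with very different scales
xi  = scales * rng.standard_normal((n, D))
xi2 = scales * rng.standard_normal((n, D))    # an independent copy of xi
y = xi - xi2
y_hat = y / np.linalg.norm(y, axis=1, keepdims=True)

Sigma     = y.T @ y / n                       # approximates E y y^T = 2 Cov(xi)
Sigma_hat = y_hat.T @ y_hat / n               # approximates E yhat yhat^T
r     = np.trace(Sigma) / np.linalg.norm(Sigma, 2)   # effective rank of Cov(xi)
r_hat = 1.0 / np.linalg.norm(Sigma_hat, 2)            # since the trace of Sigma_hat equals 1
print(r, r_hat)
```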

The following is implicit in [Ver18, section 5.6]; we include the proof here, as we want explicit constants, and we plan to take K=1K=1, applying it to Υ^\widehat{\Upsilon}. Controlling the deviation Σ^MΣ^\left\lVert\widehat{\Sigma}_{M}-\widehat{\Sigma}\right\rVert will be the way we eventually show r(Υ^)r^r_{\infty}(\widehat{\Upsilon})\gtrsim\widehat{r} “most” of the time.

Lemma 4.1.1.

Let Υ={yi}i=1MD\Upsilon=\left\{y_{i}\right\}_{i=1}^{M}\subset\mathbb{R}^{D} be as above, with the yii.i.d.yy_{i}\overset{\text{i.i.d.}}{\sim}y, and assume for some K1K\geq 1, that

y22K2𝔼y22almost surely.\left\lVert y\right\rVert_{2}^{2}\leq K^{2}\mathbb{E}\left\lVert y\right\rVert_{2}^{2}\quad\text{almost surely.}\quad

Then, with failure probability at most 2eu2e^{-u},

ΣMΣΣ((4K2/3)r(u+logD)M+2K2r(u+logD)M)for r=r(y)=trΣ(y)Σ(y).\frac{\left\lVert\Sigma_{M}-\Sigma\right\rVert}{\left\lVert\Sigma\right\rVert}\leq\left(\frac{(4K^{2}/3)r(u+\log D)}{M}+\sqrt{\frac{2K^{2}r(u+\log D)}{M}}\right)\quad\text{for $r=r(y)=\frac{\,\mathrm{tr}\,\Sigma(y)}{\left\lVert\Sigma(y)\right\rVert}$.}\quad
Proof.

Because the matrices yiyiΣy_{i}y_{i}^{\top}-\Sigma are symmetric, i.i.d., and mean-zero, we can use the matrix Bernstein inequality: for t0t\geq 0,

{i=1M(yiyiΣ)t}2Dexp(t2/2σ2+bt/3)\mathbb{P}\left\{\left\lVert\sum_{i=1}^{M}(y_{i}y_{i}^{\top}-\Sigma)\right\rVert\geq t\right\}\leq 2D\exp\left(-\frac{t^{2}/2}{\sigma^{2}+bt/3}\right)

with byiyiΣb\geq\left\lVert y_{i}y_{i}^{\top}-\Sigma\right\rVert and σ2=i=1M𝔼(yiyiΣ)2\sigma^{2}=\left\lVert\sum_{i=1}^{M}\mathbb{E}(y_{i}y_{i}^{\top}-\Sigma)^{2}\right\rVert. We want the right hand side to be at most 2eu2e^{-u}, that is,

s=log(D)+ut2/2σ2+bt/3,or0t22sb3t2σ2s.s=\log(D)+u\leq\frac{t^{2}/2}{\sigma^{2}+bt/3},\quad\text{or}\quad 0\leq t^{2}-\frac{2sb}{3}t-2\sigma^{2}s.

The positive root is at

t=12(2sb3+(2sb3)2+8σ2s).t^{\ast}=\frac{1}{2}\left(\frac{2sb}{3}+\sqrt{\Big{(}\frac{2sb}{3}\Big{)}^{2}+8\sigma^{2}s}\right).

Because the other root of the quadratic is negative, taking any ttt\geq t^{\ast} will suffice for us. Because the square root function is subadditive, it is safe to take t=(2sb/3)+σ2st=(2sb/3)+\sigma\sqrt{2s}.

We now just need to estimate bb and σ2\sigma^{2}. Estimate bb via

yiyiΣ\displaystyle\left\lVert y_{i}y_{i}^{\top}-\Sigma\right\rVert yiyi+Σ=yi22+Σ\displaystyle\leq\left\lVert y_{i}y_{i}^{\top}\right\rVert+\left\lVert\Sigma\right\rVert=\left\lVert y_{i}\right\rVert_{2}^{2}+\left\lVert\Sigma\right\rVert
K2trΣ+Σ2K2trΣ=:b.\displaystyle\leq K^{2}\,\mathrm{tr}\,\Sigma+\left\lVert\Sigma\right\rVert\leq 2K^{2}\,\mathrm{tr}\,\Sigma=:b.

For σ2\sigma^{2}, the i.i.d. assumption already gives σ2=M𝔼(yyΣ)2\sigma^{2}=\left\lVert M\mathbb{E}(yy^{\top}-\Sigma)^{2}\right\rVert. Expanding the square,

𝔼(yyΣ)2=𝔼(yy)2Σ2,\mathbb{E}(yy^{\top}-\Sigma)^{2}=\mathbb{E}(yy^{\top})^{2}-\Sigma^{2},

while (yy)2=y(yy)y=y22yy(yy^{\top})^{2}=y(y^{\top}y)y=\left\lVert y\right\rVert_{2}^{2}yy^{\top}. Taking expectations on v(yy)2vv^{\top}(yy^{\top})^{2}v with vSD1v\in S^{D-1} gives

v𝔼(yy)2v=𝔼y22vyyvK2(trΣ)𝔼vyyv=K2(trΣ)vΣv.v^{\top}\mathbb{E}(yy^{\top})^{2}v=\mathbb{E}\left\lVert y\right\rVert_{2}^{2}v^{\top}yy^{\top}v\leq K^{2}(\,\mathrm{tr}\,\Sigma)\mathbb{E}v^{\top}yy^{\top}v=K^{2}(\,\mathrm{tr}\,\Sigma)v^{\top}\Sigma v.

Taking the maximum over vSD1v\in S^{D-1} gives

𝔼(yy)2K2trΣΣandσ2MK2trΣΣ,\left\lVert\mathbb{E}(yy^{\top})^{2}\right\rVert\leq K^{2}\,\mathrm{tr}\,\Sigma\left\lVert\Sigma\right\rVert\quad\text{and}\quad\sigma^{2}\leq MK^{2}\,\mathrm{tr}\,\Sigma\left\lVert\Sigma\right\rVert,

as Σ2-\Sigma^{2} is negative semidefinite.

Recalling our choice of tt and that r=trΣ/Σr=\,\mathrm{tr}\,\Sigma/\left\lVert\Sigma\right\rVert, we find

t\displaystyle t =2sb3+σ2s\displaystyle=\frac{2sb}{3}+\sigma\sqrt{2s}
=43K2trΣ(u+logD)+2(u+logD)MK2trΣΣ\displaystyle=\frac{4}{3}K^{2}\,\mathrm{tr}\,\Sigma(u+\log D)+\sqrt{2(u+\log D)MK^{2}\,\mathrm{tr}\,\Sigma\left\lVert\Sigma\right\rVert}
=MΣ((4K2/3)r(u+logD)M+2K2r(u+logD)M).\displaystyle=M\left\lVert\Sigma\right\rVert\left(\frac{(4K^{2}/3)r(u+\log D)}{M}+\sqrt{\frac{2K^{2}r(u+\log D)}{M}}\right).

With this tt value, we finally have

{ΣMΣtM}2eu,as desired.\mathbb{P}\left\{\left\lVert\Sigma_{M}-\Sigma\right\rVert\geq\frac{t}{M}\right\}\leq 2e^{-u},\quad\text{as desired.}\quad\qed
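A quick Monte Carlo check of lemma 4.1.1 (a sketch under the extra assumption that the unit vectors are uniform on the sphere, so that K equals 1, the second moment matrix is the identity divided by D, and the intrinsic dimension equals D):

```python
import numpy as np

rng = np.random.default_rng(6)
D, M, u = 100, 2000, 3.0

Y = rng.standard_normal((M, D))
Y /= np.linalg.norm(Y, axis=1, keepdims=True)        # y_hat uniform on the sphere, K = 1
Sigma_hat_M = Y.T @ Y / M
rel_dev = np.linalg.norm(Sigma_hat_M - np.eye(D) / D, 2) * D   # ||Sigma_M - Sigma|| / ||Sigma||

r = D                                                # intrinsic dimension of the uniform sphere
tau = (4.0 / 3.0) * r * (u + np.log(D)) / M + np.sqrt(2.0 * r * (u + np.log(D)) / M)
print(rel_dev, tau)   # rel_dev should fall below tau with probability at least 1 - 2 exp(-u)
```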

If we use the unit normalized Υ^\widehat{\Upsilon}, lemma 4.1.1 provides the following upper tail probability

PX{Σ^MΣ^>τ+(u)}2exp(u)P_{X}\left\{\left\lVert\widehat{\Sigma}_{M}-\widehat{\Sigma}\right\rVert>\tau_{+}(u)\right\}\leq 2\exp(-u) (4.2)

with τ+(u)\tau_{+}(u) a function of uu. By the triangle inequality, we should immediately have

Υ^2=MΣ^MMΣ^MΣ^+MΣ^MΣ^(1+τ+(u)),\left\lVert\widehat{\Upsilon}\right\rVert^{2}=\left\lVert M\widehat{\Sigma}_{M}\right\rVert\leq M\left\lVert\widehat{\Sigma}_{M}-\widehat{\Sigma}\right\rVert+M\left\lVert\widehat{\Sigma}\right\rVert\leq M\left\lVert\widehat{\Sigma}\right\rVert(1+\tau_{+}(u)),

so that r(Υ^)r^/(1+τ+(u))r_{\infty}(\widehat{\Upsilon})\geq\widehat{r}/(1+\tau_{+}(u)) with failure probability 2exp(u)2\exp(-u). This tail probability is too weak to control r(Υ^)r_{\infty}(\widehat{\Upsilon}) for every batch with high probability, because uu will need to be too large to be useful, so we allow a fraction ζ>0\zeta>0 of the batches to fail. We consider order statistics of real-valued functions of Υ\Upsilon across batches, namely Σ^MΣ^\left\lVert\widehat{\Sigma}_{M}-\widehat{\Sigma}\right\rVert – because the batches within each 1-regular subgraph are independent, we can exploit that independence to inform the choice for uu using the following lemma.

Lemma 4.1.2.

Let {ωi}0n1\left\{\omega_{i}\right\}_{0}^{n-1} be i.i.d. random variables. Let ζ(0,1)\zeta\in(0,1) with ζn\zeta n an integer. If

{ω0>τ+(u)}2eu,then{ω((1ζ)n)>τ+(u)}exp(ζn(log(e/ζ)+log(2)u)).\mathbb{P}\left\{\omega_{0}>\tau_{+}(u)\right\}\leq 2e^{-u},\quad\text{then}\quad\mathbb{P}\left\{\omega_{((1-\zeta)n)}>\tau_{+}(u)\right\}\leq\exp\big{(}\zeta n(\log(e/\zeta)+\log(2)-u)\big{)}.
Proof.

First note

{ω((1ζ)n)>τ+(u)}={ω(i)>τ+(u) for i(1ζ)n},\mathbb{P}\left\{\omega_{((1-\zeta)n)}>\tau_{+}(u)\right\}=\mathbb{P}\left\{\omega_{(i)}>\tau_{+}(u)\text{ for }i\geq(1-\zeta)n\right\},

that is, we are looking for the top ζn\zeta n of the ω(i)\omega_{(i)} to be larger than τ+(u)\tau_{+}(u). Because the ωi\omega_{i} are drawn i.i.d., all their ζn\zeta n sized subsets are equally likely, so we may conclude

{ω((1ζ)n)>τ+(u)}\displaystyle\mathbb{P}\left\{\omega_{((1-\zeta)n)}>\tau_{+}(u)\right\} (nζn){ωi>τ+(u) for 0iζn1}\displaystyle\leq\binom{n}{\zeta n}\mathbb{P}\left\{\omega_{i}>\tau_{+}(u)\text{ for }0\leq i\leq\zeta n-1\right\}
(eζ)ζn(2eu)ζn=exp(ζn(log(e/ζ)+log(2)u)),\displaystyle\leq\left(\frac{e}{\zeta}\right)^{\zeta n}\left(2e^{-u}\right)^{\zeta n}=\exp\big{(}\zeta n(\log(e/\zeta)+\log(2)-u)\big{)},

using the independence of the ωi\omega_{i} in the last line. ∎
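For intuition, the bound of lemma 4.1.2 can be compared to the exact tail it dominates: if each variable exceeds the threshold with probability exactly 2exp(-u), the order statistic exceeds it precisely when a Binomial(n, 2exp(-u)) count of exceedances reaches ζn. A small numerical sketch (the values of n, ζ, and u are arbitrary):

```python
import math

def lemma_bound(n, zeta, u):
    """Right-hand side of lemma 4.1.2 (zeta * n is assumed to be an integer)."""
    zn = round(zeta * n)
    return math.exp(zn * (math.log(math.e / zeta) + math.log(2) - u))

def exact_tail(n, zeta, p):
    """P{at least zeta*n of n independent exceedances, each of probability p}."""
    zn = round(zeta * n)
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(zn, n + 1))

n, zeta, u = 200, 0.1, 6.0
p = 2 * math.exp(-u)              # per-variable tail probability, as in equation (4.2)
print(exact_tail(n, zeta, p))     # exact probability that the order statistic exceeds the threshold
print(lemma_bound(n, zeta, u))    # the lemma's upper bound, which indeed dominates
```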

In lemma 4.1.2, we shall take nn to be the number of batches in a given subgraph, so that the random variables in question are independent. The subgraphs from lemma 3.0.3 are of size at least (N1)/4(N-1)/4 and at most N/2N/2, but are not all of the same size when NN is odd. If WW is one of these subgraphs, it contains |W|/M\left\lfloor\left\lvert W\right\rvert/M\right\rfloor batches, as we have enforced each batch to have size at least MM, and reassigned any remainder |W|modM\left\lvert W\right\rvert\mod M to those batches. When NN is even, |W|=N/2\left\lvert W\right\rvert=N/2, and we only need to require ζn:=ζN/(2M)\zeta n:=\zeta\left\lfloor N/(2M)\right\rfloor\in\mathbb{N}. When NN is odd, we adjust ζ\zeta depending on the size of the subgraph WW. Suppose we enforce ζn:=ζ(N1)/(4M)\zeta n:=\zeta\left\lfloor(N-1)/(4M)\right\rfloor to be an integer, recalling |W|(N1)/4\left\lvert W\right\rvert\geq(N-1)/4 in all cases. For any other subgraph WW of larger size, set

ζ~=ζn|W|/Mso thatζ~|W|/M=ζn.\tilde{\zeta}=\zeta\frac{n}{\left\lfloor\left\lvert W\right\rvert/M\right\rfloor}\quad\text{so that}\quad\tilde{\zeta}\left\lfloor\left\lvert W\right\rvert/M\right\rfloor=\zeta n. (4.3)

In particular for all the subgraph sizes, 1/ζ~<2(n+1)/(ζn)3/ζ1/\tilde{\zeta}<2(n+1)/(\zeta n)\leq 3/\zeta if we assume ζ1/2\zeta\leq 1/2, as ζn1\zeta n\geq 1. Thus in lemma 4.1.2, we replace log(e/ζ)\log(e/\zeta) by log(3e/ζ)\log(3e/\zeta) when we consider the different subgraphs.

4.1.1 Retraction to the Sphere

Suppose yy does not have a second moment, that is, Cov(ξ)\mathrm{Cov}(\xi) is undefined. As mentioned before, we can use lemma 4.1.1 on the unit normalized batches Υ^\widehat{\Upsilon} instead, provided we replace Σ\Sigma by Σ^:=𝔼y^y^\widehat{\Sigma}:=\mathbb{E}\widehat{y}\widehat{y}^{\top} and rr by r^=r(y^)=1/Σ^\widehat{r}=r(\widehat{y})=1/\left\lVert\widehat{\Sigma}\right\rVert. By Cauchy-Schwarz, Σ^1\left\lVert\widehat{\Sigma}\right\rVert\leq 1 always: for any unit vector vv,

v𝔼y^y^v=𝔼(y^,v)2𝔼y^22v22=1.v^{\top}\mathbb{E}\widehat{y}\widehat{y}^{\top}v=\mathbb{E}(\widehat{y},v)^{2}\leq\mathbb{E}\left\lVert\widehat{y}\right\rVert_{2}^{2}\left\lVert v\right\rVert_{2}^{2}=1.

A first question to ask is: in the presence of a second moment, how are the operator norms of Σ\Sigma and Σ^\widehat{\Sigma} related? The following lemma gives one such answer.

Lemma 4.1.3.

Let yy be a random vector in D\mathbb{R}^{D}, with second moment matrix 𝔼yy=Σ\mathbb{E}yy^{\top}=\Sigma. If y^=y/y2\widehat{y}=y/\left\lVert y\right\rVert_{2}, then with Σ^=𝔼y^y^\widehat{\Sigma}=\mathbb{E}\widehat{y}\widehat{y}^{\top}, for any ϵ(0,1)\epsilon\in(0,1),

Σ^1ϵr+p(ϵ)withp(ϵ):={y22<ϵ𝔼y22}andr=trΣΣ=𝔼y22Σ.\left\lVert\widehat{\Sigma}\right\rVert\leq\frac{1}{\epsilon\,r}+p(\epsilon)\quad\text{with}\quad p(\epsilon):=\mathbb{P}\left\{\left\lVert y\right\rVert_{2}^{2}<\epsilon\,\mathbb{E}\left\lVert y\right\rVert_{2}^{2}\right\}\quad\text{and}\quad r=\frac{\,\mathrm{tr}\,\Sigma}{\left\lVert\Sigma\right\rVert}=\frac{\mathbb{E}\left\lVert y\right\rVert_{2}^{2}}{\left\lVert\Sigma\right\rVert}.

Further, r^r/(ϵ1+rp(ϵ))\widehat{r}\geq r/(\epsilon^{-1}+rp(\epsilon)).

Remark 4.1.4.

Some dependence on the small ball (if ϵ1\epsilon\ll 1) or lower deviation (if ϵ1\epsilon\approx 1) probability {y22<ϵ𝔼y22}\mathbb{P}\left\{\left\lVert y\right\rVert_{2}^{2}<\epsilon\,\mathbb{E}\left\lVert y\right\rVert_{2}^{2}\right\} must be present, in general, due to the nature of the retraction to the sphere. Specifically, suppose yy is distributed uniformly at random from a finite set in D\mathbb{R}^{D}, having a cluster of points all pointing in roughly the e1e_{1} direction, but of very small norm. If all the other points are distributed uniformly on the unit sphere, one expects Σ\left\lVert\Sigma\right\rVert to be roughly 1/D, as the cluster points will not contribute much to the operator norm. However, after the retraction, these cluster points now all have unit norm and are still pointing in roughly the e1e_{1} direction, so if there are enough points in the cluster, Σ^\left\lVert\widehat{\Sigma}\right\rVert could now be much closer to 1.
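The scenario described in this remark is easy to simulate (a sketch with made-up sizes: most points uniform on the unit sphere, plus a small-norm cluster along the first coordinate axis):

```python
import numpy as np

rng = np.random.default_rng(1)
D, n_sphere, n_cluster = 500, 10_000, 2_000

sphere = rng.standard_normal((n_sphere, D))
sphere /= np.linalg.norm(sphere, axis=1, keepdims=True)     # uniform on the unit sphere
cluster = np.zeros((n_cluster, D))
cluster[:, 0] = 1e-3                                        # tiny norm, all pointing along e1

pts = np.vstack([sphere, cluster])                          # y drawn uniformly from this finite set
pts_hat = pts / np.linalg.norm(pts, axis=1, keepdims=True)  # retraction to the sphere

Sigma = pts.T @ pts / len(pts)                              # E y y^T
Sigma_hat = pts_hat.T @ pts_hat / len(pts_hat)              # E yhat yhat^T
print(np.linalg.norm(Sigma, 2), np.linalg.norm(Sigma_hat, 2))
# The first norm is on the order of 1/D; the second jumps to about the cluster fraction, here 1/6.
```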

Proof.

The matrices Σ\Sigma and Σ^\widehat{\Sigma} are symmetric positive semi-definite, so their singular values correspond to their eigenvalues. These eigenvalues involve the quadratic form vvΣvv\mapsto v^{\top}\Sigma v or vvΣ^vv\mapsto v^{\top}\widehat{\Sigma}v. The latter quadratic form is just 𝔼(y^,v)2\mathbb{E}(\widehat{y},v)^{2}, which we may split as

𝔼(y^,v)2=𝔼(y^,v)2𝕀{y22ϵ𝔼y22}+𝔼(y^,v)2𝕀{y22<ϵ𝔼y22}.\mathbb{E}(\widehat{y},v)^{2}=\mathbb{E}(\widehat{y},v)^{2}\mathbb{I}\left\{\left\lVert y\right\rVert_{2}^{2}\geq\epsilon\,\mathbb{E}\left\lVert y\right\rVert_{2}^{2}\right\}+\mathbb{E}(\widehat{y},v)^{2}\mathbb{I}\left\{\left\lVert y\right\rVert_{2}^{2}<\epsilon\,\mathbb{E}\left\lVert y\right\rVert_{2}^{2}\right\}.

For the first term,

𝔼(y^,v)2𝕀{y22ϵ𝔼y22}𝔼(y,v)2ϵ𝔼y22𝕀{y22ϵ𝔼y22}𝔼(y,v)2ϵ𝔼y22.\mathbb{E}(\widehat{y},v)^{2}\mathbb{I}\left\{\left\lVert y\right\rVert_{2}^{2}\geq\epsilon\,\mathbb{E}\left\lVert y\right\rVert_{2}^{2}\right\}\leq\frac{\mathbb{E}(y,v)^{2}}{\epsilon\mathbb{E}\left\lVert y\right\rVert_{2}^{2}}\mathbb{I}\left\{\left\lVert y\right\rVert_{2}^{2}\geq\epsilon\,\mathbb{E}\left\lVert y\right\rVert_{2}^{2}\right\}\leq\frac{\mathbb{E}(y,v)^{2}}{\epsilon\,\mathbb{E}\left\lVert y\right\rVert_{2}^{2}}.

When vv is a unit vector, (y^,v)21(\widehat{y},v)^{2}\leq 1 always, so for such vv,

𝔼(y^,v)2\displaystyle\mathbb{E}(\widehat{y},v)^{2} 𝔼(y,v)2ϵ𝔼y22+𝔼(y^,v)2𝕀{y22<ϵ𝔼y22}\displaystyle\leq\frac{\mathbb{E}(y,v)^{2}}{\epsilon\,\mathbb{E}\left\lVert y\right\rVert_{2}^{2}}+\mathbb{E}(\widehat{y},v)^{2}\mathbb{I}\left\{\left\lVert y\right\rVert_{2}^{2}<\epsilon\,\mathbb{E}\left\lVert y\right\rVert_{2}^{2}\right\}
𝔼(y,v)2ϵ𝔼y22+{y22<ϵ𝔼y22}.\displaystyle\leq\frac{\mathbb{E}(y,v)^{2}}{\epsilon\,\mathbb{E}\left\lVert y\right\rVert_{2}^{2}}+\mathbb{P}\left\{\left\lVert y\right\rVert_{2}^{2}<\epsilon\,\mathbb{E}\left\lVert y\right\rVert_{2}^{2}\right\}.

Now Σ^\left\lVert\widehat{\Sigma}\right\rVert is just the maximum of 𝔼(y^,v)2\mathbb{E}(\widehat{y},v)^{2} over the unit sphere, and the maximum is realized by some vv^{\ast} as the sphere is compact. Consequently,

Σ^\displaystyle\left\lVert\widehat{\Sigma}\right\rVert 𝔼(y,v)2ϵ𝔼y22+{y22<ϵ𝔼y22}maxvSD1𝔼(y,v)2ϵ𝔼y22+{y22<ϵ𝔼y22}\displaystyle\leq\frac{\mathbb{E}(y,v^{\ast})^{2}}{\epsilon\,\mathbb{E}\left\lVert y\right\rVert_{2}^{2}}+\mathbb{P}\left\{\left\lVert y\right\rVert_{2}^{2}<\epsilon\,\mathbb{E}\left\lVert y\right\rVert_{2}^{2}\right\}\leq\max_{v\in S^{D-1}}\frac{\mathbb{E}(y,v)^{2}}{\epsilon\,\mathbb{E}\left\lVert y\right\rVert_{2}^{2}}+\mathbb{P}\left\{\left\lVert y\right\rVert_{2}^{2}<\epsilon\,\mathbb{E}\left\lVert y\right\rVert_{2}^{2}\right\}
=Σϵ𝔼y22+{y22<ϵ𝔼y22}.\displaystyle=\frac{\left\lVert\Sigma\right\rVert}{\epsilon\,\mathbb{E}\left\lVert y\right\rVert_{2}^{2}}+\mathbb{P}\left\{\left\lVert y\right\rVert_{2}^{2}<\epsilon\,\mathbb{E}\left\lVert y\right\rVert_{2}^{2}\right\}.

Recalling 𝔼y22=trΣ\mathbb{E}\left\lVert y\right\rVert_{2}^{2}=\,\mathrm{tr}\,\Sigma and the definition of rr and r^=1/Σ^\widehat{r}=1/\left\lVert\widehat{\Sigma}\right\rVert finishes the proof. ∎

Apart from assumption 4.1, we make no other assumptions on XX or ξ\xi in the following theorem.

Theorem 4.1.5.

Let 0<ϵ,η,ζ<10<\epsilon,\,\eta,\,\zeta<1, ζ1/2\zeta\leq 1/2, ϵ2/3\epsilon\leq 2/3, and 0<α0<\alpha. Under assumption 4.1, equation (JL) holds for at least (12η)(1ζ)(N2)(1-2\eta)(1-\zeta)\binom{N}{2} pairs x,xXx,x^{\prime}\in X, with probability at least 12δ1-2\delta over the draw of (Z,X)(Z,X), provided

kC[2.0.6]ϵ2(α+43+2α)(log(2e/η)(log(6e/ζ)+logD)1t+log(N2/(αr^δlog(D)))αmax(ηr^,1))k\geq\frac{C_{[\ref{thm:FreeDecompBulkJL}]}}{\epsilon^{2}}\left(\alpha+\frac{4}{3}+\sqrt{2\alpha}\right)\left(\log(2e/\eta)\frac{(\log(6e/\zeta)+\log D)}{1-t^{\prime}}+\frac{\log(N^{2}/(\alpha\widehat{r}\delta\log(D)))}{\alpha\max(\eta\widehat{r},1)}\right)

when the quantities ηM\eta M and ζ(N1)/(4M)\zeta\left\lfloor(N-1)/(4M)\right\rfloor are strictly positive integers, with

N18M:=αr^(log(6e/ζ)+logD)1t,1>t:=8αr^log(3(N1)2δ)ζ(N1),andγ(ϵ)=γ[2.0.6].\frac{N-1}{8}\geq M:=\alpha\widehat{r}\frac{(\log(6e/\zeta)+\log D)}{1-t^{\prime}},\quad 1>t^{\prime}:=\frac{8\alpha\widehat{r}\log(\frac{3(N-1)}{2\delta})}{\zeta(N-1)},\quad\text{and}\quad\gamma(\epsilon)=\gamma_{[\ref{thm:FreeDecompBulkJL}]}.
Remark 4.1.6.

To make sense of the above, consider D=DJL=O(ϵ2log(N2/δ))D=D_{JL}=O(\epsilon^{-2}\log(N^{2}/\delta)) as if from the original Johnson-Lindenstrauss lemmas. We always have r^D\widehat{r}\leq D, so tt^{\prime} vanishes fairly quickly with increasing NN when ζ\zeta is fixed or even slowly decaying. Compared to theorem 4.0.1, we have an extra factor against log(2e/η)\log(2e/\eta), but it is not too big in that log(D)=O(loglog(N/δ))\log(D)=O(\log\log(N/\delta)), and log(6e/ζ)\log(6e/\zeta) does not grow quickly with decreasing ζ\zeta. When NN is large, we can make ζ\zeta small enough that the fraction (12η)(1ζ)(1-2\eta)(1-\zeta) is not that much worse than (12η)(1-2\eta). Finally, in light of lemma 4.1.3, we can replace r^\widehat{r} by rr in the definition of kk at the expense of a bounded prefactor for log(N2/δ)\log(N^{2}/\delta) provided the lower deviation or small ball probability p(ϵ)p(\epsilon) is less than 1/r1/r.

Proof.

From lemma 3.0.3, we have either N1N-1 or 3(N1)/23(N-1)/2 1-regular subgraphs to consider when N7N\geq 7, and we choose (unit normalized) batches Υ^\widehat{\Upsilon} of size at least MM from these subgraphs. Let MΣ^M:=Υ^Υ^M\widehat{\Sigma}_{M}:=\widehat{\Upsilon}\widehat{\Upsilon}^{\top}. Within each subgraph, the batches are independent, allowing us to use lemma 4.1.2 on the random variables

ω(Υ^):=Σ^MΣ^.\omega(\widehat{\Upsilon}):=\left\lVert\widehat{\Sigma}_{M}-\widehat{\Sigma}\right\rVert.

Take n:=(N1)/(4M)n:=\left\lfloor(N-1)/(4M)\right\rfloor for that lemma and recall the ζ~\tilde{\zeta} discussion from equation (4.3). We have to choose uu so that a union bound over all the subgraphs is still smaller than δ\delta, so a safe value for uu would be

3(N1)2exp(ζn(log(3e/ζ)+log(2)u))\displaystyle\frac{3(N-1)}{2}\exp\left(\zeta n\big{(}\log(3\,e/\zeta)+\log(2)-u\big{)}\right) δ\displaystyle\leq\delta (4.4)
log(6e/ζ)+log(3(N1)/(2δ))ζn\displaystyle\log(6e/\zeta)+\frac{\log(3(N-1)/(2\delta))}{\zeta n} u\displaystyle\leq u (4.5)

With this uu in hand, we can apply lemma 4.1.1 with K=1K=1 to MΣ^M=Υ^Υ^M\widehat{\Sigma}_{M}=\widehat{\Upsilon}\widehat{\Upsilon}^{\top}:

Υ^2=MΣ^M\displaystyle\left\lVert\widehat{\Upsilon}\right\rVert^{2}=\left\lVert M\widehat{\Sigma}_{M}\right\rVert MΣ^MΣ^+MΣ^\displaystyle\leq M\left\lVert\widehat{\Sigma}_{M}-\widehat{\Sigma}\right\rVert+M\left\lVert\widehat{\Sigma}\right\rVert
MΣ^(1+((4/3)r^(u+logD)M+2r^(u+logD)M)).\displaystyle\leq M\left\lVert\widehat{\Sigma}\right\rVert\left(1+\left(\frac{(4/3)\widehat{r}(u+\log D)}{M}+\sqrt{\frac{2\widehat{r}(u+\log D)}{M}}\right)\right).

Because we are interested in r(Υ^)=M/Υ^2r_{\infty}(\widehat{\Upsilon})=M/\left\lVert\widehat{\Upsilon}\right\rVert^{2}, we should like to make the error term manageable, so choose

M=αr^(u+logD)withα>0.M=\alpha\widehat{r}(u+\log D)\quad\text{with}\quad\alpha>0.

Because uu already depends on MM through nn in equation 4.5, there is a constraint on uu and ζ\zeta that we need to address. Write (N1)/4=nM+s(N-1)/4=nM+s with 0sM10\leq s\leq M-1. Set ζ\zeta^{\ast} to satisfy ζn=ζ(N1)/(4M)\zeta n=\zeta^{\ast}(N-1)/(4M). We then have

log(6e/ζ)+t(u+logD)\displaystyle\log(6e/\zeta)+t(u+\log D) uwitht:=4αr^log(3(N1)/(2δ))ζ(N1)>0\displaystyle\leq u\quad\text{with}\quad t:=4\alpha\widehat{r}\frac{\log(3(N-1)/(2\delta))}{\zeta^{\ast}(N-1)}>0
log(6e/ζ)+tlogD\displaystyle\log(6e/\zeta)+t\log D (1t)u\displaystyle\leq(1-t)u

We can divide by (1t)(1-t) provided t<1t<1, that is, provided

αr^log(3(N1)/(2δ))<ζN14=ζ(N14s)\displaystyle\alpha\widehat{r}\log(3(N-1)/(2\delta))<\zeta^{\ast}\frac{N-1}{4}=\zeta\left(\frac{N-1}{4}-s\right)

Recalling sM1s\leq M-1, if we also require M(N1)/8M\leq(N-1)/8, it would be safe to require

αr^log(3(N1)/(2δ))<ζN18<ζ(N14s)anduN18αr^log(D).\alpha\widehat{r}\log(3(N-1)/(2\delta))<\zeta\frac{N-1}{8}<\zeta\left(\frac{N-1}{4}-s\right)\quad\text{and}\quad u\leq\frac{N-1}{8\alpha\widehat{r}}-\log(D).

We then have

1>t:=8αr^log(3(N1)/(2δ))ζ(N1)>t,1>t^{\prime}:=8\alpha\widehat{r}\frac{\log(3(N-1)/(2\delta))}{\zeta(N-1)}>t,

and because the maps tt/(1t)t\mapsto t/(1-t) and t1/(1t)t\mapsto 1/(1-t) are strictly increasing, a valid lower bound for uu is

ut1tlog(D)+11tlog(6e/ζ).u\geq\frac{t^{\prime}}{1-t^{\prime}}\log(D)+\frac{1}{1-t^{\prime}}\log(6e/\zeta).

Taking uu as this lower bound yields

u+log(D)=11t(log(D)+log(6e/ζ)).u+\log(D)=\frac{1}{1-t^{\prime}}(\log(D)+\log(6e/\zeta)).

With this choice of uu in hand, with probability at least 1δ1-\delta,

r(Υ^)MMΣ^(1+43α+2α)=1/Σ^(1+43α+2α)=r^(1+43α+2α)=:R^(M;ζ)r_{\infty}(\widehat{\Upsilon})\geq\frac{M}{M\left\lVert\widehat{\Sigma}\right\rVert\left(1+\frac{4}{3\alpha}+\sqrt{\frac{2}{\alpha}}\right)}=\frac{1/\left\lVert\widehat{\Sigma}\right\rVert}{\left(1+\frac{4}{3\alpha}+\sqrt{\frac{2}{\alpha}}\right)}=\frac{\widehat{r}}{\left(1+\frac{4}{3\alpha}+\sqrt{\frac{2}{\alpha}}\right)}=:\widehat{R}_{\infty}(M;\zeta)

for at least (1ζ)(1-\zeta) of all batches Υ^\widehat{\Upsilon}. Assuming this bound holds, we now ask that when ZZ is drawn, equation (JL) holds for all the vectors involved in at least these batches, with failure probability at most δ\delta. We run the argument of theorem 2.0.8, only for these batches Υ^\widehat{\Upsilon}, using R^(M;ζ)\widehat{R}_{\infty}(M;\zeta) in place of R^(M)\widehat{R}_{\infty}(M). As we could still have (N2)/M\left\lfloor\binom{N}{2}/M\right\rfloor “good” batches, we still must allow for all of them when we compute the union bound. The M/R^(M)M/\widehat{R}_{\infty}(M) ratio in theorem 2.0.8 now just becomes

MR^(M;ζ)=αr^(u+logD)r^(1+43α+2α)=(u+logD)(α+43+2α)\frac{M}{\widehat{R}_{\infty}(M;\zeta)}=\frac{\alpha\widehat{r}(u+\log D)}{\widehat{r}}\left(1+\frac{4}{3\alpha}+\sqrt{\frac{2}{\alpha}}\right)=(u+\log D)\left(\alpha+\frac{4}{3}+\sqrt{2\alpha}\right)

Let CαC_{\alpha} be the coefficient of (u+logD)(u+\log D) in the above. We may then set kk as

kC[2.0.6]ϵ2Cα(log(2e/η)(log(6e/ζ)+logD)1t+log(N2/(αr^δlog(D)))αmax(ηr^,1))k\geq\frac{C_{[\ref{thm:FreeDecompBulkJL}]}}{\epsilon^{2}}C_{\alpha}\left(\log(2e/\eta)\frac{(\log(6e/\zeta)+\log D)}{1-t^{\prime}}+\frac{\log(N^{2}/(\alpha\widehat{r}\delta\log(D)))}{\alpha\max(\eta\widehat{r},1)}\right) (4.6)

using u+log(D)log(D)u+\log(D)\geq\log(D) in the log(N2)\log(N^{2}) term. The choice of γ\gamma follows from theorem 2.0.8. ∎

In certain cases, we know r^\widehat{r} exactly without relying on lemma 4.1.3.

Lemma 4.1.7.

Suppose ξ=(ξ1,,ξD)D\xi=(\xi_{1},\ldots,\xi_{D})\in\mathbb{R}^{D} is a random vector with i.i.d. coordinates, and ξ\xi^{\prime} is an independent copy of ξ\xi. If y:=ξξy:=\xi-\xi^{\prime}, then the scaled unit vector y^D\widehat{y}\sqrt{D} is mean-zero isotropic.

There are no moment assumptions on the coordinates ξi\xi_{i} here, so the lemma even applies to vectors with i.i.d. standard Cauchy coordinates.

Proof.

Both properties rely on the following observation. For a fixed coordinate ii, the coordinate yi=ξiξiy_{i}=\xi_{i}-\xi_{i}^{\prime} is a symmetric random variable: yiy_{i} and yi-y_{i} have the same distribution. Consequently, for any odd function ff, (using the symmetry in the 2nd equality)

𝔼f(yi)=𝔼f(yi)=𝔼f(yi)-\mathbb{E}f(y_{i})=\mathbb{E}f(-y_{i})=\mathbb{E}f(y_{i})

so that 𝔼f(yi)=0\mathbb{E}f(y_{i})=0 for such odd functions ff.

If we temporarily freeze the other coordinates, the iith coordinate of the unit vector y^\widehat{y} is just

y^i=yi(yi2+jiDyj2)1/2=yi(yi2+C)1/2,\widehat{y}_{i}=\frac{y_{i}}{(y_{i}^{2}+\sum_{j\neq i}^{D}y_{j}^{2})^{1/2}}=\frac{y_{i}}{(y_{i}^{2}+C)^{1/2}},

an odd function of yiy_{i}, forcing y^D\widehat{y}\sqrt{D} to be mean-zero.

To check y^D\widehat{y}\sqrt{D} is isotropic, we must show the matrix Σ=D𝔼y^y^\Sigma=D\,\mathbb{E}\widehat{y}\widehat{y}^{\top} is the identity IdD\operatorname{Id}_{D}. Because the yiy_{i} are identically distributed,

D𝔼y12j=1Dyj2=𝔼i=1Dyi2j=1Dyj2=1,D\,\mathbb{E}\frac{y_{1}^{2}}{\sum_{j=1}^{D}y_{j}^{2}}=\mathbb{E}\sum_{i=1}^{D}\frac{y_{i}^{2}}{\sum_{j=1}^{D}y_{j}^{2}}=1,

so the diagonal entries of Σ\Sigma are all equal to 11.

Further, when iji\neq j, the independence of the yiy_{i} now gives

D𝔼yiyjyi2+kiDyk2=D𝔼yj𝔼(yiyi2+kiDyk2|yji)=0D\,\mathbb{E}\frac{y_{i}y_{j}}{y_{i}^{2}+\sum_{k\neq i}^{D}y_{k}^{2}}=D\,\mathbb{E}y_{j}\mathbb{E}\left(\frac{y_{i}}{y_{i}^{2}+\sum_{k\neq i}^{D}y_{k}^{2}}\Big{\lvert}y_{j\neq i}\right)=0

because the conditional expectation of this odd function of yiy_{i} vanishes. ∎
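A Monte Carlo sanity check of lemma 4.1.7 in the heavy-tailed case mentioned above (a sketch; the dimension and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
D, n = 5, 200_000

xi  = rng.standard_cauchy((n, D))    # i.i.d. standard Cauchy coordinates: no moments at all
xi2 = rng.standard_cauchy((n, D))    # an independent copy
y = xi - xi2
y_hat = y / np.linalg.norm(y, axis=1, keepdims=True)

Sigma = D * (y_hat.T @ y_hat) / n    # empirical second moment matrix of sqrt(D) * y_hat
print(np.round(Sigma, 2))            # should be close to the 5 x 5 identity
```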

We now have an immediate corollary to theorem 4.1.5, which again we may compare to theorem 4.0.1.

Corollary 4.1.8.

In the setting of theorem 4.1.5, suppose ξ\xi has i.i.d. coordinates. Then the corresponding conclusion still holds, with r^=D\widehat{r}=D, namely it suffices to take γ(ϵ)=γ[2.0.6]\gamma(\epsilon)=\gamma_{[\ref{thm:FreeDecompBulkJL}]} and

kC[2.0.6]ϵ2(α+43+2α)(log(2e/η)(log(6e/ζ)+logD)1t+log(N2/(αDδlog(D)))αmax(ηD,1)).k\geq\frac{C_{[\ref{thm:FreeDecompBulkJL}]}}{\epsilon^{2}}\left(\alpha+\frac{4}{3}+\sqrt{2\alpha}\right)\left(\log(2e/\eta)\frac{(\log(6e/\zeta)+\log D)}{1-t^{\prime}}+\frac{\log(N^{2}/(\alpha D\delta\log(D)))}{\alpha\max(\eta D,1)}\right).
Proof.

By lemma 4.1.7, the difference vector y=ξξy=\xi-\xi^{\prime} yields the isotropic vector y^D\widehat{y}\sqrt{D}. Because Σ(y^D)=IdD\Sigma(\widehat{y}\sqrt{D})=\operatorname{Id}_{D}, we compute r^:=r(y^)=r(y^D)=D\widehat{r}:=r(\widehat{y})=r(\widehat{y}\sqrt{D})=D, as the intrinsic dimension does not see constant scalings. We can then apply theorem 4.1.5. ∎

Remark 4.1.9.

The proof only uses that y^D\widehat{y}\sqrt{D} is isotropic. By lemma 1.1.3, Φ(y^D)\Phi(\widehat{y}\sqrt{D}) is still isotropic when Φ\Phi has orthonormal rows. Because equation (JL) is 1-homogeneous, the corollary still holds with ξ\xi replaced by Φ(ξ)\Phi(\xi), in particular when Φ\Phi has a fast transform method available.

4.2 Estimating r^\widehat{r}

The intrinsic dimension of y^\widehat{y}, namely r^\widehat{r}, enters into theorem 4.1.5 only as a parameter, so we are free to estimate it separately. In particular,

Corollary 4.2.1.

Theorem 4.1.5 holds with r^\widehat{r} replaced by an empirical estimate using a batch Υ^(m)\widehat{\Upsilon}(m) of size mm, namely, with failure probability at most δ\delta,

r^13Σ^mr^5form=8Dlog(2D/δ)providedm(N1)/2.\widehat{r}\geq\frac{1}{3\left\lVert\widehat{\Sigma}_{m}\right\rVert}\geq\frac{\widehat{r}}{5}\quad\text{for}\quad m=8D\log(2D/\delta)\quad\text{provided}\quad m\leq(N-1)/2.

If D=DJL=O(ϵ2log(N2/δ))D=D_{JL}=O(\epsilon^{-2}\log(N^{2}/\delta)), then computing Σ^m\left\lVert\widehat{\Sigma}_{m}\right\rVert will cost polynomial in log(N2/δ)\log(N^{2}/\delta) and ϵ2\epsilon^{-2}; however, because 1r^D1\leq\widehat{r}\leq D, we do not need very high accuracy when computing this top eigenvalue.

If we were working with XX drawn uniformly with replacement from a larger dataset, we should draw the first 16Dlog(2D/δ)16D\log(2D/\delta) data points, and then sequentially pair them off for unit difference vectors to yield Υ^(m)\widehat{\Upsilon}(m). The uniform with replacement assumption assures that these data points used are as good as any other subset, even if some of the data points turn out to be copies of the same point in the larger dataset.
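A sketch of this estimation procedure (the helper below is hypothetical and not from the paper; the toy data has i.i.d. coordinates, so the true intrinsic dimension is D by corollary 4.1.8):

```python
import numpy as np

def estimate_r_hat(data, delta=0.05):
    """Sketch of corollary 4.2.1: estimate r_hat from sequentially paired data points.

    `data` is an (n, D) array drawn uniformly with replacement from the larger dataset,
    with n at least 16 D log(2D/delta); the estimate lies between r_hat / 5 and r_hat."""
    n, D = data.shape
    m = int(np.ceil(8 * D * np.log(2 * D / delta)))         # batch size from the corollary
    assert n >= 2 * m, "need at least 16 D log(2D/delta) points"
    diffs = data[0:2 * m:2] - data[1:2 * m:2]               # pair off consecutive points
    diffs /= np.linalg.norm(diffs, axis=1, keepdims=True)   # unit difference vectors
    Sigma_hat_m = diffs.T @ diffs / m                       # empirical second moment of y_hat
    top = np.linalg.norm(Sigma_hat_m, 2)                    # top eigenvalue (= operator norm, PSD)
    return 1.0 / (3.0 * top)

rng = np.random.default_rng(3)
D = 20
n = 2 * int(np.ceil(8 * D * np.log(2 * D / 0.05)))
print(estimate_r_hat(rng.standard_normal((n, D))))          # roughly D / 3, inside [D / 5, D]
```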

Proof.

For m(N1)/2m\leq(N-1)/2, we can find a batch Υ^(m)\widehat{\Upsilon}(m) of size mm with independent columns, by corollary 3.0.2 and lemma 3.0.3. By lemma 4.1.1, we have, with failure probability at most 2exp(u)2\exp(-u),

Σ^Σ^mΣ^τ+(u)=Σ^((4/3)r^(u+log(D))m+2r^(u+log(D))m).\left\lVert\widehat{\Sigma}-\widehat{\Sigma}_{m}\right\rVert\leq\left\lVert\widehat{\Sigma}\right\rVert\tau_{+}(u)=\left\lVert\widehat{\Sigma}\right\rVert\left(\frac{(4/3)\widehat{r}(u+\log(D))}{m}+\sqrt{\frac{2\widehat{r}(u+\log(D))}{m}}\right).

By the triangle inequality, we then have

Σ^Σ^τ+(u)+Σ^mthat is(1τ+(u))Σ^Σ^m\displaystyle\left\lVert\widehat{\Sigma}\right\rVert\leq\left\lVert\widehat{\Sigma}\right\rVert\tau_{+}(u)+\left\lVert\widehat{\Sigma}_{m}\right\rVert\quad\text{that is}\quad(1-\tau_{+}(u))\left\lVert\widehat{\Sigma}\right\rVert\leq\left\lVert\widehat{\Sigma}_{m}\right\rVert

and

Σ^mΣ^τ+(u)+Σ^=(1+τ+(u))Σ^.\left\lVert\widehat{\Sigma}_{m}\right\rVert\leq\left\lVert\widehat{\Sigma}\right\rVert\tau_{+}(u)+\left\lVert\widehat{\Sigma}\right\rVert=(1+\tau_{+}(u))\left\lVert\widehat{\Sigma}\right\rVert.

Consequently, as r^=1/Σ^\widehat{r}=1/\left\lVert\widehat{\Sigma}\right\rVert,

(1τ+(u)1+τ+(u))r^1τ+(u)Σ^mr^1+τ+(u)Σ^m\left(\frac{1-\tau_{+}(u)}{1+\tau_{+}(u)}\right)\widehat{r}\leq\frac{1-\tau_{+}(u)}{\left\lVert\widehat{\Sigma}_{m}\right\rVert}\leq\widehat{r}\leq\frac{1+\tau_{+}(u)}{\left\lVert\widehat{\Sigma}_{m}\right\rVert}

Recalling r^D\widehat{r}\leq D always, we choose m=αDlog(2D/δ)m=\alpha D\log(2D/\delta) and u=log(2/δ)u=\log(2/\delta) so that

τ+(log(2/δ))(4/3)r^αD+2r^αD43α+2α23forα8.\tau_{+}(\log(2/\delta))\leq\frac{(4/3)\widehat{r}}{\alpha D}+\sqrt{\frac{2\widehat{r}}{\alpha D}}\leq\frac{4}{3\alpha}+\sqrt{\frac{2}{\alpha}}\leq\frac{2}{3}\quad\text{for}\quad\alpha\geq 8.\qed

Acknowledgements

This research was performed while the author held an NRC Research Associateship award at the U.S. Air Force Research Laboratory. I should like to thank Mary, Saint Joseph, and the Holy Trinity for helping me with this work.

Disclaimers

The views expressed are those of the author and do not reflect the official guidance or position of the United States Government, the Department of Defense, or of the United States Air Force.

Statement from the DoD: The appearance of external hyperlinks does not constitute endorsement by the United States Department of Defense (DoD) of the linked websites, or the information, products, or services contained therein. The DoD does not exercise any editorial, security, or other control over the information you may find at these locations.

5 Appendix

The following lemma shows how to modify the proof of the Hanson-Wright inequality from [RV13] (cf. [Ver18, chapter 6]) to a “bulk” version, looking at the sum of several i.i.d. quadratic forms. Note ZZ here is ZZ^{\top} in the main part of the paper. Let ZZ be a D×kD\times k matrix with entries ZijZ_{ij} drawn i.i.d. from a mean-zero unit-variance sub-gaussian distribution. Write Z(:,j)Z(:,j) for the jjth column of ZZ. Let BB be a D×DD\times D matrix and

S=j=1kZ(:,j)BZ(:,j)S=\sum_{j=1}^{k}Z(:,j)^{\top}BZ(:,j)

Note, with B=(bij)i,j=1DB=(b_{ij})_{i,j=1}^{D},

Z(:,j)BZ(:,j)=q,lZqjbqlZljand𝔼Z(:,j)BZ(:,j)=qbqq𝔼Zqj2,Z(:,j)^{\top}BZ(:,j)=\sum_{q,\,l}Z_{qj}b_{ql}Z_{lj}\quad\text{and}\quad\mathbb{E}Z(:,j)^{\top}BZ(:,j)=\sum_{q}b_{qq}\mathbb{E}Z_{qj}^{2}, (5.1)

using the mean zero and independence assumptions for the coordinates of Z(:,j)Z(:,j), that is for the ZijZ_{ij} when ii varies.

Lemma 5.0.1.

Let SS, BB, and ZZ be as above. Then for all t0t\geq 0 and either sign choice,

{±(S𝔼S)kt}2exp(ckmin(t2K4BF2,tK2B))\mathbb{P}\left\{\pm(S-\mathbb{E}S)\geq kt\right\}\leq 2\exp\left(-ck\min\left(\frac{t^{2}}{K^{4}\left\lVert B\right\rVert_{F}^{2}},\,\frac{t}{K^{2}\left\lVert B\right\rVert}\right)\right)

with cc an absolute constant (not depending on ZZ) and K=Z11ψ2K=\left\lVert Z_{11}\right\rVert_{\psi_{2}}.

Remark 5.0.2.

The key point is the additional factor of kk in the exponential, compared to the usual Hanson-Wright inequality where k=1k=1. Here,

Zψ2=inf{t>0|𝔼exp(Z2/t2)2},\left\lVert Z\right\rVert_{\psi_{2}}=\inf\left\{t>0\;\lvert\;\mathbb{E}\exp(Z^{2}/t^{2})\leq 2\right\},

so in particular CZψ2=CZψ2\left\lVert CZ\right\rVert_{\psi_{2}}=C\left\lVert Z\right\rVert_{\psi_{2}}, and Z/Kψ2=1\left\lVert Z/K\right\rVert_{\psi_{2}}=1. If we prove the result for Z/KZ/K, that is bound

{|j=1kZ(:,j)KBZ(:,j)K𝔼j=1kZ(:,j)KBZ(:,j)K|kt},\mathbb{P}\left\{\left\lvert\sum_{j=1}^{k}\frac{Z(:,j)^{\top}}{K}B\frac{Z(:,j)}{K}-\mathbb{E}\sum_{j=1}^{k}\frac{Z(:,j)^{\top}}{K}B\frac{Z(:,j)}{K}\right\rvert\geq kt\right\},

then taking tt/K2t\mapsto t/K^{2} will give us the bound for the original Z(:,j)Z(:,j).

Proof.

By equation 5.1,

S𝔼S=j=1kqbqq(Zqj2𝔼Zqj2)+j=1kqlbqlZqjZlj=:S1+S2.S-\mathbb{E}S=\sum_{j=1}^{k}\sum_{q}b_{qq}(Z_{qj}^{2}-\mathbb{E}Z_{qj}^{2})+\sum_{j=1}^{k}\sum_{q\neq l}b_{ql}Z_{qj}Z_{lj}=:S_{1}+S_{2}.

By the union bound (Boole’s inequality), we can bound the probability

p:={S𝔼Skt}{S1kt/2}+{S2kt/2}=:p1+p2,p:=\mathbb{P}\left\{S-\mathbb{E}S\geq kt\right\}\leq\mathbb{P}\left\{S_{1}\geq kt/2\right\}+\mathbb{P}\left\{S_{2}\geq kt/2\right\}=:p_{1}+p_{2},

for if S1<kt/2S_{1}<kt/2 and S2<kt/2S_{2}<kt/2, then S𝔼S<ktS-\mathbb{E}S<kt.

We can now use the i.i.d. assumption for the columns, that is, for the ZijZ_{ij} when jj varies,

p1\displaystyle p_{1} ektλ1/2𝔼exp(λ1S1)=(etλ1/2𝔼exp(λ1qbqq(Zq2𝔼Zq2)))k=1k\displaystyle\leq e^{-kt\lambda_{1}/2}\mathbb{E}\exp(\lambda_{1}S_{1})=\left(e^{-t\lambda_{1}/2}\mathbb{E}\exp\left(\lambda_{1}\sum_{q}b_{qq}(Z_{q}^{2}-\mathbb{E}Z_{q}^{2})\right)\right)^{k}=\wp_{1}^{k}

and

p2\displaystyle p_{2} ektλ2/2𝔼exp(λ2S2)=(etλ2/2𝔼exp(λ2qlbqlZqZl))k=2k\displaystyle\leq e^{-kt\lambda_{2}/2}\mathbb{E}\exp(\lambda_{2}S_{2})=\left(e^{-t\lambda_{2}/2}\mathbb{E}\exp\left(\lambda_{2}\sum_{q\neq l}b_{ql}Z_{q}Z_{l}\right)\right)^{k}=\wp_{2}^{k}

The terms 1\wp_{1} and 2\wp_{2} are the starting points for establishing a proof of the Hanson-Wright inequality [Ver18, page 133]; the former is handled with Bernstein’s inequality, while the latter uses decoupling and comparison to the case when ZZ is a standard Gaussian random vector. Consequently, we can use the bounds for 1\wp_{1} and 2\wp_{2}, both of which satisfy

max{1,2}exp(cmin(t2BF2,tB)),\max\left\{\wp_{1},\wp_{2}\right\}\leq\exp\left(-c\min\left(\frac{t^{2}}{\left\lVert B\right\rVert_{F}^{2}},\,\frac{t}{\left\lVert B\right\rVert}\right)\right),

with cc an absolute constant (not depending on the distribution of ZZ, as we already rescaled ZZ to have unit ψ2\psi_{2}-norm entries). Recalling the kkth powers, we finally have

pp1+p22exp(ckmin(t2BF2,tB)).p\leq p_{1}+p_{2}\leq 2\exp\left(-ck\min\left(\frac{t^{2}}{\left\lVert B\right\rVert_{F}^{2}},\,\frac{t}{\left\lVert B\right\rVert}\right)\right).\qed
Lemma 5.0.3.

Let ZZ be a k×Dk\times D random matrix with i.i.d. mean-zero unit-variance sub-gaussian entries. Then for a matrix AA with columns in D\mathbb{R}^{D},

{±(ZAF2kAF2)kϵAF2}2exp(ckmin(ϵ2r4(A)K4,ϵr(A)K2))\mathbb{P}\left\{\pm(\left\lVert ZA\right\rVert_{F}^{2}-k\left\lVert A\right\rVert_{F}^{2})\geq k\epsilon\left\lVert A\right\rVert_{F}^{2}\right\}\leq 2\exp\left(-ck\min\left(\frac{\epsilon^{2}r_{4}(A)}{K^{4}},\,\frac{\epsilon\,r_{\infty}(A)}{K^{2}}\right)\right)

with K=Z11ψ2K=\left\lVert Z_{11}\right\rVert_{\psi_{2}}.

Proof.

We use lemma 5.0.1, with B=AAB=AA^{\top}, and ZZZ\mapsto Z^{\top}, for then the rows ZjZ_{j} of ZZ are written as column vectors, so that

ZjBZj=AZj22,S=j=1kAZj22=ZAF2,and𝔼S=kAF2.Z_{j}^{\top}BZ_{j}=\left\lVert A^{\top}Z_{j}\right\rVert_{2}^{2},\quad S=\sum_{j=1}^{k}\left\lVert A^{\top}Z_{j}\right\rVert_{2}^{2}=\left\lVert ZA\right\rVert_{F}^{2},\quad\text{and}\quad\mathbb{E}S=k\left\lVert A\right\rVert_{F}^{2}.

Using B=A2\left\lVert B\right\rVert=\left\lVert A\right\rVert^{2}, we recover

{±(SkAF2)kt}2exp(ckmin(t2K4AAF2,tK2A2)).\mathbb{P}\left\{\pm(S-k\left\lVert A\right\rVert_{F}^{2})\geq kt\right\}\leq 2\exp\left(-ck\min\left(\frac{t^{2}}{K^{4}\left\lVert AA^{\top}\right\rVert_{F}^{2}},\,\frac{t}{K^{2}\left\lVert A\right\rVert^{2}}\right)\right).

Because

r(A)=AF2A2andr4(A)=AF4AAF2,r_{\infty}(A)=\frac{\left\lVert A\right\rVert_{F}^{2}}{\left\lVert A\right\rVert^{2}}\quad\text{and}\quad r_{4}(A)=\frac{\left\lVert A\right\rVert_{F}^{4}}{\left\lVert AA^{\top}\right\rVert_{F}^{2}},

the choice t=ϵAF2t=\epsilon\left\lVert A\right\rVert_{F}^{2} yields the result. ∎
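A Monte Carlo sketch of the concentration that lemma 5.0.3 quantifies (Gaussian entries for simplicity; the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
D, M, k, trials = 100, 40, 30, 2000

A = rng.standard_normal((D, M))             # a fixed matrix with columns in R^D
Z = rng.standard_normal((trials, k, D))     # i.i.d. standard Gaussian k x D matrices
ZA = np.einsum('tkd,dm->tkm', Z, A)         # each product Z A
S = np.sum(ZA**2, axis=(1, 2))              # ||ZA||_F^2, one value per trial

rel_err = S / (k * np.sum(A**2)) - 1.0      # relative deviation from the mean k ||A||_F^2
print(rel_err.std(), np.abs(rel_err).max()) # fluctuations shrink like 1 / sqrt(k * r_4(A))
```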

If the reader would prefer explicit constants, the following lemma may be convenient, and gives an alternative proof for lemma 5.0.3 in the Gaussian case, relying on the explicit moment generating function for the Gaussian distribution.

Lemma 5.0.4.

Let ZZ be a k×Dk\times D random matrix with i.i.d. standard Gaussian entries. Then for a matrix AA with columns in D\mathbb{R}^{D},

{ZAF2>(1+ϵ)kAF2}exp(kϵ8min{ϵr4(A),r(A)})\mathbb{P}\left\{\left\lVert ZA\right\rVert_{F}^{2}>(1+\epsilon)k\left\lVert A\right\rVert_{F}^{2}\right\}\leq\exp\left(-k\frac{\epsilon}{8}\min\left\{\epsilon\,r_{4}(A),\,r_{\infty}(A)\right\}\right)

for ϵ>0\epsilon>0. Also, when ϵ(0,1)\epsilon\in(0,1),

{ZAF2>(1+ϵ)kAF2}exp(kϵ28r(A))\mathbb{P}\left\{\left\lVert ZA\right\rVert_{F}^{2}>(1+\epsilon)k\left\lVert A\right\rVert_{F}^{2}\right\}\leq\exp\left(-k\frac{\epsilon^{2}}{8}r_{\infty}(A)\right)

and

{ZAF2<(1ϵ)kAF2}exp(kϵ24r4(A)).\mathbb{P}\left\{\left\lVert ZA\right\rVert_{F}^{2}<(1-\epsilon)k\left\lVert A\right\rVert_{F}^{2}\right\}\leq\exp\left(-k\frac{\epsilon^{2}}{4}r_{4}(A)\right).

Note r4(A)r(A)r_{4}(A)\geq r_{\infty}(A) always. When ϵr4(A)r(A)\epsilon\,r_{4}(A)\geq r_{\infty}(A), there is a savings of one factor of ϵ\epsilon in the upper tail; however, for our applications, we do not know the relative sizes of r4(A)r_{4}(A) and r(A)r_{\infty}(A), so the kϵ2r(A)/8k\epsilon^{2}r_{\infty}(A)/8 bound was included.

Proof.

Let A=UΣVA=U\Sigma V^{\top} be the SVD of AA, with Σ=diag(σ)\Sigma=\mathrm{diag}(\vec{\sigma}) the diagonal matrix of singular values. So A=VΣUA^{\top}=V\Sigma U^{\top} and by the rotation invariance of standard Gaussian vectors, AA^{\top} acts on a row ZiZ_{i} of ZZ as

AZi=VΣUZiVΣgiwithgiDA^{\top}Z_{i}=V\Sigma U^{\top}Z_{i}\sim V\Sigma g_{i}\quad\text{with}\quad g_{i}\in\mathbb{R}^{D}

and consequently

Agi22giΣVVΣgi=giΣ2gi=jσj2gij2\left\lVert A^{\top}g_{i}\right\rVert_{2}^{2}\sim g_{i}^{\top}\Sigma V^{\top}V\Sigma g_{i}=g_{i}^{\top}\Sigma^{2}g_{i}=\sum_{j}\sigma_{j}^{2}g_{ij}^{2}

with the gijg_{ij} independent standard Gaussian.

We then have

ZAF2=AZF2=i=1kAgi22\left\lVert ZA\right\rVert_{F}^{2}=\left\lVert A^{\top}Z^{\top}\right\rVert_{F}^{2}=\sum_{i=1}^{k}\left\lVert A^{\top}g_{i}\right\rVert_{2}^{2}

and for λ>0\lambda>0 to be determined

{ZAF2>(1+ϵ)kAF2}\displaystyle\mathbb{P}\left\{\left\lVert ZA\right\rVert_{F}^{2}>(1+\epsilon)k\left\lVert A\right\rVert_{F}^{2}\right\} eλ(1+ϵ)kAF2𝔼exp(λi=1kAgi22)\displaystyle\leq e^{-\lambda(1+\epsilon)k\left\lVert A\right\rVert_{F}^{2}}\mathbb{E}\exp\left(\lambda\sum_{i=1}^{k}\left\lVert A^{\top}g_{i}\right\rVert_{2}^{2}\right)
=(eλ(1+ϵ)AF2𝔼exp(λAg122))k\displaystyle=\left(e^{-\lambda(1+\epsilon)\left\lVert A\right\rVert_{F}^{2}}\mathbb{E}\exp\left(\lambda\left\lVert A^{\top}g_{1}\right\rVert_{2}^{2}\right)\right)^{k}

with g1Dg_{1}\in\mathbb{R}^{D} standard Gaussian, having used the independence of the rows {gi}\left\{g_{i}\right\}. We can now use the independence of the columns, here via the coordinates of g1g_{1}:

𝔼exp(λAg122)=j𝔼exp(λσj2g1j2)=j(12λσj2)1/2\displaystyle\mathbb{E}\exp(\lambda\left\lVert A^{\top}g_{1}\right\rVert_{2}^{2})=\prod_{j}\mathbb{E}\exp(\lambda\sigma_{j}^{2}g_{1j}^{2})=\prod_{j}(1-2\lambda\sigma_{j}^{2})^{-1/2}

provided λ<1/(2σ12)\lambda<1/(2\sigma_{1}^{2}) via change of variables y=(12λσj2)1/2xy=(1-2\lambda\sigma_{j}^{2})^{1/2}\,x,

𝔼exp(λσj2g1j2)=12πexp(λσj2x2x22)𝑑x=(12λσj2)1/2.\displaystyle\mathbb{E}\exp(\lambda\sigma_{j}^{2}g_{1j}^{2})=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\exp\left(\lambda\sigma_{j}^{2}x^{2}-\frac{x^{2}}{2}\right)\,dx=(1-2\lambda\sigma_{j}^{2})^{-1/2}.

On [0,1/2][0,1/2], the function xex+x2(1x)x\mapsto e^{x+x^{2}}(1-x) is increasing from 11, while on [1/2,2/3][1/2,2/3] it is decreasing and still greater than 1, as (10/9)>log(3)(10/9)>\log(3), so

𝔼exp(λσj2g1j2)exp(λσj2+2λ2σj4)certainly when2λσ122/3,\mathbb{E}\exp(\lambda\sigma_{j}^{2}g_{1j}^{2})\leq\exp\big{(}\lambda\sigma_{j}^{2}+2\lambda^{2}\sigma_{j}^{4}\big{)}\quad\text{certainly when}\quad 2\lambda\sigma_{1}^{2}\leq 2/3,

leaving us to minimize

h(λ):=λ(1+ϵ)AF2+j(λσj2+2λ2σj4)=λϵjσj2+2λ2jσj4.h(\lambda):=-\lambda(1+\epsilon)\left\lVert A\right\rVert_{F}^{2}+\sum_{j}(\lambda\sigma_{j}^{2}+2\lambda^{2}\sigma_{j}^{4})=-\lambda\epsilon\sum_{j}\sigma_{j}^{2}+2\lambda^{2}\sum_{j}\sigma_{j}^{4}.

There will turn out to be two cases. If we minimize h(λ)h(\lambda) directly, the minimizer is

λ=ϵ4jσj2jσj4at whichh(λ)=ϵ28(jσj2)2jσj4=ϵ28r4(A).\lambda^{\ast}=\frac{\epsilon}{4}\frac{\sum_{j}\sigma_{j}^{2}}{\sum_{j}\sigma_{j}^{4}}\quad\text{at which}\quad h(\lambda^{\ast})=-\frac{\epsilon^{2}}{8}\frac{\left(\sum_{j}\sigma_{j}^{2}\right)^{2}}{\sum_{j}\sigma_{j}^{4}}=-\frac{\epsilon^{2}}{8}r_{4}(A).

This estimate still requires 2λσ12<12\lambda^{\ast}\sigma_{1}^{2}<1, so if we require 2λσ121/22\lambda^{\ast}\sigma_{1}^{2}\leq 1/2, we force

1ϵr(A)=1ϵjσj2σ12>1ϵ4λjσj2=r4(A).\displaystyle\frac{1}{\epsilon}r_{\infty}(A)=\frac{1}{\epsilon}\frac{\sum_{j}\sigma_{j}^{2}}{\sigma_{1}^{2}}>\frac{1}{\epsilon}4\lambda^{\ast}\sum_{j}\sigma_{j}^{2}=r_{4}(A).

Because we always have r4(A)r(A)r_{4}(A)\geq r_{\infty}(A), we can use the h(λ)h(\lambda^{\ast}) value when r4(A)r_{4}(A) and r(A)r_{\infty}(A) are “comparable” and ϵ(0,1)\epsilon\in(0,1).

For the other case, when r4(A)ϵ1r(A)r_{4}(A)\geq\epsilon^{-1}r_{\infty}(A), we have

jσj4ϵσ12jσj2\sum_{j}\sigma_{j}^{4}\leq\epsilon\,\sigma_{1}^{2}\sum_{j}\sigma_{j}^{2}

and can upper bound h(λ)h(\lambda) by h~(λ)\tilde{h}(\lambda) defined as

h~(λ)=λϵjσj2+2λ2αϵσ12jσj2for anyα1.\tilde{h}(\lambda)=-\lambda\epsilon\sum_{j}\sigma_{j}^{2}+2\lambda^{2}\alpha\epsilon\,\sigma_{1}^{2}\sum_{j}\sigma_{j}^{2}\quad\text{for any}\quad\alpha\geq 1.

The minimizer for h~(λ)\tilde{h}(\lambda) is

λ~=14ασ12at whichh~(λ~)=ϵ8αr(A),\tilde{\lambda}^{\ast}=\frac{1}{4\alpha\sigma_{1}^{2}}\quad\text{at which}\quad\tilde{h}(\tilde{\lambda}^{\ast})=-\frac{\epsilon}{8\alpha}r_{\infty}(A),

and this λ~\tilde{\lambda}^{\ast} also satisfies 2λ~σ121/2<12\tilde{\lambda}^{\ast}\sigma_{1}^{2}\leq 1/2<1 whenever α1\alpha\geq 1.

When ϵ(0,1)\epsilon\in(0,1), we can also avoid the distinction between the two cases by noting σ1σj\sigma_{1}\geq\sigma_{j} for all jj, so that

jσj4σ12jσj2which corresponds to taking α=1/ϵ in the above.\sum_{j}\sigma_{j}^{4}\leq\sigma_{1}^{2}\sum_{j}\sigma_{j}^{2}\quad\text{which corresponds to taking $\alpha=1/\epsilon$ in the above.}\quad

For the lower tail, with λ<0\lambda<0,

{ZAF2<(1ϵ)kAF2}eλ(1ϵ)kAF2𝔼exp(λZAF2)\displaystyle\mathbb{P}\left\{\left\lVert ZA\right\rVert_{F}^{2}<(1-\epsilon)k\left\lVert A\right\rVert_{F}^{2}\right\}\leq e^{-\lambda(1-\epsilon)k\left\lVert A\right\rVert_{F}^{2}}\mathbb{E}\exp(\lambda\left\lVert ZA\right\rVert_{F}^{2})
=(eλ(1ϵ)AF2j𝔼exp(λσj2g1j2))k.\displaystyle=\left(e^{-\lambda(1-\epsilon)\left\lVert A\right\rVert_{F}^{2}}\prod_{j}\mathbb{E}\exp(\lambda\sigma_{j}^{2}g_{1j}^{2})\right)^{k}.

Because λ<0\lambda<0, we can estimate the moment generating function in two ways. From ex1+x+x2/2e^{x}\leq 1+x+x^{2}/2 for x0x\leq 0, we find

𝔼exp(λσj2g1j2)1+λσj2+(3/2)λ2σj4exp(λσj2+(3/2)λ2σj4)\mathbb{E}\exp(\lambda\sigma_{j}^{2}g_{1j}^{2})\leq 1+\lambda\sigma_{j}^{2}+(3/2)\lambda^{2}\sigma_{j}^{4}\leq\exp\left(\lambda\sigma_{j}^{2}+(3/2)\lambda^{2}\sigma_{j}^{4}\right)

while if we use that ex+x2/2(1x)e^{x+x^{2}/2}(1-x) is decreasing to 1 for x0x\leq 0,

𝔼exp(λσj2g1j2)=(12λσj2)1/2exp(λσj2+λ2σj4).\mathbb{E}\exp(\lambda\sigma_{j}^{2}g_{1j}^{2})=(1-2\lambda\sigma_{j}^{2})^{-1/2}\leq\exp\left(\lambda\sigma_{j}^{2}+\lambda^{2}\sigma_{j}^{4}\right).

Minimizing

h(λ)=λ(1ϵ)AF2+j(λσj2+βλ2σj4)=λϵAF2+βλ2jσj4h_{-}(\lambda)=-\lambda(1-\epsilon)\left\lVert A\right\rVert_{F}^{2}+\sum_{j}(\lambda\sigma_{j}^{2}+\beta\lambda^{2}\sigma_{j}^{4})=\lambda\epsilon\left\lVert A\right\rVert_{F}^{2}+\beta\lambda^{2}\sum_{j}\sigma_{j}^{4}

yields

h(λ)=ϵ24βr4(A)atλ=ϵ2βAF2jσj4h_{-}(\lambda_{-}^{\ast})=-\frac{\epsilon^{2}}{4\beta}r_{4}(A)\quad\text{at}\quad\lambda_{-}^{\ast}=-\frac{\epsilon}{2\beta}\frac{\left\lVert A\right\rVert_{F}^{2}}{\sum_{j}\sigma_{j}^{4}}

with β=3/2\beta=3/2 corresponding to the Taylor expansion bound and β=1\beta=1 corresponding to the function bound.

Putting everything together, and remembering the kkth power outside, we complete the lemma. ∎

The next lemma makes the connection between equation (JL2JL^{2}) and equation (JL) explicit, and is informed by the form of the target dimension derived from the tail bound rates above. In the Gaussian case, C+=8C_{+}=8 and C=4C_{-}=4.

Lemma 5.0.5.

For 0<ϵ<10<\epsilon<1 and C±>0C_{\pm}>0, the requirements

C+ϵ~+2=Cϵ~2,(1ϵ)2γ(ϵ)=1ϵ~and(1+ϵ)2γ(ϵ)=1+ϵ~+\frac{C_{+}}{\widetilde{\epsilon}_{+}^{2}}=\frac{C_{-}}{\widetilde{\epsilon}_{-}^{2}},\quad\frac{(1-\epsilon)^{2}}{\gamma(\epsilon)}=1-\widetilde{\epsilon}_{-}\quad\text{and}\quad\frac{(1+\epsilon)^{2}}{\gamma(\epsilon)}=1+\widetilde{\epsilon}_{+}

have solution ϵ~+=θϵ~\widetilde{\epsilon}_{+}=\theta\,\widetilde{\epsilon}_{-} with θ=C+/C\theta=\sqrt{C_{+}/C_{-}},

1>ϵ~=4ϵ(1+ϵ)2+θ(1ϵ)2,andγ(ϵ)=(1+ϵ)2+θ(1ϵ)21+θ.1>\widetilde{\epsilon}_{-}=\frac{4\epsilon}{(1+\epsilon)^{2}+\theta(1-\epsilon)^{2}},\quad\text{and}\quad\gamma(\epsilon)=\frac{(1+\epsilon)^{2}+\theta(1-\epsilon)^{2}}{1+\theta}.
Proof.

The first equation gives ϵ~+=θϵ~\widetilde{\epsilon}_{+}=\theta\,\widetilde{\epsilon}_{-}. Taking θ\theta times the second equation and adding it to the third gives the equation for γ(ϵ)\gamma(\epsilon). Subtracting the second equation from the third yields

(1+θ)ϵ~=(1+ϵ)2(1ϵ)2γ(ϵ)=4ϵγ(ϵ).(1+\theta)\widetilde{\epsilon}_{-}=\frac{(1+\epsilon)^{2}-(1-\epsilon)^{2}}{\gamma(\epsilon)}=\frac{4\epsilon}{\gamma(\epsilon)}.

Conclude

ϵ~=4ϵ(1+ϵ)2+θ(1ϵ)2=4ϵ(1+θ)(1ϵ)2+4ϵ<1.\widetilde{\epsilon}_{-}=\frac{4\epsilon}{(1+\epsilon)^{2}+\theta(1-\epsilon)^{2}}=\frac{4\epsilon}{(1+\theta)(1-\epsilon)^{2}+4\epsilon}<1.\qed
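A direct numerical verification of these formulas (a minimal sketch using the Gaussian-case constants mentioned before the lemma):

```python
import math

def jl_gammas(eps, C_plus=8.0, C_minus=4.0):
    """Solve the requirements of lemma 5.0.5 for gamma and the two adjusted tolerances."""
    theta = math.sqrt(C_plus / C_minus)
    gamma = ((1 + eps)**2 + theta * (1 - eps)**2) / (1 + theta)
    eps_minus = 4 * eps / ((1 + eps)**2 + theta * (1 - eps)**2)
    return gamma, eps_minus, theta * eps_minus

eps = 0.2
gamma, em, ep = jl_gammas(eps)
print((1 - eps)**2 / gamma, 1 - em)   # these agree
print((1 + eps)**2 / gamma, 1 + ep)   # these agree
print(8 / ep**2, 4 / em**2)           # and so do these
```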
Lemma 5.0.6.

For 1jM/21\leq j\leq\left\lfloor M/2\right\rfloor,

(Mj)(eMj)j.\binom{M}{j}\leq\left(\frac{eM}{j}\right)^{j}.
Proof.

First note j!(j/e)jj!\geq(j/e)^{j} by

jjj!i=0jii!=ej.We then have(Mj)=M!j!(Mj)!=1j!i=0j1(Mi)Mjj!(eMj)j\frac{j^{j}}{j!}\leq\sum_{i=0}^{\infty}\frac{j^{i}}{i!}=e^{j}.\quad\text{We then have}\quad\binom{M}{j}=\frac{M!}{j!(M-j)!}=\frac{1}{j!}\prod_{i=0}^{j-1}(M-i)\leq\frac{M^{j}}{j!}\leq\left(\frac{eM}{j}\right)^{j}

from our lower bound for j!j!. ∎

Literature Cited

  • [Als08] Brian Alspach, The wonderful Walecki construction, Bulletin of the Institute of Combinatorics and its Applications 52 (2008), 7–20 (English).
  • [GVL13] Gene H. Golub and Charles F. Van Loan, Matrix computations, 4th ed., The Johns Hopkins University Press, Baltimore, MD, 2013 (English).
  • [HMT10] Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp, Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions, December 2010, arXiv:0909.4061 [math] version: 2.
  • [JL84] William B. Johnson and Joram Lindenstrauss, Extensions of Lipschitz mappings into a Hilbert space, Conference in modern analysis and probability (New Haven, Conn., 1982), Contemp. Math., vol. 26, Amer. Math. Soc., Providence, RI, 1984, pp. 189–206. MR 737400
  • [KM05] B. Klartag and S. Mendelson, Empirical processes and random projections, Journal of Functional Analysis 225 (2005), no. 1, 229–245 (en).
  • [LN14] Kasper Green Larsen and Jelani Nelson, The Johnson-Lindenstrauss lemma is optimal for linear dimensionality reduction, arXiv:1411.2404 [cs, math] (2014), arXiv: 1411.2404.
  • [Luc82] E. Lucas, Récréations mathématiques, 1882, Published: Paris. Gauthier-Villars. T. II. 1883.
  • [Nel20] Jelani Nelson, Dimensionality Reduction in Euclidean Space, Notices of the American Mathematical Society 67 (2020), no. 10, 1 (en).
  • [NY17] Assaf Naor and Pierre Youssef, Restricted invertibility revisited, A journey through discrete mathematics. A tribute to Jiří Matoušek, Springer, Cham, 2017, pp. 657–691 (English).
  • [RV13] Mark Rudelson and Roman Vershynin, Hanson-Wright inequality and sub-gaussian concentration, Electronic Communications in Probability 18 (2013), no. none, 1–9, Publisher: Institute of Mathematical Statistics and Bernoulli Society.
  • [Sar06] Tamas Sarlos, Improved Approximation Algorithms for Large Matrices via Random Projections, 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS’06), October 2006, ISSN: 0272-5428, pp. 143–152.
  • [Vem04] Santosh S. Vempala, The random projection method, DIMACS Series in Discrete Mathematics and Theoretical Computer Science, vol. 65, American Mathematical Society, Providence, RI, 2004. MR 2073630
  • [Ver18] Roman Vershynin, High-dimensional probability. An introduction with applications in data science, Camb. Ser. Stat. Probab. Math., vol. 47, Cambridge University Press, Cambridge, 2018 (English).