Bulk Johnson-Lindenstrauss Lemmas

Michael P. Casey

Abstract

For a set $X$ of $N$ points in $\mathbb{R}^{D}$ , the Johnson-Lindenstrauss lemma provides random linear maps that approximately preserve all pairwise distances in $X$ – up to multiplicative error $(1\pm\epsilon)$ with high probability – using a target dimension of $O(\epsilon^{-2}\log(N))$ . Certain known point sets actually require a target dimension this large – any smaller dimension forces at least one distance to be stretched or compressed too much. What happens to the remaining distances? If we only allow a fraction $\eta$ of the distances to be distorted beyond tolerance $(1\pm\epsilon)$ , we show a target dimension of $O(\epsilon^{-2}\log(4e/\eta)\log(N)/R)$ is sufficient for the remaining distances. With the stable rank of a matrix $A$ as $\left\lVert A\right\rVert_{F}^{2}/\left\lVert A\right\rVert^{2}$ , the parameter $R$ is the minimal stable rank over certain $\log(N)$ sized subsets of $X-X$ or their unit normalized versions, involving each point of $X$ exactly once. The linear maps may be taken as random matrices with i.i.d. zero-mean unit-variance sub-gaussian entries. When the data is sampled i.i.d. as a given random vector $\xi$ , refined statements are provided; the most improvement happens when $\xi$ or the unit normalized $\widehat{\xi-\xi^{\prime}}$ is isotropic, with $\xi^{\prime}$ an independent copy of $\xi$ , and includes the case of i.i.d. coordinates.

•

e-mail: [email protected]
•

MSC: primary 68Q87, 68R12, 60B20; secondary 62G30, 68T09
•

keywords: dimension reduction, Johnson-Lindenstrauss lemma, Hanson-Wright inequality, stable rank, effective rank, intrinsic dimension, order statistics, Walecki construction, bulk, batch, minibatch, random projection

1 Introduction

The Johnson-Lindenstrauss lemma [JL84] concerns the approximate preservation of distances in a finite point set in Euclidean space. Specifically, for a subset $X\subset\mathbb{R}^{D}$ of $N$ points and a tolerance $\epsilon\in(0,1)$ , there exist $k\times D$ matrices $Z$ and a constant $\gamma(\epsilon)$ for which

(1-\epsilon)\left\lVert x-x^{\prime}\right\rVert_{2}\leq\sqrt{\frac{\gamma(\epsilon)}{k}}\left\lVert Z(x-x^{\prime})\right\rVert_{2}\leq(1+\epsilon)\left\lVert x-x^{\prime}\right\rVert_{2}

(JL)

holds for all pairs of points $x,x^{\prime}\in X$ simultaneously, with probability at least $1-\delta$ , provided

k=O(\epsilon^{-2}\log(N^{2}/\delta))=:D_{JL}(N)=:D_{JL}.

The matrices $Z$ are drawn randomly, and much work has been done since the original paper to equip $Z$ with special properties, such as allowing fast matrix multiplication, preserving sparsity, restricting the matrix entries to discrete distributions, and so forth; see [Nel20] for a recent review. The matrix $Z$ provides a linear method for dimension reduction, which, at the very least, reduces the amount of space needed to store the dataset $X$ on the computer, provided one can work with approximate versions of the pairwise distances. One would expect that “robust” downstream algorithms that depend on distance data, now working on the pointset $ZX$ , should still return answers that approximate those found on the original pointset $X$ , but now in shorter time, with less memory needed, etc. People appreciate this lemma, in theory [Vem04], for the above reasons, and if the algorithm scales linearly in the ambient dimension, in time or in space, then processing $ZX$ instead of $X$ will yield a proportional improvement.

However, certain algorithms, including many nearest-neighbor algorithms, scale exponentially in the dimension, especially if they only use space linear in $N$ , due to packing arguments. For comparison to brute force, the all pairs nearest-neighbor problem may already be solved in $O(DN^{2})$ time by scanning the points of $X$ with respect to their distance to each query $x$ . If the algorithm scales like $DN\exp(D)$ , this scaling is only an improvement for $D=\log(N)$ , and even then one would really prefer $D=o(\log N)$ , as $N^{2}$ is too expensive when $N$ is large. Consequently, if we use multiplication by $Z$ as a preprocessing step, the target dimension $D_{JL}=O(\epsilon^{-2}\log(N^{2}))$ is much too large to be practical, as $\exp(D_{JL})$ is now polynomial in $N$ with exponent at least $2/\epsilon^{2}$ with $\epsilon<1$ .

Apart from searching for better algorithms, it is then natural to ask whether there are matrices $Z$ with target dimension $k$ much smaller than $D_{JL}$ that would still satisfy equation (JL) for all pairs of points of $X$ . However, Larsen and Nelson [LN14] showed no such matrix exists for general datasets – a union of the standard simplex and a sufficient number of Gaussian vectors forces at least one distance to be stretched or compressed too much. On the other hand, if further assumptions are made on the pointset, smaller $k$ is possible, for instance, when $X$ already lies in a low-dimensional subspace [Sar06] or when its unit difference set $\widehat{X-X}$ has low Gaussian complexity [KM05], even while allowing many more points in the dataset.

If one considers a given algorithm “robust”, one would hope that failing to preserve a few distances should not markedly change the final output; though, one would ideally have to prove such behavior for that algorithm. One is then led to ask: what happens to the other distances between the points of $ZX$ when $k$ is smaller than $D_{JL}$ ? To be concrete, can we approximately preserve $(1-\eta)$ of all the distances, for some fraction $\eta\in(0,1/2)$ , ideally with $k=o(D_{JL}(N))$ ? The results in this paper show this is possible when $\eta$ is not too small and the data has high or even moderate “intrinsic dimension”, in a sense to be defined later. As a preview, for data sampled i.i.d. like a random vector $\xi\in\mathbb{R}^{D}$ , corollary 4.1.8 shows that if $\xi$ has i.i.d. coordinates (No moment assumptions are made.), we can take

k=\frac{C^{\prime}}{\epsilon^{2}}\left(\log(4e/\eta)\log(6De/\zeta)+\frac{\log(N^{2}/\delta)}{\eta D}\right)

for preserving $(1-\eta)(1-\zeta)$ of all pairwise distances, with probability at least $1-2\delta$ over the draw for $Z$ and $X$ , provided $N=\Omega(\zeta^{-1}D\log(N/\delta))$ and $N=\Omega(D\log(6De/\zeta))$ . The matrix $Z$ may have i.i.d. standard Gaussian, or in general, zero-mean unit-variance sub-gausssian coordinates. This estimate for $k$ is an “improvement” over the $D_{JL}$ target dimension as soon as $\eta D=\omega(1)$ or say a fractional power of $\log(N)$ ; “improvement” must be in quotes, as we have only guaranteed “most” distances are approximately preserved, not all. For general $\xi$ , the $1/\eta D$ is replaced by a $1/\eta\widehat{r}$ , with $\widehat{r}$ the intrinsic dimension $1/\left\lVert\mathbb{E}\widehat{y}\widehat{y}^{\top}\right\rVert$ of the unit normalized random vector $\widehat{y}=\widehat{\xi-\xi^{\prime}}$ for an independent copy $\xi^{\prime}$ of $\xi$ . We have $1\leq\widehat{r}\leq D$ always, and we may estimate it using corollary 4.2.1.

Both of our main results – theorems 2.0.6 for general datasets and theorem 4.1.5 for i.i.d. samples – rely on a dual viewpoint for the dimension reduction problem, namely, instead of asking how $Z$ transforms the data $X$ , we ask how $X^{\top}$ transforms the test matrix $Z^{\top}$ ; we can then exploit known properties of how general matrices act on $Z^{\top}$ or its columns. Standard results like the Hanson-Wright inequality may be viewed in this light, and we indeed do so in this paper. Treating $Z^{\top}$ as a test matrix with known properties has been done previously in randomized numerical linear algebra [HMT10]; however, unlike there, slow decay in singular values is actually a good case for us.

The rest of the paper is organized as follows. The main argument allowing us to quantify how many distances can be preserved is in section 2 and leads to our first theorem 2.0.6, which, by itself, only suggests smaller target dimensions $k$ may be possible by considering “small” batches of difference vectors. We then recall the Walecki construction in section 3, which gives us a way to choose these batches that works well for i.i.d. samples as well as the standard simplex. Section 4 then presents the rest of our results specializing to data sampled i.i.d. from a given distribution, which may just be a larger dataset. This section leads up to our second main theorem 4.1.5 allowing us to make $k$ depend on the intrinsic dimension $\widehat{r}$ mentioned above. We also show how to estimate $\widehat{r}$ from the data. The appendix contains proofs for the properties of $Z$ that we use in the paper, namely, a variant of the Hanson-Wright inequality, and a corresponding independent proof in the Gaussian case in order to have decent explicit constants for $k$ .

1.1 Notation

Suppose $X\subset\mathbb{R}^{D}$ is a point set of size $N$ . Given an arbitrary ordering of the points of $X$ , let $Y$ be the set of difference vectors

Y:=\left\{x_{i}-x_{j}\;\lvert\;x_{i},x_{j}\in X,\,0\leq i<j\leq N-1\right\}

and $\widehat{Y}$ be their unit normalized versions

\widehat{Y}:=\left\{y/\left\lVert y\right\rVert_{2}\;\lvert\;y\in Y\right\}\subset S^{D-1}.

The number of elements of a set $\Upsilon$ is denoted by $\left\lvert\Upsilon\right\rvert$ . We set $(x,y)=x^{\top}y$ for the usual Euclidean inner product, with $\left\lVert x\right\rVert_{2}^{2}=(x,x)$ . We also denote by $o$ the origin $(0,\ldots,0)$ in any $\mathbb{R}^{j}$ .

Let $A$ be a $D\times M$ matrix, which we write as $A\in\mathbb{R}^{D\times M}$ . From [GVL13, page 76], the singular value decomposition (SVD) for $A$ is the factorization $A=U\Sigma V^{\top}$ with $U\in\mathbb{R}^{D\times D}$ and $V\in\mathbb{R}^{M\times M}$ orthogonal and

\Sigma=\mathrm{diag}(\sigma_{1},\ldots,\sigma_{p})\quad\text{with}\quad p=\min\left\{D,\,M\right\}

for $\sigma_{1}\geq\sigma_{2}\geq\ldots\geq\sigma_{p}\geq 0$ . Let $\vec{\sigma}=(\sigma_{1},\ldots,\sigma_{p})$ as a vector in $\mathbb{R}^{p}$ . We write $\left\lVert A\right\rVert=\left\lVert\vec{\sigma}\right\rVert_{\infty}=\sigma_{1}$ for the operator norm of $A$ , while $\left\lVert A\right\rVert_{F}=\left\lVert\vec{\sigma}\right\rVert_{2}$ is the Frobenius (or Hilbert-Schmidt) norm of $A$ . We may also compute $\left\lVert A\right\rVert_{F}$ via

\left\lVert A\right\rVert_{F}^{2}=\sum_{i}\left\lVert A_{i}\right\rVert_{2}^{2}=\sum_{i,j}A_{ij}^{2}

where the $A_{i}$ may be taken as the rows or the columns of $A$ . Vectors are treated as column vectors unless otherwise indicated.

The $\infty$ -stable rank $r_{\infty}(A)$ of a matrix $A$ is the ratio

r_{\infty}(A):=\frac{\left\lVert A\right\rVert_{F}^{2}}{\left\lVert A\right\rVert^{2}}=\frac{\left\lVert\vec{\sigma}\right\rVert_{2}^{2}}{\left\lVert\vec{\sigma}\right\rVert_{\infty}^{2}}=\frac{\sum\sigma_{i}^{2}}{\sigma_{1}^{2}}.

There are other variants of stable rank [NY17], but only the $4$ -stable rank $r_{4}(A)$ will make an appearance in this paper:

r_{4}(A):=\frac{\left\lVert A\right\rVert_{F}^{4}}{\left\lVert AA^{\top}\right\rVert_{F}^{2}}=\frac{\left\lVert\vec{\sigma}\right\rVert_{2}^{4}}{\left\lVert\vec{\sigma}\right\rVert_{4}^{4}}=\frac{\big{(}\sum\sigma_{i}^{2}\big{)}^{2}}{\sum\sigma_{i}^{4}}\geq\frac{\big{(}\sum\sigma_{i}^{2}\big{)}^{2}}{\sigma_{1}^{2}\sum\sigma_{i}^{2}}=r_{\infty}(A)

So $r_{4}(A)\geq r_{\infty}(A)$ always, and both of these are at most $p$ , the rank of $A$ .

We make use of the $\psi_{2}$ -norm and $\psi_{1}$ -norm, defined for a random variable $\omega$ as

\left\lVert\omega\right\rVert_{\psi_{2}}:=\inf\left\{t>0\;\lvert\;\mathbb{E}\exp(\omega^{2}/t^{2})\leq 2\right\},\quad\text{and}\quad\left\lVert\omega\right\rVert_{\psi_{1}}:=\inf\left\{t>0\;\lvert\;\mathbb{E}\exp(\left\lvert\omega\right\rvert/t)\leq 2\right\}.

See [Ver18, section 2.5 and 2.7].

1.1.1 Isotropic Random Vectors and the Intrinsic Dimension

A particular condition on the second moment matrix of a random vector will be useful in this paper.

Definition 1.1.1.

A random vector $\xi\in\mathbb{R}^{D}$ is isotropic if $\Sigma(\xi):=\mathbb{E}\xi\xi^{\top}=\operatorname{Id}_{D}$ .

A working example is any vector $\xi$ with i.i.d. zero-mean unit-variance coordinates. Basic properties of isotropic random vectors include $\mathbb{E}\left\lVert\xi\right\rVert_{2}^{2}=D$ via the cyclic property of the trace, and that $\xi$ isotropic is equivalent to $\mathbb{E}(\xi,y)^{2}=\left\lVert y\right\rVert_{2}^{2}$ for any $y\in\mathbb{R}^{D}$ . See [Ver18, page 43-5] for more on such vectors.

We can assign a version of $\infty$ -stable rank to the distribution of a random vector via the intrinsic dimension.

Definition 1.1.2.

The intrinsic dimension $r(\xi)$ of a random vector $\xi\in\mathbb{R}^{D}$ is the ratio

r(\xi):=\frac{\,\mathrm{tr}\,\Sigma(\xi)}{\left\lVert\Sigma(\xi)\right\rVert}=\frac{\,\mathrm{tr}\,{\mathbb{E}\xi\xi^{\top}}}{\left\lVert\Sigma(\xi)\right\rVert}=\frac{\mathbb{E}\left\lVert\xi\right\rVert_{2}^{2}}{\left\lVert\mathbb{E}\xi\xi^{\top}\right\rVert},\quad\text{and for $c\neq 0$, }\quad r(c\,\xi)=\frac{c^{2}\,\mathbb{E}\left\lVert\xi\right\rVert_{2}^{2}}{c^{2}\left\lVert\mathbb{E}\xi\xi^{\top}\right\rVert}=r(\xi).

Like the stable ranks, the intrinsic dimension of a vector in $\mathbb{R}^{D}$ is bounded by the ambient dimension, $D$ , so isotropic random vectors realize the highest possible intrinsic dimension. In the literature, $r(\xi)$ is sometimes called the effective rank of the second moment matrix $\Sigma(\xi)$ , and is the stable rank of the matrix $\Sigma^{1/2}(\xi)$ .

Isotropic random vectors behave well under orthogonal projection.

Lemma 1.1.3.

Let $\Phi$ be a $d\times D$ matrix with orthonormal rows and $\xi\in\mathbb{R}^{D}$ an isotropic random vector. Then $\Phi(\xi)\in\mathbb{R}^{d}$ is also isotropic.

Proof.

By linearity of expectation: $\mathbb{E}(\Phi\xi)(\Phi\xi)^{\top}=\Phi(\mathbb{E}\xi\xi^{\top})\Phi^{\top}=\Phi\operatorname{Id}_{D}\Phi^{\top}=\operatorname{Id}_{d}$ . ∎

Note if $\Phi$ is scaled by constant $c\neq 0$ , the intrinsic dimension is unchanged: $d=r(\Phi(\xi))=r(c\,\Phi(\xi))$ .

2 Controlling Order Statistics

In this paper, we only study dimension reduction matrices $Z:\mathbb{R}^{D}\to\mathbb{R}^{k}$ with isotropic rows, that is, $\mathbb{E}(Z_{i},y)^{2}=\left\lVert y\right\rVert_{2}^{2}$ for all $y$ and all rows $Z_{i}$ . So, $\mathbb{E}\left\lVert Zy\right\rVert_{2}^{2}=k\left\lVert y\right\rVert_{2}^{2}$ . We say more about isotropic random vectors in section 1.1.1. To guarantee equation (JL) holds for a difference vector $y\in Y$ , the usual proof for the Johnson-Lindenstrauss lemma considers each vector individually, providing upper bounds for the failure probabilities

p_{+}(y):=\mathbb{P}_{Z}\left\{\left\lVert Zy\right\rVert_{2}^{2}>(1+\widetilde{\epsilon}_{+})k\left\lVert y\right\rVert_{2}^{2}\right\}

and

p_{-}(y):=\mathbb{P}_{Z}\left\{\left\lVert Zy\right\rVert_{2}^{2}<(1-\widetilde{\epsilon}_{-})k\left\lVert y\right\rVert_{2}^{2}\right\},

implicitly working with the formulation

(1-\widetilde{\epsilon}_{-})k\left\lVert x-x^{\prime}\right\rVert_{2}^{2}\leq\left\lVert Z(x-x^{\prime})\right\rVert_{2}^{2}\leq(1+\widetilde{\epsilon}_{+})k\left\lVert x-x^{\prime}\right\rVert_{2}^{2}

(

JL^{2}

)

which is often easier to manage. In lemma 5.0.5 of the appendix, we show how to choose $\widetilde{\epsilon}_{\pm}$ in line with the original equation (JL).

Suppose $\widetilde{\epsilon}_{-}\leq\widetilde{\epsilon}_{+}$ , as it will for this paper. If the distribution of $Z$ is sufficiently nice, sub-gaussian for example, then one may show $p_{+}+p_{-}\leq 2\exp(-\widetilde{\epsilon}_{-}^{2}k/C)$ for each fixed $y$ , with $C$ a constant depending on the distribution of $Z$ . By the union bound (Boole’s inequality), the probability that a random $Z$ fails for some $y\in Y$ is at most

\binom{N}{2}2\exp(-\widetilde{\epsilon}_{-}^{2}k/C)\leq N^{2}\exp(-\widetilde{\epsilon}_{-}^{2}k/C)<\delta,

provided $k>C\widetilde{\epsilon}_{-}^{-2}\log(N^{2}/\delta)=D_{JL}$ . The probability there is some $Z$ that succeeds for all $y\in Y$ is then at least $1-\delta$ , so at least one such $Z$ exists.

The argument above considers the vectors $y\in Y$ one at a time, making sure $Z$ succeeds for each $y$ ; if we consider the unit normalized vectors $\widehat{y}\in\widehat{Y}$ , we may view this argument as controlling the extreme order statistics of the random variables $\left\lVert Z\widehat{y}_{i}\right\rVert_{2}^{2}$ , induced by $Z$ , namely

(1-\widetilde{\epsilon}_{-})k\leq\left\lVert Z\widehat{y}_{(0)}\right\rVert_{2}^{2}\leq\ldots\leq\left\lVert Z\widehat{y}_{(j)}\right\rVert_{2}^{2}\leq\ldots\leq\left\lVert Z\widehat{y}_{\big{(}\binom{N}{2}-1\big{)}}\right\rVert_{2}^{2}\leq(1+\widetilde{\epsilon}_{+})k.

If we only wish to preserve a fraction of the distances, say $(1-\eta)$ with $\eta$ hopefully small, we can consider controlling the intermediate or central order statistics of the $\left\lVert Z\widehat{y}_{i}\right\rVert_{2}^{2}$ instead. We do so as follows.

If we divide the difference vectors into batches of size $M$ , and preserve $(1-\eta)M$ distances there, then we still recover

(1-\eta)M\binom{N}{2}/M=(1-\eta)\binom{N}{2}\quad\text{ of the total distances.}\quad

We assume $\eta M$ is a strictly positive integer here, and for simplicity of discussion, we also assume $M$ divides $N$ ; we shall return to this point later. Let $\Upsilon\subset Y$ be a given batch of size $M$ . Each given matrix $Z$ also induces an ordering on the points of $\Upsilon$ : with $\widehat{y}=y/\left\lVert y\right\rVert_{2}$ ,

\left\lVert Z\widehat{y}_{(0)}\right\rVert_{2}^{2}\leq\ldots\leq\left\lVert Z\widehat{y}_{(i)}\right\rVert_{2}^{2}\leq\ldots\leq\left\lVert Z\widehat{y}_{(M-1)}\right\rVert_{2}^{2}.

As $Z$ is random, this ordering is too, treating $Y$ as fixed. If we could guarantee that $\left\lVert Z\widehat{y}_{((1-\eta)M)}\right\rVert_{2}^{2}\leq(1+\epsilon)k$ , then all $\left\lVert Z\widehat{y}_{(i)}\right\rVert_{2}^{2}$ with $i\leq(1-\eta)M$ also have this upper bound, with an analogous statement for a lower bound of $(1-\epsilon)k$ on $\left\lVert Z\widehat{y}_{(\eta M-1)}\right\rVert_{2}^{2}$ , altogether guaranteeing $(1-2\eta)M+2>(1-2\eta)M$ of the vectors of $\Upsilon$ have

(1-\widetilde{\epsilon}_{-})k\leq\left\lVert Z\widehat{y}\right\rVert_{2}^{2}\leq(1+\widetilde{\epsilon}_{+})k.

We need to control the probabilities

	$\displaystyle p_{+}(\Upsilon):$	$\displaystyle=\mathbb{P}_{Z}\left\{\left\lVert Z\widehat{y}_{((1-\eta)M)}\right\rVert_{2}^{2}>(1+\widetilde{\epsilon}_{+})k\right\}$	(2.1)
and
	$\displaystyle p_{-}(\Upsilon):$	$\displaystyle=\mathbb{P}_{Z}\left\{\left\lVert Z\widehat{y}_{(\eta M-1)}\right\rVert_{2}^{2}<(1-\widetilde{\epsilon}_{-})k\right\}.$	(2.2)

We can recast control of $p_{\pm}(\Upsilon)$ using the following lemma. Let $\widehat{\Upsilon}=\left\{\widehat{y}_{0},\ldots,\widehat{y}_{M-1}\right\}$ , that is, the unit normalized version of $\Upsilon$ .

Lemma 2.0.1.

Let $Z$ be a random $k\times D$ random matrix. With the notation above, and $\Upsilon(\eta)$ running through all $\eta M$ -sized subsets of $\Upsilon$

p_{+}(\Upsilon)\leq\max_{\Upsilon(\eta)}\min_{\Lambda_{\Upsilon(\eta)}}\binom{M}{\eta M}\mathbb{P}_{Z}\left\{\left\lVert Z\Upsilon(\eta)\Lambda_{\Upsilon(\eta)}\right\rVert_{F}^{2}>(1+\widetilde{\epsilon}_{+})k\left\lVert\Upsilon(\eta)\Lambda_{\Upsilon(\eta)}\right\rVert_{F}^{2}\right\}

and

p_{-}(\Upsilon)\leq\max_{\Upsilon(\eta)}\min_{\Lambda_{\Upsilon(\eta)}}\binom{M}{\eta M}\mathbb{P}_{Z}\left\{\left\lVert Z\Upsilon(\eta)\Lambda_{\Upsilon(\eta)}\right\rVert_{F}^{2}<(1-\widetilde{\epsilon}_{-})k\left\lVert\Upsilon(\eta)\Lambda_{\Upsilon(\eta)}\right\rVert_{F}^{2}\right\}

with $\Lambda_{\Upsilon(\eta)}$ a diagonal matrix with strictly positive entries, chosen for each subset $\Upsilon(\eta)$ .

Remark 2.0.2.

For simplicity, the rest of the paper will take $\Upsilon(\eta)\Lambda_{\Upsilon(\eta)}$ as $\Upsilon(\eta)$ itself, or its unit normalized version $\widehat{\Upsilon}(\eta)$ ; though, there may potentially be some improvement gained by the freedom in choosing each $\Lambda_{\Upsilon(\eta)}$ .

Proof.

For $p_{+}(\Upsilon)$ , let $\widehat{y}_{\ast}=\widehat{y}_{((1-\eta)M)}$ . If $\left\lVert Z\widehat{y}_{\ast}\right\rVert_{2}^{2}>(1+\widetilde{\epsilon}_{+})k$ , then $\widehat{y}_{\ast}$ is part of a subset $\Upsilon(\eta)\subset\Upsilon$ of size $\eta M$ for which all the $Z\widehat{y}$ have norms too large. For any given subset $\Upsilon(\eta)$ , consider the following probabilities, with $\lambda_{y}>0$ chosen for each $y$ ,

	$\displaystyle\mathbb{P}_{Z}\left\{\mathcal{E}(\Upsilon(\eta))\right\}$	$\displaystyle:=\mathbb{P}_{Z}\left\{\left\lVert Z\widehat{y}\right\rVert_{2}^{2}>(1+\widetilde{\epsilon}_{+})k\text{ for all }y\in\Upsilon(\eta)\right\}$
		$\displaystyle=\mathbb{P}_{Z}\left\{\left\lVert Zy\right\rVert_{2}^{2}>(1+\widetilde{\epsilon}_{+})k\left\lVert y\right\rVert_{2}^{2}\text{ for all }y\in\Upsilon(\eta)\right\}$
		$\displaystyle=\mathbb{P}_{Z}\left\{\left\lVert Zy\lambda_{y}\right\rVert_{2}^{2}>(1+\widetilde{\epsilon}_{+})k\left\lVert y\lambda_{y}\right\rVert_{2}^{2}\text{ for all }y\in\Upsilon(\eta)\right\}$
		$\displaystyle\leq\mathbb{P}_{Z}\left\{\sum_{y\in\Upsilon(\eta)}\left\lVert Zy\lambda_{y}\right\rVert_{2}^{2}>(1+\widetilde{\epsilon}_{+})k\sum_{y\in\Upsilon(\eta)}\left\lVert y\lambda_{y}\right\rVert_{2}^{2}\right\}.$

The second and third lines follow by the linearity of $Z$ . Passing to the sum above allows an important change of viewpoint using the Frobenius norm, as each $Zy\lambda_{y}$ is a column of $Z(\Upsilon(\eta)\Lambda_{\Upsilon(\eta)})$ , with $\Lambda_{\Upsilon(\eta)}$ the appropriate diagonal matrix:

\mathbb{P}_{Z}\left\{\mathcal{E}(\Upsilon(\eta))\right\}\leq\mathbb{P}_{Z}\left\{\left\lVert Z\Upsilon(\eta)\Lambda_{\Upsilon(\eta)}\right\rVert_{F}^{2}>(1+\widetilde{\epsilon}_{+})k\left\lVert\Upsilon(\eta)\Lambda_{\Upsilon(\eta)}\right\rVert_{F}^{2}\right\}.

Now, there are $\binom{M}{\eta M}$ subsets $\Upsilon(\eta)$ of $\Upsilon$ of size $\eta M$ , so a union bound over such subsets gives

	$\displaystyle p_{+}(\Upsilon)$	$\displaystyle\leq\binom{M}{\eta M}\max_{\Upsilon(\eta)}\mathbb{P}_{Z}\left\{\mathcal{E}(\Upsilon(\eta))\right\}$
		$\displaystyle=\binom{M}{\eta M}\mathbb{P}_{Z}\left\{\left\lVert Z\Upsilon(\eta)\Lambda_{\Upsilon(\eta)}\right\rVert_{F}^{2}>(1+\widetilde{\epsilon}_{+})k\left\lVert\Upsilon(\eta)\Lambda_{\Upsilon(\eta)}\right\rVert_{F}^{2}\right\}.$

The argument for $p_{-}(\Upsilon)$ is similar, noting that $\left\lVert Z\widehat{y}_{(\eta M-1)}\right\rVert_{2}^{2}<(1-\widetilde{\epsilon}_{-})k$ forces $\eta M$ vectors $Z\widehat{y}$ to have squared norms too small, so their corresponding sum is too small as well. ∎

We now make assumptions on the matrix $Z$ , allowing us to make use of the stable rank of the minibatches $\Upsilon(\eta)$ .

Corollary 2.0.3.

With the notation as in lemma 2.0.1, $\widetilde{\epsilon}_{+}=\widetilde{\epsilon}_{-}\sqrt{2}$ , and $\widetilde{\epsilon}_{-}\in(0,1)$ , if $Z$ has i.i.d. standard Gaussian entries, then

\max\left\{p_{+}(\Upsilon),\,p_{-}(\Upsilon)\right\}\leq\binom{M}{\eta M}\max_{\Upsilon(\eta)}\exp\left(-\frac{k\widetilde{\epsilon}_{-}^{2}}{4}r_{\infty}(\Upsilon(\eta))\right).

One may replace $\Upsilon(\eta)$ by $\widehat{\Upsilon}(\eta)$ on the right hand side.

One point we want to stress even now is the presence of the product $k\,r_{\infty}$ in the exponent. If one wants a fixed failure probability, $k$ need not be as large when $r_{\infty}$ is sizable. We shall give several examples in this paper where $r_{\infty}$ is large.

Proof.

The proof follows immediately from lemma 5.0.4 in the appendix, with $A=\Upsilon(\eta)$ , recalling $r_{4}\geq r_{\infty}$ for the $p_{-}(\Upsilon)$ case. With $C_{+}=8$ and $C_{-}=4$ , we improve the denominator from 8 to 4 by setting $C_{+}/\widetilde{\epsilon}_{+}^{2}=C_{-}/\widetilde{\epsilon}_{-}^{2}$ as in in lemma 5.0.5. ∎

Consider the term $\left\lVert Z\Upsilon(\eta)\right\rVert_{F}^{2}$ in lemma 2.0.1, taking $\Lambda_{\Upsilon(\eta)}$ as the identity. (The following discussion will also hold for other scalings, such as $\widehat{\Upsilon}(\eta)$ .) Labeling the rows of $Z$ as $Z_{i}$ , we can interchange the sums implicit in $\left\lVert Z\Upsilon(\eta)\right\rVert_{F}^{2}$ to our advantage:

\displaystyle\sum_{y\in\Upsilon(\eta)}\left\lVert Zy\right\rVert_{2}^{2}=\sum_{y\in\Upsilon(\eta)}\sum_{i=1}^{k}(Z_{i},y)^{2}=\sum_{i=1}^{k}\sum_{y\in\Upsilon(\eta)}(y,Z_{i})^{2}=\sum_{i=1}^{k}\left\lVert\Upsilon(\eta)^{\top}Z_{i}\right\rVert_{2}^{2}.

Because we assume the rows $Z_{i}$ are isotropic, $\mathbb{E}\left\lVert\Upsilon(\eta)^{\top}Z_{i}\right\rVert_{2}^{2}=\left\lVert\Upsilon(\eta)^{\top}\right\rVert_{F}^{2}=\left\lVert\Upsilon(\eta)\right\rVert_{F}^{2}$ , and we have transformed the bound on $p_{+}(\Upsilon)$ to involve the probabilities

\mathbb{P}_{Z}\left\{\sum_{i=1}^{k}\left\lVert\Upsilon(\eta)^{\top}Z_{i}\right\rVert_{2}^{2}>(1+\widetilde{\epsilon}_{+})\sum_{i=1}^{k}\left\lVert\Upsilon(\eta)\right\rVert_{F}^{2}\right\},

and $\Upsilon(\eta)^{\top}$ is now viewed as a fixed matrix acting on the random vectors $Z_{i}$ . We may thus take a dual viewpoint on the dimension reduction problem: instead of considering how the matrix $Z$ acts on each difference vector, we consider how the transposed batch of difference vectors $\Upsilon(\eta)^{\top}$ acts on the matrix $Z^{\top}$ , as mentioned in the introduction. If we take the $Z_{i}$ to be independent, with i.i.d. mean-zero unit-variance sub-gaussian entries, we can use lemma 5.0.3 in the appendix to harness both the independence of the $Z_{i}$ and the Hanson-Wright inequality:

Lemma 2.0.4.

With the notation as in lemma 2.0.1 and $\widetilde{\epsilon}_{-}\leq\widetilde{\epsilon}_{+}$ , let $Z$ be a $k\times D$ random matrix with i.i.d. mean-zero unit-variance sub-gaussian entries, then

p_{\pm}(\Upsilon)\leq\binom{M}{\eta M}\max_{\Upsilon(\eta)}\exp\left(-ck\min\left(\frac{\widetilde{\epsilon}_{-}^{2}r_{4}(\Upsilon(\eta))}{K^{4}},\frac{\widetilde{\epsilon}_{-}\,r_{\infty}(\Upsilon(\eta))}{K^{2}}\right)\right),

with $K=\left\lVert Z_{11}\right\rVert_{\psi_{2}}$ . One may replace $\Upsilon(\eta)$ by $\widehat{\Upsilon}(\eta)$ on the right hand side.

Proof.

To bound the probabilities on the right hand side of lemma 2.0.1, just take $A=\Upsilon(\eta)$ in lemma 5.0.3. ∎

Remark 2.0.5.

Recall $r_{4}\geq r_{\infty}$ always, so if $K\geq 1$ , we can write

p_{\pm}(\Upsilon)\leq\binom{M}{\eta M}\max_{\Upsilon(\eta)}\exp\left(-Ck\min(\widetilde{\epsilon}_{-}^{2},\epsilon)r_{\infty}(\Upsilon(\eta))\right).

We now can control the probabilities in equations (2.1) for a given batch $\Upsilon$ of size $M$ . The control is in terms of $r_{\infty}(\Upsilon(\eta))$ or $r_{4}(\Upsilon(\eta))$ for subsets of size $\eta M$ . Because the target dimension $k$ is a global parameter, it needs to be in terms of global quantitities, but the stable ranks above vary over minibatches. To make this transition and to help with bookkeeping, recall that a decomposition of a graph is a partition of its edges.

Let $\mathcal{P}$ be a decomposition of the complete graph on $N$ vertices, into batches $\Upsilon$ of size $M$ . If $M$ does not divide $N$ , that is, with $N=jM+n$ , we can expand several of the batches to $M+s\geq M$ with $s=\left\lceil n/j\right\rceil<M$ . For those batches, $\eta(M+s)$ need not be an integer, so take $\tilde{\eta}$ as

\tilde{\eta}=\eta\frac{M}{M+s}\quad\text{in order for}\quad\tilde{\eta}(M+s)=\eta M,

(2.3)

and set $\Upsilon(\eta)=\Upsilon(\tilde{\eta})$ when the batch size is not $M$ . Note smaller $\eta$ values only help us ensure a total fraction $(1-\eta)$ of distances is preserved. For this decomposition, let

R_{\infty}(\eta M):=R_{\infty}(\eta M;\mathcal{P}):=\min_{\Upsilon\in\mathcal{P}}\min_{\Upsilon(\eta)\subset\Upsilon}r_{\infty}(\Upsilon(\eta))

be the minimum stable rank of the $\eta M$ sized minibatches from such batches. We then have our first theorem, written in terms of $\epsilon$ in the original equation (JL).

Theorem 2.0.6.

Let $0<\eta<1$ , with $\eta M\in\mathbb{N}$ , and let $0<\epsilon\leq 2/3$ . For a set $X$ of $N$ points in $\mathbb{R}^{D}$ , $R_{\infty}(\eta M)$ as above, and $Z$ a $k\times D$ matrix with i.i.d. mean-zero unit-variance sub-gaussian entries, equation (JL) holds for at least $(1-2\eta)\binom{N}{2}$ pairs $x,x^{\prime}\in X$ , with probability at least $1-\delta$ , provided

k\geq C\frac{\max(K^{4},K^{2})}{\epsilon^{2}}\left(\log(2e/\eta)\frac{\eta M}{R_{\infty}(\eta M)}+\frac{\log(N^{2}/(M\delta))}{R_{\infty}(\eta M)}\right),\quad\text{and}\quad\gamma(\epsilon)=1+\epsilon^{2}.

Here, $K=\left\lVert Z_{11}\right\rVert_{\psi_{2}}$ and $C$ is an absolute constant. In the case of standard Gaussian entries, one can replace $C\max(K^{4},K^{2})$ and $\gamma(\epsilon)$ by

C(\epsilon):=4\left(\frac{(1+\epsilon)^{2}+(1-\epsilon)^{2}\sqrt{2}}{4}\right)^{2}<2.25\quad\text{and}\quad\gamma(\epsilon)=\frac{(1+\epsilon)^{2}+(1-\epsilon)^{2}\sqrt{2}}{1+\sqrt{2}},

respectively. When the distribution for $Z$ is understood, we denote $C\max(K^{4},K^{2})$ or $C(\epsilon)$ by $C_{[\ref{thm:FreeDecompBulkJL}]}$ , and likewise $\gamma(\epsilon)$ by $\gamma_{[\ref{thm:FreeDecompBulkJL}]}$ .

To make sense of the above, suppose $\eta M=D$ , and then recall $1\leq R_{\infty}(\eta M)\leq D$ as the stable rank is always bounded above by the ambient dimension. If $R_{\infty}(\eta M)=cD$ , for $c$ bounded away from 0, the term attached to $\log(2e/\eta)$ becomes constant, while the remaining becomes constant as soon as $D\gtrsim\log(2N^{2}/(D\delta))$ .

Note $R_{\infty}(\eta M)$ depends on the decomposition $\mathcal{P}$ , which we are free to choose at will. The remainder of the paper will address how to choose $\mathcal{P}$ (and the batch size $M$ ).

Proof.

We treat the upper and lower bounds on $\left\lVert Z(x-x^{\prime})\right\rVert_{2}^{2}$ separately. If

\sum_{\Upsilon\in\mathcal{P}}(p_{+}(\Upsilon)+p_{-}(\Upsilon))\leq\delta,

(2.4)

then with probability at least $1-\delta$ none of the events defining $p_{\pm}(\Upsilon)$ hold, over all the batches $\Upsilon$ of the decomposition. So by equation (2.1), with probability at least $1-\delta$ , we preserve at least (recalling $\tilde{\eta}<\eta$ when needed)

\sum_{\Upsilon\in\mathcal{P}}(1-\eta)\left\lvert\Upsilon\right\rvert=(1-\eta)\binom{N}{2}

of the squared distances within a $(1+\widetilde{\epsilon}_{+})$ tolerance, and another $(1-\eta)\binom{N}{2}$ of the squared distances within a $(1-\widetilde{\epsilon}_{-})$ tolerance. By the pigeonhole principle, at least $(1-2\eta)\binom{N}{2}$ of the distances are approximatley preserved on both sides.

It remains to choose $k$ to ensure equation (2.4) holds. We always have, by lemma 5.0.6,

\binom{M+s}{\tilde{\eta}(M+s)}\leq\left(\frac{e(M+s)}{\eta M}\right)^{\eta M}=\exp\left(\log\left(\frac{e(M+s)}{\eta M}\right)\eta M\right)<\exp(\log(2e/\eta)\eta M),

while $\binom{M}{\eta M}\leq\exp(\log(e/\eta)\eta M)$ , so by lemma 2.0.4

p_{+}(\Upsilon)+p_{-}(\Upsilon)\leq 2\exp\left(-\frac{c}{\max(K^{4},K^{2})}k\widetilde{\epsilon}_{-}^{2}R_{\infty}(\eta M)+\log(2e/\eta)\eta M\right).

There are at most $\left\lfloor\binom{N}{2}/M\right\rfloor\leq N^{2}/(2M)$ batches $\Upsilon$ in the decomposition, expanding several of the batches to absorb any remainder $N\mod M$ if necessary, so we need $k$ to satisfy

2(N^{2}/(2M))\exp\left(-\frac{c}{\max(K^{4},K^{2})}k\widetilde{\epsilon}_{-}^{2}R_{\infty}(\eta M)+\log(2e/\eta)\eta M\right)\leq\delta.

Using lemma 5.0.5 with $C_{+}=C_{-}=1$ , we achieve the desired bound for $k$ in terms of $\epsilon$ by taking

C=\frac{1}{c}\left(\frac{1+\epsilon^{2}}{2}\right)^{2}<0.522\frac{1}{c}.

To replace $C\max(K^{4},K^{2})$ in the Gaussian case, use corollary 2.0.3 with lemma 5.0.5 on $C_{+}=2C_{-}$ to find

\frac{4}{\widetilde{\epsilon}_{-}^{2}}=\frac{C(\epsilon)}{\epsilon^{2}}.\qed

From lemma 2.0.1 we are free to scale vectors in each batch $\Upsilon$ to have unit norm. So we can also define for a given decomposition

\widehat{R}_{\infty}(\eta M;\mathcal{P})=\min_{\widehat{\Upsilon}\in\mathcal{P}}\min_{\widehat{\Upsilon}(\eta)\subset\widehat{\Upsilon}}r_{\infty}(\widehat{\Upsilon}(\eta)).

This normalization allows us to use linear algebra to control $r_{\infty}(\widehat{\Upsilon}(\eta))$ deterministically, for any of the $\eta M$ -sized subsets of $\widehat{\Upsilon}$ , once we have control of $\sigma_{1}(\widehat{\Upsilon})$ . When the underlying data is random, we can thus avoid the need to take union bounds over the minibatches.

Lemma 2.0.7.

Let $A$ be a $D\times M$ matrix with columns of constant norm. If $B$ is a $D\times m$ submatrix of $A$ , then

r_{\infty}(B)\geq\max\left(\frac{m}{M}r_{\infty}(A),1\right).

In particular, if $r_{\infty}(A)\geq cM$ , then $r_{\infty}(B)\geq\max(cm,1)$ .

To be useful, we need $c\gg 1/m$ here.

Proof.

To control $\left\lVert B\right\rVert$ , partition the matrix $A$ as $A=(B^{\prime}|B)$ with $B^{\prime}$ a $D\times(M-m)$ matrix. Viewing the unit sphere $S^{m-1}$ as $o\times S^{m-1}$ in $S^{M-1}$ ,

\left\lVert A\right\rVert=\max_{x\in S^{M-1}}\left\lVert(B^{\prime}|B)x\right\rVert_{2}\geq\max_{x\in S^{m-1}}\left\lVert(B^{\prime}|B)x\right\rVert_{2}=\max_{x\in S^{m-1}}\left\lVert Bx\right\rVert_{2}=\left\lVert B\right\rVert.

That is, $\left\lVert B\right\rVert\leq\left\lVert A\right\rVert$ . Note $r_{\infty}(A)=r_{\infty}(cA)$ for any nonzero scalar $c$ , so we may assume the columns of $A$ all have norm 1. Under this assumption,

r_{\infty}(B)=\frac{\sum\left\lVert B_{j}\right\rVert_{2}^{2}}{\left\lVert B\right\rVert^{2}}=\frac{m}{M}\frac{\sum\left\lVert A_{j}\right\rVert_{2}^{2}}{\left\lVert B\right\rVert^{2}}\geq\frac{m}{M}\frac{\sum\left\lVert A_{j}\right\rVert_{2}^{2}}{\left\lVert A\right\rVert^{2}}=\frac{m}{M}r_{\infty}(A).

Recalling $r_{\infty}\geq 1$ always completes the proof. ∎

If we set

\widehat{R}_{\infty}(M):=\widehat{R}_{\infty}(M;\mathcal{P}):=\min_{\Upsilon\in\mathcal{P}}r_{\infty}(\widehat{\Upsilon}),\quad\text{we can state the following theorem.}\quad

Theorem 2.0.8.

Let $0<\eta<1$ , with $\eta M\in\mathbb{N}$ , and $0<\epsilon\leq 2/3$ . For a set $X$ of $N$ points in $\mathbb{R}^{D}$ , $\widehat{R}_{\infty}(M)$ as above, and $Z$ a $k\times D$ matrix with i.i.d. mean-zero unit-variance sub-gaussian entries, then equation ( $JL^{2}$ ) holds for at least $(1-2\eta)\binom{N}{2}$ pairs $x,x^{\prime}\in X$ , with probability at least $1-\delta$ , provided

k\geq\frac{C_{[\ref{thm:FreeDecompBulkJL}]}}{\epsilon^{2}}\left(\log(2e/\eta)\frac{M}{\widehat{R}_{\infty}(M)}+\frac{\log(N^{2}/(M\delta))}{\max(\eta\widehat{R}_{\infty}(M),1)}\right)\quad\text{and}\quad\gamma(\epsilon)=\gamma_{[\ref{thm:FreeDecompBulkJL}]}.

Here, $C_{[\ref{thm:FreeDecompBulkJL}]}$ depends on $K=\left\lVert Z_{11}\right\rVert_{\psi_{2}}$ . In the case of independent standard Gaussian entries, $C_{[\ref{thm:FreeDecompBulkJL}]}<2.25$ .

Proof.

In the proof of theorem 2.0.6, we shall replace lemma 2.0.4 by

p_{\pm}(\widehat{\Upsilon})\leq\binom{M}{\eta M}\exp\left(-ck\eta r_{\infty}(\widehat{\Upsilon})\min\left(\frac{\widetilde{\epsilon}_{-}^{2}}{K^{4}},\frac{\widetilde{\epsilon}_{-}}{K^{2}}\right)\right).

By lemma 2.0.7, $r_{\infty}(\widehat{\Upsilon}(\eta))\geq\eta r_{\infty}(\widehat{\Upsilon})$ , no matter the subset $\widehat{\Upsilon}(\eta)$ chosen. We can then upper bound the right hand side of lemma 2.0.4, recalling $r_{4}\geq r_{\infty}$ . ∎

The choice of decomposition $\mathcal{P}$ matters, yielding very different $\widehat{R}_{\infty}(M)$ values, even with $M$ fixed, as seen in the following.

Example 2.1.

The difference vectors for the (vertices of) the standard simplex in $\mathbb{R}^{D}$ are $\left\{e_{i}-e_{j}\right\}_{i\neq j}$ , with $1\leq i\leq D$ . Suppose the decomposition $\mathcal{P}$ involved “stars” made by subsets $\left\{e_{j}-e_{1}\right\}$ for $2\leq j\leq M+1\leq D$ . Using these difference vectors as rows, the corresponding matrix $\widehat{\Upsilon}$ looks like (relabeling as necessary)

\widehat{\Upsilon}=\frac{1}{\sqrt{2}}\begin{pmatrix}\operatorname{Id}_{M}&-\mathbb{I}\end{pmatrix}

with $\mathbb{I}=(1,\ldots,1)^{\top}\in\mathbb{R}^{M}$ . If $z=(0,\ldots,0,1)^{\top}$ , we have $\left\lVert\widehat{\Upsilon}\right\rVert^{2}\geq\left\lVert\widehat{\Upsilon}z\right\rVert_{2}^{2}=M/2$ , so $\widehat{R}_{\infty}(M)\leq r_{\infty}(\widehat{\Upsilon})\leq M/(M/2)=2$ . Because $\widehat{R}_{\infty}(M)$ is bounded by a constant, this decomposition is of no use in theorem 2.0.8.

If we instead consider a decomposition involving “cycles” of length $D=M$ , that is, subsets

\left\{e_{i}-e_{j}\;\lvert\;i,j\text{ appear exactly twice }\right\},

then the corresponding difference vectors form a circulant matrix

\Upsilon=\begin{pmatrix}1&0&\cdots&0&-1\\ -1&1&&&0\\ &-1&\ddots&&\vdots\\ &&\ddots&1&0\\ &&&-1&1\end{pmatrix}.

Viewing $\Upsilon$ as a complex matrix, we can diagonalize it using the discrete Fourier transform matrix $V=F_{D}$ as $V^{-1}\Upsilon V=\mathrm{diag}(\lambda_{1},\ldots,\lambda_{D})$ . See [GVL13, page 222]. Here,

\lambda_{j}=\left(\bar{F}_{D}\begin{pmatrix}1\\ -1\\ o\end{pmatrix}\right)_{j}=1-\omega^{j-1}\quad\text{with}\quad\omega=\exp(2\pi i/D).

The squares of the singular values of $\Upsilon$ are then the eigenvalues of $\Upsilon^{\top}\Upsilon$ , that is,

\sigma_{j}^{2}(\Upsilon)=\lambda_{j}^{\ast}\lambda_{j}=(1-\bar{\omega}^{j-1})(1-\omega^{j-1})=2-2\cos(2\pi(j-1)/D)\leq 4,

with equality achieved when $D$ is even at $j=1+D/2$ . Because the cycle has length $D$ here, we have $r_{\infty}(\Upsilon)\geq(2D)/4=D/2$ , for each such cycle.

If such a decomposition involved only such cycles, we could conclude (because the vectors have constant norm) $\widehat{R}_{\infty}(M)\geq D/2$ , which would be very useful for theorem 2.0.8.

We review in the next section a construction originally due to Walecki that provides such cycle decompositions as above when $D$ is odd, and the next best thing when $D$ is even. The construction will also be useful when our data set is drawn i.i.d.; in particular, we can address cases where the minimal stable ranks $R_{\infty}$ or $\widehat{R}_{\infty}$ are too pessimistic, but “most” batches have larger stable ranks.

3 The Walecki Construction

The sets $Y$ and $\widehat{Y}$ describe $\binom{N}{2}$ directions in space. Even if the data generating these directions is sampled independently, the directions themselves are not independent; for example, $x-y$ is not independent of $x-z$ , while $x-y$ and $z-w$ are, assuming $x,y,z$ , and $w$ are drawn independently. However, we are partitioning the directions into batches. If we can guarantee that in each batch, the directions are independent or only “weakly” dependent, we can exploit these properties, ensuring the stable ranks of many batches are bounded below by given values.

Viewing $Y$ or $\widehat{Y}$ as corresponding to the edges of the complete graph $K_{N}$ on $N$ vertices, we are asking for a partition of the edges, a decomposition, such that each subset involves each vertex once, or at most twice (say). Thankfully, Walecki provided such a construction in the 1880’s; see [Luc82, page 161, Sixième Récréation] for the original French and [Als08] for an English overview. We use a different indexing scheme than presented in [Als08], which is easier for us to use.

We can arrange $N$ vertices in the complex plane as follows: $N-1$ vertices corresponding to the $(N-1)$ st roots of unity, and the remaining vertex at the origin $o$ . Let $\omega=\exp(2\pi i/(N-1))$ . We can thus label the nonzero vertices as $\left\{\omega^{j}\right\}_{j=0}^{N-2}$ . We partition their corresponding edges based on their products $W_{p}:=\left\{\left\{\omega^{j},\omega^{k}\right\}\;\lvert\;\omega^{j}\omega^{k}=\omega^{p}\right\}$ , or more formally:

W_{p}:=\left\{\left\{\omega^{j},\omega^{k}\right\}\;\lvert\;0\leq j\neq k\leq N-2\text{ and }j+k\equiv p\mod N-1\right\}

The $N-1$ sets $\left\{W_{p}\right\}$ represent 1-regular subgraphs of $K_{N}$ : each vertex has degree one, for if $\left\{\omega^{j},\omega^{k}\right\}$ and $\left\{\omega^{j},\omega^{l}\right\}$ are in $W_{p}$ , then

j+k\equiv p\equiv j+l,\quad\text{that is,}\quad k\equiv p-j\equiv l,

forcing $k=l$ because $0\leq k,l\leq N-2$ . There are only $N-1$ vertices on the circle, so there are at most $\left\lfloor(N-1)/2\right\rfloor$ edges in $W_{p}$ .

The cyclic group $\mathbb{Z}_{N-1}$ generated by $\omega$ acts freely on the vertices of the circle via counterclockwise rotations, so $\mathbb{Z}_{N-1}$ also acts on the $W_{p}$ sets via

\omega W_{p}:=\left\{\left\{\omega\omega^{j},\omega\omega^{k}\right\}\;\lvert\;j+k\equiv p\mod N-1\right\}=W_{(p+2)\mod N-1},

corresponding to the product $(\omega\omega^{j})(\omega\omega^{k})=\omega^{2}\omega^{j+k}$ . Consequently, it is enough to discuss $W_{0}$ and $W_{1}$ .

With $k>0$ , the edges of $W_{0}$ are of the form $\left\{\omega^{k},\omega^{-k}\right\}$ , while those of $W_{1}$ are of the form $\left\{\omega^{k},\omega^{-(k-1)}\right\}$ . To each edge $\left\{\omega^{k},\omega^{-k}\right\}$ in $W_{0}$ , there is a corresponding edge in $W_{1}$ , namely $\left\{\omega^{k},\omega^{-(k-1)}\right\}$ , so $\left\lvert W_{1}\right\rvert\geq\left\lvert W_{0}\right\rvert$ and $W_{1}$ includes the edge $\left\{\omega,1\right\}$ . When $N-1$ is odd, the only nonzero vertex on the real line is 1; when $N-1$ is even, both vertices $1$ and $-1$ are present. Consequently, $\left\lvert W_{0}\right\rvert$ is determined by the number of vertices strictly in the upper half plane:

\left\lvert W_{0}\right\rvert=\begin{cases}(N-2)/2\text{ if $N-1$ is odd },\\ (N-3)/2\text{ if $N-1$ is even }.\end{cases}

When $N-1$ is odd, recall $\left\lvert W_{1}\right\rvert\leq\left\lfloor(N-1)/2\right\rfloor$ because $W_{1}$ is 1-regular, so by the above, $\left\lvert W_{1}\right\rvert=\left\lvert W_{0}\right\rvert$ , with $W_{1}$ avoiding the vertex $\omega^{-(N-2)/2}=\omega^{N/2}$ , as its left most edge is $\left\{\omega^{(N-2)/2},\omega^{-(((N-2)/2)-1)}\right\}$ . Form the augmented sets $\tilde{W}_{0}=W_{0}\cup\left\{o,1\right\}$ and $\tilde{W}_{1}=W_{1}\cup\left\{\omega^{-(N-2)/2},o\right\}$ ; each $\tilde{W}_{i}$ is a 1-factor, as it is 1-regular and spans all $N$ vertices. We can thus form a cycle using $\tilde{W}_{0}$ and $\tilde{W}_{1}$ : explicitly, in cycle notation,

(o,1,\omega^{1},\omega^{-1},\ldots,\omega^{j},\omega^{-j},\ldots,\omega^{(N-2)/2},\omega^{-(N-2)/2})

which has length $2(N-2)/2+2=N$ as it should for covering all the $N$ vertices.

When $N-1$ is even, $\left\lvert W_{0}\right\rvert=(N-3)/2=\left\lfloor(N-1)/2\right\rfloor-1$ , so $W_{1}$ can contain at most one additional edge; it does, via $(-1,\,\omega^{-(k_{\ast}-1)})$ with $k_{\ast}=(N-1)/2$ . If $N\geq 7$ , there are at least two edges in $W_{0}$ , so split $W_{0}$ into $W_{0}^{+}$ and $W_{0}^{-}$ , corresponding to those vertices with nonnegative and strictly negative real parts respectively. When $\left\lvert W_{0}\right\rvert=(N-3)/2$ is even, that is, $N\equiv 3\mod 4$ , both $W_{0}^{\pm}$ are of the same size $(N-3)/4$ , while in the other case, we have $\left\lvert W_{0}^{+}\right\rvert=\left\lceil(N-3)/4\right\rceil$ and $\left\lvert W_{0}^{-}\right\rvert=\left\lfloor(N-3)/4\right\rfloor$ . Form the augmented sets $\tilde{W}_{0}^{+}=W_{0}^{+}\cup\left\{o,1\right\}$ and $\tilde{W}_{0}^{-}=W_{0}^{-}\cup\left\{o,-1\right\}$ ; these sets are 1-regular, of sizes

\left\lvert\tilde{W}_{0}^{\pm}\right\rvert=1+\left\lvert W_{0}^{\pm}\right\rvert=1+(N-3)/4=(N+1)/4.\quad\text{when}\quad N\equiv 3\mod 4,

while when $N\equiv 1\mod 4$ , that is, when $(N-3)/2$ is odd,

\left\lvert\tilde{W}_{0}^{-}\right\rvert=1+\left\lvert W_{0}^{-}\right\rvert=1+\frac{1}{2}\left(\frac{N-3}{2}-1\right)=\frac{N-1}{4}

and

\left\lvert\tilde{W}_{0}^{+}\right\rvert=1+\left\lvert W_{0}^{+}\right\rvert=1+\frac{1}{2}\left(\frac{N-3}{2}+1\right)=\frac{N+3}{4}.

The sets $\tilde{W}_{0}^{+},\tilde{W}_{0}^{-},W_{1}$ now form the cycle

(o,1,\omega^{1},\omega^{-1},\ldots,\omega^{j},\omega^{-j},\ldots,\omega^{(N-3)/2},\omega^{-(N-3)/2},-1),

which has length $2(N-3)/2+3=N$ , again as it should.

Extending the $\mathbb{Z}_{N-1}$ group action to send the origin $o$ to itself, we can thus form $\left\lfloor(N-1)/2\right\rfloor$ cycles $\mathcal{C}_{j}$ of length $N$ using the above, recalling there are $N-1$ different $W_{p}$ sets. Explicitly,

\mathcal{C}_{j}=\begin{cases}\tilde{W}_{2j}^{+}\coprod\tilde{W}_{2j}^{-}\coprod W_{2j+1}&\text{ if $N-1$ is even }\\ \tilde{W}_{2j}\coprod\tilde{W}_{2j+1}&\text{ if $N-1$ is odd }.\end{cases}

(3.1)

When $N-1$ is even, all $\binom{N}{2}$ edges are covered, while when $N-1$ is odd, $\tilde{W}_{N-2}$ still remains, but is still a 1-factor.

We thus have, considering the parity of $N$ instead of $N-1$ ,

Lemma 3.0.1 (Walecki Construction).

The complete graph $K_{N}$ has a decomposition into $(N-1)/2$ cycles of length $N$ when $N$ is odd, and $(N-2)/2$ cycles of length $N$ and a 1-factor when $N$ is even.

For reference later, we also record

Corollary 3.0.2.

Consider the cycles in lemma 3.0.1, corresponding to equation (3.1). When $N$ is even, these cycles split as a pair of 1-factors of size $N/2$ . When $N$ is odd, the cycles split as three 1-regular subgraphs when $N\geq 7$ . When $N\equiv 3\mod 4$ , their sizes are

\left\lvert W_{2j+1}\right\rvert=(N-1)/2\quad\text{and}\quad\left\lvert\tilde{W}_{2j}^{\pm}\right\rvert=(N+1)/4,

while when $N\equiv 1\mod 4$ , their sizes are

\left\lvert W_{2j+1}\right\rvert=(N-1)/2,\quad\left\lvert\tilde{W}_{2j}^{-}\right\rvert=(N-1)/4,\quad\text{and}\quad\left\lvert\tilde{W}_{2j}^{+}\right\rvert=(N+3)/4.

Returning to the sets of difference vectors $y_{ij}=x_{i}-x_{j}\in Y$ with $X\subset\mathbb{R}^{D}$ a set of $N$ points, we can assign each $y_{ij}$ to a unique 1-regular subgraph by the correspondence $y_{ij}\leftrightarrow\left\{\omega^{i},\omega^{j}\right\}$ when $0<i<j$ and $y_{0j}\leftrightarrow\left\{o,\omega^{j}\right\}$ for $0<j$ . Most useful for us is the following lemma; note we shall be considering batches of size (at least) $M$ drawn from within these subgraphs.

Lemma 3.0.3.

Let $X\subset\mathbb{R}^{D}$ be a set of $N$ points drawn i.i.d. from a common distribution. With the correspondence above, the vectors within each subgraph from corollary 3.0.2 and lemma 3.0.1 are independent, as are their unit norm versions. When $N$ is even, there are $N-1$ subgraphs involved. When $N\geq 7$ is odd, there are $3(N-1)/2$ subgraphs involved.

Proof.

The subgraphs are 1-regular, so each vertex corresponding to a point of $X$ appears only once; independence follows as no two distinct edges share a common vertex. In the unit norm case, we are applying the same function $v\mapsto v/\left\lVert v\right\rVert_{2}$ to independent vectors, so the results are independent too. The rest of the lemma is immediate. ∎

We can now prove

Theorem 3.0.4.

For $X$ the standard simplex in $\mathbb{R}^{D}$ , it suffices to take

k=\frac{2C_{[\ref{thm:FreeDecompBulkJL}]}}{\epsilon^{2}}\left(\log(2e/\eta)+\frac{\log(D/\delta)}{\max(\eta D,1)}\right)

for theorem 2.0.8.

Note $k$ is bounded independent of $D$ as soon as $\eta D\gtrsim\log(D/\delta)$ .

Proof.

Taking $M=D$ in theorem 2.0.8, we can use the Walecki construction 3.0.1 as is to control $\widehat{R}_{\infty}(D)$ . When $D$ is odd, the computations from example 2.1 show $\widehat{R}_{\infty}(D)\geq D/2$ , while when $D$ is even, the 1-factor from the Walecki construction is of size $D/2$ , with mutually orthogonal vectors of constant norm, so its stable rank is equal to its rank, $D/2$ . Thus $\widehat{R}_{\infty}(D)\geq D/2$ for both parities of $D$ . ∎

4 Further Theorems for i.i.d. Samples

Let $\xi\in\mathbb{R}^{D}$ be a given random vector. In this section, the following assumption 4.1 will be in play for the $k\times D$ dimension reduction matrix $Z$ and the dataset $X$ of $N$ points in $\mathbb{R}^{D}$ . The theorems will then be in terms of additional assumptions on the distribution of $\xi$ .

Assumption 4.1.

The matrix $Z$ has i.i.d. zero-mean unit-variance sub-gaussian entries. The dataset $X=\left\{x_{i}\right\}_{i=0}^{N-1}\subset\mathbb{R}^{D}$ is drawn independently of $Z$ , with $x_{i}\overset{\text{i.i.d.}}{\sim}\xi$ .

In practice, datasets need not be drawn i.i.d. from some underlying distribution; however, if the number of points $N$ is very large, it may be useful or convenient to subsample the data in order to fit it in memory or to try to estimate properties of the data. The theorems in this section may then be read as applying to an i.i.d. (sub)sample of the data, that is, drawing $N$ points uniformly with replacement from the underlying dataset, redefining $N$ to be the new sample size, and redefining $\xi$ to be drawn from the discrete uniform measure on the underlying data.

With the Walecki construction in hand and assumption 4.1, we can give refinements of theorems 2.0.6 and 2.0.8. The theorem statements will have failure probabilities in terms of the draw of the pair $(Z,X)$ . (There should be no confusion with the dot product notation.) Both of the theorems just mentioned give a failure probability $\delta_{Z}$ for $Z$ once $X$ is fixed, and this probability only depends on stable rank properties of the data set $X$ , or more accurately, the difference vector set $Y$ . These theorems in turn rely on lemma 2.0.4, that bounds the probabilities in equation (2.1) for $Z$ acting on a given batch $\Upsilon$ or its unit norm version in terms of stable ranks of minibatches $\Upsilon(\eta)$ . In many of the examples that follow, the assumptions on the data only guarantee $r_{\infty}(\Upsilon)$ or $r_{\infty}(\Upsilon(\eta))$ is above some threshold most of the time, say for a fraction $(1-\zeta)$ of all batches, instead of all of the time. We can use lemma 2.0.4 on those “good” batches, and still conclude that $(1-\zeta)(1-2\eta)$ of all distances are approximately preserved with high probability. To be concrete, let $\mathcal{E}_{Z}$ be the event that $\left\lVert Z\widehat{y}\right\rVert_{2}^{2}$ is approximately preserved for $(1-\zeta)(1-2\eta)$ of the vectors $y\in Y$ – provided the batches considered have $r_{\infty}(\Upsilon)\geq r_{0}$ , for some threshold $r_{0}$ . Let $\mathcal{E}_{X}$ be the event that $r_{\infty}(\Upsilon)\geq r_{0}$ for at least $(1-\zeta)$ of the batches considered in the decomposition $\mathcal{P}$ . We then have

	$\displaystyle\mathbb{P}_{Z\times X}\left\{\mathcal{E}_{Z}^{c}\right\}$	$\displaystyle=\mathbb{P}_{Z\times X}\left\{\mathcal{E}_{Z}^{c}\cap\mathcal{E}_{X}\right\}+\mathbb{P}_{Z\times X}\left\{\mathcal{E}_{Z}^{c}\cap\mathcal{E}_{X}^{c}\right\}$
		$\displaystyle=\mathbb{P}_{Z\times X}\left\{\mathcal{E}_{Z}^{c}\;\lvert\;\mathcal{E}_{X}\right\}\mathbb{P}_{X}\left\{\mathcal{E}_{X}\right\}+\mathbb{P}_{Z\times X}\left\{\mathcal{E}_{Z}^{c}\cap\mathcal{E}_{X}^{c}\right\}$
		$\displaystyle\leq\mathbb{P}_{Z\times X}\left\{\mathcal{E}_{Z}^{c}\;\lvert\;\mathcal{E}_{X}\right\}+\mathbb{P}_{X}\left\{\mathcal{E}_{X}^{c}\right\}$
		$\displaystyle\leq\delta_{Z}+\delta_{X}$

In the rest of this section, we use the 1-regular subgraph version of the Walecki construction, lemma 3.0.3 to define the decomposition $\mathcal{P}$ and the underlying batches $\Upsilon$ . Specifically, each 1-regular subgraph $W$ is at least of size $(N-1)/4$ , and we partition the subgraph edges into batches $\Upsilon$ of size $M$ – each edge corresponding to a difference vector. Under assumption 4.1, the edges within each subgraph are exchangeable, so they can be assigned to batches in an arbitrary manner as long as the batch size is respected. For any remainder when $M$ does not divide the subgraph size $\left\lvert W\right\rvert$ , we can modify the definition of $\tilde{\eta}$ from equation (2.3) to apply with $\left\lvert W\right\rvert$ in place of $N$ ; any of the expanded batches are still of size at most $2M-1$ . Note when $N$ is odd, the subgraphs are not all of the same size, so the $\tilde{\eta}$ will vary accordingly.

We first present one case where $\zeta$ is 0, that is, we are able to control $r_{\infty}(\Upsilon(\eta))$ for every minibatch, with high probability.

Theorem 4.0.1.

Under assumption 4.1, equation (JL) holds for at least $(1-2\eta)\binom{N}{2}$ pairs $x,x^{\prime}\in X$ , with probability at least $1-3\,\delta$ over the draw of $(Z,X)$ , provided

k\geq\frac{C_{\ref{thm:FreeDecompBulkJL}}}{\epsilon^{2}}\frac{4C^{2}\left\lVert\xi_{1}\right\rVert_{\psi_{2}}^{2}(1+(1+\alpha)^{2})}{(1-t)}\left(\log(2e/\eta)+\frac{\log(N^{2}/(D\delta))}{\eta D}\right)

when $\eta D$ is a strictly positive integer, with

N\geq D\geq\max\left(\frac{1}{\alpha^{2}}\log(N^{2}/(D\delta)),\,\frac{(2\left\lVert\xi_{1}\right\rVert_{\psi_{2}}^{2}+1/\log(2))^{2}}{ct^{2}}\left(\frac{\log(N^{2}/(D\delta))}{\eta D}+\log(2e/\eta)\right)\right).

To make some sense of the above, note that if $N\geq D\gtrsim k$ , it suffices to take $t=\epsilon$ and

k\geq\frac{C_{\ref{thm:FreeDecompBulkJL}}}{\epsilon^{2}}\frac{4C^{2}\left\lVert\xi_{1}\right\rVert_{\psi_{2}}^{2}(1+(1+\alpha)^{2})}{(1-\epsilon)}\left(\log(2e/\eta)+\frac{\alpha^{2}}{\eta}\right)

recalling $D\geq\alpha^{-2}\log(N^{2}/(D\delta))$ too. This target dimension is roughly the same as for the simplex 3.0.4 with $D=N$ there, despite the very different sparsity properties of these two datasets.

Proof.

With $X$ defined as in the theorem statement and $\xi^{\prime}$ an independent draw of $\xi$ , the difference vector set $Y$ is drawn i.i.d. from $\xi-\xi^{\prime}$ , so it is immediately mean-zero with i.i.d. coordinates. Because the stable ranks do not see constant scaling, we can work with $Y/\sqrt{2}$ , so that $(\xi-\xi^{\prime})/\sqrt{2}$ now has unit variance coordinates. The sub-gaussian norm for a coordinate $(\xi_{1}-\xi_{1}^{\prime})/\sqrt{2}$ of $y\in Y/\sqrt{2}$ is at most $\sqrt{2}\left\lVert\xi_{1}\right\rVert_{\psi_{2}}$ by the triangle inequality.

We control $R_{\infty}(\eta M)$ in theorem 2.0.6 by showing each batch $\Upsilon(\eta)$ has high stable rank. Because $r_{\infty}(\Upsilon(\eta))=\left\lVert\Upsilon(\eta)\right\rVert_{F}^{2}/\left\lVert\Upsilon(\eta)\right\rVert^{2}$ , we shall control numerator and denominator separately.

We have $\mathbb{E}\left\lVert\Upsilon(\eta)\right\rVert_{F}^{2}=D\eta M$ . Further, Bernstein’s inequality for the mean-zero sub-exponential random variables $\Upsilon_{ij}^{2}-1$ yields [Ver18, page 34]

\mathbb{P}\left\{\left\lvert\frac{1}{D\eta M}\left\lVert\Upsilon(\eta)\right\rVert_{F}^{2}-1\right\rvert\geq t\right\}\leq 2\exp\left(-c\min\left(\frac{t^{2}}{K^{2}},\frac{t}{K}\right)D\eta M\right)

with $K=\left\lVert\Upsilon_{11}^{2}-1\right\rVert_{\psi_{1}}$ . For us with $t\in(0,1)$ , the above gives

\mathbb{P}\left\{(1-t)D\eta M\geq\left\lVert\Upsilon(\eta)\right\rVert_{F}^{2}\right\}\leq 2\exp\left(-\frac{cD\eta M}{\max(K^{2},K)}t^{2}\right).

To lock down $K$ , recall $\left\lVert a\right\rVert_{\psi_{1}}=\left\lvert a\right\rvert/\log(2)$ for constants $a$ , so

\displaystyle K=\left\lVert\Upsilon_{11}^{2}-1\right\rVert_{\psi_{1}}\leq\left\lVert\Upsilon_{11}^{2}\right\rVert_{\psi_{1}}+\left\lVert\mathbb{E}\Upsilon_{11}^{2}\right\rVert_{\psi_{1}}\leq\left\lVert\Upsilon_{11}\right\rVert^{2}_{\psi_{2}}+\frac{\mathbb{E}\Upsilon_{11}^{2}}{\log(2)}=2\left\lVert\xi_{1}\right\rVert^{2}_{\psi_{2}}+\frac{1}{\log(2)}.

Recall lemma 2.0.7 which allowed us to state $r_{\infty}(\widehat{\Upsilon}(\eta))\geq\eta r_{\infty}(\widehat{\Upsilon})$ always, because the columns of $\widehat{\Upsilon}$ had constant norm. We can give a high probability replacement for that lemma in this context, but require that it must hold over all $\left\lfloor\binom{N}{2}/M\right\rfloor<N^{2}/(2M)$ batches $\Upsilon$ , that is,

	$\displaystyle\frac{2M\delta}{N^{2}}$	$\displaystyle\geq\binom{M}{\eta M}2\exp\left(-\frac{cD\eta M}{\max(K^{2},K)}t^{2}\right)$
	$\displaystyle\log(M\delta/N^{2})$	$\displaystyle\geq\log\binom{M}{\eta M}-\frac{cD\eta M}{\max(K^{2},K)}t^{2}$
	$\displaystyle\frac{cD\eta M}{\max(K^{2},K)}t^{2}$	$\displaystyle\geq\log(N^{2}/(M\delta))+\log\binom{M}{\eta M}.$

Taking into account the $\tilde{\eta}$ cases, as mentioned before this proof, we require

D\geq\frac{(2\left\lVert\xi_{1}\right\rVert_{\psi_{2}}^{2}+1/\log(2))^{2}}{ct^{2}}\left(\frac{\log(N^{2}/(M\delta))}{\eta M}+\log(2e/\eta)\right).

For ambient dimensions $D$ at least this large, every single batch $\Upsilon$ satisfies

r_{\infty}(\Upsilon(\eta))\geq\frac{(1-t)D\eta M}{\left\lVert\Upsilon(\eta)\right\rVert^{2}}\geq(1-t)\eta\frac{DM}{\left\lVert\Upsilon\right\rVert^{2}}

with total failure probability at most $\delta$ .

We now only need to control $\left\lVert\Upsilon\right\rVert^{2}$ instead of $\left\lVert\Upsilon(\eta)\right\rVert^{2}$ , and we can use the known result [Ver18, page 85] for matrices with mean-zero independent sub-gaussian entries – recalling the $\psi_{2}$ -norm of each entry is now $2\left\lVert\xi_{1}\right\rVert_{\psi_{2}}$ , as mentioned above –

\left\lVert\Upsilon\right\rVert\leq C(2\left\lVert\xi_{1}\right\rVert_{\psi_{2}})(\sqrt{D}+\sqrt{M}+s)

with probability at least $1-2\exp(-s^{2})$ . It is proved via an $\epsilon$ -net argument. For us, with $s=\alpha\sqrt{D}$ , the above gives

\left\lVert\Upsilon\right\rVert^{2}>4C^{2}\left\lVert\xi_{1}\right\rVert_{\psi_{2}}^{2}(M+(1+\alpha)^{2}D)

with probability at most $2\exp(-\alpha^{2}D)$ . Consequently,

r_{\infty}(\Upsilon(\eta))\geq\frac{(1-t)\eta DM}{4C^{2}\left\lVert\xi_{1}\right\rVert_{\psi_{2}}^{2}(M+(1+\alpha)^{2}D)}=\frac{(1-t)\eta M}{4C^{2}\left\lVert\xi_{1}\right\rVert_{\psi_{2}}^{2}((M/D)+(1+\alpha)^{2})}

(4.1)

over all minibatches $\Upsilon(\eta)$ from all batches $\Upsilon$ with total failure probability at most

\frac{N^{2}}{2M}2\exp(-\alpha^{2}D)+\delta\leq 2\delta\quad\text{provided}\quad D\geq\frac{1}{\alpha^{2}}\log(N^{2}/(M\delta))\quad\text{as well.}\quad

So we may take the right hand side of equation (4.1) as our $R_{\infty}(\eta M)$ value. Plugging into theorem 2.0.6 yields

k\geq\frac{C_{\ref{thm:FreeDecompBulkJL}}}{\epsilon^{2}}\frac{4C^{2}\left\lVert\xi_{1}\right\rVert_{\psi_{2}}^{2}((M/D)+(1+\alpha)^{2})}{(1-t)}\left(\log(2e/\eta)+\frac{\log(N^{2}/(M\delta))}{\eta M}\right).

Setting $M=D$ completes the proof. ∎

4.1 Controlling “Most” Batches

Unlike in theorem 4.0.1 above, in general the dataset $X$ need not be so well-behaved, to the point that controlling $r_{\infty}(\Upsilon(\eta))$ is not possible across all minibatches. To make things more manageable, we shall now work with theorem 2.0.8, using the unit normalized batches $\widehat{\Upsilon}$ , working on batches of size at least $M$ instead of $\eta M$ . For general random vectors $\xi$ , we shall not be able to guarantee that $\widehat{\Upsilon}$ has high stable rank, but we shall guarantee that a fraction $(1-\zeta)$ of the batches have stable rank comparable to a “typical” value associated with $\xi$ .

The columns of $\widehat{\Upsilon}$ have constant unit norm, so it is enough to control $\left\lVert\widehat{\Upsilon}\right\rVert^{2}=\left\lVert\widehat{\Upsilon}\widehat{\Upsilon}^{\top}\right\rVert$ in order to control $r_{\infty}(\widehat{\Upsilon})$ . It will be convenient to introduce some new notation for this purpose; we present it first in its unnormalized version.

Because we are still using the Walecki construction via lemma 3.0.3 to define the batches, the columns in $\Upsilon$ are independent, each drawn like the given random vector $y:=\xi-\xi^{\prime}\in\mathbb{R}^{D}$ , with $\xi^{\prime}$ an independent copy of $\xi$ . The corresponding second moment matrix $\Sigma:=\Sigma(y):=\mathbb{E}yy^{\top}$ is twice the covariance matrix for $\xi$ , for if $\mathbb{E}\xi=\mu$ ,

\displaystyle\mathbb{E}yy^{\top}

\displaystyle=\mathbb{E}(\xi-\xi^{\prime})(\xi-\xi^{\prime})^{\top}=2(\Sigma(\xi)-\mu\mu^{\top})=2(\mathbb{E}(\xi-\mu)(\xi-\mu)^{\top})=2\,\mathrm{Cov}(\xi).

Consequently, $r(y)$ is the effective rank of $\mathrm{Cov}(\xi)$ , as the factors of 2 will cancel in the ratio.

Now $\Sigma(y)$ may be approximated by its empirical version

\Sigma_{M}:=\frac{1}{M}\Upsilon\Upsilon^{\top}=\frac{1}{M}\sum_{i=1}^{M}y_{i}y_{i}^{\top}.\quad\text{Recall also}\quad\mathbb{E}\left\lVert y\right\rVert_{2}^{2}=\mathbb{E}\,\mathrm{tr}\,{y^{\top}y}=\mathbb{E}\,\mathrm{tr}\,{yy^{\top}}=\,\mathrm{tr}\,\Sigma.

The unit normalized versions will depend on $\widehat{y}$ , with $\widehat{\Sigma}:=\mathbb{E}\widehat{y}\widehat{y}^{\top}$ , $M\widehat{\Sigma}_{M}:=\widehat{\Upsilon}\widehat{\Upsilon}^{\top}$ , and $\widehat{r}:=r(\widehat{y})=1/\left\lVert\widehat{\Sigma}\right\rVert$ . The connection between $r$ and $\widehat{r}$ is not so immediate, but we shall return to this point in section 4.1.1.

The following is implicit in [Ver18, section 5.6]; we include the proof here, as we want explicit constants, and we plan to take $K=1$ , applying it to $\widehat{\Upsilon}$ . Controlling the deviation $\left\lVert\widehat{\Sigma}_{M}-\widehat{\Sigma}\right\rVert$ will be the way we eventually show $r_{\infty}(\widehat{\Upsilon})\gtrsim\widehat{r}$ “most” of the time.

Lemma 4.1.1.

Let $\Upsilon=\left\{y_{i}\right\}_{i=1}^{M}\subset\mathbb{R}^{D}$ be as above, with the $y_{i}\overset{\text{i.i.d.}}{\sim}y$ , and assume for some $K\geq 1$ , that

\left\lVert y\right\rVert_{2}^{2}\leq K^{2}\mathbb{E}\left\lVert y\right\rVert_{2}^{2}\quad\text{almost surely.}\quad

Then, with failure probability at most $2e^{-u}$ ,

\frac{\left\lVert\Sigma_{M}-\Sigma\right\rVert}{\left\lVert\Sigma\right\rVert}\leq\left(\frac{(4K^{2}/3)r(u+\log D)}{M}+\sqrt{\frac{2K^{2}r(u+\log D)}{M}}\right)\quad\text{for $r=r(y)=\frac{\,\mathrm{tr}\,\Sigma(y)}{\left\lVert\Sigma(y)\right\rVert}$.}\quad

Proof.

Because the matrices $y_{i}y_{i}^{\top}-\Sigma$ are symmetric, i.i.d., and mean-zero, we can use the matrix Bernstein inequality: for $t\geq 0$ ,

\mathbb{P}\left\{\left\lVert\sum_{i=1}^{M}(y_{i}y_{i}^{\top}-\Sigma)\right\rVert\geq t\right\}\leq 2D\exp\left(-\frac{t^{2}/2}{\sigma^{2}+bt/3}\right)

with $b\geq\left\lVert y_{i}y_{i}^{\top}-\Sigma\right\rVert$ and $\sigma^{2}=\left\lVert\sum_{i=1}^{M}\mathbb{E}(y_{i}y_{i}^{\top}-\Sigma)^{2}\right\rVert$ . We want the right hand side to be at most $2e^{-u}$ , that is,

s=\log(D)+u\leq\frac{t^{2}/2}{\sigma^{2}+bt/3},\quad\text{or}\quad 0\leq t^{2}-\frac{2sb}{3}t-2\sigma^{2}s.

The positive root is at

t^{\ast}=\frac{1}{2}\left(\frac{2sb}{3}+\sqrt{\Big{(}\frac{2sb}{3}\Big{)}^{2}+8\sigma^{2}s}\right).

Because the other root of the quadratic is negative, taking any $t\geq t^{\ast}$ will suffice for us. Because the square root function is subadditive, it is safe to take $t=(2sb/3)+\sigma\sqrt{2s}$ .

We now just need to estimate $b$ and $\sigma^{2}$ . Estimate $b$ via

	$\displaystyle\left\lVert y_{i}y_{i}^{\top}-\Sigma\right\rVert$	$\displaystyle\leq\left\lVert y_{i}y_{i}^{\top}\right\rVert+\left\lVert\Sigma\right\rVert=\left\lVert y_{i}\right\rVert_{2}^{2}+\left\lVert\Sigma\right\rVert$
		$\displaystyle\leq K^{2}\,\mathrm{tr}\,\Sigma+\left\lVert\Sigma\right\rVert\leq 2K^{2}\,\mathrm{tr}\,\Sigma=:b.$

For $\sigma^{2}$ , the i.i.d. assumption already gives $\sigma^{2}=\left\lVert M\mathbb{E}(yy^{\top}-\Sigma)^{2}\right\rVert$ . Expanding the square,

\mathbb{E}(yy^{\top}-\Sigma)^{2}=\mathbb{E}(yy^{\top})^{2}-\Sigma^{2},

while $(yy^{\top})^{2}=y(y^{\top}y)y=\left\lVert y\right\rVert_{2}^{2}yy^{\top}$ . Taking expectations on $v^{\top}(yy^{\top})^{2}v$ with $v\in S^{D-1}$ gives

v^{\top}\mathbb{E}(yy^{\top})^{2}v=\mathbb{E}\left\lVert y\right\rVert_{2}^{2}v^{\top}yy^{\top}v\leq K^{2}(\,\mathrm{tr}\,\Sigma)\mathbb{E}v^{\top}yy^{\top}v=K^{2}(\,\mathrm{tr}\,\Sigma)v^{\top}\Sigma v.

Taking the maximum over $v\in S^{D-1}$ gives

\left\lVert\mathbb{E}(yy^{\top})^{2}\right\rVert\leq K^{2}\,\mathrm{tr}\,\Sigma\left\lVert\Sigma\right\rVert\quad\text{and}\quad\sigma^{2}\leq MK^{2}\,\mathrm{tr}\,\Sigma\left\lVert\Sigma\right\rVert,

as $-\Sigma^{2}$ is negative semidefinite.

Recalling our choice of $t$ and that $r=\,\mathrm{tr}\,\Sigma/\left\lVert\Sigma\right\rVert$ , we find

	$\displaystyle t$	$\displaystyle=\frac{2sb}{3}+\sigma\sqrt{2s}$
		$\displaystyle=\frac{4}{3}K^{2}\,\mathrm{tr}\,\Sigma(u+\log D)+\sqrt{2(u+\log D)MK^{2}\,\mathrm{tr}\,\Sigma\left\lVert\Sigma\right\rVert}$
		$\displaystyle=M\left\lVert\Sigma\right\rVert\left(\frac{(4K^{2}/3)r(u+\log D)}{M}+\sqrt{\frac{2K^{2}r(u+\log D)}{M}}\right).$

With this $t$ value, we finally have

\mathbb{P}\left\{\left\lVert\Sigma_{m}-\Sigma\right\rVert\geq\frac{t}{M}\right\}\leq 2e^{-u},\quad\text{as desired.}\quad\qed

If we used the unit normalized $\widehat{\Upsilon}$ , lemma 4.1.1 provides the following upper tail probability

P_{X}\left\{\left\lVert\widehat{\Sigma}_{M}-\widehat{\Sigma}\right\rVert>\tau_{+}(u)\right\}\leq 2\exp(-u)

(4.2)

with $\tau_{+}(u)$ a function of $u$ . By the triangle inequality, we should immediately have

\left\lVert\widehat{\Upsilon}\right\rVert^{2}=\left\lVert M\widehat{\Sigma}_{M}\right\rVert\leq M\left\lVert\widehat{\Sigma}_{M}-\widehat{\Sigma}\right\rVert+M\left\lVert\widehat{\Sigma}\right\rVert\leq M\left\lVert\widehat{\Sigma}\right\rVert(1+\tau_{+}(u)),

so that $r_{\infty}(\widehat{\Upsilon})\geq\widehat{r}/(1+\tau_{+}(u))$ with failure probablity $2\exp(-u)$ . This tail probability is too weak to control $r_{\infty}(\widehat{\Upsilon})$ for every batch with high probability, because $u$ will need to be too large to be useful, so we allow a fraction $\zeta>0$ of the batches to fail. We consider order statistics of real-valued functions of $\Upsilon$ across batches, namely $\left\lVert\widehat{\Sigma}_{M}-\widehat{\Sigma}\right\rVert$ – because the batches within each 1-regular subgraph are independent, we can exploit that independence to inform the choice for $u$ using the following lemma.

Lemma 4.1.2.

Let $\left\{\omega_{i}\right\}_{0}^{n-1}$ be i.i.d. random variables. Let $\zeta\in(0,1)$ with $\zeta n$ an integer. If

\mathbb{P}\left\{\omega_{0}>\tau_{+}(u)\right\}\leq 2e^{-u},\quad\text{then}\quad\mathbb{P}\left\{\omega_{((1-\zeta)n)}>\tau_{+}(u)\right\}\leq\exp\big{(}\zeta n(\log(e/\zeta)+\log(2)-u)\big{)}.

Proof.

First note

\mathbb{P}\left\{\omega_{((1-\zeta)n)}>\tau_{+}(u)\right\}=\mathbb{P}\left\{\omega_{(i)}>\tau_{+}(u)\text{ for }i\geq(1-\zeta)n\right\},

that is, we are looking for the top $\zeta n$ of the $\omega_{(i)}$ to be larger than $\tau_{+}(u)$ . Because the $\omega_{i}$ are drawn i.i.d., all their $\zeta n$ sized subsets are equally likely, so we may conclude

	$\displaystyle\mathbb{P}\left\{\omega_{((1-\zeta)n)}>\tau_{+}(u)\right\}$	$\displaystyle\leq\binom{n}{\zeta n}\mathbb{P}\left\{\omega_{i}>\tau_{+}(u)\text{ for }0\leq i\leq\zeta n-1\right\}$
		$\displaystyle\leq\left(\frac{e}{\zeta}\right)^{\zeta n}\left(2e^{-u}\right)^{\zeta n}=\exp\big{(}\zeta n(\log(e/\zeta)+\log(2)-u)\big{)},$

using the independence of the $\omega_{i}$ in the last line. ∎

In lemma 4.1.2, we shall take $n$ to be the number of batches in a given subgraph, so that the random variables in question are independent. The subgraphs from lemma 3.0.3 are of size at least $(N-1)/4$ and at most $N/2$ , but are not all of the same size when $N$ is odd. If $W$ is one of these subgraphs, it contains $\left\lfloor\left\lvert W\right\rvert/M\right\rfloor$ batches, as we have enforced each batch to have size at least $M$ , and reassigned any remainder $\left\lvert W\right\rvert\mod M$ to those batches. When $N$ is even, $\left\lvert W\right\rvert=N/2$ , and we only need to require $\zeta n:=\zeta\left\lfloor N/(2M)\right\rfloor\in\mathbb{N}$ . When $N$ is odd, we adjust $\zeta$ depending on the size of the subgraph $W$ . Suppose we enforce $\zeta n:=\zeta\left\lfloor(N-1)/(4M)\right\rfloor$ to be an integer, recalling $\left\lvert W\right\rvert\geq(N-1)/4$ in all cases. For any other subgraph $W$ of larger size, set

\tilde{\zeta}=\zeta\frac{n}{\left\lfloor\left\lvert W\right\rvert/M\right\rfloor}\quad\text{so that}\quad\tilde{\zeta}\left\lfloor\left\lvert W\right\rvert/M\right\rfloor=\zeta n.

(4.3)

In particular for all the subgraph sizes, $1/\tilde{\zeta}<2(n+1)/(\zeta n)\leq 3/\zeta$ if we assume $\zeta\leq 1/2$ , as $\zeta n\geq 1$ . Thus in lemma 4.1.2, we replace $\log(e/\zeta)$ by $\log(3e/\zeta)$ when we consider the different subgraphs.

4.1.1 Retraction to the Sphere

Suppose $y$ does not have a second moment, that is, $\mathrm{Cov}(\xi)$ is undefined. As mentioned before, we can use lemma 4.1.1 on the unit normalized batches $\widehat{\Upsilon}$ instead, provided we replace $\Sigma$ by $\widehat{\Sigma}:=\mathbb{E}\widehat{y}\widehat{y}^{\top}$ and $r$ by $\widehat{r}=r(\widehat{y})=1/\left\lVert\widehat{\Sigma}\right\rVert$ . By Cauchy-Schwarz, $\left\lVert\widehat{\Sigma}\right\rVert\leq 1$ always: for any unit vector $v$ ,

v^{\top}\mathbb{E}\widehat{y}\widehat{y}^{\top}v=\mathbb{E}(\widehat{y},v)^{2}\leq\mathbb{E}\left\lVert\widehat{y}\right\rVert_{2}^{2}\left\lVert v\right\rVert_{2}^{2}=1.

A first question to ask is: in the presence of a second moment, how are the operator norms of $\Sigma$ and $\widehat{\Sigma}$ related? The following lemma gives one such answer.

Lemma 4.1.3.

Let $y$ be a random vector in $\mathbb{R}^{D}$ , with second moment matrix $\mathbb{E}yy^{\top}=\Sigma$ . If $\widehat{y}=y/\left\lVert y\right\rVert_{2}$ , then with $\widehat{\Sigma}=\mathbb{E}\widehat{y}\widehat{y}^{\top}$ , for any $\epsilon\in(0,1)$ ,

\left\lVert\widehat{\Sigma}\right\rVert\leq\frac{1}{\epsilon\,r}+p(\epsilon)\quad\text{with}\quad p(\epsilon):=\mathbb{P}\left\{\left\lVert y\right\rVert_{2}^{2}<\epsilon\,\mathbb{E}\left\lVert y\right\rVert_{2}^{2}\right\}\quad\text{and}\quad r=\frac{\,\mathrm{tr}\,\Sigma}{\left\lVert\Sigma\right\rVert}=\frac{\mathbb{E}\left\lVert y\right\rVert_{2}^{2}}{\left\lVert\Sigma\right\rVert}.

Further, $\widehat{r}\geq r/(\epsilon^{-1}+rp(\epsilon))$ .

Remark 4.1.4.

Some dependence on the small ball (if $\epsilon\ll 1$ ) or lower deviation (if $\epsilon\approx 1$ ) probability $\mathbb{P}\left\{\left\lVert y\right\rVert_{2}^{2}<\epsilon\,\mathbb{E}\left\lVert y\right\rVert_{2}^{2}\right\}$ must be present, in general, due to the nature of the retraction to the sphere. Specifically, suppose $y$ is distributed uniformly at random from a finite set in $\mathbb{R}^{D}$ , having a cluster of points all pointing in roughly the $e_{1}$ direction, but of very small norm. If all the other points are distributed uniformly on the unit sphere, one expects $\left\lVert\Sigma\right\rVert$ to be roughly 1/D, as the cluster points will not contribute much to the operator norm. However, after the retraction, these cluster points now all have unit norm and are still pointing in roughly the $e_{1}$ direction, so if there are enough points in the cluster, $\left\lVert\widehat{\Sigma}\right\rVert$ could now be much closer to 1.

Proof.

The matrices $\Sigma$ and $\widehat{\Sigma}$ are symmetric positive semi-definite, so their singular values correspond to their eigenvalues. These eigenvalues involve the quadratic form $v\mapsto v^{\top}\Sigma v$ or $v\mapsto v^{\top}\widehat{\Sigma}v$ . The latter quadratic form is just $\mathbb{E}(\widehat{y},v)^{2}$ , which we may split as

\mathbb{E}(\widehat{y},v)^{2}=\mathbb{E}(\widehat{y},v)^{2}\mathbb{I}\left\{\left\lVert y\right\rVert_{2}^{2}\geq\epsilon\,\mathbb{E}\left\lVert y\right\rVert_{2}^{2}\right\}+\mathbb{E}(\widehat{y},v)^{2}\mathbb{I}\left\{\left\lVert y\right\rVert_{2}^{2}<\epsilon\,\mathbb{E}\left\lVert y\right\rVert_{2}^{2}\right\}.

For the first term,

\mathbb{E}(\widehat{y},v)^{2}\mathbb{I}\left\{\left\lVert y\right\rVert_{2}^{2}\geq\epsilon\,\mathbb{E}\left\lVert y\right\rVert_{2}^{2}\right\}\leq\frac{\mathbb{E}(y,v)^{2}}{\epsilon\mathbb{E}\left\lVert y\right\rVert_{2}^{2}}\mathbb{I}\left\{\left\lVert y\right\rVert_{2}^{2}\geq\epsilon\,\mathbb{E}\left\lVert y\right\rVert_{2}^{2}\right\}\leq\frac{\mathbb{E}(y,v)^{2}}{\epsilon\,\mathbb{E}\left\lVert y\right\rVert_{2}^{2}}.

When $v$ is a unit vector, $(\widehat{y},v)^{2}\leq 1$ always, so for such $v$ ,

	$\displaystyle\mathbb{E}(\widehat{y},v)^{2}$	$\displaystyle\leq\frac{\mathbb{E}(y,v)^{2}}{\epsilon\,\mathbb{E}\left\lVert y\right\rVert_{2}^{2}}+\mathbb{E}(\widehat{y},v)^{2}\mathbb{I}\left\{\left\lVert y\right\rVert_{2}^{2}<\epsilon\,\mathbb{E}\left\lVert y\right\rVert_{2}^{2}\right\}$
		$\displaystyle\leq\frac{\mathbb{E}(y,v)^{2}}{\epsilon\,\mathbb{E}\left\lVert y\right\rVert_{2}^{2}}+\mathbb{P}\left\{\left\lVert y\right\rVert_{2}^{2}<\epsilon\,\mathbb{E}\left\lVert y\right\rVert_{2}^{2}\right\}.$

Now $\left\lVert\widehat{\Sigma}\right\rVert$ is just the maximum of $\mathbb{E}(\widehat{y},v)^{2}$ over the unit sphere, and the maximum is realized by some $v^{\ast}$ as the sphere is compact. Consequently,

	$\displaystyle\left\lVert\widehat{\Sigma}\right\rVert$	$\displaystyle\leq\frac{\mathbb{E}(y,v^{\ast})^{2}}{\epsilon\,\mathbb{E}\left\lVert y\right\rVert_{2}^{2}}+\mathbb{P}\left\{\left\lVert y\right\rVert_{2}^{2}<\epsilon\,\mathbb{E}\left\lVert y\right\rVert_{2}^{2}\right\}\leq\max_{v\in S^{D-1}}\frac{\mathbb{E}(y,v)^{2}}{\epsilon\,\mathbb{E}\left\lVert y\right\rVert_{2}^{2}}+\mathbb{P}\left\{\left\lVert y\right\rVert_{2}^{2}<\epsilon\,\mathbb{E}\left\lVert y\right\rVert_{2}^{2}\right\}$
		$\displaystyle=\frac{\left\lVert\Sigma\right\rVert}{\epsilon\,\mathbb{E}\left\lVert y\right\rVert_{2}^{2}}+\mathbb{P}\left\{\left\lVert y\right\rVert_{2}^{2}<\epsilon\,\mathbb{E}\left\lVert y\right\rVert_{2}^{2}\right\}.$

Recalling $\mathbb{E}\left\lVert y\right\rVert_{2}^{2}=\,\mathrm{tr}\,\Sigma$ and the definition of $r$ and $\widehat{r}=1/\left\lVert\widehat{\Sigma}\right\rVert$ finishes the proof. ∎

Apart from assumption 4.1, we make no other assumptions on $X$ or $\xi$ in the following theorem.

Theorem 4.1.5.

Let $0<\epsilon,\,\eta,\,\zeta<1$ , $\zeta\leq 1/2$ , $\epsilon\leq 2/3$ , and $0<\alpha$ . Under assumption 4.1, equation (JL) holds for at least $(1-2\eta)(1-\zeta)\binom{N}{2}$ pairs $x,x^{\prime}\in X$ , with probability at least $1-2\delta$ over the draw of $(Z,X)$ , provided

k\geq\frac{C_{[\ref{thm:FreeDecompBulkJL}]}}{\epsilon^{2}}\left(\alpha+\frac{4}{3}+\sqrt{2\alpha}\right)\left(\log(2e/\eta)\frac{(\log(6e/\zeta)+\log D)}{1-t^{\prime}}+\frac{\log(N^{2}/(\alpha\widehat{r}\delta\log(D)))}{\alpha\max(\eta\widehat{r},1)}\right)

when the quantities $\eta M$ and $\zeta\left\lfloor(N-1)/(4M)\right\rfloor$ are strictly positive integers, with

\frac{N-1}{8}\geq M:=\alpha\widehat{r}\frac{(\log(6e/\zeta)+\log D)}{1-t^{\prime}}\quad 1>t^{\prime}:=\frac{8\alpha\widehat{r}\log(\frac{3(N-1)}{2\delta})}{\zeta(N-1)},\quad\text{and}\quad\gamma(\epsilon)=\gamma_{[\ref{thm:FreeDecompBulkJL}]}.

Remark 4.1.6.

To make sense of the above, consider $D=D_{JL}=O(\epsilon^{-2}\log(N^{2}/\delta))$ as if from the original Johnson-Lindenstrauss lemmas. We always have $\widehat{r}\leq D$ , so $t^{\prime}$ vanishes fairly quickly with increasing $N$ when $\zeta$ is fixed or even slowly decaying. Compared to theorem 4.0.1, we have an extra factor against $\log(2e/\eta)$ , but it is not too big in that $\log(D)=O(\log\log(N/\delta))$ , and $\log(6e/\zeta)$ does not grow quickly with decreasing $\zeta$ . When $N$ is large, we can make $\zeta$ small enough that the fraction $(1-2\eta)(1-\zeta)$ is not that much worse than $(1-2\eta)$ . Finally, in light of lemma 4.1.3, we can replace $\widehat{r}$ by $r$ in the definition of $k$ at the expense of a bounded prefactor for $\log(N^{2}/\delta)$ provided the lower deviation or small ball probability $p(\epsilon)$ is less than $1/r$ .

Proof.

From lemma 3.0.3, we have either $N-1$ or $3(N-1)/2$ 1-regular subgraphs to consider when $N\geq 7$ , and we choose (unit normalized) batches $\widehat{\Upsilon}$ of size at least $M$ from these subgraphs. Let $M\widehat{\Sigma}_{M}:=\widehat{\Upsilon}\widehat{\Upsilon}^{\top}$ . Within each subgraph, the batches are independent, allowing us to use lemma 4.1.2 on the random variables

\omega(\widehat{\Upsilon}):=\left\lVert\widehat{\Sigma}_{M}-\widehat{\Sigma}\right\rVert.

Take $n:=\left\lfloor(N-1)/(4M)\right\rfloor$ for that lemma and recall the $\tilde{\zeta}$ discussion from equation (4.3). We have to choose $u$ so that a union bound over all the subgraphs is still smaller than $\delta$ , so a safe value for $u$ would be

	$\displaystyle\frac{3(N-1)}{2}\exp\left(\zeta n\big{(}\log(3\,e/\zeta)+\log(2)-u\big{)}\right)$	$\displaystyle\leq\delta$		(4.4)
	$\displaystyle\log(6e/\zeta)+\frac{\log(3(N-1)/(2\delta))}{\zeta n}$	$\displaystyle\leq u$		(4.5)

With this $u$ in hand, we can apply lemma 4.1.1 with $K=1$ , for $M\widehat{\Sigma}_{M}=\widehat{\Upsilon}\widehat{\Upsilon}^{\top}$

	$\displaystyle\left\lVert\widehat{\Upsilon}\right\rVert^{2}=\left\lVert M\widehat{\Sigma}_{M}\right\rVert$	$\displaystyle\leq M\left\lVert\widehat{\Sigma}_{M}-\widehat{\Sigma}\right\rVert+M\left\lVert\widehat{\Sigma}\right\rVert$
		$\displaystyle\leq M\left\lVert\widehat{\Sigma}\right\rVert\left(1+\left(\frac{(4/3)\widehat{r}(u+\log D)}{M}+\sqrt{\frac{2\widehat{r}(u+\log D)}{M}}\right)\right).$

Because we are interested in $r_{\infty}(\widehat{\Upsilon})=M/\left\lVert\widehat{\Upsilon}\right\rVert^{2}$ , we should like to make the error term manageable, so choose

M=\alpha\widehat{r}(u+\log D)\quad\text{with}\quad\alpha>0.

Because $u$ already depends on $M$ through $n$ in equation 4.5, there is a constraint on $u$ and $\zeta$ that we need to address. Write $(N-1)/4=nM+s$ with $0\leq s\leq M-1$ . Set $\zeta^{\ast}$ to satisfy $\zeta n=\zeta^{\ast}(N-1)/(4M)$ . We then have

	$\displaystyle\log(6e/\zeta)+t(u+\log D)$	$\displaystyle\leq u\quad\text{with}\quad t:=4\alpha\widehat{r}\frac{\log(3(N-1)/(2\delta))}{\zeta^{\ast}(N-1)}>0$
	$\displaystyle\log(6e/\zeta)+t\log D$	$\displaystyle\leq(1-t)u$

We can divide by $(1-t)$ provided $t<1$ .

\displaystyle\alpha\widehat{r}\log(3(N-1)/(2\delta))<\zeta^{\ast}\frac{N-1}{4}=\zeta\left(\frac{N-1}{4}-s\right)

Recalling $s\leq M-1$ , if we also require $M\leq(N-1)/8$ , it would be safe to require

\alpha\widehat{r}\log(3(N-1)/(2\delta))<\zeta\frac{N-1}{8}<\zeta\left(\frac{N-1}{4}-s\right)\quad\text{and}\quad u\leq\frac{N-1}{8\alpha\widehat{r}}-\log(D).

We then have

1>t^{\prime}:=8\alpha\widehat{r}\frac{\log(3(N-1)/(2\delta))}{\zeta(N-1)}>t,

and because the maps $t\mapsto t/(1-t)$ and $t\mapsto 1/(1-t)$ are strictly increasing, a valid lower bound for $u$ is

u\geq\frac{t^{\prime}}{1-t^{\prime}}\log(D)+\frac{1}{1-t^{\prime}}\log(6e/\zeta).

Taking $u$ as this lower bound yields

u+\log(D)=\frac{1}{1-t^{\prime}}(\log(D)+\log(6e/\zeta)).

With this choice of $u$ in hand, with probability at least $1-\delta$ ,

r_{\infty}(\widehat{\Upsilon})\geq\frac{M}{M\left\lVert\widehat{\Sigma}\right\rVert\left(1+\frac{4}{3\alpha}+\sqrt{\frac{2}{\alpha}}\right)}=\frac{1/\left\lVert\widehat{\Sigma}\right\rVert}{\left(1+\frac{4}{3\alpha}+\sqrt{\frac{2}{\alpha}}\right)}=\frac{\widehat{r}}{\left(1+\frac{4}{3\alpha}+\sqrt{\frac{2}{\alpha}}\right)}=:\widehat{R}_{\infty}(M;\zeta)

for at least $(1-\zeta)$ of all batches $\widehat{\Upsilon}$ . Assuming this bound holds, we now ask that when $Z$ is drawn, equation (JL) holds for all the vectors involved in at least these batches, with failure probability at most $\delta$ . We run the argument of theorem 2.0.8, only for these batches $\widehat{\Upsilon}$ , using $\widehat{R}_{\infty}(M;\zeta)$ in place of $\widehat{R}_{\infty}(M)$ . As we could still have $\left\lfloor\binom{N}{2}/M\right\rfloor$ “good” batches, we still must allow for all of them when we compute the union bound. The $M/\widehat{R}_{\infty}(M)$ ratio in theorem 2.0.8 now just becomes

\frac{M}{\widehat{R}_{\infty}(M;\zeta)}=\frac{\alpha\widehat{r}(u+\log D)}{\widehat{r}}\left(1+\frac{4}{3\alpha}+\sqrt{\frac{2}{\alpha}}\right)=(u+\log D)\left(\alpha+\frac{4}{3}+\sqrt{2\alpha}\right)

Let $C_{\alpha}$ be the coefficient of $(u+\log D)$ in the above. We may then set $k$ as

k\geq\frac{C_{[\ref{thm:FreeDecompBulkJL}]}}{\epsilon^{2}}C_{\alpha}\left(\log(2e/\eta)\frac{(\log(6e/\zeta)+\log D)}{1-t^{\prime}}+\frac{\log(N^{2}/(\alpha\widehat{r}\delta\log(D)))}{\alpha\max(\eta\widehat{r},1)}\right)

(4.6)

using $u+\log(D)\geq\log(D)$ in the $\log(N^{2})$ term. The choice of $\gamma$ follows from theorem 2.0.8. ∎

In certain cases, we know $\widehat{r}$ exactly without relying on lemma 4.1.3.

Lemma 4.1.7.

Suppose $\xi=(\xi_{1},\ldots,\xi_{D})\in\mathbb{R}^{D}$ is a random vector with i.i.d. coordinates, and $\xi^{\prime}$ is an independent copy of $\xi$ . If $y:=\xi-\xi^{\prime}$ , then the scaled unit vector $\widehat{y}\sqrt{D}$ is mean-zero isotropic.

There are no moment assumptions on the coordinates $\xi_{i}$ here, so the lemma even applies to vectors with i.i.d. standard Cauchy coordinates.

Proof.

Both properties rely on the following observation. For a fixed coordinate $i$ , the coordinate $y_{i}=\xi_{i}-\xi_{i}^{\prime}$ is a symmetric random variable: $y_{i}$ and $-y_{i}$ have the same distribution. Consequently, for any odd function $f$ , (using the symmetry in the 2nd equality)

-\mathbb{E}f(y_{i})=\mathbb{E}f(-y_{i})=\mathbb{E}f(y_{i})

so that $\mathbb{E}f(y_{i})=0$ for such odd functions $f$ .

If we temporarily freeze the other coordinates, the $i$ th coordinate of the unit vector $\widehat{y}$ is just

\widehat{y}_{i}=\frac{y_{i}}{(y_{i}^{2}+\sum_{j\neq i}^{D}y_{j}^{2})^{1/2}}=\frac{y_{i}}{(y_{i}^{2}+C)^{1/2}},

an odd function of $y_{i}$ , forcing $\widehat{y}\sqrt{D}$ to be mean-zero.

To check $\widehat{y}\sqrt{D}$ is isotropic, we must show the matrix $\Sigma=D\,\mathbb{E}\widehat{y}\widehat{y}^{\top}$ is the identity $\operatorname{Id}_{D}$ . Because the $y_{i}$ are identically distributed,

D\,\mathbb{E}\frac{y_{1}^{2}}{\sum_{j=1}^{D}y_{j}^{2}}=\mathbb{E}\sum_{i=1}^{D}\frac{y_{i}^{2}}{\sum_{j=1}^{D}y_{j}^{2}}=1,

so the diagonal entries of $\Sigma$ are all equal to $1$ .

Further, when $i\neq j$ , the independence of the $y_{i}$ now give

D\,\mathbb{E}\frac{y_{i}y_{j}}{y_{i}^{2}+\sum_{k\neq i}^{D}y_{k}^{2}}=D\,\mathbb{E}y_{j}\mathbb{E}\left(\frac{y_{i}}{y_{i}^{2}+\sum_{k\neq i}^{D}y_{k}^{2}}\Big{\lvert}y_{j\neq i}\right)=0

because the conditional expectation vanishes on the odd function of $y_{i}$ . ∎

We now have an immediate corollary to theorem 4.1.5, which again we may compare to theorem 4.0.1.

Corollary 4.1.8.

In the setting of theorem 4.1.5, suppose $\xi$ has i.i.d. coordinates. Then the corresponding conclusion still holds, with $\widehat{r}=D$ , namely it suffices to take $\gamma(\epsilon)=\gamma_{[\ref{thm:FreeDecompBulkJL}]}$ and

k\geq\frac{C_{[\ref{thm:FreeDecompBulkJL}]}}{\epsilon^{2}}\left(\alpha+\frac{4}{3}+\sqrt{2\alpha}\right)\left(\log(2e/\eta)\frac{(\log(6e/\zeta)+\log D)}{1-t^{\prime}}+\frac{\log(N^{2}/(\alpha D\delta\log(D)))}{\alpha\max(\eta D,1)}\right).

Proof.

By lemma 4.1.7, the difference vector $y=\xi-\xi^{\prime}$ yields the isotropic vector $\widehat{y}\sqrt{D}$ . Because $\Sigma(\widehat{y}\sqrt{D})=\operatorname{Id}_{D}$ , we compute $\widehat{r}:=r(\widehat{y})=r(\widehat{y}\sqrt{D})=D$ , as the intrinsic dimension does not see constant scalings. We can then apply theorem 4.1.5. ∎

Remark 4.1.9.

The proof only uses that $\widehat{y}\sqrt{D}$ is isotropic. By lemma 1.1.3, $\Phi(\widehat{y}\sqrt{D})$ is still isotropic when $\Phi$ has orthonormal rows. Because equation JL is 1-homogeneous, the corollary still holds with $\xi$ replaced by $\Phi(\xi)$ , in particular when $\Phi$ has a fast transform method available.

4.2 Estimating $\widehat{r}$

The intrinsic dimension of $\widehat{y}$ , namely $\widehat{r}$ , enters into theorem 4.1.5 only as a parameter, so we are free to estimate it separately. In particular,

Corollary 4.2.1.

Theorem 4.1.5 holds with $\widehat{r}$ replaced by an empirical estimate using a batch $\widehat{\Upsilon}(m)$ of size $m$ , namely, with failure probability at most $\delta$ ,

\widehat{r}\geq\frac{1}{3\left\lVert\widehat{\Sigma}_{m}\right\rVert}\geq\frac{\widehat{r}}{5}\quad\text{for}\quad m=8D\log(2D/\delta)\quad\text{provided}\quad m\leq(N-1)/2.

If $D=D_{JL}=O(\epsilon^{-2}\log(N^{2}/\delta))$ , then computing $\left\lVert\widehat{\Sigma}_{m}\right\rVert$ will cost polynomial in $\log(N^{2}/\delta)$ and $\epsilon^{-2}$ ; however, because $1\leq\widehat{r}\leq D$ , we do not need very high accuracy when computing this top eigenvalue.

If we were working with $X$ drawn uniformly with replacement from a larger dataset, we should draw the first $16D\log(2D/\delta)$ data points, and then sequentially pair them off for unit difference vectors to yield $\widehat{\Upsilon}(m)$ . The uniform with replacement assumption assures that these data points used are as good as any other subset, even if some of the data points turn out to be copies of the same point in the larger dataset.

Proof.

For $m\leq(N-1)/2$ , we can find a batch $\widehat{\Upsilon}(m)$ of size $m$ with indepedent columns, by corollary 3.0.2 and lemma 3.0.3. By lemma 4.1.1, we have, with failure probability at most $2\exp(-u)$ ,

\left\lVert\widehat{\Sigma}-\widehat{\Sigma}_{m}\right\rVert\leq\left\lVert\widehat{\Sigma}\right\rVert\tau_{+}(u)=\left\lVert\widehat{\Sigma}\right\rVert\left(\frac{(4/3)\widehat{r}(u+\log(D))}{m}+\sqrt{\frac{2\widehat{r}(u+\log(D))}{m}}\right).

By the triangle inequality, we then have

\displaystyle\left\lVert\widehat{\Sigma}\right\rVert\leq\left\lVert\widehat{\Sigma}\right\rVert\tau_{+}(u)+\left\lVert\widehat{\Sigma}_{m}\right\rVert\quad\text{that is}\quad(1-\tau_{+}(u))\left\lVert\widehat{\Sigma}\right\rVert\leq\left\lVert\widehat{\Sigma}_{m}\right\rVert

and

\left\lVert\widehat{\Sigma}_{m}\right\rVert\leq\left\lVert\widehat{\Sigma}\right\rVert\tau_{+}(u)+\left\lVert\widehat{\Sigma}\right\rVert=(1+\tau_{+}(u))\left\lVert\widehat{\Sigma}\right\rVert.

Consequently, as $\widehat{r}=1/\left\lVert\widehat{\Sigma}\right\rVert$ ,

\left(\frac{1-\tau_{+}(u)}{1+\tau_{+}(u)}\right)\widehat{r}\leq\frac{1-\tau_{+}(u)}{\left\lVert\widehat{\Sigma}_{m}\right\rVert}\leq\widehat{r}\leq\frac{1+\tau_{+}(u)}{\left\lVert\widehat{\Sigma}_{m}\right\rVert}

Recalling $\widehat{r}\leq D$ always, we choose $m=\alpha D\log(2D/\delta)$ and $u=\log(2/\delta)$ so that

\tau_{+}(\log(2/\delta))\leq\frac{(4/3)\widehat{r}}{\alpha D}+\sqrt{\frac{2\widehat{r}}{\alpha D}}\leq\frac{4}{3\alpha}+\sqrt{\frac{2}{\alpha}}\leq\frac{2}{3}\quad\text{for}\quad\alpha\geq 8.\qed

Acknowledgements

This research was performed while the author held an NRC Research Associateship award at the U.S. Air Force Research Laboratory. I should like to thank Mary, Saint Joseph, and the Holy Trinity for helping me with this work.

Disclaimers

The views expressed are those of the author and do not reflect the official guidance or position of the United States Government, the Department of Defense, or of the United States Air Force.

Statement from the DoD: The appearance of external hyperlinks does not constitute endorsement by the United States Department of Defense (DoD) of the linked websites, or the information, products, or services contained therein. The DoD does not exercise any editorial, security, or other control over the information you may find at these locations.

5 Appendix

The following lemma shows how to modify the proof of the Hanson-Wright inequality from [RV13] (cf. [Ver18, chapter 6]) to a “bulk” version, looking at the sum of several i.i.d. quadratic forms. Note $Z$ here is $Z^{\top}$ in the main part of the paper. Let $Z$ be a $D\times k$ matrix entries $Z_{ij}$ drawn i.i.d. from a mean-zero unit-variance sub-gaussian distribution. Write $Z(:,j)$ for the $j$ th column of $Z$ . Let $B$ be a $D\times D$ matrix and

S=\sum_{j=1}^{k}Z(:,j)^{\top}BZ(:,j)

Note, with $B=(b_{ij})_{i,j=1}^{D}$ ,

Z(:,j)^{\top}BZ(:,j)=\sum_{q,\,l}Z_{qj}b_{ql}Z_{lj}\quad\text{and}\quad\mathbb{E}Z(:,j)^{\top}BZ(:,j)=\sum_{q}b_{qq}\mathbb{E}Z_{qj}^{2},

(5.1)

using the mean zero and independence assumptions for the coordinates of $Z(:,j)$ , that is for the $Z_{ij}$ when $i$ varies.

Lemma 5.0.1.

Let $S$ , $B$ , and $Z$ be as above. Then for all $t\geq 0$ and either sign choice,

\mathbb{P}\left\{\pm(S-\mathbb{E}S)\geq kt\right\}\leq 2\exp\left(-ck\min\left(\frac{t^{2}}{K^{4}\left\lVert B\right\rVert_{F}^{2}},\,\frac{t}{K^{2}\left\lVert B\right\rVert}\right)\right)

with $c$ an absolute constant (not depending on $Z$ ) and $K=\left\lVert Z_{11}\right\rVert_{\psi_{2}}$ .

Remark 5.0.2.

The key point is the additional factor of $k$ in the exponential, compared to the usual Hanson-Wright inequality where $k=1$ . Here,

\left\lVert Z\right\rVert_{\psi_{2}}=\inf\left\{t>0\;\lvert\;\mathbb{E}\exp(Z^{2}/t)\leq 2\right\},

so in particular $\left\lVert CZ\right\rVert_{\psi_{2}}=C\left\lVert Z\right\rVert_{\psi_{2}}$ , and $\left\lVert Z/K\right\rVert_{\psi_{2}}=1$ . If we prove the result for $Z/K$ , that is bound

\mathbb{P}\left\{\left\lvert\sum_{j=1}^{k}\frac{Z(:,j)^{\top}}{K}B\frac{Z(:,j)}{K}-\mathbb{E}\sum_{j=1}^{k}\frac{Z(:,j)^{\top}}{K}B\frac{Z(:,j)}{K}\right\rvert\geq kt\right\},

then taking $t\mapsto t/K^{2}$ will give us the bound for the original $Z(:,j)$ .

Proof.

By equation 5.1,

S-\mathbb{E}S=\sum_{j=1}^{k}\sum_{q}b_{qq}(Z_{qj}^{2}-\mathbb{E}Z_{qj}^{2})+\sum_{j=1}^{k}\sum_{q\neq l}b_{ql}Z_{qj}Z_{lj}=:S_{1}+S_{2}.

By the union bound (Boole’s inequality), we can bound the probability

p:=\mathbb{P}\left\{S-\mathbb{E}S\geq kt\right\}\leq\mathbb{P}\left\{S_{1}\geq kt/2\right\}+\mathbb{P}\left\{S_{2}\geq kt/2\right\}=:p_{1}+p_{2},

for if $S_{1}<kt/2$ and $S_{2}<kt/2$ , then $S-\mathbb{E}S<kt$ .

We can now use the i.i.d. assumption for the columns, that is, for the $Z_{ij}$ when $j$ varies,

\displaystyle p_{1}

\displaystyle\leq e^{-kt\lambda_{1}/2}\mathbb{E}\exp(\lambda_{1}S_{1})=\left(e^{-t\lambda_{1}/2}\mathbb{E}\exp\left(\lambda_{1}\sum_{q}b_{qq}(Z_{q}^{2}-\mathbb{E}Z_{q}^{2})\right)\right)^{k}=\wp_{1}^{k}

and

\displaystyle p_{2}

\displaystyle\leq e^{-kt\lambda_{2}/2}\mathbb{E}\exp(\lambda_{2}S_{2})=\left(e^{-t\lambda_{2}/2}\mathbb{E}\exp\left(\lambda_{2}\sum_{q\neq l}b_{ql}Z_{q}Z_{l}\right)\right)^{k}=\wp_{2}^{k}

The terms $\wp_{1}$ and $\wp_{2}$ are the starting points for establishing a proof of the Hanson-Wright inequality [Ver18, page 133]; the former is for using Bernstein’s inequality, while the latter uses decoupling and comparison to the case when $Z$ is a standard Gaussian random vector. Consequently, we can use the bounds for $\wp_{1}$ and $\wp_{2}$ , which both are given by

\max\left\{\wp_{1},\wp_{2}\right\}\leq\exp\left(-c\min\left(\frac{t^{2}}{\left\lVert B\right\rVert_{F}^{2}},\,\frac{t}{\left\lVert B\right\rVert}\right)\right),

with $c$ an absolute constant (not depending on the distribution of $Z$ , as we already rescaled $Z$ to have unit $\psi_{2}$ -norm entries). Recalling the $k$ th powers, we finally have

p\leq p_{1}+p_{2}\leq 2\exp\left(-ck\min\left(\frac{t^{2}}{\left\lVert B\right\rVert_{F}^{2}},\,\frac{t}{\left\lVert B\right\rVert}\right)\right).\qed

Lemma 5.0.3.

Let $Z$ be a $k\times D$ random matrix with i.i.d. mean-zero unit-variance sub-gaussian entries. Then for a matrix $A$ with columns in $\mathbb{R}^{D}$ ,

\mathbb{P}\left\{\pm(\left\lVert ZA\right\rVert_{F}^{2}-k\left\lVert A\right\rVert_{F}^{2})\geq k\epsilon\left\lVert A\right\rVert_{F}^{2}\right\}\leq 2\exp\left(-ck\min\left(\frac{\epsilon^{2}r_{4}(A)}{K^{4}},\,\frac{\epsilon\,r_{\infty}(A)}{K^{2}}\right)\right)

with $K=\left\lVert Z_{11}\right\rVert_{\psi_{2}}$ .

Proof.

We use lemma 5.0.1, with $B=AA^{\top}$ , and $Z\mapsto Z^{\top}$ , for then the rows $Z_{j}$ of $Z$ are written as column vectors, so that

Z_{j}^{\top}BZ_{j}=\left\lVert A^{\top}Z_{j}\right\rVert_{2}^{2},\quad S=\sum_{j=1}^{k}\left\lVert A^{\top}Z_{j}\right\rVert_{2}^{2}=\left\lVert ZA\right\rVert_{F}^{2},\quad\text{and}\quad\mathbb{E}S=k\left\lVert A\right\rVert_{F}^{2}.

Using $\left\lVert B\right\rVert=\left\lVert A\right\rVert^{2}$ , we recover

\mathbb{P}\left\{\pm(S-k\left\lVert A\right\rVert_{F}^{2})\geq kt\right\}\leq\exp\left(-ck\min\left(\frac{t^{2}}{K^{4}\left\lVert AA^{\top}\right\rVert_{F}^{2}},\,\frac{t}{K^{2}\left\lVert A\right\rVert^{2}}\right)\right).

Because

r_{\infty}(A)=\frac{\left\lVert A\right\rVert_{F}^{2}}{\left\lVert A\right\rVert^{2}}\quad\text{and}\quad r_{4}(A)=\frac{\left\lVert A\right\rVert_{F}^{4}}{\left\lVert AA^{\top}\right\rVert_{F}^{2}},

the choice $t=\epsilon\left\lVert A\right\rVert_{F}^{2}$ yields the result. ∎

If the reader would prefer explicit constants, the following lemma may be convenient, and gives an alternative proof for lemma 5.0.3 in the Gaussian case, relying on the explicit moment generating function for the Gaussian distribution.

Lemma 5.0.4.

Let $Z$ be a $k\times D$ random matrix with i.i.d. standard Gaussian entries. Then for a matrix $A$ with columns in $\mathbb{R}^{D}$ ,

\mathbb{P}\left\{\left\lVert ZA\right\rVert_{F}^{2}>(1+\epsilon)k\left\lVert A\right\rVert_{F}^{2}\right\}\leq\exp\left(-k\frac{\epsilon}{8}\min\left\{\epsilon\,r_{4}(A),\,r_{\infty}(A)\right\}\right)

for $\epsilon>0$ . Also, when $\epsilon\in(0,1)$ ,

\mathbb{P}\left\{\left\lVert ZA\right\rVert_{F}^{2}>(1+\epsilon)k\left\lVert A\right\rVert_{F}^{2}\right\}\leq\exp\left(-k\frac{\epsilon^{2}}{8}r_{\infty}(A)\right)

and

\mathbb{P}\left\{\left\lVert ZA\right\rVert_{F}^{2}<(1-\epsilon)k\left\lVert A\right\rVert_{F}^{2}\right\}\leq\exp\left(-k\frac{\epsilon^{2}}{4}r_{4}(A)\right).

Note $r_{4}(A)\geq r_{\infty}(A)$ always. When $\epsilon\,r_{4}(A)\geq r_{\infty}(A)$ , there is a savings of one factor of $\epsilon$ in the upper tail; however, for our applications, we do not know the relative sizes of $r_{4}(A)$ and $r_{\infty}(A)$ , so the $k\epsilon^{2}r_{\infty}(A)/8$ bound was included.

Proof.

Let $A=U\Sigma V^{\top}$ be the SVD of $A$ , with $\Sigma=\mathrm{diag}(\vec{\sigma})$ the diagonal matrix of singular values. So $A^{\top}=V\Sigma U^{\top}$ and by the rotation invariance of standard Gaussian vectors, $A^{\top}$ acts on a row $Z_{i}$ of $Z$ as

A^{\top}Z_{i}=V\Sigma U^{\top}Z_{i}\sim V\Sigma g_{i}\quad\text{with}\quad g_{i}\in\mathbb{R}^{D}

and consequently

\left\lVert A^{\top}g_{i}\right\rVert_{2}^{2}\sim g_{i}^{\top}\Sigma V^{\top}V\Sigma g_{i}=g_{i}^{\top}\Sigma^{2}g_{i}=\sum_{j}\sigma_{j}^{2}g_{ij}^{2}

with the $g_{ij}$ independent standard Gaussian.

We then have

\left\lVert ZA\right\rVert_{F}^{2}=\left\lVert A^{\top}Z^{\top}\right\rVert_{F}^{2}=\sum_{i=1}^{k}\left\lVert A^{\top}g_{i}\right\rVert_{2}^{2}

and for $\lambda>0$ to be determined

	$\displaystyle\mathbb{P}\left\{\left\lVert ZA\right\rVert_{F}^{2}>(1+\epsilon)k\left\lVert A\right\rVert_{F}^{2}\right\}$	$\displaystyle\leq e^{-\lambda(1+\epsilon)k\left\lVert A\right\rVert_{F}^{2}}\mathbb{E}\exp\left(\lambda\sum_{i=1}^{k}\left\lVert A^{\top}g_{i}\right\rVert_{2}^{2}\right)$
		$\displaystyle=\left(e^{-\lambda(1+\epsilon)\left\lVert A\right\rVert_{F}^{2}}\mathbb{E}\exp\left(\lambda\left\lVert A^{\top}g_{1}\right\rVert_{2}^{2}\right)\right)^{k}$

with $g_{1}\in\mathbb{R}^{D}$ standard Gaussian, having used the indepdence of the rows $\left\{g_{i}\right\}$ . We can now use the independence of the columns, here via the coordinates of $g_{1}$ :

\displaystyle\mathbb{E}\exp(\lambda\left\lVert A^{\top}g_{1}\right\rVert_{2}^{2})=\prod_{j}\mathbb{E}\exp(\lambda\sigma_{j}^{2}g_{1j}^{2})=\prod_{j}(1-2\lambda\sigma_{j}^{2})^{-1/2}

provided $\lambda<1/(2\sigma_{1}^{2})$ via change of variables $y=(1-2\lambda\sigma_{j}^{2})^{1/2}\,x$ ,

\displaystyle\mathbb{E}\exp(\lambda\sigma_{j}^{2}g_{1j}^{2})=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\exp\left(\lambda\sigma_{j}^{2}x^{2}-\frac{x^{2}}{2}\right)\,dx=(1-2\lambda\sigma_{j}^{2})^{-1/2}.

On $[0,1/2]$ , the function $x\mapsto e^{x+x^{2}}(1-x)$ is increasing from $1$ , while on $[1/2,2/3]$ it is decreasing and still greater than 1, as $(10/9)>\log(3)$ , so

\mathbb{E}\exp(\lambda\sigma_{j}^{2}g_{1j}^{2})\leq\exp\big{(}\lambda\sigma_{j}^{2}+2\lambda^{2}\sigma_{j}^{4}\big{)}\quad\text{certainly when}\quad 2\lambda\sigma_{1}^{2}\leq 2/3,

leaving us to minimize

h(\lambda):=-\lambda(1+\epsilon)\left\lVert A\right\rVert_{F}^{2}+\sum_{j}(\lambda\sigma_{j}^{2}+2\lambda^{2}\sigma_{j}^{4})=-\lambda\epsilon\sum_{j}\sigma_{j}^{2}+2\lambda^{2}\sum_{j}\sigma_{j}^{4}.

There will turn out to be two cases. If we minimize $h(\lambda)$ directly, the minimizer is

\lambda^{\ast}=\frac{\epsilon}{4}\frac{\sum_{j}\sigma_{j}^{2}}{\sum_{j}\sigma_{j}^{4}}\quad\text{at which}\quad h(\lambda^{\ast})=-\frac{\epsilon^{2}}{8}\frac{\left(\sum_{j}\sigma_{j}^{2}\right)^{2}}{\sum_{j}\sigma_{j}^{4}}=-\frac{\epsilon^{2}}{8}r_{4}(A).

This estimate still requires $2\lambda^{\ast}\sigma_{1}^{2}<1$ , so if we require $2\lambda^{\ast}\sigma_{1}^{2}\leq 1/2$ , we force

\displaystyle\frac{1}{\epsilon}r_{\infty}(A)=\frac{1}{\epsilon}\frac{\sum_{j}\sigma_{j}^{2}}{\sigma_{1}^{2}}>\frac{1}{\epsilon}4\lambda^{\ast}\sum_{j}\sigma_{j}^{2}=r_{4}(A).

Because we always have $r_{4}(A)\geq r_{\infty}(A)$ , we can use the $h(\lambda^{\ast})$ value when $r_{4}(A)$ and $r_{\infty}(A)$ are “comparable” and $\epsilon\in(0,1)$ .

For the other case, when $r_{4}(A)\geq\epsilon^{-1}r_{\infty}(A)$ , we have

\sum_{j}\sigma_{j}^{4}\leq\epsilon\,\sigma_{1}^{2}\sum_{j}\sigma_{j}^{2}

and can upper bound $h(\lambda)$ by $\tilde{h}(\lambda)$ defined as

\tilde{h}(\lambda)=-\lambda\epsilon\sum_{j}\sigma_{j}^{2}+2\lambda^{2}\alpha\epsilon\,\sigma_{1}^{2}\sum_{j}\sigma_{j}^{2}\quad\text{for any}\quad\alpha\geq 1.

The minimizer for $\tilde{h}(\lambda)$ is

\tilde{\lambda}^{\ast}=\frac{1}{4\alpha\sigma_{1}^{2}}\quad\text{at which}\quad\tilde{h}(\tilde{\lambda}^{\ast})=-\frac{\epsilon}{\alpha 8}r_{\infty}(A),

and this $\tilde{\lambda}^{\ast}$ also satisfies $2\tilde{\lambda}^{\ast}\sigma_{1}^{2}\leq 1/2<1$ whenever $\alpha\geq 1$ .

When $\epsilon\in(0,1)$ , we can also avoid the distinction between the two cases by noting $\sigma_{1}\geq\sigma_{j}$ for all $j$ , so that

\sum_{j}\sigma_{j}^{4}\leq\sigma_{1}^{2}\sum_{j}\sigma_{j}^{2}\quad\text{which corresponds to taking $\alpha=1/\epsilon$ in the above.}\quad

For the lower tail, with $\lambda<0$ ,

	$\displaystyle\mathbb{P}\left\{\left\lVert ZA\right\rVert_{F}^{2}<(1-\epsilon)k\left\lVert A\right\rVert_{F}^{2}\right\}\leq e^{-\lambda(1-\epsilon)k\left\lVert A\right\rVert_{F}^{2}}\mathbb{E}\exp(\lambda\left\lVert ZA\right\rVert_{F}^{2})$
	$\displaystyle=\left(e^{-\lambda(1-\epsilon)\left\lVert A\right\rVert_{F}^{2}}\prod_{j}\mathbb{E}\exp(\lambda\sigma_{j}^{2}g_{1j}^{2})\right)^{k}.$

Because $\lambda<0$ , we can estimate the moment generating function in two ways. From $e^{x}\leq 1+x+x^{2}/2$ for $x\leq 0$ , we find

\mathbb{E}\exp(\lambda\sigma_{j}^{2}g_{1j}^{2})\leq 1+\lambda\sigma_{j}^{2}+(3/2)\lambda^{2}\sigma_{j}^{4}\leq\exp\left(\lambda\sigma_{j}^{2}+(3/2)\lambda^{2}\sigma_{j}^{4}\right)

while if we use that $e^{x+x^{2}/2}(1-x)$ is decreasing to 1 for $x\leq 0$ ,

\mathbb{E}\exp(\lambda\sigma_{j}^{2}g_{1j}^{2})=(1-2\lambda\sigma_{j}^{2})^{-1/2}\leq\exp\left(\lambda\sigma_{j}^{2}+\lambda^{2}\sigma_{j}^{4}\right).

Minimizing

h_{-}(\lambda)=-\lambda(1-\epsilon)\left\lVert A\right\rVert_{F}^{2}+\sum_{j}(\lambda\sigma_{j}^{2}+\beta\lambda^{2}\sigma_{j}^{4})=\lambda\epsilon\left\lVert A\right\rVert_{F}^{2}+\beta\lambda^{2}\sum_{j}\sigma_{j}^{4}

yields

h_{-}(\lambda_{-}^{\ast})=-\frac{\epsilon^{2}}{4\beta}r_{4}(A)\quad\text{at}\quad\lambda_{-}^{\ast}=-\frac{\epsilon}{2\beta}\frac{\left\lVert A\right\rVert_{F}^{2}}{\sum_{j}\sigma_{j}^{4}}

with $\beta=3/2$ corresponding to the Taylor expansion bound and $\beta=1$ corresponding to the function bound.

Putting everything together, and remembering the $k$ th power outside, we complete the lemma. ∎

The next lemma makes the connection between equation ( $JL^{2}$ ) and equation (JL) explicit, and is informed by the form of the target dimension derived from the tail bound rates above. In the Gaussian case, $C_{+}=8$ and $C_{-}=4$ .

Lemma 5.0.5.

For $0<\epsilon<1$ and $C_{\pm}>0$ , the requirements

\frac{C_{+}}{\widetilde{\epsilon}_{+}^{2}}=\frac{C_{-}}{\widetilde{\epsilon}_{-}^{2}},\quad\frac{(1-\epsilon)^{2}}{\gamma(\epsilon)}=1-\widetilde{\epsilon}_{-}\quad\text{and}\quad\frac{(1+\epsilon)^{2}}{\gamma(\epsilon)}=1+\widetilde{\epsilon}_{+}

have solution $\widetilde{\epsilon}_{+}=\theta\,\widetilde{\epsilon}_{-}$ with $\theta=\sqrt{C_{+}/C_{-}}$ ,

1>\widetilde{\epsilon}_{-}=\frac{4\epsilon}{(1+\epsilon)^{2}+\theta(1-\epsilon)^{2}},\quad\text{and}\quad\gamma(\epsilon)=\frac{(1+\epsilon)^{2}+\theta(1-\epsilon)^{2}}{1+\theta}.

Proof.

The first equation gives $\widetilde{\epsilon}_{+}=\theta\,\widetilde{\epsilon}_{-}$ . Taking $\theta$ times the second equation and adding it to the third gives the equation for $\gamma(\epsilon$ . Subtracting the second equation from the third yields

(1+\theta)\widetilde{\epsilon}_{-}=\frac{(1+\epsilon)^{2}-(1-\epsilon)^{2}}{\gamma(\epsilon)}=\frac{4\epsilon}{\gamma(\epsilon)}.

Conclude

\widetilde{\epsilon}_{-}=\frac{4\epsilon}{(1+\epsilon)^{2}+\theta(1-\epsilon)^{2}}=\frac{4\epsilon}{(1+\theta)(1-\epsilon)^{2}+4\epsilon}<1.\qed

Lemma 5.0.6.

For $1\leq j\leq\left\lfloor M/2\right\rfloor$ ,

\binom{M}{j}\leq\left(\frac{eM}{j}\right)^{j}.

Proof.

First note $j!\geq(j/e)^{j}$ by

\frac{j^{j}}{j!}\leq\sum_{i=0}^{\infty}\frac{j^{i}}{i!}=e^{j}.\quad\text{We then have}\quad\binom{M}{j}=\frac{M!}{j!(M-j)!}=\frac{1}{j!}\prod_{i=0}^{j-1}(M-i)\leq\frac{M^{j}}{j!}\leq\left(\frac{eM}{j}\right)^{j}

from our lower bound for $j!$ . ∎

Literature Cited

[Als08] Brian Alspach, The wonderful Walecki construction, Bulletin of the Institute of Combinatorics and its Applications 52 (2008), 7–20 (English).
[GVL13] Gene H. Golub and Charles F. Van Loan, Matrix computations, 4th ed. ed., The Johns Hopkins University Press, Baltimore, MD, 2013 (English).
[HMT10] Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp, Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions, December 2010, arXiv:0909.4061 [math] version: 2.
[JL84] William B. Johnson and Joram Lindenstrauss, Extensions of Lipschitz mappings into a Hilbert space, Conference in modern analysis and probability (New Haven, Conn., 1982), Contemp. Math., vol. 26, Amer. Math. Soc., Providence, RI, 1984, pp. 189–206. MR 737400
[KM05] B. Klartag and S. Mendelson, Empirical processes and random projections, Journal of Functional Analysis 225 (2005), no. 1, 229–245 (en).
[LN14] Kasper Green Larsen and Jelani Nelson, The Johnson-Lindenstrauss lemma is optimal for linear dimensionality reduction, arXiv:1411.2404 [cs, math] (2014), arXiv: 1411.2404.
[Luc82] E. Lucas, Récréations mathématiques, 1882, Published: Paris. Gauthier-Villars. T. II. 1883.
[Nel20] Jelani Nelson, Dimensionality Reduction in Euclidean Space, Notices of the American Mathematical Society 67 (2020), no. 10, 1 (en).
[NY17] Assaf Naor and Pierre Youssef, Restricted invertibility revisited, A journey through discrete mathematics. A tribute to Jiří Matoušek, Springer, Cham, 2017, pp. 657–691 (English).
[RV13] Mark Rudelson and Roman Vershynin, Hanson-Wright inequality and sub-gaussian concentration, Electronic Communications in Probability 18 (2013), no. none, 1–9, Publisher: Institute of Mathematical Statistics and Bernoulli Society.
[Sar06] Tamas Sarlos, Improved Approximation Algorithms for Large Matrices via Random Projections, 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS’06), October 2006, ISSN: 0272-5428, pp. 143–152.
[Vem04] Santosh S. Vempala, The random projection method, DIMACS Series in Discrete Mathematics and Theoretical Computer Science, vol. 65, American Mathematical Society, Providence, RI, 2004. MR 2073630
[Ver18] Roman Vershynin, High-dimensional probability. An introduction with applications in data science, Camb. Ser. Stat. Probab. Math., vol. 47, Cambridge University Press, Cambridge, 2018 (English).

Bulk Johnson-Lindenstrauss Lemmas

Abstract

1 Introduction

1.1 Notation

1.1.1 Isotropic Random Vectors and the Intrinsic Dimension

Definition 1.1.1.

Definition 1.1.2.

Lemma 1.1.3.

Proof.

2 Controlling Order Statistics

Lemma 2.0.1.

Remark 2.0.2.

Proof.

Corollary 2.0.3.

Proof.

Lemma 2.0.4.

Proof.

Remark 2.0.5.

Theorem 2.0.6.

Proof.

Lemma 2.0.7.

Proof.

Theorem 2.0.8.

Proof.

Example 2.1.

3 The Walecki Construction

Lemma 3.0.1 (Walecki Construction).

Corollary 3.0.2.

Lemma 3.0.3.

Proof.

Theorem 3.0.4.

Proof.

4 Further Theorems for i.i.d. Samples

Assumption 4.1.

Theorem 4.0.1.

Proof.

4.1 Controlling “Most” Batches

Lemma 4.1.1.

Proof.

Lemma 4.1.2.

Proof.

4.1.1 Retraction to the Sphere

Lemma 4.1.3.

Remark 4.1.4.

Proof.

Theorem 4.1.5.

Remark 4.1.6.

Proof.

Lemma 4.1.7.

Proof.

Corollary 4.1.8.

Proof.

Remark 4.1.9.

4.2 Estimating r^\widehat{r}

Corollary 4.2.1.

Proof.

Acknowledgements

Disclaimers

5 Appendix

Lemma 5.0.1.

Remark 5.0.2.

Proof.

Lemma 5.0.3.

Proof.

Lemma 5.0.4.

Proof.

Lemma 5.0.5.

Proof.

Lemma 5.0.6.

Proof.

Literature Cited

4.2 Estimating $\widehat{r}$