Abstract
The matrix is -regular if for any -sparse vector ,
We show that if is -regular for , then by multiplying the columns of by independent random signs, the resulting random ensemble acts on an arbitrary subset (almost) as if it were gaussian, and with the optimal probability estimate: if is the gaussian mean-width of and , then with probability at least ,
where . This estimate is optimal for .
1 Introduction
Linear operators that act in an almost-isometric way on subsets of are of obvious importance. Although approximations of isometries are the only operators that almost preserve the Euclidean norm of any point in , one may consider a more flexible alternative: a random ensemble of operators such that, for any fixed , with high probability, “acts well” on every element of . Such random ensembles have been studied extensively over the years, following the path paved by the celebrated work of Johnson and Lindenstrauss in [5]. Here we formulate the Johnson-Lindenstrauss Lemma in one of its gaussian versions:
Theorem 1.1.
There exist absolute constants and such that the following holds. Let and set to be a random matrix whose entries are independent, standard gaussian random variables. Let be of cardinality at most . Then for , with probability at least , for every ,
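To make the mechanism of Theorem 1.1 concrete, the following minimal Python sketch draws a gaussian ensemble and checks the norm distortion over a finite set. It assumes the standard multiplicative form of the guarantee and the usual normalization Γ = m^{-1/2}G; the specific values of m, ε and |T| are arbitrary and are chosen only so that the condition of the theorem holds comfortably.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m, eps = 200, 400, 0.2          # ambient dimension, target dimension, accuracy
T = rng.standard_normal((50, n))   # a finite set T of 50 points in R^n

# gaussian random ensemble: Gamma = m^{-1/2} G with iid standard gaussian entries
G = rng.standard_normal((m, n))
Gamma = G / np.sqrt(m)

# check the multiplicative norm preservation (1 - eps) <= ||Gamma x|| / ||x|| <= (1 + eps)
ratios = np.linalg.norm(T @ Gamma.T, axis=1) / np.linalg.norm(T, axis=1)
print(ratios.min(), ratios.max())
```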
The scope of Theorem 1.1 can be extended to more general random ensembles than the gaussian one, e.g., to a random matrix whose rows are iid copies of a centred random vector that exhibits suitable decay properties (see, e.g. [4, 8]). It is far more challenging to construct a random ensemble that, on the one hand, satisfies a version of Theorem 1.1, and on the other is based on “few random bits” or is constructed using a heavy-tailed random vector.
A significant breakthrough towards more general “Johnson-Lindenstrauss transforms” came in [7], where it was shown that a matrix that satisfies a suitable version of the restricted isometry property can be converted to the wanted random ensemble by multiplying its columns by random signs. More accurately, let be independent, symmetric -valued random variables. Set and for a matrix define
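In code, the column randomization just described amounts to multiplying the j-th column of the matrix by an independent random sign, i.e., forming the product of the matrix with a diagonal sign matrix. A minimal sketch, assuming the matrix is given as a dense array:

```python
import numpy as np

def column_randomize(A, rng):
    """Multiply each column of A by an independent random sign.

    Equivalent to A @ D_eps, where D_eps is diagonal with iid
    symmetric {-1, 1} entries on the diagonal.
    """
    eps = rng.choice([-1.0, 1.0], size=A.shape[1])
    return A * eps          # broadcasting scales column j by eps[j]

rng = np.random.default_rng(1)
A = rng.standard_normal((64, 256)) / np.sqrt(64)
Gamma = column_randomize(A, rng)
```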
From here on we denote by the subset of consisting of vectors that are supported on at most coordinates.
Definition 1.2.
A matrix satisfies the restricted isometry property of order and level if
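Definition 1.2 is a uniform condition over all sparse vectors and cannot be certified by sampling; still, a Monte Carlo probe of the kind sketched below gives an empirical lower bound on the restricted isometry level of a given matrix. The helper and the squared-norm criterion it tests are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def empirical_rip_level(A, s, trials=2000, rng=None):
    """Monte Carlo lower bound on the RIP level of order s:
    max over sampled unit-norm s-sparse x of | ||Ax||_2^2 - ||x||_2^2 |."""
    rng = rng or np.random.default_rng()
    n = A.shape[1]
    worst = 0.0
    for _ in range(trials):
        support = rng.choice(n, size=s, replace=False)
        x = np.zeros(n)
        x[support] = rng.standard_normal(s)
        x /= np.linalg.norm(x)
        worst = max(worst, abs(np.linalg.norm(A @ x) ** 2 - 1.0))
    return worst

rng = np.random.default_rng(2)
m, n, s = 128, 512, 10
A = rng.standard_normal((m, n)) / np.sqrt(m)
print(empirical_rip_level(A, s, rng=rng))
```

Since only finitely many sparse vectors are sampled, the returned value can underestimate the true restricted isometry level; it is a sanity check rather than a certificate.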
Theorem 1.3.
[7]
There are absolute constants and such that the following holds.
Let and . Consider and let . If satisfies the restricted isometry property of order and at level , then with probability at least , for every ,
While Theorem 1.3 does not recover the probability estimate from Theorem 1.1, it does imply at the constant probability level that is an almost isometry in the random ensemble sense: if is a matrix that -preserves the norms of vectors that are sparse, then a typical realization of the random ensemble preserves the norms of all the elements in .
Various extensions of Theorem 1.1 that hold for arbitrary subsets of have been studied over the years. In such extensions the “complexity parameter” is replaced by more suitable counterparts. A rather general version of Theorem 1.1 follows from a functional Bernstein inequality (see, e.g., [3, 8, 2]), and to formulate that inequality in the gaussian case we require the following definition.
Definition 1.4.
Let be independent, standard gaussian random variables. For set
Let
be the critical dimension of the set .
The critical dimension appears naturally when studying the geometry of convex sets—for example, in the context of the Dvoretzky-Milman Theorem (see [1] and references therein for more details). It is the natural alternative to , which was suitable for finite subsets of the sphere .
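For a concrete finite set, the gaussian mean-width can be estimated numerically by Monte Carlo, and from it a proxy for the critical dimension. The sketch below uses the normalization (mean width / diameter)^2, which is a common convention adopted here purely for illustration; it is an assumption, not the definition given above.

```python
import numpy as np

def mean_width(T, draws=5000, rng=None):
    """Monte Carlo estimate of E sup_{t in T} <G, t> for a finite set T (rows of the array)."""
    rng = rng or np.random.default_rng()
    G = rng.standard_normal((draws, T.shape[1]))
    return (G @ T.T).max(axis=1).mean()

rng = np.random.default_rng(3)
T = rng.standard_normal((100, 50))            # 100 points in R^50
ell = mean_width(T, rng=rng)
diam = max(np.linalg.norm(t - s) for t in T for s in T)
d_T = (ell / diam) ** 2                        # assumed normalization of the critical dimension
print(ell, d_T)
```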
Let be the standard gaussian random vector in , set to be independent copies of and put
to be the random ensemble used in Theorem 1.1.
Theorem 1.5.
There exist absolute constants and such that the following holds. If and then with probability at least
for every ,
(1.1)
One may use Theorem 1.5 to ensure that the uniform error in (1.1) is at most . Indeed, if
then with probability at least ,
(1.2)
which is a natural counterpart of Theorem 1.1 once is replaced by .
As it happens, a version of Theorem 1.3 that is analogous to (1.2) was proved in [9], using the notion of a multi-scale RIP.
Definition 1.6.
Let . For and the matrix satisfies a multi-scale RIP with distortion and sparsity if, for every and every , one has
Definition 1.6 implies that if then
Example 1.7.
Let be a gaussian matrix as above and set . It is standard to verify (using, for example, Theorem 1.5 and a well-known estimate on ) that with probability at least ,
By the union bound over it follows that with nontrivial probability, satisfies a multi-scale RIP with and . Observe that the second term in the multi-scale RIP condition, namely , is not needed here.
The following theorem is the starting point of this note: an estimate on the error a typical realization of the random ensemble has when acting on an arbitrary , given that satisfies an appropriate multi-scale RIP.
Theorem 1.9.
[9]
There are absolute constants and such that the following holds.
Let and let be a matrix that satisfies a multi-scale RIP with sparsity level and distortion
(1.3)
Then for , with probability at least ,
(1.4)
To put Theorem 1.9 in some context, if the belief is that should exhibit the same behaviour as the gaussian matrix , then (keeping in mind that should scale like ), “a gaussian behaviour” as in Theorem 1.5 is that with high probability,
Observe that , implying by (1.3) that . Hence, the error in (1.4) in terms of is indeed
However, despite the “gaussian error”, the probability estimate in Theorem 1.9 is far weaker than in Theorem 1.5—it is just at the constant level.
Our main result is that, under a modified and seemingly less restrictive version of the multi-scale RIP, acts on as if it were a gaussian operator, achieving the same distortion and probability estimate as in Theorem 1.5.
Definition 1.10.
Let be a matrix. For let be the largest such that for every , is regular; that is, for every
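In the spirit of Definition 1.10, one can probe empirically for the largest sparsity level at which a given matrix still looks regular at a prescribed distortion. As with the RIP sketch earlier, random sampling only detects failures, so the value returned below is an optimistic estimate; it also assumes, purely for illustration, the same squared-norm criterion used in that sketch, which need not coincide with the regularity condition above.

```python
import numpy as np

def largest_regular_level(A, delta, s_max, trials=1000, rng=None):
    """Return the largest s <= s_max such that no sampled unit-norm sparse vector
    of sparsity at most s violates | ||Ax||_2^2 - 1 | <= delta."""
    rng = rng or np.random.default_rng()
    n = A.shape[1]
    for s in range(1, s_max + 1):
        for _ in range(trials):
            support = rng.choice(n, size=s, replace=False)
            x = np.zeros(n)
            x[support] = rng.standard_normal(s)
            x /= np.linalg.norm(x)
            if abs(np.linalg.norm(A @ x) ** 2 - 1.0) > delta:
                return s - 1
    return s_max

rng = np.random.default_rng(4)
A = rng.standard_normal((128, 512)) / np.sqrt(128)
print(largest_regular_level(A, delta=0.3, s_max=30, rng=rng))
```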
Theorem 1.11.
There exist absolute constants and such that the following holds. Let and set . If , and , then with probability at least
we have that
(1.5)
Theorem 1.11 clearly improves the probability estimate from Theorem 1.9. The other (virtual) improvement is that the matrix need only be -regular for , and the way acts on for is of no importance. The reason for calling that improvement “virtual” is the following observation:
Lemma 1.13.
If then for any ,
In other words, the second term in the multi-scale RIP condition follows automatically from the first one and the fact that is sufficiently large.
Proof. Let for , and let be a decomposition of the support of into coordinate blocks of cardinality . Set , that is, the projection of onto , and write . Note that and that
The vectors are orthogonal and so . Therefore,
For the first term, as each is supported on at most coordinates, it follows from the regularity condition that
As for the second term, since and are orthogonal, and
Thus, by the regularity of and as ,
Taking the sum over all pairs , , each factor appears at most times, and . Hence, using that
Clearly, Theorem 1.11 implies a suitable version of Theorem 1.9.
Corollary 1.14.
There exist absolute constants and such that the following holds. Let be as above, set and . Let . Then with probability at least ,
In Section 3 we present one simple application of Theorem 1.11. We show that column randomization of a typical realization of a Bernoulli circulant matrix (complete or partial) exhibits an almost gaussian behaviour (conditioned on the generating vector). In particular, only random bits ( from the generating Bernoulli vector and from the column randomization) are required if one wishes to create a random ensemble that is, effectively, an almost isometry.
The proof of Theorem 1.11 is based on a chaining argument. For more information on the generic chaining mechanism, see Talagrand’s treasured manuscript [10]. We only require relatively basic notions from generic chaining theory, as well as the celebrated majorizing measures theorem.
Definition 1.16.
Let . A collection of subsets of , , is an admissible sequence if and for , . For every denote by a nearest point to in with respect to the Euclidean distance. Set for and let .
The functional with respect to the metric is defined by
\[
\gamma_2(T,d) = \inf \sup_{t \in T} \sum_{s \ge 0} 2^{s/2}\, d(t, T_s),
\]
where the infimum is taken with respect to all admissible sequences of .
An application of Talagrand’s majorizing measures theorem to the gaussian process shows that and are equivalent:
Theorem 1.17.
There are absolute constants and such that for every ,
\[
c_1 \gamma_2(T, \|\cdot\|_2) \le \mathbb{E} \sup_{t \in T} \sum_{i=1}^n g_i t_i \le c_2 \gamma_2(T, \|\cdot\|_2).
\]
The proof of Theorem 1.17 can be found, for example, in [10].
2 Proof of Theorem 1.11
We begin the proof with a word about notation: throughout, absolute constants, that is, positive numbers that are independent of all the parameters involved in the problem, are denoted by , , , etc. Their value may change from line to line.
As noted previously, the proof is based on a chaining argument. Let be an optimal admissible sequence of . Set so that is the critical dimension of , i.e.,
(without loss of generality we may assume that equality holds). Let be a parameter to be specified in what follows and observe that
Writing , it follows that
(2.1)
and
Therefore, setting
we have that for every ,
and
(2.2)
Hence,
(2.3)
To estimate the final term, note that for every ,
By the definition of the functional and the majorizing measures theorem, for every integer and every ,
Thus,
and
(2.4)
Equation (2.4) and the wanted estimate in Theorem 1.11 hint at the choice of : it should be larger than . Recalling that and that , set
The reason behind the choice of the third term will become clear in what follows.
With that choice of ,
(2.5)
and the nontrivial part of the proof is to control and with high probability, which would lead to the wanted estimate on (2.3).
2.1 A decoupling argument
For every ,
(2.6)
By the assumption that is -regular,
and noting that ,
Next, we turn to the “off-diagonal” term in (2.6). For let
and let us obtain high probability estimates on for various sets .
The first step in that direction is decoupling: let be independent -valued random variables with mean . Set and observe that for every ,
(2.7)
Indeed, for every ,
and for every ,
Equation (2.7) naturally leads to the following definition:
Definition 2.1.
For and , set
(2.8)
Recall that is the nearest point to in and .
Lemma 2.2.
For every and every ,
(2.9)
Proof. Fix an integer . With the decoupling argument in mind, fix and observe that
Moreover,
and
Combining these observations, for every and ,
from which the claim follows immediately.
As part of the decoupling argument and to deal with the introduction of the random variables in (2.7), we will use the following elementary fact:
Lemma 2.3.
Let be a function of and . If then with -probability at least ,
(2.10)
Proof. For a nonnegative function we have that point-wise, . Let and note that
By Jensen’s inequality followed by Fubini’s Theorem,
and setting proves (2.10).
We shall use Lemma 2.3 in situations where we actually have more information—namely that for any , for a well chosen . As a result,
Taking into account Lemma 2.2 and Lemma 2.3, it follows that if one wishes to estimate using a chaining argument, it suffices to obtain, for every , bounds on moments of random variables of the form
(2.11)
as that results in high probability estimates on each of the terms in (2.2). And as there are at most random variables involved in this chaining argument at the -stage, the required moment in (2.11) is for the first two terms and for the third one.
2.2 Preliminary estimates
For let be the projection of onto . The key lemma in the proof of Theorem 1.11 is:
Lemma 2.4.
There exists an absolute constant such that the following holds.
Let and be as in (2.8). Set such that . Then for ,
Proof. Let be the Euclidean unit sphere in the subspace and let be a maximal -separated subset of . By a volumetric estimate (see, e.g. [1]), there is an absolute constant such that . Moreover, a standard approximation argument shows that for every ,
where is a suitable absolute constant. Therefore,
and it suffices to control, with high probability,
Fix , recall that is -regular for , and consider first the case .
Denote by the set of indices corresponding to the largest values of . If then set .
It is straightforward to verify (e.g., using Hoeffding's inequality) that there is an absolute constant such that
where we used the fact that for ,
Therefore,
Note that is supported in , while each ‘legal’ is supported in a subset of ; in particular, and are orthogonal, implying that for every such ,
Thus, by the -regularity of (as ),
and it follows that
Turning to the case , let be the set of indices corresponding to the largest coordinates of . Therefore,
(2.12)
using that for ,
Recall that and that . The same argument used previously shows that
hence,
and the estimate holds for each for that fixed .
Setting , it follows from Chebyshev's inequality that with probability at least ,
and by the union bound, recalling that , the same estimate holds uniformly for every , provided that . Thus, with probability at least ,
and the wanted estimate follows from tail integration.
The next observation deals with more refined estimates on random variables of the form
Once again, we use the fact that for any
(2.13)
As one might have guessed, the choice of will be vectors of the form . These are random vectors that are independent of the Bernoulli random variables involved in the definition of . At the same time, will be deterministic.
Without loss of generality assume that for every . Let be the set of indices corresponding to the largest coordinates of , the set corresponding to the next largest coordinates, and so on. The choice of will be specified in what follows.
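The block decomposition used here, splitting the coordinates of a vector into consecutive blocks of prescribed cardinality ordered by decreasing magnitude, is a standard peeling device. A short sketch, assuming the vector is given as a dense array:

```python
import numpy as np

def peel_blocks(x, s):
    """Split the indices of x into blocks I_1, I_2, ... of cardinality s,
    where I_1 holds the s largest coordinates in absolute value, I_2 the
    next s largest, and so on (the last block may be shorter)."""
    order = np.argsort(-np.abs(x))          # indices sorted by decreasing |x_i|
    return [order[i:i + s] for i in range(0, len(x), s)]

x = np.array([0.1, -3.0, 0.5, 2.0, -0.2, 1.0, 0.0, -0.7])
for block in peel_blocks(x, s=3):
    print(block, x[block])
```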
Note that for any ,
Set , and by (2.13),
(2.14)
And, in the case where for every , it follows that
(2.15)
2.3 Estimating
Recall that
and that for every ,
Theorem 2.5.
There are absolute constants and such that for , with probability at least ,
Proof. Let , and as noted previously, for every ,
Following the decoupling argument, one may fix . The core of the argument is to obtain satisfactory estimates on moments of the random variables
for that (arbitrary) fixed choice of . And, since , for a uniform estimate that holds for every random variable of the form , it is enough to control the -th moment of each for .
With that in mind, denote by the expectation taken with respect to , and by the expectation taken with respect to .
Let us apply (2.15), with the choice
Thus, for every ,
The sets are all of cardinality and so there are at most of them. By Lemma 2.4, for each one of the sets and ,
implying that
(2.16)
as long as . In particular, since , it suffices that to ensure that (2.3) holds.
Hence, for every ,
because .
By Jensen’s inequality, for every ,
Setting for and , Chebyshev's inequality implies that with -probability at least ,
By the union bound, this estimate holds for every and . Thus, there are absolute constants and such that with probability at least ,
as claimed.
2.4 Estimating
Next, recall that
Expanding as in (2.6), the “diagonal term” is at most , and one has to deal with the “off-diagonal” term
As observed in Lemma 2.2 combined with Lemma 2.3, it suffices to obtain, for every fixed , estimates on the moments of
(2.17)
We begin with a standard observation:
Lemma 2.7.
There is an absolute constant such that the following holds. Let be nonnegative random variables, let and set . If there is some such that for every , , then
Proof. By Chebyshev's inequality, for every and ,
Therefore,
implying that
Hence,
The analysis is split into two cases.
Case I: .
In this case, . For every we invoke (2.2) for sets of cardinality when , and of cardinality when .
Set and ; the treatment when is identical and is omitted. Let and put if . Finally, set .
Consider such that and observe that . By Lemma 2.4,
(2.18)
Therefore, by Lemma 2.7,
Next, if then and by Lemma 2.4,
(2.19)
There are at most sets in that range, and because , it is evident that
therefore, as ,
Thus, for every fixed ,
(2.20)
Now, by (2.15),
(2.21)
which, combined with (2.18) and (2.20), implies that
(2.22)
Clearly, there are at most random variables as in (2.22). With that in mind, set and let . By Lemma 2.2 and Lemma 2.3 followed by the union bound, we have that with probability at least , for every ,
(2.23)
And, by the union bound and recalling that , (2.23) holds uniformly for every and with probability at least
An identical argument shows that with probability at least
for every ,
Finally, invoking Lemma 2.2, we have that with probability at least , for every ,
as required.
Case II: .
The necessary modifications are minor and we shall only sketch them. In this range, , and the problem is that for each vector , the number of blocks of cardinality , namely , is larger than when . Therefore, setting for every , the uniform estimate on
can be obtained by bounding for . Indeed, by Lemma 2.4, for every and every , we have that
The rest of the argument is identical to the previous one and is omitted.
Proof of Theorem 1.11. Using the estimates we established, it follows that for , with probability at least
noting that . Since
the claim follows from a straightforward computation, by separating into the cases and .
3 Application - A circulant Bernoulli matrix
Let and be independent Bernoulli vectors. Set and put . Let be the circulant matrix generated by the random vector ; that is, is the matrix whose -th row is the shifted vector , where for every , .
Fix of cardinality and let
be the normalized partial circulant matrix. It follows from Theorem 3.1 in [6] and the estimates in Section 4 there that for a typical realization of , the matrix is -regular for :
Theorem 3.1.
[6]
There exist absolute constants and such that the following holds. For with probability at least
we have that
By Theorem 3.1 and the union bound for , there is an event with probability at least with respect to , on which, for every ,
This verifies the assumption needed in Theorem 1.11 on the event . Now fix and let be the resulting partial circulant matrix. Set and let . By Theorem 1.11, with probability at least
with respect to , we have that
where
Thus, a random matrix generated by independent random signs is a good embedding of an arbitrary subset of with the same accuracy (up to logarithmic factors) as a gaussian matrix. Moreover, conditioned on the choice of the circulant matrix , the probability estimate coincides with the estimate in the gaussian case.
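For concreteness, a minimal sketch of the construction described in this section: a full circulant matrix whose rows are cyclic shifts of the Bernoulli generating vector, followed by a random row selection and column randomization. The normalization by the square root of the number of selected rows is an assumption matching the usual convention and is used here only for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 256, 64

xi = rng.choice([-1.0, 1.0], size=n)                 # Bernoulli generating vector
C = np.stack([np.roll(xi, i) for i in range(n)])     # full circulant matrix: row i is the i-th cyclic shift of xi

I = rng.choice(n, size=m, replace=False)             # row subset of cardinality m
A = C[I] / np.sqrt(m)                                # partial circulant matrix (assumed 1/sqrt(m) normalization)

eps = rng.choice([-1.0, 1.0], size=n)                # column randomization
Gamma = A * eps                                      # the resulting random ensemble
```

Only the signs of the generating vector and the column signs are random here, in line with the count of random bits mentioned in the introduction.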