
A Nearly-Optimal Bound for Fast Regression with $\ell_{\infty}$ Guarantee

Zhao Song (Adobe Research)    Mingquan Ye (University of Illinois at Chicago)    Junze Yin (Boston University)    Lichen Zhang (MIT)

Given a matrix $A\in\mathbb{R}^{n\times d}$ and a vector $b\in\mathbb{R}^{n}$, we consider the regression problem with $\ell_{\infty}$ guarantees: finding a vector $x^{\prime}\in\mathbb{R}^{d}$ such that

\[ \|x^{\prime}-x^{*}\|_{\infty}\leq \frac{\epsilon}{\sqrt{d}}\cdot\|Ax^{*}-b\|_{2}\cdot\|A^{\dagger}\|, \]

where $x^{*}=\arg\min_{x\in\mathbb{R}^{d}}\|Ax-b\|_{2}$. One popular approach for solving such $\ell_{2}$ regression problems is sketching: pick a structured random matrix $S\in\mathbb{R}^{m\times n}$ with $m\ll n$ such that $SA$ can be computed quickly, then solve the “sketched” regression problem $\arg\min_{x\in\mathbb{R}^{d}}\|SAx-Sb\|_{2}$.

In this paper, we show that in order to obtain such an $\ell_{\infty}$ guarantee for $\ell_{2}$ regression, one has to use sketching matrices that are dense. To the best of our knowledge, this is the first use case in which dense sketching matrices are necessary. On the algorithmic side, we prove that there exists a distribution of dense sketching matrices with $m=\epsilon^{-2}d\log^{3}(n/\delta)$ rows such that solving the sketched regression problem gives the $\ell_{\infty}$ guarantee with probability at least $1-\delta$. Moreover, the matrix $SA$ can be computed in time $O(nd\log n)$. Our row count is nearly optimal up to logarithmic factors, and significantly improves the result in [22], which requires a number of rows super-linear in $d$, namely $m=\Omega(\epsilon^{-2}d^{1+\gamma})$ for $\gamma=\Theta(\sqrt{\frac{\log\log n}{\log d}})$.

Moreover, we develop a novel analytical framework for $\ell_{\infty}$ guarantee regression that utilizes the Oblivious Coordinate-wise Embedding (OCE) property introduced in [31]. Our analysis is much simpler and more general than that of [22]. Leveraging this framework, we extend the $\ell_{\infty}$ guarantee regression result to dense sketching matrices for computing the fast tensor product of vectors.

1 Introduction

Linear regression, or the $\ell_{2}$ least-squares problem, is ubiquitous in numerical linear algebra, scientific computing, and machine learning. Given a tall skinny matrix $A\in\mathbb{R}^{n\times d}$ and a label vector $b\in\mathbb{R}^{n}$, the goal is to (approximately) compute an optimal solution $x^{\prime}$ that minimizes the $\ell_{2}$ loss $\|Ax-b\|_{2}$. In the regime where $n\gg d$, sketching is a popular approach to obtain an approximate solution quickly [8, 7]: the idea is to pick a random matrix $S\in\mathbb{R}^{m\times n}$ from a carefully designed distribution so that (1) $S$ can be efficiently applied to $A$, and (2) the row count $m\ll n$. Given these two guarantees, one can then solve the “sketched” regression problem:

\[ \arg\min_{x\in\mathbb{R}^{d}}\|SAx-Sb\|_{2}, \]

and obtain a vector $x^{\prime}$ such that $\|Ax^{\prime}-b\|_{2}=(1\pm\epsilon)\cdot{\rm OPT}$, where ${\rm OPT}$ denotes the optimal $\ell_{2}$ discrepancy between vectors in the column space of $A$ and $b$. Recent advances in sketching [7] show that one can design a matrix $S$ with $m=O(\epsilon^{-2}d\log^{2}(n/\delta))$ such that the sketched regression can be solved in time $O(\operatorname{nnz}(A)+d^{\omega}+\epsilon^{-2}d^{2}\operatorname{poly}\log(n,d,1/\epsilon,1/\delta))$, where $\operatorname{nnz}(A)$ denotes the number of nonzero entries of $A$ and $\delta$ is the failure probability.
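For concreteness, the following short Python snippet (our own illustration, not part of the original analysis; it assumes numpy and uses a dense Gaussian sketch in place of the structured sketches discussed later) demonstrates the sketch-and-solve recipe and compares the sketched solution with the exact least-squares solution.

import numpy as np

rng = np.random.default_rng(0)
n, d, m = 10000, 50, 2000          # tall-skinny least squares; m << n rows after sketching

A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Exact solution x* = argmin ||Ax - b||_2.
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)

# Sketch-and-solve: S has i.i.d. N(0, 1/m) entries so that E[S^T S] = I_n.
S = rng.standard_normal((m, n)) / np.sqrt(m)
x_prime, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)

print("relative cost increase:",
      np.linalg.norm(A @ x_prime - b) / np.linalg.norm(A @ x_star - b) - 1)
print("coordinate-wise (l_inf) error:", np.abs(x_prime - x_star).max())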

Unfortunately, modern machine learning emphasizes large, complex, and nonlinear models such as deep neural networks more and more, so linear regression becomes less appealing as a model. However, it is still a very important subroutine in many deep learning and optimization frameworks, especially second-order methods for training neural networks [4, 32] and convex optimization [19, 20, 31, 15]. In these applications, one typically seeks a forward error guarantee, i.e., how close the approximate solution $x^{\prime}$ is to the optimal solution $x^{*}$. A prominent example is Newton's method: given the (possibly implicit) Hessian matrix $H\in\mathbb{R}^{d\times d}$ and the gradient $g\in\mathbb{R}^{d}$, one wants to compute $H^{-1}g$. A common approach is to solve the regression $\arg\min_{x\in\mathbb{R}^{d}}\|Hx-g\|_{2}$, in which one wants $\|x-H^{-1}g\|_{2}$ or even $\|x-H^{-1}g\|_{\infty}$ to be small. When the matrix $S$ satisfies the so-called Oblivious Subspace Embedding (OSE) property [25], one can show that the approximate solution $x^{\prime}$ is close to $x^{*}$ in the $\ell_{2}$ sense:

\[ \|x^{\prime}-x^{*}\|_{2}\leq O(\epsilon)\cdot\|Ax^{*}-b\|_{2}\cdot\|A^{\dagger}\|. \quad (1) \]

Unfortunately, $\ell_{2}$-closeness does not characterize how well $x^{\prime}$ approximates $x^{*}$: $x^{*}$ may spread its $\ell_{2}$ mass evenly over all coordinates while $x^{\prime}$ concentrates its mass on a few coordinates. Formally speaking, let $a\in\mathbb{R}^{d}$ be a fixed vector; then one can measure how far $\langle a,x^{\prime}\rangle$ deviates from $\langle a,x^{*}\rangle$ via Eq. (1):

\begin{align*} |\langle a,x^{*}\rangle-\langle a,x^{\prime}\rangle| &= |\langle a,x^{\prime}-x^{*}\rangle| \\ &\leq \|a\|_{2}\|x^{\prime}-x^{*}\|_{2} \\ &\leq O(\epsilon)\cdot\|a\|_{2}\cdot\|Ax^{*}-b\|_{2}\cdot\|A^{\dagger}\|. \end{align*}

This bound is clearly too loose, as one would expect the deviation in a random direction to be only a $\frac{1}{\sqrt{d}}$ factor of the $\ell_{2}$ discrepancy. [22] shows that this intuition is indeed true when $S$ is picked as the subsampled randomized Hadamard transform (SRHT) [17] (we will later refer to the following property as the $\ell_{\infty}$ guarantee):

\[ |\langle a,x^{*}\rangle-\langle a,x^{\prime}\rangle|\lesssim\frac{\epsilon}{\sqrt{d}}\|a\|_{2}\|Ax^{*}-b\|_{2}\|A^{\dagger}\|. \quad (2) \]

However, their analysis is not tight: they require a row count of $m=\Omega(\epsilon^{-2}d^{1+\gamma})$ for $\gamma=\Theta(\sqrt{\frac{\log\log n}{\log d}})$. Such a row count is super-linear in $d$ whenever $n\leq\exp(d)$, and is therefore worse than the row count required for $S$ to be a subspace embedding, where only $m=\epsilon^{-2}d\log^{2}n$ rows are needed for constant success probability. In contrast, for random Gaussian matrices, the $\ell_{\infty}$ guarantee only requires a number of rows nearly linear in $d$. In addition to their sub-optimal row count, the analysis of [22] is also complicated: letting $U\in\mathbb{R}^{n\times d}$ be an orthonormal basis of $A$, [22] has to analyze higher moments of the matrix $I_{d}-U^{\top}S^{\top}SU$. This makes their analysis particularly hard to generalize to other dense sketching matrices beyond SRHT.

In this work, we present a novel framework for analyzing the $\ell_{\infty}$ guarantee induced by SRHT and, more generally, a large class of dense sketching matrices. Our analysis is arguably much simpler than that of [22], and it exposes the fundamental structure of sketching matrices that provides the $\ell_{\infty}$ guarantee: if any two columns of the sketching matrix have a small inner product with high probability, then the $\ell_{\infty}$ guarantee can be preserved. We then prove that the small pairwise column inner product is also closely related to the Oblivious Coordinate-wise Embedding (OCE) property introduced in [31]. More concretely, for any two fixed vectors $g,h\in\mathbb{R}^{n}$, we say the sketching matrix is $(\beta,\delta,n)$-OCE if $|\langle Sg,Sh\rangle-\langle g,h\rangle|\leq\frac{\beta}{\sqrt{m}}\cdot\|g\|_{2}\|h\|_{2}$ holds with probability at least $1-\delta$. This property has previously been leveraged for approximating the matrix-vector product between a dynamically changing projection matrix and an online sequence of vectors in fast linear programming and empirical risk minimization algorithms [20, 15, 31, 23], as these algorithms need an $\ell_{\infty}$ bound on the matrix-vector product. One common theme shared by those applications and the $\ell_{\infty}$ guarantee is the use of dense sketching matrices, such as random Gaussian, the Alon-Matias-Szegedy sketch (AMS, [2]), or SRHT. This is in drastic contrast with the trend towards sparse matrices such as Count Sketch [5, 8] and OSNAP [21, 6], which can be applied in (nearly) input-sparsity time.

In recent years, sketches that can be applied to the tensor product of matrices/vectors have gained popularity [3, 27, 10, 28, 9, 1, 35, 26, 36, 30, 29, 24], as they can speed up optimization tasks and large-scale kernel learning. We show that dense sketches for degree-2 tensors also provide the $\ell_{\infty}$ guarantee.

Theorem 1.1 (Nearly-optimal bound for dense sketching matrices).

Suppose $n\leq\exp(d)$, and matrix $A\in\mathbb{R}^{n\times d}$ and vector $b\in\mathbb{R}^{n}$ are given. Let $S\in\mathbb{R}^{m\times n}$ be a subsampled randomized Hadamard transform (SRHT) matrix with $m=O(\epsilon^{-2}d\log^{3}(n/\delta))$ rows.

For any fixed vector $a\in\mathbb{R}^{d}$,

\[ |\langle a,x^{*}\rangle-\langle a,x^{\prime}\rangle|\lesssim\frac{\epsilon}{\sqrt{d}}\cdot\|a\|_{2}\cdot\|Ax^{*}-b\|_{2}\cdot\|A^{\dagger}\| \]

with probability $1-\delta$, where $x^{\prime}=\arg\min_{x\in\mathbb{R}^{d}}\|SAx-Sb\|_{2}$ and $x^{*}=\arg\min_{x\in\mathbb{R}^{d}}\|Ax-b\|_{2}$.

Remark 1.2.

The row count $m=\epsilon^{-2}d\log^{3}n$ is nearly optimal up to logarithmic factors, as the row count for $S$ to be an OSE is $m=\epsilon^{-2}d\log^{2}n$ for constant success probability. In comparison, [22] requires $m=\epsilon^{-2}d^{1+\gamma}$ rows for $\gamma=\Theta(\sqrt{\frac{\log\log n}{\log d}})$, which is only nearly linear in $d$ if $n>\exp(d)$. In most applications, we are concerned with $n=\operatorname{poly}(d)$, meaning that their row count is worse than ours in almost all meaningful scenarios.

The row count and guarantee obtained in Theorem 1.1 extend beyond SRHT: they hold for a range of dense sketching matrices including random Gaussian, the AMS sketch [2], SRHT, and the subsampled randomized circulant transform (SRCT, see Definition 2.11). This is because our argument relies on a structural condition that is satisfied by various dense sketches.

Our result can also be generalized to degree-2 Kronecker product regression; see Theorem B.1.

Roadmap.

In Section 2, we introduce the notation that we use and explain the key definitions and properties that support the framework for $\ell_{\infty}$ guarantee regression. In Section 3, we introduce our framework by presenting a sufficient condition for a sketching matrix to give a good $\ell_{\infty}$ guarantee. In Section 4, we prove our main theorem by putting everything together. Finally, in Section 5, we summarize the main findings of this paper and compare them with previous work.

2 Preliminary

For a positive integer $n$, we define $[n]:=\{1,2,\cdots,n\}$. For a vector $x\in\mathbb{R}^{n}$, we define $\|x\|_{2}:=(\sum_{i=1}^{n}x_{i}^{2})^{1/2}$ and $\|x\|_{\infty}:=\max_{i\in[n]}|x_{i}|$. For a matrix $A$, we define $\|A\|:=\sup_{x}\|Ax\|_{2}/\|x\|_{2}$ to be the spectral norm of $A$, and $\|A\|_{F}:=(\sum_{i,j}A_{i,j}^{2})^{1/2}$ to be the Frobenius norm of $A$. The spectral norm is sub-multiplicative: $\|AB\|\leq\|A\|\cdot\|B\|$. We use $A^{\dagger}$ to denote the Moore-Penrose pseudoinverse of an $m\times n$ matrix $A$: if $A=U\Sigma V^{\top}$ is its SVD (where $U\in\mathbb{R}^{m\times n}$, $\Sigma\in\mathbb{R}^{n\times n}$, and $V\in\mathbb{R}^{n\times n}$ for $m\geq n$), then $A^{\dagger}=V\Sigma^{-1}U^{\top}$.

We use $\operatorname*{{\mathbb{E}}}[\cdot]$ to denote the expectation and $\Pr[\cdot]$ to denote the probability. For a distribution $D$ and a random variable $x$, we write $x\sim D$ to denote that $x$ is drawn from the distribution $D$. We use ${\cal N}(\mu,\sigma^{2})$ to denote a Gaussian distribution with mean $\mu$ and variance $\sigma^{2}$. We say a random variable $x$ is a Rademacher random variable if $\Pr[x=1]=1/2$ and $\Pr[x=-1]=1/2$; we also call it a sign random variable.

In addition to the $O(\cdot)$ notation, for two functions $f,g$, we use the shorthand $f\lesssim g$ (resp. $\gtrsim$) to indicate that $f\leq Cg$ (resp. $\geq$) for an absolute constant $C$. We use $f\eqsim g$ to mean $cf\leq g\leq Cf$ for constants $c>0$ and $C>0$. For two matrices $A\in\mathbb{R}^{n_{1}\times d_{1}}$ and $B\in\mathbb{R}^{n_{2}\times d_{2}}$, we use $A\otimes B\in\mathbb{R}^{n_{1}n_{2}\times d_{1}d_{2}}$ to denote the Kronecker product, i.e., the entry of $A\otimes B$ in the $(i_{1},i_{2})$-th row and $(j_{1},j_{2})$-th column is $A_{i_{1},j_{1}}B_{i_{2},j_{2}}$. For two vectors $x\in\mathbb{R}^{m},y\in\mathbb{R}^{n}$, we use $x\otimes y\in\mathbb{R}^{mn}$ to denote the tensor product of vectors, i.e., $x\otimes y=\operatorname{vec}(xy^{\top})$.

2.1 Oblivious subspace embedding and coordinate-wise embedding

Definition 2.1 (Oblivious subspace embedding [25]).

We define an $(\epsilon,\delta,d,n)$-oblivious subspace embedding (OSE) as follows: Suppose $\Pi$ is a distribution over $m\times n$ matrices $S$, where $m$ is a function of $n,d,\epsilon$, and $\delta$. Suppose that, for any fixed $n\times d$ orthonormal basis $U$, with probability at least $1-\delta$ a matrix $S$ drawn from the distribution $\Pi$ has the property that the singular values of $SU$ lie in the range $[1-\epsilon,1+\epsilon]$.

The oblivious coordinate-wise embedding (OCE) is implicitly used in [20, 15] and formally introduced in [31].

Definition 2.2 ($(\alpha,\beta,\delta)$-coordinate-wise embedding [31]).

We say a random matrix $S\in\mathbb{R}^{m\times n}$ drawn from a distribution $\Pi$ satisfies $(\alpha,\beta,\delta)$-coordinate-wise embedding if, for any fixed vectors $g,h\in\mathbb{R}^{n}$,

1. $\operatorname*{{\mathbb{E}}}_{S\sim\Pi}[g^{\top}S^{\top}Sh]=g^{\top}h$,
2. $\operatorname*{{\mathbb{E}}}_{S\sim\Pi}[(g^{\top}S^{\top}Sh)^{2}]\leq(g^{\top}h)^{2}+\frac{\alpha}{m}\|g\|_{2}^{2}\|h\|_{2}^{2}$,
3. $\Pr_{S\sim\Pi}[|g^{\top}S^{\top}Sh-g^{\top}h|\geq\frac{\beta}{\sqrt{m}}\|g\|_{2}\|h\|_{2}]\leq\delta$.

In this paper, we mainly use property 3 of Definition 2.2. For convenience, we redefine OCE as follows:

Definition 2.3 (OCE).

Let $\beta\geq 1$ and $\delta\in(0,0.1)$. We say a randomized matrix $S\in\mathbb{R}^{m\times n}$ satisfies $(\beta,\delta,n)$-OCE if

\[ \Pr_{S\sim\Pi}\Big[|g^{\top}S^{\top}Sh-g^{\top}h|\geq\frac{\beta}{\sqrt{m}}\|g\|_{2}\|h\|_{2}\Big]\leq\delta \]

and the distribution $\Pi$ is oblivious to any fixed vectors $g$ and $h$.
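As a sanity check, the failure probability in Definition 2.3 can be estimated by Monte Carlo simulation. The Python snippet below is a rough illustration of ours (not from the paper): it uses a random Gaussian sketch and an arbitrary assumed value of $\beta$.

import numpy as np

rng = np.random.default_rng(1)
n, m, trials = 1024, 128, 500
g, h = rng.standard_normal(n), rng.standard_normal(n)

deviations = []
for _ in range(trials):
    S = rng.standard_normal((m, n)) / np.sqrt(m)   # E[S^T S] = I_n
    deviations.append(abs((S @ g) @ (S @ h) - g @ h))

beta = 3.0                                          # assumed beta, for illustration only
threshold = beta * np.linalg.norm(g) * np.linalg.norm(h) / np.sqrt(m)
print("empirical failure probability:", np.mean(np.array(deviations) >= threshold))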

2.2 Sketching matrices

In this paper, we are concerned with a list of dense sketching matrices.

Definition 2.4 (Random Gaussian matrix, folklore).

We say $S\in\mathbb{R}^{m\times n}$ is a random Gaussian matrix if all entries are sampled independently from $\mathcal{N}(0,1/m)$.

Definition 2.5 (AMS sketch matrix, [2]).

Let $h_{1},h_{2},\cdots,h_{m}$ be $m$ random hash functions picked from a 4-wise independent hash family $\mathcal{H}=\{h:[n]\to\{-\frac{1}{\sqrt{m}},+\frac{1}{\sqrt{m}}\}\}$. Then $S\in\mathbb{R}^{m\times n}$ is an AMS sketch matrix if we set $S_{i,j}=h_{i}(j)$.

The following sketching matrices can utilize fast Fourier Transform (FFT) for efficient application to matrices.

Definition 2.6 (Subsampled randomized Hadamard transform (SRHT) [17, 26]).

The SRHT matrix $S\in\mathbb{R}^{m\times n}$ is defined as $S:=\frac{1}{\sqrt{m}}PHD$, where each row of the matrix $P\in\{0,1\}^{m\times n}$ contains exactly one $1$ at a random position, $H$ is the $n\times n$ Hadamard matrix, and $D$ is an $n\times n$ diagonal matrix whose diagonal entries are $-1$ or $+1$ with equal probability.

Remark 2.7.

Using the fast Fourier transform (FFT), $S$ can be applied to a vector in time $O(n\log n)$.
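A minimal Python sketch of this fast application is given below (our own illustration; it assumes $n$ is a power of two and uses a fast Walsh-Hadamard transform helper, fwht, that is not from the paper).

import numpy as np

def fwht(v):
    # Unnormalized fast Walsh-Hadamard transform: returns H v for the +-1 Hadamard matrix H.
    v, h = v.copy(), 1
    while h < len(v):
        for i in range(0, len(v), 2 * h):
            a, b = v[i:i + h].copy(), v[i + h:i + 2 * h].copy()
            v[i:i + h], v[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return v

def srht_apply(x, m, rng):
    # S x for S = (1/sqrt(m)) P H D, computed in O(n log n + m) without forming S.
    n = len(x)
    D = rng.choice([-1.0, 1.0], size=n)      # random diagonal signs
    rows = rng.integers(0, n, size=m)        # each row of P picks one coordinate
    return fwht(D * x)[rows] / np.sqrt(m)

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
Sx = srht_apply(x, m=128, rng=rng)
print(np.linalg.norm(Sx) ** 2 / np.linalg.norm(x) ** 2)   # concentrates around 1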

Definition 2.8 (Tensor subsampled randomized Hadamard transform (TensorSRHT) [26]).

The TensorSRHT $S:\mathbb{R}^{n}\times\mathbb{R}^{n}\to\mathbb{R}^{m}$ is defined as $S:=\frac{1}{\sqrt{m}}P\cdot(HD_{1}\otimes HD_{2})$, where each row of $P\in\{0,1\}^{m\times n^{2}}$ contains only one $1$ at a random coordinate (one can view $P$ as a sampling matrix), $H$ is an $n\times n$ Hadamard matrix, and $D_{1}$, $D_{2}$ are two independent $n\times n$ diagonal matrices whose diagonal entries are independent Rademacher random variables (uniform in $\{-1,1\}$).

Remark 2.9.

By leveraging the FFT algorithm in the sketch space, $S(x\otimes y)$ can be computed in time $O(n\log n+m)$.
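A minimal Python sketch of this computation follows (our own illustration, not code from the paper): it relies on the identity $(HD_{1}\otimes HD_{2})(x\otimes y)=(HD_{1}x)\otimes(HD_{2}y)$ and assumes $n$ is a power of two.

import numpy as np

def fwht(v):
    # Unnormalized fast Walsh-Hadamard transform: returns H v for the +-1 Hadamard matrix H.
    v, h = v.copy(), 1
    while h < len(v):
        for i in range(0, len(v), 2 * h):
            a, b = v[i:i + h].copy(), v[i + h:i + 2 * h].copy()
            v[i:i + h], v[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return v

def tensor_srht_apply(x, y, m, rng):
    # S(x tensor y) for S = (1/sqrt(m)) P (H D1 tensor H D2), in O(n log n + m) time.
    n = len(x)
    u = fwht(rng.choice([-1.0, 1.0], size=n) * x)   # H D1 x
    v = fwht(rng.choice([-1.0, 1.0], size=n) * y)   # H D2 y
    idx = rng.integers(0, n * n, size=m)            # P samples m coordinates of [n^2]
    return u[idx // n] * v[idx % n] / np.sqrt(m)

rng = np.random.default_rng(0)
x, y = rng.standard_normal(512), rng.standard_normal(512)
Sxy = tensor_srht_apply(x, y, m=256, rng=rng)
print(np.linalg.norm(Sxy) ** 2, np.linalg.norm(x) ** 2 * np.linalg.norm(y) ** 2)  # close on average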

Since storing and generating a Hadamard matrix can be expensive, we consider a cheaper and more space-efficient way to generate an FFT-friendly matrix via the circulant transform.

Definition 2.10 (Circulant matrix).

A circulant matrix is an $n\times n$ matrix, where $n\in\mathbb{N}$, whose row vectors consist of the same elements; compared to the preceding row vector, each row vector is rotated one element to the right.

Definition 2.11 (Subsampled randomized circulant transform (SRCT)).

Let $x\in\mathbb{R}^{n}$ be a random vector whose elements are i.i.d. Rademacher random variables.

Also, let $P\in\mathbb{R}^{m\times n}$ be a random matrix in which each row contains a $1$ at a uniformly distributed coordinate and zeros elsewhere.

Let $G\in\mathbb{R}^{n\times n}$ be a circulant matrix (see Definition 2.10) generated by $x$, and let $D\in\mathbb{R}^{n\times n}$ be a diagonal matrix whose diagonal elements are i.i.d. Rademacher random variables.

Then, the subsampled randomized circulant transform is defined as follows: $S:=\frac{1}{\sqrt{m}}PGD$.
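A minimal Python sketch of applying SRCT without forming $S$ explicitly (our own illustration, with the assumed convention that row $i$ of $G$ is the generating vector rotated right by $i$ positions): the product $Gv$ is a circular correlation, which costs $O(n\log n)$ via the FFT.

import numpy as np

rng = np.random.default_rng(0)
n, m = 1024, 128
g = rng.choice([-1.0, 1.0], size=n)        # Rademacher vector generating the circulant G
D = rng.choice([-1.0, 1.0], size=n)        # diagonal of random signs
rows = rng.integers(0, n, size=m)          # sampling matrix P: one coordinate per row

def circulant_matvec(g, v):
    # (G v)_i = sum_k g_k v_{(i+k) mod n} when row i of G is g rotated right by i positions;
    # this circular correlation is computed in O(n log n) with the FFT.
    return np.real(np.fft.ifft(np.conj(np.fft.fft(g)) * np.fft.fft(v)))

def srct_apply(x):
    # S x for S = (1/sqrt(m)) P G D, never forming S or G.
    return circulant_matvec(g, D * x)[rows] / np.sqrt(m)

# Sanity check against the explicit circulant matrix (row i = g rolled right by i).
G = np.stack([np.roll(g, i) for i in range(n)])
x = rng.standard_normal(n)
assert np.allclose(G @ (D * x), circulant_matvec(g, D * x))
print(np.linalg.norm(srct_apply(x)) ** 2 / np.linalg.norm(x) ** 2)   # roughly 1 on average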

Definition 2.12 (Tensor subsampled randomized circulant transform (TensorSRCT)).

Let $x\in\mathbb{R}^{n}$ be a random vector whose elements are i.i.d. Rademacher random variables.

Also, let $P\in\mathbb{R}^{m\times n^{2}}$ be a random matrix in which each row contains a $1$ at a uniformly distributed coordinate and zeros elsewhere.

Let $G\in\mathbb{R}^{n\times n}$ be a circulant matrix (see Definition 2.10) generated by $x$.

Let $D_{1}\in\mathbb{R}^{n\times n}$ and $D_{2}\in\mathbb{R}^{n\times n}$ be two independent diagonal matrices whose diagonal elements are i.i.d. Rademacher random variables.

Then, the tensor subsampled randomized circulant transform $T:\mathbb{R}^{n}\times\mathbb{R}^{n}\rightarrow\mathbb{R}^{m}$ is defined as follows: $T:=\frac{1}{\sqrt{m}}P\cdot(GD_{1}\otimes GD_{2})$.

Remark 2.13.

Similar to SRHT, we can utilize the fast Fourier transform with a circulant matrix: SRCT can be applied to a vector of length $n$ in $O(n\log n)$ time, and TensorSRCT can be applied to $x\otimes y$ in $O(n\log n+m)$ time.

2.3 OSE property of dense sketches

An important condition for sketch-and-solve regressions is OSE. We focus particularly on SRHT, SRCT, and their tensor variants.

Lemma 2.14 (Lemma 2.11 in [26]).

Let $S$ be an SRHT matrix as defined in Definition 2.6. If $m=O(\epsilon^{-2}d\log^{2}(nd/\delta))$, then $S$ is an $(\epsilon,\delta,d,n)$-OSE.

Lemma 2.15 (Lemma 2.12 in [26]).

Let $S$ be a TensorSRHT matrix as defined in Definition 2.8. If $m=O(\epsilon^{-2}d\log^{3}(nd/\epsilon\delta))$, then $S$ is an $(\epsilon,\delta,d,n^{2})$-OSE for degree-2 tensors.

SRCT requires a larger row count than SRHT because the Gram matrix $G^{\top}G$ is only $I_{n}$ in expectation.

Lemma 2.16 (Informal version of Corollary C.7).

Let $S$ be an SRCT matrix as defined in Definition 2.11. If $m=O(\epsilon^{-2}d^{2}\log^{2}(nd/\delta))$, then $S$ is an $(\epsilon,\delta,d,n)$-OSE.

Lemma 2.17 (Informal version of Corollary C.8).

Let $S$ be a TensorSRCT matrix as defined in Definition 2.12. If $m=O(\epsilon^{-2}d^{2}\log^{3}(nd/\delta))$, then $S$ is an $(\epsilon,\delta,d,n^{2})$-OSE.

2.4 Probability tools

Lemma 2.18 (Khintchine’s inequality [16]).

Let $\sigma_{1},\cdots,\sigma_{n}$ be i.i.d. Rademacher random variables and $z_{1},\cdots,z_{n}$ be real numbers. Then, there exist constants $C,C^{\prime}>0$ such that

\[ \Pr\Big[\Big|\sum_{i=1}^{n}z_{i}\sigma_{i}\Big|\geq Ct\|z\|_{2}\Big]\leq\exp(-C^{\prime}t^{2}). \]
Lemma 2.19 (Hoeffding bound, [13]).

Let $Z_{1},\cdots,Z_{n}$ be independent, zero-mean random variables with $Z_{i}\in[\alpha_{i},\beta_{i}]$. Then,

\[ \Pr\Big[\Big|\sum_{i=1}^{n}Z_{i}\Big|>t\Big]\leq 2\exp\Big(-\frac{t^{2}}{2\sum_{i=1}^{n}(\beta_{i}-\alpha_{i})^{2}}\Big). \]
Lemma 2.20 (Lemma 1 on page 1325 of Laurent and Massart [18] ).

Let $X\sim\mathcal{X}_{k}^{2}$ be a chi-squared distributed random variable with $k$ degrees of freedom, i.e., the sum of the squares of $k$ independent Gaussians, each with zero mean and variance $\sigma^{2}$. Then,

\[ \Pr[X-k\sigma^{2}\geq(2\sqrt{kt}+2t)\sigma^{2}]\leq\exp(-t), \qquad \Pr[k\sigma^{2}-X\geq 2\sqrt{kt}\sigma^{2}]\leq\exp(-t). \]
Lemma 2.21 (Hanson-Wright inequality [14]).

Let $x\in\mathbb{R}^{n}$ denote a random vector with independent entries $x_{i}$ satisfying $\operatorname*{{\mathbb{E}}}[x_{i}]=0$ and $|x_{i}|\leq K$. Let $A$ be an $n\times n$ matrix. Then, for every $t\geq 0$,

\[ \Pr[|x^{\top}Ax-\operatorname*{{\mathbb{E}}}[x^{\top}Ax]|>t]\leq 2\cdot\exp\big(-c\min\{t^{2}/(K^{4}\|A\|_{F}^{2}),\,t/(K^{2}\|A\|)\}\big). \]
Lemma 2.22 (Matrix Chernoff bound, Theorem 2.2 in [33]).

Let $\mathcal{X}$ be a finite set of positive semidefinite matrices of dimension $d\times d$. Suppose that $\max_{X\in\mathcal{X}}\lambda_{\max}(X)\leq B$. Sample $\{X_{1},\cdots,X_{n}\}$ uniformly at random from $\mathcal{X}$ without replacement. We define $\mu_{\min}:=n\cdot\lambda_{\min}(\operatorname*{{\mathbb{E}}}_{X\sim\mathcal{X}}[X])$ and $\mu_{\max}:=n\cdot\lambda_{\max}(\operatorname*{{\mathbb{E}}}_{X\sim\mathcal{X}}[X])$. Then,

\[ \Pr\Big[\lambda_{\min}\Big(\sum_{i=1}^{n}X_{i}\Big)\leq(1-\delta)\mu_{\min}\Big]\leq d\cdot\exp(-\delta^{2}\mu_{\min}/B) \]

for $\delta\in[0,1)$, and

\[ \Pr\Big[\lambda_{\max}\Big(\sum_{i=1}^{n}X_{i}\Big)\geq(1+\delta)\mu_{\max}\Big]\leq d\cdot\exp(-\delta^{2}\mu_{\max}/(4B)) \]

for $\delta\geq 0$.

3 $\ell_{\infty}$ guarantee via OCE

In this section, we present a sufficient condition for a sketching matrix to give a good $\ell_{\infty}$ guarantee: given a pair of fixed vectors $g,h$ with $g^{\top}h=0$, if the sketching matrix approximately preserves their inner product with high probability, then it gives a good $\ell_{\infty}$ guarantee for regression.

Lemma 3.1 (Core lemma).

Let $A\in\mathbb{R}^{n\times d}$ be a fixed matrix. Let $U\in\mathbb{R}^{n\times d}$ denote an orthonormal basis of $A$. Let $S\in\mathbb{R}^{m\times n}$ be a sketching matrix that satisfies two properties:

  • $S$ is a $(0.1,\delta,d,n)$-$\mathsf{OSE}$ (with $\delta\in(0,0.1)$, Definition 2.1).

  • $S$ is a $(\beta,\delta,n)$-$\mathsf{OCE}$ (with $\beta\geq 1$ and $\delta\in(0,0.1)$, Definition 2.3).

For any fixed vectors $a\in\mathbb{R}^{d}$ and $b\in\mathbb{R}^{n}$ with $U^{\top}b=0$,

\[ |a^{\top}(SA)^{\dagger}Sb|\lesssim\frac{\beta}{\sqrt{m}}\cdot\|a\|_{2}\cdot\|b\|_{2}\cdot\|\Sigma^{-1}\| \]

holds with probability at least $1-\delta$.

Proof.

With probability 1, the matrix $SA\in\mathbb{R}^{m\times d}$ has linearly independent columns.

Therefore, $(SA)^{\dagger}\in\mathbb{R}^{d\times m}$ is

\begin{align*} (SA)^{\dagger} &= (A^{\top}S^{\top}SA)^{-1}A^{\top}S^{\top} \\ &= (V\Sigma U^{\top}S^{\top}SU\Sigma V^{\top})^{-1}V\Sigma U^{\top}S^{\top} \\ &= (V^{\top})^{-1}\Sigma^{-1}(U^{\top}S^{\top}SU)^{-1}\Sigma^{-1}V^{-1}V\Sigma U^{\top}S^{\top} \\ &= V\Sigma^{-1}(U^{\top}S^{\top}SU)^{-1}U^{\top}S^{\top}, \end{align*}

where the first step follows from the fact that $SA\in\mathbb{R}^{m\times d}$ has full column rank, the second step follows from the SVD of $A\in\mathbb{R}^{n\times d}$, the third step follows from $(AB)^{-1}=B^{-1}A^{-1}$, and the last step follows from the fact that $V$ is orthogonal by the property of the SVD.

For convenience, we define $x$ as follows:

\[ x:=a^{\top}V\Sigma^{-1}(U^{\top}S^{\top}SU)^{-1}U^{\top}S^{\top}Sb. \]

In the next few paragraphs, we explain how to upper bound $|x|$ with high probability.

Since $S$ is a $(0.1,\delta,d,n)$-OSE (Definition 2.1), we know

\[ \Pr[\|I-U^{\top}S^{\top}SU\|\leq 0.1]\geq 1-\delta. \]

We condition on this event. It follows that

\begin{align*} \|V\Sigma^{-1}(U^{\top}S^{\top}SU)^{-1}U^{\top}\| &= \|\Sigma^{-1}(U^{\top}S^{\top}SU)^{-1}U^{\top}\| \\ &\leq \|\Sigma^{-1}\|\|(U^{\top}S^{\top}SU)^{-1}\|\|U^{\top}\| \\ &\leq \|\Sigma^{-1}\|\cdot\frac{1}{1-0.1}\cdot 1 \\ &= O(\|\Sigma^{-1}\|), \end{align*}

where the first step follows from the fact that $V$ is a rotation, the second step follows from sub-multiplicativity, and the third step follows from $\|I-U^{\top}S^{\top}SU\|\leq 0.1$ and the fact that $U$ is a rotation.

Hence, we have

\[ \Pr\big[\|a^{\top}V\Sigma^{-1}(U^{\top}S^{\top}SU)^{-1}U^{\top}\|_{2}=O(\|\Sigma^{-1}\|\cdot\|a\|_{2})\big]\geq 1-\delta. \quad (3) \]

Let us define a vector $u\in\mathbb{R}^{n}$:

\[ u:=U(U^{\top}S^{\top}SU)^{-1}\Sigma^{-1}V^{\top}a. \]

By the definition of OCE (Definition 2.3), we have that

\[ \Pr\Big[|u^{\top}S^{\top}Sb-u^{\top}b|\leq\frac{\beta}{\sqrt{m}}\cdot\|u\|_{2}\|b\|_{2}\Big]\geq 1-\delta, \]

where $U^{\top}b=0$ gives us $u^{\top}b=0$ and $u^{\top}S^{\top}Sb=x$.

Thus, the above bound translates to

\[ \Pr\Big[|x|\leq C\cdot\frac{\beta}{\sqrt{m}}\cdot\|\Sigma^{-1}\|\|a\|_{2}\|b\|_{2}\Big]\geq 1-\delta \quad (4) \]

as desired. ∎

We are now ready to prove the $\ell_{\infty}$ guarantee given the inner product bound of Lemma 3.1.

Theorem 3.2.

Suppose $A\in\mathbb{R}^{n\times d}$ has full column rank and $b\in\mathbb{R}^{n}$. Let $S\in\mathbb{R}^{m\times n}$ be a sketching matrix satisfying the conditions in Lemma 3.1. For any fixed vector $a\in\mathbb{R}^{d}$, we have

\[ |\langle a,x^{*}\rangle-\langle a,x^{\prime}\rangle|\lesssim\frac{\epsilon}{\sqrt{d}}\cdot\|a\|_{2}\cdot\|Ax^{*}-b\|_{2}\cdot\|A^{\dagger}\| \]

with probability at least $1-\delta$, where $x^{*}=\arg\min_{x\in\mathbb{R}^{d}}\|Ax-b\|_{2}$ and $x^{\prime}=\arg\min_{x\in\mathbb{R}^{d}}\|SAx-Sb\|_{2}$.

Proof.

Since $A$ has full column rank, we have $x^{*}=A^{\dagger}b$. Similarly, $SA$ has full column rank with probability 1, so $x^{\prime}=(SA)^{\dagger}Sb$ and $(SA)^{\dagger}SA=I$. Thus, we have

\begin{align*} |\langle a,x^{*}\rangle-\langle a,x^{\prime}\rangle| &= |\langle a,x^{*}-(SA)^{\dagger}Sb\rangle| \\ &= |\langle a,(SA)^{\dagger}S(Ax^{*}-b)\rangle| \\ &= |\langle a,(SA)^{\dagger}S(AA^{\dagger}b-b)\rangle| \\ &= |\langle((SA)^{\dagger}S)^{\top}a,(I-UU^{\top})b\rangle|, \quad (5) \end{align*}

where $U\in\mathbb{R}^{n\times d}$ is an orthonormal basis for $A$. It is well-known that $I-UU^{\top}=U_{\perp}U_{\perp}^{\top}$, where $U_{\perp}\in\mathbb{R}^{n\times(n-d)}$ is an orthonormal basis for the orthogonal complement of ${\rm span}(A)$. To maximize the above expression, we may take $b\in{\rm span}(U_{\perp})$, or equivalently, $U^{\top}b=0$. Thus, bounding Eq. (5) amounts to showing

\[ |a^{\top}(SA)^{\dagger}Sb|\lesssim\frac{\beta}{\sqrt{m}}\cdot\|a\|_{2}\cdot\|b\|_{2}\cdot\|A^{\dagger}\|, \]

which holds with probability at least $1-2\delta$ by Lemma 3.1. Finally, note that since $U^{\top}b=0$, we have $Ax^{*}=0$, and we have proved

\[ |\langle a,x^{*}\rangle-\langle a,x^{\prime}\rangle|\lesssim\frac{\epsilon}{\sqrt{d}}\cdot\|a\|_{2}\cdot\|Ax^{*}-b\|_{2}\cdot\|A^{\dagger}\|. \]

Note that we only require the $\mathsf{OSE}$ property with $\epsilon=O(1)$; the $\epsilon$ dependence follows from the row count of the OCE. ∎

3.1 High probability bound for OCE

In this section, we provide a unified framework for proving the high-probability bound of OCE. Our analysis covers the dense sketching matrices above, which can all be generated by first picking a set of fresh random signs and then forming the sketching matrix according to the distribution.

We state the key assumptions on dense sketching matrices that are sufficient for the OCE property.

Assumption 3.3.

Let $S\in\mathbb{R}^{m\times n}$ be a dense sketching matrix satisfying the following two assumptions:

  • Pairwise inner product bound:

    \[ \Pr\Big[\max_{i\neq j}|\langle S_{*,i},S_{*,j}\rangle|\leq\frac{\sqrt{\log(n/\delta)}}{\sqrt{m}}\Big]\geq 1-\delta. \]

  • Column norm bound:

    \[ \Pr\Big[|\|S_{*,i}\|_{2}^{2}-1|\leq\frac{\sqrt{\log(n/\delta)}}{\sqrt{m}}\Big]\geq 1-\delta, \]

    for all $i\in[n]$.
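For intuition, Assumption 3.3 can be checked numerically on a small SRHT instance. The snippet below (our own illustration, with arbitrary parameters) forms $S$ explicitly, which is only sensible for such a sanity check.

import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(0)
n, m, delta = 256, 64, 0.01
H = hadamard(n).astype(float)                  # +-1 Hadamard matrix
D = np.diag(rng.choice([-1.0, 1.0], size=n))
P = np.zeros((m, n))
P[np.arange(m), rng.integers(0, n, size=m)] = 1.0
S = P @ H @ D / np.sqrt(m)                     # explicit SRHT, for checking only

gram = S.T @ S
off_diag = np.abs(gram - np.diag(np.diag(gram))).max()   # max pairwise column inner product
col_norm_err = np.abs(np.diag(gram) - 1).max()           # should be 0 for SRHT
print(off_diag, col_norm_err, np.sqrt(np.log(n / delta) / m))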

Lemma 3.4.

Let $S\in\mathbb{R}^{m\times n}$ be a dense sketching matrix that meets Assumption 3.3. Let $h\in\mathbb{R}^{n}$ and $g\in\mathbb{R}^{n}$ be two fixed vectors. Then,

\[ |(g^{\top}S^{\top}Sh)-(g^{\top}h)|\leq\frac{\log^{1.5}(n/\delta)}{\sqrt{m}}\|g\|_{2}\|h\|_{2} \]

holds with probability at least $1-\delta$.

Proof.

We can rewrite $(g^{\top}S^{\top}Sh)-(g^{\top}h)$ as follows:

\begin{align*} (g^{\top}S^{\top}Sh)-(g^{\top}h) &= \sum_{i=1}^{n}\sum_{j\in[n]\setminus i}g_{i}h_{j}\langle S_{*,i},S_{*,j}\rangle+\sum_{i=1}^{n}g_{i}h_{i}(\|S_{*,i}\|_{2}^{2}-1) \\ &= \underbrace{\sum_{i=1}^{n}\sum_{j\in[n]\setminus i}g_{i}h_{j}\langle\sigma_{i}\overline{S}_{*,i},\sigma_{j}\overline{S}_{*,j}\rangle}_{\texttt{off-diag}}+\underbrace{\sum_{i=1}^{n}g_{i}h_{i}(\|S_{*,i}\|_{2}^{2}-1)}_{\texttt{diag}}, \end{align*}

where the first step follows from separating the off-diagonal and diagonal terms, and the second step follows from the fact that the $\sigma_{i}$'s are independent Rademacher random variables with $S_{*,i}=\sigma_{i}\overline{S}_{*,i}$ for all $i\in[n]$.

We will focus on bounding the quantity off-diag, as diag can be handled in a rather simple fashion.

We define matrices $A\in\mathbb{R}^{n\times n}$ and $B\in\mathbb{R}^{n\times n}$ as follows:

\begin{align*} A_{i,j} &:= g_{i}h_{j}\cdot\langle\overline{S}_{*,i},\overline{S}_{*,j}\rangle, & \forall i\in[n],j\in[n], \\ B_{i,j} &:= g_{i}h_{j}\cdot\max_{i^{\prime}\not=j^{\prime}}|\langle\overline{S}_{*,i^{\prime}},\overline{S}_{*,j^{\prime}}\rangle|, & \forall i\in[n],j\in[n]. \end{align*}

We define $A^{\circ}\in\mathbb{R}^{n\times n}$ to be the matrix $A\in\mathbb{R}^{n\times n}$ with its diagonal entries removed.

By applying the Hanson-Wright inequality (Lemma 2.21), we have

\[ \Pr[|\sigma^{\top}A^{\circ}\sigma|>\tau]\leq 2\cdot\exp\big(-c\cdot\min\{\tau^{2}/\|A^{\circ}\|_{F}^{2},\,\tau/\|A^{\circ}\|\}\big). \]

We can upper bound $\|A^{\circ}\|$ and $\|A^{\circ}\|_{F}$:

\begin{align*} \|A^{\circ}\| &\leq \|A^{\circ}\|_{F} \\ &\leq \|A\|_{F} \\ &\leq \|B\|_{F} \\ &\leq \|g\|_{2}\cdot\|h\|_{2}\cdot\max_{i\not=j}|\langle\overline{S}_{*,i},\overline{S}_{*,j}\rangle|, \end{align*}

where the first step follows from $\|\cdot\|\leq\|\cdot\|_{F}$, the second step follows from the definition of $A^{\circ}$, the third step follows from the definitions of $A$ and $B$, and the fourth step follows from the fact that $B$ has rank 1, since $B=\max_{i\neq j}|\langle\overline{S}_{*,i},\overline{S}_{*,j}\rangle|\cdot gh^{\top}$.

It remains to obtain a bound on $\max_{i\neq j}|\langle\overline{S}_{*,i},\overline{S}_{*,j}\rangle|$. Note that for any columns $i,j$,

\[ |\langle\overline{S}_{*,i},\overline{S}_{*,j}\rangle| = |\langle\sigma_{i}\overline{S}_{*,i},\sigma_{j}\overline{S}_{*,j}\rangle| = |\langle S_{*,i},S_{*,j}\rangle|, \]

where the first step follows from the fact that random signs do not change the magnitude of the inner product, and the second step follows from the definitions of $S_{*,i}$ and $S_{*,j}$.

Since $S$ meets Assumption 3.3, we have that with probability at least $1-\delta$,

\[ \max_{i\neq j}|\langle\overline{S}_{*,i},\overline{S}_{*,j}\rangle|\leq\frac{\sqrt{\log(n/\delta)}}{\sqrt{m}}. \]

Conditioning on the above event, we have

\[ \Pr[|\texttt{off-diag}|>\tau]\leq 2\cdot\exp\Big(-c\cdot\frac{\tau}{\|g\|_{2}\cdot\|h\|_{2}\cdot\frac{\sqrt{\log(n/\delta)}}{\sqrt{m}}}\Big). \]

Choosing $\tau=\|g\|_{2}\cdot\|h\|_{2}\cdot\log^{1.5}(n/\delta)/\sqrt{m}$, we can show that

\[ \Pr\Big[|\texttt{off-diag}|\geq\|g\|_{2}\cdot\|h\|_{2}\cdot\frac{\log^{1.5}(n/\delta)}{\sqrt{m}}\Big]\leq\Theta(\delta). \]

To bound the term $\texttt{diag}$, note that by Assumption 3.3, with probability at least $1-\delta$ we have $|\|S_{*,i}\|_{2}^{2}-1|\leq\frac{\sqrt{\log(n/\delta)}}{\sqrt{m}}$.

Conditioning on this event, we have

\begin{align*} |\texttt{diag}| &\leq \max_{i\in[n]}|\|S_{*,i}\|_{2}^{2}-1|\cdot\sum_{i=1}^{n}|g_{i}h_{i}| \\ &\leq \frac{\sqrt{\log(n/\delta)}}{\sqrt{m}}\cdot\|g\|_{2}\cdot\|h\|_{2}, \end{align*}

where the last step is by Cauchy-Schwarz. Note that the bound on $|\texttt{diag}|$ is subsumed by that on $|\texttt{off-diag}|$.

Union bounding over all events, we have

\[ \Pr\Big[|g^{\top}S^{\top}Sh-g^{\top}h|\geq\frac{\log^{1.5}(n/\delta)}{\sqrt{m}}\cdot\|g\|_{2}\cdot\|h\|_{2}\Big]\leq\Theta(\delta). \qquad\qed \]

3.2 Inner product bound for SRHT and SRCT

We will show that SRHT and SRCT satisfy Assumption 3.3. Before proving the pairwise inner product bound, we state a general property to characterize these sketching matrices. This key property will be used in the later proof.

Definition 3.5 (Sign structure).

For any sketching matrix $S$, we say it has the “sign structure” if the following properties hold:

  • $S_{k,i}\in\{\pm\frac{1}{\sqrt{m}}\}$ for all $k\in[m],i\in[n]$.

  • $S_{k,i}$ and $S_{k,j}$ are independent for any $i\neq j$.

  • $\operatorname*{{\mathbb{E}}}[S_{k,i}]=0$ for all $k\in[m]$ and $i\in[n]$.

Lemma 3.6.

Both SRHT and SRCT satisfy Definition 3.5.

Proof.

It follows directly from the definitions of the two sketching matrices. ∎

Lemma 3.7 (SRHT and SRCT).

Let $S\in\mathbb{R}^{m\times n}$ be any sketching matrix that satisfies Definition 3.5. Then, we have

\[ \Pr\Big[\max_{i\neq j}|\langle S_{*,i},S_{*,j}\rangle|\geq\frac{\sqrt{\log(n/\delta)}}{\sqrt{m}}\Big]\leq\Theta(\delta). \]
Proof.

Fix a pair of indices $i\neq j$ and define $X\in\mathbb{R}^{m\times 2}$ as follows:

\[ X:=\begin{bmatrix}S_{*,i}&S_{*,j}\end{bmatrix}. \]

The Gram matrix is $X^{\top}X=\sum_{k=1}^{m}G_{k}$, where

\begin{align*} G_{k} &= \begin{bmatrix}S_{k,i}&S_{k,j}\end{bmatrix}^{\top}\begin{bmatrix}S_{k,i}&S_{k,j}\end{bmatrix} = \begin{bmatrix}S_{k,i}\\ S_{k,j}\end{bmatrix}\begin{bmatrix}S_{k,i}&S_{k,j}\end{bmatrix} \\ &= \begin{bmatrix}S_{k,i}^{2}&S_{k,i}S_{k,j}\\ S_{k,i}S_{k,j}&S_{k,j}^{2}\end{bmatrix} = \begin{bmatrix}\frac{1}{m}&S_{k,i}S_{k,j}\\ S_{k,i}S_{k,j}&\frac{1}{m}\end{bmatrix}, \end{align*}

where the first step follows from the definition of $G_{k}$, the second step rewrites $\begin{bmatrix}S_{k,i}&S_{k,j}\end{bmatrix}^{\top}$ as a column vector, the third step follows from the definition of matrix multiplication, and the last step follows from $S_{k,i}^{2}=S_{k,j}^{2}=1/m$.

Note that $G_{k}$ has eigenvalues $0$ and $\frac{2}{m}$, i.e.,

\[ \lambda_{1}(G_{k})=2/m,\qquad\lambda_{2}(G_{k})=0. \]

Since $S_{k,i}$ and $S_{k,j}$ are independent with mean zero (Definition 3.5), we have

\[ \operatorname*{{\mathbb{E}}}[S_{k,i}S_{k,j}]=\operatorname*{{\mathbb{E}}}[S_{k,i}]\cdot\operatorname*{{\mathbb{E}}}[S_{k,j}]=0. \]

Thus, we know

\[ \operatorname*{{\mathbb{E}}}[G_{k}]=\begin{bmatrix}1/m&0\\ 0&1/m\end{bmatrix}. \quad (6) \]

Consequently, we have

\begin{align*} \operatorname*{{\mathbb{E}}}[X^{\top}X] &= \operatorname*{{\mathbb{E}}}\Big[\sum_{k=1}^{m}G_{k}\Big]=m\cdot\operatorname*{{\mathbb{E}}}[G_{k}] \\ &= m\cdot\begin{bmatrix}1/m&0\\ 0&1/m\end{bmatrix} = \begin{bmatrix}1&0\\ 0&1\end{bmatrix}, \end{align*}

where the first step follows from the definition of $X^{\top}X$, the second step follows from linearity of expectation and the fact that the $G_{k}$ are identically distributed, the third step follows from Eq. (6), and the last step follows from simple algebra.

Let $\lambda_{i}(X^{\top}X)$ be the $i$-th eigenvalue of $X^{\top}X\in\mathbb{R}^{2\times 2}$. By the matrix Chernoff bound (Lemma 2.22 with $B=2/m$), for any $t>0$, we have

\[ \Pr[\exists i\in[2]:|\lambda_{i}(X^{\top}X)-1|\geq t]\leq 4\exp(-t^{2}m/2). \]

This means that with probability at least $1-4\exp(-t^{2}m/2)$, the eigenvalues of $X^{\top}X$ lie in $[1-t,1+t]$ and, consequently, the eigenvalues of $X^{\top}X-I_{2}$ lie in $[-t,t]$. Choosing $t=O(\frac{\sqrt{\log(n/\delta)}}{\sqrt{m}})$, we have

\[ \Pr\Big[\|X^{\top}X-I_{2}\|\geq C\cdot\frac{\sqrt{\log(n/\delta)}}{\sqrt{m}}\Big]\leq\frac{\delta}{n^{2}}. \]

The proof can be wrapped up by noting that

\[ X^{\top}X-I_{2}=\begin{bmatrix}0&\langle S_{*,i},S_{*,j}\rangle\\ \langle S_{*,i},S_{*,j}\rangle&0\end{bmatrix}, \]

whose spectral norm is $|\langle S_{*,i},S_{*,j}\rangle|$. Taking a union bound over all $n^{2}$ pairs of columns, we have

\[ \Pr\Big[\max_{i\neq j}|\langle S_{*,i},S_{*,j}\rangle|\geq C\cdot\frac{\sqrt{\log(n/\delta)}}{\sqrt{m}}\Big]\leq\delta. \qquad\qed \]

3.3 Column norm bound for SRHT and SRCT

In this section, we prove the column norm bound for SRHT and SRCT; in particular, their columns are unit vectors. In Appendix D, we prove that for a random Gaussian matrix, the squared column norm is a $\chi^{2}_{m}$ random variable that concentrates around 1 with high probability.

Lemma 3.8 (SRHT and SRCT).

Let $S\in\mathbb{R}^{m\times n}$ be an SRHT matrix or an SRCT matrix.

Then, for any $i\in[n]$, we have $\|S_{*,i}\|_{2}^{2}=1$.

Proof.

The proof directly follows from the definition.

For SRHT, recall $S=\frac{1}{\sqrt{m}}PHD$: each column of $H$ has entries in $\{\pm 1\}$ and norm $\sqrt{n}$, and $D$ applies random signs that do not change the norm. The matrix $P$ subsamples $m$ entries of each column, and the factor $\frac{1}{\sqrt{m}}$ rescales each entry, so the squared column norm is exactly 1.

For SRCT, each column of $G$ also has norm $\sqrt{n}$, since its entries are $\pm 1$. Thus, by the same argument, the columns of SRCT are unit vectors. ∎

4 Put things together

Now we are ready to present the proof of Theorem 1.1.

Proof of Theorem 1.1.

Using Lemma 2.14 (which shows that SRHT gives an OSE), we know that if $m\geq d\log^{2}(n/\delta)$, then $S$ is an $(O(1),\delta,d,n)$-OSE.

Using Lemma 3.7 together with Lemma 3.4 (which show that SRHT gives an OCE), we know $\beta=O(\log^{1.5}(n/\delta))$.

Using Lemma 3.1 (which shows that OSE + OCE imply our result), we need to choose

\[ m\geq\epsilon^{-2}d\beta^{2}\geq\epsilon^{-2}d\log^{3}(n/\delta). \]

Combining the above bounds, it suffices to take

\[ m\geq d\log^{2}(n/\delta)+\epsilon^{-2}d\log^{3}(n/\delta)\geq\epsilon^{-2}d\log^{3}(n/\delta). \qquad\qed \]

5 Conclusion

In this paper, we study sketching-based regression algorithms with an $\ell_{\infty}$ guarantee. We show that SRHT with $m=\epsilon^{-2}d\log^{3}(n/\delta)$ rows provides the desired $\ell_{\infty}$ guarantee, improving upon the $\epsilon^{-2}d^{1+\gamma}$ rows, for $\gamma=\sqrt{\frac{\log\log n}{\log d}}$, of [22]. This is nearly optimal up to logarithmic factors. Our proof adapts the oblivious coordinate-wise embedding property introduced in [31] in a novel way. We also greatly extend the reach of the $\ell_{\infty}$ guarantee to degree-2 Kronecker product regression via the TensorSRHT matrix.

In addition, we introduce the SRCT and TensorSRCT matrices. These matrices can be applied in a fashion similar to SRHT, and they exhibit similar OCE behavior to SRHT.

Our result provides an elegant way to integrate fast, sketching-based regression solvers into optimization procedures, in particular second-order methods. The regression problem in each iteration can be solved in time nearly linear in the input size, and the $\ell_{\infty}$ guarantee comes in handy when analyzing convergence with approximate steps. It also gives an improved generalization bound for approximate regression via SRHT [22].

Appendix

Roadmap.

In Section A, we introduce the fundamental definitions and properties that we will use in the appendix. In Section B, we analyze and develop the $\ell_{\infty}$ guarantee for Kronecker product regression. In Section C, we introduce the strong JL moment property and prove that both the circulant transform and the tensor circulant transform satisfy it. In Section D, we study AMS, random Gaussian, and SRHT matrices and show that the inner product between any pair of distinct columns of these dense sketching matrices is bounded.

Appendix A Tools for matrices and probability

For matrices $A_{1}\in\mathbb{R}^{n_{1}\times d_{1}}$ and $A_{2}\in\mathbb{R}^{n_{2}\times d_{2}}$, we use $A_{1}\otimes A_{2}\in\mathbb{R}^{n_{1}n_{2}\times d_{1}d_{2}}$ to denote the matrix whose $((i_{1}-1)\cdot n_{2}+i_{2},\,(j_{1}-1)\cdot d_{2}+j_{2})$-th entry is $(A_{1})_{i_{1},j_{1}}\cdot(A_{2})_{i_{2},j_{2}}$.

Lemma A.1 (Markov’s inequality).

If $X$ is a non-negative random variable and $a>0$, then

\[ \Pr[X\geq a]\leq\operatorname*{{\mathbb{E}}}[X]/a. \]
Definition A.2 (Sub-exponential distribution ([12])).

We say $X\in{\sf SubExp}(\sigma^{2},\alpha)$ with parameters $\sigma>0$, $\alpha>0$ if

\[ \operatorname*{{\mathbb{E}}}[e^{\lambda X}]\leq\exp(\lambda^{2}\sigma^{2}/2),\qquad\forall|\lambda|<1/\alpha. \]
Lemma A.3 (Tail bound for sub-exponential distribution ([12])).

Let $X\in{\sf SubExp}(\sigma^{2},\alpha)$ with $\operatorname*{{\mathbb{E}}}[X]=\mu$. Then,

\[ \Pr[|X-\mu|\geq t]\leq\exp(-0.5\min\{t^{2}/\sigma^{2},\,t/\alpha\}). \]
Claim A.4.

For all matrices $A\in\mathbb{R}^{n_{1}\times n_{2}},B\in\mathbb{R}^{n_{2}\times n_{3}},C\in\mathbb{R}^{d_{1}\times d_{2}},D\in\mathbb{R}^{d_{2}\times d_{3}}$,

\[ (A\cdot B)\otimes(C\cdot D)=(A\otimes C)\cdot(B\otimes D). \]
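A quick numerical check of the mixed-product property in Claim A.4 (our own illustration, with arbitrary dimensions):

import numpy as np

rng = np.random.default_rng(0)
A, B = rng.standard_normal((3, 4)), rng.standard_normal((4, 2))
C, D = rng.standard_normal((5, 6)), rng.standard_normal((6, 3))
# Mixed-product property: (A B) kron (C D) == (A kron C)(B kron D).
print(np.allclose(np.kron(A @ B, C @ D), np.kron(A, C) @ np.kron(B, D)))   # True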

Appendix B Kronecker product regression with $\ell_{\infty}$ guarantee

In this section, we study the $\ell_{\infty}$ guarantee for Kronecker product regression. Given two matrices $A_{1},A_{2}\in\mathbb{R}^{n\times d}$ and a label vector $b\in\mathbb{R}^{n^{2}}$, the goal is to solve the regression $\arg\min_{x\in\mathbb{R}^{d^{2}}}\|(A_{1}\otimes A_{2})x-b\|_{2}^{2}$. This problem can easily be generalized to products of $q$ matrices, and fast, input-sparsity time algorithms have been studied in a line of works [10, 28, 9, 11, 24].

B.1 Main result

Theorem B.1 (Tensor version of Theorem 1.1).

Suppose $n\leq\exp(d)$, and matrix $A\in\mathbb{R}^{n^{2}\times d^{2}}$ and vector $b\in\mathbb{R}^{n^{2}}$ are given, where $A=A_{1}\otimes A_{2}$ for matrices $A_{1},A_{2}\in\mathbb{R}^{n\times d}$ and $b=b_{1}\otimes b_{2}$ for vectors $b_{1},b_{2}\in\mathbb{R}^{n}$. Let $S\in\mathbb{R}^{m\times n^{2}}$ be a

  • tensor subsampled randomized Hadamard transform (TensorSRHT) matrix with $m=\Theta(\epsilon^{-2}d^{2}\log^{3}(n/\delta))$ rows, or

  • tensor subsampled randomized circulant transform (TensorSRCT) matrix with $m=\Theta(\epsilon^{-2}d^{4}\log^{3}(n/\delta))$ rows.

For

\[ x^{\prime}=\arg\min_{x\in\mathbb{R}^{d^{2}}}\|SAx-Sb\|_{2} \quad\text{and}\quad x^{*}=\arg\min_{x\in\mathbb{R}^{d^{2}}}\|Ax-b\|_{2}, \]

and any fixed $a\in\mathbb{R}^{d^{2}}$,

\[ |\langle a,x^{*}\rangle-\langle a,x^{\prime}\rangle|\leq\frac{\epsilon}{d}\cdot\|a\|_{2}\cdot\|Ax^{*}-b\|_{2}\cdot\|A^{\dagger}\| \]

with probability $1-1/\operatorname{poly}(d)$.

Proof.

Recall that we require the sketch to be an $(O(1),\delta,d,n)$-OSE and an OCE with $\beta=O(\log^{1.5}(n/\delta))$ in order to give the $\ell_{\infty}$ guarantee.

For OCE, it follows from Lemma B.3.

The OSE property of TensorSRHT follows from Lemma 2.15, and that of TensorSRCT follows from Corollary C.8. ∎

Remark B.2.

The slightly different guarantee follows from the fact that the small dimension becomes $d^{2}$ instead of $d$. Let us discuss the utility of using these sketching matrices for solving the regression. As discussed in Definitions 2.8 and 2.12, the sketch of each column of $A_{1}\otimes A_{2}$ can be computed in $O(n\log n+m)$ time instead of $n^{2}$, so the total running time of applying $S$ to $A$ is $O(nd^{2}\log n+\operatorname{poly}(d))$. Similarly, $Sb$ can be computed in time $O(n\log n+\operatorname{poly}(d))$. The regression can then be solved in $\widetilde{O}(nd^{2}+\operatorname{poly}(d))$ time. Prior works mainly focus on input-sparsity sketches [10], importance sampling [9], iterative methods [11], or more complicated sketches that scale well to $q$-fold products and in the dynamic setting [24]. To the best of our knowledge, this is the first $\ell_{\infty}$ guarantee for Kronecker product regression (with two matrices).
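A toy numerical illustration of the setup in Theorem B.1 follows (our own sketch, with a dense Gaussian sketch standing in for TensorSRHT/TensorSRCT as a simplifying assumption, and tiny dimensions so the exact solve is feasible).

import numpy as np

rng = np.random.default_rng(0)
n, d, m = 64, 4, 512
A1, A2 = rng.standard_normal((n, d)), rng.standard_normal((n, d))
b1, b2 = rng.standard_normal(n), rng.standard_normal(n)
A, b = np.kron(A1, A2), np.kron(b1, b2)        # A in R^{n^2 x d^2}, b in R^{n^2}

x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
S = rng.standard_normal((m, n * n)) / np.sqrt(m)
x_prime, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)

# Compare the l_inf error to the (epsilon/d)-type scale from Theorem B.1.
scale = np.linalg.norm(A @ x_star - b) * np.linalg.norm(np.linalg.pinv(A), 2) / d
print(np.abs(x_prime - x_star).max(), scale)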

B.2 Oblivious coordinate-wise embedding for TensorSRHT and TensorSRCT

Lemma B.3 (TensorSRHT and TensorSRCT, Tensor version of Lemma 3.7).

Let $S\in\mathbb{R}^{m\times n^{2}}$ be a TensorSRHT or TensorSRCT matrix. Then, $S$ is an OCE with parameter $\beta=\log^{1.5}(n/\delta)$.

Proof.

To prove this result, we show that TensorSRHT and TensorSRCT satisfy Definition 3.5.

For TensorSRHT, recall $S=\frac{1}{\sqrt{m}}P(HD_{1}\otimes HD_{2})$, where $H$ is a Hadamard matrix and $D_{1},D_{2}$ are diagonal matrices with random signs. Thus, all entries of $HD_{1}\otimes HD_{2}$ are in $\{\pm 1\}$. Since $P$ is a row sampling matrix and we rescale each entry by $\frac{1}{\sqrt{m}}$, each entry of $S$ is in $\{\pm\frac{1}{\sqrt{m}}\}$. Consider two entries in the same row but different columns $i,j$: if column $i$ is generated from two columns (of $HD_{1}$ and $HD_{2}$) disjoint from those generating column $j$, then the two entries are clearly independent. Otherwise, suppose $i$ is generated from columns $a,b$ and $j$ is generated from columns $a,c$ with $b\neq c$; the entries are again independent, as their signs are determined by the independent signs attached to $b$ and $c$. Finally, we need to verify $\operatorname*{{\mathbb{E}}}[S_{k,i}]=0$; this holds since the product of two independent random signs is still a random sign. For TensorSRCT, the argument is exactly the same.

Now that both of these matrices satisfy Definition 3.5, we can use Lemma 3.7 to give a bound on pairwise inner product. The column norm bound is automatically satisfied by definition. Thus, we can invoke Lemma 3.4 to wrap up the proof. ∎

Appendix C SRCT and TensorSRCT: OSE via strong JL moment property

In this section, we prove that both SRCT and TensorSRCT are OSEs. We prove this property via the strong JL moment property [1]. This gives a worse row count compared to that of SRHT and TensorSRHT. We believe that these two families of distributions should have similar row counts for an OSE, and we leave closing the gap between them as a major open problem.

C.1 Notations

To make the notation lighter, we will use $\|X\|_{L^{t}}$ for the $L^{t}$ norm of a random variable $X$. This is formally defined below.

Definition C.1 ($t$-th moment).

For every integer $t\geq 1$ and any random variable $X\in\mathbb{R}$, we write

\[ \|X\|_{L^{t}}=\big(\operatorname*{{\mathbb{E}}}[|X|^{t}]\big)^{1/t}. \]

Note that

\[ \|X+Y\|_{L^{t}}\leq\|X\|_{L^{t}}+\|Y\|_{L^{t}} \]

for any random variables $X$, $Y$, by the Minkowski inequality.

C.2 Strong JL moment property

We show that both SRCT (see Definition 2.11) and TensorSRCT (see Definition 2.12) satisfy the so-called strong JL moment property, which is one of the core properties implying that a sketching matrix has the subspace embedding property [25].

Definition C.2 (Strong JL moment property [1]).

For every $\epsilon,\delta\in[0,1]$, we say a distribution over random matrices $S\in\mathbb{R}^{m\times n}$ has the strong $(\epsilon,\delta)$-JL moment property when

\[ \|\|Sx\|_{2}^{2}-1\|_{L^{t}}\leq\epsilon\sqrt{t/\log(1/\delta)} \quad\text{and}\quad \operatorname*{{\mathbb{E}}}[\|Sx\|_{2}^{2}]=1 \]

for all $x\in\mathbb{R}^{n}$ with $\|x\|_{2}=1$ and every integer $t\leq\log(1/\delta)$.

Given a distribution with the strong JL moment property, it is well known that such a distribution provides an OSE.

Lemma C.3 (Lemma 11 of [1]).

Let $S\in\mathbb{R}^{m\times n}$ be a random matrix with the $(\epsilon/d,\delta)$-strong JL moment property (Def. C.2). Then, $S$ is also an $(\epsilon,\delta,d,n)$-OSE (Def. 2.1).

To prove that SRCT (see Definition 2.11) and TensorSRCT (see Definition 2.12) satisfy the strong JL moment property, we prove that a more general class of matrices satisfies it.

More precisely, let $k\in\operatorname*{\mathbb{Z}}_{>0}$ be a positive integer and

\[ (D^{(i)})_{i\in[k]}\in\prod_{i\in[k]}\mathbb{R}^{n_{i}\times n_{i}} \]

be independent matrices, each with diagonal entries given by independent Rademacher variables.

Let $n=\prod_{i\in[k]}n_{i}$ and let $P\in\{0,1\}^{m\times n}$ be a random sampling matrix in which each row contains exactly one nonzero element, of value one, at a uniformly distributed position.

Then, we prove that the matrix

\[ S=\frac{1}{\sqrt{m}}PG\cdot(D_{1}\otimes\dots\otimes D_{k}) \]

satisfies the strong JL moment property, where $G$ is an $n\times n$ circulant matrix (see Definition 2.10) generated by a random vector whose elements are Rademacher variables.

If $k=1$, then $S$ is just an SRCT (see Definition 2.11). If $k=2$, then $S$ is a TensorSRCT (see Definition 2.12).

In order to prove this result we need a couple of lemmas. The first lemma can be seen as a version of Khintchine’s Inequality (see Lemma 2.18) for higher order chaos.

Lemma C.4 (Lemma 19 in [1]).

Let $t\geq 1$ and $k\in\mathbb{Z}_{>0}$. Let $(\sigma^{(i)})_{i\in[k]}\in\prod_{i\in[k]}\mathbb{R}^{n_{i}}$ be independent vectors, each satisfying Khintchine's inequality (see Lemma 2.18):

\[ \|\langle\sigma^{(i)},x\rangle\|_{L^{t}}\leq C_{t}\|x\|_{2} \]

for $t\geq 1$ and any vector $x\in\mathbb{R}^{n_{i}}$.

Let $(a_{i_{1},\ldots,i_{k}})_{i_{1}\in[n_{1}],\ldots,i_{k}\in[n_{k}]}$ be a tensor in $\mathbb{R}^{n_{1}\times\cdots\times n_{k}}$. Then,

\[ \Big\|\sum_{i_{1}\in[n_{1}],\ldots,i_{k}\in[n_{k}]}\Big(\prod_{j\in[k]}\sigma_{i_{j}}^{(j)}\Big)a_{i_{1},\ldots,i_{k}}\Big\|_{L^{t}}\leq C_{t}^{k}\Big(\sum_{i_{1}\in[n_{1}],\ldots,i_{k}\in[n_{k}]}a_{i_{1},\ldots,i_{k}}^{2}\Big)^{\frac{1}{2}} \]

for $t\geq 1$.

Viewing $a\in\mathbb{R}^{n_{1}\times\ldots\times n_{k}}$ as a vector, this can be written as

\[ \|\langle\sigma^{(1)}\otimes\cdots\otimes\sigma^{(k)},a\rangle\|_{L^{t}}\leq C_{t}^{k}\|a\|_{2} \]

for $t\geq 1$.

Proof.

The proof is by induction on $k$.

Base case: For $k=1$, the result follows from the assumption that the vectors satisfy Khintchine's inequality.

Inductive case: Assume that the result is true for every value up to $k-1$.

Let

\[ B_{i_{1},\ldots,i_{k-1}}=\sum_{i_{k}\in[n_{k}]}\sigma_{i_{k}}^{(k)}a_{i_{1},\ldots,i_{k}}. \quad (7) \]

We then substitute it into the left-hand side of the lemma:

\begin{align*} \Big\|\sum_{i_{1}\in[n_{1}],\ldots,i_{k}\in[n_{k}]}\Big(\prod_{j\in[k]}\sigma_{i_{j}}^{(j)}\Big)a_{i_{1},\ldots,i_{k}}\Big\|_{L^{t}} &= \Big\|\sum_{i_{1}\in[n_{1}],\ldots,i_{k-1}\in[n_{k-1}]}\Big(\prod_{j\in[k-1]}\sigma_{i_{j}}^{(j)}\Big)B_{i_{1},\ldots,i_{k-1}}\Big\|_{L^{t}} \\ &\leq C_{t}^{k-1}\Big\|\Big(\sum_{i_{1}\in[n_{1}],\ldots,i_{k-1}\in[n_{k-1}]}B_{i_{1},\ldots,i_{k-1}}^{2}\Big)^{\frac{1}{2}}\Big\|_{L^{t}} \\ &= C_{t}^{k-1}\Big\|\sum_{i_{1}\in[n_{1}],\ldots,i_{k-1}\in[n_{k-1}]}B_{i_{1},\ldots,i_{k-1}}^{2}\Big\|_{L^{t/2}}^{\frac{1}{2}} \\ &\leq C_{t}^{k-1}\Big(\sum_{i_{1}\in[n_{1}],\ldots,i_{k-1}\in[n_{k-1}]}\|B_{i_{1},\ldots,i_{k-1}}^{2}\|_{L^{t/2}}\Big)^{\frac{1}{2}} \\ &= C_{t}^{k-1}\Big(\sum_{i_{1}\in[n_{1}],\ldots,i_{k-1}\in[n_{k-1}]}\|B_{i_{1},\ldots,i_{k-1}}\|_{L^{t}}^{2}\Big)^{\frac{1}{2}}, \end{align*}

where the first step follows from Eq. (7), the second step follows from the inductive hypothesis, the third step follows from the definition of $\|\cdot\|_{L^{t}}$, the fourth step follows from the triangle inequality, and the fifth step follows from the definition of $\|\cdot\|_{L^{t}}$.

It remains to bound

\[ \|B_{i_{1},\ldots,i_{k-1}}\|_{L^{t}}^{2}\leq C_{t}^{2}\sum_{i_{k}\in[n_{k}]}a_{i_{1},\ldots,i_{k}}^{2} \]

by Khintchine's inequality, which finishes the induction step and hence the proof. ∎

The next lemma is a type of Rosenthal inequality proved from first principles; it mixes large and small moments of random variables in an intricate way. For completeness, we include a proof here.

Lemma C.5 (Properties of random variables with bounded $t$-th moment, Lemma 20 in [1]).

There exists a universal constant $L$ such that the following holds: for $t\geq 1$, if $X_{1},\dots,X_{k}$ are independent non-negative random variables with finite $t$-th moment, then

i[k](Xi𝔼[Xi])LtL(tmaxi[k]XiLt1/2(i[k]𝔼[Xi])1/2+tmaxi[k]XiLt).\displaystyle\|\sum_{i\in[k]}(X_{i}-\operatorname*{{\mathbb{E}}}[X_{i}])\|_{L^{t}}\leq L\cdot\big{(}\sqrt{t}\cdot\|\max_{i\in[k]}X_{i}\|^{1/2}_{L^{t}}\cdot(\sum_{i\in[k]}\operatorname*{{\mathbb{E}}}[X_{i}])^{1/2}+t\cdot\|\max_{i\in[k]}X_{i}\|_{L^{t}}\big{)}.
Proof.

Throughout these calculations L1,L2L_{1},L_{2} and L3L_{3} will be universal constants.

i[k](Xi𝔼[Xi])Lt\displaystyle\|\sum_{i\in[k]}(X_{i}-\operatorname*{{\mathbb{E}}}[X_{i}])\|_{L^{t}}\leq L1i[k]σiXiLt\displaystyle~{}L_{1}\|\sum_{i\in[k]}\sigma_{i}X_{i}\|_{L^{t}}
\displaystyle\leq L2ti[k]Xi2Lt/21/2\displaystyle~{}L_{2}\sqrt{t}\cdot\|\sum_{i\in[k]}X^{2}_{i}\|^{1/2}_{L^{t/2}}
\displaystyle\leq L2tmaxi[k]Xii[k]XiLt/21/2\displaystyle~{}L_{2}\sqrt{t}\cdot\|\max_{i\in[k]}X_{i}\cdot\sum_{i\in[k]}X_{i}\|^{1/2}_{L^{t/2}}
\displaystyle\leq L2tmaxi[k]XiLt1/2i[k]XiLt1/2\displaystyle~{}L_{2}\sqrt{t}\cdot\|\max_{i\in[k]}X_{i}\|^{1/2}_{L^{t}}\cdot\|\sum_{i\in[k]}X_{i}\|^{1/2}_{L^{t}}
\displaystyle\leq~{}L_{2}\sqrt{t}\cdot\|\max_{i\in[k]}X_{i}\|^{1/2}_{L^{t}}\cdot\Big{(}(\sum_{i\in[k]}\operatorname*{{\mathbb{E}}}[X_{i}])^{1/2}+\|\sum_{i\in[k]}(X_{i}-\operatorname*{{\mathbb{E}}}[X_{i}])\|^{1/2}_{L^{t}}\Big{)} (8)

where the first step follows from symmetrization of the $X_{i}$ (with independent Rademacher signs $\sigma_{i}$), the second step follows from Khintchine’s inequality (see Lemma 2.18), the third step follows from the non-negativity of the $X_{i}$, the fourth step follows from the Cauchy–Schwarz inequality, and the last step follows from the triangle inequality together with $\sqrt{a+b}\leq\sqrt{a}+\sqrt{b}$.

Now, let $A,B,C$ be defined as follows:

\displaystyle A:=L_{2}\sqrt{t}\|\max_{i\in[k]}X_{i}\|^{1/2}_{L^{t}},\qquad B:=(\sum_{i\in[k]}\operatorname*{{\mathbb{E}}}[X_{i}])^{1/2},\qquad C:=\|\sum_{i\in[k]}(X_{i}-\operatorname*{{\mathbb{E}}}[X_{i}])\|^{1/2}_{L^{t}}.

Then, by rewriting Eq. (8) in this notation, we have

C2A(B+C).\displaystyle C^{2}\leq A(B+C).

This implies that $C$ is at most the larger root of the corresponding quadratic.

Solving this quadratic inequality gives

C2L3(AB+A2),\displaystyle C^{2}\leq L_{3}(AB+A^{2}),

which, after substituting the definitions of $A$ and $B$ and absorbing constants into $L$, completes the proof. ∎
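The following minimal NumPy sketch is likewise only an illustration of Lemma C.5: we take the $X_{i}$ to be squared standard Gaussians (a choice made for concreteness, not mandated by the lemma), estimate both sides by Monte Carlo, and report their ratio, since the universal constant $L$ is not specified.

```python
# Monte Carlo illustration of the Rosenthal-type bound in Lemma C.5.
# Assumption for illustration only: X_i = g_i^2 with g_i standard Gaussian, so E[X_i] = 1.
import numpy as np

rng = np.random.default_rng(1)
k, t, trials = 32, 4, 100_000

X = rng.standard_normal((trials, k)) ** 2              # independent non-negative X_i
centered_sum = X.sum(axis=1) - k                       # sum_i (X_i - E[X_i])
lhs = np.mean(np.abs(centered_sum) ** t) ** (1 / t)    # || sum_i (X_i - E[X_i]) ||_{L^t}

max_Lt = np.mean(X.max(axis=1) ** t) ** (1 / t)        # || max_i X_i ||_{L^t}
rhs_without_L = np.sqrt(t) * np.sqrt(max_Lt) * np.sqrt(k) + t * max_Lt

print(f"LHS ~ {lhs:.2f},  RHS without L ~ {rhs_without_L:.2f},  ratio ~ {lhs / rhs_without_L:.2f}")
```

A ratio comfortably below one is consistent with the existence of a universal constant $L\geq 1$.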

C.3 SRCT and TensorSRCT satisfy strong JL moment property

We can now prove that SRCT (see Definition 2.11) and TensorSRCT (see Definition 2.12) have the strong JL moment property.

Theorem C.6.

There exists a universal constant $L$ such that the following holds.

Let $k\in\operatorname*{\mathbb{Z}}_{>0}$. Let $D_{1},\dots,D_{k}$ be independent diagonal matrices, where each $D_{i}\in\mathbb{R}^{n_{i}\times n_{i}}$ has independent Rademacher random variables on its diagonal.

We define n:=i[k]nin:=\prod_{i\in[k]}n_{i}, D:=D1D2Dkn×nD:=D_{1}\otimes D_{2}\otimes\dots\otimes D_{k}\in\mathbb{R}^{n\times n} and G:=G1Gkn×nG:=G_{1}\otimes\ldots\otimes G_{k}\in\mathbb{R}^{n\times n}, where each Gini×niG_{i}\in\mathbb{R}^{n_{i}\times n_{i}} is a circulant matrix generated by an independent Rademacher random vector. Let Pm×nP\in\mathbb{R}^{m\times n} be a row sampling matrix that has exactly one nonzero per row. Let S:=PGDS:=PGD.

Let xnx\in\mathbb{R}^{n} be any vector with x2=1\|x\|_{2}=1 and t1t\geq 1.

Then,

1mPGDx221LtL(trkm+trkm),\displaystyle\left\|\frac{1}{m}\|PGDx\|_{2}^{2}-1\right\|_{L^{t}}\leq L(\sqrt{\frac{tr^{k}}{m}}+\frac{tr^{k}}{m}),

where r=max{t,logm}r=\max\{t,\log m\}.

There exists a universal constant $C_{0}>1$ such that, by setting

m=Ω(ϵ2log(1/δ)(C0log(1/(ϵδ)))k),\displaystyle m=\Omega(\epsilon^{-2}\log(1/\delta)\cdot(C_{0}\log(1/(\epsilon\delta)))^{k}),

the matrix $\frac{1}{\sqrt{m}}PGD$ satisfies the $(\epsilon,\delta)$-strong JL moment property.

Proof.

Throughout the proof C1C_{1}, C2C_{2} and C3C_{3} will denote universal constants.

For every $i\in[m]$, we let $P_{i}$ be the random variable indicating which coordinate the $i$-th row of $P$ samples.

We define the random variable

\displaystyle Z_{i}:=(Sx)_{i}=G_{P_{i}}Dx,

where $G_{P_{i}}$ denotes the $P_{i}$-th row of $G$.

Since the variables $(P_{i})_{i\in[m]}$ are independent, the variables $(Z_{i})_{i\in[m]}$ are conditionally independent given $D$; that is, once $D$ is fixed, $(Z_{i})_{i\in[m]}$ are independent.

We then obtain the following inequality:

1mi[m]Zi21Lt\displaystyle~{}\|\frac{1}{m}\sum_{i\in[m]}Z^{2}_{i}-1\|_{L^{t}}
\displaystyle=~{}\|(\operatorname*{{\mathbb{E}}}[|\frac{1}{m}\sum_{i\in[m]}Z^{2}_{i}-1|^{t}~{}\big{|}~{}D])^{1/t}\|_{L^{t}}
\displaystyle\leq~{}C_{1}\|\frac{\sqrt{t}}{m}\cdot(\operatorname*{{\mathbb{E}}}[(\max_{i\in[m]}Z^{2}_{i})^{t}~{}\big{|}~{}D])^{1/(2t)}\cdot(\sum_{i\in[m]}\operatorname*{{\mathbb{E}}}[Z^{2}_{i}~{}\big{|}~{}D])^{1/2}+\frac{t}{m}\cdot(\operatorname*{{\mathbb{E}}}[(\max_{i\in[m]}Z^{2}_{i})^{t}~{}\big{|}~{}D])^{1/t}\|_{L^{t}}
\displaystyle\leq~{}C_{1}\frac{\sqrt{t}}{m}\cdot\|(\operatorname*{{\mathbb{E}}}[(\max_{i\in[m]}Z^{2}_{i})^{t}~{}\big{|}~{}D])^{1/(2t)}\cdot(\sum_{i\in[m]}\operatorname*{{\mathbb{E}}}[Z^{2}_{i}~{}\big{|}~{}D])^{1/2}\|_{L^{t}}+C_{1}\frac{t}{m}\cdot\|\max_{i\in[m]}Z^{2}_{i}\|_{L^{t}}
\displaystyle\leq C1tmmaxi[m]Zi2Lt1/2i[m]𝔼[Zi2D]Lt1/2+C1tmmaxi[m]Zi2Lt\displaystyle~{}C_{1}\frac{\sqrt{t}}{m}\cdot\|\max_{i\in[m]}Z^{2}_{i}\|^{1/2}_{L^{t}}\cdot\|\sum_{i\in[m]}\operatorname*{{\mathbb{E}}}[Z^{2}_{i}~{}\big{|}~{}D]\|^{1/2}_{L^{t}}+C_{1}\frac{t}{m}\cdot\|\max_{i\in[m]}Z^{2}_{i}\|_{L^{t}}

where the first step follows from Definition C.1 and the tower property of expectation, the second step follows from applying Lemma C.5 conditionally on $D$, the third step follows from the triangle inequality (and again the tower property), and the last step follows from the Cauchy–Schwarz inequality.

Note that each row of $G$ is generated by taking the tensor product of independent Rademacher random vectors; we can therefore view such a row as a length-$n$ Rademacher random vector. Thus,

𝔼[Zi2|D]=\displaystyle\operatorname*{{\mathbb{E}}}[Z^{2}_{i}|D]= σ{1,+1}npσ(x,σ)2\displaystyle~{}\sum_{\sigma\in\{-1,+1\}^{n}}p_{\sigma}\cdot(\langle x,\sigma\rangle)^{2}
=\displaystyle= (x1+x2++xn)22n+(x1+x2+xn)22n++(x1x2xn)22n\displaystyle~{}\frac{(x_{1}+x_{2}+\cdots+x_{n})^{2}}{2^{n}}+\frac{(x_{1}+x_{2}+\cdots-x_{n})^{2}}{2^{n}}+\dots+\frac{(-x_{1}-x_{2}-\cdots-x_{n})^{2}}{2^{n}}
=\displaystyle= x12+x22++xn2\displaystyle~{}x^{2}_{1}+x^{2}_{2}+\cdots+x^{2}_{n}
=\displaystyle= x22,\displaystyle~{}\|x\|^{2}_{2}, (9)

where the first step follows from the definition of the conditional expectation $\operatorname*{{\mathbb{E}}}[Z^{2}_{i}|D]$, the second step follows from expanding over all $2^{n}$ sign patterns, the third step follows from the cancellation of the cross terms, and the last step follows from the definition of $\|\cdot\|_{2}^{2}$.

Hence, we get that

i[m]𝔼[Zi2|D]=\displaystyle\sum_{i\in[m]}\operatorname*{{\mathbb{E}}}[Z_{i}^{2}~{}\big{|}~{}D]= i[m]x22\displaystyle~{}\sum_{i\in[m]}\|x\|_{2}^{2}
=\displaystyle= m,\displaystyle~{}m,

where the first step follows from Eq. (9) and the second step follows from $\|x\|^{2}_{2}=1$.

To bound $\|\max_{i\in[m]}Z^{2}_{i}\|_{L^{t}}$, we first show that

\displaystyle\|Z^{2}_{i}\|_{L^{r}}=~{}\|G_{P_{i}}Dx\|^{2}_{L^{2r}}
\displaystyle=~{}\|\langle\sigma,Dx\rangle\|^{2}_{L^{2r}}
\displaystyle\leq~{}r^{k}\|x\|^{2}_{2},

where $\sigma$ denotes the $P_{i}$-th row of $G$; the first step follows from the definition of $Z_{i}$, the second step writes this row as $\sigma$, which is a tensor product of independent Rademacher vectors, and the last step follows from Lemma C.4 (applied conditionally on $D$) together with $\|Dx\|_{2}=\|x\|_{2}$.

We then bound the maximum using a sufficiently high-order sum:

maxi[m]Zi2Lt\displaystyle\|\max_{i\in[m]}Z^{2}_{i}\|_{L^{t}}\leq maxi[m]Zi2Lr\displaystyle~{}\|\max_{i\in[m]}Z^{2}_{i}\|_{L^{r}}
\displaystyle\leq (i[m]Zi2Lrr)1/r\displaystyle~{}(\sum_{i\in[m]}\|Z^{2}_{i}\|^{r}_{L^{r}})^{1/r}
\displaystyle\leq m1/rrkx22erk,\displaystyle~{}m^{1/r}r^{k}\|x\|_{2}^{2}\leq er^{k},

where the first step follows from $r\geq t$ and the monotonicity of the $L^{p}$ norm in $p$, the second step follows from the non-negativity of $Z^{2}_{i}$ (the maximum is bounded by the $\ell^{r}$-sum), and the last step follows from the bound on $\|Z^{2}_{i}\|_{L^{r}}$ above, $\|x\|_{2}=1$, and $m^{1/r}\leq e$ since $r\geq\log m$.

This gives us that

1mi[m]Zi2x22LtC2trkm+C2trkm\displaystyle\|\frac{1}{m}\sum_{i\in[m]}Z^{2}_{i}-\|x\|_{2}^{2}\|_{L^{t}}\leq C_{2}\sqrt{\frac{tr^{k}}{m}}+C_{2}\frac{tr^{k}}{m} (10)

which finishes the first part of the proof.

We choose $m$ as follows:

m=16C22ϵ2log(1/δ)(C3log(1/(δϵ)))k.\displaystyle m=16C^{2}_{2}\epsilon^{-2}\cdot\log(1/\delta)\cdot(C_{3}\log(1/(\delta\epsilon)))^{k}.

With this choice of $m$, the following condition on $r$ holds:

rC3log(1/(δϵ)).\displaystyle r\leq C_{3}\log(1/(\delta\epsilon)).

Hence,

m16C22ϵ2log(1/δ)rk.\displaystyle m\geq 16C^{2}_{2}\epsilon^{-2}\cdot\log(1/\delta)\cdot r^{k}.

For all 1tlog(1/δ)1\leq t\leq\log(1/\delta), we then get that

\displaystyle\|\frac{1}{m}\|PGDx\|_{2}^{2}-1\|_{L^{t}}\leq~{}C_{2}\sqrt{\frac{tr^{k}}{m}}+C_{2}\frac{tr^{k}}{m}
\displaystyle\leq C2(trk16C22ϵ2log(1/δ)rk)1/2+C2trk16C22ϵ2log(1/δ)rk\displaystyle~{}C_{2}(\frac{tr^{k}}{16C_{2}^{2}\epsilon^{-2}\log(1/\delta)r^{k}})^{1/2}+C_{2}\frac{tr^{k}}{16C^{2}_{2}\epsilon^{-2}\log(1/\delta)r^{k}}
\displaystyle\leq 0.5ϵt/log(1/δ)+0.5ϵ2t/log(1/δ)\displaystyle~{}0.5\epsilon\sqrt{t/\log(1/\delta)}+0.5\epsilon^{2}t/\log(1/\delta)
\displaystyle\leq ϵt/log(1/δ).\displaystyle~{}\epsilon\sqrt{{t}/{\log(1/\delta)}}.

where the first step follows from Eq. (10), the second step follows from the choice of $m$, the third step follows from simple algebra, and the last step follows from $\epsilon^{2}\leq\epsilon$ and $t/\log(1/\delta)\leq\sqrt{t/\log(1/\delta)}$ (since $t/\log(1/\delta)\in(0,1]$).

This finishes the proof. ∎
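To make the objects in Theorem C.6 concrete, the following minimal NumPy sketch builds $S=\frac{1}{\sqrt{m}}PGD$ for $k=1$ (an SRCT-style sketch) and $k=2$ (a TensorSRCT-style sketch) and applies it to a unit vector. The helper names are ours, and since Definitions 2.11 and 2.12 are not reproduced here, this should be read only as an illustration of the construction described in the theorem, not as the exact definitions.

```python
# Illustrative construction of S = (1/sqrt(m)) * P * G * D from Theorem C.6.
# G is circulant with a Rademacher generator, D is a Rademacher diagonal,
# and P samples one coordinate per row. Names and parameters are illustrative.
import numpy as np

def circulant(g):
    """Return the len(g) x len(g) circulant matrix whose rows are cyclic shifts of g."""
    return np.stack([np.roll(g, i) for i in range(len(g))])

def srct_like(n, m, rng):
    """k = 1 sketch: S = (1/sqrt(m)) * P * G * D."""
    G = circulant(rng.choice([-1.0, 1.0], size=n))
    D = rng.choice([-1.0, 1.0], size=n)              # diagonal of D, stored as a vector
    rows = rng.integers(0, n, size=m)                # the coordinate sampled by each row of P
    return (G * D[None, :])[rows, :] / np.sqrt(m)    # right-multiplying by D scales the columns of G

def tensor_srct_like(n1, n2, m, rng):
    """k = 2 sketch: G = G_1 (x) G_2 and D = D_1 (x) D_2, then sample m rows."""
    GD1 = circulant(rng.choice([-1.0, 1.0], size=n1)) * rng.choice([-1.0, 1.0], size=n1)[None, :]
    GD2 = circulant(rng.choice([-1.0, 1.0], size=n2)) * rng.choice([-1.0, 1.0], size=n2)[None, :]
    GD = np.kron(GD1, GD2)                           # (G_1 D_1) (x) (G_2 D_2) = (G_1 (x) G_2)(D_1 (x) D_2)
    rows = rng.integers(0, n1 * n2, size=m)
    return GD[rows, :] / np.sqrt(m)

rng = np.random.default_rng(2)
x = rng.standard_normal(256)
x /= np.linalg.norm(x)
for name, S in [("SRCT-like", srct_like(256, 1024, rng)),
                ("TensorSRCT-like", tensor_srct_like(16, 16, 1024, rng))]:
    print(name, "||Sx||_2^2 =", round(float(np.linalg.norm(S @ x) ** 2), 3))   # both close to 1
```

The printed squared norms concentrate around $\|x\|_{2}^{2}=1$, in line with Eq. (10).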

As two corollaries, we obtain that SRCT and TensorSRCT are OSEs with $d^{2}$ rows, instead of $d$ rows.

Corollary C.7 (SRCT is an OSE).

Let Sm×nS\in\mathbb{R}^{m\times n} be an SRCT matrix with m=Θ(ϵ2d2log2(n/ϵδ))m=\Theta(\epsilon^{-2}d^{2}\log^{2}(n/\epsilon\delta)), then SS is an (ϵ,δ,d,n)(\epsilon,\delta,d,n)-OSE.

Proof.

The proof follows from combining Lemma C.3 and Theorem C.6 with k=1k=1. ∎

Corollary C.8 (TensorSRCT is an OSE).

Let Sm×nS\in\mathbb{R}^{m\times n} be a TensorSRCT matrix with m=Θ(ϵ2d2log3(n/ϵδ))m=\Theta(\epsilon^{-2}d^{2}\log^{3}(n/\epsilon\delta)), then SS is an (ϵ,δ,d,n2)(\epsilon,\delta,d,n^{2})-OSE.

Proof.

The proof follows from combining Lemma C.3 and Theorem C.6 with k=2k=2. ∎
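As a rough empirical companion to Corollary C.7 (again illustrative only, with our own helper names and untuned parameters), one can check that an SRCT-style sketch approximately preserves a $d$-dimensional column space by inspecting the singular values of $SQ$ for an orthonormal basis $Q$:

```python
# Empirical check that an SRCT-style sketch approximately preserves a column space.
# All singular values of S @ Q should be close to 1; parameters are illustrative.
import numpy as np

rng = np.random.default_rng(3)
n, d, m = 4096, 8, 2048

g = rng.choice([-1.0, 1.0], size=n)                     # Rademacher generator of the circulant G
D = rng.choice([-1.0, 1.0], size=n)                     # Rademacher diagonal, stored as a vector
rows = rng.integers(0, n, size=m)                       # coordinates sampled by P
G_rows = np.stack([np.roll(g, int(r)) for r in rows])   # only the m sampled rows of G are needed

def sketch(M):
    """Apply S = (1/sqrt(m)) * P * G * D to a matrix M with n rows."""
    return (G_rows @ (D[:, None] * M)) / np.sqrt(m)

A = rng.standard_normal((n, d))                         # a generic tall matrix
Q, _ = np.linalg.qr(A)                                  # orthonormal basis of its column space
sv = np.linalg.svd(sketch(Q), compute_uv=False)
print("singular values of S @ Q:", np.round(sv, 2))
```

Deviations of these singular values from $1$ correspond to the distortion $\epsilon$ in the OSE guarantee.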

Appendix D Gaussian and AMS

In this section, we prove that both random Gaussian matrices and AMS matrices satisfy OCE with a good parameter $\beta$. Combined with the fact that they are OSEs, one can derive the $\ell_{\infty}$ guarantee for them.

D.1 OSE property of random Gaussian and AMS

The OSE property of these two distributions is folklore; for a proof, see, e.g., [34].

Lemma D.1.

Let SS be a random Gaussian matrix defined in Def. 2.4. If m=Θ(ϵ2(d+log(d/δ)))m=\Theta(\epsilon^{-2}(d+\log(d/\delta))), then SS is an (ϵ,δ,d,n)(\epsilon,\delta,d,n)-OSE.

Lemma D.2.

Let SS be an AMS matrix defined in Def. 2.5. If m=Θ(ϵ2dlog2(n/δ))m=\Theta(\epsilon^{-2}d\log^{2}(n/\delta)), then SS is an (ϵ,δ,d,n)(\epsilon,\delta,d,n)-OSE.

D.2 OCE property of random Gaussian and AMS

In this section, we prove the OCE property of random Gaussian and AMS matrices. We start with the pairwise inner product bound for these two distributions. For the column norm bound, AMS matrices have unit-norm columns by construction, so we only prove it for random Gaussian matrices.

Lemma D.3 (Gaussian pairwise inner product bound, Lemma B.18 in [31]).

Let Sm×nS\in\mathbb{R}^{m\times n} be a random Gaussian matrix (Definition 2.4).

Then, we have:

Pr[maxij|S,i,S,j|Clog(n/δ)m]Θ(δ).\displaystyle\Pr[\max_{i\not=j}|\langle S_{*,i},S_{*,j}\rangle|\geq C\cdot\frac{\sqrt{\log{(n/\delta)}}}{\sqrt{m}}]\leq\Theta(\delta).
Proof.

Note for iji\not=j, S,i,S,j𝒩(0,1mIm)S_{*,i},S_{*,j}\sim\mathcal{N}(0,\frac{1}{m}I_{m}) are two independent Gaussian vectors. Let zk=Sk,iSk,jz_{k}=S_{k,i}S_{k,j} and z=S,i,S,jz=\langle S_{*,i},S_{*,j}\rangle.

Then, we have for any |λ|m/2|\lambda|\leq m/2,

𝔼[eλzk]=11λ2/m2exp(λ2/m2),\displaystyle\operatorname*{{\mathbb{E}}}[e^{\lambda z_{k}}]=\frac{1}{\sqrt{1-\lambda^{2}/m^{2}}}\leq\exp{(\lambda^{2}/m^{2})},

where the first step follows from $z_{k}=\frac{1}{4}(S_{k,i}+S_{k,j})^{2}-\frac{1}{4}(S_{k,i}-S_{k,j})^{2}=\frac{1}{2m}(Q_{1}-Q_{2})$, where $Q_{1},Q_{2}\sim\chi_{1}^{2}$ are independent, and from $\operatorname*{{\mathbb{E}}}[e^{\lambda Q}]=\frac{1}{\sqrt{1-2\lambda}}$ for any $Q\sim\chi_{1}^{2}$ and $\lambda<1/2$.

This implies zk𝖲𝗎𝖻𝖤𝗑𝗉(2/m2,2/m)z_{k}\in{\sf SubExp}(2/m^{2},2/m) is a sub-exponential random variable. Here 𝖲𝗎𝖻𝖤𝗑𝗉{\sf SubExp} is the shorthand of sub-exponential random variable.

Thus, we have

z=k=1mzk𝖲𝗎𝖻𝖤𝗑𝗉(2/m,2/m),\displaystyle z=\sum_{k=1}^{m}z_{k}\in{\sf SubExp}(2/m,2/m),

by sub-exponential concentration Lemma A.3, we have

Pr[|z|t]2exp(mt2/4)\displaystyle\Pr[|z|\geq t]\leq 2\exp{(-mt^{2}/4)}

for $0<t<1$. Picking $t=2\sqrt{\log(2n^{2}/\delta)/m}$ and noting that $2\sqrt{\log(2n^{2}/\delta)}\leq C\sqrt{\log(n/\delta)}$ for a suitable universal constant $C$, we have

Pr[|S,i,S,j|Clog(n/δ)m]δ/n2.\displaystyle\Pr[|\langle S_{*,i},S_{*,j}\rangle|\geq C\cdot\frac{\sqrt{\log{(n/\delta)}}}{\sqrt{m}}]\leq\delta/n^{2}.

Taking the union bound over all (i,j)[n]×[n](i,j)\in[n]\times[n] and iji\not=j, we complete the proof. ∎
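The following small NumPy experiment illustrates Lemma D.3 (it is not part of the proof): it draws $S$ with $\mathcal{N}(0,\frac{1}{m})$ entries and compares the largest pairwise column inner product against $\sqrt{\log(n/\delta)/m}$; since the constant $C$ is not specified by the lemma, we only print the two quantities.

```python
# Monte Carlo illustration of the Gaussian pairwise inner product bound (Lemma D.3).
import numpy as np

rng = np.random.default_rng(4)
m, n, delta = 512, 200, 0.01

S = rng.standard_normal((m, n)) / np.sqrt(m)   # entries ~ N(0, 1/m)
inner = S.T @ S
np.fill_diagonal(inner, 0.0)                   # discard the diagonal (i = j) terms

print("max_{i != j} |<S_i, S_j>| :", round(float(np.abs(inner).max()), 4))
print("sqrt(log(n/delta)/m)      :", round(float(np.sqrt(np.log(n / delta) / m)), 4))
```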

Lemma D.4 (AMS pairwise inner product bound).

Let $S\in\mathbb{R}^{m\times n}$ be an AMS matrix (Definition 2.5). Let $\{\sigma_{i}\}_{i\in[n]}$ be independent Rademacher random variables and $\overline{S}\in\mathbb{R}^{m\times n}$ with $\overline{S}_{*,i}=\sigma_{i}S_{*,i}$ for all $i\in[n]$.

Then, we have:

\displaystyle\Pr[\max_{i\not=j}|\langle\overline{S}_{*,i},\overline{S}_{*,j}\rangle|\geq C\cdot\frac{\sqrt{\log{(n/\delta)}}}{\sqrt{m}}]\leq\Theta(\delta)
Proof.

Note for any fixed iji\not=j, S¯,i\overline{S}_{*,i} and S¯,j\overline{S}_{*,j} are independent. By Hoeffding inequality (Lemma 2.19), we have

Pr[|S¯,i,S¯,j|t]\displaystyle~{}\Pr[|\langle\overline{S}_{*,i},\overline{S}_{*,j}\rangle|\geq t]
\displaystyle\leq 2exp(2t2i=1m(1m(1m))2)\displaystyle~{}2\exp{(-\frac{2t^{2}}{\sum_{i=1}^{m}(\frac{1}{m}-(-\frac{1}{m}))^{2}})}
\displaystyle\leq 2et2m/2,\displaystyle~{}2e^{-t^{2}m/2},

where the second step follows from $\sum_{i=1}^{m}(\frac{2}{m})^{2}=\frac{4}{m}$.

Choosing t=2log(2n2/δ)/mt=\sqrt{2\log{(2n^{2}/\delta)}}/\sqrt{m}, we have

Pr[|S¯,i,S¯,j|2log(2n2/δ)/m]δn2,\displaystyle\Pr[|\langle\overline{S}_{*,i},\overline{S}_{*,j}\rangle|\geq\sqrt{2\log{(2n^{2}/\delta)}}/\sqrt{m}]\leq\frac{\delta}{n^{2}},

union bound over all n2n^{2} pairs of columns gives the desired result. ∎

Lemma D.5 (Gaussian column norm bound).

Let Sm×nS\in\mathbb{R}^{m\times n} be a random Gaussian matrix.

Then, we have

\displaystyle\Pr[\max_{i\in[n]}|\|S_{*,i}\|_{2}^{2}-1|\geq C\cdot\frac{\sqrt{\log(n/\delta)}}{\sqrt{m}}]\leq~{}\Theta(\delta).
Proof.

For any column $S_{*,i}$, note that $m\|S_{*,i}\|_{2}^{2}\sim\chi_{m}^{2}$, since the entries of $S_{*,i}$ are independent Gaussians with zero mean and variance $\frac{1}{m}$.

By Lemma 2.20, we have

Pr[|S,i221|2tm]\displaystyle\Pr[|\|S_{*,i}\|_{2}^{2}-1|\geq 2\frac{\sqrt{t}}{\sqrt{m}}]\leq 2exp(t).\displaystyle~{}2\exp(-t).

Setting t=log(n/δ)t=\log(n/\delta), we have

Pr[|S,i221|Clog(n/δ)m]\displaystyle\Pr[|\|S_{*,i}\|_{2}^{2}-1|\geq C\cdot\frac{\sqrt{\log(n/\delta)}}{\sqrt{m}}]\leq δ/n,\displaystyle~{}\delta/n,

the proof is concluded by union bounding over all nn columns. ∎
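Analogously, a quick numerical illustration of Lemma D.5 (again with the unspecified constant $C$ left implicit) compares the largest column-norm deviation against $\sqrt{\log(n/\delta)/m}$:

```python
# Monte Carlo illustration of the Gaussian column norm bound (Lemma D.5).
import numpy as np

rng = np.random.default_rng(5)
m, n, delta = 512, 200, 0.01

S = rng.standard_normal((m, n)) / np.sqrt(m)
col_dev = np.abs((S ** 2).sum(axis=0) - 1.0).max()   # max_i | ||S_{*,i}||_2^2 - 1 |

print("max column-norm deviation :", round(float(col_dev), 4))
print("sqrt(log(n/delta)/m)      :", round(float(np.sqrt(np.log(n / delta) / m)), 4))
```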

We conclude that random Gaussian and AMS matrices are OCEs.

Lemma D.6 (Gaussian OCE).

Let Sm×nS\in\mathbb{R}^{m\times n} be a random Gaussian matrix, then SS is a (log1.5(n/δ),δ,n)(\log^{1.5}(n/\delta),\delta,n)-OCE.

Proof.

By Lemma D.3 and Lemma D.5, we know both pairwise inner product bound and column norm bound hold and thus, by Lemma 3.4, SS satisfies the desired OCE property. ∎

Lemma D.7 (AMS OCE).

Let Sm×nS\in\mathbb{R}^{m\times n} be an AMS matrix, then SS is a (log1.5(n/δ),δ,n)(\log^{1.5}(n/\delta),\delta,n)-OCE.

Proof.

The proof is similar to that of Lemma D.6, using Lemma D.4 for the pairwise inner product bound; the column norm bound holds since AMS matrices have unit-norm columns by definition. ∎
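A companion check for the AMS case: since Definition 2.5 is not reproduced here, the sketch below uses a fully independent $\pm\frac{1}{\sqrt{m}}$ sign matrix as a stand-in for the AMS construction. Its columns have unit norm exactly, and the pairwise column inner products concentrate at the same $\sqrt{\log(n/\delta)/m}$ rate, matching Lemmas D.4 and D.7 qualitatively.

```python
# Illustration for the AMS case, using fully independent signs as a stand-in
# for the AMS construction of Definition 2.5 (an assumption for this sketch only).
import numpy as np

rng = np.random.default_rng(6)
m, n, delta = 512, 200, 0.01

S = rng.choice([-1.0, 1.0], size=(m, n)) / np.sqrt(m)   # sign matrix with entries +-1/sqrt(m)

col_dev = np.abs((S ** 2).sum(axis=0) - 1.0).max()      # exactly 0 up to floating point
inner = S.T @ S
np.fill_diagonal(inner, 0.0)

print("max column-norm deviation :", float(col_dev))
print("max_{i != j} |<S_i, S_j>| :", round(float(np.abs(inner).max()), 4))
print("sqrt(log(n/delta)/m)      :", round(float(np.sqrt(np.log(n / delta) / m)), 4))
```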

References

  • AKK+ [20] Thomas D Ahle, Michael Kapralov, Jakob BT Knudsen, Rasmus Pagh, Ameya Velingker, David P Woodruff, and Amir Zandieh. Oblivious sketching of high-degree polynomial kernels. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 141–160. SIAM, 2020.
  • AMS [96] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. In Proceedings of the twenty-eighth annual ACM symposium on Theory of computing, pages 20–29, 1996.
  • ANW [14] Haim Avron, Huy Nguyen, and David Woodruff. Subspace embeddings for the polynomial kernel. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2258–2266. 2014.
  • BPSW [21] Jan van den Brand, Binghui Peng, Zhao Song, and Omri Weinstein. Training (overparametrized) neural networks in near-linear time. In ITCS, 2021.
  • CCFC [02] Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. In International Colloquium on Automata, Languages, and Programming (ICALP), pages 693–703. Springer, 2002.
  • Coh [16] Michael B Cohen. Nearly tight oblivious subspace embeddings by trace inequalities. In Proceedings of the twenty-seventh annual ACM-SIAM symposium on Discrete algorithms, pages 278–287. SIAM, 2016.
  • CSWZ [23] Yeshwanth Cherapanamjeri, Sandeep Silwal, David P. Woodruff, and Samson Zhou. Optimal algorithms for linear algebra in the current matrix multiplication time. In Proceedings of the 2023 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2023.
  • CW [13] Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and regression in input sparsity time. In Symposium on Theory of Computing Conference (STOC), pages 81–90, 2013.
  • DJS+ [19] Huaian Diao, Rajesh Jayaram, Zhao Song, Wen Sun, and David Woodruff. Optimal sketching for Kronecker product regression and low rank approximation. Advances in Neural Information Processing Systems, 32, 2019.
  • DSSW [18] Huaian Diao, Zhao Song, Wen Sun, and David Woodruff. Sketching for Kronecker product regression and p-splines. In International Conference on Artificial Intelligence and Statistics, pages 1299–1308. PMLR, 2018.
  • FFG [22] Matthew Fahrbach, Thomas Fu, and Mehrdad Ghadiri. Subquadratic Kronecker regression with applications to tensor decomposition. In Thirty-Sixth Conference on Neural Information Processing Systems, NeurIPS’22, 2022.
  • FKZ [11] Sergey Foss, Dmitry Korshunov, and Stan Zachary. An Introduction to Heavy-Tailed and Subexponential Distributions. Springer series in operations research. Springer, New York, NY, 1st ed edition, 2011.
  • Hoe [63] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
  • HW [71] David Lee Hanson and Farroll Tim Wright. A bound on tail probabilities for quadratic forms in independent random variables. The Annals of Mathematical Statistics, 42(3):1079–1083, 1971.
  • JSWZ [21] Shunhua Jiang, Zhao Song, Omri Weinstein, and Hengjie Zhang. A faster algorithm for solving general LPs. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 823–832, 2021.
  • Khi [23] Aleksandr Khintchine. Über dyadische brüche. Mathematische Zeitschrift, 18(1):109–116, 1923.
  • LDFU [13] Yichao Lu, Paramveer Dhillon, Dean P Foster, and Lyle Ungar. Faster ridge regression via the subsampled randomized Hadamard transform. In Advances in Neural Information Processing Systems (NIPS), pages 369–377, 2013.
  • LM [00] Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pages 1302–1338, 2000.
  • LS [14] Yin Tat Lee and Aaron Sidford. Path finding methods for linear programming: Solving linear programs in O(rank){O}(\sqrt{rank}) iterations and faster algorithms for maximum flow. In 55th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 424–433, 2014.
  • LSZ [19] Yin Tat Lee, Zhao Song, and Qiuyi Zhang. Solving empirical risk minimization in the current matrix multiplication time. In Conference on Learning Theory, pages 2140–2157. PMLR, 2019.
  • NN [13] Jelani Nelson and Huy L Nguyên. OSNAP: Faster numerical linear algebra algorithms via sparser subspace embeddings. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS), pages 117–126. IEEE, 2013.
  • PSW [17] Eric Price, Zhao Song, and David P Woodruff. Fast regression with an \ell_{\infty} guarantee. In ICALP, 2017.
  • QSZZ [23] Lianke Qin, Zhao Song, Lichen Zhang, and Danyang Zhuo. An online and unified algorithm for projection matrix vector multiplication with application to empirical risk minimization. In AISTATS, 2023.
  • RSZ [22] Aravind Reddy, Zhao Song, and Lichen Zhang. Dynamic tensor product regression. In Thirty-Sixth Conference on Neural Information Processing Systems, NeurIPS’22, 2022.
  • Sar [06] Tamas Sarlos. Improved approximation algorithms for large matrices via random projections. In 2006 47th annual IEEE symposium on foundations of computer science (FOCS), pages 143–152. IEEE, 2006.
  • SWYZ [21] Zhao Song, David Woodruff, Zheng Yu, and Lichen Zhang. Fast sketching of polynomial kernels of polynomial degree. In International Conference on Machine Learning, pages 9812–9823. PMLR, 2021.
  • SWZ [16] Zhao Song, David Woodruff, and Huan Zhang. Sublinear time orthogonal tensor decomposition. Advances in Neural Information Processing Systems, 29, 2016.
  • SWZ [19] Zhao Song, David P Woodruff, and Peilin Zhong. Relative error tensor low rank approximation. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2772–2789. SIAM, 2019.
  • SXYZ [22] Zhao Song, Zhaozhuo Xu, Yuanyuan Yang, and Lichen Zhang. Accelerating Frank-Wolfe algorithm using low-dimensional and adaptive data structures. arXiv preprint arXiv:2207.09002, 2022.
  • SXZ [22] Zhao Song, Zhaozhuo Xu, and Lichen Zhang. Speeding up sparsification using inner product search data structures. arXiv preprint arXiv:2204.03209, 2022.
  • SY [21] Zhao Song and Zheng Yu. Oblivious sketching-based central path method for solving linear programming problems. In 38th International Conference on Machine Learning (ICML), 2021.
  • SZZ [21] Zhao Song, Lichen Zhang, and Ruizhe Zhang. Training multi-layer over-parametrized neural network in subquadratic time. arXiv preprint arXiv:2112.07628, 2021.
  • Tro [10] Joel A. Tropp. Improved analysis of the subsampled randomized Hadamard transform. 2010.
  • Woo [14] David P Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends® in Theoretical Computer Science, 10(1–2):1–157, 2014.
  • WZ [20] David P Woodruff and Amir Zandieh. Near input sparsity time kernel embeddings via adaptive sampling. In ICML, 2020.
  • WZ [22] David Woodruff and Amir Zandieh. Leverage score sampling for tensor product matrices in input sparsity time. In Proceedings of the 39th International Conference on Machine Learning, Proceedings of Machine Learning Research. PMLR, 2022.