A Nearly-Optimal Bound for Fast Regression with $\ell_\infty$ Guarantee
Given a matrix $A \in \mathbb{R}^{n \times d}$ and a vector $b \in \mathbb{R}^n$, we consider the regression problem with $\ell_\infty$ guarantees: finding a vector $x' \in \mathbb{R}^d$ such that
$\|x' - x^*\|_\infty \le \frac{\epsilon}{\sqrt{d}} \cdot \|Ax^* - b\|_2 \cdot \|A^\dagger\|$, where $x^* = \arg\min_{x} \|Ax - b\|_2$. One popular approach for solving such $\ell_2$ regression problems is via sketching: picking a structured random matrix $S \in \mathbb{R}^{m \times n}$ with $m \ll n$ for which $SA$ can be quickly computed, and solving the “sketched” regression problem $\min_x \|SAx - Sb\|_2$.
In this paper, we show that in order to obtain such an $\ell_\infty$ guarantee for $\ell_2$ regression, one has to use sketching matrices that are dense. To the best of our knowledge, this is the first use case in which dense sketching matrices are necessary. On the algorithmic side, we prove that there exists a distribution of dense sketching matrices with $m = \epsilon^{-2} d \log^3(n/\delta)$ rows such that solving the sketched regression problem gives the $\ell_\infty$ guarantee with probability at least $1 - \delta$. Moreover, the matrix $SA$ can be computed in time $O(nd \log n)$. Our row count is nearly optimal up to logarithmic factors and significantly improves the result in [22], in which a number of rows super-linear in $d$, namely $m = \Omega(\epsilon^{-2} d^{1+\gamma})$ for $\gamma = \Theta(\sqrt{\frac{\log\log n}{\log d}})$, is required.
Moreover, we develop a novel analytical framework for $\ell_\infty$-guarantee regression that utilizes the Oblivious Coordinate-wise Embedding (OCE) property introduced in [31]. Our analysis is much simpler and more general than that of [22]. Leveraging this framework, we extend the $\ell_\infty$-guarantee regression result to dense sketching matrices for computing the fast tensor product of vectors.
1 Introduction
Linear regression, or the least-squares problem, is ubiquitous in numerical linear algebra, scientific computing, and machine learning. Given a tall skinny matrix $A \in \mathbb{R}^{n \times d}$ and a label vector $b \in \mathbb{R}^n$, the goal is to (approximately) compute an optimal solution $x^*$ that minimizes the loss $\|Ax - b\|_2$. For the regime where $n \gg d$, sketching is a popular approach to obtain an approximate solution quickly [8, 7]: the idea is to pick a random matrix $S \in \mathbb{R}^{m \times n}$ from a carefully-designed distribution, so that 1) $S$ can be efficiently applied to $A$, and 2) the row count satisfies $m \ll n$. Given these two guarantees, one can then solve the “sketched” regression problem:
and obtain a vector $x'$ such that $\|Ax' - b\|_2 \le (1 + \epsilon) \cdot \mathrm{OPT}$, where $\mathrm{OPT}$ denotes the optimal discrepancy between $b$ and vectors in the column space of $A$. Recent advances in sketching [7] show that one can design a matrix $S$ with $m = \mathrm{poly}(d/\epsilon)$ rows such that the sketched regression problem can be solved in time $O(\mathrm{nnz}(A)) + \mathrm{poly}(d/(\epsilon\delta))$, where $\mathrm{nnz}(A)$ denotes the number of nonzero entries of $A$ and $\delta$ is the failure probability.
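To make the sketch-and-solve recipe above concrete, here is a minimal Python/NumPy sketch (our own illustration with a dense Gaussian sketch, not the construction analyzed in this paper; the function and variable names are ours):

import numpy as np

def sketch_and_solve(A, b, m, seed=None):
    """Solve min_x ||SAx - Sb||_2 for a dense Gaussian sketch S with m rows."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    S = rng.standard_normal((m, n)) / np.sqrt(m)      # dense Gaussian sketching matrix
    x_sketch, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)
    return x_sketch

# Toy comparison against the exact least-squares solution.
rng = np.random.default_rng(0)
n, d = 4096, 20
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
x_prime = sketch_and_solve(A, b, m=400, seed=1)
print("l2 error:  ", np.linalg.norm(x_prime - x_star))
print("linf error:", np.linalg.norm(x_prime - x_star, np.inf))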
Unfortunately, modern machine learning places more and more emphasis on large, complex, and nonlinear models such as deep neural networks, so linear regression becomes less appealing as a model. However, it is still a very important subroutine in many deep learning and optimization frameworks, especially second-order methods for training neural networks [4, 32] or convex optimization [19, 20, 31, 15]. In these applications, one typically seeks a forward error guarantee, i.e., a bound on how close the approximate solution $x'$ is to the optimal solution $x^*$. A prominent example is Newton’s method: given the (possibly implicit) Hessian matrix $H$ and the gradient $g$, one wants to compute $H^{-1}g$. A common approach is to solve the regression $\min_x \|Hx - g\|_2$, in which one wants $\|x' - x^*\|_2$ or even $\|x' - x^*\|_\infty$ to be small. When the sketching matrix $S$ satisfies the so-called Oblivious Subspace Embedding (OSE) property [25], one can show that the approximate solution $x'$ is close to $x^*$ in the $\ell_2$ sense:
$\|x' - x^*\|_2 \le O(\epsilon) \cdot \|Ax^* - b\|_2 \cdot \|A^\dagger\|. \qquad (1)$
Unfortunately, $\ell_2$-closeness cannot characterize how well $x'$ approximates $x^*$ coordinate-wise, as $x^*$ can have a good spread of mass over all coordinates while $x' - x^*$ concentrates its mass on a few coordinates. Formally speaking, let $a \in \mathbb{R}^d$ be a fixed vector; then one can measure how far $x'$ deviates from $x^*$ along $a$ via Eq. (1): $|\langle a, x' - x^* \rangle| \le \|a\|_2 \cdot \|x' - x^*\|_2 \le O(\epsilon) \cdot \|a\|_2 \cdot \|Ax^* - b\|_2 \cdot \|A^\dagger\|$.
This bound is clearly too loose, as one would expect the deviation in a random direction to be only a $\frac{1}{\sqrt{d}}$ factor of the discrepancy. [22] shows that this intuition is indeed true when $S$ is chosen as the subsampled randomized Hadamard transform (SRHT) [17] (we will later refer to the following property as the $\ell_\infty$ guarantee):
$|\langle a, x' - x^* \rangle| \le \frac{\epsilon}{\sqrt{d}} \cdot \|a\|_2 \cdot \|Ax^* - b\|_2 \cdot \|A^\dagger\|. \qquad (2)$
However, their analysis is not tight, as they require a row count of $m = \Omega(\epsilon^{-2} d^{1+\gamma})$ for $\gamma = \Theta(\sqrt{\frac{\log\log n}{\log d}})$. Such a row count is super-linear in $d$ for essentially the whole interesting range of $d$, and is therefore worse than the row count required for $S$ to be a subspace embedding, for which only $O(d \log d)$ rows are required for constant success probability. In contrast, for random Gaussian matrices, the $\ell_\infty$ guarantee only requires a number of rows nearly linear in $d$. In addition to their sub-optimal row count, the analysis of [22] is also complicated: letting $U$ be an orthonormal basis of the column span of $A$, [22] has to analyze high-order moments of the matrix $U^\top S^\top S U$. This makes their analysis particularly hard to generalize to other dense sketching matrices beyond SRHT.
In this work, we present a novel framework for analyzing the $\ell_\infty$ guarantee induced by SRHT and, more generally, a large class of dense sketching matrices. Our analysis is arguably much simpler than that of [22], and it exposes the fundamental structure of sketching matrices that provides the $\ell_\infty$ guarantee: if any two columns of the sketching matrix have a small inner product with high probability, then the $\ell_\infty$ guarantee can be preserved. We then prove that the small pairwise column inner product is also closely related to the Oblivious Coordinate-wise Embedding (OCE) property introduced in [31]. More concretely, for any two fixed vectors $g, h \in \mathbb{R}^n$, we say the sketching matrix $S$ is OCE if $|\langle Sg, Sh \rangle - \langle g, h \rangle|$ is on the order of $\|g\|_2 \|h\|_2 / \sqrt{m}$ with high probability. This property has previously been leveraged for approximating the matrix-vector product between a dynamically-changing projection matrix and an online sequence of vectors in fast linear programming and empirical risk minimization algorithms [20, 15, 31, 23], as these algorithms need an $\ell_\infty$ bound on the matrix-vector product. One common theme shared by those applications and the $\ell_\infty$ guarantee is the use of dense sketching matrices, such as random Gaussian, the Alon-Matias-Szegedy sketch (AMS, [2]), or SRHT. This is in drastic contrast with the trending direction of using sparse matrices such as Count Sketch [5, 8] and OSNAP [21, 6], as they can be applied in (nearly) input-sparsity time.
In recent years, sketches that can be applied to the tensor product of matrices or vectors have gained popularity [3, 27, 10, 28, 9, 1, 35, 26, 36, 30, 29, 24], as they can speed up optimization tasks and large-scale kernel learning. We show that dense sketches for degree-2 tensors also provide the $\ell_\infty$ guarantee.
Theorem 1.1 (Nearly-optimal bound for dense sketching matrices).
Suppose $\epsilon, \delta \in (0, 1)$, and a matrix $A \in \mathbb{R}^{n \times d}$ and a vector $b \in \mathbb{R}^n$ are given. Let $S \in \mathbb{R}^{m \times n}$ be a subsampled randomized Hadamard transform (SRHT) matrix with $m = \epsilon^{-2} d \log^3(n/\delta)$ rows.
For any fixed vector $a \in \mathbb{R}^d$, we have
$|\langle a, x' - x^* \rangle| \le \frac{\epsilon}{\sqrt{d}} \cdot \|a\|_2 \cdot \|Ax^* - b\|_2 \cdot \|A^\dagger\|$
with probability at least $1 - \delta$, where $x' := \arg\min_x \|SAx - Sb\|_2$ and $x^* := \arg\min_x \|Ax - b\|_2$.
Remark 1.2.
The row count $m = \epsilon^{-2} d \log^3(n/\delta)$ is nearly optimal up to logarithmic factors, as the row count for $S$ to be an OSE is $\Omega(d)$ even for constant success probability. In comparison, [22] requires $m = \Omega(\epsilon^{-2} d^{1+\gamma})$ rows with $\gamma = \Theta(\sqrt{\frac{\log\log n}{\log d}})$, which is nearly linear in $d$ only when $d$ is at most polylogarithmic in $n$. In most applications, we care about the regime where $d$ is much larger than $\mathrm{polylog}(n)$, meaning that their row count is worse than ours in almost all meaningful scenarios.
The row count and $\ell_\infty$ guarantee obtained in Theorem 1.1 extend beyond SRHT; in fact, they hold for a range of dense sketching matrices including random Gaussian matrices, the AMS sketch [2], SRHT, and the subsampled randomized circulant transform (SRCT, see Definition 2.11). This is because our argument relies only on a structural condition that is satisfied by various dense sketches.
Our result can also be generalized to degree-2 Kronecker product regression, see Theorem B.1.
Roadmap.
In Section 2, we introduce the notation that we use and explain the key definitions and properties supporting our framework for $\ell_\infty$-guarantee regression. In Section 3, we introduce our framework by presenting a sufficient condition for a sketching matrix to give a good $\ell_\infty$ guarantee. In Section 4, we prove our main theorem by putting everything together. Finally, in Section 5, we summarize the main findings of this paper and compare them with previous work.
2 Preliminary
For a positive integer $n$, we define $[n] := \{1, 2, \dots, n\}$. For a vector $x$, we define $\|x\|_2 := (\sum_i x_i^2)^{1/2}$ and $\|x\|_\infty := \max_i |x_i|$. For a matrix $A$, we define $\|A\|$ to be the spectral norm of $A$ and $\|A\|_F$ to be the Frobenius norm of $A$. In general, the spectral norm satisfies $\|AB\| \le \|A\| \cdot \|B\|$ and $\|Ax\|_2 \le \|A\| \cdot \|x\|_2$. We use $A^\dagger$ to denote the Moore-Penrose pseudoinverse of a matrix $A \in \mathbb{R}^{n \times d}$: if $A = U\Sigma V^\top$ is its (thin) SVD (where $U \in \mathbb{R}^{n \times d}$ and $\Sigma, V \in \mathbb{R}^{d \times d}$, for $n \ge d$), then $A^\dagger := V\Sigma^{-1}U^\top$.
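As a quick sanity check of the SVD formula for the pseudoinverse (our own Python/NumPy illustration; the variable names are ours), one can verify it against numpy.linalg.pinv:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))                  # tall matrix, full column rank with probability 1
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
A_dagger = Vt.T @ np.diag(1.0 / sigma) @ U.T       # A^+ = V Sigma^{-1} U^T from the thin SVD
assert np.allclose(A_dagger, np.linalg.pinv(A))
assert np.allclose(A_dagger @ A, np.eye(5))        # left inverse, since A has full column rank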
We use $\mathbb{E}[\cdot]$ to denote the expectation and $\Pr[\cdot]$ to denote the probability. For a distribution $\mathcal{D}$ and a random variable $x$, we use $x \sim \mathcal{D}$ to denote that $x$ is drawn from the distribution $\mathcal{D}$. We use $\mathcal{N}(\mu, \sigma^2)$ to denote a Gaussian distribution with mean $\mu$ and variance $\sigma^2$. We say a random variable $x$ is a Rademacher random variable if $\Pr[x = 1] = \Pr[x = -1] = 1/2$; we also call it a sign random variable.
In addition to $O(\cdot)$ notation, for two functions $f, g$, we use the shorthand $f \lesssim g$ (resp. $f \gtrsim g$) to indicate that $f \le Cg$ (resp. $f \ge Cg$) for an absolute constant $C$. We use $f \eqsim g$ to mean $cf \le g \le Cf$ for constants $c$ and $C$. For two matrices $A$ and $B$, we use $A \otimes B$ to denote the Kronecker product, i.e., the $(i_1, i_2)$-th row and $(j_1, j_2)$-th column of $A \otimes B$ is $A_{i_1, j_1} B_{i_2, j_2}$. For two vectors $x \in \mathbb{R}^n$ and $y \in \mathbb{R}^m$, we use $x \otimes y$ to denote the tensor product of the vectors, in which $(x \otimes y)_{(i-1)m + j} = x_i y_j$.
2.1 Oblivious subspace embedding and coordinate-wise embedding
Definition 2.1 (Oblivious subspace embedding [25]).
We define the $(\epsilon, \delta)$-oblivious subspace embedding (OSE) as follows: suppose $\Pi$ is a distribution over matrices $S \in \mathbb{R}^{m \times n}$, where $m$ is a function of $n$, $d$, $\epsilon$, and $\delta$. Suppose that, for any fixed orthonormal basis $U \in \mathbb{R}^{n \times d}$, with probability at least $1 - \delta$, a matrix $S$ drawn from the distribution $\Pi$ has the property that the singular values of $SU$ lie in the range $[1 - \epsilon, 1 + \epsilon]$.
The oblivious coordinate-wise embedding (OCE) is implicitly used in [20, 15] and formally introduced in [31].
Definition 2.2 (Coordinate-wise embedding [31]).
We say a random matrix $S \in \mathbb{R}^{m \times n}$ satisfies coordinate-wise embedding if
In this paper, we mainly use property 3 of Definition 2.2. For convenience, we redefine OCE as follows:
Definition 2.3 (OCE).
Let $\beta > 0$ and $\delta \in (0, 1)$. We say a randomized matrix $S \in \mathbb{R}^{m \times n}$ satisfies $(\beta, \delta)$-OCE if
and the distribution of $S$ is oblivious to the fixed vectors $g$ and $h$.
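As a numerical illustration of the OCE property (our own sketch, using a dense Gaussian sketching matrix; the reference scale $\|g\|_2 \|h\|_2 / \sqrt{m}$ is the one suggested by the discussion in Section 1, not the formal bound of the definition):

import numpy as np

rng = np.random.default_rng(0)
n, m, trials = 2048, 256, 300
g, h = rng.standard_normal(n), rng.standard_normal(n)
scale = np.linalg.norm(g) * np.linalg.norm(h) / np.sqrt(m)
devs = []
for _ in range(trials):
    S = rng.standard_normal((m, n)) / np.sqrt(m)   # dense Gaussian sketch
    devs.append(abs((S @ g) @ (S @ h) - g @ h))
# The 99th-percentile deviation is a small constant multiple of the reference scale.
print(np.quantile(devs, 0.99) / scale)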
2.2 Sketching matrices
In this paper, we consider a list of dense sketching matrices.
Definition 2.4 (Random Gaussian matrix, folklore).
We say $S \in \mathbb{R}^{m \times n}$ is a random Gaussian matrix if all of its entries are sampled from $\mathcal{N}(0, 1/m)$ independently.
Definition 2.5 (AMS sketch matrix, [2]).
Let $h_1, \dots, h_m$ be random hash functions picked from a 4-wise independent hash family $\mathcal{H} = \{h : [n] \to \{-\frac{1}{\sqrt{m}}, +\frac{1}{\sqrt{m}}\}\}$. Then $S \in \mathbb{R}^{m \times n}$ is an AMS sketch matrix if we set $S_{i,j} = h_i(j)$.
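For concreteness, the following sketch (ours) builds an AMS matrix explicitly; it realizes the 4-wise independent family with a degree-3 polynomial hash over a prime field, and extracting the sign from the low-order bit is unbiased only up to $O(1/p)$, which suffices for illustration:

import numpy as np

def ams_sketch(n, m, seed=None):
    """Build an m x n AMS sketch matrix with entries +-1/sqrt(m).

    The signs in each row come from a degree-3 polynomial hash over Z_p (a standard
    4-wise independent family); taking the low-order bit as the sign is unbiased only
    up to O(1/p), which is fine for illustration purposes.
    """
    rng = np.random.default_rng(seed)
    p = (1 << 31) - 1                              # Mersenne prime 2^31 - 1
    j = np.arange(n, dtype=np.int64)
    S = np.empty((m, n))
    for i in range(m):
        a3, a2, a1, a0 = rng.integers(1, p, size=4)
        h = np.full(n, a3, dtype=np.int64)
        for a in (a2, a1, a0):                     # Horner evaluation of the degree-3 polynomial
            h = (h * j + a) % p
        S[i] = (2 * (h & 1) - 1) / np.sqrt(m)      # map the low-order bit to +-1/sqrt(m)
    return S

# Every column of the AMS sketch has unit norm.
S = ams_sketch(n=1024, m=128, seed=0)
print(np.allclose((S ** 2).sum(axis=0), 1.0))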
The following sketching matrices can utilize fast Fourier Transform (FFT) for efficient application to matrices.
Definition 2.6 (Subsampled randomized Hadamard transform (SRHT) [17, 26]).
The SRHT matrix $S \in \mathbb{R}^{m \times n}$ is defined as $S := \frac{1}{\sqrt{m}} P H D$, where each row of the sampling matrix $P \in \{0, 1\}^{m \times n}$ contains exactly one $1$ at a random position, $H \in \{-1, +1\}^{n \times n}$ is the Hadamard matrix, and $D \in \mathbb{R}^{n \times n}$ is a diagonal matrix with each diagonal entry being $+1$ or $-1$ with equal probability.
Remark 2.7.
Using fast Fourier transform (FFT) techniques (concretely, the fast Walsh-Hadamard transform), $S$ can be applied to a vector in time $O(n \log n)$.
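For concreteness, here is a small sketch (ours) that applies an SRHT to a vector in $O(n \log n)$ time via an iterative fast Walsh-Hadamard transform; it assumes $n$ is a power of two and uses the $\frac{1}{\sqrt{m}}$ normalization of Definition 2.6:

import numpy as np

def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = x.astype(float).copy()
    n, h = len(x), 1
    while h < n:
        x = x.reshape(-1, 2 * h)
        left, right = x[:, :h].copy(), x[:, h:].copy()
        x[:, :h], x[:, h:] = left + right, left - right
        x = x.reshape(n)
        h *= 2
    return x

def srht_apply(x, m, seed=None):
    """Apply S = (1/sqrt(m)) P H D to x in O(n log n) time (H is the +-1 Hadamard matrix)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    signs = rng.choice([-1.0, 1.0], size=n)        # diagonal matrix D of random signs
    rows = rng.integers(0, n, size=m)              # sampling matrix P, one 1 per row
    return fwht(signs * x)[rows] / np.sqrt(m)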
Definition 2.8 (Tensor subsampled randomized Hadamard transform (TensorSRHT) [26]).
The TensorSRHT $S \in \mathbb{R}^{m \times n^2}$ is defined as $S := \frac{1}{\sqrt{m}} P (H D_1 \otimes H D_2)$, where each row of $P \in \{0, 1\}^{m \times n^2}$ contains only one $1$ at a random coordinate, so one can view $P$ as a sampling matrix. $H$ is a Hadamard matrix, and $D_1$, $D_2$ are two independent $n \times n$ diagonal matrices with diagonals that are each independently set to a Rademacher random variable (uniform in $\{-1, +1\}$).
Remark 2.9.
By leveraging the FFT algorithm in the sketch space, $S(x \otimes y)$ can be computed in time $O(n \log n + m)$.
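The following sketch (ours, reusing the fwht helper from the SRHT example above and the same unit-column normalization) applies a TensorSRHT to $x \otimes y$ without ever forming the length-$n^2$ tensor product:

import numpy as np  # reuses the fwht helper defined in the SRHT example above

def tensor_srht_apply(x, y, m, seed=None):
    """Apply a TensorSRHT to x ⊗ y without materializing the n^2-dimensional vector.

    Uses (H D1 x) ⊗ (H D2 y) = (H ⊗ H)(D1 ⊗ D2)(x ⊗ y): sample m coordinate pairs
    (i, j) of the transformed tensor product and rescale by 1/sqrt(m).
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    d1 = rng.choice([-1.0, 1.0], size=n)
    d2 = rng.choice([-1.0, 1.0], size=n)
    u, v = fwht(d1 * x), fwht(d2 * y)              # two independently randomized Hadamard transforms
    i = rng.integers(0, n, size=m)                 # each sketch row samples one coordinate (i, j)
    j = rng.integers(0, n, size=m)                 # of the n^2-dimensional transformed vector
    return u[i] * v[j] / np.sqrt(m)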
Since storing and generating a Hadamard matrix can be expensive, we consider a cheaper and more space-efficient way to obtain an FFT-friendly sketching matrix via the circulant transform.
Definition 2.10 (Circulant matrix).
A circulant matrix $C \in \mathbb{R}^{n \times n}$ generated by a vector $g \in \mathbb{R}^n$ is an $n \times n$ matrix whose row vectors all consist of the same elements as $g$, where each row vector is rotated one element to the right relative to the preceding row vector.
Definition 2.11 (Subsampled randomized circulant transform (SRCT)).
Let $g \in \{-1, +1\}^n$ be a random vector whose elements are i.i.d. Rademacher random variables.
Also, let $P \in \{0, 1\}^{m \times n}$ be a random matrix in which each row contains a $1$ at a uniformly distributed coordinate and zeros elsewhere.
Let $C \in \{-1, +1\}^{n \times n}$ be the circulant matrix (see Definition 2.10) generated by $g$, and let $D \in \mathbb{R}^{n \times n}$ be a diagonal matrix whose diagonal elements are i.i.d. Rademacher random variables.
Then, the subsampled randomized circulant transform is defined as $S := \frac{1}{\sqrt{m}} P C D \in \mathbb{R}^{m \times n}$.
Definition 2.12 (Tensor subsampled randomized circulant transform (TensorSRCT)).
Let $g \in \{-1, +1\}^n$ be a random vector whose elements are i.i.d. Rademacher random variables.
Also, let $P \in \{0, 1\}^{m \times n^2}$ be a random matrix in which each row contains a $1$ at a uniformly distributed coordinate and zeros elsewhere.
Let $C \in \{-1, +1\}^{n \times n}$ be the circulant matrix (see Definition 2.10) generated by $g$.
Let $D_1$ and $D_2$ be two independent $n \times n$ diagonal matrices whose diagonal elements are i.i.d. Rademacher random variables.
Then, the tensor subsampled randomized circulant transform is defined as $S := \frac{1}{\sqrt{m}} P (C D_1 \otimes C D_2) \in \mathbb{R}^{m \times n^2}$.
Remark 2.13.
Similar to SRHT, we can utilize the fast Fourier transform with a circulant matrix, since multiplication by a circulant matrix is a circular convolution. SRCT can be applied to a vector of length $n$ in $O(n \log n)$ time, and TensorSRCT can be applied to $x \otimes y$ in $O(n \log n + m)$ time.
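A minimal sketch (ours) of applying an SRCT via the FFT; the first-column convention for the circulant matrix is an implementation choice (the row-rotation convention of Definition 2.10 only permutes the rows of $C$):

import numpy as np

def srct_apply(x, m, seed=None):
    """Apply an SRCT S = (1/sqrt(m)) P C D to x in O(n log n) time.

    C is the circulant matrix generated by a Rademacher vector g, so C @ z is a
    circular convolution, computed here with the FFT.
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    g = rng.choice([-1.0, 1.0], size=n)            # generator of the circulant matrix C
    d = rng.choice([-1.0, 1.0], size=n)            # diagonal of D
    z = np.fft.ifft(np.fft.fft(g) * np.fft.fft(d * x)).real   # C D x as a circular convolution
    rows = rng.integers(0, n, size=m)              # sampling matrix P
    return z[rows] / np.sqrt(m)

# Sanity check of the convolution identity on a tiny explicit circulant matrix.
n = 8
rng = np.random.default_rng(1)
g, z = rng.choice([-1.0, 1.0], size=n), rng.standard_normal(n)
C = np.array([[g[(i - j) % n] for j in range(n)] for i in range(n)])   # first-column convention
assert np.allclose(C @ z, np.fft.ifft(np.fft.fft(g) * np.fft.fft(z)).real)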
2.3 OSE property of dense sketches
An important condition for sketch-and-solve regressions is OSE. We focus particularly on SRHT, SRCT, and their tensor variants.
Lemma 2.14 (Lemma 2.11 in [26]).
Let be an SRHT matrix defined in Definition 2.6. If , then is an -OSE.
Lemma 2.15 (Lemma 2.12 in [26]).
Let be a TensorSRHT matrix defined in Definition 2.8. If , then is an -OSE for degree- tensors.
SRCT requires a larger row count than SRHT because the Gram matrix of the circulant transform equals a multiple of the identity only in expectation (for the Hadamard transform this holds exactly).
2.4 Probability tools
Lemma 2.18 (Khintchine’s inequality [16]).
Let $\sigma_1, \dots, \sigma_n$ be i.i.d. Rademacher random variables and $a_1, \dots, a_n$ be real numbers. Then, there exist constants $C_1, C_2 > 0$ such that
Lemma 2.19 (Hoeffding bound, [13]).
Let be independent, zero-mean random variables with . Then,
Lemma 2.20 (Lemma 1 on page 1325 of Laurent and Massart [18] ).
Let $X \sim \chi^2_k$ be a chi-squared distributed random variable with $k$ degrees of freedom, i.e., a sum of squares of $k$ independent Gaussians, each with zero mean and unit variance. Then,
Lemma 2.21 (Hanson-Wright inequality [14]).
Let $x \in \mathbb{R}^n$ denote a random vector with independent, zero-mean, bounded entries. Let $A$ be an $n \times n$ matrix. Then, for every $t \ge 0$,
Lemma 2.22 (Matrix Chernoff bound, Theorem 2.2 in [33]).
Let be a finite set of positive-semidefinite matrices with dimension . Suppose that . Sample uniformly at random from without replacement. We define and as follows: and . Then,
for ,
for .
3 $\ell_\infty$ guarantee via OCE
In this section, we present a sufficient condition for a sketching matrix to give a good $\ell_\infty$ guarantee: given a pair of fixed vectors $g$ and $h$, if the sketching matrix approximately preserves their inner product with high probability, then it gives a good $\ell_\infty$ guarantee for regression.
Lemma 3.1 (Core lemma).
Let $A \in \mathbb{R}^{n \times d}$ be a fixed matrix and let $U$ denote an orthonormal basis of the column span of $A$. Let $S$ be a sketching matrix that satisfies two properties (an OSE guarantee and an OCE guarantee):
For any fixed vectors and with , we have
holds with probability at least .
Proof.
With probability 1, the matrix $SA$ has linearly independent columns.
Therefore, $(SA)^\dagger$ is
where the first step follows from the fact that $SA$ has full column rank, the second step follows from the SVD of $SA$, the third step follows from the properties of the SVD factors, and the last step follows from the fact that the left singular factor is orthogonal, again by the properties of the SVD.
For convenience, we define as follows:
In the next few paragraphs, we will explain how to upper bound with high probability.
Since is a -OSE (Definition 2.1), we know
We condition on this event. It follows that
where the first step follows from the fact that the orthonormal factor acts as a rotation, the second step follows from sub-multiplicativity, and the third step follows from the OSE event together with the rotation property.
Hence, we have
(3)
Let us define a vector
By the definition of OCE (Definition 2.3), we have that
where gives us and .
Thus, the above bound translates to
(4)
as desired. ∎
We are now ready to prove the $\ell_\infty$ guarantee given the inner product bound of Lemma 3.1.
Theorem 3.2.
Suppose $A \in \mathbb{R}^{n \times d}$ has full column rank and $b \in \mathbb{R}^n$. Let $S$ be a sketching matrix satisfying the conditions in Lemma 3.1. For any fixed vector $a \in \mathbb{R}^d$, we have
holds with probability at least $1 - \delta$, where $x' := \arg\min_x \|SAx - Sb\|_2$ and $x^* := \arg\min_x \|Ax - b\|_2$.
Proof.
Since $A$ has full column rank, we have that $x^* = A^\dagger b$. Similarly, $SA$ has full column rank with probability 1; therefore $x' = (SA)^\dagger S b$ and $(SA)^\dagger SA = I$. Thus, we have
(5)
where $U$ is an orthonormal basis for the column span of $A$. It is well known that $b = UU^\top b + U_\perp U_\perp^\top b$, where $U_\perp$ is an orthonormal basis for the orthogonal complement of the column span of $A$. To maximize the above expression, we take the extremal choice; thus, bounding Eq. (3) is equivalent to considering
the inequality holds with probability at least by Lemma 3.1. Finally, note that since , we have that and we have proved
∎
Note that we only require the OSE property with constant accuracy, and the $\epsilon$ dependence follows from the row count of the OCE.
3.1 High probability bound for OCE
In this section, we provide a unified framework for proving the high-probability bound of OCE. Our analysis exploits the fact that the dense sketching matrices we consider can all be generated by first picking a set of fresh random signs and then forming the sketching matrix according to the corresponding distribution.
We state the key assumptions on dense sketching matrices that are sufficient for the OCE property.
Assumption 3.3.
Let $S \in \mathbb{R}^{m \times n}$ be a dense sketching matrix satisfying the following two assumptions:
-
•
Pairwise inner product bound:
-
•
Column norm bound:
for all .
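As a quick numerical sanity check of Assumption 3.3 (our own illustration; the $\sqrt{\log(n)/m}$ reference scale for the pairwise inner products is an assumption suggested by Lemma 3.7, not a statement of the exact bound), one can build a small SRHT explicitly and inspect its Gram matrix:

import numpy as np

def hadamard(n):
    """Sylvester construction of the n x n (+-1) Hadamard matrix; n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

rng = np.random.default_rng(0)
n, m = 1024, 256
D = rng.choice([-1.0, 1.0], size=n)
rows = rng.integers(0, n, size=m)
S = hadamard(n)[rows] * D / np.sqrt(m)             # explicit S = (1/sqrt(m)) P H D
G = S.T @ S
print("column norm error:", np.abs(np.diag(G) - 1.0).max())              # column norm bound
print("max pairwise inner product:", np.abs(G - np.diag(np.diag(G))).max(),
      "vs sqrt(log(n)/m) =", np.sqrt(np.log(n) / m))                     # pairwise inner product bound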
Lemma 3.4.
Let $S \in \mathbb{R}^{m \times n}$ be a dense sketching matrix that meets Assumption 3.3, and let $g$ and $h$ be two fixed vectors. Then, the following holds:
holds with probability at least .
Proof.
We can rewrite as follows:
where the first step follows from the fact that the random signs are independent Rademacher random variables (each with mean $0$ and second moment $1$), and the second step follows from separating the diagonal and off-diagonal terms.
We will focus on bounding the quantity off-diag, as diag can be handled in a rather simple fashion.
We define matrix and as follows:
We define the corresponding off-diagonal matrix by removing the diagonal entries.
By applying Hanson-Wright inequality (Lemma 2.21), we have
We can upper bound and .
where the first step follows from , the second step follows from the definition of , the third step follows from the definition of and , and the fourth step follows from is rank 1 as .
It remains to obtain a bound on . Note that for any column ,
where the first step follows from the fact that random signs do not change the magnitude of the inner product and the second step follows from the definition of and .
Since meets Assumption 3.3, we have that with probability at least ,
Conditioning on the above event, we have that
choosing , we can show that
To bound the term diag, note that due to Assumption 3.3, we have with probability at least , .
Conditioning on this event, we have
where the last step is by the Cauchy-Schwarz inequality. Note that this term is subsumed by the off-diagonal bound.
Taking a union bound over all events, we have that
3.2 Inner product bound for SRHT and SRCT
We will show that SRHT and SRCT satisfy Assumption 3.3. Before proving the pairwise inner product bound, we state a general property to characterize these sketching matrices. This key property will be used in the later proof.
Definition 3.5 (Sign structure).
For any sketching matrix, we say it has “Sign structure” if the following properties hold
-
•
, for all .
-
•
and are independent for any .
-
•
for all and .
Lemma 3.6.
Both SRHT and SRCT satisfy Definition 3.5.
Proof.
This follows directly from the definitions of the two sketching matrices. ∎
Lemma 3.7 (SRHT and SRCT).
Let $S$ be any sketching matrix that satisfies Definition 3.5. Then, we have
Proof.
Fix a pair of indices and we define as follows:
The Gram matrix is , where
where the first step follows from the definition of , the second step follows from rewriting , the third step follows from the definition of matrix multiplication, and the last step follows from and .
Note that has eigenvalues 0 and , i.e.,
Since and are independent Rademacher random variables, we have
Thus, we know
(6)
Consequently, we have
where the first step follows from the definition of , the second step follows from the fact that for a constant , the third step follows from Eq. (6), and the last step follows from simple algebra.
Let be the -th eigenvalue of . By matrix Chernoff bound (Lemma 2.22 with ), for any , we have
This means with probability at least , the eigenvalues of are between and consequently, the eigenvalues of are between . Let us choose , we have
The proof can be wrapped up by noting that
the spectral norm of this matrix is bounded as above, and taking a union bound over all pairs of columns, we have
∎
3.3 Column norm bound for SRHT and SRCT
In this section, we prove the column norm bound for SRHT and SRCT. In particular, their columns are unit vectors. In Appendix D, we prove that for a random Gaussian matrix, the squared column norm is a random variable that concentrates around 1 with high probability.
Lemma 3.8 (SRHT and SRCT).
Let be an SRHT matrix or SRCT matrix.
Then, for any , we have
Proof.
The proof directly follows from the definition.
For SRHT, recall $S = \frac{1}{\sqrt{m}} P H D$: every entry of $H$ is in $\{-1, +1\}$, and $D$ applies a random sign that does not change the norm. The matrix $P$ subsamples $m$ rows and each entry is rescaled by $\frac{1}{\sqrt{m}}$, so the squared column norm is $m \cdot \frac{1}{m} = 1$.
For SRCT, every entry of the circulant matrix $C$ is likewise in $\{-1, +1\}$. Thus, by the same argument, SRCT has unit column vectors. ∎
4 Putting things together
Now, we are ready to present the proof of Theorem 1.1.
5 Conclusion
In this paper, we study sketching-based regression algorithms with an $\ell_\infty$ guarantee. We show that SRHT with $m = \epsilon^{-2} d \log^3(n/\delta)$ rows provides the desired $\ell_\infty$ guarantee, improving upon the $\Omega(\epsilon^{-2} d^{1+\gamma})$ rows of [22]. This is nearly optimal up to logarithmic factors. Our proof adapts the oblivious coordinate-wise embedding property introduced in [31] in a novel way. We also greatly extend the reach of the $\ell_\infty$ guarantee to degree-2 Kronecker product regression via the TensorSRHT matrix.
In addition, we introduce the SRCT and TensorSRCT matrices. These matrices can be applied in a fashion similar to SRHT, and they exhibit OCE behavior similar to that of SRHT.
Our result provides an elegant way to integrate fast, sketching-based regression solvers into optimization procedures, in particular second-order methods. The regression problem in each iteration can be solved in time nearly linear in the input size, and the $\ell_\infty$ guarantee comes in handy when analyzing convergence with approximate steps. It also gives improved generalization bounds for approximate regression via SRHT [22].
Appendix
Roadmap.
In Section A, we introduce the fundamental definitions and properties that we will use in the appendix. In Section B, we analyze and develop the $\ell_\infty$ guarantee for Kronecker product regression. In Section C, we introduce the strong JL moment property and prove that both the circulant transform and the tensor circulant transform satisfy it. In Section D, we focus on studying AMS, random Gaussian, and SRHT, and show that the inner product between any pair of distinct columns of these dense sketching matrices is bounded.
Appendix A Tools for matrices and probability
For matrices $A$ and $B$ of the same dimensions, we use $A \circ B$ to denote the entry-wise (Hadamard) product, i.e., the matrix whose $(i, j)$-th entry is $A_{i,j} B_{i,j}$.
Lemma A.1 (Markov’s inequality).
If $X$ is a non-negative random variable and $t > 0$, then we have
Definition A.2 (Sub-exponential distribution ([12])).
We say with parameters , if:
Lemma A.3 (Tail bound for sub-exponential distribution ([12])).
Let and . Then,
Claim A.4.
For every matrix
Appendix B Kronecker product regression with $\ell_\infty$ guarantee
In this section, we study the $\ell_\infty$ guarantee for Kronecker product regression. Given two matrices $A_1, A_2 \in \mathbb{R}^{n \times d}$ and a label vector $b \in \mathbb{R}^{n^2}$, the goal is to solve the regression $\min_x \|(A_1 \otimes A_2)x - b\|_2$. This problem can be easily generalized to products of more than two matrices, and fast, input-sparsity-time algorithms have been studied in a line of works [10, 28, 9, 11, 24].
B.1 Main result
Theorem B.1 (Tensor version of Theorem 1.1).
Suppose $\epsilon, \delta \in (0, 1)$, matrices $A_1, A_2 \in \mathbb{R}^{n \times d}$, and a vector $b \in \mathbb{R}^{n^2}$ are given, where $\otimes$ denotes the Kronecker product for matrices and the tensor product for vectors. Let $S$ be a
-
•
tensor subsampled randomized Hadamard transform matrix (TensorSRHT) with rows or
-
•
tensor subsampled randomized circulant transform matrix (TensorSRCT) with rows.
For
and
and any fixed ,
with probability .
Proof.
Recall that we require $S$ to be an OSE and an OCE for it to give the $\ell_\infty$ guarantee.
For OCE, it follows from Lemma B.3.
Remark B.2.
The slightly different guarantee arises because the small dimension becomes $d^2$ instead of $d$. Let us discuss the utility of using these sketching matrices for solving the regression. As discussed in Definitions 2.8 and 2.12, each column of $S(A_1 \otimes A_2)$ can be computed in time $O(n \log n + m)$ instead of $O(n^2)$; thus the total running time of applying $S$ to $A_1 \otimes A_2$ is $O(d^2(n \log n + m))$. Similarly, $Sb$ can be computed in time $O(n^2 \log n)$. The sketched regression can then be solved in time polynomial in $d/\epsilon$. Prior works mainly focus on input-sparsity sketches [10], importance sampling [9], iterative methods [11], or more complicated sketches that scale well to products of many matrices and in the dynamic setting [24]. To the best of our knowledge, this is the first $\ell_\infty$ guarantee for Kronecker product regression (with two matrices).
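To illustrate the running-time discussion above, the following sketch (ours, reusing the fwht helper from the SRHT example in Section 2) computes $S(A_1 \otimes A_2)$ for a TensorSRHT column by column, without ever forming the $n^2 \times d^2$ Kronecker product:

import numpy as np  # reuses the fwht helper from the SRHT example in Section 2

def tensor_srht_sketch_kron(A1, A2, m, seed=None):
    """Compute S (A1 kron A2) for a TensorSRHT S without forming the Kronecker product.

    Column (i, j) of A1 kron A2 equals A1[:, i] ⊗ A2[:, j]; row r of the sketch samples
    coordinate (i_rows[r], j_rows[r]) of each transformed tensor-product column.
    """
    rng = np.random.default_rng(seed)
    n, d1 = A1.shape
    d2 = A2.shape[1]
    s1 = rng.choice([-1.0, 1.0], size=n)
    s2 = rng.choice([-1.0, 1.0], size=n)
    i_rows = rng.integers(0, n, size=m)
    j_rows = rng.integers(0, n, size=m)
    U = np.column_stack([fwht(s1 * A1[:, k]) for k in range(d1)])   # H D1 A1, column by column
    V = np.column_stack([fwht(s2 * A2[:, k]) for k in range(d2)])   # H D2 A2, column by column
    # Entry (r, (i, j)) of the sketch is U[i_rows[r], i] * V[j_rows[r], j] / sqrt(m).
    return np.einsum('ri,rj->rij', U[i_rows, :], V[j_rows, :]).reshape(m, d1 * d2) / np.sqrt(m)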
B.2 Oblivious coordinate-wise embedding for TensorSRHT and TensorSRCT
Lemma B.3 (TensorSRHT and TensorSRCT, Tensor version of Lemma 3.7).
Let be TensorSRHT or TensorSRCT. Then, is an OCE with parameter .
Proof.
To prove this result, we show that TensorSRHT and TensorSRCT satisfy Definition 3.5.
For TensorSRHT, recall $S = \frac{1}{\sqrt{m}} P (H D_1 \otimes H D_2)$: since $H$ is a Hadamard matrix and $D_1, D_2$ are just diagonal matrices of random signs, all entries of $H D_1 \otimes H D_2$ are in $\{-1, +1\}$. As $P$ is a row sampling matrix and we rescale each entry by $\frac{1}{\sqrt{m}}$, each entry of $S$ is in $\{-\frac{1}{\sqrt{m}}, +\frac{1}{\sqrt{m}}\}$. Consider two entries in the same row but in two different columns. If one column is generated from a pair of columns of $H D_1$ and $H D_2$ disjoint from the pair generating the other, then it is clear that the two signs are independent. Otherwise, suppose one entry is generated from a pair of columns sharing exactly one index with the pair generating the other; then the two signs are again independent, since conditioned on the shared sign, they are determined by two fresh independent signs. Finally, we need to verify that each entry is a random sign; this is trivially true since the product of two random signs is still a random sign. For TensorSRCT, the argument is exactly the same.
Appendix C SRCT and TensorSRCT: OSE via strong JL moment property
In this section, we prove that both SRCT and TensorSRCT are OSEs. We prove this property via the strong JL moment property [1]. This gives a worse row count compared to that of SRHT and TensorSRHT. We believe that these two families of distributions should have similar row counts for an OSE, and we leave closing the gap between them as a major open problem.
C.1 Notations
To make the notation less heavy, we will use $\|X\|_{L^t}$ for the $t$-th moment of a random variable $X$. This is formally defined below.
Definition C.1 (-th moment).
For every integer $t \ge 1$ and any random variable $X$, we write $\|X\|_{L^t} := (\mathbb{E}[|X|^t])^{1/t}$.
Note that $\|X + Y\|_{L^t} \le \|X\|_{L^t} + \|Y\|_{L^t}$
for any random variables $X$ and $Y$, by the Minkowski inequality.
C.2 Strong JL moment property
We show that both SRCT (see Definition 2.11) and TensorSRCT (see Definition 2.12) satisfy the so-called strong JL moment property. The strong JL moment property is one of the core properties that imply a sketching matrix has the subspace embedding property [25].
Definition C.2 (Strong JL moment property [1]).
For every , we say a distribution over random matrices has the Strong -JL Moment Property when
and
for all , and every integer .
Given a distribution with the strong JL moment property, it is well known that such a distribution provides an OSE.
Lemma C.3 (Lemma 11 of [1]).
To prove that SRCT (see Definition 2.11) and TensorSRCT (see Definition 2.12) satisfy the strong JL moment property, we prove that a more general class of matrices satisfies it.
More precisely, let be a positive integer and
be independent matrices, each with diagonal entries given by independent Rademacher variables.
Let and be a random sampling matrix in which each row contains exactly one uniformly distributed nonzero element which has value one.
Then, we prove that the matrix
satisfies the strong JL moment property, where $C$ is a circulant matrix (see Definition 2.10) generated by a random vector whose elements are Rademacher variables.
In order to prove this result we need a couple of lemmas. The first lemma can be seen as a version of Khintchine’s Inequality (see Lemma 2.18) for higher order chaos.
Lemma C.4 (Lemma 19 in [1]).
Let and . Let be independent vectors each satisfying the Khintchine’s inequality (see Lemma 2.18):
for and any vector .
Let be a tensor in . Then,
for .
Viewing as a vector, then
for .
Proof.
The proof will be by induction on .
Base case: For , the result is by the assumption that the vectors satisfy Khintchine’s inequality.
Inductive case: Assume that the result is true for every value up to .
Let
(7)
We then pull it out of the left-hand term in the lemma:
where the first step follows from Eq. (7), the second step follows from the inductive hypothesis, the third step follows from the definition of , the fourth step follows from the triangle inequality, the fifth step follows from the definition of .
It remains to bound
by Khintchine’s inequality, which finishes the induction step and hence the proof. ∎
The next lemma we will be using is a type of Rosenthal inequality based on first principles. It mixes large and small moments of random variables in an intricate way. For completeness, we include a proof here.
Lemma C.5 (Properties of random variables with -moment, Lemma 20 in [1]).
There exists a universal constant $C$ such that the following holds: if $X_1, \dots, X_n$ are independent non-negative random variables with finite $t$-th moment, then
Proof.
Throughout these calculations, all unnamed constants are universal constants.
(8)
where the first step follows from symmetrization, the second step follows from Khintchine's inequality (see Lemma 2.18), the third step follows from non-negativity, the fourth step follows from the Cauchy-Schwarz inequality, and the last step follows from the triangle inequality.
This implies that the quantity is at most the largest root of the corresponding quadratic.
Solving this quadratic inequality gives
which completes the proof. ∎
C.3 SRCT and TensorSRCT satisfy strong JL moment property
We can now prove that SRCT (see Definition 2.11) and TensorSRCT (see Definition 2.12) have the strong JL moment property.
Theorem C.6.
There exists a universal constant , such that, the following holds.
Let . Let be independent diagonal matrices with independent Rademacher variables.
We define , and , where each is a circulant matrix generated by an independent Rademacher random vector. Let be a row sampling matrix that has exactly one nonzero per row. Let .
Let be any vector with and .
Then,
where .
There exists a universal constant , such that, by setting
we get that has -strong JL moment property.
Proof.
Throughout the proof , and will denote universal constants.
For every , we let be the random variable that says which coordinates the -th row of samples.
We define the random variable
We note that since the underlying random signs are independent, the summands are conditionally independent given the sampled coordinates; that is, if we fix the sampled coordinates, then the summands are independent.
Then, we obtain the following inequality:
where the first step follows from Definition C.1, the second step follows from Lemma C.5, the third step follows from the triangle inequality, and the last step follows from the Cauchy-Schwarz inequality.
Note that each row of the transform is generated by taking the tensor product of independent Rademacher random vectors; we can thus view the row vector itself as a Rademacher random vector of the corresponding length. Thus,
(9)
where the first step follows from the definition of the expected value, , the second step follows from expanding all the possibilities, the third step follows from simple algebra, and the last step follows from the definition of .
To bound , we could show
where the first step follows from the definition of , the second step follows from each row of is independent Rademacher vector, therefore , and the last step follows from Lemma C.4.
We then bound the maximum using a sufficiently high-order sum:
where the first step follows from Definition C.1, the second step follows from is non-negative, and the last step follows from .
This gives us that
(10)
which finishes the first part of the proof.
We want to choose the parameter as follows:
According to the above choice, we know that the following condition holds:
Hence,
For all , we then get that
where the first step follows from Eq. (10), and the second step follows from choice of , the third step follows from simple algebra, and the last step follows from and (since ) .
This finishes the proof. ∎
As two corollaries, we have that SRCT and TensorSRCT are OSEs, albeit with a somewhat larger row count than SRHT and TensorSRHT.
Corollary C.7 (SRCT is an OSE).
Let be an SRCT matrix with , then is an -OSE.
Corollary C.8 (TensorSRCT is an OSE).
Let be a TensorSRCT matrix with , then is an -OSE.
Appendix D Gaussian and AMS
In this section, we prove that both random Gaussian matrices and AMS matrices satisfy OCE with good parameters. Combined with the fact that they are OSEs, one can derive the $\ell_\infty$ guarantee for them.
D.1 OSE property of random Gaussian and AMS
The OSE property for these two distributions is folklore. For a proof, see, e.g., [34].
Lemma D.1.
Let be a random Gaussian matrix defined in Def. 2.4. If , then is an -OSE.
Lemma D.2.
Let be an AMS matrix defined in Def. 2.5. If , then is an -OSE.
D.2 OCE property of random Gaussian and AMS
In this section, we prove the OCE property of random Gaussian and AMS matrices. We start with the pairwise inner product bound for these two distributions. For the column norm bound, AMS has unit columns, and we prove the bound for random Gaussian matrices.
Lemma D.3 (Gaussian pairwise inner product bound, Lemma B.18 in [31]).
Let be a random Gaussian matrix (Definition 2.4).
Then, we have:
Proof.
Note that for $i \neq j$, the $i$-th and $j$-th columns of $S$ are two independent Gaussian vectors; denote them by $u$ and $v$.
Then, we have for any ,
where the first step follows from where , and for any .
This implies that $\langle u, v \rangle$ is a sub-exponential random variable (SE is shorthand for sub-exponential).
Thus, we have
by sub-exponential concentration Lemma A.3, we have
for . Picking , we have
Taking a union bound over all $i$ and $j$, we complete the proof. ∎
Lemma D.4 (AMS pairwise inner product bound).
Let $S$ be an AMS matrix (Definition 2.5), whose entries are generated by 4-wise independent Rademacher random variables rescaled by $\frac{1}{\sqrt{m}}$.
Then, we have:
Proof.
Note that for any fixed $i \neq j$, the two corresponding sign variables are independent. By the Hoeffding inequality (Lemma 2.19), we have
where the second step follows from simple algebra ().
Choosing , we have
and taking a union bound over all pairs of columns gives the desired result. ∎
Lemma D.5 (Gaussian column norm bound).
Let be a random Gaussian matrix.
Then, for any , we have
Proof.
For any column $j$, note that $\|S_{*,j}\|_2^2 = \sum_{i=1}^m S_{i,j}^2$, where the $S_{i,j}$ are independent Gaussians, each with zero mean and variance $\frac{1}{m}$.
By Lemma 2.20, we have
Setting , we have
and the proof is concluded by a union bound over all columns. ∎
We conclude that random Gaussian and AMS matrices are OCEs.
Lemma D.6 (Gaussian OCE).
Let be a random Gaussian matrix, then is a -OCE.
Proof.
Lemma D.7 (AMS OCE).
Let be an AMS matrix, then is a -OCE.
Proof.
The proof is similar to that of Lemma D.6. The column norm bound follows from the definition. ∎
References
- AKK+ [20] Thomas D Ahle, Michael Kapralov, Jakob BT Knudsen, Rasmus Pagh, Ameya Velingker, David P Woodruff, and Amir Zandieh. Oblivious sketching of high-degree polynomial kernels. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 141–160. SIAM, 2020.
- AMS [96] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. In Proceedings of the twenty-eighth annual ACM symposium on Theory of computing, pages 20–29, 1996.
- ANW [14] Haim Avron, Huy Nguyen, and David Woodruff. Subspace embeddings for the polynomial kernel. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2258–2266. 2014.
- BPSW [21] Jan van den Brand, Binghui Peng, Zhao Song, and Omri Weinstein. Training (overparametrized) neural networks in near-linear time. In ITCS, 2021.
- CCFC [02] Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. In International Colloquium on Automata, Languages, and Programming (ICALP), pages 693–703. Springer, 2002.
- Coh [16] Michael B Cohen. Nearly tight oblivious subspace embeddings by trace inequalities. In Proceedings of the twenty-seventh annual ACM-SIAM symposium on Discrete algorithms, pages 278–287. SIAM, 2016.
- CSWZ [23] Yeshwanth Cherapanamjeri, Sandeep Silwal, David P. Woodruff, and Samson Zhou. Optimal algorithms for linear algebra in the current matrix multiplication time. In Proceedings of the 2023 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2023.
- CW [13] Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and regression in input sparsity time. In Symposium on Theory of Computing Conference(STOC), pages 81–90, 2013.
- DJS+ [19] Huaian Diao, Rajesh Jayaram, Zhao Song, Wen Sun, and David Woodruff. Optimal sketching for kronecker product regression and low rank approximation. Advances in neural information processing systems, 32, 2019.
- DSSW [18] Huaian Diao, Zhao Song, Wen Sun, and David Woodruff. Sketching for kronecker product regression and p-splines. In International Conference on Artificial Intelligence and Statistics, pages 1299–1308. PMLR, 2018.
- FFG [22] Matthew Fahrbach, Thomas Fu, and Mehrdad Ghadiri. Subquadratic kronecker regression with applications to tensor decomposition. In Thirty-Sixth Conference on Neural Information Processing Systems, NeurIPS’22, 2022.
- FKZ [11] Sergey Foss, Dmitry Korshunov, and Stan Zachary. An Introduction to Heavy-Tailed and Subexponential Distributions. Springer series in operations research. Springer, New York, NY, 1st ed edition, 2011.
- Hoe [63] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. In Journal of the American Statistical, pages 13–30. 1963.
- HW [71] David Lee Hanson and Farroll Tim Wright. A bound on tail probabilities for quadratic forms in independent random variables. The Annals of Mathematical Statistics, 42(3):1079–1083, 1971.
- JSWZ [21] Shunhua Jiang, Zhao Song, Omri Weinstein, and Hengjie Zhang. A faster algorithm for solving general lps. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 823–832, 2021.
- Khi [23] Aleksandr Khintchine. Über dyadische brüche. Mathematische Zeitschrift, 18(1):109–116, 1923.
- LDFU [13] Yichao Lu, Paramveer Dhillon, Dean P Foster, and Lyle Ungar. Faster ridge regression via the subsampled randomized hadamard transform. In Advances in neural information processing systems (NIPS), pages 369–377, 2013.
- LM [00] Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pages 1302–1338, 2000.
- LS [14] Yin Tat Lee and Aaron Sidford. Path finding methods for linear programming: Solving linear programs in iterations and faster algorithms for maximum flow. In 55th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 424–433, 2014.
- LSZ [19] Yin Tat Lee, Zhao Song, and Qiuyi Zhang. Solving empirical risk minimization in the current matrix multiplication time. In Conference on Learning Theory, pages 2140–2157. PMLR, 2019.
- NN [13] Jelani Nelson and Huy L Nguyên. Osnap: Faster numerical linear algebra algorithms via sparser subspace embeddings. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS), pages 117–126. IEEE, 2013.
- PSW [17] Eric Price, Zhao Song, and David P Woodruff. Fast regression with an guarantee. In ICALP, 2017.
- QSZZ [23] Lianke Qin, Zhao Song, Lichen Zhang, and Danyang Zhuo. An online and unified algorithm for projection matrix vector multiplication with application to empirical risk minimization. In AISTATS, 2023.
- RSZ [22] Aravind Reddy, Zhao Song, and Lichen Zhang. Dynamic tensor product regression. In Thirty-Sixth Conference on Neural Information Processing Systems, NeurIPS’22, 2022.
- Sar [06] Tamas Sarlos. Improved approximation algorithms for large matrices via random projections. In 2006 47th annual IEEE symposium on foundations of computer science (FOCS), pages 143–152. IEEE, 2006.
- SWYZ [21] Zhao Song, David Woodruff, Zheng Yu, and Lichen Zhang. Fast sketching of polynomial kernels of polynomial degree. In International Conference on Machine Learning, pages 9812–9823. PMLR, 2021.
- SWZ [16] Zhao Song, David Woodruff, and Huan Zhang. Sublinear time orthogonal tensor decomposition. Advances in Neural Information Processing Systems, 29, 2016.
- SWZ [19] Zhao Song, David P Woodruff, and Peilin Zhong. Relative error tensor low rank approximation. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2772–2789. SIAM, 2019.
- SXYZ [22] Zhao Song, Zhaozhuo Xu, Yuanyuan Yang, and Lichen Zhang. Accelerating frank-wolfe algorithm using low-dimensional and adaptive data structures. arXiv preprint arXiv:2207.09002, 2022.
- SXZ [22] Zhao Song, Zhaozhuo Xu, and Lichen Zhang. Speeding up sparsification using inner product search data structures. arXiv preprint arXiv:2204.03209, 2022.
- SY [21] Zhao Song and Zheng Yu. Oblivious sketching-based central path method for solving linear programming problems. In 38th International Conference on Machine Learning (ICML), 2021.
- SZZ [21] Zhao Song, Lichen Zhang, and Ruizhe Zhang. Training multi-layer over-parametrized neural network in subquadratic time. arXiv preprint arXiv:2112.07628, 2021.
- Tro [10] Joel A. Tropp. Improved analysis of the subsampled randomized hadamard transform. 2010.
- Woo [14] David P Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends® in Theoretical Computer Science, 10(1–2):1–157, 2014.
- WZ [20] David P Woodruff and Amir Zandieh. Near input sparsity time kernel embeddings via adaptive sampling. In ICML, 2020.
- WZ [22] David Woodruff and Amir Zandieh. Leverage score sampling for tensor product matrices in input sparsity time. In Proceedings of the 39th International Conference on Machine Learning, Proceedings of Machine Learning Research. PMLR, 2022.