
Improved Subsampled Randomized Hadamard Transform for Linear SVM

Zijian Lei
Department of Computer Science,
Hong Kong Baptist University,
Hong Kong SAR, China
[email protected]

Liang Lan
Department of Computer Science,
Hong Kong Baptist University,
Hong Kong SAR, China
[email protected]
Abstract

Subsampled Randomized Hadamard Transform (SRHT), a popular random projection method that can efficiently project $d$-dimensional data into an $r$-dimensional space ($r\ll d$) in $O(d\log(d))$ time, has been widely used to address the challenge of high dimensionality in machine learning. SRHT works by rotating the input data matrix $\mathbf{X}\in\mathbb{R}^{n\times d}$ with a Randomized Walsh-Hadamard Transform and then uniformly sampling columns of the rotated matrix. Despite its advantages, one limitation of SRHT is that it generates the new low-dimensional embedding without considering any specific properties of a given dataset. Therefore, this data-independent random projection method may result in inferior and unstable performance when used for a particular machine learning task, e.g., classification. To overcome this limitation, we analyze the effect of using SRHT for random projection in the context of linear SVM classification. Based on our analysis, we propose importance sampling and deterministic top-$r$ sampling to produce effective low-dimensional embeddings instead of the uniform sampling used in SRHT. In addition, we propose a new supervised non-uniform sampling method. Our experimental results demonstrate that our proposed methods achieve higher classification accuracies than SRHT and other random projection methods on six real-life datasets.

Introduction

Classification is one of the fundamental machine learning tasks. In the era of big data, huge amounts of high-dimensional data have become common in a wide variety of applications. As the dimensionality of the data in real-life classification tasks has increased dramatically in recent years, it is essential to develop accurate and efficient classification algorithms for high-dimensional data.

One of the most popular ways to address the high-dimensionality challenge is dimensionality reduction, which allows classification problems to be solved efficiently in a lower-dimensional space. Popular dimensionality reduction methods include Principal Component Analysis (PCA) (?) and Linear Discriminant Analysis (LDA) (?). However, these traditional dimensionality reduction methods are computationally expensive, which makes them unsuitable for big data. Recently, random projection (?), which projects a high-dimensional feature vector $\mathbf{x}\in\mathbb{R}^{d}$ into a low-dimensional feature vector $\widetilde{\mathbf{x}}\in\mathbb{R}^{r}$ ($r\ll d$) using a random orthonormal matrix $\mathbf{R}\in\mathbb{R}^{d\times r}$, namely $\widetilde{\mathbf{x}}=\mathbf{R}^{T}\mathbf{x}$, has attracted significant research attention. Random projection methods are (1) computationally efficient while maintaining accurate results; (2) simple to implement in practice; and (3) easy to analyze (??).

The theoretical foundation of random projection is the Johnson-Lindenstrauss Lemma (JL lemma) (?). The JL lemma shows that, in Euclidean space, high-dimensional data can be randomly projected into a lower-dimensional space while the pairwise distances between all data points are preserved with high probability. Based on the JL lemma, several different ways have been proposed to construct the random matrix $\mathbf{R}$: (1) Gaussian matrix (?): each entry of $\mathbf{R}$ is drawn from a Gaussian distribution with mean $0$ and variance $\frac{1}{d}$; (2) Achlioptas matrix (?): each entry of $\mathbf{R}$ is drawn from $\{-1, 0, 1\}$ according to a discrete distribution. This method generates a sparse random matrix $\mathbf{R}$ and therefore requires less memory and computation; (3) sparse embedding matrix (also called count sketch matrix) (?) and its close variant feature hashing (??): for each row of $\mathbf{R}$, only one column is selected at random and assigned either $1$ or $-1$ with probability 0.5; and (4) Subsampled Randomized Hadamard Transform (SRHT) (?): it uses a highly structured matrix $\mathbf{R}$ for random projection. SRHT first rotates the input matrix $\mathbf{X}\in\mathbb{R}^{n\times d}$ ($n$ is the number of samples, $d$ is the dimensionality of the data) by a Randomized Walsh-Hadamard Transform and then uniformly samples $r$ columns from the rotated matrix. More details of SRHT are given in the next section.
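
To make these constructions concrete, the following is a minimal NumPy/SciPy sketch of the first three random matrices. The sizes $n$, $d$, $r$, the variable names, and the specific Achlioptas probabilities $\{1/6, 2/3, 1/6\}$ are illustrative choices of ours, not details taken from the cited constructions.

  import numpy as np
  from scipy import sparse

  rng = np.random.default_rng(0)
  n, d, r = 100, 1024, 64                      # illustrative sizes

  # (1) Gaussian matrix: i.i.d. entries with mean 0 and variance 1/d.
  R_gauss = rng.normal(0.0, np.sqrt(1.0 / d), size=(d, r))

  # (2) Achlioptas matrix: sparse entries drawn from {-1, 0, 1}
  #     (here with the common probabilities {1/6, 2/3, 1/6}).
  R_achlioptas = rng.choice([-1.0, 0.0, 1.0], size=(d, r), p=[1 / 6, 2 / 3, 1 / 6])

  # (3) Sparse embedding (count sketch): one random +/-1 entry per row of R.
  cols = rng.integers(0, r, size=d)
  signs = rng.choice([-1.0, 1.0], size=d)
  R_sparse = sparse.csr_matrix((signs, (np.arange(d), cols)), shape=(d, r))

  X = rng.normal(size=(n, d))
  X_tilde = X @ R_gauss                        # projected data of shape (n, r)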

Although random projection methods have attracted a great deal of research attention in recent years (??) and have been studied for regression (?), classification (???) and clustering (??), most existing studies focus on data-independent projection. In other words, the random matrix $\mathbf{R}$ is constructed without considering any specific properties of a given dataset.

In this paper, we focus on SRHT for random projection and study how it affects the classification performance of linear Support Vector Machines (SVM). SRHT achieves dimensionality reduction by uniformly sampling $r$ columns from a rotated version of the input data matrix $\mathbf{X}$. Even though the Randomized Walsh-Hadamard Transform in SRHT tends to equalize column norms (?), we argue that the columns of the rotated data matrix are not equally important. Therefore, the uniform column sampling procedure in SRHT can result in low accuracy when it is used as a dimensionality reduction method for high-dimensional data classification.

To overcome this limitation of SRHT, we propose to produce effective low-dimensional embeddings by using non-uniform sampling instead of uniform sampling. To this end, we first analyze the effect of using SRHT for random projection in linear SVM classification. Based on our analysis, we propose importance sampling and deterministic top-$r$ sampling methods to improve SRHT. Furthermore, we propose a new sampling method that incorporates label information: it samples the columns that achieve optimal inter-class separability along with intra-class compactness of the data samples.

Finally, we performed experiments to evaluate our proposed methods on six real-life datasets. Our experimental results clearly demonstrate that our proposed methods obtain higher classification accuracies than SRHT and three other popular random projection methods while only slightly increasing the running time.

Preliminaries

Random Projection

Given a data matrix $\mathbf{X}\in\mathbb{R}^{n\times d}$, random projection methods reduce the dimensionality of $\mathbf{X}$ by multiplying it by a random orthonormal matrix $\mathbf{R}\in\mathbb{R}^{d\times r}$ with $r\ll d$. The projected data in the low-dimensional space is

$$\widetilde{\mathbf{X}}=\mathbf{X}\mathbf{R}\in\mathbb{R}^{n\times r}. \tag{1}$$

Note that the matrix $\mathbf{R}$ is randomly generated and is independent of the input data $\mathbf{X}$. The theoretical foundation of random projection is the Johnson-Lindenstrauss Lemma (JL lemma), stated as follows.

Lemma 1 (Johnson-Lindenstrauss Lemma (JL lemma) (?)).

For any $0<\epsilon<1$ and any integer $n$, let $r=O(\log n/\epsilon^{2})$ and let $\mathbf{R}\in\mathbb{R}^{d\times r}$ be a random orthonormal matrix. Then for any set $\mathbf{X}$ of $n$ points in $\mathbb{R}^{d}$, the following inequality on the pairwise distance between any two data points $\mathbf{x}_{i}$ and $\mathbf{x}_{j}$ in $\mathbf{X}$ holds with high probability:

$$(1-\epsilon)\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}\leq\|\mathbf{R}^{T}\mathbf{x}_{i}-\mathbf{R}^{T}\mathbf{x}_{j}\|_{2}\leq(1+\epsilon)\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}.$$

The JL lemma shows that, in Euclidean space, high-dimensional data can be randomly projected into a lower-dimensional space while the pairwise distances between all data points are well preserved with high probability. There are several different ways to construct the random projection matrix $\mathbf{R}$. In this paper, we focus on SRHT, which uses a structured orthonormal matrix for random projection. The details of SRHT are as follows.
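
As a quick numerical illustration of the lemma (not part of the original text), the following sketch checks how well pairwise distances are preserved under a Gaussian random projection; the sizes are arbitrary and the $1/r$ variance is chosen here so that squared norms are preserved in expectation.

  import numpy as np
  from scipy.spatial.distance import pdist

  rng = np.random.default_rng(0)
  n, d, r = 50, 2048, 256
  X = rng.normal(size=(n, d))
  R = rng.normal(0.0, np.sqrt(1.0 / r), size=(d, r))   # E||R^T x||^2 = ||x||^2 with this scaling
  ratios = pdist(X @ R) / pdist(X)                     # projected / original pairwise distances
  print(ratios.min(), ratios.max())                    # both close to 1 for moderate r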

Subsampled Randomized Hadamard Transform (SRHT)

For $d=2^{q}$, where $q$ is any positive integer (we can always ensure this by padding the original data with zeros), SRHT defines a $d\times r$ matrix as:

$$\mathbf{R}=\sqrt{\frac{d}{r}}\mathbf{D}\mathbf{H}\mathbf{S} \tag{2}$$

where

  • $\mathbf{D}\in\mathbb{R}^{d\times d}$ is a diagonal matrix whose entries are independent random signs $\{1,-1\}$;

  • $\mathbf{H}\in\mathbb{R}^{d\times d}$ is a normalized Walsh-Hadamard matrix. The Walsh-Hadamard matrix is defined recursively as
    $\mathbf{H}_{d}=\left[\begin{array}{cc}\mathbf{H}_{d/2}&\mathbf{H}_{d/2}\\ \mathbf{H}_{d/2}&-\mathbf{H}_{d/2}\end{array}\right]$ with $\mathbf{H}_{2}=\left[\begin{array}{cc}1&1\\ 1&-1\end{array}\right]$, and $\mathbf{H}=\frac{1}{\sqrt{d}}\mathbf{H}_{d}\in\mathbb{R}^{d\times d}$;

  • $\mathbf{S}\in\mathbb{R}^{d\times r}$ consists of $r$ columns sampled uniformly at random from the $d\times d$ identity matrix. Multiplying by $\mathbf{S}$ uniformly samples $r$ columns from the rotated data matrix $\mathbf{X}_{r}=\mathbf{X}\mathbf{D}\mathbf{H}$.

There are several advantages of using SRHT for random projection. First, we do not need to represent $\mathbf{H}$ explicitly in memory; we only need to store the diagonal random sign matrix $\mathbf{D}$ and the sampling matrix $\mathbf{S}$ with $r$ non-zero values. Therefore, the memory cost of storing $\mathbf{R}$ is only $O(d+r)$. Second, due to the recursive structure of the Hadamard matrix $\mathbf{H}$, the matrix multiplication $\mathbf{X}\mathbf{R}$ takes only $O(nd\log(d))$ time by using a Fast Fourier Transform (FFT)-style algorithm for the multiplication (?). ? (?) further improved the time complexity of SRHT to $O(nd\log(r))$ when only $r$ columns of $\mathbf{X}_{r}$ are needed.
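
As a concrete illustration, the following minimal NumPy sketch implements the uniform-sampling SRHT of (2) with an explicit fast Walsh-Hadamard transform. The helper names (fwht, srht) and example sizes are ours; a production implementation would rely on an optimized Hadamard/FFT routine to reach the stated complexity.

  import numpy as np

  def fwht(a):
      # In-place fast Walsh-Hadamard transform along the last axis
      # (the length of that axis must be a power of 2).
      h, n = 1, a.shape[-1]
      while h < n:
          for i in range(0, n, 2 * h):
              x = a[..., i:i + h].copy()
              y = a[..., i + h:i + 2 * h].copy()
              a[..., i:i + h] = x + y
              a[..., i + h:i + 2 * h] = x - y
          h *= 2
      return a

  def srht(X, r, rng):
      # Project X (n x d, with d a power of 2) to r dimensions by uniform-sampling SRHT.
      n, d = X.shape
      signs = rng.choice([-1.0, 1.0], size=d)          # diagonal of the random sign matrix D
      Xr = fwht(X * signs) / np.sqrt(d)                # X D H with the normalized Hadamard H
      idx = rng.choice(d, size=r, replace=False)       # uniform column sampling (the matrix S)
      return np.sqrt(d / r) * Xr[:, idx]               # re-scaling factor sqrt(d/r)

  rng = np.random.default_rng(0)
  X = rng.normal(size=(100, 256))
  X_tilde = srht(X, r=32, rng=rng)                     # shape (100, 32)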

Figure 1: Limitation of uniform sampling. (a) Original data $\mathbf{X}$; (b) rotated data $\mathbf{X}_{r}=\mathbf{X}\mathbf{D}\mathbf{H}$; (c) accuracy of SRHT on different runs.

Methodology

Limitation of Uniform Sampling in SRHT

Despite the advantages of SRHT mentioned in the previous section, one limitation of SRHT is that the random matrix $\mathbf{R}$ is constructed without considering any specific properties of the input data $\mathbf{X}$. Based on the definition in (2), SRHT works by first rotating the input data matrix $\mathbf{X}$, multiplying it by $\mathbf{D}\mathbf{H}$, and then uniformly sampling $r$ columns from the rotated matrix $\mathbf{X}_{r}=\mathbf{X}\mathbf{D}\mathbf{H}$. Since this uniform column sampling does not consider the underlying data distribution of $\mathbf{X}_{r}$, it can result in low and unstable accuracy when used as a dimensionality reduction method for a specific classification problem. To illustrate the limitation of the uniform sampling step in SRHT, we generate a simple two-dimensional synthetic dataset, shown in Figure 1(a). The first class (class label '+') was generated from $N(\mu_{1}=[3,3],\ \Sigma=\left[\begin{array}{cc}1&1.75\\ 1.75&1\end{array}\right])$. The second class (class label '$\bullet$') was generated from $N(\mu_{2}=[-3,-3],\ \Sigma=\left[\begin{array}{cc}1&1.75\\ 1.75&1\end{array}\right])$. After multiplying by the matrix $\mathbf{D}\mathbf{H}$, the original data is rotated into the new feature space shown in Figure 1(b). In this simple example, SRHT might choose the feature that cannot distinguish the two classes (i.e., the vertical feature in Figure 1(b)), which leads to low classification accuracy. In Figure 1(c), we show the classification accuracy of SRHT on a real-life dataset over 15 different runs. It clearly shows that this data-independent random projection method can result in low and unstable performance. In the following sections, we describe our proposed methods to construct a data-dependent sampling matrix $\mathbf{S}$ in (2) by exploiting the underlying data properties from both unsupervised and supervised perspectives.

Improved SRHT by Unsupervised Non-uniform Sampling

First, we present our proposed non-uniform sampling methods from an unsupervised perspective. In this paper, we study SRHT in the context of linear SVM classification. Assume we are given a training set $D=\{\mathbf{x}_{i},y_{i}\}_{i=1}^{n}$, where $\mathbf{x}_{i}$ is a $d$-dimensional input feature vector and $y_{i}$ is its corresponding class label. The dual problem of SVM can be written in matrix form as

$$\begin{split}\max\ \ &\mathbf{1}^{T}\boldsymbol{\alpha}-\frac{1}{2}\boldsymbol{\alpha}^{T}\mathbf{Y}\mathbf{X}\mathbf{X}^{T}\mathbf{Y}\boldsymbol{\alpha}\\ \text{s.t.}\ \ &\mathbf{1}^{T}\mathbf{Y}\boldsymbol{\alpha}=\mathbf{0},\\ &\mathbf{0}\leq\boldsymbol{\alpha}\leq\mathbf{C},\end{split} \tag{3}$$

where $\mathbf{X}\in\mathbb{R}^{n\times d}$ is the input feature matrix, $\mathbf{Y}$ is a diagonal matrix whose diagonal entries are the class labels, $\mathbf{Y}_{ii}=y_{i}$, and $C$ is the regularization parameter.

By applying SRHT to project the original $\mathbf{X}$ from the $d$-dimensional space into an $r$-dimensional space, i.e., $\widetilde{\mathbf{X}}=\sqrt{\frac{d}{r}}\mathbf{X}\mathbf{D}\mathbf{H}\mathbf{S}$, the SVM optimization problem on the projected data $\widetilde{\mathbf{X}}$ becomes

$$\begin{split}\max\ \ &\mathbf{1}^{T}\boldsymbol{\widetilde{\alpha}}-\frac{1}{2}\boldsymbol{\widetilde{\alpha}}^{T}\mathbf{Y}\widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^{T}\mathbf{Y}\boldsymbol{\widetilde{\alpha}}\\ \text{s.t.}\ \ &\mathbf{1}^{T}\mathbf{Y}\boldsymbol{\widetilde{\alpha}}=\mathbf{0},\\ &\mathbf{0}\leq\boldsymbol{\widetilde{\alpha}}\leq\mathbf{C}.\end{split} \tag{4}$$

Comparing the dual problem (3) in the original space with the dual problem (4) in the projected low-dimensional space, the only difference is that $\mathbf{X}\mathbf{X}^{T}$ in (3) is replaced by $\widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^{T}$ in (4). The constraints on the optimal solutions (i.e., $\boldsymbol{\alpha}^{*}$ of (3) and $\boldsymbol{\widetilde{\alpha}}^{*}$ of (4)) do not depend on the input data matrix. Therefore, $\boldsymbol{\widetilde{\alpha}}^{*}$ is expected to be close to $\boldsymbol{\alpha}^{*}$ when $\widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^{T}$ is close to $\mathbf{X}\mathbf{X}^{T}$.

Let us rewrite $\widetilde{\mathbf{X}}=\sqrt{\frac{d}{r}}\mathbf{X}_{r}\mathbf{S}$, where $\mathbf{X}_{r}=\mathbf{X}\mathbf{D}\mathbf{H}$. The effect of $\sqrt{\frac{d}{r}}\mathbf{X}_{r}\mathbf{S}$ is to uniformly sample $r$ columns from $\mathbf{X}_{r}$ and then re-scale each selected column by the factor $\sqrt{\frac{d}{r}}$. This corresponds to using uniform column sampling to approximate matrix multiplication (?). It has been proved in (?) that the $(i,j)$-th element of $\widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^{T}$ equals the $(i,j)$-th element of the exact product $\mathbf{X}_{r}\mathbf{X}_{r}^{T}$ in expectation, regardless of the sampling probabilities. However, the variance of the $(i,j)$-th element of $\widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^{T}$ depends on the choice of sampling probabilities. Therefore, simply using uniform sampling can result in a large variance for each element of $\widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^{T}$ and hence a large approximation error between $\widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^{T}$ and $\mathbf{X}_{r}\mathbf{X}_{r}^{T}$.

Motivated by this observation, we first propose to improve SRHT by employing importance sampling for randomized matrix multiplication. Specifically, we construct $\widetilde{\mathbf{S}}$ based on specific data properties of $\mathbf{X}_{r}$ instead of using the random sampling matrix $\mathbf{S}$ of the original SRHT. The idea is based on importance sampling for approximate matrix multiplication: (1) we assign each column $i$ of $\mathbf{X}_{r}$ a probability $p_{i}$ such that $p_{i}\geq 0$ and $\sum_{i=1}^{d}p_{i}=1$; (2) we then select $r$ columns from $\mathbf{X}_{r}$ according to the probability distribution $\{p_{i}\}_{i=1}^{d}$ to form $\widetilde{\mathbf{X}}$; (3) a column of $\widetilde{\mathbf{X}}$ is formed as $\sqrt{\frac{1}{rp_{i}}}\mathbf{X}_{r(:,i)}$ if it is chosen from the $i$-th column of $\mathbf{X}_{r}$, where $\sqrt{\frac{1}{rp_{i}}}$ is the re-scaling factor. In other words, the matrix $\widetilde{\mathbf{S}}\in\mathbb{R}^{d\times r}$ is constructed as follows: for every column $j$ of $\widetilde{\mathbf{S}}$, only one entry $i$ ($i$ is a random index from $\{1,\dots,d\}$) is selected according to the probability distribution $\{p_{i}\}_{i=1}^{d}$ and assigned a non-zero value. The non-zero value is set as

$$\widetilde{\mathbf{S}}_{ij}=\sqrt{\frac{1}{rp_{i}}}. \tag{5}$$

In the uniform sampling setting, $p_{i}=\frac{1}{d}$ and therefore $\widetilde{\mathbf{S}}=\sqrt{\frac{d}{r}}\mathbf{S}$, which is consistent with the original SRHT in (2).

Now the question is which choice of probabilities $\{p_{i}\}_{i=1}^{d}$ optimally reduces the expected approximation error between $\widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^{T}$ and $\mathbf{X}\mathbf{X}^{T}$. We use the Frobenius norm to measure this approximation error.

Let $\mathbf{X}_{r(:,j)}$ denote the $j$-th column of $\mathbf{X}_{r}$. As shown in Proposition 1, the optimal sampling probability that minimizes $\mathbf{E}[\|\widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^{T}-\mathbf{X}\mathbf{X}^{T}\|_{F}]$ is

$$p_{j}=\frac{\|\mathbf{X}_{r(:,j)}\|_{2}^{2}}{\sum_{j=1}^{d}\|\mathbf{X}_{r(:,j)}\|_{2}^{2}}. \tag{6}$$
Proposition 1.

Suppose $\mathbf{X}\in\mathbb{R}^{n\times d}$, $\mathbf{X}_{r}=\mathbf{X}\mathbf{D}\mathbf{H}\in\mathbb{R}^{n\times d}$ and $\widetilde{\mathbf{X}}=\mathbf{X}_{r}\widetilde{\mathbf{S}}$, where $\widetilde{\mathbf{S}}\in\mathbb{R}^{d\times r}$ is defined as in (5). Then the sampling probability that minimizes the expected Frobenius norm of the approximation error between $\widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^{T}$ and $\mathbf{X}\mathbf{X}^{T}$, i.e., $\mathbf{E}[\|\widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^{T}-\mathbf{X}\mathbf{X}^{T}\|_{F}]$, is $p_{j}=\frac{\|\mathbf{X}_{r(:,j)}\|_{2}^{2}}{\sum_{j=1}^{d}\|\mathbf{X}_{r(:,j)}\|_{2}^{2}}$.

To prove Proposition 1, we first use the fact that $\mathbf{X}\mathbf{X}^{T}=\mathbf{X}_{r}\mathbf{X}_{r}^{T}$ since $\mathbf{D}\mathbf{H}\mathbf{H}^{T}\mathbf{D}^{T}=\mathbf{I}$. Then we apply Lemma 4 of (?) to obtain the optimal sampling probability.
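
A minimal sketch of this importance-sampling variant (ISRHT-nps) is given below, assuming the rotated matrix $\mathbf{X}_{r}=\mathbf{X}\mathbf{D}\mathbf{H}$ has already been computed (e.g., with the FWHT sketch above). Sampling with replacement follows the standard randomized matrix-multiplication setup, and the function name is our own.

  import numpy as np

  def isrht_nps(Xr, r, rng):
      # Sample r columns of Xr with probabilities proportional to their squared
      # Euclidean norms (Eq. (6)) and re-scale each by 1/sqrt(r * p_i) (Eq. (5)).
      col_norms_sq = np.sum(Xr ** 2, axis=0)
      p = col_norms_sq / col_norms_sq.sum()
      idx = rng.choice(Xr.shape[1], size=r, replace=True, p=p)
      return Xr[:, idx] / np.sqrt(r * p[idx])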

In addition, we propose a deterministic top-$r$ sampling method, which selects the $r$ columns of $\mathbf{X}_{r}$ with the largest Euclidean norms. As shown in Proposition 2, this deterministic top-$r$ sampling directly optimizes an upper bound of $\|\widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^{T}-\mathbf{X}\mathbf{X}^{T}\|_{F}$ without using a re-scaling factor.

Proposition 2.

Suppose $\mathbf{X}\in\mathbb{R}^{n\times d}$, $\mathbf{X}_{r}=\mathbf{X}\mathbf{D}\mathbf{H}\in\mathbb{R}^{n\times d}$ and $\widetilde{\mathbf{X}}=\mathbf{X}_{r}\mathbf{P}$, where $\mathbf{P}\in\mathbb{R}^{d\times d}$ is a diagonal matrix whose entries are 1 or 0, and $\mathbf{P}_{ii}=1$ denotes that the $i$-th column of $\mathbf{X}_{r}$ is selected. Then $\sum_{\mathbf{P}_{jj}=0}\|\mathbf{X}_{r(:,j)}\|_{2}^{2}$ is an upper bound of the approximation error $\|\widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^{T}-\mathbf{X}\mathbf{X}^{T}\|_{F}$.

Proof.

Since $\widetilde{\mathbf{X}}=\mathbf{X}_{r}\mathbf{P}$ and $\mathbf{X}\mathbf{X}^{T}=\mathbf{X}_{r}\mathbf{X}_{r}^{T}$, we have $\|\widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^{T}-\mathbf{X}\mathbf{X}^{T}\|_{F}=\|\mathbf{X}_{r}\mathbf{P}\mathbf{P}^{T}\mathbf{X}_{r}^{T}-\mathbf{X}_{r}\mathbf{X}_{r}^{T}\|_{F}$. Let us rewrite the matrix product $\mathbf{X}_{r}\mathbf{X}_{r}^{T}$ as the sum of $d$ rank-one matrices, $\sum_{j=1}^{d}\mathbf{X}_{r(:,j)}\mathbf{X}_{r(:,j)}^{T}$. Similarly, $\mathbf{X}_{r}\mathbf{P}\mathbf{P}^{T}\mathbf{X}_{r}^{T}$ can be rewritten as $\sum_{j=1}^{d}\mathbf{P}_{jj}\mathbf{X}_{r(:,j)}\mathbf{X}_{r(:,j)}^{T}$. Therefore, the approximation error $\|\widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^{T}-\mathbf{X}\mathbf{X}^{T}\|_{F}$ can be upper bounded as follows,

$$\begin{split}\|\widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^{T}-\mathbf{X}\mathbf{X}^{T}\|_{F}&=\Big\|\sum_{\mathbf{P}_{jj}=0}\mathbf{X}_{r(:,j)}\mathbf{X}_{r(:,j)}^{T}\Big\|_{F}\\ &\leq\sum_{\mathbf{P}_{jj}=0}\|\mathbf{X}_{r(:,j)}\mathbf{X}_{r(:,j)}^{T}\|_{F}=\sum_{\mathbf{P}_{jj}=0}\|\mathbf{X}_{r(:,j)}\|_{2}^{2}.\end{split} \tag{7}$$

In the experimental section, we show that this deterministic top-$r$ sampling obtains better classification accuracy than the importance sampling method. ? (?) also observed that deterministic top-$r$ sampling gives better results when randomized approximate matrix multiplication is used to speed up the training of deep neural networks.
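
For reference, the deterministic variant (ISRHT-top-$r$) reduces to keeping the largest-norm columns of the precomputed $\mathbf{X}_{r}$, as in the following sketch; no re-scaling is applied, in line with Proposition 2, and the function name is our own.

  import numpy as np

  def isrht_topr(Xr, r):
      # Keep the r columns of Xr with the largest Euclidean norms, in their original order.
      col_norms_sq = np.sum(Xr ** 2, axis=0)
      idx = np.sort(np.argsort(col_norms_sq)[-r:])
      return Xr[:, idx]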

Improved SRHT by Supervised Non-uniform Sampling

In this section, we describe our proposed non-uniform sampling method from a supervised perspective. We propose to construct the sampling matrix $\widetilde{\mathbf{S}}$ by incorporating label information. Our method is based on the idea of metric learning (?): we select $r$ columns from $\mathbf{X}_{r}$ to improve the inter-class separability along with the intra-class compactness, as in (?).

Let $\mathbf{z}_{i}$ be the new feature representation of the $i$-th data sample in $\mathbf{X}_{r}$. We measure the inter-class separability by the sum of squared pairwise distances between $\mathbf{z}_{i}$'s from different classes, $\sum_{y_{i}\neq y_{j}}\|\mathbf{z}_{i}-\mathbf{z}_{j}\|^{2}$. Similarly, we measure the intra-class compactness by the sum of squared pairwise distances between $\mathbf{z}_{i}$'s from the same class, $\sum_{y_{i}=y_{j}}\|\mathbf{z}_{i}-\mathbf{z}_{j}\|^{2}$. We formulate our objective as a combination of these two terms,

$$\sum_{i,j\in D}\|\mathbf{z}_{i}-\mathbf{z}_{j}\|^{2}\mathbf{A}_{ij}=\mathrm{trace}\big((\mathbf{X}_{r}\widetilde{\mathbf{S}})^{\top}\mathbf{L}(\mathbf{X}_{r}\widetilde{\mathbf{S}})\big)=\mathrm{trace}\big(\mathbf{X}_{r}^{\top}\mathbf{L}\mathbf{X}_{r}\widetilde{\mathbf{S}}\widetilde{\mathbf{S}}^{\top}\big), \tag{8}$$

where $\mathbf{A}_{ij}=1$ if $y_{i}=y_{j}$ and $\mathbf{A}_{ij}=-a$ if $y_{i}\neq y_{j}$, the parameter $a$ balances the tradeoff between inter-class separability and intra-class compactness, $\mathbf{L}$ is the Laplacian matrix of $\mathbf{A}$, defined as $\mathbf{L}=\mathbf{D}-\mathbf{A}$, and $\mathbf{D}$ is the diagonal degree matrix of $\mathbf{A}$ with $\mathbf{D}_{ii}=\sum_{j}\mathbf{A}_{ij}$. $\widetilde{\mathbf{S}}$ is the sampling matrix whose elements are 0 or 1. We want to learn $\widetilde{\mathbf{S}}$ by minimizing (8).

Without losing any information, let us rewrite our variable $\widetilde{\mathbf{S}}$ as a $d\times d$ diagonal matrix $\mathbf{P}$ whose diagonal elements are $\mathbf{P}_{ii}\in\{0,1\}$: $\mathbf{P}_{ii}=1$ means the $i$-th column of $\mathbf{X}_{r}$ is selected and $\mathbf{P}_{ii}=0$ means it is not. Then $\widetilde{\mathbf{S}}$ can be viewed as the submatrix of $\mathbf{P}$ consisting of the columns with $\mathbf{P}_{ii}=1$.

It is straightforward to verify that $\widetilde{\mathbf{S}}\widetilde{\mathbf{S}}^{\top}$ in (8) equals $\mathbf{P}$. Therefore, our optimization problem is formulated as

$$\begin{split}\min_{\mathbf{P}}\ \ &\mathrm{trace}\big(\mathbf{X}_{r}^{\top}\mathbf{L}\mathbf{X}_{r}\mathbf{P}\big)\\ \text{s.t.}\ \ &\mathbf{P}_{ii}\in\{0,1\},\\ &\sum_{i=1}^{d}\mathbf{P}_{ii}=r.\end{split} \tag{9}$$

Let the vector $\mathbf{b}$ denote the diagonal of the matrix $\mathbf{X}_{r}^{\top}\mathbf{L}\mathbf{X}_{r}$, i.e., $b_{i}=(\mathbf{X}_{r}^{\top}\mathbf{L}\mathbf{X}_{r})_{ii}$, and let the vector $\boldsymbol{p}$ denote the diagonal of $\mathbf{P}$. Then the objective (9) simplifies to

$$\begin{split}\min_{\boldsymbol{p}}\ \ &\mathbf{b}^{\top}\boldsymbol{p}\\ \text{s.t.}\ \ &p_{i}\in\{0,1\},\\ &\sum_{i=1}^{d}p_{i}=r.\end{split} \tag{10}$$

The optimal solution of (10) simply selects the $r$ smallest elements of $\mathbf{b}$ and sets the corresponding entries of $\boldsymbol{p}$ to 1. Then $\widetilde{\mathbf{S}}$ can be easily obtained from the non-zero entries of $\boldsymbol{p}$.
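
For binary labels $y_{i}\in\{-1,+1\}$ and tradeoff $a=1$ (so that $\mathbf{A}=\mathbf{y}\mathbf{y}^{T}$ and $\mathbf{L}=\mathbf{D}-\mathbf{y}\mathbf{y}^{T}$, as noted in the complexity analysis below), the vector $\mathbf{b}$ and the resulting column selection can be computed in $O(nd)$ time as in the following sketch. Picking the $r$ smallest entries of $\mathbf{b}$ follows the minimization in (10); the function name is our own.

  import numpy as np

  def supervised_indices(Xr, y, r):
      # b_i = (Xr^T L Xr)_ii with L = Deg - y y^T, computed without forming L explicitly.
      deg = y * y.sum()                                 # Deg_jj = sum_k A_jk = y_j * sum_k y_k
      b = (deg[:, None] * Xr ** 2).sum(axis=0) - (y @ Xr) ** 2
      return np.sort(np.argsort(b)[:r])                 # columns with the r smallest b_i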

Algorithm 1 Improved Subsampled Randomized Hadamard Transform (ISRHT) for linear SVM classification
  Training
  Input: training set $\mathbf{X}_{train}$, reduced dimension $r$, regularization parameter $C$;
  Output: model $\mathbf{w}\in\mathbb{R}^{r\times 1}$, matrices $\mathbf{D}\in\mathbb{R}^{d\times d}$ and $\widetilde{\mathbf{S}}\in\mathbb{R}^{d\times r}$;
1:  Generate a diagonal random sign matrix $\mathbf{D}$ as in (2)
2:  Compute $\mathbf{X}_{r}=\mathbf{X}_{train}\mathbf{D}\mathbf{H}\in\mathbb{R}^{n\times d}$ by FFT
3:  Obtain the sampling matrix $\widetilde{\mathbf{S}}$
4:  Compute $\hat{\mathbf{X}}_{train}=\mathbf{X}_{r}\widetilde{\mathbf{S}}\in\mathbb{R}^{n\times r}$
5:  Train a linear SVM on $\hat{\mathbf{X}}_{train}$ and obtain the model $\mathbf{w}$
  Prediction
  Input: test data $\mathbf{X}_{test}$, model $\mathbf{w}$, matrices $\mathbf{D},\widetilde{\mathbf{S}}$;
  Output: predicted labels $\hat{\mathbf{y}}_{test}$;
1:  Compute $\hat{\mathbf{X}}_{test}=\mathbf{X}_{test}\mathbf{D}\mathbf{H}\widetilde{\mathbf{S}}\in\mathbb{R}^{n\times r}$
2:  Predict by $\hat{\mathbf{y}}_{test}=\hat{\mathbf{X}}_{test}\mathbf{w}$
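
Putting the pieces together, the sketch below mirrors the training and prediction phases of Algorithm 1 with deterministic top-$r$ column selection, using scikit-learn's LinearSVC as a stand-in for the linear SVM solver. The zero-padding helper, function names and default parameters are our own illustrative choices.

  import numpy as np
  from sklearn.svm import LinearSVC

  def fwht(a):                                          # same FWHT helper as in the SRHT sketch
      h, n = 1, a.shape[-1]
      while h < n:
          for i in range(0, n, 2 * h):
              x = a[..., i:i + h].copy()
              y = a[..., i + h:i + 2 * h].copy()
              a[..., i:i + h] = x + y
              a[..., i + h:i + 2 * h] = x - y
          h *= 2
      return a

  def pad_pow2(X):
      # Zero-pad the feature dimension to the next power of 2 (see the note on d = 2^q).
      n, d = X.shape
      d2 = 1 << (d - 1).bit_length()
      return X if d2 == d else np.hstack([X, np.zeros((n, d2 - d))])

  def isrht_fit(X_train, y_train, r, C=1.0, seed=0):
      rng = np.random.default_rng(seed)
      Xp = pad_pow2(X_train)
      d = Xp.shape[1]
      signs = rng.choice([-1.0, 1.0], size=d)           # step 1: diagonal random sign matrix D
      Xr = fwht(Xp * signs) / np.sqrt(d)                # step 2: X D H
      idx = np.sort(np.argsort(np.sum(Xr ** 2, axis=0))[-r:])  # step 3: top-r sampling matrix
      clf = LinearSVC(C=C).fit(Xr[:, idx], y_train)     # steps 4-5: linear SVM on the r columns
      return clf, signs, idx

  def isrht_predict(clf, signs, idx, X_test):
      Xp = pad_pow2(X_test)
      Xr = fwht(Xp * signs) / np.sqrt(Xp.shape[1])      # apply the same D and H to the test data
      return clf.predict(Xr[:, idx])                    # prediction phase of Algorithm 1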

Algorithm Implementation and Analysis

We summarize our proposed algorithm in Algorithm 1 and name it Improved Subsampled Randomized Hadamard Transform (ISRHT). With respect to training time complexity, step 2 needs $O(nd\log(d))$ time using the FFT. In step 3, for unsupervised column selection, we need $O(nd)$ time to compute the Euclidean norm of each column. For supervised column selection, the $b_{i}$ in (10) are computed as $b_{i}=(\mathbf{X}_{r}^{\top}\mathbf{L}\mathbf{X}_{r})_{ii}$ and the Laplacian matrix $\mathbf{L}$ equals $\mathbf{D}-\mathbf{y}\mathbf{y}^{T}$; therefore, the required time is still $O(nd)$. Step 4 needs $O(r)$ time and step 5 needs $O(tnr)$ time, where $t$ is the number of iterations needed to train the linear SVM. Therefore, the overall time complexity of our proposed algorithm is $O(nd\log(d)+nd+tnr)$. With respect to prediction time complexity, our proposed method needs $O(d\log(d)+r)$ time to make a prediction for a new test sample. With respect to space complexity, $\mathbf{D}$ is a diagonal matrix and $\widetilde{\mathbf{S}}$ is a sampling matrix with $r$ non-zero entries, so the space required to store the projection matrix $\mathbf{R}$ of our proposed methods is $O(d+r)$. Our proposed methods thus achieve very efficient time and space complexities for model deployment. We compare the time and space complexity of different random projection methods in Table 1.

Table 1: Comparison of different random projection methods
Algorithms | Space for $\mathbf{R}$ | Time for projection
Gaussian | $O(dr)$ | $O(ndr)$
Achlioptas | $O(\frac{1}{3}dr)$ | $O(\frac{1}{3}ndr)$
Sparse Embedding | $O(d)$ | $O(nnz(\mathbf{X}))$
SRHT | $O(d+r)$ | $O(nd\log(r))$
ISRHT-unsupervised | $O(d+r)$ | $O(nd\log(d)+nd)$
ISRHT-supervised | $O(d+r)$ | $O(nd\log(d)+nd)$

Extension to High-dimensional Sparse Data

In this section, we describe an extension of our algorithm to high-dimensional sparse data. One disadvantage of applying the Hadamard transform to sparse data is that it produces a dense matrix: as shown in step 2 of Algorithm 1, the memory cost increases from $O(nnz(\mathbf{X}))$ to $O(nd)$. To tackle this challenge, we use a combination of sparse embedding and SRHT, as proposed in (?), for memory-efficient projection. We improve on their work by replacing the original SRHT with our proposed ISRHT methods. We expect that our data-dependent projection methods can obtain higher accuracies than data-independent projection methods on high-dimensional sparse data.

Specifically, we first use a sparse embedding matrix $\mathbf{R}_{sparse}\in\mathbb{R}^{d\times r^{\prime}}$ to project $\mathbf{X}$ into an $r^{\prime}$-dimensional space and then use our improved SRHT methods to project into the final $r$-dimensional space. The sparse embedding matrix $\mathbf{R}_{sparse}$ is generated as follows: for each row of $\mathbf{R}_{sparse}$, one column is selected at random and assigned either $1$ or $-1$ with probability 0.5; all other entries are 0. Due to the sparse structure of $\mathbf{R}_{sparse}$, the matrix multiplication $\mathbf{X}\mathbf{R}_{sparse}$ takes $O(nnz(\mathbf{X}))$ time. In this way, the time complexity of the proposed method on high-dimensional sparse data is reduced to $O(nnz(\mathbf{X})+nr^{\prime}\log(r^{\prime})+nr^{\prime})$ and the memory cost is reduced to $O(nnz(\mathbf{X})+nr^{\prime})$. In our experimental setting, we set $r^{\prime}=2r$ as suggested in (?).
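
A minimal sketch of this two-stage projection for a SciPy sparse input is shown below; the subsequent Hadamard rotation and column selection (ISRHT-top-$r$, ISRHT-nps or ISRHT-supervised) are then applied to the small dense intermediate matrix as in the earlier sketches. The function name and the example sizes are our own.

  import numpy as np
  from scipy import sparse

  def sparse_embedding(X, r_prime, rng):
      # Count-sketch projection of a sparse n x d matrix to a dense n x r' matrix
      # in O(nnz(X)) time: one random +/-1 entry per row of R_sparse.
      d = X.shape[1]
      cols = rng.integers(0, r_prime, size=d)
      signs = rng.choice([-1.0, 1.0], size=d)
      R = sparse.csr_matrix((signs, (np.arange(d), cols)), shape=(d, r_prime))
      return (X @ R).toarray()

  rng = np.random.default_rng(0)
  X = sparse.random(1000, 20000, density=1e-3, format="csr", random_state=0)
  r, r_prime = 256, 512                                 # r' = 2r, a power of 2 here
  X_mid = sparse_embedding(X, r_prime, rng)             # dense n x r' intermediate matrix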

Table 2: The classification accuracy and running time (in seconds). The best results for each dataset are in bold italic, and the best results among the seven random projection methods are in bold.
Algorithms | mushrooms ($r$=16) | usps-b ($r$=16) | gisette ($r$=256) | real-sim ($r$=256) | rcv1-binary ($r$=256) | news20-binary ($r$=256)
All features | 99.85 | 91.03 | 97.5 | 97.43 | 96.43 | 96.01
 | 0.1s | 2.8s | 3.0s | 0.1s | 0.1s | 0.3s
PCA | 96.23 | 85.85 | 96.50 | 93.10 | 95.05 | -
 | 0.3s | 0.6s | 3.2s | 51.0s | 34.5s | -
DLSS | 94.66 | 80.21 | 93.90 | 89.89 | 91.43 | 79.33
 | 0.2s | 1.0s | 5.8s | 2.0s | 1.6s | 9.3s
Gaussian | 89.15$\pm$3.85 | 80.05$\pm$3.33 | 89.76$\pm$0.56 | 76.48$\pm$0.55 | 78.90$\pm$0.61 | 70.10$\pm$0.78
 | 0.2s | 1.0s | 5.9s | 19s | 1.2s | 20.3s
Achlioptas | 91.74$\pm$2.75 | 81.50$\pm$2.28 | 89.56$\pm$0.65 | 76.20$\pm$0.63 | 79.03$\pm$1.28 | 69.29$\pm$0.99
 | 0.3s | 0.9s | 7.8s | 6.3s | 1.5s | 28.6s
Sparse Embedding | 89.06$\pm$4.18 | 82.09$\pm$2.05 | 89.65$\pm$0.86 | 76.82$\pm$0.45 | 79.11$\pm$1.20 | 70.08$\pm$0.69
 | 0.3s | 0.8s | 1.5s | 2.4s | 0.2s | 0.6s
SRHT | 92.45$\pm$2.59 | 81.87$\pm$1.55 | 91.77$\pm$0.76 | 72.80$\pm$0.23 | 79.08$\pm$0.97 | 70.58$\pm$0.26
 | 0.3s | 0.3s | 1.5s | 23.3s | 0.5s | 0.7s
ISRHT-nps | 94.30$\pm$1.15 | 83.34$\pm$1.31 | 91.17$\pm$0.51 | 73.04$\pm$0.51 | 79.28$\pm$0.63 | 70.36$\pm$0.84
 | 0.4s | 0.6s | 1.4s | 24.2s | 0.6s | 0.9s
ISRHT-top-$r$ | 94.23$\pm$1.58 | 82.91$\pm$1.07 | 93.20$\pm$0.61 | 76.81$\pm$0.46 | 82.81$\pm$0.89 | 73.46$\pm$0.90
 | 0.4s | 0.4s | 1.4s | 18.7s | 0.6s | 0.9s
ISRHT-supervised | 96.25$\pm$1.22 | 86.25$\pm$0.81 | 93.90$\pm$0.72 | 80.04$\pm$0.52 | 87.81$\pm$0.42 | 78.33$\pm$0.74
 | 0.2s | 0.4s | 1.4s | 44.3s | 1.0s | 1.1s

Experiments

In this section, we compare our proposed methods with four other popular random projection methods on six real-world benchmark datasets. These benchmark datasets were downloaded from the LIBSVM website (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). For the datasets mushrooms, real-sim, rcv1-binary and news20-binary, we follow the setting in (?) and randomly select 70% of the original training data as training data and the remaining 30% as test data. The detailed information about these six datasets is summarized in Table 3.

Table 3: Experiment datasets
Dataset | train size | test size | dimensionality
mushrooms | 6,000 | 2,124 | 112
usps-b | 7,291 | 2,007 | 256
gisette | 6,000 | 1,000 | 5,000
real-sim | 48,447 | 23,862 | 20,958
rcv1-binary | 13,562 | 6,680 | 47,236
news20-binary | 13,397 | 6,599 | 1,355,191

We evaluate the performance of the following seven random projection algorithms in our experiments:

  • Gaussian: random projection by a Gaussian matrix (?);

  • Achlioptas: random projection by an Achlioptas matrix (?);

  • Sparse Embedding: random projection by a count sketch matrix (?);

  • SRHT: the original Subsampled Randomized Hadamard Transform with uniform sampling (?);

  • ISRHT-nps: our proposed ISRHT with norm-proportional sampling as defined in (6);

  • ISRHT-top-$r$: our proposed deterministic sampling method for ISRHT, which selects the $r$ columns with the largest Euclidean norms;

  • ISRHT-supervised: our proposed method that incorporates label information as defined in (10).

We also include in Table 2 the results of using all features and of two other popular data-dependent dimensionality reduction methods: PCA and Deterministic Leverage Score Sampling (DLSS) (?). Due to the high computational cost of the Singular Value Decomposition (SVD) required by PCA and leverage score sampling, we employed randomized SVD (?) in our experiments.

Experimental setup. The feature values of all datasets are linearly scaled to $[-1, 1]$. For the dense datasets (mushrooms, usps-b and gisette), we directly apply our proposed ISRHT methods. For the high-dimensional sparse datasets (real-sim, rcv1-binary and news20-binary), we use the extension of our proposed methods that combines sparse embedding and ISRHT for memory-efficient projection, as discussed in the previous section. The regularization parameter $C$ of the linear SVM is chosen from $\{2^{-5},2^{-4},\dots,2^{4},2^{5}\}$ by 5-fold cross validation. The tradeoff parameter $a$ for ISRHT-supervised is fixed to 1.0. Our experiments are performed on a server with dual 6-core Intel Xeon 2.4GHz CPUs and 128 GB RAM.
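
For reference, a minimal scikit-learn sketch of this setup (linear scaling to $[-1, 1]$ and 5-fold cross-validation over $C$) might look as follows; the random stand-in data and the estimator choices are our own illustrative assumptions, and in our pipeline the scaling is applied to the original features before projection.

  import numpy as np
  from sklearn.preprocessing import MinMaxScaler
  from sklearn.svm import LinearSVC
  from sklearn.model_selection import GridSearchCV

  rng = np.random.default_rng(0)
  X, y = rng.normal(size=(500, 64)), rng.choice([-1, 1], size=500)  # stand-in for projected data

  X_scaled = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)   # linear scaling to [-1, 1]
  grid = {"C": [2.0 ** k for k in range(-5, 6)]}                    # C in {2^-5, ..., 2^5}
  search = GridSearchCV(LinearSVC(), grid, cv=5).fit(X_scaled, y)
  print(search.best_params_)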

Figure 2: Classification accuracies of different algorithms with different $r$: (a) mushrooms, (b) usps, (c) gisette, (d) real-sim, (e) rcv, (f) news.

Experimental Results. The accuracy and training time of all algorithms are reported in Table 2. '-' denotes that PCA could not be completed because of its space complexity. The reduced dimension $r$ used for each dataset is shown in the table header. We also explore the impact of the parameter $r$ in Figure 2, which is discussed later. The reported accuracies are averaged over 15 repetitions on the test data. As shown in Table 2, our proposed ISRHT-top-$r$ method achieves significantly higher accuracies than the four data-independent random projection methods (i.e., Gaussian, Achlioptas, Sparse Embedding and SRHT) on all six datasets. ISRHT-nps obtains slightly better results than SRHT. Furthermore, by incorporating label information, our proposed ISRHT-supervised method obtains the best accuracy among the seven random projection methods on all six datasets. Using all features gives the best overall accuracy on all six datasets. PCA obtains better accuracy than DLSS and the other random projection methods but requires longer running time. We also observe that our proposed methods produce dense feature representations after random projection and therefore need longer running time than directly applying liblinear to the high-dimensional sparse data.

With respect to running time, all random projection methods are more efficient than data-dependent dimensionality reduction methods such as PCA and DLSS, especially on large datasets. Among the random projection methods, Gaussian and Achlioptas are much slower than Sparse Embedding and SRHT since they need $O(ndr)$ time for projection. The running times of our proposed ISRHT-nps and ISRHT-top-$r$ methods are very close to those of SRHT. These results demonstrate that our proposed methods only slightly increase the running time, since computing the column norms can be done very efficiently.

Impact of parameter $r$. We evaluate the impact of the parameter $r$ in our proposed algorithms. We show the accuracies of all algorithms with respect to different $r$ in Figure 2. We observe that our proposed methods obtain higher accuracy than the other four methods. The accuracy improvement is large when $r$ is set to a relatively small value. As expected, the difference becomes smaller as $r$ increases.

Figure 3: Classification accuracies of SRHT (x-axis) and ISRHT (y-axis) for different parameters: (a) ISRHT-nps vs. SRHT; (b) ISRHT-top-$r$ vs. SRHT.

Stability Comparison Between ISRHT and SRHT. We further investigate the stability of our proposed methods and SRHT with respect to different choices of the regularization parameter $C$ in SVM and the reduced dimension $r$. We evaluate all methods for different parameter combinations of $C$ and $r$, where $C$ is chosen from $\{2^{-5},2^{-4},\dots,2^{5}\}$ and $r$ is chosen from $\{\frac{d}{2^{6}},\frac{d}{2^{5}},\dots,\frac{d}{2}\}$.

Due to space limitations, we only show the comparison of the prediction accuracies of ISRHT-nps and ISRHT-top-$r$ with SRHT on the mushrooms dataset in Figure 3. The accuracies plotted in Figure 3 are averages over 15 repetitions. For each parameter combination, we plot the corresponding accuracy of SRHT on the $x$-axis and the accuracy of our proposed method on the $y$-axis, so points above the $y=x$ line indicate an accuracy improvement of our proposed methods (ISRHT-nps or ISRHT-top-$r$) over SRHT. Overall, our proposed methods are more stable across different parameter combinations.

Conclusion and Future work

In this paper, we propose to produce more effective low-dimensional embeddings than the original SRHT by using non-uniform sampling instead of uniform sampling. To achieve this goal, we first analyze the effect of using SRHT for random projection in the context of linear SVM classification. Based on our analysis, we propose importance sampling and deterministic top-$r$ sampling to improve the embedding. Second, we propose a new sampling method that incorporates label information based on the idea of metric learning. We performed extensive experiments to evaluate our proposed non-uniform sampling methods. Our experimental results demonstrate that the proposed methods achieve better accuracy than the original SRHT and three other popular random projection methods, while only slightly increasing the running time. In the future, we would like to extend our proposed ISRHT methods to nonlinear classification problems. Another interesting direction is to design data-dependent sparse embedding methods.

Acknowledgments

We would like to thank the anonymous reviewers for their insightful comments and valuable suggestions on our paper. This work was supported by NSFC 61906161.

References

  • [Achlioptas 2003] Achlioptas, D. 2003. Database-friendly random projections: Johnson-lindenstrauss with binary coins. Journal of computer and System Sciences 66(4):671–687.
  • [Adelman and Silberstein 2018] Adelman, M., and Silberstein, M. 2018. Faster neural network training with approximate tensor operations. arXiv preprint arXiv:1805.08079.
  • [Ailon and Liberty 2009] Ailon, N., and Liberty, E. 2009. Fast dimension reduction using rademacher series on dual bch codes. Discrete & Computational Geometry 42(4):615.
  • [Bingham and Mannila 2001] Bingham, E., and Mannila, H. 2001. Random projection in dimensionality reduction: applications to image and text data. In Proceedings of the seventh ACM SIGKDD, 245–250. ACM.
  • [Boutsidis and Gittens 2013] Boutsidis, C., and Gittens, A. 2013. Improved matrix algorithms via the subsampled randomized hadamard transform. SIAM Journal on Matrix Analysis and Applications 34(3):1301–1340.
  • [Boutsidis, Zouzias, and Drineas 2010] Boutsidis, C.; Zouzias, A.; and Drineas, P. 2010. Random projections for $k$-means clustering. In NIPS, 298–306.
  • [Chen et al. 2015] Chen, S.; Liu, Y.; Lyu, M. R.; King, I.; and Zhang, S. 2015. Fast relative-error approximation algorithm for ridge regression. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, 201–210. AUAI Press.
  • [Choromanski, Rowland, and Weller 2017] Choromanski, K. M.; Rowland, M.; and Weller, A. 2017. The unreasonable effectiveness of structured random orthogonal embeddings. In NeurIPS, 219–228.
  • [Clarkson and Woodruff 2017] Clarkson, K. L., and Woodruff, D. P. 2017. Low-rank approximation and regression in input sparsity time. Journal of the ACM (JACM) 63(6):54.
  • [Dasgupta and Gupta 1999] Dasgupta, S., and Gupta, A. 1999. An elementary proof of the johnson-lindenstrauss lemma. International Computer Science Institute, Technical Report 22(1):1–5.
  • [Drineas, Kannan, and Mahoney 2006] Drineas, P.; Kannan, R.; and Mahoney, M. W. 2006. Fast monte carlo algorithms for matrices i: Approximating matrix multiplication. SIAM Journal on Computing 36(1):132–157.
  • [Freksen, Kamma, and Larsen 2018] Freksen, C. B.; Kamma, L.; and Larsen, K. G. 2018. Fully understanding the hashing trick. In Advances in Neural Information Processing Systems, 5389–5399.
  • [Halko, Martinsson, and Tropp 2011] Halko, N.; Martinsson, P.-G.; and Tropp, J. A. 2011. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review 53(2):217–288.
  • [Johnson, Lindenstrauss, and Schechtman 1986] Johnson, W. B.; Lindenstrauss, J.; and Schechtman, G. 1986. Extensions of lipschitz maps into banach spaces. Israel Journal of Mathematics 54(2):129–138.
  • [Jolliffe 2011] Jolliffe, I. 2011. Principal component analysis. Springer.
  • [Lan et al. 2019] Lan, L.; Wang, Z.; Zhe, S.; Cheng, W.; Wang, J.; and Zhang, K. 2019. Scaling up kernel SVM on limited resources: A low-rank linearization approach. IEEE Trans. Neural Netw. Learning Syst. 30(2):369–378.
  • [Liu, Shen, and Tsang 2017] Liu, W.; Shen, X.; and Tsang, I. 2017. Sparse embedded $k$-means clustering. In NIPS, 3319–3327.
  • [Lu et al. 2013] Lu, Y.; Dhillon, P.; Foster, D. P.; and Ungar, L. 2013. Faster ridge regression via the subsampled randomized hadamard transform. In NIPS, 369–377.
  • [Mahoney and others 2011] Mahoney, M. W., et al. 2011. Randomized algorithms for matrices and data. Foundations and Trends® in Machine Learning 3(2):123–224.
  • [Mika et al. 1999] Mika, S.; Ratsch, G.; Weston, J.; Scholkopf, B.; and Mullers, K.-R. 1999. Fisher discriminant analysis with kernels. In Neural Networks for Signal Processing IX: Proceedings of the 1999 IEEE Signal Processing Society Workshop (cat. no. 98th8468), 41–48. IEEE.
  • [Papailiopoulos, Kyrillidis, and Boutsidis 2014] Papailiopoulos, D.; Kyrillidis, A.; and Boutsidis, C. 2014. Provable deterministic leverage score sampling. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, 997–1006. ACM.
  • [Paul et al. 2013] Paul, S.; Boutsidis, C.; Magdon-Ismail, M.; and Drineas, P. 2013. Random projections for support vector machines. In Artificial intelligence and statistics, 498–506.
  • [Paul, Magdon-Ismail, and Drineas 2015] Paul, S.; Magdon-Ismail, M.; and Drineas, P. 2015. Feature selection for linear svm with provable guarantees. In Artificial Intelligence and Statistics, 735–743.
  • [Pourkamali-Anaraki, Becker, and Wakin 2018] Pourkamali-Anaraki, F.; Becker, S.; and Wakin, M. B. 2018. Randomized clustered nystrom for large-scale kernel machines. In AAAI’18.
  • [Sarlos 2006] Sarlos, T. 2006. Improved approximation algorithms for large matrices via random projections. In FOCS’06, 143–152. IEEE.
  • [Tropp 2011] Tropp, J. A. 2011. Improved analysis of the subsampled randomized hadamard transform. Advances in Adaptive Data Analysis 3(01n02):115–126.
  • [Weinberger et al. 2009] Weinberger, K.; Dasgupta, A.; Langford, J.; Smola, A.; and Attenberg, J. 2009. Feature hashing for large scale multitask learning. In Proceedings of the 26th Annual International Conference on Machine Learning, 1113–1120. ACM.
  • [Xing et al. 2003] Xing, E. P.; Jordan, M. I.; Russell, S. J.; and Ng, A. Y. 2003. Distance metric learning with application to clustering with side-information. In NIPS, 521–528.
  • [Xu et al. 2017] Xu, Y.; Yang, H.; Zhang, L.; and Yang, T. 2017. Efficient non-oblivious randomized reduction for risk minimization with improved excess risk guarantee. In Thirty-First AAAI Conference on Artificial Intelligence.
  • [Zhang et al. 2013] Zhang, L.; Mahdavi, M.; Jin, R.; Yang, T.; and Zhu, S. 2013. Recovering the optimal solution by dual random projection. In Conference on Learning Theory, 135–157.