
On The Relative Error of Random Fourier Features for Preserving Kernel Distance

Kuan Cheng
Peking University
Email: [email protected]

Shaofeng H.-C. Jiang
Peking University
Email: [email protected]

Luojian Wei
Peking University
Email: [email protected]

Zhide Wei
Peking University
Email: [email protected]
Abstract

The method of random Fourier features (RFF), proposed in a seminal paper by Rahimi and Recht (NIPS’07), is a powerful technique to find approximate low-dimensional representations of points in (high-dimensional) kernel space, for shift-invariant kernels. While RFF has been analyzed under various notions of error guarantee, the ability to preserve the kernel distance with relative error is less understood. We show that for a significant range of kernels, including the well-known Laplacian kernels, RFF cannot approximate the kernel distance with small relative error using low dimensions. We complement this by showing that as long as the shift-invariant kernel is analytic, RFF with $\operatorname{poly}(\varepsilon^{-1}\log n)$ dimensions achieves $\varepsilon$-relative error for pairwise kernel distance of $n$ points, and the dimension bound is improved to $\operatorname{poly}(\varepsilon^{-1}\log k)$ for the specific application of kernel $k$-means. Finally, going beyond RFF, we make the first step towards data-oblivious dimension reduction for general shift-invariant kernels, and we obtain a similar $\operatorname{poly}(\varepsilon^{-1}\log n)$ dimension bound for Laplacian kernels. We also validate the dimension-error tradeoff of our methods on simulated datasets, and they demonstrate superior performance compared with other popular methods including random-projection and Nyström methods.

1 Introduction

We study the ability of the random Fourier features (RFF) method (Rahimi & Recht, 2007) to preserve the kernel distance with relative error. The kernel method (Schölkopf & Smola, 2002) is a systematic way to map the input data into a (possibly infinite-dimensional) feature space to introduce richer structures, such as non-linearity. In particular, for a set of $n$ data points $P$, a kernel function $K:P\times P\to\mathbb{R}$ implicitly defines a feature mapping $\varphi:P\to\mathcal{H}$ to a feature space $\mathcal{H}$ which is a Hilbert space, such that $\forall x,y$, $K(x,y)=\langle\varphi(x),\varphi(y)\rangle$. Kernel methods have been successfully applied in classical machine learning (Boser et al., 1992; Schölkopf et al., 1998; Girolami, 2002), and it has recently been established that, in a certain sense, the behavior of neural networks may be modeled as a kernel (Jacot et al., 2018).

Despite their superior power and wide applicability, scalability has been an outstanding issue in applying kernel methods. Specifically, the representation of data points in the feature space is only implicit, and solving for the explicit representation, which is crucially required in many algorithms, takes at least $\Omega(n^{2})$ time in the worst case. While for many problems such as kernel SVM it is possible to apply the so-called “kernel trick” to rewrite the objective in terms of $K(x,y)$, the explicit representation is still often preferred, since it is compatible with a larger range of solvers/algorithms and thus allows better efficiency.

In a seminal work (Rahimi & Recht, 2007), Rahimi and Recht addressed this issue by introducing the method of random Fourier features (see Section 2 for a detailed description), which computes an explicit low-dimensional mapping $\varphi^{\prime}:P\to\mathbb{R}^{D}$ (for $D\ll n$) such that $\langle\varphi^{\prime}(x),\varphi^{\prime}(y)\rangle\approx\langle\varphi(x),\varphi(y)\rangle=K(x,y)$, for shift-invariant kernels (i.e., kernels for which there exists $K:P\to\mathbb{R}$ such that $K(x,y)=K(x-y)$), which include the widely used Gaussian, Cauchy, and Laplacian kernels.

Towards understanding this fundamental method of RFF, a long line of research has focused on analyzing the tradeoff between the target dimension $D$ and the accuracy of approximating $K$ under certain error measures. This includes the additive error $\max_{x,y}|\langle\varphi(x),\varphi(y)\rangle-\langle\varphi^{\prime}(x),\varphi^{\prime}(y)\rangle|$ (Rahimi & Recht, 2007; Sriperumbudur & Szabó, 2015; Sutherland & Schneider, 2015), the spectral error (Avron et al., 2017; Choromanski et al., 2018; Zhang et al., 2019; Erdélyi et al., 2020; Ahle et al., 2020), and the generalization error of several learning tasks such as kernel SVM and kernel ridge regression (Avron et al., 2017; Sun et al., 2018; Li et al., 2021). A more comprehensive overview of the study of RFF can be found in a recent survey (Liu et al., 2021).

We focus on analyzing RFF with respect to the kernel distance. Here, the kernel distance of two data points $x,y$ is defined as their (Euclidean) distance in the feature space, i.e.,

\[\operatorname{dist}_{\varphi}(x,y)=\|\varphi(x)-\varphi(y)\|_{2}.\]

While previous results on the additive error of $K(x,y)$ (Rahimi & Recht, 2007; Sriperumbudur & Szabó, 2015; Sutherland & Schneider, 2015; Avron et al., 2017) readily imply an additive error guarantee for $\operatorname{dist}_{\varphi}(x,y)$, the relative error guarantee is less understood. As far as we know, Chen & Phillips (2017) is the only previous work that gives a relative error bound for the kernel distance, but unfortunately, only the Gaussian kernel is studied in that work, and whether or not the kernel distance for other shift-invariant kernels is preserved by RFF is still largely open.

In spirit, this multiplicative error guarantee of RFF, if it indeed exists, makes RFF a kernelized version of the Johnson-Lindenstrauss Lemma (Johnson & Lindenstrauss, 1984), which is one of the central results in dimension reduction. This guarantee is also very useful for downstream applications, since one can combine it directly with classical geometric algorithms such as k-means++ (Arthur & Vassilvitskii, 2007), locality sensitive hashing (Indyk & Motwani, 1998), and fast geometric matching algorithms (Raghvendra & Agarwal, 2020) to obtain very efficient algorithms for kernelized $k$-means clustering, nearest neighbor search, matching, and many more.

1.1 Our Contributions

Our main results are characterizations of the kernel functions on which RFF preserves the kernel distance with small relative error using $\operatorname{poly}\log$ target dimensions. Furthermore, we also explore how to obtain data-oblivious dimension reduction for kernels that cannot be handled by RFF.

As mentioned, it has been shown that RFF with small dimension preserves the kernel distance with additive error for all shift-invariant kernels (Rahimi & Recht, 2007; Sriperumbudur & Szabó, 2015; Sutherland & Schneider, 2015). In addition, it has been shown in Chen & Phillips (2017) that RFF indeed preserves the relative error of the kernel distance for Gaussian kernels (which are shift-invariant). Hence, by analogy with the additive case, and as informally claimed in Chen & Phillips (2017), one might be tempted to expect that RFF preserves the relative error for general shift-invariant kernels as well.

Lower Bounds.

Surprisingly, we show that this is not the case. In particular, we show that for a wide range of kernels, including the well-known Laplacian kernels, RFF requires unbounded target dimension to preserve the kernel distance within any constant multiplicative error. We state the result for a Laplacian kernel below; the full statement of the general conditions on kernels can be found in Theorem 4.1. In fact, we show a quantitatively stronger result: if the input is $(\Delta,\rho)$-bounded, then preserving any constant multiplicative error requires $\Omega(\operatorname{poly}(\Delta/\rho))$ target dimension. Here, a point $x\in\mathbb{R}^{d}$ is $(\Delta,\rho)$-bounded if $\|x\|_{\infty}\leq\Delta$ and $\min_{i:x_{i}\neq 0}|x_{i}|\geq\rho$, i.e., the magnitude is (upper) bounded by $\Delta$ and the resolution is (lower) bounded by $\rho$.

Theorem 1.1 (Lower bound; see Remark 4.1).

For every $\Delta\geq\rho>0$, let $\varphi:\mathbb{R}^{d}\to\mathcal{H}$ be a feature mapping of the Laplacian kernel $K(x,y)=\exp(-\|x-y\|_{1})$. If for every $x,y\in\mathbb{R}^{d}$ that are $(\Delta,\rho)$-bounded, the RFF mapping $\pi$ for $K$ with target dimension $D$ satisfies $\operatorname{dist}_{\pi}(x,y)\in(1\pm\varepsilon)\cdot\operatorname{dist}_{\varphi}(x,y)$ with constant probability, then $D\geq\Omega(\frac{1}{\varepsilon^{2}}\cdot\frac{\Delta}{\rho})$. This holds even when $d=1$.

Upper Bounds.

Complementing the lower bound, we show that RFF can indeed preserve the kernel distance within $1\pm\varepsilon$ error using $\operatorname{poly}(\varepsilon^{-1}\log n)$ target dimensions with high probability, as long as the kernel function is shift-invariant and analytic, which includes Gaussian kernels and Cauchy kernels. Our target dimension nearly matches (up to the degree of the polynomial in the parameters) what is achievable by the Johnson-Lindenstrauss transform (Johnson & Lindenstrauss, 1984), which is known to be tight (Larsen & Nelson, 2017). This upper bound also greatly generalizes the result of Chen & Phillips (2017), which only works for Gaussian kernels (see Appendix G for a detailed comparison).

Theorem 1.2 (Upper bound).

Let $K:\mathbb{R}^{d}\times\mathbb{R}^{d}\rightarrow\mathbb{R}$ be a kernel function which is shift-invariant and analytic at the origin, with feature mapping $\varphi:\mathbb{R}^{d}\to\mathcal{H}$ for some feature space $\mathcal{H}$. For every $0<\delta\leq\varepsilon\leq 2^{-16}$ and every $d,D\in\mathbb{N}$ with $D\geq\max\{\Theta(\varepsilon^{-1}\log^{3}(1/\delta)),\Theta(\varepsilon^{-2}\log(1/\delta))\}$, if $\pi:\mathbb{R}^{d}\to\mathbb{R}^{D}$ is an RFF mapping for $K$ with target dimension $D$, then for every $x,y\in\mathbb{R}^{d}$,

\[\Pr[|\operatorname{dist}_{\pi}(x,y)-\operatorname{dist}_{\varphi}(x,y)|\leq\varepsilon\cdot\operatorname{dist}_{\varphi}(x,y)]\geq 1-\delta.\]

The technical core of our analysis is a moment bound for RFF, which is derived by analytic techniques such as Taylor expansion and Cauchy’s integral formula for multivariate functions. The moment bound is slightly weaker than the corresponding bound for Gaussian variables, and this is the primary reason that we obtain a bound weaker than that of the Johnson-Lindenstrauss transform. Finally, several additional steps are required to fit this moment bound into Bernstein’s inequality, which yields the bound in Theorem 1.2.

Improved Dimension Bound for Kernel $k$-Means.

We show that if we focus on the specific application of kernel $k$-means, then it suffices to set the target dimension $D=\operatorname{poly}(\varepsilon^{-1}\log k)$, instead of $D=\operatorname{poly}(\varepsilon^{-1}\log n)$, to preserve the kernel $k$-means clustering cost for every $k$-partition. This follows from the probabilistic guarantee of RFF in Theorem 1.2 plus a generalization of the dimension-reduction result proved in a recent paper (Makarychev et al., 2019). Here, given a data set $P\subset\mathbb{R}^{d}$ and a kernel function $K:\mathbb{R}^{d}\times\mathbb{R}^{d}\to\mathbb{R}$, denoting the feature mapping by $\varphi:\mathbb{R}^{d}\to\mathcal{H}$, the kernel $k$-means problem asks to find a $k$-partition $\mathcal{C}:=\{C_{1},\ldots,C_{k}\}$ of $P$ such that $\operatorname{cost}^{\varphi}(P,\mathcal{C})=\sum_{i=1}^{k}\min_{c_{i}\in\mathcal{H}}\sum_{x\in C_{i}}\|\varphi(x)-c_{i}\|_{2}^{2}$ is minimized.
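To make the pipeline concrete, the following is a minimal sketch (not the implementation used in our experiments) of approximate kernel $k$-means via RFF, assuming a Gaussian kernel $K(x,y)=\exp(-\|x-y\|_{2}^{2}/2)$ and scikit-learn's KMeans for the Lloyd's step; the data, dimensions, and parameter values are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Approximate kernel k-means: embed the points with RFF, then run ordinary
# k-means on the low-dimensional embedding.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 20))            # toy data set P
D, k = 256, 5                                 # RFF target dimension, number of clusters

W = rng.standard_normal((D, X.shape[1]))      # omega_i ~ N(0, I_d), the spectral density of K
Z = X @ W.T                                   # inner products <omega_i, x>
Pi = np.concatenate([np.sin(Z), np.cos(Z)], axis=1) / np.sqrt(D)   # RFF embedding pi(x)

labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Pi)
print(labels[:10])
```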

Theorem 1.3 (Dimension reduction for clustering; see Theorem 3.1).

For the kernel $k$-means problem whose kernel function $K:\mathbb{R}^{d}\times\mathbb{R}^{d}\to\mathbb{R}$ is shift-invariant and analytic at the origin, for every data set $P\subset\mathbb{R}^{d}$, the RFF mapping $\pi:\mathbb{R}^{d}\rightarrow\mathbb{R}^{D}$ with target dimension $D\geq O(\frac{1}{\varepsilon^{2}}(\log^{3}\frac{k}{\delta}+\log^{3}\frac{1}{\varepsilon}))$, with probability at least $1-\delta$, preserves the clustering cost within $1\pm\varepsilon$ error for every $k$-partition simultaneously.

Applying RFF to speed up kernel $k$-means has also been considered in Chitta et al. (2012), but their error bound is much weaker than ours (and theirs is not a generic dimension-reduction bound). Also, similar dimension-reduction bounds (i.e., independent of $n$) for kernel $k$-means were obtained using Nyström methods (Musco & Musco, 2017; Wang et al., 2019), but their bound is $\operatorname{poly}(k)$, which is worse than our $\operatorname{poly}\log(k)$; furthermore, our RFF-based approach is unique in that it is data-oblivious, which enables broad applicability in other relevant computational settings such as streaming and distributed computing.

Going beyond RFF.

Finally, even though we have proved that RFF cannot preserve the kernel distance for every shift-invariant kernel, this does not rule out the existence of other efficient data-oblivious dimension reduction methods for those kernels, particularly for Laplacian kernels, which are the primary example in our lower bound. For instance, in the same paper where RFF was proposed, Rahimi and Recht (Rahimi & Recht, 2007) also considered an alternative embedding called “binning features” that can work for Laplacian kernels. Unfortunately, to achieve a relative error of $\varepsilon$, it requires a dimension that depends linearly on the magnitude/aspect-ratio of the dataset, which may be exponential in the input size. Follow-up works, such as (Backurs et al., 2019), suffer from similar issues.

We make the first successful attempt in this direction, and we show that Laplacian kernels do admit an efficient data-oblivious dimension reduction. Here, we use a setting similar to our lower bound (Theorem 1.1), where we focus on the $(\Delta,\rho)$-bounded case.

Theorem 1.4 (Oblivious dimension-reduction for Laplacian kernels, see Theorem F.1).

Let $K$ be a Laplacian kernel, and denote its feature mapping by $\varphi:\mathbb{R}^{d}\to\mathcal{H}$. For every $0<\delta\leq\varepsilon\leq 2^{-16}$, every $D\geq\max\{\Theta(\varepsilon^{-1}\log^{3}(1/\delta)),\Theta(\varepsilon^{-2}\log(1/\delta))\}$, and every $\Delta\geq\rho>0$, there is a mapping $\pi:\mathbb{R}^{d}\to\mathbb{R}^{D}$ such that for every $x,y\in\mathbb{R}^{d}$ that are $(\Delta,\rho)$-bounded, it holds that

\[\Pr[|\operatorname{dist}_{\pi}(x,y)-\operatorname{dist}_{\varphi}(x,y)|\leq\varepsilon\cdot\operatorname{dist}_{\varphi}(x,y)]\geq 1-\delta.\]

The time for evaluating $\pi$ is $dD\cdot\operatorname{poly}(\log\frac{\Delta}{\rho},\log\delta^{-1})$.

Our target dimension only depends on $\log\frac{\Delta}{\rho}$, which may be interpreted as the precision of the input. Hence, as an immediate corollary, for any $n$-point dataset with precision $1/\operatorname{poly}(n)$, we have an embedding with target dimension $D=\operatorname{poly}(\varepsilon^{-1}\log n)$, where the success probability is $1-1/\operatorname{poly}(n)$ and the overall running time of embedding the $n$ points is $O(n\operatorname{poly}(d\varepsilon^{-1}\log n))$.

Our proof relies on the fact that every $\ell_{1}$ metric space can be embedded isometrically into a squared $\ell_{2}$ metric space. We explicitly implement an approximate version of this embedding (Kahane, 1981), and eventually reduce our problem for Laplacian kernels to Gaussian kernels. After this reduction, we use the RFF for Gaussian kernels to obtain the final mapping. However, since the embedding into squared $\ell_{2}$ has very high dimension, to implement this whole idea efficiently we need to exploit the special structure of the embedding, combined with an application of space-bounded pseudorandom generators (PRGs) (Nisan, 1992).

Even though our algorithm utilizes the special properties of Laplacian kernels and eventually still partially uses the RFF for Gaussian kernels, it is of conceptual importance: it opens up the direction of exploring general methods for Johnson-Lindenstrauss style dimension reduction for shift-invariant kernels. Furthermore, the lower bound suggests that Johnson-Lindenstrauss style dimension reduction for general shift-invariant kernels has to be non-differentiable, which is a fundamental difference from RFF. This requirement of non-analyticity seems very counter-intuitive, but our construction of the mapping for Laplacian kernels provides valuable insight into how such non-analytic mappings behave.

Experiments and Comparison to Other Methods.

Apart from RFF, the Nyström and random-projection methods are alternative popular methods for kernel dimension reduction. In Section 6, we conduct experiments to compare their empirical dimension-error tradeoffs with those of our methods on a simulated dataset. Since we focus on the error, we use “ideal” implementations of both methods that achieve the best accuracy, so the comparison only favors the two baselines: for Nyström, we use SVD on the kernel matrix, since Nyström methods can be viewed as fast and approximate low-rank approximations of the kernel matrix; for random projection, we apply the Johnson-Lindenstrauss transform to the explicit representations of points in the feature space. We run two experiments to compare RFF (on a Gaussian kernel) and our new algorithm from Theorem 1.4 (on a Laplacian kernel) against the two baselines, respectively. Our experiments indicate that the Nyström method is indeed incapable of preserving the kernel distance in relative error, and, more interestingly, our methods perform the best among the three, even better than the Johnson-Lindenstrauss transform, which is optimal in the worst case.

1.2 Related Work

Variants of the vanilla RFF, particularly those that use information in the input data set and/or sample random features non-uniformly, have also been considered, including leverage score sampling random Fourier features (LSS-RFF) (Rudi et al., 2018; Liu et al., 2020; Erdélyi et al., 2020; Li et al., 2021), weighted random features (Rahimi & Recht, 2008; Avron et al., 2016; Chang et al., 2017; Dao et al., 2017), and kernel alignment (Shahrampour et al., 2018; Zhen et al., 2020).

The RFF-based methods usually work for shift-invariant kernels only. For general kernels, techniques based on low-rank approximation of the kernel matrix, notably the Nyström method (Williams & Seeger, 2000; Gittens & Mahoney, 2016; Musco & Musco, 2017; Oglic & Gärtner, 2017; Wang et al., 2019) and incomplete Cholesky factorization (Fine & Scheinberg, 2001; Bach & Jordan, 2002; Chen et al., 2021; Jia et al., 2021), were developed. Moreover, specific sketching techniques are known for polynomial kernels (Avron et al., 2014; Woodruff & Zandieh, 2020; Ahle et al., 2020; Song et al., 2021), a basic type of kernel that is not shift-invariant.

2 Preliminaries

Random Fourier Features.

RFF was first introduced by Rahimi and Recht (Rahimi & Recht, 2007). It is based on the fact that, for a shift-invariant kernel $K:\mathbb{R}^{d}\to\mathbb{R}$ with $K(0)=1$ (which can be assumed w.l.o.g. by normalization), the function $p:\mathbb{R}^{d}\to\mathbb{R}$ given by $p(\omega)=\frac{1}{(2\pi)^{d}}\int_{\mathbb{R}^{d}}K(x)e^{-\mathrm{i}\langle\omega,x\rangle}\mathop{}\!\mathrm{d}x$, which is the Fourier transform of $K(\cdot)$, is a probability density (guaranteed by Bochner’s theorem (Bochner, 1933; Rudin, 1991)). Then, the RFF mapping is defined as

\[\pi(x):=\sqrt{\frac{1}{D}}\begin{pmatrix}\sin\langle\omega_{1},x\rangle\\ \cos\langle\omega_{1},x\rangle\\ \vdots\\ \sin\langle\omega_{D},x\rangle\\ \cos\langle\omega_{D},x\rangle\end{pmatrix}\]

where $\omega_{1},\omega_{2},\dots,\omega_{D}\in\mathbb{R}^{d}$ are i.i.d. samples from the distribution with density $p$.
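For concreteness, here is a minimal sketch of the RFF mapping for the Gaussian kernel $K(x-y)=\exp(-\|x-y\|_{2}^{2}/2)$, whose spectral density $p$ is the standard Gaussian; the function name and parameter values below are ours, not from the paper.

```python
import numpy as np

def rff_map(X, D, seed=0):
    """RFF mapping for the Gaussian kernel K(z) = exp(-||z||^2 / 2).

    Returns an (n, 2D) array; stacking all sines before all cosines (instead of
    interleaving them) leaves inner products, and hence distances, unchanged.
    """
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((D, X.shape[1]))   # omega_1, ..., omega_D ~ p = N(0, I_d)
    Z = X @ W.T                                # <omega_i, x> for every point and frequency
    return np.concatenate([np.sin(Z), np.cos(Z)], axis=1) / np.sqrt(D)

# <pi(x), pi(y)> concentrates around K(x - y):
rng = np.random.default_rng(1)
x, y = rng.standard_normal(5), rng.standard_normal(5)
P = rff_map(np.stack([x, y]), D=4096)
print(P[0] @ P[1], np.exp(-0.5 * np.linalg.norm(x - y) ** 2))
```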

Theorem 2.1 (Rahimi & Recht 2007).

\[\mathbb{E}[\langle\pi(x),\pi(y)\rangle]=\frac{1}{D}\sum_{i=1}^{D}\mathbb{E}[\cos\langle\omega_{i},x-y\rangle]=K(x-y).\]

Fact 2.1.

Let $\omega$ be a random variable with distribution $p$ over $\mathbb{R}^{d}$. Then

\[\forall t\in\mathbb{R},\quad\mathbb{E}[\cos{(t\langle\omega,x-y\rangle)}]=\Re\int_{\mathbb{R}^{d}}p(\omega)e^{\mathrm{i}\langle\omega,t(x-y)\rangle}\mathop{}\!\mathrm{d}\omega=K(t(x-y)),\qquad(1)\]

and $\operatorname{Var}(\cos\langle\omega,x-y\rangle)=\frac{1+K(2(x-y))-2K(x-y)^{2}}{2}$.
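Both identities in Fact 2.1 (at $t=1$) are easy to check empirically; a small Monte Carlo sanity check for the Gaussian kernel, whose spectral density is the standard Gaussian, might look as follows (parameter values are arbitrary).

```python
import numpy as np

K = lambda z: np.exp(-0.5 * np.linalg.norm(z) ** 2)   # Gaussian kernel, K(0) = 1
rng = np.random.default_rng(0)
delta = rng.standard_normal(3)                        # delta = x - y
omega = rng.standard_normal((200_000, 3))             # omega ~ p = N(0, I_3)
c = np.cos(omega @ delta)

print(c.mean(), K(delta))                                     # E[cos<omega, delta>] = K(delta)
print(c.var(), (1 + K(2 * delta) - 2 * K(delta) ** 2) / 2)    # Var(cos<omega, delta>)
```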

3 Upper Bounds

We present two results in this section. In Section 3.1 we show that RFF preserves the kernel distance with relative error using $\operatorname{poly}(\varepsilon^{-1}\log n)$ target dimensions with high probability, when the kernel function is shift-invariant and analytic at the origin. Then in Section 3.2, combining this bound with a generalized analysis from a recent paper (Makarychev et al., 2019), we show that RFF also preserves the clustering cost for kernel $k$-clustering problems with $\ell_{2}^{p}$-objective, with target dimension only $\operatorname{poly}(\varepsilon^{-1}\log k)$, which is independent of $n$.

3.1 Proof of Theorem 1.2: The Relative Error for Preserving Kernel Distance

Since $K$ is shift-invariant, we interpret $K$ as a function on $\mathbb{R}^{d}$ instead of $\mathbb{R}^{d}\times\mathbb{R}^{d}$, such that $K(x,y)=K(x-y)$. For simplicity of exposition we further assume $K(x,x)=1$ for all $x$; with a straightforward modification, our proof works without this assumption. As in Section 2, let $p:\mathbb{R}^{d}\to\mathbb{R}$ be the Fourier transform of $K$, and suppose that in the RFF mapping $\pi$, the random variables $\omega_{1},\ldots,\omega_{D}\in\mathbb{R}^{d}$ are i.i.d. samples from the distribution with density $p$. When we say $K$ is analytic at the origin, we mean there exists some constant $r$ such that $K$ is analytic in $\{x\in\mathbb{R}^{d}:\|x\|_{1}<r\}$; we pick $r_{K}$ to be the maximum such $r$. Also notice that in $D\geq\max\{\Theta(\varepsilon^{-1}\log^{3}(1/\delta)),\Theta(\varepsilon^{-2}\log(1/\delta))\}$, constants depending on $K$ (e.g., $R_{K}$ from Lemma 3.2) are hidden inside the $\Theta$.

Fact 3.1.

The following holds.

  • $\operatorname{dist}_{\pi}(x,y)=\sqrt{2-\frac{2}{D}\sum_{i=1}^{D}\cos\langle\omega_{i},x-y\rangle}$, and $\operatorname{dist}_{\varphi}(x,y)=\sqrt{2-2K(x-y)}$.

  • $\Pr[|\operatorname{dist}_{\pi}(x,y)-\operatorname{dist}_{\varphi}(x,y)|\leq\varepsilon\cdot\operatorname{dist}_{\varphi}(x,y)]\geq\Pr[|\operatorname{dist}_{\pi}(x,y)^{2}-\operatorname{dist}_{\varphi}(x,y)^{2}|\leq\varepsilon\cdot\operatorname{dist}_{\varphi}(x,y)^{2}]$.

Define $X_{i}(x):=\cos\langle\omega_{i},x\rangle-K(x)$. As a crucial step, we next analyze the moments of the random variables $X_{i}(x-y)$. This bound will be plugged into Bernstein’s inequality to conclude the proof.

Lemma 3.1.

If for some $r>0$, $K$ is analytic in $\{x\in\mathbb{R}^{d}:\|x\|_{1}<r\}$, then for every even $k\geq 1$ and every $x$ such that $\|x\|_{1}<r$, we have $\mathbb{E}[|X_{i}(x)|^{k}]\leq\left(\frac{4k\|x\|_{1}}{r}\right)^{2k}$.

Proof.

The proof can be found in Appendix A. ∎

Lemma 3.2.

For a kernel $K$ which is shift-invariant and analytic at the origin, there exist $c_{K},R_{K}>0$ such that for all $\|x\|_{1}\leq R_{K}$, $\frac{1-K(x)}{\|x\|_{1}^{2}}\geq\frac{c_{K}}{2}$.

Proof.

The proof can be found in Appendix B. ∎

Proof sketch of Theorem 1.2.

We present a proof sketch for Theorem 1.2; the full proof can be found in Appendix C. We focus on the case $\|x-y\|_{1}\leq R_{K}$ (the other case is handled in the full proof). By Lemma 3.2, we have $2-2K(x-y)\geq c\|x-y\|_{1}^{2}$ (with $c=c_{K}$). Then we have:

\[\Pr\left[\left|\frac{2}{D}\sum_{i=1}^{D}X_{i}(x-y)\right|\leq\varepsilon\cdot(2-2K(x-y))\right]\geq\Pr\left[\left|\frac{2}{D}\sum_{i=1}^{D}X_{i}(x-y)\right|\leq c\varepsilon\cdot\|x-y\|_{1}^{2}\right].\]

For simplicity of exposition we take $r=r_{K}$. Assume $\delta\leq\min\{\varepsilon,2^{-16}\}$, and let $k=\log(2D^{2}/\delta)$ (assumed to be an even integer) and $t=64k^{2}/r^{2}$. By Markov’s inequality and Lemma 3.1:

\[\Pr[|X_{i}(x-y)|\geq t\|x-y\|^{2}_{1}]=\Pr\left[|X_{i}(x-y)|^{k}\geq t^{k}\|x-y\|_{1}^{2k}\right]\leq\frac{(4k)^{2k}}{t^{k}r^{2k}}=4^{-k}\leq\frac{\delta}{2D^{2}}.\]

For simplicity, denote $X_{i}(x-y)$ by $X_{i}$ and $\|x-y\|_{1}^{2}$ by $\ell$, and define $X_{i}^{\prime}=\mathbbm{1}_{[|X_{i}|\geq t\ell]}\,t\ell\cdot\mathrm{sgn}(X_{i})+\mathbbm{1}_{[|X_{i}|<t\ell]}X_{i}$; note that $\mathbb{E}[X_{i}]=0$. By some further calculation and plugging in the parameters $t,\delta,D$, we can eventually obtain $|\mathbb{E}[X_{i}^{\prime}]|\leq\delta\ell$. Denote by $\sigma^{\prime 2}$ the variance of $X_{i}^{\prime}$; then again by Lemma 3.1 we immediately have $\sigma^{\prime}\leq 64\ell/r^{2}$. The theorem follows by a straightforward application of Bernstein’s inequality. ∎

3.2 Dimension Reduction for Kernel Clustering

We present the formal statement of Theorem 1.3 in Theorem 3.1. In fact, we consider the more general $k$-clustering problem with $\ell_{2}^{p}$-objective defined in Definition 3.1 below, which generalizes kernel $k$-means (by setting $p=2$).

Definition 3.1.

Given a data set $P\subset\mathbb{R}^{d}$ and a kernel function $K:\mathbb{R}^{d}\times\mathbb{R}^{d}\to\mathbb{R}$, denoting the feature mapping by $\varphi:\mathbb{R}^{d}\to\mathcal{H}$, the kernel $k$-clustering problem with $\ell_{2}^{p}$-objective asks for a $k$-partition $\mathcal{C}=\{C_{1},C_{2},\ldots,C_{k}\}$ of $P$ that minimizes the cost function $\operatorname{cost}_{p}^{\varphi}(P,\mathcal{C}):=\sum_{i=1}^{k}\min_{c_{i}\in\mathcal{H}}\sum_{x\in C_{i}}\|\varphi(x)-c_{i}\|_{2}^{p}$.

Theorem 3.1 (Generalization of Theorem 3.6 in Makarychev et al. (2019)).

For the kernel $k$-clustering problem with $\ell_{2}^{p}$-objective whose kernel function $K:\mathbb{R}^{d}\times\mathbb{R}^{d}\to\mathbb{R}$ is shift-invariant and analytic at the origin, for every data set $P\subset\mathbb{R}^{d}$, the RFF mapping $\pi:\mathbb{R}^{d}\rightarrow\mathbb{R}^{D}$ with target dimension $D=\Omega(p^{2}\log^{3}\frac{k}{\delta}+p^{5}\log^{3}\frac{1}{\varepsilon}+p^{8})/\varepsilon^{2}$ satisfies

\[\Pr[\forall\ k\text{-partition }\mathcal{C}\text{ of }P:\ \operatorname{cost}_{p}^{\pi}(P,\mathcal{C})\in(1\pm\varepsilon)\cdot\operatorname{cost}_{p}^{\varphi}(P,\mathcal{C})]\geq 1-\delta.\]
Proof.

The proof can be found in Appendix D. ∎

4 Lower Bounds

Theorem 4.1.

Consider $\Delta\geq\rho>0$ and a shift-invariant kernel function $K:\mathbb{R}^{d}\to\mathbb{R}$ with feature mapping $\varphi:\mathbb{R}^{d}\to\mathcal{H}$. Then there exist $x,y\in\mathbb{R}^{d}$ that are $(\Delta,\rho)$-bounded, such that for every $0<\varepsilon<1$, the RFF mapping $\pi$ for $K$ with target dimension $D$ satisfies

\[\Pr[|\operatorname{dist}_{\varphi}(x,y)-\operatorname{dist}_{\pi}(x,y)|\geq\varepsilon\cdot\operatorname{dist}_{\varphi}(x,y)]\geq\frac{2}{\sqrt{2\pi}}\int_{6\varepsilon\sqrt{D/s_{K}^{(\Delta,\rho)}}}^{\infty}e^{-s^{2}/2}\mathop{}\!\mathrm{d}s-O\left(D^{-\frac{1}{2}}\right)\qquad(2)\]

where $s_{K}^{(\Delta,\rho)}:=\sup_{(\Delta,\rho)\text{-bounded }x\in\mathbb{R}^{d}}s_{K}(x)$ and $s_{K}(x):=\frac{1+K(2x)-2K(x)^{2}}{2(1-K(x))^{2}}$.

Proof.

The proof can be found in Appendix E. ∎

Note that the right-hand side of (2) is always less than $1$: the first term $\frac{2}{\sqrt{2\pi}}\int_{6\varepsilon\sqrt{D/s_{K}^{(\Delta,\rho)}}}^{\infty}e^{-s^{2}/2}\mathop{}\!\mathrm{d}s$ achieves its maximum at $\frac{D\varepsilon^{2}}{s_{K}^{(\Delta,\rho)}}=0$, and this maximum is $1$. On the other hand, we need the right-hand side of (2) to be positive in order to obtain a useful lower bound, and a typical setup achieving this is $D=\Theta(s_{K}^{(\Delta,\rho)})$.

Intuition of $s_{K}$.

Observe that $s_{K}(x)$ measures the ratio between the variance of RFF and the squared expectation evaluated at $x$. The intuition for considering this quantity comes from the central limit theorem: when the number of samples/target dimension is sufficiently large, the error/difference behaves like a Gaussian, and with constant probability the error is on the order of the standard deviation. Hence, $s_{K}$ measures the “typical” relative error when the target dimension is sufficiently large, and an upper bound on $s_{K}^{(\Delta,\rho)}$ is naturally a necessary condition for bounded relative error. The following gives a simple (sufficient) condition for kernels that do not have a bounded $s_{K}(x)$.

Remark 4.1 (Simple sufficient conditions for lower bounds).

Assume the input dimension is $1$, so $K:\mathbb{R}\to\mathbb{R}$, and assume $\Delta=1$, $\rho<1$. Then the $(\Delta,\rho)$-bounded property simply requires $\rho\leq|x|\leq 1$. We claim that if $K$’s first derivative at $0$ is non-zero, i.e., $K^{\prime}(0)\neq 0$, then RFF cannot preserve relative error for such $K$. To see this, we use Taylor’s expansion of $K$ at the origin to degree one, i.e., $K(x)\approx 1+ax$ (noting that $|x|\leq 1$, so this is a good approximation), where $a=K^{\prime}(0)$. Then

\[s_{K}(x)=\frac{1+1+2ax-2(1+ax)^{2}}{2a^{2}x^{2}}=-1-\frac{1}{ax}.\]

So if $a=K^{\prime}(0)\neq 0$, then for sufficiently small $\rho$ and $|x|\geq\rho$, $s_{K}(\rho)\geq\Omega(1/\rho)$. This also implies the claim in Theorem 1.1 for Laplacian kernels (even though one needs to slightly modify the analysis, since strictly speaking $K^{\prime}$ is not well defined at $0$ for Laplacian kernels). As a sanity check, for shift-invariant kernels that are analytic at the origin (which include Gaussian kernels), it necessarily holds that $K^{\prime}(0)=0$.
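The blow-up of $s_{K}$ near the origin is easy to see numerically; the snippet below evaluates $s_{K}(x)$ at small $x$ for a 1-D Laplacian kernel ($K^{\prime}(0)\neq 0$) and a 1-D Gaussian kernel ($K^{\prime}(0)=0$), purely as an illustration of Remark 4.1.

```python
import numpy as np

def s_K(K, x):
    # s_K(x) = (1 + K(2x) - 2 K(x)^2) / (2 (1 - K(x))^2), as in Theorem 4.1
    return (1 + K(2 * x) - 2 * K(x) ** 2) / (2 * (1 - K(x)) ** 2)

laplace = lambda x: np.exp(-np.abs(x))    # not analytic at 0, K'(0) != 0
gauss = lambda x: np.exp(-x ** 2 / 2)     # analytic at 0, K'(0) == 0

for rho in [1e-1, 1e-2, 1e-3, 1e-4]:
    print(rho, s_K(laplace, rho), s_K(gauss, rho))   # grows like 1/rho vs. stays ~2
```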

5 Beyond RFF: Oblivious Embedding for Laplacian Kernel

In this section we provide a proof sketch for Theorem 1.4. A more detailed proof is deferred to Appendix F.

Embedding

To handle a Laplacian kernel function $K(x,y)=e^{-\frac{\|x-y\|_{1}}{c}}$ for some constant $c$, we cannot directly use the RFF mapping $\phi$, since our lower bound shows that the output dimension has to be very large when $K$ is not analytic around the origin. To overcome this issue, we use the following idea. Notice that $K(x,y)$ depends only on the $\ell_{1}$-distance between $x$ and $y$. If one can embed the data points (via an embedding function $f$) from the original $\ell_{1}$ metric space into a new metric space, and ensure that there is a kernel function $K^{\prime}$ on the new space, analytic around the origin, such that $K(x,y)=K^{\prime}(f(x),f(y))$ for every pair of original data points $x,y$, then one can use the composition $\phi\circ f$ as the desired mapping.

Indeed, $\ell_{1}$ can be embedded into $\ell_{2}^{2}$ isometrically (Kahane, 1981) in the following way. For simplicity of exposition we only handle the case where the input data points are from $\mathbb{N}^{d}$, upper bounded by a natural number $N$. Notice that even though the input data points consist only of integers, the construction needs to handle fractions, as we will later consider numbers generated from Gaussian distributions or numbers computed in the RFF mapping. So we first set up two numbers: $\Delta^{\prime}=\operatorname{poly}(N,\delta^{-1})$ large enough and $\rho^{\prime}=1/\operatorname{poly}(N,\delta^{-1})$ small enough. All our subsequent operations work with numbers that are $(\Delta^{\prime},\rho^{\prime})$-bounded. For each dimension we do the following transformation. Let $\pi_{1}:\mathbb{N}\rightarrow\mathbb{R}^{N}$ be such that for every $x\in\mathbb{N}$ with $x\leq N$, the first $x$ entries of $\pi_{1}(x)$ are $1$ while all the remaining entries are $0$. Then, considering all $d$ dimensions, the embedding function $\pi_{1}^{(d)}:\mathbb{N}^{d}\rightarrow\mathbb{R}^{Nd}$ is such that for every $x\in\mathbb{N}^{d}$ with $x_{i}\leq N$ for all $i\in[d]$, $\pi^{(d)}_{1}(x)$ is the concatenation of the $d$ vectors $\pi_{1}(x_{i})$, $i\in[d]$. After the embedding, consider a new kernel function $K^{\prime}(x^{\prime},y^{\prime})=e^{-\frac{\|x^{\prime}-y^{\prime}\|_{2}^{2}}{c}}$, where $x^{\prime}=\pi^{(d)}_{1}(x)$ and $y^{\prime}=\pi^{(d)}_{1}(y)$. One sees immediately that $K^{\prime}(x^{\prime},y^{\prime})=K(x,y)$. Hence, we can then apply RFF, i.e., the mapping is $\phi\circ\pi_{1}^{(d)}$, which has a small output dimension; a brute-force sketch of this construction appears below. Detailed proofs can be found in Section F.1.
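The following is the brute-force sketch referenced above for 1-D integer inputs bounded by $N$, before the binary-tree and PRG optimizations of the next paragraphs; the choice $c=1$, the dimensions, and the function names are illustrative assumptions.

```python
import numpy as np

N, D = 1000, 2048
rng = np.random.default_rng(0)
# For K'(u, v) = exp(-||u - v||_2^2) the spectral density is N(0, 2 I), so each
# frequency vector has i.i.d. N(0, 2) entries.
W = rng.normal(scale=np.sqrt(2.0), size=(D, N))

def pi1(x):
    """Unary encoding: ||pi1(x) - pi1(y)||_2^2 = |x - y| for integers 0 <= x, y <= N."""
    v = np.zeros(N)
    v[:x] = 1.0
    return v

def embed(x):
    """phi(pi1(x)): Gaussian-kernel RFF applied to the unary encoding."""
    z = W @ pi1(x)
    return np.concatenate([np.sin(z), np.cos(z)]) / np.sqrt(D)

x, y = 413, 410
d_rff = np.linalg.norm(embed(x) - embed(y))
d_true = np.sqrt(2 - 2 * np.exp(-abs(x - y)))     # kernel distance for K(x, y) = exp(-|x - y|)
print(d_rff, d_true)
```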

However, there is another issue. In our setting, if the data is $(\Delta,\rho)$-bounded, then we have to pick $N=O(\frac{\Delta}{\rho})$. The computing time then has a factor linear in $N$, which is too large.

Polynomial Time Construction

To reduce the computation time, we start from the following observation about the RFF mapping $\phi(x^{\prime})$: each output dimension is a function of $\langle\omega,x^{\prime}\rangle$, where $\omega$ is a vector of i.i.d. Gaussian random variables. For simplicity of description we consider only the case where $x$ has one dimension and $x^{\prime}=\pi_{1}(x)$, so $x^{\prime}$ is just a vector consisting of $x$ ones followed by zeros. Notice that a sum of Gaussian random variables is still Gaussian, so given $x$, one can generate $\langle\omega,x^{\prime}\rangle$ directly as such a sum. But here comes another problem: for two data points $x,y$, we need to use the same $\omega$, so if we generate $\langle\omega,x^{\prime}\rangle$ and $\langle\omega^{\prime},y^{\prime}\rangle$ separately, then $\omega,\omega^{\prime}$ are independent.

To bypass this issue, first consider the following alternative way to generate $\langle\omega,x^{\prime}\rangle$. Let $h$ be the smallest integer such that $N\leq 2^{h}$, and consider a binary tree of depth $h$ in which each internal node has exactly $2$ children, so that it has exactly $2^{h}$ leaf nodes in the last layer. To each node $v$ we attach a random variable $\alpha_{v}$ in the following way. To the root, we attach a Gaussian variable which is the sum of $2^{h}$ independent Gaussian variables with distribution $\omega_{0}$. Then we proceed layer by layer from the root to the leaves. For $u,v$ being the children of a common parent $w$, assume that $\alpha_{w}$ is the sum of $2^{l}$ independent copies of $\omega_{0}$. Then let $\alpha_{u}$ be the sum of the first $2^{l-1}$ copies and $\alpha_{v}$ the sum of the remaining $2^{l-1}$ copies; that is, $\alpha_{w}=\alpha_{u}+\alpha_{v}$ with $\alpha_{u},\alpha_{v}$ independent. Notice that conditioned on $\alpha_{w}=a$, $\alpha_{u}$ takes the value $b$ with probability (density) $\Pr_{\alpha_{u},\alpha_{v}\text{ i.i.d.}}[\alpha_{u}=b\mid\alpha_{u}+\alpha_{v}=a]$, and $\alpha_{v}$ takes the value $a-b$ when $\alpha_{u}$ takes the value $b$.

The randomness used to generate the random variable at each node is laid out as a sequence, in order from the root to the leaves, layer by layer, from left to right. We define $\alpha^{x}$ to be the sum of the random variables at the first $x$ leaves. Notice that $\alpha^{x}$ can be sampled efficiently in the following way. Consider the path from the root to the $x$-th leaf. First we sample the root, which can be computed using the corresponding part of the randomness; we use a variable $z$ to record this outcome, calling $z$ an accumulator for convenience. Then we visit each node along the path. When visiting $v$, let $w$ be its parent, where $\alpha_{w}$ has already been sampled with outcome $a$. If $v$ is a left child of $w$, then we sample $\alpha_{v}$ conditioned on $\alpha_{w}=a$; letting $b$ be the outcome, we add $-a+b$ to the current accumulator $z$. If $v$ is a right child of $w$, then we keep the current accumulator $z$ unchanged. After visiting all nodes on the path, $z$ is the sample outcome for $\alpha^{x}$. We can show that the joint distribution of $\alpha^{x},\alpha^{y}$ is essentially the same as that of $\langle\omega,\pi_{1}(x)\rangle,\langle\omega,\pi_{1}(y)\rangle$; see Lemma F.2.
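A self-contained sketch of this lazy sampling procedure is given below for leaf distribution $\omega_{0}=N(0,1)$ (in the actual reduction $\omega_{0}$ is a coordinate of the Gaussian-RFF frequency vector). Per-node randomness is derived from a shared seed, which stands in for the shared random tape that the paper ultimately replaces with Nisan's PRG; all names and parameters here are ours.

```python
import numpy as np

def alpha(x, h, seed):
    """Sample alpha^x, the sum of the first x of 2^h i.i.d. N(0, 1) leaves,
    touching only the root-to-leaf path of the x-th leaf (1 <= x <= 2^h)."""
    a = np.random.default_rng((seed, 0, 0)).normal(0.0, np.sqrt(2.0 ** h))  # root ~ N(0, 2^h)
    z = a                                     # accumulator: current estimate of alpha^x
    node = 0                                  # index of the current node within its level
    for level in range(1, h + 1):
        half = 2 ** (h - level)               # leaves under each child (= child variance here)
        rng = np.random.default_rng((seed, level, node))       # randomness of this split
        b = rng.normal(a / 2.0, np.sqrt(half / 2.0))           # left child | parent = a
        if x <= half:                         # x-th leaf lies in the left subtree:
            z += -a + b                       # drop the (excluded) right subtree
            a, node = b, 2 * node
        else:                                 # left subtree is fully included: keep z
            x -= half
            a, node = a - b, 2 * node + 1
    return z

# With a shared seed, (alpha^x, alpha^y) behaves like (<omega, pi1(x)>, <omega, pi1(y)>):
h, x, y = 10, 413, 410
d = np.array([alpha(x, h, t) - alpha(y, h, t) for t in range(5000)])
print(d.var(), abs(x - y))                    # Var(alpha^x - alpha^y) ~ |x - y|
```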

The advantage of this alternative construction is that, given any $x$, to generate $\alpha^{x}$ one only needs to visit the path from the root to the $x$-th leaf, using the above procedure. To finally reduce the time complexity, the last issue is that the uniform random string used to generate all these random variables is very long: if we sweep the random tape to locate the randomness used for a given node, we still need time linear in $N$. Fortunately, PRGs for space-bounded computation, e.g., Nisan’s PRG (Nisan, 1992), can be used here to replace the uniform randomness, because the whole procedure for deciding whether $\|\phi\circ\pi_{1}(x)-\phi\circ\pi_{1}(y)\|_{2}$ approximates $\sqrt{2-2K(x,y)}$ within $(1\pm\varepsilon)$ multiplicative error runs in poly-logarithmic space. Moreover, such PRGs can be evaluated highly efficiently: given any index of the output, one can compute that bit in time polynomial in the seed length, which is poly-logarithmic in $N$. Hence the computation time of the mapping only has a factor poly-logarithmic in $N$ instead of linear in $N$.

We have now described our construction for the case where all input data points are from $\mathbb{N}$. One can generalize this to the case where all numbers are $(\Delta,\rho)$-bounded by simple rounding and shifting, and further to the case where the input data has $d$ dimensions by handling each dimension separately and concatenating the results. More details are deferred to Section F.4.

6 Experiments

We evaluate the empirical relative error of our methods on a simulated dataset. Specifically, we run two experiments: one evaluates RFF on a Gaussian kernel, and the other evaluates the new algorithm from Theorem F.1, which we call “New-Lap”, on a Laplacian kernel. In each experiment, we compare against two other popular methods, namely the Nyström and random-projection methods.

Baselines.

Observe that there are many possible implementations of these two methods. However, since we focus on the accuracy evaluation, we choose computationally heavy but more accurate implementations as the two baselines (hence the error evaluation is only in the baselines’ favor). In particular, we consider 1) SVD low-rank approximation, which we call “SVD”, and 2) the vanilla Johnson-Lindenstrauss algorithm performed on top of the high-dimensional representation of points in the feature space, which we call “JL”. Note that SVD is the “ideal” goal/form of Nyström methods, and that Johnson-Lindenstrauss applied in the feature space achieves a theoretically tight target-dimension bound (in the worst-case sense).

Experiment Setup.

Both experiments are conducted on a synthesized dataset $X$ consisting of $N=100$ points in $d=60$ dimensions generated i.i.d. from a Gaussian distribution. For the experiment evaluating RFF, we use a Gaussian kernel $K(x)=\exp(-0.5\cdot\|x\|^{2})$, and for the one evaluating New-Lap, we use a Laplacian kernel $K(x)=\exp(-0.5\cdot\|x\|_{1})$. In each experiment, for each method, we run it with varying target dimension $D$ (for SVD, $D$ is the target rank), and we report its empirical relative error, defined as

\[\max_{x\neq y\in X}\frac{|d^{\prime}(x,y)-d_{K}(x,y)|}{d_{K}(x,y)},\]

where $d_{K}$ is the kernel distance and $d^{\prime}$ is the approximated distance. To stabilize the results, we repeat the entire experiment for every $D$ for $T=20$ times and report the average and the 95% confidence interval. We plot the resulting dimension-error tradeoff curves in Figure 1.
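For reference, the empirical relative error above can be computed with a few lines of NumPy; the sketch below reimplements the RFF/Gaussian-kernel case of our setup ($N=100$, $d=60$) and is not the exact experiment script.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, D = 100, 60, 200
X = rng.standard_normal((N, d))                       # simulated dataset

# Exact kernel distances d_K(x, y) = sqrt(2 - 2 K(x - y)) for K(z) = exp(-0.5 ||z||^2).
diff = X[:, None, :] - X[None, :, :]
Kmat = np.exp(-0.5 * (diff ** 2).sum(-1))
d_true = np.sqrt(np.maximum(2 - 2 * Kmat, 0))

# Approximate distances d'(x, y) from the RFF embedding.
W = rng.standard_normal((D, d))
Z = X @ W.T
Pi = np.concatenate([np.sin(Z), np.cos(Z)], axis=1) / np.sqrt(D)
d_rff = np.linalg.norm(Pi[:, None, :] - Pi[None, :, :], axis=-1)

mask = ~np.eye(N, dtype=bool)                         # exclude x = y
print(np.max(np.abs(d_rff - d_true)[mask] / d_true[mask]))
```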

Figure 1: The dimension-error tradeoff curves for both experiments: (a) the experiment that evaluates RFF and (b) the one that evaluates New-Lap.
Results.

We conclude that in both experiments, our methods indeed preserve the kernel distance well in relative error, which verifies our theorems. In particular, the dimension-error curve is comparable to (and even slightly better than) that of the computationally heavy Johnson-Lindenstrauss algorithm (which is theoretically optimal in the worst case). On the contrary, the popular Nyström (low-rank approximation) method is largely incapable of preserving the relative error of the kernel distance. In fact, we observe that $d^{\prime}_{\mathrm{SVD}}(x,y)=0$ or $\approx 0$ often happens for pairs $(x,y)$ with $d(x,y)\neq 0$, which explains the high relative error. This indicates that our methods can indeed preserve the kernel distance well in relative error, while existing methods struggle to do so.

Acknowledgments

Research is partially supported by the National Key R&D Program of China (No. 2021YFA1000900), startup funds from Peking University, and the Advanced Institute of Information Technology, Peking University.

References

  • Ahle et al. (2020) Thomas D. Ahle, Michael Kapralov, Jakob Bæk Tejs Knudsen, Rasmus Pagh, Ameya Velingker, David P. Woodruff, and Amir Zandieh. Oblivious sketching of high-degree polynomial kernels. In SODA, pp.  141–160. SIAM, 2020.
  • Arthur & Vassilvitskii (2007) David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In SODA, pp.  1027–1035. SIAM, 2007.
  • Avron et al. (2014) Haim Avron, Huy L. Nguyen, and David P. Woodruff. Subspace embeddings for the polynomial kernel. In NIPS, pp.  2258–2266, 2014.
  • Avron et al. (2016) Haim Avron, Vikas Sindhwani, Jiyan Yang, and Michael W. Mahoney. Quasi-monte carlo feature maps for shift-invariant kernels. J. Mach. Learn. Res., 17:120:1–120:38, 2016.
  • Avron et al. (2017) Haim Avron, Michael Kapralov, Cameron Musco, Christopher Musco, Ameya Velingker, and Amir Zandieh. Random fourier features for kernel ridge regression: Approximation bounds and statistical guarantees. In ICML, volume 70 of Proceedings of Machine Learning Research, pp.  253–262. PMLR, 2017.
  • Bach & Jordan (2002) Francis R. Bach and Michael I. Jordan. Kernel independent component analysis. J. Mach. Learn. Res., 3:1–48, 2002.
  • Backurs et al. (2019) Arturs Backurs, Piotr Indyk, and Tal Wagner. Space and time efficient kernel density estimation in high dimensions. In NeurIPS, pp.  15773–15782, 2019.
  • Berry (1941) Andrew C Berry. The accuracy of the gaussian approximation to the sum of independent variates. Transactions of the american mathematical society, 49(1):122–136, 1941.
  • Bochner (1933) Salomon Bochner. Monotone funktionen, stieltjessche integrale und harmonische analyse. Mathematische Annalen, 108(1):378–410, 1933.
  • Boser et al. (1992) Bernhard E. Boser, Isabelle Guyon, and Vladimir Vapnik. A training algorithm for optimal margin classifiers. In COLT, pp.  144–152. ACM, 1992.
  • Box (1958) George EP Box. A note on the generation of random normal deviates. Ann. Math. Statist., 29:610–611, 1958.
  • Brehm (1981) Ulrich Brehm. Extensions of distance reducing mappings to piecewise congruent mappings on lm. Journal of Geometry, 16(1):187–193, 1981.
  • Chang et al. (2017) Wei-Cheng Chang, Chun-Liang Li, Yiming Yang, and Barnabás Póczos. Data-driven random fourier features using stein effect. In IJCAI, pp.  1497–1503. ijcai.org, 2017.
  • Chen & Phillips (2017) Di Chen and Jeff M. Phillips. Relative error embeddings of the gaussian kernel distance. In ALT, volume 76 of Proceedings of Machine Learning Research, pp.  560–576. PMLR, 2017.
  • Chen et al. (2021) Li Chen, Shuisheng Zhou, Jiajun Ma, and Mingliang Xu. Fast kernel k-means clustering using incomplete cholesky factorization. Appl. Math. Comput., 402:126037, 2021.
  • Chitta et al. (2012) Radha Chitta, Rong Jin, and Anil K. Jain. Efficient kernel clustering using random fourier features. In ICDM, pp.  161–170. IEEE Computer Society, 2012.
  • Choromanski et al. (2018) Krzysztof Choromanski, Mark Rowland, Tamás Sarlós, Vikas Sindhwani, Richard E. Turner, and Adrian Weller. The geometry of random features. In AISTATS, volume 84 of Proceedings of Machine Learning Research, pp.  1–9. PMLR, 2018.
  • Czumaj & Sohler (2004) Artur Czumaj and Christian Sohler. Sublinear-time approximation for clustering via random sampling. In ICALP, volume 3142 of Lecture Notes in Computer Science, pp.  396–407. Springer, 2004.
  • Dao et al. (2017) Tri Dao, Christopher De Sa, and Christopher Ré. Gaussian quadrature for kernel features. In NIPS, pp.  6107–6117, 2017.
  • Erdélyi et al. (2020) Tamás Erdélyi, Cameron Musco, and Christopher Musco. Fourier sparse leverage scores and approximate kernel learning. In NeurIPS, 2020.
  • Esseen (1942) Carl-Gustav Esseen. On the liapunov limit error in the theory of probability. Ark. Mat. Astr. Fys., 28:1–19, 1942.
  • Fine & Scheinberg (2001) Shai Fine and Katya Scheinberg. Efficient SVM training using low-rank kernel representations. J. Mach. Learn. Res., 2:243–264, 2001.
  • Girolami (2002) Mark A. Girolami. Mercer kernel-based clustering in feature space. IEEE Trans. Neural Networks, 13(3):780–784, 2002.
  • Gittens & Mahoney (2016) Alex Gittens and Michael W. Mahoney. Revisiting the nystrom method for improved large-scale machine learning. J. Mach. Learn. Res., 17:117:1–117:65, 2016.
  • Hormander (1966) Lars Hormander. An introduction to complex analysis in several variables. van Nostrand, 1966.
  • Indyk & Motwani (1998) Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, pp.  604–613. ACM, 1998.
  • Jacot et al. (2018) Arthur Jacot, Clément Hongler, and Franck Gabriel. Neural tangent kernel: Convergence and generalization in neural networks. In NeurIPS, pp.  8580–8589, 2018.
  • Jia et al. (2021) Hongjie Jia, Liangjun Wang, Heping Song, Qirong Mao, and Shifei Ding. An efficient nyström spectral clustering algorithm using incomplete cholesky decomposition. Expert Syst. Appl., 186:115813, 2021.
  • Johnson & Lindenstrauss (1984) William B Johnson and Joram Lindenstrauss. Extensions of lipschitz mappings into a hilbert space. Contemporary mathematics, 26:28, 1984.
  • Kahane (1981) Jean-Pierre Kahane. Hélices et quasi-hélices. In Leopoldo Nachbin (ed.), Mathematical Analysis and Applications, Part B, pp.  418–432. Academic Pr, first edition edition, 1981. ISBN 0125128029,9780125128025.
  • Larsen & Nelson (2017) Kasper Green Larsen and Jelani Nelson. Optimality of the johnson-lindenstrauss lemma. In FOCS, pp.  633–638. IEEE Computer Society, 2017.
  • Li et al. (2021) Zhu Li, Jean-Francois Ton, Dino Oglic, and Dino Sejdinovic. Towards a unified analysis of random fourier features. J. Mach. Learn. Res., 22:108:1–108:51, 2021.
  • Liao et al. (2020) Zhenyu Liao, Romain Couillet, and Michael W. Mahoney. A random matrix analysis of random fourier features: beyond the gaussian kernel, a precise phase transition, and the corresponding double descent. In NeurIPS, 2020.
  • Liu et al. (2020) Fanghui Liu, Xiaolin Huang, Yudong Chen, Jie Yang, and Johan A. K. Suykens. Random fourier features via fast surrogate leverage weighted sampling. In AAAI, pp.  4844–4851. AAAI Press, 2020.
  • Liu et al. (2021) Fanghui Liu, Xiaolin Huang, Yudong Chen, and Johan AK Suykens. Random features for kernel approximation: A survey on algorithms, theory, and beyond. IEEE Transactions on Pattern Analysis & Machine Intelligence, (01):1–1, 2021.
  • Makarychev et al. (2019) Konstantin Makarychev, Yury Makarychev, and Ilya P. Razenshteyn. Performance of johnson-lindenstrauss transform for k-means and k-medians clustering. In STOC, pp.  1027–1038. ACM, 2019.
  • Musco & Musco (2017) Cameron Musco and Christopher Musco. Recursive sampling for the nystrom method. In NIPS, pp.  3833–3845, 2017.
  • Nisan (1992) Noam Nisan. Pseudorandom generators for space-bounded computation. Comb., 12(4):449–461, 1992.
  • Oglic & Gärtner (2017) Dino Oglic and Thomas Gärtner. Nyström method with kernel k-means++ samples as landmarks. In ICML, volume 70 of Proceedings of Machine Learning Research, pp.  2652–2660. PMLR, 2017.
  • Phillips & Tai (2020) Jeff M. Phillips and Wai Ming Tai. The gaussiansketch for almost relative error kernel distance. In APPROX-RANDOM, volume 176 of LIPIcs, pp. 12:1–12:20. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2020.
  • Raghvendra & Agarwal (2020) Sharath Raghvendra and Pankaj K. Agarwal. A near-linear time ε\varepsilon-approximation algorithm for geometric bipartite matching. J. ACM, 67(3):18:1–18:19, 2020.
  • Rahimi & Recht (2007) Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In NIPS, pp.  1177–1184. Curran Associates, Inc., 2007.
  • Rahimi & Recht (2008) Ali Rahimi and Benjamin Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In NIPS, pp.  1313–1320. Curran Associates, Inc., 2008.
  • Ren & Du (2020) Yuanhang Ren and Ye Du. Uniform and non-uniform sampling methods for sub-linear time k-means clustering. In ICPR, pp.  7775–7781. IEEE, 2020.
  • Rudi & Rosasco (2017) Alessandro Rudi and Lorenzo Rosasco. Generalization properties of learning with random features. In NIPS, pp.  3215–3225, 2017.
  • Rudi et al. (2018) Alessandro Rudi, Daniele Calandriello, Luigi Carratino, and Lorenzo Rosasco. On fast leverage score sampling and optimal learning. In NeurIPS, pp.  5677–5687, 2018.
  • Rudin (1991) W. Rudin. Fourier Analysis on Groups. Wiley Classics Library. Wiley, 1991. ISBN 9780471523642.
  • Schölkopf & Smola (2002) Bernhard Schölkopf and Alexander Johannes Smola. Learning with Kernels: support vector machines, regularization, optimization, and beyond. Adaptive computation and machine learning series. MIT Press, 2002.
  • Schölkopf et al. (1998) Bernhard Schölkopf, Alexander J. Smola, and Klaus-Robert Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput., 10(5):1299–1319, 1998.
  • Shahrampour et al. (2018) Shahin Shahrampour, Ahmad Beirami, and Vahid Tarokh. On data-dependent random features for improved generalization in supervised learning. In AAAI, pp.  4026–4033. AAAI Press, 2018.
  • Song et al. (2021) Zhao Song, David P. Woodruff, Zheng Yu, and Lichen Zhang. Fast sketching of polynomial kernels of polynomial degree. In ICML, volume 139 of Proceedings of Machine Learning Research, pp.  9812–9823. PMLR, 2021.
  • Sriperumbudur & Szabó (2015) Bharath K. Sriperumbudur and Zoltán Szabó. Optimal rates for random fourier features. In NIPS, pp.  1144–1152, 2015.
  • Sun et al. (2018) Yitong Sun, Anna C. Gilbert, and Ambuj Tewari. But how does it work in theory? linear SVM with random features. In NeurIPS, pp.  3383–3392, 2018.
  • Sutherland & Schneider (2015) Danica J. Sutherland and Jeff G. Schneider. On the error of random fourier features. In UAI, pp.  862–871. AUAI Press, 2015.
  • Wang et al. (2019) Shusen Wang, Alex Gittens, and Michael W. Mahoney. Scalable kernel k-means clustering with nyström approximation: Relative-error bounds. J. Mach. Learn. Res., 20:12:1–12:49, 2019.
  • Williams & Seeger (2000) Christopher K. I. Williams and Matthias W. Seeger. Using the nyström method to speed up kernel machines. In NIPS, pp.  682–688. MIT Press, 2000.
  • Woodruff & Zandieh (2020) David P. Woodruff and Amir Zandieh. Near input sparsity time kernel embeddings via adaptive sampling. In ICML, volume 119 of Proceedings of Machine Learning Research, pp.  10324–10333. PMLR, 2020.
  • Yang et al. (2012) Tianbao Yang, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, and Zhi-Hua Zhou. Nyström method vs random fourier features: A theoretical and empirical comparison. In NIPS, pp.  485–493, 2012.
  • Zhang et al. (2019) Jian Zhang, Avner May, Tri Dao, and Christopher Ré. Low-precision random fourier features for memory-constrained kernel approximation. In AISTATS, volume 89 of Proceedings of Machine Learning Research, pp.  1264–1274. PMLR, 2019.
  • Zhen et al. (2020) Xiantong Zhen, Haoliang Sun, Ying-Jun Du, Jun Xu, Yilong Yin, Ling Shao, and Cees Snoek. Learning to learn kernels with variational random features. In ICML, volume 119 of Proceedings of Machine Learning Research, pp.  11409–11419. PMLR, 2020.

Appendix A Proof of Lemma 3.1

We restate Lemma 3.1: if for some $r>0$, $K$ is analytic in $\{x\in\mathbb{R}^{d}:\|x\|_{1}<r\}$, then for every even $k\geq 1$ and every $x$ such that $\|x\|_{1}<r$, we have $\mathbb{E}[|X_{i}(x)|^{k}]\leq\left(\frac{4k\|x\|_{1}}{r}\right)^{2k}$.

We first introduce the following two lemmas on the properties of $\mathbb{E}[X_{i}(x)^{k}]$.

Lemma A.1.

For any $\omega_{i}$ sampled in RFF and $k\geq 0$, we have $\mathbb{E}[\cos^{k}\langle\omega_{i},x\rangle]=\frac{1}{2^{k}}\sum_{j=0}^{k}\binom{k}{j}K((2j-k)x)$.

Proof.

By eq. 1 it is sufficient to prove that

\[\cos^{k}\langle\omega_{i},x\rangle=\frac{1}{2^{k}}\sum_{j=0}^{k}\binom{k}{j}\cos((2j-k)\langle\omega_{i},x\rangle).\]

We prove this by induction. In the case $k=0$ the identity holds trivially.

If the identity holds for $k$, we have

\begin{align*}
\cos^{k+1}(\langle\omega_{i},x\rangle) &=\cos(\langle\omega_{i},x\rangle)\cdot\frac{1}{2^{k}}\sum_{j=0}^{k}\binom{k}{j}\cos((2j-k)\langle\omega_{i},x\rangle)\\
&=\frac{1}{2^{k}}\sum_{j=0}^{k}\binom{k}{j}\cos(\langle\omega_{i},x\rangle)\cos((2j-k)\langle\omega_{i},x\rangle)\\
&=\frac{1}{2^{k+1}}\sum_{j=0}^{k}\binom{k}{j}\Big(\cos((2j-k+1)\langle\omega_{i},x\rangle)+\cos((2j-k-1)\langle\omega_{i},x\rangle)\Big)\\
&=\frac{1}{2^{k+1}}\sum_{j=0}^{k}\binom{k}{j}\Big(\cos((2(j+1)-(k+1))\langle\omega_{i},x\rangle)+\cos((2j-(k+1))\langle\omega_{i},x\rangle)\Big)\\
&=\frac{1}{2^{k+1}}\sum_{j=0}^{k+1}\Big(\binom{k}{j}+\binom{k}{j-1}\Big)\cos((2j-(k+1))\langle\omega_{i},x\rangle)\\
&=\frac{1}{2^{k+1}}\sum_{j=0}^{k+1}\binom{k+1}{j}\cos((2j-(k+1))\langle\omega_{i},x\rangle)
\end{align*}

where in the third equality we use the fact that $2\cos\alpha\cos\beta=\cos(\alpha+\beta)+\cos(\alpha-\beta)$. ∎
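The cosine power identity proved above is easy to verify numerically; the following check (for small $k$ and a few values of $t=\langle\omega_{i},x\rangle$) is included only as a sanity test.

```python
import numpy as np
from math import comb

# cos^k(t) = 2^{-k} * sum_j C(k, j) * cos((2j - k) t)
t = np.linspace(-3.0, 3.0, 7)
for k in range(6):
    lhs = np.cos(t) ** k
    rhs = sum(comb(k, j) * np.cos((2 * j - k) * t) for j in range(k + 1)) / 2 ** k
    assert np.allclose(lhs, rhs)
print("cos^k identity verified for k = 0..5")
```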

Lemma A.2.

If there exists $r>0$ such that $K$ is analytic in $\{x\in\mathbb{R}^{d}:\|x\|_{1}<r\}$, then for all $k\geq 0$, $\lim_{x\rightarrow 0}\frac{\mathbb{E}[X_{i}(x)^{k}]}{\|x\|_{1}^{2k}}=c$ for some constant $c$.

Proof.

We write the analytic function $K(x)$ as a Taylor series around the origin:

\[K(x)=\sum_{\beta\in\mathbb{N}^{d}}c_{\beta}x^{\beta},\qquad(3)\]

where $x^{\beta}:=\prod_{i=1}^{d}x_{i}^{\beta_{i}}$ is a monomial and $c_{\beta}$ is its coefficient. By definition, $c_{0}=1$ since $K(0)=1$. We let $s:\mathbb{N}^{d}\rightarrow\mathbb{N}$, $s(\beta):=\sum_{i=1}^{d}\beta_{i}$ denote the degree of $x^{\beta}$. Since $K(x-y)=K(x,y)=K(y,x)=K(y-x)$ by definition, $K(x)$ is an even function, so $c_{\beta}=0$ whenever $s(\beta)$ is odd.

Recall that $X_{i}(x):=\cos\langle\omega_{i},x\rangle-K(x)$. In the following, we drop the subscripts in $X_{i},\omega_{i}$ and write $X,\omega$ for simplicity. By the definition of $X$ we have

𝔼[X(x)k]=i=0k(ki)K(x)i(1)ki𝔼[coskiω,x].\mathbb{E}[X(x)^{k}]=\sum_{i=0}^{k}\binom{k}{i}K(x)^{i}(-1)^{k-i}\mathbb{E}[\cos^{k-i}\langle\omega,x\rangle]. (4)

Note that by Lemma A.1, 𝔼[coskiω,x]=12kij=0ki(kij)K((2j(ki))x)\mathbb{E}[\cos^{k-i}\langle\omega,x\rangle]=\frac{1}{2^{k-i}}\sum_{j=0}^{k-i}\binom{k-i}{j}K((2j-(k-i))x). Plug this in eq. 4:

𝔼[X(x)k]\displaystyle\mathbb{E}[X(x)^{k}] =i=0k(ki)K(x)i(1)ki12kij=0ki(kij)K((2j(ki))x)\displaystyle=\sum_{i=0}^{k}\binom{k}{i}K(x)^{i}(-1)^{k-i}\frac{1}{2^{k-i}}\sum_{j=0}^{k-i}\binom{k-i}{j}K((2j-(k-i))x)
=i=0k(ki)K(x)i(12)kij=0ki(kij)βdcβ(2jk+i)s(β)xβ\displaystyle=\sum_{i=0}^{k}\binom{k}{i}K(x)^{i}\left(\frac{-1}{2}\right)^{k-i}\sum_{j=0}^{k-i}\binom{k-i}{j}\sum_{\beta\in\mathbb{N}^{d}}c_{\beta}(2j-k+i)^{s(\beta)}x^{\beta}
=βdcβxβi=0k(ki)K(x)i(12)kij=0ki(kij)(2jk+i)s(β)\displaystyle=\sum_{\beta\in\mathbb{N}^{d}}c_{\beta}x^{\beta}\sum_{i=0}^{k}\binom{k}{i}K(x)^{i}\left(\frac{-1}{2}\right)^{k-i}\sum_{j=0}^{k-i}\binom{k-i}{j}(2j-k+i)^{s(\beta)}

where the second equality comes from the Taylor expansion of K((2j-k+i)x). Next we show that, as a power series in x, \mathbb{E}[X(x)^{k}] only contains monomials of degree at least 2k.

For \beta=0, note that \sum_{i=0}^{k}\binom{k}{i}K(x)^{i}(-1)^{k-i}=(K(x)-1)^{k}. Since K(x) is even and K(0)=1, we have 1-K(x)=O(\|x\|_{1}^{2}) near the origin, and hence \lim_{x\rightarrow 0}\frac{(K(x)-1)^{k}}{x^{t}}=0 for all t<2k. For \beta\neq 0, we next show that every term of degree less than 2k has coefficient zero.

Fix \beta\neq 0 and take the Taylor expansion of K(x)^{i}:

K(x)i=β1,β2,,βicβ1cβ2cβixβ1++βi,K(x)^{i}=\sum_{\beta_{1},\beta_{2},\dots,\beta_{i}}c_{\beta_{1}}c_{\beta_{2}}\dots c_{\beta_{i}}x^{\beta_{1}+...+\beta_{i}},

Without loss of generality, assume that \beta_{l+1},\dots,\beta_{i} are exactly those \beta_{j} that equal 0, so that c_{\beta_{l+1}}=\dots=c_{\beta_{i}}=c_{0}=1.

Now we consider the coefficient of the term c_{\beta}c_{\beta_{1}}c_{\beta_{2}}\dots c_{\beta_{l}}x^{\beta+\sum_{j=1}^{l}\beta_{j}}, which is:

C~i=0k(ki)(il)(12)kij=0ki(kij)(2jk+i)s(β)\tilde{C}\sum_{i=0}^{k}\binom{k}{i}\binom{i}{l}\left(\frac{-1}{2}\right)^{k-i}\sum_{j=0}^{k-i}\binom{k-i}{j}(2j-k+i)^{s(\beta)}

where \tilde{C} is the number of distinct ordered sequences (\beta_{1},\beta_{2},\dots,\beta_{l}); here, if \beta_{1}=\beta_{2}, the sequences (\beta_{1},\beta_{2},\dots,\beta_{l}) and (\beta_{2},\beta_{1},\dots,\beta_{l}) are regarded as the same.

Next, we show that if the degree of a monomial satisfies s(\beta)+\sum_{j=1}^{l}s(\beta_{j})<2k, then its coefficient is zero. Since all \beta_{j}\neq 0 and c_{\beta_{j}}=0 whenever s(\beta_{j}) is odd, we may assume s(\beta_{j})\geq 2, and therefore s(\beta)<2k-2l.

Let J be the shift operator that maps every function f:\mathbb{R}\to\mathbb{R} to J(f):\mathbb{R}\to\mathbb{R} defined by J(f)(x):=f(x+1). Write J^{1}=J and J^{k}:=J\circ J^{k-1} for its k-fold composition, and let J^{0} be the identity mapping, J^{0}(f)=f. We also define addition by (J_{1}+J_{2})(f)=J_{1}(f)+J_{2}(f) and scalar multiplication by (\alpha J)(f)=\alpha(J(f)). By definition, cJ^{m}\circ J^{n}=J^{m}\circ(cJ^{n})=cJ^{m+n} for all c\in\mathbb{R} and m,n\in\mathbb{N}.

Let L(x):=x^{s(\beta)}; then the coefficient can be rewritten as:

C~i=0k(ki)(il)(12)kij=0ki(kij)(J2j+i(L)(k))\tilde{C}\sum_{i=0}^{k}\binom{k}{i}\binom{i}{l}\left(\frac{-1}{2}\right)^{k-i}\sum_{j=0}^{k-i}\binom{k-i}{j}\left(J^{2j+i}(L)(-k)\right)

Let P=\tilde{C}\sum_{i=0}^{k}\binom{k}{i}\binom{i}{l}\left(\frac{-1}{2}\right)^{k-i}\sum_{j=0}^{k-i}\binom{k-i}{j}J^{2j+i}(L); then the expression above equals P(-k). We now show that P\equiv 0:

P=\displaystyle P= C~(i=0k(ki)(il)(12)kiJi(j=0ki(kij)J2j))(L)\displaystyle\tilde{C}\left(\sum_{i=0}^{k}\binom{k}{i}\binom{i}{l}\left(\frac{-1}{2}\right)^{k-i}J^{i}\circ\left(\sum_{j=0}^{k-i}\binom{k-i}{j}J^{2j}\right)\right)(L)
=\displaystyle= C~(i=lk(kl)(klil)Ji(J0+J22)ki)(L)\displaystyle\tilde{C}\left(\sum_{i=l}^{k}\binom{k}{l}\binom{k-l}{i-l}J^{i}\circ\left(-\frac{J^{0}+J^{2}}{2}\right)^{k-i}\right)(L)
=\displaystyle= C~Jl(kl)(i=lk(klil)Jil(J0+J22)ki)(L)\displaystyle\tilde{C}J^{l}\circ\binom{k}{l}\left(\sum_{i=l}^{k}\binom{k-l}{i-l}J^{i-l}\circ\left(-\frac{J^{0}+J^{2}}{2}\right)^{k-i}\right)(L)
=\displaystyle= C~Jl(kl)(JJ0+J22)kl(L).\displaystyle\tilde{C}J^{l}\circ\binom{k}{l}\left(J-\frac{J^{0}+J^{2}}{2}\right)^{k-l}(L).

Note that \left(J-\frac{J^{0}+J^{2}}{2}\right)(f)(x)=(f(x+1)-f(x))/2-(f(x+2)-f(x+1))/2 is, up to sign and a factor of 1/2, a second-order difference. Consequently, if f is a polynomial of degree k\geq 2, then \left(J-\frac{J^{0}+J^{2}}{2}\right)(f) is a polynomial of degree k-2, and if f is a polynomial of degree k<2, then \left(J-\frac{J^{0}+J^{2}}{2}\right)(f) is 0.

Since LL is a polynomial of degree less than 2(kl)2(k-l), we have

(JJ0+J22)kl(L)0.\left(J-\frac{J^{0}+J^{2}}{2}\right)^{k-l}(L)\equiv 0.

Combining the two cases, every monomial appearing in eq. 4 has degree at least 2k, which completes the proof. ∎
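For intuition, this degree bound can be checked in closed form for a concrete kernel. The Python sketch below assumes the one-dimensional Gaussian kernel K(x)=exp(-x^2/2) (so eq. 1 holds with omega ~ N(0,1)) and evaluates \mathbb{E}[X(x)^{k}] exactly via Lemma A.1 and eq. 4, illustrating that \mathbb{E}[X(x)^{k}]/x^{2k} approaches a constant as x tends to 0; this is only an illustrative check under that specific choice of kernel.

```python
# Closed-form check that E[X(x)^k] behaves like a constant times x^{2k} near 0,
# for the 1-D Gaussian kernel (an assumption made only for this illustration).
import math

def K(x):                      # Gaussian kernel, K(0) = 1
    return math.exp(-x * x / 2)

def moment_cos(m, x):          # E[cos^m(<omega, x>)] via Lemma A.1
    return sum(math.comb(m, j) * K((2 * j - m) * x) for j in range(m + 1)) / 2 ** m

def moment_X(k, x):            # E[X(x)^k] with X(x) = cos(<omega, x>) - K(x), via eq. (4)
    return sum(math.comb(k, i) * K(x) ** i * (-1) ** (k - i) * moment_cos(k - i, x)
               for i in range(k + 1))

for k in (2, 3):
    ratios = [moment_X(k, x) / x ** (2 * k) for x in (0.4, 0.2, 0.1, 0.05)]
    print(f"k = {k}: E[X(x)^k] / x^(2k) =", [f"{r:.4f}" for r in ratios])
```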

Proof of Lemma 3.1.

If 2k\|x\|_{1}\geq r, then since |X_{i}(x)|=|\cos\langle\omega_{i},x\rangle-K(x)|\leq 2, we have \mathbb{E}[X_{i}(x)^{k}]\leq 2^{k}\leq\left(\dfrac{4k\|x\|_{1}}{r}\right)^{2k}. Otherwise, \|x\|_{1}<r/(2k). Define g_{k}(x):=\mathbb{E}[X_{i}(x)^{k}]; then we have:

g_{k}(x)=\sum_{i=0}^{2k-1}\frac{1}{i!}\left[\left(\sum_{j=1}^{d}x_{j}\frac{\partial}{\partial x_{j}}\right)^{i}g_{k}\right](0)+\frac{1}{(2k)!}\left[\left(\sum_{j=1}^{d}x_{j}\frac{\partial}{\partial x_{j}}\right)^{2k}g_{k}\right](\theta x)=\frac{1}{(2k)!}\left[\left(\sum_{j=1}^{d}x_{j}\frac{\partial}{\partial x_{j}}\right)^{2k}g_{k}\right](\theta x),\quad\theta\in[0,1],

where the first equality is the Taylor expansion with Lagrange remainder, and the second equality uses Lemma A.2, which forces all directional derivatives of g_{k} of order less than 2k to vanish at the origin.

Lemma A.3 (Cauchy’s integral formula for multivariate functions, Hörmander 1966).

For f(z_{1},\dots,z_{d}) analytic in \Delta(z,r)=\left\{\zeta=(\zeta_{1},\zeta_{2},\dots,\zeta_{d})\in\mathbb{C}^{d};\left|\zeta_{\nu}-z_{\nu}\right|\leq r_{\nu},\nu=1,\dots,d\right\}, where D_{\nu} denotes the disc \{\zeta_{\nu}\in\mathbb{C}:|\zeta_{\nu}-z_{\nu}|\leq r_{\nu}\}, we have

f(z1,,zd)=1(2πi)dD1×D2××Ddf(ζ1,,ζd)(ζ1z1)(ζdzd)𝑑ζ.\displaystyle f(z_{1},\dots,z_{d})=\frac{1}{(2\pi\mathrm{i})^{d}}\int_{\partial D_{1}\times\partial D_{2}\times\dots\times\partial D_{d}}\frac{f(\zeta_{1},\dots,\zeta_{d})}{(\zeta_{1}-z_{1})\cdots(\zeta_{d}-z_{d})}\,d\zeta.

Furthermore,

k1++kdf(z1,z2,,zd)z1k1zdkd=k1!kd!(2πi)dD1×D2×Ddf(ζ1,,ζd)(ζ1z1)k1+1(ζdzd)kd+1𝑑ζ.\displaystyle\frac{\partial^{k_{1}+\cdots+k_{d}}f(z_{1},z_{2},\ldots,z_{d})}{\partial{z_{1}}^{k_{1}}\cdots\partial{z_{d}}^{k_{d}}}=\frac{k_{1}!\cdots k_{d}!}{(2\pi\mathrm{i})^{d}}\int_{\partial D_{1}\times\partial D_{2}\dots\times\partial D_{d}}\frac{f(\zeta_{1},\dots,\zeta_{d})}{(\zeta_{1}-z_{1})^{k_{1}+1}\cdots(\zeta_{d}-z_{d})^{k_{d}+1}}\,d\zeta.

If in addition |f|<M, we have the following estimate:

|k1++kdf(z1,z2,,zd)z1k1zdkd|Mk1!kd!r1k1rdkd.\displaystyle\left|\frac{\partial^{k_{1}+\cdots+k_{d}}f(z_{1},z_{2},\ldots,z_{d})}{{\partial z_{1}}^{k_{1}}\cdots\partial{z_{d}}^{k_{d}}}\right|\leq\frac{Mk_{1}!\cdots k_{d}!}{{r_{1}}^{k_{1}}\cdots{r_{d}}^{k_{d}}}.

Recall that gk(x)=𝔼[Xi(x)k]=i=0k(ki)K(x)i(1)ki12kij=0ki(kij)K((2j(ki))x)g_{k}(x)=\mathbb{E}[X_{i}(x)^{k}]=\sum_{i=0}^{k}\binom{k}{i}K(x)^{i}(-1)^{k-i}\frac{1}{2^{k-i}}\sum_{j=0}^{k-i}\binom{k-i}{j}K((2j-(k-i))x), so gk(x)=poly(K(x),K(x),,K(kx),K(kx))g_{k}(x)=\operatorname{poly}(K(x),K(-x),\dots,K(kx),K(-kx)) is analytic when x1r/k\|x\|_{1}\leq r/k. Applying Cauchy’s integral formula Lemma A.3 (here z+θx12r/2k\|z+\theta x\|_{1}\leq 2\cdot r/2k is in the analytic area),

gk(x)\displaystyle g_{k}(x) =t1++td=2kx1t1x2t2xdtdt1!t2!td!2kgk(θx)x1t1x2t2xdtd\displaystyle=\sum_{t_{1}+\dots+t_{d}=2k}\frac{x_{1}^{t_{1}}x_{2}^{t_{2}}\dots x_{d}^{t_{d}}}{t_{1}!t_{2}!\dots t_{d}!}\frac{\partial^{2k}g_{k}(\theta x)}{\partial x_{1}^{t_{1}}\partial x_{2}^{t_{2}}\dots\partial x_{d}^{t_{d}}}
=t1++td=2kx1t1x2t2xdtd(2πi)dzd,|zi|=r2kgk(z+θx)z1t1+1zdtd+1dz\displaystyle=\sum_{t_{1}+\dots+t_{d}=2k}\frac{x_{1}^{t_{1}}x_{2}^{t_{2}}\dots x_{d}^{t_{d}}}{(2\pi\mathrm{i})^{d}}\int_{z\in\mathbb{C}^{d},|z_{i}|=\frac{r}{2k}}\frac{g_{k}(z+\theta x)}{z_{1}^{t_{1}+1}\dots z_{d}^{t_{d}+1}}\mathop{}\!\mathrm{d}z

we have

|g_{k}(x)|\leq\sup_{|z_{i}|=r/(2k)}|g_{k}(z+\theta x)|\left(\frac{2k}{r}\right)^{2k}\Big(\sum_{i=1}^{d}|x_{i}|\Big)^{2k}\leq\left(\dfrac{4k\|x\|_{1}}{r}\right)^{2k}. ∎

Appendix B Proof of Lemma 3.2

See 3.2

Proof.

It suffices to prove that \liminf_{x\rightarrow 0}\frac{1-K(x)}{\|x\|_{1}^{2}}\geq c>0 for some c. Towards proving this, we show that 1-K is strongly convex at the origin, i.e., that t\mapsto K(tx) is strongly concave there. In fact, by definition, \Re\int_{\mathbb{R}^{d}}p(\omega)e^{\mathrm{i}\langle\omega,tx\rangle}\mathop{}\!\mathrm{d}\omega=K(tx) for every fixed x; differentiating twice in t gives \frac{\mathrm{d}^{2}}{\mathrm{d}t^{2}}K(tx)=-\Re\int_{\mathbb{R}^{d}}\langle\omega,x\rangle^{2}p(\omega)e^{\mathrm{i}\langle\omega,tx\rangle}\mathop{}\!\mathrm{d}\omega<0 around t=0, hence K(tx) is strongly concave with respect to t at the origin, and the claimed lower bound on 1-K(x) follows. ∎

Appendix C Proof of Theorem 1.2

See 1.2

Proof.

When \|x-y\|_{1}\geq R_{K}, consider the function g(t)=K(t(x-y)). It follows from the definition that g^{\prime}(0)=0 and g^{\prime\prime}(t)=-\Re\int_{\mathbb{R}^{d}}\langle\omega,x-y\rangle^{2}p(\omega)e^{\mathrm{i}\langle\omega,t(x-y)\rangle}\mathop{}\!\mathrm{d}\omega<0, so g(t) strictly decreases for all t>0. Hence 2-2K(x-y)\geq 2-2\max_{\|x-y\|_{1}=R_{K}}K(x-y)>0. We denote t=2-2\max_{\|x-y\|_{1}=R_{K}}K(x-y), so by the Chernoff bound, when D\geq\frac{1}{2t}(\ln\frac{1}{\delta}+\ln 2), we have:

Pr[|2Di=1DXi(xy)|ε(22K(xy))]Pr[|2Di=1DXi(xy)|t]1δ.\Pr\left[\left|\frac{2}{D}\sum_{i=1}^{D}X_{i}(x-y)\right|\leq\varepsilon\cdot(2-2K(x-y))\right]\geq\Pr\left[\left|\frac{2}{D}\sum_{i=1}^{D}X_{i}(x-y)\right|\leq t\right]\geq 1-\delta.

When xy1RK\|x-y\|_{1}\leq R_{K}, by Lemma 3.2, we have 22K(xy)cxy122-2K(x-y)\geq c\|x-y\|_{1}^{2}. Then we have:

Pr[|2Di=1DXi(xy)|ε(22K(xy))]Pr[|2Di=1DXi(xy)|cεxy12].\Pr\left[\left|\frac{2}{D}\sum_{i=1}^{D}X_{i}(x-y)\right|\leq\varepsilon\cdot(2-2K(x-y))\right]\geq\Pr\left[\left|\frac{2}{D}\sum_{i=1}^{D}X_{i}(x-y)\right|\leq c\varepsilon\cdot\|x-y\|_{1}^{2}\right].

We take r=r_{K} for simplicity of exposition. Assume \delta\leq\min\{\varepsilon,2^{-16}\}, and let k=\log(2D^{2}/\delta), rounded so that k is an even integer, and t=64k^{2}/r^{2}. By Markov’s inequality and Lemma 3.1:

Pr[|Xi(xy)|txy12]=Pr[|Xi(xy)|ktkxy12k](4k)2ktkr2k=4kδ2D2.\Pr[|X_{i}(x-y)|\geq t\|x-y\|^{2}_{1}]=\Pr\left[|X_{i}(x-y)|^{k}\geq t^{k}\|x-y\|_{1}^{2k}\right]\leq\frac{(4k)^{2k}}{t^{k}r^{2k}}=4^{-k}\leq\frac{\delta}{2D^{2}}.

For simplicity denote Xi(xy)X_{i}(x-y) by XiX_{i}, xy12\|x-y\|_{1}^{2} by \ell and define Xi=𝟙[|Xi|t]tsgn(Xi)+𝟙[|Xi|<t]XiX_{i}^{\prime}=\mathbbm{1}_{[|X_{i}|\geq t\ell]}t\ell\cdot\mathrm{sgn}(X_{i})+\mathbbm{1}_{[|X_{i}|<t\ell]}X_{i}, note that 𝔼[Xi]=0\mathbb{E}[X_{i}]=0. Then:

|𝔼[Xi]|\displaystyle|\mathbb{E}[X_{i}^{\prime}]| |𝔼[Xi|Xi|<t]|Pr[|Xi|<t]+t|Pr[Xit]Pr[Xit]|\displaystyle\leq|\mathbb{E}[X_{i}^{\prime}\mid|X_{i}^{\prime}|<t\ell]|\cdot\Pr[|X_{i}^{\prime}|<t\ell]+t\ell\cdot\left|\Pr[X^{\prime}_{i}\geq t\ell]-\Pr[X^{\prime}_{i}\leq-t\ell]\right|
=|𝔼[Xi|Xi|<t]|Pr[|Xi|<t]+t|Pr[Xit]Pr[Xit]|\displaystyle=|\mathbb{E}[X_{i}^{\prime}\mid|X_{i}^{\prime}|<t\ell]|\cdot\Pr[|X_{i}|<t\ell]+t\ell\cdot\left|\Pr[X_{i}\geq t\ell]-\Pr[X_{i}\leq-t\ell]\right|
=|𝔼[Xi]𝔼[Xi|Xi|t]Pr[|Xi|t]|Pr[|Xi|<t]Pr[|Xi|<t]+t|Pr[Xit]Pr[Xit]|\displaystyle=\frac{|\mathbb{E}[X_{i}]-\mathbb{E}\left[X_{i}\mid|X_{i}|\geq t\ell\right]\Pr[|X_{i}|\geq t\ell]|}{\Pr[|X_{i}|<t\ell]}\Pr[|X_{i}|<t\ell]+t\ell\cdot\left|\Pr[X_{i}\geq t\ell]-\Pr[X_{i}\leq-t\ell]\right|
=|𝔼[Xi]𝔼[Xi|Xi|t]Pr[|Xi|t]|+t|Pr[Xit]Pr[Xit]|\displaystyle=|\mathbb{E}[X_{i}]-\mathbb{E}\left[X_{i}\mid|X_{i}|\geq t\ell\right]\Pr[|X_{i}|\geq t\ell]|+t\ell\cdot\left|\Pr[X_{i}\geq t\ell]-\Pr[X_{i}\leq-t\ell]\right|
=|𝔼[Xi|Xi|t]Pr[|Xi|t]|+t|Pr[Xit]Pr[Xit]|\displaystyle=|\mathbb{E}\left[X_{i}\mid|X_{i}|\geq t\ell\right]\Pr[|X_{i}|\geq t\ell]|+t\ell\cdot\left|\Pr[X_{i}\geq t\ell]-\Pr[X_{i}\leq-t\ell]\right|

where t|Pr[Xi>t]Pr[Xi<t]|tPr[|Xi|>t]tδ/(2D2)t\ell\cdot|\Pr[X_{i}>t\ell]-\Pr[X_{i}<-t\ell]|\leq t\ell\cdot\Pr[|X_{i}|>t\ell]\leq t\ell\delta/(2D^{2}). The first inequality is by considering the two conditions |Xi|<t|X_{i}|<t\ell and |Xi|t|X_{i}|\geq t\ell, then taking a triangle inequality. The first and second equations are by definition of Xi,XiX_{i},X_{i}^{\prime}. The third equation is a straightforward computation. The last equation is due to 𝔼[Xi]=0\mathbb{E}[X_{i}]=0. By Lemma 3.1 for every integer α\alpha,

Pr[|Xi|α/r2]=Pr[|Xi|α/8(α/r2)α/8]𝔼[|Xi|α/8](α/r2)α/84α/8.\displaystyle\Pr\left[|X_{i}|\geq\alpha\ell/r^{2}\right]=\Pr\left[|X_{i}|^{\sqrt{\alpha}/8}\geq(\alpha\ell/r^{2})^{\sqrt{\alpha}/8}\right]\leq\frac{\mathbb{E}[|X_{i}|^{\sqrt{\alpha}/8}]}{(\alpha\ell/r^{2})^{\sqrt{\alpha}/8}}\leq 4^{-\sqrt{\alpha}/8}.

The first equality is straightforward. The first inequality is by Markov’s inequality. The second inequality is by \mathbb{E}[|X_{i}|^{\sqrt{\alpha}/8}]\leq\left(\frac{\alpha\ell}{4r^{2}}\right)^{\sqrt{\alpha}/8}, which follows from Lemma 3.1 and a rearrangement of parameters, where r is the parameter r in Lemma 3.1. Therefore,

|𝔼[Xi|Xi|t]|Pr[|Xi|t]\displaystyle|\mathbb{E}[X_{i}\mid|X_{i}|\geq t\ell]|\Pr[|X_{i}|\geq t\ell] 𝔼[|Xi||Xi|t]Pr[|Xi|t]\displaystyle\leq\mathbb{E}[|X_{i}|\mid|X_{i}|\geq t\ell]\Pr[|X_{i}|\geq t\ell]
(t+1r2)Pr[|Xi|t]+r2 integer αtr2+1Pr[|Xi|α/r2]\displaystyle\leq(t+\frac{1}{r^{2}})\ell\cdot\Pr[|X_{i}|\geq t\ell]+\frac{\ell}{r^{2}}\sum_{\text{ integer }\alpha\geq tr^{2}+1}\Pr[|X_{i}|\geq\alpha\ell/r^{2}]
(t+1r2)δ2D2+tr24α/8dα\displaystyle\leq(t+\frac{1}{r^{2}})\ell\cdot\frac{\delta}{2D^{2}}+\ell\int_{tr^{2}}^{\infty}4^{-\sqrt{\alpha}/8}\mathop{}\!\mathrm{d}\alpha
((t+1r2)δ2D2+16r2ln44tr2/8).\displaystyle\leq\ell\left((t+\frac{1}{r^{2}})\frac{\delta}{2D^{2}}+\frac{16}{r^{2}\ln 4}4^{-tr^{2}/8}\right).

The first inequality is by the property of absolute value. The second inequality is because we can divide the event |X_{i}|\geq t\ell into |X_{i}|\in[\alpha\ell/r^{2},(\alpha+1)\ell/r^{2}), \alpha=tr^{2},tr^{2}+1,\ldots, and when |X_{i}|\in[\alpha\ell/r^{2},(\alpha+1)\ell/r^{2}), we have |X_{i}|<(\alpha+1)\ell/r^{2}. The third inequality is by plugging in the previous bound for \Pr[|X_{i}|\geq\alpha\ell/r^{2}]. The last inequality is by a calculation of the integral.

By plugging in the parameters t, \delta, and D\geq\max\{\Theta(\varepsilon^{-1}\log^{3}(1/{\delta})),\Theta(\varepsilon^{-2}\log(1/{\delta}))\}, we have |\mathbb{E}[X_{i}^{\prime}]|\leq\delta\ell. Note that the \Theta(\cdot) notation hides constants depending on r. Let \sigma^{\prime 2} denote the variance of X_{i}^{\prime}; then \sigma^{\prime}\leq 64\ell/r^{2} by Lemma 3.1.

Lemma C.1 (Bernstein’s Inequality).

Let X1,..,XDX_{1},..,X_{D} be independent zero-mean random variables. Suppose that |Xi|M,i|X_{i}|\leq M,\forall i, then for all positive tt,

Pr[i=1DXit]exp(t2/2Mt/3+i=1D𝔼[Xi2]).\Pr\left[\sum_{i=1}^{D}X_{i}\geq t\right]\leq\exp\left(-\frac{t^{2}/2}{Mt/3+\sum_{i=1}^{D}\mathbb{E}[X_{i}^{2}]}\right).

Applying Bernstein’s Inequality to XiX_{i}^{\prime},

Pr[i=1DXiD𝔼[Xi](cε/σ)Dσ]\displaystyle\Pr\left[\sum_{i=1}^{D}X_{i}^{\prime}-D\mathbb{E}[X_{i}^{\prime}]\geq(c\varepsilon\ell/\sigma^{\prime})D\sigma^{\prime}\right]\leq exp(c2ε2Dctε+2σ2/2)\displaystyle\exp\left(-\frac{c^{2}\varepsilon^{2}D}{ct\varepsilon+2\sigma^{\prime 2}/\ell^{2}}\right)
\displaystyle\leq max{exp(cε2D2tε),exp(c2ε2D4σ2/2)}.\displaystyle\max\left\{\exp\left(-\frac{c\varepsilon^{2}D}{2t\varepsilon}\right),\exp\left(-\frac{c^{2}\varepsilon^{2}D}{4\sigma^{\prime 2}/\ell^{2}}\right)\right\}.

Since D\geq\max\{\Theta\left(t\varepsilon^{-1}\log(1/\delta)\right), \Theta\left(\varepsilon^{-2}\log(1/{\delta})\right)\}, we have \Pr\left[\sum_{i=1}^{D}X_{i}^{\prime}\geq\varepsilon(\ell/\sigma^{\prime})D\sigma^{\prime}\right]\leq\delta/2. With probability 1-\frac{\delta}{2}, every |X_{i}|\leq t\ell and hence X_{i}^{\prime}=X_{i}. Therefore, \Pr\left[\sum_{i=1}^{D}X_{i}\geq D(\delta+\varepsilon)\ell\right]\leq\delta/2. Combining the above, \Pr[|\operatorname{dist}_{\pi}(x,y)-\operatorname{dist}_{\varphi}(x,y)|\leq\varepsilon\cdot\operatorname{dist}_{\varphi}(x,y)]\geq 1-\delta. ∎
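To complement the proof, the following Python sketch empirically illustrates the kind of relative-error behavior described by Theorem 1.2, for the Gaussian kernel K(u)=\exp(-\|u\|^{2}/2) with the standard cosine/sine RFF features; the data, dimensions, and seeds are arbitrary illustrative choices, unrelated to the constants in the theorem.

```python
# Relative error of the RFF kernel distance vs. the exact kernel distance,
# for the Gaussian kernel K(u) = exp(-||u||^2 / 2) with frequencies omega ~ N(0, I_d).
import numpy as np

rng = np.random.default_rng(0)
d, n = 10, 200
X = rng.normal(size=(n, d))

def rff_features(X, D, rng):
    omega = rng.normal(size=(X.shape[1], D))            # omega_i ~ N(0, I_d)
    proj = X @ omega
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(D)

for D in (64, 256, 1024, 4096):
    Z = rff_features(X, D, rng)
    i, j = rng.integers(0, n, size=1000), rng.integers(0, n, size=1000)
    mask = i != j
    i, j = i[mask], j[mask]
    true_dist = np.sqrt(2 - 2 * np.exp(-0.5 * np.sum((X[i] - X[j]) ** 2, axis=1)))
    rff_dist = np.linalg.norm(Z[i] - Z[j], axis=1)
    rel_err = np.abs(rff_dist - true_dist) / true_dist
    print(f"D = {D:5d}: max relative error over sampled pairs = {rel_err.max():.4f}")
```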

Appendix D Proof of Theorem 3.1

See 3.1

The proof relies on a key notion of (\varepsilon,\delta,\rho)-dimension reduction from (Makarychev et al., 2019), which we adapt to our setting and language of kernel distance as follows.

Definition D.1 (Makarychev et al. 2019, Definition 2.1).

For ε,δ,ρ>0\varepsilon,\delta,\rho>0, a feature mapping φ:d\varphi:\mathbb{R}^{d}\to\mathcal{H} for some Hilbert space \mathcal{H}, a random mapping πd,D:dD\pi_{d,D}:\mathbb{R}^{d}\to\mathbb{R}^{D} is an (ε,δ,ρ)(\varepsilon,\delta,\rho)-dimension reduction, if

  • for every x,ydx,y\in\mathbb{R}^{d}, 11+εdistφ(x,y)distπ(x,y)(1+ε)distφ(x,y)\frac{1}{1+\varepsilon}\operatorname{dist}_{\varphi}(x,y)\leq\operatorname{dist}_{\pi}(x,y)\leq(1+\varepsilon)\operatorname{dist}_{\varphi}(x,y) with probability at least 1δ1-\delta, and

  • for every fixed p[1,)p\in[1,\infty), 𝔼[𝟙{distπ(x,y)>(1+ε)distφ(x,y)}(distπ(x,y)pdistφ(x,y)p(1+ε)p)]ρ\mathbb{E}\left[\mathbbm{1}_{\{\operatorname{dist}_{\pi}(x,y)>(1+\varepsilon)\operatorname{dist}_{\varphi}(x,y)\}}\left(\frac{\operatorname{dist}_{\pi}(x,y)^{p}}{\operatorname{dist}_{\varphi}(x,y)^{p}}-(1+\varepsilon)^{p}\right)\right]\leq\rho.

In Makarychev et al. (2019), most results are stated for a particular parameter setup of Definition D.1 resulting from the Johnson-Lindenstrauss transform (Johnson & Lindenstrauss, 1984), but their analysis actually works for other similar parameter setups. The following is a generalized statement of (Makarychev et al., 2019, Theorem 3.5) which also reveals how alternative parameter setups affect the distortion. We note that this is simply a more precise and detailed statement of (Makarychev et al., 2019, Theorem 3.5), and it follows from exactly the same proof in Makarychev et al. (2019).

Lemma D.1 (Makarychev et al. 2019, Theorem 3.5).

Let 0<ε,α<10<\varepsilon,\alpha<1 and θ:=min{εp+13(p+1)(p+2),αεp/(10k(1+ε)4p1),1/10p+1}\theta:=\min\{\varepsilon^{p+1}3^{-(p+1)(p+2)},\alpha\varepsilon^{p}/(10k(1+\varepsilon)^{4p-1}),1/10^{p+1}\}. If some (ε,δ,ρ)(\varepsilon,\delta,\rho)-dimension reduction π\pi for feature mapping φ:d\varphi:\mathbb{R}^{d}\to\mathcal{H} of some kernel function satisfies δmin(θ7/600,θ/k),(k2)δα2,ρθ\delta\leq\min(\theta^{7}/600,\theta/k),\binom{k}{2}\delta\leq\frac{\alpha}{2},\rho\leq\theta, then with probability at least 1α1-\alpha, for every partition 𝒞\mathcal{C} of PP,

costpπ(P,𝒞)\displaystyle\operatorname{cost}_{p}^{\pi}(P,\mathcal{C}) (1+ε)3pcostpφ(P,𝒞),\displaystyle\leq(1+\varepsilon)^{3p}\operatorname{cost}_{p}^{\varphi}(P,\mathcal{C}),
(1ε)costpφ(P,𝒞)\displaystyle(1-\varepsilon)\operatorname{cost}_{p}^{\varphi}(P,\mathcal{C}) (1+ε)3p1costpπ(P,𝒞).\displaystyle\leq(1+\varepsilon)^{3p-1}\operatorname{cost}_{p}^{\pi}(P,\mathcal{C}).
Proof of Theorem 3.1.

We verify that, setting D=\Theta(\log^{3}\frac{k}{\alpha}+p^{3}\log^{3}\frac{1}{\varepsilon}+p^{6})/\varepsilon^{2}, the RFF mapping \pi with target dimension D satisfies the conditions in Lemma D.1, namely, it is an (\varepsilon,\delta,\rho)-dimension reduction. In fact, Theorem 1.2 already implies that such a \pi satisfies, for every x,y\in\mathbb{R}^{d}, \frac{1}{1+\varepsilon}\operatorname{dist}_{\varphi}(x,y)\leq\operatorname{dist}_{\pi}(x,y)\leq(1+\varepsilon)\operatorname{dist}_{\varphi}(x,y) with probability at least 1-\delta, where \delta=e^{-cf(\varepsilon,D)} for some constant c, and f(\varepsilon,D):=\max\{\varepsilon^{2}D,\varepsilon^{1/3}D^{1/3}\}. For the other part,

𝔼[𝟙{distπ(x,y)>(1+ε)distφ(x,y)}(distπ(x,y)pdistφ(x,y)p(1+ε)p)]\displaystyle\mathbb{E}\left[\mathbbm{1}_{\{\operatorname{dist}_{\pi}(x,y)>(1+\varepsilon)\operatorname{dist}_{\varphi}(x,y)\}}\left(\frac{\operatorname{dist}_{\pi}(x,y)^{p}}{\operatorname{dist}_{\varphi}(x,y)^{p}}-(1+\varepsilon)^{p}\right)\right]
=\displaystyle= ε((1+t)p(1+ε)p)d(Pr(distπ(x,y)distφ(x,y)>t+1))\displaystyle\int_{\varepsilon}^{\infty}((1+t)^{p}-(1+\varepsilon)^{p})\mathop{}\!\mathrm{d}\left(-\Pr\left(\frac{\operatorname{dist}_{\pi}(x,y)}{\operatorname{dist}_{\varphi}(x,y)}>t+1\right)\right)
=\displaystyle= [(1+m)p+(1+ε)p]Pr(distπ(x,y)distφ(x,y)>m+1)|m=εm=+\displaystyle\left.[-(1+m)^{p}+(1+\varepsilon)^{p}]\Pr\left(\frac{\operatorname{dist}_{\pi}(x,y)}{\operatorname{dist}_{\varphi}(x,y)}>m+1\right)\right|_{m=\varepsilon}^{m=+\infty}
\qquad+\int_{\varepsilon}^{\infty}p(1+t)^{p-1}\Pr\left(\frac{\operatorname{dist}_{\pi}(x,y)}{\operatorname{dist}_{\varphi}(x,y)}>t+1\right)\mathop{}\!\mathrm{d}t (integration by parts)
=\displaystyle= εp(1+t)p1Pr(distπ(x,y)distφ(x,y)>t+1)dt\displaystyle\int_{\varepsilon}^{\infty}p(1+t)^{p-1}\Pr\left(\frac{\operatorname{dist}_{\pi}(x,y)}{\operatorname{dist}_{\varphi}(x,y)}>t+1\right)\mathop{}\!\mathrm{d}t
\displaystyle\leq εp(1+t)p1ecf(t,D)dt.\displaystyle\int_{\varepsilon}^{\infty}p(1+t)^{p-1}e^{-cf(t,D)}\mathop{}\!\mathrm{d}t.

where the third equality follows because \Pr\left(\frac{\operatorname{dist}_{\pi}(x,y)}{\operatorname{dist}_{\varphi}(x,y)}>m\right) decays exponentially fast with respect to m. Observe that for p\geq 1 and D\geq\frac{(p-1)^{3}}{8c^{3}}, the function p(1+t)^{p-1}e^{-cD^{\frac{1}{3}}t^{\frac{1}{3}}/2} decreases when t\geq\varepsilon, and for D\geq\frac{c(p-1)}{\varepsilon^{2}}, the function p(1+t)^{p-1}e^{-ct^{2}D/2} decreases when t\geq\varepsilon. Hence for D\geq\max\{\frac{(p-1)^{3}}{8c^{3}},\frac{c(p-1)}{\varepsilon^{2}}\}, we have

εp(1+t)p1ecf(t,D)𝑑tcεecf(t,D)/2𝑑t<c′′ecf(ε,D)/2.\int_{\varepsilon}^{\infty}p(1+t)^{p-1}e^{-cf(t,D)}dt\leq c^{\prime}\int_{\varepsilon}^{\infty}e^{-cf(t,D)/2}dt<c^{\prime\prime}e^{-cf(\varepsilon,D)/2}.

In conclusion, by setting D=Θ(log3kα+p3log31ε+p6)/ε2D=\Theta(\log^{3}\frac{k}{\alpha}+p^{3}\log^{3}\frac{1}{\varepsilon}+p^{6})/\varepsilon^{2}, for δ=ecf(ε,D),ρ=c′′ecf(ε,D)\delta=e^{-cf(\varepsilon,D)},\rho=c^{\prime\prime}e^{-cf(\varepsilon,D)} and f(ε,D)=max{ε2D,ε1/3D1/3}f(\varepsilon,D)=\max\{\varepsilon^{2}D,\varepsilon^{1/3}D^{1/3}\}, it satisfies δmin(θ7/600,θ/k),(k2)δα2,ρθ\delta\leq\min(\theta^{7}/600,\theta/k),\binom{k}{2}\delta\leq\frac{\alpha}{2},\rho\leq\theta. This verifies the condition of Lemma D.1.

Finally, we conclude the proof of Theorem 3.1 by plugging ε=ε/3p\varepsilon^{\prime}=\varepsilon/3p and the above mentioned RFF mapping π\pi with target dimension DD into Lemma D.1. ∎

Appendix E Proof of Theorem 4.1

See 4.1

Proof.

Our proof requires the following anti-concentration inequality.

Lemma E.1 (Berry 1941; Esseen 1942).

For i.i.d. random variables \xi_{i}\in\mathbb{R} with mean 0, variance 1, and finite third absolute moment, let X:=\frac{1}{\sqrt{D}}\sum_{i=1}^{D}\xi_{i}; then for any t,

Pr[Xt]12πtes2/2dsO(D12)\Pr[X\geq t]\geq\frac{1}{\sqrt{2\pi}}\int_{t}^{\infty}e^{-s^{2}/2}\mathop{}\!\mathrm{d}s-O(D^{-\frac{1}{2}})

Let X_{i}(x):=\cos\langle\omega_{i},x\rangle-K(x) and \sigma(x):=\sqrt{\operatorname{Var}(X_{i}(x))}=\sqrt{\frac{1+K(2x)-2K(x)^{2}}{2}}, and choose x,y such that s_{K}(x-y)=s_{K}^{(\Delta,\rho)}. Clearly, such a pair x,y satisfies that (x-y) is (\Delta,\rho)-bounded. In fact, it is without loss of generality to assume that both x and y are (\Delta,\rho)-bounded, since one may pick y^{\prime}=0 and x^{\prime}=x-y and still have x^{\prime}-y^{\prime}=x-y. We next verify that such x,y satisfy the claimed properties. Indeed,

Pr[|distφ(x,y)distπ(x,y)|εdistφ(x,y)]\displaystyle\Pr[|\operatorname{dist}_{\varphi}(x,y)-\operatorname{dist}_{\pi}(x,y)|\geq\varepsilon\cdot\operatorname{dist}_{\varphi}(x,y)]
\displaystyle\geq Pr[|distφ(x,y)2distπ(x,y)2|6εdistφ(x,y)2]\displaystyle\Pr[|\operatorname{dist}_{\varphi}(x,y)^{2}-\operatorname{dist}_{\pi}(x,y)^{2}|\geq 6\varepsilon\cdot\operatorname{dist}_{\varphi}(x,y)^{2}]
=\displaystyle= Pr[|2Di=1DXi(xy)|6ε(22K(xy))]\displaystyle\Pr\left[\left|\frac{2}{D}\sum_{i=1}^{D}X_{i}(x-y)\right|\geq 6\varepsilon(2-2K(x-y))\right]
=\displaystyle= Pr[|1Dσ(xy)i=1DXi(xy)|6ε(1K(xy))Dσ(xy)]\displaystyle\Pr\left[\left|\frac{1}{\sqrt{D}\cdot\sigma(x-y)}\sum_{i=1}^{D}X_{i}(x-y)\right|\geq 6\varepsilon(1-K(x-y))\cdot\frac{\sqrt{D}}{\sigma(x-y)}\right]
\displaystyle\geq O(D1/2)+22π6ε(1K(xy))D/σ(xy)es2/2ds\displaystyle-O(D^{-1/2})+\frac{2}{\sqrt{2\pi}}\int_{6\varepsilon(1-K(x-y))\sqrt{D}/\sigma(x-y)}^{\infty}e^{-s^{2}/2}\mathop{}\!\mathrm{d}s
=\displaystyle= O(D1/2)+22π6εD/sK(Δ,ρ)es2/2ds,\displaystyle-O(D^{-1/2})+\frac{2}{\sqrt{2\pi}}\int_{6\varepsilon\sqrt{D/s_{K}^{(\Delta,\rho)}}}^{\infty}e^{-s^{2}/2}\mathop{}\!\mathrm{d}s,

where the second inequality is by Lemma E.1, and the last equality follows from the definition of s_{K}(\cdot) and the choice of x,y such that s_{K}(x-y)=s_{K}^{(\Delta,\rho)}. ∎

Appendix F Beyond RFF: Oblivious Embedding for Laplacian Kernel with Small Computing Time

In this section we show an oblivious feature mapping for Laplacian kernel dimension reduction with small computing time. The following is the main theorem.

Theorem F.1.

Let KK be a Laplacian kernel with feature mapping φ:d\varphi:\mathbb{R}^{d}\to\mathcal{H}. For every 0<δε2160<\delta\leq\varepsilon\leq 2^{-16}, every d,Dd,D\in\mathbb{N}, Dmax{Θ(ε1log3(1/δ)),Θ(ε2log(1/δ))}D\geq\max\{\Theta(\varepsilon^{-1}\log^{3}(1/{\delta})),\Theta(\varepsilon^{-2}\log(1/{\delta}))\}, every Δρ>0\Delta\geq\rho>0, there is a mapping π:dD\pi:\mathbb{R}^{d}\to\mathbb{R}^{D}, such that for every x,ydx,y\in\mathbb{R}^{d} that are (Δ,ρ)(\Delta,\rho)-bounded,

Pr[|distπ(x,y)distφ(x,y)|εdistφ(x,y)]1δ.\Pr[|\operatorname{dist}_{\pi}(x,y)-\operatorname{dist}_{\varphi}(x,y)|\leq\varepsilon\cdot\operatorname{dist}_{\varphi}(x,y)]\geq 1-\delta.

The time for evaluating π\pi is dDpoly(logΔρ,logδ1)dD\operatorname{poly}(\log\frac{\Delta}{\rho},\log\delta^{-1}).

For simplicity of exposition, we first handle the case when the input data are from \mathbb{N}^{d}. At the end we will describe how to handle the case when the input data are from \mathbb{R}^{d} by a simple transformation. Let N\in\mathbb{N} be such that every entry of an input data point is at most N. Even though the input data consist only of integers, the mapping construction needs to handle fractions, as we will later consider numbers generated from Gaussian distributions or numbers computed in the RFF mapping. So we first set up two numbers, \Delta^{\prime}=\operatorname{poly}(N,\delta^{-1}) large enough and \rho^{\prime}=1/\operatorname{poly}(N,\delta^{-1}) small enough. All subsequent operations work with numbers that are (\Delta^{\prime},\rho^{\prime})-bounded. Denote \rho^{\prime}/\Delta^{\prime} by \rho_{0} for convenience.

F.1 Embedding from 1\ell_{1} to 22\ell_{2}^{2}

Now we describe an isometric embedding from the \ell_{1} norm to squared \ell_{2}. This construction is based on Kahane (1981), in which, to the best of our knowledge, the first such finite-dimensional construction was given. Let \pi_{1}:\mathbb{N}\rightarrow\mathbb{R}^{N} be such that for every x\in\mathbb{N} with x\leq N, \pi_{1}(x)[j]=1 if j\in[1,x] and \pi_{1}(x)[j]=0 otherwise. Let \pi_{1}^{(d)}:\mathbb{N}^{d}\rightarrow\mathbb{R}^{Nd} be such that for every x\in\mathbb{N}^{d} with x_{i}\leq N for all i\in[d], \pi^{(d)}_{1}(x) is the concatenation of the d vectors \pi_{1}(x_{i}), i\in[d].

Lemma F.1.

For every x,ydx,y\in\mathbb{N}^{d} with xi,yiN,i[d]x_{i},y_{i}\leq N,i\in[d], it holds that

xy1=π1(d)(x)π1(d)(y)22.\|x-y\|_{1}=\|\pi_{1}^{(d)}(x)-\pi_{1}^{(d)}(y)\|_{2}^{2}.
Proof.

Notice that for every i\in[d], \pi_{1}(x_{i}) has its first x_{i} entries equal to 1 while \pi_{1}(y_{i}) has its first y_{i} entries equal to 1. Thus \|\pi_{1}(x_{i})-\pi_{1}(y_{i})\|_{2}^{2} is exactly |x_{i}-y_{i}|. Summing over all d coordinates, by the construction of \pi_{1}^{(d)}, the lemma holds. ∎
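The embedding is straightforward to implement; the short Python sketch below (an illustration, with arbitrary N, d, and seed) verifies the identity of Lemma F.1 on random integer points.

```python
# Unary embedding pi_1^{(d)} and a check of ||x - y||_1 = ||pi_1^{(d)}(x) - pi_1^{(d)}(y)||_2^2.
import numpy as np

def pi1_d(x, N):
    """Concatenation of the unary encodings of the d coordinates of x (x in N^d, x_i <= N)."""
    return np.concatenate([(np.arange(1, N + 1) <= xi).astype(float) for xi in x])

rng = np.random.default_rng(1)
N, d = 50, 4
for _ in range(5):
    x, y = rng.integers(0, N + 1, size=d), rng.integers(0, N + 1, size=d)
    lhs = np.sum(np.abs(x - y))
    rhs = np.sum((pi1_d(x, N) - pi1_d(y, N)) ** 2)
    assert lhs == rhs
print("Lemma F.1 verified on random integer points")
```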

F.2 Feature Mapping for Laplacian Kernel

Notice that we can apply the mapping \pi_{1}^{(d)} and then the RFF mapping \phi from Theorem 1.2 for the kernel function \exp(-\|x-y\|_{2}^{2}), which is a Gaussian kernel. This gives a mapping which preserves the kernel distance for the Laplacian kernel. To be more precise, we set the mapping to be \pi=\phi\circ\pi_{1}^{(d)}. The only drawback is that the running time is high, as the above mapping maps d dimensions to dN dimensions. We formalize this as the following theorem.

Theorem F.2.

Let KK be a Laplacian kernel with feature map φ:d\varphi:\mathbb{R}^{d}\to\mathcal{H}. For every 0<δε2160<\delta\leq\varepsilon\leq 2^{-16}, every d,D,Nd,D,N\in\mathbb{N}, Dmax{Θ(ε1log3(1/δ)),Θ(ε2log(1/δ))}D\geq\max\{\Theta(\varepsilon^{-1}\log^{3}(1/{\delta})),\Theta(\varepsilon^{-2}\log(1/{\delta}))\}, there exists a mapping π:dD\pi:\mathbb{R}^{d}\to\mathbb{R}^{D} s.t. for every x,yd,x,yNx,y\in\mathbb{N}^{d},x,y\leq N,

Pr[|distπ(x,y)distφ(x,y)|εdistφ(x,y)]1δ.\Pr[|\operatorname{dist}_{\pi}(x,y)-\operatorname{dist}_{\varphi}(x,y)|\leq\varepsilon\cdot\operatorname{dist}_{\varphi}(x,y)]\geq 1-\delta.

The time of evaluating π\pi is O~(dDN)\tilde{O}(dDN).

Proof.

Consider the map \pi defined above. The guarantee follows from Lemma F.1 and Theorem 1.2. The running time is as stated, since we need to compute a vector of length O(dN) and then apply the RFF to this vector. ∎
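Concretely, the composed mapping of Theorem F.2 can be sketched as follows in Python, assuming the Laplacian kernel K(x,y)=\exp(-\|x-y\|_{1}) with unit bandwidth and Gaussian-kernel RFF frequencies \omega\sim N(0,2I) on the embedded vectors; the dimensions and seed are illustrative only.

```python
# Sketch of pi = phi o pi_1^{(d)}: unary embedding followed by Gaussian-kernel RFF,
# which approximates the Laplacian kernel distance (unit bandwidth assumed).
import numpy as np

def pi1_d(x, N):
    return np.concatenate([(np.arange(1, N + 1) <= xi).astype(float) for xi in x])

rng = np.random.default_rng(2)
N, d, D = 50, 4, 4096
omega = rng.normal(scale=np.sqrt(2.0), size=(N * d, D))   # omega ~ N(0, 2I), so E[cos<omega,u>] = exp(-||u||^2)

def laplacian_rff(x):
    proj = pi1_d(x, N) @ omega
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(D)

x = rng.integers(0, N, size=d)
y = x.copy(); y[0] += 1                                   # a nearby point with ||x - y||_1 = 1
true_dist = np.sqrt(2 - 2 * np.exp(-np.sum(np.abs(x - y))))
approx_dist = np.linalg.norm(laplacian_rff(x) - laplacian_rff(y))
print(f"kernel distance: exact = {true_dist:.4f}, RFF estimate = {approx_dist:.4f}")
```

As the sketch makes explicit, the cost is dominated by projecting the dN-dimensional unary vector onto D frequencies, in line with the \tilde{O}(dDN) evaluation time.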

The time complexity is a bit large, since we need to compute π1(d)\pi_{1}^{(d)} which has a rather large dimension dNdN. Next we show how to reduce the time complexity.

F.3 An Alternate Construction

We give the following map \pi^{\prime}, which has the same output distribution as that of \pi=\phi\circ\pi_{1}^{(d)}. Then, in the next subsection, we will use pseudorandom generators to replace the randomness in \pi^{\prime} while exploiting its high computational efficiency to reduce the time complexity. Notice that in computing \phi\circ\pi_{1}^{(d)}, for each output dimension, the crucial step is computing \langle\omega,\pi_{1}^{(d)}(x)\rangle for a Gaussian vector \omega\in\mathbb{R}^{dN} whose coordinates are independent copies of a one-dimensional Gaussian \omega_{0}. The final output is a function of \langle\omega,\pi_{1}^{(d)}(x)\rangle. So we only need to present the construction for the first block, i.e., the inner product of an N-dimensional Gaussian vector and \pi_{1}(x_{1}). For the other blocks the computations are the same, and finally we only need to sum them up. Hence, to simplify the description, we denote this inner product as \langle\omega,\pi_{1}(x)\rangle, where now x\in\mathbb{N}, x\leq N, and \omega has N coordinates, each an independent copy of \omega_{0}.

Let hh be the smallest integer s.t. N2hN\leq 2^{h}. Consider a binary tree where each node has exactly 2 children. The depth is hh. So it has exactly 2hN2^{h}\geq N leaf nodes in the last layer. For each node vv, we attach a random variable αv\alpha_{v} in the following way. For the root, we attach a Gaussian variable which is the summation of 2h2^{h} independent Gaussian variable with distribution ω0\omega_{0}. Then we proceed layer by layer from the root to leaves. For each u,vu,v being children of a common parent ww, assume that αw\alpha_{w} is the summation of 2l2^{l} independent ω0\omega_{0} distributions. Then let αu\alpha_{u} be the summation of the first 2l12^{l-1} distributions among them and αv\alpha_{v} be the summation of the second 2l12^{l-1} distributions. That is αw=αu+αv\alpha_{w}=\alpha_{u}+\alpha_{v} with αu,αv\alpha_{u},\alpha_{v} being independent. Notice that conditioned on αw=a\alpha_{w}=a, then αu\alpha_{u} takes the value bb with probability Prαu,αv i.i.d. [αu=bαu+αv=a]\Pr_{\alpha_{u},\alpha_{v}\text{ i.i.d. }}[\alpha_{u}=b\mid\alpha_{u}+\alpha_{v}=a]. αv\alpha_{v} takes the value aba-b when αu\alpha_{u} takes value bb.

The randomness for generating every random variable corresponding to a node, are presented as a sequence, in the order from root to leaves, layer by layer, from left to right. We define αx\alpha^{x} to be the summation of the random variables corresponding to the first xx leaves. Notice that αx\alpha^{x} can be sampled efficiently in the following way. Consider the path from the root to the xx-th leaf. First we sample the root, which can be computed using the corresponding randomness. We use a variable zz to record this sample outcome, calling zz an accumulator for convenience. Then we visit each node along the path. When visiting vv, assume its parent is ww, where αw\alpha_{w} has already been sampled previously with outcome aa. If vv is a left child of ww, then we sample αv\alpha_{v} conditioned on αw=a\alpha_{w}=a. Assume this sampling has outcome bb. Then we add a+b-a+b to the current accumulator zz. If vv is a right child of a node ww, then we keep the current accumulator zz unchanged. After visiting all nodes in the path, zz is the sample outcome for αx\alpha^{x}.
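The following Python sketch illustrates this sampling scheme for a single frequency coordinate, assuming each leaf variable \omega_{0} is a standard Gaussian (the Gaussian-RFF case) and using exact Gaussian conditioning in place of the discretized rejection sampling discussed below; node samples are cached so that different prefixes x share one consistent draw, and the empirical variance check at the end is only a sanity test.

```python
# Lazy, tree-based sampling of the prefix sums alpha^x = <omega, pi_1(x)> for omega_0 ~ N(0, 1).
import numpy as np

N = 1 << 10                                            # number of leaves (a power of two)

def make_prefix_sampler(rng):
    """Return alpha(x) = sum of the first x leaf Gaussians, sampled lazily along
    root-to-leaf paths; node values are cached so that all prefixes are consistent."""
    cache = {(0, N): rng.normal(scale=np.sqrt(N))}     # root: sum of N i.i.d. N(0, 1)

    def node(lo, hi):
        # Value attached to the node covering leaves lo+1..hi (a dyadic interval).
        if (lo, hi) not in cache:
            length = hi - lo
            plo = lo - (lo % (2 * length))             # start of the parent's dyadic interval
            parent = node(plo, plo + 2 * length)
            # Two i.i.d. N(0, length) variables conditioned on their sum `parent`
            # are each N(parent / 2, length / 2); sample the left child this way.
            left = rng.normal(loc=parent / 2, scale=np.sqrt(length / 2))
            cache[(plo, plo + length)] = left
            cache[(plo + length, plo + 2 * length)] = parent - left
        return cache[(lo, hi)]

    def alpha(x):
        total, lo, hi = 0.0, 0, N
        while lo < x < hi:                             # walk towards the x-th leaf
            mid = (lo + hi) // 2
            if x <= mid:
                hi = mid                               # the x-th leaf lies in the left half
            else:
                total += node(lo, mid)                 # the left half is fully included
                lo = mid
        return total + (node(lo, hi) if x >= hi else 0.0)

    return alpha

# Sanity check: alpha(x) should be N(0, x), matching <omega, pi_1(x)> for omega_0 = N(0, 1).
samples = [make_prefix_sampler(np.random.default_rng(seed))(300) for seed in range(2000)]
print("empirical variance of alpha(300):", np.var(samples), "(should be close to 300)")
```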

Lemma F.2.

The joint distribution of \alpha^{x},x=0,1,\ldots,N is the same as that of \langle\omega,\pi_{1}(x)\rangle,x=0,1,\ldots,N.

Proof.

According to our construction, each leaf is an independent copy of \omega_{0}. Hence if we take all the leaves and form a vector, then it has the same distribution as \omega.

Notice that for each parent ww with two children u,vu,v, by the construction, αw=αu+αv\alpha_{w}=\alpha_{u}+\alpha_{v}. Here αu,αv\alpha_{u},\alpha_{v} are independent, each being a summation of ll independent ω0\omega_{0}, with ll being the number of leaves derived from uu. Thus for each layer, for every node uu in the layer, αu\alpha_{u}’s are independent and the summation of them is their parent. So for the last layer all the variables are independent and follow the distribution ω0\omega_{0}. And for each node ww in the tree, αw\alpha_{w} is the summation of the random variables attached to the leaves of the subtree whose root is ww. So αx\alpha^{x} is the summation of the first xx leaf variables. ∎

We do the same operation for other dimensions of the output of π1(d)\pi_{1}^{(d)} and then sum them up to get an alternate construction π\pi^{\prime} for π=ϕπ1(d)\pi=\phi\circ\pi_{1}^{(d)}.

We note that to generate an \alpha_{v}, we only need to simulate the conditional distributions. The distribution function F of the random variable is easy to derive, since its density function is a ratio of products of Gaussian density functions, i.e.

Prαu,αv i.i.d. [αu=bαu+αv=a]\displaystyle\Pr_{\alpha_{u},\alpha_{v}\text{ i.i.d. }}[\alpha_{u}=b\mid\alpha_{u}+\alpha_{v}=a] =Prαu,αv i.i.d. [αu=b,αv=ab]/Prαu,αv i.i.d. [αu+αv=a]\displaystyle=\Pr_{\alpha_{u},\alpha_{v}\text{ i.i.d. }}[\alpha_{u}=b,\alpha_{v}=a-b]/\Pr_{\alpha_{u},\alpha_{v}\text{ i.i.d. }}[\alpha_{u}+\alpha_{v}=a]
=Prαu[αu=b]Prαv[αv=ab]/Prαu,αv i.i.d. [αu+αv=a],\displaystyle=\Pr_{\alpha_{u}}[\alpha_{u}=b]\cdot\Pr_{\alpha_{v}}[\alpha_{v}=a-b]/\Pr_{\alpha_{u},\alpha_{v}\text{ i.i.d. }}[\alpha_{u}+\alpha_{v}=a],

where \alpha_{u},\alpha_{v} are Gaussians. After getting the density function p, we can use rejection sampling to sample.

We remark that simulating a distribution using uniform random bits always incurs some simulation bias. The above lemma is proved under the assumption that the simulation has no bias. But we claim that the statistical distance between the simulated distribution and the original distribution is at most \varepsilon_{0}=\operatorname{poly}(\rho_{0})=1/\operatorname{poly}(N), which is small enough by our choice of \Delta^{\prime}=\operatorname{poly}(N,\delta^{-1}) and \rho^{\prime}=1/\operatorname{poly}(N,\delta^{-1}). To see the claim, notice that in the rejection sampling we can first obtain an interval I such that the density function p at any point of I is at least \operatorname{poly}(\rho_{0}). Then we do rejection sampling only on I. Notice that because p decays exponentially, the interval I is not too large, |I|\leq O(\log\rho_{0}^{-1}). Thus trying \operatorname{poly}(\rho_{0}^{-1}) rejection samples is enough to get a sample with sufficiently high probability 1-\operatorname{poly}(\delta). And this sample only has statistical distance \operatorname{poly}(\rho_{0}) from the original distribution, since the density functions are Gaussian functions or products of Gaussian functions, which implies that the probability measure on the complement of I is still at most \varepsilon_{0}=\operatorname{poly}(\rho_{0}).

So if we take the simulation bias into account, then we can show that for every subset S\subseteq\{0,1,\ldots,N\}, the joint distribution of \alpha^{x},x\in S has statistical distance O(|S|\varepsilon_{0}) to the joint distribution of \langle\omega,\pi_{1}(x)\rangle,x\in S. Later we will only use the case |S|=2, i.e., two points. So the overall statistical distance is \delta^{\Theta(1)}, which does not affect our analysis and parameters.

F.4 Reducing the Time Complexity Using PRGs

Next we use a pseudorandom generator to replace the randomness used in the above construction. A function G:\{0,1\}^{r}\rightarrow\{0,1\}^{n} is a pseudorandom generator for space-s computations with error parameter \varepsilon_{g} if for every probabilistic TM M with space s that uses n random bits in a read-once manner,

|Pr[M(G(Ur))=1]Pr[M(Un)=1]|εg.\left|\Pr\left[M(G(U_{r}))=1\right]-\Pr\left[M(U_{n})=1\right]\right|\leq\varepsilon_{g}.

Here rr is called the seed length of GG.

Theorem F.3 (Nisan 1992).

For every n\in\mathbb{N} and s\in\mathbb{N}, there exists a pseudorandom generator G:\{0,1\}^{r}\rightarrow\{0,1\}^{n} for space-s computations with error parameter \varepsilon_{g}, where r=O(\log n(\log n+s+\log\frac{1}{\varepsilon_{g}})). G can be computed in polynomial time (in n,r) and O(r) space. Moreover, given an index i\in[n], the i-th bit of the output of G can be computed in time \operatorname{poly}(r).

Let G:{0,1}r{0,1},=2dDNτG:\{0,1\}^{r}\rightarrow\{0,1\}^{\ell},\ell=2dDN\tau be a pseudorandom generator for space s=c1(logN+sτ)s=c_{1}(\log N+s_{\tau}), with εg=δ/2\varepsilon_{g}=\delta/2, τ=c2logN\tau=c_{2}\log N for some large enough constants c1,c2c_{1},c_{2}. Again we only need to consider the construction corresponding to the first output dimension of ϕπ1\phi\circ\pi_{1}. We replace the randomness UU_{\ell} used in the construction by output of GG. That is, when we need τ\tau uniform random bits to construct a distribution αv\alpha_{v} in the tree, we first compute positions of these bits in UU_{\ell} and then compute the corresponding bits in the output of GG. Then use them to do the construction in the same way. We denote this mapping using pseudorandomness as our final mapping π\pi^{*}.

Now we provide a test algorithm to show that the feature mapping produced with pseudorandomness has roughly the same quality as the mapping produced with true randomness. We denote the test algorithm as T=T_{K,x,y,\varepsilon}, where x,y\in\mathbb{R}^{d} and K is a Laplacian kernel with feature mapping \varphi. T works as follows. Its input is a random string, either U_{\ell} or G(U_{r}). T first computes \operatorname{dist}_{\varphi}(x,y). Notice that T actually does not have to compute \varphi, since the distance can be directly computed as \sqrt{2-2K(x,y)}. Then T(G(U_{r})) computes \operatorname{dist}_{\pi^{*}}(x,y) and tests whether

|distπ(x,y)distφ(x,y)|εdistφ(x,y).|\operatorname{dist}_{\pi^{*}}(x,y)-\operatorname{dist}_{\varphi}(x,y)|\leq\varepsilon\operatorname{dist}_{\varphi}(x,y).

Notice that when the input is UU_{\ell}, then this algorithm TT is instead testing

|distπ(x,y)distφ(x,y)|εdistφ(x,y).|\operatorname{dist}_{\pi^{\prime}}(x,y)-\operatorname{dist}_{\varphi}(x,y)|\leq\varepsilon\operatorname{dist}_{\varphi}(x,y).

Recall that π\pi^{\prime} is defined in the previous section as our mapping using true randomness.

Next we consider using TT on true randomness.

Lemma F.3.

Pr[T(U)=1]1δ/2\Pr[T(U_{\ell})=1]\geq 1-\delta/2.

Proof.

By Lemma F.2, \operatorname{dist}_{\pi^{\prime}}(x,y) has the same distribution as \operatorname{dist}_{\pi}(x,y). By Theorem 1.2, setting the error probability to \delta/2, we have

Pr[|distπ(x,y)distφ(x,y)|εdistφ(x,y)]1δ/2.\Pr[\left|\operatorname{dist}_{\pi}(x,y)-\operatorname{dist}_{\varphi}(x,y)\right|\leq\varepsilon\operatorname{dist}_{\varphi}(x,y)]\geq 1-\delta/2.

Notice that the event T(U)=1T(U_{\ell})=1 is indeed |distπ(x,y)distφ(x,y)|εdistφ(x,y)\left|\operatorname{dist}_{\pi^{\prime}}(x,y)-\operatorname{dist}_{\varphi}(x,y)\right|\leq\varepsilon\operatorname{dist}_{\varphi}(x,y). Hence the lemma holds. ∎

Now we show that TT is actually a small space computation.

Lemma F.4.

TT runs in space c(logN+sτ)c(\log N+s_{\tau}) for some constant cc and the input is read-once.

Proof.

Computing \operatorname{dist}_{\varphi}(x,y) takes space O(\log N), since x,y\in\mathbb{N}, x,y\leq N and the kernel function K can be computed in that space. Now we focus on the computation of \pi^{\prime}. We claim that, by the construction of \alpha^{x} in Section F.3, \pi^{\prime} can be computed using space O(s_{\tau}+\log N). The procedure proceeds as follows. First it finds the path to the x-th leaf. This takes space O(\log N). Then, along this path, for each node we need to sample a variable \alpha_{v}. This takes space O(s_{\tau}). Also notice that since the randomness is presented layer by layer, the procedure only needs to do a read-once sweep of the randomness. T needs to compute \pi^{\prime} for both x and y, but this only blows up the space by a factor of 2. So the overall space needed is as stated. ∎

Finally we prove our theorem by using the property of the PRG.

Proof of Theorem F.1.

We first show our result assuming x,y\in\mathbb{N}^{d} with entries at most N for an integer N. We claim that \pi^{*} is the mapping we want. By Lemma F.3, \Pr[T(U_{\ell})=1]\geq 1-\delta/2. By Lemma F.4, T runs in space O(\log N+s_{\tau}) and is read-once. As G is a pseudorandom generator for space c(\log N+s_{\tau}) for some large enough constant c,

|Pr[T(G(Ur))=1]Pr[T(U)=1]|εg,|\Pr[T(G(U_{r}))=1]-\Pr[T(U_{\ell})=1]|\leq\varepsilon_{g},

where seed length r=O(log(dDNsτ/εg)log(dDNτ)).r=O(\log(dDNs_{\tau}/\varepsilon_{g})\log(dDN\tau)). Notice that T(G(Ur))=1T(G(U_{r}))=1 is equivalent to |distπ(x,y)distφ(x,y)|εdistφ(x,y)|\operatorname{dist}_{\pi^{*}}(x,y)-\operatorname{dist}_{\varphi}(x,y)|\leq\varepsilon\cdot\operatorname{dist}_{\varphi}(x,y). Thus

Pr[|distπ(x,y)distφ(x,y)|εdistφ(x,y)]1δ/2εg1δ.\Pr[|\operatorname{dist}_{\pi^{*}}(x,y)-\operatorname{dist}_{\varphi}(x,y)|\leq\varepsilon\cdot\operatorname{dist}_{\varphi}(x,y)]\geq 1-\delta/2-\varepsilon_{g}\geq 1-\delta.

The running time is computed as follows. We only need to consider one dimension of the input data and one output dimension of the mapping, since the others can be computed in the same time. So we consider the time for sampling \alpha^{x}. For \alpha^{x}, recall that we visit the path from the root to the x-th leaf. We do not have to compute the whole output of G, but instead only need to use some parts of the output. For sampling each variable \alpha_{v} along the path, we use \tau bits in the output of G. By Theorem F.3, computing each random bit of G's output, given the index of this bit, needs time \operatorname{poly}(r). Locating the \tau bits of randomness for generating \alpha_{v} needs time O(\log N). Generating each of the Gaussian random variables using these random bits needs time t_{\tau}. Summing up these variables takes less time than sampling all of them. After sampling, the cosine and sine functions of the RFF can be computed in time \operatorname{poly}(1/\rho_{0})=\operatorname{poly}(\log N,\delta^{-1}). There are d input dimensions and D output dimensions. So the total time complexity is dD\operatorname{poly}(\log N,\delta^{-1}).

For the case where x,y\in\mathbb{R}^{d}, we only need to modify the embedding \pi_{1}^{(d)} in the following way. We first round every entry so that its decimal expansion is finite. The rounded-off parts are small enough (e.g., dropping all digits after the 10\log\rho^{-1}-th position to the right of the decimal point) that this only introduces small additive errors. Then we shift all the entries to be non-negative by adding a common shift s. Then we multiply every entry of x by a common factor t so that every entry becomes an integer. Notice that t and s can both be chosen according to \frac{\Delta}{\rho}, for example t=s=O(\frac{\Delta}{\rho}), and we can take N to be \operatorname{poly}(\frac{\Delta}{\rho}). Then we apply \pi_{1} and multiply by a factor \sqrt{1/t}. Denote this map by \tilde{\pi}_{1}. Notice that this ensures that \|x-y\|_{1}=\|\tilde{\pi}_{1}^{(d)}(x)-\tilde{\pi}_{1}^{(d)}(y)\|_{2}^{2}. Then we can apply the same construction and analysis as we did for the natural number case above. This shows the theorem. ∎

Appendix G Remarks and Comparisons to Chen & Phillips (2017)

Our upper bound in Theorem 1.2 is not directly comparable to that of Chen & Phillips (2017), which gave dimension reduction results for Gaussian kernels. Chen & Phillips (2017) showed in their Theorem 7 a slightly better target dimension bound than ours, but it only works for the case \|x-y\|\geq\sigma, where \sigma is the parameter in the Gaussian kernel (this condition is not clearly mentioned in the theorem statement, but it is indeed used, and is mentioned one line above the statement in Chen & Phillips (2017)). For the other case \|x-y\|<\sigma, their Theorem 14 gave a related bound, but their guarantee is quite different from ours. Specifically, their target dimension depends linearly on the input dimension d. Hence, when d is large (e.g., d=\log^{2}n), their Theorem 14 is worse than ours (for the case \|x-y\|<\sigma).

Finally, we remark that there might be subtle technical issues in the proof of Chen & Phillips (2017). Their Theorem 7 crucially uses a bound for moment generating functions that is established in their Lemma 5. However, we find several technical issues in the proof of Lemma 5 (found in their appendix). Specifically, the term \mathbb{E}[e^{-s\frac{1}{2}\omega^{2}\|\Delta\|^{2}}] in the last line above “But” (on page 17) should actually be \mathbb{E}[e^{s\frac{1}{2}\omega^{2}\|\Delta\|^{2}}]. Even if one fixes this mistake (by negating the exponent), then eventually one can only obtain a weaker bound of \ln M(s)\leq\frac{s^{2}}{4}\|\Delta\|^{4}+s\|\Delta\|^{2} in the very last step, since the term -s\|\Delta\|^{2} is negated accordingly. Hence, it is not clear if the claimed bound can still be obtained in Theorem 7.