
Linear Time Kernel Matrix Approximation via Hyperspherical Harmonics

John Paul Ryan    Anil Damle
Abstract

We propose a new technique for constructing low-rank approximations of matrices that arise in kernel methods for machine learning. Our approach pairs a novel automatically constructed analytic expansion of the underlying kernel function with a data-dependent compression step to further optimize the approximation. This procedure works in linear time and is applicable to any isotropic kernel. Moreover, our method accepts the desired error tolerance as input, in contrast to prevalent methods which accept the rank as input. Experimental results show our approach compares favorably to the commonly used Nyström method with respect to both accuracy for a given rank and computational time for a given accuracy across a variety of kernels, dimensions, and datasets. Notably, in many of these problem settings our approach produces near-optimal low-rank approximations. We provide an efficient open-source implementation of our new technique to complement our theoretical developments and experimental results.


1 Introduction

Kernel methods are a popular class of techniques in machine learning problems. They typically involve the implicit definition of an infinite dimensional feature space using an explicitly defined kernel function that describes similarity between data points. Examples of techniques based around the use of kernels include kernel support vector machines, kernel ridge regression, spectral clustering, kernel PCA, and Gaussian process regression (Shawe-Taylor et al., 2004; Scholkopf & Smola, 2018). In practice, it is important that an appropriate kernel be chosen for the given task and a wealth of literature exists on proper choices of kernel functions and their hyperparameters.

In this work, our focus is on computational problems that generically arise in the use of kernel methods. Let $k$ be an isotropic kernel function and let $\mathcal{X}$ be a set of points in $\mathbb{R}^{d}$. The associated kernel matrix is defined as

K_{ij}=k(\|x_{i}-x_{j}\|)\qquad x_{i},x_{j}\in\mathcal{X}.

Simply populating and storing the kernel matrix requires time and space which grow quadratically with the size of the dataset. Typically, kernel methods will also require operations on the kernel matrix which naïvely take cubic time, including solving linear systems and/or computing eigendecompositions. For even modestly sized datasets standard approaches are infeasible. This paper introduces a novel technique for constructing an approximation to the kernel matrix in linear time that subsequently enables the computation of matrix vector products with $K$ in linear time. This procedure can be used to, for example, substantially accelerate the solution of linear systems via iterative methods (Greenbaum, 1997; Shewchuk, 1994).

Accelerating computations involving kernel matrices has been the subject of extensive research. One family of algorithms uses inducing points as a means by which to approximately represent the kernel matrix in an easier-to-use form. One example is the Nyström method (Nyström, 1930), which uses kernel interactions between data points and a set of inducing points as basis functions for a low-rank approximation of the matrix. For example, if $\mathcal{U}\subset\mathcal{X}$ is a subset of datapoints, then the Nyström approximation is given by

K\approx K_{\text{Nys}}=K_{\mathcal{X}\mathcal{U}}(K_{\mathcal{U}\mathcal{U}})^{-1}K_{\mathcal{U}\mathcal{X}}. (1)

This low-rank approximation allows for accelerated matrix-vector multiplies within iterative schemes, as well as the opportunity to use the Sherman-Morrison-Woodbury formula in settings where a diagonal matrix is added to the kernel matrix (such as for regularization).
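For concreteness, the following minimal Julia sketch (illustrative only; the function and variable names are ours, not from any released package) builds such a Nyström factor for an isotropic kernel from a randomly chosen subset of inducing points:

```julia
using LinearAlgebra, Random

# Minimal Nyström sketch: return F with F*F' = K_XU * inv(K_UU) * K_UX, the
# approximation in (1). Here K_UU = R'R is a Cholesky factorization and F = K_XU*inv(R).
function nystrom_factor(k, X::Matrix{Float64}, m::Int)
    N = size(X, 1)                              # data points are rows of X
    Usub = X[randperm(N)[1:m], :]               # random inducing subset U ⊂ X
    K_XU = [k(norm(X[i, :] - Usub[j, :])) for i in 1:N, j in 1:m]
    K_UU = [k(norm(Usub[i, :] - Usub[j, :])) for i in 1:m, j in 1:m]
    C = cholesky(Symmetric(K_UU) + 1e-12I)      # small shift for numerical safety
    return K_XU / C.U                           # F = K_XU * inv(R)
end

# Example: Cauchy kernel on points scaled to the unit ball, as in Section 4.1
cauchy(r) = 1 / (1 + r^2)
X = randn(1000, 3); X ./= maximum(norm(X[i, :]) for i in 1:size(X, 1))
F = nystrom_factor(cauchy, X, 50)               # K ≈ F * F'
```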

Other techniques leverage analytic properties of the kernel function itself to arrive at more manageable forms of the kernel matrix. For example, the Fast Multipole Method and Fast Gauss Transform (Greengard & Rokhlin, 1987; Greengard & Strain, 1991; Yang et al., 2003) both use truncated expansions of the kernel function to compute low-rank matrix representations without ever having to examine the original kernel matrix. The method of random Fourier features (Rahimi et al., 2007) uses a representation of the kernel as the expectation of a random function to approximate the kernel matrix via a Monte-Carlo scheme. Recent work (Solin & Särkkä, 2020) leverages the spectral density of the kernel to compute coefficients in a harmonic expansion. Whereas the techniques of the previous paragraph use kernel interactions among data points without explicitly analyzing the kernel function itself, these analytically motivated methods use only properties of the kernel as the basis of their approximation. While data-independent approximations benefit from being easy to update with new datapoints, they often yield higher-rank representations than necessary. In addition, they are often kernel-targeted, requiring knowledge of a known expansion for a given kernel, or the kernel’s spectral density.

1.1 Contributions

We present a new technique which leverages both the analytic properties of the underlying kernel (without being targeted to any specific kernel), and additional data-dependent compression to generate a low-rank approximation to the kernel matrix. The time and space cost of the approximation are linear in the size of the dataset. The main baseline technique with which we compare our results is the Nyström method for high-accuracy kernel-independent approximations. Our experiments show favorable comparison with the Nyström method for a host of kernels, dimensions, datasets, and error tolerances. In fact, we show a number of settings in which our method achieves a nearly-optimal tradeoff between rank and error, while still taking less time than the Nyström method.

2 Prior Work

Given a kernel function $k$ and dataset $\mathcal{X}$ there are several common techniques for devising a fast matrix vector product with the associated kernel matrix $K$. One broad class of methods computes low-rank approximations to $K$ as $K\approx UV^{T}$ when applicable. The SVD is the optimal choice if accuracy for a given rank is the only consideration. For sufficiently small scale problems this may be feasible. However, for larger problems computing the SVD becomes prohibitively expensive since, at a minimum, methods targeting the top singular values/vectors require repeated matrix vector products with $K$—the very problem we want to address.

The Nyström method (Nyström, 1930) is a general approach to this problem that is much more computationally efficient at the expense of accuracy for a fixed rank approximation (i.e., it is suboptimal relative to the SVD). The most commonly used variants within machine learning are sampling based methods that yield low-rank approximations of $K$ as in (1). (The Nyström method was originally introduced within the integral equations community (Nyström, 1930) and subsequently found use within machine learning (Williams & Seeger, 2001; Gittens & Mahoney, 2013).) The key complexities of this method are: (1) selecting an appropriate rank for the approximation and (2) selecting the subset $\mathcal{U}$. In principle, the theory of pseudoskeleton approximations (Goreinov et al., 1997) provides generic upper bounds for the rank given a desired accuracy, and there is significant work within the machine learning and theoretical computer science communities to sharply characterize the best way to select $\mathcal{U}$ (see, e.g., (Musco & Musco, 2017; Gittens & Mahoney, 2013; Drineas et al., 2005)). Another broad class of methods that fall into this category are so-called random features methods such as random Fourier features (Rahimi et al., 2007). In this setting, the low-rank factors are produced by approximating the feature expansion related to a given kernel. While these methods are quite efficient, they often struggle to achieve high accuracy without taking the approximation to be of excessively large rank. For a systematic comparison of these methods see (Yang et al., 2012).

Another set of methods broadens these techniques to cases where the kernel matrix $K$ may not be globally low rank, but exhibits so-called rank structure. This is often captured using domain decomposition schemes which can leverage analytic expansions of kernel functions as in (Greengard & Rokhlin, 1987; Ryan et al., 2021), or algebraic data-dependent methods as in (Ying et al., 2004; Ying, 2006; Si et al., 2017). These methods typically require the ability to compute low-rank approximations of subblocks of $K$—our present work solves this problem, making it complementary to these schemes rather than in direct competition.

Another class of methods for solving this problem represents $K$ as a product $K\approx WK_{\mathcal{G}\mathcal{G}}W^{T}$, where $\mathcal{G}$ is a set of points chosen such that $K_{\mathcal{G}\mathcal{G}}$ admits fast matrix vector products and $W$ is an appropriately chosen interpolation/aggregation matrix. The most common methods in this class let $\mathcal{G}$ be a regular grid over the domain of interest and force $W$ to be sparse (i.e., they use local interpolation). In this setting $K_{\mathcal{G}\mathcal{G}}$ can be efficiently applied using the fast Fourier transform. This form of method arose as the so-called pre-corrected FFT (Phillips & White, 1994; White et al., 1994) and gained popularity in the machine learning community as structured kernel interpolation (SKI) (Wilson & Nickisch, 2015) (in this context, it can be seen as an acceleration of inducing point methods (Snelson & Ghahramani, 2005)). While potentially highly efficient in moderate dimension if sparse interpolation and aggregation operators are used (i.e., $W$ is very sparse), as with sampling based Nyström methods these approaches do not easily scale to higher accuracy regimes.

3 Methods

3.1 Overview

Here we derive our analytic expansion and the subsequent additional compression step which comprise our new algorithm, which we refer to as a Harmonic Decomposition Factorization (HDF). We will begin by showing why an analytic expansion can lead to a low-rank approximation of a kernel matrix. Then we will show how a Chebyshev expansion of the kernel function can be expanded and rearranged into an expansion in hyperspherical harmonics. From there, our additional compression step examines the functions which form the coefficients in the harmonic expansion—when evaluated on the data points, these functions can be further reduced in number, thus reducing the overall rank of the factorization. Finally we present error and complexity analyses for the factorization.

3.2 Methodology

We achieve a linear time low-rank approximation by analyzing the kernel’s functional form rather than algebraically operating on its associated matrix. Let $K$ be the kernel matrix whose entry in the $i$th row and $j$th column is $k(\|x_{i}-y_{j}\|)$, where $x_{i},y_{j}$ are $d$-dimensional points in datasets $\mathcal{X},\mathcal{Y}$ respectively, and $k$ is a piecewise smooth and continuous isotropic kernel. Our goal is to find $N\times r$ matrices $U,V$ with $r\ll N$ so that the $N\times N$ kernel matrix may be approximated by

K\approx UV^{T}. (2)

We achieve this by finding functions $u_{l},v_{l}$ so that

k(\|x-y\|)\approx\sum_{l=0}^{r-1}u_{l}(x)v_{l}(y). (3)

Then we may define $U$ and $V$ as

U_{il}\coloneqq u_{l}(x_{i})\qquad V_{il}\coloneqq v_{l}(y_{i}).

This is the typical mechanism of analytic methods for kernel matrix approximation—$U$ and $V$ are formed by analyzing the kernel function and applying related functions to the data, rather than by examining the original kernel matrix. Thus our main goal is to find $u_{l},v_{l}$ so that (3) can be computed efficiently and is sufficiently accurate for as small $r$ as possible.
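To make this mechanism concrete, here is a minimal sketch (in Julia, with illustrative names) of how $U$ and $V$ are populated once some family of functions $u_{l},v_{l}$ is in hand; the remainder of this section constructs a specific such family:

```julia
# Generic mechanism of analytic low-rank methods: evaluate given functions
# us[l] and vs[l] on the data to form U and V with K ≈ U*V'.
function feature_factors(us, vs, X::Matrix{Float64}, Y::Matrix{Float64})
    r = length(us)
    U = [us[l](X[i, :]) for i in 1:size(X, 1), l in 1:r]
    V = [vs[l](Y[i, :]) for i in 1:size(Y, 1), l in 1:r]
    return U, V
end

# Once U and V are formed, a matrix-vector product costs O(Nr):
# K * w ≈ U * (V' * w)
```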

3.3 Analytic Expansion

First we assume without loss of generality that $0\leq\|x_{i}-y_{j}\|\leq 1$ for all $x_{i},y_{j}$ (mathematically, we are absorbing the diameter of the point set into a lengthscale parameter in the kernel function). We begin by defining

\bar{k}(r)\coloneqq\begin{cases}k(r)&r\geq 0\\ k(-r)&r<0\end{cases} (4)

and then forming a truncated Chebyshev expansion of $\bar{k}(r)$ for $-1\leq r\leq 1$,

\bar{k}(r)\approx\sum_{i=0}^{p}a_{i}T_{i}(r),

where $T_{i}$ is the $i$th Chebyshev polynomial of the first kind. The coefficients $a_{i}$ may be computed efficiently and accurately using, e.g., the discrete cosine transform; see (Clenshaw & Curtis, 1960). This yields

k(\|x-y\|)\approx\sum_{i=0}^{p}a_{i}T_{i}(\|x-y\|) (5)

for $0\leq\|x-y\|\leq 1$. Furthermore $a_{i}=0$ for odd $i$ since $\bar{k}$ is even by definition. To see this, note that $a_{i}\propto\int_{-1}^{1}\bar{k}(r)T_{i}(r)(1-r^{2})^{-1/2}\,\mathrm{d}r$, and when $i$ is odd the integrand is an even function times an odd function (the Chebyshev weight is even), hence the integral vanishes. Consequently, we henceforth assume that $p$ is even.
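As a sketch of this step (with names of our own choosing, not the toolkit's routine), the coefficients $a_{i}$ can be approximated by sampling $\bar{k}$ at Chebyshev points of the first kind and applying the corresponding cosine-transform formula:

```julia
# Sketch: Chebyshev coefficients of k̄ on [-1, 1] via sampling at Chebyshev
# points of the first kind (equivalent to a DCT of the samples). Because k̄ is
# even, the odd-order coefficients come out (numerically) zero.
function chebyshev_coefficients(kbar, p::Int; n::Int = 2p + 16)
    θ = [π * (m - 0.5) / n for m in 1:n]        # Chebyshev–Gauss angles
    f = [kbar(cos(t)) for t in θ]               # samples of k̄ at the nodes
    a = [(2 / n) * sum(f[m] * cos(j * θ[m]) for m in 1:n) for j in 0:p]
    a[1] /= 2                                   # the j = 0 coefficient is halved
    return a                                    # a[j + 1] multiplies T_j
end

# Example with the Cauchy kernel k(r) = 1/(1 + r²), for which k̄(r) = k(r)
a = chebyshev_coefficients(r -> 1 / (1 + r^2), 20)
```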

The coefficients of the Chebyshev polynomials are defined as $t_{i,j}$ so that

T_{i}(\|x-y\|)=\sum_{j=0}^{i}t_{i,j}\|x-y\|^{j} (6)

and, by the recurrence $T_{i}(x)=2xT_{i-1}(x)-T_{i-2}(x)$ defining the Chebyshev polynomials, they satisfy $t_{0,0}=t_{1,1}=1$, $t_{1,0}=0$, $t_{i,j}=0$ for $j>i$, and, for $i\geq 2$,

t_{i,j}=\begin{cases}-t_{i-2,j}&j=0\\ 2t_{i-1,j-1}-t_{i-2,j}&j>0.\end{cases}
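A small sketch (ours, for illustration) that tabulates these coefficients directly from the recurrence:

```julia
# Sketch: table of monomial coefficients t[i+1, j+1] = t_{i,j} with
# T_i(x) = Σ_j t_{i,j} x^j, built from T_i(x) = 2x T_{i-1}(x) - T_{i-2}(x).
function chebyshev_monomial_coefficients(p::Int)
    t = zeros(p + 1, p + 1)
    t[1, 1] = 1.0                               # T_0(x) = 1
    p >= 1 && (t[2, 2] = 1.0)                   # T_1(x) = x
    for i in 2:p
        t[i + 1, 1] = -t[i - 1, 1]              # constant term
        for j in 1:i
            t[i + 1, j + 1] = 2t[i, j] - t[i - 1, j + 1]
        end
    end
    return t
end

t = chebyshev_monomial_coefficients(4)
# Row i+1 holds the coefficients of T_i, e.g. T_4(x) = 8x⁴ - 8x² + 1.
```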

Noting $\|x-y\|^{2}=\|x\|^{2}+\|y\|^{2}-2x^{T}y$ and plugging (6) into (5) yields

k(\|x-y\|)\approx\sum_{i=0}^{p}\sum_{j=0}^{i}a_{i}t_{i,j}(\|x\|^{2}+\|y\|^{2}-2x^{T}y)^{j/2}.

Applying the multinomial theorem and rearranging the sums (see Section A for details) results in

k(\|x-y\|)\approx\sum_{k_{3}=0}^{p/2}\sum_{k_{2}=0}^{p/2-k_{3}}\sum_{k_{1}=0}^{p/2-k_{3}-k_{2}}(x^{T}y)^{k_{3}}\|y\|^{2k_{2}}\|x\|^{2k_{1}}\,\mathcal{T}_{k_{1},k_{2},k_{3}} (7)

where we have absorbed $a_{i}$, $t_{i,j}$, and the multinomial coefficients into constants $\mathcal{T}_{k_{1},k_{2},k_{3}}$.

Since $x^{T}y=\|x\|\|y\|\cos{\gamma}$, where $\gamma$ is the angle between the vectors $x$ and $y$, we may write

(x^{T}y)^{k_{3}}\|y\|^{2k_{2}}\|x\|^{2k_{1}}=(\cos{\gamma})^{k_{3}}\|y\|^{2k_{2}+k_{3}}\|x\|^{2k_{1}+k_{3}}.

Plugging this into (7) and performing additional rearranging/reindexing yields

k(\|x-y\|)\approx\sum_{k=0}^{p/2}C_{k}^{\alpha}(\cos{\gamma})\sum_{m=k}^{p-k}\sum_{n=k}^{p-m}\|y\|^{m}\|x\|^{n}\,\mathcal{T}^{\prime}_{k,m,n} (8)

where $\alpha=\frac{d}{2}-1$, $C_{k}^{\alpha}(\cos{\gamma})$ is the Gegenbauer polynomial of order $k$, and $\mathcal{T}^{\prime}_{k,m,n}$ are constants which depend on the kernel but not on $x,y$.
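The Gegenbauer polynomials themselves are cheap to evaluate by their standard three-term recurrence; a brief sketch (assuming $d\geq 3$ so that $\alpha>0$):

```julia
# Sketch: evaluate C_0^α, …, C_K^α at x = cos(γ) via the standard three-term
# recurrence k C_k^α(x) = 2x(k + α - 1) C_{k-1}^α(x) - (k + 2α - 2) C_{k-2}^α(x).
function gegenbauer(K::Int, α::Float64, x::Float64)
    C = zeros(K + 1)
    C[1] = 1.0                                  # C_0^α(x) = 1
    K >= 1 && (C[2] = 2α * x)                   # C_1^α(x) = 2αx
    for k in 2:K
        C[k + 1] = (2x * (k + α - 1) * C[k] - (k + 2α - 2) * C[k - 1]) / k
    end
    return C
end

Cvals = gegenbauer(5, 1.5, 0.3)                 # e.g. d = 5, so α = d/2 - 1 = 3/2
```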

We may finally put this in the form of (3) by using the hyperspherical harmonic addition theorem (Wen & Avery, 1985):

\frac{1}{Z_{k}^{(\alpha)}}C_{k}^{(\alpha)}(\cos{\gamma})=\sum_{h\in\mathcal{H}_{k}}\Upsilon_{k}^{h}(x)\Upsilon_{k}^{h}(y) (9)

where $Z_{k}^{(\alpha)}$ is a normalization term, $\Upsilon_{k}^{h}$ are hyperspherical harmonics (we use a real basis of hyperspherical harmonics to avoid complex arithmetic), which are $d$-dimensional generalizations of the spherical harmonics (see (Avery, 1989) for their definition), and

\mathcal{H}_{k}\coloneqq\{(\mu_{1},\dots,\mu_{d-2}):k\geq\mu_{1}\geq\dots\geq|\mu_{d-2}|\geq 0\}.

Substituting (9) into (8) yields

k(\|x-y\|)\approx\sum_{k=0}^{p/2}\sum_{h\in\mathcal{H}_{k}}\Upsilon_{k}^{h}(x)\Upsilon_{k}^{h}(y)\sum_{m=k}^{p-k}\|y\|^{m}\sum_{n=k}^{p-m}\|x\|^{n}\,\mathcal{T}^{\prime\prime}_{k,m,n}. (10)

This is of the form in (3), using

u_{k,h,m}(x)=\Upsilon_{k}^{h}(x)\sum_{n=k}^{p-m}\|x\|^{n}\,\mathcal{T}^{\prime\prime}_{k,m,n},\qquad v_{k,h,m}(y)=\Upsilon_{k}^{h}(y)\|y\|^{m}.
Algorithm 1 Harmonic Decomposition Factorization (HDF)
  Input: kernel, error tolerance $\varepsilon$, datasets $x,y$.
  Output: matrices $U,V$ s.t. $K\approx UV^{T}$
  $p\leftarrow$ ChebDegreeNeeded($kernel,\varepsilon$)  {Remark 3}
  $a_{i}\leftarrow$ ChebyshevTransform($kernel,p$)
  $idx\leftarrow 0$
  for $k\in 0..p/2$ do
     $X^{(k)}\leftarrow$ InitX($k,a_{i},x$)  {Eq. (12)}
     $Y^{(k)}\leftarrow$ InitY($k,y$)
     $Q_{X}^{(k)},R_{X}^{(k)}\leftarrow$ qr($X^{(k)}$)
     $Q_{Y}^{(k)},R_{Y}^{(k)}\leftarrow$ qr($Y^{(k)}$)
     $U^{(k)},\Sigma^{(k)},V^{(k)}\leftarrow$ truncsvd($R_{X}^{(k)}(R_{Y}^{(k)})^{T},\varepsilon$)
     $s_{k}\leftarrow$ numcols($\Sigma^{(k)}$)
     if $s_{k}>0$ then
        $\bar{X}^{(k)}\leftarrow Q_{X}^{(k)}U^{(k)}\sqrt{\Sigma^{(k)}}$
        $\bar{Y}^{(k)}\leftarrow Q_{Y}^{(k)}V^{(k)}\sqrt{\Sigma^{(k)}}$
        for $h\in\mathcal{H}_{k}$ do
           $(\Upsilon_{k}^{h})_{X}\leftarrow$ Harmonic($k,h,x$)
           $(\Upsilon_{k}^{h})_{Y}\leftarrow$ Harmonic($k,h,y$)
           for $l\in 1..s_{k}$ do
              $U[:,idx]\leftarrow\bar{X}^{(k)}[:,l]\odot(\Upsilon_{k}^{h})_{X}$
              $V[:,idx]\leftarrow\bar{Y}^{(k)}[:,l]\odot(\Upsilon_{k}^{h})_{Y}$
              $idx\leftarrow idx+1$
           end for
        end for
     end if
  end for
  return $U,V$

3.4 Additional Data-Dependent Compression

The number of functions in the approximate expansion (which is $r$ in (3)) will henceforth be referred to as the rank of the expansion. The cost of computing the low-rank expansion is the cost of populating the matrices $U$ and $V^{T}$, which is linear in the number of entries in the matrices. To achieve efficiency, we would like the rank of the expansion to be as small as possible while still achieving the desired level of accuracy. Unfortunately, the rank of (10) is still large—proportional to $d^{p}$. In this section, we incorporate the datasets $\mathcal{X},\mathcal{Y}$ into a scheme that efficiently reduces the rank of (10).

First, we define

r^{(k)}(\|x\|,\|y\|)\coloneqq\sum_{m=k}^{p-k}\|y\|^{m}\sum_{n=k}^{p-m}\|x\|^{n}\,\mathcal{T}^{\prime\prime}_{k,m,n} (11)

so that (10) may be written as

k(\|x-y\|)\approx\sum_{k=0}^{p/2}\sum_{h\in\mathcal{H}_{k}}\Upsilon_{k}^{h}(x)\Upsilon_{k}^{h}(y)\,r^{(k)}(\|x\|,\|y\|).

The function $r^{(k)}$ has rank $\frac{p-2k}{2}$ (note that $\mathcal{T}^{\prime\prime}_{k,m,n}$ is nonzero only if $k,m,n$ have the same parity; see Section A), but if we can find an approximation $\tilde{r}^{(k)}\approx r^{(k)}$ with smaller rank and within our error tolerance (when evaluated on the data), then it may be substituted into (10) to reduce the overall rank.

We will find this low-rank approximation by efficiently computing a low-rank approximation of the associated $N\times N$ matrix defined by $(\mathcal{R}^{(k)})_{ij}\coloneqq r^{(k)}(\|x_{i}\|,\|y_{j}\|)$. Notably, $\mathcal{R}^{(k)}$ can be written as $\mathcal{R}^{(k)}=X^{(k)}(Y^{(k)})^{T}$, where

X^{(k)}_{il}=\sum_{n=k}^{p-k-2l}\|x_{i}\|^{n}\,\mathcal{T}^{\prime\prime}_{k,k+2l,n},\qquad Y^{(k)}_{il}=\|y_{i}\|^{k+2l}. (12)

Letting $X^{(k)}=Q^{(k)}_{X}R^{(k)}_{X}$ and $Y^{(k)}=Q^{(k)}_{Y}R^{(k)}_{Y}$ be $QR$ factorizations, we have

\mathcal{R}^{(k)}=Q^{(k)}_{X}U^{(k)}\Sigma^{(k)}(V^{(k)})^{T}(Q^{(k)}_{Y})^{T} (13)

where $U^{(k)}\Sigma^{(k)}(V^{(k)})^{T}$ is an SVD of the matrix $R^{(k)}_{X}(R^{(k)}_{Y})^{T}$. Using an appropriate error tolerance $\tau$, we may find a lower rank representation of $\mathcal{R}^{(k)}$ by truncating all singular values of $\Sigma^{(k)}$ less than $\tau$. This results in

\mathcal{R}^{(k)}\approx\bar{X}^{(k)}(\bar{Y}^{(k)})^{T} (14)

where

\bar{X}^{(k)}\coloneqq Q^{(k)}_{X}U^{(k)}_{:,1:s_{k}}\sqrt{\Sigma^{(k)}_{1:s_{k},1:s_{k}}},\qquad\bar{Y}^{(k)}\coloneqq Q^{(k)}_{Y}V^{(k)}_{:,1:s_{k}}\sqrt{\Sigma^{(k)}_{1:s_{k},1:s_{k}}},

where the notation $U^{(k)}_{:,1:s_{k}}$ refers to the first $s_{k}$ columns of $U^{(k)}$, $\Sigma^{(k)}_{1:s_{k},1:s_{k}}$ refers to the first $s_{k}$ rows and columns of $\Sigma^{(k)}$, and $s_{k}$ is the index of the smallest singular value of $\Sigma^{(k)}$ greater than $\tau$. The selection of $\tau$ is discussed in Section 3.5.
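The following Julia sketch (illustrative names; not the toolkit's internal routine) carries out exactly this compression for a single pair $X^{(k)},Y^{(k)}$:

```julia
using LinearAlgebra

# Sketch of the compression of R = X*Y': thin QR factorizations of the tall
# factors, an SVD of the small matrix R_X*R_Y', and truncation of singular
# values below τ, returning Xbar, Ybar with R ≈ Xbar*Ybar'.
function compress_pair(X::Matrix{Float64}, Y::Matrix{Float64}, τ::Float64)
    FX, FY = qr(X), qr(Y)
    QX, RX = Matrix(FX.Q), FX.R
    QY, RY = Matrix(FY.Q), FY.R
    S = svd(RX * RY')                           # small q × q SVD
    s = count(σ -> σ > τ, S.S)                  # retained rank s_k
    if s == 0
        return zeros(size(X, 1), 0), zeros(size(Y, 1), 0)
    end
    D = Diagonal(sqrt.(S.S[1:s]))
    return QX * S.U[:, 1:s] * D, QY * S.V[:, 1:s] * D
end
```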

Remark 1.

When $\mathcal{X}=\mathcal{Y}$, the matrix $\mathcal{R}^{(k)}$ is symmetric, hence the SVD may be replaced with an eigendecomposition. In fact, only the $QR$ factorization of $Y^{(k)}$ is needed in this case. Furthermore, we will have $U=V$ in (2), so the memory requirements are halved. From a practical point of view, it is efficient to include a diagonal matrix $D$ containing these eigenvalues so that we may store $K\approx UDU^{T}$ with all real entries instead of $K\approx UU^{T}$ with possibly complex entries in the case of non-positive definite matrices (see (Chen et al., 2009) for examples of indefinite learning tasks).

Remark 2.

Since the hyperspherical harmonics are orthogonal functions, using the same truncation criteria for different $\mathcal{R}^{(k)}$ yields approximately the same error for each $k$. In practice, this may result in some $\mathcal{R}^{(k)}$ being completely dropped because even their first singular value is below the threshold—our interpretation is that the $k$th order harmonics in the expansion are unnecessary to achieve the desired level of accuracy, and so this additional compression step allows us to drop them and greatly reduce the rank of the approximation.

3.5 Error

Error is introduced into our approximation in two ways: the truncation of the Chebyshev expansion in (5) and the data-driven approximation to $\mathcal{R}^{(k)}$ in (14). Suppose we use a threshold of $\tau$ to truncate singular values/vectors in (13); then the error $\varepsilon$ of the approximation for any pair of points will satisfy

\varepsilon=\sum_{k=0}^{p/2}\frac{C_{k}^{(\alpha)}(\cos{\gamma_{ij}})}{Z_{k}^{(\alpha)}}\left(\mathcal{R}^{(k)}-\bar{X}^{(k)}(\bar{Y}^{(k)})^{T}\right)_{ij}+\sum_{i=p+1}^{\infty}a_{i}T_{i}(\|x_{i}-y_{j}\|).

For the first part, note that $|C_{k}^{(\alpha)}(\cos{\gamma_{ij}})|\leq\binom{k+d-3}{k}$ (see (Kim et al., 2012), Eq. (1.1)). Further, since $\|A\|_{\infty}\leq\sqrt{N}\|A\|_{2}$ for any $N\times N$ matrix $A$, we have

\left|\left(\mathcal{R}^{(k)}-\bar{X}^{(k)}(\bar{Y}^{(k)})^{T}\right)_{ij}\right|<\sqrt{N}\,\tau.

For the second part, (Elliott et al., 1987) gives

\varepsilon_{c}(p)\coloneqq\left|\sum_{i=p+1}^{\infty}a_{i}T_{i}(\|x_{i}-y_{j}\|)\right|\leq\frac{1}{2^{p}(p+1)!}\max_{0\leq r\leq 1}\left|k^{(p+1)}(r)\right|.

Therefore, the total error satisfies

|\varepsilon|\leq\sqrt{N}\,\mathcal{C}_{p/2}\,\tau+\varepsilon_{c}(p), (15)

where $\mathcal{C}_{p/2}\coloneqq\sum_{k=0}^{p/2}\binom{k+d-3}{k}/Z_{k}^{(\alpha)}$.

Remark 3.

In practice, $\varepsilon_{c}(p)$ may be easily approximated and we may adaptively choose $p$ and $\tau$ given an input error tolerance $\varepsilon$ such that the desired accuracy is achieved. In fact, since (15) is often quite loose, tuning how $\tau$ is chosen with respect to the input error tolerance can yield significantly smaller ranks while maintaining desired accuracy.
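One simple way to make this concrete (a heuristic sketch of our own, not necessarily the exact rule used in the released implementation): compute a generously long coefficient vector once, pick the smallest even $p$ whose tail is below half the budget, and spend the rest of the budget on $\tau$ via (15):

```julia
# Heuristic sketch: split the error budget ε between the Chebyshev tail and the
# singular-value truncation, using a precomputed (generously long) coefficient
# vector a and a caller-supplied bound Cbound on the constant in (15).
function choose_p_and_tau(a::Vector{Float64}, ε::Float64, N::Int, Cbound::Float64)
    tail(p) = sum(abs, a[p + 2:end])            # crude estimate of ε_c(p)
    candidates = 0:2:length(a) - 2              # even degrees only
    i = findfirst(p -> tail(p) ≤ ε / 2, candidates)
    i === nothing && error("coefficient vector too short for this tolerance")
    p = candidates[i]
    τ = (ε / 2) / (sqrt(N) * Cbound)            # from |ε| ≤ √N·C·τ + ε_c(p)
    return p, τ
end
```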

3.6 Complexity

The algorithm is summarized in Algorithm 1. Note that $p$ is an adaptively chosen parameter so that (5) is sufficiently accurate. The Chebyshev transform takes $\mathcal{O}(p\log p)$ time via the discrete cosine transform. If an initial multiplication of Vandermonde matrices is performed outside the $k$ loop, the $QR$ factorizations can be performed in $\mathcal{O}(Np)$ time each (see, for example, (Brubeck et al., 2021)) and so contribute $\mathcal{O}(Np^{2})$ to the total cost. Forming $\bar{X}^{(k)},\bar{Y}^{(k)}$ takes $\mathcal{O}(Nps_{k})$ each, contributing $\mathcal{O}(Np^{2}s)$ to the total cost, where $s\coloneqq\max_{k}s_{k}$. The SVDs take $\mathcal{O}(p^{3})$ time each and so contribute $\mathcal{O}(p^{4})$ to the total cost.

The total number of harmonics needed by our algorithm will be bounded above by the number of harmonics of order less than or equal to $p/2$—in Section B we derive this to be $\mathbf{H}_{p/2,d}=\binom{\frac{p}{2}+d-1}{d-1}+\binom{\frac{p}{2}+d-2}{d-1}$. Each harmonic costs $\mathcal{O}(d)$ time to evaluate per datapoint, hence the cost of computing the necessary harmonics will be $\mathcal{O}(N\mathbf{H}_{p/2,d}\,d)$. Once they are computed, the cost to populate the $U$ and $V$ matrices will be $\mathcal{O}(Nr)$ where $r$ is the final rank of the approximation. This rank is given by

r=\sum_{k=0}^{p/2}|\mathcal{H}_{k}|\,s_{k}\leq\sum_{k=0}^{p/2}|\mathcal{H}_{k}|\left(\frac{p-2k}{2}\right).

Thus the total runtime complexity for the factorization is $\mathcal{O}(N(r+\mathbf{H}_{p/2,d}\,d+p^{2}s)+p^{4})$. After the factorization, matrix-vector multiplies may be performed in $\mathcal{O}(Nr)$ time.

3.7 Limitations

We have described an additional data-dependent compression step based on matrices formed out of the polynomials within the analytic expansion. We currently make no attempt to perform further compression based on the observed rank of the harmonic components. This is logical when the number of points is high relative to the dimension—matrices formed from the harmonics will have little singular value decay owing to the orthogonality of the harmonic functions. However, if the number of dimensions is large relative to the number of points, then the points are not space filling and the lack of additional compression based on the harmonics will result in poor efficiency, especially as the number of harmonics grows polynomially with the dimension (see Section C for further discussion).

Vandermonde matrices are notoriously ill-conditioned (Pan, 2016), and the accuracy of the $QR$ factorizations in the current presentation of our additional compression step can suffer when $p$ is large unless this is addressed. In our testing, this degradation only arises for very small length-scale parameters and/or very small error tolerances (notably, outside the range seen in our experiments). One step towards preventing potential conditioning issues could be to arrange the expansion so that Legendre polynomials of $\|x\|,\|y\|$ are used instead of powers of $\|x\|,\|y\|$ (in an analogous way to our replacement of powers of cosine with Gegenbauer polynomials of cosine; see Section A for details).

In many kernel matrix approximation algorithms, a non-negligible cost of generating the low-rank matrices is that matrix entries typically cost $\mathcal{O}(d)$ time to generate. For example, this is the case when a kernel evaluation is required between a point and a benchmark point, as in inducing point methods. On the one hand, our factorization benefits from the $\mathcal{O}(d)$ costs being confined to the computation of the harmonics (the Vandermonde-like matrices require only a single dimension-dependent computation of the norms of all datapoints), each of which is reused across the $s_{k}$ corresponding columns of the factors, where $s_{k}$ is the rank in (14). On the other hand, this depends on an efficient implementation of hyperspherical harmonic computation—although we provide a moderately tuned implementation, this is the subject of further work as few good pre-existing routines are available for this procedure.

4 Experiments

We have implemented the Harmonic Decomposition Factorization of Algorithm 1 in Julia as part of an open source toolkit, and have performed synthetic and real-world regression experiments single-threaded on a 2020 Apple MacBook Air with an M1 CPU and 8GB of RAM. All recorded times are averaged across 10 trials.

Figure 1: Time vs. dataset size results for the Cauchy kernel with $\sigma=1$. For this experiment we use an error tolerance of $10^{-3}$ and use normally distributed points scaled to have norm less than 1 as our dataset. For a variety of dimensions our algorithm displays linear scaling with the size of the dataset.

4.1 Synthetic Data

To test the runtime of our algorithm for constructing the factorization, we generate datasets of points drawn from a normal distribution and subsequently scaled so that the maximum norm of a point in a dataset is 1. Using the Cauchy kernel with lengthscale parameter $\sigma=1$, we vary the number of points and dimensions of this dataset and record the time required by the implementation to create the low-rank approximation. The relative error tolerance is fixed at $\varepsilon=0.005$. Figure 1 visualizes the results of this experiment—across a variety of dimensions, our factorization requires linear time in the number of points.

To illustrate how efficiently we can trade off time and accuracy, we compare our factorization with the popular Nyström method where inducing points are chosen uniformly at random (times for the Nyström method include populating the interpolation matrix as well as factoring the inducing point matrix). We use a dataset of 10,000 points in three dimensions generated in the same way as before and using the same Cauchy kernel. This time, the error tolerance is varied and the factorization time and relative errors are recorded. Figure 2 shows that our factorization is consistently cheaper to produce than the Nyström factorization that achieves the same accuracy. This further translates to faster matrix vector products at the same accuracy. This is due largely to the HDF’s lower-rank approximation, which we investigate next.

Figure 2: Time vs. relative error results for the Cauchy kernel with $\sigma=1$ using a 3D dataset of $10^{5}$ points generated the same way as in Figure 1. For both factorization and matrix-vector multiplies, our method displays a better tradeoff than the Nyström method.

To further explore the tradeoff between rank and relative error of our approximation, we conduct experiments where the error tolerance is varied and the rank of the necessary factorization along with the actual final relative error is collected. In this experiment, we use a dataset of 5000 points in five dimensions generated in the same way as above and use the Cauchy kernel ($k(r)=\frac{1}{1+(r/\sigma)^{2}}$), Gaussian kernel ($k(r)=e^{-(r/\sigma)^{2}}$), Matérn kernel with $\nu=1.5$ ($k(r)=(1+\sqrt{3}(r/\sigma))e^{-\sqrt{3}r/\sigma}$), and Matérn kernel with $\nu=2.5$ ($k(r)=(1+\sqrt{5}(r/\sigma)+(5/3)(r/\sigma)^{2})e^{-\sqrt{5}r/\sigma}$). Figure 3 compares results between our method, the Nyström method, and the SVD (which provides the optimal rank/error tradeoff). In all cases our method has greater accuracy than the Nyström method for the same rank. Additionally, our method shows nearly optimal performance for lower rank approximations. This experiment also highlights the generality of our method—most existing analytic methods cannot be applied to this breadth of kernels and are, therefore, omitted from the experiment.

Figure 3: Relative error vs. rank results for a variety of kernels using a 5D dataset of $5000$ points generated the same way as in Figure 1. Our method has a substantially improved tradeoff over the Nyström method, and approaches optimality for lower-rank approximations.

4.2 Kernel Ridge Regression on California Housing Data

We also apply our factorization within the training process of kernel ridge regression and evaluate the predictions. Kernel ridge regression (Shawe-Taylor et al., 2004) is a kernel method which involves finding a weight vector $\mathbf{w}$ that solves the minimization problem

\mathbf{w}\coloneqq\operatorname*{arg\,min}_{\mathbf{w}}\|\mathbf{y}-K(\mathbf{x},\mathbf{x})\mathbf{w}\|^{2}+\lambda\mathbf{w}^{T}K(\mathbf{x},\mathbf{x})\mathbf{w},

where $\mathbf{x}$ are the training points, $\mathbf{y}$ are the training labels, and $\lambda$ is a regularization parameter. The solution to this problem is

\mathbf{w}=(\lambda\mathbf{I}+K(\mathbf{x},\mathbf{x}))^{-1}\mathbf{y}. (16)

Predictions on test data are then performed using $\hat{\mathbf{y}}=K(\mathbf{x}^{*},\mathbf{x})\mathbf{w}$, where $\mathbf{x}^{*}$ are the test points and $\hat{\mathbf{y}}$ are our predictions of the test labels.
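When a single global low-rank approximation $K\approx UV^{T}$ of the training kernel matrix is acceptable, the regularized solve in (16) can be carried out directly with the Woodbury identity; a brief sketch (our illustration; the experiment below instead compresses diagonal blocks and uses conjugate gradients):

```julia
using LinearAlgebra

# Sketch: w = (λI + K)⁻¹ y with K ≈ U*V', via the Woodbury identity
# (λI + UV')⁻¹ y = (y - U (λI + V'U)⁻¹ V'y) / λ.
function krr_weights_lowrank(U::Matrix{Float64}, V::Matrix{Float64},
                             y::Vector{Float64}, λ::Float64)
    small = λ * I + V' * U                      # r × r system
    return (y - U * (small \ (V' * y))) / λ
end
```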

We use the California housing data set from the UCI Machine Learning Repository (Dua & Graff, 2017), which contains data drawn from the 1990 U.S. Census. The labels are the median house values, and the features are all other attributes, leading to a feature space of $d=8$ dimensions for a dataset of $N=20{,}640$ points. We scale features to be in $[0,1]$, and for each of five trials perform a random shuffle of the data followed by a $\frac{2}{3}$–$\frac{1}{3}$ train-test split.

We apply our factorization to the problem by generating low-rank approximations for the diagonal blocks of the kernel matrix in (16). The diagonal blocks are chosen to be associated with 30 clusters $C_{i}$ of points generated by $k$-means clustering (Lloyd, 1982), so that the $i$th block along the diagonal is given by $B_{i}\coloneqq K_{C_{i},C_{i}}$. A host of techniques exist for compressing the off-diagonal blocks (Ying et al., 2004; Ying, 2006; Ambikasaran et al., 2015; Ryan et al., 2021), but we perform no off-diagonal compression for this experiment to simplify the implementation. The associated linear system is then solved using the method of conjugate gradients (Greenbaum, 1997; Shewchuk, 1994), where the matrix-vector multiplies are accelerated by our compression. As a preconditioner we use a block diagonal matrix whose blocks are the inverses of the blocks of the true kernel matrix associated with the clusters.
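For reference, applying such a block-diagonal preconditioner inside the iterative solver amounts to independent small solves per cluster; a minimal sketch (illustrative names, assuming precomputed factorizations of the true diagonal blocks):

```julia
# Sketch: z = M⁻¹ r for a block-diagonal preconditioner M whose i-th block is the
# true kernel block K_{C_i,C_i}. `blockfacts[i]` is a factorization of that block
# (e.g. from cholesky) and `clusters[i]` holds the global indices of cluster C_i.
function apply_block_preconditioner(blockfacts, clusters, r::Vector{Float64})
    z = similar(r)
    for (F, idx) in zip(blockfacts, clusters)
        z[idx] = F \ r[idx]                     # small per-cluster solve
    end
    return z
end
```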

For comparison, we also applied the Nyström method using the same ranks for the diagonal blocks. We record the relative errors of the diagonal block approximations, as well as the final mean squared errors (MSEs) when compared with the ground truth test labels. Figure 4 visualizes our findings. For all kernels, our method tends to find a more accurate approximation than Nyström for the same rank, as also demonstrated for the synthetic data in Figure 3. Further, owing to the more accurate approximations, our method yields MSEs closer to those found using dense operations.

Figure 4: Relative error results for the diagonal block approximations by our method and the Nyström method. The blocks correspond to kernel interactions between points within clusters generated by $k$-means on the California Housing dataset.

5 Conclusion

We have presented a new method for low-rank approximation of kernel matrices that pairs a novel analytic expansion with an efficient data-dependent compression step. The technique works for any isotropic kernel, and is particularly efficient when the dimension of the data is small relative to the size of the dataset. Our method runs in linear time and has a configurable error tolerance along with theoretical guarantees on the size of the error. This work presents results from a suite of experiments demonstrating linear scaling and favorable comparative metrics against the Nyström method, both for synthetic and real-world data. An open-source implementation of our method is included alongside this paper.

This work also leads to many interesting and promising questions for future exploration. One currently unexplored facet is whether an additional compression step may be applied to the harmonics. For our areas of application this has not been necessary, as the assumption of moderate dimension has meant that the matrices of harmonics are not numerically low rank. However, for high $d$ and high desired accuracy, the high orders of harmonics required are too numerous for an efficient low-rank approximation (see Figure 5 in Section C). This problem could potentially be mitigated by a similar data-dependent compression technique as is used in this work, or another dimension-reduction technique.

          Cauchy $\frac{1}{1+(r/\sigma)^{2}}$    Gaussian $e^{-(r/\sigma)^{2}}$
Dense     1.72 [1.62, 1.77]e-2                   1.76 [1.71, 1.80]e-2
HDF       1.71 [1.62, 1.77]e-2                   1.76 [1.70, 1.80]e-2
Nyst.     1.78 [1.71, 2.08]e-2                   1.79 [1.71, 1.83]e-2

          Matérn $\nu=1.5$                       Matérn $\nu=2.5$
Dense     1.58 [1.54, 1.62]e-2                   1.69 [1.66, 1.73]e-2
HDF       1.58 [1.55, 1.62]e-2                   1.69 [1.65, 1.73]e-2
Nyst.     2.52 [1.70, 7.94]e-2                   1.69 [1.68, 1.72]e-2

Table 1: Mean squared error (MSE) of predictions compared to ground truth values in the test split of our regression experiment. Entries are the median, minimum, and maximum MSEs observed. In all cases, the higher accuracy of our approximation resulted in better MSE. For the Matérn kernel with $\nu=1.5$ the Nyström approximation yielded conditioning issues, and the iterative method sometimes failed to converge in 1024 iterations.

Our expansion was formed by transforming to a basis of Gegenbauer polynomials in $\cos{\gamma}$ and then splitting into harmonics. This approach is inspired by the analytic expansions underlying fast multipole methods, but may not be ideal for high dimensional problems if the hyperspherical harmonics are too onerous to compute. For higher dimensional problems it is conceivable that a more manageable basis for the angular functions would be preferred for a fast implementation. A different such basis could also be engineered to allow for efficiency in an additional compression as described in the previous paragraph, to parallel how the Vandermonde structure of (12) allows for efficiency in our additional compression step.

We have provided the theoretical underpinnings for off-diagonal compression based on our scheme and future work will incorporate our routine into a hierarchical domain decomposition framework. It is plausible that an opportunity exists to reuse matrices or at least harmonic computations between on- and off-diagonal block approximations.

References

  • Ambikasaran et al. (2015) Ambikasaran, S., Foreman-Mackey, D., Greengard, L., Hogg, D. W., and O’Neil, M. Fast direct methods for gaussian processes. IEEE transactions on pattern analysis and machine intelligence, 38(2):252–265, 2015.
  • Avery (1989) Avery, J. S. Hyperspherical Harmonics. Springer, 1989. doi: https://doi.org/10.1007/978-94-009-2323-2.
  • Brubeck et al. (2021) Brubeck, P. D., Nakatsukasa, Y., and Trefethen, L. N. Vandermonde with arnoldi. SIAM Rev., 63:405–415, 2021.
  • Chen et al. (2009) Chen, Y., Gupta, M. R., and Recht, B. Learning kernels from indefinite similarities. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pp.  145–152, New York, NY, USA, 2009. Association for Computing Machinery. ISBN 9781605585161. doi: 10.1145/1553374.1553393. URL https://doi.org/10.1145/1553374.1553393.
  • Clenshaw & Curtis (1960) Clenshaw, C. W. and Curtis, A. R. A method for numerical integration on an automatic computer. Numerische Mathematik, 2(1):197–205, 1960. doi: 10.1007/BF01386223. URL https://doi.org/10.1007/BF01386223.
  • Drineas et al. (2005) Drineas, P., Mahoney, M. W., and Cristianini, N. On the nyström method for approximating a gram matrix for improved kernel-based learning. journal of machine learning research, 6(12), 2005.
  • Dua & Graff (2017) Dua, D. and Graff, C. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
  • Elliott et al. (1987) Elliott, D., Paget, D., Phillips, G., and Taylor, P. Error of truncated chebyshev series and other near minimax polynomial approximations. Journal of Approximation Theory, 50(1):49–57, 1987. ISSN 0021-9045. doi: https://doi.org/10.1016/0021-9045(87)90065-7. URL https://www.sciencedirect.com/science/article/pii/0021904587900657.
  • Gittens & Mahoney (2013) Gittens, A. and Mahoney, M. Revisiting the nystrom method for improved large-scale machine learning. In International Conference on Machine Learning, pp. 567–575. PMLR, 2013.
  • Goreinov et al. (1997) Goreinov, S. A., Tyrtyshnikov, E. E., and Zamarashkin, N. L. A theory of pseudoskeleton approximations. Linear algebra and its applications, 261(1-3):1–21, 1997.
  • Greenbaum (1997) Greenbaum, A. Iterative Methods for Solving Linear Systems. Society for Industrial and Applied Mathematics, USA, 1997. ISBN 089871396X.
  • Greengard & Rokhlin (1987) Greengard, L. and Rokhlin, V. A fast algorithm for particle simulations. Journal of computational physics, 73(2):325–348, 1987.
  • Greengard & Strain (1991) Greengard, L. and Strain, J. The fast gauss transform. SIAM Journal on Scientific and Statistical Computing, 12(1):79–94, 1991.
  • Kim et al. (2012) Kim, D. S., Kim, T., and Rim, S.-H. Some identities involving gegenbauer polynomials. Advances in Difference Equations, 2012(1):219, 2012. doi: 10.1186/1687-1847-2012-219. URL https://doi.org/10.1186/1687-1847-2012-219.
  • Lloyd (1982) Lloyd, S. Least squares quantization in pcm. IEEE Transactions on Information Theory, 28(2):129–137, 1982. doi: 10.1109/TIT.1982.1056489.
  • Musco & Musco (2017) Musco, C. and Musco, C. Recursive sampling for the nyström method. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp.  3836–3848, 2017.
  • Nyström (1930) Nyström, E. J. Über die praktische auflösung von integralgleichungen mit anwendungen auf randwertaufgaben. Acta Mathematica, 54:185–204, 1930.
  • Pan (2016) Pan, V. Y. How bad are vandermonde matrices? SIAM Journal on Matrix Analysis and Applications, 37(2):676–694, 2016. doi: 10.1137/15M1030170. URL https://doi.org/10.1137/15M1030170.
  • Phillips & White (1994) Phillips, J. R. and White, J. A precorrected-fft method for capacitance extraction of complicated 3-d structures. In ICCAD, volume 94, pp.  268–271. Citeseer, 1994.
  • Rahimi et al. (2007) Rahimi, A., Recht, B., et al. Random features for large-scale kernel machines. In NIPS, volume 3, pp.  5. Citeseer, 2007.
  • Ryan et al. (2021) Ryan, J. P., Ament, S., Gomes, C. P., and Damle, A. The fast kernel transform. arXiv preprint arXiv:2106.04487, 2021.
  • Scholkopf & Smola (2018) Scholkopf, B. and Smola, A. J. Learning with kernels: support vector machines, regularization, optimization, and beyond. Adaptive Computation and Machine Learning series, 2018.
  • Shawe-Taylor et al. (2004) Shawe-Taylor, J., Cristianini, N., et al. Kernel methods for pattern analysis. Cambridge university press, 2004.
  • Shewchuk (1994) Shewchuk, J. R. An introduction to the conjugate gradient method without the agonizing pain. Technical report, USA, 1994.
  • Si et al. (2017) Si, S., Hsieh, C.-J., and Dhillon, I. S. Memory efficient kernel approximation. Journal of Machine Learning Research, 18(20):1–32, 2017. URL http://jmlr.org/papers/v18/15-025.html.
  • Snelson & Ghahramani (2005) Snelson, E. and Ghahramani, Z. Sparse gaussian processes using pseudo-inputs. Advances in neural information processing systems, 18:1257–1264, 2005.
  • Solin & Särkkä (2020) Solin, A. and Särkkä, S. Hilbert space methods for reduced-rank gaussian process regression. Statistics and Computing, 30(2):419–446, mar 2020. ISSN 0960-3174. doi: 10.1007/s11222-019-09886-w. URL https://doi.org/10.1007/s11222-019-09886-w.
  • Wen & Avery (1985) Wen, Z. and Avery, J. Some properties of hyperspherical harmonics. Journal of Mathematical Physics, 26(3):396–403, 1985. doi: 10.1063/1.526621. URL https://doi.org/10.1063/1.526621.
  • White et al. (1994) White, J., Phillips, J., and Korsmeyer, T. Comparing precorrected-fft and fast multipole algorithms for solving three-dimensional potential integral equations. In Proceedings of the Colorado Conference on Iterative Methods, pp. 4–10. Citeseer, 1994.
  • Williams & Seeger (2001) Williams, C. K. and Seeger, M. Using the nyström method to speed up kernel machines. In Advances in neural information processing systems, pp. 682–688, 2001.
  • Wilson & Nickisch (2015) Wilson, A. and Nickisch, H. Kernel interpolation for scalable structured gaussian processes (kiss-gp). In International Conference on Machine Learning, pp. 1775–1784, 2015.
  • Yang et al. (2003) Yang, C., Duraiswami, R., Gumerov, N. A., and Davis, L. Improved fast gauss transform and efficient kernel density estimation. IEEE, 2003.
  • Yang et al. (2012) Yang, T., Li, Y.-F., Mahdavi, M., Jin, R., and Zhou, Z.-H. Nyström method vs random fourier features: A theoretical and empirical comparison. Advances in neural information processing systems, 25:476–484, 2012.
  • Ying (2006) Ying, L. A kernel independent fast multipole algorithm for radial basis functions. Journal of Computational Physics, 213(2):451–457, 2006.
  • Ying et al. (2004) Ying, L., Biros, G., and Zorin, D. A kernel-independent adaptive fast multipole algorithm in two and three dimensions. Journal of Computational Physics, 196(2):591–626, 2004.

Appendix A Derivation of the Analytic Expansion

Here we present a detailed derivation of the analytic expansion which forms the foundation of our algorithm. Starting from the Chebyshev expansion

k(\|x-y\|)\approx\sum_{i=0}^{p}a_{i}T_{i}(\|x-y\|)

Using the definition of the Chebyshev polynomial

T_{i}(\|x-y\|)=\sum_{j=0}^{i}t_{i,j}\|x-y\|^{j},

we get

k(\|x-y\|)\approx\sum_{i=0}^{p}\sum_{j=0}^{i}a_{i}t_{i,j}(\|x\|^{2}+\|y\|^{2}-2x^{T}y)^{j/2}.

Applying the multinomial theorem yields

k(\|x-y\|)\approx\sum_{i=0}^{p}\sum_{j=0}^{i}\sum_{k_{3}=0}^{j/2}\sum_{k_{2}=0}^{j/2-k_{3}}t_{i,j}\|x\|^{j-2k_{3}-2k_{2}}\|y\|^{2k_{2}}(-2x^{T}y)^{k_{3}}\binom{j/2}{j/2-k_{3}-k_{2},k_{2},k_{3}}a_{i}.

Swapping sums via

\sum_{i=0}^{p}\sum_{j=0}^{i}\sum_{k_{3}=0}^{j/2}\sum_{k_{2}=0}^{j/2-k_{3}}=\sum_{i=0}^{p}\sum_{k_{3}=0}^{i/2}\sum_{j=2k_{3}}^{i}\sum_{k_{2}=0}^{j/2-k_{3}}=\sum_{k_{3}=0}^{p/2}\sum_{i=2k_{3}}^{p}\sum_{j=2k_{3}}^{i}\sum_{k_{2}=0}^{j/2-k_{3}}
=\sum_{k_{3}=0}^{p/2}\sum_{i=2k_{3}}^{p}\sum_{k_{2}=0}^{i/2-k_{3}}\sum_{j=2k_{2}+2k_{3}}^{i}=\sum_{k_{3}=0}^{p/2}\sum_{k_{2}=0}^{p/2-k_{3}}\sum_{i=2k_{2}+2k_{3}}^{p}\sum_{j=2k_{2}+2k_{3}}^{i}

gives

k(\|x-y\|)\approx\sum_{k_{3}=0}^{p/2}(x^{T}y)^{k_{3}}\sum_{k_{2}=0}^{p/2-k_{3}}\|y\|^{2k_{2}}\sum_{i=2k_{2}+2k_{3}}^{p}\sum_{j=2k_{2}+2k_{3}}^{i}t_{i,j}\|x\|^{j-2k_{3}-2k_{2}}(-2)^{k_{3}}\binom{j/2}{j/2-k_{3}-k_{2},k_{2},k_{3}}a_{i}.

Let $k_{1}\coloneqq j/2-k_{3}-k_{2}$ so that $j=2k_{1}+2k_{2}+2k_{3}$; then we have

k(\|x-y\|)\approx\sum_{k_{3}=0}^{p/2}(x^{T}y)^{k_{3}}\sum_{k_{2}=0}^{p/2-k_{3}}\|y\|^{2k_{2}}\sum_{i=2k_{2}+2k_{3}}^{p}\sum_{k_{1}=0}^{i/2-k_{3}-k_{2}}t_{i,2k_{1}+2k_{2}+2k_{3}}\|x\|^{2k_{1}}(-2)^{k_{3}}\binom{k_{1}+k_{2}+k_{3}}{k_{1},k_{2},k_{3}}a_{i}.

Swapping the $i$ and $k_{1}$ sums gives

k(\|x-y\|)\approx\sum_{k_{3}=0}^{p/2}(x^{T}y)^{k_{3}}\sum_{k_{2}=0}^{p/2-k_{3}}\|y\|^{2k_{2}}\sum_{k_{1}=0}^{p/2-k_{3}-k_{2}}\|x\|^{2k_{1}}\sum_{i=2k_{1}+2k_{2}+2k_{3}}^{p}t_{i,2k_{1}+2k_{2}+2k_{3}}(-2)^{k_{3}}\binom{k_{1}+k_{2}+k_{3}}{k_{1},k_{2},k_{3}}a_{i}.

Using $x^{T}y=\|x\|\|y\|\cos{\gamma}$ gives

k(\|x-y\|)\approx\sum_{k_{3}=0}^{p/2}(\cos{\gamma})^{k_{3}}\sum_{k_{2}=0}^{p/2-k_{3}}\|y\|^{2k_{2}+k_{3}}\sum_{k_{1}=0}^{p/2-k_{3}-k_{2}}\|x\|^{2k_{1}+k_{3}}\sum_{i=2k_{1}+2k_{2}+2k_{3}}^{p}\dots

We will use the identity

\cos^{i}{\gamma}=\sum_{k=0}^{i}A_{ki}C_{k}^{\alpha}(\cos{\gamma})

where $\alpha=d/2-1$, $C_{k}^{\alpha}(\cos{\gamma})$ is the Gegenbauer polynomial, $A_{ki}=0$ when $k\not\equiv i\pmod{2}$, and

A_{ki}=\frac{i!\,(\alpha+k)}{2^{i}\left(\frac{i-k}{2}\right)!\,(\alpha)_{\frac{i+k}{2}+1}}

when $k\equiv i\pmod{2}$. The notation $(\alpha)_{\frac{i+k}{2}+1}$ refers to the rising factorial, i.e. $(\alpha)_{n}=\alpha(\alpha+1)\cdots(\alpha+n-1)$. A short numerical check of this identity is sketched below.
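The following throwaway Julia snippet (the helper names are ours, not part of the released toolkit) verifies the identity above numerically for one choice of $i$, $\alpha$, and argument, using the three-term Gegenbauer recurrence:

```julia
# Numeric check (illustrative) of cos^i(γ) = Σ_k A_{k,i} C_k^α(cos γ),
# with A_{k,i} as defined above (including the ((i-k)/2)! factor).
pochhammer(a, n) = prod(a .+ (0:n - 1))        # rising factorial (a)_n, n ≥ 1

function gegenbauer_val(k, a, x)               # C_k^a(x) via the recurrence
    k == 0 && return 1.0
    Cm2, Cm1 = 1.0, 2a * x
    for j in 2:k
        Cm2, Cm1 = Cm1, (2x * (j + a - 1) * Cm1 - (j + 2a - 2) * Cm2) / j
    end
    return Cm1
end

A(k, i, a) = factorial(i) * (a + k) /
             (2.0^i * factorial(div(i - k, 2)) * pochhammer(a, div(i + k, 2) + 1))

i, a, x = 6, 1.5, 0.3                          # e.g. d = 5, so α = 3/2
lhs = x^i                                      # x plays the role of cos(γ)
rhs = sum(A(k, i, a) * gegenbauer_val(k, a, x) for k in i:-2:0)
@assert isapprox(lhs, rhs; rtol = 1e-10)
```

Using this identity to expand the $(\cos{\gamma})^{k_{3}}$ term yields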

k(\|x-y\|)\approx\sum_{k_{3}=0}^{p/2}\sum_{j=0}^{k_{3}}A_{k_{3}j}C_{j}(\cos{\gamma})\sum_{k_{2}=0}^{p/2-k_{3}}\|y\|^{2k_{2}+k_{3}}\sum_{k_{1}=0}^{p/2-k_{3}-k_{2}}\|x\|^{2k_{1}+k_{3}}\sum_{i=2k_{1}+2k_{2}+2k_{3}}^{p}\dots

Swapping the $k_{3}$ and $j$ sums yields

k(\|x-y\|)\approx\sum_{j=0}^{p/2}C_{j}(\cos{\gamma})\sum_{k_{3}=j}^{p/2}\sum_{k_{2}=0}^{p/2-k_{3}}A_{k_{3}j}\|y\|^{2k_{2}+k_{3}}\sum_{k_{1}=0}^{p/2-k_{3}-k_{2}}\|x\|^{2k_{1}+k_{3}}\sum_{i=2k_{1}+2k_{2}+2k_{3}}^{p}\dots

Let $m=2k_{2}+k_{3}$ so that $k_{2}=(m-k_{3})/2$:

k(\|x-y\|)\approx\sum_{j=0}^{p/2}C_{j}(\cos{\gamma})\sum_{k_{3}=j}^{p/2}\sum_{m=k_{3}}^{p-k_{3}}A_{k_{3}j}\|y\|^{m}\sum_{k_{1}=0}^{p/2-k_{3}/2-m/2}\|x\|^{2k_{1}+k_{3}}\sum_{i=2k_{1}+m+k_{3}}^{p}\dots

Swapping the $k_{3}$ and $m$ sums gives

k(\|x-y\|)\approx\sum_{j=0}^{p/2}C_{j}(\cos{\gamma})\sum_{m=j}^{p-j}\|y\|^{m}\sum_{k_{3}=j}^{\min(p/2,m,p-m)}\sum_{k_{1}=0}^{p/2-k_{3}/2-m/2}A_{k_{3}j}\|x\|^{2k_{1}+k_{3}}\sum_{i=2k_{1}+m+k_{3}}^{p}\dots

Let $n=2k_{1}+k_{3}$ so that $k_{1}=(n-k_{3})/2$:

k(\|x-y\|)\approx\sum_{j=0}^{p/2}C_{j}(\cos{\gamma})\sum_{m=j}^{p-j}\|y\|^{m}\sum_{k_{3}=j}^{\min(p/2,m,p-m)}\sum_{n=k_{3}}^{p-m}A_{k_{3}j}\|x\|^{n}\sum_{i=n+m}^{p}\dots

Swapping the $k_{3}$ and $n$ sums yields

k(\|x-y\|)\approx\sum_{j=0}^{p/2}C_{j}(\cos{\gamma})\sum_{m=j}^{p-j}\|y\|^{m}\sum_{n=j}^{p-m}\|x\|^{n}\sum_{k_{3}=j}^{\min(p/2,m,p-m,n)}A_{k_{3}j}\sum_{i=n+m}^{p}\dots
k(\|x-y\|)\approx\sum_{j=0}^{p/2}C_{j}(\cos{\gamma})\sum_{m=j}^{p-j}\|y\|^{m}\sum_{n=j}^{p-m}\|x\|^{n}\,\mathcal{T}^{\prime}_{j,m,n}

where

\mathcal{T}^{\prime}_{j,m,n}=\sum_{k_{3}=j}^{\min(p/2,m,p-m,n)}A_{k_{3}j}\sum_{i=n+m}^{p}t_{i,n+m}(-2)^{k_{3}}\binom{(n+m)/2}{(n-k_{3})/2,(m-k_{3})/2,k_{3}}a_{i}

The final form of the expansion follows from applying the hyperspherical harmonic addition theorem.

Appendix B Number of Harmonics of Order at Most $p/2$

Here we show that

\mathbf{H}_{p/2,d}\coloneqq\sum_{k=0}^{p/2}|\mathcal{H}_{k}|=\binom{\frac{p}{2}+d-1}{d-1}+\binom{\frac{p}{2}+d-2}{d-1}.

Note that (Wen & Avery, 1985) gives

|\mathcal{H}_{k}|=\binom{k+d-1}{k}-\binom{k+d-3}{k-2}.

Thus the LHS forms a telescoping sum and we are left with

\binom{\frac{p}{2}+d-1}{\frac{p}{2}}+\binom{\frac{p}{2}+d-2}{\frac{p}{2}-1}=\binom{\frac{p}{2}+d-1}{d-1}+\binom{\frac{p}{2}+d-2}{d-1}.
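A quick numerical check of this closed form (a throwaway Julia script of ours, not part of the toolkit):

```julia
# Verify that Σ_{k=0}^{p/2} |H_k| matches the telescoped closed form above.
num_harmonics(k, d) = binomial(k + d - 1, k) - (k >= 2 ? binomial(k + d - 3, k - 2) : 0)

p, d = 10, 5
lhs = sum(num_harmonics(k, d) for k in 0:div(p, 2))
rhs = binomial(div(p, 2) + d - 1, d - 1) + binomial(div(p, 2) + d - 2, d - 1)
@assert lhs == rhs
```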

Appendix C Compression of Matrices of Harmonics

The algorithm we presented includes an additional data-dependent compression step which involves finding a low-rank approximation to the matrices $\mathcal{R}^{(k)}$ as in (14). We could, in an analogous fashion, attempt to find a data-dependent compression of the matrix of harmonics defined by

M^{(k)}_{ij}\coloneqq\sum_{h\in\mathcal{H}_{k}}\Upsilon_{k}^{h}(x_{i})\Upsilon_{k}^{h}(x_{j}).

For problems where the dimension is large relative to the size of the dataset, this matrix can have fast-decaying singular values, as seen in Figure 5, and hence it would be useful to perform additional compression in those cases. Future research will search for efficient ways to perform this compression when it is called for, or prevent the need for it with some suitable data transformation.

Figure 5: The decay of the singular values of the matrices formed out of the harmonics in the expansion. Specifically, the matrices are defined by $M^{(k)}_{ij}\coloneqq\sum_{h\in\mathcal{H}_{k}}\Upsilon_{k}^{h}(x_{i})\Upsilon_{k}^{h}(x_{j})$ for $1\leq i,j\leq N$. This experiment used $N=500$ normally distributed points. For small $d$, there is no real need to compress the harmonic components as the orthogonality of the functions leads to matrices with little singular value decay. However, for high $d$ the number of harmonics grows substantially, and the decay of the singular values of the associated matrices suggests that an additional compression would need to be performed on the harmonic components to remain efficient.