
All Random Features Representations are Equivalent

Luke Sernau
Google DeepMind
[email protected]
Silvano Bonacina
Google DeepMind
[email protected]
Rif A. Saurous
Google Research
[email protected]
Abstract

Random features are a powerful technique for rewriting positive-definite kernels as linear products. They bring linear tools to bear in important nonlinear domains like KNNs and attention. Unfortunately, practical implementations require approximating an expectation, usually via sampling. This has led to the development of increasingly elaborate representations with ever lower sample error.

We resolve this arms race by deriving an optimal sampling policy. Under this policy all random features representations have the same approximation error, which we show is the lowest possible. This means that we are free to choose whatever representation we please, provided we sample optimally.

1 Introduction

Kernel methods have a long and illustrious history in machine learning [1]. Nowadays one of their most widespread applications is the softmax kernel inside the attention layers of a transformer. Unfortunately, kernel methods scale linearly in the size of the dataset, which is impractical for very large datasets.

In their seminal paper, Rahimi and Recht [2] show that this linear cost can be avoided by replacing the kernel with a randomized approximation that is correct in expectation. They do this by means of a random feature representation.

Definition 1.

Given a positive-definite kernel $K:X\times X\rightarrow\mathbb{R}$, a random feature representation of $K$ is a symmetric function $\phi:X\times X\rightarrow\mathbb{R}$ together with a distribution $\Omega$ (most commonly the normal distribution) such that

K\left(x_{1},x_{2}\right)=\mathop{\mathbb{E}}_{\omega\sim\Omega}\left[\phi\left(x_{1},\omega\right)\phi\left(x_{2},\omega\right)\right]. (1)
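
For concreteness, the classical example from Rahimi and Recht [2] is the Gaussian kernel paired with random Fourier features, where $\omega$ bundles a frequency $w\sim N(0,I)$ with a uniform phase $b$ and $\phi(x,\omega)=\sqrt{2}\cos(w^{\top}x+b)$. The following minimal Python/NumPy sketch (our own illustration; none of these specific choices are prescribed by the definition) checks equation (1) by Monte Carlo:

import numpy as np

rng = np.random.default_rng(0)
d = 3
x1, x2 = rng.normal(size=d), rng.normal(size=d)

# Gaussian kernel K(x1, x2) = exp(-||x1 - x2||^2 / 2).
K_exact = np.exp(-0.5 * np.sum((x1 - x2) ** 2))

# Random Fourier features: omega = (w, b) with w ~ N(0, I), b ~ U(0, 2*pi),
# and phi(x, omega) = sqrt(2) * cos(w @ x + b).
m = 200_000
w = rng.normal(size=(m, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=m)
phi1 = np.sqrt(2.0) * np.cos(w @ x1 + b)
phi2 = np.sqrt(2.0) * np.cos(w @ x2 + b)

# The sample mean of phi(x1, omega) * phi(x2, omega) approaches K(x1, x2).
print(K_exact, np.mean(phi1 * phi2))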

To see why this is useful, consider a simple kernel estimator over some dataset $\{(x_{1},y_{1}),(x_{2},y_{2}),\dots,(x_{n},y_{n})\}$.

KE\left(x\right)=\sum^{n}_{i=1}K\left(x,x_{i}\right)y_{i}.

As written, this kernel estimator requires linear time to evaluate, since for every new $x$ it's necessary to evaluate the kernel against every element in the dataset. But if the kernel admits a random features representation, we may write

KE\left(x\right)=\sum^{n}_{i=1}\mathop{\mathbb{E}}_{\omega\sim\Omega}\left[\phi(x,\omega)\phi(x_{i},\omega)\right]y_{i}=\mathop{\mathbb{E}}_{\omega\sim\Omega}\left[\phi(x,\omega)\sum^{n}_{i=1}\phi(x_{i},\omega)y_{i}\right].

This expectation can be approximated via sampling,

KE\left(x\right)\approx\frac{1}{k}\sum^{k}_{j=1}\phi(x,\omega_{j})\sum^{n}_{i=1}\phi(x_{i},\omega_{j})y_{i},

where $\omega_{1},\omega_{2},\dots,\omega_{k}$ are samples drawn from $\Omega$. As long as $k$ is less than $n$, we can save time by precomputing the sum on the right. The principal challenge is finding an estimator that keeps the error introduced by sampling small.
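
To make the savings concrete, here is a small sketch (again our own illustration, reusing the random Fourier features from the previous example) comparing the naive estimator, which touches all $n$ points per query, with the random features version, which precomputes the inner sum once and then answers each query in $O(k)$ time:

import numpy as np

rng = np.random.default_rng(0)
n, d, k = 5000, 3, 2048

# Toy dataset {(x_i, y_i)} and a query point x.
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
x = rng.normal(size=d)

# Naive kernel estimator: evaluates the Gaussian kernel against all n points.
ke_naive = np.exp(-0.5 * np.sum((X - x) ** 2, axis=1)) @ y

# Random features version: draw omega_1, ..., omega_k once.
w = rng.normal(size=(k, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=k)
phi = lambda A: np.sqrt(2.0) * np.cos(A @ w.T + b)  # maps rows of A to k features

# Precompute z_j = sum_i phi(x_i, omega_j) * y_i once, at O(n k) cost.
z = phi(X).T @ y

# Each new query now costs O(k) instead of O(n).
ke_fast = phi(x[None, :])[0] @ z / k

# The two agree up to Monte Carlo error, which shrinks like 1 / sqrt(k).
print(ke_naive, ke_fast)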

Choromanski et al. [3] were the first to apply this idea to attention. They did so by finding a choice of $\phi$ that was numerically stable and had low sample variance. This kicked off a flurry of research [4, 5, 6] into better choices of $\phi$ with ever lower sample variance.

We provide a potential conclusion to this competition. Rather than optimizing $\phi$, we simply choose our sampling strategy carefully. By approximating the expectation using an optimal importance sampling strategy, we are able to achieve a sample error that is as low as possible.

Remarkably, this sample variance does not depend on $\phi$, meaning that when sampled optimally, all random features schemes have the same variance.

Ours is not the first scheme to consider importance sampling, but in contrast to others that sample based on kernel polarization heuristics [7] or quadrature rules [8], we directly solve the variance minimization problem. To our knowledge, we are the first to obtain global optima that are independent of the choice of $\phi$.

2 Importance sampling

Importance sampling [9] is a well-known technique for reducing variance when estimating an expectation via sampling. It's based on the observation that some samples have a larger effect on the expectation than others. Instead of naively sampling from the given distribution, we sample from some new distribution $\Psi$, rescaling the samples such that the correct final estimate is preserved.

In our case, we wish to estimate (1) via sampling. Let $\Psi$ be some new distribution with the same support as $\Omega$, and suppose $\Psi$ and $\Omega$ have densities $p_{\Psi}$ and $p_{\Omega}$, respectively. Observe that

K\left(x_{1},x_{2}\right)=\mathop{\mathbb{E}}_{\omega\sim\Omega}\left[\phi\left(x_{1},\omega\right)\phi\left(x_{2},\omega\right)\right]=\mathop{\mathbb{E}}_{\omega\sim\Psi}\left[\frac{p_{\Omega}\left(\omega\right)}{p_{\Psi}\left(\omega\right)}\phi\left(x_{1},\omega\right)\phi\left(x_{2},\omega\right)\right]. (2)

In other words, we are free to sample from $\Psi$ without changing the value of the expectation. On the other hand, the sample variance, given by

\mathop{\text{Var}}_{\omega\sim\Psi}\left[\frac{p_{\Omega}\left(\omega\right)}{p_{\Psi}\left(\omega\right)}\phi\left(x_{1},\omega\right)\phi\left(x_{2},\omega\right)\right],

does depend on $\Psi$. Notice that the mean squared error of our sampling procedure is precisely this variance. All we need to do to minimize the per-sample error is find a choice of $\Psi$ to make this quantity small.
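
As a minimal illustration of (2), with every concrete choice ours rather than the paper's: take the one-dimensional Gaussian kernel with random Fourier features, $\Omega=N(0,1)$ over the frequency (the phase keeps its uniform distribution), and an arbitrary heavier-tailed proposal $\Psi=N(0,2^{2})$. Reweighting by $p_{\Omega}/p_{\Psi}$ leaves the estimate unbiased, while the per-sample variance changes:

import numpy as np

rng = np.random.default_rng(0)
x1, x2 = 0.3, -1.1
K_exact = np.exp(-0.5 * (x1 - x2) ** 2)

m = 500_000
b = rng.uniform(0.0, 2.0 * np.pi, size=m)  # the phase keeps its uniform law

def estimate(w, weights):
    terms = weights * 2.0 * np.cos(w * x1 + b) * np.cos(w * x2 + b)
    return terms.mean(), terms.var()

# Plain sampling from Omega = N(0, 1): all weights are 1.
mean_plain, var_plain = estimate(rng.normal(size=m), np.ones(m))

# Importance sampling from Psi = N(0, 2^2), reweighting each term by
# p_Omega(w) / p_Psi(w) = 2 * exp(-w^2 / 2 + w^2 / 8).
w_psi = rng.normal(scale=2.0, size=m)
mean_is, var_is = estimate(w_psi, 2.0 * np.exp(-0.5 * w_psi**2 + 0.125 * w_psi**2))

# Both means approximate K(x1, x2); only the per-sample variance differs.
print(K_exact, mean_plain, mean_is)
print(var_plain, var_is)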

3 Optimal sampling

Choosing an optimal $\Psi$ is an intrinsically data-driven question, as the variance depends on the input. We'd like to minimize the expected sample variance over all inputs, as captured in the following definition.

Definition 2.

Suppose $x_{1}$ and $x_{2}$ are sampled from $\mathcal{X}_{1}$ and $\mathcal{X}_{2}$, respectively. If we approximate the expression in (2) by sampling from $\Psi$, the expected sample variance $\mathcal{V}_{\Psi}$ over all $x_{1}$ and $x_{2}$ is given by

\mathcal{V}_{\Psi}=\mathop{\mathbb{E}}_{x_{1}\sim\mathcal{X}_{1}}\mathop{\mathbb{E}}_{x_{2}\sim\mathcal{X}_{2}}\mathop{\text{Var}}_{\omega\sim\Psi}\left[\frac{p_{\Omega}\left(\omega\right)}{p_{\Psi}\left(\omega\right)}\phi\left(x_{1},\omega\right)\phi\left(x_{2},\omega\right)\right].
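
The quantity $\mathcal{V}_{\Psi}$ can be estimated by nested Monte Carlo. The sketch below fixes an illustrative setup of our own choosing (random Fourier features of the one-dimensional Gaussian kernel, $\Omega=N(0,1)$ over the frequency, $\mathcal{X}_{1}=\mathcal{X}_{2}=N(0,1)$, and Gaussian proposals $\Psi=N(0,\sigma^{2})$) and shows how the expected sample variance changes with the proposal:

import numpy as np

rng = np.random.default_rng(0)

def expected_sample_variance(sigma, n_pairs=2000, m=4000):
    """Monte Carlo estimate of V_Psi for the Gaussian proposal Psi = N(0, sigma^2)."""
    total = 0.0
    for _ in range(n_pairs):
        x1, x2 = rng.normal(), rng.normal()        # x1 ~ X1, x2 ~ X2
        w = rng.normal(scale=sigma, size=m)        # frequency drawn from Psi
        b = rng.uniform(0.0, 2.0 * np.pi, size=m)  # phase keeps its own law
        # Density ratio p_Omega(w) / p_Psi(w) for N(0, 1) over N(0, sigma^2).
        ratio = sigma * np.exp(-0.5 * w**2 + 0.5 * (w / sigma) ** 2)
        terms = ratio * 2.0 * np.cos(w * x1 + b) * np.cos(w * x2 + b)
        total += terms.var()
    return total / n_pairs

# Different proposals lead to different expected sample variances.
for sigma in (0.5, 1.0, 1.5, 2.0):
    print(sigma, expected_sample_variance(sigma))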

The $\Psi$ that minimizes $\mathcal{V}_{\Psi}$ can be found analytically.

Theorem 1.

Let $K$ be a positive-definite kernel such that

K\left(x_{1},x_{2}\right)=\mathop{\mathbb{E}}_{\omega\sim\Psi}\left[\frac{p_{\Omega}\left(\omega\right)}{p_{\Psi}\left(\omega\right)}\phi\left(x_{1},\omega\right)\phi\left(x_{2},\omega\right)\right]

for some $\phi$ and $\Omega$, and every $\Psi$ with the same support as $\Omega$. Suppose $x_{1}$ and $x_{2}$ are sampled from $\mathcal{X}_{1}$ and $\mathcal{X}_{2}$, respectively. Then the expected sample variance $\mathcal{V}_{\Psi}$ over all $x_{1}$ and $x_{2}$ will be minimized when

p_{\Psi}(\omega)\propto p_{\Omega}(\omega)q_{\phi}\left(\omega\right),

where

q_{\phi}\left(\omega\right)=\sqrt{\mathop{\mathbb{E}}_{x_{1}\sim\mathcal{X}_{1}}\left[\phi\left(x_{1},\omega\right)^{2}\right]\mathop{\mathbb{E}}_{x_{2}\sim\mathcal{X}_{2}}\left[\phi\left(x_{2},\omega\right)^{2}\right]}.

The resulting optimal variance will be

\mathcal{\hat{V}}_{\Psi}=\left(\mathop{\mathbb{E}}_{\omega\sim\Omega}q_{\phi}\left(\omega\right)\right)^{2}-\mathop{\mathbb{E}}_{x_{1}\sim\mathcal{X}_{1}}\mathop{\mathbb{E}}_{x_{2}\sim\mathcal{X}_{2}}K\left(x_{1},x_{2}\right)^{2}.
Proof.

Recall that variance can be written as

\mathop{\text{Var}}_{\omega\sim\Psi}\left[\frac{p_{\Omega}\left(\omega\right)}{p_{\Psi}\left(\omega\right)}\phi\left(x_{1},\omega\right)\phi\left(x_{2},\omega\right)\right]=\mathop{\mathbb{E}}_{\omega\sim\Psi}\left[\frac{p_{\Omega}\left(\omega\right)^{2}}{p_{\Psi}\left(\omega\right)^{2}}\phi\left(x_{1},\omega\right)^{2}\phi\left(x_{2},\omega\right)^{2}\right]-\mathop{\mathbb{E}}_{\omega\sim\Psi}\left[\frac{p_{\Omega}\left(\omega\right)}{p_{\Psi}\left(\omega\right)}\phi\left(x_{1},\omega\right)\phi\left(x_{2},\omega\right)\right]^{2}.

The second term is equal to $K\left(x_{1},x_{2}\right)^{2}$, meaning it does not depend on $\Psi$. We turn our attention to the first term. Its expected value over $x_{1}$ and $x_{2}$ is

\mathop{\mathbb{E}}_{x_{1}\sim\mathcal{X}_{1}}\mathop{\mathbb{E}}_{x_{2}\sim\mathcal{X}_{2}}\mathop{\mathbb{E}}_{\omega\sim\Psi}\left[\frac{p_{\Omega}\left(\omega\right)^{2}}{p_{\Psi}\left(\omega\right)^{2}}\phi\left(x_{1},\omega\right)^{2}\phi\left(x_{2},\omega\right)^{2}\right]
=\mathop{\mathbb{E}}_{x_{1}\sim\mathcal{X}_{1}}\mathop{\mathbb{E}}_{x_{2}\sim\mathcal{X}_{2}}\int_{X}\frac{p_{\Omega}\left(\omega\right)^{2}}{p_{\Psi}\left(\omega\right)}\phi\left(x_{1},\omega\right)^{2}\phi\left(x_{2},\omega\right)^{2}d\omega
=\int_{X}\frac{p_{\Omega}\left(\omega\right)^{2}}{p_{\Psi}\left(\omega\right)}q_{\phi}\left(\omega\right)^{2}d\omega. (3)

We wish to find the $p_{\Psi}$ that minimizes this expression, subject to the constraint that $\int_{X}p_{\Psi}\left(\omega\right)d\omega=1$. We'll solve this using Lagrange optimization. The Lagrangian is

\mathcal{L}=\frac{p_{\Omega}\left(\omega\right)^{2}}{p_{\Psi}\left(\omega\right)}q_{\phi}\left(\omega\right)^{2}+\lambda p_{\Psi}\left(\omega\right),

where $\lambda$ is our Lagrange parameter. Differentiating with respect to $p_{\Psi}\left(\omega\right)$,

\frac{\partial\mathcal{L}}{\partial p_{\Psi}}=-\frac{p_{\Omega}\left(\omega\right)^{2}}{p_{\Psi}\left(\omega\right)^{2}}q_{\phi}\left(\omega\right)^{2}+\lambda.

Setting this equal to zero and solving, we find that

p_{\Psi}(\omega)=\frac{1}{\sqrt{\lambda}}p_{\Omega}(\omega)q_{\phi}\left(\omega\right)

as desired. The constraint that $\int_{X}p_{\Psi}\left(\omega\right)d\omega=1$ forces $\sqrt{\lambda}=\int_{X}p_{\Omega}\left(\omega\right)q_{\phi}\left(\omega\right)d\omega$. By plugging this into (3), we obtain the desired variance. ∎
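
In practice, the optimal proposal can be constructed directly from data. The sketch below is our own one-dimensional illustration of the recipe, not a construction from the paper: it writes the Gaussian kernel with positive features $\phi(x,w)=e^{wx-x^{2}}$ (a choice in the spirit of [3]), estimates $q_{\phi}$ from samples of $\mathcal{X}_{1}=\mathcal{X}_{2}$, discretizes $p_{\Omega}\cdot q_{\phi}$ on a grid so it can be normalized and sampled (in higher dimensions one would use MCMC or a parametric fit instead), and reweights each sample by $p_{\Omega}/p_{\Psi}$, which simplifies to $Z/q_{\phi}(w)$ with $Z=\mathop{\mathbb{E}}_{\omega\sim\Omega}q_{\phi}(\omega)$.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative 1-D setup: the Gaussian kernel K(x, y) = exp(-(x - y)^2 / 2)
# written with positive features phi(x, w) = exp(w * x - x^2), w ~ N(0, 1),
# and data distribution X1 = X2 = N(0, 1).
xs = rng.normal(size=2000)  # data samples used to estimate q_phi

def phi(x, w):
    return np.exp(np.multiply.outer(w, x) - x**2)

def q_phi(w):
    # q_phi(w) = E_x[phi(x, w)^2], since X1 = X2 (estimated from the data).
    return np.mean(phi(xs, w) ** 2, axis=-1)

# Theorem 1: p_Psi*(w) is proportional to p_Omega(w) * q_phi(w).
grid, dw = np.linspace(-10.0, 10.0, 2001, retstep=True)
p_omega = np.exp(-0.5 * grid**2) / np.sqrt(2.0 * np.pi)
q_grid = q_phi(grid)
Z = np.sum(p_omega * q_grid) * dw  # Z = E_Omega[q_phi]
probs = p_omega * q_grid
probs /= probs.sum()

def K_estimate(x1, x2, k=20_000):
    # Draw omega from the (discretized) optimal proposal and reweight by
    # p_Omega / p_Psi*, which simplifies to Z / q_phi(omega).
    idx = rng.choice(len(grid), size=k, p=probs)
    w = grid[idx]
    return np.mean((Z / q_grid[idx]) * phi(x1, w) * phi(x2, w))

x1, x2 = 0.4, -0.9
print(np.exp(-0.5 * (x1 - x2) ** 2), K_estimate(x1, x2))

For this particular setup one can check analytically that $p_{\Omega}(w)\,q_{\phi}(w)\propto e^{-w^{2}/10}$, so the optimal proposal is $N(0,5)$, noticeably wider than $\Omega$.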

This result tells us how to minimize the variance for a particular choice of $\phi$. What is less clear in this form is that it also tells us something about optimizing over every $\phi$. We will explore this in the next section.

4 The choice of feature representation does not matter

As written, Theorem 1 appears to be a statement about how to get the most out of any particular choice of $\phi$. While this is useful, the real insight is what it can tell us about every possible $\phi$. We are ready now to begin exploring our main point: the optimal sample variance is often entirely independent of $\phi$.

Our first step in this direction is a tight upper bound on the sampling variance, and therefore the error, that can be stated without reference to $\phi$.

Corollary 1.

The optimal sample variance achieved in Theorem 1 is upper bounded by

\mathcal{\hat{V}}_{\Psi}\leq\mathop{\mathbb{E}}_{x_{1}\sim\mathcal{X}_{1}}\mathop{\mathbb{E}}_{x_{2}\sim\mathcal{X}_{2}}\left[K\left(x_{1},x_{1}\right)K\left(x_{2},x_{2}\right)-K\left(x_{1},x_{2}\right)^{2}\right].

This follows from Theorem 1 by applying the Cauchy-Schwarz inequality to the first term in $\mathcal{\hat{V}}_{\Psi}$. Specifically,

\left(\mathop{\mathbb{E}}_{\omega\sim\Omega}q_{\phi}\left(\omega\right)\right)^{2}=\left(\mathop{\mathbb{E}}_{\omega\sim\Omega}\left[\sqrt{\mathop{\mathbb{E}}_{x_{1}\sim\mathcal{X}_{1}}\left[\phi\left(x_{1},\omega\right)^{2}\right]}\cdot\sqrt{\mathop{\mathbb{E}}_{x_{2}\sim\mathcal{X}_{2}}\left[\phi\left(x_{2},\omega\right)^{2}\right]}\right]\right)^{2}
\leq\left(\mathop{\mathbb{E}}_{\omega\sim\Omega}\mathop{\mathbb{E}}_{x_{1}\sim\mathcal{X}_{1}}\left[\phi\left(x_{1},\omega\right)^{2}\right]\right)\left(\mathop{\mathbb{E}}_{\omega\sim\Omega}\mathop{\mathbb{E}}_{x_{2}\sim\mathcal{X}_{2}}\left[\phi\left(x_{2},\omega\right)^{2}\right]\right)
=\mathop{\mathbb{E}}_{x_{1}\sim\mathcal{X}_{1}}\mathop{\mathbb{E}}_{x_{2}\sim\mathcal{X}_{2}}K\left(x_{1},x_{1}\right)K\left(x_{2},x_{2}\right).

Written this way, our bound is itself suggestively similar to the Cauchy-Schwarz inequality. Indeed, since positive-definite kernels obey the Cauchy-Schwarz inequality, we know that this expression is nonnegative, as expected. More interestingly, since this expression does not depend on $\phi$, this bound is the same regardless of which representation we're using.

In the (relatively common) case where $\mathcal{X}_{1}=\mathcal{X}_{2}$, we can go even further.

Corollary 2.

If $\mathcal{X}_{1}=\mathcal{X}_{2}$ then the bound in Corollary 1 holds with equality.

This can be seen from Theorem 1, since in this case $q_{\phi}\left(\omega\right)=\mathop{\mathbb{E}}_{x_{1}\sim\mathcal{X}_{1}}\left[\phi\left(x_{1},\omega\right)^{2}\right]$. By exchanging the expectations, it follows that $\mathop{\mathbb{E}}_{\omega\sim\Omega}q_{\phi}\left(\omega\right)=\mathop{\mathbb{E}}_{x_{1}\sim\mathcal{X}_{1}}\left[K\left(x_{1},x_{1}\right)\right]$. Since $x_{1}$ and $x_{2}$ are independent, squaring this expectation gives $\mathop{\mathbb{E}}_{x_{1}\sim\mathcal{X}_{1}}\mathop{\mathbb{E}}_{x_{2}\sim\mathcal{X}_{2}}\left[K\left(x_{1},x_{1}\right)K\left(x_{2},x_{2}\right)\right]$, so the optimal variance from Theorem 1 coincides with the bound in Corollary 1.

The significance of this result is that we are able to write down the exact sample variance without any reference to $\phi$. This is not just a statement about the error of a particular representation; it is a statement of equality that holds across all representations. It does not matter which $\phi$ you choose. Every random features representation has the same sample error as every other.
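
As a concrete instance (a worked example of our own, using the Gaussian kernel for illustration), suppose $K(x,x)=1$ for every $x$ and $\mathcal{X}_{1}=\mathcal{X}_{2}=\mathcal{X}$. Corollary 2 then gives

\mathcal{\hat{V}}_{\Psi}=\left(\mathop{\mathbb{E}}_{x\sim\mathcal{X}}K\left(x,x\right)\right)^{2}-\mathop{\mathbb{E}}_{x_{1}\sim\mathcal{X}}\mathop{\mathbb{E}}_{x_{2}\sim\mathcal{X}}K\left(x_{1},x_{2}\right)^{2}=1-\mathop{\mathbb{E}}_{x_{1}\sim\mathcal{X}}\mathop{\mathbb{E}}_{x_{2}\sim\mathcal{X}}\left[e^{-\left\|x_{1}-x_{2}\right\|^{2}}\right],

a number determined entirely by the kernel and the data distribution. Trigonometric features, positive features, or any other valid $\phi$ attain exactly this error under optimal sampling.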

References

[1] Colin Campbell. Kernel methods: a survey of current techniques. Neurocomputing, 48(1):63–84, 2002. ISSN 0925-2312. doi: 10.1016/S0925-2312(01)00643-9. URL https://www.sciencedirect.com/science/article/pii/S0925231201006439.
[2] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc., 2007. URL https://proceedings.neurips.cc/paper_files/paper/2007/file/013a006f03dbc5392effeb8f18fda755-Paper.pdf.
[3] Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, David Benjamin Belanger, Lucy J Colwell, and Adrian Weller. Rethinking attention with performers. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=Ua6zuk0WRH.
[4] Valerii Likhosherstov, Krzysztof Choromanski, Kumar Avinava Dubey, Frederick Liu, Tamás Sarlós, and Adrian Weller. Chefs' random tables: Non-trigonometric random features. In Advances in Neural Information Processing Systems, 2022. URL https://api.semanticscholar.org/CorpusID:249210161.
[5] Valerii Likhosherstov, Krzysztof Choromanski, Avinava Dubey, Frederick Liu, Tamas Sarlos, and Adrian Weller. FAVOR#: Sharp attention kernel approximations via new classes of positive random features. arXiv preprint arXiv:2302.00787, 2023.
[6] Isaac Reid, Krzysztof Marcin Choromanski, Valerii Likhosherstov, and Adrian Weller. Simplex random features. In International Conference on Machine Learning, pages 28864–28888. PMLR, 2023.
[7] Shahin Shahrampour, Ahmad Beirami, and Vahid Tarokh. On data-dependent random features for improved generalization in supervised learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[8] Francis Bach. On the equivalence between kernel quadrature rules and random feature expansions. Journal of Machine Learning Research, 18(21):1–38, 2017.
[9] Surya T Tokdar and Robert E Kass. Importance sampling: a review. Wiley Interdisciplinary Reviews: Computational Statistics, 2(1):54–60, 2010.