
All Random Features Representations are Equivalent

Luke Sernau
Google DeepMind
[email protected]
Silvano Bonacina
Google DeepMind
[email protected]
Rif A. Saurous
Google Research
[email protected]
Abstract

Random features are a powerful technique for rewriting positive-definite kernels as linear products. They bring linear tools to bear in important nonlinear domains like KNNs and attention. Unfortunately, practical implementations require approximating an expectation, usually via sampling. This has led to the development of increasingly elaborate representations with ever lower sample error.

We resolve this arms race by deriving an optimal sampling policy. Under this policy all random features representations have the same approximation error, which we show is the lowest possible. This means that we are free to choose whatever representation we please, provided we sample optimally.

1 Introduction

Kernel methods have a long and illustrious history in machine learning [1]. Nowadays one of their most widespread applications is the softmax kernel inside the attention layers of a transformer. Unfortunately, kernel methods scale linearly in the size of the dataset, which is impractical for very large datasets.

In their seminal paper, Rahimi and Recht [2] show that this linear cost can be avoided by replacing the kernel with a randomized approximation that is correct in expectation. They do this by means of a random feature representation.

Definition 1.

Given a positive-definite kernel $K:X\times X\rightarrow\mathbb{R}$, a random feature representation of $K$ is a symmetric function $\phi:X\times X\rightarrow\mathbb{R}$ together with a distribution $\Omega$ (most commonly the normal distribution) such that

K\left(x_{1},x_{2}\right)=\mathop{\mathbb{E}}_{\omega\sim\Omega}\left[\phi\left(x_{1},\omega\right)\phi\left(x_{2},\omega\right)\right]. (1)
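
For concreteness, the classical example from Rahimi and Recht [2] is the Gaussian kernel paired with random Fourier features, where $\omega$ bundles a frequency $w\sim N(0,I)$ with a uniform phase $b$ and $\phi(x,\omega)=\sqrt{2}\cos(w^{\top}x+b)$. The following minimal Python/NumPy sketch (our own illustration; none of these specific choices are prescribed by the definition) checks equation (1) by Monte Carlo:

import numpy as np

rng = np.random.default_rng(0)
d = 3
x1, x2 = rng.normal(size=d), rng.normal(size=d)

# Gaussian kernel K(x1, x2) = exp(-||x1 - x2||^2 / 2).
K_exact = np.exp(-0.5 * np.sum((x1 - x2) ** 2))

# Random Fourier features: omega = (w, b) with w ~ N(0, I), b ~ U(0, 2*pi),
# and phi(x, omega) = sqrt(2) * cos(w @ x + b).
m = 200_000
w = rng.normal(size=(m, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=m)
phi1 = np.sqrt(2.0) * np.cos(w @ x1 + b)
phi2 = np.sqrt(2.0) * np.cos(w @ x2 + b)

# The sample mean of phi(x1, omega) * phi(x2, omega) approaches K(x1, x2).
print(K_exact, np.mean(phi1 * phi2))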

To see why this is useful, consider a simple kernel estimator over some dataset $\{(x_{1},y_{1}),(x_{2},y_{2}),\dots,(x_{n},y_{n})\}$.

KE\left(x\right)=\sum^{n}_{i=1}K\left(x,x_{i}\right)y_{i}.

As written, this kernel estimator requires linear time to evaluate, since for every new $x$ it's necessary to evaluate the kernel against every element in the dataset. But if the kernel admits a random features representation, we may write

KE\left(x\right)=\sum^{n}_{i=1}\mathop{\mathbb{E}}_{\omega\sim\Omega}\left[\phi(x,\omega)\phi(x_{i},\omega)\right]y_{i}=\mathop{\mathbb{E}}_{\omega\sim\Omega}\left[\phi(x,\omega)\sum^{n}_{i=1}\phi(x_{i},\omega)y_{i}\right].

This expectation can be approximated via sampling,

KE\left(x\right)\approx\frac{1}{k}\sum^{k}_{j=1}\phi(x,\omega_{j})\sum^{n}_{i=1}\phi(x_{i},\omega_{j})y_{i},

where $\omega_{1},\omega_{2},\dots,\omega_{k}$ are samples drawn from $\Omega$. As long as $k$ is less than $n$, we can save time by precomputing the sum on the right. The principal challenge is finding an estimator that keeps the error introduced by sampling small.
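
To make the savings concrete, here is a small sketch (again our own illustration, reusing the random Fourier features from the previous example) comparing the naive estimator, which touches all $n$ points per query, with the random features version, which precomputes the inner sum once and then answers each query in $O(k)$ time:

import numpy as np

rng = np.random.default_rng(0)
n, d, k = 5000, 3, 2048

# Toy dataset {(x_i, y_i)} and a query point x.
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
x = rng.normal(size=d)

# Naive kernel estimator: evaluates the Gaussian kernel against all n points.
ke_naive = np.exp(-0.5 * np.sum((X - x) ** 2, axis=1)) @ y

# Random features version: draw omega_1, ..., omega_k once.
w = rng.normal(size=(k, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=k)
phi = lambda A: np.sqrt(2.0) * np.cos(A @ w.T + b)  # maps rows of A to k features

# Precompute z_j = sum_i phi(x_i, omega_j) * y_i once, at O(n k) cost.
z = phi(X).T @ y

# Each new query now costs O(k) instead of O(n).
ke_fast = phi(x[None, :])[0] @ z / k

# The two agree up to Monte Carlo error, which shrinks like 1 / sqrt(k).
print(ke_naive, ke_fast)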

Choromanski et al. [3] were the first to apply this idea to attention. They did so by finding a choice of $\phi$ that was numerically stable and had low sample variance. This kicked off a flurry of research [4, 5, 6] into better choices of $\phi$ with ever lower sample variance.

We provide a potential conclusion to this competition. Rather than optimizing $\phi$, we simply choose our sampling strategy carefully. By approximating the expectation using an optimal importance sampling strategy, we are able to achieve a sample error that is as low as possible.

Remarkably, this sample variance does not depend on $\phi$, meaning that when sampled optimally, all random features schemes have the same variance.

Ours is not the first scheme to consider importance sampling, but in contrast to others that sample based on kernel polarization heuristics [7] or quadrature rules [8], we directly solve the variance minimization problem. To our knowledge, we are the first to obtain global optima that are independent of the choice of $\phi$.

2 Importance sampling

Importance sampling [9] is a well-known technique for reducing variance when estimating an expectation via sampling. It's based on the observation that some samples have a larger effect on the expectation than others. Instead of naively sampling from the given distribution, we sample from some new distribution $\Psi$, rescaling the samples such that the correct final estimate is preserved.

In our case, we wish to estimate (1) via sampling. Let $\Psi$ be some new distribution with the same support as $\Omega$, and suppose $\Psi$ and $\Omega$ have densities $p_{\Psi}$ and $p_{\Omega}$, respectively. Observe that

K\left(x_{1},x_{2}\right)=\mathop{\mathbb{E}}_{\omega\sim\Omega}\left[\phi\left(x_{1},\omega\right)\phi\left(x_{2},\omega\right)\right]=\mathop{\mathbb{E}}_{\omega\sim\Psi}\left[\frac{p_{\Omega}\left(\omega\right)}{p_{\Psi}\left(\omega\right)}\phi\left(x_{1},\omega\right)\phi\left(x_{2},\omega\right)\right]. (2)

In other words, we are free to sample from $\Psi$ without changing the value of the expectation. On the other hand, the sample variance, given by

\mathop{\text{Var}}_{\omega\sim\Psi}\left[\frac{p_{\Omega}\left(\omega\right)}{p_{\Psi}\left(\omega\right)}\phi\left(x_{1},\omega\right)\phi\left(x_{2},\omega\right)\right],

does depend on $\Psi$. Notice that the mean squared error of our sampling procedure is precisely this variance. All we need to do to minimize the per-sample error is find a choice of $\Psi$ to make this quantity small.
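
As a minimal illustration of (2), with every concrete choice ours rather than the paper's: take the one-dimensional Gaussian kernel with random Fourier features, $\Omega=N(0,1)$ over the frequency (the phase keeps its uniform distribution), and an arbitrary heavier-tailed proposal $\Psi=N(0,2^{2})$. Reweighting by $p_{\Omega}/p_{\Psi}$ leaves the estimate unbiased, while the per-sample variance changes:

import numpy as np

rng = np.random.default_rng(0)
x1, x2 = 0.3, -1.1
K_exact = np.exp(-0.5 * (x1 - x2) ** 2)

m = 500_000
b = rng.uniform(0.0, 2.0 * np.pi, size=m)  # the phase keeps its uniform law

def estimate(w, weights):
    terms = weights * 2.0 * np.cos(w * x1 + b) * np.cos(w * x2 + b)
    return terms.mean(), terms.var()

# Plain sampling from Omega = N(0, 1): all weights are 1.
mean_plain, var_plain = estimate(rng.normal(size=m), np.ones(m))

# Importance sampling from Psi = N(0, 2^2), reweighting each term by
# p_Omega(w) / p_Psi(w) = 2 * exp(-w^2 / 2 + w^2 / 8).
w_psi = rng.normal(scale=2.0, size=m)
mean_is, var_is = estimate(w_psi, 2.0 * np.exp(-0.5 * w_psi**2 + 0.125 * w_psi**2))

# Both means approximate K(x1, x2); only the per-sample variance differs.
print(K_exact, mean_plain, mean_is)
print(var_plain, var_is)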

3 Optimal sampling

Choosing an optimal $\Psi$ is an intrinsically data-driven question, as the variance depends on the input. We'd like to minimize the expected sample variance over all inputs, as captured in the following definition.

Definition 2.

Suppose $x_{1}$ and $x_{2}$ are sampled from $\mathcal{X}_{1}$ and $\mathcal{X}_{2}$, respectively. If we approximate the expression in (2) by sampling from $\Psi$, the expected sample variance $\mathcal{V}_{\Psi}$ over all $x_{1}$ and $x_{2}$ is given by

\mathcal{V}_{\Psi}=\mathop{\mathbb{E}}_{x_{1}\sim\mathcal{X}_{1}}\mathop{\mathbb{E}}_{x_{2}\sim\mathcal{X}_{2}}\mathop{\text{Var}}_{\omega\sim\Psi}\left[\frac{p_{\Omega}\left(\omega\right)}{p_{\Psi}\left(\omega\right)}\phi\left(x_{1},\omega\right)\phi\left(x_{2},\omega\right)\right].
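
The quantity $\mathcal{V}_{\Psi}$ can be estimated by nested Monte Carlo. The sketch below fixes an illustrative setup of our own choosing (random Fourier features of the one-dimensional Gaussian kernel, $\Omega=N(0,1)$ over the frequency, $\mathcal{X}_{1}=\mathcal{X}_{2}=N(0,1)$, and Gaussian proposals $\Psi=N(0,\sigma^{2})$) and shows how the expected sample variance changes with the proposal:

import numpy as np

rng = np.random.default_rng(0)

def expected_sample_variance(sigma, n_pairs=2000, m=4000):
    """Monte Carlo estimate of V_Psi for the Gaussian proposal Psi = N(0, sigma^2)."""
    total = 0.0
    for _ in range(n_pairs):
        x1, x2 = rng.normal(), rng.normal()        # x1 ~ X1, x2 ~ X2
        w = rng.normal(scale=sigma, size=m)        # frequency drawn from Psi
        b = rng.uniform(0.0, 2.0 * np.pi, size=m)  # phase keeps its own law
        # Density ratio p_Omega(w) / p_Psi(w) for N(0, 1) over N(0, sigma^2).
        ratio = sigma * np.exp(-0.5 * w**2 + 0.5 * (w / sigma) ** 2)
        terms = ratio * 2.0 * np.cos(w * x1 + b) * np.cos(w * x2 + b)
        total += terms.var()
    return total / n_pairs

# Different proposals lead to different expected sample variances.
for sigma in (0.5, 1.0, 1.5, 2.0):
    print(sigma, expected_sample_variance(sigma))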

The $\Psi$ that minimizes $\mathcal{V}_{\Psi}$ can be found analytically.

Theorem 1.

Let $K$ be a positive-definite kernel such that

K\left(x_{1},x_{2}\right)=\mathop{\mathbb{E}}_{\omega\sim\Psi}\left[\frac{p_{\Omega}\left(\omega\right)}{p_{\Psi}\left(\omega\right)}\phi\left(x_{1},\omega\right)\phi\left(x_{2},\omega\right)\right]

for some $\phi$ and $\Omega$, and every $\Psi$ with the same support as $\Omega$. Suppose $x_{1}$ and $x_{2}$ are sampled from $\mathcal{X}_{1}$ and $\mathcal{X}_{2}$, respectively. Then the expected sample variance $\mathcal{V}_{\Psi}$ over all $x_{1}$ and $x_{2}$ will be minimized when

p_{\Psi}(\omega)\propto p_{\Omega}(\omega)q_{\phi}\left(\omega\right),

where

q_{\phi}\left(\omega\right)=\sqrt{\mathop{\mathbb{E}}_{x_{1}\sim\mathcal{X}_{1}}\left[\phi\left(x_{1},\omega\right)^{2}\right]\mathop{\mathbb{E}}_{x_{2}\sim\mathcal{X}_{2}}\left[\phi\left(x_{2},\omega\right)^{2}\right]}.

The resulting optimal variance will be

\mathcal{\hat{V}}_{\Psi}=\left(\mathop{\mathbb{E}}_{\omega\sim\Omega}q_{\phi}\left(\omega\right)\right)^{2}-\mathop{\mathbb{E}}_{x_{1}\sim\mathcal{X}_{1}}\mathop{\mathbb{E}}_{x_{2}\sim\mathcal{X}_{2}}K\left(x_{1},x_{2}\right)^{2}.
Proof.

Recall that variance can be written as

\mathop{\text{Var}}_{\omega\sim\Psi}\left[\frac{p_{\Omega}\left(\omega\right)}{p_{\Psi}\left(\omega\right)}\phi\left(x_{1},\omega\right)\phi\left(x_{2},\omega\right)\right]=\mathop{\mathbb{E}}_{\omega\sim\Psi}\left[\frac{p_{\Omega}\left(\omega\right)^{2}}{p_{\Psi}\left(\omega\right)^{2}}\phi\left(x_{1},\omega\right)^{2}\phi\left(x_{2},\omega\right)^{2}\right]-\mathop{\mathbb{E}}_{\omega\sim\Psi}\left[\frac{p_{\Omega}\left(\omega\right)}{p_{\Psi}\left(\omega\right)}\phi\left(x_{1},\omega\right)\phi\left(x_{2},\omega\right)\right]^{2}.

The second term is equal to $K\left(x_{1},x_{2}\right)^{2}$, meaning it does not depend on $\Psi$. We turn our attention to the first term. Its expected value over $x_{1}$ and $x_{2}$ is

\mathop{\mathbb{E}}_{x_{1}\sim\mathcal{X}_{1}}\mathop{\mathbb{E}}_{x_{2}\sim\mathcal{X}_{2}}\mathop{\mathbb{E}}_{\omega\sim\Psi}\left[\frac{p_{\Omega}\left(\omega\right)^{2}}{p_{\Psi}\left(\omega\right)^{2}}\phi\left(x_{1},\omega\right)^{2}\phi\left(x_{2},\omega\right)^{2}\right]
=\mathop{\mathbb{E}}_{x_{1}\sim\mathcal{X}_{1}}\mathop{\mathbb{E}}_{x_{2}\sim\mathcal{X}_{2}}\int_{X}\frac{p_{\Omega}\left(\omega\right)^{2}}{p_{\Psi}\left(\omega\right)}\phi\left(x_{1},\omega\right)^{2}\phi\left(x_{2},\omega\right)^{2}d\omega
=\int_{X}\frac{p_{\Omega}\left(\omega\right)^{2}}{p_{\Psi}\left(\omega\right)}q_{\phi}\left(\omega\right)^{2}d\omega. (3)

We wish to find the $p_{\Psi}$ that minimizes this expression, subject to the constraint that $\int_{X}p_{\Psi}\left(\omega\right)d\omega=1$. We'll solve this using Lagrange optimization. The Lagrangian is

\mathcal{L}=\frac{p_{\Omega}\left(\omega\right)^{2}}{p_{\Psi}\left(\omega\right)}q_{\phi}\left(\omega\right)^{2}+\lambda p_{\Psi}\left(\omega\right),

where $\lambda$ is our Lagrange parameter. Differentiating with respect to $p_{\Psi}\left(\omega\right)$,

\frac{\partial\mathcal{L}}{\partial p_{\Psi}}=-\frac{p_{\Omega}\left(\omega\right)^{2}}{p_{\Psi}\left(\omega\right)^{2}}q_{\phi}\left(\omega\right)^{2}+\lambda.

Setting this equal to zero and solving, we find that

p_{\Psi}(\omega)=\frac{1}{\sqrt{\lambda}}p_{\Omega}(\omega)q_{\phi}\left(\omega\right)

as desired. The constraint that $\int_{X}p_{\Psi}\left(\omega\right)d\omega=1$ forces $\sqrt{\lambda}=\int_{X}p_{\Omega}\left(\omega\right)q_{\phi}\left(\omega\right)d\omega$. By plugging this into (3), we obtain the desired variance. ∎
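
In practice, the optimal proposal can be constructed directly from data. The sketch below is our own one-dimensional illustration of the recipe, not a construction from the paper: it writes the Gaussian kernel with positive features $\phi(x,w)=e^{wx-x^{2}}$ (a choice in the spirit of [3]), estimates $q_{\phi}$ from samples of $\mathcal{X}_{1}=\mathcal{X}_{2}$, discretizes $p_{\Omega}\cdot q_{\phi}$ on a grid so it can be normalized and sampled (in higher dimensions one would use MCMC or a parametric fit instead), and reweights each sample by $p_{\Omega}/p_{\Psi}$, which simplifies to $Z/q_{\phi}(w)$ with $Z=\mathop{\mathbb{E}}_{\omega\sim\Omega}q_{\phi}(\omega)$.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative 1-D setup: the Gaussian kernel K(x, y) = exp(-(x - y)^2 / 2)
# written with positive features phi(x, w) = exp(w * x - x^2), w ~ N(0, 1),
# and data distribution X1 = X2 = N(0, 1).
xs = rng.normal(size=2000)  # data samples used to estimate q_phi

def phi(x, w):
    return np.exp(np.multiply.outer(w, x) - x**2)

def q_phi(w):
    # q_phi(w) = E_x[phi(x, w)^2], since X1 = X2 (estimated from the data).
    return np.mean(phi(xs, w) ** 2, axis=-1)

# Theorem 1: p_Psi*(w) is proportional to p_Omega(w) * q_phi(w).
grid, dw = np.linspace(-10.0, 10.0, 2001, retstep=True)
p_omega = np.exp(-0.5 * grid**2) / np.sqrt(2.0 * np.pi)
q_grid = q_phi(grid)
Z = np.sum(p_omega * q_grid) * dw  # Z = E_Omega[q_phi]
probs = p_omega * q_grid
probs /= probs.sum()

def K_estimate(x1, x2, k=20_000):
    # Draw omega from the (discretized) optimal proposal and reweight by
    # p_Omega / p_Psi*, which simplifies to Z / q_phi(omega).
    idx = rng.choice(len(grid), size=k, p=probs)
    w = grid[idx]
    return np.mean((Z / q_grid[idx]) * phi(x1, w) * phi(x2, w))

x1, x2 = 0.4, -0.9
print(np.exp(-0.5 * (x1 - x2) ** 2), K_estimate(x1, x2))

For this particular setup one can check analytically that $p_{\Omega}(w)\,q_{\phi}(w)\propto e^{-w^{2}/10}$, so the optimal proposal is $N(0,5)$, noticeably wider than $\Omega$.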

This result tells us how to minimize the variance for a particular choice of $\phi$. What is less clear in this form is that it also tells us something about optimizing over every $\phi$. We will explore this in the next section.

4 The choice of feature representation does not matter

As written, Theorem 1 appears to be a statement about how to get the most out of any particular choice of $\phi$. While this is useful, the real insight is what it can tell us about every possible $\phi$. We are ready now to begin exploring our main point: the optimal sample variance is often entirely independent of $\phi$.

Our first step in this direction is a tight upper bound on the sampling variance, and therefore the error, that can be stated without reference to $\phi$.

Corollary 1.

The optimal sample variance achieved in Theorem 1 is upper bounded by

\mathcal{\hat{V}}_{\Psi}\leq\mathop{\mathbb{E}}_{x_{1}\sim\mathcal{X}_{1}}\mathop{\mathbb{E}}_{x_{2}\sim\mathcal{X}_{2}}\left[K\left(x_{1},x_{1}\right)K\left(x_{2},x_{2}\right)-K\left(x_{1},x_{2}\right)^{2}\right].

This follows from Theorem 1 by applying the Cauchy-Schwarz inequality to the first term in $\mathcal{\hat{V}}_{\Psi}$. Specifically,

\left(\mathop{\mathbb{E}}_{\omega\sim\Omega}q_{\phi}\left(\omega\right)\right)^{2}=\left(\mathop{\mathbb{E}}_{\omega\sim\Omega}\left[\sqrt{\mathop{\mathbb{E}}_{x_{1}\sim\mathcal{X}_{1}}\left[\phi\left(x_{1},\omega\right)^{2}\right]}\cdot\sqrt{\mathop{\mathbb{E}}_{x_{2}\sim\mathcal{X}_{2}}\left[\phi\left(x_{2},\omega\right)^{2}\right]}\right]\right)^{2}
\leq\left(\mathop{\mathbb{E}}_{\omega\sim\Omega}\mathop{\mathbb{E}}_{x_{1}\sim\mathcal{X}_{1}}\left[\phi\left(x_{1},\omega\right)^{2}\right]\right)\left(\mathop{\mathbb{E}}_{\omega\sim\Omega}\mathop{\mathbb{E}}_{x_{2}\sim\mathcal{X}_{2}}\left[\phi\left(x_{2},\omega\right)^{2}\right]\right)
=\mathop{\mathbb{E}}_{x_{1}\sim\mathcal{X}_{1}}\mathop{\mathbb{E}}_{x_{2}\sim\mathcal{X}_{2}}K\left(x_{1},x_{1}\right)K\left(x_{2},x_{2}\right).

Written this way, our bound is itself suggestively similar to the Cauchy-Schwarz inequality. Indeed, since positive-definite kernels obey the Cauchy-Schwarz inequality, we know that this expression is nonnegative, as expected. More interestingly, since this expression does not depend on $\phi$, this bound is the same regardless of which representation we're using.

In the (relatively common) case where $\mathcal{X}_{1}=\mathcal{X}_{2}$, we can go even further.

Corollary 2.

If $\mathcal{X}_{1}=\mathcal{X}_{2}$ then the bound in Corollary 1 holds with equality.

This can be seen from Theorem 1, since in this case $q_{\phi}\left(\omega\right)=\mathop{\mathbb{E}}_{x_{1}\sim\mathcal{X}_{1}}\left[\phi\left(x_{1},\omega\right)^{2}\right]$. By exchanging the expectations, it follows that $\mathop{\mathbb{E}}_{\omega\sim\Omega}q_{\phi}\left(\omega\right)=\mathop{\mathbb{E}}_{x_{1}\sim\mathcal{X}_{1}}\left[K\left(x_{1},x_{1}\right)\right]$. Since $x_{1}$ and $x_{2}$ are independent, squaring this expectation gives $\mathop{\mathbb{E}}_{x_{1}\sim\mathcal{X}_{1}}\mathop{\mathbb{E}}_{x_{2}\sim\mathcal{X}_{2}}\left[K\left(x_{1},x_{1}\right)K\left(x_{2},x_{2}\right)\right]$, so the optimal variance from Theorem 1 coincides with the bound in Corollary 1.

The significance of this result is that we are able to write down the exact sample variance without any reference to $\phi$. This is not just a statement about the error of a particular representation; it is a statement of equality that holds across all representations. It does not matter which $\phi$ you choose. Every random features representation has the same sample error as every other.
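
As a concrete instance (a worked example of our own, using the Gaussian kernel for illustration), suppose $K(x,x)=1$ for every $x$ and $\mathcal{X}_{1}=\mathcal{X}_{2}=\mathcal{X}$. Corollary 2 then gives

\mathcal{\hat{V}}_{\Psi}=\left(\mathop{\mathbb{E}}_{x\sim\mathcal{X}}K\left(x,x\right)\right)^{2}-\mathop{\mathbb{E}}_{x_{1}\sim\mathcal{X}}\mathop{\mathbb{E}}_{x_{2}\sim\mathcal{X}}K\left(x_{1},x_{2}\right)^{2}=1-\mathop{\mathbb{E}}_{x_{1}\sim\mathcal{X}}\mathop{\mathbb{E}}_{x_{2}\sim\mathcal{X}}\left[e^{-\left\|x_{1}-x_{2}\right\|^{2}}\right],

a number determined entirely by the kernel and the data distribution. Trigonometric features, positive features, or any other valid $\phi$ attain exactly this error under optimal sampling.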

References

[1] Colin Campbell. Kernel methods: a survey of current techniques. Neurocomputing, 48(1):63–84, 2002. ISSN 0925-2312. doi: 10.1016/S0925-2312(01)00643-9. URL https://www.sciencedirect.com/science/article/pii/S0925231201006439.
[2] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc., 2007. URL https://proceedings.neurips.cc/paper_files/paper/2007/file/013a006f03dbc5392effeb8f18fda755-Paper.pdf.
[3] Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, David Benjamin Belanger, Lucy J Colwell, and Adrian Weller. Rethinking attention with performers. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=Ua6zuk0WRH.
[4] Valerii Likhosherstov, Krzysztof Choromanski, Kumar Avinava Dubey, Frederick Liu, Tamás Sarlós, and Adrian Weller. Chefs' random tables: Non-trigonometric random features. In Advances in Neural Information Processing Systems, 2022. URL https://api.semanticscholar.org/CorpusID:249210161.
[5] Valerii Likhosherstov, Krzysztof Choromanski, Avinava Dubey, Frederick Liu, Tamas Sarlos, and Adrian Weller. FAVOR#: Sharp attention kernel approximations via new classes of positive random features. arXiv preprint arXiv:2302.00787, 2023.
[6] Isaac Reid, Krzysztof Marcin Choromanski, Valerii Likhosherstov, and Adrian Weller. Simplex random features. In International Conference on Machine Learning, pages 28864–28888. PMLR, 2023.
[7] Shahin Shahrampour, Ahmad Beirami, and Vahid Tarokh. On data-dependent random features for improved generalization in supervised learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[8] Francis Bach. On the equivalence between kernel quadrature rules and random feature expansions. Journal of Machine Learning Research, 18(21):1–38, 2017.
[9] Surya T Tokdar and Robert E Kass. Importance sampling: a review. Wiley Interdisciplinary Reviews: Computational Statistics, 2(1):54–60, 2010.