Computational-Statistical Trade-off in Kernel Two-Sample Testing with Random Fourier Features

Ikjun Choi Department of Statistics and Data Science, Yonsei University, Seoul, South Korea. Ilmun Kim
Department of Statistics and Data Science, Department of Applied Statistics, Yonsei University, Seoul, South Korea.

Computational-Statistical Trade-off in Kernel Two-Sample Testing with Random Fourier Features

Abstract

Recent years have seen a surge in methods for two-sample testing, among which the Maximum Mean Discrepancy (MMD) test has emerged as an effective tool for handling complex and high-dimensional data. Despite its success and widespread adoption, the primary limitation of the MMD test has been its quadratic-time complexity, which poses challenges for large-scale analysis. While various approaches have been proposed to expedite the procedure, it has been unclear whether it is possible to attain the same power guarantee as the MMD test at sub-quadratic time cost. To fill this gap, we revisit the approximated MMD test using random Fourier features, and investigate its computational-statistical trade-off. We start by revealing that the approximated MMD test is pointwise consistent in power only when the number of random features approaches infinity. We then consider the uniform power of the test and study the time-power trade-off under the minimax testing framework. Our result shows that, by carefully choosing the number of random features, it is possible to attain the same minimax separation rates as the MMD test within sub-quadratic time. We demonstrate this point under different distributional assumptions such as densities in a Sobolev ball. Our theoretical findings are corroborated by simulation studies.

Keywords: Maximum mean discrepancy, Minimax power, Permutation tests, Random Fourier features, Two-sample testing

1 Introduction

The problem of two-sample testing stands as a fundamental topic in statistics, concerned with comparing two distributions to determine their equivalence. Classical techniques, such as the $t$ -test and Wilcoxon rank-sum test, have been widely used to tackle this problem, and their theoretical and empirical properties have been well-investigated. However, these classical approaches often require parametric or strong moment assumptions to fully ensure their soundness, and their power is limited to specific directions of alternative hypotheses, such as location shifts. While these classical approaches are effective in well-structured and simple scenarios, their limitations in handling the increasing complexity of modern statistical problems has consistently prompted the need for new developments (Stolte et al.,, 2023, for a recent review). Among various advancements made to address this issue, the kernel two-sample test based on the maximum mean discrepancy (MMD, Gretton et al., 2012a, ) has garnered significant attention over the years, due to its nonparametric nature and flexibility. It can be applied in diverse scenarios without requiring distributional assumptions and offers robust theoretical underpinnings. With its empirical success and popularity, various research endeavors have been dedicated to enhancing their performance and deepening our understanding of their theoretical properties.

Broadly, there are two main branches of research regarding the kernel test: (i) kernel selection and (ii) computational time-power trade-off. Regarding kernel selection, significant advancements have been made in the last decade, aiming to identify the kernel that best captures the difference between two distributions. A common approach involves sample splitting where one-half of the data is used for kernel selection and the other half of data is used for the actual test (e.g., Gretton et al., 2012b, ; Sutherland et al.,, 2017; Liu et al.,, 2020). However, an inefficient use of the data from sample splitting often results in a loss of power, which has been the main criticism. Another approach for kernel selection involves aggregating multiple kernels, which avoids sample splitting but requires a careful selection of kernels in advance (e.g., Schrab et al.,, 2023, 2022; Biggs et al.,, 2023; Chatterjee and Bhattacharya,, 2023).

Regarding the time-power trade-off, much effort has concentrated on constructing a time-efficient test statistic with competitive power. The standard estimator of MMD via U-statistics or V-statistics demands quadratic-time complexity, which hinders the use of kernel tests for large-scale analyses. To mitigate this computational challenge, various methods have been proposed by using linear-time statistics (Gretton et al., 2012a, ; Gretton et al., 2012b, ), block-based statistics (Zaremba et al.,, 2013) and more generally incomplete U-statistics (Yamada et al.,, 2019; Schrab et al.,, 2022). However, these methods typically sacrifice statistical power for computational efficiency. Another approach that aims to balance this time-power trade-off is based on random Fourier features (RFF, Rahimi and Recht,, 2007). The idea is to approximate a kernel function using a finite dimensional random feature mapping, which can be computed efficiently. The use of RFF in a kernel test was initially considered by Zhao and Meng, (2015) and explored further by follow-up studies (e.g., Cevid et al.,, 2022). It is intuitively clear that the performance of an RFF-MMD test crucially depends on the number of random features. While there is a line of work studying theoretical aspects of RFFs (Liu et al.,, 2022, for a survey), their focus is mainly on the approximation quality of RFFs (Liu et al.,, 2022, for a survey), and the optimal choice of the number of random features that balances between computational costs and statistical power remains largely unexplored.

Motivated by this gap, we consider kernel two-sample tests using random Fourier features and aim to establish theoretical foundations for their power properties. Our tests are based on a permutation procedure, which is practically relevant but introduces additional technical challenges. As mentioned earlier, both the quality and the computational complexity of the RFF-MMD test heavily depend on the number of random features. Our primary focus therefore is to determine the number of random features that strikes an optimal balance. It is worth highlighting that the challenge in our analysis lies in managing the interplay of three distinct randomness sources: the data itself, the random Fourier features, and the permutations employed in our approach. All of these random sources are intertwined within the testing process, which makes our analysis non-trivial and unique. To effectively manage this complexity, we systematically decompose and analyze each layer of randomness in the test procedure, transitioning them into forms that are more amenable for analysis. This approach allows us to build on existing results from the literature that specifically address each of the three aspects of randomness.

In the next subsection, we present a brief review of prior work that is most relevant to our paper.

1.1 Related work

In recent years, there has been a growing body of literature aimed at investigating the power of MMD-based tests and enhancing their performance. For example, the work of Li and Yuan, (2019); Balasubramanian et al., (2021) demonstrated that MMD tests equipped with a fine-tuned kernel can achieve minimax optimality with respect to the $L_{2}$ separation in an asymptotic sense. To establish a similar but non-asymptotic guarantee, Schrab et al., (2023) introduced a MMD aggregated test calibrated by using either permutations or a wild bootstrap. It is also worth noting that the minimax optimality of MMD two-sample tests has been established for separations other than the $L_{2}$ distance, such as MMD distance (Kim and Schrab,, 2023), and Hellinger distance (Hagrass et al.,, 2022). In addition to these works, several other MMD-based minimax tests have been proposed using techniques such as aggregation (Fromont et al.,, 2013; Chatterjee and Bhattacharya,, 2023; Biggs et al.,, 2023) and studentization (Kim and Ramdas,, 2024; Shekhar et al.,, 2023). Despite significant recent advancements made in this field, the quadratic time complexity of these methods remains a barrier in large-scale applications, which highlights the need for more efficient yet powerful testing approaches.

To address the computational concern of quadratic-time MMD tests, several time-efficient approaches have emerged, which leverage subsampled estimation techniques, such as linear-time statistics (Gretton et al., 2012a, ; Gretton et al., 2012b, ), block-based statistics (Zaremba et al.,, 2013) and incomplete U-statistics (Yamada et al.,, 2019; Schrab et al.,, 2022). However, in terms of power, these methods are either sub-optimal or ultimately require quadratic time complexity to achieve optimality (Domingo-Enrich et al.,, 2023, Proposition 2). Other advancements in accelerating two-sample tests have involved techniques, such as Nyström approximations (Chatalic et al.,, 2022), analytic mean embeddings and smoothed characteristic functions (Chwialkowski et al.,, 2015; Jitkrittum et al.,, 2016), deep linear kernels (Kirchler et al.,, 2020), as well as random Fourier features (Zhao and Meng,, 2015). These tests can also run in sub-quadratic time, while their theoretical guarantees on power remain largely unknown. We also mention the recent method using kernel thinning (Dwivedi and Mackey,, 2021; Domingo-Enrich et al.,, 2023), which achieves the same MMD separation rate as the quadratic-time test but with sub-quadratic running time. However, this guarantee is valid under specific distributional assumptions that differ from those we consider. Moreover, their result focuses solely on alternatives that deviate from the null in terms of the MMD metric.

With this context in mind, we revisit the RFF-MMD test (Zhao and Meng,, 2015) and delve into its time-power trade-off concerning the number of random features. Despite an extensive body of literature on random features for kernel approximation, prior work has mainly focused on the estimation quality of kernel approximation (Rahimi and Recht,, 2007; Zhao and Meng,, 2015; Sriperumbudur and Szabo,, 2015; Sutherland and Schneider,, 2015; Yao et al.,, 2023), and a theoretical guarantee on the power of the RFF-MMD test has not been explored. In this work, we seek to bridge this gap by thoroughly analyzing the trade-off between computation time and statistical power in the context of the RFF-MMD test.

1.2 Our contributions

Having reviewed the prior work, we now summarize the key contributions of this paper.

•

Inconsistency result for RFF-MMD (Section 3). We first investigate the setting where the number of random Fourier features is fixed, and demonstrate that the RFF-MMD test fails to achieve pointwise consistency (Theorem 3 and Corollary 4). Concretely, we prove that there exist infinitely many pairs of distinct distributions for which the power of the RFF-MMD test using a fixed number of random Fourier features is almost equal to the size even asymptotically.
•

Sufficient conditions for consistency (Section 4). Our previous negative result clearly indicates that increasing the number of random Fourier features is necessary to achieve pointwise consistency. In Theorem 5, we show that it is indeed sufficient to increase the number of Fourier features to infinity to achieve pointwise consistency, even at an arbitrarily slow rate.
•

Time-power trade-off (Section 4). As mentioned before, there exists a clear trade-off between computational efficiency and statistical power in terms of the number of random Fourier features. To balance this trade-off, we adopt the non-asymptotic minimax testing framework and analyze how changes in the number of random Fourier features impact both computational efficiency and separation rates in terms of the $L_{2}$ metric (Theorem 6) and the MMD metric (Theorem 7).
•

Achieving optimality in sub-quadratic time (Section 4). We firmly demonstrate in Theorem 6 that it is possible to achieve the minimax separation rate against $L_{2}$ alternatives in sub-quadratic time when the underlying distributions are sufficiently smooth. Similarly, we establish in Proposition 8 that a parametric separation rate against MMD alternatives can be achieved in linear time for certain classes of distributions including Gaussian distributions.

Our theoretical results are validated through simulation studies under various scenarios and the code that reproduces our numerical results can be found at https://github.com/ikjunchoi/rff-mmd.

Organization.

The remainder of this paper is organized as follows. We set up the problem and present relevant background information in Section 2. Section 3 provides an inconsistency result of the RFF-MMD test and highlights the important role of the number of random features in the power performance. Moving forward to Section 4, we investigate the time-power trade-off in terms of the number of random features denoted as $R$ , and discuss an optimal choice of $R$ under minimax frameworks. We present simulation results in Section 5 that confirm our theoretical findings. Finally, in Section 6, we discuss the implications of our findings and suggest directions for future research. All technical proofs are collected in the appendix.

2 Background

In this section, we set up the problem and lay out some background for this work. Specifically, Section 2.1 explains the two-sample problem that we tackle, and specifies the desired error guarantees. We then present a brief overview of the MMD in Section 2.2 and its estimators using random Fourier features in Section 2.3. Lastly, in Section 2.4, we review the permutation method for evaluating the significance of a two-sample test statistic.

2.1 Two-sample problem

Let $\mathcal{X}_{n_{1}}:=\{X_{i}\}^{n_{1}}_{i=1}$ be ${n_{1}}$ i.i.d. random samples from the distribution $P_{X}$ , and $\mathcal{Y}_{n_{2}}:=\{Y_{j}\}^{n_{2}}_{j=1}$ be ${n_{2}}$ i.i.d. random samples from the distribution $P_{Y}$ where ${n_{1}},{n_{2}}\geq 2$ . Based on these mutually independent samples, the problem of two-sample testing is concerned with determining whether $P_{X}$ and $P_{Y}$ agree or not. More formally, let $\mathcal{P}$ be a class of all possible pairs of distributions on some generic space $\mathbb{S}$ , and consider two disjoint subsets in $\mathcal{P}$ , namely $\mathcal{P}_{0}:=\left\{(P_{X},P_{Y})\in\mathcal{P}\,|\,P_{X}=P_{Y}\right\}$ and $\mathcal{P}_{1}:=\left\{(P_{X},P_{Y})\in\mathcal{P}\,|\,P_{X}\neq P_{Y}\right\}$ . Then, the null hypothesis $H_{0}$ and the alternative hypothesis $H_{1}$ of two-sample testing can be formulated as follows:

H_{0}:(P_{X},P_{Y})\in\mathcal{P}_{0}\quad\text{vs.}\quad H_{1}:(P_{X},P_{Y})\in\mathcal{P}_{1}.

In order to decide whether to reject $H_{0}$ or not, we devise a test function $\Delta_{{n_{1}},{n_{2}}}:(\mathbb{S}^{n_{1}},\,\mathbb{S}^{{n_{2}}})\rightarrow\{0,1\}$ , and reject the null hypothesis if and only if $\Delta_{{n_{1}},{n_{2}}}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}})=1$ . This decision-making process naturally leads to two types of errors, which we would like to minimize. The first error, called the type I error, occurs by rejecting the null hypothesis despite being true. Conversely, the second error, called the type II error, arises when the null hypothesis is accepted despite being false. One common approach to design an ideal test is to first bound the probability of the type I error uniformly over $\mathcal{P}_{0}$ as

\sup_{(P_{X},P_{Y})\in\mathcal{P}_{0}}\mathbb{P}_{X\times Y}(\Delta_{{n_{1},n_{2}}}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}})=1)\leq\alpha,\quad\text{for a given {level} $\alpha\in(0,1),$}

where $\mathbb{P}_{X\times Y}$ denotes the probability operator over $\mathcal{X}_{n_{1}}\overset{\mathrm{i.i.d.}}{\sim}P_{X}$ and $\mathcal{Y}_{n_{2}}\overset{\mathrm{i.i.d.}}{\sim}P_{Y}$ . We say that such a test is a level- $\alpha$ test. Next, our focus shifts to controlling the type II error. Given a fixed pair $(P_{X},P_{Y})$ in $\mathcal{P}_{1}$ and a level- $\alpha$ test $\Delta^{\alpha}_{{n_{1},n_{2}}}$ , suppose that the probability of the type II error is upper bounded by some constant $\beta\in(0,1)$ . Equivalently, the probability of correctly rejecting the null, referred to as the power, is lower bounded by $1-\beta$ . Ideally, we expect that the power of the test $\Delta^{\alpha}_{{n_{1},n_{2}}}$ against any fixed alternative $(P_{X},P_{Y})\in\mathcal{P}_{1}$ converges to one as we increase the data size ${n_{1}}$ and ${n_{2}}$ . More formally, we desire a test $\Delta^{\alpha}_{{n_{1},n_{2}}}$ to be pointwise consistent, satisfying

\displaystyle\lim_{{n_{1},n_{2}}\rightarrow\infty}\mathbb{P}_{X\times Y}\bigl{(}\Delta^{\alpha}_{{n_{1},n_{2}}}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}})=1\bigr{)}=1,\quad\text{for any fixed $(P_{X},P_{Y})\in\mathcal{P}_{1}$.}

(1)

A stronger notion of the power is uniform consistency, guaranteeing that the power converges to one uniformly over a class of alternative distributions. See Section 4 for a discussion. For simplicity, in the rest of this paper, we consider $\mathbb{S}$ to be the $d$ -dimensional Euclidean space denoted as $\mathbb{R}^{d}$ .

2.2 Maximum Mean Discrepancy

As an example of integral probability metrics, the MMD measures the discrepancy between two distributions in a nonparametric manner. Specifically, given a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_{k}$ equipped with a positive definite kernel $k$ , the MMD between $P_{X}$ and $P_{Y}$ is defined as

\displaystyle\mathrm{MMD}(P_{X},P_{Y};\mathcal{H}_{k}):=\sup_{f\in\mathcal{H}_{k}:\|f\|_{\mathcal{H}_{k}}\leq 1}\bigl{|}\mathbb{E}_{X}\left[f(X)\right]-\mathbb{E}_{Y}\left[f(Y)\right]\bigr{|}.

It can also be represented as the RKHS distance between two mean embeddings of $P_{X}$ and $P_{Y}$ , i.e., $\mathrm{MMD}(P_{X},P_{Y};\mathcal{H}_{k})=\|\mu_{X}-\mu_{Y}\|_{\mathcal{H}_{k}}$ where $\mu_{X}(\cdot):=\mathbb{E}_{X}[k(X,\cdot)]$ and $\mu_{Y}(\cdot):=\mathbb{E}_{Y}[k(Y,\cdot)]$ . For a characteristic kernel $k$ , the mean embedding of the kernel is injective (Sriperumbudur et al.,, 2010), which means that $\mathrm{MMD}(P_{X},P_{Y};\mathcal{H}_{k})=0$ if and only if $P_{X}=P_{Y}$ . Among several ways to estimate the MMD, one straightforward way is to substitute the population mean embeddings $\mu_{X}$ and $\mu_{Y}$ with the empirical counterparts $\hat{\mu}_{X}(\cdot)=\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}k(X_{i},\cdot)$ and $\hat{\mu}_{Y}(\cdot)=\frac{1}{{n_{2}}}\sum_{i=1}^{n_{2}}k(Y_{i},\cdot)$ . This plug-in approach results in a biased quadratic-time estimator of the squared MMD, also referred to as the V-statistic, given as

	$\displaystyle\widehat{\mathrm{MMD}}_{b}^{2}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\mathcal{H}_{k})$	$\displaystyle=\left\\|\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}k(X_{i},\cdot)-\frac{1}{{n_{2}}}\sum_{i=1}^{n_{2}}k(Y_{i},\cdot)\right\\|_{\mathcal{H}_{k}}^{2}$		(2)
		$\displaystyle=\frac{1}{{n_{1}^{2}}}\sum_{i=1}^{n_{1}}\sum_{j=1}^{n_{1}}k\left(X_{i},X_{j}\right)+\frac{1}{{n_{2}^{2}}}\sum_{i=1}^{n_{2}}\sum_{j=1}^{n_{2}}k\left(Y_{i},Y_{j}\right)-\frac{2}{{n_{1}}{n_{2}}}\sum_{i=1}^{n_{1}}\sum_{j=1}^{n_{2}}k\left(X_{i},Y_{j}\right).$		(2)

Denoting ${N}:={n_{1}}+{n_{2}}$ , this plug-in estimator requires a quadratic-time cost of $O({N}^{2}d)$ in terms of the sample size $N$ as it involves evaluating pairwise kernel similarities between samples. Another common approach to estimate $\mathrm{MMD}^{2}(P_{X},P_{Y};\mathcal{H}_{k})$ is using the U-statistic (e.g., Gretton et al., 2012a, , Lemma 6), which is given as

	$\displaystyle\widehat{\mathrm{MMD}}_{u}^{2}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\mathcal{H}_{k})\,=$	$\displaystyle\frac{1}{{n_{1}}({n_{1}}-1)}\sum_{1\leq i\neq j\leq{n_{1}}}k\left(X_{i},X_{j}\right)+\frac{1}{{n_{2}}({n_{2}}-1)}\sum_{1\leq i\neq j\leq{n_{2}}}k\left(Y_{i},Y_{j}\right)$
	$\displaystyle-$	$\displaystyle\frac{2}{{n_{1}}{n_{2}}}\sum_{i=1}^{n_{1}}\sum_{j=1}^{n_{2}}k\left(X_{i},Y_{j}\right).$

This estimator is an unbiased estimator of the squared MMD and also requires quadratic-time computational costs.

2.3 Random Fourier features

Numerous approaches have been introduced to mitigate the computational cost of quadratic-time statistics often at the cost of sacrificing power performance. As reviewed in Section 1.1, some notable approaches include incomplete U-statistics (Gretton et al., 2012a, ; Zaremba et al.,, 2013; Schrab et al.,, 2022), Nyström approximations (Chatalic et al.,, 2022), kernel thinning (Dwivedi and Mackey,, 2021; Domingo-Enrich et al.,, 2023) and random Fourier features (Rahimi and Recht,, 2007; Zhao and Meng,, 2015). This work focuses on the method utilizing random Fourier features and investigates the effect of the number of random features on the power of a test. At the heart of this method is Bochner’s theorem (Lemma 9), which offers a means to approximate the kernel using a low-dimensional feature mapping $\psi_{\omega}$ , satisfying $k(x,y)\approx\langle\psi_{\omega}(x),\psi_{\omega}(y)\rangle$ . If a bounded continuous positive definite kernel $k$ is translation invariant on $\mathbb{R}^{d}$ , that is, $k(x,y)=\kappa(x-y)$ , Bochner’s theorem guarantees the existence of a nonnegative Borel measure $\Lambda$ . It can be shown that $\Lambda$ is the inverse Fourier transform of $\kappa$ and meets

k(x,y)=\int_{\mathbb{R}^{d}}e^{\sqrt{-1}\omega^{\top}(x-y)}d\Lambda(\omega)\stackrel{{\scriptstyle(\dagger)}}{{=}}\int_{\mathbb{R}^{d}}\cos\bigl{(}{\omega^{\top}(x-y)}\bigr{)}d\Lambda(\omega),

where the equality $(\dagger)$ is obtained by the fact that $\kappa$ is both real and symmetric. Without loss of generality, we assume that $\Lambda$ is a probability measure, allowing the last integral to be expressed as $\mathbb{E}_{\omega\sim\Lambda}[\langle\psi_{\omega}(x),\psi_{\omega}(y)\rangle]$ with $\psi_{\omega}(x):=[\cos(\omega^{\top}x),\sin(\omega^{\top}x)]^{\top}$ . If not, we instead work with the scaled versions of $\Lambda$ and $\psi_{\omega}$ , given as $\Lambda^{\prime}:=\kappa^{-1}(0)\Lambda$ and $\psi_{\omega}^{\prime}(\cdot):=[\sqrt{\kappa(0)}\cos(\omega^{\top}\cdot),\sqrt{\kappa(0)}\sin(\omega^{\top}\cdot)]^{\top}$ . In this case, $k(x,y)$ can be represented as $\mathbb{E}_{\omega\sim\Lambda^{\prime}}[\langle\psi_{\omega}^{\prime}(x),\psi_{\omega}^{\prime}(y)\rangle]$ .

Now, by drawing a sequence of i.i.d. $R$ random frequencies $\boldsymbol{\omega}_{R}:=\{\omega_{r}\}^{R}_{r=1}$ from $\Lambda$ , we construct an unbiased estimator of $k(x,y)$ defined as an inner product of random feature maps:

\hat{k}(x,y):=\frac{1}{R}\sum\limits_{r=1}^{R}\langle\psi_{\omega_{r}}(x),\psi_{\omega_{r}}(y)\rangle=\langle\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(x),\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(y)\rangle,

(3)

where $\psi_{\omega_{r}}(x)=[\cos(\omega_{r}^{\top}x),\sin(\omega_{r}^{\top}x)]^{\top}$ and $\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(x)=\frac{1}{\sqrt{R}}[\psi_{\omega_{1}}(x)^{\top},~{}\ldots~{},\psi_{\omega_{R}}(x)^{\top}]^{\top}\in\mathbb{R}^{2R}$ . Let us define the vector in $\mathbb{R}^{2R}$ representing the difference in sample means of random feature maps as follows:

T(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\boldsymbol{\omega}_{R}):=\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(X_{i})-\frac{1}{{n_{2}}}\sum_{j=1}^{n_{2}}\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Y_{j}).

Also, denote the quadratic form of $T:=T(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\boldsymbol{\omega}_{R})$ as $V:=T^{\top}T$ . When we replace the kernel $k$ in Equation (2) with the estimated $\hat{k}$ , we obtain a RFF-MMD estimator of $\mathrm{MMD}^{2}$ that can run with a time complexity of $O({N}Rd)$ :

\text{r}\widehat{\mathrm{MMD}}_{b}^{2}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\boldsymbol{\omega}_{R}):=V(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\boldsymbol{\omega}_{R})=\Bigg{\|}\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(X_{i})-\frac{1}{{n_{2}}}\sum_{j=1}^{n_{2}}\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Y_{j})\Bigg{\|}_{\mathbb{R}^{2R}}^{2}.

(4)

Notably, this estimator can be computed in linear time in terms of the pooled sample size ${N}$ , and this computational benefit has motivated the prior work, such as Zhao and Meng, (2015), Sutherland and Schneider, (2015) and Cevid et al., (2022), that consider RFF-MMD statistics.

One may also consider an unbiased RFF-MMD statistic, given as

$\displaystyle\text{r}\widehat{\mathrm{MMD}}_{u}^{2}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\boldsymbol{\omega}_{R})$	$\displaystyle:=\frac{1}{{n_{1}}({n_{1}}-1)}\sum_{1\leq i\neq j\leq{n_{1}}}\langle\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(X_{i}),\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(X_{j})\rangle$	(5)
	$\displaystyle\quad+\frac{1}{{n_{2}}({n_{2}}-1)}\sum_{1\leq i\neq j\leq{n_{2}}}\langle\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Y_{i}),\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Y_{j})\rangle$
	$\displaystyle\quad-\frac{2}{{n_{1}}{n_{2}}}\sum_{i=1}^{n_{1}}\sum_{j=1}^{n_{2}}\langle\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(X_{i}),\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Y_{j})\rangle,$

which also involves $O({N}Rd)$ computational time (Zhao and Meng,, 2015, Appendix A.1). In this work, we consider both $\text{r}\widehat{\mathrm{MMD}}_{b}^{2}$ and $\text{r}\widehat{\mathrm{MMD}}_{u}^{2}$ to demonstrate statistical and computational trade-offs in RFF-based two-sample testing.

2.4 Permutation tests

There have been several methods proposed for determining the threshold for MMD tests, which ensures (asymptotic or non-asymptotic) type I error control. These methods include those using limiting distributions or concentration inequalities, Gamma approximations, and bootstrap/permutation methods (e.g., Gretton et al., 2012a, ; Schrab et al.,, 2023). Among these, permutation tests stand out for their unique strength: they maintain level $\alpha$ for any finite sample size and often achieve optimal power (e.g., Kim et al.,, 2022). This advantage has made permutation tests a popular choice in real-world applications despite extra computational costs. Given their practical relevance, this work focuses on permutation-based MMD tests and establishes their theoretical guarantees.

To explain the procedure, let us write the pooled sample as $\mathcal{Z}_{N}:=\left\{Z_{1},\ldots,Z_{N}\right\}=\left\{\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}}\right\}$ , and denote the collection of all possible permutations of $(1,2,\ldots,{N})$ as $\Pi_{N}$ . Given a permutation $\pi:=(\pi(1),\ldots,\pi({N}))\in\Pi_{N},$ we denote the permuted pooled samples as $\mathcal{Z}^{\pi}_{N}:=\left\{Z_{\pi(1)},\ldots,Z_{\pi({N})}\right\}.$ Then, for a generic test statistic $T_{{n_{1}},{n_{2}}},$ the permutation distribution of $T_{{n_{1}},{n_{2}}}$ is defined as

F^{\pi}_{T_{{n_{1}},{n_{2}}}}(t):=\frac{1}{{N}!}\sum_{\pi\in\Pi_{{N}}}\mathds{1}\{T_{{n_{1}},{n_{2}}}(\mathcal{Z}^{\pi}_{N})\leq t\}.

The permutation test rejects the null hypothesis when $T_{{n_{1}},{n_{2}}}(\mathcal{Z}_{N})\geq q_{{n_{1}},{n_{2}},1-\alpha}$ where $q_{{n_{1}},{n_{2}},1-\alpha}$ denotes the $1-\alpha$ quantile of $F^{\pi}_{T_{{n_{1}},{n_{2}}}}$ given as

q_{{n_{1}},{n_{2}},1-\alpha}:=\inf\big{\{}t:F_{T_{{n_{1}},{n_{2}}}}^{\pi}(t)\geq 1-\alpha\big{\}}.

It is well-known that the resulting permutation test maintains non-asymptotic type I error control under the exchangeability of random vectors (e.g., Hemerik and Goeman,, 2018, Theorem 1). This exchangeability condition is satisfied under the null hypothesis of two-sample testing where $\mathcal{Z}_{N}$ are assumed to be i.i.d. random vectors.

A more computationally efficient permutation test is defined through Monte Carlo simulations. Let $\pi_{1},\ldots,\pi_{B}$ be permutation vectors randomly drawn from $\Pi_{N}$ with replacement. We let $T_{{n_{1}},{n_{2}}}^{(1)},\ldots,T_{{n_{1}},{n_{2}}}^{(B)}$ denote the test statistics computed based on $\mathcal{Z}_{N}^{\pi_{1}},\dots,\mathcal{Z}_{N}^{\pi_{B}}$ . Let $\hat{q}_{{n_{1}},{n_{2}},1-\alpha}$ be the $1-\alpha$ quantile of the empirical distribution of $\{T_{{n_{1}},{n_{2}}},T_{{n_{1}},{n_{2}}}^{(1)},\ldots,T_{{n_{1}},{n_{2}}}^{(B)}\}$ , and reject the null when $T_{{n_{1}},{n_{2}}}>\hat{q}_{{n_{1}},{n_{2}},1-\alpha}$ . The resulting Monte Carlo-based test is also valid in finite samples (Hemerik and Goeman,, 2018, Theorem 2) and has almost equivalent power behavior as the full permutation test for sufficiently large $B$ .

3 Lack of consistency

In this section, we show that the RFF-MMD test, employing a finite number of random Fourier features, lacks pointwise consistency — i.e., it fails to fulfill the guarantee in Equation (1) — even when the underlying kernel is characteristic. We establish this inconsistency result by focusing on a permutation test based on the test statistic in Equation (4) or that in Equation (5), while our main idea is not limited to these specific tests. We start by explaining the intuition behind this negative result in Section 3.1 and then present the main results in Section 3.2.

3.1 Preliminaries and intuition

An alternative formulation of $\text{r}\widehat{\mathrm{MMD}}_{b}^{2}$ in Equation (4) is in terms of the characteristic functions of $P_{X}$ and $P_{Y}$ . This reformulation provides a key insight into our negative result in Section 3.2. To fix ideas, the squared MMD with a translation-invariant kernel $k$ can be represented as $\mathrm{MMD}^{2}(P_{X},P_{Y};\mathcal{H}_{k})=\int_{\mathbb{R}^{d}}|\phi_{X}(\omega)-\phi_{Y}(\omega)|^{2}d\Lambda(\omega)$ where $\phi_{X}$ and $\phi_{Y}$ are the characteristic functions of $P_{X}$ and $P_{Y}$ , respectively (e.g., Sriperumbudur et al.,, 2010, Corollary 4). Letting $\hat{\phi}_{X}(\omega):=\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}e^{\sqrt{-1}\omega^{\top}X_{i}}$ and $\hat{\phi}_{Y}(\omega):=\frac{1}{n_{2}}\sum_{j=1}^{n_{2}}e^{\sqrt{-1}\omega^{\top}Y_{j}}$ , we may represent the plug-in estimator in Equation (2) as

\displaystyle\widehat{\mathrm{MMD}}_{b}^{2}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\mathcal{H}_{k})=\int_{\mathbb{R}^{d}}|\hat{\phi}_{X}(\omega)-\hat{\phi}_{Y}(\omega)|^{2}d\Lambda(\omega).

With this identity in place, the RFF-MMD statistic $\text{r}\widehat{\mathrm{MMD}}_{b}^{2}$ can be regarded as an approximation of the above plug-in estimator via Monte Carlo simulations with $R$ random frequencies $\{\omega_{r}\}_{r=1}^{R}\overset{\mathrm{i.i.d.}}{\sim}\Lambda$ . Specifically, the RFF-MMD statistic can be written in terms of the empirical characteristic functions as:

\displaystyle\text{r}\widehat{\mathrm{MMD}}_{b}^{2}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\boldsymbol{\omega}_{R})=\frac{1}{R}\sum^{R}_{r=1}|\hat{\phi}_{X}(\omega_{r})-\hat{\phi}_{Y}(\omega_{r})|^{2}.

As it is well-known, the characteristic function uniquely determines the distribution of a random vector. Therefore, when the support of $\Lambda$ is the entire Euclidean space, the population MMD becomes zero if and only if $P_{X}$ and $P_{Y}$ coincide. However, the empirical MMD evaluated on a finite number of random points is unable to capture an arbitrary difference between $P_{X}$ and $P_{Y}$ , even asymptotically. At a high-level, this happens due to a combination of two factors. First of all, it is possible that two distinct characteristic functions can be equal in an interval (e.g., Romano and Siegel,, 1986, page 74). Moreover, if random evaluation points $\{\omega_{r}\}_{r=1}^{R}$ fall within such interval with high probability, then the RFF-MMD statistic would behave similarly to the null case, resulting in a test that is inconsistent with a fixed number of random features. This observation was partly made in Chwialkowski et al., (2015, Proposition 1), which we generalize to $\mathbb{R}^{d}$ as below.

Lemma 1 (Chwialkowski et al., 2015).

Let $R\in\mathbb{N}$ be a fixed number and let $\boldsymbol{\omega}_{R}=\{\omega_{r}\}^{R}_{r=1}$ be a sequence of real-valued i.i.d. random vectors from a probability distribution on $\mathbb{R}^{d}$ which is absolutely continuous with respect to the Lebesgue measure. For arbitrary $\epsilon\in(0,1)$ , there exists an uncountable set $\mathcal{A}_{\epsilon}$ of mutually distinct probability distributions on $\mathbb{R}^{d}$ such that for any distinct pair $P_{X},P_{Y}\in\mathcal{A}_{\epsilon}$ and their corresponding random vectors $X$ and $Y$ , it holds that $\mathbb{P}_{\boldsymbol{\omega}_{R}}(\frac{1}{R}\sum^{R}_{r=1}|\phi_{X}(\omega_{r})-\phi_{Y}(\omega_{r})|^{2}=0)\geq 1-\epsilon$ .

The above lemma implies that there exists a certain pair of $(P_{X},P_{Y})$ under the alternative such that the expectation of the RFF-MMD statistic (4) is approximately zero with high probability. Given that the same test statistic has an expectation approximately equal to zero under the null, one may argue that the power of an RFF-MMD test would be strictly less than one against that specific alternative. However, this argument is insufficient to correctly claim the lack of consistency. An instructive example would be the case where a test statistic $W$ is either $0$ or $1/n$ with probability $1-\alpha$ and $\alpha$ , respectively, under the null, whereas it takes the value $\alpha/n$ with probability one. In this case, it is clear to see that the expectation of $W$ remains the same under $H_{0}$ and $H_{1}$ , converging to zero as $n\rightarrow\infty$ . Nevertheless, if we reject the null when $W>0$ , the resulting test has size $\alpha$ and power one for any value of $n\geq 1$ . This toy example suggests that Lemma 1 is insufficient to formally prove the inconsistency result and we indeed need a distribution-level understanding of the RFF-MMD statistic. Moreover, when the critical value is determined via the permutation procedure (Section 2.4), we further need to take care of random sources arising from permutations, which adds an additional layer of technical challenges. With this context in place, we next develop inconsistency results by carefully studying the limiting distribution of the RFF-MMD statistic and its permuted counterpart.

3.2 Main results

Consider a permutation test based on the test statistic in Equation (4) defined as follows:

\Delta_{{n_{1},n_{2}},R}^{\alpha}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\boldsymbol{\omega}_{R}):=\Delta_{{n_{1},n_{2}},R}^{\alpha}:=\mathds{1}\big{\{}V(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\boldsymbol{\omega}_{R})>q_{{n_{1},n_{2}},1-\alpha}\big{\}},

(6)

where $q_{{n_{1}},{n_{2}},1-\alpha}:=\inf\{t:F_{V}^{\pi}(t)\geq 1-\alpha\}$ and $F_{V}^{\pi}(t)=\frac{1}{{N}!}\sum_{\pi\in\Pi_{{N}}}\mathds{1}\{V(Z_{\pi(1)},\dots,Z_{\pi({N})};\boldsymbol{\omega}_{R})\leq t\}$ . Building on the intuition laid out in Section 3.1, we aim to prove that the asymptotic power of the test $\Delta_{{n_{1}},{n_{2}},R}^{\alpha}$ is strictly less than one with a fixed number of $R$ against certain fixed alternatives. To formally achieve this, let $\boldsymbol{\psi}_{\boldsymbol{x}}$ be defined similarly as $\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}$ by replacing $\boldsymbol{\omega}_{R}$ with $\boldsymbol{x}\in\mathbb{R}^{d\times R}$ . Based on Euler’s formula, the event $\frac{1}{R}\sum^{R}_{r=1}|\phi_{X}(\omega_{r})-\phi_{Y}(\omega_{r})|^{2}=0$ is equivalent to $\boldsymbol{\omega}_{R}\in\mathcal{E}:=\mathcal{E}(X,Y)$ where

\mathcal{E}(X,Y):=\bigl{\{}\boldsymbol{x}\in\mathbb{R}^{d\times R}:\mathbb{E}_{X}[\boldsymbol{\psi}_{\boldsymbol{x}}(X)]=\mathbb{E}_{Y}[\boldsymbol{\psi}_{\boldsymbol{x}}(Y)]\bigr{\}}.

(7)

We call $\boldsymbol{\omega}_{R}\in\mathcal{E}$ as the first moment equivalence (1-ME) condition, which holds with high probability, say $1-\epsilon$ , for some fixed $(P_{X},P_{Y})$ according to Lemma 1. As mentioned earlier, the 1-ME condition alone is insufficient to formally prove the inconsistency result, which prompts an extension of the 1-ME condition to include higher-order moments. Specifically, consider a subset $\mathcal{E}_{k}\subseteq\mathcal{E}$ where $\boldsymbol{\omega}_{R}\in\mathcal{E}_{k}$ implies equivalence up to the $k$ -th moment, i.e., with denoting $\boldsymbol{i}=(i_{1},\ldots,i_{2R})\in\mathbb{R}^{2R},$

\mathcal{E}_{k}:=\biggl{\{}\boldsymbol{x}\in\mathbb{R}^{d\times R}:\mathbb{E}_{X}[\boldsymbol{\psi}_{\boldsymbol{x}}(X)^{\boldsymbol{i}}]=\mathbb{E}_{Y}[\boldsymbol{\psi}_{\boldsymbol{x}}(Y)^{\boldsymbol{i}}],~{}\forall\boldsymbol{i}~{}\text{such that}~{}\{i_{r}\}^{2R}_{r=1}\in\mathbb{N}\cup\{0\},~{}\sum^{2R}_{r=1}i_{r}\leq k\biggr{\}},

for $\boldsymbol{\psi}_{\boldsymbol{x}}(X)^{\boldsymbol{i}}:=(\boldsymbol{\psi}_{\boldsymbol{x}}(X)_{1}^{i_{1}},\ldots,\boldsymbol{\psi}_{\boldsymbol{x}}(X)_{2R}^{i_{2R}})$ and $\boldsymbol{\psi}_{\boldsymbol{x}}(Y)^{\boldsymbol{i}}:=(\boldsymbol{\psi}_{\boldsymbol{x}}(Y)_{1}^{i_{1}},\ldots,\boldsymbol{\psi}_{\boldsymbol{x}}(Y)_{2R}^{i_{2R}})$ . We refer $\boldsymbol{\omega}_{R}\in\mathcal{E}_{k}$ as the first $k$ moments equivalence (k-ME) condition. In the following proposition, we prove a generalized version of Lemma 1 demonstrating that the $k$ -ME condition holds with high probability for some fixed $(P_{X},P_{Y})$ . The proof of this result can be found in Section B.1.

Proposition 2.

Let $k,R\in\mathbb{N}$ be a fixed number and let $\boldsymbol{\omega}_{R}=\{\omega_{r}\}^{R}_{r=1}$ be a sequence of real-valued i.i.d. random vectors from a probability distribution on $\mathbb{R}^{d}$ which is absolutely continuous with respect to the Lebesgue measure. For arbitrary $\epsilon\in(0,1)$ , there exists an uncountable set $\mathcal{A}_{k,\epsilon}$ of mutually distinct probability distributions on $\mathbb{R}^{d}$ such that for any distinct pair $P_{X},P_{Y}\in\mathcal{A}_{k,\epsilon}$ and their corresponding random vectors $X$ and $Y$ , it holds that $\mathbb{P}_{\boldsymbol{\omega}_{R}}(\boldsymbol{\omega}_{R}\in\mathcal{E}_{k})\geq 1-\epsilon$ .

Suppose that $P_{X},P_{Y}\in\mathcal{A}_{k,\epsilon}$ , specified in Proposition 2. The power of the considered test against this specific alternative is then upper bounded as

	$\displaystyle\mathbb{P}(\Delta_{{n_{1}},{n_{2}},R}^{\alpha}=1)$	$\displaystyle=\int_{\mathcal{E}_{k}}\mathbb{P}(\Delta_{{n_{1}},{n_{2}},R}^{\alpha}=1\,\|\,\boldsymbol{\omega}_{R}=\boldsymbol{\omega})f_{\boldsymbol{\omega}_{R}}(\boldsymbol{\omega})d\boldsymbol{\omega}+\int_{\mathcal{E}^{c}_{k}}\mathbb{P}(\Delta_{{n_{1}},{n_{2}},R}^{\alpha}=1\,\|\,\boldsymbol{\omega}_{R}=\boldsymbol{\omega})f_{\boldsymbol{\omega}_{R}}(\boldsymbol{\omega})d\boldsymbol{\omega}$		(8)
		$\displaystyle\leq\int_{\mathcal{E}_{k}}\mathbb{P}(\Delta_{{n_{1}},{n_{2}},R}^{\alpha}=1\,\|\,\boldsymbol{\omega}_{R}=\boldsymbol{\omega})f_{\boldsymbol{\omega}_{R}}(\boldsymbol{\omega})d\boldsymbol{\omega}+\epsilon.$		(8)

Given this bound, our proof for inconsistency revolves around showing that $\int_{\mathcal{E}_{k}}\mathbb{P}(\Delta_{{n_{1}},{n_{2}},R}^{\alpha}=1\,|\,\boldsymbol{\omega}_{R}=\boldsymbol{\omega})f_{\boldsymbol{\omega}_{R}}(\boldsymbol{\omega})d\boldsymbol{\omega}$ is sufficiently small. This in turn requires understanding the limiting behavior of the test statistic $V(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\boldsymbol{\omega}_{R})$ and the permutation critical value $q_{{n_{1}},{n_{2}},1-\alpha}$ under the $k$ -ME condition. On the one hand, the limiting distribution of the test statistic can be derived using the standard asymptotic tools such as the central limit theorem. On the other hand, we leverage asymptotic results for permutation distributions in Chung and Romano, (2016) to show that the critical value $q_{{n_{1}},{n_{2}},1-\alpha}$ converges to the $1-\alpha$ quantile of a continuous distribution. We point out that both limiting and permutation distributions of $V(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\boldsymbol{\omega}_{R})$ are determined by the first two moments of $\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(X)$ and $\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Y)$ . Furthermore, both distributions become asymptotically identical when those moments are the same, implying the coincidence of both distributions under the 2-ME condition. Consequently, the power of the test $\Delta_{{n_{1}},{n_{2}},R}^{\alpha}$ under the 2-ME condition remains small even asymptotically, which together with inequality (8), leads to the inconsistency result. This negative result is formally stated in the following theorem and the proof can be found in Section B.2.

Theorem 3.

Let $k(x,y)=\kappa(x-y)$ be a bounded continuous positive definite kernel whose inverse Fourier transform is absolutely continuous with respect to the Lebesgue measure. Then, given any $\epsilon>0$ , for the test $\Delta_{{n_{1}},{n_{2}},R}^{\alpha}$ defined in Equation (6) with a fixed number $R\geq 1$ and the limiting sample-ratio $p:=\lim_{{n_{1}},{n_{2}}\rightarrow\infty}\frac{{n_{1}}}{{n_{1}}+{n_{2}}}\in(0,1)$ , there exist uncountably many pairs of distinct probability distributions $(P_{X},P_{Y})$ on $\mathbb{R}^{d}\times\mathbb{R}^{d}$ that satisfies

\lim_{{n_{1}},{n_{2}}\rightarrow\infty}\mathbb{P}_{X\times Y\times\omega}\big{(}\Delta_{{n_{1}},{n_{2}},R}^{\alpha}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\boldsymbol{\omega}_{R})=1\big{)}<\alpha+\epsilon.

The underlying idea of the proof for Theorem 3 can be applied to the unbiased RFF-MMD statistic in Equation (5) as well. In particular, consider a permutation test

\Delta_{{n_{1}},{n_{2}},R}^{\alpha,u}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\boldsymbol{\omega}_{R}):=\mathds{1}\big{\{}{U(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\boldsymbol{\omega}_{R})}>q_{{n_{1}},{n_{2}},1-\alpha}^{u}\big{\}},

(9)

where $U(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\boldsymbol{\omega}_{R}):=\text{r}\widehat{\mathrm{MMD}}_{u}^{2}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\boldsymbol{\omega}_{R})$ and $q_{{n_{1}},{n_{2}},1-\alpha}^{u}:=\inf\{t:F_{U}^{\pi}(t)\geq 1-\alpha\}$ is the corresponding critical value. Building on the observation that the difference between $U(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\boldsymbol{\omega}_{R})$ and $V(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\boldsymbol{\omega}_{R})$ is asymptotically negligible, we derive a result analogous to Theorem 3, demonstrating that $\Delta_{{n_{1}},{n_{2}},R}^{\alpha,u}$ fails to be pointwise consistent.

Corollary 4.

Consider the same setting in Theorem 3. Given any $\epsilon>0$ , for the test $\Delta_{{n_{1}},{n_{2}},R}^{\alpha,u}$ defined in Equation (9) with a fixed number $R$ and the limiting sample-ratio $p$ , there exist uncountably many pairs of distinct probability distributions $(P_{X},P_{Y})$ on $\mathbb{R}^{d}\times\mathbb{R}^{d}$ that satisfies

\lim_{{n_{1}},{n_{2}}\rightarrow\infty}\mathbb{P}_{X\times Y\times\omega}\big{(}\Delta_{{n_{1}},{n_{2}},R}^{\alpha,u}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\boldsymbol{\omega}_{R})=1\big{)}<\alpha+\epsilon.

The proof of Corollary 4 can be found in Section B.3. Our findings so far indicate that RFF-MMD tests with a fixed number of random features fail to be pointwise consistent. To address this issue, we naturally consider increasing $R$ with the sample size and show that the tests then become pointwise consistent. Moreover, in some cases, the RFF-MMD test can attain comparable power to the quadratic-time MMD test but in strictly less than quadratic time. These are the topics of the next section.

4 Optimal choice of the number of random features

We now turn to scenarios where the number of random Fourier features grows with the sample size, and examine computational and statistical trade-offs in selecting these random features. The first result of this section complements the previous inconsistency results, indicating that the RFF-MMD tests are pointwise consistent as long as the number of random Fourier features increases to infinity even at an arbitrarily slow rate.

Theorem 5.

Consider an arbitrary sequence $\{R_{n}\}_{{n}\geq 1}$ that increases as $\lim_{{n}\rightarrow\infty}R_{n}=\infty$ and assume that the kernel $k(\cdot,\cdot)$ is characteristic. Then, against any fixed alternative $(P_{X},P_{Y})\in\mathcal{P}_{1}$ , the permutation test $\Delta_{{n_{1}},{n_{2}},R}^{\alpha}$ defined in Equation (6) with $R=R_{n}$ and $n:=\min\{n_{1},n_{2}\}$ satisfies

\lim_{{n_{1}},{n_{2}}\rightarrow\infty}\mathbb{P}_{X\times Y\times\omega}\big{(}\Delta_{{n_{1}},{n_{2}},R}^{\alpha}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}})=1\big{)}=1.

This result also holds for the permutation test $\Delta_{{n_{1}},{n_{2}},R}^{\alpha,u}$ defined in Equation (9).

The proof of Theorem 5 is given in Section B.4. It is worth noting that increasing the number of random features comes with an increase in computational cost. On the other hand, using a small number of random features may lead to suboptimal power performance compared to the quadratic-time MMD test. Therefore, achieving a balance between computational costs and statistical power is crucial from a practical standpoint. To determine the number of random features that balance this time-power trade-off, we adopt the minimax testing framework pioneered by Ingster, (1987, 1993) explained below.

Minimax two-sample testing framework.

While pointwise consistency in Equation (1) is an important property, it only provides a guarantee against a fixed pair of alternative distributions, which may be regarded as a weak property. Given some constant $\beta\in(0,1)$ , one might instead aim to build a test that also uniformly bounds the probability of type II error in a non-asymptotic sense:

\sup_{(P_{X},P_{Y})\in\mathcal{P}_{1}}\mathbb{P}_{X\times Y}\big{(}\Delta^{\alpha}_{{n_{1},n_{2}}}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}})=0\big{)}\leq\beta.

In general, however, achieving this uniform guarantee is not feasible unless the two classes $\mathcal{P}_{0}$ and $\mathcal{P}_{1}$ are sufficiently distant. Therefore it is common to introduce a gap between $\mathcal{P}_{0}$ and $\mathcal{P}_{1}$ , and analyze the minimum gap for which the testing error is uniformly controlled. In detail, we define a class of alternative pairs $\mathcal{P}_{1}(\mathcal{C},\delta,\epsilon):=\{(P_{X},P_{Y})\in\mathcal{C}\,|\,\delta(P_{X},P_{Y})\geq\epsilon\}$ where $\delta$ is a metric of interest, $\mathcal{C}\subseteq\mathcal{P}$ is a predefined class of distribution pairs (if not stated otherwise, $\mathcal{C}=\mathcal{P}$ ), and $\epsilon>0$ is a separation parameter. Then the uniform separation rate that measures the performance of the test $\Delta$ is defined (e.g., Baraud,, 2002; Schrab et al.,, 2023) as

\rho\left(\Delta,\beta,\mathcal{C},\delta\right):=\inf\Big{\{}\epsilon>0:\sup_{(P_{X},P_{Y})\in\mathcal{P}_{1}(\mathcal{C},\delta,\epsilon)}\mathbb{P}_{X\times Y}\big{(}\Delta(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}})=0\big{)}\leq\beta\Big{\}}.

Among all possible level- $\alpha$ tests, it is reasonable to consider a test that achieves the smallest uniform separation as an optimal test. More formally, we define the minimax separation as

\rho^{\star}\left(\alpha,\beta,\mathcal{C},\delta\right):=\inf_{\Delta_{\alpha}}\rho\left(\Delta_{\alpha},\beta,\mathcal{C},\delta\right),

where the infimum is taken over all level- $\alpha$ tests, and we refer to the test $\Delta$ satisfying $\rho\left(\Delta,\beta,\mathcal{C},\delta\right)=\rho^{\star}\left(\alpha,\beta,\mathcal{C},\delta\right)$ as a minimax optimal test. However, except for a few parametric problems, it is generally infeasible to devise an optimal test that precisely achieves the minimax separation. As a compromise, it is now conventional to seek a minimax rate optimal test, which achieves the minimax separation, up to a constant. It has been shown that the quadratic-time MMD test is minimax rate optimal against the $L_{2}$ metric (Schrab et al.,, 2023; Li and Yuan,, 2019) and against the MMD metric (Kim,, 2021; Kim and Schrab,, 2023). Our aim is to determine the minimum number of random features $R$ for which the RFF-MMD test attains the same optimality property as the quadratic-time MMD.

4.1 Uniform consistency in $L_{2}$ metric

We start by examining the uniform separation rate of the RFF-MMD test over the Sobolev ball with respect to the $L_{2}$ distance. Let us denote $\mathcal{S}_{d}^{s}(M_{1})$ as the $s$ th order Sobolev ball in $\mathbb{R}^{d}$ with radius $M_{1}>0,$ that is

\mathcal{S}_{d}^{s}(M_{1}):=\Big{\{}f\in L^{1}\left(\mathbb{R}^{d}\right)\cap L^{2}\left(\mathbb{R}^{d}\right):\int_{\mathbb{R}^{d}}\|\omega\|_{2}^{2s}|\widehat{f}(\omega)|^{2}\mathrm{~{}d}\omega\leq(2\pi)^{d}M_{1}^{2}\Big{\}},

where $s>0$ is the smoothness parameter, and $\hat{f}(\omega)=\frac{1}{(2\pi)^{d/2}}\int_{\mathbb{R}^{2}}f(x)e^{-i\langle x,\omega\rangle}d\omega$ is the Fourier transform of $f$ . Furthermore, each of $L^{1}(\mathbb{R}^{d})$ and $L^{2}(\mathbb{R}^{d})$ denotes a set of functions that are integrable in absolute value and square integrable, respectively. Let $\mathcal{P}_{\mathrm{conti}}$ be the collection of distribution pairs on $\mathbb{R}^{d}\times\mathbb{R}^{d}$ where each pair of distributions $(P_{X},P_{Y})\in\mathcal{P}_{\mathrm{conti}}$ admits the probability density functions $(p_{X},p_{Y})$ with respect to the Lebesgue measure. Defining the class of distribution pairs with some constant $M_{2}>0$ given as

\displaystyle\widetilde{\mathcal{C}}_{L_{2}}:=\big{\{}(P_{X},P_{Y})\in\mathcal{P}_{\mathrm{conti}}\,\big{|}\,p_{X}-p_{Y}\in\mathcal{S}_{d}^{s}(M_{1}),~{}\max\{\|p_{X}\|_{\infty},\|p_{Y}\|_{\infty}\}\leq M_{2}\big{\}},

Schrab et al., (2023) demonstrated that the minimax rate in terms of the $L_{2}$ distance is $\rho^{\star}(\alpha,\beta,\widetilde{\mathcal{C}}_{L_{2}},\delta_{L_{2}})\asymp{n}^{-2s/(4s+d)}$ , where $a_{n}\asymp b_{n}$ indicates $c\leq|a_{n}/b_{n}|\leq C$ for some positive constants $c,C$ . They further showed that the MMD test using a translation-invariant kernel is minimax rate optimal in a non-asymptotic sense. A similar but asymptotic result was obtained by Li and Yuan, (2019), focusing specifically on the Gaussian kernel.¹¹1Both Schrab et al., (2023) and Li and Yuan, (2019) assume that $n_{1}\asymp n_{2}$ under which the minimax rate against the $L_{2}$ alternative is given as $(n_{1}+n_{2})^{-2s/(4s+d)}$ . Without this balanced sample size assumption, however, the minimax rate is dominated by the minimum sample size i.e., $n^{-2s/(4s+d)}$ . It is intuitively clear that when the number of random features $R$ is sufficiently large, the RFF-MMD test will also attain the same minimax optimality as the law of large numbers guarantees that the approximated kernel $\hat{k}$ converges to the underlying kernel $k$ almost surely. Our next question is then to ask how rapidly $R$ should be increased to ensure the same minimax guarantee and whether it is possible to attain the same optimality in sub-quadratic time. We answer these questions in the affirmative.

Similarly to Schrab et al., (2023), our analysis assumes that the kernel $k$ can be represented as a product of $d$ one-dimensional translation-invariant characteristic kernels with a given bandwidth $\lambda$ . More specifically, we assume that the kernel $k$ can be decomposed as

k(x,y)=k_{\lambda}(x,y):=\prod^{d}_{i=1}\frac{1}{\lambda_{i}}\kappa_{i}\left(\frac{x_{i}-y_{i}}{\lambda_{i}}\right)

for $\lambda=(\lambda_{1},\ldots,\lambda_{d})^{\top}\in(0,\infty)^{d},$ where $\kappa_{i}:\mathbb{R}\rightarrow\mathbb{R}$ are some non-negative functions in $L^{1}\left(\mathbb{R}\right)\cap L^{2}\left(\mathbb{R}\right)$ satisfying $\int_{\mathbb{R}}\kappa_{i}(x)dx=1$ for $i=1,\ldots,d$ . We note that $k_{\lambda}$ is indeed a characteristic kernel on $\mathbb{R}^{d}\times\mathbb{R}^{d}$ , and we treat the bandwidth $\lambda$ as a tuning parameter that varies with the sample size. In order to highlight the dependence on $\lambda$ , we let $\Delta_{{n_{1}},{n_{2}},R}^{\alpha,\lambda}$ (resp. $\Delta_{{n_{1}},{n_{2}},R}^{\alpha,u,\lambda}$ ) denote the test $\Delta_{{n_{1}},{n_{2}},R}^{\alpha}$ (resp. $\Delta_{{n_{1}},{n_{2}},R}^{\alpha,u}$ ) equipped with the kernel $k_{\lambda}$ . Given a constant $M_{3}>0$ , let us now consider a subset of $\widetilde{\mathcal{C}}_{L_{2}}$ where the support of individual distributions lies within the $d$ -dimensional hypercube $[-M_{3},M_{3}]^{d}$ . In other words, we define

\mathcal{C}_{L_{2}}:=\bigl{\{}(P_{X},P_{Y})\in\widetilde{\mathcal{C}}_{L_{2}}\,\big{|}\,\mathrm{support}(P_{X}),\,\mathrm{support}(P_{Y})\subset[-M_{3},M_{3}]^{d}\bigr{\}}.

Recalling $N=n_{1}+n_{2}$ and $n=\min\{n_{1},n_{2}\}$ , the following theorem discusses the choice of $R$ and $\lambda$ that allows $\Delta_{{n_{1}},{n_{2}},R}^{\alpha,\lambda}$ and $\Delta_{{n_{1}},{n_{2}},R}^{\alpha,u,\lambda}$ to achieve the minimax separation rate against the class of alternatives defined on $\mathcal{C}_{L_{2}}$ .

Theorem 6.

Consider the tests $\Delta_{{n_{1}},{n_{2}},R}^{\alpha,\lambda}$ and $\Delta_{{n_{1}},{n_{2}},R}^{\alpha,u,\lambda}$ with $\lambda_{i}={n}^{-2/(4s+d)}$ , $i=1,\ldots,d$ and $R\geq{n}^{4d/(4s+d)}$ for ${n}=\min\{{n_{1}},{n_{2}}\}$ . Then there exists some positive constant $C_{L_{2}}(M_{1},M_{2},M_{3},\alpha,\beta,d,s)$ such that the uniform separation of $\Delta_{{n_{1}},{n_{2}},R}^{\alpha,\lambda}$ satisfies

\rho\big{(}\Delta_{{n_{1}},{n_{2}},R}^{\alpha,\lambda},~{}\beta,~{}\mathcal{C}_{L_{2}},~{}\delta_{L_{2}}\big{)}\leq C_{L_{2}}(M_{1},M_{2},M_{3},\alpha,\beta,d,s){n}^{-2s/(4s+d)}.

The same guarantee also holds for $\Delta_{{n_{1}},{n_{2}},R}^{\alpha,u,\lambda}$ . Moreover, the computational cost of the corresponding test statistics $\text{r}\widehat{\mathrm{MMD}}_{b}^{2}$ and $\text{r}\widehat{\mathrm{MMD}}_{u}^{2}$ is $O(Nn^{\frac{4d}{4s+d}}d)$ .

Theorem 6, proven in Section B.5, has several interesting aspects worth highlighting. First of all, it indicates that the RFF-MMD tests can achieve the optimal separation rate ${n}^{-2s/(4s+d)}$ when $R$ is larger than ${n}^{4d/(4s+d)}$ . This in turn suggests that this optimality can be attained in sub-quadratic time when the underlying distributions are sufficiently smooth (i.e., $s\geq 3d/4$ ). Indeed, the computational time becomes linear in $N$ as $d/s\rightarrow 0$ . On the other hand, the computational complexity may need to exceed quadratic-time to achieve the minimax separation rate in non-smooth cases.

An astute reader may have realized that Theorem 6 is established for distributions on the bounded domain, which differs from the unbounded setting in the prior work (Schrab et al.,, 2023). We impose this additional constraint for analytical tractability, and in fact, the bounded domain is frequently assumed in minimax analysis (e.g., Ingster,, 1987, 1993; Arias-Castro et al.,, 2018). Nevertheless, it is important to point out that the worst-case instance used for deriving the minimax lower bound is defined on a bounded domain, say $[0,1]^{d}$ . Therefore the minimax rate remains unchanged for the bounded distributions that we consider.

4.2 Uniform consistency in MMD metric

In the previous subsection, we demonstrated that the RFF-MMD tests can achieve the minimax separation rate in sub-quadratic time complexity. It is worth pointing out that this result is presented against the Sobolev smooth $L_{2}$ alternatives, and optimal choices of the bandwidth $\lambda$ , which parameterizes the kernel $k_{\lambda}$ , and $R$ that balances between computational and statistical trade-offs may vary depending on classes of alternatives. To illustrate this point, we now turn to studying the uniform separation rate of the RFF-MMD test with respect to the MMD metric equipped with a generic kernel $k$ , and discuss the choice of $R$ that strikes the aforementioned trade-offs. Given a kernel $k$ , consider the alternative $\mathcal{P}_{1}(\mathcal{C},\delta,\epsilon)$ with a class of distribution pairs, $\mathcal{C}$ , and a MMD metric, $\delta_{\text{MMD}}(P_{X},P_{Y})=\mathrm{MMD}(P_{X},P_{Y};\mathcal{H}_{k}).$ As formally shown in Kim and Schrab, (2023), the minimax rate of testing against the MMD metric satisfies $\rho^{\star}\asymp n^{-1/2}$ where $n=\min\{{n_{1},n_{2}}\}$ . The next theorem demonstrates that the number of random features $R$ required to achieve the minimax separation rate in terms of the MMD metric is of order $N$ where recall $N=n_{1}+n_{2}$ ; thereby the overall runtime becomes $Nn$ in the sample size. The proof can be found in Section B.6.

Theorem 7.

Consider the tests $\Delta_{{n_{1}},{n_{2}},R}^{\alpha}$ and $\Delta_{{n_{1}},{n_{2}},R}^{\alpha,u}$ with kernel $k$ which is bounded as $0\leq k(x,y)\leq K$ for all $x,y\in\mathbb{R}^{d}$ . Then, the test $\Delta_{{n_{1}},{n_{2}},R}^{\alpha}$ with $R={n}=\min\{{n_{1},n_{2}}\}$ achieves the minimax separation rate, satisfying

\rho\big{(}\Delta_{{n_{1}},{n_{2}},R}^{\alpha},~{}\beta,~{}\mathcal{C},~{}\delta_{\mathrm{MMD}}\big{)}\leq C_{\mathrm{MMD}}(\alpha,\beta,K){n}^{-1/2}

for some positive constant $C_{\mathrm{MMD}}(\alpha,\beta,K),$ The same guarantee also holds for $\Delta_{{n_{1}},{n_{2}},R}^{\alpha,u}$ . Moreover, the computational cost of the corresponding test statistics $\text{r}\widehat{\mathrm{MMD}}_{b}^{2}$ and $\text{r}\widehat{\mathrm{MMD}}_{u}^{2}$ is $O(N{n}d).$

It has been commonly believed that the RFF-MMD test requires at least cubic-time complexity to match the power of a standard MMD test (e.g., Domingo-Enrich et al.,, 2023). However, Theorem 7 refutes this common belief, claiming that the RFF-MMD test can attain the same minimax separation rate with quadratic-time complexity. Indeed, we can further improve this point: when properly carving out the distributions of interest, it becomes possible to achieve the same separation rate of $n^{-1/2}$ in sub-quadratic or even linear-time complexity. To demonstrate this point, denote the U-statistic in Equation (5) with a single random Fourier feature (i.e., $R=1$ ) as $U_{1}$ . One of the crucial steps in the proof of Theorem 7 involves finding an upper bound for the expectation $\mathbb{E}_{\omega}[(\mathbb{E}_{X\times Y}[U_{1}\,|\,\omega])^{2}]$ . Since the kernel is uniformly bounded and $\mathbb{E}[U_{1}]=\mathrm{MMD}^{2}(P_{X},P_{Y};\mathcal{H}_{k})$ , the previous expectation is bounded above by $\mathrm{MMD}^{2}(P_{X},P_{Y};\mathcal{H}_{k})$ , up to a constant. Our analysis utilizes this somewhat crude, but not universally improvable, upper bound, which is the place where the quadratic-time complexity arises.

Now let us consider a subclass of distribution pairs $\mathcal{C}^{\prime}\subseteq\mathcal{C}$ . Suppose that there exist some universal constants $c\in(1,2]$ and $C>0$ such that the following inequality

\displaystyle\mathbb{E}_{\omega}\big{[}\big{(}\mathbb{E}_{X\times Y}[U_{1}\,|\,\omega]\big{)}^{2}\big{]}

\displaystyle\leq C\big{(}\mathbb{E}_{\omega}\big{[}\mathbb{E}_{X\times Y}[U_{1}\,|\,\omega]\big{]}\big{)}^{c}=C\big{(}\mathrm{MMD}^{2}(P_{X},P_{Y};\mathcal{H}_{k})\big{)}^{c}

(10)

holds for all $(P_{X},P_{Y})\in\mathcal{C}^{\prime}$ (see Remark 17.1 of the appendix for a discussion on the range of $c$ ). Against this class of alternatives $\mathcal{C}^{\prime}$ , our proof shows that the RFF-MMD test achieves ${n}^{-1/2}$ -separation rate within sub-quadratic time. Specifically, the time complexity depends on the value of $c$ in Equation (10) with a precise computational cost of $O(N{n}^{2-c})$ for $c\in(1,2]$ . As a concrete example, consider the class of pairs of Gaussian distributions with a common fixed covariance $\Sigma\in\mathbb{R}^{d\times d}$ , denoted as

\displaystyle\mathcal{C}_{N,\Sigma}:=\big{\{}(P_{X},P_{Y})\in\mathcal{P}_{\mathrm{conti}}\,\big{|}\,P_{X}=N(\mu_{X},\Sigma),~{}P_{Y}=N(\mu_{Y},\Sigma)~{}\text{where $\mu_{X},\mu_{Y}\in\mathbb{R}^{d}$}\big{\}}.

(11)

and set $\mathcal{C}^{\prime}=\mathcal{C}_{N,\Sigma}$ . For this Gaussian subclass and a generic Gaussian kernel given as

k_{\lambda}(x,y)=\prod^{d}_{i=1}\frac{1}{\sqrt{\pi}\lambda_{i}}e^{-\frac{(x_{i}-y_{i})^{2}}{\lambda_{i}^{2}}}

with bandwidth $\lambda=(\lambda_{1},\ldots,\lambda_{d})^{\top}\in(0,\infty)^{d}$ , we prove that the inequality in Equation (10) holds with the constant $c=2$ . This main building block allows us to show the following proposition, indicating that the RFF-MMD test achieves the uniform separation rate of ${n}^{-1/2}$ in linear-time complexity.

Proposition 8.

For the class of distribution pairs $\mathcal{C}_{N,\Sigma}$ and the Gaussian kernel $k_{\lambda}(x,y)$ with any fixed bandwidth $\lambda=(\lambda_{1},\ldots,\lambda_{d})^{\top}\in(0,\infty)^{d}$ , there exist some positive constants $C_{1}(\beta,d,\lambda,\Sigma)$ and $C_{2}(\alpha,\beta,d,\lambda)$ such that $\Delta_{{n_{1},n_{2}},R}^{\alpha,\lambda}$ with the choice of $R\geq C_{1}(\beta,d,\lambda,\Sigma)$ satisfies

\rho\big{(}\Delta_{{n_{1},n_{2}},R}^{\alpha,\lambda},~{}\beta,~{}\mathcal{C}_{N,\Sigma},~{}\delta_{\mathrm{MMD}}\big{)}\leq C_{2}(\alpha,\beta,d,\lambda){n}^{-1/2},

and the computational cost of the corresponding estimator $\text{r}\widehat{\mathrm{MMD}}_{b}^{2}$ is $O(Nd).$ This result also holds for the test $\Delta_{{n_{1},n_{2}},R}^{\alpha,u,\lambda}$ with the same choice of $R$ and its corresponding estimator $\text{r}\widehat{\mathrm{MMD}}_{u}^{2}.$

Proposition 8, proven in Section B.7, states that the RFF-MMD test requires only a fixed number of random features to match the uniform separation rate of the original MMD test. At first glance, this appears to contradict Theorem 3, which demonstrates the pointwise inconsistency of the test when the number of random features $R$ is fixed. However, this is not a contradiction as Proposition 8 assumes a smaller, specific class of distributions, whereas Theorem 3 considers all possible distributions. Notably, the distributions that lead to the inconsistency demonstrated in Theorem 3 do not fall within the class $\mathcal{C}_{N,\Sigma}$ .

While we focus on the class of Gaussian distributions for technical tractability, we believe that Proposition 8 holds for a broader class of distributions as evidenced by our empirical studies. It would be of great interest to further explore classes of distributions for which the RFF-MMD test offers significant computational gains over the original MMD test, while maintaining nearly the same power. We leave this topic for future work.

5 Numerical studies

In this section, we compare the empirical power and computational time of RFF-MMD tests with other computationally efficient methods such as linear-time statistics (lMMD; Gretton et al., 2012a, ; Gretton et al., 2012b, ), block-based statistics (bMMD; Zaremba et al.,, 2013), incomplete U-statistics (incMMD; Yamada et al.,, 2019; Schrab et al.,, 2022) under several different scenarios. Within each scenario, we run RFF-MMD tests with varying numbers of random features $R\in\{10,200,1000\}$ , and also run the quadratic time MMD test (Gretton et al., 2012b, ) as a benchmark for comparison. In our simulations, all kernel tests employ a Gaussian kernel with the bandwidth selected using the median heuristic. The significance level is set at $\alpha=0.05$ and the critical value of each test is determined by using permutation or bootstrap methods with $B=199$ Monte Carlo iterations. The power of each test is approximated by averaging the results over $2000$ repetitions.

The specific scenarios that we consider in our simulation studies are described as follows.

•

Scenario 1: Univariate Gaussians. Our first experiment is concerned with comparing two Gaussian distributions on $\mathbb{R}$ with a mean difference or a variance difference. Specifically, we first evaluate the performance of the methods in distinguishing $P_{X}=N(0,1)$ from $P_{Y}=N(\mu,1)$ by (i) varying $\mu$ from $0$ to $0.3$ and (ii) varying the sample sizes $n_{1}=n_{2}$ with a fixed mean difference of $\mu=0.15$ . We conducted a similar experiment to evaluate the performance of the methods in distinguishing $P_{X}=N(0,1)$ from $P_{Y}=N(0,\sigma^{2})$ by (i) varying $\sigma$ from $0.5$ to $2$ and (ii) varying the sample sizes $n_{1}=n_{2}$ with fixed variance $\sigma=1.3$ .
•

Scenario 2: High-dimensional Gaussians. We also compared the power of the tests for distinguishing two Gaussian distributions with different mean vectors or variance matrices in high-dimensional settings. For location alternatives, we let $\boldsymbol{\mu}_{0.1,20}\in\mathbb{R}^{d}$ be a vector whose first $20$ coordinates are $0.1,$ and the others are 0. We set $P_{X}=N(\boldsymbol{0}_{d},I_{d\times d})$ and $P_{Y}=N(\boldsymbol{\mu}_{0.1,20},I_{d\times d}),$ and also report the test powers by varying $d$ from $20$ to $2000$ or the sample sizes $n_{1}=n_{2}$ with fixed $d=1000$ . For scale alternatives, we set $P_{X}=N(\boldsymbol{0}_{d},I_{d\times d})$ and $P_{Y}=N(\boldsymbol{0}_{d},\sigma^{2}I_{d\times d}),$ and vary $\sigma$ from 0.95 to 1.1 or the sample sizes $n_{1}=n_{2}$ while fixing $\sigma=1.03$ .

•

Scenario 3: Perturbed uniforms. Motivated by the experiments conducted in Schrab et al., (2022, 2023); Biggs et al., (2023), we investigate the test powers for capturing perturbations in uniform distributions on $\mathbb{R}$ or $\mathbb{R}^{2}$ . Specifically, for $t\in\mathbb{R}^{d},$ we set the density of the null distribution as $f_{X}(t)=\mathds{1}_{[0,1]}(t)$ and that of the alternative as $f_{Y}(t)=\mathds{1}_{[0,1]}(t)+\alpha E_{d,p}(t)$ where $\alpha\in[0,p]$ is perturbation amplitude and $E_{d,p}(t)$ is the $d$ -dimensional perturbation function of size $p$ , defined as $E_{d,p}(t):=p^{-1}e^{d}\sum_{u\in\{1,\ldots,p\}^{d}}\prod^{d}_{i=1}\theta_{u}G(pt_{i}-u_{i})$ with $\{\theta_{u}\}_{u\in\{1,\ldots,p\}^{d}}\in\{-1,1\}^{p^{d}}.$ The perturbation shape function $G(t)$ is given by:

G(t):=\exp\left(-\frac{1}{1-(4t+3)^{2}}\right)\mathds{1}_{\left(-1,-\frac{1}{2}\right)}(t)-\exp\left(-\frac{1}{1-(4t+1)^{2}}\right)\mathds{1}_{\left(-\frac{1}{2},0\right)}(t),\quad t\in\mathbb{R}.

We set $E_{1,2}(t)$ as an one-dimensional alternative and $E_{2,1}(t)$ as a two-dimensional alternative. In this case, the perturbation amplitude $\alpha=0$ implies the null hypothesis and we consider different scenarios by varying $\alpha$ from $0$ to $0.9$ . Additionally, we fix the perturbation amplitude at $\alpha=0.6$ for $E_{1,2}(t)$ and $\alpha=0.45$ for $E_{2,1}(t),$ and vary the sample sizes $n_{1}=n_{2}$ .

•

Scenario 4: MNIST. To evaluate the performance of the methods in real-world settings, we consider a task of distinguishing between the distribution of even-number images and the distribution of odd-number images in the MNIST dataset. Each data point $z$ is an image with dimension $d=28\times 28=784$ (without downsampling) or $d=7\times 7=49$ (with downsampling), with labels $L_{z}\in\{0,1,\ldots,9\}$ . We collect the images of even numbers to define a distribution $P_{\text{even}}:=\{z:L_{z}\in\{0,2,4,6,8\}\}$ and collect the images of odd numbers to define another distribution $P_{\text{odd}}:=\{z:L_{z}\in\{1,3,5,7,9\}\}$ . Given a mixing rate $\gamma\in[0,1]$ , we set $P_{X}=P_{\text{even}}$ and $P_{Y}=(1-\gamma)P_{\text{even}}+\gamma P_{\text{odd}}$ . Accordingly, we regard the case $\gamma=0$ as the null hypothesis and vary $\gamma$ from $0$ to $0.3$ to evaluate the power performance. When we vary the sample sizes $n_{1}=n_{2}$ , we fix the mixing rate at $\gamma=0.1$ .

Refer to caption — Figure 1: Power experiments with two different settings: (i) univariate Gaussian distribution, (ii) high-dimensional Gaussian distribution. The sample sizes are set to ${n_{1}}={n_{2}}=1000$ for the first row of graphs. For the second row of graphs, parameters are set to $\mu=0.15$ in the first column, $\sigma=1.3$ in the second column, $d=1000$ in the third column, and $\sigma=1.03$ in the fourth column.

The simulation results for the first two scenarios are displayed in Figure 1, whereas the simulation results for the last two scenarios can be found in Figure 2. We first note that the test power of RFF-MMD test increases monotonically with $R$ , converging to the power of the quadratic-time MMD test. This empirically illustrates that the RFF-MMD test approximates the quadratic time MMD test as $R$ increases. Also, the empirical results demonstrate that different values of $R$ are required depending on the underlying distribution to match the power of the RFF-MMD test with that of the quadratic-time MMD test. Specifically, the RFF-MMD test matched the power of the original MMD test in all cases when $R=200$ , except in Scenario 2, where it matched when $R=1000$ . It is also worth noting that the RFF-MMD test outperforms other efficient methods in Scenarios 1 and 3, even with $R$ as small as 10.

In Scenario 2, which involves a high-dimensional Gaussian setting, we observed that the power of the RFF-MMD test drops more sharply than that of the incMMD test when the sample size is fixed and the dimension increases. Conversely, when the dimension is fixed and the sample size increases, the power of the RFF-MMD test converges to that of the quadratic time MMD test more quickly than the incMMD test. A similar phenomenon was observed in Scenario 4: as the dimension increases from downsampled MNIST to MNIST data, the power curve of the RFF-MMD test shifts downward, while the power curves of other methods show little variation for the same mixing rate. However, when fixing the mixing rate and varying the sample size, the power of the RFF-MMD test increases faster than that of the incMMD test. This can be explained by the fact that the RFF-MMD test involves kernel approximation. As the dimension increases while the number of random features remains fixed, the accuracy of the kernel approximation decreases, leading to a relatively faster decline in power compared to the incMMD test. Conversely, when the dimension is fixed and the sample size varies, the incMMD test considers only a subset of the samples for computing the test statistic, resulting in a relatively slower increase in power compared to the RFF-MMD test.

We empirically measured the computational time of the considered methods under Scenario 1, as recorded in Table 1. In the experiments, we varied the sample size from $250$ to $8000$ , with a mean difference of $\mu=0.15$ . To ensure the efficiency of the experiments, we measured the time taken to compute the test statistic once, rather than the time taken to perform the permutation test. The results were approximated by averaging over $1000$ repetitions. From Table 1, we experimentally confirmed that while the computational time of the conventional MMD increases quadratically with the sample size, the computational times of RFF-MMD and incMMD increase linearly. Additionally, the last row of Table 1 demonstrates that the time increases linearly with the number of features, which aligns with the theoretical computational time of $O(NRd)$ for RFF-MMD. We also note that similar patterns were observed in other simulation scenarios.

Table 1: Computational time (in seconds) comparisons of the considered methods under Scenario 1.

Sample

size

MMD

RFF-MMD

R=10

RFF-MMD

R=200

RFF-MMD

R=1000

incMMD

R^{\prime}=100

incMMD

R^{\prime}=200

lMMD

bMMD

b=n^{1/2}

250

0.0088

0.0002

0.0009

0.0070

0.0057

0.0084

0.0001

0.0006

500

0.0411

0.0003

0.0017

0.0130

0.0140

0.0251

0.0001

0.0019

1000

0.1946

0.0004

0.0051

0.0254

0.0325

0.0681

0.0002

0.0053

2000

0.7983

0.0006

0.0097

0.0485

0.0744

0.1497

0.0004

0.0155

4000

3.2662

0.0010

0.0192

0.0966

0.1567

0.3128

0.0007

0.0439

8000

13.247

0.0020

0.0371

0.1933

0.3189

0.6391

0.0015

0.1426

6 Discussion

In this work, we laid the theoretical foundations for kernel MMD tests using random Fourier features. Firstly, we proved that pointwise consistency is attainable if and only if the number of random Fourier features tends to infinity with the sample size. This observation naturally motivates an investigation into the optimal choice of the number of random Fourier features that strikes a balance between computational efficiency and statistical power. We explored this time-power trade-off under the minimax testing framework, and showed that it is possible to attain minimax separation rates within sub-quadratic time under certain distributional assumptions. We also validated these theoretical findings through numerical studies.

Our work opens up several promising avenues for future work. A natural extension of our work is to adapt our techniques to other kernel-based inference methods, such as the Hilbert–Schmidt independence criterion, and investigate fundamental time-power trade-offs in different applications. From a technical standpoint, it remains open whether a similar result to Theorem 6 can be obtained for distributions with unbounded supports. Future work can also attempt to extend our results in Section 4 to other metrics such as the Hellinger distance (e.g., Hagrass et al.,, 2022) and explore further improvements under other smoothness conditions. Finally, it would be of interest to consider deterministic Fourier features, which are shown to better approximate a kernel than random Fourier features (e.g., Wesel and Batselier,, 2021), and apply those in our application. We leave all these intriguing yet challenging problems to future work.

Acknowledgments

We acknowledge support from the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2022R1A4A1033384), and the Korea government (MSIT) RS-2023-00211073. We are grateful to Yongho Jeon and Gyumin Lee for their careful proofreading and helpful discussion.

References

Arias-Castro et al., (2018) Arias-Castro, E., Pelletier, B., and Saligrama, V. (2018). Remember the curse of dimensionality: The case of goodness-of-fit testing in arbitrary dimension. Journal of Nonparametric Statistics, 30(2):448–471.
Balasubramanian et al., (2021) Balasubramanian, K., Li, T., and Yuan, M. (2021). On the optimality of kernel-embedding based goodness-of-fit tests. Journal of Machine Learning Research, 22(1):1–45.
Baraud, (2002) Baraud, Y. (2002). Non-asymptotic minimax rates of testing in signal detection. Bernoulli, 8(5):577–606.
Biggs et al., (2023) Biggs, F., Schrab, A., and Gretton, A. (2023). MMD-FUSE: Learning and combining kernels for two-sample testing without data splitting. In Advances in Neural Information Processing Systems, volume 36, pages 75151–75188.
Bochner, (1933) Bochner, S. (1933). Monotone Funktionen, Stieltjessche Integrale und harmonische Analyse. Mathematische Annalen, 108(1):378–410.
Bogachev, (2007) Bogachev, V. I. (2007). Measure Theory. Springer Berlin Heidelberg.
Cevid et al., (2022) Cevid, D., Michel, L., Näf, J., Bühlmann, P., and Meinshausen, N. (2022). Distributional random forests: Heterogeneity adjustment and multivariate distributional regression. Journal of Machine Learning Research, 23(333):1–79.
Chatalic et al., (2022) Chatalic, A., Schreuder, N., Rosasco, L., and Rudi, A. (2022). Nyström kernel mean embeddings. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 3006–3024.
Chatterjee and Bhattacharya, (2023) Chatterjee, A. and Bhattacharya, B. B. (2023). Boosting the power of kernel two-sample tests. arXiv preprint arXiv:2302.10687.
Chung and Romano, (2016) Chung, E. and Romano, J. P. (2016). Multivariate and multiple permutation tests. Journal of Econometrics, 193(1):76–91.
Chwialkowski et al., (2015) Chwialkowski, K. P., Ramdas, A., Sejdinovic, D., and Gretton, A. (2015). Fast two-sample testing with analytic representations of probability measures. In Advances in Neural Information Processing Systems, volume 28, pages 1981–1989.
Domingo-Enrich et al., (2023) Domingo-Enrich, C., Dwivedi, R., and Mackey, L. (2023). Compress then test: Powerful kernel testing in near-linear time. In International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, pages 1174–1218.
Dwivedi and Mackey, (2021) Dwivedi, R. and Mackey, L. (2021). Kernel thinning. In Proceedings of Thirty Fourth Conference on Learning Theory, volume 134 of Proceedings of Machine Learning Research, pages 1753–1753.
Fromont et al., (2013) Fromont, M., Laurent, B., and Reynaud-Bouret, P. (2013). The two-sample problem for Poisson processes: Adaptive tests with a nonasymptotic wild bootstrap approach. The Annals of Statistics, 41(3):1431–1461.
(15) Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. (2012a). A kernel two-sample test. Journal of Machine Learning Research, 13(25):723–773.
(16) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., and Sriperumbudur, B. K. (2012b). Optimal kernel choice for large-scale two-sample tests. In Advances in Neural Information Processing Systems, volume 25, pages 1214–1222.
Guo and Shah, (2024) Guo, F. R. and Shah, R. D. (2024). Rank-transformed subsampling: inference for multiple data splitting and exchangeable p-values. arXiv preprint arXiv:2301.02739.
Hagrass et al., (2022) Hagrass, O., Sriperumbudur, B. K., and Li, B. (2022). Spectral regularized kernel two-sample tests. arXiv preprint arXiv:2212.09201.
Hemerik and Goeman, (2018) Hemerik, J. and Goeman, J. (2018). Exact testing with random permutations. Test, 27(4):811–825.
Ingster, (1987) Ingster, Y. I. (1987). Minimax testing of nonparametric hypotheses on a distribution density in the $L_{p}$ metrics. Theory of Probability & Its Applications, 31(2):333–337.
Ingster, (1993) Ingster, Y. I. (1993). Asymptotically minimax hypothesis testing for nonparametric alternatives. i, ii, iii. Mathematical Methods of Statistics, 2(2):85–114.
Jitkrittum et al., (2016) Jitkrittum, W., Szabó, Z., Chwialkowski, K. P., and Gretton, A. (2016). Interpretable distribution features with maximum testing power. In Advances in Neural Information Processing Systems, volume 29, pages 181–189.
Kim, (2021) Kim, I. (2021). Comparing a large number of multivariate distributions. Bernoulli, 27(1):419–441.
Kim et al., (2022) Kim, I., Balakrishnan, S., and Wasserman, L. (2022). Minimax optimality of permutation tests. The Annals of Statistics, 50(1):225–251.
Kim and Ramdas, (2024) Kim, I. and Ramdas, A. (2024). Dimension-agnostic inference using cross U-statistics. Bernoulli, 30(1):683–711.
Kim and Schrab, (2023) Kim, I. and Schrab, A. (2023). Differentially private permutation tests: Applications to kernel methods. arXiv preprint arXiv:2310.19043.
Kirchler et al., (2020) Kirchler, M., Khorasani, S., Kloft, M., and Lippert, C. (2020). Two-sample testing using deep learning. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 1387–1398.
Lee, (1990) Lee, J. (1990). U-statistics: Theory and Practice. CRC Press.
Lehmann and Romano, (2006) Lehmann, E. and Romano, J. (2006). Testing Statistical Hypotheses. Springer New York.
Li and Yuan, (2019) Li, T. and Yuan, M. (2019). On the optimality of gaussian kernel based nonparametric tests against smooth alternatives. arXiv preprint arXiv:1909.03302.
Liu et al., (2022) Liu, F., Huang, X., Chen, Y., and Suykens, J. A. K. (2022). Random Features for Kernel Approximation: A Survey on Algorithms, Theory, and Beyond. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10):7128–7148.
Liu et al., (2020) Liu, F., Xu, W., Lu, J., Zhang, G., Gretton, A., and Sutherland, D. J. (2020). Learning deep kernels for non-parametric two-sample tests. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 6316–6326.
Pólya, (1949) Pólya, G. (1949). Remarks on characteristic functions. In Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability, pages 115–123.
Rahimi and Recht, (2007) Rahimi, A. and Recht, B. (2007). Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, volume 20, pages 1177–1184.
Ramdas et al., (2015) Ramdas, A., Jakkam Reddi, S., Poczos, B., Singh, A., and Wasserman, L. (2015). On the decreasing power of kernel and distance based nonparametric hypothesis tests in high dimensions. Proceedings of the AAAI Conference on Artificial Intelligence, 29(1):3571–3577.
Romano and Siegel, (1986) Romano, J. P. and Siegel, A. F. (1986). Counterexamples in Probability and Statistics. CRC Press.
Schrab et al., (2023) Schrab, A., Kim, I., Albert, M., Laurent, B., Guedj, B., and Gretton, A. (2023). MMD aggregated two-sample test. Journal of Machine Learning Research, 24(194):1–81.
Schrab et al., (2022) Schrab, A., Kim, I., Guedj, B., and Gretton, A. (2022). Efficient Aggregated Kernel Tests using Incomplete U-statistics. In Advances in Neural Information Processing Systems, volume 35, pages 18793–18807.
Shekhar et al., (2023) Shekhar, S., Kim, I., and Ramdas, A. (2023). A permutation-free kernel independence test. Journal of Machine Learning Research, 24(369):1–68.
Sriperumbudur and Szabo, (2015) Sriperumbudur, B. and Szabo, Z. (2015). Optimal rates for random fourier features. In Advances in Neural Information Processing Systems, volume 28, page 1144–1152.
Sriperumbudur et al., (2010) Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Schölkopf, B., and Lanckriet, G. R. (2010). Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11(50):1517–1561.
Stolte et al., (2023) Stolte, M., Bommert, A., and Rahnenführer, J. (2023). A Review and Taxonomy of Methods for Quantifying Dataset Similarity. arXiv preprint arXiv:2312.04078.
Sutherland and Schneider, (2015) Sutherland, D. J. and Schneider, J. (2015). On the error of random fourier features. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, page 862–871.
Sutherland et al., (2017) Sutherland, D. J., Tung, H.-Y., Strathmann, H., De, S., Ramdas, A., Smola, A., and Gretton, A. (2017). Generative models and model criticism via optimized maximum mean discrepancy. In International Conference on Learning Representations.
Wesel and Batselier, (2021) Wesel, F. and Batselier, K. (2021). Large-scale learning with fourier features and tensor decompositions. In Advances in Neural Information Processing Systems, volume 34, pages 17543–17554.
Yamada et al., (2019) Yamada, M., Wu, D., Tsai, Y. H., Ohta, H., Salakhutdinov, R., Takeuchi, I., and Fukumizu, K. (2019). Post selection inference with incomplete maximum mean discrepancy estimator. In International Conference on Learning Representations.
Yao et al., (2023) Yao, J., Erichson, N. B., and Lopes, M. E. (2023). Error estimation for random fourier features. In Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, pages 2348–2364.
Zaremba et al., (2013) Zaremba, W., Gretton, A., and Blaschko, M. (2013). B-test: A non-parametric, low variance kernel two-sample test. In Advances in Neural Information Processing Systems, volume 26, page 755–763.
Zhao and Meng, (2015) Zhao, J. and Meng, D. (2015). FastMMD: Ensemble of circular discrepancy for efficient two-sample test. Neural Computation, 27(6):1345–1372.

Appendix A Technical lemmas

In this section, we collect technical lemmas used in the main proofs of our results.

Lemma 9 (Bochner,, 1933, Bochner’s theorem,).

A translation-invariant bounded continuous kernel $k(x,y)=\kappa(x-y)$ on $\mathbb{R}^{d}$ is positive definite if and only if there exists a finite non-negative Borel measure $\Lambda$ on $\mathbb{R}^{d}$ such that

k(x,y)=\int_{\mathbb{R}^{d}}e^{\sqrt{-1}\omega^{\top}(x-y)}d\Lambda(\omega).

The following result is commonly known as Young’s convolution inequality.

Lemma 10 (Bogachev,, 2007, Theorem 3.9.4).

Let $p$ and $q$ be real numbers such that $1\leq p,q\leq\infty,$ $1/p+1/q=1+1/r.$ Then, for any functions $f\in L_{p}(\mathbb{R}^{d})$ and $g\in L_{q}(\mathbb{R}^{d})$ ,

\|f\ast g\|_{r}\leq\|f\|_{p}\|g\|_{q}.

We next collect useful asymptotic tools from Chung and Romano, (2016) to analyze the limiting behavior of permutation distributions.

Lemma 11 (Chung and Romano, 2016, Lemma A.2).

Suppose $X^{n}=(X_{1},\ldots,X_{n})$ has distribution $P_{n}$ in $\mathcal{X}_{n},$ and $\boldsymbol{G}_{n}$ is a finite group of transformations $g$ of $\mathcal{X}_{n}$ onto itself. Also, let $G_{n}$ be a random variable that is uniform on $\boldsymbol{G}_{n}$ . Assume $X^{n}$ and $G_{n}$ are mutually independent. For a $d$ -dimensional test statistic $B_{n}=B_{n}(X^{n}),$ let $\hat{R}_{n}^{B}$ denote the randomization distributions of a $d$ -dimensional random vector $B_{n},$ defined by

\displaystyle\hat{R}^{B}_{n}(t)=\frac{1}{|\boldsymbol{G}_{n}|}\sum_{g\in\boldsymbol{G}_{n}}\mathds{1}\{B_{n}(gX^{n})\leq t\}.

(12)

Suppose, under $P_{n},$

\displaystyle B_{n}(G_{n}X^{n})\xrightarrow[]{p}b

(13)

for a constant $b\in\mathbb{R}^{d}.$ Then under $P_{n},$

\hat{R}^{B}_{n}(t)=\frac{1}{|\boldsymbol{G}_{n}|}\sum_{g\in\boldsymbol{G}_{n}}\mathds{1}\{B_{n}(gX^{n})\leq t\}\xrightarrow[]{p}\delta_{b}(t)\quad\text{if}~{}t\neq b,

where $\delta_{c}$ denotes the distribution function corresponding to the point mass function at $c\in\mathbb{R}^{d}.$

Lemma 12 (Chung and Romano, 2016, Lemma A.3).

Let $B_{n}$ and $T_{n}$ be sequences of $d$ -dimensional random variables satisfying Equation (13) and

(T_{n}(G_{n}X^{n}),T_{n}(G^{\prime}_{n}X^{n}))\xrightarrow{d}(T,T^{\prime}),

where $T$ and $T^{\prime}$ are independent, each with common $d$ -variate cumulative distribution function $R^{T}(\cdot).$ Let $\hat{R}^{T+B}_{n}$ (t) denote the randomization distribution of $T_{n}+B_{n},$ defined in Equation (12) with $B$ replaced by $T+B.$ Then, $\hat{R}^{T+B}_{n}(t)$ converges to the cumulative distribution function of $T+b$ in probability. In other words,

\hat{R}^{T+B}_{n}(t)=\frac{1}{|\boldsymbol{G}_{n}|}\sum_{g\in\boldsymbol{G}_{n}}\mathds{1}\{T_{n}(gX^{n})+B_{n}(gX^{n})\leq t\}\xrightarrow{p}R^{T+b}(t),

if $R^{T+b}$ is continuous at $t\in\mathbb{R}^{d},$ where $R^{T+b}(\cdot)$ denotes the corresponding $d$ -variate cumulative distribution function of $T+b.$

Lemma 13 (Chung and Romano, 2016, Lemma A.6).

Suppose the randomization distribution of a test statistic $T_{n}$ converges to $T$ in probability. In other words,

\hat{R}^{T}_{n}(t)=\frac{1}{|\boldsymbol{G}_{n}|}\sum_{g\in\boldsymbol{G}_{n}}\mathds{1}\{T_{n}(gX^{n})\leq t\}\xrightarrow{p}R^{T}(t),

if $R^{T}$ is continuous at $t\in\mathbb{R}^{d},$ where $R^{T}(\cdot)$ denotes the corresponding cumulative distribution function of $t$ . Let $h$ be a measurable map from $\mathbb{R}^{d}$ to $\mathbb{R}^{s}.$ Let $C$ be the set of points in $\mathbb{R}^{d}$ for which $h$ is continuous. If $\mathbb{P}(T\in C)=1,$ then the randomization distribution of $h(T_{n})$ converges $h(T)$ in probability.

Lemma 14 (Chung and Romano, 2016, Theorem 2.1).

Suppose that $X_{1},\ldots,X_{n_{1}}$ are ${n_{1}}$ i.i.d. random samples from the $d$ -dimensional distribution $P_{X}$ , where $X_{i}=(X_{i,1},\ldots,X_{i,d})^{\top}$ for $i=1,\ldots,n_{1}$ with mean vector $\mu$ and covariance matrix $\Sigma_{X}$ , and inedpendently, $Y_{1},\ldots,Y_{n_{2}}$ be ${n_{2}}$ i.i.d. random samples from the $d$ -dimensional distribution $P_{Y}$ , where $Y_{i}=(Y_{i,1},\ldots,Y_{i,d})^{\top}$ for $i=1,\ldots,n_{2}$ with the common mean vector $\mu$ and covariance matrix $\Sigma_{Y}.$ Let $N={n_{1}}+{n_{2}}$ and write $Z=(Z_{1},\ldots,Z_{N})=(X_{1},\ldots,X_{n_{1}},Y_{1},\ldots,Y_{n_{2}}).$ Consider a test statistic $T_{{n_{1}},{n_{2}}}(Z_{1},\ldots,Z_{N})={n_{1}}^{-1/2}\big{[}\sum_{i=1}^{n_{1}}X_{i}-\frac{n_{1}}{n_{2}}\sum_{j=1}^{n_{2}}Y_{i}\big{]}$ and its permutation distribution

\hat{R}^{T}_{{n_{1}},{n_{2}}}(t)=\frac{1}{N!}\sum_{\pi\in\boldsymbol{G}_{N}}\mathds{1}\{T_{{n_{1}},{n_{2}}}(Z_{\pi(1)},\ldots,Z_{\pi(N)})\leq t\}

where $\boldsymbol{G}_{N}$ denotes the $N!$ permutations of $\{1,2,\ldots,N\}$ and $t\in\mathbb{R}^{d^{\prime}}$ . Assume $0<\operatorname{Var}(X_{i,k})<\infty$ and $0<\operatorname{Var}(Y_{j,k})<\infty$ for all $i=1,\ldots,{n_{1}},$ $j=1,\ldots,{n_{2}},$ and $k=1,\ldots,d.$ Let ${n_{1}},{n_{2}}\rightarrow\infty,$ $p=\lim_{{n_{1}},{n_{2}}\rightarrow\infty}\frac{{n_{1}}}{{n_{1}}+{n_{2}}}$ and assume that $\bar{\Sigma}=\frac{p}{1-p}\Sigma_{X}+\Sigma_{Y}$ is positive definite. Then,

\sup_{t\in R^{d^{\prime}}}\big{|}\hat{R}^{T}_{{n_{1}},{n_{2}}}(t)-G(t)\big{|}\xrightarrow{p}0,

where $G$ denotes the $d$ -variate normal distribution with mean $\boldsymbol{0}$ and variance $\bar{\Sigma}.$

The following result is a slight modification of Ramdas et al., (2015, Proposition 1) tailored to our kernel setting.

Lemma 15 (Ramdas et al., 2015, Proposition 1).

Suppose $P_{X}=N(\mu_{1},\Sigma)$ and $P_{Y}=N(\mu_{2},\Sigma)$ . The squared MMD between $P_{X}$ and $P_{Y}$ using a Gaussian kernel $\kappa_{\lambda}(x-y)=\prod^{d}_{i=1}\frac{1}{\sqrt{\pi}\lambda_{i}}\exp{\bigl{\{}-\frac{(x_{i}-y_{i})^{2}}{\lambda_{i}^{2}}\bigr{\}}}$ has the following explicit form:

\mathrm{MMD}^{2}(P_{X},P_{Y};\mathcal{H}_{k})=2\left(\frac{1}{4\pi}\right)^{d/2}\frac{1-\exp\big{\{}-(\mu_{X}-\mu_{Y})^{\top}\left(\Sigma+D(\lambda^{2}/4)\right)^{-1}(\mu_{X}-\mu_{Y})/4\big{\}}}{\left|\Sigma+D(\lambda^{2}/4)\right|^{1/2}},

where $D(\lambda^{2}/4)=\mathrm{diag}(\lambda_{1}^{2}/4,\ldots,\lambda_{d}^{2}/4).$

The next lemma facilitates the calculation of $\mathbb{E}_{\omega}\big{[}\big{(}\mathbb{E}_{X\times Y}[U_{1}\,|\,\omega]\big{)}^{2}\big{]}$ . The proof can be found in Appendix B.8.

Lemma 16.

Let $X^{\prime},X^{\prime\prime}$ and $Y^{\prime},Y^{\prime\prime}$ be the independent copies of $X$ and $Y$ , respectively. Then, the following two equations hold:

	$\displaystyle\mathbb{E}_{\omega}\big{[}\big{(}\mathbb{E}_{X\times Y}[U_{1}\,\|\,\omega]\big{)}^{2}\big{]}$	$\displaystyle=2\kappa(0)\mathrm{MMD}^{2}(P_{X-X^{\prime}},P_{X^{\prime\prime}-Y};\mathcal{H}_{k})+2\kappa(0)\mathrm{MMD}^{2}(P_{Y-Y^{\prime}},P_{X-Y^{\prime\prime}};\mathcal{H}_{k})$
		$\displaystyle\quad-\kappa(0)\mathrm{MMD}^{2}(P_{X+X^{\prime}},P_{Y+Y^{\prime}};\mathcal{H}_{k})\quad\text{and}$
	$\displaystyle\mathbb{E}_{\omega}\big{[}\big{(}\mathbb{E}_{X\times Y}[U_{1}\,\|\,\omega]\big{)}^{2}\big{]}$	$\displaystyle=2\kappa(0)\mathrm{MMD}^{2}(P_{X+X^{\prime}},P_{X^{\prime\prime}+Y};\mathcal{H}_{k})+2\kappa(0)\mathrm{MMD}^{2}(P_{Y+Y^{\prime}},P_{X+Y^{\prime\prime}};\mathcal{H}_{k})$
		$\displaystyle\quad-\kappa(0)\mathrm{MMD}^{2}(P_{X+X^{\prime}},P_{Y+Y^{\prime}};\mathcal{H}_{k}).$

The next lemma serve as a main block in the proof of Propositions 8. The proof can be found in Appendix B.9.

Lemma 17.

For the class of distribution pairs, $\mathcal{C}_{N,\Sigma}$ , in Equation (11) and the Gaussian kernel $k_{\lambda}(x,y)$ with any fixed bandwidth $\lambda=(\lambda_{1},\ldots,\lambda_{d})^{\top}\in(0,\infty)^{d}$ , the inequality in Equation (10) holds with $c=2$ . Specifically, there exists a constant $C=C(d,\lambda,\Sigma)>0$ such that

\displaystyle\mathbb{E}_{\omega}\big{[}\big{(}\mathbb{E}_{X\times Y}[U_{1}\,|\,\omega]\big{)}^{2}\big{]}

\displaystyle\leq C\big{(}\mathbb{E}_{\omega}\big{[}\mathbb{E}_{X\times Y}[U_{1}\,|\,\omega]\big{]}\big{)}^{c}=C(d,\lambda,\Sigma)\big{(}\mathrm{MMD}^{2}(P_{X},P_{Y};\mathcal{H}_{k})\big{)}^{2}.

Remark 17.1.

Regarding the range of $c\in(1,2]$ in Equation (10), note that when $c>2$ , the inequality in Equation (10) yields $\mathrm{MMD}\gtrsim 1$ and this is not of our interest (in fact, this condition becomes vacuous since $\mathrm{MMD}$ using a bounded kernel is bounded above by a constant). More specifically, by Jensen’s inequality, we have $\mathrm{MMD}^{4}\leq\mathbb{E}_{\omega}\big{[}\big{(}\mathbb{E}_{X\times Y}[U_{1}\,|\,\omega]\big{)}^{2}\big{]}$ and then the inequality in Equation (10) implies

		$\displaystyle\mathrm{MMD}^{4}(P_{X},P_{Y};\mathcal{H}_{k_{\lambda}})\leq C\mathrm{MMD}^{2c}(P_{X},P_{Y};\mathcal{H}_{k_{\lambda}})$
	$\displaystyle\equiv~{}$	$\displaystyle\mathrm{MMD}^{2(2-c)}(P_{X},P_{Y};\mathcal{H}_{k_{\lambda}})\leq C$
	$\displaystyle\equiv~{}$	$\displaystyle\frac{1}{C^{\frac{1}{2(c-2)}}}=C^{\prime}\leq\mathrm{MMD}(P_{X},P_{Y};\mathcal{H}_{k_{\lambda}}).$

Therefore we may see some computational gain when $c>2$ but it is only possible when the minimum separation is of constant order as $\mathrm{MMD}\gtrsim 1$ .

Appendix B Proofs

Notation and terminology.

We start by organizing the notation and the terminology we use throughout this appendix. Unless explicitly stated otherwise, the symbol $\mathbb{P}(\cdot)$ denotes the probability measure that takes into account all inherent uncertainties. In addition, we represent constants as $C_{1},C_{2},\ldots$ , which may depend on “fixed” parameters such as $M_{1},M_{2},M_{3},\alpha,\beta,d,s$ that do not vary with the sample sizes ${n_{1}}$ and ${n_{2}}$ . The specific values of these constants may vary in different places. We use the notation $A_{n}\xrightarrow{p}A$ to denote that the sequence of the random variables $A_{n}$ converges in probability to a random variable $A.$ We also introduce a terminology for the convergence of permutation distributions. For a given generic test statistic $T_{{n_{1}},{n_{2}}}$ and a continuous random variable $G$ , denote the permutation distribution of $T_{{n_{1}},{n_{2}}}$ as $F^{\pi}_{T_{{n_{1}},{n_{2}}}}(\cdot):=\frac{1}{N!}\sum_{\pi\in\Pi_{N}}\mathds{1}\{T_{{n_{1}},{n_{2}}}(Z_{\pi(1)},\dots,Z_{\pi(N)})\leq\cdot\}$ and the cumulative distribution function (CDF) of $G$ as $F_{G}(\cdot):=\mathbb{P}(G\leq\cdot)$ . Suppose that

\sup_{t\in\mathbb{R}^{d}}\big{|}F^{\pi}_{T_{{n_{1}},{n_{2}}}}(t)-F_{G}(t)\big{|}\xrightarrow{p}0,

or equivalently, for any given $\epsilon>0,$

\lim_{{n_{1}},{n_{2}}\rightarrow\infty}\mathbb{P}\bigg{(}\sup_{t\in\mathbb{R}^{d}}\big{|}F^{\pi}_{T_{{n_{1}},{n_{2}}}}(t)-F_{G}(t)\big{|}>\epsilon\bigg{)}=0.

In this case, we say that $F^{\pi}_{T_{{n_{1}},{n_{2}}}}$ converges weakly in probability to $G$ , as in Chung and Romano, (2016). Also, if a sequence of random variables $\{H_{n}\}_{n=1}^{\infty}$ converges in distribution to a continuous random variable $H$ , we use the expression that aligns with the above:

\sup_{t\in\mathbb{R}^{d}}\big{|}F_{H_{n}}(t)-F_{H}(t)\big{|}\rightarrow 0,

instead of $H_{n}\xrightarrow{d}H.$ Note that Pólya’s theorem can be generalized into the multivariate case (Guo and Shah,, 2024, Lemma C.7) and thus guarantees the equivalence between those two expressions under the assumption that $H$ is continuous.

B.1 Proof of Proposition 2

For simplicity, we consider the case where $d=1$ , as the scenario with $d\geq 2$ can be extended naturally by taking the Cartesian product of the one-dimensional cases. Also, we consider the case $k=2$ , since the case $k=1$ is identical to Lemma 1, and the logic used for $k=2$ can be extended to prove the cases for $k\geq 3$ .

Now, suppose that $k=2$ . Given frequencies $\boldsymbol{\omega}_{R}=\{\omega_{1},\ldots,\omega_{R}\},$ recall that the feature mapping is defined as

\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(x)=[\cos(\omega_{1}^{\top}x),\sin(\omega_{1}^{\top}x),~{}\ldots~{},\cos(\omega_{R}^{\top}x),\sin(\omega_{R}^{\top}x)]^{\top}\in\mathbb{R}^{2R},

and $\mathcal{E}_{2}$ can be written as

\mathcal{E}_{2}:=\bigl{\{}\boldsymbol{x}\in\mathbb{R}^{d\times R}:\mathbb{E}_{X}[\boldsymbol{\psi}_{\boldsymbol{x}}(X)]=\mathbb{E}_{Y}[\boldsymbol{\psi}_{\boldsymbol{x}}(Y)],~{}\mathbb{E}_{X}[\boldsymbol{\psi}_{\boldsymbol{x}}(X)\boldsymbol{\psi}_{\boldsymbol{x}}(X)^{\top}]=\mathbb{E}_{Y}[\boldsymbol{\psi}_{\boldsymbol{x}}(Y)\boldsymbol{\psi}_{\boldsymbol{x}}(Y)^{\top}]\bigr{\}}.

Then, the components of the second moment matrix of $\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(X)$ should be one of the following terms:

	$\displaystyle\mathbb{E}_{X}[\cos(\omega_{i}^{\top}X)\cos(\omega_{j}^{\top}X)],$	$\displaystyle\quad i,j=1,\ldots,R,\quad\text{or}$
	$\displaystyle\mathbb{E}_{X}[\cos(\omega_{i}^{\top}X)\sin(\omega_{j}^{\top}X)],$	$\displaystyle\quad i,j=1,\ldots,R,\quad\text{or}$
	$\displaystyle\mathbb{E}_{X}[\sin(\omega_{i}^{\top}X)\sin(\omega_{j}^{\top}X)],$	$\displaystyle\quad i,j=1,\ldots,R.$

Similarly, the same argument holds true for $\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Y)$ . Hence, to show that $\mathbb{E}_{X}[\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(X)\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(X)^{\top}]=\mathbb{E}_{Y}[\boldsymbol{\psi}_{\boldsymbol{x}}(Y)\boldsymbol{\psi}_{\boldsymbol{x}}(Y)^{\top}]$ holds, it is enough to show that the following identities are satisfied simultaneously:

	$\displaystyle\mathbb{E}_{X}[\cos(\omega_{i}^{\top}X)\cos(\omega_{j}^{\top}X)]$	$\displaystyle=\mathbb{E}_{Y}[\cos(\omega_{i}^{\top}Y)\cos(\omega_{j}^{\top}Y)],$
	$\displaystyle\mathbb{E}_{X}[\cos(\omega_{i}^{\top}X)\sin(\omega_{j}^{\top}X)]$	$\displaystyle=\mathbb{E}_{Y}[\cos(\omega_{i}^{\top}Y)\sin(\omega_{j}^{\top}Y)],$
	$\displaystyle\mathbb{E}_{X}[\sin(\omega_{i}^{\top}X)\sin(\omega_{j}^{\top}X)]$	$\displaystyle=\mathbb{E}_{Y}[\sin(\omega_{i}^{\top}Y)\sin(\omega_{j}^{\top}Y)],$

for all $i,j=1,\ldots,R.$ By trigonometric identities, these identities are equivalent to

	$\displaystyle\mathbb{E}_{X}\big{[}\cos\big{(}(\omega_{i}-\omega_{j})^{\top}X\big{)}+\cos\big{(}(\omega_{i}+\omega_{j})^{\top}X\big{)}\big{]}$	$\displaystyle=\mathbb{E}_{Y}\big{[}\cos\big{(}(\omega_{i}-\omega_{j})^{\top}Y\big{)}+\cos\big{(}(\omega_{i}+\omega_{j})^{\top}Y\big{)}\big{]},$
	$\displaystyle\mathbb{E}_{X}\big{[}\sin\big{(}(\omega_{i}+\omega_{j})^{\top}X\big{)}-\sin\big{(}(\omega_{i}-\omega_{j})^{\top}X\big{)}\big{]}$	$\displaystyle=\mathbb{E}_{Y}\big{[}\sin\big{(}(\omega_{i}+\omega_{j})^{\top}Y\big{)}-\sin\big{(}(\omega_{i}-\omega_{j})^{\top}Y\big{)}\big{]},$
	$\displaystyle\mathbb{E}_{X}\big{[}\cos\big{(}(\omega_{i}-\omega_{j})^{\top}X\big{)}-\cos\big{(}(\omega_{i}+\omega_{j})^{\top}X\big{)}\big{]}$	$\displaystyle=\mathbb{E}_{Y}\big{[}\cos\big{(}(\omega_{i}-\omega_{j})^{\top}Y\big{)}-\cos\big{(}(\omega_{i}+\omega_{j})^{\top}Y\big{)}\big{]},$

for all $i,j=1,\ldots,R.$ Hence, if we show that the following identities

	$\displaystyle\mathbb{E}_{X}\big{[}\cos\big{(}(\omega_{i}+\omega_{j})^{\top}X\big{)}\big{]}$	$\displaystyle=\mathbb{E}_{Y}\big{[}\cos\big{(}(\omega_{i}+\omega_{j})^{\top}Y\big{)}\big{]},$
	$\displaystyle\mathbb{E}_{X}\big{[}\sin\big{(}(\omega_{i}+\omega_{j})^{\top}X\big{)}\big{]}$	$\displaystyle=\mathbb{E}_{Y}\big{[}\sin\big{(}(\omega_{i}+\omega_{j})^{\top}Y\big{)}\big{]},$
	$\displaystyle\mathbb{E}_{X}\big{[}\cos\big{(}(\omega_{i}-\omega_{j})^{\top}X\big{)}\big{]}$	$\displaystyle=\mathbb{E}_{Y}\big{[}\cos\big{(}(\omega_{i}-\omega_{j})^{\top}Y\big{)}\big{]},$
	$\displaystyle\mathbb{E}_{X}\big{[}\sin\big{(}(\omega_{i}-\omega_{j})^{\top}X\big{)}\big{]}$	$\displaystyle=\mathbb{E}_{Y}\big{[}\sin\big{(}(\omega_{i}-\omega_{j})^{\top}Y\big{)}\big{]}$

hold for all $i,j=1,\ldots,R$ given $\boldsymbol{\omega}_{R},$ the second moment matrices of $\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(X)$ and $\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Y)$ become identical. Also, note that satisfying the first two identities is equivalent to the coincidence of characteristic functions $\phi_{X}(\omega)$ and $\phi_{Y}(\omega)$ at the point $\omega_{i}+\omega_{j},$ and the last two identities imply the coincidence of $\phi_{X}(\omega)$ and $\phi_{Y}(\omega)$ at the point $\omega_{i}-\omega_{j}.$ Let us denote $\omega_{i}+\omega_{j}$ and $\omega_{i}-\omega_{j}$ as $\omega_{iR+j}$ and $\omega_{2iR+j}$ , respectively, for all $i,j=1,\ldots,R$ , and then, a sufficient condition for $\boldsymbol{\omega}_{R}\in\mathcal{E}_{2}$ is $\phi_{X}(\omega_{r})=\phi_{Y}(\omega_{r})$ for all $r=1,\ldots,2R^{2}+R.$ Now, observe that all random variables $\{\omega_{r}\}^{2R^{2}+R}_{r=1}$ have continuous probability distributions, and thus, for any $\epsilon>0$ , we can find $I=I(\epsilon)>0$ satisfying $\mathbb{P}(\omega_{r}\in[-I,I])\leq(2R^{2}+R)^{-1}\epsilon$ for all $r=1,\ldots,2R^{2}+R.$ Then,

	$\displaystyle\mathbb{P}(\omega_{1},\ldots,\omega_{2R^{2}+R}\in[-I,I]^{c})$	$\displaystyle\geq 1-\sum^{2R^{2}+R}_{r=1}\mathbb{P}(\omega_{r}\in[-I,I])$
		$\displaystyle\geq 1-\epsilon.$

Here, according to Pólya’s criterion (Pólya,, 1949, Theorem 1), we can find uncountably many characteristic functions that vanish outside the interval $[-I,I],$ and let us denote this set as $\mathcal{A}_{k,\epsilon}.$ A representative example of these functions (see e.g., Chwialkowski et al.,, 2015, Proposition 1) is a set $\{f_{\delta}\}_{\delta>I^{-1}}$ where

\displaystyle f_{\delta}(\omega)=\left\{\begin{array}[]{lll}1-\delta|\omega|&\text{ when }&|\omega|\leq\frac{1}{\delta},\\[5.0pt] 0&\text{ when }&|\omega|\geq\frac{1}{\delta}.\end{array}\right.

Then, for any distribution pair $P_{X},P_{Y}\in\mathcal{A}_{k,\epsilon},$ we have

	$\displaystyle\mathbb{P}_{\boldsymbol{\omega}_{R}}(\boldsymbol{\omega}_{R}\in\mathcal{E}_{2})$	$\displaystyle\geq\mathbb{P}(\omega_{1},\ldots,\omega_{2R^{2}+R}\in[-I,I]^{c})$
		$\displaystyle\geq 1-\sum^{2R^{2}+R}_{r=1}\mathbb{P}(\omega_{r}\in[-I,I])$
		$\displaystyle\geq 1-\epsilon,$

and this completes the proof.

B.2 Proof of Theorem 3

Recall that we use a permutation test defined as follows:

\Delta_{{n_{1}},{n_{2}},R}^{\alpha}:=\mathds{1}(V>q_{{n_{1}},{n_{2}},1-\alpha}).

We first note that the permutation test is invariant under multiplying a positive constant ${n_{1}}$ to the test statistic. Therefore, throughout the proof of Theorem 3, we consider ${n_{1}}V$ as a test statistic and its permutation quantile ${n_{1}}q_{{n_{1}},{n_{2}},1-\alpha}$ instead of $V$ and $q_{{n_{1}},{n_{2}},1-\alpha}$ , respectively. Now, note that both the test statistic ${n_{1}}V$ and the permutation quantile ${n_{1}}q_{{n_{1}},{n_{2}},1-\alpha}$ are random variables. Our strategy to prove Theorem 3 is to first assume the ME condition (7), $\boldsymbol{\omega}_{R}\in\mathcal{E}$ , which holds for any distribution pair $P_{X},P_{Y}\in\mathcal{A}_{\epsilon}$ with high probability by Lemma 1. For such fixed $\boldsymbol{\omega}_{R}$ , we analyze the asymptotic behavior of ${n_{1}}V$ and show that the test power is strictly smaller than one for an uncountable number of pairs of distributions $P_{X}$ and $P_{Y}$ .

Asymptotic behavior of the unconditional distribution

Let us start by investigating the test statistic ${n_{1}}V$ with a fixed $\boldsymbol{\omega}_{R}\in\mathcal{E}$ . For the sake of notation, we denote the covariances of $\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(X)$ and $\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Y)$ as $\Sigma_{\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(X)}$ and $\Sigma_{\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Y)}$ , respectively. Note that $\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(X)$ and $\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Y)$ are trigonometric functions, having finite variance. Hence the central limit theorem guarantees that the unconditional distribution of ${n_{1}}^{1/2}T$ converges in distribution to $N(0,\tilde{\Sigma})$ where $\tilde{\Sigma}=\Sigma_{\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(X)}+\frac{p}{1-p}\Sigma_{\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Y)}\in\mathbb{R}^{2R\times 2R}$ . Letting $\tilde{k}=\operatorname{rank}(\tilde{\Sigma})$ , consider an eigendecomposition of $\tilde{\Sigma}$ :

\tilde{\Sigma}=\tilde{Q}\tilde{D}\tilde{Q}^{\top}.

Here, $\tilde{D}=\mathrm{diag}(\tilde{\lambda}_{1},\ldots,\tilde{\lambda}_{\tilde{k}})\in\mathbb{R}^{\tilde{k}\times\tilde{k}}$ is a diagonal matrix formed from the non-zero eigenvalues of $\tilde{\Sigma}$ , and $\tilde{Q}\in\mathbb{R}^{2R\times\tilde{k}}$ is an orthogonal matrix with columns corresponding to the eigenvectors of $\tilde{\Sigma}$ . Then, a Gaussian random vector $A\sim N(0,\tilde{\Sigma})$ can be decomposed as $A=\tilde{Q}D^{\frac{1}{2}}G$ where $G=(G_{1},\ldots,G_{\tilde{k}})^{\top}\sim N(0,I_{\tilde{k}\times\tilde{k}})$ . Therefore, the distribution of $\|A\|^{2}_{\mathbb{R}^{\tilde{k}}}$ can be derived as follows:

	$\displaystyle\\|A\\|^{2}_{\mathbb{R}^{\tilde{k}}}=A^{\top}A$	$\displaystyle=G^{\top}\tilde{D}^{\frac{1}{2}}\tilde{Q}^{\top}\tilde{Q}\tilde{D}^{\frac{1}{2}}G=G^{\top}\tilde{D}^{\frac{1}{2}}\tilde{D}^{\frac{1}{2}}G$
		$\displaystyle=\sum^{\tilde{k}}_{i=1}\tilde{\lambda}_{i}G^{2}_{i}.$

Based on the fact that ${n_{1}}^{1/2}T$ converges in distribution to $A$ , the continuous mapping theorem guarantees

\sup_{t\in\mathbb{R}^{d}}\big{|}F_{{n_{1}}V}(t)-F_{\sum\tilde{\lambda}_{i}G_{i}^{2}}(t)\big{|}\rightarrow 0

(14)

for fixed $\boldsymbol{\omega}_{R}\in\mathcal{E}$ , where $F_{\sum\tilde{\lambda}_{i}G_{i}^{2}}$ is the CDF of $\sum^{\tilde{k}}_{i=1}\tilde{\lambda}_{i}G_{i}^{2}.$ If $\tilde{k}=2R$ , then the limiting distribution becomes $\sum^{2R}_{i=1}\tilde{\lambda}_{i}G_{i}^{2}$ . Even when $\tilde{k}$ is strictly less than $2R$ , we note that the distribution $F_{\sum\tilde{\lambda}_{i}G_{i}^{2}}$ can also be regarded as the distribution of $\sum^{2R}_{i=1}\tilde{\lambda}_{i}G_{i}^{2}$ instead of $\sum^{\tilde{k}}_{i=1}\tilde{\lambda}_{i}G_{i}^{2},$ since we can extend the eigenvalue set $\{\tilde{\lambda}_{i}\}_{i=1}^{\tilde{k}}$ to $\{\tilde{\lambda}_{i}\}_{i=1}^{2R}$ by including zero eigenvalues.

Asymptotic behavior of the permutation distribution

We now examine the asymptotic behavior of the permutation distribution of ${n_{1}}V$ and ${n_{1}}q_{{n_{1}},{n_{2}},1-\alpha}$ for fixed $\boldsymbol{\omega}_{R}\in\mathcal{E}$ . First, let $\bar{k}=\operatorname{rank}(\bar{\Sigma})$ where $\bar{\Sigma}=\frac{p}{1-p}\Sigma_{\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(X)}+\Sigma_{\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Y)}$ , and consider an eigendecomposition

	$\displaystyle\bar{\Sigma}$	$\displaystyle=QDQ^{\top}=\left[\bar{Q}\,\|\,Q_{0}\right]\left[\begin{array}[]{c\|c}\bar{D}&\mathbf{0}\\ \hline\cr\mathbf{0}&\mathbf{0}\end{array}\right]\left[\bar{Q}\,\|\,Q_{0}\right]^{\top}$
		$\displaystyle=\bar{Q}\bar{D}\bar{Q}^{\top},$

where $D\in\mathbb{R}^{2R\times 2R}$ is a diagonal matrix composed of $\bar{D}$ and zeros, $\bar{D}=\text{diag}(\bar{\lambda}_{1},\ldots,\bar{\lambda}_{\bar{k}})\in\mathbb{R}^{\bar{k}\times\bar{k}}$ is a diagonal matrix formed from the non-zero eigenvalues of $\bar{\Sigma}$ . In addition $\bar{Q}\in\mathbb{R}^{2R\times\bar{k}}$ denotes an orthogonal matrix whose columns are the eigenvectors of $\bar{\Sigma}$ and $Q_{0}\in\mathbb{R}^{2R\times(2R-\bar{k})}$ denotes an orthogonal matrix that makes $Q=\left[\bar{Q}\,|\,Q_{0}\right]\in\mathbb{R}^{2R\times 2R}$ also orthogonal. Note that ${n_{1}}V$ can be decomposed as

$\displaystyle{n_{1}}V$	$\displaystyle={n_{1}}T^{\top}T$	(15)
	$\displaystyle={n_{1}}T^{\top}QQ^{\top}T$
	$\displaystyle={n_{1}}T^{\top}\bar{Q}\bar{Q}^{\top}T+{n_{1}}T^{\top}Q_{0}Q_{0}^{\top}T.$

For the first term, note that $\bar{Q}^{\top}T=\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}\bar{Q}^{\top}\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(X_{i})-\frac{1}{{n_{2}}}\sum_{j=1}^{n_{2}}\bar{Q}^{\top}\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Y_{j})$ and $\mathbb{E}_{X}[\bar{Q}^{\top}\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(X)]=\mathbb{E}_{Y\sim Q}[\bar{Q}^{\top}\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Y)]$ for $\boldsymbol{\omega}_{R}\in\mathcal{E}$ . Furthermore, we observe that

	$\displaystyle\frac{p}{1-p}\Sigma_{\bar{Q}\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(X)}+\Sigma_{\bar{Q}\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Y)}$	$\displaystyle=\frac{p}{1-p}\bar{Q}^{\top}\Sigma_{\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(X)}\bar{Q}+\bar{Q}^{\top}\Sigma_{\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Y)}\bar{Q}$
		$\displaystyle=\bar{Q}^{\top}\Big{(}\frac{p}{1-p}\Sigma_{\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(X)}+\Sigma_{\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Y)}\Big{)}\bar{Q}$
		$\displaystyle=\bar{Q}^{\top}\bar{\Sigma}\bar{Q}$
		$\displaystyle=\bar{D}.$

Since $\bar{D}$ is positive definite, we apply Lemma 14 to ${n_{1}}^{1/2}\bar{Q}^{\top}T$ and conclude that the permutation distribution of ${n_{1}}^{1/2}\bar{Q}^{\top}$ converges weakly in probability to $N(0,\bar{D})$ for fixed $\boldsymbol{\omega}_{R}\in\mathcal{E}$ . Therefore, the continuous mapping theorem for permutation distribution (Chung and Romano,, 2016, Lemma A.6) guarantees that the permutation distribution of ${n_{1}}T^{\top}\bar{Q}\bar{Q}^{\top}T$ converges weakly in probability to $\sum^{\bar{k}}_{i=1}\bar{\lambda}_{i}G_{i}^{2}$ for $\boldsymbol{\omega}_{R}\in\mathcal{E}$ . More formally, if we denote the permutation distribution of ${n_{1}}T^{\top}\bar{Q}\bar{Q}^{\top}T$ as $F^{\pi}_{({n_{1}}^{1/2}\bar{Q}T)^{2}}(t)$ , for any $\epsilon>0,$ we have

\lim_{{n_{1}},{n_{2}}\rightarrow\infty}\mathbb{P}\Big{(}\sup_{t\in\mathbb{R}}\big{|}F^{\pi}_{({n_{1}}^{1/2}\bar{Q}T)^{2}}(t)-F_{\sum\bar{\lambda}_{i}G_{i}^{2}}(t)\big{|}>\epsilon\,\Big{|}\,\boldsymbol{\omega}_{R}=\boldsymbol{\omega}\Big{)}=0

for each fixed $\boldsymbol{\omega}\in\mathcal{E}$ , where $F_{\sum\bar{\lambda}_{i}G_{i}^{2}}$ denotes the CDF of $\sum^{\bar{k}}_{i=1}\bar{\lambda}_{i}G_{i}^{2}.$

For the second term in Equation (15), we note that the null space of the sum of two positive semidefinite matrices is the intersection of the null spaces of each of them. Since $Q_{0}$ is the null space of $\bar{\Sigma}=\frac{p}{1-p}\Sigma_{\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(X)}+\Sigma_{\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Y)}$ and $\bar{\Sigma}$ is positive semidefinite, the columns of $Q_{0}$ are in the null space of $\Sigma_{\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(X)}$ and $\Sigma_{\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Y)}$ . This implies that $Q_{0}^{\top}\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(X)=\mathbf{0}$ and $Q_{0}^{\top}\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Y)=\mathbf{0}$ . Hence, for any permutation $\pi\in\Pi_{N}$ , we have

Q_{0}^{\top}T(Z_{\pi(1)},\ldots,Z_{\pi(N)})=\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}Q_{0}^{\top}\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Z_{\pi(i)})-\frac{1}{{n_{2}}}\sum_{j={n_{1}}+1}^{N}Q_{0}^{\top}\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Z_{\pi(j)})=\mathbf{0},

and we conclude that the permutation distribution of ${n_{1}}T^{\top}Q_{0}Q_{0}^{\top}T$ is degenerate at zero. Combining the results, Equation (15) implies $F^{\pi}_{{n_{1}}V}(t)=F^{\pi}_{({n_{1}}^{1/2}\bar{Q}T)^{2}}(t)$ , thus for any $\epsilon>0,$

\lim_{{n_{1}},{n_{2}}\rightarrow\infty}\mathbb{P}\Big{(}\sup_{t\in\mathbb{R}}\big{|}F^{\pi}_{{n_{1}}V}(t)-F_{\sum\bar{\lambda}_{i}G_{i}^{2}}(t)\big{|}>\epsilon\,\Big{|}\,\boldsymbol{\omega}_{R}=\boldsymbol{\omega}\Big{)}=0,

for each fixed $\boldsymbol{\omega}\in\mathcal{E}$ . Similar to Equation (14), we can include zero eigenvalues and $F_{\sum\bar{\lambda}_{i}G_{i}^{2}}$ can be seen as the distribution of $\sum^{2R}_{i=1}\bar{\lambda}_{i}G_{i}^{2}$ , instead of $\sum^{\bar{k}}_{i=1}\bar{\lambda}_{i}G_{i}^{2}.$

When a sequence of “random” distribution functions converges weakly in probability to a fixed distribution function, it ensures convergence in its quantile (Lehmann and Romano,, 2006, Lemma 11.2.1 (ii)). Hence, the critical value ${n_{1}}q_{{n_{1}},{n_{2}},1-\alpha}$ of ${n_{1}}V$ converges in probability to $q_{R,1-\alpha}$ , where $q_{R,1-\alpha}$ is the $(1-\alpha)$ -quantile of $\sum^{2R}_{i=1}\bar{\lambda}_{i}G_{i}^{2}$ under the ME condition. In other words, for any $\epsilon>0,$

\lim_{{n_{1}},{n_{2}}\rightarrow\infty}\mathbb{P}\big{(}|{n_{1}}q_{{n_{1}},{n_{2}},1-\alpha}-q_{R,1-\alpha}|>\epsilon\,\big{|}\,\boldsymbol{\omega}_{R}=\boldsymbol{\omega}\big{)}=0

(16)

for each fixed $\boldsymbol{\omega}\in\mathcal{E}$ .

Constructing $P_{X}$ and $P_{Y}$

We start by summarizing the analysis we have done so far. For the permutation test $\Delta_{{n_{1}},{n_{2}},R}^{\alpha}:=\mathds{1}({n_{1}}V>{n_{1}}q_{{n_{1}},{n_{2}},1-\alpha})$ with $\boldsymbol{\omega}_{R}\in{\mathcal{E}}$ , which implies that the 1-ME condition holds, we have shown that the unconditional distribution of the test statistic ${n_{1}}V$ , denoted as $F_{{n_{1}}V}$ , converges in distribution to $\sum^{2R}_{i=1}\tilde{\lambda}_{i}G_{i}^{2}$ and the critical value ${n_{1}}q_{{n_{1}},{n_{2}},1-\alpha}$ of permtation distribution $F^{\pi}_{{n_{1}}V}$ converges in probability to $q_{R,1-\alpha}$ , that is, the $(1-\alpha)$ -quantile of $\sum^{\bar{2}R}_{i=1}\bar{\lambda}_{i}G_{i}^{2}$ . Since $\{\tilde{\lambda}_{i}\}_{i=1}^{2R}$ and $\{\bar{\lambda}_{i}\}_{i=1}^{2R}$ are the eigenvalues of $\tilde{\Sigma}$ and $\bar{\Sigma}$ , the asymptotic power depends on the difference between these matrices. Note that they are given as

\tilde{\Sigma}=\Sigma_{\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(X)}+\frac{p}{1-p}\Sigma_{\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Y)}\quad\text{and}\quad\bar{\Sigma}=\frac{p}{1-p}\Sigma_{\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(X)}+\Sigma_{\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Y)},

which involves the second moments of the feature mappings.

Here, given $\boldsymbol{\omega}_{R}\in\mathcal{E}$ , suppose that the matrices $\tilde{\Sigma}$ and $\bar{\Sigma}$ coincide for some distribution pair $P_{X},P_{Y}$ . This implies that the permutation distribution $F^{\pi}_{{n_{1}}V}$ and the unconditional distribution $F_{{n_{1}}V}$ become asymptotically identical for such fixed $\boldsymbol{\omega}_{R}$ , and thus we can expect that the test fails to distinguish those distributions $P_{X}$ and $P_{Y}$ in this case. To formalize this scenario, consider an extension of the 1-ME condition to include second moments. Recall that the 1-ME condition is $\boldsymbol{\omega}_{R}\in{\mathcal{E}},$ meaning that the first moments of the feature mappings are identical. Now, consider the 2-ME condition, $\boldsymbol{\omega}_{R}\in{\mathcal{E}_{2}},$ which implies the coincidence up to the second moments of the feature mappings. i.e.,

\mathcal{E}_{2}:=\bigl{\{}\boldsymbol{x}\in\mathbb{R}^{d\times R}:\mathbb{E}_{X}[\boldsymbol{\psi}_{\boldsymbol{x}}(X)]=\mathbb{E}_{Y}[\boldsymbol{\psi}_{\boldsymbol{x}}(Y)],~{}\mathbb{E}_{X}[\boldsymbol{\psi}_{\boldsymbol{x}}(X)\boldsymbol{\psi}_{\boldsymbol{x}}(X)^{\top}]=\mathbb{E}_{Y}[\boldsymbol{\psi}_{\boldsymbol{x}}(Y)\boldsymbol{\psi}_{\boldsymbol{x}}(Y)^{\top}]\bigr{\}}.

Then, note that Proposition 2 guarantees that the 2-ME condition holds for any distribution pair $P_{X},P_{Y}\in\mathcal{A}_{k,\epsilon}$ with arbitrarily high probability $1-\epsilon$ . Also, observe that if $\boldsymbol{\omega}_{R}\in{\mathcal{E}_{2}},$ then the covariance matrices of the feature mapping $\Sigma_{\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(X)}$ and $\Sigma_{\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Y)}$ are the same, and this indicates that the matrices $\tilde{\Sigma}$ and $\bar{\Sigma}$ coincide. We again emphasize that $F^{\pi}_{{n_{1}}V}$ and $F_{{n_{1}}V}$ become asymptotically identical in this case. Therefore, for such $\boldsymbol{\omega}_{R}\in{\mathcal{E}_{2}},$ the critical value ${n_{1}}q_{{n_{1}},{n_{2}},1-\alpha}$ converges in probability to the $(1-\alpha)$ -quantile of the unconditional distribution, and combining this fact with Equation (14) and Equation (16), Slutsky’s theorem yields the convergence

\lim_{{n_{1}},{n_{2}}\rightarrow\infty}\mathbb{P}({n_{1}}V\geq{n_{1}}q_{{n_{1}},{n_{2}},1-\alpha}\,|\,\boldsymbol{\omega}_{R}=\boldsymbol{\omega})=\alpha,

(17)

for fixed $\boldsymbol{\omega}\in{\mathcal{E}_{2}}$ . For a given $\epsilon\in(0,1)$ , let $P_{X}$ and $P_{Y}$ be distinct distributions in $\mathcal{A}_{k,\epsilon}$ defined in Proposition 2. Then, we obtain the following result:

	$\displaystyle\mathbb{P}(\Delta_{{n_{1}},{n_{2}},R}^{\alpha}=1)$	$\displaystyle=\int_{\mathcal{E}_{2}}\mathbb{P}(\Delta_{{n_{1}},{n_{2}},R}^{\alpha}=1\,\|\,\boldsymbol{\omega}_{R}=\boldsymbol{\omega})f_{\boldsymbol{\omega}_{R}}(\boldsymbol{\omega})d\boldsymbol{\omega}+\int_{\mathcal{E}_{2}^{c}}\mathbb{P}(\Delta_{{n_{1}},{n_{2}},R}^{\alpha}=1\,\|\,\boldsymbol{\omega}_{R}=\boldsymbol{\omega})f_{\boldsymbol{\omega}_{R}}(\boldsymbol{\omega})d\boldsymbol{\omega}$
		$\displaystyle\leq\int_{\mathcal{E}_{2}}\mathbb{P}(\Delta_{{n_{1}},{n_{2}},R}^{\alpha}=1\,\|\,\boldsymbol{\omega}_{R}=\boldsymbol{\omega})f_{\boldsymbol{\omega}_{R}}(\boldsymbol{\omega})d\boldsymbol{\omega}+\epsilon$

Since the probability is bounded by 1, by taking limits on both sides and applying the dominated convergence theorem, we get the desired result:

	$\displaystyle\lim_{{n_{1}},{n_{2}}\rightarrow\infty}\mathbb{P}(\Delta_{{n_{1}},{n_{2}},R}^{\alpha}=1)$	$\displaystyle\leq\lim_{{n_{1}},{n_{2}}\rightarrow\infty}\int_{\mathcal{E}_{2}}\mathbb{P}(\Delta_{{n_{1}},{n_{2}},R}^{\alpha}=1\,\|\,\boldsymbol{\omega}_{R}=\boldsymbol{\omega})f_{\boldsymbol{\omega}_{R}}(\boldsymbol{\omega})d\boldsymbol{\omega}+\epsilon$
		$\displaystyle\leq\int_{\mathcal{E}_{2}}\lim_{{n_{1}},{n_{2}}\rightarrow\infty}\mathbb{P}(\Delta_{{n_{1}},{n_{2}},R}^{\alpha}=1\,\|\,\boldsymbol{\omega}_{R}=\boldsymbol{\omega})f_{\boldsymbol{\omega}_{R}}(\boldsymbol{\omega})d\boldsymbol{\omega}+\epsilon$
		$\displaystyle\leq\alpha+\epsilon$

where the last inequality follows from Equation 17.

B.3 Proof of Corollary 4

As shown by Zhao and Meng, (2015, Appendix A.1), the unbiased estimator of MMD can be written as

	$\displaystyle\widehat{\mathrm{MMD}}_{u}^{2}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\mathcal{H}_{k})$	$\displaystyle=\widehat{\mathrm{MMD}}_{b}^{2}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\mathcal{H}_{k})+\frac{1}{{n_{1}}-1}\sum^{n_{1}}_{i=1}\sum^{n_{1}}_{j=1}\frac{1}{{n_{1}^{2}}}k(X_{i},X_{j})+\frac{1}{{n_{2}}-1}\sum^{n_{2}}_{i=1}\sum^{n_{2}}_{j=1}\frac{1}{{n_{2}^{2}}}k(Y_{i},Y_{j})$
		$\displaystyle\quad-\kappa(0)\bigg{(}\frac{1}{{n_{1}}-1}+\frac{1}{{n_{2}}-1}\bigg{)},$

where $k(x,y)=\kappa(x-y)$ . By replacing the kernel $k(x,y)$ with $\hat{k}(x,y)=\langle\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(x),\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(y)\rangle$ , we get

$\displaystyle\text{r}\widehat{\mathrm{MMD}}_{u}^{2}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\boldsymbol{\omega}_{R})$	$\displaystyle=\text{r}\widehat{\mathrm{MMD}}_{b}^{2}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\boldsymbol{\omega}_{R})+\frac{1}{{n_{1}}-1}\sum^{n_{1}}_{i=1}\sum^{n_{1}}_{j=1}\frac{1}{{n_{1}^{2}}}\hat{k}(X_{i},X_{j})$	(18)
	$\displaystyle\quad+\frac{1}{{n_{2}}-1}\sum^{n_{2}}_{i=1}\sum^{n_{2}}_{j=1}\frac{1}{{n_{2}^{2}}}\hat{k}(Y_{i},Y_{j})-\kappa(0)\bigg{(}\frac{1}{{n_{1}}-1}+\frac{1}{{n_{2}}-1}\bigg{)}$
	$\displaystyle=\text{r}\widehat{\mathrm{MMD}}_{b}^{2}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\boldsymbol{\omega}_{R})+\frac{1}{{n_{1}}-1}\bigg{\\|}\frac{1}{n_{1}}\sum^{n_{1}}_{i=1}\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(X_{i})\bigg{\\|}_{\mathbb{R}^{2R}}^{2}$
	$\displaystyle\quad+\frac{1}{{n_{2}}-1}\bigg{\\|}\frac{1}{{n_{2}}}\sum^{n_{2}}_{i=1}\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Y_{i})\bigg{\\|}_{\mathbb{R}^{2R}}^{2}-\kappa(0)\bigg{(}\frac{1}{{n_{1}}-1}+\frac{1}{{n_{2}}-1}\bigg{)}.$

Recall that we use the test statistics given as ${n_{1}}V={n_{1}}\cdot\text{r}\widehat{\mathrm{MMD}}_{b}^{2}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\boldsymbol{\omega}_{R})$ and ${n_{1}}U={n_{1}}\cdot\text{r}\widehat{\mathrm{MMD}}_{u}^{2}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\boldsymbol{\omega}_{R})$ in the permutation tests defined in Theorem 3 and Corollary 4, respectively. Also, throughout the proof of Theorem 4, we assume $\kappa(0)=1$ , which can be done without loss of generality as mentioned earlier in Section 2.3. Then, multiplying ${n_{1}}$ on both sides of Equation (18), we get

{n_{1}}U={n_{1}}V+\frac{{n_{1}}}{{n_{1}}-1}\bigg{\|}\frac{1}{n_{1}}\sum^{n_{1}}_{i=1}\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(X_{i})\bigg{\|}_{\mathbb{R}^{2R}}^{2}+\frac{{n_{1}}}{{n_{2}}-1}\bigg{\|}\frac{1}{{n_{2}}}\sum^{n_{2}}_{i=1}\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Y_{i})\bigg{\|}_{\mathbb{R}^{2R}}^{2}-\frac{n_{1}}{{n_{1}}-1}-\frac{{n_{1}}}{{n_{2}}-1}.

(19)

Our aim is to show that the unconditional distribution of the second and third terms and their permutation distribution are asymptotically the same under the ME condition. To start with, recall that the ME condition is $\boldsymbol{\omega}_{R}\in\mathcal{E},$ implying $\mathbb{E}[\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(X)]=\mathbb{E}[\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Y)].$ Let us denote the exact value of this expectation as $\mu_{\boldsymbol{\omega}_{R}}$ for such fixed $\boldsymbol{\omega}_{R}\in\mathcal{E}$ . Then, as $\frac{1}{n_{1}}\sum^{n_{1}}_{i=1}\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(X_{i})$ and $\frac{1}{{n_{2}}}\sum^{n_{2}}_{i=1}\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Y_{i})$ are the sample means of $\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(X_{i})$ and $\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Y_{i})$ with finite variance, the law of large numbers ensures that

\frac{1}{n_{1}}\sum^{n_{1}}_{i=1}\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(X_{i})\xrightarrow{p}\mu_{\boldsymbol{\omega}_{R}}\quad\text{and}\quad\frac{1}{{n_{2}}}\sum^{n_{2}}_{i=1}\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Y_{i})\xrightarrow{p}\mu_{\boldsymbol{\omega}_{R}}.

Therefore, by applying the continuous mapping theorem and Slutsky’s theorem, it can be shown that

\sup_{t\in\mathbb{R}}\big{|}F_{{n_{1}}U}(t)-L_{{n_{1}}V+c}(t)\big{|}\rightarrow 0,

(20)

where $F_{{n_{1}}U}(t)$ denotes the unconditional distribution of ${n_{1}}U$ and $L_{{n_{1}}V+c}$ denotes the asymptotic unconditional distribution of ${n_{1}}V+c(p,\mu_{\boldsymbol{\omega}_{R}}):={n_{1}}V+\big{(}1+\frac{p}{1-p}\big{)}\|\mu_{\boldsymbol{\omega}_{R}}\|_{\mathbb{R}^{2R}}^{2}-1-\frac{p}{1-p}$ . Note that we derived the asymptotic unconditional distribution of ${n_{1}}V$ in Equation (14).

For the case of the permutation distribution, we reformulate the $2R$ -dimensional vectors as follows:

\Psi^{(i)}_{N}(Z_{1},\dots,Z_{N}):=\sum^{N}_{i=1}\nu_{i}\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Z_{i}),\quad\Psi^{(ii)}_{N}(Z_{1},\dots,Z_{N}):=\sum^{N}_{i=1}(1-\nu_{i})\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Z_{i}),

where $\nu_{i}=\mathds{1}(i\leq{n_{1}})$ . Then, we observe that $\frac{1}{n_{1}}\sum^{n_{1}}_{i=1}\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(X_{i})$ and $\frac{1}{{n_{2}}}\sum^{n_{2}}_{i=1}\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Y_{i})$ in Equation (19) can be written as $\frac{1}{n_{1}}\Psi^{(i)}_{N}(Z_{1},\dots,Z_{N})$ and $\frac{1}{{n_{2}}}\Psi^{(ii)}_{N}(Z_{1},\dots,Z_{N}).$ Let $F^{\pi}_{X}$ denote the permutation distribution of $\frac{1}{n_{1}}\Psi^{(i)}_{N}(Z_{1},\dots,Z_{N})$ , defined by

F^{\pi}_{X}(t):=\frac{1}{N!}\sum_{\pi\in\Pi_{N}}\mathds{1}\bigg{\{}\frac{1}{n_{1}}\Psi^{(i)}_{N}(Z_{\pi(1)},\dots,Z_{\pi(N)})\leq t\bigg{\}},

and let $F^{\pi}_{Y}$ denote the permutation distribution of $\frac{1}{{n_{2}}}\Psi^{(ii)}_{N}(Z_{1},\dots,Z_{N})$ defined similarly. Let $G$ be a random variable that is uniformly distributed over $\Pi_{N}$ . If we can show $\frac{1}{n_{1}}\Psi^{(i)}_{N}(Z_{G(1)},\dots,Z_{G(N)})\xrightarrow{p}\mu_{\boldsymbol{\omega}_{R}}$ and $\frac{1}{{n_{2}}}\Psi^{(ii)}_{N}(Z_{G(1)},\dots,Z_{G(N)})\xrightarrow{p}\mu_{\boldsymbol{\omega}_{R}}$ , then the desired results $F^{\pi}_{X}\xrightarrow{p}\mu_{\boldsymbol{\omega}_{R}}$ and $F^{\pi}_{Y}\xrightarrow{p}\mu_{\boldsymbol{\omega}_{R}}$ are followed by Lemma 11. Since $R\in\mathbb{N}$ is a fixed number, it suffices to show that each component of $\frac{1}{n_{1}}\Psi^{(i)}_{N}(Z_{G(1)},\dots,Z_{G(N)})$ converges to the corresponding component of $\mu_{\boldsymbol{\omega}_{R}}$ in probability.

For $1\leq k\leq 2R$ , let us denote the $k$ -th component of $\Psi^{(i)}_{N}(Z_{G(1)},\dots,Z_{G(N)})$ , $\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Z_{i})$ and $\mu_{\boldsymbol{\omega}_{R}}$ as $\Psi^{(i)}_{N}(Z,G)_{k}$ , $\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Z_{i})_{k}$ and $\mu_{R,k}$ , respectively. Note that

	$\displaystyle\mathbb{E}\left[\frac{1}{n_{1}}\Psi^{(i)}_{N}(Z,G)_{k}\right]$	$\displaystyle=\frac{1}{n_{1}}\sum^{N}_{i=1}\mathbb{E}\left[\nu_{G(i)}\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Z_{G(i)})_{k}\right]$
		$\displaystyle=\frac{1}{n_{1}}\sum^{N}_{i=1}\mathbb{E}\left[\nu_{G(i)}\right]\mathbb{E}\left[\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Z_{G(i)})_{k}\right]$
		$\displaystyle\stackrel{{\scriptstyle(a)}}{{=}}\frac{1}{n_{1}}\sum^{N}_{i=1}\frac{n_{1}}{N}~{}\mathbb{E}_{Z}\Bigr{[}\mathbb{E}_{G}\left[\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Z_{G(i)})_{k}\|Z_{1},\ldots,Z_{N}\right]\Bigr{]}$
		$\displaystyle\stackrel{{\scriptstyle(b)}}{{=}}\frac{1}{N}\sum^{N}_{i=1}\mathbb{E}_{Z}\bigg{[}\frac{1}{N}\sum^{N}_{j=1}\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Z_{j})_{k}\bigg{]}$
		$\displaystyle=\frac{1}{N}\sum^{N}_{i=1}\frac{1}{N}\big{(}{n_{1}}\cdot\mathbb{E}[\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(X)_{k}]+{n_{2}}\cdot\mathbb{E}[\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Y)_{k}]\big{)}$
		$\displaystyle=\mu_{R,k},$

where $(a)$ uses the the fact that $\mathbb{E}[\nu_{G(i)}]=\frac{n_{1}}{N}$ for $1\leq i\leq N$ , and $(b)$ holds since $G$ is uniformly distributed over $\Pi_{N}$ .

Furthermore, we note that

	$\displaystyle\mathbb{E}\big{[}\nu_{G(1)}^{2}\big{]}=\frac{{n_{1}^{2}}}{N^{2}},$
	$\displaystyle\mathbb{E}\big{[}\nu_{G(1)}\nu_{G(2)}\big{]}=\frac{{n_{1}}({n_{1}}-1)}{N(N-1)},$
	$\displaystyle\mathbb{E}\big{[}\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(X)_{k}^{2}\big{]}\leq 1,\quad\mathbb{E}\left[\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Y)_{k}^{2}\right]\leq 1,$

and also

	$\displaystyle\mathbb{E}\big{[}\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Z_{G(1)})_{k}\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Z_{G(2)})_{k}\big{]}$	$\displaystyle=\frac{1}{N(N-1)}\sum_{1\leq i\neq j\leq N}\mathbb{E}\big{[}\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Z_{i})_{k}\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Z_{j})_{k}\big{]}$
		$\displaystyle=\frac{1}{N(N-1)}\sum_{1\leq i\neq j\leq N}\mathbb{E}\big{[}\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Z_{i})_{k}]\mathbb{E}[\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Z_{j})_{k}\big{]}$
		$\displaystyle=\frac{1}{N(N-1)}\cdot N(N-1)\mu_{R,k}^{2}$
		$\displaystyle=\mu_{R,k}^{2}.$

Based on these observations, we have

	$\displaystyle\operatorname{Var}\bigg{[}\frac{1}{n_{1}}\Psi^{(i)}_{N}(Z,G)_{k}\bigg{]}$	$\displaystyle=\mathbb{E}\bigg{[}\frac{1}{n_{1}^{2}}\Psi^{(i)}_{N}(Z,G)_{k}^{2}\bigg{]}-\mu_{R,k}^{2}$
		$\displaystyle=\frac{1}{{n_{1}^{2}}}\mathbb{E}\bigg{[}\sum^{N}_{i=1}\sum^{N}_{j=1}\nu_{G(i)}\nu_{G(j)}\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Z_{G(i)})_{k}\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Z_{G(j)})_{k}\bigg{]}-\mu_{R,k}^{2}$
		$\displaystyle=\frac{1}{{n_{1}^{2}}}\sum^{N}_{i=1}\sum^{N}_{j=1}\mathbb{E}\big{[}\nu_{G(i)}\nu_{G(j)}\big{]}\mathbb{E}\big{[}\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Z_{G(i)})_{k}\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Z_{G(j)})_{k}\big{]}-\mu_{R,k}^{2}$
		$\displaystyle=\frac{1}{{n_{1}^{2}}}\sum^{N}_{i=1}\mathbb{E}\big{[}\nu_{G(i)}^{2}\big{]}\mathbb{E}\big{[}\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Z_{G(i)})_{k}^{2}\big{]}$
		$\displaystyle\quad+\frac{1}{{n_{1}^{2}}}\sum_{1\leq i\neq j\leq N}\mathbb{E}\big{[}\nu_{G(i)}\nu_{G(j)}\big{]}\mathbb{E}\left[\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Z_{G(i)})_{k}\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Z_{G(j)})_{k}\right]-\mu_{R,k}^{2}$
		$\displaystyle=\frac{N}{{n_{1}^{2}}}\mathbb{E}\big{[}\nu_{G(1)}^{2}\big{]}\mathbb{E}\big{[}\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Z_{G(1)})_{k}^{2}\big{]}$
		$\displaystyle\quad+\frac{N(N-1)}{{n_{1}^{2}}}\mathbb{E}\big{[}\nu_{G(1)}\nu_{G(2)}\big{]}\mathbb{E}\big{[}\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Z_{G(1)})_{k}\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Z_{G(2)})_{k}\big{]}-\mu_{R,k}^{2}$
		$\displaystyle\leq\frac{N}{{n_{1}^{2}}}\cdot\frac{{n_{1}^{2}}}{N^{2}}\cdot 1+\frac{N(N-1)}{{n_{1}^{2}}}\cdot\frac{{n_{1}}({n_{1}}-1)}{N(N-1)}\cdot\mu_{R,k}^{2}-\mu_{R,k}^{2}$
		$\displaystyle=\frac{1}{N}-\frac{1}{{n_{1}}}\mu_{R,k}^{2}$
		$\displaystyle\rightarrow 0.$

Therefore, we now have $\frac{1}{n_{1}}\Psi^{(i)}_{N}(Z,G)_{k}\xrightarrow{p}\mu_{R,k}$ for each $k$ , and this implies

\frac{1}{n_{1}}\Psi^{(i)}_{N}(Z_{G(1)},\dots,Z_{G(N)})\xrightarrow{p}\mu_{\boldsymbol{\omega}_{R}}.

Similarly, we can get

\frac{1}{{n_{2}}}\Psi^{(ii)}_{N}(Z_{G(1)},\dots,Z_{G(N)})\xrightarrow{p}\mu_{\boldsymbol{\omega}_{R}}.

For the final step, letting $F_{{n_{1}}U}^{\pi}$ denote the permutation distribution function of ${n_{1}}U$ , and we apply the continuous mapping theorem for permutation distributions (Lemma 13) and Slutsky’s theorem extended for permutation distributions (Lemma 12) to conclude that

\lim_{{n_{1}},{n_{2}}\rightarrow\infty}\mathbb{P}\Big{(}\sup_{t\in\mathbb{R}}\big{|}F^{\pi}_{{n_{1}}U}(t)-L^{\pi}_{{n_{1}}V+c}(t)\big{|}>\epsilon\Big{)}=0,

where $L^{\pi}_{{n_{1}}V+c}$ is the asymptotic permutation distribution of ${n_{1}}V+c(p,\mu_{\boldsymbol{\omega}_{R}})$ under the ME condition. Therefore, in the same manner as Equation (16), we have

\lim_{{n_{1}},{n_{2}}\rightarrow\infty}\mathbb{P}\Big{(}\left|{n_{1}}q_{{n_{1}},{n_{2}},1-\alpha}^{u}-\big{(}q_{R,1-\alpha}+c(p,\mu_{\boldsymbol{\omega}_{R}})\big{)}\right|>\epsilon\,\Big{|}\,\boldsymbol{\omega}_{R}=\boldsymbol{\omega}\Big{)}=0

for fixed $\boldsymbol{\omega}\in\mathcal{E},$ where $q_{R,1-\alpha}$ denotes the $(1-\alpha)$ -quantile of the distribution of $\sum^{2R}_{i=1}\bar{\lambda}_{i}G_{i}^{2}.$ Combining the result with Equation (20), Slutsky’s theorem yields

		$\displaystyle\lim_{{n_{1}},{n_{2}}\rightarrow\infty}\mathbb{P}\big{(}{n_{1}}U\geq{n_{1}}q_{{n_{1}},{n_{2}},1-\alpha}^{u}\,\big{\|}\,\boldsymbol{\omega}_{R}=\boldsymbol{\omega}\big{)}$
	$\displaystyle=$	$\displaystyle\lim_{{n_{1}},{n_{2}}\rightarrow\infty}\mathbb{P}\big{(}{n_{1}}V+c(p,\mu_{\boldsymbol{\omega}_{R}})\geq q_{R,1-\alpha}+c(p,\mu_{\boldsymbol{\omega}_{R}})\,\big{\|}\,\boldsymbol{\omega}_{R}=\boldsymbol{\omega}\big{)}$
	$\displaystyle=$	$\displaystyle\lim_{{n_{1}},{n_{2}}\rightarrow\infty}\mathbb{P}\big{(}{n_{1}}V+c(p,\mu_{\boldsymbol{\omega}_{R}})\geq{n_{1}}q_{{n_{1}},{n_{2}},1-\alpha}+c(p,\mu_{\boldsymbol{\omega}_{R}})\,\big{\|}\,\boldsymbol{\omega}_{R}=\boldsymbol{\omega}\big{)}$
	$\displaystyle=$	$\displaystyle\lim_{{n_{1}},{n_{2}}\rightarrow\infty}\mathbb{P}\big{(}{n_{1}}V\geq{n_{1}}q_{{n_{1}},{n_{2}},1-\alpha}\,\big{\|}\,\boldsymbol{\omega}_{R}=\boldsymbol{\omega}\big{)}$

for fixed $\boldsymbol{\omega}\in\mathcal{E}.$ Hence, the lack of consistency of the test $\Delta_{{n_{1}},{n_{2}},R}^{\alpha,u}$ follows from Theorem 3.

B.4 Proof of Theorem 5

Recall that we use a permutation test defined as follows:

\displaystyle\Delta_{{n_{1}},{n_{2}},R}^{\alpha}:=\mathds{1}(V>q_{{n_{1}},{n_{2}},1-\alpha}).

For pointwise consistency, our strategy is to find a sequence $\beta_{{n_{1}},{n_{2}},R}\rightarrow 0$ such that

\displaystyle\mathbb{P}_{X\times Y\times\omega}\big{(}\Delta_{{n_{1}},{n_{2}},R}^{\alpha}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}})=0\big{)}\leq\beta_{{n_{1}},{n_{2}},R}

is true for ${n_{1}},{n_{2}}\geq N_{(P_{X},P_{Y})}$ and $R\geq R_{(P_{X},P_{Y})}$ , where $N_{(P_{X},P_{Y})}$ and $R_{(P_{X},P_{Y})}$ are constants depending on a given $(P_{X},P_{Y})$ with $P_{X}\neq P_{Y}$ . Similarly, if we use the test $\Delta_{{n_{1}},{n_{2}},R}^{\alpha,u}$ instead of $\Delta_{{n_{1}},{n_{2}},R}^{\alpha}$ , our goal is to show that $\mathbb{P}_{X\times Y\times\omega}(\Delta_{{n_{1}},{n_{2}},R}^{\alpha,u}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}})=0)\leq\beta_{{n_{1}},{n_{2}},R}.$ To achieve this goal, we use the approach that replaces a random permutation quantile with a deterministic quantity (see Fromont et al.,, 2013; Kim et al.,, 2022; Schrab et al.,, 2023). First, we start by integrating frameworks to analyze tests $\Delta_{{n_{1}},{n_{2}},R}^{\alpha}$ and $\Delta_{{n_{1}},{n_{2}},R}^{\alpha,u}$ in a unified manner. Let us define four events,

	$\displaystyle\mathcal{A}_{V}$	$\displaystyle:=\{V\leq q_{{n_{1}},{n_{2}},1-\alpha}\},$
	$\displaystyle\mathcal{A}_{U}$	$\displaystyle:=\{U\leq q^{u}_{{n_{1}},{n_{2}},1-\alpha}\},$

and

	$\displaystyle\mathcal{B}_{V,\beta}:=\bigg{\{}\mathbb{E}\left[V\right]\geq\sqrt{\frac{1}{\beta}\operatorname{Var}\left[V\right]}+q_{{n_{1}},{n_{2}},1-\alpha}\bigg{\}},$		(21)
	$\displaystyle\mathcal{B}_{U,\beta}:=\bigg{\{}\mathbb{E}\left[U\right]\geq\sqrt{\frac{1}{\beta}\operatorname{Var}\left[U\right]}+q^{u}_{{n_{1}},{n_{2}},1-\alpha}\bigg{\}}.$		(22)

Observe that $\mathbb{P}(\mathcal{A}_{V})\leq\beta$ implies $\mathbb{P}_{X\times Y\times\omega}\big{(}\Delta_{{n_{1}},{n_{2}},R}^{\alpha}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}})=0\big{)}\leq\beta,$ and similarly $\mathbb{P}(\mathcal{A}_{U})\leq\beta$ implies $\mathbb{P}_{X\times Y\times\omega}\big{(}\Delta_{{n_{1}},{n_{2}},R}^{\alpha,u}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}})=0\big{)}\leq\beta.$ Then, for an event $\mathcal{B}_{\beta}\subseteq\mathcal{B}_{V,\beta}\cap\mathcal{B}_{U,\beta},$ we claim that $\mathbb{P}(\mathcal{B}_{\beta})=1$ implies $\mathbb{P}(\mathcal{A}_{V})\leq\beta$ and $\mathbb{P}(\mathcal{A}_{U})\leq\beta.$ To see this, observe that Chebyshev’s inequality yields

	$\displaystyle\mathbb{P}\left(\mathcal{A}_{V},\mathcal{B}_{\beta}\right)$	$\displaystyle\leq\mathbb{P}\left(\mathcal{A}_{V},\mathcal{B}_{V,\beta}\right)$
		$\displaystyle\leq\mathbb{P}\bigg{(}V\leq\mathbb{E}\left[V\right]-\sqrt{\frac{1}{\beta}\operatorname{Var}\left[V\right]}\bigg{)}$
		$\displaystyle=\mathbb{P}\bigg{(}\sqrt{\frac{1}{\beta}\operatorname{Var}\left[V\right]}\leq\mathbb{E}\left[V\right]-V\bigg{)}$
		$\displaystyle\leq\mathbb{P}\bigg{(}\big{\|}V-\mathbb{E}[V]\big{\|}\geq\sqrt{\frac{1}{\beta}\operatorname{Var}[V]}\bigg{)}$
		$\displaystyle\leq\beta.$

Then we have

	$\displaystyle\mathbb{P}\left(\mathcal{A}_{V}\right)$	$\displaystyle=\mathbb{P}\left(\mathcal{A}_{V},\mathcal{B}_{\beta}\right)+\mathbb{P}\left(\mathcal{A}_{V},\mathcal{B}_{\beta}^{c}\right)$
		$\displaystyle\leq\beta+\mathbb{P}\left(\mathcal{A}_{V}\,\|\,\mathcal{B}_{\beta}^{c}\right)\mathbb{P}\left(\mathcal{B}_{\beta}^{c}\right)$
		$\displaystyle=\beta+0=\beta,$

and we can get a similar result with $\mathcal{A}_{U}.$ Therefore, our focus is on carefully identifying an event $\mathcal{B}_{\beta}$ and demonstrating that $\mathbb{P}(\mathcal{B}_{\beta})=1$ for sufficiently large ${n_{1}},{n_{2}}$ and $R$ . To obtain such $\mathcal{B}_{\beta}$ , we take a lower bound on the left-hand side and upper bound on the right-hand side in Equation (21) and Equation (22). Note that, as shown in Equation (18), the test statistic $V$ can be decomposed as

\displaystyle V=U+W,

(23)

where

W=\left(\frac{\kappa(0)}{n_{1}-1}-\frac{1}{n_{1}-1}\bigg{\|}\frac{1}{n_{1}}\sum^{n_{1}}_{i=1}\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(X_{i})\bigg{\|}_{\mathbb{R}^{2R}}^{2}+\frac{\kappa(0)}{n_{2}-1}-\frac{1}{n_{2}-1}\bigg{\|}\frac{1}{n_{2}}\sum^{n_{2}}_{i=1}\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Y_{i})\bigg{\|}_{\mathbb{R}^{2R}}^{2}\right),

and for all $x,y\in\mathbb{R}^{d},$

\displaystyle|\langle\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(x),\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(y)\rangle|=\frac{1}{R}\sum_{r=1}^{R}\kappa(0)|\cos({\omega_{r}^{\top}(x-y)})|\leq\kappa(0).

(24)

Therefore, $\big{\|}\frac{1}{n_{1}}\sum^{n_{1}}_{i=1}\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(X_{i})\big{\|}_{\mathbb{R}^{2R}}^{2}$ and $\big{\|}\frac{1}{n_{2}}\sum^{n_{2}}_{i=1}\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Y_{i})\big{\|}_{\mathbb{R}^{2R}}^{2}$ are less than or equal to $\kappa(0)$ , and this implies that $W$ satisfies $0\leq W\leq\kappa(0)\big{(}({n_{1}}-1)^{-1}+({n_{2}}-1)^{-1}\big{)}.$ Now, as a lower bound for $\mathbb{E}\left[V\right]$ and $\mathbb{E}\left[U\right]$ , we observe that

	$\displaystyle\mathbb{E}\left[V\right]$	$\displaystyle=\mathbb{E}\left[U+W\right]$
		$\displaystyle=\mathbb{E}\left[U\right]+\mathbb{E}\left[W\right]$
		$\displaystyle\geq\mathbb{E}\left[U\right].$

On the other hand, we note that $\operatorname{Var}\left[V\right]$ and $\operatorname{Var}\left[U\right]$ are both upper bounded by

	$\displaystyle\sqrt{\frac{1}{\beta}\operatorname{Var}\left[V\right]}$	$\displaystyle\leq\sqrt{\frac{2}{\beta}\operatorname{Var}\left[U\right]+\frac{2}{\beta}\operatorname{Var}\left[W\right]}$
		$\displaystyle\leq\sqrt{\frac{2}{\beta}\operatorname{Var}\left[U\right]}+\sqrt{\frac{2}{\beta}\operatorname{Var}\left[W\right]}$
		$\displaystyle\stackrel{{\scriptstyle(a)}}{{\leq}}\sqrt{\frac{2}{\beta}\operatorname{Var}\left[U\right]}+\sqrt{\frac{1}{2\beta}\kappa(0)^{2}\bigg{(}\frac{1}{{n_{1}}-1}+\frac{1}{{n_{2}}-1}\bigg{)}^{2}}$
		$\displaystyle\stackrel{{\scriptstyle(b)}}{{\leq}}\sqrt{\frac{2}{\beta}\operatorname{Var}\left[U\right]}+\sqrt{\frac{2}{\beta}}\kappa(0)\bigg{(}\frac{1}{{n_{1}}}+\frac{1}{{n_{2}}}\bigg{)},$

where the inequality (a) follows the fact that the variance of bounded variable is also bounded, (b) follows from the fact that $(x-1)^{-1}\leq 2x^{-1}$ for all $x\geq 2$ . For the critical value term, recall that the critical value is $q_{{n_{1}},{n_{2}},1-\alpha}=\inf\{t:F_{V}^{\pi}(t)\geq 1-\alpha\},$ and if we substitute the test statistic $V$ with $U,$ then the critical value is $q_{{n_{1}},{n_{2}},1-\alpha}^{u}=\inf\{t:F_{U}^{\pi}(t)\geq 1-\alpha\}.$ We claim that $q_{{n_{1}},{n_{2}},1-\alpha}$ and $q_{{n_{1}},{n_{2}},1-\alpha}^{u}$ are both upper bounded by

\displaystyle q_{{n_{1}},{n_{2}},1-\alpha}\leq q_{{n_{1}},{n_{2}},1-\alpha}^{u}+\kappa(0)\bigg{(}\frac{1}{{n_{1}}-1}+\frac{1}{{n_{2}}-1}\bigg{)}.

To see this, note that we have $|V-U|\leq\kappa(0)\big{(}({n_{1}}-1)^{-1}+({n_{2}}-1)^{-1}\big{)}$ from Equation (23). Based on this fact, for given $(Z_{1},\ldots,Z_{N})=(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}})$ and $\boldsymbol{\omega}_{R}$ , if a permutation $\pi\in\Pi_{N}$ satisfies $U(Z_{\pi(1)},\dots,Z_{\pi(N)};\boldsymbol{\omega}_{R})\leq q_{{n_{1}},{n_{2}},1-\alpha},$ then we also have $V(Z_{\pi(1)},\dots,Z_{\pi(N)};\boldsymbol{\omega}_{R})\leq q_{{n_{1}},{n_{2}},1-\alpha}+\kappa(0)\big{(}({n_{1}}-1)^{-1}+({n_{2}}-1)^{-1}\big{)}.$ This yields the desired result $q_{{n_{1}},{n_{2}},1-\alpha}\leq q_{{n_{1}},{n_{2}},1-\alpha}^{u}+\kappa(0)\big{(}({n_{1}}-1)^{-1}+({n_{2}}-1)^{-1}\big{)},$ and further, we also get $q_{{n_{1}},{n_{2}},1-\alpha}\leq q_{{n_{1}},{n_{2}},1-\alpha}^{u}+2\kappa(0)\big{(}{n_{1}}^{-1}+{n_{2}}^{-1}\big{)},$ using $(x-1)^{-1}\leq 2x^{-1}$ for all $x\geq 2.$

Combining the above results, we define an event

\displaystyle\mathcal{B}_{\beta}:=\bigg{\{}\mathbb{E}\left[U\right]\geq\sqrt{\frac{2}{\beta}\operatorname{Var}\left[U\right]}+q_{{n_{1}},{n_{2}},1-\alpha}^{u}+\kappa(0)\bigg{(}\sqrt{\frac{2}{\beta}}+2\bigg{)}\bigg{(}\frac{1}{{n_{1}}}+\frac{1}{{n_{2}}}\bigg{)}\bigg{\}},

(25)

then it is straightforward to see that $\mathcal{B}_{\beta}\subseteq\mathcal{B}_{V,\beta}\cap\mathcal{B}_{U,\beta}.$ Now, our strategy is to show that the probability $\mathbb{P}(\mathcal{B}_{\beta})$ becomes 1 with sufficiently large ${n_{1}},{n_{2}}$ , and $R$ . To start with, similar to Corollary 4, we can assume $\kappa(0)=1$ throughout the proof. Then the event $\mathcal{B}_{\beta}$ becomes

\displaystyle\mathcal{B}_{\beta}=\bigg{\{}\mathbb{E}\left[U\right]\geq\sqrt{\frac{2}{\beta}\operatorname{Var}\left[U\right]}+q_{{n_{1}},{n_{2}},1-\alpha}^{u}+\bigg{(}\sqrt{\frac{2}{\beta}}+2\bigg{)}\bigg{(}\frac{1}{{n_{1}}}+\frac{1}{{n_{2}}}\bigg{)}\bigg{\}},

and now we examine the three terms in the above event.

Expectation of $U$

For $\mathbb{E}\left[U\right],$ we note that the test statistic $U$ is an unbiased estimator, and hence

$\displaystyle\mathbb{E}\left[U\right]$	$\displaystyle=\mathbb{E}_{X\times Y}\big{[}\mathbb{E}_{\omega}\left[U\,\|\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}}\right]\big{]}$	(26)
	$\displaystyle=\mathbb{E}_{X\times Y}\big{[}\widehat{\mathrm{MMD}}_{u}^{2}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\mathcal{H}_{k})\big{]}$
	$\displaystyle=\mathrm{MMD}^{2}\left(P_{X},P_{Y};\mathcal{H}_{k}\right).$

Upper bound for the variance of $U$

For $\sqrt{\frac{2}{\beta}\operatorname{Var}\left[U\right]},$ consider the decomposition of $\operatorname{Var}\left[U\right]$ as follows:

	$\displaystyle\operatorname{Var}\left[U\right]$	$\displaystyle=\mathbb{E}_{X\times Y}\big{[}\operatorname{Var}_{\omega}[U\,\|\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}}]\big{]}+\operatorname{Var}_{X\times Y}\big{[}\mathbb{E}_{\omega}[U\,\|\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}}]\big{]}$		(27)
		$\displaystyle=\mathbb{E}_{X\times Y}\big{[}\operatorname{Var}_{\omega}[U\,\|\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}}]\big{]}+\operatorname{Var}_{X\times Y}\big{[}\widehat{\mathrm{MMD}}_{u}^{2}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\mathcal{H}_{k})\big{]}.$		(27)

For the first term in the last equation, recall that the statistic $V$ is

	$\displaystyle U(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\boldsymbol{\omega}_{R})$	$\displaystyle=\frac{1}{{n_{1}}({n_{1}}-1)}\sum_{1\leq i\neq j\leq{n_{1}}}\langle\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(X_{i}),\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(X_{j})\rangle-\frac{2}{{n_{1}}{n_{2}}}\sum_{i=1}^{n_{1}}\sum_{j=1}^{n_{2}}\langle\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(X_{i}),\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Y_{j})\rangle$
		$\displaystyle\quad+\frac{1}{{n_{2}}({n_{2}}-1)}\sum_{1\leq i\neq j\leq{n_{2}}}\langle\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Y_{i}),\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Y_{j})\rangle.$

Here, we emphasize that the inner product $\langle\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(X_{i}),\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(X_{j})\rangle$ and similar terms are actually the sample means. To be specific, observe that

\langle\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(X_{i}),\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(X_{j})\rangle=\frac{1}{R}\sum^{R}_{r=1}\langle{\psi_{\omega}}_{r}(X_{i}),{\psi_{\omega}}_{r}(X_{j})\rangle.

We note that as shown in Equation (29) when the samples $\mathcal{X}_{n_{1}}$ and $\mathcal{Y}_{n_{2}}$ are given, then the statistic $U$ can be seen as a mean of $R$ observations of $U_{1}(\omega_{r})$ , functions of i.i.d. random variables $\omega_{1},\ldots,\omega_{R},$ defined as

$\displaystyle U_{1}:=$	$\displaystyle\frac{1}{{n_{1}}({n_{1}}-1)}\sum_{1\leq i\neq j\leq{n_{1}}}\langle{\psi_{\omega}}_{1}(X_{i}),{\psi_{\omega}}_{1}(X_{j})\rangle-\frac{2}{{n_{1}}{n_{2}}}\sum_{i=1}^{n_{1}}\sum_{j=1}^{n_{2}}\langle{\psi_{\omega}}_{1}(X_{i}),{\psi_{\omega}}_{1}(Y_{j})\rangle$	(28)
	$\displaystyle\quad+\frac{1}{{n_{2}}({n_{2}}-1)}\sum_{1\leq i\neq j\leq{n_{2}}}\langle{\psi_{\omega}}_{1}(Y_{i}),{\psi_{\omega}}_{1}(Y_{j})\rangle$
$\displaystyle=$	$\displaystyle\frac{1}{{n_{1}}({n_{1}}-1)}\sum_{1\leq i\neq j\leq{n_{1}}}\kappa(0)\cos\big{(}{\omega_{1}^{\top}\left(X_{i}-X_{j}\right)}\big{)}-\frac{2}{{n_{1}}{n_{2}}}\sum_{i=1}^{n_{1}}\sum_{j=1}^{n_{2}}\kappa(0)\cos\big{(}{\omega_{1}^{\top}\left(X_{i}-Y_{j}\right)}\big{)}$
	$\displaystyle\quad+\frac{1}{{n_{2}}({n_{2}}-1)}\sum_{1\leq i\neq j\leq{n_{2}}}\kappa(0)\cos\big{(}{\omega_{1}^{\top}\left(Y_{i}-Y_{j}\right)}\big{)}.$

Hence, the conditional variance of $U$ in Equation (27) can be written as

\displaystyle\operatorname{Var}[U\,|\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}}]=\frac{1}{R}\operatorname{Var}[U_{1}\,|\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}}].

(29)

Also, since $|\!\cos(x)|\leq 1$ for all $x\in\mathbb{R}$ , we note that $|U_{1}|\leq 4\kappa(0).$ Therefore, since its variance is also bounded, we conclude that the first term in Equation (27) is bounded by

	$\displaystyle\mathbb{E}_{X\times Y}\big{[}\operatorname{Var}_{\omega}[U\,\|\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}}]\big{]}$	$\displaystyle=\frac{1}{R}\mathbb{E}_{X\times Y}\big{[}\operatorname{Var}_{\omega}[U_{1}\,\|\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}}]\big{]}$		(30)
		$\displaystyle\leq\frac{16\kappa(0)^{2}}{R}.$		(30)

For the second term in Equation (27), we leverage the result of Kim et al., (2022, Appendix F). Let $h(x_{1},x_{2},y_{1},y_{2}):=k(x_{1},x_{2})+k(y_{1},y_{2})-k(x_{1},y_{2})-k(x_{2},y_{1}).$ Then, there exists some positive constant $C_{1}$ such that the variance of the unbiased estimator of MMD can be bounded as

\operatorname{Var}\big{[}\widehat{\mathrm{MMD}}_{u}^{2}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\mathcal{H}_{k})\big{]}\leq C_{1}\bigg{(}\frac{\sigma^{2}_{1,0}}{{n_{1}}}+\frac{\sigma^{2}_{0,1}}{{n_{2}}}+\bigg{(}\frac{1}{{n_{1}}}+\frac{1}{{n_{2}}}\bigg{)}^{2}\sigma^{2}_{2,2}\bigg{)}

for

	$\displaystyle\sigma^{2}_{1,0}$	$\displaystyle:=\operatorname{Var}\Big{[}\mathbb{E}_{X^{\prime},Y,Y^{\prime}}\big{[}h(X,X^{\prime},Y,Y^{\prime})\big{]}\Big{]},$
	$\displaystyle\sigma^{2}_{0,1}$	$\displaystyle:=\operatorname{Var}\Big{[}\mathbb{E}_{X,X^{\prime},Y^{\prime}}\big{[}h(X,X^{\prime},Y,Y^{\prime})\big{]}\Big{]},$
	$\displaystyle\sigma^{2}_{2,2}$	$\displaystyle:=\operatorname{Var}\Big{[}h(X,X^{\prime},Y,Y^{\prime})\Big{]},$

where $X^{\prime}$ is an independent copy of $X$ , and $Y^{\prime}$ is an independent copy of $Y.$ We note that the kernel $k$ is bounded and Bochner’s theorem (Lemma 9) guarantees the existence of the nonnegative Borel measure $\Lambda$ that satisfies

k(x,y)=\kappa(x-y)=\int_{\mathbb{R}^{d}}\cos\bigl{(}{\omega^{\top}(x-y)}\bigr{)}d\Lambda(\omega).

Since $|\!\cos(x)|\leq 1$ for all $x\in\mathbb{R}$ and the measure $\Lambda$ is nonnegative, we have

\displaystyle k(x,y)=\int_{\mathbb{R}^{d}}\cos\bigl{(}{\omega^{\top}(x-y)}\bigr{)}d\Lambda(\omega)\leq\int_{\mathbb{R}^{d}}|\cos\bigl{(}{\omega^{\top}(x-y)}\bigr{)}|d\Lambda(\omega)\leq\int_{\mathbb{R}^{d}}1d\Lambda(\omega)=\kappa(0)

(31)

for all $x,y\in\mathbb{R}^{d}.$ Therefore, the kernel $k$ is bounded by $\kappa(0),$ and the term $|h(X,X^{\prime},Y,Y^{\prime})|$ is bounded by $4\kappa(0)$ . This yields

\max\left(\sigma^{2}_{1,0},\sigma^{2}_{0,1},\sigma^{2}_{2,2}\right)\leq 16\kappa(0)^{2}.

Now, we conclude that the second term in Equation (27) is bounded by

\displaystyle\operatorname{Var}\big{[}\widehat{\mathrm{MMD}}_{u}^{2}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\mathcal{H}_{k})\big{]}

\displaystyle\leq 16C_{1}\kappa(0)^{2}\left(\frac{1}{{n_{1}}}+\frac{1}{{n_{2}}}+\bigg{(}\frac{1}{{n_{1}}}+\frac{1}{{n_{2}}}\bigg{)}^{2}\right).

(32)

To sum up, combining results in Equations (30) and (32), we have

	$\displaystyle\operatorname{Var}\left[U\right]$	$\displaystyle\leq\mathbb{E}_{X\times Y}\big{[}\operatorname{Var}[V\,\|\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}}]\big{]}+2\operatorname{Var}\big{[}\widehat{\mathrm{MMD}}_{u}^{2}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\mathcal{H}_{k})\big{]}$
		$\displaystyle\leq\frac{16\kappa(0)^{2}}{R}+32C_{1}\kappa(0)^{2}\Bigg{(}\frac{1}{{n_{1}}}+\frac{1}{{n_{2}}}+\bigg{(}\frac{1}{{n_{1}}}+\frac{1}{{n_{2}}}\bigg{)}^{2}\Bigg{)}$
		$\displaystyle\leq\kappa(0)^{2}\Bigg{(}\frac{16}{R}+C_{2}\bigg{(}\frac{1}{{n_{1}}}+\frac{1}{{n_{2}}}\bigg{)}\Bigg{)},$

for $C_{2}:=64C_{1}$ , as $\left({n_{1}}^{-1}+{n_{2}}^{-1}\right)^{2}\leq{n_{1}}^{-1}+{n_{2}}^{-1}$ for ${n_{1}},{n_{2}}\geq 2$ . Therefore, we have

	$\displaystyle\sqrt{\frac{2}{\beta}\operatorname{Var}\left[U\right]}$	$\displaystyle\leq\frac{\kappa(0)}{\sqrt{\beta}}\sqrt{\frac{32}{R}+2C_{2}\bigg{(}\frac{1}{{n_{1}}}+\frac{1}{{n_{2}}}\bigg{)}}$		(33)
		$\displaystyle\leq\frac{1}{\sqrt{\beta}}\Bigg{(}\frac{6}{\sqrt{R}}+C_{3}\bigg{(}\frac{1}{\sqrt{n_{1}}}+\frac{1}{\sqrt{n_{2}}}\bigg{)}\Bigg{)},$		(33)

for $C_{3}=\sqrt{2C_{2}},$ where we use the fact that $\sqrt{x+y}\leq\sqrt{x}+\sqrt{y}$ for $x,y\geq 0,$ and the assumption $\kappa(0)=1.$

Upper bound for the critical value $q_{{n_{1}},{n_{2}},1-\alpha}^{u}$

In order to derive an upper bound for $q_{{n_{1}},{n_{2}},1-\alpha}^{u},$ we use the property of U-statistics as done by Kim et al., (2022, Appendix E, F). First, observe that Chebyshev’s inequality yields

\mathbb{P}_{\pi}\bigg{(}\big{|}U_{\pi}-\mathbb{E}_{\pi}[U_{\pi}\,|\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}},\boldsymbol{\omega}_{R}]\big{|}\geq\sqrt{\frac{1}{\alpha}\operatorname{Var}_{\pi}[U_{\pi}\,|\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}},\boldsymbol{\omega}_{R}]}\,\bigg{|}\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}},\boldsymbol{\omega}_{R}\bigg{)}\leq\alpha,

and by the definition of quantile, we have an upper bound of $q_{{n_{1}},{n_{2}},1-\alpha}^{u}$ :

\displaystyle q_{{n_{1}},{n_{2}},1-\alpha}^{u}\leq\mathbb{E}_{\pi}[U_{\pi}\,|\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}},\boldsymbol{\omega}_{R}]+\sqrt{\frac{1}{\alpha}\operatorname{Var}_{\pi}[U_{\pi}\,|\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}},\boldsymbol{\omega}_{R}]}.

For the first term of the right-hand side, since the U-statistic is centered at zero under the permutation law (see e.g., Kim et al.,, 2022, Appendix F), we can deduce that $\mathbb{E}_{\pi}[U_{\pi}\,|\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}},\boldsymbol{\omega}_{R}]=0.$ Similarly, for the second term, observe that

	$\displaystyle\operatorname{Var}_{\pi}[U_{\pi}\,\|\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}},\boldsymbol{\omega}_{R}]$	$\displaystyle=\mathbb{E}_{\pi}\big{[}(U_{\pi})^{2}\,\|\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}},\boldsymbol{\omega}_{R}\big{]}-\big{(}\mathbb{E}_{\pi}[U_{\pi}\,\|\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}},\boldsymbol{\omega}_{R}]\big{)}^{2}$
		$\displaystyle=\mathbb{E}_{\pi}\big{[}(U_{\pi})^{2}\,\|\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}},\boldsymbol{\omega}_{R}\big{]}.$

Here, we note that this statistic has been carefully studied in Kim et al., (2022, Appendix F), and the following result holds true:

	$\displaystyle\mathbb{E}_{\pi}\big{[}(U_{\pi})^{2}\,\|\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}},\boldsymbol{\omega}_{R}\big{]}$	(34)
$\displaystyle=$	$\displaystyle\frac{1}{{n_{1}}^{2}({n_{1}}-1)^{2}{n_{2}}^{2}({n_{2}}-1)^{2}}\sum_{(i_{1},\dots,j^{\prime}_{2})\in\mathbf{I}}\mathbb{E}_{\pi}\Big{[}\hat{h}\big{(}Z_{\pi(i_{1})},Z_{\pi(i_{2})};Z_{\pi(n_{1}+j_{1})},Z_{\pi(n_{1}+j_{2})}\big{)}$
	$\displaystyle\,\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\times\hat{h}\big{(}Z_{\pi(i^{\prime}_{1})},Z_{\pi(i^{\prime}_{2})};Z_{\pi(n_{1}+j^{\prime}_{1})},Z_{\pi(n_{1}+j^{\prime}_{2})}\big{)}\,\Big{\|}\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}},\boldsymbol{\omega}_{R}\Big{]},$

where $\hat{h}(x_{1},x_{2};y_{1},y_{2})$ is a kernel defined as $\hat{h}(x_{1},x_{2};y_{1},y_{2}):=\hat{k}(x_{1},x_{2})+\hat{k}(y_{1},y_{2})-\hat{k}(x_{1},y_{2})-\hat{k}(x_{2},y_{1})$ , and $\mathbf{I}$ is a set of index defined as $\mathbf{I}:=\{(i_{1},i_{2},i^{\prime}_{1},i^{\prime}_{2},j_{1},j_{2},j^{\prime}_{1},j^{\prime}_{2})\in\mathbb{N}^{8}_{+}:(i_{1},i_{2}),(i^{\prime}_{1},i^{\prime}_{2})\in\mathbf{i}^{m}_{2},(j_{1},j_{2}),(j^{\prime}_{1},j^{\prime}_{2})\in\mathbf{i}^{n}_{2},\#|\{i_{1},i_{2}\}\cap\{i^{\prime}_{1},i^{\prime}_{2}\}|+\#|\{j_{1},j_{2}\}\cap\{j^{\prime}_{1},j^{\prime}_{2}\}|>1\}.$ Here $\#|A|$ denotes the cardinality of a set $A$ , and $(l_{1},l_{2})\in\mathbf{i}^{k}_{2}$ implies $1\leq l_{1}\neq l_{2}\leq k.$ Recall that $\hat{k}(x,y)=\langle\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(x),\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(y)\rangle$ is the approximated kernel defined in Equation (3), and we have a bound $|\hat{k}(x,y)|\leq\kappa(0)$ for all $x,y\in\mathbb{R}^{d}$ , in Equation (24). This implies $|\hat{h}(\cdot)|\leq 4\kappa(0),$ and thus $|\hat{h}(\cdot)\times\hat{h}(\cdot)|\leq 16\kappa(0)^{2}.$ Using this observation and counting the number of $\mathbf{I}$ (Kim et al.,, 2022, Appendix F) yields

	$\displaystyle\mathbb{E}_{\pi}\big{[}(U_{\pi})^{2}\,\|\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}},\boldsymbol{\omega}_{R}\big{]}$	$\displaystyle\leq\frac{16}{{n_{1}^{2}}({n_{1}}-1)^{2}{n_{2}^{2}}({n_{2}}-1)^{2}}\sum_{(i_{1},\dots,j^{\prime}_{2})\in\mathbf{I}}1$
		$\displaystyle\leq C_{4}\kappa(0)^{2}\bigg{(}\frac{1}{{n_{1}}}+\frac{1}{{n_{2}}}\bigg{)}^{2}$

for some positive constant $C_{4},$ regardless of the realized values of $\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}}$ and $\boldsymbol{\omega}_{R}.$ Therefore, we get

\displaystyle\sqrt{\frac{1}{\alpha}\operatorname{Var}_{\pi}[U_{\pi}\,|\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}},\boldsymbol{\omega}_{R}]}

\displaystyle\leq\sqrt{\frac{C_{4}}{\alpha}}\kappa(0)\bigg{(}\frac{1}{{n_{1}}}+\frac{1}{{n_{2}}}\bigg{)}

and we conclude that

$\displaystyle q_{{n_{1}},{n_{2}},1-\alpha}^{u}$	$\displaystyle\leq\sqrt{\frac{1}{\alpha}\operatorname{Var}_{\pi}[U_{\pi}\,\|\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}},\boldsymbol{\omega}_{R}]}$	(35)
	$\displaystyle\leq\sqrt{\frac{C_{4}}{\alpha}}\kappa(0)\bigg{(}\frac{1}{{n_{1}}}+\frac{1}{{n_{2}}}\bigg{)}$
	$\displaystyle\leq C_{5}(\alpha)\bigg{(}\frac{1}{{n_{1}}}+\frac{1}{{n_{2}}}\bigg{)}$

for $C_{5}(\alpha):=\sqrt{\alpha^{-1}C_{4}}.$

Finding $\beta_{{n_{1}},{n_{2}},R}$

Based on Equations (33) and (35) that we obtained so far, we can derive the result as follows:

	$\displaystyle\quad\sqrt{\frac{2}{\beta}\operatorname{Var}\left[U\right]}+q_{{n_{1}},{n_{2}},1-\alpha}^{u}+\bigg{(}\sqrt{\frac{2}{\beta}}+2\bigg{)}\bigg{(}\frac{1}{{n_{1}}}+\frac{1}{{n_{2}}}\bigg{)}$
	$\displaystyle\leq\frac{1}{\sqrt{\beta}}\Bigg{(}\frac{6}{\sqrt{R}}+C_{3}\bigg{(}\frac{1}{\sqrt{n_{1}}}+\frac{1}{\sqrt{n_{2}}}\bigg{)}+\sqrt{2}\bigg{(}\frac{1}{{n_{1}}}+\frac{1}{{n_{2}}}\bigg{)}\Bigg{)}+\left(C_{5}(\alpha)+2\right)\bigg{(}\frac{1}{{n_{1}}}+\frac{1}{{n_{2}}}\bigg{)}$
	$\displaystyle\leq C(\alpha)\bigg{(}\frac{1}{\sqrt{\beta R}}+\frac{1}{\sqrt{\beta{n_{1}}}}+\frac{1}{\sqrt{\beta{n_{2}}}}+\frac{1}{{n_{1}}}+\frac{1}{{n_{2}}}\bigg{)}$

for a constant $C(\alpha):=\max\left\{6,C_{3}+\sqrt{2},C_{5}(\alpha)+2\right\}.$ Here, we consider a sequence that converges to zero, $\beta_{{n_{1}},{n_{2}},R}=\max\big{\{}\frac{1}{\log({n_{1}})},\frac{1}{\log({n_{2}})},\frac{1}{\log(R)}\big{\}}.$ Then we get

	$\displaystyle\quad\sqrt{\frac{2}{\beta}\operatorname{Var}\left[U\right]}+q_{{n_{1}},{n_{2}},1-\alpha}^{u}+\bigg{(}\sqrt{\frac{2}{\beta}}+2\bigg{)}\bigg{(}\frac{1}{{n_{1}}}+\frac{1}{{n_{2}}}\bigg{)}$
	$\displaystyle\leq C(\alpha)\Bigg{(}\bigg{(}\frac{\log R}{R}\bigg{)}^{1/2}+\bigg{(}\frac{\log{n_{1}}}{{n_{1}}}\bigg{)}^{1/2}+\bigg{(}\frac{\log{n_{2}}}{{n_{2}}}\bigg{)}^{1/2}+\frac{1}{n_{1}}+\frac{1}{{n_{2}}}\Bigg{)}.$

Since $\lim_{x\rightarrow\infty}(\log x/x)^{1/2}=0$ and $\lim_{x\rightarrow\infty}(1/x)=0$ , there exist $N_{(P_{X},P_{Y})}$ and $R_{(P_{X},P_{Y})}$ such that ${n_{1}},{n_{2}}\geq N_{(P_{X},P_{Y})}$ and $R\geq R_{(P_{X},P_{Y})}$ implies

\displaystyle\mathrm{MMD}^{2}\left(P_{X},P_{Y};\mathcal{H}_{k}\right)\geq C(\alpha)\left(\bigg{(}\frac{\log R}{R}\bigg{)}^{1/2}+\bigg{(}\frac{\log{n_{1}}}{{n_{1}}}\bigg{)}^{1/2}+\bigg{(}\frac{\log{n_{2}}}{{n_{2}}}\bigg{)}^{1/2}+\frac{1}{n_{1}}+\frac{1}{{n_{2}}}\right).

Then we can deduce that

	$\displaystyle\mathbb{P}(\mathcal{B}_{\beta_{{n_{1}},{n_{2}},R}})$	$\displaystyle=\mathbb{P}\left\{\mathbb{E}\left[U\right]\geq\sqrt{\frac{2}{\beta_{{n_{1}},{n_{2}},R}}\operatorname{Var}\left[U\right]}+q_{{n_{1}},{n_{2}},1-\alpha}^{u}+\bigg{(}\sqrt{\frac{2}{\beta}}+2\bigg{)}\bigg{(}\frac{1}{{n_{1}}}+\frac{1}{{n_{2}}}\bigg{)}\right\}$
		$\displaystyle\geq\mathbb{P}\left\{\mathrm{MMD}^{2}\left(P_{X},P_{Y};\mathcal{H}_{k}\right)\geq\sqrt{\frac{2}{\beta_{{n_{1}},{n_{2}},R}}\operatorname{Var}\left[U\right]}+q_{{n_{1}},{n_{2}},1-\alpha}^{u}+\bigg{(}\sqrt{\frac{2}{\beta}}+2\bigg{)}\bigg{(}\frac{1}{{n_{1}}}+\frac{1}{{n_{2}}}\bigg{)}\right\}$
		$\displaystyle\geq\mathbb{P}\left\{\mathrm{MMD}^{2}\left(P_{X},P_{Y};\mathcal{H}_{k}\right)\geq C(\alpha)\left(\bigg{(}\frac{\log R}{R}\bigg{)}^{1/2}+\bigg{(}\frac{\log{n_{1}}}{{n_{1}}}\bigg{)}^{1/2}+\bigg{(}\frac{\log{n_{2}}}{{n_{2}}}\bigg{)}^{1/2}+\frac{1}{n_{1}}+\frac{1}{{n_{2}}}\right)\right\}$
		$\displaystyle=1$

for ${n_{1}},{n_{2}}\geq N_{(P_{X},P_{Y})}$ and $R\geq R_{(P_{X},P_{Y})}$ . Note that the sequence $\beta_{{n_{1}},{n_{2}},R}$ converges to 0 and this completes the proof.

B.5 Proof of Theorem 6

To begin with, we introduce some assumptions and useful facts for ease of analysis. Note that we use a translation invariant kernel which can be decomposed as

k_{\lambda}(x,y)=\kappa_{\lambda}(x-y):=\prod^{d}_{i=1}\frac{1}{\lambda_{i}}\kappa_{i}\left(\frac{x_{i}-y_{i}}{\lambda_{i}}\right)

for $\lambda=(\lambda_{1},\ldots,\lambda_{d})\in(0,\infty)^{d}.$ Here, without loss of generality, we assume that $\prod^{d}_{i=1}\kappa_{i}(0)=1.$ If not, this can be done by scaling the bandwidth and $\kappa$ with a constant while the kernel $k$ remains unchanged. To be specific,

k(x,y)=\prod^{d}_{i=1}\frac{1}{\lambda_{i}}\kappa_{i}\left(\frac{x_{i}-y_{i}}{\lambda_{i}}\right)=\prod^{d}_{i=1}\frac{1}{\lambda^{*}_{i}}\kappa^{*}_{i}\left(\frac{x_{i}-y_{i}}{\lambda^{*}_{i}}\right)

holds where $\kappa^{*}_{i}(x):=\kappa_{i}(x/\kappa_{i}(0))/\kappa_{i}(0),~{}\lambda^{*}_{i}=\lambda_{i}/\kappa_{i}(0),$ and then $\prod^{d}_{i=1}\kappa^{*}_{i}(0)=1.$ Now, note that our assumption yields $\kappa_{\lambda}(0)=(\lambda_{1}\cdots\lambda_{d})^{-1}.$ Also, let us denote $C_{0},C^{\prime}_{0}>0$ be a constant that satisfies

\displaystyle\frac{1}{{n_{1}}-1}+\frac{1}{{n_{2}}-1}\leq\frac{C_{0}}{n},

(36)

and

\displaystyle\frac{1}{({n_{1}}-1)^{2}}+\frac{1}{({n_{2}}-1)^{2}}\leq\frac{C^{\prime}_{0}}{{n}^{2}},

(37)

respectively.

For the proof of Theorem 6, we follow a similar approach taken in Schrab et al., (2023) to derive an upper bound for the uniform separation rate. First, as in the proof of Theorem 5, we define an event that can be utilized concurrently for analyzing both tests $\Delta_{{n_{1}},{n_{2}},R}^{\alpha}$ and $\Delta_{{n_{1}},{n_{2}},R}^{\alpha,u}$ . Consider the following two events

	$\displaystyle\mathcal{B}_{V,\beta/2}$	$\displaystyle:=\bigg{\{}\mathbb{E}\left[V\right]\geq\sqrt{\frac{2}{\beta}\operatorname{Var}\left[V\right]}+q_{{n_{1}},{n_{2}},1-\alpha}\bigg{\}}\quad\text{and}$
	$\displaystyle\mathcal{B}_{U,\beta/2}$	$\displaystyle:=\bigg{\{}\mathbb{E}\left[U\right]\geq\sqrt{\frac{2}{\beta}\operatorname{Var}\left[U\right]}+q^{u}_{{n_{1}},{n_{2}},1-\alpha}\bigg{\}},$

and suppose that there exists an event $\mathcal{B}_{\beta/2}\subseteq\mathcal{B}_{V,\beta/2}\cap\mathcal{B}_{U,\beta/2}$ and $\mathbb{P}(\mathcal{B}_{\beta/2})\geq 1-\beta/2$ . Also, for the following two events,

	$\displaystyle\mathcal{A}_{V}$	$\displaystyle:=\{V\leq q_{{n_{1}},{n_{2}},1-\alpha}\}\quad\text{and}$
	$\displaystyle\mathcal{A}_{U}$	$\displaystyle:=\{U\leq q^{u}_{{n_{1}},{n_{2}},1-\alpha}\},$

recall that $\mathbb{P}(\mathcal{A}_{V})\leq\beta$ implies $\mathbb{P}_{X\times Y}(\Delta_{{n_{1}},{n_{2}},R}^{\alpha}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}})=0)\leq\beta,$ and similarly $\mathbb{P}(\mathcal{A}_{U})\leq\beta$ implies $\mathbb{P}_{X\times Y}(\Delta_{{n_{1}},{n_{2}},R}^{\alpha,u}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}})=0)\leq\beta.$ Then Chebyshev’s inequality yields the desired result as

	$\displaystyle\mathbb{P}(\mathcal{A}_{V})$	$\displaystyle=\mathbb{P}(\mathcal{A}_{V},\mathcal{B}_{\beta/2})+\mathbb{P}(\mathcal{A}_{V}\,\|\,\mathcal{B}_{\beta/2}^{c})\mathbb{P}(\mathcal{B}_{\beta/2}^{c})$
		$\displaystyle\leq\mathbb{P}(\mathcal{A}_{V},\mathcal{B}_{V,\beta/2})+\frac{\beta}{2}$
		$\displaystyle\leq\mathbb{P}\bigg{(}V\leq\mathbb{E}[V]-\sqrt{\frac{2}{\beta}\operatorname{Var}[V]}\bigg{)}+\frac{\beta}{2}$
		$\displaystyle=\mathbb{P}\bigg{(}\sqrt{\frac{2}{\beta}\operatorname{Var}\left[V\right]}\leq\mathbb{E}\left[V\right]-V\bigg{)}+\frac{\beta}{2}$
		$\displaystyle\leq\mathbb{P}\bigg{(}\|V-\mathbb{E}[V]\|\geq\sqrt{\frac{2}{\beta}\operatorname{Var}\left[V\right]}\bigg{)}+\frac{\beta}{2}$
		$\displaystyle\leq\frac{\beta}{2}+\frac{\beta}{2}=\beta,$

and similarly we can get $\mathbb{P}\left(\mathcal{A}_{U}\right)\leq\beta.$

Therefore, our strategy is to identify such event $\mathcal{B}_{\beta/2}.$ To begin with, we carefully analyze the difference between the statistics $V$ and $U$ , following the logic similar to Kim and Schrab, (2023, Appendix E.11). From Equation (23), the statistic $V$ can be decomposed as

	$\displaystyle V$	$\displaystyle=U+\left(\frac{\kappa_{\lambda}(0)}{{n_{1}}-1}-\frac{1}{{n_{1}}-1}\bigg{\\|}\frac{1}{n_{1}}\sum^{n_{1}}_{i=1}\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(X_{i})\bigg{\\|}_{\mathbb{R}^{2R}}^{2}+\frac{\kappa_{\lambda}(0)}{{n_{2}}-1}-\frac{1}{{n_{2}}-1}\bigg{\\|}\frac{1}{{n_{2}}}\sum^{n_{2}}_{i=1}\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(Y_{i})\bigg{\\|}_{\mathbb{R}^{2R}}^{2}\right)$
		$\displaystyle=U+\frac{\kappa_{\lambda}(0)}{{n_{1}}-1}-\frac{\kappa_{\lambda}(0)}{{n_{1}}({n_{1}}-1)}+\frac{\kappa_{\lambda}(0)}{{n_{2}}-1}-\frac{\kappa_{\lambda}(0)}{{n_{2}}({n_{2}}-1)}$
		$\displaystyle\quad-\bigg{(}\underbrace{\frac{1}{{n_{1}^{2}}({n_{1}}-1)}\sum_{1\leq i\neq j\leq{n_{1}}}\hat{k}(X_{i},X_{j})+\frac{1}{{n_{2}^{2}}({n_{2}}-1)}\sum_{1\leq i\neq j\leq{n_{2}}}\hat{k}(Y_{i},Y_{j})}_{:=W^{\prime}}\bigg{)}$
		$\displaystyle=U-W^{\prime}+\kappa_{\lambda}(0)\bigg{(}\frac{1}{{n_{1}}}+\frac{1}{{n_{2}}}\bigg{)}.$

Here, we claim that the event

\mathcal{B}_{\beta/2}:=\Bigg{\{}\mathbb{E}\left[U\right]\geq\sqrt{\frac{4}{\beta}\operatorname{Var}\left[U\right]}+q_{{n_{1}},{n_{2}},1-\alpha}^{u}+\mathbb{E}\left[W^{\prime}\right]+\sqrt{\frac{4}{\beta}\operatorname{Var}\left[W^{\prime}\right]}+C^{\prime}_{0}\frac{\kappa_{\lambda}(0)}{{n}^{2}}\Bigg{\}}

with a positive constant $C^{\prime}_{0}>0$ defined in Equation 37 satisfies $\mathcal{B}_{\beta/2}\subseteq\mathcal{B}_{V,\beta/2}\cap\mathcal{B}_{U,\beta/2}.$ To see this, we would first show $\mathcal{B}_{\beta/2}\subseteq\mathcal{B}_{U,\beta/2}$ and then show $\mathcal{B}_{\beta/2}\subseteq\mathcal{B}_{V,\beta/2}.$ Now, note that the nonnegativity of the kernel $k$ guarantees that the inequality

	$\displaystyle\mathbb{E}\left[W^{\prime}\right]$	$\displaystyle=\mathbb{E}_{X\times Y}\big{[}\mathbb{E}_{\omega}[W^{\prime}\,\|\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}}]\big{]}$
		$\displaystyle=\mathbb{E}_{X}\bigg{[}\frac{1}{{n_{1}^{2}}({n_{1}}-1)}\sum_{1\leq i\neq j\leq{n_{1}}}\mathbb{E}_{\omega}\big{[}\hat{k}(X_{i},X_{j})\,\big{\|}\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}}\big{]}\bigg{]}$
		$\displaystyle\quad+\mathbb{E}_{Y}\bigg{[}\frac{1}{{n_{2}^{2}}({n_{2}}-1)}\sum_{1\leq i\neq j\leq{n_{2}}}\mathbb{E}_{\omega}\big{[}\hat{k}(Y_{i},Y_{j})\,\big{\|}\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}}\big{]}\bigg{]}$
		$\displaystyle=\mathbb{E}_{X}\bigg{[}\frac{1}{{n_{1}^{2}}({n_{1}}-1)}\sum_{1\leq i\neq j\leq{n_{1}}}k(X_{i},X_{j})\bigg{]}+\mathbb{E}_{Y}\bigg{[}\frac{1}{{n_{2}^{2}}({n_{2}}-1)}\sum_{1\leq i\neq j\leq{n_{2}}}k(Y_{i},Y_{j})\bigg{]}$
		$\displaystyle=\frac{1}{n_{1}}\mathbb{E}_{X_{1}\times X_{2}}\big{[}k(X_{1},X_{2})\big{]}+\frac{1}{{n_{2}}}\mathbb{E}_{Y_{1}\times Y_{2}}\big{[}k(Y_{1},Y_{2})\big{]}$
		$\displaystyle\geq 0.$

Based on this observation, it can be shown that the right-hand side in the event $\mathcal{B}_{\beta/2}$ is an upper bound for the right-hand side in the event $\mathcal{B}_{U,\beta/2}$ , i.e.,

\displaystyle\sqrt{\frac{2}{\beta}\operatorname{Var}\left[U\right]}+q^{u}_{{n_{1}},{n_{2}},1-\alpha}\leq\sqrt{\frac{4}{\beta}\operatorname{Var}\left[U\right]}+q_{{n_{1}},{n_{2}},1-\alpha}^{u}+\mathbb{E}\left[W^{\prime}\right]+\sqrt{\frac{4}{\beta}\operatorname{Var}\left[W^{\prime}\right]}+C^{\prime}_{0}\frac{\kappa_{\lambda}(0)}{{n}^{2}},

and this implies $\mathcal{B}_{\beta/2}\subseteq\mathcal{B}_{U,\beta/2}.$

For the inequality $\mathcal{B}_{\beta/2}\subseteq\mathcal{B}_{V,\beta/2},$ observe that

\mathbb{E}[V]=\mathbb{E}[U]-\mathbb{E}[W^{\prime}]+\kappa_{\lambda}(0)\bigg{(}\frac{1}{{n_{1}}}+\frac{1}{{n_{2}}}\bigg{)}

and plugging this equality into the event $\mathcal{B}_{V,\beta/2}$ yields

	$\displaystyle\mathcal{B}_{V,\beta/2}$	$\displaystyle=\bigg{\{}\mathbb{E}\left[V\right]\geq\sqrt{\frac{2}{\beta}\operatorname{Var}\left[V\right]}+q_{{n_{1}},{n_{2}},1-\alpha}\bigg{\}}$
		$\displaystyle=\bigg{\{}\mathbb{E}\left[U\right]\geq\sqrt{\frac{2}{\beta}\operatorname{Var}\left[V\right]}+\mathbb{E}[W^{\prime}]+q_{{n_{1}},{n_{2}},1-\alpha}-\kappa_{\lambda}(0)\bigg{(}\frac{1}{{n_{1}}}+\frac{1}{{n_{2}}}\bigg{)}\bigg{\}}.$

Here, note that we have

	$\displaystyle\sqrt{\frac{2}{\beta}\operatorname{Var}[V]}$	$\displaystyle\leq\sqrt{\frac{2}{\beta}\operatorname{Var}[U-W^{\prime}]}$
		$\displaystyle\leq\sqrt{\frac{4}{\beta}\operatorname{Var}[U]+\frac{4}{\beta}\operatorname{Var}[W^{\prime}]}$
		$\displaystyle\leq\sqrt{\frac{4}{\beta}\operatorname{Var}[U]}+\sqrt{\frac{4}{\beta}\operatorname{Var}[W^{\prime}]}$

and

	$\displaystyle q_{{n_{1}},{n_{2}},1-\alpha}-\kappa_{\lambda}(0)\bigg{(}\frac{1}{{n_{1}}}+\frac{1}{{n_{2}}}\bigg{)}$	$\displaystyle\leq q_{{n_{1}},{n_{2}},1-\alpha}^{u}+\kappa_{\lambda}(0)\bigg{(}\frac{1}{{n_{1}}-1}+\frac{1}{{n_{2}}-1}\bigg{)}-\kappa_{\lambda}(0)\bigg{(}\frac{1}{{n_{1}}}+\frac{1}{{n_{2}}}\bigg{)}$
		$\displaystyle=q_{{n_{1}},{n_{2}},1-\alpha}^{u}+\kappa_{\lambda}(0)\bigg{(}\frac{1}{{n_{1}}({n_{1}}-1)}+\frac{1}{{n_{2}}({n_{2}}-1)}\bigg{)}$
		$\displaystyle\leq q_{{n_{1}},{n_{2}},1-\alpha}^{u}+C^{\prime}_{0}\frac{\kappa_{\lambda}(0)}{{n}^{2}}$

for the constant $C^{\prime}_{0}>0$ defined in Equation 37. These two facts guarantee that the right-hand side in the event $\mathcal{B}_{\beta/2}$ is an upper bound for the right-hand side in the event $\mathcal{B}_{V,\beta/2}$ , i.e.,

\displaystyle\sqrt{\frac{2}{\beta}\operatorname{Var}\left[V\right]}+q^{u}_{{n_{1}},{n_{2}},1-\alpha}\leq\sqrt{\frac{4}{\beta}\operatorname{Var}\left[U\right]}+q_{{n_{1}},{n_{2}},1-\alpha}^{u}+\mathbb{E}\left[W^{\prime}\right]+\sqrt{\frac{4}{\beta}\operatorname{Var}\left[W^{\prime}\right]}+C^{\prime}_{0}\frac{\kappa_{\lambda}(0)}{{n}^{2}}

and we conclude that $\mathcal{B}_{\beta/2}\subseteq\mathcal{B}_{V,\beta/2}.$

Now, we move our focus to find a sufficient condition for $\mathbb{P}(\mathcal{B}_{\beta/2})\geq 1-\beta/2.$ Observe that Chebyshev’s inequality yields

\mathbb{P}_{\pi}\Bigg{(}\big{|}U_{\pi}-\mathbb{E}_{\pi}[U_{\pi}\,|\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}},\boldsymbol{\omega}_{R}]\big{|}\geq\sqrt{\frac{1}{\alpha}\operatorname{Var}_{\pi}[U_{\pi}\,|\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}},\boldsymbol{\omega}_{R}]}\,\Bigg{|}\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}},\boldsymbol{\omega}_{R}\Bigg{)}\leq\alpha,

and by the definition of a quantile, we have an upper bound for $q_{{n_{1}},{n_{2}},1-\alpha}^{u}$ :

\displaystyle q_{{n_{1}},{n_{2}},1-\alpha}^{u}\leq\mathbb{E}_{\pi}[U_{\pi}\,|\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}},\boldsymbol{\omega}_{R}]+\sqrt{\frac{1}{\alpha}\operatorname{Var}_{\pi}[U_{\pi}\,|\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}},\boldsymbol{\omega}_{R}]}.

\mathbb{P}\bigg{(}\sqrt{\frac{1}{\alpha}\operatorname{Var}_{\pi}[U_{\pi}\,|\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}},\boldsymbol{\omega}_{R}]}<\sqrt{\frac{2}{\alpha\beta}\mathbb{E}\big{[}\operatorname{Var}_{\pi}[U_{\pi}\,|\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}},\boldsymbol{\omega}_{R}]\big{]}}\bigg{)}\geq 1-\frac{\beta}{2},

we conclude that

\displaystyle\mathbb{E}\left[U\right]\geq\sqrt{\frac{4}{\beta}\operatorname{Var}\left[U\right]}+\sqrt{\frac{2}{\alpha\beta}\mathbb{E}\big{[}\operatorname{Var}_{\pi}[U_{\pi}\,|\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}},\boldsymbol{\omega}_{R}]\big{]}}+\mathbb{E}\left[W^{\prime}\right]+\sqrt{\frac{4}{\beta}\operatorname{Var}\left[W^{\prime}\right]}+C^{\prime}_{0}\frac{\kappa_{\lambda}(0)}{{n}^{2}}

(38)

is a sufficient condition for $\mathbb{P}(\mathcal{B}_{\beta/2})\geq 1-\beta/2.$ Therefore, our goal is to analyze the above equation and to find a proper rate of $R$ and the bandwidth $\lambda_{1},\dots,\lambda_{d}$ in terms of $R,{n_{1}}$ and ${n_{2}}$ to uniformly control both types of errors.

Lower bound for $\mathbb{E}\left[U\right]$

We first note that $\mathbb{E}\left[U\right]=\mathrm{MMD}^{2}\left(P_{X},P_{Y};\mathcal{H}_{k_{\lambda}}\right)$ , and $\mathrm{MMD}^{2}\left(P_{X},P_{Y};\mathcal{H}_{k_{\lambda}}\right)$ can be written in $L_{2}$ sense (Schrab et al.,, 2023, Appendix E.5):

	$\displaystyle\mathrm{MMD}^{2}\left(P_{X},P_{Y};\mathcal{H}_{k_{\lambda}}\right)$	$\displaystyle=\langle\xi,\xi\ast\kappa_{\lambda}\rangle_{2}$
		$\displaystyle=\frac{1}{2}\left(\\|\xi\\|^{2}_{2}+\\|\xi\ast\kappa_{\lambda}\\|^{2}_{2}-\\|\xi-\xi\ast\kappa_{\lambda}\\|^{2}_{2}\right),$

where $\xi:=p-q,$ $\kappa_{\lambda}(u)=\prod^{d}_{i=1}(1/{\lambda_{i}})\kappa_{i}\left(u_{i}/{\lambda_{i}}\right)$ for $u\in\mathbb{R}^{d},$ $\ast$ denotes convoluton, and $\langle\cdot,\cdot\rangle_{2}$ is an inner product defined on $L^{2}(\mathbb{R}^{d}),$ i.e., $\langle f,~{}g\rangle_{2}=\int_{\mathbb{R}^{d}}f(x)g(x)\>dx$ for $f,g\in L^{2}(\mathbb{R}^{d}).$ Hence,

\displaystyle\mathbb{E}\left[U\right]=\frac{1}{2}\left(\|\xi\|^{2}_{2}+\|\xi\ast\kappa_{\lambda}\|^{2}_{2}-\|\xi-\xi\ast\kappa_{\lambda}\|^{2}_{2}\right).

Now, we want to upper bound $\|\xi-\xi\ast\kappa_{\lambda}\|^{2}_{2}$ , and recall that we assumed the difference of the densities $p-q$ lying in a Sobolev ball $\mathcal{S}_{d}^{s}(M_{1})$ . In this setting, as shown in Schrab et al., (2023, Appendix E.6), we have

\|\xi-\xi\ast\kappa_{\lambda}\|^{2}_{2}\leq S^{2}\|\xi\|^{2}_{2}+C_{1}(M_{1},d,s)\sum^{d}_{i=1}\lambda_{i}^{2s}

for some fixed constant $S\in(0,1)$ and positive constant $C_{1}(M_{1},d,s).$ Therefore, we conclude that

	$\displaystyle\mathbb{E}\left[U\right]$	$\displaystyle=\frac{1}{2}\left(\\|\xi\\|^{2}_{2}+\\|\xi\ast\kappa_{\lambda}\\|^{2}_{2}-\\|\xi-\xi\ast\kappa_{\lambda}\\|^{2}_{2}\right)$		(39)
		$\displaystyle\geq\frac{1-S^{2}}{2}\\|\xi\\|^{2}_{2}+\frac{1}{2}\\|\xi\ast\kappa_{\lambda}\\|^{2}_{2}-C_{1}^{\prime}(M_{1},d,s)\sum^{d}_{i=1}\lambda_{i}^{2s},$		(39)

for $C_{1}^{\prime}(M_{1},d,s):=\frac{1}{2}C_{1}(M_{1},d,s).$

Upper bound for $\sqrt{\frac{4}{\beta}\operatorname{Var}\left[U\right]}$

Recall the statistic $U_{1}$ that estimates the squared MMD with a single random feature, defined in Equation (28). Note that as shown in Equation (29), when the samples $\mathcal{X}_{n_{1}}$ and $\mathcal{Y}_{n_{2}}$ are given, then the statistic $U$ can be seen as the expectation of $R$ observations of $U_{1}(\omega_{r})$ , functions of i.i.d. random variables $\omega_{1},\ldots,\omega_{R}.$ Hence, we can decompose the variance of $U$ as follows:

	$\displaystyle\operatorname{Var}\left[U\right]$	$\displaystyle=\mathbb{E}_{X\times Y}\big{[}\operatorname{Var}_{\omega}[U\,\|\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}}]\big{]}+\operatorname{Var}_{X\times Y}\big{[}\mathbb{E}_{\omega}[U\,\|\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}}]\big{]}$
		$\displaystyle=\mathbb{E}_{X\times Y}\bigg{[}\frac{1}{R}\operatorname{Var}_{\omega}[U_{1}\,\|\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}}]\bigg{]}+\operatorname{Var}_{X\times Y}\big{[}\widehat{\mathrm{MMD}}_{u}^{2}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\mathcal{H}_{k})\big{]}$
		$\displaystyle\leq\frac{1}{R}\mathbb{E}_{X\times Y}\Big{[}\mathbb{E}_{\omega}\big{[}(U_{1})^{2}\,\big{\|}\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}}\big{]}\Big{]}+\operatorname{Var}_{X\times Y}\big{[}\widehat{\mathrm{MMD}}_{u}^{2}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\mathcal{H}_{k})\big{]}$
		$\displaystyle=\frac{1}{R}\mathbb{E}_{\omega}\Big{[}\mathbb{E}_{X\times Y}\big{[}(U_{1})^{2}\,\big{\|}\,\omega\big{]}\Big{]}+\operatorname{Var}_{X\times Y}\big{[}\widehat{\mathrm{MMD}}_{u}^{2}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\mathcal{H}_{k})\big{]}$
		$\displaystyle=\frac{1}{R}\mathbb{E}_{\omega}\Big{[}\operatorname{Var}_{X\times Y}[U_{1}\,\|\,\omega]+\big{(}\mathbb{E}_{X\times Y}[U_{1}\,\|\,\omega]\big{)}^{2}\Big{]}+\operatorname{Var}_{X\times Y}\big{[}\widehat{\mathrm{MMD}}_{u}^{2}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\mathcal{H}_{k})\big{]}$
		$\displaystyle=\frac{1}{R}\mathbb{E}_{\omega}\big{[}\operatorname{Var}_{X\times Y}[U_{1}\,\|\,\omega]\big{]}+\frac{1}{R}\mathbb{E}_{\omega}\big{[}\big{(}\mathbb{E}_{X\times Y}[U_{1}\,\|\,\omega]\big{)}^{2}\big{]}+\operatorname{Var}_{X\times Y}\big{[}\widehat{\mathrm{MMD}}_{u}^{2}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\mathcal{H}_{k})\big{]}.$

Therefore, we have

	$\displaystyle\sqrt{\frac{4}{\beta}\operatorname{Var}\left[U\right]}$	$\displaystyle\leq\sqrt{\frac{4}{\beta R}\mathbb{E}_{\omega}\big{[}\operatorname{Var}_{X\times Y}[U_{1}\,\|\,\omega]\big{]}}+\sqrt{\frac{4}{\beta R}\mathbb{E}_{\omega}\big{[}\big{(}\mathbb{E}_{X\times Y}[U_{1}\,\|\,\omega]\big{)}^{2}\big{]}}$		(40)
		$\displaystyle\quad+\sqrt{\frac{4}{\beta}\operatorname{Var}_{X\times Y}\big{[}\widehat{\mathrm{MMD}}_{u}^{2}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\mathcal{H}_{k})\big{]}}.$		(40)

We start by analyzing the first term of the right hand side, $\sqrt{\frac{4}{\beta R}\mathbb{E}_{\omega}\big{[}\operatorname{Var}_{X\times Y}[U_{1}\,\big{|}\,\omega]\big{]}}$ . When $\omega$ is fixed, we note that $U_{1}$ is a two-sample U-statistic. We use the exact variance formula of the two-sample U-statistic (see e.g., page 38 of Lee,, 1990). To do so, let us define a kernel for a two-sample U-statistic,

h_{\omega}(x_{1},x_{2};y_{1},y_{2}):=\langle\psi_{\omega_{1}}(x_{1}),\psi_{\omega_{1}}(x_{2})\rangle+\langle\psi_{\omega_{1}}(y_{1}),\psi_{\omega_{1}}(y_{2})\rangle-\langle\psi_{\omega_{1}}(x_{1}),\psi_{\omega_{1}}(y_{2})\rangle-\langle\psi_{\omega_{1}}(x_{2}),\psi_{\omega_{1}}(y_{1})\rangle,

where $\psi_{\omega_{1}}(x)=[\sqrt{\kappa_{\lambda}(0)}\cos(\omega^{\top}x),\sqrt{\kappa_{\lambda}(0)}\sin(\omega^{\top}x)]^{\top}$ for a given $\omega$ , and write the symmetrized kernel as

	$\displaystyle\bar{h}_{\omega}(x_{1},x_{2};y_{1},y_{2}):$	$\displaystyle=\frac{1}{2!2!}\sum_{1\leq i_{1}\neq i_{2}\leq 2}\sum_{1\leq j_{1}\neq j_{2}\leq 2}h_{\omega}(x_{i_{1}},x_{i_{2}};y_{j_{1}},y_{j_{2}})$
		$\displaystyle=\frac{1}{2}\langle\psi_{\omega_{1}}(x_{1})-\psi_{\omega_{1}}(y_{1}),\psi_{\omega_{1}}(x_{2})-\psi_{\omega_{1}}(y_{2})\rangle+\frac{1}{2}\langle\psi_{\omega_{1}}(x_{1})-\psi_{\omega_{1}}(y_{2}),\psi_{\omega_{1}}(x_{2})-\psi_{\omega_{1}}(y_{1})\rangle.$

Also, let

\bar{h}_{\omega,c,d}(x_{1},\dots,x_{c};y_{1},\dots,y_{d})=\mathbb{E}_{X\times Y}[\bar{h}_{\omega}(x_{1},\dots,x_{c},X_{c+1},\dots,X_{2};y_{1},\dots,y_{d},Y_{d+1},\dots,Y_{2})],

and

\check{\sigma}_{\omega,c,d}^{2}=\operatorname{Var}_{X\times Y}\left[\mathbb{E}_{X\times Y}[\bar{h}_{\omega,c,d}(X_{1},\dots,X_{c};Y_{1},\dots,Y_{d})]\right],

for $0\leq c,d\leq 2.$ Then, the variance of the two-sample U-statistic is

\displaystyle\operatorname{Var}_{X\times Y}\left[U_{1}\,|\,\omega\right]=\sum_{c=0}^{2}\sum_{d=0}^{2}\binom{2}{c}\binom{2}{d}\binom{{n_{1}}-2}{2-c}\binom{{n_{2}}-2}{2-d}\binom{{n_{1}}}{2}^{-1}\binom{{n_{2}}}{2}^{-1}\check{\sigma}_{\omega,c,d}^{2}.

(41)

Here, note that we have $\check{\sigma}_{\omega,c,d}^{2}\leq 4\kappa_{\lambda}(0)^{2}$ for all $0\leq c,d\leq 2$ , since $|\langle\psi_{\omega_{1}}(x),\psi_{\omega_{1}}(y)\rangle|\leq\kappa_{\lambda}(0)$ for all $x,y\in\mathbb{R}^{d}.$ Also, denote $\mu_{\omega,X}=\mathbb{E}_{X}\left[\psi_{\omega_{1}}(X)\right]$ and $\mu_{\omega,Y}=\mathbb{E}_{Y}\left[\psi_{\omega_{1}}(Y)\right]$ , and observe that

$\displaystyle\check{\sigma}_{\omega,1,0}^{2}$	$\displaystyle=\mathbb{E}_{X_{1}}\Big{[}\big{(}\mathbb{E}_{X_{2},Y_{1},Y_{2}}\left[\bar{h}_{\omega}(x_{1},X_{2};Y_{1},Y_{2})\,\big{\|}\,X_{1}=x_{1}\right]-\left\\|\mu_{\omega,X}-\mu_{\omega,Y}\right\\|^{2}\big{)}^{2}\Big{]}$	(42)
	$\displaystyle=\mathbb{E}_{X_{1}}\Big{[}\big{(}\langle\psi_{\omega_{1}}(X_{1})-\mu_{\omega,Y},\mu_{\omega,X}-\mu_{\omega,Y}\rangle\big{)}^{2}\Big{]}$
	$\displaystyle\stackrel{{\scriptstyle(a)}}{{\leq}}\mathbb{E}_{X_{1}}\left\\|\psi_{\omega_{1}}(X_{1})-\mu_{\omega,Y}\right\\|^{2}\cdot\left\\|\mu_{\omega,X}-\mu_{\omega,Y}\right\\|^{2}$
	$\displaystyle\stackrel{{\scriptstyle(b)}}{{\leq}}4\kappa_{\lambda}(0)\left\\|\mu_{\omega,X}-\mu_{\omega,Y}\right\\|^{2},$

where inequality (a) is by the Cauchy–Schwarz inequality and inequality (b) is by the fact that $-\kappa_{\lambda}(0)\leq\langle\psi_{\omega_{1}}(x),\psi_{\omega_{1}}(y)\rangle\leq\kappa_{\lambda}(0)$ for all $x,y\in\mathbb{R}^{d}.$ Similarly, we can get $\check{\sigma}_{\omega,0,1}^{2}\leq 4\kappa_{\lambda}(0)\left\|\mu_{\omega,X}-\mu_{\omega,Y}\right\|^{2}.$ Now, followed by (41) and $-\kappa_{\lambda}(0)\leq\langle\psi_{\omega_{1}}(x),\psi_{\omega_{1}}(y)\rangle\leq\kappa_{\lambda}(0)$ for all $x,y\in\mathbb{R}^{d}$ , we can show that there exist universal constants $C_{2},C_{2}^{\prime},C_{3},C_{3}^{\prime}>0$ such that

	$\displaystyle\operatorname{Var}_{X\times Y}\left[U_{1}\,\|\,\omega\right]$	$\displaystyle\leq C_{2}\kappa_{\lambda}(0)\left\\|\mu_{\omega,X}-\mu_{\omega,Y}\right\\|^{2}\bigg{(}\frac{1}{{n_{1}}}+\frac{1}{{n_{2}}}\bigg{)}+C_{3}\kappa_{\lambda}(0)^{2}\bigg{(}\frac{1}{{n_{1}^{2}}}+\frac{1}{{n_{2}^{2}}}+\frac{1}{{n_{1}}{n_{2}}}\bigg{)}$
		$\displaystyle\leq C_{2}^{\prime}\frac{\kappa_{\lambda}(0)}{n}\left\\|\mu_{\omega,X}-\mu_{\omega,Y}\right\\|^{2}+C_{3}^{\prime}\frac{\kappa_{\lambda}(0)^{2}}{{n}^{2}}.$

Then, observe

	$\displaystyle\mathbb{E}_{\omega}\big{[}\operatorname{Var}_{X\times Y}\left[U_{1}\,\|\,\omega\right]\big{]}$	$\displaystyle\stackrel{{\scriptstyle(a)}}{{\leq}}C_{2}^{\prime}\frac{\kappa_{\lambda}(0)}{n}\mathrm{MMD}^{2}(P_{X},P_{Y};\mathcal{H}_{k_{\lambda}})+C_{3}^{\prime}\frac{\kappa_{\lambda}(0)^{2}}{{n}^{2}}$		(43)
		$\displaystyle\stackrel{{\scriptstyle(b)}}{{\leq}}C_{2}^{\prime}\frac{\kappa_{\lambda}(0)}{n}\\|\xi\\|^{2}_{2}+C_{3}^{\prime}\frac{\kappa_{\lambda}(0)^{2}}{{n}^{2}},$		(43)

where (a) is according to the equality $\mathbb{E}_{\omega}\big{[}\left\|\mu_{\omega,X}-\mu_{\omega,Y}\right\|^{2}\big{]}=\mathrm{MMD}^{2}(P_{X},P_{Y};\mathcal{H}_{k_{\lambda}})$ and (b) follows from the fact that Young’s convolution inequality (Lemma 10) yields $\mathrm{MMD}^{2}(P_{X},P_{Y};\mathcal{H}_{k_{\lambda}})=\langle\xi,\xi\ast\kappa_{\lambda}\rangle_{2}\leq\|\xi\|_{2}\|\xi\ast\kappa_{\lambda}\|_{2}\leq\|\xi\|^{2}_{2}\|\kappa_{\lambda}\|_{1}=\|\xi\|^{2}_{2}.$ Since we have $\int_{\mathbb{R}^{d}}\kappa_{\lambda}(x)dx=1$ and $\kappa_{\lambda}(x)\geq 0$ for all $x\in\mathbb{R}^{d}.$ Hence, using $\sqrt{x+y}\leq\sqrt{x}+\sqrt{y}$ for all $x,y\geq 0,$ we can conclude that

\displaystyle\sqrt{\frac{4}{\beta R}\mathbb{E}_{\omega}\big{[}\operatorname{Var}_{X\times Y}[U_{1}\,|\,\omega]\big{]}}

\displaystyle\leq C_{2}^{\prime\prime}(\beta)\sqrt{\frac{\kappa_{\lambda}(0)}{R{n}}}\|\xi\|_{2}+C_{3}^{\prime\prime}(\beta)\frac{\kappa_{\lambda}(0)}{\sqrt{R}{n}},

(44)

for $C_{2}^{\prime\prime}(\beta):=\sqrt{{4C_{2}^{\prime}}/\beta}$ and $C_{3}^{\prime\prime}(\beta):=\sqrt{{4C_{3}^{\prime}}/\beta}.$

Now, we analyze the second term in the right hand side of Equation (40), $\sqrt{\frac{4}{\beta R}\mathbb{E}_{\omega}\big{[}\big{(}\mathbb{E}_{X\times Y}[U_{1}\,|\,\omega]\big{)}^{2}\big{]}}$ . Note that $\mathbb{E}_{X\times Y}[U_{1}\,|\,\omega]$ can be written as

\displaystyle\mathbb{E}_{X\times Y}[U_{1}\,|\,\omega]=\iint_{[-M_{3},M_{3}]^{d}}\kappa_{\lambda}(0)\cos\left(\omega^{\top}(x-y)\right)\left(p_{X}(x)-p_{Y}(x)\right)\left(p_{X}(y)-p_{Y}(y)\right)dxdy.

Therefore, for some positive constant $C_{4}(M_{3},d),$ we have

		$\displaystyle\mathbb{E}_{\omega}\big{[}\big{(}\mathbb{E}_{X\times Y}[U_{1}\,\|\,\omega]\big{)}^{2}\big{]}$
	$\displaystyle=$	$\displaystyle\mathbb{E}_{\omega}\left[\bigg{(}\iint_{[-M_{3},M_{3}]^{d}}\kappa_{\lambda}(0)\cos\left(\omega^{\top}(x-y)\right)\left(p_{X}(x)-p_{Y}(x)\right)\left(p_{X}(y)-p_{Y}(y)\right)dxdy\bigg{)}^{2}\right]$
	$\displaystyle\leq$	$\displaystyle\mathbb{E}_{\omega}\left[\bigg{(}\iint_{[-M_{3},M_{3}]^{d}}\left\|\kappa_{\lambda}(0)\cos\left(\omega^{\top}(x-y)\right)\right\|\left\|p_{X}(x)-p_{Y}(x)\right\|\left\|p_{X}(y)-p_{Y}(y)\right\|dxdy\bigg{)}^{2}\right]$
	$\displaystyle\leq$	$\displaystyle\mathbb{E}_{\omega}\left[\kappa_{\lambda}(0)^{2}\bigg{(}\iint_{[-M_{3},M_{3}]^{d}}\left\|p_{X}(x)-p_{Y}(x)\right\|\left\|p_{X}(y)-p_{Y}(y)\right\|dxdy\bigg{)}^{2}\right]$
	$\displaystyle=$	$\displaystyle\kappa_{\lambda}(0)^{2}\bigg{(}\int_{[-M_{3},M_{3}]^{d}}\left\|p_{X}(x)-p_{Y}(x)\right\|dx\bigg{)}^{4}$
	$\displaystyle\stackrel{{\scriptstyle(*)}}{{\leq}}$	$\displaystyle C_{4}(M_{3},d)\kappa_{\lambda}(0)^{2}\bigg{(}\int_{[-M_{3},M_{3}]^{d}}\left\|p_{X}(x)-p_{Y}(x)\right\|^{2}dx\bigg{)}^{2}$
	$\displaystyle=$	$\displaystyle C_{4}(M_{3},d)\kappa_{\lambda}(0)^{2}\\|\xi\\|^{4}_{2},$

where ${(*)}$ uses Jensen’s inequality since the supports of both density $p$ and $q$ are uniformly bounded. Hence, we can get

	$\displaystyle\sqrt{\frac{4}{\beta R}\mathbb{E}_{\omega}\big{[}\big{(}\mathbb{E}_{X\times Y}[U_{1}\,\|\,\omega]\big{)}^{2}\big{]}}$	$\displaystyle\leq\sqrt{\frac{4C_{4}(M_{3},d)}{\beta R}}\kappa_{\lambda}(0)\\|\xi\\|^{2}_{2}$		(45)
		$\displaystyle=C_{4}^{\prime}(M_{3},\beta,d)\frac{\kappa_{\lambda}(0)}{\sqrt{R}}\\|\xi\\|^{2}_{2}$		(45)

for $C_{4}^{\prime}(M_{3},\beta,d):=2\sqrt{{C_{4}(M_{3},d)}/\beta}.$

For the final term, $\sqrt{\frac{4}{\beta}\operatorname{Var}\big{[}\widehat{\mathrm{MMD}}_{u}^{2}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\mathcal{H}_{k_{\lambda}})\big{]}},$ Schrab et al., (2023, Proposition 3) guarantees that there exists a positive constant $C_{5}(M_{2},d)$ such that

\displaystyle\operatorname{Var}\big{[}\widehat{\mathrm{MMD}}_{u}^{2}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\mathcal{H}_{k_{\lambda}})\big{]}\leq C_{5}(M_{2},d)\bigg{(}\frac{\|\xi\ast\kappa_{\lambda}\|^{2}_{2}}{n}+\frac{\kappa_{\lambda}(0)}{{n}^{2}}\bigg{)}.

Then, similar to the proof of Schrab et al., (2023, Appendix E.5), we have

$\displaystyle\sqrt{\frac{4}{\beta}\operatorname{Var}\big{[}\widehat{\mathrm{MMD}}_{u}^{2}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\mathcal{H}_{k_{\lambda}})\big{]}}$	$\displaystyle\leq\sqrt{\frac{4C_{5}(M_{2},d)\\|\xi\ast\kappa_{\lambda}\\|^{2}_{2}}{\beta{n}}+\frac{4C_{5}(M_{2},d)\kappa_{\lambda}(0)}{\beta{n}^{2}}}$	(46)
	$\displaystyle\stackrel{{\scriptstyle(a)}}{{\leq}}2\sqrt{\frac{1}{2}\\|\xi\ast\kappa_{\lambda}\\|^{2}_{2}\frac{2C_{5}(M_{2},d)}{\beta{n}}}+\frac{2\sqrt{C_{5}(M_{2},d)\kappa_{\lambda}(0)}}{\sqrt{\beta}{n}}$
	$\displaystyle\stackrel{{\scriptstyle(b)}}{{\leq}}\frac{1}{2}\\|\xi\ast\kappa_{\lambda}\\|^{2}_{2}+\frac{2C_{5}(M_{2},d)}{\beta{n}}+\frac{2\sqrt{C_{5}(M_{2},d)\kappa_{\lambda}(0)}}{\sqrt{\beta}{n}}$
	$\displaystyle\leq\frac{1}{2}\\|\xi\ast\kappa_{\lambda}\\|^{2}_{2}+\frac{C_{5}^{\prime}(M_{2},\beta,d)}{n}+C_{5}^{\prime}(M_{2},\beta,d)\frac{\sqrt{\kappa_{\lambda}(0)}}{{n}},$

where $(a)$ used the fact that $\sqrt{x+y}\leq\sqrt{x}+\sqrt{y}$ for all $x,y>0,$ $(b)$ used $2\sqrt{xy}\leq x+y$ for all $x,y>0,$ and the last inequality holds with $C_{5}^{\prime}(M_{2},\beta,d):=\max\{2C_{5}(M_{2},d)/\beta,2\sqrt{C_{5}(M_{2},d)/\beta}\}$ .

To sum up, given Equations (40), (44), (45) and (46), a valid upper bound for $\sqrt{\frac{1}{\beta}\operatorname{Var}\left[U\right]}$ is

$\displaystyle\sqrt{\frac{4}{\beta}\operatorname{Var}\left[U\right]}$	$\displaystyle\leq\sqrt{\frac{4}{\beta R}\mathbb{E}_{\omega}\big{[}\operatorname{Var}_{X\times Y}[U_{1}\,\|\,\omega]\big{]}}+\sqrt{\frac{4}{\beta R}\mathbb{E}_{\omega}\big{[}\big{(}\mathbb{E}_{X\times Y}[U_{1}\,\|\,\omega]\big{)}^{2}\big{]}}$	(47)
	$\displaystyle\quad+\sqrt{\frac{4}{\beta}\operatorname{Var}_{X\times Y}\big{[}\widehat{\mathrm{MMD}}_{u}^{2}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\mathcal{H}_{k})\big{]}}$
	$\displaystyle\leq C_{2}^{\prime\prime}(\beta)\sqrt{\frac{\kappa_{\lambda}(0)}{R{n}}}\\|\xi\\|_{2}+C_{3}^{\prime\prime}(\beta)\frac{\kappa_{\lambda}(0)}{\sqrt{R}{n}}+C_{4}^{\prime}(M_{3},\beta,d)\frac{\kappa_{\lambda}(0)\\|\xi\\|^{2}_{2}}{\sqrt{R}}$
	$\displaystyle\quad+\frac{1}{2}\\|\xi\ast\kappa_{\lambda}\\|^{2}_{2}+\frac{C_{5}^{\prime}(M_{2},\beta,d)}{n}+C_{5}^{\prime}(M_{2},\beta,d)\frac{\sqrt{\kappa_{\lambda}(0)}}{{n}}.$

Upper bound for $\sqrt{\frac{4}{\alpha\beta}\mathbb{E}\big{[}\operatorname{Var}_{\pi}[U_{\pi}\,|\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}},\boldsymbol{\omega}_{R}]\big{]}}$

Since the U-statistic is centered at zero under the permutation law, we have

	$\displaystyle\operatorname{Var}_{\pi}[U_{\pi}\,\|\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}},\boldsymbol{\omega}_{R}]$	$\displaystyle=\mathbb{E}_{\pi}\big{[}(U_{\pi})^{2}\,\big{\|}\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}},\boldsymbol{\omega}_{R}\big{]}-\big{(}\mathbb{E}_{\pi}[U_{\pi}\,\|\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}},\boldsymbol{\omega}_{R}]\big{)}^{2}$
		$\displaystyle=\mathbb{E}_{\pi}\big{[}(U_{\pi})^{2}\,\big{\|}\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}},\boldsymbol{\omega}_{R}\big{]}.$

Recall Equation (34) and note that the following result holds true (Kim et al.,, 2022, Appendix F):

		$\displaystyle\mathbb{E}_{\pi}\big{[}(U_{\pi})^{2}\,\|\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}},\boldsymbol{\omega}_{R}\big{]}$
	$\displaystyle=$	$\displaystyle\frac{1}{{n_{1}}^{2}({n_{1}}-1)^{2}{n_{2}}^{2}({n_{2}}-1)^{2}}\sum_{(i_{1},\dots,j^{\prime}_{2})\in\mathbf{I}}\mathbb{E}_{\pi}\Big{[}\hat{h}\big{(}Z_{\pi(i_{1})},Z_{\pi(i_{2})};Z_{\pi(n_{1}+j_{1})},Z_{\pi(n_{1}+j_{2})}\big{)}$
		$\displaystyle\,\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\times\hat{h}\big{(}Z_{\pi(i^{\prime}_{1})},Z_{\pi(i^{\prime}_{2})};Z_{\pi(n_{1}+j^{\prime}_{1})},Z_{\pi(n_{1}+j^{\prime}_{2})}\big{)}\,\Big{\|}\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}},\boldsymbol{\omega}_{R}\Big{]},$

Also, it can be shown that there exists some positive constant $C_{6}$ such that for any $(i_{1},\dots,j_{2}^{\prime})\in\mathbf{I},$

	$\displaystyle\bigg{\|}\mathbb{E}_{X\times Y\times\omega}\bigg{[}\mathbb{E}_{\pi}$	$\displaystyle\Big{[}\hat{h}\big{(}Z_{\pi(i_{1})},Z_{\pi(i_{2})};Z_{\pi({n_{1}}+j_{1})},Z_{\pi({n_{1}}+j_{2})}\big{)}$
		$\displaystyle\times\hat{h}\big{(}Z_{\pi(i^{\prime}_{1})},Z_{\pi(i^{\prime}_{2})};Z_{\pi({n_{1}}+j^{\prime}_{1})},Z_{\pi({n_{1}}+j^{\prime}_{2})}\big{)}\,\Big{\|}\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}},\boldsymbol{\omega}_{R}\Big{]}\bigg{]}\bigg{\|}$
		$\displaystyle\leq C_{6}\check{\sigma}^{2}_{2,2},$

where

\check{\sigma}^{2}_{2,2}:=\max\Big{\{}\mathbb{E}\big{[}\hat{k}^{2}(X_{1},X_{2})\big{]},\mathbb{E}\big{[}\hat{k}^{2}(X_{1},Y_{1})\big{]},\mathbb{E}\big{[}\hat{k}^{2}(Y_{1},Y_{2})\big{]}\Big{\}}.

Observe that

	$\displaystyle\hat{k}^{2}(x,y)$	$\displaystyle=\bigg{(}\frac{1}{R}\sum\limits_{r=1}^{R}\langle\psi_{\omega_{r}}(x),\psi_{\omega_{r}}(y)\rangle\bigg{)}^{2}$
		$\displaystyle=\frac{1}{R^{2}}\sum^{R}_{r=1}\langle\psi_{\omega_{r}}(x),\psi_{\omega_{r}}(y)\rangle\langle\psi_{\omega_{r}}(x),\psi_{\omega_{r}}(y)\rangle+\frac{1}{R^{2}}\sum_{1\leq r_{1}\neq r_{2}\leq R}\langle\psi_{\omega_{r_{1}}}(x),\psi_{\omega_{r_{1}}}(y)\rangle\langle\psi_{\omega_{r_{2}}}(x),\psi_{\omega_{r_{2}}}(y)\rangle.$

Therefore, we have

	$\displaystyle\mathbb{E}\big{[}\hat{k}^{2}(X_{1},X_{2})\big{]}$	$\displaystyle=\mathbb{E}_{{X_{1}}\times{X_{2}}}\Big{[}\mathbb{E}_{\omega}\big{[}\hat{k}^{2}(x_{1},x_{2})\,\big{\|}\,X_{1}=x_{1},X_{2}=x_{2}\big{]}\Big{]}$
		$\displaystyle=\mathbb{E}_{{X_{1}}\times{X_{2}}}\bigg{[}\frac{1}{R}\mathbb{E}_{\omega}\big{[}\langle\psi_{\omega}(x_{1}),\psi_{\omega}(x_{2})\rangle^{2}\,\big{\|}\,X_{1}=x_{1},X_{2}=x_{2}\big{]}+\frac{R(R-1)}{R^{2}}k^{2}(X_{1},X_{2})\bigg{]}$
		$\displaystyle=\frac{1}{R}\mathbb{E}_{{X_{1}}\times{X_{2}}}\Big{[}\mathbb{E}_{\omega}\big{[}\langle\psi_{\omega}(x_{1}),\psi_{\omega}(x_{2})\rangle^{2}\,\big{\|}\,X_{1}=x_{1},X_{2}=x_{2}\big{]}\Big{]}+\frac{(R-1)}{R}\mathbb{E}_{{X_{1}}\times{X_{2}}}\big{[}k^{2}(X_{1},X_{2})\big{]}$
		$\displaystyle\leq\frac{\kappa_{\lambda}(0)^{2}}{R}+M_{2}\varkappa\kappa_{\lambda}(0),$

where the last inequality follows from the fact that $-\kappa_{\lambda}(0)\leq\langle\psi_{\omega_{1}}(x),\psi_{\omega_{1}}(y)\rangle\leq\kappa_{\lambda}(0)$ for all $x,y\in\mathbb{R}^{d},$ and $\mathbb{E}_{{X_{1}}\times{X_{2}}}\left[k^{2}(X_{1},X_{2})\right]\leq M_{2}\varkappa(\lambda_{1}\cdots\lambda_{d})^{-1}$ where $\varkappa=\prod^{d}_{i=1}\int_{\mathbb{R}}\kappa_{i}(x_{i})^{2}\text{d}x_{i}$ , as shown in Schrab et al., (2023, Appendix E.3). A similar calculation shows that $\mathbb{E}\big{[}\hat{k}^{2}(X_{1},Y_{1})\big{]}$ and $\mathbb{E}\big{[}\hat{k}^{2}(Y_{1},Y_{2})\big{]}$ are also upper bounded by the bound in the above inequality, thus we get

\check{\sigma}^{2}_{2,2}\leq\frac{\kappa_{\lambda}(0)^{2}}{R}+M_{2}\varkappa\kappa_{\lambda}(0).

Using this observation and counting the number of $\mathbf{I}$ (Kim et al.,, 2022, Appendix F) yields

	$\displaystyle\mathbb{E}\big{[}\operatorname{Var}_{\pi}[U_{\pi}\,\|\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}},\boldsymbol{\omega}_{R}]\big{]}$	$\displaystyle\leq C_{6}\check{\sigma}^{2}_{2,2}\times\frac{1}{{n_{1}^{2}}({n_{1}}-1)^{2}{n_{2}^{2}}({n_{2}}-1)^{2}}\sum_{(i_{1},\dots,j^{\prime}_{2})\in\mathbf{I}}1$
		$\displaystyle\leq C_{7}\frac{\kappa_{\lambda}(0)^{2}}{R}\bigg{(}\frac{1}{{n_{1}}}+\frac{1}{{n_{2}}}\bigg{)}^{2}+C_{7}M_{2}\varkappa\kappa_{\lambda}(0)\bigg{(}\frac{1}{{n_{1}}}+\frac{1}{{n_{2}}}\bigg{)}^{2}$
		$\displaystyle\leq C_{7}^{\prime}\frac{\kappa_{\lambda}(0)^{2}}{R{n}^{2}}+C_{7}^{\prime}M_{2}\varkappa\frac{\kappa_{\lambda}(0)}{{n}^{2}}$

for some positive constant $C_{7},C_{7}^{\prime}>0.$ Therefore, using $\sqrt{x+y}\leq\sqrt{x}+\sqrt{y}$ for all $x,y\geq 0,$ we get

\displaystyle\sqrt{\frac{4}{\alpha\beta}\mathbb{E}\big{[}\operatorname{Var}_{\pi}[U_{\pi}\,|\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}},\boldsymbol{\omega}_{R}]\big{]}}\leq C_{8}(\alpha,\beta)\frac{\kappa_{\lambda}(0)}{\sqrt{R}{n}}+C_{9}(M_{2},\alpha,\beta)\frac{\sqrt{\kappa_{\lambda}(0)}}{{n}}

(48)

for some positive constants $C_{8}(\alpha,\beta),C_{9}(M_{2},\alpha,\beta)>0$ .

Upper bound for $\mathbb{E}[W^{\prime}]$

Recall that $W^{\prime}$ is defined as

W^{\prime}=\frac{1}{{n_{1}^{2}}({n_{1}}-1)}\sum_{1\leq i\neq j\leq{n_{1}}}\hat{k}(X_{i},X_{j})+\frac{1}{{n_{2}^{2}}({n_{2}}-1)}\sum_{1\leq i\neq j\leq{n_{2}}}\hat{k}(Y_{i},Y_{j})

and its expectation is

\mathbb{E}[W^{\prime}]=\frac{1}{n_{1}}\mathbb{E}_{X_{1}\times X_{2}}\big{[}k(X_{1},X_{2})\big{]}+\frac{1}{{n_{2}}}\mathbb{E}_{Y_{1}\times Y_{2}}\big{[}k(Y_{1},Y_{2})\big{]}.

Here we observe that

	$\displaystyle\frac{1}{n_{1}}\mathbb{E}_{X_{1}\times X_{2}}\big{[}k(X_{1},X_{2})\big{]}$	$\displaystyle=\frac{1}{n_{1}}\int\int p_{Y}(y_{1})p_{Y}(y_{2})k_{\lambda}(y_{1},y_{2})dy_{1}dy_{2}$
		$\displaystyle\leq\frac{\\|p_{Y}\\|_{\infty}}{n_{1}}\int\int p_{Y}(y_{2})k_{\lambda}(y_{1},y_{2})dy_{1}dy_{2}$
		$\displaystyle=\frac{M_{2}}{n_{1}}\int p_{Y}(y_{2})dy_{2}$
		$\displaystyle=\frac{M_{2}}{n_{1}},$

and similarly we have ${n_{2}}^{-1}\mathbb{E}_{Y_{1}\times Y_{2}}\big{[}k(Y_{1},Y_{2})\big{]}\leq{M_{2}}{n_{2}}^{-1}.$ Therefore, we conclude that

	$\displaystyle\mathbb{E}[W^{\prime}]$	$\displaystyle\leq M_{2}\bigg{(}\frac{1}{{n_{1}}}+\frac{1}{{n_{2}}}\bigg{)}$		(49)
		$\displaystyle\leq\frac{C_{10}(M_{2})}{n}$		(49)

for some constant $C_{10}(M_{2})>0$ .

Upper bound for $\sqrt{\frac{4}{\beta}\operatorname{Var}[W^{\prime}]}$

We note that the variance of $W^{\prime}$ can be upper bounded as

	$\displaystyle\operatorname{Var}[W^{\prime}]$	$\displaystyle=\operatorname{Var}\bigg{[}\frac{1}{{n_{1}^{2}}({n_{1}}-1)}\sum_{1\leq i\neq j\leq{n_{1}}}\hat{k}(X_{i},X_{j})+\frac{1}{{n_{2}^{2}}({n_{2}}-1)}\sum_{1\leq i\neq j\leq{n_{2}}}\hat{k}(Y_{i},Y_{j})\bigg{]}$
		$\displaystyle\leq\frac{2}{{n_{1}^{2}}}\operatorname{Var}\bigg{[}\frac{1}{{n_{1}}({n_{1}}-1)}\sum_{1\leq i\neq j\leq{n_{1}}}\hat{k}(X_{i},X_{j})\bigg{]}+\frac{2}{{n_{2}^{2}}}\operatorname{Var}\bigg{[}\frac{1}{{n_{2}}({n_{2}}-1)}\sum_{1\leq i\neq j\leq{n_{2}}}\hat{k}(Y_{i},Y_{j})\bigg{]}.$

Moreover, recall $\hat{k}(x,y):=R^{-1}\sum_{r=1}^{R}\langle\psi_{\omega_{r}}(x),\psi_{\omega_{r}}(y)\rangle=\langle\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(x),\boldsymbol{\psi}_{\boldsymbol{\omega}_{R}}(y)\rangle.$ And then, for some positive constant $C_{11}>0$ , we also have

	$\displaystyle\operatorname{Var}\bigg{[}\frac{1}{{n_{1}}({n_{1}}-1)}\sum_{1\leq i\neq j\leq{n_{1}}}\hat{k}(X_{i},X_{j})\bigg{]}$	$\displaystyle=\mathbb{E}_{X\times Y}\Bigg{[}\operatorname{Var}_{\omega}\bigg{[}\frac{1}{{n_{1}}({n_{1}}-1)}\sum_{1\leq i\neq j\leq{n_{1}}}\hat{k}(X_{i},X_{j})\,\bigg{\|}\,\mathcal{X}_{n_{1}}\bigg{]}\Bigg{]}$
		$\displaystyle\quad+\operatorname{Var}_{X\times Y}\Bigg{[}\mathbb{E}_{\omega}\bigg{[}\frac{1}{{n_{1}}({n_{1}}-1)}\sum_{1\leq i\neq j\leq{n_{1}}}\hat{k}(X_{i},X_{j})\,\bigg{\|}\,\mathcal{X}_{n_{1}}\bigg{]}\Bigg{]}$
		$\displaystyle=\mathbb{E}_{X\times Y}\Bigg{[}\frac{1}{R}\operatorname{Var}_{\omega}\bigg{[}\frac{1}{{n_{1}}({n_{1}}-1)}\sum_{1\leq i\neq j\leq{n_{1}}}\langle\psi_{\omega}(X_{i}),\psi_{\omega}(X_{j})\rangle\,\bigg{\|}\,\mathcal{X}_{n_{1}}\bigg{]}\Bigg{]}$
		$\displaystyle\quad+\operatorname{Var}_{X\times Y}\Bigg{[}\mathbb{E}_{\omega}\bigg{[}\frac{1}{{n_{1}}({n_{1}}-1)}\sum_{1\leq i\neq j\leq{n_{1}}}\hat{k}(X_{i},X_{j})\,\bigg{\|}\,\mathcal{X}_{n_{1}}\bigg{]}\Bigg{]}$
		$\displaystyle\leq\frac{1}{R}\mathbb{E}_{X\times Y}\Bigg{[}\mathbb{E}_{\omega}\bigg{[}\bigg{(}\frac{1}{{n_{1}}({n_{1}}-1)}\sum_{1\leq i\neq j\leq{n_{1}}}\langle\psi_{\omega}(X_{i}),\psi_{\omega}(X_{j})\rangle\bigg{)}^{2}\,\bigg{\|}\,\mathcal{X}_{n_{1}}\bigg{]}\Bigg{]}$
		$\displaystyle\quad+\operatorname{Var}_{X\times Y}\bigg{[}\frac{1}{{n_{1}}({n_{1}}-1)}\sum_{1\leq i\neq j\leq{n_{1}}}k(X_{i},X_{j})\bigg{]}$
		$\displaystyle\leq\frac{\kappa_{\lambda}(0)^{2}}{R}+\operatorname{Var}_{X\times Y}\bigg{[}\frac{1}{{n_{1}}({n_{1}}-1)}\sum_{1\leq i\neq j\leq{n_{1}}}k(X_{i},X_{j})\bigg{]}$
		$\displaystyle\leq\frac{\kappa_{\lambda}(0)^{2}}{R}+C_{11}\frac{\kappa_{\lambda}(0)}{n_{1}},$

where the first inequality follows from $|\langle\psi_{\omega}(x),\psi_{\omega}(y)\rangle|\leq\kappa_{\lambda}(0)$ for all $x,y\in\mathbb{R}^{d},$ and the last inequality follows from the result in Kim and Schrab, (2023, Appendix E.11). In a similar manner, we can get

\displaystyle\operatorname{Var}\bigg{[}\frac{1}{{n_{2}}({n_{2}}-1)}\sum_{1\leq i\neq j\leq{n_{2}}}\hat{k}(Y_{i},Y_{j})\bigg{]}\leq\frac{\kappa_{\lambda}(0)^{2}}{R}+C_{11}\frac{\kappa_{\lambda}(0)}{n_{2}}.

Therefore, using $\sqrt{x+y}\leq\sqrt{x}+\sqrt{y}$ for all $x,y\geq 0,$ we conclude that

		$\displaystyle\quad\sqrt{\frac{4}{\beta}\operatorname{Var}[W^{\prime}]}$		(50)
		$\displaystyle\leq\sqrt{\frac{8}{\beta{n_{1}^{2}}}\operatorname{Var}\bigg{[}\frac{1}{{n_{1}}({n_{1}}-1)}\sum_{1\leq i\neq j\leq{n_{1}}}\hat{k}(X_{i},X_{j})\bigg{]}}+\sqrt{\frac{8}{\beta{n_{2}^{2}}}\operatorname{Var}\bigg{[}\frac{1}{{n_{2}}({n_{2}}-1)}\sum_{1\leq i\neq j\leq{n_{2}}}\hat{k}(Y_{i},Y_{j})\bigg{]}}$
		$\displaystyle\leq\frac{\sqrt{8}\kappa_{\lambda}(0)}{\sqrt{\beta R}}\bigg{(}\frac{1}{{n_{1}}}+\frac{1}{{n_{2}}}\bigg{)}+\sqrt{\frac{8C_{11}\kappa_{\lambda}(0)}{\beta}}\bigg{(}\frac{1}{{n^{3/2}_{1}}}+\frac{1}{{n^{3/2}_{2}}}\bigg{)}$
		$\displaystyle\leq C_{12}(\beta)\frac{\kappa_{\lambda}(0)}{\sqrt{R}{n}}+C_{13}(\beta)\frac{\sqrt{\kappa_{\lambda}(0)}}{{n}^{3/2}},$

for some positive constants $C_{12}(\beta),C_{13}(\beta)>0.$

Sufficient condition for Equation (38)

Recall that Equation (38),

\displaystyle\mathbb{E}\left[U\right]\geq\sqrt{\frac{4}{\beta}\operatorname{Var}\left[U\right]}+\sqrt{\frac{2}{\alpha\beta}\mathbb{E}\big{[}\operatorname{Var}_{\pi}[U_{\pi}\,|\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}},\boldsymbol{\omega}_{R}]\big{]}}+\mathbb{E}\left[W^{\prime}\right]+\sqrt{\frac{4}{\beta}\operatorname{Var}\left[W^{\prime}\right]}+C^{\prime}_{0}\frac{\kappa_{\lambda}(0)}{{n}^{2}},

is a sufficient condition for $\mathbb{P}(\mathcal{B}_{\beta/2})\geq 1-\beta/2.$ So far, in Equations (39), (47), (48), (49) and (50), we derived a lower bound for the left-hand side of the inequality, and upper bounds for the terms in the right-hand side of the inequality as follows:

	$\displaystyle\mathbb{E}\left[U\right]$	$\displaystyle\geq\frac{1-S^{2}}{2}\\|\xi\\|^{2}_{2}+\frac{1}{2}\\|\xi\ast\kappa_{\lambda}\\|^{2}_{2}-C_{1}^{\prime}(M_{1},d,s)\sum^{d}_{i=1}\lambda_{i}^{2s},$
	$\displaystyle\sqrt{\frac{4}{\beta}\operatorname{Var}\left[U\right]}$	$\displaystyle\leq C_{2}^{\prime\prime}(\beta)\sqrt{\frac{\kappa_{\lambda}(0)}{R{n}}}\\|\xi\\|_{2}+C_{3}^{\prime\prime}(\beta)\frac{\kappa_{\lambda}(0)}{\sqrt{R}{n}}+C_{4}^{\prime}(M_{3},\beta,d)\frac{\kappa_{\lambda}(0)\\|\xi\\|^{2}_{2}}{\sqrt{R}}$
		$\displaystyle\quad+\frac{1}{2}\\|\xi\ast\kappa_{\lambda}\\|^{2}_{2}+\frac{C_{5}^{\prime}(M_{2},\beta,d)}{n}+C_{5}^{\prime}(M_{2},\beta,d)\frac{\sqrt{\kappa_{\lambda}(0)}}{{n}},$
	$\displaystyle\sqrt{\frac{4}{\alpha\beta}\mathbb{E}\big{[}\operatorname{Var}_{\pi}[U_{\pi}\,\|\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}},\boldsymbol{\omega}_{R}]\big{]}}$	$\displaystyle\leq C_{8}(\alpha,\beta)\frac{\kappa_{\lambda}(0)}{\sqrt{R}{n}}+C_{9}(\alpha,\beta,M_{2})\frac{\sqrt{\kappa_{\lambda}(0)}}{{n}},$
	$\displaystyle\mathbb{E}[W^{\prime}]$	$\displaystyle\leq\frac{C_{10}(M_{2})}{n},$
	$\displaystyle\sqrt{\frac{4}{\beta}\operatorname{Var}[W^{\prime}]}$	$\displaystyle\leq C_{12}(\beta)\frac{\kappa_{\lambda}(0)}{\sqrt{R}{n}}+C_{13}(\beta)\frac{\sqrt{\kappa_{\lambda}(0)}}{{n}^{3/2}}.$

Plugging these results into Equation (38), a sufficient condition for Equation (38) is

	$\displaystyle\frac{1-S^{2}}{2}\\|\xi\\|^{2}_{2}-C_{1}^{\prime}(M_{1},d,s)\sum^{d}_{i=1}\lambda_{i}^{2s}$	$\displaystyle\geq C_{4}^{\prime}(M_{3},\beta,d)\frac{\kappa_{\lambda}(0)\\|\xi\\|^{2}_{2}}{\sqrt{R}}+C_{2}^{\prime\prime}(\beta)\sqrt{\frac{\kappa_{\lambda}(0)}{R{n}}}\\|\xi\\|_{2}$
		$\displaystyle\quad+C_{3}^{\prime\prime}(\beta)\frac{\kappa_{\lambda}(0)}{\sqrt{R}{n}}+C_{8}(\alpha,\beta)\frac{\kappa_{\lambda}(0)}{\sqrt{R}{n}}+C_{12}(\beta)\frac{\kappa_{\lambda}(0)}{\sqrt{R}{n}}$
		$\displaystyle\quad+C_{5}^{\prime}(M_{2},\beta,d)\frac{\sqrt{\kappa_{\lambda}(0)}}{{n}}+C_{9}(M_{2},\alpha,\beta)\frac{\sqrt{\kappa_{\lambda}(0)}}{{n}}+C_{13}(\beta)\frac{\sqrt{\kappa_{\lambda}(0)}}{{n}^{3/2}}$
		$\displaystyle\quad+\frac{C_{5}^{\prime}(M_{2},\beta,d)}{n}+\frac{C_{10}(M_{2})}{n}+\frac{C^{\prime}_{0}\kappa_{\lambda}(0)}{{n}^{2}}.$

Recall that $\kappa_{\lambda}(0)=(\lambda_{1}\cdots\lambda_{d})^{-1},$ and suppose that $\lambda_{1}\cdots\lambda_{d}\leq 1.$ This assumption does not compromise our analysis, as ultimately $\lambda_{1},\dots,\lambda_{d}$ we choose later satisfies this assumption. Now, observe that $\lambda_{1}\dots\lambda_{d}\leq 1$ implies ${n}^{-1}\leq{n}^{-1}(\lambda_{1}\cdots\lambda_{d})^{-1/2}.$ Also note that ${n}\leq{n}^{3/2}$ for ${n}\geq 1$ . Then, by grouping similar terms, a sufficient condition for the above inequality is

	$\displaystyle\frac{1-S^{2}}{2}\\|\xi\\|^{2}_{2}$	$\displaystyle\geq\frac{C_{4}^{\prime}(M_{3},\beta,d)}{\sqrt{R}\lambda_{1}\cdots\lambda_{d}}\\|\xi\\|^{2}_{2}+\frac{C_{2}^{\prime\prime}(\beta)}{\sqrt{R{n}\lambda_{1}\cdots\lambda_{d}}}\\|\xi\\|_{2}$
		$\displaystyle\quad+\frac{C_{14}(\alpha,\beta)}{\sqrt{R}{n}\lambda_{1}\cdots\lambda_{d}}+\frac{C_{15}(M_{2},\alpha,\beta,d)}{{n}\sqrt{\lambda_{1}\cdots\lambda_{d}}}+\frac{C^{\prime}_{0}}{{n}^{2}\lambda_{1}\cdots\lambda_{d}}+C_{1}^{\prime}(M_{1},d,s)\sum^{d}_{i=1}\lambda_{i}^{2s}.$

We observe that the simultaneous satisfaction of the following four inequalities is a sufficient condition for the above inequality:

	$\displaystyle\text{(i)}:\quad\\|\xi\\|^{2}_{2}$	$\displaystyle\geq\frac{4C_{4}^{\prime}(M_{3},\beta,d)}{\sqrt{R}\lambda_{1}\cdots\lambda_{d}}\\|\xi\\|^{2}_{2},$
	$\displaystyle\text{(ii)}:\quad\\|\xi\\|^{2}_{2}$	$\displaystyle\geq\frac{4C_{2}^{\prime\prime}(\beta)}{\sqrt{R{n}\lambda_{1}\cdots\lambda_{d}}}\\|\xi\\|_{2},$
	$\displaystyle\text{(iii)}:\quad\\|\xi\\|^{2}_{2}$	$\displaystyle\geq\frac{4C_{14}(\alpha,\beta)}{\sqrt{R}{n}\lambda_{1}\cdots\lambda_{d}},$
	$\displaystyle\text{(iv)}:\quad\\|\xi\\|^{2}_{2}$	$\displaystyle\geq\frac{4C_{15}(M_{2},\alpha,\beta,d)}{{n}\sqrt{\lambda_{1}\cdots\lambda_{d}}}+\frac{4C^{\prime}_{0}}{{n}^{2}\lambda_{1}\cdots\lambda_{d}}+4C_{1}^{\prime}(M_{1},d,s)\sum^{d}_{i=1}\lambda_{i}^{2s}.$

Now, we simplify the above inequalities to facilitate our discussion. Note that the inequality (i) is equivalent to the inequality denoted as $(a)$ :

	$\displaystyle\text{(i)}:$		$\displaystyle\\|\xi\\|^{2}_{2}$	$\displaystyle\geq\frac{4C_{4}^{\prime}(M_{3},\beta,d)}{\sqrt{R}\lambda_{1}\cdots\lambda_{d}}\\|\xi\\|^{2}_{2}$
	$\displaystyle\Longleftrightarrow\quad\text{(a)}:$		$\displaystyle R$	$\displaystyle\geq\frac{C_{16}(M_{3},\beta,d)}{(\lambda_{1}\cdots\lambda_{d})^{2}},$

where $C_{16}(M_{3},\beta,d):=16C_{4}^{\prime}(M_{3},\beta,d)^{2}.$ Also, observe that the inequality (ii) is equivalent to

		$\displaystyle\\|\xi\\|^{2}_{2}$	$\displaystyle\geq\frac{4C_{2}^{\prime\prime}(\beta)}{\sqrt{R{n}\lambda_{1}\cdots\lambda_{d}}}\\|\xi\\|_{2}$
	$\displaystyle\Longleftrightarrow\quad$	$\displaystyle\\|\xi\\|^{2}_{2}$	$\displaystyle\geq\frac{16C_{2}^{\prime\prime}(\beta)^{2}}{R{n}\lambda_{1}\cdots\lambda_{d}}.$

Since $R\geq 1$ and $R^{-1/2}\geq R^{-1}$ , a sufficient condition, denoted as (b), for simultaneously satisfying the inequalities (ii) and (iii) is

\displaystyle\text{(b)}:\quad\|\xi\|^{2}_{2}\geq\frac{C_{17}(\alpha,\beta)}{\sqrt{R}{n}\lambda_{1}\cdots\lambda_{d}}

for $C_{17}(\alpha,\beta):=\max\{4C_{14}(\alpha,\beta),16C_{2}^{\prime\prime}(\beta)^{2}\}.$ For the inequality (iv), we note that we would like to assume $\lambda_{1}\cdots\lambda_{d}\leq{n}^{-2/d}.$ This is because if not, the term ${n}^{-2}(\lambda_{1}\cdots\lambda_{d})^{-1}$ in the inequality (iv) becomes larger than one, having the test become worthless. Under the assumption, observe that the term ${n}^{-2}(\lambda_{1}\cdots\lambda_{d})^{-1}$ is dominated by the term ${n}^{-1}(\lambda_{1}\cdots\lambda_{d})^{-1/2}$ . Therefore, the following inequality, denoting (c), is sufficient to show that the inequality (iv) holds:

\displaystyle\text{(c)}:\quad\|\xi\|^{2}_{2}\geq C_{18}(M_{1},M_{2},\alpha,\beta,d,s)\bigg{(}\frac{1}{{n}\sqrt{\lambda_{1}\cdots\lambda_{d}}}+\sum^{d}_{i=1}\lambda_{i}^{2s}\bigg{)}

(51)

where $C_{18}(M_{1},M_{2},\alpha,\beta,d,s):=\max\{4C_{15}(M_{2},\alpha,\beta,d)+4C^{\prime}_{0},C_{1}^{\prime}(M_{1},d,s)\}.$

In summary, a sufficient condition for satisfying the inequalities (i), (ii), (iii) and (iv) at once is the simultaneous satisfaction of the following three inequalities:

$\displaystyle\text{(a)}:$	$\displaystyle R$	$\displaystyle\geq\frac{C_{16}(M_{3},\beta,d)}{(\lambda_{1}\cdots\lambda_{d})^{2}},$
$\displaystyle\text{(b)}:$	$\displaystyle\\|\xi\\|^{2}_{2}$	$\displaystyle\geq\frac{C_{17}(\alpha,\beta)}{\sqrt{R}{n}\lambda_{1}\cdots\lambda_{d}},$
$\displaystyle\text{(c)}:$	$\displaystyle\\|\xi\\|^{2}_{2}$	$\displaystyle\geq C_{18}(M_{1},M_{2},\alpha,\beta,d,s)\bigg{(}\frac{1}{{n}\sqrt{\lambda_{1}\cdots\lambda_{d}}}+\sum^{d}_{i=1}\lambda_{i}^{2s}\bigg{)}.$

Then by the definition of uniform separation rate, for both tests $\Delta=\Delta_{{n_{1}},{n_{2}},R}^{\alpha,\lambda}$ or $\Delta=\Delta_{{n_{1}},{n_{2}},R}^{\alpha,u,\lambda}$ , we have

\displaystyle\rho\left(\Delta,~{}\beta,~{}\mathcal{C}_{M_{1},M_{2}},~{}\delta_{L_{2}}\right)^{2}\leq C_{19}(M_{1},M_{2},\alpha,\beta,d,s)\max\bigg{\{}\frac{1}{\sqrt{R}{n}\lambda_{1}\cdots\lambda_{d}},\frac{1}{{n}\sqrt{\lambda_{1}\cdots\lambda_{d}}}+\sum^{d}_{i=1}\lambda_{i}^{2s}\bigg{\}},

for $C_{19}(M_{1},M_{2},\alpha,\beta,d,s):=\max\{C_{17}(\alpha,\beta),C_{18}(M_{1},M_{2},\alpha,\beta,d,s)\}$ and $R\geq C_{16}(M_{3},\beta,d)(\lambda_{1}\cdots\lambda_{d})^{-2}.$ For the smallest order of ${n}$ possible, we choose the bandwidth $\lambda^{\star}_{i}:={n}^{-2/(4s+d)}$ for $i=1,\ldots,d$ , and in this case, the condition on $R$ becomes $R\geq C_{16}(M_{3},\beta,d){n}^{4d/(4s+d)}.$ Plugging these values into the above inequality, we get

	$\displaystyle\rho\left(\Delta,~{}\beta,~{}\mathcal{C}_{M_{1},M_{2}},~{}\delta_{L_{2}}\right)^{2}$	$\displaystyle\leq C_{19}(M_{1},M_{2},\alpha,\beta,d,s)\max\bigg{\{}\frac{1}{\sqrt{R}{n}\lambda^{\star}_{1}\cdots\lambda^{\star}_{d}},\frac{1}{{n}\sqrt{\lambda^{\star}_{1}\cdots\lambda^{\star}_{d}}}+\sum^{d}_{i=1}\lambda_{i}^{2s}\bigg{\}}$
		$\displaystyle\leq C_{20}(M_{1},M_{2},M_{3},\alpha,\beta,d,s)\max\bigg{\{}\frac{1}{n},\frac{1}{{n}^{{4s}/{(4s+d)}}}\bigg{\}}$
		$\displaystyle=C_{20}(M_{1},M_{2},M_{3},\alpha,\beta,d,s){n}^{-{4s}/(4s+d)}$

for $C_{20}(M_{1},M_{2},M_{3},\alpha,\beta,d,s):=C_{19}(M_{1},M_{2},\alpha,\beta,d,s)\max\big{\{}C_{16}(M_{3},\beta,d)^{-1},d+1\big{\}}.$ We note that our choice $\{\lambda^{\star}_{i}\}^{d}_{i=1}$ satisfies the condition $\lambda_{1}\cdots\lambda_{d}\leq 1$ for Equation (46) and the condition $\lambda_{1}\cdots\lambda_{d}\leq{n}^{-2/d}$ for Equation 51. By letting $C_{L_{2}}(M_{1},M_{2},M_{3},\alpha,\beta,d,s):=C_{20}(M_{1},M_{2},M_{3},\alpha,\beta,d,s)^{1/2},$ we conclude that

\rho\left(\Delta,~{}\beta,~{}\mathcal{C}_{M_{1},M_{2}},~{}\delta_{L_{2}}\right)\leq C_{L_{2}}(M_{1},M_{2},M_{3},\alpha,\beta,d,s){n}^{{-2s}/{(4s+d)}}

holds and this completes the proof.

B.6 Proof of Theorem 7

We start by recalling the event defined in Equation (25), equipped with the kernel $k$ :

\displaystyle\mathcal{B}_{\beta}:=\bigg{\{}\mathbb{E}\left[U\right]\geq\sqrt{\frac{2}{\beta}\operatorname{Var}\left[U\right]}+q_{{n_{1}},{n_{2}},1-\alpha}^{u}+\kappa(0)\bigg{(}\sqrt{\frac{2}{\beta}}+2\bigg{)}\bigg{(}\frac{1}{{n_{1}}}+\frac{1}{{n_{2}}}\bigg{)}\bigg{\}}.

As shown in the proof of Theorem 5, to control the probability of type II error of both tests $\Delta_{{n_{1}},{n_{2}},R}^{\alpha}$ and $\Delta_{{n_{1}},{n_{2}},R}^{\alpha,u}$ simultaneously, it is sufficient to show that $\mathbb{P}(\mathcal{B}_{\beta})=1.$ Similar to the proof of Theorem 6, we analyze the terms in the event $\mathcal{B}_{\beta}$ and derive a sufficient condition for the event $\mathcal{B}_{\beta}$ . To start with, note that we have

	$\displaystyle\mathbb{E}\left[U\right]$	$\displaystyle=\mathrm{MMD}^{2}\left(P_{X},P_{Y};\mathcal{H}_{k}\right),$
	$\displaystyle q_{{n_{1}},{n_{2}},1-\alpha}^{u}$	$\displaystyle\leq C_{1}(\alpha)\kappa(0)\bigg{(}\frac{1}{{n_{1}}}+\frac{1}{{n_{2}}}\bigg{)},$

for some positive constant $C_{1}(\alpha)>0,$ from Equation (26) and (35). This gives

	$\displaystyle q_{{n_{1}},{n_{2}},1-\alpha}^{u}+\kappa(0)\bigg{(}\sqrt{\frac{2}{\beta}}+2\bigg{)}\bigg{(}\frac{1}{{n_{1}}}+\frac{1}{{n_{2}}}\bigg{)}$	$\displaystyle\leq\kappa(0)\bigg{(}C_{1}(\alpha)+\sqrt{\frac{2}{\beta}}+2\bigg{)}\bigg{(}\frac{1}{{n_{1}}}+\frac{1}{{n_{2}}}\bigg{)}$
		$\displaystyle\leq\frac{C_{2}(\alpha,\beta,K)}{n}$

for some positive constant $C_{2}(\alpha,\beta,K)>0$ . Therefore, a sufficient condition for $\mathbb{P}(\mathcal{B}_{\beta})=1$ is the following inequality:

\displaystyle\mathrm{MMD}^{2}\left(P_{X},P_{Y};\mathcal{H}_{k}\right)\geq\sqrt{\frac{2}{\beta}\operatorname{Var}\left[U\right]}+\frac{C_{2}(\alpha,\beta,K)}{n}.

(52)

Upper bound for $\sqrt{\frac{2}{\beta}\operatorname{Var}\left[U\right]}$

Now we analyze the square root of the variance term. Recall the decomposition of the variance of $U$ in Equation (40):

	$\displaystyle\sqrt{\frac{2}{\beta}\operatorname{Var}\left[U\right]}$	$\displaystyle\leq\sqrt{\frac{4}{\beta R}\mathbb{E}_{\omega}\big{[}\operatorname{Var}_{X\times Y}[U_{1}\,\|\,\omega]\big{]}}+\sqrt{\frac{4}{\beta R}\mathbb{E}_{\omega}\big{[}\big{(}\mathbb{E}_{X\times Y}[U_{1}\,\|\,\omega]\big{)}^{2}\big{]}}$
		$\displaystyle\quad+\sqrt{\frac{4}{\beta}\operatorname{Var}_{X\times Y}\big{[}\widehat{\mathrm{MMD}}_{u}^{2}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\mathcal{H}_{k})\big{]}}.$

We aim to analyze each term on the right-hand side and derive an upper bound for them. First, recall the result in Equation (43) and the assumption that the kernel $k$ is uniformly bounded by $K.$ Then we have

\displaystyle\sqrt{\frac{4}{\beta R}\mathbb{E}_{\omega}\big{[}\operatorname{Var}_{X\times Y}[U_{1}\,|\,\omega]\big{]}}

\displaystyle\leq\frac{C_{3}(\beta,K)}{\sqrt{R{n}}}\mathrm{MMD}(P_{X},P_{Y};\mathcal{H}_{k})+\frac{C_{4}(\beta,K)}{\sqrt{R}{n}},

(53)

for some positive constants $C_{3}(\beta,K),C_{4}(\beta,K)>0.$

For the term $\mathbb{E}_{\omega}[(\mathbb{E}_{X\times Y}[U_{1}\,|\,\omega])^{2}]$ , recall the statistic $U_{1}$ defined in Equation 28 and let us denote the statistic V with a single random feature as $V_{1}$ , i.e.,

	$\displaystyle V_{1}:=$	$\displaystyle\frac{1}{{n_{1}^{2}}}\sum_{1\leq i,j\leq{n_{1}}}\langle{\psi_{\omega}}_{1}(X_{i}),{\psi_{\omega}}_{1}(X_{j})\rangle-\frac{2}{{n_{1}}{n_{2}}}\sum_{i=1}^{n_{1}}\sum_{j=1}^{n_{2}}\langle{\psi_{\omega}}_{1}(X_{i}),{\psi_{\omega}}_{1}(Y_{j})\rangle$
		$\displaystyle\quad+\frac{1}{{n_{2}^{2}}}\sum_{1\leq i,j\leq{n_{2}}}\langle{\psi_{\omega}}_{1}(Y_{i}),{\psi_{\omega}}_{1}(Y_{j})\rangle$
	$\displaystyle=$	$\displaystyle\frac{1}{{n_{1}^{2}}}\sum_{1\leq i,j\leq{n_{1}}}\kappa(0)\cos\big{(}{\omega_{1}^{\top}\left(X_{i}-X_{j}\right)}\big{)}-\frac{2}{{n_{1}}{n_{2}}}\sum_{i=1}^{n_{1}}\sum_{j=1}^{n_{2}}\kappa(0)\cos\big{(}{\omega_{1}^{\top}\left(X_{i}-Y_{j}\right)}\big{)}$
		$\displaystyle\quad+\frac{1}{{n_{2}^{2}}}\sum_{1\leq i,j\leq{n_{2}}}\kappa(0)\cos\big{(}{\omega_{1}^{\top}\left(Y_{i}-Y_{j}\right)}\big{)}.$

Also, let $W_{1}$ be the difference between $V_{1}$ and $U_{1}.$ Then, observe

	$\displaystyle\mathbb{E}_{\omega}\big{[}\big{(}\mathbb{E}_{X\times Y}[U_{1}\,\|\,\omega]\big{)}^{2}\big{]}$	$\displaystyle\stackrel{{\scriptstyle(a)}}{{\leq}}\mathbb{E}_{\omega}\big{[}2\big{(}\mathbb{E}_{X\times Y}[V_{1}\,\|\,\omega]\big{)}^{2}+2\big{(}\mathbb{E}_{X\times Y}[-W_{1}\,\|\,\omega]\big{)}^{2}\big{]}$
		$\displaystyle\stackrel{{\scriptstyle(b)}}{{\leq}}8\kappa(0)\mathbb{E}_{\omega}\big{[}\mathbb{E}_{X\times Y}[V_{1}\,\|\,\omega]\big{]}+2\mathbb{E}_{\omega}\big{[}\mathbb{E}_{X\times Y}[W_{1}^{2}\,\|\,\omega]\big{]}$
		$\displaystyle\stackrel{{\scriptstyle(c)}}{{\leq}}8\kappa(0)\mathbb{E}_{\omega}\big{[}\mathbb{E}_{X\times Y}[V_{1}\,\|\,\omega]\big{]}+8\kappa(0)^{2}\bigg{(}\frac{1}{{n_{1}}-1}+\frac{1}{{n_{2}}-1}\bigg{)}^{2}$
		$\displaystyle\leq 8\kappa(0)\Big{(}\mathbb{E}_{\omega}\big{[}\mathbb{E}_{X\times Y}[U_{1}\,\|\,\omega]\big{]}+\mathbb{E}_{\omega}\big{[}\mathbb{E}_{X\times Y}[W_{1}\,\|\,\omega]\big{]}\Big{)}+8C_{0}^{2}\frac{\kappa(0)^{2}}{{n}^{2}}$
		$\displaystyle\stackrel{{\scriptstyle(d)}}{{\leq}}8\kappa(0)\mathrm{MMD}^{2}(P_{X},P_{Y};\mathcal{H}_{k})+8C_{0}\frac{\kappa(0)^{2}}{{n}}+8C_{0}^{2}\frac{\kappa(0)^{2}}{{n}^{2}}$
		$\displaystyle\leq 8K\mathrm{MMD}^{2}(P_{X},P_{Y};\mathcal{H}_{k})+8C_{5}\frac{K^{2}}{{n}}$

where (a) follows from the inequality $(x+y)^{2}\leq 2x^{2}+2y^{2}$ for all $x,y\in\mathbb{R}$ , (b) follows from $0\leq V_{1}\leq 4\kappa(0)$ , (c), (d) is according to $0\leq W_{1}\leq\kappa(0)\Big{(}\frac{1}{{n_{1}}-1}+\frac{1}{{n_{2}}-1}\Big{)},$ and the last inequality holds with a constant $C_{5}:=C_{0}+{C^{\prime}_{0}}^{2}.$ Using $\sqrt{x+y}\leq\sqrt{x}+\sqrt{y}$ for all $x,y\in\mathbb{R}$ , we conclude that

\displaystyle\sqrt{\frac{4}{\beta R}\mathbb{E}_{\omega}\big{[}\big{(}\mathbb{E}_{X\times Y}[U_{1}\,|\,\omega]\big{)}^{2}\big{]}}\leq\frac{C_{6}(\beta,K)}{\sqrt{R}}\mathrm{MMD}(P_{X},P_{Y};\mathcal{H}_{k})+\frac{C_{7}(\beta,K)}{\sqrt{R{n}}},

(54)

for some positive constants $C_{6}(\beta,K),C_{7}(\beta,K)>0.$

For $\operatorname{Var}_{X\times Y}\big{[}\widehat{\mathrm{MMD}}_{u}^{2}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\mathcal{H}_{k})\big{]},$ we follow a similar logic to Equation (40) through Equation (44). The difference is that instead of using $h_{\omega}$ , we use the following kernel for a two-sample U-statistic:

h(x_{1},x_{2};y_{1},y_{2}):=\langle\dot{k}(x_{1}),\dot{k}(x_{2})\rangle+\langle\dot{k}(y_{1}),\dot{k}(y_{2})\rangle-\langle\dot{k}(x_{1}),\dot{k}(y_{2})\rangle-\langle\dot{k}(x_{2}),\dot{k}(y_{1})\rangle,

where $\dot{k}(x)(\cdot)=k(x,\cdot).$ Note that we have $|\langle\dot{k}(x),\dot{k}(y)\rangle|=|k(x,y)|\leq\kappa(0)$ for all $x,y\in\mathbb{R}^{d}$ as shown in Equation (31). This fact corresponds to the condition for the inequality (b) in Equation 42. Also note that $\big{\|}\mathbb{E}_{X}[\dot{k}(X)]-\mathbb{E}_{Y}[\dot{k}(Y)]\big{\|}^{2}=\mathrm{MMD}^{2}(P_{X},P_{Y};\mathcal{H}_{k})$ holds. This is the condition for the inequality (a) in Equation 43; therefore, we can follow the same logic to the previous analysis with kernel $h$ and we have a similar result to the first line in Equation 43:

\displaystyle\operatorname{Var}_{X\times Y}\big{[}\widehat{\mathrm{MMD}}_{u}^{2}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\mathcal{H}_{k})\big{]}

\displaystyle\leq C_{8}\frac{\kappa(0)}{{n}}\mathrm{MMD}^{2}(P_{X},P_{Y};\mathcal{H}_{k})+C_{9}\frac{\kappa(0)^{2}}{{n}^{2}},

for some positive constants $C_{8},C_{9}>0.$ We apply $\sqrt{x+y}\leq\sqrt{x}+\sqrt{y}$ for all $x,y\in\mathbb{R}$ here and get the following result:

\displaystyle\sqrt{\frac{4}{\beta}\operatorname{Var}_{X\times Y}\big{[}\widehat{\mathrm{MMD}}_{u}^{2}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\mathcal{H}_{k})\big{]}}

\displaystyle\leq\frac{C_{10}(\beta,K)}{\sqrt{{n}}}\mathrm{MMD}(P_{X},P_{Y};\mathcal{H}_{k})+\frac{C_{11}(\beta,K)}{{n}}

(55)

where $C_{10}(\beta,K),C_{11}(\beta,K)>0$ are some positive constants. In summary, given Equations (53),(54) and (55), a valid upper bound for $\sqrt{\frac{2}{\beta}\operatorname{Var}\left[U\right]}$ is

	$\displaystyle\sqrt{\frac{2}{\beta}\operatorname{Var}\left[U\right]}$	$\displaystyle\leq\sqrt{\frac{4}{\beta R}\mathbb{E}_{\omega}\big{[}\operatorname{Var}_{X\times Y}[U_{1}\,\|\,\omega]\big{]}}+\sqrt{\frac{4}{\beta R}\mathbb{E}_{\omega}\big{[}\big{(}\mathbb{E}_{X\times Y}[U_{1}\,\|\,\omega]\big{)}^{2}\big{]}}$
		$\displaystyle\quad+\sqrt{\frac{4}{\beta}\operatorname{Var}_{X\times Y}\big{[}\widehat{\mathrm{MMD}}_{u}^{2}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\mathcal{H}_{k})\big{]}}$
		$\displaystyle\leq\bigg{(}\frac{C_{3}(\beta,K)}{\sqrt{R{n}}}+\frac{C_{6}(\beta,K)}{\sqrt{R}}+\frac{C_{10}(\beta,K)}{\sqrt{{n}}}\bigg{)}\mathrm{MMD}(P_{X},P_{Y};\mathcal{H}_{k})$
		$\displaystyle\quad+\frac{C_{4}(\beta,K)}{\sqrt{R}{n}}+\frac{C_{7}(\beta,K)}{\sqrt{R{n}}}+\frac{C_{11}(\beta,K)}{{n}}.$

Sufficient condition for Equation 52

Recall that Equation 52,

\displaystyle\mathrm{MMD}^{2}\left(P_{X},P_{Y};\mathcal{H}_{k}\right)\geq\sqrt{\frac{2}{\beta}\operatorname{Var}\left[U\right]}+\frac{C_{2}(\alpha,\beta,K)}{n}

is a sufficient condition for $\mathbb{P}(\mathcal{B}_{\beta})=1.$ Utilizing an upper bound for the variance of $U$ we derived, a sufficient condition for Equation 52 to hold is

	$\displaystyle\mathrm{MMD}^{2}\left(P_{X},P_{Y};\mathcal{H}_{k}\right)$	$\displaystyle\geq\bigg{(}\frac{C_{3}(\beta,K)}{\sqrt{R{n}}}+\frac{C_{6}(\beta,K)}{\sqrt{R}}+\frac{C_{10}(\beta,K)}{\sqrt{{n}}}\bigg{)}\mathrm{MMD}(P_{X},P_{Y};\mathcal{H}_{k})$
		$\displaystyle\quad+\frac{C_{4}(\beta,K)}{\sqrt{R}{n}}+\frac{C_{7}(\beta,K)}{\sqrt{R{n}}}+\frac{C_{11}(\beta,K)}{{n}}+\frac{C_{2}(\alpha,\beta,K)}{n}.$

Note that ${n}^{-1}\leq{n}^{-1/2}\leq 1$ for ${n}\geq 1.$ Then, by merging similar terms, a sufficient condition for the above inequality is

	$\displaystyle\mathrm{MMD}^{2}\left(P_{X},P_{Y};\mathcal{H}_{k}\right)$	$\displaystyle\geq\bigg{(}\frac{C_{12}(\beta,K)}{\sqrt{R}}+\frac{C_{10}(\beta,K)}{\sqrt{{n}}}\bigg{)}\mathrm{MMD}(P_{X},P_{Y};\mathcal{H}_{k})$
		$\displaystyle\quad+\frac{C_{14}(\beta,K)}{\sqrt{R{n}}}+\frac{C_{15}(\alpha,\beta,K)}{{n}},$

for some positive constants $C_{12}(\beta,K),C_{13}(\beta,K),C_{14}(\alpha,\beta,K)>0.$ Now the satisfaction of the following four inequalities at once is a sufficient condition for the above inequality:

	$\displaystyle\text{(i)}:\quad\mathrm{MMD}^{2}\left(P_{X},P_{Y};\mathcal{H}_{k}\right)$	$\displaystyle\geq\frac{4C_{12}(\beta,K)}{\sqrt{R}}\mathrm{MMD}(P_{X},P_{Y};\mathcal{H}_{k}),$
	$\displaystyle\text{(ii)}:\quad\mathrm{MMD}^{2}\left(P_{X},P_{Y};\mathcal{H}_{k}\right)$	$\displaystyle\geq\frac{4C_{10}(\beta,K)}{\sqrt{{n}}}\mathrm{MMD}(P_{X},P_{Y};\mathcal{H}_{k}),$
	$\displaystyle\text{(iii)}:\quad\mathrm{MMD}^{2}\left(P_{X},P_{Y};\mathcal{H}_{k}\right)$	$\displaystyle\geq\frac{4C_{13}(\beta,K)}{\sqrt{R{n}}},$
	$\displaystyle\text{(iv)}:\quad\mathrm{MMD}^{2}\left(P_{X},P_{Y};\mathcal{H}_{k}\right)$	$\displaystyle\geq\frac{4C_{14}(\alpha,\beta,K)}{{n}}.$

Here we note that the inequalities (i) and (ii) are equivalent to the following inequalities (a) and (b), respectively:

	$\displaystyle\text{(a)}:\quad\mathrm{MMD}^{2}\left(P_{X},P_{Y};\mathcal{H}_{k}\right)$	$\displaystyle\geq\frac{16C_{12}(\beta,K)^{2}}{R},$
	$\displaystyle\text{(b)}:\quad\mathrm{MMD}^{2}\left(P_{X},P_{Y};\mathcal{H}_{k}\right)$	$\displaystyle\geq\frac{16C_{10}(\beta,K)^{2}}{{n}}.$

Considering the inequalities (a),(b),(iii) and (iv), for both tests $\Delta=\Delta_{{n_{1}},{n_{2}},R}^{\alpha,\lambda}$ or $\Delta=\Delta_{{n_{1}},{n_{2}},R}^{\alpha,u,\lambda}$ , we have

\displaystyle\rho\big{(}\Delta,~{}\beta,~{}\mathcal{C},~{}\delta_{\mathrm{MMD}}\big{)}^{2}\leq C_{15}(\alpha,\beta,K)\max\bigg{\{}\frac{1}{R},\frac{1}{{n}},\frac{1}{\sqrt{R{n}}}\bigg{\}},

for some positive constant $C_{15}(\alpha,\beta,K)>0.$ For the smallest order of ${n}$ possible, choose $R={n}$ and then we have

\displaystyle\rho\big{(}\Delta,~{}\beta,~{}\mathcal{C},~{}\delta_{\mathrm{MMD}}\big{)}^{2}\leq C_{15}(\alpha,\beta,K){n}^{-1}.

By letting $C_{\mathrm{MMD}}(\alpha,\beta,K):=C_{15}(\alpha,\beta,K)^{1/2},$ we conclude that

\rho\big{(}\Delta,~{}\beta,~{}\mathcal{C},~{}\delta_{\mathrm{MMD}}\big{)}\leq C_{\mathrm{MMD}}(\alpha,\beta,K){n}^{-1/2}

holds and this completes the proof.

B.7 Proof of Proposition 8

Recall the event defined in Equation (25):

\displaystyle\mathcal{B}_{\beta}:=\bigg{\{}\mathbb{E}\left[U\right]\geq\sqrt{\frac{2}{\beta}\operatorname{Var}\left[U\right]}+q_{{n_{1}},{n_{2}},1-\alpha}^{u}+\kappa_{\lambda}(0)\bigg{(}\sqrt{\frac{2}{\beta}}+2\bigg{)}\bigg{(}\frac{1}{{n_{1}}}+\frac{1}{{n_{2}}}\bigg{)}\bigg{\}}

and note that when $\mathbb{P}(\mathcal{B}_{\beta})=1$ , both tests $\Delta_{{n_{1}},{n_{2}},R}^{\alpha,u,\lambda}$ and $\Delta_{{n_{1}},{n_{2}},R}^{\alpha,\lambda}$ uniformly control the probability of type II error. Also, with $\kappa_{\lambda}(0)=\pi^{-d/2}(\lambda_{1}\cdots\lambda_{d})^{-1}$ in place, Equation 52 implies that a sufficient condition for $\mathbb{P}(\mathcal{B}_{\beta})=1$ is the following inequality:

\displaystyle\mathrm{MMD}^{2}\left(P_{X},P_{Y};\mathcal{H}_{k_{\lambda}}\right)\geq\sqrt{\frac{2}{\beta}\operatorname{Var}\left[U\right]}+\frac{C_{1}(\alpha,\beta,d,\lambda)}{n}

(56)

where $C_{1}(\alpha,\beta)>0$ is some positive constant. Similar to the proof of Theorem 7, our objective is to find an upper bound for the right-hand side of the above inequality.

Upper bound for $\sqrt{\frac{2}{\beta}\operatorname{Var}\left[U\right]}$

Recall the decomposition in Equation (40)

	$\displaystyle\sqrt{\frac{2}{\beta}\operatorname{Var}\left[U\right]}$	$\displaystyle\leq\sqrt{\frac{4}{\beta R}\mathbb{E}_{\omega}\big{[}\operatorname{Var}_{X\times Y}[U_{1}\,\|\,\omega]\big{]}}+\sqrt{\frac{4}{\beta R}\mathbb{E}_{\omega}\big{[}\big{(}\mathbb{E}_{X\times Y}[U_{1}\,\|\,\omega]\big{)}^{2}\big{]}}$
		$\displaystyle\quad+\sqrt{\frac{4}{\beta}\operatorname{Var}_{X\times Y}\big{[}\widehat{\mathrm{MMD}}_{u}^{2}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\mathcal{H}_{k})\big{]}}.$

Also, as previously noted in Equation 53 and (55), we derived upper bounds for the first term and the last term on the right-hand side of the above inequality, respectively:

	$\displaystyle\sqrt{\frac{4}{\beta R}\mathbb{E}_{\omega}\big{[}\operatorname{Var}_{X\times Y}[U_{1}\,\|\,\omega]\big{]}}$	$\displaystyle\leq C_{2}(\beta)\sqrt{\frac{\kappa_{\lambda}(0)}{R{n}}}\mathrm{MMD}(P_{X},P_{Y};\mathcal{H}_{k_{\lambda}})+C_{3}(\beta)\frac{\kappa_{\lambda}(0)}{\sqrt{R}{n}},$
	$\displaystyle\sqrt{\frac{4}{\beta}\operatorname{Var}_{X\times Y}\big{[}\widehat{\mathrm{MMD}}_{u}^{2}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\mathcal{H}_{k})\big{]}}$	$\displaystyle\leq C_{4}(\beta)\sqrt{\frac{\kappa_{\lambda}(0)}{{n}}}\mathrm{MMD}(P_{X},P_{Y};\mathcal{H}_{k_{\lambda}})+C_{5}(\beta)\frac{\kappa_{\lambda}(0)}{{n}},$

We now analyze and derive an upper bound for the remaining term, $\mathbb{E}_{\omega}\big{[}(\mathbb{E}_{X\times Y}[U_{1}\,|\,\omega])^{2}\big{]}.$ We emphasize that, unlike the result in Equation (54) we stated in the proof of Theorem 7, a stronger upper bound can be established here since we assumed a smaller class of distribution pairs, specifically a class of Gaussian distribution with a common fixed variance, $\mathcal{C}_{N,\Sigma}\subsetneq\mathcal{C}.$ This favorable setting allows us to explicitly calculate the term $\mathbb{E}_{\omega}\big{[}(\mathbb{E}_{X\times Y}[U_{1}\,|\,\omega])^{2}\big{]}$ and upper bound it with ${\mathrm{MMD}}_{u}^{2}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\mathcal{H}_{k})$ to a higher power than the general case. In detail, Lemma 17 yields

	$\displaystyle\mathbb{E}_{\omega}\big{[}\big{(}\mathbb{E}_{X\times Y}[U_{1}\,\|\,\omega]\big{)}^{2}\big{]}$	$\displaystyle\leq C_{6}(d,\lambda,\Sigma)\big{(}\mathbb{E}_{\omega}\big{[}\mathbb{E}_{X\times Y}[U_{1}\,\|\,\omega]\big{]}\big{)}^{2}$
		$\displaystyle=C_{6}(d,\lambda,\Sigma)\mathrm{MMD}^{4}(P_{X},P_{Y};\mathcal{H}_{k_{\lambda}}),$

for some positive constant $C_{6}(d,\lambda,\Sigma)>0$ . Therefore, it holds that

\sqrt{\frac{4}{\beta R}\mathbb{E}_{\omega}\big{[}\big{(}\mathbb{E}_{X\times Y}[U_{1}\,|\,\omega]\big{)}^{2}\big{]}}\leq\frac{C_{7}(\beta,d,\lambda,\Sigma)}{\sqrt{R}}\mathrm{MMD}^{2}(P_{X},P_{Y};\mathcal{H}_{k_{\lambda}})

We point out that there are two improvements on this upper bound, compared to the previous bound in Equation 54 that induces quadratic computational cost. Firstly, the power of $\mathrm{MMD}(P_{X},P_{Y};\mathcal{H}_{k_{\lambda}})$ in this upper bound is two, whereas it is one in the previous. Also, note that there is no additional term in the current bound such as $(R{n})^{-1/2}$ that exists in the previous bound.

To sum up, a valid upper bound for the square root of the variance of $U$ satisfies

	$\displaystyle\sqrt{\frac{2}{\beta}\operatorname{Var}\left[U\right]}$	$\displaystyle\leq\sqrt{\frac{4}{\beta R}\mathbb{E}_{\omega}\big{[}\operatorname{Var}_{X\times Y}[U_{1}\,\|\,\omega]\big{]}}+\sqrt{\frac{4}{\beta R}\mathbb{E}_{\omega}\big{[}\big{(}\mathbb{E}_{X\times Y}[U_{1}\,\|\,\omega]\big{)}^{2}\big{]}}$
		$\displaystyle\quad+\sqrt{\frac{4}{\beta}\operatorname{Var}_{X\times Y}\big{[}\widehat{\mathrm{MMD}}_{u}^{2}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\mathcal{H}_{k})\big{]}}$
		$\displaystyle\leq\frac{C_{7}(\beta,d,\lambda,\Sigma)}{\sqrt{R}}\mathrm{MMD}^{2}(P_{X},P_{Y};\mathcal{H}_{k_{\lambda}})$
		$\displaystyle\quad+\bigg{(}C_{2}(\beta)\sqrt{\frac{\kappa_{\lambda}(0)}{R{n}}}+C_{4}(\beta)\sqrt{\frac{\kappa_{\lambda}(0)}{{n}}}\bigg{)}\mathrm{MMD}(P_{X},P_{Y};\mathcal{H}_{k_{\lambda}})+C_{3}(\beta)\frac{\kappa_{\lambda}(0)}{\sqrt{R}{n}}+C_{5}(\beta)\frac{\kappa_{\lambda}(0)}{{n}}$
		$\displaystyle\stackrel{{\scriptstyle(\dagger)}}{{\leq}}\frac{C_{7}(\beta,d,\lambda,\Sigma)}{\sqrt{R}}\mathrm{MMD}^{2}(P_{X},P_{Y};\mathcal{H}_{k_{\lambda}})+\frac{C_{8}(\beta,d,\lambda)}{\sqrt{{n}}}\mathrm{MMD}(P_{X},P_{Y};\mathcal{H}_{k_{\lambda}})+\frac{C_{9}(\beta,d,\lambda)}{{n}}$

for some positive constants $C_{8}(\beta,d,\lambda),C_{9}(\beta,d,\lambda)>0,$ where the inequality $(\dagger)$ follows from $R^{-1/2}\leq 1$ for $R\geq 1$ , and $\kappa_{\lambda}(0)=\pi^{-d/2}(\lambda_{1}\cdots\lambda_{d})^{-1}$ .

Sufficient condition for Equation 56

Note that our objective is to find a sufficient condition for Equation 56,

\displaystyle\mathrm{MMD}^{2}\left(P_{X},P_{Y};\mathcal{H}_{k_{\lambda}}\right)\geq\sqrt{\frac{2}{\beta}\operatorname{Var}\left[U\right]}+\frac{C_{1}(\alpha,\beta,d,\lambda)}{n}.

Plugging the upper bound we derive in the preceding section into the above inequality, observe that a sufficient condition for Equation 56 to hold is

	$\displaystyle\mathrm{MMD}^{2}\left(P_{X},P_{Y};\mathcal{H}_{k_{\lambda}}\right)$	$\displaystyle\geq\frac{C_{7}(\beta,d,\lambda,\Sigma)}{\sqrt{R}}\mathrm{MMD}^{2}(P_{X},P_{Y};\mathcal{H}_{k_{\lambda}})+\frac{C_{8}(\beta,d,\lambda)}{\sqrt{{n}}}\mathrm{MMD}(P_{X},P_{Y};\mathcal{H}_{k_{\lambda}})$
		$\displaystyle\quad+\frac{C_{9}(\beta,d,\lambda)}{{n}}+\frac{C_{1}(\alpha,\beta,d,\lambda)}{n}.$

Now the simultaneous satisfaction of the following three inequalities is a sufficient condition for the above inequality:

	$\displaystyle\text{(i)}:\quad\mathrm{MMD}^{2}\left(P_{X},P_{Y};\mathcal{H}_{k}\right)$	$\displaystyle\geq\frac{3C_{7}(\beta,d,\lambda,\Sigma)}{\sqrt{R}}\mathrm{MMD}^{2}(P_{X},P_{Y};\mathcal{H}_{k}),$
	$\displaystyle\text{(ii)}:\quad\mathrm{MMD}^{2}\left(P_{X},P_{Y};\mathcal{H}_{k}\right)$	$\displaystyle\geq\frac{3C_{8}(\beta,d,\lambda)}{\sqrt{{n}}}\mathrm{MMD}(P_{X},P_{Y};\mathcal{H}_{k}),$
	$\displaystyle\text{(iii)}:\quad\mathrm{MMD}^{2}\left(P_{X},P_{Y};\mathcal{H}_{k}\right)$	$\displaystyle\geq\frac{3C_{10}(\alpha,\beta,d,\lambda)}{{n}}$

where $C_{10}(\alpha,\beta,d,\lambda):=C_{9}(\beta,d,\lambda)+C_{1}(\alpha,\beta,d,\lambda).$ Observe that the inequalities (i) and (ii) are equivalent to the following inequalities (a) and (b), respectively:

	(a)	$\displaystyle:\quad R\geq 9C_{7}(\beta,d,\lambda,\Sigma)^{2},$
	(b)	$\displaystyle:\quad\mathrm{MMD}^{2}\left(P_{X},P_{Y};\mathcal{H}_{k}\right)\geq\frac{9C_{8}(\beta,\lambda)^{2}}{{n}}.$

Considering the inequalities (a), (b) and (iii) above, for both tests $\Delta=\Delta_{{n_{1}},{n_{2}},R}^{\alpha,\lambda}$ or $\Delta=\Delta_{{n_{1}},{n_{2}},R}^{\alpha,u,\lambda}$ , we have

\displaystyle\rho\big{(}\Delta,~{}\beta,~{}\mathcal{C},~{}\delta_{\mathrm{MMD}}\big{)}^{2}\leq\frac{C_{11}(\alpha,\beta,d,\lambda)}{{n}},

for some positive constant $C_{11}(\alpha,\beta,d,\lambda)>0$ and $R\geq 9C_{7}(\beta,d,\lambda,\Sigma)^{2}.$ Therefore, we conclude that

\rho\big{(}\Delta,~{}\beta,~{}\mathcal{C},~{}\delta_{\mathrm{MMD}}\big{)}\leq C_{\mathrm{MMD}}(\alpha,\beta,d,\lambda){n}^{-1/2}

with $C_{\mathrm{MMD}}(\alpha,\beta,d,\lambda):=C_{11}(\alpha,\beta,d,\lambda)^{1/2}$ and $R\geq 9C_{7}(\beta,d,\lambda,\Sigma)^{2}$ .

B.8 Proof of Lemma 16

Recall that the statistic $U_{1}$ in Equation 28 is defined as

	$\displaystyle U_{1}=$	$\displaystyle\frac{1}{{n_{1}}({n_{1}}-1)}\sum_{1\leq i\neq j\leq{n_{1}}}\kappa(0)\cos\big{(}{\omega_{1}^{\top}\left(X_{i}-X_{j}\right)}\big{)}-\frac{2}{{n_{1}}{n_{2}}}\sum_{i=1}^{n_{1}}\sum_{j=1}^{n_{2}}\kappa(0)\cos\big{(}{\omega_{1}^{\top}\left(X_{i}-Y_{j}\right)}\big{)}$
		$\displaystyle\quad+\frac{1}{{n_{2}}({n_{2}}-1)}\sum_{1\leq i\neq j\leq{n_{2}}}\kappa(0)\cos\big{(}{\omega_{1}^{\top}\left(Y_{i}-Y_{j}\right)}\big{)}.$

By taking conditional expectation with respect to $X$ and $Y$ given $\omega,$ we get

	$\displaystyle\mathbb{E}_{X\times Y}[U_{1}\,\|\,\omega]$	$\displaystyle=\mathbb{E}\big{[}\kappa(0)\cos\big{(}\omega(X_{1}-X_{2})\big{)}\,\big{\|}\,\omega\big{]}$
		$\displaystyle\quad+\mathbb{E}\big{[}\kappa(0)\cos\big{(}\omega(Y_{1}-Y_{2})\big{)}\,\big{\|}\,\omega\big{]}$
		$\displaystyle\quad-2\mathbb{E}\big{[}\kappa(0)\cos\big{(}\omega(X_{1}-Y_{1})\big{)}\,\big{\|}\,\omega\big{]}.$

Our strategy is to decompose the term $\big{(}\mathbb{E}_{X\times Y}[U_{1}\,|\,\omega]\big{)}^{2}$ explicitly and simplify it with trigonometric identities. First, let $X_{1},X_{2},X_{3},X_{4}$ be the independent copies of $X$ , and $Y_{1},Y_{2},Y_{3},Y_{4}$ be the independent copies of $Y$ . And, let us drop the subscripts with respect to $X_{1},\ldots,X_{4},Y_{1},\ldots,Y_{4}$ on $\mathbb{E}$ if the context is clear. Now, observe that the term the term $\big{(}\mathbb{E}_{X\times Y}[U_{1}\,|\,\omega]\big{)}^{2}$ can be expressed as

\displaystyle\big{(}\mathbb{E}_{X\times Y}[U_{1}\,|\,\omega]\big{)}^{2}=\text{(i)}+\text{(ii)}+4\text{(iii)}+2\text{(iv)}-4\text{(v)}-4\text{(vi)},

where

	(i)	$\displaystyle=\mathbb{E}\big{[}\kappa(0)^{2}\cos\big{(}\omega(X_{1}-X_{2})\big{)}\cos\big{(}\omega(X_{3}-X_{4})\big{)}\,\big{\|}\,\omega\big{]},$
	(ii)	$\displaystyle=\mathbb{E}\big{[}\kappa(0)^{2}\cos\big{(}\omega(Y_{1}-Y_{2})\big{)}\cos\big{(}\omega(Y_{3}-Y_{4})\big{)}\,\big{\|}\,\omega\big{]},$
	(iii)	$\displaystyle=\mathbb{E}\big{[}\kappa(0)^{2}\cos\big{(}\omega(X_{1}-Y_{1})\big{)}\cos\big{(}\omega(X_{2}-Y_{2})\big{)}\,\big{\|}\,\omega\big{]},$
	(iv)	$\displaystyle=\mathbb{E}\big{[}\kappa(0)^{2}\cos\big{(}\omega(X_{1}-X_{2})\big{)}\cos\big{(}\omega(Y_{1}-Y_{2})\big{)}\,\big{\|}\,\omega\big{]},$
	(v)	$\displaystyle=\mathbb{E}\big{[}\kappa(0)^{2}\cos\big{(}\omega(X_{1}-X_{2})\big{)}\cos\big{(}\omega(X_{3}-Y_{1})\big{)}\,\big{\|}\,\omega\big{]},$
	(vi)	$\displaystyle=\mathbb{E}\big{[}\kappa(0)^{2}\cos\big{(}\omega(Y_{1}-Y_{2})\big{)}\cos\big{(}\omega(Y_{3}-X_{1})\big{)}\,\big{\|}\,\omega\big{]}.$

Here, observe that

	(i)	$\displaystyle=\mathbb{E}\big{[}\kappa(0)^{2}\cos\big{(}\omega(X_{1}-X_{2})\big{)}\cos\big{(}\omega(X_{3}-X_{4})\big{)}\,\big{\|}\,\omega\big{]}$
		$\displaystyle=\mathbb{E}\bigg{[}\frac{1}{2}\kappa(0)^{2}\Big{(}\cos\big{(}\omega(X_{1}-X_{2}+X_{3}-X_{4})\big{)}+\cos\big{(}\omega(X_{1}-X_{2}+X_{4}-X_{3})\big{)}\Big{)}\,\bigg{\|}\,\omega\bigg{]}$
		$\displaystyle=\mathbb{E}\big{[}\kappa(0)^{2}\cos\big{(}\omega(X_{1}-X_{2}+X_{3}-X_{4})\big{)}\,\big{\|}\,\omega\big{]},$

where the last equality follows from $X_{1}-X_{2}+X_{3}-X_{4}\stackrel{{\scriptstyle d}}{{=}}X_{1}-X_{2}+X_{4}-X_{3}$ . A similar calculation guarantees that the term $\big{(}\mathbb{E}_{X\times Y}[U_{1}\,|\,\omega]\big{)}^{2}$ can be written as

	$\displaystyle\big{(}\mathbb{E}_{X\times Y}[U_{1}\,\|\,\omega]\big{)}^{2}$	$\displaystyle=\text{(i)}+\text{(ii)}+4\text{(iii)}+2\text{(iv)}-4\text{(v)}-4\text{(vi)}$
		$\displaystyle=(a)+(b)+4\bigg{(}\frac{1}{2}(c)+\frac{1}{2}(d)\bigg{)}+2(d)-4(e)-4(f)$
		$\displaystyle=(a)+(b)+2(c)+4(d)-4(e)-4(f)$

where

	$\displaystyle(a)$	$\displaystyle=\mathbb{E}\big{[}\kappa(0)^{2}\cos\big{(}\omega(X_{1}-X_{2}+X_{3}-X_{4})\big{)}\,\big{\|}\,\omega\big{]},$
	$\displaystyle(b)$	$\displaystyle=\mathbb{E}\big{[}\kappa(0)^{2}\cos\big{(}\omega(Y_{1}-Y_{2}+Y_{3}-Y_{4})\big{)}\,\big{\|}\,\omega\big{]},$
	$\displaystyle(c)$	$\displaystyle=\mathbb{E}\big{[}\kappa(0)^{2}\cos\big{(}\omega(X_{1}+X_{2}-Y_{1}-Y_{2})\big{)}\,\big{\|}\,\omega\big{]},$
	$\displaystyle(d)$	$\displaystyle=\mathbb{E}\big{[}\kappa(0)^{2}\cos\big{(}\omega(X_{1}-X_{2}+Y_{1}-Y_{2})\big{)}\,\big{\|}\,\omega\big{]},$
	$\displaystyle(e)$	$\displaystyle=\mathbb{E}\big{[}\kappa(0)^{2}\cos\big{(}\omega(X_{1}-X_{2}-X_{3}+Y_{1})\big{)}\,\big{\|}\,\omega\big{]},$
	$\displaystyle(f)$	$\displaystyle=\mathbb{E}\big{[}\kappa(0)^{2}\cos\big{(}\omega(Y_{1}-Y_{2}-Y_{3}+X_{1})\big{)}\,\big{\|}\,\omega\big{]}.$

Now, note that the symmetry of the cosine function allows different representations of the above terms. For example, combined with the symmetry of $X_{1}-X_{2},$ ( $e$ ) can also be written as

(e)=\mathbb{E}\big{[}\kappa(0)^{2}\cos\big{(}\omega(X_{1}-X_{2}-X_{3}+Y_{1})\big{)}\,\big{|}\,\omega\big{]}=\mathbb{E}\big{[}\kappa(0)^{2}\cos\big{(}\omega(X_{1}-X_{2}+X_{3}-Y_{1})\big{)}\,\big{|}\,\omega\big{]}.

Then, observe that

	$\displaystyle(a)+(d)-2(e)$	$\displaystyle=\mathbb{E}\big{[}\kappa(0)^{2}\cos\big{(}\omega([X_{1}-X_{2}]-[X_{3}-X_{4}])\big{)}\,\big{\|}\,\omega\big{]}$
		$\displaystyle\quad+\mathbb{E}\big{[}\kappa(0)^{2}\cos\big{(}\omega([Y_{1}-X_{1}]-[Y_{2}-X_{2}])\big{)}\,\big{\|}\,\omega\big{]}$
		$\displaystyle\quad-2\mathbb{E}\big{[}\kappa(0)^{2}\cos\big{(}\omega([X_{1}-X_{2}]+[X_{3}-Y_{1}])\big{)}\,\big{\|}\,\omega\big{]},$

and therefore,

\displaystyle\mathbb{E}_{\omega}\big{[}(a)+(d)-2(e)\big{]}=\kappa(0)\mathrm{MMD}^{2}(P_{X-X^{\prime}},P_{Y-X^{\prime\prime}};\mathcal{H}_{k_{\lambda}}),

where $X^{\prime},X^{\prime\prime}$ are independent copies of $X$ . Similarly, we can show that

	$\displaystyle\mathbb{E}_{\omega}\big{[}(b)+(d)-2(f)\big{]}$	$\displaystyle=\kappa(0)\mathrm{MMD}^{2}(P_{Y-Y^{\prime}},P_{X-Y^{\prime\prime}};\mathcal{H}_{k}),$
	$\displaystyle\mathbb{E}_{\omega}\big{[}(a)+(b)-2(c)\big{]}$	$\displaystyle=\kappa(0)\mathrm{MMD}^{2}(P_{X+X^{\prime}},P_{Y+Y^{\prime}};\mathcal{H}_{k}),$

where $Y^{\prime},Y^{\prime\prime}$ are independent copies of $Y$ . Since

	$\displaystyle\big{(}\mathbb{E}_{X\times Y}[U_{1}\,\|\,\omega]\big{)}^{2}$	$\displaystyle=(a)+(b)+2(c)+4(d)-4(e)-4(f)$
		$\displaystyle=2\big{(}(a)+(d)-2(e)\big{)}+2\big{(}(b)+(d)-2(f)\big{)}$
		$\displaystyle\quad-\big{(}(a)+(b)-2(c)\big{)},$

we can conclude that

	$\displaystyle\mathbb{E}_{\omega}\big{[}\big{(}\mathbb{E}_{X\times Y}[U_{1}\,\|\,\omega]\big{)}^{2}\big{]}$	$\displaystyle=2\kappa(0)\mathrm{MMD}^{2}(P_{X-X^{\prime}},P_{X^{\prime\prime}-Y};\mathcal{H}_{k})+2\kappa(0)\mathrm{MMD}^{2}(P_{Y-Y^{\prime}},P_{X-Y^{\prime\prime}};\mathcal{H}_{k})$
		$\displaystyle\quad-\kappa(0)\mathrm{MMD}^{2}(P_{X+X^{\prime}},P_{Y+Y^{\prime}};\mathcal{H}_{k}).$

Additionally, note that the second statement in the lemma can be proven in a similar manner, thereby completing the proof.

B.9 Proof of Lemma 17

Recall the class of Gaussian distributions with a common fixed variance $\Sigma\in\mathbb{R}^{d\times d}$ :

\displaystyle\mathcal{C}_{N,\Sigma}:=\big{\{}(P_{X},P_{Y})\in\mathcal{P}_{\mathrm{conti}}\,\big{|}\,P_{X}=N(\mu_{X},\Sigma),~{}P_{Y}=N(\mu_{Y},\Sigma)~{}\text{where $\mu_{X},\mu_{Y}\in\mathbb{R}^{d}$}\big{\}}.

Here we claim that the following inequality

\displaystyle\mathbb{E}_{\omega}\big{[}\big{(}\mathbb{E}_{X\times Y}[U_{1}\,|\,\omega]\big{)}^{2}\big{]}\leq C\big{(}\mathrm{MMD}^{2}(P_{X},P_{Y};\mathcal{H}_{k_{\lambda}})\big{)}^{c}

(57)

holds for any distribution pair $(P_{X},P_{Y})\in\mathcal{C}_{N,\Sigma}$ , with $c=2$ and some positive constant $C>0.$ To prove the claim, one important observation is that the exact calculation of $\mathrm{MMD}^{2}(P_{X},P_{Y};\mathcal{H}_{k})$ is feasible when we use the Gaussian kernel. To be specific, consider a Gaussian kernel with bandwidth $\lambda=(\lambda_{1},\ldots,\lambda_{d})^{\top}\in(0,\infty)^{d}$ ,

k_{\lambda}(x,y)=\kappa_{\lambda}(x,y)=\prod^{d}_{i=1}\frac{1}{\sqrt{\pi}\lambda_{i}}e^{-\frac{(x_{i}-y_{i})^{2}}{\lambda_{i}^{2}}}.

There have been several existing results on calculating MMD with a Gaussian kernel for Gaussian distributions. Among them, we leverage the result from Ramdas et al., (2015, Proposition 1), which is displayed on Lemma 15:

	$\displaystyle\mathrm{MMD}^{2}(P_{X},P_{Y};\mathcal{H}_{k})$	$\displaystyle=2\left(\frac{1}{4\pi}\right)^{d/2}\frac{1-\exp\big{\{}-(\mu_{X}-\mu_{Y})^{\top}\left(\Sigma+D(\lambda^{2}/4)\right)^{-1}(\mu_{X}-\mu_{Y})/4\big{\}}}{\left\|\Sigma+D(\lambda^{2}/4)\right\|^{1/2}}$
		$\displaystyle=C_{1}(d,\lambda,\Sigma)\big{(}1-\exp\big{\{}-(\mu_{X}-\mu_{Y})^{\top}\left(\Sigma+D(\lambda^{2}/4)\right)^{-1}(\mu_{X}-\mu_{Y})/4\big{\}}\big{)},$

for a constant $C_{1}(d,\lambda,\Sigma)=2\left(\frac{1}{4\pi}\right)^{d/2}\left|\Sigma+D(\lambda^{2}/4)\right|^{-1/2}$ and $D(\lambda^{2}/4)=\mathrm{diag}(\lambda_{1}^{2}/4,\ldots,\lambda_{d}^{2}/4).$ We are now ready to analyze the two terms in Equation 57.

Exact value of $\mathbb{E}_{\omega}\big{[}\big{(}\mathbb{E}_{X\times Y}[U_{1}\,|\,\omega]\big{)}^{2}\big{]}$

Recall Lemma 16 and observe that $\mathbb{E}_{\omega}\big{[}\big{(}\mathbb{E}_{X\times Y}[U_{1}\,|\,\omega]\big{)}^{2}\big{]}$ can be expressed with several $\mathrm{MMD}^{2}$ terms:

	$\displaystyle\mathbb{E}_{\omega}\big{[}\big{(}\mathbb{E}_{X\times Y}[U_{1}\,\|\,\omega]\big{)}^{2}\big{]}$	$\displaystyle=2\kappa_{\lambda}(0)\mathrm{MMD}^{2}(P_{X-X^{\prime}},P_{X^{\prime\prime}-Y};\mathcal{H}_{k_{\lambda}})+2\kappa_{\lambda}(0)\mathrm{MMD}^{2}(P_{Y-Y^{\prime}},P_{X-Y^{\prime\prime}};\mathcal{H}_{k_{\lambda}})$
		$\displaystyle\quad-\kappa_{\lambda}(0)\mathrm{MMD}^{2}(P_{X+X^{\prime}},P_{Y+Y^{\prime}};\mathcal{H}_{k_{\lambda}}).$

To simplify the above equation, let us define Gaussian random variables $Z_{1},Z_{2},Z_{3},Z_{4}$ such that

Z_{1}\sim N(0,2\Sigma),\quad Z_{2}\sim N(\mu_{X}-\mu_{Y},2\Sigma),\quad Z_{3}\sim N(2\mu_{X},2\Sigma),\quad Z_{4}\sim N(2\mu_{Y},2\Sigma).

Then, observe that $X-X^{\prime},~{}Y-Y^{\prime}\stackrel{{\scriptstyle d}}{{=}}Z_{1},~{}X-Y\stackrel{{\scriptstyle d}}{{=}}Z_{2},~{}X+X^{\prime}\stackrel{{\scriptstyle d}}{{=}}Z_{3}$ and $Y+Y^{\prime}\stackrel{{\scriptstyle d}}{{=}}Z_{4}$ , and these equivalences in distribution yield

\displaystyle\mathbb{E}_{\omega}\big{[}\big{(}\mathbb{E}_{X\times Y}[U_{1}\,|\,\omega]\big{)}^{2}\big{]}

\displaystyle=4\kappa_{\lambda}(0)\mathrm{MMD}^{2}(P_{Z_{1}},P_{Z_{2}};\mathcal{H}_{k_{\lambda}})-\kappa_{\lambda}(0)\mathrm{MMD}^{2}(P_{Z_{3}},P_{Z_{4}};\mathcal{H}_{k_{\lambda}}).

We apply the $\mathrm{MMD}$ calculation formula in Lemma 15 here and obtain

	$\displaystyle\quad\mathbb{E}_{\omega}\big{[}\big{(}\mathbb{E}_{X\times Y}[U_{1}\,\|\,\omega]\big{)}^{2}\big{]}$
	$\displaystyle=4\kappa_{\lambda}(0)C_{1}(d,\lambda,\Sigma)\big{(}1-\exp\big{\{}-(\mu_{X}-\mu_{Y})^{\top}\left(2\Sigma+D(\lambda^{2}/4)\right)^{-1}(\mu_{X}-\mu_{Y})/4\big{\}}\big{)}$
	$\displaystyle\quad-\kappa_{\lambda}(0)C_{1}(d,\lambda,\Sigma)\big{(}1-\exp\big{\{}-(\mu_{X}-\mu_{Y})^{\top}\left(2\Sigma+D(\lambda^{2}/4)\right)^{-1}(\mu_{X}-\mu_{Y})\big{\}}\big{)}$
	$\displaystyle=C_{2}(d,\lambda,\Sigma)\bigl{(}3-4\exp(-s_{a})+\exp(-4s_{a})\bigr{)},$

where we denote $s_{a}=(\mu_{X}-\mu_{Y})^{\top}\left(2\Sigma+D(\lambda^{2}/4)\right)^{-1}(\mu_{X}-\mu_{Y})/4$ , and $C_{2}(d,\lambda,\Sigma)=\kappa_{\lambda}(0)C_{1}(d,\lambda,\Sigma).$

Exact value of $\big{(}\mathrm{MMD}^{2}(P_{X},P_{Y};\mathcal{H}_{k_{\lambda}})\big{)}^{2}$

By taking square on the formula in Lemma 15, we have

		$\displaystyle\big{(}\mathrm{MMD}^{2}(P_{X},P_{Y};\mathcal{H}_{k_{\lambda}})\big{)}^{2}$
	$\displaystyle=$	$\displaystyle C_{1}(d,\lambda,\Sigma)^{2}\big{(}1-\exp\big{\{}-(\mu_{X}-\mu_{Y})^{\top}\left(\Sigma+D(\lambda^{2}/4)\right)^{-1}(\mu_{X}-\mu_{Y})/4\big{\}}\big{)}^{2}$
	$\displaystyle=$	$\displaystyle C_{3}(d,\lambda,\Sigma)\bigl{(}1-2\exp(-s_{b})+\exp(-2s_{b})\bigr{)},$

where $s_{b}=(\mu_{X}-\mu_{Y})^{\top}\left(\Sigma+D(\lambda^{2}/4)\right)^{-1}(\mu_{X}-\mu_{Y})/4,$ and $C_{3}(d,\lambda,\Sigma)=C_{1}(d,\lambda,\Sigma)^{2}.$

Existence of constant $C$

Our goal now is to show the existence of $C$ that satisfies

\frac{\mathbb{E}_{\omega}\big{[}\big{(}\mathbb{E}_{X\times Y}[U_{1}\,|\,\omega]\big{)}^{2}\big{]}}{\big{(}\mathrm{MMD}^{2}(P_{X},P_{Y};\mathcal{H}_{k_{\lambda}})\big{)}^{2}}\leq C.

Plugging our previous results in the above equation, it is equivalent to

\frac{C_{2}(d,\lambda,\Sigma)}{C_{3}(d,\lambda,\Sigma)}\frac{3-4\exp(-s_{a})+\exp(-4s_{a})}{1-2\exp(-s_{b})+\exp(-2s_{b})}\leq C.

Note that the last term can be written as

\displaystyle\frac{3-4\exp(-s_{a})+\exp(-4s_{a})}{1-2\exp(-s_{b})+\exp(-2s_{b})}=\underbrace{\big{(}3+2\exp(-s_{a})+\exp(-2s_{a})\big{)}}_{:=f(s_{a})}\underbrace{\frac{1-2\exp(-s_{a})+\exp(-2s_{a})}{1-2\exp(-s_{b})+\exp(-2s_{b})}}_{:=\frac{g(s_{a})}{g(s_{b})}}.

Since $f(0)=6$ and $f^{\prime}(x)=-2\exp(-x)\big{(}1+\exp(-x)\big{)}\leq 0$ for $\forall x\in\mathbb{R},$ we have $f(s_{a})\leq 6$ for $s_{a}\geq 0.$ Also, observe that $g^{\prime}(x)=2\exp(-x)\big{(}1-\exp(-x)\big{)}\geq 0$ for $\forall x\geq 0$ , $g(0)=1$ and $s_{a}\leq s_{b}$ for all $(\mu_{X}-\mu_{Y})\in\mathbb{R}^{d},$ thus we get $g(s_{a})/g(s_{b})\leq 1.$ Therefore, we can derive

f(s_{a})\frac{g(s_{a})}{g(s_{b})}\leq 6,

and this implies that there exists some positive constant $C(d,\lambda,\Sigma)>0$ satisfying

\frac{C_{2}(d,\lambda,\Sigma)}{C_{3}(d,\lambda,\Sigma)}f(s_{a})\frac{g(s_{a})}{g(s_{b})}\leq C(d,\lambda,\Sigma).

This completes the proof of Lemma 17.

		$\displaystyle\lim_{{n_{1}},{n_{2}}\rightarrow\infty}\mathbb{P}\big{(}{n_{1}}U\geq{n_{1}}q_{{n_{1}},{n_{2}},1-\alpha}^{u}\,\big{\|}\,\boldsymbol{\omega}_{R}=\boldsymbol{\omega}\big{)}$
	$\displaystyle=$	$\displaystyle\lim_{{n_{1}},{n_{2}}\rightarrow\infty}\mathbb{P}\big{(}{n_{1}}V+c(p,\mu_{\boldsymbol{\omega}_{R}})\geq q_{R,1-\alpha}+c(p,\mu_{\boldsymbol{\omega}_{R}})\,\big{\|}\,\boldsymbol{\omega}_{R}=\boldsymbol{\omega}\big{)}$
	$\displaystyle=$	$\displaystyle\lim_{{n_{1}},{n_{2}}\rightarrow\infty}\mathbb{P}\big{(}{n_{1}}V+c(p,\mu_{\boldsymbol{\omega}_{R}})\geq{n_{1}}q_{{n_{1}},{n_{2}},1-\alpha}+c(p,\mu_{\boldsymbol{\omega}_{R}})\,\big{\|}\,\boldsymbol{\omega}_{R}=\boldsymbol{\omega}\big{)}$
	$\displaystyle=$	$\displaystyle\lim_{{n_{1}},{n_{2}}\rightarrow\infty}\mathbb{P}\big{(}{n_{1}}V\geq{n_{1}}q_{{n_{1}},{n_{2}},1-\alpha}\,\big{\|}\,\boldsymbol{\omega}_{R}=\boldsymbol{\omega}\big{)}$

	$\displaystyle\operatorname{Var}\left[U\right]$	$\displaystyle=\mathbb{E}_{X\times Y}\big{[}\operatorname{Var}_{\omega}[U\,\|\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}}]\big{]}+\operatorname{Var}_{X\times Y}\big{[}\mathbb{E}_{\omega}[U\,\|\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}}]\big{]}$
		$\displaystyle=\mathbb{E}_{X\times Y}\bigg{[}\frac{1}{R}\operatorname{Var}_{\omega}[U_{1}\,\|\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}}]\bigg{]}+\operatorname{Var}_{X\times Y}\big{[}\widehat{\mathrm{MMD}}_{u}^{2}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\mathcal{H}_{k})\big{]}$
		$\displaystyle\leq\frac{1}{R}\mathbb{E}_{X\times Y}\Big{[}\mathbb{E}_{\omega}\big{[}(U_{1})^{2}\,\big{\|}\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}}\big{]}\Big{]}+\operatorname{Var}_{X\times Y}\big{[}\widehat{\mathrm{MMD}}_{u}^{2}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\mathcal{H}_{k})\big{]}$
		$\displaystyle=\frac{1}{R}\mathbb{E}_{\omega}\Big{[}\mathbb{E}_{X\times Y}\big{[}(U_{1})^{2}\,\big{\|}\,\omega\big{]}\Big{]}+\operatorname{Var}_{X\times Y}\big{[}\widehat{\mathrm{MMD}}_{u}^{2}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\mathcal{H}_{k})\big{]}$
		$\displaystyle=\frac{1}{R}\mathbb{E}_{\omega}\Big{[}\operatorname{Var}_{X\times Y}[U_{1}\,\|\,\omega]+\big{(}\mathbb{E}_{X\times Y}[U_{1}\,\|\,\omega]\big{)}^{2}\Big{]}+\operatorname{Var}_{X\times Y}\big{[}\widehat{\mathrm{MMD}}_{u}^{2}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\mathcal{H}_{k})\big{]}$
		$\displaystyle=\frac{1}{R}\mathbb{E}_{\omega}\big{[}\operatorname{Var}_{X\times Y}[U_{1}\,\|\,\omega]\big{]}+\frac{1}{R}\mathbb{E}_{\omega}\big{[}\big{(}\mathbb{E}_{X\times Y}[U_{1}\,\|\,\omega]\big{)}^{2}\big{]}+\operatorname{Var}_{X\times Y}\big{[}\widehat{\mathrm{MMD}}_{u}^{2}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\mathcal{H}_{k})\big{]}.$

	$\displaystyle\sqrt{\frac{4}{\beta R}\mathbb{E}_{\omega}\big{[}\big{(}\mathbb{E}_{X\times Y}[U_{1}\,\|\,\omega]\big{)}^{2}\big{]}}$	$\displaystyle\leq\sqrt{\frac{4C_{4}(M_{3},d)}{\beta R}}\kappa_{\lambda}(0)\\|\xi\\|^{2}_{2}$		(45)
		$\displaystyle=C_{4}^{\prime}(M_{3},\beta,d)\frac{\kappa_{\lambda}(0)}{\sqrt{R}}\\|\xi\\|^{2}_{2}$		(45)

$\displaystyle\sqrt{\frac{4}{\beta}\operatorname{Var}\big{[}\widehat{\mathrm{MMD}}_{u}^{2}(\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}};\mathcal{H}_{k_{\lambda}})\big{]}}$	$\displaystyle\leq\sqrt{\frac{4C_{5}(M_{2},d)\\|\xi\ast\kappa_{\lambda}\\|^{2}_{2}}{\beta{n}}+\frac{4C_{5}(M_{2},d)\kappa_{\lambda}(0)}{\beta{n}^{2}}}$	(46)
	$\displaystyle\stackrel{{\scriptstyle(a)}}{{\leq}}2\sqrt{\frac{1}{2}\\|\xi\ast\kappa_{\lambda}\\|^{2}_{2}\frac{2C_{5}(M_{2},d)}{\beta{n}}}+\frac{2\sqrt{C_{5}(M_{2},d)\kappa_{\lambda}(0)}}{\sqrt{\beta}{n}}$
	$\displaystyle\stackrel{{\scriptstyle(b)}}{{\leq}}\frac{1}{2}\\|\xi\ast\kappa_{\lambda}\\|^{2}_{2}+\frac{2C_{5}(M_{2},d)}{\beta{n}}+\frac{2\sqrt{C_{5}(M_{2},d)\kappa_{\lambda}(0)}}{\sqrt{\beta}{n}}$
	$\displaystyle\leq\frac{1}{2}\\|\xi\ast\kappa_{\lambda}\\|^{2}_{2}+\frac{C_{5}^{\prime}(M_{2},\beta,d)}{n}+C_{5}^{\prime}(M_{2},\beta,d)\frac{\sqrt{\kappa_{\lambda}(0)}}{{n}},$

Computational-Statistical Trade-off in Kernel Two-Sample Testing with Random Fourier Features

Computational-Statistical Trade-off in Kernel Two-Sample Testing with Random Fourier Features

Abstract

1 Introduction

1.1 Related work

1.2 Our contributions

Organization.

2 Background

2.1 Two-sample problem

2.2 Maximum Mean Discrepancy

2.3 Random Fourier features

2.4 Permutation tests

3 Lack of consistency

3.1 Preliminaries and intuition

Lemma 1 (Chwialkowski et al., 2015).

3.2 Main results

Proposition 2.

Theorem 3.

Corollary 4.

4 Optimal choice of the number of random features

Theorem 5.

Minimax two-sample testing framework.

4.1 Uniform consistency in L2L_{2} metric

Theorem 6.

4.2 Uniform consistency in MMD metric

Theorem 7.

Proposition 8.

5 Numerical studies

6 Discussion

Acknowledgments

References

Appendix A Technical lemmas

Lemma 9 (Bochner,, 1933, Bochner’s theorem,).

Lemma 10 (Bogachev,, 2007, Theorem 3.9.4).

Lemma 11 (Chung and Romano, 2016, Lemma A.2).

Lemma 12 (Chung and Romano, 2016, Lemma A.3).

Lemma 13 (Chung and Romano, 2016, Lemma A.6).

Lemma 14 (Chung and Romano, 2016, Theorem 2.1).

Lemma 15 (Ramdas et al., 2015, Proposition 1).

Lemma 16.

Lemma 17.

Remark 17.1.

Appendix B Proofs

Notation and terminology.

B.1 Proof of Proposition 2

B.2 Proof of Theorem 3

Asymptotic behavior of the unconditional distribution

Asymptotic behavior of the permutation distribution

Constructing PXP_{X} and PYP_{Y}

B.3 Proof of Corollary 4

B.4 Proof of Theorem 5

Expectation of UU

Upper bound for the variance of UU

Upper bound for the critical value qn1,n2,1−αuq_{{n_{1}},{n_{2}},1-\alpha}^{u}

Finding βn1,n2,R\beta_{{n_{1}},{n_{2}},R}

B.5 Proof of Theorem 6

Lower bound for 𝔼​[U]\mathbb{E}\left[U\right]

Upper bound for 4β​Var⁡[U]\sqrt{\frac{4}{\beta}\operatorname{Var}\left[U\right]}

Upper bound for 4α​β​𝔼​[Varπ⁡[Uπ|𝒳n1,𝒴n2,𝝎R]]\sqrt{\frac{4}{\alpha\beta}\mathbb{E}\big{[}\operatorname{Var}_{\pi}[U_{\pi}\,|\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}},\boldsymbol{\omega}_{R}]\big{]}}

Upper bound for 𝔼​[W′]\mathbb{E}[W^{\prime}]

Upper bound for 4β​Var⁡[W′]\sqrt{\frac{4}{\beta}\operatorname{Var}[W^{\prime}]}

Sufficient condition for Equation (38)

B.6 Proof of Theorem 7

Upper bound for 2β​Var⁡[U]\sqrt{\frac{2}{\beta}\operatorname{Var}\left[U\right]}

Sufficient condition for Equation 52

B.7 Proof of Proposition 8

Upper bound for 2β​Var⁡[U]\sqrt{\frac{2}{\beta}\operatorname{Var}\left[U\right]}

Sufficient condition for Equation 56

B.8 Proof of Lemma 16

B.9 Proof of Lemma 17

Exact value of 𝔼ω​[(𝔼X×Y​[U1|ω])2]\mathbb{E}_{\omega}\big{[}\big{(}\mathbb{E}_{X\times Y}[U_{1}\,|\,\omega]\big{)}^{2}\big{]}

Exact value of (MMD2​(PX,PY;ℋkλ))2\big{(}\mathrm{MMD}^{2}(P_{X},P_{Y};\mathcal{H}_{k_{\lambda}})\big{)}^{2}

Existence of constant CC

4.1 Uniform consistency in $L_{2}$ metric

Constructing $P_{X}$ and $P_{Y}$

Expectation of $U$

Upper bound for the variance of $U$

Upper bound for the critical value $q_{{n_{1}},{n_{2}},1-\alpha}^{u}$

Finding $\beta_{{n_{1}},{n_{2}},R}$

Lower bound for $\mathbb{E}\left[U\right]$

Upper bound for $\sqrt{\frac{4}{\beta}\operatorname{Var}\left[U\right]}$

Upper bound for $\sqrt{\frac{4}{\alpha\beta}\mathbb{E}\big{[}\operatorname{Var}_{\pi}[U_{\pi}\,|\,\mathcal{X}_{n_{1}},\mathcal{Y}_{n_{2}},\boldsymbol{\omega}_{R}]\big{]}}$

Upper bound for $\mathbb{E}[W^{\prime}]$

Upper bound for $\sqrt{\frac{4}{\beta}\operatorname{Var}[W^{\prime}]}$

Upper bound for $\sqrt{\frac{2}{\beta}\operatorname{Var}\left[U\right]}$

Upper bound for $\sqrt{\frac{2}{\beta}\operatorname{Var}\left[U\right]}$

Exact value of $\mathbb{E}_{\omega}\big{[}\big{(}\mathbb{E}_{X\times Y}[U_{1}\,|\,\omega]\big{)}^{2}\big{]}$

Exact value of $\big{(}\mathrm{MMD}^{2}(P_{X},P_{Y};\mathcal{H}_{k_{\lambda}})\big{)}^{2}$

Existence of constant $C$