Two-sample test based on maximum variance discrepancy
Abstract
In this article, we introduce a novel discrepancy, called the maximum variance discrepancy, for the purpose of measuring differences between two distributions in Hilbert spaces that cannot be detected by the maximum mean discrepancy. We also propose a two-sample goodness-of-fit test based on this discrepancy. We obtain the asymptotic null distribution of this two-sample test, which provides an efficient approximation method for the null distribution of the test.
1 Introduction
For probability distributions and , the test for the null hypothesis against an alternative hypothesis based on data and is known as a two-sample test. Such tests have applications in various areas. There is a huge body of literature on two-sample tests in Euclidean space, so we will not attempt a complete bibliography. In [6], a two-sample test based on Maximum Mean Discrepancy (MMD) is proposed, where the MMD is defined by (1) in Section 2. The MMD for a reproducing kernel Hilbert space associated with a positive definite kernel is defined as (2) in Section 2.
In this paper, we propose a novel discrepancy between two distributions defined as
and we call this the Maximum Variance Discrepancy (MVD), where is a covariance operator in . The MVD is obtained by replacing the kernel mean embedding in (2) with a covariance operator; hence, it is natural to consider a two-sample test based on the MVD. Related work can be found in [4], where a test for the equality of covariance operators in Hilbert spaces was proposed.
Our aim in this research is to clarify the properties of the MVD test from two perspectives: an asymptotic investigation as , and its practical implementation. We first obtain the asymptotic distribution of a consistent estimator of under . We also derive the asymptotic distribution of under the alternative hypothesis . Furthermore, we consider a sequence of local alternative distributions for and derive the asymptotic distribution of under this sequence. For practical purposes, a method to approximate the distribution of the test by under is developed. The method is based on the eigenvalues of the centered Gram matrices associated with the dataset; these eigenvalues will be shown to be estimators of the weights appearing in the asymptotic null distribution of the test. Hence, the eigenvalue-based method is expected to provide a fine approximation of the distribution of the test; however, this approximation does not work well in practice. We therefore further modify the eigenvalue-based method, and the resulting method provides a better approximation.
The rest of this paper is structured as follows. Section 2 introduces the framework of the two-sample test and defines the test statistics based on the MVD. In addition, the representation of test statistics based on the centered Gram matrices is described. Section 3.1 develops the asymptotics for the test by the MVD under . The test by the MVD under is addressed in Section 3.2. Furthermore, the behavior of the test by the MVD under the local alternative hypothesis is clarified in Section 3.3. Section 3.4 describes the estimation of the weights that appear in the asymptotic null distribution obtained in Section 3.1. Section 4 examines the implementation of the MVD test with a Gaussian kernel in the Hilbert space . Section 4.1 introduces the modification of the approximate distribution given in Section 3.4. Section 4.2 reports the results of simulations for the type I error and the power of the MVD and MMD tests. Section 5 presents the results of applications to real data sets, including high-dimension low-sample-size data. Conclusions are given in Section 6. All proofs of theoretical results are provided in Section 7.
2 Maximum Variance Discrepancy
Let be a separable Hilbert space and be a measurable space. Let be the inner product of and be the associated norm. Let and denote a sample of independent and identically distributed (i.i.d.) random variables drawn from unknown distributions and , respectively. Our goal is to test whether the unknown distributions and are equal.
Let us define the null hypothesis and the alternative hypothesis . Following [6], the gap between two distributions and on is measured by:
(1) |
where is a class of real-valued functions on . Regardless of , always defines a pseudo-metric on the space of probability distributions. Let be the unit ball of a reproducing kernel Hilbert space associated with a characteristic kernel (see [2] and [5] for details) and assume that and . Then, the MMD in is defined as the distance between and as follows:
(2) |
where and are called the kernel mean embeddings of and , respectively, in (see [6]). The MMD captures the difference between the distributions and through the difference between their means in . The motivation for this research is to capture the difference between the distributions and through the difference between their variances in , based on an idea similar to the MMD. Assume and ; then the variance is defined by
Here, for any , the tensor product is defined as the operator , is defined as , and (see Section II.4 in [12] for details). Let and . Then, we define the MVD in as
which can be seen as a discrepancy between distributions and . The can be estimated by
(3) |
where
and
Let the Gram matrices be , , and ; the centering matrix be ; and the centered Gram matrices be , , and , where is the identity matrix. This test statistic can be expanded as:
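To make the Gram-matrix form concrete, the following Python sketch computes a plug-in estimate of the squared MVD from the centered Gram matrices. The Frobenius-norm expression and the bandwidth convention of the Gaussian kernel are our reading of the construction above, assumed for illustration, and need not coincide with the exact statistic or normalization used in this paper.

```python
import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    """Gram matrix k(a_i, b_j) for a Gaussian kernel exp(-||a - b||^2 / (2 sigma^2))."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-sq / (2.0 * sigma**2))

def mvd_sq(X, Y, sigma=1.0):
    """Plug-in estimate of ||Sigma_P_hat - Sigma_Q_hat||_HS^2.

    Uses <Sigma_X_hat, Sigma_Y_hat>_HS = ||H_m K_XY H_n||_F^2 / (m n), which follows
    from the plug-in covariance estimators; the paper's exact normalization may differ.
    """
    m, n = len(X), len(Y)
    Hm = np.eye(m) - np.ones((m, m)) / m      # centering matrices
    Hn = np.eye(n) - np.ones((n, n)) / n
    Kx = Hm @ gaussian_gram(X, X, sigma) @ Hm   # centered Gram matrices
    Ky = Hn @ gaussian_gram(Y, Y, sigma) @ Hn
    Kxy = Hm @ gaussian_gram(X, Y, sigma) @ Hn
    return (np.sum(Kx**2) / m**2
            - 2.0 * np.sum(Kxy**2) / (m * n)
            + np.sum(Ky**2) / n**2)
```

For instance, `mvd_sq(np.random.randn(200, 5), np.random.randn(200, 5) + 1.0)` compares a standard normal sample with a mean-shifted one under these assumed conventions.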
We investigate the and when , the kernel is the Gaussian kernel:
(4) |
, and . Under this setting, straightforward calculations using the properties of Gaussian density yield the following result for MMD:
Proposition 1.
When and is the Gaussian kernel in (4), we have
The result for MVD is also obtained as follows using the Gaussian density property as well as the result for MMD:
Proposition 2.
When and is the Gaussian kernel in (4), we have
Corollary 1.
Using Corollary 1, we investigate the behavior of and with respect to the difference of from (0,1). A sensitive reaction to the difference of from (0,1) means that the discrepancy can react sensitively to differences between distributions from . Using such a discrepancy in the testing framework is expected to lead to correct rejection of under .
More generally, a kernel constructed from a constant and a positive definite kernel is also positive definite. Then, the and calculated with the kernel satisfy and for any distributions and , where and are calculated with the kernel .
The graphs of and are displayed for each when in Figure 1 and for each when in Figure 2. The kernel is the Gaussian kernel in (4), and the parameters are , , and in both Figures 1 and 2. Figure 1 shows and for the difference of the mean from the standard normal distribution. In Figure 1 (a), is larger than , but in Figures 1 (b) and (c), where the value of is increased, is larger than for each . In addition, Figure 2 shows the reaction of the MMD and MVD to the difference of the covariance matrix from the standard normal distribution, and is larger than for each when is large. This means that the MVD is more sensitive than the MMD to differences from the standard normal distribution for with large . The fact that there is a kernel for which the MVD is larger than the MMD motivates the two-sample test based on the MVD in the next section.
[Figure 1: MMD and MVD against the difference of the mean from the standard normal distribution. Figure 2: MMD and MVD against the difference of the covariance matrix from the standard normal distribution.]
3 Test statistic for two-sample problem
We consider a two-sample test based on for and , and is defined as a test statistic. If is large, then the null hypothesis is rejected since is the difference between and . The condition to derive the asymptotic distribution of this test statistic is as follows:
Condition.
, and .
3.1 Asymptotic null distribution
In this section, we derive an asymptotic distribution of under . In what follows, the symbol designates convergence in distribution.
Theorem 1.
Suppose that Condition is satisfied. Then, under , as ,
where and is an eigenvalue of .
It is generally not easy to utilize such an asymptotic null distribution because it is an infinite sum and determining the weights included in it is itself a difficult problem. For practical purposes, a method to approximate the distribution of the test by under is developed in Section 4. The method is based on the eigenvalues of the centered Gram matrices associated with the data set, described in Section 3.4.
3.2 Asymptotic non-null distribution
In this section, an asymptotic distribution of under is investigated.
Theorem 2.
Suppose that Condition is satisfied. Then, under , as ,
where
and
Remark 1.
We see by Theorem 2 that
where . Thus, we can evaluate the power of the test by as
as , where is the -quantile of the distribution of under , and is the distribution function of the standard normal distribution. Therefore, this test is consistent.
3.3 Asymptotic distribution under contiguous alternatives
In this section, we develop an asymptotic distribution of under a sequence of local alternative distributions for .
Theorem 3.
Let and . Suppose that Condition is satisfied. Let be a kernel defined as
(5) |
and
and let be a self-adjoint operator defined as
(6) |
(see Sections VI.1, VI.3, and VI.6 in [12] for details). Then, as
where ,
and and are, respectively, the eigenvalue of and the eigenfunction corresponding to .
The following proposition claims that the eigenvalues appearing in Theorem 3 are the same as the eigenvalues appearing in Theorem 1:
From Theorems 1 and 3 and Proposition 3, it can be seen that the local power of the test by is dominated by the noncentrality parameters. It follows that
by which we obtain
In addition, from Theorem 1 in [10], we have
Hence, the local power reveals not only the difference between and but also that between and .
3.4 Null distribution estimates using Gram matrix spectrum
The asymptotic null distribution was obtained in Theorem 1, but it is difficult to derive its weights. The following theorem shows that these weights can be estimated using the estimator of .
Theorem 4.
Assume that . Let and be the eigenvalues of and , respectively, where
Then, as
where .
In addition, Proposition 4 claims that the eigenvalues of and those of the Gram matrix are the same.
Proposition 4.
The Gram matrix is defined as
Then, the eigenvalues of and are the same.
Remark 2.
By Proposition 4, the critical value can be obtained by calculating using the eigenvalue of . In addition, the matrix is expressed as
(7) |
where is the Hadamard product. The Gram matrix is positive definite, but has an eigenvalue 0 since .
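As a hypothetical illustration of Remark 2, the sketch below estimates the weights from a Hadamard square of the pooled centered Gram matrix and simulates the weighted chi-squared sum to obtain a critical value. The particular matrix and its scaling are assumptions made in the spirit of (7), not the paper's exact construction; `gaussian_gram` refers to the helper defined in the earlier sketch.

```python
import numpy as np  # gaussian_gram is the helper from the earlier sketch

def estimated_weights(X, Y, sigma=1.0):
    """Eigenvalue-based estimates of the weights in the asymptotic null distribution.

    Assumption: the weights are estimated by the eigenvalues of the Hadamard square
    of the pooled centered Gram matrix divided by the pooled sample size, in the
    spirit of (7); the exact matrix and scaling in the paper may differ.
    """
    Z = np.vstack([X, Y])
    N = len(Z)
    H = np.eye(N) - np.ones((N, N)) / N
    Kc = H @ gaussian_gram(Z, Z, sigma) @ H   # pooled centered Gram matrix
    M = (Kc * Kc) / N                         # Hadamard product, cf. (7)
    return np.clip(np.linalg.eigvalsh(M), 0.0, None)  # clip rounding noise

def mc_critical_value(weights, alpha=0.05, n_rep=10_000, seed=None):
    """(1 - alpha)-quantile of sum_i w_i * chi^2_1 obtained by Monte Carlo."""
    rng = np.random.default_rng(seed)
    sims = rng.chisquare(df=1, size=(n_rep, len(weights))) @ weights
    return np.quantile(sims, 1.0 - alpha)
```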
4 Implementation
This section proposes corrections to the asymptotic distribution for both the MVD and MMD tests, and describes the results of simulations of the type-I error and power for the modifications. The MMD test is a two-sample test for and using the test statistic:
Its asymptotic null distribution is an infinite sum of weighted chi-squared distributions, of the same form as for (3). The approximate distribution can be obtained by estimating the eigenvalues from the centered Gram matrix (see [7] for details).
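For comparison with the MVD sketch above, a minimal version of the biased (V-statistic) form of the squared MMD is given below. This is the standard estimator from the kernel two-sample literature and may differ from the exact normalization used in this paper; `gaussian_gram` is the helper from the earlier sketch.

```python
import numpy as np  # gaussian_gram as in the earlier sketch

def mmd_sq(X, Y, sigma=1.0):
    """Standard biased (V-statistic) estimate of MMD^2."""
    Kx = gaussian_gram(X, X, sigma)
    Ky = gaussian_gram(Y, Y, sigma)
    Kxy = gaussian_gram(X, Y, sigma)
    return Kx.mean() - 2.0 * Kxy.mean() + Ky.mean()
```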
4.1 Approximation of the null distribution
In this section, we discuss methods to approximate the null distributions of the MVD and MMD tests. The asymptotic null distribution of the MVD test was obtained in Theorem 1 as an infinite sum of weighted chi-squared random variables with one degree of freedom, and according to Theorem 4, those weights can be estimated by the eigenvalues of the matrix . A similar result was obtained for the MMD by [7]. However, the approximate distribution based on these estimated eigenvalues does not work well in practice. In fact, comparing the simulated exact null distribution with this approximate distribution shows that the variance of the approximate distribution is larger than that of the simulated exact null distribution. We therefore modify the approximate distribution using the variance of the simulated exact null distribution. The variance of the exact null distribution is given by the following proposition:
Proposition 5.
Assume that . Then under ,
Proposition 5 leads to
(8) |
with respect to and . If we can estimate the variance at sample sizes and smaller than and , respectively, then we can estimate by using (8). The method of estimating it by choosing subsamples from the data without replacement is known as subsampling.
The following proposition shows a result for the MMD similar to that for the MVD:
Proposition 6.
Assume that . Then, under
4.1.1 Subsampling method
We use the subsampling method to estimate and (see Section 2.2 in [11] for details). In order to obtain two samples under the null hypothesis, we divide into and . Then, we randomly select and from and without replacement, and this is repeated in each iteration . These randomly selected values generate replications of the test statistic
for iterations . The test statistics generated over the iterations are used to estimate by computing the unbiased sample variance:
where . According to (8), is estimated by
(9) |
We also estimate by using
(10) |
from Proposition 6, where
for ,
and .
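A minimal sketch of this subsampling loop is given below, assuming the splitting-and-resampling scheme described above. The subsequent rescaling of the variance to the full sample sizes via (8) and (9), and the choice of the subsample sizes, follow the paper and are not shown; `mvd_sq` and `mmd_sq` are the sketches from earlier.

```python
import numpy as np

def subsampling_variance(X, n1, n2, statistic, n_iter=100, seed=None):
    """Unbiased sample variance of a two-sample statistic over subsamples of X.

    X is a single sample (so both subsamples obey the null hypothesis); it is split
    into two halves, and in each iteration n1 and n2 points are drawn without
    replacement from the respective halves.  `statistic` is a callable such as
    mvd_sq or mmd_sq from the earlier sketches.
    """
    rng = np.random.default_rng(seed)
    half = len(X) // 2
    X1, X2 = X[:half], X[half:]
    reps = np.empty(n_iter)
    for b in range(n_iter):
        idx1 = rng.choice(len(X1), size=n1, replace=False)
        idx2 = rng.choice(len(X2), size=n2, replace=False)
        reps[b] = statistic(X1[idx1], X2[idx2])
    return np.var(reps, ddof=1)   # unbiased sample variance over the iterations
```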
The columns of in Table 1 and in Table 2 are the variances of and , which are estimated by a simulation of 10,000 iterations with and for each and . The subsampling variances in (9) and in (10) with are given in the columns labeled “Subsampling” for . Tables 1 and 2 show that the subsampling variances and estimate the exact variances reasonably well; however, they tend to underestimate the exact variances.
We investigate how much smaller is than the variance of by performing a linear regression of against the variance of with the intercept fixed at 0, and we do the same for the MMD. The results are shown in Figure 3: (a) and (b) show the results for the MVD and (c) and (d) show the results for the MMD, for the and cases, respectively. In Figure 3, the axis is or and the axis is the variance of or for each and in Table 1 or Table 2. The line in Figure 3 is the regression line obtained by the least-squares method in the form . It can be seen from Figure 3 that when and are multiplied by the associated regression coefficient, they approach the variances and . The coefficient of the linear regression with intercept 0 is written in the row labeled “slope of the line” in Tables 1 and 2.
Subsampling | ||||||
5 | (200,200) | 0.06880 | 0.04341 | 0.05168 | 0.04902 | |
5 | (500,500) | 0.06881 | 0.03821 | 0.04897 | 0.04921 | |
5 | (200,200) | 0.07254 | 0.04246 | 0.05138 | 0.05798 | |
5 | (500,500) | 0.07188 | 0.04052 | 0.05500 | 0.05593 | |
10 | (200,200) | 0.00850 | 0.00602 | 0.00812 | 0.00898 | |
10 | (500,500) | 0.00845 | 0.00674 | 0.00751 | 0.00753 | |
10 | (200,200) | 0.01280 | 0.00937 | 0.01224 | 0.01377 | |
10 | (500,500) | 0.01270 | 0.01032 | 0.01251 | 0.01255 | |
20 | (200,200) | 0.00049 | 0.00048 | 0.00070 | 0.00094 | |
20 | (500,500) | 0.00043 | 0.00031 | 0.00046 | 0.00060 | |
20 | (200,200) | 0.00166 | 0.00152 | 0.00261 | 0.00330 | |
20 | (500,500) | 0.00147 | 0.00122 | 0.00165 | 0.00204 | |
(200,200) | 1 | 1.63621 | 1.35769 | 1.29601 | ||
slope of the line | (500,500) | 1 | 1.76057 | 1.33845 | 1.3232 | |
both | 1 | 1.69348 | 1.34798 | 1.30928 |
Subsampling | ||||||
5 | (200,200) | 0.57100 | 0.47044 | 0.50315 | 0.57047 | |
5 | (500,500) | 0.66068 | 0.54518 | 0.63054 | 0.58848 | |
5 | (200,200) | 0.65987 | 0.51867 | 0.57258 | 0.54567 | |
5 | (500,500) | 0.75903 | 0.63563 | 0.68024 | 0.65349 | |
10 | (200,200) | 0.16205 | 0.09213 | 0.12017 | 0.12940 | |
10 | (500,500) | 0.16279 | 0.16106 | 0.16809 | 0.18334 | |
10 | (200,200) | 0.25656 | 0.14104 | 0.18435 | 0.20457 | |
10 | (500,500) | 0.25836 | 0.24670 | 0.25567 | 0.26402 | |
20 | (200,200) | 0.02757 | 0.02135 | 0.02632 | 0.02814 | |
20 | (500,500) | 0.02784 | 0.02255 | 0.02407 | 0.02615 | |
20 | (200,200) | 0.07642 | 0.07404 | 0.08121 | 0.08744 | |
20 | (500,500) | 0.07856 | 0.05714 | 0.07492 | 0.07320 | |
(200,200) | 1 | 1.27369 | 1.16037 | 1.11083 | ||
slope of the line | (500,500) | 1 | 1.18426 | 1.07578 | 1.12080 | |
both | 1 | 1.21990 | 1.10951 | 1.11643 |
[Figure 3: least-squares regression of the subsampling variances against the exact variances; panels (a), (b): MVD; panels (c), (d): MMD.]
Remark 3.
Since the subsampling variance is an unbiased sample variance of times, we get
However, it is not easy to estimate . This motivates modifying the subsampling variances in (9) and in (10) to and using , which is determined from the regression coefficient in the row “slope of the line” in Tables 1 and 2.
4.1.2 Modification of approximation distribution
Since the subsampling variance underestimates , as seen in Tables 1 and 2, we use a positive to estimate by in (9). The same underestimation occurs for the MMD test, and is estimated by in (10) with a positive . Our approximation of the null distribution is based on shrinking the overly large variance of to . The method aims to approximate the exact null distribution by using
(11) |
The parameters and are determined so that the means of and are equal and the variance of is equal to , which can be established by
and
This approximation method can be applied similarly to the MMD test using . In this paper, the parameter is determined using the value of “slope of the line” in Tables 1 and 2. Figure 4 shows that this can approximate the simulated exact distribution better than . Algorithm shows how to obtain the critical value of the MVD test using this modification; a sketch of this procedure is given after Figure 4. The algorithm for the MMD test can be obtained by changing and in Algorithm to and .
[Figure 4: simulated exact null distribution compared with the approximate distributions before and after the modification.]
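As a hypothetical sketch of how the critical value might be obtained with this modification (in place of the Algorithm referred to above), the eigenvalue-based null replicates can be adjusted by moment matching: the mean is preserved and the variance is shrunk to the regression-corrected subsampling estimate. The affine form of the adjustment below is an assumption; the paper's matching constants may be defined differently.

```python
import numpy as np

def modified_critical_value(weights, target_var, alpha=0.05, n_rep=10_000, seed=None):
    """Critical value after matching the variance of the approximate null distribution.

    Replicates of the eigenvalue-based approximation (sum_i w_i * chi^2_1) are
    rescaled affinely so that their mean is preserved and their variance equals
    `target_var`, e.g. the regression-corrected subsampling estimate.
    """
    rng = np.random.default_rng(seed)
    sims = rng.chisquare(df=1, size=(n_rep, len(weights))) @ np.asarray(weights)
    a = np.sqrt(target_var / np.var(sims))             # match the target variance
    adjusted = sims.mean() + a * (sims - sims.mean())  # keep the mean unchanged
    return np.quantile(adjusted, 1.0 - alpha)
```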
4.2 Simulations
In this section, we investigate the performance of under a specific null hypothesis and specific alternative hypotheses when and is the Gaussian kernel in (4). In particular, a Monte Carlo simulation is performed to observe the type-I error and the power of the MVD and MMD tests. Two alternative cases are implemented: a uniform distribution and an exponential distribution , with , all of which have mean 0 and variance 1. The critical values are determined based on in Section 4.1.2 from a normal distribution. The type-I error of is obtained by counting the number of times exceeds the critical value in 1,000 iterations under the null hypothesis, and the estimated power of is similarly obtained by counting the number of times exceeds the critical value under each alternative distribution in 1,000 iterations; a sketch of this counting procedure is given after Table 3. We execute the above for and and , and . It is known that the selection of the value of involved in the Gaussian kernel affects the performance; we utilize depending on the dimension . The significance level is . With and , for the MVD is and for the MMD is by “slope of the line” in Tables 1 and 2. The following can be seen from Table 3:
• Table 3 shows that the probabilities of type-I error at and 10 are near the significance level of for both the MVD and MMD.
• The probability of type-I error at exceeds the significance level of for the MVD, but decreases as increases.
• It can be seen that the critical value by of the MVD tends to be estimated as less than that point of the null distribution.
• In hypothesis , it can be seen that the MVD has a higher power than the MMD.
• It can be seen that the MVD and MMD have higher powers for hypothesis than for hypothesis , and that it is difficult to distinguish between the normal distribution and the uniform distribution by the MVD and MMD with a Gaussian kernel.
• Note that the critical value changes depending on the distribution of the null hypothesis.
MVD | MMD | |||||||
---|---|---|---|---|---|---|---|---|
Type-I error | Type-I error | |||||||
5 | (200,200) | 0.060 | 0.797 | 1 | 0.047 | 0.401 | 1 | |
5 | (500,500) | 0.072 | 1 | 1 | 0.063 | 0.877 | 1 | |
5 | (200,200) | 0.056 | 0.728 | 1 | 0.052 | 0.305 | 1 | |
5 | (500,500) | 0.067 | 0.999 | 1 | 0.053 | 0.735 | 1 | |
10 | (200,200) | 0.073 | 0.612 | 1 | 0.086 | 0.342 | 1 | |
10 | (500,500) | 0.047 | 0.991 | 1 | 0.040 | 0.630 | 1 | |
10 | (200,200) | 0.054 | 0.482 | 1 | 0.086 | 0.235 | 1 | |
10 | (500,500) | 0.034 | 0.955 | 1 | 0.044 | 0.363 | 1 | |
20 | (200,200) | 0.279 | 0.816 | 1 | 0.082 | 0.239 | 1 | |
20 | (500,500) | 0.099 | 0.948 | 1 | 0.068 | 0.477 | 1 | |
20 | (200,200) | 0.060 | 0.332 | 1 | 0.047 | 0.113 | 0.989 | |
20 | (500,500) | 0.034 | 0.728 | 1 | 0.069 | 0.240 | 1 |
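The counting procedure behind Table 3 can be sketched as follows. The sampling functions, the test statistic, and the critical-value rule are placeholders for those described above (including any scaling of the statistic), so this is an illustrative loop rather than the exact simulation code.

```python
import numpy as np

def rejection_rate(sample_p, sample_q, statistic, critical_value, n_iter=1000, seed=None):
    """Fraction of iterations in which the test statistic exceeds its critical value.

    `sample_p` and `sample_q` are callables returning one sample each (e.g. draws from
    N(0, I) and from an alternative); with identical distributions this estimates the
    type-I error, otherwise the power.  `statistic` and `critical_value` are callables
    such as those in the earlier sketches.
    """
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_iter):
        X, Y = sample_p(rng), sample_q(rng)
        hits += statistic(X, Y) > critical_value(X, Y)
    return hits / n_iter
```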
5 Application to real datasets
The MVD test was applied to some real data sets. The significance level was and the critical value was obtained through 10,000 iterations of based on the eigenvalues of the matrix . We also calculated the critical value of the approximate distribution according to Algorithm in Section 4.1.2. The critical value for the MMD test was calculated from the distribution obtained based on Theorem 1 in [7] through 10,000 iterations.
5.1 USPS data
The USPS dataset consists of handwritten digits represented by a grayscale matrix (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html#usps). The sizes of each sample are shown in Table 4.
index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
sample size | 359 | 264 | 198 | 166 | 200 | 160 | 170 | 147 | 166 | 177 |
Each group was divided into two sets of sample size 70 and the MVD test was applied to each set. Table 5 shows the results of applying the MVD and MMD tests to this USPS dataset. Parameters , , and are adopted, and for the MVD and for the MMD were utilized from the slope of the line in Tables 1 and 2. In each cell, the values of and for each number are written, with the values of in parentheses.
From Table 5, the MVD test tends to reject the null hypothesis for different digit groups and not to reject it for identical groups. For USPS 2 and USPS 2, the value of is 2.953, which is larger than but smaller than . On the other hand, for the MMD test, the value of is 5.014, which is larger than both and . After modifying the distribution, there is thus a tendency to reject the null hypothesis.
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | |
---|---|---|---|---|---|---|---|---|---|---|
3.056 | 0.680 | 3.416 | 2.742 | 2.747 | 3.319 | 2.705 | 2.079 | 2.884 | 1.941 | |
2.755 | 0.572 | 2.940 | 2.317 | 2.349 | 2.785 | 2.336 | 1.823 | 2.431 | 1.719 | |
0 | 2.241 | 6.328 | 6.803 | 6.834 | 7.390 | 6.589 | 7.117 | 7.513 | 6.930 | 4.573 |
(2.740) | ||||||||||
1 | (124.6) | 0.287 | 4.269 | 4.290 | 4.356 | 4.393 | 5.132 | 4.265 | 4.233 | 4.170 |
(0.585) | ||||||||||
2 | (34.38) | (94.14) | 2.953 | 4.730 | 5.253 | 5.053 | 5.880 | 5.264 | 4.732 | 5.165 |
(5.014) | ||||||||||
3 | (42.61) | (105.0) | (26.78) | 2.383 | 5.248 | 4.239 | 6.022 | 5.242 | 4.575 | 5.022 |
(3.345) | ||||||||||
4 | (55.81) | (93.27) | (36.46) | (45.34) | 2.067 | 5.259 | 6.237 | 4.930 | 4.849 | 3.717 |
(2.745) | ||||||||||
5 | (30.65) | (95.83) | (24.64) | (18.91) | (35.32) | 2.761 | 5.757 | 5.434 | 4.814 | 5.106 |
(3.822) | ||||||||||
6 | (39.15) | (102.6) | (30.11) | (47.36) | (48.80) | (29.49) | 2.261 | 6.527 | 5.946 | 6.344 |
(5.643) | ||||||||||
7 | (72.41) | (111.3) | (50.29) | (52.91) | (45.54) | (51.29) | (74.30) | 1.560 | 5.142 | 4.062 |
(1.785) | ||||||||||
8 | (44.46) | (86.90) | (25.20) | (28.77) | (31.13) | (25.01) | (40.92) | (51.27) | 2.055 | 4.352 |
(2.666) | ||||||||||
9 | (71.81) | (95.38) | (51.48) | (49.58) | (25.87) | (46.76) | (70.81) | (31.19) | (33.73) | 1.677 |
(2.336) | ||||||||||
(4.983) | (2.191) | (4.698) | (4.352) | (4.343) | (4.691) | (4.550) | (3.917) | (4.424) | (3.767) | |
(5.228) | (1.749) | (4.502) | (3.928) | (3.961) | (4.085) | (4.510) | (4.104) | (4.245) | (4.455) |
5.2 MNIST data
The MNIST dataset consists of images of pixels in size (http://yann.lecun.com/exdb/mnist). The sizes of each sample are shown in Table 6.
index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
sample size | 5,923 | 6,742 | 5,958 | 6,131 | 5,842 | 5,421 | 5,918 | 6,265 | 5,851 | 5,949 |
The MNIST data are divided into two sets of sample size 2,000 and the MVD and MMD tests are applied. Table 7 shows the results of applying the MVD and MMD tests to the MNIST data. The approximate distribution is calculated with , , , and . As in Table 5, the values of and are written in each cell, with the values of in parentheses. In Table 7, tends to take a larger value than both and . This result is the same for the MMD test. The MVD and MMD tests tend to reject the null hypothesis with the modifications in Section 4.1.2.
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | |
---|---|---|---|---|---|---|---|---|---|---|
4.194 | 3.883 | 4.202 | 4.196 | 4.189 | 4.195 | 4.182 | 4.163 | 4.197 | 4.171 | |
3.993 | 3.937 | 3.993 | 3.989 | 3.990 | 3.993 | 3.989 | 3.990 | 3.989 | 3.988 | |
0 | 3.999 | 34.86 | 4.207 | 4.268 | 4.394 | 4.304 | 4.599 | 5.240 | 4.256 | 4.861 |
(4.092) | ||||||||||
1 | (217.4) | 5.379 | 34.68 | 34.75 | 34.86 | 34.80 | 35.08 | 35.66 | 34.73 | 35.33 |
(15.42) | ||||||||||
2 | (10.77) | (210.5) | 4.001 | 4.118 | 4.245 | 4.156 | 4.451 | 5.088 | 4.106 | 4.711 |
(4.131) | ||||||||||
3 | (13.55) | (213.4) | (9,806) | 4.007 | 4.306 | 4.211 | 4.512 | 5.146 | 4.164 | 4.769 |
(4.208) | ||||||||||
4 | (16.56) | (216.1) | (12.83) | (15.59) | 4.020 | 4.341 | 4.637 | 5.262 | 4.292 | 4.805 |
(4.482) | ||||||||||
5 | (13.35) | (213.9) | (10.10) | (11.57) | (15.12) | 4.018 | 4.546 | 5.188 | 4.201 | 4.806 |
(4.357) | ||||||||||
6 | (19.39) | (219.5) | (16.06) | (18.90) | (21.28) | (18.29) | 4.031 | 5.485 | 4.499 | 5.105 |
(4.573) | ||||||||||
7 | (28.85) | (225.2) | (24.85) | (27.49) | (28.28) | (27.56) | (34.24) | 4.067 | 5.138 | 5.625 |
(4.625) | ||||||||||
8 | (13.12) | (211.3) | (9.261) | (11.37) | (14.72) | (11.51) | (18.19) | (26.97) | 4.005 | 4.755 |
(4.190) | ||||||||||
9 | (24.80) | (223.2) | (21.11) | (23.23) | (19.05) | (22.80) | (29.84) | (29.91) | (22.07) | 4.046 |
(4.686) | ||||||||||
(4.210) | (4.718) | (4.208) | (4.206) | (4.210) | (4.209) | (4.218) | (4.238) | (4.207) | (4.224) | |
(4.058) | (5.137) | (4.024) | (4.038) | (4.077) | (4.107) | (4.092) | (4.137) | (4.039) | (4.159) |
5.3 Colon data
The Colon dataset contains gene expression data from DNA microarray experiments of colon tissue samples with = 2,000 and (see [1] for details). Among the 62 samples, 40 are tumor tissues and 22 are normal tissues. Tables 8 and 9 show the results of the MVD and MMD tests for tumor and normal. The “tumor vs. normal” column shows the values of and for tumor and normal. The “normal” and “tumor” columns show and calculated from the normal tissue and tumor tissue datasets, respectively.
For the MVD, does not exceed , but exceeds by modifying the approximate distribution. By contrast, for the MMD, exceeds both and without modifying the approximate distribution.
normal | tumor | ||||
---|---|---|---|---|---|
tumor vs. normal | |||||
3.867 | 3.536 | 5.280 | 3.728 | 5.050 | |
2.291 | 2.097 | 2.907 | 2.258 | 2.846 | |
0.684 | 0.660 | 0.879 | 0.757 | 0.906 |
normal | tumor | ||||
---|---|---|---|---|---|
tumor vs. normal | |||||
6.695 | 4.456 | 6.201 | 4.618 | 5.713 | |
8.787 | 3.974 | 4.827 | 3.945 | 4.491 | |
6.282 | 2.439 | 2.754 | 2.412 | 2.634 |
Next, tumor was divided into tumor 1 and tumor 2 , and two-sample tests by the MVD and MMD were applied. The results are shown in Tables 10 and 11, with the values for and in the column “tumor 1 vs. tumor 2”. In Table 10, when , exceeds , but in other cases does not exceed and does not exceed and . Table 11 shows that, for all , exceeds for the MMD test, but does not exceed .
tumor 1 | tumor 2 | ||||
---|---|---|---|---|---|
tumor 1 vs. tumor 2 | |||||
3.379 | 3.180 | 4.596 | 3.245 | 4.942 | |
1.858 | 1.915 | 2.502 | 2.085 | 2.875 | |
0.558 | 0.629 | 0.800 | 0.727 | 0.921 |
tumor 1 | tumor 2 | ||||
---|---|---|---|---|---|
tumor 1 vs. tumor 2 | |||||
4.627 | 4.064 | 5.621 | 3.961 | 5.782 | |
4.123 | 3.305 | 4.206 | 3.426 | 4.551 | |
2.453 | 1.942 | 2.377 | 2.102 | 2.656 |
6 Conclusion
We defined the Maximum Variance Discrepancy (MVD) with an idea similar to the Maximum Mean Discrepancy (MMD) in Section 2. We derived the asymptotic null distribution of the MVD test in Section 3.1; this was an infinite sum of weighted chi-squared distributions. In Section 3.2, we derived an asymptotic non-null distribution for the MVD test, which was a normal distribution. The asymptotic normality of the test under the alternative hypothesis showed that the two-sample test by the MVD is consistent. Furthermore, we developed an asymptotic distribution for the test under a sequence of local alternatives in Section 3.3; this was an infinite sum of weighted noncentral chi-squared distributions. We constructed estimators of the weights in the asymptotic null distribution based on the Gram matrix in Section 3.4. The approximation of the null distribution by these estimated weights does not work well, so we modified it in Section 4.1. In the power simulations reported in Section 4.2, we found that the power of the two-sample test by the MVD was larger than that of the MMD. We confirmed in Section 5 that the two-sample test by the MVD works for real datasets.
7 Proofs
Lemma 1.
Suppose that Condition is satisfied. Then, as ,
where
Lemma 2.
Let . Suppose that Condition is satisfied. Then, as , the following evaluations are obtained:
(i)
(ii)
(iii)
Lemma 3.
Let and . Suppose that Condition is satisfied. Then, as ,
7.1 Proof of Proposition 1
The kernel mean embedding with the Gaussian kernel in (4) is obtained
(12) |
and its norm is derived as
(13) |
by Proposition 4.2 in [9]. We use the property of Gaussian density
(14) |
where
and designates the density of (see, e.g., Appendix C in [13]). The property of the Gaussian density (14) is used repeatedly to calculate , and we get
(15) |
where
Using the results (13) and (15), is obtained as
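For reference, a standard Gaussian convolution identity of the kind invoked as (14) (cf. Appendix C of [13]) can be written as follows; whether (14) is stated in exactly this form in this paper is an assumption.

```latex
% Standard Gaussian convolution identity (cf. Wand and Jones, Appendix C);
% the exact statement of (14) in the text may differ in notation.
\[
  \int_{\mathbb{R}^d} \phi_{\Sigma_1}(x-\mu_1)\,\phi_{\Sigma_2}(x-\mu_2)\,\mathrm{d}x
  \;=\; \phi_{\Sigma_1+\Sigma_2}(\mu_1-\mu_2),
  \qquad \phi_{\Sigma} \text{ the density of } N(0,\Sigma).
\]
```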
7.2 Proof of Proposition 2
In the following proof, the when using the Gaussian kernel in (4) is calculated by repeatedly using (14). From the expansion of the norm
(16) |
it is sufficient for us to calculate . The definition of and the tensor product lead to
(17) |
We calculate each of these terms. The first term is derived as
(18) |
by repeatedly using (14). By using the expression of kernel mean embedding in (12) and the property of (14), we obtain the second term
(19) |
The third term is derived
(20) |
by the same calculation as . Finally, the fourth term is calculated as follows
(21) |
by using (15). Hence, combining (17) and (18)-(21) yields
(22) |
The following results are obtained by using (22):
(23) | |||
(24) |
and
(25) |
Therefore, we obtain
7.3 Proof of Theorem 1
7.4 Proof of Theorem 2
By Lemma 1, it suffices to derive the asymptotic distribution of
Let us expand the following quantity
which converges in distribution to by the central limit theorem.
7.5 Proof of Lemma 1
A direct calculation gives
From direct expansion , we have
as .
7.6 Proof of Theorem 3
From Lemma 3 it is sufficient for us to derive the asymptotic distribution of . It follows from direct calculations that
(26) |
where
In addition, we see that
(27) |
by (i) in Lemma 2. Thus, (27) leads to
(28) |
Further, the following result is obtained from (i) and (ii) in Lemma 2:
(29) |
These results (28) and (29) yield that
(30) |
By applying (28) and (30) to (26), we obtain
where is in (5). Since , in (6) is a Hilbert–Schmidt operator by Theorem VI.22 in [12]. Therefore, from Theorem 1 in [10], we have
where is an eigenvalue of and is the eigenfunction corresponding to , which satisfy
with Kronecker’s delta. From , we have
and
By direct calculation, we get
and
Hence
and
where
Therefore, by using the central limit theorem for s, we get
as .
7.7 Proof of Lemma 2
(i) First, we prove . For all , there exists such that
for all . In addition, there exists such that
for all . We put
and . Since
for all , we get
Therefore, we obtain .
(ii) Next, we prove . For all , we put
and
Then, there exists such that
for all . In addition, there exists such that
for all , and there exists such that
for all .
Let . For all , we obtain that
Therefore, is proved.
(iii) Finally, we prove . We get
by an expansion of . Using (i) and (ii) leads to the following:
7.8 Proof of Lemma 3
7.9 Proof of Proposition 3
Let be the operator with spectral representation , and be the operator
for all and . We consider the adjoint operator of this ,
for all and , hence we get that the adjoint operator of is
for all . Furthermore, since
for all , is the identity operator from to . It follows from direct calculations for that
for all and ; thus, we see that . Therefore, and are an eigenvalue and eigenfunction of , by the following:
and is an orthonormal system in , which satisfies
for all . In fact,
shows that and are an eigenvalue and eigenfunction of .
7.10 Proof of Theorem 4
Let . Then
(32) |
By the definition of the trace of a Hilbert–Schmidt operator, we get the following inequality:
(33) |
Using the notation
we have
(34) |
By direct calculation, we get the following three expressions:
(35) |
(36) |
and
(37) |
Combining (35), (36), and (37) into (34), we have the expression
Therefore,
(38) |
By the law of large numbers in Hilbert spaces (see [8]), we have
(39) |
and we get
(40) |
(41) |
and
(42) |
Next, we turn to . We have by direct computation that
Since Condition is assumed, the following hold
and
as by the law of large numbers. Hence, we get
(43) |
Also, we see that
by the Cauchy–Schwarz inequality, and we get
as . Hence, we get
(44) |
since . By the same argument as for in (44), we have
(45) |
Combining equations (39), (40), (41), (42), (43), (44), and (45) into (38) leads to
(46) |
Therefore, we get
(47) |
Next, we consider . Since and are compact Hermitian operators, by (28) of [3],
Furthermore, we have , hence
(48) |
as . Also,
(49) |
and since by (47), we obtain
(50) |
as . In addition, as by (47), we get
Finally, we shall show , as . From Chebyshev’s inequality, for any ,
as . Therefore, as .
7.11 Proof of Proposition 4
Let be a kernel defined as,
and the associated function represent for all . Then, is defined by
for any . The conjugate operator of (see Section VI.2 in [12] for details) is obtained as for all , because for all ,
Let and be an eigenvalue and eigenvector of , respectively. Then, it holds that
from the definition of eigenvalue and eigenvector. By mapping both sides with ,
Hence, the eigenvalues of are those of .
Conversely, let and be an eigenvalue and the corresponding eigenvector of ; then
and
from mapping both sides with ; hence, the eigenvalues of are those of .
7.12 Proof of Proposition 5
7.13 Proof of Proposition 6
Since
first, we need to calculate
Since the expected values of each term are obtained as
we get
(53) |
under .
7.14 Proof of (7)
The -th element of the matrix is
Each term of this can be expressed as
and
Therefore,
which gives the expression (7).
References
- [1] U. Alon, N. Barkai, D. A. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J. Levine. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences, 96(12):6745–6750, 1999.
- [2] N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404, 1950.
- [3] R. Bhatia and L. Elsner. The Hoffman-Wielandt inequality in infinite dimensions. Proceedings of the Indian Academy of Sciences - Mathematical Sciences, 104(3):483–494, 1994.
- [4] G. Boente, D. Rodriguez, and M. Sued. Testing equality between several populations covariance operators. Annals of the Institute of Statistical Mathematics, 70(4):919–950, 2018.
- [5] K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf. Kernel measures of conditional dependence. Advances in Neural Information Processing Systems 20 - Proceedings of the 2007 Conference, pages 1–13, 2009.
- [6] A. Gretton, K. M. Borgwardt, M. Rasch, B. Schölkopf, and A. J. Smola. A kernel method for the two-sample-problem. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems, volume 19, pages 513–520. MIT Press, 2007.
- [7] A. Gretton, K. Fukumizu, Z. Harchaoui, and B. Sriperumbudur. A fast, consistent kernel two-sample test. Advances in Neural Information Processing Systems, pages 673–681, 2009.
- [8] J. Hoffmann-Jorgensen and G. Pisier. The law of large numbers and the central limit theorem in Banach spaces. The Annals of Probability, 4(4):587–599, 1976.
- [9] J. Kellner and A. Celisse. A one-sample test for normality with kernel methods. Bernoulli, 25(3):1816–1837, 2019.
- [10] H. Q. Minh, P. Niyogi, and Y. Yao. Mercer’s theorem, feature maps, and smoothing. In Proceedings of the 19th Annual Conference on Learning Theory, COLT’06, pages 154–168, Berlin, Heidelberg, 2006. Springer-Verlag.
- [11] D. Politis, D. Wolf, J. Romano, M. Wolf, P. Bickel, P. Diggle, and S. Fienberg. Subsampling. Springer Series in Statistics. Springer New York, 1999.
- [12] M. Reed and B. Simon. Functional Analysis. Methods of Modern Mathematical Physics. Elsevier Science, 1981.
- [13] M. P. Wand and M. C. Jones. Kernel Smoothing. Chapman & Hall, New York, 1994.