
Two-sample test based on maximum variance discrepancy

Natsumi Makigusa

Graduate School of Science and Engineering, Chiba University
Abstract

In this article, we introduce a novel discrepancy called the maximum variance discrepancy for the purpose of measuring the difference between two distributions in Hilbert spaces that cannot be found via the maximum mean discrepancy. We also propose a two-sample goodness of fit test based on this discrepancy. We obtain the asymptotic null distribution of this two-sample test, which provides an efficient approximation method for the null distribution of the test.

1 Introduction

For probability distributions $P$ and $Q$, the test of the null hypothesis $H_{0}:P=Q$ against the alternative hypothesis $H_{1}:P\neq Q$ based on data $X_{1},\dots,X_{n}\overset{i.i.d.}{\sim}P$ and $Y_{1},\dots,Y_{m}\overset{i.i.d.}{\sim}Q$ is known as a two-sample test. Such tests have applications in various areas. There is a huge body of literature on two-sample tests in Euclidean space, so we will not attempt a complete bibliography. In [6], a two-sample test based on the Maximum Mean Discrepancy (MMD) is proposed, where the MMD is defined by (1) in Section 2. The MMD for a reproducing kernel Hilbert space $H(k)$ associated with a positive definite kernel $k$ is given by (2) in Section 2.

In this paper, we propose a novel discrepancy between two distributions defined as

T=\left\|V_{X\sim P}[k(\cdot,X)]-V_{Y\sim Q}[k(\cdot,Y)]\right\|_{H(k)^{\otimes 2}},

and we call this the Maximum Variance Discrepancy (MVD), where $V_{X\sim P}[k(\cdot,X)]$ is a covariance operator in $H(k)$. The MVD is obtained by replacing the kernel mean embedding in (2) with a covariance operator; hence, it is natural to consider a two-sample test based on the MVD. Related work can be found in [4], where a test for the equality of covariance operators in Hilbert spaces was proposed.

Our aim in this research is to clarify the properties of the MVD test from two perspectives: an asymptotic investigation as $n,m\to\infty$, and its practical implementation. We first obtain the asymptotic distribution of a consistent estimator $\widehat{T}_{n,m}^{2}$ of $T^{2}$ under $H_{0}$. We also derive the asymptotic distribution of $\widehat{T}_{n,m}^{2}$ under the alternative hypothesis $H_{1}$. Furthermore, we consider a sequence of local alternative distributions $Q_{nm}=(1-1/\sqrt{n+m})P+(1/\sqrt{n+m})Q$ for $P\neq Q$ and derive the asymptotic distribution of $\widehat{T}^{2}_{n,m}$ under this sequence. For practical purposes, we develop a method to approximate the null distribution of $\widehat{T}^{2}_{n,m}$. The method is based on the eigenvalues of the centered Gram matrices associated with the data set; these eigenvalues will be shown to be estimators of the weights appearing in the asymptotic null distribution, so the eigenvalue-based method is expected to provide a fine approximation of the distribution of the test statistic. In practice, however, this approximation does not work well, so we further modify the eigenvalue-based method; the modified method provides a better approximation.

The rest of this paper is structured as follows. Section 2 introduces the framework of the two-sample test and defines the test statistic based on the MVD. In addition, the representation of the test statistic in terms of centered Gram matrices is described. Section 3.1 develops the asymptotics of the MVD test under $H_{0}$. The MVD test under $H_{1}$ is addressed in Section 3.2. Furthermore, the behavior of the MVD test under local alternatives is clarified in Section 3.3. Section 3.4 describes the estimation of the weights that appear in the asymptotic null distribution obtained in Section 3.1. Section 4 examines the implementation of the MVD test with a Gaussian kernel in the Hilbert space $\mathcal{H}=\mathbb{R}^{d}$. Section 4.1 introduces a modification of the approximate distribution given in Section 3.4. Section 4.2 reports the results of simulations of the type I error and the power of the MVD and MMD tests. Section 5 presents the results of applications to real data sets, including high-dimension, low-sample-size data. Conclusions are given in Section 6. All proofs of theoretical results are provided in Section 7.

2 Maximum Variance Discrepancy

Let $\mathcal{H}$ be a separable Hilbert space and $(\mathcal{H},\mathcal{A})$ a measurable space. Let $\left<\cdot,\cdot\right>_{\mathcal{H}}$ be the inner product of $\mathcal{H}$ and $\|\cdot\|_{\mathcal{H}}=\sqrt{\left<\cdot,\cdot\right>_{\mathcal{H}}}$ the associated norm. Let $X_{1},\dots,X_{n}\in\mathcal{H}$ and $Y_{1},\dots,Y_{m}\in\mathcal{H}$ denote samples of independent and identically distributed (i.i.d.) random variables drawn from unknown distributions $P$ and $Q$, respectively. Our goal is to test whether the unknown distributions $P$ and $Q$ are equal.

Let us define the null hypothesis $H_{0}:P=Q$ and the alternative hypothesis $H_{1}:P\neq Q$. Following [6], the gap between two distributions $P$ and $Q$ on $\mathcal{H}$ is measured by

\text{MMD}(P,Q)=\sup_{f\in\mathcal{F}}|\mathbb{E}_{X\sim P}[f(X)]-\mathbb{E}_{Y\sim Q}[f(Y)]|, (1)

where $\mathcal{F}$ is a class of real-valued functions on $\mathcal{H}$. Regardless of the choice of $\mathcal{F}$, $\text{MMD}(P,Q)$ always defines a pseudo-metric on the space of probability distributions. Let $\mathcal{F}$ be the unit ball of a reproducing kernel Hilbert space $H(k)$ associated with a characteristic kernel $k:\mathcal{H}\times\mathcal{H}\to\mathbb{R}$ (see [2] and [5] for details), and assume that $\mathbb{E}_{X\sim P}[\sqrt{k(X,X)}]<\infty$ and $\mathbb{E}_{Y\sim Q}[\sqrt{k(Y,Y)}]<\infty$. Then the MMD in $H(k)$ gives the distance between $P$ and $Q$ as follows:

\text{MMD}(P,Q)=\sup_{\left\|f\right\|_{H(k)}=1}|\left<f,\mu_{k}(P)-\mu_{k}(Q)\right>_{H(k)}|=\left\|\mu_{k}(P)-\mu_{k}(Q)\right\|_{H(k)}, (2)

where $\mu_{k}(P)=\mathbb{E}_{X\sim P}[k(\cdot,X)]$ and $\mu_{k}(Q)=\mathbb{E}_{Y\sim Q}[k(\cdot,Y)]$ are the kernel mean embeddings of $P$ and $Q$, respectively, in $H(k)$ (see [6]). The MMD captures the difference between $P$ and $Q$ through the difference between the means of $k(\cdot,X)$ and $k(\cdot,Y)$ in $H(k)$. The motivation for this research is to capture the difference between $P$ and $Q$ through the difference between the corresponding variances in $H(k)$, based on an idea similar to the MMD. Assume $\mathbb{E}_{X\sim P}[k(X,X)]<\infty$ and $\mathbb{E}_{Y\sim Q}[k(Y,Y)]<\infty$; then the variance $V_{X\sim P}[k(\cdot,X)]:H(k)\to H(k)$ is defined by

V_{X\sim P}[k(\cdot,X)]=\mathbb{E}_{X\sim P}[(k(\cdot,X)-\mu_{k}(P))^{\otimes 2}]\in H(k)^{\otimes 2}.

Here, for any $h,h^{\prime}\in H(k)$, the tensor product $h\otimes h^{\prime}$ is defined as the operator $H(k)\to H(k),\ x\mapsto\left<h^{\prime},x\right>_{H(k)}h$; $h^{\otimes 2}$ is defined as $h^{\otimes 2}=h\otimes h$; and $H(k)^{\otimes 2}=H(k)\otimes H(k)$ (see Section II.4 in [12] for details). Let $\Sigma_{k}(P)=V_{X\sim P}[k(\cdot,X)]$ and $\Sigma_{k}(Q)=V_{Y\sim Q}[k(\cdot,Y)]$. Then, we define the MVD in $H(k)$ as

\text{MVD}(P,Q)=\sup_{\left\|A\right\|_{H(k)^{\otimes 2}}=1}|\left<A,\Sigma_{k}(P)-\Sigma_{k}(Q)\right>_{H(k)^{\otimes 2}}|=\left\|\Sigma_{k}(P)-\Sigma_{k}(Q)\right\|_{H(k)^{\otimes 2}},

which can be seen as a discrepancy between the distributions $P$ and $Q$. The quantity $T^{2}=\text{MVD}(P,Q)^{2}$ can be estimated by

\widehat{T}^{2}_{n,m}=\left\|\Sigma_{k}(\widehat{P})-\Sigma_{k}(\widehat{Q})\right\|^{2}_{H(k)^{\otimes 2}}, (3)

where

\Sigma_{k}(\widehat{P})=\frac{1}{n}\sum_{i=1}^{n}(k(\cdot,X_{i})-\mu_{k}(\widehat{P}))^{\otimes 2},\qquad\mu_{k}(\widehat{P})=\frac{1}{n}\sum_{i=1}^{n}k(\cdot,X_{i})

and

\Sigma_{k}(\widehat{Q})=\frac{1}{m}\sum_{j=1}^{m}(k(\cdot,Y_{j})-\mu_{k}(\widehat{Q}))^{\otimes 2},\qquad\mu_{k}(\widehat{Q})=\frac{1}{m}\sum_{j=1}^{m}k(\cdot,Y_{j}).

Let the Gram matrices be $K_{X}=(k(X_{i},X_{s}))_{1\leq i,s\leq n}$, $K_{Y}=(k(Y_{j},Y_{t}))_{1\leq j,t\leq m}$, and $K_{XY}=(k(X_{i},Y_{j}))_{1\leq i\leq n,\,1\leq j\leq m}$; the centering matrix be $P_{n}=I_{n}-(1/n)\underline{1}_{n}\underline{1}_{n}^{T}$, where $I_{n}$ is the $n\times n$ identity matrix and $\underline{1}_{n}$ is the $n$-vector of ones; and the centered Gram matrices be $\widetilde{K}_{X}=P_{n}K_{X}P_{n}$, $\widetilde{K}_{Y}=P_{m}K_{Y}P_{m}$, and $\widetilde{K}_{XY}=P_{n}K_{XY}P_{m}$. The test statistic can then be expanded as

\widehat{T}^{2}_{n,m}=\frac{1}{n^{2}}\text{tr}(\widetilde{K}_{X}^{2})-\frac{2}{nm}\text{tr}(\widetilde{K}_{XY}\widetilde{K}_{XY}^{T})+\frac{1}{m^{2}}\text{tr}(\widetilde{K}_{Y}^{2}).
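As an illustration, the expansion above can be evaluated directly from data. The following sketch (not part of the paper; it assumes the Gaussian kernel (4) with an illustrative bandwidth $\sigma$, and the function names are ours) computes $\widehat{T}^{2}_{n,m}$ from the centered Gram matrices:

```python
import numpy as np

def gaussian_gram(A, B, sigma=0.1):
    # k(a, b) = exp(-sigma * ||a - b||^2), the Gaussian kernel (4)
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sigma * sq)

def mvd_stat(X, Y, sigma=0.1):
    # T^2_{n,m} = tr(K~_X^2)/n^2 - 2 tr(K~_XY K~_XY^T)/(nm) + tr(K~_Y^2)/m^2
    n, m = len(X), len(Y)
    Pn = np.eye(n) - np.ones((n, n)) / n   # centering matrix P_n
    Pm = np.eye(m) - np.ones((m, m)) / m
    KX = Pn @ gaussian_gram(X, X, sigma) @ Pn
    KY = Pm @ gaussian_gram(Y, Y, sigma) @ Pm
    KXY = Pn @ gaussian_gram(X, Y, sigma) @ Pm
    return (np.trace(KX @ KX) / n**2
            - 2.0 * np.trace(KXY @ KXY.T) / (n * m)
            + np.trace(KY @ KY) / m**2)
```

Since $\widehat{T}^{2}_{n,m}$ is a squared Hilbert-space norm, the value is nonnegative, and it vanishes when the two samples coincide.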

We investigate $\text{MMD}(P,Q)^{2}$ and $\text{MVD}(P,Q)^{2}$ when $\mathcal{H}=\mathbb{R}^{d}$ and the kernel $k(\cdot,\cdot)$ is the Gaussian kernel

k(\underline{t},\underline{s})=\exp(-\sigma\left\|\underline{t}-\underline{s}\right\|^{2}_{\mathbb{R}^{d}}),\qquad\sigma>0, (4)

with $P=N(\underline{0},I_{d})$ and $Q=N(\underline{m},\Sigma)$. Under this setting, straightforward calculations using the properties of the Gaussian density yield the following result for the MMD:

Proposition 1.

When $\mathcal{H}=\mathbb{R}^{d}$ and $k(\cdot,\cdot)$ is the Gaussian kernel in (4), we have

{\rm MMD}(N(\underline{0},I_{d}),N(\underline{m},\Sigma))^{2}
=(1+4\sigma)^{-d/2}+|I_{d}+4\sigma\Sigma|^{-1/2}-2|(1+2\sigma)I_{d}+2\sigma\Sigma|^{-1/2}\exp\left(-\sigma\underline{m}^{T}((1+2\sigma)I_{d}+2\sigma\Sigma)^{-1}\underline{m}\right).

The result for the MVD is obtained similarly, using the Gaussian density properties together with the result for the MMD:

Proposition 2.

When $\mathcal{H}=\mathbb{R}^{d}$ and $k(\cdot,\cdot)$ is the Gaussian kernel in (4), we have

{\rm MVD}(N(\underline{0},I_{d}),N(\underline{m},\Sigma))^{2}
=(1+8\sigma)^{-d/2}-2(1+8\sigma+12\sigma^{2})^{-d/2}+(1+4\sigma)^{-d}
+|I_{d}+8\sigma\Sigma|^{-1/2}-2|I_{d}+8\sigma\Sigma+12\sigma^{2}\Sigma^{2}|^{-1/2}+|I_{d}+4\sigma\Sigma|^{-1}
-2|(1+4\sigma)I_{d}+4\sigma\Sigma|^{-1/2}\exp\left(-2\sigma\underline{m}^{T}\left((1+4\sigma)I_{d}+4\sigma\Sigma\right)^{-1}\underline{m}\right)
+2|I_{d}+2\sigma\Sigma|^{-1/2}|(1+4\sigma)I_{d}+2\sigma\Sigma|^{-1/2}\exp\left(-2\sigma\underline{m}^{T}\left((1+4\sigma)I_{d}+2\sigma\Sigma\right)^{-1}\underline{m}\right)
+2(1+2\sigma)^{-d/2}|(1+2\sigma)I_{d}+4\sigma\Sigma|^{-1/2}\exp\left(-2\sigma\underline{m}^{T}\left((1+2\sigma)I_{d}+4\sigma\Sigma\right)^{-1}\underline{m}\right)
-2|(1+2\sigma)I_{d}+2\sigma\Sigma|^{-1}\exp\left(-2\sigma\underline{m}^{T}\left((1+2\sigma)I_{d}+2\sigma\Sigma\right)^{-1}\underline{m}\right).

In particular, $\text{MMD}(P,Q)$ and $\text{MVD}(P,Q)$ for $P=N(\underline{0},I_{d})$ and $Q=N(t\underline{1},sI_{d})$ follow from Propositions 1 and 2. The result is Corollary 1.

Corollary 1.

When $\mathcal{H}=\mathbb{R}^{d}$ and $k(\cdot,\cdot)$ is the Gaussian kernel in (4), we have

{\rm MMD}(N(\underline{0},I_{d}),N(t\underline{1},sI_{d}))^{2}
=(1+4\sigma)^{-d/2}+(1+4\sigma s)^{-d/2}-2(1+2\sigma+2\sigma s)^{-d/2}\exp\left(-\sigma t^{2}d(1+2\sigma+2\sigma s)^{-1}\right)

and

{\rm MVD}(N(\underline{0},I_{d}),N(t\underline{1},sI_{d}))^{2}
=(1+8\sigma)^{-d/2}-2(1+8\sigma+12\sigma^{2})^{-d/2}+(1+4\sigma)^{-d}
+(1+8\sigma s)^{-d/2}-2(1+8\sigma s+12\sigma^{2}s^{2})^{-d/2}+(1+4\sigma s)^{-d}
-2(1+4\sigma+4\sigma s)^{-d/2}\exp\left(-2\sigma t^{2}d(1+4\sigma+4\sigma s)^{-1}\right)
+2(1+2\sigma s)^{-d/2}(1+4\sigma+2\sigma s)^{-d/2}\exp\left(-2\sigma t^{2}d(1+4\sigma+2\sigma s)^{-1}\right)
+2(1+2\sigma)^{-d/2}(1+2\sigma+4\sigma s)^{-d/2}\exp\left(-2\sigma t^{2}d(1+2\sigma+4\sigma s)^{-1}\right)
-2(1+2\sigma+2\sigma s)^{-d}\exp\left(-2\sigma t^{2}d(1+2\sigma+2\sigma s)^{-1}\right).
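For reference, the closed forms in Corollary 1 can be evaluated numerically. The sketch below (illustrative only; the defaults $d=10$ and $\sigma=0.1$ are chosen to match the figures, and the function names are ours) transcribes the two formulas. At $(t,s)=(0,1)$ both must vanish, since then $P=Q$:

```python
import numpy as np

def mmd2_closed(t, s, d=10, sigma=0.1):
    # MMD^2 between N(0, I_d) and N(t*1, s*I_d), Corollary 1
    return ((1 + 4*sigma) ** (-d/2) + (1 + 4*sigma*s) ** (-d/2)
            - 2 * (1 + 2*sigma + 2*sigma*s) ** (-d/2)
            * np.exp(-sigma * t**2 * d / (1 + 2*sigma + 2*sigma*s)))

def mvd2_closed(t, s, d=10, sigma=0.1):
    # MVD^2 between N(0, I_d) and N(t*1, s*I_d), Corollary 1
    e = lambda c: np.exp(-2 * sigma * t**2 * d / c)   # shared exponential factor
    return ((1 + 8*sigma) ** (-d/2) - 2 * (1 + 8*sigma + 12*sigma**2) ** (-d/2)
            + (1 + 4*sigma) ** (-d)
            + (1 + 8*sigma*s) ** (-d/2)
            - 2 * (1 + 8*sigma*s + 12*sigma**2*s**2) ** (-d/2)
            + (1 + 4*sigma*s) ** (-d)
            - 2 * (1 + 4*sigma + 4*sigma*s) ** (-d/2) * e(1 + 4*sigma + 4*sigma*s)
            + 2 * (1 + 2*sigma*s) ** (-d/2)
            * (1 + 4*sigma + 2*sigma*s) ** (-d/2) * e(1 + 4*sigma + 2*sigma*s)
            + 2 * (1 + 2*sigma) ** (-d/2)
            * (1 + 2*sigma + 4*sigma*s) ** (-d/2) * e(1 + 2*sigma + 4*sigma*s)
            - 2 * (1 + 2*sigma + 2*sigma*s) ** (-d) * e(1 + 2*sigma + 2*sigma*s))
```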

We use Corollary 1 to investigate the behavior of ${\rm MMD}(N(\underline{0},I_{d}),N(t\underline{1},sI_{d}))^{2}$ and ${\rm MVD}(N(\underline{0},I_{d}),N(t\underline{1},sI_{d}))^{2}$ as $(t,s)$ moves away from $(0,1)$. A discrepancy that reacts sensitively to deviations of $(t,s)$ from $(0,1)$ reacts sensitively to departures of the distribution from $N(\underline{0},I_{d})$; a test built on such a discrepancy is expected to correctly reject $H_{0}$ under $H_{1}$.

More generally, the kernel $k^{\prime}(x,y)=\exp(C)k(x,y)$ obtained from a constant $C$ and a positive definite kernel $k(x,y)$ is also positive definite. The discrepancies $\text{MMD}_{k^{\prime}}(P,Q)$ and $\text{MVD}_{k^{\prime}}(P,Q)$ computed with the kernel $k^{\prime}$ satisfy $\text{MMD}_{k^{\prime}}(P,Q)^{2}=\exp(C)\,\text{MMD}_{k}(P,Q)^{2}$ and $\text{MVD}_{k^{\prime}}(P,Q)^{2}=\exp(2C)\,\text{MVD}_{k}(P,Q)^{2}$ for any distributions $P$ and $Q$, where $\text{MMD}_{k}(P,Q)$ and $\text{MVD}_{k}(P,Q)$ are computed with the kernel $k$.
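This scaling is inherited exactly by the Gram-matrix estimators, since every entry of a Gram matrix of $k^{\prime}$ is $\exp(C)$ times the corresponding entry for $k$. A small numerical check (a sketch with illustrative data and bandwidth; the estimator formulas follow Section 2, and the biased V-statistic form of the MMD estimator is assumed):

```python
import numpy as np

def gram(A, B, sigma=0.1, C=0.0):
    # k'(a, b) = exp(C) * exp(-sigma * ||a - b||^2)
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(C) * np.exp(-sigma * sq)

def mmd2_hat(X, Y, C=0.0):
    # biased MMD^2 estimator: mean(K_X) - 2 mean(K_XY) + mean(K_Y)
    return (gram(X, X, C=C).mean() - 2 * gram(X, Y, C=C).mean()
            + gram(Y, Y, C=C).mean())

def mvd2_hat(X, Y, C=0.0):
    # T^2_{n,m} from the centered Gram matrices
    n, m = len(X), len(Y)
    Pn = np.eye(n) - np.ones((n, n)) / n
    Pm = np.eye(m) - np.ones((m, m)) / m
    KX = Pn @ gram(X, X, C=C) @ Pn
    KY = Pm @ gram(Y, Y, C=C) @ Pm
    KXY = Pn @ gram(X, Y, C=C) @ Pm
    return (np.trace(KX @ KX) / n**2 - 2 * np.trace(KXY @ KXY.T) / (n * m)
            + np.trace(KY @ KY) / m**2)

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 4))
Y = rng.normal(size=(12, 4)) + 0.5
# MMD^2 scales by exp(C); MVD^2 scales by exp(2C)
assert np.isclose(mmd2_hat(X, Y, 4.0), np.exp(4.0) * mmd2_hat(X, Y, 0.0))
assert np.isclose(mvd2_hat(X, Y, 4.0), np.exp(8.0) * mvd2_hat(X, Y, 0.0))
```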

Figure 1 plots $\text{MMD}_{k^{\prime}}(P,Q)^{2}$ and $\text{MVD}_{k^{\prime}}(P,Q)^{2}$ against $t$ with $s=1$, and Figure 2 plots them against $s$ with $t=0$. The kernel $k$ is the Gaussian kernel in (4), and the parameters are $C=0,4,10$, $d=10$, and $\sigma=0.1$ in both figures. Figure 1 shows $\text{MMD}_{k^{\prime}}(P,Q)^{2}$ and $\text{MVD}_{k^{\prime}}(P,Q)^{2}$ as the mean departs from that of the standard normal distribution. In Figure 1 (a), $\text{MMD}_{k^{\prime}}(P,Q)^{2}$ is larger than $\text{MVD}_{k^{\prime}}(P,Q)^{2}$, but in Figures 1 (b) and (c), where the value of $C$ is increased, $\text{MVD}_{k^{\prime}}(P,Q)^{2}$ is larger than $\text{MMD}_{k^{\prime}}(P,Q)^{2}$ for each $t$. Similarly, Figure 2 shows the reaction of the MMD and MVD to departures of the covariance matrix from that of the standard normal distribution: $\text{MVD}_{k^{\prime}}(P,Q)^{2}$ is larger than $\text{MMD}_{k^{\prime}}(P,Q)^{2}$ for each $s$ when $C$ is large. This means that the MVD is more sensitive than the MMD to departures from the standard normal distribution when $k^{\prime}$ has a large $C$. The existence of a kernel $k^{\prime}$ for which the MVD exceeds the MMD motivates the two-sample test based on the MVD in the next section.

Figure 1: $\text{MMD}_{k^{\prime}}(N(\underline{0},I_{d}),N(t\underline{1},sI_{d}))^{2}$ (solid) and $\text{MVD}_{k^{\prime}}(N(\underline{0},I_{d}),N(t\underline{1},sI_{d}))^{2}$ (dashed) for each $t$: $s=1$, $d=10$, $\sigma=0.1$, and (a) $C=0$, (b) $C=4$, and (c) $C=10$.
Figure 2: $\text{MMD}_{k^{\prime}}(N(\underline{0},I_{d}),N(t\underline{1},sI_{d}))^{2}$ (solid) and $\text{MVD}_{k^{\prime}}(N(\underline{0},I_{d}),N(t\underline{1},sI_{d}))^{2}$ (dashed) for each $s$: $t=0$, $d=10$, $\sigma=0.1$, and (a) $C=0$, (b) $C=4$, and (c) $C=10$.

3 Test statistic for two-sample problem

We consider a two-sample test of $H_{0}:P=Q$ against $H_{1}:P\neq Q$ based on $T^{2}$, with $\widehat{T}^{2}_{n,m}$ as the test statistic. If $\widehat{T}^{2}_{n,m}$ is large, the null hypothesis $H_{0}$ is rejected, since $T^{2}$ measures the difference between $P$ and $Q$. The condition used to derive the asymptotic distribution of this test statistic is as follows:

Condition.

$\mathbb{E}_{X\sim P}[k(X,X)^{2}]<\infty$, $\mathbb{E}_{Y\sim Q}[k(Y,Y)^{2}]<\infty$, and $\lim_{n,m\to\infty}n/(n+m)=\rho$ with $0<\rho<1$.

3.1 Asymptotic null distribution

In this section, we derive the asymptotic distribution of $\widehat{T}^{2}_{n,m}$ under $H_{0}$. In what follows, the symbol "$\xrightarrow{\mathcal{D}}$" designates convergence in distribution.

Theorem 1.

Suppose that the Condition is satisfied. Then, under $H_{0}:P=Q$, as $n,m\to\infty$,

(n+m)\widehat{T}^{2}_{n,m}\xrightarrow{\mathcal{D}}\frac{1}{\rho(1-\rho)}\sum_{\ell=1}^{\infty}\lambda_{\ell}Z_{\ell}^{2},

where $Z_{\ell}\overset{i.i.d.}{\sim}N(0,1)$ and the $\lambda_{\ell}$ are the eigenvalues of $V_{X\sim P}[(k(\cdot,X)-\mu_{k}(P))^{\otimes 2}]$.

It is generally not easy to use such an asymptotic null distribution directly, because it is an infinite sum and determining its weights is itself a difficult problem. For practical purposes, a method to approximate the null distribution of $\widehat{T}_{n,m}^{2}$ is developed in Section 4, based on the eigenvalues of the centered Gram matrices introduced in Section 3.4.

3.2 Asymptotic non-null distribution

In this section, the asymptotic distribution of $\widehat{T}^{2}_{n,m}$ under $H_{1}$ is investigated.

Theorem 2.

Suppose that the Condition is satisfied. Then, under $H_{1}:P\neq Q$, as $n,m\to\infty$,

\sqrt{n+m}(\widehat{T}^{2}_{n,m}-T^{2})\xrightarrow{\mathcal{D}}N(0,4\rho^{-1}v^{2}_{P}+4(1-\rho)^{-1}v^{2}_{Q}),

where

v_{P}^{2}=V_{X\sim P}\left[\left<\Sigma_{k}(P)-\Sigma_{k}(Q),(k(\cdot,X)-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right>_{H(k)^{\otimes 2}}\right]

and

v_{Q}^{2}=V_{Y\sim Q}\left[\left<\Sigma_{k}(P)-\Sigma_{k}(Q),(k(\cdot,Y)-\mu_{k}(Q))^{\otimes 2}-\Sigma_{k}(Q)\right>_{H(k)^{\otimes 2}}\right].
Remark 1.

We see by Theorem 2 that

\frac{\sqrt{n+m}(\widehat{T}^{2}_{n,m}-T^{2})}{v}\xrightarrow{\mathcal{D}}N(0,1),

where $v=\sqrt{4\rho^{-1}v^{2}_{P}+4(1-\rho)^{-1}v^{2}_{Q}}$. Thus, we can evaluate the power of the test based on $(n+m)\widehat{T}^{2}_{n,m}$ as

\Pr((n+m)\widehat{T}^{2}_{n,m}\geq t_{\alpha}\,|\,H_{1})
=\Pr((n+m)(\widehat{T}^{2}_{n,m}-T^{2})\geq t_{\alpha}-(n+m)T^{2}\,|\,H_{1})
=\Pr\left(\frac{\sqrt{n+m}(\widehat{T}^{2}_{n,m}-T^{2})}{v}\geq\frac{t_{\alpha}}{\sqrt{n+m}\,v}-\frac{\sqrt{n+m}\,T^{2}}{v}\,\Bigg{|}\,H_{1}\right)
\approx 1-\Phi\left(\frac{t_{\alpha}}{\sqrt{n+m}\,v}-\frac{\sqrt{n+m}\,T^{2}}{v}\right)\to 1

as $n,m\to\infty$, where $t_{\alpha}$ is the $(1-\alpha)$-quantile of the distribution of $(n+m)\widehat{T}^{2}_{n,m}$ under $H_{0}$, and $\Phi$ is the distribution function of the standard normal distribution. Therefore, this test is consistent.
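As a sketch, the normal approximation to the power in Remark 1 can be evaluated for hypothetical values of $t_{\alpha}$, $T^{2}$, and $v$ (all numbers below are illustrative placeholders, not values from the paper):

```python
from math import sqrt
from statistics import NormalDist

def approx_power(t_alpha, T2, v, n, m):
    # 1 - Phi( t_alpha / (sqrt(n+m) v) - sqrt(n+m) T^2 / v ), as in Remark 1
    z = t_alpha / (sqrt(n + m) * v) - sqrt(n + m) * T2 / v
    return 1.0 - NormalDist().cdf(z)
```

For fixed $T^{2}>0$ the argument of $\Phi$ tends to $-\infty$, so the approximate power increases to 1 as the sample sizes grow, which is the consistency statement above.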

3.3 Asymptotic distribution under contiguous alternatives

In this section, we develop the asymptotic distribution of $\widehat{T}^{2}_{n,m}$ under a sequence of local alternative distributions $Q_{nm}=(1-1/\sqrt{n+m})P+(1/\sqrt{n+m})Q$ for $P\neq Q$.

Theorem 3.

Let $X_{1},\dots,X_{n}\overset{i.i.d.}{\sim}P$ and $Y_{1},\dots,Y_{m}\overset{i.i.d.}{\sim}Q_{nm}$. Suppose that the Condition is satisfied. Let $h:\mathcal{H}\times\mathcal{H}\to\mathbb{R}$ be the kernel defined as

h(x,y)=\left<(k(\cdot,x)-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P),(k(\cdot,y)-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right>_{H(k)^{\otimes 2}},\qquad x,y\in\mathcal{H}, (5)

and

h(\cdot,x)=(k(\cdot,x)-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\in H(k)^{\otimes 2}

and let $S_{k}:L_{2}(\mathcal{H},P)\to L_{2}(\mathcal{H},P)$ be the self-adjoint operator defined as

S_{k}g(x)=\int_{\mathcal{H}}h(x,y)g(y)\,dP(y),\qquad g\in L_{2}(\mathcal{H},P) (6)

(see Sections VI.1, VI.3, and VI.6 in [12] for details). Then, as $n,m\to\infty$,

(n+m)\left\|\Sigma_{k}(\widehat{P})-\Sigma_{k}(\widehat{Q}_{nm})\right\|^{2}_{H(k)^{\otimes 2}}\xrightarrow{\mathcal{D}}\frac{1}{\rho(1-\rho)}\sum_{\ell=1}^{\infty}\theta_{\ell}W_{\ell}^{2},

where $W_{\ell}\sim N(\sqrt{\rho(1-\rho)}\,\zeta_{\ell}(P,Q)/\theta_{\ell},1)$ with $W_{\ell}\mathop{\perp\!\!\!\perp}W_{\ell^{\prime}}$ $(\ell\neq\ell^{\prime})$,

\zeta_{\ell}(P,Q)=\int_{\mathcal{H}}\left<\Sigma_{k}(Q)-\Sigma_{k}(P)+(\mu_{k}(Q)-\mu_{k}(P))^{\otimes 2},h(\cdot,y)\right>_{H(k)^{\otimes 2}}\Psi_{\ell}(y)\,dP(y),

and $\theta_{\ell}$ and $\Psi_{\ell}$ are, respectively, an eigenvalue of $S_{k}$ and the eigenfunction corresponding to $\theta_{\ell}$.

The following proposition states that the eigenvalues $\theta_{\ell}$ appearing in Theorem 3 coincide with the eigenvalues $\lambda_{\ell}$ appearing in Theorem 1:

Proposition 3.

The eigenvalues of $V_{X\sim P}[(k(\cdot,X)-\mu_{k}(P))^{\otimes 2}]$ in Theorem 1 and of $S_{k}$ in (6) of Theorem 3 are the same.

From Theorems 1 and 3 and Proposition 3, the local power of the test based on $(n+m)\widehat{T}^{2}_{n,m}$ is driven by the noncentrality parameters. Since $\mathbb{E}_{Y\sim Q}[h(\cdot,Y)]=\Sigma_{k}(Q)-\Sigma_{k}(P)+(\mu_{k}(Q)-\mu_{k}(P))^{\otimes 2}$, it follows that

\zeta_{\ell}(P,Q)=\int_{\mathcal{H}}\left<\mathbb{E}_{Y\sim Q}[h(\cdot,Y)],h(\cdot,y)\right>_{H(k)^{\otimes 2}}\Psi_{\ell}(y)\,dP(y)=\lambda_{\ell}\mathbb{E}_{Y\sim Q}[\Psi_{\ell}(Y)],

by which we obtain

\mathbb{E}\left[\frac{1}{\rho(1-\rho)}\sum_{\ell=1}^{\infty}\theta_{\ell}W_{\ell}^{2}\right]
=\frac{1}{\rho(1-\rho)}\sum_{\ell=1}^{\infty}\lambda_{\ell}\left(1+\rho(1-\rho)\cdot\frac{\zeta_{\ell}(P,Q)^{2}}{\lambda_{\ell}^{2}}\right)
=\frac{1}{\rho(1-\rho)}\sum_{\ell=1}^{\infty}\lambda_{\ell}\Big(1+\rho(1-\rho)\{\mathbb{E}_{Y\sim Q}[\Psi_{\ell}(Y)]\}^{2}\Big).

In addition, from Theorem 1 in [10], we have

\sum_{\ell=1}^{\infty}\lambda_{\ell}\{\mathbb{E}_{Y\sim Q}[\Psi_{\ell}(Y)]\}^{2}
=\sum_{\ell=1}^{\infty}\lambda_{\ell}\mathbb{E}_{Y\sim Q}[\Psi_{\ell}(Y)]\mathbb{E}_{Y^{\prime}\sim Q}[\Psi_{\ell}(Y^{\prime})]
=\mathbb{E}_{Y,Y^{\prime}\sim Q}[h(Y,Y^{\prime})]
=\left\|\Sigma_{k}(Q)-\Sigma_{k}(P)+(\mu_{k}(Q)-\mu_{k}(P))^{\otimes 2}\right\|_{H(k)^{\otimes 2}}^{2}.

Hence, the local power reflects not only the difference between $\Sigma_{k}(P)$ and $\Sigma_{k}(Q)$ but also that between $\mu_{k}(P)$ and $\mu_{k}(Q)$.

3.4 Null distribution estimates using Gram matrix spectrum

The asymptotic null distribution was obtained in Theorem 1, but its weights are difficult to derive. The following theorem shows that these weights can be estimated using an estimator of $V[(k(\cdot,X)-\mu_{k}(P))^{\otimes 2}]$.

Theorem 4.

Assume that $\mathbb{E}_{X\sim P}[k(X,X)^{2}]<\infty$. Let $\{\lambda_{\ell}\}_{\ell=1}^{\infty}$ and $\{\widehat{\lambda}^{(n)}_{\ell}\}_{\ell=1}^{\infty}$ be the eigenvalues of $\Upsilon$ and $\widehat{\Upsilon}^{(n)}$, respectively, where

\Upsilon=V\left[(k(\cdot,X)-\mu_{k}(P))^{\otimes 2}\right]\qquad\text{and}\qquad\widehat{\Upsilon}^{(n)}=\frac{1}{n}\sum_{i=1}^{n}\left\{(k(\cdot,X_{i})-\mu_{k}(\widehat{P}))^{\otimes 2}-\Sigma_{k}(\widehat{P})\right\}^{\otimes 2}.

Then, as $n\to\infty$,

\sum_{\ell=1}^{\infty}\widehat{\lambda}_{\ell}^{(n)}Z_{\ell}^{2}\xrightarrow{\mathcal{D}}\sum_{\ell=1}^{\infty}\lambda_{\ell}Z_{\ell}^{2},

where $Z_{\ell}\overset{i.i.d.}{\sim}N(0,1)$.

In addition, Proposition 4 states that the eigenvalues of $\widehat{\Upsilon}^{(n)}$ coincide with those of a Gram matrix computed from the data.

Proposition 4.

The $n\times n$ Gram matrix $H=(H_{ij})_{1\leq i,j\leq n}$ is defined by

H_{ij}=\left<(k(\cdot,X_{i})-\mu_{k}(\widehat{P}))^{\otimes 2}-\Sigma_{k}(\widehat{P}),(k(\cdot,X_{j})-\mu_{k}(\widehat{P}))^{\otimes 2}-\Sigma_{k}(\widehat{P})\right>_{H(k)^{\otimes 2}}.

Then, the eigenvalues of $\widehat{\Upsilon}^{(n)}$ and $H/n$ are the same.

Remark 2.

By Proposition 4, the critical value can be obtained by simulating $1/\{\rho(1-\rho)\}\sum_{\ell=1}^{n-1}\widehat{\lambda}_{\ell}^{(n)}Z_{\ell}^{2}$ using the eigenvalues $\widehat{\lambda}^{(n)}_{\ell}$, $\ell\in\{1,\dots,n-1\}$, of $H/n$. In addition, the matrix $H$ can be expressed as

H=P_{n}(\widetilde{K}_{X}\odot\widetilde{K}_{X})P_{n}, (7)

where $\odot$ denotes the Hadamard product. The $n\times n$ Gram matrix $K_{X}$ is positive definite, but $H$ has the eigenvalue $0$ since $H\underline{1}_{n}=\underline{0}$.
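Combining Theorem 4, Proposition 4, and (7), the critical value can be simulated from data. The following sketch (Gaussian kernel with an illustrative bandwidth; the function name and defaults are ours, not the paper's) forms $H$ via (7), extracts the eigenvalues of $H/n$, and draws from the weighted chi-square sum:

```python
import numpy as np

def null_critical_value(X, sigma=0.1, rho=0.5, alpha=0.05, reps=10000, seed=0):
    # Simulate 1/{rho(1-rho)} * sum_l lambda_l^ Z_l^2 with weights estimated
    # by the eigenvalues of H/n, where H = P_n (K~_X o K~_X) P_n, Eq. (7).
    n = len(X)
    Pn = np.eye(n) - np.ones((n, n)) / n
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    K = Pn @ np.exp(-sigma * sq) @ Pn          # centered Gram matrix K~_X
    H = Pn @ (K * K) @ Pn                      # Hadamard product, Eq. (7)
    lam = np.linalg.eigvalsh(H / n)
    lam = lam[lam > 1e-12]                     # H has eigenvalue 0 (H 1 = 0)
    rng = np.random.default_rng(seed)
    Z2 = rng.chisquare(1, size=(reps, len(lam)))
    draws = (Z2 * lam).sum(axis=1) / (rho * (1 - rho))
    return np.quantile(draws, 1 - alpha)       # (1-alpha)-quantile as t_alpha
```

As Section 4.1 discusses, this eigenvalue-based approximation is later rescaled, since its variance overstates that of the simulated exact null distribution.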

4 Implementation

This section proposes corrections to the approximate null distributions of both the MVD and MMD tests and reports simulations of the type I error and power for the modified approximations. The MMD test is a two-sample test of $H_{0}$ against $H_{1}$ using the test statistic

\widehat{\Delta}_{n,m}^{2}=\left\|\frac{1}{n}\sum_{i=1}^{n}k(\cdot,X_{i})-\frac{1}{m}\sum_{j=1}^{m}k(\cdot,Y_{j})\right\|_{H(k)}^{2}.

The asymptotic null distribution of $\widehat{\Delta}^{2}_{n,m}$ is an infinite sum of weighted chi-square random variables, as is that of $\widehat{T}^{2}_{n,m}$ in (3). The approximate distribution can be obtained by estimating the eigenvalues from the centered Gram matrix (see [7] for details).

4.1 Approximation of the null distribution

In this section, we discuss methods to approximate the null distributions of the MVD and MMD tests. The asymptotic null distribution of the MVD test was obtained in Theorem 1 as an infinite sum of weighted chi-squared random variables with one degree of freedom, and by Theorem 4 those weights $\lambda_{\ell}~(\ell\geq 1)$ can be estimated by the eigenvalues $\widehat{\lambda}_{\ell}^{(n)}$ of the matrix $H/n$. A similar result was obtained for the MMD by [7]. However, the approximate distribution based on these estimated eigenvalues does not work well in practice. Indeed, comparing the simulated exact null distribution with the approximate distribution based on estimated eigenvalues shows that the variance of the approximate distribution is larger than that of the simulated exact null distribution. We therefore modify the approximate distribution $1/\{\rho(1-\rho)\}\sum_{\ell=1}^{n-1}\widehat{\lambda}_{\ell}^{(n)}Z_{\ell}^{2}$ by matching the variance of the simulated exact null distribution. The variance of the exact null distribution, $V[(n+m)\widehat{T}^{2}_{n,m}]$, is given by the following proposition:

Proposition 5.

Assume that $\mathbb{E}_{X\sim P}[k(X,X)^{2}]<\infty$. Then, under $H_{0}$,

V[(n+m)\widehat{T}^{2}_{n,m}]=\frac{2(n+m)^{4}}{n^{2}m^{2}}\left\|\Upsilon\right\|^{2}_{H(k)^{\otimes 4}}+O\left(\frac{1}{n}\right)+O\left(\frac{1}{m}\right).

Proposition 5 leads to

V[(n+m)\widehat{T}^{2}_{n,m}]\approx\frac{(n+m)^{4}}{n^{2}m^{2}}\cdot\frac{k^{2}\ell^{2}}{(k+\ell)^{4}}\,V[(k+\ell)\widehat{T}^{2}_{k,\ell}], (8)

relating $V[(n+m)\widehat{T}^{2}_{n,m}]$ and $V[(k+\ell)\widehat{T}^{2}_{k,\ell}]$ for $k,\ell\in\mathbb{N}$. If we can estimate the variance $V[(k+\ell)\widehat{T}_{k,\ell}^{2}]$ for some $k$ and $\ell$ smaller than $n$ and $m$, respectively, then $V[(n+m)\widehat{T}^{2}_{n,m}]$ can be estimated using (8). The method of estimating $V[(k+\ell)\widehat{T}_{k,\ell}^{2}]$ by drawing $(k,\ell)$ observations from the $(n,m)$ observations without replacement is known as subsampling.

The following proposition for the MMD shows a result similar to that for the MVD:

Proposition 6.

Assume that $\mathbb{E}_{X\sim P}[k(X,X)]<\infty$. Then, under $H_{0}$,

V[(n+m)\widehat{\Delta}^{2}_{n,m}]=\frac{2(n+m)^{4}}{n^{2}m^{2}}\left\|\Sigma_{k}(P)\right\|^{2}_{H(k)^{\otimes 2}}+O\left(\frac{1}{n}\right)+O\left(\frac{1}{m}\right).

4.1.1 Subsampling method

We use the subsampling method to estimate $V[(k+\ell)\widehat{T}^{2}_{k,\ell}]$ and $V[(k+\ell)\widehat{\Delta}^{2}_{k,\ell}]$ (see Section 2.2 in [11] for details). In order to obtain two samples under the null hypothesis, we divide $X_{1},\dots,X_{n}$ into $X_{1},\dots,X_{n_{1}}$ and $X_{n_{1}+1},\dots,X_{n}$. Then, in each iteration $i\in\{1,\dots,I\}$, we randomly select $X^{*}_{1}(i),\dots,X^{*}_{k}(i)$ and $Y^{*}_{1}(i),\dots,Y^{*}_{\ell}(i)$ from $X_{1},\dots,X_{n_{1}}$ and $X_{n_{1}+1},\dots,X_{n}$, respectively, without replacement. These randomly selected values generate the replications of the test statistic

\widehat{T}_{k,\ell}^{2}(i)=\widehat{T}_{k,\ell}^{2}(X^{*}_{1}(i),\dots,X^{*}_{k}(i);Y^{*}_{1}(i),\dots,Y^{*}_{\ell}(i))

for each iteration $i\in\{1,\dots,I\}$. The $I$ generated test statistics $(k+\ell)\widehat{T}_{k,\ell}^{2}(i)$ are used to estimate $V[(k+\ell)\widehat{T}^{2}_{k,\ell}]$ via the unbiased sample variance

V[(k+\ell)\widehat{T}^{2}_{k,\ell}]_{\text{sub}}=\frac{1}{I-1}\sum_{i=1}^{I}\left\{(k+\ell)\widehat{T}^{2}_{k,\ell}(i)-(k+\ell)\overline{\widehat{T}^{2}}_{k,\ell}\right\}^{2},

where $\overline{\widehat{T}^{2}}_{k,\ell}=(1/I)\sum_{i=1}^{I}\widehat{T}^{2}_{k,\ell}(i)$. According to (8), $V[(n+m)\widehat{T}^{2}_{n,m}]$ is estimated by

V[(n+m)\widehat{T}^{2}_{n,m}]_{\text{sub}}=\frac{(n+m)^{4}}{n^{2}m^{2}}\cdot\frac{k^{2}\ell^{2}}{(k+\ell)^{4}}\,V[(k+\ell)\widehat{T}^{2}_{k,\ell}]_{\text{sub}}. (9)
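The subsampling procedure above can be sketched as follows (a sketch only: Gaussian kernel with an illustrative bandwidth, and the sample sizes, $(k,\ell)$, and $I$ below are placeholders rather than the values used in the paper's tables):

```python
import numpy as np

def mvd_stat(X, Y, sigma=0.1):
    # T^2_{k,l} from centered Gram matrices (Gaussian kernel (4))
    n, m = len(X), len(Y)
    Pn = np.eye(n) - np.ones((n, n)) / n
    Pm = np.eye(m) - np.ones((m, m)) / m
    g = lambda A, B: np.exp(-sigma * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))
    KX, KY, KXY = Pn @ g(X, X) @ Pn, Pm @ g(Y, Y) @ Pm, Pn @ g(X, Y) @ Pm
    return (np.trace(KX @ KX) / n**2 - 2 * np.trace(KXY @ KXY.T) / (n * m)
            + np.trace(KY @ KY) / m**2)

def subsampling_variance(X, m, k, l, I=200, sigma=0.1, seed=0):
    # Split the X sample in half, repeatedly draw (k, l) points without
    # replacement from the two halves, and rescale the subsample variance
    # of (k+l) T^2_{k,l} via (8) to estimate V[(n+m) T^2_{n,m}], Eq. (9).
    n = len(X)
    n1 = n // 2
    rng = np.random.default_rng(seed)
    stats = []
    for _ in range(I):
        idx = rng.choice(n1, size=k, replace=False)
        jdx = n1 + rng.choice(n - n1, size=l, replace=False)
        stats.append((k + l) * mvd_stat(X[idx], X[jdx], sigma))
    v_sub = np.var(stats, ddof=1)              # V[(k+l) T^2_{k,l}]_sub
    return (n + m) ** 4 / (n**2 * m**2) * (k**2 * l**2) / (k + l) ** 4 * v_sub
```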

We also estimate $V[(n+m)\widehat{\Delta}^{2}_{n,m}]$ by using

V[(n+m)Δ^n,m2]sub=(n+m)4n2m2k22(k+)4V[(k+)Δ^k,2]subV[(n+m)\widehat{\Delta}^{2}_{n,m}]_{\text{sub}}=\frac{(n+m)^{4}}{n^{2}m^{2}}\frac{k^{2}\ell^{2}}{(k+\ell)^{4}}V[(k+\ell)\widehat{\Delta}^{2}_{k,\ell}]_{\text{sub}} (10)

from Proposition 6, where

Δ^k,2(i)=Δ^k,2(X1(i),,Xk(i);Y1(i),,Y(i))\widehat{\Delta}_{k,\ell}^{2}(i)=\widehat{\Delta}_{k,\ell}^{2}(X^{*}_{1}(i),\dots,X^{*}_{k}(i);Y^{*}_{1}(i),\dots,Y^{*}_{\ell}(i))

for i{1,,I}i\in\{1,\dots,I\},

V[(k+)Δ^k,2]sub=1I1j=1I{(k+)Δ^2k,(j)(k+)Δ^2¯k,}2V[(k+\ell)\widehat{\Delta}^{2}_{k,\ell}]_{\text{sub}}=\frac{1}{I-1}\sum_{j=1}^{I}\left\{(k+\ell){\widehat{\Delta}^{2}}_{k,\ell}(j)-(k+\ell)\overline{\widehat{\Delta}^{2}}_{k,\ell}\right\}^{2}

and Δ^2¯k,=(1/I)i=1IΔ^2k,(i)\overline{\widehat{\Delta}^{2}}_{k,\ell}=(1/I)\sum_{i=1}^{I}{\widehat{\Delta}^{2}}_{k,\ell}(i).

The columns labeled $(n+m)\widehat{T}^{2}_{n,m}$ in Table 1 and $(n+m)\widehat{\Delta}^{2}_{n,m}$ in Table 2 report the variances of $(n+m)\widehat{T}^{2}_{n,m}$ and $(n+m)\widehat{\Delta}^{2}_{n,m}$, estimated by a simulation of 10,000 iterations with $X_{1},\dots,X_{n}\overset{i.i.d.}{\sim}N(\underline{0},I_{d})$ and $Y_{1},\dots,Y_{m}\overset{i.i.d.}{\sim}N(\underline{0},I_{d})$ for each $\sigma$, $d$, and $(n,m)$. The subsampling variances $V[(n+m)\widehat{T}^{2}_{n,m}]_{\text{sub}}$ in (9) and $V[(n+m)\widehat{\Delta}^{2}_{n,m}]_{\text{sub}}$ in (10) with $X_{1},\dots,X_{n}\overset{i.i.d.}{\sim}N(\underline{0},I_{d})$ are given in the columns labeled “Subsampling” for each $(k,\ell)$. Tables 1 and 2 show that the subsampling variances $V[(n+m)\widehat{T}^{2}_{n,m}]_{\text{sub}}$ and $V[(n+m)\widehat{\Delta}^{2}_{n,m}]_{\text{sub}}$ track the exact variances well, although they tend to underestimate them.

We investigate how much smaller the subsampling variance is than the simulated variance of $(n+m)\widehat{T}^{2}_{n,m}$ by regressing the latter on the former with the intercept fixed at 0; the same is done for the MMD. The results are shown in Figure 3: (a) and (b) show results for the MVD and (c) and (d) for the MMD, for the cases $(n,m)=(200,200)$ with $(k,\ell)=(50,50)$ and $(n,m)=(500,500)$ with $(k,\ell)=(125,125)$, respectively. In Figure 3, the $x$ axis is $V[(n+m)\widehat{T}^{2}_{n,m}]_{\text{sub}}$ or $V[(n+m)\widehat{\Delta}^{2}_{n,m}]_{\text{sub}}$ and the $y$ axis is the simulated variance of $(n+m)\widehat{T}^{2}_{n,m}$ or $(n+m)\widehat{\Delta}^{2}_{n,m}$ for each $\sigma$, $d$, $(n,m)$, and $(k,\ell)$ in Table 1 or Table 2. The line in Figure 3 is the least-squares regression line of the form $y=ax+\varepsilon$. Figure 3 shows that multiplying $V[(n+m)\widehat{T}^{2}_{n,m}]_{\text{sub}}$ and $V[(n+m)\widehat{\Delta}^{2}_{n,m}]_{\text{sub}}$ by the associated regression coefficient brings them close to the variances of $(n+m)\widehat{T}^{2}_{n,m}$ and $(n+m)\widehat{\Delta}^{2}_{n,m}$. The coefficients of these regressions with intercept 0 are reported in the rows labeled “slope of the line” in Tables 1 and 2.

Table 1: The variance of (n+m)T^n,m2(n+m)\widehat{T}^{2}_{n,m} under P=Q=N(0¯,Id)P=Q=N(\underline{0},I_{d}) and V[(n+m)T^n,m2]subV[(n+m)\widehat{T}^{2}_{n,m}]_{\text{sub}} : I=1,000I=1,000, n1=n/2n_{1}=n/2, and X1,,Xni.i.d.N(0¯,Id)X_{1},\dots,X_{n}\overset{i.i.d.}{\sim}N(\underline{0},I_{d}).
σ\sigma dd (n,m)(n,m) (n+m)T^n,m2(n+m)\widehat{T}^{2}_{n,m} Subsampling (k,)(k,\ell)
(n/4,n/4)(n/4,n/4) (n/6,n/6)(n/6,n/6) (n/8,n/8)(n/8,n/8)
d3/4d^{-3/4} 5 (200,200) 0.06880 0.04341 0.05168 0.04902
d3/4d^{-3/4} 5 (500,500) 0.06881 0.03821 0.04897 0.04921
d7/8d^{-7/8} 5 (200,200) 0.07254 0.04246 0.05138 0.05798
d7/8d^{-7/8} 5 (500,500) 0.07188 0.04052 0.05500 0.05593
d3/4d^{-3/4} 10 (200,200) 0.00850 0.00602 0.00812 0.00898
d3/4d^{-3/4} 10 (500,500) 0.00845 0.00674 0.00751 0.00753
d7/8d^{-7/8} 10 (200,200) 0.01280 0.00937 0.01224 0.01377
d7/8d^{-7/8} 10 (500,500) 0.01270 0.01032 0.01251 0.01255
d3/4d^{-3/4} 20 (200,200) 0.00049 0.00048 0.00070 0.00094
d3/4d^{-3/4} 20 (500,500) 0.00043 0.00031 0.00046 0.00060
d7/8d^{-7/8} 20 (200,200) 0.00166 0.00152 0.00261 0.00330
d7/8d^{-7/8} 20 (500,500) 0.00147 0.00122 0.00165 0.00204
(200,200) 1 1.63621 1.35769 1.29601
slope of the line (500,500) 1 1.76057 1.33845 1.3232
both 1 1.69348 1.34798 1.30928
Table 2: The variance of (n+m)Δ^n,m2(n+m)\widehat{\Delta}^{2}_{n,m} under P=Q=N(0¯,Id)P=Q=N(\underline{0},I_{d}) and V[(n+m)Δ^n,m2]subV[(n+m)\widehat{\Delta}^{2}_{n,m}]_{\text{sub}} : I=1,000I=1,000, n1=n/2n_{1}=n/2, and X1,,Xni.i.d.N(0¯,Id)X_{1},\dots,X_{n}\overset{i.i.d.}{\sim}N(\underline{0},I_{d}).
σ\sigma dd (n,m)(n,m) (n+m)Δ^n,m2(n+m)\widehat{\Delta}^{2}_{n,m} Subsampling (k,)(k,\ell)
(n/4,n/4)(n/4,n/4) (n/6,n/6)(n/6,n/6) (n/8,n/8)(n/8,n/8)
d3/4d^{-3/4} 5 (200,200) 0.57100 0.47044 0.50315 0.57047
d3/4d^{-3/4} 5 (500,500) 0.66068 0.54518 0.63054 0.58848
d7/8d^{-7/8} 5 (200,200) 0.65987 0.51867 0.57258 0.54567
d7/8d^{-7/8} 5 (500,500) 0.75903 0.63563 0.68024 0.65349
d3/4d^{-3/4} 10 (200,200) 0.16205 0.09213 0.12017 0.12940
d3/4d^{-3/4} 10 (500,500) 0.16279 0.16106 0.16809 0.18334
d7/8d^{-7/8} 10 (200,200) 0.25656 0.14104 0.18435 0.20457
d7/8d^{-7/8} 10 (500,500) 0.25836 0.24670 0.25567 0.26402
d3/4d^{-3/4} 20 (200,200) 0.02757 0.02135 0.02632 0.02814
d3/4d^{-3/4} 20 (500,500) 0.02784 0.02255 0.02407 0.02615
d7/8d^{-7/8} 20 (200,200) 0.07642 0.07404 0.08121 0.08744
d7/8d^{-7/8} 20 (500,500) 0.07856 0.05714 0.07492 0.07320
(200,200) 1 1.27369 1.16037 1.11083
slope of the line (500,500) 1 1.18426 1.07578 1.12080
both 1 1.21990 1.10951 1.11643
Figure 3: Linear regressions of the simulated variances on the subsampling variances for the MVD ($(n+m)\widehat{T}^{2}_{n,m}$ against $V[(n+m)\widehat{T}^{2}_{n,m}]_{\text{sub}}$) and the MMD ($(n+m)\widehat{\Delta}^{2}_{n,m}$ against $V[(n+m)\widehat{\Delta}^{2}_{n,m}]_{\text{sub}}$). (a) MVD $(n,m)=(200,200),~(k,\ell)=(50,50)$. (b) MVD $(n,m)=(500,500),~(k,\ell)=(125,125)$. (c) MMD $(n,m)=(200,200),~(k,\ell)=(50,50)$. (d) MMD $(n,m)=(500,500),~(k,\ell)=(125,125)$.
Remark 3.

Since the subsampling variance $V[(k+\ell)\widehat{T}^{2}_{k,\ell}]_{\text{sub}}$ is the unbiased sample variance of the $I$ replications, we get

𝔼[V[(k+)T^k,2]sub]\displaystyle\mathbb{E}[V[(k+\ell)\widehat{T}^{2}_{k,\ell}]_{\text{sub}}]
=1I1j=1I𝔼[{(k+)T^2k,(j)(k+)T^2¯k,}2]\displaystyle=\frac{1}{I-1}\sum_{j=1}^{I}\mathbb{E}[\left\{(k+\ell){\widehat{T}^{2}}_{k,\ell}(j)-(k+\ell)\overline{\widehat{T}^{2}}_{k,\ell}\right\}^{2}]
=1I1j=1I𝔼[{(k+)T^2k,(j)𝔼[(k+)T^k,2]+𝔼[(k+)T^k,2](k+)T^2¯k,}2]\displaystyle=\frac{1}{I-1}\sum_{j=1}^{I}\mathbb{E}\left[\left\{(k+\ell){\widehat{T}^{2}}_{k,\ell}(j)-\mathbb{E}[(k+\ell)\widehat{T}^{2}_{k,\ell}]+\mathbb{E}[(k+\ell)\widehat{T}^{2}_{k,\ell}]-(k+\ell)\overline{\widehat{T}^{2}}_{k,\ell}\right\}^{2}\right]
=1I1j=1I[𝔼[{(k+)T^2k,(j)𝔼[(k+)T^k,2]}2]+𝔼[{𝔼[(k+)T^k,2](k+)T^2¯k,}2]\displaystyle=\frac{1}{I-1}\sum_{j=1}^{I}\Big{[}\mathbb{E}\left[\left\{(k+\ell){\widehat{T}^{2}}_{k,\ell}(j)-\mathbb{E}[(k+\ell)\widehat{T}^{2}_{k,\ell}]\right\}^{2}\right]+\mathbb{E}\left[\left\{\mathbb{E}[(k+\ell)\widehat{T}^{2}_{k,\ell}]-(k+\ell)\overline{\widehat{T}^{2}}_{k,\ell}\right\}^{2}\right]
+2𝔼[{(k+)T^2k,(j)𝔼[(k+)T^k,2]}{𝔼[(k+)T^k,2](k+)T^2¯k,}]]\displaystyle~{}~{}~{}~{}~{}+2\mathbb{E}\left[\left\{(k+\ell){\widehat{T}^{2}}_{k,\ell}(j)-\mathbb{E}[(k+\ell)\widehat{T}^{2}_{k,\ell}]\right\}\left\{\mathbb{E}[(k+\ell)\widehat{T}^{2}_{k,\ell}]-(k+\ell)\overline{\widehat{T}^{2}}_{k,\ell}\right\}\right]\Big{]}
=1I1j=1I{V[(k+)T^k,2]+1I2i,s=1ICov((k+)T^k,2(i),(k+)T^k,2(s))\displaystyle=\frac{1}{I-1}\sum_{j=1}^{I}\Big{\{}V[(k+\ell)\widehat{T}^{2}_{k,\ell}]+\frac{1}{I^{2}}\sum_{i,s=1}^{I}Cov\left((k+\ell)\widehat{T}^{2}_{k,\ell}(i),(k+\ell)\widehat{T}^{2}_{k,\ell}(s)\right)
2Ii=1ICov((k+)T^k,2(j),(k+)T^k,2(i))}\displaystyle~{}~{}~{}~{}~{}-\frac{2}{I}\sum_{i=1}^{I}Cov\left((k+\ell)\widehat{T}^{2}_{k,\ell}(j),(k+\ell)\widehat{T}^{2}_{k,\ell}(i)\right)\Big{\}}
=II1V[(k+)T^k,2]1I(I1)i,j=1ICov((k+)T^k,2(i),(k+)T^k,2(j))\displaystyle=\frac{I}{I-1}V[(k+\ell)\widehat{T}^{2}_{k,\ell}]-\frac{1}{I(I-1)}\sum_{i,j=1}^{I}Cov\left((k+\ell)\widehat{T}^{2}_{k,\ell}(i),(k+\ell)\widehat{T}^{2}_{k,\ell}(j)\right)
=V[(k+)T^k,2]2I(I1)i<jCov((k+)T^k,2(i),(k+)T^k,2(j)).\displaystyle=V[(k+\ell)\widehat{T}^{2}_{k,\ell}]-\frac{2}{I(I-1)}\sum_{i<j}Cov\left((k+\ell)\widehat{T}^{2}_{k,\ell}(i),(k+\ell)\widehat{T}^{2}_{k,\ell}(j)\right).

However, it is not easy to estimate $Cov\left((k+\ell)\widehat{T}^{2}_{k,\ell}(i),(k+\ell)\widehat{T}^{2}_{k,\ell}(j)\right)$. This motivated us to modify the subsampling variances $V[(n+m)\widehat{T}^{2}_{n,m}]_{\text{sub}}$ in (9) and $V[(n+m)\widehat{\Delta}^{2}_{n,m}]_{\text{sub}}$ in (10) to $(1+\tau)V[(n+m)\widehat{T}^{2}_{n,m}]_{\text{sub}}$ and $(1+\tau)V[(n+m)\widehat{\Delta}^{2}_{n,m}]_{\text{sub}}$ with $\tau>0$, which is determined from the regression coefficients in the row “slope of the line” in Tables 1 and 2.
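The bias identified in Remark 3 can be demonstrated numerically. The sketch below uses the mean of a subsample of $N(0,1)$ data as a toy replication statistic (a stand-in for $(k+\ell)\widehat{T}^{2}_{k,\ell}$, not the paper's estimator); because the replications are drawn from the same data, they are positively correlated, and the unbiased sample variance across them falls below the true variance $1/k$:

```python
import numpy as np

def subsample_variance_bias_demo(n=400, k=100, I=2000, seed=0):
    """Illustrate Remark 3 with a toy statistic: replications computed
    from overlapping subsamples of the same data are positively
    correlated, so their sample variance tends to underestimate the
    true (unconditional) variance of a single replication, here 1/k.
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(size=n)
    reps = np.array([X[rng.choice(n, size=k, replace=False)].mean()
                     for _ in range(I)])
    v_sub = reps.var(ddof=1)   # unbiased sample variance of the replications
    v_true = 1.0 / k           # exact Var of the mean of k i.i.d. N(0,1) draws
    return v_sub, v_true
```

With the defaults above, `v_sub` typically comes out well below `v_true`, mirroring the negative covariance correction term in Remark 3.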

4.1.2 Modification of approximation distribution

Since the subsampling variance $V[(n+m)\widehat{T}^{2}_{n,m}]_{\text{sub}}$ underestimates $V[(n+m)\widehat{T}^{2}_{n,m}]$, as seen in Tables 1 and 2, we estimate $V[(n+m)\widehat{T}^{2}_{n,m}]$ by $(1+\tau)V[(n+m)\widehat{T}^{2}_{n,m}]_{\text{sub}}$ in (9) for some $\tau>0$. The same underestimation occurs for the MMD test, and $V[(n+m)\widehat{\Delta}^{2}_{n,m}]$ is estimated by $(1+\tau)V[(n+m)\widehat{\Delta}^{2}_{n,m}]_{\text{sub}}$ in (10) with $\tau>0$. Our approximation of the null distribution replaces the overly large variance of $1/\{\rho(1-\rho)\}\sum_{\ell=1}^{n-1}\widehat{\lambda}_{\ell}^{(n)}Z_{\ell}^{2}$ with $(1+\tau)V[(n+m)\widehat{T}^{2}_{n,m}]_{\text{sub}}$. The method aims to approximate the exact null distribution by using

Wn=ξn/{ρ(1ρ)}=1n1λ^(n)Z2+cn.W_{n}^{\prime}=\xi_{n}/\{\rho(1-\rho)\}\sum_{\ell=1}^{n-1}\widehat{\lambda}_{\ell}^{(n)}Z_{\ell}^{2}+c_{n}. (11)

The parameters $\xi_{n}$ and $c_{n}$ are determined so that the mean of $W^{\prime}_{n}$ equals that of $1/\{\rho(1-\rho)\}\sum_{\ell=1}^{n-1}\widehat{\lambda}_{\ell}^{(n)}Z_{\ell}^{2}$ and the variance of $W^{\prime}_{n}$ equals $(1+\tau)V[(n+m)\widehat{T}^{2}_{n,m}]_{\text{sub}}$; that is,

\mathbb{E}[W^{\prime}_{n}]=\frac{1}{\rho(1-\rho)}\sum_{\ell=1}^{n-1}\widehat{\lambda}_{\ell}^{(n)}

and

V[Wn]=(1+τ)V[(n+m)T^n,m2]sub.V[W^{\prime}_{n}]=(1+\tau)V[(n+m)\widehat{T}^{2}_{n,m}]_{\text{sub}}.
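These two conditions determine $\xi_{n}$ and $c_{n}$ in closed form. Writing $S_{n}=1/\{\rho(1-\rho)\}\sum_{\ell=1}^{n-1}\widehat{\lambda}_{\ell}^{(n)}Z_{\ell}^{2}$, so that $W_{n}^{\prime}=\xi_{n}S_{n}+c_{n}$, and using $\mathbb{E}[Z_{\ell}^{2}]=1$ and $V[Z_{\ell}^{2}]=2$, a short calculation (not displayed in the paper) gives

```latex
V[S_{n}] = \frac{2}{\rho^{2}(1-\rho)^{2}}\sum_{\ell=1}^{n-1}\bigl(\widehat{\lambda}_{\ell}^{(n)}\bigr)^{2},
\qquad
\xi_{n} = \sqrt{\frac{(1+\tau)V[(n+m)\widehat{T}^{2}_{n,m}]_{\text{sub}}}{V[S_{n}]}},
\qquad
c_{n} = (1-\xi_{n})\,\mathbb{E}[S_{n}].
```

Here $\xi_{n}$ rescales the spread of $S_{n}$ toward the inflated subsampling target, while $c_{n}$ restores the mean.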

This approximation method applies in the same way to the MMD test using $(1+\tau)V[(n+m)\widehat{\Delta}^{2}_{n,m}]_{\text{sub}}$. In this paper, the parameter $\tau>0$ is determined from the values in the row “slope of the line” in Tables 1 and 2. Figure 4 shows that $W^{\prime}_{n}$ approximates the simulated exact distribution better than $1/\{\rho(1-\rho)\}\sum_{\ell=1}^{n-1}\widehat{\lambda}^{(n)}_{\ell}Z_{\ell}^{2}$ does. The Algorithm shows how to obtain the critical value of the MVD test using this modification. The algorithm for the MMD test is obtained by replacing $H$ and $\widehat{T}^{2}$ in the Algorithm with $\widetilde{K}_{X}$ and $\widehat{\Delta}^{2}$.

Figure 4: Density estimates of the simulated exact null distribution (solid), $W^{\prime}_{n}$ (dashed), and $1/\{\rho(1-\rho)\}\sum_{\ell=1}^{n-1}\widehat{\lambda}^{(n)}_{\ell}Z_{\ell}^{2}$ (dotted), with $(n,m)=(500,500)$, $\sigma=d^{-3/4}$, and $(k,\ell)=(125,125)$. Left panel: MVD results with $d=10$, $\tau_{\text{MVD}}=0.69348$. Right panel: MMD results with $d=20$, $\tau_{\text{MMD}}=0.21990$.
Algorithm Calculation of critical value for the MVD test.
Require: $X_{1},\dots,X_{n},Y_{1},\dots,Y_{m}\in\mathcal{H}$, a kernel $k:\mathcal{H}\times\mathcal{H}\to\mathbb{R}$, a significance level $0<\alpha<1$, and parameters $(k,\ell)\in\{1,\dots,[n/2]\}$ and $\tau>0$; for example, $\tau$ is selected from the values of “slope of the line” in Tables 1 and 2.
1. Compute the eigenvalues $\widehat{\lambda}_{\ell}^{(n)}$ of $H$ in (7) and obtain $1/\{\rho(1-\rho)\}\sum_{\ell=1}^{n-1}\widehat{\lambda}_{\ell}^{(n)}Z_{\ell}^{2}(j)$ from random elements $Z_{1}^{(j)},\dots,Z_{n-1}^{(j)}\overset{i.i.d.}{\sim}N(0,1)$, $j\in\{1,\dots,J\}$.
2. (a) Obtain copies $(k+\ell)\widehat{T}^{2}_{k,\ell}(i)$, $i\in\{1,\dots,I\}$, of $(k+\ell)\widehat{T}^{2}_{k,\ell}$ under $H_{0}$ by the subsampling method.
 (b) Compute the subsampling variance $V[(n+m)\widehat{T}^{2}_{n,m}]_{\text{sub}}$ from the $(k+\ell)\widehat{T}^{2}_{k,\ell}(i)$.
3. Compute $W_{n}^{\prime}$ in (11) from $1/\{\rho(1-\rho)\}\sum_{\ell=1}^{n-1}\widehat{\lambda}_{\ell}^{(n)}Z_{\ell}^{2}(j)$ and $V[(n+m)\widehat{T}^{2}_{n,m}]_{\text{sub}}$.
Output: the critical value $t_{\alpha}(W^{\prime}_{n})$, taken as the $J(1-\alpha)$-th of the $J$ simulated values of $W^{\prime}_{n}$ sorted in ascending order.
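A compact sketch of the Algorithm in Python follows. It assumes a Gaussian kernel and, since the matrix $H$ in (7) is not reproduced here, uses the doubly centered Gram matrix divided by $n$ as a generic stand-in for it; the subsampling variance is taken as a precomputed input, and the function names are illustrative:

```python
import numpy as np

def gaussian_gram(X, sigma):
    """Gram matrix of the Gaussian kernel k(x, y) = exp(-sigma * ||x - y||^2)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sigma * sq)

def mvd_critical_value(X, sigma, rho, v_sub, tau, alpha=0.05, J=10_000, rng=None):
    """Critical value t_alpha(W'_n), following the steps of the Algorithm.

    Step 1: eigenvalues of a Gram-based matrix (here the doubly centered
            Gram matrix / n, a stand-in for H in (7)) and J replications
            of S(j) = sum_l lam_l Z_l^2 / (rho * (1 - rho)).
    Step 2: v_sub is the precomputed subsampling variance
            V[(n+m) T-hat^2_{n,m}]_sub, inflated by (1 + tau).
    Step 3: W'(j) = xi * S(j) + c with mean/variance matching; the
            critical value is the J*(1-alpha)-th sorted value.
    """
    rng = np.random.default_rng(rng)
    n = len(X)
    C = np.eye(n) - np.ones((n, n)) / n            # centering matrix
    lam = np.linalg.eigvalsh(C @ gaussian_gram(X, sigma) @ C / n)
    lam = lam[lam > 1e-12]                         # keep numerically positive weights
    a = 1.0 / (rho * (1.0 - rho))
    S = a * (rng.standard_normal((J, len(lam))) ** 2 @ lam)
    mean_S, var_S = a * lam.sum(), 2.0 * a ** 2 * (lam ** 2).sum()
    xi = np.sqrt((1.0 + tau) * v_sub / var_S)      # match the target variance
    c = (1.0 - xi) * mean_S                        # keep the mean unchanged
    W = xi * S + c
    return np.sort(W)[int(J * (1.0 - alpha)) - 1]
```

A larger `v_sub` (or `tau`) inflates `xi` and hence the critical value, which is the intended effect of the modification.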

4.2 Simulations

In this section, we investigate the performance of $(n+m)\widehat{T}^{2}_{n,m}$ under a specific null hypothesis and specific alternative hypotheses when $\mathcal{H}=\mathbb{R}^{d}$ and $k(\cdot,\cdot)$ is the Gaussian kernel in (4). In particular, a Monte Carlo simulation is performed to observe the type-I error and the power of the MVD and MMD tests. Two alternatives are implemented: a uniform distribution $Q_{1}$ and an exponential distribution $Q_{2}$, with $P=N(0,1)$, all of which have mean 0 and variance 1. The critical values are determined on the basis of $W^{\prime}_{n}$ in Section 4.1.2 with data generated from a normal distribution. The type-I error of $(n+m)\widehat{T}^{2}_{n,m}$ is obtained by counting the number of times $(n+m)\widehat{T}^{2}_{n,m}$ exceeds the critical value in 1,000 iterations under the null hypothesis. The estimated power of $(n+m)\widehat{T}^{2}_{n,m}$ is obtained in the same way by counting the number of times $(n+m)\widehat{T}^{2}_{n,m}$ exceeds the critical value in 1,000 iterations under each alternative distribution. We execute the above for $(n,m)=(200,200)$ and $(500,500)$ and $d=5,10$, and $20$. It is known that the choice of the bandwidth $\sigma$ in the Gaussian kernel affects the performance, so we let $\sigma$ depend on the dimension $d$. The significance level is $\alpha=0.05$. With $n_{1}=n/2$ and $(k,\ell)=(n/8,n/8)$, we take $\tau_{\text{MVD}}=0.30928$ for the MVD and $\tau_{\text{MMD}}=0.11643$ for the MMD from the row “slope of the line” in Tables 1 and 2.
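The rejection counting described above can be sketched as follows, with hypothetical sampler and statistic callables (the paper's estimators themselves are not reproduced):

```python
import numpy as np

def rejection_rate(sample_P, sample_Q, statistic, critical_value,
                   n, m, iters=1000, rng=None):
    """Estimate the rejection probability of a two-sample test by
    counting how often the scaled statistic exceeds the critical value.

    sample_P, sample_Q : callables (rng, size) -> (size, d) arrays.
    statistic          : callable (X, Y) -> squared statistic (a stand-in
                         for T-hat^2_{n,m} or Delta-hat^2_{n,m}).
    Under P = Q this estimates the type-I error; under P != Q, the power.
    """
    rng = np.random.default_rng(rng)
    hits = 0
    for _ in range(iters):
        X = sample_P(rng, n)
        Y = sample_Q(rng, m)
        if (n + m) * statistic(X, Y) > critical_value:
            hits += 1
    return hits / iters
```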
The following can be seen from Table 3:

  • The probabilities of type-I error at $d=5$ and $10$ are near the significance level of $\alpha=0.05$ for both the MVD and MMD.

  • The probability of type-I error at d=20d=20 exceeds the significance level of α=0.05\alpha=0.05 for the MVD, but decreases as (n,m)(n,m) increases.

  • The critical value based on $W^{\prime}_{n}$ for the MVD tends to be smaller than the corresponding quantile of the null distribution.

  • Against the alternative $Q_{1}$, the MVD has a higher power than the MMD.

  • Both the MVD and MMD have higher power against $Q_{2}$ than against $Q_{1}$; that is, it is difficult for the MVD and MMD with a Gaussian kernel to distinguish between the normal distribution and the uniform distribution.

  • Note that the critical value changes depending on the distribution of the null hypothesis.

Table 3: Type-I error and power of the test by (n+m)T^n,m2(n+m)\widehat{T}^{2}_{n,m} for each sample size and each parameter σ\sigma.
σ\sigma dd (n,m)(n,m) MVD MMD
Type-I error Q1Q_{1} Q2Q_{2} Type-I error Q1Q_{1} Q2Q_{2}
d3/4d^{-3/4} 5 (200,200) 0.060 0.797 1 0.047 0.401 1
d3/4d^{-3/4} 5 (500,500) 0.072 1 1 0.063 0.877 1
d7/8d^{-7/8} 5 (200,200) 0.056 0.728 1 0.052 0.305 1
d7/8d^{-7/8} 5 (500,500) 0.067 0.999 1 0.053 0.735 1
d3/4d^{-3/4} 10 (200,200) 0.073 0.612 1 0.086 0.342 1
d3/4d^{-3/4} 10 (500,500) 0.047 0.991 1 0.040 0.630 1
d7/8d^{-7/8} 10 (200,200) 0.054 0.482 1 0.086 0.235 1
d7/8d^{-7/8} 10 (500,500) 0.034 0.955 1 0.044 0.363 1
d3/4d^{-3/4} 20 (200,200) 0.279 0.816 1 0.082 0.239 1
d3/4d^{-3/4} 20 (500,500) 0.099 0.948 1 0.068 0.477 1
d7/8d^{-7/8} 20 (200,200) 0.060 0.332 1 0.047 0.113 0.989
d7/8d^{-7/8} 20 (500,500) 0.034 0.728 1 0.069 0.240 1

5 Application to real datasets

The MVD test was applied to some real datasets. The significance level was $\alpha=0.05$, and the critical value $t_{0.05}(H)$ was obtained through 10,000 iterations of $1/\{\rho(1-\rho)\}\sum_{\ell=1}^{n-1}\widehat{\lambda}^{(n)}_{\ell}Z_{\ell}^{2}$ based on the eigenvalues of the matrix $H/n$. We also calculated the critical value $t_{0.05}(W^{\prime}_{n})$ of the approximate distribution $W^{\prime}_{n}$ according to the Algorithm in Section 4.1.2. The critical value $t_{0.05}(\widetilde{K}_{X})$ for the MMD test was calculated from the distribution based on Theorem 1 in [7] through 10,000 iterations.

5.1 USPS data

The USPS dataset consists of handwritten digits represented by a 16×1616\times 16 grayscale matrix (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html#usps). The sizes of each sample are shown in Table 4.

Table 4: Sample sizes of the USPS data.
index 0 1 2 3 4 5 6 7 8 9
sample size 359 264 198 166 200 160 170 147 166 177

Each group was divided into two sets of sample size 70, and the MVD test was applied to each pair of sets. Table 5 shows the results of applying the MVD and MMD tests to this USPS dataset. The parameters $\sigma=d^{-3/4},d^{-7/8}$, $n_{1}=35$, and $k=\ell=18$ are adopted, and $\tau_{\text{MVD}}=0.69348$ for the MVD and $\tau_{\text{MMD}}=0.21990$ for the MMD were taken from the “slope of the line” rows in Tables 1 and 2. In each cell, the values of $(n+m)\widehat{T}^{2}_{n,m}$ and $(n+m)\widehat{\Delta}^{2}_{n,m}$ for each pair of digits are written, with the values of $(n+m)\widehat{\Delta}^{2}_{n,m}$ in parentheses.

From Table 5, the MVD test tends to reject the null hypothesis when the two groups are different digits and not to reject it when they are the same digit. For $P=$ USPS 2 and $Q=$ USPS 2, the value of $(n+m)\widehat{T}^{2}_{n,m}$ is 2.953, which is larger than $t_{0.05}(W^{\prime}_{n})=2.946$ but smaller than $t_{0.05}(H)=3.416$. On the other hand, for the MMD test, the value of $(n+m)\widehat{\Delta}^{2}_{n,m}$ is 5.014, which is larger than both $t_{0.05}(W^{\prime}_{n})=4.488$ and $t_{0.05}(H)=4.698$. Modifying the approximate distribution thus increases the tendency to reject the null hypothesis.

Table 5: Values of (n+m)T^n,m2(n+m)\widehat{T}^{2}_{n,m} and (n+m)Δ^n,m2(n+m)\widehat{\Delta}^{2}_{n,m} for σ=d3/4\sigma=d^{-3/4}, n1=35n_{1}=35, k,=18k,\ell=18, τMVD=0.69348\tau_{\text{MVD}}=0.69348, and τMMD=0.21990\tau_{\text{MMD}}=0.21990.
0 1 2 3 4 5 6 7 8 9
t0.05(H)t_{0.05}(H) 3.056 0.680 3.416 2.742 2.747 3.319 2.705 2.079 2.884 1.941
t0.05(Wn)t_{0.05}(W^{\prime}_{n}) 2.755 0.572 2.940 2.317 2.349 2.785 2.336 1.823 2.431 1.719
0 2.241 6.328 6.803 6.834 7.390 6.589 7.117 7.513 6.930 4.573
(2.740)
1 (124.6) 0.287 4.269 4.290 4.356 4.393 5.132 4.265 4.233 4.170
(0.585)
2 (34.38) (94.14) 2.953 4.730 5.253 5.053 5.880 5.264 4.732 5.165
(5.014)
3 (42.61) (105.0) (26.78) 2.383 5.248 4.239 6.022 5.242 4.575 5.022
(3.345)
4 (55.81) (93.27) (36.46) (45.34) 2.067 5.259 6.237 4.930 4.849 3.717
(2.745)
5 (30.65) (95.83) (24.64) (18.91) (35.32) 2.761 5.757 5.434 4.814 5.106
(3.822)
6 (39.15) (102.6) (30.11) (47.36) (48.80) (29.49) 2.261 6.527 5.946 6.344
(5.643)
7 (72.41) (111.3) (50.29) (52.91) (45.54) (51.29) (74.30) 1.560 5.142 4.062
(1.785)
8 (44.46) (86.90) (25.20) (28.77) (31.13) (25.01) (40.92) (51.27) 2.055 4.352
(2.666)
9 (71.81) (95.38) (51.48) (49.58) (25.87) (46.76) (70.81) (31.19) (33.73) 1.677
(2.336)
t0.05(K~X)t_{0.05}(\widetilde{K}_{X}) (4.983) (2.191) (4.698) (4.352) (4.343) (4.691) (4.550) (3.917) (4.424) (3.767)
t0.05(Wn)t_{0.05}(W^{\prime}_{n}) (5.228) (1.749) (4.502) (3.928) (3.961) (4.085) (4.510) (4.104) (4.245) (4.455)

5.2 MNIST data

The MNIST dataset consists of images of 28×28=78428\times 28=784 pixels in size (http://yann.lecun.com/exdb/mnist). The sizes of each sample are shown in Table 6.

Table 6: Sample sizes of the MNIST data.
index 0 1 2 3 4 5 6 7 8 9
sample size 5,923 6,742 5,958 6,131 5,842 5,421 5,918 6,265 5,851 5,949

The MNIST data are divided into two sets of sample size 2,000 and the MVD and MMD tests are applied. Table 7 shows the results of applying the MVD and MMD tests to the MNIST data. The approximate distribution WnW^{\prime}_{n} is calculated with n1=1,000n_{1}=1,000, k,=500k,\ell=500, τMVD=0.69348\tau_{\text{MVD}}=0.69348, and τMMD=0.21990\tau_{\text{MMD}}=0.21990. As in Table 5, the values of (n+m)T^n,m2(n+m)\widehat{T}^{2}_{n,m} and (n+m)Δ^n,m2(n+m)\widehat{\Delta}^{2}_{n,m} are written in each cell, with the values of (n+m)Δ^n,m2(n+m)\widehat{\Delta}^{2}_{n,m} in parentheses. In Table 7, (n+m)T^n,m2(n+m)\widehat{T}_{n,m}^{2} tends to take a larger value than both t0.05(H)t_{0.05}(H) and t0.05(Wn)t_{0.05}(W^{\prime}_{n}). This result is the same for the MMD test. The MVD and MMD tests tend to reject the null hypothesis with the modifications in Section 4.1.2.

Table 7: Values of (n+m)T^n,m2(n+m)\widehat{T}^{2}_{n,m} and (n+m)Δ^n,m2(n+m)\widehat{\Delta}^{2}_{n,m} for σ=d2\sigma=d^{-2}, n1=1,000n_{1}=1,000, k,=500k,\ell=500, τMVD=0.69348\tau_{\text{MVD}}=0.69348, and τMMD=0.21990\tau_{\text{MMD}}=0.21990.
0 1 2 3 4 5 6 7 8 9
t0.05(H)t_{0.05}(H) 4.194 3.883 4.202 4.196 4.189 4.195 4.182 4.163 4.197 4.171
t0.05(Wn)t_{0.05}(W^{\prime}_{n}) 3.993 3.937 3.993 3.989 3.990 3.993 3.989 3.990 3.989 3.988
0 3.999 34.86 4.207 4.268 4.394 4.304 4.599 5.240 4.256 4.861
(4.092)
1 (217.4) 5.379 34.68 34.75 34.86 34.80 35.08 35.66 34.73 35.33
(15.42)
2 (10.77) (210.5) 4.001 4.118 4.245 4.156 4.451 5.088 4.106 4.711
(4.131)
3 (13.55) (213.4) (9.806) 4.007 4.306 4.211 4.512 5.146 4.164 4.769
(4.208)
4 (16.56) (216.1) (12.83) (15.59) 4.020 4.341 4.637 5.262 4.292 4.805
(4.482)
5 (13.35) (213.9) (10.10) (11.57) (15.12) 4.018 4.546 5.188 4.201 4.806
(4.357)
6 (19.39) (219.5) (16.06) (18.90) (21.28) (18.29) 4.031 5.485 4.499 5.105
(4.573)
7 (28.85) (225.2) (24.85) (27.49) (28.28) (27.56) (34.24) 4.067 5.138 5.625
(4.625)
8 (13.12) (211.3) (9.261) (11.37) (14.72) (11.51) (18.19) (26.97) 4.005 4.755
(4.190)
9 (24.80) (223.2) (21.11) (23.23) (19.05) (22.80) (29.84) (29.91) (22.07) 4.046
(4.686)
t0.05(K~X)t_{0.05}(\widetilde{K}_{X}) (4.210) (4.718) (4.208) (4.206) (4.210) (4.209) (4.218) (4.238) (4.207) (4.224)
t0.05(Wn)t_{0.05}(W^{\prime}_{n}) (4.058) (5.137) (4.024) (4.038) (4.077) (4.107) (4.092) (4.137) (4.039) (4.159)

5.3 Colon data

The Colon dataset contains gene expression data from DNA microarray experiments of colon tissue samples with $d=2{,}000$ and $n=62$ (see [1] for details). Among the 62 samples, 40 are tumor tissues and 22 are normal tissues. Tables 8 and 9 show the results of the MVD and MMD tests for $P=$ tumor and $Q=$ normal. The “tumor vs. normal” column shows the values of $(n+m)\widehat{T}^{2}_{n,m}$ and $(n+m)\widehat{\Delta}^{2}_{n,m}$ for $P=$ tumor and $Q=$ normal. The “normal” and “tumor” columns show $t_{0.05}(W^{\prime}_{n})$ and $t_{0.05}(H)$ calculated from the normal tissue and tumor tissue datasets, respectively.

For the MVD, $(n+m)\widehat{T}^{2}_{n,m}$ does not exceed $t_{0.05}(H)$, but it exceeds $t_{0.05}(W^{\prime}_{n})$ once the approximate distribution is modified. By contrast, for the MMD, $(n+m)\widehat{\Delta}^{2}_{n,m}$ exceeds both $t_{0.05}(H)$ and $t_{0.05}(W_{n}^{\prime})$ even without the modification.

Table 8: Values of (n+m)T^n,m2(n+m)\widehat{T}^{2}_{n,m} and critical values for normal (n1=11,k,=6n_{1}=11,~{}k,\ell=6) and tumor (n1=20,k,=10n_{1}=20,~{}k,\ell=10), with τMVD=0.69348\tau_{\text{MVD}}=0.69348.
normal tumor
σ\sigma tumor vs. normal t0.05(Wn)t_{0.05}(W^{\prime}_{n}) t0.05(H)t_{0.05}(H) t0.05(Wn)t_{0.05}(W^{\prime}_{n}) t0.05(H)t_{0.05}(H)
d3/4d^{-3/4} 3.867 3.536 5.280 3.728 5.050
d7/8d^{-7/8} 2.291 2.097 2.907 2.258 2.846
d1d^{-1} 0.684 0.660 0.879 0.757 0.906
Table 9: Values of (n+m)Δ^n,m2(n+m)\widehat{\Delta}^{2}_{n,m} and critical values for normal (n1=11,k,=6n_{1}=11,~{}k,\ell=6) and tumor (n1=20,k,=10n_{1}=20,~{}k,\ell=10), with τMMD=0.21990\tau_{\text{MMD}}=0.21990.
normal tumor
σ\sigma tumor vs. normal t0.05(Wn)t_{0.05}(W^{\prime}_{n}) t0.05(H)t_{0.05}(H) t0.05(Wn)t_{0.05}(W^{\prime}_{n}) t0.05(H)t_{0.05}(H)
d3/4d^{-3/4} 6.695 4.456 6.201 4.618 5.713
d7/8d^{-7/8} 8.787 3.974 4.827 3.945 4.491
d1d^{-1} 6.282 2.439 2.754 2.412 2.634

Next, the tumor group $(\text{sample size}=40)$ was divided into $P=$ tumor 1 $(n=20)$ and $Q=$ tumor 2 $(m=20)$, and the two-sample tests by the MVD and MMD were applied. The results are shown in Tables 10 and 11, with the values of $(n+m)\widehat{T}_{n,m}^{2}$ and $(n+m)\widehat{\Delta}_{n,m}^{2}$ in the column “tumor 1 vs. tumor 2”. In Table 10, when $\sigma=d^{-3/4}$, $(n+m)\widehat{T}_{n,m}^{2}$ exceeds $t_{0.05}(W^{\prime}_{n})$ but not $t_{0.05}(H)$; in the other cases, $(n+m)\widehat{T}_{n,m}^{2}$ exceeds neither $t_{0.05}(W^{\prime}_{n})$ nor $t_{0.05}(H)$. Table 11 shows that, for all $\sigma$, $(n+m)\widehat{\Delta}^{2}_{n,m}$ exceeds $t_{0.05}(W^{\prime}_{n})$ for the MMD test, but $(n+m)\widehat{\Delta}^{2}_{n,m}$ does not exceed $t_{0.05}(H)$.

Table 10: Values of (n+m)T^n,m2(n+m)\widehat{T}^{2}_{n,m} and critical values for tumor 1 (n1=10,k,=5n_{1}=10,~{}k,\ell=5) and tumor 2 (n1=10,k,=5n_{1}=10,~{}k,\ell=5), with τ=0.69348\tau=0.69348.
tumor 1 tumor 2
σ\sigma tumor 1 vs. tumor 2 t0.05(Wn)t_{0.05}(W^{\prime}_{n}) t0.05(H)t_{0.05}(H) t0.05(Wn)t_{0.05}(W^{\prime}_{n}) t0.05(H)t_{0.05}(H)
d3/4d^{-3/4} 3.379 3.180 4.596 3.245 4.942
d7/8d^{-7/8} 1.858 1.915 2.502 2.085 2.875
d1d^{-1} 0.558 0.629 0.800 0.727 0.921
Table 11: Values of $(n+m)\widehat{\Delta}^{2}_{n,m}$ and critical values for tumor 1 ($n_{1}=10,~k,\ell=5$) and tumor 2 ($n_{1}=10,~k,\ell=5$), with $\tau_{\text{MMD}}=0.21990$.
tumor 1 tumor 2
σ\sigma tumor 1 vs. tumor 2 t0.05(Wn)t_{0.05}(W^{\prime}_{n}) t0.05(H)t_{0.05}(H) t0.05(Wn)t_{0.05}(W^{\prime}_{n}) t0.05(H)t_{0.05}(H)
d3/4d^{-3/4} 4.627 4.064 5.621 3.961 5.782
d7/8d^{-7/8} 4.123 3.305 4.206 3.426 4.551
d1d^{-1} 2.453 1.942 2.377 2.102 2.656

6 Conclusion

We defined the Maximum Variance Discrepancy (MVD) in Section 2, based on an idea similar to that of the Maximum Mean Discrepancy (MMD). We derived the asymptotic null distribution of the MVD test in Section 3.1; this was an infinite weighted sum of chi-squared distributions. In Section 3.2, we derived the asymptotic non-null distribution of the MVD test, which was a normal distribution. The asymptotic normality of the test under the alternative hypothesis showed that the two-sample test based on the MVD is consistent. Furthermore, we developed the asymptotic distribution of the test under a sequence of local alternatives in Section 3.3; this was an infinite weighted sum of noncentral chi-squared distributions. We constructed an estimator of the weights of the asymptotic null distribution based on the Gram matrix in Section 3.4. The approximation of the null distribution using these estimated weights does not work well, so we modified it in Section 4.1. In the power simulations reported in Section 4.2, we found that the power of the two-sample test based on the MVD was larger than that of the MMD. We confirmed in Section 5 that the two-sample test based on the MVD works for real datasets.

7 Proofs

Lemma 1.

Suppose that Condition is satisfied. Then, as n,mn,m\to\infty,

n+m(Σk(P^)Σk(Q^)H(k)22Σ~k(P)Σ~k(Q)H(k)22)𝑃0,\sqrt{n+m}\left(\left\|\Sigma_{k}(\widehat{P})-\Sigma_{k}(\widehat{Q})\right\|_{H(k)^{\otimes 2}}^{2}-\left\|\widetilde{\Sigma}_{k}(P)-\widetilde{\Sigma}_{k}(Q)\right\|_{H(k)^{\otimes 2}}^{2}\right)\xrightarrow{P}0,

where

Σ~k(P)=1ni=1n(k(,Xi)μk(P))2andΣ~k(Q)=1mj=1m(k(,Yj)μk(Q))2.\widetilde{\Sigma}_{k}(P)=\frac{1}{n}\sum_{i=1}^{n}(k(\cdot,X_{i})-\mu_{k}(P))^{\otimes 2}~{}~{}\text{and}~{}~{}\widetilde{\Sigma}_{k}(Q)=\frac{1}{m}\sum_{j=1}^{m}(k(\cdot,Y_{j})-\mu_{k}(Q))^{\otimes 2}.
Lemma 2.

Let $Y_{1},\dots,Y_{m}\overset{i.i.d.}{\sim}Q_{nm}$. Suppose that Condition is satisfied. Then, as $n,m\to\infty$, the following evaluations hold:

  • (i)
    μk(Q^nm)μk(Qnm)H(k)=Op(1n+m),\left\|\mu_{k}(\widehat{Q}_{nm})-\mu_{k}(Q_{nm})\right\|_{H(k)}=O_{p}\left(\frac{1}{\sqrt{n+m}}\right),
  • (ii)
    Σk(Q^nm)Σk(Qnm)H(k)2=Op(1n+m),\left\|\Sigma_{k}(\widehat{Q}_{nm})-\Sigma_{k}(Q_{nm})\right\|_{H(k)^{\otimes 2}}=O_{p}\left(\frac{1}{\sqrt{n+m}}\right),
  • (iii)
    Σ~k(Qnm)Σk(Qnm)H(k)2=Op(1n+m).\left\|\widetilde{\Sigma}_{k}(Q_{nm})-\Sigma_{k}(Q_{nm})\right\|_{H(k)^{\otimes 2}}=O_{p}\left(\frac{1}{\sqrt{n+m}}\right).
Lemma 3.

Let X1,,Xni.i.d.PX_{1},\dots,X_{n}\overset{i.i.d.}{\sim}P and Y1,,Ymi.i.d.QnmY_{1},\dots,Y_{m}\overset{i.i.d.}{\sim}Q_{nm}. Suppose that Condition is satisfied. Then, as n,mn,m\to\infty,

(n+m)(Σk(P^)Σk(Q^nm)H(k)22Σ~k(P)Σ~k(Qnm)H(k)22)𝑃0.(n+m)\left(\left\|\Sigma_{k}(\widehat{P})-\Sigma_{k}(\widehat{Q}_{nm})\right\|^{2}_{H(k)^{\otimes 2}}-\left\|\widetilde{\Sigma}_{k}(P)-\widetilde{\Sigma}_{k}(Q_{nm})\right\|^{2}_{H(k)^{\otimes 2}}\right)\xrightarrow{P}0.

7.1 Proof of Proposition 1

The kernel mean embedding $\mu_{k}(N(\underline{m},\Sigma))$ with the Gaussian kernel in (4) is given by

μk(N(m¯,Σ))=|Id+2σΣ|1/2exp(σ(m¯)T(Id+2σΣ)1(m¯))\mu_{k}(N(\underline{m},\Sigma))=|I_{d}+2\sigma\Sigma|^{-1/2}\exp\left(-\sigma(\cdot-\underline{m})^{T}(I_{d}+2\sigma\Sigma)^{-1}(\cdot-\underline{m})\right) (12)

and its norm is

μk(N(m¯,Σ))H(k)2=|Id+4σΣ|1/2\left\|\mu_{k}(N(\underline{m},\Sigma))\right\|^{2}_{H(k)}=|I_{d}+4\sigma\Sigma|^{-1/2} (13)

by Proposition 4.2 in [9]. We use the following property of Gaussian densities:

ϕΣ1(x¯m¯1)ϕΣ2(x¯m¯2)=ϕΣ1+Σ2(m¯1m¯2)ϕ(Σ11+Σ21)1(x¯m¯),\phi_{\Sigma_{1}}(\underline{x}-\underline{m}_{1})\phi_{\Sigma_{2}}(\underline{x}-\underline{m}_{2})=\phi_{\Sigma_{1}+\Sigma_{2}}(\underline{m}_{1}-\underline{m}_{2})\phi_{(\Sigma_{1}^{-1}+\Sigma_{2}^{-1})^{-1}}(\underline{x}-\underline{m}^{*}), (14)

where

\underline{m}^{*}=(\Sigma^{-1}_{1}+\Sigma^{-1}_{2})^{-1}(\Sigma^{-1}_{1}\underline{m}_{1}+\Sigma^{-1}_{2}\underline{m}_{2})

and \phi_{\Sigma}(\cdot-\underline{m}) denotes the density of N(\underline{m},\Sigma); see, e.g., Appendix C in [13]. Applying the property (14) repeatedly to calculate \mathbb{E}_{\begin{subarray}{c}\underline{X}\sim N(\underline{\mu},\Sigma)\\ \underline{X}^{\prime}\sim N(\underline{m}_{0},\Sigma_{0})\end{subarray}}[k(\underline{X},\underline{X}^{\prime})], we get

𝔼X¯N(μ¯,Σ)X¯N(m¯0,Σ0)[k(X¯,X¯)]\displaystyle\mathbb{E}_{\begin{subarray}{c}\underline{X}\sim N(\underline{\mu},\Sigma)\\ \underline{X}^{\prime}\sim N(\underline{m}_{0},\Sigma_{0})\end{subarray}}[k(\underline{X},\underline{X}^{\prime})]
=ddexp(σx¯y¯𝕕2)𝑑N(μ¯,Σ)(x¯)𝑑N(m¯0,Σ0)(y¯)\displaystyle=\int_{\mathbb{R}^{d}}\int_{\mathbb{R}^{d}}\exp(-\sigma\left\|\underline{x}-\underline{y}\right\|^{2}_{\mathbb{R^{d}}})dN(\underline{\mu},\Sigma)(\underline{x})dN(\underline{m}_{0},\Sigma_{0})(\underline{y})
=(πσ)d/2ddϕ12σId(x¯y¯)𝑑N(μ¯,Σ)(x¯)𝑑N(m¯0,Σ0)(y¯)\displaystyle=\left(\frac{\pi}{\sigma}\right)^{d/2}\int_{\mathbb{R}^{d}}\int_{\mathbb{R}^{d}}\phi_{\frac{1}{2\sigma}I_{d}}(\underline{x}-\underline{y})dN(\underline{\mu},\Sigma)(\underline{x})dN(\underline{m}_{0},\Sigma_{0})(\underline{y})
=(πσ)d/2ddϕ12σId(x¯y¯)ϕΣ(x¯μ¯)𝑑x¯𝑑N(m¯0,Σ0)(y¯)\displaystyle=\left(\frac{\pi}{\sigma}\right)^{d/2}\int_{\mathbb{R}^{d}}\int_{\mathbb{R}^{d}}\phi_{\frac{1}{2\sigma}I_{d}}(\underline{x}-\underline{y})\phi_{\Sigma}(\underline{x}-\underline{\mu})d\underline{x}dN(\underline{m}_{0},\Sigma_{0})(\underline{y})
=(πσ)d/2ddϕ12σId+Σ(y¯μ¯)ϕ(2σId+Σ1)1(x¯m¯1)𝑑x¯𝑑N(m¯0,Σ0)(y¯)\displaystyle=\left(\frac{\pi}{\sigma}\right)^{d/2}\int_{\mathbb{R}^{d}}\int_{\mathbb{R}^{d}}\phi_{\frac{1}{2\sigma}I_{d}+\Sigma}(\underline{y}-\underline{\mu})\phi_{(2\sigma I_{d}+\Sigma^{-1})^{-1}}(\underline{x}-\underline{m}_{1}^{*})d\underline{x}dN(\underline{m}_{0},\Sigma_{0})(\underline{y})
=(πσ)d/2dϕ12σId+Σ(y¯μ¯)𝑑N(m¯0,Σ0)(y¯)dϕ(2σId+Σ1)1(x¯m¯1)𝑑x¯\displaystyle=\left(\frac{\pi}{\sigma}\right)^{d/2}\int_{\mathbb{R}^{d}}\phi_{\frac{1}{2\sigma}I_{d}+\Sigma}(\underline{y}-\underline{\mu})dN(\underline{m}_{0},\Sigma_{0})(\underline{y})\int_{\mathbb{R}^{d}}\phi_{(2\sigma I_{d}+\Sigma^{-1})^{-1}}(\underline{x}-\underline{m}_{1}^{*})d\underline{x}
=(πσ)d/2dϕ12σId+Σ(y¯μ¯)ϕΣ0(y¯m¯0)𝑑y¯\displaystyle=\left(\frac{\pi}{\sigma}\right)^{d/2}\int_{\mathbb{R}^{d}}\phi_{\frac{1}{2\sigma}I_{d}+\Sigma}(\underline{y}-\underline{\mu})\phi_{\Sigma_{0}}(\underline{y}-\underline{m}_{0})d\underline{y}
=(πσ)d/2dϕ12σId+Σ+Σ0(μ¯m¯0)ϕ((12σId+Σ)1+Σ01)1(y¯m¯2)𝑑y¯\displaystyle=\left(\frac{\pi}{\sigma}\right)^{d/2}\int_{\mathbb{R}^{d}}\phi_{\frac{1}{2\sigma}I_{d}+\Sigma+\Sigma_{0}}(\underline{\mu}-\underline{m}_{0})\phi_{((\frac{1}{2\sigma}I_{d}+\Sigma)^{-1}+\Sigma_{0}^{-1})^{-1}}(\underline{y}-\underline{m}_{2}^{*})d\underline{y}
=(πσ)d/2ϕ12σId+Σ+Σ0(μ¯m¯0)dϕ((12σId+Σ)1+Σ01)1(y¯m¯2)𝑑y¯\displaystyle=\left(\frac{\pi}{\sigma}\right)^{d/2}\phi_{\frac{1}{2\sigma}I_{d}+\Sigma+\Sigma_{0}}(\underline{\mu}-\underline{m}_{0})\int_{\mathbb{R}^{d}}\phi_{((\frac{1}{2\sigma}I_{d}+\Sigma)^{-1}+\Sigma_{0}^{-1})^{-1}}(\underline{y}-\underline{m}_{2}^{*})d\underline{y}
=(πσ)d/2ϕ12σId+Σ+Σ0(μ¯m¯0),\displaystyle=\left(\frac{\pi}{\sigma}\right)^{d/2}\phi_{\frac{1}{2\sigma}I_{d}+\Sigma+\Sigma_{0}}(\underline{\mu}-\underline{m}_{0}), (15)

where

\displaystyle\underline{m}_{1}^{*}=(2\sigma I_{d}+\Sigma^{-1})^{-1}(2\sigma\underline{y}+\Sigma^{-1}\underline{\mu}),
\displaystyle\underline{m}_{2}^{*}=\left\{\left(\frac{1}{2\sigma}I_{d}+\Sigma\right)^{-1}+\Sigma_{0}^{-1}\right\}^{-1}\left\{\left(\frac{1}{2\sigma}I_{d}+\Sigma\right)^{-1}\underline{\mu}+\Sigma_{0}^{-1}\underline{m}_{0}\right\}.
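Identity (14) is elementary but easy to misstate: in the precision-weighted mean \underline{m}^{*}, each precision attaches to its own mean. A minimal numerical sketch in one dimension (the helper names below are ad hoc, not from the paper) confirms the identity:

```python
# Hypothetical numerical sanity check of the Gaussian product identity (14)
# in one dimension, where phi(v, t) denotes the N(0, v) density (v = variance).
import math

def phi(v, t):
    """Density of N(0, v) evaluated at t."""
    return math.exp(-t * t / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)

def product_identity(v1, m1, v2, m2, x):
    """Return (lhs, rhs) of (14) for scalar variances v1, v2."""
    lhs = phi(v1, x - m1) * phi(v2, x - m2)
    v_star = 1.0 / (1.0 / v1 + 1.0 / v2)
    m_star = v_star * (m1 / v1 + m2 / v2)  # each precision weights its own mean
    rhs = phi(v1 + v2, m1 - m2) * phi(v_star, x - m_star)
    return lhs, rhs

# Both sides agree pointwise up to floating-point error.
for x in (-1.0, 0.0, 0.7, 2.3):
    lhs, rhs = product_identity(1.0, 0.0, 2.0, 3.0, x)
    assert abs(lhs - rhs) < 1e-12
```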

Using the results (13) and (15), \text{MMD}(N(\underline{0},I_{d}),N(\underline{m},\Sigma))^{2} is obtained as

MMD(N(0¯,Id),N(m¯,Σ))2\displaystyle\text{MMD}(N(\underline{0},I_{d}),N(\underline{m},\Sigma))^{2}
=μk(N(0¯,Id))μk(N(m¯,Σ))H(k)2\displaystyle=\left\|\mu_{k}(N(\underline{0},I_{d}))-\mu_{k}(N(\underline{m},\Sigma))\right\|^{2}_{H(k)}
=μk(N(0¯,Id))H(k)2+μk(N(m¯,Σ))H(k)22μk(N(0¯,Id)),μk(N(m¯,Σ))H(k)\displaystyle=\left\|\mu_{k}(N(\underline{0},I_{d}))\right\|^{2}_{H(k)}+\left\|\mu_{k}(N(\underline{m},\Sigma))\right\|^{2}_{H(k)}-2\left<\mu_{k}(N(\underline{0},I_{d})),\mu_{k}(N(\underline{m},\Sigma))\right>_{H(k)}
=μk(N(0¯,Id))H(k)2+μk(N(m¯,Σ))H(k)22𝔼X¯N(0¯,Id)X¯N(m¯,Σ)[k(X¯,X¯)]\displaystyle=\left\|\mu_{k}(N(\underline{0},I_{d}))\right\|^{2}_{H(k)}+\left\|\mu_{k}(N(\underline{m},\Sigma))\right\|^{2}_{H(k)}-2\mathbb{E}_{\begin{subarray}{c}\underline{X}\sim N(\underline{0},I_{d})\\ \underline{X}^{\prime}\sim N(\underline{m},\Sigma)\end{subarray}}[k(\underline{X},\underline{X}^{\prime})]
=|Id+4σId|1/2+|Id+4σΣ|1/22(πσ)d/2ϕ12σId+Id+Σ(m¯)\displaystyle=|I_{d}+4\sigma I_{d}|^{-1/2}+|I_{d}+4\sigma\Sigma|^{-1/2}-2\left(\frac{\pi}{\sigma}\right)^{d/2}\phi_{\frac{1}{2\sigma}I_{d}+I_{d}+\Sigma}(\underline{m})
=(1+4σ)d/2+|Id+4σΣ|1/2\displaystyle=(1+4\sigma)^{-d/2}+|I_{d}+4\sigma\Sigma|^{-1/2}
2(πσ)d/2|πσ((1+2σ)Id+2σΣ)|1/2exp(σm¯T((1+2σ)Id+2σΣ)1m¯)\displaystyle~{}~{}~{}~{}~{}-2\left(\frac{\pi}{\sigma}\right)^{d/2}\left|\frac{\pi}{\sigma}\left((1+2\sigma)I_{d}+2\sigma\Sigma\right)\right|^{-1/2}\exp\left(-\sigma\underline{m}^{T}\left((1+2\sigma)I_{d}+2\sigma\Sigma\right)^{-1}\underline{m}\right)
=(1+4σ)d/2+|Id+4σΣ|1/2\displaystyle=(1+4\sigma)^{-d/2}+|I_{d}+4\sigma\Sigma|^{-1/2}
2|(1+2σ)Id+2σΣ|1/2exp(σm¯T((1+2σ)Id+2σΣ)1m¯).\displaystyle~{}~{}~{}~{}~{}-2\left|(1+2\sigma)I_{d}+2\sigma\Sigma\right|^{-1/2}\exp\left(-\sigma\underline{m}^{T}\left((1+2\sigma)I_{d}+2\sigma\Sigma\right)^{-1}\underline{m}\right).
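As a sanity check (not part of the proof), the closed form above can be compared against a crude Monte Carlo estimate of \text{MMD}^{2} for d=1; the sample size, seed, and tolerance below are arbitrary choices:

```python
# A minimal Monte Carlo sketch (assumed parameters, not from the paper) checking
# the closed-form MMD^2 between N(0, 1) and N(m, s) for d = 1, with the
# Gaussian kernel k(x, y) = exp(-sigma * (x - y)^2).
import math
import random

def mmd2_closed_form(m, s, sigma):
    """Closed-form MMD(N(0,1), N(m,s))^2 for d = 1 (s = variance)."""
    a = (1.0 + 4.0 * sigma) ** -0.5
    b = (1.0 + 4.0 * sigma * s) ** -0.5
    c = (1.0 + 2.0 * sigma) + 2.0 * sigma * s
    return a + b - 2.0 * c ** -0.5 * math.exp(-sigma * m * m / c)

def mmd2_monte_carlo(m, s, sigma, n=200000, seed=0):
    """Estimate E k(X,X') + E k(Y,Y') - 2 E k(X,Y) from independent draws."""
    rng = random.Random(seed)
    k = lambda x, y: math.exp(-sigma * (x - y) ** 2)
    sd = math.sqrt(s)
    exx = sum(k(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)) / n
    eyy = sum(k(rng.gauss(m, sd), rng.gauss(m, sd)) for _ in range(n)) / n
    exy = sum(k(rng.gauss(0, 1), rng.gauss(m, sd)) for _ in range(n)) / n
    return exx + eyy - 2.0 * exy

# The estimate should fall within Monte Carlo error of the closed form,
# and the closed form vanishes when the two distributions coincide.
assert abs(mmd2_closed_form(1.0, 2.0, 0.5) - mmd2_monte_carlo(1.0, 2.0, 0.5)) < 0.02
assert abs(mmd2_closed_form(0.0, 1.0, 0.5)) < 1e-12
```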

7.2 Proof of Proposition 2

In this proof, \text{MVD}(N(\underline{0},I_{d}),N(\underline{m},\Sigma))^{2} with the Gaussian kernel in (4) is calculated by repeatedly applying (14). From the expansion of the norm

MVD(N(0¯,Id),N(m¯,Σ))2\displaystyle\text{MVD}(N(\underline{0},I_{d}),N(\underline{m},\Sigma))^{2}
=Σk(N(0¯,Id))Σk(N(m¯,Σ))H(k)22\displaystyle=\left\|\Sigma_{k}(N(\underline{0},I_{d}))-\Sigma_{k}(N(\underline{m},\Sigma))\right\|^{2}_{H(k)^{\otimes 2}}
=Σk(N(0¯,Id))H(k)22+Σk(N(m¯,Σ))H(k)222Σk(N(0¯,Id)),Σk(N(m¯,Σ))H(k)2,\displaystyle=\left\|\Sigma_{k}(N(\underline{0},I_{d}))\right\|^{2}_{H(k)^{\otimes 2}}+\left\|\Sigma_{k}(N(\underline{m},\Sigma))\right\|^{2}_{H(k)^{\otimes 2}}-2\left<\Sigma_{k}(N(\underline{0},I_{d})),\Sigma_{k}(N(\underline{m},\Sigma))\right>_{H(k)^{\otimes 2}}, (16)

it suffices to calculate \left<\Sigma_{k}(N(\underline{\mu},\Sigma)),\Sigma_{k}(N(\underline{m}_{0},\Sigma_{0}))\right>_{H(k)^{\otimes 2}}. The definition of \Sigma_{k}(P) and of the tensor product h^{\otimes 2} leads to

Σk(N(μ¯,Σ)),Σk(N(m¯0,Σ0))H(k)2\displaystyle\left<\Sigma_{k}(N(\underline{\mu},\Sigma)),\Sigma_{k}(N(\underline{m}_{0},\Sigma_{0}))\right>_{H(k)^{\otimes 2}}
=𝔼X¯N(μ¯,Σ)[(k(,X¯)𝔼X¯N(μ¯,Σ)[k(,X¯)])2],\displaystyle=\left<\mathbb{E}_{\underline{X}\sim N(\underline{\mu},\Sigma)}\left[\left(k(\cdot,\underline{X})-\mathbb{E}_{\underline{X}\sim N(\underline{\mu},\Sigma)}[k(\cdot,\underline{X})]\right)^{\otimes 2}\right],\right.
𝔼Y¯N(m¯0,Σ0)[(k(,Y¯)𝔼Y¯N(m¯0,Σ0)[k(,Y¯)])2]H(k)2\displaystyle~{}~{}~{}~{}~{}\left.\mathbb{E}_{\underline{Y}\sim N(\underline{m}_{0},\Sigma_{0})}\left[\left(k(\cdot,\underline{Y})-\mathbb{E}_{\underline{Y}\sim N(\underline{m}_{0},\Sigma_{0})}[k(\cdot,\underline{Y})]\right)^{\otimes 2}\right]\right>_{H(k)^{\otimes 2}}
=𝔼X¯N(μ¯,Σ)Y¯N(m¯0,Σ0)[(k(,X¯)𝔼X¯N(μ¯,Σ)[k(,X¯)])2,(k(,Y¯)𝔼Y¯N(m¯0,Σ0)[k(,Y¯)])2H(k)2]\displaystyle=\mathbb{E}_{\begin{subarray}{c}\underline{X}\sim N(\underline{\mu},\Sigma)\\ \underline{Y}\sim N(\underline{m}_{0},\Sigma_{0})\end{subarray}}\left[\left<\left(k(\cdot,\underline{X})-\mathbb{E}_{\underline{X}\sim N(\underline{\mu},\Sigma)}[k(\cdot,\underline{X})]\right)^{\otimes 2},\left(k(\cdot,\underline{Y})-\mathbb{E}_{\underline{Y}\sim N(\underline{m}_{0},\Sigma_{0})}[k(\cdot,\underline{Y})]\right)^{\otimes 2}\right>_{H(k)^{\otimes 2}}\right]
=𝔼X¯N(μ¯,Σ)Y¯N(m¯0,Σ0)[k(,X¯)2{𝔼X¯N(μ¯,Σ)[k(,X¯)]}2,k(,Y¯)2{𝔼Y¯N(m¯0,Σ0)[k(,Y¯)]}2H(k)2]\displaystyle=\mathbb{E}_{\begin{subarray}{c}\underline{X}\sim N(\underline{\mu},\Sigma)\\ \underline{Y}\sim N(\underline{m}_{0},\Sigma_{0})\end{subarray}}\left[\left<k(\cdot,\underline{X})^{\otimes 2}-\left\{\mathbb{E}_{\underline{X}\sim N(\underline{\mu},\Sigma)}[k(\cdot,\underline{X})]\right\}^{\otimes 2},k(\cdot,\underline{Y})^{\otimes 2}-\left\{\mathbb{E}_{\underline{Y}\sim N(\underline{m}_{0},\Sigma_{0})}[k(\cdot,\underline{Y})]\right\}^{\otimes 2}\right>_{H(k)^{\otimes 2}}\right]
=𝔼X¯N(μ¯,Σ)Y¯N(m¯0,Σ0)[k(,X¯)2,k(,Y¯)2H(k)2k(,X¯)2,{𝔼Y¯N(m¯0,Σ0)[k(,Y¯)]}2H(k)2\displaystyle=\mathbb{E}_{\begin{subarray}{c}\underline{X}\sim N(\underline{\mu},\Sigma)\\ \underline{Y}\sim N(\underline{m}_{0},\Sigma_{0})\end{subarray}}\left[\left<k(\cdot,\underline{X})^{\otimes 2},k(\cdot,\underline{Y})^{\otimes 2}\right>_{H(k)^{\otimes 2}}-\left<k(\cdot,\underline{X})^{\otimes 2},\left\{\mathbb{E}_{\underline{Y}\sim N(\underline{m}_{0},\Sigma_{0})}[k(\cdot,\underline{Y})]\right\}^{\otimes 2}\right>_{H(k)^{\otimes 2}}\right.
{𝔼X¯N(μ¯,Σ)[k(,X¯)]}2,k(,Y¯)2H(k)2\displaystyle~{}~{}~{}~{}~{}-\left<\left\{\mathbb{E}_{\underline{X}\sim N(\underline{\mu},\Sigma)}[k(\cdot,\underline{X})]\right\}^{\otimes 2},k(\cdot,\underline{Y})^{\otimes 2}\right>_{H(k)^{\otimes 2}}
+{𝔼X¯N(μ¯,Σ)[k(,X¯)]}2,{𝔼Y¯N(m¯0,Σ0)[k(,Y¯)]}2H(k)2]\displaystyle~{}~{}~{}~{}~{}\left.+\left<\left\{\mathbb{E}_{\underline{X}\sim N(\underline{\mu},\Sigma)}[k(\cdot,\underline{X})]\right\}^{\otimes 2},\left\{\mathbb{E}_{\underline{Y}\sim N(\underline{m}_{0},\Sigma_{0})}[k(\cdot,\underline{Y})]\right\}^{\otimes 2}\right>_{H(k)^{\otimes 2}}\right]
=𝔼X¯N(μ¯,Σ)Y¯N(m¯0,Σ0)[k(,X¯),k(,Y¯)H(k)2k(,X¯),𝔼Y¯N(m¯0,Σ0)[k(,Y¯)]H(k)2\displaystyle=\mathbb{E}_{\begin{subarray}{c}\underline{X}\sim N(\underline{\mu},\Sigma)\\ \underline{Y}\sim N(\underline{m}_{0},\Sigma_{0})\end{subarray}}\left[\left<k(\cdot,\underline{X}),k(\cdot,\underline{Y})\right>_{H(k)}^{2}-\left<k(\cdot,\underline{X}),\mathbb{E}_{\underline{Y}\sim N(\underline{m}_{0},\Sigma_{0})}[k(\cdot,\underline{Y})]\right>_{H(k)}^{2}\right.
𝔼X¯N(μ¯,Σ)[k(,X¯)],k(,Y¯)H(k)2+𝔼X¯N(μ¯,Σ)[k(,X¯)],𝔼Y¯N(m¯0,Σ0)[k(,Y¯)]H(k)2]\displaystyle~{}~{}~{}~{}~{}\left.-\left<\mathbb{E}_{\underline{X}\sim N(\underline{\mu},\Sigma)}[k(\cdot,\underline{X})],k(\cdot,\underline{Y})\right>_{H(k)}^{2}+\left<\mathbb{E}_{\underline{X}\sim N(\underline{\mu},\Sigma)}[k(\cdot,\underline{X})],\mathbb{E}_{\underline{Y}\sim N(\underline{m}_{0},\Sigma_{0})}[k(\cdot,\underline{Y})]\right>_{H(k)}^{2}\right]
=𝔼X¯N(μ¯,Σ)Y¯N(m¯0,Σ0)[k(X¯,Y¯)2{𝔼Y¯N(m¯0,Σ0)[k(X¯,Y¯)]}2{𝔼X¯N(μ¯,Σ)[k(X¯,Y¯)]}2\displaystyle=\mathbb{E}_{\begin{subarray}{c}\underline{X}\sim N(\underline{\mu},\Sigma)\\ \underline{Y}\sim N(\underline{m}_{0},\Sigma_{0})\end{subarray}}\left[k(\underline{X},\underline{Y})^{2}-\left\{\mathbb{E}_{\underline{Y}\sim N(\underline{m}_{0},\Sigma_{0})}[k(\underline{X},\underline{Y})]\right\}^{2}-\left\{\mathbb{E}_{\underline{X}\sim N(\underline{\mu},\Sigma)}[k(\underline{X},\underline{Y})]\right\}^{2}\right.
+{𝔼X¯N(μ¯,Σ)Y¯N(m¯0,Σ0)[k(X¯,Y¯)]}2]\displaystyle~{}~{}~{}~{}~{}\left.+\left\{\mathbb{E}_{\begin{subarray}{c}\underline{X}\sim N(\underline{\mu},\Sigma)\\ \underline{Y}\sim N(\underline{m}_{0},\Sigma_{0})\end{subarray}}[k(\underline{X},\underline{Y})]\right\}^{2}\right]
=𝔼X¯N(μ¯,Σ)Y¯N(m¯0,Σ0)[k(X¯,Y¯)2]𝔼X¯N(μ¯,Σ)[{𝔼Y¯N(m¯0,Σ0)[k(X¯,Y¯)]}2]\displaystyle=\mathbb{E}_{\begin{subarray}{c}\underline{X}\sim N(\underline{\mu},\Sigma)\\ \underline{Y}\sim N(\underline{m}_{0},\Sigma_{0})\end{subarray}}[k(\underline{X},\underline{Y})^{2}]-\mathbb{E}_{\underline{X}\sim N(\underline{\mu},\Sigma)}\left[\left\{\mathbb{E}_{\underline{Y}\sim N(\underline{m}_{0},\Sigma_{0})}[k(\underline{X},\underline{Y})]\right\}^{2}\right]
𝔼Y¯N(m¯0,Σ0)[{𝔼X¯N(μ¯,Σ)[k(X¯,Y¯)]}2]+{𝔼X¯N(μ¯,Σ)Y¯N(m¯0,Σ0)[k(X¯,Y¯)]}2\displaystyle~{}~{}~{}~{}~{}-\mathbb{E}_{\underline{Y}\sim N(\underline{m}_{0},\Sigma_{0})}\left[\left\{\mathbb{E}_{\underline{X}\sim N(\underline{\mu},\Sigma)}[k(\underline{X},\underline{Y})]\right\}^{2}\right]+\left\{\mathbb{E}_{\begin{subarray}{c}\underline{X}\sim N(\underline{\mu},\Sigma)\\ \underline{Y}\sim N(\underline{m}_{0},\Sigma_{0})\end{subarray}}[k(\underline{X},\underline{Y})]\right\}^{2}
=:I1I2I3+I4.\displaystyle=:I_{1}-I_{2}-I_{3}+I_{4}. (17)

We calculate each of these terms. The first term I1I_{1} is derived as

I1\displaystyle I_{1} =𝔼X¯N(μ¯,Σ)Y¯N(m¯0,Σ0)[k(X¯,Y¯)2]\displaystyle=\mathbb{E}_{\begin{subarray}{c}\underline{X}\sim N(\underline{\mu},\Sigma)\\ \underline{Y}\sim N(\underline{m}_{0},\Sigma_{0})\end{subarray}}[k(\underline{X},\underline{Y})^{2}]
=ddexp(2σx¯y¯d2)𝑑N(m¯0,Σ0)(x¯)𝑑N(μ¯,Σ)(y¯)\displaystyle=\int_{\mathbb{R}^{d}}\int_{\mathbb{R}^{d}}\exp\left(-2\sigma\left\|\underline{x}-\underline{y}\right\|^{2}_{\mathbb{R}^{d}}\right)dN(\underline{m}_{0},\Sigma_{0})(\underline{x})dN(\underline{\mu},\Sigma)(\underline{y})
=(π2σ)d/2ddϕ14σId(x¯y¯)𝑑N(m¯0,Σ0)(x¯)𝑑N(μ¯,Σ)(y¯)\displaystyle=\left(\frac{\pi}{2\sigma}\right)^{d/2}\int_{\mathbb{R}^{d}}\int_{\mathbb{R}^{d}}\phi_{\frac{1}{4\sigma}I_{d}}(\underline{x}-\underline{y})dN(\underline{m}_{0},\Sigma_{0})(\underline{x})dN(\underline{\mu},\Sigma)(\underline{y})
=(π2σ)d/2ddϕ14σId(x¯y¯)ϕΣ0(x¯m¯0)𝑑x¯𝑑N(μ¯,Σ)(y¯)\displaystyle=\left(\frac{\pi}{2\sigma}\right)^{d/2}\int_{\mathbb{R}^{d}}\int_{\mathbb{R}^{d}}\phi_{\frac{1}{4\sigma}I_{d}}(\underline{x}-\underline{y})\phi_{\Sigma_{0}}(\underline{x}-\underline{m}_{0})d\underline{x}dN(\underline{\mu},\Sigma)(\underline{y})
=(π2σ)d/2dϕ14σId+Σ0(y¯m¯0)𝑑N(μ¯,Σ)(y¯)\displaystyle=\left(\frac{\pi}{2\sigma}\right)^{d/2}\int_{\mathbb{R}^{d}}\phi_{\frac{1}{4\sigma}I_{d}+\Sigma_{0}}(\underline{y}-\underline{m}_{0})dN(\underline{\mu},\Sigma)(\underline{y})
\displaystyle=\left(\frac{\pi}{2\sigma}\right)^{d/2}\int_{\mathbb{R}^{d}}\phi_{\frac{1}{4\sigma}I_{d}+\Sigma_{0}}(\underline{y}-\underline{m}_{0})\phi_{\Sigma}(\underline{y}-\underline{\mu})d\underline{y}
\displaystyle=\left(\frac{\pi}{2\sigma}\right)^{d/2}\phi_{\frac{1}{4\sigma}I_{d}+\Sigma_{0}+\Sigma}(\underline{m}_{0}-\underline{\mu}) (18)

by repeatedly using (14). By using the expression of the kernel mean embedding in (12) together with the property (14), we obtain the second term

I2\displaystyle I_{2} =𝔼X¯N(μ¯,Σ)[{𝔼Y¯N(m¯0,Σ0)[k(X¯,Y¯)]}2]\displaystyle=\mathbb{E}_{\underline{X}\sim N(\underline{\mu},\Sigma)}\left[\left\{\mathbb{E}_{\underline{Y}\sim N(\underline{m}_{0},\Sigma_{0})}[k(\underline{X},\underline{Y})]\right\}^{2}\right]
=d{|Id+2σΣ0|1/2exp(σ(x¯m¯0)T(Id+2σΣ0)1(x¯m¯0))}2𝑑N(μ¯,Σ)(x¯)\displaystyle=\int_{\mathbb{R}^{d}}\left\{|I_{d}+2\sigma\Sigma_{0}|^{-1/2}\exp\left(-\sigma(\underline{x}-\underline{m}_{0})^{T}(I_{d}+2\sigma\Sigma_{0})^{-1}(\underline{x}-\underline{m}_{0})\right)\right\}^{2}dN(\underline{\mu},\Sigma)(\underline{x})
=|Id+2σΣ0|1dexp(2σ(x¯m¯0)T(Id+2σΣ0)1(x¯m¯0))𝑑N(μ¯,Σ)(x¯)\displaystyle=|I_{d}+2\sigma\Sigma_{0}|^{-1}\int_{\mathbb{R}^{d}}\exp\left(-2\sigma(\underline{x}-\underline{m}_{0})^{T}(I_{d}+2\sigma\Sigma_{0})^{-1}(\underline{x}-\underline{m}_{0})\right)dN(\underline{\mu},\Sigma)(\underline{x})
=(π2σ)d/2|Id+2σΣ0|1/2dϕ14σ(Id+2σΣ0)(x¯m¯0)𝑑N(μ¯,Σ)(x¯)\displaystyle=\left(\frac{\pi}{2\sigma}\right)^{d/2}|I_{d}+2\sigma\Sigma_{0}|^{-1/2}\int_{\mathbb{R}^{d}}\phi_{\frac{1}{4\sigma}(I_{d}+2\sigma\Sigma_{0})}(\underline{x}-\underline{m}_{0})dN(\underline{\mu},\Sigma)(\underline{x})
=(π2σ)d/2|Id+2σΣ0|1/2dϕ14σ(Id+2σΣ0)(x¯m¯0)ϕΣ(x¯μ¯)𝑑x¯\displaystyle=\left(\frac{\pi}{2\sigma}\right)^{d/2}|I_{d}+2\sigma\Sigma_{0}|^{-1/2}\int_{\mathbb{R}^{d}}\phi_{\frac{1}{4\sigma}(I_{d}+2\sigma\Sigma_{0})}(\underline{x}-\underline{m}_{0})\phi_{\Sigma}(\underline{x}-\underline{\mu})d\underline{x}
=(π2σ)d/2|Id+2σΣ0|1/2ϕ14σ(Id+2σΣ0+4σΣ)(m¯0μ¯).\displaystyle=\left(\frac{\pi}{2\sigma}\right)^{d/2}|I_{d}+2\sigma\Sigma_{0}|^{-1/2}\phi_{\frac{1}{4\sigma}(I_{d}+2\sigma\Sigma_{0}+4\sigma\Sigma)}(\underline{m}_{0}-\underline{\mu}). (19)

The third term I_{3} is derived as

I3\displaystyle I_{3} =𝔼Y¯N(m¯0,Σ0)[{𝔼X¯N(μ¯,Σ)[k(X¯,Y¯)]}2]\displaystyle=\mathbb{E}_{\underline{Y}\sim N(\underline{m}_{0},\Sigma_{0})}\left[\left\{\mathbb{E}_{\underline{X}\sim N(\underline{\mu},\Sigma)}[k(\underline{X},\underline{Y})]\right\}^{2}\right]
=(π2σ)d/2|Id+2σΣ|1/2ϕ14σ(Id+2σΣ+4σΣ0)(m¯0μ¯)\displaystyle=\left(\frac{\pi}{2\sigma}\right)^{d/2}|I_{d}+2\sigma\Sigma|^{-1/2}\phi_{\frac{1}{4\sigma}(I_{d}+2\sigma\Sigma+4\sigma\Sigma_{0})}(\underline{m}_{0}-\underline{\mu}) (20)

by the same calculation as I2I_{2}. Finally, the fourth term I4I_{4} is calculated as follows

I4\displaystyle I_{4} ={𝔼X¯N(μ¯,Σ)Y¯N(m¯0,Σ0)[k(X¯,Y¯)]}2\displaystyle=\left\{\mathbb{E}_{\begin{subarray}{c}\underline{X}\sim N(\underline{\mu},\Sigma)\\ \underline{Y}\sim N(\underline{m}_{0},\Sigma_{0})\end{subarray}}[k(\underline{X},\underline{Y})]\right\}^{2}
=(πσ)d{ϕ12σId+Σ+Σ0(m¯0μ¯)}2\displaystyle=\left(\frac{\pi}{\sigma}\right)^{d}\left\{\phi_{\frac{1}{2\sigma}I_{d}+\Sigma+\Sigma_{0}}(\underline{m}_{0}-\underline{\mu})\right\}^{2}
=(πσ)d{|2π(12σId+Σ+Σ0)|1/2exp(σ(m¯0μ¯)T(Id+2σΣ+2σΣ0)1(m¯0μ¯))}2\displaystyle=\left(\frac{\pi}{\sigma}\right)^{d}\left\{\left|2\pi\left(\frac{1}{2\sigma}I_{d}+\Sigma+\Sigma_{0}\right)\right|^{-1/2}\exp\left(-\sigma(\underline{m}_{0}-\underline{\mu})^{T}(I_{d}+2\sigma\Sigma+2\sigma\Sigma_{0})^{-1}(\underline{m}_{0}-\underline{\mu})\right)\right\}^{2}
=|Id+2σΣ+2σΣ0|1exp(2σ(m¯0μ¯)T(Id+2σΣ+2σΣ0)1(m¯0μ¯))\displaystyle=|I_{d}+2\sigma\Sigma+2\sigma\Sigma_{0}|^{-1}\exp\left(-2\sigma(\underline{m}_{0}-\underline{\mu})^{T}(I_{d}+2\sigma\Sigma+2\sigma\Sigma_{0})^{-1}(\underline{m}_{0}-\underline{\mu})\right)
=|Id+2σΣ+2σΣ0|1|π2σ(Id+2σΣ+2σΣ0)|1/2ϕ14σ(Id+2σΣ+2σΣ0)(m¯0μ¯)\displaystyle=|I_{d}+2\sigma\Sigma+2\sigma\Sigma_{0}|^{-1}\left|\frac{\pi}{2\sigma}(I_{d}+2\sigma\Sigma+2\sigma\Sigma_{0})\right|^{1/2}\phi_{\frac{1}{4\sigma}(I_{d}+2\sigma\Sigma+2\sigma\Sigma_{0})}(\underline{m}_{0}-\underline{\mu})
=(π2σ)d/2|Id+2σΣ+2σΣ0|1/2ϕ14σ(Id+2σΣ+2σΣ0)(m¯0μ¯)\displaystyle=\left(\frac{\pi}{2\sigma}\right)^{d/2}|I_{d}+2\sigma\Sigma+2\sigma\Sigma_{0}|^{-1/2}\phi_{\frac{1}{4\sigma}(I_{d}+2\sigma\Sigma+2\sigma\Sigma_{0})}(\underline{m}_{0}-\underline{\mu}) (21)

by using (15). Hence, combining (17) and (18)-(21) yields

Σk(N(μ¯,Σ)),Σk(N(m¯0,Σ0))H(k)2\displaystyle\left<\Sigma_{k}(N(\underline{\mu},\Sigma)),\Sigma_{k}(N(\underline{m}_{0},\Sigma_{0}))\right>_{H(k)^{\otimes 2}}
=(π2σ)d/2ϕ14σId+Σ0+Σ(m¯0μ¯)(π2σ)d/2|Id+2σΣ0|1/2ϕ14σ(Id+2σΣ0+4σΣ)(m¯0μ¯)\displaystyle=\left(\frac{\pi}{2\sigma}\right)^{d/2}\phi_{\frac{1}{4\sigma}I_{d}+\Sigma_{0}+\Sigma}(\underline{m}_{0}-\underline{\mu})-\left(\frac{\pi}{2\sigma}\right)^{d/2}|I_{d}+2\sigma\Sigma_{0}|^{-1/2}\phi_{\frac{1}{4\sigma}(I_{d}+2\sigma\Sigma_{0}+4\sigma\Sigma)}(\underline{m}_{0}-\underline{\mu})
(π2σ)d/2|Id+2σΣ|1/2ϕ14σ(Id+2σΣ+4σΣ0)(m¯0μ¯)\displaystyle~{}~{}~{}~{}~{}-\left(\frac{\pi}{2\sigma}\right)^{d/2}|I_{d}+2\sigma\Sigma|^{-1/2}\phi_{\frac{1}{4\sigma}(I_{d}+2\sigma\Sigma+4\sigma\Sigma_{0})}(\underline{m}_{0}-\underline{\mu})
+(π2σ)d/2|Id+2σΣ+2σΣ0|1/2ϕ14σ(Id+2σΣ+2σΣ0)(m¯0μ¯).\displaystyle~{}~{}~{}~{}~{}+\left(\frac{\pi}{2\sigma}\right)^{d/2}|I_{d}+2\sigma\Sigma+2\sigma\Sigma_{0}|^{-1/2}\phi_{\frac{1}{4\sigma}(I_{d}+2\sigma\Sigma+2\sigma\Sigma_{0})}(\underline{m}_{0}-\underline{\mu}). (22)

The following results are obtained by using (22):

Σk(N(0¯,Id))H(k)22\displaystyle\left\|\Sigma_{k}(N(\underline{0},I_{d}))\right\|^{2}_{H(k)^{\otimes 2}}
=(π2σ)d/2ϕ14σ(1+8σ)Id(0¯)(π2σ)d/2|(1+2σ)Id|1/2ϕ14σ(1+6σ)Id(0¯)\displaystyle=\left(\frac{\pi}{2\sigma}\right)^{d/2}\phi_{\frac{1}{4\sigma}(1+8\sigma)I_{d}}(\underline{0})-\left(\frac{\pi}{2\sigma}\right)^{d/2}|(1+2\sigma)I_{d}|^{-1/2}\phi_{\frac{1}{4\sigma}(1+6\sigma)I_{d}}(\underline{0})
(π2σ)d/2|(1+2σ)Id|1/2ϕ14σ(1+6σ)Id(0¯)+(π2σ)d/2|(1+4σ)Id|1/2ϕ14σ(1+4σ)Id(0¯)\displaystyle~{}~{}~{}~{}~{}-\left(\frac{\pi}{2\sigma}\right)^{d/2}|(1+2\sigma)I_{d}|^{-1/2}\phi_{\frac{1}{4\sigma}(1+6\sigma)I_{d}}(\underline{0})+\left(\frac{\pi}{2\sigma}\right)^{d/2}|(1+4\sigma)I_{d}|^{-1/2}\phi_{\frac{1}{4\sigma}(1+4\sigma)I_{d}}(\underline{0})
=(1+8σ)d/22(1+8σ+12σ2)d/2+(1+4σ)d,\displaystyle=(1+8\sigma)^{-d/2}-2(1+8\sigma+12\sigma^{2})^{-d/2}+(1+4\sigma)^{-d}, (23)
Σk(N(m¯,Σ))H(k)22\displaystyle\left\|\Sigma_{k}(N(\underline{m},\Sigma))\right\|^{2}_{H(k)^{\otimes 2}}
=(π2σ)d/2ϕ14σ(Id+8σΣ)(0¯)(π2σ)d/2|Id+2σΣ|1/2ϕ14σ(Id+6σΣ)(0¯)\displaystyle=\left(\frac{\pi}{2\sigma}\right)^{d/2}\phi_{\frac{1}{4\sigma}(I_{d}+8\sigma\Sigma)}(\underline{0})-\left(\frac{\pi}{2\sigma}\right)^{d/2}|I_{d}+2\sigma\Sigma|^{-1/2}\phi_{\frac{1}{4\sigma}(I_{d}+6\sigma\Sigma)}(\underline{0})
(π2σ)d/2|Id+2σΣ|1/2ϕ14σ(Id+6σΣ)(0¯)+(π2σ)d/2|Id+4σΣ|1/2ϕ14σ(Id+4σΣ)(0¯)\displaystyle~{}~{}~{}~{}~{}-\left(\frac{\pi}{2\sigma}\right)^{d/2}|I_{d}+2\sigma\Sigma|^{-1/2}\phi_{\frac{1}{4\sigma}(I_{d}+6\sigma\Sigma)}(\underline{0})+\left(\frac{\pi}{2\sigma}\right)^{d/2}|I_{d}+4\sigma\Sigma|^{-1/2}\phi_{\frac{1}{4\sigma}(I_{d}+4\sigma\Sigma)}(\underline{0})
=|Id+8σΣ|1/22|Id+8σΣ+12σ2Σ2|1/2+|Id+4σΣ|1,\displaystyle=|I_{d}+8\sigma\Sigma|^{-1/2}-2|I_{d}+8\sigma\Sigma+12\sigma^{2}\Sigma^{2}|^{-1/2}+|I_{d}+4\sigma\Sigma|^{-1}, (24)

and

Σk(N(0¯,Id)),Σk(N(m¯,Σ))H(k)2\displaystyle\left<\Sigma_{k}(N(\underline{0},I_{d})),\Sigma_{k}(N(\underline{m},\Sigma))\right>_{H(k)^{\otimes 2}}
=|(1+4σ)Id+4σΣ|1/2exp(2σm¯T((1+4σ)Id+4σΣ)1m¯)\displaystyle=|(1+4\sigma)I_{d}+4\sigma\Sigma|^{-1/2}\exp\left(-2\sigma\underline{m}^{T}\left((1+4\sigma)I_{d}+4\sigma\Sigma\right)^{-1}\underline{m}\right)
|Id+2σΣ|1/2|(1+4σ)Id+2σΣ|1/2exp(2σm¯T((1+4σ)Id+2σΣ)1m¯)\displaystyle~{}~{}~{}~{}~{}-|I_{d}+2\sigma\Sigma|^{-1/2}|(1+4\sigma)I_{d}+2\sigma\Sigma|^{-1/2}\exp\left(-2\sigma\underline{m}^{T}\left((1+4\sigma)I_{d}+2\sigma\Sigma\right)^{-1}\underline{m}\right)
(1+2σ)d/2|(1+2σ)Id+4σΣ|1/2exp(2σm¯T((1+2σ)Id+4σΣ)1m¯)\displaystyle~{}~{}~{}~{}~{}-(1+2\sigma)^{-d/2}|(1+2\sigma)I_{d}+4\sigma\Sigma|^{-1/2}\exp\left(-2\sigma\underline{m}^{T}\left((1+2\sigma)I_{d}+4\sigma\Sigma\right)^{-1}\underline{m}\right)
+|(1+2σ)Id+2σΣ|1exp(2σm¯T((1+2σ)Id+2σΣ)1m¯).\displaystyle~{}~{}~{}~{}~{}+|(1+2\sigma)I_{d}+2\sigma\Sigma|^{-1}\exp\left(-2\sigma\underline{m}^{T}\left((1+2\sigma)I_{d}+2\sigma\Sigma\right)^{-1}\underline{m}\right). (25)

Therefore, we obtain

MVD(N(0¯,Id),N(m¯,Σ))2\displaystyle\text{MVD}(N(\underline{0},I_{d}),N(\underline{m},\Sigma))^{2}
=(1+8σ)d/22(1+8σ+12σ2)d/2+(1+4σ)d\displaystyle=(1+8\sigma)^{-d/2}-2(1+8\sigma+12\sigma^{2})^{-d/2}+(1+4\sigma)^{-d}
+|Id+8σΣ|1/22|Id+8σΣ+12σ2Σ2|1/2+|Id+4σΣ|1\displaystyle~{}~{}~{}~{}~{}+|I_{d}+8\sigma\Sigma|^{-1/2}-2|I_{d}+8\sigma\Sigma+12\sigma^{2}\Sigma^{2}|^{-1/2}+|I_{d}+4\sigma\Sigma|^{-1}
2|(1+4σ)Id+4σΣ|1/2exp(2σm¯T((1+4σ)Id+4σΣ)1m¯)\displaystyle~{}~{}~{}~{}~{}-2|(1+4\sigma)I_{d}+4\sigma\Sigma|^{-1/2}\exp\left(-2\sigma\underline{m}^{T}\left((1+4\sigma)I_{d}+4\sigma\Sigma\right)^{-1}\underline{m}\right)
+2|Id+2σΣ|1/2|(1+4σ)Id+2σΣ|1/2exp(2σm¯T((1+4σ)Id+2σΣ)1m¯)\displaystyle~{}~{}~{}~{}~{}+2|I_{d}+2\sigma\Sigma|^{-1/2}|(1+4\sigma)I_{d}+2\sigma\Sigma|^{-1/2}\exp\left(-2\sigma\underline{m}^{T}\left((1+4\sigma)I_{d}+2\sigma\Sigma\right)^{-1}\underline{m}\right)
+2(1+2σ)d/2|(1+2σ)Id+4σΣ|1/2exp(2σm¯T((1+2σ)Id+4σΣ)1m¯)\displaystyle~{}~{}~{}~{}~{}+2(1+2\sigma)^{-d/2}|(1+2\sigma)I_{d}+4\sigma\Sigma|^{-1/2}\exp\left(-2\sigma\underline{m}^{T}\left((1+2\sigma)I_{d}+4\sigma\Sigma\right)^{-1}\underline{m}\right)
2|(1+2σ)Id+2σΣ|1exp(2σm¯T((1+2σ)Id+2σΣ)1m¯)\displaystyle~{}~{}~{}~{}~{}-2|(1+2\sigma)I_{d}+2\sigma\Sigma|^{-1}\exp\left(-2\sigma\underline{m}^{T}\left((1+2\sigma)I_{d}+2\sigma\Sigma\right)^{-1}\underline{m}\right)

by substituting the formulas (23)-(25) into (16).
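As with the MMD case, the closed form for \text{MVD}^{2} can be checked numerically (not part of the proof). The sketch below estimates each covariance inner product through the decomposition I_{1}-I_{2}-I_{3}+I_{4} in (17), using the kernel mean embedding (12) for the inner expectations; the sample size, seed, and tolerance are ad hoc choices:

```python
# Monte Carlo sketch (assumed parameters, not from the paper) checking the
# closed-form MVD(N(0,1), N(m,s))^2 for d = 1 and Gaussian kernel parameter
# sigma, via the decomposition <Sigma_P, Sigma_Q> = I1 - I2 - I3 + I4 in (17).
import math
import random

SIGMA = 0.5  # kernel parameter in k(x, y) = exp(-SIGMA * (x - y)^2)

def kernel(x, y):
    return math.exp(-SIGMA * (x - y) ** 2)

def mean_embed(x, m, s):
    """mu_k(N(m, s)) evaluated at x, i.e. (12) for d = 1 (s = variance)."""
    c = 1.0 + 2.0 * SIGMA * s
    return c ** -0.5 * math.exp(-SIGMA * (x - m) ** 2 / c)

def cov_inner_mc(mu, s, m0, s0, n, rng):
    """Monte Carlo estimate of <Sigma_k(N(mu,s)), Sigma_k(N(m0,s0))> via (17)."""
    xs = [rng.gauss(mu, math.sqrt(s)) for _ in range(n)]
    ys = [rng.gauss(m0, math.sqrt(s0)) for _ in range(n)]
    i1 = sum(kernel(x, y) ** 2 for x, y in zip(xs, ys)) / n
    i2 = sum(mean_embed(x, m0, s0) ** 2 for x in xs) / n
    i3 = sum(mean_embed(y, mu, s) ** 2 for y in ys) / n
    i4 = (sum(kernel(x, y) for x, y in zip(xs, ys)) / n) ** 2
    return i1 - i2 - i3 + i4

def mvd2_closed_form(m, s):
    """Closed-form MVD(N(0,1), N(m,s))^2 for d = 1, from (23)-(25)."""
    t = SIGMA
    self_p = (1 + 8*t) ** -0.5 - 2 * (1 + 8*t + 12*t*t) ** -0.5 + (1 + 4*t) ** -1
    self_q = ((1 + 8*t*s) ** -0.5 - 2 * (1 + 8*t*s + 12*t*t*s*s) ** -0.5
              + (1 + 4*t*s) ** -1)
    def cross(c):
        return math.exp(-2 * t * m * m / c)
    cross_sum = (-2 * ((1 + 4*t) + 4*t*s) ** -0.5 * cross((1 + 4*t) + 4*t*s)
                 + 2 * (1 + 2*t*s) ** -0.5 * ((1 + 4*t) + 2*t*s) ** -0.5
                   * cross((1 + 4*t) + 2*t*s)
                 + 2 * (1 + 2*t) ** -0.5 * ((1 + 2*t) + 4*t*s) ** -0.5
                   * cross((1 + 2*t) + 4*t*s)
                 - 2 * ((1 + 2*t) + 2*t*s) ** -1 * cross((1 + 2*t) + 2*t*s))
    return self_p + self_q + cross_sum

rng = random.Random(0)
n = 200000
m, s = 1.0, 2.0
mvd2_mc = (cov_inner_mc(0.0, 1.0, 0.0, 1.0, n, rng)
           - 2 * cov_inner_mc(0.0, 1.0, m, s, n, rng)
           + cov_inner_mc(m, s, m, s, n, rng))
assert abs(mvd2_mc - mvd2_closed_form(m, s)) < 0.02
```

As an internal consistency check, the closed form vanishes when the two distributions coincide (m = 0, s = 1), matching MVD(P, P) = 0.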

7.3 Proof of Theorem 1

Theorem 1 follows from Corollary 1 of [4] by regarding k(\cdot,X_{1}),\dots,k(\cdot,X_{n}) and k(\cdot,Y_{1}),\dots,k(\cdot,Y_{m}) as the data.

7.4 Proof of Theorem 2

By Lemma 1, it suffices to derive the asymptotic distribution of

n+m{Σ~k(P)Σ~k(Q)H(k)22Σk(P)Σk(Q)H(k)22}.\sqrt{n+m}\left\{\left\|\widetilde{\Sigma}_{k}(P)-\widetilde{\Sigma}_{k}(Q)\right\|^{2}_{H(k)^{\otimes 2}}-\left\|\Sigma_{k}(P)-\Sigma_{k}(Q)\right\|^{2}_{H(k)^{\otimes 2}}\right\}.

Let us expand the following quantity

n+m{Σ~k(P)Σ~k(Q)H(k)22Σk(P)Σk(Q)H(k)22}\displaystyle\sqrt{n+m}\left\{\left\|\widetilde{\Sigma}_{k}(P)-\widetilde{\Sigma}_{k}(Q)\right\|^{2}_{H(k)^{\otimes 2}}-\left\|\Sigma_{k}(P)-\Sigma_{k}(Q)\right\|^{2}_{H(k)^{\otimes 2}}\right\}
=n+mΣ~k(P)Σ~k(Q)+Σk(P)Σk(Q),Σ~k(P)Σ~k(Q)Σk(P)+Σk(Q)H(k)2\displaystyle=\sqrt{n+m}\left<\widetilde{\Sigma}_{k}(P)-\widetilde{\Sigma}_{k}(Q)+\Sigma_{k}(P)-\Sigma_{k}(Q),\widetilde{\Sigma}_{k}(P)-\widetilde{\Sigma}_{k}(Q)-\Sigma_{k}(P)+\Sigma_{k}(Q)\right>_{H(k)^{\otimes 2}}
=n+mΣ~k(P)Σk(P){Σ~k(Q)Σk(Q)},\displaystyle=\sqrt{n+m}\left<\widetilde{\Sigma}_{k}(P)-\Sigma_{k}(P)-\{\widetilde{\Sigma}_{k}(Q)-\Sigma_{k}(Q)\},\right.
2{Σk(P)Σk(Q)}+Σ~k(P)Σk(P){Σ~k(Q)Σk(Q)}H(k)2\displaystyle~{}~{}~{}~{}~{}\left.2\{\Sigma_{k}(P)-\Sigma_{k}(Q)\}+\widetilde{\Sigma}_{k}(P)-\Sigma_{k}(P)-\{\widetilde{\Sigma}_{k}(Q)-\Sigma_{k}(Q)\}\right>_{H(k)^{\otimes 2}}
=2n+mΣk(P)Σk(Q),Σ~k(P)Σk(P){Σ~k(Q)Σk(Q)}H(k)2\displaystyle=2\sqrt{n+m}\left<\Sigma_{k}(P)-\Sigma_{k}(Q),\widetilde{\Sigma}_{k}(P)-\Sigma_{k}(P)-\{\widetilde{\Sigma}_{k}(Q)-\Sigma_{k}(Q)\}\right>_{H(k)^{\otimes 2}}
+n+mΣ~k(P)Σk(P){Σ~k(Q)Σk(Q)}H(k)22\displaystyle~{}~{}~{}~{}~{}+\sqrt{n+m}\left\|\widetilde{\Sigma}_{k}(P)-\Sigma_{k}(P)-\{\widetilde{\Sigma}_{k}(Q)-\Sigma_{k}(Q)\}\right\|^{2}_{H(k)^{\otimes 2}}
=n+mn2ni=1nΣk(P)Σk(Q),(k(,Xi)μk(P))2Σk(P)H(k)2\displaystyle=\sqrt{\frac{n+m}{n}}\frac{2}{\sqrt{n}}\sum_{i=1}^{n}\left<\Sigma_{k}(P)-\Sigma_{k}(Q),(k(\cdot,X_{i})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right>_{H(k)^{\otimes 2}}
n+mm2mj=1mΣk(P)Σk(Q),(k(,Yj)μk(Q))2Σk(Q)H(k)2+Op(1n+m),\displaystyle~{}~{}~{}~{}~{}-\sqrt{\frac{n+m}{m}}\frac{2}{\sqrt{m}}\sum_{j=1}^{m}\left<\Sigma_{k}(P)-\Sigma_{k}(Q),(k(\cdot,Y_{j})-\mu_{k}(Q))^{\otimes 2}-\Sigma_{k}(Q)\right>_{H(k)^{\otimes 2}}+O_{p}\left(\frac{1}{\sqrt{n+m}}\right),

which converges in distribution to N(0,4\rho^{-1}v_{P}^{2}+4(1-\rho)^{-1}v^{2}_{Q}) by the central limit theorem.

7.5 Proof of Lemma 1

A direct calculation gives

n+m(Σk(P^)Σk(Q^)H(k)22Σ~k(P)Σ~k(Q)H(k)22)\displaystyle\sqrt{n+m}\left(\left\|\Sigma_{k}(\widehat{P})-\Sigma_{k}(\widehat{Q})\right\|^{2}_{H(k)^{\otimes 2}}-\left\|\widetilde{\Sigma}_{k}(P)-\widetilde{\Sigma}_{k}(Q)\right\|^{2}_{H(k)^{\otimes 2}}\right)
=(Σk(P^)Σk(Q^)H(k)2+Σ~k(P)Σ~k(Q)H(k)2)\displaystyle=\left(\left\|\Sigma_{k}(\widehat{P})-\Sigma_{k}(\widehat{Q})\right\|_{H(k)^{\otimes 2}}+\left\|\widetilde{\Sigma}_{k}(P)-\widetilde{\Sigma}_{k}(Q)\right\|_{H(k)^{\otimes 2}}\right)
×n+m(Σk(P^)Σk(Q^)H(k)2Σ~k(P)Σ~k(Q)H(k)2)\displaystyle~{}~{}~{}~{}~{}\times\sqrt{n+m}\left(\left\|\Sigma_{k}(\widehat{P})-\Sigma_{k}(\widehat{Q})\right\|_{H(k)^{\otimes 2}}-\left\|\widetilde{\Sigma}_{k}(P)-\widetilde{\Sigma}_{k}(Q)\right\|_{H(k)^{\otimes 2}}\right)
(Σk(P^)Σk(Q^)H(k)2+Σ~k(P)Σ~k(Q)H(k)2)\displaystyle\leq\left(\left\|\Sigma_{k}(\widehat{P})-\Sigma_{k}(\widehat{Q})\right\|_{H(k)^{\otimes 2}}+\left\|\widetilde{\Sigma}_{k}(P)-\widetilde{\Sigma}_{k}(Q)\right\|_{H(k)^{\otimes 2}}\right)
×n+mΣk(P^)Σ~k(P)(Σk(Q^)Σ~k(Q))H(k)2.\displaystyle~{}~{}~{}~{}~{}\times\sqrt{n+m}\left\|\Sigma_{k}(\widehat{P})-\widetilde{\Sigma}_{k}(P)-\left(\Sigma_{k}(\widehat{Q})-\widetilde{\Sigma}_{k}(Q)\right)\right\|_{H(k)^{\otimes 2}}.

From the direct expansion \Sigma_{k}(\widehat{P})=\widetilde{\Sigma}_{k}(P)-(\mu_{k}(P)-\mu_{k}(\widehat{P}))^{\otimes 2}, together with the analogous expansion for \widehat{Q}, we have

n+m(Σk(P^)Σk(Q^)H(k)22Σ~k(P)Σ~k(Q)H(k)22)\displaystyle\sqrt{n+m}\left(\left\|\Sigma_{k}(\widehat{P})-\Sigma_{k}(\widehat{Q})\right\|^{2}_{H(k)^{\otimes 2}}-\left\|\widetilde{\Sigma}_{k}(P)-\widetilde{\Sigma}_{k}(Q)\right\|^{2}_{H(k)^{\otimes 2}}\right)
(Σk(P^)Σk(Q^)H(k)2+Σ~k(P)Σ~k(Q)H(k)2)\displaystyle\leq\left(\left\|\Sigma_{k}(\widehat{P})-\Sigma_{k}(\widehat{Q})\right\|_{H(k)^{\otimes 2}}+\left\|\widetilde{\Sigma}_{k}(P)-\widetilde{\Sigma}_{k}(Q)\right\|_{H(k)^{\otimes 2}}\right)
×n+m(μk(P)μk(P^))2(μk(Q)μk(Q^))2H(k)2\displaystyle~{}~{}~{}~{}~{}\times\sqrt{n+m}\left\|(\mu_{k}(P)-\mu_{k}(\widehat{P}))^{\otimes 2}-(\mu_{k}(Q)-\mu_{k}(\widehat{Q}))^{\otimes 2}\right\|_{H(k)^{\otimes 2}}
(Σk(P^)Σk(Q^)H(k)2+Σ~k(P)Σ~k(Q)H(k)2)\displaystyle\leq\left(\left\|\Sigma_{k}(\widehat{P})-\Sigma_{k}(\widehat{Q})\right\|_{H(k)^{\otimes 2}}+\left\|\widetilde{\Sigma}_{k}(P)-\widetilde{\Sigma}_{k}(Q)\right\|_{H(k)^{\otimes 2}}\right)
×n+m(μk(P)μk(P^)H(k)2+μk(Q)μk(Q^)H(k)2)\displaystyle~{}~{}~{}~{}~{}\times\sqrt{n+m}\left(\left\|\mu_{k}(P)-\mu_{k}(\widehat{P})\right\|^{2}_{H(k)}+\left\|\mu_{k}(Q)-\mu_{k}(\widehat{Q})\right\|^{2}_{H(k)}\right)
=(Σk(P^)Σk(Q^)H(k)2+Σ~k(P)Σ~k(Q)H(k)2)\displaystyle=\left(\left\|\Sigma_{k}(\widehat{P})-\Sigma_{k}(\widehat{Q})\right\|_{H(k)^{\otimes 2}}+\left\|\widetilde{\Sigma}_{k}(P)-\widetilde{\Sigma}_{k}(Q)\right\|_{H(k)^{\otimes 2}}\right)
×(n+mnnμk(P)μk(P^)H(k)2+n+mmmμk(Q)μk(Q^)H(k)2)\displaystyle~{}~{}~{}~{}~{}\times\left(\frac{\sqrt{n+m}}{n}\cdot n\left\|\mu_{k}(P)-\mu_{k}(\widehat{P})\right\|^{2}_{H(k)}+\frac{\sqrt{n+m}}{m}\cdot m\left\|\mu_{k}(Q)-\mu_{k}(\widehat{Q})\right\|^{2}_{H(k)}\right)
𝑃0,\displaystyle\xrightarrow{P}0,

as n,mn,m\to\infty.

7.6 Proof of Theorem 3

By Lemma 3, it suffices to derive the asymptotic distribution of (n+m)\left\|\widetilde{\Sigma}_{k}(P)-\widetilde{\Sigma}_{k}(Q_{nm})\right\|^{2}_{H(k)^{\otimes 2}}. It follows from direct calculations that

(n+m)Σ~k(P)Σ~k(Qnm)H(k)22\displaystyle(n+m)\left\|\widetilde{\Sigma}_{k}(P)-\widetilde{\Sigma}_{k}(Q_{nm})\right\|^{2}_{H(k)^{\otimes 2}}
=(n+m)1ni=1n(k(,Xi)μk(P))21mj=1m(k(,Yj)μk(Qnm))2H(k)22\displaystyle=(n+m)\left\|\frac{1}{n}\sum_{i=1}^{n}(k(\cdot,X_{i})-\mu_{k}(P))^{\otimes 2}-\frac{1}{m}\sum_{j=1}^{m}(k(\cdot,Y_{j})-\mu_{k}(Q_{nm}))^{\otimes 2}\right\|^{2}_{H(k)^{\otimes 2}}
=(n+m)1ni=1n(k(,Xi)μk(P))2Σk(P)\displaystyle=(n+m)\left\|\frac{1}{n}\sum_{i=1}^{n}(k(\cdot,X_{i})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right.
1mj=1m(k(,Yj)μk(P)+μk(P)μk(Qnm))2+Σk(P)H(k)22\displaystyle~{}~{}~{}~{}~{}\left.-\frac{1}{m}\sum_{j=1}^{m}(k(\cdot,Y_{j})-\mu_{k}(P)+\mu_{k}(P)-\mu_{k}(Q_{nm}))^{\otimes 2}+\Sigma_{k}(P)\right\|^{2}_{H(k)^{\otimes 2}}
=(n+m)1ni=1n(k(,Xi)μk(P))2Σk(P){1mj=1m(k(,Yj)μk(P))2Σk(P)}\displaystyle=(n+m)\Bigg{\|}\frac{1}{n}\sum_{i=1}^{n}(k(\cdot,X_{i})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)-\left\{\frac{1}{m}\sum_{j=1}^{m}(k(\cdot,Y_{j})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right\}
1mj=1m(k(,Yj)μk(P))(μk(P)μk(Qnm))1mj=1m(μk(P)μk(Qnm))(k(,Yj)μk(P))\displaystyle~{}~{}~{}~{}~{}-\frac{1}{m}\sum_{j=1}^{m}(k(\cdot,Y_{j})-\mu_{k}(P))\otimes(\mu_{k}(P)-\mu_{k}(Q_{nm}))-\frac{1}{m}\sum_{j=1}^{m}(\mu_{k}(P)-\mu_{k}(Q_{nm}))\otimes(k(\cdot,Y_{j})-\mu_{k}(P))
(μk(P)μk(Qnm))2H(k)22\displaystyle~{}~{}~{}~{}~{}-(\mu_{k}(P)-\mu_{k}(Q_{nm}))^{\otimes 2}\Bigg{\|}^{2}_{H(k)^{\otimes 2}}
=(n+m)1ni=1n(k(,Xi)μk(P))2Σk(P){1mj=1m(k(,Yj)μk(P))2Σk(P)}+An+mH(k)22\displaystyle=(n+m)\Bigg{\|}\frac{1}{n}\sum_{i=1}^{n}(k(\cdot,X_{i})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)-\left\{\frac{1}{m}\sum_{j=1}^{m}(k(\cdot,Y_{j})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right\}+\frac{A}{\sqrt{n+m}}\Bigg{\|}^{2}_{H(k)^{\otimes 2}}
=(n+m)1ni=1n(k(,Xi)μk(P))2Σk(P){1mj=1m(k(,Yj)μk(P))2Σk(P)}H(k)22\displaystyle=(n+m)\Bigg{\|}\frac{1}{n}\sum_{i=1}^{n}(k(\cdot,X_{i})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)-\left\{\frac{1}{m}\sum_{j=1}^{m}(k(\cdot,Y_{j})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right\}\Bigg{\|}^{2}_{H(k)^{\otimes 2}}
+2n+m1ni=1n(k(,Xi)μk(P))2Σk(P){1mj=1m(k(,Yj)μk(P))2Σk(P)},AH(k)2\displaystyle~{}~{}~{}~{}~{}+2\sqrt{n+m}\left<\frac{1}{n}\sum_{i=1}^{n}(k(\cdot,X_{i})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)-\left\{\frac{1}{m}\sum_{j=1}^{m}(k(\cdot,Y_{j})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right\},A\right>_{H(k)^{\otimes 2}}
+AH(k)22,\displaystyle~{}~{}~{}~{}~{}+\left\|A\right\|^{2}_{H(k)^{\otimes 2}}, (26)

where

A=(\mu_{k}(\widehat{Q}_{nm})-\mu_{k}(P))\otimes(\mu_{k}(Q)-\mu_{k}(P))+(\mu_{k}(Q)-\mu_{k}(P))\otimes(\mu_{k}(\widehat{Q}_{nm})-\mu_{k}(P))-\frac{1}{\sqrt{n+m}}(\mu_{k}(P)-\mu_{k}(Q))^{\otimes 2}.

In addition, we see that

μk(Q^nm)μk(P)H(k)\displaystyle\left\|\mu_{k}(\widehat{Q}_{nm})-\mu_{k}(P)\right\|_{H(k)} =μk(Q^nm)μk(Qnm)+μk(Qnm)μk(P)H(k)\displaystyle=\left\|\mu_{k}(\widehat{Q}_{nm})-\mu_{k}(Q_{nm})+\mu_{k}(Q_{nm})-\mu_{k}(P)\right\|_{H(k)}
μk(Q^nm)μk(Qnm)H(k)+μk(Qnm)μk(P)H(k)\displaystyle\leq\left\|\mu_{k}(\widehat{Q}_{nm})-\mu_{k}(Q_{nm})\right\|_{H(k)}+\left\|\mu_{k}(Q_{nm})-\mu_{k}(P)\right\|_{H(k)}
=Op(1n+m)\displaystyle=O_{p}\left(\frac{1}{\sqrt{n+m}}\right) (27)

by (i) in Lemma 2. Thus, (27) leads to

AH(k)2\displaystyle\left\|A\right\|_{H(k)^{\otimes 2}} 2μk(Q^nm)μk(P)H(k)μk(P)μk(Q)H(k)+1n+mμk(P)μk(Q)H(k)2\displaystyle\leq 2\left\|\mu_{k}(\widehat{Q}_{nm})-\mu_{k}(P)\right\|_{H(k)}\left\|\mu_{k}(P)-\mu_{k}(Q)\right\|_{H(k)}+\frac{1}{\sqrt{n+m}}\left\|\mu_{k}(P)-\mu_{k}(Q)\right\|^{2}_{H(k)}
=Op(1n+m).\displaystyle=O_{p}\left(\frac{1}{\sqrt{n+m}}\right). (28)

Further, the following result is obtained from (i) and (ii) in Lemma 2:

1mj=1m(k(,Yj)μk(P))2Σk(P)H(k)2\displaystyle\left\|\frac{1}{m}\sum_{j=1}^{m}(k(\cdot,Y_{j})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right\|_{H(k)^{\otimes 2}}
=1mj=1m(k(,Yj)μk(P))2Σk(Qnm)+Σk(Qnm)Σk(P)H(k)2\displaystyle=\left\|\frac{1}{m}\sum_{j=1}^{m}(k(\cdot,Y_{j})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(Q_{nm})+\Sigma_{k}(Q_{nm})-\Sigma_{k}(P)\right\|_{H(k)^{\otimes 2}}
1mj=1m(k(,Yj)μk(P))2Σk(Qnm)H(k)2+Σk(Qnm)Σk(P)H(k)2\displaystyle\leq\left\|\frac{1}{m}\sum_{j=1}^{m}(k(\cdot,Y_{j})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(Q_{nm})\right\|_{H(k)^{\otimes 2}}+\left\|\Sigma_{k}(Q_{nm})-\Sigma_{k}(P)\right\|_{H(k)^{\otimes 2}}
=1mj=1m(k(,Yj)μk(P))2Σk(Qnm)H(k)2+O(1n+m)\displaystyle=\left\|\frac{1}{m}\sum_{j=1}^{m}(k(\cdot,Y_{j})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(Q_{nm})\right\|_{H(k)^{\otimes 2}}+O\left(\frac{1}{\sqrt{n+m}}\right)
=1mj=1m(k(,Yj)μk(Qnm)+μk(Qnm)μk(P))2Σk(Qnm)H(k)2+O(1n+m)\displaystyle=\left\|\frac{1}{m}\sum_{j=1}^{m}(k(\cdot,Y_{j})-\mu_{k}(Q_{nm})+\mu_{k}(Q_{nm})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(Q_{nm})\right\|_{H(k)^{\otimes 2}}+O\left(\frac{1}{\sqrt{n+m}}\right)
=1mj=1m(k(,Yj)μk(Qnm))2Σk(Qnm)+(μk(Q^nm)μk(Qnm))(μk(Qnm)μk(P))\displaystyle=\Bigg{\|}\frac{1}{m}\sum_{j=1}^{m}(k(\cdot,Y_{j})-\mu_{k}(Q_{nm}))^{\otimes 2}-\Sigma_{k}(Q_{nm})+(\mu_{k}(\widehat{Q}_{nm})-\mu_{k}(Q_{nm}))\otimes(\mu_{k}(Q_{nm})-\mu_{k}(P))
+(μk(Qnm)μk(P))(μk(Q^nm)μk(Qnm))+(μk(Qnm)μk(P))2H(k)2+O(1n+m)\displaystyle~{}~{}~{}~{}~{}+(\mu_{k}(Q_{nm})-\mu_{k}(P))\otimes(\mu_{k}(\widehat{Q}_{nm})-\mu_{k}(Q_{nm}))+(\mu_{k}(Q_{nm})-\mu_{k}(P))^{\otimes 2}\Bigg{\|}_{H(k)^{\otimes 2}}+O\left(\frac{1}{\sqrt{n+m}}\right)
\displaystyle\leq\left\|\widetilde{\Sigma}_{k}(Q_{nm})-\Sigma_{k}(Q_{nm})\right\|_{H(k)^{\otimes 2}}+\frac{2}{\sqrt{n+m}}\left\|\mu_{k}(\widehat{Q}_{nm})-\mu_{k}(Q_{nm})\right\|_{H(k)}\left\|\mu_{k}(Q)-\mu_{k}(P)\right\|_{H(k)}
+1n+mμk(Q)μk(P)H(k)2+O(1n+m)\displaystyle~{}~{}~{}~{}~{}+\frac{1}{n+m}\left\|\mu_{k}(Q)-\mu_{k}(P)\right\|^{2}_{H(k)}+O\left(\frac{1}{\sqrt{n+m}}\right)
=Op(1n+m).\displaystyle=O_{p}\left(\frac{1}{\sqrt{n+m}}\right). (29)

The results (28) and (29) yield

n+m1ni=1n(k(,Xi)μk(P))2Σk(P){1mj=1m(k(,Yj)μk(P))2Σk(P)},AH(k)2\displaystyle\sqrt{n+m}\left<\frac{1}{n}\sum_{i=1}^{n}(k(\cdot,X_{i})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)-\left\{\frac{1}{m}\sum_{j=1}^{m}(k(\cdot,Y_{j})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right\},A\right>_{H(k)^{\otimes 2}}
n+m1ni=1n(k(,Xi)μk(P))2Σk(P){1mj=1m(k(,Yj)μk(P))2Σk(P)}H(k)2\displaystyle\leq\sqrt{n+m}\left\|\frac{1}{n}\sum_{i=1}^{n}(k(\cdot,X_{i})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)-\left\{\frac{1}{m}\sum_{j=1}^{m}(k(\cdot,Y_{j})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right\}\right\|_{H(k)^{\otimes 2}}
×AH(k)2\displaystyle~{}~{}~{}~{}~{}\times\left\|A\right\|_{H(k)^{\otimes 2}}
\displaystyle\leq\sqrt{n+m}\Bigg{\{}\left\|\frac{1}{n}\sum_{i=1}^{n}(k(\cdot,X_{i})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right\|_{H(k)^{\otimes 2}}
+1mj=1m(k(,Yj)μk(P))2Σk(P)H(k)2}AH(k)2\displaystyle~{}~{}~{}~{}~{}+\left\|\frac{1}{m}\sum_{j=1}^{m}(k(\cdot,Y_{j})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right\|_{H(k)^{\otimes 2}}\Bigg{\}}\left\|A\right\|_{H(k)^{\otimes 2}}
=n+m{Op(1n+m)+Op(1n+m)}Op(1n+m)\displaystyle=\sqrt{n+m}\left\{O_{p}\left(\frac{1}{\sqrt{n+m}}\right)+O_{p}\left(\frac{1}{\sqrt{n+m}}\right)\right\}\cdot O_{p}\left(\frac{1}{\sqrt{n+m}}\right)
=Op(1n+m).\displaystyle=O_{p}\left(\frac{1}{\sqrt{n+m}}\right). (30)

Applying (28) and (30) to (26), we obtain

(n+m)Σ~k(P)Σ~k(Qnm)H(k)22\displaystyle(n+m)\left\|\widetilde{\Sigma}_{k}(P)-\widetilde{\Sigma}_{k}(Q_{nm})\right\|^{2}_{H(k)^{\otimes 2}}
=(n+m)1ni=1n(k(,Xi)μk(P))2Σk(P){1mj=1m(k(,Yj)μk(P))2Σk(P)}H(k)22\displaystyle=(n+m)\Bigg{\|}\frac{1}{n}\sum_{i=1}^{n}(k(\cdot,X_{i})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)-\left\{\frac{1}{m}\sum_{j=1}^{m}(k(\cdot,Y_{j})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right\}\Bigg{\|}^{2}_{H(k)^{\otimes 2}}
+Op(1n+m)\displaystyle~{}~{}~{}~{}~{}+O_{p}\left(\frac{1}{\sqrt{n+m}}\right)
=(n+m)1ni=1n(k(,Xi)μk(P))2Σk(P){1mj=1m(k(,Yj)μk(P))2Σk(P)},\displaystyle=(n+m)\left<\frac{1}{n}\sum_{i=1}^{n}(k(\cdot,X_{i})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)-\left\{\frac{1}{m}\sum_{j=1}^{m}(k(\cdot,Y_{j})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right\},\right.
1ns=1n(k(,Xs)μk(P))2Σk(P){1mt=1m(k(,Yt)μk(P))2Σk(P)}H(k)2\displaystyle~{}~{}~{}~{}~{}\left.\frac{1}{n}\sum_{s=1}^{n}(k(\cdot,X_{s})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)-\left\{\frac{1}{m}\sum_{t=1}^{m}(k(\cdot,Y_{t})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right\}\right>_{H(k)^{\otimes 2}}
+Op(1n+m)\displaystyle~{}~{}~{}~{}~{}+O_{p}\left(\frac{1}{\sqrt{n+m}}\right)
=n+mn2m2i,s=1nj,t=1m(k(,Xi)μk(P))2Σk(P){(k(,Yj)μk(P))2Σk(P)},\displaystyle=\frac{n+m}{n^{2}m^{2}}\sum_{i,s=1}^{n}\sum_{j,t=1}^{m}\left<(k(\cdot,X_{i})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)-\left\{(k(\cdot,Y_{j})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right\},\right.
(k(,Xs)μk(P))2Σk(P){(k(,Yt)μk(P))2Σk(P)}H(k)2\displaystyle~{}~{}~{}~{}~{}\left.(k(\cdot,X_{s})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)-\left\{(k(\cdot,Y_{t})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right\}\right>_{H(k)^{\otimes 2}}
+Op(1n+m)\displaystyle~{}~{}~{}~{}~{}+O_{p}\left(\frac{1}{\sqrt{n+m}}\right)
=n+mn2i,s=1n(k(,Xi)μk(P))2Σk(P),(k(,Xs)μk(P))2Σk(P)H(k)2\displaystyle=\frac{n+m}{n^{2}}\sum_{i,s=1}^{n}\left<(k(\cdot,X_{i})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P),(k(\cdot,X_{s})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right>_{H(k)^{\otimes 2}}
+n+mm2j,t=1m(k(,Yj)μk(P))2Σk(P),(k(,Yt)μk(P))2Σk(P)H(k)2\displaystyle~{}~{}~{}~{}~{}+\frac{n+m}{m^{2}}\sum_{j,t=1}^{m}\left<(k(\cdot,Y_{j})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P),(k(\cdot,Y_{t})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right>_{H(k)^{\otimes 2}}
2(n+m)nmi=1nj=1m(k(,Xi)μk(P))2Σk(P),(k(,Yj)μk(P))2Σk(P)H(k)2\displaystyle~{}~{}~{}~{}~{}-\frac{2(n+m)}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m}\left<(k(\cdot,X_{i})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P),(k(\cdot,Y_{j})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right>_{H(k)^{\otimes 2}}
+Op(1n+m)\displaystyle~{}~{}~{}~{}~{}+O_{p}\left(\frac{1}{\sqrt{n+m}}\right)
=n+mn2i,s=1nh(Xi,Xs)+n+mm2j,t=1mh(Yj,Yt)2(n+m)nmi=1nj=1mh(Xi,Yj)+Op(1n+m),\displaystyle=\frac{n+m}{n^{2}}\sum_{i,s=1}^{n}h(X_{i},X_{s})+\frac{n+m}{m^{2}}\sum_{j,t=1}^{m}h(Y_{j},Y_{t})-\frac{2(n+m)}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m}h(X_{i},Y_{j})+O_{p}\left(\frac{1}{\sqrt{n+m}}\right),

where h(x,y)h(x,y) is defined in (5). Since 𝔼XP[k(X,X)2]<\mathbb{E}_{X\sim P}[k(X,X)^{2}]<\infty, SkS_{k} in (6) is a Hilbert–Schmidt operator by Theorem VI.22 in [12]. Therefore, from Theorem 1 in [10], we have

h(x,y)==1θΨ(x)Ψ(y),h(x,y)=\sum_{\ell=1}^{\infty}\theta_{\ell}\Psi_{\ell}(x)\Psi_{\ell}(y),

where θ\theta_{\ell} is an eigenvalue of SkS_{k} and Ψ\Psi_{\ell} is the eigenfunction corresponding to θ\theta_{\ell}; these eigenpairs satisfy

h(x,y)Ψ(y)𝑑P(y)=θΨ(x)andΨi(y)Ψj(y)𝑑P(y)=δij\int_{\mathcal{H}}h(x,y)\Psi_{\ell}(y)dP(y)=\theta_{\ell}\Psi_{\ell}(x)~{}~{}\text{and}~{}~{}\int_{\mathcal{H}}\Psi_{i}(y)\Psi_{j}(y)dP(y)=\delta_{ij}

where δij\delta_{ij} is the Kronecker delta. From h(x,y)Ψ(y)𝑑P(y)=θΨ(x)\int_{\mathcal{H}}h(x,y)\Psi_{\ell}(y)dP(y)=\theta_{\ell}\Psi_{\ell}(x), we have

𝔼XP[Ψ(X)]=1θ𝔼XP[h(X,y)]Ψ(y)𝑑P(y)=0\mathbb{E}_{X\sim P}[\Psi_{\ell}(X)]=\frac{1}{\theta_{\ell}}\int_{\mathcal{H}}\mathbb{E}_{X\sim P}[h(X,y)]\Psi_{\ell}(y)dP(y)=0

and

𝔼YQnm[Ψ(Y)]\displaystyle\mathbb{E}_{Y\sim Q_{nm}}[\Psi_{\ell}(Y)] =1θ𝔼YQnm[h(Y,y)]Ψ(y)dP(y).\displaystyle=\frac{1}{\theta_{\ell}}\int_{\mathcal{H}}\mathbb{E}_{Y\sim Q_{nm}}[h(Y,y)]\Psi_{\ell}(y)dP(y).

By direct calculation, we get

𝔼YQnm[(k(,Y)μk(P))2Σk(P)]\displaystyle\mathbb{E}_{Y\sim Q_{nm}}[(k(\cdot,Y)-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)]
=𝔼YQnm[(k(,Y)μk(P))2]Σk(P)\displaystyle=\mathbb{E}_{Y\sim Q_{nm}}[(k(\cdot,Y)-\mu_{k}(P))^{\otimes 2}]-\Sigma_{k}(P)
=(k(,y)μk(P))2dQnm(y)Σk(P)\displaystyle=\int_{\mathcal{H}}(k(\cdot,y)-\mu_{k}(P))^{\otimes 2}dQ_{nm}(y)-\Sigma_{k}(P)
=(k(,y)μk(P))2d{(11n+m)P+1n+mQ}(y)Σk(P)\displaystyle=\int_{\mathcal{H}}(k(\cdot,y)-\mu_{k}(P))^{\otimes 2}d\left\{\left(1-\frac{1}{\sqrt{n+m}}\right)P+\frac{1}{\sqrt{n+m}}Q\right\}(y)-\Sigma_{k}(P)
=(11n+m)𝔼YP[(k(,Y)μk(P))2]+1n+m𝔼YQ[(k(,Y)μk(P))2]Σk(P)\displaystyle=\left(1-\frac{1}{\sqrt{n+m}}\right)\mathbb{E}_{Y\sim P}[(k(\cdot,Y)-\mu_{k}(P))^{\otimes 2}]+\frac{1}{\sqrt{n+m}}\mathbb{E}_{Y\sim Q}[(k(\cdot,Y)-\mu_{k}(P))^{\otimes 2}]-\Sigma_{k}(P)
=(11n+m)Σk(P)+1n+m(Σk(Q)+(μk(Q)μk(P))2)Σk(P)\displaystyle=\left(1-\frac{1}{\sqrt{n+m}}\right)\Sigma_{k}(P)+\frac{1}{\sqrt{n+m}}\left(\Sigma_{k}(Q)+(\mu_{k}(Q)-\mu_{k}(P))^{\otimes 2}\right)-\Sigma_{k}(P)
=1n+m(Σk(Q)Σk(P)+(μk(Q)μk(P))2)\displaystyle=\frac{1}{\sqrt{n+m}}(\Sigma_{k}(Q)-\Sigma_{k}(P)+(\mu_{k}(Q)-\mu_{k}(P))^{\otimes 2})
=:1n+mζ(P,Q)\displaystyle=:\frac{1}{\sqrt{n+m}}\zeta(P,Q)

and

θ𝔼YQnm[Ψ(Y)]\displaystyle\theta_{\ell}\mathbb{E}_{Y\sim Q_{nm}}[\Psi_{\ell}(Y)]
=𝔼YQnm[h(Y,y)]Ψ(y)dP(y)\displaystyle=\int_{\mathcal{H}}\mathbb{E}_{Y\sim Q_{nm}}[h(Y,y)]\Psi_{\ell}(y)dP(y)
=𝔼YQnm[(k(,Y)μk(P))2Σk(P),(k(,y)μk(P))2Σk(P)H(k)2]Ψ(y)dP(y)\displaystyle=\int_{\mathcal{H}}\mathbb{E}_{Y\sim Q_{nm}}[\left<(k(\cdot,Y)-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P),(k(\cdot,y)-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right>_{H(k)^{\otimes 2}}]\Psi_{\ell}(y)dP(y)
=𝔼YQnm[(k(,Y)μk(P))2Σk(P)],(k(,y)μk(P))2Σk(P)H(k)2Ψ(y)dP(y)\displaystyle=\int_{\mathcal{H}}\left<\mathbb{E}_{Y\sim Q_{nm}}[(k(\cdot,Y)-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)],(k(\cdot,y)-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right>_{H(k)^{\otimes 2}}\Psi_{\ell}(y)dP(y)
=1n+mζ(P,Q),h(,y)H(k)2Ψ(y)dP(y)\displaystyle=\frac{1}{\sqrt{n+m}}\int_{\mathcal{H}}\left<\zeta(P,Q),h(\cdot,y)\right>_{H(k)^{\otimes 2}}\Psi_{\ell}(y)dP(y)
=:1n+mζ(P,Q).\displaystyle=:\frac{1}{\sqrt{n+m}}\zeta_{\ell}(P,Q).

Hence

𝔼YQnm[Ψ(Y)]=1n+mζ(P,Q)θ\mathbb{E}_{Y\sim Q_{nm}}[\Psi_{\ell}(Y)]=\frac{1}{\sqrt{n+m}}\frac{\zeta_{\ell}(P,Q)}{\theta_{\ell}}

and

VYQnm[Ψ(Y)]\displaystyle V_{Y\sim Q_{nm}}[\Psi_{\ell}(Y)]
=𝔼YQnm[Ψ(Y)2]{𝔼YQnm[Ψ(Y)]}2\displaystyle=\mathbb{E}_{Y\sim Q_{nm}}[\Psi_{\ell}(Y)^{2}]-\left\{\mathbb{E}_{Y\sim Q_{nm}}[\Psi_{\ell}(Y)]\right\}^{2}
=Ψ(y)2dQnm(y)1n+mζ(P,Q)2θ2\displaystyle=\int_{\mathcal{H}}\Psi_{\ell}(y)^{2}dQ_{nm}(y)-\frac{1}{n+m}\frac{\zeta_{\ell}(P,Q)^{2}}{\theta_{\ell}^{2}}
=Ψ(y)2dP(y)+1n+mΨ(y)2d(QP)(y)1n+mζ(P,Q)2θ2\displaystyle=\int_{\mathcal{H}}\Psi_{\ell}(y)^{2}dP(y)+\frac{1}{\sqrt{n+m}}\int_{\mathcal{H}}\Psi_{\ell}(y)^{2}d(Q-P)(y)-\frac{1}{n+m}\frac{\zeta_{\ell}(P,Q)^{2}}{\theta_{\ell}^{2}}
=1+1n+mτ1n+mζ(P,Q)2θ2,\displaystyle=1+\frac{1}{\sqrt{n+m}}\tau_{\ell\ell}-\frac{1}{n+m}\frac{\zeta_{\ell}(P,Q)^{2}}{\theta_{\ell}^{2}},

where

τs=Ψ(y)Ψs(y)d(QP)(y).\tau_{\ell s}=\int_{\mathcal{H}}\Psi_{\ell}(y)\Psi_{s}(y)d(Q-P)(y).
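In practice the eigenpairs (θℓ, Ψℓ) of Sk are rarely available in closed form, but they can be approximated by eigendecomposing the Gram matrix of h on a sample (a Nyström-type approximation). The following is only a minimal sketch, using a hypothetical one-term degenerate kernel h(x, y) = xy under P = N(0, 1), for which θ₁ = 1 and Ψ₁(x) = x:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.standard_normal(n)       # sample from P = N(0, 1)

# Gram matrix of the hypothetical one-term degenerate kernel h(x, y) = x * y,
# whose only eigenpair under P is theta_1 = 1 with Psi_1(x) = x
H = np.outer(x, x)

# Nystrom-type approximation: eigenpairs of H / n estimate (theta_l, Psi_l)
w, V = np.linalg.eigh(H / n)     # eigenvalues in ascending order
theta_hat = w[-1]                # estimates theta_1 = 1
psi_hat = np.sqrt(n) * V[:, -1]  # estimates Psi_1 at the sample points
```

The rescaling by √n makes the empirical L2(ℋ, P)-norm of psi_hat equal to one, mirroring the orthonormality condition above.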

Therefore, applying the central limit theorem to the Ψ\Psi_{\ell}'s, we get

\displaystyle(n+m)\left\|\widetilde{\Sigma}_{k}(P)-\widetilde{\Sigma}_{k}(Q_{nm})\right\|^{2}_{H(k)^{\otimes 2}}
=n+mn2i,s=1nh(Xi,Xs)+n+mm2j,t=1mh(Yj,Yt)2(n+m)nmi=1nj=1mh(Xi,Yj)+Op(1n+m)\displaystyle=\frac{n+m}{n^{2}}\sum_{i,s=1}^{n}h(X_{i},X_{s})+\frac{n+m}{m^{2}}\sum_{j,t=1}^{m}h(Y_{j},Y_{t})-\frac{2(n+m)}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m}h(X_{i},Y_{j})+O_{p}\left(\frac{1}{\sqrt{n+m}}\right)
=n+mn2i,s=1n=1θΨ(Xi)Ψ(Xs)+n+mm2j,t=1m=1θΨ(Yj)Ψ(Yt)\displaystyle=\frac{n+m}{n^{2}}\sum_{i,s=1}^{n}\sum_{\ell=1}^{\infty}\theta_{\ell}\Psi_{\ell}(X_{i})\Psi_{\ell}(X_{s})+\frac{n+m}{m^{2}}\sum_{j,t=1}^{m}\sum_{\ell=1}^{\infty}\theta_{\ell}\Psi_{\ell}(Y_{j})\Psi_{\ell}(Y_{t})
2(n+m)nmi=1nj=1m=1θΨ(Xi)Ψ(Yj)\displaystyle~{}~{}~{}~{}~{}-\frac{2(n+m)}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m}\sum_{\ell=1}^{\infty}\theta_{\ell}\Psi_{\ell}(X_{i})\Psi_{\ell}(Y_{j})
=n+mn=1θ(1ni=1nΨ(Xi))2+n+mm=1θ(1mj=1mΨ(Yj))2\displaystyle=\frac{n+m}{n}\sum_{\ell=1}^{\infty}\theta_{\ell}\left(\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\Psi_{\ell}(X_{i})\right)^{2}+\frac{n+m}{m}\sum_{\ell=1}^{\infty}\theta_{\ell}\left(\frac{1}{\sqrt{m}}\sum_{j=1}^{m}\Psi_{\ell}(Y_{j})\right)^{2}
2(n+m)nm=1θ(1ni=1nΨ(Xi))(1mj=1mΨ(Yj))\displaystyle~{}~{}~{}~{}~{}-\frac{2(n+m)}{\sqrt{nm}}\sum_{\ell=1}^{\infty}\theta_{\ell}\left(\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\Psi_{\ell}(X_{i})\right)\left(\frac{1}{\sqrt{m}}\sum_{j=1}^{m}\Psi_{\ell}(Y_{j})\right)
𝒟1ρ(1ρ)=1θZ2,ZN(ρ(1ρ)ζ(P,Q)θ,1),\displaystyle\xrightarrow{\mathcal{D}}\frac{1}{\rho(1-\rho)}\sum_{\ell=1}^{\infty}\theta_{\ell}Z_{\ell}^{2},~{}~{}Z_{\ell}\sim N\left(\sqrt{\rho(1-\rho)}\cdot\frac{\zeta_{\ell}(P,Q)}{\theta_{\ell}},1\right),

as n,mn,m\to\infty.
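The limit above is a weighted sum of (possibly non-central) squared normals, so its quantiles can be approximated by Monte Carlo once estimates of the θℓ are available. A minimal sketch of the central case ζℓ(P, Q) = 0, with a hypothetical truncated spectrum θℓ = ℓ⁻² (both the spectrum and ρ below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.0 / np.arange(1, 51) ** 2      # hypothetical truncated spectrum theta_l
rho = 0.5                                # limit of n / (n + m)

# Monte Carlo draws from the null limit (zeta_l(P, Q) = 0):
# (1 / (rho * (1 - rho))) * sum_l theta_l * Z_l^2 with Z_l ~ N(0, 1)
Z = rng.standard_normal((100_000, theta.size))
null_draws = (Z ** 2 @ theta) / (rho * (1 - rho))

crit_95 = np.quantile(null_draws, 0.95)  # approximate 5%-level critical value
```

Since E[Zℓ²] = 1 under the null, the mean of the draws concentrates around (Σℓ θℓ)/(ρ(1−ρ)), which gives a quick sanity check on the simulation.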

7.7 Proof of Lemma 2

(i) First, we prove μk(Q^nm)μk(Qnm)H(k)=Op(1/n+m)\left\|\mu_{k}(\widehat{Q}_{nm})-\mu_{k}(Q_{nm})\right\|_{H(k)}=O_{p}(1/\sqrt{n+m}). For all δ>0\delta>0, there exists N1N_{1}\in\mathbb{N} such that

\frac{1}{\sqrt{n+m}}\left|\mathbb{E}_{Y\sim P}[\left\|k(\cdot,Y)-\mu_{k}(P)\right\|^{2}_{H(k)}]-\mathbb{E}_{Y\sim Q}[\left\|k(\cdot,Y)-\mu_{k}(P)\right\|^{2}_{H(k)}]\right|<\frac{\delta}{2}

for all n,m>N1n,m>N_{1}. In addition, there exists N2N_{2}\in\mathbb{N} such that

\left|\frac{n+m}{m}-\frac{1}{1-\rho}\right|<\frac{\delta}{2}

for all n,m>N2n,m>N_{2}. We put

Mδ=1δ(11ρ+δ2)(𝔼YP[k(,Y)μk(P)2H(k)]+δ2),M_{\delta}=\sqrt{\frac{1}{\delta}\left(\frac{1}{1-\rho}+\frac{\delta}{2}\right)\left(\mathbb{E}_{Y\sim P}[\left\|k(\cdot,Y)-\mu_{k}(P)\right\|^{2}_{H(k)}]+\frac{\delta}{2}\right)},

and Nδ=max{N1,N2}N_{\delta}=\max\{N_{1},N_{2}\}. Since

μk(Qnm)=(11n+m)μk(P)+1n+mμk(Q)=μk(P)1n+m(μk(P)μk(Q)),\mu_{k}(Q_{nm})=\left(1-\frac{1}{\sqrt{n+m}}\right)\mu_{k}(P)+\frac{1}{\sqrt{n+m}}\mu_{k}(Q)=\mu_{k}(P)-\frac{1}{\sqrt{n+m}}(\mu_{k}(P)-\mu_{k}(Q)),

for all n,m>Nδn,m>N_{\delta}, we get

(n+mμk(Q^nm)μk(Qnm)H(k)>Mδ)\displaystyle\mathbb{P}\left(\sqrt{n+m}\left\|\mu_{k}(\widehat{Q}_{nm})-\mu_{k}(Q_{nm})\right\|_{H(k)}>M_{\delta}\right)
\displaystyle\leq\frac{\mathbb{E}_{Q_{nm}}\left[(n+m)\left\|\mu_{k}(\widehat{Q}_{nm})-\mu_{k}(Q_{nm})\right\|^{2}_{H(k)}\right]}{M_{\delta}^{2}}
=(n+m)𝔼Qnm[1mj=1m(k(,Yj)μk(Qnm))2H(k)]Mδ2\displaystyle=\frac{(n+m)\mathbb{E}_{Q_{nm}}\left[\left\|\frac{1}{m}\sum_{j=1}^{m}(k(\cdot,Y_{j})-\mu_{k}(Q_{nm}))\right\|^{2}_{H(k)}\right]}{M_{\delta}^{2}}
\displaystyle=\frac{(n+m)\mathbb{E}_{Y\sim Q_{nm}}[\left\|k(\cdot,Y)-\mu_{k}(Q_{nm})\right\|^{2}_{H(k)}]}{mM_{\delta}^{2}}
=n+mmMδ2𝔼YQnm[k(,Y)μk(P)+1n+m(μk(P)μk(Q))2H(k)]\displaystyle=\frac{n+m}{mM_{\delta}^{2}}\mathbb{E}_{Y\sim Q_{nm}}\left[\left\|k(\cdot,Y)-\mu_{k}(P)+\frac{1}{\sqrt{n+m}}(\mu_{k}(P)-\mu_{k}(Q))\right\|^{2}_{H(k)}\right]
\displaystyle=\frac{n+m}{mM_{\delta}^{2}}\Big{(}\mathbb{E}_{Y\sim Q_{nm}}[\left\|k(\cdot,Y)-\mu_{k}(P)\right\|^{2}_{H(k)}]+\frac{2}{\sqrt{n+m}}\mathbb{E}_{Y\sim Q_{nm}}[\left<k(\cdot,Y)-\mu_{k}(P),\mu_{k}(P)-\mu_{k}(Q)\right>_{H(k)}]
+1n+mμk(P)μk(Q)2H(k))\displaystyle~{}~{}~{}~{}~{}+\frac{1}{n+m}\left\|\mu_{k}(P)-\mu_{k}(Q)\right\|^{2}_{H(k)}\Big{)}
=n+mmMδ2(𝔼YQnm[k(,Y)μk(P)2H(k)]1n+mμk(P)μk(Q)2H(k))\displaystyle=\frac{n+m}{mM_{\delta}^{2}}\left(\mathbb{E}_{Y\sim Q_{nm}}[\left\|k(\cdot,Y)-\mu_{k}(P)\right\|^{2}_{H(k)}]-\frac{1}{n+m}\left\|\mu_{k}(P)-\mu_{k}(Q)\right\|^{2}_{H(k)}\right)
n+mmMδ2𝔼YQnm[k(,Y)μk(P)2H(k)]\displaystyle\leq\frac{n+m}{mM_{\delta}^{2}}\mathbb{E}_{Y\sim Q_{nm}}[\left\|k(\cdot,Y)-\mu_{k}(P)\right\|^{2}_{H(k)}]
\displaystyle=\frac{n+m}{mM_{\delta}^{2}}\left(\left(1-\frac{1}{\sqrt{n+m}}\right)\mathbb{E}_{Y\sim P}[\left\|k(\cdot,Y)-\mu_{k}(P)\right\|^{2}_{H(k)}]+\frac{1}{\sqrt{n+m}}\mathbb{E}_{Y\sim Q}[\left\|k(\cdot,Y)-\mu_{k}(P)\right\|^{2}_{H(k)}]\right)
=n+mmMδ2(𝔼YP[k(,Y)μk(P)2H(k)]\displaystyle=\frac{n+m}{mM_{\delta}^{2}}\Big{(}\mathbb{E}_{Y\sim P}[\left\|k(\cdot,Y)-\mu_{k}(P)\right\|^{2}_{H(k)}]
\displaystyle~{}~{}~{}~{}~{}-\frac{1}{\sqrt{n+m}}\left(\mathbb{E}_{Y\sim P}[\left\|k(\cdot,Y)-\mu_{k}(P)\right\|^{2}_{H(k)}]-\mathbb{E}_{Y\sim Q}[\left\|k(\cdot,Y)-\mu_{k}(P)\right\|^{2}_{H(k)}]\right)\Big{)}
<1Mδ2(11ρ+δ2)(𝔼YP[k(,Y)μk(P)2H(k)]+δ2)\displaystyle<\frac{1}{M_{\delta}^{2}}\left(\frac{1}{1-\rho}+\frac{\delta}{2}\right)\left(\mathbb{E}_{Y\sim P}[\left\|k(\cdot,Y)-\mu_{k}(P)\right\|^{2}_{H(k)}]+\frac{\delta}{2}\right)
=δ.\displaystyle=\delta.

Therefore, we obtain μk(Q^nm)μk(Qnm)H(k)=Op(1/n+m)\left\|\mu_{k}(\widehat{Q}_{nm})-\mu_{k}(Q_{nm})\right\|_{H(k)}=O_{p}(1/\sqrt{n+m}).
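The Markov-inequality argument above rests on the empirical mean embedding satisfying E‖μ̂ − μ‖² = (E k(Y, Y) − E k(Y, Y′))/m, i.e. a 1/m decay. This rate can be checked numerically in a case where the population quantities are in closed form; the following is a minimal sketch for the Gaussian kernel k(x, y) = exp(−(x − y)²/2) and P = N(0, 1) with no mixture drift (so Q_nm = P), where the closed-form constants below follow from standard Gaussian integrals:

```python
import numpy as np

rng = np.random.default_rng(0)

def sq_err(m):
    # ||mu_k(P_hat) - mu_k(P)||^2 in H(k) for k(x, y) = exp(-(x - y)^2 / 2)
    # and P = N(0, 1), using the closed forms
    #   <k(., x), mu_k(P)> = exp(-x^2 / 4) / sqrt(2),   ||mu_k(P)||^2 = 1 / sqrt(3)
    y = rng.standard_normal(m)
    K = np.exp(-0.5 * (y[:, None] - y[None, :]) ** 2)
    return K.mean() - 2 * np.mean(np.exp(-y ** 2 / 4) / np.sqrt(2)) + 1 / np.sqrt(3)

# prediction from the Markov bound: E||mu_k(P_hat) - mu_k(P)||^2 = (1 - 1/sqrt(3)) / m
err_100 = np.mean([sq_err(100) for _ in range(300)])
err_400 = np.mean([sq_err(400) for _ in range(300)])
```

Quadrupling m should divide the averaged squared error by roughly four, matching the O_p(1/√(n+m)) rate in norm.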

(ii) Next, we prove Σ~k(Qnm)Σk(Qnm)H(k)2=Op(1/n+m)\left\|\widetilde{\Sigma}_{k}(Q_{nm})-\Sigma_{k}(Q_{nm})\right\|_{H(k)^{\otimes 2}}=O_{p}(1/\sqrt{n+m}). For all δ>0\delta>0, we put

Mδ=1δ(11ρ+δ2)(𝔼YP[(k(,Y)μk(P))2Σk(P)2H(k)2]+δ)M_{\delta}=\sqrt{\frac{1}{\delta}\left(\frac{1}{1-\rho}+\frac{\delta}{2}\right)(\mathbb{E}_{Y\sim P}[\left\|(k(\cdot,Y)-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right\|^{2}_{H(k)^{\otimes 2}}]+\delta)}

and

1n+mA(n,m,Y):=\displaystyle\frac{1}{\sqrt{n+m}}A(n,m,Y):= 1n+m{(k(,Y)μk(P))(μk(P)μk(Q))\displaystyle\frac{1}{\sqrt{n+m}}\Big{\{}(k(\cdot,Y)-\mu_{k}(P))\otimes(\mu_{k}(P)-\mu_{k}(Q))
+(μk(P)μk(Q))(k(,Y)μk(P))+1n+m(μk(P)μk(Q))2}.\displaystyle~{}~{}~{}~{}~{}+(\mu_{k}(P)-\mu_{k}(Q))\otimes(k(\cdot,Y)-\mu_{k}(P))+\frac{1}{\sqrt{n+m}}(\mu_{k}(P)-\mu_{k}(Q))^{\otimes 2}\Big{\}}.

Then, there exists N1N_{1}\in\mathbb{N} such that

\displaystyle\Big{|}\frac{1}{n+m}\mathbb{E}_{Y\sim Q_{nm}}[\left\|A(n,m,Y)\right\|^{2}_{H(k)^{\otimes 2}}]
\displaystyle~{}~{}~{}~{}~{}-\frac{2}{\sqrt{n+m}}\mathbb{E}_{Y\sim Q_{nm}}\left[\left<(k(\cdot,Y)-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P),A(n,m,Y)\right>_{H(k)^{\otimes 2}}\right]\Big{|}<\frac{\delta}{2}

for all n,m>N1n,m>N_{1}. In addition, there exists N2N_{2}\in\mathbb{N} such that

|1n+m(𝔼YP[(k(,Y)μk(P))2Σk(P)2H(k)2]\displaystyle\Big{|}\frac{1}{\sqrt{n+m}}\Big{(}\mathbb{E}_{Y\sim P}[\left\|(k(\cdot,Y)-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right\|^{2}_{H(k)^{\otimes 2}}]
𝔼YQ[(k(,Y)μk(P))2Σk(P)2H(k)2])|<δ2\displaystyle~{}~{}~{}~{}~{}-\mathbb{E}_{Y\sim Q}[\left\|(k(\cdot,Y)-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right\|^{2}_{H(k)^{\otimes 2}}]\Big{)}\Big{|}<\frac{\delta}{2}

for all n,m>N2n,m>N_{2}, and there exists N3N_{3}\in\mathbb{N} such that

\left|\frac{n+m}{m}-\frac{1}{1-\rho}\right|<\frac{\delta}{2}

for all n,m>N3n,m>N_{3}.

Let Nδ=max{N1,N2,N3}N_{\delta}=\max\{N_{1},N_{2},N_{3}\}. For all n,m>Nδn,m>N_{\delta}, we obtain that

(n+mΣ~k(Qnm)Σk(Qnm)H(k)2>Mδ)\displaystyle\mathbb{P}(\sqrt{n+m}\left\|\widetilde{\Sigma}_{k}(Q_{nm})-\Sigma_{k}(Q_{nm})\right\|_{H(k)^{\otimes 2}}>M_{\delta})
\displaystyle\leq\frac{\mathbb{E}_{Y\sim Q_{nm}}\left[(n+m)\left\|\widetilde{\Sigma}_{k}(Q_{nm})-\Sigma_{k}(Q_{nm})\right\|_{H(k)^{\otimes 2}}^{2}\right]}{M_{\delta}^{2}}
\displaystyle=\frac{(n+m)\mathbb{E}_{Y\sim Q_{nm}}[\left\|(k(\cdot,Y)-\mu_{k}(Q_{nm}))^{\otimes 2}-\Sigma_{k}(Q_{nm})\right\|^{2}_{H(k)^{\otimes 2}}]}{mM_{\delta}^{2}}
=n+mmMδ2𝔼YQnm[(k(,Y)μk(Qnm))2Σk(P)+1n+m(Σk(P)Σk(Q))\displaystyle=\frac{n+m}{mM_{\delta}^{2}}\mathbb{E}_{Y\sim Q_{nm}}\Big{[}\Big{\|}(k(\cdot,Y)-\mu_{k}(Q_{nm}))^{\otimes 2}-\Sigma_{k}(P)+\frac{1}{\sqrt{n+m}}(\Sigma_{k}(P)-\Sigma_{k}(Q))
1n+m(11n+m)(μk(P)μk(Q))22H(k)2]\displaystyle~{}~{}~{}~{}~{}-\frac{1}{\sqrt{n+m}}\left(1-\frac{1}{\sqrt{n+m}}\right)(\mu_{k}(P)-\mu_{k}(Q))^{\otimes 2}\Big{\|}^{2}_{H(k)^{\otimes 2}}\Big{]}
\displaystyle=\frac{n+m}{mM_{\delta}^{2}}\Big{\{}\mathbb{E}_{Y\sim Q_{nm}}\left[\left\|(k(\cdot,Y)-\mu_{k}(Q_{nm}))^{\otimes 2}-\Sigma_{k}(P)\right\|^{2}_{H(k)^{\otimes 2}}\right]
\displaystyle~{}~{}~{}~{}~{}-\frac{1}{n+m}\left\|\Sigma_{k}(P)-\Sigma_{k}(Q)-\left(1-\frac{1}{\sqrt{n+m}}\right)(\mu_{k}(P)-\mu_{k}(Q))^{\otimes 2}\right\|^{2}_{H(k)^{\otimes 2}}\Big{\}}
n+mmMδ2𝔼YQnm[(k(,Y)μk(Qnm))2Σk(P)2H(k)2]\displaystyle\leq\frac{n+m}{mM_{\delta}^{2}}\mathbb{E}_{Y\sim Q_{nm}}\left[\left\|(k(\cdot,Y)-\mu_{k}(Q_{nm}))^{\otimes 2}-\Sigma_{k}(P)\right\|^{2}_{H(k)^{\otimes 2}}\right]
=n+mmMδ2𝔼YQnm[{k(,Y)μk(P)+1n+m(μk(P)μk(Q))}2Σk(P)2H(k)2]\displaystyle=\frac{n+m}{mM_{\delta}^{2}}\mathbb{E}_{Y\sim Q_{nm}}\left[\left\|\left\{k(\cdot,Y)-\mu_{k}(P)+\frac{1}{\sqrt{n+m}}(\mu_{k}(P)-\mu_{k}(Q))\right\}^{\otimes 2}-\Sigma_{k}(P)\right\|^{2}_{H(k)^{\otimes 2}}\right]
\displaystyle=\frac{n+m}{mM_{\delta}^{2}}\mathbb{E}_{Y\sim Q_{nm}}\left[\left\|(k(\cdot,Y)-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)+\frac{1}{\sqrt{n+m}}A(n,m,Y)\right\|^{2}_{H(k)^{\otimes 2}}\right]
\displaystyle=\frac{n+m}{mM_{\delta}^{2}}\mathbb{E}_{Y\sim Q_{nm}}\Big{[}\left\|(k(\cdot,Y)-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right\|^{2}_{H(k)^{\otimes 2}}+\frac{1}{n+m}\left\|A(n,m,Y)\right\|^{2}_{H(k)^{\otimes 2}}
\displaystyle~{}~{}~{}~{}~{}-\frac{2}{\sqrt{n+m}}\left<(k(\cdot,Y)-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P),A(n,m,Y)\right>_{H(k)^{\otimes 2}}\Big{]}
\displaystyle=\frac{n+m}{mM_{\delta}^{2}}\Big{\{}\mathbb{E}_{Y\sim Q_{nm}}\Big{[}\left\|(k(\cdot,Y)-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right\|^{2}_{H(k)^{\otimes 2}}\Big{]}+\frac{1}{n+m}\mathbb{E}_{Y\sim Q_{nm}}\Big{[}\left\|A(n,m,Y)\right\|^{2}_{H(k)^{\otimes 2}}\Big{]}
\displaystyle~{}~{}~{}~{}~{}-\frac{2}{\sqrt{n+m}}\mathbb{E}_{Y\sim Q_{nm}}\Big{[}\left<(k(\cdot,Y)-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P),A(n,m,Y)\right>_{H(k)^{\otimes 2}}\Big{]}\Big{\}}
<n+mmMδ2{𝔼YP[(k(,Y)μk(P))2Σk(P)2H(k)2]\displaystyle<\frac{n+m}{mM_{\delta}^{2}}\Big{\{}\mathbb{E}_{Y\sim P}\left[\left\|(k(\cdot,Y)-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right\|^{2}_{H(k)^{\otimes 2}}\right]
1n+m(𝔼YP[(k(,Y)μk(P))2Σk(P)2H(k)2]\displaystyle~{}~{}~{}~{}~{}~{}-\frac{1}{\sqrt{n+m}}\Big{(}\mathbb{E}_{Y\sim P}\left[\left\|(k(\cdot,Y)-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right\|^{2}_{H(k)^{\otimes 2}}\right]
𝔼YQ[(k(,Y)μk(P))2Σk(P)2H(k)2])+δ2}\displaystyle~{}~{}~{}~{}~{}-\mathbb{E}_{Y\sim Q}\left[\left\|(k(\cdot,Y)-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right\|^{2}_{H(k)^{\otimes 2}}\right]\Big{)}+\frac{\delta}{2}\Big{\}}
\displaystyle<\frac{1}{M_{\delta}^{2}}\left(\frac{1}{1-\rho}+\frac{\delta}{2}\right)\left(\mathbb{E}_{Y\sim P}\left[\left\|(k(\cdot,Y)-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right\|^{2}_{H(k)^{\otimes 2}}\right]+\delta\right)
=δ.\displaystyle=\delta.

Therefore, Σ~k(Qnm)Σk(Qnm)H(k)2=Op(1/n+m)\left\|\widetilde{\Sigma}_{k}(Q_{nm})-\Sigma_{k}(Q_{nm})\right\|_{H(k)^{\otimes 2}}=O_{p}(1/\sqrt{n+m}) is proved.

(iii) Finally, we prove Σk(Q^nm)Σk(Qnm)H(k)2=Op(1/n+m)\left\|\Sigma_{k}(\widehat{Q}_{nm})-\Sigma_{k}(Q_{nm})\right\|_{H(k)^{\otimes 2}}=O_{p}(1/\sqrt{n+m}). We get

Σk(Q^nm)Σk(Qnm)\displaystyle\Sigma_{k}(\widehat{Q}_{nm})-\Sigma_{k}(Q_{nm})
=1mj=1m(k(,Yj)μk(Q^nm))2Σk(Qnm)\displaystyle=\frac{1}{m}\sum_{j=1}^{m}(k(\cdot,Y_{j})-\mu_{k}(\widehat{Q}_{nm}))^{\otimes 2}-\Sigma_{k}(Q_{nm})
=1mj=1m(k(,Yj)μk(Qnm)+μk(Qnm)μk(Q^nm))2Σk(Qnm)\displaystyle=\frac{1}{m}\sum_{j=1}^{m}(k(\cdot,Y_{j})-\mu_{k}(Q_{nm})+\mu_{k}(Q_{nm})-\mu_{k}(\widehat{Q}_{nm}))^{\otimes 2}-\Sigma_{k}(Q_{nm})
=1mj=1m(k(,Yj)μk(Qnm))2Σk(Qnm)\displaystyle=\frac{1}{m}\sum_{j=1}^{m}(k(\cdot,Y_{j})-\mu_{k}(Q_{nm}))^{\otimes 2}-\Sigma_{k}(Q_{nm})
+1mj=1m(k(,Yj)μk(Qnm))(μk(Qnm)μk(Q^nm))\displaystyle~{}~{}~{}~{}~{}+\frac{1}{m}\sum_{j=1}^{m}(k(\cdot,Y_{j})-\mu_{k}(Q_{nm}))\otimes(\mu_{k}(Q_{nm})-\mu_{k}(\widehat{Q}_{nm}))
+1mj=1m(μk(Qnm)μk(Q^nm))(k(,Yj)μk(Qnm))+(μk(Qnm)μk(Q^nm))2\displaystyle~{}~{}~{}~{}~{}+\frac{1}{m}\sum_{j=1}^{m}(\mu_{k}(Q_{nm})-\mu_{k}(\widehat{Q}_{nm}))\otimes(k(\cdot,Y_{j})-\mu_{k}(Q_{nm}))+(\mu_{k}(Q_{nm})-\mu_{k}(\widehat{Q}_{nm}))^{\otimes 2}
=Σ~k(Qnm)Σk(Qnm)(μk(Qnm)μk(Q^nm))2\displaystyle=\widetilde{\Sigma}_{k}(Q_{nm})-\Sigma_{k}(Q_{nm})-(\mu_{k}(Q_{nm})-\mu_{k}(\widehat{Q}_{nm}))^{\otimes 2}

by an expansion of Σ~k(Qnm)\widetilde{\Sigma}_{k}(Q_{nm}). Using (i) and (ii) leads to the following:

Σk(Q^nm)Σk(Qnm)H(k)2\displaystyle\left\|\Sigma_{k}(\widehat{Q}_{nm})-\Sigma_{k}(Q_{nm})\right\|_{H(k)^{\otimes 2}}
\displaystyle\leq\left\|\widetilde{\Sigma}_{k}(Q_{nm})-\Sigma_{k}(Q_{nm})\right\|_{H(k)^{\otimes 2}}+\left\|\mu_{k}(Q_{nm})-\mu_{k}(\widehat{Q}_{nm})\right\|^{2}_{H(k)}
=Op(1n+m)+Op(1n+m)\displaystyle=O_{p}\left(\frac{1}{\sqrt{n+m}}\right)+O_{p}\left(\frac{1}{n+m}\right)
=Op(1n+m).\displaystyle=O_{p}\left(\frac{1}{\sqrt{n+m}}\right).
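The expansion used in (iii) above is the Hilbert-space version of an elementary finite-dimensional fact: recentring an empirical covariance from a fixed point to the sample mean costs exactly one rank-one term. A minimal numerical check of that identity, with ℝ³ standing in for H(k) (the array values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 50, 3
Y = rng.standard_normal((m, d))
mu = np.array([0.3, -0.1, 0.7])   # plays the role of the fixed center mu_k(Q_nm)
ybar = Y.mean(axis=0)             # plays the role of mu_k(Q_hat_nm)

Sigma_tilde = (Y - mu).T @ (Y - mu) / m      # analogue of Sigma~_k(Q_nm)
Sigma_hat = (Y - ybar).T @ (Y - ybar) / m    # analogue of Sigma_k(Q_hat_nm)

# identity from the proof: Sigma_hat = Sigma_tilde - (ybar - mu)(ybar - mu)^T
residual = Sigma_hat - (Sigma_tilde - np.outer(ybar - mu, ybar - mu))
```

The residual vanishes exactly (up to floating-point error), regardless of the sample or the chosen center.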

7.8 Proof of Lemma 3

First, we have

Σk(Qnm)\displaystyle\Sigma_{k}(Q_{nm})
=𝔼YQnm[(k(,Y)μk(Qnm))2]\displaystyle=\mathbb{E}_{Y\sim Q_{nm}}\left[(k(\cdot,Y)-\mu_{k}(Q_{nm}))^{\otimes 2}\right]
=(11n+m)𝔼YP[(k(,Y)μk(Qnm))2]+1n+m𝔼YQ[(k(,Y)μk(Qnm))2]\displaystyle=\left(1-\frac{1}{\sqrt{n+m}}\right)\mathbb{E}_{Y\sim P}\left[(k(\cdot,Y)-\mu_{k}(Q_{nm}))^{\otimes 2}\right]+\frac{1}{\sqrt{n+m}}\mathbb{E}_{Y\sim Q}\left[(k(\cdot,Y)-\mu_{k}(Q_{nm}))^{\otimes 2}\right]
=(11n+m)𝔼YP[(k(,Y)μk(P)+μk(P)μk(Qnm))2]\displaystyle=\left(1-\frac{1}{\sqrt{n+m}}\right)\mathbb{E}_{Y\sim P}\left[(k(\cdot,Y)-\mu_{k}(P)+\mu_{k}(P)-\mu_{k}(Q_{nm}))^{\otimes 2}\right]
+1n+m𝔼YQ[(k(,Y)μk(Q)+μk(Q)μk(Qnm))2]\displaystyle~{}~{}~{}~{}~{}+\frac{1}{\sqrt{n+m}}\mathbb{E}_{Y\sim Q}\left[(k(\cdot,Y)-\mu_{k}(Q)+\mu_{k}(Q)-\mu_{k}(Q_{nm}))^{\otimes 2}\right]
=(11n+m){𝔼YP[(k(,Y)μk(P))2]+(μk(P)μk(Qnm))2}\displaystyle=\left(1-\frac{1}{\sqrt{n+m}}\right)\left\{\mathbb{E}_{Y\sim P}\left[(k(\cdot,Y)-\mu_{k}(P))^{\otimes 2}\right]+(\mu_{k}(P)-\mu_{k}(Q_{nm}))^{\otimes 2}\right\}
+1n+m{𝔼YQ[(k(,Y)μk(Q))2]+(μk(Q)μk(Qnm))2}\displaystyle~{}~{}~{}~{}~{}+\frac{1}{\sqrt{n+m}}\left\{\mathbb{E}_{Y\sim Q}\left[(k(\cdot,Y)-\mu_{k}(Q))^{\otimes 2}\right]+(\mu_{k}(Q)-\mu_{k}(Q_{nm}))^{\otimes 2}\right\}
=(11n+m)Σk(P)+1n+mΣk(Q)+(11n+m)1n+m(μk(P)μk(Q))2\displaystyle=\left(1-\frac{1}{\sqrt{n+m}}\right)\Sigma_{k}(P)+\frac{1}{\sqrt{n+m}}\Sigma_{k}(Q)+\left(1-\frac{1}{\sqrt{n+m}}\right)\frac{1}{n+m}(\mu_{k}(P)-\mu_{k}(Q))^{\otimes 2}
+1n+m(11n+m)2(μk(P)μk(Q))2\displaystyle~{}~{}~{}~{}~{}+\frac{1}{\sqrt{n+m}}\left(1-\frac{1}{\sqrt{n+m}}\right)^{2}(\mu_{k}(P)-\mu_{k}(Q))^{\otimes 2}
=(11n+m)Σk(P)+1n+mΣk(Q)+1n+m(11n+m)(μk(P)μk(Q))2.\displaystyle=\left(1-\frac{1}{\sqrt{n+m}}\right)\Sigma_{k}(P)+\frac{1}{\sqrt{n+m}}\Sigma_{k}(Q)+\frac{1}{\sqrt{n+m}}\left(1-\frac{1}{\sqrt{n+m}}\right)(\mu_{k}(P)-\mu_{k}(Q))^{\otimes 2}. (31)
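Equation (31) is the law of total covariance for the two-component mixture Q_nm = (1 − 1/√(n+m))P + (1/√(n+m))Q. A minimal numerical check of the same identity in ℝ², with a fixed mixture weight ε standing in for 1/√(n+m) (all values below are illustrative):

```python
import numpy as np

eps = 0.2                                   # stands in for 1 / sqrt(n + m)
muP, muQ = np.array([0.0, 0.0]), np.array([2.0, -1.0])
SP = np.eye(2)
SQ = np.array([[2.0, 0.5], [0.5, 1.0]])

# exact mean and covariance of the mixture (1 - eps) P + eps Q,
# via the law of total covariance
mu_mix = (1 - eps) * muP + eps * muQ
S_mix = ((1 - eps) * (SP + np.outer(muP - mu_mix, muP - mu_mix))
         + eps * (SQ + np.outer(muQ - mu_mix, muQ - mu_mix)))

# right-hand side of (31): (1-eps) SP + eps SQ + eps(1-eps)(muP-muQ)(muP-muQ)^T
rhs = ((1 - eps) * SP + eps * SQ
       + eps * (1 - eps) * np.outer(muP - muQ, muP - muQ))
```

The between-component contributions ε²(1−ε) and ε(1−ε)² combine into the single coefficient ε(1−ε), exactly as in the last step of (31).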

The result (31) leads to

Σk(P)Σk(Qnm)\displaystyle\Sigma_{k}(P)-\Sigma_{k}(Q_{nm})
=Σk(P)(11n+m)Σk(P)1n+mΣk(Q)\displaystyle=\Sigma_{k}(P)-\left(1-\frac{1}{\sqrt{n+m}}\right)\Sigma_{k}(P)-\frac{1}{\sqrt{n+m}}\Sigma_{k}(Q)
1n+m(11n+m)(μk(P)μk(Q))2\displaystyle~{}~{}~{}~{}~{}-\frac{1}{\sqrt{n+m}}\left(1-\frac{1}{\sqrt{n+m}}\right)(\mu_{k}(P)-\mu_{k}(Q))^{\otimes 2}
=1n+mΣk(P)1n+mΣk(Q)1n+m(11n+m)(μk(P)μk(Q))2\displaystyle=\frac{1}{\sqrt{n+m}}\Sigma_{k}(P)-\frac{1}{\sqrt{n+m}}\Sigma_{k}(Q)-\frac{1}{\sqrt{n+m}}\left(1-\frac{1}{\sqrt{n+m}}\right)(\mu_{k}(P)-\mu_{k}(Q))^{\otimes 2}
=1n+m(Σk(P)Σk(Q))1n+m(11n+m)(μk(P)μk(Q))2.\displaystyle=\frac{1}{\sqrt{n+m}}(\Sigma_{k}(P)-\Sigma_{k}(Q))-\frac{1}{\sqrt{n+m}}\left(1-\frac{1}{\sqrt{n+m}}\right)(\mu_{k}(P)-\mu_{k}(Q))^{\otimes 2}.

Hence

(n+m)(Σk(P^)Σk(Q^nm)2H(k)2Σ~k(P)Σ~k(Qnm)2H(k)2)\displaystyle(n+m)\left(\left\|\Sigma_{k}(\widehat{P})-\Sigma_{k}(\widehat{Q}_{nm})\right\|^{2}_{H(k)^{\otimes 2}}-\left\|\widetilde{\Sigma}_{k}(P)-\widetilde{\Sigma}_{k}(Q_{nm})\right\|^{2}_{H(k)^{\otimes 2}}\right)
(n+m){Σk(P^)Σk(Q^nm)H(k)2+Σ~k(P)Σ~k(Qnm)H(k)2}\displaystyle\leq(n+m)\left\{\left\|\Sigma_{k}(\widehat{P})-\Sigma_{k}(\widehat{Q}_{nm})\right\|_{H(k)^{\otimes 2}}+\left\|\widetilde{\Sigma}_{k}(P)-\widetilde{\Sigma}_{k}(Q_{nm})\right\|_{H(k)^{\otimes 2}}\right\}
×(μk(P)μk(P^))2(μk(Qnm)μk(Q^nm))2H(k)2\displaystyle~{}~{}~{}~{}~{}\times\left\|(\mu_{k}(P)-\mu_{k}(\widehat{P}))^{\otimes 2}-(\mu_{k}(Q_{nm})-\mu_{k}(\widehat{Q}_{nm}))^{\otimes 2}\right\|_{H(k)^{\otimes 2}}
n+m{Σk(P^)Σk(Q^nm)H(k)2+Σ~k(P)Σ~k(Qnm)H(k)2}\displaystyle\leq\sqrt{n+m}\left\{\left\|\Sigma_{k}(\widehat{P})-\Sigma_{k}(\widehat{Q}_{nm})\right\|_{H(k)^{\otimes 2}}+\left\|\widetilde{\Sigma}_{k}(P)-\widetilde{\Sigma}_{k}(Q_{nm})\right\|_{H(k)^{\otimes 2}}\right\}
×(n+mnnμk(P)μk(P^)2H(k)+n+mmmμk(Qnm)μk(Q^nm)2H(k))\displaystyle~{}~{}~{}~{}~{}\times\left(\frac{\sqrt{n+m}}{n}n\left\|\mu_{k}(P)-\mu_{k}(\widehat{P})\right\|^{2}_{H(k)}+\frac{\sqrt{n+m}}{m}m\left\|\mu_{k}(Q_{nm})-\mu_{k}(\widehat{Q}_{nm})\right\|^{2}_{H(k)}\right)
=n+m{Σk(P^)Σk(P)(Σk(Q^nm)Σk(Qnm))+Σk(P)Σk(Qnm)H(k)2\displaystyle=\sqrt{n+m}\left\{\left\|\Sigma_{k}(\widehat{P})-\Sigma_{k}(P)-(\Sigma_{k}(\widehat{Q}_{nm})-\Sigma_{k}(Q_{nm}))+\Sigma_{k}(P)-\Sigma_{k}(Q_{nm})\right\|_{H(k)^{\otimes 2}}\right.
+Σ~k(P)Σk(P)(Σ~k(Qnm)Σk(Qnm))+Σk(P)Σk(Qnm)H(k)2}\displaystyle\left.~{}~{}~{}~{}~{}+\left\|\widetilde{\Sigma}_{k}(P)-\Sigma_{k}(P)-(\widetilde{\Sigma}_{k}(Q_{nm})-\Sigma_{k}(Q_{nm}))+\Sigma_{k}(P)-\Sigma_{k}(Q_{nm})\right\|_{H(k)^{\otimes 2}}\right\}
×(n+mnnμk(P)μk(P^)2H(k)+n+mmmμk(Qnm)μk(Q^nm)2H(k))\displaystyle~{}~{}~{}~{}~{}\times\left(\frac{\sqrt{n+m}}{n}n\left\|\mu_{k}(P)-\mu_{k}(\widehat{P})\right\|^{2}_{H(k)}+\frac{\sqrt{n+m}}{m}m\left\|\mu_{k}(Q_{nm})-\mu_{k}(\widehat{Q}_{nm})\right\|^{2}_{H(k)}\right)
=n+m{Σk(P^)Σk(P)(Σk(Q^nm)Σk(Qnm))\displaystyle=\sqrt{n+m}\left\{\Bigg{\|}\Sigma_{k}(\widehat{P})-\Sigma_{k}(P)-(\Sigma_{k}(\widehat{Q}_{nm})-\Sigma_{k}(Q_{nm}))\right.
+1n+m(Σk(P)Σk(Q))1n+m(11n+m)(μk(P)μk(Q))2H(k)2\displaystyle~{}~{}~{}~{}~{}+\frac{1}{\sqrt{n+m}}(\Sigma_{k}(P)-\Sigma_{k}(Q))-\frac{1}{\sqrt{n+m}}\left(1-\frac{1}{\sqrt{n+m}}\right)(\mu_{k}(P)-\mu_{k}(Q))^{\otimes 2}\Bigg{\|}_{H(k)^{\otimes 2}}
+Σ~k(P)Σk(P)(Σ~k(Qnm)Σk(Qnm))\displaystyle~{}~{}~{}~{}~{}+\Bigg{\|}\widetilde{\Sigma}_{k}(P)-\Sigma_{k}(P)-(\widetilde{\Sigma}_{k}(Q_{nm})-\Sigma_{k}(Q_{nm}))
\displaystyle\left.~{}~{}~{}~{}~{}+\frac{1}{\sqrt{n+m}}(\Sigma_{k}(P)-\Sigma_{k}(Q))-\frac{1}{\sqrt{n+m}}\left(1-\frac{1}{\sqrt{n+m}}\right)(\mu_{k}(P)-\mu_{k}(Q))^{\otimes 2}\Bigg{\|}_{H(k)^{\otimes 2}}\right\}
×(n+mnnμk(P)μk(P^)2H(k)+n+mμk(Qnm)μk(Q^nm)2H(k))\displaystyle~{}~{}~{}~{}~{}\times\left(\frac{\sqrt{n+m}}{n}n\left\|\mu_{k}(P)-\mu_{k}(\widehat{P})\right\|^{2}_{H(k)}+\sqrt{n+m}\left\|\mu_{k}(Q_{nm})-\mu_{k}(\widehat{Q}_{nm})\right\|^{2}_{H(k)}\right)
{n+mnnΣk(P^)Σk(P)H(k)2+n+mΣk(Q^nm)Σk(Qnm)H(k)2\displaystyle\leq\Bigg{\{}\sqrt{\frac{n+m}{n}}\sqrt{n}\left\|\Sigma_{k}(\widehat{P})-\Sigma_{k}(P)\right\|_{H(k)^{\otimes 2}}+\sqrt{n+m}\left\|\Sigma_{k}(\widehat{Q}_{nm})-\Sigma_{k}(Q_{nm})\right\|_{H(k)^{\otimes 2}}
+n+mnnΣ~k(P)Σk(P)H(k)2+n+mΣ~k(Qnm)Σk(Qnm)H(k)2\displaystyle~{}~{}~{}~{}~{}+\sqrt{\frac{n+m}{n}}\sqrt{n}\left\|\widetilde{\Sigma}_{k}(P)-\Sigma_{k}(P)\right\|_{H(k)^{\otimes 2}}+\sqrt{n+m}\left\|\widetilde{\Sigma}_{k}(Q_{nm})-\Sigma_{k}(Q_{nm})\right\|_{H(k)^{\otimes 2}}
+2Σk(P)Σk(Q)H(k)2+2(11n+m)μk(P)μk(Q)2H(k)}\displaystyle~{}~{}~{}~{}~{}+2\left\|\Sigma_{k}(P)-\Sigma_{k}(Q)\right\|_{H(k)^{\otimes 2}}+2\left(1-\frac{1}{\sqrt{n+m}}\right)\left\|\mu_{k}(P)-\mu_{k}(Q)\right\|^{2}_{H(k)}\Bigg{\}}
×(n+mnnμk(P)μk(P^)2H(k)+n+mμk(Qnm)μk(Q^nm)2H(k)).\displaystyle~{}~{}~{}~{}~{}\times\left(\frac{\sqrt{n+m}}{n}n\left\|\mu_{k}(P)-\mu_{k}(\widehat{P})\right\|^{2}_{H(k)}+\sqrt{n+m}\left\|\mu_{k}(Q_{nm})-\mu_{k}(\widehat{Q}_{nm})\right\|^{2}_{H(k)}\right).

Therefore, we obtain

(n+m)(Σk(P^)Σk(Q^nm)2H(k)2Σ~k(P)Σ~k(Qnm)2H(k)2)𝑃0,(n+m)\left(\left\|\Sigma_{k}(\widehat{P})-\Sigma_{k}(\widehat{Q}_{nm})\right\|^{2}_{H(k)^{\otimes 2}}-\left\|\widetilde{\Sigma}_{k}(P)-\widetilde{\Sigma}_{k}(Q_{nm})\right\|^{2}_{H(k)^{\otimes 2}}\right)\xrightarrow{P}0,

as n,mn,m\to\infty by Lemma 2, which completes the proof of Lemma 3.

7.9 Proof of Proposition 3

Let \Upsilon=V[(k(\cdot,X)-\mu_{k}(P))^{\otimes 2}] be the operator with spectral representation \Upsilon=\sum_{\ell=1}^{\infty}\theta_{\ell}\phi^{\otimes 2}_{\ell}, and let T:H(k)^{\otimes 2}\to L_{2}(\mathcal{H},P) be the operator defined by

(T(A))(x)=Υ1/2{(k(,x)μk(P))2Σk(P)},AH(k)2(T(A))(x)=\left<\Upsilon^{-1/2}\{(k(\cdot,x)-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\},A\right>_{H(k)^{\otimes 2}}

for all AH(k)2A\in H(k)^{\otimes 2} and xx\in\mathcal{H}. We consider the adjoint operator of this TT,

Tg,AH(k)2\displaystyle\left<T^{*}g,A\right>_{H(k)^{\otimes 2}} =g,TAL2(,P)\displaystyle=\left<g,TA\right>_{L_{2}(\mathcal{H},P)}
=(T(A))(y)g(y)dP(y)\displaystyle=\int_{\mathcal{H}}(T(A))(y)g(y)dP(y)
=Υ1/2{(k(,y)μk(P))2Σk(P)},AH(k)2g(y)dP(y)\displaystyle=\int_{\mathcal{H}}\left<\Upsilon^{-1/2}\{(k(\cdot,y)-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\},A\right>_{H(k)^{\otimes 2}}g(y)dP(y)
=(Υ1/2{(k(,y)μk(P))2Σk(P)})g(y)dP(y),AH(k)2\displaystyle=\left<\int_{\mathcal{H}}(\Upsilon^{-1/2}\{(k(\cdot,y)-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\})g(y)dP(y),A\right>_{H(k)^{\otimes 2}}

for all g\in L_{2}(\mathcal{H},P) and A\in H(k)^{\otimes 2}; hence the adjoint operator T^{*} of T is

Tg=(Υ1/2{(k(,y)μk(P))2Σk(P)})g(y)dP(y)T^{*}g=\int_{\mathcal{H}}(\Upsilon^{-1/2}\{(k(\cdot,y)-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\})g(y)dP(y)

for all gL2(,P)g\in L_{2}(\mathcal{H},P). Furthermore, since

T(T(A))\displaystyle T^{*}(T(A)) =(Υ1/2{(k(,y)μk(P))2Σk(P)})(T(A))(y)dP(y)\displaystyle=\int_{\mathcal{H}}(\Upsilon^{-1/2}\{(k(\cdot,y)-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\})(T(A))(y)dP(y)
=(Υ1/2{(k(,y)μk(P))2Σk(P)})Υ1/2{(k(,y)μk(P))2Σk(P)},AH(k)2dP(y)\displaystyle=\int_{\mathcal{H}}(\Upsilon^{-1/2}\{(k(\cdot,y)-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\})\left<\Upsilon^{-1/2}\{(k(\cdot,y)-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\},A\right>_{H(k)^{\otimes 2}}dP(y)
=(Υ1/2{(k(,y)μk(P))2Σk(P)})2AdP(y)\displaystyle=\int_{\mathcal{H}}(\Upsilon^{-1/2}\{(k(\cdot,y)-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\})^{\otimes 2}AdP(y)
=Υ1/2({(k(,y)μk(P))2Σk(P)})2dP(y)Υ1/2A\displaystyle=\Upsilon^{-1/2}\int_{\mathcal{H}}(\{(k(\cdot,y)-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\})^{\otimes 2}dP(y)\Upsilon^{-1/2}A
=Υ1/2ΥΥ1/2A\displaystyle=\Upsilon^{-1/2}\Upsilon\Upsilon^{-1/2}A
=A\displaystyle=A

for all A\in H(k)^{\otimes 2}, so T^{*}T is the identity operator on H(k)^{\otimes 2}. It follows from direct calculations for T\Upsilon T^{*}:L_{2}(\mathcal{H},P)\to L_{2}(\mathcal{H},P) that

(TΥTg)(x)\displaystyle(T\Upsilon T^{*}g)(x)
=Υ1/2{(k(,x)μk(P))2Σk(P)},ΥTgH(k)2\displaystyle=\left<\Upsilon^{-1/2}\{(k(\cdot,x)-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\},\Upsilon T^{*}g\right>_{H(k)^{\otimes 2}}
=Υ1/2{(k(,x)μk(P))2Σk(P)},TgH(k)2\displaystyle=\left<\Upsilon^{1/2}\{(k(\cdot,x)-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\},T^{*}g\right>_{H(k)^{\otimes 2}}
=TΥ1/2{(k(,x)μk(P))2Σk(P)},gL2(,P)\displaystyle=\left<T\Upsilon^{1/2}\{(k(\cdot,x)-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\},g\right>_{L_{2}(\mathcal{H},P)}
=[TΥ1/2{(k(,x)μk(P))2Σk(P)}](y)g(y)dP(y)\displaystyle=\int_{\mathcal{H}}[T\Upsilon^{1/2}\{(k(\cdot,x)-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\}](y)g(y)dP(y)
=Υ1/2{(k(,y)μk(P))2Σk(P)},Υ1/2{(k(,x)μk(P))2Σk(P)}H(k)2g(y)dP(y)\displaystyle=\int_{\mathcal{H}}\left<\Upsilon^{-1/2}\{(k(\cdot,y)-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\},\Upsilon^{1/2}\{(k(\cdot,x)-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\}\right>_{H(k)^{\otimes 2}}g(y)dP(y)
=(k(,y)μk(P))2Σk(P),(k(,x)μk(P))2Σk(P)H(k)2g(y)dP(y)\displaystyle=\int_{\mathcal{H}}\left<(k(\cdot,y)-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P),(k(\cdot,x)-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right>_{H(k)^{\otimes 2}}g(y)dP(y)
=(Skg)(x)\displaystyle=(S_{k}g)(x)

for all g\in L_{2}(\mathcal{H},P) and x\in\mathcal{H}; thus T\Upsilon T^{*}=S_{k}. Therefore, \theta_{\ell} and T\phi_{\ell} are an eigenvalue and corresponding eigenfunction of S_{k}, since

Skg\displaystyle S_{k}g =TΥTg\displaystyle=T\Upsilon T^{*}g
=T=1θϕ2Tg\displaystyle=T\sum_{\ell=1}^{\infty}\theta_{\ell}\phi_{\ell}^{\otimes 2}T^{*}g
==1θϕ,TgH(k)2Tϕ\displaystyle=\sum_{\ell=1}^{\infty}\theta_{\ell}\left<\phi_{\ell},T^{*}g\right>_{H(k)^{\otimes 2}}T\phi_{\ell}
\displaystyle=\sum_{\ell=1}^{\infty}\theta_{\ell}\left<T\phi_{\ell},g\right>_{L_{2}(\mathcal{H},P)}T\phi_{\ell}
==1θ(Tϕ)2g\displaystyle=\sum_{\ell=1}^{\infty}\theta_{\ell}(T\phi_{\ell})^{\otimes 2}g

and \{T\phi_{\ell}\}_{\ell=1}^{\infty} is an orthonormal system in L_{2}(\mathcal{H},P), since

Tϕ,TϕsL2(,P)=TTϕ,ϕsH(k)2=ϕ,ϕsH(k)2=δs\left<T\phi_{\ell},T\phi_{s}\right>_{L_{2}(\mathcal{H},P)}=\left<T^{*}T\phi_{\ell},\phi_{s}\right>_{H(k)^{\otimes 2}}=\left<\phi_{\ell},\phi_{s}\right>_{H(k)^{\otimes 2}}=\delta_{\ell s}

for all ,s\ell,s\in\mathbb{N}. In fact,

Sk(Tϕ)=TΥTTϕ=TΥϕ=Tθϕ=θ(Tϕ)S_{k}(T\phi_{\ell})=T\Upsilon T^{*}T\phi_{\ell}=T\Upsilon\phi_{\ell}=T\theta_{\ell}\phi_{\ell}=\theta_{\ell}(T\phi_{\ell})

shows that \theta_{\ell} and T\phi_{\ell} are an eigenvalue and corresponding eigenfunction of S_{k}.
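The spectral transfer used in this proof (an isometry T with T^{*}T=I carries eigenpairs of \Upsilon to eigenpairs of T\Upsilon T^{*}) can be illustrated by a finite-dimensional numerical sketch. The matrix sizes, eigenvalues, and the stand-ins for T and \Upsilon below are all hypothetical, chosen only to mirror the algebra:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite-dimensional stand-ins: T is an isometry from R^p to
# R^q (T^T T = I_p), Upsilon is a PSD operator on R^p with known spectrum.
p, q = 4, 10
T, _ = np.linalg.qr(rng.standard_normal((q, p)))   # orthonormal columns
theta = np.array([4.0, 3.0, 2.0, 1.0])             # eigenvalues of Upsilon
phi, _ = np.linalg.qr(rng.standard_normal((p, p))) # orthonormal eigenvectors
Upsilon = phi @ np.diag(theta) @ phi.T

S = T @ Upsilon @ T.T                              # analogue of S_k = T Upsilon T^*

# S (T phi_l) = theta_l (T phi_l): eigenpairs transfer through T
for l in range(p):
    assert np.allclose(S @ (T @ phi[:, l]), theta[l] * (T @ phi[:, l]))

# {T phi_l} is an orthonormal system, since T^* T = I
assert np.allclose((T @ phi).T @ (T @ phi), np.eye(p))
```

The nonzero spectrum of S coincides with that of \Upsilon, exactly as in the identification of the eigenvalues of S_{k} above.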

7.10 Proof of Theorem 4

Let Wn==1(λ^(n)λ)Z2W^{\prime}_{n}=\sum_{\ell=1}^{\infty}({\widehat{\lambda}}^{(n)}_{\ell}-\lambda_{\ell})Z_{\ell}^{2}. Then

𝔼[Wn]=𝔼[=1(λ^(n)λ)],\displaystyle\mathbb{E}[W^{\prime}_{n}]=\mathbb{E}\left[\sum_{\ell=1}^{\infty}({\widehat{\lambda}}^{(n)}_{\ell}-\lambda_{\ell})\right],
\displaystyle V[W^{\prime}_{n}]=2\mathbb{E}\left[\sum_{\ell=1}^{\infty}({\widehat{\lambda}}^{(n)}_{\ell}-\lambda_{\ell})^{2}\right]+Cov\left(\sum_{\ell=1}^{\infty}({\widehat{\lambda}}^{(n)}_{\ell}-\lambda_{\ell}),\sum_{\ell^{\prime}=1}^{\infty}({\widehat{\lambda}}^{(n)}_{\ell^{\prime}}-\lambda_{\ell^{\prime}})\right). (32)

By the definition of trace of the Hilbert–Schmidt operator, we get the following inequality

=1(λ^(n)λ)=tr[Υ^(n)]tr[Υ]=Υ^(n)Υ,IH(k)4Υ^(n)ΥH(k)4.\sum_{\ell=1}^{\infty}({\widehat{\lambda}}^{(n)}_{\ell}-\lambda_{\ell})=\text{tr}[\widehat{\Upsilon}^{(n)}]-\text{tr}[\Upsilon]=\left<\widehat{\Upsilon}^{(n)}-\Upsilon,I\right>_{H(k)^{\otimes 4}}\leq\left\|\widehat{\Upsilon}^{(n)}-\Upsilon\right\|_{H(k)^{\otimes 4}}. (33)

Using the notation

B(X1)=(k(,X1)μk(P))(μk(P)μk(P^))+(μk(P)μk(P^))(k(,X1)μk(P))+(μk(P)μk(P^))2,B(X_{1})=(k(\cdot,X_{1})-\mu_{k}(P))\otimes(\mu_{k}(P)-\mu_{k}(\widehat{P}))+(\mu_{k}(P)-\mu_{k}(\widehat{P}))\otimes(k(\cdot,X_{1})-\mu_{k}(P))+(\mu_{k}(P)-\mu_{k}(\widehat{P}))^{\otimes 2},

we have

Υ^(n)\displaystyle\widehat{\Upsilon}^{(n)} =1ni=1n{(k(,Xi)μk(P^))2Σk(P^)}2\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\left\{(k(\cdot,X_{i})-\mu_{k}(\widehat{P}))^{\otimes 2}-\Sigma_{k}(\widehat{P})\right\}^{\otimes 2}
=1ni=1n{(k(,Xi)μk(P))2Σk(P)+B(Xi)+Σk(P)Σk(P^)}2\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\left\{(k(\cdot,X_{i})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)+B(X_{i})+\Sigma_{k}(P)-\Sigma_{k}(\widehat{P})\right\}^{\otimes 2}
=1ni=1n{(k(,Xi)μk(P))2Σk(P)}2\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\left\{(k(\cdot,X_{i})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right\}^{\otimes 2}
+1ni=1n{(k(,Xi)μk(P))2Σk(P)}{B(Xi)+Σk(P)Σk(P^)}\displaystyle~{}~{}~{}~{}~{}+\frac{1}{n}\sum_{i=1}^{n}\left\{(k(\cdot,X_{i})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right\}\otimes\left\{B(X_{i})+\Sigma_{k}(P)-\Sigma_{k}(\widehat{P})\right\}
+1ni=1n{B(Xi)+Σk(P)Σk(P^)}{(k(,Xi)μk(P))2Σk(P)}\displaystyle~{}~{}~{}~{}~{}+\frac{1}{n}\sum_{i=1}^{n}\left\{B(X_{i})+\Sigma_{k}(P)-\Sigma_{k}(\widehat{P})\right\}\otimes\left\{(k(\cdot,X_{i})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right\}
+1ni=1n{B(Xi)+Σk(P)Σk(P^)}2\displaystyle~{}~{}~{}~{}~{}+\frac{1}{n}\sum_{i=1}^{n}\left\{B(X_{i})+\Sigma_{k}(P)-\Sigma_{k}(\widehat{P})\right\}^{\otimes 2}
=:I1+I2+I3+I4.\displaystyle=:I_{1}+I_{2}+I_{3}+I_{4}. (34)

By direct calculation, we obtain the following three expressions:

I2\displaystyle I_{2} =1ni=1n{(k(,Xi)μk(P))2Σk(P)}{B(Xi)+Σk(P)Σk(P^)}\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\left\{(k(\cdot,X_{i})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right\}\otimes\left\{B(X_{i})+\Sigma_{k}(P)-\Sigma_{k}(\widehat{P})\right\}
=1ni=1n{(k(,Xi)μk(P))2Σk(P)}B(Xi)\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\left\{(k(\cdot,X_{i})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right\}\otimes B(X_{i})
+1ni=1n{(k(,Xi)μk(P))2Σk(P)}(Σk(P)Σk(P^))\displaystyle~{}~{}~{}~{}~{}+\frac{1}{n}\sum_{i=1}^{n}\left\{(k(\cdot,X_{i})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right\}\otimes\left(\Sigma_{k}(P)-\Sigma_{k}(\widehat{P})\right)
=1ni=1n{(k(,Xi)μk(P))2Σk(P)}B(Xi)(Σk(P)Σk(P^))2,\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\left\{(k(\cdot,X_{i})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right\}\otimes B(X_{i})-\left(\Sigma_{k}(P)-\Sigma_{k}(\widehat{P})\right)^{\otimes 2}, (35)
I3\displaystyle I_{3} =1ni=1n{B(Xi)+Σk(P)Σk(P^)}{(k(,Xi)μk(P))2Σk(P)}\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\left\{B(X_{i})+\Sigma_{k}(P)-\Sigma_{k}(\widehat{P})\right\}\otimes\left\{(k(\cdot,X_{i})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right\}
=1ni=1nB(Xi){(k(,Xi)μk(P))2Σk(P)}(Σk(P)Σk(P^))2\displaystyle=\frac{1}{n}\sum_{i=1}^{n}B(X_{i})\otimes\left\{(k(\cdot,X_{i})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right\}-\left(\Sigma_{k}(P)-\Sigma_{k}(\widehat{P})\right)^{\otimes 2} (36)

and

I4=\displaystyle I_{4}= 1ni=1n{B(Xi)+Σk(P)Σk(P^)}2\displaystyle\frac{1}{n}\sum_{i=1}^{n}\left\{B(X_{i})+\Sigma_{k}(P)-\Sigma_{k}(\widehat{P})\right\}^{\otimes 2}
=1ni=1nB(Xi)2+1ni=1nB(Xi)(Σk(P)Σk(P^))\displaystyle=\frac{1}{n}\sum_{i=1}^{n}B(X_{i})^{\otimes 2}+\frac{1}{n}\sum_{i=1}^{n}B(X_{i})\otimes\left(\Sigma_{k}(P)-\Sigma_{k}(\widehat{P})\right)
+1ni=1n(Σk(P)Σk(P^))B(Xi)+(Σk(P)Σk(P^))2\displaystyle~{}~{}~{}~{}~{}+\frac{1}{n}\sum_{i=1}^{n}(\Sigma_{k}(P)-\Sigma_{k}(\widehat{P}))\otimes B(X_{i})+(\Sigma_{k}(P)-\Sigma_{k}(\widehat{P}))^{\otimes 2}
=1ni=1nB(Xi)2(μk(P)μk(P^))2(Σk(P)Σk(P^))\displaystyle=\frac{1}{n}\sum_{i=1}^{n}B(X_{i})^{\otimes 2}-(\mu_{k}(P)-\mu_{k}(\widehat{P}))^{\otimes 2}\otimes(\Sigma_{k}(P)-\Sigma_{k}(\widehat{P}))
(Σk(P)Σk(P^))(μk(P)μk(P^))2+(Σk(P)Σk(P^))2.\displaystyle~{}~{}~{}~{}~{}-(\Sigma_{k}(P)-\Sigma_{k}(\widehat{P}))\otimes(\mu_{k}(P)-\mu_{k}(\widehat{P}))^{\otimes 2}+(\Sigma_{k}(P)-\Sigma_{k}(\widehat{P}))^{\otimes 2}. (37)

Combining (35), (36) and (37) with (34), we obtain the expression

Υ^(n)\displaystyle\widehat{\Upsilon}^{(n)} =1ni=1n{(k(,Xi)μk(P))2Σk(P)}2(Σk(P)Σk(P^))2\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\left\{(k(\cdot,X_{i})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right\}^{\otimes 2}-(\Sigma_{k}(P)-\Sigma_{k}(\widehat{P}))^{\otimes 2}
(μk(P)μk(P^))2(Σk(P)Σk(P^))(Σk(P)Σk(P^))(μk(P)μk(P^))2\displaystyle~{}~{}~{}~{}~{}-(\mu_{k}(P)-\mu_{k}(\widehat{P}))^{\otimes 2}\otimes(\Sigma_{k}(P)-\Sigma_{k}(\widehat{P}))-(\Sigma_{k}(P)-\Sigma_{k}(\widehat{P}))\otimes(\mu_{k}(P)-\mu_{k}(\widehat{P}))^{\otimes 2}
+1ni=1nB(Xi)2+1ni=1nB(Xi){(k(,Xi)μk(P))2Σk(P)}\displaystyle~{}~{}~{}~{}~{}+\frac{1}{n}\sum_{i=1}^{n}B(X_{i})^{\otimes 2}+\frac{1}{n}\sum_{i=1}^{n}B(X_{i})\otimes\left\{(k(\cdot,X_{i})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right\}
+1ni=1n{(k(,Xi)μk(P))2Σk(P)}B(Xi).\displaystyle~{}~{}~{}~{}~{}+\frac{1}{n}\sum_{i=1}^{n}\left\{(k(\cdot,X_{i})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right\}\otimes B(X_{i}).

Therefore,

Υ^(n)ΥH(k)4\displaystyle\left\|\widehat{\Upsilon}^{(n)}-\Upsilon\right\|_{H(k)^{\otimes 4}}
1ni=1n{(k(,Xi)μk(P))2Σk(P)}2ΥH(k)4+(Σk(P)Σk(P^))2H(k)4\displaystyle\leq\left\|\frac{1}{n}\sum_{i=1}^{n}\{(k(\cdot,X_{i})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\}^{\otimes 2}-\Upsilon\right\|_{H(k)^{\otimes 4}}+\left\|(\Sigma_{k}(P)-\Sigma_{k}(\widehat{P}))^{\otimes 2}\right\|_{H(k)^{\otimes 4}}
+(μk(P)μk(P^))2(Σk(P)Σk(P^))H(k)4+(Σk(P)Σk(P^))(μk(P)μk(P^))2H(k)4\displaystyle~{}~{}~{}~{}+\left\|(\mu_{k}(P)-\mu_{k}(\widehat{P}))^{\otimes 2}\otimes(\Sigma_{k}(P)-\Sigma_{k}(\widehat{P}))\right\|_{H(k)^{\otimes 4}}+\left\|(\Sigma_{k}(P)-\Sigma_{k}(\widehat{P}))\otimes(\mu_{k}(P)-\mu_{k}(\widehat{P}))^{\otimes 2}\right\|_{H(k)^{\otimes 4}}
+1ni=1nB(Xi)2H(k)4+1ni=1nB(Xi){(k(,Xi)μk(P))2Σk(P)}H(k)4\displaystyle~{}~{}~{}~{}~{}+\left\|\frac{1}{n}\sum_{i=1}^{n}B(X_{i})^{\otimes 2}\right\|_{H(k)^{\otimes 4}}+\left\|\frac{1}{n}\sum_{i=1}^{n}B(X_{i})\otimes\{(k(\cdot,X_{i})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\}\right\|_{H(k)^{\otimes 4}}
+1ni=1n{(k(,Xi)μk(P))2Σk(P)}B(Xi)H(k)4\displaystyle~{}~{}~{}~{}~{}+\left\|\frac{1}{n}\sum_{i=1}^{n}\{(k(\cdot,X_{i})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\}\otimes B(X_{i})\right\|_{H(k)^{\otimes 4}}
=:I1+I2+I3+I4+I5+I6+I7.\displaystyle=:I_{1}+I_{2}+I_{3}+I_{4}+I_{5}+I_{6}+I_{7}. (38)

By the law of large numbers in Hilbert spaces (see [8]), we have

I1=1ni=1n{(k(,Xi)μk(P))2Σk(P)}2ΥH(k)4=Op(1n),I_{1}=\left\|\frac{1}{n}\sum_{i=1}^{n}\{(k(\cdot,X_{i})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\}^{\otimes 2}-\Upsilon\right\|_{H(k)^{\otimes 4}}=O_{p}\left(\frac{1}{\sqrt{n}}\right), (39)

and we get

I2=Σk(P)Σk(P^)2H(k)2=Op(1n),I_{2}=\left\|\Sigma_{k}(P)-\Sigma_{k}(\widehat{P})\right\|^{2}_{H(k)^{\otimes 2}}=O_{p}\left(\frac{1}{n}\right), (40)
I3=μk(P)μk(P^)2H(k)Σk(P)Σk(P^)H(k)2=Op(1nn),I_{3}=\left\|\mu_{k}(P)-\mu_{k}(\widehat{P})\right\|^{2}_{H(k)}\left\|\Sigma_{k}(P)-\Sigma_{k}(\widehat{P})\right\|_{H(k)^{\otimes 2}}=O_{p}\left(\frac{1}{n\sqrt{n}}\right), (41)

and

I4=μk(P)μk(P^)2H(k)Σk(P)Σk(P^)H(k)2=Op(1nn).I_{4}=\left\|\mu_{k}(P)-\mu_{k}(\widehat{P})\right\|^{2}_{H(k)}\left\|\Sigma_{k}(P)-\Sigma_{k}(\widehat{P})\right\|_{H(k)^{\otimes 2}}=O_{p}\left(\frac{1}{n\sqrt{n}}\right). (42)

Next, we focus on I_{5}. We have by direct computation that

I5\displaystyle I_{5} 1ni=1nB(Xi)2H(k)2\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\left\|B(X_{i})\right\|^{2}_{H(k)^{\otimes 2}}
\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\left\{\left\|k(\cdot,X_{i})-\mu_{k}(P)\right\|_{H(k)}\left\|\mu_{k}(P)-\mu_{k}(\widehat{P})\right\|_{H(k)}+\left\|\mu_{k}(P)-\mu_{k}(\widehat{P})\right\|_{H(k)}\left\|k(\cdot,X_{i})-\mu_{k}(P)\right\|_{H(k)}\right.
+μk(P)μk(P^)2H(k)}2\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}\left.+\left\|\mu_{k}(P)-\mu_{k}(\widehat{P})\right\|^{2}_{H(k)}\right\}^{2}
=1ni=1n{μk(P)μk(P^)2H(k)+2k(,Xi)μk(P)H(k)μk(P)μk(P^)H(k)}2\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\left\{\left\|\mu_{k}(P)-\mu_{k}(\widehat{P})\right\|^{2}_{H(k)}+2\left\|k(\cdot,X_{i})-\mu_{k}(P)\right\|_{H(k)}\left\|\mu_{k}(P)-\mu_{k}(\widehat{P})\right\|_{H(k)}\right\}^{2}
=μk(P)μk(P^)4H(k)+{4ni=1nk(,Xi)μk(P)H(k)}μk(P)μk(P^)3H(k)\displaystyle=\left\|\mu_{k}(P)-\mu_{k}(\widehat{P})\right\|^{4}_{H(k)}+\left\{\frac{4}{n}\sum_{i=1}^{n}\left\|k(\cdot,X_{i})-\mu_{k}(P)\right\|_{H(k)}\right\}\left\|\mu_{k}(P)-\mu_{k}(\widehat{P})\right\|^{3}_{H(k)}
+{4ni=1nk(,Xi)μk(P)H(k)2}μk(P)μk(P^)H(k)2.\displaystyle~{}~{}~{}~{}~{}+\left\{\frac{4}{n}\sum_{i=1}^{n}\left\|k(\cdot,X_{i})-\mu_{k}(P)\right\|_{H(k)}^{2}\right\}\left\|\mu_{k}(P)-\mu_{k}(\widehat{P})\right\|_{H(k)}^{2}.

Since the Condition is assumed, the following hold:

1ni=1nk(,Xi)μk(P)H(k)2𝑃𝔼[k(,Xi)μk(P)H(k)2]\frac{1}{n}\sum_{i=1}^{n}\left\|k(\cdot,X_{i})-\mu_{k}(P)\right\|_{H(k)}^{2}\xrightarrow{P}\mathbb{E}[\left\|k(\cdot,X_{i})-\mu_{k}(P)\right\|_{H(k)}^{2}]

and

1ni=1nk(,Xi)μk(P)H(k)𝑃𝔼[k(,Xi)μk(P)H(k)]\frac{1}{n}\sum_{i=1}^{n}\left\|k(\cdot,X_{i})-\mu_{k}(P)\right\|_{H(k)}\xrightarrow{P}\mathbb{E}[\left\|k(\cdot,X_{i})-\mu_{k}(P)\right\|_{H(k)}]

as nn\to\infty by the law of large numbers. Hence, we get

I5\displaystyle I_{5} =Op(1n2)+1ni=1nk(,Xi)μk(P)H(k)Op(1nn)+1ni=1nk(,Xi)μk(P)H(k)2Op(1n)\displaystyle=O_{p}\left(\frac{1}{n^{2}}\right)+\frac{1}{n}\sum_{i=1}^{n}\left\|k(\cdot,X_{i})-\mu_{k}(P)\right\|_{H(k)}O_{p}\left(\frac{1}{n\sqrt{n}}\right)+\frac{1}{n}\sum_{i=1}^{n}\left\|k(\cdot,X_{i})-\mu_{k}(P)\right\|_{H(k)}^{2}O_{p}\left(\frac{1}{n}\right)
=Op(1n).\displaystyle=O_{p}\left(\frac{1}{n}\right). (43)

Also, we see that

I6\displaystyle I_{6} =1ni=1nB(Xi){(k(,Xi)μk(P))2Σk(P)}H(k)4\displaystyle=\left\|\frac{1}{n}\sum_{i=1}^{n}B(X_{i})\otimes\{(k(\cdot,X_{i})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\}\right\|_{H(k)^{\otimes 4}}
1ni=1nB(Xi)H(k)2(k(,Xi)μk(P))2Σk(P)H(k)2\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\left\|B(X_{i})\right\|_{H(k)^{\otimes 2}}\left\|(k(\cdot,X_{i})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right\|_{H(k)^{\otimes 2}}
(1ni=1nB(Xi)H(k)22)1/2(1ni=1n(k(,Xi)μk(P))2Σk(P)H(k)22)1/2\displaystyle\leq\left(\frac{1}{n}\sum_{i=1}^{n}\left\|B(X_{i})\right\|_{H(k)^{\otimes 2}}^{2}\right)^{1/2}\left(\frac{1}{n}\sum_{i=1}^{n}\left\|(k(\cdot,X_{i})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right\|_{H(k)^{\otimes 2}}^{2}\right)^{1/2}

by the Cauchy–Schwarz inequality, and we get

1ni=1n(k(,Xi)μk(P))2Σk(P)H(k)22𝑃𝔼[(k(,Xi)μk(P))2Σk(P)H(k)22]\frac{1}{n}\sum_{i=1}^{n}\left\|(k(\cdot,X_{i})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right\|_{H(k)^{\otimes 2}}^{2}\xrightarrow{P}\mathbb{E}[\left\|(k(\cdot,X_{i})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right\|_{H(k)^{\otimes 2}}^{2}]

as nn\to\infty. Hence, we get

I6=(1ni=1n(k(,Xi)μk(P))2Σk(P)H(k)22)1/2Op(1n)=Op(1n)I_{6}=\left(\frac{1}{n}\sum_{i=1}^{n}\left\|(k(\cdot,X_{i})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right\|_{H(k)^{\otimes 2}}^{2}\right)^{1/2}O_{p}\left(\frac{1}{\sqrt{n}}\right)=O_{p}\left(\frac{1}{\sqrt{n}}\right) (44)

since I5=Op(1/n)I_{5}=O_{p}(1/n). By the same argument as for I6=Op(1/n)I_{6}=O_{p}(1/\sqrt{n}) in (44), we have

I7\displaystyle I_{7} =1ni=1n{(k(,Xi)μk(P))2Σk(P)}B(Xi)H(k)4\displaystyle=\left\|\frac{1}{n}\sum_{i=1}^{n}\{(k(\cdot,X_{i})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\}\otimes B(X_{i})\right\|_{H(k)^{\otimes 4}}
\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\left\|B(X_{i})\right\|_{H(k)^{\otimes 2}}\left\|(k(\cdot,X_{i})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right\|_{H(k)^{\otimes 2}}
=Op(1n).\displaystyle=O_{p}\left(\frac{1}{\sqrt{n}}\right). (45)

Substituting (39), (40), (41), (42), (43), (44) and (45) into (38) leads to

Υ^(n)ΥH(k)4=Op(1n).\left\|\widehat{\Upsilon}^{(n)}-\Upsilon\right\|_{H(k)^{\otimes 4}}=O_{p}\left(\frac{1}{\sqrt{n}}\right). (46)

Therefore, we get

=1(λ^(n)λ)=Op(1n)\sum_{\ell=1}^{\infty}(\widehat{\lambda}_{\ell}^{(n)}-\lambda_{\ell})=O_{p}\left(\frac{1}{\sqrt{n}}\right) (47)

by (33) and (46); that is, \mathbb{E}[W^{\prime}]\to 0 as n\to\infty.

Next, we consider V[W^{\prime}]. Since \widehat{\Upsilon}^{(n)} and \Upsilon are compact Hermitian operators, (28) of [3] gives

=1(λ^(n)λ)2Υ^(n)ΥH(k)42.\sum_{\ell=1}^{\infty}(\widehat{\lambda}_{\ell}^{(n)}-\lambda_{\ell})^{2}\leq\left\|\widehat{\Upsilon}^{(n)}-\Upsilon\right\|_{H(k)^{\otimes 4}}^{2}.

Furthermore, we have shown \left\|\widehat{\Upsilon}^{(n)}-\Upsilon\right\|_{H(k)^{\otimes 4}}^{2}=O_{p}(1/n); hence

𝔼[=1(λ^(n)λ)2]0,\mathbb{E}\left[\sum_{\ell=1}^{\infty}(\widehat{\lambda}_{\ell}^{(n)}-\lambda_{\ell})^{2}\right]\to 0, (48)

as nn\to\infty. Also,

Cov\left(\sum_{\ell=1}^{\infty}({\widehat{\lambda}}^{(n)}_{\ell}-\lambda_{\ell}),\sum_{\ell^{\prime}=1}^{\infty}({\widehat{\lambda}}^{(n)}_{\ell^{\prime}}-\lambda_{\ell^{\prime}})\right)=\mathbb{E}\left[\left(\sum_{\ell=1}^{\infty}({\widehat{\lambda}}^{(n)}_{\ell}-\lambda_{\ell})\right)^{2}\right]-\left\{\mathbb{E}\left[\sum_{\ell=1}^{\infty}({\widehat{\lambda}}^{(n)}_{\ell}-\lambda_{\ell})\right]\right\}^{2} (49)

and since (=1(λ^(n)λ))2=Op(1/n)\left(\sum_{\ell=1}^{\infty}({\widehat{\lambda}}^{(n)}_{\ell}-\lambda_{\ell})\right)^{2}=O_{p}(1/n) by (47), we obtain

𝔼[(=1(λ^(n)λ))2]0,\mathbb{E}\left[\left(\sum_{\ell=1}^{\infty}({\widehat{\lambda}}^{(n)}_{\ell}-\lambda_{\ell})\right)^{2}\right]\to 0, (50)

as n\to\infty. In addition, since \mathbb{E}\left[\sum_{\ell=1}^{\infty}({\widehat{\lambda}}^{(n)}_{\ell}-\lambda_{\ell})\right]\to 0 as n\to\infty by (47), we get

Cov\left(\sum_{\ell=1}^{\infty}({\widehat{\lambda}}^{(n)}_{\ell}-\lambda_{\ell}),\sum_{\ell^{\prime}=1}^{\infty}({\widehat{\lambda}}^{(n)}_{\ell^{\prime}}-\lambda_{\ell^{\prime}})\right)\to 0,

as n\to\infty by (49) and (50). Therefore, V[W^{\prime}]\to 0 as n\to\infty.

Finally, we shall show W𝑃0W^{\prime}\xrightarrow{P}0, as nn\to\infty. From Chebyshev’s inequality, for any ε>0\varepsilon>0,

P(|W|ε)\displaystyle P(|W^{\prime}|\geq\varepsilon) =P(|W||𝔼[W]|ε|𝔼[W]|)\displaystyle=P(|W^{\prime}|-|\mathbb{E}[W^{\prime}]|\geq\varepsilon-|\mathbb{E}[W^{\prime}]|)
P(|W𝔼[W]|ε|𝔼[W]|)\displaystyle\leq P(|W^{\prime}-\mathbb{E}[W^{\prime}]|\geq\varepsilon-|\mathbb{E}[W^{\prime}]|)
V[W](ε|𝔼[W]|)2\displaystyle\leq\frac{V[W^{\prime}]}{(\varepsilon-|\mathbb{E}[W^{\prime}]|)^{2}}
0,\displaystyle\to 0,

as nn\to\infty. Therefore, W𝑃0W^{\prime}\xrightarrow{P}0 as nn\to\infty.

7.11 Proof of Proposition 4

Let \widetilde{h}:\mathcal{H}\times\mathcal{H}\to\mathbb{R} be the kernel defined by

h~(x,y)=(k(,x)μk(P^))2Σk(P^),(k(,y)μk(P^))2Σk(P^)H(k)2,x,y\widetilde{h}(x,y)=\left<(k(\cdot,x)-\mu_{k}(\widehat{P}))^{\otimes 2}-\Sigma_{k}(\widehat{P}),(k(\cdot,y)-\mu_{k}(\widehat{P}))^{\otimes 2}-\Sigma_{k}(\widehat{P})\right>_{H(k)^{\otimes 2}},~{}~{}x,y\in\mathcal{H}

and let \widetilde{h}(\cdot,x) denote the associated element \widetilde{h}(\cdot,x)=(k(\cdot,x)-\mu_{k}(\widehat{P}))^{\otimes 2}-\Sigma_{k}(\widehat{P}) of H(k)^{\otimes 2} for all x\in\mathcal{H}. Then, the operator T:H(k)^{\otimes 2}\to\mathbb{R}^{n} is defined by

T(A)=[A,h~(,X1)H(k)2A,h~(,Xn)H(k)2]T(A)=\begin{bmatrix}\left<A,\widetilde{h}(\cdot,X_{1})\right>_{H(k)^{\otimes 2}}\\ \vdots\\ \left<A,\widetilde{h}(\cdot,X_{n})\right>_{H(k)^{\otimes 2}}\end{bmatrix}

for any A\in H(k)^{\otimes 2}. The adjoint operator T^{*} of T (see Section VI.2 in [12] for details) is given by T^{*}\underline{a}=\sum_{i=1}^{n}a_{i}\widetilde{h}(\cdot,X_{i}) for all \underline{a}=\begin{bmatrix}a_{1}&\cdots&a_{n}\end{bmatrix}^{T}\in\mathbb{R}^{n}, because for all A\in H(k)^{\otimes 2},

Ta¯,AH(k)2\displaystyle\left<T^{*}\underline{a},A\right>_{H(k)^{\otimes 2}}
=a¯T(TA)\displaystyle=\underline{a}^{T}(TA)
\displaystyle=\underline{a}^{T}\begin{bmatrix}\left<\widetilde{h}(\cdot,X_{1}),A\right>_{H(k)^{\otimes 2}}\\ \vdots\\ \left<\widetilde{h}(\cdot,X_{n}),A\right>_{H(k)^{\otimes 2}}\end{bmatrix}
\displaystyle=a_{1}\left<\widetilde{h}(\cdot,X_{1}),A\right>_{H(k)^{\otimes 2}}+\cdots+a_{n}\left<\widetilde{h}(\cdot,X_{n}),A\right>_{H(k)^{\otimes 2}}
=i=1naih~(,Xi),AH(k)2.\displaystyle=\left<\sum_{i=1}^{n}a_{i}\widetilde{h}(\cdot,X_{i}),A\right>_{H(k)^{\otimes 2}}.

Let λ\lambda and AA be the eigenvalue and eigenvector of Υ^(n)\widehat{\Upsilon}^{(n)}, respectevely. Then, it is holds that

\frac{1}{n}\sum_{j=1}^{n}\left<\widetilde{h}(\cdot,X_{j}),A\right>_{H(k)^{\otimes 2}}\widetilde{h}(\cdot,X_{j})=\frac{1}{n}\sum_{j=1}^{n}\{\widetilde{h}(\cdot,X_{j})\}^{\otimes 2}A=\lambda A

from the definition of eigenvalue and eigenvector. By mapping both sides with TT,

\displaystyle\frac{1}{n}\begin{bmatrix}\widetilde{h}(X_{1},X_{1})&\cdots&\widetilde{h}(X_{1},X_{n})\\ \vdots&\ddots&\vdots\\ \widetilde{h}(X_{n},X_{1})&\cdots&\widetilde{h}(X_{n},X_{n})\end{bmatrix}\begin{bmatrix}\left<A,\widetilde{h}(\cdot,X_{1})\right>_{H(k)^{\otimes 2}}\\ \vdots\\ \left<A,\widetilde{h}(\cdot,X_{n})\right>_{H(k)^{\otimes 2}}\end{bmatrix}
=1nj=1nh~(,Xj),AH(k)2[h~(X1,Xj)h~(Xn,Xj)]\displaystyle=\frac{1}{n}\sum_{j=1}^{n}\left<\widetilde{h}(\cdot,X_{j}),A\right>_{H(k)^{\otimes 2}}\begin{bmatrix}\widetilde{h}(X_{1},X_{j})\\ \vdots\\ \widetilde{h}(X_{n},X_{j})\end{bmatrix}
=1nj=1nh~(,Xj),AH(k)2T(h~(,Xj))\displaystyle=\frac{1}{n}\sum_{j=1}^{n}\left<\widetilde{h}(\cdot,X_{j}),A\right>_{H(k)^{\otimes 2}}T(\widetilde{h}(\cdot,X_{j}))
=λT(A)\displaystyle=\lambda T(A)
=λ[A,h~(,X1)H(k)2A,h~(,Xn)H(k)2].\displaystyle=\lambda\begin{bmatrix}\left<A,\widetilde{h}(\cdot,X_{1})\right>_{H(k)^{\otimes 2}}\\ \vdots\\ \left<A,\widetilde{h}(\cdot,X_{n})\right>_{H(k)^{\otimes 2}}\end{bmatrix}.

Hence, each eigenvalue of \widehat{\Upsilon}^{(n)} is an eigenvalue of H/n.

Conversely, let \tau and \underline{u}=\begin{bmatrix}u_{1}&\cdots&u_{n}\end{bmatrix}^{T} be an eigenvalue and corresponding eigenvector of H/n. Then

\frac{1}{n}\sum_{j=1}^{n}u_{j}\begin{bmatrix}\widetilde{h}(X_{1},X_{j})\\ \vdots\\ \widetilde{h}(X_{n},X_{j})\end{bmatrix}=\frac{1}{n}H\underline{u}=\tau\underline{u},

and

Υ^(n){j=1nujh~(,Xj)}\displaystyle\widehat{\Upsilon}^{(n)}\left\{\sum_{j=1}^{n}u_{j}\widetilde{h}(\cdot,X_{j})\right\}
=1ni=1n{h~(,Xi)}2{j=1nujh~(,Xj)}\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\left\{\widetilde{h}(\cdot,X_{i})\right\}^{\otimes 2}\left\{\sum_{j=1}^{n}u_{j}\widetilde{h}(\cdot,X_{j})\right\}
=1ni=1nh~(,Xi),j=1nujh~(,Xj)H(k)2h~(,Xi)\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\left<\widetilde{h}(\cdot,X_{i}),\sum_{j=1}^{n}u_{j}\widetilde{h}(\cdot,X_{j})\right>_{H(k)^{\otimes 2}}\widetilde{h}(\cdot,X_{i})
=1ni,j=1nujh~(,Xi),h~(,Xj)H(k)2h~(,Xi)\displaystyle=\frac{1}{n}\sum_{i,j=1}^{n}u_{j}\left<\widetilde{h}(\cdot,X_{i}),\widetilde{h}(\cdot,X_{j})\right>_{H(k)^{\otimes 2}}\widetilde{h}(\cdot,X_{i})
=1nj=1nuji=1nh~(Xi,Xj)h~(,Xi)\displaystyle=\frac{1}{n}\sum_{j=1}^{n}u_{j}\sum_{i=1}^{n}\widetilde{h}(X_{i},X_{j})\widetilde{h}(\cdot,X_{i})
=1nj=1nujT([h~(X1,Xj)h~(Xn,Xj)])\displaystyle=\frac{1}{n}\sum_{j=1}^{n}u_{j}T^{*}\left(\begin{bmatrix}\widetilde{h}(X_{1},X_{j})\\ \vdots\\ \widetilde{h}(X_{n},X_{j})\end{bmatrix}\right)
\displaystyle=\tau T^{*}(\underline{u})
\displaystyle=\tau\sum_{i=1}^{n}u_{i}\widetilde{h}(\cdot,X_{i})

from mapping both sides with T^{*}; hence each eigenvalue of H/n is an eigenvalue of \widehat{\Upsilon}^{(n)}.
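The two directions above are the familiar kernel-PCA duality between the empirical operator \widehat{\Upsilon}^{(n)}=\frac{1}{n}\sum_{i=1}^{n}\{\widetilde{h}(\cdot,X_{i})\}^{\otimes 2} and the Gram matrix H. A finite-dimensional numerical sketch, with generic feature vectors standing in for \widetilde{h}(\cdot,X_{i}) (the dimensions and data below are hypothetical, not objects from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# Rows of V stand in for the features htilde(., X_i). The empirical operator
# (1/n) sum_i v_i v_i^T and the Gram matrix H/n, H_ij = <v_i, v_j>, share
# their nonzero eigenvalues, and u -> V^T u maps eigenvectors across.
n, d = 6, 20
V = rng.standard_normal((n, d))

Ups_hat = V.T @ V / n            # analogue of Upsilon-hat^(n), a d x d operator
H = V @ V.T                      # Gram matrix H with H_ij = <v_i, v_j>

# the nonzero spectra coincide
top_op = np.sort(np.linalg.eigvalsh(Ups_hat))[::-1][:n]
top_gram = np.sort(np.linalg.eigvalsh(H / n))[::-1]
assert np.allclose(top_op, top_gram)

# an eigenvector u of H/n is pushed to an eigenvector V^T u of the operator,
# mirroring T^* u = sum_i u_i htilde(., X_i) above
tau_all, U = np.linalg.eigh(H / n)
tau, u = tau_all[-1], U[:, -1]
assert np.allclose(Ups_hat @ (V.T @ u), tau * (V.T @ u))
```

This is why, in practice, the eigenvalues \widehat{\lambda}^{(n)}_{\ell} can be computed from the n\times n matrix H without ever forming an operator on H(k)^{\otimes 2}.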

7.12 Proof of Proposition 5

We see that

𝔼[(n+m)T^2n,m]\displaystyle\mathbb{E}[(n+m)\widehat{T}^{2}_{n,m}]
=n+mn2i,s=1n𝔼[h(Xi,Xs)]+n+mm2j,t=1m𝔼[h(Yj,Yt)]2(n+m)nmi=1nj=1m𝔼[h(Xi,Yj)],\displaystyle=\frac{n+m}{n^{2}}\sum_{i,s=1}^{n}\mathbb{E}[h(X_{i},X_{s})]+\frac{n+m}{m^{2}}\sum_{j,t=1}^{m}\mathbb{E}[h(Y_{j},Y_{t})]-\frac{2(n+m)}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m}\mathbb{E}[h(X_{i},Y_{j})],

where h(x,y) is defined in (5). Since we have

𝔼[h(X1,X2)]\displaystyle\mathbb{E}[h(X_{1},X_{2})] =𝔼[(k(,X1)μk(P))2Σk(P),(k(,X2)μk(P))2Σk(P)H(k)2]\displaystyle=\mathbb{E}\left[\left<(k(\cdot,X_{1})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P),(k(\cdot,X_{2})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right>_{H(k)^{\otimes 2}}\right]
=𝔼[(k(,X1)μk(P))2Σk(P)],𝔼[(k(,X2)μk(P))2Σk(P)]H(k)2\displaystyle=\left<\mathbb{E}[(k(\cdot,X_{1})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)],\mathbb{E}[(k(\cdot,X_{2})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)]\right>_{H(k)^{\otimes 2}}
=0\displaystyle=0

and

𝔼[h(X1,X1)]\displaystyle\mathbb{E}[h(X_{1},X_{1})] =𝔼[(k(,X1)μk(P))2Σk(P),(k(,X1)μk(P))2Σk(P)H(k)2]\displaystyle=\mathbb{E}\left[\left<(k(\cdot,X_{1})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P),(k(\cdot,X_{1})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right>_{H(k)^{\otimes 2}}\right]
=𝔼[{(k(,X1)μk(P))2Σk(P)}2,IH(k)4]\displaystyle=\mathbb{E}\left[\left<\left\{(k(\cdot,X_{1})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right\}^{\otimes 2},I\right>_{H(k)^{\otimes 4}}\right]
=𝔼[{(k(,X1)μk(P))2Σk(P)}2],IH(k)4\displaystyle=\left<\mathbb{E}\left[\left\{(k(\cdot,X_{1})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\right\}^{\otimes 2}\right],I\right>_{H(k)^{\otimes 4}}
=Υ,IH(k)4,\displaystyle=\left<\Upsilon,I\right>_{H(k)^{\otimes 4}},

under P=QP=Q, it follows that

𝔼[(n+m)T^2n,m]=n+mn2nΥ,IH(k)4+n+mm2mΥ,IH(k)4=(n+m)2nmΥ,IH(k)4.\mathbb{E}[(n+m)\widehat{T}^{2}_{n,m}]=\frac{n+m}{n^{2}}n\left<\Upsilon,I\right>_{H(k)^{\otimes 4}}+\frac{n+m}{m^{2}}m\left<\Upsilon,I\right>_{H(k)^{\otimes 4}}\\ =\frac{(n+m)^{2}}{nm}\left<\Upsilon,I\right>_{H(k)^{\otimes 4}}. (51)
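The two moment identities \mathbb{E}[h(X_{1},X_{2})]=0 and \mathbb{E}[h(X_{1},X_{1})]=\left<\Upsilon,I\right>_{H(k)^{\otimes 4}} can be checked numerically. As a hedged sketch (not the paper's setting), assume the linear kernel k(x,y)=\left<x,y\right> on \mathbb{R}^{d} and X\sim N(0,I_{d}), so that \mu_{k}(P)=0, \Sigma_{k}(P)=I, h(x,y)=\left<x,y\right>^{2}-\|x\|^{2}-\|y\|^{2}+d, and \left<\Upsilon,I\right>=\mathbb{E}\|XX^{T}-I\|_{F}^{2}=d^{2}+d:

```python
import numpy as np

rng = np.random.default_rng(2)

# Monte Carlo check of the two moments, under the assumed linear kernel
# k(x, y) = <x, y> and X, Y ~ N(0, I_d): h(x, y) = <x,y>^2 - |x|^2 - |y|^2 + d.
d, N = 3, 200_000
X = rng.standard_normal((N, d))
Y = rng.standard_normal((N, d))

h_XY = (X * Y).sum(1) ** 2 - (X ** 2).sum(1) - (Y ** 2).sum(1) + d
h_XX = (X ** 2).sum(1) ** 2 - 2 * (X ** 2).sum(1) + d

print(h_XY.mean())   # close to 0, since E[h(X1, X2)] = 0
print(h_XX.mean())   # close to d^2 + d = 12, since E[h(X1, X1)] = <Upsilon, I>
```

The off-diagonal terms of the V-statistic therefore contribute nothing in expectation under P=Q, and only the n+m diagonal terms produce the mean in (51).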

Next, we consider 𝔼[{(n+m)T^n,m2}2]\mathbb{E}\left[\left\{(n+m)\widehat{T}_{n,m}^{2}\right\}^{2}\right]. It follows from direct calculations that

𝔼[{(n+m)T^n,m2}2]\displaystyle\mathbb{E}\left[\left\{(n+m)\widehat{T}_{n,m}^{2}\right\}^{2}\right]
=𝔼[{n+mn2i,s=1nh(Xi,Xs)+n+mm2j,t=1mh(Yj,Yt)2(n+m)nmi=1nj=1mh(Xi,Yj)}2]\displaystyle=\mathbb{E}\left[\left\{\frac{n+m}{n^{2}}\sum_{i,s=1}^{n}h(X_{i},X_{s})+\frac{n+m}{m^{2}}\sum_{j,t=1}^{m}h(Y_{j},Y_{t})-\frac{2(n+m)}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m}h(X_{i},Y_{j})\right\}^{2}\right]
=(n+m)2n4𝔼[{i,s=1nh(Xi,Xs)}2]+(n+m)2m4𝔼[{j,t=1mh(Yj,Yt)}2]\displaystyle=\frac{(n+m)^{2}}{n^{4}}\mathbb{E}\left[\left\{\sum_{i,s=1}^{n}h(X_{i},X_{s})\right\}^{2}\right]+\frac{(n+m)^{2}}{m^{4}}\mathbb{E}\left[\left\{\sum_{j,t=1}^{m}h(Y_{j},Y_{t})\right\}^{2}\right]
+4(n+m)2n2m2𝔼[{i=1nj=1mh(Xi,Yj)}2]+2(n+m)2n2m2𝔼[{i,s=1nh(Xi,Xs)}{j,t=1mh(Yj,Yt)}]\displaystyle~{}~{}~{}~{}~{}+\frac{4(n+m)^{2}}{n^{2}m^{2}}\mathbb{E}\left[\left\{\sum_{i=1}^{n}\sum_{j=1}^{m}h(X_{i},Y_{j})\right\}^{2}\right]+\frac{2(n+m)^{2}}{n^{2}m^{2}}\mathbb{E}\left[\left\{\sum_{i,s=1}^{n}h(X_{i},X_{s})\right\}\left\{\sum_{j,t=1}^{m}h(Y_{j},Y_{t})\right\}\right]
4(n+m)2n3m𝔼[i,s,=1nj=1mh(Xi,Xs)h(X,Yj)]4(n+m)2nm3𝔼[i=1nj,t,k=1mh(Yj,Yt)h(Xi,Yk)].\displaystyle~{}~{}~{}~{}~{}-\frac{4(n+m)^{2}}{n^{3}m}\mathbb{E}\left[\sum_{i,s,\ell=1}^{n}\sum_{j=1}^{m}h(X_{i},X_{s})h(X_{\ell},Y_{j})\right]-\frac{4(n+m)^{2}}{nm^{3}}\mathbb{E}\left[\sum_{i=1}^{n}\sum_{j,t,k=1}^{m}h(Y_{j},Y_{t})h(X_{i},Y_{k})\right].

A straightforward but lengthy computation yields that

𝔼[{i,s=1nh(Xi,Xs)}2]=nA,IH(k)8+2n(n1)Υ2H(k)4+n(n1)Υ,IH(k)42,\mathbb{E}\left[\left\{\sum_{i,s=1}^{n}h(X_{i},X_{s})\right\}^{2}\right]=n\left<A,I\right>_{H(k)^{\otimes 8}}+2n(n-1)\left\|\Upsilon\right\|^{2}_{H(k)^{\otimes 4}}+n(n-1)\left<\Upsilon,I\right>_{H(k)^{\otimes 4}}^{2}, (52)

where A=\mathbb{E}[\{(k(\cdot,X_{1})-\mu_{k}(P))^{\otimes 2}-\Sigma_{k}(P)\}^{\otimes 4}]. In addition, we obtain by direct calculation that

𝔼[{i=1nj=1mh(Xi,Yj)}2]=nmΥ2H(k)4,\displaystyle\mathbb{E}\left[\left\{\sum_{i=1}^{n}\sum_{j=1}^{m}h(X_{i},Y_{j})\right\}^{2}\right]=nm\left\|\Upsilon\right\|^{2}_{H(k)^{\otimes 4}},
\displaystyle\mathbb{E}\left[\left\{\sum_{i,s=1}^{n}h(X_{i},X_{s})\right\}\left\{\sum_{j,t=1}^{m}h(Y_{j},Y_{t})\right\}\right]=nm\left<\Upsilon,I\right>_{H(k)^{\otimes 4}}^{2}.

Therefore, using (51) and (52),

\displaystyle V[(n+m)\widehat{T}^{2}_{n,m}]
=𝔼[{(n+m)T^2n,m}2]{𝔼[(n+m)T^2n,m]}2\displaystyle=\mathbb{E}\left[\left\{(n+m)\widehat{T}^{2}_{n,m}\right\}^{2}\right]-\{\mathbb{E}[(n+m)\widehat{T}^{2}_{n,m}]\}^{2}
=(n+m)2n4(nA,IH(k)8+2n(n1)Υ2H(k)4+n(n1)Υ,IH(k)42)\displaystyle=\frac{(n+m)^{2}}{n^{4}}\left(n\left<A,I\right>_{H(k)^{\otimes 8}}+2n(n-1)\left\|\Upsilon\right\|^{2}_{H(k)^{\otimes 4}}+n(n-1)\left<\Upsilon,I\right>_{H(k)^{\otimes 4}}^{2}\right)
+(n+m)2m4(mA,IH(k)8+2m(m1)Υ2H(k)4+m(m1)Υ,IH(k)42)\displaystyle~{}~{}~{}~{}~{}+\frac{(n+m)^{2}}{m^{4}}\left(m\left<A,I\right>_{H(k)^{\otimes 8}}+2m(m-1)\left\|\Upsilon\right\|^{2}_{H(k)^{\otimes 4}}+m(m-1)\left<\Upsilon,I\right>_{H(k)^{\otimes 4}}^{2}\right)
\displaystyle~{}~{}~{}~{}~{}+\frac{4(n+m)^{2}}{n^{2}m^{2}}nm\left\|\Upsilon\right\|^{2}_{H(k)^{\otimes 4}}+\frac{2(n+m)^{2}}{n^{2}m^{2}}nm\left<\Upsilon,I\right>^{2}_{H(k)^{\otimes 4}}-\frac{(n+m)^{4}}{n^{2}m^{2}}\left<\Upsilon,I\right>^{2}_{H(k)^{\otimes 4}}
=2(n+m)4n2m2Υ2H(k)4+O(1n)+O(1m).\displaystyle=\frac{2(n+m)^{4}}{n^{2}m^{2}}\left\|\Upsilon\right\|^{2}_{H(k)^{\otimes 4}}+O\left(\frac{1}{n}\right)+O\left(\frac{1}{m}\right).
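The leading term in the last line comes from collecting the three \left\|\Upsilon\right\|^{2} contributions: up to lower-order terms,

```latex
2(n+m)^{2}\left(\frac{1}{n^{2}}+\frac{2}{nm}+\frac{1}{m^{2}}\right)\left\|\Upsilon\right\|^{2}_{H(k)^{\otimes 4}}
=2(n+m)^{2}\left(\frac{n+m}{nm}\right)^{2}\left\|\Upsilon\right\|^{2}_{H(k)^{\otimes 4}}
=\frac{2(n+m)^{4}}{n^{2}m^{2}}\left\|\Upsilon\right\|^{2}_{H(k)^{\otimes 4}},
```

while the \left<\Upsilon,I\right>^{2} terms cancel at the same order.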

7.13 Proof of Proposition 6

Since

(n+m)Δ^2n,m=n+mn2i,s=1nk(Xi,Xs)+n+mm2j,t=1mk(Yj,Yt)2(n+m)nmi=1nj=1mk(Xi,Yj),(n+m)\widehat{\Delta}^{2}_{n,m}=\frac{n+m}{n^{2}}\sum_{i,s=1}^{n}k(X_{i},X_{s})+\frac{n+m}{m^{2}}\sum_{j,t=1}^{m}k(Y_{j},Y_{t})-\frac{2(n+m)}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m}k(X_{i},Y_{j}),

we first calculate

𝔼[(n+m)Δ^2n,m]=n+mn2i,s=1n𝔼[k(Xi,Xs)]+n+mm2j,t=1m𝔼[k(Yj,Yt)]2(n+m)nmi=1nj=1m𝔼[k(Xi,Yj)].\mathbb{E}[(n+m)\widehat{\Delta}^{2}_{n,m}]=\frac{n+m}{n^{2}}\sum_{i,s=1}^{n}\mathbb{E}[k(X_{i},X_{s})]+\frac{n+m}{m^{2}}\sum_{j,t=1}^{m}\mathbb{E}[k(Y_{j},Y_{t})]-\frac{2(n+m)}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m}\mathbb{E}[k(X_{i},Y_{j})].

Since the expected values of each term are obtained as

𝔼[k(X1,X2)]=μk(P)2H(k),\displaystyle\mathbb{E}[k(X_{1},X_{2})]=\left\|\mu_{k}(P)\right\|^{2}_{H(k)},
\displaystyle\mathbb{E}[k(X_{1},X_{1})]=\left<\Sigma_{k}(P),I\right>_{H(k)^{\otimes 2}}+\left\|\mu_{k}(P)\right\|_{H(k)}^{2},

we get

𝔼[(n+m)Δ^2n,m]=(n+m)2nmΣk(P),IH(k)2\displaystyle\mathbb{E}[(n+m)\widehat{\Delta}^{2}_{n,m}]=\frac{(n+m)^{2}}{nm}\left<\Sigma_{k}(P),I\right>_{H(k)^{\otimes 2}} (53)

under P=QP=Q.

Next, the second moment of (n+m)Δ^2n,m(n+m)\widehat{\Delta}^{2}_{n,m} is

𝔼[{(n+m)Δ^2n,m}2]\displaystyle\mathbb{E}[\{(n+m)\widehat{\Delta}^{2}_{n,m}\}^{2}]
=𝔼[{n+mn2i,s=1nk(Xi,Xs)+n+mm2j,t=1mk(Yj,Yt)2(n+m)nmi=1nj=1mk(Xi,Yj)}2]\displaystyle=\mathbb{E}\left[\left\{\frac{n+m}{n^{2}}\sum_{i,s=1}^{n}k(X_{i},X_{s})+\frac{n+m}{m^{2}}\sum_{j,t=1}^{m}k(Y_{j},Y_{t})-\frac{2(n+m)}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m}k(X_{i},Y_{j})\right\}^{2}\right]
=(n+m)2n4𝔼[{i,s=1nk(Xi,Xs)}2]+(n+m)2m4𝔼[{j,t=1mk(Yj,Yt)}2]\displaystyle=\frac{(n+m)^{2}}{n^{4}}\mathbb{E}\left[\left\{\sum_{i,s=1}^{n}k(X_{i},X_{s})\right\}^{2}\right]+\frac{(n+m)^{2}}{m^{4}}\mathbb{E}\left[\left\{\sum_{j,t=1}^{m}k(Y_{j},Y_{t})\right\}^{2}\right]
+4(n+m)2n2m2𝔼[{i=1nj=1mk(Xi,Yj)}2]+2(n+m)2n2m2𝔼[{i,s=1nk(Xi,Xs)}{j,t=1mk(Yj,Yt)}]\displaystyle~{}~{}~{}~{}~{}+\frac{4(n+m)^{2}}{n^{2}m^{2}}\mathbb{E}\left[\left\{\sum_{i=1}^{n}\sum_{j=1}^{m}k(X_{i},Y_{j})\right\}^{2}\right]+\frac{2(n+m)^{2}}{n^{2}m^{2}}\mathbb{E}\left[\left\{\sum_{i,s=1}^{n}k(X_{i},X_{s})\right\}\left\{\sum_{j,t=1}^{m}k(Y_{j},Y_{t})\right\}\right]
4(n+m)2n3m𝔼[i,s,=1nj=1mk(Xi,Xs)k(X,Yj)]4(n+m)2nm3𝔼[i=1nj,t,k=1mk(Yj,Yt)k(Xi,Yk)].\displaystyle~{}~{}~{}~{}~{}-\frac{4(n+m)^{2}}{n^{3}m}\mathbb{E}\left[\sum_{i,s,\ell=1}^{n}\sum_{j=1}^{m}k(X_{i},X_{s})k(X_{\ell},Y_{j})\right]-\frac{4(n+m)^{2}}{nm^{3}}\mathbb{E}\left[\sum_{i=1}^{n}\sum_{j,t,k=1}^{m}k(Y_{j},Y_{t})k(X_{i},Y_{k})\right]. (54)

These expectations are obtained as

𝔼[{i,s=1nk(Xi,Xs)}2]\displaystyle\mathbb{E}\left[\left\{\sum_{i,s=1}^{n}k(X_{i},X_{s})\right\}^{2}\right]
=n𝔼[k(X1,X1)2]+4n(n1)𝔼[k(,X1)2k(,X1)],μk(P)H(k)+2n(n1)Σk(P)2H(k)2\displaystyle=n\mathbb{E}[k(X_{1},X_{1})^{2}]+4n(n-1)\left<\mathbb{E}[k(\cdot,X_{1})^{\otimes 2}k(\cdot,X_{1})],\mu_{k}(P)\right>_{H(k)}+2n(n-1)\left\|\Sigma_{k}(P)\right\|^{2}_{H(k)^{\otimes 2}}
+n(n1)Σk(P),IH(k)22+2n(n1)2Σk(P),IH(k)2μk(P)2H(k)\displaystyle~{}~{}~{}~{}~{}+n(n-1)\left<\Sigma_{k}(P),I\right>_{H(k)^{\otimes 2}}^{2}+2n(n-1)^{2}\left<\Sigma_{k}(P),I\right>_{H(k)^{\otimes 2}}\left\|\mu_{k}(P)\right\|^{2}_{H(k)}
+4n(n1)2Σk(P),μk(P)2H(k)2+n(n1)(n2+n3)μk(P)4H(k),\displaystyle~{}~{}~{}~{}~{}+4n(n-1)^{2}\left<\Sigma_{k}(P),\mu_{k}(P)^{\otimes 2}\right>_{H(k)^{\otimes 2}}+n(n-1)(n^{2}+n-3)\left\|\mu_{k}(P)\right\|^{4}_{H(k)}, (55)
𝔼[{i=1nj=1mk(Xi,Yj)}2]\displaystyle\mathbb{E}\left[\left\{\sum_{i=1}^{n}\sum_{j=1}^{m}k(X_{i},Y_{j})\right\}^{2}\right]
=nmΣk(P)2H(k)2+nm(n+m)Σk(P),μk(P)2H(k)2+n2m2μk(P)4H(k),\displaystyle=nm\left\|\Sigma_{k}(P)\right\|^{2}_{H(k)^{\otimes 2}}+nm(n+m)\left<\Sigma_{k}(P),\mu_{k}(P)^{\otimes 2}\right>_{H(k)^{\otimes 2}}+n^{2}m^{2}\left\|\mu_{k}(P)\right\|^{4}_{H(k)}, (56)
𝔼[{i,s=1nk(Xi,Xs)}{j,t=1mk(Yj,Yt)}]\displaystyle\mathbb{E}\left[\left\{\sum_{i,s=1}^{n}k(X_{i},X_{s})\right\}\left\{\sum_{j,t=1}^{m}k(Y_{j},Y_{t})\right\}\right]
\displaystyle=nm\left<\Sigma_{k}(P),I\right>_{H(k)^{\otimes 2}}^{2}+nm(n+m)\left<\Sigma_{k}(P),I\right>_{H(k)^{\otimes 2}}\left\|\mu_{k}(P)\right\|^{2}_{H(k)}+n^{2}m^{2}\left\|\mu_{k}(P)\right\|^{4}_{H(k)}, (57)
𝔼[i,s,=1nj=1mk(Xi,Xs)k(X,Yj)]\displaystyle\mathbb{E}\left[\sum_{i,s,\ell=1}^{n}\sum_{j=1}^{m}k(X_{i},X_{s})k(X_{\ell},Y_{j})\right]
=nm𝔼[k(,X1)2k(,X1)],μk(P)H(k)+n(n1)mΣk(P),IH(k)2μk(P)2H(k)\displaystyle=nm\left<\mathbb{E}[k(\cdot,X_{1})^{\otimes 2}k(\cdot,X_{1})],\mu_{k}(P)\right>_{H(k)}+n(n-1)m\left<\Sigma_{k}(P),I\right>_{H(k)^{\otimes 2}}\left\|\mu_{k}(P)\right\|^{2}_{H(k)}
+2n(n1)mΣk(P),μk(P)2H(k)2+n(n1)(n+1)mμk(P)4H(k),\displaystyle~{}~{}~{}~{}~{}+2n(n-1)m\left<\Sigma_{k}(P),\mu_{k}(P)^{\otimes 2}\right>_{H(k)^{\otimes 2}}+n(n-1)(n+1)m\left\|\mu_{k}(P)\right\|^{4}_{H(k)}, (58)
𝔼[i=1nj,t,k=1mk(Yj,Yt)k(Xi,Yk)]\displaystyle\mathbb{E}\left[\sum_{i=1}^{n}\sum_{j,t,k=1}^{m}k(Y_{j},Y_{t})k(X_{i},Y_{k})\right]
=nm𝔼[k(,X1)2k(,X1)],μk(P)H(k)+m(m1)nΣk(P),IH(k)2μk(P)2H(k)\displaystyle=nm\left<\mathbb{E}[k(\cdot,X_{1})^{\otimes 2}k(\cdot,X_{1})],\mu_{k}(P)\right>_{H(k)}+m(m-1)n\left<\Sigma_{k}(P),I\right>_{H(k)^{\otimes 2}}\left\|\mu_{k}(P)\right\|^{2}_{H(k)}
+2m(m1)nΣk(P),μk(P)2H(k)2+m(m1)(m+1)nμk(P)4H(k).\displaystyle~{}~{}~{}~{}~{}+2m(m-1)n\left<\Sigma_{k}(P),\mu_{k}(P)^{\otimes 2}\right>_{H(k)^{\otimes 2}}+m(m-1)(m+1)n\left\|\mu_{k}(P)\right\|^{4}_{H(k)}. (59)

Combining (54) with (55)-(59) yields

𝔼[{(n+m)Δ^2n,m}2]\displaystyle\mathbb{E}[\{(n+m)\widehat{\Delta}^{2}_{n,m}\}^{2}]
=2(n+m)4n2m2Σk(P)2H(k)2+(n+m)4n2m2Σk(P),IH(k)22+O(1n)+O(1m).\displaystyle=\frac{2(n+m)^{4}}{n^{2}m^{2}}\left\|\Sigma_{k}(P)\right\|^{2}_{H(k)^{\otimes 2}}+\frac{(n+m)^{4}}{n^{2}m^{2}}\left<\Sigma_{k}(P),I\right>_{H(k)^{\otimes 2}}^{2}+O\left(\frac{1}{n}\right)+O\left(\frac{1}{m}\right). (60)

Therefore, from (53) and (60), the variance of (n+m)Δ^2n,m(n+m)\widehat{\Delta}^{2}_{n,m} is

V[(n+m)Δ^2n,m]\displaystyle V[(n+m)\widehat{\Delta}^{2}_{n,m}] =𝔼[{(n+m)Δ^2n,m}2]{𝔼[(n+m)Δ^2n,m]}2\displaystyle=\mathbb{E}\left[\left\{(n+m)\widehat{\Delta}^{2}_{n,m}\right\}^{2}\right]-\left\{\mathbb{E}[(n+m)\widehat{\Delta}^{2}_{n,m}]\right\}^{2}
=2(n+m)4n2m2Σk(P)2H(k)2+O(1n)+O(1m).\displaystyle=\frac{2(n+m)^{4}}{n^{2}m^{2}}\left\|\Sigma_{k}(P)\right\|^{2}_{H(k)^{\otimes 2}}+O\left(\frac{1}{n}\right)+O\left(\frac{1}{m}\right).
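As a numerical illustration of the plug-in statistic \widehat{\Delta}^{2}_{n,m} above (not part of the paper), the following sketch computes it from Gram matrices; the Gaussian kernel choice and the function names `gaussian_gram` and `mmd2_biased` are illustrative assumptions:

```python
import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    """Gaussian kernel Gram matrix: k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def mmd2_biased(X, Y, sigma=1.0):
    """Plug-in (biased) estimator of MMD^2:
    (1/n^2) sum_{i,s} k(X_i, X_s) + (1/m^2) sum_{j,t} k(Y_j, Y_t)
    - (2/nm) sum_{i,j} k(X_i, Y_j)."""
    Kxx = gaussian_gram(X, X, sigma)
    Kyy = gaussian_gram(Y, Y, sigma)
    Kxy = gaussian_gram(X, Y, sigma)
    return Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()
```

Since this is the squared norm of the difference of empirical kernel mean embeddings, it is nonnegative and vanishes when the two samples coincide.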

7.14 Proof of (7)

The (i,j)(i,j)-th element of the matrix HH is

Hij\displaystyle H_{ij} =(k(,Xi)μk(P^))2Σk(P^),(k(,Xj)μk(P^))2Σk(P^)H(k)2\displaystyle=\left<(k(\cdot,X_{i})-{\mu}_{k}(\widehat{P}))^{\otimes 2}-{\Sigma}_{k}(\widehat{P}),(k(\cdot,X_{j})-{\mu}_{k}(\widehat{P}))^{\otimes 2}-{\Sigma}_{k}(\widehat{P})\right>_{H(k)^{\otimes 2}}
=(k(,Xi)μk(P^))2,(k(,Xj)μk(P^))2H(k)2Σk(P^),(k(,Xj)μk(P^))2H(k)2\displaystyle=\left<(k(\cdot,X_{i})-{\mu}_{k}(\widehat{P}))^{\otimes 2},(k(\cdot,X_{j})-{\mu}_{k}(\widehat{P}))^{\otimes 2}\right>_{H(k)^{\otimes 2}}-\left<{\Sigma}_{k}(\widehat{P}),(k(\cdot,X_{j})-{\mu}_{k}(\widehat{P}))^{\otimes 2}\right>_{H(k)^{\otimes 2}}
(k(,Xi)μk(P^))2,Σk(P^)H(k)2+Σk(P^),Σk(P^)H(k)2.\displaystyle~{}~{}~{}~{}~{}-\left<(k(\cdot,X_{i})-{\mu}_{k}(\widehat{P}))^{\otimes 2},{\Sigma}_{k}(\widehat{P})\right>_{H(k)^{\otimes 2}}+\left<{\Sigma}_{k}(\widehat{P}),{\Sigma}_{k}(\widehat{P})\right>_{H(k)^{\otimes 2}}.

Each term of this HijH_{ij} can be expressed as

(k(,Xi)μk(P^))2,(k(,Xj)μk(P^))2H(k)2\displaystyle\left<(k(\cdot,X_{i})-{\mu}_{k}(\widehat{P}))^{\otimes 2},(k(\cdot,X_{j})-{\mu}_{k}(\widehat{P}))^{\otimes 2}\right>_{H(k)^{\otimes 2}}
=k(,Xi)μk(P^),k(,Xj)μk(P^)H(k)2\displaystyle=\left<k(\cdot,X_{i})-{\mu}_{k}(\widehat{P}),k(\cdot,X_{j})-{\mu}_{k}(\widehat{P})\right>_{H(k)}^{2}
={k(Xi,Xj)μk(P^)(Xi)μk(P^)(Xj)+μk(P^),μk(P^)H(k)}2\displaystyle=\left\{k(X_{i},X_{j})-\mu_{k}(\widehat{P})(X_{i})-\mu_{k}(\widehat{P})(X_{j})+\left<\mu_{k}(\widehat{P}),\mu_{k}(\widehat{P})\right>_{H(k)}\right\}^{2}
={k(Xi,Xj)1ns=1nk(Xj,Xs)1n=1nk(Xi,X)+1n2s,=1nk(Xs,X)}2\displaystyle=\left\{k(X_{i},X_{j})-\frac{1}{n}\sum_{s=1}^{n}k(X_{j},X_{s})-\frac{1}{n}\sum_{\ell=1}^{n}k(X_{i},X_{\ell})+\frac{1}{n^{2}}\sum_{s,\ell=1}^{n}k(X_{s},X_{\ell})\right\}^{2}
=(K~ij)2\displaystyle=\left(\widetilde{K}_{ij}\right)^{2}
=(K~K~)ij,\displaystyle=\left(\widetilde{K}\odot\widetilde{K}\right)_{ij},
Σk(P^),(k(,Xi)μk(P^))2H(k)2\displaystyle\left<\Sigma_{k}(\widehat{P}),(k(\cdot,X_{i})-{\mu}_{k}(\widehat{P}))^{\otimes 2}\right>_{H(k)^{\otimes 2}}
=1ns=1n(k(,Xs)μk(P^))2,(k(,Xi)μk(P^))2H(k)2\displaystyle=\left<\frac{1}{n}\sum_{s=1}^{n}(k(\cdot,X_{s})-{\mu}_{k}(\widehat{P}))^{\otimes 2},(k(\cdot,X_{i})-{\mu}_{k}(\widehat{P}))^{\otimes 2}\right>_{H(k)^{\otimes 2}}
=1ns=1n(k(,Xs)μk(P^))2,(k(,Xi)μk(P^))2H(k)2\displaystyle=\frac{1}{n}\sum_{s=1}^{n}\left<(k(\cdot,X_{s})-{\mu}_{k}(\widehat{P}))^{\otimes 2},(k(\cdot,X_{i})-{\mu}_{k}(\widehat{P}))^{\otimes 2}\right>_{H(k)^{\otimes 2}}
=1ns=1nk(,Xs)μk(P^),k(,Xi)μk(P^)H(k)2\displaystyle=\frac{1}{n}\sum_{s=1}^{n}\left<k(\cdot,X_{s})-{\mu}_{k}(\widehat{P}),k(\cdot,X_{i})-{\mu}_{k}(\widehat{P})\right>_{H(k)}^{2}
\displaystyle=\frac{1}{n}\sum_{s=1}^{n}\left(\widetilde{K}_{si}\right)^{2}
\displaystyle=\frac{1}{n}\sum_{s=1}^{n}\left(\widetilde{K}\odot\widetilde{K}\right)_{si}

and

Σk(P^),Σk(P^)H(k)2\displaystyle\left<\Sigma_{k}(\widehat{P}),\Sigma_{k}(\widehat{P})\right>_{H(k)^{\otimes 2}}
=1ns=1n(k(,Xs)μk(P^))2,1n=1n(k(,X)μk(P^))2H(k)2\displaystyle=\left<\frac{1}{n}\sum_{s=1}^{n}(k(\cdot,X_{s})-{\mu}_{k}(\widehat{P}))^{\otimes 2},\frac{1}{n}\sum_{\ell=1}^{n}(k(\cdot,X_{\ell})-{\mu}_{k}(\widehat{P}))^{\otimes 2}\right>_{H(k)^{\otimes 2}}
=1n2s,=1n(k(,Xs)μk(P^))2,(k(,X)μk(P^))2H(k)2\displaystyle=\frac{1}{n^{2}}\sum_{s,\ell=1}^{n}\left<(k(\cdot,X_{s})-{\mu}_{k}(\widehat{P}))^{\otimes 2},(k(\cdot,X_{\ell})-{\mu}_{k}(\widehat{P}))^{\otimes 2}\right>_{H(k)^{\otimes 2}}
=1n2s,=1nk(,Xs)μk(P^),k(,X)μk(P^)H(k)2\displaystyle=\frac{1}{n^{2}}\sum_{s,\ell=1}^{n}\left<k(\cdot,X_{s})-{\mu}_{k}(\widehat{P}),k(\cdot,X_{\ell})-{\mu}_{k}(\widehat{P})\right>_{H(k)}^{2}
=1n2s,=1n(K~s)2\displaystyle=\frac{1}{n^{2}}\sum_{s,\ell=1}^{n}\left(\widetilde{K}_{s\ell}\right)^{2}
=1n2s,=1n(K~K~)s.\displaystyle=\frac{1}{n^{2}}\sum_{s,\ell=1}^{n}\left(\widetilde{K}\odot\widetilde{K}\right)_{s\ell}.

Therefore,

Hij=(K~K~)ij1ns=1n(K~K~)sj1ns=1n(K~K~)si+1n2s,=1n(K~K~)s,H_{ij}=\left(\widetilde{K}\odot\widetilde{K}\right)_{ij}-\frac{1}{n}\sum_{s=1}^{n}\left(\widetilde{K}\odot\widetilde{K}\right)_{sj}-\frac{1}{n}\sum_{s=1}^{n}\left(\widetilde{K}\odot\widetilde{K}\right)_{si}+\frac{1}{n^{2}}\sum_{s,\ell=1}^{n}\left(\widetilde{K}\odot\widetilde{K}\right)_{s\ell},

which gives the expression (7).
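The expression (7) amounts to a double centering of the Hadamard square of the centered Gram matrix \widetilde{K}. A minimal NumPy sketch (the function name `mvd_h_matrix` is illustrative, not from the paper):

```python
import numpy as np

def mvd_h_matrix(K):
    """H from a Gram matrix K[i, j] = k(X_i, X_j):
    K_tilde = C K C with the centering matrix C = I - (1/n) 1 1^T,
    then H = C (K_tilde o K_tilde) C, where o is the Hadamard product."""
    n = K.shape[0]
    C = np.eye(n) - np.full((n, n), 1.0 / n)  # centering matrix
    K_tilde = C @ K @ C                       # doubly centered Gram matrix
    return C @ (K_tilde * K_tilde) @ C        # double centering of K_tilde o K_tilde
```

Since H is doubly centered, each of its rows and columns sums to zero, consistent with the four-term expression above.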

References

  • [1] U. Alon, N. Barkai, D. A. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J. Levine. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences, 96(12):6745–6750, 1999.
  • [2] N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404, 1950.
  • [3] R. Bhatia and L. Elsner. The Hoffman-Wielandt inequality in infinite dimensions. Proceedings of the Indian Academy of Sciences - Mathematical Sciences, 104(3):483–494, 1994.
  • [4] G. Boente, D. Rodriguez, and M. Sued. Testing equality between several populations covariance operators. Annals of the Institute of Statistical Mathematics, 70(4):919–950, 2018.
  • [5] K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf. Kernel measures of conditional dependence. Advances in Neural Information Processing Systems 20 - Proceedings of the 2007 Conference, pages 1–13, 2009.
  • [6] A. Gretton, K. M. Borgwardt, M. Rasch, B. Schölkopf, and A. J. Smola. A kernel method for the two-sample-problem. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems, volume 19, pages 513–520. MIT Press, 2007.
  • [7] A. Gretton, K. Fukumizu, Z. Harchaoui, and B. Sriperumbudur. A fast, consistent kernel two-sample test. Advances in Neural Information Processing Systems, pages 673–681, 2009.
  • [8] J. Hoffmann-Jorgensen and G. Pisier. The law of large numbers and the central limit theorem in Banach spaces. The Annals of Probability, 4(4):587–599, 1976.
  • [9] J. Kellner and A. Celisse. A one-sample test for normality with kernel methods. Bernoulli, 25(3):1816–1837, 2019.
  • [10] H. Q. Minh, P. Niyogi, and Y. Yao. Mercer’s theorem, feature maps, and smoothing. In Proceedings of the 19th Annual Conference on Learning Theory, COLT’06, pages 154–168, Berlin, Heidelberg, 2006. Springer-Verlag.
  • [11] D. N. Politis, J. P. Romano, and M. Wolf. Subsampling. Springer Series in Statistics. Springer, New York, 1999.
  • [12] M. Reed and B. Simon. Functional Analysis. Methods of Modern Mathematical Physics. Elsevier Science, 1981.
  • [13] M. P. Wand and M. C. Jones. Kernel Smoothing. Chapman & Hall, New York, 1994.