
Optimal Kernel for Kernel-Based Modal Statistical Methods

Ryoya Yamasaki ([email protected])
Department of Systems Science
Graduate School of Informatics, Kyoto University
36-1 Yoshida-Honmachi, Sakyo-ku, Kyoto 606-8501 JAPAN

Toshiyuki Tanaka ([email protected])
Department of Systems Science
Graduate School of Informatics, Kyoto University
36-1 Yoshida-Honmachi, Sakyo-ku, Kyoto 606-8501 JAPAN
Abstract

Kernel-based modal statistical methods include mode estimation, regression, and clustering. The estimation accuracy of these methods depends on the kernel used as well as on the bandwidth. We study the effect of the selection of the kernel function on the estimation accuracy of these methods. In particular, we theoretically derive a (multivariate) optimal kernel that, among a certain kernel class defined via the number of its sign changes, minimizes an analytically obtained asymptotic error criterion when an optimal bandwidth is used.

Keywords: Optimal kernel, kernel mode estimation, modal linear regression, mode clustering

1 Introduction

The mode of a continuous random variable is a location statistic at which a probability density function (PDF) takes a local maximum value. It provides a good and easy-to-interpret summary of probabilistic events: one can expect many sample points in the vicinity of a mode. Besides, the mode has several desirable properties for data analysis. One of them is that, as a statistic expressing the centrality of data following a unimodal distribution, it is robust against outliers and skewed or heavy-tailed distributions. The mean is not robust at all, which often causes trouble in real applications. Another desirable property of the mode is that it can be naturally defined even in multivariate settings or non-Euclidean spaces. For example, the median and quantiles are intrinsically difficult to define in these cases due to the lack of an appropriate ordering structure. Also, the mode works well for multimodal distributions without loss of interpretability.

These attractive properties of the mode are utilized not only in location estimation but also in other data analysis techniques such as regression and clustering. Among various modal statistical methods, kernel-based methods have been widely used, presumably because they retain the inherent goodness of the mode and are easy to use in that they can be implemented with simple estimation algorithms, such as the mean-shift algorithm (Comaniciu and Meer, 2002; Yamasaki and Tanaka, 2020) for kernel mode estimation (Parzen, 1962; Eddy, 1980; Romano, 1988) and for mode clustering (Casa et al., 2020), and the iteratively reweighted least squares algorithm (Yamasaki and Tanaka, 2019) for modal linear regression (Yao and Li, 2014; Kemp et al., 2020). Also, their good estimation accuracy is an important advantage: for example, kernel mode estimation generally has the best estimation accuracy in mode estimation in the sense that it attains the minimax-optimal rate with respect to the sample size used (Hasminskii, 1979; Donoho et al., 1991).

Estimation accuracy of such kernel-based methods depends on the kernel function used as well as the bandwidth. Nevertheless, the selection of a kernel has not been well studied, and some questions remain unanswered. For instance, for univariate kernel mode estimation, Granovsky and Müller (1991) have derived the optimal kernel, which minimizes the asymptotic mean squared error (AMSE) of a kernel mode estimate using an optimal bandwidth among a certain kernel class defined via the number of sign changes. Whether their argument can be extended to multivariate cases, however, has not been clarified yet. We are also interested in optimal kernels for other kernel-based modal statistical methods, such as modal linear regression and mode clustering, for which the problem of kernel selection has not been elucidated enough, as far as we are aware. The objective of this paper is to study the problem of kernel selection for various kernel-based modal statistical methods in the multivariate setting.

In the main part of this paper (Sections 2-4), with the aim of simplifying the presentation, we focus on kernel mode estimation (Parzen, 1962; Eddy, 1980; Romano, 1988) as a representative kernel-based modal statistical method. In Section 2, we address the multivariate extension of the study on the optimal kernel by Granovsky and Müller (1991). For multivariate cases, we need some additional assumptions on the structure of the kernels to allow a systematic discussion of the optimal kernel, so we mainly consider two commonly used kernel classes, radial-basis kernels (RKs) and product kernels (PKs). In Section 2.1, we show basic asymptotic behaviors of the mode estimator, such as asymptotic normality, that apply to general kernels including RKs and PKs; this is a modification of the existing statement in (Mokkadem and Pelletier, 2003). The characterization of the asymptotic behaviors leads to the AMSE of the mode estimator and the optimal bandwidth, providing the basis for our subsequent discussion of the optimal kernel. We then construct theories for the optimal RK in Section 2.2, and study PKs in Section 2.3. As one of the consequences, the optimal RK is found to improve the AMSE by more than 10% compared with the commonly used Gaussian kernel, and the improvement is greater in higher dimensions. In view of the slow nonparametric convergence rate of mode estimation, this improvement would significantly contribute to higher sample efficiency. Moreover, on the basis of the considerations in Sections 2.2 and 2.3, in Section 2.4 we compare these two kernel classes in terms of the AMSE, and show that, among non-negative kernels, the optimal RK is better than any PK regardless of the underlying PDF. Section 3 gives some additional discussion. In Section 4 we show results of simulation experiments that examine to what extent the theories on kernel selection (the results in Section 2), which are based on asymptotics, reflect the real performance in a finite sample size situation; the results support the usefulness of our discussion.

We show that our discussion can be further adapted to other kernel-based modal statistical methods, such as another mode estimation method (Abraham et al., 2004), modal linear regression (Yao and Li, 2014; Kemp et al., 2020), and mode clustering (Comaniciu and Meer, 2002; Casa et al., 2020). For these methods, we can obtain results similar to, but not completely the same as, those for kernel mode estimation. Section 5 gives method-specific results on kernel selection.

The remaining part of this paper consists of a concluding section and appendix sections: Section 6 summarizes our results and presents possible directions for future work. Appendices give proofs of the theories on the optimal RK and of the theorems for the asymptotic behaviors of the kernel mode estimator and modal linear regression, which are presented newly in this paper.

2 Optimal Kernel for Kernel Mode Estimation

2.1 Basic Asymptotic Behaviors

In general, for a continuous random variable, its mode refers either to a point where its PDF takes a local maximum value, or to a set of such local maximizers. If we followed such definitions of the mode, the design of an estimator corresponding to each mode and theoretical statements regarding such an estimator would become complicated in a multimodal case. To evade such a technical complication, we assume, unless stated otherwise, that the PDF has a unique global maximizer (see (A.3) in Appendix A). Let ${\bm{X}}\in\mathbb{R}^{d}$ be a $d$-variate continuous random variable, and assume that it has a PDF $f$. Then, we define the mode ${\bm{\theta}}\in\mathbb{R}^{d}$ of $f$ as

${\bm{\theta}}\coloneq\mathop{\rm arg\,max}_{{\bm{x}}\in\mathbb{R}^{d}}f({\bm{x}}).$   (1)

Suppose that an independent and identically distributed (i.i.d.) sample $\{{\bm{X}}_{i}\in\mathbb{R}^{d}\}_{i=1}^{n}$ is drawn from $f$. The kernel mode estimator (KME) is a plugin estimator

${\bm{\theta}}_{n}\coloneq\mathop{\rm arg\,max}_{{\bm{x}}\in\mathbb{R}^{d}}f_{n}({\bm{x}}),$   (2)

where the kernel density estimator (KDE)

$f_{n}({\bm{x}})\coloneq\frac{1}{nh_{n}^{d}}\sum_{i=1}^{n}K\Bigl(\frac{{\bm{x}}-{\bm{X}}_{i}}{h_{n}}\Bigr)$   (3)

is used as a surrogate for the PDF in (1), where $K$ is a kernel function defined on $\mathbb{R}^{d}$, and where $h_{n}>0$ is a parameter regarding the scale of the kernel, called the bandwidth. Here, we assume that the KDE also has a unique global maximizer.

Evaluating the KME amounts to solving the optimization problem (2), where the objective function $f_{n}$ is in general non-convex, so how one solves it is itself a very important problem. In this paper, however, as we are mainly interested in elucidating properties of the KME, we assume that the KME is evaluated by a certain means, so that how to solve it is beyond the scope of this paper.
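As a concrete illustration (and not part of the theory below), the following sketch evaluates the KME (2)-(3) for a Gaussian kernel by a plain mean-shift iteration started from the sample point of highest estimated density; the kernel choice, bandwidth value, and stopping rule are illustrative assumptions, not prescriptions of this paper.

```python
# A minimal sketch of the KME (2)-(3) with a Gaussian kernel, using a plain
# mean-shift iteration warm-started at the sample point of highest KDE value.
import numpy as np

def kde(points, data, h):
    """Evaluate the Gaussian-kernel KDE f_n at each row of `points`."""
    n, d = data.shape
    diff = points[:, None, :] - data[None, :, :]           # shape (m, n, d)
    sq = np.sum(diff**2, axis=-1)                          # squared norms
    K = np.exp(-0.5 * sq / h**2) / ((2 * np.pi)**(d / 2) * h**d)
    return K.mean(axis=1)                                  # (1/(n h^d)) sum K((x-X_i)/h)

def kme_mean_shift(data, h, tol=1e-8, max_iter=500):
    """Kernel mode estimate via mean-shift (valid for the Gaussian kernel)."""
    x = data[np.argmax(kde(data, data, h))].copy()         # warm start
    for _ in range(max_iter):
        w = np.exp(-0.5 * np.sum((data - x)**2, axis=1) / h**2)
        x_new = w @ data / w.sum()
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x

rng = np.random.default_rng(0)
X = rng.normal(loc=[1.0, -2.0], scale=0.7, size=(2000, 2))
print(kme_mean_shift(X, h=0.3))   # close to the true mode (1, -2) of this Gaussian
```

For the Gaussian kernel, the mean-shift update is a fixed-point iteration on the stationarity condition $\nabla f_{n}({\bm{x}})=\bm{0}_{d}$.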

In what follows, for a set $S\subseteq\mathbb{R}$, we let $S_{\geq 0}$ denote the set of all nonnegative elements in $S$. Thus, $\mathbb{Z}_{\geq 0}=\{0,1,\ldots\}$, and $\mathbb{R}_{\geq 0}$ represents the set of nonnegative real numbers. For later use, we introduce the multi-index notation: For ${\bm{i}}=(i_{1},\ldots,i_{d})^{\top}\in\mathbb{Z}_{\geq 0}^{d}$, let $|{\bm{i}}|\coloneq\sum_{j=1}^{d}i_{j}$. It then follows that ${\bm{i}}=\bm{0}_{d}$ and $|{\bm{i}}|=0$ are equivalent, where $\bm{0}_{d}\coloneq(0,\ldots,0)^{\top}$ is the $d$-dimensional zero vector. For a multi-index ${\bm{i}}\in\mathbb{Z}_{\geq 0}^{d}$ and a vector ${\bm{x}}=(x_{1},\ldots,x_{d})^{\top}$, let ${\bm{x}}^{\bm{i}}\coloneq\prod_{j=1}^{d}x_{j}^{i_{j}}$. The factorial of a multi-index is defined as ${\bm{i}}!\coloneq\prod_{j=1}^{d}i_{j}!$. Let $\partial^{\bm{i}}\coloneq\prod_{j=1}^{d}\partial_{j}^{i_{j}}$, where $\partial_{j}\coloneq\partial/\partial x_{j}$.

For a kernel function $K$ on $\mathbb{R}^{d}$ and a multi-index ${\bm{i}}\in\mathbb{Z}_{\geq 0}^{d}$, we let

${\mathcal{B}}_{d,{\bm{i}}}(K)\coloneq\int_{\mathbb{R}^{d}}{\bm{x}}^{\bm{i}}K({\bm{x}})\,d{\bm{x}}$   (4)

be the (multivariate) moment of $K$ indexed by ${\bm{i}}$. Throughout this paper, we assume using a $q$-th order kernel $K$ with an even integer $q\geq 2$. Here, we say that $K$ defined on $\mathbb{R}^{d}$ is $q$-th order if it satisfies the following moment condition: For ${\bm{i}}\in\mathbb{Z}_{\geq 0}^{d}$,

${\mathcal{B}}_{d,{\bm{i}}}(K)=\begin{cases}1,&{\bm{i}}=\bm{0}_{d},\\ 0,&\text{for all ${\bm{i}}$ such that }|{\bm{i}}|\in\{1,\ldots,q-1\},\\ \text{non-zero},&\text{for some ${\bm{i}}$ such that }|{\bm{i}}|=q.\end{cases}$   (5)

Also, note that the condition (5) for ${\bm{i}}=\bm{0}_{d}$ implies normalization of the kernel.
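As a quick numerical sanity check of the moment condition (5) in the univariate case, the snippet below integrates the moments of the 4th-order kernel $K_{1,4}^{B}$ listed in Table 1; the use of scipy quadrature and the printed precision are incidental implementation choices.

```python
# Check the moment condition (5) for the univariate 4th-order kernel K_{1,4}^B.
import numpy as np
from scipy.integrate import quad

def K14B(x):
    return 105 / 64 * (1 - 3 * x**2) * np.maximum(1 - x**2, 0.0)**2

for i in range(5):
    moment, _ = quad(lambda x: x**i * K14B(x), -1, 1)
    print(f"moment of order {i}: {moment:+.6f}")
# expected: 1, 0, 0, 0, and a nonzero value (-1/33), so K_{1,4}^B is 4th order
```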

Here we consider the standard situation where the Hessian matrix $Hf({\bm{\theta}})$ of the PDF at the mode is non-singular, where $H\coloneq\nabla\nabla^{\top}$. Existing works (Yamato, 1971; Rüschendorf, 1977; Mokkadem and Pelletier, 2003) have revealed basic statistical properties of the KME such as consistency and the asymptotic distribution. In the following, we provide a theorem modified from that of, in particular, (Mokkadem and Pelletier, 2003): it shows the asymptotic bias (AB), asymptotic variance-covariance matrix (AVC), and AMSE of ${\bm{\theta}}_{n}$, along with asymptotic normality, for a more general kernel class than those covered by existing proofs. This modification is essential for our purpose, since it covers the RKs that will be discussed in the next section, whereas the existing proofs do not.

Theorem 1.

Assume Assumption A.1 in Appendix A with an even integer $q\geq 2$. Then, the KME ${\bm{\theta}}_{n}$ asymptotically follows a normal distribution with the following AB and AVC:

$\mathrm{E}[{\bm{\theta}}_{n}-{\bm{\theta}}]\approx-h_{n}^{q}\mathrm{A}{\bm{b}},\quad\mathrm{Cov}[{\bm{\theta}}_{n}]\approx\frac{1}{nh_{n}^{d+2}}\mathrm{A}\mathrm{V}\mathrm{A},$   (6)

where $\mathrm{A}\coloneq\{Hf({\bm{\theta}})\}^{-1}$, where ${\bm{b}}={\bm{b}}({\bm{\theta}};f,K)$ with

${\bm{b}}({\bm{x}};f,K)\coloneq\sum_{{\bm{i}}\in\mathbb{Z}_{\geq 0}^{d}:|{\bm{i}}|=q}\frac{1}{{\bm{i}}!}\cdot\nabla\partial^{\bm{i}}f({\bm{x}})\cdot{\mathcal{B}}_{d,{\bm{i}}}(K),$   (7)

and where $\mathrm{V}$ is the $d\times d$ matrix defined as $\mathrm{V}\coloneq f({\bm{\theta}}){\mathcal{V}}_{d}(K)$ with

${\mathcal{V}}_{d}(K)\coloneq\int_{\mathbb{R}^{d}}\nabla K({\bm{x}})\nabla K({\bm{x}})^{\top}\,d{\bm{x}}.$   (8)

Moreover, the AMSE is given by

$\mathrm{E}[\|{\bm{\theta}}_{n}-{\bm{\theta}}\|^{2}]\approx h_{n}^{2q}\|\mathrm{A}{\bm{b}}\|^{2}+\frac{1}{nh_{n}^{d+2}}\operatorname{tr}(\mathrm{A}\mathrm{V}\mathrm{A}),$   (9)

where $\|\cdot\|$ is the Euclidean norm in $\mathbb{R}^{d}$ and where $\operatorname{tr}(\mathrm{M})$ denotes the trace of a square matrix $\mathrm{M}$.

A plausible approach to selecting the bandwidth $h_{n}$ and kernel $K$ is to make the AMSE (9) as small as possible. Noting that $\mathrm{A}{\bm{b}}\neq\bm{0}_{d}$ holds under our assumptions (see (A.4) and (A.6)), the stationarity condition of the AMSE with respect to $h_{n}$ leads to the optimal bandwidth,

$h_{d,q,n}^{\mathrm{opt}}=\left(\frac{(d+2)\operatorname{tr}(\mathrm{A}\mathrm{V}\mathrm{A})}{2qn\|\mathrm{A}{\bm{b}}\|^{2}}\right)^{\frac{1}{d+2q+2}},$   (10)

which strictly minimizes the AMSE. (The optimal bandwidth (10) depends on the inaccessible PDF $f$ and its mode ${\bm{\theta}}$, via $\mathrm{A}$, ${\bm{b}}$, and $\mathrm{V}$, and thus it cannot be used in practice as it is. However, it is possible to use a plugin estimator of the bandwidth which replaces these inaccessible quantities with their consistent estimates, and the discussion on the optimal kernel below holds also for the estimated optimal bandwidth.) Moreover, substituting the optimal bandwidth into the AMSE, the bandwidth-optimized AMSE becomes

$\mathrm{E}[\|{\bm{\theta}}_{n}-{\bm{\theta}}\|^{2}\mid h_{d,q,n}^{\mathrm{opt}}]\approx\frac{2q}{d+2q+2}\|\mathrm{A}{\bm{b}}\|^{\frac{2(d+2)}{d+2q+2}}\left(\frac{d+2}{2qn}\operatorname{tr}(\mathrm{A}\mathrm{V}\mathrm{A})\right)^{\frac{2q}{d+2q+2}}.$   (11)

The bandwidth-optimized AMSE (11) depends on the kernel function $K$ used in the KDE through $\{{\mathcal{B}}_{d,{\bm{i}}}(K)\}$ and ${\mathcal{V}}_{d}(K)$, so that one may consider further optimizing the bandwidth-optimized AMSE with respect to the kernel function. However, it also depends on the sample-generating PDF $f$ via $\mathrm{A}$, ${\bm{b}}$, and $\mathrm{V}$, and moreover, the dependence on the kernel and that on the PDF are mixed in a complex way in the AMSE expression (11), so it seems quite difficult to study kernel optimization in the multivariate setting without additional assumptions. Furthermore, a resulting optimal kernel would in general depend on the PDF, and a multivariate kernel without any structural assumptions is also difficult to implement in practice. We thus introduce structural assumptions on the kernel functions, and consider optimization of the bandwidth-optimized AMSE under such assumptions. Under appropriate structural assumptions on the kernel functions, the optimal kernel turns out not to depend on the PDF $f$, as will be shown in the following sections.
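Before moving on, we note that the quantities (10) and (9) are simple functions of $\mathrm{A}$, ${\bm{b}}$, and $\mathrm{V}$, which in practice would be replaced by plug-in estimates. The following minimal sketch evaluates the optimal bandwidth (10) and the AMSE (9) at that bandwidth; the numerical values of $\mathrm{A}$, ${\bm{b}}$, and $\mathrm{V}$ below are placeholders used only for illustration.

```python
# Evaluate the optimal bandwidth (10) and the AMSE (9) at that bandwidth,
# for given plug-in values of A, b, and V.
import numpy as np

def optimal_bandwidth_and_amse(A, b, V, n, d, q):
    tr_AVA = np.trace(A @ V @ A)
    Ab2 = np.linalg.norm(A @ b)**2
    h_opt = ((d + 2) * tr_AVA / (2 * q * n * Ab2))**(1 / (d + 2 * q + 2))
    amse = h_opt**(2 * q) * Ab2 + tr_AVA / (n * h_opt**(d + 2))
    return h_opt, amse

d, q, n = 2, 2, 10_000
A = np.linalg.inv(np.array([[-3.0, 0.5], [0.5, -2.0]]))   # placeholder {Hf(theta)}^{-1}
b = np.array([0.8, -0.4])                                 # placeholder b(theta; f, K)
V = 0.2 * np.eye(d)                                       # placeholder f(theta) * V_d(K)
print(optimal_bandwidth_and_amse(A, b, V, n, d, q))
```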

2.2 Optimal Kernel in Radial-Basis Kernels

In multivariate kernel-based methods, RKs have been most commonly used. We first consider kernel optimization among the class of RKs. We say that a kernel $K$ is an RK if there is a function $G$ on $\mathbb{R}_{\geq 0}$ such that $K$ is represented as $K(\cdot)=G(\|\cdot\|)$.

For an RK $K(\cdot)=G(\|\cdot\|)$, by means of the Cartesian-to-polar coordinate transformation, the functional ${\mathcal{B}}_{d,{\bm{i}}}(K)$ is rewritten in terms of $G$ as

${\mathcal{B}}_{d,{\bm{i}}}(K)=b_{d,{\bm{i}}}B_{d,i}(G)$   (12)

with $i=|{\bm{i}}|$, where $B_{d,i}(G)\coloneq\int_{\mathbb{R}_{\geq 0}}x^{d-1+i}G(x)\,dx$, and where $b_{d,{\bm{i}}}$ is a kernel-independent factor given by

$b_{d,{\bm{i}}}\coloneq\int_{0}^{\pi}\cos^{i_{1}}\xi_{1}\sin^{d-2+\sum_{j>1}i_{j}}\xi_{1}\,d\xi_{1}\int_{0}^{\pi}\cos^{i_{2}}\xi_{2}\sin^{d-3+\sum_{j>2}i_{j}}\xi_{2}\,d\xi_{2}\cdots\int_{0}^{\pi}\cos^{i_{d-2}}\xi_{d-2}\sin^{1+\sum_{j>d-2}i_{j}}\xi_{d-2}\,d\xi_{d-2}\int_{0}^{2\pi}\cos^{i_{d-1}}\xi_{d-1}\sin^{i_{d}}\xi_{d-1}\,d\xi_{d-1}=\begin{cases}\frac{2\prod_{j=1}^{d}\Gamma(\frac{1+i_{j}}{2})}{\Gamma(\frac{d+i}{2})}\neq 0,&i_{1},\ldots,i_{d}\in 2\mathbb{Z}_{\geq 0},\\ 0,&\text{otherwise}.\end{cases}$   (13)

Note that $b_{d,{\bm{i}}}$, and hence ${\mathcal{B}}_{d,{\bm{i}}}(K)$, vanishes once any index in the multi-index ${\bm{i}}$ is odd, which includes the case where $|{\bm{i}}|$ is odd, and that $b_{d,\bm{0}_{d}}$ is equal to $b_{d}\coloneq 2\pi^{d/2}/\Gamma(\frac{d}{2})$. One then has ${\bm{b}}=B_{d,q}(G)\bar{{\bm{b}}}$, where

$\bar{{\bm{b}}}\coloneq\sum_{{\bm{i}}\in\mathbb{Z}_{\geq 0}^{d}:|{\bm{i}}|=q}\frac{1}{{\bm{i}}!}\cdot\nabla\partial^{\bm{i}}f({\bm{\theta}})\cdot b_{d,{\bm{i}}}=\frac{\pi^{\frac{d}{2}}}{2^{q-1}\Gamma(\frac{d+q}{2})\Gamma(\frac{q}{2}+1)}\nabla\nabla_{2}^{q/2}f({\bm{\theta}}),$   (14)

where $\nabla_{l}\coloneq\sum_{j=1}^{d}\frac{\partial^{l}}{\partial x_{j}^{l}}$, so that $\nabla_{2}$ denotes the Laplacian operator.
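The closed form in (13) is easy to check numerically: the snippet below compares the gamma-function expression for $b_{d,{\bm{i}}}$ against a Monte Carlo estimate of the corresponding integral of ${\bm{x}}^{\bm{i}}$ over the unit sphere; the multi-index, sample size, and seed are arbitrary choices made for this illustration.

```python
# Compare the closed form (13) for b_{d,i} with a Monte Carlo surface integral.
import numpy as np
from math import gamma, pi

def b_closed_form(i):
    d = len(i)
    if any(ij % 2 for ij in i):
        return 0.0
    return 2 * np.prod([gamma((1 + ij) / 2) for ij in i]) / gamma((d + sum(i)) / 2)

def b_monte_carlo(i, n=1_000_000, seed=0):
    d = len(i)
    rng = np.random.default_rng(seed)
    u = rng.normal(size=(n, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)           # uniform on the sphere
    surface = 2 * pi**(d / 2) / gamma(d / 2)                 # |S^{d-1}|
    return surface * np.mean(np.prod(u**np.array(i), axis=1))

i = (2, 0, 2)                                                # d = 3, |i| = 4
print(b_closed_form(i), b_monte_carlo(i))                    # both ~ 4*pi/15
```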

Also, the functional ${\mathcal{V}}_{d}(K)$ for an RK $K(\cdot)=G(\|\cdot\|)$ reduces to

${\mathcal{V}}_{d}(K)=v_{d}V_{d,1}(G)\mathrm{I}_{d}$   (15)

with $V_{d,l}(G)\coloneq\int_{\mathbb{R}_{\geq 0}}x^{d-1}\{G^{(l)}(x)\}^{2}\,dx$ and $v_{d}\coloneq b_{d,(2,0,\ldots,0)}=2\pi^{d/2}/(d\,\Gamma(\frac{d}{2}))$. This implies that $\mathrm{V}=v_{d}f({\bm{\theta}})V_{d,1}(G)\mathrm{I}_{d}$, that is, $\mathrm{V}$ is proportional to the identity matrix $\mathrm{I}_{d}$.

Therefore, the bandwidth-optimized AMSE (11) becomes

$\mathrm{E}[\|{\bm{\theta}}_{n}-{\bm{\theta}}\|^{2}\mid h_{d,q,n}^{\mathrm{opt}}]\approx\frac{2q}{d+2q+2}\|\mathrm{A}\bar{{\bm{b}}}\|^{\frac{2(d+2)}{d+2q+2}}\left(\frac{d+2}{2qn}v_{d}f({\bm{\theta}})\operatorname{tr}(\mathrm{A}^{2})\right)^{\frac{2q}{d+2q+2}}\times\Bigl(B_{d,q}^{2(d+2)}(G)\cdot V_{d,1}^{2q}(G)\Bigr)^{\frac{1}{d+2q+2}},$   (16)

in which the kernel-dependent factor is $(B_{d,q}^{2(d+2)}\cdot V_{d,1}^{2q})^{\frac{1}{d+2q+2}}$ alone. (Although implicit in this notation, $B_{d,i}$ and $V_{d,1}$ are functionals of $G$. We will use the notation $B_{d,i}(G)$, etc., when we want to make the dependence on $G$ explicit, and use such abbreviations otherwise.)

As one can see, the bandwidth-optimized AMSE (16) for an RK is decomposed into a product of the kernel-dependent factor, which we call the AMSE criterion, and the remaining PDF-dependent part. This fact implies that an RK minimizing the AMSE criterion also minimizes the AMSE, that is, it is optimal, among the class of RKs for every PDF satisfying the requirements of Theorem 1.

For an RK $K(\cdot)=G(\|\cdot\|)$, as stated above, ${\mathcal{B}}_{d,{\bm{i}}}(K)$ vanishes for ${\bm{i}}$ with odd $|{\bm{i}}|$, so that one needs to consider the even moment conditions alone, which can be translated into the following moment condition for $G$:

$B_{d,i}(G)=\begin{cases}b_{d}^{-1},&i=0,\\ 0,&i\in I_{q-2},\\ \text{non-zero},&i=q,\end{cases}$   (17)

where $I_{k}$ is the index set defined as

$I_{k}\coloneq\bigl\{k-2i:i=0,1,\ldots,\bigl\lfloor\tfrac{k}{2}\bigr\rfloor\bigr\}.$   (18)

Thus, if one focuses on the functional form of the AMSE criterion, one might consider the following variational problem (we name it ‘type-1’ because it depends on the first derivative of $G$ via $V_{d,1}^{2q}(G)=V_{d,0}^{2q}(G^{(1)})$).

Problem 1 ($d$-variate, type-1, $q$-th order).

$\min_{G}\quad B_{d,q}^{2(d+2)}(G)\cdot V_{d,1}^{2q}(G),$   (P1)

s.t. $G$ satisfies the moment condition (17).   (P1-1)

However, this problem has no solution, as shown in the next proposition.

Proposition 1.

Problem 1 has no solution.

Proof.

One has $B_{d,q}^{2(d+2)}(G)\cdot V_{d,1}^{2q}(G)\geq 0$ by definition. It can never be equal to zero under the moment condition (17). Indeed, if one had $B_{d,q}^{2(d+2)}(G)\cdot V_{d,1}^{2q}(G)=0$, either $B_{d,q}(G)$ or $V_{d,1}(G)$ would have to be equal to zero, whereas $B_{d,q}(G)$ must be nonzero due to the moment condition (17). $V_{d,1}(G)$ cannot be equal to zero either, since otherwise $G^{(1)}$ would have to be identically zero, implying that $G$ is a constant, contradicting the condition $B_{d,0}(G)=b_{d}^{-1}<\infty$.

We next show that the infimum of $B_{d,q}^{2(d+2)}(G)\cdot V_{d,1}^{2q}(G)$ is zero. Take $G_{1}$ and $G_{2}$ satisfying (17) with $B_{d,q}(G_{1})\not=B_{d,q}(G_{2})$ and $V_{d,1}(G_{1}),V_{d,1}(G_{2})<\infty$ (for example, $G_{1}=G_{d,q}^{B}$ and $G_{2}=G_{d,q}^{E}$ defined later). Consider the linear combination $G=wG_{1}+(1-w)G_{2}$ of $G_{1}$ and $G_{2}$ with $w\in\mathbb{R}$. One then has $B_{d,q}(G)=wB_{d,q}(G_{1})+(1-w)B_{d,q}(G_{2})$, so that $G$ satisfies (17) unless $w=w_{0}\coloneq\frac{B_{d,q}(G_{2})}{B_{d,q}(G_{2})-B_{d,q}(G_{1})}$, in which case one has $B_{d,q}(G)=0$, so that the order of $G$ is strictly higher than $q$. On the other hand, substituting $(s,t)=(wG_{1}^{(1)}(x),(1-w)G_{2}^{(1)}(x))$ into the inequality $(s+t)^{2}\leq 2(s^{2}+t^{2})$, multiplying both sides by $x^{d-1}$, and taking the integral over $x\in\mathbb{R}_{\geq 0}$, one has

$V_{d,1}(G)\leq 2\bigl(w^{2}V_{d,1}(G_{1})+(1-w)^{2}V_{d,1}(G_{2})\bigr)<\infty$   (19)

for any $w$. One therefore has $B_{d,q}^{2(d+2)}(G)\cdot V_{d,1}^{2q}(G)\to 0$ as $w\to w_{0}$, proving that the infimum of $B_{d,q}^{2(d+2)}(G)\cdot V_{d,1}^{2q}(G)$ is equal to zero. ∎

This proposition states that the approach to kernel selection via simply minimizing the AMSE criterion among the entire class of $q$-th order RKs is not a theoretically viable option. One may be able to discourage this approach from a practical standpoint as well. As shown above, one can make the AMSE criterion arbitrarily close to zero by considering a sequence of kernels for which $B_{d,q}$ approaches zero. For such a sequence of kernels, however, the optimal bandwidth $h_{d,q,n}^{\mathrm{opt}}\propto\{V_{d,1}/(nB_{d,q}^{2})\}^{\frac{1}{d+2q+2}}$ diverges toward infinity. It then causes the effect of the next leading-order term of the AB, which is $O(h_{n}^{q+2})$, to be non-negligible in the MSE when the sample size $n$ is finite, thereby invalidating the use of the AMSE criterion as an estimate of the MSE in a finite sample situation.

The above discussion suggests that, in order to avoid diverging behaviors of the optimal bandwidth and to obtain results meaningful in a finite sample situation, one should consider optimization of the AMSE criterion within a certain kernel class which at least excludes those kernels with vanishing leading-order moments. Since such an approach can give a kernel with the smallest AMSE criterion at least among the considered class, it would be superior to other kernel design methods that are not based on any discussion of the goodness of the resulting higher-order kernels within kernel classes of the same order, such as the jackknife (Schucany and Sommers, 1977; Wand and Schucany, 1990), as well as those that optimize alternative criteria other than the AMSE criterion, such as the minimum-variance kernels (Eddy, 1980; Müller, 1984). Existing research (Gasser and Müller, 1979; Gasser et al., 1985; Granovsky and Müller, 1989, 1991) can be interpreted as following this approach, focusing on the relationship between the order of the kernel and the number of sign changes. More concretely, on the basis of the observation that any $q$-th order univariate kernel should change its sign at least $(q-2)$ times on $\mathbb{R}$, they considered kernel optimization among the class of $q$-th order kernels with exactly $(q-2)$ sign changes (which we call the minimum-sign-change condition). They have also shown that the minimum-sign-change condition excludes kernels with small leading-order moments, which cause the difficulties discussed above in the minimization of the AMSE criterion. In addition, and importantly, they succeeded in obtaining a closed-form solution of the resulting kernel optimization problem under the minimum-sign-change condition.

In this paper, we also take the same approach as the one by (Gasser and Müller, 1979; Gasser et al., 1985; Granovsky and Müller, 1989, 1991): under the moment condition (17), one can show that $G$ changes its sign at least $(\frac{q}{2}-1)$ times on $\mathbb{R}_{\geq 0}$; see Lemma B.2. Then, we consider the following modified problem, to which the minimum-sign-change condition (P2-3) is added.

Problem 2 (Modified $d$-variate, type-1, $q$-th order).

$\min_{G}\quad B_{d,q}^{2(d+2)}(G)\cdot V_{d,1}^{2q}(G),$   (P2)

s.t. $G$ satisfies the moment condition (17),   (P2-1)
$G$ is differentiable and $G^{(1)}$ is integrable,   (P2-2)
$G$ changes its sign $(\tfrac{q}{2}-1)$ times on $\mathbb{R}_{\geq 0}$,   (P2-3)
$V_{d,1}(G)<\infty$,   (P2-4)
$x^{d+q}G(x)\to 0$ as $x\to\infty$.   (P2-5)

Here, the conditions (P2-2)–(P2-5) are technically required in the proof. The conditions (P2-2) and (P2-4), which make the problem well-defined, and the condition (P2-5), which is close to requiring $|B_{d,q}(G)|<\infty$, are almost non-restrictive. The minimum-sign-change condition (P2-3) for $q=2$ implies that the kernel is non-negative, under the normalization condition in (P2-1). Thus, the minimum-sign-change condition (P2-3) for higher orders $q\geq 4$ can be regarded as a sort of extension of the non-negativity condition.

In the same way as (Gasser and Müller, 1979; Gasser et al., 1985), we can show for $q=2,4$ that this problem is solvable and that the solution is as follows.

Theorem 2.

For every $d\geq 1$ and $q=2,4$, a solution of Problem 2 is

$G_{d,q}^{B}(x)\coloneq\biggl(\sum_{i\in I_{q+2}}c^{B}_{d,q,i}x^{i}\biggr)\mathbbm{1}_{x\leq 1},$   (20)

where the coefficients $c^{B}_{d,q,i}$ with $i\in I_{q+2}$ are given by

$c^{B}_{d,q,i}=\frac{(-1)^{\frac{i}{2}}\Gamma(\frac{d+q+4}{2})\Gamma(\frac{d+q+i}{2})}{\pi^{\frac{d}{2}}\Gamma(\frac{q}{2})\Gamma(\frac{2+i}{2})\Gamma(\frac{d+2+i}{2})\Gamma(\frac{q+4-i}{2})}.$   (21)

(Note that a solution of Problem 2 has freedom in its scale, i.e., for any $s>0$, $G(x)=s^{-d}G_{d,q}^{B}(x/s)$ is also a solution of Problem 2.)

The kernel $G_{d,q}^{B}$ is also represented as

$G_{d,q}^{B}(x)=\frac{\Gamma(\frac{d+q}{2})}{\pi^{\frac{d}{2}}\Gamma(\frac{q}{2})}P_{\frac{q}{2}+1}^{(\frac{d}{2},-2)}(1-2x^{2})\mathbbm{1}_{x\leq 1}=(-1)^{\frac{q}{2}+1}\frac{\Gamma(\frac{d+q}{2}+2)}{\pi^{\frac{d}{2}}\Gamma(\frac{q}{2}+2)}(1-x^{2})^{2}P_{\frac{q}{2}-1}^{(2,\frac{d}{2})}(2x^{2}-1)\mathbbm{1}_{x\leq 1},$   (22)

where $\Gamma(\cdot)$ is the gamma function and where $P_{n}^{(\alpha,\beta)}(\cdot)$ is the Jacobi polynomial (Szegö, 1939) defined by

$P_{n}^{(\alpha,\beta)}(x)\coloneq\frac{(-1)^{n}}{2^{n}n!}(1-x)^{-\alpha}(1+x)^{-\beta}\frac{d^{n}}{dx^{n}}\left[(1-x)^{n+\alpha}(1+x)^{n+\beta}\right].$   (23)

Alternatively, we can consider the following problem with another minimum-sign-change condition (P3-3) that is defined for the first derivative of a kernel.

Problem 3 (Modified $d$-variate, type-1, $q$-th order).

$\min_{G}\quad B_{d,q}^{2(d+2)}(G)\cdot V_{d,1}^{2q}(G),$   (P3)

s.t. $G$ satisfies the moment condition (17),   (P3-1)
$G$ is differentiable and $G^{(1)}$ is integrable,   (P3-2)
$G^{(1)}$ changes its sign $(\tfrac{q}{2}-1)$ times on $\mathbb{R}_{\geq 0}$,   (P3-3)
$V_{d,1}(G)<\infty$,   (P3-4)
there exists $\delta>0$ such that $x^{d+q+\delta}G(x)\to 0$ as $x\to\infty$.   (P3-5)

The minimum-sign-change condition (P3-3) for $q=2$ implies that the kernel is non-negative and non-increasing, under the normalization condition in (P3-1) and the end-point condition (P3-5). Therefore, the minimum-sign-change condition for the first derivative (P3-3) can be seen as a more restrictive version of the minimum-sign-change condition for the kernel itself (P2-3). Additionally, note that the change from (P2-5) to (P3-5) stems from the difference in the proof techniques used for Theorems 2 and 3.

We can show that this problem has the same solution even for general even orders $q=2,4,6,\ldots$, on the basis of the proof techniques in (Granovsky and Müller, 1989, 1991).

Theorem 3.

For every $d\geq 1$ and even $q\geq 2$, a solution of Problem 3 is (20), or equivalently (22).

Figure 1: The kernel $K_{d,q}^{B}$ for $d=2$ and $q=2,4,6$ (panels: $q=2$, $q=4$, $q=6$).

This type-1 optimal kernel is a truncation of a polynomial function with terms of degrees $0,2,\ldots,(q+2)$ and has a compact support (see Figure 1). It should be emphasized that the optimal kernel is a truncated kernel even though we have not assumed truncation a priori, and that the truncation of the optimal kernel is a consequence of the minimum-sign-change condition and the optimality with respect to the AMSE criterion. In particular, the 2nd-order optimal kernel $K_{d,2}^{B}({\bm{x}})\propto\{(1-\|{\bm{x}}\|^{2})_{+}\}^{2}$ (where $(\cdot)_{+}=\max\{\cdot,0\}$), often called the Biweight kernel, is optimal among non-negative RKs (from Theorem 2). The kernel $G_{d,q}^{B}$ can be represented by the product of the $d$-variate Biweight RK $G_{d,2}^{B}$ and a polynomial function as

$G_{d,q}^{B}(x)=(-1)^{\frac{q}{2}+1}\frac{2\Gamma(\frac{d+q}{2}+2)}{\Gamma(\frac{d}{2}+3)\Gamma(\frac{q}{2}+2)}P_{\frac{q}{2}-1}^{(2,\frac{d}{2})}(2x^{2}-1)\cdot G_{d,2}^{B}(x).$   (24)

Berlinet (1993) pointed out that in a variety of kernel-based estimation problems the optimal kernels of different orders often form a hierarchy, in the sense that an optimal kernel can be represented as a product of a polynomial and the corresponding lowest-order kernel, called the basic kernel; see also Section 3.11 of (Berlinet and Thomas-Agnan, 2004). According to this terminology, for example, the statement of Theorem 2 or 3 can be concisely summarized as follows: Among the class of RKs, the Biweight hierarchy minimizes the AMSE of the kernel mode estimator for any $d$ under the minimum-sign-change condition for the kernel or its first derivative.

However, it should also be noted that the kernel optimization problem was formulated without imposing several regularity conditions, other than the moment condition, that ensure the asymptotic normality and are used for deriving the AMSE. The kernel $K_{d,q}^{B}$ is not twice differentiable at the edge of its support. In this respect, $K_{d,q}^{B}$ does not satisfy all the sufficient conditions in Theorem 1 for deriving the AMSE (see (A.7) and (A.14) in Assumption A.1 in the Appendix). On the other hand, one can still add a small enough perturbation to $K_{d,q}^{B}$ to make it twice differentiable at the edge of its support as well, while keeping the other conditions satisfied and changing the value of the AMSE criterion by an arbitrarily small amount. Namely, Problem 2 or 3 and its solution $K_{d,q}^{B}$ still tell us the lower bound of the achievable AMSE under the minimum-sign-change condition, as well as the shape of kernels that will achieve an AMSE close to the bound (this also implies that the problem of minimizing the AMSE under all the conditions for Theorem 1 has no solution). From these considerations, we expect that, despite the lack of a theoretical guarantee for the asymptotic normality of the KME, this issue is not destructive and $K_{d,q}^{B}$ will provide practically good performance among the minimum-sign-change kernels; see also the simulation experiments in Section 4.
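As a sketch of how the explicit form (20)-(21) can be used in practice, the following code builds the radial profile $G_{d,q}^{B}$ directly from the coefficients (21) and verifies the moment condition (17) by numerical quadrature; the chosen $(d,q)$ and the quadrature routine are illustrative assumptions.

```python
# Build G_{d,q}^B from the coefficients (21) and check the moment condition (17).
import numpy as np
from math import gamma, pi
from scipy.integrate import quad

def G_B(d, q):
    """Return G_{d,q}^B as a callable on [0, inf)."""
    idx = list(range(q + 2, -1, -2))                        # I_{q+2} = {q+2, q, ..., 0}
    coef = {
        i: (-1)**(i // 2) * gamma((d + q + 4) / 2) * gamma((d + q + i) / 2)
           / (pi**(d / 2) * gamma(q / 2) * gamma((2 + i) / 2)
              * gamma((d + 2 + i) / 2) * gamma((q + 4 - i) / 2))
        for i in idx
    }
    return lambda x: np.where(x <= 1, sum(c * x**k for k, c in coef.items()), 0.0)

d, q = 2, 4
G = G_B(d, q)
b_d = 2 * pi**(d / 2) / gamma(d / 2)
for i in range(0, q + 1, 2):
    B_di, _ = quad(lambda x: x**(d - 1 + i) * G(x), 0, 1)
    print(f"B_{{{d},{i}}}(G) = {B_di:+.6f}")
# expected: B_{d,0} = 1/b_d, B_{d,2} = 0, B_{d,4} nonzero, in accordance with (17)
```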

2.3 Optimal Kernel in Product Kernels

In this section we consider optimization of the bandwidth-optimized AMSE with respect to the kernel among the class of PKs, where a kernel $K$ is a PK if there exists a function $G$ on $\mathbb{R}$ such that $K({\bm{x}})=\prod_{j=1}^{d}G(x_{j})$ for all ${\bm{x}}=(x_{1},\ldots,x_{d})^{\top}$. To simplify the discussion, we assume in this paper that $G$, as a kernel on $\mathbb{R}$, is symmetric and $q$-th order, which makes the resulting PK $K$ $q$-th order as well. We furthermore assume that $G$ satisfies the minimum-sign-change condition (P2-3) or (P3-3).

Under these assumptions, simple calculations lead to

${\mathcal{B}}_{d,{\bm{i}}}(K)=\begin{cases}1,&{\bm{i}}=\bm{0}_{d},\\ 2B_{1,q}(G),&\text{if one index in ${\bm{i}}$ is $q$ and the others are $0$},\\ 0,&\text{for other ${\bm{i}}$ such that }|{\bm{i}}|\in\{1,\ldots,q\},\end{cases}$   (25)

${\mathcal{V}}_{d}(K)=2^{d}V_{1,1}(G)\cdot V^{d-1}_{1,0}(G)\,\mathrm{I}_{d}.$   (26)

Therefore, the bandwidth-optimized AMSE (11) becomes

$\mathrm{E}[\|{\bm{\theta}}_{n}-{\bm{\theta}}\|^{2}\mid h_{d,q,n}^{\mathrm{opt}}]\approx\frac{2q}{d+2q+2}\|\mathrm{A}\tilde{{\bm{b}}}\|^{\frac{2(d+2)}{d+2q+2}}\biggl(\frac{d+2}{2qn}2^{d}f({\bm{\theta}})\operatorname{tr}(\mathrm{A}^{2})\biggr)^{\frac{2q}{d+2q+2}}\times\biggl[\Bigl(B_{1,q}^{6}(G)\cdot V_{1,1}^{2q}(G)\Bigr)\cdot\Bigl(B_{1,q}^{2}(G)\cdot V_{1,0}^{2q}(G)\Bigr)^{d-1}\biggr]^{\frac{1}{d+2q+2}},$   (27)

where $\tilde{{\bm{b}}}$ is given by

$\tilde{{\bm{b}}}\coloneq\sum_{j=1}^{d}\frac{1}{q!}\cdot\nabla\frac{\partial^{q}f({\bm{\theta}})}{\partial x_{j}^{q}}\cdot 2=\frac{2}{q!}\nabla\nabla_{q}f({\bm{\theta}}).$   (28)

Equation (27) shows that the bandwidth-optimized AMSE for the PK is also decomposed into a product of the kernel-dependent factor and the PDF-dependent factor. One can therefore take the kernel-dependent factor $\{(B_{1,q}^{6}\cdot V_{1,1}^{2q})\cdot(B_{1,q}^{2}\cdot V_{1,0}^{2q})^{d-1}\}^{\frac{1}{d+2q+2}}$ as the AMSE criterion for the PK.
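As a sanity check of the reductions (25) and (26), the snippet below compares the two-dimensional integrals for a PK with a Gaussian univariate factor against the univariate functionals $B_{1,q}(G)$, $V_{1,1}(G)$, and $V_{1,0}(G)$; the Gaussian factor, the truncation of the integration range, and $d=2$ are assumptions made only for this illustration.

```python
# Verify (25)-(26) for d = 2 with a Gaussian univariate factor G (q = 2).
import numpy as np
from scipy.integrate import dblquad, quad

G  = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
dG = lambda x: -x * G(x)
L = 8.0   # ample truncation for the Gaussian

# left-hand sides: 2D integrals
B_20, _ = dblquad(lambda y, x: x**2 * G(x) * G(y), -L, L, -L, L)        # i = (2, 0)
V_11, _ = dblquad(lambda y, x: dG(x)**2 * G(y)**2, -L, L, -L, L)        # (V_d(K))_{11}
V_12, _ = dblquad(lambda y, x: dG(x) * G(x) * dG(y) * G(y), -L, L, -L, L)

# right-hand sides from univariate half-line functionals
B1q  = quad(lambda x: x**2 * G(x), 0, L)[0]
V1_1 = quad(lambda x: dG(x)**2, 0, L)[0]
V1_0 = quad(lambda x: G(x)**2, 0, L)[0]
print(B_20, 2 * B1q)                   # should agree, cf. (25)
print(V_11, 2**2 * V1_1 * V1_0)        # should agree, cf. (26)
print(V_12)                            # off-diagonal entry, ~ 0
```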

We have not succeeded in deriving the optimal $G$ minimizing this criterion. In the following, we consider instead a lower bound of the AMSE criterion, which will be compared with the optimal value of the criterion for the RK in the next section. The bound we discuss is derived as follows:

$\min_{G}\bigl\{\bigl(B_{1,q}^{6}(G)\cdot V_{1,1}^{2q}(G)\bigr)\cdot\bigl(B_{1,q}^{2}(G)\cdot V_{1,0}^{2q}(G)\bigr)^{d-1}\bigr\}=\min_{G,G^{\prime}:G=G^{\prime}}\bigl\{\bigl(B_{1,q}^{6}(G)\cdot V_{1,1}^{2q}(G)\bigr)\cdot\bigl(B_{1,q}^{2}(G^{\prime})\cdot V_{1,0}^{2q}(G^{\prime})\bigr)^{d-1}\bigr\}\geq\min_{G}\bigl(B_{1,q}^{6}(G)\cdot V_{1,1}^{2q}(G)\bigr)\cdot\min_{G^{\prime}}\bigl(B_{1,q}^{2}(G^{\prime})\cdot V_{1,0}^{2q}(G^{\prime})\bigr)^{d-1},$   (29)

where the last inequality holds strictly if $d>1$, since the solutions of the two minimization problems involved will become different. The first minimization problem is a univariate case of Theorem 2 or Theorem 3 (or (Granovsky and Müller, 1991, Corollary 1)): the Biweight hierarchy $\{G_{1,q}^{B}\}$ provides the optimal $G$ among the $q$-th order kernels satisfying the minimum-sign-change condition (P3-3). The second minimization problem, which is an instance of the type-0 problems, has been solved by Granovsky and Müller (1989): for even $q\geq 2$, the kernel

$G_{1,q}^{E}(x)=\frac{\Gamma(\frac{q+1}{2})}{\pi^{\frac{1}{2}}\Gamma(\frac{q}{2})}P_{\frac{q}{2}}^{(\frac{1}{2},-1)}(1-2x^{2})\mathbbm{1}_{x\leq 1}=(-1)^{\frac{q}{2}+1}\frac{\Gamma(\frac{q+3}{2})}{\pi^{\frac{1}{2}}\Gamma(\frac{q}{2}+1)}(1-x^{2})P_{\frac{q}{2}-1}^{(1,\frac{1}{2})}(2x^{2}-1)\mathbbm{1}_{x\leq 1}$   (30)

minimizes $B_{1,q}^{2}\cdot V_{1,0}^{2q}$ among $q$-th order kernels that change their sign $(\frac{q}{2}-1)$ times on $\mathbb{R}_{\geq 0}$. The type-0 optimal kernel $K_{1,q}^{E}(x)=G_{1,q}^{E}(|x|)$ is a truncation of a polynomial function with terms of degrees $0,2,\ldots,q$, and forms the Epanechnikov hierarchy, with the Epanechnikov kernel $K_{1,2}^{E}(x)\propto(1-x^{2})_{+}$ as the basic kernel. We have therefore obtained a lower bound

$\Bigl\{\Bigl(B_{1,q}^{6}(K_{1,q}^{B})\cdot V_{1,1}^{2q}(K_{1,q}^{B})\Bigr)\cdot\Bigl(B_{1,q}^{2}(K_{1,q}^{E})\cdot V_{1,0}^{2q}(K_{1,q}^{E})\Bigr)^{d-1}\Bigr\}^{\frac{1}{d+2q+2}}$   (31)

of the AMSE criterion for the PK model.

2.4 Comparison between Radial-Basis Kernels and Product Kernels

We have so far studied the RKs and the PKs separately. One may then ask which of these classes of kernels will perform better. In this section we discuss this question.

Our approach to answering this question is basically grounded on a comparison between the AMSE of the optimal RK and that of the optimal PK. There are, however, difficulties in this approach. One is that the optimal PK, as well as its AMSE, is not known, which prevents us from comparing the RKs and the PKs directly in terms of the AMSE. Another is that the PDF-dependent factor of the AMSE takes different forms for the RKs and for the PKs. Most evidently, cross derivatives of the PDF are present in $\bar{{\bm{b}}}$ (14), the quantity which contributes to the AB of ${\bm{\theta}}_{n}$ for the RKs, whereas they are absent from its counterpart $\tilde{{\bm{b}}}$ (28) for the PKs. Such differences in the PDF-dependent factors hinder a direct, PDF-independent comparison between the RKs and the PKs in terms of the AMSE in the general setting.

The only exception is the case $q=2$, where the PDF-dependent factor takes the same form for the RKs and the PKs, and consequently one can compare the AMSEs for RKs and PKs regardless of the PDF. In the case of $q=2$, simple calculations show $\tilde{{\bm{b}}}=\nabla\nabla_{2}f({\bm{\theta}})$ and $\bar{{\bm{b}}}=\frac{v_{d}}{2}\nabla\nabla_{2}f({\bm{\theta}})=\frac{v_{d}}{2}\tilde{{\bm{b}}}$. Therefore, the ratio of the lower bound of the AMSE for the PK that derives from (31) to the AMSE for the optimal RK $K_{d,2}^{B}$ becomes

$\Biggl(\frac{2^{6d+4}\bigl(B_{1,2}^{6}(K_{1,2}^{B})\cdot V_{1,1}^{4}(K_{1,2}^{B})\bigr)\cdot\bigl(B_{1,2}^{2}(K_{1,2}^{E})\cdot V_{1,0}^{4}(K_{1,2}^{E})\bigr)^{d-1}}{v_{d}^{2d+8}\bigl(B_{d,2}^{2(d+2)}(G_{d,2}^{B})\cdot V_{d,1}^{4}(G_{d,2}^{B})\bigr)}\Biggr)^{\frac{1}{d+6}}.$   (32)
Figure 2: Ratio (32) of the AMSE lower bound for PKs to the AMSE for the optimal RK, versus the dimension $d$.

Figure 2 shows how the ratio depends on the dimension $d$. One observes that it increases monotonically with $d$. Since the ratio is larger than 1 for $d\geq 2$ (note that there is no distinction between the RKs and the PKs when $d=1$), one can conclude that the optimal second-order non-negative RK achieves an AMSE smaller than that of any second-order non-negative PK.
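The ratio (32) is straightforward to evaluate numerically; the sketch below does so with one-dimensional quadrature, using the explicit form $G_{d,2}^{B}(x)=\frac{\Gamma(d/2+3)}{2\pi^{d/2}}(1-x^{2})^{2}\mathbbm{1}_{x\leq 1}$ obtained from (22) with $q=2$, and should reproduce the qualitative behavior of Figure 2 (a ratio of 1 at $d=1$, increasing with $d$); the quadrature routine is an incidental implementation choice.

```python
# Numerical evaluation of the ratio (32), implemented directly from its definition.
import numpy as np
from math import gamma, pi
from scipy.integrate import quad

# univariate half-line functionals, cf. the definitions of B_{1,i} and V_{1,l}
KB  = lambda x: 15 / 16 * (1 - x**2)**2          # Biweight K_{1,2}^B on [0, 1]
dKB = lambda x: -15 / 4 * x * (1 - x**2)
KE  = lambda x: 3 / 4 * (1 - x**2)               # Epanechnikov K_{1,2}^E on [0, 1]
B_KB  = quad(lambda x: x**2 * KB(x), 0, 1)[0]
V1_KB = quad(lambda x: dKB(x)**2, 0, 1)[0]
B_KE  = quad(lambda x: x**2 * KE(x), 0, 1)[0]
V0_KE = quad(lambda x: KE(x)**2, 0, 1)[0]

def ratio(d):
    c = gamma(d / 2 + 3) / (2 * pi**(d / 2))      # G_{d,2}^B(x) = c (1 - x^2)^2 on [0, 1]
    B_G = quad(lambda x: x**(d + 1) * c * (1 - x**2)**2, 0, 1)[0]
    V_G = quad(lambda x: x**(d - 1) * (-4 * c * x * (1 - x**2))**2, 0, 1)[0]
    v_d = 2 * pi**(d / 2) / (d * gamma(d / 2))
    num = 2**(6 * d + 4) * B_KB**6 * V1_KB**4 * (B_KE**2 * V0_KE**4)**(d - 1)
    den = v_d**(2 * d + 8) * B_G**(2 * (d + 2)) * V_G**4
    return (num / den)**(1 / (d + 6))

for d in range(1, 6):
    print(d, round(ratio(d), 4))
```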

It should be emphasized that the above superiority result of the optimal RK over the PKs is limited to the second-order ($q=2$) case. As mentioned above, for $q\geq 4$ one cannot compare the RKs and the PKs in terms of the AMSE in a PDF-independent manner. Consequently, depending on the PDF, there might be some PKs with better AMSE values than the optimal RK $K_{d,q}^{B}$ for $q\geq 4$, some examples of which will be shown in the simulation experiments in Section 4.

2.5 Optimal Kernel in Elliptic Kernels

Kernel methods in the multivariate setting may use a kernel function together with a linear transformation, in order to calibrate the difference in scales and shearing across the coordinates. In this section we assume for simplicity that an RK is to be combined with a linear transformation. For a $d\times d$ matrix $\mathrm{P}$ representing an area-preserving linear transformation in $\mathbb{R}^{d}$ (hence $|\det\mathrm{P}|=1$) and an RK $K$ (the base kernel), the transformed kernel $K_{\mathrm{P}}({\bm{x}})=K(\mathrm{P}{\bm{x}})$ is often called an elliptic kernel. In this section, we discuss the optimal base kernel $K$ and the optimal transformation for kernel mode estimation based on an elliptic kernel.

Using the elliptic kernel $K_{\mathrm{P}}$ is equivalent to using the kernel $K$ with the transformed sample $\{\mathrm{P}{\bm{X}}_{i}\}_{i=1}^{n}$ drawn from $f_{\mathrm{P}}$, which is defined by $f_{\mathrm{P}}(\cdot)=f(\mathrm{P}^{-1}\cdot)$, for estimation of the mode $\mathrm{P}{\bm{\theta}}$ of $f_{\mathrm{P}}$. Multiplying the resulting mode estimate by $\mathrm{P}^{-1}$ from the left yields the KME with $K_{\mathrm{P}}$ of $f$, an estimate of the mode ${\bm{\theta}}$. This view, together with the calculations

$\{Hf_{\mathrm{P}}(\mathrm{P}{\bm{\theta}})\}^{-1}=\mathrm{P}\mathrm{A}\mathrm{P}^{\top},\quad\nabla\nabla^{q/2}_{2}f_{\mathrm{P}}(\mathrm{P}{\bm{\theta}})=\nabla(\nabla^{\top}\mathrm{Q}\nabla)^{q/2}f({\bm{\theta}}),$   (33)

where $\mathrm{Q}\coloneq\mathrm{P}^{-1}\mathrm{P}^{-\top}$, reveals that the AB and AVC of the KME using $K_{\mathrm{P}}$ reduce to

$\mathrm{E}[{\bm{\theta}}_{n}-{\bm{\theta}}]\approx-\frac{\pi^{d/2}h_{n}^{q}}{2^{q-1}\Gamma(\frac{d+q}{2})\Gamma(\frac{q}{2}+1)}B_{d,q}(G)\,\mathrm{A}\nabla(\nabla^{\top}\mathrm{Q}\nabla)^{q/2}f({\bm{\theta}}),$   (34)

$\mathrm{Cov}[{\bm{\theta}}_{n}]\approx\frac{v_{d}f({\bm{\theta}})}{nh_{n}^{d+2}}V_{d,1}(G)\,\mathrm{A}\mathrm{Q}^{-1}\mathrm{A},$   (35)

and that their kernel-dependent factors remain $B_{d,q}(G)$ and $V_{d,1}(G)$ of $K(\cdot)=G(\|\cdot\|)$. Therefore, as long as $\mathrm{P}$ (or equivalently $\mathrm{Q}$) is fixed such that the AB of the KME is non-zero, an argument similar to the ones in Section 2 leads to the optimal bandwidth and the same AMSE criterion with respect to $K$. Thus, the kernel $K_{d,q}^{B}$ is optimal as a base kernel even under such a transformation $\mathrm{P}$.

One may furthermore consider optimizing $\mathrm{Q}$ (equivalently $\mathrm{P}$) via

$\min_{\mathrm{Q}}\;\|\mathrm{A}\nabla(\nabla^{\top}\mathrm{Q}\nabla)^{q/2}f({\bm{\theta}})\|^{2(d+2)}\operatorname{tr}(\mathrm{A}\mathrm{Q}^{-1}\mathrm{A})^{2q},\quad\mathrm{s.t.}\ \det(\mathrm{Q})=1,$   (36)

which comes from the linear-transformation-dependent factor of the bandwidth-optimized AMSE. We have so far not succeeded in obtaining a closed-form solution of the optimization problem (36), although a numerical optimization might be possible with plug-in estimates of $f({\bm{\theta}})$ and $\mathrm{A}$.
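If one nevertheless wishes to attack (36) numerically, a crude sketch for $q=2$ is given below: the third-derivative information $\partial_{i}\partial_{j}\partial_{k}f({\bm{\theta}})$ is represented by a placeholder tensor T (in practice it would be a plug-in estimate), $\mathrm{Q}$ is parameterized as $\mathrm{M}\mathrm{M}^{\top}$ rescaled so that $\det(\mathrm{Q})=1$, and a generic derivative-free optimizer is applied; none of these choices is claimed to be the recommended one.

```python
# Crude numerical sketch for the Q-optimization (36) with q = 2.
# T is a placeholder for the third-derivative tensor of f at the mode,
# and A for {Hf(theta)}^{-1}; both would be plug-in estimates in practice.
import numpy as np
from itertools import permutations
from scipy.optimize import minimize

d, q = 2, 2
rng = np.random.default_rng(0)
A = np.linalg.inv(np.array([[-3.0, 0.5], [0.5, -2.0]]))
T = rng.normal(size=(d, d, d))
T = sum(T.transpose(p) for p in permutations(range(3))) / 6   # symmetrize toy tensor

def objective(params):
    M = params.reshape(d, d)
    Q = M @ M.T + 1e-9 * np.eye(d)
    Q /= np.linalg.det(Q)**(1 / d)                            # enforce det(Q) = 1
    grad_term = A @ np.einsum('ijk,jk->i', T, Q)              # A grad(grad^T Q grad) f(theta)
    return (np.linalg.norm(grad_term)**(2 * (d + 2))
            * np.trace(A @ np.linalg.inv(Q) @ A)**(2 * q))

res = minimize(objective, np.eye(d).ravel(), method='Nelder-Mead')
print(res.fun)
M = res.x.reshape(d, d)
print(M @ M.T)   # optimized Q, up to the det-one rescaling applied in the objective
```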

It is worth mentioning that in some cases one can make the AB equal to zero by a certain choice of $\mathrm{Q}$, whereas in other cases one cannot make the AB equal to zero by any choice of $\mathrm{Q}$. Furthermore, whether one can make the AB equal to zero via the choice of $\mathrm{Q}$ depends on the $(q+1)$-st order term of the Taylor expansion of $f({\bm{x}})$ around the mode ${\bm{x}}={\bm{\theta}}$, and this dependence seems to be quite complicated. Consequently, optimization of the linear transformation in terms of the AMSE is difficult to tackle in a manner similar to that adopted in the previous sections, because the assumption that the AB does not vanish does not necessarily hold. Although we have not succeeded in providing a complete specification as to when one can make the AB equal to zero via the choice of $\mathrm{Q}$, we describe in the following some cases where the AB can be made equal to zero by a certain choice of $\mathrm{Q}$ and some other cases where the AB cannot be made equal to zero by any choice of $\mathrm{Q}$.

The following two propositions provide cases where the AB can be made zero by an appropriate choice of $\mathrm{Q}$.

Proposition 2.

Assume $d\geq q+1$. Assume that the $(q+1)$-st order term of the Taylor expansion of $f({\bm{x}})$ around the mode ${\bm{x}}={\bm{\theta}}$ is of the form

$\prod_{i=1}^{q+1}{\bm{a}}_{i}^{\top}({\bm{x}}-{\bm{\theta}}),$   (37)

with $\{{\bm{a}}_{i}\}_{i=1}^{q+1}$ linearly independent. Then there is a choice of $\mathrm{Q}$ with which the AB (34) is made equal to zero.

We show that there exists a positive definite $\mathrm{Q}$ which satisfies

$\nabla(\nabla^{\top}\mathrm{Q}\nabla)^{q/2}\prod_{i=1}^{q+1}{\bm{a}}_{i}^{\top}({\bm{x}}-{\bm{\theta}})=\bm{0}_{d}.$   (38)

The left-hand side is calculated as

$\nabla(\nabla^{\top}\mathrm{Q}\nabla)^{q/2}\prod_{i=1}^{q+1}{\bm{a}}_{i}^{\top}({\bm{x}}-{\bm{\theta}})=\sum_{\sigma}({\bm{a}}_{\sigma(1)}^{\top}\mathrm{Q}{\bm{a}}_{\sigma(2)})\cdots({\bm{a}}_{\sigma(q-1)}^{\top}\mathrm{Q}{\bm{a}}_{\sigma(q)}){\bm{a}}_{\sigma(q+1)},$   (39)

where the summation on the right-hand side is taken over all permutations $\sigma$ of $\{1,\ldots,q+1\}$. Due to the assumed linear independence of $\{{\bm{a}}_{i}\}_{i=1}^{q+1}$, there exist duals $\{{\bm{b}}_{i}\}_{i=1}^{q+1}$ of $\{{\bm{a}}_{i}\}_{i=1}^{q+1}$ satisfying ${\bm{a}}_{i}^{\top}{\bm{b}}_{j}=\mathbbm{1}_{i=j}$. Take $\{{\bm{b}}_{i}\}_{i=q+2,\ldots,d}$ as a basis of the orthogonal complement of $\{{\bm{a}}_{i}\}_{i=1}^{q+1}$. Then $\{{\bm{b}}_{i}\}_{i=1,\ldots,d}$ forms a basis of $\mathbb{R}^{d}$. Let

$\mathrm{Q}=\sum_{i=1}^{d}{\bm{b}}_{i}{\bm{b}}_{i}^{\top}\succ\mathrm{O}_{d},$   (40)

with the zero $d\times d$ matrix $\mathrm{O}_{d}$, where the notation $\mathrm{A}\succ\mathrm{B}$ means that $\mathrm{A}-\mathrm{B}$ is positive definite. One then has ${\bm{a}}_{i}^{\top}\mathrm{Q}{\bm{a}}_{j}=\mathbbm{1}_{i=j}$ for $i,j\in\{1,\ldots,q+1\}$. This implies that all the coefficients of ${\bm{a}}_{i}$ on the right-hand side of (39) vanish, proving that (38) holds with the above $\mathrm{Q}$.
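The construction in this proof is easy to verify numerically for a small example; the snippet below builds the duals and $\mathrm{Q}$ as in (40) for random $\{{\bm{a}}_{i}\}$ (here $d=4$, $q=2$) and checks that the permutation sum (39) vanishes. The dimensions and random seed are arbitrary choices for illustration.

```python
# Numerical sanity check of the construction (40) in the proof of Proposition 2.
import numpy as np
from itertools import permutations

rng = np.random.default_rng(1)
d, q = 4, 2
a = rng.normal(size=(q + 1, d))                # generic, hence linearly independent, a_1..a_{q+1}

b_dual = np.linalg.pinv(a)                     # columns b_i with a_i^T b_j = 1_{i=j}
_, _, Vt = np.linalg.svd(a)
b_comp = Vt[q + 1:].T                          # basis of the orthogonal complement of span{a_i}
B = np.hstack([b_dual, b_comp])                # full basis b_1, ..., b_d of R^d
Q = B @ B.T                                    # equation (40)

total = np.zeros(d)
for sigma in permutations(range(q + 1)):
    term = np.prod([a[sigma[2 * k]] @ Q @ a[sigma[2 * k + 1]] for k in range(q // 2)])
    total += term * a[sigma[q]]
print(np.linalg.norm(total))                   # ~ 0, i.e., (38) holds
```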

Proposition 3.

Assume $d\geq q$. Assume that the $(q+1)$-st order term of the Taylor expansion of $f({\bm{x}})$ around the mode ${\bm{x}}={\bm{\theta}}$ is of the form

$\prod_{i=1}^{q+1}{\bm{a}}_{i}^{\top}({\bm{x}}-{\bm{\theta}}),$   (41)

where $\{{\bm{a}}_{i}\}_{i=1}^{q+1}$ spans a $q$-dimensional subspace of $\mathbb{R}^{d}$. Assume further that $\sum_{i=1}^{q+1}s_{i}{\bm{a}}_{i}=\bm{0}_{d}$ holds with coefficients $\{s_{i}\}_{i=1}^{q+1}$ satisfying $s_{1}s_{2}\cdots s_{q+1}\not=0$. Then there is a choice of $\mathrm{Q}$ with which the AB (34) is made equal to zero.

One can find a subset of $\{{\bm{a}}_{i}\}_{i=1}^{q+1}$ of size $q$ which consists of $q$ linearly independent vectors. Without loss of generality we assume that $\{{\bm{a}}_{i}\}_{i=1}^{q}$ are linearly independent. Choosing

$\mathrm{Q}=\sum_{i=1}^{q}\frac{1}{s_{i}^{2}}{\bm{b}}_{i}{\bm{b}}_{i}^{\top}-\sum_{1\leq i<j\leq q}\frac{1}{qs_{i}s_{j}}({\bm{b}}_{i}{\bm{b}}_{j}^{\top}+{\bm{b}}_{j}{\bm{b}}_{i}^{\top})+\sum_{i=q+2}^{d}{\bm{b}}_{i}{\bm{b}}_{i}^{\top}\succ\mathrm{O}_{d}$   (42)

with $\{{\bm{b}}_{i}\}_{i=1}^{q}$, the duals of $\{{\bm{a}}_{i}\}_{i=1}^{q}$ in the subspace spanned by $\{{\bm{a}}_{i}\}_{i=1}^{q}$, and with $\{{\bm{b}}_{i}\}_{i=q+2}^{d}$, a basis of the orthogonal complement of $\{{\bm{a}}_{i}\}_{i=1}^{q}$, one has

${\bm{a}}_{i}^{\top}\mathrm{Q}{\bm{a}}_{j}=\begin{cases}\frac{1}{s_{i}^{2}},&i=j\in\{1,\ldots,q+1\},\\ -\frac{1}{qs_{i}s_{j}},&i\not=j,\ i,j\in\{1,\ldots,q+1\},\\ 0,&\text{otherwise}.\end{cases}$   (46)

One can thus show that the following holds:

$\nabla(\nabla^{\top}\mathrm{Q}\nabla)^{q/2}\left(\prod_{i=1}^{q+1}{\bm{a}}_{i}^{\top}({\bm{x}}-{\bm{\theta}})\right)=\left(-\frac{1}{q}\right)^{q/2}\frac{1}{\prod_{i=1}^{q+1}s_{i}}\sum_{\sigma}s_{\sigma(q+1)}{\bm{a}}_{\sigma(q+1)}=\bm{0}_{d}.$   (47)

The following propositions provide cases where the AB cannot be made equal to zero by any choice of $\mathrm{Q}$.

Proposition 4.

Assume $d\geq q$. Assume that the $(q+1)$-st order term of the Taylor expansion of $f({\bm{x}})$ around the mode ${\bm{x}}={\bm{\theta}}$ is of the form

${\bm{a}}^{\top}({\bm{x}}-{\bm{\theta}})\prod_{i=1}^{q/2}\frac{1}{2}({\bm{x}}-{\bm{\theta}})^{\top}\mathrm{R}_{i}({\bm{x}}-{\bm{\theta}}),$   (48)

where $\mathrm{R}_{i}$, $i=1,\ldots,q/2$, are positive definite. Then the AB (34) cannot be made equal to zero by any choice of $\mathrm{Q}$.

In the case of $q=2$ one can prove the following proposition.

Proposition 5.

If, around its mode ${\bm{x}}={\bm{\theta}}$, $f({\bm{x}})$ admits the expression

$f({\bm{x}})=f({\bm{\theta}})+\frac{1}{2}({\bm{x}}-{\bm{\theta}})^{\top}\mathrm{A}^{-1}({\bm{x}}-{\bm{\theta}})+g({\bm{x}}-{\bm{\theta}}){\bm{a}}^{\top}({\bm{x}}-{\bm{\theta}})+\mathrm{h.o.t.},$   (49)

where ${\bm{a}}$ is nonzero, and where

$g({\bm{u}})=\frac{1}{2}{\bm{u}}^{\top}\mathrm{R}{\bm{u}},\quad\mathrm{R}\succ\mathrm{O}_{d}$   (50)

is a positive definite quadratic form, then there is no $\mathrm{P}$ which makes the AB vanish.

We show that in this case no positive definite choice of $\mathrm{Q}$ makes the quantity $\nabla\nabla^{\top}\mathrm{Q}\nabla(g({\bm{x}}-{\bm{\theta}}){\bm{a}}^{\top}({\bm{x}}-{\bm{\theta}}))$ vanish. Indeed, one can write

$\nabla\nabla^{\top}\mathrm{Q}\nabla\left(g({\bm{x}}-{\bm{\theta}}){\bm{a}}^{\top}({\bm{x}}-{\bm{\theta}})\right)=(2\mathrm{R}\mathrm{Q}+(\operatorname{tr}\mathrm{R}\mathrm{Q})\mathrm{I}){\bm{a}}.$   (51)

One observes that $\mathrm{R}\mathrm{Q}$ has positive eigenvalues only, since $\mathrm{R}$ and $\mathrm{Q}$ are both positive definite. We thus have $\operatorname{tr}\mathrm{R}\mathrm{Q}>0$, which in turn implies that the matrix $2\mathrm{R}\mathrm{Q}+(\operatorname{tr}\mathrm{R}\mathrm{Q})\mathrm{I}$ is invertible. Hence the quantity in (51) is nonvanishing for an arbitrary choice of $\mathrm{Q}$.

Proposition 4 can be proven similarly. The term $\nabla(\nabla^{\top}\mathrm{Q}\nabla)^{q/2}f({\bm{\theta}})$ in this case is represented as $\mathrm{M}(\{\mathrm{R}_{i}\},\mathrm{Q}){\bm{a}}$, where the $d\times d$ matrix $\mathrm{M}(\{\mathrm{R}_{i}\},\mathrm{Q})$ is a sum of terms of the form $\mathrm{R}_{j_{1}}\mathrm{Q}\cdots\mathrm{R}_{j_{n}}\mathrm{Q}$ with coefficients which are products of $\operatorname{tr}\mathrm{R}_{i_{1}}\mathrm{Q}\cdots\mathrm{R}_{i_{k}}\mathrm{Q}>0$, and is hence positive definite. The term $\nabla(\nabla^{\top}\mathrm{Q}\nabla)^{q/2}f({\bm{\theta}})$ is therefore nonzero irrespective of the choice of $\mathrm{Q}$.

We close this section by showing a complete specification as to when the AB can be made equal to zero via the choice of $\mathrm{Q}$ in the most basic case $(d,q)=(2,2)$.

Proposition 6.

Assume $(d,q)=(2,2)$. The third-order term of the Taylor expansion of $f({\bm{x}})$ around the mode ${\bm{x}}={\bm{\theta}}$ is either:

  • identically 0, in which case the AB is equal to zero with an arbitrary $\mathrm{Q}$,

  • of the form $\prod_{i=1}^{3}{\bm{a}}_{i}^{\top}({\bm{x}}-{\bm{\theta}})$ with ${\bm{a}}_{1},{\bm{a}}_{2},{\bm{a}}_{3}$, any two of which are linearly independent, in which case there is a choice of $\mathrm{Q}$ with which the AB is equal to zero,

  • other than any of the above, in which case the AB cannot be made equal to zero by any choice of $\mathrm{Q}$.

A proof is given in Appendix D.

3 Additional Discussion

3.1 Degree of Goodness of Optimal Kernel and Heuristic Findings

Table 1: The AMSE criterion $(B_{1,q}^{6}\cdot V_{1,1}^{2q})^{\frac{1}{2q+3}}$ and the AMSE ratio in brackets, for the univariate KME, $q=2,4,6$.

$q$ | Kernel $K$ | $B_{1,q}$ | $V_{1,1}$ | $(B_{1,q}^{6}\cdot V_{1,1}^{2q})^{\frac{1}{2q+3}}$ [AMSE ratio]
2 | Biweight $K_{1,2}^{B}(x)=\frac{15}{16}\{(1-x^{2})_{+}\}^{2}$ | $\frac{1}{14}$ | $\frac{15}{14}$ | 0.1083 [1.0000]
2 | Triweight $\frac{35}{32}\{(1-x^{2})_{+}\}^{3}$ | $\frac{1}{18}$ | $\frac{35}{22}$ | 0.1095 [1.0105]
2 | Tricube $\frac{70}{81}\{(1-|x|^{3})_{+}\}^{3}$ | $\frac{35}{486}$ | $\frac{210}{187}$ | 0.1121 [1.0345]
2 | Cosine $\frac{\pi}{4}\cos(\frac{\pi x}{2})\mathbbm{1}_{|x|\leq 1}$ | $\frac{1}{2}-\frac{4}{\pi^{2}}$ | $\frac{\pi^{4}}{128}$ | 0.1135 [1.0475]
2 | Epanechnikov $K_{1,2}^{E}(x)=\frac{3}{4}(1-x^{2})_{+}$ | 0.1 | 0.75 | 0.1179 [1.0883]
2 | Triangle $(1-|x|)_{+}$ | $\frac{1}{12}$ | 1 | 0.1188 [1.0971]
2 | Gaussian $K_{1,2}^{G}(x)=\frac{1}{(2\pi)^{1/2}}e^{-x^{2}/2}$ | 0.5 | $\frac{1}{8\pi^{1/2}}$ | 0.1213 [1.1198]
2 | Logistic $\frac{1}{e^{x}+2+e^{-x}}$ | $\frac{\pi^{2}}{6}$ | $\frac{1}{60}$ | 0.1476 [1.3629]
2 | Sech $\frac{1}{2}\mathrm{sech}(\frac{\pi x}{2})$ | 0.5 | $\frac{\pi}{24}$ | 0.1727 [1.5949]
2 | Laplace $K_{1,2}^{L}(x)=\frac{1}{2}e^{-|x|}$ | 1 | 0.125 | 0.3048 [2.8133]
4 | $K_{1,4}^{B}(x)=\frac{105}{64}(1-3x^{2})\{(1-x^{2})_{+}\}^{2}$ | $-\frac{1}{66}$ | $\frac{525}{88}$ | 0.3729 [1.0000]
4 | $K_{1,4}^{E}(x)=\frac{15}{32}(3-7x^{2})(1-x^{2})_{+}$ | $-\frac{1}{42}$ | $\frac{75}{16}$ | 0.4005 [1.0737]
4 | $K_{1,4}^{G}(x)=\frac{1}{2(2\pi)^{1/2}}(3-x^{2})e^{-x^{2}/2}$ | $-1.5$ | $\frac{55}{128\pi^{1/2}}$ | 0.4451 [1.1935]
4 | $K_{1,4}^{L}(x)=\frac{1}{20}(12-x^{2})e^{-|x|}$ | $-21.6$ | $\frac{313}{1600}$ | 1.6314 [4.3743]
6 | $K_{1,6}^{B}(x)=\frac{315}{2048}(15-110x^{2}+143x^{4})\{(1-x^{2})_{+}\}^{2}$ | $\frac{1}{286}$ | $\frac{2205}{128}$ | 1.0149 [1.0000]
6 | $K_{1,6}^{E}(x)=\frac{105}{256}(5-30x^{2}+33x^{4})(1-x^{2})_{+}$ | $\frac{5}{858}$ | $\frac{3675}{256}$ | 1.0760 [1.0600]
6 | $K_{1,6}^{G}(x)=\frac{1}{8(2\pi)^{1/2}}(15-10x^{2}+x^{4})e^{-x^{2}/2}$ | 7.5 | $\frac{7105}{8192\pi^{1/2}}$ | 1.2639 [1.2453]
6 | $K_{1,6}^{L}(x)=\frac{1}{2384}(1560-220x^{2}+3x^{4})e^{-|x|}$ | $\frac{196200}{149}$ | $\frac{5572345}{22733824}$ | 5.7451 [5.6609]

We are interested not only in specifying the optimal kernel, but also in how the kernel selection affects the AMSE. To quantify the degree of suboptimality of a kernel, we define, for a dd-variate qq-th order kernel KK, the AMSE ratio as the ratio of the bandwidth-optimized AMSE (11) for KK to that for the optimal RK Kd,qBK_{d,q}^{B}. The AMSE ratio depends only on KK if the bandwidth-optimized AMSEs for KK and Kd,qBK_{d,q}^{B} share the same PDF-dependent factor, which is the case for RKs (including univariate ones) and PKs with q=2q=2 (see Section 2.4). Table 1 compares the AMSE criteria and the AMSE ratios in the univariate case. An empirical observation is that truncated kernels, such as the Triweight and Epanechnikov kernels in addition to the optimal Biweight kernel, are better than non-truncated kernels such as the Sech, Laplace, and K1,4LK_{1,4}^{L} kernels. This tendency also holds for multivariate RKs and PKs. Figure 3 shows the AMSE ratios for several multivariate RKs.

\begin{overpic}[height=142.26378pt,bb=0 0 360 252]{./amseratios1.png}\put(18.0,60.0){\small(a)}\end{overpic} \begin{overpic}[height=142.26378pt,bb=0 0 360 252]{./amseratios2.png}\put(18.0,60.0){\small(b)}\end{overpic}
Figure 3: AMSE ratios for multivariate RKs. (b) shows a zoom of (a), with the range of the AMSE ratio being [1,1.35], to better visualize the differences.

For example, the AMSE ratio for the most frequently used Gaussian kernel Kd,2GK_{d,2}^{G} is approximately 1.1198, 1.1547, 1.1864, 1.2151, 1.2413, 1.2652, 1.2870, 1.3071, 1.3256, 1.3428 for d=1,,10d=1,\ldots,10, and monotonically approaches e2/41.8473e^{2}/4\approx 1.8473 as dd goes to infinity.
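As an illustration of how the univariate entries of Table 1 can be reproduced, the following sketch evaluates the kernel-dependent constants by numerical integration. It assumes the normalizations B1,q(K) = (1/2)∫u^q K(u) du and V1,1(K) = (1/2)∫{K′(u)}² du, which are consistent with the tabulated values (e.g., B1,2 = 1/14 and V1,1 = 15/14 for the Biweight kernel); only the Biweight and Gaussian kernels with q = 2 are shown.

```python
import numpy as np
from scipy.integrate import quad

def criterion(K, dK, q, support=(-np.inf, np.inf)):
    """AMSE criterion (B_{1,q}^6 * V_{1,1}^{2q})^{1/(2q+3)} for a univariate kernel."""
    B = 0.5 * quad(lambda u: u**q * K(u), *support)[0]
    V = 0.5 * quad(lambda u: dK(u)**2, *support)[0]
    return (B**6 * V**(2 * q))**(1.0 / (2 * q + 3))

# Biweight (q = 2): K(u) = (15/16)(1 - u^2)_+^2
biweight  = lambda u: 15/16 * (1 - u**2)**2
dbiweight = lambda u: -15/4 * u * (1 - u**2)
# Gaussian (q = 2)
gauss  = lambda u: np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)
dgauss = lambda u: -u * gauss(u)

c_bi = criterion(biweight, dbiweight, q=2, support=(-1, 1))
c_ga = criterion(gauss, dgauss, q=2)
print(f"Biweight: {c_bi:.4f}")                              # ~0.1083, cf. Table 1
print(f"Gaussian: {c_ga:.4f}")                              # ~0.1213
print(f"AMSE ratio Gaussian/Biweight: {c_ga / c_bi:.4f}")   # ~1.1198
```

Running the same function for the other kernels reproduces the remaining criterion values and ratios of Table 1 up to numerical integration error.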

3.2 Other Criteria

We note here that the optimality of the kernel Kd,qBK_{d,q}^{B} is not limited to the scenario in which the AMSE is used as the criterion. For simplicity, the univariate case is assumed in this section; the multivariate extension is straightforward.

Grund and Hall (1995) have studied the asymptotic LpL^{p} error (ALpL^{p}E) E[|θnθ|p]\mathrm{E}[|\theta_{n}-\theta|^{p}] for p1p\geq 1 and the optimal bandwidth minimizing it. Although they have clarified only the nn-dependence of the optimal bandwidth, we can go one step beyond their discussion and clarify the kernel dependence: the optimal bandwidth is represented as h1,q,p,nopt=cp(V1,1/(nB1,q2))12q+3h_{1,q,p,n}^{\mathrm{opt}}=c_{p}(V_{1,1}/(nB_{1,q}^{2}))^{\frac{1}{2q+3}}, where cpc_{p} is a PDF-dependent but kernel-independent coefficient for which a closed-form expression is difficult to derive. If this optimal bandwidth is used, then the ALpL^{p}E reduces to

E[|θnθ|ph1,q,p,nopt]npq2q+3(B1,q6V1,12q)p2(2q+3)E[|21/2{f(θ)}1/2cp3/2f(2)(θ)Z+2cpqf(q+1)(θ)q!f(2)(θ)|p],\displaystyle\mathrm{E}[|\theta_{n}-\theta|^{p}\mid h_{1,q,p,n}^{\mathrm{opt}}]\approx n^{-\frac{pq}{2q+3}}\Bigl{(}B_{1,q}^{6}\cdot V_{1,1}^{2q}\Bigr{)}^{\frac{p}{2(2q+3)}}\mathrm{E}\left[\biggl{|}\frac{2^{1/2}\{f(\theta)\}^{1/2}}{c_{p}^{3/2}f^{(2)}(\theta)}Z+\frac{2c_{p}^{q}f^{(q+1)}(\theta)}{q!f^{(2)}(\theta)}\biggr{|}^{p}\right], (52)

where ZZ denotes the random variable following the standard normal distribution, and where the expectation is taken with respect to ZZ. Since the kernel-dependent factor of the ALpL^{p}E is (B1,q6V1,12q)(B_{1,q}^{6}\cdot V_{1,1}^{2q}), which is the same as that for the AMSE, the kernel K1,qBK_{1,q}^{B} also minimizes the ALpL^{p}E among minimum-sign-change kernels, even when p2p\neq 2.

Also, it follows from (Mokkadem and Pelletier, 2003) that the KME θn\theta_{n} satisfies a law of the iterated logarithm. Under the scaling limnnhn2q+3/lnlnn=0\lim_{n\to\infty}nh_{n}^{2q+3}/\ln\ln n=0, it holds that

limsupnnhn32lnlnn(θnθ)=liminfnnhn32lnlnn(θnθ)=f(θ)V1,1|f(2)(θ)|,a.s.,\displaystyle\mathop{\rm lim\,sup}\limits_{n\to\infty}\sqrt{\frac{nh_{n}^{3}}{2\ln\ln n}}(\theta_{n}-\theta)=-\mathop{\rm lim\,inf}\limits_{n\to\infty}\sqrt{\frac{nh_{n}^{3}}{2\ln\ln n}}(\theta_{n}-\theta)=\frac{\sqrt{f(\theta)V_{1,1}}}{|f^{(2)}(\theta)|},\quad\text{a.s.}, (53)

in other words, nhn3/(2lnlnn)(θnθ)\sqrt{nh_{n}^{3}/(2\ln\ln n)}(\theta_{n}-\theta) is relatively compact almost surely, and its limit set, which almost surely contains the rescaled AB, is given by

[f(θ)V1,1/|f(2)(θ)|,f(θ)V1,1/|f(2)(θ)|].\displaystyle\Bigl{[}-\sqrt{f(\theta)V_{1,1}}/|f^{(2)}(\theta)|,\sqrt{f(\theta)V_{1,1}}/|f^{(2)}(\theta)|\Bigr{]}. (54)

Then, the combination of the bandwidth (10) minimizing the AMSE and the kernel K1,qBK_{1,q}^{B} uniformly minimizes the size of the limit set of (θnθ)(\theta_{n}-\theta), which is proportional to V1,1/hn3(B1,q6V1,12q)12(2q+3)\sqrt{V_{1,1}/h_{n}^{3}}\propto(B_{1,q}^{6}\cdot V_{1,1}^{2q})^{\frac{1}{2(2q+3)}}.

3.3 Singular Hessian

Although it has been so far assumed that the Hessian matrix HfHf is non-singular at the mode (see (A.4) in Assumption A.1), this section considers the optimal kernel when the Hessian matrix is singular. Since the multivariate singular case is so intricate that even its convergence rate remains an open problem, we restrict ourselves to d=1d=1 and consider the case f(2)(θ)=0f^{(2)}(\theta)=0 at the mode θ\theta.

Vieu (1996); Mokkadem and Pelletier (2003) have studied the singular case, where the PDF ff satisfies f(i)(θ)=0f^{(i)}(\theta)=0, i=1,,pi=1,\ldots,p and f(p+1)(θ)<0f^{(p+1)}(\theta)<0 for 1pq11\leq p\leq q-1 instead of f(2)(θ)<0f^{(2)}(\theta)<0. The non-singular case studied in Section 2 corresponds to p=1p=1. Note that the integer pp should be an odd number for ff to take a maximum at θ\theta.

Weak convergence of the KME θn\theta_{n} holds even in the singular case, though in a somewhat different form, with the requirements on the moments of the kernel KK weakened to

1,i(K)={1,i=0,0,i=p,,q1,2B1,q(K)0,i=q.\displaystyle{\mathcal{B}}_{1,i}(K)=\begin{cases}1,&i=0,\\ 0,&i=p,\ldots,q-1,\\ 2B_{1,q}(K)\neq 0,&i=q.\end{cases} (55)

In other words, the conditions 1,i(K)=0{\mathcal{B}}_{1,i}(K)=0, i=1,,p1i=1,\ldots,p-1, required in the moment condition (5) in the non-singular case, are no longer needed in the singular case. Together with several additional conditions (see (Mokkadem and Pelletier, 2003) for the details), (θnθ)p(\theta_{n}-\theta)^{p} asymptotically follows a normal distribution with

E[(θnθ)p]hnqa¯b¯B1,q,var[(θnθ)p]2f(θ)nhn3a¯2V1,1,\displaystyle\mathrm{E}[(\theta_{n}-\theta)^{p}]\approx-h_{n}^{q}\bar{a}\bar{b}B_{1,q},\quad\mathrm{var}[(\theta_{n}-\theta)^{p}]\approx\frac{2f(\theta)}{nh_{n}^{3}}\bar{a}^{2}V_{1,1}, (56)

where a¯:-(1p!f(p+1)(θ))1\bar{a}\coloneq(\frac{1}{p!}f^{(p+1)}(\theta))^{-1}, b¯:-2q!f(q+1)(θ)\bar{b}\coloneq\frac{2}{q!}f^{(q+1)}(\theta). Moreover, the bias-variance decomposition leads straightforwardly to the asymptotic L2pL^{2p} error (AL2pL^{2p}E), instead of the AMSE, of the KME θn\theta_{n}:

E[(θnθ)2p]hn2q(a¯b¯)2B1,q2+2f(θ)nhn3a¯2V1,1,\displaystyle\mathrm{E}[(\theta_{n}-\theta)^{2p}]\approx h_{n}^{2q}(\bar{a}\bar{b})^{2}B_{1,q}^{2}+\frac{2f(\theta)}{nh_{n}^{3}}\bar{a}^{2}V_{1,1}, (57)

which has the same functional form, in the bandwidth hnh_{n} and the kernel-dependent terms B1,qB_{1,q}, V1,1V_{1,1}, as the AMSE (9) for the univariate non-singular case. The optimal bandwidth minimizing the AL2pL^{2p}E can be calculated in the same way as the conventional one (10). Also, the AL2pL^{2p}E criterion reduces to (B1,q6V1,12q)12q+3(B_{1,q}^{6}\cdot V_{1,1}^{2q})^{\frac{1}{2q+3}}.

Table 2: The AL2pL^{2p}E criterion (B1,q6V1,12q)12q+3(B_{1,q}^{6}\cdot V_{1,1}^{2q})^{\frac{1}{2q+3}} and its ratio to the smallest value in brackets for the univariate singular cases, (p,q)=(3,4),(5,6)(p,q)=(3,4),(5,6).
Kernel KK B1,4B_{1,4} B1,6B_{1,6} V1,1V_{1,1} (B1,46V1,18)111(B_{1,4}^{6}\cdot V_{1,1}^{8})^{\frac{1}{11}} (B1,66V1,112)115(B_{1,6}^{6}\cdot V_{1,1}^{12})^{\frac{1}{15}}
Biweight 142\frac{1}{42} 5462\frac{5}{462} 1514\frac{15}{14} 0.1369 [1.0000] 0.1729 [1.0098]
Triweight 166\frac{1}{66} 5858\frac{5}{858} 3522\frac{35}{22} 0.1426 [1.0418] 0.1851 [1.0815]
Tricube 144\frac{1}{44} 1104\frac{1}{104} 210187\frac{210}{187} 0.1381 [1.0089] 0.1712 [1.0000]
Cosine 0.0394 0.0214 π4128\frac{\pi^{4}}{128} 0.1404 [1.0257] 0.1728 [1.0095]
Epanechnikov 370\frac{3}{70} 142\frac{1}{42} 0.75 0.1455 [1.0631] 0.1781 [1.0405]
Triangle 130\frac{1}{30} 156\frac{1}{56} 1 0.1564 [1.1427] 0.1999 [1.1674]
Gaussian 1.5 7.5 18π1/2\frac{1}{8\pi^{1/2}} 0.1813 [1.3246] 0.2683 [1.5675]
Logistic 7π430\frac{7\pi^{4}}{30} 31π642\frac{31\pi^{6}}{42} 160\frac{1}{60} 0.2797 [2.0435] 0.5223 [3.0507]
Sech 2.5 30.5 π24\frac{\pi}{24} 0.3757 [2.7443] 0.7714 [4.4614]
Laplace 12 360 0.125 0.8548 [6.2441] 1.9955 [11.6563]
K1,4BK_{1,4}^{B} 166-\frac{1}{66} 5429-\frac{5}{429} 52588\frac{525}{88} 0.3729 [2.7244] 0.7033 [4.1083]
K1,6BK_{1,6}^{B} 0 1286\frac{1}{286} 2205128\frac{2205}{128} 1.0145 [5.9282]

The essential difference between the discussion of the optimal kernel in the singular case and that in the non-singular case lies in the required moment conditions (55) and (5): as mentioned above, a kernel function is not required to satisfy the ii-th moment condition for i=1,,p1i=1,\ldots,p-1 in the singular case. For example, when p=3p=3 and q=4q=4, no symmetric 2nd order kernel satisfies (5), whereas any such kernel satisfies the singular version (55) of the moment conditions. The kernel K1,qBK_{1,q}^{B}, of course, minimizes the AL2pL^{2p}E criterion among the conventional qq-th order kernels satisfying the minimum-sign-change condition, regardless of pp. However, the AL2pL^{2p}E criterion may be further improved by using a lower-order kernel satisfying (55).

We investigated the two simple cases (p,q)=(3,4),(5,6)(p,q)=(3,4),(5,6) by calculating the AL2pL^{2p}E criterion for several kernels, and report the results in Table 2, where the AL2pL^{2p}E criterion is given by (B1,46V1,18)111(B_{1,4}^{6}\cdot V_{1,1}^{8})^{\frac{1}{11}} and (B1,66V1,112)115(B_{1,6}^{6}\cdot V_{1,1}^{12})^{\frac{1}{15}} for (p,q)=(3,4)(p,q)=(3,4) and (5,6)(5,6), respectively. In the case (p,q)=(3,4)(p,q)=(3,4), where any symmetric 2nd order kernel, as well as any 4th order kernel, fulfills the conditions (55), one can observe that the Biweight kernel K1,2BK_{1,2}^{B} is better than K1,4BK_{1,4}^{B}. In the case (p,q)=(5,6)(p,q)=(5,6), where at least three types of kernels, symmetric 2nd and 4th order kernels and 6th order kernels, satisfy the conditions (55), Table 2 shows that the Tricube kernel gives the smallest value of the AL2pL^{2p}E criterion among the kernels we investigated. Thus, we have confirmed that the kernel K1,qBK_{1,q}^{B}, which is optimal in the non-singular case, may not be optimal in the singular case, where a lower-order kernel may improve the asymptotic estimation accuracy. Although we have not succeeded in deriving an optimal kernel for the singular case, the empirical finding that truncated kernels perform well should be useful.
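The comparison for (p,q)=(3,4) in Table 2 can be checked with a short self-contained sketch; the moment normalization is the same assumption as in the sketch after Table 1.

```python
import numpy as np
from scipy.integrate import quad

def crit_34(K, dK):
    """AL^{2p}E criterion (B_{1,4}^6 * V_{1,1}^8)^{1/11} for the singular case (p, q) = (3, 4)."""
    B4 = 0.5 * quad(lambda u: u**4 * K(u), -1, 1)[0]   # both kernels are supported on [-1, 1]
    V1 = 0.5 * quad(lambda u: dK(u)**2, -1, 1)[0]
    return (B4**6 * V1**8)**(1.0 / 11.0)

biweight  = lambda u: 15/16 * (1 - u**2)**2
dbiweight = lambda u: -15/4 * u * (1 - u**2)
K14B      = lambda u: 105/64 * (1 - 3*u**2) * (1 - u**2)**2
dK14B     = lambda u: 105/32 * u * (1 - u**2) * (9*u**2 - 5)

print(f"Biweight (2nd order): {crit_34(biweight, dbiweight):.4f}")   # ~0.1369, cf. Table 2
print(f"K_{{1,4}}^B (4th order): {crit_34(K14B, dK14B):.4f}")        # ~0.3729
```

The printed values, about 0.1369 for the 2nd-order Biweight kernel and 0.3729 for K1,4B, match the corresponding entries of Table 2.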

4 Simulation Experiment

The analysis of the optimal kernel in the previous sections is based on the asymptotic normality and the evaluation of the AMSE, derived from the asymptotic expansion of the KME. While the leading-order term of the bandwidth-optimized AMSE (11) is O(n2qd+2q+2)O(n^{-\frac{2q}{d+2q+2}}), the next-leading-order term is O(n2(q+2)d+2q+2)O(n^{-\frac{2(q+2)}{d+2q+2}}) if one uses symmetric kernels such as RKs and PKs. Although the bandwidth-optimized AMSE ignores all but the leading-order term, the ignored terms may affect the behavior of the KME, and thus its MSE, when the sample size nn is finite. In this section, via simulation studies, we examine how well the kernel selection based on the AMSE criterion reflects the real performance of the KME and verify the practical goodness of the optimal RK Kd,qBK_{d,q}^{B} in finite-sample situations.

We considered three cases for the dimension, d=1,2,3d=1,2,3, and used synthetic i.i.d. sample sets of size n=100,400,n=100,400,\ldots, 102400, drawn from the distribution 12𝒩(𝟎d,Id)+12𝒩(𝟏d,2Id)\frac{1}{2}\mathcal{N}(\bm{0}_{d},\mathrm{I}_{d})+\frac{1}{2}\mathcal{N}(\bm{1}_{d},2\mathrm{I}_{d}), where 𝒩(𝝁,Σ)\mathcal{N}(\bm{\mu},\Sigma) denotes the normal distribution with mean 𝝁\bm{\mu} and variance-covariance matrix Σ\Sigma, and where 𝟏d\bm{1}_{d} is the all-1 dd-vector and Id\mathrm{I}_{d} is the d×dd\times d identity matrix. The modes of the sample-generating distribution are located at 𝜽0.2395,0.1514𝟏2,0.0874𝟏3{\bm{\theta}}\approx 0.2395,0.1514\cdot\bm{1}_{2},0.0874\cdot\bm{1}_{3} for d=1,2,3d=1,2,3, respectively. Because a symmetric distribution such as the normal distribution does not satisfy assumption (A.6), skewed distributions were used in the experiments. As 2nd order kernels, we examined the optimal Biweight Kd,2BK_{d,2}^{B}, Epanechnikov Kd,2EK_{d,2}^{E}, Gaussian Kd,2GK_{d,2}^{G}, and Laplace Kd,2LK_{d,2}^{L} kernels, as well as PKs of K1,2BK_{1,2}^{B}, K1,2EK_{1,2}^{E}, and K1,2LK_{1,2}^{L} if d2d\geq 2. For the higher orders q=4q=4 and 66, in addition to the optimal kernels Kd,qBK_{d,q}^{B}, three RKs Kd,qEK_{d,q}^{E}, Kd,qGK_{d,q}^{G}, and Kd,qLK_{d,q}^{L} designed via the jackknife (Schucany and Sommers, 1977; Wand and Schucany, 1990), an alternative method for designing a higher-order kernel, on the basis of Kd,2EK_{d,2}^{E}, Kd,2GK_{d,2}^{G}, and Kd,2LK_{d,2}^{L} (see Section C), and four PKs with K1,qBK_{1,q}^{B}, K1,qEK_{1,q}^{E}, K1,qGK_{1,q}^{G}, and K1,qLK_{1,q}^{L} were also examined. Note that the RKs Kd,qBK_{d,q}^{B}, Kd,qEK_{d,q}^{E}, and Kd,qLK_{d,q}^{L} and PKs of K1,qBK_{1,q}^{B}, K1,qEK_{1,q}^{E} and K1,qLK_{1,q}^{L} are not twice differentiable at some points and thus do not exactly satisfy all the conditions of Theorem 1 on the asymptotic behaviors of the KME. For each kernel, we used the optimal bandwidth (10) calculated from the sample-generating PDF and its mode. (It should be noted that this procedure requires access to the sample-generating PDF, which is unavailable in practice; see Footnote 1. The purpose of our simulation experiments here is not to evaluate performance in practical settings but rather to see how the MSE of the KME with the bandwidth optimized with respect to the AMSE behaves in the finite-nn situation, so we used the bandwidth values exactly optimized with respect to the AMSE.) In these settings, the smallest AMSE among those of the kernels examined is given by the RK Kd,qBK_{d,q}^{B} for d=1d=1 or q=2,6q=2,6 and by the PK with K1,4BK_{1,4}^{B} for d=2,3d=2,3 and q=4q=4. On the basis of 1000 trials, we calculated the MSE and its standard deviation (SD); the results are reported in Table 3, along with the MSE ratio, defined as the ratio of the MSE of each kernel to that for the kernel Kd,qBK_{d,q}^{B} with the same dd and qq when n=102400n=102400. Table 3 also shows the results of the Welch test with pp-value cutoff 0.05, under the null hypothesis that the MSE of interest is equal to the best MSE for the same d,q,nd,q,n.

Table 3: MSE and SD (both multiplied by 10210^{2}) of the KME. For each combination of d,q,nd,q,n, the smallest MSE is shown in bold. Shown in gray are those which were worse than the smallest (Welch test, p=0.05p=0.05).
dd qq Kernel AMSE ratio n=100n=100 n=400n=400 n=1600n=1600 n=6400n=6400 n=25600n=25600 n=102400n=102400 MSE ratio
1 2 K1,2BK_{1,2}^{B} [1.0000] 4.2571±0.1977\mathbf{4.2571\pm 0.1977} 1.6822±0.0737\mathbf{1.6822\pm 0.0737} 0.5799±0.0259\mathbf{0.5799\pm 0.0259} 0.2574±0.0110\mathbf{0.2574\pm 0.0110} 0.1186±0.0052\mathbf{0.1186\pm 0.0052} 0.0489±0.0021\mathbf{0.0489\pm 0.0021} [1.0000]
K1,2EK_{1,2}^{E} [1.0883] 4.3761±0.19454.3761\pm 0.1945 1.7085±0.07071.7085\pm 0.0707 0.6470±0.02850.6470\pm 0.0285 0.2772±0.01180.2772\pm 0.0118 0.1274±0.00540.1274\pm 0.0054 0.0528±0.00220.0528\pm 0.0022 [1.0805]
K1,2GK_{1,2}^{G} [1.1198] 4.9153±0.24144.9153\pm 0.2414 1.9430±0.08701.9430\pm 0.0870 0.6425±0.02820.6425\pm 0.0282 0.2918±0.01250.2918\pm 0.0125 0.1349±0.00580.1349\pm 0.0058 0.0553±0.00240.0553\pm 0.0024 [1.1307]
K1,2LK_{1,2}^{L} [2.8133] 10.8033±0.495710.8033\pm 0.4957 4.2865±0.20014.2865\pm 0.2001 1.4643±0.06061.4643\pm 0.0606 0.6984±0.03050.6984\pm 0.0305 0.2935±0.01260.2935\pm 0.0126 0.1234±0.00560.1234\pm 0.0056 [2.5243]
4 K1,4BK_{1,4}^{B} [1.0000] 5.6845±0.29685.6845\pm 0.2968 1.9812±0.08771.9812\pm 0.0877 0.5978±0.02720.5978\pm 0.0272 0.2220±0.00990.2220\pm 0.0099 0.0876±0.0041\mathbf{0.0876\pm 0.0041} 0.0294±0.0014\mathbf{0.0294\pm 0.0014} [1.0000]
K1,4EK_{1,4}^{E} [1.0737] 5.4032±0.2695\mathbf{5.4032\pm 0.2695} 1.9559±0.0869\mathbf{1.9559\pm 0.0869} 0.5976±0.0273\mathbf{0.5976\pm 0.0273} 0.2217±0.0100\mathbf{0.2217\pm 0.0100} 0.0882±0.00410.0882\pm 0.0041 0.0304±0.00140.0304\pm 0.0014 [1.0334]
K1,4GK_{1,4}^{G} [1.1935] 7.5127±0.39167.5127\pm 0.3916 2.5871±0.11562.5871\pm 0.1156 0.7530±0.03300.7530\pm 0.0330 0.2829±0.01250.2829\pm 0.0125 0.1089±0.00510.1089\pm 0.0051 0.0374±0.00170.0374\pm 0.0017 [1.2712]
K1,4LK_{1,4}^{L} [4.3743] 14.0314±0.599914.0314\pm 0.5999 6.2487±0.27416.2487\pm 0.2741 2.2248±0.09442.2248\pm 0.0944 0.9542±0.04280.9542\pm 0.0428 0.3592±0.01570.3592\pm 0.0157 0.1323±0.00590.1323\pm 0.0059 [4.5026]
6 K1,6BK_{1,6}^{B} [1.0000] 7.8852±0.41997.8852\pm 0.4199 2.5326±0.11122.5326\pm 0.1112 0.7125±0.03200.7125\pm 0.0320 0.2374±0.01050.2374\pm 0.0105 0.0850±0.00410.0850\pm 0.0041 0.0254±0.00120.0254\pm 0.0012 [1.0000]
K1,6EK_{1,6}^{E} [1.0602] 7.4116±0.3902\mathbf{7.4116\pm 0.3902} 2.3941±0.1070\mathbf{2.3941\pm 0.1070} 0.6891±0.0320\mathbf{0.6891\pm 0.0320} 0.2296±0.0103\mathbf{0.2296\pm 0.0103} 0.0843±0.0041\mathbf{0.0843\pm 0.0041} 0.0251±0.0012\mathbf{0.0251\pm 0.0012} [0.9882]
K1,6GK_{1,6}^{G} [1.2453] 10.5674±0.530610.5674\pm 0.5306 3.5329±0.15803.5329\pm 0.1580 0.9670±0.04180.9670\pm 0.0418 0.3282±0.01430.3282\pm 0.0143 0.1128±0.00530.1128\pm 0.0053 0.0351±0.00160.0351\pm 0.0016 [1.3804]
K1,6LK_{1,6}^{L} [5.6609] 16.9073±0.694616.9073\pm 0.6946 8.1200±0.35068.1200\pm 0.3506 2.9961±0.12842.9961\pm 0.1284 1.3032±0.05801.3032\pm 0.0580 0.4784±0.02130.4784\pm 0.0213 0.1700±0.00780.1700\pm 0.0078 [6.6836]
2 2 RK K2,2BK_{2,2}^{B} [1.0000] 11.4818±0.4159\mathbf{11.4818\pm 0.4159} 4.6054±0.1672\mathbf{4.6054\pm 0.1672} 1.8538±0.0571\mathbf{1.8538\pm 0.0571} 0.7933±0.0244\mathbf{0.7933\pm 0.0244} 0.3510±0.0102\mathbf{0.3510\pm 0.0102} 0.1669±0.0050\mathbf{0.1669\pm 0.0050} [1.0000]
RK K2,2EK_{2,2}^{E} [1.0887] 12.0854±0.432012.0854\pm 0.4320 4.8839±0.17104.8839\pm 0.1710 2.0408±0.06132.0408\pm 0.0613 0.8817±0.02670.8817\pm 0.0267 0.3842±0.01100.3842\pm 0.0110 0.1824±0.00570.1824\pm 0.0057 [1.0930]
RK K2,2GK_{2,2}^{G} [1.1547] 14.2550±0.518714.2550\pm 0.5187 5.4297±0.19255.4297\pm 0.1925 2.1603±0.07002.1603\pm 0.0700 0.9142±0.03010.9142\pm 0.0301 0.4018±0.01210.4018\pm 0.0121 0.1929±0.00560.1929\pm 0.0056 [1.1557]
RK K2,2LK_{2,2}^{L} [2.4495] 28.7910±0.937128.7910\pm 0.9371 11.9038±0.393111.9038\pm 0.3931 4.8348±0.16224.8348\pm 0.1622 2.0335±0.06912.0335\pm 0.0691 0.8559±0.02710.8559\pm 0.0271 0.4139±0.01330.4139\pm 0.0133 [2.4801]
PK of K1,2BK_{1,2}^{B} [1.0231] 11.9216±0.428311.9216\pm 0.4283 4.7388±0.17004.7388\pm 0.1700 1.8922±0.05881.8922\pm 0.0588 0.8127±0.02540.8127\pm 0.0254 0.3579±0.01050.3579\pm 0.0105 0.1713±0.00510.1713\pm 0.0051 [1.0265]
PK of K1,2EK_{1,2}^{E} [1.0984] 12.0406±0.424712.0406\pm 0.4247 4.9762±0.17444.9762\pm 0.1744 1.9840±0.06001.9840\pm 0.0600 0.8672±0.02660.8672\pm 0.0266 0.3893±0.01110.3893\pm 0.0111 0.1826±0.00570.1826\pm 0.0057 [1.0945]
PK of K1,2LK_{1,2}^{L} [2.8944] 29.0720±0.947329.0720\pm 0.9473 11.9243±0.391411.9243\pm 0.3914 4.9221±0.16074.9221\pm 0.1607 2.2091±0.07372.2091\pm 0.0737 0.9401±0.03110.9401\pm 0.0311 0.4626±0.01540.4626\pm 0.0154 [2.7722]
4 RK K2,4BK_{2,4}^{B} [1.0000] 16.4534±0.698516.4534\pm 0.6985 5.4845±0.20565.4845\pm 0.2056 1.8539±0.05971.8539\pm 0.0597 0.6997±0.02260.6997\pm 0.0226 0.2671±0.00830.2671\pm 0.0083 0.1065±0.0035\mathbf{0.1065\pm 0.0035} [1.0000]
RK K2,4EK_{2,4}^{E} [1.0817] 15.1225±0.5873\mathbf{15.1225\pm 0.5873} 5.3820±0.2064\mathbf{5.3820\pm 0.2064} 1.8173±0.0575\mathbf{1.8173\pm 0.0575} 0.6958±0.02250.6958\pm 0.0225 0.2774±0.00860.2774\pm 0.0086 0.1089±0.00360.1089\pm 0.0036 [1.0218]
RK K2,4GK_{2,4}^{G} [1.2552] 22.5890±0.876422.5890\pm 0.8764 7.5474±0.27017.5474\pm 0.2701 2.5624±0.08662.5624\pm 0.0866 0.9378±0.03170.9378\pm 0.0317 0.3452±0.01100.3452\pm 0.0110 0.1380±0.00440.1380\pm 0.0044 [1.2955]
RK K2,4LK_{2,4}^{L} [3.9734] 39.0583±1.195039.0583\pm 1.1950 18.1752±0.587918.1752\pm 0.5879 7.3580±0.23407.3580\pm 0.2340 3.0711±0.10273.0711\pm 0.1027 1.1322±0.03561.1322\pm 0.0356 0.4634±0.01520.4634\pm 0.0152 [4.3494]
PK of K1,4BK_{1,4}^{B} [0.9948] 16.6173±0.691516.6173\pm 0.6915 5.5669±0.20495.5669\pm 0.2049 1.8709±0.06161.8709\pm 0.0616 0.6935±0.02220.6935\pm 0.0222 0.2667±0.0084\mathbf{0.2667\pm 0.0084} 0.1086±0.00360.1086\pm 0.0036 [1.0191]
PK of K1,4EK_{1,4}^{E} [1.0579] 15.6356±0.644415.6356\pm 0.6444 5.4196±0.20365.4196\pm 0.2036 1.8217±0.05991.8217\pm 0.0599 0.6756±0.0215\mathbf{0.6756\pm 0.0215} 0.2711±0.00840.2711\pm 0.0084 0.1109±0.00370.1109\pm 0.0037 [1.0409]
PK of K1,4GK_{1,4}^{G} [1.2215] 22.0738±0.851022.0738\pm 0.8510 7.3806±0.26467.3806\pm 0.2646 2.4888±0.08452.4888\pm 0.0845 0.9088±0.03020.9088\pm 0.0302 0.3344±0.01060.3344\pm 0.0106 0.1350±0.00430.1350\pm 0.0043 [1.2666]
PK of K1,4LK_{1,4}^{L} [4.9457] 38.6564±1.177138.6564\pm 1.1771 18.4631±0.583118.4631\pm 0.5831 7.5841±0.23777.5841\pm 0.2377 3.2581±0.10533.2581\pm 0.1053 1.2406±0.03981.2406\pm 0.0398 0.5353±0.01750.5353\pm 0.0175 [5.0237]
6 RK K2,6BK_{2,6}^{B} [1.0000] 23.2642±0.916323.2642\pm 0.9163 7.2236±0.26477.2236\pm 0.2647 2.2068±0.07152.2068\pm 0.0715 0.7672±0.02480.7672\pm 0.0248 0.2628±0.00830.2628\pm 0.0083 0.0969±0.00330.0969\pm 0.0033 [1.0000]
RK K2,6EK_{2,6}^{E} [1.0700] 22.0099±0.9000\mathbf{22.0099\pm 0.9000} 6.8727±0.2570\mathbf{6.8727\pm 0.2570} 2.0995±0.0665\mathbf{2.0995\pm 0.0665} 0.7287±0.0235\mathbf{0.7287\pm 0.0235} 0.2569±0.0081\mathbf{0.2569\pm 0.0081} 0.0961±0.0033\mathbf{0.0961\pm 0.0033} [0.9913]
RK K2,6GK_{2,6}^{G} [1.3282] 31.5057±1.176131.5057\pm 1.1761 10.7053±0.381810.7053\pm 0.3818 3.4107±0.11703.4107\pm 0.1170 1.1480±0.03791.1480\pm 0.0379 0.3806±0.01210.3806\pm 0.0121 0.1362±0.00440.1362\pm 0.0044 [1.4056]
RK K2,6LK_{2,6}^{L} [5.4536] 45.7616±1.315745.7616\pm 1.3157 23.4046±0.716023.4046\pm 0.7160 10.3467±0.317010.3467\pm 0.3170 4.3718±0.13864.3718\pm 0.1386 1.6100±0.05011.6100\pm 0.0501 0.6141±0.02000.6141\pm 0.0200 [6.3369]
PK of K1,6BK_{1,6}^{B} [1.0419] 24.0760±0.936624.0760\pm 0.9366 7.5713±0.27427.5713\pm 0.2742 2.3077±0.07862.3077\pm 0.0786 0.7828±0.02430.7828\pm 0.0243 0.2714±0.00860.2714\pm 0.0086 0.1014±0.00340.1014\pm 0.0034 [1.0463]
PK of K1,6EK_{1,6}^{E} [1.0966] 22.6378±0.905922.6378\pm 0.9059 7.0804±0.25697.0804\pm 0.2569 2.1788±0.07422.1788\pm 0.0742 0.7332±0.02300.7332\pm 0.0230 0.2627±0.00820.2627\pm 0.0082 0.0991±0.00330.0991\pm 0.0033 [1.0226]
PK of K1,6GK_{1,6}^{G} [1.3577] 31.6913±1.135331.6913\pm 1.1353 10.8801±0.388010.8801\pm 0.3880 3.4745±0.12173.4745\pm 0.1217 1.1579±0.03731.1579\pm 0.0373 0.3831±0.01220.3831\pm 0.0122 0.1393±0.00460.1393\pm 0.0046 [1.4379]
PK of K1,6LK_{1,6}^{L} [7.3943] 46.4623±1.330546.4623\pm 1.3305 24.1137±0.727824.1137\pm 0.7278 11.0021±0.332611.0021\pm 0.3326 4.7613±0.14804.7613\pm 0.1480 1.8774±0.05911.8774\pm 0.0591 0.7473±0.02350.7473\pm 0.0235 [7.7113]
3 2 RK K3,2BK_{3,2}^{B} [1.0000] 18.7642±0.6040\mathbf{18.7642\pm 0.6040} 6.9539±0.1976\mathbf{6.9539\pm 0.1976} 2.9496±0.0771\mathbf{2.9496\pm 0.0771} 1.3446±0.0338\mathbf{1.3446\pm 0.0338} 0.6420±0.0161\mathbf{0.6420\pm 0.0161} 0.3099±0.0079\mathbf{0.3099\pm 0.0079} [1.0000]
RK K3,2EK_{3,2}^{E} [1.0864] 19.3876±0.604119.3876\pm 0.6041 7.4858±0.21497.4858\pm 0.2149 3.1403±0.08103.1403\pm 0.0810 1.4470±0.03581.4470\pm 0.0358 0.6943±0.01740.6943\pm 0.0174 0.3372±0.00830.3372\pm 0.0083 [1.0879]
RK K3,2GK_{3,2}^{G} [1.1864] 23.9051±0.752923.9051\pm 0.7529 8.7213±0.27738.7213\pm 0.2773 3.5969±0.09683.5969\pm 0.0968 1.6082±0.04041.6082\pm 0.0404 0.7779±0.01990.7779\pm 0.0199 0.3723±0.00950.3723\pm 0.0095 [1.2013]
RK K3,2LK_{3,2}^{L} [2.3660] 53.3562±1.392853.3562\pm 1.3928 21.1493±0.615121.1493\pm 0.6151 8.8721±0.25168.8721\pm 0.2516 3.6972±0.09553.6972\pm 0.0955 1.7077±0.04611.7077\pm 0.0461 0.8118±0.02080.8118\pm 0.0208 [2.6195]
PK of K1,2BK_{1,2}^{B} [1.0448] 19.7222±0.618619.7222\pm 0.6186 7.3303±0.21247.3303\pm 0.2124 3.1041±0.08133.1041\pm 0.0813 1.4081±0.03511.4081\pm 0.0351 0.6781±0.01710.6781\pm 0.0171 0.3264±0.00830.3264\pm 0.0083 [1.0531]
PK of K1,2EK_{1,2}^{E} [1.1098] 19.8647±0.645619.8647\pm 0.6456 7.5905±0.20967.5905\pm 0.2096 3.2686±0.08693.2686\pm 0.0869 1.4949±0.03671.4949\pm 0.0367 0.7059±0.01790.7059\pm 0.0179 0.3438±0.00860.3438\pm 0.0086 [1.1095]
PK of K1,2LK_{1,2}^{L} [2.9686] 52.7602±1.418452.7602\pm 1.4184 21.8585±0.637421.8585\pm 0.6374 9.2388±0.24869.2388\pm 0.2486 4.1435±0.10624.1435\pm 0.1062 1.9666±0.05031.9666\pm 0.0503 0.9162±0.02280.9162\pm 0.0228 [2.9563]
4 RK K3,4BK_{3,4}^{B} [1.0000] 25.3986±0.889425.3986\pm 0.8894 8.4013±0.27478.4013\pm 0.2747 3.0604±0.08413.0604\pm 0.0841 1.2210±0.03181.2210\pm 0.0318 0.5073±0.01340.5073\pm 0.0134 0.2093±0.00570.2093\pm 0.0057 [1.0000]
RK K3,4EK_{3,4}^{E} [1.0861] 23.3127±0.8190\mathbf{23.3127\pm 0.8190} 7.8901±0.2350\mathbf{7.8901\pm 0.2350} 2.9518±0.0817\mathbf{2.9518\pm 0.0817} 1.1936±0.03201.1936\pm 0.0320 0.5003±0.01330.5003\pm 0.0133 0.2084±0.00550.2084\pm 0.0055 [0.9960]
RK K3,4GK_{3,4}^{G} [1.3143] 37.2782±1.174837.2782\pm 1.1748 12.9172±0.430312.9172\pm 0.4303 4.5679±0.12404.5679\pm 0.1240 1.7626±0.04541.7626\pm 0.0454 0.7301±0.01940.7301\pm 0.0194 0.2955±0.00790.2955\pm 0.0079 [1.4119]
RK K3,4LK_{3,4}^{L} [3.9987] 70.6824±1.695570.6824\pm 1.6955 32.7709±0.858232.7709\pm 0.8582 14.6790±0.385114.6790\pm 0.3851 6.0564±0.15326.0564\pm 0.1532 2.4945±0.06552.4945\pm 0.0655 0.9916±0.02520.9916\pm 0.0252 [4.7381]
PK of K1,4BK_{1,4}^{B} [0.9765] 26.0719±0.897826.0719\pm 0.8978 8.4814±0.26978.4814\pm 0.2697 3.1138±0.08533.1138\pm 0.0853 1.2346±0.03181.2346\pm 0.0318 0.5133±0.01360.5133\pm 0.0136 0.2071±0.00560.2071\pm 0.0056 [0.9896]
PK of K1,4EK_{1,4}^{E} [1.0301] 24.4835±0.913724.4835\pm 0.9137 7.9345±0.23947.9345\pm 0.2394 2.9769±0.08232.9769\pm 0.0823 1.1889±0.0309\mathbf{1.1889\pm 0.0309} 0.4941±0.0134\mathbf{0.4941\pm 0.0134} 0.2045±0.0053\mathbf{0.2045\pm 0.0053} [0.9774]
PK of K1,4GK_{1,4}^{G} [1.2284] 35.2748±1.131535.2748\pm 1.1315 11.9953±0.396411.9953\pm 0.3964 4.2515±0.11474.2515\pm 0.1147 1.6530±0.04241.6530\pm 0.0424 0.6856±0.01820.6856\pm 0.0182 0.2761±0.00740.2761\pm 0.0074 [1.3195]
PK of K1,4LK_{1,4}^{L} [5.4110] 70.0372±1.694470.0372\pm 1.6944 33.6666±0.883333.6666\pm 0.8833 15.2810±0.409515.2810\pm 0.4095 6.5560±0.16256.5560\pm 0.1625 2.8585±0.07612.8585\pm 0.0761 1.1353±0.02771.1353\pm 0.0277 [5.4249]
6 RK K3,6BK_{3,6}^{B} [1.0000] 36.4994±1.221236.4994\pm 1.2212 11.6394±0.402911.6394\pm 0.4029 3.8464±0.10503.8464\pm 0.1050 1.4006±0.03581.4006\pm 0.0358 0.5244±0.01370.5244\pm 0.0137 0.1936±0.00510.1936\pm 0.0051 [1.0000]
RK K3,6EK_{3,6}^{E} [1.0769] 33.6589±1.1940\mathbf{33.6589\pm 1.1940} 10.6750±0.3579\mathbf{10.6750\pm 0.3579} 3.5802±0.1002\mathbf{3.5802\pm 0.1002} 1.3213±0.0336\mathbf{1.3213\pm 0.0336} 0.4910±0.0134\mathbf{0.4910\pm 0.0134} 0.1844±0.0048\mathbf{0.1844\pm 0.0048} [0.9527]
RK K3,6GK_{3,6}^{G} [1.4103] 51.3040±1.486251.3040\pm 1.4862 19.3169±0.606019.3169\pm 0.6060 6.5221±0.17996.5221\pm 0.1799 2.3097±0.06032.3097\pm 0.0603 0.8633±0.02270.8633\pm 0.0227 0.3148±0.00820.3148\pm 0.0082 [1.6260]
RK K3,6LK_{3,6}^{L} [5.7653] 81.7929±1.876781.7929\pm 1.8767 42.6492±1.029342.6492\pm 1.0293 20.4169±0.513120.4169\pm 0.5131 9.4432±0.23059.4432\pm 0.2305 3.8039±0.09473.8039\pm 0.0947 1.3986±0.03391.3986\pm 0.0339 [7.2246]
PK of K1,6BK_{1,6}^{B} [1.0606] 37.5986±1.221837.5986\pm 1.2218 12.3234±0.399812.3234\pm 0.3998 4.1032±0.11294.1032\pm 0.1129 1.5007±0.03841.5007\pm 0.0384 0.5629±0.01460.5629\pm 0.0146 0.2027±0.00540.2027\pm 0.0054 [1.0471]
PK of K1,6EK_{1,6}^{E} [1.1092] 34.8445±1.187234.8445\pm 1.1872 11.1684±0.359111.1684\pm 0.3591 3.7513±0.10523.7513\pm 0.1052 1.3843±0.03551.3843\pm 0.0355 0.5196±0.01360.5196\pm 0.0136 0.1976±0.00480.1976\pm 0.0048 [1.0206]
PK of K1,6GK_{1,6}^{G} [1.4384] 51.3070±1.519251.3070\pm 1.5192 18.7785±0.568218.7785\pm 0.5682 6.5203±0.17896.5203\pm 0.1789 2.3292±0.06052.3292\pm 0.0605 0.8699±0.02270.8699\pm 0.0227 0.3156±0.00830.3156\pm 0.0083 [1.6303]
PK of K1,6LK_{1,6}^{L} [9.1885] 84.8374±1.926684.8374\pm 1.9266 45.2915±1.106645.2915\pm 1.1066 22.4703±0.550822.4703\pm 0.5508 10.7301±0.252510.7301\pm 0.2525 4.6637±0.11614.6637\pm 0.1161 1.7211±0.03921.7211\pm 0.0392 [8.8902]

For q=2q=2, the MSE ratios were close to the AMSE ratios, which suggests that the AMSE criterion serves as a good indicator of real performance. Also, the optimal Biweight kernel Kd,2BK_{d,2}^{B} achieved the best estimation result for every dd and nn, as expected from the asymptotic theory. Although the differences between the MSEs for the Biweight kernel and those for other truncated kernels (the RK Kd,2EK_{d,2}^{E} and PKs of K1,2BK_{1,2}^{B} and K1,2EK_{1,2}^{E}) were not significant, especially for smaller nn, the differences between them and those for non-truncated kernels (the RKs Kd,2GK_{d,2}^{G} and Kd,2LK_{d,2}^{L} and PK of K1,2LK_{1,2}^{L}) were statistically significant at the 0.05 level for most combinations of dd and nn.

For the higher orders q=4,6q=4,6, truncated kernels performed well, whereas non-truncated ones gave significantly larger MSEs; this tendency is the same as in the case q=2q=2. However, the experimental results for the higher-order kernels exhibited deviations from the asymptotic theory: except for the case (d,q)=(1,4)(d,q)=(1,4), even with the largest sample size n=102400n=102400 investigated, the smallest MSE values were achieved by kernels other than the one with the minimum AMSE (K1,6EK_{1,6}^{E} for q=6q=6 when d=1d=1, and PK of K1,4BK_{1,4}^{B} for q=4q=4 and RK Kd,6BK_{d,6}^{B} for q=6q=6 when d=2,3d=2,3). Such deviations can be ascribed to the residual terms of the AMSE ignored by the asymptotics, as described at the beginning of this section. The ratio of the next-leading-order term to the leading-order term of the AMSE for the considered RKs and PKs is O(n4d+2q+2)O(n^{-\frac{4}{d+2q+2}}), which becomes larger (for a given nn) as dd and/or qq increase. Therefore, for the residual terms to be negligible, one needs more sample points for larger dd and/or qq. In the cases q=4,6q=4,6, even n=102400n=102400 might not have been large enough for the small differences, less than about 10 %, in the AMSE criterion to be reflected in the MSE, for the PDFs used. However, considering that the kernels which performed worse than the best-performing ones were still within about 10 % of the theoretical optimum in the AMSE criterion, and that their experimental differences were not significant, we can conclude that even though the AMSE criterion may not be a quantitative performance index for higher-order kernels, it is still suggestive of real performance.
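For completeness, a drastically simplified univariate (d=1, q=2) version of this experiment is sketched below. The Gaussian-kernel KME is computed by maximizing the KDE on a grid rather than by mean shift, and the bandwidth rule hn = n^{-1/5} (constant 1) is a stand-in assumption rather than the exactly AMSE-optimized bandwidth (10), so the absolute MSE values differ from those in Table 3; only the qualitative decay of the MSE with n is illustrated.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 0.2395                      # mode of 0.5 N(0,1) + 0.5 N(1,2) (cf. the text above)

def sample(n):
    comp = rng.random(n) < 0.5
    return np.where(comp, rng.normal(0.0, 1.0, n), rng.normal(1.0, np.sqrt(2.0), n))

def kme(x, h):
    """Gaussian-kernel KDE maximized on a fine grid (a crude stand-in for mean shift)."""
    grid = np.linspace(x.min(), x.max(), 2000)
    kde = np.exp(-0.5 * ((grid[:, None] - x[None, :]) / h) ** 2).sum(axis=1)
    return grid[np.argmax(kde)]          # the normalization constant does not affect the argmax

for n in (100, 400, 1600):
    sq_errs = []
    for _ in range(200):                 # 1000 trials in the paper; fewer here for speed
        x = sample(n)
        h = 1.0 * n ** (-1 / 5)          # hypothetical bandwidth constant, d = 1, q = 2
        sq_errs.append((kme(x, h) - theta_true) ** 2)
    print(f"n = {n:5d}: MSE ~ {np.mean(sq_errs):.4f}")
```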

5 Optimal Kernel for Other Methods

5.1 In-Sample Mode Estimation

A mode estimator considered in (Abraham et al., 2004), which we call the in-sample mode estimator (ISME), is defined as the sample point at which the KDE (3) takes the largest value among the sample points:

𝜽¯n:-argmax𝒙{𝑿i}i=1nfn(𝒙).\displaystyle\bar{{\bm{\theta}}}_{n}\coloneq\mathop{\rm arg\,max}\limits_{{\bm{x}}\in\{{\bm{X}}_{i}\}_{i=1}^{n}}f_{n}({\bm{x}}). (58)

The ISME can be evaluated efficiently with the quick-shift (QS) algorithm (Vedaldi and Soatto, 2008). Although the QS algorithm has an extra tuning parameter in addition to KK and hnh_{n}, which may affect the quality of the estimator, it has the advantage that it converges in a finite number of iterations, irrespective of the kernel used, as long as the sample size is finite, which makes it computationally efficient.
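A minimal univariate sketch of the ISME (58) is given below; it simply evaluates the KDE at every sample point and returns the maximizing point (the O(n²) brute-force counterpart of the QS algorithm), with a Gaussian kernel and an illustrative bandwidth of our own choosing.

```python
import numpy as np

def isme(x, h):
    """In-sample mode estimate (58) for a 1-d sample x, Gaussian kernel, bandwidth h."""
    z = (x[:, None] - x[None, :]) / h
    fn = np.exp(-0.5 * z ** 2).sum(axis=1) / (len(x) * h * np.sqrt(2 * np.pi))  # f_n(X_i)
    return x[np.argmax(fn)]

rng = np.random.default_rng(1)
n = 1000
comp = rng.random(n) < 0.5
x = np.where(comp, rng.normal(0.0, 1.0, n), rng.normal(1.0, np.sqrt(2.0), n))
print("ISME:", isme(x, h=n ** (-1 / 5)))   # typically lands near the mode (about 0.24) of the mixture
```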

Abraham et al. (2004) have proved in their Corollary 1.1 that, in the large-sample asymptotics, the ISME 𝜽¯n\bar{{\bm{\theta}}}_{n} converges in distribution to the asymptotic distribution of the KME 𝜽n{\bm{\theta}}_{n} if the bandwidth hnh_{n} satisfies nd22hnd(d+2)2lnn0n^{\frac{d-2}{2}}h_{n}^{\frac{d(d+2)}{2}}\ln n\to 0 as nn\to\infty. A simple calculation shows that, if we use the optimal bandwidth hd,q,noptn1d+2q+2h_{d,q,n}^{\mathrm{opt}}\propto n^{-\frac{1}{d+2q+2}} in (10), this requirement is fulfilled for (d,q)(d,q) satisfying (d2)q<d+2(d-2)q<d+2, that is, for any qq when d=1,2d=1,2; 2q42\leq q\leq 4 when d=3d=3; and q=2q=2 when d=4,5d=4,5. Accordingly, at least for these cases, the ISME has the same AMSE and the same AMSE criterion as the KME, so that the results on the optimal kernel still hold for the ISME as well: in particular, Kd,qBK_{d,q}^{B} minimizes the AMSE criterion of the ISME among the minimum-sign-change RKs for the above-mentioned combinations of dd and qq.
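The enumeration of the admissible (d,q) pairs above follows from the inequality (d−2)q &lt; d+2 alone and can be checked mechanically:

```python
# Pairs (d, q), q even, for which (d - 2) q < d + 2 holds, reproducing the list in the text.
pairs = [(d, q) for d in range(1, 7) for q in (2, 4, 6, 8) if (d - 2) * q < d + 2]
print(pairs)   # every listed even q for d = 1, 2; q = 2, 4 for d = 3; q = 2 for d = 4, 5; none for d = 6
```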

5.2 Modal Linear Regression

Modal linear regression (MLR) (Yao and Li, 2014) aims to obtain a conditional mode predictor as a linear model. In addition to the intrinsic good properties rooted in the conditional mode, the MLR has the advantage that the resulting parameter and regression-line estimates are consistent even for a heteroscedastic or asymmetric conditional PDF, in contrast to robust M-type estimators, which are not consistent in these cases (Baldauf and Santos Silva, 2012).

Let (𝑿,𝒀)({\bm{X}},{\bm{Y}}) be a pair of random variables in 𝒳×𝒴dX×dY{\mathcal{X}}\times{\mathcal{Y}}\subset\mathbb{R}^{d_{\scalebox{0.5}{X}}}\times\mathbb{R}^{d_{\scalebox{0.5}{Y}}} following a certain joint distribution. MLR assumes that the conditional distribution of 𝒀{\bm{Y}} conditioned on 𝑿=𝒙{\bm{X}}={\bm{x}} is such that the conditional PDF fY—X(|𝒙)f_{\scalebox{0.5}{Y|X}}(\cdot|{\bm{x}}) satisfies arg max𝒚𝒴fY—X(𝒚|𝒙)=Θ𝒙\text{arg~{}max}_{{\bm{y}}\in{\mathcal{Y}}}f_{\scalebox{0.5}{Y|X}}({\bm{y}}|{\bm{x}})=\Theta{\bm{x}} for any 𝒙𝒳{\bm{x}}\in{\mathcal{X}}, where ΘdY×dX\Theta\in\mathbb{R}^{d_{\scalebox{0.5}{Y}}\times d_{\scalebox{0.5}{X}}} is an underlying MLR parameter. For given i.i.d. samples {(𝑿i,𝒀i)𝒳×𝒴}i=1n\{({\bm{X}}_{i},{\bm{Y}}_{i})\in{\mathcal{X}}\times{\mathcal{Y}}\}_{i=1}^{n} of (𝑿,𝒀)({\bm{X}},{\bm{Y}}), the MLR adopts, as the estimator of the parameter Θ\Theta, the global maximizer Θn\Theta_{n} of the kernel-based objective function with argument ΩdY×dX\Omega\in\mathbb{R}^{d_{\scalebox{0.5}{Y}}\times d_{\scalebox{0.5}{X}}},

On(Ω):-1nhndYi=1nK(Ω𝑿i𝒀ihn),\displaystyle O_{n}(\Omega)\coloneq\frac{1}{nh_{n}^{d_{\scalebox{0.5}{Y}}}}\sum_{i=1}^{n}K\biggl{(}\frac{\Omega{\bm{X}}_{i}-{\bm{Y}}_{i}}{h_{n}}\biggr{)}, (59)

with the kernel KK defined on dY\mathbb{R}^{d_{\scalebox{0.5}{Y}}} and the bandwidth hn>0h_{n}>0.
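A minimal sketch of the MLR objective (59) for dY=1 is given below: it maximizes On(Ω) with a Gaussian kernel, starting from the least-squares solution. The data-generating model (a line plus right-skewed mixture noise whose mode is approximately zero) and the bandwidth are illustrative assumptions of ours, not taken from the paper.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, h = 500, 0.3
X = np.column_stack([np.ones(n), rng.uniform(-1.0, 1.0, n)])            # intercept + one covariate
theta = np.array([1.0, 2.0])                                            # conditional-mode line
comp = rng.random(n) < 0.7
eps = np.where(comp, rng.normal(0.0, 0.2, n), rng.normal(1.5, 1.0, n))  # skewed noise, mode ~ 0
Y = X @ theta + eps

def neg_obj(omega):
    """Negative of the kernel objective (59) with a Gaussian kernel and d_Y = 1."""
    r = (X @ omega - Y) / h
    return -np.mean(np.exp(-0.5 * r ** 2)) / (h * np.sqrt(2.0 * np.pi))

omega_ols = np.linalg.lstsq(X, Y, rcond=None)[0]        # mean regression, intercept biased upward
omega_mlr = minimize(neg_obj, omega_ols, method="Nelder-Mead").x
print("OLS :", np.round(omega_ols, 3))
print("MLR :", np.round(omega_mlr, 3))                  # close to theta = (1, 2)
```

In this toy setup the noise mean is positive while its mode is near zero, so the least-squares intercept is biased upward, whereas maximizing (59) recovers the conditional-mode line.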

In (Kemp et al., 2020, Theorem 3), the asymptotic normality of the MLR has been proved. They have considered the case where a 2nd order kernel is used and the AVC is relatively dominant compared with the AB by assuming hn4(nhn)(dY+2)h_{n}^{4}\ll(nh_{n})^{-(d_{\scalebox{0.5}{Y}}+2)}. We provide an extension of their theorem which allows the use of a higher-order kernel and provides an explicit expression of the AB. Note that the vectorization operator vec()\operatorname{vec}(\cdot) is defined as the nmnm-vector vec(M):-(𝒎1,,𝒎n)\operatorname{vec}(\mathrm{M})\coloneq(\bm{m}_{1}^{\top},\ldots,\bm{m}_{n}^{\top})^{\top} for an m×nm\times n matrix M=(𝒎1,,𝒎n)\mathrm{M}=(\bm{m}_{1},\ldots,\bm{m}_{n}), and that we adopt the definition that the operator \nabla applied to a function with a matrix argument is the nabla operator with respect to the vectorization of the argument. The same definition applies to the Hessian operator HH as well.

Theorem 4.

Assume Assumption E.1 for an even integer q2q\geq 2. Then, the vectorization vec(Θn)\operatorname{vec}(\Theta_{n}) of the MLR parameter estimator Θn\Theta_{n} asymptotically follows a normal distribution with the following AB and AVC:

E[vec(Θn)vec(Θ)]hnqA𝒃,Cov[vec(Θn)]1nhndY+2AVA,\displaystyle\mathrm{E}[\operatorname{vec}(\Theta_{n})-\operatorname{vec}(\Theta)]\approx-h_{n}^{q}\mathrm{A}{\bm{b}},\quad\mathrm{Cov}[\operatorname{vec}(\Theta_{n})]\approx\frac{1}{nh_{n}^{d_{\scalebox{0.5}{Y}}+2}}\mathrm{A}\mathrm{V}\mathrm{A}, (60)

where the abbreviations A\mathrm{A}, 𝐛{\bm{b}}, and V\mathrm{V} are defined by

A={E[(𝑿IdY)HfY—X(Θ𝑿|𝑿)(𝑿IdY)]}1,𝒃=E[𝑿𝑻(Θ𝑿;𝑿,fX,Y,K)],V=E[fY—X(Θ𝑿|𝑿)(𝑿IdY)𝒱dY(𝑿IdY)].\displaystyle\begin{split}\mathrm{A}&=\bigl{\{}\mathrm{E}[({\bm{X}}\otimes\mathrm{I}_{d_{\scalebox{0.5}{Y}}})Hf_{\scalebox{0.5}{Y|X}}(\Theta{\bm{X}}|{\bm{X}})({\bm{X}}\otimes\mathrm{I}_{d_{\scalebox{0.5}{Y}}})^{\top}]\bigr{\}}^{-1},\\ {\bm{b}}&=\mathrm{E}[{\bm{X}}\otimes\bm{T}(\Theta{\bm{X}};{\bm{X}},f_{\scalebox{0.5}{X,Y}},K)],\\ \mathrm{V}&=\mathrm{E}[f_{\scalebox{0.5}{Y|X}}(\Theta{\bm{X}}|{\bm{X}})({\bm{X}}\otimes\mathrm{I}_{d_{\scalebox{0.5}{Y}}}){\mathcal{V}}_{d_{\scalebox{0.5}{Y}}}({\bm{X}}\otimes\mathrm{I}_{d_{\scalebox{0.5}{Y}}})^{\top}].\end{split} (61)

In the above, 𝐓(𝐲;𝐱,fX,Y,K)\bm{T}({\bm{y}};{\bm{x}},f_{\scalebox{0.5}{X,Y}},K) is defined as

𝑻(𝒚;𝒙,fX,Y,K):-𝒊0dY:|𝒊|=q1𝒊!𝒊fY—X(𝒚|𝒙)dY,𝒊(K),\displaystyle\bm{T}({\bm{y}};{\bm{x}},f_{\scalebox{0.5}{X,Y}},K)\coloneq\sum_{{\bm{i}}\in\mathbb{Z}_{\geq 0}^{d_{\scalebox{0.5}{Y}}}:|{\bm{i}}|=q}\frac{1}{{\bm{i}}!}\cdot\nabla\partial^{\bm{i}}f_{\scalebox{0.5}{Y|X}}({\bm{y}}|{\bm{x}})\cdot{\mathcal{B}}_{d_{\scalebox{0.5}{Y}},{\bm{i}}}(K), (62)

and 𝒱dY{\mathcal{V}}_{d_{\scalebox{0.5}{Y}}} is the dY×dYd_{\scalebox{0.5}{Y}}\times d_{\scalebox{0.5}{Y}} matrix defined via (8). Moreover, the AMSE is obtained as

E[vec(Θn)vec(Θ)2]hn2qA𝒃2+1nhndY+2tr(AVA).\displaystyle\mathrm{E}[\|\operatorname{vec}(\Theta_{n})-\operatorname{vec}(\Theta)\|^{2}]\approx h_{n}^{2q}\|\mathrm{A}{\bm{b}}\|^{2}+\frac{1}{nh_{n}^{d_{\scalebox{0.5}{Y}}+2}}\operatorname{tr}(\mathrm{A}\mathrm{V}\mathrm{A}). (63)

By following the same line of argument as in Section 2, one can, for instance, optimize the bandwidth hnh_{n} in terms of the AMSE, and furthermore show that the AMSE criterion becomes (BdY,q2(dY+2)VdY,12q)1dY+2q+2(B_{d_{\scalebox{0.5}{Y}},q}^{2(d_{\scalebox{0.5}{Y}}+2)}\cdot V_{d_{\scalebox{0.5}{Y}},1}^{2q})^{\frac{1}{d_{\scalebox{0.5}{Y}}+2q+2}} if an RK is used, which implies that the kernel KdY,qBK_{d_{\scalebox{0.5}{Y}},q}^{B} is optimal. These results generalize those of (Yamasaki and Tanaka, 2019) for dY=1d_{\scalebox{0.5}{Y}}=1 and q=2q=2.

5.3 Mode Clustering

In mode clustering, a cluster is defined via the gradient flow 𝒙(t)=f(𝒙(t)){\bm{x}}^{\prime}(t)=\nabla f({\bm{x}}(t)) of the PDF ff regarded as a scalar field. A limiting point of the gradient flow defines the center of the cluster corresponding to its domain of attraction. For mean-shift-based mode clustering (Comaniciu and Meer, 2002; Yamasaki and Tanaka, 2020), the KDE is often plugged into ff. Although the clustering error for mode clustering using the KDE is difficult to evaluate in general dimensions, Casa et al. (2020) have analyzed the clustering error rate in the univariate case in detail.

Here we review their discussion in a fairly simplified setting: we consider the univariate case and assume that the PDF and KDE are bimodal, as depicted in Figure 4. In this setting, the true clusters C1,C2C_{1},C_{2} and estimated clusters Cn,1,Cn,2C_{n,1},C_{n,2} are

C1={x:x<ζ},C2={x:xζ},Cn,1={x:x<ζn},Cn,2={x:xζn},\displaystyle\begin{split}&C_{1}=\{x\in\mathbb{R}:x<\zeta\},\quad C_{2}=\{x\in\mathbb{R}:x\geq\zeta\},\\ &C_{n,1}=\{x\in\mathbb{R}:x<\zeta_{n}\},\quad C_{n,2}=\{x\in\mathbb{R}:x\geq\zeta_{n}\},\end{split} (64)

with the local minimizers ζ\zeta and ζn\zeta_{n} of the PDF and KDE, respectively, and the clustering error rate (CER) becomes

CERn=j=12Pr(XCj and XCn,j)={Pr(ζX<ζn),ζ<ζn,Pr(ζnX<ζ),ζn<ζ.\displaystyle\mathrm{CER}_{n}=\sum_{j=1}^{2}\Pr(X\in C_{j}\text{ and }X\not\in C_{n,j})=\begin{cases}\Pr(\zeta\leq X<\zeta_{n}),&\zeta<\zeta_{n},\\ \Pr(\zeta_{n}\leq X<\zeta),&\zeta_{n}<\zeta.\end{cases} (65)

In Figure 4, the red area represents the CER. The green area is |ζnζ|×f(ζ)|\zeta_{n}-\zeta|\times f(\zeta), and the blue area is approximated as |ζnζ|×f(2)(ζ)2|ζnζ|2|\zeta_{n}-\zeta|\times\frac{f^{(2)}(\zeta)}{2}|\zeta_{n}-\zeta|^{2}, which is negligible in the asymptotic situation. Therefore, the relationship 'green area < red area (CER) < green + blue areas' implies that the CER asymptotically equals the green area, |ζnζ|×f(ζ)|\zeta_{n}-\zeta|\times f(\zeta): the asymptotic mean CER (AMCER) reduces to

limnE[CERn]=f(ζ)E[|ζnζ|].\displaystyle\lim_{n\to\infty}\mathrm{E}[\mathrm{CER}_{n}]=f(\zeta)\mathrm{E}[|\zeta_{n}-\zeta|]. (66)

Since the asymptotic distribution of the local minimizer ζn\zeta_{n} has the same form as that of the KME of the mode, the AMCER behaves like the AL1L^{1}E for the mode discussed in Section 3.2.
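The relation (66) can be illustrated numerically in the simplified bimodal setting of Figure 4; in the sketch below, the mixture 0.5 N(0, 0.5²) + 0.5 N(2, 0.5²), the Gaussian kernel, and the bandwidth are all illustrative choices of ours.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
f = lambda x: 0.5 * norm.pdf(x, 0.0, 0.5) + 0.5 * norm.pdf(x, 2.0, 0.5)   # bimodal PDF
F = lambda x: 0.5 * norm.cdf(x, 0.0, 0.5) + 0.5 * norm.cdf(x, 2.0, 0.5)
zeta = 1.0                                    # valley (local minimizer) of the symmetric mixture

n, h = 1000, 0.25
comp = rng.random(n) < 0.5
x = np.where(comp, rng.normal(0.0, 0.5, n), rng.normal(2.0, 0.5, n))

grid = np.linspace(0.0, 2.0, 2001)            # search the valley between the two modes
kde = np.exp(-0.5 * ((grid[:, None] - x[None, :]) / h) ** 2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))
zeta_n = grid[np.argmin(kde)]

cer = abs(F(zeta_n) - F(zeta))                # exact clustering error rate (65), the red area
approx = f(zeta) * abs(zeta_n - zeta)         # the "green area" approximation behind (66)
print(f"zeta_n = {zeta_n:.3f}, CER = {cer:.5f}, f(zeta)|zeta_n - zeta| = {approx:.5f}")
```

The two printed quantities agree closely, reflecting that the blue area in Figure 4 is of higher order in |ζn−ζ|.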

\begin{overpic}[height=142.26378pt,bb=0 0 665 335]{./mo-cl.png}\put(3.0,42.0){\small(a)}\end{overpic}
\begin{overpic}[height=142.26378pt,bb=0 0 105 335]{./mo-cl2.png}\put(5.0,85.0){\small(b)}\end{overpic}
Figure 4: The black solid and dotted curves are the underlying PDF and KDE, respectively. The red solid and dotted lines are located at local minimizers (ζ\zeta and ζn\zeta_{n}) of the underlying PDF and KDE, respectively. In panel (a), the red area represents the clustering error rate. In panel (b), the green area is a lower bound of the error rate, and the summation of the green and blue areas is an upper bound.

Casa et al. (2020) have shown the nn-dependence of the optimal bandwidth minimizing the AMCER. We further show the kernel dependence of the optimal bandwidth and the optimal kernel: since the AMCER behaves like the AL1L^{1}E for the mode, the optimal bandwidth minimizing the AMCER becomes h1,q,1,nopt=c1(V1,1/(nB1,q2))12q+3h_{1,q,1,n}^{\mathrm{opt}}=c_{1}(V_{1,1}/(nB_{1,q}^{2}))^{\frac{1}{2q+3}}, c1>0c_{1}>0, and the resulting kernel optimization problem becomes equivalent to the type-1 kernel optimization problem for the KME with the AL1L^{1}E criterion as the objective function. Thus, the kernel K1,qBK_{1,q}^{B} minimizes the AMCER under the optimal bandwidth among the qq-th order minimum-sign-change kernels.

6 Conclusion

In this paper, we have studied kernel selection, particularly the optimal kernel, for kernel mode estimation, by extending the existing approach for the univariate setting by (Gasser and Müller, 1979; Gasser et al., 1985; Granovsky and Müller, 1989, 1991). This approach is based on asymptotics, and it seeks a kernel function that minimizes the main term of the AMSE under the optimal bandwidth among kernels (or their derivatives) that change their sign the minimum number of times required by their order. Theorems 2 and 3 are the main novel theoretical contribution; they show that the Biweight hierarchy {Gd,qB}\{G_{d,q}^{B}\} provides the optimal RK Kd,qB()=Gd,qB()K_{d,q}^{B}(\cdot)=G_{d,q}^{B}(\|\cdot\|) for every dimension d1d\geq 1, for the even orders q=2,4q=2,4 and for every even order q2q\geq 2, respectively. We have also studied the PK model and compared it with the RK model. As a result, we have found that the 2nd order optimal RK Kd,2BK_{d,2}^{B} is superior, in terms of the AMSE, to the PK model using any non-negative kernel. Simulation studies confirmed the superiority of the optimal Biweight kernel among the non-negative kernels, and showed that truncated kernels, including the optimal kernel Kd,qBK_{d,q}^{B}, gave better accuracy also for the higher orders q=4,6q=4,6.

The theory of the optimal kernel also applies to other kernel-based modal statistical methods: in Section 5, we have shown the optimal kernel for in-sample mode estimation, modal linear regression, and univariate mode clustering. Elucidating the optimal kernel for other potentially relevant methods, such as principal curve estimation or density ridge estimation (Ozertem and Erdogmus, 2011; Sando and Hino, 2020) and multivariate mode clustering, for which it is difficult to represent asymptotic errors explicitly, remains an open problem; studying them analytically or experimentally will also be interesting.

A Proof of Asymptotic Behaviors of Kernel Mode Estimation

Regularity conditions for Theorem 1, i.e., sufficient conditions for deriving the expressions (6), (9) of the AB, AVC, and AMSE along with the asymptotic normality of the KME, are listed below. They consist of the conditions on the sample {𝑿id}i=1n\{{\bm{X}}_{i}\in\mathbb{R}^{d}\}_{i=1}^{n} (A.1), mode 𝜽{\bm{\theta}} and PDF ff (A.2)–(A.6), kernel KK (A.6)–(A.14), and bandwidth hnh_{n} (A.15) and (A.16).

Assumption A.1 (Regularity conditions for Theorem 1).

For finite d1d\geq 1 and even q2q\geq 2,

  • (A.1)

    {𝑿id}i=1n\{{\bm{X}}_{i}\in\mathbb{R}^{d}\}_{i=1}^{n} is a sample of i.i.d. observations from ff.

  • (A.2)

    ff is (q+2)(q+2) times differentiable in d\mathbb{R}^{d} (i.e., ff and f𝒊f^{\bm{i}}, |𝒊|=2|{\bm{i}}|=2 are continuous at 𝜽{\bm{\theta}}).

  • (A.3)

    ff has a unique and isolated maximizer at 𝜽{\bm{\theta}} (i.e., f(𝒙)<f(𝜽)f({\bm{x}})<f({\bm{\theta}}) for all 𝒙𝜽{\bm{x}}\neq{\bm{\theta}}, f(𝜽)=𝟎d\nabla f({\bm{\theta}})=\bm{0}_{d} due to (A.2), and sup𝒙Nf(𝒙)<f(𝜽)\sup_{{\bm{x}}\not\in N}f({\bm{x}})<f({\bm{\theta}}) for a neighborhood NN of 𝜽{\bm{\theta}}).

  • (A.4)

    𝒊f\partial^{\bm{i}}f, |𝒊|=2|{\bm{i}}|=2 satisfies |𝒊f(𝒙)|𝑑𝒙<\int|\partial^{\bm{i}}f({\bm{x}})|\,d{\bm{x}}<\infty, and Hf(𝜽)Hf({\bm{\theta}}) is non-singular.

  • (A.5)

    |𝒊f(𝜽)|<|\partial^{\bm{i}}f({\bm{\theta}})|<\infty for all 𝒊{\bm{i}} s.t. |𝒊|=2,,q+1|{\bm{i}}|=2,\ldots,q+1, and 𝒊f\partial^{\bm{i}}f is bounded for all 𝒊{\bm{i}} s.t. |𝒊|=q+2|{\bm{i}}|=q+2.

  • (A.6)

    𝒊f(𝜽)\partial^{\bm{i}}f({\bm{\theta}}), |𝒊|=q+1|{\bm{i}}|=q+1 and KK satisfy 𝒃(𝜽;f,K)𝟎d{\bm{b}}({\bm{\theta}};f,K)\neq\bm{0}_{d}.

  • (A.7)

    KK is bounded and twice differentiable in d\mathbb{R}^{d} and satisfies the covering number condition, |K(𝒙)|𝑑𝒙<\int|K({\bm{x}})|\,d{\bm{x}}<\infty, and lim𝒙𝒙|K(𝒙)|=0\lim_{\|{\bm{x}}\|\to\infty}\|{\bm{x}}\||K({\bm{x}})|=0.

  • (A.8)

    d,𝟎d(K)=1{\mathcal{B}}_{d,\bm{0}_{d}}(K)=1.

  • (A.9)

    d,𝒊(K)=0{\mathcal{B}}_{d,{\bm{i}}}(K)=0 for all 𝒊{\bm{i}} s.t. |𝒊|=1,,q1|{\bm{i}}|=1,\ldots,q-1.

  • (A.10)

    |d,𝒊(K)|<|{\mathcal{B}}_{d,{\bm{i}}}(K)|<\infty for all 𝒊{\bm{i}} s.t. |𝒊|=q|{\bm{i}}|=q, and d,𝒊(K)0{\mathcal{B}}_{d,{\bm{i}}}(K)\neq 0 for some 𝒊{\bm{i}} s.t. |𝒊|=q|{\bm{i}}|=q.

  • (A.11)

    |d,𝒊(K)|<|{\mathcal{B}}_{d,{\bm{i}}}(K)|<\infty for all 𝒊{\bm{i}} s.t. |𝒊|=q+1|{\bm{i}}|=q+1.

  • (A.12)

    iK\partial_{i}K is bounded and satisfies |iK(𝒙)jK(𝒙)|𝑑𝒙<\int|\partial_{i}K({\bm{x}})\partial_{j}K({\bm{x}})|\,d{\bm{x}}<\infty, lim𝒙𝒙|iK(𝒙)jK(𝒙)|=0\lim_{\|{\bm{x}}\|\to\infty}\|{\bm{x}}\||\partial_{i}K({\bm{x}})\partial_{j}K({\bm{x}})|=0, and |iK(𝒙)|2+δ𝑑𝒙<\int|\partial_{i}K({\bm{x}})|^{2+\delta}\,d{\bm{x}}<\infty for some δ>0\delta>0, for all i,j=1,,di,j=1,\ldots,d.

  • (A.13)

    K(𝒙)K(𝒙)𝑑𝒙=𝒱d(K)\int\nabla K({\bm{x}})\nabla K({\bm{x}})^{\top}\,d{\bm{x}}={\mathcal{V}}_{d}(K) has a finite determinant.

  • (A.14)

    𝒊K\partial^{\bm{i}}K, |𝒊|=2|{\bm{i}}|=2 satisfies the covering number condition.

  • (A.15)

    limn(nhnd+2q+2)12=c<\lim_{n\to\infty}(nh_{n}^{d+2q+2})^{\frac{1}{2}}=c<\infty (i.e., hn0h_{n}\to 0).

  • (A.16)

    limnnhnd+4/lnn=\lim_{n\to\infty}nh_{n}^{d+4}/\ln n=\infty.

Here, the covering number condition, which appears in (A.7) and (A.14), is defined as follows:

Definition A.1 ((Pollard, 1984, Definition 23) and (Mokkadem and Pelletier, 2003, Section 2)).
  • Let PP be a probability measure on SS and {\mathcal{F}} be a class of functions in 1(P):-{g:|Sg(𝒙)P(d𝒙)|<}{\mathcal{L}}_{1}(P)\coloneq\{g:|\int_{S}g({\bm{x}})\,P(d{\bm{x}})|<\infty\}. For each ϵ>0\epsilon>0, define the covering number N(ϵ,P,)N(\epsilon,P,{\mathcal{F}}) as the smallest value of mm for which there exist functions g1,,gmg_{1},\ldots,g_{m} such that minjS|fgj|P(d𝒙)ϵ\min_{j}\int_{S}|f-g_{j}|\,P(d{\bm{x}})\leq\epsilon for each ff\in{\mathcal{F}}. We set N(ϵ,P,)=N(\epsilon,P,{\mathcal{F}})=\infty if no such mm exists.

  • Let gg be a function defined on d\mathbb{R}^{d}. Let (g)={g(𝒙h),h>0,𝒙d}{\mathcal{F}}(g)=\{g(\frac{\cdot\,-{\bm{x}}}{h}),h>0,{\bm{x}}\in\mathbb{R}^{d}\} be the class of functions consisting of arbitrarily translated and scaled versions of gg. We say that gg satisfies the covering number condition if gg is bounded and integrable on d\mathbb{R}^{d}, and if there exist A>0A>0 and W>0W>0 such that N(ϵ,P,(g))AϵWN(\epsilon,P,{\mathcal{F}}(g))\leq A\epsilon^{-W} for any probability measure PP on d\mathbb{R}^{d} and any ϵ(0,1)\epsilon\in(0,1).

For example, the RK K(𝒙)=G(𝒙)K({\bm{x}})=G(\|{\bm{x}}\|) and the PK K(𝒙)=j=1dG(xj)K({\bm{x}})=\prod_{j=1}^{d}G(x_{j}) fulfill the covering number condition in (A.7) if, in both cases, GG has bounded variation.

We give a proof of the asymptotic normality of the KME 𝜽n{\bm{\theta}}_{n}, which is a modification of the proof of (Mokkadem and Pelletier, 2003, Theorem 2.2), below.

Proof of Theorem 1.

Pollard (1984) gives the following lemma on uniform consistency (see his Theorem 37 on p. 34):

Lemma A.1.

Let gg be a function on d\mathbb{R}^{d} satisfying the covering number condition, and {hn}\{h_{n}\} be a sequence of positive numbers satisfying limnhn=0\lim_{n\to\infty}h_{n}=0. If there exists j0j\geq 0 such that limnnhnd+2j/lnn=\lim_{n\to\infty}nh_{n}^{d+2j}/\ln n=\infty, then

limnsup𝒛d1nhnd+j|i=1ng(𝒛𝒁ihn)E[g(𝒛𝒁hn)]|=0,a.s.,\displaystyle\lim_{n\to\infty}\sup_{\bm{z}\in\mathbb{R}^{d}}\frac{1}{nh_{n}^{d+j}}\biggl{|}\sum_{i=1}^{n}g\biggl{(}\frac{\bm{z}-\bm{Z}_{i}}{h_{n}}\biggr{)}-\mathrm{E}\biggl{[}g\biggl{(}\frac{\bm{z}-\bm{Z}}{h_{n}}\biggr{)}\biggr{]}\biggr{|}=0,\quad\text{a.s.,} (A.17)

where E\mathrm{E} is the expectation value with respect to the distribution of the random vector 𝐙d\bm{Z}\in\mathbb{R}^{d}, and where {𝐙id}i=1n\{\bm{Z}_{i}\in\mathbb{R}^{d}\}_{i=1}^{n} is a sample of i.i.d. observations of 𝐙\bm{Z}.

Also, one has the following lemma (see (Bochner, 2005, Theorem 1.1.1)):

Lemma A.2.

Let g1g_{1} be a function on d\mathbb{R}^{d} satisfying sup𝐮d|g1(𝐮)|<\sup_{{\bm{u}}\in\mathbb{R}^{d}}|g_{1}({\bm{u}})|<\infty, |g1(𝐮)|𝑑𝐮<\int|g_{1}({\bm{u}})|\,d{\bm{u}}<\infty, and lim|𝐮|𝐮|g1(𝐮)|=0\lim_{|{\bm{u}}|\to\infty}\|{\bm{u}}\||g_{1}({\bm{u}})|=0, g2g_{2} be a function on d\mathbb{R}^{d} satisfying |g2(𝐮)|𝑑𝐮<\int|g_{2}({\bm{u}})|\,d{\bm{u}}<\infty, and {hn}\{h_{n}\} be a sequence of positive numbers satisfying limnhn=0\lim_{n\to\infty}h_{n}=0. Then one has that, for any 𝐱d{\bm{x}}\in\mathbb{R}^{d} of continuity of g2g_{2},

limn1hndg1(𝒖hn)g2(𝒙𝒖)𝑑𝒖=g2(𝒙)g1(𝒖)𝑑𝒖.\displaystyle\lim_{n\to\infty}\frac{1}{h_{n}^{d}}\int g_{1}\biggl{(}\frac{{\bm{u}}}{h_{n}}\biggr{)}g_{2}({\bm{x}}-{\bm{u}})\,d{\bm{u}}=g_{2}({\bm{x}})\int g_{1}({\bm{u}})\,d{\bm{u}}. (A.18)

Applying Lemma A.1 with g=Kg=K and j=0j=0 under (A.1), the covering number condition on KK (A.7), hn0h_{n}\to 0 (A.15), and nhnd/lnnnh_{n}^{d}/\ln n\to\infty (A.16) ensures that

limnsup𝒙d|fn(𝒙)E[fn(𝒙)]|=0,a.s.\displaystyle\lim_{n\to\infty}\sup_{{\bm{x}}\in\mathbb{R}^{d}}\bigl{|}f_{n}({\bm{x}})-\mathrm{E}[f_{n}({\bm{x}})]\bigr{|}=0,\quad\text{a.s.} (A.19)

Also, Lemma A.2 with g1=Kg_{1}=K and g2=fg_{2}=f under the fact |f(𝒙)|𝑑𝒙=1<\int|f({\bm{x}})|\,d{\bm{x}}=1<\infty, the continuity of ff (A.2), (A.7), (A.8), and hn0h_{n}\to 0 (A.15) implies that

E[fn(𝜽)]=1hndK(𝜽𝒖hn)f(𝒖)𝑑𝒖=f(𝜽)K(𝒖)𝑑𝒖=f(𝜽).\displaystyle\mathrm{E}[f_{n}({\bm{\theta}})]=\frac{1}{h_{n}^{d}}\int K\biggl{(}\frac{{\bm{\theta}}-{\bm{u}}}{h_{n}}\biggr{)}f({\bm{u}})\,d{\bm{u}}=f({\bm{\theta}})\int K({\bm{u}})\,d{\bm{u}}=f({\bm{\theta}}). (A.20)

According to these results, one has limnfn(𝜽)=f(𝜽)\lim_{n\to\infty}f_{n}({\bm{\theta}})=f({\bm{\theta}}) a.s., which implies the consistency of 𝜽n{\bm{\theta}}_{n} to 𝜽{\bm{\theta}} by contradiction with the (ϵ,δ)(\epsilon,\delta)-definition of the limit: for any ϵ>0\epsilon>0, there exists a δ>0\delta>0 such that |fn(𝜽n)f(𝜽)|δ|f_{n}({\bm{\theta}}_{n})-f({\bm{\theta}})|\geq\delta if 𝜽n𝜽ϵ\|{\bm{\theta}}_{n}-{\bm{\theta}}\|\geq\epsilon.

Because 𝜽n{\bm{\theta}}_{n} maximizes fnf_{n} (i.e., fn(𝜽n)=𝟎d\nabla f_{n}({\bm{\theta}}_{n})=\bm{0}_{d}) and KK and hence fnf_{n} are twice differentiable (A.7), Taylor’s approximation of fn(𝒙)\nabla f_{n}({\bm{x}}) at 𝜽n{\bm{\theta}}_{n} around 𝜽{\bm{\theta}} shows

𝟎d=fn(𝜽n)=fn(𝜽)+Hfn(𝜽)(𝜽n𝜽),\displaystyle\bm{0}_{d}=\nabla f_{n}({\bm{\theta}}_{n})=\nabla f_{n}({\bm{\theta}})+Hf_{n}({\bm{\theta}}^{*})({\bm{\theta}}_{n}-{\bm{\theta}}), (A.21)

that is, 𝜽n=𝜽{Hfn(𝜽)}1fn(𝜽){\bm{\theta}}_{n}={\bm{\theta}}-\{Hf_{n}({\bm{\theta}}^{*})\}^{-1}\nabla f_{n}({\bm{\theta}}) if Hfn(𝜽)Hf_{n}({\bm{\theta}}^{*}) is invertible, where 𝜽{\bm{\theta}}^{*} satisfies 𝜽𝜽𝜽n𝜽\|{\bm{\theta}}^{*}-{\bm{\theta}}\|\leq\|{\bm{\theta}}_{n}-{\bm{\theta}}\|. We thus study the asymptotic behaviors of Hfn(𝜽)Hf_{n}({\bm{\theta}}^{*}) and fn(𝜽)\nabla f_{n}({\bm{\theta}}) below.

Applying Lemma A.1 with g=𝒊Kg=\partial^{\bm{i}}K, |𝒊|=2|{\bm{i}}|=2 and j=2j=2 under (A.1), the covering number condition on 𝒊K\partial^{\bm{i}}K (A.14), hn0h_{n}\to 0 (A.15), and (A.16), together with the consistency of 𝜽n{\bm{\theta}}_{n} to 𝜽{\bm{\theta}}, shows that Hfn(𝜽)Hf_{n}({\bm{\theta}}^{*}) converges to E[Hfn(𝜽)]\mathrm{E}[Hf_{n}({\bm{\theta}})]. Moreover, integration by parts and Lemma A.2 with g1=Kg_{1}=K and g2=𝒊fg_{2}=\partial^{\bm{i}}f, |𝒊|=2|{\bm{i}}|=2, under the continuity of 𝒊f\partial^{\bm{i}}f (A.2), (A.4), (A.7), (A.8), and hn0h_{n}\to 0 (A.15), imply that, for every 𝒊{\bm{i}} satisfying |𝒊|=2|{\bm{i}}|=2,

E[𝒊fn(𝜽)]=1hnd+2𝒊K(𝜽𝒖hn)f(𝒖)d𝒖=1hndK(𝜽𝒖hn)𝒊f(𝒖)d𝒖=𝒊f(𝜽)K(𝒖)𝑑𝒖=𝒊f(𝜽),\displaystyle\begin{split}\mathrm{E}[\partial^{\bm{i}}f_{n}({\bm{\theta}})]&=\frac{1}{h_{n}^{d+2}}\int\partial^{\bm{i}}K\biggl{(}\frac{{\bm{\theta}}-{\bm{u}}}{h_{n}}\biggr{)}f({\bm{u}})\,d{\bm{u}}=\frac{1}{h_{n}^{d}}\int K\biggl{(}\frac{{\bm{\theta}}-{\bm{u}}}{h_{n}}\biggr{)}\partial^{\bm{i}}f({\bm{u}})\,d{\bm{u}}\\ &=\partial^{\bm{i}}f({\bm{\theta}})\int K({\bm{u}})\,d{\bm{u}}=\partial^{\bm{i}}f({\bm{\theta}}),\end{split} (A.22)

that is, E[Hfn(𝜽)]=Hf(𝜽)\mathrm{E}[Hf_{n}({\bm{\theta}})]=Hf({\bm{\theta}}). Combining these results, one finds that {Hfn(𝜽)}1\{Hf_{n}({\bm{\theta}}^{*})\}^{-1} is consistent for the matrix A\mathrm{A}.

We next show that fn(𝜽)\nabla f_{n}({\bm{\theta}}) asymptotically follows the normal distribution with the mean hnq𝒃h_{n}^{q}{\bm{b}} and the variance-covariance matrix (nhnd+2)1V(nh_{n}^{d+2})^{-1}\mathrm{V}, on the basis of Lyapounov’s central limit theorem. The critical difference from the existing proof by (Mokkadem and Pelletier, 2003) lies in the assumptions (A.6) and (A.10), which affect the calculation of the mean of fn(𝜽)\nabla f_{n}({\bm{\theta}}). On the basis of the multivariate Taylor expansion of jf(𝜽hn𝒙)\partial_{j}f({\bm{\theta}}-h_{n}{\bm{x}}) around 𝜽{\bm{\theta}} under the setting that qq is even,

jf(𝜽hn𝒙)=jf(𝜽)+{|𝒊|=q(hn𝒙)𝒊𝒊!j𝒊f(𝜽)}{|𝒊|=q+1(hn𝒙)𝒊𝒊!j𝒊f(𝜽𝒙)},\displaystyle\partial_{j}f({\bm{\theta}}-h_{n}{\bm{x}})=\partial_{j}f({\bm{\theta}})-\cdots+\biggl{\{}\sum_{|{\bm{i}}|=q}\frac{(h_{n}{\bm{x}})^{\bm{i}}}{{\bm{i}}!}\cdot\partial_{j}\partial^{\bm{i}}f({\bm{\theta}})\biggr{\}}-\biggl{\{}\sum_{|{\bm{i}}|=q+1}\frac{(h_{n}{\bm{x}})^{\bm{i}}}{{\bm{i}}!}\cdot\partial_{j}\partial^{\bm{i}}f({\bm{\theta}}_{\bm{x}})\biggr{\}}, (A.23)

where 𝜽𝒙{\bm{\theta}}_{\bm{x}} lies in hn𝒙\|h_{n}{\bm{x}}\|-neighborhood of 𝜽{\bm{\theta}} and depends on 𝒙{\bm{x}}, one has that

E[jfn(𝜽)]=1hnd+1jK(𝜽𝒙hn)f(𝒙)d𝒙=K(𝒙)jf(𝜽hn𝒙)d𝒙=K(𝒙){+|𝒊|=q(hn𝒙)𝒊𝒊!j𝒊f(𝜽)}𝑑𝒙.\displaystyle\begin{split}\mathrm{E}\bigl{[}\partial_{j}f_{n}({\bm{\theta}})\bigr{]}&=\int\frac{1}{h_{n}^{d+1}}\partial_{j}K\biggl{(}\frac{{\bm{\theta}}-{\bm{x}}}{h_{n}}\biggr{)}f({\bm{x}})\,d{\bm{x}}=\int K({\bm{x}})\partial_{j}f({\bm{\theta}}-h_{n}{\bm{x}})\,d{\bm{x}}\\ &=\int K({\bm{x}})\biggl{\{}\cdots+\sum_{|{\bm{i}}|=q}\frac{(h_{n}{\bm{x}})^{\bm{i}}}{{\bm{i}}!}\cdot\partial_{j}\partial^{\bm{i}}f({\bm{\theta}})-\cdots\biggr{\}}\,d{\bm{x}}.\end{split} (A.24)

Because $\partial_{j}f({\bm{\theta}})=0$ (A.3), the integrals of the 2nd to $q$-th summands vanish due to the finiteness of $\partial^{\bm{i}}f({\bm{\theta}})$, $|{\bm{i}}|=2,\ldots,q$ (A.5) and (A.9), and the integral of the $(q+2)$-nd (residual) term is asymptotically negligible due to the boundedness of $\partial^{\bm{i}}f$, $|{\bm{i}}|=q+2$ (A.5), (A.11), and $h_{n}\to 0$ (A.15). One can therefore approximate the asymptotic mean of $\nabla f_{n}({\bm{\theta}})$ as

1hnqE[fn(𝜽)]jj(K(𝒙)|𝒊|=q𝒙𝒊𝒊!𝒊f(𝜽)d𝒙)=(𝒃(𝜽;f,K))j,\displaystyle\frac{1}{h_{n}^{q}}\mathrm{E}\bigl{[}\nabla f_{n}({\bm{\theta}})\bigr{]}_{j}\to\partial_{j}\biggl{(}\int K({\bm{x}})\sum_{|{\bm{i}}|=q}\frac{{\bm{x}}^{\bm{i}}}{{\bm{i}}!}\cdot\partial^{\bm{i}}f({\bm{\theta}})\,d{\bm{x}}\biggr{)}=\bigl{(}{\bm{b}}({\bm{\theta}};f,K)\big{)}_{j}, (A.25)

under the assumptions (A.6) and (A.10). Moreover, by using Lemma A.2 with g1=iKjKg_{1}=\partial_{i}K\partial_{j}K and g2=fg_{2}=f under the continuity of ff (A.2), (A.12), and hn0h_{n}\to 0 (A.15), one can calculate the variance-covariance matrix of fn(𝜽)\nabla f_{n}({\bm{\theta}}) with the definition (A.13):

nhnd+2Cov[fn(𝜽)]i,j=nhnd+21nCov[1hnd+1iK(𝜽𝑿hn),1hnd+1jK(𝜽𝑿hn)]=1hndiK(𝜽𝒙hn)jK(𝜽𝒙hn)f(𝒙)d𝒙f(𝜽)[𝒱d(K)]i,j=Vi,j.\displaystyle\begin{split}&nh_{n}^{d+2}\mathrm{Cov}[\nabla f_{n}({\bm{\theta}})]_{i,j}=nh_{n}^{d+2}\frac{1}{n}\mathrm{Cov}\biggl{[}\frac{1}{h_{n}^{d+1}}\partial_{i}K\left(\frac{{\bm{\theta}}-{\bm{X}}}{h_{n}}\right),\frac{1}{h_{n}^{d+1}}\partial_{j}K\left(\frac{{\bm{\theta}}-{\bm{X}}}{h_{n}}\right)\biggr{]}\\ &=\frac{1}{h_{n}^{d}}\int\partial_{i}K\left(\frac{{\bm{\theta}}-{\bm{x}}}{h_{n}}\right)\partial_{j}K\left(\frac{{\bm{\theta}}-{\bm{x}}}{h_{n}}\right)f({\bm{x}})\,d{\bm{x}}\to f({\bm{\theta}})[{\mathcal{V}}_{d}(K)]_{i,j}=\mathrm{V}_{i,j}.\end{split} (A.26)

Finally, applying Slutsky's theorem to these pieces finishes the proof of the asymptotic normality of $\{Hf_{n}({\bm{\theta}}^{*})\}^{-1}\nabla f_{n}({\bm{\theta}})$. ∎

The condition (A.6) is added to those in (Mokkadem and Pelletier, 2003) to ensure that the main term of the AB, which is proportional to $\mathrm{A}{\bm{b}}$, does not vanish. Also, the condition (A.10) is a modification of condition (A5) iii) in (Mokkadem and Pelletier, 2003), which, contrary to their description, does not apply to RKs. However, it should be noted that the kernel $K_{d,q}^{B}$ does not satisfy the twice differentiability among the sufficient conditions on the kernel. Eddy (1980) studied the behavior of the KME ${\bm{\theta}}_{n}$ within an $O((nh_{n}^{d+2})^{-\frac{1}{2}})$-neighborhood of the mode ${\bm{\theta}}$ and attempted a proof without assuming that the kernel $K$ is twice differentiable (so that the kernel $K_{d,q}^{B}$ fulfills their requirements), but their proof is not rigorous in that it does not consider the possibility that the KME ${\bm{\theta}}_{n}$ lies outside of that neighborhood. We have considered several approaches, including the proof in (Eddy, 1980), and concluded that twice differentiability of $K$ appears to be necessary for deriving the AMSE.

B Theories on Optimal Kernel

B.1 Relationships between the Moment Condition and the Number of Sign Changes

In this section we describe relationships between the moment condition and the number of sign changes of a kernel. We first introduce several notations. To represent a kernel class defined from the moment condition, we introduce, for integers d1d\geq 1, ll, and kk satisfying 0l<k0\leq l<k,

d,l,k:-{g:Bd,i(g)=(1)l(d)lbd1(i=l), 0(iIk2\{l}),non-zero(i=k)},\displaystyle{\mathcal{M}}_{d,l,k}\coloneq\{g:B_{d,i}(g)=(-1)^{l}(d)_{l}b_{d}^{-1}\;(i=l),\;0\;(i\in I_{k-2}\backslash\{l\}),\;\text{non-zero}\;(i=k)\}, (B.1)

where (n)m:-n(n+1)(n+m1)(n)_{m}\coloneq n(n+1)\cdots(n+m-1) denotes the Pochhammer symbol. Also, we represent a class of functions on 0\mathbb{R}_{\geq 0} with the prescribed number of sign changes as

𝒩j:-{g:g changes its sign j times on 0}.\displaystyle{\mathcal{N}}_{j}\coloneq\{g:\text{$g$ changes its sign $j$ times on $\mathbb{R}_{\geq 0}$}\}. (B.2)

Here, the number of the sign changes is defined as follows:

Definition B.1.

A function gg defined on a finite or infinite interval [a,b][a,b] is said to change its sign kk times on [a,b][a,b], if there are (k+1)(k+1) subintervals Sj=[xj1,xj]S_{j}=[x_{j-1},x_{j}], j=1,,k+1j=1,\ldots,k+1, where a=x0<x1<<xk+1=ba=x_{0}<x_{1}<\cdots<x_{k+1}=b, such that:

  • (i)

    g(x)0g(x)\leq 0 (or 0\geq 0) for all xSjx\in S_{j}, and there exists xSjx\in S_{j} such that g(x)<0g(x)<0 (or >0>0), for each j{1,,k+1}j\in\{1,\ldots,k+1\}.

  • (ii)

    If k1k\geq 1, g(x)g(y)0g(x)g(y)\leq 0 for all xSjx\in S_{j}, ySj+1y\in S_{j+1}, j=1,,kj=1,\ldots,k.

Note that a function may have an interval over which it equals zero, and that a point at which its sign changes does not have to be uniquely determined. Additionally, we introduce the following term:

Definition B.2.

Functions g1g_{1} and g2g_{2} defined on a finite or infinite interval [a,b][a,b] are said to share the same sign-change pattern if the same set of subintervals in Definition B.1 applies to both g1g_{1} and g2g_{2}.
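As a rough numerical counterpart of Definitions B.1 and B.2 (a heuristic grid-based check only, not the measure-theoretic definitions themselves; the test function, the grid, and the tolerance below are our own illustrative choices), one can count sign changes as follows.

```python
# Heuristic sketch: count the sign changes of a function g on a grid over [0, b],
# ignoring points where g is numerically zero.  Illustrative only.
import numpy as np

def count_sign_changes(g, xs, tol=1e-12):
    vals = g(xs)
    signs = np.sign(vals[np.abs(vals) > tol])      # drop (near-)zero values
    return int(np.sum(signs[1:] != signs[:-1]))

xs = np.linspace(0.0, 5.0, 200001)
# (1 - x^2) e^{-x} changes its sign once on [0, 5] (at x = 1), as in N_1
print(count_sign_changes(lambda x: (1.0 - x**2) * np.exp(-x), xs))
```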

In the following, we provide a lemma (Lemma B.2) about the minimum number of sign changes of a qq-th order RK. It is a multivariate extension of (Gasser et al., 1985, Lemma 2) stating that a univariate qq-th order kernel changes its sign at least (q2)(q-2) times on \mathbb{R}, and becomes a basis for the condition (P2-3) in Problem 2 and condition (P3-3) in Problem 3. The following Lemma B.1 is for Lemmas B.2 and B.12.

Lemma B.1.

Let $S_{1},\ldots,S_{k}$ be measurable subsets of $\mathbb{R}_{\geq 0}$ with $\mu(S_{i})>0$, where $\mu$ is the Lebesgue measure on $\mathbb{R}_{\geq 0}$, and assume that they are ordered in the sense that for any $x\in S_{i}$ and any $y\in S_{j}$ with $i<j$ the inequality $x<y$ holds almost surely. Let $w(x)$ be a measurable function which does not change its sign inside each $S_{j}$, $j=1,\ldots,k$. Assume further that $\int_{S_{j}}w(x)\,dx\neq 0$ holds for every $j=1,\ldots,k$. Then the following $k\times k$ matrix is non-singular:

M=[S1w(x)𝑑xS2w(x)𝑑xSkw(x)𝑑xS1w(x)x2𝑑xS2w(x)x2𝑑xSkw(x)x2𝑑xS1w(x)x2(k1)𝑑xS2w(x)x2(k1)𝑑xSkw(x)x2(k1)𝑑x].\displaystyle\mathrm{M}=\left[\begin{array}[]{cccc}\int_{S_{1}}w(x)\,dx&\int_{S_{2}}w(x)\,dx&\cdots&\int_{S_{k}}w(x)\,dx\\ \int_{S_{1}}w(x)x^{2}\,dx&\int_{S_{2}}w(x)x^{2}\,dx&\cdots&\int_{S_{k}}w(x)x^{2}\,dx\\ \vdots&\vdots&\ddots&\vdots\\ \int_{S_{1}}w(x)x^{2(k-1)}\,dx&\int_{S_{2}}w(x)x^{2(k-1)}\,dx&\cdots&\int_{S_{k}}w(x)x^{2(k-1)}\,dx\end{array}\right]. (B.7)
Proof.

Assume to the contrary that M\mathrm{M} is singular. Then there exists a non-zero vector 𝒂=(a1,,ak)\bm{a}=(a_{1},\ldots,a_{k})^{\top} which satisfies M𝒂=𝟎k\mathrm{M}^{\top}\bm{a}=\bm{0}_{k}. Rewriting it component-wise, one has

i=0k1ai+1Sjw(x)x2i𝑑x=Sjw(x)P(x)𝑑x=0,j=1,,k,\displaystyle\sum_{i=0}^{k-1}a_{i+1}\int_{S_{j}}w(x)x^{2i}\,dx=\int_{S_{j}}w(x)P(x)\,dx=0,\quad j=1,\ldots,k, (B.8)

where we let

P(x)=i=0k1ai+1x2i.\displaystyle P(x)=\sum_{i=0}^{k-1}a_{i+1}x^{2i}. (B.9)

Take an interval $[a,b]$ with $a=\inf S_{j}$ and $b=\sup S_{j}$. Then $w(x)\mathbbm{1}_{x\in S_{j}}$ is integrable, does not change its sign on $[a,b]$, and $\int_{a}^{b}w(x)\mathbbm{1}_{x\in S_{j}}\,dx\neq 0$. From (B.8) and the mean value theorem, there exists $\alpha_{j}\in(a,b)$ satisfying $P(\alpha_{j})=0$. Since $P(x)$ is an even function of $x$, one has $P(-\alpha_{j})=0$. Because the sets $S_{1},\ldots,S_{k}$ are ordered, the zeros $\alpha_{1},\ldots,\alpha_{k}$ are all distinct. Thus $P(x)$ is factorized as

P(x)=(x2α12)(x2αk2)Q(x),\displaystyle P(x)=(x^{2}-\alpha_{1}^{2})\cdots(x^{2}-\alpha_{k}^{2})Q(x), (B.10)

where Q(x)Q(x) is a polynomial not identically equal to 0. The degree of the right-hand side is therefore at least 2k2k, whereas from (B.9) the degree of P(x)P(x) is at most 2(k1)2(k-1), leading to contradiction. ∎
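A small numerical sanity check of Lemma B.1 (a sketch only; the concrete intervals and weight below are illustrative choices of ours, not part of the proof): for ordered intervals and a weight of constant sign on each of them with non-vanishing integrals, the matrix $\mathrm{M}$ has a non-zero determinant.

```python
# Sketch: build the matrix M of Lemma B.1 for the ordered intervals S_1 = [0, 1],
# S_2 = [1, 2], S_3 = [2, 3] and a piecewise-constant weight w with alternating sign,
# then check that det(M) is non-zero.  All concrete choices here are illustrative.
import numpy as np
from scipy.integrate import quad

intervals = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]
w_signs = [1.0, -1.0, 1.0]       # w is constant (hence of one sign) on each S_j
k = len(intervals)

M = np.array([[quad(lambda x, s=s: s * x**(2 * i), a, b)[0]
               for (a, b), s in zip(intervals, w_signs)]
              for i in range(k)])
print(np.linalg.det(M))          # should be non-zero
```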

Lemma B.2.

For d1d\geq 1 and even q2q\geq 2, if a function gg defined on 0\mathbb{R}_{\geq 0} is such that gd,0,qg\in{\mathcal{M}}_{d,0,q} or gd,1,q+1g\in{\mathcal{M}}_{d,1,q+1}, then gg changes its sign at least (q21)(\frac{q}{2}-1) times on 0\mathbb{R}_{\geq 0}.

Proof.

Assume that $g$ changes its sign $(k-1)$ times on $\mathbb{R}_{\geq 0}$, and decompose $\mathbb{R}_{\geq 0}$ into $k$ subintervals $S_{1},\ldots,S_{k}$ in the manner of Definition B.1: the sign of $g$ alternates over these subintervals.

Proof for gd,0,qg\in{\mathcal{M}}_{d,0,q}. Lemma B.1 shows that the k×kk\times k matrix

M=[S1xd+1g(x)𝑑xS2xd+1g(x)𝑑xSkxd+1g(x)𝑑xS1xd+3g(x)𝑑xS2xd+3g(x)𝑑xSkxd+3g(x)𝑑xS1xd+2k1g(x)𝑑xS2xd+2k1g(x)𝑑xSkxd+2k1g(x)𝑑x]\displaystyle\mathrm{M}=\begin{bmatrix}\int_{S_{1}}x^{d+1}g(x)\,dx&\int_{S_{2}}x^{d+1}g(x)\,dx&\cdots&\int_{S_{k}}x^{d+1}g(x)\,dx\\ \int_{S_{1}}x^{d+3}g(x)\,dx&\int_{S_{2}}x^{d+3}g(x)\,dx&\cdots&\int_{S_{k}}x^{d+3}g(x)\,dx\\ \vdots&\vdots&\ddots&\vdots\\ \int_{S_{1}}x^{d+2k-1}g(x)\,dx&\int_{S_{2}}x^{d+2k-1}g(x)\,dx&\cdots&\int_{S_{k}}x^{d+2k-1}g(x)\,dx\end{bmatrix} (B.11)

is non-singular. Assume, contrary to the statement of this lemma, that $k<\frac{q}{2}$. Then, from the moment conditions $g\in{\mathcal{M}}_{d,0,q}$, one would have

M𝟏k=[0xd+1g(x)𝑑x0xd+3g(x)𝑑x0xd+2k1g(x)𝑑x]=𝟎k.\displaystyle\mathrm{M}\bm{1}_{k}=\begin{bmatrix}\int_{\mathbb{R}_{\geq 0}}x^{d+1}g(x)\,dx\\ \int_{\mathbb{R}_{\geq 0}}x^{d+3}g(x)\,dx\\ \vdots\\ \int_{\mathbb{R}_{\geq 0}}x^{d+2k-1}g(x)\,dx\end{bmatrix}=\bm{0}_{k}. (B.12)

This contradicts the non-singularity of M\mathrm{M}. Thus, kq2k\geq\frac{q}{2} holds, i.e., gg changes its sign at least (q21)(\frac{q}{2}-1) times on 0\mathbb{R}_{\geq 0}.

Proof for gd,1,q+1g\in{\mathcal{M}}_{d,1,q+1}. For the matrix

M=[S1xd+2g(x)𝑑xS2xd+2g(x)𝑑xSkxd+2g(x)𝑑xS1xd+4g(x)𝑑xS2xd+4g(x)𝑑xSkxd+4g(x)𝑑xS1xd+2kg(x)𝑑xS2xd+2kg(x)𝑑xSkxd+2kg(x)𝑑x]\displaystyle\mathrm{M}=\begin{bmatrix}\int_{S_{1}}x^{d+2}g(x)\,dx&\int_{S_{2}}x^{d+2}g(x)\,dx&\cdots&\int_{S_{k}}x^{d+2}g(x)\,dx\\ \int_{S_{1}}x^{d+4}g(x)\,dx&\int_{S_{2}}x^{d+4}g(x)\,dx&\cdots&\int_{S_{k}}x^{d+4}g(x)\,dx\\ \vdots&\vdots&\ddots&\vdots\\ \int_{S_{1}}x^{d+2k}g(x)\,dx&\int_{S_{2}}x^{d+2k}g(x)\,dx&\cdots&\int_{S_{k}}x^{d+2k}g(x)\,dx\end{bmatrix} (B.13)

one can repeat the argument used in the case $g\in{\mathcal{M}}_{d,0,q}$. ∎

Also, in the cases $q=2,4$, we can determine the sign pattern of the kernel:

Lemma B.3.

Let d1d\geq 1 and q=2q=2 or 44, and assume that a function Gd,qG_{d,q} is defined on 0\mathbb{R}_{\geq 0} and such that Gd,qd,0,q𝒩q21G_{d,q}\in{\mathcal{M}}_{d,0,q}\cap{\mathcal{N}}_{\frac{q}{2}-1}. Then, Gd,q(x)0G_{d,q}(x)\geq 0 for xS1S3x\in S_{1}\cup S_{3}\cup\cdots and Gd,q(x)0G_{d,q}(x)\leq 0 for xS2S4x\in S_{2}\cup S_{4}\cup\cdots, where the subintervals S1,,Sq2S_{1},\ldots,S_{\frac{q}{2}} of 0\mathbb{R}_{\geq 0} are defined according to Definition B.1. Moreover, sign(Bd,q(Gd,q))=(1)q/2+1\operatorname{sign}(B_{d,q}(G_{d,q}))=(-1)^{q/2+1}.

Proof.

Proof for q=2q=2. A 2nd order kernel satisfying the normalization condition and the minimum-sign-change condition is non-negative, and the statements Gd,2(x)0G_{d,2}(x)\geq 0 for any x0x\in\mathbb{R}_{\geq 0} and sign(Bd,2(Gd,2))=+1\operatorname{sign}(B_{d,2}(G_{d,2}))=+1 are trivial.

Proof for $q=4$. The sign change of the minimum-sign-change 4-th order kernel $G_{d,4}$ occurs at a single point, denoted by $\rho>0$. If one assumes that $G_{d,4}(x)\leq 0$ for $x\in[0,\rho]$, then one has the following contradiction on the basis of the mean value theorem of integration:

0=Bd,2(Gd,4)=0xd+1Gd,4(x)𝑑x=ξ120ρxd1Gd,4(x)𝑑x+ξ22ρxd1Gd,4(x)𝑑x=ξ12>00xd1Gd,4(x)𝑑x>0Bd,0(Gd,4)>0+(ξ22ξ12)>0ρxd1Gd,4(x)𝑑x>0>0,\displaystyle\begin{split}&0=B_{d,2}(G_{d,4})=\int_{0}^{\infty}x^{d+1}G_{d,4}(x)\,dx=\xi_{1}^{2}\int_{0}^{\rho}x^{d-1}G_{d,4}(x)\,dx+\xi_{2}^{2}\int_{\rho}^{\infty}x^{d-1}G_{d,4}(x)\,dx\\ &=\underbrace{\xi_{1}^{2}}_{>0}\underbrace{\int_{0}^{\infty}x^{d-1}G_{d,4}(x)\,dx}_{>0~{}\because B_{d,0}(G_{d,4})>0}+\underbrace{(\xi_{2}^{2}-\xi_{1}^{2})}_{>0}\underbrace{\int_{\rho}^{\infty}x^{d-1}G_{d,4}(x)\,dx}_{>0}>0,\end{split} (B.14)

where 0<ξ1<ρ<ξ20<\xi_{1}<\rho<\xi_{2}. Thus, Gd,4(x)0G_{d,4}(x)\geq 0 for 0xρ0\leq x\leq\rho and Gd,4(x)0G_{d,4}(x)\leq 0 for xρx\geq\rho. Also, one has

Bd,4(Gd,4)=0xd+3Gd,4(x)𝑑x=ξ~120ρxd+1Gd,4(x)𝑑x+ξ~22ρxd+1Gd,4(x)𝑑x=ξ~12>00xd+1Gd,4(x)𝑑x=0Bd,2(Gd,4)=0+(ξ~22ξ~12)>0ρxd+1Gd,4(x)𝑑x<0<0,\displaystyle\begin{split}&B_{d,4}(G_{d,4})=\int_{0}^{\infty}x^{d+3}G_{d,4}(x)\,dx=\tilde{\xi}_{1}^{2}\int_{0}^{\rho}x^{d+1}G_{d,4}(x)\,dx+\tilde{\xi}_{2}^{2}\int_{\rho}^{\infty}x^{d+1}G_{d,4}(x)\,dx\\ &=\underbrace{\tilde{\xi}_{1}^{2}}_{>0}\underbrace{\int_{0}^{\infty}x^{d+1}G_{d,4}(x)\,dx}_{=0~{}\because B_{d,2}(G_{d,4})=0}+\underbrace{(\tilde{\xi}_{2}^{2}-\tilde{\xi}_{1}^{2})}_{>0}\underbrace{\int_{\rho}^{\infty}x^{d+1}G_{d,4}(x)\,dx}_{<0}<0,\end{split} (B.15)

where 0<ξ~1<ρ<ξ~20<\tilde{\xi}_{1}<\rho<\tilde{\xi}_{2}. Therefore, sign(Bd,4(Gd,4))=1\operatorname{sign}(B_{d,4}(G_{d,4}))=-1 is confirmed. ∎

B.2 Proof of Theorem 2

First we show that the kernel Gd,qBG_{d,q}^{B} satisfies all the conditions of Problem 2.

Lemma B.4.

The kernel Gd,qBG_{d,q}^{B} satisfies all the conditions (P2-1)–(P2-5) with G=Gd,qBG=G_{d,q}^{B}.

Proof.

The fact that Gd,qBG_{d,q}^{B} satisfies (P2-3), that is, it changes its sign (q21)(\frac{q}{2}-1) times on 0\mathbb{R}_{\geq 0}, is a direct consequence of the well-known property of the Jacobi polynomials, that Pn(α,β)()P_{n}^{(\alpha,\beta)}(\cdot) with α,β>1\alpha,\beta>-1 has nn simple zeros in the interval (1,1)(-1,1), together with the rightmost expression in (22) of Gd,qBG_{d,q}^{B}. Differentiability of Gd,qBG_{d,q}^{B} on 0\mathbb{R}_{\geq 0} and boundedness and continuity in the sense of ‘a.e.’ of Gd,qB(1)G_{d,q}^{B(1)} (condition (P2-2)), finiteness of Vd,1(Gd,qB)V_{d,1}(G_{d,q}^{B}) (condition (P2-4)), and the behavior xd+qGd,qB(x)0x^{d+q}G_{d,q}^{B}(x)\to 0 as xx\to\infty (condition (P2-5)) follow straightforwardly from the fact that, from the rightmost expression in (22), Gd,qB(x)G_{d,q}^{B}(x) is a polynomial restricted to the finite interval x[0,1]x\in[0,1] with a double zero at x=1x=1.

We thus show in the following that Gd,qBd,0,qG_{d,q}^{B}\in{\mathcal{M}}_{d,0,q} holds, that is, Gd,qBG_{d,q}^{B} satisfies the moment condition (P2-1). Let

Bd,i(Gd,qB)=(1)q2+1Γ(d+q2+2)πd2Γ(q2+2)01xd1+i(1x2)2Pq21(2,d2)(2x21)𝑑x=(1)q2+1Γ(d+q2+2)πd2Γ(q2+2)Id,q,i,\displaystyle\begin{split}B_{d,i}(G_{d,q}^{B})&=(-1)^{\frac{q}{2}+1}\frac{\Gamma(\frac{d+q}{2}+2)}{\pi^{\frac{d}{2}}\Gamma(\frac{q}{2}+2)}\int_{0}^{1}x^{d-1+i}(1-x^{2})^{2}P_{\frac{q}{2}-1}^{(2,\frac{d}{2})}(2x^{2}-1)\,dx\\ &=(-1)^{\frac{q}{2}+1}\frac{\Gamma(\frac{d+q}{2}+2)}{\pi^{\frac{d}{2}}\Gamma(\frac{q}{2}+2)}I_{d,q,i},\end{split} (B.16)

where

Id,q,i:-01xd1+i(1x2)2Pq21(2,d2)(2x21)𝑑x=12d+i2+311(1y)2(1+y)d+i21Pq21(2,d2)(y)𝑑y.\displaystyle\begin{split}I_{d,q,i}\coloneq&\,\int_{0}^{1}x^{d-1+i}(1-x^{2})^{2}P_{\frac{q}{2}-1}^{(2,\frac{d}{2})}(2x^{2}-1)\,dx\\ =&\,\frac{1}{2^{\frac{d+i}{2}+3}}\int_{-1}^{1}(1-y)^{2}(1+y)^{\frac{d+i}{2}-1}P_{\frac{q}{2}-1}^{(2,\frac{d}{2})}(y)\,dy.\end{split} (B.17)

Using the identity (Prudnikov et al., 1986, Section 2.22.2, Item 11)

11(1y)α(1+y)λ1Pn(α,β)(y)𝑑y=(1)n2α+λB(α+n+1,λ)(βλ+nn),λ>0,α,β>1,\displaystyle\int_{-1}^{1}(1-y)^{\alpha}(1+y)^{\lambda-1}P_{n}^{(\alpha,\beta)}(y)\,dy=(-1)^{n}2^{\alpha+\lambda}B(\alpha+n+1,\lambda){\beta-\lambda+n\choose n},\quad\lambda>0,\,\alpha,\beta>-1, (B.18)

where

(xy)=Γ(x+1)Γ(y+1)Γ(xy+1),x,y\displaystyle{x\choose y}=\frac{\Gamma(x+1)}{\Gamma(y+1)\Gamma(x-y+1)},\quad x,y\in\mathbb{R} (B.19)

is the binomial coefficient extended to real-valued arguments, and where B(,)B(\cdot,\cdot) denotes the Beta function, one has

Id,q,i=(1)q212B(q2+2,d+i2)(q21i2q21).\displaystyle I_{d,q,i}=\frac{(-1)^{\frac{q}{2}-1}}{2}B\Bigl{(}\frac{q}{2}+2,\frac{d+i}{2}\Bigr{)}{\frac{q}{2}-1-\frac{i}{2}\choose\frac{q}{2}-1}. (B.20)

The values of Id,q,iI_{d,q,i} for i{0,2,,q}i\in\{0,2,\ldots,q\} are therefore

Id,q,i={(1)q212B(q2+2,d2),i=0,0,i{2,,q2},12B(q2+2,d+q2),i=q.\displaystyle I_{d,q,i}=\left\{\begin{array}[]{ll}\displaystyle\frac{(-1)^{\frac{q}{2}-1}}{2}B\Bigl{(}\frac{q}{2}+2,\frac{d}{2}\Bigr{)},&i=0,\\ 0,&i\in\{2,\ldots,q-2\},\\ \displaystyle\frac{1}{2}B\Bigl{(}\frac{q}{2}+2,\frac{d+q}{2}\Bigr{)},&i=q.\end{array}\right. (B.24)

These in turn imply that the moments of Gd,qBG_{d,q}^{B} are given by

Bd,i(Gd,qB)={Γ(d2)2πd2=1bd,i=0,0,i{2,,q2},(1)q2+12πd2Γ(d+q2)Γ(d+q2+2)Γ(d2+q+2),i=q,\displaystyle B_{d,i}(G_{d,q}^{B})=\left\{\begin{array}[]{ll}\displaystyle\frac{\Gamma(\frac{d}{2})}{2\pi^{\frac{d}{2}}}=\frac{1}{b_{d}},&i=0,\\ 0,&i\in\{2,\ldots,q-2\},\\ \displaystyle\frac{(-1)^{\frac{q}{2}+1}}{2\pi^{\frac{d}{2}}}\frac{\Gamma(\frac{d+q}{2})\Gamma(\frac{d+q}{2}+2)}{\Gamma(\frac{d}{2}+q+2)},&i=q,\end{array}\right. (B.28)

showing that Gd,qBd,0,qG_{d,q}^{B}\in{\mathcal{M}}_{d,0,q} holds. (The fact that Bd,i(Gd,qB)=0B_{d,i}(G_{d,q}^{B})=0 for i{2,,q2}i\in\{2,\ldots,q-2\} can alternatively be understood on the basis of the expression (B.17) and the orthogonality relation (B.70) of the Jacobi polynomials.) ∎
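As a numerical cross-check of (B.28) (a sketch only; it assumes the closed form of $G_{d,q}^{B}$ implied by (B.16), namely $G_{d,q}^{B}(x)=(-1)^{\frac{q}{2}+1}\frac{\Gamma(\frac{d+q}{2}+2)}{\pi^{\frac{d}{2}}\Gamma(\frac{q}{2}+2)}(1-x^{2})^{2}P_{\frac{q}{2}-1}^{(2,\frac{d}{2})}(2x^{2}-1)$ for $x\in[0,1]$ and $0$ otherwise, and the moment functional $B_{d,i}(g)=\int_{0}^{\infty}x^{d-1+i}g(x)\,dx$ as in (B.46); the values of $d$ and $q$ are illustrative), the moments can be evaluated by quadrature:

```python
# Sketch: numerically verify the moments (B.28) of G_{d,q}^B for illustrative (d, q).
import numpy as np
from math import gamma, pi
from scipy.integrate import quad
from scipy.special import eval_jacobi

d, q = 2, 6     # dimension and (even) kernel order; illustrative choices

def G_B(x):     # closed form of G_{d,q}^B on [0, 1] implied by (B.16)
    c = (-1)**(q // 2 + 1) * gamma((d + q) / 2 + 2) / (pi**(d / 2) * gamma(q / 2 + 2))
    return c * (1 - x**2)**2 * eval_jacobi(q // 2 - 1, 2.0, d / 2, 2 * x**2 - 1)

def B(i):       # moment functional B_{d,i}(G_{d,q}^B)
    return quad(lambda x: x**(d - 1 + i) * G_B(x), 0.0, 1.0)[0]

print(B(0), gamma(d / 2) / (2 * pi**(d / 2)))          # both equal 1/b_d
print([round(B(i), 12) for i in range(2, q, 2)])       # intermediate moments vanish
print(B(q), (-1)**(q // 2 + 1) * gamma((d + q) / 2) * gamma((d + q) / 2 + 2)
            / (2 * pi**(d / 2) * gamma(d / 2 + q + 2)))  # q-th moment as in (B.28)
```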

Before proceeding to the proof, we introduce two polynomials related to the kernel $K_{d,q}^{B}(\cdot)=G_{d,q}^{B}(\|\cdot\|)$:

Pd,qB(x):-Γ(d+q2)πd2Γ(q2)Pq2+1(d2,2)(12x2),Rd,qB(x):-(d1)Pd,qB(1)(x)x+Pd,qB(2)(x).\displaystyle P_{d,q}^{B}(x)\coloneq\frac{\Gamma(\frac{d+q}{2})}{\pi^{\frac{d}{2}}\Gamma(\frac{q}{2})}P_{\frac{q}{2}+1}^{(\frac{d}{2},-2)}(1-2x^{2}),\quad R_{d,q}^{B}(x)\coloneq(d-1)\frac{P^{B(1)}_{d,q}(x)}{x}+P^{B(2)}_{d,q}(x). (B.29)

$P_{d,q}^{B}$ is the polynomial from which $G_{d,q}^{B}$ is obtained by truncation. Also, the function $R_{d,q}^{B}$ is defined so that the first derivative of $x^{d-1}P_{d,q}^{B(1)}(x)$ becomes $x^{d-1}R_{d,q}^{B}(x)$, and $R_{d,q}^{B}$ is a polynomial with even-degree terms of degrees $0,2,\ldots,q$.
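Indeed, this defining property of $R_{d,q}^{B}$ is a one-line consequence of the product rule applied to the definition (B.29):

\displaystyle\frac{d}{dx}\bigl\{x^{d-1}P_{d,q}^{B(1)}(x)\bigr\}=(d-1)x^{d-2}P_{d,q}^{B(1)}(x)+x^{d-1}P_{d,q}^{B(2)}(x)=x^{d-1}\biggl\{(d-1)\frac{P_{d,q}^{B(1)}(x)}{x}+P_{d,q}^{B(2)}(x)\biggr\}=x^{d-1}R_{d,q}^{B}(x).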

In the following, Lemma B.6 gives the properties of the functions $P_{d,q}^{B}$ and $R_{d,q}^{B}$ that we need; Lemma B.5 is used in its proof.

Lemma B.5.

Suppose that P(x)=a0(x2a12)(x2an2)P(x)=a_{0}(x^{2}-a_{1}^{2})\cdots(x^{2}-a_{n}^{2}) has coefficients a00a_{0}\neq 0, 0<a1an0<a_{1}\leq\cdots\leq a_{n}. Then, P(x)/xjP(x)/x^{j}, j=0,,2nj=0,\ldots,2n; P(1)(x)/xjP^{(1)}(x)/x^{j}, j=0,,2n2j=0,\ldots,2n-2; and P(2)(x)/xjP^{(2)}(x)/x^{j}, j=0,,2n4j=0,\ldots,2n-4 are strictly monotonically increasing (resp. decreasing) for x>anx>a_{n}, when a0>0a_{0}>0 (resp. a0<0a_{0}<0).

Proof.

From simple calculus, one obtains

P(1)(x)=2a0i=1nx(x2a12)(x2ai12)(x2ai+12)(x2an2),P(2)(x)=2a0i=1n(x2a12)(x2ai12)(x2ai+12)(x2an2)+4a01i<jnx2(x2a12)(x2ai12)(x2ai+12)(x2aj12)(x2aj+12)(x2an2).\displaystyle\begin{split}&P^{(1)}(x)=2a_{0}\sum_{i=1}^{n}x(x^{2}-a_{1}^{2})\cdots(x^{2}-a_{i-1}^{2})(x^{2}-a_{i+1}^{2})\cdots(x^{2}-a_{n}^{2}),\\ &P^{(2)}(x)=2a_{0}\sum_{i=1}^{n}(x^{2}-a_{1}^{2})\cdots(x^{2}-a_{i-1}^{2})(x^{2}-a_{i+1}^{2})\cdots(x^{2}-a_{n}^{2})\\ &\hphantom{P^{(2)}(x)}+4a_{0}\sum_{1\leq i<j\leq n}x^{2}(x^{2}-a_{1}^{2})\cdots(x^{2}-a_{i-1}^{2})(x^{2}-a_{i+1}^{2})\cdots(x^{2}-a_{j-1}^{2})(x^{2}-a_{j+1}^{2})\cdots(x^{2}-a_{n}^{2}).\end{split} (B.30)

The first statement in (Gasser et al., 1985, Lemma 4) shows the statement on P(x)/xjP(x)/x^{j}, j=0,,2nj=0,\ldots,2n in this lemma. Also, applying the latter (resp. former) statement of their lemma to every summand of P(1)(x)P^{(1)}(x) (resp. P(2)(x)P^{(2)}(x)) in (B.30), a result on P(1)(x)/xjP^{(1)}(x)/x^{j}, j=0,,2n2j=0,\ldots,2n-2 (resp. P(2)(x)/xjP^{(2)}(x)/x^{j}, j=0,,2n4j=0,\ldots,2n-4) can be proved. ∎

Lemma B.6.

The polynomial Pd,qBP_{d,q}^{B} in (B.29) can be represented as Pd,qB(x)=a0(x2a12)(x2aq2+12)P_{d,q}^{B}(x)=a_{0}(x^{2}-a_{1}^{2})\cdots(x^{2}-a_{\frac{q}{2}+1}^{2}), a0>0a_{0}>0 (resp. a0<0a_{0}<0) when q=2,6,q=2,6,\ldots (resp. q=4,8,q=4,8,\ldots), 0<a1<<aq2=aq2+1=10<a_{1}<\cdots<a_{\frac{q}{2}}=a_{\frac{q}{2}+1}=1. Also, for the polynomial Rd,qBR_{d,q}^{B} in (B.29), Rd,qB(x)/xjR_{d,q}^{B}(x)/x^{j}, j=0,,q2j=0,\ldots,q-2 is monotonically increasing and positive (resp. decreasing and negative) for x>1x>1 when q=2,6,q=2,6,\ldots (resp. q=4,8,q=4,8,\ldots).

Proof.

The symmetric polynomial function $P_{d,q}^{B}$ has $(q-2)$ zeros inside $(-1,1)$ and double zeros at $x=\pm 1$. Thus, in the factorized form, $0<a_{1}<\cdots<a_{\frac{q}{2}-1}<1$ and $a_{\frac{q}{2}}=a_{\frac{q}{2}+1}=1$.

Using the relationship of the Jacobi polynomial

Pn(α,β)(1)=(α+1)nn!,\displaystyle P_{n}^{(\alpha,\beta)}(1)=\frac{(\alpha+1)_{n}}{n!}, (B.31)

which appears in (Askey, 1975, p.7), one has

\displaystyle P_{d,q}^{B}(0)=\frac{\Gamma(\frac{d+q}{2})}{\pi^{\frac{d}{2}}\Gamma(\frac{q}{2})}P_{\frac{q}{2}+1}^{(\frac{d}{2},-2)}(1)=\frac{\Gamma(\frac{d+q}{2})}{\pi^{\frac{d}{2}}\Gamma(\frac{q}{2})}\frac{(\frac{d}{2}+1)_{\frac{q}{2}+1}}{(\frac{q}{2}+1)!}>0. (B.32)

Then, on the basis of the factorized expression of Pd,qBP_{d,q}^{B},

Pd,qB(0)=a0(a12)(aq2+12)=(1)q2+1a0a12aq2+12>0,\displaystyle P_{d,q}^{B}(0)=a_{0}(-a_{1}^{2})\cdots(-a_{\frac{q}{2}+1}^{2})=(-1)^{\frac{q}{2}+1}a_{0}a_{1}^{2}\cdots a_{\frac{q}{2}+1}^{2}>0, (B.33)

one has a0>0a_{0}>0 (resp. a0<0a_{0}<0) when q=2,6,q=2,6,\ldots (resp. q=4,8,q=4,8,\ldots).

Finally, the increase or decrease of Rd,qB(x)/xjR_{d,q}^{B}(x)/x^{j}, j=0,,q2j=0,\ldots,q-2 for x>1x>1 is a direct corollary of Lemma B.5. On the basis of the calculation (B.30), one has

\displaystyle\begin{split}R_{d,q}^{B}(1)&=(d-1)\frac{P^{B(1)}_{d,q}(1)}{1}+P^{B(2)}_{d,q}(1)=(d-1)\frac{0}{1}+4a_{0}1^{2}(1-a_{1}^{2})\cdots(1-a_{\frac{q}{2}-1}^{2})\\ &\propto a_{0}(1-a_{1}^{2})\cdots(1-a_{\frac{q}{2}-1}^{2})\end{split} (B.34)

which, together with the above-mentioned monotonicity of $R_{d,q}^{B}(x)/x^{0}=R_{d,q}^{B}(x)$, ensures that the sign of $R_{d,q}^{B}(x)$ for $x>1$ coincides with that of $a_{0}$. ∎

Finally, we provide a proof of Theorem 2 below:

Proof of Theorem 2.

Suppose that $G_{d,q}=G_{d,q}^{B}+Q_{d,q}:\mathbb{R}_{\geq 0}\to\mathbb{R}$ satisfies the conditions (P2-1)–(P2-5) and has the same $q$-th moment $B_{d,q}$ as $G_{d,q}^{B}$. In this setting, because $G_{d,q}^{B}$ satisfies (P2-1)–(P2-5) (as shown in Lemma B.4), the disturbance $Q_{d,q}$ needs to satisfy the moment conditions

Bd,i(Qd,q)=0,iIq.\displaystyle B_{d,i}(Q_{d,q})=0,\quad i\in I_{q}. (B.35)

From the radial symmetry, the odd-order moment conditions are automatically fulfilled. Additionally, it should be noted that $Q_{d,q}$ does not need to have a compact support, so that the optimality of the kernel is not restricted to the class of truncated kernels. In contrast, the following proof does require the condition (P2-5) (i.e., $\lim_{x\to\infty}x^{d+q}Q_{d,q}(x)=0$).

The following calculations are performed to derive the inequality (B.38). On the basis of integration by parts, $P_{d,q}^{B(1)}(0)=0$, $\lim_{x\to\infty}x^{d+q}Q_{d,q}(x)=0$, the moment conditions (B.35), and the fact that the polynomial $R_{d,q}^{B}$ defined in (B.29) has terms of degrees $0,2,\ldots,q$ only, one has

0xd1Pd,qB(1)(x)Qd,q(1)(x)𝑑x=[xd1Pd,qB(1)(x)Qd,q(x)]00xd1Rd,qB(x)Qd,q(x)𝑑x=0.\displaystyle\begin{split}\int_{0}^{\infty}x^{d-1}P_{d,q}^{B(1)}(x)Q_{d,q}^{(1)}(x)\,dx&=\biggl{[}x^{d-1}P_{d,q}^{B(1)}(x)Q_{d,q}(x)\biggr{]}_{0}^{\infty}-\int_{0}^{\infty}x^{d-1}R_{d,q}^{B}(x)Q_{d,q}(x)\,dx\\ &=0.\end{split} (B.36)

Second, integration by parts, together with $P_{d,q}^{B(1)}(1)=0$ and $\lim_{x\to\infty}x^{d+q}Q_{d,q}(x)=0$, shows

1xd1Pd,qB(1)(x)Qd,q(1)(x)𝑑x=[xd1Pd,qB(1)(x)Qd,q(x)]11xd1Rd,qB(x)Qd,q(x)𝑑x=1xd1Rd,qB(x)Qd,q(x)𝑑x.\displaystyle\begin{split}\int_{1}^{\infty}x^{d-1}P_{d,q}^{B(1)}(x)Q_{d,q}^{(1)}(x)\,dx&=\biggl{[}x^{d-1}P_{d,q}^{B(1)}(x)Q_{d,q}(x)\biggr{]}_{1}^{\infty}-\int_{1}^{\infty}x^{d-1}R_{d,q}^{B}(x)Q_{d,q}(x)\,dx\\ &=-\int_{1}^{\infty}x^{d-1}R_{d,q}^{B}(x)Q_{d,q}(x)\,dx.\end{split} (B.37)

Collecting these pieces, one can obtain the following inequality:

\displaystyle\begin{split}&V_{d,1}(G_{d,q})-V_{d,1}(G_{d,q}^{B})=V_{d,1}(Q_{d,q})+2\int_{0}^{\infty}x^{d-1}G_{d,q}^{B(1)}(x)Q_{d,q}^{(1)}(x)\,dx\\ &\geq 2\int_{0}^{\infty}x^{d-1}G_{d,q}^{B(1)}(x)Q_{d,q}^{(1)}(x)\,dx=2\int_{0}^{\infty}x^{d-1}\{G_{d,q}^{B(1)}(x)-P_{d,q}^{B(1)}(x)\}Q_{d,q}^{(1)}(x)\,dx\\ &=-2\int_{1}^{\infty}x^{d-1}P_{d,q}^{B(1)}(x)Q_{d,q}^{(1)}(x)\,dx=2\int_{1}^{\infty}x^{d-1}R_{d,q}^{B}(x)Q_{d,q}(x)\,dx,\end{split} (B.38)

where the equality holds just when Qd,q(1)(x)0Q^{(1)}_{d,q}(x)\equiv 0, which holds only if Qd,q(x)0Q_{d,q}(x)\equiv 0 due to the moment condition Bd,0(Qd,q)=0B_{d,0}(Q_{d,q})=0.

Thus, since the kernels $G_{d,q}$ and $G_{d,q}^{B}$ have the same value of the $q$-th moment, in order to prove the optimality of $G_{d,q}^{B}$ with respect to the AMSE criterion, it suffices to prove

1xd1Rd,qB(x)Qd,q(x)𝑑x0\displaystyle\int_{1}^{\infty}x^{d-1}R_{d,q}^{B}(x)Q_{d,q}(x)\,dx\geq 0 (B.39)

for $Q_{d,q}(x)\not\equiv 0$, instead of showing $V_{d,1}(G_{d,q})>V_{d,1}(G_{d,q}^{B})$ directly. We prove the inequality (B.39) for the orders $q=2,4$ below.

Proof for $q=2$. Lemma B.6 shows that $R_{d,2}^{B}(x)>0$ for $x>1$. Also, since $G_{d,2}$ is non-negative due to the minimum-sign-change condition and $G_{d,2}^{B}(x)=0$ for $x>1$, the disturbance $Q_{d,2}(x)=G_{d,2}(x)$, $x>1$, has to be non-negative. Thus, (B.39) holds.

Proof for $q=4$. By Lemma B.3, the only sign change of the minimum-sign-change 4-th order kernel $G_{d,4}$ is from positive to negative, at some point $\rho>0$.

If ρ1\rho\leq 1, then Gd,4(x)=Qd,4(x)G_{d,4}(x)=Q_{d,4}(x), x1ρx\geq 1\geq\rho is non-positive. Also, Lemma B.6 shows that Rd,4B(x)<0R_{d,4}^{B}(x)<0 for x1x\geq 1. Thus, the inequality (B.39) holds.

Next we consider the other case, $\rho>1$, in which the following hold:

Gd,4(x)0 for xρ(which implies Qd,4(x)0 for x[1,ρ]),\displaystyle G_{d,4}(x)\geq 0\text{ for }x\leq\rho\,(\text{which implies }Q_{d,4}(x)\geq 0\text{ for }x\in[1,\rho]), (B.40)
Gd,4(x)=Gd,4B(x)+Qd,4(x)=Qd,4(x)0 for xρ(>1).\displaystyle G_{d,4}(x)=G_{d,4}^{B}(x)+Q_{d,4}(x)=Q_{d,4}(x)\leq 0\text{ for }x\geq\rho(>1). (B.41)

From the equation

0=Bd,2(Gd,4)=01xd+1Gd,4(x)𝑑x0(B.40)+1xd+1Qd,4(x)𝑑x,\displaystyle 0=B_{d,2}(G_{d,4})=\underbrace{\int_{0}^{1}x^{d+1}G_{d,4}(x)\,dx}_{\geq 0~{}\because\eqref{eq:B.3}}+\int_{1}^{\infty}x^{d+1}Q_{d,4}(x)\,dx, (B.42)

one has

1xd+1Qd,4(x)𝑑x0,\displaystyle\int_{1}^{\infty}x^{d+1}Q_{d,4}(x)\,dx\leq 0, (B.43)

which leads us to obtain

ρxd+1Qd,4(x)𝑑x=1xd+1Qd,4(x)𝑑x1ρxd+1Qd,4(x)𝑑x0.\displaystyle\int_{\rho}^{\infty}x^{d+1}Q_{d,4}(x)\,dx=\int_{1}^{\infty}x^{d+1}Q_{d,4}(x)\,dx-\int_{1}^{\rho}x^{d+1}Q_{d,4}(x)\,dx\leq 0. (B.44)

On the basis of the mean value theorem of integration, one gets

1xd1Rd,4B(x)Qd,4(x)𝑑x=Rd,4B(ζ1)ζ121ρxd+1Qd,4(x)𝑑x+Rd,4B(ζ2)ζ22ρxd+1Qd,4(x)𝑑x=Rd,4B(ζ1)ζ12<01xd+1Qd,4(x)𝑑x0(B.43)+(Rd,4B(ζ2)ζ22Rd,4B(ζ1)ζ12)<0ρxd+1Qd,4(x)𝑑x0(B.44)0,\displaystyle\begin{split}&\int_{1}^{\infty}x^{d-1}R_{d,4}^{B}(x)Q_{d,4}(x)\,dx=\frac{R_{d,4}^{B}(\zeta_{1})}{\zeta_{1}^{2}}\int_{1}^{\rho}x^{d+1}Q_{d,4}(x)\,dx+\frac{R_{d,4}^{B}(\zeta_{2})}{\zeta_{2}^{2}}\int_{\rho}^{\infty}x^{d+1}Q_{d,4}(x)\,dx\\ &=\underbrace{\frac{R_{d,4}^{B}(\zeta_{1})}{\zeta_{1}^{2}}}_{<0}\underbrace{\int_{1}^{\infty}x^{d+1}Q_{d,4}(x)\,dx}_{\leq 0~{}\because\eqref{eq:B.5}}+\underbrace{\biggl{(}\frac{R_{d,4}^{B}(\zeta_{2})}{\zeta_{2}^{2}}-\frac{R_{d,4}^{B}(\zeta_{1})}{\zeta_{1}^{2}}\biggr{)}}_{<0}\underbrace{\int_{\rho}^{\infty}x^{d+1}Q_{d,4}(x)\,dx}_{\leq 0~{}\because\eqref{eq:B.6}}\geq 0,\end{split} (B.45)

where $1<\zeta_{1}<\rho<\zeta_{2}$, and where we use $R_{d,4}^{B}(1)<0$ and the fact that $R_{d,4}^{B}(x)/x^{2}$ is monotonically decreasing for $x>1$, both of which result from Lemma B.6. This inequality completes the proof. ∎

B.3 Proof of Theorem 3

B.3.1 Reformulation of Problem 3

In this section, we provide a proof of Theorem 3, which is a multivariate RK extension of the existing results by (Granovsky and Müller, 1991). We reformulate Problem 3 as Problem B.2, to which the proofs in the existing works apply almost directly.

Problem 3 includes both the function GG and derivative G(1)G^{(1)} in its formulation, which makes the problem difficult to handle. Under the condition (P3-5) (i.e., xd+qG(x)0x^{d+q}G(x)\to 0 as xx\to\infty), integration-by-parts yields the equation,

Bd,i(G)=0xd1+iG(x)𝑑x=[1d+ixd+iG(x)]01d+i0xd+iG(1)(x)𝑑x=1d+iBd,i+1(G(1)),foriIq.\displaystyle\begin{split}B_{d,i}(G)&=\int_{0}^{\infty}x^{d-1+i}G(x)\,dx=\biggl{[}\frac{1}{d+i}x^{d+i}G(x)\biggr{]}_{0}^{\infty}-\frac{1}{d+i}\int_{0}^{\infty}x^{d+i}G^{(1)}(x)\,dx\\ &=-\frac{1}{d+i}B_{d,i+1}(G^{(1)}),\quad\text{for}\;i\in I_{q}.\end{split} (B.46)

This equation allows us to rewrite the objective function (P3) as a functional of $G^{(1)}$ only, and leads to the following lemma, which allows us to represent the moment condition (P3-1) in terms of $G^{(1)}$.

Lemma B.7.

For d1d\geq 1 and even q2q\geq 2, assume that xd+qG(x)0x^{d+q}G(x)\to 0 holds as xx\to\infty and that G(1)G^{(1)} is continuous a.e. Then the following two conditions are equivalent:

  • (i)

    Gd,0,qG\in{\mathcal{M}}_{d,0,q} and GG is differentiable.

  • (ii)

    G(1)d,1,q+1G^{(1)}\in{\mathcal{M}}_{d,1,q+1}.

Thus, the following problem is equivalent to Problem 3 in the sense that a solution of one of the problems is a solution of the other.

Problem B.1.

Find G(x)=xF(t)𝑑tG^{*}(x)=-\int_{x}^{\infty}F^{*}(t)\,dt, where FF^{*} is a solution of the following optimization problem, with G(x):-xF(t)𝑑tG(x)\coloneq-\int_{x}^{\infty}F(t)\,dt:

minF\displaystyle\mathop{\rm min}\limits_{F}\quad Bd,q+12(d+2)(F)Vd,02q(F),\displaystyle B^{2(d+2)}_{d,q+1}(F)\cdot V^{2q}_{d,0}(F), (P4)
s.t.\displaystyle\mathop{\rm s.t.}\limits\quad Fd,1,q+1,\displaystyle F\in{\mathcal{M}}_{d,1,q+1}, (P4-1)
Fis integrable,\displaystyle F\;\text{is integrable}, (P4-2)
F𝒩q21,\displaystyle F\in{\mathcal{N}}_{\frac{q}{2}-1}, (P4-3)
Vd,0(F)<,\displaystyle V_{d,0}(F)<\infty, (P4-4)
there exists δ>0 s.t. xd+q+δG(x)0asx.\displaystyle\text{there exists }\delta>0\text{ s.t.~{}}x^{d+q+\delta}G(x)\to 0\;\text{as}\;x\to\infty. (P4-5)

The kernel optimization problem B.1 still has a scale indeterminacy in its solution, as also described in Footnote 3. Indeed, one has the following lemma.

Lemma B.8.

Let d1d\geq 1 and q=2,4,q=2,4,\ldots For a function FF and s>0s>0, we define a scale-transformed function Fs():-s(d+1)F(/s)F_{s}(\cdot)\coloneq s^{-(d+1)}F(\cdot/s). Then, one has

Bd,i(Fs)=si1Bd,i(F),Vd,0(Fs)=s(d+2)Vd,0(F).\displaystyle B_{d,i}(F_{s})=s^{i-1}B_{d,i}(F),\quad V_{d,0}(F_{s})=s^{-(d+2)}V_{d,0}(F). (B.47)
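For completeness, both relations in (B.47) follow from the change of variables $u=x/s$, where we use $B_{d,i}(F)=\int_{0}^{\infty}x^{d-1+i}F(x)\,dx$ as in (B.46) and take $V_{d,0}(F)=\int_{0}^{\infty}x^{d-1}F(x)^{2}\,dx$, consistent with the inner product $\langle\cdot,\cdot\rangle_{d}$ introduced in Appendix B.3.2:

\displaystyle B_{d,i}(F_{s})=\int_{0}^{\infty}x^{d-1+i}s^{-(d+1)}F(x/s)\,dx=\int_{0}^{\infty}(su)^{d-1+i}s^{-(d+1)}F(u)\,s\,du=s^{i-1}B_{d,i}(F),

\displaystyle V_{d,0}(F_{s})=\int_{0}^{\infty}x^{d-1}s^{-2(d+1)}F(x/s)^{2}\,dx=\int_{0}^{\infty}(su)^{d-1}s^{-2(d+1)}F(u)^{2}\,s\,du=s^{-(d+2)}V_{d,0}(F).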

This lemma immediately shows that the objective function (P4) is scale-invariant, that is, for any s>0s>0 one has

Bd,q+12(d+2)(Fs)Vd,02q(Fs)=Bd,q+12(d+2)(F)Vd,02q(F).\displaystyle B_{d,q+1}^{2(d+2)}(F_{s})\cdot V_{d,0}^{2q}(F_{s})=B_{d,q+1}^{2(d+2)}(F)\cdot V_{d,0}^{2q}(F). (B.48)

It also shows that the set d,1,k{\mathcal{M}}_{d,1,k} is scale-invariant, that is, Fd,1,kF\in{\mathcal{M}}_{d,1,k} implies Fsd,1,kF_{s}\in{\mathcal{M}}_{d,1,k} for any s>0s>0. The following Problem B.2 resolves the scale indeterminacy.

Problem B.2.

Find G(x)=xF(t)𝑑tG^{*}(x)=-\int_{x}^{\infty}F^{*}(t)\,dt, where FF^{*} is a solution of the following optimization problem for a given constant C>0C>0, with G(x):-xF(t)𝑑tG(x)\coloneq-\int_{x}^{\infty}F(t)\,dt:

minF\displaystyle\mathop{\rm min}\limits_{F}\quad Vd,0(F),\displaystyle V_{d,0}(F), (P5)
s.t.\displaystyle\mathop{\rm s.t.}\limits\quad Fd,1,q+1,\displaystyle F\in{\mathcal{M}}_{d,1,q+1}, (P5-1)
Fis integrable,\displaystyle F\;\text{is integrable}, (P5-2)
F𝒩q21,\displaystyle F\in{\mathcal{N}}_{\frac{q}{2}-1}, (P5-3)
Vd,0(F)<,\displaystyle V_{d,0}(F)<\infty, (P5-4)
there exists δ>0 s.t. xd+q+δG(x)0asx,\displaystyle\text{there exists }\delta>0\text{ s.t.~{}}x^{d+q+\delta}G(x)\to 0\;\text{as}\;x\to\infty, (P5-5)
|Bd,q+1(F)|=(d+q)C.\displaystyle|B_{d,q+1}(F)|=(d+q)C. (P5-6)

B.3.2 Auxiliary Lemmas

Here we introduce notations and lemmas that will be used for proofs in the next sections. Let dd be a positive integer. For functions f,gf,g defined on 0=[0,)\mathbb{R}_{\geq 0}=[0,\infty), define

f,gd:-0xd1f(x)g(x)𝑑x.\displaystyle\langle f,g\rangle_{d}\coloneq\int_{0}^{\infty}x^{d-1}f(x)g(x)\,dx. (B.49)

Let L2(0,xd1)L^{2}(\mathbb{R}_{\geq 0},x^{d-1}) be the space of square-integrable functions on 0\mathbb{R}_{\geq 0} with weight xd1x^{d-1}. This is a Hilbert space equipped with the inner product ,d\langle\cdot,\cdot\rangle_{d}. Note however that, for any nonnegative integer ii, one has xiL2(0,xd1)x^{i}\not\in L^{2}(\mathbb{R}_{\geq 0},x^{d-1}), as the integral of (xi)2(x^{i})^{2} over 0\mathbb{R}_{\geq 0} with weight xd1x^{d-1} diverges.

With a measurable subset SS of 0\mathbb{R}_{\geq 0}, define

f,gS,d:-Sxd1f(x)g(x)𝑑x,\displaystyle\langle f,g\rangle_{S,d}\coloneq\int_{S}x^{d-1}f(x)g(x)\,dx, (B.50)

and let L2(S,xd1)L^{2}(S,x^{d-1}) be the space of square-integrable functions on SS with weight xd1x^{d-1}. Let DD be a compact measurable subset of 0\mathbb{R}_{\geq 0}, and let Dc=0\DD^{c}=\mathbb{R}_{\geq 0}\backslash D. One has that L2(D,xd1)L^{2}(D,x^{d-1}) and L2(Dc,xd1)L^{2}(D^{c},x^{d-1}) are both subspaces of L2(0,xd1)L^{2}(\mathbb{R}_{\geq 0},x^{d-1}), and that the map V:ff𝟙xDf𝟙xDcV:f\to f\mathbbm{1}_{x\in D}\oplus f\mathbbm{1}_{x\in D^{c}} from L2(0,xd1)L^{2}(\mathbb{R}_{\geq 0},x^{d-1}) to L2(D,xd1)L2(Dc,xd1)L^{2}(D,x^{d-1})\oplus L^{2}(D^{c},x^{d-1}) is an isomorphism (Conway, 2007, page 25). We therefore identify L2(0,xd1)L^{2}(\mathbb{R}_{\geq 0},x^{d-1}) and L2(D,xd1)L2(Dc,xd1)L^{2}(D,x^{d-1})\oplus L^{2}(D^{c},x^{d-1}) via this isomorphism.

Let II be a finite subset of 0\mathbb{Z}_{\geq 0}, and let

MD,I:-span{xi𝟙xD,iI}.\displaystyle M_{D,I}\coloneq\mathop{\mathrm{span}}\{x^{i}\mathbbm{1}_{x\in D},i\in I\}. (B.51)

For fixed DD and II, MD,IM_{D,I} is a subspace of L2(D,xd1)L^{2}(D,x^{d-1}) with dimension equal to |I||I|.

Lemma B.9.

Assume that a function f0L2(0,xd1)f_{0}\in L^{2}(\mathbb{R}_{\geq 0},x^{d-1}) satisfies f0,fd=0\langle f_{0},f\rangle_{d}=0 for any ff with supp(f)D\operatorname{supp}(f)\subset D and f,xid=0\langle f,x^{i}\rangle_{d}=0 for all iIi\in I. Then f0𝟙xDMD,If_{0}\mathbbm{1}_{x\in D}\in M_{D,I}.

Proof.

In view of the above direct-sum decomposition of L2(0,xd1)L^{2}(\mathbb{R}_{\geq 0},x^{d-1}), the subspace MD,IM_{D,I} is regarded as the restriction of

M¯D,I=span{xi𝟙xD,iI}L2(Dc,xd1)\displaystyle\bar{M}_{D,I}=\mathop{\mathrm{span}}\{x^{i}\mathbbm{1}_{x\in D},i\in I\}\oplus L^{2}(D^{c},x^{d-1}) (B.52)

to L2(D,xd1)L^{2}(D,x^{d-1}). The orthogonal complement of M¯D,I\bar{M}_{D,I} in L2(0,xd1)L^{2}(\mathbb{R}_{\geq 0},x^{d-1}) is given, via Lemma B.10 below, by

\displaystyle\begin{split}\bar{M}_{D,I}^{\perp}&=(\mathop{\mathrm{span}}\{x^{i}\mathbbm{1}_{x\in D},i\in I\})^{\perp}\cap L^{2}(D^{c},x^{d-1})^{\perp}\\ &=\{f\in L^{2}(\mathbb{R}_{\geq 0},x^{d-1}):\langle f,x^{i}\rangle_{D,d}=0\;\text{for all}\;i\in I\}\cap\{f:\operatorname{supp}(f)\subset D\}\\ &=\{f\in L^{2}(\mathbb{R}_{\geq 0},x^{d-1}):\operatorname{supp}(f)\subset D,\ \langle f,x^{i}\rangle_{d}=0\;\text{for all}\;i\in I\}.\end{split} (B.53)

From the assumption of this lemma, the function $f_{0}$ belongs to the orthogonal complement $\bar{M}_{D,I}^{\perp\perp}$ of $\bar{M}_{D,I}^{\perp}$. On the other hand, $\bar{M}_{D,I}$ is a closed subspace of $L^{2}(\mathbb{R}_{\geq 0},x^{d-1})$ thanks to (Rudin, 1991, Theorem 1.42), since $M_{D,I}$ is finite-dimensional. One can therefore conclude that $\bar{M}_{D,I}^{\perp\perp}=\bar{M}_{D,I}$ holds, implying $f_{0}\in\bar{M}_{D,I}$, and the statement of the lemma follows. ∎

Lemma B.10.

Let HH be a Hilbert space, and let A,BA,B be subsets of HH satisfying 0AB0\in A\cap B. Then (A+B)=AB(A+B)^{\perp}=A^{\perp}\cap B^{\perp} holds.

Proof.

Since 0B0\in B, one has AA+BA\subset A+B and hence (A+B)A(A+B)^{\perp}\subset A^{\perp}. One similarly has (A+B)B(A+B)^{\perp}\subset B^{\perp}, and thus (A+B)AB(A+B)^{\perp}\subset A^{\perp}\cap B^{\perp}. On the other hand, take xABx\in A^{\perp}\cap B^{\perp} and yA+By\in A+B. Noting that yy is represented as y=a+by=a+b with aAa\in A and bBb\in B, one has

x,y=x,a+x,b=0,\displaystyle\langle x,y\rangle=\langle x,a\rangle+\langle x,b\rangle=0, (B.54)

where the final equality is due to xAx\in A^{\perp} and xBx\in B^{\perp}, implying that x(A+B)x\in(A+B)^{\perp} holds. This proves AB(A+B)A^{\perp}\cap B^{\perp}\subset(A+B)^{\perp}. ∎

It should be noted that the above lemma holds regardless of the dimensionality or the closedness of A,BA,B.

B.3.3 Proof that the Optimal Solution Is a Polynomial Kernel

The following lemma is an RK extension of (Granovsky and Müller, 1989, Theorem), and it states the functional form that an optimal solution of Problem B.2 should take.

Lemma B.11.

For $d\geq 1$ and even $q\geq 2$, a solution of Problem B.2 takes the form $\bar{p}_{q+1,\tau}(\cdot)\coloneq p_{q+1}(\cdot)\mathbbm{1}_{\cdot\leq\tau}$, where $\tau>0$ is determined by the constant $C$ in Problem B.2, and where $p_{q+1}$ is a polynomial with terms of degree in $I_{q+1}$ and $\bar{p}_{q+1,\tau}$ has no discontinuity (i.e., $p_{q+1}(\tau)=0$).

Proof of Lemma B.11.

In this proof, we work with Problem B.2. To briefly represent classes of kernel functions that satisfy various combinations of the conditions, we introduce a new notation: we let ${\mathcal{C}}_{i_{1},\ldots,i_{j}}$ denote the class of functions satisfying the conditions $(\mathrm{P5\text{-}}i_{1}),\ldots,(\mathrm{P5\text{-}}i_{j})$ in Problem B.2. ${\mathcal{C}}_{4}$ equals $L^{2}(\mathbb{R}_{\geq 0},x^{d-1})$. Moreover, we use the abbreviation ${\mathcal{B}}\coloneq{\mathcal{C}}_{1,2,3,4,5,6}$.

Step 1

Define V=inffVd,0(f)V=\inf_{f\in{\mathcal{B}}}V_{d,0}(f), and let {fn}\{f_{n}\in{\mathcal{B}}\} be a sequence of functions satisfying Vd,0(fn)V<ϵnV_{d,0}(f_{n})-V<\epsilon_{n} with positive numbers {ϵn}\{\epsilon_{n}\} satisfying ϵn0\epsilon_{n}\to 0 as nn\to\infty. The function set {f𝒞4:Vd,0(f)const}\{f\in{\mathcal{C}}_{4}:V_{d,0}(f)\leq\text{const}\} is weakly compact as a bounded set in the Hilbert space 𝒞4{\mathcal{C}}_{4} equipped with the scalar product ,d\langle\cdot,\cdot\rangle_{d}. It is thus found that there exists a subsequence of fnf_{n} weakly converging to a function f𝒞4f_{*}\in{\mathcal{C}}_{4}. Note that this convergence particularly shows fn,fdf,fd\langle f_{n},f_{*}\rangle_{d}\to\langle f_{*},f_{*}\rangle_{d}, and one has

f,fd2=limnf,fnd2limnfn,fndf,fd=Vf,fd,\displaystyle\langle f_{*},f_{*}\rangle_{d}^{2}=\lim_{n\to\infty}\langle f_{*},f_{n}\rangle_{d}^{2}\leq\lim_{n\to\infty}\langle f_{n},f_{n}\rangle_{d}\langle f_{*},f_{*}\rangle_{d}=V\langle f_{*},f_{*}\rangle_{d}, (B.55)

which implies f,fd=V\langle f_{*},f_{*}\rangle_{d}=V.

In the rest of Step 1, we show that ff_{*} satisfies (P5-1), (P5-3), and (P5-6): fn𝒞1,6f_{n}\in{\mathcal{C}}_{1,6} implies that

0xd1+ifn(x)𝑑x=ci,iIq+1,\displaystyle\int_{0}^{\infty}x^{d-1+i}f_{n}(x)\,dx=c_{i},\quad i\in I_{q+1}, (B.56)

where $c_{i}$, $i\in I_{q+1}$, are $i$-dependent constants. Introducing $g_{n}(x)\coloneq-\int_{x}^{\infty}f_{n}(t)\,dt$ and an $x$-independent constant $c>0$, from the condition (P5-5) for $G=g_{n}$ (that is, $|g_{n}(x)|=o(x^{-(d+q+\delta)})$), one has

limM|Mxd1+ifn(x)𝑑x|limM{|[xd1+ign(x)]M|+M(d1+i)xd2+i|gn(x)|dx}limMcMx(q+2+δi)𝑑xlimMcMx(1+δ)𝑑x=0,iIq+1.\displaystyle\begin{split}&\lim_{M\to\infty}\biggl{|}\int_{M}^{\infty}x^{d-1+i}f_{n}(x)\,dx\biggr{|}\leq\lim_{M\to\infty}\biggl{\{}\biggl{|}\biggl{[}x^{d-1+i}g_{n}(x)\biggr{]}_{M}^{\infty}\biggr{|}+\int_{M}^{\infty}(d-1+i)x^{d-2+i}|g_{n}(x)|\,dx\biggr{\}}\\ &\leq\lim_{M\to\infty}c\int_{M}^{\infty}x^{-(q+2+\delta-i)}\,dx\leq\lim_{M\to\infty}c\int_{M}^{\infty}x^{-(1+\delta)}\,dx=0,\quad i\in I_{q+1}.\end{split} (B.57)

Because $f_{n}$ converges weakly to $f_{*}$ and $x^{i}\mathbbm{1}_{x\leq M}\in{\mathcal{C}}_{4}$, one has

0Mxd1+if(x)𝑑x=limn0Mxd1+ifn(x)𝑑x=limn{ciMxd1+ifn(x)𝑑x}.\displaystyle\int_{0}^{M}x^{d-1+i}f_{*}(x)\,dx=\lim_{n\to\infty}\int_{0}^{M}x^{d-1+i}f_{n}(x)\,dx=\lim_{n\to\infty}\left\{c_{i}-\int_{M}^{\infty}x^{d-1+i}f_{n}(x)\,dx\right\}. (B.58)

From these results, one has

\displaystyle\int_{\mathbb{R}_{\geq 0}}x^{d-1+i}f_{*}(x)\,dx=\lim_{M\to\infty}\int_{0}^{M}x^{d-1+i}f_{*}(x)\,dx=c_{i},\quad i\in I_{q+1}, (B.59)

which implies that $f_{*}$ satisfies (P5-1) and (P5-6) for $F=f_{*}$. Moreover, since $f_{n}\in{\mathcal{N}}_{\frac{q}{2}-1}$, it also holds that $f_{*}\in{\mathcal{N}}_{\frac{q}{2}-1}$ (P5-3): otherwise $f_{*}$ and $f_{n}$ would have an interval (say $S$) on which their signs do not coincide (so that $\langle\mathbbm{1}_{x\in S},f_{n}\rangle_{d}\neq\langle\mathbbm{1}_{x\in S},f_{*}\rangle_{d}$ with $\mathbbm{1}_{x\in S}\in{\mathcal{C}}_{4}$), contradicting the weak convergence of $f_{n}$ to $f_{*}$.

Step 2

On the basis of the fact that $f_{*}\in{\mathcal{C}}_{1,3,4,6}$, proved in Step 1, we now derive the functional form of $f_{*}$. We introduce the function class

=𝒞4{f:Bd,i(f)=0for alliIq+1}.\displaystyle{\mathcal{F}}={\mathcal{C}}_{4}\cap\{f:B_{d,i}(f)=0\;\text{for all}\;i\in I_{q+1}\}. (B.60)

Then, for any bounded, not-identically-zero function $f\in{\mathcal{F}}$ satisfying $\operatorname{supp}(f)\subset D_{*}$, where $\operatorname{supp}(f)$ denotes the support of $f$ and $D_{*}\coloneq\operatorname{supp}(f_{*})$, one has $\bar{f}\coloneq f_{*}+\delta f\in{\mathcal{C}}_{1,3,4,6}$ whenever $|\delta|$ is sufficiently small. For the optimality of $f_{*}$, the condition $\langle f_{*},f\rangle_{d}=0$ for all such $f\in{\mathcal{F}}$ is necessary, because $|\delta|\gg\delta^{2}$ for small $|\delta|$,

Vd,0(f¯)Vd,0(f)=2δf,fd+δ2Vd,0(f)0\displaystyle V_{d,0}(\bar{f})-V_{d,0}(f_{*})=2\delta\langle f_{*},f\rangle_{d}+\delta^{2}V_{d,0}(f)\geq 0 (B.61)

holds for both signs of $\delta$, and $f_{*}$ is not identically zero, since $V=V_{d,0}(f_{*})$ is bounded away from $0$ by an argument by contradiction based on the condition (P5-3) for $F=f_{*}$ and Lemma B.2. This implies that $f_{*}$ lies in the orthogonal complement of ${\mathcal{F}}\cap\{f:\operatorname{supp}(f)\subset D_{*}\}$; applying Lemma B.9 then shows $f_{*}\in\mathop{\mathrm{span}}\{x^{i}\mathbbm{1}_{x\in D_{*}},i\in I_{q+1}\}$. This in turn implies that $f_{*}$ takes the form $p_{q+1}(x)\mathbbm{1}_{x\in D_{*}}$, where $p_{q+1}$ is a polynomial with terms of degree in $I_{q+1}$. Also, the boundedness of $V_{d,0}(f_{*})$, combined with the functional form of $f_{*}$ derived above, implies that the support $D_{*}$ of $f_{*}$ is bounded.

Step 3

Hereafter we show that an optimal solution ff_{*} has no discontinuity and that, for a polynomial pq+1p_{q+1} satisfying pq+1(τ)=0p_{q+1}(\tau)=0, ff_{*} takes a form

f(x)=p¯q+1,τ(x):-pq+1(x)𝟙xτ.\displaystyle f_{*}(x)=\bar{p}_{q+1,\tau}(x)\coloneq p_{q+1}(x)\mathbbm{1}_{x\leq\tau}. (B.62)

Then, we prove the following lemma.

Lemma B.12.

Let SS be a subset of 0\mathbb{R}_{\geq 0} with μ(S)(0,)\mu(S)\in(0,\infty), where μ\mu denotes the Lebesgue measure on 0\mathbb{R}_{\geq 0}. Let ff_{*} be a function defined in (B.62) and DD_{*} be the support of ff_{*}. Assume that pq+1(x)p_{q+1}(x) takes the same sign over SS, and that f()f_{*}(\cdot) and pq+1()𝟙SDp_{q+1}(\cdot)\mathbbm{1}_{\cdot\in S\cup D_{*}} share the same sign-change pattern. Then either μ(SD)\mu(S\cap D_{*}) or μ(SDc)\mu(S\cap D_{*}^{c}) is zero.

Proof of Lemma B.12.

Assume to the contrary that both $\mu(S\cap D_{*})$ and $\mu(S\cap D_{*}^{c})$ are strictly positive. Since $\mu(S\cap D_{*})>0$, one can take an ordered partition $\{P_{j}\}_{j=1,\ldots,k+1}$ of $S\cap D_{*}$, with $k\coloneq\frac{q}{2}$, such that $\mu(P_{j})\in(0,\infty)$ for all $j=1,\ldots,k+1$ and such that, for $x\in P_{i}$ and $y\in P_{j}$ with $i<j$, one has $x<y$ almost surely. We let $\mathrm{M}=(m_{ij})$ and $\bm{b}=(b_{1},\ldots,b_{k+1})^{\top}$ with

mij:-Pjxd+2i2f(x)𝑑x,i,j=1,2,,k+1,\displaystyle m_{ij}\coloneq\int_{P_{j}}x^{d+2i-2}f_{*}(x)\,dx,\quad i,j=1,2,\ldots,k+1, (B.63)
bi:-SDcxd+2i2𝑑x,i=1,2,,k+1.\displaystyle b_{i}\coloneq\int_{S\cap D_{*}^{c}}x^{d+2i-2}\,dx,\quad i=1,2,\ldots,k+1. (B.64)

One has 0<bi<0<b_{i}<\infty due to the assumption 0<μ(SDc)μ(S)<0<\mu(S\cap D_{*}^{c})\leq\mu(S)<\infty. Furthermore, let 𝒄=(c1,,ck+1)\bm{c}=(c_{1},\ldots,c_{k+1})^{\top} and define

f𝒄(x)=j=1k+1cjpq+1(x)𝟙xPj.\displaystyle f_{\bm{c}}(x)=\sum_{j=1}^{k+1}c_{j}p_{q+1}(x)\mathbbm{1}_{x\in P_{j}}. (B.65)

Note that f𝒄f_{\bm{c}} is equivalent to ff_{*} on SS when 𝒄=𝟏k+1\bm{c}=\bm{1}_{k+1}.

We now want to choose $\bm{c}$ so that $f_{\bm{c}}(x)$ and the function $\mathbbm{1}_{x\in S\cap D_{*}^{c}}$ share the same moments of orders in $I_{q+1}$. This is achieved if $\bm{c}$ satisfies $\mathrm{M}\bm{c}=\bm{b}$, since the $(2i-1)$-st moments of $f_{\bm{c}}(x)$ and $\mathbbm{1}_{x\in S\cap D_{*}^{c}}$ are $\sum_{j=1}^{k+1}m_{ij}c_{j}$ and $b_{i}$, respectively. Applying Lemma B.1 with $w(x)=x^{d}f_{*}(x)$ shows that $\mathrm{M}$ is invertible, so that the choice $\bm{c}=\mathrm{M}^{-1}\bm{b}$ makes the moments of $f_{\bm{c}}(x)$ and $\mathbbm{1}_{x\in S\cap D_{*}^{c}}$ coincide.

With the above 𝒄\bm{c}, let f(x)=s(𝟙xSDcf𝒄(x))f(x)=s\cdot(\mathbbm{1}_{x\in S\cap D_{*}^{c}}-f_{\bm{c}}(x)), where s{1,1}s\in\{-1,1\} is the sign of pq+1p_{q+1} on SS. One has ff\in{\mathcal{F}} by construction, which implies pq+1,fd=0\langle p_{q+1},f\rangle_{d}=0. On the other hand, one has

SDcxd1pq+1(x)f(x)𝑑x=SDcxd1pq+1(x)s𝑑x>0.\displaystyle\int_{S\cap D_{*}^{c}}x^{d-1}p_{q+1}(x)f(x)\,dx=\int_{S\cap D_{*}^{c}}x^{d-1}p_{q+1}(x)s\,dx>0. (B.66)

Since ff vanishes outside SS, we have

f,fd=SDxd1pq+1(x)f(x)𝑑x=pq+1,fdSDcxd1pq+1(x)s𝑑x<0.\displaystyle\langle f_{*},f\rangle_{d}=\int_{S\cap D_{*}}x^{d-1}p_{q+1}(x)f(x)\,dx=\langle p_{q+1},f\rangle_{d}-\int_{S\cap D_{*}^{c}}x^{d-1}p_{q+1}(x)s\,dx<0. (B.67)

For a sufficiently small δ>0\delta>0, the function f+δff_{*}+\delta f has the same sign ss as pq+1p_{q+1} on SS, and it is equal to ff_{*} outside SS. Therefore, ff_{*} and f+δff_{*}+\delta f share the same sign-change pattern, and one consequently has f+δf𝒞1,3,4,6f_{*}+\delta f\in{\mathcal{C}}_{1,3,4,6}. The above inequality f,fd<0\langle f_{*},f\rangle_{d}<0 implies that for a sufficiently small δ>0\delta>0 one has Vd,0(f+δf)<Vd,0(f)V_{d,0}(f_{*}+\delta f)<V_{d,0}(f_{*}), leading to contradiction. ∎

We consider, as in Definition B.1, subintervals partitioning $\mathbb{R}_{\geq 0}$ according to the sign of $p_{q+1}$. Since $p_{q+1}$ is a polynomial of degree $(q+1)$ with odd-degree terms only, the number of such subintervals is at most $(\frac{q}{2}+1)$. Lemma B.12 states that, for each of the subintervals, the support $D_{*}$ of $f_{*}$ either contains its interior as a whole or excludes it, up to Lebesgue-null sets; it cannot contain only a proper fraction of it. Since $f_{*}\in{\mathcal{N}}_{\frac{q}{2}-1}$ and $D_{*}$ is bounded, the only possibility is therefore that the closure of $D_{*}$ equals $[0,\tau]$, where $\tau$ is the largest zero of $p_{q+1}$, that is, $f_{*}=\bar{p}_{q+1,\tau}$.

In Lemma B.13 below, we will show that such a function $f_{*}$ exists uniquely and that it also satisfies the conditions (P5-2) and (P5-5) (besides the conditions confirmed in Step 1). Thus we conclude that a function of the form $\bar{p}_{q+1,\tau}$ is a solution of Problem B.2. ∎

B.3.4 Proof that the Optimal Polynomial Kernel Is Uniquely Determined

The following lemma is an RK extension of (Granovsky and Müller, 1989, Lemma 2), and shows that the polynomial function suggested by the above lemma is uniquely determined.

Lemma B.13.

For d1d\geq 1 and even q2q\geq 2, there exists a unique constant C0C\neq 0 and a unique polynomial pq+1p_{q+1} with terms of degree in Iq+1I_{q+1}, such that p¯q+1,1():-pq+1()𝟙1\bar{p}_{q+1,1}(\cdot)\coloneq p_{q+1}(\cdot)\mathbbm{1}_{\cdot\leq 1} satisfies

p¯q+1,1d,1,q+1,0xd+qp¯q+1,1(x)𝑑x=C,pq+1(1)=0.\displaystyle\bar{p}_{q+1,1}\in{\mathcal{M}}_{d,1,q+1},\quad\int_{\mathbb{R}_{\geq 0}}x^{d+q}\bar{p}_{q+1,1}(x)\,dx=C,\quad p_{q+1}(1)=0. (B.68)

Also, the resulting polynomial pq+1p_{q+1} is such that p¯q+1,1=Gd,qB(1)\bar{p}_{q+1,1}=G_{d,q}^{B(1)}.

Here, for simplicity of description, suppose that $C$ is given such that $\tau$ equals $1$; this is possible because $\tau$ in $\bar{p}_{q+1,\tau}$ and $|C|$ in (P5-6) are in one-to-one correspondence, as can be seen from Lemma B.8 on the scale transformation. We then show that the constant $C$ satisfying the conditions (in place of $\tau$) is uniquely determined.

Proof of Lemma B.13.

For odd $j\geq 1$, we introduce the following polynomials, which are built from the Jacobi polynomials and normalized on $[0,1]$ with respect to $\langle\cdot,\cdot\rangle_{d}$:

Qj(x)=Cd,jxPj12(d2,0)(12x2),\displaystyle Q_{j}(x)=C_{d,j}xP_{\frac{j-1}{2}}^{(\frac{d}{2},0)}(1-2x^{2}), (B.69)

where Cd,j=d+2jC_{d,j}=\sqrt{d+2j} is a normalization constant, and define Q¯j(x)=Qj(x)𝟙x1\bar{Q}_{j}(x)=Q_{j}(x)\mathbbm{1}_{x\leq 1}. It is straightforward to show that {Q¯j}j=1,3,\{\bar{Q}_{j}\}_{j=1,3,\ldots} satisfies the orthogonality relation

Q¯j,Q¯kd=01xd1Qj(x)Qk(x)𝑑x=Cd,jCd,k01xd+1Pj12(d2,0)(12x2)Pk12(d2,0)(12x2)𝑑x=Cd,jCd,k2d2+211(1t)d2Pj12(d2,0)(t)Pk12(d2,0)(t)𝑑t=𝟙j=kfor oddj,k1,\displaystyle\begin{split}\langle\bar{Q}_{j},\bar{Q}_{k}\rangle_{d}&=\int_{0}^{1}x^{d-1}Q_{j}(x)Q_{k}(x)\,dx\\ &=C_{d,j}C_{d,k}\int_{0}^{1}x^{d+1}P_{\frac{j-1}{2}}^{(\frac{d}{2},0)}(1-2x^{2})P_{\frac{k-1}{2}}^{(\frac{d}{2},0)}(1-2x^{2})\,dx\\ &=\frac{C_{d,j}C_{d,k}}{2^{\frac{d}{2}+2}}\int_{-1}^{1}(1-t)^{\frac{d}{2}}P_{\frac{j-1}{2}}^{(\frac{d}{2},0)}(t)P_{\frac{k-1}{2}}^{(\frac{d}{2},0)}(t)\,dt\\ &=\mathbbm{1}_{j=k}\quad\text{for odd}\;j,k\geq 1,\end{split} (B.70)

where the last equality follows from the orthogonality relation of the Jacobi polynomials:

11(1t)α(1+t)βPm(α,β)(t)Pn(α,β)(t)𝑑t=2α+β+12n+α+β+1Γ(n+α+1)Γ(n+β+1)Γ(n+α+β+1)Γ(n+1)𝟙m=n.\displaystyle\int_{-1}^{1}(1-t)^{\alpha}(1+t)^{\beta}P_{m}^{(\alpha,\beta)}(t)P_{n}^{(\alpha,\beta)}(t)\,dt=\frac{2^{\alpha+\beta+1}}{2n+\alpha+\beta+1}\frac{\Gamma(n+\alpha+1)\Gamma(n+\beta+1)}{\Gamma(n+\alpha+\beta+1)\Gamma(n+1)}\mathbbm{1}_{m=n}. (B.71)
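As a quick quadrature check of the orthonormality (B.70) (a sketch; the dimension below is an illustrative choice), one may evaluate $\langle\bar{Q}_{j},\bar{Q}_{k}\rangle_{d}$ numerically:

```python
# Sketch: quadrature check of the orthonormality (B.70) of the functions \bar{Q}_j.
import numpy as np
from scipy.integrate import quad
from scipy.special import eval_jacobi

d = 3   # illustrative dimension

def Q(j, x):   # Q_j(x) = sqrt(d + 2 j) * x * P_{(j-1)/2}^{(d/2, 0)}(1 - 2 x^2), as in (B.69)
    return np.sqrt(d + 2 * j) * x * eval_jacobi((j - 1) // 2, d / 2, 0.0, 1 - 2 * x**2)

for j in (1, 3, 5):
    for k in (1, 3, 5):
        val = quad(lambda x: x**(d - 1) * Q(j, x) * Q(k, x), 0.0, 1.0)[0]
        print(j, k, round(val, 8))   # ~1 when j == k, ~0 otherwise
```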

Since the Jacobi polynomial Pm(α,β)P_{m}^{(\alpha,\beta)} is a polynomial of degree mm, QjQ_{j} is a polynomial with terms of degree in IjI_{j}, so that one can write

Qj(x)=iIjcj,ixi,\displaystyle Q_{j}(x)=\sum_{i\in I_{j}}c_{j,i}x^{i}, (B.72)

where cj,j0c_{j,j}\neq 0. Furthermore, since any polynomial of degree mm is represented as a linear combination of {Pn(α,β)}n=0,1,,m\{P_{n}^{(\alpha,\beta)}\}_{n=0,1,\ldots,m}, any polynomial with terms of degree in IkI_{k} for odd kk is represented as a linear combination of {Qj}j=1,3,,k\{Q_{j}\}_{j=1,3,\ldots,k}. This in particular means that, for odd j3j\geq 3, xix^{i} with iIj2i\in I_{j-2} is represented as a linear combination of {Qj}j=1,3,,i\{Q_{j}\}_{j=1,3,\ldots,i}, so that the orthogonality (B.70) implies that Q¯j\bar{Q}_{j} satisfies the moment conditions Bd,i(Q¯j)=0B_{d,i}(\bar{Q}_{j})=0, iIj2i\in I_{j-2}.

As noted above, any polynomial with terms of degree in $I_{k}$ can be represented as a linear combination of $Q_{j}$, $j=1,3,\ldots,k$. In particular, the polynomial in the statement of the lemma can be written as

pq+1(x)=jIq+1wjQj(x).\displaystyle p_{q+1}(x)=\sum_{j\in I_{q+1}}w_{j}Q_{j}(x). (B.73)

The coefficient wjw_{j} is evaluated, via the moment conditions of p¯q+1,1\bar{p}_{q+1,1}, as

wj=01xd1pq+1(x)Qj(x)𝑑x=01xd1pq+1(x)(iIjcj,ixi)𝑑x=iIjcj,iBd,i(p¯q+1,1)={cj,1C¯,jIq1,cq+1,1C¯+cq+1,q+1C,j=q+1,\displaystyle\begin{split}w_{j}&=\int_{0}^{1}x^{d-1}p_{q+1}(x)Q_{j}(x)\,dx=\int_{0}^{1}x^{d-1}p_{q+1}(x)\biggl{(}\sum_{i\in I_{j}}c_{j,i}x^{i}\biggr{)}\,dx\\ &=\sum_{i\in I_{j}}c_{j,i}B_{d,i}(\bar{p}_{q+1,1})=\begin{cases}c_{j,1}\bar{C},&j\in I_{q-1},\\ c_{q+1,1}\bar{C}+c_{q+1,q+1}C,&j=q+1,\end{cases}\end{split} (B.74)

where C¯:-Bd,1(p¯q+1,1)=dbd1<0\bar{C}\coloneq B_{d,1}(\bar{p}_{q+1,1})=-db_{d}^{-1}<0. Therefore, pq+1p_{q+1} has a unique representation,

pq+1(x)=C¯jIq+1cj,1Qj(x)+Ccq+1,q+1Qq+1(x),\displaystyle p_{q+1}(x)=\bar{C}\sum_{j\in I_{q+1}}c_{j,1}Q_{j}(x)+Cc_{q+1,q+1}Q_{q+1}(x), (B.75)

where $C$ is the only undetermined quantity and the other factors are fixed. Imposing the additional condition $p_{q+1}(1)=0$ then determines $C$ uniquely.

One could evaluate (B.75) explicitly to show $\bar{p}_{q+1,1}=G_{d,q}^{B(1)}$; instead, we confirm in the following Lemma B.14 that $G_{d,q}^{B}$ satisfies the equivalent requirements (P3-1)–(P3-5). On the basis of the uniqueness, this completes the proof of the latter half of the statement. ∎

Lemma B.14.

The kernel Gd,qBG_{d,q}^{B} satisfies all the conditions (P3-1)–(P3-5) with G=Gd,qBG=G_{d,q}^{B}.

Proof.

The only difference between (P2-1)–(P2-5) and (P3-1)–(P3-5) is between (P2-3) and (P3-3). One can show that $G_{d,q}^{B}$ satisfies (P3-3) by an argument similar to that in the proof of Lemma B.4. ∎

According to these lemmas, one can see that Problem B.2 gives the optimal solution $-\int_{x}^{\infty}G_{d,q}^{B(1)}(t)\,dt=G_{d,q}^{B}(x)$, whose constant of integration is determined by the end-point condition (P5-5). This is also a solution of Problem 3 due to the equivalence between the problems. This completes the proof of Theorem 3.

C Supplemental Information on Appeared Kernels

Here, we provide supplemental information on the kernels used in our simulation experiments in Section 4. More specifically, we give the expressions of Bd,qB_{d,q}, Vd,0V_{d,0}, and Vd,1V_{d,1} for those kernels.

Proposition C.1.

For the kernel Gd,qBG_{d,q}^{B}, the functionals Bd,qB_{d,q}, Vd,0V_{d,0}, and Vd,1V_{d,1} become

Bd,q(Gd,qB)=(1)q2+1Γ(d+q2)Γ(d+q+42)2πd2Γ(d+2q+42),Vd,0(Gd,qB)=8(3d+4q+4)Γ(d+q2)Γ(d+q+42)πdd(d+2q)2,3Γ(q2)2,Vd,1(Gd,qB)=16Γ(d+q+22)Γ(d+q+42)πd(d+2)(d+2q+2)Γ(q2)2.\displaystyle\begin{split}&B_{d,q}(G_{d,q}^{B})=(-1)^{\frac{q}{2}+1}\frac{\Gamma(\frac{d+q}{2})\Gamma(\frac{d+q+4}{2})}{2\pi^{\frac{d}{2}}\Gamma(\frac{d+2q+4}{2})},\quad V_{d,0}(G_{d,q}^{B})=\frac{8(3d+4q+4)\Gamma(\frac{d+q}{2})\Gamma(\frac{d+q+4}{2})}{\pi^{d}d(d+2q)_{2,3}\Gamma(\frac{q}{2})^{2}},\\ &V_{d,1}(G_{d,q}^{B})=\frac{16\Gamma(\frac{d+q+2}{2})\Gamma(\frac{d+q+4}{2})}{\pi^{d}(d+2)(d+2q+2)\Gamma(\frac{q}{2})^{2}}.\end{split} (C.1)

The following propositions concern the kernels $K_{d,q}^{E}$, $K_{d,q}^{G}$, and $K_{d,q}^{L}$, which are designed as higher-order extensions of the Epanechnikov, Gaussian, and Laplace kernels via the jackknife (Schucany and Sommers, 1977; Wand and Schucany, 1990). These kernels belong to the Epanechnikov, Gaussian, and Laplace hierarchies, respectively, and take the form of a polynomial multiplied by the Epanechnikov, Gaussian, or Laplace kernel.

Proposition C.2.

The kernel Kd,qE(𝐱)=Gd,qE(𝐱)K_{d,q}^{E}({\bm{x}})=G_{d,q}^{E}(\|{\bm{x}}\|) is a dd-variate, qq-th order kernel, where

Gd,qE(u)=Γ(d+q2)πd2Γ(q2)Pq2(d2,1)(12u2)𝟙u1=(1)q2+1Γ(d+q2+1)πd2Γ(q2+1)(1u2)Pq21(1,d2)(2u21)𝟙u1=(1)q2+1Γ(d+q2+1)Γ(d2+2)Γ(q2+1)Pq21(1,d2)(2u21)Gd,2E(u),\displaystyle\begin{split}G_{d,q}^{E}(u)&=\frac{\Gamma(\frac{d+q}{2})}{\pi^{\frac{d}{2}}\Gamma(\frac{q}{2})}P_{\frac{q}{2}}^{(\frac{d}{2},-1)}(1-2u^{2})\mathbbm{1}_{u\leq 1}=(-1)^{\frac{q}{2}+1}\frac{\Gamma(\frac{d+q}{2}+1)}{\pi^{\frac{d}{2}}\Gamma(\frac{q}{2}+1)}(1-u^{2})P_{\frac{q}{2}-1}^{(1,\frac{d}{2})}(2u^{2}-1)\mathbbm{1}_{u\leq 1}\\ &=(-1)^{\frac{q}{2}+1}\frac{\Gamma(\frac{d+q}{2}+1)}{\Gamma(\frac{d}{2}+2)\Gamma(\frac{q}{2}+1)}P_{\frac{q}{2}-1}^{(1,\frac{d}{2})}(2u^{2}-1)\cdot G_{d,2}^{E}(u),\end{split} (C.2)

and the functionals Bd,qB_{d,q}, Vd,0V_{d,0}, and Vd,1V_{d,1} of Gd,qEG_{d,q}^{E} become

\displaystyle\begin{split}
&B_{d,q}(G_{d,q}^{E})=(-1)^{\frac{q}{2}+1}\frac{\Gamma(\frac{d+q}{2})\Gamma(\frac{d+q+2}{2})}{2\pi^{\frac{d}{2}}\Gamma(\frac{d+2q+2}{2})},\quad
V_{d,0}(G_{d,q}^{E})=\frac{4\Gamma(\frac{d+q}{2})\Gamma(\frac{d+q+2}{2})}{\pi^{d}d(d+2q)\Gamma(\frac{q}{2})^{2}},\\
&V_{d,1}(G_{d,q}^{E})=\frac{4\Gamma(\frac{d+q+2}{2})^{2}}{\pi^{d}(d+2)\Gamma(\frac{q}{2})^{2}}.
\end{split} \qquad (C.3)
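As a sanity check of (C.2), the radial profile $G_{d,q}^{E}$ can be evaluated with standard special-function routines. The following Python sketch is our own illustration, not part of the paper; it assumes SciPy is available, the function names are ours, and it uses the second expression in (C.2) via `scipy.special.eval_jacobi` to confirm numerically that $K_{d,q}^{E}$ integrates to one.

```python
# A minimal numerical sketch (not from the paper): evaluating the radial profile
# G^E_{d,q} of the Epanechnikov-hierarchy kernel via the second expression in (C.2),
# and checking that the resulting d-variate kernel integrates to 1.
from math import gamma, pi
from scipy.special import eval_jacobi
from scipy.integrate import quad

def G_E(u, d, q):
    """Radial profile G^E_{d,q}(u) from (C.2); q must be even."""
    if u > 1.0:
        return 0.0
    coef = (-1.0)**(q // 2 + 1) * gamma((d + q) / 2 + 1) / (pi**(d / 2) * gamma(q / 2 + 1))
    return coef * (1.0 - u**2) * eval_jacobi(q // 2 - 1, 1, d / 2, 2.0 * u**2 - 1.0)

def total_mass(d, q):
    """Integral of K^E_{d,q}(x) = G^E_{d,q}(||x||) over R^d, via polar coordinates."""
    surface = 2.0 * pi**(d / 2) / gamma(d / 2)  # surface area of the unit sphere in R^d
    val, _ = quad(lambda u: G_E(u, d, q) * u**(d - 1), 0.0, 1.0)
    return surface * val

if __name__ == "__main__":
    for d, q in [(1, 2), (2, 2), (2, 4), (3, 6)]:
        print(d, q, round(total_mass(d, q), 6))  # each value should be close to 1
```

For $q=2$ this profile reduces to the standard multivariate Epanechnikov kernel $\frac{\Gamma(\frac{d}{2}+2)}{\pi^{d/2}}(1-u^{2})\mathbbm{1}_{u\leq 1}$, which the sketch reproduces.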
Proposition C.3.

The kernel $K_{d,q}^{G}(\bm{x})=G_{d,q}^{G}(\|\bm{x}\|)$ is a $d$-variate, $q$-th order kernel, where

\displaystyle G_{d,2}^{G}(u)=(2\pi)^{-\frac{d}{2}}\exp(-u^{2}/2),\quad G_{d,q}^{G}(u)=L_{\frac{q}{2}-1}^{(\frac{d}{2})}\left(\frac{u^{2}}{2}\right)G_{d,2}^{G}(u), \qquad (C.4)

where $L_{n}^{(\alpha)}$ is the generalized Laguerre polynomial (Szegö, 1939). Also, the functionals $B_{d,q}$, $V_{d,0}$, and $V_{d,1}$ of $G_{d,q}^{G}$ become

\displaystyle\begin{split}
&B_{d,q}(G_{d,q}^{G})=(-2)^{\frac{q}{2}-1}\frac{\Gamma(\frac{d+q}{2})}{\pi^{\frac{d}{2}}},\\
&V_{d,0}(G_{d,q}^{G})=\frac{\{\Gamma(\frac{d+q}{2})\}^{2}}{2^{d+q-1}\pi^{d}}\sum_{i_{1},i_{2}=0}^{q/2-1}(-2)^{i_{1}+i_{2}}\frac{\Gamma(\frac{d}{2}+q-2-i_{1}-i_{2})}{\Gamma(\frac{d+q}{2}-i_{1})\Gamma(\frac{d+q}{2}-i_{2})\Gamma(\frac{q}{2}-i_{1})\Gamma(\frac{q}{2}-i_{2})i_{1}!i_{2}!},\\
&V_{d,1}(G_{d,2}^{G})=\frac{\Gamma(\frac{d}{2}+2)}{(d+2)2^{d}\pi^{d}},\quad V_{d,1}(G_{d,4}^{G})=\frac{(d+10)\Gamma(\frac{d+6}{2})}{(d+2)2^{d+3}\pi^{d}},\\
&V_{d,1}(G_{d,6}^{G})=\frac{(d^{2}+26d+176)\Gamma(\frac{d+8}{2})}{(d+2)2^{d+8}\pi^{d}}.
\end{split} \qquad (C.5)
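Similarly, (C.4) can be evaluated directly with `scipy.special.eval_genlaguerre`. The sketch below is again our own illustration under the same assumptions; it checks numerically that $K_{d,q}^{G}$ integrates to one and that its radial moments of order $2,\ldots,q-2$ vanish, as expected of a $q$-th order kernel.

```python
# A minimal numerical sketch (not from the paper): the Gaussian-hierarchy profile
# G^G_{d,q} of (C.4), with a check that K^G_{d,q}(x) = G^G_{d,q}(||x||) integrates to 1
# and that its radial moments of orders 2, ..., q-2 vanish.
from math import gamma, pi, exp, inf
from scipy.special import eval_genlaguerre
from scipy.integrate import quad

def G_G(u, d, q):
    """Radial profile G^G_{d,q}(u) from (C.4); q must be even."""
    base = (2.0 * pi)**(-d / 2) * exp(-u**2 / 2.0)
    return eval_genlaguerre(q // 2 - 1, d / 2, u**2 / 2.0) * base

def radial_moment(d, q, k):
    """Integral of ||x||^k K^G_{d,q}(x) over R^d, via polar coordinates."""
    surface = 2.0 * pi**(d / 2) / gamma(d / 2)
    val, _ = quad(lambda u: u**k * G_G(u, d, q) * u**(d - 1), 0.0, inf)
    return surface * val

if __name__ == "__main__":
    d, q = 2, 6
    print(round(radial_moment(d, q, 0), 6))                     # normalization: ~1
    print([round(radial_moment(d, q, k), 6) for k in (2, 4)])   # vanishing moments: ~0
```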
Proposition C.4.

The kernel $K_{d,q}^{L}(\bm{x})=G_{d,q}^{L}(\|\bm{x}\|)$ is a $d$-variate, $q$-th order kernel, where

\displaystyle G_{d,2}^{L}(u)=\frac{\Gamma(\frac{d}{2})}{2\pi^{\frac{d}{2}}\Gamma(d)}\exp(-u),\quad G_{d,q}^{L}(u)=\biggl(\sum_{i\in I_{q-2}}b_{d,q,i}^{L}u^{i}\biggr)G_{d,2}^{L}(u), \qquad (C.6)

and where $b_{d,q,i}^{L}$, $i\in I_{q-2}$, are determined as

\displaystyle\begin{bmatrix}b_{d,q,0}^{L}\\ b_{d,q,2}^{L}\\ \vdots\\ b_{d,q,q-2}^{L}\end{bmatrix}
=\begin{bmatrix}1&(d)_{2}&\cdots&(d)_{q-2}\\ 1&(d+2)_{2}&\cdots&(d+2)_{q-2}\\ \vdots&\vdots&\ddots&\vdots\\ 1&(d+q-2)_{2}&\cdots&(d+q-2)_{q-2}\end{bmatrix}^{-1}
\begin{bmatrix}1\\ 0\\ \vdots\\ 0\end{bmatrix}. \qquad (C.7)

Also, the functionals $B_{d,q}$, $V_{d,0}$, and $V_{d,1}$ of $G_{d,q}^{L}$ become

\displaystyle\begin{split}
&B_{d,2}(G_{d,2}^{L})=\frac{(d)_{2}\Gamma(\frac{d}{2})}{2\pi^{\frac{d}{2}}};\quad B_{d,4}(G_{d,4}^{L})=-\frac{(d)_{4}(2d+7)\Gamma(\frac{d}{2})}{2(2d+3)\pi^{\frac{d}{2}}};\\
&B_{d,6}(G_{d,6}^{L})=\frac{196200}{149},\ \frac{2215080}{307\pi},\ \frac{1335600}{109\pi},\quad d=1,2,3;\\
&V_{d,0}(G_{d,2}^{L})=\frac{\Gamma(\frac{d}{2})^{2}}{2^{d+2}\pi^{d}\Gamma(d)};\quad V_{d,0}(G_{d,4}^{L})=\frac{(d+2)_{2}(9d^{2}+73d+96)\Gamma(\frac{d}{2})^{2}}{2^{d+8}\pi^{d}(2d+3)^{2}\Gamma(d)};\\
&V_{d,0}(G_{d,6}^{L})=\frac{4327215}{22733824},\ \frac{28174335}{193021952\pi^{2}},\ \frac{5292791}{389316608\pi^{2}},\quad d=1,2,3;\\
&V_{d,1}(G_{d,2}^{L})=\frac{\Gamma(\frac{d}{2})^{2}}{2^{d+2}\pi^{d}\Gamma(d)};\quad V_{d,1}(G_{d,4}^{L})=\frac{(d+1)(9d^{3}+133d^{2}+534d+576)\Gamma(\frac{d}{2})^{2}}{2^{d+8}\pi^{d}(2d+3)^{2}\Gamma(d)};\\
&V_{d,1}(G_{d,6}^{L})=\frac{5572345}{22733824},\ \frac{38442103}{193021952\pi^{2}},\ \frac{7306271}{389316608\pi^{2}},\quad d=1,2,3.
\end{split} \qquad (C.8)
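The coefficients in (C.7) are obtained by solving a small linear system. The following sketch is our own illustration, not part of the paper; it assumes that $(a)_{k}$ in (C.7) denotes the Pochhammer (rising-factorial) symbol and uses `scipy.special.poch` accordingly, then checks the normalization and vanishing lower-order radial moments of the resulting $K_{d,q}^{L}$.

```python
# A minimal numerical sketch (not from the paper): solving the linear system (C.7)
# for the polynomial coefficients b^L_{d,q,i} of the Laplace-hierarchy kernel, assuming
# (a)_k is the Pochhammer (rising-factorial) symbol, and checking that the resulting
# K^L_{d,q} integrates to 1 with vanishing lower-order radial moments.
from math import gamma, pi, exp, inf
import numpy as np
from scipy.special import poch
from scipy.integrate import quad

def laplace_coefficients(d, q):
    """Coefficients b^L_{d,q,i}, i = 0, 2, ..., q-2, from (C.7)."""
    exponents = list(range(0, q - 1, 2))                      # 0, 2, ..., q-2
    M = np.array([[poch(d + 2 * r, i) for i in exponents]     # row r has base d + 2r
                  for r in range(len(exponents))])
    rhs = np.zeros(len(exponents))
    rhs[0] = 1.0
    return exponents, np.linalg.solve(M, rhs)

def G_L(u, d, q, exponents, b):
    """Radial profile G^L_{d,q}(u) from (C.6)."""
    base = gamma(d / 2) / (2.0 * pi**(d / 2) * gamma(d)) * exp(-u)
    return sum(bi * u**i for i, bi in zip(exponents, b)) * base

def radial_moment(d, q, k):
    exponents, b = laplace_coefficients(d, q)
    surface = 2.0 * pi**(d / 2) / gamma(d / 2)
    val, _ = quad(lambda u: u**k * G_L(u, d, q, exponents, b) * u**(d - 1), 0.0, inf)
    return surface * val

if __name__ == "__main__":
    d, q = 3, 6
    print(round(radial_moment(d, q, 0), 6))                     # ~1
    print([round(radial_moment(d, q, k), 6) for k in (2, 4)])   # ~0
```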

D Proof of Proposition 6

This appendix provides a proof of Proposition 6. In the case $(d,q)=(2,2)$, the third-order term of the Taylor expansion of $f$ around its mode $\bm{x}=\bm{\theta}$ is represented as a real binary cubic form, i.e., a homogeneous degree-3 polynomial of $\bm{u}=\bm{x}-\bm{\theta}$: $au_{1}^{3}+bu_{1}^{2}u_{2}+cu_{1}u_{2}^{2}+du_{2}^{3}$, where $a,b,c,d\in\mathbb{R}$ are proportional to appropriate third-order partial derivatives of $f$ at $\bm{\theta}$. Properties of such a real binary cubic form that are invariant under invertible linear transforms of $\bm{u}$ are characterized by the algebraic curve defined as the set of $\bm{u}$ satisfying $au_{1}^{3}+bu_{1}^{2}u_{2}+cu_{1}u_{2}^{2}+du_{2}^{3}=0$. Putting aside the trivial case $a=b=c=d=0$, classification of the algebraic curve $au_{1}^{3}+bu_{1}^{2}u_{2}+cu_{1}u_{2}^{2}+du_{2}^{3}=0$ can be done on the basis of the classification of the set of solutions of the cubic equation $at^{3}+bt^{2}+ct+d=0$ in a real-valued variable $t$ (a numerical sketch of this classification is given after the case list below): it has either:

  • a single real solution, which is triply degenerate,

  • two real distinct solutions, one of which is doubly degenerate,

  • three real distinct solutions, or

  • one real solution and two complex conjugate solutions.

Accordingly, the third-order term of the Taylor expansion of $f$ at the mode is either:

  1. (D.1)

    0 identically,

  2. (D.2)

    of the form $(\bm{a}^{\top}\bm{u})^{3}$,

  3. (D.3)

    of the form $(\bm{a}_{1}^{\top}\bm{u})^{2}(\bm{a}_{2}^{\top}\bm{u})$ with $\bm{a}_{1},\bm{a}_{2}$ linearly independent,

  4. (D.4)

    of the form $\prod_{i=1}^{3}\bm{a}_{i}^{\top}\bm{u}$ with $\bm{a}_{1},\bm{a}_{2},\bm{a}_{3}$, any two of which are linearly independent, or

  5. (D.5)

    of the form $(\bm{a}^{\top}\bm{u})g(\bm{u})$, where $g(\bm{u})=\frac{1}{2}\bm{u}^{\top}\mathrm{R}\bm{u}$ with $\mathrm{R}$ positive definite.
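The following Python sketch is our own illustration and not part of the proof; the function name, example coefficients, and numerical tolerance are ours, and it treats only the generic case $a\neq 0$. It classifies a given real binary cubic into the cases (D.2)–(D.5) from the roots of $at^{3}+bt^{2}+ct+d=0$, as described above.

```python
# A minimal numerical sketch (not from the paper) of the root-based classification:
# for a != 0, the real roots of a*t^3 + b*t^2 + c*t + d = 0 correspond to the real
# linear factors u_1 - t*u_2 of the binary cubic, and their number and multiplicities
# select among the cases (D.2)-(D.5).
import numpy as np

def classify_cubic(a, b, c, d, tol=1e-8):
    """Classify the real binary cubic a*u1^3 + b*u1^2*u2 + c*u1*u2^2 + d*u2^3 (a != 0)."""
    roots = np.roots([a, b, c, d])
    real = sorted(r.real for r in roots if abs(r.imag) < tol)
    if len(real) < 3:
        return "(D.5): one real linear factor times a positive definite quadratic"
    if abs(real[0] - real[2]) < tol:
        return "(D.2): a cubed linear form (a^T u)^3"
    if abs(real[0] - real[1]) < tol or abs(real[1] - real[2]) < tol:
        return "(D.3): a squared linear form times an independent linear form"
    return "(D.4): a product of three pairwise independent linear forms"

if __name__ == "__main__":
    print(classify_cubic(1, 0, 0, 0))    # u1^3                   -> (D.2)
    print(classify_cubic(1, -1, 0, 0))   # u1^2 (u1 - u2)         -> (D.3)
    print(classify_cubic(1, 0, -1, 0))   # u1 (u1 - u2)(u1 + u2)  -> (D.4)
    print(classify_cubic(1, 0, 1, 0))    # u1 (u1^2 + u2^2)       -> (D.5)
```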

In the case (D.1), the AB is trivially 0 for any choice of $\mathrm{Q}$. The cases (D.4) and (D.5) have already been covered in Propositions 3 and 4, respectively. What remains is to consider the cases (D.2) and (D.3), which are dealt with in the following lemmas, completing the proof of Proposition 6.

Lemma D.1.

$\nabla(\nabla^{\top}\mathrm{Q}\nabla)^{q/2}(\bm{a}^{\top}(\bm{x}-\bm{\theta}))^{q+1}$ with $\bm{a}\neq\mathbf{0}_{d}$ is non-zero irrespective of the choice of the positive definite $\mathrm{Q}$.

Proof.

One has

\displaystyle\nabla(\nabla^{\top}\mathrm{Q}\nabla)^{q/2}(\bm{a}^{\top}(\bm{x}-\bm{\theta}))^{q+1}=(q+1)!\,(\bm{a}^{\top}\mathrm{Q}\bm{a})^{q/2}\bm{a}, \qquad (D.6)

and the coefficient of $\bm{a}$ in the above expression is non-zero because of the positive-definiteness of $\mathrm{Q}$. ∎

Lemma D.2.

$\nabla(\nabla^{\top}\mathrm{Q}\nabla)^{q/2}(\bm{a}_{1}^{\top}(\bm{x}-\bm{\theta}))^{q}(\bm{a}_{2}^{\top}(\bm{x}-\bm{\theta}))$ with linearly independent $\bm{a}_{1},\bm{a}_{2}$ is non-zero irrespective of the choice of the positive definite $\mathrm{Q}$.

Proof.

One has

\displaystyle\nabla(\nabla^{\top}\mathrm{Q}\nabla)^{q/2}(\bm{a}_{1}^{\top}(\bm{x}-\bm{\theta}))^{q}(\bm{a}_{2}^{\top}(\bm{x}-\bm{\theta}))=q\cdot q!\,(\bm{a}_{1}^{\top}\mathrm{Q}\bm{a}_{1})^{q/2-1}(\bm{a}_{1}^{\top}\mathrm{Q}\bm{a}_{2})\bm{a}_{1}+q!\,(\bm{a}_{1}^{\top}\mathrm{Q}\bm{a}_{1})^{q/2}\bm{a}_{2}, \qquad (D.7)

and the coefficient of $\bm{a}_{2}$ in the above expression is non-zero because of the positive-definiteness of $\mathrm{Q}$. ∎
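As a consistency check of the closed forms (D.6) and (D.7), the following symbolic sketch is our own illustration (assuming SymPy is available; the particular $\mathrm{Q}$, $\bm{a}_{1}$, $\bm{a}_{2}$ are arbitrary choices of ours) and verifies both identities for the instance $d=2$, $q=2$.

```python
# A minimal symbolic sketch (not from the paper) checking the identities (D.6) and (D.7)
# for a concrete instance with d = 2 and q = 2; the particular Q, a_1, a_2 are ours.
import sympy as sp

x1, x2 = sp.symbols('x1 x2')
vars_ = (x1, x2)
u = sp.Matrix([x1, x2])                  # u = x - theta
Q = sp.Matrix([[2, 1], [1, 3]])          # a positive definite matrix
a1, a2 = sp.Matrix([1, 2]), sp.Matrix([3, -1])
q = 2

def LQ(expr):
    """One application of (nabla^T Q nabla) = sum_{i,j} Q_ij d^2/(dx_i dx_j)."""
    return sum(Q[i, j] * sp.diff(expr, vars_[i], vars_[j]) for i in range(2) for j in range(2))

def LQ_power(expr, m):
    for _ in range(m):
        expr = LQ(expr)
    return expr

def grad(expr):
    return sp.Matrix([sp.diff(expr, v) for v in vars_])

# (D.6): nabla (nabla^T Q nabla)^{q/2} (a^T u)^{q+1} = (q+1)! (a^T Q a)^{q/2} a
lhs = grad(LQ_power((a1.dot(u))**(q + 1), q // 2))
rhs = sp.factorial(q + 1) * (a1.dot(Q * a1))**(q // 2) * a1
print(sp.simplify(lhs - rhs))            # expected: zero vector

# (D.7): the analogous identity for (a1^T u)^q (a2^T u)
lhs = grad(LQ_power((a1.dot(u))**q * a2.dot(u), q // 2))
rhs = (q * sp.factorial(q) * (a1.dot(Q * a1))**(q // 2 - 1) * a1.dot(Q * a2) * a1
       + sp.factorial(q) * (a1.dot(Q * a1))**(q // 2) * a2)
print(sp.simplify(lhs - rhs))            # expected: zero vector
```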

The general treatment would require a higher-order analog of Sylvester's law of inertia for quadratic forms, that is, a classification of homogeneous degree-$q$ polynomials in $d$ variables under the equivalence relation in which two polynomials are equivalent if and only if one can be transformed into the other by applying an invertible linear transform over $\mathbb{R}$ to its variables. This problem seems to be less explored than its counterpart over $\mathbb{C}$, and is already complicated even when $d=3$; see, e.g., Banchi (2015). We therefore do not attempt to extend Proposition 6 to more general cases in this paper, and defer the problem to future investigation.

E Proof of Asymptotic Behaviors of the Modal Linear Regression

Theorem 4 on asymptotic behaviors of the MLR parameter estimator can be proved in almost the same manner as Theorem 1, under the following regularity conditions:

Assumption E.1 (Regularity conditions for Theorem 4).

For finite $d_{\mathrm{Y}}$ and even $q$,

  • (E.1)

    $\{(\bm{X}_{i},\bm{Y}_{i})\in\mathcal{X}\times\mathcal{Y}\}_{i=1}^{n}$ is a sample of i.i.d. observations from $f_{\mathrm{X,Y}}$.

  • (E.2)

    $\mathrm{E}[\|\bm{X}\|^{d_{\mathrm{Y}}+4+\tau}]<\infty$ for some $\tau>0$.

  • (E.3)

    The parameter space $\mathcal{A}$ is a compact subset of $\mathbb{R}^{d_{\mathrm{Y}}\times d_{\mathrm{X}}}$ and has a nonempty interior $\mathrm{int}(\mathcal{A})$; $\Theta\in\mathrm{int}(\mathcal{A})$.

  • (E.4)

    $\Pr(\Omega\bm{X}_{i}=0)<1$ for any fixed $\Omega\in\mathbb{R}^{d_{\mathrm{Y}}\times d_{\mathrm{X}}}$ such that $\Omega\neq\mathrm{0}$ (no multicollinearity).

  • (E.5)

    $f_{\mathrm{Y|X}}(\cdot|\bm{x})$ is $(q+2)$ times differentiable in $\mathbb{R}^{d_{\mathrm{Y}}}$ for all $\bm{x}$.

  • (E.6)

    $f_{\mathrm{Y|X}}(\bm{y}|\bm{x})$ has a unique and isolated maximizer at $\bm{y}=\Theta\bm{x}$ for all $\bm{x}$ (i.e., $f_{\mathrm{Y|X}}(\bm{y}|\bm{x})<f_{\mathrm{Y|X}}(\Theta\bm{x}|\bm{x})$ for all $\bm{y}\neq\Theta\bm{x}$, $\nabla f_{\mathrm{Y|X}}(\Theta\bm{x}|\bm{x})=\bm{0}_{d_{\mathrm{Y}}}$, and $\sup_{\bm{y}\notin N_{\bm{x}}}f_{\mathrm{Y|X}}(\bm{y}|\bm{x})<f_{\mathrm{Y|X}}(\Theta\bm{x}|\bm{x})$ for a neighborhood $N_{\bm{x}}$ of $\Theta\bm{x}$, for all $\bm{x}$).

  • (E.7)

    $\partial^{\bm{i}}f_{\mathrm{Y|X}}(\cdot|\bm{x})$, $|\bm{i}|=2$, satisfies $\int|\partial^{\bm{i}}f_{\mathrm{Y|X}}(\bm{y}|\bm{x})|\,d\bm{y}<\infty$ for all $\bm{x}$, and $\mathrm{A}$ in (61) is non-singular.

  • (E.8)

    $|\partial^{\bm{i}}f_{\mathrm{Y|X}}(\Theta\bm{x}|\bm{x})|<\infty$ for all $\bm{x}$ and $\bm{i}$ s.t. $|\bm{i}|=2,\ldots,q+1$, and $\partial^{\bm{i}}f_{\mathrm{Y|X}}(\cdot|\bm{x})$ is bounded in $\mathbb{R}^{d_{\mathrm{Y}}}$ for all $\bm{x}$ and $\bm{i}$ s.t. $|\bm{i}|=q+2$.

  • (E.9)

    $\partial^{\bm{i}}f_{\mathrm{Y|X}}$, $|\bm{i}|=q+1$, and $K$ are such that $\bm{b}$ in (61) is non-zero.

  • (E.10)

    $K$ is bounded and twice differentiable in $\mathbb{R}^{d_{\mathrm{Y}}}$, and satisfies the covering number condition, $\int|K(\bm{y})|\,d\bm{y}<\infty$, and $\lim_{\|\bm{y}\|\to\infty}\|\bm{y}\|\,|K(\bm{y})|=0$.

  • (E.11)

    $\mathcal{B}_{d_{\mathrm{Y}},\bm{0}_{d_{\mathrm{Y}}}}(K)=1$.

  • (E.12)

    $\mathcal{B}_{d_{\mathrm{Y}},\bm{i}}(K)=0$ for all $\bm{i}$ s.t. $|\bm{i}|=1,\ldots,q-1$.

  • (E.13)

    $|\mathcal{B}_{d_{\mathrm{Y}},\bm{i}}(K)|<\infty$ for all $\bm{i}$ s.t. $|\bm{i}|=q$, and $\mathcal{B}_{d_{\mathrm{Y}},\bm{i}}(K)\neq 0$ for some $\bm{i}$ s.t. $|\bm{i}|=q$.

  • (E.14)

    $|\mathcal{B}_{d_{\mathrm{Y}},\bm{i}}(K)|<\infty$ for all $\bm{i}$ s.t. $|\bm{i}|=q+1$.

  • (E.15)

    $\partial_{i}K$ is bounded and satisfies $\int|\partial_{i}K(\bm{y})\partial_{j}K(\bm{y})|\,d\bm{y}<\infty$, $\lim_{\|\bm{y}\|\to\infty}\|\bm{y}\|\,|\partial_{i}K(\bm{y})\partial_{j}K(\bm{y})|=0$, and $\int|\partial_{i}K(\bm{y})|^{2+\delta}\,d\bm{y}<\infty$ for some $\delta>0$, for all $i,j=1,\ldots,d_{\mathrm{Y}}$.

  • (E.16)

    $\int\nabla K(\bm{y})\nabla K(\bm{y})^{\top}\,d\bm{y}=\mathcal{V}_{d_{\mathrm{Y}}}(K)$ has a finite determinant.

  • (E.17)

    $\partial^{\bm{i}}K$, $|\bm{i}|=2$, satisfies the covering number condition.

  • (E.18)

    $\lim_{n\to\infty}(nh_{n}^{d_{\mathrm{Y}}+2q+2})^{\frac{1}{2}}=c<\infty$.

  • (E.19)

    $\lim_{n\to\infty}nh_{n}^{d_{\mathrm{Y}}+4}/\ln n=\infty$.

Acknowledgements

This work was supported by Grant-in-Aid for JSPS Fellows, Number 20J23367.

References

  • Abraham et al. (2004) C. Abraham, G. Biau, and B. Cadre. On the asymptotic properties of a simple estimate of the mode. ESAIM: Probability and Statistics, 8:1–11, 2004.
  • Askey (1975) R. Askey. Orthogonal Polynomials and Special Functions, volume 21. SIAM, 1975.
  • Baldauf and Santos Silva (2012) M. Baldauf and J. M. C. Santos Silva. On the use of robust regression in econometrics. Economics Letters, 114(1):124–127, 2012.
  • Banchi (2015) M. Banchi. Rank and border rank of real ternary cubics. Bollettino dell’Unione Matematica Italiana, 8(1):64–80, 2015.
  • Berlinet (1993) A. Berlinet. Hierarchies of higher order kernels. Probability Theory and Related Fields, 94(4):489–504, 1993.
  • Berlinet and Thomas-Agnan (2004) A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers, 2004.
  • Bochner (2005) S. Bochner. Harmonic Analysis and the Theory of Probability. Courier Corporation, 2005.
  • Casa et al. (2020) A. Casa, J. E. Chacón, and G. Menardi. Modal clustering asymptotics with applications to bandwidth selection. Electronic Journal of Statistics, 14(1):835–856, 2020.
  • Comaniciu and Meer (2002) D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603–619, 2002.
  • Conway (2007) J. B. Conway. A Course in Functional Analysis. Springer, second edition, 2007.
  • Donoho et al. (1991) D. L. Donoho, R. C. Liu, et al. Geometrizing rates of convergence, III. The Annals of Statistics, 19(2):668–701, 1991.
  • Eddy (1980) W. F. Eddy. Optimum kernel estimators of the mode. The Annals of Statistics, 8(4):870–882, 1980.
  • Gasser and Müller (1979) T. Gasser and H.-G. Müller. Kernel estimation of regression functions. In T. Gasser and M. Rosenblatt, editors, Smoothing Techniques for Curve Estimation, pages 23–68. Springer, 1979.
  • Gasser et al. (1985) T. Gasser, H.-G. Müller, and V. Mammitzsch. Kernels for nonparametric curve estimation. Journal of the Royal Statistical Society. Series B (Methodological), 47:238–252, 1985.
  • Granovsky and Müller (1989) B. L. Granovsky and H.-G. Müller. On the optimality of a class of polynomial kernel functions. Statistics & Risk Modeling, 7(4):301–312, 1989.
  • Granovsky and Müller (1991) B. L. Granovsky and H.-G. Müller. Optimizing kernel methods: a unifying variational principle. International Statistical Review/Revue Internationale de Statistique, 59(3):373–388, 1991.
  • Grund and Hall (1995) B. Grund and P. Hall. On the minimisation of Lp{L}^{p} error in mode estimation. The Annals of Statistics, 23(6):2264–2284, 1995.
  • Hasminskii (1979) R. Hasminskii. Lower bound for the risks of nonparametric estimates of the mode. Contributions to Statistics, pages 91–97, 1979.
  • Kemp et al. (2020) G. C. R. Kemp, P. M. D. C. Parente, and J. M. C. Santos Silva. Dynamic vector mode regression. Journal of Business & Economic Statistics, 38(3):647–661, 2020.
  • Mokkadem and Pelletier (2003) A. Mokkadem and M. Pelletier. The law of the iterated logarithm for the multivariate kernel mode estimator. ESAIM: Probability and Statistics, 7:1–21, 2003.
  • Müller (1984) H.-G. Müller. Smooth optimum kernel estimators of densities, regression curves and modes. The Annals of Statistics, 12(2):766–774, 1984.
  • Ozertem and Erdogmus (2011) U. Ozertem and D. Erdogmus. Locally defined principal curves and surfaces. Journal of Machine Learning Research, 12:1249–1286, 2011.
  • Parzen (1962) E. Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065–1076, 1962.
  • Pollard (1984) D. Pollard. Convergence of Stochastic Processes. Springer Science & Business Media, 1984.
  • Prudnikov et al. (1986) A. P. Prudnikov, Y. A. Brychkov, and O. I. Marichev. Integrals and Series, Volume 2: Special Functions. Gordon and Breach Science Publishers, 1986.
  • Romano (1988) J. P. Romano. On weak convergence and optimality of kernel density estimates of the mode. The Annals of Statistics, 16(2):629–647, 1988.
  • Rudin (1991) W. Rudin. Functional Analysis. McGraw-Hill, second edition, 1991.
  • Rüschendorf (1977) L. Rüschendorf. Consistency of estimators for multivariate density functions and for the mode. Sankhyā: The Indian Journal of Statistics, Series A, 39(3):243–250, 1977.
  • Sando and Hino (2020) K. Sando and H. Hino. Modal principal component analysis. Neural Computation, 32(10):1901–1935, 2020.
  • Schucany and Sommers (1977) W. Schucany and J. P. Sommers. Improvement of kernel type density estimators. Journal of the American Statistical Association, 72(358):420–423, 1977.
  • Szegö (1939) G. Szegö. Orthogonal Polynomials, volume 23. American Mathematical Society, 1939.
  • Vedaldi and Soatto (2008) A. Vedaldi and S. Soatto. Quick shift and kernel methods for mode seeking. In European Conference on Computer Vision, pages 705–718, 2008.
  • Vieu (1996) P. Vieu. A note on density mode estimation. Statistics & Probability Letters, 26(4):297–307, 1996.
  • Wand and Schucany (1990) M. P. Wand and W. R. Schucany. Gaussian-based kernels. Canadian Journal of Statistics, 18(3):197–204, 1990.
  • Yamasaki and Tanaka (2019) R. Yamasaki and T. Tanaka. Kernel selection for modal linear regression: Optimal kernel and IRLS algorithm. In 2019 18th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 595–601, Dec 2019.
  • Yamasaki and Tanaka (2020) R. Yamasaki and T. Tanaka. Properties of mean shift. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(9):2273–2286, 2020.
  • Yamato (1971) H. Yamato. Sequential estimation of a continuous probability density function and mode. Bulletin of Mathematical Statistics, 14(3):1–12, 1971.
  • Yao and Li (2014) W. Yao and L. Li. A new regression model: modal linear regression. Scandinavian Journal of Statistics, 41(3):656–671, 2014.