Optimal Kernel for Kernel-Based Modal Statistical Methods
Abstract
Kernel-based modal statistical methods include mode estimation, regression, and clustering. The estimation accuracy of these methods depends on the kernel used as well as on the bandwidth. We study the effect of the choice of kernel function on the estimation accuracy of these methods. In particular, we theoretically derive a (multivariate) optimal kernel that, when the optimal bandwidth is used, minimizes an analytically obtained asymptotic error criterion among a certain kernel class defined via the number of sign changes.
Keywords: Optimal kernel, kernel mode estimation, modal linear regression, mode clustering
1 Introduction
The mode of a continuous random variable is a location statistic, a point where the probability density function (PDF) takes a local maximum value. It provides a good and easy-to-interpret summary of probabilistic events: one can expect many sample points in the vicinity of a mode. Besides, the mode has several desirable properties for data analysis. One of them is that, as a statistic expressing the centrality of data following a unimodal distribution, it is robust against outliers and skewed or heavy-tailed distributions. The mean is not robust at all, which often causes trouble in real applications. Another desirable property of the mode is that it can be naturally defined even in multivariate settings or non-Euclidean spaces. For example, the median and quantiles are intrinsically difficult to define in these cases due to the lack of an appropriate ordering structure. Also, the mode works well for multimodal distributions without loss of interpretability.
These attractive properties of the mode are utilized not only in location estimation but also in other data analysis techniques such as regression and clustering. Among various modal statistical methods, kernel-based methods have been widely used, presumably because they retain the inherent goodness of the mode and are easy to use in that they can be implemented with simple estimation algorithms, such as the mean-shift algorithm (Comaniciu and Meer, 2002; Yamasaki and Tanaka, 2020) for kernel mode estimation (Parzen, 1962; Eddy, 1980; Romano, 1988) and for mode clustering (Casa et al., 2020), and the iteratively reweighted least squares algorithm (Yamasaki and Tanaka, 2019) for modal linear regression (Yao and Li, 2014; Kemp et al., 2020). Their good estimation accuracy is another important advantage: for example, kernel mode estimation generally has the best estimation accuracy among mode estimation methods in the sense that it attains the minimax-optimal rate with respect to the sample size (Hasminskii, 1979; Donoho et al., 1991).
Estimation accuracy of such kernel-based methods depends on the kernel function used as well as the bandwidth. Nevertheless, the selection of a kernel has not been well studied, and some questions remain unanswered. For instance, for univariate kernel mode estimation, Granovsky and Müller (1991) have derived the optimal kernel, which minimizes the asymptotic mean squared error (AMSE) of a kernel mode estimate using an optimal bandwidth among a certain kernel class defined via the number of sign changes. Whether their argument can be extended to multivariate cases, however, has not been clarified yet. We are also interested in optimal kernels for other kernel-based modal statistical methods, such as modal linear regression and mode clustering, for which the problem of kernel selection has not been elucidated enough, as far as we are aware. The objective of this paper is to study the problem of kernel selection for various kernel-based modal statistical methods in the multivariate setting.
In the main part of this paper (Sections 2–4), to simplify the presentation, we focus on kernel mode estimation (Parzen, 1962; Eddy, 1980; Romano, 1988) as a representative kernel-based modal statistical method. In Section 2, we address the multivariate extension of the study on the optimal kernel by Granovsky and Müller (1991). In multivariate cases, some additional assumptions on the structure of the kernels are needed to allow a systematic discussion of the optimal kernel, so we mainly consider two commonly used kernel classes, radial-basis kernels (RKs) and product kernels (PKs). In Section 2.1, we show basic asymptotic behaviors of the mode estimator, such as asymptotic normality, that apply to general kernels including RKs and PKs; this is a modification of the existing statement in (Mokkadem and Pelletier, 2003). The characterization of the asymptotic behaviors leads to the AMSE of the mode estimator and the optimal bandwidth, providing the basis for our subsequent discussion of the optimal kernel. We then develop the theory of the optimal RK in Section 2.2, and study PKs in Section 2.3. As one consequence, the optimal RK is found to improve the AMSE by more than 10 % compared with the commonly used Gaussian kernel, and the improvement is greater in higher dimensions. In view of the slow nonparametric convergence rate of mode estimation, this improvement would contribute significantly to higher sample efficiency. Moreover, on the basis of the considerations in Sections 2.2 and 2.3, in Section 2.4 we compare these two kernel classes in terms of the AMSE and show that, among non-negative kernels, the optimal RK is better than any PK regardless of the underlying PDF. Section 3 gives some additional discussion. In Section 4 we show results of simulation experiments that examine to what extent the theories on kernel selection (the results in Section 2), which are based on asymptotics, reflect real performance for finite sample sizes; the results support the usefulness of our discussion.
We show that our discussion can be further adapted to other kernel-based modal statistical methods, such as another mode estimation method (Abraham et al., 2004), modal linear regression (Yao and Li, 2014; Kemp et al., 2020), and mode clustering (Comaniciu and Meer, 2002; Casa et al., 2020). For these methods, we can obtain results similar to, but not completely the same as, those for kernel mode estimation. Section 5 gives method-specific results on kernel selection.
The remaining part of this paper consists of a concluding section and appendix sections: Section 6 summarizes our results and presents possible directions for future work. Appendices give proofs of the theories on the optimal RK and of the theorems for the asymptotic behaviors of the kernel mode estimator and modal linear regression, which are presented newly in this paper.
2 Optimal Kernel for Kernel Mode Estimation
2.1 Basic Asymptotic Behaviors
In general, for a continuous random variable, its mode refers either to a point where its PDF takes a local maximum value, or to the set of such local maximizers. If we followed such definitions of the mode, the design of an estimator corresponding to each mode and theoretical statements regarding such an estimator would become complicated in a multimodal case. To avoid such technical complications, we assume, unless stated otherwise, that the PDF has a unique global maximizer (see (A.3) in Appendix A). Let be a -variate continuous random variable, and assume that it has a PDF . Then, we define the mode of as
(1) |
Suppose that an independent and identically distributed (i.i.d.) sample is drawn from . The kernel mode estimator (KME) is a plugin estimator
(2) |
where the kernel density estimator (KDE)
(3) |
is used as a surrogate for the PDF in (1), where is a kernel function defined on , and where is a parameter regarding the scale of the kernel and is called a bandwidth. Here, we assume that the KDE also has a unique global maximizer.
Evaluating the KME amounts to solving the optimization problem (2), whose objective function is in general non-convex, so how one solves it is itself an important problem. In this paper, however, as we are mainly interested in elucidating properties of the KME, we assume that the KME is obtained by some means; how to compute it is beyond the scope of this paper.
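Nevertheless, to make the estimator concrete, the following is a minimal sketch of the KME computed by the mean-shift fixed-point iteration mentioned in the introduction, with a Gaussian kernel; the kernel, the bandwidth value, and the starting point are illustrative choices of ours, not prescriptions of the paper.

```python
import numpy as np

def gaussian_kde(x, sample, h):
    """Gaussian-kernel density estimate at a point x (any kernel could be used)."""
    d = sample.shape[1]
    u = (x - sample) / h                                   # (n, d) scaled differences
    k = np.exp(-0.5 * np.sum(u ** 2, axis=1)) / (2 * np.pi) ** (d / 2)
    return k.mean() / h ** d

def mean_shift_mode(sample, h, n_iter=500, tol=1e-8):
    """Approximate KME: ascend the Gaussian KDE by mean-shift, starting from the sample mean."""
    x = sample.mean(axis=0)                                # illustrative starting point
    for _ in range(n_iter):
        w = np.exp(-0.5 * np.sum(((x - sample) / h) ** 2, axis=1))
        x_new = (w[:, None] * sample).sum(axis=0) / w.sum()
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

rng = np.random.default_rng(0)
sample = rng.gamma(shape=3.0, scale=1.0, size=(2000, 2))   # skewed toy sample, mode near (2, 2)
theta_hat = mean_shift_mode(sample, h=0.5)                 # bandwidth chosen ad hoc for illustration
print(theta_hat, gaussian_kde(theta_hat, sample, 0.5))
```

With a Gaussian kernel, a mean-shift step does not decrease the KDE value, so the iterate converges to a local maximizer of the KDE; for a multimodal KDE the starting point determines which local maximizer is reached.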
In what follows, for a set , we let denote the set of all nonnegative elements in . Thus, , and represents the set of nonnegative real numbers. For later use, we introduce the multi-index notation: For , let . It then follows that and are equivalent, where is the -dimensional zero vector. For a multi-index and a vector , let . The factorial of a multi-index is defined as . Let , where .
For a kernel function on and a multi-index , we let
(4) |
be the (multivariate) moment of indexed by . Throughout this paper, we assume using a -th order kernel of an even integer . Here, we say that defined on is -th order if it satisfies the following moment condition: For ,
(5) |
Also, note that the condition (5) for implies normalization of the kernel.
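As a concrete check of the moment condition, the following sketch numerically verifies that the univariate Biweight kernel is a second-order kernel: its zeroth moment equals one (normalization), its first moment vanishes, and its second moment is nonzero. The kernel choice is illustrative.

```python
from scipy.integrate import quad

def biweight(x):
    """Univariate Biweight kernel (15/16)(1 - x^2)^2 supported on [-1, 1]."""
    return 15.0 / 16.0 * (1.0 - x ** 2) ** 2 if abs(x) <= 1.0 else 0.0

def moment(kernel, order, lower=-1.0, upper=1.0):
    """Numerically compute the `order`-th moment of a univariate kernel."""
    value, _ = quad(lambda x: x ** order * kernel(x), lower, upper)
    return value

print(moment(biweight, 0))  # ~1.0: normalization (zeroth moment)
print(moment(biweight, 1))  # ~0.0: odd moments vanish by symmetry
print(moment(biweight, 2))  # ~0.142857 (= 1/7), nonzero, so the kernel is of order 2
```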
Here we consider the standard situation where the Hessian matrix of the PDF at the mode is non-singular, where . Existing works (Yamato, 1971; Rüschendorf, 1977; Mokkadem and Pelletier, 2003) have revealed basic statistical properties of the KME such as consistency and asymptotic distribution. In the following, we provide a theorem modified from that of, especially, (Mokkadem and Pelletier, 2003): it shows the asymptotic bias (AB), variance-covariance matrix (AVC), and AMSE of , along with asymptotic normality, for a more general kernel class than those covered by existing proofs. This modification is essential for our purpose, since it covers the RKs that will be discussed in the next section, whereas the existing proofs do not.
Theorem 1.
Assume Assumption A.1 in Appendix A with an even integer . Then, the KME asymptotically follows a normal distribution with the following AB and AVC:
(6) |
where , where with
(7) |
and where is the matrix defined as with
(8) |
Moreover, the AMSE is given by
(9) |
where is the Euclidean norm in and where denotes the trace of a square matrix .
A plausible approach to selecting the bandwidth and kernel is to make the AMSE (9) as small as possible. Noting to hold under our assumptions (see (A.4) and (A.6)), the stationarity condition of the AMSE with respect to leads to the optimal bandwidth,
(10) |
which strictly minimizes the AMSE. (The optimal bandwidth (10) depends on the inaccessible PDF and its mode , via , , and , and thus cannot be used in practice as it is. However, it is possible to use a plug-in estimator of the bandwidth which replaces these inaccessible quantities with their consistent estimates, and the discussion on the optimal kernel below also holds for the estimated optimal bandwidth.) Moreover, substituting the optimal bandwidth into the AMSE, the bandwidth-optimized AMSE becomes
(11) |
The bandwidth-optimized AMSE (11) depends on the kernel function used in the KDE through and , so one may consider further optimizing it with respect to the kernel function. However, it also depends on the sample-generating PDF via , and the dependence on the kernel and that on the PDF are mixed in a complex way in the expression (11), so it seems quite difficult to study kernel optimization in the multivariate setting without additional assumptions. Furthermore, an optimal kernel obtained in this way would in general depend on the PDF, and a multivariate kernel without any structural assumptions is also difficult to implement in practice. We thus introduce some structural assumptions on kernel functions, and consider optimization of the bandwidth-optimized AMSE under such assumptions. Under appropriate structural assumptions, the optimal kernel function turns out not to depend on the PDF , as will be shown in the following sections.
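Although the constants in (10) involve the unknown PDF, the rates themselves are determined by balancing a squared bias of order h^{2ν} against a variance of order 1/(n h^{d+2}), which gives an optimal bandwidth of order n^{-1/(2ν+d+2)} and a bandwidth-optimized AMSE of order n^{-2ν/(2ν+d+2)}. The following sketch only illustrates these rates; the constant C is a placeholder for the PDF- and kernel-dependent factor, which would be replaced by a plug-in estimate in practice.

```python
def optimal_bandwidth_rate(n, d, nu, C=1.0):
    """Rate of the AMSE-optimal bandwidth, C * n^{-1/(2*nu + d + 2)}.
    C is a placeholder for the PDF- and kernel-dependent constant in (10)."""
    return C * n ** (-1.0 / (2 * nu + d + 2))

def amse_rate(n, d, nu):
    """Order of the bandwidth-optimized AMSE, n^{-2*nu/(2*nu + d + 2)}."""
    return n ** (-2.0 * nu / (2 * nu + d + 2))

# e.g., d = 1 and a second-order kernel: h_opt ~ n^{-1/7} and AMSE ~ n^{-4/7}
for n in (10 ** 3, 10 ** 4, 10 ** 5):
    print(n, optimal_bandwidth_rate(n, d=1, nu=2), amse_rate(n, d=1, nu=2))
```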
2.2 Optimal Kernel in Radial-Basis Kernels
In multivariate kernel-based methods, RKs have been the most commonly used. We first consider kernel optimization among the class of RKs. We say that a kernel is an RK if there is a function on such that is represented as .
For an RK , by means of the Cartesian-to-polar coordinate transformation, the functional is rewritten in terms of as
(12) |
with , where , and where is a kernel-independent factor given by
(13) |
Note that , and hence also , vanishes whenever any index in the multi-index is odd, which includes the case where is odd, and that is equal to . One then has , where
(14) |
where so that denotes the Laplacian operator.
Also, the functional for an RK reduces to
(15) |
with and . This implies that , that is, is proportional to the identity matrix .
Therefore, the bandwidth-optimized AMSE (11) becomes
(16) |
in which the kernel-dependent factor is alone. (Although implicit in the notation here, and are functionals of . We will use the notation etc. when we want to make the dependence on explicit, and otherwise use such abbreviations.)
As one can see, the bandwidth-optimized AMSE (16) for an RK is decomposed into a product of the kernel-dependent factor, which we call the AMSE criterion, and the remaining PDF-dependent part. This fact implies that an RK minimizing the AMSE criterion also minimizes the AMSE, that is, it is optimal, among the class of RKs for every PDF satisfying the requirements of Theorem 1.
For an RK , as stated above, vanishes for with odd , so that one needs to consider the even moment condition alone, which can be translated into the moment condition for as follows:
(17) |
where is the index set defined as
(18) |
Thus, if one focuses on the functional form of the AMSE criterion, one might consider the following variational problem (we name it ‘type-1’ because it depends on the first derivative of via ).
However, this problem has no solution, as shown in the next proposition.
Proposition 1.
Problem 1 has no solution.
Proof.
One has by definition. It can never be equal to zero under the moment condition (17). Indeed, if one had , either or would have to be equal to zero, whereas must be nonzero due to the moment condition (17). cannot be equal to zero either, since otherwise would have to be identically zero, implying that is constant, contradicting the condition .
We next show that the infimum of is zero. Take and satisfying (17) with and (for example, and defined later). Consider the linear combination of and with . One then has , so that satisfies (17) unless , in which case one has so that the order of is strictly higher than . On the other hand, substituting into the inequality , multiplying both sides with , and taking the integral over , one has
(19) |
for any . One therefore has as , proving that the infimum of is equal to zero. ∎
This proposition states that the approach to kernel selection via simply minimizing the AMSE criterion among the entire class of -th order RKs is not a theoretically viable option. This approach can also be discouraged from a practical standpoint. As shown above, one can make the AMSE criterion arbitrarily close to zero by considering a sequence of kernels for which approaches zero. For such a sequence of kernels, however, the optimal bandwidth diverges toward infinity. This causes the effect of the next-leading-order term of the AB, which is , to be non-negligible in the MSE when the sample size is finite, thereby invalidating the use of the AMSE criterion as an estimate of the MSE in a finite-sample situation.
The above discussion suggests that, in order to avoid diverging behaviors of the optimal bandwidth and to obtain results meaningful in a finite-sample situation, one should consider optimization of the AMSE criterion within a kernel class which at least excludes kernels with vanishing leading-order moments. Since such an approach can give a kernel with the smallest AMSE criterion at least among the considered class, it would be superior to other kernel design methods that are not based on any discussion of the goodness of the resulting higher-order kernels within kernel classes of the same order, such as the jackknife (Schucany and Sommers, 1977; Wand and Schucany, 1990), as well as to methods that optimize criteria other than the AMSE criterion, such as the minimum-variance kernels (Eddy, 1980; Müller, 1984). The existing studies (Gasser and Müller, 1979; Gasser et al., 1985; Granovsky and Müller, 1989, 1991) can be interpreted as following this approach, and they have focused on the relationship between the order of the kernel and the number of sign changes. More concretely, on the basis of the observation that any -th order univariate kernel should change its sign at least times on , they considered kernel optimization among the class of -th order kernels with exactly sign changes (which we call the minimum-sign-change condition). They have also shown that the minimum-sign-change condition excludes kernels with small leading-order moments, which cause the difficulties discussed above in the minimization of the AMSE criterion. In addition, and importantly, they succeeded in obtaining a closed-form solution of the resulting kernel optimization problem under the minimum-sign-change condition.
In this paper, we also take the same approach as the one by (Gasser and Müller, 1979; Gasser et al., 1985; Granovsky and Müller, 1989, 1991): under the moment condition (17), one can show that changes its sign at least times on ; see Lemma B.2. Then, we consider the following modified problem to which the minimum-sign-change condition (P2-3) is added.
Here, the conditions (P2-2)–(P2-5) are technically required in the proof. The conditions (P2-2) and (P2-4), which make the problem well-defined, and the condition (P2-5), which is close to , would be almost non-restrictive. The minimum-sign-change condition (P2-3) for implies that the kernel is non-negative, under the normalization condition in (P2-1). Thus, the minimum-sign-change condition (P2-3) for higher orders can be regarded as a sort of extension of the non-negativity condition.
In the same way as (Gasser and Müller, 1979; Gasser et al., 1985), we can show, when , that this problem is solvable, and the solution is given as follows.
Theorem 2.
For every and , a solution of Problem 2 (note that a solution of Problem 2 has freedom in its scale, i.e., for any , is also a solution of Problem 2) is
(20) |
where the coefficients with are given by
(21) |
The kernel is also represented as
(22) |
where is the gamma function and where is the Jacobi polynomial (Szegö, 1939) defined by
(23) |
Alternatively, we can consider the following problem with another minimum-sign-change condition (P3-3) that is defined for the first derivative of a kernel.
The minimum-sign-change condition (P3-3) for implies that the kernel is non-negative and non-increasing, under the normalization condition in (P3-1) and the end-point condition (P3-5). Therefore, the minimum-sign-change condition for the first derivative (P3-3) can be seen as a more restrictive version of the minimum-sign-change condition for the kernel itself (P2-3). Additionally, note that the change from (P2-5) to (P3-5) stems from the difference in the proof techniques used for Theorems 2 and 3.
We can show that this problem has the same solution even for general even orders , on the basis of the proof techniques in (Granovsky and Müller, 1989, 1991).
This type-1 optimal kernel is a truncation of a polynomial function with terms of degrees and has a compact support (see Figure 1). It should be emphasized that the optimal kernel is a truncated kernel even though we have not assumed truncation a priori, and that the truncation of the optimal kernel is a consequence of the minimum-sign-change condition and the optimality with respect to the AMSE criterion. In particular, the 2nd order optimal kernel (where ), often called a Biweight kernel, is optimal among non-negative RKs (from Theorem 2). The kernel can be represented by the product of the -variate Biweight RK and a polynomial function as
(24) |
Berlinet (1993) pointed out that in a variety of kernel-based estimation problems the optimal kernels of different orders often form a hierarchy, in the sense that an optimal kernel can be represented as a product of a polynomial and the corresponding lowest-order kernel called the basic kernel: See also Section 3.11 of (Berlinet and Thomas-Agnan, 2004). According to their terminology, for example, the statement of Theorem 2 or 3 can be concisely summarized as follows: Among the class of RKs, the Biweight hierarchy minimizes the AMSE of the kernel mode estimator for any under the minimum-sign-change condition for the kernel or its first derivative.
However, it should also be noted that the kernel optimization problem was formulated without considering several regularity conditions, other than the moment condition, that ensure the asymptotic normality and are used for deriving the AMSE. The kernel is not twice differentiable at the edge of its support. In this respect, does not satisfy all the sufficient conditions in Theorem 1 for deriving the AMSE (see (A.7) and (A.14) in Assumption A.1 in the Appendix). On the other hand, one can still add a small enough perturbation to to make it twice differentiable at the edge of its support as well, while keeping the other conditions satisfied and changing the value of the AMSE criterion by an arbitrarily small amount. Namely, Problem 2 or 3 and its solutions still tell us the lower bound of the achievable AMSE under the minimum-sign-change condition, as well as the shape of kernels that achieve an AMSE close to the bound (this also implies that the problem of minimizing the AMSE under all the conditions of Theorem 1 has no solution). From these considerations, we expect that, despite the lack of a theoretical guarantee for the asymptotic normality of the KME, this issue is not so destructive and that the kernel provides practically good performance among the minimum-sign-change kernels; see also the simulation experiments in Section 4.
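For concreteness, the following is a minimal sketch of the second-order optimal RK, the d-variate Biweight RK, whose profile is proportional to (1 − ‖x‖²)² on the unit ball; the normalization constant used in the code, Γ(d/2 + 3)/(2π^{d/2}), is our own calculation (it reduces to the familiar 15/16 for d = 1) and is double-checked numerically.

```python
import numpy as np
from math import gamma, pi

def biweight_rk(x):
    """d-variate Biweight radial-basis kernel: c_d * (1 - ||x||^2)^2 on the unit ball.
    The constant c_d = Gamma(d/2 + 3) / (2 * pi^(d/2)) makes the kernel integrate
    to one (our own calculation; it equals 15/16 when d = 1)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    d = x.size
    r2 = float(np.dot(x, x))
    if r2 >= 1.0:
        return 0.0
    return gamma(d / 2 + 3) / (2 * pi ** (d / 2)) * (1.0 - r2) ** 2

# Monte Carlo check that the kernel integrates to about one for d = 3
rng = np.random.default_rng(0)
pts = rng.uniform(-1.0, 1.0, size=(100_000, 3))   # sample the cube enclosing the support
vals = np.array([biweight_rk(p) for p in pts])
print(vals.mean() * 2.0 ** 3)                      # cube volume 8; result ~1.0
```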
2.3 Optimal Kernel in Product Kernels
In this section we consider optimization of the bandwidth-optimized AMSE with respect to the kernel among the class of PKs, where a kernel is a PK if there exists a function on such that for all . To simplify the discussion, we assume in this paper that , as a kernel on , is symmetric and -th order, making the resulting PK -th order as well. We furthermore assume that satisfies the minimum-sign-change condition (P2-3) or (P3-3).
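The following is a minimal sketch of how a PK is assembled coordinate-wise from a univariate kernel; the univariate Biweight factor is an illustrative choice.

```python
import numpy as np

def biweight_1d(t):
    """Univariate Biweight kernel (15/16)(1 - t^2)^2 on [-1, 1]."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= 1.0, 15.0 / 16.0 * (1.0 - t ** 2) ** 2, 0.0)

def product_kernel(x, k1d=biweight_1d):
    """Product kernel K(x) = prod_j k(x_j) built from a univariate kernel k."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    return float(np.prod(k1d(x)))

print(product_kernel([0.0, 0.0, 0.0]))   # (15/16)^3: the value at the origin for d = 3
print(product_kernel([0.5, -0.2, 1.2]))  # 0.0: one coordinate falls outside [-1, 1]
```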
Under these assumptions, simple calculations lead to
(25)
(26)
Therefore, the bandwidth-optimized AMSE (11) becomes
(27) |
where is given by
(28) |
Equation (27) shows that the bandwidth-optimized AMSE for the PK is also decomposed into a product of the kernel-dependent factor and the PDF-dependent factor. One can therefore take the kernel-dependent factor as the AMSE criterion for the PK.
We have not succeeded in deriving the optimal minimizing this criterion. In the following, we consider instead a lower bound of the AMSE criterion, which will be compared with the optimal value of the criterion for the RK in the next section. The bound we discuss is derived as follows:
(29) |
where the last inequality holds strictly if since the solutions of the two minimization problems involved differ. The first minimization problem is a univariate case of Theorem 2 or Theorem 3 (or (Granovsky and Müller, 1991, Corollary 1)): the Biweight hierarchy provides the optimal among the -th order kernels satisfying the minimum-sign-change condition (P3-3). The second minimization problem, which is an instance of the type-0 problems, has been solved by Granovsky and Müller (1989): for even , the kernel
(30) |
minimizes among -th order kernels that change their sign times on . The type-0 optimal kernel is a truncation of a polynomial function with terms of degrees , and forms the Epanechnikov hierarchy, with the Epanechnikov kernel as the basic kernel. We have therefore obtained a lower bound
(31) |
of the AMSE criterion for the PK model.
2.4 Comparison between Radial-Basis Kernels and Product Kernels
We have so far studied the RKs and the PKs separately. One may then ask which of these classes of kernels will perform better. In this section we discuss this question.
Our approach to answering this question is basically grounded in a comparison between the AMSE of the optimal RK and that of the optimal PK. There are, however, difficulties in this approach: one is that the optimal PK, as well as its AMSE, is not known, which prevents us from comparing the RKs and the PKs directly in terms of the AMSE. Another is that the PDF-dependent factor of the AMSE takes different forms for the RKs and the PKs. Most evidently, in the quantities (14) and (28), which contribute to the AB of for the RKs and the PKs, respectively, cross derivatives of the PDF are present in the former whereas they are absent in the latter. Such differences in the PDF-dependent factors hinder a direct, PDF-independent comparison between the RKs and the PKs in terms of the AMSE in the general setting.
The only exception is the case , where the PDF-dependent factor takes the same form for the RKs and the PKs, and consequently one can compare the AMSEs for RKs and PKs, regardless of the PDF. In the case of , simple calculations show and . Therefore, the ratio of the lower bound of the AMSE for PK that derives from (31) to the AMSE for the optimal RK becomes
(32) |

Figure 2 shows how the ratio depends on the dimension . One observes that it increases monotonically with . Since the ratio is larger than 1 for (note that there is no distinction between the RKs and the PKs when ), one can conclude that the optimal second-order non-negative RK achieves an AMSE smaller than that of any second-order non-negative PK.
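The dimension dependence of this comparison can be illustrated numerically. The sketch below relies on two assumptions of ours rather than on expressions taken verbatim from the text: (i) for a second-order kernel, the kernel-dependent factor of the bandwidth-optimized AMSE is proportional to μ₂(K)^{2(d+2)/(d+6)} V(K)^{4/(d+6)}, where μ₂ is the marginal second moment and V = ∫(∂K/∂x₁)² dx, obtained by balancing a squared bias of order h⁴μ₂² against a variance of order V/(n h^{d+2}); and (ii) the PK lower bound is obtained by splitting this factor into a scale-invariant part minimized by the univariate Biweight kernel and a part minimized by the Epanechnikov kernel, mirroring (29)–(31). The printed ratios equal 1 at d = 1 and increase above 1 with d, consistent with the behavior described above, and should be read as indicative values only.

```python
from math import gamma, pi, sqrt

# exact univariate ingredients
MU2_BW, R_DBW = 1.0 / 7.0, 15.0 / 7.0   # Biweight: second moment, integral of (k')^2
MU2_EP, R_EP = 1.0 / 5.0, 3.0 / 5.0     # Epanechnikov: second moment, integral of k^2

def kernel_factor(mu2, v, d):
    """Assumed kernel-dependent AMSE factor for a 2nd-order kernel (see the lead-in)."""
    return mu2 ** (2.0 * (d + 2) / (d + 6)) * v ** (4.0 / (d + 6))

def optimal_rk_factor(d):
    """Factor of the d-variate Biweight RK, with moments in closed form (our calculation)."""
    mu2 = 1.0 / (d + 6)                                     # marginal second moment
    v = 8.0 * gamma(d / 2 + 3) / (pi ** (d / 2) * (d + 6))  # integral of (dK/dx_1)^2
    return kernel_factor(mu2, v, d)

def pk_lower_bound(d):
    """Lower bound on the PK factor, split into a Biweight part (mu2^{3/2} R(k'))
    and an Epanechnikov part (mu2^{1/2} R(k)); the split is our reading of (29)-(31)."""
    p = d + 6
    return ((MU2_BW ** 1.5 * R_DBW) ** (4.0 / p)
            * (sqrt(MU2_EP) * R_EP) ** (4.0 * (d - 1) / p))

for d in range(1, 11):
    print(d, round(pk_lower_bound(d) / optimal_rk_factor(d), 4))
# prints 1.0 for d = 1 and values increasing above 1 with d
```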
It should be emphasized that the above superiority result of the optimal RK over the PKs is limited to the second-order () case. As mentioned above, for one cannot compare the RKs and the PKs in terms of the AMSE in a PDF-independent manner. Consequently, depending on the PDF, there might be some PKs with better AMSE values than the optimal RK for , some examples of which will be shown in the simulation experiments in Section 4.
2.5 Optimal Kernel in Elliptic Kernels
Kernel methods in the multivariate setting may use a kernel function together with a linear transformation, in order to calibrate the difference in scales and shearing across the coordinates. In this section we assume for simplicity that an RK is to be combined with a linear transformation. For a matrix representing an area-preserving linear transformation in (hence ) and an RK (the base kernel), the transformed kernel is often called an elliptic kernel. In this section, we discuss the optimal base kernel and optimal transformation for kernel mode estimation based on an elliptic kernel.
Using the elliptic kernel is equivalent to using the kernel with the transformed sample drawn from , which is defined by , for estimation of the mode of . Multiplying the resulting mode estimate by from left yields the KME with of , an estimate of the mode . This view, together with the calculations
(33) |
where , reveals that the AB and AVC of the KME using reduce to
(34)
(35)
and that their kernel-dependent factors remain and of . Therefore, as long as (or equivalently ) is fixed such that the AB of the KME is non-zero, an argument similar to the ones in Section 2 leads to the optimal bandwidth and the same AMSE criterion with respect to . Thus, the kernel is optimal as a base kernel even under such a transformation .
One may furthermore consider optimizing (equivalently ) via
(36) |
which comes from the linear-transformation-dependent factor of the bandwidth-optimized AMSE. We have so far not succeeded in obtaining a closed-form solution of the optimization problem (36), although a numerical optimization might be possible with plug-in estimates of and .
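In practice, a common heuristic choice of the transformation, though not the minimizer of (36), is a determinant-normalized whitening matrix estimated from the sample covariance. The following sketch applies such a transform, estimates the mode of the transformed sample with a base kernel (any KME routine can be plugged in, e.g. the mean-shift sketch in Section 2.1), and maps the estimate back; all concrete choices here are illustrative.

```python
import numpy as np

def det_normalized_whitener(sample):
    """Area-preserving (det = 1) whitening matrix built from the sample covariance.
    This is a heuristic choice of the linear transformation, not the minimizer of (36)."""
    cov = np.cov(sample, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    A = evecs @ np.diag(evals ** -0.5) @ evecs.T            # usual whitening matrix
    return A / np.linalg.det(A) ** (1.0 / A.shape[0])       # rescale so that det(A) = 1

def elliptic_kernel_mode(sample, h, mode_estimator):
    """Mode estimation with an elliptic kernel: transform, estimate, transform back."""
    A = det_normalized_whitener(sample)
    z = sample @ A.T                       # transformed sample
    m = mode_estimator(z, h)               # KME of the transformed sample with the base RK
    return np.linalg.solve(A, m)           # map the estimate back to the original coordinates

# usage: theta_hat = elliptic_kernel_mode(sample, h=0.5, mode_estimator=mean_shift_mode),
# where mode_estimator is any KME routine, e.g. the mean-shift sketch in Section 2.1.
```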
It is worth mentioning that in some cases one can make the AB equal to zero by a certain choice of , whereas in other cases one cannot make the AB equal to zero by any choice of . Furthermore, whether one can make the AB equal to zero via the choice of depends on the -st order term of the Taylor expansion of around the mode , and the dependence seems to be quite complicated. Consequently, optimization of the linear transformation in terms of the AMSE is difficult to tackle in a manner similar to that adopted in the previous sections, because the assumption that the AB does not vanish does not necessarily hold. Although we have not succeeded in providing a complete specification of when one can make the AB equal to zero via the choice of , we describe in the following some cases where the AB can be made equal to zero by a certain choice of and some other cases where the AB cannot be made equal to zero by any choice of .
The following two propositions provide cases where the AB can be made zero by an appropriate choice of .
Proposition 2.
Assume . Assume that the -st order term of the Taylor expansion of around the mode is of the form
(37) |
with linearly independent. Then there is a choice of with which the AB (34) is made equal to zero.
We show that there exists a positive definite which satisfies
(38) |
The left-hand side is calculated as
(39) |
where the summation on the right-hand side is to be taken with respect to all the permutations of . Due to the assumed linear independence of , there exist duals of satisfying . Take as a basis of the orthogonal complement of . Then forms a basis of . Let
(40) |
with the zero matrix , where the notation means that is positive definite. One then has for . This implies that all the coefficients of on the right-hand side of (39) vanish, proving that (38) holds with the above .
Proposition 3.
Assume . Assume that the -st order term of the Taylor expansion of around the mode is of the form
(41) |
where spans a -dimensional subspace of . Assume further that holds with coefficients satisfying . Then there is a choice of with which the AB (34) is made equal to zero.
One can find a subset of of size , which consists of linearly independent vectors. Without loss of generality we assume that are linearly independent. Choosing
(42) |
with , the duals of in the subspace spanned by , and with , a basis of the orthogonal complement of , one has
(46) |
One can thus show that the following holds:
(47) |
The following propositions provide cases where the AB cannot be equal to zero by any choice of .
Proposition 4.
Assume . Assume that the -st order term of the Taylor expansion of around the mode is of the form
(48) |
where , are positive definite. Then the AB (34) cannot be equal to zero by any choice of .
In the case of one can prove the following proposition.
Proposition 5.
If, around its mode , admits the expression
(49) |
where is nonzero, and where
(50) |
is a positive definite quadratic form, then there is no which makes the AB vanish.
We show that in this case no choice of which is positive definite will make the quantity vanish. Indeed, one can write
(51) |
One observes that has positive eigenvalues only, since and are both positive definite. We thus have , which in turn implies that the matrix is invertible. Hence the quantity in (51) is nonvanishing with an arbitrary choice of .
Proposition 4 can be proven similarly. The term in this case is represented as , where the matrix is a sum of terms of the form with coefficients which are products of , and is hence positive definite. The term is therefore nonzero irrespective of the choice of .
We close this section by showing a complete specification as to when the AB can be made equal to zero via the choice of in the most basic case .
Proposition 6.
Assume . The third-order term of the Taylor expansion of around the mode is either:
- 0 identically, in which case the AB is equal to zero with an arbitrary ,
- of the form with , any two of which are linearly independent, in which case there is a choice of with which the AB is equal to zero,
- other than any of the above, in which case the AB cannot be made equal to zero by any choice of .
A proof is given in Appendix D.
3 Additional Discussion
3.1 Degree of Goodness of Optimal Kernel and Heuristic Findings
Kernel | ||||
---|---|---|---|---|
2 | Biweight | 0.1083 [1.0000] | ||
Triweight | 0.1095 [1.0105] | |||
Tricube | 0.1121 [1.0345] | |||
Cosine | 0.1135 [1.0475] | |||
Epanechnikov | 0.1 | 0.75 | 0.1179 [1.0883] | |
Triangle | 1 | 0.1188 [1.0971] | ||
Gaussian | 0.5 | 0.1213 [1.1198] | ||
Logistic | 0.1476 [1.3629] | |||
Sech | 0.5 | 0.1727 [1.5949] | ||
Laplace | 1 | 0.125 | 0.3048 [2.8133] | |
4 | 0.3729 [1.0000] | |||
0.4005 [1.0737] | ||||
-1.5 | 0.4451 [1.1935] | |||
-21.6 | 1.6314 [4.3743] | |||
6 | 1.0149 [1.0000] | |||
1.0760 [1.0600] | ||||
7.5 | 1.2639 [1.2453] | |||
5.7451 [5.6609] |
We are interested not only in specifying the optimal kernel but also in how the kernel selection affects the AMSE. To quantify the degree of suboptimality of kernels, we define, for a -variate -th order kernel , the AMSE ratio as the ratio of the bandwidth-optimized AMSE (11) for to that for the optimal RK . The AMSE ratio depends only on if the bandwidth-optimized AMSEs for and share the same PDF-dependent factor, which is the case for RKs (including univariate ones) and for PKs with (see Section 2.4). Table 1 compares the AMSE criteria and the AMSE ratios in the univariate case. An empirical observation is that truncated kernels, such as the Triweight and Epanechnikov kernels in addition to the optimal Biweight kernel, are better than non-truncated kernels, including the Sech and Laplace kernels and . This tendency still holds for multivariate RKs and PKs. Figure 3 shows the AMSE ratios for several multivariate RKs.
[Figure 3: AMSE ratios for several multivariate RKs; panels (a) and (b).]
For example, the AMSE ratio for the most frequently used Gaussian kernel approximately equals 1.1198, 1.1547, 1.1864, 1.2151, 1.2413, 1.2652, 1.2870, 1.3071, 1.3256, 1.3428 for , and monotonically approaches as goes to infinity.
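These values can be reproduced from closed-form moments of the Gaussian and Biweight RKs, under the same assumed form of the second-order kernel factor as in the sketch of Section 2.4 (a derivation of ours, not quoted from the text); the moments used below are also our own calculations. The computed ratios (≈1.120, 1.155, 1.186, …) agree with the values quoted above up to rounding.

```python
from math import gamma, pi

def kernel_factor(mu2, v, d):
    """Assumed kernel-dependent factor of the bandwidth-optimized AMSE (2nd order)."""
    return mu2 ** (2.0 * (d + 2) / (d + 6)) * v ** (4.0 / (d + 6))

def gaussian_rk_factor(d):
    """Standard d-variate Gaussian RK: mu2 = 1, integral of (dK/dx_1)^2 = 1/(2^(d+1) pi^(d/2))."""
    return kernel_factor(1.0, 1.0 / (2.0 ** (d + 1) * pi ** (d / 2)), d)

def biweight_rk_factor(d):
    """d-variate Biweight RK: mu2 = 1/(d+6), integral of (dK/dx_1)^2 = 8 Gamma(d/2+3)/(pi^(d/2)(d+6))."""
    mu2 = 1.0 / (d + 6)
    v = 8.0 * gamma(d / 2 + 3) / (pi ** (d / 2) * (d + 6))
    return kernel_factor(mu2, v, d)

for d in range(1, 11):
    print(d, round(gaussian_rk_factor(d) / biweight_rk_factor(d), 4))
# ~1.120, 1.155, 1.186, ..., matching the ratios quoted above up to rounding
```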
3.2 Other Criteria
We here mention that the optimality of the kernel is not limited to the scenario in which the AMSE is used as the criterion. In this section, the univariate case is assumed for simplicity, but its multivariate extension is straightforward.
Grund and Hall (1995) have studied the asymptotic error (AE) for and the optimal bandwidth minimizing it. Although they have clarified only the -dependence of the optimal bandwidth, we can go one step beyond their discussion and clarify the kernel dependence: the optimal bandwidth is represented as , where is a PDF-dependent, kernel-independent coefficient for which a closed-form expression is difficult to derive. If this optimal bandwidth is used, then the AE reduces to
(52) |
where denotes the random variable following the standard normal distribution, and where the expectation is taken with respect to . Since the kernel-dependent factor of the AE is , which is the same as that for the AMSE, the kernel also minimizes the AE among minimum-sign-change kernels, even when .
Also, from (Mokkadem and Pelletier, 2003), it can be found that the KME satisfies the law of the iterated logarithm. In the scaling , it holds that
(53) |
in other words, becomes relatively compact, and its limit set, in which the rescaled AB is included almost surely, is given by
(54) |
Then, the combination of the bandwidth (10) minimizing the AMSE and the kernel uniformly minimizes the size of the limit set of , which is proportional to .
3.3 Singular Hessian
Although it has been so far assumed that the Hessian matrix is non-singular at the mode (see (A.4) in Assumption A.1), this section provides a consideration on the optimal kernel when the Hessian matrix is singular. Since the multivariate singular case is so intricate that even its convergence rate remains an open problem, we suppose and hence at the mode .
Vieu (1996); Mokkadem and Pelletier (2003) have studied the singular case, where the PDF satisfies , and for instead of . The non-singular case studied in Section 2 corresponds to . Note that the integer should be an odd number for to take a maximum at .
A weak convergence of the KME holds even in the singular case, in a somewhat different manner, where the requirements on the moments of the kernel are weakened to
(55) |
In other words, the conditions , , required in the moment condition (5) in the non-singular case, are no longer needed in the singular case. Together with several additional conditions (see (Mokkadem and Pelletier, 2003) for the details), asymptotically follows a normal distribution with
(56) |
where , . Moreover, the bias-variance decomposition leads straightforwardly to the asymptotic error (AE), instead of the AMSE, of the KME :
(57) |
which has the same functional form regarding the bandwidth and kernel-dependent terms , as those of the AMSE (9) for the univariate non-singular case. The optimal bandwidth minimizing the AE can be calculated in the same way as that for the conventional one (10). Also, the AE criterion reduces to .
Kernel | |||||
---|---|---|---|---|---|
Biweight | 0.1369 [1.0000] | 0.1729 [1.0098] | |||
Triweight | 0.1426 [1.0418] | 0.1851 [1.0815] | |||
Tricube | 0.1381 [1.0089] | ||||
Cosine | 0.1404 [1.0257] | 0.1728 [1.0095] | |||
Epanechnikov | 0.75 | 0.1455 [1.0631] | 0.1781 [1.0405] | ||
Triangle | 1 | 0.1564 [1.1427] | 0.1999 [1.1674] | ||
Gaussian | 1.5 | 7.5 | 0.1813 [1.3246] | 0.2683 [1.5675] | |
Logistic | 0.2797 [2.0435] | 0.5223 [3.0507] | |||
Sech | 2.5 | 30.5 | 0.3757 [2.7443] | 0.7714 [4.4614] | |
Laplace | 12 | 360 | 0.125 | 0.8548 [6.2441] | 1.9955 [11.6563] |
0.3729 [2.7244] | 0.7033 [4.1083] | ||||
0 | – | 1.0145 [5.9282] |
The essential difference between the discussion of the optimal kernel in the singular case and that in the non-singular case arises from the difference between the required moment conditions (55) and (5): as mentioned above, a kernel function is not required to satisfy the -th moment condition for in the singular case. For example, when and , any symmetric 2nd order kernel does not satisfy (5), but does satisfy the singular version (55) of the moment conditions. The kernel , of course, minimizes the AE criterion among the conventional -th order kernels satisfying the minimum-sign-change condition, regardless of . However, the AE criterion may be improved by using a lower-order kernel satisfying (55).
We investigated two simple cases by calculating the AE criterion for several kernels, and report the results in Table 2, where the AE criterion is given by and for and , respectively. In the case , where any symmetric 2nd order kernel, as well as any 4th order kernel, fulfills the conditions (55), one can observe that the Biweight kernel is better than . In the case , where at least three types of kernels, symmetric 2nd and 4th order kernels and 6th order kernels, satisfy the conditions (55), Table 2 shows that the Tricube kernel gives the minimal value of the AE criterion among the kernels we investigated. Thus, we have confirmed that the kernel , which is optimal in the non-singular case, may not be optimal in the singular case, where a lower-order kernel may improve the asymptotic estimation accuracy. Although we have not succeeded in deriving an optimal kernel in the singular case, the empirical finding that truncated kernels perform well would be useful.
4 Simulation Experiment
The analysis in the previous sections on the optimal kernel is based on the asymptotic normality and the evaluation of the AMSE, derived from the asymptotic expansion of the KME. While the leading-order term of the bandwidth-optimized AMSE (11) is , the next-leading-order term is if one uses symmetric kernels such as RKs and PKs. Although the bandwidth-optimized AMSE ignores all but the leading-order term, those ignored terms may affect behaviors of the KME and thus its MSE when the sample size is finite. In this section, via simulation studies, we examine how well the kernel selection based on the AMSE criterion reflects the real performance of the KME and verify practical goodness of the optimal RK in a finite sample situation.
We tried the three cases regarding the dimension, , and used synthetic i.i.d. sample sets of size , 102400, drawn from the distribution , where denotes the normal distribution with mean and variance-covariance matrix , and where is the all-one -vector and is the identity matrix. The modes of the sample-generating distributions are located at for , respectively. Because a symmetric distribution such as the normal distribution does not satisfy assumption (A.6), skewed distributions were used in the experiments. As 2nd order kernels, we examined the optimal Biweight , Epanechnikov , Gaussian , and Laplace kernels, as well as PKs of , , and if . For the higher orders and , in addition to the optimal kernels , we also examined three RKs , , and designed via the jackknife (Schucany and Sommers, 1977; Wand and Schucany, 1990), an alternative method for designing a higher-order kernel, on the basis of , , and (see Section C), and four PKs with , , , and . Note that the RKs , , and and the PKs of , and are not twice differentiable at some points and thus do not exactly satisfy all the conditions of Theorem 1 on the asymptotic behaviors of the KME. For each kernel, we used the optimal bandwidth (10) calculated from the sample-generating PDF and its mode. (It should be noted that this procedure requires access to the sample-generating PDF, which is unavailable in practice; see the remark following (10) in Section 2.1. The purpose of our simulation experiments here is not to evaluate performance in practical settings but rather to see how the MSE of the KME with the bandwidth optimized with respect to the AMSE behaves in the finite-sample situation, so we used the bandwidth values exactly optimized with respect to the AMSE.) In these settings, the smallest AMSE among those of the kernels examined is given by the RK for or and the PK with for and . On the basis of 1000 trials, we calculated the MSE and its standard deviation (SD), and the results are reported in Table 3, along with the MSE ratio, defined as the ratio of the MSE of each kernel to that of the kernel with the same and when . Table 3 also shows the results of the Welch test with -value cutoff 0.05, under the null hypothesis that the MSE of interest equals the best MSE for the same .
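The following is a minimal sketch of this type of Monte Carlo comparison, not a reproduction of the exact setup above: it uses an illustrative univariate skewed density (a Gamma distribution with known mode), a grid-search KME, a bandwidth of the theoretically optimal order n^{-1/7} with an ad hoc constant in place of the exact optimum (10), and only two kernels.

```python
import numpy as np

rng = np.random.default_rng(0)

def kme_grid(sample, h, kernel, grid):
    """Grid-search KME: maximize the KDE built from `kernel` over the points in `grid`."""
    u = (grid[:, None] - sample[None, :]) / h
    return grid[np.argmax(kernel(u).mean(axis=1))]

biweight = lambda u: np.where(np.abs(u) <= 1, 15 / 16 * (1 - u ** 2) ** 2, 0.0)
gaussian = lambda u: np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

true_mode = 2.0                              # mode of the Gamma(shape=3, scale=1) density
grid = np.linspace(0.0, 8.0, 1001)
n, trials = 800, 100
for name, kern in [("Biweight", biweight), ("Gaussian", gaussian)]:
    est = np.array([kme_grid(rng.gamma(3.0, 1.0, n), h=n ** (-1 / 7), kernel=kern, grid=grid)
                    for _ in range(trials)])
    print(name, "MSE =", np.mean((est - true_mode) ** 2))
```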
Kernel | AMSE ratio | MSE ratio | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
1 | 2 | [1.0000] | [1.0000] | |||||||
[1.0883] | [1.0805] | |||||||||
[1.1198] | [1.1307] | |||||||||
[2.8133] | [2.5243] | |||||||||
4 | [1.0000] | [1.0000] | ||||||||
[1.0737] | [1.0334] | |||||||||
[1.1935] | [1.2712] | |||||||||
[4.3743] | [4.5026] | |||||||||
6 | [1.0000] | [1.0000] | ||||||||
[1.0602] | [0.9882] | |||||||||
[1.2453] | [1.3804] | |||||||||
[5.6609] | [6.6836] | |||||||||
2 | 2 | RK | [1.0000] | [1.0000] | ||||||
RK | [1.0887] | [1.0930] | ||||||||
RK | [1.1547] | [1.1557] | ||||||||
RK | [2.4495] | [2.4801] | ||||||||
PK of | [1.0231] | [1.0265] | ||||||||
PK of | [1.0984] | [1.0945] | ||||||||
PK of | [2.8944] | [2.7722] | ||||||||
4 | RK | [1.0000] | [1.0000] | |||||||
RK | [1.0817] | [1.0218] | ||||||||
RK | [1.2552] | [1.2955] | ||||||||
RK | [3.9734] | [4.3494] | ||||||||
PK of | [0.9948] | [1.0191] | ||||||||
PK of | [1.0579] | [1.0409] | ||||||||
PK of | [1.2215] | [1.2666] | ||||||||
PK of | [4.9457] | [5.0237] | ||||||||
6 | RK | [1.0000] | [1.0000] | |||||||
RK | [1.0700] | [0.9913] | ||||||||
RK | [1.3282] | [1.4056] | ||||||||
RK | [5.4536] | [6.3369] | ||||||||
PK of | [1.0419] | [1.0463] | ||||||||
PK of | [1.0966] | [1.0226] | ||||||||
PK of | [1.3577] | [1.4379] | ||||||||
PK of | [7.3943] | [7.7113] | ||||||||
3 | 2 | RK | [1.0000] | [1.0000] | ||||||
RK | [1.0864] | [1.0879] | ||||||||
RK | [1.1864] | [1.2013] | ||||||||
RK | [2.3660] | [2.6195] | ||||||||
PK of | [1.0448] | [1.0531] | ||||||||
PK of | [1.1098] | [1.1095] | ||||||||
PK of | [2.9686] | [2.9563] | ||||||||
4 | RK | [1.0000] | [1.0000] | |||||||
RK | [1.0861] | [0.9960] | ||||||||
RK | [1.3143] | [1.4119] | ||||||||
RK | [3.9987] | [4.7381] | ||||||||
PK of | [0.9765] | [0.9896] | ||||||||
PK of | [1.0301] | [0.9774] | ||||||||
PK of | [1.2284] | [1.3195] | ||||||||
PK of | [5.4110] | [5.4249] | ||||||||
6 | RK | [1.0000] | [1.0000] | |||||||
RK | [1.0769] | [0.9527] | ||||||||
RK | [1.4103] | [1.6260] | ||||||||
RK | [5.7653] | [7.2246] | ||||||||
PK of | [1.0606] | [1.0471] | ||||||||
PK of | [1.1092] | [1.0206] | ||||||||
PK of | [1.4384] | [1.6303] | ||||||||
PK of | [9.1885] | [8.8902] |
For , the MSE ratios were close to the AMSE ratios, which suggests that the AMSE criterion serves as a good indicator of real performance. Also, the optimal Biweight kernel achieved the best estimation result for every and , as expected from the asymptotic theories. Although the differences between the MSEs for the Biweight kernel and those for other truncated kernels (the RK and the PKs of and ) were not significant, especially for smaller , the differences between them and those for non-truncated kernels (the RKs and and the PK of ) were statistically significant at the 0.05 level for most combinations of and .
For the higher orders , truncated kernels performed well, whereas non-truncated ones gave significantly larger MSEs. This tendency is the same as in the case . However, the experimental results for the higher-order kernels exhibited deviations from the asymptotic theories: except for the case of , even with the largest sample size investigated, the smallest MSE values were achieved by kernels other than the one with the minimum AMSE ( for when , and the PK of for and the RK for when ). Such deviations can be ascribed to the fact that the asymptotics ignores residual terms of the AMSE, as described at the beginning of this section. The ratio between the leading-order and next-leading-order terms of the AMSE for the considered RKs and PKs is , which gets larger as and/or increase. Therefore, for the residual terms to be negligible, one needs more sample points for larger and/or . In the cases of , even might not have been large enough to translate the small difference of less than about 10 % in the AMSE into a difference in the MSE, for the PDFs used. However, considering that the kernels which performed worse than the best-performing ones still performed close to the theoretical optimum, with a difference of less than 10 % in the AMSE criterion, and that their experimental difference was not significant, we find that even though the AMSE criterion may not be a quantitative performance index for higher-order kernels, it is still suggestive of real performance.
5 Optimal Kernel for Other Methods
5.1 In-Sample Mode Estimation
A mode estimator considered in (Abraham et al., 2004), which we call the in-sample mode estimator (ISME), is defined as the sample point at which the value of the KDE (3) is maximal among those at the sample points:
(58) |
The ISME can be evaluated efficiently with the quick-shift (QS) algorithm (Vedaldi and Soatto, 2008). Although the QS algorithm has an extra tuning parameter in addition to and , which may affect the quality of the estimator, it has the advantage that it converges in a finite number of iterations irrespective of the kernel used, as long as the sample size is finite, making it computationally efficient.
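A minimal sketch of the ISME as defined in (58) is given below: it evaluates the KDE at every sample point and returns the maximizing point. The Gaussian kernel and the brute-force O(n²) evaluation are illustrative; the quick-shift algorithm discussed above is the more efficient option in practice.

```python
import numpy as np

def in_sample_mode(sample, h):
    """In-sample mode estimator: the sample point maximizing the KDE.
    The Gaussian kernel and the plain O(n^2) evaluation are illustrative; constants
    that do not affect the argmax are omitted."""
    diffs = (sample[:, None, :] - sample[None, :, :]) / h          # (n, n, d)
    kde_vals = np.exp(-0.5 * np.sum(diffs ** 2, axis=2)).sum(axis=1)
    return sample[np.argmax(kde_vals)]

# usage: theta_tilde = in_sample_mode(sample, h=0.5)
```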
Abraham et al. (2004) have proved in their Corollary 1.1 that, in the large-sample asymptotics, the ISME converges in distribution to an asymptotic distribution of the KME if the bandwidth satisfies as . A simple calculation shows that, if we use the optimal bandwidth in (10), the requirement for the bandwidth is fulfilled for satisfying , that is, for any when ; when ; and when . Accordingly, at least for these cases, the ISME has the same AMSE and the same AMSE criterion as the KME, so that the results on the optimal kernel still hold for the ISME as well: especially, minimizes the AMSE criterion of the ISME among the minimum-sign-change RKs, for the above-mentioned combinations of and .
5.2 Modal Linear Regression
Modal linear regression (MLR) (Yao and Li, 2014) aims to obtain a conditional mode predictor as a linear model. In addition to the intrinsic good properties rooted in the conditional mode, the MLR has the advantage that the resulting parameter and regression line are consistent even for a heteroscedastic or asymmetric conditional PDF, in contrast to robust M-type estimators, which are not consistent in these cases (Baldauf and Santos Silva, 2012).
Let be a pair of random variables in following a certain joint distribution. MLR assumes that the conditional distribution of conditioned on is such that the conditional PDF satisfies for any , where is an underlying MLR parameter. For given i.i.d. samples of , the MLR adopts, as the estimator of the parameter , the global maximizer of the kernel-based objective function with argument ,
(59) |
with the kernel defined on and the bandwidth .
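The following is a minimal sketch of fitting this objective by the EM-type iteratively reweighted least squares update mentioned in the introduction, written for a Gaussian kernel; the ordinary-least-squares initialization and the convergence tolerance are illustrative choices.

```python
import numpy as np

def modal_linear_regression(X, y, h, n_iter=200, tol=1e-8):
    """Modal linear regression fitted by an EM-type iteratively reweighted least
    squares update with a Gaussian kernel; the OLS initialization is illustrative."""
    X1 = np.column_stack([np.ones(len(y)), X])          # add an intercept column
    beta = np.linalg.lstsq(X1, y, rcond=None)[0]        # ordinary least squares start
    for _ in range(n_iter):
        r = y - X1 @ beta
        w = np.exp(-0.5 * (r / h) ** 2)                 # Gaussian kernel weights on residuals
        beta_new = np.linalg.solve(X1.T @ (w[:, None] * X1), X1.T @ (w * y))
        if np.linalg.norm(beta_new - beta) < tol:
            return beta_new
        beta = beta_new
    return beta

# usage: beta_hat = modal_linear_regression(X, y, h=0.5)
```

With a Gaussian kernel, this reweighting iteration is the modal-EM-type algorithm commonly used for this objective and is known not to decrease the objective at each step.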
In (Kemp et al., 2020, Theorem 3), the asymptotic normality of the MLR has been proved. They have considered the case where a 2nd order kernel is used and the AVC is relatively dominant compared with the AB by assuming . We provide an extension of their theorem which allows the use of a higher-order kernel and provides an explicit expression of the AB. Note that the vectorization operator is defined as the -vector for an matrix , and that we adopt the definition that the operator applied to a function with a matrix argument is the nabla operator with respect to the vectorization of the argument. The same definition applies to the Hessian operator as well.
Theorem 4.
Assume Assumption E.1 for an even integer . Then, the vectorization of the MLR parameter estimator asymptotically follows a normal distribution with the following AB and AVC:
(60) |
where the abbreviations , , and are defined by
(61) |
In the above, is defined as
(62) |
and is the matrix defined via (8). Moreover, the AMSE is obtained as
(63) |
By following the same line of argument as in Section 2, for instance, one can optimize the bandwidth in terms of the AMSE, and furthermore show that the AMSE criterion becomes if an RK is used, which implies that the kernel is optimal. These results are a generalization of (Yamasaki and Tanaka, 2019) to and .
5.3 Mode Clustering
In mode clustering, a cluster is defined via the gradient flow of the PDF regarded as a scalar field. A limiting point of the gradient flow defines a center of the cluster corresponding to its domain of attraction. For the mean-shift-based mode clustering (Comaniciu and Meer, 2002; Yamasaki and Tanaka, 2020), the KDE is often plugged into . Although clustering error for mode clustering using the KDE is difficult to evaluate in general dimensions, Casa et al. (2020) have analyzed clustering error rate in the univariate case in detail.
Here we review their discussion in a fairly simplified setting: we consider the univariate case and assume that the PDF and KDE are bimodal, as depicted in Figure 4. In this setting, the true clusters and estimated clusters are
(64) |
with the local minimizers and of the PDF and KDE, respectively, and the clustering error rate (CER) becomes
(65) |
In Figure 4, the red area represents the CER. The green area is , and the blue area is approximated as , which is negligible in the asymptotic situation. Therefore, the relationship 'green area < red area (CER) < green + blue areas' implies that the CER asymptotically equals the green area, : the asymptotic mean CER (AMCER) reduces to
(66) |
On the basis of the fact that the form of the asymptotic distribution of the local minimizer is the same as that of the mode, the AMCER behaves like the AE for the mode, discussed in Section 3.2.
Casa et al. (2020) have shown -dependence of the optimal bandwidth minimizing the AMCER. We further show the kernel dependence of the optimal bandwidth and the optimal kernel: Since the AMCER behaves like the AE for the mode, the optimal bandwidth minimizing the AMCER becomes , , and the resulting kernel optimization problem becomes equivalent to the type-1 kernel optimization problem for the KME with the AE criterion as the objective function. Thus, the kernel minimizes the AMCER for the optimal bandwidth among the -th order minimum-sign-change kernels.
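To make the univariate two-cluster setting concrete, the following sketch estimates a KDE on a grid, splits the data at the KDE's local minimizer between the two modes, and reports the empirical misclassification rate relative to the true split point, which plays the role of the CER; the mixture density, kernel, bandwidth, and search region are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def kde_on_grid(sample, grid, h):
    """Biweight-kernel KDE evaluated on a grid (the kernel choice is illustrative)."""
    u = (grid[:, None] - sample[None, :]) / h
    k = np.where(np.abs(u) <= 1, 15 / 16 * (1 - u ** 2) ** 2, 0.0)
    return k.mean(axis=1) / h

# bimodal truth: 0.5 N(-2, 1) + 0.5 N(2, 1); by symmetry the true split point is 0
n = 2000
sample = np.concatenate([rng.normal(-2, 1, n // 2), rng.normal(2, 1, n // 2)])
grid = np.linspace(-6.0, 6.0, 1201)
dens = kde_on_grid(sample, grid, h=0.5)

# estimated split: minimizer of the KDE between the two bumps (search region assumed known)
between = (grid > -2.0) & (grid < 2.0)
split_hat = grid[between][np.argmin(dens[between])]

# empirical clustering error rate: fraction of points on the wrong side of the true split
cer = np.mean((sample < split_hat) != (sample < 0.0))
print(split_hat, cer)
```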
6 Conclusion
In this paper, we have studied kernel selection, in particular the optimal kernel, for kernel mode estimation, by extending the existing univariate approach of (Gasser and Müller, 1979; Gasser et al., 1985; Granovsky and Müller, 1989, 1991). This approach is based on asymptotics, and it seeks a kernel function that minimizes the main term of the AMSE under the optimal bandwidth among kernels (or their derivatives) that change their sign the minimum number of times required by their order. Theorems 2 and 3 are the main novel theoretical contributions, showing that the Biweight hierarchy provides the optimal RK for every dimension and even order and , respectively. We have also studied the PK model and compared it with the RK model. As a result, we have found that the 2nd order optimal RK is superior in terms of the AMSE to the PK model using any non-negative kernel. Simulation studies confirmed the superiority of the optimal Biweight kernel among the non-negative kernels, as well as the good accuracy of truncated kernels, including the optimal kernels , even for the higher orders .
The theories on the optimal kernel are also effective for other kernel-based modal statistical methods. In Section 5, we have shown the optimal kernel for in-sample mode estimation, modal linear regression, and univariate mode clustering. Elucidating the optimal kernel for other potentially relevant methods, such as principal curve estimation or density ridge estimation (Ozertem and Erdogmus, 2011; Sando and Hino, 2020) and multivariate mode clustering, for which it is difficult to represent asymptotic errors explicitly, remains an open problem; studying them analytically or experimentally will also be interesting.
A Proof of Asymptotic Behaviors of Kernel Mode Estimation
Regularity conditions for Theorem 1, i.e., sufficient conditions for deriving the expressions (6), (9) of the AB, AVC, and AMSE along with the asymptotic normality of the KME, are listed below. They consist of the conditions on the sample (A.1), mode and PDF (A.2)–(A.6), kernel (A.6)–(A.14), and bandwidth (A.15) and (A.16).
Assumption A.1 (Regularity conditions for Theorem 1).
For finite and even ,
- (A.1) is a sample of i.i.d. observations from .
- (A.2) is times differentiable in (i.e., and , are continuous at ).
- (A.3) has a unique and isolated maximizer at (i.e., for all , due to (A.2), and for a neighborhood of ).
- (A.4) , satisfies , and is non-singular.
- (A.5) for all s.t. , and is bounded for all s.t. .
- (A.6) , and satisfy .
- (A.7) is bounded and twice differentiable in and satisfies the covering number condition, , and .
- (A.8) .
- (A.9) for all s.t. .
- (A.10) for all s.t. , and for some s.t. .
- (A.11) for all s.t. .
- (A.12) is bounded and satisfies , , and for some , for all .
- (A.13) has a finite determinant.
- (A.14) , satisfies the covering number condition.
- (A.15) (i.e., ).
- (A.16) .
Definition A.1 ((Pollard, 1984, Definition 23) and (Mokkadem and Pelletier, 2003, Section 2)).
- Let be a probability measure on and be a class of functions in . For each , define the covering number as the smallest value of for which there exist functions such that for each . We set if no such exists.
- Let be a function defined on . Let be the class of functions consisting of arbitrarily translated and scaled versions of . We say that satisfies the covering number condition if is bounded and integrable on , and if there exist and such that for any probability measure on and any .
For example, the RK and the PK fulfill the covering number condition in (A.7) if has bounded variation in both cases.
We give a proof of the asymptotic normality of the KME , which is a modification of the proof of (Mokkadem and Pelletier, 2003, Theorem 2.2), below.
Proof of Theorem 1.
Pollard (1984) gives the following lemma on uniform consistency (see his Theorem 37 in p. 34):
Lemma A.1.
Let be a function on satisfying the covering number condition, and be a sequence of positive numbers satisfying . If there exists such that , then
(A.17) |
where is the expectation value with respect to the distribution of the random vector , and where is a sample of i.i.d. observations of .
Also, one has the following lemma (see (Bochner, 2005, Theorem 1.1.1)):
Lemma A.2.
Let be a function on satisfying , , and , be a function on satisfying , and be a sequence of positive numbers satisfying . Then one has that, for any of continuity of ,
(A.18) |
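Lemma A.2 is a Bochner-type approximation result. The following sketch illustrates the underlying phenomenon in one dimension, with a Gaussian kernel and a continuous function of our own choosing (not the paper's objects): the scaled convolution at a fixed point tends to the function value there times the kernel's integral as the bandwidth shrinks.

```python
import numpy as np
from scipy.integrate import quad

K = lambda u: np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)   # kernel integrating to 1
g = lambda y: np.cos(y) + 2.0                           # bounded continuous function
x = 0.7                                                 # point of continuity

for h in [1.0, 0.3, 0.1, 0.03]:
    # (1/h) * int K((x - y)/h) g(y) dy, truncated to where the kernel is non-negligible
    val, _ = quad(lambda y: K((x - y) / h) * g(y) / h, x - 8 * h, x + 8 * h)
    print(h, val)

print("g(x) =", g(x))   # the values above approach this limit as h decreases
```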
Applying Lemma A.1 with and , under (A.1), the covering number condition of (A.7), (A.15), and (A.16), ensures that
(A.19) |
Also, Lemma A.2 with and under the fact , the continuity of (A.2), (A.7), (A.8), and (A.15) implies that
(A.20) |
According to these results, one has a.s., which implies the consistency of to , by contradiction with the -definition of the limit: for any , there exists a such that if .
Because maximizes (i.e., ) and and hence are twice differentiable (A.7), Taylor’s approximation of at around shows
(A.21) |
that is, if is invertible, where satisfies . We thus study the asymptotic behaviors of and below.
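For the reader's convenience, the standard form of this Taylor-expansion step is, in generic notation (with $\hat f_n$ the kernel density estimate, $\hat\theta_n$ the KME, and $\theta_*$ the true mode; these symbols are ours, since the paper's own notation is not reproduced in this extraction),
\[
0 \;=\; \nabla \hat f_n(\hat\theta_n) \;=\; \nabla \hat f_n(\theta_*) + \nabla^2 \hat f_n(\tilde\theta_n)\,(\hat\theta_n - \theta_*),
\qquad\text{hence}\qquad
\hat\theta_n - \theta_* \;=\; -\bigl[\nabla^2 \hat f_n(\tilde\theta_n)\bigr]^{-1}\,\nabla \hat f_n(\theta_*),
\]
whenever $\nabla^2 \hat f_n(\tilde\theta_n)$ is invertible, where $\tilde\theta_n$ is an intermediate point between $\hat\theta_n$ and $\theta_*$ (strictly speaking, one such point per row of the Hessian).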
Applying Lemma A.1 with , and under (A.1), the covering number condition of (A.14), (A.15), and (A.16), and the consistency of to shows that converges to . Moreover, integration by parts and Lemma A.2 with and , under the continuity of (A.2), (A.4), (A.7), (A.8), and (A.15), imply that, for every satisfying ,
(A.22) |
that is, . Combining these results, one can find that is consistent to the matrix .
We next show that asymptotically follows the normal distribution with mean and variance-covariance matrix , on the basis of Lyapounov’s central limit theorem. The critical difference from the existing proof by (Mokkadem and Pelletier, 2003) lies in the assumptions (A.6) and (A.10), which affect the calculation of the mean of . On the basis of the multivariate Taylor expansion of around , under the setting that is even,
(A.23) |
where lies in -neighborhood of and depends on , one has that
(A.24) |
Because of (A.3), the integrals of the -th summands vanish owing to the finiteness of , (A.5), and (A.9), and the integral of the -nd residual term is asymptotically negligible owing to the boundedness of , (A.5), (A.11), and (A.15); one can therefore approximate the asymptotic mean of as
(A.25) |
under the assumptions (A.6) and (A.10). Moreover, by using Lemma A.2 with and under the continuity of (A.2), (A.12), and (A.15), one can calculate the variance-covariance matrix of with the definition (A.13):
(A.26) |
Finally, applying Slutsky’s theorem to these pieces completes the proof of the asymptotic normality of . ∎
The condition (A.6) is added to those in (Mokkadem and Pelletier, 2003) to ensure that the main term of the AB does not vanish. Also, the condition (A.10) is a modification of condition (A5) iii) in (Mokkadem and Pelletier, 2003), the latter of which does not apply to RKs, contrary to their description. However, it should be noted that the kernel does not satisfy the twice-differentiability requirement among the sufficient conditions on the kernel. Eddy (1980) studied the behavior of the KME within the -neighborhood of the mode and attempted a proof without assuming that the kernel is twice differentiable (so that the kernel fulfills their requirements), but that proof is not rigorous in that it does not consider the possibility that the KME lies outside of that neighborhood. We have considered several approaches, including the proof in (Eddy, 1980), and concluded that twice differentiability of appears to be necessary for deriving the AMSE.
B Theories on Optimal Kernel
B.1 Relationships between the Moment Condition and the Number of Sign Changes
In this section, we describe relationships between the moment condition and the number of sign changes of a kernel. We first introduce some notation. To represent a kernel class defined by the moment condition, we introduce, for integers , , and satisfying ,
(B.1) |
where denotes the Pochhammer symbol. Also, we represent a class of functions on with the prescribed number of sign changes as
(B.2) |
Here, the number of the sign changes is defined as follows:
Definition B.1.
A function defined on a finite or infinite interval is said to change its sign times on , if there are subintervals , , where , such that:
(i) (or ) for all , and there exists such that (or ), for each .
(ii) If , for all , , .
Note that a function may have an interval over which it equals zero, and that a point where its sign changes does not have to be uniquely determined. Additionally, we introduce the following term:
Definition B.2.
Functions and defined on a finite or infinite interval are said to share the same sign-change pattern if the same set of subintervals in Definition B.1 applies to both and .
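As a simple numerical companion to Definition B.1, the following sketch counts sign changes of a function sampled on a grid, discarding (near-)zero values so that stretches on which the function vanishes do not contribute; the test function is an arbitrary example of ours.

```python
import numpy as np

def count_sign_changes(values, tol=1e-12):
    """Count sign changes in a sequence of function values.

    Values within tol of zero are dropped first, mirroring the fact that a
    function may vanish on a whole interval without that interval contributing
    a sign change in the sense of Definition B.1.
    """
    signs = np.sign(values[np.abs(values) > tol])
    return int(np.sum(signs[1:] != signs[:-1]))

r = np.linspace(0.0, 1.0, 10001)
f = (1.0 - r**2) * (0.2 - r**2)        # toy example: one sign change on (0, 1)
print(count_sign_changes(f))           # -> 1
```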
In the following, we provide a lemma (Lemma B.2) on the minimum number of sign changes of a -th order RK. It is a multivariate extension of (Gasser et al., 1985, Lemma 2), which states that a univariate -th order kernel changes its sign at least times on , and it serves as a basis for the condition (P2-3) in Problem 2 and the condition (P3-3) in Problem 3. The following Lemma B.1 is used in the proofs of Lemmas B.2 and B.12.
Lemma B.1.
Let be measurable subsets of with , where is the Lebesgue measure on , and assume that they are ordered in the sense that for any and any with the inequality holds almost surely. Let be a measurable function which does not change its sign inside each , . Assume further that holds. Then the following matrix is non-singular:
(B.7) |
Proof.
Assume to the contrary that is singular. Then there exists a non-zero vector which satisfies . Rewriting it component-wise, one has
(B.8) |
where we let
(B.9) |
Take an interval with and . Then is integrable, does not change its sign on , and . From (B.8) and the mean value theorem, there exists satisfying . Since is an even function of , one has . The zeros are all distinct. Thus is factorized as
(B.10) |
where is a polynomial not identically equal to 0. The degree of the right-hand side is therefore at least , whereas from (B.9) the degree of is at most , leading to a contradiction. ∎
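Since the explicit entries of the matrix in (B.7) are not reproduced in this extraction, the following sketch only illustrates the flavor of Lemma B.1 under an assumed even-moment form of the matrix: for ordered intervals and a weight of constant sign on each, the moment matrix is assembled numerically and checked to be non-singular.

```python
import numpy as np
from scipy.integrate import quad

# Ordered intervals A_1 < A_2 < A_3 and a weight of constant sign on each of them.
intervals = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]
w = lambda r: np.exp(-r)   # positive everywhere, hence of constant sign on each A_j

# Hypothetical even-moment matrix M[i, j] = int_{A_j} r^(2i) w(r) dr (our stand-in for (B.7)).
M = np.array([[quad(lambda r: r ** (2 * i) * w(r), a, b)[0] for (a, b) in intervals]
              for i in range(3)])
print(np.linalg.det(M))    # clearly non-zero, as the lemma asserts for such configurations
```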
Lemma B.2.
For and even , if a function defined on is such that or , then changes its sign at least times on .
Proof.
Assume that changes its sign times on , and decompose into subintervals as in Definition B.1: the sign of alternates across these subintervals.
Proof for . Lemma B.1 shows that the matrix
(B.11) |
is non-singular. Assume, contrary to the statement of this lemma, that . Then, from the moment conditions , one would have
(B.12) |
This contradicts the non-singularity of . Thus, holds, i.e., changes its sign at least times on .
Proof for . For the matrix
(B.13) |
one can proceed similarly to the case of . ∎
Also, in the case of , we can identify the order in which the kernel changes its sign:
Lemma B.3.
Let and or , and assume that a function is defined on and such that . Then, for and for , where the subintervals of are defined according to Definition B.1. Moreover, .
Proof.
Proof for . A 2nd order kernel satisfying the normalization condition and the minimum-sign-change condition is non-negative, and the statements for any and are trivial.
Proof for . The sign change of a minimum-sign-change 4-th order kernel occurs at a single point, which is denoted as . If one assumes that for , then one obtains the following contradiction: on the basis of the mean value theorem for integrals,
(B.14) |
where . Thus, for and for . Also, one has
(B.15) |
where . Therefore, is confirmed. ∎
B.2 Proof of Theorem 2
First we show that the kernel satisfies all the conditions of Problem 2.
Proof.
The fact that satisfies (P2-3), that is, it changes its sign times on , is a direct consequence of the well-known property of the Jacobi polynomials, that with has simple zeros in the interval , together with the rightmost expression in (22) of . Differentiability of on and boundedness and continuity in the sense of ‘a.e.’ of (condition (P2-2)), finiteness of (condition (P2-4)), and the behavior as (condition (P2-5)) follow straightforwardly from the fact that, from the rightmost expression in (22), is a polynomial restricted to the finite interval with a double zero at .
We thus show in the following that holds, that is, satisfies the moment condition (P2-1). Let
(B.16) |
where
(B.17) |
Using the identity (Prudnikov et al., 1986, Section 2.22.2, Item 11)
(B.18) |
where
(B.19) |
is the binomial coefficient extended to real-valued arguments, and where denotes the Beta function, one has
(B.20) |
The values of for are therefore
(B.24) |
These in turn imply that the moments of are given by
(B.28) |
showing that holds. (The fact that for can alternatively be understood on the basis of the expression (B.17) and the orthogonality relation (B.70) of the Jacobi polynomials.) ∎
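The multivariate expressions verified above are not reproduced in this extraction; as a minimal numerical companion we instead check the analogous conditions for the standard univariate Biweight kernel, the simplest member of the Biweight hierarchy. The target values 1, 0, and 1/7 are standard facts about this kernel, not quantities taken from the text.

```python
import numpy as np
from scipy.integrate import quad

biweight = lambda u: 15.0 / 16.0 * (1.0 - u**2) ** 2     # support [-1, 1]

m0 = quad(lambda u: biweight(u), -1, 1)[0]               # normalization: 1
m1 = quad(lambda u: u * biweight(u), -1, 1)[0]           # first moment: 0 (symmetry)
m2 = quad(lambda u: u**2 * biweight(u), -1, 1)[0]        # second moment: 1/7 (non-zero)
print(m0, m1, m2)   # -> 1.0, 0.0, 0.142857..., i.e. a valid non-negative 2nd-order kernel
```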
Then, we introduce two polynomials related to the kernel , before the succeeding proof:
(B.29) |
is the original polynomial of before truncation. Also, the function is defined such that the first derivative of equals , and it is a polynomial function with terms of -nd degree.
In the following, we give Lemma B.6 on the properties of functions and .
Lemma B.5.
Suppose that has coefficients , . Then, , ; , ; and , are strictly monotonically increasing (resp. decreasing) for , when (resp. ).
Proof.
Lemma B.6.
Proof.
The symmetric polynomial function has zeros inside and double zeros at . Thus, its coefficients become and .
Using the relationship of the Jacobi polynomial
(B.31) |
which appears in (Askey, 1975, p.7), one has
(B.32) |
Then, on the basis of the factorized expression of ,
(B.33) |
one has (resp. ) when (resp. ).
Finally, we provide a proof of Theorem 2 below:
Proof of Theorem 2.
Suppose that satisfies the conditions (P2-1)–(P2-5) and has the same -th moment as that of . In this setting, because satisfies (P2-1)–(P2-5) (as shown in Lemma B.4), the disturbance needs to satisfy the moment conditions
(B.35) |
By radial symmetry, the odd-order moment conditions are automatically fulfilled. Additionally, it should be noted that does not need to have a compact support, so that the optimality of the kernel is not limited to the class of truncated kernels. The following proof does, however, require the condition (P2-5) (i.e., ).
The following calculations derive the inequality (B.38). On the basis of integration by parts, , , the combination of the moment conditions (B.35), and the fact that the polynomial defined in (B.29) has -th degree terms, one has
(B.36) |
Second, integration by parts, together with and , shows
(B.37) |
Collecting these pieces, one can obtain the following inequality:
(B.38) |
where equality holds exactly when , which holds only if due to the moment condition .
Thus, since the kernels and have the same -th moment, in order to prove the optimality of with respect to the AMSE criterion, it suffices to prove
(B.39) |
for , instead of showing . We prove the inequality (B.39) for order , below.
Proof for . First, we provide the proof for . Lemma B.6 shows for . Also, since is non-negative due to the minimum-sign-change condition and for , the disturbance , has to be non-negative. Thus, (B.39) holds.
Proof for . We next show the theorem for . The only sign change of a minimum-sign-change 4-th order kernel goes from positive to negative at some point, by Lemma B.3.
Next, we consider the other case with , in which the following are satisfied:
(B.40) | ||||
(B.41) |
From the equation
(B.42) |
one has
(B.43) |
which in turn yields
(B.44) |
On the basis of the mean value theorem for integrals, one obtains
(B.45) |
for , where we use and the fact that is monotonically decreasing for , both of which follow from Lemma B.6. This inequality completes the proof. ∎
B.3 Proof of Theorem 3
B.3.1 Reformulation of Problem 3
In this section, we provide a proof of Theorem 3, which is a multivariate RK extension of the existing results of Granovsky and Müller (1991). We here reformulate Problem 3 as Problem B.2, to which the proof in the existing works applies almost directly.
Problem 3 includes both the function and its derivative in its formulation, which makes the problem difficult to handle. Under the condition (P3-5) (i.e., as ), integration by parts yields the equation,
(B.46) |
This equation allows us to rewrite the objective function (P3) as a functional of only, and it leads to the following lemma, which allows us to represent the moment condition (P3-1) in terms of .
Lemma B.7.
For and even , assume that holds as and that is continuous a.e. Then the following two conditions are equivalent:
(i) and is differentiable.
(ii) .
Thus, the following problem is equivalent to Problem 3 in the sense that a solution of one of the problems is a solution of the other.
Problem B.1.
The kernel optimization problem B.1 still has a scale indeterminacy in its solution, as also described in Footnote 3. Indeed, one has the following lemma.
Lemma B.8.
Let and . For a function and , we define the scale-transformed function . Then, one has
(B.47) |
B.3.2 Auxiliary Lemmas
Here we introduce notation and lemmas that will be used for the proofs in the next sections. Let be a positive integer. For functions defined on , define
(B.49) |
Let be the space of square-integrable functions on with weight . This is a Hilbert space equipped with the inner product . Note however that, for any nonnegative integer , one has , as the integral of over with weight diverges.
With a measurable subset of , define
(B.50) |
and let be the space of square-integrable functions on with weight . Let be a compact measurable subset of , and let . One has that and are both subspaces of , and that the map from to is an isomorphism (Conway, 2007, page 25). We therefore identify and via this isomorphism.
Let be a finite subset of , and let
(B.51) |
For fixed and , is a subspace of with dimension equal to .
Lemma B.9.
Assume that a function satisfies for any with and for all . Then .
Proof.
In view of the above direct-sum decomposition of , the subspace is regarded as the restriction of
(B.52) |
to . The orthogonal complement of in is given, via Lemma B.10 below, by
(B.53) |
From the assumption of this lemma, the function belongs to the orthogonal complement of . On the other hand, is a closed subspace in thanks to (Rudin, 1991, Theorem 1.42), since is finite-dimensional. One can therefore conclude that holds, implying , and the statement of the lemma follows. ∎
Lemma B.10.
Let be a Hilbert space, and let be subsets of satisfying . Then holds.
Proof.
Since , one has and hence . One similarly has , and thus . On the other hand, take and . Noting that is represented as with and , one has
(B.54) |
where the final equality is due to and , implying that holds. This proves . ∎
It should be noted that the above lemma holds regardless of the dimensionality or the closedness of .
B.3.3 Proof that the optimal solution is a polynomial kernel
The following lemma is an RK extension of (Granovsky and Müller, 1989, Theorem), and it states the functional form that an optimal solution of Problem B.2 should take.
Lemma B.11.
Proof of Lemma B.11.
Step 1
Define , and let be a sequence of functions satisfying with positive numbers satisfying as . The function set is weakly compact as a bounded set in the Hilbert space equipped with the scalar product . It follows that there exists a subsequence of weakly converging to a function . Note that this convergence in particular shows , and one has
(B.55) |
which implies .
In the rest of Step 1, we show that satisfies (P5-1), (P5-3), and (P5-6): implies that
(B.56) |
where , are -dependent constants. Introducing and a -independent constant , from the condition (P5-5) for (that is, ), one has
(B.57) |
Because converges to and , one has
(B.58) |
From these results, one has
(B.59) |
which implies that satisfies (P5-1) and (P5-6) for . Moreover, (P5-3) also holds: otherwise, and would have an interval (say ) on which their signs do not coincide (so with ), contradicting the fact that converges to .
Step 2
On the basis of the fact that proved in Step 1, we here show a functional form of . We introduce the function class
(B.60) |
Then, one has that for with a sufficiently small absolute value and any bounded non-identically-zero function satisfying , where denotes the support of and . For the optimality of , for all is necessary, because ,
(B.61) |
and is not identically zero since is bounded away from 0, as shown by contradiction with the condition (P5-3) for and Lemma B.2. This implies that should lie in the orthogonal complement of , that is, , which is proved by applying Lemma B.9. The result implies that takes the form , where is a polynomial with terms of degree in . Also, the boundedness of , in addition to the functional form of derived above, implies that the support of is bounded.
Step 3
Hereafter we show that an optimal solution has no discontinuity and that, for a polynomial satisfying , takes the form
(B.62) |
Then, we prove the following lemma.
Lemma B.12.
Let be a subset of with , where denotes the Lebesgue measure on . Let be a function defined in (B.62) and be the support of . Assume that takes the same sign over , and that and share the same sign-change pattern. Then either or is zero.
Proof of Lemma B.12.
Assume to the contrary that both and are strictly positive. Since , one can take an ordered partition of with for all , with which, for and with one has almost surely. We let and with
(B.63) | |||
(B.64) |
One has due to the assumption . Furthermore, let and define
(B.65) |
Note that is equivalent to on when .
We now want to set so that and the function share the same moments of orders in . This is achieved if satisfies , since the -st moments of and are and , respectively. Applying Lemma B.1 with shows that is invertible, so that such a choice of makes the moments of and coincide.
With the above , let , where is the sign of on . One has by construction, which implies . On the other hand, one has
(B.66) |
Since vanishes outside , we have
(B.67) |
For a sufficiently small , the function has the same sign as on , and it is equal to outside . Therefore, and share the same sign-change pattern, and one consequently has . The above inequality implies that for a sufficiently small one has , leading to a contradiction. ∎
We consider, as in Definition B.1, subintervals partitioning according to the sign of . Since is a polynomial of degree with odd-degree terms only, the number of the above subintervals is at most . Lemma B.12 states that, for each of the subintervals, the support of either contains its interior entirely or not at all, and cannot contain only a fraction of it in terms of the Lebesgue measure. Since and is bounded, the only possibility is therefore that the closure of equals , where is the largest zero of , that is, .
B.3.4 Proof that the optimal polynomial kernel is uniquely determined
The following lemma is an RK extension of (Granovsky and Müller, 1989, Lemma 2), and shows that a polynomial function suggested by the above lemma is uniquely determined.
Lemma B.13.
For and even , there exists a unique constant and a unique polynomial with terms of degree in , such that satisfies
(B.68) |
Also, the resulting polynomial is such that .
For simplicity of description, suppose here that is given such that equals 1; this is possible because in and in (P5-6) are in one-to-one correspondence, as can be seen from Lemma B.8 on the scale transformation. We then show that satisfying the conditions (instead of ) is uniquely determined.
Proof of Lemma B.13.
For odd , we introduce one of the normalized Jacobi polynomials on :
(B.69) |
where is a normalization constant, and define . It is straightforward to show that satisfies the orthogonality relation
(B.70) |
where the last equality follows from the orthogonality relation of the Jacobi polynomials:
(B.71) |
Since the Jacobi polynomial is a polynomial of degree , is a polynomial with terms of degree in , so that one can write
(B.72) |
where . Furthermore, since any polynomial of degree is represented as a linear combination of , any polynomial with terms of degree in for odd is represented as a linear combination of . This in particular means that, for odd , with is represented as a linear combination of , so that the orthogonality (B.70) implies that satisfies the moment conditions , .
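Because the specific parameters of the Jacobi polynomials in (B.69)–(B.71) are not reproduced in this extraction, the following sketch verifies only the generic weighted orthogonality relation that the argument relies on, for an arbitrary choice of the parameters alpha and beta.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import jacobi

alpha, beta = 1.0, 0.5                                  # arbitrary parameters for illustration
weight = lambda x: (1 - x) ** alpha * (1 + x) ** beta

def weighted_inner(m, n):
    """int_{-1}^{1} (1-x)^alpha (1+x)^beta P_m^(alpha,beta)(x) P_n^(alpha,beta)(x) dx."""
    Pm, Pn = jacobi(m, alpha, beta), jacobi(n, alpha, beta)
    return quad(lambda x: weight(x) * Pm(x) * Pn(x), -1, 1)[0]

print(weighted_inner(2, 3))   # ~0: polynomials of distinct degrees are orthogonal
print(weighted_inner(3, 3))   # > 0: squared norm of the degree-3 polynomial
```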
Any polynomial with terms of degree in can be represented as a linear combination of , . So, the polynomial in the statement of the theorem can be written as
(B.73) |
The coefficient is evaluated, via the moment conditions of , as
(B.74) |
where . Therefore, has a unique representation,
(B.75) |
where is variable and the other components are fixed. Additionally imposing determines uniquely.
Proof.
C Supplemental Information on Appeared Kernels
Here, we provide supplemental information on the kernels used in our simulation experiments in Section 4. More specifically, we give the expressions of , , and for those kernels.
Proposition C.1.
For the kernel , the functionals , , and become
(C.1) |
The following propositions concern the kernels , , and , which are designed as higher-order extensions of the Epanechnikov, Gaussian, and Laplace kernels via the jackknife (Schucany and Sommers, 1977; Wand and Schucany, 1990). These kernels belong to the Epanechnikov, Gaussian, and Laplace hierarchies, respectively, and take the form of a product between a polynomial and the Epanechnikov, Gaussian, or Laplace kernel.
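As a concrete univariate instance of such a hierarchy (a standard construction, not one of the multivariate kernels in the propositions below), the fourth-order Gaussian-based kernel multiplies the Gaussian density by a quadratic polynomial; the sketch checks numerically that its zeroth and second moments are 1 and 0 while its fourth moment is non-zero.

```python
import numpy as np
from scipy.integrate import quad

phi = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # standard Gaussian density
K4 = lambda x: 0.5 * (3.0 - x**2) * phi(x)                # 4th-order Gaussian-based kernel

moments = [quad(lambda x: x**j * K4(x), -np.inf, np.inf)[0] for j in (0, 2, 4)]
print(moments)   # ~[1.0, 0.0, -3.0]: unit mass, vanishing 2nd moment, non-zero 4th moment
```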
Proposition C.2.
The kernel is a -variate, -th order kernel, where
(C.2) |
and the functionals , , and of become
(C.3) |
Proposition C.3.
The kernel is a -variate, -th order kernel, where
(C.4) |
where is the Laguerre polynomial (Szegö, 1939). Also, the functionals , , and of become
(C.5) |
Proposition C.4.
The kernel is a -variate, -th order kernel, where
(C.6) |
and where , , are determined as
(C.7) |
Also, the functionals , , and of become
(C.8) |
D Proof of Proposition 6
This appendix provides a proof of Proposition 6. In the case , the third-order term of the Taylor expansion of around its mode is represented as a real binary cubic form, i.e., a homogeneous degree-3 polynomial of : , where are proportional to appropriate third-order partial derivatives of at . The properties of such a real binary cubic form that are invariant under invertible linear transforms of are characterized by the algebraic curve defined as the set of satisfying . Putting aside the trivial case , the classification of this algebraic curve can be carried out on the basis of the classification of the set of solutions of the associated cubic equation in a real-valued variable : it has either:
• a single real solution, which is triply degenerate,
• two real distinct solutions, one of which is doubly degenerate,
• three real distinct solutions, or
• a real and two complex solutions.
Accordingly, the third-order term of the Taylor expansion of at the mode is either:
(D.1) 0 identically,
(D.2) of the form ,
(D.3) of the form with linearly independent,
(D.4) of the form with , any two of which are linearly independent, or
(D.5) of the form , where with positive definite.
In the case (D.1), the AB is trivially 0 for any choice of . The cases (D.4) and (D.5) have already been covered in Propositions 3 and 4, respectively. What remains is to consider the cases (D.2) and (D.3), which are dealt with in the following lemmas, completing the proof of Proposition 6.
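As a small numerical companion to the classification above (with coefficient names of our own choosing), a dehomogenized real binary cubic form a t^3 + b t^2 + c t + d can be sorted into the root patterns listed at the start of this appendix by inspecting the roots of the associated cubic equation:

```python
import numpy as np

def classify_cubic(a, b, c, d, tol=1e-3):
    """Classify the roots of a*t^3 + b*t^2 + c*t + d = 0 (a != 0).

    The tolerance is loose because repeated roots are numerically
    ill-conditioned for polynomial root finders.
    """
    roots = np.roots([a, b, c, d])
    if np.any(np.abs(roots.imag) > tol):
        return "one real and two complex solutions"
    r = np.sort(roots.real)
    if r[2] - r[0] < tol:
        return "a single, triply degenerate real solution"
    if min(r[1] - r[0], r[2] - r[1]) < tol:
        return "two distinct real solutions, one doubly degenerate"
    return "three distinct real solutions"

print(classify_cubic(1, 0, -1, 0))    # t(t - 1)(t + 1): three distinct real solutions
print(classify_cubic(1, 0, 1, 0))     # t(t^2 + 1):      one real and two complex solutions
print(classify_cubic(1, -3, 3, -1))   # (t - 1)^3:       a single, triply degenerate real solution
```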
Lemma D.1.
with is non-zero irrespective of the choice of the positive definite .
Proof.
One has
(D.6) |
and the coefficient of in the above expression is non-zero because of the positive-definiteness of . ∎
Lemma D.2.
with linearly independent is non-zero irrespective of the choice of the positive definite .
Proof.
One has
(D.7) |
and the coefficient of in the above expression is non-zero because of the positive-definiteness of . ∎
A general treatment would require a higher-order analog of Sylvester’s law of inertia for quadratic forms, that is, a classification of homogeneous degree- polynomials of variables under the equivalence relation in which two polynomials are equivalent if and only if one can be transformed into the other by applying an invertible linear transform to its variables. This problem seems to be less explored than that for , and is already complicated even when : see, e.g., Banchi (2015). We therefore do not attempt to extend Proposition 6 to more general cases in this paper, and defer the problem to future investigation.
E Proof of Asymptotic Behaviors of the Modal Linear Regression
Theorem 4 on the asymptotic behaviors of the MLR parameter estimator can be proved in almost the same manner as Theorem 1, under the following regularity conditions:
Assumption E.1 (Regularity conditions for Theorem 4).
For finite and even ,
(E.1) is a sample of i.i.d. observations from .
(E.2) for some .
(E.3) The parameter space is compact subset of , and has a nonempty interior ; .
(E.4) for any fixed such that (no multicollinearity).
(E.5) is times differentiable in for all .
(E.6) has a unique and isolated maximizer at for all (i.e., for all , , and for a neighborhood of , for all ).
(E.7) , satisfies for all , and in (61) is non-singular.
(E.8) for all and s.t. , and is bounded in for all and s.t. .
(E.9) , , and is such that in (61) is non-zero.
(E.10) is bounded and twice differentiable in and satisfies the covering number condition, , and .
(E.11) .
(E.12) for all s.t. .
(E.13) for all s.t. , and for some s.t. .
(E.14) for all s.t. .
(E.15) is bounded and satisfies , , and for some , for all .
(E.16) has a finite determinant .
(E.17) , satisfies the covering number condition.
(E.18) .
(E.19) .
Acknowledgements
This work was supported by Grant-in-Aid for JSPS Fellows, Number 20J23367.
References
- Abraham et al. (2004) C. Abraham, G. Biau, and B. Cadre. On the asymptotic properties of a simple estimate of the mode. ESAIM: Probability and Statistics, 8:1–11, 2004.
- Askey (1975) R. Askey. Orthogonal Polynomials and Special Functions, volume 21. SIAM, 1975.
- Baldauf and Santos Silva (2012) M. Baldauf and J. M. C. Santos Silva. On the use of robust regression in econometrics. Economics Letters, 114(1):124–127, 2012.
- Banchi (2015) M. Banchi. Rank and border rank of real ternary cubics. Bollettino dell’Unione Matematica Italiana, 8(1):64–80, 2015.
- Berlinet (1993) A. Berlinet. Hierarchies of higher order kernels. Probability Theory and Related Fields, 94(4):489–504, 1993.
- Berlinet and Thomas-Agnan (2004) A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers, 2004.
- Bochner (2005) S. Bochner. Harmonic Analysis and the Theory of Probability. Courier Corporation, 2005.
- Casa et al. (2020) A. Casa, J. E. Chacón, and G. Menardi. Modal clustering asymptotics with applications to bandwidth selection. Electronic Journal of Statistics, 14(1):835–856, 2020.
- Comaniciu and Meer (2002) D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603–619, 2002.
- Conway (2007) J. B. Conway. A Course in Functional Analysis. Springer, second edition, 2007.
- Donoho et al. (1991) D. L. Donoho, R. C. Liu, et al. Geometrizing rates of convergence, III. The Annals of Statistics, 19(2):668–701, 1991.
- Eddy (1980) W. F. Eddy. Optimum kernel estimators of the mode. The Annals of Statistics, 8(4):870–882, 1980.
- Gasser and Müller (1979) T. Gasser and H.-G. Müller. Kernel estimation of regression functions. In T. Gasser and M. Rosenblatt, editors, Smoothing Techniques for Curve Estimation, pages 23–68. Springer, 1979.
- Gasser et al. (1985) T. Gasser, H.-G. Müller, and V. Mammitzsch. Kernels for nonparametric curve estimation. Journal of the Royal Statistical Society. Series B (Methodological), 47:238–252, 1985.
- Granovsky and Müller (1989) B. L. Granovsky and H.-G. Müller. On the optimality of a class of polynomial kernel functions. Statistics & Risk Modeling, 7(4):301–312, 1989.
- Granovsky and Müller (1991) B. L. Granovsky and H.-G. Müller. Optimizing kernel methods: a unifying variational principle. International Statistical Review/Revue Internationale de Statistique, 59(3):373–388, 1991.
- Grund and Hall (1995) B. Grund and P. Hall. On the minimisation of error in mode estimation. The Annals of Statistics, 23(6):2264–2284, 1995.
- Hasminskii (1979) R. Hasminskii. Lower bound for the risks of nonparametric estimates of the mode. Contributions to Statistics, pages 91–97, 1979.
- Kemp et al. (2020) G. C. R. Kemp, P. M. D. C. Parente, and J. M. C. Santos Silva. Dynamic vector mode regression. Journal of Business & Economic Statistics, 38(3):647–661, 2020.
- Mokkadem and Pelletier (2003) A. Mokkadem and M. Pelletier. The law of the iterated logarithm for the multivariate kernel mode estimator. ESAIM: Probability and Statistics, 7:1–21, 2003.
- Müller (1984) H.-G. Müller. Smooth optimum kernel estimators of densities, regression curves and modes. The Annals of Statistics, 12(2):766–774, 1984.
- Ozertem and Erdogmus (2011) U. Ozertem and D. Erdogmus. Locally defined principal curves and surfaces. Journal of Machine Learning Research, 12:1249–1286, 2011.
- Parzen (1962) E. Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065–1076, 1962.
- Pollard (1984) D. Pollard. Convergence of Stochastic Processes. Springer Science & Business Media, 1984.
- Prudnikov et al. (1986) A. P. Prudnikov, Y. A. Brychkov, and O. I. Marichev. Integrals and Series, Volume 2: Special Functions. Gordon and Breach Science Publishers, 1986.
- Romano (1988) J. P. Romano. On weak convergence and optimality of kernel density estimates of the mode. The Annals of Statistics, 16(2):629–647, 1988.
- Rudin (1991) W. Rudin. Functional Analysis. McGraw-Hill, second edition, 1991.
- Rüschendorf (1977) L. Rüschendorf. Consistency of estimators for multivariate density functions and for the mode. Sankhyā: The Indian Journal of Statistics, Series A, 39(3):243–250, 1977.
- Sando and Hino (2020) K. Sando and H. Hino. Modal principal component analysis. Neural Computation, 32(10):1901–1935, 2020.
- Schucany and Sommers (1977) W. Schucany and J. P. Sommers. Improvement of kernel type density estimators. Journal of the American Statistical Association, 72(358):420–423, 1977.
- Szegö (1939) G. Szegö. Orthogonal Polynomials, volume 23. American Mathematical Society, 1939.
- Vedaldi and Soatto (2008) A. Vedaldi and S. Soatto. Quick shift and kernel methods for mode seeking. In European Conference on Computer Vision, pages 705–718, 2008.
- Vieu (1996) P. Vieu. A note on density mode estimation. Statistics & Probability Letters, 26(4):297–307, 1996.
- Wand and Schucany (1990) M. P. Wand and W. R. Schucany. Gaussian-based kernels. Canadian Journal of Statistics, 18(3):197–204, 1990.
- Yamasaki and Tanaka (2019) R. Yamasaki and T. Tanaka. Kernel selection for modal linear regression: Optimal kernel and IRLS algorithm. In 2019 18th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 595–601, Dec 2019.
- Yamasaki and Tanaka (2020) R. Yamasaki and T. Tanaka. Properties of mean shift. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(9):2273–2286, 2020.
- Yamato (1971) H. Yamato. Sequential estimation of a continuous probability density function and mode. Bulletin of Mathematical Statistics, 14(3):1–12, 1971.
- Yao and Li (2014) W. Yao and L. Li. A new regression model: modal linear regression. Scandinavian Journal of Statistics, 41(3):656–671, 2014.