A reproducing kernel Hilbert space framework for functional classification
Abstract
We encounter a bottleneck when we try to borrow the strength of classical classifiers to classify functional data. The major issue is that functional data are intrinsically infinite dimensional; thus classical classifiers either cannot be applied directly or perform poorly due to the curse of dimensionality. To address this concern, we propose to project functional data onto one specific direction, and a distance-weighted discrimination (DWD) classifier is then built upon the projection score. The projection direction is identified by minimizing, over a reproducing kernel Hilbert space, an empirical risk function that contains the particular loss function of a DWD classifier. Hence our proposed classifier can avoid overfitting and enjoys the appealing properties of DWD classifiers. This framework is further extended to accommodate functional data classification problems in which scalar covariates are involved. In contrast to previous work, we establish a non-asymptotic estimation error bound on the relative misclassification rate. In the finite-sample case, we demonstrate through simulation studies and a real-world application that the proposed classifiers compare favorably with several commonly used functional classifiers in terms of prediction accuracy.
keywords:
Functional classification; Projection; Distance-weighted discrimination; Reproducing kernel Hilbert space; Non-asymptotic error bound
1 Introduction
For functional data classification, the explanatory variable is usually a random function and the outcome is a categorical random variable which can take two or more categories. As with classification problems for scalar covariates, a functional classifier is built upon a collection of observations consisting of a functional covariate and a categorical response for each subject, and then a class label will be assigned to a new functional covariate based on the classifier. A typical example is the phoneme recognition problem in Friedman et al. (2009). Log-periodograms for each of the two phonemes “aa” and “ao” were measured at 256 frequency levels, and the primary goal is to make use of these log-periodograms to classify phonemes. For this problem, the log-periodograms measured at 256 frequencies can be regarded as a functional covariate and the outcome takes two possible categories: “aa” or “ao”. Therefore, this phoneme recognition can be framed as a functional classification problem. Actually, functional classification has been extensively studied in the literature due to its wide applications in various fields such as neural science, genetics, agriculture and chemometrics (Tian, 2010; Leng and Müller, 2005; Delaigle and Hall, 2012; Berrendero et al., 2016).
As pointed out by Fan and Fan (2008), a high dimension of scalar covariates has a negative impact on the prediction accuracy of classifiers due to the curse of dimensionality. This issue is even more serious in functional classification since functional data are intrinsically infinite dimensional (Ferraty and Vieu, 2004). In light of this fact, dimension reduction has been suggested before classifying functional data. Functional principal component (FPC) analysis is a commonly used technique in this regard, and various classifiers for functional data have been proposed based on FPC scores, which are the projections of functional covariates onto a number of FPCs. Typical examples include discriminant analysis (Hall et al., 2001), the naive Bayes classifier (Dai et al., 2017) and logistic regression (Leng and Müller, 2005). Since FPC analysis is an unsupervised dimension reduction approach, the retained FPC scores are not necessarily more predictive of the outcome than the discarded ones. In contrast, treating fully observed functional data as a random variable in a Hilbert space without dimension reduction has also attracted substantial attention in functional classification (Yao et al., 2016). For instance, Ferraty and Vieu (2003) proposed a distance-based classifier for functional data, and Biau et al. (2005) and Cérou and Guyader (2006) considered nearest neighbor classification.
An optimal separating hyperplane can be constructed to distinguish two perfectly separable classes. This idea is further extended to accommodate the nonseparable case in support vector machines (SVM). More specifically, with the aid of the so-called kernel trick, the original feature space is expanded and a linear boundary in this expanded feature space can separate the two overlapping classes very well. Projected back onto the original feature space, this linear boundary becomes a nonlinear decision boundary. For a more comprehensive introduction to the SVM, one can refer to Vapnik (2013) and Cristianini and Shawe-Taylor (2000). Due to this ability to construct flexible decision boundaries, different versions of the SVM for functional data have been proposed in the literature. Rossi and Villa (2006) considered projecting functional covariates onto a set of fixed basis functions first, and then applied SVMs to the projections for classification. In contrast, Yao et al. (2016) proposed a supervised method to perform dimension reduction for functional data; a weighted SVM (Lin et al., 2002) was then constructed on the reduced feature space. Wu and Liu (2013) first recovered trajectories of sparse functional data or longitudinal data using principal analysis by conditional expectation (PACE) (Yao et al., 2005), and then proposed a support vector classifier for the random curves. However, the convergence rate of SVMs with a functional covariate was not established in the aforementioned work.
Marron et al. (2007) noted that the data piling problem may cause a deterioration in the performance of the SVM. They proposed the distance-weighted discrimination (DWD) classifier, which makes use of all observations in a training sample, rather than only the support vectors as in the SVM, to determine the decision boundary. Wang and Zou (2018) proposed an efficient algorithm to solve the DWD problem. In this article, we extend the idea of the DWD to functional data to address a binary classification problem. The basic idea is to find an optimal projection direction such that the DWD classifier built upon the projected score achieves good prediction performance. Additionally, to avoid overfitting on the training sample, we incorporate a roughness penalty term when minimizing the empirical risk function. Penalized approaches have been investigated recently in the context of functional linear regression; interested readers can refer to the work by Yuan and Cai (2010) and Sun et al. (2018). However, as far as we know, this framework has received little attention in functional classification problems. The method proposed in this article estimates the slope function by seeking a minimizer of a regularized empirical risk function over a reproducing kernel Hilbert space (RKHS), where the RKHS is closely associated with the penalty term in the regularized empirical risk function. With the help of the representer theorem, we are able to convert this infinite-dimensional minimization problem into a finite-dimensional one, which lays the foundation for the numerical implementation of the proposed classifier. This framework is further extended to accommodate classification when observations of both a functional covariate and several scalar covariates are available for each subject. There has been extensive research on partial functional linear regression models for such scenarios; see Kong et al. (2016) and Wong et al. (2019) for instance. However, much less progress has been made for classification in this regard; thus our work fills the gap of functional data classification when scalar covariates are also available. In addition to the novel methodology, we establish a non-asymptotic oracle-type inequality bounding the convergence rate of the relative loss and the relative classification error. This error bound is essentially different from those considered in Delaigle and Hall (2012), Dai et al. (2017) and Berrendero et al. (2018), all of which focused on asymptotic perfect classification.
The rest of this article is organized as follows. In Section 2 we introduce the RKHS-based functional DWD classifier for classifying functional data without and with scalar covariates. Theoretical properties of the proposed classifiers are established in Section 3. We carry out simulation studies in Section 4 to investigate the finite sample performance of the proposed classifiers in terms of prediction accuracy. In Section 5 we consider one real world application to demonstrate the performance of the proposed classifiers. We conclude this article in Section 6. All technical proofs are provided in the Appendix.
2 Methodology
Let denote a random function with a compact domain , and is a binary outcome related to . Without loss of generality, we assume that . Suppose that the training sample consists of , i.i.d. copies of . Our primary goal is to build a classifier based on this training sample.
We first present an overview of the distance-weighted discrimination (DWD) proposed by Marron et al. (2007). Consider the following classification problem, where is a vector of scalar covariates and is a binary response. The main task is to build a classifier based on pairs of observations . According to Wang and Zou (2018), the decision boundary of a generalized distance-weighted discrimination classifier can be obtained by solving
where
(1)
is the loss function and is a tuning parameter. Note that as , the generalized DWD loss function converges to the hinge loss function used in the SVM. This relationship is also illustrated in Figure 1. Denote by the solution to the minimization problem above. Given a new observation , the predicted class label will be 1 if and -1 otherwise.
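For concreteness, the following is a minimal numerical sketch of the generalized DWD loss of Wang and Zou (2018) and its hinge-loss limit; the function name and vectorized implementation are ours, and the piecewise formula should be checked against (1).

```python
import numpy as np

def gdwd_loss(u, q=1.0):
    """Generalized DWD loss of Wang and Zou (2018):
    V_q(u) = 1 - u                              if u <= q / (q + 1),
           = u**(-q) * q**q / (q + 1)**(q + 1)  otherwise.
    As q -> infinity, V_q approaches the hinge loss max(0, 1 - u)."""
    u = np.asarray(u, dtype=float)
    thresh = q / (q + 1.0)
    const = q ** q / (q + 1.0) ** (q + 1.0)
    tail = const * np.maximum(u, thresh) ** (-q)   # clipped so the branch is safe for u <= 0
    return np.where(u <= thresh, 1.0 - u, tail)

# Quick check against the hinge loss at a few margins
margins = np.array([-0.5, 0.2, 0.8, 1.5])
print(gdwd_loss(margins, q=100.0))        # close to the hinge loss for large q
print(np.maximum(0.0, 1.0 - margins))     # hinge loss
```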
In this article, we aim to extend the framework of DWD to functional data. In particular, we consider the following objective function
(2)
where is a penalty functional. The penalty functional can be conveniently defined through the slope function as a squared norm or semi-norm associated with . A canonical example of is the Sobolev space. Without loss of generality, assuming that , the Sobolev space of order is then defined as
Endowed with the (squared) norm
is a reproducing kernel Hilbert space. In this case, a possible choice of the penalty functional is given by
(3)
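As a concrete instance, taking the common second-order case on [0, 1] (this specific order is our choice for illustration), the space, its squared norm, and the corresponding penalty can be written as follows.

```latex
% Second-order Sobolev space on [0,1]; one standard convention (cf. Gu, 2013)
\mathcal{W}_2^2[0,1] \;=\; \bigl\{\beta:\ \beta,\ \beta' \ \text{absolutely continuous},\ \beta'' \in L^2[0,1]\bigr\},
\qquad
\|\beta\|_{\mathcal{W}_2^2}^2 \;=\; \Bigl(\int_0^1 \beta(t)\,dt\Bigr)^{2}
 + \Bigl(\int_0^1 \beta'(t)\,dt\Bigr)^{2}
 + \int_0^1 \bigl\{\beta''(t)\bigr\}^{2}\,dt,
\qquad
J(\beta) \;=\; \int_0^1 \bigl\{\beta''(t)\bigr\}^{2}\,dt .
```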
2.1 Representer theorem
Let the penalty functional be a squared semi-norm on such that the null space
(4)
is a finite-dimensional linear subspace of . Denote by its orthogonal complement in such that . That is, for any , there exists a unique decomposition such that and . Note that is also a reproducing kernel Hilbert space with the inner product of restricted to . Let be the corresponding reproducing kernel of such that for any . Let and be the basis functions of .
We will assume that is continuous and square integrable. With slight abuse of notation, write
(5)
According to Yuan and Cai (2010), for any and ,
(6)
With these observations, we are able to establish the following theorem, which is crucial for both the numerical implementation and the theoretical analysis of the proposed classifier.
Theorem 1
Let and be the minimizer of (2) and . Then there exist and such that
(7)
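For orientation, a representation of the type asserted in (7) can be sketched in the notation of Yuan and Cai (2010); the expansion below is our reading of the result (with the basis functions spanning the null space and K_1 the reproducing kernel of its orthogonal complement) and should be checked against the exact statement.

```latex
% Sketch of the representer-type expansion (our notation, following Yuan and Cai, 2010)
\hat{\beta}(t) \;=\; \sum_{\nu=1}^{m} d_\nu\, \xi_\nu(t)
  \;+\; \sum_{i=1}^{n} c_i \int_{\mathcal{T}} K_1(s,t)\, X_i(s)\, ds,
\qquad d_\nu,\ c_i \in \mathbb{R}.
```

In particular, each evaluation of the form $\int_{\mathcal{T}} X_j(t)\hat{\beta}(t)\,dt$ then reduces to a finite linear combination of the coefficients, which is what makes the finite-dimensional reformulation in Section 2.2 possible.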
2.2 Estimation algorithm
For the purpose of illustration, we assume that and , in the following numerical implementations. Then is the linear space spanned by and . A possible choice for the reproducing kernel associated with is
where and . Readers may refer to Chapter 2.3 of Gu (2013) for more details. Based on Theorem 1, we only need to consider that takes the following form:
for some and to minimize the function in (2). As a result,
For the penalty term, we have , where is an matrix with th entry . Denote by an matrix with the th entry for . Let and denote the th row of and , respectively. Now the infinite-dimensional minimization of (2) becomes the following finite dimensional minimization problem:
(8)
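To make this reduction concrete, the sketch below assembles the two finite-dimensional building blocks used in (8): a null-space design matrix whose rows contain the integrals of each curve against the null-space basis functions 1 and t, and the kernel matrix whose (i, j) entry is the double integral of the reproducing kernel against the curves of subjects i and j. The reproducing kernel is the cubic-spline kernel described in Chapter 2.3 of Gu (2013); the Riemann-type quadrature, the function names, and the assumption of a common observation grid on [0, 1] are our own.

```python
import numpy as np

def k1(t):
    # Scaled Bernoulli polynomials used in Gu (2013), Chapter 2.3
    return t - 0.5

def k2(t):
    return (k1(t) ** 2 - 1.0 / 12.0) / 2.0

def k4(t):
    return (k1(t) ** 4 - k1(t) ** 2 / 2.0 + 7.0 / 240.0) / 24.0

def rk_cubic(s, t):
    """Reproducing kernel K1(s, t) of the cubic-spline RKHS on [0, 1]."""
    return k2(s) * k2(t) - k4(np.abs(s - t))

def design_matrices(X, tgrid):
    """Return (T, Sigma) for curves X (n x p) observed on a common grid.

    T[i]        ~ (integral of X_i(t), integral of t * X_i(t))
    Sigma[i, j] ~ double integral of K1(s, t) X_i(s) X_j(t).
    Integrals are approximated by Riemann sums (an assumption of this sketch).
    """
    w = np.gradient(tgrid)                        # quadrature weights
    Xw = X * w
    T = np.column_stack([Xw.sum(axis=1), Xw @ tgrid])
    K = rk_cubic(tgrid[:, None], tgrid[None, :])  # p x p kernel on the grid
    Sigma = Xw @ K @ Xw.T
    return T, Sigma
```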
To find the minimizer of , we implement the majorization-minimization (MM) principle. The basic idea is as follows. We first look for a majorization function , where in this problem, of the target function . This majorization function satisfies for any and if . Additionally, it should be easy to find the minimizer of for any given . Then, given an initial value of , say , we generate a sequence of ’s, say , defined by , . As long as this sequence converges, the limit is regarded as the minimizer of the objective function .
Given , let with and
where denotes a vector of length with each component equal to 1. According to Lemma 2 of Wang and Zou (2018), we can take the majorization function of as
It is trivial to show that the minimizer of is
Then the algorithm proceeds until the sequence of minimizers converges. The limit of this sequence is denoted by , and thus . The functional DWD classifier assigns 1 or -1 to a new functional observation according to whether the statistic is positive or negative.
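A minimal end-to-end sketch of the resulting classifier follows. For simplicity it minimizes the regularized empirical risk in (2) directly with a generic quasi-Newton optimizer rather than the closed-form MM updates described above (the generalized DWD loss is continuously differentiable, so this is feasible, if less efficient). It reuses the illustrative helpers `gdwd_loss`, `design_matrices` and `rk_cubic` sketched earlier; all names and defaults are ours.

```python
import numpy as np
from scipy.optimize import minimize

def fit_fdwd(X, y, tgrid, lam=1e-2, q=1.0):
    """Sketch: minimize mean_i V_q(y_i f(X_i)) + lam * c' Sigma c,
    with decision function f(X_i) = b0 + T_i d + Sigma_i c."""
    n = X.shape[0]
    T, Sigma = design_matrices(X, tgrid)

    def objective(theta):
        b0, d, c = theta[0], theta[1:3], theta[3:]
        f = b0 + T @ d + Sigma @ c
        return np.mean(gdwd_loss(y * f, q)) + lam * (c @ Sigma @ c)

    res = minimize(objective, np.zeros(3 + n), method="L-BFGS-B")
    b0, d, c = res.x[0], res.x[1:3], res.x[3:]

    def predict(Xnew):
        Tn, _ = design_matrices(Xnew, tgrid)
        w = np.gradient(tgrid)
        K = rk_cubic(tgrid[:, None], tgrid[None, :])
        Snew = (Xnew * w) @ K @ (X * w).T   # cross inner products with training curves
        return np.sign(b0 + Tn @ d + Snew @ c)

    return predict
```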
2.3 Functional DWD with scalar covariates
The algorithm presented above is to address binary classification problems for univariate functional data. This idea can be further extended to accommodate binary classification when both a functional covariate and finite dimensional scalar covariates are involved. In particular, the training sample consists of , where denotes the dimensional scalar covariates of the th subject. With slight abuse of notation, we consider the following extension of (2):
(9)
to build a partial linear DWD classifier.
To solve the minimization problem (9), we resort to the specific representation of in Theorem 1. It is straightforward to verify that this result still holds in the context of (9). As a result, the infinite-dimensional minimization problem (9) is converted to the following finite-dimensional one:
(10)
With some modifications, we employ the MM principle to address the minimization problem above. In particular, the majorization function of is taken as
where with , and , and
Thus the minimizer of is given by
We then follow steps in Section 2.2 to implement classifications on subjects with both a functional covariate and several scalar covariates.
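The modification can be sketched as follows: the scalar covariates enter the decision function linearly and are left unpenalized, while the functional part is handled exactly as in Section 2.2. Again, direct numerical minimization stands in for the MM updates, and the helpers from the earlier sketches are reused; all names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def fit_plfdwd(X, W, y, tgrid, lam=1e-2, q=1.0):
    """Partial linear functional DWD sketch:
    f(X_i, W_i) = b0 + T_i d + W_i' gamma + Sigma_i c,
    with only the functional part penalized through c' Sigma c."""
    n, r = W.shape
    T, Sigma = design_matrices(X, tgrid)

    def objective(theta):
        b0, d = theta[0], theta[1:3]
        gamma, c = theta[3:3 + r], theta[3 + r:]
        f = b0 + T @ d + W @ gamma + Sigma @ c
        return np.mean(gdwd_loss(y * f, q)) + lam * (c @ Sigma @ c)

    res = minimize(objective, np.zeros(3 + r + n), method="L-BFGS-B")
    return res.x   # (b0, d, gamma, c) stacked
```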
2.4 Tuning parameter selection
Here we focus on binary classification when both a functional covariate and several scalar covariates are involved. The prediction performance of the proposed classifier depends on the choice of the two tuning parameters and . Computing the inverse of the matrix from scratch for every combination of and would be computationally intensive, especially when the sample size is large, so we adopt a solution with a lower computational cost: the essential idea is to avoid directly recomputing this inverse for each combination of tuning parameters. Write as
Therefore, the inverse of admits
(11)
Note that among these matrices only depends on and . The inverse of is available from the Sherman-Morrison-Woodbury formula:
(12)
To compute the matrix in (12), we need to find the inverse of first. Let denote the eigen-decomposition of , which does not depend on . Then we compute the inverse of for each and ; it is actually a diagonal matrix. Hence the inverse of is immediately available for each combination of and . Furthermore, note that is a matrix. These facts suggest that it is efficient to compute the (inverse) matrix in (12), as long as is relatively small.
Finally, we employ the expression of in (11) to compute directly. Denote by the inverse of . By equation (11), we have
With the procedures above, we are able to compute the minimizer of the majorization function for different values of and , and thus solve the minimization problem of (10) efficiently. We employ cross validation to choose the optimal combination of and in the following numerical studies.
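The computational strategy of this subsection can be sketched as follows: the expensive eigen-decomposition is computed once, since the matrix it factorizes does not depend on the tuning parameters, and the inverse needed for each candidate value is then assembled from a diagonal matrix plus a low-rank Sherman-Morrison-Woodbury correction coming from the small null-space/scalar design block, in the spirit of (11) and (12). The matrix names and the exact form of the correction below are illustrative rather than the paper's exact expressions.

```python
import numpy as np

def cached_inverse_factory(M, T):
    """Precompute what does not depend on the penalization constant so that
    inv(M + alpha * I + T @ C @ T.T) can be formed cheaply for many alpha.

    M : symmetric (n, n) matrix, eigen-decomposed once.
    T : (n, m) matrix with m small (null-space / scalar-covariate block).
    """
    evals, U = np.linalg.eigh(M)            # done once, independent of alpha

    def inverse(alpha, C):
        # inv(M + alpha I) = U diag(1 / (evals + alpha)) U': diagonal in the eigenbasis
        Ainv = (U / (evals + alpha)) @ U.T
        AinvT = Ainv @ T
        # Sherman-Morrison-Woodbury correction for the low-rank block T C T'
        core = np.linalg.inv(np.linalg.inv(C) + T.T @ AinvT)
        return Ainv - AinvT @ core @ AinvT.T

    return inverse

# Usage sketch: one eigh(), then cheap updates over a grid of penalization values.
# inv_fun = cached_inverse_factory(M, T)
# for alpha in np.logspace(-4, 1, 20):
#     Minv = inv_fun(alpha, C)
```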
3 Theoretical properties
Let denote the Bayes classifier, which minimizes the probability of misclassification, . It is trivial that a.s. on the set . Given a loss function , the associated risk function for a classifier is then defined by
Blanchard et al. (2008) established a non-asymptotic bound on when is the estimated support vector classifier from a training sample in which each subject consists of multiple scalar covariates and a binary outcome, and is the corresponding hinge loss function.
Denote by the estimated functional DWD classifier, where and minimize the target function in (2). In the context of functional classification, assuming that the functional covariate , the Bayes classifier is a measurable functional from to . Here we aim to establish a non-asymptotic bound on , where is the loss function for the functional DWD classifier.
The following conditions are required: (C1) concerns bounds related to the noise, while (C2) concerns bounds on the kernel and functions.
- (C1) For , we require for all and , thus large enough so that . We also require for all , which bounds the probability away from 0 and 1 and also implies that .
- (C2) There exist positive constants such that for and for being the reproducing kernel.
Similar to Blanchard et al. (2008), there are two settings to consider, and they affect how the penalization parameter is controlled. In setting (S1), the risk is analyzed via the spectral properties of the reproducing kernel; specifically, the penalization parameter is controlled by the tail sum of the eigenvalues of the reproducing kernel. In setting (S2), the risk is analyzed via covering numbers under the sup-norm, and is instead controlled via , the supremum norm -entropy. This control is encapsulated in the term defined in the following theorem.
Theorem 2
Under conditions (C1) and (C2), let the penalization parameter be bounded as
for some universal constant with under (S1) and under (S2) where is the solution to with . For an iid sample of functional-binary pairs and FDWD loss function , let the regularized estimator be the solution to
for corresponding classifier . Then, for being the Bayes classifier and the classifier corresponding to any arbitrary , the following holds with probability at least ,
for positive constants .
Proof of Theorem 2 can be found in the appendix. We extend this theorem to the functional DWD estimator with scalar covariates in the following corollary. For this extension, we require an additional condition:
- (C3) There exists a positive constant such that for .
The proof of the corollary below follows from that of Theorem 2; instead of considering suprema over the ball , we consider the product ball .
Corollary 1
Under conditions (C1), (C2), and (C3), let be a positive constant, and let the penalization parameter be bounded as
for some universal constant with under (S1) and under (S2) where is the solution to with . For an iid sample of functional-covariate-binary triples and FDWD loss function , let the regularized estimator and be the solution to
for corresponding classifier . Then, for being the Bayes classifier and the classifier corresponding to any arbitrary and , the following holds with probability at least ,
for positive constants .
4 Simulation studies
In this section, we considered two different simulation settings to investigate the finite-sample performance of the proposed classifier. In both settings, the functional covariate was generated in the following way: , where the ’s are independently drawn from a uniform distribution on (, , and and for . Observations at 50 time points on the interval [0, 1] were available for each sample path of . Two scalar covariates () were independently generated from a truncated normal distribution within the interval with mean 0 and variance 1. Then the binary response variable with values or was generated from the logistic model:
where and is referred to as the discriminant function in this article.
In the first scenario, the slope function of was , and or so that the discriminant function does or does not depend on the scalar covariates, respectively. In the second scenario, the slope function was written as a linear combination of the functional principal components of . In particular, , and the coefficient vector of the scalar covariates was or . In each simulation scenario, 100 or 200 curves were generated for training. Then 500 samples were generated as the test set to assess prediction accuracy.
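A rough sketch of this data-generating scheme is given below. It is not the paper's exact specification: the basis functions, score scaling, truncation interval, and coefficient values are placeholders that the reader would replace with the choices described above.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(n, tgrid, beta, gamma=np.zeros(2), K=10):
    """Sketch of the simulation design: functional covariate from a truncated
    Karhunen-Loeve-type expansion, two truncated-normal-like scalar covariates,
    and labels drawn from a logistic model on the discriminant function."""
    phi = np.sqrt(2) * np.sin(np.pi * np.outer(np.arange(1, K + 1), tgrid))  # assumed basis
    xi = rng.uniform(-np.sqrt(3), np.sqrt(3), (n, K)) / np.arange(1, K + 1)  # decaying scores
    X = xi @ phi                                   # curves on the grid
    W = np.clip(rng.normal(size=(n, 2)), -2, 2)    # stand-in for truncated normals
    w = np.gradient(tgrid)                         # quadrature weights
    eta = (X * beta(tgrid)) @ w + W @ gamma        # discriminant function
    prob = 1.0 / (1.0 + np.exp(-eta))
    Y = np.where(rng.uniform(size=n) < prob, 1, -1)
    return X, W, Y

# Example with an assumed slope function on 50 time points in [0, 1]
X, W, Y = simulate(100, np.linspace(0, 1, 50),
                   beta=lambda t: np.sin(2 * np.pi * t),
                   gamma=np.array([1.0, -1.0]))
```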
In addition to the proposed functional DWD classifier, we also considered several other commonly used functional data classifiers for comparison. The centroid classifier of Delaigle and Hall (2012) first projects each functional covariate onto one specific direction and then performs classification based on the distance to the centroid in the projected space. The functional quadratic discriminant classifier of Galeano et al. (2015) conducts a quadratic discriminant analysis on FPC scores, while the functional logistic classifier fits a logistic regression model on them. Note that the aforementioned classifiers, except our proposed functional DWD with scalar covariates, only account for functional covariates in classification. To study the effect of scalar covariates on classification, we also fitted an SVM classifier with only these two scalar covariates when they are involved in the discriminant function. In each simulation trial, we randomly generated a training set of size 100 or 200 to fit all classifiers and then evaluated their predictive accuracy on a test sample of size 500. To assess the uncertainty in estimating the prediction accuracy of each classifier, independent simulation trials were conducted in each scenario.
Scenario 1

n | Scalar covariates | Centroid | Quadratic | Logistic | KNN | fDWD | PLfDWD | S-SVM
---|---|---|---|---|---|---|---|---
100 | Yes | 41.3 (3.7) | 43.2 (3.0) | 41.9 (3.1) | 43.4 (3.3) | 39.8 (2.7) | 32.3 (2.7) | 37.9 (5.1)
100 | No | 39.1 (3.7) | 40.7 (3.3) | 39.5 (3.1) | 40.8 (3.6) | 37.7 (2.8) | |
200 | Yes | 40.2 (2.8) | 41.5 (2.8) | 40.5 (2.6) | 43.0 (2.8) | 39.3 (2.4) | 31.3 (2.4) | 35.5 (3.0)
200 | No | 37.9 (2.7) | 39.1 (2.8) | 38.2 (2.4) | 40.6 (3.1) | 37.1 (2.3) | |

Scenario 2

n | Scalar covariates | Centroid | Quadratic | Logistic | KNN | fDWD | PLfDWD | S-SVM
---|---|---|---|---|---|---|---|---
100 | Yes | 21.7 (2.2) | 22.4 (2.4) | 21.9 (2.2) | 22.6 (2.5) | 21.0 (2.0) | 11.0 (1.9) | 35.7 (4.5)
100 | No | 11.1 (1.6) | 11.4 (1.6) | 11.4 (1.7) | 11.5 (1.6) | 10.4 (1.4) | |
200 | Yes | 21.3 (2.0) | 21.5 (2.0) | 21.2 (1.9) | 22.2 (2.2) | 20.8 (1.8) | 10.5 (1.9) | 33.9 (2.9)
200 | No | 10.7 (1.5) | 10.7 (1.4) | 10.6 (1.4) | 11.1 (1.5) | 10.2 (1.4) | |
Table 1 summarizes the mean and the standard error of the misclassification error rate of each classifier. In the first scenario, the proposed functional DWD classifier with scalar covariates is considerably more accurate than any other classifier. This is not surprising, since even the SVM classifier with only scalar covariates outperforms the functional classifiers that do not take scalar covariates into consideration, which underlines the importance of accounting for scalar covariates when the true discriminant function indeed depends on them. Additionally, whether or not the true discriminant function depends on the scalar covariates in these settings, the misclassification error rates of our proposed functional DWD classifiers are very close to the Bayes errors, which are 0.283 and 0.376, respectively. As the projection function in the centroid classifier and the slope function in the functional logistic regression of Leng and Müller (2005) are represented in terms of FPCs, these two classifiers should be favored in the second scenario, and this is reflected in their prediction accuracy. Nevertheless, our proposed classifier still dominates all competitors regardless of whether the true discriminant function depends on the scalar covariates. A plausible reason why the proposed classifier is superior to the centroid and logistic classifiers is that the roughness of the projection direction is appropriately controlled in our method. Once again, the misclassification rates of our proposed classifiers are very close to the Bayes errors, which are 0.086 with scalar covariates and 0.099 without, respectively.
5 Real data examples
In this section, we apply the proposed classifiers as well as several alternative classifiers to one real-world example to demonstrate the performance of our proposal.
Alzheimer’s disease (AD) is an irreversible and progressive brain disorder that leads to increasingly serious dementia symptoms over a few years. Previous studies showed that increasing age is one of the most important risk factors for AD, and most patients with AD are above 65. However, there is also a substantial number of early-onset Alzheimer’s patients whose ages are under 65. The situation is aggravated by the fact that there is currently no cure for AD and that the disease eventually destroys patients’ ability to perform even the simplest tasks. For these reasons, studies of AD have received considerable attention in the past few years.
In our study, the data were obtained from the ongoing Alzheimer’s Disease Neuroimaging Initiative (ADNI), which aims to unite researchers from around the world to collect, validate and analyze relevant data. In particular, the ADNI is interested in identifying biomarkers of AD from genetic, structural and functional neuroimaging, and clinical data. The dataset consists of two main parts. The first part is neuroimaging data collected by diffusion tensor imaging (DTI). More specifically, fractional anisotropy (FA) values were measured at 83 locations along the corpus callosum (CC) fiber tract for each subject. The second part comprises demographic features: gender (a categorical variable), handedness (left- or right-handed, a categorical variable), age, education level, AD status and the mini-mental state examination (MMSE) score. The AD status is a categorical variable with three levels: normal control (NC), mild cognitive impairment (MCI) and Alzheimer’s disease (AD). We combine the first two categories into a single category for simplicity, and this status variable is then treated as a binary outcome in the following analysis. The MMSE is one of the most widely used tests of cognitive functions such as orientation, attention, memory, language and visual-spatial skills for assessing the level of dementia a patient may have. A more detailed description of the data can be found at http://adni.loni.usc.edu. Previous studies, such as Li et al. (2017) and Tang et al. (2020), focused on building regression models to investigate the relationship between the progression of the AD status and the neuroimaging and demographic data. In contrast, our main objective is to use the DTI data and demographic features to predict the AD status.
We had 214 subjects in our analysis after removing 3 subjects with missing values. Among them, there are subjects from the first group, i.e., subjects whose status is either NC or MCI, and subjects from the AD group. The functional predictor was taken as the FA profiles, which are displayed in Figure 2, and the scalar covariates consisted of gender, handedness, age, education level and the MMSE score. To justify the importance of incorporating FA profiles in classification, we also considered an SVM classifier with only these scalar covariates. To compare the prediction performance of the classifiers, these 214 subjects were randomly divided into a training set with subjects and a test set with the remaining subjects. In the study, we considered two particular choices of the training-set size: 107 and 171. Following this rule, we randomly split the whole dataset into training and test sets 500 times.
Training size | Centroid | Quadratic | Logistic | KNN | fDWD | PLfDWD | S-SVM
---|---|---|---|---|---|---|---
107 | 19.6 (0.029) | 20.6 (0.031) | 21.4 (0.030) | 21.0 (0.028) | 19.8 (0.030) | 11.2 (0.028) | 26.3 (0.032)
171 | 19.8 (0.054) | 20.5 (0.056) | 21.5 (0.057) | 21.7 (0.054) | 19.8 (0.054) | 9.9 (0.048) | 27.2 (0.059)
Table 2 summarizes the mean misclassification error rates and the standard errors across the 500 splits for each classifier. When scalar covariates are not accounted for, our proposed method (fDWD) outperforms all other competitors except the centroid method in terms of prediction accuracy. Even more remarkably, incorporating scalar covariates, albeit in a linear manner, results in a substantial reduction in misclassification error for our proposed classifier, to around half of the errors of the functional classifiers without scalar covariates. One might argue that this occurred because the scalar covariates are highly predictive of the AD status while the functional covariate is not. However, the prediction performance of S-SVM refutes this argument, as the SVM classifier that only considered scalar covariates performed even worse than those with only the functional covariate. On the one hand, these comparisons indicate the superiority of our proposed classifier; on the other hand, they also suggest that accounting for scalar covariates in an appropriate way can enhance prediction accuracy when discriminating functional data.
6 Conclusion
In this paper we propose a novel methodology that combines the idea of the canonical DWD classifier with regularized functional linear regression under the RKHS framework to classify functional data. The use of an RKHS enables us to control the roughness of the estimated projection direction, and thus enhances prediction accuracy in comparison with conventional functional logistic regression and the centroid classifier. Moreover, we further extend the framework to classifying subjects with both a functional covariate and several scalar covariates. Although we focus on the DWD loss function to inherit the appealing properties of DWD classifiers, this framework can be extended to other loss functions, such as the logistic loss in functional logistic regression and the hinge loss in functional SVM classifiers. Moreover, the scalar covariates are incorporated into our classifier in a linear manner to achieve a good trade-off between flexibility and interpretability. Nonlinear or nonparametric forms of the scalar covariates can also be accommodated within our framework, as long as appropriate regularization is adopted to avoid overfitting. This direction deserves future investigation in both theory and practice.
Numerical studies, including both simulation studies and a real-world application, suggest that the proposed classifier is superior to many other competitors in terms of prediction accuracy. The application of our classifier to a study of Alzheimer’s disease provides numerical evidence that both neuroimaging data and demographic features are relevant to AD, and that ignoring either of them deteriorates the prediction accuracy for the AD status.
References
- Berrendero et al. (2016) Berrendero, J. R., Cuevas, A. and Torrecilla, J. L. (2016) Variable selection in functional data classification: a maxima-hunting proposal. Statistica Sinica, 26, 619–638.
- Berrendero et al. (2018) — (2018) On the use of reproducing kernel Hilbert spaces in functional classification. Journal of the American Statistical Association, 113, 1210–1218.
- Biau et al. (2005) Biau, G., Bunea, F. and Wegkamp, M. H. (2005) Functional classification in Hilbert spaces. IEEE Transactions on Information Theory, 51, 2163–2172.
- Blanchard et al. (2008) Blanchard, G., Bousquet, O. and Massart, P. (2008) Statistical performance of support vector machines. The Annals of Statistics, 36, 489–531.
- Cérou and Guyader (2006) Cérou, F. and Guyader, A. (2006) Nearest neighbor classification in infinite dimension. ESAIM: Probability and Statistics, 10, 340–355.
- Cristianini and Shawe-Taylor (2000) Cristianini, N. and Shawe-Taylor, J. (2000) An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press.
- Dai et al. (2017) Dai, X., Müller, H.-G. and Yao, F. (2017) Optimal Bayes classifiers for functional data and density ratios. Biometrika, 104, 545–560.
- Delaigle and Hall (2012) Delaigle, A. and Hall, P. (2012) Achieving near perfect classification for functional data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74, 267–286.
- Fan and Fan (2008) Fan, J. and Fan, Y. (2008) High dimensional classification using features annealed independence rules. The Annals of Statistics, 36, 2605.
- Ferraty and Vieu (2003) Ferraty, F. and Vieu, P. (2003) Curves discrimination: a nonparametric functional approach. Computational Statistics & Data Analysis, 44, 161–173.
- Ferraty and Vieu (2004) — (2004) Nonparametric models for functional data, with application in regression, time series prediction and curve discrimination. Nonparametric Statistics, 16, 111–125.
- Ferraty and Vieu (2006) — (2006) Nonparametric functional data analysis: theory and practice. Springer, New York.
- Friedman et al. (2009) Friedman, J., Hastie, T. and Tibshirani, R. (2009) The elements of statistical learning, 2nd edition. Springer, New York.
- Galeano et al. (2015) Galeano, P., Joseph, E. and Lillo, R. E. (2015) The Mahalanobis distance for functional data with applications to classification. Technometrics, 57, 281–291.
- Gu (2013) Gu, C. (2013) Smoothing spline ANOVA models, 2nd edition. Springer, New York.
- Hall et al. (2001) Hall, P., Poskitt, D. S. and Presnell, B. (2001) A functional data—analytic approach to signal discrimination. Technometrics, 43, 1–9.
- Kong et al. (2016) Kong, D., Xue, K., Yao, F. and Zhang, H. H. (2016) Partially functional linear regression in high dimensions. Biometrika, 103, 147–159.
- Leng and Müller (2005) Leng, X. and Müller, H.-G. (2005) Classification using functional data analysis for temporal gene expression data. Bioinformatics, 22, 68–76.
- Li et al. (2017) Li, J., Huang, C. and Zhu, H. (2017) A functional varying-coefficient single-index model for functional response data. Journal of the American Statistical Association, 112, 1169–1181.
- Lin et al. (2002) Lin, Y., Lee, Y. and Wahba, G. (2002) Support vector machines for classification in nonstandard situations. Machine Learning, 46, 191–202.
- Marron et al. (2007) Marron, J. S., Todd, M. J. and Ahn, J. (2007) Distance-weighted discrimination. Journal of the American Statistical Association, 102, 1267–1271.
- Massart (2000) Massart, P. (2000) Some applications of concentration inequalities to statistics. Annales de la Faculté des sciences de Toulouse: Mathématiques, 9, 245–303.
- Mendelson (2003) Mendelson, S. (2003) On the performance of kernel classes. Journal of Machine Learning Research, 4, 759–771.
- Rossi and Villa (2006) Rossi, F. and Villa, N. (2006) Support vector machine for functional data classification. Neurocomputing, 69, 730–742.
- Sun et al. (2018) Sun, X., Du, P., Wang, X. and Ma, P. (2018) Optimal penalized function-on-function regression under a reproducing kernel Hilbert space framework. Journal of the American Statistical Association, 113, 1601–1611.
- Tang et al. (2020) Tang, Q., Kong, L., Ruppert, D. and Karunamuni, R. J. (2020) Partial functional partially linear single-index models. Statistica Sinica, Preprint No: SS-2018-0316. URL: http://www.stat.sinica.edu.tw/statistica/.
- Tian (2010) Tian, T. S. (2010) Functional data analysis in brain imaging studies. Frontiers in psychology, 1, 35.
- Vapnik (2013) Vapnik, V. (2013) The nature of statistical learning theory. Springer, New York.
- Wang and Zou (2018) Wang, B. and Zou, H. (2018) Another look at distance-weighted discrimination. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80, 177–198.
- Wong et al. (2019) Wong, R. K., Li, Y. and Zhu, Z. (2019) Partially linear functional additive models for multivariate functional data. Journal of the American Statistical Association, 114, 406–418.
- Wu and Liu (2013) Wu, Y. and Liu, Y. (2013) Functional robust support vector machines for sparse and irregular longitudinal data. Journal of Computational and Graphical Statistics, 22, 379–395.
- Yao et al. (2005) Yao, F., Müller, H.-G. and Wang, J.-L. (2005) Functional data analysis for sparse longitudinal data. Journal of the American Statistical Association, 100, 577–590.
- Yao et al. (2016) Yao, F., Wu, Y. and Zou, J. (2016) Probability-enhanced effective dimension reduction for classifying sparse functional data. Test, 25, 1–22.
- Yuan and Cai (2010) Yuan, M. and Cai, T. T. (2010) A reproducing kernel Hilbert space approach to functional linear regression. The Annals of Statistics, 38, 3412–3444.
Appendix
A.1: Proof of Theorem 1
Based on the observations above, we have . Therefore, the solution to (2) can be written as
where is the orthogonal complement of in . To prove (7), we just need to check that .
Let . Plugging the solution into (2), we have
In the first term,
In other words, the first term does not depend on . Meanwhile, the second term, , is minimized when since is orthogonal to in .
A.2: Proof of Theorem 2
To prove Theorem 2, we first prove the following lemmas. We also define with a countable set. In what follows, Lemmas 6.4 and 6.6 are for setting (S1) and Lemmas 6.8 and 6.10 are for setting (S2).
Lemma 1
Let with such that and reproducing kernel such that . Then,
for all .
Proof 6.3.
Via the Riesz representation theorem, we have for some . Using Hölder’s inequality or the Cauchy-Schwarz inequality, we have the following bound.
Lemma 6.4.
For all with and , Furthermore, under Conditions (C1) and (C2),
for all and all where and is the associated risk.
Proof 6.5.
First, we note that for that is differentiable with . Hence, is Lipschitz implying that
proving the first part of the lemma.
For the bound on , let with . Without loss of generality, we will consider the case and . Note that for . Note also that , so for . We have the ratio
If , then denote so . Also, note that . Thus, if we choose such that , then
Lemma 6.6.
Under Conditions (C1) and (C2), let be a sequence of subroot functions with being the solution to . Then, for all , , corresponding , , and
with
Proof 6.7.
We first define the Rademacher average for a function
to be
for iid Rademacher random variables
that are independent of the . This can be applied
to a class of by denoting
.
Lemma 6.7 from Blanchard et al. (2008) uses a standard symmetrization trick to prove that for any such collection of real-valued functions , some 1-Lipschitz function , and any that
(13)
Lemma 6.8 from Blanchard et al. (2008) comes from Mendelson (2003). It builds off the previous result to note that
(14)
where is the th eigenvalue of the reproducing kernel and where for the norms are and .
By noting that is a 1-Lipschitz function for any choice of , we apply Equation 13 for with to get
Then, application of Equation 14 gives
Thus, we aim to solve . Let be the minimizer over , which exists due to the reproducing kernel being a trace class operator, which in turn implies that is finite for all and tends to as . Then, application of the quadratic formula and the convexity result gives
Finally, choosing gives
Lemma 6.8.
For all with and , Furthermore, under Conditions (C1) and (C2),
for all and all where and is the associated risk.
Proof 6.9.
By choice of the metric , we have immediately that . Next, we aim to bound
Recalling that , we can assume that and that . We also note that for , and that , so for . Thus,
If , then denote so . Also, note that . Thus, if we choose such that , then
Lemma 6.10.
Under Conditions (C1) and (C2), let be a sequence of subroot functions with being the solution to . Then, for all , , corresponding , , and
with
Proof 6.11.
Lemma 6.10 from Blanchard et al. (2008) states that for a separable class of real functions in sup-norm such that and , the following bound holds:
(15)
where is the supremum norm -entropy for .
We first note that
and secondly note that the DWD loss function is 1-Lipschitz, which implies that
Thus, we can bound
and application of Equation 15 gives
where .
This final bound is a sub-root function in terms of .
For , the solution to , we can bound , the solution to with , as follows. First, we note that is decreasing. We also choose a so that . Therefore,
Thus,
Therefore, taking results in , implying that , giving finally that
Proof 6.12 (Proof of Theorem 2).
Given the above lemmas, we apply the “model selection” Theorem 4.3 from Blanchard et al. (2008), which is a generalization of Theorem 4.2 of Massart (2000), to our functional DWD classifier.
First, we choose to be our countable set of radii for the balls with . To apply Theorem 4.3, we require a sequence and choose similarly to Blanchard et al. (2008). We also require a penalty function that satisfies
where , , and under setting (S1), and , , and under setting (S2). To achieve this, we take to be under setting (S1) and under setting (S2), and we take for some universal constant . Consequently, we can choose for another suitable positive constant .
Therefore, given the estimator and corresponding estimator , we have . Then the regularization term and consequently, . Thus, . Recalling that , we have that for and, updating as necessary, we have that
where the third inequality results from being the minimizer of the regularized estimation procedure. Application of the “model selection theorem” (Massart, 2000; Blanchard et al., 2008) gives that with probability that