A reproducing kernel Hilbert space framework for functional classification
Abstract
We encounter a bottleneck when we try to borrow the strength of classical classifiers to classify functional data. The major issue is that functional data are intrinsically infinite dimensional; thus classical classifiers either cannot be applied directly or perform poorly due to the curse of dimensionality. To address this concern, we propose to project functional data onto one specific direction, and a distance-weighted discrimination (DWD) classifier is then built upon the projection score. The projection direction is identified by minimizing, over a reproducing kernel Hilbert space, an empirical risk function that contains the particular loss function of a DWD classifier. Hence our proposed classifier can avoid overfitting and enjoys the appealing properties of DWD classifiers. This framework is further extended to accommodate functional data classification problems in which scalar covariates are involved. In contrast to previous work, we establish a non-asymptotic estimation error bound on the relative misclassification rate. In the finite-sample case, we demonstrate through simulation studies and a real-world application that the proposed classifiers compare favorably with several commonly used functional classifiers in terms of prediction accuracy.
keywords:
Functional classification; Projection; Distance-weighted discrimination; Reproducing kernel Hilbert space; Non-asymptotic error bound
1 Introduction
For functional data classification, the explanatory variable is usually a random function and the outcome is a categorical random variable which can take two or more categories. As with classification problems for scalar covariates, a functional classifier is built upon a collection of observations consisting of a functional covariate and a categorical response for each subject, and then a class label will be assigned to a new functional covariate based on the classifier. A typical example is the phoneme recognition problem in Friedman et al. (2009). Log-periodograms for each of the two phonemes “aa” and “ao” were measured at 256 frequency levels, and the primary goal is to make use of these log-periodograms to classify phonemes. For this problem, the log-periodograms measured at 256 frequencies can be regarded as a functional covariate and the outcome takes two possible categories: “aa” or “ao”. Therefore, this phoneme recognition can be framed as a functional classification problem. Actually, functional classification has been extensively studied in the literature due to its wide applications in various fields such as neural science, genetics, agriculture and chemometrics (Tian, 2010; Leng and Müller, 2005; Delaigle and Hall, 2012; Berrendero et al., 2016).
As pointed out by Fan and Fan (2008), a high dimension of scalar covariates has a negative impact on the prediction accuracy of classifiers due to the curse of dimensionality. This issue is even more serious in functional classification since functional data are intrinsically infinite dimensional (Ferraty and Vieu, 2004). In light of this fact, dimension reduction has been suggested before classifying functional data. Functional principal component (FPC) analysis is a commonly used technique in this regard, and various classifiers for functional data have been proposed based on FPC scores, which are the projections of functional covariates onto a number of FPCs. Typical examples include discriminant analysis (Hall et al., 2001), the naive Bayes classifier (Dai et al., 2017) and logistic regression (Leng and Müller, 2005). Since FPC analysis is an unsupervised dimension reduction approach, the retained FPC scores are not necessarily more predictive of the outcome than the discarded ones. In contrast, treating fully observed functional data as a random variable in a Hilbert space without dimension reduction has also attracted substantial attention in functional classification (Yao et al., 2016). For instance, Ferraty and Vieu (2003) proposed a distance-based classifier for functional data, and Biau et al. (2005) and Cérou and Guyader (2006) considered nearest neighbor classification.
An optimal separating hyperplane can be constructed to distinguish two perfectly separable classes. This idea is further extended to accommodate the nonseparable case in support vector machines (SVM). More specifically, with the aid of the so-called kernel trick, the original feature space is expanded and a linear boundary in this expanded feature space can separate the two overlapping classes very well. Projected back onto the original feature space, this linear boundary becomes a nonlinear decision boundary. For a more comprehensive introduction to the SVM, one can refer to Vapnik (2013) and Cristianini and Shawe-Taylor (2000). Due to this ability to construct flexible decision boundaries, different versions of the SVM for functional data have been proposed in the literature. Rossi and Villa (2006) considered projecting functional covariates onto a set of fixed basis functions first, and then applied SVMs to the projections for classification. In contrast, Yao et al. (2016) proposed a supervised method to perform dimension reduction for functional data; a weighted SVM (Lin et al., 2002) was then constructed on the reduced feature space. Wu and Liu (2013) first recovered trajectories of sparse functional data or longitudinal data using principal analysis by conditional expectation (PACE) (Yao et al., 2005), and then proposed a support vector classifier for the random curves. However, the convergence rate of SVMs with a functional covariate was not established in the aforementioned work.
Marron et al. (2007) noted that the data piling problem may cause a deterioration in the performance of the SVM. They proposed the distance-weighted discrimination (DWD) classifier, which makes use of all observations in a training sample, rather than only the support vectors as in the SVM, to determine the decision boundary. Wang and Zou (2018) proposed an efficient algorithm to solve the DWD problem. In this article, we extend the idea of the DWD to functional data to address a binary classification problem. The basic idea is to find an optimal projection direction such that the DWD classifier built upon the projected score achieves good prediction performance. Additionally, to avoid overfitting on the training sample, we incorporate a roughness penalty term when minimizing the empirical risk function. Penalized approaches have been investigated recently in the context of functional linear regression; interested readers can refer to the work by Yuan and Cai (2010) and Sun et al. (2018). However, as far as we know, this framework has received little attention in functional classification problems. The method proposed in this article estimates the slope function by seeking a minimizer of a regularized empirical risk function over a reproducing kernel Hilbert space (RKHS), where the RKHS is closely associated with the penalty term in the regularized empirical risk function. With the help of the representer theorem, we are able to convert this infinite-dimensional minimization problem into a finite-dimensional one, which lays the foundation for the numerical implementation of the proposed classifier. This framework is further extended to accommodate classification when observations of both a functional covariate and several scalar covariates are available for each subject. There has been extensive research on partial functional linear regression models for such scenarios; see Kong et al. (2016) and Wong et al. (2019) for instance. However, much less progress has been made for classification in this regard; thus our work fills the gap of functional data classification when scalar covariates are also available. In addition to the novel methodology, we establish a non-asymptotic oracle-type inequality bounding the convergence rate of the relative loss and the relative classification error. This error bound is essentially different from those considered in Delaigle and Hall (2012), Dai et al. (2017) and Berrendero et al. (2018), all of which focused on asymptotic perfect classification.
The rest of this article is organized as follows. In Section 2 we introduce the RKHS-based functional DWD classifier for classifying functional data without and with scalar covariates. Theoretical properties of the proposed classifiers are established in Section 3. We carry out simulation studies in Section 4 to investigate the finite sample performance of the proposed classifiers in terms of prediction accuracy. In Section 5 we consider one real world application to demonstrate the performance of the proposed classifiers. We conclude this article in Section 6. All technical proofs are provided in the Appendix.
2 Methodology
Let denote a random function with a compact domain , and is a binary outcome related to . Without loss of generality, we assume that . Suppose that the training sample consists of , i.i.d. copies of . Our primary goal is to build a classifier based on this training sample.
We first present an overview of the distance-weighted discrimination (DWD) proposed by Marron et al. (2007). Consider the following classification problem, where is a vector of scalar covariates and is a binary response. The main task is to build a classifier based on pairs of observations . According to Wang and Zou (2018), the decision boundary of a generalized distance-weighted discrimination classifier can be obtained by solving
where
(1)
is the loss function and is a tuning parameter. Note that as , the generalized DWD loss function converges to the hinge loss function used in the SVM. This relationship is also illustrated in Figure 1. Denote by the solution to the minimization problem above. Given a new observation , the predicted class label will be 1 if and -1 otherwise.
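For concreteness, the following is a minimal numerical sketch of the generalized DWD loss of Wang and Zou (2018) and its hinge-loss limit; the function name and vectorized implementation are ours, and the piecewise formula should be checked against (1).

```python
import numpy as np

def gdwd_loss(u, q=1.0):
    """Generalized DWD loss of Wang and Zou (2018):
    V_q(u) = 1 - u                              if u <= q / (q + 1),
           = u**(-q) * q**q / (q + 1)**(q + 1)  otherwise.
    As q -> infinity, V_q approaches the hinge loss max(0, 1 - u)."""
    u = np.asarray(u, dtype=float)
    thresh = q / (q + 1.0)
    const = q ** q / (q + 1.0) ** (q + 1.0)
    tail = const * np.maximum(u, thresh) ** (-q)   # clipped so the branch is safe for u <= 0
    return np.where(u <= thresh, 1.0 - u, tail)

# Quick check against the hinge loss at a few margins
margins = np.array([-0.5, 0.2, 0.8, 1.5])
print(gdwd_loss(margins, q=100.0))        # close to the hinge loss for large q
print(np.maximum(0.0, 1.0 - margins))     # hinge loss
```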
In this article, we aim to extend the framework of DWD to functional data. In particular, we consider the following objective function
(2)
where is a penalty functional. The penalty functional can be conveniently defined through the slope function as a squared norm or semi-norm associated with . A canonical example of is the Sobolev space. Without loss of generality, assuming that , the Sobolev space of order is then defined as
Endowed with the (squared) norm
is a reproducing kernel Hilbert space. In this case, a possible choice of the penalty functional is given by
(3)
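As a concrete instance, taking the common second-order case on [0, 1] (this specific order is our choice for illustration), the space, its squared norm, and the corresponding penalty can be written as follows.

```latex
% Second-order Sobolev space on [0,1]; one standard convention (cf. Gu, 2013)
\mathcal{W}_2^2[0,1] \;=\; \bigl\{\beta:\ \beta,\ \beta' \ \text{absolutely continuous},\ \beta'' \in L^2[0,1]\bigr\},
\qquad
\|\beta\|_{\mathcal{W}_2^2}^2 \;=\; \Bigl(\int_0^1 \beta(t)\,dt\Bigr)^{2}
 + \Bigl(\int_0^1 \beta'(t)\,dt\Bigr)^{2}
 + \int_0^1 \bigl\{\beta''(t)\bigr\}^{2}\,dt,
\qquad
J(\beta) \;=\; \int_0^1 \bigl\{\beta''(t)\bigr\}^{2}\,dt .
```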
2.1 Representer theorem
Let the penalty functional be a squared semi-norm on such that the null space
(4)
is a finite-dimensional linear subspace of . Denote by its orthogonal complement in such that . That is, for any , there exists a unique decomposition such that and . Note that is also a reproducing kernel Hilbert space with the inner product of restricted to . Let be the corresponding reproducing kernel of such that for any . Let and be the basis functions of .
We will assume that is continuous and square integrable. With slight abuse of notation, write
(5)
According to Yuan and Cai (2010), for any and ,
(6)
With these observations, we are able to establish the following theorem, which is crucial for both the numerical implementation and the theoretical analysis of the proposed classifier.
Theorem 1
Let and be the minimizer of (2) and . Then there exist and such that
(7)
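For orientation, a representation of the type asserted in (7) can be sketched in the notation of Yuan and Cai (2010); the expansion below is our reading of the result (with the basis functions spanning the null space and K_1 the reproducing kernel of its orthogonal complement) and should be checked against the exact statement.

```latex
% Sketch of the representer-type expansion (our notation, following Yuan and Cai, 2010)
\hat{\beta}(t) \;=\; \sum_{\nu=1}^{m} d_\nu\, \xi_\nu(t)
  \;+\; \sum_{i=1}^{n} c_i \int_{\mathcal{T}} K_1(s,t)\, X_i(s)\, ds,
\qquad d_\nu,\ c_i \in \mathbb{R}.
```

In particular, each evaluation of the form $\int_{\mathcal{T}} X_j(t)\hat{\beta}(t)\,dt$ then reduces to a finite linear combination of the coefficients, which is what makes the finite-dimensional reformulation in Section 2.2 possible.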
2.2 Estimation algorithm
For the purpose of illustration, we assume that and , in the following numerical implementations. Then is the linear space spanned by and . A possible choice for the reproducing kernel associated with is
where and . Readers may refer to Chapter 2.3 of Gu (2013) for more details. Based on Theorem 1, we only need to consider that takes the following form:
for some and to minimize the function in (2). As a result,
For the penalty term, we have , where is an matrix with th entry . Denote by an matrix with the th entry for . Let and denote the th row of and , respectively. Now the infinite-dimensional minimization of (2) becomes the following finite dimensional minimization problem:
(8)
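To make this reduction concrete, the sketch below assembles the two finite-dimensional building blocks used in (8): a null-space design matrix whose rows contain the integrals of each curve against the null-space basis functions 1 and t, and the kernel matrix whose (i, j) entry is the double integral of the reproducing kernel against the curves of subjects i and j. The reproducing kernel is the cubic-spline kernel described in Chapter 2.3 of Gu (2013); the Riemann-type quadrature, the function names, and the assumption of a common observation grid on [0, 1] are our own.

```python
import numpy as np

def k1(t):
    # Scaled Bernoulli polynomials used in Gu (2013), Chapter 2.3
    return t - 0.5

def k2(t):
    return (k1(t) ** 2 - 1.0 / 12.0) / 2.0

def k4(t):
    return (k1(t) ** 4 - k1(t) ** 2 / 2.0 + 7.0 / 240.0) / 24.0

def rk_cubic(s, t):
    """Reproducing kernel K1(s, t) of the cubic-spline RKHS on [0, 1]."""
    return k2(s) * k2(t) - k4(np.abs(s - t))

def design_matrices(X, tgrid):
    """Return (T, Sigma) for curves X (n x p) observed on a common grid.

    T[i]        ~ (integral of X_i(t), integral of t * X_i(t))
    Sigma[i, j] ~ double integral of K1(s, t) X_i(s) X_j(t).
    Integrals are approximated by Riemann sums (an assumption of this sketch).
    """
    w = np.gradient(tgrid)                        # quadrature weights
    Xw = X * w
    T = np.column_stack([Xw.sum(axis=1), Xw @ tgrid])
    K = rk_cubic(tgrid[:, None], tgrid[None, :])  # p x p kernel on the grid
    Sigma = Xw @ K @ Xw.T
    return T, Sigma
```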
To find the minimizer of , we implement the majorization-minimization (MM) principle. The basic idea is as follows. We first look for a majorization function , where in this problem, of the target function . This majorization function satisfies for any and if . Additionally, it should be easy to find the minimizer of for any given . Then, given an initial value of , say , we generate a sequence of ’s, say , defined by , . As long as this sequence converges, the limit is regarded as the minimizer of the objective function .
Given , let with and
where denotes a vector of length with each component equal to 1. According to Lemma 2 of Wang and Zou (2018), we can take the majorization function of as
It is trivial to show that the minimizer of is
Then the algorithm proceeds until the sequence of minimizers converges. The limit of this sequence is denoted by , and thus . The functional DWD classifier assigns 1 or -1 to a new functional observation according to whether the statistic is positive or negative.
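A minimal end-to-end sketch of the resulting classifier follows. For simplicity it minimizes the regularized empirical risk in (2) directly with a generic quasi-Newton optimizer rather than the closed-form MM updates described above (the generalized DWD loss is continuously differentiable, so this is feasible, if less efficient). It reuses the illustrative helpers `gdwd_loss`, `design_matrices` and `rk_cubic` sketched earlier; all names and defaults are ours.

```python
import numpy as np
from scipy.optimize import minimize

def fit_fdwd(X, y, tgrid, lam=1e-2, q=1.0):
    """Sketch: minimize mean_i V_q(y_i f(X_i)) + lam * c' Sigma c,
    with decision function f(X_i) = b0 + T_i d + Sigma_i c."""
    n = X.shape[0]
    T, Sigma = design_matrices(X, tgrid)

    def objective(theta):
        b0, d, c = theta[0], theta[1:3], theta[3:]
        f = b0 + T @ d + Sigma @ c
        return np.mean(gdwd_loss(y * f, q)) + lam * (c @ Sigma @ c)

    res = minimize(objective, np.zeros(3 + n), method="L-BFGS-B")
    b0, d, c = res.x[0], res.x[1:3], res.x[3:]

    def predict(Xnew):
        Tn, _ = design_matrices(Xnew, tgrid)
        w = np.gradient(tgrid)
        K = rk_cubic(tgrid[:, None], tgrid[None, :])
        Snew = (Xnew * w) @ K @ (X * w).T   # cross inner products with training curves
        return np.sign(b0 + Tn @ d + Snew @ c)

    return predict
```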
2.3 Functional DWD with scalar covariates
The algorithm presented above is to address binary classification problems for univariate functional data. This idea can be further extended to accommodate binary classification when both a functional covariate and finite dimensional scalar covariates are involved. In particular, the training sample consists of , where denotes the dimensional scalar covariates of the th subject. With slight abuse of notation, we consider the following extension of (2):
(9)
to build a partial linear DWD classifier.
To solve the minimization problem (9), we resort to the specific representation of in Theorem 1. It is straightforward to verify that this result still holds in the context of (9). As a result, the infinite-dimensional minimization problem (9) is converted to the following finite-dimensional one:
(10)
With some modifications, we employ the MM principle to address the minimization problem above. In particular, the majorization function of is taken as
where with , and , and
Thus the minimizer of is given by
We then follow steps in Section 2.2 to implement classifications on subjects with both a functional covariate and several scalar covariates.
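The modification can be sketched as follows: the scalar covariates enter the decision function linearly and are left unpenalized, while the functional part is handled exactly as in Section 2.2. Again, direct numerical minimization stands in for the MM updates, and the helpers from the earlier sketches are reused; all names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def fit_plfdwd(X, W, y, tgrid, lam=1e-2, q=1.0):
    """Partial linear functional DWD sketch:
    f(X_i, W_i) = b0 + T_i d + W_i' gamma + Sigma_i c,
    with only the functional part penalized through c' Sigma c."""
    n, r = W.shape
    T, Sigma = design_matrices(X, tgrid)

    def objective(theta):
        b0, d = theta[0], theta[1:3]
        gamma, c = theta[3:3 + r], theta[3 + r:]
        f = b0 + T @ d + W @ gamma + Sigma @ c
        return np.mean(gdwd_loss(y * f, q)) + lam * (c @ Sigma @ c)

    res = minimize(objective, np.zeros(3 + r + n), method="L-BFGS-B")
    return res.x   # (b0, d, gamma, c) stacked
```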
2.4 Tuning parameter selection
Here we focus on binary classification when both a functional covariate and several scalar covariates are involved. The prediction performance of the proposed classifier depends on the choice of the two tuning parameters and . Computing the inverse of the matrix from scratch for every combination of and would be computationally intensive, especially when the sample size is large, so we adopt a solution with a lower computational cost: the essential idea is to avoid directly recomputing this inverse for each combination of tuning parameters. Write as
Therefore, the inverse of admits
(11)
Note that among these matrices only depends on and . The inverse of is available from the Sherman-Morrison-Woodbury formula:
(12)
To compute the matrix in (12), we need to find the inverse of first. Let denote the eigen-decomposition of , which does not depend on . Then we compute the inverse of for each and ; it is actually a diagonal matrix. Hence the inverse of is immediately available for each combination of and . Furthermore, note that is a matrix. These facts suggest that it is efficient to compute the (inverse) matrix in (12), as long as is relatively small.
Finally, we employ the expression of in (11) to compute directly. Denote by the inverse of . By equation (11), we have
With the procedures above, we are able to compute the minimizer of the majorization function for different values of and , and thus solve the minimization problem of (10) efficiently. We employ cross validation to choose the optimal combination of and in the following numerical studies.
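The computational strategy of this subsection can be sketched as follows: the expensive eigen-decomposition is computed once, since the matrix it factorizes does not depend on the tuning parameters, and the inverse needed for each candidate value is then assembled from a diagonal matrix plus a low-rank Sherman-Morrison-Woodbury correction coming from the small null-space/scalar design block, in the spirit of (11) and (12). The matrix names and the exact form of the correction below are illustrative rather than the paper's exact expressions.

```python
import numpy as np

def cached_inverse_factory(M, T):
    """Precompute what does not depend on the penalization constant so that
    inv(M + alpha * I + T @ C @ T.T) can be formed cheaply for many alpha.

    M : symmetric (n, n) matrix, eigen-decomposed once.
    T : (n, m) matrix with m small (null-space / scalar-covariate block).
    """
    evals, U = np.linalg.eigh(M)            # done once, independent of alpha

    def inverse(alpha, C):
        # inv(M + alpha I) = U diag(1 / (evals + alpha)) U': diagonal in the eigenbasis
        Ainv = (U / (evals + alpha)) @ U.T
        AinvT = Ainv @ T
        # Sherman-Morrison-Woodbury correction for the low-rank block T C T'
        core = np.linalg.inv(np.linalg.inv(C) + T.T @ AinvT)
        return Ainv - AinvT @ core @ AinvT.T

    return inverse

# Usage sketch: one eigh(), then cheap updates over a grid of penalization values.
# inv_fun = cached_inverse_factory(M, T)
# for alpha in np.logspace(-4, 1, 20):
#     Minv = inv_fun(alpha, C)
```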
3 Theoretical properties
Let denote the Bayes classifier, which minimizes the probability of misclassification, . It is trivial that a.s. on the set . Given a loss function , the associated risk function for a classifier is then defined by
Blanchard et al. (2008) established a non-asymptotic bound on when is the estimated support vector classifier from a training sample in which each subject consists of multiple scalar covariates and a binary outcome, and is the corresponding hinge loss function.
Denote by the estimated functional DWD classifier, where and minimize the target function in (2). In the context of functional classification, assuming that the functional covariate , the Bayes classifier is a measurable functional from to . Here we aim to establish a non-asymptotic bound on , where is the loss function for the functional DWD classifier.
The following conditions are required: (C1) concerns bounds related to the noise, while (C2) concerns bounds on the kernel and functions.
- (C1) For , we require for all and , thus large enough so that . We also require for all , which bounds the probability away from 0 and 1 and also implies that .
- (C2) There exist positive constants such that for and for being the reproducing kernel.
Similar to Blanchard et al. (2008), there are two settings to consider, and they affect how the penalization parameter is controlled. In setting (S1), the risk is analyzed via the spectral properties of the reproducing kernel; specifically, the penalization parameter is controlled by the tail sum of the eigenvalues of the reproducing kernel. In setting (S2), the risk is analyzed via covering numbers under the sup-norm, and is instead controlled via , the supremum norm -entropy. This control is encapsulated in the term defined in the following theorem.
Theorem 2
Under conditions (C1) and (C2), let the penalization parameter be bounded as
for some universal constant with under (S1) and under (S2) where is the solution to with . For an iid sample of functional-binary pairs and FDWD loss function , let the regularized estimator be the solution to
for corresponding classifier . Then, for being the Bayes classifier and the classifier corresponding to any arbitrary , the following holds with probability at least ,
for positive constants .
Proof of Theorem 2 can be found in the appendix. We extend this theorem to the functional DWD estimator with scalar covariates in the following corollary. For this extension, we require an additional condition:
- (C3) There exists a positive constant such that for .
The proof of the corollary below follows from that of Theorem 2; instead of considering suprema over the ball , we consider the product ball .
Corollary 1
Under conditions (C1), (C2), and (C3), let be a positive constant, and let the penalization parameter be bounded as
for some universal constant with under (S1) and under (S2) where is the solution to with . For an iid sample of functional-covariate-binary triples and FDWD loss function , let the regularized estimator and be the solution to
for corresponding classifier . Then, for being the Bayes classifier and the classifier corresponding to any arbitrary and , the following holds with probability at least ,
for positive constants .
4 Simulation studies
In this section, we considered two different simulation settings to investigate the finite-sample performance of the proposed classifier. In both settings, the functional covariate was generated in the following way: , where the ’s are independently drawn from a uniform distribution on (, , and and for . Observations at 50 time points on the interval [0, 1] were available for each sample path of . Two scalar covariates () were independently generated from a truncated normal distribution within the interval with mean 0 and variance 1. Then the binary response variable with values or was generated from the logistic model:
where and is referred to as the discriminant function in this article.
In the first scenario, the slope function of was , and or so that the discriminant function does or does not depend on the scalar covariates, respectively. In the second scenario, the slope function was written as a linear combination of the functional principal components of . In particular, , and the coefficient vector of the scalar covariates was or . In each simulation scenario, 100 or 200 curves were generated for training. Then 500 samples were generated as the test set to assess prediction accuracy.
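A rough sketch of this data-generating scheme is given below. It is not the paper's exact specification: the basis functions, score scaling, truncation interval, and coefficient values are placeholders that the reader would replace with the choices described above.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(n, tgrid, beta, gamma=np.zeros(2), K=10):
    """Sketch of the simulation design: functional covariate from a truncated
    Karhunen-Loeve-type expansion, two truncated-normal-like scalar covariates,
    and labels drawn from a logistic model on the discriminant function."""
    phi = np.sqrt(2) * np.sin(np.pi * np.outer(np.arange(1, K + 1), tgrid))  # assumed basis
    xi = rng.uniform(-np.sqrt(3), np.sqrt(3), (n, K)) / np.arange(1, K + 1)  # decaying scores
    X = xi @ phi                                   # curves on the grid
    W = np.clip(rng.normal(size=(n, 2)), -2, 2)    # stand-in for truncated normals
    w = np.gradient(tgrid)                         # quadrature weights
    eta = (X * beta(tgrid)) @ w + W @ gamma        # discriminant function
    prob = 1.0 / (1.0 + np.exp(-eta))
    Y = np.where(rng.uniform(size=n) < prob, 1, -1)
    return X, W, Y

# Example with an assumed slope function on 50 time points in [0, 1]
X, W, Y = simulate(100, np.linspace(0, 1, 50),
                   beta=lambda t: np.sin(2 * np.pi * t),
                   gamma=np.array([1.0, -1.0]))
```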
In addition to the proposed functional DWD classifier, we also considered several other commonly used functional data classifiers for comparison. The centroid classifier of Delaigle and Hall (2012) first projects each functional covariate onto one specific direction and then performs classification based on the distance to the centroid in the projected space. The functional quadratic discriminant classifier of Galeano et al. (2015) conducts a quadratic discriminant analysis on FPC scores, while the functional logistic classifier fits a logistic regression model on them. Note that the aforementioned classifiers, except our proposed functional DWD with scalar covariates, only account for functional covariates in classification. To study the effect of scalar covariates on classification, we also fitted an SVM classifier with only these two scalar covariates when they are involved in the discriminant function. In each simulation trial, we randomly generated a training set of size 100 or 200 to fit all classifiers and then evaluated their predictive accuracy on a test sample of size 500. To assess the uncertainty in estimating the prediction accuracy of each classifier, independent simulation trials were conducted in each scenario.
Scenario 1

n | Scalar covariates | Centroid | Quadratic | Logistic | KNN | fDWD | PLfDWD | S-SVM
---|---|---|---|---|---|---|---|---
100 | Yes | 41.3 (3.7) | 43.2 (3.0) | 41.9 (3.1) | 43.4 (3.3) | 39.8 (2.7) | 32.3 (2.7) | 37.9 (5.1)
100 | No | 39.1 (3.7) | 40.7 (3.3) | 39.5 (3.1) | 40.8 (3.6) | 37.7 (2.8) | |
200 | Yes | 40.2 (2.8) | 41.5 (2.8) | 40.5 (2.6) | 43.0 (2.8) | 39.3 (2.4) | 31.3 (2.4) | 35.5 (3.0)
200 | No | 37.9 (2.7) | 39.1 (2.8) | 38.2 (2.4) | 40.6 (3.1) | 37.1 (2.3) | |

Scenario 2

n | Scalar covariates | Centroid | Quadratic | Logistic | KNN | fDWD | PLfDWD | S-SVM
---|---|---|---|---|---|---|---|---
100 | Yes | 21.7 (2.2) | 22.4 (2.4) | 21.9 (2.2) | 22.6 (2.5) | 21.0 (2.0) | 11.0 (1.9) | 35.7 (4.5)
100 | No | 11.1 (1.6) | 11.4 (1.6) | 11.4 (1.7) | 11.5 (1.6) | 10.4 (1.4) | |
200 | Yes | 21.3 (2.0) | 21.5 (2.0) | 21.2 (1.9) | 22.2 (2.2) | 20.8 (1.8) | 10.5 (1.9) | 33.9 (2.9)
200 | No | 10.7 (1.5) | 10.7 (1.4) | 10.6 (1.4) | 11.1 (1.5) | 10.2 (1.4) | |
Table 1 summarizes the mean and the standard error of the misclassification error rate of each classifier. In the first scenario, the proposed functional DWD classifier with scalar covariates is considerably more accurate than any other classifier. This is not surprising, since even the SVM classifier with only scalar covariates outperforms the functional classifiers that do not take scalar covariates into consideration, which underlines the importance of accounting for scalar covariates when the true discriminant function indeed depends on them. Additionally, whether or not the true discriminant function depends on the scalar covariates in these settings, the misclassification error rates of our proposed functional DWD classifiers are very close to the Bayes errors, which are 0.283 and 0.376, respectively. As the projection function in the centroid classifier and the slope function in the functional logistic regression of Leng and Müller (2005) are represented in terms of FPCs, these two classifiers should be favored in the second scenario, and this is reflected in their prediction accuracy. Nevertheless, our proposed classifier still dominates all competitors regardless of whether the true discriminant function depends on the scalar covariates. A plausible reason why the proposed classifier is superior to the centroid and logistic classifiers is that the roughness of the projection direction is appropriately controlled in our method. Once again, the misclassification rates of our proposed classifiers are very close to the Bayes errors, which are 0.086 with scalar covariates and 0.099 without, respectively.
5 Real data examples
In this section, we apply the proposed classifiers as well as several alternative classifiers to one real-world example to demonstrate the performance of our proposal.
Alzheimer’s disease (AD) is an irreversible and progressive brain disorder that leads to increasingly serious dementia symptoms over a few years. Previous studies showed that increasing age is one of the most important risk factors for AD, and most patients with AD are above 65. However, there is also a substantial number of early-onset Alzheimer’s patients whose ages are under 65. The situation is aggravated by the fact that there is currently no cure for AD and that the disease eventually destroys patients’ ability to perform even the simplest tasks. For these reasons, studies of AD have received considerable attention in the past few years.
In our study, the data were obtained from the ongoing Alzheimer’s Disease Neuroimaging Initiative (ADNI), which aims to unite researchers from around the world to collect, validate and analyze relevant data. In particular, the ADNI is interested in identifying biomarkers of AD from genetic, structural and functional neuroimaging, and clinical data. The dataset consists of two main parts. The first part is neuroimaging data collected by diffusion tensor imaging (DTI). More specifically, fractional anisotropy (FA) values were measured at 83 locations along the corpus callosum (CC) fiber tract for each subject. The second part comprises demographic features: gender (a categorical variable), handedness (left- or right-handed, a categorical variable), age, education level, AD status and the mini-mental state examination (MMSE) score. The AD status is a categorical variable with three levels: normal control (NC), mild cognitive impairment (MCI) and Alzheimer’s disease (AD). We combine the first two categories into a single category for simplicity, and this status variable is then treated as a binary outcome in the following analysis. The MMSE is one of the most widely used tests of cognitive functions such as orientation, attention, memory, language and visual-spatial skills for assessing the level of dementia a patient may have. A more detailed description of the data can be found at http://adni.loni.usc.edu. Previous studies, such as Li et al. (2017) and Tang et al. (2020), focused on building regression models to investigate the relationship between the progression of the AD status and the neuroimaging and demographic data. In contrast, our main objective is to use the DTI data and demographic features to predict the AD status.
We had 214 subjects in our analysis after removing 3 subjects with missing values. Among them, there are subjects from the first group, i.e., subjects whose status is either NC or MCI, and subjects from the AD group. The functional predictor was taken as the FA profiles, which are displayed in Figure 2, and the scalar covariates consisted of gender, handedness, age, education level and the MMSE score. To justify the importance of incorporating FA profiles in classification, we also considered an SVM classifier with only these scalar covariates. To compare the prediction performance of the classifiers, these 214 subjects were randomly divided into a training set with subjects and a test set with the remaining subjects. In the study, we considered two particular choices of the training-set size: 107 and 171. Following this rule, we randomly split the whole dataset into training and test sets 500 times.
Training size | Centroid | Quadratic | Logistic | KNN | fDWD | PLfDWD | S-SVM
---|---|---|---|---|---|---|---
107 | 19.6 (0.029) | 20.6 (0.031) | 21.4 (0.030) | 21.0 (0.028) | 19.8 (0.030) | 11.2 (0.028) | 26.3 (0.032)
171 | 19.8 (0.054) | 20.5 (0.056) | 21.5 (0.057) | 21.7 (0.054) | 19.8 (0.054) | 9.9 (0.048) | 27.2 (0.059)
Table 2 summarizes the mean misclassification error rates and the standard errors across the 500 splits for each classifier. When scalar covariates are not accounted for, our proposed method (fDWD) outperforms all other competitors except the centroid method in terms of prediction accuracy. Even more remarkably, incorporating scalar covariates, albeit in a linear manner, results in a substantial reduction in misclassification error for our proposed classifier, to around half of the errors of the functional classifiers without scalar covariates. One might argue that this occurred because the scalar covariates are highly predictive of the AD status while the functional covariate is not. However, the prediction performance of S-SVM refutes this argument, as the SVM classifier that only considered scalar covariates performed even worse than those with only the functional covariate. On the one hand, these comparisons indicate the superiority of our proposed classifier; on the other hand, they also suggest that accounting for scalar covariates in an appropriate way can enhance prediction accuracy when discriminating functional data.
6 Conclusion
In this paper we propose a novel methodology that combines the idea of the canonical DWD classifier with regularized functional linear regression under the RKHS framework to classify functional data. The use of an RKHS enables us to control the roughness of the estimated projection direction, and thus enhances prediction accuracy in comparison with conventional functional logistic regression and the centroid classifier. Moreover, we further extend the framework to classifying subjects with both a functional covariate and several scalar covariates. Although we focus on the DWD loss function to inherit the appealing properties of DWD classifiers, this framework can be extended to other loss functions, such as the logistic loss in functional logistic regression and the hinge loss in functional SVM classifiers. Moreover, the scalar covariates are incorporated into our classifier in a linear manner to achieve a good trade-off between flexibility and interpretability. Nonlinear or nonparametric forms of the scalar covariates can also be accommodated within our framework, as long as appropriate regularization is adopted to avoid overfitting. This direction deserves future investigation in both theory and practice.
Numerical studies, including both simulation studies and a real-world application, suggest that the proposed classifier is superior to many other competitors in terms of prediction accuracy. The application of our classifier to a study of Alzheimer’s disease provides numerical evidence that both neuroimaging data and demographic features are relevant to AD, and that ignoring either of them deteriorates the prediction accuracy for the AD status.
References
- Berrendero et al. (2016) Berrendero, J. R., Cuevas, A. and Torrecilla, J. L. (2016) Variable selection in functional data classification: a maxima-hunting proposal. Statistica Sinica, 26, 619–638.
- Berrendero et al. (2018) — (2018) On the use of reproducing kernel Hilbert spaces in functional classification. Journal of the American Statistical Association, 113, 1210–1218.
- Biau et al. (2005) Biau, G., Bunea, F. and Wegkamp, M. H. (2005) Functional classification in Hilbert spaces. IEEE Transactions on Information Theory, 51, 2163–2172.
- Blanchard et al. (2008) Blanchard, G., Bousquet, O. and Massart, P. (2008) Statistical performance of support vector machines. The Annals of Statistics, 36, 489–531.
- Cérou and Guyader (2006) Cérou, F. and Guyader, A. (2006) Nearest neighbor classification in infinite dimension. ESAIM: Probability and Statistics, 10, 340–355.
- Cristianini and Shawe-Taylor (2000) Cristianini, N. and Shawe-Taylor, J. (2000) An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press.
- Dai et al. (2017) Dai, X., Müller, H.-G. and Yao, F. (2017) Optimal Bayes classifiers for functional data and density ratios. Biometrika, 104, 545–560.
- Delaigle and Hall (2012) Delaigle, A. and Hall, P. (2012) Achieving near perfect classification for functional data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74, 267–286.
- Fan and Fan (2008) Fan, J. and Fan, Y. (2008) High dimensional classification using features annealed independence rules. The Annals of Statistics, 36, 2605.
- Ferraty and Vieu (2003) Ferraty, F. and Vieu, P. (2003) Curves discrimination: a nonparametric functional approach. Computational Statistics & Data Analysis, 44, 161–173.
- Ferraty and Vieu (2004) — (2004) Nonparametric models for functional data, with application in regression, time series prediction and curve discrimination. Nonparametric Statistics, 16, 111–125.
- Ferraty and Vieu (2006) — (2006) Nonparametric functional data analysis: theory and practice. Springer, New York.
- Friedman et al. (2009) Friedman, J., Hastie, T. and Tibshirani, R. (2009) The elements of statistical learning, 2nd edition. Springer, New York.
- Galeano et al. (2015) Galeano, P., Joseph, E. and Lillo, R. E. (2015) The Mahalanobis distance for functional data with applications to classification. Technometrics, 57, 281–291.
- Gu (2013) Gu, C. (2013) Smoothing spline ANOVA models, 2nd edition. Springer, New York.
- Hall et al. (2001) Hall, P., Poskitt, D. S. and Presnell, B. (2001) A functional data—analytic approach to signal discrimination. Technometrics, 43, 1–9.
- Kong et al. (2016) Kong, D., Xue, K., Yao, F. and Zhang, H. H. (2016) Partially functional linear regression in high dimensions. Biometrika, 103, 147–159.
- Leng and Müller (2005) Leng, X. and Müller, H.-G. (2005) Classification using functional data analysis for temporal gene expression data. Bioinformatics, 22, 68–76.
- Li et al. (2017) Li, J., Huang, C. and Zhu, H. (2017) A functional varying-coefficient single-index model for functional response data. Journal of the American Statistical Association, 112, 1169–1181.
- Lin et al. (2002) Lin, Y., Lee, Y. and Wahba, G. (2002) Support vector machines for classification in nonstandard situations. Machine Learning, 46, 191–202.
- Marron et al. (2007) Marron, J. S., Todd, M. J. and Ahn, J. (2007) Distance-weighted discrimination. Journal of the American Statistical Association, 102, 1267–1271.
- Massart (2000) Massart, P. (2000) Some applications of concentration inequalities to statistics. Annales de la Faculté des sciences de Toulouse: Mathématiques, 9, 245–303.
- Mendelson (2003) Mendelson, S. (2003) On the performance of kernel classes. Journal of Machine Learning Research, 4, 759–771.
- Rossi and Villa (2006) Rossi, F. and Villa, N. (2006) Support vector machine for functional data classification. Neurocomputing, 69, 730–742.
- Sun et al. (2018) Sun, X., Du, P., Wang, X. and Ma, P. (2018) Optimal penalized function-on-function regression under a reproducing kernel Hilbert space framework. Journal of the American Statistical Association, 113, 1601–1611.
- Tang et al. (2020) Tang, Q., Kong, L., Ruppert, D. and Karunamuni, R. J. (2020) Partial functional partially linear single-index models. Statistica Sinica, Preprint No: SS-2018-0316. URL: http://www.stat.sinica.edu.tw/statistica/.
- Tian (2010) Tian, T. S. (2010) Functional data analysis in brain imaging studies. Frontiers in psychology, 1, 35.
- Vapnik (2013) Vapnik, V. (2013) The nature of statistical learning theory. Springer, New York.
- Wang and Zou (2018) Wang, B. and Zou, H. (2018) Another look at distance-weighted discrimination. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80, 177–198.
- Wong et al. (2019) Wong, R. K., Li, Y. and Zhu, Z. (2019) Partially linear functional additive models for multivariate functional data. Journal of the American Statistical Association, 114, 406–418.
- Wu and Liu (2013) Wu, Y. and Liu, Y. (2013) Functional robust support vector machines for sparse and irregular longitudinal data. Journal of Computational and Graphical Statistics, 22, 379–395.
- Yao et al. (2005) Yao, F., Müller, H.-G. and Wang, J.-L. (2005) Functional data analysis for sparse longitudinal data. Journal of the American Statistical Association, 100, 577–590.
- Yao et al. (2016) Yao, F., Wu, Y. and Zou, J. (2016) Probability-enhanced effective dimension reduction for classifying sparse functional data. Test, 25, 1–22.
- Yuan and Cai (2010) Yuan, M. and Cai, T. T. (2010) A reproducing kernel Hilbert space approach to functional linear regression. The Annals of Statistics, 38, 3412–3444.
Appendix
A.1: Proof of Theorem 1
Based on the observations above, we have . Therefore, the solution to (2) can be written as
where is the orthogonal complement of in . To prove (7), we just need to check that .
Let . Plugging the solution into (2), we have
In the first term,
In other words, the first term does not depend on . Meanwhile, the second term, , is minimized when since is orthogonal to in .
A.2: Proof of Theorem 2
To prove Theorem 2, we first prove the following lemmas. We also define with a countable set. In what follows, Lemmas 6.4 and 6.6 are for setting (S1) and Lemmas 6.8 and 6.10 are for setting (S2).
Lemma 1
Let with such that and reproducing kernel such that . Then,
for all .
Proof 6.3.
Via the Riesz representation theorem, we have for some . Using Hölder’s inequality or the Cauchy-Schwarz inequality, we have the following bound.
Lemma 6.4.
For all with and , Furthermore, under Conditions (C1) and (C2),
for all and all where and is the associated risk.
Proof 6.5.
First, we note that for that is differentiable with . Hence, is Lipschitz implying that
proving the first part of the lemma.
For the bound on , let with . Without loss of generality, we will consider the case and . Note that for . Note also that , so for . We have the ratio
If , then denote so . Also, note that . Thus, if we choose such that , then
Lemma 6.6.
Under Conditions (C1) and (C2), let be a sequence of subroot functions with being the solution to . Then, for all , , corresponding , , and
with
Proof 6.7.
We first define the Rademacher average for a function
to be
for iid Rademacher random variables
that are independent of the . This can be applied
to a class of by denoting
.
Lemma 6.7 from Blanchard et al. (2008) uses a standard symmetrization trick to prove that for any such collection of real-valued functions , some 1-Lipschitz function , and any that
(13)
Lemma 6.8 from Blanchard et al. (2008) comes from Mendelson (2003). It builds off the previous result to note that
(14)
where is the th eigenvalue of the reproducing kernel and where for the norms are and .
By noting that is a 1-Lipschitz function for any choice of , we apply Equation 13 for with to get
Then, application of Equation 14 gives
Thus, we aim to solve . Let be the minimizer over , which exists due to the reproducing kernel being a trace class operator, which in turn implies that is finite for all and tends to as . Then, application of the quadratic formula and the convexity result gives
Finally, choosing gives
Lemma 6.8.
For all with and , Furthermore, under Conditions (C1) and (C2),
for all and all where and is the associated risk.
Proof 6.9.
By choice of the metric , we have immediately that . Next, we aim to bound
Recalling that , we can assume that and that . We also note that for , and that , so for . Thus,
If , then denote so . Also, note that . Thus, if we choose such that , then
Lemma 6.10.
Under Conditions (C1) and (C2), let be a sequence of subroot functions with being the solution to . Then, for all , , corresponding , , and
with
Proof 6.11.
Lemma 6.10 from Blanchard et al. (2008) states that for a separable class of real functions in sup-norm such that and , the following bound holds:
(15)
where is the supremum norm -entropy for .
We first note that
and secondly note that the DWD loss function is 1-Lipschitz, which implies that
Thus, we can bound
and application of Equation 15 gives
where .
This final bound is a sub-root function in terms of .
For , the solution to , we can bound , the solution to with , as follows. First, we note that is decreasing. We also choose a so that . Therefore,
Thus,
Therefore, taking results in , implying that , giving finally that
Proof 6.12 (Proof of Theorem 2).
Given the above lemmas, we apply the “model selection” Theorem 4.3 from Blanchard et al. (2008), which is a generalization of Theorem 4.2 of Massart (2000), to our functional DWD classifier.
First, we choose to be our countable set of radii for the balls with . To apply Theorem 4.3, we require a sequence and choose similarly to Blanchard et al. (2008). We also require a penalty function that satisfies
where , , and under setting (S1), and , , and under setting (S2). To achieve this, we take to be under setting (S1) and under setting (S2), and we take for some universal constant . Consequently, we can choose for another suitable positive constant .
Therefore, given the estimator and corresponding estimator , we have . Then the regularization term and consequently, . Thus, . Recalling that , we have that for and, updating as necessary, we have that
where the third inequality results from being the minimizer of the regularized estimation procedure. Application of the “model selection theorem” (Massart, 2000; Blanchard et al., 2008) gives that with probability that