Robust estimation for functional logistic regression models

Graciela Boente^a, Marina Valdora^a
^a Universidad de Buenos Aires and CONICET, Argentina

Abstract

This paper addresses the problem of providing robust estimators under a functional logistic regression model. Logistic regression is a popular tool in classification problems with two populations. As in functional linear regression, regularization tools are needed to compute estimators for the functional slope. The traditional methods are based on dimension reduction or penalization combined with maximum likelihood or quasi–likelihood techniques and, for that reason, they may be affected by misclassified points, especially if they are associated with functional covariates with atypical behaviour. The proposal given in this paper adapts some of the best practices used when the covariates are finite–dimensional to provide reliable estimates. Under regularity conditions, consistency of the resulting estimators and rates of convergence for the predictions are derived. A numerical study illustrates the finite sample performance of the proposed method and reveals its stability under different contamination scenarios. A real data example is also presented.

Keywords: $B$ -splines; Functional Data Analysis; Logistic Regression Models; Robust Estimation

AMS Subject Classification: 62F35; 62G25

1 Introduction

In many applications, such as chemometrics, image recognition and spectroscopy, the observed data contain functional covariates, that is, variables originating from phenomena that are continuous in time or space and can be assumed to be smooth functions, rather than finite dimensional vectors. Functional data analysis aims to provide tools for analysing such data and has received considerable attention in recent years due to its high versatility and numerous applications. Different approaches, either parametric, nonparametric or even semiparametric, have been proposed to model data with functional predictors. Some well-known references in the treatment of functional data are the books of Ferraty and Vieu, (2006) and Ferraty and Romain, (2010), who carefully discuss non–parametric models, and also Ramsay and Silverman, (2005, 2002), Horváth and Kokoszka, (2012) and Hsing and Eubank, (2015) who place emphasis on parametric models such as the functional linear one. We also refer to Aneiros-Pérez et al., (2017) and the reviews in Cuevas, (2014) and Goia and Vieu, (2016) for some other results in the area.

Among regression models relating a scalar response with a functional covariate, the functional linear regression model is one of the most popular. Several estimation aspects under this model have been considered, among others, in Cardot et al., (2003), Cardot and Sarda, (2005), Cai and Hall, (2006), Hall and Horowitz, (2007), see also Febrero-Bande et al., (2017) and Reiss et al., (2017) for a review. Robust proposals for functional linear regression models using either $P-$ , $B-$ splines or functional principal components were given in Maronna and Yohai, (2013), Boente et al., (2020), Kalogridis and Van Aelst, (2023) and Kalogridis and Van Aelst, (2019), respectively.

Most of the papers mentioned above consider the case where the response is a continuous variable. However, discrete responses arise in some practical problems such as classification. In particular, when the interest lies in the presence or absence of a given condition, the response corresponds to a binary outcome. Reiss et al., (2017) include a review of relevant papers treating the case of responses whose conditional distribution belongs to an exponential family, that is, estimation under a generalized functional linear regression model. As in functional linear models, the naive approach to estimate the functional regression parameter, which takes as multivariate covariates the values of the functional covariates observed on a grid of points and ignores their functional nature, is not appropriate, since it leads to an ill-conditioned problem. The causes for this issue are, on the one hand, the high correlation existing, within each trajectory, among observations corresponding to close grid values and, on the other hand, the fact that the number of grid measurements may exceed the number of observations, see Marx and Eilers, (1999) and Ramsay, (2004) for a discussion. As mentioned in Wang et al., (2016), one of the challenges in functional regression is the inverse nature of the problem, which causes estimation problems mainly generated by the compactness of the covariance operator of $X$ . For the reasons mentioned above, the extension from the situation with finite-dimensional predictors to the case of an infinite-dimensional one is not direct. The usual practice to solve this drawback is regularization, which can be achieved in several ways, either by reducing the set of candidates to a finite–dimensional one or by adding a penalty term as when considering $P-$ splines. In particular, Cardot and Sarda, (2005) proposed estimators based on penalized likelihood and spline approximations and derived consistency properties for them.
A different approach was followed in Müller and Stadtmüller, (2005) where the estimators are obtained via a truncated Karhunen-Loève expansion for the covariates. These authors also provided a theoretical study of the properties of their estimators as well as an illustration of their proposal on a classification problem that analyzes the interplay of longevity and reproduction, in short and long–lived Mediterranean fruit flies.

Among generalized regression models, logistic regression is one of the best known and useful models in statistics. It has been extensively studied when euclidean covariates arise and several robust proposals were given in this setting, some of which will be mentioned below, in Section 2.1. The functional logistic regression model is a generalization of the finite–dimensional logistic regression model, which assumes that the observed covariates are functional data rather than vectors in $\mathbb{R}^{p}$ . It is particularly relevant in discrimination problems for curve data. This model was already considered in James, (2002) as a particular case of generalized linear models with functional predictors. Beyond the procedures studied in the framework of generalized functional regression models which can be used in functional logistic regression models, some authors have considered specific estimation methods for this particular model. Among others, we can mention Escabias et al., (2004), Aguilera et al., (2008), Aguilera-Morillo et al., (2013) and Mousavi and Sørensen, (2018) who provide a review and comparison of different estimation methods for functional logistic regression. Among the numerous interesting applications of functional logistic regression that have been reported in the literature we can mention Escabias et al., (2005), who use it to model environmental data, Ratcliffe et al., (2002), who apply it to foetal heart rate data and Reiss et al., (2005) who analyze PET imaging data. Besides, Sørensen et al., (2013) present several medical applications of functional data, while Wang et al., (2017) used a penalized Haar wavelet approach for the classification of brain images to assist in early diagnosis of Alzheimer’s disease.

The framework in which this paper will focus corresponds to the functional logistic regression model with scalar response, also labelled as scalar–on–function logistic regression in the literature. Under this model, the i.i.d. observations $(y_{i},X_{i})$ , $1\leq i\leq n$ , are such that the response $y_{i}\in\{0,1\}$ and the predictor $X_{i}\in L^{2}({\mathcal{I}})$ , with ${\mathcal{I}}$ a compact interval, while the link function is the logistic one. More precisely, if $Bi(1,p)$ stands for the Bernoulli distribution with success probability $p$ and $F(t)=\exp(t)\left[1+\exp(t)\right]^{-1}$ denotes the logistic function, the functional logistic regression model assumes that

y_{i}|X_{i}\sim Bi\left(1,p_{i}\right)\qquad\mbox{where}\qquad p_{i}=\mathbb{E}(y_{i}|X_{i})=F\left(\alpha_{0}+\langle X_{i},\beta_{0}\rangle\right)\,,

(1)

with $\alpha_{0}\in\mathbb{R}$ , $\beta_{0}\in L^{2}({\mathcal{I}})$ and $\langle u,v\rangle=\int_{{\mathcal{I}}}u(t)v(t)dt$ stands for the usual inner product in $L^{2}({\mathcal{I}})$ and $\|\cdot\|$ for the corresponding norm.
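To make model (1) concrete, the following minimal sketch draws a sample from it using only the Python standard library. The covariate generator, the slope `beta0`, the intercept value and the grid are hypothetical choices made for illustration, and the inner product is approximated by the trapezoidal rule.

```python
import math
import random

def logistic(t):
    """Logistic function F(t) = exp(t) / (1 + exp(t)), computed stably."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def inner_product(u, v, grid):
    """Trapezoidal approximation of <u, v> = int u(t) v(t) dt on the grid."""
    total = 0.0
    for k in range(len(grid) - 1):
        h = grid[k + 1] - grid[k]
        total += 0.5 * h * (u[k] * v[k] + u[k + 1] * v[k + 1])
    return total

def simulate(n, grid, alpha0, beta0, rng):
    """Draw (y_i, X_i, p_i), i = 1..n, from model (1)."""
    beta0_vals = [beta0(t) for t in grid]
    sample = []
    for _ in range(n):
        # Hypothetical covariate: random combination of three smooth functions.
        z = [rng.gauss(0.0, 1.0) for _ in range(3)]
        x = [z[0] + z[1] * math.sin(2 * math.pi * t) + z[2] * math.cos(2 * math.pi * t)
             for t in grid]
        p = logistic(alpha0 + inner_product(x, beta0_vals, grid))
        y = 1 if rng.random() < p else 0  # y | X ~ Bi(1, p)
        sample.append((y, x, p))
    return sample

grid = [k / 100 for k in range(101)]
rng = random.Random(0)
data = simulate(200, grid, alpha0=0.5,
                beta0=lambda t: 2.0 * math.sin(2 * math.pi * t), rng=rng)
```

Each element of `data` holds the binary response, the discretized trajectory and the success probability $p_{i}$ that generated the response.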

The estimators for functional logistic regression mentioned above are based on the method of maximum likelihood combined with some regularization tool. As it happens with finite dimensional covariates, these estimators are highly affected by the presence of misclassified points, especially when combined with high leverage covariates. Robust methods, on the other hand, have the advantage of giving reliable results, even when a proportion of the data corresponds to atypical observations. As mentioned above, some robust methods for functional linear regression models have been recently proposed and this area has shown great development in the last ten years. However, the literature on robust procedures for generalized functional linear models and specifically for functional logistic regression ones is scarce. To our knowledge, only a few procedures have been considered and most of them lack a careful study of the asymptotic properties of the proposal considered. The first attempt to provide a robust method for functional logistic regression was given in Denhere and Billor, (2016). This method is based on reducing the dimension of the covariates by using a robust principal components method proposed in Hubert et al., (2005). The robustness of this method ensures that the functional principal component analysis is not influenced by outlying covariates. However, it does not take into account the problem of large deviance residuals originated by incorrectly classified observations. To solve this problem, Mutis et al., (2022) propose to combine a basis approximation with weights computed using the Pearson residuals, in an approach related to that given by Alin and Agostinelli, (2017) for finite–dimensional covariates. Recently, Kalogridis, (2023) introduced an approach based on divergence measures combined with penalizations and provided a careful study of its asymptotic properties, for bounded covariates.

In this paper, we follow a different perspective, taking into account the sensitivity of these estimators to atypical observations and based on the ideas given for euclidean covariates by Bianco and Yohai, (1996) and Croux and Haesbroeck, (2003), we define robust estimators of the intercept $\alpha_{0}$ and the slope $\beta_{0}$ following a sieve approach combined with weighted $M-$ estimators. More precisely, as done for instance in functional linear regression, we first reduce the set of candidates for estimating $\beta_{0}$ to those belonging to a finite–dimensional space spanned by a fixed basis selected by the practitioner, such as the $B-$ splines, Fourier, or wavelet bases. This enables the use of the robust tools developed for finite–dimensional covariates in this infinite–dimensional framework. Clearly this regularization process involves the selection of the basis dimension which should increase with the sample size at a given rate and which must be chosen in a robust way. For that reason, in Section 2.4, we describe a resistant procedure to select the dimension of the approximating space.

The rest of the paper is organized as follows. The model and our proposed estimators are described in Section 2. Theoretical assurances regarding consistency and convergence rates of our proposal are provided in Section 3, while in Section 4 we report the results of a simulation study to explore their finite-sample properties. Section 5 contains the analysis of a real-data set, while final comments are given in Section 6. All proofs are relegated to the Appendix.

2 The estimators

As mentioned in the Introduction, our proposal for estimators under the functional logistic regression model (1) is based on basis reduction. For that reason and for the sake of completeness, in Section 2.1, we recall some of the robust proposals given when the covariates belong to a finite–dimensional space.

2.1 Some robust proposals for euclidean covariates

When the covariates are finite–dimensional, the practitioner deals with i.i.d. observations $\left(y_{i},\mathbf{x}_{i}\right)$ , $1\leq i\leq n$ , where $\mathbf{x}_{i}\in\mathbb{R}^{p}$ , $y_{i}\in\{0,1\}$ . In this case, the well–known logistic regression model states that $y_{i}|\mathbf{x}_{i}\sim Bi(1,F(\alpha_{0}+\mathbf{x}_{i}^{\mbox{\footnotesize\sc t}}\mbox{\boldmath$\beta$}_{0}))$ , where $\mbox{\boldmath$\beta$}_{0}\in\mathbb{R}^{p}$ . As mentioned in the Introduction, the maximum likelihood estimator of the regression coefficients is very sensitive to outliers, meaning that we can neither accurately classify a new observation based on these estimators, nor identify the covariates carrying important information for class assignment. To solve this drawback, different robust procedures have been considered.

In particular, consistent $M-$ estimators bounding the deviance were defined in Bianco and Yohai, (1996), while in order to obtain bounded influence estimators a weighted version was introduced in Croux and Haesbroeck, (2003). For the family of $M-$ estimators defined in Bianco and Yohai, (1996), Croux and Haesbroeck, (2003) introduced a loss function that guarantees the existence of the resulting robust estimator when the maximum likelihood estimators do exist. Basu et al., (1998) considered a proposal based on minimum divergence. However, their approach can also be seen as a particular case of the Bianco and Yohai, (1996) estimator with a properly defined loss function. Other approaches were given in Cantoni and Ronchetti, (2001) who consider a robust quasi–likelihood estimator, Bondell, (2005, 2008) whose procedures incorporate a minimum distance perspective and Hobza et al., (2008) who define a median estimator by using an $L^{1}-$ estimator of the smoothed responses.

As pointed out in Maronna et al., (2019), the use of redescending weighted $M-$ estimators ensures estimators with good robustness properties. For that reason and taking into account that our proposal will combine dimension reduction with weighted $M-$ estimators, we briefly describe the proposal introduced in Croux and Haesbroeck, (2003). From now on, denote $d(y,t)$ the squared deviance function, that is, $d(y,t)=-\log(F(t))y-\log(1-F(t))(1-y)$ and let $\rho:\mathbb{R}_{\geq 0}\to\mathbb{R}$ be a bounded, differentiable and nondecreasing function with derivative $\psi=\rho^{\prime}$ . Furthermore, define

\phi(y,t)=\rho(d(y,t))+G(F(t))+G(1-F(t))\,,

(2)

where $G(t)=\int_{0}^{t}\psi(-\log u)\,du$ . The correction term $G(F(t))+G(1-F(t))$ was introduced in Bianco and Yohai, (1996) to guarantee Fisher–consistency of the resulting procedure. It is worth mentioning that the function $\phi(y,t)$ can be written as

\phi(y,t)=y\rho\left(\,-\,\log\left[F(t)\right]\right)+G(F(t))+(1-y)\rho\left(\,-\,\log\left[1-F(t)\right]\right)+G(1-F(t))\,.

(3)

The weighted $M-$ estimators are the minimizers of $L_{n}^{\star}(a,\mathbf{b})=\sum_{i=1}^{n}\phi(y_{i},a+\mathbf{x}_{i}^{\mbox{\footnotesize\sc t}}\mathbf{b})w(\mathbf{x}_{i})/n$ , that is,

(\widehat{\alpha},\widehat{\mbox{\boldmath$\beta$}})=\mathop{\mbox{argmin}}_{a\in\mathbb{R},\mathbf{b}\in\mathbb{R}^{p}}L_{n}^{\star}(a,\mathbf{b})\,.

(4)

The weights $w(\mathbf{x}_{i})$ are usually based on a robust Mahalanobis distance of the explanatory variables, that is, they depend on the distance between $\mathbf{x}_{i}$ and a robust center of the data. With this notation, the minimum divergence estimators considered in Basu et al., (1998) correspond to the choice $\rho(t)=(1+1/c)(1-\exp(-\,c\,t))$ and $w\equiv 1$ .

The asymptotic properties of the $M-$ estimators with $w\equiv 1$ were obtained in Bianco and Yohai, (1996), while the situation of a general weight function was studied in Bianco and Martinez, (2009). It should be mentioned that these estimators are implemented in the package RobStatTM, through the functions logregBY, when the weights equal 1, and logregWBY when considering hard rejection weights derived from the MCD estimator of the continuous explanatory variables. In both cases, the loss function is taken as the one introduced in Croux and Haesbroeck, (2003).
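As an illustration of the loss (2), the sketch below evaluates $\phi(y,t)$ for the minimum divergence choice $\rho(t)=(1+1/c)(1-\exp(-\,c\,t))$ mentioned above, for which $\psi(t)=(1+c)\exp(-\,c\,t)$ and the correction term has the closed form $G(u)=u^{1+c}$. The value $c=0.5$ and all function names are assumptions made for this example. It also provides the expected loss under $y\sim Bi(1,p_{0})$, which can be used to check Fisher–consistency numerically.

```python
import math

C = 0.5  # tuning constant c > 0; an illustrative value, not taken from the paper

def F(t):
    """Logistic function, computed stably."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def softplus(x):
    """log(1 + exp(x)); -log F(t) = softplus(-t) and -log(1 - F(t)) = softplus(t)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def deviance(y, t):
    """d(y, t) = -y log F(t) - (1 - y) log(1 - F(t))."""
    return y * softplus(-t) + (1 - y) * softplus(t)

def rho(s, c=C):
    """rho(s) = (1 + 1/c)(1 - exp(-c s)): bounded and nondecreasing on s >= 0."""
    return (1.0 + 1.0 / c) * (1.0 - math.exp(-c * s))

def G(u, c=C):
    """G(u) = int_0^u psi(-log v) dv = u^(1 + c) for psi(s) = (1 + c) exp(-c s)."""
    return u ** (1.0 + c)

def phi(y, t, c=C):
    """Loss (2): rho(d(y, t)) + G(F(t)) + G(1 - F(t))."""
    p = F(t)
    return rho(deviance(y, t), c) + G(p, c) + G(1.0 - p, c)

def phi_form3(y, t, c=C):
    """Equivalent expression (3), valid for y in {0, 1}."""
    p = F(t)
    return (y * rho(softplus(-t), c) + G(p, c)
            + (1 - y) * rho(softplus(t), c) + G(1.0 - p, c))

def expected_phi(p0, t, c=C):
    """E phi(y, t) when y ~ Bi(1, p0)."""
    return p0 * phi(1, t, c) + (1.0 - p0) * phi(0, t, c)
```

For instance, with `t0 = 0.8` and `p0 = F(t0)`, the map `t -> expected_phi(p0, t)` has a numerically vanishing derivative at `t0`, in agreement with the role of the correction term.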

2.2 The case of functional covariates

In this section, we consider the situation where $(y_{i},X_{i})$ , $1\leq i\leq n$ are independent observations such that $y_{i}\in\{0,1\}$ and $X_{i}\in L^{2}({\mathcal{T}})$ with ${\mathcal{T}}$ a compact interval, that, without loss of generality, we assume to be ${\mathcal{T}}=[0,1]$ . The model relating the responses to the covariates is the functional logistic regression model, that is, we assume that (1) holds.

As mentioned in the Introduction, estimation under functional linear regression or functional logistic regression models is an ill–posed problem. To avoid this issue, one possibility is dimension reduction that can be achieved by considering as possible candidates for estimating $\beta_{0}$ the elements of a finite–dimensional space spanned by a fixed basis. This is the approach we follow in this paper, that is, to define robust estimators of the intercept $\alpha_{0}$ and the slope $\beta_{0}$ , we will use a sieve approach combined with weighted $M-$ estimators. We do not restrict our attention to a particular basis such as the $B-$ spline basis considered, for instance, in Boente et al., (2020) for the functional semi–linear model or Mutis et al., (2022) for the functional logistic regression one. Instead, we provide a general framework which allows the practitioner to choose the basis according to the smoothness knowledge or assumptions to be considered on $\beta_{0}$ .

Henceforth, let $k=k_{n}$ stand for the dimension of the finite dimensional space spanned by the basis $\{B_{j}:1\leq j\leq k_{n}\}$ . The space of possible candidates corresponds to $\mathbb{R}\times{\mathcal{M}}_{k}$ , where ${\mathcal{M}}_{k}=\left\{\sum_{j=1}^{k}b_{j}B_{j},\mathbf{b}\in\mathbb{R}^{k}\right\}$ .

From now on, for any $\mathbf{b}\in\mathbb{R}^{k}$ , $\beta_{\mathbf{b}}$ will stand for $\beta_{\mathbf{b}}=\sum_{j=1}^{k}b_{j}B_{j}$ . Then, for any possible candidate $\beta_{\mathbf{b}}\in{\mathcal{M}}_{k}$ , the inner product $\langle X_{i},\beta_{\mathbf{b}}\rangle$ equals $\sum_{j=1}^{k}b_{j}x_{ij}$ where $x_{ij}=\langle X_{i},B_{j}\rangle$ , which suggests using the robust estimators defined in Section 2.1 taking as covariates $\mathbf{x}_{i}=(x_{i1},\dots,x_{ik})^{\mbox{\footnotesize\sc t}}$ .

More precisely, the weighted estimators defined in (4) over the finite–dimensional approximating spaces $\mathbb{R}\times{\mathcal{M}}_{k}$ are the key tool for obtaining consistent estimators of $\beta_{0}$ . For any $\mathbf{b}\in\mathbb{R}^{k}$ define

L_{n}(\alpha,\beta_{\mathbf{b}})=\frac{1}{n}\sum_{i=1}^{n}\phi(y_{i},\alpha+\langle X_{i},\beta_{\mathbf{b}}\rangle)w(X_{i})=\frac{1}{n}\sum_{i=1}^{n}\phi(y_{i},\alpha+\mathbf{x}_{i}^{\mbox{\footnotesize\sc t}}\mathbf{b})w(X_{i})

and

(\widehat{\alpha},\widehat{{\mathbf{b}}})=\mathop{\mbox{argmin}}_{\alpha\in\mathbb{R},\mathbf{b}\in\mathbb{R}^{k}}L_{n}(\alpha,\beta_{\mathbf{b}})\,.

(5)

Hence, the estimator of $\beta_{0}$ is given by

\widehat{\beta}(t)=\beta_{\widehat{{\mathbf{b}}}}(t)=\sum_{j=1}^{k}\widehat{{b}}_{j}B_{j}(t)\,,

where $\widehat{{\mathbf{b}}}=\left(\widehat{{b}}_{1},\ldots,\widehat{{b}}_{k}\right)^{\mbox{\footnotesize\sc t}}$ , meaning that $(\widehat{\alpha},\widehat{\beta})=\mathop{\mbox{argmin}}_{\alpha\in\mathbb{R},\beta\in{\mathcal{M}}_{k}}L_{n}(\alpha,\beta)$ .

The weights $w(X_{i})$ in (5) may be computed as in (4) using a weight function of the robust Mahalanobis distance of the projected variables $\mathbf{x}_{i}$ , in which case, $L_{n}(\alpha,\beta_{\mathbf{b}})=L_{n}^{\star}(\alpha,\mathbf{b})$ . Another possibility is to compute the weights from the functional covariates, for instance, discarding observations which are declared as outliers by the functional boxplot or any other functional measure of atypicality. In the simulation study reported in Section 4, we explore both possible choices for the weights. As in the finite–dimensional setting, for the sake of simplicity, when deriving consistency results, we will assume that the weight function $w$ is not data dependent.
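The two steps behind (5), projecting each trajectory on the basis and then minimizing the weighted loss over $(\alpha,\mathbf{b})$, can be sketched as follows. This is only an illustration under stated assumptions: the loss uses the minimum divergence $\rho$ of Section 2.1 with $c=0.5$, the basis $\{1,\sin(2\pi t),\cos(2\pi t)\}$ and the simulated data are hypothetical, the weights are $w\equiv 1$, and the plain gradient descent with backtracking is a stand-in optimizer rather than the algorithm used in the numerical study.

```python
import math
import random

C = 0.5  # illustrative tuning constant for rho(s) = (1 + 1/c)(1 - exp(-c s))

def F(t):
    """Logistic function, computed stably."""
    return 1.0 / (1.0 + math.exp(-t)) if t >= 0 else math.exp(t) / (1.0 + math.exp(t))

def softplus(x):
    """log(1 + exp(x)); -log F(t) = softplus(-t) and -log(1 - F(t)) = softplus(t)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def phi(y, t):
    """Loss (2) for the rho above, for which G(u) = u^(1 + c)."""
    d = y * softplus(-t) + (1 - y) * softplus(t)
    p = F(t)
    return (1 + 1 / C) * (1 - math.exp(-C * d)) + p ** (1 + C) + (1 - p) ** (1 + C)

def dphi(y, t):
    """Derivative of phi(y, t) with respect to t."""
    p = F(t)
    nu = (1 + C) * (p ** C * (1 - p) + (1 - p) ** C * p)
    return (p - y) * nu

def project(x, basis_vals, grid):
    """Scores x_j = <X, B_j>, approximated by the trapezoidal rule."""
    scores = []
    for b in basis_vals:
        s = 0.0
        for k in range(len(grid) - 1):
            s += 0.5 * (grid[k + 1] - grid[k]) * (x[k] * b[k] + x[k + 1] * b[k + 1])
        scores.append(s)
    return scores

def predictor(theta, x):
    """alpha + x^T b for theta = (alpha, b_1, ..., b_k)."""
    return theta[0] + sum(bj * xj for bj, xj in zip(theta[1:], x))

def loss(theta, ys, scores, weights):
    """Weighted objective L_n(alpha, beta_b)."""
    return sum(w * phi(y, predictor(theta, x))
               for y, x, w in zip(ys, scores, weights)) / len(ys)

def gradient(theta, ys, scores, weights):
    g = [0.0] * len(theta)
    for y, x, w in zip(ys, scores, weights):
        d = w * dphi(y, predictor(theta, x)) / len(ys)
        g[0] += d
        for j, xj in enumerate(x):
            g[j + 1] += d * xj
    return g

def fit(ys, scores, weights, n_iter=500):
    """Gradient descent with backtracking, started at zero (stand-in optimizer)."""
    theta = [0.0] * (len(scores[0]) + 1)
    current = loss(theta, ys, scores, weights)
    for _ in range(n_iter):
        g = gradient(theta, ys, scores, weights)
        step = 1.0
        while step > 1e-10:
            trial = [th - step * gj for th, gj in zip(theta, g)]
            value = loss(trial, ys, scores, weights)
            if value < current:
                theta, current = trial, value
                break
            step *= 0.5
        else:
            break  # no step decreased the objective: stop
    return theta, current

# Hypothetical data: beta_0(t) = 2 sin(2 pi t), alpha_0 = 0.5.
rng = random.Random(1)
grid = [k / 50 for k in range(51)]
basis = [lambda t: 1.0,
         lambda t: math.sin(2 * math.pi * t),
         lambda t: math.cos(2 * math.pi * t)]
basis_vals = [[b(t) for t in grid] for b in basis]
ys, scores = [], []
for _ in range(150):
    z = [rng.gauss(0.0, 1.0) for _ in range(3)]
    x = [z[0] + z[1] * math.sin(2 * math.pi * t) + z[2] * math.cos(2 * math.pi * t)
         for t in grid]
    s = project(x, basis_vals, grid)
    ys.append(1 if rng.random() < F(0.5 + 2.0 * s[1]) else 0)
    scores.append(s)
weights = [1.0] * len(ys)  # w = 1; robust weights would downweight outlying X_i
theta_hat, value = fit(ys, scores, weights)
```

The first entry of `theta_hat` plays the role of $\widehat{\alpha}$ and the remaining ones are the coefficients $\widehat{b}_{j}$ of $\widehat{\beta}$ in the chosen basis. Since the objective is not convex, the output should be read as a local solution that depends on the starting point.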

2.3 On the basis choice

The basis choice depends on the knowledge or assumptions to be made on the slope parameter. Some well-known bases are $B-$ splines, Bernstein and Legendre polynomials, and Fourier or wavelet bases. They vary in the way they approximate a function as discussed, for instance, in Boente and Martinez, (2023) and Kalogridis and Van Aelst, (2023).

When considering $B$ -spline approximations, consistency results will require the slope parameter to be $r$ -times continuously differentiable, that is, $\beta_{0}\in C^{r}([0,1])$ , where $r\leq\ell-2$ and $\ell$ is the spline order. In particular, when cubic splines are considered, the results in Section 3 hold for twice continuously differentiable regression functions. Recall that a spline of order $\ell$ is a polynomial of degree $\ell-1$ within each subinterval defined by the knots. As stated in Corollary 6.21 in Schumaker, (1981), if $\beta_{0}\in C^{r}([0,1])$ with $r-$ th derivative Lipschitz and $r\leq\ell-2$ , under proper assumptions on the knots, there exists a spline of order $\ell$ , say $\widetilde{\beta}=\sum_{j=1}^{k}b_{j}B_{j}$ , such that $\|\widetilde{\beta}-\beta_{0}\|_{\infty}=O(k^{-{(r+1)}})$ . It is worth mentioning that the approximation order has an impact on the rates of convergence derived in Theorems 3.1 and 3.3 through assumption A10.

Bernstein polynomials are a possible alternative to $B-$ splines. They are defined as

B_{j}(t)=\binom{k}{j}t^{j}(1-t)^{k-j}\quad\mbox{for }j=0,\dots,k\,.

The Weierstrass Theorem ensures that if $\beta_{0}\in C([0,1])$ , there exists $\widetilde{\beta}=\sum_{j=0}^{k}b_{j}B_{j}$ , where $b_{j}=\beta_{0}(j/k)$ , such that $\|\widetilde{\beta}-\beta_{0}\|_{\infty}\to 0$ . Furthermore, Theorem 3.2 in Powell, (1981) guarantees that when considering Bernstein polynomials of order $k$ we also get that $\|\widetilde{\beta}-\beta_{0}\|_{\infty}=O(k^{-r})$ , whenever $\beta_{0}\in C^{r}([0,1])$ .
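The Bernstein approximation with coefficients $b_{j}=\beta_{0}(j/k)$ is easy to evaluate directly. The sketch below, in which the slope `beta0` and the evaluation grid are hypothetical choices, computes the supremum error on a grid for a few values of $k$ to illustrate the uniform convergence stated by the Weierstrass Theorem.

```python
import math

def bernstein_basis(j, k, t):
    """B_j(t) = C(k, j) t^j (1 - t)^(k - j), for j = 0, ..., k."""
    return math.comb(k, j) * t ** j * (1.0 - t) ** (k - j)

def bernstein_approx(f, k, t):
    """Bernstein approximation sum_{j=0}^{k} f(j / k) B_j(t)."""
    return sum(f(j / k) * bernstein_basis(j, k, t) for j in range(k + 1))

def sup_error(f, k, n_grid=200):
    """Maximum absolute error over an equispaced grid of [0, 1]."""
    return max(abs(bernstein_approx(f, k, i / n_grid) - f(i / n_grid))
               for i in range(n_grid + 1))

beta0 = lambda t: math.sin(2 * math.pi * t)  # hypothetical smooth slope
errors = {k: sup_error(beta0, k) for k in (5, 10, 20, 40)}
```

The entries of `errors` decrease as $k$ grows, and for each $k$ the basis functions add up to one at every point of $[0,1]$.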

Legendre polynomials define an orthogonal basis in $L^{2}([0,1])$ . As mentioned in Boente and Martinez, (2023), the convergence rates derived in Theorem 2.5 from Wang and Xiang, (2012) allow one to show that, if $\beta_{0}\in C^{r}([0,1])$ and $\widetilde{\beta}$ stands for the truncated Legendre series expansion of $\beta_{0}$ of order $k$ , then $\|\widetilde{\beta}-\beta_{0}\|_{\infty}=O\left(k^{-\,r+1/2}\right)$ . Note that in this case, the approximation rate is lower than for the other two bases mentioned above and this will affect the rates provided in Theorem 3.3. More generally, when polynomial bases of order $k$ are considered, Jackson’s Theorem (see Theorem 3.12 in Schumaker, 1981) ensures that if $\beta_{0}\in W^{r,2}([0,1])$ , the $L^{2}-$ Sobolev space of order $r$ as defined below, there exists a polynomial $\widetilde{\beta}$ of order $k\geq r$ such that $\|\widetilde{\beta}-\beta_{0}\|=O(k^{-r})$ and this improved $L^{2}$ approximation order is enough to guarantee a better rate of convergence for the predictions, when polynomial bases are considered.

Finally, the Fourier basis is the natural basis in $L^{2}([0,1])$ and it is usually considered when approximating periodic functions. Clearly, the finite expansion $\widetilde{\beta}(t)=b_{0}+\sum_{j=1}^{k}\left\{b_{j,1}\sin(2\pi\,j\,t)+b_{j,2}\cos(2\pi\,j\,t)\right\}$ , where $b_{0}=\int_{0}^{1}\beta_{0}(t)\,dt$ , $b_{j,1}=2\int_{0}^{1}\beta_{0}(t)\sin(2\pi\,j\,t)\,dt$ and $b_{j,2}=2\int_{0}^{1}\beta_{0}(t)\cos(2\pi\,j\,t)\,dt$ , converges to $\beta_{0}$ in $L^{2}([0,1])$ as $k\to\infty$ . When $\beta_{0}\in W^{r,2}([0,1])$ , Corollary 2.4 in Chapter 7 from DeVore and Lorentz, (1993) ensures that $\|\widetilde{\beta}-\beta_{0}\|=O(k^{-r})$ .
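The truncated Fourier expansion can be checked numerically. The sketch below is an illustration only: it works with the orthonormal version of the basis, $1$, $\sqrt{2}\sin(2\pi jt)$ and $\sqrt{2}\cos(2\pi jt)$ (a scaling chosen here so that every coefficient is a plain inner product), takes a hypothetical periodic slope and approximates all integrals by the trapezoidal rule.

```python
import math

def integrate(f, m=2000):
    """Composite trapezoidal rule on [0, 1] with m subintervals."""
    h = 1.0 / m
    return h * (0.5 * f(0.0) + sum(f(i * h) for i in range(1, m)) + 0.5 * f(1.0))

def fourier_projection(f, k):
    """L2 projection of f on span{1, sqrt(2) sin(2 pi j t), sqrt(2) cos(2 pi j t): j <= k}."""
    b0 = integrate(f)
    terms = []
    for j in range(1, k + 1):
        s = lambda t, j=j: math.sqrt(2.0) * math.sin(2 * math.pi * j * t)
        c = lambda t, j=j: math.sqrt(2.0) * math.cos(2 * math.pi * j * t)
        b_sin = integrate(lambda t: f(t) * s(t))  # <f, sqrt(2) sin(2 pi j .)>
        b_cos = integrate(lambda t: f(t) * c(t))  # <f, sqrt(2) cos(2 pi j .)>
        terms.append((b_sin, s, b_cos, c))
    return lambda t: b0 + sum(bs * s(t) + bc * c(t) for bs, s, bc, c in terms)

def l2_error(f, g):
    """L2([0, 1]) norm of f - g."""
    return math.sqrt(integrate(lambda t: (f(t) - g(t)) ** 2))

beta0 = lambda t: math.exp(math.sin(2 * math.pi * t))  # hypothetical periodic slope
errs = {k: l2_error(beta0, fourier_projection(beta0, k)) for k in (1, 2, 4)}
```

For this smooth periodic example the $L^{2}$ errors stored in `errs` decrease quickly with $k$.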

Wavelet bases may be useful when seeking sparse slope function estimates. Zhao et al., (2012) provide conditions ensuring $L^{2}$ approximations of order $O(k^{-r})$ , when $\beta_{0}\in W^{r,2}([0,1])$ and one–dimensional wavelets are used, see also Mallat, (2009).

2.4 Selecting the size of the basis

The number of elements of the basis plays the role of regularization parameter in our estimation procedure. The importance of considering a robust criterion to select the regularization parameter has been discussed by several authors who report how standard model selection methods can be highly affected by a small proportion of outliers. The sensitivity to atypical data of classical basis selectors may be inherited by the final regression estimators even when a robust procedure is considered.

To deal with these problems, when the covariates belong to $\mathbb{R}^{p}$ , Ronchetti, (1985) and Tharmaratnam and Claeskens, (2013) provide some robust approaches when considering linear regression models. Besides, under a sparse logistic regression model, Bianco et al., (2022) report in their supplement a numerical study that reveals the importance of considering a robust criterion in order to achieve reliable predictions. Finally, for functional covariates and under a semi–linear and a functional linear model, Boente et al., (2020) and Kalogridis and Van Aelst, (2019, 2023), respectively, discuss robust criteria for selecting the regularization parameters.

In our framework, the basis dimension $k=k_{n}$ may be determined by a model selection criterion such as a robust version of the Akaike criterion used in Lu, (2015) or the robust Schwarz, (1978) criterion considered in He and Shi, (1996) and He et al., (2002) for semi–parametric regression models. Suppose that $(\widehat{\alpha}^{(k)},\widehat{{\mathbf{b}}}^{(k)})$ is the solution of (5) when we use a $k-$ dimensional linear space and denote as $\widehat{\beta}^{(k)}=\beta_{\widehat{{\mathbf{b}}}^{(k)}}$ . We define a robust $BIC$ criterion, whose large values indicate a poor fit, as

RBIC(k)=L_{n}(\widehat{\alpha}^{(k)},\widehat{\beta}^{(k)})+k\,\frac{\log n}{n}\,.

(6)

For instance, when considering $B-$ spline procedures, in order to obtain an optimal rate of convergence, we let the number of knots increase slowly with the sample size. Theorem 3.3 below shows that when $\beta_{0}$ is twice continuously differentiable and is approximated with cubic splines ( $\ell=4$ ), the size $k_{n}$ of the bases can be taken of order $n^{1/5}$ . Hence, a possible way to select $k_{n}$ is to search for the first local minimum of $RBIC(k)$ in the range $\max(n^{1/5}/2,4)\leq k\leq 8+2\,n^{1/5}$ . Note that for cubic splines the smallest possible dimension of the basis is 4.
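The search just described can be sketched as follows. The function names, the treatment of the endpoints of the range as admissible local minima and the fitted loss values in the usage lines are all hypothetical; in practice the loss values would be obtained by solving (5) for each candidate $k$.

```python
import math

def candidate_dimensions(n):
    """Integers k with max(n^(1/5) / 2, 4) <= k <= 8 + 2 n^(1/5)."""
    root = n ** 0.2
    lower = math.ceil(max(root / 2.0, 4.0))
    upper = math.floor(8.0 + 2.0 * root)
    return list(range(lower, upper + 1))

def rbic(loss_value, k, n):
    """RBIC(k) = L_n(alpha_hat^(k), beta_hat^(k)) + k log(n) / n, as in (6)."""
    return loss_value + k * math.log(n) / n

def first_local_minimum(ks, values):
    """First k whose criterion is not larger than that of its neighbours."""
    for i, k in enumerate(ks):
        left_ok = i == 0 or values[i] <= values[i - 1]
        right_ok = i == len(ks) - 1 or values[i] <= values[i + 1]
        if left_ok and right_ok:
            return k
    return ks[-1]

n = 300
ks = candidate_dimensions(n)
# Hypothetical minimized losses, decreasing in k as the sieve grows.
fitted_loss = {k: 0.6 + 3.0 / k ** 2 for k in ks}
values = [rbic(fitted_loss[k], k, n) for k in ks]
k_selected = first_local_minimum(ks, values)
```

With these made-up loss values the candidate range for $n=300$ is $4\leq k\leq 14$ and the criterion first stops decreasing at $k=7$.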

3 Consistency results

To provide a unified approach in which the basis gives approximations either in $L^{2}([0,1])$ or in $C([0,1])$ , equipped with their respective norms $\|\cdot\|$ and $\|\cdot\|_{\infty}$ , we will denote by ${\mathcal{H}}$ the space $L^{2}([0,1])$ or $C([0,1])$ and by $\|\cdot\|_{{\mathcal{H}}}$ the corresponding norm. Hence, we have that $\|f\|\leq\|f\|_{{\mathcal{H}}}$ . Furthermore, ${\mathcal{W}}^{r,{\mathcal{H}}}$ will stand for the Hölder space

{\mathcal{W}}^{r,\infty}([0,1])\,=\,\biggl{\{}f\in C^{r}\left([0,1]\right):\big{\|}f^{(j)}\big{\|}_{\infty}<\infty,\;0\leq j\leq r,\mbox{ and }\sup_{z_{1}\neq z_{2}}\frac{\big{|}f^{(r)}(z_{1})-f^{(r)}(z_{2})\big{|}}{|z_{1}-z_{2}|}<\infty\biggr{\}}\,,

when ${\mathcal{H}}=C([0,1])$ , while when ${\mathcal{H}}=L^{2}([0,1])$ , we label ${\mathcal{W}}^{r,{\mathcal{H}}}$ the Sobolev space

{\mathcal{W}}^{r,2}([0,1])=\biggl{\{}f\in L^{2}([0,1]):\mbox{the weak derivatives of $f$ up to order $r$ exist and }\|f^{(j)}\|^{2}=\int_{0}^{1}\left\{f^{(j)}(t)\right\}^{2}\,dt<\infty\mbox{ for }j=1,\dots,r\biggr{\}}\,.

We denote $\|\cdot\|_{{\mathcal{W}}^{r,{\mathcal{H}}}}$ the corresponding norm, that is,

\|f\|_{{\mathcal{W}}^{r,{\mathcal{H}}}}=\max_{1\leq j\leq r}\big{\|}f^{(j)}\big{\|}_{\infty}+\sup_{x\neq y,x,y\in(0,1)}\frac{\big{|}f^{(r)}(x)-f^{(r)}(y)\big{|}}{|x-y|}\,

in the former case and $\|f\|_{{\mathcal{W}}^{r,{\mathcal{H}}}}^{2}=\|f\|^{2}+\sum_{j=1}^{r}\|f^{(j)}\|^{2}$ in the latter one.

To derive consistency results, we will need the following assumptions.

A1

$\rho:\mathbb{R}_{\geq 0}\to\mathbb{R}$ is a bounded, continuously differentiable function with bounded derivative $\psi$ and $\rho(0)=0$ .
A2

$\psi(t)\geq 0$ and there exists some $a\geq\log 2$ such that $\psi(t)>0$ for all $0<t<a$ .
A3

$\psi(t)\geq 0$ and there exist values $a\geq\log 2$ and $A_{0}>0$ such that $\psi(t)>A_{0}$ for every $0<t<a$ .
A4

$\psi$ is a continuously differentiable function with bounded derivative $\psi^{\prime}$ .
A5

$w$ is a non–negative bounded function with support ${\mathcal{C}}_{w}$ such that $\mathbb{P}(X\in{\mathcal{C}}_{w})>0$ . Without loss of generality, we assume that $\|w\|_{\infty}=1$ .
A6
(a)

$\mathbb{E}w(X)\,\|X\|<\infty$ .
(b)

$\mathbb{E}w(X)\,\|X\|^{2}<\infty$ .
A7

The basis functions are such that $B_{j}\in{\mathcal{W}}^{1,{\mathcal{H}}}$ .
A8

The basis dimension $k_{n}$ is such that $k_{n}\to\infty$ , $k_{n}/n\to 0$ .
A9

There exists an element $\widetilde{\beta}_{k}\in{\mathcal{M}}_{k}$ , $\widetilde{\beta}_{k}=\sum_{j=1}^{k}\widetilde{b}_{j}B_{j}(x)$ such that $\|\widetilde{\beta}_{k}-\beta_{0}\|_{{\mathcal{H}}}\to 0$ as $k\to\infty$ .
A10

There exists an element $\widetilde{\beta}_{k}\in{\mathcal{M}}_{k}$ , $\widetilde{\beta}_{k}=\sum_{j=1}^{k}\widetilde{b}_{j}B_{j}(x)$ such that $\|\widetilde{\beta}_{k}-\beta_{0}\|_{{\mathcal{H}}}=O(k^{-r})$ , for some $r>0$ . Furthermore, the basis dimension $k_{n}$ is of order $O(n^{\varsigma})$ where $\varsigma<1/r$ .

A11

The following condition holds:

\mathbb{P}(\langle X,\beta\rangle+\alpha=0)=0\,,\mbox{ for any $\beta\in{\mathcal{H}}^{\star}$, $\alpha\in\mathbb{R}$, such that $(\beta,\alpha)\neq 0$}\,,

(7)

where ${\mathcal{H}}^{\star}={{\mathcal{H}}}$ or ${\mathcal{H}}^{\star}={\mathcal{W}}^{1,{\mathcal{H}}}$ , depending on whether $\beta_{0}$ belongs to ${{\mathcal{H}}}$ or ${\mathcal{W}}^{1,{\mathcal{H}}}$ , respectively.

Our first result states that, under mild assumptions, the estimators obtained minimizing $L_{n}(\alpha,\beta)$ over $(\alpha,\beta)\in\mathbb{R}\times{\mathcal{M}}_{k}$ produce consistent estimators of the conditional success probability with respect to the weighted mean square error of the differences between predicted probabilities defined as

\pi_{\mathbb{P}}^{2}(\theta_{1},\theta_{2})=\mathbb{E}\left(w(X)\left[F(\alpha_{1}+\langle X,\beta_{1}\rangle)-F(\alpha_{2}+\langle X,\beta_{2}\rangle)\right]^{2}\right)\,,

where for $j=1,2$ , $\theta_{j}=(\alpha_{j},\beta_{j})\in\Theta=\mathbb{R}\times{\mathcal{H}}$ . Henceforth, to simplify the notation, we denote $\widehat{\theta}=(\widehat{\alpha},\widehat{\beta})$ and $\theta_{0}=(\alpha_{0},\beta_{0})$ .

Theorem 3.1.

Let $\rho$ be a function satisfying A1 and A3 and $w$ a weight function satisfying A5.

(a)

Under A8 and A9, $\pi_{\mathbb{P}}^{2}(\widehat{\theta},\theta_{0})\buildrel a.s.\over{\longrightarrow}0$ .
(b)

If $w$ satisfies A6a, we have that $\pi_{\mathbb{P}}^{2}(\widehat{\theta},\theta_{0})=O_{\mathbb{P}}(\sqrt{k_{n}/n}+\|\widetilde{\beta}_{k}-\beta_{0}\|_{{\mathcal{H}}})$ . Moreover, if A10 also holds, $\pi_{\mathbb{P}}(\widehat{\theta},\theta_{0})=O_{\mathbb{P}}(n^{-{\lambda}})$ , where $\lambda=\min(\varsigma\,r/2,(1-\varsigma)/4)$ .
(c)

If $w$ satisfies A6b and $\psi$ satisfies A4, we have that $\pi_{\mathbb{P}}^{2}(\widehat{\theta},\theta_{0})=O_{\mathbb{P}}(\sqrt{k_{n}/n}+\|\widetilde{\beta}_{k}-\beta_{0}\|_{{\mathcal{H}}}^{2})$ . Moreover, if A10 also holds, $\pi_{\mathbb{P}}(\widehat{\theta},\theta_{0})=O_{\mathbb{P}}(n^{-{\omega}}\;)$ , where $\omega=\min(\,\varsigma\,r,(1-\varsigma)/4)$ .
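To see where the exponents $\lambda$ and $\omega$ come from, the following worked computation may help. It assumes that $k_{n}$ is of exact order $n^{\varsigma}$ (A10 only states an upper bound, so this is an extra assumption made for the illustration) and simply balances the two terms in the bounds of parts (b) and (c).

```latex
% Part (b): with k_n of exact order n^{\varsigma},
\sqrt{k_{n}/n}\asymp n^{-(1-\varsigma)/2}\,,\qquad
\|\widetilde{\beta}_{k}-\beta_{0}\|_{\mathcal{H}}=O\left(n^{-\varsigma r}\right)
\;\Longrightarrow\;
\pi_{\mathbb{P}}^{2}(\widehat{\theta},\theta_{0})
=O_{\mathbb{P}}\left(n^{-\min\{(1-\varsigma)/2,\;\varsigma r\}}\right)\,,
\quad\mbox{so}\quad
\pi_{\mathbb{P}}(\widehat{\theta},\theta_{0})
=O_{\mathbb{P}}\left(n^{-\min\{\varsigma r/2,\;(1-\varsigma)/4\}}\right)
=O_{\mathbb{P}}\left(n^{-\lambda}\right)\,.

% Part (c): the approximation error enters squared,
\|\widetilde{\beta}_{k}-\beta_{0}\|_{\mathcal{H}}^{2}=O\left(n^{-2\varsigma r}\right)
\;\Longrightarrow\;
\pi_{\mathbb{P}}(\widehat{\theta},\theta_{0})
=O_{\mathbb{P}}\left(n^{-\min\{\varsigma r,\;(1-\varsigma)/4\}}\right)
=O_{\mathbb{P}}\left(n^{-\omega}\right)\,.

% Illustrative values r = 2 and \varsigma = 1/5 (so that \varsigma < 1/r):
\lambda=\min\left(\tfrac{1}{5},\tfrac{1}{5}\right)=\tfrac{1}{5}\,,\qquad
\omega=\min\left(\tfrac{2}{5},\tfrac{1}{5}\right)=\tfrac{1}{5}\,.
```

In this example the two terms are balanced in part (b), while in part (c) the stochastic term $\sqrt{k_{n}/n}$ is the one that determines the rate.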

Remark 3.1.

Denote $\widehat{p}(X)=F(\widehat{\alpha}+\langle X,\widehat{\beta}\rangle)$ and $p_{0}(X)=F(\alpha_{0}+\langle X,\beta_{0}\rangle)$ . When $w(X)\equiv 1$ , Theorem 3.1(a) implies that for any $\epsilon>0$ ,

\mathbb{P}\left(\left|\widehat{p}(X)-p_{0}(X)\right|>\epsilon\right)=\mathbb{E}\left(\mathbb{P}\left(\left|\widehat{p}(X)-p_{0}(X)\right|>\epsilon\Big{|}_{(y_{1},X_{1}),\dots,(y_{n},X_{n})}\right)\right)\leq\mathbb{E}\left(\frac{\pi_{\mathbb{P}}^{2}(\widehat{\theta},\theta_{0})}{\epsilon^{2}}\right)\,,

where the right hand side of the inequality converges to 0 as $n\to\infty$. Thus, $\widehat{p}(X)\buildrel p\over{\longrightarrow}p_{0}(X)$, which makes it possible to consistently classify a new observation. Moreover, using that $F^{-1}$ is continuous, we also conclude that $\widehat{\alpha}+\langle X,\widehat{\beta}\rangle$ converges in probability to $\alpha_{0}+\langle X,\beta_{0}\rangle$. However, the infinite–dimensional structure of the covariates does not allow us to derive the consistency of $\widehat{\beta}$ from this result; it is instead obtained in Theorem 3.2. The rates obtained in Theorem 3.1(b) are preliminary and will be improved in Theorem 3.3.

Theorem 3.2 establishes strong consistency of the intercept and slope parameter, which clearly implies that of the predicted probability, that is, $F(\widehat{\alpha}+\langle X,\widehat{\beta}\rangle)\buildrel a.s.\over{\longrightarrow}F(\alpha_{0}+\langle X,\beta_{0}\rangle)$ . This result provides an improvement over the one obtained in Theorem 3.1, but requires additional assumptions on the covariates, namely, assumption A11 which is discussed in Remark 3.2.

Theorem 3.2.

Let $\rho$ be a function satisfying A1 and A2, and $w$ a weight function satisfying A5. Assume that A7 to A9 hold.

(a)

If in addition A11 holds, we have that $|\widehat{\alpha}-\alpha_{0}|+\|\widehat{\beta}-\beta_{0}\|_{{\mathcal{H}}}\buildrel a.s.\over{\longrightarrow}0$ .
(b)

Assume that ${\mathcal{H}}=L^{2}([0,1])$ , that the basis functions are such that $B_{j}\in{\mathcal{W}}^{1,2}$ , $1\leq j\leq k_{n}$ , providing approximations in A9 in $L^{2}([0,1])$ and that A11 holds with ${\mathcal{H}}^{\star}={\mathcal{W}}^{1,2}$ , i.e., $\beta_{0}\in{\mathcal{W}}^{1,2}$ .
If, in addition, $\beta_{0}\in C([0,1])$ and the basis elements are also continuous, i.e., $B_{j}\in C([0,1])$ , $1\leq j\leq k_{n}$ , then $|\widehat{\alpha}-\alpha_{0}|+\|\widehat{\beta}-\beta_{0}\|_{\infty}\buildrel a.s.\over{\longrightarrow}0$ .

Theorem 3.2(b) is useful for situations where the slope parameter is continuous but we use a smooth basis that provides an approximation in $L^{2}([0,1])$, such as the Fourier one.

Remark 3.2 (Comments on assumptions).

Assumptions A1, A2, A5 and A11 are needed to ensure Fisher–consistency of the proposal, see Lemma A.1 in the Appendix. In the finite–dimensional case, A1 and A2 were also required in Bianco and Yohai, (1996) who considered $M-$ estimators, while assumption A5 corresponds to assumption A2 in Bianco and Martinez, (2009) who studied the asymptotic behaviour of weighted $M-$ estimators. Assumption A11 is the infinite–dimensional counterpart of assumptions C1 in Bianco and Yohai, (1996) and A3 in Bianco and Martinez, (2009). For the functional logistic regression model considered this assumption is stronger than the one required for functional linear regression models in Boente et al., (2020) and Kalogridis and Van Aelst, (2023) which states that $\mathbb{P}(\langle X,\beta\rangle+\alpha=0)<c<1$ . However, it is weaker than assumption C2 in Kalogridis and Van Aelst, (2019) who defined robust estimators based on principal components under a functional linear model and assumed that the process $X$ has a finite–dimensional Karhunen-Loève expansion with scores having a joint density function. It is worth mentioning that condition A11 is related to the fact that the slope parameter is not identifiable if the kernel of the covariance operator of $X$ does not reduce to $\{0\}$ . Instead of requiring the condition over all possible elements $\beta\in L^{2}([0,1])$ , depending on the smoothness of $\beta_{0}$ , the set of values $\beta\neq 0$ over which the probability $\mathbb{P}(\langle X,\beta\rangle+\alpha=0)$ equals 0 may be reduced.

Furthermore, assumptions A1 and A2 hold for the loss function introduced in Croux and Haesbroeck, (2003) and for $\rho(t)=(1+1/c)(1-\exp(-\,c\,t))$, which is related to the minimum divergence estimators defined in Basu et al., (1998); in both cases, $a$ can be taken as $+\infty$. Note that if $\psi(0)\neq 0$ and assumptions A1 and A2 hold for some constant $a>\log(2)$, then condition A3 is fulfilled. This situation arises, for example, for the two loss functions mentioned above. It is worth mentioning that A3 is a key point to derive that $L(\theta)-L(\theta_{0})\geq C_{0}\,\pi_{\mathbb{P}}^{2}(\theta,\theta_{0})$, for any $\theta$ and some constant $C_{0}>0$, where $L(\theta)=L(\alpha,\beta)=\mathbb{E}\left(\phi\left(y,\alpha+\langle X,\beta\rangle\right)w(X)\right)$. This inequality allows one to derive convergence rates for the weighted mean square error of the prediction differences from those obtained for the empirical process $\sup_{\theta\in\mathbb{R}\times{\mathcal{M}}_{k}}|L_{n}(\theta)-L(\theta)|$.

Assumption A8 gives a rate at which the dimension of the finite–dimensional space ${\mathcal{M}}_{k}$ should increase. It is a standard condition when a sieve approach is considered. Furthermore, assumption A10 imposes a stronger rate on the basis dimension in order to obtain rates of convergence.

Assumption A9 states that the true slope may be approximated by an element of ${\mathcal{M}}_{k}$ . Conditions under which this assumption holds for some basis choices were discussed in Section 2.3, where conditions ensuring a given rate for this approximation were also given. The approximation rate required in A10 plays a role when deriving rates of convergence for the predicted probabilities. Note also that under A7, the approximating element $\widetilde{\beta}_{k}\in{\mathcal{M}}_{k}$ , given in A9 and A10, also belongs to ${\mathcal{W}}^{1,{\mathcal{H}}}$ . A first attempt to obtain these rates is given in Theorem 3.1, but better ones will be obtained in Theorem 3.3 below.

3.1 Rates of Consistency

To derive rates of convergence for the estimators, we define the pseudo-distance given by

\widetilde{\pi}_{\mathbb{P}}^{2}(\theta_{1},\theta_{2})=\mathbb{E}\left(w(X)\left[\alpha_{1}-\alpha_{2}+\langle X,(\beta_{1}-\beta_{2})\rangle\right]^{2}\right)\,,

where for $j=1,2$ , $\theta_{j}=(\alpha_{j},\beta_{j})\in\Theta=\mathbb{R}\times{\mathcal{H}}$ . The following additional assumption will be required

A12: There exist $\epsilon_{0}>0$ and a positive constant $C_{0}^{\star}$, such that for any $\theta=(\alpha,\beta)\in\mathbb{R}\times{\mathcal{H}}$ with $|\alpha-\alpha_{0}|+\|\beta-\beta_{0}\|<\epsilon_{0}$ we have $L(\theta)-L(\theta_{0})\geq C_{0}^{\star}\,\widetilde{\pi}_{\mathbb{P}}^{2}(\theta,\theta_{0})$.

Note that since $F^{\prime}(t)=F(t)\left(1-F(t)\right)$ is bounded by 1, $\pi_{\mathbb{P}}^{2}(\theta_{1},\theta_{2})\leq\widetilde{\pi}_{\mathbb{P}}^{2}(\theta_{1},\theta_{2})$, so the weighted mean square error of the differences between predicted probabilities inherits the rates of convergence obtained in Theorem 3.3 for the distance $\widetilde{\pi}_{\mathbb{P}}$.
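The inequality follows from the mean value theorem. The short derivation below is only a sketch; it also indicates that the constant could be sharpened, since $F(t)(1-F(t))\leq 1/4$ for every $t$. Here $u_{j}$ and $\xi$ are auxiliary symbols introduced for this illustration.

```latex
% u_j = \alpha_j + \langle X, \beta_j \rangle, and \xi lies between u_1 and u_2
\left|F(u_{1})-F(u_{2})\right| = F^{\prime}(\xi)\,\left|u_{1}-u_{2}\right| \leq \tfrac{1}{4}\,\left|u_{1}-u_{2}\right| ,
\qquad\mbox{so that}\qquad
\pi_{\mathbb{P}}^{2}(\theta_{1},\theta_{2}) \leq \tfrac{1}{16}\,\widetilde{\pi}_{\mathbb{P}}^{2}(\theta_{1},\theta_{2}) \leq \widetilde{\pi}_{\mathbb{P}}^{2}(\theta_{1},\theta_{2})\,.
```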

Theorem 3.3.

Let $\rho$ be a function satisfying A1, A3 and A4, and $w$ a weight function satisfying A5 and A6b. Assume that A7 and A10 to A12 hold. Then, $\gamma_{n}\widetilde{\pi}_{\mathbb{P}}(\widehat{\theta},\theta_{0})=O_{\mathbb{P}}(1)$, whenever $\gamma_{n}=O(n^{r\,\varsigma})$ and $\gamma_{n}\sqrt{\log(\gamma_{n})}=O(n^{(1-\varsigma)/2})$.

The lower bound given in assumption A12 is a requirement that is fulfilled when the covariates are bounded, as shown in Proposition 3.4 below. Moreover, one consequence of Proposition 3.4 is that for bounded functional covariates, the pseudo-distances $\widetilde{\pi}_{\mathbb{P}}$ and $\pi_{\mathbb{P}}$ are equivalent.

Proposition 3.4.

Assume that assumptions A1 and A3 hold and that for some positive constant $C>0$, $\mathbb{P}(\|X\|\leq C)=1$. Then, there exists a constant $C_{1}>0$ such that $\pi_{\mathbb{P}}^{2}(\theta,\theta_{0})\geq C_{1}\,\widetilde{\pi}_{\mathbb{P}}^{2}(\theta,\theta_{0})$, for any $\theta=(\alpha,\beta)\in\mathbb{R}\times{\mathcal{H}}$ with $|\alpha-\alpha_{0}|+\|\beta-\beta_{0}\|<1$. Moreover, A12 holds.

As a consequence of Proposition 3.4 and Theorem 3.3, we get the following result that improves the rates given in Theorem 3.1.

Corollary 3.5.

Let $\rho$ be a function satisfying A1, A3 and A4, and $w$ a weight function satisfying A5. Assume that A7, A10 and A11 hold and that for some positive constant $C>0$ , $\mathbb{P}(\|X\|\leq C)=1$ . Then, $\gamma_{n}\pi_{\mathbb{P}}(\widehat{\theta},\theta_{0})=O_{\mathbb{P}}(1)$ , whenever $\gamma_{n}=O(n^{r\,\varsigma})$ and $\gamma_{n}\sqrt{\log(\gamma_{n})}=O(n^{(1-\varsigma)/2})$ . In particular,

\frac{n^{\eta}}{\sqrt{\log(n)}}\pi_{\mathbb{P}}(\widehat{\theta},\theta_{0})=O_{\mathbb{P}}(1)\,,

where $\eta=\min(r\,\varsigma,(1-\varsigma)/2)$ .

Remark 3.3.

As mentioned in Boente et al., (2020), who obtained rates under a semi-linear functional regression model, if $\varsigma=1/(1+2r)$ in A10, one can choose $\gamma_{n}=O(n^{\frac{r}{1+2r}-\delta})$ , for some $\delta>0$ arbitrarily small, which yields a convergence rate arbitrarily close to the optimal one. Then, as mentioned above, when considering cubic splines, if $\beta_{0}$ is twice continuously differentiable one has that $r=2$ in A10. Hence, taking $\varsigma=1/5$ , i.e., if the basis dimension $k_{n}$ has order $n^{1/5}$ , we ensure that the convergence rate for $\widetilde{\pi}_{\mathbb{P}}(\widehat{\theta},\theta_{0})$ and $\pi_{\mathbb{P}}(\widehat{\theta},\theta_{0})$ is arbitrarily close to $n^{2/5}$ .

Clearly, one may select in Theorem 3.3, $\gamma_{n}=n^{\eta}/\sqrt{\log(n)}$ where $\eta=\min(r\,\varsigma,(1-\varsigma)/2)$. In particular, if $\varsigma=1/(1+2r)$ in A10, we have that $\gamma_{n}=n^{{r}/({1+2r})}/\sqrt{\log(n)}$. Hence, for bounded covariates, both the weighted mean square error of the predictions and the weighted mean square error between the predicted probabilities are such that $\widetilde{\pi}_{\mathbb{P}}(\widehat{\theta},\theta_{0})=O_{\mathbb{P}}\left(n^{-\,{r}/({1+2r})}\;\sqrt{\log(n)}\right)$ and $\pi_{\mathbb{P}}(\widehat{\theta},\theta_{0})=O_{\mathbb{P}}\left(n^{-\,{r}/({1+2r})}\;\sqrt{\log(n)}\right)$, leading to a convergence rate that is suboptimal with respect to the one obtained, for instance, in nonparametric regression models, see Stone, (1982, 1985). Furthermore, this rate equals the one obtained for penalized estimators in Kalogridis, (2023), when considering $B$-splines with $k_{n}=O(n^{1/(2r+1)})$ and the slope function is $r$ times continuously differentiable, i.e., $\beta_{0}\in{\mathcal{W}}^{r-1,\infty}$. As mentioned therein, the term $\log(n)$ is related to the fact that we are considering infinite–dimensional covariates. Recall that when $X\in L^{2}(0,1)$ and to ensure identifiability, the eigenvalues of its covariance operator are non–null but converge to $0$, enabling us to provide a lower bound for $\widetilde{\pi}_{\mathbb{P}}(\theta_{1},\theta_{2})$ in terms of $\|\beta_{1}-\beta_{2}\|$.

It is also worth mentioning that the rate derived in Theorem 3.1(c) allows us to conclude that, when $k_{n}=O(n^{1/(4r+1)})$, that is, when $\varsigma=1/(4r+1)$, we have $\pi_{\mathbb{P}}(\widehat{\theta},\theta_{0})=O_{\mathbb{P}}(n^{-\,r/(4r+1)})$. Hence, if the functional covariates are bounded, from Proposition 3.4 we get that $\widetilde{\pi}_{\mathbb{P}}(\widehat{\theta},\theta_{0})=O_{\mathbb{P}}(n^{-\,r/(4r+1)})$. This rate of convergence corresponds to the one obtained in Cardot and Sarda, (2005) for their penalized estimators and is slower than the rate $\widetilde{\pi}_{\mathbb{P}}(\widehat{\theta},\theta_{0})=O_{\mathbb{P}}\left(n^{-\,{r}/({1+2r})}\;\sqrt{\log(n)}\right)$ derived from Theorem 3.3.
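The exponent arithmetic in this remark can be checked with exact fractions. The sketch below is only an illustration: the function name `rate_exponents` is hypothetical, and it simply evaluates the expressions for $\lambda$, $\omega$ and $\eta$ stated in Theorem 3.1 and Corollary 3.5.

```python
from fractions import Fraction


def rate_exponents(r, varsigma):
    """Return (lambda, omega, eta) for smoothness r and basis-growth exponent varsigma."""
    r = Fraction(r)
    s = Fraction(varsigma)
    lam = min(s * r / 2, (1 - s) / 4)   # Theorem 3.1(b)
    omega = min(s * r, (1 - s) / 4)     # Theorem 3.1(c)
    eta = min(s * r, (1 - s) / 2)       # Corollary 3.5
    return lam, omega, eta


r = 2  # cubic splines with a twice continuously differentiable slope
_, _, eta = rate_exponents(r, Fraction(1, 1 + 2 * r))
print(eta)      # 2/5, i.e. r / (1 + 2r)
_, omega, _ = rate_exponents(r, Fraction(1, 4 * r + 1))
print(omega)    # 2/9, i.e. r / (4r + 1)
```

For $r=2$ both terms inside each minimum coincide at the chosen $\varsigma$, which is why these choices balance the approximation and estimation errors.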

4 Simulation study

We performed a Monte Carlo study to investigate the finite-sample properties of our proposed estimators for the functional logistic regression model. For that purpose, we generated a training sample ${\mathcal{M}}$ of i.i.d. observations $(y_{i},X_{i})$ such that $y_{i}\sim Bi(1,F(\alpha_{0}+\langle\beta_{0},X_{i}\rangle))$ where $\alpha_{0}=0$. The true regression parameter was set equal to $\beta_{0}(t)=\sum_{j=1}^{50}b_{j,0}\phi_{j}(t)$, where $\{\phi_{j}\}_{j=1}^{50}$ correspond to elements of the Fourier basis, more precisely, $\phi_{1}(t)\equiv 1$, $\phi_{j}(t)=\sqrt{2}\cos((j-1)\pi t)$, $j\geq 2$, and the coefficients are $b_{1,0}=0.3$ and $b_{j,0}=4(-1)^{j+1}j^{-2}$, $j\geq 2$. The process that generates the functional covariates $X_{i}(t)$ was Gaussian with mean 0 and covariance operator with eigenfunctions $\phi_{j}(t)$. For uncontaminated samples, the scores $\xi_{ij}$ were generated as independent Gaussian random variables $\xi_{ij}\sim N(0,j^{-2})$. We denote the distribution of this Gaussian process ${\mathcal{G}}(0,\Gamma)$. Taking into account that $\mbox{\sc Var}(\xi_{ij})\leq 1/2500$ when $j>50$, the process was approximated numerically using the first 50 terms of its Karhunen-Loève representation.
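The data-generating mechanism for clean samples can be sketched as follows. This is a minimal illustration, not the code used in the study: the function names are hypothetical, NumPy is assumed to be available, and the inner product $\langle X_{i},\beta_{0}\rangle$ is computed from the scores, using the orthonormality of the cosine basis in $L^{2}(0,1)$.

```python
import numpy as np


def fourier_basis(t, n_terms=50):
    """phi_1 = 1 and phi_j(t) = sqrt(2) cos((j-1) pi t), j >= 2, on the grid t."""
    j = np.arange(1, n_terms + 1)
    phi = np.sqrt(2.0) * np.cos(np.outer(j - 1, np.pi * t))
    phi[0] = 1.0
    return phi  # shape (n_terms, len(t))


def true_coefficients(n_terms=50):
    """b_1 = 0.3 and b_j = 4 (-1)^(j+1) / j^2 for j >= 2."""
    j = np.arange(1, n_terms + 1)
    b = 4.0 * (-1.0) ** (j + 1) / j**2
    b[0] = 0.3
    return b


def simulate_clean(n, rng, alpha0=0.0, n_terms=50):
    """Scores xi_ij ~ N(0, j^-2) and responses y_i ~ Bi(1, F(alpha0 + <X_i, beta0>))."""
    j = np.arange(1, n_terms + 1)
    xi = rng.normal(size=(n, n_terms)) / j       # standard deviation 1/j
    b0 = true_coefficients(n_terms)
    lin_pred = alpha0 + xi @ b0                  # <X_i, beta0> via orthonormality
    prob = 1.0 / (1.0 + np.exp(-lin_pred))
    y = rng.binomial(1, prob)
    return xi, y


rng = np.random.default_rng(0)                   # seed chosen only for illustration
xi, y = simulate_clean(300, rng)
t = np.linspace(0.0, 1.0, 100)
X = xi @ fourier_basis(t)                        # trajectories X_i(t) on the grid
```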

We chose as basis the $B-$ spline basis for all the procedures considered, denoted $\{B_{j}\}_{j=1}^{k_{n}}$ . We compared four estimators: the procedure based on using the deviance after dimension reduction, that is using $\rho(t)=t$ in (2), labelled the classical estimators and denoted cl, the one that uses $M-$ estimators denoted m, and their weighted versions. The $M-$ estimators and weighted $M-$ estimators were computed using the loss function introduced in Croux and Haesbroeck, (2003) and defined as

\rho(t)=\left\{\begin{array}[]{ll}t\;e^{-\sqrt{c}}&\hbox{if }\,\,t\leq c\\ -2e^{-\sqrt{t}}\left(1+\sqrt{t}\right)+e^{-\sqrt{c}}\left(2\left(1+\sqrt{c}\right)+c\right)&\hbox{if }\,\,t>c\,,\end{array}\right.

with tuning constant $c=0.5$. For the former the weights equal 1 for all observations, while for the latter, as for the weighted deviance estimators, two different types of weight functions were considered.

a) For the first one, after dimension reduction, that is, after computing $x_{ij}=\langle X_{i},B_{j}\rangle$, we evaluated the Donoho–Stahel location and scatter estimators, denoted $\widehat{\mbox{\boldmath$\mu$}}$ and $\widehat{\mbox{\boldmath$\Sigma$}}$, respectively, of the sample $\mathbf{x}_{1},\dots,\mathbf{x}_{n}$ with $\mathbf{x}_{i}=(x_{i1},\dots,x_{ik_{n}})^{\mbox{\footnotesize\sc t}}$. The weights are then defined as $w(X_{i})=1$ when the squared Mahalanobis distance $d_{i}^{2}=(\mathbf{x}_{i}-\widehat{\mbox{\boldmath$\mu$}})^{\mbox{\footnotesize\sc t}}\widehat{\mbox{\boldmath$\Sigma$}}^{-1}(\mathbf{x}_{i}-\widehat{\mbox{\boldmath$\mu$}})$ is less than or equal to $\chi_{0.975,k_{n}}$ and $0$ otherwise, where $\chi_{{}_{\alpha,p}}$ stands for the $\alpha$-quantile of a chi-square distribution with $p$ degrees of freedom. Hence, for this family we used hard rejection weights and for that reason, the weighted estimators based on the deviance and the weighted $M$-estimators are denoted wcl-hr and wm-hr, respectively. Note that wcl-hr are related to the Mallows–type estimators introduced in Carroll and Pederson, (1993).

b) The second family of weight functions is based on the functional boxplots as defined by Sun and Genton, (2011). Again, we chose hard rejection weights, but on the functional space, by taking $w(X)=0$ if $X$ was declared an outlier by the functional boxplot and $1$ otherwise. The functional boxplot was computed using the function fbplot of the library fda, taking as method "Both", which orders the observations according to the band depth and then breaks ties with the modified band depth, as defined in López-Pintado and Romo, (2009). In this case the weighting is done before the $B$-spline approximation. The resulting weighted classical and $M$-estimators are denoted wcl-fbb and wm-fbb, respectively.
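The hard rejection rule in a) can be sketched as below. The sketch assumes that a robust location vector `mu` and scatter matrix `sigma` have already been obtained (for instance from a Donoho–Stahel estimator, which is not computed here) and that `cutoff` is the chi-square quantile described above; one way to obtain it would be `scipy.stats.chi2.ppf(0.975, df=k)`. The function name is hypothetical and NumPy is assumed.

```python
import numpy as np


def hard_rejection_weights(x, mu, sigma, cutoff):
    """Return (weights, d2): w_i = 1 if the squared Mahalanobis distance d_i^2 <= cutoff, else 0.

    x: (n, k) matrix of projected covariates x_ij = <X_i, B_j>.
    mu, sigma: robust location (k,) and scatter (k, k), assumed given.
    """
    diff = x - mu
    solved = np.linalg.solve(sigma, diff.T).T        # rows are Sigma^{-1} (x_i - mu)
    d2 = np.einsum("ij,ij->i", diff, solved)         # squared Mahalanobis distances
    return (d2 <= cutoff).astype(float), d2


# Toy illustration with k = 2, where the 0.975 chi-square quantile is -2 log(0.025).
x_toy = np.array([[0.0, 0.0], [1.0, 1.0], [10.0, 10.0]])
w_toy, d2_toy = hard_rejection_weights(x_toy, np.zeros(2), np.eye(2), -2.0 * np.log(0.025))
```

In the toy data the third point has squared distance 200 and receives weight 0, while the first two are kept.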

For each setting we generated $n_{R}=1000$ samples of size $n=300$ and used cubic splines with equally spaced knots. For the robust estimators we selected the size of the spline basis, $k_{n}$ , by minimizing $RBIC(k)$ in equation (6) over the grid $4\leq k\leq 14$ . For the classical estimator, we used the standard $BIC$ criterion, that is, we chose $\rho(t)=t$ in equation (6).

Figure 1: Trajectories $X_{i}(t)$ with and without contamination. The red dotted lines correspond to the added contaminated covariates.

We considered different contamination scenarios by adding a proportion $\epsilon$ of atypical points. We denote these scenarios $C_{j,\epsilon}$ , for $1\leq j\leq 5$ and we chose $\epsilon=0.05$ and $0.10$ .

• In the first scenario, denoted $C_{1,\epsilon}$, we generated $n_{\mbox{\scriptsize\sc out}}=\epsilon\;n$ misclassified points $(\widetilde{y},\widetilde{X})$, where $\widetilde{X}\sim{\mathcal{G}}(0,25\Gamma)$ and $\widetilde{y}=1$ when $\alpha_{0}+\langle\widetilde{X},\beta_{0}\rangle<0$ and $\widetilde{y}=0$, otherwise.

• Under $C_{2,\epsilon}$, we have tried to adapt to the functional framework the damaging effect of the high leverage points considered by Croux and Haesbroeck, (2003). For that purpose, given $m>0$, we generated $\widetilde{X}\sim{\mathcal{G}}(m\;\beta_{0},0.01\Gamma)$. The response $\widetilde{y}$, related to $\widetilde{X}$, was always taken equal to $0$. It is worth noticing that $\langle\widetilde{X},\beta_{0}\rangle$ is very close to $m\|\beta_{0}\|^{2}$, thus the leverage of the added points increases with $m$. We chose $m=4$.

• Contamination $C_{3,\epsilon}$ generates extreme outliers as $\widetilde{X}\sim{\mathcal{G}}(\mu,\Gamma)$ with $\mu(t)=25$, for all $t$. As in $C_{1,\epsilon}$, we chose $\widetilde{y}=1$ when $\alpha_{0}+\langle\widetilde{X},\beta_{0}\rangle<0$ and $\widetilde{y}=0$, otherwise.

• Setting $C_{4,\epsilon}$ aims to construct trajectories with extreme symmetric outliers. For that purpose, we define $\widetilde{X}\sim 0.5{\mathcal{G}}(\mu,\Gamma)+0.5{\mathcal{G}}(-\mu,\Gamma)$ with $\mu(t)=25$, for all $t$. As in $C_{1,\epsilon}$, we chose $\widetilde{y}=1$ when $\alpha_{0}+\langle\widetilde{X},\beta_{0}\rangle<0$ and $\widetilde{y}=0$, otherwise.

• The purpose of $C_{5,\epsilon}$ is to add trajectories with a partial contamination. We generated $\widetilde{X}(t)=Z(t)+25\,B\mathbb{I}_{T<t}$ where $Z\sim{\mathcal{G}}(0,\Gamma)$, $\mathbb{P}(B=1)=\mathbb{P}(B=-1)=0.5$ and $T\sim{\mathcal{U}}(0,1)$ and $Z,B$ and $T$ are independent. As in $C_{1,\epsilon}$, we chose $\widetilde{y}=1$ when $\alpha_{0}+\langle\widetilde{X},\beta_{0}\rangle<0$ and $\widetilde{y}=0$, otherwise.
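Two of the schemes above, $C_{1,\epsilon}$ and $C_{5,\epsilon}$, can be sketched as follows. This is an illustration under stated assumptions rather than the code used in the study: the function names are hypothetical, NumPy is assumed, the $C_{1,\epsilon}$ covariates are represented through their scores on the eigenfunctions (so that ${\mathcal{G}}(0,25\Gamma)$ corresponds to scores with standard deviation $5/j$), and the $C_{5,\epsilon}$ shift is applied to curves already evaluated on a grid.

```python
import numpy as np


def misclassified_labels(lin_pred):
    """y = 1 when alpha0 + <X, beta0> < 0 and y = 0 otherwise."""
    return (lin_pred < 0).astype(int)


def contaminate_c1(n_out, b0, rng, alpha0=0.0):
    """Scores of X ~ G(0, 25 Gamma) with deliberately wrong labels."""
    j = np.arange(1, len(b0) + 1)
    xi = 5.0 * rng.normal(size=(n_out, len(b0))) / j   # sd 5/j for the j-th score
    y = misclassified_labels(alpha0 + xi @ b0)
    return xi, y


def contaminate_c5_curves(z, t, rng):
    """Add 25 * B * 1{T < t} to the clean curves z (rows) observed on the grid t."""
    n_out = z.shape[0]
    sign = rng.choice([-1.0, 1.0], size=n_out)          # B = +1 or -1 with prob. 0.5
    jump = rng.uniform(0.0, 1.0, size=n_out)            # T ~ U(0, 1)
    return z + 25.0 * sign[:, None] * (jump[:, None] < t[None, :])
```

For $C_{5,\epsilon}$ the labels would then be obtained by applying `misclassified_labels` to a numerical approximation of $\alpha_{0}+\langle\widetilde{X},\beta_{0}\rangle$ computed from the shifted curves.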

The way contaminated trajectories are constructed under settings $C_{3,\epsilon}$ to $C_{5,\epsilon}$ corresponds to the contaminations considered in Denhere and Billor, (2016). However, we force the atypical trajectories to correspond to bad leverage points.

To illustrate the type of outliers generated, Figure 1 shows the obtained functional covariates $X_{i}(t)$ , for one sample generated under each scheme.

To compare the estimators of $\alpha_{0}$ , we computed their biases and standard deviations, $s_{\widehat{\alpha}}$ which are reported in Table 1. We also present in Figures 2 and 3 their boxplots for the considered contaminations.

	$\widehat{\alpha}-\alpha_{0}$	$s_{\widehat{\alpha}}$	$\widehat{\alpha}-\alpha_{0}$	$s_{\widehat{\alpha}}$	$\widehat{\alpha}-\alpha_{0}$	$s_{\widehat{\alpha}}$	$\widehat{\alpha}-\alpha_{0}$	$s_{\widehat{\alpha}}$	$\widehat{\alpha}-\alpha_{0}$	$s_{\widehat{\alpha}}$
					$C_{0}$
cl					-0.0040	0.1250
m					-0.0072	0.1528
wcl-hr					-0.0046	0.1250
wm-hr					-0.0045	0.1278
wcl-fbb					-0.0040	0.1250
wm-fbb					-0.0083	0.1259
	$C_{1,0.05}$		$C_{2,0.05}$		$C_{3,0.05}$		$C_{4,0.05}$		$C_{5,0.05}$
cl	-0.0024	0.1160	-0.0540	0.1200	-0.0200	0.1230	-0.0064	0.1216	-0.0034	0.1166
m	-0.0038	0.1589	-0.1613	0.1139	-0.0580	0.1086	-0.0041	0.1170	-0.0050	0.1138
wcl-hr	-0.0029	0.1265	-0.0028	0.1263	-0.0029	0.1263	-0.0065	0.1247	-0.0029	0.1254
wm-hr	0.0007	0.1331	0.0008	0.1331	0.0009	0.1333	-0.0053	0.1304	0.0042	0.1282
wcl-fbb	-0.0030	0.1208	-0.0080	0.1247	-0.0177	0.1228	-0.0061	0.1234	-0.0028	0.1234
wm-fbb	-0.0002	0.1352	-0.0068	0.1326	-0.0168	0.1313	-0.0084	0.1283	0.0034	0.1259
	$C_{1,0.10}$		$C_{2,0.10}$		$C_{3,0.10}$		$C_{4,0.10}$		$C_{5,0.10}$
cl	0.0011	0.1117	-0.0550	0.1220	-0.0154	0.1219	-0.0017	0.1219	0.0035	0.1174
m	-0.0092	0.1564	-0.1680	0.0952	-0.0397	0.1110	-0.0024	0.1128	-0.0006	0.1064
wcl-hr	0.0022	0.1234	0.0018	0.1240	0.0021	0.1241	-0.0006	0.1250	0.0053	0.1247
wm-hr	-0.0067	0.1292	-0.0069	0.1302	-0.0065	0.1299	-0.0032	0.1292	0.0078	0.1279
wcl-fbb	0.0030	0.1156	-0.0057	0.1235	-0.0154	0.1219	-0.0014	0.1220	0.0051	0.1229
wm-fbb	-0.0037	0.1262	-0.0145	0.1295	-0.0240	0.1297	-0.0017	0.1245	0.0031	0.1296

Table 1: Bias and standard deviations for the estimators of $\alpha_{0}$, over $n_{R}=1000$ replications, for clean and contaminated samples of size $n=300$.

As expected, for clean samples, all the methods perform similarly. It should be noted that the $M$ and weighted $M$-estimators show slightly lower biases than the classical procedure based on the deviance, but their standard deviations are larger due to the loss of efficiency. For all the considered contamination schemes, the obtained results reflect the stability of the weighted estimators. In contrast, the classical procedure and also the $M$-estimator are affected by some of these schemes. When considering the performance across contaminations, we observe that $C_{2,\epsilon}$ is the one with the largest effect on the bias of the classical estimator. This contamination, as well as $C_{3,\epsilon}$, is also damaging for the $M$-estimator due to the presence of extreme high leverage outliers.

Regarding the performance of the weighted estimators under $C_{1,\epsilon}$ to $C_{5,\epsilon}$, the weighted proposal which downweights observations with large robust Mahalanobis distance provides the best results, especially when looking at the bias. Note that under some contaminating schemes, the obtained biases for wm-fbb are considerably larger than those of wm-hr, for instance under $C_{3,0.10}$ it is more than 10 times larger, while their standard deviations are comparable. The same behaviour arises when comparing the biases of wcl-fbb and wcl-hr.

To evaluate the performance of the different estimators of $\beta_{0}$ , as in Qingguo, (2015) and Boente et al., (2020), one possibility is to consider numerical approximations of their integrated squared bias and mean integrated squared error, computed on a grid of equally spaced points on $[0,1]$ . However, as mentioned in He and Shi, (1998), these measures may be influenced by numerical errors at or near the boundaries of the grid. For that reason, we only report here trimmed versions of the above summaries computed without the $q$ first and last points of the grid. More specifically, if $\widehat{\beta}_{j}$ is the estimate of the function $\beta_{0}$ obtained with the $j$ -th sample ( $1\leq j\leq n_{R}$ ) and $t_{1}\leq\dots\leq t_{M}$ are equispaced points on $[0,1]$ , we evaluated

\mbox{Bias}_{\mbox{\footnotesize\sc tr}}^{2}(\widehat{\beta})=\frac{1}{M-2q}\sum_{s=q+1}^{M-q}\left(\frac{1}{n_{R}}\sum_{j=1}^{n_{R}}\widehat{\beta}_{j}(t_{s})-\beta_{0}(t_{s})\right)^{2}\,,

\mbox{MISE}_{\mbox{\footnotesize\sc tr}}(\widehat{\beta})=\frac{1}{M-2q}\sum_{s=q+1}^{M-q}\frac{1}{n_{R}}\sum_{j=1}^{n_{R}}\left(\widehat{\beta}_{j}(t_{s})-\beta_{0}(t_{s})\right)^{2}\,.

We chose $M=100$ and $q=[M\times 0.05]$ , which uses the central 90% interior points in the grid.
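The two trimmed summaries can be computed from a matrix of estimated curves as in the following sketch. The function name and the array layout (one row per replication, one column per grid point) are assumptions made for the illustration; NumPy is assumed.

```python
import numpy as np


def trimmed_summaries(beta_hat, beta0, q):
    """Trimmed squared bias and MISE.

    beta_hat: (n_R, M) array with beta_hat[j, s] the j-th estimate at grid point t_s.
    beta0: (M,) array with the true slope on the same grid.
    q: number of grid points dropped at each end.
    """
    M = beta_hat.shape[1]
    inner = slice(q, M - q)                       # grid points s = q+1, ..., M-q
    err = beta_hat[:, inner] - beta0[inner]
    bias2 = np.mean(np.mean(err, axis=0) ** 2)    # average over s of squared mean error
    mise = np.mean(err ** 2)                      # average over s and j of squared error
    return bias2, mise


M = 100
q = int(M * 0.05)                                 # q = 5, keeping the central 90 points
```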

We also considered another summary measure which aims to evaluate the estimator's predictive capability. With that purpose, denote $(\widehat{\alpha}_{j},\widehat{\beta}_{j})$ the estimator obtained at replication $j$, $1\leq j\leq n_{R}$. At the $j$-th replication, we also generated, independently from the sample used to compute the estimator, a new sample ${\mathcal{T}}_{j}=\{(y_{i,{\mathcal{T}}_{j}},X_{i,{\mathcal{T}}_{j}})\}_{i=1}^{n}$ distributed as $C_{0}$. We then computed the probability mean squared errors defined as

\text{PMSE}(\widehat{\alpha},\widehat{\beta})=\frac{1}{n_{R}}\sum_{j=1}^{n_{R}}\frac{1}{n}\sum_{i=1}^{n}\left\{F\left(\langle X_{i,{\mathcal{T}}_{j}},\beta_{0}\rangle+\alpha_{0}\right)-F\left(\langle X_{i,{\mathcal{T}}_{j}},\widehat{\beta}_{j}\rangle+\widehat{\alpha}_{j}\right)\right\}^{2}\,.
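Given the true and estimated linear predictors evaluated on the test samples, the PMSE reduces to an average of squared differences between logistic probabilities. The sketch below is illustrative only: the function names and the array layout are assumptions, all test samples are taken to have the same size $n$, and NumPy is assumed.

```python
import numpy as np


def logistic(u):
    """Logistic distribution function F(u) = 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + np.exp(-u))


def pmse(lin_true, lin_hat):
    """Probability mean squared error.

    lin_true[j, i] = alpha0 + <X_i, beta0> and lin_hat[j, i] = alpha_hat_j + <X_i, beta_hat_j>,
    both evaluated on the i-th point of the j-th test sample; shape (n_R, n).
    """
    return np.mean((logistic(lin_true) - logistic(lin_hat)) ** 2)
```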

Table 2 presents the results obtained for the PMSE, while Table 3 reports the obtained summary measures $\mbox{Bias}^{2}_{\mbox{\footnotesize\sc tr}}(\widehat{\beta})$ and $\mbox{MISE}_{\mbox{\footnotesize\sc tr}}(\widehat{\beta})$ .

	$C_{0}$	$C_{1,0.05}$	$C_{1,0.10}$	$C_{2,0.05}$	$C_{2,0.10}$	$C_{3,0.05}$	$C_{3,0.10}$	$C_{4,0.05}$	$C_{4,0.10}$	$C_{5,0.05}$	$C_{5,0.10}$
cl	0.0039	0.0155	0.0251	0.0223	0.0265	0.0118	0.0130	0.0118	0.0129	0.0263	0.0296
m	0.0043	0.0151	0.0244	0.0222	0.0256	0.0105	0.0113	0.0106	0.0113	0.0242	0.0266
wcl-hr	0.0042	0.0042	0.0040	0.0042	0.0041	0.0042	0.0040	0.0042	0.0040	0.0041	0.0041
wm-hr	0.0042	0.0042	0.0041	0.0041	0.0042	0.0042	0.0041	0.0042	0.0041	0.0042	0.0040
wcl-fbb	0.0039	0.0055	0.0115	0.0051	0.0060	0.0108	0.0130	0.0056	0.0108	0.0039	0.0039
wm-fbb	0.0039	0.0053	0.0109	0.0053	0.0060	0.0104	0.0120	0.0056	0.0102	0.0038	0.0039

Table 2: Probability mean squared errors, PMSE, over $n_{R}=1000$ replications, for clean and contaminated samples of size $n=300$.

	$\mbox{Bias}_{\mbox{\footnotesize\sc tr}}^{2}$	$\mbox{MISE}_{\mbox{\footnotesize\sc tr}}$	$\mbox{Bias}_{\mbox{\footnotesize\sc tr}}^{2}$	$\mbox{MISE}_{\mbox{\footnotesize\sc tr}}$	$\mbox{Bias}_{\mbox{\footnotesize\sc tr}}^{2}$	$\mbox{MISE}_{\mbox{\footnotesize\sc tr}}$	$\mbox{Bias}_{\mbox{\footnotesize\sc tr}}^{2}$	$\mbox{MISE}_{\mbox{\footnotesize\sc tr}}$	$\mbox{Bias}_{\mbox{\footnotesize\sc tr}}^{2}$	$\mbox{MISE}_{\mbox{\footnotesize\sc tr}}$
					$C_{0}$
cl					0.0029	0.3305
m					0.0024	0.3259
wcl-hr					0.0019	0.3795
wm-hr					0.0021	0.3510
wcl-fbb					0.0029	0.3305
wm-fbb					0.0030	0.3257
	$C_{1,0.05}$		$C_{2,0.05}$		$C_{3,0.05}$		$C_{4,0.05}$		$C_{5,0.05}$
cl	0.5726	0.7706	1.2394	1.5669	0.1416	0.4412	0.1435	0.4246	0.9499	1.0056
m	0.5222	0.7338	1.0635	1.3688	0.1290	0.3230	0.1300	0.3323	0.8710	0.8908
wcl-hr	0.0019	0.3566	0.0020	0.3523	0.0020	0.3535	0.0020	0.3444	0.0025	0.3534
wm-hr	0.0016	0.3350	0.0014	0.3353	0.0015	0.3379	0.0016	0.3382	0.0018	0.3605
wcl-fbb	0.0749	0.4007	0.0119	0.4016	0.1069	0.4258	0.0076	0.3386	0.0028	0.3190
wm-fbb	0.0517	0.3878	0.0105	0.4046	0.1028	0.4173	0.0086	0.3452	0.0019	0.3179
	$C_{1,0.10}$		$C_{2,0.10}$		$C_{3,0.10}$		$C_{4,0.10}$		$C_{5,0.10}$
cl	1.0181	1.1032	1.5130	1.8423	0.1620	0.4570	0.1659	0.4436	1.0601	1.0953
m	0.9764	1.0436	1.2632	1.5716	0.1440	0.3264	0.1414	0.3118	0.9227	0.9385
wcl-hr	0.0018	0.3484	0.0017	0.3497	0.0018	0.3475	0.0025	0.3344	0.0016	0.3381
wm-hr	0.0033	0.3354	0.0023	0.3440	0.0023	0.3385	0.0025	0.3272	0.0017	0.3274
wcl-fbb	0.4184	0.6609	0.0270	0.4626	0.1617	0.4568	0.0997	0.4133	0.0016	0.3188
wm-fbb	0.3615	0.6406	0.0252	0.4493	0.1441	0.4293	0.0942	0.4012	0.0016	0.3099

Table 3: Trimmed version of the integrated squared bias and mean integrated squared errors for the estimators of $\beta_{0}$, over $n_{R}=1000$ replications, for clean and contaminated samples of size $n=300$.

In order to visually explore the performance of these estimators, Figures 4 to 14 contain functional boxplots, see Sun and Genton, (2011), for the $n_{R}=1000$ realizations of the different estimators for $\beta_{0}$ under the contamination settings. As in standard boxplots, the magenta central box of these functional boxplots represents the 50% inner band of curves, the solid black line indicates the central (deepest) function and the dotted red lines indicate outlying curves, that is, outlying estimates $\widehat{\beta}_{j}$ for some $1\leq j\leq n_{R}$ . We also indicate in blue lines the whiskers delimiting the non–outlying curves and the true function $\beta_{0}$ with a dark green line. To avoid boundary effects, we show in Figures 15 to 25 the different estimates evaluated on the interior points of a grid of 100 equispaced points. In addition, to facilitate comparisons between contamination cases and estimation methods, the scales of the vertical axes are the same for all Figures.

In clean samples, all the estimators give similar results, with slightly smaller values of the $\mbox{Bias}_{\mbox{\footnotesize\sc tr}}^{2}$ when considering the weighted estimators. It is worth mentioning that the weighted $M$-estimators are remarkably efficient. Furthermore, when considering the PMSE, the weighted estimators giving weight 0 to the observations with covariates detected as atypical by the functional boxplot give results similar to those of the classical estimator. This behaviour may be explained by the fact that in most generated samples, under $C_{0}$, no atypical curves are detected among the covariates.

As expected, when misclassified observations are present, the procedure based on the deviance breaks down, in particular when these responses are combined with extreme high leverage covariates, as is the case for $C_{2,\epsilon}$. Contamination schemes $C_{3,\epsilon}$ and $C_{4,\epsilon}$ have less effect than $C_{1,\epsilon}$ on the integrated squared bias and mean integrated squared error of the classical estimator of the slope. This fact is also illustrated in Figures 9 to 12 and 20 to 23, where the true curve is still in the band containing the 50% deepest estimates. In contrast, under $C_{1,\epsilon}$, $C_{2,\epsilon}$ and particularly under $C_{5,\epsilon}$, the plot of the true function is beyond the limits of the functional boxplot, meaning that the obtained estimates become completely uninformative and do not reflect the shape of $\beta_{0}$, see Figures 5 to 8, 13 and 14 for instance.

It is interesting to note that the unweighted $M$-estimator shows a performance similar to that of the classical one, for the considered contamination schemes. In contrast, weighted estimators give very good results in all the studied contamination settings, especially when considering wcl-hr and wm-hr, which clearly outperform wcl-fbb and wm-fbb in most cases. It is worth mentioning that the weighted estimators based on the deviance are quite stable across contaminations, the only exception being $C_{1,0.10}$ where wcl-fbb is more sensitive. In this last case, as when considering the wm-fbb estimates, the true curve lies above the magenta central region for values of $t$ larger than 0.8 (see Figures 6 and 17), but is still included in the band limited by the blue whiskers.

In most cases, the weighted $M$-estimators improve on the weighted estimators obtained when $\rho(t)=t$, and the procedures with weights based on the Mahalanobis distance of the projected covariates outperform those whose weights are based on the functional boxplot. In particular, contamination schemes $C_{1,\epsilon}$ and $C_{3,\epsilon}$ affect the PMSE of the latter procedure more than that of the former one. Note that, under $C_{1,0.10}$, $C_{3,0.10}$ and also under $C_{4,0.10}$, the PMSE of wcl-fbb is more than twice that obtained with wcl-hr, and similarly when comparing wm-fbb with wm-hr (see Table 2).

In summary, the wcl-hr and especially the wm-hr estimators display a remarkably stable behaviour across the selected contaminations. Both estimators are comparable, but wm-hr attains in general lower values of the $\mbox{MISE}_{\mbox{\footnotesize\sc tr}}$. The good performance of wcl-hr may be explained by the fact that the hard–rejection weights mainly discard from the sample the observations with high leverage covariates, which in most cases correspond to misclassified ones. This behaviour was already noticed by Croux and Haesbroeck, (2003) in the finite–dimensional setting.

5 Real data example

In this section, we consider the electricity prices data set analysed in Liebl, (2013) in the context of electricity price forecasting and we investigate the performance of the proposed estimators.

The data consist of hourly electricity prices in Germany between 1 January 2006 and 30 September 2008, as traded at the Leipzig European Energy Exchange, together with German electricity demand (as reported by the European Network of Transmission System Operators for Electricity). These data were also used in Boente et al., (2020), who modelled the daily average hourly energy demand through a semi-functional linear regression model using as Euclidean covariate the mean hourly amount of wind-generated electricity in the system for that day and as functional covariate the curve of energy prices observed hourly.

In our analysis, the response measures high or low demand of electricity, that is, we define the binary variable $y=1$ if the average hourly demand exceeds 55000 (“High Demand”) and $0$ (“Low Demand”) otherwise. Furthermore, the functional covariates used to predict the conditional probability that a day has high demand, correspond to the curves of energy prices. These curves are observed hourly originating a matrix of dimension $638\times 24$ , after removing weekends, holidays and other non-working days. Hence, with these data we fit a functional logistic regression model,

\mathbb{P}(y=1|X)=F\left(\alpha_{0}+\left\langle X,\beta_{0}\right\rangle\right)\,,

using the weighted robust estimators defined in this paper and their classical alternatives. The trajectories corresponding to $y=1$ and $y=0$ are given in Figure 26.
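To make the model concrete, the following Python sketch evaluates the fitted probability for one discretized price curve. It is only an illustration under assumptions that are not taken from the paper's implementation: $F$ is the logistic distribution function, the 24 hourly readings are mapped to an equally spaced grid on $[0,1]$ , the inner product $\langle X,\beta\rangle$ is approximated by the trapezoidal rule, and the curve values, slope values and intercept are hypothetical.

```python
import math

def logistic(t):
    """Logistic distribution function F(t), evaluated in a numerically stable way."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def inner_product(x_vals, beta_vals, grid):
    """Trapezoidal approximation of <X, beta>, the integral of X(t) * beta(t)."""
    total = 0.0
    for j in range(1, len(grid)):
        left = x_vals[j - 1] * beta_vals[j - 1]
        right = x_vals[j] * beta_vals[j]
        total += 0.5 * (left + right) * (grid[j] - grid[j - 1])
    return total

def predict_high_demand(x_vals, beta_vals, grid, alpha):
    """P(y = 1 | X) = F(alpha + <X, beta>) for one discretized curve."""
    return logistic(alpha + inner_product(x_vals, beta_vals, grid))

# Hypothetical inputs: 24 hourly readings mapped to [0, 1], constant curves.
grid = [h / 23 for h in range(24)]
x_curve = [1.0] * 24      # made-up price curve
beta_curve = [2.0] * 24   # made-up slope function
p_hat = predict_high_demand(x_curve, beta_curve, grid, alpha=-1.0)
```

With these constant curves the inner product equals 2, so the fitted probability is $F(1)$ .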

To compute the estimators, we used $B-$ splines to generate the space of possible candidates and, as above, we denote by $\{B_{j}\}_{j=1}^{k_{n}}$ the $k_{n}-$ dimensional basis. As in the simulation study, after dimension reduction, we computed the finite–dimensional coefficients through (5) using as loss $\rho(t)=t$ , which leads to the deviance based estimators denoted cl, and the $\rho$ function introduced in Croux and Haesbroeck, (2003) with tuning constant $0.5$ , denoted m. We also computed the weighted versions of the previous estimators, with weights based on the robust Mahalanobis distance of the projected data. That is, we evaluated the Donoho–Stahel location and scatter estimators, denoted $\widehat{\mbox{\boldmath$\mu$}}$ and $\widehat{\mbox{\boldmath$\Sigma$}}$ , respectively, of the sample $\mathbf{x}_{1},\dots,\mathbf{x}_{n}$ with $\mathbf{x}_{i}=(x_{i1},\dots,x_{ik_{n}})^{\mbox{\footnotesize\sc t}}$ , $x_{ij}=\langle X_{i},B_{j}\rangle$ and we defined $w(X_{i})=1$ when $d_{i}^{2}=(\mathbf{x}_{i}-\widehat{\mbox{\boldmath$\mu$}})^{\mbox{\footnotesize\sc t}}\widehat{\mbox{\boldmath$\Sigma$}}^{-1}(\mathbf{x}_{i}-\widehat{\mbox{\boldmath$\mu$}})$ is less than or equal to $\chi^{2}_{0.975,k_{n}}$ and $0$ otherwise. For simplicity, in Table 4 below, these procedures are labelled wcl and wm, respectively. As in the simulation, to select the basis dimension we use the criterion defined in Section 2.4, that is, we minimize the quantity $RBIC(k)$ defined in (6). All procedures choose $k_{n}=7$ , except the $M-$ estimator that selects $k_{n}=5$ .
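The hard–rejection weights described above can be sketched as follows. This is a minimal illustration, not the code used for the analysis: the location vector and the inverse scatter matrix are passed as inputs (here they are made-up stand-ins for the Donoho–Stahel estimates, which are not computed), the scores are hypothetical, and the cut-off is read as the 0.975 quantile of a chi-square distribution with $k_{n}$ degrees of freedom. The example uses two scores only because, for two degrees of freedom, that quantile has the closed form $-2\log(0.025)$ .

```python
import math

def mahalanobis_sq(x, mu, sigma_inv):
    """Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu)."""
    diff = [xi - mi for xi, mi in zip(x, mu)]
    size = len(diff)
    return sum(
        diff[r] * sigma_inv[r][c] * diff[c]
        for r in range(size)
        for c in range(size)
    )

def hard_rejection_weights(scores, mu, sigma_inv, cutoff):
    """Weight 1 when the squared distance is at most the cut-off, 0 otherwise."""
    return [1 if mahalanobis_sq(x, mu, sigma_inv) <= cutoff else 0 for x in scores]

# Hypothetical projected scores x_i = (<X_i, B_1>, <X_i, B_2>).
scores = [[0.1, -0.2], [1.0, 0.5], [4.0, 4.0]]
mu = [0.0, 0.0]                        # stand-in for the robust location
sigma_inv = [[1.0, 0.0], [0.0, 1.0]]   # stand-in for the inverse robust scatter
# Chi-square 0.975 quantile with 2 degrees of freedom: -2 * log(1 - 0.975).
cutoff = -2.0 * math.log(1.0 - 0.975)
weights = hard_rejection_weights(scores, mu, sigma_inv, cutoff)
```

In this toy example only the third point, whose squared distance is 32, receives weight zero.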

cl	m	wcl	wm
-1.564	-1.588	-1.693	-1.811

Table 4: Estimates of $\alpha_{0}$

The estimates of $\alpha_{0}$ are reported in Table 4, while those for $\beta_{0}$ are shown in Figure 27. In solid black and gray lines we represent the wm and $M-$ estimators, respectively, while red solid and dashed lines correspond to the cl and wcl procedures. Comparing the obtained results, we note that the weighted estimators lead to slightly larger absolute values of the intercept. Regarding the estimation of $\beta_{0}$ , it is clear that the $M-$ estimator produces a smoother curve since the basis dimension is smaller. The classical procedure is almost equal to $0$ between 9am and 4pm, while all the weighted estimators detect a positive “peak” around 1pm and two slumps near 8am and 5pm. All procedures detect the large peaks close to 4am and 9pm, where prices seem to have a larger (in magnitude) association when predicting the demand, but for the classical procedure the magnitude of the function $\widehat{\beta}$ is smaller than that of the weighted ones in that range.

To identify potential atypical observations, we used the deviance QQ-plot defined in García Ben and Yohai, (2004). For that purpose, we considered the deviance residuals based on the ${\mbox{{wm}}}-$ estimator. More precisely, if $(\widehat{\alpha},\widehat{\beta})$ stands for the weighted $M-$ estimator, we compute the predicted probabilities and deviance residuals $\widehat{p}_{i}=F\left(\widehat{\alpha}+\left\langle X_{i},\widehat{\beta}\right\rangle\right)$ and $d_{i}=\text{sign}(y_{i}-\widehat{p}_{i})\sqrt{-2\left\{(1-y_{i})\log(1-\widehat{p}_{i})+y_{i}\log(\widehat{p}_{i})\right\}}\,.$ Let $\widehat{F}_{D}$ be the estimator of their distribution function, given by

\widehat{F}_{D}(d)=\frac{1}{n}\left\{\sum_{\widehat{p}_{i}\in{\mathcal{A}}}\widehat{p}_{i}+\sum_{1-\widehat{p}_{i}\in{\mathcal{A}}}\left(1-\widehat{p}_{i}\right)\right\}\,,

where ${\mathcal{A}}=\left\{t:(-2\log t)^{1/2}\leq d\right\}$ . We consider outliers those observations with deviance residual $d_{i}$ smaller than $\widehat{F}_{D}^{-1}(0.005)$ or larger than $\widehat{F}_{D}^{-1}(0.995)$ , leading to $19$ observations detected as possible atypical observations. These observations, together with the value of their residual deviance and predicted probability are reported in Table 5.

$y_{i}=0$				$y_{i}=1$
$i$	Date	$d_{i}$	$\widehat{p}_{i}$	$i$	Date	$d_{i}$	$\widehat{p}_{i}$
134	2006-07-25	-5.686	1.000	66	2006-04-10	2.344	0.064
136	2006-07-27	-4.972	1.000	75	2006-04-25	2.905	0.015
137	2006-07-28	-2.667	0.971	76	2006-04-26	2.511	0.043
469	2008-01-10	-2.493	0.955	77	2006-04-27	2.313	0.069
502	2008-03-03	-2.412	0.945	445	2007-11-20	3.208	0.006
508	2008-03-11	-2.282	0.926	543	2008-05-06	2.393	0.057
519	2008-03-28	-3.052	0.990	550	2008-05-16	2.435	0.052
621	2008-09-05	-2.512	0.957	580	2008-07-01	2.196	0.090
622	2008-09-08	-2.497	0.956
627	2008-09-15	-2.385	0.942
638	2008-09-30	-2.923	0.986

Table 5: Outliers identified by the weighted $M-$ estimator with weights based on the Mahalanobis distance of the projected data and a hard rejection weight function.
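The detection rule based on the deviance residuals can be sketched in a few lines of Python. The sketch rests on one reading of $\widehat{F}_{D}$ that should be taken as an assumption: under the fitted model, the $i$ -th residual equals $\sqrt{-2\log\widehat{p}_{i}}$ with probability $\widehat{p}_{i}$ and $-\sqrt{-2\log(1-\widehat{p}_{i})}$ with probability $1-\widehat{p}_{i}$ , and $\widehat{F}_{D}$ averages these two-point distributions over the sample. The function names, the clipping constant and the toy data are hypothetical.

```python
import math

EPS = 1e-12  # clipping constant, keeps log() finite when a fitted value is 0 or 1

def clip(p):
    return min(max(p, EPS), 1.0 - EPS)

def deviance_residual(y, p):
    """Signed deviance residual for a binary response y and fitted probability p."""
    p = clip(p)
    dev = -2.0 * ((1 - y) * math.log(1.0 - p) + y * math.log(p))
    return math.copysign(math.sqrt(dev), y - p)

def fd_atoms(p_hat):
    """Atoms and masses of the estimated residual distribution, sorted by atom."""
    n = len(p_hat)
    atoms = []
    for p in map(clip, p_hat):
        atoms.append((math.sqrt(-2.0 * math.log(p)), p / n))
        atoms.append((-math.sqrt(-2.0 * math.log(1.0 - p)), (1.0 - p) / n))
    return sorted(atoms)

def fd_quantile(p_hat, level):
    """Smallest atom d whose cumulative mass reaches the given level."""
    atoms = fd_atoms(p_hat)
    cum = 0.0
    for d, mass in atoms:
        cum += mass
        if cum >= level - 1e-12:
            return d
    return atoms[-1][0]

def flag_outliers(y, p_hat, lower=0.005, upper=0.995):
    """Indices whose residual lies outside the [lower, upper] quantile range."""
    lo, hi = fd_quantile(p_hat, lower), fd_quantile(p_hat, upper)
    return [
        i for i, (yi, pi) in enumerate(zip(y, p_hat))
        if not lo <= deviance_residual(yi, pi) <= hi
    ]

# Toy data: 99 well-fitted observations and one response at odds with its fit.
p_hat = [0.9] * 50 + [0.1] * 49 + [0.001]
y = [1] * 50 + [0] * 49 + [1]
flagged = flag_outliers(y, p_hat)
```

In the toy data only the last observation, with $y=1$ and fitted probability 0.001, falls outside the two quantiles.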

It is worth mentioning that the observation corresponding to November 7, 2006, which is an outlier in the covariate space, is not detected as such by the deviance QQ-plot, since it does not correspond to a bad leverage point.

The covariates related to the detected atypical observations are presented in Figure 28 with dashed black lines. The left panel corresponds to the covariates with $y=0$ and the right one to those with $y=1$ , where we also indicate the curve corresponding to the 7th of November, which clearly appears as an outlier in the covariate space.

We next compute the classical estimator after removing the identified outliers. The obtained value for the estimate of $\alpha_{0}$ is -1.852, a value closer to the one obtained when using the wm estimator, as reported in Table 4. Figure 29 presents the obtained estimator in dotted magenta lines. To facilitate comparisons, we present in the left panel all the estimators and in the right one only the weighted $M-$ estimate, with a solid black line, and the classical one computed with the whole data set and without the outliers, with a red solid line and a dotted magenta one, respectively. Note the similarity of the curve obtained by the wm method and the estimate obtained when using the cl method after removing the atypical observations.

6 Conclusion

This paper addresses the problem of providing robust estimators for the functional logistic regression model. We tackle the problem by basis dimension reduction, that is, we consider a finite–dimensional space of candidates and, once the reduction is made, we use weighted $M-$ estimators to down–weight the effect of bad leverage covariates. In this sense, our estimators are robust against outliers corresponding to misclassified points, and also to outliers in the functional explanatory variables that have a damaging effect on the estimation. We also propose a robust BIC-type criterion to select the optimal dimension of the spline basis, which seems to work well in practice.

Regarding the asymptotic behaviour of our proposal, we prove that the estimators are strongly consistent, under a very general framework which allows the practitioner to choose different bases such as $B-$ splines, Bernstein and Legendre polynomials, Fourier bases or wavelet ones. The convergence rates obtained depend on the capability of the basis to approximate the true slope function. It is worth mentioning that, unlike the weak consistency results obtained in Kalogridis, (2023) for penalized estimators based on divergence measures, our results do not require the covariates to be bounded. Moreover, strong consistency rather than convergence in probability of the sieve based weighted $M-$ estimators is derived.

The performance in finite samples is studied by means of a simulation study and in a real data example. The simulation study reveals the good robustness properties of our proposal and the high sensitivity of the procedure based on the deviance when atypical observations are present. In particular, weighted $M-$ estimators show their advantage over the unweighted ones, a fact already observed by Croux and Haesbroeck, (2003) when finite–dimensional covariates instead of functional ones are used in the model.

We also apply our method to a real data set where the weighted $M-$ estimator remains reliable even in the presence of misclassified observations corresponding to atypical functional explanatory variables. The deviance residuals obtained from the robust fit provide a natural way to identify potential atypical observations. The classical estimators obtained by minimizing the deviance after dimension reduction and computed over the “cleaned” data set, that is, the data set without the outliers, present a similar shape to those obtained by the robust procedure, which automatically down–weights these data.

As has been extensively discussed, functional data are intrinsically infinite–dimensional and, even when recorded at a finite grid of points, they should be considered as random elements of some functional space rather than multivariate observations. For instance, in some applications, the functional covariates may be viewed as longitudinal data, while in other cases, the predictors are densely observed curves. Examples of these situations are the Mediterranean fruit flies data studied in Müller and Stadtmüller, (2005), for the former, and the Tecator data set studied in Ferraty and Vieu, (2006) or the Danone data set reported in Aguilera-Morillo et al., (2013), for the latter. In more challenging cases, the grid at which the predictors are recorded may be sparse or irregular and the observations may be subject to measurement errors. The proposals considered in James, (2002) or Müller, (2005) for generalized functional regression models allow for sparse recording and measurement errors. Even though our proposal can be implemented when the predictors are observed on a dense grid of points, the asymptotic results derived make use of the whole process structure. The interesting situation of sparse and irregular time grids, which is common, for instance, in longitudinal studies, is beyond the scope of the paper and may be the object of future research.

7 Acknowledgements.

This research was partially supported by Universidad de Buenos Aires [Grant 20020170100022BA] and ANPCYT [Grant PICT 2021-I-A-00260], Argentina (Graciela Boente and Marina Valdora) and by the Ministry of Economy, Industry and Competitiveness, Spain (MINECO/AEI/FEDER, UE) [Grant MTM2016-76969P] (Graciela Boente).

A Appendix

From now on, we denote $\nu(t)$ the function

\nu(t)=\psi\left(-\log F(t)\right)\left[1-F(t)\right]+\psi\left(-\log\left[1-F(t)\right]\right)F(t)\,.

(8)

Under A1 and A2, the function $\nu(\cdot)$ is continuous, bounded and strictly positive.

Denote $\Psi(y,t)={\partial}\phi(y,t)/{\partial t}=-[y-F(t)]\nu(t)$ where $\nu(t)$ is given by (8). Under A1 and A2, the function $\Psi(y,\cdot)$ is continuous and bounded and we will denote $M_{\phi}=\sup_{y\in\{0,1\},t\in\mathbb{R}}\phi(y,t)$ and $M_{\Psi}=\sup_{y\in\{0,1\},t\in\mathbb{R}}|\Psi(y,t)|$ , where we have used that $\phi(y,t)\geq 0$ from assumptions A1 and A2.

Henceforth, for any $(\alpha,\beta)\in\mathbb{R}\times L^{2}([0,1])$ , we denote $\theta=(\alpha,\beta)$ . Besides, $L(\theta)$ stands for the population counterpart of $L_{n}(\theta)$ , that is,

L(\theta)=\mathbb{E}\left(\phi\left(y,\alpha+\langle X,\beta\rangle\right)w(X)\right)\,,

where $y|X\sim Bi\left(1,F(\alpha_{0}+\langle X,\beta_{0}\rangle)\right)$ . Recall that we have denoted $\widehat{\theta}=(\widehat{\alpha},\widehat{\beta})$ and $\theta_{0}=(\alpha_{0},\beta_{0})$ .

A.1 Some preliminary results

Lemma A.1 states the Fisher–consistency of the estimators defined through (5). It follows using similar arguments as those considered in Theorem S.1.1 from Bianco et al., (2022). Note that Lemma A.1(a) only states that the true value $\theta_{0}$ is one of the minimizers of the function $L(\theta)$ , while in part (b) to derive that it is the unique minimizer an additional requirement, (9), is needed. As mentioned in Remark 3.2, this condition is related to the fact that the slope parameter is not identifiable if the kernel of the covariance operator of $X$ does not reduce to $\{0\}$ . Instead of asking (9) to hold for any $\beta\in L^{2}([0,1])$ , we require that the condition holds only for $\beta\in{\mathcal{H}}^{\star}$ whenever $\beta_{0}\in{\mathcal{H}}^{\star}$ , where henceforth, ${\mathcal{H}}^{\star}={\mathcal{H}}$ when $\beta_{0}\in{\mathcal{H}}$ , while ${\mathcal{H}}^{\star}={\mathcal{W}}^{1,{\mathcal{H}}}$ when $\beta_{0}\in{\mathcal{W}}^{1,{\mathcal{H}}}$ . It is worth mentioning that under A5 and A11, condition (9) is fulfilled.

Lemma A.1.

Suppose $(y,X)$ follows the functional logistic regression model given in (1). Let $\phi:\mathbb{R}^{2}\to\mathbb{R}$ be the function given by (2), where $\rho$ satisfies conditions A1 and A2 and let $w$ be a non–negative bounded function.

a)

For any $\theta=(\alpha,\beta)\in\mathbb{R}\times L^{2}([0,1])$ , we have that $L(\theta_{0})\leq L(\theta)$ .

b)

Furthermore, assume that $\beta_{0}\in{\mathcal{H}}^{\star}$ and

\mathbb{P}\left(\{\langle X,\beta\rangle=a\}\cup\{w(X)=0\}\right)<1,\qquad\forall a\in\mathbb{R},\beta\in{\mathcal{H}}^{\star},(a,\beta)\neq 0\,,

(9)

holds. Then, for all $\alpha\in\mathbb{R}$ , $\beta\in{\mathcal{H}}^{\star}$ , $\theta=(\alpha,\beta)\neq(\alpha_{0},\beta_{0})$ , we have $L(\theta_{0})<L(\theta)$ .

Proof.

(a) As in Theorem 2.2 in Bianco and Yohai, (1996), taking conditional expectation, we have that

\begin{aligned}
L(\theta)&=\mathbb{E}\left[\mathbb{E}\left(\phi(y,\alpha+\langle X,\beta\rangle)w(X)\,|\,X\right)\right]=\mathbb{E}\left\{\phi(F(\alpha_{0}+\langle X,\beta_{0}\rangle),\alpha+\langle X,\beta\rangle)w(X)\right\}\\
&=\mathbb{E}\left\{\phi(F(R_{0}(X)),R(X))w(X)\right\}\,,
\end{aligned}

where for a fixed value $X$ , we have denoted $R(X)=\alpha+\langle X,\beta\rangle$ and $R_{0}(X)=\alpha_{0}+\langle X,\beta_{0}\rangle$ .

We will show that, for any fixed $t_{0}$ , the function $\phi(F(t_{0}),t)$ reaches its unique minimum when $t=t_{0}$ . For simplicity, denote $\Phi(t)=\phi(F(t_{0}),t)$ , then, $\Phi^{\prime}(t)=\Psi(F(t_{0}),t)=\,-\,(F(t_{0})-F(t))\nu(t)$ , where $\nu(t)$ defined in (8) is positive. Hence, $\Phi^{\prime}(t_{0})=0$ . Furthermore, $\Phi^{\prime}(t)>0$ , for $t>t_{0}$ , and $\Phi^{\prime}(t)<0$ for $t<t_{0}$ which entails that $\Phi$ has a unique minimum at $t_{0}$ .

Hence, for any $R(X)\neq R_{0}(X)$ , $\phi(F(R_{0}(X)),R_{0}(X))<\phi(F(R_{0}(X)),R(X))$ , that is,

\phi(F(\alpha_{0}+\langle X,\beta_{0}\rangle),\alpha_{0}+\langle X,\beta_{0}\rangle)<\phi(F(\alpha_{0}+\langle X,\beta_{0}\rangle),\alpha+\langle X,\beta\rangle)\,,

(10)

for any $\langle X,(\beta-\beta_{0})\rangle+\alpha-\alpha_{0}\neq 0$ which concludes the proof of a).

The proof of b) follows immediately from (10) since (9) holds and $\beta-\beta_{0}\in{\mathcal{H}}^{\star}$ . ∎
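The key step of the proof, namely that $\Phi(t)=\phi(F(t_{0}),t)$ is minimized only at $t=t_{0}$ , can be checked numerically in a simple special case. The Python sketch below is an illustration under assumptions that go beyond the text: it takes the classical choice $\rho(t)=t$ with $\psi=\rho^{\prime}\equiv 1$ , so that $\nu\equiv 1$ in (8) and $\Phi$ reduces, up to an additive constant, to the expected negative log-likelihood, and it uses the logistic $F$ . The value of $t_{0}$ and the search grid are hypothetical.

```python
import math

def logistic(t):
    """Logistic distribution function F(t)."""
    return 1.0 / (1.0 + math.exp(-t))

def big_phi(t, t0):
    """Phi(t) = phi(F(t0), t) for rho(t) = t, up to an additive constant."""
    p0 = logistic(t0)
    p = logistic(t)
    return -p0 * math.log(p) - (1.0 - p0) * math.log(1.0 - p)

def big_phi_prime(t, t0):
    """Derivative -(F(t0) - F(t)) * nu(t), with nu(t) = 1 in this special case."""
    return logistic(t) - logistic(t0)

def argmin_on_grid(t0, lo=-5.0, hi=5.0, steps=2001):
    """Grid point in [lo, hi] at which Phi(., t0) is smallest."""
    grid = [lo + (hi - lo) * j / (steps - 1) for j in range(steps)]
    return min(grid, key=lambda t: big_phi(t, t0))

t0 = 1.5  # hypothetical value of the true linear predictor
t_min = argmin_on_grid(t0)
```

The grid minimizer coincides with $t_{0}$ , and the derivative is negative to the left of $t_{0}$ and positive to the right, in line with the sign argument used in the proof.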

Lemma A.3 provides an improvement over Lemma A.1 and will be helpful to derive Theorem 3.1. For its proof, we need the following Lemma which corresponds to Lemma 3 in Bianco et al., (2023). We state it here for completeness.

Lemma A.2.

Let $\Lambda(p,p_{0})$ be defined as

\Lambda(p,p_{0})=p_{0}\rho(-\log p)+(1-p_{0})\rho(-\log(1-p))+G(p)+G(1-p)\,

(11)

for $(p,p_{0})\in(0,1)\times[0,1]$ and assume that assumption A1 holds.

(a)

If in addition A2 holds, then the function $\Lambda(p,p_{0})$ has a unique minimum at $p=p_{0}$ . Furthermore, $\Lambda(p,p_{0})$ can be extended to a continuous function on $[0,1]\times[0,1]$ .
(b)

Assume that, in addition, A3 holds. Then there exists a positive constant $C_{0}$ such that for each $0<p<1$ , $\Lambda(p,p_{0})-\Lambda(p_{0},p_{0})\geq C_{0}\,(p-p_{0})^{2}$ .

The proof of Lemma A.3 is now a direct consequence of the previous result.

Lemma A.3.

Assume that assumptions A1 and A3 hold. Then, there exists a constant $C_{0}>0$ such that $L(\theta)-L(\theta_{0})\geq C_{0}\,\pi_{\mathbb{P}}^{2}(\theta,\theta_{0})$ .

Proof.

For any $\theta=(\alpha,\beta)$ , denote $p(X)=F(\alpha+\langle X,\beta\rangle)$ and let $p_{0}(X)=F(\alpha_{0}+\langle X,\beta_{0}\rangle)$ .

Using that $y|X\sim Bi\left(1,p_{0}(X)\right)$ , (3) and (11), taking conditional expectation, we immediately get that $L(\theta)=\mathbb{E}\left\{w(X)\,\Lambda\left(p(X),p_{0}(X)\right)\right\}$ . Lemma A.2(b) implies that

\Lambda\left(p(X),p_{0}(X)\right)-\Lambda\left(p_{0}(X),p_{0}(X)\right)\geq C_{0}\,\left\{p(X)-p_{0}(X)\right\}^{2}=C_{0}\left\{F(\alpha+\langle X,\beta\rangle)-F(\alpha_{0}+\langle X,\beta_{0}\rangle)\right\}^{2}\,,

which entails

L(\theta)-L(\theta_{0})\geq C_{0}\mathbb{E}\left[w(X)\left\{F(\alpha+\langle X,\beta\rangle)-F(\alpha_{0}+\langle X,\beta_{0}\rangle)\right\}^{2}\right]=C_{0}\,\,\pi_{\mathbb{P}}^{2}(\theta,\theta_{0})\,,

and concludes the proof. ∎
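The type of quadratic lower bound used above can be illustrated numerically in a familiar special case. The sketch below assumes the classical choice $\rho(t)=t$ and, as a further assumption about the terms $G(p)+G(1-p)$ in (11), that they add up to a constant in that case, so that $\Lambda(p,p_{0})-\Lambda(p_{0},p_{0})$ is the Kullback–Leibler divergence between two Bernoulli laws. For that divergence, Pinsker's inequality gives the bound with constant 2. This is not a verification of A1 to A3, which this loss need not satisfy, only an example of the inequality's shape; the grid is hypothetical.

```python
import math

def excess_risk(p, p0):
    """Lambda(p, p0) - Lambda(p0, p0) for rho(t) = t: Bernoulli KL divergence."""
    def term(a, b):
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return term(p0, p) + term(1.0 - p0, 1.0 - p)

# Smallest ratio excess_risk / (p - p0)^2 over a grid of probabilities.
grid = [j / 100 for j in range(1, 100)]
worst_ratio = min(
    excess_risk(p, p0) / (p - p0) ** 2
    for p in grid
    for p0 in grid
    if p != p0
)
```

On this grid the smallest ratio stays just above 2, attained for pairs of probabilities close to 1/2.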

From now on, for any measure $\mathbb{Q}$ and class of functions ${\mathcal{F}}$ , $N(\epsilon,{\mathcal{F}},L_{s}(\mathbb{Q}))$ and $N_{[\;]}(\epsilon,{\mathcal{F}},L_{s}(\mathbb{Q}))$ will denote the covering and bracketing numbers of the class ${\mathcal{F}}$ with respect to the distance in $L_{s}(\mathbb{Q})$ , as defined, for instance, in van der Vaart and Wellner, (1996). Furthermore, we will make use of the empirical process $\mathbb{G}_{n}f=\sqrt{n}(P_{n}-P)f$ , where $P_{n}$ stands for the empirical probability measure of $(y_{i},X_{i})$ , $1\leq i\leq n$ , and $P$ is the probability measure corresponding to $(y,X)$ , which follows the functional logistic regression model (1).

We first derive a Glivenko–Cantelli result for classes of functions depending on $n$ , which will be helpful to derive Lemma A.5(b) and which is a slight modification of Theorem 37 in Pollard, (1984), see also Lemma 2.3.3 in van de Geer, (1988).

Lemma A.4.

Let $z_{1},\dots,z_{n}$ , be i.i.d. random elements in a metric space ${\mathcal{X}}$ and ${\mathcal{F}}_{n}=\{f:{\mathcal{X}}\to\mathbb{R}\}$ be the class of bounded functions depending on $n$ , that is, for some positive constant $M$ , $|f|\leq M$ , for any $f\in{\mathcal{F}}_{n}$ and assume that for any $0<\epsilon<1$ , there exists some constant $C>1$ independent of $n$ and $\epsilon$ , such that

\log N(M\,\epsilon,{\mathcal{F}}_{n},L_{1}(P_{n}))\leq Cq_{n}\log\left(\frac{1}{\epsilon}\right)\;,

where $q_{n}$ is a non–random sequence of numbers such that $q_{n}/n\to 0$ . Then, we have that

\sup_{f\in{\mathcal{F}}_{n}}\left|(P_{n}-P)(f)\right|\buildrel a.s.\over{\longrightarrow}0\,.

Proof.

First of all, note that without loss of generality we may assume that $M=1$ , in which case

\mbox{\sc Var}\left(P_{n}f\right)=\frac{1}{n}\mbox{\sc Var}\left(f(z_{1})\right)\leq\frac{1}{n}\mathbb{E}\left(f^{2}(z_{1})\right)\leq\frac{1}{n}\,,

implying that for $n\geq 1/(8\epsilon^{2})$ , we have that

\frac{\mbox{\sc Var}\left(P_{n}f\right)}{\left(4\,\epsilon\right)^{2}}\leq\frac{1}{2}\,.

Thus, using Lemma 8 in Pollard, (1984), we obtain the inequality

\mathbb{P}\left(\sup_{f\in{\mathcal{F}}_{n}}\left|(P_{n}-P)(f)\right|>8\,\epsilon\right)\leq 4\mathbb{P}\left(\sup_{f\in{\mathcal{F}}_{n}}\left|P_{n}^{0}f\right|>2\,\epsilon\right)

where $P_{n}^{0}f=(1/n)\sum_{i=1}^{n}\xi_{i}f(z_{i})$ and $\{\xi_{i}\}_{i=1}^{n}$ is a Rademacher sequence independent of $\{z_{i}\}_{i=1}^{n}$ , that is, $\xi_{1},\dots\xi_{n}$ are i.i.d. $\mathbb{P}(\xi_{i}=1)=\mathbb{P}(\xi_{i}=-1)=1/2$ .

The definition of the covering number allows us to choose functions $g_{j}$ , $1\leq j\leq N=N(\epsilon,{\mathcal{F}}_{n},L_{1}(P_{n}))$ , such that for any $f\in{\mathcal{F}}_{n}$ , $\min_{j}P_{n}|f-g_{j}|\leq\epsilon$ . Denote by $j(f)$ the index where the minimum is attained. Without loss of generality, we may assume that $g_{j}\in{\mathcal{F}}_{n}$ , so that $|g_{j}|\leq 1$ . Thus, the approximation argument and Hoeffding's inequality lead to

\begin{aligned}
\mathbb{P}\left(\sup_{f\in{\mathcal{F}}_{n}}\left|P_{n}^{0}f\right|>2\,\epsilon\,\Big|\,\{z_{1},\dots,z_{n}\}\right)&\leq\mathbb{P}\left(\sup_{f\in{\mathcal{F}}_{n}}\left|P_{n}^{0}g_{j(f)}\right|+P_{n}\left|f-g_{j(f)}\right|>2\,\epsilon\,\Big|\,\{z_{1},\dots,z_{n}\}\right)\\
&\leq\mathbb{P}\left(\max_{1\leq j\leq N}\left|P_{n}^{0}g_{j}\right|>\,\epsilon\,\Big|\,\{z_{1},\dots,z_{n}\}\right)\\
&\leq N\max_{1\leq j\leq N}\mathbb{P}\left(\left|P_{n}^{0}g_{j}\right|>\,\epsilon\,\Big|\,\{z_{1},\dots,z_{n}\}\right)\\
&\leq 2N(\epsilon,{\mathcal{F}}_{n},L_{1}(P_{n}))\exp\left\{-\,\frac{n\epsilon^{2}}{2}\right\}\leq 2\exp\left\{-\,\frac{n\epsilon^{2}}{2}+Cq_{n}\log\left(\frac{1}{\epsilon}\right)\right\}\,.
\end{aligned}

Hence, using that $q_{n}/n\to 0$ , we have that for $n$ large enough

C\,\frac{q_{n}}{n}\log\left(\frac{1}{\epsilon}\right)<\frac{\epsilon^{2}}{4}\,,

which implies that, for $n$ large enough,

\mathbb{P}\left(\sup_{f\in{\mathcal{F}}_{n}}\left|(P_{n}-P)(f)\right|>8\,\epsilon\right)\leq 8\exp\left\{-\,\frac{n\epsilon^{2}}{4}\right\}\,,

so $\sum_{n\geq 1}\mathbb{P}\left(\sup_{f\in{\mathcal{F}}_{n}}\left|(P_{n}-P)(f)\right|>8\,\epsilon\right)<\infty$ and the result follows now from the Borel–Cantelli lemma. ∎
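The Hoeffding step of the proof bounds the conditional tail of a Rademacher average $P_{n}^{0}g$ with $|g|\leq 1$ by $2\exp\{-n\epsilon^{2}/2\}$ . As a sanity check, the following Python sketch compares this bound with the exact tail probability in the simplest case $g\equiv 1$ , where $n\,P_{n}^{0}g$ equals $2B-n$ with $B$ a binomial variable with parameters $n$ and $1/2$ . The choice $g\equiv 1$ and the values of $n$ and $\epsilon$ are hypothetical and serve only as an illustration.

```python
import math

def rademacher_tail(n, eps):
    """Exact P(|n^{-1} * sum of xi_i| > eps) for i.i.d. Rademacher signs xi_i."""
    # The sum of the signs equals 2 * B - n with B ~ Binomial(n, 1/2).
    favourable = sum(
        math.comb(n, b) for b in range(n + 1) if abs(2 * b - n) > n * eps
    )
    return favourable / 2 ** n

def hoeffding_bound(n, eps):
    """Upper bound 2 * exp(-n * eps^2 / 2) used in the proof."""
    return 2.0 * math.exp(-n * eps ** 2 / 2.0)

# Exact tail versus bound for a few hypothetical sample sizes and thresholds.
comparison = [
    (n, eps, rademacher_tail(n, eps), hoeffding_bound(n, eps))
    for n in (10, 50, 200)
    for eps in (0.1, 0.3, 0.5)
]
```

In every case the exact tail lies below the bound, and both decay exponentially in $n$ for fixed $\epsilon$ , which is what makes the series in the Borel–Cantelli argument summable.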

To avoid burdensome notation, when there is no doubt we will write $k$ instead of $k_{n}$ . To derive uniform results, Lemma A.5(a) below provides a bound for the covering number of the class of functions

{\mathcal{F}}_{n}=\left\{f(y,X)=\phi\left(y,\alpha+\langle X,\beta\rangle\right)\,w(X),\beta\in{\mathcal{M}}_{k},\alpha\in\mathbb{R}\right\}\,.

(12)

Its proof relies on providing a bound for the Vapnik-Chervonenkis dimension of the set ${\mathcal{F}}_{n}$ , which follows from Lemma S.2.2 in Bianco et al., (2022). Besides, Lemma A.5(b) shows that $L_{n}(\theta)$ converges to $L(\theta)$ with probability one, uniformly over $\theta=(\alpha,\beta)\in\mathbb{R}\times{\mathcal{M}}_{k}$ . Its proof uses standard empirical process arguments. This uniform law of large numbers will be crucial to obtain consistency results for our proposal as given in Theorems 3.1(a) and 3.2.

Lemma A.5.

Let $\rho$ be a function satisfying A1 and A2, and $w$ a weight function satisfying A5. Let ${\mathcal{F}}_{n}$ be the class of functions given in (12). Then,

(a)

for any measure $\mathbb{Q}$ , $0<\epsilon<1$ , there exists some constant $C>1$ independent of $n$ and $\epsilon$ , such that

N(M_{\phi}\,\epsilon,{\mathcal{F}}_{n},L_{1}(\mathbb{Q}))\leq Cq_{n}(16\,e)^{q_{n}}\left(\frac{1}{\epsilon}\right)^{q_{n}-1}\;,

where $q_{n}=2k_{n}+6$ and $M_{\phi}=\sup_{y,t}|\phi(y,t)|$ .

(b)

If in addition, A8 holds, we have $\sup_{\theta\in\mathbb{R}\times{\mathcal{M}}_{k}}\left|L_{n}(\theta)-L(\theta)\right|\buildrel a.s.\over{\longrightarrow}0$ .
(c)

There exists a constant $C^{\star}$ that does not depend on $n$ nor $k_{n}$ such that

$\mathbb{E}\left\{\sup_{f\in{\mathcal{F}}_{n}}\left|(P_{n}-P)(f)\right|\right\}\leq C^{\star}\sqrt{\frac{k_{n}}{n}}\,,$ (13)

which entails that $\sup_{f\in{\mathcal{F}}_{n}}\left|(P_{n}-P)(f)\right|=O_{\mathbb{P}}\left(\sqrt{{k_{n}}/{n}}\right)$ .

Proof.

(a) First of all note that since A1 and A5 hold, we have that $\phi(y,t)$ is bounded and $w(X)$ is a bounded function with $\|w\|_{\infty}=1$ , hence ${\mathcal{F}}_{n}$ has envelope $M_{\phi}$ .

Lemma S.2.2 from Bianco et al., (2022) implies that the class of functions

{\mathcal{G}}_{n}=\left\{g(y,\mathbf{x})=\phi\left(y,\alpha+\mathbf{x}^{\mbox{\footnotesize\sc t}}\mathbf{b}\right),\mathbf{b}\in\mathbb{R}^{k},\alpha\in\mathbb{R}\right\}\,,

where $\mathbf{x}=(\langle X,B_{1}\rangle,\dots,\langle X,B_{k}\rangle)^{\mbox{\footnotesize\sc t}}$ , is VC-subgraph with index smaller than or equal to $2(k+1)+4=2k+6$ , since we now include an intercept term $\alpha\in\mathbb{R}$ .

Note that for any $f\in{\mathcal{F}}_{n}$ we have $f(y,X)=\phi\left(y,\alpha+\langle X,\beta\rangle\right)w(X)=\phi\left(y,\alpha+\mathbf{x}^{\mbox{\footnotesize\sc t}}\mathbf{b}\right)w(X)=g(y,\mathbf{x})w(X)$ , where $\beta=\beta_{\mathbf{b}}$ and $g\in{\mathcal{G}}_{n}$ . Hence, using the permanence property of VC-classes when multiplying by a fixed function, in this case $w(X)$ , we conclude that $V({\mathcal{F}}_{n})\leq 2k+6$ and the result follows now from Theorem 2.6.7 in van der Vaart and Wellner, (1996).

(b) From (a), using that $\log(2k_{n}+6)/(2k_{n}+6)<1$ and assuming that $C>1$ we get that

\begin{aligned}
\sup_{\mathbb{Q}}\log\left(N\left(M_{\phi}\,\epsilon,{\mathcal{F}}_{n},L_{1}(\mathbb{Q})\right)\right)&\leq\log C+\log(2k_{n}+6)+(2k_{n}+6)\log(16e)+(2k_{n}+5)\log\left(\frac{1}{\epsilon}\right)\\
&\leq(2k_{n}+6)\left[\log C+1+\log(16e)+\log\left(\frac{1}{\epsilon}\right)\right]\,.
\end{aligned}

Hence, for any $\epsilon<\min(e^{-C},(16e)^{-1},1/e)$ we have that

\frac{1}{n}\log\left(N\left(M_{\phi}\,\epsilon,{\mathcal{F}}_{n},L_{1}(P_{n})\right)\right)\leq 3\,\frac{2k_{n}+6}{n}\log\left(\frac{1}{\epsilon}\right)\to 0\,,

since A8 holds. Therefore, using Lemma A.4, we conclude that

\sup_{\theta\in\mathbb{R}\times{\mathcal{M}}_{k}}\left|L_{n}(\theta)-L(\theta)\right|\buildrel a.s.\over{\longrightarrow}0\,,

completing the proof of (b).

(c) As in (a), using that $V({\mathcal{F}}_{n})\leq q_{n}=2k_{n}+6$ and Theorem 2.6.7 in van der Vaart and Wellner, (1996), we deduce that there exists some constant $C_{0}>1$ independent of $n$ and $\epsilon$ , such that

N(M_{\phi}\,\epsilon,{\mathcal{F}}_{n},L_{2}(\mathbb{Q}))\leq C_{0}q_{n}(16\,e)^{q_{n}}\left(\frac{1}{\epsilon}\right)^{2q_{n}-2}=C_{0}(2k_{n}+6)(16\,e)^{2k_{n}+6}\left(\frac{1}{\epsilon}\right)^{4k_{n}+10}\;,

(14)

for any measure $\mathbb{Q}$ , $0<\epsilon<1$ . Theorem 2.14.1 in van der Vaart and Wellner, (1996) allows us to conclude that, for some universal constant $C_{1}>0$ ,

\mathbb{E}\left\{\sup_{f\in{\mathcal{F}}_{n}}\left|\sqrt{n}(P_{n}-P)(f)\right|\right\}\leq C_{1}\;M_{\phi}\sup_{\mathbb{Q}}\int_{0}^{1}\sqrt{1+\log N\left(M_{\phi}\,\epsilon,{\mathcal{F}}_{n},L_{2}(\mathbb{Q})\right)}d\epsilon\,,

where the supremum is taken over all discrete probability measures $\mathbb{Q}$ . Using (14) and that $\log t\leq t$ for $t\geq 1$ and denoting $C_{2}=\log(C_{0})+1+\log(16\;e)>1$ , we get that

\begin{aligned}
\sqrt{1+\log N\left(M_{\phi}\,\epsilon,{\mathcal{F}}_{n},L_{2}(\mathbb{Q})\right)}&\leq\sqrt{1+C_{2}(2k_{n}+6)+(4k_{n}+10)\log\left(\frac{1}{\epsilon}\right)}\\
&\leq\sqrt{1+16k_{n}C_{2}+16k_{n}\log\left(\frac{1}{\epsilon}\right)}\,,
\end{aligned}

where we have used that $2k_{n}+6\leq 4k_{n}+10\leq 16k_{n}$ . Let $C_{4}=4\,C_{1}\;M_{\phi}\;C_{3}$ with $C_{3}=C_{2}+1$ . Then, we obtain that

\mathbb{E}\left[\sup_{f\in{\mathcal{F}}_{n}}\left|\sqrt{n}(P_{n}-P)(f)\right|\right]\leq\sqrt{k_{n}}\;C_{4}\,\int_{0}^{1}\sqrt{1+\log\left(\frac{1}{\epsilon}\right)}d\epsilon\,,

which entails (13), since $\int_{0}^{1}\sqrt{1-\log(\epsilon)}\,d\epsilon<\infty$ . Markov’s inequality immediately leads to $\sup_{f\in{\mathcal{F}}_{n}}\left|(P_{n}-P)(f)\right|=O_{\mathbb{P}}\left(\sqrt{{k_{n}}/{n}}\right)$ , concluding the proof. ∎
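The finiteness of the entropy integral used in the last step can be made explicit. The substitution $u=1-\log(\epsilon)$ turns it into an incomplete gamma integral, which gives the closed form $1+(e\sqrt{\pi}/2)\,\mathrm{erfc}(1)$ , roughly 1.379. The Python sketch below compares this value with a plain midpoint-rule approximation; the number of cells is an arbitrary choice and the code is meant as a numerical sanity check only.

```python
import math

def entropy_integral(cells=100_000):
    """Midpoint-rule approximation of the integral of sqrt(1 - log(eps)) on (0, 1)."""
    h = 1.0 / cells
    return h * sum(
        math.sqrt(1.0 - math.log((j + 0.5) * h)) for j in range(cells)
    )

# Closed form obtained with the substitution u = 1 - log(eps):
# e * Gamma(3/2, 1) = 1 + (e * sqrt(pi) / 2) * erfc(1).
closed_form = 1.0 + 0.5 * math.e * math.sqrt(math.pi) * math.erfc(1.0)
numeric = entropy_integral()
```

The two values agree to several decimal places, so the integral only contributes a constant to the bound (13).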

The following result is needed to prove the convergence rates stated in Theorem 3.3. From now on, $\Theta_{n}^{\star}=\mathbb{R}\times{\mathcal{M}}_{k_{n}}$ .

Lemma A.6.

Let $\rho$ be a function satisfying A1 and A2, and $w$ a weight function satisfying A5 and A6b. Given $\widetilde{\beta}_{0}\in{\mathcal{M}}_{k}$ , define $\widetilde{\theta}_{0}=(\alpha_{0},\widetilde{\beta}_{0})$ and the class of functions

{\mathcal{G}}_{n,c,\widetilde{\theta}_{0},\theta_{0}^{*}}=\{f_{\theta}=V_{\theta}-V_{\theta_{0}^{*}}:\theta\in\Theta_{n}^{\star}\quad\mbox{and}\quad|\alpha-\alpha_{0}|+\|\beta-\widetilde{\beta}_{0}\|_{{\mathcal{H}}}\leq c\}\,,

where $\theta_{0}^{*}=(\alpha_{0}^{*},\beta_{0}^{*})$ is a fixed element in $\mathbb{R}\times{\mathcal{H}}$ , and $V_{\theta}=\phi\left(y,\alpha+\langle X,\beta\rangle\right)w(X)$ , for $\theta=(\alpha,\beta)$ . Then, there exists some constant $A>0$ independent of $n$ , $\widetilde{\theta}_{0}$ and $\epsilon$ , such that

N_{[\;]}(\epsilon,{\mathcal{G}}_{n,c,\widetilde{\theta}_{0},\theta_{0}^{*}},L_{2}(P))\leq\left(\frac{Ac}{\epsilon}+1\right)^{k+1}\,.

Proof.

First note that $|f_{\theta}|\leq 2M_{\phi}$ and denote $\widetilde{\Theta}_{n,c}=\{\theta\in\Theta_{n}^{\star}:|\alpha-\alpha_{0}|+\|\beta-\widetilde{\beta}_{0}\|_{{\mathcal{H}}}\leq c\}$ . Note that since $\|\beta-\widetilde{\beta}_{0}\|\leq\|\beta-\widetilde{\beta}_{0}\|_{{\mathcal{H}}}$ , $\widetilde{\Theta}_{n,c}\subset{\mathcal{I}}_{c}\times{\mathcal{B}}_{n,c}$ , where ${\mathcal{I}}_{c}=[\alpha_{0}-c,\alpha_{0}+c]$ and ${\mathcal{B}}_{n,c}=\{\beta\in{\mathcal{M}}_{k}:\|\beta-\widetilde{\beta}_{0}\|\leq c\}$ .

For any $\theta_{\ell}\in\widetilde{\Theta}_{n,c}$ , $\ell=1,2$ we have that

\begin{aligned}
|f_{\theta_{1}}-f_{\theta_{2}}|&=\left|V_{\theta_{1}}-V_{\theta_{2}}\right|=\left|\phi\left(y,\alpha_{1}+\langle X,\beta_{1}\rangle\right)-\phi\left(y,\alpha_{2}+\langle X,\beta_{2}\rangle\right)\right|w(X)\\
&\leq M_{\Psi}\left|\alpha_{1}-\alpha_{2}+\langle X,\beta_{1}-\beta_{2}\rangle\right|w(X)\\
&\leq M_{\Psi}\left\{\left|\alpha_{1}-\alpha_{2}\right|+w(X)\|X\|\;\|\beta_{1}-\beta_{2}\|\right\}\,,
\end{aligned}

(15)

where we recall that $M_{\Psi}=\sup_{y\in\{0,1\},t\in\mathbb{R}}|\Psi(y,t)|$ . Hence,

\mathbb{E}\left(f_{\theta_{1}}-f_{\theta_{2}}\right)^{2}\leq M_{\Psi}^{2}\left\{(\alpha_{1}-\alpha_{2})^{2}+\mathbb{E}\left(w^{2}(X)\|X\|^{2}\right)\|\beta_{1}-\beta_{2}\|^{2}\right\}\,.

Taking into account that $\widetilde{\beta}_{0}\in{\mathcal{M}}_{k}$ , it can be written as $\widetilde{\beta}_{0}=\sum_{j=1}^{k}\widetilde{b}_{0,j}B_{j}$ , so ${\mathcal{B}}_{n,c}=\{\beta=\sum_{j=1}^{k}b_{j}B_{j},\mathbf{b}\in\mathbb{R}^{k}:\|\sum_{j=1}^{k}\left(b_{j}-\widetilde{b}_{0,j}\right)B_{j}\|\leq c\}$ . Thus, according to Corollary 2.6 in van de Geer, (2000), taking therein as measure $\mathbb{Q}$ the uniform measure on ${\mathcal{T}}=[0,1]$ , we get that ${\mathcal{B}}_{n,c}$ can be covered by

N_{{\mathcal{B}}_{n,c},\delta}=\left(\frac{4c+\delta}{\delta}\right)^{k}\,,

balls with center $\beta_{j,c,\delta}$ , $1\leq j\leq N_{{\mathcal{B}}_{n,c},\delta}$ , and radius $\delta$ . Besides, the interval ${\mathcal{I}}_{c}$ may also be covered by $N_{{\mathcal{I}}_{c},\delta}=(4c+\delta)/\delta$ balls with center $\alpha_{j,c,\delta}$ , $1\leq j\leq N_{{\mathcal{I}}_{c},\delta}$ , and radius $\delta$ .

Given $\epsilon>0$ , take $\delta=\epsilon/\left\{2\;M_{\Psi}\left[1+\left(\mathbb{E}w^{2}(X)\|X\|^{2}\right)^{1/2}\right]\right\}$ and for $1\leq j\leq N_{{\mathcal{I}}_{c},\delta}$ and $1\leq m\leq N_{{\mathcal{B}}_{n,c},\delta}$ , define the functions $f_{j,m}=\phi\left(y,\alpha_{j,c,\delta}+\langle X,\beta_{m,c,\delta}\rangle\right)w(X)-V_{\theta_{0}^{*}}$ and

	$\displaystyle f_{j,m}^{(U)}(y,X)$	$\displaystyle=f_{j,m}+\delta\;M_{\Psi}\left(1+w(X)\|X\|\;\right)$
	$\displaystyle f_{j,m}^{(L)}(y,X)$	$\displaystyle=f_{j,m}-\delta\;M_{\Psi}\left(1+w(X)\|X\|\;\right)\,.$

Given $f_{\theta}\in{\mathcal{G}}_{n,c,\widetilde{\theta}_{0},\theta_{0}^{*}}$ , let $j$ and $m$ be such that $|\alpha-\alpha_{j,c,\delta}|<\delta$ and $\|\beta-\beta_{m,c,\delta}\|<\delta$ and $\theta_{j,m}=(\alpha_{j,c,\delta},\beta_{m,c,\delta})$ . Then, using (15), we obtain that

\left|f_{\theta}-f_{\theta_{j,m}}\right|\leq\delta\;M_{\Psi}\left\{1+w(X)\|X\|\;\right\}\,,

so $f_{j,m}^{(L)}\leq f_{\theta}\leq f_{j,m}^{(U)}$ , since $f_{\theta_{j,m}}=f_{j,m}$ . Besides,

\|f_{j,m}^{(U)}-f_{j,m}^{(L)}\|=2\delta M_{\Psi}\left\{\mathbb{E}\left(1+w(X)\|X\|\right)^{2}\right\}^{1/2}\leq 2\delta M_{\Psi}\left\{1+\left(\mathbb{E}w^{2}(X)\|X\|^{2}\right)^{1/2}\right\}=\epsilon\,,

which implies that

N_{[\;]}(\epsilon,{\mathcal{G}}_{n,c,\widetilde{\theta}_{0},\theta_{0}^{*}},L_{2}(P))\leq N_{{\mathcal{I}}_{c},\delta}N_{{\mathcal{B}}_{n,c},\delta}\leq\left(\frac{4c+\delta}{\delta}\right)^{k+1}=\left(\frac{8\;M_{\Psi}\left[1+\left(\mathbb{E}w^{2}(X)\|X\|^{2}\right)^{1/2}\right]c}{\epsilon}+1\right)^{k+1}\,,

and the result follows taking $A=8\;M_{\Psi}\left[1+\left(\mathbb{E}w^{2}(X)\|X\|^{2}\right)^{1/2}\right]$ . ∎
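The counting in this proof can be checked numerically. The sketch below is only an illustration: the inputs `eps`, `c`, `k`, `M_Psi` and `second_moment` (standing for $\mathbb{E}w^{2}(X)\|X\|^{2}$) are hypothetical numbers, and the function name is ours. It computes $\delta$ as chosen in the proof, multiplies the two covering numbers, and compares the product with the closed form $(Ac/\epsilon+1)^{k+1}$.

```python
import math

def bracketing_bound(eps, c, k, M_Psi, second_moment):
    """Return (delta, product of covering numbers, closed-form bound).

    second_moment plays the role of E[w^2(X) ||X||^2]; every input is a
    hypothetical number supplied by the caller, not a value from the paper.
    """
    s = 1.0 + math.sqrt(second_moment)
    delta = eps / (2.0 * M_Psi * s)           # radius chosen in the proof
    n_interval = (4.0 * c + delta) / delta    # cover of the interval I_c
    n_ball = n_interval ** k                  # cover of B_{n,c}
    A = 8.0 * M_Psi * s                       # constant from the last line of the proof
    closed_form = (A * c / eps + 1.0) ** (k + 1)
    return delta, n_interval * n_ball, closed_form

delta, product, closed_form = bracketing_bound(
    eps=0.1, c=1.0, k=5, M_Psi=0.5, second_moment=2.0
)
```

Under these assumptions `product` and `closed_form` agree up to rounding, which is the last equality in the displayed bound.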

A.2 Proof of Theorems 3.1 and 3.2

Recall that $\widehat{\theta}=(\widehat{\alpha},\widehat{\beta})$ , $\theta_{0}=(\alpha_{0},\beta_{0})$ . Henceforth, $\widetilde{\theta}$ stands for $\widetilde{\theta}=(\alpha_{0},\widetilde{\beta})$ with $\widetilde{\beta}=\widetilde{\beta}_{k}$ defined in assumptions A9 and A10.

The following Lemma is useful to derive Theorems 3.1 and 3.2.

Lemma A.7.

Let $\rho$ be a function satisfying A1 and A2, and $w$ a weight function satisfying A5. Assume that A8 and A9 hold. Then, we have that $L(\widehat{\alpha},\beta_{\widehat{{\mathbf{b}}}})=L(\widehat{\theta})\buildrel a.s.\over{\longrightarrow}L(\theta_{0})$ .

Proof.

From A8 and A9, we have that there exists $\widetilde{\beta}_{k_{n}}\in{\mathcal{M}}_{k_{n}}$ , $\widetilde{\beta}_{k_{n}}=\sum_{j=1}^{k_{n}}\widetilde{b}_{j}B_{j}(x)$ such that $\|\widetilde{\beta}_{k_{n}}-\beta_{0}\|_{{\mathcal{H}}}\to 0$ as $n\to\infty$ . As mentioned above, we denote $\widetilde{\theta}=(\alpha_{0},\widetilde{\beta}_{k_{n}})$ . Using that $\widetilde{\beta}_{k_{n}}\in{\mathcal{M}}_{k_{n}}$ , we conclude that $L_{n}(\widehat{\theta})\leq L_{n}(\widetilde{\theta})\,,$ while from Lemma A.1(a) we have that $0\leq L(\widehat{\theta})-L(\theta_{0})$ . Thus,

$\displaystyle 0\leq L(\widehat{\theta})-L(\theta_{0})$	$\displaystyle=L(\widehat{\theta})-L(\widetilde{\theta})+L(\widetilde{\theta})-L(\theta_{0})$
	$\displaystyle=\left\{L_{n}(\widetilde{\theta})-L(\widetilde{\theta})\right\}-\left\{L_{n}(\widehat{\theta})-L(\widehat{\theta})\right\}+\left\{L(\widetilde{\theta})-L(\theta_{0})\right\}+L_{n}(\widehat{\theta})-L_{n}(\widetilde{\theta})$
	$\displaystyle\leq\left\{L_{n}(\widetilde{\theta})-L(\widetilde{\theta})\right\}-\left\{L_{n}(\widehat{\theta})-L(\widehat{\theta})\right\}+\left\{L(\widetilde{\theta})-L(\theta_{0})\right\}$
	$\displaystyle\leq 2\,\sup_{f\in{\mathcal{F}}_{n}}\left|(P_{n}-P)f\right|+\left\{L(\widetilde{\theta})-L(\theta_{0})\right\}=2\,A_{n}+B_{n}\,,$	(16)

where ${\mathcal{F}}_{n}$ is defined in (12).

From Lemma A.5 we obtain that $A_{n}\buildrel a.s.\over{\longrightarrow}0$ . On the other hand, from $\|\widetilde{\beta}_{k_{n}}-\beta_{0}\|_{{\mathcal{H}}}\to 0$ as $n\to\infty$ , the fact that for any $f\in{\mathcal{H}}$ , we have the bound $\|f\|\leq\|f\|_{{\mathcal{H}}}$ and the Cauchy–Schwarz inequality, we get that for any $v\in L^{2}([0,1])$ , $\alpha_{0}+\langle v,\widetilde{\beta}_{k_{n}}\rangle\to\alpha_{0}+\langle v,\beta_{0}\rangle$ . Thus, from the Bounded Convergence Theorem, the continuity of $\phi\left(y,t\right)$ with respect to $t$ and its boundedness together with the boundedness of $w$ , we conclude that $B_{n}=L(\widetilde{\theta})-L(\theta_{0})\to 0$ . Summarizing, we have that

0\leq L(\widehat{\theta})-L(\theta_{0})\leq 2\,A_{n}+B_{n}\buildrel a.s.\over{\longrightarrow}0\,,

which concludes the proof. ∎

Proof of Theorem 3.1.

(a) From Lemma A.3, we have that there exists a constant $C_{0}>0$ independent of $n$ such that

L(\widehat{\theta})-L(\theta_{0})\geq C_{0}\,\pi_{\mathbb{P}}^{2}(\widehat{\theta},\theta_{0})\,.

Then, the result follows from Lemma A.7 which implies that $L(\widehat{\theta})-L(\theta_{0})\buildrel a.s.\over{\longrightarrow}0$ .

(b) Recall that from (16) and Lemma A.3 we obtain that, for some constant $C_{0}$ independent of $n$ ,

C_{0}\,\pi_{\mathbb{P}}^{2}(\widehat{\theta},\theta_{0})\leq L(\widehat{\theta})-L(\theta_{0})\leq 2\,\sup_{f\in{\mathcal{F}}_{n}}\left|(P_{n}-P)f\right|+\left\{L(\widetilde{\theta})-L(\theta_{0})\right\}\,.

Lemma A.5(c) entails that $\sup_{f\in{\mathcal{F}}_{n}}\left|(P_{n}-P)f\right|=O_{\mathbb{P}}\left(\sqrt{{k_{n}}/{n}}\right)$ , so, using again that $B_{n}=L(\widetilde{\theta})-L(\theta_{0})\geq 0$ , the proof will be concluded if we show that

L(\widetilde{\theta})-L(\theta_{0})=O(\|\widetilde{\beta}_{k_{n}}-\beta_{0}\|_{{\mathcal{H}}})\,.

(17)

To prove (17), recall that $\Psi(y,t)={\partial}\phi(y,t)/{\partial t}=-[y-F(t)]\nu(t)$ where $\nu(t)$ is given by (8) and that $M_{\Psi}=\sup_{y\in\{0,1\},t\in\mathbb{R}}|\Psi(y,t)|$ is finite. Thus,

	$\displaystyle 0$	$\displaystyle\leq L(\widetilde{\theta})-L(\theta_{0})=\mathbb{E}\left\{w(X)\,\left[\phi(y,\alpha_{0}+\langle X,\widetilde{\beta}\rangle)-\phi(y,\alpha_{0}+\langle X,\beta_{0}\rangle)\right]\right\}$
		$\displaystyle\leq M_{\Psi}\mathbb{E}\left\{w(X)\,\left|\langle X,\widetilde{\beta}-\beta_{0}\rangle\right|\right\}\leq M_{\Psi}\|\widetilde{\beta}-\beta_{0}\|\;\mathbb{E}\left\{w(X)\,\|X\|\right\}\,.$

Then, (17) follows from the fact that $\|f\|\leq\|f\|_{{\mathcal{H}}}$ .

(c) To derive (c) it will be enough to show that $L(\widetilde{\theta})-L(\theta_{0})=O\left(\|\widetilde{\beta}_{k_{n}}-\beta_{0}\|_{{\mathcal{H}}}^{2}\right)$ .

Using that $\psi$ is continuously differentiable with bounded derivative, we obtain that the derivative of the function $\nu(t)$ defined in (8) equals

	$\displaystyle\nu^{\,\prime}(t)$	$\displaystyle=-\psi^{\prime}\left(-\log F(t)\right)\left[1-F(t)\right]^{2}-\psi\left(-\log F(t)\right)F(t)\left[1-F(t)\right]$
		$\displaystyle+\psi\left(-\log\left[1-F(t)\right]\right)F(t)\left[1-F(t)\right]+\psi^{\prime}\left(-\log\left[1-F(t)\right]\right)F^{2}(t)$

and is bounded. Hence, ${\partial^{2}}\phi(y,t)/{\partial t^{2}}=F(t)\left[1-F(t)\right]\nu(t)-[y-F(t)]\nu^{\,\prime}(t)$ is also bounded. Denote $A=\sup_{y\in\{0,1\},t\in\mathbb{R}}|{\partial^{2}}\phi(y,t)/{\partial t^{2}}|$ .
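As a sanity check on the displayed expression for $\nu^{\,\prime}(t)$, the following sketch compares it with a central finite difference. Two things are assumptions here and not taken from the paper: the score function $\psi(s)=e^{-s}$ is a hypothetical smooth choice used only for illustration, and $\nu(t)=\psi(-\log F(t))[1-F(t)]+\psi(-\log[1-F(t)])F(t)$ is the form of (8) that we assume, since it is the one whose derivative matches the display.

```python
import math

def F(t):
    """Logistic distribution function."""
    return 1.0 / (1.0 + math.exp(-t))

# Hypothetical smooth choice rho(s) = 1 - exp(-s), so psi = rho' and psi' are:
def psi(s):
    return math.exp(-s)

def dpsi(s):
    return -math.exp(-s)

def nu(t):
    # Assumed form of nu; equation (8) is not reproduced in this appendix.
    Ft = F(t)
    return psi(-math.log(Ft)) * (1.0 - Ft) + psi(-math.log(1.0 - Ft)) * Ft

def dnu(t):
    # Term-by-term transcription of the displayed expression for nu'(t).
    Ft = F(t)
    return (-dpsi(-math.log(Ft)) * (1.0 - Ft) ** 2
            - psi(-math.log(Ft)) * Ft * (1.0 - Ft)
            + psi(-math.log(1.0 - Ft)) * Ft * (1.0 - Ft)
            + dpsi(-math.log(1.0 - Ft)) * Ft ** 2)

grid = [-5.0 + 0.05 * i for i in range(201)]
h = 1e-5
# Largest gap between the central difference of nu and the closed form.
max_err = max(abs((nu(t + h) - nu(t - h)) / (2.0 * h) - dnu(t)) for t in grid)
# Largest absolute value of nu' on the grid, illustrating boundedness.
max_abs_dnu = max(abs(dnu(t)) for t in grid)
```

For this particular $\psi$ the two quantities coincide up to discretization error and $|\nu^{\,\prime}|$ stays below $0.2$ on the grid, in line with the boundedness claimed above.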

To lighten the notation, denote $R_{0}(X)=\alpha_{0}+\langle X,\beta_{0}\rangle$ . Define the function $g:\mathbb{R}\to\mathbb{R}$ as $g(t)=L(t\widetilde{\theta}+(1-t)\theta_{0})$ . Then,

g(t)=\mathbb{E}\left\{w(X)\phi\left(F(R_{0}(X)),R_{0}(X)+t\langle X,\widetilde{\beta}-\beta_{0}\rangle\right)\right\}\,.

Note that $g(0)=L(\theta_{0})$ , $g(1)=L(\widetilde{\theta})$ and $g(t)\geq g(0)$ for all $t$ .

Recall that $\Psi(y,t)={\partial}\phi(y,t)/{\partial t}=-[y-F(t)]\nu(t)$ , then $\Psi(F(r_{0}),r_{0})=0$ , for any $r_{0}$ . Therefore, we have that

	$\displaystyle g(1)-g(0)$	$\displaystyle=\mathbb{E}\left\{w(X)\;\Psi\left(F(R_{0}(X)),R_{0}(X)\right)\langle X,\widetilde{\beta}-\beta_{0}\rangle\right\}+\mathbb{E}\left\{w(X)\;\frac{\partial^{2}\phi(F(R_{0}(X)),t)}{\partial t^{2}}\Big|_{t=\xi_{\varrho}}\langle X,\widetilde{\beta}-\beta_{0}\rangle^{2}\right\}$
		$\displaystyle=\mathbb{E}\left\{w(X)\;\frac{\partial^{2}\phi(F(R_{0}(X)),t)}{\partial t^{2}}\Big|_{t=\xi_{\varrho}}\langle X,\widetilde{\beta}-\beta_{0}\rangle^{2}\right\}$

where $\xi_{\varrho}=\alpha_{0}+\langle X,\beta_{0}\rangle+\varrho\langle X,\widetilde{\beta}-\beta_{0}\rangle$ for some $\varrho=\varrho(X)\in(0,1)$ .

Therefore, using that ${\partial^{2}}\phi(y,t)/{\partial t^{2}}$ is bounded, we obtain

L(\widetilde{\theta})-L(\theta_{0})=g(1)-g(0)\leq A\,\mathbb{E}\left\{w(X)\langle X,\widetilde{\beta}-\beta_{0}\rangle^{2}\right\}\leq A\,\mathbb{E}\left\{w(X)\|X\|^{2}\right\}\|\widetilde{\beta}-\beta_{0}\|_{{\mathcal{H}}}^{2}\,,

(18)

where the last inequality follows from the Cauchy–Schwarz inequality and the fact that, for any $f\in{\mathcal{H}}$ , we have the bound $\|f\|\leq\|f\|_{{\mathcal{H}}}$ , concluding the proof. ∎
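The pointwise mechanism behind (18) can be illustrated numerically: since $\Psi(F(r_{0}),r_{0})=0$, the excess $\phi(F(r_{0}),r_{0}+h)-\phi(F(r_{0}),r_{0})$ is of order $h^{2}$ and is controlled by the bound $A$ on the second derivative. The sketch below is a hedged illustration only. It uses the hypothetical choice $\psi(s)=e^{-s}$, for which $\nu(t)=2F(t)[1-F(t)]$ under the assumed form of (8), approximates $A$ on a grid, and recovers the excess by integrating $\Psi$ with a midpoint rule.

```python
import math

def F(t):
    return 1.0 / (1.0 + math.exp(-t))

def nu(t):
    # nu for the hypothetical psi(s) = exp(-s): nu(t) = 2 F(t) [1 - F(t)].
    return 2.0 * F(t) * (1.0 - F(t))

def dnu(t):
    return 2.0 * F(t) * (1.0 - F(t)) * (1.0 - 2.0 * F(t))

def Psi(y, t):
    # First derivative of phi in t, as recalled in the proof.
    return -(y - F(t)) * nu(t)

def d2phi(y, t):
    # Second derivative of phi in t, as displayed in part (c).
    return F(t) * (1.0 - F(t)) * nu(t) - (y - F(t)) * dnu(t)

def excess(r0, h, n_nodes=2000):
    # phi(F(r0), r0 + h) - phi(F(r0), r0), as the integral of Psi (midpoint rule).
    p = F(r0)
    step = h / n_nodes
    return step * sum(Psi(p, r0 + (i + 0.5) * step) for i in range(n_nodes))

grid = [-10.0 + 0.01 * i for i in range(2001)]
# Grid approximation of A = sup |d^2 phi / dt^2| over y in {0, 1} and t.
A = max(abs(d2phi(y, t)) for y in (0.0, 1.0) for t in grid)
cases = [(r0, h) for r0 in (-2.0, 0.0, 1.5) for h in (-3.0, -0.1, 0.1, 3.0)]
# Ratio excess / h^2 for each (r0, h); each should lie in [0, A].
ratios = [excess(r0, h) / h ** 2 for r0, h in cases]
```

With $h=\langle X,\widetilde{\beta}-\beta_{0}\rangle$, taking expectations of such a pointwise bound is what yields the quadratic rate in (18).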

Lemma A.8 is an intermediate step to derive Theorem 3.2.

Lemma A.8.

Let $\rho$ be a function satisfying A1 and A2, and $w$ a weight function satisfying A5. Assume that A7 to A9 and A11 hold. Then, there exists $M>0$ such that $\mathbb{P}(\cup_{m\in\mathbb{N}}\cap_{n\geq m}\{|\widehat{\alpha}|+\|\widehat{\beta}\|_{{\mathcal{W}}^{1,{\mathcal{H}}}}\leq M\})=1$ .

Proof.

From now on, ${\mathcal{B}}_{1}$ denotes the unit ball in ${\mathcal{W}}^{1,{\mathcal{H}}}$ and ${\mathcal{B}}=\{(\alpha,\beta)\in\mathbb{R}\times{\mathcal{W}}^{1,{\mathcal{H}}}:|\alpha|+\|\beta\|_{{\mathcal{W}}^{1,{\mathcal{H}}}}=1\}\subset\ [-1,1]\times{\mathcal{B}}_{1}$ . The Rellich–Kondrachov theorem entails that ${\mathcal{W}}^{1,{\mathcal{H}}}$ is compactly embedded in ${\mathcal{H}}$ , hence ${\mathcal{B}}$ is compact in $\mathbb{R}\times{\mathcal{H}}$ .

To simplify the notation, let $\theta=(\alpha,\beta)$ and $\ell(X,\theta)=\alpha+\langle X,\beta\rangle$ . Furthermore, given $\theta_{j}\in\mathbb{R}\times{\mathcal{H}}$ , for $j=1,2$ , denote $d(\theta_{1},\theta_{2})=|\alpha_{1}-\alpha_{2}|+\|\beta_{1}-\beta_{2}\|_{{\mathcal{H}}}$ .

Step 1. We begin by proving that, given $0<\delta\leq 1/4$ , there exist $K>0$ and positive numbers $\varphi_{1},\dots,\varphi_{s}$ such that for every $\theta\in{\mathcal{B}}$ , there exists $j\in\{1,\dots,s\}$ such that

\mathbb{P}\left(|\ell(X,\theta)|>\frac{\varphi_{j}}{2}\right)>\mathbb{P}\left(|\ell(X,\theta)|>\frac{\varphi_{j}}{2}\cap\|X\|\leq K\right)>1-2\delta\,.

(19)

To derive (19), first given $\delta>0$ , define $K_{\delta}$ such that, for any $K\geq K_{\delta}$

\mathbb{P}(\|X\|\geq K)<\delta\,.

(20)

Fix now $\theta\in{\mathcal{B}}$ . Using that $\beta\in{\mathcal{H}}$ and A11, there exists $\varphi_{\theta}>0$ a continuity point of the distribution of $|\ell(X,\theta)|$ such that

\mathbb{P}\left(|\ell(X,\theta)|<\varphi_{\theta}\right)<\delta\;.

(21)

Note that from (20) and (21), we conclude that $A(\theta)=\mathbb{P}\left(|\ell(X,\theta)|\geq\varphi_{\theta}\right)-\mathbb{P}\left(\|X\|\geq K\right)>1-2\delta$ and $B(\theta)=\mathbb{P}\left(|\ell(X,\theta)|\geq\varphi_{\theta}\cap\|X\|\leq K\right)>1-2\delta$ .

Let $\tau_{\theta}$ stand for $\tau_{\theta}={\varphi_{\theta}}/{(2(K+1))}$ . Then, given $\theta^{*}=(\alpha^{*},\beta^{*})\in\mathbb{R}\times{\mathcal{H}}$ , such that $d(\theta^{*},\theta)<\tau_{\theta}$ , and using that $\|f\|\leq\|f\|_{{\mathcal{H}}}$ and the Cauchy–Schwarz inequality, we obtain

|\ell(X,\theta)|\leq|\ell(X,\theta^{*})|+|\ell(X,\theta-\theta^{*})|\leq|\ell(X,\theta^{*})|+\tau_{\theta}\left(\|X\|+1\right)\,.

Hence, we get

\mathbb{P}\left(|\ell(X,\theta^{*})|\geq\frac{\varphi_{\theta}}{2}\cap\|X\|\leq K\right)\geq B(\theta)>1-2\delta\,.

(22)

Consider the covering of ${\mathcal{B}}$ given by the open balls ${\mathcal{B}}(\theta,\tau_{\theta})=\{(a,f)\in\mathbb{R}\times{\mathcal{H}}:d((a,f),\theta)<\tau_{\theta}\}$ , $\theta\in{\mathcal{B}}$ . Using that ${\mathcal{B}}$ is compact in $\mathbb{R}\times{\mathcal{H}}$ , we get that there exist $\theta_{1},\dots\theta_{s}$ such that $\theta_{j}\in{\mathcal{B}}$ and ${\mathcal{B}}\subset\cup_{j=1}^{s}{\mathcal{B}}(\theta_{j},\tau_{\theta_{j}})$ . Therefore, from (22), we conclude that

\min_{1\leq j\leq s}\inf_{d(\theta,\theta_{j})<\tau_{\theta_{j}}}\mathbb{P}\left(|\ell(X,\theta)|>\frac{\varphi_{\theta_{j}}}{2}\cap\|X\|\leq K\right)>1-2\delta\,,

(23)

meaning that, for every $\theta\in{\mathcal{B}}$ , there exists $j\in\{1,\dots,s\}$ such that

\mathbb{P}\left(|\ell(X,\theta)|>\frac{\varphi_{\theta_{j}}}{2}\cap\|X\|\leq K\right)>1-2\delta\;,

concluding the proof of Step 1.

Step 2. We will show that for any $\theta=(\alpha,\beta)\in{\mathcal{B}}$

\mathbb{E}\left(\lim_{a\to\infty}\phi(y,a\;\ell(X,\theta))w(X)\right)>L(\theta_{0})\,.

(24)

Let us consider a sequence $a_{m}\to+\infty$ ; it is enough to show that $D>L(\theta_{0})$ where $D=\mathbb{E}\left\{\lim_{m\to\infty}\phi\left(y,a_{m}\;\ell(X,\theta)\right)w(X)\right\}$ . Using that $\phi$ is a bounded function and the bounded convergence theorem, we get that $D=\lim_{m\to\infty}D_{m}$ , where $D_{m}=\mathbb{E}\left\{\phi\left(y,a_{m}\;\ell(X,\theta)\right)w(X)\right\}=L(a_{m}\,\theta)$ .

Note that as in the proof of Lemma A.1, we have that

D_{m}=\mathbb{E}\left\{w(X)\,\phi\left(F(\ell(X,\theta_{0})),a_{m}\;\ell(X,\theta)\right)\right\}=\mathbb{E}\left\{w(X)\phi(F(R_{0}(X)),R_{m}(X))\right\}\,,

(25)

where, as in the proof of Theorem 3.1, we denote $R_{m}(X)=a_{m}\;\ell(X,\theta)$ and $R_{0}(X)=\ell(X,\theta_{0})$ .

In the proof of Lemma A.1, we have shown that, for any fixed $r_{0}$ , the function $\Phi(t)=\phi(F(r_{0}),t)$ reaches its unique minimum when $t=r_{0}$ and $\Phi^{\prime}(t)>0$ , for $t>r_{0}$ , and that $\Phi^{\prime}(t)<0$ for $t<r_{0}$ . Then, $\Phi$ is strictly increasing on $(r_{0},+\infty)$ and strictly decreasing on $(-\infty,r_{0})$ , so $\lim_{t\to+\infty}\Phi(t)>\Phi(r_{0})$ and similarly $\lim_{t\to-\infty}\Phi(t)>\Phi(r_{0})$ .
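The behaviour of $\Phi$ described above can be visualised with a small numerical sketch. It is an illustration under assumptions, not the setting of the paper: we take the hypothetical score $\psi(s)=e^{-s}$, for which the assumed form of (8) gives $\nu(t)=2F(t)[1-F(t)]>0$, and we use only the identity $\Phi^{\prime}(t)=\Psi(F(r_{0}),t)=-[F(r_{0})-F(t)]\nu(t)$. The gaps $\Phi(T)-\Phi(r_{0})$ for large $|T|$ are obtained by integrating $\Phi^{\prime}$ with a midpoint rule.

```python
import math

def F(t):
    return 1.0 / (1.0 + math.exp(-t))

def nu(t):
    # Hypothetical positive nu, from psi(s) = exp(-s): nu(t) = 2 F(t) [1 - F(t)].
    return 2.0 * F(t) * (1.0 - F(t))

def dPhi(t, r0):
    # Phi'(t) = Psi(F(r0), t) = -[F(r0) - F(t)] nu(t).
    return -(F(r0) - F(t)) * nu(t)

def gap(r0, T, n_nodes=20000):
    # Phi(T) - Phi(r0), computed as the integral of Phi' (midpoint rule).
    step = (T - r0) / n_nodes
    return step * sum(dPhi(r0 + (i + 0.5) * step, r0) for i in range(n_nodes))

r0 = 0.7
p = F(r0)
right_gap = gap(r0, 40.0)    # proxy for lim_{t -> +inf} Phi(t) - Phi(r0)
left_gap = gap(r0, -40.0)    # proxy for lim_{t -> -inf} Phi(t) - Phi(r0)
```

For this particular $\psi$, the substitution $u=F(t)$ shows that the two limits equal $(1-p)^{2}$ and $p^{2}$ with $p=F(r_{0})$, so both are strictly positive whenever $F(r_{0})$ is not $0$ or $1$.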

Using A11 and that $\beta\in{\mathcal{W}}^{1,{\mathcal{H}}}$ , we have that with probability one

\lim_{m\to\infty}R_{m}(X)=\lim_{m\to\infty}a_{m}\;\ell(X,\theta)=\left\{\begin{array}[]{rl}+\infty&\mbox{ when }\ell(X,\theta)>0\\ -\infty&\mbox{ when }\ell(X,\theta)<0\end{array}\right.\,.

(26)

Fix $\omega\in\Omega$ , such that $X_{0}=X(\omega)$ satisfies (26) and $F(R_{0}(X_{0}))$ is not $0$ or $1$ . Let $r_{0}=R_{0}(X_{0})$ and take as above $\Phi(t)=\phi(F(r_{0}),t)$ . Then,

	$\displaystyle\lim_{m\to\infty}\phi(F(R_{0}(X_{0})),R_{m}(X_{0}))=\lim_{m\to\infty}\Phi(R_{m}(X_{0}))$	$\displaystyle=\left\{\begin{array}[]{rl}\lim_{r\to+\infty}\Phi(r)&\mbox{ when }\ell(X_{0},\theta)>0\\ \lim_{r\to-\infty}\Phi(r)&\mbox{ when }\ell(X_{0},\theta)<0\end{array}\right.$
		$\displaystyle>\Phi(r_{0})=\Phi\left(R_{0}(X_{0})\right)=\phi(F(R_{0}(X_{0})),R_{0}(X_{0}))\,.$

Thus, using (25), A11 and A5, together with the boundedness of $\phi$ and the bounded convergence theorem, we obtain that

D=\lim_{m\to\infty}D_{m}=\mathbb{E}\left\{w(X)\,\lim_{m\to\infty}\phi(F(R_{0}(X)),R_{m}(X))\right\}>\mathbb{E}\left\{w(X)\,\phi(F(R_{0}(X)),R_{0}(X))\right\}=L(\theta_{0})\,,

which concludes the proof of (24).

Step 3. Let us show that there exists $\eta>0$ such that for any $\theta=(\alpha,\beta)\in{\mathcal{B}}$ , we have

\lim_{a\to\infty}\inf_{|\alpha^{*}-\alpha|+\|\beta^{*}-\beta\|_{{\mathcal{W}}^{1,{\mathcal{H}}}}<\eta}L(a\,\theta^{*})>L(\theta_{0})\,,

(27)

where we have denoted $\theta^{*}=(\alpha^{*},\beta^{*})$ . The proof is an extension to the functional setting of that of Lemma 6.3 in Bianco and Yohai, (1996).

Take $\delta<(D-L(\theta_{0}))/(2\,M_{\phi})$ where $D=\mathbb{E}\left(\lim_{a\to\infty}\phi(y,a\;\ell(X,\theta))w(X)\right)$ , the quantity $D-L(\theta_{0})$ is positive from (24) and $M_{\phi}=\sup_{y\in\{0,1\},t\in\mathbb{R}}\phi(y,t)$ .

From Step 1 we have that, for any $\theta\in{\mathcal{B}}$ , there exists $j\in\{1,\dots,s\}$ such that (19) holds. Let $j_{0}$ be the index corresponding to the chosen $\theta\in{\mathcal{B}}$ and define the set ${\mathcal{E}}=\{X:|\ell(X,\theta)|>\varphi_{j_{0}}/{2}\cap\|X\|\leq K\}$ . Then, from (19), we get that

\mathbb{P}(X\notin{\mathcal{E}})<2\delta<\frac{D-L(\theta_{0})}{M_{\phi}}\,.

(28)

Take $X\in{\mathcal{E}}$ and define $\eta=\min_{1\leq j\leq s}\varphi_{j}/(8(K+1))$ . Then, using again that $\|f\|\leq\|f\|_{{\mathcal{W}}^{1,{\mathcal{H}}}}$ and that the set $\{\theta^{*}=(\alpha^{*},\beta^{*}):|\alpha^{*}-\alpha|+\|\beta^{*}-\beta\|_{{\mathcal{W}}^{1,{\mathcal{H}}}}<\eta\}\subset\{\theta^{*}:|\alpha^{*}-\alpha|+\|\beta^{*}-\beta\|_{{\mathcal{W}}^{1,{\mathcal{H}}}}\leq\varphi_{{j_{0}}}/(4(K+1))\}$ , which is compact in $\mathbb{R}\times{\mathcal{H}}$ , it is easy to see that for any $\theta^{*}=(\alpha^{*},\beta^{*})$ such that $|\alpha^{*}-\alpha|+\|\beta^{*}-\beta\|_{{\mathcal{W}}^{1,{\mathcal{H}}}}\leq\eta\leq\varphi_{{j_{0}}}/(4(K+1))$ , we have that

\left\{\begin{array}[]{c }\mbox{sign}\left(\ell(X,\theta^{*})\right)=\mbox{sign}\left(\ell(X,\theta)\right)\,,\\ |\ell(X,\theta^{*})|\geq\displaystyle\frac{\varphi_{j_{0}}}{4}\,.\end{array}\right.

(29)

Using that the set ${\mathcal{V}}_{\eta}=\{\theta^{*}\in\mathbb{R}\times{\mathcal{W}}^{1,{\mathcal{H}}}:|\alpha^{*}-\alpha|+\|\beta^{*}-\beta\|_{{\mathcal{W}}^{1,{\mathcal{H}}}}\leq\eta\}$ is compact in $\mathbb{R}\times{\mathcal{H}}$ we conclude that there exists $\{\theta_{m}^{*}\}_{m\in\mathbb{N}}\subset{\mathcal{V}}_{\eta}$ (depending on $a$ and $X$ ) and a value $\widetilde{\theta}^{*}=\widetilde{\theta}^{*}_{a,X}\in{\mathcal{V}}_{\eta}$ which also depends on $a$ and $X$ , such that $d(\theta_{m}^{*},\widetilde{\theta}^{*})\to 0$ and $\lim_{m\to\infty}\phi(y,a\;\ell(X,\theta_{m}^{*}))=\inf_{\theta^{*}\in{\mathcal{V}}_{\eta}}\phi(y,a\;\ell(X,\theta^{*}))$ . The continuity of $\phi$ together with the fact that $d(\theta_{m}^{*},\widetilde{\theta}^{*})\to 0$ and the Cauchy–Schwarz inequality leads to $\phi(y,a\,\ell(X,\widetilde{\theta}^{*}))=\inf_{\theta^{*}\in{\mathcal{V}}_{\eta}}\phi(y,a\,\ell(X,\theta^{*}))$ . Then, using (29), we conclude that

\liminf_{a\to\infty}\inf_{\theta^{*}\in{\mathcal{V}}_{\eta}}\phi\left(y,a\;\ell(X,\theta^{*})\right)=\lim_{a\to\infty}\phi\left(y,a\;\ell\left(X,\widetilde{\theta}^{*}_{a,X}\right)\right)\,.

Therefore, using Fatou’s Lemma we obtain that

	$\displaystyle\liminf_{a\to\infty}\inf_{|\alpha^{*}-\alpha|+\|\beta^{*}-\beta\|_{{\mathcal{W}}^{1,{\mathcal{H}}}}<\eta}L(a\,\theta^{*})$	$\displaystyle\geq\mathbb{E}\left\{\liminf_{a\to\infty}\inf_{\theta^{*}\in{\mathcal{V}}_{\eta}}\phi(y,a\;\ell(X,\theta^{*}))w(X)\right\}$
		$\displaystyle\geq\mathbb{E}\left\{\mathbb{I}_{{\mathcal{E}}}(X)\,w(X)\,\lim_{a\to\infty}\inf_{\theta^{*}\in{\mathcal{V}}_{\eta}}\phi(y,a\;\ell(X,\theta^{*}))\right\}$
		$\displaystyle\geq\mathbb{E}\left\{\mathbb{I}_{{\mathcal{E}}}(X)\,w(X)\,\lim_{a\to\infty}\phi\left(y,a\;\ell\left(X,\widetilde{\theta}^{*}_{a,X}\right)\right)\right\}$
		$\displaystyle=D-\mathbb{E}\left\{\mathbb{I}_{{\mathcal{E}}^{c}}(X)\,w(X)\,\lim_{a\to\infty}\phi\left(y,a\;\ell\left(X,\widetilde{\theta}^{*}_{a,X}\right)\right)\right\}\,,$

where we denote ${\mathcal{E}}^{c}$ the complement of ${\mathcal{E}}$ . Using (28) and assumption A5, we obtain that

\mathbb{E}\left\{\mathbb{I}_{{\mathcal{E}}^{c}}(X)\,w(X)\,\lim_{a\to\infty}\phi\left(y,a\;\ell\left(X,\widetilde{\theta}^{*}_{a,X}\right)\right)\right\}\leq M_{\phi}\mathbb{P}\left(X\in{\mathcal{E}}^{c}\right)<D-L(\theta_{0})\,,

so that

\liminf_{a\to\infty}\inf_{|\alpha^{*}-\alpha|+\|\beta^{*}-\beta\|_{{\mathcal{W}}^{1,{\mathcal{H}}}}<\eta}L(a\,\theta^{*})\geq D-\mathbb{E}\left\{\mathbb{I}_{{\mathcal{E}}^{c}}(X)\,w(X)\,\lim_{a\to\infty}\phi\left(y,a\;\ell(X,\widetilde{\theta}^{*}_{a,X})\right)\right\}>L(\theta_{0})\,,

which concludes the proof of (27).

Step 4. We will show that there exist $M>0$ and $\tau>0$ such that

\inf_{\theta=(\alpha,\beta):\|\beta\|_{{\mathcal{W}}^{1,{\mathcal{H}}}}+|\alpha|>M}L\left(\theta\right)>L(\theta_{0})+\tau\;.

(30)

Note that by proving (30), we may indeed conclude the proof. In fact, from Lemma A.7, we have that $L(\widehat{\alpha},\widehat{\beta})\buildrel a.s.\over{\longrightarrow}L(\alpha_{0},\beta_{0})$ , thus if ${\mathcal{N}}=\{\lim_{n\to\infty}L(\widehat{\alpha},\widehat{\beta})\neq L(\alpha_{0},\beta_{0})\}$ , $\mathbb{P}({\mathcal{N}})=0$ . Take $\omega\notin{\mathcal{N}}$ and $n_{0}$ such that for all $n\geq n_{0}$ , $|L(\widehat{\alpha},\widehat{\beta})-L(\alpha_{0},\beta_{0})|<\tau$ , then $L(\widehat{\alpha},\widehat{\beta})<L(\alpha_{0},\beta_{0})+\tau<\inf_{|\alpha|+\|\beta\|_{{\mathcal{W}}^{1,{\mathcal{H}}}}>M}L\left(\alpha,\beta\right)$ , which entails that for all $n\geq n_{0}$ , $|\widehat{\alpha}|+\|\widehat{\beta}\|_{{\mathcal{W}}^{1,{\mathcal{H}}}}\leq M$ , as desired.

Let us show that (30) holds. From Step 3, there exists $\eta$ such that, for any $\theta=(\alpha,\beta)\in{\mathcal{B}}$ , we have that $D(\theta)>L(\theta_{0})$ where

D(\theta)=\lim_{a\to\infty}\inf_{|\alpha^{*}-\alpha|+\|\beta^{*}-\beta\|_{{\mathcal{W}}^{1,{\mathcal{H}}}}<\eta}L(a\,\theta^{*})\,,

and $\theta^{*}=(\alpha^{*},\beta^{*})$ . Define $0<\tau_{\theta}<(D(\theta)-L(\theta_{0}))/2$ , then $D(\theta)>L(\theta_{0})+2\;\tau_{\theta}$ , which implies that there exists $a_{\theta}>0$ such that

\inf_{a>a_{\theta}}\inf_{|\alpha^{*}-\alpha|+\|\beta^{*}-\beta\|_{{\mathcal{W}}^{1,{\mathcal{H}}}}<\eta}L(a\,\theta^{*})>L(\theta_{0})+\tau_{\theta}\,.

(31)

Taking into account that the open balls ${\mathcal{B}}(\theta,\eta)=\{\theta^{*}\in\mathbb{R}\times{\mathcal{W}}^{1,{\mathcal{H}}}:|\alpha^{*}-\alpha|+\|\beta^{*}-\beta\|_{{\mathcal{W}}^{1,{\mathcal{H}}}}<\eta\}$ provide a covering of ${\mathcal{B}}$ , which is compact in $\mathbb{R}\times{\mathcal{H}}$ , we get that there exist $\theta_{1},\dots,\theta_{s}$ such that $\theta_{j}\in{\mathcal{B}}$ and ${\mathcal{B}}\subset\cup_{j=1}^{s}{\mathcal{B}}(\theta_{j},\eta)$ . Thus, if $a_{j}=a_{\theta_{j}}$ , $A=\max_{1\leq j\leq s}(a_{j})$ , $\tau_{j}=\tau_{\theta_{j}}$ and $\tau=\min_{1\leq j\leq s}(\tau_{j})$ , from (31), we obtain that for $j=1,\dots,s$ we have

\inf_{a>A}\inf_{\theta^{*}\in{\mathcal{B}}(\theta_{j},\eta)}L(a\,\theta^{*})>L(\theta_{0})+\tau\,.

(32)

Let $\theta\in\mathbb{R}\times{\mathcal{W}}^{1,{\mathcal{H}}}$ be such that $d_{\theta}=|\alpha|+\|\beta\|_{{\mathcal{W}}^{1,{\mathcal{H}}}}>A$ , define $\widetilde{\theta}=\theta/d_{\theta}$ , then $\widetilde{\theta}\in{\mathcal{B}}$ , so there exists $j_{0}$ such that $\widetilde{\theta}\in{\mathcal{B}}(\theta_{j_{0}},\eta)$ and $L(\theta)=L(d_{\theta}\;\widetilde{\theta})$ . Therefore, using (32), we obtain that $L(\theta)>L(\theta_{0})+\tau$ , so

\inf_{|\alpha|+\|\beta\|_{{\mathcal{W}}^{1,{\mathcal{H}}}}>A}L\left(\alpha,\beta\right)>L(\theta_{0})+\tau\,,

that is, (30) holds with $M=A$ , concluding the proof. ∎

Proof of Theorem 3.2.

We will show only b), that is, that the result holds when $\beta_{0}\in C([0,1])$ , $B_{j}\in C([0,1])$ , $1\leq j\leq k_{n}$ , $B_{j}\in{\mathcal{W}}^{1,2}$ and provide approximations in $L^{2}([0,1])$ , that is, below ${\mathcal{W}}^{1,{\mathcal{H}}}$ is the Sobolev space ${\mathcal{W}}^{1,2}$ . The situation where ${\mathcal{H}}=L^{2}([0,1])$ follows similarly, replacing the supremum norm below by the $L^{2}-$ norm and using that in this case, the Sobolev space ${\mathcal{W}}^{1,{\mathcal{H}}}$ is compactly embedded in $L^{2}([0,1])$ and that A11 holds for any $\beta\in{\mathcal{W}}^{1,{\mathcal{H}}}$ .

Assume that we have shown that

\inf_{\theta\in{\mathcal{A}}_{\epsilon}}L(\theta)>L(\theta_{0})

(33)

where ${\mathcal{A}}_{\epsilon}=\{\theta=(\alpha,\beta)\in\mathbb{R}\times{\mathcal{W}}^{1,{\mathcal{H}}}:|\alpha|+\|\beta\|_{{\mathcal{W}}^{1,{\mathcal{H}}}}\leq M,|\alpha-\alpha_{0}|+\|\beta-\beta_{0}\|_{\infty}>\epsilon\}$ . Then, taking into account that $\widehat{\beta}\in{\mathcal{W}}^{1,{\mathcal{H}}}\cap C([0,1])$ , that Lemma A.7 implies that $L(\widehat{\theta})\buildrel a.s.\over{\longrightarrow}L(\theta_{0})$ and Lemma A.8, we conclude that $|\widehat{\alpha}-\alpha_{0}|+\|\widehat{\beta}-\beta_{0}\|_{\infty}\buildrel a.s.\over{\longrightarrow}0$ .

Let us derive (33). Let $\{\theta_{m}\}_{m\geq 1}$ be a sequence such that $\theta_{m}=(\alpha_{m},\beta_{m})\in{\mathcal{A}}_{\epsilon}$ and $\lim_{m\to\infty}L(\theta_{m})=\inf_{\theta\in{\mathcal{A}}_{\epsilon}}L(\theta)$ . The Rellich–Kondrachov theorem entails that ${\mathcal{W}}^{1,{\mathcal{H}}}$ is compactly embedded in $C([0,1])$ , hence the ball $\{\theta=(\alpha,\beta)\in\mathbb{R}\times{\mathcal{W}}^{1,{\mathcal{H}}}:|\alpha|+\|\beta\|_{{\mathcal{W}}^{1,{\mathcal{H}}}}\leq M\}$ is compact in $\mathbb{R}\times C([0,1])$ . Thus, there exist a subsequence $\{\theta_{m_{j}}\}_{j\geq 1}$ of $\{\theta_{m}\}_{m\geq 1}$ and a point $\theta^{\star}=(\alpha^{\star},\beta^{\star})\in\mathbb{R}\times{\mathcal{W}}^{1,{\mathcal{H}}}$ with $|\alpha^{\star}|+\|\beta^{\star}\|_{{\mathcal{W}}^{1,{\mathcal{H}}}}\leq M$ such that $|\alpha_{m_{j}}-\alpha^{\star}|+\|\beta_{m_{j}}-\beta^{\star}\|_{\infty}\to 0$ . Then, using that $\theta_{m}\in{\mathcal{A}}_{\epsilon}$ , we have $|\alpha_{m_{j}}-\alpha_{0}|+\|\beta_{m_{j}}-\beta_{0}\|_{\infty}>\epsilon$ which implies that $|\alpha^{\star}-\alpha_{0}|+\|\beta^{\star}-\beta_{0}\|_{\infty}\geq\epsilon$ . Using that $\|f\|\leq\|f\|_{\infty}$ and the Cauchy–Schwarz inequality, we get that for any $v\in L^{2}([0,1])$ , $\alpha_{m_{j}}+\langle v,\beta_{m_{j}}\rangle\to\alpha^{\star}+\langle v,\beta^{\star}\rangle$ . Thus, using the Bounded Convergence Theorem, the continuity of $\phi\left(y,t\right)$ with respect to $t$ and its boundedness, we get that $L(\theta_{m_{j}})\to L(\theta^{\star})$ , which leads to $\inf_{\theta\in{\mathcal{A}}_{\epsilon}}L(\theta)=L(\theta^{\star})$ . Using that A11 holds and taking into account that $\beta^{\star}\in{\mathcal{W}}^{1,{\mathcal{H}}}$ , Lemma A.1(b) implies that $L(\theta^{\star})>L(\theta_{0})$ concluding the proof. ∎

A.3 Proof of Theorem 3.3 and Proposition 3.4

Proof of Theorem 3.3.

From assumption A10, we have that there exists an element $\widetilde{\beta}_{k}\in{\mathcal{M}}_{k}$ , $\widetilde{\beta}_{k}=\sum_{j=1}^{k}\widetilde{b}_{j}B_{j}(x)$ such that $\|\widetilde{\beta}_{k}-\beta_{0}\|_{{\mathcal{H}}}=O(k^{-r})$ . Without loss of generality, we assume that $\|\widetilde{\beta}_{k}-\beta_{0}\|_{{\mathcal{H}}}<\epsilon_{0}/2$ with $\epsilon_{0}$ defined in A1. Recall that $\Theta_{n}^{\star}=\mathbb{R}\times{\mathcal{M}}_{k_{n}}$ .

In order to get the convergence rate of our estimator $\widehat{\theta}=(\widehat{\alpha},\widehat{\beta})$ we will apply Theorem 3.4.1 of van der Vaart and Wellner, (1996). According to the notation in that Theorem, let $d(\theta_{1},\theta_{2})=\widetilde{\pi}_{\mathbb{P}}(\theta_{1},\theta_{2})$ and $\Theta_{n}=\{\theta\in\Theta_{n}^{\star}:|\alpha-\alpha_{0}|+\|\beta-\beta_{0}\|_{{\mathcal{H}}}\leq\epsilon_{0}/2\}$ , where $\epsilon_{0}$ is given in assumption A1. Furthermore, the function $m_{\theta}$ in that Theorem equals

m_{\theta}(y,X)=\,-\,\phi\left(y,\alpha+\langle X,\beta\rangle\right)w(X)\,.

First of all, note that Theorem 3.2 implies that $\mathbb{P}(\widehat{\theta}\in\Theta_{n})\to 1$ as required. Secondly, to emphasize the dependence on $n$ denote $\widetilde{\theta}_{0,n}=(\alpha_{0},\widetilde{\beta})$ with $\widetilde{\beta}=\widetilde{\beta}_{k}$ defined in assumption A10. Assumption A6b, the fact that $\|f\|\leq\|f\|_{{\mathcal{H}}}$ and the Bounded Convergence Theorem imply that $\widetilde{\pi}_{\mathbb{P}}(\widetilde{\theta}_{0,n},\theta_{0})\to 0$ , as $n\to\infty$ . Furthermore, from Theorem 3.2, we get that $\widetilde{\pi}_{\mathbb{P}}(\widehat{\theta},\theta_{0})\buildrel a.s.\over{\longrightarrow}0$ . Hence, using the triangle inequality, we immediately obtain that $\widetilde{\pi}_{\mathbb{P}}(\widehat{\theta},\widetilde{\theta}_{0,n})\buildrel p\over{\longrightarrow}0$ as required in Theorem 3.4.1 of van der Vaart and Wellner, (1996). Moreover, we also have that $L_{n}(\widehat{\theta})\leq L_{n}(\widetilde{\theta}_{0,n})$ , since $\widetilde{\theta}_{0,n}\in\mathbb{R}\times{\mathcal{M}}_{k}$ , which is also a requirement to apply that result.

Let $A=\sup_{y\in\{0,1\},t\in\mathbb{R}}|{\partial^{2}}\phi(y,t)/{\partial t^{2}}|$ and denote $C^{2}=16(A+C_{0}^{\star})\mathbb{E}\left\{w(X)\|X\|^{2}\right\}/C_{0}^{\star}$ with $C_{0}^{\star}$ the constant given in assumption A1. Define $\delta_{n}=C\|\widetilde{\beta}-\beta_{0}\|_{{\mathcal{H}}}$ . Note that from A10, $\delta_{n}=O(n^{-\,r\varsigma})$ .

To make use of Theorem 3.4.1 of van der Vaart and Wellner, (1996), we have to show that there exists a function $\varphi_{n}:(0,\infty)\to\mathbb{R}$ such that $\varphi_{n}(\delta)/\delta^{\gamma}$ is decreasing on $(\delta_{n},\infty)$ , for some $\gamma<2$ and such that, for any $\delta>\delta_{n}$ , we have

	$\displaystyle\sup_{\theta\in\Theta_{n,\delta}}\mathbb{E}\left(m_{\theta}(y,X)-m_{\widetilde{\theta}_{0,n}}(y,X)\right)=L(\widetilde{\theta}_{0,n})-\inf_{\theta\in\Theta_{n,\delta}}L(\theta)\lesssim-\delta^{2}\,,$		(34)
	$\displaystyle\mathbb{E}^{*}\sup_{\theta\in\Theta_{n,\delta}}\left|\mathbb{G}_{n}\left(m_{\theta}-m_{\widetilde{\theta}_{0,n}}\right)\right|=\mathbb{E}^{*}\sup_{\theta\in\Theta_{n,\delta}}\sqrt{n}\biggl|\left(L_{n}(\theta)-L(\theta)\right)-\left(L_{n}(\widetilde{\theta}_{0,n})-L(\widetilde{\theta}_{0,n})\right)\biggr|\lesssim\varphi_{n}(\delta)\,,$		(35)

where $\mathbb{G}_{n}f=\sqrt{n}(P_{n}-P)f$ , $\mathbb{E}^{*}$ stands for the outer expectation and $\Theta_{n,\delta}=\{\theta\in\Theta_{n}:\delta/2<d(\theta,\widetilde{\theta}_{0,n})\leq\delta\}$ .

We begin by showing (34). Assumption A1 entails that, for any $\theta\in\Theta_{n}$ ,

L(\theta)-L(\theta_{0})\geq C_{0}^{\star}\,\widetilde{\pi}_{\mathbb{P}}^{2}(\theta,\theta_{0})\,,

(36)

while from (18) in the proof of Theorem 3.1(c), we get that

L(\widetilde{\theta}_{0,n})-L(\theta_{0})\leq A\,\mathbb{E}\left\{w(X)\|X\|^{2}\right\}\|\widetilde{\beta}-\beta_{0}\|_{{\mathcal{H}}}^{2}=\frac{A\,\mathbb{E}\left\{w(X)\|X\|^{2}\right\}}{C^{2}}\delta_{n}^{2}\,.

(37)

Moreover, using the Cauchy–Schwarz inequality and the fact that $\|f\|\leq\|f\|_{{\mathcal{H}}}$ , we have

\widetilde{\pi}_{\mathbb{P}}(\theta,\widetilde{\theta}_{0,n})\leq\widetilde{\pi}_{\mathbb{P}}(\theta,\theta_{0})+\widetilde{\pi}_{\mathbb{P}}(\theta_{0},\widetilde{\theta}_{0,n})\leq\widetilde{\pi}_{\mathbb{P}}(\theta,\theta_{0})+\left\{\mathbb{E}\left\{w(X)\|X\|^{2}\right\}\|\widetilde{\beta}-\beta_{0}\|_{{\mathcal{H}}}^{2}\right\}^{1/2}\,,

which together with the inequality $(a+b)^{2}\leq 2(a^{2}+b^{2})$ , implies that

\widetilde{\pi}_{\mathbb{P}}^{2}(\theta,\theta_{0})\geq\frac{1}{2}\widetilde{\pi}_{\mathbb{P}}^{2}(\theta,\widetilde{\theta}_{0,n})-\mathbb{E}\left\{w(X)\|X\|^{2}\right\}\|\widetilde{\beta}-\beta_{0}\|_{{\mathcal{H}}}^{2}=\frac{1}{2}\widetilde{\pi}_{\mathbb{P}}^{2}(\theta,\widetilde{\theta}_{0,n})-\frac{\mathbb{E}\left\{w(X)\|X\|^{2}\right\}}{C^{2}}\delta_{n}^{2}\,.

(38)

Then combining (36), (37) and (38), we get that for any $\theta\in\Theta_{n}$ , such that $\delta/2<d(\theta,\widetilde{\theta}_{0,n})=\widetilde{\pi}_{\mathbb{P}}(\theta,\widetilde{\theta}_{0,n})$ , we have

	$\displaystyle L(\theta)-L(\widetilde{\theta}_{0,n})$	$\displaystyle=\left\{L(\theta)-L(\theta_{0})\right\}-\left\{L(\widetilde{\theta}_{0,n})-L(\theta_{0})\right\}\geq C_{0}^{\star}\,\widetilde{\pi}_{\mathbb{P}}^{2}(\theta,\theta_{0})-\frac{A\,\mathbb{E}\left\{w(X)\|X\|^{2}\right\}}{C^{2}}\delta_{n}^{2}$
		$\displaystyle\geq\frac{C_{0}^{\star}}{2}\widetilde{\pi}_{\mathbb{P}}^{2}(\theta,\widetilde{\theta}_{0,n})-\frac{(A+C_{0}^{\star})\mathbb{E}\left\{w(X)\|X\|^{2}\right\}}{C^{2}}\delta_{n}^{2}\geq\frac{C_{0}^{\star}}{8}\delta^{2}-\frac{C_{0}^{\star}}{16}\delta_{n}^{2}\geq\frac{C_{0}^{\star}}{16}\delta^{2}\,,$

concluding the proof of (34).
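The role of the constant $C$ in this chain can be verified with a few lines of arithmetic. The sketch below uses hypothetical values for $A$, $C_{0}^{\star}$, $\mathbb{E}\{w(X)\|X\|^{2}\}$ (called `Ew`) and $\|\widetilde{\beta}-\beta_{0}\|_{{\mathcal{H}}}$ (called `beta_err`); none of them comes from the paper. It checks that the choice $C^{2}=16(A+C_{0}^{\star})\mathbb{E}\{w(X)\|X\|^{2}\}/C_{0}^{\star}$ turns the penalty term into $C_{0}^{\star}\delta_{n}^{2}/16$, and that the final lower bound holds for $\delta>\delta_{n}$.

```python
import math

def check_constants(A, C0_star, Ew, beta_err, delta_factor):
    """Evaluate the quantities in the last two inequalities of the chain.

    All arguments are hypothetical inputs; delta is set to
    delta_factor * delta_n, so delta_factor > 1 means delta > delta_n.
    """
    C2 = 16.0 * (A + C0_star) * Ew / C0_star          # C^2 as defined in the proof
    delta_n = math.sqrt(C2) * beta_err                # delta_n = C * ||beta_tilde - beta_0||
    delta = delta_factor * delta_n
    penalty = (A + C0_star) * Ew / C2 * delta_n ** 2  # term subtracted in the chain
    lower = C0_star / 8.0 * delta ** 2 - penalty      # bound before the last step
    return delta_n, delta, penalty, lower

A_hyp, C0_hyp = 0.3, 0.1
delta_n, delta, penalty, lower = check_constants(
    A=A_hyp, C0_star=C0_hyp, Ew=2.0, beta_err=0.01, delta_factor=2.0
)
```

Any `delta_factor` of at least one gives the same conclusion, since then $\delta_{n}^{2}\leq\delta^{2}$.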

We have now to find $\varphi_{n}(\delta)$ such that $\varphi_{n}(\delta)/\delta^{\gamma}$ is decreasing in $\delta$ , for some $\gamma<2$ and (35) holds. Define the class of functions

{\mathcal{F}}_{n,\delta}=\left\{V_{\theta}-V_{\widetilde{\theta}_{0,n}}:\theta\in\Theta_{n,\delta}\right\}\,,

with $V_{\theta}=\,-\,m_{\theta}=\phi\left(y,\alpha+\langle X,\beta\rangle\right)w(X)$ . Inequality (35) involves an empirical process indexed by ${\mathcal{F}}_{n,\delta}$ , since

\mathbb{E}^{*}\sup_{\stackrel{{\scriptstyle\theta\in\Theta_{n}}}{{\delta/2<d(\theta,\widetilde{\theta}_{0,n})\leq\delta}}}\left|\mathbb{G}_{n}\left(m_{\theta}-m_{\widetilde{\theta}_{0,n}}\right)\right|=\mathbb{E}^{*}\sup_{f\in{\mathcal{F}}_{n,\delta}}\sqrt{n}|(P_{n}-P)f|\,.

For any $f\in{\mathcal{F}}_{n,\delta}$ we have that $\|f\|_{\infty}\leq A_{1}=2\sup_{y\in\{0,1\};t\in\mathbb{R}}\phi(y,t)=2M_{\phi}$ . Furthermore, setting $A_{2}=2\,M_{\psi}$ and using the inequality

	$\displaystyle\|V_{\theta}-V_{\widetilde{\theta}_{0,n}}\|$	$\displaystyle=w(X)\left\|\phi\left(y,\alpha+\langle X,\beta\rangle\right)-\phi\left(y,\alpha_{0}+\langle X,\widetilde{\beta}\rangle\right)\right\|$
		$\displaystyle\leq 2\,M_{\psi}\left\|\alpha-\alpha_{0}+\langle X,\beta-\widetilde{\beta}\rangle\right\|w(X)\,,$

and the fact that $\|w\|_{\infty}=1$ and $\widetilde{\pi}_{\mathbb{P}}^{2}(\theta,\widetilde{\theta}_{0,n})\leq\delta^{2}$ , we get that

Pf^{2}\leq 4\,M_{\psi}^{2}\;\mathbb{E}\left(\left[\alpha-\alpha_{0}+\langle X,\beta-\widetilde{\beta}\rangle\right]^{2}w(X)\right)\leq A_{2}^{2}\,\delta^{2}\,.

Lemma 3.4.2 in van der Vaart and Wellner, (1996) leads to

\mathbb{E}^{*}\sup_{f\in{\mathcal{F}}_{n,\delta}}\sqrt{n}|(P_{n}-P)f|\leq J_{[\;]}\left(A_{2}\delta,{\mathcal{F}}_{n,\delta},L_{2}(P)\right)\left(1+A_{1}\frac{J_{[\;]}(A_{2}\,\delta,{\mathcal{F}}_{n,\delta},L_{2}(P))}{A_{2}^{2}\delta^{2}\;\sqrt{n}}\right)\,,

where $J_{[\;]}(\delta,{\mathcal{F}},L_{2}(P))=\int_{0}^{\delta}\sqrt{1+\log N_{[\;]}(\epsilon,{\mathcal{F}},L_{2}(P))}d\epsilon$ is the bracketing integral of the class ${\mathcal{F}}$ .

Recall that $\|\widetilde{\beta}_{k}-\beta_{0}\|_{{\mathcal{H}}}=O(n^{-\,r\varsigma})$ , so for $n$ large enough, $\|\widetilde{\beta}_{k}-\beta_{0}\|_{{\mathcal{H}}}<\epsilon_{0}/2$ and, for any $\theta=(\alpha,\beta)\in\Theta_{n}$ , we have $|\alpha-\alpha_{0}|+\|\widetilde{\beta}_{k}-\beta\|_{{\mathcal{H}}}<\epsilon_{0}$ . Therefore, ${\mathcal{F}}_{n,\delta}\subset{\mathcal{G}}_{n,c,\widetilde{\theta}_{0,n},\widetilde{\theta}_{0,n}}$ where ${\mathcal{G}}_{n,c,\widetilde{\theta}_{0},\theta_{0}^{*}}$ is defined in Lemma A.6 and we take $\widetilde{\theta}_{0}=\widetilde{\theta}_{0,n}=(\alpha_{0},\widetilde{\beta}_{k})$ , $\theta_{0}^{*}=\widetilde{\theta}_{0,n}$ and $c=\epsilon_{0}$ . Hence, the bound given in Lemma A.6 leads to

N_{[\;]}\left(\epsilon,{\mathcal{F}}_{n,\delta},L_{2}(P)\right)\leq\left(\frac{B_{1}}{\epsilon}+1\right)^{k+1}\,,

for some positive constant $B_{1}$ independent of $n$ , $\widetilde{\theta}_{0,n}$ and $\epsilon$ . Therefore, for $\delta<B_{1}/A_{2}$ , we have

	$\displaystyle J_{[\;]}\left(A_{2}\delta,{\mathcal{F}}_{n,\delta},L_{2}(P)\right)$	$\displaystyle\leq\int_{0}^{A_{2}\,\delta}\sqrt{1+\log\left(\left(\frac{B_{1}}{\epsilon}+1\right)^{k+1}\right)}d\epsilon$
		$\displaystyle\leq\int_{0}^{A_{2}\,\delta}\sqrt{1+(k+1)\log\left(2\,\frac{B_{1}}{\epsilon}\right)}d\epsilon$
		$\displaystyle\leq 2\,(k+1)^{1/2}\int_{0}^{A_{2}\,\delta}\sqrt{1+\log\left(2\,\frac{B_{1}}{\epsilon}\right)}d\epsilon$
		$\displaystyle=4\,B_{1}\,(k+1)^{1/2}\int_{0}^{B_{2}\;\delta}\sqrt{1+\log\left(\frac{1}{\epsilon}\right)}d\epsilon\,,$

where $B_{2}={A_{2}}/({2\,B_{1}})$ . Note that $\int_{0}^{\delta}\sqrt{1+\log(1/\epsilon)}\,d\epsilon=O(\delta\sqrt{\log(1/\delta)})$ as $\delta\to 0$ , hence there exist $\delta_{0}>0$ and a constant $C>0$ such that, for any $\delta<\delta_{0}$ , $\int_{0}^{\delta}\sqrt{1+\log(1/\epsilon)}\,d\epsilon\leq C\,\delta\,\sqrt{\log(1/\delta)}$ . This implies that, for $\delta<\delta_{0}/B_{2}$ ,

J_{[\;]}(A_{2}\delta,{\mathcal{F}}_{n,\delta},L_{2}(P))\lesssim\delta\,\sqrt{\log\left(\frac{1}{\delta}\right)}\sqrt{k+1}\,.
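The order $\delta\sqrt{\log(1/\delta)}$ of the entropy integral used above can be checked numerically. The following is only an illustrative sketch: the function name `entropy_integral`, the midpoint rule and the grid size are choices made here, not part of the paper. It approximates $\int_{0}^{\delta}\sqrt{1+\log(1/\epsilon)}\,d\epsilon$ after the change of variable $\epsilon=\delta u$ and reports its ratio to $\delta\sqrt{\log(1/\delta)}$ .

```python
import math

def entropy_integral(delta, n_grid=20000):
    """Midpoint-rule approximation of int_0^delta sqrt(1 + log(1/eps)) d eps.

    The substitution eps = delta * u maps the range to (0, 1); the
    integrand has only a mild (integrable) singularity at u = 0.
    """
    h = 1.0 / n_grid
    total = 0.0
    for i in range(n_grid):
        u = (i + 0.5) * h
        total += math.sqrt(1.0 + math.log(1.0 / (delta * u)))
    return delta * total * h

# Illustrative values of delta (assumed, not taken from the paper).
deltas = [10.0 ** (-j) for j in range(1, 7)]
ratios = [entropy_integral(d) / (d * math.sqrt(math.log(1.0 / d))) for d in deltas]
for d, r in zip(deltas, ratios):
    print(f"delta={d:.0e}  ratio={r:.4f}")
```

The ratios stay bounded as $\delta$ shrinks, which is what the constant $C$ in the bound encodes.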

If we denote $q_{n}=k+1$ , we obtain that for some constant $A_{3}$ independent of $n$ and $\delta$ ,

\mathbb{E}^{*}\sup_{\theta\in\Theta_{n,\delta}}\left|\mathbb{G}_{n}\left(m_{\theta}-m_{\widetilde{\theta}_{0,n}}\right)\right|\leq A_{3}\,\left[\delta\,q_{n}^{1/2}\sqrt{\log\left(\frac{1}{\delta}\right)}+\frac{q_{n}}{\sqrt{n}}\;\log\left(\frac{1}{\delta}\right)\right]\,.

Choosing

\varphi_{n}(\delta)=A_{3}\,\left[\delta\,q_{n}^{1/2}\sqrt{\log\left(\frac{1}{\delta}\right)}+\frac{q_{n}}{\sqrt{n}}\;\log\left(\frac{1}{\delta}\right)\right]\,,

we have that $\varphi_{n}(\delta)/\delta$ is decreasing in $\delta$ , concluding the proof of (35).
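As a sanity check of the monotonicity claim, the sketch below evaluates $\varphi_{n}(\delta)/\delta$ on a grid of $\delta\in(0,1)$ and also verifies the scaling property $\varphi_{n}(c\,\delta)\leq c\,\varphi_{n}(\delta)$ for $c=2$ . The values of $n$ , $q_{n}$ and $A_{3}$ are arbitrary placeholders chosen for illustration, not quantities from the paper.

```python
import math

def phi_n(delta, q_n, n, a3=1.0):
    """phi_n(delta) as in the display above; a3 stands in for the constant A_3."""
    log_term = math.log(1.0 / delta)
    return a3 * (delta * math.sqrt(q_n * log_term)
                 + q_n / math.sqrt(n) * log_term)

n, q_n = 1000, 10                      # illustrative values only
grid = [0.01 * j for j in range(1, 100)]   # delta in (0, 1)

# phi_n(delta) / delta should decrease along the grid.
ratios = [phi_n(d, q_n, n) / d for d in grid]
is_decreasing = all(a > b for a, b in zip(ratios, ratios[1:]))

# Consequence of the monotonicity: phi_n(c * delta) <= c * phi_n(delta), here with c = 2.
scaling_ok = all(phi_n(2 * d, q_n, n) <= 2 * phi_n(d, q_n, n)
                 for d in grid if 2 * d < 1)
print(is_decreasing, scaling_ok)
```

Both terms of $\varphi_{n}(\delta)/\delta$ , namely $q_{n}^{1/2}\sqrt{\log(1/\delta)}$ and $q_{n}\log(1/\delta)/(\delta\sqrt{n})$ , are decreasing on $(0,1)$ , which is what the check reflects.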

To apply Theorem 3.4.1 of van der Vaart and Wellner, (1996), it remains to show that $\gamma_{n}\lesssim\delta_{n}^{-1}$ and

\gamma_{n}^{2}\,\varphi_{n}\left(\frac{1}{\gamma_{n}}\right)\lesssim\sqrt{n}\,,

(39)

since $\varphi_{n}(c\delta)\leq c\,\varphi_{n}(\delta)$ , for $c>1$ . First note that $\gamma_{n}=O(n^{r\varsigma})$ and $\delta_{n}=C\|\widetilde{\beta}-\beta_{0}\|_{{\mathcal{H}}}=O(n^{-\,r\varsigma})$ , so that $\gamma_{n}\lesssim\delta_{n}^{-1}$ .

To derive (39), observe that

\gamma_{n}^{2}\varphi_{n}\left(\frac{1}{\gamma_{n}}\right)=A_{3}\,\left[\gamma_{n}q_{n}^{1/2}\,\sqrt{\log(\gamma_{n})}+\gamma_{n}^{2}\,\log(\gamma_{n})\;\frac{q_{n}}{\sqrt{n}}\right]=A_{3}\,\left[\sqrt{n}\;a_{n}(1+a_{n})\right]\,,

where $a_{n}=\gamma_{n}\,\sqrt{\log(\gamma_{n})}\;q_{n}^{1/2}/\sqrt{n}$ . Hence, to derive that $\gamma_{n}^{2}\varphi_{n}\left(1/{\gamma_{n}}\right)\lesssim\sqrt{n}$ , it is enough to show that $a_{n}=O(1)$ , which follows easily since $q_{n}=O(n^{\varsigma})$ and $\gamma_{n}\sqrt{\log(\gamma_{n})}=O(n^{(1-\varsigma)/2})$ , concluding the proof of (39).
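For completeness, the exponent count behind $a_{n}=O(1)$ , using only the two rates just quoted, is

a_{n}=\frac{\gamma_{n}\sqrt{\log(\gamma_{n})}\;q_{n}^{1/2}}{\sqrt{n}}=O\left(n^{(1-\varsigma)/2}\right)\,O\left(n^{\varsigma/2}\right)\,n^{-1/2}=O\left(n^{(1-\varsigma)/2+\varsigma/2-1/2}\right)=O(1)\,.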

Hence, from Theorem 3.4.1 of van der Vaart and Wellner, (1996), we get that $\gamma_{n}\widetilde{\pi}_{\mathbb{P}}(\widehat{\theta},\widetilde{\theta}_{0,n})=O_{\mathbb{P}}(1)$ . As noticed above,

\widetilde{\pi}_{\mathbb{P}}(\theta_{0},\widetilde{\theta}_{0,n})\leq\left\{\mathbb{E}\left\{w(X)\|X\|^{2}\right\}\|\widetilde{\beta}-\beta_{0}\|_{{\mathcal{H}}}^{2}\right\}^{1/2}=O(n^{-\,r\varsigma})\,.

Then, using that $\gamma_{n}=O(n^{r\varsigma})$ , we get that $\gamma_{n}\widetilde{\pi}_{\mathbb{P}}(\theta_{0},\widetilde{\theta}_{0,n})=O_{\mathbb{P}}(1)$ and from the triangle inequality we obtain that $\gamma_{n}\widetilde{\pi}_{\mathbb{P}}(\widehat{\theta},\theta_{0})=O_{\mathbb{P}}(1)$ , as desired. ∎

Proof of Proposition 3.4.

From Lemma A.3, we have that there exists a constant $C_{0}>0$ independent of $n$ such that, for any $\theta$ ,

L(\theta)-L(\theta_{0})\geq C_{0}\,\pi_{\mathbb{P}}^{2}(\theta,\theta_{0})\,,

then to show that A1 holds, it will be enough to show that there exists a constant $C_{1}>0$ such that, for any $\theta=(\alpha,\beta)\in\mathbb{R}\times{\mathcal{H}}$ with $|\alpha-\alpha_{0}|+\|\beta-\beta_{0}\|<1$ ,

\pi_{\mathbb{P}}^{2}(\theta,\theta_{0})\geq C_{1}\,\widetilde{\pi}_{\mathbb{P}}^{2}(\theta,\theta_{0})\,,

and then take $\epsilon_{0}=1$ and $C_{0}^{\star}=C_{0}C_{1}$ .

Since $\mathbb{P}(\|X\|\leq C)=1$ , we have that, with probability one, for any $|\alpha-\alpha_{0}|+\|\beta-\beta_{0}\|<1$ ,

\left|\alpha+\langle X,\beta\rangle\right|\leq|\alpha_{0}|+1+C\left(\|\beta_{0}\|+1\right)=C^{\star}\,.

Thus, using that $F$ is strictly increasing we get that

0<A_{1}=F\left(-C^{\star}\right)\leq F\left(\alpha+\langle X,\beta\rangle\right)\leq F\left(C^{\star}\right)=A_{2}<1\,.

(40)

The Mean Value Theorem implies that given $\theta=(\alpha,\beta)\in\mathbb{R}\times{\mathcal{H}}$ with $|\alpha-\alpha_{0}|+\|\beta-\beta_{0}\|<1$ there exists $(\alpha^{\star}_{X},\beta^{\star}_{X})=(1-\omega_{X})\theta+\omega_{X}\theta_{0}$ , $0\leq\omega_{X}\leq 1$ , such that

	$\displaystyle\pi_{\mathbb{P}}^{2}(\theta,\theta_{0})$	$\displaystyle=\mathbb{E}\left\{w(X)\left[F(\alpha+\langle X,\beta\rangle)-F(\alpha_{0}+\langle X,\beta_{0}\rangle)\right]^{2}\right\}$
		$\displaystyle=\mathbb{E}\left\{w(X)\left\{F\left(\alpha_{X}^{\star}+\langle X,\beta_{X}^{\star}\rangle\right)\left[1-F\left(\alpha_{X}^{\star}+\langle X,\beta_{X}^{\star}\rangle\right)\right]\left(\alpha-\alpha_{0}+\langle X,\beta-\beta_{0}\rangle\right)\right\}^{2}\right\}$
		$\displaystyle\geq A_{1}^{2}(1-A_{2})^{2}\mathbb{E}\left\{w(X)\left[\alpha-\alpha_{0}+\langle X,\beta-\beta_{0}\rangle\right]^{2}\right\}=A_{1}^{2}(1-A_{2})^{2}\widetilde{\pi}_{\mathbb{P}}^{2}(\theta,\theta_{0})\,,$

where the last inequality follows from (40), since $|\alpha^{\star}_{X}-\alpha_{0}|+\|\beta^{\star}_{X}-\beta_{0}\|<1$ and the proof is concluded taking $C_{1}=A_{1}^{2}(1-A_{2})^{2}$ . ∎
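The key step of the proof is that, on $[-C^{\star},C^{\star}]$ , the derivative $F(1-F)$ of the logistic function is bounded below by $A_{1}(1-A_{2})$ , so every difference quotient of $F$ on that interval is at least $A_{1}(1-A_{2})$ . The following sketch checks this numerically; the value of `c_star`, the random seed and the number of draws are arbitrary choices made here for illustration.

```python
import math
import random

def F(t):
    """Logistic distribution function."""
    return 1.0 / (1.0 + math.exp(-t))

c_star = 3.0                      # illustrative value of C^star (assumed)
a1, a2 = F(-c_star), F(c_star)    # the constants A_1 and A_2 in (40)
lower = a1 * (1.0 - a2)           # lower bound for F' = F (1 - F) on [-C^star, C^star]

rng = random.Random(0)
worst = float("inf")
for _ in range(10000):
    s = rng.uniform(-c_star, c_star)
    t = rng.uniform(-c_star, c_star)
    if abs(s - t) > 1e-8:         # skip near-ties to avoid cancellation error
        worst = min(worst, abs(F(s) - F(t)) / abs(s - t))

print(f"A1*(1-A2) = {lower:.5f}, smallest observed slope = {worst:.5f}")
```

Squaring the difference quotient bound and integrating against $w(X)$ gives the constant $C_{1}=A_{1}^{2}(1-A_{2})^{2}$ obtained above.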

References

Aguilera et al., (2008) Aguilera, A., Escabias, M., and Valderrama, M. (2008). Discussion of different logistic models with functional data: Application to systemic lupus erythematosus. Computational Statistics and Data Analysis, 53:151–163.
Aguilera-Morillo et al., (2013) Aguilera-Morillo, M., Aguilera, A., Escabias, M., and Valderrama, M. (2013). Penalized spline approaches for functional logit regression. Test, 22:251–277.
Alin and Agostinelli, (2017) Alin, A. and Agostinelli, C. (2017). Robust iteratively reweighted SIMPLS. Journal of Chemometrics, 31:e2881.
Aneiros-Pérez et al., (2017) Aneiros-Pérez, G., Bongiorno, E. G., Cao, R., and Vieu, P. (2017). Functional Statistics and Related Fields. Springer.
Basu et al., (1998) Basu, A., Harris, I. R., Hjort, N. L., and Jones, M. C. (1998). Robust and efficient estimation by minimizing a density power divergence. Biometrika, 85:549–559.
Bianco et al., (2022) Bianco, A., Boente, G., and Chebi, G. (2022). Penalized robust estimators in logistic regression with applications to sparse models. Test, 31:563–594.
Bianco et al., (2023) Bianco, A., Boente, G., and Chebi, G. (2023). Asymptotic behaviour of penalized robust estimators in logistic regression when dimension increases. In Robust and Multivariate Statistical Methods: Festschrift in Honor of David E. Tyler, Eds: Yi, Mengxi and Nordhausen, Klaus, pages 323–348. Springer International Publishing.
Bianco and Martinez, (2009) Bianco, A. and Martinez, E. (2009). Robust testing in the logistic regression model. Computational Statistics and Data Analysis, 53:4095–4105.
Bianco and Yohai, (1996) Bianco, A. and Yohai, V. (1996). Robust estimation in the logistic regression model. Lecture Notes in Statistics, 109:17–34.
Boente and Martinez, (2023) Boente, G. and Martinez, A. (2023). A robust spline approach in partially linear additive models. Computational Statistics and Data Analysis, 178:107611.
Boente et al., (2020) Boente, G., Salibián-Barrera, M., and Vena, P. (2020). Robust estimation for semi–functional linear regression models. Computational Statistics and Data Analysis, 152:107041.
Bondell, (2005) Bondell, H. D. (2005). Minimum distance estimation for the logistic regression model. Biometrika, 92:724–731.
Bondell, (2008) Bondell, H. D. (2008). A characteristic function approach to the biased sampling model, with application to robust logistic regression. Journal of Statistical Planning and Inference, 138:742–755.
Cai and Hall, (2006) Cai, T. and Hall, P. (2006). Prediction in functional linear regression. Annals of Statistics, 34:2159–2179.
Cantoni and Ronchetti, (2001) Cantoni, E. and Ronchetti, E. (2001). Robust inference for generalized linear models. Journal of the American Statistical Association, 96:1022–1030.
Cardot et al., (2003) Cardot, H., Ferraty, F., and Sarda, P. (2003). Spline estimators for the functional linear model. Statistica Sinica, 13:571–591.
Cardot and Sarda, (2005) Cardot, H. and Sarda, P. (2005). Estimation in generalized linear models for functional data via penalized likelihood. Journal of Multivariate Analysis, 92:24–41.
Carroll and Pederson, (1993) Carroll, R. J. and Pederson, S. (1993). On robust estimation in the logistic regression model. Journal of the Royal Statistical Society, Series B, 55:693–706.
Croux and Haesbroeck, (2003) Croux, C. and Haesbroeck, G. (2003). Implementing the Bianco and Yohai estimator for logistic regression. Computational Statistics and Data Analysis, 44:273–295.
Cuevas, (2014) Cuevas, A. (2014). A partial overview of the theory of statistics with functional data. Journal of Statistical Planning and Inference, 147:1–23.
Denhere and Billor, (2016) Denhere, M. and Billor, N. (2016). Robust principal component functional logistic regression. Communications in Statistics: Simulation and Computation, 45:264–281.
DeVore and Lorentz, (1993) DeVore, R. and Lorentz, G. (1993). Constructive Approximation. Springer.
Escabias et al., (2005) Escabias, M., Aguilera, A., and Valderrama, M. (2005). Modeling environmental data by functional principal component logistic regression. Environmetrics, 16:95–107.
Escabias et al., (2004) Escabias, M., Aguilera, A. M., and Valderrama, M. (2004). Principal component estimation of functional logistic regression: Discussion of two different approaches. Journal of Nonparametric Statistics, 16:365–384.
Febrero-Bande et al., (2017) Febrero-Bande, M., Galeano, P., and González-Manteiga, W. (2017). Functional principal component regression and functional partial least–squares regression: An overview and a comparative study. International Statistical Review, 85:61–83.
Ferraty and Romain, (2010) Ferraty, F. and Romain, Y. (2010). The Oxford Handbook of Functional Data Analysis. Oxford University Press.
Ferraty and Vieu, (2006) Ferraty, F. and Vieu, P. (2006). Nonparametric Functional Data Analysis: Theory and Practice. Springer.
García Ben and Yohai, (2004) García Ben, M. and Yohai, V. J. (2004). Quantile–quantile plot for deviance residuals in the generalized linear model. Journal of Computational and Graphical Statistics, 13:36–47.
Goia and Vieu, (2016) Goia, A. and Vieu, P. (2016). An introduction to recent advances in high/infinite dimensional statistics. Journal of Multivariate Analysis, 146:1–6.
Hall and Horowitz, (2007) Hall, P. and Horowitz, J. L. (2007). Methodology and convergence rates for functional linear regression. Annals of Statistics, 35:70–91.
He and Shi, (1996) He, X. and Shi, P. (1996). Bivariate tensor-product B-spline in a partly linear model. Journal of Multivariate Analysis, 58:162–181.
He and Shi, (1998) He, X. and Shi, P. (1998). Monotone B-spline smoothing. Journal of the American Statistical Association, 93:643–650.
He et al., (2002) He, X., Zhu, Z., and Fung, W. (2002). Estimation in a semiparametric model for longitudinal data with unspecified dependence structure. Biometrika, 89:579–590.
Hobza et al., (2008) Hobza, T., Pardo, L., and Vajda, I. (2008). Robust median estimator in logistic regression. Journal of Statistical Planning and Inference, 138:3822–3840.
Horváth and Kokoszka, (2012) Horváth, L. and Kokoszka, P. (2012). Inference for Functional Data with Applications. Springer.
Hsing and Eubank, (2015) Hsing, T. and Eubank, R. (2015). Theoretical foundations of Functional Data Analysis with an introduction to Linear Operators, volume 997. John Wiley and Sons.
Hubert et al., (2005) Hubert, M., Rousseeuw, P. J., and Vanden Branden, K. (2005). ROBPCA: A new approach to robust principal component analysis. Technometrics, 47:64–79.
James, (2002) James, G. M. (2002). Generalized linear models with functional predictors. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64:411–432.
Kalogridis, (2023) Kalogridis, I. (2023). Robust and adaptive functional logistic regression. Available at https://arxiv.org/abs/2305.01350.
Kalogridis and Van Aelst, (2019) Kalogridis, I. and Van Aelst, S. (2019). Robust functional regression based on principal components. Journal of Multivariate Analysis, 173:393–415.
Kalogridis and Van Aelst, (2023) Kalogridis, I. and Van Aelst, S. (2023). Robust penalized estimators for functional linear regression. Journal of Multivariate Analysis, 194:105104.
Liebl, (2013) Liebl, D. (2013). Modelling and forecasting electricity spot prices: a functional data perspective. The Annals of Applied Statistics, 7:1562–1592.
López-Pintado and Romo, (2009) López-Pintado, S. and Romo, J. (2009). On the concept of depth for functional data. Journal of the American Statistical Association, 104:718–734.
Lu, (2015) Lu, M. (2015). Spline estimation of generalised monotonic regression. Journal of Nonparametric Statistics, 27:19–39.
Mallat, (2009) Mallat, S. (2009). A Wavelet Tour of Signal Processing: the Sparse Way. Academic Press.
Maronna et al., (2019) Maronna, R., Martin, D., Yohai, V., and Salibián-Barrera, M. (2019). Robust Statistics: Theory and Methods (with R). John Wiley and Sons.
Maronna and Yohai, (2013) Maronna, R. and Yohai, V. (2013). Robust functional linear regression based on splines. Computational Statistics and Data Analysis, 65:46–55.
Marx and Eilers, (1999) Marx, B. D. and Eilers, P. H. C. (1999). Generalized linear regression on sampled signals and curves: A P-spline approach. Technometrics, 4:1–13.
Mousavi and Sørensen, (2018) Mousavi, S. N. and Sørensen, H. (2018). Functional logistic regression: A comparison of three methods. Journal of Statistical Computation and Simulation, 88:250–268.
Müller, (2005) Müller, H. G. (2005). Functional modelling and classification of longitudinal data. Scandinavian Journal of Statistics, 32:223–240.
Müller and Stadtmüller, (2005) Müller, H. G. and Stadtmüller, U. (2005). Generalized functional linear models. Annals of Statistics, 33:774–805.
Mutis et al., (2022) Mutis, M., Beyaztas, U., Simsek, G., and Shang, H. (2022). A robust scalar–on–function logistic regression for classification. Communications in Statistics - Theory and Methods, pages 1–17.
Pollard, (1984) Pollard, D. (1984). Convergence of Stochastic Processes. Springer Series in Statistics.
Powell, (1981) Powell, M. (1981). Approximation Theory and Methods. Cambridge: Cambridge University Press.
Qingguo, (2015) Qingguo, T. (2015). Estimation for semi-functional linear regression. Statistics, 49:1262–1278.
Ramsay and Silverman, (2002) Ramsay, J. and Silverman, B. (2002). Applied Functional Data Analysis. Methods and Case Studies. Springer.
Ramsay and Silverman, (2005) Ramsay, J. and Silverman, B. (2005). Functional Data Analysis, 2nd edition. Springer.
Ramsay, (2004) Ramsay, J. O. (2004). Functional data analysis. Encyclopedia of Statistical Sciences.
Ratcliffe et al., (2002) Ratcliffe, S. J., Heller, G. Z., and Leader, L. R. (2002). Functional data analysis with application to periodically stimulated foetal heart rate data. II: Functional logistic regression. Statistics in Medicine, 21:1115–1127.
Reiss et al., (2017) Reiss, P. T., Goldsmith, J., Shang, H. L., and Ogden, R. T. (2017). Methods for scalar–on–function regression. International Statistical Review, 85:228–249.
Reiss et al., (2005) Reiss, P. T., Ogden, R. T., Mann, J., and Parsey, R. V. (2005). Functional logistic regression with PET imaging data: A voxel-level clinical diagnostic tool. Journal of Cerebral Blood Flow and Metabolism, 25:S635–S635.
Ronchetti, (1985) Ronchetti, E. (1985). Robust model selection in regression. Statistics and Probability Letters, 3:21–23.
Schumaker, (1981) Schumaker, L. (1981). Spline Functions: Basic Theory. Wiley.
Schwarz, (1978) Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6:461–464.
Sørensen et al., (2013) Sørensen, H., Goldsmith, J., and Sangalli, L. M. (2013). An introduction with medical applications to functional data analysis. Statistics in Medicine, 32:5222–5240.
Stone, (1982) Stone, C. (1982). Optimal global rates of convergence for nonparametric regression. Annals of Statistics, 10:1040–1053.
Stone, (1985) Stone, C. (1985). Additive regression and other nonparametric models. Annals of Statistics, 13:689–705.
Sun and Genton, (2011) Sun, Y. and Genton, M. G. (2011). Functional boxplots. Journal of Computational and Graphical Statistics, 20:316–334.
Tharmaratnam and Claeskens, (2013) Tharmaratnam, K. and Claeskens, G. (2013). A comparison of robust versions of the AIC based on M-, S- and MM-estimators. Statistics, 47:216–235.
van de Geer, (1988) van de Geer, S. (1988). Regression Analysis and Empirical Processes. CWI Tract 45. Center for Mathematics and Computer Science, Amsterdam. Available at https://ir.cwi.nl/pub/13169.
van de Geer, (2000) van de Geer, S. (2000). Empirical Processes in $M$-Estimation. Cambridge Series in Statistical and Probabilistic Mathematics.
van der Vaart and Wellner, (1996) van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. Springer, New York.
Wang and Xiang, (2012) Wang, H. and Xiang, S. (2012). On the convergence rates of Legendre approximation. Mathematics of Computation, 81:861–877.
Wang et al., (2016) Wang, J. L., Chiou, J., and Müller, H. (2016). Functional data analysis. Annual Review of Statistics and Its Application, 3:257–295.
Wang et al., (2017) Wang, X., Nan, B., Zhu, J., Koeppe, R., and Frey, K. (2017). Classification of ADNI PET images via regularized 3D functional data analysis. Biostatistics and Epidemiology, 1:3–19.
Zhao et al., (2012) Zhao, Y., Ogden, T., and Reiss, P. (2012). Wavelet-based LASSO in functional linear regression. Journal of Computational and Graphical Statistics, 21:600–617.

[Figure panels (repeated across five figures): $\widehat{\beta}_{\mbox{\scriptsize\sc cl}}$ , $\widehat{\beta}_{\mbox{\footnotesize\sc m}}$ , $\widehat{\beta}_{\mbox{\scriptsize{wcl-hr}}}$ , $\widehat{\beta}_{\mbox{\scriptsize{wm-hr}}}$ , $\widehat{\beta}_{\mbox{\scriptsize{wcl-fbb}}}$ , $\widehat{\beta}_{\mbox{\scriptsize{wm-fbb}}}$ .]