Robust estimation for functional logistic regression models

Graciela Boente^a, Marina Valdora^a
^a Universidad de Buenos Aires and CONICET, Argentina
Abstract

This paper addresses the problem of providing robust estimators under a functional logistic regression model. Logistic regression is a popular tool in classification problems with two populations. As in functional linear regression, regularization tools are needed to compute estimators for the functional slope. The traditional methods are based on dimension reduction or penalization combined with maximum likelihood or quasi–likelihood techniques and, for that reason, they may be affected by misclassified points, especially if these are associated with functional covariates with atypical behaviour. The proposal given in this paper adapts some of the best practices used when the covariates are finite–dimensional to provide reliable estimations. Under regularity conditions, consistency of the resulting estimators and rates of convergence for the predictions are derived. A numerical study illustrates the finite sample performance of the proposed method and reveals its stability under different contamination scenarios. A real data example is also presented.

Keywords: B-splines; Functional Data Analysis; Logistic Regression Models; Robust Estimation

AMS Subject Classification: 62F35; 62G25

1 Introduction

In many applications, such as chemometrics, image recognition and spectroscopy, the observed data contain functional covariates, that is, variables originated by phenomena that are continuous in time or space and can be assumed to be smooth functions, rather than finite dimensional vectors. Functional data analysis aims to provide tools for analysing such data and has received considerable attention in recent years due to its high versatility and numerous applications. Different approaches, either parametric, nonparametric or semiparametric, have been proposed to model data with functional predictors. Some well-known references on the treatment of functional data are the books of Ferraty and Vieu (2006) and Ferraty and Romain (2010), who carefully discuss non–parametric models, and also Ramsay and Silverman (2005, 2002), Horváth and Kokoszka (2012) and Hsing and Eubank (2015), who place emphasis on parametric models such as the functional linear one. We also refer to Aneiros-Pérez et al. (2017) and the reviews in Cuevas (2014) and Goia and Vieu (2016) for some other results in the area.

Among regression models relating a scalar response with a functional covariate, the functional linear regression model is one of the most popular. Several estimation aspects under this model have been considered, among others, in Cardot et al. (2003), Cardot and Sarda (2005), Cai and Hall (2006) and Hall and Horowitz (2007); see also Febrero-Bande et al. (2017) and Reiss et al. (2017) for a review. Robust proposals for functional linear regression models using either $P$-splines, $B$-splines or functional principal components were given in Maronna and Yohai (2013), Boente et al. (2020), Kalogridis and Van Aelst (2023) and Kalogridis and Van Aelst (2019), respectively.

Most of the papers mentioned above consider the case where the response is a continuous variable. However, discrete responses arise in some practical problems such as classification. In particular, when the interest lies in the presence or absence of a given condition, the response corresponds to a binary outcome. Reiss et al. (2017) include a review of relevant papers treating the case of responses whose conditional distribution belongs to an exponential family, that is, estimation under a generalized functional linear regression model. As in functional linear models, the naive approach to estimating the functional regression parameter, which takes as multivariate covariates the values of the functional covariates observed on a grid of points and ignores their functional nature, is not appropriate, since it leads to an ill-conditioned problem. The causes of this issue are, on the one hand, the high correlation existing, within each trajectory, among observations corresponding to close grid values and, on the other hand, the fact that the number of grid measurements may exceed the number of observations; see Marx and Eilers (1999) and Ramsay (2004) for a discussion. As mentioned in Wang et al. (2016), one of the challenges in functional regression is the inverse nature of the problem, which causes estimation problems mainly generated by the compactness of the covariance operator of $X$. For the reasons mentioned above, the extension from the situation with finite-dimensional predictors to the case of an infinite-dimensional one is not direct. The usual practice to solve this drawback is regularization, which can be achieved in several ways, either by reducing the set of candidates to a finite–dimensional one or by adding a penalty term, as when considering $P$-splines. In particular, Cardot and Sarda (2005) proposed estimators based on penalized likelihood and spline approximations and derived consistency properties for them.
A different approach was followed in Müller and Stadtmüller (2005), where the estimators are obtained via a truncated Karhunen-Loève expansion for the covariates. These authors also provided a theoretical study of the properties of their estimators as well as an illustration of their proposal on a classification problem that analyzes the interplay of longevity and reproduction in short and long–lived Mediterranean fruit flies.

Among generalized regression models, logistic regression is one of the best known and most useful models in statistics. It has been extensively studied when euclidean covariates arise and several robust proposals were given in this setting, some of which will be mentioned below, in Section 2.1. The functional logistic regression model is a generalization of the finite–dimensional logistic regression model, which assumes that the observed covariates are functional data rather than vectors in $\mathbb{R}^{p}$. It is particularly relevant in discrimination problems for curve data. This model was already considered in James (2002) as a particular case of generalized linear models with functional predictors. Beyond the procedures studied in the framework of generalized functional regression models, which can be used in functional logistic regression models, some authors have considered specific estimation methods for this particular model. Among others, we can mention Escabias et al. (2004), Aguilera et al. (2008), Aguilera-Morillo et al. (2013) and Mousavi and Sørensen (2018), who provide a review and comparison of different estimation methods for functional logistic regression. Among the numerous interesting applications of functional logistic regression that have been reported in the literature, we can mention Escabias et al. (2005), who use it to model environmental data, Ratcliffe et al. (2002), who apply it to foetal heart rate data, and Reiss et al. (2005), who analyze PET imaging data. Besides, Sørensen et al. (2013) present several medical applications of functional data, while Wang et al. (2017) used a penalized Haar wavelet approach for the classification of brain images to assist in early diagnosis of Alzheimer’s disease.

This paper focuses on the functional logistic regression model with scalar response, also labelled as scalar–on–function logistic regression in the literature. Under this model, the i.i.d. observations $(y_{i},X_{i})$, $1\leq i\leq n$, are such that the response $y_{i}\in\{0,1\}$ and the predictor $X_{i}\in L^{2}({\mathcal{I}})$, with ${\mathcal{I}}$ a compact interval, while the link function is the logistic one. More precisely, if $Bi(1,p)$ stands for the Bernoulli distribution with success probability $p$ and $F(t)=\exp(t)\left[1+\exp(t)\right]^{-1}$ denotes the logistic function, the functional logistic regression model assumes that

$$y_{i}|X_{i}\sim Bi\left(1,p_{i}\right)\qquad\mbox{where}\qquad p_{i}=\mathbb{E}(y_{i}|X_{i})=F\left(\alpha_{0}+\langle X_{i},\beta_{0}\rangle\right)\,, \tag{1}$$

with $\alpha_{0}\in\mathbb{R}$ and $\beta_{0}\in L^{2}({\mathcal{I}})$, where $\langle u,v\rangle=\int_{{\mathcal{I}}}u(t)v(t)dt$ stands for the usual inner product in $L^{2}({\mathcal{I}})$ and $\|\cdot\|$ for the corresponding norm.
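As an illustration only, data following model (1) can be generated numerically once the curves are observed on a grid. The sketch below makes several assumptions that are not part of the model: the interval is taken as $[0,1]$, the inner product is approximated by the trapezoidal rule, and the law of $X_i$ (a random combination of $\sin(2\pi t)$ and $t$), the slope $\beta_0(t)=3\sin(2\pi t)$ and the intercept $0.5$ are arbitrary choices. The names `simulate` and `inner_product` are hypothetical.

```python
import math
import random


def logistic(t):
    """Logistic function F(t) = exp(t) / (1 + exp(t)), evaluated without overflow."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def inner_product(u, v, grid):
    """Trapezoidal approximation of <u, v>, the integral of u(t) v(t) over the grid."""
    total = 0.0
    for j in range(len(grid) - 1):
        h = grid[j + 1] - grid[j]
        total += 0.5 * h * (u[j] * v[j] + u[j + 1] * v[j + 1])
    return total


def simulate(n, alpha0, beta0, grid, rng):
    """Draw (y_i, X_i, p_i) from model (1); the law of X_i is an arbitrary choice."""
    beta_vals = [beta0(t) for t in grid]
    sample = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        x_vals = [z1 * math.sin(2 * math.pi * t) + z2 * t for t in grid]
        # Success probability p_i = F(alpha_0 + <X_i, beta_0>)
        p = logistic(alpha0 + inner_product(x_vals, beta_vals, grid))
        y = 1 if rng.random() < p else 0
        sample.append((y, x_vals, p))
    return sample


grid = [j / 100 for j in range(101)]
data = simulate(200, 0.5, lambda t: 3.0 * math.sin(2 * math.pi * t),
                grid, random.Random(0))
print(sum(y for y, _, _ in data), "successes out of", len(data))
```

Fixing the seed through `random.Random(0)` only makes the illustration reproducible; any other source of curves could be plugged in without changing the way the responses are drawn.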

The estimators for functional logistic regression mentioned above are based on the method of maximum likelihood combined with some regularization tool. As happens with finite dimensional covariates, these estimators are highly affected by the presence of misclassified points, especially when combined with high leverage covariates. Robust methods, on the other hand, have the advantage of giving reliable results, even when a proportion of the data correspond to atypical data. As mentioned above, some robust methods for functional linear regression models have been recently proposed and this area has shown great development in the last ten years. However, the literature on robust procedures for generalized functional linear models, and specifically for functional logistic regression ones, is scarce. To our knowledge, only a few procedures have been considered and most of them lack a careful study of the asymptotic properties of the proposal considered. The first attempt to provide a robust method for functional logistic regression was given in Denhere and Billor (2016). This method is based on reducing the dimension of the covariates by using a robust principal components method proposed in Hubert et al. (2005). The robustness of this method ensures that the functional principal component analysis is not influenced by outlying covariates. However, it does not take into account the problem of large deviance residuals originated by incorrectly classified observations. To solve this problem, Mutis et al. (2022) propose to combine a basis approximation with weights computed using the Pearson residuals, in an approach related to that given by Alin and Agostinelli (2017) for finite–dimensional covariates. Recently, Kalogridis (2023) introduced an approach based on divergence measures combined with penalizations and provided a careful study of its asymptotic properties, for bounded covariates.

In this paper, we follow a different perspective. Taking into account the sensitivity of these estimators to atypical observations and building on the ideas given for euclidean covariates by Bianco and Yohai (1996) and Croux and Haesbroeck (2003), we define robust estimators of the intercept $\alpha_{0}$ and the slope $\beta_{0}$ following a sieve approach combined with weighted $M$-estimators. More precisely, as done for instance in functional linear regression, we first reduce the set of candidates for estimating $\beta_{0}$ to those belonging to a finite–dimensional space spanned by a fixed basis selected by the practitioner, such as the $B$-spline, Fourier, or wavelet bases. This makes it possible to use the robust tools developed for finite–dimensional covariates in this infinite–dimensional framework. Clearly, this regularization process involves the selection of the basis dimension, which should increase with the sample size at a given rate and which must be chosen in a robust way. For that reason, in Section 2.4, we describe a resistant procedure to select the dimension of the approximating space.

The rest of the paper is organized as follows. The model and our proposed estimators are described in Section 2. Theoretical assurances regarding consistency and convergence rates of our proposal are provided in Section 3, while in Section 4 we report the results of a simulation study to explore their finite-sample properties. Section 5 contains the analysis of a real-data set, while final comments are given in Section 6. All proofs are relegated to the Appendix.

2 The estimators

As mentioned in the Introduction, our proposal for estimators under the functional logistic regression model (1) is based on basis reduction. For that reason and for the sake of completeness, in Section 2.1, we recall some of the robust proposals given when the covariates belong to a finite–dimensional space.

2.1 Some robust proposals for euclidean covariates

When the covariates are finite–dimensional, the practitioner deals with i.i.d. observations $\left(y_{i},\mathbf{x}_{i}\right)$, $1\leq i\leq n$, where $\mathbf{x}_{i}\in\mathbb{R}^{p}$ and $y_{i}\in\{0,1\}$. In this case, the well–known logistic regression model states that $y_{i}|\mathbf{x}_{i}\sim Bi(1,F(\alpha_{0}+\mathbf{x}_{i}^{\top}\boldsymbol{\beta}_{0}))$, where $\boldsymbol{\beta}_{0}\in\mathbb{R}^{p}$. As mentioned in the Introduction, the maximum likelihood estimator of the regression coefficients is very sensitive to outliers, meaning that we can neither accurately classify a new observation based on these estimators nor identify those covariates carrying important information for the assignment. To solve this drawback, different robust procedures have been considered.

In particular, consistent $M$-estimators bounding the deviance were defined in Bianco and Yohai (1996), while in order to obtain bounded influence estimators a weighted version was introduced in Croux and Haesbroeck (2003). For the family of $M$-estimators defined in Bianco and Yohai (1996), Croux and Haesbroeck (2003) introduced a loss function that guarantees the existence of the resulting robust estimator when the maximum likelihood estimators do exist. Basu et al. (1998) considered a proposal based on minimum divergence. However, their approach can also be seen as a particular case of the Bianco and Yohai (1996) estimator with a properly defined loss function. Other approaches were given in Cantoni and Ronchetti (2001), who consider a robust quasi–likelihood estimator, Bondell (2005, 2008), whose procedures incorporate a minimum distance perspective, and Hobza et al. (2008), who define a median estimator by using an $L^{1}$-estimator of the smoothed responses.

As pointed out in Maronna et al. (2019), the use of redescending weighted $M$-estimators ensures estimators with good robustness properties. For that reason and taking into account that our proposal will combine dimension reduction with weighted $M$-estimators, we briefly describe the proposal introduced in Croux and Haesbroeck (2003). From now on, denote by $d(y,t)$ the squared deviance function, that is, $d(y,t)=-\log(F(t))\,y-\log(1-F(t))\,(1-y)$, and let $\rho:\mathbb{R}_{\geq 0}\to\mathbb{R}$ be a bounded, differentiable and nondecreasing function with derivative $\psi=\rho^{\prime}$. Furthermore, define

$$\phi(y,t)=\rho(d(y,t))+G(F(t))+G(1-F(t))\,, \tag{2}$$

where $G(t)=\int_{0}^{t}\psi(-\log u)\,du$. The correction term $G(F(t))+G(1-F(t))$ was introduced in Bianco and Yohai (1996) to guarantee Fisher–consistency of the resulting procedure. It is worth mentioning that the function $\phi(y,t)$ can be written as

$$\phi(y,t)=y\,\rho\left(-\log\left[F(t)\right]\right)+G(F(t))+(1-y)\,\rho\left(-\log\left[1-F(t)\right]\right)+G(1-F(t))\,. \tag{3}$$

The weighted $M$-estimators are the minimizers of $L_{n}^{\star}(a,\mathbf{b})=\sum_{i=1}^{n}\phi(y_{i},a+\mathbf{x}_{i}^{\top}\mathbf{b})\,w(\mathbf{x}_{i})/n$, that is,

$$(\widehat{\alpha},\widehat{\boldsymbol{\beta}})=\mathop{\mbox{argmin}}_{a\in\mathbb{R},\mathbf{b}\in\mathbb{R}^{p}}L_{n}^{\star}(a,\mathbf{b})\,. \tag{4}$$

The weights $w(\mathbf{x}_{i})$ are usually based on a robust Mahalanobis distance of the explanatory variables, that is, they depend on the distance between $\mathbf{x}_{i}$ and a robust center of the data. With this notation, the minimum divergence estimators considered in Basu et al. (1998) correspond to the choice $\rho(t)=(1+1/c)(1-\exp(-\,c\,t))$ and $w\equiv 1$.
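To make the role of the correction term concrete, the following sketch evaluates the loss in (3) for the minimum divergence choice $\rho(t)=(1+1/c)(1-\exp(-c\,t))$ mentioned above. For this $\rho$, $\psi(t)=(1+c)\exp(-c\,t)$, so that $\psi(-\log u)=(1+c)u^{c}$ and $G(t)=t^{c+1}$ in closed form. The value $c=0.5$, the grid search and the function names are assumptions made only for illustration. The code then checks numerically that, when $y\sim Bi(1,p)$, the expected loss is minimized near $t_{0}$ with $F(t_{0})=p$, which is the Fisher–consistency property.

```python
import math


def softplus(x):
    """Stable evaluation of log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def logistic(t):
    """F(t) = 1 / (1 + exp(-t)) = exp(-softplus(-t))."""
    return math.exp(-softplus(-t))


def rho(s, c):
    """rho(s) = (1 + 1/c) (1 - exp(-c s)), the minimum divergence choice."""
    return (1.0 + 1.0 / c) * (1.0 - math.exp(-c * s))


def G(u, c):
    """Correction term: psi(s) = (1 + c) exp(-c s) gives G(u) = u ** (c + 1)."""
    return u ** (c + 1.0)


def phi(y, t, c):
    """Loss (3): y rho(-log F) + (1 - y) rho(-log(1 - F)) + G(F) + G(1 - F)."""
    F = logistic(t)
    # -log F(t) = softplus(-t) and -log(1 - F(t)) = softplus(t)
    return (y * rho(softplus(-t), c) + (1 - y) * rho(softplus(t), c)
            + G(F, c) + G(1.0 - F, c))


def expected_loss(p, t, c):
    """E phi(y, t) when y ~ Bi(1, p)."""
    return p * phi(1, t, c) + (1.0 - p) * phi(0, t, c)


p, c = 0.3, 0.5
ts = [-5.0 + 0.01 * i for i in range(1001)]
t_hat = min(ts, key=lambda t: expected_loss(p, t, c))
print(t_hat, math.log(p / (1.0 - p)))  # grid minimizer versus logit(p)
```

Dropping the two `G` terms from `phi` and rerunning the search is a quick way to see why the correction is needed, since the minimizer then generally moves away from the logit of $p$.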

The asymptotic properties of the $M$-estimators with $w\equiv 1$ were obtained in Bianco and Yohai (1996), while the situation of a general weight function was studied in Bianco and Martinez (2009). It should be mentioned that these estimators are implemented in the package RobStatTM, through the functions logregBY, when the weights equal 1, and logregWBY, when considering hard rejection weights derived from the MCD estimator of the continuous explanatory variables. In both cases, the loss function is taken as the one introduced in Croux and Haesbroeck (2003).

2.2 The case of functional covariates

In this section, we consider the situation where $(y_{i},X_{i})$, $1\leq i\leq n$, are independent observations such that $y_{i}\in\{0,1\}$ and $X_{i}\in L^{2}({\mathcal{T}})$ with ${\mathcal{T}}$ a compact interval that, without loss of generality, we assume to be ${\mathcal{T}}=[0,1]$. The model relating the responses to the covariates is the functional logistic regression model, that is, we assume that (1) holds.

As mentioned in the Introduction, estimation under functional linear regression or functional logistic regression models is an ill–posed problem. To avoid this issue, one possibility is dimension reduction, which can be achieved by considering as possible candidates for estimating $\beta_{0}$ the elements of a finite–dimensional space spanned by a fixed basis. This is the approach we follow in this paper, that is, to define robust estimators of the intercept $\alpha_{0}$ and the slope $\beta_{0}$, we will use a sieve approach combined with weighted $M$-estimators. We do not restrict our attention to a particular basis, such as the $B$-spline basis considered, for instance, in Boente et al. (2020) for the functional semi–linear model or Mutis et al. (2022) for the functional logistic regression one. Instead, we provide a general framework which allows the practitioner to choose the basis according to the smoothness knowledge or assumptions to be considered on $\beta_{0}$.

Henceforth, let $k=k_{n}$ stand for the dimension of the finite dimensional space spanned by the basis $\{B_{j}:1\leq j\leq k_{n}\}$. The space of possible candidates corresponds to $\mathbb{R}\times{\mathcal{M}}_{k}$, where ${\mathcal{M}}_{k}=\left\{\sum_{j=1}^{k}b_{j}B_{j},\;\mathbf{b}\in\mathbb{R}^{k}\right\}$.

From now on, for any $\mathbf{b}\in\mathbb{R}^{k}$, $\beta_{\mathbf{b}}$ will stand for $\beta_{\mathbf{b}}=\sum_{j=1}^{k}b_{j}B_{j}$. Then, for any possible candidate $\beta_{\mathbf{b}}\in{\mathcal{M}}_{k}$, the inner product $\langle X_{i},\beta_{\mathbf{b}}\rangle$ equals $\sum_{j=1}^{k}b_{j}x_{ij}$, where $x_{ij}=\langle X_{i},B_{j}\rangle$, which suggests using the robust estimators defined in Section 2.1 taking as covariates $\mathbf{x}_{i}=(x_{i1},\dots,x_{ik})^{\top}$.
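The reduction from the functional covariate to the score vector can be sketched numerically as follows. The code computes $x_{ij}=\langle X_{i},B_{j}\rangle$ by the trapezoidal rule on a grid and checks that $\langle X_{i},\beta_{\mathbf{b}}\rangle$ coincides with $\sum_{j}b_{j}x_{ij}$. The Fourier-type basis, the curve $X_{i}(t)=e^{-t}\cos(3t)$, the coefficient vector and all function names are hypothetical choices made only for the illustration.

```python
import math


def trapezoid(values, grid):
    """Trapezoidal rule for the integral of a function sampled on the grid."""
    return sum(0.5 * (grid[j + 1] - grid[j]) * (values[j] + values[j + 1])
               for j in range(len(grid) - 1))


def fourier_basis(k):
    """First k functions of 1, sqrt(2) sin(2 pi t), sqrt(2) cos(2 pi t), ..."""
    funcs = [lambda t: 1.0]
    m = 1
    while len(funcs) < k:
        funcs.append(lambda t, m=m: math.sqrt(2.0) * math.sin(2 * math.pi * m * t))
        funcs.append(lambda t, m=m: math.sqrt(2.0) * math.cos(2 * math.pi * m * t))
        m += 1
    return funcs[:k]


def scores(x_vals, basis_vals, grid):
    """x_ij = <X_i, B_j> for one curve X_i sampled on the grid."""
    return [trapezoid([x * b for x, b in zip(x_vals, bj)], grid)
            for bj in basis_vals]


grid = [j / 200 for j in range(201)]
basis_vals = [[B(t) for t in grid] for B in fourier_basis(5)]
x_vals = [math.exp(-t) * math.cos(3.0 * t) for t in grid]  # one made-up curve
b = [0.5, -1.0, 2.0, 0.3, -0.7]                            # candidate coefficients

x_i = scores(x_vals, basis_vals, grid)
# beta_b evaluated on the grid, then <X_i, beta_b> computed directly
beta_b = [sum(bj * Bj[m] for bj, Bj in zip(b, basis_vals)) for m in range(len(grid))]
direct = trapezoid([x * v for x, v in zip(x_vals, beta_b)], grid)
via_scores = sum(bj * xij for bj, xij in zip(b, x_i))
print(direct, via_scores)
```

Because the quadrature rule is linear, the two quantities agree up to rounding error whatever the grid, so the finite–dimensional estimators of Section 2.1 can be applied to the scores without further approximation beyond the numerical integration itself.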

More precisely, the weighted estimators defined in (4) over the finite–dimensional approximating spaces $\mathbb{R}\times{\mathcal{M}}_{k}$ are the key tool for obtaining consistent estimators of $\beta_{0}$. For any $\mathbf{b}\in\mathbb{R}^{k}$ define

$$L_{n}(\alpha,\beta_{\mathbf{b}})=\frac{1}{n}\sum_{i=1}^{n}\phi(y_{i},\alpha+\langle X_{i},\beta_{\mathbf{b}}\rangle)\,w(X_{i})=\frac{1}{n}\sum_{i=1}^{n}\phi(y_{i},\alpha+\mathbf{x}_{i}^{\top}\mathbf{b})\,w(X_{i})$$

and

$$(\widehat{\alpha},\widehat{\mathbf{b}})=\mathop{\mbox{argmin}}_{\alpha\in\mathbb{R},\mathbf{b}\in\mathbb{R}^{k}}L_{n}(\alpha,\beta_{\mathbf{b}})\,. \tag{5}$$

Hence, the estimator of $\beta_{0}$ is given by

$$\widehat{\beta}(t)=\beta_{\widehat{\mathbf{b}}}(t)=\sum_{j=1}^{k}\widehat{b}_{j}B_{j}(t)\,,$$

where $\widehat{\mathbf{b}}=\left(\widehat{b}_{1},\ldots,\widehat{b}_{k}\right)^{\top}$, meaning that $(\widehat{\alpha},\widehat{\beta})=\mathop{\mbox{argmin}}_{\alpha\in\mathbb{R},\beta\in{\mathcal{M}}_{k}}L_{n}(\alpha,\beta)$.
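A minimal sketch of the objective function minimized in (5) is given below, written in terms of the scores $\mathbf{x}_{i}$. It assumes the minimum divergence $\rho$ with $c=0.5$, for which $\rho(-\log u)=(1+1/c)(1-u^{c})$ and $G(u)=u^{c+1}$, and weights supplied by the user. The tiny data set and the function names are made up. The minimization itself, which could be carried out with any general-purpose numerical optimizer, is not shown.

```python
import math


def logistic(t):
    """Logistic function, written to avoid overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def phi(y, t, c=0.5):
    """Loss (3) for rho(s) = (1 + 1/c)(1 - exp(-c s)), for which
    rho(-log u) = (1 + 1/c)(1 - u**c) and G(u) = u**(c + 1)."""
    F = logistic(t)
    rho_one = (1.0 + 1.0 / c) * (1.0 - F ** c)            # rho(-log F(t))
    rho_zero = (1.0 + 1.0 / c) * (1.0 - (1.0 - F) ** c)   # rho(-log(1 - F(t)))
    return (y * rho_one + (1 - y) * rho_zero
            + F ** (c + 1.0) + (1.0 - F) ** (c + 1.0))


def objective(alpha, b, y, x, w):
    """L_n(alpha, beta_b) = (1/n) sum_i phi(y_i, alpha + x_i' b) w_i."""
    n = len(y)
    total = 0.0
    for yi, xi, wi in zip(y, x, w):
        t = alpha + sum(bj * xij for bj, xij in zip(b, xi))
        total += wi * phi(yi, t)
    return total / n


# Tiny made-up data set: three observations with k = 2 scores each.
y = [1, 0, 1]
x = [[0.8, -0.1], [-1.2, 0.4], [0.3, 0.9]]
w = [1.0, 1.0, 0.0]   # the third observation receives weight zero
value = objective(0.1, [1.0, 0.5], y, x, w)
print(value)
```

Since both $\rho$ and the correction term are bounded, each summand is bounded whatever the value of the linear predictor, which is what limits the effect of a misclassified observation on the objective.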

The weights $w(X_{i})$ in (5) may be computed as in (4) using a weight function of the robust Mahalanobis distance of the projected variables $\mathbf{x}_{i}$, in which case $L_{n}(\alpha,\beta_{\mathbf{b}})=L_{n}^{\star}(\alpha,\mathbf{b})$. Another possibility is to compute the weights from the functional covariates, for instance, discarding observations which are declared as outliers by the functional boxplot or any other functional measure of atypicality. In the simulation study reported in Section 4, we explore both possible choices for the weights. As in the finite–dimensional setting, for the sake of simplicity, when deriving consistency results, we will assume that the weight function $w$ is not data dependent.

2.3 On the basis choice

The basis choice depends on the knowledge or assumptions to be made on the slope parameter. Some well known bases are $B$-splines, Bernstein and Legendre polynomials, the Fourier basis or wavelet ones. They vary in the way they approximate a function as discussed, for instance, in Boente and Martinez (2023) and Kalogridis and Van Aelst (2023).

When considering $B$-spline approximations, consistency results will require the slope parameter to be $r$-times continuously differentiable, that is, $\beta_{0}\in C^{r}([0,1])$, where $r\leq\ell-2$ and $\ell$ is the spline order. In particular, when cubic splines are considered, the results in Section 3 hold for twice continuously differentiable regression functions. Recall that a spline of order $\ell$ is a polynomial of degree $\ell-1$ within each subinterval defined by the knots. As stated in Corollary 6.21 in Schumaker (1981), if $\beta_{0}\in C^{r}([0,1])$ with $r$-th derivative Lipschitz and $r\leq\ell-2$, under proper assumptions on the knots, there exists a spline of order $\ell$, say $\widetilde{\beta}=\sum_{j=1}^{k}b_{j}B_{j}$, such that $\|\widetilde{\beta}-\beta_{0}\|_{\infty}=O(k^{-(r+1)})$. It is worth mentioning that the approximation order has an impact on the rates of convergence derived in Theorems 3.1 and 3.3 through assumption A10.

Bernstein polynomials are a possible alternative to $B$-splines. They are defined as

$$B_{j}(t)=\binom{k}{j}t^{j}(1-t)^{k-j}\quad\mbox{for }j=0,\dots,k\,.$$

Weierstrass Theorem ensures that if $\beta_{0}\in C([0,1])$, there exists $\widetilde{\beta}=\sum_{j=0}^{k}b_{j}B_{j}$, where $b_{j}=\beta_{0}(j/k)$, such that $\|\widetilde{\beta}-\beta_{0}\|_{\infty}\to 0$. Furthermore, Theorem 3.2 in Powell (1981) guarantees that when considering Bernstein polynomials of order $k$ we also get that $\|\widetilde{\beta}-\beta_{0}\|_{\infty}=O(k^{-r})$, whenever $\beta_{0}\in C^{r}([0,1])$.
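The Bernstein construction can be illustrated with a few lines of code. The sketch evaluates the basis displayed above and the approximation with coefficients $b_{j}=\beta_{0}(j/k)$, summing over $j=0,\dots,k$, the index range of the basis. It then records the maximum error on a grid for increasing $k$. The slope $\beta_{0}(t)=\sin(2\pi t)$, the grid size and the function names are arbitrary choices, and the output is only meant to show the uniform error decreasing, not to verify any particular rate.

```python
import math


def bernstein(j, k, t):
    """B_j(t) = C(k, j) t^j (1 - t)^(k - j), for j = 0, ..., k."""
    return math.comb(k, j) * t ** j * (1.0 - t) ** (k - j)


def bernstein_approx(f, k, t):
    """Bernstein approximation: sum over j = 0, ..., k of f(j / k) B_j(t)."""
    return sum(f(j / k) * bernstein(j, k, t) for j in range(k + 1))


def sup_error(f, k, n_grid=400):
    """Maximum absolute error over an equispaced grid of [0, 1]."""
    return max(abs(bernstein_approx(f, k, m / n_grid) - f(m / n_grid))
               for m in range(n_grid + 1))


def beta0(t):
    """Hypothetical slope function used only for this illustration."""
    return math.sin(2 * math.pi * t)


errors = {k: sup_error(beta0, k) for k in (10, 20, 40, 80)}
for k, err in errors.items():
    print(k, round(err, 4))
```

The basis functions are nonnegative and add up to one at every $t$, so the approximation is a weighted average of the values $\beta_{0}(j/k)$ and never leaves the range of $\beta_{0}$.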

Legendre polynomials define an orthogonal basis in $L^{2}([0,1])$. As mentioned in Boente and Martinez (2023), the convergence rates derived in Theorem 2.5 from Wang and Xiang (2012) allow us to show that, if $\beta_{0}\in C^{r}([0,1])$ and $\widetilde{\beta}$ stands for the truncated Legendre series expansion of $\beta_{0}$ of order $k$, then $\|\widetilde{\beta}-\beta_{0}\|_{\infty}=O\left(k^{-r+1/2}\right)$. Note that in this case, the approximation rate is lower than for the other two bases mentioned above and this will affect the rates provided in Theorem 3.3. More generally, when polynomial bases of order $k$ are considered, Jackson’s Theorem (see Theorem 3.12 in Schumaker, 1981) ensures that if $\beta_{0}\in W^{r,2}([0,1])$, the $L^{2}$-Sobolev space of order $r$ as defined below, there exists a polynomial $\widetilde{\beta}$ of order $k\geq r$ such that $\|\widetilde{\beta}-\beta_{0}\|=O(k^{-r})$, and this improved $L^{2}$ approximation order is enough to guarantee a better rate of convergence for the predictions when polynomial bases are considered.

Finally, the Fourier basis is the natural basis in $L^{2}([0,1])$ and it is usually considered when approximating periodic functions. Clearly, the finite expansion $\widetilde{\beta}_{0}(t)=b_{0}+\sum_{j=1}^{k}\left[b_{j,1}\sin(2\pi\,j\,t)+b_{j,2}\cos(2\pi\,j\,t)\right]$, where $b_{0}=\int_{0}^{1}\beta_{0}(t)\,dt$, $b_{j,1}=\int_{0}^{1}\beta_{0}(t)\sin(2\pi\,j\,t)\,dt$ and $b_{j,2}=\int_{0}^{1}\beta_{0}(t)\cos(2\pi\,j\,t)\,dt$, converges to $\beta_{0}$ in $L^{2}([0,1])$. When $\beta_{0}\in W^{r,2}([0,1])$, Corollary 2.4 in Chapter 7 from DeVore and Lorentz (1993) ensures that $\|\widetilde{\beta}-\beta_{0}\|=O(k^{-r})$.

Wavelet bases may be useful when seeking sparse slope function estimates. Zhao et al. (2012) provide conditions ensuring $L^{2}$ approximations of order $O(k^{-r})$, when $\beta_{0}\in W^{r,2}([0,1])$ and one–dimensional wavelets are used; see also Mallat (2009).

2.4 Selecting the size of the basis

The number of elements of the basis plays the role of regularization parameter in our estimation procedure. The importance of considering a robust criterion to select the regularization parameter has been discussed by several authors who report how standard model selection methods can be highly affected by a small proportion of outliers. The sensitivity to atypical data of classical basis selectors may be inherited by the final regression estimators even when a robust procedure is considered.

To deal with these problems, when the covariates belong to $\mathbb{R}^{p}$, Ronchetti (1985) and Tharmaratnam and Claeskens (2013) provide some robust approaches when considering linear regression models. Besides, under a sparse logistic regression model, Bianco et al. (2022) report in their supplement a numerical study that reveals the importance of considering a robust criterion in order to achieve reliable predictions. Finally, for functional covariates and under a semi–linear and a functional linear model, Boente et al. (2020) and Kalogridis and Van Aelst (2019, 2023), respectively, discuss robust criteria for selecting the regularization parameters.

In our framework, the basis dimension $k=k_{n}$ may be determined by a model selection criterion such as a robust version of the Akaike criterion used in Lu (2015) or the robust Schwarz (1978) criterion considered in He and Shi (1996) and He et al. (2002) for semi–parametric regression models. Suppose that $(\widehat{\alpha}^{(k)},\widehat{\mathbf{b}}^{(k)})$ is the solution of (5) when we use a $k$-dimensional linear space and denote $\widehat{\beta}^{(k)}=\beta_{\widehat{\mathbf{b}}^{(k)}}$. We define a robust $BIC$ criterion, whose large values indicate a poor fit, as

$$RBIC(k)=L_{n}(\widehat{\alpha}^{(k)},\widehat{\beta}^{(k)})+k\,\frac{\log n}{n}\,. \tag{6}$$

For instance, when considering $B$-spline procedures, in order to obtain an optimal rate of convergence, we let the number of knots increase slowly with the sample size. Theorem 3.3 below shows that when $\beta_{0}$ is twice continuously differentiable and is approximated with cubic splines ($\ell=4$), the size $k_{n}$ of the bases can be taken of order $n^{1/5}$. Hence, a possible way to select $k_{n}$ is to search for the first local minimum of $RBIC(k_{n})$ in the range $\max(n^{1/5}/2,4)\leq k\leq 8+2\,n^{1/5}$. Note that for cubic splines the smallest possible number of knots is 4.
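The search just described can be sketched as follows, assuming the minimized loss values $L_{n}(\widehat{\alpha}^{(k)},\widehat{\beta}^{(k)})$ have already been computed for each candidate $k$. Two conventions in the code are assumptions and are not specified in the text: the range is restricted to integers by rounding the lower bound up and the upper bound down, and an endpoint of the range counts as a local minimum when it does not exceed its single neighbour. The loss values used in the example are made up.

```python
import math


def candidate_dimensions(n):
    """Integers k with max(n^(1/5) / 2, 4) <= k <= 8 + 2 n^(1/5)."""
    lower = math.ceil(max(n ** 0.2 / 2.0, 4.0))
    upper = math.floor(8.0 + 2.0 * n ** 0.2)
    return list(range(lower, upper + 1))


def rbic(loss_value, k, n):
    """Criterion (6): minimized loss plus the penalty k log(n) / n."""
    return loss_value + k * math.log(n) / n


def first_local_minimum(ks, values):
    """Smallest k whose criterion value does not exceed that of its neighbours."""
    for i in range(len(ks)):
        left_ok = i == 0 or values[i] <= values[i - 1]
        right_ok = i == len(ks) - 1 or values[i] <= values[i + 1]
        if left_ok and right_ok:
            return ks[i]
    return None  # only reached when ks is empty


n = 300
ks = candidate_dimensions(n)
# Made-up minimized losses, decreasing in k as richer spaces fit better.
losses = [0.60 / (1.0 + 0.3 * (k - 3)) for k in ks]
values = [rbic(loss, k, n) for loss, k in zip(losses, ks)]
print(ks, first_local_minimum(ks, values))
```

Searching for the first local minimum rather than the global one favours smaller bases, which is in line with letting the dimension grow slowly with the sample size.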

3 Consistency results

To provide a unified approach in which the basis gives approximations either in $L^{2}(0,1)$ or in $C([0,1])$, equipped with their respective norms $\|\cdot\|$ and $\|\cdot\|_{\infty}$, we will denote by ${\mathcal{H}}$ the space $L^{2}([0,1])$ or $C([0,1])$ and by $\|\cdot\|_{{\mathcal{H}}}$ the corresponding norm. Hence, we have that $\|f\|\leq\|f\|_{{\mathcal{H}}}$. Furthermore, ${\mathcal{W}}^{r,{\mathcal{H}}}$ will stand for the Hölder space

$${\mathcal{W}}^{r,\infty}([0,1])=\biggl\{f\in C^{r}\left([0,1]\right):\big\|f^{(j)}\big\|_{\infty}<\infty,\;0\leq j\leq r,\mbox{ and }\sup_{z_{1}\neq z_{2}}\frac{\big|f^{(r)}(z_{1})-f^{(r)}(z_{2})\big|}{|z_{1}-z_{2}|}<\infty\biggr\}\,,$$

when ${\mathcal{H}}=C([0,1])$, while when ${\mathcal{H}}=L^{2}([0,1])$, we label ${\mathcal{W}}^{r,{\mathcal{H}}}$ the Sobolev space

$${\mathcal{W}}^{r,2}([0,1])=\Bigl\{f\in L^{2}([0,1]):\mbox{the weak derivatives of $f$ up to order $r$ exist and }\|f^{(j)}\|^{2}=\int_{0}^{1}\left\{f^{(j)}(t)\right\}^{2}\,dt<\infty\mbox{ for }j=1,\dots,r\Bigr\}\,.$$

We denote by $\|\cdot\|_{{\mathcal{W}}^{r,{\mathcal{H}}}}$ the corresponding norm, that is,

$$\|f\|_{{\mathcal{W}}^{r,{\mathcal{H}}}}=\max_{1\leq j\leq r}\big\|f^{(j)}\big\|_{\infty}+\sup_{x\neq y,\,x,y\in(0,1)}\frac{\big|f^{(r)}(x)-f^{(r)}(y)\big|}{|x-y|}$$

in the former case and $\|f\|_{{\mathcal{W}}^{r,{\mathcal{H}}}}^{2}=\|f\|^{2}+\sum_{j=1}^{r}\|f^{(j)}\|^{2}$ in the latter one.

To derive consistency results, we will need the following assumptions.

  A1. $\rho:\mathbb{R}_{\geq 0}\to\mathbb{R}$ is a bounded, continuously differentiable function with bounded derivative $\psi$ and $\rho(0)=0$.

  A2. $\psi(t)\geq 0$ and there exists some $a\geq\log 2$ such that $\psi(t)>0$ for all $0<t<a$.

  A3. $\psi(t)\geq 0$ and there exist values $a\geq\log 2$ and $A_{0}>0$ such that $\psi(t)>A_{0}$ for every $0<t<a$.

  A4. $\psi$ is a continuously differentiable function with bounded derivative $\psi^{\prime}$.

  A5. $w$ is a non–negative bounded function with support ${\mathcal{C}}_{w}$ such that $\mathbb{P}(X\in{\mathcal{C}}_{w})>0$. Without loss of generality, we assume that $\|w\|_{\infty}=1$.

  A6. (a) $\mathbb{E}\,w(X)\,\|X\|<\infty$.

      (b) $\mathbb{E}\,w(X)\,\|X\|^{2}<\infty$.

  A7. The basis functions are such that $B_{j}\in{\mathcal{W}}^{1,{\mathcal{H}}}$.

  A8. The basis dimension $k_{n}$ is such that $k_{n}\to\infty$, $k_{n}/n\to 0$.

  A9. There exists an element $\widetilde{\beta}_{k}\in{\mathcal{M}}_{k}$, $\widetilde{\beta}_{k}=\sum_{j=1}^{k}\widetilde{b}_{j}B_{j}(x)$, such that $\|\widetilde{\beta}_{k}-\beta_{0}\|_{{\mathcal{H}}}\to 0$ as $k\to\infty$.

  A10. There exists an element $\widetilde{\beta}_{k}\in{\mathcal{M}}_{k}$, $\widetilde{\beta}_{k}=\sum_{j=1}^{k}\widetilde{b}_{j}B_{j}(x)$, such that $\|\widetilde{\beta}_{k}-\beta_{0}\|_{{\mathcal{H}}}=O(k^{-r})$, for some $r>0$. Furthermore, the basis dimension $k_{n}$ is of order $O(n^{\varsigma})$ where $\varsigma<1/r$.

  A11. The following condition holds:

    $$\mathbb{P}(\langle X,\beta\rangle+\alpha=0)=0\,,\mbox{ for any $\beta\in{\mathcal{H}}^{\star}$, $\alpha\in\mathbb{R}$, such that $(\beta,\alpha)\neq 0$}\,, \tag{7}$$

    where ${\mathcal{H}}^{\star}={\mathcal{H}}$ or ${\mathcal{H}}^{\star}={\mathcal{W}}^{1,{\mathcal{H}}}$, depending on whether $\beta_{0}$ belongs to ${\mathcal{H}}$ or ${\mathcal{W}}^{1,{\mathcal{H}}}$, respectively.

Our first result states that, under mild assumptions, the estimators obtained by minimizing $L_{n}(\alpha,\beta)$ over $(\alpha,\beta)\in\mathbb{R}\times{\mathcal{M}}_{k}$ produce consistent estimators of the conditional success probability with respect to the weighted mean square error of the differences between predicted probabilities, defined as

$$\pi_{\mathbb{P}}^{2}(\theta_{1},\theta_{2})=\mathbb{E}\left(w(X)\left[F(\alpha_{1}+\langle X,\beta_{1}\rangle)-F(\alpha_{2}+\langle X,\beta_{2}\rangle)\right]^{2}\right)\,,$$

where for $j=1,2$, $\theta_{j}=(\alpha_{j},\beta_{j})\in\Theta=\mathbb{R}\times{\mathcal{H}}$. Henceforth, to simplify the notation, we denote $\widehat{\theta}=(\widehat{\alpha},\widehat{\beta})$ and $\theta_{0}=(\alpha_{0},\beta_{0})$.

Theorem 3.1.

Let $\rho$ be a function satisfying A1 and A3 and $w$ a weight function satisfying A5.

  1. (a)

    Under A8 and A9, $\pi_{\mathbb{P}}^{2}(\widehat{\theta},\theta_{0})\buildrel a.s.\over{\longrightarrow}0$.

  2. (b)

    If $w$ satisfies A6a, we have that $\pi_{\mathbb{P}}^{2}(\widehat{\theta},\theta_{0})=O_{\mathbb{P}}(\sqrt{k_{n}/n}+\|\widetilde{\beta}_{k}-\beta_{0}\|_{{\mathcal{H}}})$. Moreover, if A10 also holds, $\pi_{\mathbb{P}}(\widehat{\theta},\theta_{0})=O_{\mathbb{P}}(n^{-{\lambda}})$, where $\lambda=\min(\varsigma\,r/2,(1-\varsigma)/4)$.

  3. (c)

    If $w$ satisfies A6b and $\psi$ satisfies A4, we have that $\pi_{\mathbb{P}}^{2}(\widehat{\theta},\theta_{0})=O_{\mathbb{P}}(\sqrt{k_{n}/n}+\|\widetilde{\beta}_{k}-\beta_{0}\|_{{\mathcal{H}}}^{2})$. Moreover, if A10 also holds, $\pi_{\mathbb{P}}(\widehat{\theta},\theta_{0})=O_{\mathbb{P}}(n^{-{\omega}})$, where $\omega=\min(\varsigma\,r,(1-\varsigma)/4)$.

Remark 3.1.

Denote $\widehat{p}(X)=F(\widehat{\alpha}+\langle X,\widehat{\beta}\rangle)$ and $p_{0}(X)=F(\alpha_{0}+\langle X,\beta_{0}\rangle)$. When $w(X)\equiv 1$, Theorem 3.1(a) implies that for any $\epsilon>0$,

$$\mathbb{P}\left(\left|\widehat{p}(X)-p_{0}(X)\right|>\epsilon\right)=\mathbb{E}\left(\mathbb{P}\left(\left|\widehat{p}(X)-p_{0}(X)\right|>\epsilon\Big{|}_{(y_{1},X_{1}),\dots,(y_{n},X_{n})}\right)\right)\leq\mathbb{E}\left(\frac{\pi_{\mathbb{P}}^{2}(\widehat{\theta},\theta_{0})}{\epsilon^{2}}\right)\,,$$

where the right hand side of the inequality converges to 0 as $n\to\infty$. Thus, $\widehat{p}(X)\buildrel p\over{\longrightarrow}p_{0}(X)$, which allows one to classify a new observation consistently. Moreover, using that $F^{-1}$ is continuous, we also conclude that $\widehat{\alpha}+\langle X,\widehat{\beta}\rangle$ converges in probability to $\alpha_{0}+\langle X,\beta_{0}\rangle$. However, the infinite–dimensional structure of the covariates does not allow us to derive the consistency of $\widehat{\beta}$, which is instead obtained in Theorem 3.2. The rates obtained in Theorem 3.1 (b) provide a preliminary rate that will be improved in Theorem 3.3.
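The inequality in Remark 3.1 is a conditional Chebyshev bound, and the intermediate step can be written out as follows. This is only a sketch: the symbol $\mathcal{D}_{n}$ is a label introduced here for the sample $(y_{1},X_{1}),\dots,(y_{n},X_{n})$, and the new covariate $X$ is assumed to be independent of it.

```latex
\begin{aligned}
\mathbb{P}\left(\left|\widehat{p}(X)-p_{0}(X)\right|>\epsilon \,\Big|\, \mathcal{D}_{n}\right)
&\leq \frac{1}{\epsilon^{2}}\,
\mathbb{E}\left(\left[\widehat{p}(X)-p_{0}(X)\right]^{2} \,\Big|\, \mathcal{D}_{n}\right)
= \frac{\pi_{\mathbb{P}}^{2}(\widehat{\theta},\theta_{0})}{\epsilon^{2}}\,,
\qquad\text{when } w\equiv 1\,.
\end{aligned}
```

Since $F$ takes values in $[0,1]$, we have $0\leq\pi_{\mathbb{P}}^{2}(\widehat{\theta},\theta_{0})\leq 1$ when $w\equiv 1$, so the almost sure convergence in Theorem 3.1(a) and the dominated convergence theorem give $\mathbb{E}\,\pi_{\mathbb{P}}^{2}(\widehat{\theta},\theta_{0})\to 0$.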

Theorem 3.2 establishes strong consistency of the intercept and slope parameter, which clearly implies that of the predicted probability, that is, $F(\widehat{\alpha}+\langle X,\widehat{\beta}\rangle)\buildrel a.s.\over{\longrightarrow}F(\alpha_{0}+\langle X,\beta_{0}\rangle)$. This result provides an improvement over the one obtained in Theorem 3.1, but requires additional assumptions on the covariates, namely, assumption A11 which is discussed in Remark 3.2.

Theorem 3.2.

Let $\rho$ be a function satisfying A1 and A2, and $w$ a weight function satisfying A5. Assume that A7 to A9 hold.

  1. (a)

    If in addition A11 holds, we have that $|\widehat{\alpha}-\alpha_{0}|+\|\widehat{\beta}-\beta_{0}\|_{{\mathcal{H}}}\buildrel a.s.\over{\longrightarrow}0$.

  2. (b)

    Assume that ${\mathcal{H}}=L^{2}([0,1])$, that the basis functions are such that $B_{j}\in{\mathcal{W}}^{1,2}$, $1\leq j\leq k_{n}$, providing approximations in A9 in $L^{2}([0,1])$ and that A11 holds with ${\mathcal{H}}^{\star}={\mathcal{W}}^{1,2}$, i.e., $\beta_{0}\in{\mathcal{W}}^{1,2}$. If, in addition, $\beta_{0}\in C([0,1])$ and the basis elements are also continuous, i.e., $B_{j}\in C([0,1])$, $1\leq j\leq k_{n}$, then $|\widehat{\alpha}-\alpha_{0}|+\|\widehat{\beta}-\beta_{0}\|_{\infty}\buildrel a.s.\over{\longrightarrow}0$.

Theorem 3.2(b) is useful for situations where the slope parameter is continuous but we use a smooth basis that provides an approximation in $L^{2}([0,1])$, such as the Fourier one.

Remark 3.2 (Comments on assumptions).

Assumptions A1, A2, A5 and A11 are needed to ensure Fisher–consistency of the proposal, see Lemma A.1 in the Appendix. In the finite–dimensional case, A1 and A2 were also required in Bianco and Yohai, (1996) who considered $M$-estimators, while assumption A5 corresponds to assumption A2 in Bianco and Martinez, (2009) who studied the asymptotic behaviour of weighted $M$-estimators. Assumption A11 is the infinite–dimensional counterpart of assumptions C1 in Bianco and Yohai, (1996) and A3 in Bianco and Martinez, (2009). For the functional logistic regression model considered this assumption is stronger than the one required for functional linear regression models in Boente et al., (2020) and Kalogridis and Van Aelst, (2023) which states that $\mathbb{P}(\langle X,\beta\rangle+\alpha=0)<c<1$. However, it is weaker than assumption C2 in Kalogridis and Van Aelst, (2019) who defined robust estimators based on principal components under a functional linear model and assumed that the process $X$ has a finite–dimensional Karhunen-Loève expansion with scores having a joint density function. It is worth mentioning that condition A11 is related to the fact that the slope parameter is not identifiable if the kernel of the covariance operator of $X$ does not reduce to $\{0\}$. Instead of requiring the condition over all possible elements $\beta\in L^{2}([0,1])$, depending on the smoothness of $\beta_{0}$, the set of values $\beta\neq 0$ over which the probability $\mathbb{P}(\langle X,\beta\rangle+\alpha=0)$ equals 0 may be reduced.

Furthermore, assumptions A1 and A2 hold for the loss function introduced in Croux and Haesbroeck, (2003) and for $\rho(t)=(1+1/c)(1-\exp(-\,c\,t))$ which is related to the minimum divergence estimators defined in Basu et al., (1998) and in both cases, $a$ can be taken as $+\infty$. Note that if $\psi(0)\neq 0$ and assumptions A1 and A2 hold for some constant $a>\log(2)$, then condition A3 is fulfilled. This situation arises, for example, for the two loss functions mentioned above. It is worth mentioning that A3 is a key point to derive that $L(\theta)-L(\theta_{0})\geq C_{0}\,\pi_{\mathbb{P}}^{2}(\theta,\theta_{0})$, for any $\theta$ and some constant $C_{0}>0$, where $L(\theta)=L(\alpha,\beta)=\mathbb{E}\left(\phi\left(y,\alpha+\langle X,\beta\rangle\right)w(X)\right)$. This inequality allows us to derive convergence rates for the weighted mean square error of the prediction differences from those obtained for the empirical process $\sup_{\theta\in\mathbb{R}\times{\mathcal{M}}_{k}}|L_{n}(\theta)-L(\theta)|$.

Assumption A8 gives a rate at which the dimension of the finite–dimensional space ${\mathcal{M}}_{k}$ should increase. It is a standard condition when a sieve approach is considered. Furthermore, assumption A10 imposes a stronger requirement on the growth rate of the basis dimension in order to obtain rates of convergence.

Assumption A9 states that the true slope may be approximated by an element of ${\mathcal{M}}_{k}$. Conditions under which this assumption holds for some basis choices were discussed in Section 2.3, where conditions ensuring a given rate for this approximation were also given. The approximation rate required in A10 plays a role when deriving rates of convergence for the predicted probabilities. Note also that under A7, the approximating element $\widetilde{\beta}_{k}\in{\mathcal{M}}_{k}$, given in A9 and A10, also belongs to ${\mathcal{W}}^{1,{\mathcal{H}}}$. A first attempt to obtain these rates is given in Theorem 3.1, but better ones will be obtained in Theorem 3.3 below.

3.1 Rates of Consistency

To derive rates of convergence for the estimators, we define the pseudo-distance given by

$$\widetilde{\pi}_{\mathbb{P}}^{2}(\theta_{1},\theta_{2})=\mathbb{E}\left(w(X)\left[\alpha_{1}-\alpha_{2}+\langle X,(\beta_{1}-\beta_{2})\rangle\right]^{2}\right)\,,$$

where for $j=1,2$, $\theta_{j}=(\alpha_{j},\beta_{j})\in\Theta=\mathbb{R}\times{\mathcal{H}}$. The following additional assumption will be required:

  12. A12

    There exist $\epsilon_{0}>0$ and a positive constant $C_{0}^{\star}$, such that for any $\theta=(\alpha,\beta)\in\mathbb{R}\times{\mathcal{H}}$ with $|\alpha-\alpha_{0}|+\|\beta-\beta_{0}\|<\epsilon_{0}$ we have $L(\theta)-L(\theta_{0})\geq C_{0}^{\star}\,\widetilde{\pi}_{\mathbb{P}}^{2}(\theta,\theta_{0})$.

Note that since $F^{\prime}(t)=F(t)\left(1-F(t)\right)$ is bounded by 1, $\pi_{\mathbb{P}}^{2}(\theta_{1},\theta_{2})\leq\widetilde{\pi}_{\mathbb{P}}^{2}(\theta_{1},\theta_{2})$, so the weighted mean square error of the differences between predicted probabilities inherits the rates of convergence obtained in Theorem 3.3 for the distance $\widetilde{\pi}_{\mathbb{P}}$.
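The comparison between the two pseudo-distances can be checked numerically on sample analogues. The sketch below is illustrative only: it assumes that $F$ is the logistic function $F(t)=1/(1+e^{-t})$, approximates the inner product by a crude Riemann sum on an equispaced grid, and uses toy curves and parameters that are not taken from the paper. All function names are hypothetical.

```python
import math


def logistic(t):
    """Logistic link F(t) = 1 / (1 + exp(-t)) (assumed form of F)."""
    return 1.0 / (1.0 + math.exp(-t))


def inner_product(x, beta, step):
    """Crude Riemann-sum approximation of <x, beta> on an equispaced grid."""
    return step * sum(a * b for a, b in zip(x, beta))


def empirical_distances(curves, weights, theta1, theta2, step):
    """Sample analogues of pi^2 and pi-tilde^2 for theta = (alpha, beta)."""
    (a1, b1), (a2, b2) = theta1, theta2
    pi2 = 0.0
    pi_tilde2 = 0.0
    for x, w in zip(curves, weights):
        eta1 = a1 + inner_product(x, b1, step)
        eta2 = a2 + inner_product(x, b2, step)
        pi2 += w * (logistic(eta1) - logistic(eta2)) ** 2
        pi_tilde2 += w * (eta1 - eta2) ** 2
    n = len(curves)
    return pi2 / n, pi_tilde2 / n


# Toy data: five sine curves observed on an 11-point grid, all with weight 1.
grid = [i / 10 for i in range(11)]
step = 0.1
curves = [[math.sin(2 * math.pi * k * t) for t in grid] for k in range(1, 6)]
weights = [1.0] * len(curves)
theta_a = (0.0, [1.0 for _ in grid])
theta_b = (0.5, [t for t in grid])
pi2, pi_tilde2 = empirical_distances(curves, weights, theta_a, theta_b, step)
```

For the logistic link the derivative satisfies $F^{\prime}(t)\leq 1/4$, so the mean value theorem gives the slightly sharper pointwise bound $|F(a)-F(b)|\leq|a-b|/4$; the sample quantities above therefore satisfy `pi2 <= pi_tilde2 / 16`.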

Theorem 3.3.

Let $\rho$ be a function satisfying A1, A3 and A4, and $w$ a weight function satisfying A5 and A6b. Assume that A7 and A10 to A12 hold. Then, $\gamma_{n}\widetilde{\pi}_{\mathbb{P}}(\widehat{\theta},\theta_{0})=O_{\mathbb{P}}(1)$, whenever $\gamma_{n}=O(n^{r\,\varsigma})$ and $\gamma_{n}\sqrt{\log(\gamma_{n})}=O(n^{(1-\varsigma)/2})$.

The lower bound given in assumption A12 is a requirement that is fulfilled when the covariates are bounded, as shown in Proposition 3.4 below. Moreover, one consequence of Proposition 3.4 is that for bounded functional covariates, the pseudo-distances $\widetilde{\pi}_{\mathbb{P}}$ and $\pi_{\mathbb{P}}$ are equivalent.

Proposition 3.4.

Assume that assumptions A1 and A3 hold and that for some positive constant $C>0$, $\mathbb{P}(\|X\|\leq C)=1$. Then, there exists a constant $C_{1}>0$ such that $\pi_{\mathbb{P}}^{2}(\theta,\theta_{0})\geq C_{1}\,\widetilde{\pi}_{\mathbb{P}}^{2}(\theta,\theta_{0})$, for any $\theta=(\alpha,\beta)\in\mathbb{R}\times{\mathcal{H}}$ with $|\alpha-\alpha_{0}|+\|\beta-\beta_{0}\|<1$. Moreover, A12 holds.

As a consequence of Proposition 3.4 and Theorem 3.3, we get the following result that improves the rates given in Theorem 3.1.

Corollary 3.5.

Let $\rho$ be a function satisfying A1, A3 and A4, and $w$ a weight function satisfying A5. Assume that A7, A10 and A11 hold and that for some positive constant $C>0$, $\mathbb{P}(\|X\|\leq C)=1$. Then, $\gamma_{n}\pi_{\mathbb{P}}(\widehat{\theta},\theta_{0})=O_{\mathbb{P}}(1)$, whenever $\gamma_{n}=O(n^{r\,\varsigma})$ and $\gamma_{n}\sqrt{\log(\gamma_{n})}=O(n^{(1-\varsigma)/2})$. In particular,

$$\frac{n^{\eta}}{\sqrt{\log(n)}}\pi_{\mathbb{P}}(\widehat{\theta},\theta_{0})=O_{\mathbb{P}}(1)\,,$$

where $\eta=\min(r\,\varsigma,(1-\varsigma)/2)$.

Remark 3.3.

As mentioned in Boente et al., (2020), who obtained rates under a semi-linear functional regression model, if $\varsigma=1/(1+2r)$ in A10, one can choose $\gamma_{n}=O(n^{\frac{r}{1+2r}-\delta})$, for some $\delta>0$ arbitrarily small, which yields a convergence rate arbitrarily close to the optimal one. Then, as mentioned above, when considering cubic splines, if $\beta_{0}$ is twice continuously differentiable one has that $r=2$ in A10. Hence, taking $\varsigma=1/5$, i.e., if the basis dimension $k_{n}$ has order $n^{1/5}$, we ensure that the convergence rate for $\widetilde{\pi}_{\mathbb{P}}(\widehat{\theta},\theta_{0})$ and $\pi_{\mathbb{P}}(\widehat{\theta},\theta_{0})$ is arbitrarily close to $n^{2/5}$.

Clearly, one may select in Theorem 3.3, $\gamma_{n}=n^{\eta}/\sqrt{\log(n)}$ where $\eta=\min(r\,\varsigma,(1-\varsigma)/2)$. Hence, if $\varsigma=1/(1+2r)$ in A10, we have that $\gamma_{n}=n^{{r}/({1+2r})}/\sqrt{\log(n)}$. Thus, for bounded covariates, both the weighted mean square error of the predictions and the weighted mean square error between the predicted probabilities are such that $\widetilde{\pi}_{\mathbb{P}}(\widehat{\theta},\theta_{0})=O_{\mathbb{P}}\left(n^{-\,{r}/({1+2r})}\;\sqrt{\log(n)}\right)$ and $\pi_{\mathbb{P}}(\widehat{\theta},\theta_{0})=O_{\mathbb{P}}\left(n^{-\,{r}/({1+2r})}\;\sqrt{\log(n)}\right)$, leading to a convergence rate that is suboptimal with respect to the one obtained for instance in nonparametric regression models, see Stone, (1982, 1985). Furthermore, this rate equals the one obtained for penalized estimators in Kalogridis, (2023), when considering $B$-splines with $k_{n}=O(n^{1/(2r+1)})$ and the slope function is $r$ times continuously differentiable, i.e., $\beta_{0}\in{\mathcal{W}}^{r-1,\infty}$. As mentioned therein, the term $\log(n)$ is related to the fact that we are considering infinite–dimensional covariates. Recall that when $X\in L^{2}(0,1)$ and to ensure identifiability, the eigenvalues of its covariance operator are non–null but converge to 0, enabling us to provide a lower bound for $\widetilde{\pi}_{\mathbb{P}}(\theta_{1},\theta_{2})$ in terms of $\|\beta_{1}-\beta_{2}\|$.

It is also worth mentioning that the rate derived in Theorem 3.1(c) allows us to conclude that, when $k_{n}=O(n^{1/(4r+1)})$, that is, when $\varsigma=1/(4r+1)$, we have $\pi_{\mathbb{P}}(\widehat{\theta},\theta_{0})=O_{\mathbb{P}}(n^{-\,r/(4r+1)})$. Hence, if the functional covariates are bounded, from Proposition 3.4 we get that $\widetilde{\pi}_{\mathbb{P}}(\widehat{\theta},\theta_{0})=O_{\mathbb{P}}(n^{-\,r/(4r+1)})$. This rate of convergence corresponds to the one obtained in Cardot and Sarda, (2005) for their penalized estimators and is slower than the rate $\widetilde{\pi}_{\mathbb{P}}(\widehat{\theta},\theta_{0})=O_{\mathbb{P}}\left(n^{-\,{r}/({1+2r})}\;\sqrt{\log(n)}\right)$ derived from Theorem 3.3.
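The exponents discussed in this remark can be tabulated exactly with rational arithmetic. The following sketch simply evaluates the two formulas $\eta=\min(r\,\varsigma,(1-\varsigma)/2)$ and $\omega=\min(\varsigma\,r,(1-\varsigma)/4)$ at the basis growth rates mentioned in the text; the function names are hypothetical and the logarithmic factor is ignored.

```python
from fractions import Fraction


def eta_theorem_3_3(r, varsigma):
    """Exponent eta = min(r * varsigma, (1 - varsigma) / 2) of Theorem 3.3."""
    return min(r * varsigma, (1 - varsigma) / 2)


def omega_theorem_3_1c(r, varsigma):
    """Exponent omega = min(varsigma * r, (1 - varsigma) / 4) of Theorem 3.1(c)."""
    return min(varsigma * r, (1 - varsigma) / 4)


def compare_exponents(r):
    """Return (eta, omega) at varsigma = 1/(1+2r) and varsigma = 1/(4r+1)."""
    r = Fraction(r)
    eta = eta_theorem_3_3(r, 1 / (1 + 2 * r))
    omega = omega_theorem_3_1c(r, 1 / (4 * r + 1))
    return eta, omega


for r in (1, 2, 3):
    eta, omega = compare_exponents(r)
    print(f"r={r}: eta={eta}, omega={omega}")
```

For cubic splines with $r=2$, this gives $\eta=2/5$ against $\omega=2/9$, which is the gap between the two rates described above.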

4 Simulation study

We performed a Monte Carlo study to investigate the finite-sample properties of our proposed estimators for the functional logistic regression model. For that purpose, we generated a training sample ${\mathcal{M}}$ of observations $(y_{i},X_{i})$ i.i.d. such that $y_{i}\sim Bi(1,F(\alpha_{0}+\langle\beta_{0},X_{i}\rangle))$ where $\alpha_{0}=0$. The true regression parameter was set equal to $\beta_{0}(t)=\sum_{j=1}^{50}b_{j,0}\phi_{j}(t)$, where $\{\phi_{j}\}_{j=1}^{50}$ correspond to elements of the Fourier basis, more precisely, $\phi_{1}(t)\equiv 1$, $\phi_{j}(t)=\sqrt{2}\cos((j-1)\pi t)$, $j\geq 2$, and the coefficients $b_{1,0}=0.3$ and $b_{j,0}=4(-1)^{j+1}j^{-2}$, $j\geq 2$. The process that generates the functional covariates $X_{i}(t)$ was Gaussian with mean 0 and covariance operator with eigenfunctions $\phi_{j}(t)$. For uncontaminated samples, the scores $\xi_{ij}$ were generated as independent Gaussian random variables $\xi_{ij}\sim N(0,j^{-2})$. We denote the distribution of this Gaussian process ${\mathcal{G}}(0,\Gamma)$. Taking into account that $\mathrm{Var}(\xi_{ij})\leq 1/2500$ when $j>50$, the process was approximated numerically using the first 50 terms of its Karhunen-Loève representation.
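The data-generating mechanism described above can be sketched with the standard library alone. This is an illustrative sketch, not the authors' code: it assumes the logistic link for $F$ and works directly with the 50 Karhunen-Loève scores, using the fact that the cosine basis is orthonormal on $[0,1]$, so that $\langle X_{i},\beta_{0}\rangle=\sum_{j}\xi_{ij}b_{j,0}$. The function names and the seed are arbitrary choices.

```python
import math
import random

N_TERMS = 50  # truncation level of the Karhunen-Loeve expansion used in the text


def phi(j, t):
    """Cosine basis: phi_1 = 1 and phi_j(t) = sqrt(2) cos((j-1) pi t), j >= 2."""
    return 1.0 if j == 1 else math.sqrt(2.0) * math.cos((j - 1) * math.pi * t)


def beta0_coefficients():
    """b_{1,0} = 0.3 and b_{j,0} = 4 (-1)^(j+1) j^(-2) for j >= 2."""
    return [0.3] + [4.0 * (-1) ** (j + 1) / j ** 2 for j in range(2, N_TERMS + 1)]


def simulate_clean_sample(n, rng, alpha0=0.0):
    """Draw n pairs (y_i, scores_i) from the uncontaminated model."""
    b0 = beta0_coefficients()
    sample = []
    for _ in range(n):
        # xi_ij ~ N(0, j^-2), i.e. standard deviation 1/j
        scores = [rng.gauss(0.0, 1.0 / j) for j in range(1, N_TERMS + 1)]
        # orthonormal basis: <X_i, beta_0> is the dot product of the coefficients
        eta = alpha0 + sum(s * b for s, b in zip(scores, b0))
        prob = 1.0 / (1.0 + math.exp(-eta))
        y = 1 if rng.random() < prob else 0
        sample.append((y, scores))
    return sample


def evaluate_curve(scores, t):
    """Value X_i(t) of the trajectory with the given scores."""
    return sum(s * phi(j, t) for j, s in enumerate(scores, start=1))


rng = random.Random(2024)
sample = simulate_clean_sample(300, rng)
```

Keeping the scores rather than discretized curves avoids numerical integration; `evaluate_curve` recovers the trajectory on any grid when plots or spline projections are needed.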

We chose as basis the $B$-spline basis for all the procedures considered, denoted $\{B_{j}\}_{j=1}^{k_{n}}$. We compared four estimators: the procedure based on using the deviance after dimension reduction, that is using $\rho(t)=t$ in (2), labelled the classical estimators and denoted cl, the one that uses $M$-estimators denoted m, and their weighted versions. The $M$-estimators and weighted $M$-estimators were computed using the loss function introduced in Croux and Haesbroeck, (2003) and defined as

$$\rho(t)=\left\{\begin{array}{ll}t\;e^{-\sqrt{c}}&\hbox{if }\,\,t\leq c\\ -2e^{-\sqrt{t}}\left(1+\sqrt{t}\right)+e^{-\sqrt{c}}\left(2\left(1+\sqrt{c}\right)+c\right)&\hbox{if }\,\,t>c\,,\end{array}\right.$$

with tuning constant $c=0.5$. For the former the weights equal 1 for all observations, while for the latter, as for the weighted deviance estimators, two different types of weight functions were considered.

  • a)

    For the first one, after dimension reduction, that is, after computing $x_{ij}=\langle X_{i},B_{j}\rangle$, we evaluated the Donoho–Stahel location and scatter estimators, denoted $\widehat{\boldsymbol{\mu}}$ and $\widehat{\boldsymbol{\Sigma}}$, respectively, of the sample $\mathbf{x}_{1},\dots,\mathbf{x}_{n}$ with $\mathbf{x}_{i}=(x_{i1},\dots,x_{ik_{n}})^{\mathrm{T}}$. The weights are then defined as $w(X_{i})=1$ when the squared Mahalanobis distance $d_{i}^{2}=(\mathbf{x}_{i}-\widehat{\boldsymbol{\mu}})^{\mathrm{T}}\widehat{\boldsymbol{\Sigma}}^{-1}(\mathbf{x}_{i}-\widehat{\boldsymbol{\mu}})$ is less than or equal to $\chi_{0.975,k_{n}}$ and 0 otherwise, where $\chi_{\alpha,p}$ stands for the $\alpha$-quantile of a chi-square distribution with $p$ degrees of freedom. Hence, for this family we used hard rejection weights and for that reason, the weighted estimators based on the deviance and the weighted $M$-estimators are denoted wcl-hr and wm-hr, respectively. Note that wcl-hr are related to the Mallows–type estimators introduced in Carroll and Pederson, (1993).

  • b)

    The second family of weight functions is based on the functional boxplots as defined by Sun and Genton, (2011). Again, we choose hard rejection weights but on the functional space by taking $w(X)=0$ if $X$ was declared an outlier by the functional boxplot and $1$ otherwise. The functional boxplot was computed using the function fbplot of the library fda taking as method "Both" which orders the observations according to the band depth and then breaks ties with the modified band depth, as defined in López-Pintado and Romo, (2009). In this case the weighting is done before the $B$-spline approximation. The resulting weighted classical and $M$-estimators are denoted wcl-fbb and wm-fbb, respectively.
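The hard rejection rule of item a) can be sketched as follows. This is a simplified illustration under stated assumptions: the robust location and the inverse scatter matrix are taken as given inputs (the Donoho–Stahel estimators themselves are not computed here), and the toy example uses dimension 2, where the chi-square 0.975 quantile has the closed form $-2\log(0.025)$, rather than the basis dimensions used in the study. All names are hypothetical.

```python
import math


def squared_mahalanobis(x, center, scatter_inv):
    """d^2 = (x - center)^T scatter_inv (x - center) for plain Python lists."""
    d = [xi - ci for xi, ci in zip(x, center)]
    k = len(d)
    return sum(d[i] * scatter_inv[i][j] * d[j] for i in range(k) for j in range(k))


def hard_rejection_weights(rows, center, scatter_inv, cutoff):
    """Weight 1 if the squared distance is at most the cutoff, 0 otherwise."""
    return [
        1.0 if squared_mahalanobis(x, center, scatter_inv) <= cutoff else 0.0
        for x in rows
    ]


# Toy example in dimension 2: the chi-square distribution with 2 degrees of
# freedom has cdf 1 - exp(-x/2), so its 0.975 quantile is -2 log(0.025).
cutoff = -2.0 * math.log(1.0 - 0.975)
center = [0.0, 0.0]                      # stand-in for the robust location
scatter_inv = [[1.0, 0.0], [0.0, 4.0]]   # inverse of the scatter diag(1, 0.25)
rows = [[0.5, 0.2], [3.0, 0.0], [0.0, 1.5]]
weights = hard_rejection_weights(rows, center, scatter_inv, cutoff)
```

In the toy example only the first point keeps weight 1; the other two have squared distance 9, which exceeds the cutoff of about 7.38.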

For each setting we generated $n_{R}=1000$ samples of size $n=300$ and used cubic splines with equally spaced knots. For the robust estimators we selected the size of the spline basis, $k_{n}$, by minimizing $RBIC(k)$ in equation (6) over the grid $4\leq k\leq 14$. For the classical estimator, we used the standard $BIC$ criterion, that is, we chose $\rho(t)=t$ in equation (6).
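For reference, the Croux and Haesbroeck loss function used for the robust estimators in this section can be coded directly from its piecewise definition. The sketch below is illustrative: `psi_ch` is the derivative obtained by differentiating the two branches by hand, which gives $e^{-\sqrt{\max(t,c)}}$, and `rho_ch_sup` is the limit of the second branch as $t\to\infty$; the function names are hypothetical.

```python
import math


def rho_ch(t, c=0.5):
    """Croux-Haesbroeck loss: linear up to c, bounded and increasing afterwards."""
    if t < 0:
        raise ValueError("t must be non-negative")
    if t <= c:
        return t * math.exp(-math.sqrt(c))
    root = math.sqrt(t)
    return (-2.0 * math.exp(-root) * (1.0 + root)
            + math.exp(-math.sqrt(c)) * (2.0 * (1.0 + math.sqrt(c)) + c))


def psi_ch(t, c=0.5):
    """Derivative of rho_ch: exp(-sqrt(c)) for t <= c and exp(-sqrt(t)) for t > c."""
    return math.exp(-math.sqrt(max(t, c)))


def rho_ch_sup(c=0.5):
    """Supremum of rho_ch, i.e. its limit as t grows."""
    return math.exp(-math.sqrt(c)) * (2.0 * (1.0 + math.sqrt(c)) + c)
```

The two branches agree at $t=c$, where both equal $c\,e^{-\sqrt{c}}$, so the function is continuous, and the derivative is continuous there as well.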

Figure 1: Trajectories $X_{i}(t)$ with and without contamination. The red dotted lines correspond to the added contaminated covariates. Panels: $C_{0}$, $C_{1,0.05}$, $C_{2,0.05}$ (top row); $C_{3,0.05}$, $C_{4,0.05}$, $C_{5,0.05}$ (bottom row).

We considered different contamination scenarios by adding a proportion $\epsilon$ of atypical points. We denote these scenarios $C_{j,\epsilon}$, for $1\leq j\leq 5$ and we chose $\epsilon=0.05$ and $0.10$.

  • In the first scenario, denoted $C_{1,\epsilon}$, we generated $n_{\mathrm{out}}=\epsilon\;n$ misclassified points $(\widetilde{y},\widetilde{X})$, where $\widetilde{X}\sim{\mathcal{G}}(0,25\Gamma)$ and $\widetilde{y}=1$ when $\alpha_{0}+\langle\widetilde{X},\beta_{0}\rangle<0$ and $\widetilde{y}=0$, otherwise.

  • Under $C_{2,\epsilon}$, we have tried to adapt to the functional framework the damaging effect of the high leverage points considered by Croux and Haesbroeck, (2003). For that purpose, given $m>0$, we generated $\widetilde{X}\sim{\mathcal{G}}(m\;\beta_{0},0.01\Gamma)$. The response $\widetilde{y}$, related to $\widetilde{X}$, was always taken equal to $0$. It is worth noticing that $\langle\widetilde{X},\beta_{0}\rangle$ is very close to $m\|\beta_{0}\|^{2}$, thus the leverage of the added points increases with $m$. We chose $m=4$.

  • Contamination $C_{3,\epsilon}$ generates extreme outliers as $\widetilde{X}\sim{\mathcal{G}}(\mu,\Gamma)$ with $\mu(t)=25$, for all $t$. As in $C_{1,\epsilon}$, we chose $\widetilde{y}=1$ when $\alpha_{0}+\langle\widetilde{X},\beta_{0}\rangle<0$ and $\widetilde{y}=0$, otherwise.

  • Setting $C_{4,\epsilon}$ aims to construct trajectories with extreme symmetric outliers. For that purpose, we define $\widetilde{X}\sim 0.5{\mathcal{G}}(\mu,\Gamma)+0.5{\mathcal{G}}(-\mu,\Gamma)$ with $\mu(t)=25$, for all $t$. As in $C_{1,\epsilon}$, we chose $\widetilde{y}=1$ when $\alpha_{0}+\langle\widetilde{X},\beta_{0}\rangle<0$ and $\widetilde{y}=0$, otherwise.

  • The purpose of $C_{5,\epsilon}$ is to add trajectories with a partial contamination. We generated $\widetilde{X}(t)=Z(t)+25\,B\mathbb{I}_{T<t}$ where $Z\sim{\mathcal{G}}(0,\Gamma)$, $\mathbb{P}(B=1)=\mathbb{P}(B=-1)=0.5$ and $T\sim{\mathcal{U}}(0,1)$ and $Z,B$ and $T$ are independent. As in $C_{1,\epsilon}$, we chose $\widetilde{y}=1$ when $\alpha_{0}+\langle\widetilde{X},\beta_{0}\rangle<0$ and $\widetilde{y}=0$, otherwise.

The way contaminated trajectories are constructed under settings $C_{3,\epsilon}$ to $C_{5,\epsilon}$ corresponds to the contaminations considered in Denhere and Billor, (2016). However, we force the atypical trajectories to correspond to bad leverage points.
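Two ingredients shared by these scenarios are easy to sketch: the rule that assigns the wrong label to a contaminated covariate, and the partial shift used in $C_{5,\epsilon}$. The code below is a minimal illustration on a discretized grid; the clean trajectory is replaced by a zero placeholder, and the function names, grid and seed are assumptions made here.

```python
import random


def misclassified_label(alpha0, inner_product):
    """Return y~ = 1 when alpha0 + <X~, beta0> < 0 and y~ = 0 otherwise."""
    return 1 if alpha0 + inner_product < 0 else 0


def partial_contamination(z_values, grid, rng, shift=25.0):
    """X~(t) = Z(t) + shift * B * 1{T < t}, with B = +1 or -1 and T ~ U(0, 1)."""
    b = 1.0 if rng.random() < 0.5 else -1.0
    t_jump = rng.random()
    return [
        z + shift * b * (1.0 if t_jump < t else 0.0)
        for z, t in zip(z_values, grid)
    ]


rng = random.Random(7)
grid = [i / 99 for i in range(100)]
z = [0.0] * len(grid)  # placeholder for a clean trajectory Z evaluated on the grid
x_tilde = partial_contamination(z, grid, rng)
```

Because the label is set to the less likely class given the sign of the linear predictor, every contaminated pair acts as a bad leverage point, as stated above.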

To illustrate the type of outliers generated, Figure 1 shows the obtained functional covariates $X_{i}(t)$, for one sample generated under each scheme.

To compare the estimators of $\alpha_{0}$, we computed their biases and standard deviations, $s_{\widehat{\alpha}}$, which are reported in Table 1. We also present in Figures 2 and 3 their boxplots for the considered contaminations.

$C_{0}$

| Method | $\widehat{\alpha}-\alpha_{0}$ | $s_{\widehat{\alpha}}$ |
|---|---|---|
| cl | -0.0040 | 0.1250 |
| m | -0.0072 | 0.1528 |
| wcl-hr | -0.0046 | 0.1250 |
| wm-hr | -0.0045 | 0.1278 |
| wcl-fbb | -0.0040 | 0.1250 |
| wm-fbb | -0.0083 | 0.1259 |

$\epsilon=0.05$

| Method | $C_{1,0.05}$: $\widehat{\alpha}-\alpha_{0}$ | $C_{1,0.05}$: $s_{\widehat{\alpha}}$ | $C_{2,0.05}$: $\widehat{\alpha}-\alpha_{0}$ | $C_{2,0.05}$: $s_{\widehat{\alpha}}$ | $C_{3,0.05}$: $\widehat{\alpha}-\alpha_{0}$ | $C_{3,0.05}$: $s_{\widehat{\alpha}}$ | $C_{4,0.05}$: $\widehat{\alpha}-\alpha_{0}$ | $C_{4,0.05}$: $s_{\widehat{\alpha}}$ | $C_{5,0.05}$: $\widehat{\alpha}-\alpha_{0}$ | $C_{5,0.05}$: $s_{\widehat{\alpha}}$ |
|---|---|---|---|---|---|---|---|---|---|---|
| cl | -0.0024 | 0.1160 | -0.0540 | 0.1200 | -0.0200 | 0.1230 | -0.0064 | 0.1216 | -0.0034 | 0.1166 |
| m | -0.0038 | 0.1589 | -0.1613 | 0.1139 | -0.0580 | 0.1086 | -0.0041 | 0.1170 | -0.0050 | 0.1138 |
| wcl-hr | -0.0029 | 0.1265 | -0.0028 | 0.1263 | -0.0029 | 0.1263 | -0.0065 | 0.1247 | -0.0029 | 0.1254 |
| wm-hr | 0.0007 | 0.1331 | 0.0008 | 0.1331 | 0.0009 | 0.1333 | -0.0053 | 0.1304 | 0.0042 | 0.1282 |
| wcl-fbb | -0.0030 | 0.1208 | -0.0080 | 0.1247 | -0.0177 | 0.1228 | -0.0061 | 0.1234 | -0.0028 | 0.1234 |
| wm-fbb | -0.0002 | 0.1352 | -0.0068 | 0.1326 | -0.0168 | 0.1313 | -0.0084 | 0.1283 | 0.0034 | 0.1259 |

$\epsilon=0.10$

| Method | $C_{1,0.10}$: $\widehat{\alpha}-\alpha_{0}$ | $C_{1,0.10}$: $s_{\widehat{\alpha}}$ | $C_{2,0.10}$: $\widehat{\alpha}-\alpha_{0}$ | $C_{2,0.10}$: $s_{\widehat{\alpha}}$ | $C_{3,0.10}$: $\widehat{\alpha}-\alpha_{0}$ | $C_{3,0.10}$: $s_{\widehat{\alpha}}$ | $C_{4,0.10}$: $\widehat{\alpha}-\alpha_{0}$ | $C_{4,0.10}$: $s_{\widehat{\alpha}}$ | $C_{5,0.10}$: $\widehat{\alpha}-\alpha_{0}$ | $C_{5,0.10}$: $s_{\widehat{\alpha}}$ |
|---|---|---|---|---|---|---|---|---|---|---|
| cl | 0.0011 | 0.1117 | -0.0550 | 0.1220 | -0.0154 | 0.1219 | -0.0017 | 0.1219 | 0.0035 | 0.1174 |
| m | -0.0092 | 0.1564 | -0.1680 | 0.0952 | -0.0397 | 0.1110 | -0.0024 | 0.1128 | -0.0006 | 0.1064 |
| wcl-hr | 0.0022 | 0.1234 | 0.0018 | 0.1240 | 0.0021 | 0.1241 | -0.0006 | 0.1250 | 0.0053 | 0.1247 |
| wm-hr | -0.0067 | 0.1292 | -0.0069 | 0.1302 | -0.0065 | 0.1299 | -0.0032 | 0.1292 | 0.0078 | 0.1279 |
| wcl-fbb | 0.0030 | 0.1156 | -0.0057 | 0.1235 | -0.0154 | 0.1219 | -0.0014 | 0.1220 | 0.0051 | 0.1229 |
| wm-fbb | -0.0037 | 0.1262 | -0.0145 | 0.1295 | -0.0240 | 0.1297 | -0.0017 | 0.1245 | 0.0031 | 0.1296 |

Table 1: Bias and standard deviations for the estimators of $\alpha_{0}$, over $n_{R}=1000$ for clean and contaminated samples of size $n=300$.

Figure 2: Boxplot of the estimators for $\alpha_{0}$ under $C_{0}$ to $C_{2,0.10}$. Panels: $C_{0}$; $C_{1,0.05}$, $C_{1,0.10}$; $C_{2,0.05}$, $C_{2,0.10}$.

Figure 3: Boxplot of the estimators for $\alpha_{0}$ under $C_{3,0.05}$ to $C_{5,0.10}$. Panels: $C_{3,0.05}$, $C_{3,0.10}$; $C_{4,0.05}$, $C_{4,0.10}$; $C_{5,0.05}$, $C_{5,0.10}$.

As expected, for clean samples, all the methods perform similarly. It should be noted that the $M$ and weighted $M$-estimators show slightly lower biases than the classical procedure based on the deviance, but their standard deviations are larger due to the loss of efficiency. For all the considered contamination schemes, the obtained results reflect the stability of the weighted estimators. In contrast, the classical procedure and also the $M$-estimator are affected by some of these schemes. When considering the performance across contaminations, we observe that $C_{2,\epsilon}$ is the one with the largest effect on the bias of the classical estimator. This contamination, as well as $C_{3,\epsilon}$, is also damaging for the $M$-estimator due to the presence of extreme high leverage outliers.

Regarding the performance of the weighted estimators under $C_{1,\epsilon}$ to $C_{5,\epsilon}$, the weighted proposal which downweights observations with large robust Mahalanobis distance provides the best results, especially when looking at the bias. Note that under some contaminating schemes, the obtained biases for wm-fbb are considerably larger than those of wm-hr, for instance under $C_{3,0.05}$ it is more than 10 times larger, while their standard deviations are comparable. The same behaviour arises when comparing the biases of wcl-fbb and wcl-hr.

To evaluate the performance of the different estimators of $\beta_{0}$, as in Qingguo, (2015) and Boente et al., (2020), one possibility is to consider numerical approximations of their integrated squared bias and mean integrated squared error, computed on a grid of equally spaced points on $[0,1]$. However, as mentioned in He and Shi, (1998), these measures may be influenced by numerical errors at or near the boundaries of the grid. For that reason, we only report here trimmed versions of the above summaries computed without the $q$ first and last points of the grid. More specifically, if $\widehat{\beta}_{j}$ is the estimate of the function $\beta_{0}$ obtained with the $j$-th sample ($1\leq j\leq n_{R}$) and $t_{1}\leq\dots\leq t_{M}$ are equispaced points on $[0,1]$, we evaluated

$$\begin{aligned}
\mathrm{Bias}_{\mathrm{tr}}^{2}(\widehat{\beta}) &= \frac{1}{M-2q}\sum_{s=q+1}^{M-q}\left(\frac{1}{n_{R}}\sum_{j=1}^{n_{R}}\widehat{\beta}_{j}(t_{s})-\beta_{0}(t_{s})\right)^{2}\,,\\
\mathrm{MISE}_{\mathrm{tr}}(\widehat{\beta}) &= \frac{1}{M-2q}\sum_{s=q+1}^{M-q}\frac{1}{n_{R}}\sum_{j=1}^{n_{R}}\left(\widehat{\beta}_{j}(t_{s})-\beta_{0}(t_{s})\right)^{2}\,.
\end{aligned}$$

We chose $M=100$ and $q=[M\times 0.05]$, which uses the central 90% interior points in the grid.
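The two trimmed summaries can be computed directly from the replicated estimates evaluated on the grid. The sketch below is a minimal illustration with plain Python lists; the function name and the toy inputs are assumptions, and the integer part in $q=[M\times 0.05]$ is taken with `int`.

```python
def trimmed_summaries(estimates, truth, trim_fraction=0.05):
    """Return (trimmed squared bias, trimmed MISE) on an equispaced grid.

    estimates: list of n_R lists, each holding beta_hat_j(t_1), ..., beta_hat_j(t_M)
    truth:     list holding beta_0(t_1), ..., beta_0(t_M)
    """
    m_points = len(truth)
    q = int(m_points * trim_fraction)  # number of points dropped at each end
    n_rep = len(estimates)
    bias2 = 0.0
    mise = 0.0
    # grid indices q+1, ..., M-q in the text correspond to q, ..., M-q-1 here
    for s in range(q, m_points - q):
        mean_estimate = sum(est[s] for est in estimates) / n_rep
        bias2 += (mean_estimate - truth[s]) ** 2
        mise += sum((est[s] - truth[s]) ** 2 for est in estimates) / n_rep
    denom = m_points - 2 * q
    return bias2 / denom, mise / denom


# Toy check: two replications that miss the truth by +1 and -1 everywhere.
M = 100
grid = [s / (M - 1) for s in range(M)]
truth = [t ** 2 for t in grid]
estimates = [[v + 1.0 for v in truth], [v - 1.0 for v in truth]]
bias2, mise = trimmed_summaries(estimates, truth)
```

In the toy check the two errors cancel on average, so the squared bias is essentially 0 while the mean integrated squared error is essentially 1, which illustrates how the second summary also captures variability.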

We also considered another summary measure which aims to evaluate the estimator's predictive capability. With that purpose, denote $(\widehat{\alpha}_{j},\widehat{\beta}_{j})$ the estimator obtained at replication $j$, $1\leq j\leq n_{R}$. At the $j$-th replication, we also generated, independently from the sample used to compute the estimator, a new sample ${\mathcal{T}}_{j}=\{(y_{i,{\mathcal{T}}_{j}},X_{i,{\mathcal{T}}_{j}})\}_{i=1}^{n}$ distributed as $C_{0}$. We then computed the probability mean squared errors defined as

$$\text{PMSE}(\widehat{\alpha},\widehat{\beta})=\frac{1}{n_{R}}\sum_{j=1}^{n_{R}}\frac{1}{n}\sum_{i=1}^{n}\left\{F\left(\langle X_{i,{\mathcal{T}}_{j}},\beta_{0}\rangle+\alpha_{0}\right)-F\left(\langle X_{i,{\mathcal{T}}_{j}},\widehat{\beta}_{j}\rangle+\widehat{\alpha}_{j}\right)\right\}^{2}\,.$$
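The probability mean squared error can be sketched as a double average once the linear predictors on each test sample are available. The code below assumes the logistic link for $F$ and takes as inputs, for each replication, the true predictors $\alpha_{0}+\langle X_{i,{\mathcal{T}}_{j}},\beta_{0}\rangle$ and the fitted ones; the function names and this input layout are choices made for the illustration.

```python
import math


def logistic(t):
    """Logistic link F(t) = 1 / (1 + exp(-t)) (assumed form of F)."""
    return 1.0 / (1.0 + math.exp(-t))


def pmse(true_predictors, fitted_predictors):
    """Average over replications of the mean squared probability difference.

    true_predictors[j][i]:   alpha_0 + <X_i, beta_0> on the j-th test sample
    fitted_predictors[j][i]: alpha_hat_j + <X_i, beta_hat_j> on the same sample
    """
    total = 0.0
    for eta0_list, eta_hat_list in zip(true_predictors, fitted_predictors):
        squared = [
            (logistic(e0) - logistic(eh)) ** 2
            for e0, eh in zip(eta0_list, eta_hat_list)
        ]
        total += sum(squared) / len(squared)
    return total / len(true_predictors)
```

For example, a single test point with true predictor 0 and fitted predictor $\log 3$ compares probabilities 0.5 and 0.75 and contributes a squared difference of 0.0625.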

Table 2 presents the results obtained for the PMSE, while Table 3 reports the obtained summary measures $\mathrm{Bias}^{2}_{\mathrm{tr}}(\widehat{\beta})$ and $\mathrm{MISE}_{\mathrm{tr}}(\widehat{\beta})$.

| Method | $C_{0}$ | $C_{1,0.05}$ | $C_{1,0.10}$ | $C_{2,0.05}$ | $C_{2,0.10}$ | $C_{3,0.05}$ | $C_{3,0.10}$ | $C_{4,0.05}$ | $C_{4,0.10}$ | $C_{5,0.05}$ | $C_{5,0.10}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| cl | 0.0039 | 0.0155 | 0.0251 | 0.0223 | 0.0265 | 0.0118 | 0.0130 | 0.0118 | 0.0129 | 0.0263 | 0.0296 |
| m | 0.0043 | 0.0151 | 0.0244 | 0.0222 | 0.0256 | 0.0105 | 0.0113 | 0.0106 | 0.0113 | 0.0242 | 0.0266 |
| wcl-hr | 0.0042 | 0.0042 | 0.0040 | 0.0042 | 0.0041 | 0.0042 | 0.0040 | 0.0042 | 0.0040 | 0.0041 | 0.0041 |
| wm-hr | 0.0042 | 0.0042 | 0.0041 | 0.0041 | 0.0042 | 0.0042 | 0.0041 | 0.0042 | 0.0041 | 0.0042 | 0.0040 |
| wcl-fbb | 0.0039 | 0.0055 | 0.0115 | 0.0051 | 0.0060 | 0.0108 | 0.0130 | 0.0056 | 0.0108 | 0.0039 | 0.0039 |
| wm-fbb | 0.0039 | 0.0053 | 0.0109 | 0.0053 | 0.0060 | 0.0104 | 0.0120 | 0.0056 | 0.0102 | 0.0038 | 0.0039 |

Table 2: Probability mean squared errors, PMSE, over $n_{R}=1000$ for clean and contaminated samples of size $n=300$.

$C_{0}$

| Method | $\mathrm{Bias}_{\mathrm{tr}}^{2}$ | $\mathrm{MISE}_{\mathrm{tr}}$ |
|---|---|---|
| cl | 0.0029 | 0.3305 |
| m | 0.0024 | 0.3259 |
| wcl-hr | 0.0019 | 0.3795 |
| wm-hr | 0.0021 | 0.3510 |
| wcl-fbb | 0.0029 | 0.3305 |
| wm-fbb | 0.0030 | 0.3257 |

$\epsilon=0.05$

| Method | $C_{1,0.05}$: $\mathrm{Bias}_{\mathrm{tr}}^{2}$ | $C_{1,0.05}$: $\mathrm{MISE}_{\mathrm{tr}}$ | $C_{2,0.05}$: $\mathrm{Bias}_{\mathrm{tr}}^{2}$ | $C_{2,0.05}$: $\mathrm{MISE}_{\mathrm{tr}}$ | $C_{3,0.05}$: $\mathrm{Bias}_{\mathrm{tr}}^{2}$ | $C_{3,0.05}$: $\mathrm{MISE}_{\mathrm{tr}}$ | $C_{4,0.05}$: $\mathrm{Bias}_{\mathrm{tr}}^{2}$ | $C_{4,0.05}$: $\mathrm{MISE}_{\mathrm{tr}}$ | $C_{5,0.05}$: $\mathrm{Bias}_{\mathrm{tr}}^{2}$ | $C_{5,0.05}$: $\mathrm{MISE}_{\mathrm{tr}}$ |
|---|---|---|---|---|---|---|---|---|---|---|
| cl | 0.5726 | 0.7706 | 1.2394 | 1.5669 | 0.1416 | 0.4412 | 0.1435 | 0.4246 | 0.9499 | 1.0056 |
| m | 0.5222 | 0.7338 | 1.0635 | 1.3688 | 0.1290 | 0.3230 | 0.1300 | 0.3323 | 0.8710 | 0.8908 |
| wcl-hr | 0.0019 | 0.3566 | 0.0020 | 0.3523 | 0.0020 | 0.3535 | 0.0020 | 0.3444 | 0.0025 | 0.3534 |
| wm-hr | 0.0016 | 0.3350 | 0.0014 | 0.3353 | 0.0015 | 0.3379 | 0.0016 | 0.3382 | 0.0018 | 0.3605 |
| wcl-fbb | 0.0749 | 0.4007 | 0.0119 | 0.4016 | 0.1069 | 0.4258 | 0.0076 | 0.3386 | 0.0028 | 0.3190 |
| wm-fbb | 0.0517 | 0.3878 | 0.0105 | 0.4046 | 0.1028 | 0.4173 | 0.0086 | 0.3452 | 0.0019 | 0.3179 |

$\epsilon=0.10$

| Method | $C_{1,0.10}$: $\mathrm{Bias}_{\mathrm{tr}}^{2}$ | $C_{1,0.10}$: $\mathrm{MISE}_{\mathrm{tr}}$ | $C_{2,0.10}$: $\mathrm{Bias}_{\mathrm{tr}}^{2}$ | $C_{2,0.10}$: $\mathrm{MISE}_{\mathrm{tr}}$ | $C_{3,0.10}$: $\mathrm{Bias}_{\mathrm{tr}}^{2}$ | $C_{3,0.10}$: $\mathrm{MISE}_{\mathrm{tr}}$ | $C_{4,0.10}$: $\mathrm{Bias}_{\mathrm{tr}}^{2}$ | $C_{4,0.10}$: $\mathrm{MISE}_{\mathrm{tr}}$ | $C_{5,0.10}$: $\mathrm{Bias}_{\mathrm{tr}}^{2}$ | $C_{5,0.10}$: $\mathrm{MISE}_{\mathrm{tr}}$ |
|---|---|---|---|---|---|---|---|---|---|---|
| cl | 1.0181 | 1.1032 | 1.5130 | 1.8423 | 0.1620 | 0.4570 | 0.1659 | 0.4436 | 1.0601 | 1.0953 |
| m | 0.9764 | 1.0436 | 1.2632 | 1.5716 | 0.1440 | 0.3264 | 0.1414 | 0.3118 | 0.9227 | 0.9385 |
| wcl-hr | 0.0018 | 0.3484 | 0.0017 | 0.3497 | 0.0018 | 0.3475 | 0.0025 | 0.3344 | 0.0016 | 0.3381 |
| wm-hr | 0.0033 | 0.3354 | 0.0023 | 0.3440 | 0.0023 | 0.3385 | 0.0025 | 0.3272 | 0.0017 | 0.3274 |
| wcl-fbb | 0.4184 | 0.6609 | 0.0270 | 0.4626 | 0.1617 | 0.4568 | 0.0997 | 0.4133 | 0.0016 | 0.3188 |
| wm-fbb | 0.3615 | 0.6406 | 0.0252 | 0.4493 | 0.1441 | 0.4293 | 0.0942 | 0.4012 | 0.0016 | 0.3099 |

Table 3: Trimmed version of the integrated squared bias and mean integrated squared errors for the estimators of $\beta_{0}$, over $n_{R}=1000$ for clean and contaminated samples of size $n=300$.

In order to visually explore the performance of these estimators, Figures 4 to 14 contain functional boxplots, see Sun and Genton, (2011), for the $n_{R}=1000$ realizations of the different estimators for $\beta_{0}$ under the contamination settings. As in standard boxplots, the magenta central box of these functional boxplots represents the 50% inner band of curves, the solid black line indicates the central (deepest) function and the dotted red lines indicate outlying curves, that is, outlying estimates $\widehat{\beta}_{j}$ for some $1\leq j\leq n_{R}$. We also indicate in blue lines the whiskers delimiting the non–outlying curves and the true function $\beta_{0}$ with a dark green line. To avoid boundary effects, we show in Figures 15 to 25 the different estimates evaluated on the interior points of a grid of 100 equispaced points. In addition, to facilitate comparisons between contamination cases and estimation methods, the scales of the vertical axes are the same for all Figures.

In clean samples, all the estimators give similar results, with slightly smaller values of $\mathrm{Bias}_{\mathrm{tr}}^{2}$ for the weighted estimators. It is worth mentioning that the weighted $M$-estimators are remarkably efficient. Furthermore, in terms of the PMSE, the weighted estimators that give weight 0 to the observations whose covariates are detected as atypical by the functional boxplot give results similar to those of the classical estimator. This behaviour may be explained by the fact that, in most samples generated under $C_0$, no atypical curves are detected among the covariates.

As expected, when misclassified observations are present, the procedure based on the deviance breaks down, in particular when these responses are combined with extreme high-leverage covariates, as is the case for $C_{2,\epsilon}$. Contamination schemes $C_{3,\epsilon}$ and $C_{4,\epsilon}$ have less effect than $C_{1,\epsilon}$ on the integrated squared bias and mean integrated squared error of the classical estimator of the slope. This fact is also illustrated in Figures 9 to 12 and 20 to 23, where the true curve is still in the band containing the 50% deepest estimates. In contrast, under $C_{1,\epsilon}$, $C_{2,\epsilon}$ and particularly under $C_{5,\epsilon}$, the true function lies beyond the limits of the functional boxplot, meaning that the obtained estimates become completely uninformative and do not reflect the shape of $\beta_0$; see, for instance, Figures 5 to 8, 13 and 14.

It is interesting to note that the unweighted $M$-estimator shows a performance similar to that of the classical one for the considered contamination schemes. In contrast, weighted estimators give very good results in all the studied contamination settings, especially wcl-hr and wm-hr, which clearly outperform wcl-fbb and wm-fbb in most cases. It is worth mentioning that the weighted estimators based on the deviance are quite stable across contaminations, the only exception being $C_{1,0.10}$, where wcl-fbb is more sensitive. In this last case, as when considering the wm-fbb estimates, the true curve lies above the magenta central region for values of $t$ larger than 0.8 (see Figures 6 and 17), but is still included in the band limited by the blue whiskers.

In most cases, the weighted $M$-estimators improve on the weighted estimators obtained when $\rho(t)=t$, and the procedures with weights based on the Mahalanobis distance of the projected covariates outperform those whose weights are based on the functional boxplot. In particular, contamination schemes $C_{1,\epsilon}$ and $C_{3,\epsilon}$ affect the PMSE of the latter procedures more than that of the former. Note that, under $C_{1,0.10}$, $C_{3,0.10}$ and also under $C_{4,0.10}$, the PMSE of wcl-fbb is twice that obtained with wcl-hr, and similarly when comparing wm-fbb with wm-hr (see Table 2).

In summary, the wcl-hr and especially the wm-hr estimators display a remarkably stable behaviour across the selected contaminations. Both estimators are comparable, but wm-hr attains in general lower values of $\mathrm{MISE}_{\mathrm{tr}}$. The good performance of wcl-hr may be explained by the fact that the hard–rejection weights mainly discard from the sample the observations with high leverage covariates, which in most cases correspond to misclassified ones. This behaviour was already noticed by Croux and Haesbroeck (2003) in the finite–dimensional setting.

[Six panels: cl, m, wcl-hr, wm-hr, wcl-fbb, wm-fbb]
Figure 4: Functional boxplot of the estimators for $\beta_0$ under $C_0$ within the interval $[0,1]$. The true function is shown with a green dashed line, while the black solid one is the central curve of the $n_R=1000$ estimates $\widehat{\beta}$.

[Six panels: cl, m, wcl-hr, wm-hr, wcl-fbb, wm-fbb]
Figure 5: Functional boxplot of the estimators for $\beta_0$ under $C_{1,0.05}$ within the interval $[0,1]$. The true function is shown with a green dashed line, while the black solid one is the central curve of the $n_R=1000$ estimates $\widehat{\beta}$.

[Six panels: cl, m, wcl-hr, wm-hr, wcl-fbb, wm-fbb]
Figure 6: Functional boxplot of the estimators for $\beta_0$ under $C_{1,0.10}$ within the interval $[0,1]$. The true function is shown with a green dashed line, while the black solid one is the central curve of the $n_R=1000$ estimates $\widehat{\beta}$.

[Six panels: cl, m, wcl-hr, wm-hr, wcl-fbb, wm-fbb]
Figure 7: Functional boxplot of the estimators for $\beta_0$ under $C_{2,0.05}$ within the interval $[0,1]$. The true function is shown with a green dashed line, while the black solid one is the central curve of the $n_R=1000$ estimates $\widehat{\beta}$.

[Six panels: cl, m, wcl-hr, wm-hr, wcl-fbb, wm-fbb]
Figure 8: Functional boxplot of the estimators for $\beta_0$ under $C_{2,0.10}$ within the interval $[0,1]$. The true function is shown with a green dashed line, while the black solid one is the central curve of the $n_R=1000$ estimates $\widehat{\beta}$.

[Six panels: cl, m, wcl-hr, wm-hr, wcl-fbb, wm-fbb]
Figure 9: Functional boxplot of the estimators for $\beta_0$ under $C_{3,0.05}$ within the interval $[0,1]$. The true function is shown with a green dashed line, while the black solid one is the central curve of the $n_R=1000$ estimates $\widehat{\beta}$.

[Six panels: cl, m, wcl-hr, wm-hr, wcl-fbb, wm-fbb]
Figure 10: Functional boxplot of the estimators for $\beta_0$ under $C_{3,0.10}$ within the interval $[0,1]$. The true function is shown with a green dashed line, while the black solid one is the central curve of the $n_R=1000$ estimates $\widehat{\beta}$.

[Six panels: cl, m, wcl-hr, wm-hr, wcl-fbb, wm-fbb]
Figure 11: Functional boxplot of the estimators for $\beta_0$ under $C_{4,0.05}$ within the interval $[0,1]$. The true function is shown with a green dashed line, while the black solid one is the central curve of the $n_R=1000$ estimates $\widehat{\beta}$.

[Six panels: cl, m, wcl-hr, wm-hr, wcl-fbb, wm-fbb]
Figure 12: Functional boxplot of the estimators for $\beta_0$ under $C_{4,0.10}$ within the interval $[0,1]$. The true function is shown with a green dashed line, while the black solid one is the central curve of the $n_R=1000$ estimates $\widehat{\beta}$.

[Six panels: cl, m, wcl-hr, wm-hr, wcl-fbb, wm-fbb]
Figure 13: Functional boxplot of the estimators for $\beta_0$ under $C_{5,0.05}$ within the interval $[0,1]$. The true function is shown with a green dashed line, while the black solid one is the central curve of the $n_R=1000$ estimates $\widehat{\beta}$.

[Six panels: cl, m, wcl-hr, wm-hr, wcl-fbb, wm-fbb]
Figure 14: Functional boxplot of the estimators for $\beta_0$ under $C_{5,0.10}$ within the interval $[0,1]$. The true function is shown with a green dashed line, while the black solid one is the central curve of the $n_R=1000$ estimates $\widehat{\beta}$.

[Six panels: cl, m, wcl-hr, wm-hr, wcl-fbb, wm-fbb]
Figure 15: Functional boxplot of the estimators for $\beta_0$ under $C_0$ within the interval $[0.05,0.95]$. The true function is shown with a green dashed line, while the black solid one is the central curve of the $n_R=1000$ estimates $\widehat{\beta}$.

[Six panels: cl, m, wcl-hr, wm-hr, wcl-fbb, wm-fbb]
Figure 16: Functional boxplot of the estimators for $\beta_0$ under $C_{1,0.05}$ within the interval $[0.05,0.95]$. The true function is shown with a green dashed line, while the black solid one is the central curve of the $n_R=1000$ estimates $\widehat{\beta}$.

[Six panels: cl, m, wcl-hr, wm-hr, wcl-fbb, wm-fbb]
Figure 17: Functional boxplot of the estimators for $\beta_0$ under $C_{1,0.10}$ within the interval $[0.05,0.95]$. The true function is shown with a green dashed line, while the black solid one is the central curve of the $n_R=1000$ estimates $\widehat{\beta}$.

[Six panels: cl, m, wcl-hr, wm-hr, wcl-fbb, wm-fbb]
Figure 18: Functional boxplot of the estimators for $\beta_0$ under $C_{2,0.05}$ within the interval $[0.05,0.95]$. The true function is shown with a green dashed line, while the black solid one is the central curve of the $n_R=1000$ estimates $\widehat{\beta}$.

[Six panels: cl, m, wcl-hr, wm-hr, wcl-fbb, wm-fbb]
Figure 19: Functional boxplot of the estimators for $\beta_0$ under $C_{2,0.10}$ within the interval $[0.05,0.95]$. The true function is shown with a green dashed line, while the black solid one is the central curve of the $n_R=1000$ estimates $\widehat{\beta}$.

[Six panels: cl, m, wcl-hr, wm-hr, wcl-fbb, wm-fbb]
Figure 20: Functional boxplot of the estimators for $\beta_0$ under $C_{3,0.05}$ within the interval $[0.05,0.95]$. The true function is shown with a green dashed line, while the black solid one is the central curve of the $n_R=1000$ estimates $\widehat{\beta}$.

[Six panels: cl, m, wcl-hr, wm-hr, wcl-fbb, wm-fbb]
Figure 21: Functional boxplot of the estimators for $\beta_0$ under $C_{3,0.10}$ within the interval $[0.05,0.95]$. The true function is shown with a green dashed line, while the black solid one is the central curve of the $n_R=1000$ estimates $\widehat{\beta}$.

[Six panels: cl, m, wcl-hr, wm-hr, wcl-fbb, wm-fbb]
Figure 22: Functional boxplot of the estimators for $\beta_0$ under $C_{4,0.05}$ within the interval $[0.05,0.95]$. The true function is shown with a green dashed line, while the black solid one is the central curve of the $n_R=1000$ estimates $\widehat{\beta}$.

[Six panels: cl, m, wcl-hr, wm-hr, wcl-fbb, wm-fbb]
Figure 23: Functional boxplot of the estimators for $\beta_0$ under $C_{4,0.10}$ within the interval $[0.05,0.95]$. The true function is shown with a green dashed line, while the black solid one is the central curve of the $n_R=1000$ estimates $\widehat{\beta}$.

[Six panels: cl, m, wcl-hr, wm-hr, wcl-fbb, wm-fbb]
Figure 24: Functional boxplot of the estimators for $\beta_0$ under $C_{5,0.05}$ within the interval $[0.05,0.95]$. The true function is shown with a green dashed line, while the black solid one is the central curve of the $n_R=1000$ estimates $\widehat{\beta}$.

[Six panels: cl, m, wcl-hr, wm-hr, wcl-fbb, wm-fbb]
Figure 25: Functional boxplot of the estimators for $\beta_0$ under $C_{5,0.10}$ within the interval $[0.05,0.95]$. The true function is shown with a green dashed line, while the black solid one is the central curve of the $n_R=1000$ estimates $\widehat{\beta}$.

5 Real data example

In this section, we consider the electricity prices data set analysed in Liebl (2013) in the context of electricity price forecasting, and we investigate the performance of the proposed estimators.

The data consist of hourly electricity prices in Germany between 1 January 2006 and 30 September 2008, as traded at the Leipzig European Energy Exchange, and of German electricity demand (as reported by the European Network of Transmission System Operators for Electricity). These data were also used in Boente et al. (2020), who modelled the daily average hourly energy demand through a semi-functional linear regression model, using as Euclidean covariate the mean hourly amount of wind-generated electricity in the system for that day and as functional covariate the curve of energy prices observed hourly.

In our analysis, the response measures high or low demand of electricity, that is, we define the binary variable $y=1$ if the average hourly demand exceeds 55000 (“High Demand”) and $y=0$ (“Low Demand”) otherwise. Furthermore, the functional covariates used to predict the conditional probability that a day has high demand correspond to the curves of energy prices. These curves are observed hourly, giving rise to a matrix of dimension $638\times 24$ after removing weekends, holidays and other non-working days. Hence, with these data we fit a functional logistic regression model,

$$\mathbb{P}(y=1|X)=F\left(\alpha_{0}+\left\langle X,\beta_{0}\right\rangle\right)\,,$$

using the weighted robust estimators defined in this paper and their classical alternatives. The trajectories corresponding to $y=1$ and $y=0$ are given in Figure 26.
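The model above can be illustrated with a minimal sketch, under stated assumptions: $F$ is taken to be the logistic function, the 24 hourly observation points are rescaled to $[0,1]$, and the inner product $\langle X,\beta\rangle$ is approximated by the trapezoidal rule. The function names and the toy price curve, slope and intercept are hypothetical and are not taken from the data set.

```python
import math

def logistic(t):
    """Logistic link F(t) = 1 / (1 + exp(-t)), written to avoid overflow."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def inner_product(x, beta, grid):
    """Trapezoidal approximation of <X, beta>, the integral of X(t) * beta(t)."""
    total = 0.0
    for k in range(len(grid) - 1):
        left = x[k] * beta[k]
        right = x[k + 1] * beta[k + 1]
        total += 0.5 * (left + right) * (grid[k + 1] - grid[k])
    return total

def high_demand_probability(alpha, beta, x, grid):
    """P(y = 1 | X) = F(alpha + <X, beta>) for one discretised curve."""
    return logistic(alpha + inner_product(x, beta, grid))

def demand_label(average_hourly_demand, threshold=55000.0):
    """Binary response: 1 ("High Demand") if the average exceeds the threshold."""
    return 1 if average_hourly_demand > threshold else 0

# Hypothetical toy inputs (assumptions, not values from the paper).
grid = [h / 23.0 for h in range(24)]          # 24 hourly points rescaled to [0, 1]
x = [40.0 + 10.0 * math.sin(math.pi * t) for t in grid]  # toy price curve
beta = [0.05 for _ in grid]                   # toy constant slope function
p = high_demand_probability(-1.5, beta, x, grid)
```

Only the threshold 55000 and the $638\times 24$ hourly layout come from the text; everything else is a placeholder chosen for illustration.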

[Two panels: Low demand ($y=0$), High demand ($y=1$)]
Figure 26: Energy prices for the German electricity data.

To compute the estimators, we used $B$-splines to generate the space of possible candidates and, as above, we denote by $\{B_j\}_{j=1}^{k_n}$ the $k_n$-dimensional basis. As in the simulation study, after dimension reduction, we computed the finite–dimensional coefficients through (5), using as loss $\rho(t)=t$, which leads to the deviance-based estimators denoted cl, and the $\rho$-function introduced in Croux and Haesbroeck (2003) with tuning constant $0.5$, denoted m. We also computed weighted versions of these estimators, with weights based on the robust Mahalanobis distance of the projected data. That is, we evaluated the Donoho–Stahel location and scatter estimators, denoted $\widehat{\boldsymbol{\mu}}$ and $\widehat{\boldsymbol{\Sigma}}$, respectively, of the sample $\mathbf{x}_1,\dots,\mathbf{x}_n$ with $\mathbf{x}_i=(x_{i1},\dots,x_{ik_n})^{\mathrm{T}}$, $x_{ij}=\langle X_i,B_j\rangle$, and we defined $w(X_i)=1$ when $d_i^2=(\mathbf{x}_i-\widehat{\boldsymbol{\mu}})^{\mathrm{T}}\widehat{\boldsymbol{\Sigma}}^{-1}(\mathbf{x}_i-\widehat{\boldsymbol{\mu}})$ is less than or equal to $\chi_{0.975,k_n}$ and $w(X_i)=0$ otherwise. For simplicity, in Table 4 below, these procedures are labelled wcl and wm, respectively. As in the simulation, to select the basis dimension we use the criterion defined in Section 2.4, that is, we minimize the quantity $RBIC(k)$ defined in (6). All procedures choose $k_n=7$, except the $M$-estimator, which selects $k_n=5$.
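The hard–rejection rule can be sketched as follows, with two loudly stated simplifications. First, the Donoho–Stahel estimators are replaced by a much cruder stand-in, the coordinatewise median and a diagonal scatter built from normalised MADs, so this is not the procedure used in the paper. Second, the cutoff is assumed to be the 0.975 quantile of a chi-square distribution with $k_n$ degrees of freedom; it is passed as an argument, and only the case of two degrees of freedom, where the quantile has a closed form, is evaluated. The function names and the toy scores are hypothetical.

```python
import math
from statistics import median

def robust_center_scale(scores):
    """Coordinatewise median and normalised MAD of the projected scores.

    Stand-in for the Donoho-Stahel location and scatter (an assumption):
    the resulting scatter is diagonal and ignores correlations.
    """
    k = len(scores[0])
    centers, scales = [], []
    for j in range(k):
        column = [row[j] for row in scores]
        med = median(column)
        mad = 1.4826 * median([abs(v - med) for v in column])
        if mad <= 0.0:
            raise ValueError("zero MAD: cannot scale coordinate %d" % j)
        centers.append(med)
        scales.append(mad)
    return centers, scales

def hard_rejection_weights(scores, cutoff):
    """w_i = 1 if the squared robust distance is at most the cutoff, else 0."""
    centers, scales = robust_center_scale(scores)
    weights = []
    for row in scores:
        d2 = sum(((v - c) / s) ** 2 for v, c, s in zip(row, centers, scales))
        weights.append(1 if d2 <= cutoff else 0)
    return weights

# With 2 degrees of freedom the chi-square distribution function is
# 1 - exp(-x / 2), so its 0.975 quantile is -2 * log(1 - 0.975).
cutoff_2df = -2.0 * math.log(1.0 - 0.975)

# Hypothetical projected scores x_i = (x_i1, x_i2); the last row is an
# artificial high-leverage point.
scores = [[0.1, -0.2], [0.0, 0.3], [-0.3, 0.1],
          [0.2, 0.0], [-0.1, -0.1], [8.0, 9.0]]
weights = hard_rejection_weights(scores, cutoff_2df)
```

For other basis dimensions the cutoff would have to come from a chi-square quantile routine, and a genuinely robust, affine equivariant scatter estimator would be needed to reproduce the weights described in the text.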

cl m wcl wm
-1.564 -1.588 -1.693 -1.811
Table 4: Estimates of $\alpha_0$.

The estimates of $\alpha_0$ are reported in Table 4, while those for $\beta_0$ are shown in Figure 27. Solid black and gray lines represent the wm and $M$-estimators, respectively, while red solid and dashed lines correspond to the cl and wcl procedures. Comparing the obtained results, we note that the weighted estimators lead to slightly larger absolute values of the intercept. Regarding the estimation of $\beta_0$, it is clear that the $M$-estimator produces a smoother curve, since the basis dimension is smaller. The classical estimate is almost equal to 0 between 9am and 4pm, while all the weighted estimators detect a positive “peak” around 1pm and two slumps near 8am and 5pm. All procedures detect the large peaks close to 4am and 9pm, where prices seem to have a larger (in magnitude) association when predicting the demand, but for the classical procedure the magnitude of the function $\widehat{\beta}$ is smaller than that of the weighted ones in that range.

Figure 27: German Electricity: Estimates of $\beta_0$. Solid black and gray lines are used for the weighted $M$- and $M$-estimators, respectively, while red solid and dashed lines correspond to the classical procedure and its weighted counterpart.

To identify potential atypical observations, we used the deviance QQ-plot defined in García Ben and Yohai (2004). For that purpose, we considered the deviance residuals based on the wm-estimator. More precisely, if $(\widehat{\alpha},\widehat{\beta})$ stands for the weighted $M$-estimator, we compute the predicted probabilities and deviance residuals $\widehat{p}_i=F\left(\widehat{\alpha}+\langle X_i,\widehat{\beta}\rangle\right)$ and $d_i=\mathrm{sign}(y_i-\widehat{p}_i)\sqrt{-2\left\{(1-y_i)\log(1-\widehat{p}_i)+y_i\log(\widehat{p}_i)\right\}}$. Let $\widehat{F}_D$ be the estimator of their distribution function, given by

$$\widehat{F}_{D}(d)=\frac{1}{n}\left\{\sum_{\widehat{p}_{i}\in{\mathcal{A}}}\widehat{p}_{i}+\sum_{1-\widehat{p}_{i}\in{\mathcal{A}}}\left(1-\widehat{p}_{i}\right)\right\}\,,$$

where $\mathcal{A}=\left\{t:(-2\log t)^{1/2}\leq d\right\}$. We consider as outliers those observations with deviance residual $d_i$ smaller than $\widehat{F}_D^{-1}(0.005)$ or larger than $\widehat{F}_D^{-1}(0.995)$, leading to $19$ observations detected as possibly atypical. These observations, together with the value of their deviance residual and predicted probability, are reported in Table 5.

Observations with $y_i=0$:
$i$ Date $d_i$ $\widehat{p}_i$
134 2006-07-25 -5.686 1.000
136 2006-07-27 -4.972 1.000
137 2006-07-28 -2.667 0.971
469 2008-01-10 -2.493 0.955
502 2008-03-03 -2.412 0.945
508 2008-03-11 -2.282 0.926
519 2008-03-28 -3.052 0.990
621 2008-09-05 -2.512 0.957
622 2008-09-08 -2.497 0.956
627 2008-09-15 -2.385 0.942
638 2008-09-30 -2.923 0.986

Observations with $y_i=1$:
$i$ Date $d_i$ $\widehat{p}_i$
66 2006-04-10 2.344 0.064
75 2006-04-25 2.905 0.015
76 2006-04-26 2.511 0.043
77 2006-04-27 2.313 0.069
445 2007-11-20 3.208 0.006
543 2008-05-06 2.393 0.057
550 2008-05-16 2.435 0.052
580 2008-07-01 2.196 0.090

Table 5: Outliers identified by the weighted $M$-estimator with weights based on the Mahalanobis distance of the projected data and a hard rejection weight function.
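The deviance residuals can be computed directly from the response and the predicted probability. The sketch below is only an illustration of the formula for $d_i$; the function name is hypothetical, and the inputs are a few rounded probabilities copied from Table 5. Because those probabilities are rounded to three decimals, the recomputed residuals should only be expected to match the tabulated ones approximately, and rows reported as $\widehat{p}_i=1.000$ cannot be recomputed at all from the rounded value.

```python
import math

def deviance_residual(y, p):
    """Signed deviance residual for a binary response y and fitted probability p.

    Only the log term selected by y is evaluated, which is equivalent to
    (1 - y) * log(1 - p) + y * log(p) for y in {0, 1}.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    log_lik = math.log(p) if y == 1 else math.log(1.0 - p)
    sign = 1.0 if y - p >= 0 else -1.0
    return sign * math.sqrt(-2.0 * log_lik)

# (y_i, rounded p_i, tabulated d_i) for four rows of Table 5.
rows = [(1, 0.064, 2.344), (1, 0.015, 2.905),
        (0, 0.971, -2.667), (0, 0.955, -2.493)]
residuals = [deviance_residual(y, p) for y, p, _ in rows]
```

Misclassified days show up with large residuals of either sign: a high demand day with a small fitted probability gives a large positive $d_i$, and a low demand day with a fitted probability near one gives a large negative $d_i$.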

It is worth mentioning that the observation corresponding to 7 November 2006, which is an outlier in the covariate space, is not detected as such by the deviance QQ-plot, since it does not correspond to a bad leverage point.

[Two panels: Low demand, High demand]
Figure 28: Energy prices for the German electricity data. The covariates related to the detected atypical data are plotted in dashed black lines.

The covariates related to the detected atypical observations are presented in Figure 28 with dashed black lines. The left panel shows the covariates corresponding to $y=0$ and the right one those corresponding to $y=1$, where we also indicate the curve corresponding to 7 November 2006, which clearly appears as an outlier in the covariate space.

We next compute the classical estimator after removing the identified outliers. The obtained estimate of $\alpha_0$ is $-1.852$, a value closer to the one obtained with the wm estimator ($-1.811$, see Table 4). Figure 29 presents the obtained estimator in dotted magenta lines. To facilitate comparisons, the left panel shows all the estimators, while the right one shows only the weighted $M$-estimate, with a solid black line, and the classical ones computed with the whole data set and without the outliers, with a red solid line and a dotted magenta one, respectively. Note the similarity between the curve obtained by the wm method and the estimate obtained with the cl method after removing the atypical observations.

[Two panels: all estimators (left); weighted $M$-estimate and classical estimates with and without the detected outliers (right)]
Figure 29: German Electricity: Estimates of $\beta_0$. Solid black and gray lines are used for the weighted $M$- and $M$-estimators, respectively, while red solid and dashed lines correspond to the classical procedure and its weighted counterpart. The classical estimator computed without the detected outliers is given in magenta dotted lines.

6 Conclusion

This paper addresses the problem of providing robust estimators for the functional logistic regression model. We tackle the problem by basis dimension reduction, that is, we consider a finite–dimensional space of candidates and, once the reduction is made, we use weighted $M$-estimators to down–weight the effect of bad leverage covariates. In this sense, our estimators are robust against outliers corresponding to misclassified points, and also against outliers in the functional explanatory variables that have a damaging effect on the estimation. We also propose a robust BIC-type criterion to select the optimal dimension of the spline basis, which seems to work well in practice.

Regarding the asymptotic behaviour of our proposal, we prove that the estimators are strongly consistent under a very general framework, which allows the practitioner to choose among different bases, such as $B$-splines, Bernstein and Legendre polynomials, Fourier bases or wavelets. The convergence rates obtained depend on the capability of the basis to approximate the true slope function. It is worth mentioning that, unlike the weak consistency results obtained in Kalogridis (2023) for penalized estimators based on divergence measures, our results do not require the covariates to be bounded. Moreover, strong consistency, rather than convergence in probability, of the sieve-based weighted $M$-estimators is derived.

The performance in finite samples is studied by means of a simulation study and a real data example. The simulation study reveals the good robustness properties of our proposal and the high sensitivity of the procedure based on the deviance when atypical observations are present. In particular, weighted $M$-estimators show their advantage over the unweighted ones, a fact already observed by Croux and Haesbroeck (2003) when finite–dimensional covariates instead of functional ones are used in the model.

We also apply our method to a real data set, where the weighted $M$-estimator remains reliable even in the presence of misclassified observations corresponding to atypical functional explanatory variables. The deviance residuals obtained from the robust fit provide a natural way to identify potential atypical observations. The classical estimators obtained by minimizing the deviance after dimension reduction and computed over the “cleaned” data set, that is, the data set without the outliers, have a shape similar to that obtained by the robust procedure, which automatically down–weights these data.

As has been extensively discussed, functional data are intrinsically infinite–dimensional and, even when recorded on a finite grid of points, they should be considered as random elements of some functional space rather than multivariate observations. For instance, in some applications, the functional covariates may be viewed as longitudinal data, while in other cases, the predictors are densely observed curves. Examples of these situations are the Mediterranean fruit flies data studied in Müller and Stadtmüller (2005), for the former, and the Tecator data set studied in Ferraty and Vieu (2006) or the Danone data set reported in Aguilera-Morillo et al. (2013), for the latter. In more challenging cases, the grid at which the predictors are recorded may be sparse or irregular and the observations may be subject to measurement errors. The proposals considered in James (2002) or Müller (2005) for generalized functional regression models allow for sparse recording and measurement errors. Even though our proposal can be implemented if the predictors are observed on a dense grid of points, the asymptotic results derived make use of the whole process structure. The interesting situation of sparse and irregular time grids, which is common, for instance, in longitudinal studies, is beyond the scope of the paper and may be the object of future research.

7 Acknowledgements.

This research was partially supported by Universidad de Buenos Aires [Grant 20020170100022BA] and ANPCYT [Grant PICT 2021-I-A-00260] in Argentina (Graciela Boente and Marina Valdora) and by the Ministry of Economy, Industry and Competitiveness, Spain (MINECO/AEI/FEDER, UE) [Grant MTM2016-76969P] (Graciela Boente).

A Appendix

From now on, we denote ν(t)\nu(t) the function

ν(t)=ψ(logF(t))[1F(t)]+ψ(log[1F(t)])F(t).\nu(t)=\psi\left(-\log F(t)\right)\left[1-F(t)\right]+\psi\left(-\log\left[1-F(t)\right]\right)F(t)\,. (8)

Under A1 and A2, the function ν()\nu(\cdot) is continuous, bounded and strictly positive.

Denote $\Psi(y,t)={\partial}\phi(y,t)/{\partial t}=-[y-F(t)]\,\nu(t)$, where $\nu(t)$ is given by (8). Under A1 and A2, the function $\Psi(y,\cdot)$ is continuous and bounded, and we will denote $M_{\phi}=\sup_{y\in\{0,1\},t\in\mathbb{R}}\phi(y,t)$ and $M_{\Psi}=\sup_{y\in\{0,1\},t\in\mathbb{R}}|\Psi(y,t)|$, where we have used that $\phi(y,t)\geq 0$ from assumptions A1 and A2.

Henceforth, for any $(\alpha,\beta)\in\mathbb{R}\times L^{2}([0,1])$, we denote $\theta=(\alpha,\beta)$. Besides, $L(\theta)$ stands for the population counterpart of $L_{n}(\theta)$, that is,

$$L(\theta)=\mathbb{E}\left(\phi\left(y,\alpha+\langle X,\beta\rangle\right)w(X)\right)\,,$$

where $y|X\sim Bi\left(1,F(\alpha_{0}+\langle X,\beta_{0}\rangle)\right)$. Recall that we have denoted $\widehat{\theta}=(\widehat{\alpha},\widehat{\beta})$ and $\theta_{0}=(\alpha_{0},\beta_{0})$.

A.1 Some preliminary results

Lemma A.1 states the Fisher–consistency of the estimators defined through (5). It follows using similar arguments as those considered in Theorem S.1.1 from Bianco et al., (2022). Note that Lemma A.1(a) only states that the true value $\theta_{0}$ is one of the minimizers of the function $L(\theta)$, while in part (b), to derive that it is the unique minimizer, an additional requirement, (9), is needed. As mentioned in Remark 3.2, this condition is related to the fact that the slope parameter is not identifiable if the kernel of the covariance operator of $X$ does not reduce to $\{0\}$. Instead of asking (9) to hold for any $\beta\in L^{2}([0,1])$, we require that the condition holds only for $\beta\in{\mathcal{H}}^{\star}$ whenever $\beta_{0}\in{\mathcal{H}}^{\star}$, where henceforth, ${\mathcal{H}}^{\star}={\mathcal{H}}$ when $\beta_{0}\in{\mathcal{H}}$, while ${\mathcal{H}}^{\star}={\mathcal{W}}^{1,{\mathcal{H}}}$ when $\beta_{0}\in{\mathcal{W}}^{1,{\mathcal{H}}}$. It is worth mentioning that under A5 and A11, condition (9) is fulfilled.

Lemma A.1.

Suppose $(y,X)$ follows the functional logistic regression model given in (1). Let $\phi:\mathbb{R}^{2}\to\mathbb{R}$ be the function given by (2), where $\rho$ satisfies conditions A1 and A2 and let $w$ be a non–negative bounded function.

  (a) For any $\theta=(\alpha,\beta)\in\mathbb{R}\times L^{2}([0,1])$, we have that $L(\theta_{0})\leq L(\theta)$.

  (b) Furthermore, assume that $\beta_{0}\in{\mathcal{H}}^{\star}$ and

    $$\mathbb{P}\left(\langle X,\beta\rangle=a\cup\{w(X)=0\}\right)<1,\qquad\forall a\in\mathbb{R},\beta\in{\mathcal{H}}^{\star},(a,\beta)\neq 0\,, \tag{9}$$

    holds. Then, for all $\alpha\in\mathbb{R}$, $\beta\in{\mathcal{H}}^{\star}$, $\theta=(\alpha,\beta)\neq(\alpha_{0},\beta_{0})$, we have $L(\theta_{0})<L(\theta)$.

Proof.

(a) As in Theorem 2.2 in Bianco and Yohai, (1996), taking conditional expectation, we have that

$$\begin{aligned}
L(\theta) &=\mathbb{E}\left[\mathbb{E}\left\{\phi(y,\alpha+\langle X,\beta\rangle)\,w(X)\,|\,X\right\}\right]=\mathbb{E}\left\{\phi(F(\alpha_{0}+\langle X,\beta_{0}\rangle),\alpha+\langle X,\beta\rangle)\,w(X)\right\}\\
&=\mathbb{E}\left\{\phi(F(R_{0}(X)),R(X))\,w(X)\right\}\,,
\end{aligned}$$

where for a fixed value $X$, we have denoted $R(X)=\alpha+\langle X,\beta\rangle$ and $R_{0}(X)=\alpha_{0}+\langle X,\beta_{0}\rangle$.

We will show that, for any fixed $t_{0}$, the function $\phi(F(t_{0}),t)$ reaches its unique minimum when $t=t_{0}$. For simplicity, denote $\Phi(t)=\phi(F(t_{0}),t)$; then, $\Phi^{\prime}(t)=\Psi(F(t_{0}),t)=-\,(F(t_{0})-F(t))\,\nu(t)$, where $\nu(t)$ defined in (8) is positive. Hence, $\Phi^{\prime}(t_{0})=0$. Furthermore, $\Phi^{\prime}(t)>0$ for $t>t_{0}$, and $\Phi^{\prime}(t)<0$ for $t<t_{0}$, which entails that $\Phi$ has a unique minimum at $t_{0}$.
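The sign argument for $\Phi^{\prime}$ can be illustrated numerically. The sketch below evaluates $\Phi^{\prime}(t)=-(F(t_{0})-F(t))\,\nu(t)$ on a grid around $t_{0}$ and checks that it is negative to the left of $t_{0}$ and positive to the right. The score function $\psi(s)=\exp(-\sqrt{\max(s,0.5)})$ is an illustrative assumption, and the helper names `dPhi` and `sign_pattern` are hypothetical; this is a check on a finite grid, not a proof.

```python
import math


def F(t):
    """Logistic distribution function, written to avoid overflow."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def psi(s, c=0.5):
    """Assumed positive, bounded score function (illustrative choice only)."""
    return math.exp(-math.sqrt(max(s, c)))


def nu(t):
    """nu(t) as in (8), using 1 - F(t) = F(-t)."""
    p, q = F(t), F(-t)
    return psi(-math.log(p)) * q + psi(-math.log(q)) * p


def dPhi(t, t0):
    """Phi'(t) = Psi(F(t0), t) = -(F(t0) - F(t)) * nu(t)."""
    return -(F(t0) - F(t)) * nu(t)


def sign_pattern(t0, half_width=10.0, n=400):
    """Check Phi' < 0 on grid points left of t0 and Phi' > 0 on grid points right of t0."""
    step = half_width / n
    left = [dPhi(t0 - step * i, t0) for i in range(1, n + 1)]
    right = [dPhi(t0 + step * i, t0) for i in range(1, n + 1)]
    return all(v < 0 for v in left), all(v > 0 for v in right)


print(sign_pattern(0.7))
```

Only the positivity of $\nu$ and the monotonicity of $F$ drive the sign pattern, so any other positive choice of `psi` should give the same qualitative behaviour.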

Hence, for any $R(X)\neq R_{0}(X)$, $\phi(F(R_{0}(X)),R_{0}(X))<\phi(F(R_{0}(X)),R(X))$, that is,

$$\phi(F(\alpha_{0}+\langle X,\beta_{0}\rangle),\alpha_{0}+\langle X,\beta_{0}\rangle)<\phi(F(\alpha_{0}+\langle X,\beta_{0}\rangle),\alpha+\langle X,\beta\rangle)\,, \tag{10}$$

whenever $\langle X,\beta-\beta_{0}\rangle+\alpha-\alpha_{0}\neq 0$, which concludes the proof of (a).

The proof of (b) follows immediately from (10) since (9) holds and $\beta-\beta_{0}\in{\mathcal{H}}^{\star}$. ∎

Lemma A.3 provides an improvement over Lemma A.1 and will be helpful to derive Theorem 3.1. For its proof, we need the following Lemma which corresponds to Lemma 3 in Bianco et al., (2023). We state it here for completeness.

Lemma A.2.

Let $\Lambda(p,p_{0})$ be defined as

$$\Lambda(p,p_{0})=p_{0}\,\rho(-\log p)+(1-p_{0})\,\rho(-\log(1-p))+G(p)+G(1-p) \tag{11}$$

for $(p,p_{0})\in(0,1)\times[0,1]$ and assume that assumption A1 holds.

  (a) If in addition A2 holds, then the function $\Lambda(p,p_{0})$ has a unique minimum at $p=p_{0}$. Furthermore, $\Lambda(p,p_{0})$ can be extended to a continuous function on $[0,1]\times[0,1]$.

  (b) Assume that, in addition, A3 holds. Then there exists a positive constant $C_{0}$ such that for each $0<p<1$, $\Lambda(p,p_{0})-\Lambda(p_{0},p_{0})\geq C_{0}\,(p-p_{0})^{2}$.

The proof of Lemma A.3 is now a direct consequence of the previous result.

Lemma A.3.

Assume that assumptions A1 and A3 hold. Then, there exists a constant $C_{0}>0$ such that $L(\theta)-L(\theta_{0})\geq C_{0}\,\pi_{\mathbb{P}}^{2}(\theta,\theta_{0})$.

Proof.

For any $\theta=(\alpha,\beta)$, denote $p(X)=F(\alpha+\langle X,\beta\rangle)$ and let $p_{0}(X)=F(\alpha_{0}+\langle X,\beta_{0}\rangle)$.

Using that $y|X\sim Bi\left(1,p_{0}(X)\right)$, (3) and (11), and taking conditional expectation, we immediately get that $L(\theta)=\mathbb{E}\left\{w(X)\,\Lambda\left(p(X),p_{0}(X)\right)\right\}$. Lemma A.2(b) implies that

$$\Lambda\left(p(X),p_{0}(X)\right)-\Lambda\left(p_{0}(X),p_{0}(X)\right)\geq C_{0}\,\left\{p(X)-p_{0}(X)\right\}^{2}=C_{0}\left\{F(\alpha+\langle X,\beta\rangle)-F(\alpha_{0}+\langle X,\beta_{0}\rangle)\right\}^{2}\,,$$

which entails

$$L(\theta)-L(\theta_{0})\geq C_{0}\,\mathbb{E}\left[w(X)\left\{F(\alpha+\langle X,\beta\rangle)-F(\alpha_{0}+\langle X,\beta_{0}\rangle)\right\}^{2}\right]=C_{0}\,\pi_{\mathbb{P}}^{2}(\theta,\theta_{0})\,,$$

and concludes the proof. ∎

From now on, for any measure $\mathbb{Q}$ and class of functions ${\mathcal{F}}$, $N(\epsilon,{\mathcal{F}},L_{s}(\mathbb{Q}))$ and $N_{[\;]}(\epsilon,{\mathcal{F}},L_{s}(\mathbb{Q}))$ will denote the covering and bracketing numbers of the class ${\mathcal{F}}$ with respect to the distance in $L_{s}(\mathbb{Q})$, as defined, for instance, in van der Vaart and Wellner, (1996). Furthermore, we will make use of the empirical process $\mathbb{G}_{n}f=\sqrt{n}(P_{n}-P)f$, where $P_{n}$ stands for the empirical probability measure of $(y_{i},X_{i})$, $1\leq i\leq n$, and $P$ is the probability measure of $(y,X)$, which follows the functional logistic regression model (1).

We first derive a Glivenko–Cantelli result for classes of functions depending on $n$, which will be helpful to derive Lemma A.5(b) and which is a slight modification of Theorem 37 in Pollard, (1984); see also Lemma 2.3.3 in van de Geer, (1988).

Lemma A.4.

Let $z_{1},\dots,z_{n}$ be i.i.d. random elements in a metric space ${\mathcal{X}}$ and let ${\mathcal{F}}_{n}=\{f:{\mathcal{X}}\to\mathbb{R}\}$ be a class of bounded functions depending on $n$, that is, for some positive constant $M$, $|f|\leq M$ for any $f\in{\mathcal{F}}_{n}$. Assume that, for any $0<\epsilon<1$, there exists some constant $C>1$ independent of $n$ and $\epsilon$ such that

$$\log N(M\,\epsilon,{\mathcal{F}}_{n},L_{1}(P_{n}))\leq Cq_{n}\log\left(\frac{1}{\epsilon}\right)\;,$$

where $q_{n}$ is a non–random sequence of numbers such that $q_{n}/n\to 0$. Then, we have that

$$\sup_{f\in{\mathcal{F}}_{n}}\left|(P_{n}-P)(f)\right|\buildrel a.s.\over{\longrightarrow}0\,.$$
Proof.

First of all, note that without loss of generality we may assume that $M=1$, in which case

$$\mathrm{Var}\left(P_{n}f\right)=\frac{1}{n}\mathrm{Var}\left(f(z_{1})\right)\leq\frac{1}{n}\mathbb{E}\left(f^{2}(z_{1})\right)\leq\frac{1}{n}\,,$$

implying that for $n\geq 1/(8\epsilon^{2})$, we have that

$$\frac{\mathrm{Var}\left(P_{n}f\right)}{\left(4\,\epsilon\right)^{2}}\leq\frac{1}{2}\,.$$

Thus, using Lemma 8 in Pollard, (1984), we obtain the inequality

$$\mathbb{P}\left(\sup_{f\in{\mathcal{F}}_{n}}\left|(P_{n}-P)(f)\right|>8\,\epsilon\right)\leq 4\,\mathbb{P}\left(\sup_{f\in{\mathcal{F}}_{n}}\left|P_{n}^{0}f\right|>2\,\epsilon\right)$$

where $P_{n}^{0}f=(1/n)\sum_{i=1}^{n}\xi_{i}f(z_{i})$ and $\{\xi_{i}\}_{i=1}^{n}$ is a Rademacher sequence independent of $\{z_{i}\}_{i=1}^{n}$, that is, $\xi_{1},\dots,\xi_{n}$ are i.i.d. with $\mathbb{P}(\xi_{i}=1)=\mathbb{P}(\xi_{i}=-1)=1/2$.

By definition of the covering number, we may choose functions $g_{j}$, $1\leq j\leq N=N(\epsilon,{\mathcal{F}}_{n},L_{1}(P_{n}))$, such that for any $f\in{\mathcal{F}}_{n}$, $\min_{j}P_{n}|f-g_{j}|\leq\epsilon$. Denote by $j(f)$ the index where the minimum is attained. Without loss of generality, we may assume that $g_{j}\in{\mathcal{F}}_{n}$, so that $|g_{j}|\leq 1$. Thus, the approximation argument and Hoeffding's inequality lead to

$$\begin{aligned}
\mathbb{P}\left(\sup_{f\in{\mathcal{F}}_{n}}\left|P_{n}^{0}f\right|>2\,\epsilon\,\Big|\,\{z_{1},\dots,z_{n}\}\right) &\leq\mathbb{P}\left(\sup_{f\in{\mathcal{F}}_{n}}\left\{\left|P_{n}^{0}g_{j(f)}\right|+P_{n}\left|f-g_{j(f)}\right|\right\}>2\,\epsilon\,\Big|\,\{z_{1},\dots,z_{n}\}\right)\\
&\leq\mathbb{P}\left(\max_{1\leq j\leq N}\left|P_{n}^{0}g_{j}\right|>\epsilon\,\Big|\,\{z_{1},\dots,z_{n}\}\right)\\
&\leq N\max_{1\leq j\leq N}\mathbb{P}\left(\left|P_{n}^{0}g_{j}\right|>\epsilon\,\Big|\,\{z_{1},\dots,z_{n}\}\right)\\
&\leq 2N(\epsilon,{\mathcal{F}}_{n},L_{1}(P_{n}))\exp\left\{-\,\frac{n\epsilon^{2}}{2}\right\}\leq 2\exp\left\{-\,\frac{n\epsilon^{2}}{2}+Cq_{n}\log\left(\frac{1}{\epsilon}\right)\right\}\,.
\end{aligned}$$

Hence, using that $q_{n}/n\to 0$, we have that for $n$ large enough

$$C\,\frac{q_{n}}{n}\log\left(\frac{1}{\epsilon}\right)<\frac{\epsilon^{2}}{4}\,,$$

which, combined with the symmetrization inequality above, implies that, for $n$ large enough,

$$\mathbb{P}\left(\sup_{f\in{\mathcal{F}}_{n}}\left|(P_{n}-P)(f)\right|>8\,\epsilon\right)\leq 8\exp\left\{-\,\frac{n\epsilon^{2}}{4}\right\}\,,$$

so $\sum_{n\geq 1}\mathbb{P}\left(\sup_{f\in{\mathcal{F}}_{n}}\left|(P_{n}-P)(f)\right|>8\,\epsilon\right)<\infty$ and the result follows now from the Borel–Cantelli lemma. ∎

To lighten the notation, when there is no ambiguity we will write $k$ instead of $k_{n}$. To derive uniform results, Lemma A.5(a) below provides a bound on the covering number of the class of functions

$${\mathcal{F}}_{n}=\left\{f(y,X)=\phi\left(y,\alpha+\langle X,\beta\rangle\right)\,w(X),\beta\in{\mathcal{M}}_{k},\alpha\in\mathbb{R}\right\}\,. \tag{12}$$

Its proof relies on providing a bound for the Vapnik–Chervonenkis dimension of the set ${\mathcal{F}}_{n}$, which follows from Lemma S.2.2 in Bianco et al., (2022). Besides, Lemma A.5(b) shows that $L_{n}(\theta)$ converges to $L(\theta)$ with probability one, uniformly over $\theta=(\alpha,\beta)\in\mathbb{R}\times{\mathcal{M}}_{k}$. Its proof uses standard empirical process arguments. This uniform law of large numbers will be crucial to obtain consistency results for our proposal as given in Theorems 3.1(a) and 3.2.

Lemma A.5.

Let $\rho$ be a function satisfying A1 and A2, and $w$ a weight function satisfying A5. Let ${\mathcal{F}}_{n}$ be the class of functions given in (12). Then,

  (a) for any measure $\mathbb{Q}$ and $0<\epsilon<1$, there exists some constant $C>1$ independent of $n$ and $\epsilon$, such that

    $$N(M_{\phi}\,\epsilon,{\mathcal{F}}_{n},L_{1}(\mathbb{Q}))\leq Cq_{n}(16\,e)^{q_{n}}\left(\frac{1}{\epsilon}\right)^{q_{n}-1}\;,$$

    where $q_{n}=2k_{n}+6$ and $M_{\phi}=\sup_{y,t}|\phi(y,t)|$.

  (b) If in addition, A8 holds, we have $\sup_{\theta\in\mathbb{R}\times{\mathcal{M}}_{k}}\left|L_{n}(\theta)-L(\theta)\right|\buildrel a.s.\over{\longrightarrow}0$.

  (c) There exists a constant $C^{\star}$ that depends neither on $n$ nor on $k_{n}$ such that

    $$\mathbb{E}\left\{\sup_{f\in{\mathcal{F}}_{n}}\left|(P_{n}-P)(f)\right|\right\}\leq C^{\star}\sqrt{\frac{k_{n}}{n}}\,, \tag{13}$$

    which entails that $\sup_{f\in{\mathcal{F}}_{n}}\left|(P_{n}-P)(f)\right|=O_{\mathbb{P}}\left(\sqrt{{k_{n}}/{n}}\right)$.

Proof.

(a) First of all note that since A1 and A5 hold, we have that $\phi(y,t)$ is bounded and $w(X)$ is a bounded function with $\|w\|_{\infty}=1$, hence ${\mathcal{F}}_{n}$ has envelope $M_{\phi}$.

Lemma S.2.2 from Bianco et al., (2022) implies that the class of functions

$${\mathcal{G}}_{n}=\left\{g(y,\mathbf{x})=\phi\left(y,\alpha+\mathbf{x}^{\mathrm{T}}\mathbf{b}\right),\mathbf{b}\in\mathbb{R}^{k},\alpha\in\mathbb{R}\right\}\,,$$

where $\mathbf{x}=(\langle X,B_{1}\rangle,\dots,\langle X,B_{k}\rangle)^{\mathrm{T}}$, is VC-subgraph with index smaller than or equal to $2(k+1)+4=2k+6$, since we now include an intercept term $\alpha\in\mathbb{R}$.

Note that for any $f\in{\mathcal{F}}_{n}$ we have $f(y,X)=\phi\left(y,\alpha+\langle X,\beta\rangle\right)w(X)=\phi\left(y,\alpha+\mathbf{x}^{\mathrm{T}}\mathbf{b}\right)w(X)=g(y,\mathbf{x})\,w(X)$, where $\beta=\beta_{\mathbf{b}}$ and $g\in{\mathcal{G}}_{n}$. Hence, using the permanence property of VC-classes when multiplying by a fixed function, in this case $w(X)$, we conclude that $V({\mathcal{F}}_{n})\leq 2k+6$ and the result follows now from Theorem 2.6.7 in van der Vaart and Wellner, (1996).

(b) From (a), using that $\log(2k_{n}+6)/(2k_{n}+6)<1$ and assuming that $C>1$ we get that

$$\begin{aligned}
\sup_{\mathbb{Q}}\log\left(N\left(M_{\phi}\,\epsilon,{\mathcal{F}}_{n},L_{1}(\mathbb{Q})\right)\right) &\leq\log C+\log(2k_{n}+6)+(2k_{n}+6)\log(16e)+(2k_{n}+5)\log\left(\frac{1}{\epsilon}\right)\\
&\leq(2k_{n}+6)\left[\log C+1+\log(16e)+\log\left(\frac{1}{\epsilon}\right)\right]\,.
\end{aligned}$$

Hence, for any $\epsilon<\min(e^{-C},(16e)^{-1},1/e)$ we have that

$$\frac{1}{n}\log\left(N\left(M_{\phi}\,\epsilon,{\mathcal{F}}_{n},L_{1}(P_{n})\right)\right)\leq 3\,\frac{2k_{n}+6}{n}\log\left(\frac{1}{\epsilon}\right)\to 0\,,$$

since A8 holds. Therefore, using Lemma A.4, we conclude that

$$\sup_{\theta\in\mathbb{R}\times{\mathcal{M}}_{k}}\left|L_{n}(\theta)-L(\theta)\right|\buildrel a.s.\over{\longrightarrow}0\,,$$

completing the proof of (b).
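The passage from the bracketed bound to the factor $3$ in part (b) rests on the elementary facts $\log C+1\leq C<\log(1/\epsilon)$ and $\log(16e)<\log(1/\epsilon)$ for $\epsilon$ below the stated threshold. The following Python sketch checks the resulting inequality on a grid; the values of $C$ and $k$ are hypothetical, and the function names `lhs`, `rhs` and `holds_on_grid` are ours.

```python
import math


def lhs(C, k, eps):
    """(2k+6) [log C + 1 + log(16e) + log(1/eps)], the bound obtained from part (a)."""
    bracket = math.log(C) + 1 + math.log(16 * math.e) + math.log(1 / eps)
    return (2 * k + 6) * bracket


def rhs(k, eps):
    """3 (2k+6) log(1/eps), the simplified bound used together with Lemma A.4."""
    return 3 * (2 * k + 6) * math.log(1 / eps)


def holds_on_grid(C, ks=(1, 5, 50), n_eps=200):
    """Check lhs <= rhs for eps on a grid strictly below min(e^{-C}, 1/(16e), 1/e)."""
    eps_max = min(math.exp(-C), 1 / (16 * math.e), 1 / math.e)
    eps_grid = [eps_max * i / (n_eps + 1) for i in range(1, n_eps + 1)]
    return all(lhs(C, k, e) <= rhs(k, e) for k in ks for e in eps_grid)


# Hypothetical constants, for illustration only.
print(holds_on_grid(C=1.5), holds_on_grid(C=10.0))
```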

(c) As in (a), using that $V({\mathcal{F}}_{n})\leq q_{n}=2k_{n}+6$ and Theorem 2.6.7 in van der Vaart and Wellner, (1996), we deduce that there exists some constant $C_{0}>1$ independent of $n$ and $\epsilon$, such that

$$N(M_{\phi}\,\epsilon,{\mathcal{F}}_{n},L_{2}(\mathbb{Q}))\leq C_{0}\,q_{n}(16\,e)^{q_{n}}\left(\frac{1}{\epsilon}\right)^{2q_{n}-2}=C_{0}(2k_{n}+6)(16\,e)^{2k_{n}+6}\left(\frac{1}{\epsilon}\right)^{4k_{n}+10}\;, \tag{14}$$

for any measure $\mathbb{Q}$ and $0<\epsilon<1$. Theorem 2.14.1 in van der Vaart and Wellner, (1996) allows us to conclude that, for some universal constant $C_{1}>0$,

$$\mathbb{E}\left\{\sup_{f\in{\mathcal{F}}_{n}}\left|\sqrt{n}(P_{n}-P)(f)\right|\right\}\leq C_{1}\;M_{\phi}\sup_{\mathbb{Q}}\int_{0}^{1}\sqrt{1+\log N\left(M_{\phi}\,\epsilon,{\mathcal{F}}_{n},L_{2}(\mathbb{Q})\right)}\,d\epsilon\,,$$

where the supremum is taken over all discrete probability measures $\mathbb{Q}$. Using (14) and that $\log t\leq t$ for $t\geq 1$ and denoting $C_{2}=\log(C_{0})+1+\log(16\,e)>1$, we get that

$$\begin{aligned}
\sqrt{1+\log N\left(M_{\phi}\,\epsilon,{\mathcal{F}}_{n},L_{2}(\mathbb{Q})\right)} &\leq\sqrt{1+C_{2}(2k_{n}+6)+(4k_{n}+10)\log\left(\frac{1}{\epsilon}\right)}\\
&\leq\sqrt{1+16\,k_{n}C_{2}+16\,k_{n}\log\left(\frac{1}{\epsilon}\right)}\,,
\end{aligned}$$

where we have used that $2k_{n}+6\leq 4k_{n}+10\leq 16k_{n}$. Let $C_{4}=4\,C_{1}\,M_{\phi}\,C_{3}$ with $C_{3}=C_{2}+1$. Then, we obtain that

$$\mathbb{E}\left[\sup_{f\in{\mathcal{F}}_{n}}\left|\sqrt{n}(P_{n}-P)(f)\right|\right]\leq\sqrt{k_{n}}\;C_{4}\,\int_{0}^{1}\sqrt{1+\log\left(\frac{1}{\epsilon}\right)}\,d\epsilon\,,$$

which entails (13), since $\int_{0}^{1}\sqrt{1-\log(\epsilon)}\,d\epsilon<\infty$. Markov's inequality immediately leads to $\sup_{f\in{\mathcal{F}}_{n}}\left|(P_{n}-P)(f)\right|=O_{\mathbb{P}}\left(\sqrt{{k_{n}}/{n}}\right)$, concluding the proof. ∎
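The finiteness of the entropy integral used in the last step can be made explicit. With the change of variable $u=-\log\epsilon$, one gets $\int_{0}^{1}\sqrt{1-\log\epsilon}\,d\epsilon=\int_{0}^{\infty}\sqrt{1+u}\,e^{-u}\,du=1+\frac{\sqrt{\pi}}{2}\,e\,\mathrm{erfc}(1)$, a derivation of ours that is not stated in the paper. The Python sketch below compares this closed form with a plain midpoint rule; the truncation point and the number of nodes are arbitrary choices.

```python
import math


def closed_form():
    """Value of the integral via u = -log(eps): 1 + (sqrt(pi)/2) * e * erfc(1)."""
    return 1.0 + 0.5 * math.sqrt(math.pi) * math.e * math.erfc(1.0)


def midpoint_rule(upper=40.0, n=200_000):
    """Midpoint approximation of the integral of sqrt(1 + u) * exp(-u) over [0, upper]."""
    h = upper / n
    total = 0.0
    for i in range(n):
        u = (i + 0.5) * h
        total += math.sqrt(1.0 + u) * math.exp(-u)
    return h * total


print(closed_form(), midpoint_rule())
```

Both values should agree to several decimal places, which is consistent with the integral being a finite constant that can be absorbed into $C^{\star}$ in (13).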

The following result is needed to prove the convergence rates stated in Theorem 3.3. From now on, $\Theta_{n}^{\star}=\mathbb{R}\times{\mathcal{M}}_{k_{n}}$.

Lemma A.6.

Let $\rho$ be a function satisfying A1 and A2, and $w$ a weight function satisfying A5 and A6b. Given $\widetilde{\beta}_{0}\in{\mathcal{M}}_{k}$, define $\widetilde{\theta}_{0}=(\alpha_{0},\widetilde{\beta}_{0})$ and the class of functions

$${\mathcal{G}}_{n,c,\widetilde{\theta}_{0},\theta_{0}^{*}}=\{f_{\theta}=V_{\theta}-V_{\theta_{0}^{*}}:\theta\in\Theta_{n}^{\star}\quad\mbox{and}\quad|\alpha-\alpha_{0}|+\|\beta-\widetilde{\beta}_{0}\|_{{\mathcal{H}}}\leq c\}\,,$$

where $\theta_{0}^{*}=(\alpha_{0}^{*},\beta_{0}^{*})$ is a fixed element in $\mathbb{R}\times{\mathcal{H}}$, and $V_{\theta}=\phi\left(y,\alpha+\langle X,\beta\rangle\right)w(X)$, for $\theta=(\alpha,\beta)$. Then, there exists some constant $A>0$ independent of $n$, $\widetilde{\theta}_{0}$ and $\epsilon$, such that

$$N_{[\;]}(\epsilon,{\mathcal{G}}_{n,c,\widetilde{\theta}_{0},\theta_{0}^{*}},L_{2}(P))\leq\left(\frac{Ac}{\epsilon}+1\right)^{k+1}\,.$$
Proof.

First note that $|f_{\theta}|\leq 2M_{\phi}$ and denote $\widetilde{\Theta}_{n,c}=\{\theta\in\Theta_{n}^{\star}:|\alpha-\alpha_{0}|+\|\beta-\widetilde{\beta}_{0}\|_{{\mathcal{H}}}\leq c\}$. Note that since $\|\beta-\widetilde{\beta}_{0}\|\leq\|\beta-\widetilde{\beta}_{0}\|_{{\mathcal{H}}}$, $\widetilde{\Theta}_{n,c}\subset{\mathcal{I}}_{c}\times{\mathcal{B}}_{n,c}$, where ${\mathcal{I}}_{c}=[\alpha_{0}-c,\alpha_{0}+c]$ and ${\mathcal{B}}_{n,c}=\{\beta\in{\mathcal{M}}_{k}:\|\beta-\widetilde{\beta}_{0}\|\leq c\}$.

For any $\theta_{\ell}\in\widetilde{\Theta}_{n,c}$, $\ell=1,2$, we have that

$$\begin{aligned}
|f_{\theta_{1}}-f_{\theta_{2}}| &=\left|V_{\theta_{1}}-V_{\theta_{2}}\right|=\left|\phi\left(y,\alpha_{1}+\langle X,\beta_{1}\rangle\right)-\phi\left(y,\alpha_{2}+\langle X,\beta_{2}\rangle\right)\right|w(X)\\
&\leq M_{\Psi}\left|\alpha_{1}-\alpha_{2}+\langle X,\beta_{1}-\beta_{2}\rangle\right|w(X)\\
&\leq M_{\Psi}\left\{\left|\alpha_{1}-\alpha_{2}\right|+w(X)\|X\|\;\|\beta_{1}-\beta_{2}\|\right\}\,,
\end{aligned} \tag{15}$$

where we recall that $M_{\Psi}=\sup_{y\in\{0,1\},t\in\mathbb{R}}|\Psi(y,t)|$. Hence, using that $(a+b)^{2}\leq 2(a^{2}+b^{2})$,

$$\mathbb{E}\left(f_{\theta_{1}}-f_{\theta_{2}}\right)^{2}\leq 2\,M_{\Psi}^{2}\left\{(\alpha_{1}-\alpha_{2})^{2}+\mathbb{E}\left(w^{2}(X)\|X\|^{2}\right)\|\beta_{1}-\beta_{2}\|^{2}\right\}\,.$$

Taking into account that $\widetilde{\beta}_{0}\in{\mathcal{M}}_{k}$, it can be written as $\widetilde{\beta}_{0}=\sum_{j=1}^{k}\widetilde{b}_{0,j}B_{j}$, so ${\mathcal{B}}_{n,c}=\{\beta=\sum_{j=1}^{k}b_{j}B_{j},\mathbf{b}\in\mathbb{R}^{k}:\|\sum_{j=1}^{k}(b_{j}-\widetilde{b}_{0,j})B_{j}\|\leq c\}$. Thus, according to Corollary 2.6 in van de Geer, (2000), taking therein as measure $\mathbb{Q}$ the uniform measure on ${\mathcal{T}}=[0,1]$, we get that ${\mathcal{B}}_{n,c}$ can be covered by

$$N_{{\mathcal{B}}_{n,c},\delta}=\left(\frac{4c+\delta}{\delta}\right)^{k}$$

balls with center $\beta_{j,c,\delta}$, $1\leq j\leq N_{{\mathcal{B}}_{n,c},\delta}$, and radius $\delta$. Besides, the interval ${\mathcal{I}}_{c}$ may also be covered by $N_{{\mathcal{I}}_{c},\delta}=(4c+\delta)/\delta$ balls with center $\alpha_{j,c,\delta}$, $1\leq j\leq N_{{\mathcal{I}}_{c},\delta}$, and radius $\delta$.

Given $\epsilon>0$, take $\delta=\epsilon/\left\{2\,M_{\Psi}\left[1+\left(\mathbb{E}w^{2}(X)\|X\|^{2}\right)^{1/2}\right]\right\}$ and for $1\leq j\leq N_{{\mathcal{I}}_{c},\delta}$ and $1\leq m\leq N_{{\mathcal{B}}_{n,c},\delta}$, define the functions $f_{j,m}=\phi\left(y,\alpha_{j,c,\delta}+\langle X,\beta_{m,c,\delta}\rangle\right)w(X)-V_{\theta_{0}^{*}}$ and

$$\begin{aligned}
f_{j,m}^{(U)}(y,X) &=f_{j,m}+\delta\,M_{\Psi}\left(1+w(X)\|X\|\right)\\
f_{j,m}^{(L)}(y,X) &=f_{j,m}-\delta\,M_{\Psi}\left(1+w(X)\|X\|\right)\,.
\end{aligned}$$

Given $f_{\theta}\in{\mathcal{G}}_{n,c,\widetilde{\theta}_{0},\theta_{0}^{*}}$, let $j$ and $m$ be such that $|\alpha-\alpha_{j,c,\delta}|<\delta$ and $\|\beta-\beta_{m,c,\delta}\|<\delta$, and let $\theta_{j,m}=(\alpha_{j,c,\delta},\beta_{m,c,\delta})$. Then, using (15), we obtain that

$$\left|f_{\theta}-f_{\theta_{j,m}}\right|\leq\delta\,M_{\Psi}\left\{1+w(X)\|X\|\right\}\,,$$

so $f_{j,m}^{(L)}\leq f_{\theta}\leq f_{j,m}^{(U)}$, since $f_{\theta_{j,m}}=f_{j,m}$. Besides,

$$\|f_{j,m}^{(U)}-f_{j,m}^{(L)}\|=2\,\delta\,M_{\Psi}\left\{\mathbb{E}\left(1+w(X)\|X\|\right)^{2}\right\}^{1/2}\leq 2\,\delta\,M_{\Psi}\left\{1+\left(\mathbb{E}w^{2}(X)\|X\|^{2}\right)^{1/2}\right\}=\epsilon\,,$$

which implies that

$$N_{[\;]}(\epsilon,{\mathcal{G}}_{n,c,\widetilde{\theta}_{0},\theta_{0}^{*}},L_{2}(P))\leq N_{{\mathcal{I}}_{c},\delta}\,N_{{\mathcal{B}}_{n,c},\delta}\leq\left(\frac{4c+\delta}{\delta}\right)^{k+1}=\left(\frac{8\,M_{\Psi}\left[1+\left(\mathbb{E}w^{2}(X)\|X\|^{2}\right)^{1/2}\right]c}{\epsilon}+1\right)^{k+1}\,,$$

and the result follows taking $A=8\,M_{\Psi}\left[1+\left(\mathbb{E}w^{2}(X)\|X\|^{2}\right)^{1/2}\right]$. ∎
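The arithmetic behind the final identity, namely that the choice of $\delta$ turns $((4c+\delta)/\delta)^{k+1}$ into $(Ac/\epsilon+1)^{k+1}$, can be checked with a few lines of Python. All numerical inputs below are hypothetical, and `m2` is our shorthand for $\mathbb{E}\,w^{2}(X)\|X\|^{2}$, which the code takes as given rather than estimates.

```python
import math


def delta_for(eps, M_psi, m2):
    """delta = eps / (2 M_Psi [1 + sqrt(m2)]), with m2 standing for E[w(X)^2 ||X||^2]."""
    return eps / (2.0 * M_psi * (1.0 + math.sqrt(m2)))


def bracket_bound(eps, c, k, M_psi, m2):
    """Upper bound (A c / eps + 1)^(k+1) with A = 8 M_Psi [1 + sqrt(m2)]."""
    A = 8.0 * M_psi * (1.0 + math.sqrt(m2))
    return (A * c / eps + 1.0) ** (k + 1)


def product_of_covers(eps, c, k, M_psi, m2):
    """((4c + delta) / delta)^(k+1): interval cover times the cover of B_{n,c}."""
    d = delta_for(eps, M_psi, m2)
    return ((4.0 * c + d) / d) ** (k + 1)


# Hypothetical values of eps, c, k, M_Psi and m2, for illustration only.
args = (0.1, 1.0, 5, 0.5, 2.0)
print(bracket_bound(*args), product_of_covers(*args))
```

The logarithm of the bound grows linearly in $k+1$, which is the feature of the bracketing entropy that matters for the rates.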

A.2 Proof of Theorems 3.1 and 3.2

Recall that $\widehat{\theta}=(\widehat{\alpha},\widehat{\beta})$ and $\theta_{0}=(\alpha_{0},\beta_{0})$. Henceforth, $\widetilde{\theta}$ stands for $\widetilde{\theta}=(\alpha_{0},\widetilde{\beta})$ with $\widetilde{\beta}=\widetilde{\beta}_{k}$ defined in assumptions A9 and A10.

The following Lemma is useful to derive Theorems 3.1 and 3.2.

Lemma A.7.

Let $\rho$ be a function satisfying A1 and A2, and $w$ a weight function satisfying A5. Assume that A8 and A9 hold. Then, we have that $L(\widehat{\alpha},\beta_{\widehat{\mathbf{b}}})=L(\widehat{\theta})\buildrel a.s.\over{\longrightarrow}L(\theta_{0})$.

Proof.

From A8 and A9, we have that there exists $\widetilde{\beta}_{k_{n}}\in{\mathcal{M}}_{k_{n}}$, $\widetilde{\beta}_{k_{n}}=\sum_{j=1}^{k_{n}}\widetilde{b}_{j}B_{j}(x)$, such that $\|\widetilde{\beta}_{k_{n}}-\beta_{0}\|_{{\mathcal{H}}}\to 0$ as $n\to\infty$. As mentioned above, we denote $\widetilde{\theta}=(\alpha_{0},\widetilde{\beta}_{k_{n}})$. Using that $\widetilde{\beta}_{k_{n}}\in{\mathcal{M}}_{k_{n}}$, we conclude that $L_{n}(\widehat{\theta})\leq L_{n}(\widetilde{\theta})$, while from Lemma A.1(a) we have that $0\leq L(\widehat{\theta})-L(\theta_{0})$. Thus,

$$\begin{aligned}
0\leq L(\widehat{\theta})-L(\theta_{0}) &=L(\widehat{\theta})-L(\widetilde{\theta})+L(\widetilde{\theta})-L(\theta_{0})\\
&=\left\{L_{n}(\widetilde{\theta})-L(\widetilde{\theta})\right\}-\left\{L_{n}(\widehat{\theta})-L(\widehat{\theta})\right\}+\left\{L(\widetilde{\theta})-L(\theta_{0})\right\}+L_{n}(\widehat{\theta})-L_{n}(\widetilde{\theta})\\
&\leq\left\{L_{n}(\widetilde{\theta})-L(\widetilde{\theta})\right\}-\left\{L_{n}(\widehat{\theta})-L(\widehat{\theta})\right\}+\left\{L(\widetilde{\theta})-L(\theta_{0})\right\}\\
&\leq 2\,\sup_{f\in{\mathcal{F}}_{n}}\left|(P_{n}-P)f\right|+\left\{L(\widetilde{\theta})-L(\theta_{0})\right\}=2\,A_{n}+B_{n}\,,
\end{aligned} \tag{16}$$

where ${\mathcal{F}}_{n}$ is defined in (12).

From Lemma A.5(b) we obtain that $A_{n}\buildrel a.s.\over{\longrightarrow}0$. On the other hand, from $\|\widetilde{\beta}_{k_{n}}-\beta_{0}\|_{{\mathcal{H}}}\to 0$ as $n\to\infty$, the fact that for any $f\in{\mathcal{H}}$ we have the bound $\|f\|\leq\|f\|_{{\mathcal{H}}}$, and the Cauchy–Schwarz inequality, we get that for any $v\in L^{2}([0,1])$, $\alpha_{0}+\langle v,\widetilde{\beta}_{k_{n}}\rangle\to\alpha_{0}+\langle v,\beta_{0}\rangle$. Thus, from the Bounded Convergence Theorem, the continuity of $\phi\left(y,t\right)$ with respect to $t$ and its boundedness together with the boundedness of $w$, we conclude that $B_{n}=L(\widetilde{\theta})-L(\theta_{0})\to 0$. Summarizing, we have that

$$0\leq L(\widehat{\theta})-L(\theta_{0})\leq 2\,A_{n}+B_{n}\buildrel a.s.\over{\longrightarrow}0\,,$$

which concludes the proof. ∎

Proof of Theorem 3.1.

(a) From Lemma A.3, we have that there exists a constant $C_{0}>0$ independent of $n$ such that

$$L(\widehat{\theta})-L(\theta_{0})\geq C_{0}\,\pi_{\mathbb{P}}^{2}(\widehat{\theta},\theta_{0})\,.$$

Then, the result follows from Lemma A.7, which implies that $L(\widehat{\theta})-L(\theta_{0})\buildrel a.s.\over{\longrightarrow}0$.

(b) Recall that from (16) and Lemma A.3 we obtain that, for some constant $C_{0}$ independent of $n$,

$$C_{0}\,\pi_{\mathbb{P}}^{2}(\widehat{\theta},\theta_{0})\leq L(\widehat{\theta})-L(\theta_{0})\leq 2\,\sup_{f\in{\mathcal{F}}_{n}}\left|(P_{n}-P)f\right|+\left\{L(\widetilde{\theta})-L(\theta_{0})\right\}\,.$$

Lemma A.5(c) entails that $\sup_{f\in{\mathcal{F}}_{n}}\left|(P_{n}-P)f\right|=O_{\mathbb{P}}\left(\sqrt{{k_{n}}/{n}}\right)$, so, using again that $B_{n}=L(\widetilde{\theta})-L(\theta_{0})\geq 0$, the proof will be concluded if we show that

$$L(\widetilde{\theta})-L(\theta_{0})=O(\|\widetilde{\beta}_{k_{n}}-\beta_{0}\|_{{\mathcal{H}}})\,. \tag{17}$$

To prove (17), recall that $\Psi(y,t)={\partial}\phi(y,t)/{\partial t}=-[y-F(t)]\,\nu(t)$, where $\nu(t)$ is given by (8), and that $M_{\Psi}=\sup_{y\in\{0,1\},t\in\mathbb{R}}|\Psi(y,t)|$ is finite. Thus,

$$\begin{aligned}
0 &\leq L(\widetilde{\theta})-L(\theta_{0})=\mathbb{E}\left\{w(X)\,\left[\phi(y,\alpha_{0}+\langle X,\widetilde{\beta}\rangle)-\phi(y,\alpha_{0}+\langle X,\beta_{0}\rangle)\right]\right\}\\
&\leq M_{\Psi}\,\mathbb{E}\left\{w(X)\,\left|\langle X,\widetilde{\beta}-\beta_{0}\rangle\right|\right\}\leq M_{\Psi}\,\|\widetilde{\beta}-\beta_{0}\|\;\mathbb{E}\left\{w(X)\,\|X\|\right\}\,.
\end{aligned}$$

Then, (17) follows now from the fact that $\|f\|\leq\|f\|_{{\mathcal{H}}}$.

(c) To derive (c) it will be enough to show that L(θ~)L(θ0)=O(β~knβ02)L(\widetilde{\theta})-L(\theta_{0})=O\left(\|\widetilde{\beta}_{k_{n}}-\beta_{0}\|_{{\mathcal{H}}}^{2}\right).

Using that ψ\psi is continuously differentiable with bounded derivative, we obtain that the derivative of the function ν(t)\nu(t) defined in (8) equals

ν(t)\displaystyle\nu^{\,\prime}(t) =ψ(logF(t))[1F(t)]2ψ(logF(t))F(t)[1F(t)]\displaystyle=-\psi^{\prime}\left(-\log F(t)\right)\left[1-F(t)\right]^{2}-\psi\left(-\log F(t)\right)F(t)\left[1-F(t)\right]
+ψ(log[1F(t)])F(t)[1F(t)]+ψ(log[1F(t)])F2(t)\displaystyle+\psi\left(-\log\left[1-F(t)\right]\right)F(t)\left[1-F(t)\right]+\psi^{\prime}\left(-\log\left[1-F(t)\right]\right)F^{2}(t)

and is bounded. Hence, 2ϕ(y,t)/t2=F(t)[1F(t)]ν(t)[yF(t)]ν(t){\partial^{2}}\phi(y,t)/{\partial t^{2}}=F(t)\left[1-F(t)\right]\nu(t)-[y-F(t)]\nu^{\,\prime}(t) is also bounded. Denote A=supy{0,1},t|2ϕ(y,t)/t2|A=\sup_{y\in\{0,1\},t\in\mathbb{R}}|{\partial^{2}}\phi(y,t)/{\partial t^{2}}|.

To avoid burden notation, denote R0(X)=α0+X,β0R_{0}(X)=\alpha_{0}+\langle X,\beta_{0}\rangle. Define the function g:g:\mathbb{R}\to\mathbb{R} as g(t)=L(tθ~+(1t)θ0)g(t)=L(t\widetilde{\theta}+(1-t)\theta_{0}). Then,

g(t)=𝔼{w(X)ϕ(F(R0(X)),R0(X)+tX,β~β0)}.g(t)=\mathbb{E}\left\{w(X)\phi\left(F(R_{0}(X)),R_{0}(X)+t\langle X,\widetilde{\beta}-\beta_{0}\rangle\right)\right\}\,.

Note that g(0)=L(θ0)g(0)=L(\theta_{0}), g(1)=L(θ~)g(1)=L(\widetilde{\theta}) and g(t)g(0)g(t)\geq g(0) for all tt.

Recall that Ψ(y,t)=ϕ(y,t)/t=[yF(t)]ν(t)\Psi(y,t)={\partial}\phi(y,t)/{\partial t}=-[y-F(t)]\nu(t), then Ψ(F(r0),r0)=0\Psi(F(r_{0}),r_{0})=0, for any r0r_{0}. Therefore, we have that

g(1)g(0)\displaystyle g(1)-g(0) =𝔼{w(X)Ψ(F(R0(X)),R0(X))X,β~β0}+𝔼{w(X)2ϕ(F(R0(X)),t)t2|t=ξϱX,β~β02}\displaystyle=\mathbb{E}\left\{w(X)\;\Psi\left(F(R_{0}(X)),R_{0}(X)\right)\langle X,\widetilde{\beta}-\beta_{0}\rangle\right\}+\mathbb{E}\left\{w(X)\;\frac{\partial^{2}\phi(F(R_{0}(X)),t)}{\partial t^{2}}\Big{|}_{t=\xi_{\varrho}}\langle X,\widetilde{\beta}-\beta_{0}\rangle^{2}\right\}
=𝔼{w(X)2ϕ(F(R0(X)),t)t2|t=ξϱX,β~β02}\displaystyle=\mathbb{E}\left\{w(X)\;\frac{\partial^{2}\phi(F(R_{0}(X)),t)}{\partial t^{2}}\Big{|}_{t=\xi_{\varrho}}\langle X,\widetilde{\beta}-\beta_{0}\rangle^{2}\right\}

where ξϱ=α0+X,β0+ϱX,β~β0\xi_{\varrho}=\alpha_{0}+\langle X,\beta_{0}\rangle+\varrho\langle X,\widetilde{\beta}-\beta_{0}\rangle for some ϱ=ϱ(X)(0,1)\varrho=\varrho(X)\in(0,1).

Therefore, using that 2ϕ(y,t)/t2{\partial^{2}}\phi(y,t)/{\partial t^{2}} is bounded, we obtain

L(θ~)L(θ0)=g(1)g(0)A𝔼{w(X)X,β~β02}A𝔼{w(X)X2}β~β02,L(\widetilde{\theta})-L(\theta_{0})=g(1)-g(0)\leq A\,\mathbb{E}\left\{w(X)\langle X,\widetilde{\beta}-\beta_{0}\rangle^{2}\right\}\leq A\,\mathbb{E}\left\{w(X)\|X\|^{2}\right\}\|\widetilde{\beta}-\beta_{0}\|_{{\mathcal{H}}}^{2}\,, (18)

where the last inequality follows from the Cauchy–Schwarz inequality and the fact that, for any $f\in{\mathcal{H}}$, we have the bound $\|f\|\leq\|f\|_{{\mathcal{H}}}$, concluding the proof. ∎
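The last inequality in (18) can be spelled out pointwise in $X$. The following is a short sketch, assuming (as is usual for a weight function, though not restated here) that $w\geq 0$.

```latex
\begin{align*}
\langle X,\widetilde{\beta}-\beta_{0}\rangle^{2}
  &\leq \|X\|^{2}\,\|\widetilde{\beta}-\beta_{0}\|^{2}
  && \text{(Cauchy--Schwarz in } L^{2}([0,1])\text{)}\\
  &\leq \|X\|^{2}\,\|\widetilde{\beta}-\beta_{0}\|_{\mathcal{H}}^{2}
  && \text{(since } \|f\|\leq\|f\|_{\mathcal{H}}\text{)}.
\end{align*}
% Multiplying both sides by w(X) >= 0 (assumed) and taking expectations
% yields the second bound in (18).
```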

Lemma A.8 is an intermediate step to derive Theorem 3.2.

Lemma A.8.

Let $\rho$ be a function satisfying A1 and A2, and $w$ a weight function satisfying A5. Assume that A7 to A9 and A11 hold. Then, there exists $M>0$ such that $\mathbb{P}\left(\cup_{m\in\mathbb{N}}\cap_{n\geq m}\left\{|\widehat{\alpha}|+\|\widehat{\beta}\|_{{\mathcal{W}}^{1,{\mathcal{H}}}}\leq M\right\}\right)=1$.

Proof.

From now on, 1{\mathcal{B}}_{1} denotes the unit ball in 𝒲1,{\mathcal{W}}^{1,{\mathcal{H}}} and ={(α,β)×𝒲1,:|α|+β𝒲1,=1}[1,1]×1{\mathcal{B}}=\{(\alpha,\beta)\in\mathbb{R}\times{\mathcal{W}}^{1,{\mathcal{H}}}:|\alpha|+\|\beta\|_{{\mathcal{W}}^{1,{\mathcal{H}}}}=1\}\subset\ [-1,1]\times{\mathcal{B}}_{1}. The Rellich–Kondrachov theorem entails that 𝒲1,{\mathcal{W}}^{1,{\mathcal{H}}} is compactly embedded in {\mathcal{H}}, hence {\mathcal{B}} is compact in ×\mathbb{R}\times{\mathcal{H}}.

To simplify the notation, let θ=(α,β)\theta=(\alpha,\beta) and (X,θ)=α+X,β\ell(X,\theta)=\alpha+\langle X,\beta\rangle. Furthermore, given θj×\theta_{j}\in\mathbb{R}\times{\mathcal{H}}, for j=1,2j=1,2, denote d(θ1,θ2)=|α1α2|+β1β2d(\theta_{1},\theta_{2})=|\alpha_{1}-\alpha_{2}|+\|\beta_{1}-\beta_{2}\|_{{\mathcal{H}}}.

Step 1. We begin by proving that, given $0<\delta\leq 1/4$, there exist $K>0$ and positive numbers $\varphi_{1},\dots,\varphi_{s}$ such that, for every $\theta\in{\mathcal{B}}$, there exists $j\in\{1,\dots,s\}$ such that

(|(X,θ)|>φj2)>(|(X,θ)|>φj2XK)>12δ.\mathbb{P}\left(|\ell(X,\theta)|>\frac{\varphi_{j}}{2}\right)>\mathbb{P}\left(|\ell(X,\theta)|>\frac{\varphi_{j}}{2}\cap\|X\|\leq K\right)>1-2\delta\,. (19)

To derive (19), first given δ>0\delta>0, define KδK_{\delta} such that, for any KKδK\geq K_{\delta}

(XK)<δ.\mathbb{P}(\|X\|\geq K)<\delta\,. (20)

Fix now θ\theta\in{\mathcal{B}}. Using that β\beta\in{\mathcal{H}} and A11, there exists φθ>0\varphi_{\theta}>0 a continuity point of the distribution of |(X,θ)||\ell(X,\theta)| such that

(|(X,θ)|<φθ)<δ.\mathbb{P}\left(|\ell(X,\theta)|<\varphi_{\theta}\right)<\delta\;. (21)

Note that from (20) and (21), we conclude that A(θ)=(|(X,θ)|φθ)(XK)>12δA(\theta)=\mathbb{P}\left(|\ell(X,\theta)|\geq\varphi_{\theta}\right)-\mathbb{P}\left(\|X\|\geq K\right)>1-2\delta and B(θ)=(|(X,θ)|φθXK)>12δB(\theta)=\mathbb{P}\left(|\ell(X,\theta)|\geq\varphi_{\theta}\cap\|X\|\leq K\right)>1-2\delta .

Let $\tau_{\theta}={\varphi_{\theta}}/{(2(K+1))}$. Then, given $\theta^{*}=(\alpha^{*},\beta^{*})\in\mathbb{R}\times{\mathcal{H}}$ such that $d(\theta^{*},\theta)<\tau_{\theta}$, and using that $\|f\|\leq\|f\|_{{\mathcal{H}}}$ and the Cauchy–Schwarz inequality, we obtain

|(X,θ)||(X,θ)|+|(X,θθ)||(X,θ)|+τθ(X+1).|\ell(X,\theta)|\leq|\ell(X,\theta^{*})|+|\ell(X,\theta-\theta^{*})|\leq|\ell(X,\theta^{*})|+\tau_{\theta}\left(\|X\|+1\right)\,.

Hence, we get

(|(X,θ)|φθ2XK)\displaystyle\mathbb{P}\left(|\ell(X,\theta^{*})|\geq\frac{\varphi_{\theta}}{2}\cap\|X\|\leq K\right) B(θ)>12δ.\displaystyle\geq B(\theta)>1-2\delta\,. (22)

Consider the covering of {\mathcal{B}} given by the open balls (θ,τθ)={(a,f)×:d((a,f),θ)<τθ}{\mathcal{B}}(\theta,\tau_{\theta})=\{(a,f)\in\mathbb{R}\times{\mathcal{H}}:d((a,f),\theta)<\tau_{\theta}\}, θ\theta\in{\mathcal{B}}. Using that {\mathcal{B}} is compact in ×\mathbb{R}\times{\mathcal{H}}, we get that there exist θ1,θs\theta_{1},\dots\theta_{s} such that θj\theta_{j}\in{\mathcal{B}} and j=1s(θj,τθj){\mathcal{B}}\subset\cup_{j=1}^{s}{\mathcal{B}}(\theta_{j},\tau_{\theta_{j}}). Therefore, from (22), we conclude that

min1jsinfd(θ,θj)<τθj(|(X,θ)|>φθj2XK)>12δ,\min_{1\leq j\leq s}\inf_{d(\theta,\theta_{j})<\tau_{\theta_{j}}}\mathbb{P}\left(|\ell(X,\theta)|>\frac{\varphi_{\theta_{j}}}{2}\cap\|X\|\leq K\right)>1-2\delta\,, (23)

meaning that, for every $\theta\in{\mathcal{B}}$, there exists $j\in\{1,\dots,s\}$ such that

(|(X,θ)|>φθj2XK)>12δ,\mathbb{P}\left(|\ell(X,\theta)|>\frac{\varphi_{\theta_{j}}}{2}\cap\|X\|\leq K\right)>1-2\delta\;,

concluding the proof of Step 1.
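The passage from the triangle-inequality bound to (22) rests on an event inclusion that the text leaves implicit. A brief sketch, using only the quantities $\tau_{\theta}$, $\varphi_{\theta}$, $K$, $A(\theta)$ and $B(\theta)$ defined above, is the following.

```latex
% On the event {||X|| <= K}, for any theta^* with d(theta^*, theta) < tau_theta:
\begin{align*}
|\ell(X,\theta^{*})|
  &\geq |\ell(X,\theta)| - \tau_{\theta}\,(\|X\|+1)
   \geq |\ell(X,\theta)| - \tau_{\theta}\,(K+1)
   = |\ell(X,\theta)| - \frac{\varphi_{\theta}}{2}\,.
\end{align*}
% Hence
\begin{align*}
\left\{|\ell(X,\theta)|\geq\varphi_{\theta}\,,\ \|X\|\leq K\right\}
  \subset
\left\{|\ell(X,\theta^{*})|\geq\frac{\varphi_{\theta}}{2}\,,\ \|X\|\leq K\right\},
\end{align*}
% so the probability of the right-hand event is at least B(theta). Moreover,
% P(E \cap F) >= P(E) - P(F^c) with E = {|l(X,theta)| >= varphi_theta},
% F = {||X|| <= K}, together with (20) and (21), gives
\begin{align*}
B(\theta)\;\geq\; A(\theta)\;>\;(1-\delta)-\delta \;=\; 1-2\delta\,.
\end{align*}
```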

Step 2. We will show that for any θ=(α,β)\theta=(\alpha,\beta)\in{\mathcal{B}}

𝔼(limaϕ(y,a(X,θ))w(X))>L(θ0).\mathbb{E}\left(\lim_{a\to\infty}\phi(y,a\;\ell(X,\theta))w(X)\right)>L(\theta_{0})\,. (24)

Consider a sequence $a_{m}\to+\infty$; it is enough to show that $D>L(\theta_{0})$, where $D=\mathbb{E}\left\{\lim_{m\to\infty}\phi\left(y,a_{m}\;\ell(X,\theta)\right)w(X)\right\}$. Using that $\phi$ is a bounded function and the bounded convergence theorem, we get that $D=\lim_{m\to\infty}D_{m}$, where $D_{m}=\mathbb{E}\left\{\phi\left(y,a_{m}\;\ell(X,\theta)\right)w(X)\right\}=L(a_{m}\,\theta)$.

Note that as in the proof of Lemma A.1, we have that

Dm\displaystyle D_{m} =𝔼{w(X)ϕ(F((X,θ0)),am(X,θ))}=𝔼{w(X)ϕ(F(R0(X)),Rm(X))},\displaystyle=\mathbb{E}\left\{w(X)\,\phi\left(F(\ell(X,\theta_{0})),a_{m}\;\ell(X,\theta)\right)\right\}=\mathbb{E}\left\{w(X)\phi(F(R_{0}(X)),R_{m}(X))\right\}\,, (25)

where, as in the proof of Theorem 3.1, we denote Rm(X)=am(X,θ)R_{m}(X)=a_{m}\;\ell(X,\theta) and R0(X)=(X,θ0)R_{0}(X)=\ell(X,\theta_{0}).

In the proof of Lemma A.1, we have shown that, for any fixed r0r_{0}, the function Φ(t)=ϕ(F(r0),t)\Phi(t)=\phi(F(r_{0}),t) reaches its unique minimum when t=r0t=r_{0} and Φ(t)>0\Phi^{\prime}(t)>0, for t>r0t>r_{0}, and that Φ(t)<0\Phi^{\prime}(t)<0 for t<r0t<r_{0}. Then, Φ\Phi is strictly increasing on (r0,+)(r_{0},+\infty) and strictly decreasing on (,r0)(-\infty,r_{0}), so limt+Φ(t)>Φ(r0)\lim_{t\to+\infty}\Phi(t)>\Phi(r_{0}) and similarly limtΦ(t)>Φ(r0)\lim_{t\to-\infty}\Phi(t)>\Phi(r_{0}) .

Using A11 and that β𝒲1,\beta\in{\mathcal{W}}^{1,{\mathcal{H}}}, we have that with probability one

limmRm(X)=limmam(X,θ)={+ when (X,θ)>0 when (X,θ)<0.\lim_{m\to\infty}R_{m}(X)=\lim_{m\to\infty}a_{m}\;\ell(X,\theta)=\left\{\begin{array}[]{rl}+\infty&\mbox{ when }\ell(X,\theta)>0\\ -\infty&\mbox{ when }\ell(X,\theta)<0\end{array}\right.\,. (26)

Fix ωΩ\omega\in\Omega, such that X0=X(ω)X_{0}=X(\omega) satisfies (26) and F(R0(X0))F(R_{0}(X_{0})) is not 0 or 11. Let r0=R0(X0)r_{0}=R_{0}(X_{0}) and take as above Φ(t)=ϕ(F(r0),t)\Phi(t)=\phi(F(r_{0}),t). Then,

$$\begin{aligned}
\lim_{m\to\infty}\phi(F(R_{0}(X_{0})),R_{m}(X_{0}))=\lim_{m\to\infty}\Phi(R_{m}(X_{0})) &=\left\{\begin{array}{rl}\lim_{r\to+\infty}\Phi(r)&\mbox{ when }\ell(X_{0},\theta)>0\\ \lim_{r\to-\infty}\Phi(r)&\mbox{ when }\ell(X_{0},\theta)<0\end{array}\right.\\
&>\Phi(r_{0})=\Phi\left(R_{0}(X_{0})\right)=\phi\left(F(R_{0}(X_{0})),R_{0}(X_{0})\right)\,.
\end{aligned}$$

Thus, using (25), A11 and A5, together with the boundedness of $\phi$ and the bounded convergence theorem, we obtain that

$$D=\lim_{m\to\infty}D_{m}=\mathbb{E}\left\{w(X)\,\lim_{m\to\infty}\phi(F(R_{0}(X)),R_{m}(X))\right\}>\mathbb{E}\left\{w(X)\,\phi(F(R_{0}(X)),R_{0}(X))\right\}=L(\theta_{0})\,,$$

which concludes the proof of (24).

Step 3. Let us show that there exists η>0\eta>0 such that for any θ=(α,β)\theta=(\alpha,\beta)\in{\mathcal{B}}, we have

limainf|αα|+ββ𝒲1,<ηL(aθ)>L(θ0),\lim_{a\to\infty}\inf_{|\alpha^{*}-\alpha|+\|\beta^{*}-\beta\|_{{\mathcal{W}}^{1,{\mathcal{H}}}}<\eta}L(a\,\theta^{*})>L(\theta_{0})\,, (27)

where we have denoted θ=(α,β)\theta^{*}=(\alpha^{*},\beta^{*}). The proof is an extension to the functional setting of that of Lemma 6.3 in Bianco and Yohai, (1996).

Take δ<(DL(θ0))/(2Mϕ)\delta<(D-L(\theta_{0}))/(2\,M_{\phi}) where D=𝔼(limaϕ(y,a(X,θ))w(X))D=\mathbb{E}\left(\lim_{a\to\infty}\phi(y,a\;\ell(X,\theta))w(X)\right), the quantity DL(θ0)D-L(\theta_{0}) is positive from (24) and Mϕ=supy{0,1},tϕ(y,t)M_{\phi}=\sup_{y\in\{0,1\},t\in\mathbb{R}}\phi(y,t).

From Step 1 we have that, for any $\theta\in{\mathcal{B}}$, there exists $j\in\{1,\dots,s\}$ such that (19) holds. Let $j_{0}$ be the index corresponding to the chosen $\theta\in{\mathcal{B}}$ and define the set ${\mathcal{E}}=\{X:|\ell(X,\theta)|>\varphi_{j_{0}}/{2}\mbox{ and }\|X\|\leq K\}$. Then, from (19), we get that

(X)<2δ<DL(θ0)Mϕ.\mathbb{P}(X\notin{\mathcal{E}})<2\delta<\frac{D-L(\theta_{0})}{M_{\phi}}\,. (28)

Take $X\in{\mathcal{E}}$ and define $\eta=\min_{1\leq j\leq s}\varphi_{j}/(8(K+1))$. Then, using again that $\|f\|\leq\|f\|_{{\mathcal{W}}^{1,{\mathcal{H}}}}$ and that the set $\{\theta^{*}=(\alpha^{*},\beta^{*}):|\alpha^{*}-\alpha|+\|\beta^{*}-\beta\|_{{\mathcal{W}}^{1,{\mathcal{H}}}}<\eta\}$ is contained in $\{\theta^{*}:|\alpha^{*}-\alpha|+\|\beta^{*}-\beta\|_{{\mathcal{W}}^{1,{\mathcal{H}}}}\leq\varphi_{{j_{0}}}/(4(K+1))\}$, which is compact in $\mathbb{R}\times{\mathcal{H}}$, it is easy to see that, for any $\theta^{*}=(\alpha^{*},\beta^{*})$ such that $|\alpha^{*}-\alpha|+\|\beta^{*}-\beta\|_{{\mathcal{W}}^{1,{\mathcal{H}}}}\leq\eta\leq\varphi_{{j_{0}}}/(4(K+1))$, we have that

{sign((X,θ))=sign((X,θ)),|(X,θ)|φj04.\left\{\begin{array}[]{c }\mbox{sign}\left(\ell(X,\theta^{*})\right)=\mbox{sign}\left(\ell(X,\theta)\right)\,,\\ |\ell(X,\theta^{*})|\geq\displaystyle\frac{\varphi_{j_{0}}}{4}\,.\end{array}\right. (29)
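The claim (29), described as easy to see, can be checked directly. The sketch below uses only that $X\in{\mathcal{E}}$ (so $\|X\|\leq K$ and $|\ell(X,\theta)|>\varphi_{j_{0}}/2$), the Cauchy–Schwarz inequality and the comparison $\|f\|\leq\|f\|_{{\mathcal{W}}^{1,{\mathcal{H}}}}$.

```latex
% For X in E and |alpha^* - alpha| + ||beta^* - beta||_{W^{1,H}} <= varphi_{j_0}/(4(K+1)):
\begin{align*}
|\ell(X,\theta^{*})-\ell(X,\theta)|
  &= \left|\alpha^{*}-\alpha+\langle X,\beta^{*}-\beta\rangle\right|
   \leq |\alpha^{*}-\alpha| + \|X\|\,\|\beta^{*}-\beta\| \\
  &\leq (K+1)\left(|\alpha^{*}-\alpha|+\|\beta^{*}-\beta\|_{\mathcal{W}^{1,\mathcal{H}}}\right)
   \leq \frac{\varphi_{j_{0}}}{4}
   < \frac{|\ell(X,\theta)|}{2}\,.
\end{align*}
% The perturbation is strictly smaller than |l(X,theta)|, so the sign cannot change, and
\begin{align*}
|\ell(X,\theta^{*})|
  \;\geq\; |\ell(X,\theta)| - \frac{\varphi_{j_{0}}}{4}
  \;>\; \frac{\varphi_{j_{0}}}{2}-\frac{\varphi_{j_{0}}}{4}
  \;=\; \frac{\varphi_{j_{0}}}{4}\,.
\end{align*}
```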

Using that the set ${\mathcal{V}}_{\eta}=\{\theta^{*}\in\mathbb{R}\times{\mathcal{W}}^{1,{\mathcal{H}}}:|\alpha^{*}-\alpha|+\|\beta^{*}-\beta\|_{{\mathcal{W}}^{1,{\mathcal{H}}}}\leq\eta\}$ is compact in $\mathbb{R}\times{\mathcal{H}}$, we conclude that there exist a sequence $\{\theta_{m}^{*}\}_{m\in\mathbb{N}}\subset{\mathcal{V}}_{\eta}$ (depending on $a$ and $X$) and a value $\widetilde{\theta}^{*}=\widetilde{\theta}^{*}_{a,X}\in{\mathcal{V}}_{\eta}$, which also depends on $a$ and $X$, such that $d(\theta_{m}^{*},\widetilde{\theta}^{*})\to 0$ and $\lim_{m\to\infty}\phi(y,a\;\ell(X,\theta_{m}^{*}))=\inf_{\theta^{*}\in{\mathcal{V}}_{\eta}}\phi(y,a\;\ell(X,\theta^{*}))$. The continuity of $\phi$, together with the fact that $d(\theta_{m}^{*},\widetilde{\theta}^{*})\to 0$ and the Cauchy–Schwarz inequality, leads to $\phi(y,a\,\ell(X,\widetilde{\theta}^{*}))=\inf_{\theta^{*}\in{\mathcal{V}}_{\eta}}\phi(y,a\,\ell(X,\theta^{*}))$. Then, using (29), we conclude that

lim infainfθ𝒱ηϕ(y,a(X,θ))=limaϕ(y,a(X,θ~a,X)).\liminf_{a\to\infty}\inf_{\theta^{*}\in{\mathcal{V}}_{\eta}}\phi\left(y,a\;\ell(X,\theta^{*})\right)=\lim_{a\to\infty}\phi\left(y,a\;\ell\left(X,\widetilde{\theta}^{*}_{a,X}\right)\right)\,.

Therefore, using Fatou’s Lemma we obtain that

lim infainf|αα|+ββ𝒲1,<ηL(aθ)\displaystyle\liminf_{a\to\infty}\inf_{|\alpha^{*}-\alpha|+\|\beta^{*}-\beta\|_{{\mathcal{W}}^{1,{\mathcal{H}}}}<\eta}L(a\,\theta^{*}) 𝔼{lim infainfθ𝒱ηϕ(y,a(X,θ))w(X)}\displaystyle\geq\mathbb{E}\left\{\liminf_{a\to\infty}\inf_{\theta^{*}\in{\mathcal{V}}_{\eta}}\phi(y,a\;\ell(X,\theta^{*}))w(X)\right\}
𝔼{𝕀(X)w(X)limainfθ𝒱ηϕ(y,a(X,θ))}\displaystyle\geq\mathbb{E}\left\{\mathbb{I}_{{\mathcal{E}}}(X)\,w(X)\,\lim_{a\to\infty}\inf_{\theta^{*}\in{\mathcal{V}}_{\eta}}\phi(y,a\;\ell(X,\theta^{*}))\right\}
𝔼{𝕀(X)w(X)limaϕ(y,a(X,θ~a,X))}\displaystyle\geq\mathbb{E}\left\{\mathbb{I}_{{\mathcal{E}}}(X)\,w(X)\,\lim_{a\to\infty}\phi\left(y,a\;\ell\left(X,\widetilde{\theta}^{*}_{a,X}\right)\right)\right\}
=D𝔼{𝕀c(X)w(X)limaϕ(y,a(X,θ~a,X))}.\displaystyle=D-\mathbb{E}\left\{\mathbb{I}_{{\mathcal{E}}^{c}}(X)\,w(X)\,\lim_{a\to\infty}\phi\left(y,a\;\ell\left(X,\widetilde{\theta}^{*}_{a,X}\right)\right)\right\}\,.

where we denote c{\mathcal{E}}^{c} the complement of {\mathcal{E}}. Using (28) and assumption A5, we obtain that

𝔼{𝕀c(X)w(X)limaϕ(y,a(X,θ~a,X))}Mϕ(Xc)<DL(θ0),\mathbb{E}\left\{\mathbb{I}_{{\mathcal{E}}^{c}}(X)\,w(X)\,\lim_{a\to\infty}\phi\left(y,a\;\ell\left(X,\widetilde{\theta}^{*}_{a,X}\right)\right)\right\}\leq M_{\phi}\mathbb{P}\left(X\in{\mathcal{E}}^{c}\right)<D-L(\theta_{0})\,,

so

$$\liminf_{a\to\infty}\;\inf_{|\alpha^{*}-\alpha|+\|\beta^{*}-\beta\|_{{\mathcal{W}}^{1,{\mathcal{H}}}}<\eta}L(a\,\theta^{*})\geq D-\mathbb{E}\left\{\mathbb{I}_{{\mathcal{E}}^{c}}(X)\,w(X)\,\lim_{a\to\infty}\phi\left(y,a\;\ell(X,\widetilde{\theta}^{*}_{a,X})\right)\right\}>L(\theta_{0})\,,$$

which concludes the proof of (27).

Step 4. We will show that there exist $M>0$ and $\tau>0$ such that

infθ=(α,β):β𝒲1,+|α|>ML(θ)>L(θ0)+τ.\inf_{\theta=(\alpha,\beta):\|\beta\|_{{\mathcal{W}}^{1,{\mathcal{H}}}}+|\alpha|>M}L\left(\theta\right)>L(\theta_{0})+\tau\;. (30)

Note that by proving (30), we may indeed conclude the proof. In fact, from Lemma A.7, we have that L(α^,β^)a.s.L(α0,β0)L(\widehat{\alpha},\widehat{\beta})\buildrel a.s.\over{\longrightarrow}L(\alpha_{0},\beta_{0}), thus if 𝒩={limnL(α^,β^)L(α0,β0)}{\mathcal{N}}=\{\lim_{n\to\infty}L(\widehat{\alpha},\widehat{\beta})\neq L(\alpha_{0},\beta_{0})\}, (𝒩)=0\mathbb{P}({\mathcal{N}})=0. Take ω𝒩\omega\notin{\mathcal{N}} and n0n_{0} such that for all nn0n\geq n_{0}, |L(α^,β^)L(α0,β0)|<τ|L(\widehat{\alpha},\widehat{\beta})-L(\alpha_{0},\beta_{0})|<\tau, then L(α^,β^)<L(α0,β0)+τ<inf|α|+β𝒲1,>ML(α,β)L(\widehat{\alpha},\widehat{\beta})<L(\alpha_{0},\beta_{0})+\tau<\inf_{|\alpha|+\|\beta\|_{{\mathcal{W}}^{1,{\mathcal{H}}}}>M}L\left(\alpha,\beta\right), which entails that for all nn0n\geq n_{0}, |α^|+β^𝒲1,M|\widehat{\alpha}|+\|\widehat{\beta}\|_{{\mathcal{W}}^{1,{\mathcal{H}}}}\leq M, as desired.

Let us show that (30) holds. From Step 3, there exists η\eta such that, for any θ=(α,β)\theta=(\alpha,\beta)\in{\mathcal{B}}, we have that D(θ)>L(θ0)D(\theta)>L(\theta_{0}) where

D(θ)=limainf|αα|+ββ𝒲1,<ηL(aθ),D(\theta)=\lim_{a\to\infty}\inf_{|\alpha^{*}-\alpha|+\|\beta^{*}-\beta\|_{{\mathcal{W}}^{1,{\mathcal{H}}}}<\eta}L(a\,\theta^{*})\,,

and θ=(α,β)\theta^{*}=(\alpha^{*},\beta^{*}). Define 0<τθ<(D(θ)L(θ0))/20<\tau_{\theta}<(D(\theta)-L(\theta_{0}))/2, then D(θ)>L(θ0)+2τθD(\theta)>L(\theta_{0})+2\;\tau_{\theta}, which implies that there exists aθ>0a_{\theta}>0 such that

infa>aθinf|αα|+ββ𝒲1,<ηL(aθ)>L(θ0)+τθ.\inf_{a>a_{\theta}}\inf_{|\alpha^{*}-\alpha|+\|\beta^{*}-\beta\|_{{\mathcal{W}}^{1,{\mathcal{H}}}}<\eta}L(a\,\theta^{*})>L(\theta_{0})+\tau_{\theta}\,. (31)

Taking into account that the open balls ${\mathcal{B}}(\theta,\eta)=\{\theta^{*}\in\mathbb{R}\times{\mathcal{W}}^{1,{\mathcal{H}}}:|\alpha^{*}-\alpha|+\|\beta^{*}-\beta\|_{{\mathcal{W}}^{1,{\mathcal{H}}}}<\eta\}$, $\theta\in{\mathcal{B}}$, provide a covering of ${\mathcal{B}}$, which is compact in $\mathbb{R}\times{\mathcal{H}}$, we get that there exist $\theta_{1},\dots,\theta_{s}$ such that $\theta_{j}\in{\mathcal{B}}$ and ${\mathcal{B}}\subset\cup_{j=1}^{s}{\mathcal{B}}(\theta_{j},\eta)$. Thus, if $a_{j}=a_{\theta_{j}}$, $A=\max_{1\leq j\leq s}(a_{j})$, $\tau_{j}=\tau_{\theta_{j}}$ and $\tau=\min_{1\leq j\leq s}(\tau_{j})$, from (31), we obtain that for $j=1,\dots,s$ we have

infa>Ainfθ(θj,η)L(aθ)>L(θ0)+τ.\inf_{a>A}\inf_{\theta^{*}\in{\mathcal{B}}(\theta_{j},\eta)}L(a\,\theta^{*})>L(\theta_{0})+\tau\,. (32)

Let θ×𝒲1,\theta\in\mathbb{R}\times{\mathcal{W}}^{1,{\mathcal{H}}} be such that dθ=|α|+β𝒲1,>Ad_{\theta}=|\alpha|+\|\beta\|_{{\mathcal{W}}^{1,{\mathcal{H}}}}>A, define θ~=θ/dθ\widetilde{\theta}=\theta/d_{\theta}, then θ~\widetilde{\theta}\in{\mathcal{B}}, so there exists j0j_{0} such that θ~(θj0,η)\widetilde{\theta}\in{\mathcal{B}}(\theta_{j_{0}},\eta) and L(θ)=L(dθθ~)L(\theta)=L(d_{\theta}\;\widetilde{\theta}). Therefore, using (32), we obtain that L(θ)>L(θ0)+τL(\theta)>L(\theta_{0})+\tau, so

inf|α|+β𝒲1,>AL(α,β)>L(θ0)+τ,\inf_{|\alpha|+\|\beta\|_{{\mathcal{W}}^{1,{\mathcal{H}}}}>A}L\left(\alpha,\beta\right)>L(\theta_{0})+\tau\,,

which is (30) with $M=A$, concluding the proof. ∎

Proof of Theorem 3.2.

We will show only b), that is, that the result holds when β0C([0,1])\beta_{0}\in C([0,1]), BjC([0,1])B_{j}\in C([0,1]), 1jkn1\leq j\leq k_{n}, Bj𝒲1,2B_{j}\in{\mathcal{W}}^{1,2} and provide approximations in L2([0,1])L^{2}([0,1]), that is, below 𝒲1,{\mathcal{W}}^{1,{\mathcal{H}}} is the Sobolev space 𝒲1,2{\mathcal{W}}^{1,2}. The situation where =L2([0,1]){\mathcal{H}}=L^{2}([0,1]) follows similarly, replacing the supremum norm below by the L2L^{2}-norm and using that in this case, the Sobolev space 𝒲1,{\mathcal{W}}^{1,{\mathcal{H}}} is compactly embedded in L2([0,1])L^{2}([0,1]) and that A11 holds for any β𝒲1,\beta\in{\mathcal{W}}^{1,{\mathcal{H}}}.

Assume that we have shown that

infθ𝒜ϵL(θ)>L(θ0)\inf_{\theta\in{\mathcal{A}}_{\epsilon}}L(\theta)>L(\theta_{0}) (33)

where 𝒜ϵ={θ=(α,β)×𝒲1,:|α|+β𝒲1,M,|αα0|+ββ0>ϵ}{\mathcal{A}}_{\epsilon}=\{\theta=(\alpha,\beta)\in\mathbb{R}\times{\mathcal{W}}^{1,{\mathcal{H}}}:|\alpha|+\|\beta\|_{{\mathcal{W}}^{1,{\mathcal{H}}}}\leq M,|\alpha-\alpha_{0}|+\|\beta-\beta_{0}\|_{\infty}>\epsilon\}. Then, taking into account that β^𝒲1,C([0,1])\widehat{\beta}\in{\mathcal{W}}^{1,{\mathcal{H}}}\cap C([0,1]), that Lemma A.7 implies that L(θ^)a.s.L(θ0)L(\widehat{\theta})\buildrel a.s.\over{\longrightarrow}L(\theta_{0}) and Lemma A.8, we conclude that |α^α0|+β^β0a.s.0|\widehat{\alpha}-\alpha_{0}|+\|\widehat{\beta}-\beta_{0}\|_{\infty}\buildrel a.s.\over{\longrightarrow}0.

Let us derive (33). Let $\{\theta_{m}\}_{m\geq 1}$ be a sequence such that $\theta_{m}=(\alpha_{m},\beta_{m})\in{\mathcal{A}}_{\epsilon}$ and $\lim_{m\to\infty}L(\theta_{m})=\inf_{\theta\in{\mathcal{A}}_{\epsilon}}L(\theta)$. The Rellich–Kondrachov theorem entails that ${\mathcal{W}}^{1,{\mathcal{H}}}$ is compactly embedded in $C([0,1])$, hence the ball $\{\theta=(\alpha,\beta)\in\mathbb{R}\times{\mathcal{W}}^{1,{\mathcal{H}}}:|\alpha|+\|\beta\|_{{\mathcal{W}}^{1,{\mathcal{H}}}}\leq M\}$ is compact in $\mathbb{R}\times C([0,1])$. Thus, there exist a subsequence $\{\theta_{m_{j}}\}_{j\geq 1}$ of $\{\theta_{m}\}_{m\geq 1}$ and a point $\theta^{\star}=(\alpha^{\star},\beta^{\star})\in\mathbb{R}\times{\mathcal{W}}^{1,{\mathcal{H}}}$ with $|\alpha^{\star}|+\|\beta^{\star}\|_{{\mathcal{W}}^{1,{\mathcal{H}}}}\leq M$ such that $|\alpha_{m_{j}}-\alpha^{\star}|+\|\beta_{m_{j}}-\beta^{\star}\|_{\infty}\to 0$. Then, using that $\theta_{m}\in{\mathcal{A}}_{\epsilon}$, we have $|\alpha_{m_{j}}-\alpha_{0}|+\|\beta_{m_{j}}-\beta_{0}\|_{\infty}>\epsilon$, which implies that $|\alpha^{\star}-\alpha_{0}|+\|\beta^{\star}-\beta_{0}\|_{\infty}\geq\epsilon$. Using that $\|f\|\leq\|f\|_{\infty}$ and the Cauchy–Schwarz inequality, we get that, for any $v\in L^{2}([0,1])$, $\alpha_{m_{j}}+\langle v,\beta_{m_{j}}\rangle\to\alpha^{\star}+\langle v,\beta^{\star}\rangle$. Thus, using the Bounded Convergence Theorem, the continuity of $\phi\left(y,t\right)$ with respect to $t$ and its boundedness, we get that $L(\theta_{m_{j}})\to L(\theta^{\star})$, which leads to $\inf_{\theta\in{\mathcal{A}}_{\epsilon}}L(\theta)=L(\theta^{\star})$. Using that A11 holds and taking into account that $\beta^{\star}\in{\mathcal{W}}^{1,{\mathcal{H}}}$, Lemma A.1(b) implies that $L(\theta^{\star})>L(\theta_{0})$, concluding the proof. ∎

A.3 Proof of Theorem 3.3 and Proposition 3.4

Proof of Theorem 3.3.

From assumption A10, we have that there exists an element β~kk\widetilde{\beta}_{k}\in{\mathcal{M}}_{k}, β~k=j=1kb~jBj(x)\widetilde{\beta}_{k}=\sum_{j=1}^{k}\widetilde{b}_{j}B_{j}(x) such that β~kβ0=O(kr)\|\widetilde{\beta}_{k}-\beta_{0}\|_{{\mathcal{H}}}=O(k^{-r}). Without loss of generality, we assume that β~kβ0<ϵ0/2\|\widetilde{\beta}_{k}-\beta_{0}\|_{{\mathcal{H}}}<\epsilon_{0}/2 with ϵ0\epsilon_{0} defined in A1. Recall that Θn=×kn\Theta_{n}^{\star}=\mathbb{R}\times{\mathcal{M}}_{k_{n}}.

In order to get the convergence rate of our estimator θ^=(α^,β^)\widehat{\theta}=(\widehat{\alpha},\widehat{\beta}) we will apply Theorem 3.4.1 of van der Vaart and Wellner, (1996). According to the notation in that Theorem, let d(θ1,θ2)=π~(θ1,θ2)d(\theta_{1},\theta_{2})=\widetilde{\pi}_{\mathbb{P}}(\theta_{1},\theta_{2}) and Θn={θΘn:|αα0|+ββ0ϵ0/2}\Theta_{n}=\{\theta\in\Theta_{n}^{\star}:|\alpha-\alpha_{0}|+\|\beta-\beta_{0}\|_{{\mathcal{H}}}\leq\epsilon_{0}/2\}, where ϵ0\epsilon_{0} is given in assumption A1. Furthermore, the function mθm_{\theta} in that Theorem equals

mθ(y,X)=ϕ(y,α+X,β)w(X).m_{\theta}(y,X)=\,-\,\phi\left(y,\alpha+\langle X,\beta\rangle\right)w(X)\,.

First of all, note that Theorem 3.2 implies that $\mathbb{P}(\widehat{\theta}\in\Theta_{n})\to 1$, as required. Secondly, to emphasize the dependence on $n$, denote $\widetilde{\theta}_{0,n}=(\alpha_{0},\widetilde{\beta})$ with $\widetilde{\beta}=\widetilde{\beta}_{k}$ defined in assumption A10. Assumption A6b, the fact that $\|f\|\leq\|f\|_{{\mathcal{H}}}$ and the Bounded Convergence Theorem imply that $\widetilde{\pi}_{\mathbb{P}}(\widetilde{\theta}_{0,n},\theta_{0})\to 0$, as $n\to\infty$. Furthermore, from Theorem 3.2, we get that $\widetilde{\pi}_{\mathbb{P}}(\widehat{\theta},\theta_{0})\buildrel a.s.\over{\longrightarrow}0$. Hence, using the triangle inequality, we immediately obtain that $\widetilde{\pi}_{\mathbb{P}}(\widehat{\theta},\widetilde{\theta}_{0,n})\buildrel p\over{\longrightarrow}0$, as required in Theorem 3.4.1 of van der Vaart and Wellner, (1996). Moreover, we also have that $L_{n}(\widehat{\theta})\leq L_{n}(\widetilde{\theta}_{0,n})$, since $\widetilde{\theta}_{0,n}\in\mathbb{R}\times{\mathcal{M}}_{k}$, which is also a requirement to apply that result.

Let A=supy{0,1},t|2ϕ(y,t)/t2|A=\sup_{y\in\{0,1\},t\in\mathbb{R}}|{\partial^{2}}\phi(y,t)/{\partial t^{2}}| and denote C2=16(A+C0)𝔼{w(X)X2}/C0C^{2}=16(A+C_{0}^{\star})\mathbb{E}\left\{w(X)\|X\|^{2}\right\}/C_{0}^{\star} with C0C_{0}^{\star} the constant given in assumption A1. Define δn=Cβ~β0\delta_{n}=C\|\widetilde{\beta}-\beta_{0}\|_{{\mathcal{H}}}. Note that from A10, δn=O(nrς)\delta_{n}=O(n^{-\,r\varsigma}).

To make use of Theorem 3.4.1 of van der Vaart and Wellner, (1996), we have to show that there exists a function φn:(0,)\varphi_{n}:(0,\infty)\to\mathbb{R} such that φn(δ)/δγ\varphi_{n}(\delta)/\delta^{\gamma} is decreasing on (δn,)(\delta_{n},\infty), for some γ<2\gamma<2 and such that, for any δ>δn\delta>\delta_{n}, we have

$$\sup_{\theta\in\Theta_{n,\delta}}\mathbb{E}\left(m_{\theta}(y,X)-m_{\widetilde{\theta}_{0,n}}(y,X)\right)=L(\widetilde{\theta}_{0,n})-\inf_{\theta\in\Theta_{n,\delta}}L(\theta)\lesssim-\delta^{2}\,,\tag{34}$$
$$\mathbb{E}^{*}\sup_{\theta\in\Theta_{n,\delta}}\left|\mathbb{G}_{n}\left(m_{\theta}-m_{\widetilde{\theta}_{0,n}}\right)\right|=\mathbb{E}^{*}\sup_{\theta\in\Theta_{n,\delta}}\sqrt{n}\,\biggl|\left(L_{n}(\theta)-L(\theta)\right)-\left(L_{n}(\widetilde{\theta}_{0,n})-L(\widetilde{\theta}_{0,n})\right)\biggr|\lesssim\varphi_{n}(\delta)\,,\tag{35}$$

where 𝔾nf=n(PnP)f\mathbb{G}_{n}f=\sqrt{n}(P_{n}-P)f, 𝔼\mathbb{E}^{*} stands for the outer expectation and Θn,δ={θΘn:δ/2<d(θ,θ~0,n)δ}\Theta_{n,\delta}=\{\theta\in\Theta_{n}:\delta/2<d(\theta,\widetilde{\theta}_{0,n})\leq\delta\}.

We begin by showing (34). Assumption A1 entails that, for any θΘn\theta\in\Theta_{n},

L(θ)L(θ0)C0π~2(θ,θ0),L(\theta)-L(\theta_{0})\geq C_{0}^{\star}\,\widetilde{\pi}_{\mathbb{P}}^{2}(\theta,\theta_{0})\,, (36)

while from (18) in the proof of Theorem 3.1(c), we get that

L(θ~0,n)L(θ0)A𝔼{w(X)X2}β~β02=A𝔼{w(X)X2}C2δn2.L(\widetilde{\theta}_{0,n})-L(\theta_{0})\leq A\,\mathbb{E}\left\{w(X)\|X\|^{2}\right\}\|\widetilde{\beta}-\beta_{0}\|_{{\mathcal{H}}}^{2}=\frac{A\,\mathbb{E}\left\{w(X)\|X\|^{2}\right\}}{C^{2}}\delta_{n}^{2}\,. (37)

Moreover, using the Cauchy–Schwarz inequality and the fact that $\|f\|\leq\|f\|_{{\mathcal{H}}}$, we have

π~(θ,θ~0,n)π~(θ,θ0)+π~(θ0,θ~0,n)π~(θ,θ0)+{𝔼{w(X)X2}β~β02}1/2,\widetilde{\pi}_{\mathbb{P}}(\theta,\widetilde{\theta}_{0,n})\leq\widetilde{\pi}_{\mathbb{P}}(\theta,\theta_{0})+\widetilde{\pi}_{\mathbb{P}}(\theta_{0},\widetilde{\theta}_{0,n})\leq\widetilde{\pi}_{\mathbb{P}}(\theta,\theta_{0})+\left\{\mathbb{E}\left\{w(X)\|X\|^{2}\right\}\|\widetilde{\beta}-\beta_{0}\|_{{\mathcal{H}}}^{2}\right\}^{1/2}\,,

which together with the inequality (a+b)22(a2+b2)(a+b)^{2}\leq 2(a^{2}+b^{2}), implies that

π~2(θ,θ0)12π~2(θ,θ~0,n)𝔼{w(X)X2}β~β02=12π~2(θ,θ~0,n)𝔼{w(X)X2}C2δn2.\widetilde{\pi}_{\mathbb{P}}^{2}(\theta,\theta_{0})\geq\frac{1}{2}\widetilde{\pi}_{\mathbb{P}}^{2}(\theta,\widetilde{\theta}_{0,n})-\mathbb{E}\left\{w(X)\|X\|^{2}\right\}\|\widetilde{\beta}-\beta_{0}\|_{{\mathcal{H}}}^{2}=\frac{1}{2}\widetilde{\pi}_{\mathbb{P}}^{2}(\theta,\widetilde{\theta}_{0,n})-\frac{\mathbb{E}\left\{w(X)\|X\|^{2}\right\}}{C^{2}}\delta_{n}^{2}\,. (38)

Then combining (36), (37) and (38), we get that for any θΘn\theta\in\Theta_{n}, such that δ/2<d(θ,θ~0,n)=π~(θ,θ~0,n)\delta/2<d(\theta,\widetilde{\theta}_{0,n})=\widetilde{\pi}_{\mathbb{P}}(\theta,\widetilde{\theta}_{0,n}), we have

$$\begin{aligned}
L(\theta)-L(\widetilde{\theta}_{0,n}) &=\left\{L(\theta)-L(\theta_{0})\right\}-\left\{L(\widetilde{\theta}_{0,n})-L(\theta_{0})\right\}\geq C_{0}^{\star}\,\widetilde{\pi}_{\mathbb{P}}^{2}(\theta,\theta_{0})-\frac{A\,\mathbb{E}\left\{w(X)\|X\|^{2}\right\}}{C^{2}}\delta_{n}^{2}\\
&\geq\frac{C_{0}^{\star}}{2}\widetilde{\pi}_{\mathbb{P}}^{2}(\theta,\widetilde{\theta}_{0,n})-\frac{(A+C_{0}^{\star})\mathbb{E}\left\{w(X)\|X\|^{2}\right\}}{C^{2}}\delta_{n}^{2}\geq\frac{C_{0}^{\star}}{8}\delta^{2}-\frac{C_{0}^{\star}}{16}\delta_{n}^{2}\geq\frac{C_{0}^{\star}}{16}\delta^{2}\,,
\end{aligned}$$

concluding the proof of (34).
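The constants in the last chain of inequalities follow from the choice of $C^{2}$ made above. A short check, assuming $0<\mathbb{E}\{w(X)\|X\|^{2}\}<\infty$ so that $C$ is well defined, is:

```latex
\begin{align*}
\frac{(A+C_{0}^{\star})\,\mathbb{E}\{w(X)\|X\|^{2}\}}{C^{2}}
  &= (A+C_{0}^{\star})\,\mathbb{E}\{w(X)\|X\|^{2}\}\cdot
     \frac{C_{0}^{\star}}{16\,(A+C_{0}^{\star})\,\mathbb{E}\{w(X)\|X\|^{2}\}}
   = \frac{C_{0}^{\star}}{16}\,,\\[4pt]
\widetilde{\pi}_{\mathbb{P}}(\theta,\widetilde{\theta}_{0,n})>\frac{\delta}{2}
  &\;\Longrightarrow\;
  \frac{C_{0}^{\star}}{2}\,\widetilde{\pi}_{\mathbb{P}}^{2}(\theta,\widetilde{\theta}_{0,n})
  > \frac{C_{0}^{\star}}{8}\,\delta^{2}\,,\\[4pt]
\delta>\delta_{n}
  &\;\Longrightarrow\;
  \frac{C_{0}^{\star}}{8}\,\delta^{2}-\frac{C_{0}^{\star}}{16}\,\delta_{n}^{2}
  \geq \frac{C_{0}^{\star}}{8}\,\delta^{2}-\frac{C_{0}^{\star}}{16}\,\delta^{2}
  = \frac{C_{0}^{\star}}{16}\,\delta^{2}\,.
\end{align*}
```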

We have now to find φn(δ)\varphi_{n}(\delta) such that φn(δ)/δγ\varphi_{n}(\delta)/\delta^{\gamma} is decreasing in δ\delta, for some γ<2\gamma<2 and (35) holds. Define the class of functions

n,δ={VθVθ~0,n:θΘn,δ},{\mathcal{F}}_{n,\delta}=\left\{V_{\theta}-V_{\widetilde{\theta}_{0,n}}:\theta\in\Theta_{n,\delta}\right\}\,,

with Vθ=mθ=ϕ(y,α+X,β)w(X)V_{\theta}=\,-\,m_{\theta}=\phi\left(y,\alpha+\langle X,\beta\rangle\right)w(X). Inequality (35) involves an empirical process indexed by n,δ{\mathcal{F}}_{n,\delta}, since

𝔼supδ/2<d(θ,θ~0,n)δθΘn|𝔾n(mθmθ~0,n)|=𝔼supfn,δn|(PnP)f|.\mathbb{E}^{*}\sup_{\stackrel{{\scriptstyle\theta\in\Theta_{n}}}{{\delta/2<d(\theta,\widetilde{\theta}_{0,n})\leq\delta}}}\left|\mathbb{G}_{n}\left(m_{\theta}-m_{\widetilde{\theta}_{0,n}}\right)\right|=\mathbb{E}^{*}\sup_{f\in{\mathcal{F}}_{n,\delta}}\sqrt{n}|(P_{n}-P)f|\,.

For any fn,δf\in{\mathcal{F}}_{n,\delta} we have that fA1=2supy{0,1};tϕ(y,t)=2Mϕ\|f\|_{\infty}\leq A_{1}=2\sup_{y\in\{0,1\};t\in\mathbb{R}}\phi(y,t)=2M_{\phi}. Furthermore, if A2=2MψA_{2}=2\,M_{\psi} using the inequality

|VθVθ~0,n|\displaystyle|V_{\theta}-V_{\widetilde{\theta}_{0,n}}| =w(X)|ϕ(y,α+X,β)ϕ(y,α0+X,β~)|\displaystyle=w(X)\left|\phi\left(y,\alpha+\langle X,\beta\rangle\right)-\phi\left(y,\alpha_{0}+\langle X,\widetilde{\beta}\rangle\right)\right|
2Mψ|αα0+X,ββ~|w(X),\displaystyle\leq 2\,M_{\psi}\left|\alpha-\alpha_{0}+\langle X,\beta-\widetilde{\beta}\rangle\right|w(X)\,,

and the fact that w=1\|w\|_{\infty}=1 and π~2(θ,θ~0,n)δ2\widetilde{\pi}_{\mathbb{P}}^{2}(\theta,\widetilde{\theta}_{0,n})\leq\delta^{2}, we get that

Pf24Mψ2𝔼([αα0+X,ββ~]2w(X))A22δ2.Pf^{2}\leq 4\,M_{\psi}^{2}\;\mathbb{E}\left(\left[\alpha-\alpha_{0}+\langle X,\beta-\widetilde{\beta}\rangle\right]^{2}w(X)\right)\leq A_{2}^{2}\,\delta^{2}\,.

Lemma 3.4.2 in van der Vaart and Wellner, (1996) leads to

𝔼supfn,δn|(PnP)f|J[](A2δ,n,δ,L2(P))(1+A1J[](A2δ,n,δ,L2(P))A22δ2n),\mathbb{E}^{*}\sup_{f\in{\mathcal{F}}_{n,\delta}}\sqrt{n}|(P_{n}-P)f|\leq J_{[\;]}\left(A_{2}\delta,{\mathcal{F}}_{n,\delta},L_{2}(P)\right)\left(1+A_{1}\frac{J_{[\;]}(A_{2}\,\delta,{\mathcal{F}}_{n,\delta},L_{2}(P))}{A_{2}^{2}\delta^{2}\;\sqrt{n}}\right)\,,

where J[](δ,,L2(P))=0δ1+logN[](ϵ,,L2(P))𝑑ϵJ_{[\;]}(\delta,{\mathcal{F}},L_{2}(P))=\int_{0}^{\delta}\sqrt{1+\log N_{[\;]}(\epsilon,{\mathcal{F}},L_{2}(P))}d\epsilon is the bracketing integral of the class {\mathcal{F}}.

Recall that $\|\widetilde{\beta}_{k}-\beta_{0}\|_{{\mathcal{H}}}=O(n^{-\,r\varsigma})$, so for $n$ large enough, $\|\widetilde{\beta}_{k}-\beta_{0}\|_{{\mathcal{H}}}<\epsilon_{0}/2$ and, for any $\theta=(\alpha,\beta)\in\Theta_{n}$, we have $|\alpha-\alpha_{0}|+\|\widetilde{\beta}_{k}-\beta\|_{{\mathcal{H}}}<\epsilon_{0}$. Therefore, ${\mathcal{F}}_{n,\delta}\subset{\mathcal{G}}_{n,c,\widetilde{\theta}_{0,n},\widetilde{\theta}_{0,n}}$, where ${\mathcal{G}}_{n,c,\widetilde{\theta}_{0},\theta_{0}^{*}}$ is defined in Lemma A.6 and we take $\widetilde{\theta}_{0}=\widetilde{\theta}_{0,n}=(\alpha_{0},\widetilde{\beta}_{k})$, $\theta_{0}^{*}=\widetilde{\theta}_{0,n}$ and $c=\epsilon_{0}$. Hence, the bound given in Lemma A.6 leads to

N[](ϵ,n,δ,L2(P))(B1ϵ+1)k+1,N_{[\;]}\left(\epsilon,{\mathcal{F}}_{n,\delta},L_{2}(P)\right)\leq\left(\frac{B_{1}}{\epsilon}+1\right)^{k+1}\,,

for some positive constant B1B_{1} independent of nn, θ~0,n\widetilde{\theta}_{0,n} and ϵ\epsilon. Therefore, for δ<B1/A2\delta<B_{1}/A_{2}, we have

J[](A2δ,n,δ,L2(P))\displaystyle J_{[\;]}\left(A_{2}\delta,{\mathcal{F}}_{n,\delta},L_{2}(P)\right) 0A2δ1+log((B1ϵ+1)k+1)𝑑ϵ\displaystyle\leq\int_{0}^{A_{2}\,\delta}\sqrt{1+\log\left(\left(\frac{B_{1}}{\epsilon}+1\right)^{k+1}\right)}d\epsilon
0A2δ1+(k+1)log(2B1ϵ)𝑑ϵ\displaystyle\leq\int_{0}^{A_{2}\,\delta}\sqrt{1+(k+1)\log\left(2\,\frac{B_{1}}{\epsilon}\right)}d\epsilon
2(k+1)1/20A2δ1+log(2B1ϵ)𝑑ϵ\displaystyle\leq 2\,(k+1)^{1/2}\int_{0}^{A_{2}\,\delta}\sqrt{1+\log\left(2\,\frac{B_{1}}{\epsilon}\right)}d\epsilon
=4B1(k+1)1/20B2δ1+log(1ϵ)𝑑ϵ,\displaystyle=4\,B_{1}\,(k+1)^{1/2}\int_{0}^{B_{2}\;\delta}\sqrt{1+\log\left(\frac{1}{\epsilon}\right)}d\epsilon\,,

where B2=A2/(2B1)B_{2}={A_{2}}/({2\,B_{1}}). Note that 0δ1+log(1/ϵ)𝑑ϵ=O(δlog(1/δ))\int_{0}^{\delta}\sqrt{1+\log(1/\epsilon)}\,d\epsilon=O(\delta\sqrt{\log(1/\delta)}) as δ0\delta\to 0, hence there exists δ0>0\delta_{0}>0 and a constant C>0C>0 such that for any δ<δ0\delta<\delta_{0}, 0δ1+log(1/ϵ)𝑑ϵCδlog(1/δ)\int_{0}^{\delta}\sqrt{1+\log(1/\epsilon)}\,d\epsilon\leq C\,\delta\,\sqrt{\log(1/\delta)}. This implies that for δ<δ0/B2\delta<\delta_{0}/B_{2}

J[](A2δ,n,δ,L2(P))δlog(1δ)k+1.J_{[\;]}(A_{2}\delta,{\mathcal{F}}_{n,\delta},L_{2}(P))\lesssim\delta\,\sqrt{\log\left(\frac{1}{\delta}\right)}\sqrt{k+1}\,.
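The order bound for the entropy integral used above can be made explicit by integration by parts. The following sketch gives one admissible choice of the constants; the values $\delta_{0}=e^{-1}$ and $C=\sqrt{2}+1/2$ are derived here for illustration and are not taken from the original argument.

```latex
% Let f(eps) = sqrt(1 + log(1/eps)), which is decreasing on (0,1] with f >= 1.
% Since f'(eps) = -1/(2 eps f(eps)) and eps f(eps) -> 0 as eps -> 0,
\begin{align*}
\int_{0}^{\delta} f(\epsilon)\,d\epsilon
  &= \delta f(\delta) + \int_{0}^{\delta}\frac{d\epsilon}{2 f(\epsilon)}
   \;\leq\; \delta f(\delta) + \frac{\delta}{2 f(\delta)}\,.
\end{align*}
% For 0 < delta <= e^{-1} we have log(1/delta) >= 1, so f(delta) <= sqrt(2 log(1/delta)),
% and f(delta) >= 1 gives delta/(2 f(delta)) <= delta/2 <= (delta/2) sqrt(log(1/delta)):
\begin{align*}
\int_{0}^{\delta}\sqrt{1+\log(1/\epsilon)}\,d\epsilon
  \;\leq\; \left(\sqrt{2}+\tfrac{1}{2}\right)\delta\,\sqrt{\log(1/\delta)}\,,
  \qquad 0<\delta\leq e^{-1}\,.
\end{align*}
```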

If we denote qn=k+1q_{n}=k+1, we obtain that for some constant A3A_{3} independent of nn and δ\delta,

𝔼supθΘn,δ|𝔾n(mθmθ~0,n)|A3[δqn1/2log(1δ)+qnnlog(1δ)].\mathbb{E}^{*}\sup_{\theta\in\Theta_{n,\delta}}\left|\mathbb{G}_{n}\left(m_{\theta}-m_{\widetilde{\theta}_{0,n}}\right)\right|\leq A_{3}\,\left[\delta\,q_{n}^{1/2}\sqrt{\log\left(\frac{1}{\delta}\right)}+\frac{q_{n}}{\sqrt{n}}\;\log\left(\frac{1}{\delta}\right)\right]\,.

Choosing

φn(δ)=A3[δqn1/2log(1δ)+qnnlog(1δ)],\varphi_{n}(\delta)=A_{3}\,\left[\delta\,q_{n}^{1/2}\sqrt{\log\left(\frac{1}{\delta}\right)}+\frac{q_{n}}{\sqrt{n}}\;\log\left(\frac{1}{\delta}\right)\right]\,,

we have that φn(δ)/δ\varphi_{n}(\delta)/\delta is decreasing in δ\delta, concluding the proof of (35).
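The monotonicity claim can be verified term by term. The sketch below restricts attention to $0<\delta<1$, where $\log(1/\delta)>0$; this restriction is an assumption made here for the check.

```latex
\begin{align*}
\frac{\varphi_{n}(\delta)}{\delta}
  = A_{3}\left[\,q_{n}^{1/2}\sqrt{\log\!\left(\frac{1}{\delta}\right)}
    \;+\;\frac{q_{n}}{\sqrt{n}}\cdot\frac{1}{\delta}\,\log\!\left(\frac{1}{\delta}\right)\right].
\end{align*}
% On (0,1): delta -> log(1/delta) is positive and decreasing, and delta -> 1/delta is
% positive and decreasing, so both summands are decreasing. Hence the requirement of
% Theorem 3.4.1 holds with gamma = 1 < 2. As a by-product, for c > 1 with c*delta < 1,
\begin{align*}
\frac{\varphi_{n}(c\,\delta)}{c\,\delta}\leq\frac{\varphi_{n}(\delta)}{\delta}
  \quad\Longrightarrow\quad
  \varphi_{n}(c\,\delta)\leq c\,\varphi_{n}(\delta)\,.
\end{align*}
```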

To apply Theorem 3.4.1 of van der Vaart and Wellner, (1996), it remains to show that γnδn1\gamma_{n}\lesssim\delta_{n}^{-1} and

γn2φn(1γn)n,\gamma_{n}^{2}\,\varphi_{n}\left(\frac{1}{\gamma_{n}}\right)\lesssim\sqrt{n}\,, (39)

since $\varphi_{n}(c\delta)\leq c\,\varphi_{n}(\delta)$, for $c>1$. First note that $\gamma_{n}=O(n^{r\varsigma})$ and $\delta_{n}=C\|\widetilde{\beta}-\beta_{0}\|_{{\mathcal{H}}}=O(n^{-\,r\varsigma})$, so that $\gamma_{n}\lesssim\delta_{n}^{-1}$.

To derive (39), observe that

$$\gamma_{n}^{2}\varphi_{n}\left(\frac{1}{\gamma_{n}}\right)=A_{3}\,\left[\gamma_{n}q_{n}^{1/2}\,\sqrt{\log(\gamma_{n})}+\gamma_{n}^{2}\,\log(\gamma_{n})\;\frac{q_{n}}{\sqrt{n}}\right]=A_{3}\,\left[\sqrt{n}\;a_{n}(1+a_{n})\right]\,,$$

where $a_{n}=\gamma_{n}\,\sqrt{\log(\gamma_{n})}\;q_{n}^{1/2}/\sqrt{n}$. Hence, to derive that $\gamma_{n}^{2}\varphi_{n}\left(1/{\gamma_{n}}\right)\lesssim\sqrt{n}$, it is enough to show that $a_{n}=O(1)$, which follows easily since $q_{n}=O(n^{\varsigma})$ and $\gamma_{n}\sqrt{\log(\gamma_{n})}=O(n^{(1-\varsigma)/2})$, concluding the proof of (39).
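
Spelled out, and writing $c_{1}$ and $c_{2}$ for generic constants bounding the two rates (these symbols are introduced here only for illustration), for $n$ large enough

$$a_{n}=\frac{\gamma_{n}\sqrt{\log(\gamma_{n})}\;q_{n}^{1/2}}{\sqrt{n}}\leq\frac{c_{1}\,n^{(1-\varsigma)/2}\,\left(c_{2}\,n^{\varsigma}\right)^{1/2}}{n^{1/2}}=c_{1}\,c_{2}^{1/2}\,,$$

so that $a_{n}(1+a_{n})$ is bounded.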

Hence, from Theorem 3.4.1 of van der Vaart and Wellner, (1996), we get that $\gamma_{n}\widetilde{\pi}_{\mathbb{P}}(\widehat{\theta},\widetilde{\theta}_{0,n})=O_{\mathbb{P}}(1)$. As noticed above,

$$\widetilde{\pi}_{\mathbb{P}}(\theta_{0},\widetilde{\theta}_{0,n})\leq\left\{\mathbb{E}\left\{w(X)\|X\|^{2}\right\}\|\widetilde{\beta}-\beta_{0}\|_{{\mathcal{H}}}^{2}\right\}^{1/2}=O(n^{-\,r\varsigma})\,.$$

Then, using that $\gamma_{n}=O(n^{r\varsigma})$, we get that $\gamma_{n}\widetilde{\pi}_{\mathbb{P}}(\theta_{0},\widetilde{\theta}_{0,n})=O_{\mathbb{P}}(1)$ and, from the triangle inequality, we obtain that $\gamma_{n}\widetilde{\pi}_{\mathbb{P}}(\widehat{\theta},\theta_{0})=O_{\mathbb{P}}(1)$, as desired. ∎

Proof of Proposition 3.4.

From Lemma A.3, we have that there exists a constant $C_{0}>0$ independent of $n$ such that, for any $\theta$,

$$L(\theta)-L(\theta_{0})\geq C_{0}\,\pi_{\mathbb{P}}^{2}(\theta,\theta_{0})\,,$$

then to show that A1 holds, it will be enough to show that there exists a constant $C_{1}>0$ such that, for any $\theta=(\alpha,\beta)\in\mathbb{R}\times{\mathcal{H}}$ with $|\alpha-\alpha_{0}|+\|\beta-\beta_{0}\|<1$,

$$\pi_{\mathbb{P}}^{2}(\theta,\theta_{0})\geq C_{1}\,\widetilde{\pi}_{\mathbb{P}}^{2}(\theta,\theta_{0})\,,$$

and then take $\epsilon_{0}=1$ and $C_{0}^{\star}=C_{0}C_{1}$.

Since $\mathbb{P}(\|X\|\leq C)=1$, we have that, with probability one, for any $|\alpha-\alpha_{0}|+\|\beta-\beta_{0}\|<1$,

$$\left|\alpha+\langle X,\beta\rangle\right|\leq|\alpha_{0}|+1+C\left(\|\beta_{0}\|+1\right)=C^{\star}\,.$$

Thus, using that $F$ is strictly increasing we get that

$$0<A_{1}=F\left(-C^{\star}\right)\leq F\left(\alpha+\langle X,\beta\rangle\right)\leq F\left(C^{\star}\right)=A_{2}<1\,. \tag{40}$$

The Mean Value Theorem implies that, given $\theta=(\alpha,\beta)\in\mathbb{R}\times{\mathcal{H}}$ with $|\alpha-\alpha_{0}|+\|\beta-\beta_{0}\|<1$, there exists $(\alpha^{\star}_{X},\beta^{\star}_{X})=(1-\omega_{X})\theta+\omega_{X}\theta_{0}$, $0\leq\omega_{X}\leq 1$, such that

$$\begin{aligned}
\pi_{\mathbb{P}}^{2}(\theta,\theta_{0}) &=\mathbb{E}\left\{w(X)\left[F(\alpha+\langle X,\beta\rangle)-F(\alpha_{0}+\langle X,\beta_{0}\rangle)\right]^{2}\right\}\\
&=\mathbb{E}\left\{w(X)\left\{F\left(\alpha_{X}^{\star}+\langle X,\beta_{X}^{\star}\rangle\right)\left[1-F\left(\alpha_{X}^{\star}+\langle X,\beta_{X}^{\star}\rangle\right)\right]\left(\alpha-\alpha_{0}+\langle X,\beta-\beta_{0}\rangle\right)\right\}^{2}\right\}\\
&\geq A_{1}^{2}(1-A_{2})^{2}\,\mathbb{E}\left\{w(X)\left[\alpha-\alpha_{0}+\langle X,\beta-\beta_{0}\rangle\right]^{2}\right\}=A_{1}^{2}(1-A_{2})^{2}\,\widetilde{\pi}_{\mathbb{P}}^{2}(\theta,\theta_{0})\,,
\end{aligned}$$

where the last inequality follows from (40), since $|\alpha^{\star}_{X}-\alpha_{0}|+\|\beta^{\star}_{X}-\beta_{0}\|<1$, and the proof is concluded taking $C_{1}=A_{1}^{2}(1-A_{2})^{2}$. ∎
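
The lower bound in the last display can be illustrated numerically. The Python sketch below takes $F$ to be the logistic distribution function, consistent with the derivative $F(1-F)$ used above, and checks on a grid that both the derivative and the difference quotients of $F$ stay above $A_{1}(1-A_{2})$ when the linear predictor lies in $[-C^{\star},C^{\star}]$. The value $C^{\star}=3$, the grid and the function names are hypothetical choices made only for illustration.

```python
import math


def F(t: float) -> float:
    """Logistic distribution function."""
    return 1.0 / (1.0 + math.exp(-t))


def lower_constant(c_star: float) -> float:
    """A_1 (1 - A_2) with A_1 = F(-C*) and A_2 = F(C*), as in (40)."""
    return F(-c_star) * (1.0 - F(c_star))


c_star = 3.0  # hypothetical value of C*, chosen only for illustration
bound = lower_constant(c_star)

# Grid of linear predictors t = alpha + <X, beta> inside [-C*, C*].
grid = [-c_star + 2.0 * c_star * i / 400 for i in range(401)]

# Smallest value of the derivative F'(t) = F(t) (1 - F(t)) on the grid.
min_derivative = min(F(t) * (1.0 - F(t)) for t in grid)

# Smallest difference quotient |F(s) - F(t)| / |s - t| over distinct grid points.
min_quotient = min(
    abs(F(s) - F(t)) / abs(s - t) for s in grid for t in grid if s != t
)
```

Squaring the pointwise inequality and taking the expectation weighted by $w(X)$ gives the constant $C_{1}=A_{1}^{2}(1-A_{2})^{2}$ of the proof.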

References

  • Aguilera et al., (2008) Aguilera, A., Escabias, M., and Valderrama, M. (2008). Discussion of different logistic models with functional data: Application to systemic lupus erythematosus. Computational Statistics and Data Analysis, 53:151–163.
  • Aguilera-Morillo et al., (2013) Aguilera-Morillo, M., Aguilera, A., Escabias, M., and Valderrama, M. (2013). Penalized spline approaches for functional logit regression. Test, 22:251–277.
  • Alin and Agostinelli, (2017) Alin, A. and Agostinelli, C. (2017). Robust iteratively reweighted SIMPLS. Journal of Chemometrics, 31:e2881.
  • Aneiros-Pérez et al., (2017) Aneiros-Pérez, G., Bongiorno, E. G., Cao, R., and Vieu, P. (2017). Functional Statistics and Related Fields. Springer.
  • Basu et al., (1998) Basu, A., Harris, I. R., Hjort, N. L., and Jones, M. C. (1998). Robust and efficient estimation by minimizing a density power divergence. Biometrika, 85:549–559.
  • Bianco et al., (2022) Bianco, A., Boente, G., and Chebi, G. (2022). Penalized robust estimators in logistic regression with applications to sparse models. Test, 31:563–594.
  • Bianco et al., (2023) Bianco, A., Boente, G., and Chebi, G. (2023). Asymptotic behaviour of penalized robust estimators in logistic regression when dimension increases. In Robust and Multivariate Statistical Methods: Festschrift in Honor of David E. Tyler, Eds: Yi, Mengxi and Nordhausen, Klaus, pages 323–348. Springer International Publishing.
  • Bianco and Martinez, (2009) Bianco, A. and Martinez, E. (2009). Robust testing in the logistic regression model. Computational Statistics and Data Analysis, 53:4095–4105.
  • Bianco and Yohai, (1996) Bianco, A. and Yohai, V. (1996). Robust estimation in the logistic regression model. Lecture Notes in Statistics, 109:17–34.
  • Boente and Martinez, (2023) Boente, G. and Martinez, A. (2023). A robust spline approach in partially linear additive models. Computational Statistics and Data Analysis, 178:107611.
  • Boente et al., (2020) Boente, G., Salibián-Barrera, M., and Vena, P. (2020). Robust estimation for semi–functional linear regression models. Computational Statistics and Data Analysis, 152:107041.
  • Bondell, (2005) Bondell, H. D. (2005). Minimum distance estimation for the logistic regression model. Biometrika, 92:724–731.
  • Bondell, (2008) Bondell, H. D. (2008). A characteristic function approach to the biased sampling model, with application to robust logistic regression. Journal of Statistical Planning and Inference, 138:742–755.
  • Cai and Hall, (2006) Cai, T. and Hall, P. (2006). Prediction in functional linear regression. Annals of Statistics, 34:2159–2179.
  • Cantoni and Ronchetti, (2001) Cantoni, E. and Ronchetti, E. (2001). Robust inference for generalized linear models. Journal of the American Statistical Association, 96:1022–1030.
  • Cardot et al., (2003) Cardot, H., Ferraty, F., and Sarda, P. (2003). Spline estimators for the functional linear model. Statistica Sinica, 13:571–591.
  • Cardot and Sarda, (2005) Cardot, H. and Sarda, P. (2005). Estimation in generalized linear models for functional data via penalized likelihood. Journal of Multivariate Analysis, 92:24–41.
  • Carroll and Pederson, (1993) Carroll, R. J. and Pederson, S. (1993). On robust estimation in the logistic regression model. Journal of the Royal Statistical Society, Series B, 55:693–706.
  • Croux and Haesbroeck, (2003) Croux, C. and Haesbroeck, G. (2003). Implementing the Bianco and Yohai estimator for logistic regression. Computational Statistics and Data Analysis, 44:273–295.
  • Cuevas, (2014) Cuevas, A. (2014). A partial overview of the theory of statistics with functional data. Journal of Statistical Planning and Inference, 147:1–23.
  • Denhere and Billor, (2016) Denhere, M. and Billor, N. (2016). Robust principal component functional logistic regression. Communications in Statistics: Simulation and Computation, 45:264–281.
  • DeVore and Lorentz, (1993) DeVore, R. and Lorentz, G. (1993). Constructive Approximation. Springer.
  • Escabias et al., (2005) Escabias, M., Aguilera, A., and Valderrama, M. (2005). Modeling environmental data by functional principal component logistic regression. Environmetrics, 16:95–107.
  • Escabias et al., (2004) Escabias, M., Aguilera, A. M., and Valderrama, M. (2004). Principal component estimation of functional logistic regression: Discussion of two different approaches. Journal of Nonparametric Statistics, 16:365–384.
  • Febrero-Bande et al., (2017) Febrero-Bande, M., Galeano, P., and González-Manteiga, W. (2017). Functional principal component regression and functional partial least–squares regression: An overview and a comparative study. International Statistical Review, 85:61–83.
  • Ferraty and Romain, (2010) Ferraty, F. and Romain, Y. (2010). The Oxford Handbook of Functional Data Analysis. Oxford University Press.
  • Ferraty and Vieu, (2006) Ferraty, F. and Vieu, P. (2006). Nonparametric Functional Data Analysis: Theory and Practice. Springer.
  • García Ben and Yohai, (2004) García Ben, M. and Yohai, V. J. (2004). Quantile–quantile plot for deviance residuals in the generalized linear model. Journal of Computational and Graphical Statistics, 13:36–47.
  • Goia and Vieu, (2016) Goia, A. and Vieu, P. (2016). An introduction to recent advances in high/infinite dimensional statistics. Journal of Multivariate Analysis, 146:1–6.
  • Hall and Horowitz, (2007) Hall, P. and Horowitz, J. L. (2007). Methodology and convergence rates for functional linear regression. Annals of Statistics, 35:70–91.
  • He and Shi, (1996) He, X. and Shi, P. (1996). Bivariate tensor-product B-spline in a partly linear model. Journal of Multivariate Analysis, 58:162–181.
  • He and Shi, (1998) He, X. and Shi, P. (1998). Monotone B-spline smoothing. Journal of the American Statistical Association, 93:643–650.
  • He et al., (2002) He, X., Zhu, Z., and Fung, W. (2002). Estimation in a semiparametric model for longitudinal data with unspecified dependence structure. Biometrika, 89:579–590.
  • Hobza et al., (2008) Hobza, T., Pardo, L., and Vajda, I. (2008). Robust median estimator in logistic regression. Journal of Statistical Planning and Inference, 138:3822–3840.
  • Horváth and Kokoszka, (2012) Horváth, L. and Kokoszka, P. (2012). Inference for Functional Data with Applications. Springer.
  • Hsing and Eubank, (2015) Hsing, T. and Eubank, R. (2015). Theoretical foundations of Functional Data Analysis with an introduction to Linear Operators, volume 997. John Wiley and Sons.
  • Hubert et al., (2005) Hubert, M., Rousseeuw, P. J., and Vanden Branden, K. (2005). ROBPCA: A new approach to robust principal component analysis. Technometrics, 47:64–79.
  • James, (2002) James, G. M. (2002). Generalized linear models with functional predictors. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64:411–432.
  • Kalogridis, (2023) Kalogridis, I. (2023). Robust and adaptive functional logistic regression. Available at https://arxiv.org/abs/2305.01350.
  • Kalogridis and Van Aelst, (2019) Kalogridis, I. and Van Aelst, S. (2019). Robust functional regression based on principal components. Journal of Multivariate Analysis, 173:393–415.
  • Kalogridis and Van Aelst, (2023) Kalogridis, I. and Van Aelst, S. (2023). Robust penalized estimators for functional linear regression. Journal of Multivariate Analysis, 194:105104.
  • Liebl, (2013) Liebl, D. (2013). Modelling and forecasting electricity spot prices: a functional data perspective. The Annals of Applied Statistics, 7:1562–1592.
  • López-Pintado and Romo, (2009) López-Pintado, S. and Romo, J. (2009). On the concept of depth for functional data. Journal of the American Statistical Association, 104:718–734.
  • Lu, (2015) Lu, M. (2015). Spline estimation of generalised monotonic regression. Journal of Nonparametric Statistics, 27:19–39.
  • Mallat, (2009) Mallat, S. (2009). A Wavelet Tour of Signal Processing: the Sparse Way. Academic Press.
  • Maronna et al., (2019) Maronna, R., Martin, D., Yohai, V., and Salibián-Barrera, M. (2019). Robust Statistics: Theory and Methods (with R). John Wiley and Sons.
  • Maronna and Yohai, (2013) Maronna, R. and Yohai, V. (2013). Robust functional linear regression based on splines. Computational Statistics and Data Analysis, 65:46–55.
  • Marx and Eilers, (1999) Marx, B. D. and Eilers, P. H. C. (1999). Generalized linear regression on sampled signals and curves: A P-spline approach. Technometrics, 41:1–13.
  • Mousavi and Sørensen, (2018) Mousavi, S. N. and Sørensen, H. (2018). Functional logistic regression: A comparison of three methods. Journal of Statistical Computation and Simulation, 88:250–268.
  • Müller, (2005) Müller, H. G. (2005). Functional modelling and classification of longitudinal data. Scandinavian Journal of Statistics, 32:223–240.
  • Müller and Stadtmüller, (2005) Müller, H. G. and Stadtmüller, U. (2005). Generalized functional linear models. Annals of Statistics, 33:774–805.
  • Mutis et al., (2022) Mutis, M., Beyaztas, U., Simsek, G., and Shang, H. (2022). A robust scalar–on–function logistic regression for classification. Communications in Statistics - Theory and Methods, pages 1–17.
  • Pollard, (1984) Pollard, D. (1984). Convergence of Stochastic Processes. Springer Series in Statistics.
  • Powell, (1981) Powell, M. (1981). Approximation Theory and Methods. Cambridge: Cambridge University Press.
  • Qingguo, (2015) Qingguo, T. (2015). Estimation for semi-functional linear regression. Statistics, 49:1262–1278.
  • Ramsay and Silverman, (2002) Ramsay, J. and Silverman, B. (2002). Applied Functional Data Analysis. Methods and Case Studies. Springer.
  • Ramsay and Silverman, (2005) Ramsay, J. and Silverman, B. (2005). Functional Data Analysis, 2nd edition. Springer.
  • Ramsay, (2004) Ramsay, J. O. (2004). Functional data analysis. Encyclopedia of Statistical Sciences.
  • Ratcliffe et al., (2002) Ratcliffe, S. J., Heller, G. Z., and Leader, L. R. (2002). Functional data analysis with application to periodically stimulated foetal heart rate data. II: Functional logistic regression. Statistics in Medicine, 21:1115–1127.
  • Reiss et al., (2017) Reiss, P. T., Goldsmith, J., Shang, H. L., and Ogden, R. T. (2017). Methods for scalar–on–function regression. International Statistical Review, 85:228–249.
  • Reiss et al., (2005) Reiss, P. T., Ogden, R. T., Mann, J., and Parsey, R. V. (2005). Functional logistic regression with PET imaging data: A voxel-level clinical diagnostic tool. Journal of Cerebral Blood Flow and Metabolism, 25:S635–S635.
  • Ronchetti, (1985) Ronchetti, E. (1985). Robust model selection in regression. Statistics and Probability Letters, 3:21–23.
  • Schumaker, (1981) Schumaker, L. (1981). Spline Functions: Basic Theory. Wiley.
  • Schwarz, (1978) Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6:461–464.
  • Sørensen et al., (2013) Sørensen, H., Goldsmith, J., and Sangalli, L. M. (2013). An introduction with medical applications to functional data analysis. Statistics in Medicine, 32:5222–5240.
  • Stone, (1982) Stone, C. (1982). Optimal global rates of convergence for nonparametric regression. Annals of Statistics, 10:1040–1053.
  • Stone, (1985) Stone, C. (1985). Additive regression and other nonparametric models. Annals of Statistics, 13:689–705.
  • Sun and Genton, (2011) Sun, Y. and Genton, M. G. (2011). Functional boxplots. Journal of Computational and Graphical Statistics, 20:316–334.
  • Tharmaratnam and Claeskens, (2013) Tharmaratnam, K. and Claeskens, G. (2013). A comparison of robust versions of the AIC based on M-, S- and MM-estimators. Statistics, 47:216–235.
  • van de Geer, (1988) van de Geer, S. (1988). Regression Analysis and Empirical Processes. CWI Tract 45. Center for Mathematics and Computer Science, Amsterdam. Available at https://ir.cwi.nl/pub/13169.
  • van de Geer, (2000) van de Geer, S. (2000). Empirical Processes in M-Estimation. Cambridge Series in Statistical and Probabilistic Mathematics.
  • van der Vaart and Wellner, (1996) van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. Springer, New York.
  • Wang and Xiang, (2012) Wang, H. and Xiang, S. (2012). On the convergence rates of Legendre approximation. Mathematics of Computation, 81:861–877.
  • Wang et al., (2016) Wang, J. L., Chiou, J., and Müller, H. (2016). Functional data analysis. Annual Review of Statistics and Its Application, 3:257–295.
  • Wang et al., (2017) Wang, X., Nan, B., Zhu, J., Koeppe, R., and Frey, K. (2017). Classification of ADNI PET images via regularized 3D functional data analysis. Biostatistics and Epidemiology, 1:3–19.
  • Zhao et al., (2012) Zhao, Y., Ogden, T., and Reiss, P. (2012). Wavelet-based LASSO in functional linear regression. Journal of Computational and Graphical Statistics, 21:600–617.