
A robust approach for generalized linear models based on maximum $L_q$-likelihood procedure

Felipe Osorio, Departamento de Matemática, Universidad Técnica Federico Santa María, Avenida España 1680, Valparaíso, Chile, [email protected]; Manuel Galea, Departamento de Estadística, Pontificia Universidad Católica de Chile, Avenida Vicuña Mackenna 4860, Santiago, Chile, [email protected]; and Patricia Giménez, Departamento de Matemática, Universidad Nacional de Mar del Plata, Funes 3350, CP 7600, Mar del Plata, Argentina, [email protected]
Abstract.

In this paper we propose a procedure for robust estimation in the context of generalized linear models based on the maximum $L_q$-likelihood method. Alongside this, we present an estimation algorithm that is a natural extension of the iteratively reweighted least squares method commonly used in generalized linear models. The discussion of the asymptotic distribution of the proposed estimator and of a set of statistics for testing linear hypotheses makes it possible to define standardized residuals using the mean-shift outlier model. In addition, robust versions of the deviance function and the Akaike information criterion are defined with the aim of providing tools for model selection. Finally, the performance of the proposed methodology is illustrated through a simulation study and the analysis of a real dataset.

Keywords. Generalized linear model; Influence function; $q$-entropy; Robustness.

Corresponding author. E-mail address: [email protected] (F. Osorio)

1. Introduction

In this work we consider the robust estimation of the regression coefficients in generalized linear models (Nelder and Wedderburn, 1972). This model is defined by assuming that $y_1,\dots,y_n$ are independent random variables, each with density function

f(y_i; \theta_i, \phi) = \exp[\phi\{y_i \theta_i - b(\theta_i)\} + c(y_i, \phi)], \qquad i = 1, \dots, n, \qquad (1.1)

where $\phi > 0$ represents a scale parameter, and $b(\cdot)$ and $c(\cdot,\cdot)$ are known functions. It is also assumed that the systematic part of the model is given by a linear predictor $\eta_i = x_i^\top \beta$, which is related to the natural parameter

\theta_i = k(\eta_i), \qquad i = 1, \dots, n, \qquad (1.2)

by a one-to-one differentiable function $k(\cdot)$, called the $\theta$-link to distinguish it from the conventional link that relates the covariates to the mean of $y_i$. The $\theta$-link function has been used by Thomas and Cook (1989, 1990) to perform local influence analyses in generalized linear models. This allows us to write $\eta = X\beta$, where $X$ is an $n \times p$ design matrix with full column rank and $\beta = (\beta_1, \dots, \beta_p)^\top$ represents the unknown regression coefficients.

A number of authors have considered procedures for robust estimation of regression coefficients in generalized linear models (GLM), with particular emphasis on logistic regression (see Pregibon, 1982; Stefanski et al., 1986; Künsch et al., 1989; Morgenthaler, 1992), whereas Cantoni and Ronchetti (2001) developed an approach for obtaining robust estimators in the quasi-likelihood framework based on the class of $M$-estimators of Mallows' type proposed by Preisser and Qaqish (1999). More recently, Ghosh and Basu (2016) considered an alternative perspective for robust estimation in generalized linear models based on the minimization of the density power divergence introduced by Basu et al. (1998).

To address robust estimation in the framework of generalized linear models, we briefly describe the estimation procedure introduced by Ferrari and Yang (2010) (see also Ferrari and Paterlini, 2009), who proposed the minimization of a $q$-divergence and showed it to be equivalent to the minimization of an empirical version of the Tsallis-Havrda-Charvát entropy (Havrda and Charvát, 1967; Tsallis, 1988), also known as $q$-entropy. An interesting feature of this procedure is that the efficiency of the method depends on a tuning parameter $q$, known as the distortion parameter. It should be stressed that the methodology proposed in this work offers a parametric alternative for robust estimation in generalized linear models which allows, for example, obtaining robust estimators for logistic regression, an extremely relevant input for carrying out binary classification.

Ferrari and Yang (2010) introduced a procedure to obtain robust estimators, known as maximum $L_q$-likelihood estimation, which is defined as

\widehat{\beta}_q^{*} = \underset{\beta \in \mathbb{R}^p}{\operatorname{arg\,max}}\ L_q(\beta), \qquad q > 0,

where

Lqโ€‹(๐œท)=โˆ‘i=1nlqโ€‹(fโ€‹(yi;kโ€‹(๐’™iโŠคโ€‹๐œท),ฯ•)),L_{q}(\mbox{\boldmath$\beta$})=\sum_{i=1}^{n}l_{q}(f(y_{i};k(\mbox{\boldmath$x$}_{i}^{\top}\mbox{\boldmath$\beta$}),\phi)), (1.3)

with $l_q(u)$ denoting the deformed logarithm of order $q$, defined as:

l_q(u) = \begin{cases} (u^{1-q} - 1)/(1-q), & q \neq 1, \\ \log(u), & q = 1. \end{cases}

The function $l_q(\cdot)$ is known in statistics as the Box-Cox transformation. The notation for the objective function defined in (1.3) emphasizes the fact that we are considering $\phi$ as a nuisance parameter. Note that this parameter is in fact known for many models of the exponential family.
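As a quick numerical illustration (ours, not part of the original derivation; the function name `lq` is our own), the deformed logarithm can be coded directly from the definition above:

```python
import math

def lq(u, q):
    """Deformed logarithm of order q: (u^(1-q) - 1)/(1 - q) for q != 1, log(u) at q = 1."""
    if q == 1.0:
        return math.log(u)
    return (u ** (1.0 - q) - 1.0) / (1.0 - q)
```

For $q$ near 1 the two branches agree closely, e.g. `lq(2.0, 0.999)` differs from $\log 2$ only in the fourth decimal, which is the limiting behavior invoked below when recovering the maximum likelihood case.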

It should be stressed that, in general, the maximum $L_q$-likelihood estimator $\widehat{\beta}_q^{*}$ is not Fisher consistent for the true parameter value $\beta_0$. An appropriate transformation $\tau_q$ yielding a Fisher-consistent version of the maximum $L_q$-likelihood estimator is considered in the next section. Usually, the maximum $L_q$-likelihood estimator $\widehat{\beta}_q^{*}$ is obtained by solving the system of equations $\Psi_n(\beta) = 0$, which assumes the form

\Psi_n(\beta) = \sum_{i=1}^n \frac{\partial}{\partial \beta}\, l_q(f(y_i; k(x_i^\top \beta), \phi)) = \sum_{i=1}^n \{f(y_i; k(x_i^\top \beta), \phi)\}^{1-q}\, \frac{\partial}{\partial \beta} \log f(y_i; k(x_i^\top \beta), \phi).

Therefore, $\Psi_n(\beta)$ can be written as an estimating function associated with the solution of a weighted likelihood, as follows

\Psi_n(\beta) = \sum_{i=1}^n \Psi_i(\beta) = \sum_{i=1}^n U_i(\beta)\, s_i(\beta), \qquad (1.4)

with $U_i(\beta) = \{f(y_i; k(x_i^\top \beta), \phi)\}^{1-q}$ being a weight based on a power of the density function and $s_i(\beta) = \partial \log f(y_i; k(x_i^\top \beta), \phi)/\partial \beta$ denoting the score function associated with the $i$th observation ($i = 1, \dots, n$). Interestingly, for $q \neq 1$, the estimation procedure attributes weights $U_i(\beta)$ to each of the observations depending on their probability of occurrence. Thus, observations that disagree with the postulated model receive a small weight in the estimation procedure as long as $q < 1$, while if $q > 1$ the importance of observations whose density is close to zero is accentuated. It should be noted that the maximum $L_q$-likelihood estimation procedure is a generalization of the maximum likelihood method. Indeed, it is easy to see that as $q \to 1$, $l_q(u) \to \log(u)$, and hence $\widehat{\beta}_q^{*}$ converges to the maximum likelihood estimator of $\beta$.
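To make the downweighting mechanism concrete, here is a small sketch (our own illustration, using a hypothetical Poisson(3) working model in place of $f(y_i; k(x_i^\top\beta), \phi)$) of the weights $U_i = f^{1-q}$ for $q = 0.8$:

```python
import math

def poisson_pmf(y, mu):
    """Poisson density, standing in for the working-model density f."""
    return math.exp(-mu) * mu ** y / math.factorial(y)

q, mu = 0.8, 3.0
ys = [2, 3, 4, 15]                                 # 15 is grossly discrepant under Poisson(3)
U = [poisson_pmf(y, mu) ** (1.0 - q) for y in ys]  # weights U_i = f(y_i)^{1-q}
```

The observation $y = 15$ receives a weight an order of magnitude smaller than the observations near the mean, which is exactly the downweighting described above for $q < 1$.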

When $q$ is fixed, the maximum $L_q$-likelihood estimator $\widehat{\beta}_q^{*}$ belongs to the class of $M$-estimators, and we can use the results available in Hampel et al. (1986) to study the statistical properties of this kind of estimator. Since such estimators are defined by optimizing the objective function given in (1.3), we can also use the very general setting of extremum estimation (Gourieroux and Monfort, 1995) to obtain their asymptotic properties. A further alternative is to consider $\widehat{\beta}_q^{*}$ as the solution of an inference function and invoke the asymptotic normality results developed for that class of estimators (see, for instance, Yuan and Jennrich, 1998). In particular, Ferrari and Yang (2010) derived the asymptotic properties of the maximum $L_q$-likelihood estimator for distributions belonging to the exponential family. An interesting result of that work is that the asymptotic behavior of this estimator is characterized by the distortion parameter $q$.

Although there are relatively few applications of this methodology in the statistical literature, we can highlight that the procedure has been extended to handle models for incomplete data (Qin and Priebe, 2013), to carry out hypothesis tests using a generalized version of the likelihood ratio statistic (Huang et al., 2013; Qin and Priebe, 2017), as well as to address robust estimation in measurement error models (Giménez et al., 2022a, b) and beta regression (Ribeiro and Ferrari, 2023).

The remainder of the paper unfolds as follows. We begin by describing the estimation procedure in GLM based on maximum $L_q$-likelihood, discussing its asymptotic properties and some methods for the selection of the distortion parameter. Statistics for testing linear hypotheses, model selection methods and the definition of standardized residuals are discussed in Section 3, whereas Section 4 illustrates the good performance of our proposal by analyzing a real data set as well as a Monte Carlo simulation experiment. The Appendices include proofs of all the theoretical results.

2. Robust estimation in GLM based on maximum $L_q$-likelihood

2.1. Estimation procedure and asymptotic properties

The maximum $L_q$-likelihood estimator for $\beta$ in the class of generalized linear models is a solution of the estimating equation:

\Psi_n(\beta) = \phi \sum_{i=1}^n U_i(\beta)\, W_i^{1/2}\, \frac{(y_i - \mu_i)}{\sqrt{V_i}}\, x_i, \qquad (2.1)

with

Uiโ€‹(๐œท)=expโก{(1โˆ’q)โ€‹ฯ•โ€‹[yiโ€‹kโ€‹(๐’™iโŠคโ€‹๐œท)โˆ’bโ€‹(kโ€‹(๐’™iโŠคโ€‹๐œท))]+cqโ€‹(yi,ฯ•)},U_{i}(\mbox{\boldmath$\beta$})=\exp\{(1-q)\phi[y_{i}k(\mbox{\boldmath$x$}_{i}^{\top}\mbox{\boldmath$\beta$})-b(k(\mbox{\boldmath$x$}_{i}^{\top}\mbox{\boldmath$\beta$}))]+c_{q}(y_{i},\phi)\}, (2.2)

corresponds to the weight arising from the estimating function defined in (1.4), with $c_q(y_i, \phi) = (1-q)c(y_i, \phi)$, while $\mu_i = \operatorname{E}(y_i) = \dot{b}(\theta_i)$ and $\operatorname{var}(y_i) = \phi^{-1} V_i$, with $V_i = \ddot{b}(\theta_i)$ being the variance function, and $W_i = V_i\{\dot{k}(\eta_i)\}^2$. We use dots over functions to denote derivatives, $\dot{k}(u) = \operatorname{d}k/\operatorname{d}u$ and $\ddot{k}(u) = \operatorname{d}^2 k/\operatorname{d}u^2$. In this paper, we consider $q < 1$, in which case the weights $U_i(\beta)$ ($i = 1, \dots, n$) provide a mechanism for downweighting outlying observations. The individual score function adopts the form $s_i(\beta) = \phi W_i^{1/2} r_i x_i$, where $r_i = (y_i - \mu_i)/\sqrt{V_i}$ denotes the Pearson residual. It is straightforward to see that the estimating function for $\beta$ can be written as:

\Psi_n(\beta) = \phi\, X^\top W^{1/2} U V^{-1/2}(Y - \mu),

where ๐‘ผ=diagโก(U1,โ€ฆ,Un)\mbox{\boldmath$U$}=\operatorname{diag}(U_{1},\dots,U_{n}), with Ui=Uiโ€‹(๐œท)U_{i}=U_{i}(\mbox{\boldmath$\beta$}) is given in (2.2), while ๐‘พ=diagโก(W1,โ€ฆ,Wn)\mbox{\boldmath$W$}=\operatorname{diag}(W_{1},\dots,W_{n}), ๐‘ฝ=diagโก(V1,โ€ฆ,Vn)\mbox{\boldmath$V$}=\operatorname{diag}(V_{1},\dots,V_{n}), ๐’€=(y1,โ€ฆ,yn)โŠค\mbox{\boldmath$Y$}=(y_{1},\dots,y_{n})^{\top} and ๐=(ฮผ1,โ€ฆ,ฮผn)โŠค\mbox{\boldmath$\mu$}=(\mu_{1},\dots,\mu_{n})^{\top}.

In Appendix A we show that the estimating function in (2.1) is unbiased for the surrogate parameter $\beta_*$, that is, $\operatorname{E}_0\{\Psi_n(\beta_*)\} = 0$, where $\operatorname{E}_0(\cdot)$ denotes expectation with respect to the true distribution $f(\cdot; \theta_0, \phi)$. This allows introducing a suitable calibration function, say $\tau_q(\cdot)$, such that $\widehat{\beta}_q = \tau_q(\widehat{\beta}_q^{*})$, which rescales the parameter estimates in order to achieve Fisher consistency for $\beta_0$.

When partial derivatives and expectations exist, we define the matrices

A_n(\beta) = \operatorname{E}_0\{\Psi_n(\beta)\,\Psi_n^\top(\beta)\} = \frac{\phi}{2-q}\, X^\top W J X, \qquad (2.3)
B_n(\beta) = \operatorname{E}_0\Big\{-\frac{\partial \Psi_n(\beta)}{\partial \beta^\top}\Big\} = \phi\, X^\top W J G K X, \qquad (2.4)

where ๐‘ฑ=diagโก(J1,โ€ฆ,Jn)\mbox{\boldmath$J$}=\operatorname{diag}(J_{1},\dots,J_{n}) whose diagonal elements are given by Ji=Jqโˆ’ฯ•โ€‹(ฮธi)J_{i}=J_{q}^{-\phi}(\theta_{i}), for i=1,โ€ฆ,ni=1,\dots,n, with Jqโ€‹(ฮธ)=expโก{โˆ’qโ€‹bโ€‹(ฮธ)+bโ€‹(qโ€‹ฮธ)}J_{q}(\theta)=\exp\{-qb(\theta)+b(q\theta)\}, qโ€‹ฮธโˆˆฮ˜q\theta\in\Theta was defined by Menendez (2000), whereas ๐‘ฎ=diagโก(gห™โ€‹(ฮธ1โˆ—),โ€ฆ,gห™โ€‹(ฮธnโˆ—))\mbox{\boldmath$G$}=\operatorname{diag}(\dot{g}(\theta_{1}^{*}),\dots,\dot{g}(\theta_{n}^{*})) and ๐‘ฒ=diagโก(kห™โ€‹(ฮท1),โ€ฆ,kห™โ€‹(ฮทn))\mbox{\boldmath$K$}=\operatorname{diag}(\dot{k}(\eta_{1}),\dots,\dot{k}(\eta_{n})) with dโกฮทโˆ—/dโกฮท=(1/q)โ€‹gห™โ€‹(ฮธโˆ—)โ€‹kห™โ€‹(ฮท){\operatorname{d}}\eta_{*}/{\operatorname{d}}\eta=(1/q)\dot{g}(\theta_{*})\dot{k}(\eta) and gโ€‹(ฮธโˆ—)=kโˆ’1โ€‹(ฮธโˆ—)g(\theta_{*})=k^{-1}(\theta_{*}) corresponding to the inverse function of the ฮธ\theta-link evaluated on the surrogate parameter. We propose to use the Newton-scoring algorithm (Jรธrgensen and Knudsen, 2004) to address the parameter estimation associated with the estimating equation ๐šฟnโ€‹(๐œท)=๐ŸŽ\mbox{\boldmath$\Psi$}_{n}(\mbox{\boldmath$\beta$})=\mbox{\boldmath$0$}, which assumes the form

\beta^{(t+1)} = \beta^{(t)} + B_n^{-1}(\beta^{(t)})\, \Psi_n(\beta^{(t)}), \qquad t = 0, 1, \dots

Therefore, the estimation procedure adopts an iteratively reweighted least squares structure, as

\beta^{(t+1)} = (X^\top W^{(t)} J^{(t)} G^{(t)} K^{(t)} X)^{-1} X^\top W^{(t)} J^{(t)} G^{(t)} K^{(t)} Z_q^{(t)}, \qquad (2.5)

where ๐’q=๐œผ+๐‘ฑโˆ’1โ€‹๐‘ฎโˆ’1โ€‹๐‘ฒโˆ’1โ€‹๐‘พโˆ’1/2โ€‹๐‘ผโ€‹๐‘ฝโˆ’1/2โ€‹(๐’€โˆ’๐)\mbox{\boldmath$Z$}_{q}=\mbox{\boldmath$\eta$}+\mbox{\boldmath$J$}^{-1}\mbox{\boldmath$G$}^{-1}\mbox{\boldmath$K$}^{-1}\mbox{\boldmath$W$}^{-1/2}\mbox{\boldmath$UV$}^{-1/2}(\mbox{\boldmath$Y$}-\mbox{\boldmath$\mu$}) denotes the working response which must be evaluated at ๐œท=๐œท(t)\mbox{\boldmath$\beta$}=\mbox{\boldmath$\beta$}^{(t)}. Let ๐œท^qโˆ—\widehat{\mbox{\boldmath$\beta$}}{}_{q}^{*} be the value obtained at the convergence of the algorithm in (2.5). Thus, in order to obtain a corrected version of the maximum Lqq-likelihood estimator, we must take

\widehat{\eta}_q = k^{-1}(q\, k(\widehat{\eta}_q^{*})), \qquad \textrm{with} \qquad \widehat{\eta}_q^{*} = X \widehat{\beta}_q^{*}. \qquad (2.6)

Evidently, for the canonical $\theta$-link, $k(u) = u$, it is sufficient to consider $\widehat{\eta}_q = q\,\widehat{\eta}_q^{*}$, or equivalently $\widehat{\beta}_q = q\,\widehat{\beta}_q^{*}$. In addition, in this case we have $GK = I_n$, with $I_n$ the identity matrix of order $n$, which leads to a considerably simpler version of the algorithm proposed in (2.5). Qin and Priebe (2017) suggested another alternative to achieve Fisher consistency, based on re-centering the estimating equation with a bias-correction term, which they termed the bias-corrected maximum $L_q$-likelihood estimator. We emphasize the simplicity of the calibration function in (2.6), which introduces a reparameterization that allows us to characterize the asymptotic distribution of $\widehat{\beta}_q$.
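To fix ideas, the following sketch (our own minimal reading of the iteration, not reference code; the name `mlq_poisson` is ours) implements (2.5) for a Poisson model with canonical $\theta$-link, for which $\phi = 1$, $\mu_i = e^{\eta_i}$, $W_i = V_i = \mu_i$, $J_i = \exp\{q\mu_i - \mu_i^q\}$ and $GK = I_n$, followed by the calibration $\widehat{\beta}_q = q\,\widehat{\beta}_q^{*}$:

```python
import numpy as np

def mlq_poisson(X, y, q, n_iter=100, tol=1e-10):
    """Iteratively reweighted least squares (2.5) for Poisson regression with
    the canonical theta-link k(u) = u; returns the calibrated estimate q * beta*."""
    n, p = X.shape
    beta = np.zeros(p)
    log_fact = np.array([sum(np.log(np.arange(1, yi + 1))) for yi in y])
    for _ in range(n_iter):
        eta = X @ beta                  # theta_i = eta_i under the canonical theta-link
        mu = np.exp(eta)                # mean, with V_i = W_i = mu_i
        J = np.exp(q * mu - mu ** q)    # J_i = J_q^{-1}(theta_i) = exp{q mu_i - mu_i^q}
        logf = y * eta - mu - log_fact  # log f(y_i; theta_i)
        U = np.exp((1.0 - q) * logf)    # weights U_i = f^{1-q}
        z = eta + U * (y - mu) / (J * mu)   # working response Z_q (G = K = I_n)
        WJ = mu * J
        beta_new = np.linalg.solve(X.T @ (WJ[:, None] * X), X.T @ (WJ * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    return q * beta                     # calibration (2.6) for the canonical link
```

For $q = 1$ all weights reduce to $U_i = J_i = 1$ and the code collapses to ordinary Fisher scoring; for an intercept-only design it then reproduces the closed-form maximum likelihood estimate $\log \bar{y}$.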

Remark 2.1.

For $q = 1$, it is easy to see that in the formula given in (2.5) we obtain $J = I_n$ and $U = I_n$, thereby recovering the Fisher-scoring method for maximum likelihood estimation in generalized linear models (see, for instance, Green, 1984).

The following result describes the asymptotic normality of the $\widehat{\beta}_q$ estimator in the context of generalized linear models.

Proposition 2.2.

For the generalized linear model defined in (1.1) and (1.2), and under the assumptions of Property 24.16 of Gourieroux and Monfort (1995), it follows that

(B_n^{-1} A_n B_n^{-1})^{-1/2}\, \sqrt{n}\,(\widehat{\beta}_q - \beta_0) \stackrel{D}{\longrightarrow} \mathsf{N}_p(0, I_p),

where ๐€n=๐€nโ€‹(๐›ƒ0)\mbox{\boldmath$A$}_{n}=\mbox{\boldmath$A$}_{n}(\mbox{\boldmath$\beta$}_{0}) and ๐n=๐nโ€‹(๐›ƒ0)\mbox{\boldmath$B$}_{n}=\mbox{\boldmath$B$}_{n}(\mbox{\boldmath$\beta$}_{0}). That is, nโ€‹(๐›ƒ^qโˆ’๐›ƒ0)\sqrt{n}(\widehat{\mbox{\boldmath$\beta$}}_{q}-\mbox{\boldmath$\beta$}_{0}) asymptotically follows a pp-variate normal distribution with mean vector ๐ŸŽ and covariance matrix,

B_n^{-1} A_n B_n^{-1} = \frac{\phi^{-1}}{2-q}\, (X^\top W J G K X)^{-1} X^\top W J X\, (X^\top W J G K X)^{-1}.
Remark 2.3.

It is straightforward to see that for the canonical $\theta$-link $k(u) = u$, the asymptotic covariance matrix of the $\widehat{\beta}_q$ estimator adopts a rather simple form, namely:

B_n^{-1} A_n B_n^{-1} = \frac{\phi^{-1}}{2-q}\, (X^\top W J X)^{-1}.
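As a sketch of Remark 2.3 (our own helper, assuming the diagonal weights $W$ and $J$ have already been evaluated at the estimate), the covariance above is a one-liner:

```python
import numpy as np

def acov_canonical(X, W, J, phi, q):
    """Asymptotic covariance (phi^{-1}/(2-q)) (X' diag(W J) X)^{-1} for the
    canonical theta-link; W and J are vectors of diagonal weights."""
    M = X.T @ ((W * J)[:, None] * X)
    return np.linalg.inv(M) / (phi * (2.0 - q))
```

At $q = 1$ (where $J = I_n$) this reduces to the usual inverse Fisher information $\phi^{-1}(X^\top W X)^{-1}$.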

2.2. Influence function and selection of the distortion parameter $q$

From the general theory of robustness, we can characterize a slight misspecification of the proposed model through the influence function of a statistical functional $T(F)$, which is defined as (see Hampel et al., 1986, Sec. 4.2)

IFโก(Y;๐‘ป)=limฯตโ†’0๐‘ปโ€‹(Fฯต)โˆ’๐‘ปโ€‹(F)ฯต=โˆ‚๐‘ปโ€‹(Fฯต)โˆ‚ฯต|ฯต=0,\operatorname{IF}(Y;\mbox{\boldmath$T$})=\lim_{\epsilon\to 0}\frac{\mbox{\boldmath$T$}(F_{\epsilon})-\mbox{\boldmath$T$}(F)}{\epsilon}=\frac{\partial\mbox{\boldmath$T$}(F_{\epsilon})}{\partial\epsilon}\Big{|}_{\epsilon=0},

where $F_\epsilon = (1-\epsilon)F + \epsilon\,\delta_Y$, with $\delta_Y$ a point-mass distribution at $Y$. That is, $\operatorname{IF}(Y; T)$ reflects the effect on $T$ of a contamination by adding an observation at $Y$. It should be noted that the maximum $L_q$-likelihood estimation procedure for fixed $q$ corresponds to an $M$-estimation method, where $q$ acts as a tuning constant. It is well known that the influence functions of the estimators obtained by the maximum likelihood and maximum $L_q$-likelihood methods adopt the form

IFโก(Y;๐œท^ML)=๐“•nโˆ’1โ€‹๐’”โ€‹(๐œท),IFqโก(Y;๐œท^q)=๐‘ฉnโˆ’1โ€‹{fโ€‹(y;kโ€‹(๐’™โŠคโ€‹๐œท),ฯ•)}1โˆ’qโ€‹๐’”โ€‹(๐œท),\operatorname{IF}(Y;\widehat{\mbox{\boldmath$\beta$}}_{\mathrm{ML}})=\mbox{\boldmath$\mathcal{F}$}_{n}^{-1}\mbox{\boldmath$s$}(\mbox{\boldmath$\beta$}),\qquad\operatorname{IF}_{q}(Y;\widehat{\mbox{\boldmath$\beta$}}_{q})=\mbox{\boldmath$B$}_{n}^{-1}\{f(y;k(\mbox{\boldmath$x$}^{\top}\mbox{\boldmath$\beta$}),\phi)\}^{1-q}\mbox{\boldmath$s$}(\mbox{\boldmath$\beta$}),

respectively, where $\mathcal{F}_n = \operatorname{E}\{-\partial s_n(\beta)/\partial \beta^\top\} = \phi\, X^\top W X$ is the Fisher information matrix, with $s_n(\beta) = \sum_{i=1}^n s_i(\beta)$ the score function, whereas $\{f(y; k(x^\top \beta), \phi)\}^{1-q}$ corresponds to the weight defined in (2.2) and $s(\beta) = \phi W^{1/2}(y - \mu)x/\sqrt{V}$, with $x = (x_1, \dots, x_p)^\top$. It can be appreciated that when $q < 1$ the weights $U_i(\beta)$ provide a mechanism to limit the influence of extreme observations, downweighting those observations that are strongly discrepant from the proposed model. The properties of robust procedures based on weighting observations through the working statistical model have been discussed, for example, in Windham (1995) and Basu et al. (1998), who highlighted the connection between these procedures and robust methods (see also Ferrari and Yang, 2010). Indeed, we can write the asymptotic covariance of $\widehat{\beta}_q$ equivalently as

Covโก(๐œท^q)=E0โก{IFqโก(Y;๐œท^q)โ€‹IFqโŠคโก(Y;๐œท^q)}.\operatorname{Cov}(\widehat{\mbox{\boldmath$\beta$}}_{q})=\operatorname{E}_{0}\{\operatorname{IF}_{q}(Y;\widehat{\mbox{\boldmath$\beta$}}_{q})\operatorname{IF}_{q}^{\top}(Y;\widehat{\mbox{\boldmath$\beta$}}_{q})\}.

Additionally, it is known that $B_n^{-1} A_n B_n^{-1} - \mathcal{F}_n^{-1}$ is a positive semidefinite matrix. That is, the asymptotic covariance matrix of the $\widehat{\beta}_q$ estimator is always larger than that of $\widehat{\beta}_{\mathrm{ML}}$ under the working model defined in (1.1) and (1.2). One way to quantify the loss in efficiency of the maximum $L_q$-likelihood estimator is to consider the asymptotic relative efficiency of $\widehat{\beta}_q$ with respect to $\widehat{\beta}_{\mathrm{ML}}$, defined as

AREโก(๐œท^q,๐œท^ML)=trโก(๐“•nโˆ’1)/trโก(๐‘ฉnโˆ’1โ€‹๐‘จnโ€‹๐‘ฉnโˆ’1).\operatorname{ARE}(\widehat{\mbox{\boldmath$\beta$}}_{q},\widehat{\mbox{\boldmath$\beta$}}_{\mathrm{ML}})=\operatorname{tr}(\mbox{\boldmath$\mathcal{F}$}_{n}^{-1})/\operatorname{tr}(\mbox{\boldmath$B$}_{n}^{-1}\mbox{\boldmath$A$}_{n}\mbox{\boldmath$B$}_{n}^{-1}). (2.7)

Hence, $\operatorname{ARE}(\widehat{\beta}_q, \widehat{\beta}_{\mathrm{ML}}) \leq 1$. This leads to a procedure for the selection of the distortion parameter $q$. Following the same reasoning as Windham (1995), an alternative is to choose $q$ from the available data as the value $q_{\mathrm{opt}}$ that minimizes $\operatorname{tr}(B_n^{-1} A_n B_n^{-1})$. In other words, the interest is in selecting the model for which the loss in efficiency is smallest. This procedure has also been used by Qin and Priebe (2017) for adaptive selection of the distortion parameter.

In this paper we follow the guidelines given by La Vecchia et al. (2015), who proposed a data-driven method for the selection of $q$ (see also Ribeiro and Ferrari, 2023). In this procedure the authors introduced the concept of a stability condition as a mechanism for determining the distortion parameter: considering an equally spaced grid, say $1 \geq q_1 > q_2 > \cdots > q_m$, the estimate $\widehat{\beta}_{q_j}$ is computed for each value in the grid, $j = 1, \dots, m$. To achieve robustness we can select an optimal value $q_{\mathrm{opt}}$ as the greatest value $q_j$ satisfying $QV_j < \rho$, where $QV_j = \|\widehat{\beta}_{q_j} - \widehat{\beta}_{q_{j+1}}\|$, with $\rho$ a threshold value. Based on experimental results, La Vecchia et al. (2015) proposed constructing a grid between $q_1 = 1$ and some minimum value $q_m$ with increments of 0.01, with $\rho = 0.05\,\|\widehat{\beta}_{q_m}\|$. One of the main advantages of their procedure is that, in the absence of contamination, it selects a distortion parameter as close to 1 as possible; that is, it leads to the maximum likelihood estimator. Yet another alternative, which has shown remarkable performance for $q$-selection based on the parametric bootstrap, is reported by Giménez et al. (2022a, b).
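The grid search just described can be sketched as follows (our own illustration; `betas` is assumed to already hold the fitted coefficient vectors $\widehat{\beta}_{q_1}, \dots, \widehat{\beta}_{q_m}$ computed on the decreasing grid):

```python
import numpy as np

def select_q(qs, betas, rho):
    """Return the greatest q_j on the grid 1 >= q_1 > ... > q_m whose stability
    measure QV_j = ||beta_{q_j} - beta_{q_{j+1}}|| falls below the threshold rho."""
    for j in range(len(qs) - 1):
        if np.linalg.norm(betas[j] - betas[j + 1]) < rho:
            return qs[j]
    return qs[-1]   # fall back to the most robust fit on the grid
```

With the threshold recommended above, one would call it as `select_q(qs, betas, 0.05 * np.linalg.norm(betas[-1]))`.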

Remark 2.4.

An additional constraint that the selected value of $q$ must satisfy, which follows from the definition of the function $J_q(\theta)$, is that $q\theta$ must lie in the parameter space $\Theta$. This allows, for example, refining the grid of search values for the procedures proposed by La Vecchia et al. (2015) or Giménez et al. (2022a).

3. Goodness of fit and hypothesis testing

3.1. Robust deviance and model selection

The assessment of goodness of fit in generalized linear models under the estimation method based on the $L_q$-likelihood function requires appropriate modifications to obtain robust extensions of the deviance function. Consider the parameter vector partitioned as $\beta = (\beta_1^\top, \beta_2^\top)^\top$ and suppose that our interest is in testing the hypothesis $H_0: \beta_1 = 0$, where $\beta_1 \in \mathbb{R}^r$. Thus, we can evaluate the discrepancy between the model defined by the null hypothesis and the model under $H_1: \beta_1 \neq 0$ using the proposal of Qin and Priebe (2017), as

D_q(Y, \widehat{\mu}) = 2\{L_q(\widehat{\beta}_q) - L_q(\widetilde{\beta}_q)\}
= 2\sum_{i=1}^n \{l_q(f(y_i; k(x_i^\top \widehat{\beta}_q), \phi)) - l_q(f(y_i; k(x_i^\top \widetilde{\beta}_q), \phi))\}, \qquad (3.1)

where ๐œท^q\widehat{\mbox{\boldmath$\beta$}}_{q} and ๐œท~q\widetilde{\mbox{\boldmath$\beta$}}_{q} are the corrected maximum Lqq-likelihood estimates for ๐œท\beta obtained under H0H_{0} and H1H_{1}, respectively, and using these estimates we have that ฮผ~i=ฮผiโ€‹(๐œท~q)\widetilde{\mu}_{i}=\mu_{i}(\widetilde{\mbox{\boldmath$\beta$}}_{q}) and ฮผ^i=ฮผiโ€‹(๐œท^q)\widehat{\mu}_{i}=\mu_{i}(\widehat{\mbox{\boldmath$\beta$}}_{q}), for i=1,โ€ฆ,ni=1,\dots,n. Qin and Priebe (2017) based on results available in Cantoni and Ronchetti (2001), proved that the Dqโ€‹(๐’€,๐^)D_{q}(\mbox{\boldmath$Y$},\widehat{\mbox{\boldmath$\mu$}}) statistic has an asymptotic distribution following a discrete mixture of chi-squared random variables with one degree of freedom. We should note that the correction term proposed in Qin and Priebe (2017) in our case zeroed for the surrogate parameter.

Following Ronchetti (1997), we can introduce an Akaike information criterion based on the $L_q$-estimation procedure, as follows

\operatorname{AIC}_q = -2\sum_{i=1}^n l_q(f(y_i; k(x_i^\top \widehat{\beta}_q), \phi)) + 2\operatorname{tr}(\widehat{B}_n^{-1}\widehat{A}_n),

where ๐œท^q\widehat{\mbox{\boldmath$\beta$}}_{q} corresponds to the maximum Lqq-likelihood estimator obtained from the algorithm defined in (2.5). Using the definitions of ๐‘จn\mbox{\boldmath$A$}_{n} and ๐‘ฉn\mbox{\boldmath$B$}_{n}, given in (2.3) and (2.4) leads to,

AICq=โˆ’2โ€‹โˆ‘i=1nlqโ€‹(fโ€‹(yi;kโ€‹(๐’™iโŠคโ€‹๐œท^q),ฯ•))+22โˆ’qโ€‹trโก{(๐‘ฟโŠคโ€‹๐‘พ^โ€‹๐‘ฑ^โ€‹๐‘ฎ^โ€‹๐‘ฒ^โ€‹๐‘ฟ)โˆ’1โ€‹๐‘ฟโŠคโ€‹๐‘พ^โ€‹๐‘ฑ^โ€‹๐‘ฟ}.\operatorname{AIC}_{q}=-2\sum_{i=1}^{n}l_{q}(f(y_{i};k(\mbox{\boldmath$x$}_{i}^{\top}\widehat{\mbox{\boldmath$\beta$}}_{q}),\phi))+\frac{2}{2-q}\operatorname{tr}\{(\mbox{\boldmath$X$}^{\top}\widehat{\mbox{\boldmath$W$}}\widehat{\mbox{\boldmath$J$}}\widehat{\mbox{\boldmath$G$}}\widehat{\mbox{\boldmath$K$}}\mbox{\boldmath$X$})^{-1}\mbox{\boldmath$X$}^{\top}\widehat{\mbox{\boldmath$W$}}\widehat{\mbox{\boldmath$J$}}\mbox{\boldmath$X$}\}.

Evidently, when the canonical ฮธ\theta-link is considered, the penalty term reduces to 2โ€‹trโก(๐‘ฉ^nโˆ’1โ€‹๐‘จ^n)=2โ€‹p/(2โˆ’q)2\operatorname{tr}(\widehat{\mbox{\boldmath$B$}}_{n}^{-1}\widehat{\mbox{\boldmath$A$}}_{n})=2p/(2-q), and as qโ†’1q\to 1 we recover the usual Akaike information criterion, AIC\operatorname{AIC}, for model selection in the context of maximum likelihood estimation. Some authors (see, for instance, Ghosh and Basu, 2016) have suggested using AICq\operatorname{AIC}_{q} as an alternative mechanism for the selection of the tuning parameter, i.e., the distortion parameter qq.
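Numerically, the penalty term can be evaluated as a trace of a matrix quotient. The matrices below are synthetic stand-ins for the estimates of ๐‘จn and ๐‘ฉn (not output of any fitted model); under the canonical link ๐‘จn is proportional to ๐‘ฉn, so the trace collapses to p/(2 − q).

```python
import numpy as np

def aic_q_penalty(A_hat, B_hat):
    """Penalty term 2*tr(B_n^{-1} A_n) of AIC_q."""
    return 2.0 * np.trace(np.linalg.solve(B_hat, A_hat))

# Canonical-link case: A_n = B_n/(2 - q), hence penalty = 2p/(2 - q).
rng = np.random.default_rng(0)
p, q = 3, 0.9
M = rng.standard_normal((p, p))
B_hat = M @ M.T + p * np.eye(p)   # symmetric positive definite stand-in
A_hat = B_hat / (2.0 - q)
penalty = aic_q_penalty(A_hat, B_hat)   # equals 2*3/(2 - 0.9)
```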

3.2. Hypothesis testing and residual analysis

A robust alternative for conducting hypothesis testing in the context of maximum Lqq-likelihood estimation has been proposed by Qin and Priebe (2017), who studied the properties of Lqq-likelihood ratio type statistics for simple hypotheses. As suggested by Equation (3.1), this type of development allows, for example, the evaluation of the fitted model. In this section we focus on testing linear hypotheses of the type

H0:๐‘ฏโ€‹๐œท0=๐’‰,againstH1:๐‘ฏโ€‹๐œท0โ‰ ๐’‰,H_{0}:\mbox{\boldmath$H\beta$}_{0}=\mbox{\boldmath$h$},\qquad\textrm{against}\qquad H_{1}:\mbox{\boldmath$H\beta$}_{0}\neq\mbox{\boldmath$h$}, (3.2)

where ๐‘ฏH es a known matrix of order dร—pd\times p, with rankโก(๐‘ฏ)=d\operatorname{rank}(\mbox{\boldmath$H$})=d (dโ‰คp)(d\leq p) and ๐’‰โˆˆโ„d\mbox{\boldmath$h$}\in\mathbb{R}^{d}. Wald, score-type and bilinear form (Crudu and Osorio, 2020) statistics for testing the hypothesis in (3.2) are given by the following result.

Proposition 3.1.

Under the assumptions of Properties 24.10 and 24.16 in Gourieroux and Monfort (1995), assuming that ๐€n\mbox{\boldmath$A$}_{n} and ๐n\mbox{\boldmath$B$}_{n} converge almost surely to matrices ๐€A and ๐B, respectively, and under H0H_{0}, the Wald, score-type and bilinear form statistics given by

Wn\displaystyle W_{n} =nโ€‹(๐‘ฏโ€‹๐œท^qโˆ’๐’‰)โŠคโ€‹(๐‘ฏโ€‹๐‘ฉ^โ€‹๐‘จ^โˆ’1โ€‹๐‘ฉ^โ€‹๐‘ฏโŠคโˆ’1)โˆ’1โ€‹(๐‘ฏโ€‹๐œท^qโˆ’๐’‰),\displaystyle=n(\mbox{\boldmath$H$}\widehat{\mbox{\boldmath$\beta$}}_{q}-\mbox{\boldmath$h$})^{\top}(\mbox{\boldmath$H$}\widehat{\mbox{\boldmath$B$}}{}^{-1}\widehat{\mbox{\boldmath$A$}}\widehat{\mbox{\boldmath$B$}}{}^{-1}\mbox{\boldmath$H$}^{\top})^{-1}(\mbox{\boldmath$H$}\widehat{\mbox{\boldmath$\beta$}}_{q}-\mbox{\boldmath$h$}), (3.3)
Rn\displaystyle R_{n} =1nโ€‹๐šฟnโŠคโ€‹(๐œท~q)โ€‹๐‘ฉ~โˆ’1โ€‹๐‘ฏโŠคโ€‹(๐‘ฏโ€‹๐‘ฉ~โ€‹๐‘จ~โˆ’1โ€‹๐‘ฉ~โ€‹๐‘ฏโŠคโˆ’1)โˆ’1โ€‹๐‘ฏโ€‹๐‘ฉ~โˆ’1โ€‹๐šฟnโ€‹(๐œท~q),and\displaystyle=\frac{1}{n}\mbox{\boldmath$\Psi$}_{n}^{\top}(\widetilde{\mbox{\boldmath$\beta$}}_{q})\widetilde{\mbox{\boldmath$B$}}^{-1}\mbox{\boldmath$H$}^{\top}(\mbox{\boldmath$H$}\widetilde{\mbox{\boldmath$B$}}{}^{-1}\widetilde{\mbox{\boldmath$A$}}\widetilde{\mbox{\boldmath$B$}}{}^{-1}\mbox{\boldmath$H$}^{\top})^{-1}\mbox{\boldmath$H$}\widetilde{\mbox{\boldmath$B$}}^{-1}\mbox{\boldmath$\Psi$}_{n}(\widetilde{\mbox{\boldmath$\beta$}}_{q}),\quad\textrm{and} (3.4)
Bโ€‹Fn\displaystyle BF_{n} =๐šฟnโŠคโ€‹(๐œท~q)โ€‹๐‘ฉ~โˆ’1โ€‹๐‘ฏโŠคโ€‹(๐‘ฏโ€‹๐‘ฉ^โ€‹๐‘จ^โˆ’1โ€‹๐‘ฉ^โ€‹๐‘ฏโŠคโˆ’1)โˆ’1โ€‹(๐‘ฏโ€‹๐œท^qโˆ’๐’‰),\displaystyle=\mbox{\boldmath$\Psi$}_{n}^{\top}(\widetilde{\mbox{\boldmath$\beta$}}_{q})\widetilde{\mbox{\boldmath$B$}}^{-1}\mbox{\boldmath$H$}^{\top}(\mbox{\boldmath$H$}\widehat{\mbox{\boldmath$B$}}{}^{-1}\widehat{\mbox{\boldmath$A$}}\widehat{\mbox{\boldmath$B$}}{}^{-1}\mbox{\boldmath$H$}^{\top})^{-1}(\mbox{\boldmath$H$}\widehat{\mbox{\boldmath$\beta$}}_{q}-\mbox{\boldmath$h$}), (3.5)

are asymptotically equivalent and follow a chi-square distribution with dd degrees of freedom, where ๐›ƒ~q\widetilde{\mbox{\boldmath$\beta$}}_{q} and ๐›ƒ^q\widehat{\mbox{\boldmath$\beta$}}_{q} represent the corrected maximum Lqq-likelihood estimates for ๐›ƒ\beta under the null and alternative hypotheses, respectively. For the estimates of matrices ๐€A and ๐B, the notation is analogous. Therefore, we reject the null hypothesis at level ฮฑ\alpha if either WnW_{n}, RnR_{n} or Bโ€‹FnBF_{n} exceeds the 100โ€‹(1โˆ’ฮฑ)%100(1-\alpha)\% quantile of the ฯ‡2โ€‹(d)\chi^{2}(d) distribution.
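For illustration, the Wald statistic (3.3) is a quadratic form in ๐‘ฏ๐œท̂q − ๐’‰ built with the sandwich matrix ๐‘ฏ๐‘ฉ̂⁻¹๐‘จ̂๐‘ฉ̂⁻¹๐‘ฏ⊤. The sketch below is schematic: all numerical inputs are placeholders, not estimates from a fitted model.

```python
import numpy as np
from scipy.stats import chi2

def wald_lq(beta_hat, H, h, A_hat, B_hat, n):
    """Wald statistic (3.3): n (H b - h)' {H B^-1 A B^-1 H'}^-1 (H b - h)."""
    Binv = np.linalg.inv(B_hat)
    sandwich = H @ Binv @ A_hat @ Binv @ H.T
    r = H @ beta_hat - h
    return float(n * r @ np.linalg.solve(sandwich, r))

# Hypothetical test of H0: beta_2 = beta_3 = 0 in a p = 3 model.
beta_hat = np.array([0.8, 0.1, -0.05])
H = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
h = np.zeros(2)
A_hat, B_hat = np.eye(3), 2.0 * np.eye(3)   # placeholder matrix estimates
Wn = wald_lq(beta_hat, H, h, A_hat, B_hat, n=100)
reject = Wn > chi2.ppf(0.95, df=H.shape[0])
```

The score-type and bilinear form statistics in (3.4) and (3.5) are quadratic or bilinear forms of exactly the same structure, with the estimating function evaluated at the null estimate replacing ๐‘ฏ๐œท̂q − ๐’‰ where appropriate.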

Robustness of the statistics defined in (3.3)-(3.5) can be characterized using the tools available in Heritier and Ronchetti (1994). In addition, it is relatively simple to extend these test statistics to accommodate nonlinear hypotheses following the results developed by Crudu and Osorio (2020).

Suppose that our aim is to evaluate the inclusion of a new variable in the regression model. Following Wang (1985), we can consider the added-variable model, which is defined through the linear predictor ฮทi=๐’™iโŠคโ€‹๐œท+ziโ€‹ฮณ=๐’‡iโŠคโ€‹๐œน\eta_{i}=\mbox{\boldmath$x$}_{i}^{\top}\mbox{\boldmath$\beta$}+z_{i}\gamma=\mbox{\boldmath$f$}_{i}^{\top}\mbox{\boldmath$\delta$}, where ๐’‡iโŠค=(๐’™iโŠค,zi)\mbox{\boldmath$f$}_{i}^{\top}=(\mbox{\boldmath$x$}_{i}^{\top},z_{i}) and ๐œน=(๐œทโŠค,ฮณ)โŠค\mbox{\boldmath$\delta$}=(\mbox{\boldmath$\beta$}^{\top},\gamma)^{\top}, with ฮธi=kโ€‹(ฮทi)\theta_{i}=k(\eta_{i}), for i=1,โ€ฆ,ni=1,\dots,n. Based on the above results we present the score-type statistic for testing the hypothesis H0:ฮณ=0H_{0}:\gamma=0. Let ๐‘ญ=(๐‘ฟ,๐’›)\mbox{\boldmath$F$}=(\mbox{\boldmath$X$},\mbox{\boldmath$z$}) be the model matrix for the added-variable model, with ๐‘ฟ=(๐’™1,โ€ฆ,๐’™n)โŠค\mbox{\boldmath$X$}=(\mbox{\boldmath$x$}_{1},\dots,\mbox{\boldmath$x$}_{n})^{\top} and ๐’›=(z1,โ€ฆ,zn)โŠค\mbox{\boldmath$z$}=(z_{1},\dots,z_{n})^{\top}. This notation enables us to write the estimation function associated with the maximum Lqq-likelihood estimation problem as:

๐šฟnโ€‹(๐œน)=(๐šฟฮฒโ€‹(๐œน)๐šฟฮณโ€‹(๐œน))=ฯ•โ€‹(๐‘ฟโŠคโ€‹๐‘พ1/2โ€‹๐‘ผโ€‹๐‘ฝโˆ’1/2โ€‹(๐’€โˆ’๐)๐’›โŠคโ€‹๐‘พ1/2โ€‹๐‘ผโ€‹๐‘ฝโˆ’1/2โ€‹(๐’€โˆ’๐)),\mbox{\boldmath$\Psi$}_{n}(\mbox{\boldmath$\delta$})=\begin{pmatrix}\mbox{\boldmath$\Psi$}_{\beta}(\mbox{\boldmath$\delta$})\\ \mbox{\boldmath$\Psi$}_{\gamma}(\mbox{\boldmath$\delta$})\end{pmatrix}=\phi\begin{pmatrix}\mbox{\boldmath$X$}^{\top}\mbox{\boldmath$W$}^{1/2}\mbox{\boldmath$UV$}^{-1/2}(\mbox{\boldmath$Y$}-\mbox{\boldmath$\mu$})\\ \mbox{\boldmath$z$}^{\top}\mbox{\boldmath$W$}^{1/2}\mbox{\boldmath$UV$}^{-1/2}(\mbox{\boldmath$Y$}-\mbox{\boldmath$\mu$})\end{pmatrix},

whereas,

๐‘จnโ€‹(๐œน)\displaystyle\mbox{\boldmath$A$}_{n}(\mbox{\boldmath$\delta$}) =ฯ•2โˆ’qโ€‹(๐‘ฟโŠคโ€‹๐‘พโ€‹๐‘ฑโ€‹๐‘ฟ๐‘ฟโŠคโ€‹๐‘พโ€‹๐‘ฑโ€‹๐’›๐’›โŠคโ€‹๐‘พโ€‹๐‘ฑโ€‹๐‘ฟ๐’›โŠคโ€‹๐‘พโ€‹๐‘ฑโ€‹๐’›),\displaystyle=\frac{\phi}{2-q}\begin{pmatrix}\mbox{\boldmath$X$}^{\top}\mbox{\boldmath$WJX$}&\mbox{\boldmath$X$}^{\top}\mbox{\boldmath$WJz$}\\ \mbox{\boldmath$z$}^{\top}\mbox{\boldmath$WJX$}&\mbox{\boldmath$z$}^{\top}\mbox{\boldmath$WJz$}\end{pmatrix},
๐‘ฉnโ€‹(๐œน)\displaystyle\mbox{\boldmath$B$}_{n}(\mbox{\boldmath$\delta$}) =ฯ•โ€‹(๐‘ฟโŠคโ€‹๐‘พโ€‹๐‘ฑโ€‹๐‘ฎโ€‹๐‘ฒโ€‹๐‘ฟ๐‘ฟโŠคโ€‹๐‘พโ€‹๐‘ฑโ€‹๐‘ฎโ€‹๐‘ฒโ€‹๐’›๐’›โŠคโ€‹๐‘พโ€‹๐‘ฑโ€‹๐‘ฎโ€‹๐‘ฒโ€‹๐‘ฟ๐’›โŠคโ€‹๐‘พโ€‹๐‘ฑโ€‹๐‘ฎโ€‹๐‘ฒโ€‹๐’›).\displaystyle=\phi\begin{pmatrix}\mbox{\boldmath$X$}^{\top}\mbox{\boldmath$WJGKX$}&\mbox{\boldmath$X$}^{\top}\mbox{\boldmath$WJGKz$}\\ \mbox{\boldmath$z$}^{\top}\mbox{\boldmath$WJGKX$}&\mbox{\boldmath$z$}^{\top}\mbox{\boldmath$WJGKz$}\end{pmatrix}.

It is easy to notice that the estimate for ฮฒ\beta under the null hypothesis H0:ฮณ=0H_{0}:\gamma=0 can be obtained by the Newton-scoring algorithm given in (2.5). Details of the derivation of the score-type statistic for testing H0:ฮณ=0H_{0}:\gamma=0 are deferred to Appendix E, where it is shown that

Rn=๐šฟฮณโŠคโ€‹(๐œท^q)โ€‹{Covโก(๐šฟฮณโ€‹(๐œท^q))}โˆ’1โ€‹๐šฟฮณโ€‹(๐œท^q),R_{n}=\mbox{\boldmath$\Psi$}_{\gamma}^{\top}(\widehat{\mbox{\boldmath$\beta$}}_{q})\{\operatorname{Cov}(\mbox{\boldmath$\Psi$}_{\gamma}(\widehat{\mbox{\boldmath$\beta$}}_{q}))\}^{-1}\mbox{\boldmath$\Psi$}_{\gamma}(\widehat{\mbox{\boldmath$\beta$}}_{q}),

which for the context of the added variable model, takes the form:

Rn=(2โˆ’q){๐’›โŠค๐‘พ^๐‘ผ^1/2๐‘ฝ^(๐’€โˆ’๐^)โˆ’1/2}2ฯ•โˆ’1โ€‹๐’›โŠคโ€‹(๐‘ฐnโˆ’๐‘พ^โ€‹๐‘ฑ^โ€‹๐‘ฎ^โ€‹๐‘ฒ^โ€‹๐‘ท^)โ€‹๐‘พ^โ€‹๐‘ฑ^โ€‹(๐‘ฐnโˆ’๐‘ท^โ€‹๐‘พ^โ€‹๐‘ฑ^โ€‹๐‘ฎ^โ€‹๐‘ฒ^)โ€‹๐’›,R_{n}=\frac{(2-q)\{\mbox{\boldmath$z$}^{\top}\widehat{\mbox{\boldmath$W$}}{}^{1/2}\widehat{\mbox{\boldmath$U$}}\widehat{\mbox{\boldmath$V$}}{}^{-1/2}(\mbox{\boldmath$Y$}-\widehat{\mbox{\boldmath$\mu$}})\}^{2}}{\phi^{-1}\mbox{\boldmath$z$}^{\top}(\mbox{\boldmath$I$}_{n}-\widehat{\mbox{\boldmath$W$}}\widehat{\mbox{\boldmath$J$}}\widehat{\mbox{\boldmath$G$}}\widehat{\mbox{\boldmath$K$}}\widehat{\mbox{\boldmath$P$}})\widehat{\mbox{\boldmath$W$}}\widehat{\mbox{\boldmath$J$}}(\mbox{\boldmath$I$}_{n}-\widehat{\mbox{\boldmath$P$}}\widehat{\mbox{\boldmath$W$}}\widehat{\mbox{\boldmath$J$}}\widehat{\mbox{\boldmath$G$}}\widehat{\mbox{\boldmath$K$}})\mbox{\boldmath$z$}},

where ๐‘ท=๐‘ฟโ€‹(๐‘ฟโŠคโ€‹๐‘พโ€‹๐‘ฑโ€‹๐‘ฎโ€‹๐‘ฒโ€‹๐‘ฟ)โˆ’1โ€‹๐‘ฟโŠค\mbox{\boldmath$P$}=\mbox{\boldmath$X$}(\mbox{\boldmath$X$}^{\top}\mbox{\boldmath$WJGKX$})^{-1}\mbox{\boldmath$X$}^{\top}. Thus, we reject the null hypothesis by comparing the value obtained for RnR_{n} with some percentile value (1โˆ’ฮฑ)(1-\alpha) of the chi-squared distribution with one degree of freedom. This type of development has been used, for example, in Wei and Shih (1994) to propose tools for outlier detection considering the mean-shift outlier model in both generalized linear and nonlinear regression models for the maximum likelihood framework. It should be noted that the mean-shift outlier model can be obtained by considering ๐’›=(0,โ€ฆ,1,โ€ฆ,0)โŠค\mbox{\boldmath$z$}=(0,\dots,1,\dots,0)^{\top} as the vector of zeros except for an 1 at the iith position, which leads to the standardized residual.

ti=2โˆ’qโ€‹U^iโ€‹(Yiโˆ’ฮผ^i)ฯ•โˆ’1/2โ€‹J^i1/2โ€‹V^i1/2โ€‹(1โˆ’g^iโ€‹k^iโ€‹m^iโ€‹i)โˆ’g^iโ€‹k^iโ€‹(m^iโ€‹iโˆ’g^iโ€‹k^iโ€‹m^iโ€‹iโˆ—),i=1,โ€ฆ,n,t_{i}=\frac{\sqrt{2-q}\,\widehat{U}_{i}(Y_{i}-\widehat{\mu}_{i})}{\phi^{-1/2}\widehat{J}_{i}^{1/2}\widehat{V}_{i}^{1/2}\sqrt{(1-\widehat{g}_{i}\widehat{k}_{i}\widehat{m}_{ii})-\widehat{g}_{i}\widehat{k}_{i}(\widehat{m}_{ii}-\widehat{g}_{i}\widehat{k}_{i}\widehat{m}_{ii}^{*})}},\qquad i=1,\dots,n,

where U^i=Uiโ€‹(๐œท^)\widehat{U}_{i}=U_{i}(\widehat{\mbox{\boldmath$\beta$}}), J^i=Jqโˆ’ฯ•โ€‹(ฮธiโ€‹(๐œท^))\widehat{J}_{i}=J_{q}^{-\phi}(\theta_{i}(\widehat{\mbox{\boldmath$\beta$}})), g^i=gห™โ€‹(ฮธ^iโˆ—)\widehat{g}_{i}=\dot{g}(\widehat{\theta}_{i}^{*}), k^i=kห™โ€‹(ฮท^i)\widehat{k}_{i}=\dot{k}(\widehat{\eta}_{i}), and m^iโ€‹i\widehat{m}_{ii}, m^iโ€‹iโˆ—\widehat{m}_{ii}^{*} are, respectively, the diagonal elements of matrices ๐‘ด^=๐‘ท^โ€‹๐‘พ^โ€‹๐‘ฑ^\widehat{\mbox{\boldmath$M$}}=\widehat{\mbox{\boldmath$P$}}\widehat{\mbox{\boldmath$W$}}\widehat{\mbox{\boldmath$J$}} and ๐‘ด^2\widehat{\mbox{\boldmath$M$}}{}^{2}, for i=1,โ€ฆ,ni=1,\dots,n. It is straightforward to note that tit_{i} has an approximately standard normal distribution and can be used for the identification of outliers and possibly influential observations. Evidently, for q=1q=1 we recover the standardized residual for generalized linear models. We have developed the standardized residual tit_{i} based on the score-type statistic defined in (3.4); further alternatives can be constructed using the Wald and bilinear-form statistics defined in Equations (3.3) and (3.5), respectively. Additionally, we can consider other types of residuals. Indeed, based on Equation (3.1) we can define the deviance residual as

riDEV=signโก(yiโˆ’ฮผ^i)โ€‹dโ€‹(yi,ฮผ^i),i=1,โ€ฆ,n,r_{i}^{\mathrm{DEV}}=\operatorname{sign}(y_{i}-\widehat{\mu}_{i})\sqrt{d(y_{i},\widehat{\mu}_{i})},\qquad i=1,\dots,n,

with dโ€‹(yi,ฮผ^i)=2โ€‹{lqโ€‹(yi,yi)โˆ’lqโ€‹(yi,ฮผ^i)}d(y_{i},\widehat{\mu}_{i})=2\{l_{q}(y_{i},y_{i})-l_{q}(y_{i},\widehat{\mu}_{i})\}, the deviance associated with the iith observation, which is nonnegative because the saturated fit maximizes the Lqq-likelihood term by term. Moreover, it is also possible to consider the quantile residual (Dunn and Smyth, 1996), which is defined as

riQTL=ฮฆโˆ’1โ€‹{Fโ€‹(yi;ฮผ^i,ฯ•)},i=1,โ€ฆ,n,r_{i}^{\mathrm{QTL}}=\Phi^{-1}\{F(y_{i};\widehat{\mu}_{i},\phi)\},\qquad i=1,\dots,n,

where Fโ€‹(โ‹…;ฮผ,ฯ•)F(\cdot;\mu,\phi) is the cumulative distribution function associated with the random variable yy, while ฮฆโ€‹(โ‹…)\Phi(\cdot) is the cumulative distribution function of the standard normal.
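For a continuous response the quantile residual can be computed directly from the fitted cumulative distribution function. The sketch below assumes a Gamma response parameterized by its mean ฮผ and precision-type parameter ฯ• (shape ฯ•, scale ฮผ/ฯ•), a choice made here only for illustration; for discrete families such as the Poisson or binomial, the randomized version of Dunn and Smyth (1996) should be used instead.

```python
import numpy as np
from scipy.stats import gamma, norm

def quantile_residuals(y, mu, phi):
    """Quantile residuals Phi^{-1}{F(y_i; mu_i, phi)} for a Gamma response
    with mean mu and shape phi (so scale = mu/phi)."""
    F = gamma.cdf(y, a=phi, scale=mu / phi)
    return norm.ppf(F)
```

Under a correctly specified model these residuals are exactly standard normal by the probability integral transform, which makes QQ-plots against normal quantiles directly interpretable.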

Throughout this work we have assumed that ฯ•\phi is a fixed nuisance parameter. Following Gimรฉnez etย al. (2022a, b), we can obtain the maximum Lqq-likelihood estimator ฯ•^q\widehat{\phi}_{q} by solving the problem,

maxฯ•โ€‹Hnโ€‹(ฯ•),Hnโ€‹(ฯ•)=โˆ‘i=1nlqโ€‹(fโ€‹(yi;kโ€‹(๐’™iโŠคโ€‹๐œท^q),ฯ•)),\underset{\phi}{\max}\,H_{n}(\phi),\qquad H_{n}(\phi)=\sum_{i=1}^{n}l_{q}(f(y_{i};k(\mbox{\boldmath$x$}_{i}^{\top}\widehat{\mbox{\boldmath$\beta$}}_{q}),\phi)),

which corresponds to a profiled Lqq-likelihood function. It should be noted that obtaining ฯ•^q\widehat{\phi}_{q} only requires a one-dimensional optimization procedure.
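A minimal sketch of this one-dimensional profiling step, assuming for concreteness a Gaussian model with precision ฯ• = 1/ฯƒยฒ and fitted means already obtained; the function names are ours, not the authors'.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

def lq(u, q):
    """Lq transform; log(u) in the limit q -> 1."""
    return np.log(u) if q == 1.0 else (u**(1.0 - q) - 1.0) / (1.0 - q)

def phi_hat_q(y, mu_hat, q):
    """Maximize the profiled Lq-likelihood H_n(phi) over the precision phi,
    for a Gaussian response with fitted means mu_hat."""
    def neg_Hn(phi):
        f = norm.pdf(y, loc=mu_hat, scale=1.0 / np.sqrt(phi))
        return -np.sum(lq(f, q))
    return minimize_scalar(neg_Hn, bounds=(1e-6, 1e3), method="bounded").x
```

For q = 1 this recovers the maximum likelihood estimate ฯ•̂ = n/ฮฃ(yแตข − ฮผ̂แตข)ยฒ, which provides a simple numerical check.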

4. Numerical experiments

4.1. Example

We revisit the dataset introduced by Finney (1947), which aims to assess the effect of the volume and rate of inspired air on transient vasoconstriction in the skin of the digits. The response is binary, indicating the occurrence or non-occurrence of vasoconstriction. Following Pregibon (1981), a logistic regression model using the linear predictor ฮท=ฮฒ1+ฮฒ2โ€‹logโก(volume)+ฮฒ3โ€‹logโก(rate)\eta=\beta_{1}+\beta_{2}\log(\text{volume})+\beta_{3}\log(\text{rate}) was considered. Observations 4 and 18 have been identified as highly influential on the estimated coefficients obtained by the maximum likelihood method (Pregibon, 1981). In fact, several authors have highlighted the extreme characteristic of this dataset, since the removal of these observations brings the model close to indeterminacy (see, for instance, Kรผnsch etย al., 1989, Sec.โ€‰4). Table 4.1 reports the parameter estimates and their standard errors obtained from three robust procedures. Here CR denotes the fit using the method proposed by Cantoni and Ronchetti (2001), BY indicates the estimation method for logistic regression developed by Bianco and Yohai (1996), both available in the robustbase package (Todorov and Filzmoser, 2009) of the R software (R Development Core Team, 2022), while MLqq denotes the procedure proposed in this work, with qq selected using the mechanism described in Section 2.2, which yielded q=0.79q=0.79. The results of the maximum likelihood fit are also reported. Estimation was additionally carried out using the weighted Bianco and Yohai estimator (Croux and Haesbroeck, 2003); however, since the explanatory variables do not contain any outliers, the estimates obtained using the BY and weighted BY methods are identical, and the latter are therefore not reported here.

Table 4.1. Estimates by maximum likelihood and robust methods for the skin vaso-constriction data.
Intercept logโก(volume)\log(\text{volume}) logโก(rate)\log(\text{rate})
ML -2.875 (1.321) 5.179 (1.865) 4.562 (1.838)
ML* -24.581 (14.020) 39.550 (23.244) 31.935 (17.758)
MLqq, q=0.79q=0.79 -5.185 (2.563) 8.234 (3.920) 7.287 (3.455)
CR -2.753 (1.327) 4.974 (1.862) 4.388 (1.845)
BY -6.852 (10.040) 10.734 (15.295) 9.364 (12.770)
* With cases 4 and 18 removed.
Figure 4.1. Skin vaso-constriction data: (a) observations and the estimated probabilities and (b) QQ-plot of Studentized residuals with simulation envelope, from the logistic model fitted using maximum likelihood.

From Table 4.1 we must highlight the drastic change in the maximum likelihood estimates of the regression coefficients when observations 4 and 18 are deleted (see also Pregibon, 1981; Wang, 1985). It is interesting to note that the results for this case are very close to those offered by the estimation procedure based on the L1L_{1}-norm introduced by Morgenthaler (1992). Moreover, as discussed in Sec.โ€‰6.17 of Atkinson and Riani (2000), the removal of observations 4, 18 and 24 leads to a perfect fit. Indeed, the elimination of observations 4 and 18 leads to a substantial increase in the standard errors.

Figure 4.1โ€‰(a) shows a strong overlap between the groups with zero and unit response. The poor fit is also evident from the QQ-plot of Studentized residuals with simulated envelopes (see Figure 4.1โ€‰(b)). The estimated weights U^i=Uiโ€‹(๐œท^)\widehat{U}_{i}=U_{i}(\widehat{\mbox{\boldmath$\beta$}}), (i=1,โ€ฆ,39)(i=1,\dots,39), are displayed in Figure 4.2, which shows the ability of the algorithm to downweight observations 4, 18 and 24. It should be stressed that the estimates of the regression coefficients obtained by the maximum Lqq-likelihood method with q=0.79q=0.79 are very similar to those obtained by the resistant procedure discussed by Pregibon (1982). Figure 4.3 reveals that the maximum Lqq-likelihood procedure offers a better fit, and the QQ-plot with simulated envelope correctly identifies observations 4 and 18 as outliers.

Figure 4.2. Skin vaso-constriction data: weights obtained from the logistic model fitted using maximum Lqq-likelihood with q=0.79q=0.79.
Figure 4.3. Skin vaso-constriction data: (a) observations and the estimated probabilities and (b) QQ-plot of Studentized residuals with simulation envelope, from the logistic model fitted using maximum Lqq-likelihood, q=0.79q=0.79.

Appendix F reports the fit of the proposed model for values of qq ranging from 1 to 0.74. Table F.1 shows that as the value of qq decreases, the standard errors of the estimates increase. Additionally, Figures F.1 and F.2 in Appendix F show how, as qq decreases, the weights of observations 4 and 18 rapidly drop to zero. Indeed, when qโ‰ค0.76q\leq 0.76 the problem of indeterminacy underlying these data becomes evident.

4.2. Simulation study

To evaluate the performance of the proposed estimation procedure we developed a small Monte Carlo simulation study. We consider a Poisson regression model with logarithmic link function, based on the linear predictor

ฮทi=ฮฒ1โ€‹xiโ€‹1+ฮฒ2โ€‹xiโ€‹2+ฮฒ3โ€‹xiโ€‹3,i=1,โ€ฆ,n,\eta_{i}=\beta_{1}x_{i1}+\beta_{2}x_{i2}+\beta_{3}x_{i3},\qquad i=1,\dots,n, (4.1)

where xiโ€‹jโˆผ๐–ดโ€‹(0,1)x_{ij}\sim\mathsf{U}(0,1) for j=1,2,3j=1,2,3 and ๐œท=(ฮฒ1,ฮฒ2,ฮฒ3)โŠค=(1,1,1)โŠค\mbox{\boldmath$\beta$}=(\beta_{1},\beta_{2},\beta_{3})^{\top}=(1,1,1)^{\top}. We generated M=1000M=1000 data sets with sample sizes n=25,100n=25,100 and 400400 from model (4.1). Following Cantoni and Ronchetti (2006), the observations are contaminated by multiplying a percentage ฮต\varepsilon of randomly selected responses by a factor ฮฝ\nu. We considered ฮฝ=2,5\nu=2,5 and contamination percentages ฮต=0.00,0.05,0.10\varepsilon=0.00,0.05,0.10 and 0.250.25. For contaminated as well as non-contaminated data we carried out the estimation by maximum likelihood, by the method of Cantoni and Ronchetti (2001), and by the procedure proposed in this paper.
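One replicate of this design can be sketched as follows; the function name is ours and the estimation step is omitted.

```python
import numpy as np

def simulate_contaminated(n, beta, eps, nu, rng):
    """Poisson responses with log link from linear predictor X beta,
    then a fraction eps of randomly chosen responses multiplied by nu."""
    p = len(beta)
    X = rng.uniform(size=(n, p))                  # x_ij ~ U(0, 1)
    y = rng.poisson(np.exp(X @ beta)).astype(float)
    idx = rng.choice(n, size=int(round(eps * n)), replace=False)
    y[idx] *= nu                                  # contamination
    return X, y

rng = np.random.default_rng(1)
X, y = simulate_contaminated(100, np.ones(3), eps=0.10, nu=5, rng=rng)
```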

Let ๐œท^k=(ฮฒ^kโ€‹1,โ€ฆ,ฮฒ^kโ€‹p)โŠค\widehat{\mbox{\boldmath$\beta$}}_{k}=(\widehat{\beta}_{k1},\dots,\widehat{\beta}_{kp})^{\top} be the vector of estimated parameters for the kkth simulated sample (k=1,โ€ฆ,M)(k=1,\dots,M). We summarize the results of the simulations by calculating the following statistics (Hosseinian and Morgenthaler, 2011)

bias=โ€–1Mโ€‹โˆ‘k=1M(๐œท^kโˆ’๐œท)โ€–,IQR=1pโ€‹โˆ‘j=1pIQRโก(ฮฒ^kโ€‹j),\operatorname{bias}=\Big{\|}\frac{1}{M}\sum_{k=1}^{M}(\widehat{\mbox{\boldmath$\beta$}}_{k}-\mbox{\boldmath$\beta$})\Big{\|},\qquad\operatorname{IQR}=\frac{1}{p}\sum_{j=1}^{p}\operatorname{IQR}(\widehat{\beta}_{kj}),

where IQR\operatorname{IQR} denotes the interquartile range and for our study p=3p=3.
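These two summaries can be computed as a direct transcription of the formulas above (function name ours):

```python
import numpy as np

def mc_summary(beta_hats, beta):
    """bias = ||mean_k(beta_hat_k) - beta|| and the interquartile range
    averaged over the p coefficients, from an (M, p) array of estimates."""
    est = np.asarray(beta_hats)
    bias = np.linalg.norm(est.mean(axis=0) - np.asarray(beta))
    q75, q25 = np.percentile(est, [75, 25], axis=0)
    return bias, float(np.mean(q75 - q25))
```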

Table 4.2. Bias of the MLqq estimators for various values of the distortion parameter qq under different levels of contamination.
nn ฮต\varepsilon ฮฝ\nu qq CR
1.00 0.99 0.97 0.95 0.93 0.91 0.89 0.87 0.85
25 0.05 2 0.020 0.007 0.039 0.076 0.113 0.151 0.188 0.225 0.262 0.009
100 0.042 0.023 0.020 0.058 0.097 0.135 0.173 0.211 0.248 0.024
400 0.043 0.023 0.018 0.057 0.096 0.135 0.173 0.211 0.248 0.023
25 0.05 5 0.114 0.085 0.050 0.064 0.098 0.141 0.182 0.217 0.244 0.023
100 0.155 0.101 0.024 0.053 0.100 0.137 0.154 0.160 0.158 0.031
400 0.161 0.101 0.015 0.049 0.099 0.142 0.182 0.202 0.195 0.029
25 0.10 2 0.055 0.037 0.021 0.051 0.089 0.128 0.166 0.204 0.242 0.030
100 0.083 0.061 0.021 0.028 0.067 0.107 0.147 0.187 0.226 0.048
400 0.084 0.063 0.019 0.023 0.065 0.106 0.146 0.186 0.225 0.048
25 0.10 5 0.221 0.178 0.097 0.009 0.082 0.141 0.172 0.184 0.184 0.058
100 0.285 0.213 0.093 0.037 0.044 0.089 0.150 0.177 0.171 0.065
400 0.296 0.216 0.088 0.013 0.052 0.141 0.301 0.301 0.245 0.063
25 0.25 2 0.187 0.166 0.123 0.083 0.051 0.046 0.072 0.106 0.143 0.131
100 0.194 0.171 0.124 0.078 0.034 0.022 0.061 0.098 0.123 0.130
400 0.199 0.175 0.127 0.079 0.033 0.014 0.056 0.101 0.126 0.131
25 0.25 5 0.590 0.530 0.410 0.424 0.459 0.468 0.484 0.503 0.504 0.229
100 0.591 0.518 0.407 0.741 0.750 0.702 0.649 0.597 0.544 0.190
400 0.610 0.530 0.401 0.838 0.783 0.730 0.676 0.623 0.568 0.187
Table 4.3. Interquartile range of the MLqq estimators for various values of the distortion parameter qq under different levels of contamination.
nn ฮต\varepsilon ฮฝ\nu qq CR
1.00 0.99 0.97 0.95 0.93 0.91 0.89 0.87 0.85
25 0.05 2 0.431 0.426 0.413 0.401 0.393 0.386 0.377 0.369 0.363 0.425
100 0.195 0.190 0.183 0.178 0.175 0.171 0.167 0.164 0.161 0.187
400 0.103 0.100 0.095 0.091 0.090 0.089 0.087 0.085 0.083 0.097
25 0.05 5 0.611 0.572 0.492 0.442 0.432 0.426 0.440 0.437 0.454 0.444
100 0.363 0.303 0.227 0.201 0.185 0.178 0.182 0.190 0.220 0.192
400 0.195 0.160 0.118 0.103 0.097 0.093 0.090 0.095 0.121 0.101
25 0.10 2 0.454 0.449 0.435 0.423 0.413 0.404 0.393 0.388 0.386 0.453
100 0.201 0.196 0.191 0.187 0.182 0.175 0.171 0.165 0.163 0.190
400 0.109 0.106 0.103 0.100 0.097 0.094 0.093 0.092 0.090 0.101
25 0.10 5 0.807 0.753 0.619 0.548 0.547 0.578 0.619 0.689 0.730 0.474
100 0.421 0.358 0.282 0.236 0.234 0.296 0.418 0.459 0.475 0.204
400 0.221 0.193 0.150 0.126 0.117 0.263 0.229 0.220 0.241 0.111
25 0.25 2 0.519 0.513 0.506 0.496 0.483 0.472 0.473 0.475 0.472 0.520
100 0.239 0.238 0.230 0.223 0.215 0.210 0.205 0.204 0.208 0.224
400 0.118 0.116 0.114 0.112 0.110 0.107 0.106 0.104 0.107 0.114
25 0.25 5 1.008 1.003 1.057 1.293 1.303 1.298 1.221 1.167 1.139 0.712
100 0.465 0.462 0.469 0.560 0.502 0.492 0.481 0.470 0.459 0.288
400 0.226 0.228 0.244 0.240 0.241 0.236 0.231 0.226 0.220 0.142

Tables 4.2 and 4.3 present the simulation results. It should be noted that, in the presence of contamination, the maximum Lqq-likelihood estimation method with values of the distortion parameter qq between roughly 0.99 and 0.95 outperforms the procedure proposed by Cantoni and Ronchetti (2001), with the exception of severe levels of contamination (i.e. ฮต=0.25\varepsilon=0.25 and ฮฝ=5\nu=5). These results are in concordance with Theorem 3.1 of Ferrari and Yang (2010), although they highlight the need to understand in detail the proportion of aberrant data tolerated by this estimation procedure. The simulations also reveal the trade-off underlying the selection of the distortion parameter: decreasing the value of qq tends to yield estimates with smaller standard errors but introduces considerable bias. We therefore recommend the selection procedures proposed by La Vecchia etย al. (2015) and Ribeiro and Ferrari (2023). Indeed, for non-contaminated data (results not presented here) these procedures correctly lead to selecting the maximum likelihood estimator, i.e. q=1q=1.

5. Concluding remarks

The methodology described in this work provides a fully parametric robust estimation mechanism that depends on a single tuning parameter qq, which controls the trade-off between robustness and efficiency of the procedure. The simplicity of the proposed approach, as well as its attractive asymptotic properties, has also allowed us to introduce measures to evaluate the goodness of fit of the model, to carry out hypothesis tests, and to define standardized residuals. In particular, the estimation can be characterized as iteratively weighted least squares, which allows existing code to be reused and results in a procedure less sensitive to outlying observations. We have described several strategies for the selection of the distortion (tuning) parameter. In particular, the proposal of La Vecchia etย al. (2015) represents a very low computational cost alternative that seems to work well in the application with real data. The Monte Carlo simulation study reveals a satisfactory performance of the procedure based on maximum Lqq-likelihood estimation in the presence of contamination, even when compared to some popular robust estimation alternatives for GLMs. It should be stressed that, despite the robustness of the proposed procedure, there may still be observations that exert disproportionate influence on key aspects of the statistical model; the study of the local influence of such observations is a topic of further research currently being developed by the authors. An implementation of the MLqq estimation for GLMs, along with the experiments developed in this work, is publicly available on GitHub.

Code and software availability

All analyses and simulations were conducted in the R environment for statistical computing. The replication files related to this article are available online at https://github.com/faosorios/robGLM

Acknowledgements

The authors were partially supported by the UTFSM grant PI_LI_19_11. Additionally, authors acknowledge the support of the computing infrastructure of the Applied Laboratory of Spatial Analysis UTFSM - ALSA (MT7018).

References

  • Atkinson and Riani (2000) Atkinson, A., Riani, M. (2000). Robust Diagnostics Regression Analysis. Springer, New York.
  • Basu etย al. (1998) Basu, A., Harris, I.R., Hjort, N.L., Jones, M.C. (1998). Robust and efficient estimation by minimising a density power divergence. Biometrika 85, 549-559.
  • Bianco and Yohai (1996) Bianco, A.M., Yohai, V.J. (1996). Robust estimation in the logistic regression model. In: Rieder, H. (Ed.), Robust Statistics, Data Analysis, and Computer Intensive Methods, Lecture Notes in Statistics, Vol. 109. Springer, New York, pp. 17-34.
  • Boos (1992) Boos, D.D. (1992). On generalized score tests. The American Statistician 46, 327-333.
  • Cantoni and Ronchetti (2001) Cantoni, E., Ronchetti, E. (2001). Robust inference for generalized linear models. Journal of the American Statistical Association 96, 1022-1030.
  • Cantoni and Ronchetti (2006) Cantoni, E., Ronchetti, E. (2006). A robust approach for skewed and heavy-tailed outcomes in the analysis of health care expenditures. Journal of Health Economics 25, 198-213.
  • Croux and Haesbroeck (2003) Croux, C., Haesbroeck, G. (2003). Implementing the Bianco and Yohai estimator for logistic regression. Computational Statistics and Data Analysis 44, 273-295.
  • Crudu and Osorio (2020) Crudu, F., Osorio, F. (2020). Bilinear form test statistics for extremum estimation. Economics Letters 187, 108885.
  • Dunn and Smyth (1996) Dunn, P.K., Smyth, G.K. (1996). Randomized quantile residuals. Journal of Computational and Graphical Statistics 5, 236-244.
  • Ferrari and Paterlini (2009) Ferrari, D., Paterlini, S. (2009). The maximum Lqq-likelihood method: An application to extreme quantile estimation in finance. Methodology and Computing in Applied Probability 11, 3-19.
  • Ferrari and Yang (2010) Ferrari, D., Yang, Y. (2010). Maximum Lqq-likelihood estimation. The Annals of Statistics 38, 753-783.
  • Finney (1947) Finney, D.J. (1947). The estimation from individual records of the relationships between dose and quantal response. Biometrika 34, 320-324.
  • Ghosh and Basu (2016) Ghosh, A., Basu, A. (2016). Robust estimation in generalized linear models: the density power divergence approach. TEST 25, 269-290.
  • Gimรฉnez etย al. (2022a) Gimรฉnez, P., Guarracino, L., Galea, M. (2022). Maximum Lqq-likelihood estimation in functional measurement error models. Statistica Sinica 32, 1723-1743.
  • Gimรฉnez etย al. (2022b) Gimรฉnez, P., Guarracino, L., Galea, M. (2022). Robust estimation in functional comparative calibration models via maximum Lqq-likelihood. Brazilian Journal of Probability and Statistics 36, 725-750.
  • Gourieroux and Monfort (1995) Gourieroux, C., Monfort, A. (1995). Statistics and Econometric Models: Testing, Confidence Regions, Model Selection and Asymptotic Theory. Vol. 2. Cambridge University Press.
  • Green (1984) Green, P.J. (1984). Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives. Journal of the Royal Statistical Society, Series B 46, 149-170.
  • Hampel etย al. (1986) Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J., Stahel, W.A. (1986). Robust Statistics: The approach based on influence functions. Wiley, New York.
  • Havrda and Charvรกt (1967) Havrda, J., Charvรกt, F. (1967). Quantification method of classification processes: Concept of structural entropy. Kybernetika 3, 30-35.
  • Heritier and Ronchetti (1994) Heritier, S., Ronchetti, E. (1994). Robust bounded-influence tests in general parametric models. Journal of the American Statistical Association 89, 897-904.
  • Hosseinian and Morgenthaler (2011) Hosseinian, S., Morgenthaler, S. (2011). Robust binary regression. Journal of Statistical Planning and Inference 141, 1497-1509.
  • Huang etย al. (2013) Huang, C., Lin, R., Ren, Y. (2013). Testing for the shape parameter of generalized extreme value distribution based on the Lqq-likelihood estimation ratio statistic. Metrika 76, 641-671.
  • Jรธrgensen and Knudsen (2004) Jรธrgensen, B., Knudsen, S.J. (2004). Parameter orthogonality and bias adjustment for estimating functions. Scandinavian Journal of Statistics 31, 93-114.
  • Kรผnsch etย al. (1989) Kรผnsch, H.R., Stefanski, L.A., Carroll, R.J. (1989). Conditionally unbiased bounded-influence estimation in general regression models, with applications to generalized linear models. Journal of the American Statistical Association 84, 460-466.
  • La Vecchia etย al. (2015) La Vecchia, D., Camponovo, L., Ferrari, D. (2015). Robust heart rate variability analysis by generalized entropy minimization. Computational Statistics and Data Analysis 82, 137-151.
  • Menendez (2000) Menendez, M.L. (2000). Shannonโ€™s entropy in exponential families: statistical applications. Applied Mathematics Letters 13, 37-42.
  • Morgenthaler (1992) Morgenthaler, S. (1992). Least-absolute-deviations fits for generalized linear models. Biometrika 79, 747-754.
  • Nelder and Wedderburn (1972) Nelder, J.A., Wedderburn, R.W.M. (1972). Generalized linear models. Journal of the Royal Statistical Society, Series A 135, 370-384.
  • Pregibon (1981) Pregibon, D. (1981). Logistic regression diagnostics. The Annals of Statistics 9, 705-724.
  • Pregibon (1982) Pregibon, D. (1982). Resistant fits for some commonly used logistic models with medical applications. Biometrics 38, 485-498.
  • Preisser and Qaqish (1999) Preisser, J.S., Qaqish, B.F. (1999). Robust regression for clustered data with applications to binary regression. Biometrics 55, 574-579.
  • Qin and Priebe (2013) Qin, Y., Priebe, C.E. (2013). Maximum $L_q$-likelihood estimation via the Expectation-Maximization algorithm: A robust estimation of mixture models. Journal of the American Statistical Association 108, 914-928.
  • Qin and Priebe (2017) Qin, Y., Priebe, C.E. (2017). Robust hypothesis testing via $L_q$-likelihood. Statistica Sinica 27, 1793-1813.
  • R Development Core Team (2022) R Development Core Team (2022). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-900051-07-0, http://www.R-project.org.
  • Ribeiro and Ferrari (2023) Ribeiro, T.K.A., Ferrari, S.L.P. (2023). Robust estimation in beta regression via maximum $L_q$-likelihood. Statistical Papers 64, 321-353.
  • Ronchetti (1997) Ronchetti, E. (1997). Robust aspects of model choice. Statistica Sinica 7, 327-338.
  • Stefanski et al. (1986) Stefanski, L.A., Carroll, R.J., Ruppert, D. (1986). Optimally bounded score functions for generalized linear models with applications to logistic regression. Biometrika 73, 413-425.
  • Thomas and Cook (1989) Thomas, W., Cook, R.D. (1989). Assessing influence on regression coefficients in generalized linear models. Biometrika 76, 741-749.
  • Thomas and Cook (1990) Thomas, W., Cook, R.D. (1990). Assessing influence on predictions from generalized linear models. Technometrics 32, 59-65.
  • Todorov and Filzmoser (2009) Todorov, V., Filzmoser, P. (2009). An object-oriented framework for robust multivariate analysis. Journal of Statistical Software 32 (3), 1-47.
  • Tsallis (1988) Tsallis, C. (1988). Possible generalization of Boltzmann-Gibbs statistics. Journal of Statistical Physics 52, 479-487.
  • Wang (1985) Wang, P.C. (1985). Adding a variable in generalized linear models. Technometrics 27, 273-276.
  • Wei and Shih (1994) Wei, B.C., Shih, J.Q. (1994). On statistical models for regression diagnostics. Annals of the Institute of Statistical Mathematics 46, 267-278.
  • Windham (1995) Windham, M.P. (1995). Robustifying model fitting. Journal of the Royal Statistical Society, Series B 57, 599-609.
  • Yuan and Jennrich (1998) Yuan, K.H., Jennrich, R.I. (1998). Asymptotics of estimating equations under natural conditions. Journal of Multivariate Analysis 65, 245-260.

Appendix A Fisher consistency

We say that a random variable $Y$ belongs to the exponential family of distributions if its probability density function is given by

f(y;\theta,\phi)=\exp[\phi\{y\theta-b(\theta)\}+c(y,\phi)], (A.1)

where $b(\cdot)$ and $c(\cdot,\cdot)$ are known functions, $\theta$ is the natural parameter taking values in $\Theta\subset\mathbb{R}$, and $\phi>0$ is a dispersion parameter. If a random variable has density function (A.1), we write $Y\sim\mathsf{FE}(\theta,\phi)$. To verify Fisher consistency, the following lemma is introduced.

Lemma A.1.

Suppose $Y\sim\mathsf{FE}(\theta,\phi)$ with expectation $\operatorname{E}(Y)=\dot{b}(\theta)$, and consider $U(\theta)=\{f(y;\theta,\phi)\}^{1-q}$ for $q>0$. Then, for $r>0$ and an integrable function $h(Y;\theta)$, it follows that

\operatorname{E}_{0}\{U^{r}(\theta)h(Y;\theta)\}=J_{q}^{-\phi}(\theta)\operatorname{E}_{q}\{h(Y;\theta)\},

where $\theta_{0}=q\theta$ and $J_{q}(\theta)=\exp\{-qb(\theta)+b(q\theta)\}$ is defined in Menendez (2000), whereas the expectation $\operatorname{E}_{q}\{h(Y;\theta)\}$ must be calculated based on the density function

f_{q}(y;\theta,\phi)=\exp[\phi(r(1-q)+q)\{y\theta-b(\theta)\}+c_{q,r}(y,\phi)], (A.2)

with $c_{q,r}(y,\phi)=(r(1-q)+1)c(y,\phi)$.

Proof.

We wish to compute expectations of the form

\operatorname{E}_{0}\{U^{r}(\theta)h(Y;\theta)\}=\int_{\mathcal{Y}}U^{r}(\theta)h(y;\theta)f(y;\theta_{0},\phi)\,{\rm d}y.

For $\theta_{0}=q\theta$, we have

U^{r}(\theta)f(y;\theta_{0},\phi)=\{f(y;\theta,\phi)\}^{r(1-q)}f(y;q\theta,\phi)=\exp[r(1-q)\phi\{y\theta-b(\theta)\}+r(1-q)c(y,\phi)]\times\exp[\phi\{yq\theta-b(q\theta)\}+c(y,\phi)].

Note that

r(1-q)\phi\{y\theta-b(\theta)\}+\phi\{yq\theta-b(q\theta)\}=\phi(r(1-q)+q)\{y\theta-b(\theta)\}+\phi\{qb(\theta)-b(q\theta)\}.

Thus,

U^{r}(\theta)f(y;\theta_{0},\phi)=\exp[\phi\{qb(\theta)-b(q\theta)\}]\exp[\phi(r(1-q)+q)\{y\theta-b(\theta)\}+c_{q,r}(y,\phi)],

where $c_{q,r}(y,\phi)=(r(1-q)+1)c(y,\phi)$. Letting

J_{q}(\theta)=\exp\{-qb(\theta)+b(q\theta)\},\qquad q\theta\in\Theta,

(see Menendez, 2000), we may write

\operatorname{E}_{0}\{U^{r}(\theta)h(Y;\theta)\}=J_{q}^{-\phi}(\theta)\operatorname{E}_{q}\{h(Y;\theta)\},

because $\operatorname{E}_{q}\{h(Y;\theta)\}=\int_{\mathcal{Y}}h(y;\theta)f_{q}(y;\theta,\phi)\,{\rm d}y$, with $f_{q}(y;\theta,\phi)$ as defined in (A.2). ∎
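Since the factorization $U^{r}(\theta)f(y;\theta_{0},\phi)=J_{q}^{-\phi}(\theta)f_{q}(y;\theta,\phi)$ obtained in the proof is purely algebraic, it can be checked numerically. The following is a minimal Python sketch for the Poisson case, where $b(\theta)=e^{\theta}$, $\phi=1$ and $c(y,\phi)=-\log y!$; the function names are ours, not part of the paper:

```python
import math

def log_f(y, theta):
    # Poisson density in exponential-family form (A.1):
    # b(theta) = exp(theta), phi = 1, c(y, phi) = -log(y!)
    return y * theta - math.exp(theta) - math.lgamma(y + 1)

def log_fq(y, theta, q, r):
    # Kernel f_q of Lemma A.1 / Equation (A.2), phi = 1
    a = r * (1 - q) + q
    return a * (y * theta - math.exp(theta)) - (r * (1 - q) + 1) * math.lgamma(y + 1)

theta, q, r = 0.4, 0.8, 2
# J_q(theta) = exp{-q b(theta) + b(q theta)} with b(theta) = exp(theta)
Jq = math.exp(-q * math.exp(theta) + math.exp(q * theta))
for y in range(10):
    # Left side: U^r(theta) f(y; q*theta) = f(y;theta)^{r(1-q)} f(y; q*theta)
    lhs = math.exp(r * (1 - q) * log_f(y, theta) + log_f(y, q * theta))
    # Right side: J_q^{-1}(theta) f_q(y; theta)
    rhs = math.exp(log_fq(y, theta, q, r)) / Jq
    assert abs(lhs - rhs) < 1e-12
```

Note that for $q=1$ the weight $U(\theta)$ is identically one and $f_q$ collapses back to $f$, recovering the ordinary likelihood setting.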

Proposition A.2.

The estimating function defined in Equation (2.1) is unbiased for the surrogate parameter $\theta_{*}=\theta_{0}/q$.

Proof.

We must calculate the expectation

\operatorname{E}_{0}\{\mbox{\boldmath$\Psi$}_{n}(\mbox{\boldmath$\beta$}_{*})\}=\phi\sum_{i=1}^{n}W_{i}^{1/2}\frac{\operatorname{E}_{0}\{U_{i}(\mbox{\boldmath$\beta$}_{*})(Y_{i}-\mu_{*,i})\}}{\sqrt{V_{i}}}\mbox{\boldmath$x$}_{i}.

Since $U_{i}(\mbox{\boldmath$\beta$}_{*})=\{f(y_{i};\theta_{i}(\mbox{\boldmath$\beta$}_{*}),\phi)\}^{1-q}$, it is enough to obtain $\operatorname{E}_{0}\{U(\theta_{*})(Y-\mu_{*})\}$. Using Lemma A.1 leads to

\operatorname{E}_{0}\{U(\theta_{*})(Y-\mu_{*})\}=J_{q}^{-\phi}(\theta_{*})\operatorname{E}_{q}(Y-\mu_{*}).

Noting that

\operatorname{E}_{q}(Y-\mu_{*})=\int_{\mathcal{Y}}\{y-\dot{b}(\theta_{*})\}f_{q}(y;\theta_{*},\phi)\,{\rm d}y=0,

yields $\operatorname{E}_{0}\{\mbox{\boldmath$\Psi$}_{n}(\mbox{\boldmath$\beta$}_{*})\}=\mbox{\boldmath$0$}$. That is, $\mbox{\boldmath$\Psi$}_{n}$ is an unbiased estimating function for $\theta_{*}$. ∎

Appendix B Corrected version of the maximum $L_q$-likelihood estimator in GLM

Let $\theta_{0}$ be the true parameter. Ferrari and Yang (2010) highlighted that, for $q$ fixed, the maximum $L_q$-likelihood estimator $\hat{\theta}_{q}^{*}$ converges in probability to $\theta_{*}=\theta_{0}/q$. Here, $\theta_{*}$ is called the surrogate parameter of $\theta_{0}$. This leads to the correction $\hat{\theta}_{q}=q\hat{\theta}_{q}^{*}\stackrel{p}{\to}q\theta_{*}=\theta_{0}$. Considering the $\theta$-link, that is, $\theta=k(\eta)$, we write $\mbox{\boldmath$\eta$}_{*}=k^{-1}(\mbox{\boldmath$\theta$}_{*})=\mbox{\boldmath$X\beta$}_{*}$ for the corresponding surrogates. Then we have that

\hat{\eta}_{q}=k^{-1}(q\hat{\theta}_{q}^{*})\stackrel{p}{\to}k^{-1}(\theta_{0})=\eta_{0}.

This leads to the correction $\hat{\eta}_{q}=k^{-1}(qk(\hat{\eta}_{q}^{*}))$ for the linear predictor, which yields the calibration function in Equation (2.6).
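As a small illustration, the calibration can be coded directly from this display. The sketch below uses hypothetical helper names of our own; for a canonical link, $k$ is the identity and the correction reduces to multiplying the raw linear predictor by $q$:

```python
def calibrate(eta_star, q, k=lambda t: t, k_inv=lambda t: t):
    """Calibration eta_q = k^{-1}(q * k(eta_star)) of Appendix B.

    k is the theta-link (theta = k(eta)) and k_inv its inverse; the
    defaults correspond to the canonical link, where theta = eta."""
    return k_inv(q * k(eta_star))

# Canonical link: the corrected predictor is just q times the raw one
assert calibrate(2.5, 0.8) == 0.8 * 2.5
# q = 1 (ordinary maximum likelihood): no correction is applied
assert calibrate(2.5, 1.0) == 2.5
```

For a non-canonical $\theta$-link one simply passes the corresponding pair $(k, k^{-1})$ of callables.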

Appendix C Asymptotic covariance

Before calculating the matrices defined in Equations (2.3) and (2.4), we introduce the following lemma.

Lemma C.1.

Consider $Y\sim\mathsf{FE}(\theta,\phi)$ and assume the setting of Lemma A.1. Then it is straightforward that

\operatorname{E}_{0}\{U^{r}(\theta)(Y-\mu)^{2}\}=\frac{\phi^{-1}}{r(1-q)+q}J_{q}^{-\phi}(\theta)V(\mu),

where $V(\mu)=\ddot{b}(\theta)$ is the variance function.

Proof.

Using Lemma A.1, we have that $\operatorname{E}_{0}\{U^{r}(\theta)(Y-\mu)^{2}\}=J_{q}^{-\phi}(\theta)\operatorname{E}_{q}\{(Y-\mu)^{2}\}$. Since

\operatorname{E}_{q}\{(Y-\mu)^{2}\}=\frac{\phi^{-1}}{r(1-q)+q}\,\ddot{b}(\theta),

the result follows. โˆŽ

Proof of Equation (2.3).

We have that

\mbox{\boldmath$A$}_{n}(\mbox{\boldmath$\beta$})=\operatorname{E}_{0}\{\mbox{\boldmath$\Psi$}_{n}(\mbox{\boldmath$\beta$})\mbox{\boldmath$\Psi$}_{n}^{\top}(\mbox{\boldmath$\beta$})\}=\sum_{i=1}^{n}\operatorname{E}_{0}\{U_{i}^{2}(\mbox{\boldmath$\beta$})\mbox{\boldmath$s$}_{i}(\mbox{\boldmath$\beta$})\mbox{\boldmath$s$}_{i}^{\top}(\mbox{\boldmath$\beta$})\},

where $\mbox{\boldmath$s$}_{i}(\mbox{\boldmath$\beta$})=\phi\{y_{i}-\dot{b}(\theta_{i})\}\dot{k}(\eta_{i})\mbox{\boldmath$x$}_{i}$. Thus,

\mbox{\boldmath$A$}_{n}(\mbox{\boldmath$\beta$})=\phi^{2}\sum_{i=1}^{n}\operatorname{E}_{0}[U_{i}^{2}(\mbox{\boldmath$\beta$})\{y_{i}-\dot{b}(\theta_{i})\}^{2}]\{\dot{k}(\eta_{i})\}^{2}\mbox{\boldmath$x$}_{i}\mbox{\boldmath$x$}_{i}^{\top}. (C.1)

Using Lemma C.1 with $r=2$, we obtain

\operatorname{E}_{0}[U_{i}^{2}(\mbox{\boldmath$\beta$})\{y_{i}-\dot{b}(\theta_{i})\}^{2}]=\frac{\phi^{-1}}{2(1-q)+q}\,J_{q}^{-\phi}(\theta_{i})\ddot{b}(\theta_{i}).

Substituting this result into Equation (C.1), it follows that

\mbox{\boldmath$A$}_{n}(\mbox{\boldmath$\beta$})=\frac{\phi}{2-q}\sum_{i=1}^{n}J_{q}^{-\phi}(\theta_{i})\ddot{b}(\theta_{i})\{\dot{k}(\eta_{i})\}^{2}\mbox{\boldmath$x$}_{i}\mbox{\boldmath$x$}_{i}^{\top}=\frac{\phi}{2-q}\sum_{i=1}^{n}W_{i}J_{q}^{-\phi}(\theta_{i})\mbox{\boldmath$x$}_{i}\mbox{\boldmath$x$}_{i}^{\top}=\frac{\phi}{2-q}\,\mbox{\boldmath$X$}^{\top}\mbox{\boldmath$WJX$},

which yields Equation (2.3). ∎
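The final step above, collapsing the sum of outer products into the matrix form $\phi/(2-q)\,\mbox{\boldmath$X$}^{\top}\mbox{\boldmath$WJX$}$, can be checked numerically. In this sketch the diagonal factors $W_i$ and $J_q^{-\phi}(\theta_i)$ are filled with arbitrary positive placeholders rather than model-based quantities:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, phi, q = 8, 3, 1.0, 0.85
X = rng.normal(size=(n, p))
w = rng.uniform(0.5, 2.0, size=n)  # placeholder for W_i
j = rng.uniform(0.5, 2.0, size=n)  # placeholder for J_q^{-phi}(theta_i)

# Sum-of-outer-products form from the proof of Equation (2.3)
A_sum = phi / (2 - q) * sum(w[i] * j[i] * np.outer(X[i], X[i]) for i in range(n))
# Closed matrix form A_n = phi/(2-q) X^T W J X
A_mat = phi / (2 - q) * X.T @ np.diag(w * j) @ X

assert np.allclose(A_sum, A_mat)
```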

Proof of Equation (2.4).

To obtain $\mbox{\boldmath$B$}_{n}(\mbox{\boldmath$\beta$})$, first consider

\frac{\partial}{\partial\eta_{i}}U_{i}(\eta_{i})s_{i}(\eta_{i})=\frac{\partial U_{i}(\eta_{i})}{\partial\eta_{i}}s_{i}(\eta_{i})+U_{i}(\eta_{i})\frac{\partial s_{i}(\eta_{i})}{\partial\eta_{i}},

with

\frac{\partial U_{i}(\eta_{i})}{\partial\eta_{i}}=(1-q)\{f(y_{i};\eta_{i},\phi)\}^{-q}\frac{\partial f(y_{i};\eta_{i},\phi)}{\partial\eta_{i}}=(1-q)U_{i}(\eta_{i})s_{i}(\eta_{i}),

where $s_{i}(\eta_{i})=\phi\{y_{i}-\dot{b}(\theta_{i})\}\dot{k}(\eta_{i})$. We also have that

\frac{\partial s_{i}(\eta_{i})}{\partial\eta_{i}}=-\phi\,\ddot{b}(\theta_{i})\{\dot{k}(\eta_{i})\}^{2}+\phi\{y_{i}-\dot{b}(\theta_{i})\}\ddot{k}(\eta_{i}).

On the one hand,

\operatorname{E}_{0}\Big\{-\frac{\partial U_{i}(\eta_{i})}{\partial\eta_{i}}s_{i}(\eta_{i})\Big\}=-\operatorname{E}_{0}\{(1-q)U_{i}(\eta_{i})s_{i}^{2}(\eta_{i})\}=-(1-q)\phi^{2}\operatorname{E}_{0}[U_{i}(\eta_{i})\{y_{i}-\dot{b}(\theta_{i})\}^{2}]\{\dot{k}(\eta_{i})\}^{2}.

We know that $\operatorname{E}_{0}[U_{i}(\eta_{i})\{y_{i}-\dot{b}(\theta_{i})\}^{2}]=\phi^{-1}J_{q}^{-\phi}(\theta_{i})\ddot{b}(\theta_{i})$ by Lemma C.1 with $r=1$; therefore

\operatorname{E}_{0}\Big\{-\frac{\partial U_{i}(\eta_{i})}{\partial\eta_{i}}s_{i}(\eta_{i})\Big\}=-(1-q)\phi J_{q}^{-\phi}(\theta_{i})\ddot{b}(\theta_{i})\{\dot{k}(\eta_{i})\}^{2}, (C.2)

on the other hand,

\operatorname{E}_{0}\Big\{-U_{i}(\eta_{i})\frac{\partial s_{i}(\eta_{i})}{\partial\eta_{i}}\Big\}=\phi\ddot{b}(\theta_{i})\{\dot{k}(\eta_{i})\}^{2}\operatorname{E}_{0}\{U_{i}(\eta_{i})\}-\phi\operatorname{E}_{0}[U_{i}(\eta_{i})\{y_{i}-\dot{b}(\theta_{i})\}]\ddot{k}(\eta_{i}).

By using Lemma A.1, we have that

\operatorname{E}_{0}\{U_{i}(\eta_{i})\}=J_{q}^{-\phi}(\theta_{i}),\qquad\operatorname{E}_{0}[U_{i}(\eta_{i})\{y_{i}-\dot{b}(\theta_{i})\}]=0.

Thus,

\operatorname{E}_{0}\Big\{-U_{i}(\eta_{i})\frac{\partial s_{i}(\eta_{i})}{\partial\eta_{i}}\Big\}=\phi\ddot{b}(\theta_{i})\{\dot{k}(\eta_{i})\}^{2}J_{q}^{-\phi}(\theta_{i}). (C.3)

Combining Equations (C.2) and (C.3) yields the expectation

\operatorname{E}_{0}\Big\{-\frac{\partial}{\partial\eta_{i}}U_{i}(\eta_{i})s_{i}(\eta_{i})\Big\}=q\phi J_{q}^{-\phi}(\theta_{i})\ddot{b}(\theta_{i})\{\dot{k}(\eta_{i})\}^{2}=q\phi J_{q}^{-\phi}(\theta_{i})W_{i}.

We know that $\partial\eta_{i}^{*}/\partial\eta_{i}=q^{-1}\,\partial k^{-1}(\theta_{i}^{*})/\partial\theta_{i}^{*}\,\dot{k}(\eta_{i})$. Therefore,

\operatorname{E}_{0}\Big\{-\frac{\partial\mbox{\boldmath$\Psi$}_{n}(\mbox{\boldmath$\beta$})}{\partial\mbox{\boldmath$\beta$}^{\top}}\Big\}=\phi\sum_{i=1}^{n}J_{q}^{-\phi}(\theta_{i})W_{i}\frac{\partial k^{-1}(\theta_{i}^{*})}{\partial\theta_{i}^{*}}\dot{k}(\eta_{i})\mbox{\boldmath$x$}_{i}\mbox{\boldmath$x$}_{i}^{\top}.

Considering $\mbox{\boldmath$G$}=\operatorname{diag}(\dot{g}(\theta_{1}^{*}),\dots,\dot{g}(\theta_{n}^{*}))$ with $g(\theta_{i}^{*})=k^{-1}(\theta_{i}^{*})$ and $\mbox{\boldmath$K$}=\operatorname{diag}(\dot{k}(\eta_{1}),\dots,\dot{k}(\eta_{n}))$, we obtain

\operatorname{E}_{0}\Big\{-\frac{\partial\mbox{\boldmath$\Psi$}_{n}(\mbox{\boldmath$\beta$})}{\partial\mbox{\boldmath$\beta$}^{\top}}\Big\}=\phi\mbox{\boldmath$X$}^{\top}\mbox{\boldmath$WJGKX$},

and the proof is complete. โˆŽ
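With both blocks in hand, the sandwich covariance $\mbox{\boldmath$B$}_{n}^{-1}\mbox{\boldmath$A$}_{n}\mbox{\boldmath$B$}_{n}^{-1}$ can be assembled in a few lines. The sketch below uses placeholder diagonal factors (hypothetical names, not the authors' code) and illustrates the reduction at $q=1$: there $J_q(\theta_i)=1$ and $\dot{g}(\theta_i^{*})\dot{k}(\eta_i)=1$, so $\mbox{\boldmath$B$}_{n}=\mbox{\boldmath$A$}_{n}=\phi\mbox{\boldmath$X$}^{\top}\mbox{\boldmath$WX$}$ and the sandwich collapses to the usual GLM inverse Fisher information:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, phi = 10, 4, 1.0
X = rng.normal(size=(n, p))
w = rng.uniform(0.5, 2.0, size=n)   # placeholder for W_i

q = 1.0                             # ordinary maximum likelihood
j = np.ones(n)                      # J_q^{-phi}(theta_i) = 1 when q = 1
gk = np.ones(n)                     # g'(theta_i^*) k'(eta_i) = 1 when q = 1

A = phi / (2 - q) * X.T @ np.diag(w * j) @ X        # Equation (2.3)
B = phi * X.T @ np.diag(w * j * gk) @ X             # Equation (2.4)
sandwich = np.linalg.inv(B) @ A @ np.linalg.inv(B)  # B^{-1} A B^{-1}

# At q = 1 the sandwich reduces to (phi X^T W X)^{-1}
assert np.allclose(A, B)
assert np.allclose(sandwich, np.linalg.inv(phi * X.T @ np.diag(w) @ X))
```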

Appendix D $L_q$-likelihood based tests

Proof of Proposition 2.

Following Property 24.16 in Gourieroux and Monfort (1995), we know that

\sqrt{n}(\widehat{\mbox{\boldmath$\beta$}}_{q}-\mbox{\boldmath$\beta$}_{0})\stackrel{D}{\longrightarrow}\mathsf{N}_{p}(\mbox{\boldmath$0$},\mbox{\boldmath$B$}^{-1}\mbox{\boldmath$AB$}^{-1}).

This leads to

\sqrt{n}\mbox{\boldmath$H$}(\widehat{\mbox{\boldmath$\beta$}}_{q}-\mbox{\boldmath$\beta$}_{0})\stackrel{D}{\longrightarrow}\mathsf{N}_{k}(\mbox{\boldmath$0$},\mbox{\boldmath$HB$}^{-1}\mbox{\boldmath$AB$}^{-1}\mbox{\boldmath$H$}^{\top}).

Evidently, $\sqrt{n}\mbox{\boldmath$H$}(\widehat{\mbox{\boldmath$\beta$}}_{q}-\mbox{\boldmath$\beta$}_{0})=\sqrt{n}(\mbox{\boldmath$H$}\widehat{\mbox{\boldmath$\beta$}}_{q}-\mbox{\boldmath$H\beta$}_{0})$. Now, under $H_{0}$, we have $\mbox{\boldmath$H\beta$}_{0}=\mbox{\boldmath$h$}$. Hence

\sqrt{n}(\mbox{\boldmath$H$}\widehat{\mbox{\boldmath$\beta$}}_{q}-\mbox{\boldmath$h$})\stackrel{D}{\longrightarrow}\mathsf{N}_{k}(\mbox{\boldmath$0$},\mbox{\boldmath$HB$}^{-1}\mbox{\boldmath$AB$}^{-1}\mbox{\boldmath$H$}^{\top}). (D.1)

Thus, under $H_{0}:\mbox{\boldmath$H\beta$}_{0}=\mbox{\boldmath$h$}$, we obtain

W_{n}=n(\mbox{\boldmath$H$}\widehat{\mbox{\boldmath$\beta$}}_{q}-\mbox{\boldmath$h$})^{\top}(\mbox{\boldmath$H$}\widehat{\mbox{\boldmath$B$}}^{-1}\widehat{\mbox{\boldmath$A$}}\widehat{\mbox{\boldmath$B$}}^{-1}\mbox{\boldmath$H$}^{\top})^{-1}(\mbox{\boldmath$H$}\widehat{\mbox{\boldmath$\beta$}}_{q}-\mbox{\boldmath$h$})\stackrel{D}{\longrightarrow}\chi^{2}(k),

which allows us to establish Equation (3.3).

To establish (3.4), note that the constrained estimator $\widetilde{\mbox{\boldmath$\beta$}}_{q}$ is defined as the solution of the Lagrangian problem

Q_{n}(\mbox{\boldmath$\beta$},\mbox{\boldmath$\lambda$})=L_{q}(\mbox{\boldmath$\beta$})+\mbox{\boldmath$\lambda$}^{\top}(\mbox{\boldmath$h$}-\mbox{\boldmath$H\beta$}),

where $\mbox{\boldmath$\lambda$}$ denotes a $k\times 1$ vector of Lagrange multipliers. We have that $\widetilde{\mbox{\boldmath$\beta$}}_{q}$ and $\widetilde{\mbox{\boldmath$\lambda$}}_{n}$ satisfy the first-order conditions (see Property 24.10 of Gourieroux and Monfort, 1995),

\mbox{\boldmath$\Psi$}_{n}(\widetilde{\mbox{\boldmath$\beta$}}_{q})-\mbox{\boldmath$H$}^{\top}\widetilde{\mbox{\boldmath$\lambda$}}_{n}=\mbox{\boldmath$0$},\qquad\mbox{\boldmath$H$}\widetilde{\mbox{\boldmath$\beta$}}_{q}-\mbox{\boldmath$h$}=\mbox{\boldmath$0$}, (D.2)

and the corrected $\widetilde{\mbox{\boldmath$\beta$}}_{q}$ is consistent. Considering Taylor expansions of $\mbox{\boldmath$\Psi$}_{n}(\widehat{\mbox{\boldmath$\beta$}}_{q})$ and $\mbox{\boldmath$\Psi$}_{n}(\widetilde{\mbox{\boldmath$\beta$}}_{q})$ around $\mbox{\boldmath$\beta$}$, and assuming that $\mbox{\boldmath$B$}_{n}\stackrel{a.s.}{\longrightarrow}\mbox{\boldmath$B$}$ uniformly, simple calculations yield

\sqrt{n}(\widehat{\mbox{\boldmath$\beta$}}_{q}-\widetilde{\mbox{\boldmath$\beta$}}_{q})=\mbox{\boldmath$B$}^{-1}\,\frac{1}{\sqrt{n}}\mbox{\boldmath$\Psi$}_{n}(\widetilde{\mbox{\boldmath$\beta$}}_{q})+o_{p}(\mbox{\boldmath$1$}).

The first-order condition (D.2) then leads to

\sqrt{n}\mbox{\boldmath$H$}(\widehat{\mbox{\boldmath$\beta$}}_{q}-\widetilde{\mbox{\boldmath$\beta$}}_{q})=\mbox{\boldmath$HB$}^{-1}\mbox{\boldmath$H$}^{\top}\,\frac{\widetilde{\mbox{\boldmath$\lambda$}}_{n}}{\sqrt{n}}+o_{p}(\mbox{\boldmath$1$}).

Moreover, using $\mbox{\boldmath$H$}\widetilde{\mbox{\boldmath$\beta$}}_{q}-\mbox{\boldmath$h$}=\mbox{\boldmath$0$}$, we find

\sqrt{n}(\mbox{\boldmath$H$}\widehat{\mbox{\boldmath$\beta$}}_{q}-\mbox{\boldmath$h$})-\sqrt{n}(\mbox{\boldmath$H$}\widetilde{\mbox{\boldmath$\beta$}}_{q}-\mbox{\boldmath$h$})=\sqrt{n}\mbox{\boldmath$H$}(\widehat{\mbox{\boldmath$\beta$}}_{q}-\widetilde{\mbox{\boldmath$\beta$}}_{q}). (D.3)

From (D.1) and (D.3), we obtain

\sqrt{n}\mbox{\boldmath$H$}(\widehat{\mbox{\boldmath$\beta$}}_{q}-\widetilde{\mbox{\boldmath$\beta$}}_{q})=\sqrt{n}(\mbox{\boldmath$H$}\widehat{\mbox{\boldmath$\beta$}}_{q}-\mbox{\boldmath$h$})\stackrel{D}{\longrightarrow}\mathsf{N}_{k}(\mbox{\boldmath$0$},\mbox{\boldmath$HB$}^{-1}\mbox{\boldmath$AB$}^{-1}\mbox{\boldmath$H$}^{\top}). (D.4)

Furthermore, combining the previous display with Equation (D.3) enables us to write

\frac{\widetilde{\mbox{\boldmath$\lambda$}}_{n}}{\sqrt{n}}=(\mbox{\boldmath$HB$}^{-1}\mbox{\boldmath$H$}^{\top})^{-1}\sqrt{n}(\mbox{\boldmath$H$}\widehat{\mbox{\boldmath$\beta$}}_{q}-\mbox{\boldmath$h$})+o_{p}(\mbox{\boldmath$1$}).

Then, using (D.4), it follows that

๐€~nnโŸถD๐–ญkโ€‹(๐ŸŽ,(๐‘ฏโ€‹๐‘ฉโˆ’1โ€‹๐‘ฏโŠค)โˆ’1โ€‹๐‘ฏโ€‹๐‘ฉโˆ’1โ€‹๐‘จโ€‹๐‘ฉโˆ’1โ€‹๐‘ฏโŠคโ€‹(๐‘ฏโ€‹๐‘ฉโˆ’1โ€‹๐‘ฏโŠค)โˆ’1).\frac{\widetilde{\mbox{\boldmath$\lambda$}}_{n}}{\sqrt{n}}\stackrel{{\scriptstyle D}}{{\longrightarrow}}\mathsf{N}_{k}(\mbox{\boldmath$0$},(\mbox{\boldmath$HB$}^{-1}\mbox{\boldmath$H$}^{\top})^{-1}\mbox{\boldmath$HB$}^{-1}\mbox{\boldmath$AB$}^{-1}\mbox{\boldmath$H$}^{\top}(\mbox{\boldmath$HB$}^{-1}\mbox{\boldmath$H$}^{\top})^{-1}). (D.5)

This result allows us to define the statistic

R_{n}=\frac{1}{n}\,\widetilde{\mbox{\boldmath$\lambda$}}_{n}^{\top}\mbox{\boldmath$H$}\widetilde{\mbox{\boldmath$B$}}^{-1}\mbox{\boldmath$H$}^{\top}(\mbox{\boldmath$H$}\widetilde{\mbox{\boldmath$B$}}^{-1}\widetilde{\mbox{\boldmath$A$}}\widetilde{\mbox{\boldmath$B$}}^{-1}\mbox{\boldmath$H$}^{\top})^{-1}\mbox{\boldmath$H$}\widetilde{\mbox{\boldmath$B$}}^{-1}\mbox{\boldmath$H$}^{\top}\widetilde{\mbox{\boldmath$\lambda$}}_{n}=\frac{1}{n}\,\mbox{\boldmath$\Psi$}_{n}^{\top}(\widetilde{\mbox{\boldmath$\beta$}}_{q})\widetilde{\mbox{\boldmath$B$}}^{-1}\mbox{\boldmath$H$}^{\top}(\mbox{\boldmath$H$}\widetilde{\mbox{\boldmath$B$}}^{-1}\widetilde{\mbox{\boldmath$A$}}\widetilde{\mbox{\boldmath$B$}}^{-1}\mbox{\boldmath$H$}^{\top})^{-1}\mbox{\boldmath$H$}\widetilde{\mbox{\boldmath$B$}}^{-1}\mbox{\boldmath$\Psi$}_{n}(\widetilde{\mbox{\boldmath$\beta$}}_{q})\stackrel{D}{\longrightarrow}\chi^{2}(k),

as desired.

Let $\mbox{\boldmath$HB$}^{-1}\mbox{\boldmath$AB$}^{-1}\mbox{\boldmath$H$}^{\top}=\mbox{\boldmath$RR$}^{\top}$, where $\mbox{\boldmath$R$}$ is a nonsingular $k\times k$ matrix. Thus, it follows from (D.1) and (D.5) that

\sqrt{n}\,\mbox{\boldmath$R$}^{-1}(\mbox{\boldmath$H$}\widehat{\mbox{\boldmath$\beta$}}_{q}-\mbox{\boldmath$h$})\stackrel{D}{\longrightarrow}\mathsf{N}_{k}(\mbox{\boldmath$0$},\mbox{\boldmath$I$}_{k}),\qquad\mbox{\boldmath$R$}^{-1}\mbox{\boldmath$HB$}^{-1}\mbox{\boldmath$H$}^{\top}\,\frac{\widetilde{\mbox{\boldmath$\lambda$}}_{n}}{\sqrt{n}}\stackrel{D}{\longrightarrow}\mathsf{N}_{k}(\mbox{\boldmath$0$},\mbox{\boldmath$I$}_{k}),

which implies that

BF_{n}=\Big\{\mbox{\boldmath$R$}^{-1}\mbox{\boldmath$H$}\widetilde{\mbox{\boldmath$B$}}^{-1}\mbox{\boldmath$H$}^{\top}\,\frac{\widetilde{\mbox{\boldmath$\lambda$}}_{n}}{\sqrt{n}}\Big\}^{\top}\sqrt{n}\,\mbox{\boldmath$R$}^{-1}(\mbox{\boldmath$H$}\widehat{\mbox{\boldmath$\beta$}}_{q}-\mbox{\boldmath$h$})=\widetilde{\mbox{\boldmath$\lambda$}}_{n}^{\top}\mbox{\boldmath$H$}\widetilde{\mbox{\boldmath$B$}}^{-1}\mbox{\boldmath$H$}^{\top}\mbox{\boldmath$R$}^{-\top}\mbox{\boldmath$R$}^{-1}(\mbox{\boldmath$H$}\widehat{\mbox{\boldmath$\beta$}}_{q}-\mbox{\boldmath$h$})=\widetilde{\mbox{\boldmath$\lambda$}}_{n}^{\top}\mbox{\boldmath$H$}\widetilde{\mbox{\boldmath$B$}}^{-1}\mbox{\boldmath$H$}^{\top}(\mbox{\boldmath$H$}\widehat{\mbox{\boldmath$B$}}^{-1}\widehat{\mbox{\boldmath$A$}}\widehat{\mbox{\boldmath$B$}}^{-1}\mbox{\boldmath$H$}^{\top})^{-1}(\mbox{\boldmath$H$}\widehat{\mbox{\boldmath$\beta$}}_{q}-\mbox{\boldmath$h$})\stackrel{D}{\longrightarrow}\chi^{2}(k),

which can be written alternatively as

BF_{n}=\mbox{\boldmath$\Psi$}_{n}^{\top}(\widetilde{\mbox{\boldmath$\beta$}}_{q})\widetilde{\mbox{\boldmath$B$}}^{-1}\mbox{\boldmath$H$}^{\top}(\mbox{\boldmath$H$}\widehat{\mbox{\boldmath$B$}}^{-1}\widehat{\mbox{\boldmath$A$}}\widehat{\mbox{\boldmath$B$}}^{-1}\mbox{\boldmath$H$}^{\top})^{-1}(\mbox{\boldmath$H$}\widehat{\mbox{\boldmath$\beta$}}_{q}-\mbox{\boldmath$h$})\stackrel{D}{\longrightarrow}\chi^{2}(k),

and the proposition is verified. โˆŽ
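The Wald statistic above translates directly into code. The following is a minimal sketch in Python (function and variable names are ours; $\widehat{\mbox{\boldmath$A$}}$ and $\widehat{\mbox{\boldmath$B$}}$ stand for consistent estimates of the limit matrices):

```python
import numpy as np

def wald_stat(beta_hat, H, h, A_hat, B_hat, n):
    """Wald statistic W_n = n d^T (H B^{-1} A B^{-1} H^T)^{-1} d,
    with d = H beta_hat - h."""
    B_inv = np.linalg.inv(B_hat)
    V = B_inv @ A_hat @ B_inv            # sandwich covariance matrix
    d = H @ beta_hat - h
    return float(n * d @ np.linalg.solve(H @ V @ H.T, d))

# Toy example: test H_0: beta_2 = 0 in a 3-parameter model
beta_hat = np.array([1.0, 0.3, -0.2])
H = np.array([[0.0, 1.0, 0.0]])
h = np.array([0.0])
A_hat = B_hat = np.eye(3)                # placeholder limit matrices
assert wald_stat(beta_hat, H, h, A_hat, B_hat, n=50) >= 0.0
# When the restriction holds exactly at beta_hat, the statistic is zero
assert wald_stat(beta_hat, H, H @ beta_hat, A_hat, B_hat, n=50) == 0.0
```

Under $H_0$ the value returned would be compared against a $\chi^2(k)$ quantile, with $k$ the number of rows of $H$.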

Appendix E Score-type statistic for hypotheses about subvectors

Consider the hypothesis $H_{0}:\mbox{\boldmath$\beta$}_{2}=\mbox{\boldmath$\beta$}_{02}$ with $\mbox{\boldmath$\beta$}=(\mbox{\boldmath$\beta$}_{1}^{\top},\mbox{\boldmath$\beta$}_{2}^{\top})^{\top}$, where $\mbox{\boldmath$\beta$}_{1}\in\mathbb{R}^{r}$ and $\mbox{\boldmath$\beta$}_{2}\in\mathbb{R}^{p-r}$. Define the Lagrangian function

Q_{n}(\mbox{\boldmath$\beta$})=L_{q}(\mbox{\boldmath$\beta$})+\mbox{\boldmath$\lambda$}^{\top}(\mbox{\boldmath$\beta$}_{02}-\mbox{\boldmath$\beta$}_{2}),

where $\mbox{\boldmath$\lambda$}\in\mathbb{R}^{p-r}$ is a vector of Lagrange multipliers. Let $\mbox{\boldmath$\Psi$}_{1}(\mbox{\boldmath$\beta$})=\partial L_{q}(\mbox{\boldmath$\beta$})/\partial\mbox{\boldmath$\beta$}_{1}$ and $\mbox{\boldmath$\Psi$}_{2}(\mbox{\boldmath$\beta$})=\partial L_{q}(\mbox{\boldmath$\beta$})/\partial\mbox{\boldmath$\beta$}_{2}$ denote the derivatives of $L_{q}(\mbox{\boldmath$\beta$})$ with respect to $\mbox{\boldmath$\beta$}_{1}$ and $\mbox{\boldmath$\beta$}_{2}$, respectively. The first-order conditions lead to

\mbox{\boldmath$\Psi$}_{1}(\widetilde{\mbox{\boldmath$\beta$}})=\mbox{\boldmath$0$},\qquad\mbox{\boldmath$\Psi$}_{2}(\widetilde{\mbox{\boldmath$\beta$}})-\widetilde{\mbox{\boldmath$\lambda$}}=\mbox{\boldmath$0$},\qquad\mbox{\boldmath$\beta$}_{02}-\widetilde{\mbox{\boldmath$\beta$}}_{2}=\mbox{\boldmath$0$},

where $\widetilde{\mbox{\boldmath$\beta$}}=(\widetilde{\mbox{\boldmath$\beta$}}_{1}^{\top},\widetilde{\mbox{\boldmath$\beta$}}_{2}^{\top})^{\top}$ and $\widetilde{\mbox{\boldmath$\lambda$}}$ denote the constrained estimators of $\mbox{\boldmath$\beta$}$ and $\mbox{\boldmath$\lambda$}$.

Following Boos (1992), we can consider a Taylor expansion of $\mbox{\boldmath$\Psi$}_{n}(\widetilde{\mbox{\boldmath$\beta$}})$ around $\mbox{\boldmath$\beta$}_{0}=(\mbox{\boldmath$\beta$}_{01}^{\top},\mbox{\boldmath$\beta$}_{02}^{\top})^{\top}$ as

\mbox{\boldmath$\Psi$}_{1}(\widetilde{\mbox{\boldmath$\beta$}})=\mbox{\boldmath$\Psi$}_{1}(\mbox{\boldmath$\beta$}_{0})+\frac{\partial\mbox{\boldmath$\Psi$}_{1}(\mbox{\boldmath$\beta$}_{0})}{\partial\mbox{\boldmath$\beta$}_{1}^{\top}}(\widetilde{\mbox{\boldmath$\beta$}}_{1}-\mbox{\boldmath$\beta$}_{01})+o_{p}(\mbox{\boldmath$1$}),\qquad\mbox{\boldmath$\Psi$}_{2}(\widetilde{\mbox{\boldmath$\beta$}})=\mbox{\boldmath$\Psi$}_{2}(\mbox{\boldmath$\beta$}_{0})+\frac{\partial\mbox{\boldmath$\Psi$}_{2}(\mbox{\boldmath$\beta$}_{0})}{\partial\mbox{\boldmath$\beta$}_{1}^{\top}}(\widetilde{\mbox{\boldmath$\beta$}}_{1}-\mbox{\boldmath$\beta$}_{01})+o_{p}(\mbox{\boldmath$1$}).

Substituting $-\partial\mbox{\boldmath$\Psi$}_{1}(\mbox{\boldmath$\beta$}_{0})/\partial\mbox{\boldmath$\beta$}_{1}^{\top}$ and $-\partial\mbox{\boldmath$\Psi$}_{2}(\mbox{\boldmath$\beta$}_{0})/\partial\mbox{\boldmath$\beta$}_{1}^{\top}$ by their expectations, it follows that

\mbox{\boldmath$\Psi$}_{1}(\widetilde{\mbox{\boldmath$\beta$}})=\mbox{\boldmath$\Psi$}_{1}(\mbox{\boldmath$\beta$}_{0})-\mbox{\boldmath$B$}_{11}(\mbox{\boldmath$\beta$}_{0})(\widetilde{\mbox{\boldmath$\beta$}}_{1}-\mbox{\boldmath$\beta$}_{01})+o_{p}(\mbox{\boldmath$1$}), (E.1)
\mbox{\boldmath$\Psi$}_{2}(\widetilde{\mbox{\boldmath$\beta$}})=\mbox{\boldmath$\Psi$}_{2}(\mbox{\boldmath$\beta$}_{0})-\mbox{\boldmath$B$}_{21}(\mbox{\boldmath$\beta$}_{0})(\widetilde{\mbox{\boldmath$\beta$}}_{1}-\mbox{\boldmath$\beta$}_{01})+o_{p}(\mbox{\boldmath$1$}), (E.2)

with $\mbox{\boldmath$B$}_{11}(\mbox{\boldmath$\beta$})=\operatorname{E}_{0}\{-\partial\mbox{\boldmath$\Psi$}_{1}(\mbox{\boldmath$\beta$})/\partial\mbox{\boldmath$\beta$}_{1}^{\top}\}$ and $\mbox{\boldmath$B$}_{21}(\mbox{\boldmath$\beta$})=\operatorname{E}_{0}\{-\partial\mbox{\boldmath$\Psi$}_{2}(\mbox{\boldmath$\beta$})/\partial\mbox{\boldmath$\beta$}_{1}^{\top}\}$. Setting $\mbox{\boldmath$\Psi$}_{1}(\widetilde{\mbox{\boldmath$\beta$}})=\mbox{\boldmath$0$}$ in (E.1), we notice that

\widetilde{\mbox{\boldmath$\beta$}}_{1}-\mbox{\boldmath$\beta$}_{01}=\mbox{\boldmath$B$}_{11}^{-1}(\mbox{\boldmath$\beta$}_{0})\mbox{\boldmath$\Psi$}_{1}(\mbox{\boldmath$\beta$}_{0})+o_{p}(\mbox{\boldmath$1$}). (E.3)

Substituting Equation (E.3) into (E.2), we obtain

\mbox{\boldmath$\Psi$}_{2}(\widetilde{\mbox{\boldmath$\beta$}})=\mbox{\boldmath$\Psi$}_{2}(\mbox{\boldmath$\beta$}_{0})-\mbox{\boldmath$B$}_{21}(\mbox{\boldmath$\beta$}_{0})\mbox{\boldmath$B$}_{11}^{-1}(\mbox{\boldmath$\beta$}_{0})\mbox{\boldmath$\Psi$}_{1}(\mbox{\boldmath$\beta$}_{0})+o_{p}(\mbox{\boldmath$1$})=(-\mbox{\boldmath$B$}_{21}(\mbox{\boldmath$\beta$}_{0})\mbox{\boldmath$B$}_{11}^{-1}(\mbox{\boldmath$\beta$}_{0}),\mbox{\boldmath$I$})\begin{pmatrix}\mbox{\boldmath$\Psi$}_{1}(\mbox{\boldmath$\beta$}_{0})\\ \mbox{\boldmath$\Psi$}_{2}(\mbox{\boldmath$\beta$}_{0})\end{pmatrix}+o_{p}(\mbox{\boldmath$1$}).

Moreover,

$\operatorname{Cov}(\mbox{\boldmath$\Psi$}_{2}(\widetilde{\mbox{\boldmath$\beta$}}))=(-\mbox{\boldmath$B$}_{21}(\mbox{\boldmath$\beta$}_{0})\mbox{\boldmath$B$}_{11}^{-1}(\mbox{\boldmath$\beta$}_{0}),\mbox{\boldmath$I$})\operatorname{Cov}(\mbox{\boldmath$\Psi$}_{n}(\mbox{\boldmath$\beta$}_{0}))\begin{pmatrix}-\mbox{\boldmath$B$}_{11}^{-\top}(\mbox{\boldmath$\beta$}_{0})\mbox{\boldmath$B$}_{21}^{\top}(\mbox{\boldmath$\beta$}_{0})\\ \mbox{\boldmath$I$}\end{pmatrix}.$

This leads to the score-type statistic

Rn=๐šฟ2โŠคโ€‹(๐œท~)โ€‹{Covโก(๐šฟ2โ€‹(๐œท~))}โˆ’1โ€‹๐šฟ2โ€‹(๐œท~).R_{n}=\mbox{\boldmath$\Psi$}_{2}^{\top}(\widetilde{\mbox{\boldmath$\beta$}})\{\operatorname{Cov}(\mbox{\boldmath$\Psi$}_{2}(\widetilde{\mbox{\boldmath$\beta$}}))\}^{-1}\mbox{\boldmath$\Psi$}_{2}(\widetilde{\mbox{\boldmath$\beta$}}).
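Computationally, $R_{n}$ is just a quadratic form in $\mbox{\boldmath$\Psi$}_{2}(\widetilde{\mbox{\boldmath$\beta$}})$. A generic sketch with illustrative inputs (not values from the paper):

```python
import math
import numpy as np

def score_statistic(psi2, cov_psi2):
    """Score-type statistic R_n = Psi_2^T {Cov(Psi_2)}^{-1} Psi_2."""
    return float(psi2 @ np.linalg.solve(cov_psi2, psi2))

# Illustrative inputs only (not values from the paper)
psi2 = np.array([1.2, -0.4])
cov_psi2 = np.array([[2.0, 0.3],
                     [0.3, 1.5]])
rn = score_statistic(psi2, cov_psi2)
# Under H0, R_n ~ chi^2(r) with r = len(psi2); for r = 2 the survival
# function has the closed form P(chi^2_2 > x) = exp(-x/2)
pval = math.exp(-rn / 2.0)
```

Using `np.linalg.solve` rather than forming the explicit inverse is the numerically preferable way to evaluate the quadratic form.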

E.1. Score-type statistic for adding a variable

Assume that $\eta_{i}=\mbox{\boldmath$x$}_{i}^{\top}\mbox{\boldmath$\beta$}+z_{i}\gamma=\mbox{\boldmath$f$}_{i}^{\top}\mbox{\boldmath$\delta$}$, where $\mbox{\boldmath$f$}_{i}=(\mbox{\boldmath$x$}_{i}^{\top},z_{i})^{\top}$ and $\mbox{\boldmath$\delta$}=(\mbox{\boldmath$\beta$}^{\top},\gamma)^{\top}$. Let $\mbox{\boldmath$F$}=(\mbox{\boldmath$X$},\mbox{\boldmath$z$})$ be the model matrix for the added-variable model, where

๐‘ฟ=(๐’™1โŠคโ‹ฎ๐’™nโŠค),๐’›=(z1โ‹ฎzn).\mbox{\boldmath$X$}=\begin{pmatrix}\mbox{\boldmath$x$}_{1}^{\top}\\ \vdots\\ \mbox{\boldmath$x$}_{n}^{\top}\end{pmatrix},\qquad\mbox{\boldmath$z$}=\begin{pmatrix}z_{1}\\ \vdots\\ z_{n}\end{pmatrix}.

Thus,

๐šฟnโ€‹(ฮด)=(๐šฟฮฒโ€‹(๐œน)๐šฟฮณโ€‹(๐œน))=ฯ•โ€‹(๐‘ฟโŠคโ€‹๐‘พ1/2โ€‹๐‘ผโ€‹๐‘ฝโˆ’1/2โ€‹(๐’€โˆ’๐)๐’›โŠคโ€‹๐‘พ1/2โ€‹๐‘ผโ€‹๐‘ฝโˆ’1/2โ€‹(๐’€โˆ’๐)).\mbox{\boldmath$\Psi$}_{n}(\delta)=\begin{pmatrix}\mbox{\boldmath$\Psi$}_{\beta}(\mbox{\boldmath$\delta$})\\ \mbox{\boldmath$\Psi$}_{\gamma}(\mbox{\boldmath$\delta$})\end{pmatrix}=\phi\begin{pmatrix}\mbox{\boldmath$X$}^{\top}\mbox{\boldmath$W$}^{1/2}\mbox{\boldmath$UV$}^{-1/2}(\mbox{\boldmath$Y$}-\mbox{\boldmath$\mu$})\\ \mbox{\boldmath$z$}^{\top}\mbox{\boldmath$W$}^{1/2}\mbox{\boldmath$UV$}^{-1/2}(\mbox{\boldmath$Y$}-\mbox{\boldmath$\mu$})\end{pmatrix}.
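With $W$, $U$ and $V$ diagonal, both blocks of $\Psi_{n}(\delta)$ reduce to weighted cross-products. A small numerical sketch under hypothetical diagonal entries (all values simulated):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 2
X = rng.normal(size=(n, p))          # model matrix for beta
z = rng.normal(size=n)               # added covariate
resid = rng.normal(size=n)           # stands in for (Y - mu)
phi = 1.0
# Hypothetical diagonal entries of W, U and V
W = rng.uniform(0.5, 1.5, size=n)
U = rng.uniform(0.5, 1.5, size=n)
V = rng.uniform(0.5, 1.5, size=n)

core = np.sqrt(W) * U * resid / np.sqrt(V)   # W^{1/2} U V^{-1/2} (Y - mu)
psi_beta = phi * X.T @ core                  # Psi_beta(delta), a p-vector
psi_gamma = phi * z @ core                   # Psi_gamma(delta), a scalar
```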

We have that,

๐‘จnโ€‹(ฮด)\displaystyle\mbox{\boldmath$A$}_{n}(\delta) =ฯ•2โˆ’qโ€‹๐‘ญโŠคโ€‹๐‘พโ€‹๐‘ฑโ€‹๐‘ญ=ฯ•2โˆ’qโ€‹(๐‘ฟโŠคโ€‹๐‘พโ€‹๐‘ฑโ€‹๐‘ฟ๐‘ฟโŠคโ€‹๐‘พโ€‹๐‘ฑโ€‹๐’›๐’›โŠคโ€‹๐‘พโ€‹๐‘ฑโ€‹๐‘ฟ๐’›โŠคโ€‹๐‘พโ€‹๐‘ฑโ€‹๐’›)\displaystyle=\frac{\phi}{2-q}\mbox{\boldmath$F$}^{\top}\mbox{\boldmath$WJF$}=\frac{\phi}{2-q}\begin{pmatrix}\mbox{\boldmath$X$}^{\top}\mbox{\boldmath$WJX$}&\mbox{\boldmath$X$}^{\top}\mbox{\boldmath$WJz$}\\ \mbox{\boldmath$z$}^{\top}\mbox{\boldmath$WJX$}&\mbox{\boldmath$z$}^{\top}\mbox{\boldmath$WJz$}\end{pmatrix}
๐‘ฉnโ€‹(ฮด)\displaystyle\mbox{\boldmath$B$}_{n}(\delta) =ฯ•โ€‹๐‘ญโŠคโ€‹๐‘พโ€‹๐‘ฑโ€‹๐‘ฎโ€‹๐‘ฒโ€‹๐‘ญ=ฯ•โ€‹(๐‘ฟโŠคโ€‹๐‘พโ€‹๐‘ฑโ€‹๐‘ฎโ€‹๐‘ฒโ€‹๐‘ฟ๐‘ฟโŠคโ€‹๐‘พโ€‹๐‘ฑโ€‹๐‘ฎโ€‹๐‘ฒโ€‹๐’›๐’›โŠคโ€‹๐‘พโ€‹๐‘ฑโ€‹๐‘ฎโ€‹๐‘ฒโ€‹๐‘ฟ๐’›โŠคโ€‹๐‘พโ€‹๐‘ฑโ€‹๐‘ฎโ€‹๐‘ฒโ€‹๐’›).\displaystyle=\phi\mbox{\boldmath$F$}^{\top}\mbox{\boldmath$WJGKF$}=\phi\begin{pmatrix}\mbox{\boldmath$X$}^{\top}\mbox{\boldmath$WJGKX$}&\mbox{\boldmath$X$}^{\top}\mbox{\boldmath$WJGKz$}\\ \mbox{\boldmath$z$}^{\top}\mbox{\boldmath$WJGKX$}&\mbox{\boldmath$z$}^{\top}\mbox{\boldmath$WJGKz$}\end{pmatrix}.

Consider the following partition for matrices ๐‘จnโ€‹(๐œน)\mbox{\boldmath$A$}_{n}(\mbox{\boldmath$\delta$}) and ๐‘ฉnโ€‹(๐œน)\mbox{\boldmath$B$}_{n}(\mbox{\boldmath$\delta$}),

๐‘จnโ€‹(ฮด)=(๐‘จ11๐‘จ12๐‘จ21๐‘จ22),๐‘ฉnโ€‹(ฮด)=(๐‘ฉ11๐‘ฉ12๐‘ฉ21๐‘ฉ22).\mbox{\boldmath$A$}_{n}(\delta)=\begin{pmatrix}\mbox{\boldmath$A$}_{11}&\mbox{\boldmath$A$}_{12}\\ \mbox{\boldmath$A$}_{21}&\mbox{\boldmath$A$}_{22}\end{pmatrix},\qquad\mbox{\boldmath$B$}_{n}(\delta)=\begin{pmatrix}\mbox{\boldmath$B$}_{11}&\mbox{\boldmath$B$}_{12}\\ \mbox{\boldmath$B$}_{21}&\mbox{\boldmath$B$}_{22}\end{pmatrix}.

Therefore, the covariance of ๐šฟฮณโ€‹(๐œน)\mbox{\boldmath$\Psi$}_{\gamma}(\mbox{\boldmath$\delta$}) assumes the form,

Covโก(๐šฟฮณโ€‹(๐œน))=๐‘จ22โˆ’๐‘จ21โ€‹๐‘ฉ11โˆ’1โ€‹๐‘ฉ12โˆ’(๐‘ฉ21โ€‹๐‘ฉ11โˆ’1โ€‹๐‘จ12โˆ’๐‘ฉ21โ€‹๐‘ฉ11โˆ’1โ€‹๐‘จ11โ€‹๐‘ฉ11โˆ’1โ€‹๐‘ฉ12).\operatorname{Cov}(\mbox{\boldmath$\Psi$}_{\gamma}(\mbox{\boldmath$\delta$}))=\mbox{\boldmath$A$}_{22}-\mbox{\boldmath$A$}_{21}\mbox{\boldmath$B$}_{11}^{-1}\mbox{\boldmath$B$}_{12}-(\mbox{\boldmath$B$}_{21}\mbox{\boldmath$B$}_{11}^{-1}\mbox{\boldmath$A$}_{12}-\mbox{\boldmath$B$}_{21}\mbox{\boldmath$B$}_{11}^{-1}\mbox{\boldmath$A$}_{11}\mbox{\boldmath$B$}_{11}^{-1}\mbox{\boldmath$B$}_{12}).
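Because $B_{n}(\delta)=\phi F^{\top}WJGKF$ is symmetric, the expansion just displayed coincides with the sandwich form $(-B_{21}B_{11}^{-1},\,I)\,A_{n}\,(-B_{21}B_{11}^{-1},\,I)^{\top}$. This equivalence can be verified numerically with arbitrary symmetric blocks (all matrices below are simulated placeholders):

```python
import numpy as np

rng = np.random.default_rng(2)
p, r = 3, 2                      # dim(beta) and dim(gamma)
m = p + r

def random_spd(k):
    # Random symmetric positive-definite matrix
    M = rng.normal(size=(k, k))
    return M @ M.T + k * np.eye(k)

A = random_spd(m)                # stands in for A_n(delta)
B = random_spd(m)                # stands in for B_n(delta); symmetric, as in the text
A11, A12, A21, A22 = A[:p, :p], A[:p, p:], A[p:, :p], A[p:, p:]
B11, B12, B21 = B[:p, :p], B[:p, p:], B[p:, :p]
B11inv = np.linalg.inv(B11)

# Expanded form displayed above
cov1 = (A22 - A21 @ B11inv @ B12
        - (B21 @ B11inv @ A12 - B21 @ B11inv @ A11 @ B11inv @ B12))
# Equivalent sandwich form
C = np.hstack([-B21 @ B11inv, np.eye(r)])
cov2 = C @ A @ C.T
print(np.allclose(cov1, cov2))  # True when B is symmetric
```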

After some simple algebra, we obtain

Covโก(๐šฟฮณโ€‹(๐œน))=ฯ•2โˆ’qโ€‹๐’›โŠคโ€‹(๐‘ฐโˆ’๐‘พโ€‹๐‘ฑโ€‹๐‘ฎโ€‹๐‘ฒโ€‹๐‘ท)โ€‹๐‘พโ€‹๐‘ฑโ€‹(๐‘ฐโˆ’๐‘ทโ€‹๐‘พโ€‹๐‘ฑโ€‹๐‘ฎโ€‹๐‘ฒ)โ€‹๐’›,\operatorname{Cov}(\mbox{\boldmath$\Psi$}_{\gamma}(\mbox{\boldmath$\delta$}))=\frac{\phi}{2-q}\mbox{\boldmath$z$}^{\top}(\mbox{\boldmath$I$}-\mbox{\boldmath$WJGKP$})\mbox{\boldmath$WJ$}(\mbox{\boldmath$I$}-\mbox{\boldmath$PWJGK$})\mbox{\boldmath$z$},

where ๐‘ท=๐‘ฟโ€‹(๐‘ฟโŠคโ€‹๐‘พโ€‹๐‘ฑโ€‹๐‘ฎโ€‹๐‘ฒโ€‹๐‘ฟ)โˆ’1โ€‹๐‘ฟโŠค\mbox{\boldmath$P$}=\mbox{\boldmath$X$}(\mbox{\boldmath$X$}^{\top}\mbox{\boldmath$WJGKX$})^{-1}\mbox{\boldmath$X$}^{\top}. This, leads to

Rn\displaystyle R_{n} =(ฯ•2โˆ’q)โˆ’1โ€‹{ฯ•โ€‹๐’›โŠคโ€‹๐‘พ^1/2โ€‹๐‘ผ^โ€‹๐‘ฝ^โˆ’1/2โ€‹(๐’€โˆ’๐)}2๐’›โŠคโ€‹(๐‘ฐโˆ’๐‘พ^โ€‹๐‘ฑ^โ€‹๐‘ฎ^โ€‹๐‘ฒ^โ€‹๐‘ท^)โ€‹๐‘พ^โ€‹๐‘ฑ^โ€‹(๐‘ฐโˆ’๐‘ท^โ€‹๐‘พ^โ€‹๐‘ฑ^โ€‹๐‘ฎ^โ€‹๐‘ฒ^)โ€‹๐’›\displaystyle=\Big{(}\frac{\phi}{2-q}\Big{)}^{-1}\frac{\{\phi\mbox{\boldmath$z$}^{\top}\widehat{\mbox{\boldmath$W$}}^{1/2}\widehat{\mbox{\boldmath$U$}}\widehat{\mbox{\boldmath$V$}}^{-1/2}(\mbox{\boldmath$Y$}-\mbox{\boldmath$\mu$})\}^{2}}{\mbox{\boldmath$z$}^{\top}(\mbox{\boldmath$I$}-\widehat{\mbox{\boldmath$W$}}\widehat{\mbox{\boldmath$J$}}\widehat{\mbox{\boldmath$G$}}\widehat{\mbox{\boldmath$K$}}\widehat{\mbox{\boldmath$P$}})\widehat{\mbox{\boldmath$W$}}\widehat{\mbox{\boldmath$J$}}(\mbox{\boldmath$I$}-\widehat{\mbox{\boldmath$P$}}\widehat{\mbox{\boldmath$W$}}\widehat{\mbox{\boldmath$J$}}\widehat{\mbox{\boldmath$G$}}\widehat{\mbox{\boldmath$K$}})\mbox{\boldmath$z$}}
=(2โˆ’q)โ€‹{๐’›โŠคโ€‹๐‘พ^1/2โ€‹๐‘ผ^โ€‹๐‘ฝ^โˆ’1/2โ€‹(๐’€โˆ’๐)}2ฯ•โˆ’1โ€‹๐’›โŠคโ€‹(๐‘ฐโˆ’๐‘พ^โ€‹๐‘ฑ^โ€‹๐‘ฎ^โ€‹๐‘ฒ^โ€‹๐‘ท^)โ€‹๐‘พ^โ€‹๐‘ฑ^โ€‹(๐‘ฐโˆ’๐‘ท^โ€‹๐‘พ^โ€‹๐‘ฑ^โ€‹๐‘ฎ^โ€‹๐‘ฒ^)โ€‹๐’›โŸถDฯ‡2โ€‹(1).\displaystyle=\frac{(2-q)\{\mbox{\boldmath$z$}^{\top}\widehat{\mbox{\boldmath$W$}}^{1/2}\widehat{\mbox{\boldmath$U$}}\widehat{\mbox{\boldmath$V$}}^{-1/2}(\mbox{\boldmath$Y$}-\mbox{\boldmath$\mu$})\}^{2}}{\phi^{-1}\mbox{\boldmath$z$}^{\top}(\mbox{\boldmath$I$}-\widehat{\mbox{\boldmath$W$}}\widehat{\mbox{\boldmath$J$}}\widehat{\mbox{\boldmath$G$}}\widehat{\mbox{\boldmath$K$}}\widehat{\mbox{\boldmath$P$}})\widehat{\mbox{\boldmath$W$}}\widehat{\mbox{\boldmath$J$}}(\mbox{\boldmath$I$}-\widehat{\mbox{\boldmath$P$}}\widehat{\mbox{\boldmath$W$}}\widehat{\mbox{\boldmath$J$}}\widehat{\mbox{\boldmath$G$}}\widehat{\mbox{\boldmath$K$}})\mbox{\boldmath$z$}}\stackrel{{\scriptstyle D}}{{\longrightarrow}}\chi^{2}(1).

Taking the signed square root of $R_{n}$, it follows that

2โˆ’qโ€‹{๐’›โŠคโ€‹๐‘พ^1/2โ€‹๐‘ผ^โ€‹๐‘ฝ^โˆ’1/2โ€‹(๐’€โˆ’๐)}ฯ•โˆ’1/2โ€‹๐’›โŠคโ€‹(๐‘ฐโˆ’๐‘พ^โ€‹๐‘ฑ^โ€‹๐‘ฎ^โ€‹๐‘ฒ^โ€‹๐‘ท^)โ€‹๐‘พ^โ€‹๐‘ฑ^โ€‹(๐‘ฐโˆ’๐‘ท^โ€‹๐‘พ^โ€‹๐‘ฑ^โ€‹๐‘ฎ^โ€‹๐‘ฒ^)โ€‹๐’›โŸถD๐–ญโ€‹(0,1).\frac{\sqrt{2-q}\{\mbox{\boldmath$z$}^{\top}\widehat{\mbox{\boldmath$W$}}^{1/2}\widehat{\mbox{\boldmath$U$}}\widehat{\mbox{\boldmath$V$}}^{-1/2}(\mbox{\boldmath$Y$}-\mbox{\boldmath$\mu$})\}}{\phi^{-1/2}\sqrt{\mbox{\boldmath$z$}^{\top}(\mbox{\boldmath$I$}-\widehat{\mbox{\boldmath$W$}}\widehat{\mbox{\boldmath$J$}}\widehat{\mbox{\boldmath$G$}}\widehat{\mbox{\boldmath$K$}}\widehat{\mbox{\boldmath$P$}})\widehat{\mbox{\boldmath$W$}}\widehat{\mbox{\boldmath$J$}}(\mbox{\boldmath$I$}-\widehat{\mbox{\boldmath$P$}}\widehat{\mbox{\boldmath$W$}}\widehat{\mbox{\boldmath$J$}}\widehat{\mbox{\boldmath$G$}}\widehat{\mbox{\boldmath$K$}})\mbox{\boldmath$z$}}}\stackrel{{\scriptstyle D}}{{\longrightarrow}}\mathsf{N}(0,1).

Now, consider $\mbox{\boldmath$z$}=(0,\dots,1,\dots,0)^{\top}$, a vector of zeros except for a 1 at the $i$th position. It is straightforward to verify that

๐’›โŠคโ€‹๐‘พ^1/2โ€‹๐‘ผ^โ€‹๐‘ฝ^โˆ’1/2โ€‹(๐’€โˆ’๐)=W^i1/2โ€‹U^iโ€‹(yiโˆ’ฮผ^i)/V^i1/2,\mbox{\boldmath$z$}^{\top}\widehat{\mbox{\boldmath$W$}}^{1/2}\widehat{\mbox{\boldmath$U$}}\widehat{\mbox{\boldmath$V$}}^{-1/2}(\mbox{\boldmath$Y$}-\mbox{\boldmath$\mu$})=\widehat{W}_{i}^{1/2}\widehat{U}_{i}(y_{i}-\hat{\mu}_{i})/\widehat{V}_{i}^{1/2},

and

๐’›โŠคโ€‹(๐‘ฐโˆ’๐‘พ^โ€‹๐‘ฑ^โ€‹๐‘ฎ^โ€‹๐‘ฒ^โ€‹๐‘ท^)โ€‹๐‘พ^โ€‹๐‘ฑ^โ€‹(๐‘ฐโˆ’๐‘ท^โ€‹๐‘พ^โ€‹๐‘ฑ^โ€‹๐‘ฎ^โ€‹๐‘ฒ^)โ€‹๐’›=W^iโ€‹J^iโ€‹{(1โˆ’g^iโ€‹k^iโ€‹m^iโ€‹i)โˆ’g^iโ€‹k^iโ€‹(m^iโ€‹iโˆ’g^iโ€‹k^iโ€‹m^iโ€‹iโˆ—)}.\mbox{\boldmath$z$}^{\top}(\mbox{\boldmath$I$}-\widehat{\mbox{\boldmath$W$}}\widehat{\mbox{\boldmath$J$}}\widehat{\mbox{\boldmath$G$}}\widehat{\mbox{\boldmath$K$}}\widehat{\mbox{\boldmath$P$}})\widehat{\mbox{\boldmath$W$}}\widehat{\mbox{\boldmath$J$}}(\mbox{\boldmath$I$}-\widehat{\mbox{\boldmath$P$}}\widehat{\mbox{\boldmath$W$}}\widehat{\mbox{\boldmath$J$}}\widehat{\mbox{\boldmath$G$}}\widehat{\mbox{\boldmath$K$}})\mbox{\boldmath$z$}=\widehat{W}_{i}\widehat{J}_{i}\{(1-\widehat{g}_{i}\widehat{k}_{i}\widehat{m}_{ii})-\widehat{g}_{i}\widehat{k}_{i}(\widehat{m}_{ii}-\widehat{g}_{i}\widehat{k}_{i}\widehat{m}_{ii}^{*})\}.

This yields the standardized residual,

ti=2โˆ’qโ€‹W^i1/2โ€‹U^iโ€‹(yiโˆ’ฮผ^i)/V^i1/2ฯ•โˆ’1/2โ€‹W^i1/2โ€‹J^i1/2โ€‹(1โˆ’g^iโ€‹k^iโ€‹m^iโ€‹i)โˆ’g^iโ€‹k^iโ€‹(m^iโ€‹iโˆ’g^iโ€‹k^iโ€‹m^iโ€‹iโˆ—),t_{i}=\frac{\sqrt{2-q}\,\widehat{W}_{i}^{1/2}\widehat{U}_{i}(y_{i}-\widehat{\mu}_{i})/\hat{V}_{i}^{1/2}}{\phi^{-1/2}\widehat{W}_{i}^{1/2}\widehat{J}_{i}^{1/2}\sqrt{(1-\widehat{g}_{i}\widehat{k}_{i}\widehat{m}_{ii})-\widehat{g}_{i}\widehat{k}_{i}(\widehat{m}_{ii}-\widehat{g}_{i}\widehat{k}_{i}\widehat{m}_{ii}^{*})}},

which approximately follows a standard normal distribution.
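The residual $t_{i}$ depends only on scalar per-observation quantities, and $\widehat{W}_{i}^{1/2}$ cancels between numerator and denominator. A minimal sketch (the numerical inputs below are hypothetical, not fitted values from the paper):

```python
import numpy as np

def std_residual(y, mu, W, U, V, J, g, k, m, m_star, q, phi):
    """Standardized residual t_i built from the fitted per-observation quantities."""
    num = np.sqrt(2.0 - q) * np.sqrt(W) * U * (y - mu) / np.sqrt(V)
    den = (np.sqrt(W * J / phi)
           * np.sqrt((1.0 - g * k * m) - g * k * (m - g * k * m_star)))
    return num / den

# Hypothetical fitted values for a single Bernoulli observation
t = std_residual(y=1.0, mu=0.62, W=0.24, U=1.0, V=0.24, J=1.0,
                 g=1.0, k=1.0, m=0.15, m_star=0.15, q=0.90, phi=1.0)
```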

Appendix F Additional Results

Additional tables and figures related to the example of skin vaso-constriction data are presented below.

Table F.1. Estimates (standard errors in parentheses) by maximum Lq-likelihood for the skin vaso-constriction data.

  q       Intercept           log(volume)        log(rate)
  1.00     -2.875  (1.321)     5.179  (1.865)     4.562  (1.838)
  0.98     -2.919  (1.349)     5.220  (1.909)     4.600  (1.875)
  0.96     -2.970  (1.382)     5.271  (1.960)     4.649  (1.918)
  0.94     -3.033  (1.420)     5.337  (2.020)     4.710  (1.967)
  0.92     -3.109  (1.465)     5.421  (2.092)     4.789  (2.026)
  0.90     -3.205  (1.519)     5.531  (2.179)     4.892  (2.097)
  0.88     -3.327  (1.587)     5.677  (2.289)     5.027  (2.185)
  0.86     -3.488  (1.673)     5.877  (2.430)     5.211  (2.297)
  0.84     -3.712  (1.712)     6.165  (2.624)     5.473  (2.450)
  0.82     -4.047  (1.965)     6.614  (2.914)     5.875  (2.677)
  0.80     -4.636  (2.273)     7.439  (3.431)     6.601  (3.078)
  0.79*    -5.185  (2.563)     8.234  (3.920)     7.287  (3.455)
  0.78     -6.322  (3.166)     9.936  (4.934)     8.727  (4.233)
  0.76    -18.054 (11.594)    28.906 (19.195)    23.609 (14.760)
  0.74    -20.645 (15.478)    33.394 (25.732)    26.922 (19.695)

* reported in Section 4.1.
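Reading down the table, the standard errors grow only modestly until about $q=0.80$ and then inflate abruptly below $q=0.78$. Transcribing the intercept column makes this explicit (values copied from Table F.1):

```python
# Standard errors of the intercept at selected q (transcribed from Table F.1)
se_intercept = {1.00: 1.321, 0.90: 1.519, 0.80: 2.273, 0.76: 11.594, 0.74: 15.478}

base = se_intercept[1.00]
inflation = {q: round(se / base, 2) for q, se in se_intercept.items()}
# inflation: modest growth down to q = 0.80, then an abrupt jump below q = 0.78
```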
Figure F.1. Skin vaso-constriction data: estimated weights from the logistic model fitted using maximum Lq-likelihood for q = 0.98, 0.96, 0.94, 0.92, 0.90, 0.88, 0.86, 0.84, 0.82, 0.78, 0.76, 0.74 (panels (a)-(l); images not reproduced in this conversion).
Figure F.2. Skin vaso-constriction data: observations and the estimated probabilities using maximum Lq-likelihood for q = 0.98, 0.96, 0.94, 0.92, 0.90, 0.88, 0.86, 0.84, 0.82, 0.78, 0.76, 0.74 (panels (a)-(l); images not reproduced in this conversion).