Larissa A. Matos
¹Departamento de Estatística, Universidade Estadual de Campinas, Campinas, 13083-859, São Paulo, Brazil
²Departamento de Matemáticas, Escuela Superior Politécnica del Litoral, Guayaquil, 090112, Ecuador
Censored autoregressive regression models with Student-$t$ innovations
Abstract
The Student-$t$ distribution is widely used in statistical modeling of datasets involving outliers since its longer-than-normal tails provide a robust approach to handle such data. Furthermore, data collected over time may contain censored or missing observations, making it impossible to use standard statistical procedures. This paper proposes an algorithm to estimate the parameters of a censored linear regression model when the regression errors are autocorrelated and the innovations follow a Student-$t$ distribution. To fit the proposed model, maximum likelihood estimates are obtained through the SAEM algorithm, a stochastic approximation of the EM algorithm that is useful for models in which the E-step does not have an analytic form. The methods are illustrated by the analysis of a real dataset that has left-censored and missing observations. We also conducted two simulation studies to examine the asymptotic properties of the estimates and the robustness of the model.
Keywords: Autoregressive models; Censored data; Heavy-tailed distributions; Missing data; SAEM algorithm

1 Introduction
A linear regression model is a commonly used statistical tool to analyze the relation between a response (dependent) variable and some explanatory (independent) variables. In these models, the errors are usually assumed to be independent and identically distributed random variables with zero mean and constant variance. However, observations collected over time are often autocorrelated rather than independent. Examples of this kind of data abound in economics, medicine, environmental monitoring, and social sciences, and therefore the study of the dependence among observations is of considerable practical interest (see tsay2005analysis; box2015time).
A stochastic model that has been successfully used in many real-world applications to deal with serial correlation in the error term is the autoregressive (AR) model. In the AR model, the current state of the process is expressed as a finite linear combination of previous states and a stochastic shock of disturbance, which is also known as the innovation in the time series literature. In general, the disturbance is assumed to be normally distributed. For example, beach1978maximum suggested a maximum likelihood (ML) method to estimate the parameters of a regression model with AR(1) errors; ullah1983estimation used the two-stage Prais-Winsten estimator for a linear regression model with first-order autocorrelated disturbances, in which a Gaussian assumption on the innovations is used to derive a large-sample asymptotic approximation for the variance-covariance matrix; and more recently alpuim2008efficiency proposed a linear regression model whose error terms follow a Gaussian autoregressive stationary process, with the model parameters estimated by ML and ordinary least squares.
In practice, however, the normality assumption may be unrealistic, especially in the presence of outliers. To relax this assumption, tiku1999estimating suggested considering non-normal symmetric innovations in a simple regression model with AR(1) errors. tuacc2018robust proposed a model with AR(p) errors in which the innovations follow a Student-$t$ distribution with fixed degrees of freedom as an alternative to the normal distribution, applying the conditional maximum likelihood method to estimate the model parameters, and nduka2018based developed an EM algorithm to estimate the parameters of autoregressive models of order $p$ with Student-$t$ innovations.
An additional challenge arises when some observations are censored or missing. In the first case, values can occur outside the range of a measuring instrument, and in the second, the values are unknown. Censored time series are frequently encountered in environmental monitoring, medicine, and economics, and there are some proposals in the literature related to censored autoregressive linear regression models with normal innovations. For example, zeger1986regression proposed a Gaussian censored autocorrelated model whose parameters are estimated by maximizing the full likelihood using the quasi-Newton and Expectation-Maximization (EM) algorithms; the authors pointed out that the quasi-Newton approach is preferable because it also estimates the variance-covariance matrix of the parameters.
Additionally, park2007censored developed a censored autoregressive moving average model with Gaussian white noise, in which the algorithm iteratively imputes the censored or missing observations by sampling from the truncated normal distribution and the parameters are estimated by maximizing an approximate full likelihood function, while wang2018quasi suggested a quasi-likelihood method using the complete-incomplete data framework. Recently, schumacher2017censored considered an autoregressive censored linear model with the parameters estimated via an analytically tractable and efficient stochastic approximation of the EM (SAEM) algorithm, and liu2019parameter proposed a coupled Markov chain Monte Carlo (MCMC)-SAEM algorithm to fit an AR regression model with Student-$t$ innovations accounting for missing data.
Nevertheless, to the best of our knowledge, there are no studies that consider the Student-$t$ distribution for the innovations in censored autoregressive models from a likelihood-based perspective. Hence, in this work we propose an EM-type algorithm to estimate the parameters of a censored regression model with autoregressive errors and innovations following a Student-$t$ distribution. Specifically, the SAEM algorithm is considered to avoid the direct computation of complex expressions in the E-step of the EM algorithm (dempster1977maximum), and because its computational effort is much smaller than that of the Monte Carlo EM algorithm (wei1990monte), as shown by jank2006implementing.
The paper is organized as follows. Section 2 introduces the autoregressive regression model of order $p$ and the SAEM algorithm. Section 3 is devoted to formulating the censored autoregressive model with Student-$t$ innovations, providing the expressions used to estimate the parameters of the proposed model, the method used to estimate the observed information matrix, and the expressions used to make predictions. Section 4 displays the results of two simulation studies carried out to examine the asymptotic properties of the estimators and to demonstrate the robustness of the model. Section 5 applies the proposed model to the total phosphorus concentration dataset available in the R package ARCensReg (schumacher2016package; rteam), along with an analysis of quantile residuals. Finally, Section 6 concludes with a discussion.
2 Preliminaries
We begin by defining some notation and introducing the basic definitions and algorithms used throughout this work. The normal distribution with mean $\mu$ and variance $\sigma^2$ is denoted by $N(\mu, \sigma^2)$, the notation for a $p$-variate normal distribution with mean vector $\boldsymbol{\mu}$ and variance-covariance matrix $\boldsymbol{\Sigma}$ is $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, and a truncated multivariate normal distribution with truncation region $\mathbb{A}$ is denoted by $TN_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}; \mathbb{A})$. The notation for a random variable with Gamma distribution is $\mathrm{Gamma}(\alpha, \beta)$, where $\alpha$ is the shape parameter and $\beta$ is the rate parameter. The Student-$t$ distribution with location parameter $\mu$, scale parameter $\sigma^2$, and $\nu$ degrees of freedom is denoted by $t(\mu, \sigma^2, \nu)$. A random variable with uniform distribution on the interval $(a, b)$ is denoted by $U(a, b)$. Furthermore, the expression "iid" means independent and identically distributed, $\mathbf{A}^\top$ represents the transpose of $\mathbf{A}$, $\Gamma(x)$ is the gamma function evaluated at $x > 0$, and $\mathcal{F}_t$ is the $\sigma$-field generated by $\{Y_1, \dots, Y_t\}$. Finally, for integers $a$ and $b$ with $a \leq b$, we use the index $a{:}b$ as follows: for a vector $\mathbf{x} = (x_1, \dots, x_n)^\top$, $\mathbf{x}_{a:b} = (x_a, \dots, x_b)^\top$; and for a matrix $\mathbf{X}$ with rows $\mathbf{x}_1^\top, \dots, \mathbf{x}_n^\top$, $\mathbf{X}_{a:b} = (\mathbf{x}_a, \dots, \mathbf{x}_b)^\top$.
2.1 The autoregressive regression model of order $p$
First, consider a linear regression model with errors that are autocorrelated as a discrete-time autoregressive process of order $p$; thus, the model for an observation at time $t$ is given by

$$Y_t = \mathbf{x}_t^\top \boldsymbol{\beta} + \xi_t, \qquad (1)$$
$$\xi_t = \phi_1 \xi_{t-1} + \dots + \phi_p \xi_{t-p} + \eta_t, \qquad t = 1, \dots, n, \qquad (2)$$
where $Y_t$ represents the dependent variable observed at time $t$, $\mathbf{x}_t$ is a $q \times 1$ vector of known covariables, $\boldsymbol{\beta}$ is a $q \times 1$ vector of unknown regression parameters to be estimated, and $\xi_t$ is the regression error, assumed to follow an autoregressive model with $\boldsymbol{\phi} = (\phi_1, \dots, \phi_p)^\top$ being the vector of autoregressive coefficients and $\eta_t$ being a shock of disturbance. The term $\eta_t$ is also known as the innovation in the time series literature (see, for instance, box2015time; schumacher2017censored; wang2018quasi).
Now, suppose that $\eta_t$ in Equation 2 follows a Student-$t$ distribution with location parameter 0, scale parameter $\sigma^2$, and $\nu$ degrees of freedom, whose probability density function (pdf) can be written as

$$f(\eta; \sigma^2, \nu) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)\sqrt{\pi \nu \sigma^2}} \left(1 + \frac{\eta^2}{\nu \sigma^2}\right)^{-\frac{\nu+1}{2}}, \qquad \eta \in \mathbb{R}. \qquad (3)$$
Then, the model defined by Equations 1-3 will be called the autoregressive regression $t$ model of order $p$ (ART($p$)) hereinafter.
Particularly, the distribution of $\eta_t$ can be written using a scale mixture of normal (SMN) distributions, as described in kotz2004multivariate: $\eta_t$ is equal in distribution to $U_t^{-1/2} Z_t$, with $Z_t \sim N(0, \sigma^2)$, $U_t \sim \mathrm{Gamma}(\nu/2, \nu/2)$, and $U_t$ being independent of $Z_t$. Consequently, the conditional distribution of $\eta_t$ given $U_t = u_t$ is normal with zero mean and variance $\sigma^2/u_t$, i.e., $\eta_t \mid (U_t = u_t) \sim N(0, \sigma^2/u_t)$.
This representation facilitates the implementation of an EM-type algorithm, which is discussed next.
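To make the SMN representation concrete, the following minimal R sketch simulates Student-$t$ innovations through the Gamma mixing variable and compares them with the theoretical $t$ quantiles; the variable names and the values $\nu = 4$ and $\sigma^2 = 2$ are illustrative choices, not quantities from the model fitting.

```r
# Simulating Student-t innovations via the scale mixture of normals:
# eta_t | u_t ~ N(0, sigma^2 / u_t), with U_t ~ Gamma(nu/2, nu/2).
set.seed(123)
n    <- 10000
nu   <- 4    # degrees of freedom (illustrative)
sig2 <- 2    # scale parameter sigma^2 (illustrative)

u   <- rgamma(n, shape = nu / 2, rate = nu / 2)  # mixing weights U_t
eta <- rnorm(n, mean = 0, sd = sqrt(sig2 / u))   # conditional normal draws

# Marginally, eta / sqrt(sig2) should match a standard Student-t with nu df
qqplot(qt(ppoints(n), df = nu), eta / sqrt(sig2),
       xlab = "Theoretical t quantiles", ylab = "Sample quantiles")
abline(0, 1)
```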
2.2 The SAEM algorithm
The EM algorithm was first introduced by dempster1977maximum and provides a general approach to the iterative computation of ML estimates in problems with incomplete data. Let $\mathbf{y}_c = (\mathbf{y}_o^\top, \mathbf{y}_m^\top)^\top$ be the complete data, where $\mathbf{y}_o$ is the observed data and $\mathbf{y}_m$ is the incomplete (or missing) data. Let $\ell_c(\boldsymbol{\theta}; \mathbf{y}_c)$ be the complete log-likelihood function. The EM algorithm proceeds as follows:
- E-step: Let $\widehat{\boldsymbol{\theta}}^{(k)}$ be the estimate of $\boldsymbol{\theta}$ at the $k$th iteration. Compute the conditional expectation of the complete log-likelihood function
$$Q_k(\boldsymbol{\theta}) = \mathrm{E}\left[\ell_c(\boldsymbol{\theta}; \mathbf{y}_c) \mid \mathbf{y}_o, \widehat{\boldsymbol{\theta}}^{(k)}\right].$$
- M-step: Maximize $Q_k(\boldsymbol{\theta})$ with respect to $\boldsymbol{\theta}$ to obtain $\widehat{\boldsymbol{\theta}}^{(k+1)}$.
There are situations in which the E-step has no analytic form. To deal with these cases, wei1990monte proposed to replace the expectation in the computation of $Q_k(\boldsymbol{\theta})$ with a Monte Carlo (MC) approximation based on a large number of independent simulations of the missing data, in what was called the Monte Carlo EM (MCEM) algorithm. As an alternative to the computationally intensive MCEM algorithm, delyon1999convergence proposed a scheme that splits the E-step into a simulation step and an integration step (using a stochastic approximation), while the maximization step remains unchanged. This algorithm, called the SAEM algorithm, proceeds as follows:
- E-step:
  1. Simulation: Draw $M$ samples $\mathbf{y}_m^{(k,1)}, \dots, \mathbf{y}_m^{(k,M)}$ of the missing data from the conditional distribution $f(\mathbf{y}_m \mid \mathbf{y}_o, \widehat{\boldsymbol{\theta}}^{(k)})$.
  2. Stochastic approximation: Update $\widehat{Q}_k(\boldsymbol{\theta})$ by
  $$\widehat{Q}_k(\boldsymbol{\theta}) = \widehat{Q}_{k-1}(\boldsymbol{\theta}) + \delta_k \left[\frac{1}{M} \sum_{m=1}^{M} \ell_c\left(\boldsymbol{\theta}; \mathbf{y}_o, \mathbf{y}_m^{(k,m)}\right) - \widehat{Q}_{k-1}(\boldsymbol{\theta})\right],$$
  where $\delta_k$ is a decreasing sequence of positive numbers such that $\sum_{k=1}^{\infty} \delta_k = \infty$ and $\sum_{k=1}^{\infty} \delta_k^2 < \infty$, also known as the smoothness parameter (kuhn2005maximum).
- M-step: Update $\widehat{\boldsymbol{\theta}}^{(k)}$ as
$$\widehat{\boldsymbol{\theta}}^{(k+1)} = \operatorname*{arg\,max}_{\boldsymbol{\theta}} \widehat{Q}_k(\boldsymbol{\theta}).$$
This process is iterated until some distance between two successive evaluations of the actual log-likelihood function becomes small enough.
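As an illustration of these steps, the following R sketch applies SAEM to a toy problem that is far simpler than the model developed below: estimating the mean of normal data with unit variance when values above a known limit are right-censored. The smoothing sequence uses the cutoff scheme discussed later in Subsection 3.2; all names and constants are illustrative.

```r
# Minimal SAEM sketch on a toy problem (not the paper's model).
set.seed(1)
lod  <- 1.0
y    <- rnorm(200, mean = 0.5)
cens <- y > lod                 # censoring indicator
y[cens] <- lod                  # observed data: true value or the limit

W <- 200; M <- 10; c0 <- 0.3
delta <- ifelse(1:W <= c0 * W, 1, 1 / (1:W - c0 * W))  # smoothing sequence

mu <- mean(y)                   # starting value
S  <- sum(y)                    # running approx. of E[sum(Y) | observed data]
for (k in 1:W) {
  # Simulation: draw censored values from N(mu, 1) truncated to (lod, Inf)
  s_bar <- mean(replicate(M, {
    z <- qnorm(runif(sum(cens), pnorm(lod, mu), 1), mu)  # inverse-CDF sampling
    sum(y[!cens]) + sum(z)                               # complete-data statistic
  }))
  S  <- S + delta[k] * (s_bar - S)   # stochastic approximation (E-step)
  mu <- S / length(y)                # M-step: closed-form update of the mean
}
mu   # should be near the true mean 0.5
```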
3 Proposed model and parameter estimation
This section is devoted to the formulation and ML estimation of the censored autoregressive linear model. Here, we specify the log-likelihood function, the algorithm used to overcome the parameter estimation problem, expressions to approximate the standard errors, and expressions to predict values at unobserved times when the values of the covariables (if needed) at these times are available.
3.1 The censored ART($p$) model
Assume that the response variable $Y_t$ in Equation 1 is not fully observed for all times. Instead, we observe $(V_t, \rho_t)$, where $V_t$ represents either an observed value or the limit of detection (LOD) of a censored variable, and $\rho_t$ is the censoring indicator defined as
$$\rho_t = \begin{cases} 1, & \text{if } V_{1t} \leq Y_t \leq V_{2t}, \\ 0, & \text{otherwise}, \end{cases} \qquad (6)$$
with $[V_{1t}, V_{2t}]$ denoting the interval within which $Y_t$ is known to lie when it is censored or missing. Thereby, the model defined by Equations 1-6 will be called the censored autoregressive regression $t$ model of order $p$ (CART($p$)). We say that $y_t$ is left-censored if $\rho_t = 1$, $V_{1t} = -\infty$, and $V_{2t} = V_t$; right-censored if $\rho_t = 1$, $V_{1t} = V_t$, and $V_{2t} = +\infty$; interval censored if $\rho_t = 1$ and both $V_{1t}$ and $V_{2t}$ are finite; and it represents a missing value if $\rho_t = 1$, $V_{1t} = -\infty$, and $V_{2t} = +\infty$.
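As an illustration, the sketch below builds the quantities $(V_t, V_{1t}, V_{2t}, \rho_t)$ for a series subject to left censoring at a known LOD, with missing values encoded as NA; the function name and the encoding are ours, introduced only for illustration.

```r
# Constructing the censoring structure for left-censored data with missing
# values: rho = 1 flags censored or missing observations, and (v1, v2) give
# the interval known to contain Y_t in those cases.
make_censoring <- function(y, lod) {
  miss <- is.na(y)
  rho  <- as.numeric(miss | y <= lod)    # 1 = censored or missing, 0 = observed
  v1   <- rep(-Inf, length(y))           # lower limit (left censoring)
  v2   <- ifelse(miss, Inf, lod)         # upper limit: the LOD, or Inf if missing
  v    <- ifelse(miss, NA, pmax(y, lod)) # recorded value: y_t, or the LOD
  data.frame(v = v, v1 = v1, v2 = v2, rho = rho)
}
```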
To compute the log-likelihood function of the CART($p$) model, we condition the marginal distribution on the first $p$ observations, which are considered fully observed, and we denote by $\mathbf{y}_o$ the vector of observed data and by $\mathbf{y}_m$ the vector of censored or missing observations, with $\mathbf{y} = (\mathbf{y}_o^\top, \mathbf{y}_m^\top)^\top$. Note also that the distribution of $Y_t$ conditional on all the preceding data only depends on the previous $p$ observations. Then, for $t > p$, $Y_t$ given $\mathbf{y}_{(t-p):(t-1)}$ follows a Student-$t$ distribution with location parameter $\mu_t = \mathbf{x}_t^\top \boldsymbol{\beta} + \sum_{i=1}^{p} \phi_i (y_{t-i} - \mathbf{x}_{t-i}^\top \boldsymbol{\beta})$, scale parameter $\sigma^2$, and $\nu$ degrees of freedom, where $\widetilde{\mathbf{X}}_t = (\mathbf{x}_{t-1}, \dots, \mathbf{x}_{t-p})^\top$ is the $p \times q$ matrix with the covariates related to $\mathbf{y}_{(t-p):(t-1)}$ and $\mathbf{x}_t$ is the $q \times 1$ vector with the covariates related to $Y_t$. In other words, we have
$$Y_t \mid \mathbf{y}_{1:(t-1)} \sim t(\mu_t, \sigma^2, \nu), \qquad t = p+1, \dots, n.$$
Thus, the observed (conditional) log-likelihood function can be computed by
$$\ell(\boldsymbol{\theta}; \mathbf{y}_o) = \log \int_{\mathcal{D}_m} \prod_{t=p+1}^{n} f\left(y_t \mid \mathbf{y}_{(t-p):(t-1)}; \boldsymbol{\theta}\right) d\mathbf{y}_m, \qquad (7)$$
where $\boldsymbol{\theta} = (\boldsymbol{\beta}^\top, \boldsymbol{\phi}^\top, \sigma^2, \nu)^\top$ and $\mathcal{D}_m$ denotes the region determined by the censoring intervals of the unobserved coordinates.
3.2 Parameter estimation via the SAEM algorithm
The observed log-likelihood function in Equation 7 involves complex expressions, making it very difficult to work with directly. Hence, to obtain the ML estimates of $\boldsymbol{\theta}$, we now apply an EM-type algorithm that considers the hierarchical representation presented in Subsection 2.1.
Let $\mathbf{y}_m$ and $\mathbf{u} = (u_{p+1}, \dots, u_n)^\top$ be hypothetical missing data, and consider that we observe $(\mathbf{V}, \boldsymbol{\rho})$, where $\mathbf{V} = (V_1, \dots, V_n)^\top$ and $\boldsymbol{\rho} = (\rho_1, \dots, \rho_n)^\top$. Then, the complete dataset is given by $\mathbf{y}_c = (\mathbf{y}^\top, \mathbf{u}^\top)^\top$, and the complete log-likelihood function can be written as
$$\ell_c(\boldsymbol{\theta}; \mathbf{y}_c) = -\frac{1}{2} \sum_{t=p+1}^{n} \left[\log \sigma^2 + \frac{u_t (y_t - \mu_t)^2}{\sigma^2} - (\nu - 1) \log u_t + \nu u_t - \nu \log\left(\frac{\nu}{2}\right) + 2 \log \Gamma\left(\frac{\nu}{2}\right)\right] + \mathrm{cte},$$
with $\mu_t = \mathbf{x}_t^\top \boldsymbol{\beta} + \sum_{i=1}^{p} \phi_i (y_{t-i} - \mathbf{x}_{t-i}^\top \boldsymbol{\beta})$, and cte being a constant that does not depend on $\boldsymbol{\theta}$.
Let $\widehat{\boldsymbol{\theta}}^{(k)}$ denote the current estimate of $\boldsymbol{\theta}$. Then, the conditional expectation of the complete log-likelihood function given the observed data, ignoring constant terms, is given by $Q_k(\boldsymbol{\theta}) = Q_{1k}(\boldsymbol{\beta}, \boldsymbol{\phi}, \sigma^2) + Q_{2k}(\nu)$, with
$$Q_{1k}(\boldsymbol{\beta}, \boldsymbol{\phi}, \sigma^2) = -\frac{n-p}{2} \log \sigma^2 - \frac{1}{2\sigma^2} \sum_{t=p+1}^{n} \mathrm{E}\left[u_t (y_t - \mu_t)^2 \,\middle|\, \mathbf{V}, \boldsymbol{\rho}, \widehat{\boldsymbol{\theta}}^{(k)}\right],$$
$$Q_{2k}(\nu) = \sum_{t=p+1}^{n} \left\{\frac{\nu}{2} \log\left(\frac{\nu}{2}\right) - \log \Gamma\left(\frac{\nu}{2}\right) + \frac{\nu - 1}{2}\, \widehat{(\log u_t)}^{(k)} - \frac{\nu}{2}\, \widehat{u_t}^{(k)}\right\},$$
where $\widehat{u_t}^{(k)} = \mathrm{E}[u_t \mid \mathbf{V}, \boldsymbol{\rho}, \widehat{\boldsymbol{\theta}}^{(k)}]$ and $\widehat{(\log u_t)}^{(k)} = \mathrm{E}[\log u_t \mid \mathbf{V}, \boldsymbol{\rho}, \widehat{\boldsymbol{\theta}}^{(k)}]$. Expanding the quadratic form $u_t (y_t - \mu_t)^2$ shows that $Q_{1k}$ depends on the data only through conditional expectations of $u_t$, $u_t y_t$, $u_t y_t^2$, $u_t \mathbf{y}_{(t-p):(t-1)}$, $u_t y_t \mathbf{y}_{(t-p):(t-1)}$, and $u_t \mathbf{y}_{(t-p):(t-1)} \mathbf{y}_{(t-p):(t-1)}^\top$, for $t = p+1, \dots, n$.
It is easy to observe that the E-step reduces to the computation of these conditional expectations, all taken with respect to $(\mathbf{V}, \boldsymbol{\rho}, \widehat{\boldsymbol{\theta}}^{(k)})$. Because of the difficulties in calculating them, specifically when there are successive censored observations, we consider the implementation of the SAEM algorithm introduced in Subsection 2.2, which at the $k$th iteration proceeds as follows:
Step E-1 (Sampling). We first consider the observed and censored/missing parts of $\mathbf{y}$ separately and rearrange the elements of the location vector and scale matrix to compute the conditional distribution of $\mathbf{y}_m$, as demonstrated in Appendix A.1. Thus, in the simulation step, we generate samples from the conditional distribution of the latent variables via the Gibbs sampler according to the following scheme:
- Step 1: Sample $\mathbf{y}_m^{(k,m)}$ from the truncated multivariate normal distribution $f(\mathbf{y}_m \mid \mathbf{y}_o, \mathbf{u}^{(k,m-1)}, \widehat{\boldsymbol{\theta}}^{(k)})$ derived in Appendix A.1, with the truncation region determined by the censoring limits.
- Step 2: Sample $\mathbf{u}^{(k,m)}$ from $f(\mathbf{u} \mid \mathbf{y}^{(k,m)}, \widehat{\boldsymbol{\theta}}^{(k)})$, whose elements are independent and Gamma distributed as $u_t \mid \mathbf{y} \sim \mathrm{Gamma}\left(\frac{\nu+1}{2}, \frac{\nu + (y_t - \mu_t)^2/\sigma^2}{2}\right)$, for $t = p+1, \dots, n$ and $m = 1, \dots, M$ (see the sketch after this list).
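A minimal R sketch of the second Gibbs step, assuming the Gamma full conditional derived in Appendix A.1; here r denotes the vector of residuals $y_t - \mu_t$, and all names are illustrative.

```r
# One update of the mixing weights u_t given the current (completed) series:
# u_t | y ~ Gamma((nu + 1)/2, (nu + (y_t - mu_t)^2 / sigma2)/2).
sample_u <- function(r, sigma2, nu) {
  rgamma(length(r), shape = (nu + 1) / 2, rate = (nu + r^2 / sigma2) / 2)
}
```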
Step E-2 (Stochastic Approximation). Given the sequence $(\mathbf{y}_m^{(k,m)}, \mathbf{u}^{(k,m)})$ for $m = 1, \dots, M$, we replace the conditional expectations in $Q_k(\boldsymbol{\theta})$ by the following stochastic approximations:
$$\widehat{S}_t^{(k)} = \widehat{S}_t^{(k-1)} + \delta_k \left[\frac{1}{M} \sum_{m=1}^{M} s_t\left(\mathbf{y}^{(k,m)}, \mathbf{u}^{(k,m)}\right) - \widehat{S}_t^{(k-1)}\right],$$
where $s_t(\mathbf{y}, \mathbf{u})$ denotes, in turn, each of the complete-data statistics listed above ($u_t$, $\log u_t$, $u_t y_t$, $u_t y_t^2$, $u_t \mathbf{y}_{(t-p):(t-1)}$, $u_t y_t \mathbf{y}_{(t-p):(t-1)}$, and $u_t \mathbf{y}_{(t-p):(t-1)} \mathbf{y}_{(t-p):(t-1)}^\top$).
Following galarza2017quantile, the smoothness parameter will be considered as $\delta_k = 1$, if $1 \leq k \leq cW$, and $\delta_k = 1/(k - cW)$, if $cW < k \leq W$, where $W$ is the maximum number of iterations and $c \in [0, 1]$ is a cutoff point that determines the percentage of initial iterations with no memory. If $c = 0$, the algorithm will have memory for all iterations and hence will converge slowly to the ML estimates, and $W$ needs to be large. If $c = 1$, the algorithm will be memory-free, it will converge quickly to a solution neighborhood, and it will initiate a Markov chain leading to a reasonably well-estimated mean after applying the necessary burn-in and thinning steps. A number $c$ between 0 and 1 will assure an initial convergence in distribution to a solution neighborhood for the first $cW$ iterations and an almost sure convergence for the rest of the iterations.
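The behavior of this sequence is easy to visualize; the short sketch below plots $\delta_k$ for the illustrative values $W = 400$ and $c = 0.25$ (both arbitrary).

```r
# Smoothing sequence: no memory (delta = 1) for the first c*W iterations,
# then decreasing weights 1/(k - c*W) that enforce almost sure convergence.
W  <- 400; c0 <- 0.25
k  <- 1:W
delta <- ifelse(k <= c0 * W, 1, 1 / (k - c0 * W))
plot(k, delta, type = "l", xlab = "Iteration k", ylab = expression(delta[k]))
```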
The choices of $c$ and $W$ could affect the speed of convergence of the SAEM algorithm. A graphical approach can be used to monitor the convergence of the estimates of all parameters and to determine the values of these constants, as suggested by lavielle2014mixed. An advantage of the SAEM algorithm is that, even though it performs an MCMC E-step, it only requires a small and fixed Monte Carlo sample size $M$, making it much faster than the MCEM algorithm.
Step CM (Conditional Maximization). A conditional maximization step is then carried out, and $\widehat{\boldsymbol{\theta}}^{(k)}$ is updated by maximizing $\widehat{Q}_k(\boldsymbol{\theta})$ over $\boldsymbol{\theta}$ to obtain a new estimate $\widehat{\boldsymbol{\theta}}^{(k+1)}$. Writing $\boldsymbol{\alpha}_t = \mathbf{x}_t - \sum_{i=1}^{p} \phi_i \mathbf{x}_{t-i}$ and $\mathbf{w}_t = (y_{t-1} - \mathbf{x}_{t-1}^\top\boldsymbol{\beta}, \dots, y_{t-p} - \mathbf{x}_{t-p}^\top\boldsymbol{\beta})^\top$, so that $y_t - \mu_t = (y_t - \sum_{i=1}^{p}\phi_i y_{t-i}) - \boldsymbol{\alpha}_t^\top \boldsymbol{\beta}$, this leads to the expressions
$$\widehat{\boldsymbol{\beta}}^{(k+1)} = \left[\sum_{t=p+1}^{n} \widehat{u_t}^{(k)}\, \widehat{\boldsymbol{\alpha}}_t \widehat{\boldsymbol{\alpha}}_t^\top\right]^{-1} \sum_{t=p+1}^{n} \widehat{\boldsymbol{\alpha}}_t \left(\widehat{u_t y_t}^{(k)} - \sum_{i=1}^{p} \widehat{\phi}_i^{(k)}\, \widehat{u_t y_{t-i}}^{(k)}\right),$$
$$\widehat{\boldsymbol{\phi}}^{(k+1)} = \left[\sum_{t=p+1}^{n} \widehat{u_t \mathbf{w}_t \mathbf{w}_t^\top}^{(k)}\right]^{-1} \sum_{t=p+1}^{n} \widehat{u_t \mathbf{w}_t (y_t - \mathbf{x}_t^\top \boldsymbol{\beta})}^{(k)},$$
$$\widehat{\sigma^2}^{(k+1)} = \frac{1}{n-p} \sum_{t=p+1}^{n} \widehat{u_t (y_t - \mu_t)^2}^{(k)}, \qquad \widehat{\nu}^{(k+1)} = \operatorname*{arg\,max}_{\nu > 0}\; Q_{2k}(\nu),$$
where each hatted quantity denotes the corresponding stochastic approximation from Step E-2, the quadratic forms being assembled from the approximated expectations of $u_t$, $u_t y_t$, $u_t y_t^2$, $u_t \mathbf{y}_{(t-p):(t-1)}$, $u_t y_t \mathbf{y}_{(t-p):(t-1)}$, and $u_t \mathbf{y}_{(t-p):(t-1)} \mathbf{y}_{(t-p):(t-1)}^\top$, with $\boldsymbol{\beta}$, $\boldsymbol{\phi}$, and $\sigma^2$ fixed at their most recent values in each conditional update. The update of $\nu$ is performed by numerical maximization.
3.3 Standard error approximation
The Fisher information matrix is a good measure of the amount of information that a sample dataset provides about the parameters, and it can be used to compute the asymptotic variance of the estimators. louis1982finding developed a procedure for extracting the observed information matrix when the EM algorithm is used to find the ML estimates in problems with incomplete data.
Let $\mathbf{S}(\mathbf{y}_c; \boldsymbol{\theta}) = \partial \ell_c(\boldsymbol{\theta}; \mathbf{y}_c)/\partial \boldsymbol{\theta}$ and $\mathbf{B}(\mathbf{y}_c; \boldsymbol{\theta}) = -\partial^2 \ell_c(\boldsymbol{\theta}; \mathbf{y}_c)/\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^\top$ respectively denote the first derivative and the negative of the second derivative of the complete-data log-likelihood function with respect to the parameter vector $\boldsymbol{\theta}$, and let $\mathbf{S}_o$ and $\mathbf{B}_o$ be the corresponding derivatives of the log-likelihood function of the observed data. Then, the observed information matrix is given by
$$\mathbf{I}_o(\boldsymbol{\theta}; \mathbf{y}_o) = \mathrm{E}\left[\mathbf{B}(\mathbf{y}_c; \boldsymbol{\theta}) \mid \mathbf{y}_o\right] - \mathrm{E}\left[\mathbf{S}(\mathbf{y}_c; \boldsymbol{\theta}) \mathbf{S}^\top(\mathbf{y}_c; \boldsymbol{\theta}) \mid \mathbf{y}_o\right] + \mathrm{E}\left[\mathbf{S}(\mathbf{y}_c; \boldsymbol{\theta}) \mid \mathbf{y}_o\right] \mathrm{E}\left[\mathbf{S}^\top(\mathbf{y}_c; \boldsymbol{\theta}) \mid \mathbf{y}_o\right].$$
delyon1999convergence adapted the method proposed by louis1982finding to compute the observed information matrix when the SAEM algorithm is used to estimate the parameters. For this, it is necessary to compute the auxiliary terms
$$\mathbf{H}_k = \mathbf{H}_{k-1} + \delta_k \left[\frac{1}{M} \sum_{m=1}^{M} \mathbf{S}\left(\mathbf{y}_c^{(k,m)}; \widehat{\boldsymbol{\theta}}^{(k)}\right) - \mathbf{H}_{k-1}\right]$$
and
$$\mathbf{G}_k = \mathbf{G}_{k-1} + \delta_k \left[\frac{1}{M} \sum_{m=1}^{M} \left\{\mathbf{B}\left(\mathbf{y}_c^{(k,m)}; \widehat{\boldsymbol{\theta}}^{(k)}\right) - \mathbf{S}\mathbf{S}^\top\left(\mathbf{y}_c^{(k,m)}; \widehat{\boldsymbol{\theta}}^{(k)}\right)\right\} - \mathbf{G}_{k-1}\right],$$
so that the observed information matrix is approximated at each iteration by
$$\mathbf{I}_o^{(k)} = \mathbf{G}_k + \mathbf{H}_k \mathbf{H}_k^\top, \qquad (14)$$
where $\mathbf{y}_c^{(k,m)}$ is the vector of complete data sampled at iteration $k$, for $k = 1, \dots, W$ and $m = 1, \dots, M$, with $W$ denoting the maximum number of iterations of the SAEM algorithm and $M$ the number of MC samples used in the stochastic approximation. The inverse of the limiting value of $\mathbf{I}_o^{(k)}$ can be used to assess the dispersion of the estimators (delyon1999convergence). The analytical expressions for the first and second derivatives of the complete-data log-likelihood function are given in Appendix A.2.
3.4 Prediction
Considering the interest in predicting values from the CART($p$) model, we denote by $\mathbf{y}_{\mathrm{obs}} = (y_1, \dots, y_n)^\top$ the $n$-vector of random variables (RVs) corresponding to the given sample and by $\mathbf{y}_{\mathrm{pred}} = (y_{n+1}, \dots, y_{n+h})^\top$ the vector of RVs of length $h$ corresponding to the time points at which we are interested in predicting.
Let $\mathbf{y}_{\mathrm{obs}} = (\mathbf{y}_{\mathrm{obs}}^{o\top}, \mathbf{y}_{\mathrm{obs}}^{m\top})^\top$, where $\mathbf{y}_{\mathrm{obs}}^{o}$ is the vector of uncensored observations and $\mathbf{y}_{\mathrm{obs}}^{m}$ is the vector of censored or missing observations. To deal with the censored values in $\mathbf{y}_{\mathrm{obs}}$, we use an imputation procedure that consists of replacing the censored values with the values obtained in the last iteration of the SAEM algorithm, since the elements of $\mathbf{y}_{\mathrm{obs}}^{m}$ can also be updated during Step E-2 of the SAEM algorithm as
$$\widehat{\mathbf{y}}_m^{(k)} = \widehat{\mathbf{y}}_m^{(k-1)} + \delta_k \left[\frac{1}{M} \sum_{m=1}^{M} \mathbf{y}_m^{(k,m)} - \widehat{\mathbf{y}}_m^{(k-1)}\right], \qquad (15)$$
for $k = 1, \dots, W$, with the same $\delta_k$, $W$, and $M$ settings considered in the estimation. The resulting vector of complete observed values will be denoted by $\widehat{\mathbf{y}}_{\mathrm{obs}}$.
Now, supposing that all values in $\widehat{\mathbf{y}}_{\mathrm{obs}}$ are completely observed and that the explanatory variables for $\mathbf{y}_{\mathrm{pred}}$ are available, the forecasting procedure is performed recursively (box2015time) as
$$\widehat{y}_{n+j} = \mathbf{x}_{n+j}^\top \boldsymbol{\beta} + \sum_{i=1}^{p} \phi_i \left(\widehat{y}_{n+j-i} - \mathbf{x}_{n+j-i}^\top \boldsymbol{\beta}\right), \qquad j = 1, \dots, h,$$
with $\widehat{y}_t$ equal to the corresponding element of $\widehat{\mathbf{y}}_{\mathrm{obs}}$ for $t \leq n$. In practice, $\boldsymbol{\theta}$ is substituted by $\widehat{\boldsymbol{\theta}}$, the estimate obtained via the SAEM algorithm, such that $\widehat{y}_{n+j} = \widehat{y}_{n+j}(\widehat{\boldsymbol{\theta}})$.
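A sketch of this recursion in R is given below; beta and phi stand for the SAEM estimates, X holds the in-sample covariates, and x_new holds the covariates at the $h$ prediction times (all names are illustrative).

```r
# Recursive forecasting for an AR(p) regression: predicted regression errors
# are fed back into the recursion as if they were observed.
forecast_arp <- function(y, X, x_new, beta, phi) {
  p  <- length(phi)
  xi <- as.numeric(y - X %*% beta)       # in-sample regression errors
  preds <- numeric(nrow(x_new))
  for (j in seq_len(nrow(x_new))) {
    xi_next  <- sum(phi * rev(tail(xi, p)))   # phi_1*xi_{n+j-1} + ... + phi_p*xi_{n+j-p}
    preds[j] <- sum(x_new[j, ] * beta) + xi_next
    xi <- c(xi, xi_next)                      # append the predicted error
  }
  preds
}
```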
4 Simulation Study
In this section, we examine the asymptotic properties of the SAEM estimates through a simulation study considering different sample sizes and levels of censoring. A second simulation study is performed to demonstrate the robustness of the estimates obtained from the proposed model when the data are perturbed.
Table 1: Results of Simulation Study 1 based on 300 MC samples: mean of the estimates (MC-Mean), mean of the standard errors (IM-SE), standard deviation of the estimates (MC-SD), and coverage probability (CP) for each sample size $n$ and LOD.

| $n$ | LOD | Average level of censoring | Measure | $\beta_0$ | $\beta_1$ | $\beta_2$ | $\phi_1$ | $\phi_2$ | $\sigma^2$ | $\nu$ |
|---|---|---|---|---|---|---|---|---|---|---|
| | | | True value | 5.00 | 0.50 | 0.90 | -0.40 | 0.12 | 2.00 | 4.00 |
| 100 | No | 0% | MC-Mean | 4.988 | 0.501 | 0.937 | -0.409 | 0.093 | 1.779 | 4.150 |
| | | | IM-SE | 0.325 | 0.141 | 0.565 | 0.086 | 0.086 | 0.531 | 2.612 |
| | | | MC-SD | 0.334 | 0.149 | 0.567 | 0.090 | 0.091 | 0.465 | 2.089 |
| | | | CP (%) | 94.0 | 91.3 | 94.0 | - | - | - | - |
| | 1.60 | 5.10% | MC-Mean | 4.984 | 0.502 | 0.942 | -0.409 | 0.092 | 1.781 | 4.345 |
| | | | IM-SE | 0.319 | 0.141 | 0.557 | 0.091 | 0.088 | 0.533 | 3.326 |
| | | | MC-SD | 0.339 | 0.149 | 0.576 | 0.096 | 0.094 | 0.494 | 2.692 |
| | | | CP (%) | 94.3 | 92.0 | 94.3 | - | - | - | - |
| | 3.45 | 19.59% | MC-Mean | 4.977 | 0.505 | 0.931 | -0.418 | 0.089 | 1.812 | 4.559 |
| | | | IM-SE | 0.337 | 0.150 | 0.586 | 0.105 | 0.099 | 0.584 | 3.957 |
| | | | MC-SD | 0.349 | 0.156 | 0.588 | 0.110 | 0.103 | 0.516 | 2.914 |
| | | | CP (%) | 93.3 | 93.7 | 93.3 | - | - | - | - |
| | 4.30 | 34.11% | MC-Mean | 4.939 | 0.511 | 0.928 | -0.434 | 0.077 | 1.856 | 4.930 |
| | | | IM-SE | 0.371 | 0.163 | 0.644 | 0.122 | 0.115 | 0.721 | 4.872 |
| | | | MC-SD | 0.372 | 0.171 | 0.642 | 0.126 | 0.120 | 0.571 | 3.517 |
| | | | CP (%) | 94.0 | 93.7 | 93.3 | - | - | - | - |
| 300 | No | 0% | MC-Mean | 5.015 | 0.501 | 0.893 | -0.401 | 0.115 | 1.937 | 4.162 |
| | | | IM-SE | 0.167 | 0.084 | 0.299 | 0.048 | 0.048 | 0.293 | 1.172 |
| | | | MC-SD | 0.180 | 0.090 | 0.312 | 0.052 | 0.050 | 0.295 | 1.266 |
| | | | CP (%) | 93.3 | 93.7 | 93.7 | - | - | - | - |
| | 1.60 | 5.31% | MC-Mean | 5.017 | 0.501 | 0.889 | -0.399 | 0.115 | 1.944 | 4.244 |
| | | | IM-SE | 0.169 | 0.085 | 0.303 | 0.053 | 0.051 | 0.314 | 1.444 |
| | | | MC-SD | 0.182 | 0.091 | 0.319 | 0.055 | 0.051 | 0.311 | 1.498 |
| | | | CP (%) | 93.0 | 94.0 | 93.0 | - | - | - | - |
| | 3.45 | 20.58% | MC-Mean | 4.999 | 0.507 | 0.891 | -0.409 | 0.110 | 1.997 | 4.645 |
| | | | IM-SE | 0.177 | 0.090 | 0.317 | 0.060 | 0.057 | 0.365 | 2.122 |
| | | | MC-SD | 0.191 | 0.095 | 0.332 | 0.061 | 0.058 | 0.341 | 1.935 |
| | | | CP (%) | 93.7 | 92.3 | 93.7 | - | - | - | - |
| | 4.30 | 35.44% | MC-Mean | 4.958 | 0.518 | 0.901 | -0.409 | 0.110 | 2.080 | 5.043 |
| | | | IM-SE | 0.194 | 0.098 | 0.341 | 0.068 | 0.065 | 0.421 | 2.716 |
| | | | MC-SD | 0.199 | 0.106 | 0.339 | 0.074 | 0.069 | 0.385 | 2.258 |
| | | | CP (%) | 95.0 | 93.3 | 95.3 | - | - | - | - |
| 600 | No | 0% | MC-Mean | 5.001 | 0.507 | 0.909 | -0.400 | 0.118 | 1.966 | 4.072 |
| | | | IM-SE | 0.125 | 0.063 | 0.225 | 0.037 | 0.036 | 0.221 | 0.889 |
| | | | MC-SD | 0.125 | 0.064 | 0.226 | 0.038 | 0.037 | 0.212 | 0.918 |
| | | | CP (%) | 93.0 | 94.7 | 92.3 | - | - | - | - |
| | 1.60 | 5.11% | MC-Mean | 4.999 | 0.508 | 0.911 | -0.401 | 0.117 | 1.977 | 4.145 |
| | | | IM-SE | 0.125 | 0.063 | 0.225 | 0.037 | 0.036 | 0.221 | 0.899 |
| | | | MC-SD | 0.124 | 0.065 | 0.224 | 0.038 | 0.037 | 0.212 | 0.937 |
| | | | CP (%) | 92.7 | 94.3 | 92.0 | - | - | - | - |
| | 3.45 | 19.99% | MC-Mean | 4.981 | 0.513 | 0.921 | -0.405 | 0.117 | 2.034 | 4.447 |
| | | | IM-SE | 0.132 | 0.066 | 0.235 | 0.042 | 0.040 | 0.254 | 1.252 |
| | | | MC-SD | 0.127 | 0.069 | 0.229 | 0.042 | 0.041 | 0.239 | 1.342 |
| | | | CP (%) | 94.0 | 93.7 | 94.3 | - | - | - | - |
| | 4.30 | 34.64% | MC-Mean | 4.939 | 0.521 | 0.933 | -0.410 | 0.112 | 2.113 | 4.826 |
| | | | IM-SE | 0.142 | 0.072 | 0.251 | 0.048 | 0.045 | 0.300 | 1.672 |
| | | | MC-SD | 0.139 | 0.074 | 0.251 | 0.049 | 0.046 | 0.284 | 1.657 |
| | | | CP (%) | 93.7 | 94.7 | 94.7 | - | - | - | - |
4.1 Simulation study 1
This study aims to provide empirical evidence about the consistency of the ML estimates under different scenarios. Hence, 300 MC samples were generated considering three different sample sizes: $n = 100$, 300, and 600. The data were generated from the model specified by Equations 1-2, considering a Student-$t$ distribution for the innovations ($\eta_t$'s) with $\nu = 4$ and $\sigma^2 = 2$. The remaining parameters were set as $\boldsymbol{\beta} = (5, 0.50, 0.90)^\top$ and $\boldsymbol{\phi} = (-0.40, 0.12)^\top$, and the covariables at time $t$ were $\mathbf{x}_t = (1, x_{t1}, x_{t2})^\top$, for $t = 1, \dots, n$. From this scenario, two analyses were conducted and will be discussed next.
4.1.1 Asymptotic properties
Aiming to obtain scenarios with average censoring/missing levels of 5%, 20%, and 35%, respectively, we considered the following LODs (values lower than the LOD are substituted by the LOD): 1.60, 3.45, and 4.30. Furthermore, 20% of the desired censoring rate corresponds to observations randomly selected to be treated as missing. Additionally, we considered the case without censoring (original data) for comparison.
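For reference, the following R sketch generates one MC replicate under this design with $n = 300$ and LOD 3.45; the covariate distributions are illustrative choices that stand in for the fixed design used in the study.

```r
# One simulated replicate: ART(2) data with Student-t innovations generated
# via the SMN representation, then left-censored at the LOD.
set.seed(42)
n    <- 300
beta <- c(5, 0.50, 0.90); phi <- c(-0.40, 0.12)
sig2 <- 2; nu <- 4; lod <- 3.45

X   <- cbind(1, rnorm(n), runif(n))            # illustrative covariates
u   <- rgamma(n, nu / 2, nu / 2)               # mixing weights
eta <- rnorm(n, 0, sqrt(sig2 / u))             # Student-t innovations
xi  <- numeric(n)
for (t in 3:n) xi[t] <- phi[1] * xi[t - 1] + phi[2] * xi[t - 2] + eta[t]
y   <- as.numeric(X %*% beta + xi)

cens  <- y <= lod                              # left-censoring indicator
y_obs <- ifelse(cens, lod, y)                  # observed series
mean(cens)                                     # empirical censoring rate
```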
For each sample size and LOD, we computed the mean of the 300 MC estimates (MC-Mean), the mean of the standard errors (IM-SE) computed from the inverse of the observed information matrix given in Equation 14, the standard deviation of the 300 MC estimates (MC-SD), and the coverage probability (CP) of a 95% confidence interval, i.e.,
$$\mathrm{CP}(\theta_i) = \frac{1}{300} \sum_{j=1}^{300} \mathbb{I}\left(\theta_i \in \left[\widehat{\theta}_i^{(j)} - 1.96\, \mathrm{SE}\left(\widehat{\theta}_i^{(j)}\right),\ \widehat{\theta}_i^{(j)} + 1.96\, \mathrm{SE}\left(\widehat{\theta}_i^{(j)}\right)\right]\right),$$
where $\mathbb{I}(\cdot)$ is the indicator function and $\widehat{\theta}_i^{(j)}$ is the estimate of the $i$th parameter of $\boldsymbol{\theta}$ in the $j$th MC sample.
The results are shown in Table 1, where we can observe that the mean of the estimates (MC-Mean) is close to the true parameter value in all combinations of sample size and LOD, except for the scale parameter $\sigma^2$ in samples of size $n = 100$. As expected, this difference decreases as the sample size increases. Notice that the mean of the standard errors obtained through the inverse of the observed information matrix (IM-SE) is in general close to the standard deviation of the estimates (MC-SD) in all scenarios, indicating that the proposed method to obtain the standard errors is reliable.
Figure 1 shows the boxplots of the estimates for each parameter, considering different sample sizes and LODs. The solid red line represents the true parameter value. In general, the median of the estimates is close to the true value independently of the sample size and LOD. However, for $\sigma^2$ and $\nu$, the median underestimates the true value in samples of size $n = 100$, i.e., the smallest sample size in the simulation study. Furthermore, the interquartile ranges decrease as the sample sizes increase, suggesting the consistency of the estimates. Additionally, boxplots for the estimates of $\nu$ are shown in Figure 8 in Appendix B.
Finally, to study the asymptotic properties of the estimates, we analyzed the mean squared error (MSE) of the estimates obtained from the proposed algorithm in all scenarios, defined by $\mathrm{MSE}(\theta_i) = \frac{1}{300} \sum_{j=1}^{300} (\widehat{\theta}_i^{(j)} - \theta_i)^2$. The results for each parameter, sample size, and LOD are shown in Figure 2, where we note that the MSE tends to zero as the sample size increases. Thus, the proposed SAEM algorithm seems to provide ML estimates with good asymptotic properties for our autoregressive censored linear model with Student-$t$ innovations.
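The Monte Carlo summaries reported in Table 1, together with the MSE just defined, can be computed with a few lines of R; est and se below are the vectors of estimates and standard errors of one parameter across the 300 replicates, and theta is the true value (illustrative names).

```r
# MC-Mean, IM-SE, MC-SD, 95% coverage probability, and MSE for one parameter.
mc_summary <- function(est, se, theta) {
  covered <- theta >= est - 1.96 * se & theta <= est + 1.96 * se
  c(MC_Mean = mean(est), IM_SE = mean(se), MC_SD = sd(est),
    CP = 100 * mean(covered), MSE = mean((est - theta)^2))
}
```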
4.1.2 Residual Analysis
Checking the specification of a statistical model usually involves statistical tests and graphical methods based on residuals. However, conventional residuals (e.g., Pearson's residuals) are not appropriate for some models, since they may lead to erroneous inference, as kalliovirta2006misspecification demonstrated for models based on mixtures of distributions. As an alternative, dunn1996randomized developed the quantile residuals, a type of residual for regression models with independent responses that produces normally distributed residuals by inverting the fitted distribution function for each response value and finding the equivalent standard normal quantile. These results assume that the model is correctly specified and that the parameters are consistently estimated. The method can be extended to dependent data by expressing the likelihood as a product of univariate conditional likelihoods.
To compute the quantile residuals for the CART($p$) model, we first impute the censored or missing observations as defined in Section 3.4. Then, considering all values as completely observed, the residual for the $t$th observation is computed as
$$\widehat{r}_t = \Phi^{-1}\left(T\left(y_t; \widehat{\mu}_t, \widehat{\sigma}^2, \widehat{\nu}\right)\right), \qquad t = p+1, \dots, n, \qquad (16)$$
where $\Phi(\cdot)$ denotes the cumulative distribution function (CDF) of the standard normal distribution, $\Phi^{-1}(\cdot)$ is its inverse, and $T(\cdot; \widehat{\mu}_t, \widehat{\sigma}^2, \widehat{\nu})$ is the CDF of the univariate Student-$t$ distribution with location parameter $\widehat{\mu}_t$, scale parameter $\widehat{\sigma}^2$, and $\widehat{\nu}$ degrees of freedom, evaluated at the ML estimates of $\boldsymbol{\theta}$ obtained through the SAEM algorithm. Note that the quantile residuals are calculated only for the last $n - p$ observations.
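A sketch of Equation 16 in R, with the conditional location $\widehat{\mu}_t$ rebuilt from the fitted coefficients, is given below; the function and variable names are illustrative, and the series y is assumed to have had its censored values imputed first.

```r
# Quantile residuals for the fitted CART(p) model: evaluate the standardized
# Student-t CDF at each observation and map it to a standard normal quantile.
quantile_residuals <- function(y, X, beta, phi, sig2, nu) {
  p     <- length(phi)
  xi    <- as.numeric(y - X %*% beta)        # fitted regression errors
  t_idx <- (p + 1):length(y)
  mu    <- sapply(t_idx, function(t) {
    sum(X[t, ] * beta) + sum(phi * xi[t - 1:p])  # conditional location mu_t
  })
  qnorm(pt((y[t_idx] - mu) / sqrt(sig2), df = nu))
}
```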
To analyze how the quantile residuals behave under the proposed model, they were computed for a simulated dataset considering four levels of left censoring: 0%, 5.17%, 20%, and 34.67%. Figure 3 shows a quantile-quantile (Q-Q) plot, a dispersion plot (residuals vs. time), and a histogram of the residuals. For all levels of censoring, the histogram resembles that of a normally distributed variable, and the dispersion plot is compatible with independent residuals. From the Q-Q plots, we can deduce that the residuals are roughly normally distributed, because all points form a roughly straight line inside the confidence band. However, for the samples with 20% and 34.67% of censoring, the Q-Q plots present a slight deviation from the center line in the lower tail, which might be due to the high proportion of censored values.
For comparison, we fitted the same dataset considering the normal distribution (i.e., disregarding the heavy tails) and computed the corresponding quantile residuals. The resulting plots are given in Figure 4, where we can see clear signs of non-normality, such as large residuals and several points outside the confidence band in the Q-Q plots. Therefore, this illustration indicates that the method can help to check the model specification in the CART($p$) model. Nevertheless, for significant levels of censoring, more caution is needed in the analysis of residuals, since our proposal imputes the unobserved data by their conditional expectations.
4.2 Simulation study 2: Robustness of the estimators
[Figure 5: Boxplots of the estimates by perturbation level, for panels (a) 0%, (b) 10.04%, and (c) 30% of censoring.]
This simulation study aims to compare the performance of the estimates from two censored AR models in the presence of outliers in the response variable. In this case, we generated 300 MC samples under the model specified in Equations 1-2, considering the standard normal distribution for the innovations. The regression coefficients were set as $\boldsymbol{\beta} = (4, 0.50)^\top$, with a vector $\boldsymbol{\phi} = (\phi_1, \phi_2)^\top$ of autoregressive coefficients of order $p = 2$, and the covariates were set as $\mathbf{x}_t = (1, x_{t1})^\top$, for $t = 1, \dots, n$. After generating the data, each MC sample was perturbed considering the following scheme: the maximum value was increased by $\kappa$ times the sample standard deviation, i.e., $y_{\max}(\kappa) = y_{\max} + \kappa\, \mathrm{SD}(\mathbf{y})$, for $\kappa \in \{0, 1, 2, 3, 4, 5, 7\}$.
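The perturbation scheme amounts to shifting the sample maximum, as in the following illustrative sketch:

```r
# Increase the maximum observation by kappa sample standard deviations.
perturb <- function(y, kappa) {
  i    <- which.max(y)
  y[i] <- y[i] + kappa * sd(y)
  y
}
```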
Table 2: Mean of the estimates obtained from the CAR(2) and CART(2) models based on 300 MC samples, by perturbation level $\kappa$ and censoring rate. For the CART(2) model, the estimated innovation variance $\widehat{\nu}\widehat{\sigma}^2/(\widehat{\nu}-2)$ is reported in place of $\sigma^2$.

a. Level of censoring: 0%

| $\kappa$ | CAR(2): $\beta_0$ | $\beta_1$ | $\sigma^2$ | $\phi_1$ | $\phi_2$ | CART(2): $\beta_0$ | $\beta_1$ | $\widehat{\nu}\widehat{\sigma}^2/(\widehat{\nu}-2)$ | $\phi_1$ | $\phi_2$ | $\widehat{\nu}$ | DI (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4.007 | 0.510 | 0.963 | 0.478 | -0.215 | 4.007 | 0.510 | 0.979 | 0.480 | -0.216 | 24.424 | 17.00 |
| 1 | 4.021 | 0.526 | 1.030 | 0.469 | -0.209 | 4.008 | 0.518 | 1.042 | 0.464 | -0.207 | 17.098 | 73.33 |
| 2 | 4.036 | 0.541 | 1.135 | 0.447 | -0.195 | 3.999 | 0.518 | 1.130 | 0.433 | -0.186 | 9.568 | 98.67 |
| 3 | 4.050 | 0.557 | 1.275 | 0.416 | -0.175 | 3.991 | 0.515 | 1.229 | 0.398 | -0.162 | 6.064 | 99.67 |
| 4 | 4.064 | 0.572 | 1.449 | 0.383 | -0.154 | 3.987 | 0.513 | 1.328 | 0.369 | -0.140 | 4.896 | 100 |
| 5 | 4.079 | 0.588 | 1.655 | 0.349 | -0.133 | 3.986 | 0.510 | 1.426 | 0.346 | -0.121 | 4.324 | 100 |
| 7 | 4.107 | 0.619 | 2.160 | 0.287 | -0.100 | 3.985 | 0.508 | 1.628 | 0.309 | -0.090 | 3.710 | 100 |

b. Level of censoring: 10.04%

| $\kappa$ | CAR(2): $\beta_0$ | $\beta_1$ | $\sigma^2$ | $\phi_1$ | $\phi_2$ | CART(2): $\beta_0$ | $\beta_1$ | $\widehat{\nu}\widehat{\sigma}^2/(\widehat{\nu}-2)$ | $\phi_1$ | $\phi_2$ | $\widehat{\nu}$ | DI (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4.006 | 0.511 | 0.966 | 0.478 | -0.214 | 4.006 | 0.511 | 0.997 | 0.478 | -0.213 | 20.892 | 21.00 |
| 1 | 4.018 | 0.529 | 1.044 | 0.468 | -0.208 | 4.007 | 0.519 | 1.079 | 0.460 | -0.203 | 14.290 | 77.67 |
| 2 | 4.027 | 0.549 | 1.165 | 0.445 | -0.194 | 4.002 | 0.518 | 1.207 | 0.424 | -0.180 | 7.764 | 98.67 |
| 3 | 4.036 | 0.570 | 1.327 | 0.415 | -0.175 | 3.999 | 0.514 | 1.362 | 0.387 | -0.155 | 5.061 | 99.67 |
| 4 | 4.044 | 0.592 | 1.527 | 0.383 | -0.156 | 3.999 | 0.511 | 1.535 | 0.357 | -0.133 | 4.105 | 100 |
| 5 | 4.052 | 0.614 | 1.764 | 0.351 | -0.138 | 4.000 | 0.509 | 1.704 | 0.332 | -0.115 | 3.668 | 100 |
| 7 | 4.065 | 0.659 | 2.340 | 0.294 | -0.109 | 4.003 | 0.505 | 2.170 | 0.295 | -0.088 | 3.166 | 100 |

c. Level of censoring: 30%

| $\kappa$ | CAR(2): $\beta_0$ | $\beta_1$ | $\sigma^2$ | $\phi_1$ | $\phi_2$ | CART(2): $\beta_0$ | $\beta_1$ | $\widehat{\nu}\widehat{\sigma}^2/(\widehat{\nu}-2)$ | $\phi_1$ | $\phi_2$ | $\widehat{\nu}$ | DI (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4.014 | 0.505 | 0.944 | 0.476 | -0.219 | 4.015 | 0.503 | 0.972 | 0.476 | -0.218 | 19.338 | 24.67 |
| 1 | 4.014 | 0.532 | 1.050 | 0.464 | -0.213 | 4.011 | 0.513 | 1.076 | 0.454 | -0.206 | 12.553 | 82.33 |
| 2 | 4.007 | 0.562 | 1.213 | 0.439 | -0.199 | 4.007 | 0.512 | 1.230 | 0.415 | -0.181 | 6.854 | 98.33 |
| 3 | 3.996 | 0.594 | 1.429 | 0.409 | -0.182 | 4.006 | 0.507 | 1.406 | 0.376 | -0.156 | 4.651 | 100 |
| 4 | 3.982 | 0.628 | 1.694 | 0.378 | -0.165 | 4.008 | 0.503 | 1.616 | 0.344 | -0.135 | 3.803 | 100 |
| 5 | 3.965 | 0.664 | 2.005 | 0.349 | -0.151 | 4.010 | 0.500 | 1.837 | 0.318 | -0.118 | 3.386 | 100 |
| 7 | 3.926 | 0.737 | 2.760 | 0.301 | -0.130 | 4.016 | 0.495 | 2.428 | 0.280 | -0.092 | 2.926 | 100 |
Furthermore, we considered three different levels of censoring: the first case corresponds to no censoring; the second case considered 2.34 as a LOD, where all values lower than or equal to the LOD were substituted by 2.34 (which implied an average of 10.04% of censoring); and the third scenario considered a LOD of 3.31 (which yielded an average of 30% of censoring). For each scenario, we fitted two models: the first considers normal innovations, denoted by CAR(2), whose parameters were estimated by the function ARCensReg from the R package ARCensReg (schumacher2016package), and the second one is our proposed model, CART(2), fitted through the function ARtCensReg from the same package.
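A hedged usage sketch for fitting both models is given below; the argument names follow our reading of the ARCensReg documentation and may differ across package versions, so ?ARCensReg and ?ARtCensReg should be consulted before running.

```r
# Fitting the censored AR(2) models with normal and Student-t innovations.
# cc: censoring indicator; lcl/ucl: lower and upper censoring limits;
# x: design matrix; p: autoregressive order. Argument names are assumptions.
library(ARCensReg)
fit_normal <- ARCensReg(cc = rho, lcl = v1, ucl = v2, y = y_obs, x = X, p = 2)
fit_t      <- ARtCensReg(cc = rho, lcl = v1, ucl = v2, y = y_obs, x = X, p = 2)
```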
Table 2 displays the mean of the estimates for each parameter by level of perturbation and censoring rate. To obtain values of $\sigma^2$ comparable between models, for the model with Student-$t$ innovations we reported the estimated variance of the innovation, $\widehat{\mathrm{Var}}(\eta_t) = \widehat{\nu}\widehat{\sigma}^2/(\widehat{\nu} - 2)$, where $\widehat{\sigma}^2$ is the estimate of the scale parameter under our proposal. Moreover, the percentage of times that our model detected the perturbed observation as influential, denoted by DI (%), is also reported; it was computed as the number of times the estimated weight ($\widehat{u}_t$, computed during the E-step of the SAEM algorithm) of the perturbed observation was the lowest one, divided by the number of MC samples.
Table 2a and Figure 5a show the results for the non-censored dataset, where it is possible to observe that, for the normal distribution, the bias increases as the perturbation increases, as is natural; however, when Student-$t$ innovations are considered, the bias is much smaller, illustrating the robustness of the model against atypical observations. As expected, the estimates of $\nu$ decrease as the perturbation grows. For the non-perturbed samples, the observation with the maximum value was detected as influential in only 17% of the samples, but this percentage increases quickly with the perturbation. Observe that for $\kappa = 4$, 5, and 7, the perturbed observation was detected as influential in all MC samples.
Table 2b and Figure 5b display the results for the samples with an average of 10.04% of censoring. The estimates of the regression coefficients have a distribution similar to the non-censored case. On the other hand, for $\sigma^2$, a more significant difference was observed between the true value and its estimate under the normal model. In contrast, the model with heavy-tailed innovations performs better in recovering the true parameter values, with a mean value of $\widehat{\nu}$ smaller than that observed previously.
Finally, the results for the scenario with an average of 30% of left censoring are shown in Table 2c and Figure 5c. For the normal case, the bias of $\sigma^2$ is more pronounced than in the previous two cases, while for our model its mean and median are generally close to the true parameter value for all perturbation levels. For $\beta_0$, the normal model returned estimates close to the true value only for perturbation levels lower than 4 ($\kappa < 4$), while for larger perturbations it tends to underestimate the parameter. These results confirm that the heavy tails of the Student-$t$ distribution give our model the ability to mitigate the effect of outliers, i.e., a much more robust method against atypical values.
5 Application
In this section, the CART($p$) model is applied to a real environmental dataset containing both missing and censored observations. We consider the phosphorus concentration dataset previously analyzed by schumacher2017censored and wang2018quasi, which is available in the package ARCensReg. In this study, we are interested in tracking the phosphorus concentration levels over time as an indicator of river water quality since, for instance, excessive phosphorus in surface water may result in eutrophication. The data consist of 181 observations of phosphorus concentration (P) in mg/L measured monthly from October 1998 to October 2013 in the West Fork Cedar River at Finchford, Iowa, USA. The phosphorus concentration measurement was subject to a LOD of 0.02, 0.05, or 0.10, depending on the time, and therefore 28 (15.47%) observations are left-censored. Moreover, there are 7 (3.87%) missing observations from September 2008 to March 2009 due to a program suspension caused by a temporary lack of funding.
As mentioned by wang2018quasi, phosphorus concentration levels are generally correlated with the water discharge (Q), which is measured in cubic feet per second; our objective is then to explore the relationship between P and Q when the response contains censored and missing observations. Aiming to evaluate the prediction accuracy, the dataset was split into training and testing sets. The training dataset consists of the first 169 observations, of which 20.71% are left-censored or missing, while the testing dataset contains the remaining 12 observations, all fully observed. Following wang2018quasi, to make the P-Q relationship more linear, we considered the logarithmic transformation of P and Q and fitted the following censored autoregressive model:
$$Y_t = \sum_{i=1}^{4} \left(\beta_{0i} + \beta_{1i} x_t\right) D_{it} + \xi_t, \qquad t = 1, \dots, n, \qquad (17)$$
where $Y_t = \log(P_t)$, $x_t = \log(Q_t)$, the regression error $\xi_t$ follows an autoregressive model, and $D_{it}$ is a dummy seasonal variable for quarter $i$, $i = 1, 2, 3, 4$. The first quarter corresponds to January through March, the second one to April through June, and so on. In this model, $\beta_{0i}$ and $\beta_{1i}$ are respectively the intercept and the slope for quarter $i$, $i = 1, 2, 3, 4$.
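The design matrix implied by Equation 17 (four quarterly intercepts and four quarterly slopes) can be built as in the following sketch, where dat is a hypothetical data frame with columns P, Q, and month:

```r
# Quarterly dummies interacted with the intercept and with log-discharge.
quarter <- factor(ceiling(dat$month / 3))        # 1: Jan-Mar, ..., 4: Oct-Dec
y <- log(dat$P)                                  # log phosphorus concentration
x <- log(dat$Q)                                  # log water discharge
X <- model.matrix(~ 0 + quarter + quarter:x)     # 4 intercepts + 4 slopes
```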
The AR order $p$ is unknown, but since the second observation in the training set (November 1998) is censored, it is not possible to consider a model of order greater than one, because the proposed algorithm assumes that the first $p$ values are completely observed. Therefore, for this application, we only considered a model of order 1. Besides, schumacher2017censored concluded that a censored autoregressive model of order 1 was the best for fitting this dataset based on information criteria and the mean squared prediction error (MSPE). The authors also found that the dataset has some influential observations, so it seems reasonable to consider a model with innovations following a heavy-tailed distribution.
Table 3: Parameter estimates and standard errors (SE) for the CART(1) and CAR(1) models fitted to the phosphorus concentration data, along with the MSPE and MAPE computed on the testing dataset.

| Parameter | CART(1): Estimate | SE | 95% CI | CAR(1): Estimate | SE | 95% CI |
|---|---|---|---|---|---|---|
| $\beta_{01}$ | -4.338 | 0.609 | (-5.530, -3.145) | -4.670 | 0.513 | (-5.674, -3.665) |
| $\beta_{02}$ | -2.731 | 0.750 | (-4.201, -1.262) | -3.022 | 0.751 | (-4.494, -1.549) |
| $\beta_{03}$ | -4.141 | 0.419 | (-4.963, -3.319) | -4.174 | 0.442 | (-5.041, -3.307) |
| $\beta_{04}$ | -4.651 | 0.545 | (-5.719, -3.583) | -4.999 | 0.549 | (-6.075, -3.922) |
| $\beta_{11}$ | 0.293 | 0.107 | (0.084, 0.503) | 0.373 | 0.085 | (0.206, 0.541) |
| $\beta_{12}$ | 0.139 | 0.105 | (-0.066, 0.345) | 0.185 | 0.105 | (-0.021, 0.391) |
| $\beta_{13}$ | 0.363 | 0.070 | (0.227, 0.500) | 0.364 | 0.075 | (0.218, 0.510) |
| $\beta_{14}$ | 0.351 | 0.097 | (0.161, 0.542) | 0.410 | 0.098 | (0.218, 0.601) |
| $\phi_1$ | -0.087 | 0.085 | - | -0.077 | 0.089 | - |
| $\sigma^2$ | 0.176 | 0.046 | - | 0.254 | 0.030 | - |
| $\nu$ | 5.002 | 3.059 | - | - | - | - |
| MSPE | 0.101 | | | 0.126 | | |
| MAPE | 0.240 | | | 0.255 | | |
For comparison purposes, we considered the model in Equation 17 with regression errors following an autoregressive model of order 1 (as defined by Equation 2), with innovations independent and identically distributed as Student-$t$ and normal (CART(1) and CAR(1), respectively). Both models were fitted using the R library ARCensReg through the functions ARtCensReg and ARCensReg, respectively. Parameter estimates and their corresponding standard errors (SEs) are displayed in Table 3, along with the MSPE and the mean absolute prediction error (MAPE), computed considering one-step-ahead predictions for the testing dataset. These criteria indicate that the heavy-tailed Student-$t$ (CART(1)) model provides better predictions than the normal model for the phosphorus concentration data.
One can also note that the estimates of the intercepts $\beta_{0i}$ under the CART(1) model are negative and greater than those obtained from the CAR(1) model. On the other hand, the estimates of the slopes are all positive and, except for the second quarter, significantly different from zero, evidencing the correlation between the water discharge and the phosphorus concentration. Regarding the autoregressive coefficient $\phi_1$, both models estimated similar values.
Figure 6 presents a Q-Q plot (left), a residuals vs. time scatterplot (middle), and a histogram (right) for the residual analysis of both models. For the CART(1) model (top), we see in the Q-Q plot that all points form a roughly straight line and lie within the confidence bands. Further, the histogram resembles that of a normally distributed variable, and the dispersion plot is compatible with independent residuals. On the other hand, the Q-Q plot for the CAR(1) model (bottom) presents some points outside of the confidence bands in the upper tail, indicating that the residual distribution is skewed or heavy-tailed. Additionally, the scatterplot and histogram display larger residual values than those from the Student-$t$ model. Therefore, the CART(1) model seems to provide a better fit to the phosphorus concentration data than the CAR(1) model. Plots of the sample autocorrelation of the quantile residuals are displayed in Appendix B, Figure 9.
[Figure 6: Quantile residual diagnostics for the CART(1) model (top) and the CAR(1) model (bottom).]
Finally, Figure 7 shows the imputed values for the censored and missing observations from October 1998 to October 2012 (training dataset) and the predicted values from November 2012 to October 2013 (yellow box), considering normal (light blue line) and Student-$t$ innovations (pink line), with the observed values represented by a black solid line and vertical black dashed lines indicating the period with missing values. Here, we see slight differences between the predicted values obtained under the two fitted models. Besides, the general behavior of the imputed values for the missing period seems to capture well the seasonal behavior of the time series and is also similar for both models. In addition, to assess the convergence of the SAEM parameter estimates, convergence plots for our proposal are displayed in Figure 10 in Appendix B.
6 Conclusions
Extending autoregressive regression methods to include censored response variables is a promising area of research. This paper introduces a novel model for left-, right-, or interval-censored time series that simultaneously accommodates heavy tails and missing observations, the latter being treated as interval-censored observations.
Our approach extends some previous works, such as schumacher2017censored and liu2019parameter, and handles missing data while respecting the nature of the data, without the need for transformations. The proposed methods have been coded and implemented in the R package ARCensReg, which is available to users on the CRAN repository.
It is important to remark that we assumed the dropout/censoring mechanism to be missing at random (MAR) (see diggle2002analysis, p. 283). However, in cases where MAR with ignorability is not realistic, the relationship between the unobserved measurements and the censoring process should be further investigated. Future directions point to tackling the limitation of assuming that the first $p$ observations are fully observed when fitting a CART($p$) model. Furthermore, a natural and interesting path for future research is to extend this model to a multivariate framework.
Acknowledgments
This study was financed by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001. Larissa A. Matos acknowledges support from FAPESP-Brazil (Grant 2020/16713-0).
Appendix A Conditional distributions and derivatives
A.1 Full conditional distributions
The full conditional distributions needed to perform the E-step of the SAEM algorithm are $f(\mathbf{u} \mid \mathbf{y}, \boldsymbol{\theta})$ and $f(\mathbf{y}_m \mid \mathbf{y}_o, \mathbf{u}, \boldsymbol{\theta})$. The conditional pdf of $u_t$ given $\mathbf{y}$ is
$$f(u_t \mid \mathbf{y}, \boldsymbol{\theta}) \propto f(y_t \mid \mathbf{y}_{(t-p):(t-1)}, u_t)\, f(u_t) \propto u_t^{\frac{\nu+1}{2} - 1} \exp\left\{-\frac{u_t}{2}\left(\nu + \frac{(y_t - \mu_t)^2}{\sigma^2}\right)\right\}.$$
From the last expression, we deduce that $u_t$ is conditionally independent of $u_s$, for all $t \neq s$, with $u_t \mid \mathbf{y}, \boldsymbol{\theta} \sim \mathrm{Gamma}\left(\frac{\nu+1}{2}, \frac{\nu + \delta_t^2}{2}\right)$, where $\delta_t^2 = (y_t - \mu_t)^2/\sigma^2$ and $\mu_t = \mathbf{x}_t^\top \boldsymbol{\beta} + \sum_{i=1}^{p} \phi_i (y_{t-i} - \mathbf{x}_{t-i}^\top \boldsymbol{\beta})$, for $t = p+1, \dots, n$.
To compute the conditional distribution of $\mathbf{y}_m$, we consider that the first $p$ observations are completely observed. Writing $\boldsymbol{\xi}_t = (\xi_t, \xi_{t-1}, \dots, \xi_{t-p+1})^\top$, the AR($p$) errors admit the VAR(1) representation suggested by zhou2020student,
$$\boldsymbol{\xi}_t = \mathbf{A} \boldsymbol{\xi}_{t-1} + \mathbf{B} \eta_t,$$
where $\mathbf{B} = (1, 0, \dots, 0)^\top$ is a vector of dimension $p$ and $\mathbf{A}$ is the $p \times p$ companion matrix whose first row is $\boldsymbol{\phi}^\top$ and whose remaining rows contain the identity matrix $\mathbf{I}_{p-1}$ in the first $p-1$ columns and $0$s in the last one. Through a recursive process based on the VAR(1) form, we have
$$\boldsymbol{\xi}_t = \mathbf{A}^{t-p} \boldsymbol{\xi}_p + \sum_{j=0}^{t-p-1} \mathbf{A}^{j} \mathbf{B}\, \eta_{t-j}, \qquad t > p.$$
Note that the model defined in Equations 1 and 2 can be recovered through the first element of the preceding vectors, i.e.,
$$Y_t = \mathbf{x}_t^\top \boldsymbol{\beta} + \tilde{\mathbf{a}}_{t-p}^\top \boldsymbol{\xi}_p + \sum_{j=0}^{t-p-1} \bar{a}_j\, \eta_{t-j}, \qquad (18)$$
where $\mathbf{A}^{j}$ represents the matrix $\mathbf{A}$ multiplied by itself $j$ times, $\tilde{\mathbf{a}}_{k}^\top$ is the vector whose elements correspond to the first row of $\mathbf{A}^{k}$, $\bar{a}_j$ is the element $(1,1)$ of $\mathbf{A}^{j}$, and $\boldsymbol{\xi}_p = (\xi_p, \xi_{p-1}, \dots, \xi_1)^\top$ with $\xi_t = y_t - \mathbf{x}_t^\top \boldsymbol{\beta}$. From Equation 18, we deduce that $Y_t$ given $\mathbf{y}_{1:p}$, $\mathbf{u}$, and $\boldsymbol{\theta}$ is normally distributed, for all $t > p$. Thus, the conditional expectation and the elements of the variance-covariance matrix that characterize this normal distribution are given by
$$\mathrm{E}\left[Y_t \mid \mathbf{y}_{1:p}, \mathbf{u}, \boldsymbol{\theta}\right] = \mathbf{x}_t^\top \boldsymbol{\beta} + \tilde{\mathbf{a}}_{t-p}^\top \boldsymbol{\xi}_p, \qquad (19)$$
$$\mathrm{Cov}\left(Y_t, Y_s \mid \mathbf{y}_{1:p}, \mathbf{u}, \boldsymbol{\theta}\right) = \sigma^2 \sum_{j=0}^{t-p-1} \frac{\bar{a}_j\, \bar{a}_{j+s-t}}{u_{t-j}}, \qquad p < t \leq s \leq n. \qquad (20)$$
Now, suppose that $\mathbf{y}$ is partitioned into two vectors, $\mathbf{y}_{1:p}$, containing the first $p$ (fully observed) values, and $\mathbf{y}_{(p+1):n}$, containing the remaining observations, i.e., $\mathbf{y} = (\mathbf{y}_{1:p}^\top, \mathbf{y}_{(p+1):n}^\top)^\top$. Let $\mathbf{y}_o$ and $\mathbf{y}_m$ be the observed and the censored/missing parts of $\mathbf{y}_{(p+1):n}$, respectively. Then $\mathbf{y}_{(p+1):n} \mid \mathbf{y}_{1:p}, \mathbf{u}, \boldsymbol{\theta} \sim N_{n-p}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, where the $t$th element of $\boldsymbol{\mu}$ is given by Equation 19 and the $(t, s)$ element of $\boldsymbol{\Sigma}$ is given by Equation 20, for all $t, s = p+1, \dots, n$.
To compute the conditional distribution of $\mathbf{y}_m$, we rearrange the elements of $\mathbf{y}_{(p+1):n}$, $\boldsymbol{\mu}$, and $\boldsymbol{\Sigma}$ as follows:
$$\mathbf{y}_{(p+1):n} = \begin{pmatrix} \mathbf{y}_o \\ \mathbf{y}_m \end{pmatrix}, \qquad \boldsymbol{\mu} = \begin{pmatrix} \boldsymbol{\mu}_o \\ \boldsymbol{\mu}_m \end{pmatrix}, \qquad \boldsymbol{\Sigma} = \begin{pmatrix} \boldsymbol{\Sigma}_{oo} & \boldsymbol{\Sigma}_{om} \\ \boldsymbol{\Sigma}_{mo} & \boldsymbol{\Sigma}_{mm} \end{pmatrix}.$$
Using the results for the conditional distribution of the multivariate normal, we obtain
$$\mathbf{y}_m \mid \mathbf{y}_o, \mathbf{y}_{1:p}, \mathbf{u}, \boldsymbol{\theta} \sim N(\boldsymbol{\mu}_{m.o}, \boldsymbol{\Sigma}_{m.o}),$$
with $\boldsymbol{\mu}_{m.o} = \boldsymbol{\mu}_m + \boldsymbol{\Sigma}_{mo} \boldsymbol{\Sigma}_{oo}^{-1} (\mathbf{y}_o - \boldsymbol{\mu}_o)$ and $\boldsymbol{\Sigma}_{m.o} = \boldsymbol{\Sigma}_{mm} - \boldsymbol{\Sigma}_{mo} \boldsymbol{\Sigma}_{oo}^{-1} \boldsymbol{\Sigma}_{om}$.
A.2 First and second derivatives of the complete-data log-likelihood function
The complete-data log-likelihood function for the model defined by Equations 1-6 is given by
$$\ell_c(\boldsymbol{\theta}; \mathbf{y}_c) = \mathrm{cte} - \frac{n-p}{2} \log \sigma^2 - \frac{1}{2\sigma^2} \sum_{t=p+1}^{n} u_t r_t^2 + \sum_{t=p+1}^{n} \left[\frac{\nu}{2} \log\left(\frac{\nu}{2}\right) - \log \Gamma\left(\frac{\nu}{2}\right) + \frac{\nu - 1}{2} \log u_t - \frac{\nu}{2} u_t\right],$$
where $r_t = y_t - \mathbf{x}_t^\top \boldsymbol{\beta} - \boldsymbol{\phi}^\top \mathbf{w}_t$, $\mathbf{w}_t = (y_{t-1} - \mathbf{x}_{t-1}^\top \boldsymbol{\beta}, \dots, y_{t-p} - \mathbf{x}_{t-p}^\top \boldsymbol{\beta})^\top$, and $\boldsymbol{\alpha}_t = \mathbf{x}_t - \sum_{i=1}^{p} \phi_i \mathbf{x}_{t-i}$.

The first derivatives of $\ell_c$ with respect to each element of $\boldsymbol{\theta}$ are:
$$\frac{\partial \ell_c}{\partial \boldsymbol{\beta}} = \frac{1}{\sigma^2} \sum_{t=p+1}^{n} u_t r_t \boldsymbol{\alpha}_t, \qquad \frac{\partial \ell_c}{\partial \boldsymbol{\phi}} = \frac{1}{\sigma^2} \sum_{t=p+1}^{n} u_t r_t \mathbf{w}_t,$$
$$\frac{\partial \ell_c}{\partial \sigma^2} = -\frac{n-p}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{t=p+1}^{n} u_t r_t^2, \qquad \frac{\partial \ell_c}{\partial \nu} = \frac{n-p}{2} \left[\log\left(\frac{\nu}{2}\right) + 1 - \psi\left(\frac{\nu}{2}\right)\right] + \frac{1}{2} \sum_{t=p+1}^{n} \left(\log u_t - u_t\right),$$
with $\psi(\cdot)$ denoting the digamma function.

Besides, the elements of the Hessian matrix are given by:
$$-\frac{\partial^2 \ell_c}{\partial \boldsymbol{\beta} \partial \boldsymbol{\beta}^\top} = \frac{1}{\sigma^2} \sum_{t=p+1}^{n} u_t \boldsymbol{\alpha}_t \boldsymbol{\alpha}_t^\top, \qquad -\frac{\partial^2 \ell_c}{\partial \boldsymbol{\phi} \partial \boldsymbol{\phi}^\top} = \frac{1}{\sigma^2} \sum_{t=p+1}^{n} u_t \mathbf{w}_t \mathbf{w}_t^\top,$$
$$-\frac{\partial^2 \ell_c}{\partial \boldsymbol{\beta} \partial \boldsymbol{\phi}^\top} = \frac{1}{\sigma^2} \sum_{t=p+1}^{n} u_t \left(\boldsymbol{\alpha}_t \mathbf{w}_t^\top + r_t \widetilde{\mathbf{X}}_t^\top\right), \qquad -\frac{\partial^2 \ell_c}{\partial \boldsymbol{\beta} \partial \sigma^2} = \frac{1}{\sigma^4} \sum_{t=p+1}^{n} u_t r_t \boldsymbol{\alpha}_t,$$
$$-\frac{\partial^2 \ell_c}{\partial \boldsymbol{\phi} \partial \sigma^2} = \frac{1}{\sigma^4} \sum_{t=p+1}^{n} u_t r_t \mathbf{w}_t, \qquad -\frac{\partial^2 \ell_c}{\partial (\sigma^2)^2} = -\frac{n-p}{2\sigma^4} + \frac{1}{\sigma^6} \sum_{t=p+1}^{n} u_t r_t^2,$$
$$-\frac{\partial^2 \ell_c}{\partial \nu^2} = \frac{n-p}{2} \left[\frac{1}{2} \psi_1\left(\frac{\nu}{2}\right) - \frac{1}{\nu}\right],$$
with $\psi_1(\cdot)$ denoting the trigamma function and $\widetilde{\mathbf{X}}_t = (\mathbf{x}_{t-1}, \dots, \mathbf{x}_{t-p})^\top$; the cross derivatives of $\nu$ with the remaining parameters are zero.
Appendix B Additional results
This section displays some additional results obtained from the simulation study and the analysis of the phosphorus concentration dataset.
B.1 Simulation study 1
Figure 8 shows boxplots of the estimates of $\nu$ considering different sample sizes and limits of detection (LODs). Here it is possible to observe that the median of the estimates is close to the true value ($\nu = 4$) independently of the sample size and LOD. Furthermore, the interquartile ranges decrease as the sample sizes increase, suggesting the consistency of the estimates.
B.2 Application
For the phosphorus concentration dataset, available in the package ARCensReg, two censored autoregressive models were fitted: one considering Student-$t$ innovations (CART(1)) and the other considering normal innovations (CAR(1)). Figure 9 displays the sample autocorrelation of the quantile residuals computed from each model, where we note that the obtained residuals present negligible serial autocorrelation.
Finally, Figure 10 shows the convergence plots of the SAEM parameter estimates obtained at each iteration for the CART(1) model, the model selected as the best for fitting this dataset based on prediction accuracy and residual analysis.