Survey Data Integration for Distribution Function Estimation
Abstract
Integration of probabilistic and non-probabilistic samples for the estimation of finite population totals (or means) has recently received considerable attention in the field of survey sampling; yet, to the best of our knowledge, this framework has not been extended to cumulative distribution function (CDF) estimation. To address this gap, we propose a novel CDF estimator that integrates data from probability samples with data from potentially large nonprobability samples. Assuming that a set of shared covariates is observed in both samples, while the response variable is observed only in the latter, the proposed estimator uses a survey-weighted empirical CDF of regression residuals trained on the convenience sample to estimate the CDF of the response variable. Under some regularity conditions, we show that our CDF estimator is both design-consistent for the finite population CDF and asymptotically normally distributed. Additionally, we define and study a quantile estimator based on the proposed CDF estimator. Furthermore, we use both the bootstrap and asymptotic formulae to estimate their respective sampling variances. Our empirical results show that the proposed CDF estimator is robust to model misspecification under ignorability, and robust to violations of ignorability under a correctly specified model. When both assumptions are violated, our residual-based CDF estimator still outperforms its ‘plug-in’ mass imputation and naïve siblings, albeit with noted decreases in efficiency.
Keywords: Data integration; Probability surveys; Nonprobability samples; Distribution functions; Quantile estimation
1 Introduction
Much of the literature on survey data integration focuses on estimating the finite population mean $\bar{Y}_N = N^{-1}\sum_{i \in U} y_i$, where $U$ denotes a finite population of size $N$, $Y$ is the variable of interest, and $y_i$ is the value assigned to the $i$-th unit of the population. Given a probability sample $A \subset U$ of size $n_A$, and inclusion probability $\pi_i = P(i \in A) > 0$ for all $i \in U$, the [27] (HT) finite population mean estimator, $\hat{\bar{Y}}_\pi = N^{-1}\sum_{i \in A} \pi_i^{-1} y_i$, is design-unbiased with a relatively simple closed-form sampling variance [40]. An interesting scenario arises when $y_i$ is missing for all $i \in A$, but a set of covariates, say $\mathbf{x}_i$, remains fully measured. Furthermore, assume that there exists a nonprobability (convenience) sample, $B \subset U$, whose $y_i$ and $\mathbf{x}_i$ are both fully measured for all $i \in B$. Although $\hat{\bar{Y}}_\pi$ cannot be directly calculated in $A$, it is possible to ‘impute’ $A$’s missing values by fitting a regression model for $Y$ on $\mathbf{X}$ in sample $B$; this approach is often referred to as mass imputation in the literature. [30] investigated the use of generalized linear models as the imputation model and established the consistency of their proposed parametric mass imputation estimator (PMIE). Their empirical results imply optimal performance when (a) the mean (regression) function in $B$ can be ‘transported’ to $A$ (suitably named transportability), and (b) the regression model used is reasonably specified. [25] extended this work to the realm of nonparametric regression, which makes little to no assumption about the parametric form of the regression function. There, they proposed mass imputation estimators based on kernel regression (NPMIEK) and generalized additive models (NPMIEG), and noted marked gains in efficiency for both compared to the PMIE.
A natural extension of this work involves estimating the finite population cumulative distribution function (CDF), $F_N(t)$, and the quantile function, $t_N(\alpha)$. The CDF of a finite population is defined as the proportion of elements whose $y_i$ for $i \in U$ lie at or below a quantile $t$; that is,
$$F_N(t) = \frac{1}{N}\sum_{i \in U} I(y_i \le t), \tag{1.1}$$
where $I(\cdot)$ returns one if the condition in its argument is satisfied, and zero otherwise. The quantile function, in turn, defines the $\alpha$-th quantile as the smallest $t$ among all $y_i$ such that $F_N(t) \ge \alpha$; or, equivalently,
$$t_N(\alpha) = \inf\{t : F_N(t) \ge \alpha\}. \tag{1.2}$$
If $y_i$ was measured for all $i \in A$, then, as with $\hat{\bar{Y}}_\pi$, one could prove that the HT CDF estimator,
$$\hat{F}_\pi(t) = \frac{1}{N}\sum_{i \in A} \pi_i^{-1} I(y_i \le t),$$
alongside its corresponding quantile estimator,
$$\hat{t}_\pi(\alpha) = \inf\{t : \hat{F}_\pi(t) \ge \alpha\},$$
are at least asymptotically unbiased with relatively simple variance expressions [24, 44]. Furthermore, if $\mathbf{x}_i$ was measured for all $i \in U$, a more efficient estimator of $F_N(t)$ could be obtained by predicting $I(y_i \le t)$ (a) for $i \notin A$ or (b) for all $i \in U$, then correcting for design bias [37, 24, 29]. Neither of these approaches is compatible with the monotone missingness framework described above, though. To address this gap in the literature, we modify [24]’s residual-based CDF estimator to allow for information integration between $A$ and $B$, and denote the resulting estimator by $\hat{F}_R(t)$; the idea is to use a regression model built on $B$ to obtain a vector of estimated residuals, $\hat{\boldsymbol{\epsilon}}$, and use these residuals’ empirical CDF (eCDF) as a substitute for $I(y_i \le t)$ in $\hat{F}_\pi(t)$. We also explore the use of $\hat{t}_R(\alpha)$ as a corresponding quantile estimator.
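As a minimal sketch of the two complete-data estimators above, assuming the response were observed in the probability sample, the following Python (NumPy) functions compute a survey-weighted CDF and invert it to obtain a quantile. The names `ht_cdf` and `ht_quantile`, and the fallback of normalising by the sum of the weights when the population size is not supplied, are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def ht_cdf(y, w, t, N=None):
    """Survey-weighted estimate of the finite population CDF at t.

    y : responses observed in the probability sample
    w : design weights (inverse inclusion probabilities)
    N : population size; if None, the sum of the weights is used instead
        (an assumption made here for convenience).
    """
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    denom = w.sum() if N is None else float(N)
    return float(np.sum(w * (y <= t)) / denom)

def ht_quantile(y, w, alpha, N=None):
    """Smallest observed y whose estimated CDF is at least alpha."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    order = np.argsort(y)
    y_sorted, w_sorted = y[order], w[order]
    denom = w.sum() if N is None else float(N)
    cum = np.cumsum(w_sorted) / denom          # estimated CDF at each sorted y
    idx = np.searchsorted(cum, alpha, side="left")
    # If the estimated CDF never reaches alpha, the quantile is undefined.
    return float(y_sorted[idx]) if idx < len(y_sorted) else float("nan")
```

Both functions need the response in the probability sample, which is exactly what is unavailable in the data integration setting; they serve only as the complete-data benchmark.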
The rest of the paper is organized as follows. In Section 2, we introduce some key concepts and set notations that are used throughout the paper. Next, in Section 3, we introduce our residual eCDF estimator and derive its asymptotic properties; we then contrast it with a competing ‘plug-in’ CDF estimator in Section 3.3, and derive their corresponding quantile estimators in Section 3.4. Monte Carlo simulation results are provided in Section 4, which demonstrate the finite-sample performance of the proposed CDF and quantile estimators. In Section 5, we illustrate the proposed estimators using a real data application. We conclude in Section 6 with a discussion of the main results.
2 Background and Preliminaries
Let $U = \{1, \ldots, N\}$ denote an index set for the units in a finite population of size $N$, and let $A \subset U$ denote a probability sample drawn from $U$ with measured variables $\{(w_i, \mathbf{x}_i) : i \in A\}$, where $w_i = \pi_i^{-1}$ represents the sampling weight for element $i$ and $\mathbf{x}_i$ represents a set of covariates. In a similar vein, let $B \subset U$ denote a nonprobability sample from $U$ with measured variables $\{(\mathbf{x}_i, y_i) : i \in B\}$, where $y_i$ represents a response variable; note that $w_i$ is unmeasured in $B$, which differentiates $B$ from $A$ as a convenience sample. The missingness pattern outlined here is summarized in Table 2.1.
Sample | ||||||
---|---|---|---|---|---|---|
✓ | ✓ | ✓ | ✓ | ✓ | ||
✓ | ✓ | ✓ | ✓ | ✓ |
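To incorporate the covariates in estimating $F_N(t)$, we assume that the finite population is a realization from the following superpopulation model, $\xi$:
$$y_i = m(\mathbf{x}_i; \boldsymbol{\beta}) + \nu(\mathbf{x}_i)\,\epsilon_i, \qquad i \in U, \tag{2.1}$$
where $m(\mathbf{x}_i; \boldsymbol{\beta})$ is a known function of $\mathbf{x}_i$ parameterized by an unknown parameter vector $\boldsymbol{\beta}$, $\nu(\mathbf{x}_i)$ is a known, strictly positive function, and $\epsilon_i$ is an error term satisfying $E(\epsilon_i \mid \mathbf{x}_i) = 0$ and $\mathrm{Var}(\epsilon_i \mid \mathbf{x}_i) = \sigma^2$. Now, if $U$ was observed in its entirety, a natural estimator of $\boldsymbol{\beta}$ can be chosen to solve the finite population score function
$$\sum_{i \in U} \{y_i - m(\mathbf{x}_i; \boldsymbol{\beta})\}\,\mathbf{h}(\mathbf{x}_i; \boldsymbol{\beta}) = \mathbf{0}$$
for some $p$-dimensional function $\mathbf{h}(\mathbf{x}_i; \boldsymbol{\beta})$; this estimator is henceforth denoted as $\boldsymbol{\beta}_N$, since it requires full information of all population elements. Nonetheless, given that $y_i$ is present within $B$ alone, we estimate $\boldsymbol{\beta}$ by solving a sample-based score function,
$$\sum_{i \in B} \{y_i - m(\mathbf{x}_i; \boldsymbol{\beta})\}\,\mathbf{h}(\mathbf{x}_i; \boldsymbol{\beta}) = \mathbf{0},$$
whose solution is henceforth denoted as $\hat{\boldsymbol{\beta}}$.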
Once the estimated coefficient vector is obtained, mass imputation uses the fitted regression function to ‘impute’ the missing responses in the probability sample, assuming positivity and transportability (stated in Definitions 1 and 2, respectively).
Definition 1 (Positivity Assumption).
Let $\delta_i$ be an indicator function that returns 1 if observation $i$ is included in sample $B$ and 0 otherwise. The assumption of positivity is satisfied if $P(\delta_i = 1 \mid \mathbf{x}_i = \mathbf{x}) > 0$ for all $i \in U$ and $\mathbf{x}$ in the support of $\mathbf{X}$. That is, for every value $\mathbf{x}$, there is a positive chance for the corresponding units to be selected in sample $B$.
Definition 2 (Transportability Assumption).
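Let $\delta_i$ again denote the sample indicator for $B$, and let $f(y \mid \mathbf{x})$ represent a function of $y$ conditioned on $\mathbf{x}$. The assumption of transportability is satisfied if $f(y \mid \mathbf{x}, \delta = 1) = f(y \mid \mathbf{x})$.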
Transportability is a crucial assumption, as it allows us to “transport” the prediction model from the nonprobability sample to the probability sample. As described in [30], a sufficient condition for transportability to hold is the ignorability condition, stated in Definition 3.
Definition 3 (Ignorability Condition).
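Let $\delta_i$ again denote the sample indicator for $B$, and let $P(\delta_i = 1 \mid \mathbf{x}_i, y_i)$ denote the probability of unit $i$ being included in sample $B$ conditioned on $(\mathbf{x}_i, y_i)$. The assumption of ignorability is satisfied if $P(\delta_i = 1 \mid \mathbf{x}_i, y_i) = P(\delta_i = 1 \mid \mathbf{x}_i)$.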
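Definition 3 is essentially the missing at random (MAR) assumption [39]. It is commonly known that, under MAR (or, ignorability), $\hat{\boldsymbol{\beta}}$ is a consistent estimator of $\boldsymbol{\beta}$, in that the sampling distribution of $\hat{\boldsymbol{\beta}}$ concentrates ever more tightly around $\boldsymbol{\beta}$ as $n_B$ increases [30, 33, 42]. In the next section, we use the consistency of $\hat{\boldsymbol{\beta}}$ to derive a residual-based CDF estimator of $F_N(t)$, where $I(y_i \le t)$ in $\hat{F}_\pi(t)$ is replaced by $\hat{G}$, the empirical CDF of the estimated regression residuals from sample $B$, evaluated at $\{t - m(\mathbf{x}_i; \hat{\boldsymbol{\beta}})\}/\nu(\mathbf{x}_i)$.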
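The mass-imputation workflow described in this section can be sketched as follows. This is a minimal illustration under stated assumptions rather than the procedure of [30]: the working model is taken to be linear with an intercept and constant variance function, so that ordinary least squares solves the sample-based score function, and the names `fit_ols`, `predict`, and `mass_imputed_mean` are hypothetical.

```python
import numpy as np

def fit_ols(x_b, y_b):
    """Least-squares fit of a linear working model on the nonprobability sample."""
    x_b = np.asarray(x_b, dtype=float)
    y_b = np.asarray(y_b, dtype=float)
    X = np.column_stack([np.ones(len(y_b)), x_b])   # add an intercept column
    beta_hat, *_ = np.linalg.lstsq(X, y_b, rcond=None)
    return beta_hat

def predict(beta_hat, x):
    """Predicted responses from the fitted linear model."""
    x = np.asarray(x, dtype=float)
    X = np.column_stack([np.ones(x.shape[0]), x])
    return X @ beta_hat

def mass_imputed_mean(x_a, w_a, x_b, y_b, N=None):
    """Weighted mean of imputed responses in the probability sample."""
    w_a = np.asarray(w_a, dtype=float)
    y_hat_a = predict(fit_ols(x_b, y_b), x_a)       # impute the missing y in A
    denom = w_a.sum() if N is None else float(N)
    return float(np.sum(w_a * y_hat_a) / denom)
```

The model is trained where the response is observed and applied where the design weights are observed; transportability is what licenses moving the fitted function from one sample to the other.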
3 Proposed Estimators and Asymptotic Results
3.1 Residual-based CDF Estimator
Note that if model (2.1) holds, may be restated as
Now, letting
denote the finite population distribution function of at some fixed point , we note
which further implies that
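Thus, it is not unreasonable to replace $I(y_i \le t)$ in $\hat{F}_\pi(t)$ with $G[\{t - m(\mathbf{x}_i; \boldsymbol{\beta})\}/\nu(\mathbf{x}_i)]$, where $G$ is the distribution function of the model errors, assuming that the working model (2.1) holds.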
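Given that $A$ is subject to missingness on $y$ only, replacing the error distribution function $G$ with an empirical estimate of $G$ from $B$ remains plausible if $P(\delta_i = 1 \mid \mathbf{x}_i, y_i) = P(\delta_i = 1 \mid \mathbf{x}_i)$ (i.e., ignorability holds). To describe the procedure, let $\hat{G}$ denote the eCDF of the estimated residuals from sample $B$, $\hat{\epsilon}_j = \{y_j - m(\mathbf{x}_j; \hat{\boldsymbol{\beta}})\}/\nu(\mathbf{x}_j)$, such that
$$\hat{G}(u) = \frac{1}{n_B}\sum_{j \in B} I(\hat{\epsilon}_j \le u).$$
Then, the residual-based estimator of $F_N(t)$ is defined as
$$\hat{F}_R(t) = \frac{1}{N}\sum_{i \in A} w_i\, \hat{G}\left\{\frac{t - m(\mathbf{x}_i; \hat{\boldsymbol{\beta}})}{\nu(\mathbf{x}_i)}\right\}. \tag{3.1}$$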
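The construction above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: it assumes a linear working model with an intercept and a constant variance function (so the residuals are simply observed minus fitted values), and the function name `residual_cdf` and its arguments are hypothetical.

```python
import numpy as np

def residual_cdf(t, x_a, w_a, x_b, y_b, N=None):
    """Residual-based CDF estimate at t.

    x_a, w_a : covariates and design weights in the probability sample
    x_b, y_b : covariates and responses in the nonprobability sample
    N        : population size; if None, the sum of the weights is used.
    """
    x_a, w_a = np.asarray(x_a, dtype=float), np.asarray(w_a, dtype=float)
    x_b, y_b = np.asarray(x_b, dtype=float), np.asarray(y_b, dtype=float)

    # Fit the working model on the nonprobability sample.
    X_b = np.column_stack([np.ones(len(y_b)), x_b])
    beta_hat, *_ = np.linalg.lstsq(X_b, y_b, rcond=None)
    resid_sorted = np.sort(y_b - X_b @ beta_hat)

    # Predicted means for the probability sample.
    m_a = np.column_stack([np.ones(len(w_a)), x_a]) @ beta_hat

    # Empirical CDF of the residuals, evaluated at t - m(x_i) for each i in A.
    g_hat = np.searchsorted(resid_sorted, t - m_a, side="right") / len(resid_sorted)

    denom = w_a.sum() if N is None else float(N)
    return float(np.sum(w_a * g_hat) / denom)
```

Each unit of the probability sample contributes a value between zero and one rather than a hard 0/1 indicator, which is what separates this estimator from a plug-in of predicted values.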
3.2 Asymptotic Theory
Before we discuss the asymptotic properties of , we must establish a framework that allows , , and to be infinite. Similar to [28, 43], let denote an increasing sequence of finite populations of size from model (2.1); furthermore, let and denote samples from ( henceforth omitted for brevity) according to some sampling design, such that the sampling fraction converges to a limit in (0,1) as both and tend to infinity [24]. Given this asymptotic framework, let such that, for any ,
Letting again denote convergence in probability, we have previously mentioned that (i.e., ) under the ignorability condition; and since as well, we may use Theorem 2.1.3 of [32] to show .
In the following, we formally define a set of conditions that govern the validity of our asymptotic results.
Assumption 1.
The sampling design of is ignorable; that is,
Assumption 2.
The sampling fraction converges to a limit in as and tend to infinity. Furthermore, where denotes the design-based expectation and .
Assumption 3.
There exist some positive real constants , and such that and for all Furthermore,
Assumption 4.
For any random variable with finite population moments for an arbitrarily small ,
where is the finite population mean of and some positive constant.
Assumption 5.
For any random variable with a finite fourth population moment,
where denotes the HT mean estimator of and the HT estimator of .
Assumption 6.
The coefficient vector satisfies
Assumption 7.
converges to a smooth function as goes to infinity; that is,
where the limiting function is continuous with finite first and second derivatives.
Assumption 1 allows ; Assumption 2 establishes a framework to allow as , and limits the growth rate of both sample sizes; Assumptions 3 and 4 ensure design-based consistency under , the sampling design of , and allows to grow at a faster rate than ; Assumption 5 ensures asymptotic normality under a general design; Assumption 6 establishes regularity conditions for ; and Assumption 7 specifies a limiting smooth function for [43, 24, 36]. Note that, if Assumptions 3 and 6 hold, then .
The following theorem summarizes the asymptotic properties of the proposed residual-based CDF estimator, .
Theorem 1.
Proof.
From here, we may use results from [43], [36], and [37] to show that and have the same limiting normal distribution, and are both design-consistent for ; that is,
where is as given in the theorem.
∎
Proof.
Given that and are both design-consistent for , we may again use Theorem 2.1.3 of [32] to show
(3.2) |
Note that
where . Given that is asymptotically unbiased for , we would like to show that converges in probability to zero.
3.3 Plug-In CDF Estimator
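One may consider a more direct ‘mass imputation’ approach to CDF estimation by directly substituting $y_i$ for all $i \in A$ with predicted values, $\hat{y}_i = m(\mathbf{x}_i; \hat{\boldsymbol{\beta}})$, but this approach will almost always result in higher estimation bias. Let $\hat{F}_P(t)$ denote this direct plug-in CDF estimator such that
$$\hat{F}_P(t) = \frac{1}{N}\sum_{i \in A} w_i\, I(\hat{y}_i \le t).$$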
In the best-case scenario of , we observe
and thus
which is not unbiased for under model 2.1 unless is effectively zero for all . From the perspective of sample , we note , where the unknown residual could trigger an erroneous output even if it is small for all —that is to say, the plug-in approach is ‘all-or-nothing’, in the sense that is either right or wrong with very little room for error. This is not the case for , though, since small implies a well-specified (and transportable) , making a reasonable substitute for within .
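To make the ‘all-or-nothing’ behaviour concrete, the sketch below implements a plug-in CDF estimator under the same illustrative assumptions used elsewhere in these examples (linear working model fitted by least squares; the name `plugin_cdf` is hypothetical) and evaluates it on a simulated population whose parameters are arbitrary choices made here.

```python
import numpy as np

def plugin_cdf(t, x_a, w_a, x_b, y_b, N=None):
    """Plug-in CDF estimate: weighted share of predicted values at or below t."""
    x_a, w_a = np.asarray(x_a, dtype=float), np.asarray(w_a, dtype=float)
    x_b, y_b = np.asarray(x_b, dtype=float), np.asarray(y_b, dtype=float)
    X_b = np.column_stack([np.ones(len(y_b)), x_b])
    beta_hat, *_ = np.linalg.lstsq(X_b, y_b, rcond=None)
    y_hat_a = np.column_stack([np.ones(len(w_a)), x_a]) @ beta_hat
    denom = w_a.sum() if N is None else float(N)
    return float(np.sum(w_a * (y_hat_a <= t)) / denom)

# Toy population (all settings below are illustrative assumptions).
rng = np.random.default_rng(1)
N = 20_000
x = rng.normal(size=N)
y = 1.0 + 2.0 * x + rng.normal(size=N)
a = rng.choice(N, size=500, replace=False)     # probability sample (SRS)
b = rng.choice(N, size=2000, replace=False)    # stand-in for the convenience sample
w_a = np.full(500, N / 500)

t = np.quantile(y, 0.05)                       # a point in the lower tail
true_F = float(np.mean(y <= t))
plug_F = plugin_cdf(t, x[a], w_a, x[b], y[b], N=N)
print(f"true F(t) = {true_F:.3f}, plug-in estimate = {plug_F:.3f}")
```

Because predicted values vary less than the responses they replace, the plug-in CDF tends to be too steep near the centre and too flat in the tails, even when the regression model itself is correct.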
3.4 Quantile Estimation
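We conclude this section with a discussion of the quantile estimation problem. From Eq. 1.2, an estimator of $t_N(\alpha)$ using a generic CDF estimator $\hat{F}(t)$ could be defined as
$$\hat{t}(\alpha) = \inf\{t : \hat{F}(t) \ge \alpha\}, \tag{3.3}$$
or, for $\hat{F}_R(t)$, as
$$\hat{t}_R(\alpha) = \inf\{t : \hat{F}_R(t) \ge \alpha\}, \tag{3.4}$$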
where denotes the percentile of interest. In general, if some is asymptotically unbiased, then its corresponding quantile, , will also be asymptotically unbiased [24]. And, assuming is asymptotically normally distributed, confidence intervals can be constructed using [44]’s formula as follows:
(3.5) | ||||
(3.6) |
where is the percentile under the standard normal distribution and is the estimated standard error of . Then, using [31, 26]’s linear approximations, we may estimate the variance of using
(3.7) |
An alternative variance estimator based on the bootstrap will be defined and empirically evaluated in Section 4.4.
One should note that quantile estimation is extremely sensitive to small sample sizes, generally, and small inclusion probabilities, specifically, as both tend to exclude many (if not most) of the quantiles in . In the next sections, we will evaluate the consequences of small and on the performance of both and using simulated data.
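The inversion step and a Woodruff-type interval can be sketched as below. This is a hedged illustration: the candidate grid, the clipping of the shifted levels to the unit interval, and the variance approximation based on the interval width are standard textbook choices assumed here, and they may differ in detail from Eqs. (3.5) to (3.7). The names `invert_cdf` and `woodruff_interval` are hypothetical.

```python
import numpy as np

def invert_cdf(cdf, candidates, alpha):
    """Smallest candidate value whose estimated CDF is at least alpha.

    cdf        : callable returning the estimated CDF at a point
    candidates : values over which the minimum is searched
    Returns NaN when no candidate reaches alpha.
    """
    cand = np.sort(np.asarray(candidates, dtype=float))
    values = np.array([cdf(t) for t in cand])
    hits = np.nonzero(values >= alpha)[0]
    return float(cand[hits[0]]) if hits.size else float("nan")

def woodruff_interval(cdf, candidates, alpha, se, z=1.645):
    """Interval for a quantile obtained by inverting the CDF at shifted levels.

    se : estimated standard error of the CDF estimator near the quantile
    z  : standard normal percentile (1.645 gives roughly 90% coverage)
    """
    lower = invert_cdf(cdf, candidates, max(alpha - z * se, 0.0))
    upper = invert_cdf(cdf, candidates, min(alpha + z * se, 1.0))
    # Linear approximation: treat the interval as 2*z standard errors wide.
    var_hat = ((upper - lower) / (2.0 * z)) ** 2
    return lower, upper, var_hat
```

Returning NaN when no candidate reaches the requested level makes it explicit that a quantile estimate can be undefined if the estimated CDF never gets close enough to zero or one over the available candidates.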
4 Simulation Study
We conducted a two-phase Monte Carlo simulation study to contrast the finite-sample performance of the proposed CDF and quantile estimators ($\hat{F}_R$ and $\hat{t}_R$) to their plug-in and ‘naïve’, $B$-only equivalents ($\hat{F}_P$ and $\hat{t}_P$; $\hat{F}_B$ and $\hat{t}_B$). We measure the relative performance of the estimators using the relative root mean squared error (RRMSE), defined generically for some as
(4.1) |
where denotes a ‘gold-standard’ estimator of from . Here, , which denote the HT estimators of and , respectively.
4.1 Simulation Settings
To thoroughly assess the performance of our proposed estimators, we considered the following four superpopulation models, , which account for varying levels of complexity between and .
- •
- •
- •
- •
Under each of these models, we generated a finite population $U$ of size . After $U$ was obtained, a simple random sample without replacement (SRS) of size was selected to serve as $A$, the probability sample. For sample $B$ with , we used modifications of [25]’s stratified simple random sampling technique to generate samples under both missing at random (MAR) and missing not at random (MNAR) mechanisms. For MAR, we defined to be the covariate with the strongest correlation to $Y$, and placed all elements with in stratum I and the rest in stratum II. We then used stratified SRS to obtain and elements from the respective strata. This procedure was nearly identical to that used to generate MNAR data, with the exception of using as the stratum discrimination rule. Lastly, using each of the randomly sampled $A$/$B$ sample pairs, we calculated $\hat{F}_B$ (shortened as ‘B’), $\hat{F}_P$ (shortened as ‘P’) and $\hat{F}_R$ (shortened as ‘R’), as well as their respective quantile estimators, at the 1st, 10th, 25th, 50th, 75th, 90th, and 99th percentiles. The two estimators $\hat{F}_P$ and $\hat{F}_R$ used multivariable linear regression, which was fit on sample $B$ using R’s lm() function. Results are presented in Tables 4.1–4.1.
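The sampling scheme just described can be sketched as follows. The stratum cut-off (the median of the stratifying variable) and the allocation (70% of the nonprobability sample from stratum I) are placeholders assumed purely for illustration, since the actual thresholds and stratum sizes are not reproduced here; `draw_samples` is a hypothetical helper name.

```python
import numpy as np

def draw_samples(x, y, n_a, n_b, mechanism="MAR", frac_stratum_1=0.7, rng=None):
    """Draw an SRS probability sample and a stratified nonprobability sample.

    x : the single covariate most correlated with y (used for MAR strata)
    y : the response (used for MNAR strata)
    The median cut-off and frac_stratum_1 are illustrative assumptions.
    """
    rng = np.random.default_rng(rng)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    N = len(y)

    # Probability sample: SRS without replacement, equal weights N / n_a.
    a_idx = rng.choice(N, size=n_a, replace=False)
    w_a = np.full(n_a, N / n_a)

    # Strata depend on x only under MAR, and on y itself under MNAR.
    driver = x if mechanism == "MAR" else y
    in_s1 = driver <= np.median(driver)
    s1, s2 = np.flatnonzero(in_s1), np.flatnonzero(~in_s1)

    # Unequal allocation across strata makes the sample non-representative.
    n1 = int(round(frac_stratum_1 * n_b))
    n2 = n_b - n1
    b_idx = np.concatenate([
        rng.choice(s1, size=n1, replace=False),
        rng.choice(s2, size=n2, replace=False),
    ])
    return a_idx, w_a, b_idx
```

Under MAR the selection depends on the response only through an observed covariate, so conditioning on that covariate restores ignorability; under MNAR the selection depends on the response directly and no adjustment on covariates can fully remove the resulting bias.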
| | Percentile | MAR | | | | | | MNAR | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | B | P | R | B | P | R | B | P | R | B | P | R |
| 500 | 1% | 2.10 | 1.14 | 0.76 | 1.59 | 1.09 | 0.89 | 2.01 | 1.19 | 0.72 | 1.60 | 1.19 | 0.93 |
| | 10% | 4.64 | 1.38 | 0.83 | 3.58 | 1.32 | 0.85 | 5.39 | 1.22 | 0.93 | 3.83 | 1.06 | 0.81 |
| | 25% | 6.64 | 1.25 | 0.90 | 5.63 | 1.19 | 0.93 | 9.09 | 1.12 | 1.36 | 6.34 | 1.02 | 1.22 |
| | 50% | 7.14 | 1.10 | 0.91 | 6.77 | 1.01 | 0.92 | 15.69 | 2.09 | 1.91 | 9.63 | 1.87 | 1.73 |
| | 75% | 6.33 | 1.22 | 0.93 | 7.34 | 1.21 | 0.96 | 9.17 | 2.91 | 2.14 | 13.49 | 2.79 | 2.04 |
| | 90% | 4.48 | 1.42 | 0.91 | 6.30 | 1.46 | 0.93 | 5.25 | 2.92 | 2.03 | 11.08 | 3.11 | 2.01 |
| | 99% | 1.56 | 1.12 | 0.77 | 3.10 | 1.42 | 1.32 | 1.67 | 1.59 | 1.13 | 3.47 | 2.41 | 1.24 |
| 1000 | 1% | 1.83 | 1.13 | 0.73 | 1.38 | 1.09 | 0.87 | 1.75 | 1.17 | 0.70 | 1.30 | 1.13 | 0.89 |
| | 10% | 4.64 | 1.41 | 0.81 | 3.52 | 1.34 | 0.84 | 5.27 | 1.19 | 0.92 | 3.78 | 1.06 | 0.81 |
| | 25% | 6.37 | 1.19 | 0.84 | 5.33 | 1.12 | 0.86 | 9.05 | 1.09 | 1.33 | 6.24 | 0.99 | 1.18 |
| | 50% | 7.47 | 1.04 | 0.87 | 6.95 | 0.94 | 0.86 | 15.67 | 2.03 | 1.86 | 9.71 | 1.84 | 1.70 |
| | 75% | 6.31 | 1.15 | 0.84 | 7.16 | 1.13 | 0.86 | 9.33 | 2.95 | 2.15 | 13.77 | 2.84 | 2.05 |
| | 90% | 4.31 | 1.31 | 0.81 | 6.11 | 1.39 | 0.84 | 5.28 | 2.91 | 2.02 | 11.24 | 3.17 | 2.02 |
| | 99% | 1.62 | 1.14 | 0.74 | 2.94 | 1.41 | 1.30 | 1.64 | 1.62 | 1.15 | 3.32 | 2.48 | 1.22 |
| 5000 | 1% | 1.62 | 1.11 | 0.73 | 1.13 | 1.09 | 0.87 | 1.62 | 1.20 | 0.71 | 1.11 | 1.11 | 0.89 |
| | 10% | 4.64 | 1.44 | 0.80 | 3.47 | 1.35 | 0.82 | 5.07 | 1.18 | 0.88 | 3.60 | 1.07 | 0.78 |
| | 25% | 6.39 | 1.22 | 0.81 | 5.49 | 1.15 | 0.86 | 8.78 | 1.06 | 1.30 | 6.07 | 0.98 | 1.16 |
| | 50% | 7.30 | 1.01 | 0.83 | 6.87 | 0.94 | 0.84 | 15.32 | 1.95 | 1.79 | 9.38 | 1.79 | 1.63 |
| | 75% | 6.40 | 1.16 | 0.84 | 7.43 | 1.17 | 0.86 | 8.98 | 2.75 | 1.99 | 13.53 | 2.70 | 1.94 |
| | 90% | 4.64 | 1.43 | 0.83 | 6.20 | 1.41 | 0.82 | 5.25 | 2.87 | 1.96 | 11.14 | 3.08 | 1.93 |
| | 99% | 1.53 | 1.11 | 0.70 | 2.77 | 1.44 | 1.29 | 1.65 | 1.65 | 1.16 | 3.13 | 2.43 | 1.21 |
| 10000 | 1% | 1.60 | 1.18 | 0.71 | 1.11 | 1.11 | 0.87 | 1.61 | 1.22 | 0.71 | 1.11 | 1.10 | 0.88 |
| | 10% | 4.64 | 1.44 | 0.79 | 3.58 | 1.39 | 0.84 | 5.21 | 1.21 | 0.88 | 3.70 | 1.08 | 0.78 |
| | 25% | 6.45 | 1.22 | 0.82 | 5.41 | 1.15 | 0.85 | 9.16 | 1.08 | 1.32 | 6.32 | 0.98 | 1.18 |
| | 50% | 7.43 | 0.99 | 0.82 | 7.05 | 0.93 | 0.84 | 15.45 | 1.99 | 1.81 | 9.69 | 1.85 | 1.68 |
| | 75% | 6.64 | 1.17 | 0.85 | 7.61 | 1.15 | 0.88 | 9.08 | 2.79 | 2.03 | 13.33 | 2.70 | 1.94 |
| | 90% | 4.52 | 1.37 | 0.81 | 6.30 | 1.43 | 0.84 | 5.26 | 2.89 | 1.98 | 11.04 | 3.06 | 1.93 |
| | 99% | 1.50 | 1.10 | 0.70 | 2.72 | 1.41 | 1.26 | 1.52 | 1.52 | 1.06 | 3.01 | 2.35 | 1.16 |

| | Percentile | MAR | | | | | | MNAR | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | B | P | R | B | P | R | B | P | R | B | P | R |
| | 1% | 2.03 | 2.90 | 5.59 | 1.55 | 3.58 | 4.66 | 1.74 | 2.10 | 4.54 | 1.42 | 2.54 | 3.56 |
| | 10% | 4.87 | 1.32 | 1.64 | 3.60 | 1.69 | 1.96 | 4.06 | 1.46 | 1.54 | 3.01 | 1.68 | 1.66 |
| | 25% | 6.89 | 2.32 | 1.04 | 5.32 | 2.51 | 1.31 | 6.90 | 2.00 | 0.86 | 4.77 | 1.88 | 0.92 |
| | 50% | 7.17 | 1.84 | 1.42 | 6.75 | 1.54 | 1.43 | 11.90 | 1.17 | 1.00 | 7.63 | 0.91 | 0.85 |
| | 75% | 5.96 | 2.42 | 1.27 | 6.68 | 1.78 | 0.99 | 8.55 | 4.48 | 2.96 | 10.66 | 3.34 | 2.35 |
| | 90% | 4.07 | 4.55 | 2.92 | 5.40 | 3.85 | 2.36 | 4.94 | 5.74 | 4.31 | 9.24 | 5.44 | 3.92 |
| | 99% | 1.52 | 2.22 | 2.01 | 2.66 | 4.45 | 2.57 | 1.56 | 2.22 | 2.11 | 3.08 | 5.21 | 3.28 |
| | 1% | 1.88 | 2.90 | 5.66 | 1.36 | 3.59 | 4.73 | 1.47 | 2.08 | 4.49 | 1.16 | 2.64 | 3.70 |
| | 10% | 4.91 | 1.24 | 1.59 | 3.60 | 1.58 | 1.89 | 4.07 | 1.42 | 1.52 | 3.11 | 1.70 | 1.68 |
| | 25% | 6.93 | 2.32 | 0.99 | 5.22 | 2.47 | 1.23 | 6.86 | 1.92 | 0.78 | 4.80 | 1.87 | 0.86 |
| | 50% | 7.30 | 1.80 | 1.39 | 6.90 | 1.53 | 1.41 | 11.49 | 1.09 | 0.93 | 7.33 | 0.84 | 0.79 |
| | 75% | 6.03 | 2.31 | 1.11 | 6.74 | 1.73 | 0.86 | 8.44 | 4.42 | 2.90 | 10.17 | 3.18 | 2.23 |
| | 90% | 4.04 | 4.50 | 2.86 | 5.40 | 3.81 | 2.30 | 4.90 | 5.75 | 4.30 | 9.00 | 5.30 | 3.81 |
| | 99% | 1.49 | 2.24 | 2.03 | 2.44 | 4.39 | 2.52 | 1.59 | 2.33 | 2.21 | 2.87 | 5.26 | 3.32 |
| | 1% | 1.60 | 2.71 | 5.31 | 1.16 | 3.46 | 4.65 | 1.20 | 1.90 | 4.26 | 0.91 | 2.42 | 3.47 |
| | 10% | 4.75 | 1.20 | 1.55 | 3.46 | 1.53 | 1.82 | 3.97 | 1.39 | 1.48 | 2.93 | 1.56 | 1.57 |
| | 25% | 6.83 | 2.29 | 0.94 | 5.20 | 2.45 | 1.17 | 6.72 | 1.86 | 0.74 | 4.70 | 1.82 | 0.81 |
| | 50% | 7.18 | 1.72 | 1.31 | 6.65 | 1.41 | 1.31 | 11.77 | 1.04 | 0.90 | 7.50 | 0.82 | 0.76 |
| | 75% | 5.82 | 2.22 | 1.00 | 6.39 | 1.63 | 0.76 | 8.52 | 4.44 | 2.90 | 10.36 | 3.24 | 2.26 |
| | 90% | 4.00 | 4.50 | 2.83 | 5.22 | 3.71 | 2.24 | 4.89 | 5.77 | 4.30 | 8.92 | 5.29 | 3.79 |
| | 99% | 1.42 | 2.23 | 2.02 | 2.21 | 4.36 | 2.44 | 1.51 | 2.28 | 2.16 | 2.72 | 5.32 | 3.28 |
| | 1% | 1.59 | 2.73 | 5.35 | 1.11 | 3.50 | 4.59 | 1.21 | 1.93 | 4.41 | 0.92 | 2.60 | 3.68 |
| | 10% | 4.79 | 1.22 | 1.53 | 3.41 | 1.52 | 1.74 | 3.97 | 1.34 | 1.52 | 2.96 | 1.55 | 1.62 |
| | 25% | 6.80 | 2.32 | 0.96 | 5.16 | 2.48 | 1.20 | 6.62 | 1.82 | 0.74 | 4.68 | 1.79 | 0.82 |
| | 50% | 7.25 | 1.81 | 1.36 | 6.75 | 1.48 | 1.36 | 11.95 | 1.05 | 0.91 | 7.64 | 0.82 | 0.77 |
| | 75% | 5.73 | 2.15 | 0.95 | 6.35 | 1.59 | 0.73 | 8.58 | 4.45 | 2.91 | 10.35 | 3.24 | 2.25 |
| | 90% | 3.99 | 4.47 | 2.80 | 5.35 | 3.79 | 2.27 | 4.99 | 5.89 | 4.38 | 8.99 | 5.30 | 3.80 |
| | 99% | 1.36 | 2.15 | 1.94 | 2.10 | 4.15 | 2.34 | 1.51 | 2.27 | 2.16 | 2.69 | 5.28 | 3.30 |

| | Percentile | MAR | | | | | | MNAR | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | B | P | R | B | P | R | B | P | R | B | P | R |
| 500 | 1% | 1.89 | 2.24 | 0.55 | 1.54 | 5.69 | 4.89 | 2.06 | 2.24 | 0.64 | 1.60 | 5.12 | 4.48 |
| | 10% | 3.78 | 6.37 | 0.76 | 3.15 | 5.64 | 0.85 | 5.55 | 5.18 | 2.94 | 3.85 | 3.14 | 1.94 |
| | 25% | 5.10 | 4.51 | 0.84 | 4.43 | 3.31 | 0.81 | 9.24 | 2.49 | 5.84 | 6.46 | 1.34 | 4.36 |
| | 50% | 5.69 | 1.59 | 0.95 | 5.61 | 1.09 | 0.98 | 16.43 | 12.25 | 8.28 | 10.52 | 7.28 | 7.22 |
| | 75% | 5.26 | 4.97 | 1.10 | 5.74 | 3.95 | 1.12 | 9.51 | 12.82 | 7.82 | 14.68 | 12.23 | 8.59 |
| | 90% | 3.87 | 6.00 | 1.08 | 5.08 | 5.76 | 1.16 | 5.43 | 7.68 | 5.57 | 12.21 | 13.93 | 7.87 |
| | 99% | 1.59 | 2.29 | 0.76 | 2.77 | 6.10 | . | 1.64 | 2.22 | 1.86 | 3.53 | 10.36 | . |
| 1000 | 1% | 1.68 | 2.25 | 0.41 | 1.31 | 5.66 | 4.89 | 1.83 | 2.23 | 0.50 | 1.36 | 5.07 | 4.47 |
| | 10% | 3.66 | 6.40 | 0.60 | 3.01 | 5.57 | 0.70 | 5.45 | 5.18 | 2.92 | 3.82 | 3.16 | 1.93 |
| | 25% | 5.00 | 4.47 | 0.68 | 4.41 | 3.31 | 0.66 | 9.39 | 2.44 | 5.95 | 6.35 | 1.27 | 4.32 |
| | 50% | 5.59 | 1.30 | 0.74 | 5.39 | 0.88 | 0.75 | 15.77 | 11.78 | 7.95 | 10.05 | 6.96 | 6.92 |
| | 75% | 5.00 | 4.75 | 0.80 | 5.54 | 3.84 | 0.82 | 9.03 | 12.27 | 7.44 | 14.04 | 11.73 | 8.24 |
| | 90% | 3.89 | 6.17 | 0.80 | 4.98 | 5.79 | 0.86 | 5.19 | 7.38 | 5.35 | 11.46 | 13.16 | 7.44 |
| | 99% | 1.47 | 2.20 | 0.51 | 2.54 | 6.09 | . | 1.56 | 2.18 | 1.81 | 3.17 | 10.08 | . |
| 5000 | 1% | 1.47 | 2.26 | 0.26 | 1.13 | 5.71 | 4.97 | 1.64 | 2.24 | 0.33 | 1.16 | 5.04 | 4.49 |
| | 10% | 3.66 | 6.54 | 0.47 | 2.96 | 5.60 | 0.58 | 5.45 | 5.28 | 2.90 | 3.84 | 3.25 | 1.95 |
| | 25% | 4.87 | 4.34 | 0.53 | 4.38 | 3.32 | 0.53 | 9.78 | 2.42 | 6.20 | 6.62 | 1.28 | 4.51 |
| | 50% | 5.71 | 1.12 | 0.60 | 5.46 | 0.76 | 0.60 | 16.49 | 12.29 | 8.29 | 10.23 | 7.12 | 7.07 |
| | 75% | 5.05 | 4.70 | 0.58 | 5.56 | 3.85 | 0.60 | 9.13 | 12.46 | 7.51 | 13.81 | 11.56 | 8.09 |
| | 90% | 3.79 | 6.09 | 0.52 | 4.90 | 5.71 | 0.64 | 5.25 | 7.47 | 5.40 | 11.78 | 13.59 | 7.66 |
| | 99% | 1.44 | 2.24 | 0.31 | 2.28 | 5.95 | . | 1.59 | 2.25 | 1.87 | 3.10 | 10.40 | . |
| 10000 | 1% | 1.41 | 2.21 | 0.22 | 1.06 | 5.56 | 4.82 | 1.62 | 2.23 | 0.29 | 1.16 | 5.18 | 4.60 |
| | 10% | 3.42 | 6.11 | 0.41 | 2.80 | 5.38 | 0.54 | 5.51 | 5.35 | 2.92 | 3.78 | 3.22 | 1.92 |
| | 25% | 4.95 | 4.47 | 0.51 | 4.43 | 3.38 | 0.51 | 9.15 | 2.27 | 5.80 | 6.15 | 1.18 | 4.20 |
| | 50% | 5.38 | 1.02 | 0.53 | 5.29 | 0.72 | 0.55 | 16.06 | 12.01 | 8.09 | 9.96 | 6.93 | 6.88 |
| | 75% | 5.01 | 4.62 | 0.53 | 5.41 | 3.72 | 0.55 | 9.26 | 12.65 | 7.63 | 13.94 | 11.70 | 8.19 |
| | 90% | 3.77 | 6.02 | 0.46 | 4.86 | 5.64 | 0.58 | 5.27 | 7.50 | 5.43 | 11.71 | 13.53 | 7.65 |
| | 99% | 1.44 | 2.24 | 0.26 | 2.30 | 6.05 | . | 1.62 | 2.29 | 1.91 | 3.11 | 10.57 | . |

| | Percentile | MAR | | | | | | MNAR | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | B | P | R | B | P | R | B | P | R | B | P | R |
| 500 | 1% | 1.10 | 2.23 | 1.09 | 1.09 | 3.64 | 2.24 | 1.88 | 2.19 | 1.57 | 1.71 | 3.32 | 2.11 |
| | 10% | 2.49 | 6.03 | 0.99 | 2.08 | 6.51 | 0.99 | 4.67 | 4.88 | 2.56 | 3.39 | 3.78 | 1.87 |
| | 25% | 3.42 | 5.60 | 1.09 | 3.09 | 4.03 | 1.00 | 8.22 | 1.42 | 4.86 | 5.74 | 0.74 | 3.66 |
| | 50% | 4.14 | 2.08 | 1.54 | 3.90 | 1.18 | 1.38 | 14.34 | 11.38 | 7.55 | 8.81 | 5.85 | 5.96 |
| | 75% | 3.88 | 7.52 | 1.93 | 4.12 | 5.15 | 1.79 | 8.96 | 11.95 | 7.48 | 12.03 | 10.43 | 7.38 |
| | 90% | 2.79 | 6.76 | 1.88 | 3.36 | 7.31 | 1.69 | 5.26 | 7.63 | 5.48 | 10.17 | 12.05 | 7.00 |
| | 99% | 1.17 | 2.27 | 1.10 | 1.67 | 5.41 | 1.59 | 1.63 | 2.27 | 1.84 | 2.89 | 6.98 | 3.47 |
| 1000 | 1% | 0.85 | 2.34 | 0.95 | 0.75 | 3.88 | 2.44 | 1.70 | 2.28 | 1.46 | 1.37 | 3.35 | 2.14 |
| | 10% | 2.31 | 5.92 | 0.74 | 1.94 | 6.60 | 0.80 | 4.76 | 5.14 | 2.53 | 3.47 | 4.00 | 1.85 |
| | 25% | 3.44 | 5.89 | 0.91 | 3.09 | 4.19 | 0.81 | 8.12 | 1.23 | 4.79 | 5.61 | 0.62 | 3.57 |
| | 50% | 4.20 | 1.76 | 1.40 | 3.97 | 1.01 | 1.26 | 14.38 | 11.48 | 7.61 | 8.79 | 5.85 | 5.97 |
| | 75% | 3.96 | 7.73 | 1.87 | 4.21 | 5.33 | 1.74 | 8.77 | 11.81 | 7.38 | 11.96 | 10.45 | 7.39 |
| | 90% | 2.81 | 6.97 | 1.83 | 3.45 | 7.63 | 1.62 | 4.97 | 7.20 | 5.19 | 9.61 | 11.52 | 6.72 |
| | 99% | 1.01 | 2.24 | 0.97 | 1.38 | 5.32 | . | 1.58 | 2.22 | 1.77 | 2.73 | 6.89 | . |
| 5000 | 1% | 0.43 | 2.22 | 0.73 | 0.31 | 3.84 | 2.38 | 1.48 | 2.26 | 1.31 | 1.07 | 3.34 | 2.11 |
| | 10% | 2.28 | 6.29 | 0.57 | 1.84 | 6.82 | 0.63 | 4.95 | 5.50 | 2.51 | 3.56 | 4.28 | 1.80 |
| | 25% | 3.40 | 6.02 | 0.75 | 3.01 | 4.22 | 0.65 | 8.21 | 1.07 | 4.77 | 5.74 | 0.54 | 3.60 |
| | 50% | 4.18 | 1.52 | 1.32 | 3.93 | 0.87 | 1.19 | 13.88 | 11.10 | 7.32 | 8.68 | 5.75 | 5.87 |
| | 75% | 3.92 | 7.82 | 1.83 | 4.08 | 5.27 | 1.66 | 8.57 | 11.60 | 7.23 | 11.68 | 10.25 | 7.23 |
| | 90% | 2.82 | 7.16 | 1.83 | 3.41 | 7.81 | 1.60 | 5.24 | 7.62 | 5.48 | 10.21 | 12.31 | 7.17 |
| | 99% | 0.89 | 2.24 | 0.89 | 1.20 | 5.31 | . | 1.51 | 2.18 | 1.71 | 2.57 | 6.81 | . |
| 10000 | 1% | 0.38 | 2.26 | 0.71 | 0.24 | 3.68 | 2.31 | 1.43 | 2.22 | 1.27 | 1.04 | 3.33 | 2.12 |
| | 10% | 2.21 | 6.09 | 0.52 | 1.79 | 6.63 | 0.59 | 4.80 | 5.33 | 2.44 | 3.41 | 4.10 | 1.75 |
| | 25% | 3.21 | 5.65 | 0.70 | 2.92 | 4.05 | 0.61 | 8.17 | 1.03 | 4.76 | 5.54 | 0.50 | 3.48 |
| | 50% | 3.99 | 1.45 | 1.27 | 3.71 | 0.82 | 1.13 | 14.04 | 11.23 | 7.42 | 8.69 | 5.79 | 5.89 |
| | 75% | 3.82 | 7.63 | 1.79 | 4.01 | 5.19 | 1.64 | 9.04 | 12.24 | 7.63 | 12.08 | 10.60 | 7.49 |
| | 90% | 2.78 | 7.07 | 1.81 | 3.31 | 7.58 | 1.56 | 5.12 | 7.43 | 5.36 | 9.83 | 11.88 | 6.92 |
| | 99% | 0.92 | 2.36 | 0.92 | 1.19 | 5.51 | . | 1.58 | 2.29 | 1.79 | 2.63 | 7.01 | . |
4.2 CDF Estimation
Let us begin with the CDF estimators. Starting with the first model, under MAR, $\hat{F}_R$ drastically outperformed $\hat{F}_B$ even for small $n_B$, which implies superiority under ignorability and a correctly specified regression model. $\hat{F}_P$ still performed well overall, though with an RRMSE that was always larger than that of $\hat{F}_R$. Under MNAR, all estimators experienced significant reductions in efficiency, but $\hat{F}_R$ still performed the best overall, which suggests robustness to violations of ignorability for correctly specified regression models.
For the second model, under both MAR and MNAR, $\hat{F}_R$ tended to perform best for $t$ near the center of the distribution, but it was consistently outperformed by $\hat{F}_B$ at the distribution tails (i.e., at the 1st and 99th percentiles). Although $\hat{F}_P$ generally performed worse than $\hat{F}_R$ (with a noticeable exception at the 10th percentile), it tended to perform better than $\hat{F}_B$, suggesting gains in efficiency even with model misspecification [30].
Under the third model and MAR, $\hat{F}_R$ experienced impressive gains in efficiency compared to $\hat{F}_B$ despite severe model misspecification. $\hat{F}_P$, though, tended to perform worse than $\hat{F}_B$, which provides support for its endorsement if and only if the prediction error between $\hat{y}_i$ and $y_i$ is negligible for all $i$; otherwise, the consequences of $\hat{F}_P$’s ‘all-or-nothing’ approach will likely spoil any potential help from $\mathbf{X}$. Under MNAR, $\hat{F}_R$ still performed the best overall, but we noted severe efficiency losses that failed to dissipate with larger $n_B$. In fact, these losses were much greater in magnitude than those under MNAR for the first model, which may imply sensitivity to violations of ignorability if the regression model is severely misspecified.
Concluding with the fourth model, under MAR, $\hat{F}_R$ exhibited superior performance at all percentiles except the 1st, where it was outperformed by $\hat{F}_B$ for larger $n_B$. Meanwhile, $\hat{F}_P$ generally performed worse than both $\hat{F}_B$ and $\hat{F}_R$, with efficiency losses surpassing those reported under the third model. Conclusions under MNAR are virtually identical to those discussed for the third model.
In summary, these results imply several important points. First, $\hat{F}_R$ seems to be fairly robust to violations of ignorability if the regression model is correctly specified, and robust to regression model misspecification if ignorability is satisfied. When both assumptions are violated, $\hat{F}_R$ will still generally perform better than $\hat{F}_P$ and $\hat{F}_B$, albeit with noticeable reductions in efficiency. Second, $\hat{F}_P$ generally performs worse than $\hat{F}_R$, but may still perform optimally even under MNAR if the regression model is correctly specified. Finally, $\hat{F}_B$, despite its lackluster performance, still performed well at the tails of $Y$’s distribution, even when its sample size was small ( of ) and its missingness mechanism was MNAR. This is an encouraging result for researchers interested in estimating $F_N(t)$ at the distribution extrema without the use of a probability sample.
4.3 Quantile Estimation
In general, the performance of the three quantile estimators tended to match their CDF counterparts, with some noticeable exceptions. For one, , unlike and , was missing for nearly every under models and , even for MAR missingness. There is a reasonable explanation, however: if we were to compute and for every value of for and/or for , we will always encounter data points that serve either as the sample minimum (i.e., ) or the sample maximum (i.e., ), even if the CDF estimator itself is spurious. This is not always true for , though, because there may not exist a in that causes to be near zero or unity, especially if the regression model is misspecified. In these cases, it may be better to substitute with , especially for large , as it is more likely to produce an accurate estimator of than . Another exception involves the performance of , , and under MNAR for model , which seemed to vary widely depending on the percentile in question; specifically, performed the best for percentiles near distribution extrema (i.e., 1st and 99th), for those near the distribution centers (i.e., 25th and 50th), and elsewhere, despite generally outperforming its CDF siblings. We suspect that these conclusions result from model misspecification.
4.4 Variance Estimation
We conclude this section with results from a limited variance estimation study, which investigated several performance metrics regarding the variance of and under , models and , and . The simulation settings were otherwise identical to those previously described.
We calculated two types of variance estimators for both $\hat{F}_R$ and $\hat{t}_R$. The first type, denoted by $V_1$, is given in Theorem 2 for $\hat{F}_R$ and by Eq. (3.7) for $\hat{t}_R$. Note that, due to the SRS design of $A$, the variance estimator in Theorem 2 simplifies to
(4.2) |
where $s^2 = \frac{1}{n_A - 1}\sum_{i \in A}\left(\hat{G}(X_i) - \frac{1}{n_A}\sum_{k \in A}\hat{G}(X_k)\right)^2$. The second variance estimator, denoted by $V_2$, is an adaptation of [30]’s bootstrap variance procedure, which is summarized as follows:
1. Generate sets of replicate weights for sample $A$ and bootstrap samples with replacement (of size $n_B$) from sample $B$.
2. Use each weight-resample pair to calculate , , and .
3. Use both the replicate estimators (say, ) and the estimators calculated from the original sets (say, ) to calculate
(4.3) (4.4) where
(4.5) (4.6) and .
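The three steps above can be sketched as follows. This is a rough illustration under stated assumptions, not the procedure of [30]: replicate weights are generated by a simple with-replacement bootstrap of the probability sample (appropriate only for designs close to SRS), the variance is taken as the mean squared deviation of the replicates from the original estimate, and `bootstrap_variance` and its `estimator` callback are hypothetical names.

```python
import numpy as np

def bootstrap_variance(estimator, x_a, w_a, x_b, y_b, n_rep=200, seed=0):
    """Bootstrap variance of an estimator built from both samples.

    estimator : callable (x_a, w_a, x_b, y_b) -> float
    n_rep     : number of weight-resample pairs
    """
    rng = np.random.default_rng(seed)
    x_a, w_a = np.asarray(x_a, dtype=float), np.asarray(w_a, dtype=float)
    x_b, y_b = np.asarray(x_b, dtype=float), np.asarray(y_b, dtype=float)
    n_a, n_b = len(w_a), len(y_b)

    theta_hat = estimator(x_a, w_a, x_b, y_b)        # estimate from the original sets
    reps = np.empty(n_rep)
    for r in range(n_rep):
        # Step 1a: replicate weights for A (resampling counts times weights).
        counts = rng.multinomial(n_a, np.full(n_a, 1.0 / n_a))
        w_rep = w_a * counts
        # Step 1b: bootstrap sample of size n_b, with replacement, from B.
        b_idx = rng.integers(0, n_b, size=n_b)
        # Step 2: recompute the estimator on the weight-resample pair.
        reps[r] = estimator(x_a, w_rep, x_b[b_idx], y_b[b_idx])

    # Step 3: mean squared deviation of the replicates from the original estimate.
    return float(np.mean((reps - theta_hat) ** 2))
```

Passing the estimator as a callable lets the same routine serve the CDF and quantile estimators alike; for complex designs the replicate weights would need to respect the design, which this sketch does not attempt.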
We then compared both and to the ‘gold-standard’ Monte Carlo variance, defined as
(4.7) |
where may denote either or .
Finally, after independent simulation runs, we calculated the following performance statistics at the 1st, 10th, 25th, 50th, 75th, 90th, and 99th percentiles:
- (): The mean of () across all simulation runs.
- CR: The coverage rate, or the percentage of runs in which the respective population parameter was captured by the approximately 90% -confidence interval. This was calculated for as either $\hat{F}_R(t) \pm 1.645 \times \sqrt{V_1(\hat{F}_R(t))}$ or $\hat{F}_R(t) \pm 1.645 \times \sqrt{V_2(\hat{F}_R(t))}$, and similarly for using Eqs. (3.5) and (3.6) or (4.5) and (4.6).
- AL: The average length of the confidence intervals.
- : The mean estimated variance of (), multiplied by for or for (to enhance readability).
- RB: The percent absolute relative bias of relative to , $\mathrm{RB} = \frac{|\widehat{\mathrm{Var}}_{MC} - \bar{\widehat{\mathrm{Var}}}|}{\widehat{\mathrm{Var}}_{MC}} \times 100$, rounded to one decimal place.
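The performance statistics listed above can be computed from the per-run output of a simulation as in the following sketch. It assumes normal-theory intervals built from the square root of each variance estimate, and it takes the Monte Carlo variance to be the sample variance of the point estimates across runs (with an n - 1 divisor); the precise form of Eq. (4.7) is not reproduced here, so that choice is an assumption, as is the function name `performance_stats`.

```python
from math import sqrt
from statistics import fmean, variance

def performance_stats(estimates, var_estimates, truth, z=1.645):
    """Summarise one variance estimator across simulation runs.

    estimates      point estimates, one per run
    var_estimates  estimated variances, one per run
    truth          population parameter being estimated
    z              normal critical value (1.645 for a 90% interval)
    """
    bounds = [(e - z * sqrt(v), e + z * sqrt(v))
              for e, v in zip(estimates, var_estimates)]
    covered = [lo <= truth <= hi for lo, hi in bounds]
    var_mc = variance(estimates)      # assumed form of the Monte Carlo variance
    mean_var = fmean(var_estimates)
    return {
        "mean_estimate": fmean(estimates),
        "CR": 100 * sum(covered) / len(covered),
        "AL": fmean(hi - lo for lo, hi in bounds),
        "mean_var": mean_var,
        "RB": round(abs(var_mc - mean_var) / var_mc * 100, 1),
    }

# Toy usage with three runs and a constant variance estimate.
stats = performance_stats([0.4, 0.5, 0.6], [0.01, 0.01, 0.01], truth=0.5)
```

Calling the function once per variance estimator and percentile yields one row of a results table of the kind reported in this section.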
Results for are presented in Tables 4.4 and 4.4, whereas results for are presented in Tables 4.4 and 4.4.
| Sample size | Mechanism | Percentile | True value | Mean estimate | CR ($V_1$) | CR ($V_2$) | AL ($V_1$) | AL ($V_2$) | Mean variance ($V_1$) | Mean variance ($V_2$) | RB ($V_1$) | RB ($V_2$) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 500 | MAR | 1% | 0.010 | 0.010 | 83.8 | 86.4 | 0.010 | 0.011 | 0.010 | 0.011 | 15.5 | 5.7 |
|  |  | 10% | 0.100 | 0.099 | 85.2 | 88.0 | 0.035 | 0.038 | 0.113 | 0.131 | 23.0 | 10.7 |
|  |  | 25% | 0.250 | 0.250 | 85.8 | 87.4 | 0.052 | 0.056 | 0.253 | 0.286 | 18.1 | 7.2 |
|  |  | 50% | 0.500 | 0.499 | 86.4 | 86.6 | 0.061 | 0.064 | 0.346 | 0.382 | 14.7 | 5.8 |
|  |  | 75% | 0.750 | 0.748 | 86.8 | 90.0 | 0.052 | 0.056 | 0.255 | 0.288 | 15.2 | 4.2 |
|  |  | 90% | 0.900 | 0.901 | 84.8 | 87.8 | 0.035 | 0.038 | 0.113 | 0.131 | 17.5 | 4.5 |
|  |  | 99% | 0.990 | 0.990 | 81.6 | 83.0 | 0.010 | 0.010 | 0.010 | 0.011 | 11.7 | 1.5 |
|  | MNAR | 1% | 0.010 | 0.010 | 82.8 | 86.2 | 0.010 | 0.011 | 0.010 | 0.011 | 15.1 | 5.1 |
|  |  | 10% | 0.100 | 0.100 | 85.0 | 87.4 | 0.035 | 0.037 | 0.113 | 0.131 | 22.9 | 11.0 |
|  |  | 25% | 0.250 | 0.250 | 84.8 | 88.6 | 0.052 | 0.056 | 0.253 | 0.286 | 18.0 | 7.4 |
|  |  | 50% | 0.500 | 0.499 | 86.4 | 88.2 | 0.061 | 0.064 | 0.347 | 0.382 | 14.7 | 5.8 |
|  |  | 75% | 0.750 | 0.748 | 87.2 | 88.8 | 0.053 | 0.056 | 0.255 | 0.288 | 15.1 | 4.1 |
|  |  | 90% | 0.900 | 0.900 | 86.0 | 88.0 | 0.035 | 0.038 | 0.113 | 0.131 | 17.4 | 4.5 |
|  |  | 99% | 0.990 | 0.990 | 83.0 | 84.8 | 0.010 | 0.010 | 0.010 | 0.011 | 11.3 | 1.2 |
| 10000 | MAR | 1% | 0.010 | 0.010 | 86.8 | 86.4 | 0.010 | 0.010 | 0.010 | 0.010 | 5.1 | 4.5 |
|  |  | 10% | 0.100 | 0.100 | 87.6 | 88.2 | 0.035 | 0.035 | 0.113 | 0.113 | 8.8 | 8.9 |
|  |  | 25% | 0.250 | 0.250 | 88.6 | 88.6 | 0.052 | 0.052 | 0.253 | 0.253 | 7.2 | 7.2 |
|  |  | 50% | 0.500 | 0.500 | 89.0 | 89.0 | 0.061 | 0.061 | 0.346 | 0.346 | 6.5 | 6.5 |
|  |  | 75% | 0.750 | 0.749 | 88.8 | 89.4 | 0.052 | 0.053 | 0.254 | 0.255 | 3.6 | 3.1 |
|  |  | 90% | 0.900 | 0.901 | 87.6 | 87.8 | 0.035 | 0.035 | 0.113 | 0.114 | 6.1 | 5.4 |
|  |  | 99% | 0.990 | 0.990 | 82.8 | 82.4 | 0.010 | 0.010 | 0.010 | 0.010 | 7.2 | 6.5 |
|  | MNAR | 1% | 0.010 | 0.010 | 86.8 | 86.8 | 0.010 | 0.010 | 0.010 | 0.010 | 5.1 | 4.8 |
|  |  | 10% | 0.100 | 0.100 | 88.2 | 88.2 | 0.035 | 0.035 | 0.113 | 0.114 | 8.9 | 8.1 |
|  |  | 25% | 0.250 | 0.250 | 88.6 | 88.4 | 0.052 | 0.052 | 0.253 | 0.254 | 7.2 | 6.5 |
|  |  | 50% | 0.500 | 0.500 | 88.6 | 89.4 | 0.061 | 0.061 | 0.346 | 0.347 | 6.5 | 6.0 |
|  |  | 75% | 0.750 | 0.749 | 88.2 | 88.4 | 0.052 | 0.052 | 0.254 | 0.255 | 3.6 | 3.4 |
|  |  | 90% | 0.900 | 0.901 | 87.8 | 87.6 | 0.035 | 0.035 | 0.113 | 0.114 | 6.1 | 5.5 |
|  |  | 99% | 0.990 | 0.990 | 82.4 | 82.4 | 0.010 | 0.010 | 0.010 | 0.010 | 7.2 | 6.7 |
| Sample size | Mechanism | Percentile | True value | Mean estimate | CR ($V_1$) | CR ($V_2$) | AL ($V_1$) | AL ($V_2$) | Mean variance ($V_1$) | Mean variance ($V_2$) | RB ($V_1$) | RB ($V_2$) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 500 | MAR | 1% | 0.010 | 0.010 | 43.6 | 88.0 | 0.003 | 0.008 | 0.001 | 0.006 | 85.9 | 2.1 |
|  |  | 10% | 0.100 | 0.098 | 59.4 | 89.0 | 0.018 | 0.035 | 0.030 | 0.114 | 72.3 | 4.8 |
|  |  | 25% | 0.250 | 0.250 | 67.8 | 89.8 | 0.032 | 0.053 | 0.095 | 0.263 | 61.5 | 6.7 |
|  |  | 50% | 0.500 | 0.499 | 71.6 | 91.8 | 0.040 | 0.061 | 0.148 | 0.341 | 51.9 | 10.8 |
|  |  | 75% | 0.750 | 0.748 | 67.6 | 92.0 | 0.033 | 0.053 | 0.102 | 0.261 | 57.3 | 9.3 |
|  |  | 90% | 0.900 | 0.899 | 63.2 | 90.4 | 0.020 | 0.036 | 0.036 | 0.120 | 68.8 | 4.1 |
|  |  | 99% | 0.990 | 0.990 | 48.4 | 88.8 | 0.003 | 0.008 | 0.001 | 0.007 | 84.0 | 0.2 |
|  | MNAR | 1% | 0.010 | 0.010 | 43.6 | 87.4 | 0.003 | 0.008 | 0.001 | 0.006 | 86.1 | 1.5 |
|  |  | 10% | 0.100 | 0.097 | 58.8 | 89.4 | 0.018 | 0.035 | 0.030 | 0.114 | 72.4 | 5.1 |
|  |  | 25% | 0.250 | 0.249 | 70.4 | 92.2 | 0.032 | 0.053 | 0.095 | 0.264 | 61.6 | 6.9 |
|  |  | 50% | 0.500 | 0.498 | 73.8 | 93.6 | 0.040 | 0.061 | 0.148 | 0.340 | 52.0 | 10.6 |
|  |  | 75% | 0.750 | 0.748 | 70.4 | 91.4 | 0.033 | 0.053 | 0.102 | 0.260 | 57.3 | 8.8 |
|  |  | 90% | 0.900 | 0.899 | 62.0 | 90.6 | 0.020 | 0.036 | 0.036 | 0.120 | 68.9 | 3.6 |
|  |  | 99% | 0.990 | 0.990 | 49.2 | 86.2 | 0.003 | 0.008 | 0.001 | 0.007 | 84.0 | 0.3 |
| 10000 | MAR | 1% | 0.010 | 0.010 | 83.8 | 88.6 | 0.003 | 0.003 | 0.001 | 0.001 | 22.1 | 4.5 |
|  |  | 10% | 0.100 | 0.098 | 84.2 | 86.8 | 0.018 | 0.019 | 0.029 | 0.033 | 5.5 | 7.8 |
|  |  | 25% | 0.250 | 0.250 | 91.2 | 91.8 | 0.032 | 0.033 | 0.093 | 0.102 | 0.7 | 9.7 |
|  |  | 50% | 0.500 | 0.499 | 89.8 | 91.2 | 0.040 | 0.041 | 0.146 | 0.155 | 5.1 | 12.1 |
|  |  | 75% | 0.750 | 0.748 | 90.4 | 91.0 | 0.033 | 0.034 | 0.100 | 0.109 | 5.9 | 14.5 |
|  |  | 90% | 0.900 | 0.899 | 89.6 | 91.4 | 0.019 | 0.021 | 0.035 | 0.039 | 2.0 | 14.6 |
|  |  | 99% | 0.990 | 0.990 | 87.8 | 91.4 | 0.003 | 0.004 | 0.001 | 0.001 | 12.8 | 12.1 |
|  | MNAR | 1% | 0.010 | 0.010 | 83.6 | 89.8 | 0.003 | 0.003 | 0.001 | 0.001 | 22.2 | 4.4 |
|  |  | 10% | 0.100 | 0.098 | 86.2 | 88.0 | 0.018 | 0.019 | 0.029 | 0.033 | 5.6 | 7.7 |
|  |  | 25% | 0.250 | 0.250 | 91.6 | 92.2 | 0.032 | 0.033 | 0.093 | 0.101 | 0.7 | 9.5 |
|  |  | 50% | 0.500 | 0.499 | 90.0 | 91.6 | 0.040 | 0.041 | 0.145 | 0.155 | 5.0 | 11.7 |
|  |  | 75% | 0.750 | 0.748 | 90.2 | 92.0 | 0.033 | 0.034 | 0.100 | 0.108 | 5.9 | 14.0 |
|  |  | 90% | 0.900 | 0.899 | 90.8 | 92.6 | 0.019 | 0.021 | 0.035 | 0.039 | 2.1 | 14.3 |
|  |  | 99% | 0.990 | 0.990 | 86.4 | 91.2 | 0.003 | 0.004 | 0.001 | 0.001 | 12.9 | 11.7 |
| Sample size | Mechanism | Percentile | True value | Mean estimate | CR ($V_1$) | CR ($V_2$) | AL ($V_1$) | AL ($V_2$) | Mean variance ($V_1$) | Mean variance ($V_2$) | RB ($V_1$) | RB ($V_2$) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 500 | MAR | 1% | 1.353 | 1.583 | 71.8 | 75.4 | 1.140 | 1.228 | 1.497 | 1.718 | 13.6 | 0.9 |
|  |  | 10% | 4.461 | 4.506 | 84.4 | 85.0 | 0.596 | 0.609 | 0.334 | 0.350 | 23.3 | 19.7 |
|  |  | 25% | 6.277 | 6.298 | 85.2 | 85.2 | 0.493 | 0.498 | 0.227 | 0.231 | 17.3 | 15.7 |
|  |  | 50% | 8.289 | 8.307 | 85.2 | 85.0 | 0.461 | 0.464 | 0.198 | 0.200 | 13.4 | 12.3 |
|  |  | 75% | 10.298 | 10.331 | 86.2 | 87.6 | 0.497 | 0.502 | 0.230 | 0.235 | 15.9 | 14.1 |
|  |  | 90% | 12.146 | 12.167 | 83.6 | 84.8 | 0.606 | 0.619 | 0.346 | 0.362 | 17.6 | 13.8 |
|  |  | 99% | 15.291 | 15.597 | 67.6 | 70.0 | 1.396 | 1.509 | 2.310 | 2.652 | 26.9 | 16.1 |
|  | MNAR | 1% | 1.353 | 1.579 | 72.8 | 75.4 | 1.149 | 1.246 | 1.508 | 1.780 | 12.1 | 3.7 |
|  |  | 10% | 4.461 | 4.505 | 84.6 | 85.0 | 0.600 | 0.609 | 0.338 | 0.350 | 19.7 | 16.9 |
|  |  | 25% | 6.277 | 6.298 | 84.6 | 86.0 | 0.491 | 0.497 | 0.224 | 0.230 | 16.9 | 14.8 |
|  |  | 50% | 8.289 | 8.308 | 85.4 | 85.6 | 0.461 | 0.464 | 0.198 | 0.201 | 13.3 | 12.0 |
|  |  | 75% | 10.298 | 10.332 | 86.4 | 86.0 | 0.496 | 0.502 | 0.230 | 0.235 | 14.8 | 12.9 |
|  |  | 90% | 12.146 | 12.169 | 84.4 | 85.2 | 0.609 | 0.620 | 0.350 | 0.362 | 15.1 | 12.2 |
|  |  | 99% | 15.291 | 15.590 | 68.1 | 69.3 | 1.419 | 1.502 | 2.385 | 2.647 | 23.5 | 15.1 |
| 10000 | MAR | 1% | 1.353 | 1.572 | 73.4 | 74.2 | 1.140 | 1.162 | 1.485 | 1.554 | 4.8 | 0.3 |
|  |  | 10% | 4.461 | 4.501 | 87.4 | 88.0 | 0.592 | 0.596 | 0.329 | 0.334 | 11.5 | 10.2 |
|  |  | 25% | 6.277 | 6.292 | 89.0 | 88.8 | 0.491 | 0.491 | 0.225 | 0.225 | 6.3 | 6.3 |
|  |  | 50% | 8.289 | 8.304 | 88.2 | 89.0 | 0.461 | 0.461 | 0.197 | 0.198 | 5.2 | 5.1 |
|  |  | 75% | 10.298 | 10.329 | 88.0 | 88.6 | 0.497 | 0.498 | 0.230 | 0.231 | 2.2 | 1.7 |
|  |  | 90% | 12.146 | 12.168 | 88.2 | 88.4 | 0.608 | 0.611 | 0.347 | 0.352 | 7.4 | 6.2 |
|  |  | 99% | 15.291 | 15.583 | 66.6 | 67.6 | 1.425 | 1.460 | 2.412 | 2.492 | 16.4 | 13.6 |
|  | MNAR | 1% | 1.353 | 1.572 | 73.2 | 74.2 | 1.137 | 1.163 | 1.482 | 1.545 | 5.2 | 1.1 |
|  |  | 10% | 4.461 | 4.502 | 87.4 | 87.6 | 0.594 | 0.600 | 0.332 | 0.339 | 11.4 | 9.5 |
|  |  | 25% | 6.277 | 6.292 | 88.6 | 88.6 | 0.492 | 0.494 | 0.225 | 0.227 | 6.0 | 5.1 |
|  |  | 50% | 8.289 | 8.304 | 89.0 | 89.2 | 0.460 | 0.461 | 0.196 | 0.198 | 6.5 | 6.0 |
|  |  | 75% | 10.298 | 10.328 | 88.4 | 88.4 | 0.496 | 0.497 | 0.229 | 0.230 | 3.6 | 3.2 |
|  |  | 90% | 12.146 | 12.167 | 89.0 | 89.0 | 0.607 | 0.611 | 0.346 | 0.351 | 6.9 | 5.5 |
|  |  | 99% | 15.291 | 15.580 | 66.2 | 67.0 | 1.422 | 1.454 | 2.413 | 2.479 | 15.3 | 13.0 |
| Sample size | Mechanism | Percentile | True value | Mean estimate | CR ($V_1$) | CR ($V_2$) | AL ($V_1$) | AL ($V_2$) | Mean variance ($V_1$) | Mean variance ($V_2$) | RB ($V_1$) | RB ($V_2$) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 500 | MAR | 1% | -3.919 | -2.937 | . | . | . | . | . | . | . | . |
|  |  | 10% | -2.688 | -2.651 | 53.8 | 60.8 | 0.123 | 0.145 | 0.015 | 0.021 | 73.7 | 63.2 |
|  |  | 25% | -1.938 | -1.930 | 65.8 | 68.4 | 0.122 | 0.127 | 0.014 | 0.015 | 62.1 | 58.9 |
|  |  | 50% | -1.122 | -1.114 | 72.2 | 72.0 | 0.122 | 0.123 | 0.014 | 0.014 | 52.6 | 51.4 |
|  |  | 75% | -0.315 | -0.302 | 67.0 | 69.8 | 0.124 | 0.128 | 0.014 | 0.015 | 57.9 | 54.8 |
|  |  | 90% | 0.396 | 0.419 | 60.5 | 69.1 | 0.134 | 0.166 | 0.018 | 0.027 | 66.1 | 48.2 |
|  |  | 99% | 1.566 | . | . | . | . | . | . | . | . | . |
|  | MNAR | 1% | -3.919 | -2.933 | . | . | . | . | . | . | . | . |
|  |  | 10% | -2.688 | -2.648 | 53.8 | 61.0 | 0.123 | 0.147 | 0.015 | 0.021 | 73.5 | 62.2 |
|  |  | 25% | -1.938 | -1.929 | 69.0 | 71.4 | 0.122 | 0.127 | 0.014 | 0.015 | 61.1 | 58.1 |
|  |  | 50% | -1.122 | -1.113 | 73.6 | 74.6 | 0.122 | 0.124 | 0.014 | 0.014 | 50.2 | 48.3 |
|  |  | 75% | -0.315 | -0.301 | 69.4 | 71.8 | 0.125 | 0.130 | 0.014 | 0.016 | 56.7 | 52.8 |
|  |  | 90% | 0.396 | 0.423 | 61.2 | 71.0 | 0.138 | 0.170 | 0.019 | 0.029 | 66.0 | 47.9 |
|  |  | 99% | 1.566 | . | . | . | . | . | . | . | . | . |
| 10000 | MAR | 1% | -3.919 | -2.911 | . | . | . | . | . | . | . | . |
|  |  | 10% | -2.688 | -2.656 | 73.0 | 77.8 | 0.120 | 0.133 | 0.014 | 0.017 | 24.5 | 6.7 |
|  |  | 25% | -1.938 | -1.932 | 89.0 | 90.4 | 0.121 | 0.124 | 0.014 | 0.014 | 4.0 | 0.5 |
|  |  | 50% | -1.122 | -1.115 | 89.4 | 89.4 | 0.121 | 0.122 | 0.014 | 0.014 | 4.9 | 7.2 |
|  |  | 75% | -0.315 | -0.303 | 88.6 | 89.2 | 0.124 | 0.127 | 0.014 | 0.015 | 5.7 | 10.5 |
|  |  | 90% | 0.396 | 0.420 | 78.8 | 84.2 | 0.132 | 0.147 | 0.017 | 0.021 | 15.2 | 7.6 |
|  |  | 99% | 1.566 | . | . | . | . | . | . | . | . | . |
|  | MNAR | 1% | -3.919 | -2.910 | . | . | . | . | . | . | . | . |
|  |  | 10% | -2.688 | -2.655 | 73.0 | 77.4 | 0.121 | 0.133 | 0.014 | 0.017 | 20.1 | 1.9 |
|  |  | 25% | -1.938 | -1.931 | 88.6 | 89.2 | 0.121 | 0.124 | 0.014 | 0.014 | 1.1 | 3.3 |
|  |  | 50% | -1.122 | -1.114 | 88.6 | 89.2 | 0.121 | 0.122 | 0.014 | 0.014 | 7.9 | 10.1 |
|  |  | 75% | -0.315 | -0.303 | 88.8 | 89.4 | 0.124 | 0.126 | 0.014 | 0.015 | 9.2 | 13.3 |
|  |  | 90% | 0.396 | 0.421 | 80.6 | 85.7 | 0.132 | 0.148 | 0.017 | 0.022 | 14.8 | 9.3 |
|  |  | 99% | 1.566 | . | . | . | . | . | . | . | . | . |
Starting with , under model and , we noted that the coverage rates for when using were quite close to the nominal confidence level for both MAR and MNAR missingness mechanisms, albeit lower at the distribution extrema due to smaller variance. The same pattern was noticed for , alongside rates that were closer to the nominal 90%; this was largely due to less-biased , suggesting that [30]’s bootstrap variance estimator outperforms its asymptotic competitor when is rather small. Yet, as increased from 0.5% to 10% of , the differences between the two variance estimators became negligible, producing coverage rates that were consistently close to the nominal level for all . Moving to model , under , we noted a severely increased degree of bias for the asymptotic variance estimators, resulting in liberal confidence intervals that excluded at disproportionate rates. The bootstrap variance estimator continued to perform well, though, suggesting robustness to violations of ignorability and model misspecification. For large , the differences again dissipated to practically negligible magnitudes, but tended to have less bias than .
Let us now transition to quantile estimation. Starting with model , we note similar trends as before, albeit with several exceptions. For one, although the RB was smaller for the bootstrap variance estimator than for its asymptotic counterpart, it was still larger here than under CDF estimation, resulting in coverage rates that were somewhat lower on average and markedly lower at the distribution extrema. These trends were more pronounced for model and but became much milder with larger . Still, most performance statistics remained undefined at the 1st and 99th percentiles, as only one run out of produced an estimator of (see Section 4.3 for a discussion of this phenomenon). In aggregate, these results imply that both variance estimators of are (generally) robust to violations of ignorability and model misspecification, but only if is large. If is small, the two variance estimators will indeed be similar, but will likely be severely biased as well.
5 Real Data Example
In this section, we illustrate the performance of and using data from the National Health and Nutrition Examination Survey (NHANES) published by the [23]. NHANES is a multistage stratified random sample collected to obtain and disseminate detailed information on the health and nutritional status of a nationally representative sample of non-institutionalized individuals in the United States. Our goal is to estimate the distribution function of total cholesterol (TCHL, , in mg/dL) in the population of U.S. adults using biological sex (), age (), glycohemoglobin (, in %), triglycerides (, in mg/dL), direct high-density lipoprotein cholesterol (HDL, , in mg/dL), body mass index (BMI, , in kg/m$^2$), and pulse (). We used the subsample of individuals with available data in the 2015-2016 NHANES cohort as the probability sample ( 2,474), and those with available data in the 2017-2020 cohort for the nonprobability sample ( 3,770). Interestingly, observations in that were obtained during the 2019-2020 time period were deemed by the CDC to be nonrepresentative of the population and had to be combined with the previous cohort to ensure respondent confidentiality. Absolute relative biases () of our estimators relative to and are presented in Table 5.3, whereas bootstrap estimators of the variance are presented in Tables 5.4 and 5.5. To accommodate the lack of information regarding the true regression function and ’s missingness mechanism, we also report various statistics concerning the estimated coefficients (Table 5.1), model fit (Table 5.2), and model diagnostics (Figure 1).
Covariate | Coefficient (probability sample, survey-weighted) | Coefficient (nonprobability sample) | SE | LL | UL | t-statistic | p-value | Captured |
---|---|---|---|---|---|---|---|---|
Intercept | 87.62 | 93.35 | 5.62 | 84.10 | 102.60 | 16.60 | 0.00 | Y |
Sex: Male | 0.44 | 2.12 | 1.24 | 0.08 | 4.15 | 1.71 | 0.09 | Y |
Age | 0.28 | 0.25 | 0.03 | 0.19 | 0.30 | 7.79 | 0.00 | Y |
Glycohemoglobin | -1.71 | -1.03 | 0.57 | -1.97 | -0.09 | -1.80 | 0.07 | Y |
Triglycerides | 0.27 | 0.20 | 0.01 | 0.18 | 0.21 | 28.31 | 0.00 | N |
HDL | 0.93 | 1.00 | 0.04 | 0.93 | 1.07 | 22.57 | 0.00 | Y |
BMI | 0.43 | 0.25 | 0.09 | 0.11 | 0.39 | 2.90 | 0.00 | N |
Pulse | 0.04 | -0.02 | 0.05 | -0.11 | 0.06 | -0.43 | 0.67 | Y |
R$^2$ | Adjusted R$^2$ | F-statistic | Residual SE | df (model) | df (residual) |
---|---|---|---|---|---|
0.26 | 0.26 | 186.08 | 35.55 | 7 | 3762 |

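Percentile | Reference CDF value | Reference quantile (mg/dL) | RB, CDF: B | RB, CDF: P | RB, CDF: R | RB, quantile: B | RB, quantile: P | RB, quantile: R |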
---|---|---|---|---|---|---|---|---|
1% | 0.01 | 107.00 | 99.00 | 100.00 | 52.49 | 6.54 | 38.71 | 25.51 |
10% | 0.10 | 138.00 | 50.99 | 99.49 | 17.67 | 5.80 | 15.87 | 0.37 |
25% | 0.25 | 158.00 | 30.34 | 70.55 | 10.17 | 5.06 | 7.07 | 2.03 |
50% | 0.51 | 184.00 | 16.59 | 16.92 | 6.95 | 4.89 | 2.38 | 2.40 |
75% | 0.75 | 212.00 | 7.53 | 21.61 | 4.53 | 3.77 | 8.21 | 2.41 |
90% | 0.90 | 244.00 | 3.56 | 9.85 | 2.68 | 4.51 | 14.24 | 3.60 |
99% | 0.99 | 295.00 | 0.12 | 0.77 | 0.04 | 0.68 | 18.69 | 0.50 |
Percentile | Reference CDF value | Estimate | LL | UL | Bootstrap variance (scaled) | Captured |
---|---|---|---|---|---|---|
1% | 0.010 | 0.015 | 0.008 | 0.023 | 0.020 | Y |
10% | 0.102 | 0.121 | 0.105 | 0.136 | 0.089 | N |
25% | 0.254 | 0.280 | 0.258 | 0.302 | 0.182 | N |
50% | 0.510 | 0.546 | 0.519 | 0.572 | 0.253 | N |
75% | 0.752 | 0.786 | 0.763 | 0.809 | 0.197 | N |
90% | 0.904 | 0.928 | 0.915 | 0.941 | 0.062 | N |
99% | 0.990 | 0.990 | 0.985 | 0.994 | 0.008 | Y |
Percentile | Reference quantile (mg/dL) | Estimate | LL | UL | Bootstrap variance | Captured |
---|---|---|---|---|---|---|
1% | 107.00 | 134.29 | 115.23 | 153.35 | 14.78 | N |
10% | 138.00 | 137.49 | 118.20 | 156.77 | 4.23 | Y |
25% | 158.00 | 154.79 | 134.33 | 175.25 | 0.62 | Y |
50% | 184.00 | 179.58 | 157.54 | 201.63 | 0.86 | Y |
75% | 212.00 | 206.89 | 183.23 | 230.54 | 1.61 | Y |
90% | 244.00 | 235.22 | 209.99 | 260.44 | 3.18 | Y |
99% | 295.00 | 296.49 | 268.17 | 324.81 | 549.83 | Y |
Before we address the relative biases and bootstrap variances, let us briefly discuss various characteristics associated with the regression model. Starting with Table 5.1, we note that six of the eight (75%) survey-weighted coefficients from were captured in their respective 90% confidence intervals (which were constructed from a regression model fit on ). Interestingly, the coefficient associated with pulse () had a very high p-value of 0.67, which may suggest that one’s pulse, after controlling for the remaining six covariates, is not a significant predictor of TCHL. A similar argument can be extended to biological sex, given that biological males in sample had an average increase of 0.44 in TCHL compared to biological females, which is practically insignificant. Of course, these interpretations are likely challenged by severe model misspecification, given the small in Table 5.2 and the extreme nonlinearity, heteroscedasticity, and outliers in Figure 1.
Yet, despite the (likely) misspecified regression model, we noticed that the relative biases of were notably superior to the other two CDF estimators. For the quantile estimators, performed the best overall, but it was noticeably outperformed at the 1st percentile and slightly outperformed at the 50th; the latter case is likely a coincidence, whereas the former case is most likely due to ’s sensitivity to distribution extrema (see Section 4.3). In aggregate, these results match those presented previously from our simulations, as they imply superiority of compared to and even when the regression function used is misspecified.
We conclude with a brief discussion of the performance of the bootstrap variance estimators under the NHANES data. Starting with , in Table 5.4, was excluded from the intervals attributed to the 10th, 25th, 50th, 75th, and 90th percentiles; yet, in Table 5.5, was included in all intervals except at the 1st percentile. It should be noted that these intervals are not meant for estimating or ; rather, they estimate or . Therefore, these results should be read cautiously.
6 Conclusion
In this paper, we considered the problem of estimating the finite population distribution function using monotone missing data from probability and nonprobability samples. We proposed a residual-based CDF estimator, , that substitutes in , the Horvitz-Thompson estimator of the finite population CDF, with , the empirical CDF of the estimated regression residuals from sample at . Under ignorability of the missingness mechanism of sample , a correctly specified model, and minor regularity conditions, both and its corresponding quantile estimator are shown to be asymptotically unbiased for their respective population parameters. Empirical results support this claim, with the added benefit of robustness to violations of ignorability if the regression model holds, and robustness to model misspecification if ignorability holds. When both assumptions are violated, performed better than , the naïve empirical CDF of , and , the plug-in CDF estimator, albeit with noticeable reductions in efficiency. We also defined and contrasted two variance estimators, and demonstrated that the bootstrap estimator tends to produce smaller relative bias than the asymptotic-based variance estimator. In future research, we aim to extend this work to the realm of nonparametric regression, which can produce CDF and quantile estimators that are more robust to model misspecification.
Declarations
Data Availability Statement
The data that support the findings of this study are publicly available from the United States Centers for Disease Control and Prevention (CDC) at https://wwwn.cdc.gov/nchs/nhanes/Default.aspx.
Funding
This work was funded by the Chancellor’s Distinguished Fellowship, a Title III HBGI grant from the U.S. Department of Education.
References
- [23] Centers for Disease Control and Prevention (CDC) “NHANES - National Health and Nutrition Examination Survey” https://www.cdc.gov/nchs/nhanes/index.htm (visited: 2023-10-11) National Center for Health Statistics (NCHS), 2015-2020
- [24] Richard L Chambers and R Dunstan “Estimating distribution functions from survey data” In Biometrika 73.3 Oxford University Press, 1986, pp. 597–604
- [25] Sixia Chen, Shu Yang and Jae Kwang Kim “Nonparametric mass imputation for data integration” In Journal of Survey Statistics and Methodology 10.1 Oxford University Press, 2022, pp. 1–24
- [26] Carol A Francisco and Wayne A Fuller “Quantile estimation with a complex survey design” In The Annals of statistics JSTOR, 1991, pp. 454–469
- [27] Daniel G Horvitz and Donovan J Thompson “A generalization of sampling without replacement from a finite universe” In Journal of the American Statistical Association 47.260 Taylor & Francis, 1952, pp. 663–685
- [28] Cary T Isaki and Wayne A Fuller “Survey design under the regression superpopulation model” In Journal of the American Statistical Association 77.377 Taylor & Francis, 1982, pp. 89–96
- [29] Alicia A Johnson, F Jay Breidt and Jean D Opsomer “Estimating distribution functions from survey data using nonparametric regression” In Journal of Statistical Theory and Practice 2 Springer, 2008, pp. 419–431
- [30] Jae Kwang Kim, Seho Park, Yilin Chen and Changbao Wu “Combining non-probability and probability survey samples through mass imputation” In Journal of the Royal Statistical Society Series A: Statistics in Society 184.3 Oxford University Press, 2021, pp. 941–963
- [31] JG Kovar, JNK Rao and CFJ Wu “Bootstrap and other methods to measure errors in survey estimates” In Canadian Journal of Statistics 16.S1 Wiley Online Library, 1988, pp. 25–45
- [32] Erich Leo Lehmann “Elements of large-sample theory” New York, NY: Springer, 1999
- [33] Roderick JA Little and Donald B Rubin “Statistical analysis with missing data” Hoboken, New Jersey: John Wiley & Sons, 2019
- [34] Thomas Lumley “Analysis of Complex Survey Samples” R package version 2.2 In Journal of Statistical Software 9.1, 2004, pp. 1–19
- [35] Mateus Maia, Arthur R Azevedo and Anderson Ara “Predictive comparison between random machines and random forests” In Journal of Data Science 19.4 School of Statistics, Renmin University of China, 2021, pp. 593–614
- [36] Ronald H Randles “On the asymptotic normality of statistics with estimated parameters” In The Annals of Statistics JSTOR, 1982, pp. 462–474
- [37] JNK Rao, JG Kovar and HJ Mantel “On estimating distribution functions and quantiles from survey data using auxiliary information” In Biometrika JSTOR, 1990, pp. 365–375
- [38] Marie-Hélène Roy and Denis Larocque “Robustness of random forests for regression” In Journal of Nonparametric Statistics 24.4 Taylor & Francis, 2012, pp. 993–1006
- [39] Donald B Rubin “Inference and missing data” In Biometrika 63.3 Oxford University Press, 1976, pp. 581–592
- [40] Carl-Erik Särndal, Bengt Swensson and Jan Wretman “Model Assisted Survey Sampling” New York, NY: Springer, 2003
- [41] Erwan Scornet “Random forests and kernel methods” In IEEE Transactions on Information Theory 62.3 IEEE, 2016, pp. 1485–1500
- [42] Anastasios A Tsiatis “Semiparametric theory and missing data” Springer, 2006
- [43] Jianqiang C Wang and Jean D Opsomer “On asymptotic normality and variance estimation for nondifferentiable survey estimators” In Biometrika 98.1 Oxford University Press, 2011, pp. 91–106
- [44] Ralph S Woodruff “Confidence intervals for medians and other position measures” In Journal of the American Statistical Association 47.260 Taylor & Francis, 1952, pp. 635–646