
Survey Data Integration for Distribution Function Estimation

Jeremy R.A. Flood and Sayed A. Mostafa; Department of Mathematics & Statistics, North Carolina A&T State University, Greensboro, NC, USA. Email: [email protected].
Abstract

Integration of probabilistic and non-probabilistic samples for the estimation of finite population totals (or means) has recently received considerable attention in the field of survey sampling; yet, to the best of our knowledge, this framework has not been extended to cumulative distribution function (CDF) estimation. To address this gap, we propose a novel CDF estimator that integrates data from probability samples with data from (potentially big) nonprobability samples. Assuming that a set of shared covariates is observed in both samples, while the response variable is observed only in the latter, the proposed estimator uses a survey-weighted empirical CDF of regression residuals trained on the convenience sample to estimate the CDF of the response variable. Under some regularity conditions, we show that our CDF estimator is both design-consistent for the finite population CDF and asymptotically normally distributed. Additionally, we define and study a quantile estimator based on the proposed CDF estimator. Furthermore, we use both the bootstrap and asymptotic formulae to estimate their respective sampling variances. Our empirical results show that the proposed CDF estimator is robust to model misspecification under ignorability, and robust to violations of ignorability under correct model specification. When both assumptions are violated, our residual-based CDF estimator still outperforms its ‘plug-in’ mass imputation and naïve siblings, albeit with noted decreases in efficiency.

Keywords: Data integration · Probability surveys · Nonprobability samples · Distribution functions · Quantile estimation

1 Introduction

Much of the literature on survey data integration focuses on estimating the finite population mean $\mu_{\text{N}}=\frac{1}{N}\sum_{i\in\mathcal{U}}{Y_{i}}$, where $\mathcal{U}$ denotes a finite population of size $N$, $Y$ is the variable of interest, and $Y_{i}$ is the value assigned to the $i$-th unit of the population. Given a probability sample $\boldsymbol{A}$ of size $n_{\text{A}}$ with inclusion probabilities $\pi_{i}$ for all $i\in\boldsymbol{A}$, the Horvitz–Thompson (HT) [27] finite population mean estimator, $\hat{\mu}_{\pi}=\frac{1}{N}\sum_{i\in\boldsymbol{A}}{\pi^{-1}_{i}Y_{i}}$, is design-unbiased with a relatively simple closed-form sampling variance [40]. An interesting scenario arises when $Y_{i}$ is missing for all $i\in\boldsymbol{A}$, but a set of covariates, say $\boldsymbol{X}_{i}$, remains fully measured. Furthermore, assume that there exists a nonprobability (convenience) sample, $\boldsymbol{B}$, in which $Y_{j}$ and $\boldsymbol{X}_{j}$ are both fully measured for all $j\in\boldsymbol{B}$. Although $\hat{\mu}_{\pi}$ cannot be directly calculated in $\boldsymbol{A}$, it is possible to ‘impute’ $\boldsymbol{A}$’s missing $Y$ values by fitting a regression model for $Y$ on $\boldsymbol{X}$ in sample $\boldsymbol{B}$; this approach is often referred to as mass imputation in the literature. [30] investigated the use of generalized linear models as the imputation model and established the consistency of their proposed parametric mass imputation estimator (PMIE). Their empirical results imply optimal performance when (a) the mean (regression) function in $\boldsymbol{B}$ can be ‘transported’ to $\boldsymbol{A}$ (suitably named transportability), and (b) the regression model used is reasonably specified. [25] extended this work to nonparametric regression, which makes little to no assumptions about the parametric form of $\mathbb{E}\left(Y|\boldsymbol{X}\right)$. There, they proposed mass imputation estimators based on kernel regression (NPMIEK) and generalized additive models (NPMIEG), and noted marked gains in efficiency for both compared to the PMIE.

A natural extension of this work involves estimating the finite population cumulative distribution function (CDF), $F_{\text{N}}(t)$, and the quantile function, $t_{\text{N}}(\alpha)$. The CDF of a finite population $\mathcal{U}$ is defined as the proportion of elements whose $Y_{u}$, $u\in\mathcal{U}$, lie at or below a given point $t$; that is,

F_{\text{N}}(t)=\frac{1}{N}\sum_{u\in\mathcal{U}}{\mathbbm{1}\left(Y_{u}\leq t\right)}, (1.1)

where $\mathbbm{1}(\cdot)$ returns one if the condition in its argument is satisfied, and zero otherwise. The quantile function, in turn, defines the $\alpha^{th}$ quantile as the smallest $Y_{u}$ among all $u\in\mathcal{U}$ such that $F_{\text{N}}(Y_{u})\geq\alpha$; or, equivalently,

t_{\text{N}}(\alpha)=\inf_{t}\big\{t:F_{\text{N}}(t)\geq\alpha\big\}. (1.2)

If $Y_{i}$ were measured for all $i\in\boldsymbol{A}$, then, as with $\hat{\mu}_{\pi}$, one could prove that the HT CDF estimator,

\widehat{F}_{\pi}(t)=\frac{1}{N}\sum_{i\in\boldsymbol{A}}{\pi^{-1}_{i}\mathbbm{1}(Y_{i}\leq t)},

alongside its corresponding quantile estimator,

\hat{t}_{\pi}(\alpha)=\inf_{t}\left\{t:\widehat{F}_{\pi}(t)\geq\alpha\right\},

are at least asymptotically unbiased with relatively simple variance expressions [24, 44]. Furthermore, if $\boldsymbol{X}_{u}$ were measured for all $u\in\mathcal{U}$, a more efficient estimator of $F_{\text{N}}(t)$ could be obtained by predicting (a) $\mathbbm{1}(Y_{k}\leq t)$ for $k\in\mathcal{U}\backslash\boldsymbol{A}$ or (b) $\mathbbm{1}(Y_{u}\leq t)$ for all $u\in\mathcal{U}$, and then correcting for design bias [37, 24, 29]. Neither of these approaches is compatible with the monotone missingness framework described above, though. To address this paucity in the literature, we modify the residual-based CDF estimator of [24], henceforth denoted $\widehat{F}_{\text{R}}(t)$, to allow for information integration between $\boldsymbol{A}$ and $\boldsymbol{B}$; the idea is to use a regression model built on $\boldsymbol{B}$ to obtain a vector of estimated residuals, $\hat{\epsilon}$, and use these residuals’ empirical CDF (eCDF) as a substitute for $\mathbbm{1}(Y_{i}\leq t)$ in $\widehat{F}_{\pi}(t)$. We also explore the use of $\hat{t}_{\text{R}}(\alpha)$ as a corresponding quantile estimator.
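For concreteness, the following minimal R sketch computes $\widehat{F}_{\pi}(t)$ and $\hat{t}_{\pi}(\alpha)$ from a fully observed probability sample; the inputs y (responses), d (design weights $\pi^{-1}_{i}$), and N are illustrative names, not objects defined in the paper.

```r
# Minimal sketch of the HT CDF and quantile estimators above.
F_pi <- function(t, y, d, N) sum(d * (y <= t)) / N

t_pi <- function(alpha, y, d, N) {
  tgrid <- sort(unique(y))                                   # candidate values of t
  Fvals <- vapply(tgrid, F_pi, numeric(1), y = y, d = d, N = N)
  min(tgrid[Fvals >= alpha])                                 # inf{t : F_pi(t) >= alpha}
}
```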

The rest of the paper is organized as follows. In Section 2, we introduce key concepts and set the notation used throughout the paper. Next, in Section 3, we introduce our residual eCDF estimator and derive its asymptotic properties; we then contrast it with a competing ‘plug-in’ CDF estimator in Section 3.3, and derive their corresponding quantile estimators in Section 3.4. Monte Carlo simulation results demonstrating the finite-sample performance of the proposed CDF and quantile estimators are provided in Section 4. In Section 5, we illustrate the proposed estimators using a real data application. We conclude in Section 6 with a discussion of the main results.

2 Background and Preliminaries

Let $\mathcal{U}=\{1,2,\dots,N\}$ denote an index set for the units in a finite population of size $N$, and let $\boldsymbol{A}$ denote an $n_{\text{A}}\times(p+1)$ probability sample drawn from $\mathcal{U}$ with measured variables $\{d,X_{1},X_{2},\cdots,X_{p}\}$, where $d_{i}\coloneqq\pi_{i}^{-1}$ represents the sampling weight for element $i$ and $X_{1},\cdots,X_{p}$ represent a set of covariates. In a similar vein, let $\boldsymbol{B}$ denote an $n_{\text{B}}\times(p+1)$ nonprobability sample from $\mathcal{U}$ with measured variables $\{Y,X_{1},X_{2},\cdots,X_{p}\}$, where $Y$ represents a response variable; note that $d$ is unmeasured in $\boldsymbol{B}$, which differentiates $\boldsymbol{B}$ from $\boldsymbol{A}$ as a convenience sample. The missingness pattern outlined here is summarized in Table 2.1.

Table 2.1: Data structure for the probability ($\boldsymbol{A}$) and nonprobability ($\boldsymbol{B}$) samples.
Sample | $d$ | $X_{1}$ | $X_{2}$ | $\cdots$ | $X_{p}$ | $Y$
$\boldsymbol{A}$ | ✓ | ✓ | ✓ | $\cdots$ | ✓ | ×
$\boldsymbol{B}$ | × | ✓ | ✓ | $\cdots$ | ✓ | ✓

To incorporate the covariates in estimating $F_{\text{N}}(t)$, we assume that the finite population is a realization from the following superpopulation model, $\xi$:

Y=m(\boldsymbol{X};\boldsymbol{\beta})+\nu(\boldsymbol{X})\epsilon, (2.1)

where $m(\boldsymbol{X};\boldsymbol{\beta})=\mathbb{E}_{\xi}\left(Y|\boldsymbol{X}\right)$ is a known function of $\boldsymbol{X}$ parameterized by an unknown parameter vector $\boldsymbol{\beta}$, $\nu(\cdot)$ is a known, strictly positive function, and $\epsilon$ is an error term satisfying $\mathbb{E}_{\xi}\left(\epsilon|\boldsymbol{X}\right)=0$ and $\mathbb{E}_{\xi}\left(\epsilon^{2}|\boldsymbol{X}\right)=\sigma^{2}_{\epsilon}$. Now, if $\mathcal{U}$ were observed in its entirety, a natural estimator of $\boldsymbol{\beta}$ could be chosen to solve the finite population score function

U(\boldsymbol{\beta})=\frac{1}{N}\sum_{u\in\mathcal{U}}{\Big(Y_{u}-m(\boldsymbol{X}_{u};\boldsymbol{\beta})\Big)\boldsymbol{W}\left(\boldsymbol{X}_{u};\boldsymbol{\beta}\right)}=0

for some $p$-dimensional function $\boldsymbol{W}$; this estimator is henceforth denoted as $\boldsymbol{\beta}_{\text{N}}$, since it requires full information on all $N$ population elements. Nonetheless, given that $Y$ is present within $\boldsymbol{B}$ alone, we estimate $\boldsymbol{\beta}$ by solving a sample-based score function,

\widehat{U}(\boldsymbol{\beta})=\frac{1}{n_{\text{B}}}\sum_{j\in\boldsymbol{B}}{\Big(Y_{j}-m(\boldsymbol{X}_{j};\boldsymbol{\beta})\Big)\boldsymbol{W}\left(\boldsymbol{X}_{j};\boldsymbol{\beta}\right)}=0,

whose solution is henceforth denoted as $\boldsymbol{\widehat{\beta}}$.

Once $\boldsymbol{\widehat{\beta}}$ is obtained, mass imputation uses $m(\boldsymbol{X};\boldsymbol{\widehat{\beta}})$ to ‘impute’ the missing $Y$ in $\boldsymbol{A}$, assuming positivity and transportability (stated in Definitions 1 and 2, respectively).

Definition 1 (Positivity Assumption).

Let $\delta_{j}$ be an indicator function that returns 1 if observation $j$ is included in sample $\boldsymbol{B}$ and 0 otherwise. The assumption of positivity is satisfied if $\Pr(\delta_{j}=1|\boldsymbol{X}=\boldsymbol{x})>0$ for all $j\in\boldsymbol{B}$ and $\boldsymbol{x}$ in the support of $\boldsymbol{X}$. That is, for every value $\boldsymbol{x}$, there is a positive chance for the corresponding units to be selected in sample $\boldsymbol{B}$.

Definition 2 (Transportability Assumption).

Let $\delta_{j}$ again denote the sample indicator for $\boldsymbol{B}$, and let $f(Y|\boldsymbol{X})$ represent a function of $Y$ conditioned on $\boldsymbol{X}$. The assumption of transportability is satisfied if $f(Y|\boldsymbol{X},\delta_{j}=1)=f(Y|\boldsymbol{X})$.

Transportability is a crucial assumption as it allows us to ‘transport’ the prediction model from $\boldsymbol{B}$ to $\boldsymbol{A}$ without worry. As described in [30], a sufficient condition for transportability to hold is the ignorability condition, stated in Definition 3.

Definition 3 (Ignorability Condition).

Let $\delta_{j}$ again denote the sample indicator for $\boldsymbol{B}$, and let $\Pr(\delta_{j}=1|\boldsymbol{X})$ denote the probability of unit $j$ being included in sample $\boldsymbol{B}$ conditioned on $\boldsymbol{X}$. The assumption of ignorability is satisfied if $\Pr(\delta_{j}=1|\boldsymbol{X},Y)=\Pr(\delta_{j}=1|\boldsymbol{X})$.

Definition 3 is essentially the missing at random (MAR) assumption [39]. It is well known that, under MAR (i.e., ignorability), $\boldsymbol{\widehat{\beta}}$ is a consistent estimator of $\boldsymbol{\beta}$, in the sense that the sampling distribution of $\boldsymbol{\widehat{\beta}}$ concentrates around $\boldsymbol{\beta}$ as $n_{\text{B}}$ increases [30, 33, 42]. In the next section, we use the consistency of $\boldsymbol{\widehat{\beta}}$ to derive a residual-based estimator of $F_{\text{N}}(t)$, where $\mathbbm{1}(Y_{i}\leq t)$ in $\widehat{F}_{\pi}(t)$ is replaced by $\widehat{G}(\boldsymbol{X}_{i})$, the empirical CDF of the estimated regression residuals from sample $\boldsymbol{B}$ evaluated at $\frac{t-m(\boldsymbol{X}_{i};\boldsymbol{\widehat{\beta}})}{\nu(\boldsymbol{X}_{i})}$.

3 Proposed Estimators and Asymptotic Results

3.1 Residual-based CDF Estimator

Note that if model (2.1) holds, $F_{\text{N}}(t)$ may be restated as

F_{\text{N}}(t)=\frac{1}{N}\sum_{u\in\mathcal{U}}{\mathbbm{1}\left(\epsilon_{u}\leq\frac{t-m(\boldsymbol{X}_{u};\boldsymbol{\beta})}{\nu(\boldsymbol{X}_{u})}\right)}.

Now, letting

G_{\text{N}}(\boldsymbol{X}_{u})\coloneqq\frac{1}{N}\sum_{v\in\mathcal{U}}{\mathbbm{1}\left(\epsilon_{v}\leq\frac{t-m(\boldsymbol{X}_{u};\boldsymbol{\beta})}{\nu(\boldsymbol{X}_{u})}\right)}

denote the finite population distribution function of $\epsilon$ evaluated at the fixed point $\frac{t-m(\boldsymbol{X}_{u};\boldsymbol{\beta})}{\nu(\boldsymbol{X}_{u})}$, we note that

F_{\text{N}}(t)-\frac{1}{N}\sum_{u\in\mathcal{U}}{G_{\text{N}}(\boldsymbol{X}_{u})}=\frac{1}{N}\sum_{u\in\mathcal{U}}{\left[\mathbbm{1}\left(\epsilon_{u}\leq\frac{t-m(\boldsymbol{X}_{u};\boldsymbol{\beta})}{\nu(\boldsymbol{X}_{u})}\right)-G_{\text{N}}\left(\boldsymbol{X}_{u}\right)\right]}=0,

which further implies that

\mathbb{E}_{\xi}\left[\frac{1}{N}\sum_{u\in\mathcal{U}}{\mathbbm{1}\left(Y_{u}\leq t\right)}\right]=\mathbb{E}_{\xi}\left[\frac{1}{N}\sum_{u\in\mathcal{U}}{G_{\text{N}}\left(\boldsymbol{X}_{u}\right)}\right]=\Pr\left(Y\leq t\right).

Thus, it is not egregious to replace $\mathbbm{1}\left(Y_{u}\leq t\right)$ in $F_{\text{N}}(t)$ with $G_{\text{N}}\left(\boldsymbol{X}_{u}\right)$, assuming that the working model (2.1) holds.

Given that $\boldsymbol{A}$ is subject to missingness on $Y$ only, replacing $\mathbbm{1}\left(Y_{i}\leq t\right)$ in $\widehat{F}_{\pi}(t)$ with an empirical estimate of $G_{\text{N}}(\boldsymbol{X}_{i})$ built from $\boldsymbol{B}$ remains plausible if $\boldsymbol{\widehat{\beta}}\xrightarrow{P}\boldsymbol{\beta}$ (i.e., if ignorability holds). To describe the procedure, let $\widehat{G}(\boldsymbol{X}_{k})$ denote the eCDF of the estimated residuals from sample $\boldsymbol{B}$, $\hat{\epsilon}_{j}=\frac{Y_{j}-m(\boldsymbol{X}_{j};\boldsymbol{\widehat{\beta}})}{\nu(\boldsymbol{X}_{j})}$, such that

\widehat{G}(\boldsymbol{X}_{k})=\frac{1}{n_{\text{B}}}\sum_{j\in\boldsymbol{B}}{\mathbbm{1}\left(\hat{\epsilon}_{j}\leq\frac{t-m(\boldsymbol{X}_{k};\boldsymbol{\widehat{\beta}})}{\nu(\boldsymbol{X}_{k})}\right)}.

Then, the residual-based estimator of FN(t)F_{\text{N}}(t) is defined as

\widehat{F}_{\text{R}}(t)=\frac{1}{N}\sum_{i\in\boldsymbol{A}}{\pi^{-1}_{i}\widehat{G}(\boldsymbol{X}_{i})}=\frac{1}{Nn_{\text{B}}}\sum_{i\in\boldsymbol{A}}{\sum_{j\in\boldsymbol{B}}{\pi^{-1}_{i}\mathbbm{1}\left(\hat{\epsilon}_{j}\leq\frac{t-m(\boldsymbol{X}_{i};\boldsymbol{\widehat{\beta}})}{\nu(\boldsymbol{X}_{i})}\right)}}. (3.1)
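The following is a minimal R sketch of Eq. (3.1), assuming a linear working model $m(\boldsymbol{X};\boldsymbol{\beta})$ fit with lm() (as in our simulations of Section 4) and $\nu(\boldsymbol{X})\equiv 1$; the data layout (data frames A and B with columns y, x1, x2, and weights d in A) is illustrative, not the authors' code.

```r
# Residual-based CDF estimator F_R(t) of Eq. (3.1); a sketch assuming
# m(X; beta) is linear and nu(X) = 1. Column names are illustrative.
F_R <- function(t, A, B, N) {
  fit  <- lm(y ~ x1 + x2, data = B)   # beta-hat fit on the convenience sample B
  ehat <- resid(fit)                  # estimated residuals eps-hat_j, j in B
  mA   <- predict(fit, newdata = A)   # m(X_i; beta-hat) for each i in A
  # G-hat(X_i): eCDF of the B residuals evaluated at t - m(X_i; beta-hat)
  Ghat <- vapply(t - mA, function(u) mean(ehat <= u), numeric(1))
  sum(A$d * Ghat) / N                 # survey-weighted average over sample A
}
```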

3.2 Asymptotic Theory

Before we discuss the asymptotic properties of $\widehat{F}_{\text{R}}(t)$, we must establish a framework that allows $n_{\text{A}}$, $n_{\text{B}}$, and $N$ to tend to infinity. Similar to [28, 43], let $\mathscr{U}\coloneqq\{\mathcal{U}_{\text{N}}\}$ denote an increasing sequence of finite populations of size $N\to\infty$ generated from model (2.1); furthermore, let $\boldsymbol{A}$ and $\boldsymbol{B}$ denote samples from $\mathcal{U}_{\text{N}}$ ($N$ henceforth omitted for brevity) drawn according to some sampling design, such that the sampling fraction $\frac{n_{\text{s}}}{N}=\frac{n_{\text{A}}+n_{\text{B}}}{N}$ converges to a limit in $(0,1)$ as both $n_{\text{s}}$ and $N$ tend to infinity [24]. Given this asymptotic framework, let $\boldsymbol{\beta}^{*}=\text{plim}_{n_{\text{B}}\to\infty}\boldsymbol{\widehat{\beta}}$, such that, for any $\varepsilon>0$,

\Pr\left(\big|\boldsymbol{\widehat{\beta}}-\boldsymbol{\beta}^{*}\big|>\varepsilon\right)\to 0\ \text{as}\ n_{\text{B}}\to\infty.

Letting $\xrightarrow{P}$ again denote convergence in probability, we have previously mentioned that $\boldsymbol{\widehat{\beta}}\xrightarrow{P}\boldsymbol{\beta}$ (i.e., $\boldsymbol{\beta}^{*}=\boldsymbol{\beta}$) under the ignorability condition; and since $\boldsymbol{\beta}_{\text{N}}\xrightarrow{P}\boldsymbol{\beta}$ as well, we may use Theorem 2.1.3 of [32] to show $\boldsymbol{\widehat{\beta}}\xrightarrow{P}\boldsymbol{\beta}_{\text{N}}$.

In the following, we formally define a set of conditions that govern the validity of our asymptotic results.

Assumption 1.

The sampling design of $\boldsymbol{B}$ is ignorable; that is,

\Pr\left(j\in\boldsymbol{B}|\boldsymbol{X},Y\right)=\Pr\left(j\in\boldsymbol{B}|\boldsymbol{X}\right).
Assumption 2.

The sampling fraction $\frac{n_{\text{A}}+n_{\text{B}}}{N}$ converges to a limit in $(0,1)$ as $n_{\text{A}}+n_{\text{B}}$ and $N$ tend to infinity. Furthermore, $\mathbb{E}_{\mathcal{D}}\left(n_{\text{A}}\right)=O(N^{\gamma})$, where $\mathbb{E}_{\mathcal{D}}(\cdot)$ denotes the design-based expectation and $\frac{2}{3}<\gamma\leq 1$.

Assumption 3.

There exist some positive real constants $c_{1},c_{2}$, and $c_{3}$ such that $c_{1}\leq\frac{N\pi_{i}}{\mathbb{E}_{\mathcal{D}}\left(n_{\text{A}}\right)}\leq c_{2}$ and $\left|1-\frac{\pi_{h}\pi_{i}}{\pi_{hi}}\right|\leq c_{3}$ for all $h,i\in\boldsymbol{A}$. Furthermore, $n^{-1/2}_{\text{B}}=o\left([\mathbb{E}_{\mathcal{D}}\left(n_{\text{A}}\right)]^{-\frac{1}{2}}\right)$.

Assumption 4.

For any random variable $Z$ with finite $2+\upsilon$ population moments for an arbitrarily small $\upsilon>0$,

\mathrm{Var}_{\mathcal{D}}\left(\frac{1}{N}\sum_{i\in\boldsymbol{A}}{\pi^{-1}_{i}Z_{i}}\right)\leq\frac{c_{3}}{\mathbb{E}_{\mathcal{D}}\left(n_{\text{A}}\right)(N-1)}\sum_{u\in\mathcal{U}_{\text{N}}}{\left(Z_{u}-\bar{Z}_{\text{N}}\right)^{2}},

where $\bar{Z}_{\text{N}}=\frac{1}{N}\sum_{u\in\mathcal{U}_{\text{N}}}{Z_{u}}$ is the finite population mean of $Z$ and $c_{3}$ is some positive constant.

Assumption 5.

For any random variable $Z$ with a finite fourth population moment,

\left[\mathrm{Var}_{\mathcal{D}}\left(\bar{Z}_{\pi}\right)\right]^{-1/2}\left(\bar{Z}_{\pi}-\bar{Z}_{\text{N}}\right)\xrightarrow{\mathcal{L}}\mathrm{N}\left(0,1\right)\quad\text{and}\quad\left[\mathrm{Var}_{\mathcal{D}}\left(\bar{Z}_{\pi}\right)\right]^{-1}\widehat{\mathrm{Var}}_{\pi}\left(\bar{Z}_{\pi}\right)-1=O_{\text{P}}\left([\mathbb{E}_{\mathcal{D}}\left(n_{\text{A}}\right)]^{-1/2}\right),

where $\bar{Z}_{\pi}=\frac{1}{N}\sum_{i\in\boldsymbol{A}}{\pi^{-1}_{i}Z_{i}}$ denotes the HT estimator of $\bar{Z}_{\text{N}}$ and $\widehat{\mathrm{Var}}_{\pi}\left(\bar{Z}_{\pi}\right)$ the HT estimator of $\mathrm{Var}_{\mathcal{D}}\left(\bar{Z}_{\pi}\right)$.

Assumption 6.

The coefficient vector $\boldsymbol{\widehat{\beta}}$ satisfies

\boldsymbol{\widehat{\beta}}=\boldsymbol{\beta}_{\text{N}}+o_{P}\left(n^{-1/2}_{\text{B}}\right).
Assumption 7.

$F_{\text{N}}(t)$ converges to a smooth function $F(t)$ as $N$ goes to infinity; that is,

\lim_{N\to\infty}{F_{\text{N}}(t)}=F(t),

where the limiting function $F(t)$ is continuous with finite first and second derivatives.

Assumption 1 allows $\boldsymbol{\widehat{\beta}}\xrightarrow{P}\boldsymbol{\beta}_{\text{N}}$; Assumption 2 establishes a framework allowing $n_{\text{A}},n_{\text{B}}\to\infty$ as $N\to\infty$ and limits the growth rate of both sample sizes; Assumptions 3 and 4 ensure design-based consistency under $p(\boldsymbol{A})$, the sampling design of $\boldsymbol{A}$, and allow $n_{\text{B}}$ to grow at a faster rate than $\mathbb{E}_{\mathcal{D}}\left(n_{\text{A}}\right)$; Assumption 5 ensures asymptotic normality under a general design; Assumption 6 establishes regularity conditions for $\boldsymbol{\widehat{\beta}}$; and Assumption 7 specifies a limiting smooth function for $F_{\text{N}}(t)$ [43, 24, 36]. Note that, if Assumptions 3 and 6 hold, then $\boldsymbol{\widehat{\beta}}=\boldsymbol{\beta}_{\text{N}}+o_{P}\left([\mathbb{E}_{\mathcal{D}}\left(n_{\text{A}}\right)]^{-\frac{1}{2}}\right)$.

The following theorem summarizes the asymptotic properties of the proposed residual-based CDF estimator, F^R(t)\widehat{F}_{\text{R}}(t).

Theorem 1.

Under Assumptions 1–7, $\widehat{F}_{\text{R}}(t)$ is design-consistent for $F_{\text{N}}(t)$ and is asymptotically normally distributed; that is,

\frac{\widehat{F}_{\text{R}}(t)-F_{\text{N}}(t)}{\sqrt{AV\{\widehat{F}_{\text{R}}(t)\}}}\overset{\mathcal{L}}{\longrightarrow}\mathrm{N}(0,1),

where

AV\{\widehat{F}_{\text{R}}(t)\}=\frac{1}{N^{2}}\sum_{u\in\mathcal{U}}{\sum_{v\in\mathcal{U}}{\left(\frac{\pi_{uv}}{\pi_{u}\pi_{v}}-1\right)G_{\text{N}}(\boldsymbol{X}_{u})G_{\text{N}}(\boldsymbol{X}_{v})}}.
Proof.

First, define $\tilde{F}_{\text{R}}(t)$ as

\tilde{F}_{\text{R}}(t)=\frac{1}{N}\sum_{i\in\boldsymbol{A}}{\pi^{-1}_{i}\frac{1}{n_{\text{B}}}\sum_{j\in\boldsymbol{B}}{\mathbbm{1}\left(\epsilon_{j}\leq\frac{t-m(\boldsymbol{X}_{i};\boldsymbol{\beta})}{\nu\left(\boldsymbol{X}_{i}\right)}\right)}}=\frac{1}{N}\sum_{i\in\boldsymbol{A}}{\pi^{-1}_{i}\tilde{G}(\boldsymbol{X}_{i})}=\frac{1}{N}\sum_{u\in\mathcal{U}}{\pi^{-1}_{u}\ell_{u}\tilde{G}(\boldsymbol{X}_{u})},

where $\ell_{u}=1$ if $u\in\boldsymbol{A}$ and zero otherwise. Now, under Assumption 1, $\tilde{G}(\boldsymbol{X})\xrightarrow{P}G_{\text{N}}(\boldsymbol{X})$, and thus

\mathbb{E}_{\mathcal{D}}\left(\tilde{F}_{\text{R}}(t)\right)=\mathbb{E}_{\mathcal{D}}\left(\frac{1}{N}\sum_{u\in\mathcal{U}}{\ell_{u}\pi^{-1}_{u}G_{\text{N}}(\boldsymbol{X}_{u})}+o_{\text{P}}(1)\right)=\frac{1}{N}\sum_{u\in\mathcal{U}}{G_{\text{N}}(\boldsymbol{X}_{u})}+o(1),

\mathbb{E}_{\mathcal{D}}\left(\tilde{F}^{2}_{\text{R}}(t)\right)=\frac{1}{N^{2}}\sum_{u\in\mathcal{U}}{\sum_{v\in\mathcal{U}}{\frac{\pi_{uv}}{\pi_{u}\pi_{v}}G_{\text{N}}(\boldsymbol{X}_{u})G_{\text{N}}(\boldsymbol{X}_{v})}}+o(1);

ergo,

\mathrm{Var}_{\mathcal{D}}\left(\tilde{F}_{\text{R}}(t)\right)=\frac{1}{N^{2}}\sum_{u\in\mathcal{U}}{\sum_{v\in\mathcal{U}}{\left(\frac{\pi_{uv}}{\pi_{u}\pi_{v}}-1\right)G_{\text{N}}(\boldsymbol{X}_{u})G_{\text{N}}(\boldsymbol{X}_{v})}}+o(1).

From here, we may use results from [43], [36], and [37] to show that $\widehat{F}_{\text{R}}(t)$ and $\tilde{F}_{\text{R}}(t)$ have the same limiting normal distribution and are both design-consistent for $F_{\text{N}}(t)$; that is,

\frac{\widehat{F}_{\text{R}}(t)-F_{\text{N}}(t)}{\sqrt{AV\{\widehat{F}_{\text{R}}(t)\}}}\overset{\mathcal{L}}{\longrightarrow}\mathrm{N}(0,1),

where $AV\{\widehat{F}_{\text{R}}(t)\}$ is as given in the theorem. ∎

Theorem 2.

Suppose Assumptions 1–6 hold. An asymptotically unbiased estimator of $AV\{\widehat{F}_{\text{R}}(t)\}$ is given by

\widehat{\mathrm{Var}}\left(\widehat{F}_{\text{R}}(t)\right)=\frac{1}{N^{2}}\sum_{h\in\boldsymbol{A}}{\sum_{i\in\boldsymbol{A}}{\left(\frac{\pi_{hi}}{\pi_{h}\pi_{i}}-1\right)\frac{1}{\pi_{hi}}\widehat{G}(\boldsymbol{X}_{h})\widehat{G}(\boldsymbol{X}_{i})}}.
Proof.

Given that $\widehat{F}_{\text{R}}(t)$ and $\tilde{F}_{\text{R}}(t)$ are both design-consistent for $F_{\text{N}}(t)$, we may again use Theorem 2.1.3 of [32] to show

\widehat{F}^{2}_{\text{R}}(t)-\tilde{F}^{2}_{\text{R}}(t)\coloneqq\frac{1}{N^{2}}\sum_{h\in\boldsymbol{A}}{\sum_{i\in\boldsymbol{A}}{\pi^{-1}_{h}\pi^{-1}_{i}\left(\widehat{G}(\boldsymbol{X}_{h})\widehat{G}(\boldsymbol{X}_{i})-\tilde{G}(\boldsymbol{X}_{h})\tilde{G}(\boldsymbol{X}_{i})\right)}}\xrightarrow{P}0. (3.2)

Note that

\widehat{\mathrm{Var}}\left(\widehat{F}_{\text{R}}(t)\right)-\widehat{\mathrm{Var}}\left(\tilde{F}_{\text{R}}(t)\right)=\frac{1}{N^{2}}\sum_{h\in\boldsymbol{A}}{\sum_{i\in\boldsymbol{A}}{\pi^{-1}_{h}\pi^{-1}_{i}\left(\widehat{G}(\boldsymbol{X}_{h})\widehat{G}(\boldsymbol{X}_{i})-\tilde{G}(\boldsymbol{X}_{h})\tilde{G}(\boldsymbol{X}_{i})\right)\alpha_{hi}}},

where $\alpha_{hi}\coloneqq 1-\frac{\pi_{h}\pi_{i}}{\pi_{hi}}$. Given that $\widehat{\mathrm{Var}}\left(\tilde{F}_{\text{R}}(t)\right)$ is asymptotically unbiased for $AV\{\widehat{F}_{\text{R}}(t)\}$, we would like to show that $\widehat{\mathrm{Var}}\left(\widehat{F}_{\text{R}}(t)\right)-\widehat{\mathrm{Var}}\left(\tilde{F}_{\text{R}}(t)\right)$ converges in probability to zero.

Recall, from Assumption 3, that there exists some constant $c_{3}\in\mathbb{R}^{+}$ such that $|\alpha_{hi}|\leq c_{3}$ for all $h,i\in\boldsymbol{A}$. This implies

-c_{3}\times\left(\widehat{F}^{2}_{\text{R}}(t)-\tilde{F}^{2}_{\text{R}}(t)\right)\leq\widehat{\mathrm{Var}}\left(\widehat{F}_{\text{R}}(t)\right)-\widehat{\mathrm{Var}}\left(\tilde{F}_{\text{R}}(t)\right)\leq c_{3}\times\left(\widehat{F}^{2}_{\text{R}}(t)-\tilde{F}^{2}_{\text{R}}(t)\right);

and since

\text{plim}_{n_{\text{B}}\to\infty}\left\{c_{3}\times\left(\widehat{F}^{2}_{\text{R}}(t)-\tilde{F}^{2}_{\text{R}}(t)\right)\right\}=\text{plim}_{n_{\text{B}}\to\infty}\left\{-c_{3}\times\left(\widehat{F}^{2}_{\text{R}}(t)-\tilde{F}^{2}_{\text{R}}(t)\right)\right\}=0,

we see, by the squeeze theorem and (3.2), that $\widehat{\mathrm{Var}}\left(\widehat{F}_{\text{R}}(t)\right)-\widehat{\mathrm{Var}}\left(\tilde{F}_{\text{R}}(t)\right)\xrightarrow{P}0$. Thus, $\widehat{\mathrm{Var}}\left(\widehat{F}_{\text{R}}(t)\right)$ is asymptotically unbiased for $AV\{\widehat{F}_{\text{R}}(t)\}$. ∎
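For completeness, a direct R translation of the variance estimator in Theorem 2 is sketched below; the inputs (Ghat, the vector of $\widehat{G}(\boldsymbol{X}_{i})$ values for $i\in\boldsymbol{A}$; pis, the first-order inclusion probabilities; piHI, the matrix of joint inclusion probabilities $\pi_{hi}$) are assumed names for illustration.

```r
# Sketch of Theorem 2's plug-in variance estimator for F_R(t).
var_FR <- function(Ghat, pis, piHI, N) {
  W <- (piHI / outer(pis, pis) - 1) / piHI   # (pi_hi/(pi_h pi_i) - 1) * (1/pi_hi)
  as.numeric(t(Ghat) %*% W %*% Ghat) / N^2   # double sum over h, i in A
}
```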

In Section 4.4, we define a replication-type variance estimator based on the bootstrap, and compare its performance with that of the variance estimator in Theorem 2.

3.3 Plug-In CDF Estimator

One may consider a more direct ‘mass imputation’ approach to CDF estimation by substituting $Y_{i}$ for all $i\in\boldsymbol{A}$ with predicted values, $\hat{Y}_{i}\coloneqq m(\boldsymbol{X}_{i};\boldsymbol{\widehat{\beta}})$, but this approach will almost always result in higher estimation bias. Let $\widehat{F}_{\text{P}}(t)$ denote this direct plug-in CDF estimator, such that

\widehat{F}_{\text{P}}(t)=\frac{1}{N}\sum_{i\in\boldsymbol{A}}{\pi^{-1}_{i}\mathbbm{1}\left(\hat{Y}_{i}\leq t\right)}=\frac{1}{N}\sum_{i\in\boldsymbol{A}}{\pi^{-1}_{i}\mathbbm{1}\left(m(\boldsymbol{X}_{i};\boldsymbol{\widehat{\beta}})\leq t\right)}.

In the best-case scenario of $\boldsymbol{\widehat{\beta}}=\boldsymbol{\beta}$, we observe

\widehat{F}_{\text{P}}(t)\Big|_{\boldsymbol{\widehat{\beta}}=\boldsymbol{\beta}}=\frac{1}{N}\sum_{i\in\boldsymbol{A}}{\pi^{-1}_{i}\mathbbm{1}\Big(m(\boldsymbol{X}_{i};\boldsymbol{\beta})\leq t\Big)},

and thus

\mathbb{E}_{\mathcal{D}}\Big(\widehat{F}_{\text{P}}(t)\Big|_{\boldsymbol{\widehat{\beta}}=\boldsymbol{\beta}}\Big)=\frac{1}{N}\sum_{u\in\mathcal{U}}{\mathbbm{1}\Big(m(\boldsymbol{X}_{u};\boldsymbol{\beta})\leq t\Big)}=\frac{1}{N}\sum_{u\in\mathcal{U}}{\mathbbm{1}\Big(Y_{u}-\nu(\boldsymbol{X}_{u})\epsilon_{u}\leq t\Big)},

which is not unbiased for $F_{\text{N}}(t)$ under model (2.1) unless $\nu(\boldsymbol{X}_{u})\epsilon_{u}$ is effectively zero for all $u\in\mathcal{U}$. From the perspective of sample $\boldsymbol{A}$, we note $\hat{Y}_{i}=Y_{i}-\varsigma_{i}$, where the unknown residual $\varsigma_{i}$ can trigger an erroneous $\mathbbm{1}(\cdot)$ output even if it is small for all $i\in\boldsymbol{A}$. That is to say, the plug-in approach is ‘all-or-nothing’: $\mathbbm{1}(\hat{Y}_{i}\leq t)$ is either right or wrong, with very little room for error. This is not the case for $\widehat{F}_{\text{R}}(t)$, though, since small $\varsigma_{i}$ implies a well-specified (and transportable) $\boldsymbol{\widehat{\beta}}$, making $\widehat{G}(\boldsymbol{X}_{i})$ a reasonable substitute for $\mathbbm{1}(Y_{i}\leq t)$ within $\widehat{F}_{\pi}(t)$.
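For contrast with the residual-based sketch of Section 3.1, a minimal R version of $\widehat{F}_{\text{P}}(t)$ follows, under the same assumed data layout (these names are illustrative, not the authors' code):

```r
# Plug-in CDF estimator F_P(t): substitute Y-hat_i = m(X_i; beta-hat) into the
# HT indicator; a sketch assuming a linear working model fit on B.
F_P <- function(t, A, B, N) {
  fit  <- lm(y ~ x1 + x2, data = B)   # working regression model fit on B
  yhat <- predict(fit, newdata = A)   # Y-hat_i for each i in A
  sum(A$d * (yhat <= t)) / N          # 'all-or-nothing' indicator substitution
}
```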

3.4 Quantile Estimation

We conclude this section with a discussion of the quantile estimation problem. From Eq. (1.2), an estimator of $t_{\text{N}}(\alpha)$ based on $\widehat{F}_{\text{R}}(t)$ could be defined as

\hat{t}_{\text{R}}(\alpha)=\inf_{t}\left\{t:\widehat{F}_{\text{R}}(t)\geq\alpha\right\}, (3.3)

or, for $\widehat{F}_{\text{P}}(t)$, as

\hat{t}_{\text{P}}(\alpha)=\inf_{t}\left\{t:\widehat{F}_{\text{P}}(t)\geq\alpha\right\}, (3.4)

where $\alpha$ denotes the percentile of interest. In general, if some $\widehat{F}(t)$ is asymptotically unbiased, then its corresponding quantile estimator, $\hat{t}(\alpha)$, will also be asymptotically unbiased [24]. Assuming further that $\widehat{F}(t)$ is asymptotically normally distributed, $100(1-\gamma)\%$ confidence intervals can be constructed using the formula of [44] as follows:

\hat{t}_{\text{LL}}=\min_{t}\left(\widehat{F}(t)\geq\alpha-z_{\gamma/2}\widehat{\text{SE}}\left[\widehat{F}\left(\hat{t}(\alpha)\right)\right]\right) (3.5)

\hat{t}_{\text{UL}}=\min_{t}\left(\widehat{F}(t)\geq\alpha+z_{\gamma/2}\widehat{\text{SE}}\left[\widehat{F}\left(\hat{t}(\alpha)\right)\right]\right), (3.6)

where $z_{\gamma/2}$ is the $(1-\gamma/2)^{th}$ quantile of the standard normal distribution and $\widehat{\text{SE}}$ is the estimated standard error of $\widehat{F}(t)$. Then, using the linear approximations of [31, 26], we may estimate the variance of $\hat{t}(\alpha)$ via

\widehat{\mathrm{Var}}\left[\hat{t}(\alpha)\right]=\left(\frac{\hat{t}_{\text{UL}}-\hat{t}_{\text{LL}}}{2z_{\gamma/2}}\right)^{2}. (3.7)
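In practice, the inversion in Eqs. (3.3)-(3.4) and the Woodruff-type variance in Eqs. (3.5)-(3.7) reduce to a grid search; a hedged R sketch follows, where tgrid, Fhat (the CDF estimate on the grid), and se (the estimated standard error of $\widehat{F}$ at $\hat{t}(\alpha)$) are assumed inputs.

```r
# Quantile estimation by inverting an estimated CDF, Eq. (3.3)/(3.4),
# plus the linear-approximation variance of Eq. (3.7).
quantile_est <- function(alpha, tgrid, Fhat) {
  min(tgrid[Fhat >= alpha])                   # inf{t : F-hat(t) >= alpha}
}

quantile_var <- function(alpha, tgrid, Fhat, se, gamma = 0.10) {
  z   <- qnorm(1 - gamma / 2)                 # e.g., 1.645 when gamma = 0.10
  tLL <- min(tgrid[Fhat >= alpha - z * se])   # Eq. (3.5)
  tUL <- min(tgrid[Fhat >= alpha + z * se])   # Eq. (3.6)
  ((tUL - tLL) / (2 * z))^2                   # Eq. (3.7)
}
```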

An alternative variance estimator based on the bootstrap will be defined and empirically evaluated in Section 4.4.

One should note that quantile estimation is extremely sensitive to small sample sizes in general, and to small inclusion probabilities in particular, as both tend to exclude many (if not most) of the distinct values of $Y$ in $\mathcal{U}$. In the next sections, we evaluate the consequences of small $n_{\text{B}}$ and $n_{\text{A}}$ on the performance of both $\hat{t}_{\text{P}}(\alpha)$ and $\hat{t}_{\text{R}}(\alpha)$ using simulated data.

4 Simulation Study

We conducted a two-phase Monte Carlo simulation study to contrast the finite-sample performance of the proposed CDF and quantile estimators ($\widehat{F}_{\text{R}}(t)$ and $\hat{t}_{\text{R}}(\alpha)$) with their plug-in equivalents ($\widehat{F}_{\text{P}}(t)$ and $\hat{t}_{\text{P}}(\alpha)$) and their ‘naïve’, $\boldsymbol{B}$-only equivalents ($\widehat{F}_{\text{B}}(t)$ and $\hat{t}_{\text{B}}(\alpha)$). We measure the relative performance of the estimators using the relative root mean squared error (RRMSE), defined generically for some $\hat{\theta}$ as

\mathrm{RRMSE}\left(\hat{\theta}\right)=\sqrt{\frac{\mathrm{MSE}\left(\hat{\theta}\right)}{\mathrm{MSE}\left(\hat{\theta}_{\text{A}}\right)}}, (4.1)

where $\hat{\theta}_{\text{A}}$ denotes a ‘gold-standard’ estimator of $\theta$ from $\boldsymbol{A}$. Here, $\hat{\theta}_{\text{A}}\coloneqq\{\widehat{F}_{\pi}(t),\hat{t}_{\pi}(\alpha)\}$, the HT estimators of $F_{\text{N}}(t)$ and $t_{\text{N}}(\alpha)$, respectively.
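In code, Eq. (4.1) amounts to the following one-liner, where est and gold denote Monte Carlo draws of a candidate estimator and of the HT ‘gold standard’, respectively (names are assumptions):

```r
# RRMSE of Eq. (4.1): root ratio of MSEs against the HT benchmark.
rrmse <- function(est, gold, theta) {
  sqrt(mean((est - theta)^2) / mean((gold - theta)^2))
}
```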

4.1 Simulation Settings

To thoroughly assess the performance of our proposed estimators, we considered the following four superpopulation models, $\{\xi_{i}\}^{4}_{i=1}$, which account for varying levels of complexity in the relationship between $Y$ and $\boldsymbol{X}$.

• Model $\xi_{1}$ [25]:

Y=.3+2X_{1}+2X_{2}+\epsilon,

where $X_{1},X_{2}\sim\mathrm{N}(\mu=2,\sigma=1)$ and $\epsilon\sim\mathrm{N}(\mu=0,\sigma=1)$.

• Model $\xi_{2}$ [25]:

Y=.3+.5X^{2}_{1}+.5X^{2}_{2}+\epsilon,

where $X_{1},X_{2}\sim\mathrm{N}(\mu=2,\sigma=1)$ and $\epsilon\sim\mathrm{N}(\mu=0,\sigma=1)$.

• Model $\xi_{3}$ [41, 35]:

Y=-\sin(X_{1})+X^{2}_{2}+X_{3}-\mathrm{e}^{-X^{2}_{4}}+\epsilon,

where $X_{1},\cdots,X_{4}\sim\mathrm{Uniform}\left(\min=-1,\max=1\right)$ and $\epsilon\sim\mathrm{N}\left(\mu=0,\sigma=\sqrt{.5}\right)$.

• Model $\xi_{4}$ [38, 35]:

Y=X_{1}+.707X^{2}_{2}+2\mathbbm{1}\left(X_{3}>0\right)+.873\ln\left(|X_{1}|\right)|X_{3}|+.894X_{2}X_{4}+2\mathbbm{1}\left(X_{5}>0\right)+.464\mathrm{e}^{X_{6}}+\epsilon,

where $X_{1},\cdots,X_{6}\sim\mathrm{N}\left(\mu=0,\sigma=1\right)$ and $\epsilon\sim\mathrm{N}\left(\mu=0,\sigma=1\right)$.

Under each of these models, we generated a finite population $\mathcal{U}$ of size $N=100{,}000$. After $\mathcal{U}$ was obtained, a simple random sample without replacement (SRS) of size $n_{\text{A}}=500$ was selected to serve as $\boldsymbol{A}$, the probability sample. For sample $\boldsymbol{B}$ with $n_{\text{B}}\in\{500,\,1000,\,5000,\,10000\}$, we used modifications of the stratified simple random sampling technique of [25] to generate samples under both missing at random (MAR) and missing not at random (MNAR) mechanisms. For MAR, we defined $X^{*}$ to be the covariate with the strongest correlation with $Y$, placed all elements $u\in\mathcal{U}$ with $X_{u}^{*}\leq\mu_{X^{*}}$ in stratum I and the rest in stratum II, and then used stratified SRS to obtain $n_{\text{I}}=0.85\,n_{\text{B}}$ and $n_{\text{II}}=(1-0.85)\,n_{\text{B}}$ elements from the respective strata; a code sketch of this mechanism is given below. The procedure for generating MNAR data was nearly identical, with the exception of using $Y_{u}\leq\mu_{Y}$ as the stratum discrimination rule. Lastly, using each of the $n_{\text{sim}}=1500$ randomly sampled $\boldsymbol{A}$/$\boldsymbol{B}$ sample pairs, we calculated $\widehat{F}_{\text{B}}(t)$ (shortened as ‘B’), $\widehat{F}_{\text{P}}(t)$ (shortened as ‘P’), and $\widehat{F}_{\text{R}}(t)$ (shortened as ‘R’), as well as their respective quantile estimators, at the 1st, 10th, 25th, 50th, 75th, 90th, and 99th percentiles. The two estimators $\widehat{F}_{\text{P}}(t)$ and $\widehat{F}_{\text{R}}(t)$ used multivariable linear regression, fit on sample $\boldsymbol{B}$ using R’s lm() function. Results are presented in Tables 4.1–4.4.
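The MAR selection mechanism just described can be sketched in R as follows; U is a finite-population data frame in which the covariate xstar (the one most strongly correlated with $Y$) has already been identified. All names are illustrative, not the authors' code.

```r
# Stratified-SRS selection of the nonprobability sample B under MAR.
draw_B_mar <- function(U, nB, frac = 0.85) {
  I   <- which(U$xstar <= mean(U$xstar))        # stratum I: xstar at or below its mean
  II  <- setdiff(seq_len(nrow(U)), I)           # stratum II: the remainder
  idx <- c(sample(I,  round(frac * nB)),        # n_I  = 0.85 * nB from stratum I
           sample(II, nB - round(frac * nB)))   # n_II = 0.15 * nB from stratum II
  U[idx, ]
}
# The MNAR version replaces the xstar rule with y <= mean(y).
```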

Table 4.1: RRMSE values for $\widehat{F}_{\text{B}}(t)$ (shortened as ‘B’), $\widehat{F}_{\text{P}}(t)$ (shortened as ‘P’), and $\widehat{F}_{\text{R}}(t)$ (shortened as ‘R’), as well as their corresponding quantile estimators, under model $\xi_{1}$.
MAR MNAR
$\widehat{F}(t)$ $\hat{t}(\alpha)$ $\widehat{F}(t)$ $\hat{t}(\alpha)$
$n_{\text{B}}$ $\alpha$ B P R B P R B P R B P R
500 1% 2.10 1.14 0.76 1.59 1.09 0.89 2.01 1.19 0.72 1.60 1.19 0.93
10% 4.64 1.38 0.83 3.58 1.32 0.85 5.39 1.22 0.93 3.83 1.06 0.81
25% 6.64 1.25 0.90 5.63 1.19 0.93 9.09 1.12 1.36 6.34 1.02 1.22
50% 7.14 1.10 0.91 6.77 1.01 0.92 15.69 2.09 1.91 9.63 1.87 1.73
75% 6.33 1.22 0.93 7.34 1.21 0.96 9.17 2.91 2.14 13.49 2.79 2.04
90% 4.48 1.42 0.91 6.30 1.46 0.93 5.25 2.92 2.03 11.08 3.11 2.01
99% 1.56 1.12 0.77 3.10 1.42 1.32 1.67 1.59 1.13 3.47 2.41 1.24
1000 1% 1.83 1.13 0.73 1.38 1.09 0.87 1.75 1.17 0.70 1.30 1.13 0.89
10% 4.64 1.41 0.81 3.52 1.34 0.84 5.27 1.19 0.92 3.78 1.06 0.81
25% 6.37 1.19 0.84 5.33 1.12 0.86 9.05 1.09 1.33 6.24 0.99 1.18
50% 7.47 1.04 0.87 6.95 0.94 0.86 15.67 2.03 1.86 9.71 1.84 1.70
75% 6.31 1.15 0.84 7.16 1.13 0.86 9.33 2.95 2.15 13.77 2.84 2.05
90% 4.31 1.31 0.81 6.11 1.39 0.84 5.28 2.91 2.02 11.24 3.17 2.02
99% 1.62 1.14 0.74 2.94 1.41 1.30 1.64 1.62 1.15 3.32 2.48 1.22
5000 1% 1.62 1.11 0.73 1.13 1.09 0.87 1.62 1.20 0.71 1.11 1.11 0.89
10% 4.64 1.44 0.80 3.47 1.35 0.82 5.07 1.18 0.88 3.60 1.07 0.78
25% 6.39 1.22 0.81 5.49 1.15 0.86 8.78 1.06 1.30 6.07 0.98 1.16
50% 7.30 1.01 0.83 6.87 0.94 0.84 15.32 1.95 1.79 9.38 1.79 1.63
75% 6.40 1.16 0.84 7.43 1.17 0.86 8.98 2.75 1.99 13.53 2.70 1.94
90% 4.64 1.43 0.83 6.20 1.41 0.82 5.25 2.87 1.96 11.14 3.08 1.93
99% 1.53 1.11 0.70 2.77 1.44 1.29 1.65 1.65 1.16 3.13 2.43 1.21
10000 1% 1.60 1.18 0.71 1.11 1.11 0.87 1.61 1.22 0.71 1.11 1.10 0.88
10% 4.64 1.44 0.79 3.58 1.39 0.84 5.21 1.21 0.88 3.70 1.08 0.78
25% 6.45 1.22 0.82 5.41 1.15 0.85 9.16 1.08 1.32 6.32 0.98 1.18
50% 7.43 0.99 0.82 7.05 0.93 0.84 15.45 1.99 1.81 9.69 1.85 1.68
75% 6.64 1.17 0.85 7.61 1.15 0.88 9.08 2.79 2.03 13.33 2.70 1.94
90% 4.52 1.37 0.81 6.30 1.43 0.84 5.26 2.89 1.98 11.04 3.06 1.93
99% 1.50 1.10 0.70 2.72 1.41 1.26 1.52 1.52 1.06 3.01 2.35 1.16
Table 4.2: RRMSE values for $\widehat{F}_{\text{B}}(t)$ (shortened as ‘B’), $\widehat{F}_{\text{P}}(t)$ (shortened as ‘P’), and $\widehat{F}_{\text{R}}(t)$ (shortened as ‘R’), as well as their corresponding quantile estimators, under model $\xi_{2}$.
MAR MNAR
$\widehat{F}(t)$ $\hat{t}(\alpha)$ $\widehat{F}(t)$ $\hat{t}(\alpha)$
$n_{\text{B}}$ $\alpha$ B P R B P R B P R B P R
500 1% 2.03 2.90 5.59 1.55 3.58 4.66 1.74 2.10 4.54 1.42 2.54 3.56
10% 4.87 1.32 1.64 3.60 1.69 1.96 4.06 1.46 1.54 3.01 1.68 1.66
25% 6.89 2.32 1.04 5.32 2.51 1.31 6.90 2.00 0.86 4.77 1.88 0.92
50% 7.17 1.84 1.42 6.75 1.54 1.43 11.90 1.17 1.00 7.63 0.91 0.85
75% 5.96 2.42 1.27 6.68 1.78 0.99 8.55 4.48 2.96 10.66 3.34 2.35
90% 4.07 4.55 2.92 5.40 3.85 2.36 4.94 5.74 4.31 9.24 5.44 3.92
99% 1.52 2.22 2.01 2.66 4.45 2.57 1.56 2.22 2.11 3.08 5.21 3.28
1000 1% 1.88 2.90 5.66 1.36 3.59 4.73 1.47 2.08 4.49 1.16 2.64 3.70
10% 4.91 1.24 1.59 3.60 1.58 1.89 4.07 1.42 1.52 3.11 1.70 1.68
25% 6.93 2.32 0.99 5.22 2.47 1.23 6.86 1.92 0.78 4.80 1.87 0.86
50% 7.30 1.80 1.39 6.90 1.53 1.41 11.49 1.09 0.93 7.33 0.84 0.79
75% 6.03 2.31 1.11 6.74 1.73 0.86 8.44 4.42 2.90 10.17 3.18 2.23
90% 4.04 4.50 2.86 5.40 3.81 2.30 4.90 5.75 4.30 9.00 5.30 3.81
99% 1.49 2.24 2.03 2.44 4.39 2.52 1.59 2.33 2.21 2.87 5.26 3.32
5000 1% 1.60 2.71 5.31 1.16 3.46 4.65 1.20 1.90 4.26 0.91 2.42 3.47
10% 4.75 1.20 1.55 3.46 1.53 1.82 3.97 1.39 1.48 2.93 1.56 1.57
25% 6.83 2.29 0.94 5.20 2.45 1.17 6.72 1.86 0.74 4.70 1.82 0.81
50% 7.18 1.72 1.31 6.65 1.41 1.31 11.77 1.04 0.90 7.50 0.82 0.76
75% 5.82 2.22 1.00 6.39 1.63 0.76 8.52 4.44 2.90 10.36 3.24 2.26
90% 4.00 4.50 2.83 5.22 3.71 2.24 4.89 5.77 4.30 8.92 5.29 3.79
99% 1.42 2.23 2.02 2.21 4.36 2.44 1.51 2.28 2.16 2.72 5.32 3.28
10000 1% 1.59 2.73 5.35 1.11 3.50 4.59 1.21 1.93 4.41 0.92 2.60 3.68
10% 4.79 1.22 1.53 3.41 1.52 1.74 3.97 1.34 1.52 2.96 1.55 1.62
25% 6.80 2.32 0.96 5.16 2.48 1.20 6.62 1.82 0.74 4.68 1.79 0.82
50% 7.25 1.81 1.36 6.75 1.48 1.36 11.95 1.05 0.91 7.64 0.82 0.77
75% 5.73 2.15 0.95 6.35 1.59 0.73 8.58 4.45 2.91 10.35 3.24 2.25
90% 3.99 4.47 2.80 5.35 3.79 2.27 4.99 5.89 4.38 8.99 5.30 3.80
99% 1.36 2.15 1.94 2.10 4.15 2.34 1.51 2.27 2.16 2.69 5.28 3.30
Table 4.3: RRMSE values for $\widehat{F}_{\text{B}}(t)$ (shortened as ‘B’), $\widehat{F}_{\text{P}}(t)$ (shortened as ‘P’), and $\widehat{F}_{\text{R}}(t)$ (shortened as ‘R’), as well as their corresponding quantile estimators, under model $\xi_{3}$.
MAR MNAR
$\widehat{F}(t)$ $\hat{t}(\alpha)$ $\widehat{F}(t)$ $\hat{t}(\alpha)$
$n_{\text{B}}$ $\alpha$ B P R B P R B P R B P R
500 1% 1.89 2.24 0.55 1.54 5.69 4.89 2.06 2.24 0.64 1.60 5.12 4.48
10% 3.78 6.37 0.76 3.15 5.64 0.85 5.55 5.18 2.94 3.85 3.14 1.94
25% 5.10 4.51 0.84 4.43 3.31 0.81 9.24 2.49 5.84 6.46 1.34 4.36
50% 5.69 1.59 0.95 5.61 1.09 0.98 16.43 12.25 8.28 10.52 7.28 7.22
75% 5.26 4.97 1.10 5.74 3.95 1.12 9.51 12.82 7.82 14.68 12.23 8.59
90% 3.87 6.00 1.08 5.08 5.76 1.16 5.43 7.68 5.57 12.21 13.93 7.87
99% 1.59 2.29 0.76 2.77 6.10 . 1.64 2.22 1.86 3.53 10.36 .
1000 1% 1.68 2.25 0.41 1.31 5.66 4.89 1.83 2.23 0.50 1.36 5.07 4.47
10% 3.66 6.40 0.60 3.01 5.57 0.70 5.45 5.18 2.92 3.82 3.16 1.93
25% 5.00 4.47 0.68 4.41 3.31 0.66 9.39 2.44 5.95 6.35 1.27 4.32
50% 5.59 1.30 0.74 5.39 0.88 0.75 15.77 11.78 7.95 10.05 6.96 6.92
75% 5.00 4.75 0.80 5.54 3.84 0.82 9.03 12.27 7.44 14.04 11.73 8.24
90% 3.89 6.17 0.80 4.98 5.79 0.86 5.19 7.38 5.35 11.46 13.16 7.44
99% 1.47 2.20 0.51 2.54 6.09 . 1.56 2.18 1.81 3.17 10.08 .
5000 1% 1.47 2.26 0.26 1.13 5.71 4.97 1.64 2.24 0.33 1.16 5.04 4.49
10% 3.66 6.54 0.47 2.96 5.60 0.58 5.45 5.28 2.90 3.84 3.25 1.95
25% 4.87 4.34 0.53 4.38 3.32 0.53 9.78 2.42 6.20 6.62 1.28 4.51
50% 5.71 1.12 0.60 5.46 0.76 0.60 16.49 12.29 8.29 10.23 7.12 7.07
75% 5.05 4.70 0.58 5.56 3.85 0.60 9.13 12.46 7.51 13.81 11.56 8.09
90% 3.79 6.09 0.52 4.90 5.71 0.64 5.25 7.47 5.40 11.78 13.59 7.66
99% 1.44 2.24 0.31 2.28 5.95 . 1.59 2.25 1.87 3.10 10.40 .
10000 1% 1.41 2.21 0.22 1.06 5.56 4.82 1.62 2.23 0.29 1.16 5.18 4.60
10% 3.42 6.11 0.41 2.80 5.38 0.54 5.51 5.35 2.92 3.78 3.22 1.92
25% 4.95 4.47 0.51 4.43 3.38 0.51 9.15 2.27 5.80 6.15 1.18 4.20
50% 5.38 1.02 0.53 5.29 0.72 0.55 16.06 12.01 8.09 9.96 6.93 6.88
75% 5.01 4.62 0.53 5.41 3.72 0.55 9.26 12.65 7.63 13.94 11.70 8.19
90% 3.77 6.02 0.46 4.86 5.64 0.58 5.27 7.50 5.43 11.71 13.53 7.65
99% 1.44 2.24 0.26 2.30 6.05 . 1.62 2.29 1.91 3.11 10.57 .
Table 4.4: RRMSE values for $\widehat{F}_{\text{B}}(t)$ (shortened as ‘B’), $\widehat{F}_{\text{P}}(t)$ (shortened as ‘P’), and $\widehat{F}_{\text{R}}(t)$ (shortened as ‘R’), as well as their corresponding quantile estimators, under model $\xi_{4}$.
MAR MNAR
$\widehat{F}(t)$ $\hat{t}(\alpha)$ $\widehat{F}(t)$ $\hat{t}(\alpha)$
$n_{\text{B}}$ $\alpha$ B P R B P R B P R B P R
500 1% 1.10 2.23 1.09 1.09 3.64 2.24 1.88 2.19 1.57 1.71 3.32 2.11
10% 2.49 6.03 0.99 2.08 6.51 0.99 4.67 4.88 2.56 3.39 3.78 1.87
25% 3.42 5.60 1.09 3.09 4.03 1.00 8.22 1.42 4.86 5.74 0.74 3.66
50% 4.14 2.08 1.54 3.90 1.18 1.38 14.34 11.38 7.55 8.81 5.85 5.96
75% 3.88 7.52 1.93 4.12 5.15 1.79 8.96 11.95 7.48 12.03 10.43 7.38
90% 2.79 6.76 1.88 3.36 7.31 1.69 5.26 7.63 5.48 10.17 12.05 7.00
99% 1.17 2.27 1.10 1.67 5.41 1.59 1.63 2.27 1.84 2.89 6.98 3.47
1000 1% 0.85 2.34 0.95 0.75 3.88 2.44 1.70 2.28 1.46 1.37 3.35 2.14
10% 2.31 5.92 0.74 1.94 6.60 0.80 4.76 5.14 2.53 3.47 4.00 1.85
25% 3.44 5.89 0.91 3.09 4.19 0.81 8.12 1.23 4.79 5.61 0.62 3.57
50% 4.20 1.76 1.40 3.97 1.01 1.26 14.38 11.48 7.61 8.79 5.85 5.97
75% 3.96 7.73 1.87 4.21 5.33 1.74 8.77 11.81 7.38 11.96 10.45 7.39
90% 2.81 6.97 1.83 3.45 7.63 1.62 4.97 7.20 5.19 9.61 11.52 6.72
99% 1.01 2.24 0.97 1.38 5.32 . 1.58 2.22 1.77 2.73 6.89 .
5000 1% 0.43 2.22 0.73 0.31 3.84 2.38 1.48 2.26 1.31 1.07 3.34 2.11
10% 2.28 6.29 0.57 1.84 6.82 0.63 4.95 5.50 2.51 3.56 4.28 1.80
25% 3.40 6.02 0.75 3.01 4.22 0.65 8.21 1.07 4.77 5.74 0.54 3.60
50% 4.18 1.52 1.32 3.93 0.87 1.19 13.88 11.10 7.32 8.68 5.75 5.87
75% 3.92 7.82 1.83 4.08 5.27 1.66 8.57 11.60 7.23 11.68 10.25 7.23
90% 2.82 7.16 1.83 3.41 7.81 1.60 5.24 7.62 5.48 10.21 12.31 7.17
99% 0.89 2.24 0.89 1.20 5.31 . 1.51 2.18 1.71 2.57 6.81 .
10000 1% 0.38 2.26 0.71 0.24 3.68 2.31 1.43 2.22 1.27 1.04 3.33 2.12
10% 2.21 6.09 0.52 1.79 6.63 0.59 4.80 5.33 2.44 3.41 4.10 1.75
25% 3.21 5.65 0.70 2.92 4.05 0.61 8.17 1.03 4.76 5.54 0.50 3.48
50% 3.99 1.45 1.27 3.71 0.82 1.13 14.04 11.23 7.42 8.69 5.79 5.89
75% 3.82 7.63 1.79 4.01 5.19 1.64 9.04 12.24 7.63 12.08 10.60 7.49
90% 2.78 7.07 1.81 3.31 7.58 1.56 5.12 7.43 5.36 9.83 11.88 6.92
99% 0.92 2.36 0.92 1.19 5.51 . 1.58 2.29 1.79 2.63 7.01 .

4.2 CDF Estimation

Let us begin with the CDF estimators. Starting with model $\xi_{1}$, under MAR, $\widehat{F}_{\text{R}}(t)$ drastically outperformed $\widehat{F}_{\text{B}}(t)$ even for small $n_{\text{B}}$, which implies superiority under ignorability and a correctly specified regression model. $\widehat{F}_{\text{P}}(t)$ still performed well overall, though, with RRMSE that was always larger than that of $\widehat{F}_{\text{R}}(t)$. Under MNAR, all estimators experienced significant reductions in efficiency, but $\widehat{F}_{\text{R}}(t)$ still performed the best overall, which suggests robustness to violations of ignorability for correctly specified regression models. For model $\xi_{2}$, under both MAR and MNAR, $\widehat{F}_{\text{R}}(t)$ tended to perform optimally for $t$ near the center of the distribution, but it was consistently outperformed by $\widehat{F}_{\text{B}}(t)$ at the distribution tails (i.e., at the 1st and 99th percentiles). Despite $\widehat{F}_{\text{P}}(t)$ generally performing worse than $\widehat{F}_{\text{R}}(t)$, with a noticeable exception at the 10th percentile, it tended to perform better than $\widehat{F}_{\text{B}}(t)$, suggesting gains in efficiency even under model misspecification [30]. Under model $\xi_{3}$ and MAR, $\widehat{F}_{\text{R}}(t)$ experienced impressive gains in efficiency compared to $\widehat{F}_{\pi}(t)$ despite severe model misspecification. $\widehat{F}_{\text{P}}(t)$, though, tended to perform worse than $\widehat{F}_{\text{R}}(t)$, which supports its endorsement if and only if the prediction error between $Y_{i}$ and $\hat{Y}_{i}$ is negligible for all $i\in\boldsymbol{A}$; otherwise, the consequences of $\widehat{F}_{\text{P}}(t)$’s ‘all-or-nothing’ approach will likely spoil any potential help from $\boldsymbol{B}$. Under MNAR, $\widehat{F}_{\text{R}}(t)$ still performed the best overall, but we noted severe efficiency losses that failed to dissipate with larger $n_{\text{B}}$. In fact, these losses were much greater in magnitude compared to those under MNAR for model $\xi_{1}$, which may imply sensitivity to violations of ignorability when the regression model is severely misspecified. Concluding with model $\xi_{4}$, under MAR, $\widehat{F}_{\text{R}}(t)$ exhibited superior performance at all percentiles except the 1st for $n_{\text{B}}\neq 500$, where it was outperformed by $\widehat{F}_{\text{B}}(t)$. Meanwhile, $\widehat{F}_{\text{P}}(t)$ generally performed worse than both $\widehat{F}_{\text{R}}(t)$ and $\widehat{F}_{\text{B}}(t)$, with efficiency losses surpassing those reported under model $\xi_{3}$. Conclusions under MNAR are virtually identical to those discussed for model $\xi_{3}$.

In summary, these results imply several important points. First, $\widehat{F}_{\text{R}}(t)$ appears to be fairly robust to violations of ignorability if the regression model is correctly specified, and robust to regression model misspecification if ignorability is satisfied. When both assumptions are violated, $\widehat{F}_{\text{R}}(t)$ will generally still perform better than $\widehat{F}_{\text{P}}(t)$ and $\widehat{F}_{\text{B}}(t)$, albeit with noticeable reductions in efficiency. Second, $\widehat{F}_{\text{P}}(t)$ generally performs worse than $\widehat{F}_{\text{R}}(t)$, but $\widehat{F}_{\text{P}}(t)$ may still perform well even under MNAR if the regression model is correctly specified. The final point emphasizes that $\widehat{F}_{\text{B}}(t)$, despite its lackluster performance overall, still performed well at the tails of $Y$’s distribution, even when its sample size was small ($<1\%$ of $\mathcal{U}$) and its missingness mechanism was MNAR. This is an encouraging result for researchers interested in estimating $F_{\text{N}}(t)$ at the distribution extrema without the use of a probability sample.

4.3 Quantile Estimation

In general, the performance of the three quantile estimators tended to match that of their CDF counterparts, with some noticeable exceptions. For one, $\hat{t}_{\text{R}}(.99)$, unlike $\hat{t}_{\text{B}}(.99)$ and $\hat{t}_{\text{P}}(.99)$, was missing for nearly every $n_{\text{B}}$ under models $\xi_{3}$ and $\xi_{4}$, even for MAR missingness. There is a reasonable explanation, however: if we were to compute $\widehat{F}_{\text{P}}(t)$ and $\widehat{F}_{\text{B}}(t)$ at every value of $\hat{Y}_{i}$ for $i\in\boldsymbol{A}$ and/or $Y_{j}$ for $j\in\boldsymbol{B}$, we will always encounter data points that serve either as the sample minimum (i.e., $\widehat{F}(t)\approx 0$) or the sample maximum (i.e., $\widehat{F}(t)\approx 1$), even if the CDF estimator itself is spurious. This is not always true for $\widehat{F}_{\text{R}}(t)$, though, because there may not exist an $\boldsymbol{X}_{i}$ in $\boldsymbol{A}$ that drives $\widehat{G}(\boldsymbol{X}_{i})$ near zero or unity, especially if the regression model is misspecified. In these cases, it may be better to substitute $\hat{t}_{\text{R}}(\alpha)$ with $\hat{t}_{\text{B}}(\alpha)$, especially for large $n_{\text{B}}$, as it is more likely to produce an accurate estimate of $t_{\text{N}}(\alpha)$ than $\hat{t}_{\text{P}}(\alpha)$. Another exception involves the performance of $\hat{t}_{\text{B}}(\alpha)$, $\hat{t}_{\text{P}}(\alpha)$, and $\hat{t}_{\text{R}}(\alpha)$ under MNAR for model $\xi_{4}$, which varied widely depending on the percentile in question; specifically, $\hat{t}_{\text{B}}(\alpha)$ performed the best for percentiles near the distribution extrema (i.e., 1st and 99th), $\hat{t}_{\text{P}}(\alpha)$ for those near the distribution center (i.e., 25th and 50th), and $\hat{t}_{\text{R}}(\alpha)$ elsewhere, despite $\widehat{F}_{\text{R}}(t)$ generally outperforming its CDF siblings. We suspect that these discrepancies result from model misspecification.

4.4 Variance Estimation

We conclude this section with results from a limited variance estimation study, which investigated several performance metrics for the variance estimators of $\widehat{F}_{\text{R}}(t)$ and $\hat{t}_{\text{R}}(\alpha)$ under $n_{\text{B}}\in\{500,\,10000\}$, models $\xi_{1}$ and $\xi_{3}$, and $n_{\text{sim}}=500$. The simulation settings were otherwise identical to those previously described.

We calculated two types of variance estimators for both $\widehat{F}_{\text{R}}(t)$ and $\hat{t}_{\text{R}}(\alpha)$. The first type, denoted by $V_{1}$, is given in Theorem 2 for $\widehat{F}_{\text{R}}(t)$ and by Eq. (3.7) for $\hat{t}_{\text{R}}(\alpha)$. Note that, due to the SRS design of $\boldsymbol{A}$, the variance estimator in Theorem 2 simplifies to

V_{1}\left(\widehat{F}_{\text{R}}(t)\right)=\widehat{\mathrm{Var}}\left(\widehat{F}_{\text{R}}(t)\right)=\frac{1-\frac{n_{\text{A}}}{N}}{n_{\text{A}}}s^{2}, (4.2)

where $s^{2}=\frac{1}{n_{\text{A}}-1}\sum_{i\in\boldsymbol{A}}\left(\widehat{G}(\boldsymbol{X}_{i})-\frac{1}{n_{\text{A}}}\sum_{i\in\boldsymbol{A}}{\widehat{G}(\boldsymbol{X}_{i})}\right)^{2}$. The second variance estimator, denoted by $V_{2}$, is an adaptation of the bootstrap variance procedure of [30], which is summarized as follows (a code sketch of both $V_{1}$ and $V_{2}$ follows the list):

1. Generate $L=500$ sets of replicate weights for sample $\boldsymbol{A}$ and bootstrap samples with replacement (of size $n_{\text{B}}$) from sample $\boldsymbol{B}$.

2. Use each weight-resample pair to calculate $\widehat{F}_{\text{R}}(t)$, $\hat{t}_{\text{R}}(\alpha)$, and $\widehat{F}_{\text{R}}(\hat{t}_{\text{R}}(\alpha))$.

3. Use both the replicate estimators (say, $\hat{\theta}^{l}$) and the estimators calculated from the original $\boldsymbol{A},\boldsymbol{B}$ sets (say, $\hat{\theta}$) to calculate

V_{2}\left(\widehat{F}_{\text{R}}(t)\right)=\widehat{\mathrm{Var}}_{\text{Boot}}\left(\widehat{F}_{\text{R}}(t)\right)=\frac{1}{L}\sum_{l=1}^{L}{\left(\widehat{F}^{l}_{\text{R}}(t)-\widehat{F}_{\text{R}}(t)\right)^{2}} (4.3)

V_{2}\Big(\hat{t}_{\text{R}}(\alpha)\Big)=\widehat{\mathrm{Var}}_{\text{Boot}}\Big(\hat{t}_{\text{R}}(\alpha)\Big)=\left(\frac{\hat{t}^{*}_{\text{UL}}-\hat{t}^{*}_{\text{LL}}}{2z_{\gamma/2}}\right)^{2}, (4.4)

where

\hat{t}^{*}_{\text{LL}}=\min_{t}\left(\widehat{F}_{\text{R}}(t)\geq\alpha-z_{\gamma/2}\widehat{\text{SE}}_{\text{Boot}}\left[\widehat{F}_{\text{R}}\left(\hat{t}_{\text{R}}(\alpha)\right)\right]\right) (4.5)

\hat{t}^{*}_{\text{UL}}=\min_{t}\left(\widehat{F}_{\text{R}}(t)\geq\alpha+z_{\gamma/2}\widehat{\text{SE}}_{\text{Boot}}\left[\widehat{F}_{\text{R}}\left(\hat{t}_{\text{R}}(\alpha)\right)\right]\right) (4.6)

and $\widehat{\text{SE}}_{\text{Boot}}\coloneqq\sqrt{\widehat{\mathrm{Var}}_{\text{Boot}}}$.
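As a hedged sketch, both variance estimators can be coded as follows under the SRS design of $\boldsymbol{A}$, reusing the F_R() function sketched in Section 3.1 (all names are illustrative):

```r
# V1: Eq. (4.2); Ghat is the vector G-hat(X_i) for i in A.
V1_srs <- function(Ghat, nA, N) {
  (1 - nA / N) / nA * var(Ghat)      # var() uses the 1/(nA - 1) divisor, as in s^2
}

# V2: Eq. (4.3) via the bootstrap; resampling A stands in for replicate
# weights under SRS, and B is resampled with replacement.
V2_boot <- function(t, A, B, N, L = 500) {
  est  <- F_R(t, A, B, N)                          # estimate from the original samples
  reps <- replicate(L, {
    Astar <- A[sample(nrow(A), replace = TRUE), ]
    Bstar <- B[sample(nrow(B), replace = TRUE), ]
    F_R(t, Astar, Bstar, N)                        # replicate estimate F_R^l(t)
  })
  mean((reps - est)^2)
}
```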

We then compared both $V_{1}$ and $V_{2}$ to the ‘gold-standard’ Monte Carlo variance, defined as

\widehat{\mathrm{Var}}_{\text{MC}}=\frac{1}{n_{\text{sim}}-1}\sum_{m=1}^{n_{\text{sim}}}{\left(\hat{\theta}_{m}-\frac{1}{n_{\text{sim}}}\sum_{m=1}^{n_{\text{sim}}}{\hat{\theta}_{m}}\right)^{2}}, (4.7)

where $\hat{\theta}$ may denote either $\widehat{F}_{\text{R}}(t)$ or $\hat{t}_{\text{R}}(\alpha)$.

Finally, after $n_{\text{sim}}=500$ independent simulation runs, we calculated the following performance statistics at the 1st, 10th, 25th, 50th, 75th, 90th, and 99th percentiles:

• $\bar{\widehat{F}}_{\text{R}}$ ($\bar{\hat{t}}_{\text{R}}$): The mean of $\widehat{F}_{\text{R}}(t)$ ($\hat{t}_{\text{R}}(\alpha)$) across all simulation runs.

• CR: The coverage rate, or percentage of instances in which the respective population parameter was captured by the approximate 90% $Z$-confidence interval. For $\widehat{F}_{\text{R}}(t)$, this interval was calculated as either $\widehat{F}_{\text{R}}(t)\pm 1.645\times\sqrt{V_{1}\left(\widehat{F}_{\text{R}}(t)\right)}$ or $\widehat{F}_{\text{R}}(t)\pm 1.645\times\sqrt{V_{2}\left(\widehat{F}_{\text{R}}(t)\right)}$, and similarly for $\hat{t}_{\text{R}}(\alpha)$ using Eqs. (3.5) and (3.6) or (4.5) and (4.6).

  • AL: The average length of the confidence intervals.

• $\bar{\widehat{\mathrm{Var}}}$: The mean estimated variance of $\widehat{F}_{\text{R}}$ ($\hat{t}_{\text{R}}$), multiplied by $10$ for $\hat{t}_{\text{R}}(\alpha)$ or $10^{3}$ for $\widehat{F}_{\text{R}}(t)$ (to enhance readability).

• RB: The percent absolute relative bias of $\bar{\widehat{\mathrm{Var}}}$ relative to $\widehat{\mathrm{Var}}_{\text{MC}}$, $\mathrm{RB}=\frac{\left|\widehat{\mathrm{Var}}_{\text{MC}}-\bar{\widehat{\mathrm{Var}}}\right|}{\widehat{\mathrm{Var}}_{\text{MC}}}\times 100$, rounded to one decimal place.

Results for $\widehat{F}_{\text{R}}(t)$ are presented in Tables 4.5 and 4.6, whereas results for $\hat{t}_{\text{R}}(\alpha)$ are presented in Tables 4.7 and 4.8.

Table 4.5: Performance statistics for the variance estimators of $\widehat{F}_{\text{R}}(t)$ under model $\xi_{1}$.
CR AL $\bar{\widehat{\mathrm{Var}}}$ RB
$n_{\text{B}}$ $\alpha$ $F_{\text{N}}$ $\bar{\widehat{F}}_{\text{R}}$ $V_{1}$ $V_{2}$ $V_{1}$ $V_{2}$ $V_{1}$ $V_{2}$ $V_{1}$ $V_{2}$
500 MAR 1% 0.010 0.010 83.8 86.4 0.010 0.011 0.010 0.011 15.5 5.7
10% 0.100 0.099 85.2 88.0 0.035 0.038 0.113 0.131 23.0 10.7
25% 0.250 0.250 85.8 87.4 0.052 0.056 0.253 0.286 18.1 7.2
50% 0.500 0.499 86.4 86.6 0.061 0.064 0.346 0.382 14.7 5.8
75% 0.750 0.748 86.8 90.0 0.052 0.056 0.255 0.288 15.2 4.2
90% 0.900 0.901 84.8 87.8 0.035 0.038 0.113 0.131 17.5 4.5
99% 0.990 0.990 81.6 83.0 0.010 0.010 0.010 0.011 11.7 1.5
MNAR 1% 0.010 0.010 82.8 86.2 0.010 0.011 0.010 0.011 15.1 5.1
10% 0.100 0.100 85.0 87.4 0.035 0.037 0.113 0.131 22.9 11.0
25% 0.250 0.250 84.8 88.6 0.052 0.056 0.253 0.286 18.0 7.4
50% 0.500 0.499 86.4 88.2 0.061 0.064 0.347 0.382 14.7 5.8
75% 0.750 0.748 87.2 88.8 0.053 0.056 0.255 0.288 15.1 4.1
90% 0.900 0.900 86.0 88.0 0.035 0.038 0.113 0.131 17.4 4.5
99% 0.990 0.990 83.0 84.8 0.010 0.010 0.010 0.011 11.3 1.2
10000 MAR 1% 0.010 0.010 86.8 86.4 0.010 0.010 0.010 0.010 5.1 4.5
10% 0.100 0.100 87.6 88.2 0.035 0.035 0.113 0.113 8.8 8.9
25% 0.250 0.250 88.6 88.6 0.052 0.052 0.253 0.253 7.2 7.2
50% 0.500 0.500 89.0 89.0 0.061 0.061 0.346 0.346 6.5 6.5
75% 0.750 0.749 88.8 89.4 0.052 0.053 0.254 0.255 3.6 3.1
90% 0.900 0.901 87.6 87.8 0.035 0.035 0.113 0.114 6.1 5.4
99% 0.990 0.990 82.8 82.4 0.010 0.010 0.010 0.010 7.2 6.5
MNAR 1% 0.010 0.010 86.8 86.8 0.010 0.010 0.010 0.010 5.1 4.8
10% 0.100 0.100 88.2 88.2 0.035 0.035 0.113 0.114 8.9 8.1
25% 0.250 0.250 88.6 88.4 0.052 0.052 0.253 0.254 7.2 6.5
50% 0.500 0.500 88.6 89.4 0.061 0.061 0.346 0.347 6.5 6.0
75% 0.750 0.749 88.2 88.4 0.052 0.052 0.254 0.255 3.6 3.4
90% 0.900 0.901 87.8 87.6 0.035 0.035 0.113 0.114 6.1 5.5
99% 0.990 0.990 82.4 82.4 0.010 0.010 0.010 0.010 7.2 6.7
Table 4.6: Performance statistics for the variance estimators of $\widehat{F}_{\text{R}}(t)$ under model $\xi_{3}$.
CR AL $\bar{\widehat{\mathrm{Var}}}$ RB
$n_{\text{B}}$ Miss. $\alpha$ $F_{\text{N}}$ $\bar{\widehat{F}}_{\text{R}}$ $V_{1}$ $V_{2}$ $V_{1}$ $V_{2}$ $V_{1}$ $V_{2}$ $V_{1}$ $V_{2}$
500 MAR 1% 0.010 0.010 43.6 88.0 0.003 0.008 0.001 0.006 85.9 2.1
10% 0.100 0.098 59.4 89.0 0.018 0.035 0.030 0.114 72.3 4.8
25% 0.250 0.250 67.8 89.8 0.032 0.053 0.095 0.263 61.5 6.7
50% 0.500 0.499 71.6 91.8 0.040 0.061 0.148 0.341 51.9 10.8
75% 0.750 0.748 67.6 92.0 0.033 0.053 0.102 0.261 57.3 9.3
90% 0.900 0.899 63.2 90.4 0.020 0.036 0.036 0.120 68.8 4.1
99% 0.990 0.990 48.4 88.8 0.003 0.008 0.001 0.007 84.0 0.2
MNAR 1% 0.010 0.010 43.6 87.4 0.003 0.008 0.001 0.006 86.1 1.5
10% 0.100 0.097 58.8 89.4 0.018 0.035 0.030 0.114 72.4 5.1
25% 0.250 0.249 70.4 92.2 0.032 0.053 0.095 0.264 61.6 6.9
50% 0.500 0.498 73.8 93.6 0.040 0.061 0.148 0.340 52.0 10.6
75% 0.750 0.748 70.4 91.4 0.033 0.053 0.102 0.260 57.3 8.8
90% 0.900 0.899 62.0 90.6 0.020 0.036 0.036 0.120 68.9 3.6
99% 0.990 0.990 49.2 86.2 0.003 0.008 0.001 0.007 84.0 0.3
10000 MAR 1% 0.010 0.010 83.8 88.6 0.003 0.003 0.001 0.001 22.1 4.5
10% 0.100 0.098 84.2 86.8 0.018 0.019 0.029 0.033 5.5 7.8
25% 0.250 0.250 91.2 91.8 0.032 0.033 0.093 0.102 0.7 9.7
50% 0.500 0.499 89.8 91.2 0.040 0.041 0.146 0.155 5.1 12.1
75% 0.750 0.748 90.4 91.0 0.033 0.034 0.100 0.109 5.9 14.5
90% 0.900 0.899 89.6 91.4 0.019 0.021 0.035 0.039 2.0 14.6
99% 0.990 0.990 87.8 91.4 0.003 0.004 0.001 0.001 12.8 12.1
MNAR 1% 0.010 0.010 83.6 89.8 0.003 0.003 0.001 0.001 22.2 4.4
10% 0.100 0.098 86.2 88.0 0.018 0.019 0.029 0.033 5.6 7.7
25% 0.250 0.250 91.6 92.2 0.032 0.033 0.093 0.101 0.7 9.5
50% 0.500 0.499 90.0 91.6 0.040 0.041 0.145 0.155 5.0 11.7
75% 0.750 0.748 90.2 92.0 0.033 0.034 0.100 0.108 5.9 14.0
90% 0.900 0.899 90.8 92.6 0.019 0.021 0.035 0.039 2.1 14.3
99% 0.990 0.990 86.4 91.2 0.003 0.004 0.001 0.001 12.9 11.7
Table 4.7: Performance statistics for the variance estimators of $\hat{t}_{\text{R}}(\alpha)$ under model $\xi_{1}$.
CR AL $\bar{\widehat{\mathrm{Var}}}$ RB
$n_{\text{B}}$ Miss. $\alpha$ $t_{\text{N}}$ $\bar{\hat{t}}_{\text{R}}$ $V_{1}$ $V_{2}$ $V_{1}$ $V_{2}$ $V_{1}$ $V_{2}$ $V_{1}$ $V_{2}$
500 MAR 1% 1.353 1.583 71.8 75.4 1.140 1.228 1.497 1.718 13.6 0.9
10% 4.461 4.506 84.4 85.0 0.596 0.609 0.334 0.350 23.3 19.7
25% 6.277 6.298 85.2 85.2 0.493 0.498 0.227 0.231 17.3 15.7
50% 8.289 8.307 85.2 85.0 0.461 0.464 0.198 0.200 13.4 12.3
75% 10.298 10.331 86.2 87.6 0.497 0.502 0.230 0.235 15.9 14.1
90% 12.146 12.167 83.6 84.8 0.606 0.619 0.346 0.362 17.6 13.8
99% 15.291 15.597 67.6 70.0 1.396 1.509 2.310 2.652 26.9 16.1
MNAR 1% 1.353 1.579 72.8 75.4 1.149 1.246 1.508 1.780 12.1 3.7
10% 4.461 4.505 84.6 85.0 0.600 0.609 0.338 0.350 19.7 16.9
25% 6.277 6.298 84.6 86.0 0.491 0.497 0.224 0.230 16.9 14.8
50% 8.289 8.308 85.4 85.6 0.461 0.464 0.198 0.201 13.3 12.0
75% 10.298 10.332 86.4 86.0 0.496 0.502 0.230 0.235 14.8 12.9
90% 12.146 12.169 84.4 85.2 0.609 0.620 0.350 0.362 15.1 12.2
99% 15.291 15.590 68.1 69.3 1.419 1.502 2.385 2.647 23.5 15.1
10000 MAR 1% 1.353 1.572 73.4 74.2 1.140 1.162 1.485 1.554 4.8 0.3
10% 4.461 4.501 87.4 88.0 0.592 0.596 0.329 0.334 11.5 10.2
25% 6.277 6.292 89.0 88.8 0.491 0.491 0.225 0.225 6.3 6.3
50% 8.289 8.304 88.2 89.0 0.461 0.461 0.197 0.198 5.2 5.1
75% 10.298 10.329 88.0 88.6 0.497 0.498 0.230 0.231 2.2 1.7
90% 12.146 12.168 88.2 88.4 0.608 0.611 0.347 0.352 7.4 6.2
99% 15.291 15.583 66.6 67.6 1.425 1.460 2.412 2.492 16.4 13.6
MNAR 1% 1.353 1.572 73.2 74.2 1.137 1.163 1.482 1.545 5.2 1.1
10% 4.461 4.502 87.4 87.6 0.594 0.600 0.332 0.339 11.4 9.5
25% 6.277 6.292 88.6 88.6 0.492 0.494 0.225 0.227 6.0 5.1
50% 8.289 8.304 89.0 89.2 0.460 0.461 0.196 0.198 6.5 6.0
75% 10.298 10.328 88.4 88.4 0.496 0.497 0.229 0.230 3.6 3.2
90% 12.146 12.167 89.0 89.0 0.607 0.611 0.346 0.351 6.9 5.5
99% 15.291 15.580 66.2 67.0 1.422 1.454 2.413 2.479 15.3 13.0
Table 4.8: Performance statistics for the variance estimators of $\hat{t}_{\text{R}}(\alpha)$ under model $\xi_{3}$.
CR AL $\bar{\widehat{\mathrm{Var}}}$ RB
$n_{\text{B}}$ Miss. $\alpha$ $t_{\text{N}}$ $\bar{\hat{t}}_{\text{R}}$ $V_{1}$ $V_{2}$ $V_{1}$ $V_{2}$ $V_{1}$ $V_{2}$ $V_{1}$ $V_{2}$
500 MAR 1% -3.919 -2.937 . . . . . . . .
10% -2.688 -2.651 53.8 60.8 0.123 0.145 0.015 0.021 73.7 63.2
25% -1.938 -1.930 65.8 68.4 0.122 0.127 0.014 0.015 62.1 58.9
50% -1.122 -1.114 72.2 72.0 0.122 0.123 0.014 0.014 52.6 51.4
75% -0.315 -0.302 67.0 69.8 0.124 0.128 0.014 0.015 57.9 54.8
90% 0.396 0.419 60.5 69.1 0.134 0.166 0.018 0.027 66.1 48.2
99% 1.566 . . . . . . . . .
MNAR 1% -3.919 -2.933 . . . . . . . .
10% -2.688 -2.648 53.8 61.0 0.123 0.147 0.015 0.021 73.5 62.2
25% -1.938 -1.929 69.0 71.4 0.122 0.127 0.014 0.015 61.1 58.1
50% -1.122 -1.113 73.6 74.6 0.122 0.124 0.014 0.014 50.2 48.3
75% -0.315 -0.301 69.4 71.8 0.125 0.130 0.014 0.016 56.7 52.8
90% 0.396 0.423 61.2 71.0 0.138 0.170 0.019 0.029 66.0 47.9
99% 1.566 . . . . . . . . .
10000 MAR 1% -3.919 -2.911 . . . . . . . .
10% -2.688 -2.656 73.0 77.8 0.120 0.133 0.014 0.017 24.5 6.7
25% -1.938 -1.932 89.0 90.4 0.121 0.124 0.014 0.014 4.0 0.5
50% -1.122 -1.115 89.4 89.4 0.121 0.122 0.014 0.014 4.9 7.2
75% -0.315 -0.303 88.6 89.2 0.124 0.127 0.014 0.015 5.7 10.5
90% 0.396 0.420 78.8 84.2 0.132 0.147 0.017 0.021 15.2 7.6
99% 1.566 . . . . . . . . .
MNAR 1% -3.919 -2.910 . . . . . . . .
10% -2.688 -2.655 73.0 77.4 0.121 0.133 0.014 0.017 20.1 1.9
25% -1.938 -1.931 88.6 89.2 0.121 0.124 0.014 0.014 1.1 3.3
50% -1.122 -1.114 88.6 89.2 0.121 0.122 0.014 0.014 7.9 10.1
75% -0.315 -0.303 88.8 89.4 0.124 0.126 0.014 0.015 9.2 13.3
90% 0.396 0.421 80.6 85.7 0.132 0.148 0.017 0.022 14.8 9.3
99% 1.566 . . . . . . . . .

Starting with $\widehat{F}_{\text{R}}(t)$, under model $\xi_{1}$ and $n_{\text{B}}=500$, the coverage rates based on $V_{1}$ were quite close to the nominal confidence level for both MAR and MNAR missingness mechanisms, albeit lower at the distribution extrema due to smaller variance. The same pattern held for $V_{2}$, with rates even closer to the nominal 90%; this was largely due to a less-biased $\bar{\widehat{\mathrm{Var}}}$, suggesting that [30]'s bootstrap variance estimator outperforms its asymptotic competitor when $n_{\text{B}}$ is rather small. Yet, as $n_{\text{B}}$ increased from 0.5% to 10% of $N$, the differences between the two variance estimators became negligible, producing coverage rates that were consistently near the nominal level for all $\alpha\neq 99\%$. Moving to model $\xi_{3}$ with $n_{\text{B}}=500$, we note a severely increased degree of bias for the asymptotic variance estimator, resulting in liberal confidence intervals that excluded $F_{\text{N}}(t)$ at disproportionate rates. The bootstrap variance estimator, however, continued to perform well, suggesting robustness to violations of ignorability and model misspecification. For large $n_{\text{B}}$, the differences again shrank to practically negligible magnitudes, though $V_{1}$ tended to have less bias than $V_{2}$.

Let us now transition to quantile estimation. Starting with model $\xi_{1}$, we note trends similar to those above, with several exceptions. First, although the RB values of the bootstrap variance estimator remained smaller than those of its asymptotic-theory counterpart, both were larger than under CDF estimation, resulting in coverage rates that were somewhat lower on average and markedly lower at the distribution extrema. These trends were amplified under model $\xi_{3}$ with $n_{\text{B}}=500$ but became much milder as $n_{\text{B}}$ grew. Still, most performance statistics remained undefined at the 1st and 99th percentiles, as only one run out of $n_{\text{sim}}=500$ produced a defined estimate of $\hat{t}_{\text{R}}$ (see Section 4.3 for a discussion of this phenomenon). In aggregate, these results imply that both variance estimators of $\hat{t}_{\text{R}}$ are (generally) robust to violations of ignorability and model misspecification, but only if $n_{\text{B}}$ is large. If $n_{\text{B}}$ is small, the two variance estimators will indeed be similar, but will likely be severely biased as well.

5 Real Data Example

In this section, we illustrate the performance of $\widehat{F}_{\text{R}}(t)$ and $\widehat{F}_{\text{P}}(t)$ using data from the National Health and Nutrition Examination Survey (NHANES) published by the Centers for Disease Control and Prevention (CDC) [23]. NHANES is a multistage stratified random sample designed to obtain and disseminate detailed information on the health and nutritional status of a nationally representative sample of non-institutionalized individuals in the United States. Our goal is to estimate the distribution function of total cholesterol (TCHL, $Y$, in mg/dL) in the population of U.S. adults using biological sex ($X_{1}$), age ($X_{2}$), glycohemoglobin ($X_{3}$, in %), triglycerides ($X_{4}$, in mg/dL), direct high-density lipoprotein cholesterol (HDL, $X_{5}$, in mg/dL), body mass index (BMI, $X_{6}$, in kg/m$^{2}$), and pulse ($X_{7}$). We used the subsample of individuals with available data in the 2015-2016 NHANES cohort as the probability sample ($n_{\text{A}}=$ 2,474) and those with available data in the 2017-2020 cohort as the nonprobability sample ($n_{\text{B}}=$ 3,770). Notably, observations in $\boldsymbol{B}$ collected during the 2019-2020 period were deemed by the CDC to be nonrepresentative of the population and had to be combined with the previous cohort to ensure respondent confidentiality. Absolute relative biases ($\mathrm{RB}$) of our estimators relative to $\widehat{F}_{\pi}(t)$ and $\hat{t}_{\pi}(\alpha)$ are presented in Table 5.3, whereas bootstrap estimates of the variance are presented in Tables 5.4 and 5.5. To accommodate the lack of information regarding the true regression function and $\boldsymbol{B}$'s missingness mechanism, we also report various statistics concerning the estimated coefficients (Table 5.1), model fit (Table 5.2), and model diagnostics (Figure 1).
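For transparency, the following is a minimal R sketch of the two model fits underlying Tables 5.1 and 5.2; the data frames A and B, the variable names, and the single-stage design specification are simplifying assumptions (the actual NHANES design involves strata and PSUs).

    library(survey)

    f <- TCHL ~ sex + age + ghb + trig + hdl + bmi + pulse

    fit_B <- lm(f, data = B)                               # naive fit on the convenience sample
    des_A <- svydesign(ids = ~1, weights = ~wt, data = A)  # simplified single-stage design
    fit_A <- svyglm(f, design = des_A)                     # survey-weighted fit on A

    # 90% t-confidence intervals from B; flag whether each survey-weighted
    # coefficient from A is captured (final column of Table 5.1)
    ci_B    <- confint(fit_B, level = 0.90)
    covered <- coef(fit_A) >= ci_B[, 1] & coef(fit_A) <= ci_B[, 2]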

Table 5.1: Table of coefficients, along with various statistics. We report (a) the survey-weighted $\hat{\beta}$ from $\boldsymbol{A}$ (obtained using [34]'s svyglm() function in R; $\hat{\beta}_{\boldsymbol{A}}$), (b) naïve coefficients from $\boldsymbol{B}$ ($\hat{\beta}_{\boldsymbol{B}}$), (c) the respective SEs, 90% $t$-confidence intervals, $t$-statistics, and $p$-values from $\boldsymbol{B}$, and (d) indicators for $\mathrm{LL}\leq\hat{\beta}_{\boldsymbol{A}}\leq\mathrm{UL}$.
$\boldsymbol{A}$ $\boldsymbol{B}$ SE LL UL $t$ $\Pr(T\geq|t|)$ $\hat{\beta}_{\boldsymbol{A}}\in\mathrm{CI}$
β^0\hat{\beta}_{0} 87.62 93.35 5.62 84.10 102.60 16.60 0.00 Y
β^1\hat{\beta}_{1}: Male 0.44 2.12 1.24 0.08 4.15 1.71 0.09 Y
β^2\hat{\beta}_{2} 0.28 0.25 0.03 0.19 0.30 7.79 0.00 Y
β^3\hat{\beta}_{3} -1.71 -1.03 0.57 -1.97 -0.09 -1.80 0.07 Y
β^4\hat{\beta}_{4} 0.27 0.20 0.01 0.18 0.21 28.31 0.00 N
β^5\hat{\beta}_{5} 0.93 1.00 0.04 0.93 1.07 22.57 0.00 Y
β^6\hat{\beta}_{6} 0.43 0.25 0.09 0.11 0.39 2.90 0.00 N
β^7\hat{\beta}_{7} 0.04 -0.02 0.05 -0.11 0.06 -0.43 0.67 Y
Table 5.2: Model fit statistics. We report the multiple $R^{2}$, adjusted $R^{2}$, $F$-statistic, residual standard error, and numerator (denominator) degrees of freedom.
$R^{2}$ $R^{2}_{\text{A}}$ $F$ SE $df_{\text{num}}$ $df_{\text{denom}}$
0.26 0.26 186.08 35.55 7 3762
Figure 1: Diagnostic plots related to the linear regression model. Here we report (a) residuals against the fitted values (for linearity), (b) Normal quantile-quantile plot (for residual normality), (c) scale-location plot (for homoscedasticity), and (d) standardized residuals against leverage (for outlier detection).
Table 5.3: Percent absolute relative bias ($\mathrm{RB}$) of $\widehat{F}_{\text{B}}(t)$, $\widehat{F}_{\text{P}}(t)$, and $\widehat{F}_{\text{R}}(t)$, as well as their respective quantile estimators, relative to the HT equivalents $\widehat{F}_{\pi}(t)$ and $\hat{t}_{\pi}(\alpha)$, using the 2015-2016 NHANES dataset ($\boldsymbol{A}$).
$\mathrm{RB}(\widehat{F})$ $\mathrm{RB}(\hat{t})$
$\alpha$ $\widehat{F}_{\pi}(t)$ $\hat{t}_{\pi}(\alpha)$ B P R B P R
1% 0.01 107.00 99.00 100.00 52.49 6.54 38.71 25.51
10% 0.10 138.00 50.99 99.49 17.67 5.80 15.87 0.37
25% 0.25 158.00 30.34 70.55 10.17 5.06 7.07 2.03
50% 0.51 184.00 16.59 16.92 6.95 4.89 2.38 2.40
75% 0.75 212.00 7.53 21.61 4.53 3.77 8.21 2.41
90% 0.90 244.00 3.56 9.85 2.68 4.51 14.24 3.60
99% 0.99 295.00 0.12 0.77 0.04 0.68 18.69 0.50
Table 5.4: Performance of the bootstrap variance estimator of $\widehat{F}_{\text{R}}(t)$ under the NHANES data example.
$\alpha$ $\widehat{F}_{\pi}(t)$ $\widehat{F}_{\text{R}}(t)$ LL UL $\widehat{\mathrm{Var}}_{\text{Boot}}\times 10^{3}$ $\widehat{F}_{\pi}(t)\in\text{CI}$
1% 0.010 0.015 0.008 0.023 0.020 Y
10% 0.102 0.121 0.105 0.136 0.089 N
25% 0.254 0.280 0.258 0.302 0.182 N
50% 0.510 0.546 0.519 0.572 0.253 N
75% 0.752 0.786 0.763 0.809 0.197 N
90% 0.904 0.928 0.915 0.941 0.062 N
99% 0.990 0.990 0.985 0.994 0.008 Y
Table 5.5: Performance of the bootstrap variance estimator of $\hat{t}_{\text{R}}(\alpha)$ under the NHANES data example.
$\alpha$ $\hat{t}_{\pi}(\alpha)$ $\hat{t}_{\text{R}}(\alpha)$ LL UL $\widehat{\mathrm{Var}}_{\text{Boot}}$ $\hat{t}_{\pi}(\alpha)\in\text{CI}$
1% 107.00 134.29 115.23 153.35 14.78 N
10% 138.00 137.49 118.20 156.77 4.23 Y
25% 158.00 154.79 134.33 175.25 0.62 Y
50% 184.00 179.58 157.54 201.63 0.86 Y
75% 212.00 206.89 183.23 230.54 1.61 Y
90% 244.00 235.22 209.99 260.44 3.18 Y
99% 295.00 296.49 268.17 324.81 549.83 Y

Before we address the relative biases and bootstrap variances, let us briefly discuss various characteristics of the regression model. Starting with Table 5.1, we note that six out of eight (75%) of the survey-weighted coefficients from $\boldsymbol{A}$ were captured in their respective 90% confidence intervals (constructed from the regression model fit on $\boldsymbol{B}$). Interestingly, the coefficient associated with pulse ($\hat{\beta}_{7}$) had a very high $p$-value of 0.67, which may suggest that one's pulse, after controlling for the remaining six covariates, is not a significant predictor of TCHL. A similar argument extends to biological sex: the estimated average increase in TCHL for biological males relative to biological females was only 0.44 mg/dL in sample $\boldsymbol{A}$ (2.12 mg/dL in sample $\boldsymbol{B}$), which is practically insignificant. Of course, these interpretations are likely undermined by severe model misspecification, given the small $R^{2}_{\text{A}}$ in Table 5.2 and the pronounced nonlinearity, heteroscedasticity, and outliers in Figure 1.

Yet, despite the (likely) misspecified regression model, the relative biases of $\widehat{F}_{\text{R}}(t)$ were notably smaller than those of the other two CDF estimators. Among the quantile estimators, $\hat{t}_{\text{R}}(\alpha)$ performed best overall, though it was noticeably outperformed at the 1st percentile and slightly outperformed at the 50th; the latter is likely coincidental, whereas the former is most likely due to $\hat{t}_{\text{R}}(\alpha)$'s sensitivity to distribution extrema (see Section 4.3). In aggregate, these results match those from our simulations, implying the superiority of $\widehat{F}_{\text{R}}(t)$ over $\widehat{F}_{\text{B}}(t)$ and $\widehat{F}_{\text{P}}(t)$ even when the regression function is misspecified.

We conclude with a brief discussion of the performance of the bootstrap variance estimators under the NHANES data. Starting with $\widehat{F}_{\text{R}}(t)$ in Table 5.4, $\widehat{F}_{\pi}(t)$ was excluded from the intervals at the 10th, 25th, 50th, 75th, and 90th percentiles; yet, in Table 5.5, $\hat{t}_{\pi}(\alpha)$ was included in all intervals except at the 1st percentile. It should be noted that these intervals are not designed to capture $\widehat{F}_{\pi}(t)$ or $\hat{t}_{\pi}(\alpha)$; rather, they target $F_{\text{N}}(t)$ and $t_{\text{N}}(\alpha)$. These results should therefore be read cautiously.

6 Conclusion

In this paper, we considered the problem of estimating the finite population distribution function using monotone missing data from probability and nonprobability samples. We proposed a residual-based CDF estimator, $\widehat{F}_{\text{R}}(t)$, that substitutes $\mathbbm{1}(Y_{i}\leq t)$ in $\widehat{F}_{\pi}(t)$, the Horvitz-Thompson estimator of the finite population CDF, with $\widehat{G}\left((t-\hat{Y}_{i})/\nu(\boldsymbol{X}_{i})\right)$, where $\widehat{G}$ is the empirical CDF of the estimated regression residuals from sample $\boldsymbol{B}$. Under ignorability of the missingness mechanism of sample $\boldsymbol{B}$, a correctly specified model, and minor regularity conditions, both $\widehat{F}_{\text{R}}(t)$ and its corresponding quantile estimator are shown to be asymptotically unbiased for their respective population parameters. Our empirical results overwhelmingly support this claim, with the added benefit of robustness to violations of ignorability if the regression model holds, and robustness to model misspecification if ignorability holds. When both assumptions are violated, $\widehat{F}_{\text{R}}(t)$ still performed better than $\widehat{F}_{\text{B}}(t)$, the naïve empirical CDF of $\boldsymbol{B}$, and $\widehat{F}_{\text{P}}(t)$, the plug-in CDF estimator, albeit with noticeable reductions in efficiency. We also defined and contrasted two variance estimators, demonstrating that the bootstrap estimator tends to produce smaller relative bias than the asymptotic-based variance estimator. In future research, we seek to extend this work to nonparametric regression, which can produce CDF and quantile estimators that are more robust to model misspecification.
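To make the construction concrete, we close with a minimal R sketch of $\widehat{F}_{\text{R}}(t)$ and its quantile inversion under a linear working model with homoscedastic errors ($\nu(\boldsymbol{x})\equiv 1$); all object names are illustrative, and the grid-based inversion is one of several possible implementations.

    # y_B, X_B: response and covariate data frame in the convenience sample B;
    # X_A, w_A: covariates and design weights (1/pi_i) in the probability sample A;
    # N: finite population size
    cdf_residual <- function(t_grid, y_B, X_B, X_A, w_A, N) {
      fit    <- lm(y_B ~ ., data = X_B)      # working regression trained on B
      e_hat  <- resid(fit)                   # estimated residuals
      yhat_A <- predict(fit, newdata = X_A)  # transported predictions on A
      # For each t: G_hat evaluated at (t - yhat_i)/nu(x_i), with nu = 1 here,
      # averaged over A with Horvitz-Thompson weights
      vapply(t_grid, function(t) {
        G_hat <- vapply(t - yhat_A, function(u) mean(e_hat <= u), numeric(1))
        sum(w_A * G_hat) / N
      }, numeric(1))
    }

    # t_R(alpha) = min{ t : F_R(t) >= alpha }, searched over the same grid
    quantile_residual <- function(alpha, t_grid, Fvals) {
      min(t_grid[Fvals >= alpha])
    }

For example, one would compute Fvals <- cdf_residual(t_grid, ...) once and then invert it at any desired level via quantile_residual(0.5, t_grid, Fvals).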

Declarations

Data Availability Statement

The data that support the findings of this study are publicly available from the United States Centers for Disease Control and Prevention (CDC) at https://wwwn.cdc.gov/nchs/nhanes/Default.aspx.

Funding

This work was funded by the Chancellor’s Distinguished Fellowship, a Title III HBGI grant from the U.S. Department of Education.

References

  • [23] Centers for Disease Control and Prevention (CDC) “NHANES - National Health and Nutrition Examination Survey” https://www.cdc.gov/nchs/nhanes/index.htm (visited: 2023-10-11) National Center for Health Statistics (NCHS), 2015-2020
  • [24] Richard L Chambers and R Dunstan “Estimating distribution functions from survey data” In Biometrika 73.3 Oxford University Press, 1986, pp. 597–604
  • [25] Sixia Chen, Shu Yang and Jae Kwang Kim “Nonparametric mass imputation for data integration” In Journal of Survey Statistics and Methodology 10.1 Oxford University Press, 2022, pp. 1–24
  • [26] Carol A Francisco and Wayne A Fuller “Quantile estimation with a complex survey design” In The Annals of Statistics JSTOR, 1991, pp. 454–469
  • [27] Daniel G Horvitz and Donovan J Thompson “A generalization of sampling without replacement from a finite universe” In Journal of the American Statistical Association 47.260 Taylor & Francis, 1952, pp. 663–685
  • [28] Cary T Isaki and Wayne A Fuller “Survey design under the regression superpopulation model” In Journal of the American Statistical Association 77.377 Taylor & Francis, 1982, pp. 89–96
  • [29] Alicia A Johnson, F Jay Breidt and Jean D Opsomer “Estimating distribution functions from survey data using nonparametric regression” In Journal of Statistical Theory and Practice 2 Springer, 2008, pp. 419–431
  • [30] Jae Kwang Kim, Seho Park, Yilin Chen and Changbao Wu “Combining non-probability and probability survey samples through mass imputation” In Journal of the Royal Statistical Society Series A: Statistics in Society 184.3 Oxford University Press, 2021, pp. 941–963
  • [31] JG Kovar, JNK Rao and CFJ Wu “Bootstrap and other methods to measure errors in survey estimates” In Canadian Journal of Statistics 16.S1 Wiley Online Library, 1988, pp. 25–45
  • [32] Erich Leo Lehmann “Elements of large-sample theory” New York, NY: Springer, 1999
  • [33] Roderick JA Little and Donald B Rubin “Statistical analysis with missing data” Hoboken, New Jersey: John Wiley & Sons, 2019
  • [34] Thomas Lumley “Analysis of Complex Survey Samples” R package version 2.2 In Journal of Statistical Software 9.1, 2004, pp. 1–19
  • [35] Mateus Maia, Arthur R Azevedo and Anderson Ara “Predictive comparison between random machines and random forests” In Journal of Data Science 19.4 School of Statistics, Renmin University of China, 2021, pp. 593–614
  • [36] Ronald H Randles “On the asymptotic normality of statistics with estimated parameters” In The Annals of Statistics JSTOR, 1982, pp. 462–474
  • [37] JNK Rao, JG Kovar and HJ Mantel “On estimating distribution functions and quantiles from survey data using auxiliary information” In Biometrika JSTOR, 1990, pp. 365–375
  • [38] Marie-Hélène Roy and Denis Larocque “Robustness of random forests for regression” In Journal of Nonparametric Statistics 24.4 Taylor & Francis, 2012, pp. 993–1006
  • [39] Donald B Rubin “Inference and missing data” In Biometrika 63.3 Oxford University Press, 1976, pp. 581–592
  • [40] Carl-Erik Särndal, Bengt Swensson and Jan Wretman “Model Assisted Survey Sampling” New York, NY: Springer, 2003
  • [41] Erwan Scornet “Random forests and kernel methods” In IEEE Transactions on Information Theory 62.3 IEEE, 2016, pp. 1485–1500
  • [42] Anastasios A Tsiatis “Semiparametric theory and missing data” Springer, 2006
  • [43] Jianqiang C Wang and Jean D Opsomer “On asymptotic normality and variance estimation for nondifferentiable survey estimators” In Biometrika 98.1 Oxford University Press, 2011, pp. 91–106
  • [44] Ralph S Woodruff “Confidence intervals for medians and other position measures” In Journal of the American Statistical Association 47.260 Taylor & Francis, 1952, pp. 635–646