
Survey Data Integration for Distribution Function Estimation

Jeremy R.A. Flood and Sayed A. Mostafa; Department of Mathematics & Statistics, North Carolina A&T State University, Greensboro, NC, USA. Email: [email protected].
Abstract

Integration of probabilistic and non-probabilistic samples for the estimation of finite population totals (or means) has recently received considerable attention in the field of survey sampling; yet, to the best of our knowledge, this framework has not been extended to cumulative distribution function (CDF) estimation. To address this gap, we propose a novel CDF estimator that integrates data from probability samples with data from (potentially big) nonprobability samples. Assuming that a set of shared covariates is observed in both samples, while the response variable is observed only in the latter, the proposed estimator uses a survey-weighted empirical CDF of regression residuals trained on the convenience sample to estimate the CDF of the response variable. Under some regularity conditions, we show that our CDF estimator is both design-consistent for the finite population CDF and asymptotically normally distributed. Additionally, we define and study a quantile estimator based on the proposed CDF estimator. Furthermore, we use both the bootstrap and asymptotic formulae to estimate their respective sampling variances. Our empirical results show that the proposed CDF estimator is robust to model misspecification under ignorability, and robust to violations of ignorability under correct model specification. When both assumptions are violated, our residual-based CDF estimator still outperforms its ‘plug-in’ mass imputation and naïve siblings, albeit with noted decreases in efficiency.

Keywords: Data integration · Probability surveys · Nonprobability samples · Distribution functions · Quantile estimation

1 Introduction

Much of the literature on survey data integration focuses on estimating the finite population mean $\mu_{\text{N}}=\frac{1}{N}\sum_{i\in\mathcal{U}}{Y_{i}}$, where $\mathcal{U}$ denotes a finite population of size $N$, $Y$ is the variable of interest, and $Y_{i}$ is the value assigned to the $i$-th unit of the population. Given a probability sample $\boldsymbol{A}$ of size $n_{\text{A}}$ with inclusion probabilities $\pi_{i}$ for all $i\in\boldsymbol{A}$, the Horvitz–Thompson (HT) [27] finite population mean estimator, $\hat{\mu}_{\pi}=\frac{1}{N}\sum_{i\in\boldsymbol{A}}{\pi^{-1}_{i}Y_{i}}$, is design-unbiased with a relatively simple closed-form sampling variance [40]. An interesting scenario arises when $Y_{i}$ is missing for all $i\in\boldsymbol{A}$, but a set of covariates, say $\boldsymbol{X}_{i}$, remains fully measured. Furthermore, assume that there exists a nonprobability (convenience) sample, $\boldsymbol{B}$, in which $Y_{j}$ and $\boldsymbol{X}_{j}$ are both fully measured for all $j\in\boldsymbol{B}$. Although $\hat{\mu}_{\pi}$ cannot be directly calculated in $\boldsymbol{A}$, it is possible to ‘impute’ $\boldsymbol{A}$’s missing $Y$ values by fitting a regression model for $Y$ on $\boldsymbol{X}$ in sample $\boldsymbol{B}$; this approach is often referred to as mass imputation in the literature. [30] investigated the use of generalized linear models as the imputation model and established the consistency of their proposed parametric mass imputation estimator (PMIE). Their empirical results imply optimal performance when (a) the mean (regression) function in $\boldsymbol{B}$ can be ‘transported’ to $\boldsymbol{A}$ (suitably named transportability), and (b) the regression model used is reasonably specified. [25] extended this work to nonparametric regression, which makes little to no assumptions about the parametric form of $\mathbb{E}\left(Y|\boldsymbol{X}\right)$. There, they proposed mass imputation estimators based on kernel regression (NPMIEK) and generalized additive models (NPMIEG), and noted marked gains in efficiency for both compared to the PMIE.

A natural extension of this work involves estimating the finite population cumulative distribution function (CDF), $F_{\text{N}}(t)$, and the quantile function, $t_{\text{N}}(\alpha)$. The CDF of a finite population $\mathcal{U}$ is defined as the proportion of elements whose $Y_{u}$, $u\in\mathcal{U}$, lie at or below a given point $t$; that is,

F_{\text{N}}(t)=\frac{1}{N}\sum_{u\in\mathcal{U}}{\mathbbm{1}\left(Y_{u}\leq t\right)}, (1.1)

where $\mathbbm{1}(\cdot)$ returns one if the condition in its argument is satisfied, and zero otherwise. The quantile function, in turn, defines the $\alpha^{th}$ quantile as the smallest $Y_{u}$ among all $u\in\mathcal{U}$ such that $F_{\text{N}}(Y_{u})\geq\alpha$; or, equivalently,

t_{\text{N}}(\alpha)=\inf_{t}\big\{t:F_{\text{N}}(t)\geq\alpha\big\}. (1.2)

If $Y_{i}$ were measured for all $i\in\boldsymbol{A}$, then, as with $\hat{\mu}_{\pi}$, one could prove that the HT CDF estimator,

\widehat{F}_{\pi}(t)=\frac{1}{N}\sum_{i\in\boldsymbol{A}}{\pi^{-1}_{i}\mathbbm{1}(Y_{i}\leq t)},

alongside its corresponding quantile estimator,

\hat{t}_{\pi}(\alpha)=\inf_{t}\left\{t:\widehat{F}_{\pi}(t)\geq\alpha\right\},

are at least asymptotically unbiased with relatively simple variance expressions [24, 44]. Furthermore, if $\boldsymbol{X}_{u}$ were measured for all $u\in\mathcal{U}$, a more efficient estimator of $F_{\text{N}}(t)$ could be obtained by predicting (a) $\mathbbm{1}(Y_{k}\leq t)$ for $k\in\mathcal{U}\backslash\boldsymbol{A}$ or (b) $\mathbbm{1}(Y_{u}\leq t)$ for all $u\in\mathcal{U}$, and then correcting for design bias [37, 24, 29]. Neither of these approaches is compatible with the monotone missingness framework described above, though. To address this paucity in the literature, we modify the residual-based CDF estimator of [24], henceforth denoted $\widehat{F}_{\text{R}}(t)$, to allow for information integration between $\boldsymbol{A}$ and $\boldsymbol{B}$; the idea is to use a regression model built on $\boldsymbol{B}$ to obtain a vector of estimated residuals, $\hat{\epsilon}$, and use these residuals’ empirical CDF (eCDF) as a substitute for $\mathbbm{1}(Y_{i}\leq t)$ in $\widehat{F}_{\pi}(t)$. We also explore the use of $\hat{t}_{\text{R}}(\alpha)$ as a corresponding quantile estimator.
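For concreteness, the following minimal R sketch computes $\widehat{F}_{\pi}(t)$ and $\hat{t}_{\pi}(\alpha)$ from a fully observed probability sample; the inputs y (responses), d (design weights $\pi^{-1}_{i}$), and N are illustrative names, not objects defined in the paper.

```r
# Minimal sketch of the HT CDF and quantile estimators above.
F_pi <- function(t, y, d, N) sum(d * (y <= t)) / N

t_pi <- function(alpha, y, d, N) {
  tgrid <- sort(unique(y))                                   # candidate values of t
  Fvals <- vapply(tgrid, F_pi, numeric(1), y = y, d = d, N = N)
  min(tgrid[Fvals >= alpha])                                 # inf{t : F_pi(t) >= alpha}
}
```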

The rest of the paper is organized as follows. In Section 2, we introduce key concepts and set the notation used throughout the paper. Next, in Section 3, we introduce our residual eCDF estimator and derive its asymptotic properties; we then contrast it with a competing ‘plug-in’ CDF estimator in Section 3.3, and derive their corresponding quantile estimators in Section 3.4. Monte Carlo simulation results demonstrating the finite-sample performance of the proposed CDF and quantile estimators are provided in Section 4. In Section 5, we illustrate the proposed estimators using a real data application. We conclude in Section 6 with a discussion of the main results.

2 Background and Preliminaries

Let $\mathcal{U}=\{1,2,\dots,N\}$ denote an index set for the units in a finite population of size $N$, and let $\boldsymbol{A}$ denote an $n_{\text{A}}\times(p+1)$ probability sample drawn from $\mathcal{U}$ with measured variables $\{d,X_{1},X_{2},\cdots,X_{p}\}$, where $d_{i}\coloneqq\pi_{i}^{-1}$ represents the sampling weight for element $i$ and $X_{1},\cdots,X_{p}$ represent a set of covariates. In a similar vein, let $\boldsymbol{B}$ denote an $n_{\text{B}}\times(p+1)$ nonprobability sample from $\mathcal{U}$ with measured variables $\{Y,X_{1},X_{2},\cdots,X_{p}\}$, where $Y$ represents a response variable; note that $d$ is unmeasured in $\boldsymbol{B}$, which differentiates $\boldsymbol{B}$ from $\boldsymbol{A}$ as a convenience sample. The missingness pattern outlined here is summarized in Table 2.1.

Table 2.1: Data structure for the probability ($\boldsymbol{A}$) and nonprobability ($\boldsymbol{B}$) samples.
Sample | $d$ | $X_{1}$ | $X_{2}$ | $\cdots$ | $X_{p}$ | $Y$
$\boldsymbol{A}$ | ✓ | ✓ | ✓ | $\cdots$ | ✓ | ×
$\boldsymbol{B}$ | × | ✓ | ✓ | $\cdots$ | ✓ | ✓

To incorporate the covariates in estimating $F_{\text{N}}(t)$, we assume that the finite population is a realization from the following superpopulation model, $\xi$:

Y=m(\boldsymbol{X};\boldsymbol{\beta})+\nu(\boldsymbol{X})\epsilon, (2.1)

where $m(\boldsymbol{X};\boldsymbol{\beta})=\mathbb{E}_{\xi}\left(Y|\boldsymbol{X}\right)$ is a known function of $\boldsymbol{X}$ parameterized by an unknown parameter vector $\boldsymbol{\beta}$, $\nu(\cdot)$ is a known, strictly positive function, and $\epsilon$ is an error term satisfying $\mathbb{E}_{\xi}\left(\epsilon|\boldsymbol{X}\right)=0$ and $\mathbb{E}_{\xi}\left(\epsilon^{2}|\boldsymbol{X}\right)=\sigma^{2}_{\epsilon}$. Now, if $\mathcal{U}$ were observed in its entirety, a natural estimator of $\boldsymbol{\beta}$ could be chosen to solve the finite population score function

U(\boldsymbol{\beta})=\frac{1}{N}\sum_{u\in\mathcal{U}}{\Big(Y_{u}-m(\boldsymbol{X}_{u};\boldsymbol{\beta})\Big)\boldsymbol{W}\left(\boldsymbol{X}_{u};\boldsymbol{\beta}\right)}=0

for some $p$-dimensional function $\boldsymbol{W}$; this estimator is henceforth denoted as $\boldsymbol{\beta}_{\text{N}}$, since it requires full information on all $N$ population elements. Nonetheless, given that $Y$ is present within $\boldsymbol{B}$ alone, we estimate $\boldsymbol{\beta}$ by solving a sample-based score function,

\widehat{U}(\boldsymbol{\beta})=\frac{1}{n_{\text{B}}}\sum_{j\in\boldsymbol{B}}{\Big(Y_{j}-m(\boldsymbol{X}_{j};\boldsymbol{\beta})\Big)\boldsymbol{W}\left(\boldsymbol{X}_{j};\boldsymbol{\beta}\right)}=0,

whose solution is henceforth denoted as $\boldsymbol{\widehat{\beta}}$.

Once $\boldsymbol{\widehat{\beta}}$ is obtained, mass imputation uses $m(\boldsymbol{X};\boldsymbol{\widehat{\beta}})$ to ‘impute’ the missing $Y$ in $\boldsymbol{A}$, assuming positivity and transportability (stated in Definitions 1 and 2, respectively).

Definition 1 (Positivity Assumption).

Let $\delta_{j}$ be an indicator function that returns 1 if observation $j$ is included in sample $\boldsymbol{B}$ and 0 otherwise. The assumption of positivity is satisfied if $\Pr(\delta_{j}=1|\boldsymbol{X}=\boldsymbol{x})>0$ for all $j\in\boldsymbol{B}$ and $\boldsymbol{x}$ in the support of $\boldsymbol{X}$. That is, for every value $\boldsymbol{x}$, there is a positive chance for the corresponding units to be selected in sample $\boldsymbol{B}$.

Definition 2 (Transportability Assumption).

Let $\delta_{j}$ again denote the sample indicator for $\boldsymbol{B}$, and let $f(Y|\boldsymbol{X})$ represent a function of $Y$ conditioned on $\boldsymbol{X}$. The assumption of transportability is satisfied if $f(Y|\boldsymbol{X},\delta_{j}=1)=f(Y|\boldsymbol{X})$.

Transportability is a crucial assumption as it allows us to ‘transport’ the prediction model from $\boldsymbol{B}$ to $\boldsymbol{A}$ without worry. As described in [30], a sufficient condition for transportability to hold is the ignorability condition, stated in Definition 3.

Definition 3 (Ignorability Condition).

Let $\delta_{j}$ again denote the sample indicator for $\boldsymbol{B}$, and let $\Pr(\delta_{j}=1|\boldsymbol{X})$ denote the probability of unit $j$ being included in sample $\boldsymbol{B}$ conditioned on $\boldsymbol{X}$. The assumption of ignorability is satisfied if $\Pr(\delta_{j}=1|\boldsymbol{X},Y)=\Pr(\delta_{j}=1|\boldsymbol{X})$.

Definition 3 is essentially the missing at random (MAR) assumption [39]. It is well known that, under MAR (i.e., ignorability), $\boldsymbol{\widehat{\beta}}$ is a consistent estimator of $\boldsymbol{\beta}$, in the sense that the sampling distribution of $\boldsymbol{\widehat{\beta}}$ concentrates around $\boldsymbol{\beta}$ as $n_{\text{B}}$ increases [30, 33, 42]. In the next section, we use the consistency of $\boldsymbol{\widehat{\beta}}$ to derive a residual-based estimator of $F_{\text{N}}(t)$, where $\mathbbm{1}(Y_{i}\leq t)$ in $\widehat{F}_{\pi}(t)$ is replaced by $\widehat{G}(\boldsymbol{X}_{i})$, the empirical CDF of the estimated regression residuals from sample $\boldsymbol{B}$ evaluated at $\frac{t-m(\boldsymbol{X}_{i};\boldsymbol{\widehat{\beta}})}{\nu(\boldsymbol{X}_{i})}$.

3 Proposed Estimators and Asymptotic Results

3.1 Residual-based CDF Estimator

Note that if model (2.1) holds, $F_{\text{N}}(t)$ may be restated as

F_{\text{N}}(t)=\frac{1}{N}\sum_{u\in\mathcal{U}}{\mathbbm{1}\left(\epsilon_{u}\leq\frac{t-m(\boldsymbol{X}_{u};\boldsymbol{\beta})}{\nu(\boldsymbol{X}_{u})}\right)}.

Now, letting

G_{\text{N}}(\boldsymbol{X}_{u})\coloneqq\frac{1}{N}\sum_{v\in\mathcal{U}}{\mathbbm{1}\left(\epsilon_{v}\leq\frac{t-m(\boldsymbol{X}_{u};\boldsymbol{\beta})}{\nu(\boldsymbol{X}_{u})}\right)}

denote the finite population distribution function of $\epsilon$ evaluated at the fixed point $\frac{t-m(\boldsymbol{X}_{u};\boldsymbol{\beta})}{\nu(\boldsymbol{X}_{u})}$, we note that

F_{\text{N}}(t)-\frac{1}{N}\sum_{u\in\mathcal{U}}{G_{\text{N}}(\boldsymbol{X}_{u})}=\frac{1}{N}\sum_{u\in\mathcal{U}}{\left[\mathbbm{1}\left(\epsilon_{u}\leq\frac{t-m(\boldsymbol{X}_{u};\boldsymbol{\beta})}{\nu(\boldsymbol{X}_{u})}\right)-G_{\text{N}}\left(\boldsymbol{X}_{u}\right)\right]}=0,

which further implies that

\mathbb{E}_{\xi}\left[\frac{1}{N}\sum_{u\in\mathcal{U}}{\mathbbm{1}\left(Y_{u}\leq t\right)}\right]=\mathbb{E}_{\xi}\left[\frac{1}{N}\sum_{u\in\mathcal{U}}{G_{\text{N}}\left(\boldsymbol{X}_{u}\right)}\right]=\Pr\left(Y\leq t\right).

Thus, it is not egregious to replace $\mathbbm{1}\left(Y_{u}\leq t\right)$ in $F_{\text{N}}(t)$ with $G_{\text{N}}\left(\boldsymbol{X}_{u}\right)$, assuming that the working model (2.1) holds.

Given that $\boldsymbol{A}$ is subject to missingness on $Y$ only, replacing $\mathbbm{1}\left(Y_{i}\leq t\right)$ in $\widehat{F}_{\pi}(t)$ with an empirical estimate of $G_{\text{N}}(\boldsymbol{X}_{i})$ built from $\boldsymbol{B}$ remains plausible if $\boldsymbol{\widehat{\beta}}\xrightarrow{P}\boldsymbol{\beta}$ (i.e., if ignorability holds). To describe the procedure, let $\widehat{G}(\boldsymbol{X}_{k})$ denote the eCDF of the estimated residuals from sample $\boldsymbol{B}$, $\hat{\epsilon}_{j}=\frac{Y_{j}-m(\boldsymbol{X}_{j};\boldsymbol{\widehat{\beta}})}{\nu(\boldsymbol{X}_{j})}$, such that

\widehat{G}(\boldsymbol{X}_{k})=\frac{1}{n_{\text{B}}}\sum_{j\in\boldsymbol{B}}{\mathbbm{1}\left(\hat{\epsilon}_{j}\leq\frac{t-m(\boldsymbol{X}_{k};\boldsymbol{\widehat{\beta}})}{\nu(\boldsymbol{X}_{k})}\right)}.

Then, the residual-based estimator of FN(t)F_{\text{N}}(t) is defined as

\widehat{F}_{\text{R}}(t)=\frac{1}{N}\sum_{i\in\boldsymbol{A}}{\pi^{-1}_{i}\widehat{G}(\boldsymbol{X}_{i})}=\frac{1}{Nn_{\text{B}}}\sum_{i\in\boldsymbol{A}}{\sum_{j\in\boldsymbol{B}}{\pi^{-1}_{i}\mathbbm{1}\left(\hat{\epsilon}_{j}\leq\frac{t-m(\boldsymbol{X}_{i};\boldsymbol{\widehat{\beta}})}{\nu(\boldsymbol{X}_{i})}\right)}}. (3.1)
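The following is a minimal R sketch of Eq. (3.1), assuming a linear working model $m(\boldsymbol{X};\boldsymbol{\beta})$ fit with lm() (as in our simulations of Section 4) and $\nu(\boldsymbol{X})\equiv 1$; the data layout (data frames A and B with columns y, x1, x2, and weights d in A) is illustrative, not the authors' code.

```r
# Residual-based CDF estimator F_R(t) of Eq. (3.1); a sketch assuming
# m(X; beta) is linear and nu(X) = 1. Column names are illustrative.
F_R <- function(t, A, B, N) {
  fit  <- lm(y ~ x1 + x2, data = B)   # beta-hat fit on the convenience sample B
  ehat <- resid(fit)                  # estimated residuals eps-hat_j, j in B
  mA   <- predict(fit, newdata = A)   # m(X_i; beta-hat) for each i in A
  # G-hat(X_i): eCDF of the B residuals evaluated at t - m(X_i; beta-hat)
  Ghat <- vapply(t - mA, function(u) mean(ehat <= u), numeric(1))
  sum(A$d * Ghat) / N                 # survey-weighted average over sample A
}
```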

3.2 Asymptotic Theory

Before we discuss the asymptotic properties of $\widehat{F}_{\text{R}}(t)$, we must establish a framework that allows $n_{\text{A}}$, $n_{\text{B}}$, and $N$ to tend to infinity. Similar to [28, 43], let $\mathscr{U}\coloneqq\{\mathcal{U}_{\text{N}}\}$ denote an increasing sequence of finite populations of size $N\to\infty$ generated from model (2.1); furthermore, let $\boldsymbol{A}$ and $\boldsymbol{B}$ denote samples from $\mathcal{U}_{\text{N}}$ ($N$ henceforth omitted for brevity) drawn according to some sampling design, such that the sampling fraction $\frac{n_{\text{s}}}{N}=\frac{n_{\text{A}}+n_{\text{B}}}{N}$ converges to a limit in $(0,1)$ as both $n_{\text{s}}$ and $N$ tend to infinity [24]. Given this asymptotic framework, let $\boldsymbol{\beta}^{*}=\text{plim}_{n_{\text{B}}\to\infty}\boldsymbol{\widehat{\beta}}$, such that, for any $\varepsilon>0$,

\Pr\left(\big|\boldsymbol{\widehat{\beta}}-\boldsymbol{\beta}^{*}\big|>\varepsilon\right)\to 0\ \text{as}\ n_{\text{B}}\to\infty.

Letting $\xrightarrow{P}$ again denote convergence in probability, we have previously mentioned that $\boldsymbol{\widehat{\beta}}\xrightarrow{P}\boldsymbol{\beta}$ (i.e., $\boldsymbol{\beta}^{*}=\boldsymbol{\beta}$) under the ignorability condition; and since $\boldsymbol{\beta}_{\text{N}}\xrightarrow{P}\boldsymbol{\beta}$ as well, we may use Theorem 2.1.3 of [32] to show $\boldsymbol{\widehat{\beta}}\xrightarrow{P}\boldsymbol{\beta}_{\text{N}}$.

In the following, we formally define a set of conditions that govern the validity of our asymptotic results.

Assumption 1.

The sampling design of $\boldsymbol{B}$ is ignorable; that is,

\Pr\left(j\in\boldsymbol{B}|\boldsymbol{X},Y\right)=\Pr\left(j\in\boldsymbol{B}|\boldsymbol{X}\right).
Assumption 2.

The sampling fraction $\frac{n_{\text{A}}+n_{\text{B}}}{N}$ converges to a limit in $(0,1)$ as $n_{\text{A}}+n_{\text{B}}$ and $N$ tend to infinity. Furthermore, $\mathbb{E}_{\mathcal{D}}\left(n_{\text{A}}\right)=O(N^{\gamma})$, where $\mathbb{E}_{\mathcal{D}}(\cdot)$ denotes the design-based expectation and $\frac{2}{3}<\gamma\leq 1$.

Assumption 3.

There exist some positive real constants $c_{1},c_{2}$, and $c_{3}$ such that $c_{1}\leq\frac{N\pi_{i}}{\mathbb{E}_{\mathcal{D}}\left(n_{\text{A}}\right)}\leq c_{2}$ and $\left|1-\frac{\pi_{h}\pi_{i}}{\pi_{hi}}\right|\leq c_{3}$ for all $h,i\in\boldsymbol{A}$. Furthermore, $n^{-1/2}_{\text{B}}=o\left([\mathbb{E}_{\mathcal{D}}\left(n_{\text{A}}\right)]^{-\frac{1}{2}}\right)$.

Assumption 4.

For any random variable $Z$ with finite $2+\upsilon$ population moments for an arbitrarily small $\upsilon>0$,

\mathrm{Var}_{\mathcal{D}}\left(\frac{1}{N}\sum_{i\in\boldsymbol{A}}{\pi^{-1}_{i}Z_{i}}\right)\leq\frac{c_{3}}{\mathbb{E}_{\mathcal{D}}\left(n_{\text{A}}\right)(N-1)}\sum_{u\in\mathcal{U}_{\text{N}}}{\left(Z_{u}-\bar{Z}_{\text{N}}\right)^{2}},

where $\bar{Z}_{\text{N}}=\frac{1}{N}\sum_{u\in\mathcal{U}_{\text{N}}}{Z_{u}}$ is the finite population mean of $Z$ and $c_{3}$ is some positive constant.

Assumption 5.

For any random variable $Z$ with a finite fourth population moment,

\left[\mathrm{Var}_{\mathcal{D}}\left(\bar{Z}_{\pi}\right)\right]^{-1/2}\left(\bar{Z}_{\pi}-\bar{Z}_{\text{N}}\right)\xrightarrow{\mathcal{L}}\mathrm{N}\left(0,1\right)\quad\text{and}\quad\left[\mathrm{Var}_{\mathcal{D}}\left(\bar{Z}_{\pi}\right)\right]^{-1}\widehat{\mathrm{Var}}_{\pi}\left(\bar{Z}_{\pi}\right)-1=O_{\text{P}}\left([\mathbb{E}_{\mathcal{D}}\left(n_{\text{A}}\right)]^{-1/2}\right),

where $\bar{Z}_{\pi}=\frac{1}{N}\sum_{i\in\boldsymbol{A}}{\pi^{-1}_{i}Z_{i}}$ denotes the HT estimator of $\bar{Z}_{\text{N}}$ and $\widehat{\mathrm{Var}}_{\pi}\left(\bar{Z}_{\pi}\right)$ the HT estimator of $\mathrm{Var}_{\mathcal{D}}\left(\bar{Z}_{\pi}\right)$.

Assumption 6.

The coefficient vector $\boldsymbol{\widehat{\beta}}$ satisfies

\boldsymbol{\widehat{\beta}}=\boldsymbol{\beta}_{\text{N}}+o_{P}\left(n^{-1/2}_{\text{B}}\right).
Assumption 7.

$F_{\text{N}}(t)$ converges to a smooth function $F(t)$ as $N$ goes to infinity; that is,

\lim_{N\to\infty}{F_{\text{N}}(t)}=F(t),

where the limiting function $F(t)$ is continuous with finite first and second derivatives.

Assumption 1 allows $\boldsymbol{\widehat{\beta}}\xrightarrow{P}\boldsymbol{\beta}_{\text{N}}$; Assumption 2 establishes a framework allowing $n_{\text{A}},n_{\text{B}}\to\infty$ as $N\to\infty$ and limits the growth rate of both sample sizes; Assumptions 3 and 4 ensure design-based consistency under $p(\boldsymbol{A})$, the sampling design of $\boldsymbol{A}$, and allow $n_{\text{B}}$ to grow at a faster rate than $\mathbb{E}_{\mathcal{D}}\left(n_{\text{A}}\right)$; Assumption 5 ensures asymptotic normality under a general design; Assumption 6 establishes regularity conditions for $\boldsymbol{\widehat{\beta}}$; and Assumption 7 specifies a limiting smooth function for $F_{\text{N}}(t)$ [43, 24, 36]. Note that, if Assumptions 3 and 6 hold, then $\boldsymbol{\widehat{\beta}}=\boldsymbol{\beta}_{\text{N}}+o_{P}\left([\mathbb{E}_{\mathcal{D}}\left(n_{\text{A}}\right)]^{-\frac{1}{2}}\right)$.

The following theorem summarizes the asymptotic properties of the proposed residual-based CDF estimator, F^R(t)\widehat{F}_{\text{R}}(t).

Theorem 1.

Under Assumptions 1–7, $\widehat{F}_{\text{R}}(t)$ is design-consistent for $F_{\text{N}}(t)$ and is asymptotically normally distributed; that is,

\frac{\widehat{F}_{\text{R}}(t)-F_{\text{N}}(t)}{\sqrt{AV\{\widehat{F}_{\text{R}}(t)\}}}\overset{\mathcal{L}}{\longrightarrow}\mathrm{N}(0,1),

where

AV\{\widehat{F}_{\text{R}}(t)\}=\frac{1}{N^{2}}\sum_{u\in\mathcal{U}}{\sum_{v\in\mathcal{U}}{\left(\frac{\pi_{uv}}{\pi_{u}\pi_{v}}-1\right)G_{\text{N}}(\boldsymbol{X}_{u})G_{\text{N}}(\boldsymbol{X}_{v})}}.
Proof.

First, define $\tilde{F}_{\text{R}}(t)$ as

\tilde{F}_{\text{R}}(t)=\frac{1}{N}\sum_{i\in\boldsymbol{A}}{\pi^{-1}_{i}\frac{1}{n_{\text{B}}}\sum_{j\in\boldsymbol{B}}{\mathbbm{1}\left(\epsilon_{j}\leq\frac{t-m(\boldsymbol{X}_{i};\boldsymbol{\beta})}{\nu\left(\boldsymbol{X}_{i}\right)}\right)}}=\frac{1}{N}\sum_{i\in\boldsymbol{A}}{\pi^{-1}_{i}\tilde{G}(\boldsymbol{X}_{i})}=\frac{1}{N}\sum_{u\in\mathcal{U}}{\pi^{-1}_{u}\ell_{u}\tilde{G}(\boldsymbol{X}_{u})},

where $\ell_{u}=1$ if $u\in\boldsymbol{A}$ and zero otherwise. Now, under Assumption 1, $\tilde{G}(\boldsymbol{X})\xrightarrow{P}G_{\text{N}}(\boldsymbol{X})$, and thus

\mathbb{E}_{\mathcal{D}}\left(\tilde{F}_{\text{R}}(t)\right)=\mathbb{E}_{\mathcal{D}}\left(\frac{1}{N}\sum_{u\in\mathcal{U}}{\ell_{u}\pi^{-1}_{u}G_{\text{N}}(\boldsymbol{X}_{u})}+o_{\text{P}}(1)\right)=\frac{1}{N}\sum_{u\in\mathcal{U}}{G_{\text{N}}(\boldsymbol{X}_{u})}+o(1),

\mathbb{E}_{\mathcal{D}}\left(\tilde{F}^{2}_{\text{R}}(t)\right)=\frac{1}{N^{2}}\sum_{u\in\mathcal{U}}{\sum_{v\in\mathcal{U}}{\frac{\pi_{uv}}{\pi_{u}\pi_{v}}G_{\text{N}}(\boldsymbol{X}_{u})G_{\text{N}}(\boldsymbol{X}_{v})}}+o(1);

ergo,

\mathrm{Var}_{\mathcal{D}}\left(\tilde{F}_{\text{R}}(t)\right)=\frac{1}{N^{2}}\sum_{u\in\mathcal{U}}{\sum_{v\in\mathcal{U}}{\left(\frac{\pi_{uv}}{\pi_{u}\pi_{v}}-1\right)G_{\text{N}}(\boldsymbol{X}_{u})G_{\text{N}}(\boldsymbol{X}_{v})}}+o(1).

From here, we may use results from [43], [36], and [37] to show that $\widehat{F}_{\text{R}}(t)$ and $\tilde{F}_{\text{R}}(t)$ have the same limiting normal distribution and are both design-consistent for $F_{\text{N}}(t)$; that is,

\frac{\widehat{F}_{\text{R}}(t)-F_{\text{N}}(t)}{\sqrt{AV\{\widehat{F}_{\text{R}}(t)\}}}\overset{\mathcal{L}}{\longrightarrow}\mathrm{N}(0,1),

where $AV\{\widehat{F}_{\text{R}}(t)\}$ is as given in the theorem. ∎

Theorem 2.

Suppose Assumptions 1–6 hold. An asymptotically unbiased estimator of $AV\{\widehat{F}_{\text{R}}(t)\}$ is given by

\widehat{\mathrm{Var}}\left(\widehat{F}_{\text{R}}(t)\right)=\frac{1}{N^{2}}\sum_{h\in\boldsymbol{A}}{\sum_{i\in\boldsymbol{A}}{\left(\frac{\pi_{hi}}{\pi_{h}\pi_{i}}-1\right)\frac{1}{\pi_{hi}}\widehat{G}(\boldsymbol{X}_{h})\widehat{G}(\boldsymbol{X}_{i})}}.
Proof.

Given that $\widehat{F}_{\text{R}}(t)$ and $\tilde{F}_{\text{R}}(t)$ are both design-consistent for $F_{\text{N}}(t)$, we may again use Theorem 2.1.3 of [32] to show

\widehat{F}^{2}_{\text{R}}(t)-\tilde{F}^{2}_{\text{R}}(t)\coloneqq\frac{1}{N^{2}}\sum_{h\in\boldsymbol{A}}{\sum_{i\in\boldsymbol{A}}{\pi^{-1}_{h}\pi^{-1}_{i}\left(\widehat{G}(\boldsymbol{X}_{h})\widehat{G}(\boldsymbol{X}_{i})-\tilde{G}(\boldsymbol{X}_{h})\tilde{G}(\boldsymbol{X}_{i})\right)}}\xrightarrow{P}0. (3.2)

Note that

\widehat{\mathrm{Var}}\left(\widehat{F}_{\text{R}}(t)\right)-\widehat{\mathrm{Var}}\left(\tilde{F}_{\text{R}}(t)\right)=\frac{1}{N^{2}}\sum_{h\in\boldsymbol{A}}{\sum_{i\in\boldsymbol{A}}{\pi^{-1}_{h}\pi^{-1}_{i}\left(\widehat{G}(\boldsymbol{X}_{h})\widehat{G}(\boldsymbol{X}_{i})-\tilde{G}(\boldsymbol{X}_{h})\tilde{G}(\boldsymbol{X}_{i})\right)\alpha_{hi}}},

where $\alpha_{hi}\coloneqq 1-\frac{\pi_{h}\pi_{i}}{\pi_{hi}}$. Given that $\widehat{\mathrm{Var}}\left(\tilde{F}_{\text{R}}(t)\right)$ is asymptotically unbiased for $AV\{\widehat{F}_{\text{R}}(t)\}$, we would like to show that $\widehat{\mathrm{Var}}\left(\widehat{F}_{\text{R}}(t)\right)-\widehat{\mathrm{Var}}\left(\tilde{F}_{\text{R}}(t)\right)$ converges in probability to zero.

Recall, from Assumption 3, that there exists some constant $c_{3}\in\mathbb{R}^{+}$ such that $|\alpha_{hi}|\leq c_{3}$ for all $h,i\in\boldsymbol{A}$. This implies

-c_{3}\times\left(\widehat{F}^{2}_{\text{R}}(t)-\tilde{F}^{2}_{\text{R}}(t)\right)\leq\widehat{\mathrm{Var}}\left(\widehat{F}_{\text{R}}(t)\right)-\widehat{\mathrm{Var}}\left(\tilde{F}_{\text{R}}(t)\right)\leq c_{3}\times\left(\widehat{F}^{2}_{\text{R}}(t)-\tilde{F}^{2}_{\text{R}}(t)\right);

and since

\text{plim}_{n_{\text{B}}\to\infty}\left\{c_{3}\times\left(\widehat{F}^{2}_{\text{R}}(t)-\tilde{F}^{2}_{\text{R}}(t)\right)\right\}=\text{plim}_{n_{\text{B}}\to\infty}\left\{-c_{3}\times\left(\widehat{F}^{2}_{\text{R}}(t)-\tilde{F}^{2}_{\text{R}}(t)\right)\right\}=0,

we see, by the squeeze theorem and (3.2), that $\widehat{\mathrm{Var}}\left(\widehat{F}_{\text{R}}(t)\right)-\widehat{\mathrm{Var}}\left(\tilde{F}_{\text{R}}(t)\right)\xrightarrow{P}0$. Thus, $\widehat{\mathrm{Var}}\left(\widehat{F}_{\text{R}}(t)\right)$ is asymptotically unbiased for $AV\{\widehat{F}_{\text{R}}(t)\}$. ∎
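For completeness, a direct R translation of the variance estimator in Theorem 2 is sketched below; the inputs (Ghat, the vector of $\widehat{G}(\boldsymbol{X}_{i})$ values for $i\in\boldsymbol{A}$; pis, the first-order inclusion probabilities; piHI, the matrix of joint inclusion probabilities $\pi_{hi}$) are assumed names for illustration.

```r
# Sketch of Theorem 2's plug-in variance estimator for F_R(t).
var_FR <- function(Ghat, pis, piHI, N) {
  W <- (piHI / outer(pis, pis) - 1) / piHI   # (pi_hi/(pi_h pi_i) - 1) * (1/pi_hi)
  as.numeric(t(Ghat) %*% W %*% Ghat) / N^2   # double sum over h, i in A
}
```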

In Section 4.4, we define a replication-type variance estimator based on the bootstrap, and compare its performance with that of the variance estimator in Theorem 2.

3.3 Plug-In CDF Estimator

One may consider a more direct ‘mass imputation’ approach to CDF estimation by substituting $Y_{i}$ for all $i\in\boldsymbol{A}$ with predicted values, $\hat{Y}_{i}\coloneqq m(\boldsymbol{X}_{i};\boldsymbol{\widehat{\beta}})$, but this approach will almost always result in higher estimation bias. Let $\widehat{F}_{\text{P}}(t)$ denote this direct plug-in CDF estimator, such that

\widehat{F}_{\text{P}}(t)=\frac{1}{N}\sum_{i\in\boldsymbol{A}}{\pi^{-1}_{i}\mathbbm{1}\left(\hat{Y}_{i}\leq t\right)}=\frac{1}{N}\sum_{i\in\boldsymbol{A}}{\pi^{-1}_{i}\mathbbm{1}\left(m(\boldsymbol{X}_{i};\boldsymbol{\widehat{\beta}})\leq t\right)}.

In the best-case scenario of $\boldsymbol{\widehat{\beta}}=\boldsymbol{\beta}$, we observe

\widehat{F}_{\text{P}}(t)\Big|_{\boldsymbol{\widehat{\beta}}=\boldsymbol{\beta}}=\frac{1}{N}\sum_{i\in\boldsymbol{A}}{\pi^{-1}_{i}\mathbbm{1}\Big(m(\boldsymbol{X}_{i};\boldsymbol{\beta})\leq t\Big)},

and thus

\mathbb{E}_{\mathcal{D}}\Big(\widehat{F}_{\text{P}}(t)\Big|_{\boldsymbol{\widehat{\beta}}=\boldsymbol{\beta}}\Big)=\frac{1}{N}\sum_{u\in\mathcal{U}}{\mathbbm{1}\Big(m(\boldsymbol{X}_{u};\boldsymbol{\beta})\leq t\Big)}=\frac{1}{N}\sum_{u\in\mathcal{U}}{\mathbbm{1}\Big(Y_{u}-\nu(\boldsymbol{X}_{u})\epsilon_{u}\leq t\Big)},

which is not unbiased for $F_{\text{N}}(t)$ under model (2.1) unless $\nu(\boldsymbol{X}_{u})\epsilon_{u}$ is effectively zero for all $u\in\mathcal{U}$. From the perspective of sample $\boldsymbol{A}$, we note $\hat{Y}_{i}=Y_{i}-\varsigma_{i}$, where the unknown residual $\varsigma_{i}$ can trigger an erroneous $\mathbbm{1}(\cdot)$ output even if it is small for all $i\in\boldsymbol{A}$. That is to say, the plug-in approach is ‘all-or-nothing’: $\mathbbm{1}(\hat{Y}_{i}\leq t)$ is either right or wrong, with very little room for error. This is not the case for $\widehat{F}_{\text{R}}(t)$, though, since small $\varsigma_{i}$ implies a well-specified (and transportable) $\boldsymbol{\widehat{\beta}}$, making $\widehat{G}(\boldsymbol{X}_{i})$ a reasonable substitute for $\mathbbm{1}(Y_{i}\leq t)$ within $\widehat{F}_{\pi}(t)$.
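For contrast with the residual-based sketch of Section 3.1, a minimal R version of $\widehat{F}_{\text{P}}(t)$ follows, under the same assumed data layout (these names are illustrative, not the authors' code):

```r
# Plug-in CDF estimator F_P(t): substitute Y-hat_i = m(X_i; beta-hat) into the
# HT indicator; a sketch assuming a linear working model fit on B.
F_P <- function(t, A, B, N) {
  fit  <- lm(y ~ x1 + x2, data = B)   # working regression model fit on B
  yhat <- predict(fit, newdata = A)   # Y-hat_i for each i in A
  sum(A$d * (yhat <= t)) / N          # 'all-or-nothing' indicator substitution
}
```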

3.4 Quantile Estimation

We conclude this section with a discussion of the quantile estimation problem. From Eq. (1.2), an estimator of $t_{\text{N}}(\alpha)$ based on $\widehat{F}_{\text{R}}(t)$ could be defined as

\hat{t}_{\text{R}}(\alpha)=\inf_{t}\left\{t:\widehat{F}_{\text{R}}(t)\geq\alpha\right\}, (3.3)

or, for $\widehat{F}_{\text{P}}(t)$, as

\hat{t}_{\text{P}}(\alpha)=\inf_{t}\left\{t:\widehat{F}_{\text{P}}(t)\geq\alpha\right\}, (3.4)

where $\alpha$ denotes the percentile of interest. In general, if some $\widehat{F}(t)$ is asymptotically unbiased, then its corresponding quantile estimator, $\hat{t}(\alpha)$, will also be asymptotically unbiased [24]. Assuming further that $\widehat{F}(t)$ is asymptotically normally distributed, $100(1-\gamma)\%$ confidence intervals can be constructed using the formula of [44] as follows:

\hat{t}_{\text{LL}}=\min_{t}\left(\widehat{F}(t)\geq\alpha-z_{\gamma/2}\widehat{\text{SE}}\left[\widehat{F}\left(\hat{t}(\alpha)\right)\right]\right) (3.5)

\hat{t}_{\text{UL}}=\min_{t}\left(\widehat{F}(t)\geq\alpha+z_{\gamma/2}\widehat{\text{SE}}\left[\widehat{F}\left(\hat{t}(\alpha)\right)\right]\right), (3.6)

where $z_{\gamma/2}$ is the $(1-\gamma/2)^{th}$ quantile of the standard normal distribution and $\widehat{\text{SE}}$ is the estimated standard error of $\widehat{F}(t)$. Then, using the linear approximations of [31, 26], we may estimate the variance of $\hat{t}(\alpha)$ via

\widehat{\mathrm{Var}}\left[\hat{t}(\alpha)\right]=\left(\frac{\hat{t}_{\text{UL}}-\hat{t}_{\text{LL}}}{2z_{\gamma/2}}\right)^{2}. (3.7)
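In practice, the inversion in Eqs. (3.3)-(3.4) and the Woodruff-type variance in Eqs. (3.5)-(3.7) reduce to a grid search; a hedged R sketch follows, where tgrid, Fhat (the CDF estimate on the grid), and se (the estimated standard error of $\widehat{F}$ at $\hat{t}(\alpha)$) are assumed inputs.

```r
# Quantile estimation by inverting an estimated CDF, Eq. (3.3)/(3.4),
# plus the linear-approximation variance of Eq. (3.7).
quantile_est <- function(alpha, tgrid, Fhat) {
  min(tgrid[Fhat >= alpha])                   # inf{t : F-hat(t) >= alpha}
}

quantile_var <- function(alpha, tgrid, Fhat, se, gamma = 0.10) {
  z   <- qnorm(1 - gamma / 2)                 # e.g., 1.645 when gamma = 0.10
  tLL <- min(tgrid[Fhat >= alpha - z * se])   # Eq. (3.5)
  tUL <- min(tgrid[Fhat >= alpha + z * se])   # Eq. (3.6)
  ((tUL - tLL) / (2 * z))^2                   # Eq. (3.7)
}
```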

An alternative variance estimator based on the bootstrap will be defined and empirically evaluated in Section 4.4.

One should note that quantile estimation is extremely sensitive to small sample sizes in general, and to small inclusion probabilities in particular, as both tend to exclude many (if not most) of the distinct values of $Y$ in $\mathcal{U}$. In the next sections, we evaluate the consequences of small $n_{\text{B}}$ and $n_{\text{A}}$ on the performance of both $\hat{t}_{\text{P}}(\alpha)$ and $\hat{t}_{\text{R}}(\alpha)$ using simulated data.

4 Simulation Study

We conducted a two-phase Monte Carlo simulation study to contrast the finite-sample performance of the proposed CDF and quantile estimators ($\widehat{F}_{\text{R}}(t)$ and $\hat{t}_{\text{R}}(\alpha)$) with their plug-in equivalents ($\widehat{F}_{\text{P}}(t)$ and $\hat{t}_{\text{P}}(\alpha)$) and their ‘naïve’, $\boldsymbol{B}$-only equivalents ($\widehat{F}_{\text{B}}(t)$ and $\hat{t}_{\text{B}}(\alpha)$). We measure the relative performance of the estimators using the relative root mean squared error (RRMSE), defined generically for some $\hat{\theta}$ as

\mathrm{RRMSE}\left(\hat{\theta}\right)=\sqrt{\frac{\mathrm{MSE}\left(\hat{\theta}\right)}{\mathrm{MSE}\left(\hat{\theta}_{\text{A}}\right)}}, (4.1)

where $\hat{\theta}_{\text{A}}$ denotes a ‘gold-standard’ estimator of $\theta$ from $\boldsymbol{A}$. Here, $\hat{\theta}_{\text{A}}\coloneqq\{\widehat{F}_{\pi}(t),\hat{t}_{\pi}(\alpha)\}$, the HT estimators of $F_{\text{N}}(t)$ and $t_{\text{N}}(\alpha)$, respectively.
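In code, Eq. (4.1) amounts to the following one-liner, where est and gold denote Monte Carlo draws of a candidate estimator and of the HT ‘gold standard’, respectively (names are assumptions):

```r
# RRMSE of Eq. (4.1): root ratio of MSEs against the HT benchmark.
rrmse <- function(est, gold, theta) {
  sqrt(mean((est - theta)^2) / mean((gold - theta)^2))
}
```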

4.1 Simulation Settings

To thoroughly assess the performance of our proposed estimators, we considered the following four superpopulation models, $\{\xi_{i}\}^{4}_{i=1}$, which account for varying levels of complexity in the relationship between $Y$ and $\boldsymbol{X}$.

• Model $\xi_{1}$ [25]:

Y=.3+2X_{1}+2X_{2}+\epsilon,

where $X_{1},X_{2}\sim\mathrm{N}(\mu=2,\sigma=1)$ and $\epsilon\sim\mathrm{N}(\mu=0,\sigma=1)$.

• Model $\xi_{2}$ [25]:

Y=.3+.5X^{2}_{1}+.5X^{2}_{2}+\epsilon,

where $X_{1},X_{2}\sim\mathrm{N}(\mu=2,\sigma=1)$ and $\epsilon\sim\mathrm{N}(\mu=0,\sigma=1)$.

• Model $\xi_{3}$ [41, 35]:

Y=-\sin(X_{1})+X^{2}_{2}+X_{3}-\mathrm{e}^{-X^{2}_{4}}+\epsilon,

where $X_{1},\cdots,X_{4}\sim\mathrm{Uniform}\left(\min=-1,\max=1\right)$ and $\epsilon\sim\mathrm{N}\left(\mu=0,\sigma=\sqrt{.5}\right)$.

• Model $\xi_{4}$ [38, 35]:

Y=X_{1}+.707X^{2}_{2}+2\mathbbm{1}\left(X_{3}>0\right)+.873\ln\left(|X_{1}|\right)|X_{3}|+.894X_{2}X_{4}+2\mathbbm{1}\left(X_{5}>0\right)+.464\mathrm{e}^{X_{6}}+\epsilon,

where $X_{1},\cdots,X_{6}\sim\mathrm{N}\left(\mu=0,\sigma=1\right)$ and $\epsilon\sim\mathrm{N}\left(\mu=0,\sigma=1\right)$.

Under each of these models, we generated a finite population $\mathcal{U}$ of size $N=100{,}000$. After $\mathcal{U}$ was obtained, a simple random sample without replacement (SRS) of size $n_{\text{A}}=500$ was selected to serve as $\boldsymbol{A}$, the probability sample. For sample $\boldsymbol{B}$ with $n_{\text{B}}\in\{500,\,1000,\,5000,\,10000\}$, we used modifications of the stratified simple random sampling technique of [25] to generate samples under both missing at random (MAR) and missing not at random (MNAR) mechanisms. For MAR, we defined $X^{*}$ to be the covariate with the strongest correlation with $Y$, placed all elements $u\in\mathcal{U}$ with $X_{u}^{*}\leq\mu_{X^{*}}$ in stratum I and the rest in stratum II, and then used stratified SRS to obtain $n_{\text{I}}=0.85\,n_{\text{B}}$ and $n_{\text{II}}=(1-0.85)\,n_{\text{B}}$ elements from the respective strata; a code sketch of this mechanism is given below. The procedure for generating MNAR data was nearly identical, with the exception of using $Y_{u}\leq\mu_{Y}$ as the stratum discrimination rule. Lastly, using each of the $n_{\text{sim}}=1500$ randomly sampled $\boldsymbol{A}$/$\boldsymbol{B}$ sample pairs, we calculated $\widehat{F}_{\text{B}}(t)$ (shortened as ‘B’), $\widehat{F}_{\text{P}}(t)$ (shortened as ‘P’), and $\widehat{F}_{\text{R}}(t)$ (shortened as ‘R’), as well as their respective quantile estimators, at the 1st, 10th, 25th, 50th, 75th, 90th, and 99th percentiles. The two estimators $\widehat{F}_{\text{P}}(t)$ and $\widehat{F}_{\text{R}}(t)$ used multivariable linear regression, fit on sample $\boldsymbol{B}$ using R’s lm() function. Results are presented in Tables 4.1–4.4.
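The MAR selection mechanism just described can be sketched in R as follows; U is a finite-population data frame in which the covariate xstar (the one most strongly correlated with $Y$) has already been identified. All names are illustrative, not the authors' code.

```r
# Stratified-SRS selection of the nonprobability sample B under MAR.
draw_B_mar <- function(U, nB, frac = 0.85) {
  I   <- which(U$xstar <= mean(U$xstar))        # stratum I: xstar at or below its mean
  II  <- setdiff(seq_len(nrow(U)), I)           # stratum II: the remainder
  idx <- c(sample(I,  round(frac * nB)),        # n_I  = 0.85 * nB from stratum I
           sample(II, nB - round(frac * nB)))   # n_II = 0.15 * nB from stratum II
  U[idx, ]
}
# The MNAR version replaces the xstar rule with y <= mean(y).
```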

Table 4.1: RRMSE values for $\widehat{F}_{\text{B}}(t)$ (shortened as ‘B’), $\widehat{F}_{\text{P}}(t)$ (shortened as ‘P’), and $\widehat{F}_{\text{R}}(t)$ (shortened as ‘R’), as well as their corresponding quantile estimators, under model $\xi_{1}$.
MAR MNAR
$\widehat{F}(t)$ $\hat{t}(\alpha)$ $\widehat{F}(t)$ $\hat{t}(\alpha)$
$n_{\text{B}}$ $\alpha$ B P R B P R B P R B P R
500 1% 2.10 1.14 0.76 1.59 1.09 0.89 2.01 1.19 0.72 1.60 1.19 0.93
10% 4.64 1.38 0.83 3.58 1.32 0.85 5.39 1.22 0.93 3.83 1.06 0.81
25% 6.64 1.25 0.90 5.63 1.19 0.93 9.09 1.12 1.36 6.34 1.02 1.22
50% 7.14 1.10 0.91 6.77 1.01 0.92 15.69 2.09 1.91 9.63 1.87 1.73
75% 6.33 1.22 0.93 7.34 1.21 0.96 9.17 2.91 2.14 13.49 2.79 2.04
90% 4.48 1.42 0.91 6.30 1.46 0.93 5.25 2.92 2.03 11.08 3.11 2.01
99% 1.56 1.12 0.77 3.10 1.42 1.32 1.67 1.59 1.13 3.47 2.41 1.24
1000 1% 1.83 1.13 0.73 1.38 1.09 0.87 1.75 1.17 0.70 1.30 1.13 0.89
10% 4.64 1.41 0.81 3.52 1.34 0.84 5.27 1.19 0.92 3.78 1.06 0.81
25% 6.37 1.19 0.84 5.33 1.12 0.86 9.05 1.09 1.33 6.24 0.99 1.18
50% 7.47 1.04 0.87 6.95 0.94 0.86 15.67 2.03 1.86 9.71 1.84 1.70
75% 6.31 1.15 0.84 7.16 1.13 0.86 9.33 2.95 2.15 13.77 2.84 2.05
90% 4.31 1.31 0.81 6.11 1.39 0.84 5.28 2.91 2.02 11.24 3.17 2.02
99% 1.62 1.14 0.74 2.94 1.41 1.30 1.64 1.62 1.15 3.32 2.48 1.22
5000 1% 1.62 1.11 0.73 1.13 1.09 0.87 1.62 1.20 0.71 1.11 1.11 0.89
10% 4.64 1.44 0.80 3.47 1.35 0.82 5.07 1.18 0.88 3.60 1.07 0.78
25% 6.39 1.22 0.81 5.49 1.15 0.86 8.78 1.06 1.30 6.07 0.98 1.16
50% 7.30 1.01 0.83 6.87 0.94 0.84 15.32 1.95 1.79 9.38 1.79 1.63
75% 6.40 1.16 0.84 7.43 1.17 0.86 8.98 2.75 1.99 13.53 2.70 1.94
90% 4.64 1.43 0.83 6.20 1.41 0.82 5.25 2.87 1.96 11.14 3.08 1.93
99% 1.53 1.11 0.70 2.77 1.44 1.29 1.65 1.65 1.16 3.13 2.43 1.21
10000 1% 1.60 1.18 0.71 1.11 1.11 0.87 1.61 1.22 0.71 1.11 1.10 0.88
10% 4.64 1.44 0.79 3.58 1.39 0.84 5.21 1.21 0.88 3.70 1.08 0.78
25% 6.45 1.22 0.82 5.41 1.15 0.85 9.16 1.08 1.32 6.32 0.98 1.18
50% 7.43 0.99 0.82 7.05 0.93 0.84 15.45 1.99 1.81 9.69 1.85 1.68
75% 6.64 1.17 0.85 7.61 1.15 0.88 9.08 2.79 2.03 13.33 2.70 1.94
90% 4.52 1.37 0.81 6.30 1.43 0.84 5.26 2.89 1.98 11.04 3.06 1.93
99% 1.50 1.10 0.70 2.72 1.41 1.26 1.52 1.52 1.06 3.01 2.35 1.16
Table 4.2: RRMSE values for $\widehat{F}_{\text{B}}(t)$ (shortened as ‘B’), $\widehat{F}_{\text{P}}(t)$ (shortened as ‘P’), and $\widehat{F}_{\text{R}}(t)$ (shortened as ‘R’), as well as their corresponding quantile estimators, under model $\xi_{2}$.
MAR MNAR
$\widehat{F}(t)$ $\hat{t}(\alpha)$ $\widehat{F}(t)$ $\hat{t}(\alpha)$
$n_{\text{B}}$ $\alpha$ B P R B P R B P R B P R
500 1% 2.03 2.90 5.59 1.55 3.58 4.66 1.74 2.10 4.54 1.42 2.54 3.56
10% 4.87 1.32 1.64 3.60 1.69 1.96 4.06 1.46 1.54 3.01 1.68 1.66
25% 6.89 2.32 1.04 5.32 2.51 1.31 6.90 2.00 0.86 4.77 1.88 0.92
50% 7.17 1.84 1.42 6.75 1.54 1.43 11.90 1.17 1.00 7.63 0.91 0.85
75% 5.96 2.42 1.27 6.68 1.78 0.99 8.55 4.48 2.96 10.66 3.34 2.35
90% 4.07 4.55 2.92 5.40 3.85 2.36 4.94 5.74 4.31 9.24 5.44 3.92
99% 1.52 2.22 2.01 2.66 4.45 2.57 1.56 2.22 2.11 3.08 5.21 3.28
1000 1% 1.88 2.90 5.66 1.36 3.59 4.73 1.47 2.08 4.49 1.16 2.64 3.70
10% 4.91 1.24 1.59 3.60 1.58 1.89 4.07 1.42 1.52 3.11 1.70 1.68
25% 6.93 2.32 0.99 5.22 2.47 1.23 6.86 1.92 0.78 4.80 1.87 0.86
50% 7.30 1.80 1.39 6.90 1.53 1.41 11.49 1.09 0.93 7.33 0.84 0.79
75% 6.03 2.31 1.11 6.74 1.73 0.86 8.44 4.42 2.90 10.17 3.18 2.23
90% 4.04 4.50 2.86 5.40 3.81 2.30 4.90 5.75 4.30 9.00 5.30 3.81
99% 1.49 2.24 2.03 2.44 4.39 2.52 1.59 2.33 2.21 2.87 5.26 3.32
5000 1% 1.60 2.71 5.31 1.16 3.46 4.65 1.20 1.90 4.26 0.91 2.42 3.47
10% 4.75 1.20 1.55 3.46 1.53 1.82 3.97 1.39 1.48 2.93 1.56 1.57
25% 6.83 2.29 0.94 5.20 2.45 1.17 6.72 1.86 0.74 4.70 1.82 0.81
50% 7.18 1.72 1.31 6.65 1.41 1.31 11.77 1.04 0.90 7.50 0.82 0.76
75% 5.82 2.22 1.00 6.39 1.63 0.76 8.52 4.44 2.90 10.36 3.24 2.26
90% 4.00 4.50 2.83 5.22 3.71 2.24 4.89 5.77 4.30 8.92 5.29 3.79
99% 1.42 2.23 2.02 2.21 4.36 2.44 1.51 2.28 2.16 2.72 5.32 3.28
10000 1% 1.59 2.73 5.35 1.11 3.50 4.59 1.21 1.93 4.41 0.92 2.60 3.68
10% 4.79 1.22 1.53 3.41 1.52 1.74 3.97 1.34 1.52 2.96 1.55 1.62
25% 6.80 2.32 0.96 5.16 2.48 1.20 6.62 1.82 0.74 4.68 1.79 0.82
50% 7.25 1.81 1.36 6.75 1.48 1.36 11.95 1.05 0.91 7.64 0.82 0.77
75% 5.73 2.15 0.95 6.35 1.59 0.73 8.58 4.45 2.91 10.35 3.24 2.25
90% 3.99 4.47 2.80 5.35 3.79 2.27 4.99 5.89 4.38 8.99 5.30 3.80
99% 1.36 2.15 1.94 2.10 4.15 2.34 1.51 2.27 2.16 2.69 5.28 3.30
Table 4.3: RRMSE values for $\widehat{F}_{\text{B}}(t)$ (shortened as ‘B’), $\widehat{F}_{\text{P}}(t)$ (shortened as ‘P’), and $\widehat{F}_{\text{R}}(t)$ (shortened as ‘R’), as well as their corresponding quantile estimators, under model $\xi_{3}$.
MAR MNAR
$\widehat{F}(t)$ $\hat{t}(\alpha)$ $\widehat{F}(t)$ $\hat{t}(\alpha)$
$n_{\text{B}}$ $\alpha$ B P R B P R B P R B P R
500 1% 1.89 2.24 0.55 1.54 5.69 4.89 2.06 2.24 0.64 1.60 5.12 4.48
10% 3.78 6.37 0.76 3.15 5.64 0.85 5.55 5.18 2.94 3.85 3.14 1.94
25% 5.10 4.51 0.84 4.43 3.31 0.81 9.24 2.49 5.84 6.46 1.34 4.36
50% 5.69 1.59 0.95 5.61 1.09 0.98 16.43 12.25 8.28 10.52 7.28 7.22
75% 5.26 4.97 1.10 5.74 3.95 1.12 9.51 12.82 7.82 14.68 12.23 8.59
90% 3.87 6.00 1.08 5.08 5.76 1.16 5.43 7.68 5.57 12.21 13.93 7.87
99% 1.59 2.29 0.76 2.77 6.10 . 1.64 2.22 1.86 3.53 10.36 .
1000 1% 1.68 2.25 0.41 1.31 5.66 4.89 1.83 2.23 0.50 1.36 5.07 4.47
10% 3.66 6.40 0.60 3.01 5.57 0.70 5.45 5.18 2.92 3.82 3.16 1.93
25% 5.00 4.47 0.68 4.41 3.31 0.66 9.39 2.44 5.95 6.35 1.27 4.32
50% 5.59 1.30 0.74 5.39 0.88 0.75 15.77 11.78 7.95 10.05 6.96 6.92
75% 5.00 4.75 0.80 5.54 3.84 0.82 9.03 12.27 7.44 14.04 11.73 8.24
90% 3.89 6.17 0.80 4.98 5.79 0.86 5.19 7.38 5.35 11.46 13.16 7.44
99% 1.47 2.20 0.51 2.54 6.09 . 1.56 2.18 1.81 3.17 10.08 .
5000 1% 1.47 2.26 0.26 1.13 5.71 4.97 1.64 2.24 0.33 1.16 5.04 4.49
10% 3.66 6.54 0.47 2.96 5.60 0.58 5.45 5.28 2.90 3.84 3.25 1.95
25% 4.87 4.34 0.53 4.38 3.32 0.53 9.78 2.42 6.20 6.62 1.28 4.51
50% 5.71 1.12 0.60 5.46 0.76 0.60 16.49 12.29 8.29 10.23 7.12 7.07
75% 5.05 4.70 0.58 5.56 3.85 0.60 9.13 12.46 7.51 13.81 11.56 8.09
90% 3.79 6.09 0.52 4.90 5.71 0.64 5.25 7.47 5.40 11.78 13.59 7.66
99% 1.44 2.24 0.31 2.28 5.95 . 1.59 2.25 1.87 3.10 10.40 .
10000 1% 1.41 2.21 0.22 1.06 5.56 4.82 1.62 2.23 0.29 1.16 5.18 4.60
10% 3.42 6.11 0.41 2.80 5.38 0.54 5.51 5.35 2.92 3.78 3.22 1.92
25% 4.95 4.47 0.51 4.43 3.38 0.51 9.15 2.27 5.80 6.15 1.18 4.20
50% 5.38 1.02 0.53 5.29 0.72 0.55 16.06 12.01 8.09 9.96 6.93 6.88
75% 5.01 4.62 0.53 5.41 3.72 0.55 9.26 12.65 7.63 13.94 11.70 8.19
90% 3.77 6.02 0.46 4.86 5.64 0.58 5.27 7.50 5.43 11.71 13.53 7.65
99% 1.44 2.24 0.26 2.30 6.05 . 1.62 2.29 1.91 3.11 10.57 .
Table 4.4: RRMSE values for $\widehat{F}_{\text{B}}(t)$ (shortened as ‘B’), $\widehat{F}_{\text{P}}(t)$ (shortened as ‘P’), and $\widehat{F}_{\text{R}}(t)$ (shortened as ‘R’), as well as their corresponding quantile estimators, under model $\xi_{4}$.
MAR MNAR
$\widehat{F}(t)$ $\hat{t}(\alpha)$ $\widehat{F}(t)$ $\hat{t}(\alpha)$
$n_{\text{B}}$ $\alpha$ B P R B P R B P R B P R
500 1% 1.10 2.23 1.09 1.09 3.64 2.24 1.88 2.19 1.57 1.71 3.32 2.11
10% 2.49 6.03 0.99 2.08 6.51 0.99 4.67 4.88 2.56 3.39 3.78 1.87
25% 3.42 5.60 1.09 3.09 4.03 1.00 8.22 1.42 4.86 5.74 0.74 3.66
50% 4.14 2.08 1.54 3.90 1.18 1.38 14.34 11.38 7.55 8.81 5.85 5.96
75% 3.88 7.52 1.93 4.12 5.15 1.79 8.96 11.95 7.48 12.03 10.43 7.38
90% 2.79 6.76 1.88 3.36 7.31 1.69 5.26 7.63 5.48 10.17 12.05 7.00
99% 1.17 2.27 1.10 1.67 5.41 1.59 1.63 2.27 1.84 2.89 6.98 3.47
1000 1% 0.85 2.34 0.95 0.75 3.88 2.44 1.70 2.28 1.46 1.37 3.35 2.14
10% 2.31 5.92 0.74 1.94 6.60 0.80 4.76 5.14 2.53 3.47 4.00 1.85
25% 3.44 5.89 0.91 3.09 4.19 0.81 8.12 1.23 4.79 5.61 0.62 3.57
50% 4.20 1.76 1.40 3.97 1.01 1.26 14.38 11.48 7.61 8.79 5.85 5.97
75% 3.96 7.73 1.87 4.21 5.33 1.74 8.77 11.81 7.38 11.96 10.45 7.39
90% 2.81 6.97 1.83 3.45 7.63 1.62 4.97 7.20 5.19 9.61 11.52 6.72
99% 1.01 2.24 0.97 1.38 5.32 . 1.58 2.22 1.77 2.73 6.89 .
5000 1% 0.43 2.22 0.73 0.31 3.84 2.38 1.48 2.26 1.31 1.07 3.34 2.11
10% 2.28 6.29 0.57 1.84 6.82 0.63 4.95 5.50 2.51 3.56 4.28 1.80
25% 3.40 6.02 0.75 3.01 4.22 0.65 8.21 1.07 4.77 5.74 0.54 3.60
50% 4.18 1.52 1.32 3.93 0.87 1.19 13.88 11.10 7.32 8.68 5.75 5.87
75% 3.92 7.82 1.83 4.08 5.27 1.66 8.57 11.60 7.23 11.68 10.25 7.23
90% 2.82 7.16 1.83 3.41 7.81 1.60 5.24 7.62 5.48 10.21 12.31 7.17
99% 0.89 2.24 0.89 1.20 5.31 . 1.51 2.18 1.71 2.57 6.81 .
10000 1% 0.38 2.26 0.71 0.24 3.68 2.31 1.43 2.22 1.27 1.04 3.33 2.12
10% 2.21 6.09 0.52 1.79 6.63 0.59 4.80 5.33 2.44 3.41 4.10 1.75
25% 3.21 5.65 0.70 2.92 4.05 0.61 8.17 1.03 4.76 5.54 0.50 3.48
50% 3.99 1.45 1.27 3.71 0.82 1.13 14.04 11.23 7.42 8.69 5.79 5.89
75% 3.82 7.63 1.79 4.01 5.19 1.64 9.04 12.24 7.63 12.08 10.60 7.49
90% 2.78 7.07 1.81 3.31 7.58 1.56 5.12 7.43 5.36 9.83 11.88 6.92
99% 0.92 2.36 0.92 1.19 5.51 . 1.58 2.29 1.79 2.63 7.01 .

4.2 CDF Estimation

Let us begin with the CDF estimators. Starting with model $\xi_{1}$, under MAR, $\widehat{F}_{\text{R}}(t)$ drastically outperformed $\widehat{F}_{\text{B}}(t)$ even for small $n_{\text{B}}$, which implies superiority under ignorability and a correctly specified regression model. $\widehat{F}_{\text{P}}(t)$ still performed well overall, though, with RRMSE that was always larger than that of $\widehat{F}_{\text{R}}(t)$. Under MNAR, all estimators experienced significant reductions in efficiency, but $\widehat{F}_{\text{R}}(t)$ still performed the best overall, which suggests robustness to violations of ignorability for correctly specified regression models. For model $\xi_{2}$, under both MAR and MNAR, $\widehat{F}_{\text{R}}(t)$ tended to perform optimally for $t$ near the center of the distribution, but it was consistently outperformed by $\widehat{F}_{\text{B}}(t)$ at the distribution tails (i.e., at the 1st and 99th percentiles). Despite $\widehat{F}_{\text{P}}(t)$ generally performing worse than $\widehat{F}_{\text{R}}(t)$, with a noticeable exception at the 10th percentile, it tended to perform better than $\widehat{F}_{\text{B}}(t)$, suggesting gains in efficiency even under model misspecification [30]. Under model $\xi_{3}$ and MAR, $\widehat{F}_{\text{R}}(t)$ experienced impressive gains in efficiency compared to $\widehat{F}_{\pi}(t)$ despite severe model misspecification. $\widehat{F}_{\text{P}}(t)$, though, tended to perform worse than $\widehat{F}_{\text{R}}(t)$, which supports its endorsement if and only if the prediction error between $Y_{i}$ and $\hat{Y}_{i}$ is negligible for all $i\in\boldsymbol{A}$; otherwise, the consequences of $\widehat{F}_{\text{P}}(t)$’s ‘all-or-nothing’ approach will likely spoil any potential help from $\boldsymbol{B}$. Under MNAR, $\widehat{F}_{\text{R}}(t)$ still performed the best overall, but we noted severe efficiency losses that failed to dissipate with larger $n_{\text{B}}$. In fact, these losses were much greater in magnitude compared to those under MNAR for model $\xi_{1}$, which may imply sensitivity to violations of ignorability when the regression model is severely misspecified. Concluding with model $\xi_{4}$, under MAR, $\widehat{F}_{\text{R}}(t)$ exhibited superior performance at all percentiles except the 1st for $n_{\text{B}}\neq 500$, where it was outperformed by $\widehat{F}_{\text{B}}(t)$. Meanwhile, $\widehat{F}_{\text{P}}(t)$ generally performed worse than both $\widehat{F}_{\text{R}}(t)$ and $\widehat{F}_{\text{B}}(t)$, with efficiency losses surpassing those reported under model $\xi_{3}$. Conclusions under MNAR are virtually identical to those discussed for model $\xi_{3}$.

In summary, these results imply several important points. First, $\widehat{F}_{\text{R}}(t)$ appears to be fairly robust to violations of ignorability if the regression model is correctly specified, and robust to regression model misspecification if ignorability is satisfied. When both assumptions are violated, $\widehat{F}_{\text{R}}(t)$ will generally still perform better than $\widehat{F}_{\text{P}}(t)$ and $\widehat{F}_{\text{B}}(t)$, albeit with noticeable reductions in efficiency. Second, $\widehat{F}_{\text{P}}(t)$ generally performs worse than $\widehat{F}_{\text{R}}(t)$, but $\widehat{F}_{\text{P}}(t)$ may still perform well even under MNAR if the regression model is correctly specified. The final point emphasizes that $\widehat{F}_{\text{B}}(t)$, despite its lackluster performance overall, still performed well at the tails of $Y$’s distribution, even when its sample size was small ($<1\%$ of $\mathcal{U}$) and its missingness mechanism was MNAR. This is an encouraging result for researchers interested in estimating $F_{\text{N}}(t)$ at the distribution extrema without the use of a probability sample.

4.3 Quantile Estimation

In general, the performance of the three quantile estimators tended to match that of their CDF counterparts, with some noticeable exceptions. For one, $\hat{t}_{\text{R}}(.99)$, unlike $\hat{t}_{\text{B}}(.99)$ and $\hat{t}_{\text{P}}(.99)$, was missing for nearly every $n_{\text{B}}$ under models $\xi_{3}$ and $\xi_{4}$, even for MAR missingness. There is a reasonable explanation, however: if we were to compute $\widehat{F}_{\text{P}}(t)$ and $\widehat{F}_{\text{B}}(t)$ at every value of $\hat{Y}_{i}$ for $i\in\boldsymbol{A}$ and/or $Y_{j}$ for $j\in\boldsymbol{B}$, we will always encounter data points that serve either as the sample minimum (i.e., $\widehat{F}(t)\approx 0$) or the sample maximum (i.e., $\widehat{F}(t)\approx 1$), even if the CDF estimator itself is spurious. This is not always true for $\widehat{F}_{\text{R}}(t)$, though, because there may not exist an $\boldsymbol{X}_{i}$ in $\boldsymbol{A}$ that drives $\widehat{G}(\boldsymbol{X}_{i})$ near zero or unity, especially if the regression model is misspecified. In these cases, it may be better to substitute $\hat{t}_{\text{R}}(\alpha)$ with $\hat{t}_{\text{B}}(\alpha)$, especially for large $n_{\text{B}}$, as it is more likely to produce an accurate estimate of $t_{\text{N}}(\alpha)$ than $\hat{t}_{\text{P}}(\alpha)$. Another exception involves the performance of $\hat{t}_{\text{B}}(\alpha)$, $\hat{t}_{\text{P}}(\alpha)$, and $\hat{t}_{\text{R}}(\alpha)$ under MNAR for model $\xi_{4}$, which varied widely depending on the percentile in question; specifically, $\hat{t}_{\text{B}}(\alpha)$ performed the best for percentiles near the distribution extrema (i.e., 1st and 99th), $\hat{t}_{\text{P}}(\alpha)$ for those near the distribution center (i.e., 25th and 50th), and $\hat{t}_{\text{R}}(\alpha)$ elsewhere, despite $\widehat{F}_{\text{R}}(t)$ generally outperforming its CDF siblings. We suspect that these discrepancies result from model misspecification.

4.4 Variance Estimation

We conclude this section with results from a limited variance estimation study, which investigated several performance metrics for the variance estimators of $\widehat{F}_{\text{R}}(t)$ and $\hat{t}_{\text{R}}(\alpha)$ under $n_{\text{B}}\in\{500,\,10000\}$, models $\xi_{1}$ and $\xi_{3}$, and $n_{\text{sim}}=500$. The simulation settings were otherwise identical to those previously described.

We calculated two types of variance estimators for both $\widehat{F}_{\text{R}}(t)$ and $\hat{t}_{\text{R}}(\alpha)$. The first type, denoted by $V_{1}$, is given in Theorem 2 for $\widehat{F}_{\text{R}}(t)$ and by Eq. (3.7) for $\hat{t}_{\text{R}}(\alpha)$. Note that, due to the SRS design of $\boldsymbol{A}$, the variance estimator in Theorem 2 simplifies to

V_{1}\left(\widehat{F}_{\text{R}}(t)\right)=\widehat{\mathrm{Var}}\left(\widehat{F}_{\text{R}}(t)\right)=\frac{1-\frac{n_{\text{A}}}{N}}{n_{\text{A}}}s^{2}, (4.2)

where $s^{2}=\frac{1}{n_{\text{A}}-1}\sum_{i\in\boldsymbol{A}}\left(\widehat{G}(\boldsymbol{X}_{i})-\frac{1}{n_{\text{A}}}\sum_{i\in\boldsymbol{A}}{\widehat{G}(\boldsymbol{X}_{i})}\right)^{2}$. The second variance estimator, denoted by $V_{2}$, is an adaptation of the bootstrap variance procedure of [30], which is summarized as follows (a code sketch of both $V_{1}$ and $V_{2}$ follows the list):

1. Generate $L=500$ sets of replicate weights for sample $\boldsymbol{A}$ and bootstrap samples with replacement (of size $n_{\text{B}}$) from sample $\boldsymbol{B}$.

2. Use each weight-resample pair to calculate $\widehat{F}_{\text{R}}(t)$, $\hat{t}_{\text{R}}(\alpha)$, and $\widehat{F}_{\text{R}}(\hat{t}_{\text{R}}(\alpha))$.

3. Use both the replicate estimators (say, $\hat{\theta}^{l}$) and the estimators calculated from the original $\boldsymbol{A},\boldsymbol{B}$ sets (say, $\hat{\theta}$) to calculate

V_{2}\left(\widehat{F}_{\text{R}}(t)\right)=\widehat{\mathrm{Var}}_{\text{Boot}}\left(\widehat{F}_{\text{R}}(t)\right)=\frac{1}{L}\sum_{l=1}^{L}{\left(\widehat{F}^{l}_{\text{R}}(t)-\widehat{F}_{\text{R}}(t)\right)^{2}} (4.3)

V_{2}\Big(\hat{t}_{\text{R}}(\alpha)\Big)=\widehat{\mathrm{Var}}_{\text{Boot}}\Big(\hat{t}_{\text{R}}(\alpha)\Big)=\left(\frac{\hat{t}^{*}_{\text{UL}}-\hat{t}^{*}_{\text{LL}}}{2z_{\gamma/2}}\right)^{2}, (4.4)

where

\hat{t}^{*}_{\text{LL}}=\min_{t}\left(\widehat{F}_{\text{R}}(t)\geq\alpha-z_{\gamma/2}\widehat{\text{SE}}_{\text{Boot}}\left[\widehat{F}_{\text{R}}\left(\hat{t}_{\text{R}}(\alpha)\right)\right]\right) (4.5)

\hat{t}^{*}_{\text{UL}}=\min_{t}\left(\widehat{F}_{\text{R}}(t)\geq\alpha+z_{\gamma/2}\widehat{\text{SE}}_{\text{Boot}}\left[\widehat{F}_{\text{R}}\left(\hat{t}_{\text{R}}(\alpha)\right)\right]\right) (4.6)

and $\widehat{\text{SE}}_{\text{Boot}}\coloneqq\sqrt{\widehat{\mathrm{Var}}_{\text{Boot}}}$.
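As a hedged sketch, both variance estimators can be coded as follows under the SRS design of $\boldsymbol{A}$, reusing the F_R() function sketched in Section 3.1 (all names are illustrative):

```r
# V1: Eq. (4.2); Ghat is the vector G-hat(X_i) for i in A.
V1_srs <- function(Ghat, nA, N) {
  (1 - nA / N) / nA * var(Ghat)      # var() uses the 1/(nA - 1) divisor, as in s^2
}

# V2: Eq. (4.3) via the bootstrap; resampling A stands in for replicate
# weights under SRS, and B is resampled with replacement.
V2_boot <- function(t, A, B, N, L = 500) {
  est  <- F_R(t, A, B, N)                          # estimate from the original samples
  reps <- replicate(L, {
    Astar <- A[sample(nrow(A), replace = TRUE), ]
    Bstar <- B[sample(nrow(B), replace = TRUE), ]
    F_R(t, Astar, Bstar, N)                        # replicate estimate F_R^l(t)
  })
  mean((reps - est)^2)
}
```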

We then compared both $V_{1}$ and $V_{2}$ to the ‘gold-standard’ Monte Carlo variance, defined as

\widehat{\mathrm{Var}}_{\text{MC}}=\frac{1}{n_{\text{sim}}-1}\sum_{m=1}^{n_{\text{sim}}}{\left(\hat{\theta}_{m}-\frac{1}{n_{\text{sim}}}\sum_{m=1}^{n_{\text{sim}}}{\hat{\theta}_{m}}\right)^{2}}, (4.7)

where $\hat{\theta}$ may denote either $\widehat{F}_{\text{R}}(t)$ or $\hat{t}_{\text{R}}(\alpha)$.

Finally, after $n_{\text{sim}}=500$ independent simulation runs, we calculated the following performance statistics at the 1st, 10th, 25th, 50th, 75th, 90th, and 99th percentiles:

• $\bar{\widehat{F}}_{\text{R}}$ ($\bar{\hat{t}}_{\text{R}}$): The mean of $\widehat{F}_{\text{R}}(t)$ ($\hat{t}_{\text{R}}(\alpha)$) across all simulation runs.

• CR: The coverage rate, or percentage of instances in which the respective population parameter was captured by the approximate 90% $Z$-confidence interval. For $\widehat{F}_{\text{R}}(t)$, this interval was calculated as either $\widehat{F}_{\text{R}}(t)\pm 1.645\times\sqrt{V_{1}\left(\widehat{F}_{\text{R}}(t)\right)}$ or $\widehat{F}_{\text{R}}(t)\pm 1.645\times\sqrt{V_{2}\left(\widehat{F}_{\text{R}}(t)\right)}$, and similarly for $\hat{t}_{\text{R}}(\alpha)$ using Eqs. (3.5) and (3.6) or (4.5) and (4.6).

  • AL: The average length of the confidence intervals.

• $\bar{\widehat{\mathrm{Var}}}$: The mean estimated variance of $\widehat{F}_{\text{R}}$ ($\hat{t}_{\text{R}}$), multiplied by $10$ for $\hat{t}_{\text{R}}(\alpha)$ or $10^{3}$ for $\widehat{F}_{\text{R}}(t)$ (to enhance readability).

• RB: The percent absolute relative bias of $\bar{\widehat{\mathrm{Var}}}$ relative to $\widehat{\mathrm{Var}}_{\text{MC}}$, $\mathrm{RB}=\frac{\left|\widehat{\mathrm{Var}}_{\text{MC}}-\bar{\widehat{\mathrm{Var}}}\right|}{\widehat{\mathrm{Var}}_{\text{MC}}}\times 100$, rounded to one decimal place.

Results for $\widehat{F}_{\text{R}}(t)$ are presented in Tables 4.5 and 4.6, whereas results for $\hat{t}_{\text{R}}(\alpha)$ are presented in Tables 4.7 and 4.8.

Table 4.5: Performance statistics for the variance estimators of $\widehat{F}_{\text{R}}(t)$ under model $\xi_{1}$.
CR AL $\bar{\widehat{\mathrm{Var}}}$ RB
$n_{\text{B}}$ $\alpha$ $F_{\text{N}}$ $\bar{\widehat{F}}_{\text{R}}$ $V_{1}$ $V_{2}$ $V_{1}$ $V_{2}$ $V_{1}$ $V_{2}$ $V_{1}$ $V_{2}$
500 MAR 1% 0.010 0.010 83.8 86.4 0.010 0.011 0.010 0.011 15.5 5.7
10% 0.100 0.099 85.2 88.0 0.035 0.038 0.113 0.131 23.0 10.7
25% 0.250 0.250 85.8 87.4 0.052 0.056 0.253 0.286 18.1 7.2
50% 0.500 0.499 86.4 86.6 0.061 0.064 0.346 0.382 14.7 5.8
75% 0.750 0.748 86.8 90.0 0.052 0.056 0.255 0.288 15.2 4.2
90% 0.900 0.901 84.8 87.8 0.035 0.038 0.113 0.131 17.5 4.5
99% 0.990 0.990 81.6 83.0 0.010 0.010 0.010 0.011 11.7 1.5
MNAR 1% 0.010 0.010 82.8 86.2 0.010 0.011 0.010 0.011 15.1 5.1
10% 0.100 0.100 85.0 87.4 0.035 0.037 0.113 0.131 22.9 11.0
25% 0.250 0.250 84.8 88.6 0.052 0.056 0.253 0.286 18.0 7.4
50% 0.500 0.499 86.4 88.2 0.061 0.064 0.347 0.382 14.7 5.8
75% 0.750 0.748 87.2 88.8 0.053 0.056 0.255 0.288 15.1 4.1
90% 0.900 0.900 86.0 88.0 0.035 0.038 0.113 0.131 17.4 4.5
99% 0.990 0.990 83.0 84.8 0.010 0.010 0.010 0.011 11.3 1.2
10000 MAR 1% 0.010 0.010 86.8 86.4 0.010 0.010 0.010 0.010 5.1 4.5
10% 0.100 0.100 87.6 88.2 0.035 0.035 0.113 0.113 8.8 8.9
25% 0.250 0.250 88.6 88.6 0.052 0.052 0.253 0.253 7.2 7.2
50% 0.500 0.500 89.0 89.0 0.061 0.061 0.346 0.346 6.5 6.5
75% 0.750 0.749 88.8 89.4 0.052 0.053 0.254 0.255 3.6 3.1
90% 0.900 0.901 87.6 87.8 0.035 0.035 0.113 0.114 6.1 5.4
99% 0.990 0.990 82.8 82.4 0.010 0.010 0.010 0.010 7.2 6.5
MNAR 1% 0.010 0.010 86.8 86.8 0.010 0.010 0.010 0.010 5.1 4.8
10% 0.100 0.100 88.2 88.2 0.035 0.035 0.113 0.114 8.9 8.1
25% 0.250 0.250 88.6 88.4 0.052 0.052 0.253 0.254 7.2 6.5
50% 0.500 0.500 88.6 89.4 0.061 0.061 0.346 0.347 6.5 6.0
75% 0.750 0.749 88.2 88.4 0.052 0.052 0.254 0.255 3.6 3.4
90% 0.900 0.901 87.8 87.6 0.035 0.035 0.113 0.114 6.1 5.5
99% 0.990 0.990 82.4 82.4 0.010 0.010 0.010 0.010 7.2 6.7
Table 4.6: Performance statistics for the variance estimators of $\widehat{F}_{\text{R}}(t)$ under model $\xi_{3}$.
CR AL $\bar{\widehat{\mathrm{Var}}}$ RB
$n_{\text{B}}$ Miss. $\alpha$ $F_{\text{N}}$ $\bar{\widehat{F}}_{\text{R}}$ $V_{1}$ $V_{2}$ $V_{1}$ $V_{2}$ $V_{1}$ $V_{2}$ $V_{1}$ $V_{2}$
500 MAR 1% 0.010 0.010 43.6 88.0 0.003 0.008 0.001 0.006 85.9 2.1
10% 0.100 0.098 59.4 89.0 0.018 0.035 0.030 0.114 72.3 4.8
25% 0.250 0.250 67.8 89.8 0.032 0.053 0.095 0.263 61.5 6.7
50% 0.500 0.499 71.6 91.8 0.040 0.061 0.148 0.341 51.9 10.8
75% 0.750 0.748 67.6 92.0 0.033 0.053 0.102 0.261 57.3 9.3
90% 0.900 0.899 63.2 90.4 0.020 0.036 0.036 0.120 68.8 4.1
99% 0.990 0.990 48.4 88.8 0.003 0.008 0.001 0.007 84.0 0.2
MNAR 1% 0.010 0.010 43.6 87.4 0.003 0.008 0.001 0.006 86.1 1.5
10% 0.100 0.097 58.8 89.4 0.018 0.035 0.030 0.114 72.4 5.1
25% 0.250 0.249 70.4 92.2 0.032 0.053 0.095 0.264 61.6 6.9
50% 0.500 0.498 73.8 93.6 0.040 0.061 0.148 0.340 52.0 10.6
75% 0.750 0.748 70.4 91.4 0.033 0.053 0.102 0.260 57.3 8.8
90% 0.900 0.899 62.0 90.6 0.020 0.036 0.036 0.120 68.9 3.6
99% 0.990 0.990 49.2 86.2 0.003 0.008 0.001 0.007 84.0 0.3
10000 MAR 1% 0.010 0.010 83.8 88.6 0.003 0.003 0.001 0.001 22.1 4.5
10% 0.100 0.098 84.2 86.8 0.018 0.019 0.029 0.033 5.5 7.8
25% 0.250 0.250 91.2 91.8 0.032 0.033 0.093 0.102 0.7 9.7
50% 0.500 0.499 89.8 91.2 0.040 0.041 0.146 0.155 5.1 12.1
75% 0.750 0.748 90.4 91.0 0.033 0.034 0.100 0.109 5.9 14.5
90% 0.900 0.899 89.6 91.4 0.019 0.021 0.035 0.039 2.0 14.6
99% 0.990 0.990 87.8 91.4 0.003 0.004 0.001 0.001 12.8 12.1
MNAR 1% 0.010 0.010 83.6 89.8 0.003 0.003 0.001 0.001 22.2 4.4
10% 0.100 0.098 86.2 88.0 0.018 0.019 0.029 0.033 5.6 7.7
25% 0.250 0.250 91.6 92.2 0.032 0.033 0.093 0.101 0.7 9.5
50% 0.500 0.499 90.0 91.6 0.040 0.041 0.145 0.155 5.0 11.7
75% 0.750 0.748 90.2 92.0 0.033 0.034 0.100 0.108 5.9 14.0
90% 0.900 0.899 90.8 92.6 0.019 0.021 0.035 0.039 2.1 14.3
99% 0.990 0.990 86.4 91.2 0.003 0.004 0.001 0.001 12.9 11.7
Table 4.7: Performance statistics for the variance estimators of $\hat{t}_{\text{R}}(\alpha)$ under model $\xi_{1}$.
CR AL $\bar{\widehat{\mathrm{Var}}}$ RB
$n_{\text{B}}$ Miss. $\alpha$ $t_{\text{N}}$ $\bar{\hat{t}}_{\text{R}}$ $V_{1}$ $V_{2}$ $V_{1}$ $V_{2}$ $V_{1}$ $V_{2}$ $V_{1}$ $V_{2}$
500 MAR 1% 1.353 1.583 71.8 75.4 1.140 1.228 1.497 1.718 13.6 0.9
10% 4.461 4.506 84.4 85.0 0.596 0.609 0.334 0.350 23.3 19.7
25% 6.277 6.298 85.2 85.2 0.493 0.498 0.227 0.231 17.3 15.7
50% 8.289 8.307 85.2 85.0 0.461 0.464 0.198 0.200 13.4 12.3
75% 10.298 10.331 86.2 87.6 0.497 0.502 0.230 0.235 15.9 14.1
90% 12.146 12.167 83.6 84.8 0.606 0.619 0.346 0.362 17.6 13.8
99% 15.291 15.597 67.6 70.0 1.396 1.509 2.310 2.652 26.9 16.1
MNAR 1% 1.353 1.579 72.8 75.4 1.149 1.246 1.508 1.780 12.1 3.7
10% 4.461 4.505 84.6 85.0 0.600 0.609 0.338 0.350 19.7 16.9
25% 6.277 6.298 84.6 86.0 0.491 0.497 0.224 0.230 16.9 14.8
50% 8.289 8.308 85.4 85.6 0.461 0.464 0.198 0.201 13.3 12.0
75% 10.298 10.332 86.4 86.0 0.496 0.502 0.230 0.235 14.8 12.9
90% 12.146 12.169 84.4 85.2 0.609 0.620 0.350 0.362 15.1 12.2
99% 15.291 15.590 68.1 69.3 1.419 1.502 2.385 2.647 23.5 15.1
10000 MAR 1% 1.353 1.572 73.4 74.2 1.140 1.162 1.485 1.554 4.8 0.3
10% 4.461 4.501 87.4 88.0 0.592 0.596 0.329 0.334 11.5 10.2
25% 6.277 6.292 89.0 88.8 0.491 0.491 0.225 0.225 6.3 6.3
50% 8.289 8.304 88.2 89.0 0.461 0.461 0.197 0.198 5.2 5.1
75% 10.298 10.329 88.0 88.6 0.497 0.498 0.230 0.231 2.2 1.7
90% 12.146 12.168 88.2 88.4 0.608 0.611 0.347 0.352 7.4 6.2
99% 15.291 15.583 66.6 67.6 1.425 1.460 2.412 2.492 16.4 13.6
MNAR 1% 1.353 1.572 73.2 74.2 1.137 1.163 1.482 1.545 5.2 1.1
10% 4.461 4.502 87.4 87.6 0.594 0.600 0.332 0.339 11.4 9.5
25% 6.277 6.292 88.6 88.6 0.492 0.494 0.225 0.227 6.0 5.1
50% 8.289 8.304 89.0 89.2 0.460 0.461 0.196 0.198 6.5 6.0
75% 10.298 10.328 88.4 88.4 0.496 0.497 0.229 0.230 3.6 3.2
90% 12.146 12.167 89.0 89.0 0.607 0.611 0.346 0.351 6.9 5.5
99% 15.291 15.580 66.2 67.0 1.422 1.454 2.413 2.479 15.3 13.0
Table 4.8: Performance statistics for the variance estimators of $\hat{t}_{\text{R}}(\alpha)$ under model $\xi_{3}$.
CR AL $\bar{\widehat{\mathrm{Var}}}$ RB
$n_{\text{B}}$ Miss. $\alpha$ $t_{\text{N}}$ $\bar{\hat{t}}_{\text{R}}$ $V_{1}$ $V_{2}$ $V_{1}$ $V_{2}$ $V_{1}$ $V_{2}$ $V_{1}$ $V_{2}$
500 MAR 1% -3.919 -2.937 . . . . . . . .
10% -2.688 -2.651 53.8 60.8 0.123 0.145 0.015 0.021 73.7 63.2
25% -1.938 -1.930 65.8 68.4 0.122 0.127 0.014 0.015 62.1 58.9
50% -1.122 -1.114 72.2 72.0 0.122 0.123 0.014 0.014 52.6 51.4
75% -0.315 -0.302 67.0 69.8 0.124 0.128 0.014 0.015 57.9 54.8
90% 0.396 0.419 60.5 69.1 0.134 0.166 0.018 0.027 66.1 48.2
99% 1.566 . . . . . . . . .
MNAR 1% -3.919 -2.933 . . . . . . . .
10% -2.688 -2.648 53.8 61.0 0.123 0.147 0.015 0.021 73.5 62.2
25% -1.938 -1.929 69.0 71.4 0.122 0.127 0.014 0.015 61.1 58.1
50% -1.122 -1.113 73.6 74.6 0.122 0.124 0.014 0.014 50.2 48.3
75% -0.315 -0.301 69.4 71.8 0.125 0.130 0.014 0.016 56.7 52.8
90% 0.396 0.423 61.2 71.0 0.138 0.170 0.019 0.029 66.0 47.9
99% 1.566 . . . . . . . . .
10000 MAR 1% -3.919 -2.911 . . . . . . . .
10% -2.688 -2.656 73.0 77.8 0.120 0.133 0.014 0.017 24.5 6.7
25% -1.938 -1.932 89.0 90.4 0.121 0.124 0.014 0.014 4.0 0.5
50% -1.122 -1.115 89.4 89.4 0.121 0.122 0.014 0.014 4.9 7.2
75% -0.315 -0.303 88.6 89.2 0.124 0.127 0.014 0.015 5.7 10.5
90% 0.396 0.420 78.8 84.2 0.132 0.147 0.017 0.021 15.2 7.6
99% 1.566 . . . . . . . . .
MNAR 1% -3.919 -2.910 . . . . . . . .
10% -2.688 -2.655 73.0 77.4 0.121 0.133 0.014 0.017 20.1 1.9
25% -1.938 -1.931 88.6 89.2 0.121 0.124 0.014 0.014 1.1 3.3
50% -1.122 -1.114 88.6 89.2 0.121 0.122 0.014 0.014 7.9 10.1
75% -0.315 -0.303 88.8 89.4 0.124 0.126 0.014 0.015 9.2 13.3
90% 0.396 0.421 80.6 85.7 0.132 0.148 0.017 0.022 14.8 9.3
99% 1.566 . . . . . . . . .

Starting with $\widehat{F}_{\text{R}}(t)$, under model $\xi_{1}$ and $n_{\text{B}}=500$, the coverage rates based on $V_{1}$ were quite close to the nominal confidence level for both MAR and MNAR missingness mechanisms, albeit lower at the distribution extrema due to smaller variance. The same pattern held for $V_{2}$, with rates even closer to the nominal 90%; this was largely due to a less-biased $\bar{\widehat{\mathrm{Var}}}$, suggesting that [30]'s bootstrap variance estimator outperforms its asymptotic competitor when $n_{\text{B}}$ is rather small. Yet, as $n_{\text{B}}$ increased from 0.5% to 10% of $N$, the differences between the two variance estimators became negligible, producing coverage rates that were consistently near the nominal level for all $\alpha\neq 99\%$. Moving to model $\xi_{3}$ with $n_{\text{B}}=500$, we note a severely increased degree of bias for the asymptotic variance estimator, resulting in liberal confidence intervals that excluded $F_{\text{N}}(t)$ at disproportionate rates. The bootstrap variance estimator, however, continued to perform well, suggesting robustness to violations of ignorability and model misspecification. For large $n_{\text{B}}$, the differences again shrank to practically negligible magnitudes, though $V_{1}$ tended to have less bias than $V_{2}$.

Let us now transition to quantile estimation. Starting with model $\xi_{1}$, we note trends similar to those above, with several exceptions. First, although the RB values of the bootstrap variance estimator remained smaller than those of its asymptotic-theory counterpart, both were larger than under CDF estimation, resulting in coverage rates that were somewhat lower on average and markedly lower at the distribution extrema. These trends were amplified under model $\xi_{3}$ with $n_{\text{B}}=500$ but became much milder as $n_{\text{B}}$ grew. Still, most performance statistics remained undefined at the 1st and 99th percentiles, as only one run out of $n_{\text{sim}}=500$ produced a defined estimate of $\hat{t}_{\text{R}}$ (see Section 4.3 for a discussion of this phenomenon). In aggregate, these results imply that both variance estimators of $\hat{t}_{\text{R}}$ are (generally) robust to violations of ignorability and model misspecification, but only if $n_{\text{B}}$ is large. If $n_{\text{B}}$ is small, the two variance estimators will indeed be similar, but will likely be severely biased as well.

5 Real Data Example

In this section, we illustrate the performance of $\widehat{F}_{\text{R}}(t)$ and $\widehat{F}_{\text{P}}(t)$ using data from the National Health and Nutrition Examination Survey (NHANES) published by the Centers for Disease Control and Prevention (CDC) [23]. NHANES is a multistage stratified random sample designed to obtain and disseminate detailed information on the health and nutritional status of a nationally representative sample of non-institutionalized individuals in the United States. Our goal is to estimate the distribution function of total cholesterol (TCHL, $Y$, in mg/dL) in the population of U.S. adults using biological sex ($X_{1}$), age ($X_{2}$), glycohemoglobin ($X_{3}$, in %), triglycerides ($X_{4}$, in mg/dL), direct high-density lipoprotein cholesterol (HDL, $X_{5}$, in mg/dL), body mass index (BMI, $X_{6}$, in kg/m$^{2}$), and pulse ($X_{7}$). We used the subsample of individuals with available data in the 2015-2016 NHANES cohort as the probability sample ($n_{\text{A}}=$ 2,474) and those with available data in the 2017-2020 cohort as the nonprobability sample ($n_{\text{B}}=$ 3,770). Notably, observations in $\boldsymbol{B}$ collected during the 2019-2020 period were deemed by the CDC to be nonrepresentative of the population and had to be combined with the previous cohort to ensure respondent confidentiality. Absolute relative biases ($\mathrm{RB}$) of our estimators relative to $\widehat{F}_{\pi}(t)$ and $\hat{t}_{\pi}(\alpha)$ are presented in Table 5.3, whereas bootstrap estimates of the variance are presented in Tables 5.4 and 5.5. To accommodate the lack of information regarding the true regression function and $\boldsymbol{B}$'s missingness mechanism, we also report various statistics concerning the estimated coefficients (Table 5.1), model fit (Table 5.2), and model diagnostics (Figure 1).
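For transparency, the following is a minimal R sketch of the two model fits underlying Tables 5.1 and 5.2; the data frames A and B, the variable names, and the single-stage design specification are simplifying assumptions (the actual NHANES design involves strata and PSUs).

    library(survey)

    f <- TCHL ~ sex + age + ghb + trig + hdl + bmi + pulse

    fit_B <- lm(f, data = B)                               # naive fit on the convenience sample
    des_A <- svydesign(ids = ~1, weights = ~wt, data = A)  # simplified single-stage design
    fit_A <- svyglm(f, design = des_A)                     # survey-weighted fit on A

    # 90% t-confidence intervals from B; flag whether each survey-weighted
    # coefficient from A is captured (final column of Table 5.1)
    ci_B    <- confint(fit_B, level = 0.90)
    covered <- coef(fit_A) >= ci_B[, 1] & coef(fit_A) <= ci_B[, 2]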

Table 5.1: Table of coefficients, along with various statistics. We report (a) the survey-weighted $\hat{\beta}$ from $\boldsymbol{A}$ (obtained using [34]'s svyglm() function in R; $\hat{\beta}_{\boldsymbol{A}}$), (b) naïve coefficients from $\boldsymbol{B}$ ($\hat{\beta}_{\boldsymbol{B}}$), (c) the respective SEs, 90% $t$-confidence intervals, $t$-statistics, and $p$-values from $\boldsymbol{B}$, and (d) indicators for $\mathrm{LL}\leq\hat{\beta}_{\boldsymbol{A}}\leq\mathrm{UL}$.
$\boldsymbol{A}$ $\boldsymbol{B}$ SE LL UL $t$ $\Pr(T\geq|t|)$ $\hat{\beta}_{\boldsymbol{A}}\in\mathrm{CI}$
β^0\hat{\beta}_{0} 87.62 93.35 5.62 84.10 102.60 16.60 0.00 Y
β^1\hat{\beta}_{1}: Male 0.44 2.12 1.24 0.08 4.15 1.71 0.09 Y
β^2\hat{\beta}_{2} 0.28 0.25 0.03 0.19 0.30 7.79 0.00 Y
β^3\hat{\beta}_{3} -1.71 -1.03 0.57 -1.97 -0.09 -1.80 0.07 Y
β^4\hat{\beta}_{4} 0.27 0.20 0.01 0.18 0.21 28.31 0.00 N
β^5\hat{\beta}_{5} 0.93 1.00 0.04 0.93 1.07 22.57 0.00 Y
β^6\hat{\beta}_{6} 0.43 0.25 0.09 0.11 0.39 2.90 0.00 N
β^7\hat{\beta}_{7} 0.04 -0.02 0.05 -0.11 0.06 -0.43 0.67 Y
Table 5.2: Model fit statistics. We report the multiple $R^{2}$, adjusted $R^{2}$, $F$-statistic, residual standard error, and numerator (denominator) degrees of freedom.
$R^{2}$ $R^{2}_{\text{A}}$ $F$ SE $df_{\text{num}}$ $df_{\text{denom}}$
0.26 0.26 186.08 35.55 7 3762
Figure 1: Diagnostic plots related to the linear regression model. Here we report (a) residuals against the fitted values (for linearity), (b) Normal quantile-quantile plot (for residual normality), (c) scale-location plot (for homoscedasticity), and (d) standardized residuals against leverage (for outlier detection).
Table 5.3: Percent absolute relative bias ($\mathrm{RB}$) of $\widehat{F}_{\text{B}}(t)$, $\widehat{F}_{\text{P}}(t)$, and $\widehat{F}_{\text{R}}(t)$, as well as their respective quantile estimators, relative to the HT equivalents $\widehat{F}_{\pi}(t)$ and $\hat{t}_{\pi}(\alpha)$, using the 2015-2016 NHANES dataset ($\boldsymbol{A}$).
$\mathrm{RB}(\widehat{F})$ $\mathrm{RB}(\hat{t})$
$\alpha$ $\widehat{F}_{\pi}(t)$ $\hat{t}_{\pi}(\alpha)$ B P R B P R
1% 0.01 107.00 99.00 100.00 52.49 6.54 38.71 25.51
10% 0.10 138.00 50.99 99.49 17.67 5.80 15.87 0.37
25% 0.25 158.00 30.34 70.55 10.17 5.06 7.07 2.03
50% 0.51 184.00 16.59 16.92 6.95 4.89 2.38 2.40
75% 0.75 212.00 7.53 21.61 4.53 3.77 8.21 2.41
90% 0.90 244.00 3.56 9.85 2.68 4.51 14.24 3.60
99% 0.99 295.00 0.12 0.77 0.04 0.68 18.69 0.50
Table 5.4: Performance of the bootstrap variance estimator of $\widehat{F}_{\text{R}}(t)$ under the NHANES data example.
$\alpha$ $\widehat{F}_{\pi}(t)$ $\widehat{F}_{\text{R}}(t)$ LL UL $\widehat{\mathrm{Var}}_{\text{Boot}}\times 10^{3}$ $\widehat{F}_{\pi}(t)\in\text{CI}$
1% 0.010 0.015 0.008 0.023 0.020 Y
10% 0.102 0.121 0.105 0.136 0.089 N
25% 0.254 0.280 0.258 0.302 0.182 N
50% 0.510 0.546 0.519 0.572 0.253 N
75% 0.752 0.786 0.763 0.809 0.197 N
90% 0.904 0.928 0.915 0.941 0.062 N
99% 0.990 0.990 0.985 0.994 0.008 Y
Table 5.5: Performance of the bootstrap variance estimator of $\hat{t}_{\text{R}}(\alpha)$ under the NHANES data example.
$\alpha$ $\hat{t}_{\pi}(\alpha)$ $\hat{t}_{\text{R}}(\alpha)$ LL UL $\widehat{\mathrm{Var}}_{\text{Boot}}$ $\hat{t}_{\pi}(\alpha)\in\text{CI}$
1% 107.00 134.29 115.23 153.35 14.78 N
10% 138.00 137.49 118.20 156.77 4.23 Y
25% 158.00 154.79 134.33 175.25 0.62 Y
50% 184.00 179.58 157.54 201.63 0.86 Y
75% 212.00 206.89 183.23 230.54 1.61 Y
90% 244.00 235.22 209.99 260.44 3.18 Y
99% 295.00 296.49 268.17 324.81 549.83 Y

Before we address the relative biases and bootstrap variances, let us briefly discuss various characteristics of the regression model. Starting with Table 5.1, we note that six out of eight (75%) of the survey-weighted coefficients from $\boldsymbol{A}$ were captured in their respective 90% confidence intervals (constructed from the regression model fit on $\boldsymbol{B}$). Interestingly, the coefficient associated with pulse ($\hat{\beta}_{7}$) had a very high $p$-value of 0.67, which may suggest that one's pulse, after controlling for the remaining six covariates, is not a significant predictor of TCHL. A similar argument extends to biological sex: the estimated average increase in TCHL for biological males relative to biological females was only 0.44 mg/dL in sample $\boldsymbol{A}$ (2.12 mg/dL in sample $\boldsymbol{B}$), which is practically insignificant. Of course, these interpretations are likely undermined by severe model misspecification, given the small $R^{2}_{\text{A}}$ in Table 5.2 and the pronounced nonlinearity, heteroscedasticity, and outliers in Figure 1.

Yet, despite the (likely) misspecified regression model, the relative biases of $\widehat{F}_{\text{R}}(t)$ were notably smaller than those of the other two CDF estimators. Among the quantile estimators, $\hat{t}_{\text{R}}(\alpha)$ performed best overall, though it was noticeably outperformed at the 1st percentile and slightly outperformed at the 50th; the latter is likely coincidental, whereas the former is most likely due to $\hat{t}_{\text{R}}(\alpha)$'s sensitivity to distribution extrema (see Section 4.3). In aggregate, these results match those from our simulations, implying the superiority of $\widehat{F}_{\text{R}}(t)$ over $\widehat{F}_{\text{B}}(t)$ and $\widehat{F}_{\text{P}}(t)$ even when the regression function is misspecified.

We conclude with a brief discussion of the performance of the bootstrap variance estimators under the NHANES data. Starting with $\widehat{F}_{\text{R}}(t)$ in Table 5.4, $\widehat{F}_{\pi}(t)$ was excluded from the intervals at the 10th, 25th, 50th, 75th, and 90th percentiles; yet, in Table 5.5, $\hat{t}_{\pi}(\alpha)$ was included in all intervals except at the 1st percentile. It should be noted that these intervals are not designed to capture $\widehat{F}_{\pi}(t)$ or $\hat{t}_{\pi}(\alpha)$; rather, they target $F_{\text{N}}(t)$ and $t_{\text{N}}(\alpha)$. These results should therefore be read cautiously.

6 Conclusion

In this paper, we considered the problem of estimating the finite population distribution function using monotone missing data from probability and nonprobability samples. We proposed a residual-based CDF estimator, $\widehat{F}_{\text{R}}(t)$, that substitutes $\mathbbm{1}(Y_{i}\leq t)$ in $\widehat{F}_{\pi}(t)$, the Horvitz-Thompson estimator of the finite population CDF, with $\widehat{G}\left((t-\hat{Y}_{i})/\nu(\boldsymbol{X}_{i})\right)$, where $\widehat{G}$ is the empirical CDF of the estimated regression residuals from sample $\boldsymbol{B}$. Under ignorability of the missingness mechanism of sample $\boldsymbol{B}$, a correctly specified model, and minor regularity conditions, both $\widehat{F}_{\text{R}}(t)$ and its corresponding quantile estimator are shown to be asymptotically unbiased for their respective population parameters. Our empirical results overwhelmingly support this claim, with the added benefit of robustness to violations of ignorability if the regression model holds, and robustness to model misspecification if ignorability holds. When both assumptions are violated, $\widehat{F}_{\text{R}}(t)$ still performed better than $\widehat{F}_{\text{B}}(t)$, the naïve empirical CDF of $\boldsymbol{B}$, and $\widehat{F}_{\text{P}}(t)$, the plug-in CDF estimator, albeit with noticeable reductions in efficiency. We also defined and contrasted two variance estimators, demonstrating that the bootstrap estimator tends to produce smaller relative bias than the asymptotic-based variance estimator. In future research, we seek to extend this work to nonparametric regression, which can produce CDF and quantile estimators that are more robust to model misspecification.
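To make the construction concrete, we close with a minimal R sketch of $\widehat{F}_{\text{R}}(t)$ and its quantile inversion under a linear working model with homoscedastic errors ($\nu(\boldsymbol{x})\equiv 1$); all object names are illustrative, and the grid-based inversion is one of several possible implementations.

    # y_B, X_B: response and covariate data frame in the convenience sample B;
    # X_A, w_A: covariates and design weights (1/pi_i) in the probability sample A;
    # N: finite population size
    cdf_residual <- function(t_grid, y_B, X_B, X_A, w_A, N) {
      fit    <- lm(y_B ~ ., data = X_B)      # working regression trained on B
      e_hat  <- resid(fit)                   # estimated residuals
      yhat_A <- predict(fit, newdata = X_A)  # transported predictions on A
      # For each t: G_hat evaluated at (t - yhat_i)/nu(x_i), with nu = 1 here,
      # averaged over A with Horvitz-Thompson weights
      vapply(t_grid, function(t) {
        G_hat <- vapply(t - yhat_A, function(u) mean(e_hat <= u), numeric(1))
        sum(w_A * G_hat) / N
      }, numeric(1))
    }

    # t_R(alpha) = min{ t : F_R(t) >= alpha }, searched over the same grid
    quantile_residual <- function(alpha, t_grid, Fvals) {
      min(t_grid[Fvals >= alpha])
    }

For example, one would compute Fvals <- cdf_residual(t_grid, ...) once and then invert it at any desired level via quantile_residual(0.5, t_grid, Fvals).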

Declarations

Data Availability Statement

The data that support the findings of this study are publicly available from the United States Centers for Disease Control and Prevention (CDC) at https://wwwn.cdc.gov/nchs/nhanes/Default.aspx.

Funding

This work was funded by the Chancellor’s Distinguished Fellowship, a Title III HBGI grant from the U.S. Department of Education.

References

  • [23] Centers for Disease Control and Prevention (CDC) “NHANES - National Health and Nutrition Examination Survey” https://www.cdc.gov/nchs/nhanes/index.htm (visited: 2023-10-11) National Center for Health Statistics (NCHS), 2015-2020
  • [24] Richard L Chambers and R Dunstan “Estimating distribution functions from survey data” In Biometrika 73.3 Oxford University Press, 1986, pp. 597–604
  • [25] Sixia Chen, Shu Yang and Jae Kwang Kim “Nonparametric mass imputation for data integration” In Journal of Survey Statistics and Methodology 10.1 Oxford University Press, 2022, pp. 1–24
  • [26] Carol A Francisco and Wayne A Fuller “Quantile estimation with a complex survey design” In The Annals of Statistics JSTOR, 1991, pp. 454–469
  • [27] Daniel G Horvitz and Donovan J Thompson “A generalization of sampling without replacement from a finite universe” In Journal of the American Statistical Association 47.260 Taylor & Francis, 1952, pp. 663–685
  • [28] Cary T Isaki and Wayne A Fuller “Survey design under the regression superpopulation model” In Journal of the American Statistical Association 77.377 Taylor & Francis, 1982, pp. 89–96
  • [29] Alicia A Johnson, F Jay Breidt and Jean D Opsomer “Estimating distribution functions from survey data using nonparametric regression” In Journal of Statistical Theory and Practice 2 Springer, 2008, pp. 419–431
  • [30] Jae Kwang Kim, Seho Park, Yilin Chen and Changbao Wu “Combining non-probability and probability survey samples through mass imputation” In Journal of the Royal Statistical Society Series A: Statistics in Society 184.3 Oxford University Press, 2021, pp. 941–963
  • [31] JG Kovar, JNK Rao and CFJ Wu “Bootstrap and other methods to measure errors in survey estimates” In Canadian Journal of Statistics 16.S1 Wiley Online Library, 1988, pp. 25–45
  • [32] Erich Leo Lehmann “Elements of large-sample theory” New York, NY: Springer, 1999
  • [33] Roderick JA Little and Donald B Rubin “Statistical analysis with missing data” Hoboken, New Jersey: John Wiley & Sons, 2019
  • [34] Thomas Lumley “Analysis of Complex Survey Samples” R package version 2.2 In Journal of Statistical Software 9.1, 2004, pp. 1–19
  • [35] Mateus Maia, Arthur R Azevedo and Anderson Ara “Predictive comparison between random machines and random forests” In Journal of Data Science 19.4 School of Statistics, Renmin University of China, 2021, pp. 593–614
  • [36] Ronald H Randles “On the asymptotic normality of statistics with estimated parameters” In The Annals of Statistics JSTOR, 1982, pp. 462–474
  • [37] JNK Rao, JG Kovar and HJ Mantel “On estimating distribution functions and quantiles from survey data using auxiliary information” In Biometrika JSTOR, 1990, pp. 365–375
  • [38] Marie-Hélène Roy and Denis Larocque “Robustness of random forests for regression” In Journal of Nonparametric Statistics 24.4 Taylor & Francis, 2012, pp. 993–1006
  • [39] Donald B Rubin “Inference and missing data” In Biometrika 63.3 Oxford University Press, 1976, pp. 581–592
  • [40] Carl-Erik Särndal, Bengt Swensson and Jan Wretman “Model Assisted Survey Sampling” New York, NY: Springer, 2003
  • [41] Erwan Scornet “Random forests and kernel methods” In IEEE Transactions on Information Theory 62.3 IEEE, 2016, pp. 1485–1500
  • [42] Anastasios A Tsiatis “Semiparametric theory and missing data” Springer, 2006
  • [43] Jianqiang C Wang and Jean D Opsomer “On asymptotic normality and variance estimation for nondifferentiable survey estimators” In Biometrika 98.1 Oxford University Press, 2011, pp. 91–106
  • [44] Ralph S Woodruff “Confidence intervals for medians and other position measures” In Journal of the American Statistical Association 47.260 Taylor & Francis, 1952, pp. 635–646