Estimation of population size based on capture-recapture designs and evaluation of the estimation reliability
Abstract
We propose a modern method to estimate population size based on capture-recapture designs of K samples. The observed data are formulated as a sample of n i.i.d. K-dimensional vectors of binary indicators, where the k-th component of each vector indicates whether the subject was caught by the k-th sample, such that only subjects with nonzero capture vectors are observed. The target quantity is the unconditional probability that the capture vector is nonzero, across both observed and unobserved subjects. We cover models assuming a single general constraint on the K-dimensional distribution such that the target quantity is identified and the statistical model is otherwise unrestricted. We present solutions for general linear constraints, as well as constraints commonly assumed to identify capture-recapture models, including no K-way interaction in linear and log-linear models, and independence or conditional independence between samples. We demonstrate that the choice of constraint (identification assumption) has a dramatic impact on the value of the estimand, showing that it is crucial that the constraint is known to hold by design. For the commonly assumed constraint of no K-way interaction in a log-linear model, the statistical target parameter is only defined when each of the observable capture patterns is present, and therefore suffers from the curse of dimensionality. We propose a targeted MLE based on an undersmoothed lasso model to smooth across the cells while targeting the fit towards the single-valued target parameter of interest. For each identification assumption, we provide simulation-based inference and confidence intervals to assess the performance of the estimators under correct and incorrect identifying assumptions. We apply the proposed method, alongside existing estimators, to estimate the prevalence of a parasitic infection using multi-source surveillance data from a region in southwestern China, under the four identification assumptions.
Keywords: Asymptotically linear estimator, capture-recapture, targeted maximum likelihood estimation (TMLE), undersmoothed lasso.
1 Introduction
Epidemiologists use surveillance networks to monitor trends in disease frequency. When multiple surveillance components or surveys gather data on the same underlying population (such as those diagnosed with a particular disease over a particular time period), a variety of methods (capture-recapture designs [1], distance sampling [2], multiple observers [3], etc.) may be used to better estimate the disease occurrence in the population. Capture-recapture models are widely used for estimating the size of partially observed populations, usually assuming that individuals do not enter or leave the population between sample collections [4, 5]. These models have been widely applied to epidemiological data [6].
Due to the unobservability of outcomes for individuals not captured by any survey, additional identifying assumptions have to be made in capture-recapture problems. In two-sample scenarios, a common identification assumption is independence between the two samples (i.e., that capture in one survey does not change the probability of capture in the other survey). The estimator of population size based on this assumption is known as the Lincoln-Petersen estimator [7]. However, the independence assumption is often violated in empirical studies. In problems involving three or more surveys, it is common to assume that the highest-order interaction term in a log-linear or logistic model equals zero (i.e., that correlations between survey captures can be described by lower-order interactions alone), an assumption that is very difficult to interpret or empirically verify [4]. An additional challenge for capture-recapture estimators is the curse of dimensionality in the finite sample case, whereby the absence of one or more capture patterns from the observed sample leads to undefined interaction terms. Traditionally, a common approach to selecting among alternative capture-recapture models is to perform model selection based on Akaike's Information Criterion (AIC) [8], the Bayesian Information Criterion (BIC) [9], or Draper's version of the Bayesian Information Criterion [10]. However, this approach is known to have limited reliability in the presence of violations of its identifying assumptions [4, 11, 12]. Das and Kennedy developed a doubly robust method under the specific identification assumption that two lists are conditionally independent given measured covariate information [13].
In this paper, we propose a generalizable framework for estimating population size with as few model assumptions as possible. This framework can be adapted to posit various identification assumptions, including independence between any pair of surveys or absence of highest-order interactions, and can be applied to linear and non-linear constraints. In high-dimensional settings with finite samples, we use machine learning methods to smooth over unobserved capture patterns and apply targeted maximum likelihood estimation (TMLE) [14] updates to reduce estimation bias. Previous work has shown the vulnerability of existing estimators to violations of identification assumptions [15, 16], and we further show the significant impact of misspecified identification assumptions on estimation results for all existing and proposed estimators.
1.1 Paper outline
In this paper, we start with the statistical formulation of the estimation problem. In section 2, we define the framework of our estimators under linear and non-linear constraints. Specifically, we develop the estimators under each of the following identification assumptions: no K-way interaction in a linear model, independence among samples, conditional independence among samples, and no K-way interaction in a log-linear model. In section 3, we derive the efficient estimator of the target parameter, and the statistical inference. In section 4 we provide the targeted learning updates for the smoothed estimators under the no K-way interaction log-linear model. In section 5, we illustrate the performance of existing and proposed estimators in various situations, including high-dimensional finite sample settings. In section 6, we show the performance of the estimators under violations of various identification assumptions. In section 7, we apply the estimators to surveillance data on a parasitic infectious disease in a region in southwestern China [17]. In section 8, we summarise the characteristics of the proposed estimators and state the main findings of the paper.
2 Statistical formulation of estimation problem
2.1 Defining the data and its probability distribution
We define the capture-recapture experiment in the following manner. One takes a first random sample from the population of size $N$, records identifying characteristics of the individuals captured and an indicator of their capture, then repeats the process a total of $K$ times, resulting in $K$ samples of sizes $n_1, \ldots, n_K$.
Each individual in the population, whether captured or not, defines a capture history as a vector $B = (B_1, \ldots, B_K)$, where $B_k$ denotes the indicator that this individual is captured by sample $k$. We assume the capture history vectors of all individuals independently and identically follow a common probability distribution, denoted by $P_B$. Thus $P_B$ is defined on the $2^K$ possible vectors $b$ of dimension $K$.
Note that, for the $n$ individuals contained in our observed sample, we actually observe the realization of $B$. For any individual that is never captured, we know that $B = 0$, where $0$ is the $K$-dimensional vector in which each component equals $0$. However, we do not know the identity of these never-captured individuals, and we also do not know how many there are (i.e., we do not know $N$).
Therefore, we can conclude that our observed data set $O_1, \ldots, O_n$ consists of independent and identically distributed draws from the conditional distribution of $B$, given $B \neq 0$. Let's denote this true probability distribution with $P_0$ and its corresponding random variable with $O$.
So we can conclude that under our assumptions we have $O \sim P_0$, where $P_0(o) = P_B(o)/P_B(B \neq 0)$ for all $o \neq 0$. Since the probability distribution of $O$ is implied by the distribution $P_B$ of $B$, we also use the notation $P_0 = P_{P_B}$ to stress this parameterization.
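To make this data structure concrete, the following minimal R sketch (with a hypothetical population size and hypothetical per-sample capture probabilities) generates full-data capture histories $B$ and then discards the all-zero rows, so that only draws from the conditional distribution $P_0$ remain:

```r
# Minimal sketch: simulate K = 3 capture indicators for a population of size N,
# then keep only the individuals with a nonzero capture history.
set.seed(1)
N <- 6000                                      # hypothetical population size
p <- c(0.02, 0.05, 0.03)                       # assumed per-sample capture probabilities
B <- sapply(p, function(pk) rbinom(N, 1, pk))  # N x 3 matrix of capture indicators
observed <- B[rowSums(B) > 0, ]                # only nonzero capture vectors are seen
n <- nrow(observed)                            # observed sample size; N stays unknown
```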
2.2 Full-data model, target quantity
Let $\mathcal{M}^F$ be a model for the underlying distribution $P_B$. The full-data target parameter $\Psi^F: \mathcal{M}^F \to [0,1]$ is defined as
$$\Psi^F(P_B) = P_B(B \neq 0).$$
In other words, we want to know the proportion of the population that on average will be caught at least once by the combined samples. An estimator $\psi_n$ of this immediately translates into an estimator of the desired size of the population of interest: $N_n = n/\psi_n$. We will focus on full-data models defined by a single constraint, where we distinguish between linear constraints defined by a function $f$ and a non-linear constraint defined by a function $\Phi$. Specifically, for a given function $f$, we define the full-data model
$$\mathcal{M}^F = \{P_B : E_{P_B} f(B) = 0\}.$$
We will require that $f$ satisfies the following condition:
$$f(0) \neq 0. \qquad (1)$$
In other words, $\mathcal{M}^F$ contains all probability distributions of $B$ for which $E_{P_B} f(B) = 0$. An example of interest is:
$$f(b) = (-1)^{K - \sum_k b_k}.$$
In this case, $E_{P_B} f(B) = 0$ is equivalent with
$$\sum_{b \in \{0,1\}^K} (-1)^{K - \sum_k b_k}\, P_B(b) = 0. \qquad (2)$$
We note the left-hand side represents the $K$-th way interaction term in the saturated model
$$P_B(b) = \beta_0 + \sum_k \beta_k b_k + \sum_{j<k} \beta_{jk} b_j b_k + \dots + \beta_{1 \cdots K} \prod_k b_k.$$
For example, if $K = 3$, then the latter model states:
$$P_B(b) = \beta_0 + \beta_1 b_1 + \beta_2 b_2 + \beta_3 b_3 + \beta_{12} b_1 b_2 + \beta_{13} b_1 b_3 + \beta_{23} b_2 b_3 + \beta_{123} b_1 b_2 b_3,$$
and the constraint states that $\beta_{123} = 0$.
One might also use a log-link in this saturated model:
$$\log P_B(b) = \beta_0 + \sum_k \beta_k b_k + \sum_{j<k} \beta_{jk} b_j b_k + \dots + \beta_{1 \cdots K} \prod_k b_k.$$
In this case, $\beta_{1 \cdots K}$ is the $K$-way interaction term in this log-linear model, and we now have
$$\beta_{1 \cdots K} = \sum_b (-1)^{K - \sum_k b_k} \log P_B(b).$$
So, assuming $\beta_{1 \cdots K} = 0$ corresponds with assuming
$$\sum_b (-1)^{K - \sum_k b_k} \log P_B(b) = 0.$$
This is an example of a non-linear constraint $\Phi(P_B) = 0$, where
$$\Phi(P_B) = \sum_b (-1)^{K - \sum_k b_k} \log P_B(b). \qquad (3)$$
We will also consider general non-linear constraints defined by such a function $\Phi$, so for that purpose one can keep this example in mind. As we will see, the choice of this constraint has quite dramatic implications for the resulting statistical target parameter/estimand and thereby for the resulting estimator. As one can already tell from the definition of $\Phi$, $\Phi(P_B)$ is not even defined if there are some $b$ for which $P_B(b) = 0$, so that an NPMLE of $\psi$ will also be ill defined in the case that the empirical distribution satisfies $P_n(o) = 0$ for some $o$. As we will see, the parameter defined by the linear constraint $f$ is very well estimated by the MLE, even when $2^K$ is very large relative to $n$, but for $\Phi$ we would need a so-called TMLE, incorporating machine learning [14].
Since it is hard to believe that the constraint $\Phi(P_B) = 0$ is more realistic than the $f$-constraint in real applications, it appears that, without a good reason to prefer $\Phi$, the $f$-constraint is far superior. However, it also raises alarm bells that the choice of constraint, if wrong, can result in dramatically different statistical output, so that one should really try to make sure that the chosen constraint is known to hold by design.
Another example of a non-linear constraint is independence between two samples, i.e., that
$$P_B(B_1 = 1, B_2 = 1) = P_B(B_1 = 1)\, P_B(B_2 = 1),$$
i.e., that the binary indicators $B_1$ and $B_2$ are independent, but the remaining components can depend on each other and on $(B_1, B_2)$. This corresponds with assuming $\Phi(P_B) = 0$, where
$$\Phi(P_B) = P_B(B_1 = 1, B_2 = 1) - P_B(B_1 = 1)\, P_B(B_2 = 1). \qquad (4)$$
A third non-linear constraint example is conditional independence between two samples, given the others. Suppose we have $K = 3$ samples in total, and the third sample $B_3$ is independent of the second sample $B_2$ given the first sample $B_1$ (there is no time ordering of $B_1, B_2, B_3$). Then we can derive the target parameter and efficient influence curve for this conditional independence constraint. The conditional independence constraint is defined as
$$P_B(B_2 = b_2, B_3 = b_3 \mid B_1 = b_1) = P_B(B_2 = b_2 \mid B_1 = b_1)\, P_B(B_3 = b_3 \mid B_1 = b_1) \text{ for all } (b_1, b_2, b_3).$$
Because we must have the term $P_B(B = 0)$ in the constraint to successfully identify the parameter of interest, the constraint evaluated at $(b_1, b_2, b_3) = (0,0,0)$ is sufficient. Thus the equation above can be presented as
$$P_B(B_2 = 0, B_3 = 0 \mid B_1 = 0) = P_B(B_2 = 0 \mid B_1 = 0)\, P_B(B_3 = 0 \mid B_1 = 0). \qquad (5)$$
This is because, among all the combinations of $(b_1, b_2, b_3)$, only the situation $b = (0,0,0)$ generates the required term $P_B(B = 0)$ in the constraint.
2.3 Identifiability from probability distribution of data
Given such a full-data model with one constraint defined by $f$ or $\Phi$, one will need to establish that for all $P_B \in \mathcal{M}^F$, we have (say we use $\Phi$)
$$\Psi^F(P_B) = \Psi(P_{P_B})$$
for a known $\Psi: \mathcal{M} \to [0,1]$, where the statistical model $\mathcal{M}$ for $P_0$ is defined as
$$\mathcal{M} = \{P_{P_B} : P_B \in \mathcal{M}^F\}.$$
Let’s first study this identifiability problem in the special case of our full-data model defined by the linear constraint with $f(b) = (-1)^{K - \sum_k b_k}$ (equation 2). We will show that we have identifiability of the whole $P_B$ from $P_0$. Firstly, we note that $P_B(o) = P_B(B \neq 0)\, P_0(o) = \psi P_0(o)$ for all $o \neq 0$. We also have $P_B(0) = 1 - \psi$, so that $E_{P_B} f(B) = 0$ yields:
$$(-1)^K (1 - \psi) + \psi \sum_{o \neq 0} (-1)^{K - \sum_k o_k} P_0(o) = 0.$$
We can now solve for $\psi$:
$$\psi = \frac{1}{1 - E_{P_0}\big[(-1)^{\sum_k O_k}\big]}.$$
At first sight, one might wonder if this solution is in the range $[0,1]$. Working backwards from the right-hand side, i.e., using $P_0(o) = P_B(o)/\psi$ and $P_B(0) = 1 - \psi$, it indeed follows that the denominator equals $1/\psi$, so that the right-hand side is indeed in $[0,1]$. Thus, we can conclude that:
$$\Psi(P_0) = \frac{1}{1 - E_0\big[(-1)^{\sum_k O_k}\big]}, \qquad (6)$$
where we use the notation $E_0$ for the expectation operator under $P_0$.
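This identification result translates directly into a plug-in estimator. A short R sketch, reusing the `observed` matrix from the simulation sketch in section 2.1:

```r
# Sketch: plug-in estimate of psi under the no-K-way-interaction linear constraint,
# psi_n = 1 / (1 - mean of (-1)^{sum_k O_k}), as in equation (6).
psi_linear <- function(observed) {
  1 / (1 - mean((-1)^rowSums(observed)))
}
psi_n <- psi_linear(observed)
N_n   <- nrow(observed) / psi_n   # implied population-size estimate
```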
Suppose now that the one-dimensional constraint in the full-data model is defined in general by $\Phi(P_B) = 0$ for some function $\Phi$. The full-data model is now defined by $\mathcal{M}^F = \{P_B : \Phi(P_B) = 0\}$, and the corresponding observed data model can be denoted with $\mathcal{M}$. To establish the identifiability, we still use $P_B(o) = \psi P_0(o)$ for all $o \neq 0$. We also still have $P_B(0) = 1 - \psi$. Define $\Phi_{P_0}(\psi)$ as $\Phi(P_B)$ with $P_B(0)$ replaced by $1 - \psi$ and $P_B(o)$ replaced by $\psi P_0(o)$ for $o \neq 0$. The constraint now yields the equation
$$\Phi_{P_0}(\psi) = 0$$
for a given $P_0$.
Consider now our particular example $\Phi(P_B) = \sum_b (-1)^{K - \sum_k b_k} \log P_B(b)$. Note that
$$\Phi_{P_0}(\psi) = (-1)^K \log(1 - \psi) + \sum_{o \neq 0} (-1)^{K - \sum_k o_k} \log\big(\psi P_0(o)\big).$$
We need to solve this equation in $\psi$ for the given $P_0$. Using that $\sum_{o \neq 0} (-1)^{K - \sum_k o_k} = -(-1)^K$, for $K$ odd and for $K$ even alike the equation reduces to
$$\frac{1 - \psi}{\psi} = \prod_{o \neq 0} P_0(o)^{(-1)^{\sum_k o_k + 1}}.$$
So, in general, this solution can be represented as:
$$\Psi(P_0) = \Big(1 + \prod_{o \neq 0} P_0(o)^{(-1)^{\sum_k o_k + 1}}\Big)^{-1}. \qquad (7)$$
This solution only exists if $P_0(o) > 0$ for all $o \neq 0$. In particular, a plug-in estimator based on the empirical distribution function would not be defined when $P_n(o) = 0$ for some cells $o$.
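For illustration, a hedged R sketch of the plug-in estimator (7) for a general number of samples $K$; the cell bookkeeping is ours, and the function returns `NA` when an observable cell is empty, which is exactly the curse of dimensionality noted above:

```r
# Sketch: plug-in estimate of psi under the no-K-way-interaction log-linear
# constraint (equation 7). Undefined whenever an observable cell is empty.
psi_loglinear <- function(observed) {
  K       <- ncol(observed)
  cells   <- as.matrix(expand.grid(rep(list(0:1), K)))[-1, ]  # all nonzero patterns
  key     <- apply(cells, 1, paste, collapse = "")
  obs_key <- apply(observed, 1, paste, collapse = "")
  p_n     <- as.numeric(table(factor(obs_key, levels = key))) / nrow(observed)
  if (any(p_n == 0)) return(NA)                # empty cell: estimator undefined
  s <- (-1)^(rowSums(cells) + 1)               # exponent (-1)^{|o| + 1} per cell
  1 / (1 + prod(p_n^s))
}
```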
For general $\Phi$, one needs to assume that this one-dimensional equation $\Phi_{P_0}(\psi) = 0$ in $\psi$, for a given $P_0$, always has a unique solution, which then proves the desired identifiability of $\psi$ from $P_0$ for any $P_B \in \mathcal{M}^F$. This solution is now denoted with $\Psi(P_0)$. So, in this case, $\Psi(P_0)$ is defined implicitly by $\Phi_{P_0}(\Psi(P_0)) = 0$. If $\Phi$ is one-dimensional, one will still have that $\mathcal{M}$ is nonparametric.
Let’s now consider the constraint (4), which assumes that $B_1$ is independent of $B_2$. Again, using $P_B(o) = \psi P_0(o)$ for $o \neq 0$ and $P_B(0) = 1 - \psi$, the equation $\Phi(P_B) = 0$ yields the following quadratic equation in $\psi$:
$$\psi\, P_0(B_1 = 1, B_2 = 1) = \psi^2\, P_0(B_1 = 1)\, P_0(B_2 = 1),$$
where all three probabilities concern capture patterns with $B_1 = 1$ or $B_2 = 1$ and are therefore determined by $P_0$. Since $\psi > 0$, this yields the equation $P_0(B_1 = 1, B_2 = 1) = \psi\, P_0(B_1 = 1)\, P_0(B_2 = 1)$ and thus
$$\psi = \frac{P_0(B_1 = 1, B_2 = 1)}{P_0(B_1 = 1)\, P_0(B_2 = 1)}.$$
A more helpful way this parameter can be represented is given by:
$$\Psi(P_0) = \frac{E_0[O_1 O_2]}{E_0[O_1]\, E_0[O_2]}. \qquad (8)$$
This identifiability result relies on $E_0[O_1]\, E_0[O_2] > 0$, i.e., that $P_0(B_1 = 1) > 0$ and $P_0(B_2 = 1) > 0$. In particular, we need $P_0(B_1 = 1, B_2 = 1) > 0$ as well, and thereby that each of the three cells $(1,0)$, $(0,1)$ and $(1,1)$ has positive probability under the bivariate marginal distribution of $(B_1, B_2)$ under $P_0$. It follows trivially that the inequality $\Psi(P_0) \leq 1$ holds for $K = 2$, since in that case $E_0[O_1]\,E_0[O_2] = E_0[O_1 O_2] + P_0(1,0)\,P_0(0,1) \geq E_0[O_1 O_2]$. It will have to be verified whether this inequality constraint always holds for $K > 2$ as well, or whether this is an actual assumption in the statistical model. Since it would be an inequality constraint in the statistical model, it would not affect the tangent space and thereby the efficient influence function of $\Psi$ presented below, but the MLE would now involve maximizing the likelihood under this constraint so that the resulting MLE of $\psi$ satisfies this constraint.
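A minimal R sketch of the corresponding plug-in estimator; note that it only uses the first two columns of the capture-history matrix:

```r
# Sketch: plug-in estimate under independence of samples 1 and 2 (equation 8),
# a Lincoln-Petersen-type estimator that ignores the remaining samples.
psi_indep <- function(observed) {
  p11 <- mean(observed[, 1] == 1 & observed[, 2] == 1)
  p1  <- mean(observed[, 1] == 1)
  p2  <- mean(observed[, 2] == 1)
  p11 / (p1 * p2)
}
```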
Similarly, for the conditional independence assumption, the parameter can be derived as
$$\Psi(P_0) = \frac{\delta}{\delta + bc}, \qquad (9)$$
where
$$a = P_0(B_1 = 0), \quad b = P_0(B_1 = 0, B_2 = 0), \quad c = P_0(B_1 = 0, B_3 = 0), \quad \delta = a - b - c. \qquad (10)$$
Details on this derivation are presented in appendix section 9.4.
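A minimal R sketch of the plug-in estimator of (9), with $a$, $b$ and $c$ estimated by empirical proportions as in (10):

```r
# Sketch: plug-in estimate under conditional independence of samples 2 and 3
# given sample 1 (equations 9-10), with delta = a - b - c.
psi_cond_indep <- function(observed) {
  a  <- mean(observed[, 1] == 0)
  b  <- mean(observed[, 1] == 0 & observed[, 2] == 0)
  cc <- mean(observed[, 1] == 0 & observed[, 3] == 0)
  delta <- a - b - cc
  delta / (delta + b * cc)
}
```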
One could also define a model by a multivariate $\Phi = (\Phi_1, \ldots, \Phi_m)$. In this case, the full-data model is restricted by $m$ constraints and the observed data model will not be saturated anymore. Let's consider the example in which we assume that $B_1, \ldots, B_K$ are independent Bernoulli random variables. In this case, we assume that $P_B(b) = \prod_k P_B(B_k = b_k)$ for all possible $b$. This can also be defined as stating that for each 2 components $(B_j, B_k)$, we have that these two Bernoullis are independent. Let $\Phi_{jk}$ be the constraint defined as in (4) but with $B_1$ and $B_2$ replaced by $B_j$ and $B_k$, respectively. Then, we can define $\Phi = (\Phi_{jk} : 1 \leq j < k \leq K)$, a vector of dimension $K(K-1)/2$. This defines now the model $\mathcal{M}^F$, which assumes that all components of $B$ are independent. We will also work out the MLE and efficient influence curve of $\Psi$ for this restricted statistical model.
2.4 Statistical model and target parameter
We have now defined the statistical estimation problem for linear and non-linear constraints. For linear constraints defined by a function $f$, we observe $O_1, \ldots, O_n \sim_{iid} P_0 \in \mathcal{M}$, and our statistical target parameter is given by $\Psi: \mathcal{M} \to [0,1]$, where
$$\Psi(P_0) = \frac{f(0)}{f(0) - E_0 f(O)}. \qquad (11)$$
The statistical target parameter satisfies $\Psi(P_{P_B}) = \Psi^F(P_B)$ for all $P_B \in \mathcal{M}^F$.
Since the full-data model only includes a single constraint, it follows that $\mathcal{M}$ consists of all possible probability distributions of $O$ on $\{0,1\}^K \setminus \{0\}$, so that it is a nonparametric/saturated model.
Similarly, we can state the statistical model and target parameter for our examples using non-linear one-dimensional constraints $\Phi$.
In general, we have a statistical model $\mathcal{M} = \{P_{P_B} : P_B \in \mathcal{M}^F\}$ for some full-data model $\mathcal{M}^F$ for the distribution of $B$, and we would be given a particular mapping $\Psi: \mathcal{M} \to [0,1]$ satisfying $\Psi(P_{P_B}) = \Psi^F(P_B)$ for all $P_B \in \mathcal{M}^F$. In this case, our goal is to estimate $\Psi(P_0)$ based on knowing that $P_0 \in \mathcal{M}$.
2.5 Efficient influence curve of target parameter
An estimator of $\psi_0 = \Psi(P_0)$ is efficient if and only if it is asymptotically linear with influence curve equal to the canonical gradient of the pathwise derivative of $\Psi$. Therefore it is important to determine this canonical gradient. It teaches us how to construct an efficient estimator of $\psi_0$ in the model $\mathcal{M}$. In addition, it provides us with Wald-type confidence intervals based on an efficient estimator. As we will see, for most of our estimation problems with a nonparametric model $\mathcal{M}$, $P_0$ can be estimated with the empirical measure $P_n$ and $\psi_n = \Psi(P_n)$. However, for constraint (3), an NPMLE is typically not defined due to empty cells, so that smoothing and bias reduction are needed.
Let $Z = (-1)^{\sum_k O_k}$ so that $\Psi(P_0) = 1/(1 - E_0 Z)$. Note that $E_0 Z$ is simply the expectation of $Z$ w.r.t. its distribution. $E_0 Z$ is a pathwise differentiable parameter at any $P_0$ with canonical gradient/efficient influence curve given by:
$$Z - E_0 Z.$$
By the delta-method, the efficient influence curve of $\Psi$ at $P_0$ is given by:
$$D_f(P_0)(O) = \Psi(P_0)^2 \Big((-1)^{\sum_k O_k} - E_0\big[(-1)^{\sum_k O_k}\big]\Big). \qquad (12)$$
In general, if $\Psi$ is the target parameter and the model is nonparametric, then the efficient influence curve is given by $D(P)(o)$, where
$$D(P)(o) = \frac{d}{d\epsilon}\Psi\big((1-\epsilon)P + \epsilon\,\delta_o\big)\Big|_{\epsilon = 0}$$
is the Gateaux derivative in the direction $\delta_o - P$, and $\delta_o$ is the empirical distribution for a sample of size one, putting all its mass on $o$ [18].
Consider now the model $\mathcal{M}$ for a general univariate constraint function $\Phi$ that maps any possible $P_B$ into a real number. Recall that $\Psi(P_0)$ is now defined implicitly by $\Phi_{P_0}(\Psi(P_0)) = 0$. For notational convenience, in this particular paragraph we suppress the dependence of $\psi = \Psi(P_0)$ on $P_0$. The implicit function theorem implies that the general form of the efficient influence curve is given by:
$$D(P_0)(o) = -\Big(\frac{d}{d\psi}\Phi_{P_0}(\psi)\Big)^{-1}\, \frac{d}{d\epsilon}\Phi_{P_{0,\epsilon,o}}(\psi)\Big|_{\epsilon = 0},$$
where the latter derivative is the directional derivative of $P_0 \mapsto \Phi_{P_0}(\psi)$ in the direction $\delta_o - P_0$, with $P_{0,\epsilon,o} = (1-\epsilon)P_0 + \epsilon\,\delta_o$.
Note that $\Phi_{P_0}(\psi) = \Phi(P_{B,\psi,P_0})$, where $P_{B,\psi,P_0}(0) = 1 - \psi$ and $P_{B,\psi,P_0}(b) = \psi P_0(b)$ for $b \neq 0$. We now use the chain rule. So we obtain:
$$\frac{d}{d\psi}\Phi_{P_0}(\psi) = -\frac{d}{dP_B(0)}\Phi(P_{B,\psi,P_0}) + \sum_{b \neq 0} P_0(b)\,\frac{d}{dP_B(b)}\Phi(P_{B,\psi,P_0}),$$
where $\frac{d}{dP_B(b)}\Phi$ is the directional/Gateaux derivative of $\Phi$ in the direction of the function $1_b$ that equals 1 at $b$ and zero otherwise. We also have:
$$\frac{d}{d\epsilon}\Phi_{P_{0,\epsilon,o}}(\psi)\Big|_{\epsilon = 0} = \psi \sum_{b \neq 0}\big(1_o(b) - P_0(b)\big)\,\frac{d}{dP_B(b)}\Phi(P_{B,\psi,P_0}),$$
since under $P_{0,\epsilon,o}$ the full-data distribution equals $1-\psi$ at $b = 0$ and $\psi\big((1-\epsilon)P_0 + \epsilon\,\delta_o\big)(b)$ otherwise. We conclude that
$$D(P_0)(o) = \frac{-\psi \sum_{b \neq 0}\big(1_o(b) - P_0(b)\big)\frac{d}{dP_B(b)}\Phi(P_{B,\psi,P_0})}{-\frac{d}{dP_B(0)}\Phi(P_{B,\psi,P_0}) + \sum_{b \neq 0} P_0(b)\frac{d}{dP_B(b)}\Phi(P_{B,\psi,P_0})}.$$
Let’s now consider our special example $\Phi(P_B) = \sum_b (-1)^{K - \sum_k b_k}\log P_B(b)$. In this case, the statistical target parameter is given by (7). It is straightforward to show that the directional derivative of $\Psi$ at $P_0$ in the direction $\delta_o - P_0$ is given by:
$$\Psi(P_0)\big(1 - \Psi(P_0)\big)\Big(1 + \frac{(-1)^{\sum_k o_k}}{P_0(o)}\Big).$$
As one would have predicted from the definition of $\Phi$, this directional derivative is only bounded if $P_0(o) > 0$ for all $o \neq 0$. The efficient influence curve is thus given by:
$$D_\Phi(P_0)(O) = \Psi(P_0)\big(1 - \Psi(P_0)\big)\Big(1 + \frac{(-1)^{\sum_k O_k}}{P_0(O)}\Big), \qquad (13)$$
where we use that $\sum_{o \neq 0}(-1)^{\sum_k o_k} = -1$, so that indeed the expectation of $D_\Phi(P_0)$ equals zero (under $P_0$).
The efficient influence function for (8), corresponding with the constraint (4), is the influence curve of the empirical plug-in estimator $\Psi(P_n)$, and can thus be derived from the delta-method:
$$D_\Phi(P_0)(O) = \Psi(P_0)\Big(\frac{O_1 O_2 - z}{z} - \frac{O_1 - x}{x} - \frac{O_2 - y}{y}\Big), \qquad (14)$$
where $x = E_0[O_1]$, $y = E_0[O_2]$ and $z = E_0[O_1 O_2]$.
Similarly, we can derive the efficient influence curve for the conditional independence constraint and the parameter (9) under this constraint as
$$D_\Phi(P_0)(O) = \frac{\partial \Psi}{\partial a}\big(1\{O_1 = 0\} - a\big) + \frac{\partial \Psi}{\partial b}\big(1\{O_1 = 0, O_2 = 0\} - b\big) + \frac{\partial \Psi}{\partial c}\big(1\{O_1 = 0, O_3 = 0\} - c\big), \qquad (15)$$
with $\frac{\partial \Psi}{\partial a} = \frac{bc}{(\delta + bc)^2}$, $\frac{\partial \Psi}{\partial b} = -\frac{c(\delta + b)}{(\delta + bc)^2}$ and $\frac{\partial \Psi}{\partial c} = -\frac{b(\delta + c)}{(\delta + bc)^2}$.
Finally, consider a non-saturated model implied by a multidimensional constraint function $\Phi$. In this case, the above is still a gradient of the pathwise derivative, where division by a vector is now defined componentwise. However, this gradient is now not equal to the canonical gradient. We can then determine the tangent space of the model and project the gradient onto the tangent space at $P_0$, which yields the actual efficient influence curve. We can demonstrate this for the example defined by the multidimensional constraint using the general result in appendix 9.3.
3 Efficient estimator of target parameter, and statistical inference
Estimation of $m_0 = E_0 f(O)$ based on the statistical model $\mathcal{M}$ and data $O_1, \ldots, O_n$ is trivial since $m_0$ is just a mean of $f(O)$. In other words, we estimate $m_0$ with the NPMLE
$$m_n = \frac{1}{n}\sum_{i=1}^n f(O_i),$$
where $P_n$ is the empirical distribution of $O_1, \ldots, O_n$, and, similarly, we estimate $\psi_0$ with its NPMLE
$$\psi_n = \Psi(P_n) = \frac{f(0)}{f(0) - m_n}.$$
This estimator is asymptotically linear at $P_0$ with influence curve $D_f(P_0)$ under no further conditions. As a consequence, a valid asymptotic 95% confidence interval is given by:
$$\psi_n \pm 1.96\,\sigma_n/\sqrt{n},$$
where
$$\sigma_n^2 = \frac{1}{n}\sum_{i=1}^n D_f(P_n)(O_i)^2,$$
and 1.96 is the 0.975 quantile value of the standard normal distribution.
If the general constraint $\Phi$ is differentiable so that $\Psi(P_n)$ is an asymptotically linear estimator of $\Psi(P_0)$ [19], then we can estimate $\psi_0$ with the NPMLE $\Psi(P_n)$. Again, under no further conditions, $\Psi(P_n)$ is asymptotically linear with influence curve $D_\Phi(P_0)$, and an asymptotically valid confidence interval is obtained as above. The estimator $m_n$ is always well behaved, even when $2^K$ is relatively large compared to $n$. For a general $\Phi$, the behavior will very much depend on the precise dependence of $\Psi$ on $P_0$. In general, if $\Psi(P_n)$ starts suffering from the dimensionality of the model (i.e., empty cells make the estimator erratic), then one should expect that the inference will suffer accordingly, even though the estimator will still be asymptotically efficient. In these latter cases, we propose to use targeted maximum likelihood estimation, which targets data-adaptive machine learning fits towards an optimal variance-bias trade-off for the target parameter.
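As an illustration, a short R sketch of this Wald-type confidence interval for the linear-constraint estimator, with the influence curve $D_f$ estimated as in (12):

```r
# Sketch: 95% Wald interval for the linear-constraint plug-in estimator,
# using the estimated efficient influence curve D(P_n)(O) = psi_n^2 (Z - mean(Z)).
Z     <- (-1)^rowSums(observed)
psi_n <- 1 / (1 - mean(Z))
D     <- psi_n^2 * (Z - mean(Z))
se    <- sqrt(mean(D^2) / length(Z))
ci    <- psi_n + c(-1, 1) * qnorm(0.975) * se
```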
Let’s consider the $\Phi$-example which assumes that the $K$-way interaction on the log-scale equals zero. In this case the target parameter is defined by (7), which shows that the NPMLE is not defined when $P_n(o) = 0$ for some $o$. So in this case, the MLE suffers immensely from the curse of dimensionality and can thus not be used when the number of samples $K$ is such that the sample size $n$ is of the same order as $2^K$. In this context, we discuss estimation based on TMLE in the next section.
Under the independence constraint, $\Psi(P_0)$ can be estimated with the plug-in empirical estimator $\Psi(P_n)$, which is the NPMLE. So, this estimator only relies on positive probability of the cells $(1,0)$, $(0,1)$ and $(1,1)$ under the bivariate marginal distribution of $(O_1, O_2)$ under $P_0$. We also note that this efficient estimator of $\psi_0$ only uses the data on the first two samples. Thus, the best estimator based upon this particular constraint ignores the data on all the other samples. More assumptions will be needed, such as the model that assumes that all components of $B$ are independent, in order to create a statistical model that is able to also incorporate the data with the other patterns.
Constrained models based on multivariate $\Phi$: For these models, if the NPMLE behaves well for the larger model in which just one of the constraints is used, then it will behave well under more constraints. If, on the other hand, this NPMLE suffers from the curse of dimensionality, it might be the case that the MLE behaves better due to the additional constraints, but generally speaking that is a lot to hope for (except if $\Phi$ is really high dimensional relative to $2^K$). To construct an asymptotically efficient estimator one can thus use the MLE $\Psi(P_n^{ML})$, where now $P_n^{ML}$ is the MLE over the actual model $\mathcal{M}$. In the special case that the behavior of this MLE suffers from the curse of dimensionality, we recommend the TMLE below instead, which will be worked out for the example (3).
4 Targeted maximum likelihood estimation when the (NP)MLE of target parameter suffers from curse of dimensionality
Consider the statistical target parameter defined by (7), whose efficient influence curve is given by (13). Let $P_n^0$ be an initial estimator of the true distribution $P_0$ of $O$ in the nonparametric statistical model $\mathcal{M}$ that consists of all possible distributions of $O$. An ideal initial estimator of $P_0$ would be consistent and remain well defined when empty cells exist. One could use, for example, an undersmoothed lasso estimator constructed in algorithm 1 as the initial estimator [20]. In algorithm 1, we establish the empirical criterion by which the level of undersmoothing may be chosen to appropriately satisfy the conditions required of an efficient plug-in estimator. In particular, we require that the minimum of the absolute empirical mean of the scores of the selected basis functions is smaller than a constant times $n^{-1/2}$, a cut-off which is not parameter specific. This condition essentially enforces the selected $L_1$-norm in the lasso to be large enough so that the fit includes sparsely supported basis functions [20]. When the regularization hyperparameter $\lambda$ equals zero, the undersmoothed lasso estimator is the same as the NPMLE plug-in estimator, and when $\lambda$ is not zero, the undersmoothed lasso estimator will smooth over the predicted probabilities and avoid predicting probabilities of exactly zero when empty cells exist.
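The sketch below is a crude stand-in for algorithm 1, not the algorithm itself: it fits a Poisson lasso (glmnet) to hypothetical cell counts using main-effect and two-way basis functions, and simply takes the smallest lambda on glmnet's path in place of the undersmoothing criterion. Its only purpose is to show how the lasso fit yields strictly positive smoothed cell probabilities even when a cell is empty:

```r
library(glmnet)
# Hypothetical counts for the 7 observed cells of a K = 3 design; the (1,1,1)
# cell is empty, so the NPMLE plug-in of (7) would be undefined.
cells  <- expand.grid(b1 = 0:1, b2 = 0:1, b3 = 0:1)[-1, ]   # nonzero patterns
counts <- c(60, 25, 10, 30, 6, 4, 0)                        # aligned with rows of 'cells'
X   <- model.matrix(~ (b1 + b2 + b3)^2, data = cells)[, -1] # mains + 2-way terms
fit <- glmnet(X, counts, family = "poisson")
mu  <- predict(fit, newx = X, s = min(fit$lambda), type = "response")
p_init <- as.numeric(mu / sum(mu))   # smoothed, strictly positive cell probabilities
```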
We denote $P_n^0$ as our initial estimator; the TMLE will update this initial estimator into $P_n^*$ in such a way that $P_n D_\Phi(P_n^*) \approx 0$, allowing a rigorous analysis of the TMLE of $\psi_0$ as presented in algorithm 2.
The TMLE update uses the exponential tilt submodel through $P_n^0$,
$$P_{n,\epsilon}^0(o) = c(\epsilon)\,\exp\big(\epsilon\, D_\Phi(P_n^0)(o)\big)\, P_n^0(o), \qquad (16)$$
where $c(\epsilon)$ is the normalizing constant. Note that indeed this submodel (16) through $P_n^0$ is a density for all $\epsilon$, and that its score at $\epsilon = 0$ is given by:
$$\frac{d}{d\epsilon}\log P_{n,\epsilon}^0(O)\Big|_{\epsilon = 0} = D_\Phi(P_n^0)(O).$$
Recall that $E_{P_n^0} D_\Phi(P_n^0) = 0$, so that indeed the expectation of its score equals zero. This proves that if we select as loss function the log-likelihood loss $L(P)(O) = -\log P(O)$, then the score spans the efficient influence function $D_\Phi(P_n^0)$. This property is needed to establish the asymptotic efficiency of the TMLE. In accordance with general TMLE procedures [14], the TMLE updating process in algorithm 2 is iterated until convergence, at which point $\epsilon_n \approx 0$, or equivalently $P_n D_\Phi(P_n^*) \approx 0$. Let $P_n^*$ be the probability distribution in the limit. The TMLE of $\psi_0$ is now defined as $\psi_n^* = \Psi(P_n^*)$. Due to the fact that the MLE $\epsilon_n$ solves its score equation and equals zero at convergence, it follows that this TMLE satisfies $P_n D_\Phi(P_n^*) = 0$ (or $\approx 0$).
The proof of the asymptotic efficiency of TMLE is provided in appendix section 9.6.
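Continuing the lasso sketch above, a minimal R implementation of the TMLE iteration using the exponential tilt (16), the efficient influence curve (13) and the stopping rule $P_n D_\Phi(P_n^*) \approx 0$; the tolerance and the bracketing interval for $\epsilon$ are arbitrary choices of ours:

```r
# Sketch of the TMLE update for psi under constraint (3), continuing from
# 'cells', 'counts' and the initial lasso fit 'p_init' above.
s   <- (-1)^(rowSums(cells) + 1)                 # (-1)^{|o|+1} per observed cell
psi <- function(p) 1 / (1 + prod(p^s))           # target parameter (7)
eic <- function(p) {                             # efficient influence curve (13)
  ps <- psi(p); ps * (1 - ps) * (1 - s / p)
}
p <- p_init
for (step in 1:100) {
  D <- eic(p)
  if (abs(sum(counts / sum(counts) * D)) < 1e-8) break  # P_n D(P_n^*) ~ 0
  ll  <- function(eps) sum(counts * log(p * exp(eps * D) / sum(p * exp(eps * D))))
  eps <- optimize(ll, c(-1, 1), maximum = TRUE)$maximum  # MLE along the tilt
  p   <- p * exp(eps * D); p <- p / sum(p)               # updated cell probabilities
}
psi_tmle <- psi(p)
```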
5 Simulations
In this section, we show the performance of all the estimators given that their identification assumptions (linear and non-linear constraints) hold true. The estimands $\Psi_f(P_0)$ and $\Psi_\Phi(P_0)$ refer to the probability of an individual being captured at least once, under the linear constraint $f$ or the non-linear constraint $\Phi$ respectively. In subsection 5.1, the linear constraint is defined in equation 2, the corresponding estimand (equation 11) is
$$\Psi_f(P_0) = \frac{1}{1 - \sum_{o \neq 0}(-1)^{\sum_k o_k} P_0(o)},$$
and its plug-in estimator is
$$\psi_n = \frac{1}{1 - \sum_{o \neq 0}(-1)^{\sum_k o_k} P_n(o)},$$
where $P_n(o)$ is the empirical probability of cell $o$. In subsection 5.2, we provide a summary of all the non-linear constraints $\Phi$ and their corresponding estimators. In subsection 5.3, the non-linear constraint is defined by the no $K$-way interaction assumption of section 2.2. The estimand (equation 7) is
$$\Psi_\Phi(P_0) = \Big(1 + \prod_{o \neq 0} P_0(o)^{(-1)^{\sum_k o_k + 1}}\Big)^{-1}.$$
The plug-in estimator replaces $P_0(o)$ by $P_n(o)$:
$$\psi_n = \Big(1 + \prod_{o \neq 0} P_n(o)^{(-1)^{\sum_k o_k + 1}}\Big)^{-1}.$$
In addition, the undersmoothed lasso estimator $\psi_n^{u\text{-}lasso}$ replaces the plug-in $P_n$ with the probabilities estimated by a lasso regression whose regularization parameter is chosen by the undersmoothing algorithm, and the estimator $\psi_n^{cv\text{-}lasso}$ replaces the plug-in $P_n$ with the probabilities estimated by a lasso regression whose regularization parameter is optimized by cross-validation. The estimator $\psi_n^{TMLE(u\text{-}lasso)}$ updates the estimated probabilities of the undersmoothed lasso using targeted learning techniques, and the estimator $\psi_n^{TMLE(cv\text{-}lasso)}$ updates the estimated probabilities of the cross-validated lasso using targeted learning techniques. In addition, we compare the performance of our proposed estimators to the existing estimators $M_0$ and $M_t$.
In subsection 5.4, the non-linear constraint is defined by the independence assumption in equation 4. The estimand (equation 8) is
$$\Psi_\Phi(P_0) = \frac{E_0[O_1 O_2]}{E_0[O_1]\, E_0[O_2]},$$
and its plug-in estimator replaces these expectations by empirical means.
In subsection 5.5, the non-linear constraint is defined by the conditional independence assumption in equation 5. The estimand (equation 9, with $a$, $b$, $c$ and $\delta$ defined in equation 10) is
$$\Psi_\Phi(P_0) = \frac{\delta}{\delta + bc},$$
and its plug-in estimator replaces $a$, $b$ and $c$ by empirical proportions.
5.1 Linear identification assumption
The linear identification constraint is $E_{P_B} f(B) = 0$, for any function $f$ such that $f(0) \neq 0$. An example of interest is $f(b) = (-1)^{K - \sum_k b_k}$, where $K$ is the total number of samples. In this case, the linear identification assumption is equation 2. For example, if $K = 3$, then the constraint (equation 2) states:
$$-P_B(0,0,0) + P_B(0,0,1) + P_B(0,1,0) + P_B(1,0,0) - P_B(0,1,1) - P_B(1,0,1) - P_B(1,1,0) + P_B(1,1,1) = 0,$$
i.e., that $\beta_{123} = 0$ in the saturated linear model. The identified estimator is
$$\psi_n = \frac{1}{1 - \sum_{o \neq 0}(-1)^{\sum_k o_k} P_n(o)},$$
where $P_n(o)$ is the observed probability of cell $o$, and the estimated efficient influence curve (equation 12) is
$$D_f(P_n)(O) = \psi_n^2\Big((-1)^{\sum_k O_k} - \sum_{o \neq 0}(-1)^{\sum_k o_k} P_n(o)\Big).$$
We use the parameters below to generate the underlying distribution; they are ordered as the intercept, the three main effects, the three two-way interactions, and the three-way interaction of the saturated linear model for $P_B(b)$. The assumption that the highest-way interaction term equals zero is satisfied in this setting.
0.0725 | 0.03 | 0.01 | 0.04 | 0.01 | 0.02 | 0.02 | 0 |
The simulated true probabilities for all 7 observed cells are:
P(0,0,1) | P(0,1,0) | P(0,1,1) | P(1,0,0) | P(1,0,1) | P(1,1,0) | P(1,1,1) |
0.1213 | 0.0889 | 0.1429 | 0.1105 | 0.1752 | 0.1428 | 0.2183 |
We can calculate the true value of $\psi_0$ analytically as $1 - P_B(0,0,0) = 1 - 0.0725 = 0.9275$, and draw repeated samples to obtain the asymptotic bias and variance of the plug-in estimator.
Figure 1 shows that our estimators perform well when the identification assumption holds true.
5.2 Non-linear identification assumption
We perform simulations for three types of non-linear identification assumptions: 1) the K-way interaction term equals zero in a log-linear model, 2) the independence assumption, and 3) the conditional independence assumption. For assumption 1), we provide simulations for the following estimators:

1. Existing estimator defined by model $M_0$, which assumes that in the log-linear model all the main terms have the same value and there are no interaction terms [21].
2. Existing estimator defined by model $M_t$, which assumes that the log-linear model does not contain interaction terms [4].
3. The non-parametric plug-in estimator $\psi_n^{NPMLE} = \Psi(P_n)$ of equation 7.
4. The estimator $\psi_n^{u\text{-}lasso}$ based on undersmoothed lasso regression.
5. The estimator $\psi_n^{cv\text{-}lasso}$ based on lasso regression with cross-validation.
6. The TMLE $\psi_n^{TMLE(u\text{-}lasso)}$ based on $\psi_n^{u\text{-}lasso}$.
7. The TMLE $\psi_n^{TMLE(cv\text{-}lasso)}$ based on $\psi_n^{cv\text{-}lasso}$.
Models $M_0$ and $M_t$ were fitted using the R package "Rcapture" (version 1.4-3) [22].
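For reference, a hedged sketch of how such fits can be obtained with Rcapture, assuming the capture histories are available as an n x 3 binary matrix `observed`:

```r
library(Rcapture)
# Sketch: closed-population log-linear fits; closedp() returns abundance
# estimates and fit statistics for models M0, Mt, Mh, and others.
fit <- closedp(observed)   # observed: one 0/1 row per captured individual
fit$results                # abundance, deviance, df, AIC, BIC per model
```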
5.3 Non-linear identification assumption: K-way interaction term equals zero
Given that we have three samples ($K = 3$), the identification assumption is that $\beta_{123} = 0$ in the log-linear model
$$\log E[n_b] = \beta_0 + \beta_1 b_1 + \beta_2 b_2 + \beta_3 b_3 + \beta_{12} b_1 b_2 + \beta_{13} b_1 b_3 + \beta_{23} b_2 b_3 + \beta_{123} b_1 b_2 b_3$$
(section 2.2). Here $n_b$ is the count of observations in cell $b$, and $b_k = 1$ if the subject is captured by sample $k$. The parameter is identified in equation 7, and its influence curve is identified in equation 13.
In this section, we evaluated both our proposed estimators as well as two existing parametric estimators. The first estimator we evaluated is the model $M_t$ [23], where $p_{ik}$ denotes the probability that subject $i$ is captured by sample $k$, modeled as $p_{ik} = p_k$ for subject $i$ and sample $k$. This estimator assumes that all the interaction terms are zero in the log-linear model, and only uses the main-term variables in model fitting. The second estimator is $M_0$ [21], with the formula $p_{ik} = p$, where $p$ is a constant.
5.3.1 Log-linear model with main term effects
We used the parameters below to generate the underlying distribution; they are ordered as the intercept, the three main effects, the three two-way interactions, and the three-way interaction of the log-linear model. The assumption that the K-way interaction term equals zero is satisfied in this setting. The model assumptions for $M_0$ and $M_t$ are also satisfied in this setting.
-0.9398 | -1 | -1 | -1 | 0 | 0 | 0 | 0 |
The simulated observed probabilities for all 7 observed cells are:
P(0,0,1) | P(0,1,0) | P(0,1,1) | P(1,0,0) | P(1,0,1) | P(1,1,0) | P(1,1,1) |
0.2359 | 0.2359 | 0.0868 | 0.2359 | 0.0868 | 0.0868 | 0.0319 |
Under this setting, the true value of $\psi_0$ can be calculated analytically as 0.6093. Figure 2 and table 1 show the performance of the estimators, and table 2 shows the model fit statistics for $M_0$ and $M_t$.
Estimator | Average | Lower 95% CI | Upper 95% CI | Coverage(%) |
---|---|---|---|---|
$\psi_n^{NPMLE}$ | 0.6078 | 0.4791 | 0.7366 | 95
$\psi_n^{u\text{-}lasso}$ | 0.6473 | 0.5237 | 0.7708 | 84.2
$\psi_n^{TMLE(u\text{-}lasso)}$ | 0.6078 | 0.4806 | 0.735 | 93.7
$\psi_n^{cv\text{-}lasso}$ | 0.8534 | 0.7984 | 0.9083 | 4.4
$\psi_n^{TMLE(cv\text{-}lasso)}$ | 0.4044 | 0.2988 | 0.5101 | 13.6
$M_0$ | 0.6093 | 0.5659 | 0.6521 | 97.8
$M_t$ | 0.6096 | 0.5662 | 0.6525 | 97.8
Estimator | Df | AIC | BIC |
---|---|---|---|
$M_0$ | 5 | 56.532 | 66.348
$M_t$ | 3 | 58.034 | 77.665
Figure 2 and table 1 compare the performance of the two base learners: lasso regression with the regularization term chosen by cross-validation ($\psi_n^{cv\text{-}lasso}$), and undersmoothed lasso regression ($\psi_n^{u\text{-}lasso}$). The $\psi_n^{cv\text{-}lasso}$ estimator is significantly biased from the true value of $\psi_0$, and the TMLE estimator based on it, $\psi_n^{TMLE(cv\text{-}lasso)}$, is also biased. And although the $\psi_n^{u\text{-}lasso}$ estimator is biased, the TMLE based on it, $\psi_n^{TMLE(u\text{-}lasso)}$ (average 0.6078, coverage 93.7%), is able to adjust the bias and give a fit as good as the NPMLE estimator (average 0.6078, coverage 95%). Thus, in the sections below we will only use the undersmoothed lasso estimator as the base learner.
5.3.2 Log-linear model with main term and interaction term effects
We used the parameters below to generate the underlying distribution (ordered as in section 5.3.1). The assumption that the K-way interaction term equals zero is satisfied in this setting.
-0.9194 | -1 | -1 | -1 | -0.1 | -0.1 | -0.1 | 0 |
The simulated observed probabilities for all 7 observed cells are:
P(0,0,1) | P(0,1,0) | P(0,1,1) | P(1,0,0) | P(1,0,1) | P(1,1,0) | P(1,1,1) |
0.2440 | 0.2440 | 0.0812 | 0.2440 | 0.0812 | 0.0812 | 0.0245 |
Under this setting, the true value of $\psi_0$ can be calculated analytically as 0.6012. The model assumptions of $M_0$ and $M_t$ are violated, and their estimates are biased (estimated means 0.572 and 0.5722). The coverage of their 95% confidence intervals is also low (81.5%). Table 4 shows the model fit statistics for $M_0$ and $M_t$. As the NPMLE and TMLE estimators do not make parametric model assumptions, they are unbiased and the coverage of their 95% asymptotic confidence intervals is close to 95% (94.7% for $\psi_n^{NPMLE}$ and 93.7% for $\psi_n^{TMLE(u\text{-}lasso)}$), as shown in figure 3 and table 3.
Estimator | Average | Lower 95% CI | Upper 95% CI | Coverage(%) |
---|---|---|---|---|
$\psi_n^{NPMLE}$ | 0.6013 | 0.4625 | 0.7401 | 94.7
$\psi_n^{u\text{-}lasso}$ | 0.6482 | 0.5159 | 0.7805 | 78.4
$\psi_n^{TMLE(u\text{-}lasso)}$ | 0.6005 | 0.4653 | 0.7358 | 93.7
$M_0$ | 0.572 | 0.5274 | 0.6163 | 81.5
$M_t$ | 0.5722 | 0.5277 | 0.6166 | 81.5
Estimator | Df | AIC | BIC |
---|---|---|---|
$M_0$ | 5 | 55.499 | 65.314
$M_t$ | 3 | 58.440 | 78.071
5.3.3 Log-linear model with empty cells
When the probabilities of some cells are close to zero, in the finite sample case there are likely to be empty cells in the observed data. For example, if the probability $P_0(1,1,1)$ of a subject being caught in all 3 samples is very small relative to $1/n$, where $n$ is the total number of unique subjects caught by any of the 3 samples, then it is likely that we will not observe any subject being caught three times, i.e., cell $(1,1,1)$ will likely be empty.
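A quick back-of-the-envelope illustration in R, using the cell probability from the simulated distribution below and a hypothetical number of captured subjects:

```r
# Probability that cell (1,1,1) stays empty among n captured subjects,
# if each captured subject falls in that cell with probability p111.
p111 <- 2e-4   # as in the simulated distribution below
n    <- 1000   # hypothetical number of captured subjects
(1 - p111)^n   # ~0.82: an empty cell is more likely than not
```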
We used the parameters below to generate the underlying distribution (ordered as in section 5.3.1). The assumption that the K-way interaction term equals zero is satisfied in this setting.
-0.4578 | -1 | -2 | -3 | -1 | -1 | -1 | 0 |
The simulated observed probabilities for all 7 observed cells are:
P(0,0,1) | P(0,1,0) | P(0,1,1) | P(1,0,0) | P(1,0,1) | P(1,1,0) | P(1,1,1) |
0.0857 | 0.2331 | 0.0043 | 0.6336 | 0.0116 | 0.0315 | 2e-04 |
Figure 4 shows that all the estimators are biased when empty cells exist. Table 5 shows that the coverage of the 95% asymptotic confidence intervals is highest for $\psi_n^{TMLE(u\text{-}lasso)}$ (58.8%), while the coverage of the 95% asymptotic confidence intervals for $M_0$ and $M_t$ is 0, due to the estimation bias and the narrow range of the intervals. Table 6 shows the model fit statistics for $M_0$ and $M_t$.
Estimator | Average | Lower 95% CI | Upper 95% CI | Coverage(%) |
---|---|---|---|---|
$\psi_n^{u\text{-}lasso}$ | 0.5565 | 0.1956 | 0.9174 | 20
$\psi_n^{TMLE(u\text{-}lasso)}$ | 0.2308 | 0.0875 | 0.374 | 58.8
$M_0$ | 0.1318 | 0.0995 | 0.169 | 0
$M_t$ | 0.1729 | 0.1313 | 0.2201 | 0
Estimator | Df | AIC | BIC |
---|---|---|---|
$M_0$ | 5 | 501.562 | 511.378
$M_t$ | 3 | 45.506 | 65.137
5.4 Non-linear identification assumption: independence
Given 3 samples, we assume that the first and second samples are independent of each other, that is, $P_B(B_1 = 1, B_2 = 1) = P_B(B_1 = 1)\, P_B(B_2 = 1)$. Then we have $\Psi(P_0)$ identified in equation 8, and the influence curve identified in equation 14.
Here we illustrate the performance of our estimators with an underlying distribution satisfying the independence constraint. The resulting probability distribution over the 7 observed cells is:
P(0,0,1) | P(0,1,0) | P(0,1,1) | P(1,0,0) | P(1,0,1) | P(1,1,0) | P(1,1,1) |
0.4355 | 0.2540 | 0.1089 | 0.1210 | 0.0403 | 0.0302 | 0.0101 |
We can calculate the true value of $\psi_0$ analytically as 0.4958, and draw repeated samples to obtain the asymptotic bias and variance of the plug-in estimator.
Figure 5 shows that our estimators perform well when the independence assumptions hold true.
5.5 Non-linear identification assumption: conditional independence
Given 3 samples, we assume that the third sample $B_3$ is conditionally independent of the second sample $B_2$ given the first sample $B_1$, that is, $P_B(B_2 = 0, B_3 = 0 \mid B_1 = 0) = P_B(B_2 = 0 \mid B_1 = 0)\, P_B(B_3 = 0 \mid B_1 = 0)$. Then we can derive $\Psi(P_0) = \delta/(\delta + bc)$ from equation 9, and the efficient influence curve can be derived from equation 15.
Here we illustrate the performance of our estimator with an underlying distribution satisfying the conditional independence constraint. The resulting probability distribution over the 7 observed cells is:
P(0,0,1) | P(0,1,0) | P(0,1,1) | P(1,0,0) | P(1,0,1) | P(1,1,0) | P(1,1,1) |
0.3943 | 0.2784 | 0.0696 | 0.1546 | 0.0515 | 0.0387 | 0.0129 |
We can derive the true value of $\psi_0$ analytically as 0.388, and draw repeated samples to obtain the asymptotic bias and variance of the plug-in estimator.
Figure 6 shows that our estimator performs well when the conditional independence assumptions are met.
6 Evaluating the sensitivity of identification bias to violations of the assumed constraint
In this section, we use simulations to show the sensitivity of the identification bias of all the estimators to violations of their identification assumptions.
6.1 Violation of linear assumptions
In this section we analyze the same problem stated in section 5.1, but with the identification assumption violated. We used the parameters below to generate the underlying distribution (ordered as in section 5.1). The assumption that the highest-way interaction term equals zero is violated in this setting ($\beta_{123} = 0.2 \neq 0$).
0.11 | 0.1 | 0.05 | 0.08 | -0.2 | -0.2 | -0.1 | 0.2 |
The simulated observed probabilities for all 7 cells are:
P(0,0,1) | P(0,1,0) | P(0,1,1) | P(1,0,0) | P(1,0,1) | P(1,1,0) | P(1,1,1) |
0.2135 | 0.1798 | 0.0449 | 0.2360 | 0.1011 | 0.1978 | 0.0449 |
We can calculate the true value of $\psi_0$ analytically as $1 - 0.11 = 0.89$, and draw repeated samples to obtain the asymptotic value of the estimator; the asymptotic value is biased by 0.1481.
Figure 7 shows that the linear estimator is biased when the identification assumption is violated.
6.2 Violation of non-linear assumption: independence
Given 3 samples, we again suppose that the first and second samples are independent of each other, that is, $P_B(B_1 = 1, B_2 = 1) = P_B(B_1 = 1)\, P_B(B_2 = 1)$, and we compute $\psi_n$ and its efficient influence curve as stated above; in this simulation the assumption is violated.
Here we illustrate the performance of our estimators with the following underlying distribution, in which $B_1$ and $B_2$ are dependent. The probability of all 7 observed cells is given by:
P(0,0,1) | P(0,1,0) | P(0,1,1) | P(1,0,0) | P(1,0,1) | P(1,1,0) | P(1,1,1) |
0.1429 | 0.1429 | 0.1429 | 0.1143 | 0.1143 | 0.1714 | 0.1714 |
We can calculate the true value of $\psi_0$ analytically from the underlying distribution, and draw repeated samples to obtain the asymptotic value of the estimator; the asymptotic value is biased by 0.0794.
Figure 8 shows that violation of the independence assumption will create a significant bias in the results.
6.3 Violation of non-linear assumption: conditional independence
Same as above, our assumption is that, given 3 samples, the third sample $B_3$ is independent of the second sample $B_2$ conditional on the first sample $B_1$. Then we can derive the influence curve in the same way as above.
Here we illustrate the performance of our estimators when the identification assumption does not hold.
The simulated distribution over the 7 observed cells is as follows:
P(0,0,1) | P(0,1,0) | P(0,1,1) | P(1,0,0) | P(1,0,1) | P(1,1,0) | P(1,1,1) |
0.6194 | 0.1749 | 0.0437 | 0.0648 | 0.0648 | 0.0291 | 0.0032 |
The true value of $\psi_0$ can be computed from the underlying distribution, and we drew repeated samples and obtained the asymptotic value of the estimator; the asymptotic value is biased by 0.3244.
Figure 9 shows that with a strong violation of conditional independence assumptions, the estimated values will have significant biases.
6.4 Violation of non-linear assumption: K-way interaction term equals zero
Here we illustrate the performance of our estimators when the identification assumption that the K-way interaction term equals zero does not hold. We used the parameters below to generate the underlying distribution (ordered as in section 5.3.1); the three-way interaction term equals 1 rather than 0. The underlying distribution otherwise follows the same log-linear model as in section 5.3.
-1.6333 | 0 | 0 | 0 | -1 | -2 | -0.5 | 1 |
The simulated observed probabilities for all 7 observed cells are:
P(0,0,1) | P(0,1,0) | P(0,1,1) | P(1,0,0) | P(1,0,1) | P(1,1,0) | P(1,1,1) |
0.2386 | 0.2386 | 0.0878 | 0.2386 | 0.0323 | 0.1447 | 0.0196 |
Figure 10 and table 7 show that when the assumption is heavily violated, all estimators are significantly biased. The true value is $\psi_0 = 0.8074$. The bias for the $\psi_n^{NPMLE}$ estimator is 0.2032, for the $\psi_n^{u\text{-}lasso}$ estimator 0.1372, for $\psi_n^{TMLE(u\text{-}lasso)}$ 0.2055, for the $M_0$ estimator 0.2217, and for the $M_t$ estimator 0.2185. The coverage of the 95% asymptotic confidence intervals is 26.3% for $\psi_n^{NPMLE}$, 58.6% for $\psi_n^{u\text{-}lasso}$, 21.9% for $\psi_n^{TMLE(u\text{-}lasso)}$, and 0% for $M_0$ and $M_t$. Table 8 shows the fit statistics of the $M_0$ and $M_t$ models.
Estimator | Average | Lower 95% CI | Upper 95% CI | Coverage(%)
---|---|---|---|---|
$\psi_n^{NPMLE}$ | 0.6042 | 0.4491 | 0.7592 | 26.3
$\psi_n^{u\text{-}lasso}$ | 0.6702 | 0.5261 | 0.8142 | 58.6
$\psi_n^{TMLE(u\text{-}lasso)}$ | 0.6019 | 0.4516 | 0.7523 | 21.9
$M_0$ | 0.5857 | 0.5416 | 0.6295 | 0
$M_t$ | 0.5889 | 0.5447 | 0.6328 | 0
Estimator | df | AIC | BIC |
---|---|---|---|
$M_0$ | 5 | 157.720 | 167.535
$M_t$ | 3 | 128.623 | 148.254
6.5 Violation of all identification assumptions
In the sections above we showed that if an identification assumption is violated, its corresponding estimators will be biased and their asymptotic confidence intervals will have low coverage. In this section, all the identification assumptions are violated in the simulated distribution, and we analyzed the performance of all the estimators discussed above in this setting.
The underlying distribution of the three samples follows the same log-linear model as in section 5.3, with the parameters below (ordered as in section 5.3.1).
-1.1835 | -1 | -1 | -1 | -1.5 | -1 | 2 | 1 |
The simulated observed probability for all 7 observed cells is given by:
P(0,0,1) | P(0,1,0) | P(0,1,1) | P(1,0,0) | P(1,0,1) | P(1,1,0) | P(1,1,1) |
0.1624 | 0.1624 | 0.0133 | 0.1624 | 0.0220 | 0.4414 | 0.0362 |
Figure 11 and table 9 show that all estimators are significantly biased when none of their identification assumptions holds. The true value of $\psi_0$ is 0.694. Under the assumption that samples $B_2$ and $B_3$ are independent conditional on $B_1$ ("Conditional independence" in table 9), the plug-in estimator has a bias of -0.2497. Under the assumption that samples $B_1$ and $B_2$ are independent ("Independence" in table 9), the plug-in estimator has a bias of 0.3760. Under the assumption that the 3-way additive interaction term in the linear model equals zero ("Highest-way interaction equals zero (linear)" in table 9), the plug-in estimator has a bias of -0.2627. And under the assumption that the 3-way multiplicative interaction term in the log-linear model equals zero ("Highest-way interaction equals zero (log-linear)" in table 9), the $\psi_n^{NPMLE}$ estimator has a bias of 0.2517, the $\psi_n^{u\text{-}lasso}$ estimator has a bias of -0.1573, the $\psi_n^{TMLE(u\text{-}lasso)}$ estimator has a bias of -0.1867, the $M_0$ estimator has a bias of -0.1017, and the $M_t$ estimator has a bias of -0.1446. The coverage of the 95% asymptotic confidence intervals of all the estimators is far below 95%. Table 10 shows the fit statistics of the $M_0$ and $M_t$ models.
Assumption | Estimator | Average | Lower 95% CI | Upper 95% CI | Coverage(%) |
---|---|---|---|---|---|
Conditional independence | plug-in | 0.9435 | 0.9313 | 0.9557 | 31.2
Independence | plug-in | 0.3223 | 0.2472 | 0.3974 | 0
Highest-way interaction equals zero (linear) | plug-in | 0.9565 | 0.8999 | 1.0132 | 0
Highest-way interaction equals zero (log-linear) | $\psi_n^{NPMLE}$ | 0.4466 | 0.2519 | 0.6413 | 0
Highest-way interaction equals zero (log-linear) | $\psi_n^{u\text{-}lasso}$ | 0.8511 | 0.8005 | 0.9017 | 0
Highest-way interaction equals zero (log-linear) | $\psi_n^{TMLE(u\text{-}lasso)}$ | 0.8805 | 0.8385 | 0.9225 | 31.2
Highest-way interaction equals zero (log-linear) | $M_0$ | 0.7955 | 0.7621 | 0.8269 | 0
Highest-way interaction equals zero (log-linear) | $M_t$ | 0.8384 | 0.8074 | 0.8673 | 0
Estimator | Df | AIC | BIC |
---|---|---|---|
$M_0$ | 5 | 743.631 | 753.447
$M_t$ | 3 | 391.757 | 411.388
7 Data Analysis
Schistosomiasis is an acute and chronic parasitic disease endemic in 78 countries worldwide. Estimates show that at least 229 million people required preventive treatment in 2018 [24]. In China, schistosomiasis is an ongoing public health challenge and the country has set ambitious goals of achieving nationwide transmission interruption [25]. To monitor the transmission of schistosomiasis, China operates three surveillance systems which can be considered as repeated samples of the underlying infected population. The first is a national surveillance system ($S_1$) designed to monitor nationwide schistosomiasis prevalence, which covers 1% of communities in endemic provinces, conducting surveys every 6-9 years. The second is a sentinel system ($S_2$) designed to provide longitudinal measures of disease prevalence and intensity in select communities, which conducts yearly surveys. The third system ($S_3$) comprises routine surveillance of all communities in endemic counties, with surveys conducted on a roughly three-year cycle [25]. In this application, we use case records for schistosomiasis from these three systems for a region in southwestern China [17]. The community, having a population of about 6000, reported a total of 302 cases in 2004: 112 in $S_1$, 294 in $S_2$, and 167 in $S_3$, with 202 cases appearing on more than one system register (figure 12). The three surveillance systems are not independent, and we apply estimators under the four constraints (highest-way interaction equals zero in a log-linear model, highest-way interaction equals zero in a linear model, independence, and conditional independence) to the data.
The estimated $\psi$ with its 95% asymptotic confidence interval under each assumption is shown in table 11. The highest estimate of $\psi$ is 0.9969 (with 95% CI: (0.9965, 0.9972)) under the conditional independence assumption, and the lowest is 0.8935 (with 95% CI: (0.8041, 0.9829)) under the linear model with no highest-way interaction assumption (table 11). This analysis illustrates, using real data, the heavy dependence of estimates of $\psi$ on the arbitrary identification assumptions.
One way to solve this problem is to make certain identification assumptions true by design. In this case, we could make an effort to design the sentinel system $S_2$ and the routine system $S_3$ to be independent of each other conditional on the national system $S_1$. For example, let the sampling surveys in $S_2$ and $S_3$ be done independently, while each of them could borrow information from $S_1$. In so doing we can assume the conditional independence between the two samples and conduct the analysis correspondingly.
Assumption | Estimator | Estimated | Lower 95% CI | Upper 95% CI |
---|---|---|---|---|
Highest-way interaction equals zero (log-linear) | 0.9212 | 0.8755 | 0.9669 | |
Highest-way interaction equals zero (log-linear) | 0.9428 | 0.9089 | 0.9768 | |
Highest-way interaction equals zero (log-linear) | 0.9321 | 0.8959 | 0.9627 | |
Highest-way interaction equals zero (log-linear) | 0.9934 | 0.9748 | 1 | |
Highest-way interaction equals zero (linear) | 0.8935 | 0.8041 | 0.9829 | |
Independence | 0.963 | 0.9326 | 0.9935 | |
Conditional independence | 0.9969 | 0.9965 | 0.9972 |
Estimator | Df | AIC | BIC |
---|---|---|---|
$M_0$ | 5 | 329.383 | 336.804
$M_t$ | 3 | 60.139 | 74.980
8 Discussion
8.1 Summary table
In this section we summarise all the estimators we proposed under linear and non-linear constraints as below.
Identification assumption | Highest-way interaction term in linear model equals zero |
---|---|
Constraint | $\sum_b (-1)^{K - \sum_k b_k} P_B(b) = 0$
Method | Plug-in |
Target parameter | $\Psi(P_0) = \big(1 - E_0[(-1)^{\sum_k O_k}]\big)^{-1}$
Efficient influence curve | $D(P_0)(O) = \Psi(P_0)^2\big((-1)^{\sum_k O_k} - E_0[(-1)^{\sum_k O_k}]\big)$
Identification assumption | Independence between two samples |
---|---|
Constraint | $P_B(B_1 = 1, B_2 = 1) = P_B(B_1 = 1)\, P_B(B_2 = 1)$
Method | Plug-in |
Target parameter | $\Psi(P_0) = E_0[O_1 O_2]/\big(E_0[O_1]\, E_0[O_2]\big)$
Efficient influence curve | $D(P_0)(O) = \Psi(P_0)\Big(\frac{O_1 O_2 - z}{z} - \frac{O_1 - x}{x} - \frac{O_2 - y}{y}\Big)$,
where $x = E_0[O_1]$, $y = E_0[O_2]$, $z = E_0[O_1 O_2]$
Identification assumption | Conditional independence between two samples |
---|---|
Constraint | $P_B(B_2 = 0, B_3 = 0 \mid B_1 = 0) = P_B(B_2 = 0 \mid B_1 = 0)\, P_B(B_3 = 0 \mid B_1 = 0)$
Method | Plug-in |
Target parameter | $\Psi(P_0) = \delta/(\delta + bc)$
Efficient influence curve | $D(P_0)(O) = \frac{\partial\Psi}{\partial a}(1\{O_1 = 0\} - a) + \frac{\partial\Psi}{\partial b}(1\{O_1 = 0, O_2 = 0\} - b) + \frac{\partial\Psi}{\partial c}(1\{O_1 = 0, O_3 = 0\} - c)$,
where $a = P_0(O_1 = 0)$, $b = P_0(O_1 = 0, O_2 = 0)$, $c = P_0(O_1 = 0, O_3 = 0)$, $\delta = a - b - c$
Identification assumption | Highest-way interaction term in log-linear model equals zero |
---|---|
Constraint | $\sum_b (-1)^{K - \sum_k b_k} \log P_B(b) = 0$
Method | Plug-in (NPMLE), undersmoothed lasso, TMLE based on lasso |
Target parameter | $\Psi(P_0) = \big(1 + \prod_{o \neq 0} P_0(o)^{(-1)^{\sum_k o_k + 1}}\big)^{-1}$
Efficient influence curve | $D(P_0)(O) = \Psi(P_0)\big(1 - \Psi(P_0)\big)\big(1 + (-1)^{\sum_k O_k}/P_0(O)\big)$
We developed a modern method to estimate population size based on capture-recapture designs with a minimal number of constraints or parametric assumptions. We provide the solutions, theoretical support, simulation studies and sensitivity analyses for four identification assumptions: independence between two samples, conditional independence between two samples, no highest-way interaction in linear models, and no highest-way interaction in log-linear models. We also developed machine learning algorithms to address the curse of dimensionality in high-dimensional problems under the assumption of no highest-way interaction in a log-linear model. Through our analysis, we found that whether the identification assumption holds true plays a vital role in the performance of the estimation. When the assumption is violated, all estimators will be biased. This conclusion applies to models of all forms, parametric or non-parametric, simple plug-in estimators or complex machine-learning based estimators. Thus one should always ensure that the chosen identification assumption is known to be true by survey design; otherwise all the estimators will be unreliable. Under the circumstances where the identification assumption holds true, the performance of our targeted maximum likelihood estimator, $\psi_n^{TMLE(u\text{-}lasso)}$, is superior to the $M_0$ (identical capture probabilities, no highest-way interaction in the log-linear model), $M_t$ (no interaction terms in the log-linear model) and $\psi_n^{NPMLE}$ (plug-in, no highest-way interaction in the log-linear model) estimators in several aspects. First, by making the smallest number of assumptions required for identifiability, the TMLE is more robust in empirical data analysis, as there will be no bias due to violations of parametric model assumptions. Second, the TMLE is based on a consistent undersmoothed lasso estimator; this property ensures the asymptotic efficiency of the TMLE. Third, when there are empty cells, the TMLE overcomes the curse of dimensionality by correcting the bias introduced by the undersmoothed lasso estimator, and gives a more honest asymptotic confidence interval, wider than those from parametric models, and hence with higher coverage.
References
- [1] Anne Chao et al. “The applications of capture-recapture models to epidemiological data” In Statistics in medicine 20.20 Wiley Online Library, 2001, pp. 3123–3157
- [2] Stephen T Buckland et al. “Introduction to distance sampling: estimating abundance of biological populations” Oxford (United Kingdom) Oxford Univ. Press, 2001
- [3] Mathew W Alldredge, Kenneth H Pollock and Theodore R Simons “Estimating detection probabilities from multiple-observer point counts” In The Auk 123.4 Oxford University Press, 2006, pp. 1172–1182
- [4] Anne Chao “An overview of closed capture-recapture models” In Journal of Agricultural, Biological, and Environmental Statistics 6.2 Springer, 2001, pp. 158–175
- [5] Zachary T Kurtz “Local log-linear models for capture-recapture” In arXiv preprint arXiv:1302.0890, 2013
- [6] Janet T Wittes “Applications of a multinomial capture-recapture model to epidemiological data” In Journal of the American Statistical Association 69.345 Taylor & Francis Group, 1974, pp. 93–97
- [7] George Arthur Frederick Seber “The estimation of animal abundance and related parameters” Blackburn press Caldwell, New Jersey, 1982
- [8] Hamparsum Bozdogan “Model selection and Akaike’s information criterion (AIC): The general theory and its analytical extensions” In Psychometrika 52.3 Springer, 1987, pp. 345–370
- [9] Gideon Schwarz “Estimating the dimension of a model” In Annals of statistics 6.2 Institute of Mathematical Statistics, 1978, pp. 461–464
- [10] David Draper “Assessment and propagation of model uncertainty” In Journal of the Royal Statistical Society: Series B (Methodological) 57.1 Wiley Online Library, 1995, pp. 45–70
- [11] Ernest B Hook and Ronald R Regal “Validity of methods for model selection, weighting for model uncertainty, and small sample adjustment in capture-recapture estimation” In American journal of epidemiology 145.12 Oxford University Press, 1997, pp. 1138–1144
- [12] Toon Braeye et al. “Capture-recapture estimators in epidemiology with applications to pertussis and pneumococcal invasive disease surveillance” In PloS one 11.8 Public Library of Science San Francisco, CA USA, 2016, pp. e0159832
- [13] Manjari Das and Edward H Kennedy “Doubly robust capture-recapture methods for estimating population size” In arXiv preprint arXiv:2104.14091, 2021
- [14] Mark J van der Laan and Sherri Rose “Targeted learning: causal inference for observational and experimental data” Springer Science & Business Media, 2011
- [15] Annegret Grimm, Bernd Gruber and Klaus Henle “Reliability of different mark-recapture methods for population size estimation tested against reference population sizes constructed from field data” In PLoS One 9.6 Public Library of Science, 2014, pp. e98840
- [16] Samuel G Rees, Anne E Goodenough, Adam G Hart and Richard Stafford “Testing the effectiveness of capture mark recapture population estimation techniques using a computer simulation with known population size” In Ecological Modelling 222.17 Elsevier, 2011, pp. 3291–3294
- [17] Robert C Spear et al. “Factors influencing the transmission of Schistosoma japonicum in the mountains of Sichuan Province of China” In The American journal of tropical medicine and hygiene 70.1 ASTMH, 2004, pp. 48–56
- [18] René Gateaux “Fonctions d’une infinité de variables indépendantes” In Bulletin de la Société Mathématique de France 47, 1919, pp. 70–96
- [19] Richard D Gill, Jon A Wellner and Jens Praestgaard “Non-and semi-parametric maximum likelihood estimators and the von mises method (part 1)[with discussion and reply]” In Scandinavian Journal of Statistics JSTOR, 1989, pp. 97–128
- [20] Mark J van der Laan, David Benkeser and Weixin Cai “Efficient estimation of pathwise differentiable target parameters with the undersmoothed highly adaptive lasso” In arXiv preprint arXiv:1908.05607, 2019
- [21] Richard M Cormack “Log-linear models for capture-recapture” In Biometrics JSTOR, 1989, pp. 395–413
- [22] Sophie Baillargeon and Louis-Paul Rivest “Rcapture: loglinear models for capture-recapture in R” In Journal of Statistical Software 19.5, 2007, pp. 1–31
- [23] Zoe Emily Schnabel “The estimation of the total fish population of a lake” In The American Mathematical Monthly 45.6 Taylor & Francis, 1938, pp. 348–352
- [24] World Health Organization “Schistosomiasis: Key Facts” URL: https://www.who.int/en/news-room/fact-sheets/detail/schistosomiasis
- [25] Song Liang et al. “Surveillance systems for neglected tropical diseases: global lessons from China’s evolving schistosomiasis reporting systems, 1949–2014” In Emerging themes in epidemiology 11.1 Springer, 2014, pp. 19
- [26] Joseph L Doob “The limiting distributions of certain statistics” In The Annals of Mathematical Statistics 6.3 JSTOR, 1935, pp. 160–169
9 Appendix
In the appendix, we formally state the lemmas used in the main text and provide the proofs. In section 9.1, we derive the target parameter under the identification assumption that the first two samples are independent, given that there are three samples in total. In section 9.2 we derive the efficient influence curve for this parameter. In section 9.3, we state the lemma on how to derive the efficient influence curve under a multidimensional constraint $\Phi$. In section 9.4, we formally state the target parameter under the identification assumption that the second and third samples are independent conditional on the first sample. In section 9.5 we derive the efficient influence curve for this parameter. In section 9.6 we prove the asymptotic efficiency of the TMLE.
9.1 Target parameter under independence assumption
Lemma 1.
For constraint (4), we have the target parameter:
$$\Psi(P_0) = \frac{E_0[O_1 O_2]}{E_0[O_1]\, E_0[O_2]}.$$
Proof.
We provide a brief proof of lemma 1 when there are three samples. In this case, $\Phi(P_B) = 0$ is equivalent to
$$P_B(B_1 = 1, B_2 = 1) = P_B(B_1 = 1)\, P_B(B_2 = 1). \qquad (17)$$
When there are three samples, equation 17 can be expanded as:
$$\psi\big(P_0(1,1,0) + P_0(1,1,1)\big) = \psi^2\Big(\sum_{o: o_1 = 1} P_0(o)\Big)\Big(\sum_{o: o_2 = 1} P_0(o)\Big). \qquad (18)$$
Denote $x = \sum_{o: o_1 = 1} P_0(o)$ and $y = \sum_{o: o_2 = 1} P_0(o)$, and let $z = P_0(1,1,0) + P_0(1,1,1)$. Equation 18 can be written as:
$$\psi z = \psi^2 x y. \qquad (19)$$
Since $\psi > 0$, from equation 19 we know $z = \psi x y$, and thus we have
$$\psi = \frac{z}{xy} = \frac{E_0[O_1 O_2]}{E_0[O_1]\, E_0[O_2]},$$
where $x = E_0[O_1]$ and $y = E_0[O_2]$. ∎
9.2 Efficient influence curve under independence assumption
Lemma 2.
For constraint (4), we have the efficient influence curve as:
$$D_\Phi(P_0)(O) = \Psi(P_0)\Big(\frac{O_1 O_2 - z}{z} - \frac{O_1 - x}{x} - \frac{O_2 - y}{y}\Big), \qquad (20)$$
where $x = E_0[O_1]$, $y = E_0[O_2]$ and $z = E_0[O_1 O_2]$.
Proof.
By the delta method [26], the efficient influence curve of $\Psi(P_0) = z/(xy)$ can be written as a function of each component's influence curve. The efficient influence curves of the three components $z$, $x$ and $y$ are $O_1 O_2 - z$, $O_1 - x$ and $O_2 - y$, respectively. Therefore, we only need to calculate the derivatives to get $D_\Phi(P_0)$, and the three parts of the derivatives are given by
$$\frac{\partial \Psi}{\partial z} = \frac{1}{xy}, \qquad \frac{\partial \Psi}{\partial x} = -\frac{z}{x^2 y}, \qquad \frac{\partial \Psi}{\partial y} = -\frac{z}{x y^2}.$$
Combining the terms yields lemma 2. ∎
9.3 Efficient influence curve under multidimensional constraint
Lemma 3.
Consider a model $\mathcal{M} = \{P \in \mathcal{M}^0 : \Phi(P) = 0\}$ defined by an initial larger model $\mathcal{M}^0$ and a multivariate constraint function $\Phi = (\Phi_1, \ldots, \Phi_m)$. Suppose that $\Phi_j$ is pathwise differentiable at $P$ with efficient influence curve $D_{\Phi_j}(P)$ for all $j$. Let $T^0(P)$ be the tangent space at $P$ for model $\mathcal{M}^0$, and let $\Pi(\cdot \mid T^0(P))$ be the projection operator onto $T^0(P)$. The tangent space at $P$ for model $\mathcal{M}$ is given by:
$$T(P) = \big\{h \in T^0(P) : E_P[h\, D_{\Phi_j}(P)] = 0,\ j = 1, \ldots, m\big\}.$$
The projection onto $T(P)$ is given by:
$$\Pi(D \mid T(P)) = \Pi(D \mid T^0(P)) - \Pi\big(\Pi(D \mid T^0(P)) \mid \langle D_\Phi(P) \rangle\big),$$
where $\Pi(\cdot \mid \langle D_\Phi(P) \rangle)$ is the projection operator onto the $m$-dimensional subspace of $L^2_0(P)$ spanned by the components of $D_\Phi(P)$. The latter projection is given by the formula:
$$\Pi(D \mid \langle D_\Phi(P) \rangle) = E_P\big[D\, D_\Phi(P)^\top\big]\, E_P\big[D_\Phi(P)\, D_\Phi(P)^\top\big]^{-1} D_\Phi(P).$$
9.4 Target parameter under conditional independence assumption
Lemma 4. For constraint (5), the target parameter is given by equations 9 and 10:
$$\Psi(P_0) = \frac{\delta}{\delta + bc}, \quad \text{where } a = P_0(O_1 = 0),\ b = P_0(O_1 = 0, O_2 = 0),\ c = P_0(O_1 = 0, O_3 = 0),\ \delta = a - b - c.$$
Proof. Constraint (5) can be written as $P_B(B = 0)\, P_B(B_1 = 0) = P_B(B_1 = 0, B_2 = 0)\, P_B(B_1 = 0, B_3 = 0)$. Substituting $P_B(B = 0) = 1 - \psi$, $P_B(B_1 = 0) = 1 - \psi + \psi a$, $P_B(B_1 = 0, B_2 = 0) = 1 - \psi + \psi b$ and $P_B(B_1 = 0, B_3 = 0) = 1 - \psi + \psi c$, expanding both sides and cancelling the common $(1-\psi)^2$ terms yields $(1 - \psi)(a - b - c) = \psi bc$, and thus $\psi = \delta/(\delta + bc)$. ∎
9.5 Efficient influence curve under conditional independence assumption
Lemma 5. For constraint (5), the efficient influence curve of $\Psi$ is given by equation 15.
Proof.
By the delta method [26], the efficient influence curve of $\Psi(P_0) = \delta/(\delta + bc)$ can be written as a function of each component's influence curve. The efficient influence curves of the three components $a$, $b$ and $c$ are presented as follows:
$$1\{O_1 = 0\} - a, \qquad (22)$$
$$1\{O_1 = 0, O_2 = 0\} - b, \qquad (23)$$
$$1\{O_1 = 0, O_3 = 0\} - c. \qquad (24)$$
Therefore, we only need to calculate the derivatives to get $D_\Phi(P_0)$, and the three parts of the derivatives are given by
$$\frac{\partial \Psi}{\partial a} = \frac{bc}{(\delta + bc)^2}, \qquad \frac{\partial \Psi}{\partial b} = -\frac{c(\delta + b)}{(\delta + bc)^2}, \qquad \frac{\partial \Psi}{\partial c} = -\frac{b(\delta + c)}{(\delta + bc)^2},$$
where we use that $\partial\delta/\partial a = 1$ and $\partial\delta/\partial b = \partial\delta/\partial c = -1$. Plugging each part into the delta-method expansion, we have that lemma 5 is true. ∎
9.6 Asymptotic efficiency of TMLE
In this section we prove the following theorem 1, establishing the asymptotic efficiency of the TMLE. The empirical NPMLE is also efficient, since asymptotically all cells are filled up. The TMLE is just a finite sample improvement, and asymptotically the TMLE acts as the empirical NPMLE. The estimators behave like MLEs in a parametric model and thus converge at rate $n^{-1/2}$.
Theorem 1.
Consider the TMLE $\psi_n^* = \Psi(P_n^*)$ of $\psi_0 = \Psi(P_0)$ defined above, satisfying $P_n D_\Phi(P_n^*) = 0$. We assume $P_0(o) > 0$ for all $o \neq 0$. If $\|P_n^* - P_0\| = o_P(n^{-1/4})$, then $\psi_n^*$ is an asymptotically efficient estimator of $\psi_0$:
$$\psi_n^* - \psi_0 = \frac{1}{n}\sum_{i=1}^n D_\Phi(P_0)(O_i) + o_P(n^{-1/2}).$$
Proof.
Define $R(P, P_0) \equiv \Psi(P) - \Psi(P_0) + P_0 D_\Phi(P)$. Note that
$$\Psi(P) - \Psi(P_0) = -P_0 D_\Phi(P) + R(P, P_0).$$
Consider the second order Taylor expansion of $\Psi$ at $P_0$:
$$\Psi(P) - \Psi(P_0) = \sum_{o \neq 0}\frac{d\Psi(P_0)}{dP_0(o)}\,(P - P_0)(o) + R_2(P, P_0),$$
where
$$R_2(P, P_0) = O\Big(\sum_{o \neq 0}\big((P - P_0)(o)\big)^2\Big). \qquad (25)$$
From equation 25 we know that $R(P, P_0)$ is a second order term involving square differences $\big((P - P_0)(o)\big)^2$ for $o \neq 0$. Thus, we observe that
$$R(P, P_0) = O\big(\|P - P_0\|^2\big).$$
This proves that
$$\Psi(P) - \Psi(P_0) = -P_0 D_\Phi(P) + O\big(\|P - P_0\|^2\big).$$
We can apply this identity to $P = P_n^*$, so that we obtain $\Psi(P_n^*) - \Psi(P_0) = -P_0 D_\Phi(P_n^*) + R(P_n^*, P_0)$. Combining this identity with $P_n D_\Phi(P_n^*) = 0$ yields:
$$\Psi(P_n^*) - \Psi(P_0) = (P_n - P_0) D_\Phi(P_n^*) + R(P_n^*, P_0).$$
We will assume that $P_n^*$ is consistent for $P_0$ and $P_0(o) > 0$ for all $o \neq 0$, so that it follows that $D_\Phi(P_n^*)$ is uniformly bounded by an $M < \infty$ with probability tending to 1, and that it falls in a $P_0$-Donsker class (the dimension $2^K - 1$ is finite), and $P_0\big(D_\Phi(P_n^*) - D_\Phi(P_0)\big)^2 \to 0$ in probability as $n \to \infty$. By empirical process theory, it now follows that
$$(P_n - P_0) D_\Phi(P_n^*) = (P_n - P_0) D_\Phi(P_0) + o_P(n^{-1/2}).$$
We also note that $D_\Phi(P)$ has denominators $P(o)$ that are bounded away from zero, so that $R(P_n^*, P_0) = o_P(n^{-1/2})$ if $\|P_n^* - P_0\| = o_P(n^{-1/4})$ (e.g., in Euclidean norm). Thus theorem 1 is proved. ∎
10 Acknowledgement
This research was supported by NIH grant R01AI125842 and NIH grant 2R01AI074345.