
Estimation of population size based on capture-recapture designs and evaluation of the estimation reliability

Yue You Division of Biostatistics and Epidemiology, University of California, Berkeley Mark van der Laan Division of Biostatistics and Epidemiology, University of California, Berkeley Philip Collender Division of Environmental Health Sciences, University of California, Berkeley Qu Cheng Division of Environmental Health Sciences, University of California, Berkeley Alan Hubbard Division of Biostatistics and Epidemiology, University of California, Berkeley Nicholas P Jewell Division of Biostatistics and Epidemiology, University of California, Berkeley Zhiyue Tom Hu Division of Biostatistics and Epidemiology, University of California, Berkeley Robin Mejia Department of Statistics, Carnegie Mellon University Justin Remais Division of Environmental Health Sciences, University of California, Berkeley
Abstract

We propose a modern method to estimate population size based on capture-recapture designs of $K$ samples. The observed data are formulated as a sample of $n$ i.i.d. $K$-dimensional vectors of binary indicators, where the $k$-th component of each vector indicates whether the subject was caught by the $k$-th sample, such that only subjects with nonzero capture vectors are observed. The target quantity is the unconditional probability of the vector being nonzero across both observed and unobserved subjects. We cover models assuming a single general constraint on the $K$-dimensional distribution such that the target quantity is identified and the statistical model is unrestricted. We present solutions for general linear constraints, as well as constraints commonly assumed to identify capture-recapture models, including no $K$-way interaction in linear and log-linear models, and independence or conditional independence between samples. We demonstrate that the choice of constraint (identification assumption) has a dramatic impact on the value of the estimand, showing that it is crucial that the constraint is known to hold by design. For the commonly assumed constraint of no $K$-way interaction in a log-linear model, the statistical target parameter is only defined when each of the $2^K-1$ observable capture patterns is present, and therefore suffers from the curse of dimensionality. We propose a targeted MLE based on an undersmoothed lasso model to smooth across the cells while targeting the fit towards the single-valued target parameter of interest. For each identification assumption, we provide simulated inference and confidence intervals to assess the performance of the estimators under correct and incorrect identifying assumptions. We apply the proposed method, alongside existing estimators, to estimate the prevalence of a parasitic infection using multi-source surveillance data from a region in southwestern China, under the four identification assumptions.

Keywords: Asymptotically linear estimator, capture-recapture, targeted maximum likelihood estimation (TMLE), undersmoothed lasso.

1 Introduction

Epidemiologists use surveillance networks to monitor trends in disease frequency. When multiple surveillance components or surveys gather data on the same underlying population (such as those diagnosed with a particular disease over a particular time period), a variety of methods (capture-recapture designs [1], distance sampling [2], multiple observers [3], etc.) may be used to better estimate the disease occurrence in the population. Capture-recapture models are widely used for estimating the size of partially observed populations, usually assuming that individuals do not enter or leave the population between sample collections [4, 5]. These models have been widely applied to epidemiological data [6].

Due to the unobservability of outcomes for individuals not captured by any survey, additional identifying assumptions have to be made in capture-recapture problems. In two-sample scenarios, a common identification assumption is independence between the two samples (i.e., that capture in one survey does not change the probability of capture in the other survey). The estimator of population size based on this assumption is known as the Lincoln-Petersen estimator [7]. However, the independence assumption is often violated in empirical studies. In problems involving three or more surveys, it is common to assume that the highest-order interaction term in a log-linear or logistic model equals zero (i.e., that correlations between survey captures can be described by lower-order interactions alone), an assumption that is very difficult to interpret or empirically verify [4]. An additional challenge to capture-recapture estimators is the curse of dimensionality in the finite sample case, whereby the absence of one or more capture patterns from the observed sample leads to undefined interaction terms. Traditionally, a common approach to selecting among alternative capture-recapture models is to perform model selection based on Akaike's Information Criterion (AIC) [8], the Bayesian Information Criterion (BIC) [9], or Draper's version of the Bayesian Information Criterion [10]. However, this approach is known to have limited reliability in the presence of violations of its identifying assumptions [4, 11, 12]. Das and Kennedy contributed a doubly robust method under the specific identification assumption that two lists are conditionally independent given measured covariate information [13].

In this paper, we propose a generalizable framework for estimating population size with as few model assumptions as possible. This framework can be adapted to posit various identification assumptions, including independence between any pair of surveys or absence of highest-order interactions, and can be applied to linear and non-linear constraints. In high-dimensional settings with finite samples, we use machine learning methods to smooth over unobserved capture patterns and apply targeted maximum likelihood estimation (TMLE) [14] updates to reduce estimation bias. Previous work has shown the vulnerability of existing estimators to violations of identification assumptions [15, 16], and we further show the significant impact of misspecified identification assumptions on the estimation results for all existing and proposed estimators.

1.1 Paper outline

In this paper, we start with the statistical formulation of the estimation problem. In section 2, we define the framework of our estimators under linear and non-linear constraints. Specifically, we develop the estimators under each of the following identification assumptions: no K-way interaction in a linear model, independence among samples, conditional independence among samples, and no K-way interaction in a log-linear model. In section 3, we derive the efficient estimator of the target parameter and the corresponding statistical inference. In section 4, we provide the targeted learning updates for the smoothed estimators under the no K-way interaction log-linear model. In section 5, we illustrate the performance of existing and proposed estimators in various situations, including high-dimensional finite sample settings. In section 6, we show the performance of the estimators under violations of various identification assumptions. In section 7, we apply the estimators to surveillance data on a parasitic infectious disease in a region in southwestern China [17]. In section 8, we summarize the characteristics of the proposed estimators and state the main findings of the paper.

2 Statistical formulation of estimation problem

2.1 Defining the data and its probability distribution

We define the capture-recapture experiments in the following manner. One takes a first random sample of size $n_1$ from the population, records identifying characteristics of the individuals captured and an indicator of their capture, then repeats the process a total of $K$ times, resulting in $K$ samples of size $n_1,\ldots,n_K$.

Each individual $i$ in the population, whether captured or not, defines a capture history as a vector $B^*_i=(B^*_i(1),\ldots,B^*_i(K))$, where $B^*_i(k)$ denotes the indicator that individual $i$ is captured by sample $k$. We assume the capture history vectors $B^*_i$ of all $i=1,\ldots,N$ individuals independently and identically follow a common probability distribution, denoted by $P_{B^*}$. Thus $P_{B^*}$ is defined on the $2^K$ possible vectors of dimension $K$.

Note that, for the individuals contained in our observed sample of size $n=\sum_{k=1}^K n_k$, we actually observe the realization of $B^*$. For any individual $i$ that is never captured, we know that $B^*_i=0$, where $0=(0,\ldots,0)$ is the $K$-dimensional vector in which each component equals 0. However, we do not know the identity of these $N-n$ individuals, and we also do not know how many there are (i.e., we do not know $N-n$).

Therefore, we can conclude that our observed data set $B_1,\ldots,B_n$ consists of $n$ independent and identically distributed draws from the conditional distribution of $B^*$, given $B^*\neq 0$. Let's denote this true probability distribution with $P_0$ and its corresponding random variable with $B$.

So we can conclude that under our assumptions we have $B_1,\ldots,B_n\sim_{iid}P_0$, where $P_0(b)=P_{B^*,0}(b\mid B^*\neq 0)$ for all $b\in\{0,1\}^K$. Since the probability distribution $P$ of $B$ is implied by the distribution $P^*$ of $B^*$, we also use the notation $P=P_{P^*}$ to stress this parameterization.

2.2 Full-data model, target quantity

Let $\mathcal{M}^F$ be a model for the underlying distribution $P_{B^*,0}$. The full-data target parameter $\Psi^F:\mathcal{M}^F\rightarrow\mathbb{R}$ is defined as

\Psi^F(P_{B^*})=P_{B^*}(B^*\neq 0).

In other words, we want to know the proportion of $N$ that on average will be caught at least once by the combined sample. An estimator $\psi^F_n$ of $\psi^F_0=\Psi^F(P_{B^*,0})$ immediately translates into an estimator of the desired size $N$ of the population of interest: $N_n=n/\psi^F_n$. We will focus on full-data models defined by a single constraint, where we distinguish between linear constraints defined by a function $f$ and non-linear constraints defined by a function $\Phi$. Specifically, for a given function $f$, we define the full-data model

\mathcal{M}^F_f=\left\{P_{B^*}:\sum_b f(b)P_{B^*}(b)=0,\ 0<P^*(0)<1\right\}.

We will require that $f$ satisfies the following condition:

f(b=0)\neq 0. (1)

In other words, $\mathcal{M}^F_f$ contains all probability distributions of $B^*$ for which $E_{P_{B^*}}f(B^*)=0$. An example of interest is:

f_I(b)=(-1)^{K+\sum_{k=1}^K b_k}.

In this case, $E_0 f_I(B^*)=0$ is equivalent to

\sum_b(-1)^{K+\sum_{k=1}^K b_k}P_{B^*,0}(b)=0. (2)

We note that the left-hand side represents the $K$-way interaction term $\alpha_1$ (where the subscript $1$ denotes the all-ones vector $(1,\ldots,1)$) in the saturated model

P_{B^*}(b)=\alpha_0+\sum_{b'\neq 0}\alpha_{b'}\prod_{j:b_j'=1}b_j.

For example, if $K=2$, then the latter model states:

P_{B^*}(b_1,b_2)=\alpha_{00}+\alpha_{10}b_1+\alpha_{01}b_2+\alpha_{11}b_1b_2,

and the constraint $Ef_I(B^*)=0$ states that $\alpha_{11}=0$.
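To make this equivalence concrete: for $K=2$, $f_I(b)=(-1)^{2+b_1+b_2}$ assigns $+1$ to the patterns $(0,0)$ and $(1,1)$ and $-1$ to $(1,0)$ and $(0,1)$, so that

Ef_I(B^*)=P_{B^*}(0,0)-P_{B^*}(1,0)-P_{B^*}(0,1)+P_{B^*}(1,1)=\alpha_{11},

using $P_{B^*}(0,0)=\alpha_{00}$, $P_{B^*}(1,0)=\alpha_{00}+\alpha_{10}$, $P_{B^*}(0,1)=\alpha_{00}+\alpha_{01}$, and $P_{B^*}(1,1)=\alpha_{00}+\alpha_{10}+\alpha_{01}+\alpha_{11}$.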

One might also use a log-link in this saturated model:

\log P_{B^*}(b)=a_0+\sum_{b'\neq 0}a_{b'}\prod_{j:b_j'=1}b_j.

In this case, $a_1$ is the $K$-way interaction term in this log-linear model, and we now have

a_1\equiv\sum_b(-1)^{K+\sum_{k=1}^K b_k}\log P_{B^*,0}(b)=0.

So, assuming $a_1=0$ corresponds to assuming

0=\sum_b(-1)^{K+\sum_{k=1}^K b_k}\log P_{B^*,0}(b).

This is an example of a non-linear constraint $\Phi_I(P^*)=0$, where

\Phi_I(P^*)\equiv\sum_b(-1)^{1+\sum_{k=1}^K b_k}\log P_{B^*}(b). (3)

We will also consider general non-linear constraints defined by such a function $\Phi$, so for that purpose one can keep this example $\Phi_I$ in mind. As we will see, the choice of this constraint has quite dramatic implications for the resulting statistical target parameter/estimand and thereby for the resulting estimator. As one can already tell from the definition of $\Phi_I$, $\Phi_I$ is not even defined if there are some $b$ for which $P_{B^*}(b)=0$, so that an NPMLE of $\Psi^F_{\Phi_I}(P^*_0)$ will also be ill-defined whenever the empirical distribution satisfies $P_n(B=b)=0$ for some $b\neq 0$. As we will see, the parameter $\Psi^F_{f_I}(P^*_0)$ is very well estimated by the MLE, even when $K$ is very large relative to $n$, but for $\Psi^F_{\Phi_I}(P^*_0)$ we need a so-called TMLE, incorporating machine learning [14].

Since it is hard to believe that the $\Phi_I$ constraint is more realistic than the $f$-constraint in real applications, it appears that, without a good reason to prefer $\Phi_I$, the $f$-constraint is far superior. However, it also raises alarm bells that the choice of constraint, if wrong, can result in dramatically different statistical output, so one should really make sure that the chosen constraint is known to hold by design.

Another example of a non-linear constraint is the independence between two samples, i.e., that

P^*(B^*(1:2)=(0,0))=P^*(B^*(1)=0)P^*(B^*(2)=0),

i.e., that the binary indicators $B^*(1)$ and $B^*(2)$ are independent, while the remaining components can depend on each other and on $B^*(1),B^*(2)$. This corresponds to assuming $\Phi_{II}(P^*)=0$, where

\Phi_{II}(P^*)\equiv\sum_b I(b(1:2)=(0,0))P^*(b)-\sum_{b_1,b_2}I(b_1(1)=b_2(2)=0)P^*(b_1)P^*(b_2). (4)

A third non-linear constraint example is conditional independence between two samples, given the others. Suppose we have $K$ samples in total, and the $j$-th sample $B_j$ is independent of the $m$-th sample $B_m$ given all other samples (there is no time ordering of $j,m$). Then we can derive the target parameter and efficient influence curve for this conditional independence constraint. The conditional independence constraint $\Phi_{CI}=0$ is defined as

\Phi_{CI,(j,m)}=P^*(B_j=1\mid B_1=b_1,\cdots,B_m=0,\cdots,B_K=b_K)-P^*(B_j=1\mid B_1=b_1,\cdots,B_m=1,\cdots,B_K=b_K),\quad b_t\in\{0,1\},\ t=1,\cdots,K.

Because we must have the term $P^*(0,0,\cdots,0)$ in the constraint to successfully identify the parameter of interest, the version of the constraint $\Phi_{CI}=0$ evaluated at the all-zero pattern is sufficient. Thus the equation above can be reduced to

\Phi_{CI,(j,m)}=P^*(B_j=1\mid B_1=0,\cdots,B_m=0,\cdots,B_K=0)-P^*(B_j=1\mid B_1=0,\cdots,B_m=1,\cdots,B_K=0). (5)

This is because, among all the combinations of $\{B_1=b_1,\cdots,B_{m-1}=b_{m-1},B_{m+1}=b_{m+1},\cdots,B_K=b_K\}$, $b_t\in\{0,1\}$, $t=1,\cdots,K$, only the case $\{B_1=0,\cdots,B_{m-1}=0,B_{m+1}=0,\cdots,B_K=0\}$ generates the required term $P^*(0,0,\cdots,0)$ in the constraint.

2.3 Identifiability from probability distribution of data

Given such a full-data model with one constraint defined by $f$ or $\Phi$, one will need to establish that for all $P^*\in\mathcal{M}^F_f$, we have (say we use $f$)

\Psi^F(P^*)=\Psi_f(P)

for a known $\Psi_f:\mathcal{M}\rightarrow(0,1)$ and $P=P_{P^*}$, where the statistical model for $P_0$ is defined as

\mathcal{M}_f=\{P_{P_{B^*}}:P_{B^*}\in\mathcal{M}^F_f\}.

Let's first study this identifiability problem in the special case of our full-data models defined by the linear constraint $P^*_0f=0$ with $f(0)\neq 0$ (equation 2). We will show that the whole of $P_{B^*}$ is identifiable from $P=P_{P_{B^*}}$. Firstly, we note that $P_{P^*}(b)=P^*(b)/\psi^F$ for all $b\neq 0$. Thus, $P^*(b)=\psi^F P(b)$ for all $b\neq 0$. We also have $P^*(0)=1-\psi^F$, so that $\sum_b f(b)P^*(b)=0$ yields:

0=\sum_b f(b)P^*(b)=\sum_{b\neq 0}f(b)\psi^F P(b)+f(0)P^*(0)=\psi^F\sum_{b\neq 0}f(b)P(b)+f(0)(1-\psi^F).

We can now solve for $\psi^F$:

\psi^F=\frac{f(0)}{f(0)-\sum_{b\neq 0}f(b)P(b)}.

At first sight, one might wonder whether this solution is in the range $(0,1)$. Working backwards from the right-hand side, i.e., using $f(0)P^*(0)+\sum_{b\neq 0}f(b)P^*(b)=0$ and $P(b)=P^*(b)/\psi^F$, it indeed follows that the denominator equals $f(0)+f(0)(1-\psi^F)/\psi^F$, so that the right-hand side is indeed in $(0,1)$. Thus, we can conclude that:

\Psi^F(P^*)=\Psi_f(P)\equiv\frac{f(0)}{f(0)-Pf}, (6)

where we use the notation $Pf=\int f(b)\,dP(b)$ for the expectation operator.

Suppose now that the one-dimensional constraint in the full-data model is defined in general by $\Phi(P^*)=0$ for some function $\Phi:\mathcal{M}^F_\Phi\rightarrow\mathbb{R}$. The full-data model is now defined by $\mathcal{M}^F_\Phi=\{P^*:\Phi(P^*)=0,0<P^*(0)<1\}$, and the corresponding observed data model can be denoted with $\mathcal{M}_\Phi$. To establish identifiability, we still use $P^*(b)=\psi^F P(b)$ for all $b\neq 0$. We also still have $P^*(0)=1-\psi^F$. Define $P^*_{P,\psi}$ as $P^*_{P,\psi}(0)=1-\psi$ and $P^*_{P,\psi}(b)=\psi P(b)$ for $b\neq 0$. The constraint $\Phi(P^*)=0$ now yields the equation

\Phi(P^*_{P,\psi^F})=0 \mbox{ in } \psi^F

for a given $P=P_{P^*}$.

Consider now our particular example $\Phi_I$. Note that

\Phi_I(P^*_{P,\psi})=\sum_{b\neq 0}(-1)^{K+\sum_k b_k}\log\{\psi P(b)\}+(-1)^K\log(1-\psi).

We need to solve this equation in $\psi$ for the given $P$. For $K$ odd, we obtain

\Psi_I(P)=\frac{1}{1+\exp\left(\sum_{b\neq 0}f_I(b)\log P(b)\right)},

and for $K$ even we obtain $\log((1-\psi)/\psi)=-\sum_{b\neq 0}f_I(b)\log P(b)$ and thus

\Psi_I(P)=\frac{1}{1+\exp\left(-\sum_{b\neq 0}f_I(b)\log P(b)\right)}.

So, in general, this solution can be represented as:

\Psi_I(P)=\frac{1}{1+\exp\left((-1)^{K+1}\sum_{b\neq 0}f_I(b)\log P(b)\right)}. (7)

This solution $\Psi_I(P)$ only exists if $P(b)>0$ for all $b\neq 0$. In particular, a plug-in estimator $\Psi_I(P_n)$ based on the empirical distribution function $P_n$ would not be defined when $P_n(b)=0$ for some cells $b\in\{0,1\}^K$.
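To illustrate, a minimal plug-in computation of (7) can be sketched as follows (a sketch in Python/numpy; the function name and the dictionary input format are ours, for illustration only). It returns NaN exactly when some observable cell is empty, reflecting that the estimand map is then undefined:

```python
import numpy as np
from itertools import product

def psi_I_plugin(pn, K):
    """Plug-in estimator (7) under the no-K-way-interaction (log-linear) constraint.

    pn: dict mapping each nonzero capture pattern (a 0/1 tuple of length K)
        to its empirical probability P_n(b).
    """
    log_sum = 0.0
    for b in product([0, 1], repeat=K):
        if sum(b) == 0:
            continue                        # only nonzero patterns are observable
        if pn.get(b, 0.0) <= 0.0:
            return np.nan                   # empty cell: (7) is undefined
        f_I = (-1) ** (K + sum(b))          # f_I(b) = (-1)^{K + sum_k b_k}
        log_sum += f_I * np.log(pn[b])
    return 1.0 / (1.0 + np.exp((-1) ** (K + 1) * log_sum))
```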

For general $\Phi$, one needs to assume that this one-dimensional equation $H(\psi,P)\equiv\Phi(P^*_{P,\psi})=0$ in $\psi$, for a given $P\in\mathcal{M}$, always has a unique solution, which then proves the desired identifiability of $\Psi^F(P^*)$ from $P_{P^*}$ for any $P^*\in\mathcal{M}^F_\Phi$. This solution is denoted with $\Psi_\Phi(P)$. So, in this case, $\Psi_\Phi:\mathcal{M}_\Phi\rightarrow\mathbb{R}$ is defined implicitly by $H(\Psi_\Phi(P),P)=0$. If $\Phi$ is one-dimensional, one will still have that $\mathcal{M}_\Phi$ is nonparametric.

Let's now consider the $\Phi_{II}$ constraint, which assumes that $B^*(1)$ is independent of $B^*(2)$. Again, using $P^*(b)=\psi P(b)$ for $b\neq 0$ and $P^*(0)=1-\psi$, the equation $\Phi_{II}(P^*)=0$ yields the following quadratic equation in $\psi$:

a_{II}(P)\psi^2+b_{II}(P)\psi=0,

where

a_{II}(P)=-\sum_{b_1\neq 0,b_2\neq 0}I(b_1(1)=b_2(2)=0)P(b_1)P(b_2)+\sum_{b_2\neq 0}I(b_2(2)=0)P(b_2)+\sum_{b_1\neq 0}I(b_1(1)=0)P(b_1)-1,

b_{II}(P)=\sum_{b\neq 0}I(b(1:2)=(0,0))P(b)-\sum_{b_2\neq 0}I(b_2(2)=0)P(b_2)-\sum_{b_1\neq 0}I(b_1(1)=0)P(b_1)+1.

Since $\psi\neq 0$, this yields the equation $a_{II}(P)\psi+b_{II}(P)=0$ and thus

\Psi_{II}(P)=\frac{-b_{II}(P)}{a_{II}(P)}.

A more helpful representation of this parameter is given by:

\Psi_{II}(P)=\frac{1-P(B(1)=0)-P(B(2)=0)+P(B(1:2)=(0,0))}{1-P(B(1)=0)-P(B(2)=0)+P(B(1)=0)P(B(2)=0)}. (8)

This identifiability result relies on $\Psi_{II}(P)\in(0,1)$, i.e., that $P(B(1)=B(2)=0)<P(B(1)=0)P(B(2)=0)$. In particular, we need $0<P(B(1)=0)<1$ and $0<P(B(2)=0)<1$, and thereby that each of the three cells $(1,0),(0,1),(1,1)$ has positive probability under the bivariate marginal distribution of $B(1),B(2)$ under $P$. It follows trivially that the inequality holds for $K=2$, since in that case $P(B(1)=B(2)=0)=0$. It will have to be verified whether this inequality constraint always holds for $K>2$ as well, or whether it is an actual assumption in the statistical model. Since it would be an inequality constraint in the statistical model, it would not affect the tangent space, and thereby not the efficient influence function for $\Psi_{II}:\mathcal{M}\rightarrow\mathbb{R}$ presented below, but the MLE would now involve maximizing the likelihood under this constraint so that the resulting MLE of $P_0$ satisfies it.
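A minimal plug-in computation of (8) is sketched below (Python/numpy; the function name and input format are ours). As noted above, only the bivariate margin of the first two samples enters:

```python
import numpy as np

def psi_II_plugin(B):
    """Plug-in estimator (8) under independence of samples 1 and 2.

    B: (n, K) array of 0/1 observed capture histories (all rows nonzero).
    """
    p1 = np.mean(B[:, 0] == 0)                       # P_n(B(1) = 0)
    p2 = np.mean(B[:, 1] == 0)                       # P_n(B(2) = 0)
    p12 = np.mean((B[:, 0] == 0) & (B[:, 1] == 0))   # P_n(B(1:2) = (0,0))
    return (1 - p1 - p2 + p12) / (1 - p1 - p2 + p1 * p2)
```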

Similarly, for the conditional independence assumption, the parameter $\Psi_{CI}$ can be derived as

\Psi_{CI}=\frac{P(B_m=1,B_j=1,0,\ldots,0)}{C_0(P)}, (9)

where

C_0(P)=P(B_m=1,B_j=1,0,\ldots,0)+P(B_m=0,B_j=1,0,\ldots,0)P(B_m=1,B_j=0,0,\ldots,0). (10)

Details of this derivation are presented in appendix 9.4.
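For completeness, the corresponding plug-in computation of (9)-(10) can be sketched as follows (Python/numpy; the function name and input format are ours); it uses only the capture patterns in which all samples other than $j$ and $m$ are zero:

```python
import numpy as np

def psi_CI_plugin(B, j, m):
    """Plug-in estimator (9)/(10) under conditional independence of samples j, m.

    B: (n, K) array of 0/1 observed capture histories; j, m: column indices.
    """
    rest = np.delete(np.arange(B.shape[1]), [j, m])
    zero_rest = (B[:, rest] == 0).all(axis=1)        # all other samples are 0
    p11 = np.mean(zero_rest & (B[:, m] == 1) & (B[:, j] == 1))
    p01 = np.mean(zero_rest & (B[:, m] == 0) & (B[:, j] == 1))
    p10 = np.mean(zero_rest & (B[:, m] == 1) & (B[:, j] == 0))
    return p11 / (p11 + p01 * p10)                   # eq. (9) with C_0 from (10)
```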

One could also define a model by a multivariate $\Phi:\mathcal{M}^F\rightarrow\mathbb{R}^d$. In this case, the full-data model is restricted by $d$ constraints $\Phi(P^*)=0$ and the observed data model will no longer be saturated. Let's consider the example in which we assume that $(B^*(1),\ldots,B^*(K))$ are $K$ independent Bernoulli random variables. In this case, we assume that $P^*(B^*=b)=\prod_{k=1}^K P^*(B^*(k)=b(k))$ for all possible $b\in\{0,1\}^K$. This can also be defined as stating that each pair of components $B^*(j_1),B^*(j_2)$ consists of two independent Bernoullis. Let $\Phi_{II,j_1,j_2}(P)$ be the constraint defined as in (4) but with $B(1)$ and $B(2)$ replaced by $B(j_1)$ and $B(j_2)$, respectively. Then, we can define $\Phi_{III}(P)=(\Phi_{II,j_1,j_2}(P):(j_1,j_2)\in\{1,\ldots,K\}^2,j_1<j_2)$, a vector of dimension $K(K-1)/2$. This defines the model $\mathcal{M}^F_{III}=\{P^*:\Phi_{III}(P^*)=0\}$, which assumes that all components of $B^*$ are independent. We will also work out the MLE and efficient influence curve of $\psi^F_0$ for this restricted statistical model.

2.4 Statistical model and target parameter

We have now defined the statistical estimation problem for linear and non-linear constraints. For linear constraints defined by a function $f$, we observe $B_1,\ldots,B_n\sim P_0\in\mathcal{M}_f=\{P_{P^*}:P^*\in\mathcal{M}^F\}$, $\mathcal{M}^F=\{P^*:P^*f=0,0<P^*(0)<1\}$, and our statistical target parameter is given by $\Psi_f:\mathcal{M}\rightarrow\mathbb{R}$, where

\Psi_f(P)=\frac{f(0)}{f(0)-Pf}. (11)

The statistical target parameter satisfies $\Psi^F(P^*)=P^*(B^*\neq 0)=\Psi_f(P_{P^*})$ for all $P^*\in\mathcal{M}^F$.

Since the full-data model only includes a single constraint, it follows that $\mathcal{M}_f$ consists of all possible probability distributions of $B$ on $\{b:b\neq 0\}$, so that it is a nonparametric/saturated model.

Similarly, we can state the statistical model and target parameter for our examples using non-linear one-dimensional constraints $\Phi$.

In general, we have a statistical model $\mathcal{M}=\{P_{P^*}:P^*\in\mathcal{M}^F\}$ for some full-data model $\mathcal{M}^F$ for the distribution of $B^*$, and we would be given a particular mapping $\Psi:\mathcal{M}\rightarrow\mathbb{R}$ satisfying $\Psi^F(P^*)=\Psi(P_{P^*})$ for all $P^*\in\mathcal{M}^F$. In this case, our goal is to estimate $\Psi(P_0)$ based on knowing that $P_0\in\mathcal{M}$.

2.5 Efficient influence curve of target parameter

An estimator of $\Psi(P_0)$ is efficient if and only if it is asymptotically linear with influence curve equal to the canonical gradient of the pathwise derivative of $\Psi$. Therefore it is important to determine this canonical gradient: it teaches us how to construct an efficient estimator of $\Psi(P)$ in model $\mathcal{M}$, and it provides us with Wald-type confidence intervals based on an efficient estimator. As we will see, for most of our estimation problems the model is nonparametric, so that $P$ and $\Psi(P)$ can be estimated with the empirical measure $P_n$ and the plug-in $\Psi(P_n)$. However, for the constraint $\Phi_I$, an NPMLE is typically not defined due to empty cells, so that smoothing and bias reduction are needed.

Let $\Psi_{1f}(P)=\sum_{b\neq 0}f(b)P(b)$, so that $\Psi_f(P)=\frac{f(0)}{f(0)-\Psi_{1f}(P)}$. Note that $\Psi_{1f}(P)=Pf$ is simply the expectation of $f(B)$ w.r.t. its distribution. $\Psi_{1f}:\mathcal{M}_f\rightarrow\mathbb{R}$ is a pathwise differentiable parameter at any $P\in\mathcal{M}$ with canonical gradient/efficient influence curve given by:

D^*_{1f}(P)(B)=f(B)-\Psi_{1f}(P).

By the delta-method, the efficient influence curve of $\Psi_f$ at $P$ is given by:

D^*_f(P)(B)=\frac{f(0)}{\{f(0)-\Psi_{1f}(P)\}^2}D^*_{1f}(P)(B). (12)

In general, if $\Psi:\mathcal{M}\rightarrow\mathbb{R}$ is the target parameter and the model $\mathcal{M}$ is nonparametric, then the efficient influence curve $D^*(P)$ is given by $D^*(P)=d\Psi(P)(P_{n=1}-P)$, where

d\Psi(P)(h)=\left.\frac{d}{d\epsilon}\Psi(P+\epsilon h)\right|_{\epsilon=0}

is the Gateaux derivative in the direction $h$, and $P_{n=1}$ is the empirical distribution for a sample of size one, $\{B\}$, putting all its mass on $B$ [18].

Consider now the model $\mathcal{M}_\Phi$ for a general univariate constraint function $\Phi:\mathcal{M}^F_{NP}\rightarrow\mathbb{R}$ that maps any possible $P^*$ into a real number. Recall that $\Psi_\Phi:\mathcal{M}_\Phi\rightarrow\mathbb{R}$ is now defined implicitly by $\Phi(P^*_{P,\psi})=0$. For notational convenience, in this particular paragraph we suppress the dependence of $\Psi$ on $\Phi$. The implicit function theorem implies that the general form of the efficient influence curve is given by:

D^*_\Phi(P)(B)=-\left\{\frac{d}{d\psi}\Phi(P^*_{P,\psi})\right\}^{-1}\frac{d}{dP}\Phi(P^*_{P,\psi})(P_{n=1}-P),

where the latter derivative is the directional derivative of $P\rightarrow\Phi(P^*_{P,\psi})$ in the direction $P_{n=1}-P$.

Note that

\frac{d}{d\psi}\Phi(P^*_{P,\psi})=d\Phi(P^*_{P,\psi})\frac{d}{d\psi}P^*_{P,\psi}.

We now use that $P^*_{P,\psi}(b)=P(b)I(b\neq 0)+(1-\psi)I(b=0)$. So we obtain:

\frac{d}{d\psi}\Phi(P^*_{P,\psi})=d\Phi(P^*_{P,\psi})(-I_0),

where $I_0(b)$ is the function in $b$ that equals 1 if $b=0$ and zero otherwise, and

d\Phi(P^{*0})(h)\equiv\left.\frac{d}{d\epsilon}\Phi(P^{*0}+\epsilon h)\right|_{\epsilon=0}

is the directional/Gateaux derivative of $\Phi$ in the direction $h$. We also have:

\frac{d}{dP}\Phi(P^*_{P,\psi})(P_{n=1}-P)=d\Phi(P^*_{P,\psi})\frac{d}{dP}P^*_{P,\psi}(P_{n=1}-P).

We have $\frac{d}{dP}P^*_{P,\psi}(h)=I_{0^c}h$, where $I_{0^c}(b)$ is the function in $b$ that equals zero if $b=0$ and equals 1 otherwise. So we obtain:

\frac{d}{dP}\Phi(P^*_{P,\psi})(P_{n=1}-P)=d\Phi(P^*_{P,\psi})(I_{0^c}(P_{n=1}-P)).

We conclude that

D^*_\Phi(P)=-\{d\Phi(P^*_{P,\psi})(-I_0)\}^{-1}d\Phi(P^*_{P,\psi})(I_{0^c}(P_{n=1}-P)).

Let's now consider our special example $\Phi_I$. In this case, the statistical target parameter $\Psi_I:\mathcal{M}\rightarrow\mathbb{R}$ is given by (7). It is straightforward to show that the directional derivative of $\Psi_I$ at $P=(P(b):b)$ in direction $h=(h(b):b)$ is given by:

d\Psi_I(P)(h)=(-1)^K\Psi_I(P)(1-\Psi_I(P))\sum_{b\neq 0}\frac{f_I(b)}{P(b)}h(b).

As one would have predicted from the definition of $\Psi_I$, this directional derivative is only bounded if $P(b)>0$ for all $b\neq 0$. The efficient influence curve is thus given by $d\Psi_I(P)(P_{n=1}-P)$:

D^*_{\Phi_I}(P)=(-1)^K\Psi_I(P)(1-\Psi_I(P))\sum_{b\neq 0}\frac{f_I(b)}{P(b)}\{I_B(b)-P(b)\}=(-1)^K\Psi_I(P)(1-\Psi_I(P))\left\{\frac{f_I(B)}{P(B)}+f_I(0)\right\}, (13)

where we use that $\sum_{b\neq 0}f_I(b)=-f_I(0)$, so that indeed the expectation of $D^*_{\Phi_I}(P)$ equals zero (under $P$).

The efficient influence function for $\Psi_{II}:\mathcal{M}\rightarrow\mathbb{R}$ (8), corresponding with the constraint $\Phi_{II}(P^*)=0$, is the influence curve of the empirical plug-in estimator $\Psi_{II}(P_n)$, and can thus be derived from the delta-method:

D^*_{\Phi_{II}}(P)=C_2(P)\{I(B(1)=0)-P(B(1)=0)\}+C_3(P)\{I(B(2)=0)-P(B(2)=0)\}+C_4(P)\{I(B(1)=B(2)=0)-P(B(1:2)=0)\}, (14)

where

C_2(P)=\frac{1-P(B(1)=0)-P(B(2)=0)-P(B(1)=B(2)=0)}{(1-P(B(1)=0)-P(B(2)=0)-P(B(1)=0)P(B(2)=0))^2}P(B(2)=0),

C_3(P)=\frac{1-P(B(1)=0)-P(B(2)=0)-P(B(1)=B(2)=0)}{(1-P(B(1)=0)-P(B(2)=0)-P(B(1)=0)P(B(2)=0))^2}P(B(1)=0),

C_4(P)=-\frac{1}{1-P(B(1)=0)-P(B(2)=0)-P(B(1)=0)P(B(2)=0)}.

Similarly, we can derive the efficient influence curve for the conditional independence constraint $\Phi_{CI}$ and the parameter $\Psi_{CI}$ under this constraint as

D^*_{\Phi_{CI}}(P)=\frac{1}{C_5(P)}(C_6(P)-C_7(P)-C_8(P)), (15)

where

C_5(P)=-\sum_{b_1\neq 0,b_2\neq 0}I(b_1(1)=b_2(2)=0)P(b_1)P(b_2),

C_6(P)=P(B_m=0,B_j=1,0,\ldots,0)\,P(B_m=0,B_j=0,0,\ldots,0)\,[\mathbb{I}(B_m=1,B_j=1,0,\ldots,0)-P(B_m=1,B_j=1,0,\ldots,0)],

C_7(P)=P(B_m=1,B_j=0,0,\ldots,0)\,P(B_m=1,B_j=1,0,\ldots,0)\,[\mathbb{I}(B_m=0,B_j=1,0,\ldots,0)-P(B_m=0,B_j=1,0,\ldots,0)],

C_8(P)=P(B_m=0,B_j=1,0,\ldots,0)\,P(B_m=1,B_j=1,0,\ldots,0)\,[\mathbb{I}(B_m=1,B_j=0,0,\ldots,0)-P(B_m=1,B_j=0,0,\ldots,0)].

The details of deriving the influence curve $D^*_{\Phi_{CI}}(P)$ can be found in appendix section 9.4.

Finally, consider a non-saturated model $\mathcal{M}$ implied by a multidimensional constraint function $\Phi$. In this case, the above $D^*_\Phi(P)$ is still a gradient of the pathwise derivative, where division by a vector $x$ is now defined componentwise as $1/x=(1/x_j:j)$. However, it is now not equal to the canonical gradient. We can determine the tangent space of the model $\mathcal{M}$ and project $D^*_\Phi(P)$ onto the tangent space at $P$, which then yields the actual efficient influence curve. We demonstrate this for the example defined by the multidimensional constraint $\Phi_{III}$ using the general result in appendix 9.3.

3 Efficient estimator of target parameter, and statistical inference

Estimation of $\Psi_f(P_0)$ based on the statistical model $\mathcal{M}_f$ and data $B_1,\ldots,B_n\sim_{iid}P_0$ is trivial, since $\Psi_{1f}(P_0)=P_0f$ is just the mean of $f(B)$. In other words, we estimate $\Psi_{1f}(P_0)$ with the NPMLE

\Psi_{1f}(P_n)=P_nf=\frac{1}{n}\sum_{i=1}^n f(B_i),

where $P_n$ is the empirical distribution of $B_1,\ldots,B_n$, and, similarly, we estimate $\Psi_f(P_0)$ with its NPMLE

\Psi_f(P_n)=\frac{f(0)}{f(0)-\Psi_{1f}(P_n)}=\frac{f(0)}{f(0)-P_nf}.

This estimator is asymptotically linear at $P_0$ with influence curve $D^*_f(P_0)$ under no further conditions. As a consequence, a valid asymptotic 95% confidence interval is given by:

\Psi_f(P_n)\pm q(0.975)\,\sigma_n/\sqrt{n},

where

\sigma^2_n\equiv\frac{1}{n}\sum_{i=1}^n\{D^*_f(P_n)(B_i)\}^2,

and $q(0.975)$ is the 0.975-quantile of the standard normal distribution. If the general constraint $\Phi:\mathcal{M}\rightarrow\mathbb{R}$ is differentiable so that $\Phi(P_n)$ is an asymptotically linear estimator of $\Phi(P_0)$ [19], then we can estimate $\Psi_\Phi(P_0)$ with the NPMLE $\Psi_\Phi(P_n)$. Again, under no further conditions, $\Psi_\Phi(P_n)$ is asymptotically linear with influence curve $D^*_\Phi(P_0)$, and an asymptotically valid confidence interval is obtained as above. The estimator $\Psi_f(P_n)$ is always well behaved, even when $K$ is relatively large relative to $n$. For a general $\Phi$, this will very much depend on the precise dependence of $\Phi(P)$ on $P$. In general, if $\Phi(P_n)$ starts suffering from the dimensionality of the model (i.e., empty cells make the estimator erratic), then one should expect that $\Psi_\Phi(P_n)$ will suffer accordingly, even though it will still be asymptotically efficient. In these latter cases, we propose to use targeted maximum likelihood estimation, which targets data-adaptive machine learning fits towards an optimal bias-variance trade-off for the target parameter.
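To illustrate how simple this estimator and its Wald-type interval are, here is a sketch in Python/numpy (the function name and input format are ours, not from the paper):

```python
import numpy as np
from scipy.stats import norm

def psi_f_with_ci(B, f):
    """NPMLE of Psi_f, eq. (6)/(11), with a Wald 95% CI from the EIC, eq. (12).

    B: (n, K) array of 0/1 observed (nonzero) capture histories.
    f: callable mapping a 0/1 pattern to f(b), with f(0) != 0.
    """
    n, K = B.shape
    fB = np.array([f(b) for b in B])            # f(B_i) for each observation
    f0 = f(np.zeros(K, dtype=int))
    pnf = fB.mean()                             # Psi_{1f}(P_n) = P_n f
    psi = f0 / (f0 - pnf)                       # Psi_f(P_n)
    eic = f0 / (f0 - pnf) ** 2 * (fB - pnf)     # D*_f(P_n)(B_i), eq. (12)
    se = np.sqrt(np.mean(eic ** 2) / n)
    q = norm.ppf(0.975)
    return psi, (psi - q * se, psi + q * se), n / psi   # last entry: N_n = n / psi
```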

Let's consider the $\Phi_I$-example, which assumes that the $K$-way interaction on the log-scale equals zero. In this case, the target parameter $\Psi_I(P)$ is defined by (7), which shows that the NPMLE $\Psi_I(P_n)$ is not defined when $P_n(b)=0$ for some $b\in\{0,1\}^K$. So in this case, the MLE suffers immensely from the curse of dimensionality and cannot be used when the number $K$ of samples is such that the sample size $n$ is of the same order as $2^K$. In this context, we discuss estimation based on TMLE in the next section.

$\Psi_{II}(P_0)$ can be estimated with the plug-in empirical estimator, which is the NPMLE. This estimator only relies on positive probability of the cells $(0,1),(1,0)$ and $(1,1)$ under the bivariate distribution of $B(1),B(2)$ under $P$. We also note that this efficient estimator of $\psi_{II,0}$ only uses the data on the first two samples. Thus, the best estimator based upon this particular constraint $\Phi_{II}$ ignores the data on all the other samples. More assumptions will be needed, such as the model that assumes that all components of $B^*$ are independent, in order to create a statistical model that is able to also incorporate the data with the other patterns.

Constrained models based on multivariate $\Phi$: For these models, if the NPMLE behaves well for the larger model in which just one of the $\Phi$ constraints is used, then it will behave well under more constraints. If, on the other hand, this NPMLE suffers from the curse of dimensionality, it might be the case that the MLE behaves better due to the additional constraints, but generally speaking that is a lot to hope for (except if $\Phi$ is really high dimensional relative to $2^K$). To construct an asymptotically efficient estimator one can thus use the MLE $\Psi_\Phi(\tilde{P}_n)$, where now $\tilde{P}_n$ is the MLE over the actual model $\mathcal{M}_\Phi$. In the special case that this MLE $\Psi_\Phi(\tilde{P}_n)$ suffers from the curse of dimensionality, we recommend the TMLE below instead, which will be worked out for the example $\Phi_I$.

4 Targeted maximum likelihood estimation when the (NP)MLE of target parameter suffers from curse of dimensionality

Consider the statistical target parameter $\Psi_I:\mathcal{M}_{\Phi_I}\rightarrow\mathbb{R}$ defined by (7), whose efficient influence curve is given by (13). Let $P_n^0$ be an initial estimator of the true distribution $P_0$ of $B$ in the nonparametric statistical model $\mathcal{M}_{\Phi_I}$ that consists of all possible distributions of $B=(B(1),\ldots,B(K))$. An ideal initial estimator of $P_0$ would be consistent and remain well defined when empty cells exist. One could use, for example, the undersmoothed lasso estimator constructed in algorithm 1 as the initial estimator [20]. In algorithm 1, we establish the empirical criterion by which the level of undersmoothing may be chosen to appropriately satisfy the conditions required of an efficient plug-in estimator. In particular, we require that the minimum of the empirical mean of the selected basis functions is smaller than a constant times $n^{-1/2}$, which is not parameter specific. This condition essentially enforces the selected $L_1$-norm in the lasso to be large enough so that the fit includes sparsely supported basis functions [20]. When the regularization hyperparameter $\lambda$ equals zero, the undersmoothed lasso estimator coincides with the NPMLE plug-in estimator; when $\lambda$ is nonzero, the undersmoothed lasso estimator smooths over the predicted probabilities and avoids predicting probabilities of exactly zero when empty cells exist.

The log-linear model can be expressed as:
\log E_{B^*}(b)=a_0+\sum_{b'\neq 0}a_{b'}\prod_{j:b_j'=1}b_j.
In this case, $b=(b_1,b_2,\ldots,b_K)$, and $b_j=1$ if the subject is captured by sample $j$, $j=1,2,\ldots,K$. $E_{B^*}(b)$ is the count of observations in cell $b$. The $K$-way interaction term is $a_{1,1,\ldots,1}$ ($K$ ones), and the identification assumption is that $a_{1,1,\ldots,1}=0$.
Fit a lasso regression $M_{lasso}$ with all but the highest-way interaction term as the independent variables, the count $E_{B^*}(b)$ as the dependent variable, and the model family specified as Poisson. The regularization term $\lambda$ is chosen such that the absolute value of the empirical mean of the efficient influence function satisfies $\lvert P_nD^*_{\Phi_I}(P_n)\rvert\leq T^*=\frac{\sigma_n}{\sqrt{n}}$, where $\sigma_n=\sqrt{\frac{1}{n}\sum_{i=1}^n\{D^*_{I,cv}(P_n)(B_i)\}^2}$. The probabilities $P_n(B_i)$, $i=1,\ldots,n$, used in computing $\sigma_n$ are estimated by the lasso regression with regularization term chosen by cross-validation. This algorithm naturally avoids fits with $P_n(B_i)=0$ for any $B_i$, since such a fit would have a log-likelihood of minus infinity.
while $\lvert P_nD^*_{\Phi_I}(P_n)\rvert\geq T^*$ do
       1. decrease $\lambda$;
       2. predict the count $\hat{E}_{B^*}(b)$ using $M_{lasso}$ and the new $\lambda$;
       3. calculate the predicted probability $\hat{P}_B(b)$ by dividing the count $\hat{E}_{B^*}(b)$ by the sum of all counts;
       4. calculate $\hat{\Psi}_I(P)=\frac{1}{1+\exp\left((-1)^{K+1}\sum_{b\neq 0}f_I(b)\log\hat{P}_B(b)\right)}$ (equation 7);
       5. calculate $D^*_{\Phi_I}(P_n)(B_i)=(-1)^K\hat{\Psi}_I(P)(1-\hat{\Psi}_I(P))\left[\frac{f_I(B_i)}{\hat{P}_B(B_i)}+f_I(0)\right]$ (equation 13);
       6. update $P_nD^*_{\Phi_I}(P_n)=\frac{1}{n}\sum_{i=1}^nD^*_{\Phi_I}(P_n)(B_i)$.
end while
Result: Predicted probability P^B(b)\hat{P}_{B}(b) for each cell
Algorithm 1 Undersmoothed Lasso
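The following sketch illustrates Algorithm 1 in Python, assuming statsmodels' elastic-net penalized Poisson GLM as the lasso solver. The design-matrix construction, the $\lambda$ grid, the helper names, and the shortcut of recomputing $\sigma_n$ at the current fit (rather than at a separate cross-validated fit, as in the algorithm) are our simplifications; steps 4-5 follow equations (7) and (13). This is a sketch under those assumptions, not the authors' implementation:

```python
import numpy as np
import statsmodels.api as sm
from itertools import combinations, product

def cells_and_design(K):
    """Rows: the 2^K - 1 nonzero capture patterns. Columns: intercept,
    main terms, and all interactions except the K-way term."""
    cells = np.array([b for b in product([0, 1], repeat=K) if any(b)])
    cols = [np.ones(len(cells))]
    for order in range(1, K):                       # interaction orders 1..K-1
        for js in combinations(range(K), order):
            cols.append(cells[:, list(js)].prod(axis=1))
    return cells, np.column_stack(cols)

def undersmoothed_lasso(counts, K, lambdas):
    """Sketch of Algorithm 1: relax the penalty until |P_n D*| <= sigma_n / sqrt(n).

    counts: observed count per nonzero cell, aligned with cells_and_design(K).
    """
    cells, X = cells_and_design(K)
    n = counts.sum()
    fI = (-1.0) ** (K + cells.sum(axis=1))          # f_I(b) over nonzero cells
    fI0 = (-1.0) ** K                               # f_I(0)
    for lam in sorted(lambdas, reverse=True):       # step 1: decrease lambda
        fit = sm.GLM(counts, X, family=sm.families.Poisson()).fit_regularized(
            method="elastic_net", alpha=lam, L1_wt=1.0)
        p = np.exp(X @ fit.params)                  # steps 2-3: cell probabilities
        p = p / p.sum()
        psi = 1.0 / (1.0 + np.exp((-1) ** (K + 1) * np.sum(fI * np.log(p))))  # eq. (7)
        D = (-1) ** K * psi * (1 - psi) * (fI / p + fI0)                      # eq. (13)
        pnD = np.sum(counts / n * D)                # step 6: empirical mean of EIF
        sigma_n = np.sqrt(np.sum(counts / n * D ** 2))
        if abs(pnD) <= sigma_n / np.sqrt(n):        # undersmoothing criterion T*
            break
    return p, psi
```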

We denote $P_n^0\equiv\hat{P}_B(b)$ as our initial estimator. The TMLE $P_n^*$ will update this initial estimator $P_n^0$ in such a way that $P_nD^*_{\Phi_I}(P_n^*)=0$, allowing a rigorous analysis of the TMLE $\Psi_I(P_n^*)$ of $\Psi_I(P_0)$, as presented in algorithm 2.

Note that indeed this submodel (16) $\{P_\epsilon:\epsilon\}\subset\mathcal{M}_{\Phi_I}$ through $P$ is a density for all $\epsilon$ and that its score is given by:

\left.\frac{d}{d\epsilon}\log P_\epsilon\right|_{\epsilon=0}=D^*_{\Phi_I}(P).

Recall that $\sum_{b\neq 0}f_I(b)=-f_I(0)$, so that indeed the expectation of this score equals zero: $E_PD^*_{\Phi_I}(P)(B)=0$. This proves that if we select as loss function for $P$ the log-likelihood loss $L(P)=-\log P$, then the score $\left.\frac{d}{d\epsilon}L(P_\epsilon)\right|_{\epsilon=0}$ spans the efficient influence function $D^*_{\Phi_I}(P)$. This property is needed to establish the asymptotic efficiency of the TMLE, with details in appendix section 9.6.

In accordance with general TMLE procedures [14], the TMLE updating process in algorithm 2 is iterated until convergence, at which point $\epsilon_n^m\approx 0$, or $P_nD^*_{\Phi_I}(P_n^m)=o_P(1/\sqrt{n})$. Let $P_n^*=\lim_m P_n^m$ be the probability distribution in the limit. The TMLE of $\Psi_I(P_0)$ is now defined as $\Psi_I(P_n^*)$. Due to the fact that the MLE $\epsilon_n^m$ solves its score equation and $\epsilon_n^m\approx 0$, it follows that this TMLE satisfies $P_nD^*_{\Phi_I}(P_n^*)=0$ (or $o_P(1/\sqrt{n})$).

Input: Vector of predicted probabilities $\hat{P}_B(b)$; observed population size $n$;
Procedure:
1. Denote $P_n^0=\hat{P}_B(b)$; calculate $\Psi_n(P_n^0)$ using equation 7, $D^*_{\Phi_I}(P_n^0)$ using equation 13, the empirical mean of $D^*_{\Phi_I}(P_n^0)$ as $P_nD^*_{\Phi_I}(P_n^0)=\frac{1}{n}\sum_{i=1}^nD^*_{\Phi_I}(P_n^0)(B_i)$, and the stopping criterion $s=\frac{\sqrt{\frac{1}{n}\sum_{i=1}^nD^*_{\Phi_I}(P_n^0)(B_i)^2}}{\max(\log(n),C)\sqrt{n}}$, where $C$ is a positive constant.
2. Let $m\in\mathbb{Z}$ be the number of iterations, and let $P_n^m$ be the updated probability in iteration $m$. Initialize $m=0$. Let $\delta$ be a positive constant close to zero.
while $\lvert P_nD^*_{\Phi_I}(P_n^m)\rvert>s$ and $\lvert\epsilon\rvert>\delta$ do
       2.1. calculate the bounds for $\epsilon$ such that the updated probabilities lie in $[0,1]$:
l_\epsilon=\max_i\left[\min\left(-\frac{1}{D^*_{\Phi_I}(P_n^m)(B_i)},\frac{1-P_n^m(B_i)}{P_n^m(B_i)D^*_{\Phi_I}(P_n^m)(B_i)}\right)\right]
u_\epsilon=\min_i\left[\max\left(-\frac{1}{D^*_{\Phi_I}(P_n^m)(B_i)},\frac{1-P_n^m(B_i)}{P_n^m(B_i)D^*_{\Phi_I}(P_n^m)(B_i)}\right)\right]
       2.2. construct a least favorable parametric model $\{P_{n,\epsilon}^m:\epsilon\}$ through $P_n^m$ defined as follows:
P_{n,\epsilon}^m=C(P_n^m,\epsilon)(1+\epsilon D^*_{\Phi_I}(P_n^m))P_n^m, (16)
where
C(P,\epsilon)=\frac{1}{\sum_{b\neq 0}(1+\epsilon D^*_{\Phi_I}(P)(b))P(b)};
       2.3. calculate
\epsilon_n^m=\arg\max_{\epsilon\in[l_\epsilon,u_\epsilon]}\frac{1}{n}\sum_{i=1}^n\log P_{n,\epsilon}^m(B_i);
       2.4. $m\leftarrow m+1$, and update $P_n^m\leftarrow P_{n,\epsilon_n^m}^m$.
end while
3. Denote $P_n^*=P_n^m$ from the final iteration $m$. Calculate the TMLE $\Psi_n(P_n^*)$ using equation 7, and its efficient influence function $D^*_{\Phi_I}(P_n^*)$ using equation 13.
Result: TMLE $\Psi_n(P_n^*)$ and efficient influence function $D^*_{\Phi_I}(P_n^*)$.
Algorithm 2 Targeted maximum likelihood estimation (TMLE) update procedure

The proof of the asymptotic efficiency of TMLE is provided in appendix section 9.6.
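To make the update procedure concrete, here is a compact numpy sketch of Algorithm 2 working over the $2^K-1$ observable cells. It is a simplified illustration under our own conventions (cell-level vectors, a bounded one-dimensional MLE for $\epsilon$ via scipy, and $C=10$ in the stopping criterion), not the authors' implementation:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def psi_I(p, fI, K):
    """Eq. (7) evaluated at cell probabilities p over the 2^K - 1 nonzero cells."""
    return 1.0 / (1.0 + np.exp((-1) ** (K + 1) * np.sum(fI * np.log(p))))

def tmle_update(p0, counts, fI, K, C=10, max_iter=100):
    """Sketch of Algorithm 2: iterate the least favorable submodel (16)."""
    n, fI0 = counts.sum(), (-1.0) ** K
    p = p0.copy()
    for _ in range(max_iter):
        psi = psi_I(p, fI, K)
        D = (-1) ** K * psi * (1 - psi) * (fI / p + fI0)   # EIF (13), per cell
        pnD = np.sum(counts / n * D)                       # P_n D*(P_n^m)
        s = np.sqrt(np.sum(counts / n * D ** 2)) / (max(np.log(n), C) * np.sqrt(n))
        if abs(pnD) <= s:
            break                                          # stopping criterion met
        # bounds on eps keeping 1 + eps * D(b) > 0 for every cell
        pos, neg = D[D > 0], D[D < 0]
        lo = -0.99 / pos.max() if pos.size else -1.0
        hi = -0.99 / neg.min() if neg.size else 1.0
        def neg_loglik(eps):
            pe = (1.0 + eps * D) * p
            pe = pe / pe.sum()                             # normalization C(P, eps)
            return -np.sum(counts * np.log(pe))
        eps_n = minimize_scalar(neg_loglik, bounds=(lo, hi), method="bounded").x
        pe = (1.0 + eps_n * D) * p
        p = pe / pe.sum()                                  # P_n^{m+1}
    return p, psi_I(p, fI, K)
```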

5 Simulations

In this section, we show the performance of all the estimators given that their identification assumptions (linear and non-linear constraints) hold true. The estimands $\Psi_f(P)$ and $\Psi_\Phi(P)$ refer to the probability of an individual being captured at least once under the linear constraint $f$ or the non-linear constraint $\Phi$. In subsection 5.1, the linear constraint is defined in equation 2, with corresponding estimand (equation 11)

\Psi_f(P)=\frac{f(0)}{f(0)-Pf},

and its plug-in estimator

\Psi_f(P_n)=\frac{f(0)}{f(0)-\sum_{b\neq 0}f(b)P_n(b)},

where $P_n(b)$ is the empirical probability of cell $b$. In subsection 5.2, we provide a summary of all the non-linear constraints defined by functions $\Phi$ and their corresponding estimators. In subsection 5.3, the non-linear constraint $\Phi_I$ is defined by the assumption in section 2.2, with estimand (equation 7)

\Psi_I(P)=\frac{1}{1+\exp\left((-1)^{K+1}\sum_{b\neq 0}f_I(b)\log P(b)\right)},

and plug-in estimator

\Psi_I(P_{NP})=\frac{1}{1+\exp\left((-1)^{K+1}\sum_{b\neq 0}f_I(b)\log P_n(b)\right)}.

In addition, the undersmoothed lasso estimator $\Psi_I(P_{lasso})$ replaces the plugged-in $P_n(b)$ in $\Psi_I(P_{NP})$ with the probability estimated by a lasso regression with regularization parameter chosen by the undersmoothing algorithm, and the estimator $\Psi_I(P_{lasso\_cv})$ replaces the plugged-in $P_n(b)$ with the probability estimated by a lasso regression with regularization parameter optimized by cross-validation. The estimator $\Psi_I(P_{tmle})$ updates the estimated probabilities of $\Psi_I(P_{lasso})$ using targeted learning techniques, and the estimator $\Psi_I(P_{tmle\_cv})$ updates the estimated probabilities of $\Psi_I(P_{lasso\_cv})$ using targeted learning techniques. In addition, we compare the performance of our proposed estimators to the existing estimators $\Psi_I(P_{M_0})$ and $\Psi_I(P_{M_t})$.

In subsection 5.4, the non-linear constraint $\Phi_{II}$ is defined by the independence assumption in equation 4. The estimand is (equation 8)

\Psi_{II}(P)\equiv\frac{1-P(B(1)=0)-P(B(2)=0)+P(B(1:2)=(0,0))}{1-P(B(1)=0)-P(B(2)=0)+P(B(1)=0)P(B(2)=0)},

and its plug-in estimator

\Psi_{II}(P_n)\equiv\frac{1-P_n(B(1)=0)-P_n(B(2)=0)+P_n(B(1:2)=(0,0))}{1-P_n(B(1)=0)-P_n(B(2)=0)+P_n(B(1)=0)P_n(B(2)=0)}.

In subsection 5.5, the non-linear constraint $\Phi_{CI}$ is defined by the conditional independence assumption in equation 5. The estimand is (equation 9, with $C_0(P)$ defined in equation 10)

\Psi_{CI}(P)\equiv\frac{P(B_m=1,B_j=1,0,\ldots,0)}{C_0(P)},

and its plug-in estimator

\Psi_{CI}(P_n)\equiv\frac{P_n(B_m=1,B_j=1,0,\ldots,0)}{C_0(P_n)}.

5.1 Linear identification assumption

The linear identification constraint is $E_{P_{B^*}}f(B^*)=0$ for any function $f$ such that $f(b=0)\neq 0$. An example of interest is $f(b)=(-1)^{K+\sum_{k=1}^Kb_k}$, where $K$ is the total number of samples. In this case, the linear identification assumption is equation 2. For example, if $K=3$, then the saturated linear model states:

P_{B^*}(b_1,b_2,b_3)=\alpha_0+\alpha_1b_1+\alpha_2b_2+\alpha_3b_3+\alpha_4b_1b_2+\alpha_5b_1b_3+\alpha_6b_2b_3+\alpha_7b_1b_2b_3,

and the constraint $E_{P_{B^*}}f(B^*)=0$ states that $\alpha_7=0$. The identified estimator is

\Psi_f(P_n)=\frac{f(0)}{f(0)-\sum_{b\neq 0}f(b)P_n(b)},

where $P_n(b)$ is the observed probability of cell $b$, and the efficient influence curve (equation 12) is

D^*_f=\frac{f(0)}{(f(0)-\sum_{b\neq 0}f(b)P(b))^2}\left[f(B)-\sum_{b\neq 0}f(b)P(b)\right].

We use the parameters below to generate the underlying distribution. The assumption that the highest-way interaction term equals zero ($\alpha_7=0$) is satisfied in this setting.

$\alpha_0$ $\alpha_1$ $\alpha_2$ $\alpha_3$ $\alpha_4$ $\alpha_5$ $\alpha_6$ $\alpha_7$
0.0725 0.03 0.01 0.04 0.01 0.02 0.02 0

The simulated true probabilities for all 7 observed cells are:

P(0,0,1) P(0,1,0) P(0,1,1) P(1,0,0) P(1,0,1) P(1,1,0) P(1,1,1)
0.1213 0.0889 0.1429 0.1105 0.1752 0.1428 0.2183

We can calculate the true value of $\Psi_f(P)$ analytically as $\Psi_f(P_0)=0.9275$, and we draw $10^4$ samples to obtain the asymptotic estimate $\Psi_{f,asym}(P_n)=0.9316$ with asymptotic $\hat{\sigma}^2=0.7492$.
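The reported true value can be checked directly from the probabilities in the table above (a quick numpy verification; the probabilities are copied from the table):

```python
import numpy as np

# cells ordered (0,0,1), (0,1,0), (0,1,1), (1,0,0), (1,0,1), (1,1,0), (1,1,1)
p = np.array([0.1213, 0.0889, 0.1429, 0.1105, 0.1752, 0.1428, 0.2183])
ones = np.array([1, 1, 2, 1, 2, 2, 3])     # sum_k b_k for each cell
f = (-1.0) ** (3 + ones)                   # f_I(b) with K = 3
f0 = (-1.0) ** 3                           # f_I(0) = -1
print(f0 / (f0 - np.sum(f * p)))           # eq. (6): about 0.9276 ~ 0.9275
```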

Figure 1 shows that our estimators perform well when the identification assumption holds true.

Figure 1: Simulation results when the linear identification assumption holds true. Upper left: ratio of estimated variance to true variance vs. sample size; the line is the ratio at each sample size, and the grey horizontal line is the true value. Upper right: coverage of the 95% asymptotic confidence interval, based on the normal approximation with $\hat{\sigma}^2$ estimated from the efficient influence curve, vs. sample size; the line is the average coverage. Lower left: mean of the estimated $\psi$ and its 95% asymptotic confidence interval vs. sample size; the line is the mean, the shaded area is the confidence interval, and the grey horizontal line is the true value. Lower right: distribution of 1000 estimates of $\psi$ for a sample size of 1000; the solid vertical line is the true value, and the dashed line is the mean value.

5.2 Non-linear identification assumption

We perform simulations for three types of non-linear identification assumptions: 1) the K-way interaction term equals zero in a log-linear model, 2) the independence assumption, and 3) the conditional independence assumption. For assumption 1), we provide simulations for the following estimators:

  1. Existing estimator defined by model $M_0$, which assumes that in the log-linear model all the main terms have the same value and there are no interaction terms [21]: $\Psi_I(P_{M_0})$

  2. Existing estimator defined by model $M_t$, which assumes that the log-linear model does not contain interaction terms [4]: $\Psi_I(P_{M_t})$

  3. Non-parametric plug-in estimator: $\Psi_I(P_{NP})$

  4. Estimator based on undersmoothed lasso regression: $\Psi_I(P_{lasso})$

  5. Estimator based on lasso regression with cross-validation: $\Psi_I(P_{lasso\_cv})$

  6. TMLE based on $\Psi_I(P_{lasso})$: $\Psi_I(P_{tmle})$

  7. TMLE based on $\Psi_I(P_{lasso\_cv})$: $\Psi_I(P_{tmle\_cv})$

Models $M_0$ and $M_t$ were computed using the R package "RCapture" (version 1.4-3) [22].

5.3 Non-linear identification assumption: k-way interaction term equals zero

Given that we have three samples $B_1,B_2$ and $B_3$, the identification assumption is that in the log-linear model

\log(E_{B^*}(b))=\alpha_0+\alpha_1b_1+\alpha_2b_2+\alpha_3b_3+\alpha_4b_1b_2+\alpha_5b_1b_3+\alpha_6b_2b_3+\alpha_7b_1b_2b_3,

we have $\alpha_7=0$ (section 2.2). Here $E_{B^*}(b)$ is the count of observations in cell $b=(b_1,b_2,b_3)$, and $b_i=1$ if the subject is captured by sample $i$, $i=1,2,3$. The parameter $\Psi_I(P)$ is identified in equation 7, and its influence curve $D^*_{\Phi_I}(P)$ is identified in equation 13.

In this section, we evaluate both our proposed estimators and two existing parametric estimators. The first estimator is the $M_t$ model [23], where $P_{ij}$ denotes the probability that subject $i$ is captured by sample $j$, modeled as $\log(P_{ij})=\mu_j$ for subject $i$ and sample $j$. This estimator assumes that all the interaction terms in the log-linear model are zero, and only uses the main-term variables in model training. The second estimator is $M_0$ [21], with the formula $\log(P_{ij})=\alpha$, where $\alpha$ is a constant.

5.3.1 Log-linear model with main term effects

We used the parameters below to generate the underlying distribution. The assumption that the $K$-way interaction term is zero ($\alpha_7=0$) is satisfied in this setting. The model assumptions for $M_0$ and $M_t$ are also satisfied in this setting.

$\alpha_0$ $\alpha_1$ $\alpha_2$ $\alpha_3$ $\alpha_4$ $\alpha_5$ $\alpha_6$ $\alpha_7$
-0.9398 -1 -1 -1 0 0 0 0

The simulated observed probabilities for all 7 observed cells are:

P(0,0,1) P(0,1,0) P(0,1,1) P(1,0,0) P(1,0,1) P(1,1,0) P(1,1,1)
0.2359 0.2359 0.0868 0.2359 0.0868 0.0868 0.0319

Under this setting, the true value of $\Psi_I(P)$ can be calculated analytically as $\Psi_I(P_0)=0.6093$. Figure 2 and table 1 show the performance of the estimators $\Psi_I(P_{M_0})$, $\Psi_I(P_{M_t})$, $\Psi_I(P_{NP})$, $\Psi_I(P_{lasso})$, $\Psi_I(P_{lasso\_cv})$, $\Psi_I(P_{tmle})$, and $\Psi_I(P_{tmle\_cv})$. Table 2 shows the model fit statistics for $\Psi_I(P_{M_0})$ and $\Psi_I(P_{M_t})$.

Figure 2: Left: distribution of 1000 estimates of \psi for sample size 1000 with estimators \Psi_{I}(P_{lasso\_cv}) and \Psi_{I}(P_{tmle\_cv}); Right: distribution of 1000 estimates of \psi for sample size 1000 with estimators \Psi_{I}(P_{lasso}) and \Psi_{I}(P_{tmle}). In each panel, the black vertical line is the true value and the dashed vertical lines represent the mean values for each estimator.
Estimator Average ψ\psi Lower 95% CI Upper 95% CI Coverage(%)
ΨI(PNP)\Psi_{I}(P_{NP}) 0.6078 0.4791 0.7366 95
ΨI(Plasso)\Psi_{I}(P_{lasso}) 0.6473 0.5237 0.7708 84.2
ΨI(Ptmle)\Psi_{I}(P_{tmle}) 0.6078 0.4806 0.735 93.7
ΨI(Plasso_cv)\Psi_{I}(P_{lasso\_cv}) 0.8534 0.7984 0.9083 4.4
ΨI(Ptmle_cv)\Psi_{I}(P_{tmle\_cv}) 0.4044 0.2988 0.5101 13.6
ΨI(PM0)\Psi_{I}(P_{M_{0}}) 0.6093 0.5659 0.6521 97.8
ΨI(PMt)\Psi_{I}(P_{M_{t}}) 0.6096 0.5662 0.6525 97.8
Table 1: Estimated ψ^\hat{\psi} and 95% asymptotic confidence interval based on normal approximation with σ^2\hat{\sigma}^{2} estimated from efficient influence curve and its coverage by each estimator. ΨI(PNP)\Psi_{I}(P_{NP}) is the plug-in maximum likelihood estimator, ΨI(Plasso)\Psi_{I}(P_{lasso}) uses probabilities estimated from undersmoothed lasso regression, ΨI(Ptmle)\Psi_{I}(P_{tmle}) is the TMLE based on ΨI(Plasso)\Psi_{I}(P_{lasso}), ΨI(Plasso_cv)\Psi_{I}(P_{lasso\_cv}) uses probabilities estimated from lasso regression with regularization term optimized by cross-validation, ΨI(Ptmle_cv)\Psi_{I}(P_{tmle\_cv}) is the TMLE based on ΨI(Plasso_cv)\Psi_{I}(P_{lasso\_cv}), ΨI(PM0)\Psi_{I}(P_{M_{0}}) and ΨI(PMt)\Psi_{I}(P_{M_{t}}) are existing estimators defined in section 5.2. True ψ0=0.6093\psi_{0}=0.6093.
Estimator Df AIC BIC
ΨI(PM0)\Psi_{I}(P_{M_{0}}) 5 56.532 66.348
ΨI(PMt)\Psi_{I}(P_{M_{t}}) 3 58.034 77.665
Table 2: Model fit statistics for the \Psi_{I}(P_{M_{0}}) and \Psi_{I}(P_{M_{t}}) estimators.

Figure 2 and table 1 compare the performance of two base learners: lasso regression with the regularization term chosen by cross-validation (\Psi_{I}(P_{lasso\_cv})) and undersmoothed lasso regression (\Psi_{I}(P_{lasso})). The \Psi_{I}(P_{lasso\_cv}) estimator is substantially biased away from the true value of \psi, and the TMLE based on it, \Psi_{I}(P_{tmle\_cv}), remains biased. Although the \Psi_{I}(P_{lasso}) estimator is also biased, the TMLE based on it, \Psi_{I}(P_{tmle}) (mean = 0.6078, coverage = 93.7%), corrects the bias and gives a fit as good as the NPMLE estimator (mean = 0.6078, coverage = 95%). Thus, in the sections below we use only the undersmoothed lasso estimator \Psi_{I}(P_{lasso}) as the base learner.
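Schematically, undersmoothing moves down the lasso path from heavier to lighter penalization and stops at the first fit that solves the efficient-influence-curve score equation closely enough. The sketch below reuses eic_I from the section 5.3 sketch; the cutoff \sigma_{n}/(\sqrt{n}\log n) is one criterion in the spirit of [20], and the whole function illustrates the selection rule rather than the paper's exact implementation.

```python
import numpy as np

def undersmooth_select(p_hats, counts):
    """Select an undersmoothed fit from a lasso path (schematic).

    p_hats: candidate cell-probability vectors, ordered from most to least
    penalized (e.g., a decreasing-lambda lasso path); counts: observed cell
    counts. Returns the most-penalized fit that already solves the EIC score
    equation to within sigma_n / (sqrt(n) log n), else the least-penalized fit.
    """
    n = counts.sum()
    emp = counts / n
    for p in p_hats:
        d = eic_I(p)
        score = float(np.sum(emp * d))                        # P_n D*(p)
        sigma = float(np.sqrt(np.sum(emp * d ** 2) - score ** 2))
        if abs(score) <= sigma / (np.sqrt(n) * np.log(n)):
            return p
    return p_hats[-1]
```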

5.3.2 Log-linear model with main term and interaction term effects

We used the parameters below to generate the underlying distribution. The assumption that the K-way interaction term \alpha_{7}=0 is satisfied in this setting.

α0\alpha_{0} α1\alpha_{1} α2\alpha_{2} α3\alpha_{3} α4\alpha_{4} α5\alpha_{5} α6\alpha_{6} α7\alpha_{7}
-0.9194 -1 -1 -1 -0.1 -0.1 -0.1 0

The simulated observed probabilities for all 7 observed cells are:

P(0,0,1) P(0,1,0) P(0,1,1) P(1,0,0) P(1,0,1) P(1,1,0) P(1,1,1)
0.2440 0.2440 0.0812 0.2440 0.0812 0.0812 0.0245

Under this setting, the true value of \Psi_{I}(P) can be calculated analytically as \Psi_{I}(P_{0})=0.6013. The model assumptions of \Psi_{I}(P_{M_{0}}) and \Psi_{I}(P_{M_{t}}) are violated, and their estimates are biased (estimated mean \hat{\Psi}_{I}(P_{M_{0}})=\hat{\Psi}_{I}(P_{M_{t}})=0.572; bias = \Psi_{I}(P_{0})-\hat{\Psi}_{I}(P_{M_{t}})=0.6013-0.5720=0.0293). The coverage of their 95% confidence intervals is also low (81.5%). Table 4 shows the model fit statistics for \Psi_{I}(P_{M_{0}}) and \Psi_{I}(P_{M_{t}}). As the \Psi_{I}(P_{NP}) and \Psi_{I}(P_{tmle}) estimators make no parametric model assumptions, they are unbiased, and the coverage of their 95% asymptotic confidence intervals is close to 95% (94.7% for \Psi_{I}(P_{NP}) and 93.7% for \Psi_{I}(P_{tmle})), as shown in figure 3 and table 3.

Figure 3: Distribution of 1000 estimates of \psi for sample size of 1000; the black vertical line is the true value, and the dashed vertical lines represent mean values for each estimator.
Estimator Average ψ\psi Lower 95% CI Upper 95% CI Coverage(%)
ΨI(PNP)\Psi_{I}(P_{NP}) 0.6013 0.4625 0.7401 94.7
ΨI(Plasso)\Psi_{I}(P_{lasso}) 0.6482 0.5159 0.7805 78.4
ΨI(Ptmle)\Psi_{I}(P_{tmle}) 0.6005 0.4653 0.7358 93.7
ΨI(PM0)\Psi_{I}(P_{M_{0}}) 0.572 0.5274 0.6163 81.5
ΨI(PMt)\Psi_{I}(P_{M_{t}}) 0.5722 0.5277 0.6166 81.5
Table 3: Estimated ψ^\hat{\psi} and 95% asymptotic confidence interval based on normal approximation with σ^2\hat{\sigma}^{2} estimated from efficient influence curve and its coverage by each estimator. ΨI(PNP)\Psi_{I}(P_{NP}) is the plug-in maximum likelihood estimator, ΨI(Plasso)\Psi_{I}(P_{lasso}) uses probabilities estimated from undersmoothed lasso regression, ΨI(Ptmle)\Psi_{I}(P_{tmle}) is the TMLE based on ΨI(Plasso)\Psi_{I}(P_{lasso}), ΨI(PM0)\Psi_{I}(P_{M_{0}}) and ΨI(PMt)\Psi_{I}(P_{M_{t}}) are existing estimators defined in section 5.2. True ΨI(P0)=0.6013\Psi_{I}(P_{0})=0.6013.
Estimator Df AIC BIC
ΨI(PM0)\Psi_{I}(P_{M_{0}}) 5 55.499 65.314
ΨI(PMt)\Psi_{I}(P_{M_{t}}) 3 58.440 78.071
Table 4: Model fit statistics for the \Psi_{I}(P_{M_{0}}) and \Psi_{I}(P_{M_{t}}) estimators.

5.3.3 Log-linear model with empty cells

When the probability of some cells is close to zero, empty cells are likely to arise in the observed data in finite samples. For example, if the probability of a subject being caught in all 3 samples, P(1,1,1), is less than 10^{-6} and the total number of unique subjects caught by any of the 3 samples is less than 10^{3}, then it is likely that no subject will be observed in all three samples, i.e., cell (1,1,1) will likely be empty.

We used the parameters below to generate the underlying distribution. The assumption that the K-way interaction term \alpha_{7}=0 is satisfied in this setting.

α0\alpha_{0} α1\alpha_{1} α2\alpha_{2} α3\alpha_{3} α4\alpha_{4} α5\alpha_{5} α6\alpha_{6} α7\alpha_{7}
-0.4578 -1 -2 -3 -1 -1 -1 0

The simulated observed probabilities for all 7 observed cells are:

P(0,0,1) P(0,1,0) P(0,1,1) P(1,0,0) P(1,0,1) P(1,1,0) P(1,1,1)
0.0857 0.2331 0.0043 0.6336 0.0116 0.0315 2e-04
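With these cell probabilities, the empty-cell phenomenon is easy to quantify: since P(1,1,1)=2\times 10^{-4}, the chance that cell (1,1,1) receives no observation in a sample of size 1000 is

```python
p111, n = 2e-4, 1000
print((1 - p111) ** n)   # ~0.82: cell (1,1,1) is empty in roughly 82% of samples
```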

Figure 4 and table 5 show the performance of the estimators when there is no observation in cell (1,1,1).

Figure 4 shows that all the existing estimators are biased when cell (1,1,1) is empty. Table 5 shows that the coverage of the 95% asymptotic confidence interval is highest for \Psi_{I}(P_{tmle}) (58.8%), while the coverage for \Psi_{I}(P_{M_{0}}) and \Psi_{I}(P_{M_{t}}) is 0, due to the estimation bias and the narrow range of the intervals. Table 6 shows the model fit statistics for \Psi_{I}(P_{M_{0}}) and \Psi_{I}(P_{M_{t}}).

Figure 4: Distribution of 1000 estimates of ψ\psi for sample size of 1000, the black vertical line is the true value of ΨI(P0)=0.3674\Psi_{I}(P_{0})=0.3674, and the dashed vertical lines represent mean values for each estimator.
Estimator Average ψ\psi Lower 95% CI Upper 95% CI Coverage(%)
ΨI(Plasso)\Psi_{I}(P_{lasso}) 0.5565 0.1956 0.9174 20
ΨI(Ptmle)\Psi_{I}(P_{tmle}) 0.2308 0.0875 0.374 58.8
ΨI(PM0)\Psi_{I}(P_{M_{0}}) 0.1318 0.0995 0.169 0
ΨI(PMt)\Psi_{I}(P_{M_{t}}) 0.1729 0.1313 0.2201 0
Table 5: Estimated \hat{\psi}, 95% asymptotic confidence interval based on normal approximation with \hat{\sigma}^{2} estimated from the efficient influence curve, and its coverage, for the estimators in figure 4. \Psi_{I}(P_{lasso}) uses probabilities estimated from undersmoothed lasso regression, \Psi_{I}(P_{tmle}) is the TMLE based on \Psi_{I}(P_{lasso}), and \Psi_{I}(P_{M_{0}}) and \Psi_{I}(P_{M_{t}}) are existing estimators defined in section 5.2. True \Psi_{I}(P_{0})=0.3674.
Estimator Df AIC BIC
ΨI(PM0)\Psi_{I}(P_{M_{0}}) 5 501.562 511.378
ΨI(PMt)\Psi_{I}(P_{M_{t}}) 3 45.506 65.137
Table 6: Model fit statistics for the \Psi_{I}(P_{M_{0}}) and \Psi_{I}(P_{M_{t}}) estimators.

5.4 Non-linear identification assumption: independence

Given 3 samples, we assume that the first and second sample B1,B2B_{1},B_{2} are independent of each other, that is, P(B1=1,B2=1)=P(B1=1)×P(B2=1)P^{*}(B_{1}=1,B_{2}=1)=P^{*}(B_{1}=1)\times P^{*}(B_{2}=1). Then we have ΨII(P)\Psi_{II}(P) identified in equation 8, and the influence curve DII(P)D^{*}_{II}(P) identified in equation 14.

Here we illustrate the performance of our estimators with the following underlying distribution:

P(B1=1)=0.1\displaystyle P(B_{1}=1)=0.1
P(B2=1|B1=1)=0.2\displaystyle P(B_{2}=1|B_{1}=1)=0.2
P(B2=1|B1=0)=0.2\displaystyle P(B_{2}=1|B_{1}=0)=0.2
P(B3=1|B1=1)=0.25\displaystyle P(B_{3}=1|B_{1}=1)=0.25
P(B3=1|B1=0)=0.3\displaystyle P(B_{3}=1|B_{1}=0)=0.3

The probability distribution for all 7 observed cells are:

P(0,0,1) P(0,1,0) P(0,1,1) P(1,0,0) P(1,0,1) P(1,1,0) P(1,1,1)
0.4355 0.2540 0.1089 0.1210 0.0403 0.0302 0.0101

We can calculate the true value of \Psi_{II}(P) analytically as \Psi_{II}(P_{0})=0.4960, and draw 10^{6} samples to obtain the asymptotic \Psi_{II}(P_{n})=0.4963, with asymptotic \sigma^{2}=\frac{1}{n}\sum D^{*2}=4.2657.
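The closed form of lemma 1 can be checked against these cell probabilities; the sketch below reuses the PATTERNS ordering from the section 5.3 sketch, and the name psi_II is ours.

```python
p = dict(zip(PATTERNS, [0.4355, 0.2540, 0.1089, 0.1210, 0.0403, 0.0302, 0.0101]))

def psi_II(p):
    """Plug-in of lemma 1 from the observed-cell probabilities."""
    p1 = sum(v for b, v in p.items() if b[0] == 0)           # P(B(1) = 0)
    p2 = sum(v for b, v in p.items() if b[1] == 0)           # P(B(2) = 0)
    p12 = sum(v for b, v in p.items() if b[0] == b[1] == 0)  # P(B(1:2) = (0,0))
    return (1 - p1 - p2 + p12) / (1 - p1 - p2 + p1 * p2)

print(psi_II(p))   # ~0.496, matching Psi_II(P_0) = 0.4960
```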

Figure 5 shows that our estimators perform well when the independence assumption holds true.

Figure 5: Simulation results when the independence identification assumption holds true. Upper left: ratio of estimated variance over true variance vs sample size. The line is the ratio at each sample size, and the grey horizontal line is the true value; Upper right: coverage of 95% asymptotic confidence interval based on normal approximation with σ^2\hat{\sigma}^{2} estimated from efficient influence curve vs sample size. The line is the average coverage; Lower left: mean value of estimated ψ\psi and its 95% asymptotic confidence interval based on normal approximation with σ^2\hat{\sigma}^{2} estimated from efficient influence curve vs sample size. The line is the mean, and shaded area is the confidence interval and the grey horizontal line is the true value; Lower right: Distribution of 1000 estimates of ψ\psi for sample size of 1000, the vertical line is the true value, and the dashed line is the mean value.

5.5 Non-linear identification assumption: conditional independence

Given 3 samples, we assume that, among subjects not captured by the first sample (B_{1}=0), the third sample B_{3} is conditionally independent of the second sample B_{2}; that is, P^{*}(B_{3}=1|B_{2}=b_{2},B_{1}=0)=P^{*}(B_{3}=1|B_{1}=0). Then we can derive that P^{*}(0,0,0)=\frac{P^{*}(0,1,0)P^{*}(0,0,1)}{P^{*}(0,1,1)} and \Psi_{CI}(P)=\frac{P(0,1,1)}{P(0,1,1)+P(0,1,0)P(0,0,1)} from equation 9, and the efficient influence curve can be derived from equation 15 as

DΦCI(P)=\displaystyle D^{*}_{\Phi_{CI}}(P)=
P(0,1,0)P(0,0,1)[P(0,1,1)+P(0,1,0)P(0,0,1)]2[I(0,1,1)P(0,1,1)]\displaystyle\frac{P(0,1,0)P(0,0,1)}{[P(0,1,1)+P(0,1,0)P(0,0,1)]^{2}}[I(0,1,1)-P(0,1,1)]-
P(0,0,1)P(0,1,1)[P(0,1,1)+P(0,1,0)P(0,0,1)]2[I(0,1,0)P(0,1,0)]\displaystyle\frac{P(0,0,1)P(0,1,1)}{[P(0,1,1)+P(0,1,0)P(0,0,1)]^{2}}[I(0,1,0)-P(0,1,0)]-
P(0,1,0)P(0,1,1)[P(0,1,1)+P(0,1,0)P(0,0,1)]2[I(0,0,1)P(0,0,1)]\displaystyle\frac{P(0,1,0)P(0,1,1)}{[P(0,1,1)+P(0,1,0)P(0,0,1)]^{2}}[I(0,0,1)-P(0,0,1)]

Here we illustrate the performance of our estimators with the following underlying distribution:

P(B1=1)=0.1\displaystyle P(B_{1}=1)=0.1
P(B2=1|B1=1)=0.2\displaystyle P(B_{2}=1|B_{1}=1)=0.2
P(B2=1|B1=0)=0.15\displaystyle P(B_{2}=1|B_{1}=0)=0.15
P(B3=1|B1=1)=P(B3=1|B2=1,B1=1)=0.25\displaystyle P(B_{3}=1|B_{1}=1)=P(B_{3}=1|B_{2}=1,B_{1}=1)=0.25
P(B3=1|B1=1)=P(B3=1|B2=0,B1=1)=0.25\displaystyle P(B_{3}=1|B_{1}=1)=P(B_{3}=1|B_{2}=0,B_{1}=1)=0.25
P(B3=1|B1=0)=P(B3=1|B2=1,B1=0)=0.2\displaystyle P(B_{3}=1|B_{1}=0)=P(B_{3}=1|B_{2}=1,B_{1}=0)=0.2
P(B3=1|B1=0)=P(B3=1|B2=0,B1=0)=0.2\displaystyle P(B_{3}=1|B_{1}=0)=P(B_{3}=1|B_{2}=0,B_{1}=0)=0.2

The probability distribution for all 7 observed cells P(B1,B2,B3)P(B_{1},B_{2},B_{3}) are:

P(0,0,1) P(0,1,0) P(0,1,1) P(1,0,0) P(1,0,1) P(1,1,0) P(1,1,1)
0.3943 0.2784 0.0696 0.1546 0.0515 0.0387 0.0129

We can derive the true \Psi_{CI}(P) analytically as \Psi_{CI}(P_{0})=0.388, and draw 10^{6} samples to obtain the asymptotic \hat{\Psi}_{CI}(P_{n})=0.3887, with asymptotic \hat{\sigma}^{2}=1.0958.
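A one-line check of the closed form \Psi_{CI}(P)=P(0,1,1)/(P(0,1,1)+P(0,1,0)P(0,0,1)) against these cell probabilities (our own illustration):

```python
p011, p010, p001 = 0.0696, 0.2784, 0.3943
print(p011 / (p011 + p010 * p001))   # ~0.388, matching Psi_CI(P_0)
```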

Figure 6 shows that our estimator performs well when the conditional independence assumption is met.

Figure 6: Simulation results when the conditional independence identification assumption holds true. Upper left: ratio of estimated variance over true variance vs sample size. The line is the ratio at each sample size, and the grey horizontal line is the true value; Upper right: coverage of 95% asymptotic confidence interval based on normal approximation with σ^2\hat{\sigma}^{2} estimated from efficient influence curve vs sample size. The line is the average coverage; Lower left: mean value of estimated ψ\psi and its 95% asymptotic confidence interval based on normal approximation with σ^2\hat{\sigma}^{2} estimated from efficient influence curve vs sample size. The line is the mean, and shaded area is the confidence interval and the grey horizontal line is the true value; Lower right: Distribution of 1000 estimates of ψ\psi for sample size of 1000, the vertical line is the true value, and the dashed line is the mean value.

6 Evaluating the sensitivity of identification bias to violations of the assumed constraint

In this section, we use simulations to show how sensitive each estimator's identification bias is to violations of its identification assumption.

6.1 Violation of linear assumptions

In this section we analyze the same setting as in section 5.1, but with the identification assumption violated.

We used the parameters below to generate the underlying distribution. The assumption that the K-way interaction term \alpha_{7}=0 is violated in this setting (\alpha_{7}=0.2).

α0\alpha_{0} α1\alpha_{1} α2\alpha_{2} α3\alpha_{3} α4\alpha_{4} α5\alpha_{5} α6\alpha_{6} α7\alpha_{7}
0.11 0.1 0.05 0.08 -0.2 -0.2 -0.1 0.2

The simulated observed probabilities for all 7 cells are:

P(0,0,1) P(0,1,0) P(0,1,1) P(1,0,0) P(1,0,1) P(1,1,0) P(1,1,1)
0.2135 0.1798 0.0449 0.2360 0.1011 0.1978 0.0449

We can calculate the true value of \Psi analytically as \Psi_{0}=0.8900, and draw 10^{5} samples to obtain the asymptotic \Psi_{f}(P_{n})=0.7419, with asymptotic \hat{\sigma}^{2}=0.2662. The asymptotic value is biased by 0.1481.

Figure 7 shows that the linear estimator is biased when the identification assumption is violated.

Figure 7: Simulation results when the linear identification assumption is violated. Upper left: ratio of estimated variance over true variance vs sample size. The line is the ratio at each sample size, and the grey horizontal line is the true value; Upper right: coverage of 95% asymptotic confidence interval based on normal approximation with σ^2\hat{\sigma}^{2} estimated from efficient influence curve vs sample size. The line is the average coverage; Lower left: mean value of estimated ψ\psi and its 95% asymptotic confidence interval based on normal approximation with σ^2\hat{\sigma}^{2} estimated from efficient influence curve vs sample size. The line is the mean, and shaded area is the confidence interval and the grey horizontal line is the true value; Lower right: Distribution of 1000 estimates of ψ\psi for sample size of 1000, the vertical line is the true value, and the dashed line is the mean value.

6.2 Violation of non-linear assumption: independence

Given 3 samples, we suppose that the first and second sample B1,B2B_{1},B_{2} are independent of each other, that is, P(B1=1,B2=1)=P(B1=1)×P(B2=1)P^{*}(B_{1}=1,B_{2}=1)=P^{*}(B_{1}=1)\times P^{*}(B_{2}=1). Then we can compute ψ^\hat{\psi} and its efficient influence curve as stated above.

Here we illustrate the performance of our estimators with the following underlying distribution:

P(B1=1)=0.5\displaystyle P(B_{1}=1)=0.5
P(B2=1|B1=1)=0.6\displaystyle P(B_{2}=1|B_{1}=1)=0.6
P(B2=1|B1=0)=0.5\displaystyle P(B_{2}=1|B_{1}=0)=0.5
P(B3=1)=0.5\displaystyle P(B_{3}=1)=0.5

The probability for all 7 observed cells is given by:

P(0,0,1) P(0,1,0) P(0,1,1) P(1,0,0) P(1,0,1) P(1,1,0) P(1,1,1)
0.1429 0.1429 0.1429 0.1143 0.1143 0.1714 0.1714

We can calculate the true value of \Psi analytically as \Psi_{0}=0.875, and draw 10^{6} samples to obtain the asymptotic \Psi_{II}(P_{n})=0.9544 with asymptotic \sigma^{2}=0.4415; the asymptotic value is biased by 0.0794.
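Applying the psi_II sketch from section 5.4 to this table reproduces the identification limit and the resulting bias:

```python
p = dict(zip(PATTERNS, [0.1429, 0.1429, 0.1429, 0.1143, 0.1143, 0.1714, 0.1714]))
print(psi_II(p))           # ~0.954: the limit of the plug-in estimator
print(0.875 - psi_II(p))   # ~-0.079: identification bias relative to psi_0 = 0.875
```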

Figure 8 shows that violation of the independence assumption creates a significant bias in the results.

Figure 8: Simulation results when the independence identification assumption is violated. Upper left: ratio of estimated variance over true variance vs sample size. The line is the ratio at each sample size, and the grey horizontal line is the true value; Upper right: coverage of 95% asymptotic confidence interval based on normal approximation with σ^2\hat{\sigma}^{2} estimated from efficient influence curve vs sample size. The line is the average coverage; Lower left: mean value of estimated ψ\psi and its 95% asymptotic confidence interval based on normal approximation with σ^2\hat{\sigma}^{2} estimated from efficient influence curve vs sample size. The line is the mean, and shaded area is the confidence interval and the grey horizontal line is the true value; Lower right: Distribution of 1000 estimates of ψ\psi for sample size of 1000, the vertical line is the true value, and the dashed line is the mean value.

6.3 Violation of non-linear assumption: conditional independence

As above, we assume that, given 3 samples, the third sample B_{3} is independent of the second sample B_{2} conditional on the first sample B_{1}. The influence curve is then derived as above.

Here we illustrate the performance of our estimators when the identification assumption does not hold.

The simulated distribution is as follows:

P(B1=1)=0.1\displaystyle P(B_{1}=1)=0.1
P(B2=1|B1=1)=0.2\displaystyle P(B_{2}=1|B_{1}=1)=0.2
P(B2=1|B1=0)=0.15\displaystyle P(B_{2}=1|B_{1}=0)=0.15
P(B3=1|B1=1,B2=1)=0.10\displaystyle P(B_{3}=1|B_{1}=1,B_{2}=1)=0.10
P(B3=1|B1=1,B2=0)=0.50\displaystyle P(B_{3}=1|B_{1}=1,B_{2}=0)=0.50
P(B3=1|B1=0,B2=1)=0.20\displaystyle P(B_{3}=1|B_{1}=0,B_{2}=1)=0.20
P(B3=1|B1=0,B2=0)=0.50\displaystyle P(B_{3}=1|B_{1}=0,B_{2}=0)=0.50

The probability distribution for all 7 observed cells are:

P(0,0,1) P(0,1,0) P(0,1,1) P(1,0,0) P(1,0,1) P(1,1,0) P(1,1,1)
0.6194 0.1749 0.0437 0.0648 0.0648 0.0291 0.0032

The true value of \Psi is \Psi_{0}=0.6175; we drew 10^{6} samples and obtained the asymptotic \Psi_{CI}(P_{n})=0.2931 with asymptotic \sigma^{2}=1.2349, so the asymptotic value is biased by 0.3244.

Figure 9 shows that under a strong violation of the conditional independence assumption, the estimates have significant biases.

Figure 9: Simulation results when the conditional independence identification assumption is violated. Upper left: ratio of estimated variance over true variance vs sample size. The line is the ratio at each sample size, and the grey horizontal line is the true value; Upper right: coverage of 95% asymptotic confidence interval based on normal approximation with σ^2\hat{\sigma}^{2} estimated from efficient influence curve vs sample size. The line is the average coverage; Lower left: mean value of estimated ψ\psi and its 95% asymptotic confidence interval based on normal approximation with σ^2\hat{\sigma}^{2} estimated from efficient influence curve vs sample size. The line is the mean, and shaded area is the confidence interval and the grey horizontal line is the true value; Lower right: Distribution of 1000 estimates of ψ\psi for sample size of 1000, the vertical line is the true value, and the dashed line is the mean value.

6.4 Violation of non-linear assumption: K-way interaction term equals zero

Here we illustrate the performance of our estimators when the identification assumption that the K-way interaction term \alpha_{7}=0 does not hold. We used the parameters below to generate the underlying distribution, which follows the same model as in section 5.3.

α0\alpha_{0} α1\alpha_{1} α2\alpha_{2} α3\alpha_{3} α4\alpha_{4} α5\alpha_{5} α6\alpha_{6} α7\alpha_{7}
-1.6333 0 0 0 -1 -2 -0.5 1

The simulated observed probabilities for all 7 observed cells are:

P(0,0,1) P(0,1,0) P(0,1,1) P(1,0,0) P(1,0,1) P(1,1,0) P(1,1,1)
0.2386 0.2386 0.0878 0.2386 0.0323 0.1447 0.0196

Figure 10 and table 7 show that when the assumption is heavily violated, all estimators are significantly biased. The true value is \psi_{0}=0.8074. The bias is 0.2032 for the \Psi_{I}(P_{NP}) estimator, 0.1372 for \Psi_{I}(P_{lasso}), 0.2055 for \Psi_{I}(P_{tmle}), 0.2217 for \Psi_{I}(P_{M_{0}}), and 0.2185 for \Psi_{I}(P_{M_{t}}). The coverage of the 95% asymptotic confidence intervals is 26% for \Psi_{I}(P_{NP}), 59% for \Psi_{I}(P_{lasso}), 22% for \Psi_{I}(P_{tmle}), and 0% for both \Psi_{I}(P_{M_{0}}) and \Psi_{I}(P_{M_{t}}). Table 8 shows the model fit statistics for the M_{0} and M_{t} models.
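As a check, applying the psi_I sketch from section 5.3 to these limiting cell probabilities reproduces the large bias of the non-parametric plug-in:

```python
p = np.array([0.2386, 0.2386, 0.0878, 0.2386, 0.0323, 0.1447, 0.0196])
print(psi_I(p))   # ~0.607, versus psi_0 = 0.8074: a large identification bias
```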

Figure 10: Distribution of 1000 estimates of ΨI\Psi_{I} for sample size of 1000, the black vertical line is the true value, and the dashed vertical lines represent mean values for each estimator.
Estimator Average \psi Lower 95% CI Upper 95% CI Coverage(%)
ΨI(PNP)\Psi_{I}(P_{NP}) 0.6042 0.4491 0.7592 26.3
ΨI(Plasso)\Psi_{I}(P_{lasso}) 0.6702 0.5261 0.8142 58.6
ΨI(Ptmle)\Psi_{I}(P_{tmle}) 0.6019 0.4516 0.7523 21.9
ΨI(PM0)\Psi_{I}(P_{M_{0}}) 0.5857 0.5416 0.6295 0
ΨI(PMt)\Psi_{I}(P_{M_{t}}) 0.5889 0.5447 0.6328 0
Table 7: Estimated ψ^\hat{\psi} and 95% asymptotic confidence interval based on normal approximation with σ^2\hat{\sigma}^{2} estimated from efficient influence curve and its coverage by each estimator. True Ψ0=0.8074\Psi_{0}=0.8074.
Estimator df AIC BIC
ΨI(PM0)\Psi_{I}(P_{M_{0}}) 5 157.720 167.535
ΨI(PMt)\Psi_{I}(P_{M_{t}}) 3 128.623 148.254
Table 8: Model fit statistics for the \Psi_{I}(P_{M_{0}}) and \Psi_{I}(P_{M_{t}}) estimators.

6.5 Violation of all identification assumptions

In the sections above we showed that when an identification assumption is violated, its corresponding estimators are biased and their asymptotic confidence intervals have low coverage. In this section, all of the identification assumptions are violated by the simulated distribution, and we analyze the performance of all the estimators discussed above in this setting.

The underlying distribution of the three samples B_{1},B_{2},B_{3} follows the same log-linear model as in section 5.3, with the parameters below.

α0\alpha_{0} α1\alpha_{1} α2\alpha_{2} α3\alpha_{3} α4\alpha_{4} α5\alpha_{5} α6\alpha_{6} α7\alpha_{7}
-1.1835 -1 -1 -1 -1.5 -1 2 1

The simulated observed probability for all 7 observed cells is given by:

P(0,0,1) P(0,1,0) P(0,1,1) P(1,0,0) P(1,0,1) P(1,1,0) P(1,1,1)
0.1624 0.1624 0.0133 0.1624 0.0220 0.4414 0.0362

Figure 11 and table 9 show that all estimators are significantly biased when none of their identification assumptions holds. The true value is \psi_{0}=0.6938. Under the assumption that samples B_{1} and B_{2} are independent conditional on B_{3} ("Conditional independence" in table 9), the plug-in estimator has a bias of -0.2497. Under the assumption that samples B_{1} and B_{2} are independent ("Independence" in table 9), the plug-in estimator has a bias of 0.3760. Under the assumption that the 3-way additive interaction term in the linear model equals zero ("Highest-way interaction equals zero (linear)" in table 9), the plug-in estimator has a bias of -0.2627. Under the assumption that the 3-way multiplicative interaction term in the log-linear model equals zero ("Highest-way interaction equals zero (log-linear)" in table 9), the \Psi_{I}(P_{NP}) estimator has a bias of 0.2517, \Psi_{I}(P_{lasso}) of -0.1573, \Psi_{I}(P_{tmle}) of -0.1867, \Psi_{I}(P_{M_{0}}) of -0.1017, and \Psi_{I}(P_{M_{t}}) of -0.1446. The coverage of the 95% asymptotic confidence intervals is far below 95% for all estimators. Table 10 shows the model fit statistics for the M_{0} and M_{t} models.

Figure 11: Distribution of 1000 estimates of ψ\psi for sample size of 1000, the black vertical line is the true value, and the dashed vertical lines represent mean values for each estimator.
Assumption Estimator Average ψ\psi Lower 95% CI Upper 95% CI Coverage(%)
Conditional independence ΨCI(Pn)\Psi_{CI}(P_{n}) 0.9435 0.9313 0.9557 31.2
Independence ΨII(Pn)\Psi_{II}(P_{n}) 0.3223 0.2472 0.3974 0
Highest-way interaction equals zero (linear) Ψf(Pn)\Psi_{f}(P_{n}) 0.9565 0.8999 1.0132 0
Highest-way interaction equals zero (log-linear) ΨI(PNP)\Psi_{I}(P_{NP}) 0.4466 0.2519 0.6413 0
Highest-way interaction equals zero (log-linear) ΨI(Plasso)\Psi_{I}(P_{lasso}) 0.8511 0.8005 0.9017 0
Highest-way interaction equals zero (log-linear) ΨI(Ptmle)\Psi_{I}(P_{tmle}) 0.8805 0.8385 0.9225 31.2
Highest-way interaction equals zero (log-linear) ΨI(PM0)\Psi_{I}(P_{M_{0}}) 0.7955 0.7621 0.8269 0
Highest-way interaction equals zero (log-linear) ΨI(PMt)\Psi_{I}(P_{M_{t}}) 0.8384 0.8074 0.8673 0
Table 9: Estimated \hat{\psi} and 95% asymptotic confidence interval based on normal approximation with \hat{\sigma}^{2} estimated from the efficient influence curve, and its coverage, for each estimator. True \psi_{0}=0.6938. Conditional independence assumption: samples B_{1} and B_{2} are independent conditional on B_{3}; Independence assumption: samples B_{1} and B_{2} are independent; Highest-way interaction equals zero (linear): the 3-way interaction term in the linear model equals zero; Highest-way interaction equals zero (log-linear): the 3-way interaction term in the log-linear model equals zero.
Estimator Df AIC BIC
ΨI(PM0)\Psi_{I}(P_{M_{0}}) 5 743.631 753.447
ΨI(PMt)\Psi_{I}(P_{M_{t}}) 3 391.757 411.388
Table 10: Model fit statistics for the \Psi_{I}(P_{M_{0}}) and \Psi_{I}(P_{M_{t}}) estimators.

7 Data Analysis

Schistosomiasis is an acute and chronic parasitic disease endemic to 78 countries worldwide. Estimates show that at least 229 million people required preventive treatment in 2018 [24]. In China, schistosomiasis is an ongoing public health challenge, and the country has set ambitious goals of achieving nationwide transmission interruption [25]. To monitor the transmission of schistosomiasis, China operates three surveillance systems, which can be considered repeated samples of the underlying infected population. The first is a national surveillance system designed to monitor nationwide schistosomiasis prevalence (S_{1}), which covers 1% of communities in endemic provinces and conducts surveys every 6-9 years. The second is a sentinel system designed to provide longitudinal measures of disease prevalence and intensity in select communities (S_{2}), which conducts yearly surveys. The third system (S_{3}) comprises routine surveillance of all communities in endemic counties, with surveys conducted on a roughly three-year cycle [25]. In this application, we use case records for schistosomiasis from these three systems for a region in southwestern China [17]. The community, with a population of about 6000, reported a total of 302 cases in 2004: 112 in S_{1}, 294 in S_{2}, and 167 in S_{3}, with 202 cases appearing on more than one system register (figure 12). The three surveillance systems are not independent, and we apply estimators under the four constraints (highest-way interaction equals zero in the log-linear model, highest-way interaction equals zero in the linear model, independence, and conditional independence) to the data.

Figure 12: Schistosomiasis case frequencies among S1,S2,S3S_{1},S_{2},S_{3} surveillance systems in a community (population 6000\sim 6000) in southwestern China in 2004 [17]

The estimated \hat{\psi} with its 95% asymptotic confidence interval for each estimator is shown in table 11. The highest estimate of \psi is 0.9969 (95% CI: (0.9965, 0.9972)), under the conditional independence assumption, and the lowest is 0.8935 (95% CI: (0.8041, 0.9829)), under the linear model with no highest-way interaction (table 11). This analysis illustrates, on real data, the heavy dependence of estimates of \Psi on the arbitrary identification assumptions.

One way to address this problem is to make a chosen identification assumption hold by design. In this case, we could design the sentinel system S_{2} and the routine system S_{3} to be independent of each other conditional on the national system S_{1}: for example, the sampling surveys in S_{2} and S_{3} could be carried out independently, while each may draw on information from S_{1}. The conditional independence assumption between the two samples would then hold by design, and the analysis could be conducted accordingly.

Assumption Estimator Estimated ψ\psi Lower 95% CI Upper 95% CI
Highest-way interaction equals zero (log-linear) ΨI(Plasso)\Psi_{I}(P_{lasso}) 0.9212 0.8755 0.9669
Highest-way interaction equals zero (log-linear) ΨI(Ptmle)\Psi_{I}(P_{tmle}) 0.9428 0.9089 0.9768
Highest-way interaction equals zero (log-linear) ΨI(PM0)\Psi_{I}(P_{M_{0}}) 0.9321 0.8959 0.9627
Highest-way interaction equals zero (log-linear) ΨI(PMt)\Psi_{I}(P_{M_{t}}) 0.9934 0.9748 1
Highest-way interaction equals zero (linear) Ψf(Pn)\Psi_{f}(P_{n}) 0.8935 0.8041 0.9829
Independence ΨII(Pn)\Psi_{II}(P_{n}) 0.963 0.9326 0.9935
Conditional independence ΨCI(Pn)\Psi_{CI}(P_{n}) 0.9969 0.9965 0.9972
Table 11: Estimated \hat{\psi} and 95% asymptotic confidence interval based on normal approximation with \hat{\sigma}^{2} estimated from the efficient influence curve, for each estimator. Conditional independence assumption: surveys S_{1} and S_{2} are independent conditional on S_{3}; Independence assumption: surveys S_{1} and S_{2} are independent; Highest-way interaction equals zero (linear): the 3-way interaction term in the linear model equals zero; Highest-way interaction equals zero (log-linear): the 3-way interaction term in the log-linear model equals zero. The \Psi_{I}(P_{NP}) estimator under the log-linear highest-way interaction assumption is not defined, due to empty cells in the observed data.
Estimator Df AIC BIC
ΨI(PM0)\Psi_{I}(P_{M_{0}}) 5 329.383 336.804
ΨI(PMt)\Psi_{I}(P_{M_{t}}) 3 60.139 74.980
Table 12: Model fit statistics for the \Psi_{I}(P_{M_{0}}) and \Psi_{I}(P_{M_{t}}) estimators.

8 Discussion

8.1 Summary table

In this section we summarize the estimators proposed under the linear and non-linear constraints.

Identification assumption Highest-way interaction term in linear model equals zero
Constraint EPBf(B)=b(1)K+k=1KbkPB,0(b)=0E_{P_{B^{*}}}f(B^{*})=\sum_{b}(-1)^{K+\sum_{k=1}^{K}b_{k}}P_{B^{*},0}(b)=0
Method Plug-in
Target parameter ψf=f(0)f(0)b0f(b)P(b)\psi_{f}=\frac{f(0)}{f(0)-\sum_{b\neq 0}f(b)P(b)}
Efficient influence curve Df=f(0)(f(0)b0f(b)P(b))2[f(B)b0f(b)P(b)]D^{*}_{f}=\frac{f(0)}{(f(0)-\sum_{b\neq 0}f(b)P(b))^{2}}[f(B)-\sum_{b\neq 0}f(b)P(b)]
Table 13: Summary table for identification assumption: highest-way interaction term in linear model equals zero, where f(b)=(1)K+k=1Kbkf(b)=(-1)^{K+\sum_{k=1}^{K}b_{k}}.
Identification assumption Independence between two samples
Constraint ΦII(P)bI(b(1:2)=(0,0))P(b)\Phi_{II}(P^{*})\equiv\sum_{b}I(b(1:2)=(0,0))P^{*}(b)-
b1,b2I(b1(1)=b2(2)=0)P(b1)P(b2)\sum_{b_{1},b_{2}}I(b_{1}(1)=b_{2}(2)=0)P^{*}(b_{1})P^{*}(b_{2})
Method Plug-in
Target parameter ΨII(P)=1P(B(1)=0)P(B(2)=0)+P(B(1:2)=(0,0))1P(B(1)=0)P(B(2)=0)+P(B(1)=0)P(B(2)=0)\Psi_{II}(P)=\frac{1-P(B(1)=0)-P(B(2)=0)+P(B(1:2)=(0,0))}{1-P(B(1)=0)-P(B(2)=0)+P(B(1)=0)P(B(2)=0)}
Efficient influence curve D^{*}_{\Phi_{II}}(P)=C_{2}(P)\{\mathbb{1}(B(1)=0)-P(B(1)=0)\}+C_{3}(P)\{\mathbb{1}(B(2)=0)-P(B(2)=0)\}+C_{4}(P)\{\mathbb{1}(B(1:2)=(0,0))-P(B(1:2)=(0,0))\},
where C_{2}(P)=\frac{(1-P(B(2)=0))(P(B(1:2)=(0,0))-P(B(2)=0))}{(1-P(B(1)=0)-P(B(2)=0)+P(B(1)=0)P(B(2)=0))^{2}},
C_{3}(P)=\frac{(1-P(B(1)=0))(P(B(1:2)=(0,0))-P(B(1)=0))}{(1-P(B(1)=0)-P(B(2)=0)+P(B(1)=0)P(B(2)=0))^{2}},
C_{4}(P)=\frac{1}{1-P(B(1)=0)-P(B(2)=0)+P(B(1)=0)P(B(2)=0)}
Table 14: Summary table for identification assumption: independence between two samples.
Identification assumption Conditional independence between two samples
Constraint ΦCI,(j,m)=P(Bj=1|B1=0,,Bm=0,,BK=0)\Phi_{CI,(j,m)}=P^{*}(B_{j}=1|B_{1}=0,\cdots,B_{m}=0,\cdots,B_{K}=0)
P(Bj=1|B1=0,,Bm=1,,BK=0)-P^{*}(B_{j}=1|B_{1}=0,\cdots,B_{m}=1,\cdots,B_{K}=0)
Method Plug-in
Target parameter ΨCI=P(Bm=1,Bj=1,0,,0)P(Bm=1,Bj=1,0,,0)+P(Bm=0,Bj=1,0,,0)P(Bm=1,Bj=0,0,,0)\Psi_{CI}=\frac{P(B_{m}=1,B_{j}=1,0,...,0)}{P(B_{m}=1,B_{j}=1,0,...,0)+P(B_{m}=0,B_{j}=1,0,...,0)P(B_{m}=1,B_{j}=0,0,...,0)}
Efficient influence curve D^{*}_{\Phi_{CI}}(P)=\frac{1}{C_{5}(P)}(C_{6}(P)-C_{7}(P)-C_{8}(P)),
where C_{5}(P)=[P(B_{m}=1,B_{j}=1,0,\dots,0)+P(B_{m}=0,B_{j}=1,0,\dots,0)P(B_{m}=1,B_{j}=0,0,\dots,0)]^{2},
C_{6}(P)=P(B_{m}=0,B_{j}=1,0,\dots,0)P(B_{m}=1,B_{j}=0,0,\dots,0)[\mathbb{I}(B_{m}=1,B_{j}=1,0,\dots,0)-P(B_{m}=1,B_{j}=1,0,\dots,0)],
C_{7}(P)=P(B_{m}=1,B_{j}=0,0,\dots,0)P(B_{m}=1,B_{j}=1,0,\dots,0)[\mathbb{I}(B_{m}=0,B_{j}=1,0,\dots,0)-P(B_{m}=0,B_{j}=1,0,\dots,0)],
C_{8}(P)=P(B_{m}=0,B_{j}=1,0,\dots,0)P(B_{m}=1,B_{j}=1,0,\dots,0)[\mathbb{I}(B_{m}=1,B_{j}=0,0,\dots,0)-P(B_{m}=1,B_{j}=0,0,\dots,0)]
Table 15: Summary table for identification assumption: conditional independence between two samples.
Identification assumption Highest-way interaction term in log-linear model equals zero
Constraint ΦI(P)b(1)1+k=1KbklogPB(b)=0\Phi_{I}(P^{*})\equiv\sum_{b}(-1)^{1+\sum_{k=1}^{K}b_{k}}\log P_{B^{*}}(b)=0
Method Plug-in (NPMLE), undersmoothed lasso, TMLE based on lasso
Target parameter ΨI(P)=11+exp((1)K+1b0f(b)logP(b))\Psi_{I}(P)=\frac{1}{1+\exp((-1)^{K+1}\sum_{b\not=0}f(b)\log P(b))}
Efficient influence curve DΦI(P)=(1)KΨI(P)(1ΨI(P)){f(B)P(B)+f(0)}D^{*}_{\Phi_{I}}(P)=(-1)^{K}\Psi_{I}(P)(1-\Psi_{I}(P))\left\{\frac{f(B)}{P(B)}+f(0)\right\}
Table 16: Summary table for identification assumption: highest-way interaction term in log-linear model equals zero, where f(b)=(1)K+k=1Kbkf(b)=(-1)^{K+\sum_{k=1}^{K}b_{k}}.
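For completeness, the linear-constraint plug-in of table 13 and its influence curve can be coded in the same style as the earlier sketches, reusing f and f0 from the section 5.3 sketch (again K=3; the names psi_f and eic_f are ours, and this is an illustration rather than the paper's code).

```python
import numpy as np

def psi_f(p):
    """Plug-in psi_f = f(0) / (f(0) - sum_{b != 0} f(b) P(b)) from table 13."""
    return f0 / (f0 - np.sum(f * p))

def eic_f(p):
    """D*_f = f(0)/(f(0) - sum f(b)P(b))^2 [f(B) - sum_{b != 0} f(b) P(b)]."""
    s = np.sum(f * p)
    return f0 / (f0 - s) ** 2 * (f - s)
```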

We developed a modern method to estimate population size based on capture-recapture designs with a minimal number of constraints or parametric assumptions. We provide solutions, theoretical support, a simulation study, and sensitivity analyses for four identification assumptions: independence between two samples, conditional independence between two samples, no highest-way interaction in the linear model, and no highest-way interaction in the log-linear model. We also developed machine learning algorithms to address the curse of dimensionality in high dimensional problems under the assumption of no highest-way interaction in the log-linear model. Through our analysis, we found that whether the identification assumption holds true plays a vital role in the performance of estimation. When the assumption is violated, all estimators are biased; this conclusion applies to models of all forms, parametric or non-parametric, simple plug-in estimators or complex machine-learning based estimators. Thus one should always ensure that the chosen identification assumption is known to be true by survey design, otherwise all the estimators will be unreliable.

When the identification assumption holds true, the performance of our targeted maximum likelihood estimator, \Psi_{I}(P_{tmle}), is superior to the \Psi_{I}(P_{M_{0}}) (identical capture probabilities, no highest-way interaction in the log-linear model), \Psi_{I}(P_{M_{t}}) (no interaction terms in the log-linear model) and \Psi_{I}(P_{NP}) (plug-in, no highest-way interaction in the log-linear model) estimators in several respects. First, by making only the assumptions required for identifiability, \Psi_{I}(P_{tmle}) is more robust in empirical data analysis, as there is no bias due to violations of parametric model assumptions. Second, \Psi_{I}(P_{tmle}) is based on a consistent undersmoothed lasso estimator; this property ensures the asymptotic efficiency of the TMLE. Third, when there are empty cells, \Psi_{I}(P_{tmle}) overcomes the curse of dimensionality by correcting the bias introduced by the undersmoothed lasso estimator, and gives a more honest asymptotic confidence interval, wider than those from parametric models, and hence with higher coverage.

References

  • [1] Anne Chao et al. “The applications of capture-recapture models to epidemiological data” In Statistics in medicine 20.20 Wiley Online Library, 2001, pp. 3123–3157
  • [2] Stephen T Buckland et al. “Introduction to distance sampling: estimating abundance of biological populations” Oxford (United Kingdom) Oxford Univ. Press, 2001
  • [3] Mathew W Alldredge, Kenneth H Pollock and Theodore R Simons “Estimating detection probabilities from multiple-observer point counts” In The Auk 123.4 Oxford University Press, 2006, pp. 1172–1182
  • [4] Anne Chao “An overview of closed capture-recapture models” In Journal of Agricultural, Biological, and Environmental Statistics 6.2 Springer, 2001, pp. 158–175
  • [5] Zachary T Kurtz “Local log-linear models for capture-recapture” In arXiv preprint arXiv:1302.0890, 2013
  • [6] Janet T Wittes “Applications of a multinomial capture-recapture model to epidemiological data” In Journal of the American Statistical Association 69.345 Taylor & Francis Group, 1974, pp. 93–97
  • [7] George Arthur Frederick Seber “The estimation of animal abundance and related parameters” Blackburn press Caldwell, New Jersey, 1982
  • [8] Hamparsum Bozdogan “Model selection and Akaike’s information criterion (AIC): The general theory and its analytical extensions” In Psychometrika 52.3 Springer, 1987, pp. 345–370
  • [9] Gideon Schwarz “Estimating the dimension of a model” In Annals of statistics 6.2 Institute of Mathematical Statistics, 1978, pp. 461–464
  • [10] David Draper “Assessment and propagation of model uncertainty” In Journal of the Royal Statistical Society: Series B (Methodological) 57.1 Wiley Online Library, 1995, pp. 45–70
  • [11] Ernest B Hook and Ronald R Regal “Validity of methods for model selection, weighting for model uncertainty, and small sample adjustment in capture-recapture estimation” In American journal of epidemiology 145.12 Oxford University Press, 1997, pp. 1138–1144
  • [12] Toon Braeye et al. “Capture-recapture estimators in epidemiology with applications to pertussis and pneumococcal invasive disease surveillance” In PloS one 11.8 Public Library of Science San Francisco, CA USA, 2016, pp. e0159832
  • [13] Manjari Das and Edward H Kennedy “Doubly robust capture-recapture methods for estimating population size” In arXiv preprint arXiv:2104.14091, 2021
  • [14] Mark J van der Laan and Sherri Rose “Targeted learning: causal inference for observational and experimental data” Springer Science & Business Media, 2011
  • [15] Annegret Grimm, Bernd Gruber and Klaus Henle “Reliability of different mark-recapture methods for population size estimation tested against reference population sizes constructed from field data” In PLoS One 9.6 Public Library of Science, 2014, pp. e98840
  • [16] Samuel G Rees, Anne E Goodenough, Adam G Hart and Richard Stafford “Testing the effectiveness of capture mark recapture population estimation techniques using a computer simulation with known population size” In Ecological Modelling 222.17 Elsevier, 2011, pp. 3291–3294
  • [17] Robert C Spear et al. “Factors influencing the transmission of Schistosoma japonicum in the mountains of Sichuan Province of China” In The American journal of tropical medicine and hygiene 70.1 ASTMH, 2004, pp. 48–56
  • [18] René Gateaux “Fonctions d’une infinité de variables indépendantes” In Bulletin de la Société Mathématique de France 47, 1919, pp. 70–96
  • [19] Richard D Gill, Jon A Wellner and Jens Praestgaard “Non- and semi-parametric maximum likelihood estimators and the von Mises method (part 1) [with discussion and reply]” In Scandinavian Journal of Statistics JSTOR, 1989, pp. 97–128
  • [20] Mark J van der Laan, David Benkeser and Weixin Cai “Efficient estimation of pathwise differentiable target parameters with the undersmoothed highly adaptive lasso” In arXiv preprint arXiv:1908.05607, 2019
  • [21] Richard M Cormack “Log-linear models for capture-recapture” In Biometrics JSTOR, 1989, pp. 395–413
  • [22] Sophie Baillargeon and Louis-Paul Rivest “Rcapture: loglinear models for capture-recapture in R” In Journal of Statistical Software 19.5, 2007, pp. 1–31
  • [23] Zoe Emily Schnabel “The estimation of the total fish population of a lake” In The American Mathematical Monthly 45.6 Taylor & Francis, 1938, pp. 348–352
  • [24] “Schistosomiasis: Key Facts” URL: https://www.who.int/en/news-room/fact-sheets/detail/schistosomiasis
  • [25] Song Liang et al. “Surveillance systems for neglected tropical diseases: global lessons from China’s evolving schistosomiasis reporting systems, 1949–2014” In Emerging themes in epidemiology 11.1 Springer, 2014, pp. 19
  • [26] Joseph L Doob “The limiting distributions of certain statistics” In The Annals of Mathematical Statistics 6.3 JSTOR, 1935, pp. 160–169

9 Appendix

In the appendix, we formally state the lemmas used in the main text and provide their proofs. In section 9.1, we derive the target parameter \Psi_{II}(P) under the identification assumption that the first two samples B_{1},B_{2} are independent, given that there are three samples in total. In section 9.2 we derive the efficient influence curve D^{*}_{\Phi_{II}}(P) for \Psi_{II}(P). In section 9.3, we state the lemma on deriving the efficient influence curve under the multidimensional constraint \Phi_{III}. In section 9.4, we formally state the target parameter \Psi_{CI}(P) under the identification assumption that the first two samples B_{1},B_{2} are independent conditional on the third sample. In section 9.5 we derive the efficient influence curve D^{*}_{\Phi_{CI}}(P) for \Psi_{CI}(P). In section 9.6 we prove the asymptotic efficiency of the TMLE.

9.1 Target parameter under independence assumption

Lemma 1.

For constraint ΦII\Phi_{II}(equation 4), we have the target parameter ΨII\Psi_{II} as:

ΨII(P)=1P(B(1)=0)P(B(2)=0)+P(B(1:2)=(0,0))1P(B(1)=0)P(B(2)=0)+P(B(1)=0)P(B(2)=0)\displaystyle\Psi_{II}(P)=\frac{1-P(B(1)=0)-P(B(2)=0)+P(B(1:2)=(0,0))}{1-P(B(1)=0)-P(B(2)=0)+P(B(1)=0)P(B(2)=0)}
Proof.

We provide a brief proof of lemma 1 when there are three samples. In this case, ΦII(P)=0\Phi_{II}(P^{*})=0 is equivalent to

P^{*}(B^{*}(1:2)=(0,0))=P^{*}(B^{*}(1)=0)P^{*}(B^{*}(2)=0). (17)

When there are three samples, equation 17 can be expanded as:

P(0,0,1)+P(0,0,0)\displaystyle P^{*}(0,0,1)+P^{*}(0,0,0) =\displaystyle= P(0,0,0)2+[P(0,1,0)+P(0,1,1)+P(0,0,1)\displaystyle P^{*}(0,0,0)^{2}+[P^{*}(0,1,0)+P^{*}(0,1,1)+P^{*}(0,0,1) (18)
+P(1,0,0)+P(1,0,1)+P(0,0,1)]×P(0,0,0)\displaystyle+P^{*}(1,0,0)+P^{*}(1,0,1)+P^{*}(0,0,1)]\times P^{*}(0,0,0)
+[P(0,1,0)+P(0,1,1)+P(0,0,1)]\displaystyle+[P^{*}(0,1,0)+P^{*}(0,1,1)+P^{*}(0,0,1)]
×[P(1,0,0)+P(1,0,1)+P(0,0,1)]\displaystyle\times[P^{*}(1,0,0)+P^{*}(1,0,1)+P^{*}(0,0,1)]

Denote a=P(0,1,0)+P(0,1,1)+P(0,0,1)a^{*}=P^{*}(0,1,0)+P^{*}(0,1,1)+P^{*}(0,0,1), b=P(1,0,0)+P(1,0,1)+P(0,0,1)b^{*}=P^{*}(1,0,0)+P^{*}(1,0,1)+P^{*}(0,0,1). Equation 18 can be written as:

0=P(0,0,0)2+(a+b1)×P(0,0,0)P(0,0,1).\displaystyle 0=P^{*}(0,0,0)^{2}+(a^{*}+b^{*}-1)\times P^{*}(0,0,0)-P^{*}(0,0,1).

Plugging in \psi=1-P^{*}(0,0,0), we have

P(0,1,0)=P(0,1,0)ψ,,P(0,0,1)=P(0,0,1)ψ,a=aψ,b=bψ.P(0,1,0)=\frac{P^{*}(0,1,0)}{\psi},...,P(0,0,1)=\frac{P^{*}(0,0,1)}{\psi},a=\frac{a^{*}}{\psi},b=\frac{b^{*}}{\psi}.

Thus equation 18 can be expressed as:

0=(1+abab)ψ2+(a+b1P(0,0,1))ψ.0=(1+ab-a-b)\psi^{2}+(a+b-1-P(0,0,1))\psi. (19)

Here, let a_{II}=1+ab-a-b and b_{II}=a+b-1-P(0,0,1). Since \psi>0, dividing equation 19 by \psi gives a_{II}\psi+b_{II}=0, thus we have

ψ\displaystyle\psi =\displaystyle= bIIaII\displaystyle-\frac{b_{II}}{a_{II}}
=\displaystyle= 1P(B(1)=0)P(B(2)=0)+P(B(1:2)=(0,0))1+P(B(1)=0)P(B(2)=0)P(B(1)=0)P(B(2)=0)\displaystyle\frac{1-P(B(1)=0)-P(B(2)=0)+P(B(1:2)=(0,0))}{1+P(B(1)=0)P(B(2)=0)-P(B(1)=0)-P(B(2)=0)}

where P(B(1)=0)=P(0,1,0)+P(0,1,1)+P(0,0,1), P(B(2)=0)=P(1,0,0)+P(1,0,1)+P(0,0,1), and P(B(1:2)=(0,0))=P(0,0,1). ∎

9.2 Efficient influence curve under independence assumption

Lemma 2.

For constraint \Phi_{II}=0 (equation 4), we have the efficient influence curve D^{*}_{\Phi_{II}}(P) as:

DΦII(P)\displaystyle D^{*}_{\Phi_{II}}(P) =\displaystyle= ψP(B(1)=0)×DΦII(P(B(1)=0))\displaystyle\frac{\partial\psi}{\partial P(B(1)=0)}\times D^{*}_{\Phi_{II}}(P(B(1)=0)) (20)
+ψP(B(2)=0)×DΦII(P(B(2)=0))\displaystyle+\frac{\partial\psi}{\partial P(B(2)=0)}\times D^{*}_{\Phi_{II}}(P(B(2)=0))
+ψP(B(1:2)=(0,0))×DΦII(P(B(1:2)=(0,0))).\displaystyle+\frac{\partial\psi}{\partial P(B(1:2)=(0,0))}\times D^{*}_{\Phi_{II}}(P(B(1:2)=(0,0))).
Proof.

By the delta method [26], the efficient influence curve of \psi can be written as a function of each component's influence curve. The efficient influence curves of the three components are as follows:

DΦII(P(B(1)=0))\displaystyle D^{*}_{\Phi_{II}}(P(B(1)=0)) =\displaystyle= 𝕀(B(1)=0)P(B(1)=0).\displaystyle\mathbbm{I}(B(1)=0)-P(B(1)=0).
DΦII(P(B(2)=0))\displaystyle D^{*}_{\Phi_{II}}(P(B(2)=0)) =\displaystyle= 𝕀(B(2)=0)P(B(2)=0).\displaystyle\mathbbm{I}(B(2)=0)-P(B(2)=0).
DΦII(P(B(1:2)=(0,0)))\displaystyle D^{*}_{\Phi_{II}}(P(B(1:2)=(0,0))) =\displaystyle= 𝕀(B(1:2)=(0,0))P(B(1:2)=(0,0)).\displaystyle\mathbbm{I}(B(1:2)=(0,0))-P(B(1:2)=(0,0)).

Therefore, we only need to calculate the derivatives to get DΦII(P)D^{*}_{\Phi_{II}}(P), and the three parts of derivatives are given by

\frac{\partial\psi}{\partial P(B(1)=0)}=\frac{(1-P(B(2)=0))(P(B(1:2)=(0,0))-P(B(2)=0))}{(1-P(B(1)=0)-P(B(2)=0)+P(B(1)=0)P(B(2)=0))^{2}}.
ψP(B(2)=0)=(1P(B(1)=0))(P(B(1:2)=(0,0))P(B(1)=0))(1P(B(1)=0)P(B(2)=0)+P(B(1)=0)P(B(2)=0))2.\frac{\partial\psi}{\partial P(B(2)=0)}=\frac{(1-P(B(1)=0))(P(B(1:2)=(0,0))-P(B(1)=0))}{(1-P(B(1)=0)-P(B(2)=0)+P(B(1)=0)P(B(2)=0))^{2}}.
ψP(B(1:2)=(0,0))=11P(B(1)=0)P(B(2)=0)+P(B(1)=0)P(B(2)=0).\frac{\partial\psi}{\partial P(B(1:2)=(0,0))}=\frac{1}{1-P(B(1)=0)-P(B(2)=0)+P(B(1)=0)P(B(2)=0)}.

Plugging each part into equation 20, we have

DΦII(P)\displaystyle D^{*}_{\Phi_{II}}(P) =\displaystyle= (1P(B(2)=0))(P(B(1:2)=(0,0))P(B(2)=0))(1P(B(1)=0)P(B(2)=0)+P(B(1)=0)P(B(2)=0))2\displaystyle\frac{(1-P(B(2)=0))(P(B(1:2)=(0,0))-P(B(2)=0))}{(1-P(B(1)=0)-P(B(2)=0)+P(B(1)=0)P(B(2)=0))^{2}}
[𝕀(B(1)=0)P(B(1)=0)]\displaystyle[\mathbbm{I}(B(1)=0)-P(B(1)=0)]
+(1P(B(1)=0))(P(B(1:2)=(0,0))P(B(1)=0))(1P(B(1)=0)P(B(2)=0)+P(B(1)=0)P(B(2)=0))2\displaystyle+\frac{(1-P(B(1)=0))(P(B(1:2)=(0,0))-P(B(1)=0))}{(1-P(B(1)=0)-P(B(2)=0)+P(B(1)=0)P(B(2)=0))^{2}}
[𝕀(B(2)=0)P(B(2)=0)]\displaystyle[\mathbbm{I}(B(2)=0)-P(B(2)=0)]
+11P(B(1)=0)P(B(2)=0)+P(B(1)=0)P(B(2)=0)\displaystyle+\frac{1}{1-P(B(1)=0)-P(B(2)=0)+P(B(1)=0)P(B(2)=0)}
[𝕀(B(1:2)=(0,0))P(B(1:2)=(0,0))].\displaystyle[\mathbbm{I}(B(1:2)=(0,0))-P(B(1:2)=(0,0))].

9.3 Efficient influence curve under multidimensional constraint

Lemma 3.

Consider a model {P1:Φ(P)=0}{\cal M}\equiv\{P\in{\cal M}_{1}:\Phi(P)=0\} defined by an initial larger model 1{\cal M}_{1} and multivariate constraint function Φ:1IRK\Phi:{\cal M}_{1}\rightarrow\hbox{${\rm I\kern-1.99997ptR}$}^{K}. Suppose that Φ:1IRK\Phi:{\cal M}_{1}\rightarrow\hbox{${\rm I\kern-1.99997ptR}$}^{K} is path-wise differentiable at PP with efficient influence curve DΦ(P)D^{*}_{\Phi}(P) for all P1P\in{\cal M}_{1}. Let T1(P)T_{1}(P) be the tangent space at PP for model 1{\cal M}_{1}, and let ΠT1:L02(P)T1(P)\Pi_{T_{1}}:L^{2}_{0}(P)\rightarrow T_{1}(P) be the projection operator onto T1(P)T_{1}(P). The tangent space at PP for model {\cal M} is given by:

T(P)={ST1(P):SDΦ(P)}.T(P)=\{S\in T_{1}(P):S\perp D^{*}_{\Phi}(P)\}.

The projection onto T(P)T(P) is given by:

\Pi_{T}(S)=\Pi_{T_{1}}(S)-\Pi_{D^{*}_{\Phi}}(\Pi_{T_{1}}(S)),

where ΠDΦ\Pi_{D^{*}_{\Phi}} is the projection operator on the KK-dimensional subspace of T1(P)T_{1}(P) spanned by the components of DΦ(P)D^{*}_{\Phi}(P). The latter projection is given by the formula:

ΠDΦ(S)=E(SDΦ(P))E(DΦ(P)DΦ(P))1DΦ(P).\Pi_{D^{*}_{\Phi}}(S)=E(SD^{*}_{\Phi}(P)^{\top})E(D^{*}_{\Phi}(P)D^{*}_{\Phi}(P)^{\top})^{-1}D^{*}_{\Phi}(P).

9.4 Target parameter under conditional independence assumption

Lemma 4.

For constraint ΦCI=0\Phi_{CI}=0 (equation 5), we have the target parameter ΨCI\Psi_{CI} as

ΨCI=P(Bm=1,Bj=1,0,,0)C0(P).\Psi_{CI}=\frac{P(B_{m}=1,B_{j}=1,0,...,0)}{C_{0}(P)}.

where

C0(P)\displaystyle C_{0}(P) =\displaystyle= P(Bm=1,Bj=1,0,,0)\displaystyle P(B_{m}=1,B_{j}=1,0,...,0)
+P(Bm=0,Bj=1,0,,0)P(Bm=1,Bj=0,0,,0).\displaystyle+P(B_{m}=0,B_{j}=1,0,...,0)P(B_{m}=1,B_{j}=0,0,...,0).
Proof.

The constraint \Phi_{CI}=0 (equation 5) can be written as

P(Bj=1|B1=0,,Bm=0,,BK=0)\displaystyle P^{*}(B_{j}=1|B_{1}=0,\cdots,B_{m}=0,\cdots,B_{K}=0) =\displaystyle=
P(Bj=1|B1=0,,Bm=1,,BK=0).\displaystyle P^{*}(B_{j}=1|B_{1}=0,\cdots,B_{m}=1,\cdots,B_{K}=0).

This equation is equivalent to

P(0,,0)\displaystyle P^{*}(0,\dots,0) =\displaystyle= P(Bm=0,Bj=1,0,,0)P(Bm=1,Bj=1,0,,0)\displaystyle\frac{P^{*}(B_{m}=0,B_{j}=1,0,\dots,0)}{P^{*}(B_{m}=1,B_{j}=1,0,\dots,0)} (21)
×P(Bm=1,Bj=0,0,,0).\displaystyle\times P^{*}(B_{m}=1,B_{j}=0,0,\dots,0).

As ΨCI1P(0,,0)\Psi_{CI}\equiv 1-P^{*}(0,\dots,0), we have

P(Bm=1,Bj=0,0,,0)=P(Bm=1,Bj=0,0,,0)ΨCI.P^{*}(B_{m}=1,B_{j}=0,0,\dots,0)=P(B_{m}=1,B_{j}=0,0,\dots,0)\Psi_{CI}.
P(Bm=0,Bj=1,0,,0)=P(Bm=0,Bj=1,0,,0)ΨCI.P^{*}(B_{m}=0,B_{j}=1,0,\dots,0)=P(B_{m}=0,B_{j}=1,0,\dots,0)\Psi_{CI}.
P(Bm=1,Bj=1,0,,0)=P(Bm=1,Bj=1,0,,0)ΨCI.P^{*}(B_{m}=1,B_{j}=1,0,\dots,0)=P(B_{m}=1,B_{j}=1,0,\dots,0)\Psi_{CI}.

Therefore, equation 21 is equivalent to

ΨCI=P(Bm=1,Bj=1,0,,0)P(Bm=1,Bj=1,0,,0)+C9.\Psi_{CI}=\frac{P(B_{m}=1,B_{j}=1,0,\dots,0)}{P(B_{m}=1,B_{j}=1,0,\dots,0)+C_{9}}.

where

C9=P(Bm=0,Bj=1,0,,0)P(Bm=1,Bj=0,0,,0).C_{9}=P(B_{m}=0,B_{j}=1,0,\dots,0)P(B_{m}=1,B_{j}=0,0,\dots,0).

9.5 Efficient influence curve under conditional independence assumption

Lemma 5.

For constraint ΦCI=0\Phi_{CI}=0 (equation 5), we have the efficient influence curve DΦCI(P)D^{*}_{\Phi_{CI}}(P) as

DΦCI(P)\displaystyle D^{*}_{\Phi_{CI}}(P) =\displaystyle= 1C5(P)(C6(P)C7(P)C8(P))\displaystyle\frac{1}{C_{5}(P)}(C_{6}(P)-C_{7}(P)-C_{8}(P))

where

C_{5}(P) = [P(B_{m}=1,B_{j}=1,0,\dots,0)+P(B_{m}=0,B_{j}=1,0,\dots,0)P(B_{m}=1,B_{j}=0,0,\dots,0)]^{2}.
C_{6}(P) = P(B_{m}=0,B_{j}=1,0,\dots,0)P(B_{m}=1,B_{j}=0,0,\dots,0)
[\mathbb{I}(B_{m}=1,B_{j}=1,0,\dots,0)-P(B_{m}=1,B_{j}=1,0,\dots,0)].
C7(P)\displaystyle C_{7}(P) =\displaystyle= P(Bm=1,Bj=0,0,,0)P(Bm=1,Bj=1,0,,0)\displaystyle P(B_{m}=1,B_{j}=0,0,...,0)P(B_{m}=1,B_{j}=1,0,...,0)
[𝕀(Bm=0,Bj=1,0,,0)P(Bm=0,Bj=1,0,,0)].\displaystyle[\mathbbm{I}(B_{m}=0,B_{j}=1,0,...,0)-P(B_{m}=0,B_{j}=1,0,...,0)].
C8(P)\displaystyle C_{8}(P) =\displaystyle= P(Bm=0,Bj=1,0,,0)P(Bm=1,Bj=1,0,,0)\displaystyle P(B_{m}=0,B_{j}=1,0,...,0)P(B_{m}=1,B_{j}=1,0,...,0)
[𝕀(Bm=1,Bj=0,0,,0)P(Bm=1,Bj=0,0,,0)].\displaystyle[\mathbbm{I}(B_{m}=1,B_{j}=0,0,...,0)-P(B_{m}=1,B_{j}=0,0,...,0)].
Proof.

By the delta method [26], the efficient influence curve of \Psi_{CI} can be written as a function of each component's influence curve. The efficient influence curves of the three components are as follows

DΦCI(P(Bm=1,Bj=1,0,,0))\displaystyle D^{*}_{\Phi_{CI}}(P(B_{m}=1,B_{j}=1,0,\dots,0)) =\displaystyle= 𝕀(Bm=1,Bj=1,0,,0)\displaystyle\mathbbm{I}(B_{m}=1,B_{j}=1,0,\dots,0) (22)
P(Bm=1,Bj=1,0,,0).\displaystyle-P(B_{m}=1,B_{j}=1,0,\dots,0).
DΦCI(P(Bm=0,Bj=1,0,,0))\displaystyle D^{*}_{\Phi_{CI}}(P(B_{m}=0,B_{j}=1,0,\dots,0)) =\displaystyle= 𝕀(Bm=0,Bj=1,0,,0)\displaystyle\mathbbm{I}(B_{m}=0,B_{j}=1,0,\dots,0) (23)
P(Bm=0,Bj=1,0,,0).\displaystyle-P(B_{m}=0,B_{j}=1,0,\dots,0).
DΦCI(P(Bm=1,Bj=0,0,,0))\displaystyle D^{*}_{\Phi_{CI}}(P(B_{m}=1,B_{j}=0,0,\dots,0)) =\displaystyle= 𝕀(Bm=1,Bj=0,0,,0)\displaystyle\mathbbm{I}(B_{m}=1,B_{j}=0,0,\dots,0) (24)
P(Bm=1,Bj=0,0,,0)).\displaystyle-P(B_{m}=1,B_{j}=0,0,\dots,0)).

Therefore, we only need to calculate the derivatives to get DΦCI(P)D^{*}_{\Phi_{CI}}(P), and the three parts of derivatives are given by

\frac{\partial\psi}{\partial P(B_{m}=1,B_{j}=1,0,\dots,0)} = \frac{P(B_{m}=0,B_{j}=1,0,\dots,0)\,P(B_{m}=1,B_{j}=0,0,\dots,0)}{C_{5}(P)}.
\frac{\partial\psi}{\partial P(B_{m}=0,B_{j}=1,0,\dots,0)} = -\frac{P(B_{m}=1,B_{j}=1,0,\dots,0)\,P(B_{m}=1,B_{j}=0,0,\dots,0)}{C_{5}(P)}.
\frac{\partial\psi}{\partial P(B_{m}=1,B_{j}=0,0,\dots,0)} = -\frac{P(B_{m}=1,B_{j}=1,0,\dots,0)\,P(B_{m}=0,B_{j}=1,0,\dots,0)}{C_{5}(P)}.

where

C_{5}(P)=[P(B_{m}=1,B_{j}=1,0,\dots,0)+P(B_{m}=0,B_{j}=1,0,\dots,0)P(B_{m}=1,B_{j}=0,0,\dots,0)]^{2},

as in the statement of lemma 5. Plugging each part into the delta-method expansion, together with equations 22-24, we have that lemma 5 is true. ∎

9.6 Asymptotic efficiency of TMLE

In this section we prove theorem 1 below, establishing the asymptotic efficiency of the TMLE. The empirical NPMLE is also efficient, since asymptotically all cells are filled; the TMLE is a finite-sample improvement and asymptotically behaves like the empirical NPMLE. The estimator P_{n}^{*} behaves like a parametric-model MLE and thus converges at rate o_{P}(1/\sqrt{n}).
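Concretely, the targeting condition P_{n}D^{*}_{\Phi_{I}}(P_{n}^{*})=o_{P}(1/\sqrt{n}) can be enforced with a one-dimensional exponential tilt of the initial fit. The sketch below reuses psi_I and eic_I from the section 5.3 sketch and substitutes a crude grid search for a proper root finder; it is one simple way to implement the update, not necessarily the paper's.

```python
import numpy as np

def tmle_update(p_init, counts, eps_grid=np.linspace(-2.0, 2.0, 4001)):
    """One targeting step: tilt p_eps(b) proportional to p(b) exp(eps D*(b)),
    choosing eps so that the empirical mean P_n D*(p_eps) is (near) zero."""
    emp = counts / counts.sum()
    d0 = eic_I(p_init)                       # tilt direction: EIC at the initial fit
    best_p, best_score = p_init, float("inf")
    for eps in eps_grid:
        p_eps = p_init * np.exp(eps * d0)
        p_eps = p_eps / p_eps.sum()          # renormalize over the 7 observed cells
        score = abs(float(np.sum(emp * eic_I(p_eps))))
        if score < best_score:
            best_p, best_score = p_eps, score
    return psi_I(best_p), best_p
```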

Theorem 1.

Consider the TMLE ΨI(Pn)\Psi_{I}(P_{n}^{*}) of ΨI(P0)\Psi_{I}(P_{0}) defined above, satisfying PnDΦI(Pn)=oP(1/n)P_{n}D^{*}_{\Phi_{I}}(P_{n}^{*})=o_{P}(1/\sqrt{n}). We assume P0(b)>0P_{0}(b)>0 for all b0b\not=0. If PnP0=oP(n1/4)\parallel P_{n}^{*}-P_{0}\parallel=o_{P}(n^{-1/4}), then ΨI(Pn)\Psi_{I}(P_{n}^{*}) is an asymptotically efficient estimator of ΨI(P0)\Psi_{I}(P_{0}):

ΨI(Pn)ΨI(P0)=(PnP0)DΦI(P0)+oP(1/n).\Psi_{I}(P_{n}^{*})-\Psi_{I}(P_{0})=(P_{n}-P_{0})D^{*}_{\Phi_{I}}(P_{0})+o_{P}(1/\sqrt{n}).
Proof.

Define f_{I,0}(b)=f_{I}(b)+f_{I}(0). Note that

P0DΦI(P)=(1)KψI(1ψI)b0fI,0(b)(P0P)(b)P(b).P_{0}D^{*}_{\Phi_{I}}(P)=(-1)^{K}\psi_{I}(1-\psi_{I})\sum_{b\not=0}f_{I,0}(b)\frac{(P_{0}-P)(b)}{P(b)}.

Consider the second order Taylor expansion of ΨI(P0)\Psi_{I}(P_{0}) at PP:

ΨI(P0)ΨI(P)=dΨI(P)(P0P)+R2(P,P0),\Psi_{I}(P_{0})-\Psi_{I}(P)=d\Psi_{I}(P)(P_{0}-P)+R_{2}(P,P_{0}),

where

dΨI(P)(h)\displaystyle d\Psi_{I}(P)(h) =\displaystyle= ddϵΨI(P+ϵh)|ϵ=0\displaystyle\left.\frac{d}{d\epsilon}\Psi_{I}(P+\epsilon h)\right|_{\epsilon=0}
=\displaystyle= (1)KΨI(P)(1ΨI(P))b0fI(b)h(b)P(b),\displaystyle(-1)^{K}\Psi_{I}(P)(1-\Psi_{I}(P))\sum_{b\not=0}\frac{f_{I}(b)h(b)}{P(b)},
R2(P,P0)\displaystyle R_{2}(P,P_{0}) =\displaystyle= 12d2ΨI(P+ϵh)dϵ2(P0P)2|ϵ=0+oP((P0P)2)\displaystyle\left.\frac{1}{2}\frac{d^{2}\Psi_{I}(P+\epsilon h)}{d\epsilon^{2}}(P_{0}-P)^{2}\right|_{\epsilon=0}+o_{P}((P_{0}-P)^{2}) (25)
=12ΨI(P)(1ΨI(P))(P0P)2(b)×\displaystyle=\frac{1}{2}\Psi_{I}(P)(1-\Psi_{I}(P))(P_{0}-P)^{2}(b)\times
[(12ΨI(P))(b0fI(b)h(b)P(b))2+(1)K+1b0fI(b)h(b)2P(b)2]\displaystyle[(1-2\Psi_{I}(P))(\sum_{b\not=0}\frac{f_{I}(b)h(b)}{P(b)})^{2}+(-1)^{K+1}\sum_{b\not=0}\frac{f_{I}(b)h(b)^{2}}{P(b)^{2}}]
+oP((P0P)2(b))\displaystyle+o_{P}((P_{0}-P)^{2}(b))

From equation 25 we know that R2(P,P0)R_{2}(P,P_{0}) is a second order term involving square differences (P0P)2(b)(P_{0}-P)^{2}(b) for b0b\not=0. Thus, we observe that

P0DΦI(P)=dΨI(P)(P0P).P_{0}D^{*}_{\Phi_{I}}(P)=d\Psi_{I}(P)(P_{0}-P).

This proves that

P0DΦI(P)=ΨI(P0)ΨI(P)R2(P,P0).P_{0}D^{*}_{\Phi_{I}}(P)=\Psi_{I}(P_{0})-\Psi_{I}(P)-R_{2}(P,P_{0}).

We can apply this identity to PnP_{n}^{*} so that we obtain P0DΦI(Pn)=ΨI(P0)ΨI(Pn)R2(Pn,P0)P_{0}D^{*}_{\Phi_{I}}(P_{n}^{*})=\Psi_{I}(P_{0})-\Psi_{I}(P_{n}^{*})-R_{2}(P_{n}^{*},P_{0}). Combining this identity with PnDΦI(Pn)=0P_{n}^{*}D^{*}_{\Phi_{I}}(P_{n}^{*})=0 yields:

ΨI(Pn)ΨI(P0)=(PnP0)DΦI(Pn)+R2(Pn,P0).\Psi_{I}(P_{n}^{*})-\Psi_{I}(P_{0})=(P_{n}-P_{0})D^{*}_{\Phi_{I}}(P_{n}^{*})+R_{2}(P_{n}^{*},P_{0}).

We will assume that P_{n}^{*} is consistent for P_{0} and P_{0}(b)>0 for all b\neq 0, so that it follows that D^{*}_{\Phi_{I}}(P_{n}^{*}) is uniformly bounded by an M<\infty with probability tending to 1, that it falls in a P_{0}-Donsker class (the dimension is finite, 2^{K}-2), and that P_{0}\{D^{*}_{\Phi_{I}}(P_{n}^{*})-D^{*}_{\Phi_{I}}(P_{0})\}^{2}\rightarrow 0 in probability as n\rightarrow\infty. By empirical process theory, it now follows that

(PnP0)DΦI(Pn)=(PnP0)DΦI(P0)+oP(1/n).(P_{n}-P_{0})D^{*}_{\Phi_{I}}(P_{n}^{*})=(P_{n}-P_{0})D^{*}_{\Phi_{I}}(P_{0})+o_{P}(1/\sqrt{n}).

We also note that R2(Pn,P0)R_{2}(P_{n}^{*},P_{0}) has denominators that are bounded away from zero, so that R2(Pn,P0)=oP(1/n)R_{2}(P_{n}^{*},P_{0})=o_{P}(1/\sqrt{n}) if PnP0=oP(n1/4)\parallel P_{n}^{*}-P_{0}\parallel=o_{P}(n^{-1/4}) (e.g., Euclidean norm). Thus theorem 1 is proved. ∎

10 Acknowledgement

This research was supported by NIH grant R01AI125842 and NIH grant 2R01AI074345.