
Substitute adjustment via recovery of latent variables

Jeffrey Adams [email protected]  and  Niels Richard Hansen [email protected] Department of Mathematical Sciences, University of Copenhagen
Universitetsparken 5, Copenhagen, 2100, Denmark
Abstract.

The deconfounder was proposed as a method for estimating causal parameters in a context with multiple causes and unobserved confounding. It is based on recovery of a latent variable from the observed causes. We disentangle the causal interpretation from the statistical estimation problem and show that the deconfounder in general estimates adjusted regression target parameters. It does so by outcome regression adjusted for the recovered latent variable, termed the substitute. We refer to the general algorithm, stripped of causal assumptions, as substitute adjustment. We give theoretical results to support that substitute adjustment estimates adjusted regression parameters when the regressors are conditionally independent given the latent variable. We also introduce a variant of our substitute adjustment algorithm that estimates an assumption-lean target parameter with minimal model assumptions. We then give finite sample bounds and asymptotic results supporting substitute adjustment estimation in the case where the latent variable takes values in a finite set. A simulation study illustrates finite sample properties of substitute adjustment. Our results support that when the latent variable model of the regressors holds, substitute adjustment is a viable method for adjusted regression.

1. Introduction

The deconfounder was proposed by Wang & Blei (2019) as a general algorithm for estimating causal parameters via outcome regression when: (1) there are multiple observed causes of the outcome; (2) the causal effects are potentially confounded by a latent variable; (3) the causes are conditionally independent given a latent variable $Z$. The proposal spurred discussion and criticism; see the comments to (Wang & Blei, 2019) and the contributions by D’Amour (2019); Ogburn et al. (2020) and Grimmer et al. (2023). One question raised was whether the assumptions made by Wang & Blei (2019) are sufficient to claim that the deconfounder estimates a causal parameter. Though an amendment by Wang & Blei (2020) addressed the criticism and clarified their assumptions, it did not resolve all questions regarding the deconfounder.

The key idea of the deconfounder is to recover the latent variable $Z$ from the observed causes and use this substitute confounder as a replacement for the unobserved confounder. The causal parameter is then estimated by outcome regression using the substitute confounder for adjustment. This way of adjusting for potential confounding has been in widespread use for some time in genetics and genomics, where, e.g., EIGENSTRAT based on PCA (Patterson et al., 2006; Price et al., 2006) was proposed to adjust for population structure in genome-wide association studies (GWASs); see also (Song et al., 2015). Similarly, surrogate variable adjustment (Leek & Storey, 2007) adjusts for unobserved factors causing unwanted variation in gene expression measurements.

In our view, the discussion regarding the deconfounder was muddled by several issues. First, issues with non-identifiability of target parameters from the observational distribution with a finite number of observed causes led to confusion. Second, the causal role of the latent variable $Z$ and its causal relations to any unobserved confounder were difficult to grasp. Third, there was a lack of theory supporting that the deconfounder actually estimates causal target parameters consistently. We defer the treatment of the thorny causal interpretation of the deconfounder to the discussion in Section 5 and focus here on the statistical aspects.

In our view, the statistical problem is best treated as adjusted regression without insisting on a causal interpretation. Suppose that we observe a real-valued outcome variable $Y$ and additional variables $X_1, X_2, \ldots, X_p$. We can then be interested in estimating the adjusted regression function

(1) $x \mapsto \mathbb{E}\left[\mathbb{E}\left[Y \mid X_i = x; \mathbf{X}_{-i}\right]\right],$

where $\mathbf{X}_{-i}$ denotes all variables but $X_i$. That is, we adjust for all other variables when regressing $Y$ on $X_i$. The adjusted regression function could have a causal interpretation in some contexts, but it is also of interest without one. It can, for instance, be used to study the added predictive value of $X_i$, and it is constant (as a function of $x$) if and only if $\mathbb{E}\left[Y \mid X_i = x; \mathbf{X}_{-i}\right] = \mathbb{E}\left[Y \mid \mathbf{X}_{-i}\right]$; that is, if and only if $Y$ is conditionally mean independent of $X_i$ given $\mathbf{X}_{-i}$ (Lundborg et al., 2023).

In the context of a GWAS, $Y$ is a continuous phenotype and $X_i$ represents a single nucleotide polymorphism (SNP) at genomic site $i$. The regression function (1) quantifies how much the SNP at site $i$ adds to the prediction of the phenotype outcome on top of all other SNP sites. In practice, only a fraction of all SNPs along the genome are observed, yet the number of SNPs can be in the millions, and estimation of the full regression model $\mathbb{E}\left[Y \mid X_i = x; \mathbf{X}_{-i} = \mathbf{x}_{-i}\right]$ can be impossible without model assumptions. Thus, if the regression function (1) is the target of interest, it is extremely useful if we, by adjusting for a substitute of a latent variable, can obtain a computationally efficient and statistically valid estimator of (1).

From our perspective, when viewing the problem as adjusted regression, the most pertinent questions are: (1) when is adjustment by the latent variable $Z$ instead of $\mathbf{X}_{-i}$ appropriate; (2) can adjustment by substitutes of the latent variable, recovered from the observed $X_i$-s, be justified; (3) can we establish an asymptotic theory that allows for statistical inference when adjusting for substitutes?

With the aim of answering the three questions above, this paper makes two main contributions:

A transparent statistical framework. We focus on estimation of the adjusted mean, thereby disentangling the statistical problem from the causal discussion. This way the target of inference is clear, and so are the assumptions we need about the observational distribution in terms of the latent variable model. We present in Section 2 a general framework with an infinite number of $X_i$-variables, and we present clear assumptions implying that we can replace adjustment by $\mathbf{X}_{-i}$ with adjustment by $Z$. Within the general framework, we subsequently present an assumption-lean target parameter that is interpretable without restrictive model assumptions on the regression function.

A novel theoretical analysis. By restricting attention to the case where the latent variable $Z$ takes values in a finite set, we give in Section 3 bounds on the estimation error due to using substitutes and on the recovery error, that is, the substitute mislabeling rate. These bounds quantify, among other things, how the errors depend on $p$, the actual (finite) number of $X_i$-s used for recovery. With minimal assumptions on the conditional distributions in the latent variable model and on the outcome model, we use our bounds to derive asymptotic conditions ensuring that the assumption-lean target parameter can be estimated just as well using substitutes as if the latent variables were observed.

To implement substitute adjustment in practice, we leverage recent developments on estimation in finite mixture models via tensor methods, which are computationally and statistically efficient in high dimensions. We illustrate our results via a simulation study in Section 4. Proofs and auxiliary results are in Appendix A. Appendix B contains a complete characterization of when recovery of $Z$ is possible from an infinite $\mathbf{X}$ in a Gaussian mixture model.

1.1. Relation to existing literature

Our framework and results are based on ideas by Wang & Blei (2019, 2020) and the literature preceding them on adjustment by surrogate/substitute variables. We add new results to this line of research on the theoretical justification of substitute adjustment as a method for estimation.

There is some literature on the theoretical properties of tests and estimators in high-dimensional problems with latent variables. Somewhat related to our framework is the work by Wang et al. (2017) on adjustment for latent confounders in multiple testing, motivated by applications to gene expression analysis. More directly related is the work by Ćevid et al. (2020) and Guo, Ćevid & Bühlmann (2022), who analyze estimators within a linear modelling framework with unobserved confounding. While their methods and results are definitely interesting, they differ from substitute adjustment, since they do not directly attempt to recover the latent variables. The linearity and sparsity assumptions, which we will not make, play an important role for their methods and analysis.

The paper by Grimmer et al. (2023) comes closest to our framework and analysis. Grimmer et al. (2023) present theoretical results and extensive numerical examples, primarily with a continuous latent variable. Their results are not favorable for the deconfounder, and they conclude that the deconfounder is “not a viable substitute for careful research design in real-world applications”. Their theoretical analyses are mostly in terms of computing the population (or $n$-asymptotic) bias of a method for a finite $p$ (the number of $X_i$-variables), and then possibly investigating the limit of the bias as $p$ tends to infinity. Compared to this, we analyze the asymptotic behaviour of the estimator based on substitute adjustment as $n$ and $p$ tend to infinity jointly. Moreover, since we specifically treat discrete latent variables, some of our results are also in a different framework.

2. Substitute adjustment

2.1. The General Model

The full model is specified in terms of variables $(\mathbf{X}, Y)$, where $Y \in \mathbb{R}$ is a real-valued outcome variable of interest and $\mathbf{X} \in \mathbb{R}^{\mathbb{N}}$ is an infinite vector of additional real-valued variables. That is, $\mathbf{X} = (X_i)_{i \in \mathbb{N}}$ with $X_i \in \mathbb{R}$ for $i \in \mathbb{N}$. We let $\mathbf{X}_{-i} = (X_j)_{j \in \mathbb{N} \setminus \{i\}}$, and define (informally) for each $i \in \mathbb{N}$ and $x \in \mathbb{R}$ the target parameter of interest

(2) $\chi_x^i = \mathbb{E}\left[\mathbb{E}\left[Y \mid X_i = x; \mathbf{X}_{-i}\right]\right].$

That is, $\chi_x^i$ is the mean outcome given $X_i = x$ when adjusting for all remaining variables $\mathbf{X}_{-i}$. Since $\mathbb{E}\left[Y \mid X_i = x; \mathbf{X}_{-i}\right]$ is generally not uniquely defined for all $x \in \mathbb{R}$ by the distribution of $(\mathbf{X}, Y)$, we need some additional structure to formally define $\chi^i_x$. The following assumption and subsequent definition achieve this by assuming that a particular choice of the conditional expectation is made and remains fixed. Throughout, $\mathbb{R}$ is equipped with the Borel $\sigma$-algebra and $\mathbb{R}^{\mathbb{N}}$ with the corresponding product $\sigma$-algebra.

Assumption 1 (Regular Conditional Distribution).

Fix for each $i \in \mathbb{N}$ a Markov kernel $(P_{x,\mathbf{x}}^i)_{(x,\mathbf{x}) \in \mathbb{R} \times \mathbb{R}^{\mathbb{N}}}$ on $\mathbb{R}$. Assume that $P_{x,\mathbf{x}}^i$ is the regular conditional distribution of $Y$ given $(X_i, \mathbf{X}_{-i}) = (x, \mathbf{x})$ for all $x \in \mathbb{R}$, $\mathbf{x} \in \mathbb{R}^{\mathbb{N}}$ and $i \in \mathbb{N}$. With $P^{-i}$ the distribution of $\mathbf{X}_{-i}$, suppose additionally that

$\iint |y| \, P^i_{x,\mathbf{x}}(\mathrm{d}y) P^{-i}(\mathrm{d}\mathbf{x}) < \infty$

for all $x \in \mathbb{R}$.

Definition 1.

Under Assumption 1 we define

(3) $\chi_x^i = \iint y \, P^i_{x,\mathbf{x}}(\mathrm{d}y) P^{-i}(\mathrm{d}\mathbf{x}).$
Remark 1.

Definition 1 makes the choice of conditional expectation explicit by letting

$\mathbb{E}\left[Y \mid X_i = x; \mathbf{X}_{-i}\right] = \int y \, P^i_{x,\mathbf{X}_{-i}}(\mathrm{d}y)$

be defined in terms of the specific regular conditional distribution that is fixed according to Assumption 1. We may need additional regularity assumptions to identify this Markov kernel from the distribution of $(\mathbf{X}, Y)$, which we will not pursue here.

The main assumption in this paper is the existence of a latent variable, $Z$, that will render the $X_i$-s conditionally independent, and which can be recovered from $\mathbf{X}$ in a suitable way. The variable $Z$ will take values in a measurable space $(E, \mathcal{E})$, which we assume to be a Borel space. We use the notation $\sigma(Z)$ and $\sigma(\mathbf{X}_{-i})$ to denote the $\sigma$-algebras generated by $Z$ and $\mathbf{X}_{-i}$, respectively.

Assumption 2 (Latent Variable Model).

There is a random variable $Z$ with values in $(E, \mathcal{E})$ such that:

  (1) $X_1, X_2, \ldots$ are conditionally independent given $Z$;

  (2) $\sigma(Z) \subseteq \bigcap_{i=1}^{\infty} \sigma(\mathbf{X}_{-i})$.

The latent variable model given by Assumption 2 allows us to identify the adjusted mean by adjusting for the latent variable only.

Proposition 1.

Fix $i \in \mathbb{N}$ and let $P^{-i}_z$ denote a regular conditional distribution of $\mathbf{X}_{-i}$ given $Z = z$. Under Assumptions 1 and 2, the Markov kernel

(4) $Q_{x,z}^i(A) = \int P^i_{x,\mathbf{x}}(A) P^{-i}_z(\mathrm{d}\mathbf{x}), \qquad A \subseteq \mathbb{R},$

is a regular conditional distribution of $Y$ given $(X_i, Z) = (x, z)$, in which case

(5) $\chi_x^i = \iint y \, Q_{x,z}^i(\mathrm{d}y) P^Z(\mathrm{d}z) = \mathbb{E}\left[\mathbb{E}\left[Y \mid X_i = x; Z\right]\right].$
Figure 1. Directed Acyclic Graph (DAG), with nodes $Z$, $X_i$, $\mathbf{X}_{-i}$ and $Y$, representing the joint distribution of $(X_i, \mathbf{X}_{-i}, Z, Y)$. The variable $Z$ blocks the backdoor from $X_i$ to $Y$.

The joint distribution of $(X_i, \mathbf{X}_{-i}, Z, Y)$ is, by Assumption 2, Markov w.r.t. the graph in Figure 1. Proposition 1 is essentially the backdoor criterion, since $Z$ blocks the backdoor from $X_i$ to $Y$ via $\mathbf{X}_{-i}$; see Theorem 3.3.2 in (Pearl, 2009) or Proposition 6.41(ii) in (Peters et al., 2017). Nevertheless, we include a proof in Appendix A for two reasons. First, Proposition 1 does not involve causal assumptions about the model, and we want to clarify that the mathematical result is agnostic to such assumptions. Second, the proof we give of Proposition 1 does not require regularity assumptions, such as densities of the conditional distributions, but it relies subtly on Assumption 2(2).

Example 1.

Suppose $\mathbb{E}[|X_i|] \leq C$ for all $i$ and some finite constant $C$, and assume, for simplicity, that $\mathbb{E}[X_i] = 0$. Let $\boldsymbol{\beta} = (\beta_i)_{i \in \mathbb{N}} \in \ell_1$ and define

$\langle \boldsymbol{\beta}, \mathbf{X} \rangle = \sum_{i=1}^{\infty} \beta_i X_i.$

The infinite sum converges almost surely since $\boldsymbol{\beta} \in \ell_1$. With $\varepsilon$ being $\mathcal{N}(0,1)$-distributed and independent of $\mathbf{X}$, consider the outcome model

$Y = \langle \boldsymbol{\beta}, \mathbf{X} \rangle + \varepsilon.$

Letting $\boldsymbol{\beta}_{-i}$ denote the $\boldsymbol{\beta}$-sequence with the $i$-th coordinate removed, a straightforward, though slightly informal, computation gives

$\chi^i_x = \mathbb{E}\left[\mathbb{E}\left[\beta_i X_i + \langle \boldsymbol{\beta}_{-i}, \mathbf{X}_{-i} \rangle \mid X_i = x; \mathbf{X}_{-i}\right]\right] = \beta_i x + \mathbb{E}\left[\langle \boldsymbol{\beta}_{-i}, \mathbf{X}_{-i} \rangle\right] = \beta_i x + \langle \boldsymbol{\beta}_{-i}, \mathbb{E}\left[\mathbf{X}_{-i}\right] \rangle = \beta_i x.$

To fully justify the computation via Assumption 1, we let $P^i_{x,\mathbf{x}}$ be the $\mathcal{N}(\beta_i x + \langle \boldsymbol{\beta}_{-i}, \mathbf{x} \rangle, 1)$-distribution for the $P^{-i}$-almost all $\mathbf{x}$ where $\langle \boldsymbol{\beta}_{-i}, \mathbf{x} \rangle$ is well defined. For the remaining $\mathbf{x}$ we let $P^i_{x,\mathbf{x}}$ be the $\mathcal{N}(\beta_i x, 1)$-distribution. Then $P^i_{x,\mathbf{x}}$ is a regular conditional distribution of $Y$ given $(X_i, \mathbf{X}_{-i}) = (x, \mathbf{x})$,

$\int y \, P_{x,\mathbf{x}}^i(\mathrm{d}y) = \beta_i x + \langle \boldsymbol{\beta}_{-i}, \mathbf{x} \rangle \quad \text{for } P^{-i}\text{-almost all } \mathbf{x},$

and $\chi^i_x = \beta_i x$ follows from (3). It also follows from (4) that for $P^Z$-almost all $z \in E$,

$\mathbb{E}\left[Y \mid X_i = x; Z = z\right] = \int y \, Q^i_{x,z}(\mathrm{d}y) = \beta_i x + \int \langle \boldsymbol{\beta}_{-i}, \mathbf{x} \rangle P^{-i}_z(\mathrm{d}\mathbf{x}) = \beta_i x + \sum_{j \neq i} \beta_j \mathbb{E}[X_j \mid Z = z].$

That is, with $\Gamma_{-i}(z) = \sum_{j \neq i} \beta_j \mathbb{E}[X_j \mid Z = z]$, the regression model

$\mathbb{E}\left[Y \mid X_i = x; Z = z\right] = \beta_i x + \Gamma_{-i}(z)$

is a partially linear model.

Example 2.

While Example 1 is explicit about the outcome model, it does not describe an explicit latent variable model fulfilling Assumption 2. To this end, take $E = \mathbb{R}$, let $Z', U_1, U_2, \ldots$ be i.i.d. $\mathcal{N}(0,1)$-distributed and set $X_i = Z' + U_i$. By the Law of Large Numbers, for any $i \in \mathbb{N}$,

$\frac{1}{n} \sum_{j=1; j \neq i}^{n+1} X_j = Z' + \frac{1}{n} \sum_{j=1; j \neq i}^{n+1} U_j \rightarrow Z'$

almost surely as $n \to \infty$. Setting

$Z = \begin{cases} \lim\limits_{n \to \infty} \frac{1}{n} \sum_{j=1; j \neq i}^{n+1} X_j & \text{if the limit exists} \\ 0 & \text{otherwise,} \end{cases}$

we get that $\sigma(Z) \subseteq \sigma(\mathbf{X}_{-i})$ for any $i \in \mathbb{N}$ and $Z = Z'$ almost surely. Thus, Assumption 2 holds.

Continuing with the outcome model from Example 1, we see that for $P^Z$-almost all $z \in E$,

$\mathbb{E}[X_j \mid Z = z] = \mathbb{E}[Z' + U_j \mid Z = z] = z,$

thus $\Gamma_{-i}(z) = \gamma_{-i} z$ with $\gamma_{-i} = \sum_{j \neq i} \beta_j$. In this example it is actually possible to compute the regular conditional distribution, $Q^i_{x,z}$, of $Y$ given $(X_i, Z) = (x, z)$ explicitly. It is the $\mathcal{N}\left(\beta_i x + \gamma_{-i} z, 1 + \|\boldsymbol{\beta}_{-i}\|_2^2\right)$-distribution, where $\|\boldsymbol{\beta}_{-i}\|_2^2 = \langle \boldsymbol{\beta}_{-i}, \boldsymbol{\beta}_{-i} \rangle$.
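The following minimal simulation sketch (not part of the paper; the sample sizes, seed and truncation of $\boldsymbol{\beta}$ to $p$ coordinates are our own illustrative choices) checks Examples 1 and 2 numerically: averaging the $X_j$-s recovers $Z'$, and OLS of $Y$ on $(1, X_i, \hat{Z})$ approximately recovers $\beta_i$.

```python
# Simulation sketch of Examples 1 and 2 (illustrative choices of n, p, beta
# and the seed are ours): the average of the X_j-s recovers Z', and OLS of
# Y on (1, X_i, Zhat) approximately recovers beta_i.
import numpy as np

rng = np.random.default_rng(0)
n, p, i = 5000, 200, 0                    # samples, observed coordinates, target index
beta = 0.9 ** np.arange(p)                # summable coefficient sequence, truncated to p terms

z_prime = rng.normal(size=n)              # latent variable Z'
X = z_prime[:, None] + rng.normal(size=(n, p))   # X_j = Z' + U_j
Y = X @ beta + rng.normal(size=n)         # outcome model from Example 1 (truncated to p terms)

z_hat = X[:, np.arange(p) != i].mean(axis=1)     # substitute: average over j != i

# OLS of Y on (1, X_i, Zhat); the X_i-coefficient targets beta_i = beta[i]
D = np.column_stack([np.ones(n), X[:, i], z_hat])
coef, *_ = np.linalg.lstsq(D, Y, rcond=None)
print("true beta_i:", beta[i], "estimate:", coef[1])
```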

2.2. Substitute Latent Variable Adjustment

Proposition 1 tells us that under Assumptions 1 and 2 the adjusted mean $\chi_x^i$, defined by adjusting for the entire infinite vector $\mathbf{X}_{-i}$, is also given by adjusting for the latent variable $Z$. If the latent variable were observed, we could estimate $\chi^i_x$ in terms of an estimate of the following regression function.

Definition 2 (Regression function).

Under Assumptions 1 and 2 define the regression function

(6) $b_x^i(z) = \int y \, Q_{x,z}^i(\mathrm{d}y) = \mathbb{E}\left[Y \mid X_i = x; Z = z\right],$

where $Q_{x,z}^i$ is given by (4).

If we had $n$ i.i.d. observations, $(x_{i,1}, z_1, y_1), \ldots, (x_{i,n}, z_n, y_n)$, of $(X_i, Z, Y)$, a straightforward plug-in estimate of $\chi^i_x$ would be

(7) $\hat{\chi}^i_x = \frac{1}{n} \sum_{k=1}^n \hat{b}_x^i(z_k),$

where $\hat{b}_x^i(z)$ is an estimate of the regression function $b_x^i(z)$. In practice we do not observe the latent variable $Z$. Though Assumption 2(2) implies that $Z$ can be recovered from $\mathbf{X}$, we do not assume that we know this recovery map, nor do we in practice observe the entire $\mathbf{X}$, but only the first $p$ coordinates, $\mathbf{X}_{1:p} = (X_1, \ldots, X_p)$.

We thus need an estimate of a recovery map, $\hat{f}^p : \mathbb{R}^p \to E$, such that for the substitute latent variable $\hat{Z} = \hat{f}^p(\mathbf{X}_{1:p})$ the $\sigma$-algebra $\sigma(\hat{Z})$ approximately contains the same information as $\sigma(Z)$ (we can in general only hope to learn a recovery map of $Z$ up to a Borel isomorphism, but this is also all that is needed, cf. Assumption 2). Using such substitutes, a natural way to estimate $\chi^i_x$ is given by Algorithm 1, which is a general three-step procedure returning the estimate $\widehat{\chi}^{i,\mathrm{sub}}_x$.

1  input: data $\mathcal{S}_0 = \{\mathbf{x}_{1:p,1}^0, \ldots, \mathbf{x}_{1:p,m}^0\}$ and $\mathcal{S} = \{(\mathbf{x}_{1:p,1}, y_1), \ldots, (\mathbf{x}_{1:p,n}, y_n)\}$, a set $E$, $i \in \{1, \ldots, p\}$ and $x \in \mathbb{R}$;
2  options: a method for estimating a recovery map $f^p : \mathbb{R}^p \to E$; a method for estimating the regression function $z \mapsto b_x^i(z)$;
3  begin
4      use the data in $\mathcal{S}_0$ to compute the estimate $\hat{f}^p$ of the recovery map.
5      use the data in $\mathcal{S}$ to compute the substitute latent variables as $\hat{z}_k := \hat{f}^p(\mathbf{x}_{1:p,k})$, $k = 1, \ldots, n$.
6      use the data in $\mathcal{S}$ combined with the substitutes to compute the regression function estimate, $z \mapsto \hat{b}_x^i(z)$, and set
       $\widehat{\chi}^{i,\mathrm{sub}}_x = \frac{1}{n} \sum_{k=1}^n \hat{b}_x^i(\hat{z}_k).$
7  end
8  return $\widehat{\chi}^{i,\mathrm{sub}}_x$

Algorithm 1: General Substitute Adjustment

The regression estimate $\hat{b}_x^i(z)$ in Algorithm 1 is computed on the basis of the substitutes, which likewise enter into the final computation of $\widehat{\chi}^{i,\mathrm{sub}}_x$. Thus the estimate directly estimates $\chi^{i,\mathrm{sub}}_x = \mathbb{E}\left[\mathbb{E}\left[Y \mid X_i = x; \hat{Z}\right] \mid \hat{f}^p\right]$, and it is expected to be biased as an estimate of $\chi^i_x$. The general idea is that under some regularity assumptions, and for $p \to \infty$ and $m \to \infty$ appropriately, $\chi^{i,\mathrm{sub}}_x \to \chi^i_x$ and the bias vanishes asymptotically. Section 3 specifies a setup where such a result is shown rigorously.

Note that the estimated recovery map $\hat{f}^p$ in Algorithm 1 is the same for all $i = 1, \ldots, p$. Thus, for any fixed $i$, the $x_{i,k}^0$-s are used for estimation of the recovery map, and the $x_{i,k}$-s are used for the computation of the substitutes. Steps 4 and 5 of the algorithm could be changed to construct a recovery map $\hat{f}_{-i}^p$ independent of the $i$-th coordinate. This appears to align better with Assumption 2, and it would most likely make the $\hat{z}_k$-s slightly less correlated with the $x_{i,k}$-s. It would, on the other hand, lead to a slightly larger recovery error and, worse, a substantial increase in the computational complexity if we want to estimate $\widehat{\chi}_x^{i,\mathrm{sub}}$ for all $i = 1, \ldots, p$.

Algorithm 1 leaves some options open. First, the estimation method used to compute $\hat{f}^p$ could be based on any method for estimating a recovery map, e.g., using a factor model if $E = \mathbb{R}$ or a mixture model if $E$ is finite. The idea of such methods is to compute a parsimonious $\hat{f}^p$ such that: (1) conditionally on $\hat{z}^0_k = \hat{f}^p(\mathbf{x}_{1:p,k}^0)$, the observations $x_{1,k}^0, \ldots, x_{p,k}^0$ are approximately independent for $k = 1, \ldots, m$; and (2) $\hat{z}^0_k$ is minimally predictive of $x_{i,k}^0$ for $i = 1, \ldots, p$. Second, the regression method for estimation of the regression function $b_x^i(z)$ could be any parametric or nonparametric method. If $E = \mathbb{R}$ we could use OLS combined with the parametric model $b_x^i(z) = \beta_0 + \beta_i x + \gamma_{-i} z$, which would lead to the estimate

$\widehat{\chi}^{i,\mathrm{sub}}_x = \hat{\beta}_0 + \hat{\beta}_i x + \hat{\gamma}_{-i} \frac{1}{n} \sum_{k=1}^n \hat{z}_k.$

If $E$ is finite, we could still use OLS but now combined with the parametric model $b_x^i(z) = \beta_{i,z}' x + \gamma_{-i,z}$, which would lead to the estimate

$\widehat{\chi}^{i,\mathrm{sub}}_x = \left(\frac{1}{n} \sum_{k=1}^n \hat{\beta}_{i,\hat{z}_k}'\right) x + \frac{1}{n} \sum_{k=1}^n \hat{\gamma}_{-i,\hat{z}_k}.$

The relation between the two datasets in Algorithm 1 is not specified by the algorithm either. It is possible that they are independent, e.g., by data splitting, in which case $\hat{f}^p$ is independent of the data in $\mathcal{S}$. It is also possible that $m = n$ and $\mathbf{x}_{1:p,k}^0 = \mathbf{x}_{1:p,k}$ for $k = 1, \ldots, n$. While we will assume that $\mathcal{S}_0$ and $\mathcal{S}$ are independent for the theoretical analysis, the $\mathbf{x}_{1:p}$-s from $\mathcal{S}$ will in practice often be part of $\mathcal{S}_0$, if not all of $\mathcal{S}_0$.
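As one concrete instantiation of Algorithm 1 (a sketch under assumptions not made by the paper: a finite $E$, scikit-learn's GaussianMixture as the recovery method, and the piecewise linear regression $b_x^i(z) = \beta'_{i,z} x + \gamma_{-i,z}$), the three steps can be written as follows; the function name and defaults are ours.

```python
# Sketch of Algorithm 1 with a finite E: recovery via a diagonal Gaussian
# mixture (one possible choice; the paper leaves the method open) and OLS of
# y on x_i within each substitute class.
import numpy as np
from sklearn.mixture import GaussianMixture

def substitute_adjustment(X0, X, y, i, x, K):
    """Return chi-hat_x^{i,sub} for S_0 = X0 (m x p) and S = (X, y) (n x p, n)."""
    # Step 4: estimate the recovery map f^p on S_0
    gm = GaussianMixture(n_components=K, covariance_type="diag", random_state=0).fit(X0)
    # Step 5: substitutes for the samples in S
    z_hat = gm.predict(X)
    # Step 6: regress y on x_i within each class, then average b-hat_x^i(z_hat_k) over k
    chi, n = 0.0, len(y)
    for z in range(K):
        idx = z_hat == z
        if idx.sum() < 2:                  # skip (nearly) empty classes in this sketch
            continue
        D = np.column_stack([np.ones(idx.sum()), X[idx, i]])
        gamma, beta = np.linalg.lstsq(D, y[idx], rcond=None)[0]
        chi += idx.sum() / n * (gamma + beta * x)
    return chi
```

The final weighted sum equals $\frac{1}{n}\sum_{k=1}^n \hat{b}_x^i(\hat{z}_k)$, since $\hat{b}_x^i(\hat{z}_k)$ is constant within each substitute class.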

2.3. Assumption-Lean Substitute Adjustment

If the regression model in the general Algorithm 1 is misspecified, we cannot expect $\widehat{\chi}^{i,\mathrm{sub}}_x$ to be a consistent estimate of $\chi^i_x$. In Section 3 we investigate the distribution of a substitute adjustment estimator in the case where $E$ is finite. It is possible to carry out this investigation assuming a partially linear regression model, $b_x^i(z) = \beta_i x + \Gamma_{-i}(z)$, but the results would then hinge on this model being correct. To circumvent such a model assumption, we proceed instead in the spirit of assumption-lean regression (Berk et al., 2021; Vansteelandt & Dukes, 2022). Thus we focus on a univariate target parameter defined as a functional of the data distribution, and we then investigate its estimation via substitute adjustment.

Assumption 3 (Moments).

It holds that $\mathbb{E}[Y^2] < \infty$, $\mathbb{E}[X_i^2] < \infty$ and $\mathbb{E}\left[\operatorname{Var}\left[X_i \mid Z\right]\right] > 0$.

Definition 3 (Target parameter).

Let $i \in \mathbb{N}$. Under Assumptions 2 and 3 define the target parameter

(8) $\beta_i = \frac{\mathbb{E}\left[\operatorname{Cov}\left[X_i, Y \mid Z\right]\right]}{\mathbb{E}\left[\operatorname{Var}\left[X_i \mid Z\right]\right]}.$
1  input: data $\mathcal{S}_0 = \{\mathbf{x}_{1:p,1}^0, \ldots, \mathbf{x}_{1:p,m}^0\}$ and $\mathcal{S} = \{(\mathbf{x}_{1:p,1}, y_1), \ldots, (\mathbf{x}_{1:p,n}, y_n)\}$, a set $E$ and $i \in \{1, \ldots, p\}$;
2  options: a method for estimating the recovery map $f^p : \mathbb{R}^p \to E$; methods for estimating the regression functions $\mu_i(z) = \mathbb{E}\left[X_i \mid Z = z\right]$ and $g(z) = \mathbb{E}\left[Y \mid Z = z\right]$;
3  begin
4      use the data in $\mathcal{S}_0$ to compute the estimate $\hat{f}^p$ of the recovery map.
5      use the data in $\mathcal{S}$ to compute the substitute latent variables as $\hat{z}_k := \hat{f}^p(\mathbf{x}_{1:p,k})$, $k = 1, \ldots, n$.
6      use the data in $\mathcal{S}$ combined with the substitutes to compute the regression function estimates $z \mapsto \hat{\mu}_i(z)$ and $z \mapsto \hat{g}(z)$, and set
       $\widehat{\beta}^{\mathrm{sub}}_i = \frac{\sum_{k=1}^n (x_{i,k} - \hat{\mu}_i(\hat{z}_k))(y_k - \hat{g}(\hat{z}_k))}{\sum_{k=1}^n (x_{i,k} - \hat{\mu}_i(\hat{z}_k))^2}.$
7  end
8  return $\widehat{\beta}^{\mathrm{sub}}_i$

Algorithm 2: Assumption-Lean Substitute Adjustment

Algorithm 2 gives a procedure for estimating $\beta_i$ based on substitute latent variables. The following proposition gives insight into the interpretation of the target parameter $\beta_i$.

Proposition 2.

Under Assumptions 1, 2 and 3, with $b_x^i(z)$ given as in Definition 2 and $\beta_i$ given as in Definition 3,

(9) $\beta_i = \frac{\mathbb{E}\left[\operatorname{Cov}\left[X_i, b^i_{X_i}(Z) \mid Z\right]\right]}{\mathbb{E}\left[\operatorname{Var}\left[X_i \mid Z\right]\right]}.$

Moreover, $\beta_i = 0$ if $b^i_x(z)$ does not depend on $x$. If $b^i_x(z) = \beta_i'(z) x + \Gamma_{-i}(z)$, then

(10) $\beta_i = \mathbb{E}\left[w_i(Z) \beta_i'(Z)\right],$

where

$w_i(Z) = \frac{\operatorname{Var}[X_i \mid Z]}{\mathbb{E}\left[\operatorname{Var}[X_i \mid Z]\right]}.$

We include a proof of Proposition 2 in Appendix A.1 for completeness. The arguments are essentially as in (Vansteelandt & Dukes, 2022).

Remark 2.

If $b^i_x(z) = \beta_i'(z) x + \Gamma_{-i}(z)$, it follows from Proposition 1 that $\chi^i_x = \beta'_i x$, where the coefficient $\beta'_i = \mathbb{E}[\beta_i'(Z)]$ may differ from $\beta_i$ given by (10). In the special case where the variance of $X_i$ given $Z$ is constant across all values of $Z$, the weights in (10) all equal $1$, in which case $\beta_i = \beta'_i$. For the partially linear model, $b^i_x(z) = \beta_i' x + \Gamma_{-i}(z)$, with $\beta_i'$ not depending on $z$, it follows from (10) that $\beta_i = \beta_i'$ irrespective of the weights.

Remark 3.

If $X_i \in \{0, 1\}$, then $b_x^i(Z) = (b_1^i(Z) - b_0^i(Z)) x + b_0^i(Z)$, and the contrast $\chi^i_1 - \chi^i_0 = \mathbb{E}\left[b_1^i(Z) - b_0^i(Z)\right]$ is an unweighted mean of differences, while it follows from (10) that

(11) $\beta_i = \mathbb{E}\left[w_i(Z) (b_1^i(Z) - b_0^i(Z))\right].$

If we let $\pi_i(Z) = \mathbb{P}(X_i = 1 \mid Z)$, we see that the weights are given as

$w_i(Z) = \frac{\pi_i(Z)(1 - \pi_i(Z))}{\mathbb{E}\left[\pi_i(Z)(1 - \pi_i(Z))\right]}.$

We summarize three important take-away messages from Proposition 2 and the remarks above as follows:

Conditional mean independence:

The null hypothesis of conditional mean independence,

$\mathbb{E}\left[Y \mid X_i = x; \mathbf{X}_{-i}\right] = \mathbb{E}\left[Y \mid \mathbf{X}_{-i}\right],$

implies that $\beta_i = 0$. The target parameter $\beta_i$ thus suggests an assumption-lean approach to testing this null hypothesis without a specific model of the conditional mean.

Heterogeneous partially linear model:

If the conditional mean,

$b_x^i(z) = \mathbb{E}\left[Y \mid X_i = x; Z = z\right],$

is linear in $x$ with an $x$-coefficient that depends on $Z$ (heterogeneity), the target parameter $\beta_i$ is a weighted mean of these coefficients, while $\chi^i_x = \beta'_i x$ with $\beta'_i$ the unweighted mean.

Simple partially linear model:

If the conditional mean is linear in $x$ with an $x$-coefficient that does not depend on $Z$ (homogeneity), the target parameter $\beta_i$ coincides with this $x$-coefficient and $\chi_x^i = \beta_i x$. Example 1 is a special case where the latent variable model is arbitrary but the full outcome model is linear.

Just as for the general Algorithm 1, the estimate that Algorithm 2 outputs, $\widehat{\beta}^{\mathrm{sub}}_i$, does not directly estimate the target parameter $\beta_i$. It directly estimates

(12) $\beta^{\mathrm{sub}}_i = \frac{\mathbb{E}\left[\operatorname{Cov}\left[X_i, Y \mid \hat{Z}\right] \mid \hat{f}^p\right]}{\mathbb{E}\left[\operatorname{Var}\left[X_i \mid \hat{Z}\right] \mid \hat{f}^p\right]}.$

Fixing the estimated recovery map $\hat{f}^p$ and letting $n \to \infty$, we can expect $\widehat{\beta}^{\mathrm{sub}}_i$ to be consistent for $\beta^{\mathrm{sub}}_i$ and not for $\beta_i$.

Pretending that the $z_k$-s were observed, we introduce the oracle estimator

$\widehat{\beta}_i = \frac{\sum_{k=1}^n (x_{i,k} - \overline{\mu}_i(z_k))(y_k - \overline{g}(z_k))}{\sum_{k=1}^n (x_{i,k} - \overline{\mu}_i(z_k))^2}.$

Here, $\overline{\mu}_i$ and $\overline{g}$ denote estimates of the regression functions $\mu_i$ and $g$, respectively, using the $z_k$-s instead of the substitutes. The estimator $\widehat{\beta}_i$ does not depend on $m$, $p$ or $\hat{f}^p$, and when $(x_{i,1}, z_1, y_1), \ldots, (x_{i,n}, z_n, y_n)$ are i.i.d. observations, standard regularity assumptions (van der Vaart, 1998) will ensure that $\widehat{\beta}_i$ is consistent for $\beta_i$ (and possibly even $\sqrt{n}$-rate asymptotically normal). Writing

(13) $\widehat{\beta}^{\mathrm{sub}}_i - \beta_i = (\widehat{\beta}^{\mathrm{sub}}_i - \widehat{\beta}_i) + (\widehat{\beta}_i - \beta_i),$

we see that if we can appropriately bound the error, $|\widehat{\beta}^{\mathrm{sub}}_i - \widehat{\beta}_i|$, due to using the substitutes instead of the unobserved $z_k$-s, we can transfer asymptotic properties of $\widehat{\beta}_i$ to $\widehat{\beta}^{\mathrm{sub}}_i$. It is the objective of the following section to demonstrate how such a bound can be achieved for a particular model class.
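For a finite $E$, a minimal sketch of the oracle estimator (with $\overline{\mu}_i$ and $\overline{g}$ taken to be within-class sample means, one possible choice) looks as follows; passing the substitutes $\hat{z}_k$ instead of the $z_k$-s gives exactly the estimator $\widehat{\beta}^{\mathrm{sub}}_i$ computed in Algorithm 3 below. The helper name is ours.

```python
# Sketch of the oracle estimator beta-hat_i above, assuming a finite latent set
# so that mu-bar_i and g-bar are within-class sample means. Passing the
# substitutes z_hat instead of z yields beta-hat_i^sub of Algorithms 2/3.
import numpy as np

def oracle_beta(x_i, z, y):
    """x_i, y: length-n arrays; z: length-n array of latent labels."""
    x_res = np.asarray(x_i, dtype=float).copy()
    y_res = np.asarray(y, dtype=float).copy()
    labels = np.asarray(z)
    for label in np.unique(labels):
        idx = labels == label
        x_res[idx] -= x_res[idx].mean()    # x_{i,k} - mu-bar_i(z_k)
        y_res[idx] -= y_res[idx].mean()    # y_k - g-bar(z_k)
    return float((x_res * y_res).sum() / (x_res ** 2).sum())
```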

3. Substitute adjustment in a mixture model

In this section, we present a theoretical analysis of assumption-lean substitute adjustment in the case where the latent variable takes values in a finite set. We provide finite-sample bounds on the error of $\widehat{\beta}^{\mathrm{sub}}_i$ due to the use of substitutes, and we show, in particular, that there exist trajectories of $m$, $n$ and $p$ along which the estimator is asymptotically equivalent to the oracle estimator $\widehat{\beta}_i$, which uses the actual latent variables.

3.1. The mixture model

To be concrete, we assume that $\mathbf{X}$ is generated by a finite mixture model such that, conditionally on a latent variable $Z$ with values in a finite set, the coordinates of $\mathbf{X}$ are independent. The precise model specification is as follows.

Assumption 4 (Mixture Model).

There is a latent variable $Z$ with values in the finite set $E = \{1, \ldots, K\}$ such that $X_1, X_2, \ldots$ are conditionally independent given $Z = z$. Furthermore:

  (1) The conditional distribution of $X_i$ given $Z = z$ has finite second moment, and its conditional mean and variance are denoted

    $\mu_i(z) = \mathbb{E}[X_i \mid Z = z]$ and $\sigma_i^2(z) = \operatorname{Var}[X_i \mid Z = z]$

    for $z \in E$ and $i \in \mathbb{N}$.

  (2) The conditional means satisfy the following separation condition:

    (14) $\sum_{i=1}^{\infty} (\mu_i(z) - \mu_i(v))^2 = \infty$

    for all $z, v \in E$ with $v \neq z$.

  (3) There are constants $0 < \sigma_{\min}^2 \leq \sigma_{\max}^2 < \infty$ that bound the conditional variances:

    (15) $\sigma_{\min}^2 \leq \max_{z \in E} \sigma_i^2(z) \leq \sigma_{\max}^2$

    for all $i \in \mathbb{N}$.

  (4) $\mathbb{P}(Z = z) > 0$ for all $z \in E$.

Algorithm 3 is one specific version of Algorithm 2 for computing $\hat{\beta}_i^{\mathrm{sub}}$ when the latent variable takes values in a finite set $E$. The recovery map in Step 5 assigns each observation to the nearest estimated mean vector, and it is thus estimated in Step 4 by estimating the means of each of the mixture components. How this is done precisely is an option of the algorithm. Once the substitutes are computed, outcome means and $x_{i,k}$-means are (re)computed within each component. The computations in Steps 6 and 7 of Algorithm 3 result in the same estimator as the OLS estimator of $\beta_i$ computed using the linear model

$b_x^i(z) = \beta_i x + \gamma_{-i,z}, \qquad \beta_i, \gamma_{-i,1}, \ldots, \gamma_{-i,K} \in \mathbb{R},$

on the data $(x_{i,1}, \hat{z}_1, y_1), \ldots, (x_{i,n}, \hat{z}_n, y_n)$. This may be relevant in practice, but it is also used in the proof of Theorem 1. The corresponding oracle estimator, $\hat{\beta}_i$, is similarly an OLS estimator.

1  input: data $\mathcal{S}_0 = \{\mathbf{x}_{1:p,1}^0, \ldots, \mathbf{x}_{1:p,m}^0\}$ and $\mathcal{S} = \{(\mathbf{x}_{1:p,1}, y_1), \ldots, (\mathbf{x}_{1:p,n}, y_n)\}$, a finite set $E$ and $i \in \{1, \ldots, p\}$;
2  options: a method for estimating the conditional means $\mu_j(z) = \mathbb{E}[X_j \mid Z = z]$;
3  begin
4      use the data in $\mathcal{S}_0$ to compute the estimates $\check{\mu}_j(z)$ for $j \in \{1, \ldots, p\}$ and $z \in E$.
5      use the data in $\mathcal{S}$ to compute the substitute latent variables as $\hat{z}_k = \operatorname{arg\,min}_z \|\mathbf{x}_{1:p,k} - \check{\boldsymbol{\mu}}_{1:p}(z)\|_2$, $k = 1, \ldots, n$.
6      use the data in $\mathcal{S}$ combined with the substitutes to compute the estimates
       $\hat{g}(z) = \frac{1}{\hat{n}(z)} \sum_{k : \hat{z}_k = z} y_k$ and $\hat{\mu}_i(z) = \frac{1}{\hat{n}(z)} \sum_{k : \hat{z}_k = z} x_{i,k}$ for $z \in E$,
       where $\hat{n}(z) = \sum_{k=1}^n 1(\hat{z}_k = z)$ is the number of $k$-s with $\hat{z}_k = z$.
7      use the data in $\mathcal{S}$ combined with the substitutes to compute
       $\widehat{\beta}^{\mathrm{sub}}_i = \frac{\sum_{k=1}^n (x_{i,k} - \hat{\mu}_i(\hat{z}_k))(y_k - \hat{g}(\hat{z}_k))}{\sum_{k=1}^n (x_{i,k} - \hat{\mu}_i(\hat{z}_k))^2}.$
8  end
9  return $\widehat{\beta}^{\mathrm{sub}}_i$

Algorithm 3: Assumption-Lean Substitute Adjustment with Mixtures
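A compact sketch of Algorithm 3 follows (not the authors' implementation; Step 4 here uses k-means on $\mathcal{S}_0$ as one convenient way to estimate the component means, whereas Section 3.5 below describes tensor-based alternatives).

```python
# Sketch of Algorithm 3: Step 4 via k-means on S_0 (one possible choice),
# Step 5 via nearest-mean assignment, Steps 6-7 via within-class centering.
import numpy as np
from sklearn.cluster import KMeans

def substitute_beta(X0, X, y, i, K):
    # Step 4: component means mu-check_{1:p}(z), z = 0, ..., K-1
    mu_check = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X0).cluster_centers_
    # Step 5: nearest-mean substitutes for the samples in S
    dists = ((X[:, None, :] - mu_check[None, :, :]) ** 2).sum(axis=2)
    z_hat = dists.argmin(axis=1)
    # Steps 6-7: subtract mu-hat_i(z_hat_k) and g-hat(z_hat_k), then the ratio
    x_res = X[:, i].astype(float).copy()
    y_res = np.asarray(y, dtype=float).copy()
    for z in range(K):
        idx = z_hat == z
        if idx.any():
            x_res[idx] -= x_res[idx].mean()
            y_res[idx] -= y_res[idx].mean()
    return float((x_res * y_res).sum() / (x_res ** 2).sum())
```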

Note that Assumption 4 implies that

$\mathbb{E}[X_i^2] = \sum_{z \in E} \mathbb{E}[X_i^2 \mid Z = z] \mathbb{P}(Z = z) = \sum_{z \in E} (\sigma_i^2(z) + \mu_i(z)^2) \mathbb{P}(Z = z) < \infty$

and

$\mathbb{E}\left[\operatorname{Var}\left[X_i \mid Z\right]\right] = \sum_{z \in E} \sigma_i^2(z) \mathbb{P}(Z = z) \geq \sigma_{\min}^2 \min_{z \in E} \mathbb{P}(Z = z) > 0.$

Hence Assumption 4, combined with $\mathbb{E}[Y^2] < \infty$, ensures that the moment conditions in Assumption 3 hold.

The following proposition states that the mixture model given by Assumption 4 is a special case of the general latent variable model.

Proposition 3.

Assumption 4 on the mixture model implies Assumption 2; specifically, $\sigma(Z) \subseteq \sigma(\mathbf{X}_{-i})$ for all $i \in \mathbb{N}$.

Remark 4.

The proof of Proposition 3 is in Appendix A.3. Technically, the proof only gives almost sure recovery of $Z$ from $\mathbf{X}_{-i}$, and we can thus only conclude that $\sigma(Z)$ is contained in $\sigma(\mathbf{X}_{-i})$ up to negligible sets. We can, however, replace $Z$ by a variable, $Z'$, such that $\sigma(Z') \subseteq \sigma(\mathbf{X}_{-i})$ and $Z' = Z$ almost surely. We can thus simply swap $Z$ with $Z'$ in Assumption 4.

Remark 5.

The arguments leading to Proposition 3 rely on Assumptions 4(2) and 4(3), specifically the separation condition (14) and the upper bound in (15). However, these conditions are not necessary for recovery of $Z$ from $\mathbf{X}_{-i}$. Using Kakutani's theorem on equivalence of product measures, it is possible to characterize precisely when $Z$ can be recovered, but the abstract characterization is not particularly operational. In Appendix B we analyze the characterization for the Gaussian mixture model, where $X_i$ given $Z = z$ has a $\mathcal{N}(\mu_i(z), \sigma_i^2(z))$-distribution. This leads to Proposition 5 and Corollary 1 in Appendix B, which give necessary and sufficient conditions for recovery in the Gaussian mixture model.

3.2. Bounding estimation error due to using substitutes

In this section we derive an upper bound on the estimation error due to using substitutes, cf. the decomposition (13). To this end, we consider the (partly hypothetical) observations $(x_{i,1}, \hat{z}_1, z_1, y_1), \ldots, (x_{i,n}, \hat{z}_n, z_n, y_n)$, which include the otherwise unobserved $z_k$-s as well as their observed substitutes, the $\hat{z}_k$-s. We let $\mathbf{x}_i = (x_{i,1}, \ldots, x_{i,n})^T \in \mathbb{R}^n$ and $\mathbf{y} = (y_1, \ldots, y_n)^T \in \mathbb{R}^n$, and we let $\|\mathbf{x}_i\|_2$ and $\|\mathbf{y}\|_2$ denote the $2$-norms of $\mathbf{x}_i$ and $\mathbf{y}$, respectively. We also let

$n(z) = \sum_{k=1}^n 1(z_k = z) \qquad \text{and} \qquad \hat{n}(z) = \sum_{k=1}^n 1(\hat{z}_k = z)$

for $z \in E = \{1, \ldots, K\}$, and

$n_{\min} = \min\{n(1), \ldots, n(K), \hat{n}(1), \ldots, \hat{n}(K)\}.$

Furthermore,

$\overline{\mu}_i(z) = \frac{1}{n(z)} \sum_{k : z_k = z} x_{i,k},$

and we define the following three quantities:

(16) $\alpha = \frac{n_{\min}}{n},$

(17) $\delta = \frac{1}{n} \sum_{k=1}^n 1(\hat{z}_k \neq z_k),$

(18) $\rho = \frac{\min\left\{\sum_{k=1}^n (x_{i,k} - \overline{\mu}_i(z_k))^2, \sum_{k=1}^n (x_{i,k} - \hat{\mu}_i(\hat{z}_k))^2\right\}}{\|\mathbf{x}_i\|_2^2}.$
Theorem 1.

Let $\alpha$, $\delta$ and $\rho$ be given by (16), (17) and (18). If $\alpha, \rho > 0$, then

(19) $|\widehat{\beta}^{\mathrm{sub}}_i - \hat{\beta}_i| \leq \frac{2\sqrt{2}}{\rho^2} \sqrt{\frac{\delta}{\alpha}} \frac{\|\mathbf{y}\|_2}{\|\mathbf{x}_i\|_2}.$

The proof of Theorem 1 is given in Appendix A.2. Appealing to the Law of Large Numbers, the quantities in the upper bound (19) can be interpreted as follows:

  • The ratio $\|\mathbf{y}\|_2 / \|\mathbf{x}_i\|_2$ is approximately a fixed and finite constant (unless $X_i$ is constantly zero), depending on the marginal distributions of $X_i$ and $Y$ only.

  • The fraction $\alpha$ is approximately

    (20) $\min_{z \in E} \left\{\min\{\mathbb{P}(Z = z), \mathbb{P}(\hat{Z} = z)\}\right\},$

    which is strictly positive by Assumption 4(4) (unless recovery is working poorly).

  • The quantity $\rho$ is a standardized measure of the residual variation of the $x_{i,k}$-s within the groups defined by the $z_k$-s or the $\hat{z}_k$-s. It is approximately equal to the constant

    $\frac{\min\left\{\mathbb{E}\left[\operatorname{Var}\left[X_i \mid Z\right]\right], \mathbb{E}\left[\operatorname{Var}\left[X_i \mid \hat{Z}\right]\right]\right\}}{\mathbb{E}[X_i^2]},$

    which is strictly positive if the probabilities in (20) are strictly positive and not all of the conditional variances are $0$.

  • The fraction $\delta$ is the relative mislabeling frequency of the substitutes. It is approximately equal to the mislabeling rate $\mathbb{P}(\hat{Z} \neq Z)$.

The bound (19) tells us that if the mislabeling rate of the substitutes tends to $0$, that is, if $\mathbb{P}(\hat{Z} \neq Z) \to 0$, the estimation error tends to $0$ roughly like $\sqrt{\mathbb{P}(\hat{Z} \neq Z)}$. This could potentially be achieved by letting $p \to \infty$ and $m \to \infty$. We formalize this statement in Section 3.4.
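In a simulation, where the true labels $z_k$ are available, the quantities $\alpha$, $\delta$, $\rho$ and the right-hand side of (19) can be computed directly. The helper below is a sketch of ours, not part of the paper, and assumes latent labels coded $0, \ldots, K-1$.

```python
# Sketch (our helper, not the paper's): compute alpha, delta, rho of (16)-(18)
# and the right-hand side of the bound (19) from x_i, y, the true labels z
# (available in simulations) and the substitutes z_hat, coded 0, ..., K-1.
import numpy as np

def bound_rhs(x_i, y, z, z_hat, K):
    x_i, y, z, z_hat = map(np.asarray, (x_i, y, z, z_hat))
    n = len(y)
    counts = [np.sum(z == k) for k in range(K)] + [np.sum(z_hat == k) for k in range(K)]
    alpha = min(counts) / n                              # (16)
    delta = np.mean(z != z_hat)                          # (17)

    def rss(labels):                                     # within-group residual sum of squares
        r = x_i.astype(float).copy()
        for k in range(K):
            idx = labels == k
            if idx.any():
                r[idx] -= r[idx].mean()
        return (r ** 2).sum()

    rho = min(rss(z), rss(z_hat)) / (x_i ** 2).sum()     # (18)
    rhs = 2 * np.sqrt(2) / rho ** 2 * np.sqrt(delta / alpha) * np.linalg.norm(y) / np.linalg.norm(x_i)
    return alpha, delta, rho, rhs                        # rhs bounds |beta-hat_i^sub - beta-hat_i|
```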

3.3. Bounding the mislabeling rate of the substitutes

In this section we give bounds on the mislabeling rate, $\mathbb{P}(\hat{Z} \neq Z)$, with the ultimate purpose of controlling the magnitude of $\delta$ in the bound (19). Two different approximations cause mislabeling. First, the computation of $\hat{Z}$ is based on the $p$ variables in $\mathbf{X}_{1:p}$ only, and it is thus an approximation of the full recovery map based on all variables in $\mathbf{X}$. Second, the recovery map is an estimate and thus itself an approximation. The severity of the second approximation is quantified by the following relative errors of the conditional means used for recovery.

Definition 4 (Relative errors, $p$-separation).

For the mixture model given by Assumption 4, let $\boldsymbol{\mu}_{1:p}(z) = (\mu_i(z))_{i=1,\ldots,p} \in \mathbb{R}^p$ for $z \in E$. With $\check{\boldsymbol{\mu}}_{1:p}(z) \in \mathbb{R}^p$ for $z \in E$ any collection of $p$-vectors, define the relative errors

(21) $R_{z,v}^{(p)} = \frac{\|\boldsymbol{\mu}_{1:p}(z) - \check{\boldsymbol{\mu}}_{1:p}(z)\|_2}{\|\boldsymbol{\mu}_{1:p}(z) - \boldsymbol{\mu}_{1:p}(v)\|_2}$

for $z, v \in E$, $v \neq z$. Define, moreover, the minimal $p$-separation as

(22) $\mathrm{sep}(p) = \min_{z \neq v} \|\boldsymbol{\mu}_{1:p}(z) - \boldsymbol{\mu}_{1:p}(v)\|_2^2.$

Note that Assumption 4(2) implies that $\mathrm{sep}(p) \to \infty$ as $p \to \infty$. This convergence could be arbitrarily slow. The following definition captures the important case where the separation grows at least linearly in $p$.

Definition 5 (Strong separation).

We say that the mixture model satisfies strong separation if there exists an $\varepsilon > 0$ such that $\mathrm{sep}(p) \geq \varepsilon p$ eventually.

Strong separation is equivalent to

$\liminf_{p \to \infty} \frac{\mathrm{sep}(p)}{p} > 0.$

A sufficient condition for strong separation is that $|\mu_i(z) - \mu_i(v)| \geq \varepsilon$ eventually for all $z, v \in E$, $v \neq z$, and some $\varepsilon > 0$; that is, $\liminf_{i \to \infty} |\mu_i(z) - \mu_i(v)| > 0$ for $v \neq z$. When we have strong separation, then for $p$ large enough,

$\left(R_{z,v}^{(p)}\right)^2 \leq \frac{1}{\varepsilon p} \|\boldsymbol{\mu}_{1:p}(z) - \check{\boldsymbol{\mu}}_{1:p}(z)\|_2^2 \leq \frac{1}{\varepsilon} \max_{i=1,\ldots,p} \left(\mu_i(z) - \check{\mu}_i(z)\right)^2,$

and we note that it is conceivable that we can estimate $\boldsymbol{\mu}_{1:p}(z)$ by an estimator, $\check{\boldsymbol{\mu}}_{1:p}(z)$, such that for $m, p \to \infty$ appropriately, $R_{z,v}^{(p)} \overset{P}{\to} 0$. (Parametric assumptions, say, and marginal estimators of each $\mu_i(z)$ that, under Assumption 4, are uniformly consistent over $i \in \mathbb{N}$ can be combined with a simple union bound to show the claim, possibly in a suboptimal way, cf. Section 3.5.)

The following proposition shows that a bound on $R_{z,v}^{(p)}$ is sufficient to ensure that the growth of $\mathrm{sep}(p)$ controls how fast the mislabeling rate diminishes with $p$. The proposition is stated for a fixed $\check{\boldsymbol{\mu}}$, which means that when $\check{\boldsymbol{\mu}}$ is an estimate, we are effectively assuming it is independent of the template observation $(\mathbf{X}_{1:p}, Z)$ used to compute $\hat{Z}$.

Proposition 4.

Suppose that Assumption 4 holds. Let $\check{\boldsymbol{\mu}}_{1:p}(z) \in \mathbb{R}^p$ for $z \in E$ and let

$\hat{Z} = \operatorname{arg\,min}_z \|\mathbf{X}_{1:p} - \check{\boldsymbol{\mu}}_{1:p}(z)\|_2.$

Suppose also that $R_{z,v}^{(p)} \leq \frac{1}{10}$ for all $z, v \in E$ with $v \neq z$. Then

(23) $\mathbb{P}\left(\hat{Z} \neq Z\right) \leq \frac{25 K \sigma_{\max}^2}{\mathrm{sep}(p)}.$

If, in addition, the conditional distribution of $X_i$ given $Z = z$ is sub-Gaussian with variance factor $v_{\max}$, independent of $i$ and $z$, then

(24) $\mathbb{P}\left(\hat{Z} \neq Z\right) \leq K \exp\left(-\frac{\mathrm{sep}(p)}{50 v_{\max}}\right).$
Remark 6.

The proof of Proposition 4 is in Appendix A.3. It shows that the specific constants, $25$ and $50$, appearing in the bounds above hinge on the specific bound, $R_{z,v}^{(p)} \leq \frac{1}{10}$, on the relative errors. The proof works for any bound strictly smaller than $\frac{1}{4}$. Replacing $\frac{1}{10}$ by a smaller bound on the relative errors decreases the constant, but it will always be larger than $4$.

The upshot of Proposition 4 is that if the relative errors, $R_{z,v}^{(p)}$, are sufficiently small, then Assumption 4 is sufficient to ensure that $\mathbb{P}(\hat{Z} \neq Z) \to 0$ as $p \to \infty$. Without additional distributional assumptions, the general bound (23) decays slowly with $p$; even with strong separation, it only gives a rate of $\tfrac{1}{p}$. With the additional sub-Gaussian assumption, the rate improves dramatically, and with strong separation it becomes $e^{-cp}$ for some constant $c > 0$. If the $X_i$-s are bounded, their conditional distributions are sub-Gaussian, so the rate is fast in this special but important case.
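A small Monte Carlo sketch (our own toy setting: two balanced Gaussian components with unit variances and known means, so the relative errors vanish) illustrates the fast decay of the mislabeling rate in $p$ under strong separation.

```python
# Monte Carlo sketch (toy assumptions: K = 2 balanced Gaussian components with
# unit variance and known means, so R_{z,v}^{(p)} = 0): the mislabeling rate
# P(Zhat != Z) of nearest-mean recovery decays rapidly in p.
import numpy as np

rng = np.random.default_rng(1)
mu0, mu1 = 0.0, 0.5                        # per-coordinate means; sep(p) = 0.25 * p
for p in [10, 50, 200]:
    z = rng.integers(0, 2, size=20000)
    means = np.where(z == 1, mu1, mu0)[:, None]
    X = means + rng.normal(size=(len(z), p))
    d0 = ((X - mu0) ** 2).sum(axis=1)      # squared distance to mu_{1:p}(0)
    d1 = ((X - mu1) ** 2).sum(axis=1)      # squared distance to mu_{1:p}(1)
    z_hat = (d1 < d0).astype(int)
    print(p, (z_hat != z).mean())          # empirical mislabeling rate
```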

3.4. Asymptotics of the substitute adjustment estimator

Suppose $Z$ takes values in $E = \{1, \ldots, K\}$ and that $(x_{i,1}, z_1, y_1), \ldots, (x_{i,n}, z_n, y_n)$ are observations of $(X_i, Z, Y)$. Then Assumption 3 ensures that the oracle OLS estimator $\widehat{\beta}_i$ is $\sqrt{n}$-consistent and that

$\widehat{\beta}_i \overset{\mathrm{as}}{\sim} \mathcal{N}(\beta_i, w_i^2 / n).$

There are standard sandwich formulas for the asymptotic variance parameter $w_i^2$. In this section we combine the bounds from Sections 3.2 and 3.3 to show our main theoretical result: that $\widehat{\beta}^{\mathrm{sub}}_i$ is a consistent and asymptotically normal estimator of $\beta_i$ for $n, m \to \infty$ if also $p \to \infty$ appropriately.

Assumption 5.

The dataset $\mathcal{S}_0$ in Algorithm 3 consists of i.i.d. observations of $\mathbf{X}_{1:p}$, the dataset $\mathcal{S}$ in Algorithm 3 consists of i.i.d. observations of $(\mathbf{X}_{1:p}, Y)$, and $\mathcal{S}$ is independent of $\mathcal{S}_0$.

Theorem 2.

Suppose Assumption 1 holds and $\mathbb{E}[Y^2] < \infty$, and consider the mixture model fulfilling Assumption 4. Consider data satisfying Assumption 5 and the estimator $\widehat{\beta}^{\mathrm{sub}}_i$ given by Algorithm 3. Suppose that $n, m, p \to \infty$ such that $\mathbb{P}(R_{z,v}^{(p)} > \tfrac{1}{10}) \to 0$. Then the following hold:

  (1) The estimation error due to using substitutes tends to $0$ in probability, that is,

    $|\widehat{\beta}^{\mathrm{sub}}_i - \hat{\beta}_i| \overset{P}{\to} 0,$

    and $\widehat{\beta}^{\mathrm{sub}}_i$ is a consistent estimator of $\beta_i$.

  (2) If $\frac{\mathrm{sep}(p)}{n} \to \infty$ and $n \mathbb{P}(R_{z,v}^{(p)} > \tfrac{1}{10}) \to 0$, then $\sqrt{n} |\widehat{\beta}^{\mathrm{sub}}_i - \hat{\beta}_i| \overset{P}{\to} 0$.

  (3) If $X_i$ conditionally on $Z = z$ is sub-Gaussian, with variance factor independent of $i$ and $z$, and if $\frac{\mathrm{sep}(p)}{\log(n)} \to \infty$ and $n \mathbb{P}(R_{z,v}^{(p)} > \tfrac{1}{10}) \to 0$, then $\sqrt{n} |\widehat{\beta}^{\mathrm{sub}}_i - \hat{\beta}_i| \overset{P}{\to} 0$.

In addition, in case (2) as well as case (3), $\widehat{\beta}^{\mathrm{sub}}_i \overset{\mathrm{as}}{\sim} \mathcal{N}(\beta_i, w_i^2/n)$, where the asymptotic variance parameter $w_i^2$ is the same as for the oracle estimator $\widehat{\beta}_i$.

Remark 7.

The proof of Theorem 2 is in Appendix A.4. As mentioned in Remark 6, the precise value of the constant $\tfrac{1}{10}$ is not important. It could be replaced by any other constant strictly smaller than $\tfrac{1}{4}$, and the conclusion would be the same.

Remark 8.

The general growth condition on $p$ in terms of $n$ in case (2) is restrictive; even with strong separation we would need $\tfrac{p}{n} \to \infty$, that is, $p$ must grow faster than $n$. In the sub-Gaussian case this improves substantially, so that $p$ only needs to grow faster than $\log(n)$.

3.5. Tensor decompositions

One open question, from both a theoretical and a practical perspective, is how to construct the estimators $\check{\boldsymbol{\mu}}_{1:p}(z)$. We want to ensure consistency for $m, p \to \infty$, which is expressed as $\mathbb{P}\left(R_{z,v}^{(p)} > \tfrac{1}{10}\right) \to 0$ in our theoretical results, and we want the estimator to be computable efficiently for large $m$ and $p$. We indicated in Section 3.3 that simple marginal estimators of $\mu_i(z)$ can achieve this, but such estimators may be highly inefficient. In this section we briefly describe two methods based on tensor decompositions (Anandkumar et al., 2014) related to the third-order moments of $\mathbf{X}_{1:p}$. To apply such methods we need to additionally assume that the $X_i$-s have finite third moments.

Introduce first the third-order $p \times p \times p$ tensor $G^{(p)}$ as

$G^{(p)} = \sum_{i=1}^p \mathbf{a}_i \otimes \mathbf{e}_i \otimes \mathbf{e}_i + \mathbf{e}_i \otimes \mathbf{a}_i \otimes \mathbf{e}_i + \mathbf{e}_i \otimes \mathbf{e}_i \otimes \mathbf{a}_i,$

where $\mathbf{e}_i \in \mathbb{R}^p$ is the standard basis vector with a $1$ in the $i$-th coordinate and $0$ elsewhere, and where

$\mathbf{a}_i = \sum_{z \in E} \mathbb{P}(Z = z) \sigma_i^2(z) \boldsymbol{\mu}_{1:p}(z).$

In terms of the third-order raw moment tensor and $G^{(p)}$ we define the tensor

(25) $M_3^{(p)} = \mathbb{E}[\mathbf{X}_{1:p} \otimes \mathbf{X}_{1:p} \otimes \mathbf{X}_{1:p}] - G^{(p)}.$

Letting $\mathcal{I} = \{(i_1, i_2, i_3) \in \{1, \ldots, p\}^3 \mid i_1, i_2, i_3 \text{ all distinct}\}$ denote the set of tensor indices with all entries distinct, we see from the definition of $G^{(p)}$ that $G^{(p)}_{i_1,i_2,i_3} = 0$ for $(i_1, i_2, i_3) \in \mathcal{I}$. Thus

$(M_3^{(p)})_{i_1,i_2,i_3} = \mathbb{E}\left[X_{i_1} X_{i_2} X_{i_3}\right]$

for $(i_1, i_2, i_3) \in \mathcal{I}$. In the following, $(M^{(p)}_3)_{\mathcal{I}}$ denotes the incomplete tensor obtained by restricting the indices of $M^{(p)}_3$ to $\mathcal{I}$.
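For the entries with distinct indices, the identity above means that $(M_3^{(p)})_{\mathcal{I}}$ can be estimated by plain sample third moments; the brute-force helper below (ours, cubic in $p$ and only meant to illustrate the definition) makes this explicit.

```python
# Sketch (our helper, cubic in p, for illustration only): estimate the entries
# of the incomplete tensor (M_3^{(p)})_I with distinct indices by the sample
# averages of x_{i1} * x_{i2} * x_{i3} over the observations in S_0.
import numpy as np
from itertools import permutations

def incomplete_M3_hat(X0):
    """X0: (m, p) data matrix; returns a dict over index triples with distinct entries."""
    m, p = X0.shape
    return {
        (i1, i2, i3): float(np.mean(X0[:, i1] * X0[:, i2] * X0[:, i3]))
        for i1, i2, i3 in permutations(range(p), 3)
    }
```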

The key to using the $M_3^{(p)}$-tensor for estimation of the $\mu_i(z)$-s is the following rank-$K$ tensor decomposition,

(26) $M_3^{(p)} = \sum_{z=1}^K \mathbb{P}(Z = z) \boldsymbol{\mu}_{1:p}(z) \otimes \boldsymbol{\mu}_{1:p}(z) \otimes \boldsymbol{\mu}_{1:p}(z);$

see Theorem 3.3 in (Anandkumar et al., 2014) or the derivations on page 2 in (Guo, Nie & Yang, 2022).

Guo, Nie & Yang (2022) propose an algorithm based on incomplete tensor decomposition as follows: let $(\widehat{M}_3^{(p)})_{\mathcal{I}}$ denote an estimate of the incomplete tensor $(M_3^{(p)})_{\mathcal{I}}$; obtain an approximate rank-$K$ tensor decomposition of the incomplete tensor $(\widehat{M}_3^{(p)})_{\mathcal{I}}$; extract estimates $\check{\boldsymbol{\mu}}_{1:p}(1), \ldots, \check{\boldsymbol{\mu}}_{1:p}(K)$ from this tensor decomposition. Theorem 4.2 in (Guo, Nie & Yang, 2022) shows that if the vectors $\boldsymbol{\mu}_{1:p}(1), \ldots, \boldsymbol{\mu}_{1:p}(K)$ satisfy certain regularity assumptions, they are estimated consistently by their algorithm (up to permutation) if $(\widehat{M}_3^{(p)})_{\mathcal{I}}$ is consistent. We note that the regularity assumptions are fulfilled for generic vectors in $\mathbb{R}^p$.

A computational downside of working directly with M3(p)M_{3}^{(p)} is that it grows cubically with pp. Anandkumar et al. (2014) propose to consider 𝐗~(p)=𝐖T𝐗1:pK\widetilde{\mathbf{X}}^{(p)}=\mathbf{W}^{T}\mathbf{X}_{1:p}\in\mathbb{R}^{K}, where 𝐖\mathbf{W} is a p×Kp\times K whitening matrix. The tensor decomposition is then computed for the corresponding K×K×KK\times K\times K tensor M~3\widetilde{M}_{3}. When K<pK<p is fixed and pp grows, this is computationally advantageous. Theorem 5.1 in Anandkumar et al. (2014) shows that, under a generically satisfied non-degeneracy condition, the tensor decomposition of M~3\widetilde{M}_{3} can be estimated consistently (up to permutation) if M~3\widetilde{M}_{3} can be estimated consistently.

To use the methodology from Anandkumar et al. (2014) in Algorithm 3, we replace Step 4 by their Algorithm 1 applied to 𝐱~(0,p)=𝐖T𝐱1:p(0)\widetilde{\mathbf{x}}^{(0,p)}=\mathbf{W}^{T}\mathbf{x}_{1:p}^{(0)}. This will estimate the transformed mean vectors 𝝁~(p)(z)=𝐖T𝝁1:p(z)K\widetilde{\boldsymbol{\mu}}^{(p)}(z)=\mathbf{W}^{T}\boldsymbol{\mu}_{1:p}(z)\in\mathbb{R}^{K}. Likewise, we replace Step 5 in Algorithm 3 by

z^k=argminz𝐱~(p)𝝁~ˇ(p)(z)2\hat{z}_{k}=\operatorname*{arg\,min}_{z}\left\|\widetilde{\mathbf{x}}^{(p)}-\check{\widetilde{\boldsymbol{\mu}}}^{(p)}(z)\right\|_{2}

where 𝐱~(p)=𝐖T𝐱1:p\widetilde{\mathbf{x}}^{(p)}=\mathbf{W}^{T}\mathbf{x}_{1:p}. The separation and relative errors conditions should then be expressed in terms of the pp-dependent KK-vectors 𝝁~(p)(1),,𝝁~(p)(K)K\widetilde{\boldsymbol{\mu}}^{(p)}(1),\ldots,\widetilde{\boldsymbol{\mu}}^{(p)}(K)\in\mathbb{R}^{K}.
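
To make the modified Steps 4 and 5 concrete, here is a small sketch of the nearest-mean assignment in the whitened space; the whitening matrix $\mathbf{W}$ and the estimated transformed means are assumed to be available already, and the names are ours.

```python
import numpy as np

def assign_substitutes(X, W, mu_tilde_check):
    """Nearest-mean assignment in the whitened K-dimensional space.

    X:              (n, p) matrix of observations x_{1:p}
    W:              (p, K) whitening matrix (assumed given)
    mu_tilde_check: (K, K) matrix whose rows are the estimated transformed
                    means W^T mu_{1:p}(z) for z = 1, ..., K
    Returns 0-based substitute labels z_hat in {0, ..., K-1}.
    """
    X_tilde = X @ W                                   # whitened observations, (n, K)
    d2 = ((X_tilde[:, None, :] - mu_tilde_check[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)                          # nearest transformed mean
```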

4. Simulation Study

Our analysis in Section 3 shows that Algorithm 3 consistently estimates the $\beta_i$-parameters via substitute adjustment when $n$, $m$ and $p$ tend to infinity at appropriate rates. The purpose of this section is to shed light on the finite sample performance of substitute adjustment via a simulation study.

The XiX_{i}-s are simulated according to a mixture model fulfilling Assumption 4, and the outcome model is as in Example 1, which makes bxi(z)=𝔼[YXi=x;Z=z]b_{x}^{i}(z)=\mathbb{E}[Y\mid X_{i}=x;Z=z] a partially linear model. Throughout, we take m=nm=n and 𝒮0=𝒮\mathcal{S}_{0}=\mathcal{S} in Algorithm 3. The simulations are carried out for different choices of nn, pp, 𝜷\boldsymbol{\beta} and μi(z)\mu_{i}(z)-s, and we report results on both the mislabeling rate of the latent variables and the mean squared error (MSE) of the βi\beta_{i}-estimators.

4.1. Mixture model simulations and recovery of ZZ

The mixture model in our simulations is given as follows.

  • We set K=10K=10 and fix pmax=1000p_{\max}=1000 and nmax=1000n_{\max}=1000.

  • We draw μi(z)\mu_{i}(z)-s independently and uniformly from (1,1)(-1,1) for z{1,,K}z\in\{1,\ldots,K\} and i{1,,pmax}i\in\{1,\ldots,p_{\max}\}.

  • Fixing the μi(z)\mu_{i}(z)-s and a choice of μscale{0.75,1,1.5}\mu_{\mathrm{scale}}\in\{0.75,1,1.5\}, we simulate nmaxn_{\max} independent observations of (𝐗1:pmax,Z)(\mathbf{X}_{1:{p_{\max}}},Z), each with the latent variable ZZ uniformly distributed on {1,,K}\{1,...,K\}, and XiX_{i} given Z=zZ=z being 𝒩(μscaleμi(z),1)\mathcal{N}(\mu_{\mathrm{scale}}\cdot\mu_{i}(z),1)-distributed.

We use the algorithm from Anandkumar et al. (2014), as described in Section 3.5, for recovery. We replicate the simulation outlined above 1010 times, and we consider recovery of ZZ for p{50,100,200,1000}p\in\{50,100,200,1000\} and n{50,100,200,500,1000}n\in\{50,100,200,500,1000\}. For replication b{1,,10}b\in\{1,\ldots,10\} the actual values of the latent variables are denoted zb,kz_{b,k}. For each combination of nn and pp the substitutes are denoted z^b,k(n,p)\hat{z}_{b,k}^{(n,p)}. The mislabeling rate for fixed pp and nn is estimated as

δ(n,p)=110b=1101nk=1n1(z^b,k(n,p)zb,k).\delta^{(n,p)}=\frac{1}{10}\sum_{b=1}^{10}\frac{1}{n}\sum_{k=1}^{n}1(\hat{z}_{b,k}^{(n,p)}\neq z_{b,k}).
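
The following is a minimal, self-contained sketch of one such replication. To keep it short, substitutes are assigned by nearest true mean rather than by the tensor-based estimates used in the actual study, so the labels align and $\delta^{(n,p)}$ can be computed directly from the formula above.

```python
import numpy as np

rng = np.random.default_rng(1)
K, p, n, mu_scale = 10, 200, 500, 1.0

mu = rng.uniform(-1.0, 1.0, size=(K, p))            # mu_i(z), rows indexed by z
z = rng.integers(K, size=n)                          # latent labels, uniform over K classes
X = mu_scale * mu[z] + rng.normal(size=(n, p))       # X_i | Z = z ~ N(mu_scale * mu_i(z), 1)

# oracle nearest-mean recovery (stand-in for the tensor-based mean estimates)
d2 = ((X[:, None, :] - mu_scale * mu[None, :, :]) ** 2).sum(axis=2)
z_hat = d2.argmin(axis=1)

mislabeling_rate = np.mean(z_hat != z)               # one term of delta^(n, p)
```

In the study itself, the transformed means are first estimated via the whitened tensor decomposition of Section 3.5 before this assignment step.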
Figure 2. Empirical mislabeling rates as a function of n=mn=m and pp and for three different separation scales.

Figure 2 shows the estimated mislabeling rates from the simulations. The results demonstrate that for reasonable choices of nn and pp, the algorithm based on (Anandkumar et al., 2014) is capable of recovering ZZ quite well.

The theoretical upper bounds on the mislabeling rate in Proposition 4 are monotonically decreasing as functions of the separations $\|\boldsymbol{\mu}_{1:p}(z)-\boldsymbol{\mu}_{1:p}(v)\|_2$, which are, in turn, monotonically increasing in $p$ and in $\mu_{\mathrm{scale}}$. The results in Figure 2 support that this behavior of the upper bounds carries over to the actual mislabeling rate. Moreover, the rapid decay of the mislabeling rate with $\mu_{\mathrm{scale}}$ is in accordance with the exponential decay of the upper bound in the sub-Gaussian case.

4.2. Outcome model simulation and estimation of βi\beta_{i}

Given simulated ZZ-s and XiX_{i}-s as described in Section 4.1, we simulate the outcomes as follows.

  • Draw βi\beta_{i} independently and uniformly from (1,1)(-1,1) for i=1,,pmaxi=1,\ldots,p_{\max}.

  • Fix γscale{0,20,40,100,200}\gamma_{\mathrm{scale}}\in\{0,20,40,100,200\} and let γz=γscalez\gamma_{z}=\gamma_{\mathrm{scale}}\cdot z for z{1,,K}z\in\{1,\ldots,K\}.

  • With ε𝒩(0,1)\varepsilon\sim\mathcal{N}(0,1) simulate nmaxn_{\max} independent outcomes as

    Y=i=1pmaxβiXi+γZ+ε.Y=\sum_{i=1}^{p_{\max}}\beta_{i}X_{i}+\gamma_{Z}+\varepsilon.
Figure 3. Average MSE for substitute adjustment using Algorithm 3 as a function of sample size nn and for two different dimensions, a range of the unobserved confounding levels, and with μscale=1\mu_{\mathrm{scale}}=1.

The simulation parameter γscale\gamma_{\mathrm{scale}} captures a potential effect of unobserved XiX_{i}-s for i>pmaxi>p_{\max}. We refer to this effect as unobserved confounding. For p<pmaxp<p_{\max}, adjustment using the naive linear regression model i=1pβixi\sum_{i=1}^{p}\beta_{i}x_{i} would lead to biased estimates even if γscale=0\gamma_{\mathrm{scale}}=0, while the naive linear regression model for p=pmaxp=p_{\max} would be correct when γscale=0\gamma_{\mathrm{scale}}=0. When γscale>0\gamma_{\mathrm{scale}}>0, adjusting via naive linear regression for all observed XiX_{i}-s would still lead to biased estimates due to the unobserved confounding.

We consider the estimation error for p{125,175}p\in\{125,175\} and n{50,100,200,500,1000}n\in\{50,100,200,500,1000\}. Let βb,i\beta_{b,i} denote the ii-th parameter in the bb-th replication, and let β^b,isub,n,p\hat{\beta}_{b,i}^{\mathrm{sub},n,p} denote the corresponding estimate from Algorithm 3 for each combination of nn and pp. The average MSE of 𝜷^bsub,n,p\hat{\boldsymbol{\beta}}_{b}^{\mathrm{sub},n,p} is computed as

MSE(n,p)=110b=1101pi=1p(β^b,isub,n,pβb,i)2.\mathrm{MSE}^{(n,p)}=\frac{1}{10}\sum_{b=1}^{10}\frac{1}{p}\sum_{i=1}^{p}(\hat{\beta}_{b,i}^{\mathrm{sub},n,p}-\beta_{b,i})^{2}.
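
A self-contained sketch of the estimation step follows, with toy data in the spirit of Sections 4.1 and 4.2 and oracle recovery of the substitutes; for each $i$, regressing $\mathbf{y}$ on $\mathbf{x}_i$ and the one-hot encoded substitutes gives a coefficient on $\mathbf{x}_i$ that agrees with the projection form (27) in Appendix A.2, since the dummies span the intercept. The helper name is ours.

```python
import numpy as np

def substitute_adjustment(X, y, z_hat, K):
    """Marginal OLS of y on each x_i, adjusting for one-hot substitute dummies."""
    n, p = X.shape
    D = np.eye(K)[z_hat]                              # (n, K) one-hot encoding of z_hat
    beta_hat = np.empty(p)
    for i in range(p):
        design = np.column_stack([X[:, i], D])        # dummies absorb the intercept
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        beta_hat[i] = coef[0]                         # coefficient on x_i
    return beta_hat

# toy data in the spirit of Sections 4.1-4.2 (gamma_scale = 20, oracle recovery)
rng = np.random.default_rng(2)
K, p, n = 10, 125, 500
mu = rng.uniform(-1, 1, size=(K, p))
z = rng.integers(K, size=n)
X = mu[z] + rng.normal(size=(n, p))
z_hat = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
beta = rng.uniform(-1, 1, size=p)
y = X @ beta + 20.0 * (z + 1) + rng.normal(size=n)

mse = np.mean((substitute_adjustment(X, y, z_hat, K) - beta) ** 2)
```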

Figure 3 shows the MSE for the different combinations of nn and pp and for different choices of γscale\gamma_{\mathrm{scale}}. Unsurprisingly, the MSE decays with sample size and increases with the magnitude of unobserved confounding. More interestingly, we see a clear decrease with the dimension pp indicating that the lower mislabeling rate for larger pp translates to a lower MSE as well.

Figure 4. Average MSE for substitute adjustment using Algorithm 3 compared to average MSE for the ridge and augmented ridge estimators for two different dimensions, a range of unobserved confounding levels, and with μscale=1\mu_{\mathrm{scale}}=1.

Finally, we compare the results of Algorithm 3 with two other approaches. Letting 𝕏\mathbb{X} denote the n×pn\times p model matrix for the xi,kx_{i,k}-s and 𝐲\mathbf{y} the nn-vector of outcomes, the ridge regression estimator is given as

𝜷^Ridge(n,p)=argmin𝜷pminβ0𝐲β0𝕏𝜷22+λ𝜷22,\hat{\boldsymbol{\beta}}^{(n,p)}_{\mathrm{Ridge}}=\operatorname*{arg\,min}_{\boldsymbol{\beta}\in\mathbb{R}^{p}}\min_{\beta_{0}\in\mathbb{R}}\|\mathbf{y}-\beta_{0}-\mathbb{X}\boldsymbol{\beta}\|_{2}^{2}+\lambda\norm{\boldsymbol{\beta}}_{2}^{2},

with λ\lambda chosen by five-fold cross-validation. The augmented ridge regression estimator is given as

𝜷^Aug-Ridge(n,p)=argmin𝜷pmin𝜸K\displaystyle\hat{\boldsymbol{\beta}}^{(n,p)}_{\text{Aug-Ridge}}=\operatorname*{arg\,min}_{\boldsymbol{\beta}\in\mathbb{R}^{p}}\min_{\boldsymbol{\gamma}\in\mathbb{R}^{K}} 𝐲[𝕏,𝐙^][𝜷𝜸]22+λ𝜷22,\displaystyle\left\|\mathbf{y}-\left[\mathbb{X},\hat{\mathbf{Z}}\right]\left[\begin{array}[]{c}\boldsymbol{\beta}\\ \boldsymbol{\gamma}\end{array}\right]\right\|_{2}^{2}+\lambda\norm{\boldsymbol{\beta}}_{2}^{2},

where 𝐙^\hat{\mathbf{Z}} is the n×Kn\times K model matrix of dummy variable encodings of the substitutes. Again, λ\lambda is chosen by five-fold cross-validation.
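
Since the augmented ridge objective penalizes $\boldsymbol{\beta}$ but not $\boldsymbol{\gamma}$, it cannot be obtained from an off-the-shelf ridge routine directly. A minimal sketch for a fixed $\lambda$ (the study selects $\lambda$ by five-fold cross-validation) solves the corresponding normal equations, assuming every substitute class is non-empty; the function name is ours.

```python
import numpy as np

def augmented_ridge(X, D, y, lam):
    """Ridge with the penalty on beta only; the substitute coefficients gamma
    (one per dummy column of D) are left unpenalized.  Solves the normal
    equations of || y - X beta - D gamma ||^2 + lam * ||beta||^2."""
    n, p = X.shape
    K = D.shape[1]
    A = np.column_stack([X, D])                              # (n, p + K) design
    penalty = np.diag(np.r_[np.full(p, lam), np.zeros(K)])   # penalize beta block only
    theta = np.linalg.solve(A.T @ A + penalty, A.T @ y)
    return theta[:p], theta[p:]                              # beta_hat, gamma_hat

# hypothetical usage with one-hot encoded substitutes z_hat in {0, ..., K-1}:
# D = np.eye(K)[z_hat]
# beta_hat, gamma_hat = augmented_ridge(X, D, y, lam=1.0)
```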

The average MSE is computed for ridge regression and augmented ridge regression just as for substitute adjustment. Figure 4 shows results for p=125p=125 and p=175p=175. These two values of pp correspond to asymptotic (as pp stays fixed and nn\to\infty) mislabeling rates δ\delta around 7%7\% and 2%2\%, respectively.

We see that both alternative estimators outperform Algorithm 3 when the sample size is too small to learn $Z$ reliably. However, naive linear regression is biased, and so is ridge regression (even asymptotically), and its performance does not improve as the sample size $n$ increases. Substitute adjustment as well as augmented ridge regression adjust for $\hat{Z}$, and their performance improves with $n$, despite the fact that $p$ is too small to recover $Z$ exactly. When $n$ and the amount of unobserved confounding are sufficiently large, both of these estimators outperform ridge regression. It is unsurprising that the augmented ridge estimator performs similarly to Algorithm 3 for large sample sizes: after adjusting for the substitutes, the $x_{i,k}$-residuals are roughly orthogonal when the substitutes give accurate recovery, and a joint regression then yields estimates similar to those of the marginal regressions.

We made a couple of observations (data not shown) during the simulation study. We experimented with changing the mixture distributions to other sub-Gaussian distributions as well as to the Laplace distribution and got similar results as shown here using the Gaussian distribution. We also implemented sample splitting, and though Proposition 4 assumes sample splitting, we found that the improved estimation accuracy attained by using all available data for the tensor decomposition outweighs the benefit of sample splitting in the recovery stage.

In conclusion, our simulations show that for reasonable finite nn and pp, it is possible to recover the latent variables sufficiently well for substitute adjustment to be a better alternative than naive linear or ridge regression in settings where the unobserved confounding is sufficiently large.

5. Discussion

We break the discussion into three parts. In the first part we revisit the discussion of the causal interpretation of the target parameters $\chi_x^i$ treated in this paper. In the second part we discuss substitute adjustment as a method for estimating these parameters as well as the assumption-lean parameters $\beta_i$. In the third part we discuss possible extensions of our results.

5.1. Causal interpretations

The main causal question is whether a contrast of the form χxiχx0i\chi_{x}^{i}-\chi_{x_{0}}^{i} has a causal interpretation as an average treatment effect. The framework in (Wang & Blei, 2019) and the subsequent criticisms by D’Amour (2019) and Ogburn et al. (2020) are based on the XiX_{i}-s all being causes of YY, and on the possibility of unobserved confounding. Notably, the latent variable ZZ to be recovered is not equal to an unobserved confounder, but Wang & Blei (2019) argue that using the deconfounder allows us to weaken the assumption of “no unmeasured confounding” to “no unmeasured single-cause confounding”. The assumptions made in (Wang & Blei, 2019) did not fully justify this claim, and we found it difficult to understand precisely what the causal assumptions related to ZZ were.

Mathematically precise assumptions that allow for identification of causal parameters from a finite number of causes, X1,,XpX_{1},\ldots,X_{p}, via deconfounding are stated as Assumptions 1 and 2 in (Wang & Blei, 2020). We find these assumptions regarding recovery of ZZ (also termed “pinpointing” in the context of the deconfounder) for finite pp implausible. Moreover, the entire framework of the deconfounder rests on the causal assumption of “weak unconfoundedness” in Assumption 1 and Theorem 1 of (Wang & Blei, 2020), which might be needed for a causal interpretation but is unnecessary for the deconfounder algorithm to estimate a meaningful target parameter.

We find it beneficial to disentangle the causal interpretation from the definition of the target parameter. By defining the target parameter entirely in terms of the observational distribution of observed (or, at least, observable) variables, we can discuss the properties of the statistical method of substitute adjustment without making causal claims. We have shown that substitute adjustment under our Assumption 2 on the latent variable model targets the adjusted mean irrespective of any unobserved confounding. Grimmer et al. (2023) present a similar view. The contrast $\chi_x^i-\chi_{x_0}^i$ might have a causal interpretation in specific applications, but substitute adjustment as a statistical method does not rely on such an interpretation or on the assumptions needed to justify it. In any specific application with multiple causes and potential unobserved confounding, substitute adjustment might be a useful method for deconfounding, but depending on the context and the causal assumptions we are willing to make, other methods could be preferable (Miao et al., 2023).

5.2. Substitute adjustment: interpretation, merits and deficits

We define the target parameter as an adjusted mean when adjusting for an infinite number of variables. Clearly, this is a mathematical idealization of adjusting for a large number of variables, but it also has some important technical consequences. First, the recovery Assumption 2(2) is a more plausible modelling assumption than recovery from a finite number of variables. Second, it gives a clear qualitative difference between the adjusted mean of one (or any finite number of) variables and regression on all variables. Third, the natural requirement in Assumption 2(2) that ZZ can be recovered from 𝐗i\mathbf{X}_{-i} for any ii replaces the minimality of a “multi-cause separator” from (Wang & Blei, 2020). Our assumption is that σ(Z)\sigma(Z) is sufficiently minimal in a very explicit way, which ensures that ZZ does not contain information unique to any single XiX_{i}.

Grimmer et al. (2023) come to a similar conclusion as we do: that the target parameter of substitute adjustment (and the deconfounder) is the adjusted mean χxi\chi^{i}_{x}, where you adjust for an infinite number of variables. They argue forcefully that substitute adjustment, using a finite number pp of variables, does not have an advantage over naive regression, that is, over estimating the regression function 𝔼[YX1=x1,,Xp=xp]\mathbb{E}\left[Y\mid X_{1}=x_{1},\ldots,X_{p}=x_{p}\right] directly. With i=1i=1, say, they argue that substitute adjustment is effectively assuming a partially linear, semiparametric regression model

𝔼[YX1=x1,,Xp=xp]=β0+β1x1+h(x2,,xp),\mathbb{E}\left[Y\mid X_{1}=x_{1},\ldots,X_{p}=x_{p}\right]=\beta_{0}+\beta_{1}x_{1}+h(x_{2},\ldots,x_{p}),

with the specific constraint that h(x2,,xp)=g(z^)=g(f(p)(x2,,xp))h(x_{2},\ldots,x_{p})=g(\hat{z})=g(f^{(p)}(x_{2},\ldots,x_{p})). We agree with their analysis and conclusion; substitute adjustment is implicitly a way of making assumptions about hh. It is also a way to leverage those assumptions, either by shrinking the bias compared to directly estimating a misspecified (linear, say) hh, or by improving efficiency over methods that use a too flexible model of hh. We believe there is room for further studies of such bias and efficiency tradeoffs.

We also believe that there are two potential benefits of substitute adjustment, which are not brought forward by Grimmer et al. (2023). First, the latent variable model can be estimated without access to outcome observations. This means that the inner part of h=gf(p)h=g\circ f^{(p)} could, potentially, be estimated very accurately on the basis of a large sample 𝒮0\mathcal{S}_{0} in cases where it would be difficult to estimate the composed map hh accurately from 𝒮\mathcal{S} alone. Second, when pp is very large, e.g., in the millions, but ZZ is low-dimensional, there can be huge computational advantages to running pp small parallel regressions compared to just one naive linear regression of YY on all of 𝐗1:p\mathbf{X}_{1:p}, let alone pp naive partially linear regressions.

5.3. Possible extensions

We believe that our error bound in Theorem 1 is an interesting result, which in a precise way bounds the error of an OLS estimator in terms of errors in the regressors. This result is closely related to the classical literature on errors-in-variables models (or measurement error models) (Durbin, 1954; Cochran, 1968; Schennach, 2016), though this literature focuses on methods for bias correction when the errors are non-vanishing. We see two possible extensions of our result. For one, Theorem 1 could easily be generalized to E=dE=\mathbb{R}^{d}. In addition, it might be possible to apply the bias correction techniques developed for errors-in-variables to improve the finite sample properties of the substitute adjustment estimator.

Our analysis of the recovery error could also be extended. The concentration inequalities in Section 3.3 are unsurprising, but developed to match our specific needs for a high-dimensional analysis with as few assumptions as possible. For more refined results on finite mixture estimation see, e.g., (Heinrich & Kahn, 2018), and see (Ndaoud, 2022) for optimal recovery when K=2K=2 and the mixture distributions are Gaussian. In cases where the mixture distributions are Gaussian, it is also plausible that specialized algorithms such as (Kalai et al., 2012; Gandhi & Borns-Weil, 2016) are more efficient than the methods we consider based on conditional means only.

One general concern with substitute adjustment is model misspecification. We have done our analysis with minimal distributional assumptions, but there are, of course, two fundamental assumptions: the assumption of conditional independence of the XiX_{i}-s given the latent variable ZZ, and the assumption that ZZ takes values in a finite set of size KK. An important extension of our results is to study robustness to violations of these two fundamental assumptions. We have also not considered estimation of KK, and it would likewise be relevant to understand how that affects the substitute adjustment estimator.

Acknowledgments

We thank Alexander Mangulad Christgau for helpful input. JA and NRH were supported by a research grant (NNF20OC0062897) from Novo Nordisk Fonden. JA also received funding from the European Union’s Horizon 2020 research and innovation programme under Marie Skłodowska-Curie grant agreement No 801199.

Appendix A Proofs and auxiliary results

A.1. Proofs of results in Section 2.1

Proof of Proposition 1.

Since $X_i$ as well as $\mathbf{X}_{-i}$ take values in Borel spaces, each has a regular conditional distribution given $Z=z$ (Kallenberg, 2021, Theorem 8.5). These are denoted $P^i_z$ and $P^{-i}_z$, respectively. Moreover, Assumption 2(2) and the Doob-Dynkin lemma (Kallenberg, 2021, Lemma 1.14) imply that for each $i\in\mathbb{N}$ there is a measurable map $f_i:\mathbb{R}^{\mathbb{N}}\to E$ such that $Z=f_i(\mathbf{X}_{-i})$. This implies that $P^{-i}(B)=\int P^{-i}_z(B)\,P^Z(\mathrm{d}z)$ for $B\subseteq\mathbb{R}^{\mathbb{N}}$ measurable.

Since Z=fi(𝐗i)Z=f_{i}(\mathbf{X}_{-i}) it holds that fi(Pi)=PZf_{i}(P^{-i})=P^{Z}, and furthermore that Pzi(fi1({z}))=1P^{-i}_{z}(f^{-1}_{i}(\{z\}))=1. Assumption 2(1) implies that XiX_{i} and 𝐗i\mathbf{X}_{-i} are conditionally independent given ZZ, thus for A,CA,C\subseteq\mathbb{R} and BEB\subseteq E measurable sets and B~=fi1(B)\tilde{B}=f_{i}^{-1}(B)\subseteq\mathbb{R}^{\mathbb{N}},

(XiA,ZB,YC)\displaystyle\mathbb{P}(X_{i}\in A,Z\in B,Y\in C) =(XiA,𝐗iB~,YC)\displaystyle=\mathbb{P}(X_{i}\in A,\mathbf{X}_{-i}\in\tilde{B},Y\in C)
=1A(x)1B~(𝐱)Px,𝐱i(C)P(dx,d𝐱)\displaystyle=\int 1_{A}(x)1_{\tilde{B}}(\mathbf{x})P_{x,\mathbf{x}}^{i}(C)P(\mathrm{d}x,\mathrm{d}\mathbf{x})
=1A(x)1B~(𝐱)Px,𝐱i(C)PziPzi(dx,d𝐱)PZ(dz)\displaystyle=\int 1_{A}(x)1_{\tilde{B}}(\mathbf{x})\,P_{x,\mathbf{x}}^{i}(C)\int P_{z}^{i}\otimes P_{z}^{-i}(\mathrm{d}x,\mathrm{d}\mathbf{x})P^{Z}(\mathrm{d}z)
=1A(x)1B~(𝐱)Px,𝐱i(C)Pzi(dx)Pzi(d𝐱)PZ(dz)\displaystyle=\iiint 1_{A}(x)1_{\tilde{B}}(\mathbf{x})\,P_{x,\mathbf{x}}^{i}(C)P_{z}^{i}(\mathrm{d}x)P_{z}^{-i}(\mathrm{d}\mathbf{x})P^{Z}(\mathrm{d}z)
=1A(x)1B(z)Px,𝐱i(C)Pzi(d𝐱)Pzi(dx)PZ(dz)\displaystyle=\iiint 1_{A}(x)1_{B}(z)\,\int P_{x,\mathbf{x}}^{i}(C)P_{z}^{-i}(\mathrm{d}\mathbf{x})P_{z}^{i}(\mathrm{d}x)P^{Z}(\mathrm{d}z)
=1A(x)1B(z)Qx,zi(C)Pzi(dx)PZ(dz).\displaystyle=\iint 1_{A}(x)1_{B}(z)\,Q^{i}_{x,z}(C)P_{z}^{i}(\mathrm{d}x)P^{Z}(\mathrm{d}z).

Hence Qx,ziQ^{i}_{x,z} is a regular conditional distribution of YY given (Xi,Z)=(x,z)(X_{i},Z)=(x,z).

We finally find that

χxi\displaystyle\chi_{x}^{i} =yPx,𝐱i(dy)Pi(d𝐱)\displaystyle=\iint y\,P^{i}_{x,\mathbf{x}}(\mathrm{d}y)P^{-i}(\mathrm{d}\mathbf{x})
=yPx,𝐱i(dy)Pzi(d𝐱)PZ(dz)\displaystyle=\iiint y\,P^{i}_{x,\mathbf{x}}(\mathrm{d}y)P^{-i}_{z}(\mathrm{d}\mathbf{x})P^{Z}(\mathrm{d}z)
=yPx,𝐱i(dy)Pzi(d𝐱)PZ(dz)\displaystyle=\iint y\,\int P^{i}_{x,\mathbf{x}}(\mathrm{d}y)P^{-i}_{z}(\mathrm{d}\mathbf{x})P^{Z}(\mathrm{d}z)
\displaystyle=\iint y\,Q^{i}_{x,z}(\mathrm{d}y)P^{Z}(\mathrm{d}z). ∎

Proof of Proposition 2.

We find that

Cov[Xi,YZ]\displaystyle\operatorname{Cov}\left[X_{i},Y\mid Z\right] =𝔼[(Xi𝔼[XiZ])YZ]\displaystyle=\mathbb{E}\left[(X_{i}-\mathbb{E}[X_{i}\mid Z])Y\mid Z\right]
=𝔼[𝔼[(Xi𝔼[XiZ])YXi,Z]Z]\displaystyle=\mathbb{E}\left[\mathbb{E}\left[(X_{i}-\mathbb{E}[X_{i}\mid Z])Y\mid X_{i},Z\right]\mid Z\right]
=𝔼[(Xi𝔼[XiZ])𝔼[YXi,Z]Z]\displaystyle=\mathbb{E}\left[(X_{i}-\mathbb{E}[X_{i}\mid Z])\mathbb{E}\left[Y\mid X_{i},Z\right]\mid Z\right]
=𝔼[(Xi𝔼[XiZ])bXii(Z)Z]\displaystyle=\mathbb{E}\left[(X_{i}-\mathbb{E}[X_{i}\mid Z])b_{X_{i}}^{i}(Z)\mid Z\right]
=Cov[Xi,bXii(Z)Z],\displaystyle=\operatorname{Cov}\left[X_{i},b^{i}_{X_{i}}(Z)\mid Z\right],

which shows (9). From this representation, if bxi(z)=bi(z)b_{x}^{i}(z)=b^{i}(z) does not depend on xx, bi(Z)b^{i}(Z) is σ(Z)\sigma(Z)-measurable and Cov[Xi,bi(Z)Z]=0\operatorname{Cov}\left[X_{i},b^{i}(Z)\mid Z\right]=0, whence βi=0\beta_{i}=0.

If bxi(z)=βi(z)x+ηi(z)b^{i}_{x}(z)=\beta_{i}^{\prime}(z)x+\eta_{-i}(z),

Cov[Xi,bXii(Z)Z]=Cov[Xi,βi(Z)Xi+ηi(Z)Z]=βi(Z)Var[XiZ],\operatorname{Cov}\left[X_{i},b^{i}_{X_{i}}(Z)\mid Z\right]=\operatorname{Cov}\left[X_{i},\beta_{i}^{\prime}(Z)X_{i}+\eta_{-i}(Z)\mid Z\right]=\beta_{i}^{\prime}(Z)\operatorname{Var}\left[X_{i}\mid Z\right],

and (10) follows. ∎

A.2. Auxiliary results related to Section 3.2 and proof of Theorem 1

Let 𝐙\mathbf{Z} denote the n×Kn\times K matrix of dummy variable encodings of the zkz_{k}-s, and let 𝐙^\hat{\mathbf{Z}} denote the similar matrix for the substitutes z^k\hat{z}_{k}-s. With P𝐙P_{\mathbf{Z}} and P𝐙^P_{\hat{\mathbf{Z}}} the orthogonal projections onto the column spaces of 𝐙\mathbf{Z} and 𝐙^\hat{\mathbf{Z}}, respectively, we can write the estimator from Algorithm 3 as

(27) β^isub=𝐱iP𝐙^𝐱i,𝐲P𝐙^𝐲𝐱iP𝐙^𝐱i22.\widehat{\beta}^{\mathrm{sub}}_{i}=\frac{\langle\mathbf{x}_{i}-P_{\hat{\mathbf{Z}}}\mathbf{x}_{i},\mathbf{y}-P_{\hat{\mathbf{Z}}}\mathbf{y}\rangle}{\|\mathbf{x}_{i}-P_{\hat{\mathbf{Z}}}\mathbf{x}_{i}\|_{2}^{2}}.

Here 𝐱i,𝐲n\mathbf{x}_{i},\mathbf{y}\in\mathbb{R}^{n} denote the nn-vectors of xi,kx_{i,k}-s and yky_{k}-s, respectively, and ,\langle\cdot,\cdot\rangle is the standard inner product on n\mathbb{R}^{n}, so that, e.g., 𝐲22=𝐲,𝐲\|\mathbf{y}\|_{2}^{2}=\langle\mathbf{y},\mathbf{y}\rangle. The estimator, had we observed the latent variables, is similarly given as

(28) β^i=𝐱iP𝐙𝐱i,𝐲P𝐙𝐲𝐱iP𝐙𝐱i22.\hat{\beta}_{i}=\frac{\langle\mathbf{x}_{i}-P_{\mathbf{Z}}\mathbf{x}_{i},\mathbf{y}-P_{\mathbf{Z}}\mathbf{y}\rangle}{\|\mathbf{x}_{i}-P_{\mathbf{Z}}\mathbf{x}_{i}\|_{2}^{2}}.

The proof of Theorem 1 is based on the following bound on the difference between the projection matrices.

Lemma 1.

Let α\alpha and δ\delta be as defined by (16) and (17). If α>0\alpha>0 it holds that

(29) P𝐙P𝐙^22δα,\|P_{\mathbf{Z}}-P_{\hat{\mathbf{Z}}}\|_{2}\leq\sqrt{\frac{2\delta}{\alpha}},

where 2\|\cdot\|_{2} above denotes the operator 22-norm also known as the spectral norm.

Proof.

When α>0\alpha>0, the matrices 𝐙\mathbf{Z} and 𝐙^\hat{\mathbf{Z}} have full rank KK. Let 𝐙+=(𝐙T𝐙)1𝐙T\mathbf{Z}^{+}=(\mathbf{Z}^{T}\mathbf{Z})^{-1}\mathbf{Z}^{T} and 𝐙^+=(𝐙^T𝐙^)1𝐙^T\hat{\mathbf{Z}}^{+}=(\hat{\mathbf{Z}}^{T}\hat{\mathbf{Z}})^{-1}\hat{\mathbf{Z}}^{T} denote the Moore-Penrose inverses of 𝐙\mathbf{Z} and 𝐙^\hat{\mathbf{Z}}, respectively. Then P𝐙=𝐙𝐙+P_{\mathbf{Z}}=\mathbf{Z}\mathbf{Z}^{+} and P𝐙^=𝐙^𝐙^+P_{\hat{\mathbf{Z}}}=\hat{\mathbf{Z}}\hat{\mathbf{Z}}^{+}. By Theorems 2.3 and 2.4 in (Stewart, 1977),

P𝐙P𝐙^2min{𝐙+2,𝐙^+2}𝐙𝐙^2.\|P_{\mathbf{Z}}-P_{\hat{\mathbf{Z}}}\|_{2}\leq\min\left\{\|\mathbf{Z}^{+}\|_{2},\|\hat{\mathbf{Z}}^{+}\|_{2}\right\}\ \|\mathbf{Z}-\hat{\mathbf{Z}}\|_{2}.

The operator 22-norm 𝐙+2\|\mathbf{Z}^{+}\|_{2} is the square root of the largest eigenvalue of

(𝐙T𝐙)1=(n(1)1000n(2)1000n(K)1).(\mathbf{Z}^{T}\mathbf{Z})^{-1}=\left(\begin{array}[]{cccc}n(1)^{-1}&0&\ldots&0\\ 0&n(2)^{-1}&\ldots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\ldots&n(K)^{-1}\end{array}\right).

Whence 𝐙+2(nmin)1/2=(αn)1/2\|\mathbf{Z}^{+}\|_{2}\leq(n_{\min})^{-1/2}=(\alpha n)^{-1/2}. The same bound is obtained for 𝐙^+2\|\hat{\mathbf{Z}}^{+}\|_{2}, which gives

P𝐙P𝐙^21αn𝐙𝐙^2.\|P_{\mathbf{Z}}-P_{\hat{\mathbf{Z}}}\|_{2}\leq\frac{1}{\sqrt{\alpha n}}\ \|\mathbf{Z}-\hat{\mathbf{Z}}\|_{2}.

We also have that

\|\mathbf{Z}-\hat{\mathbf{Z}}\|_{2}^{2}\leq\|\mathbf{Z}-\hat{\mathbf{Z}}\|_{F}^{2}=\sum_{k=1}^{n}\sum_{i=1}^{K}(\mathbf{Z}_{k,i}-\hat{\mathbf{Z}}_{k,i})^{2}=2\delta n,

because $\sum_{i=1}^{K}(\mathbf{Z}_{k,i}-\hat{\mathbf{Z}}_{k,i})^{2}=2$ precisely for those $k$ with $\hat{z}_{k}\neq z_{k}$ and $0$ otherwise. Combining the inequalities gives (29). ∎
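
As a quick numerical sanity check of (29) (our own sketch, not part of the argument), one can generate label vectors with a few mislabelings, form the dummy matrices and their projections, and verify the bound:

```python
import numpy as np

rng = np.random.default_rng(3)
n, K = 400, 5
z = rng.integers(K, size=n)
z_hat = z.copy()
flip = rng.random(n) < 0.05                    # mislabel roughly 5% of the observations
z_hat[flip] = rng.integers(K, size=flip.sum())

Z = np.eye(K)[z]                               # n x K dummy matrices
Z_hat = np.eye(K)[z_hat]
P_Z = Z @ np.linalg.inv(Z.T @ Z) @ Z.T         # projections onto the column spaces
P_Zhat = Z_hat @ np.linalg.inv(Z_hat.T @ Z_hat) @ Z_hat.T

delta = np.mean(z_hat != z)
alpha = min(np.bincount(z, minlength=K).min(),
            np.bincount(z_hat, minlength=K).min()) / n
lhs = np.linalg.norm(P_Z - P_Zhat, 2)          # operator 2-norm (largest singular value)
assert lhs <= np.sqrt(2 * delta / alpha) + 1e-10
```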

Before proceeding with the proof of Theorem 1, note that

k=1n(xi,kμ¯i(zk))2=𝐱iP𝐙𝐱i22=(IP𝐙)𝐱i22𝐱i22\sum_{k=1}^{n}(x_{i,k}-\overline{\mu}_{i}(z_{k}))^{2}=\|\mathbf{x}_{i}-P_{\mathbf{Z}}\mathbf{x}_{i}\|_{2}^{2}=\|(I-P_{\mathbf{Z}})\mathbf{x}_{i}\|_{2}^{2}\leq\|\mathbf{x}_{i}\|_{2}^{2}

since (IP𝐙)(I-P_{\mathbf{Z}}) is a projection. Similarly, k=1n(xi,kμ^i(z^k))2=𝐱iP𝐙^𝐱i22𝐱22\sum_{k=1}^{n}(x_{i,k}-\hat{\mu}_{i}(\hat{z}_{k}))^{2}=\|\mathbf{x}_{i}-P_{\hat{\mathbf{Z}}}\mathbf{x}_{i}\|_{2}^{2}\leq\|\mathbf{x}\|_{2}^{2}, thus

ρ=min{𝐱iP𝐙𝐱i22,𝐱iP𝐙^𝐱i22}𝐱i221.\rho=\frac{\min\left\{\|\mathbf{x}_{i}-P_{\mathbf{Z}}\mathbf{x}_{i}\|_{2}^{2},\|\mathbf{x}_{i}-P_{\hat{\mathbf{Z}}}\mathbf{x}_{i}\|_{2}^{2}\right\}}{\|\mathbf{x}_{i}\|_{2}^{2}}\leq 1.
Proof of Theorem 1.

First note that since IP𝐙^I-P_{\hat{\mathbf{Z}}} is an orthogonal projection,

𝐱iP𝐙^𝐱i,𝐲P𝐙^𝐲=𝐱i,(IP𝐙^)𝐲\langle\mathbf{x}_{i}-P_{\hat{\mathbf{Z}}}\mathbf{x}_{i},\mathbf{y}-P_{\hat{\mathbf{Z}}}\mathbf{y}\rangle=\langle\mathbf{x}_{i},(I-P_{\hat{\mathbf{Z}}})\mathbf{y}\rangle

and similarly for the other inner product in (28). Moreover,

𝐱i,(IP𝐙^)𝐲𝐱i,(IP𝐙)𝐲=𝐱i,(P𝐙P𝐙^)𝐲\langle\mathbf{x}_{i},(I-P_{\hat{\mathbf{Z}}})\mathbf{y}\rangle-\langle\mathbf{x}_{i},(I-P_{\mathbf{Z}})\mathbf{y}\rangle=\langle\mathbf{x}_{i},(P_{\mathbf{Z}}-P_{\hat{\mathbf{Z}}})\mathbf{y}\rangle

and

(IP𝐙)𝐱i22(IP𝐙^)𝐱i22=(P𝐙^P𝐙)𝐱i22.\|(I-P_{\mathbf{Z}})\mathbf{x}_{i}\|_{2}^{2}-\|(I-P_{\hat{\mathbf{Z}}})\mathbf{x}_{i}\|_{2}^{2}=\|(P_{\hat{\mathbf{Z}}}-P_{\mathbf{Z}})\mathbf{x}_{i}\|_{2}^{2}.

We find that

β^isubβ^i\displaystyle\widehat{\beta}^{\mathrm{sub}}_{i}-\hat{\beta}_{i} =𝐱i,(IP𝐙^)𝐲(IP𝐙^)𝐱i22𝐱i,(IP𝐙)𝐲(IP𝐙)𝐱i22\displaystyle=\frac{\langle\mathbf{x}_{i},(I-P_{\hat{\mathbf{Z}}})\mathbf{y}\rangle}{\|(I-P_{\hat{\mathbf{Z}}})\mathbf{x}_{i}\|_{2}^{2}}-\frac{\langle\mathbf{x}_{i},(I-P_{\mathbf{Z}})\mathbf{y}\rangle}{\|(I-P_{\mathbf{Z}})\mathbf{x}_{i}\|_{2}^{2}}
=𝐱i,(IP𝐙^)𝐲(1(IP𝐙^)𝐱i221(IP𝐙)𝐱i22)\displaystyle=\langle\mathbf{x}_{i},(I-P_{\hat{\mathbf{Z}}})\mathbf{y}\rangle\left(\frac{1}{\|(I-P_{\hat{\mathbf{Z}}})\mathbf{x}_{i}\|_{2}^{2}}-\frac{1}{\|(I-P_{\mathbf{Z}})\mathbf{x}_{i}\|_{2}^{2}}\right)
+𝐱i,(IP𝐙^)𝐲𝐱i,(IP𝐙)𝐲(IP𝐙)𝐱i22\displaystyle\qquad\qquad+\frac{\langle\mathbf{x}_{i},(I-P_{\hat{\mathbf{Z}}})\mathbf{y}\rangle-\langle\mathbf{x}_{i},(I-P_{\mathbf{Z}})\mathbf{y}\rangle}{\|(I-P_{\mathbf{Z}})\mathbf{x}_{i}\|_{2}^{2}}
=𝐱i,(IP𝐙^)𝐲((P𝐙^P𝐙)𝐱i22(IP𝐙^)𝐱i22(IP𝐙)𝐱i22)\displaystyle=\langle\mathbf{x}_{i},(I-P_{\hat{\mathbf{Z}}})\mathbf{y}\rangle\left(\frac{\|(P_{\hat{\mathbf{Z}}}-P_{\mathbf{Z}})\mathbf{x}_{i}\|_{2}^{2}}{\|(I-P_{\hat{\mathbf{Z}}})\mathbf{x}_{i}\|_{2}^{2}\|(I-P_{\mathbf{Z}})\mathbf{x}_{i}\|_{2}^{2}}\right)
+𝐱i,(P𝐙P𝐙^)𝐲(IP𝐙)𝐱i22.\displaystyle\qquad\qquad+\frac{\langle\mathbf{x}_{i},(P_{\mathbf{Z}}-P_{\hat{\mathbf{Z}}})\mathbf{y}\rangle}{\|(I-P_{\mathbf{Z}})\mathbf{x}_{i}\|_{2}^{2}}.

This gives the following inequality, using that ρ1\rho\leq 1,

|β^isubβ^i|\displaystyle|\widehat{\beta}^{\mathrm{sub}}_{i}-\hat{\beta}_{i}| P𝐙P𝐙^2𝐱i23𝐲2ρ2𝐱i24+P𝐙P𝐙^2𝐱i2𝐲2ρ𝐱i22\displaystyle\leq\frac{\|P_{\mathbf{Z}}-P_{\hat{\mathbf{Z}}}\|_{2}\|\mathbf{x}_{i}\|_{2}^{3}\|\mathbf{y}\|_{2}}{\rho^{2}\|\mathbf{x}_{i}\|^{4}_{2}}+\frac{\|P_{\mathbf{Z}}-P_{\hat{\mathbf{Z}}}\|_{2}\|\mathbf{x}_{i}\|_{2}\|\mathbf{y}\|_{2}}{\rho\|\mathbf{x}_{i}\|^{2}_{2}}
=(1ρ2+1ρ)P𝐙P𝐙^2𝐲2𝐱i2\displaystyle=\left(\frac{1}{\rho^{2}}+\frac{1}{\rho}\right)\|P_{\mathbf{Z}}-P_{\hat{\mathbf{Z}}}\|_{2}\frac{\|\mathbf{y}\|_{2}}{\|\mathbf{x}_{i}\|_{2}}
2ρ2P𝐙P𝐙^2𝐲2𝐱i2.\displaystyle\leq\frac{2}{\rho^{2}}\|P_{\mathbf{Z}}-P_{\hat{\mathbf{Z}}}\|_{2}\frac{\|\mathbf{y}\|_{2}}{\|\mathbf{x}_{i}\|_{2}}.

Combining this inequality with (29) gives (19). ∎

A.3. Auxiliary concentration inequalities. Proofs of Propositions 3 and 4

Lemma 2.

Suppose that Assumption 4 holds. Let 𝛍ˇ1:p(z)p\check{\boldsymbol{\mu}}_{1:p}(z)\in\mathbb{R}^{p} for zEz\in E and let Z^=argminz𝐗1:p𝛍ˇ1:p(z)2\hat{Z}=\operatorname*{arg\,min}_{z}\|\mathbf{X}_{1:p}-\check{\boldsymbol{\mu}}_{1:p}(z)\|_{2}. Suppose that Rz,v(p)110R_{z,v}^{(p)}\leq\frac{1}{10} for all z,vEz,v\in E with vzv\neq z then

(30) (Z^=vZ=z)25σmax2𝝁1:p(z)𝝁1:p(v)22.\mathbb{P}(\hat{Z}=v\mid Z=z)\leq\frac{25\sigma_{\max}^{2}}{\|\boldsymbol{\mu}_{1:p}(z)-\boldsymbol{\mu}_{1:p}(v)\|_{2}^{2}}.
Proof.

Since pp is fixed throughout the proof, we simplify the notation by dropping the 1:p1\!\!:\!\!p subscript and use, e.g., 𝐗\mathbf{X} and 𝝁\boldsymbol{\mu} to denote the p\mathbb{R}^{p}-vectors 𝐗1:p\mathbf{X}_{1:p} and 𝝁1:p\boldsymbol{\mu}_{1:p}, respectively.

Fix also z,vEz,v\in E with vzv\neq z and observe first that

(Z^=v)\displaystyle(\hat{Z}=v) (𝐗𝝁ˇ(v)2<𝐗𝝁ˇ(z)2)\displaystyle\subseteq\left(\norm{\mathbf{X}-\check{\boldsymbol{\mu}}(v)}_{2}<\norm{\mathbf{X}-\check{\boldsymbol{\mu}}(z)}_{2}\right)
=(𝐗𝝁ˇ(z),𝝁ˇ(z)𝝁ˇ(v)<12𝝁ˇ(z)𝝁ˇ(v)22)\displaystyle=\left(\langle\mathbf{X}-\check{\boldsymbol{\mu}}(z),\check{\boldsymbol{\mu}}(z)-\check{\boldsymbol{\mu}}(v)\rangle<-\tfrac{1}{2}\norm{\check{\boldsymbol{\mu}}(z)-\check{\boldsymbol{\mu}}(v)}^{2}_{2}\right)
=(𝐗𝝁(z),𝝁ˇ(z)𝝁ˇ(v)<\displaystyle=\Big{(}\langle\mathbf{X}-\boldsymbol{\mu}(z),\check{\boldsymbol{\mu}}(z)-\check{\boldsymbol{\mu}}(v)\rangle<
(12𝝁ˇ(z)𝝁ˇ(v)22+𝝁(z)𝝁ˇ(z),𝝁ˇ(z)𝝁ˇ(v))).\displaystyle\qquad\qquad-\left(\tfrac{1}{2}\norm{\check{\boldsymbol{\mu}}(z)-\check{\boldsymbol{\mu}}(v)}^{2}_{2}+\langle\boldsymbol{\mu}(z)-\check{\boldsymbol{\mu}}(z),\check{\boldsymbol{\mu}}(z)-\check{\boldsymbol{\mu}}(v)\rangle\right)\Big{)}.

The objective is to bound the probability of the event above using Chebyshev’s inequality. To this end, we first use the Cauchy-Schwarz inequality to get

12𝝁ˇ(z)𝝁ˇ(v)22\displaystyle\tfrac{1}{2}\norm{\check{\boldsymbol{\mu}}(z)-\check{\boldsymbol{\mu}}(v)}^{2}_{2} +𝝁(z)𝝁ˇ(z),𝝁ˇ(z)𝝁ˇ(v)\displaystyle+\langle\boldsymbol{\mu}(z)-\check{\boldsymbol{\mu}}(z),\check{\boldsymbol{\mu}}(z)-\check{\boldsymbol{\mu}}(v)\rangle
12𝝁ˇ(z)𝝁ˇ(v)22𝝁(z)𝝁ˇ(z)2𝝁ˇ(z)𝝁ˇ(v)2\displaystyle\geq\tfrac{1}{2}\norm{\check{\boldsymbol{\mu}}(z)-\check{\boldsymbol{\mu}}(v)}^{2}_{2}-\|\boldsymbol{\mu}(z)-\check{\boldsymbol{\mu}}(z)\|_{2}\|\check{\boldsymbol{\mu}}(z)-\check{\boldsymbol{\mu}}(v)\|_{2}
=𝝁(z)𝝁(v)22(12Bz,v2Rz,v(p)Bz,v),\displaystyle=\norm{\boldsymbol{\mu}(z)-\boldsymbol{\mu}(v)}_{2}^{2}\left(\tfrac{1}{2}B_{z,v}^{2}-R_{z,v}^{(p)}B_{z,v}\right),

where

Bz,v=𝝁ˇ(z)𝝁ˇ(v)2𝝁(z)𝝁(v)2.B_{z,v}=\frac{\norm{\check{\boldsymbol{\mu}}(z)-\check{\boldsymbol{\mu}}(v)}_{2}}{\norm{\boldsymbol{\mu}(z)-\boldsymbol{\mu}(v)}_{2}}.

The triangle and reverse triangle inequality give that

𝝁ˇ(z)𝝁ˇ(v)2\displaystyle\norm{\check{\boldsymbol{\mu}}(z)-\check{\boldsymbol{\mu}}(v)}_{2} 𝝁(z)𝝁(v)2+𝝁ˇ(z)𝝁(z)2+𝝁(v)𝝁ˇ(v)2\displaystyle\leq\norm{\boldsymbol{\mu}(z)-\boldsymbol{\mu}(v)}_{2}+\norm{\check{\boldsymbol{\mu}}(z)-\boldsymbol{\mu}(z)}_{2}+\norm{\boldsymbol{\mu}(v)-\check{\boldsymbol{\mu}}(v)}_{2}
𝝁ˇ(z)𝝁ˇ(v)2\displaystyle\norm{\check{\boldsymbol{\mu}}(z)-\check{\boldsymbol{\mu}}(v)}_{2} |𝝁(z)𝝁(v)2𝝁(z)𝝁ˇ(z)2𝝁(v)𝝁ˇ(v)2|,\displaystyle\geq\Big{|}\norm{\boldsymbol{\mu}(z)-\boldsymbol{\mu}(v)}_{2}-\norm{\boldsymbol{\mu}(z)-\check{\boldsymbol{\mu}}(z)}_{2}-\norm{\boldsymbol{\mu}(v)-\check{\boldsymbol{\mu}}(v)}_{2}\Big{|},

and dividing by 𝝁(z)𝝁(v)2\norm{\boldsymbol{\mu}(z)-\boldsymbol{\mu}(v)}_{2} combined with the bound 110\tfrac{1}{10} on the relative errors yield

Bz,v\displaystyle B_{z,v} 1+Rz,v(p)+Rv,z(p)65,\displaystyle\leq 1+R_{z,v}^{(p)}+R_{v,z}^{(p)}\leq\frac{6}{5},
Bz,v\displaystyle B_{z,v} |1Rz,v(p)Rv,z(p)|45.\displaystyle\geq\Big{|}1-R_{z,v}^{(p)}-R_{v,z}^{(p)}\Big{|}\geq\frac{4}{5}.

This gives

12Bz,v2Rz,v(p)Bz,v12Bz,v2110Bz,v625\tfrac{1}{2}B_{z,v}^{2}-R_{z,v}^{(p)}B_{z,v}\geq\tfrac{1}{2}B_{z,v}^{2}-\tfrac{1}{10}B_{z,v}\geq\tfrac{6}{25}

since the function bb2210bb\mapsto b^{2}-\tfrac{2}{10}b is increasing for b45b\geq\tfrac{4}{5}.

Introducing the variables Wi=(Xiμi(z))(μˇi(z)μˇi(v))W_{i}=(X_{i}-\mu_{i}(z))(\check{\mu}_{i}(z)-\check{\mu}_{i}(v)) we conclude that

(31) (Z^=v)\displaystyle(\hat{Z}=v) (i=1pWi<625𝝁(z)𝝁(v)22).\displaystyle\subseteq\left(\sum_{i=1}^{p}W_{i}<-\tfrac{6}{25}\norm{\boldsymbol{\mu}(z)-\boldsymbol{\mu}(v)}_{2}^{2}\right).

Note that 𝔼[WiZ=z]=0\mathbb{E}[W_{i}\mid Z=z]=0 and Var[WiZ=z]=(μˇi(z)μˇi(v))2σi2(z)\operatorname{Var}[W_{i}\mid Z=z]=(\check{\mu}_{i}(z)-\check{\mu}_{i}(v))^{2}\sigma^{2}_{i}(z), and by Assumption 4, the WiW_{i}-s are conditionally independent given Z=zZ=z, so Chebyshev’s inequality gives that

(Z^=vZ=z)\displaystyle\mathbb{P}(\hat{Z}=v\mid Z=z) (i=1pWi<625𝝁(z)𝝁(v)22|Z=z)\displaystyle\leq\mathbb{P}\left(\sum_{i=1}^{p}W_{i}<-\tfrac{6}{25}\norm{\boldsymbol{\mu}(z)-\boldsymbol{\mu}(v)}_{2}^{2}\;\middle|\;Z=z\right)
(256)2i=1p(μˇi(z)μˇi(v))2σi2(z)𝝁(z)𝝁(v)24\displaystyle\leq\left(\frac{25}{6}\right)^{2}\frac{\sum_{i=1}^{p}(\check{\mu}_{i}(z)-\check{\mu}_{i}(v))^{2}\sigma^{2}_{i}(z)}{\norm{\boldsymbol{\mu}(z)-\boldsymbol{\mu}(v)}_{2}^{4}}
\displaystyle\leq\left(\frac{25}{6}\right)^{2}\frac{\sigma^{2}_{\max}\norm{\check{\boldsymbol{\mu}}(z)-\check{\boldsymbol{\mu}}(v)}_{2}^{2}}{\norm{\boldsymbol{\mu}(z)-\boldsymbol{\mu}(v)}_{2}^{4}}
(256)2Bz,v2σmax2𝝁(z)𝝁(v)22\displaystyle\leq\left(\frac{25}{6}\right)^{2}B_{z,v}^{2}\frac{\sigma^{2}_{\max}}{\norm{\boldsymbol{\mu}(z)-\boldsymbol{\mu}(v)}_{2}^{2}}
25σmax2𝝁(z)𝝁(v)22,\displaystyle\leq\frac{25\sigma^{2}_{\max}}{\norm{\boldsymbol{\mu}(z)-\boldsymbol{\mu}(v)}_{2}^{2}},

where we, for the last inequality, used that Bz,v2(65)2B_{z,v}^{2}\leq\left(\frac{6}{5}\right)^{2}. ∎

Before proceeding to the concentration inequality for sub-Gaussian distributions, we use Lemma 2 to prove Proposition 3.

Proof of Proposition 3.

Suppose that i=1i=1 for convenience. We take 𝝁ˇ1:p(z)=𝝁1:p(z)\check{\boldsymbol{\mu}}_{1:p}(z)=\boldsymbol{\mu}_{1:p}(z) for all pp\in\mathbb{N} and zEz\in E and write Z^p=argminz𝐗2:p𝝁2:p(z)2\hat{Z}_{p}=\operatorname*{arg\,min}_{z}\|\mathbf{X}_{2:p}-\boldsymbol{\mu}_{2:p}(z)\|_{2} for the prediction of ZZ based on the coordinates 2,,p2,\ldots,p. With this oracle choice of 𝝁ˇ1:p(z)\check{\boldsymbol{\mu}}_{1:p}(z), the relative errors are zero, thus the bound (30) holds, and Lemma 2 gives

(Z^pZ)\displaystyle\mathbb{P}\left(\hat{Z}_{p}\neq Z\right) =zvz(Z^p=v,Z=z)\displaystyle=\sum_{z}\sum_{v\neq z}\mathbb{P}\left(\hat{Z}_{p}=v,Z=z\right)
=zvz(Z^p=v|Z=z)(Z=z)\displaystyle=\sum_{z}\sum_{v\neq z}\mathbb{P}\left(\hat{Z}_{p}=v\;\middle|\;Z=z\right)\mathbb{P}\left(Z=z\right)
Cminzv𝝁2:p(z)𝝁2:p(v)22\displaystyle\leq\frac{C}{\min_{z\neq v}\norm{\boldsymbol{\mu}_{2:p}(z)-\boldsymbol{\mu}_{2:p}(v)}_{2}^{2}}

with CC a constant independent of pp. By (14), minzv𝝁2:p(z)𝝁2:p(v)22\min_{z\neq v}\norm{\boldsymbol{\mu}_{2:p}(z)-\boldsymbol{\mu}_{2:p}(v)}_{2}^{2}\to\infty for pp\to\infty, and by choosing a subsequence, prp_{r}, we can ensure that (Z^prZ)1r2.\mathbb{P}\left(\hat{Z}_{p_{r}}\neq Z\right)\leq\frac{1}{r^{2}}. Then r=1(Z^prZ)<\sum_{r=1}^{\infty}\mathbb{P}\left(\hat{Z}_{p_{r}}\neq Z\right)<\infty, and by Borel-Cantelli’s lemma,

(Z^prZinfinitely often)=0.\mathbb{P}\left(\hat{Z}_{p_{r}}\neq Z\ \text{infinitely often}\right)=0.

That is, (Z^pr=Zeventually)=1\mathbb{P}\left(\hat{Z}_{p_{r}}=Z\ \text{eventually}\right)=1, which shows that we can recover ZZ from (Z^pr)r(\hat{Z}_{p_{r}})_{r\in\mathbb{N}} and thus from 𝐗1\mathbf{X}_{-1} (with probability 11). Defining

Z={limrZ^prif Z^pr=Z eventually0otherwiseZ^{\prime}=\left\{\begin{array}[]{ll}\lim\limits_{r\to\infty}\hat{Z}_{p_{r}}&\qquad\text{if }\hat{Z}_{p_{r}}=Z\text{ eventually}\\ 0&\qquad\text{otherwise}\end{array}\right.

we see that σ(Z)σ(𝐗1)\sigma(Z^{\prime})\subseteq\sigma(\mathbf{X}_{-1}) and Z=ZZ^{\prime}=Z almost surely. Thus if we replace ZZ by ZZ^{\prime} in Assumption 4 we see that Assumption 2(2) holds. ∎

Lemma 3.

Consider the same setup as in Lemma 2, that is, Assumption 4 holds and Rz,v(p)110R_{z,v}^{(p)}\leq\frac{1}{10} for all z,vEz,v\in E with vzv\neq z. Suppose, in addition, that the conditional distribution of XiX_{i} given Z=zZ=z is sub-Gaussian with variance factor vmaxv_{\max}, independent of ii and zz, then

(32) (Z^=vZ=z)exp(150vmax𝝁1:p(z)𝝁1:p(v)22).\mathbb{P}(\hat{Z}=v\mid Z=z)\leq\exp\left(-\frac{1}{50v_{\max}}\|\boldsymbol{\mu}_{1:p}(z)-\boldsymbol{\mu}_{1:p}(v)\|_{2}^{2}\right).
Proof.

Recall that XiX_{i} given Z=zZ=z being sub-Gaussian with variance factor vmaxv_{\max} means that

log𝔼[eλ(Xiμi(z))|Z=z]12λ2vmax\log\mathbb{E}\left[e^{\lambda(X_{i}-\mu_{i}(z))}\;\middle|\;Z=z\right]\leq\frac{1}{2}\lambda^{2}v_{\max}

for λ\lambda\in\mathbb{R}. Consequently, with WiW_{i} as in the proof of Lemma 2, and using conditional independence of the XiX_{i}-s given Z=zZ=z,

log𝔼[eλi=1pWi|Z=z]\displaystyle\log\mathbb{E}\left[e^{\lambda\sum_{i=1}^{p}W_{i}}\;\middle|\;Z=z\right] =i=1plog𝔼[eλ(μˇi(z)μˇi(v))(Xiμi(z))|Z=z]\displaystyle=\sum_{i=1}^{p}\log\mathbb{E}\left[e^{\lambda(\check{\mu}_{i}(z)-\check{\mu}_{i}(v))(X_{i}-\mu_{i}(z))}\;\middle|\;Z=z\right]
12λ2vmaxi=1p(μˇi(z)μˇi(v))2\displaystyle\leq\frac{1}{2}\lambda^{2}v_{\max}\sum_{i=1}^{p}(\check{\mu}_{i}(z)-\check{\mu}_{i}(v))^{2}
=12λ2vmax𝝁ˇ1:p(z)𝝁ˇ1:p(v)22.\displaystyle=\frac{1}{2}\lambda^{2}v_{\max}\|\check{\boldsymbol{\mu}}_{1:p}(z)-\check{\boldsymbol{\mu}}_{1:p}(v)\|_{2}^{2}.

Using (31) in combination with the Chernoff bound gives

(Z^=vZ=z)\displaystyle\mathbb{P}(\hat{Z}=v\mid Z=z) (i=1pWi<625𝝁1:p(z)𝝁1:p(v)22|Z=z)\displaystyle\leq\mathbb{P}\left(\sum_{i=1}^{p}W_{i}<-\tfrac{6}{25}\norm{\boldsymbol{\mu}_{1:p}(z)-\boldsymbol{\mu}_{1:p}(v)}_{2}^{2}\;\middle|\;Z=z\right)
exp((625)2𝝁1:p(z)𝝁1:p(v)242vmax𝝁ˇ1:p(z)𝝁ˇ1:p(v)22)\displaystyle\leq\exp\left(-\left(\frac{6}{25}\right)^{2}\frac{\norm{\boldsymbol{\mu}_{1:p}(z)-\boldsymbol{\mu}_{1:p}(v)}_{2}^{4}}{2v_{\max}\|\check{\boldsymbol{\mu}}_{1:p}(z)-\check{\boldsymbol{\mu}}_{1:p}(v)\|_{2}^{2}}\right)
=exp(12vmax(625)2Bz,v2𝝁1:p(z)𝝁1:p(v)22)\displaystyle=\exp\left(-\frac{1}{2v_{\max}}\left(\frac{6}{25}\right)^{2}B_{z,v}^{-2}\norm{\boldsymbol{\mu}_{1:p}(z)-\boldsymbol{\mu}_{1:p}(v)}_{2}^{2}\right)
exp(150vmax𝝁1:p(z)𝝁1:p(v)22),\displaystyle\leq\exp\left(-\frac{1}{50v_{\max}}\norm{\boldsymbol{\mu}_{1:p}(z)-\boldsymbol{\mu}_{1:p}(v)}_{2}^{2}\right),

where we, as in the proof of Lemma 2, have used that the bound on the relative error implies that Bz,v65B_{z,v}\leq\tfrac{6}{5}. ∎

Proof of Proposition 4.

The argument proceeds as in the proof of Proposition 3. We first note that

(Z^Z)\displaystyle\mathbb{P}\left(\hat{Z}\neq Z\right) =zvz(Z^=v,Z=z)\displaystyle=\sum_{z}\sum_{v\neq z}\mathbb{P}\left(\hat{Z}=v,Z=z\right)
=zvz(Z^=v|Z=z)(Z=z).\displaystyle=\sum_{z}\sum_{v\neq z}\mathbb{P}\left(\hat{Z}=v\;\middle|\;Z=z\right)\mathbb{P}\left(Z=z\right).

Lemma 2 then gives

(Z^Z)25Kσmax2sep(p).\mathbb{P}\left(\hat{Z}\neq Z\right)\leq\frac{25K\sigma^{2}_{\max}}{\mathrm{sep}(p)}.

If the sub-Gaussian assumption holds, Lemma 3 instead gives

\mathbb{P}\left(\hat{Z}\neq Z\right)\leq K\exp\left(-\frac{\mathrm{sep}(p)}{50v_{\max}}\right). ∎

A.4. Proof of Theorem 2

Proof of Theorem 2.

Recall that

δ=1nk=1n1(z^kzk),\delta=\frac{1}{n}\sum_{k=1}^{n}1(\hat{z}_{k}\neq z_{k}),

hence by Proposition 4

𝔼[δ]\displaystyle\mathbb{E}[\delta] =(Z^kZ)\displaystyle=\mathbb{P}(\hat{Z}_{k}\neq Z)
(Z^kZ|maxzvRz,v(p)110)+(maxzvRz,v(p)>110)\displaystyle\leq\mathbb{P}\left(\hat{Z}_{k}\neq Z\;\middle|\;\max_{z\neq v}R^{(p)}_{z,v}\leq\tfrac{1}{10}\right)+\mathbb{P}\left(\max_{z\neq v}R^{(p)}_{z,v}>\tfrac{1}{10}\right)
(33) 25Kσmax2sep(p)+K2maxzv(Rz,v(p)>110).\displaystyle\leq\frac{25K\sigma_{\max}^{2}}{\mathrm{sep}(p)}+K^{2}\max_{z\neq v}\mathbb{P}\left(R^{(p)}_{z,v}>\tfrac{1}{10}\right).

Both terms above tend to $0$, thus $\mathbb{E}[\delta]\to 0$, and Markov's inequality gives $\delta\overset{P}{\to}0$.

Now rewrite the bound (19) as

|β^isubβ^i|δ(22ρ2α𝐲2𝐱i2)=Ln|\widehat{\beta}^{\mathrm{sub}}_{i}-\hat{\beta}_{i}|\leq\sqrt{\delta}\underbrace{\left(\frac{2\sqrt{2}}{\rho^{2}\sqrt{\alpha}}\frac{\|\mathbf{y}\|_{2}}{\|\mathbf{x}_{i}\|_{2}}\right)}_{=L_{n}}

From the argument above, δ𝑃0\sqrt{\delta}\overset{P}{\to}0. We will show that the second factor, LnL_{n}, tends to a constant, LL, in probability under the stated assumptions. This will imply that

|β^isubβ^i|𝑃0,|\widehat{\beta}^{\mathrm{sub}}_{i}-\hat{\beta}_{i}|\overset{P}{\to}0,

which shows case (1).

Observe first that

\frac{1}{n}\|\mathbf{x}_{i}\|_{2}^{2}=\frac{1}{n}\sum_{k=1}^{n}x_{i,k}^{2}\overset{P}{\to}\mathbb{E}[X_{i}^{2}]\in(0,\infty)

by the Law of Large Numbers, using the i.i.d. assumption and the fact that $\mathbb{E}[X_i^2]\in(0,\infty)$ by Assumption 4. Similarly, $\frac{1}{n}\|\mathbf{y}\|_{2}^{2}\overset{P}{\to}\mathbb{E}[Y^{2}]\in[0,\infty)$; note that the normalization by $n$ cancels in the ratio $\|\mathbf{y}\|_2/\|\mathbf{x}_i\|_2$ appearing in $L_n$.

Turning to α\alpha, we first see that by the Law of Large Numbers,

n(z)n𝑃(Z=z)\frac{n(z)}{n}\overset{P}{\to}\mathbb{P}(Z=z)

for nn\to\infty and zEz\in E. Then observe that for any zEz\in E

|n^(z)n(z)|k=1n|1(z^k=z)1(zk=z)|k=1n1(z^kzk)nδ.|\hat{n}(z)-n(z)|\leq\sum_{k=1}^{n}|1(\hat{z}_{k}=z)-1(z_{k}=z)|\leq\sum_{k=1}^{n}1(\hat{z}_{k}\neq z_{k})\leq n\delta.

Since δ𝑃0\delta\overset{P}{\to}0, also

n^(z)n𝑃(Z=z),\frac{\hat{n}(z)}{n}\overset{P}{\to}\mathbb{P}(Z=z),

thus

α=nminn=min{n(1)n,,n(K)n,n^(1)n,,n^(K)n}𝑃minzE(Z=z)(0,).\alpha=\frac{n_{\min}}{n}=\min\left\{\frac{n(1)}{n},\ldots,\frac{n(K)}{n},\frac{\hat{n}(1)}{n},\ldots,\frac{\hat{n}(K)}{n}\right\}\overset{P}{\to}\min_{z\in E}\mathbb{P}(Z=z)\in(0,\infty).

We finally consider ρ\rho, and to this end we first see that

1n(IP𝐙)𝐱i22=1nk=1n(xi,kμ¯(zk))2𝑃𝔼[σi2(Z)](0,).\displaystyle\frac{1}{n}\|(I-P_{\mathbf{Z}})\mathbf{x}_{i}\|_{2}^{2}=\frac{1}{n}\sum_{k=1}^{n}(x_{i,k}-\overline{\mu}(z_{k}))^{2}\overset{P}{\to}\mathbb{E}\left[\sigma_{i}^{2}(Z)\right]\in(0,\infty).

Moreover, using Lemma 1,

\displaystyle\left|\|(I-P_{\hat{\mathbf{Z}}})\mathbf{x}_{i}\|_{2}^{2}-\|(I-P_{\mathbf{Z}})\mathbf{x}_{i}\|_{2}^{2}\right| =\left|\|(P_{\hat{\mathbf{Z}}}-P_{\mathbf{Z}})\mathbf{x}_{i}\|_{2}^{2}+2\langle(I-P_{\hat{\mathbf{Z}}})\mathbf{x}_{i},(P_{\hat{\mathbf{Z}}}-P_{\mathbf{Z}})\mathbf{x}_{i}\rangle\right|
P𝐙^P𝐙22𝐱i22+2P𝐙^P𝐙2𝐱i22\displaystyle\leq\|P_{\hat{\mathbf{Z}}}-P_{\mathbf{Z}}\|_{2}^{2}\|\mathbf{x}_{i}\|_{2}^{2}+2\|P_{\hat{\mathbf{Z}}}-P_{\mathbf{Z}}\|_{2}\|\mathbf{x}_{i}\|^{2}_{2}
(2δα+2δα)𝐱i22.\displaystyle\leq\left(\frac{2\delta}{\alpha}+\sqrt{\frac{2\delta}{\alpha}}\right)\|\mathbf{x}_{i}\|^{2}_{2}.

Hence

ρ𝑃𝔼[σi2(Z)]𝔼[Xi2](0,).\rho\overset{P}{\to}\frac{\mathbb{E}\left[\sigma_{i}^{2}(Z)\right]}{\mathbb{E}[X_{i}^{2}]}\in(0,\infty).

Combining the limit results,

Ln𝑃L=22𝔼[Xi2]2𝔼[σi2(Z)]2minzE(Z=z)𝔼[Y2]𝔼[Xi2](0,).L_{n}\overset{P}{\to}L=\frac{2\sqrt{2}\mathbb{E}[X_{i}^{2}]^{2}}{\mathbb{E}\left[\sigma_{i}^{2}(Z)\right]^{2}\sqrt{\min_{z\in E}\mathbb{P}(Z=z)}}\sqrt{\frac{\mathbb{E}[Y^{2}]}{\mathbb{E}[X_{i}^{2}]}}\in(0,\infty).

To complete the proof, suppose first that sep(p)n\frac{\mathrm{sep}(p)}{n}\to\infty. Then

n|β^isubβ^i|nδLn\sqrt{n}|\widehat{\beta}^{\mathrm{sub}}_{i}-\hat{\beta}_{i}|\leq\sqrt{n\delta}L_{n}

By (33) we have, under the assumptions given in case (2) of the theorem, that nδ𝑃0n\delta\overset{P}{\to}0, and case (2) follows.

Finally, in the sub-Gaussian case, and if just hn=sep(p)log(n)h_{n}=\frac{\mathrm{sep}(p)}{\log(n)}\to\infty, then we can replace (33) by the bound

𝔼[δ]Kexp(sep(p)50vmax)+K2maxzv(Rz,v(p)>110).\mathbb{E}[\delta]\leq K\exp\left(-\frac{\mathrm{sep}(p)}{50v_{\max}}\right)+K^{2}\max_{z\neq v}\mathbb{P}\left(R^{(p)}_{z,v}>\tfrac{1}{10}\right).

Multiplying by nn, we get that the first term in the bound equals

Knexp(sep(p)50vmax)\displaystyle Kn\exp\left(-\frac{\mathrm{sep}(p)}{50v_{\max}}\right) =Kexp(sep(p)50vmax+log(n))\displaystyle=K\exp\left(-\frac{\mathrm{sep}(p)}{50v_{\max}}+\log(n)\right)
=Kexp(log(n)(1hn50vmax))0\displaystyle=K\exp\left(\log(n)\left(1-\frac{h_{n}}{50v_{\max}}\right)\right)\to 0

for nn\to\infty. We conclude that the relaxed growth condition on pp in terms of nn in the sub-Gaussian case is enough to imply nδ𝑃0n\delta\overset{P}{\to}0, and case (3) follows.

By the decomposition

n(β^isubβi)=n(β^isubβ^i)+n(β^iβi)\sqrt{n}(\widehat{\beta}^{\mathrm{sub}}_{i}-\beta_{i})=\sqrt{n}(\widehat{\beta}^{\mathrm{sub}}_{i}-\widehat{\beta}_{i})+\sqrt{n}(\hat{\beta}_{i}-\beta_{i})

it follows from Slutsky’s theorem that in case (2) as well as case (3),

\sqrt{n}(\widehat{\beta}^{\mathrm{sub}}_{i}-\beta_{i})=\sqrt{n}(\widehat{\beta}_{i}-\beta_{i})+o_{P}(1)\overset{\mathcal{D}}{\to}\mathcal{N}(0,w_{i}^{2}). ∎

Appendix B Gaussian mixture models

This appendix contains an analysis of a latent variable model with a finite EE, similar to the one given by Assumption 4, but with Assumption 4(1) strengthened to

XiZ=z𝒩(μi(z),σi2(z)).X_{i}\mid Z=z\sim\mathcal{N}(\mu_{i}(z),\sigma_{i}^{2}(z)).

Assumptions 4(2), 4(3) and 4(4) are dropped, and the purpose is to understand precisely when Assumption 2(2) holds in this model, that is, when $Z$ can be recovered from $\mathbf{X}_{-i}$. To keep notation simple, we show when $Z$ can be recovered from $\mathbf{X}$, but the analysis and conclusion are the same if a single coordinate is left out.

The key to this analysis is a classical result due to Kakutani. As in Section 2, the conditional distribution of 𝐗\mathbf{X} given Z=zZ=z is denoted PzP_{z}, and the model assumption is that

(34) Pz=i=1PziP_{z}=\bigotimes_{i=1}^{\infty}P^{i}_{z}

where PziP^{i}_{z} is the conditional distribution of XiX_{i} given Z=zZ=z. For Kakutani’s theorem below we do not need the Gaussian assumption; only that PziP^{i}_{z} and PviP^{i}_{v} are equivalent (absolutely continuous w.r.t. each other), and we let dPzidPvi\frac{\mathrm{d}P_{z}^{i}}{\mathrm{d}P_{v}^{i}} denote the Radon-Nikodym derivative of PziP_{z}^{i} w.r.t. PviP_{v}^{i}.

Theorem 3 (Kakutani (1948)).

Let z,vEz,v\in E and vzv\neq z. Then PzP_{z} and PvP_{v} are singular if and only if

(35) i=1logdPzidPvidPvi=.\sum_{i=1}^{\infty}-\log\int\sqrt{\frac{\mathrm{d}P_{z}^{i}}{\mathrm{d}P_{v}^{i}}}\ \mathrm{d}P_{v}^{i}=\infty.

Note that

BCz,vi=dPzidPvidPvi\mathrm{BC}_{z,v}^{i}=\int\sqrt{\frac{\mathrm{d}P_{z}^{i}}{\mathrm{d}P_{v}^{i}}}\ \mathrm{d}P_{v}^{i}

is known as the Bhattacharyya coefficient, while log(BCz,vi)-\log(\mathrm{BC}_{z,v}^{i}) and 1BCz,vi\sqrt{1-\mathrm{BC}_{z,v}^{i}} are known as the Bhattacharyya distance and the Hellinger distance, respectively, between PziP^{i}_{z} and PviP^{i}_{v}. Note also that if Pzi=hziλP^{i}_{z}=h^{i}_{z}\cdot\lambda and Pvi=hviλP^{i}_{v}=h^{i}_{v}\cdot\lambda for a reference measure λ\lambda, then

BCz,vi=hzihvidλ.\mathrm{BC}_{z,v}^{i}=\int\sqrt{h_{z}^{i}h_{v}^{i}}\ \mathrm{d}\lambda.
Proposition 5.

Let PziP^{i}_{z} be the 𝒩(μi(z),σi2(z))\mathcal{N}(\mu_{i}(z),\sigma_{i}^{2}(z))-distribution for all ii\in\mathbb{N} and zEz\in E. Then PzP_{z} and PvP_{v} are singular if and only if either

(36) i=1(μi(z)μi(v))2σi2(z)+σi2(v)\displaystyle\sum_{i=1}^{\infty}\frac{(\mu_{i}(z)-\mu_{i}(v))^{2}}{\sigma_{i}^{2}(z)+\sigma^{2}_{i}(v)} =or\displaystyle=\infty\qquad\text{or}
(37) i=1log(σi2(z)+σi2(v)2σi(z)σi(v))\displaystyle\sum_{i=1}^{\infty}\log\left(\frac{\sigma_{i}^{2}(z)+\sigma^{2}_{i}(v)}{2\sigma_{i}(z)\sigma_{i}(v)}\right) =\displaystyle=\infty
Proof.

Letting μ=μi(z)\mu=\mu_{i}(z), ν=μi(v)\nu=\mu_{i}(v), τ=1/σi(z)\tau=1/\sigma_{i}(z) and κ=1/σi(v)\kappa=1/\sigma_{i}(v) we find

BCz,vi\displaystyle\mathrm{BC}_{z,v}^{i} =τ2πexp(τ22(xμ)2)κ2πexp(κ22(xν)2)dx\displaystyle=\int\sqrt{\frac{\tau}{\sqrt{2\pi}}\exp(-\frac{\tau^{2}}{2}\quantity(x-\mu)^{2})\frac{\kappa}{\sqrt{2\pi}}\exp(-\frac{\kappa^{2}}{2}\quantity(x-\nu)^{2})}\mathrm{d}x
=τκ2πexp((τ2+κ2)x22(τ2μ+κ2ν)x+(τ2μ2+κ2ν2)4)dx\displaystyle=\sqrt{\frac{\tau\kappa}{2\pi}}\int\exp(-\frac{(\tau^{2}+\kappa^{2})x^{2}-2(\tau^{2}\mu+\kappa^{2}\nu)x+(\tau^{2}\mu^{2}+\kappa^{2}\nu^{2})}{4})\mathrm{d}x
=τκ2π4πτ2+κ2exp((τ2μ+κ2ν)24(τ2+κ2)τ2μ2+κ2ν24)\displaystyle=\sqrt{\frac{\tau\kappa}{2\pi}}\sqrt{\frac{4\pi}{\tau^{2}+\kappa^{2}}}\exp(\frac{(\tau^{2}\mu+\kappa^{2}\nu)^{2}}{4(\tau^{2}+\kappa^{2})}-\frac{\tau^{2}\mu^{2}+\kappa^{2}\nu^{2}}{4})
=2τκτ2+κ2exp(τ2κ2(μν)24(τ2+κ2))\displaystyle=\sqrt{\frac{2\tau\kappa}{\tau^{2}+\kappa^{2}}}\exp(-\frac{\tau^{2}\kappa^{2}(\mu-\nu)^{2}}{4(\tau^{2}+\kappa^{2})})
\displaystyle=\sqrt{\frac{2\sigma_{i}(z)\sigma_{i}(v)}{\sigma^{2}_{i}(z)+\sigma^{2}_{i}(v)}}\exp\left(-\frac{(\mu_{i}(z)-\mu_{i}(v))^{2}}{4(\sigma^{2}_{i}(z)+\sigma^{2}_{i}(v))}\right).

Thus

i=1log(BCz,vi)=12i=1log(σi2(z)+σi2(v)2σi(z)σi(v))+14i=1(μi(z)μi(v))2σi2(z)+σi2(v),\sum_{i=1}^{\infty}-\log\left(\mathrm{BC}_{z,v}^{i}\right)=\frac{1}{2}\sum_{i=1}^{\infty}\log\left(\frac{\sigma^{2}_{i}(z)+\sigma_{i}^{2}(v)}{2\sigma_{i}(z)\sigma_{i}(v)}\right)+\frac{1}{4}\sum_{i=1}^{\infty}\frac{(\mu_{i}(z)-\mu_{i}(v))^{2}}{\sigma^{2}_{i}(z)+\sigma_{i}^{2}(v)},

and the result follows from Theorem 3. ∎
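
As a numerical check of the closed form for $\mathrm{BC}_{z,v}^i$ derived in the proof (our own sketch), the coefficient can also be computed by quadrature and compared:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def bc_closed_form(mu1, s1, mu2, s2):
    """sqrt(2*s1*s2/(s1^2+s2^2)) * exp(-(mu1-mu2)^2 / (4*(s1^2+s2^2)))."""
    return np.sqrt(2 * s1 * s2 / (s1 ** 2 + s2 ** 2)) * np.exp(
        -(mu1 - mu2) ** 2 / (4 * (s1 ** 2 + s2 ** 2)))

def bc_numeric(mu1, s1, mu2, s2):
    """Bhattacharyya coefficient by numerical integration of sqrt(h_z h_v)."""
    integrand = lambda x: np.sqrt(norm.pdf(x, mu1, s1) * norm.pdf(x, mu2, s2))
    return quad(integrand, -np.inf, np.inf)[0]

print(bc_closed_form(0.0, 1.0, 0.5, 2.0), bc_numeric(0.0, 1.0, 0.5, 2.0))  # agree
```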

Corollary 1.

Let PziP^{i}_{z} be the 𝒩(μi(z),σi2(z))\mathcal{N}(\mu_{i}(z),\sigma_{i}^{2}(z))-distribution for all ii\in\mathbb{N} and zEz\in E. There is a mapping f:Ef:\mathbb{R}^{\mathbb{N}}\to E such that Z=f(𝐗)Z=f(\mathbf{X}) almost surely if and only if either (36) or (37) holds.

Proof.

If either (36) or (37) holds, PzP_{z} and PvP_{v} are singular whenever vzv\neq z. This implies that there are measurable subsets AzA_{z}\subseteq\mathbb{R}^{\mathbb{N}} for zEz\in E such that Pz(Az)=1P_{z}(A_{z})=1 and Pv(Az)=0P_{v}(A_{z})=0 for vzv\neq z. Setting A=zAzA=\cup_{z}A_{z} we see that

P(A)=zPz(A)(Z=z)=zPz(Az)(Z=z)=1.P(A)=\sum_{z}P_{z}(A)\mathbb{P}(Z=z)=\sum_{z}P_{z}(A_{z})\mathbb{P}(Z=z)=1.

Defining the map f:Ef:\mathbb{R}^{\mathbb{N}}\to E by f(𝐱)=zf(\mathbf{x})=z if 𝐱Az\mathbf{x}\in A_{z} (and arbitrarily on the complement of AA) we see that f(𝐗)=Zf(\mathbf{X})=Z almost surely.

On the other hand, if there is such a mapping ff, define Az=f1({z})A_{z}=f^{-1}(\{z\}) for all zEz\in E. Then AzAv=A_{z}\cap A_{v}=\emptyset for vzv\neq z and

Pz(Az)\displaystyle P_{z}(A_{z}) =(𝐗Az,Z=z)(Z=z)=(f(𝐗)=z,Z=z)(Z=z)\displaystyle=\frac{\mathbb{P}(\mathbf{X}\in A_{z},Z=z)}{\mathbb{P}(Z=z)}=\frac{\mathbb{P}(f(\mathbf{X})=z,Z=z)}{\mathbb{P}(Z=z)}
=(f(𝐗)=Z,Z=z)(Z=z)=(Z=z)(Z=z)=1.\displaystyle=\frac{\mathbb{P}(f(\mathbf{X})=Z,Z=z)}{\mathbb{P}(Z=z)}=\frac{\mathbb{P}(Z=z)}{\mathbb{P}(Z=z)}=1.

Similarly, for vzv\neq z

Pv(Az)\displaystyle P_{v}(A_{z}) =(𝐗Az,Z=v)(Z=v)=(f(𝐗)=z,Z=v)(Z=v)\displaystyle=\frac{\mathbb{P}(\mathbf{X}\in A_{z},Z=v)}{\mathbb{P}(Z=v)}=\frac{\mathbb{P}(f(\mathbf{X})=z,Z=v)}{\mathbb{P}(Z=v)}
=(f(𝐗)Z,Z=v)(Z=v)=0(Z=v)=0.\displaystyle=\frac{\mathbb{P}(f(\mathbf{X})\neq Z,Z=v)}{\mathbb{P}(Z=v)}=\frac{0}{\mathbb{P}(Z=v)}=0.

This shows that PzP_{z} and PvP_{v} are singular, and by Proposition 5, either (36) or (37) holds. ∎

References

  • Anandkumar et al. (2014) Anandkumar, A., Ge, R., Hsu, D., Kakade, S. M. & Telgarsky, M. (2014), ‘Tensor decompositions for learning latent variable models’, Journal of Machine Learning Research 15, 2773–2832.
  • Berk et al. (2021) Berk, R., Buja, A., Brown, L., George, E., Kuchibhotla, A. K., Su, W. & Zhao, L. (2021), ‘Assumption lean regression’, The American Statistician 75(1), 76–84.
  • Ćevid et al. (2020) Ćevid, D., Bühlmann, P. & Meinshausen, N. (2020), ‘Spectral deconfounding via perturbed sparse linear models’, Journal of Machine Learning Research 21(232), 1–41.
  • Cochran (1968) Cochran, W. G. (1968), ‘Errors of measurement in statistics’, Technometrics 10(4), 637–666.
  • Durbin (1954) Durbin, J. (1954), ‘Errors in variables’, Revue de l’Institut International de Statistique / Review of the International Statistical Institute 22(1/3), 23–32.
  • D’Amour (2019) D’Amour, A. (2019), On multi-cause approaches to causal inference with unobserved counfounding: Two cautionary failure cases and a promising alternative, in ‘The 22nd International Conference on Artificial Intelligence and Statistics’, PMLR, pp. 3478–3486.
  • Gandhi & Borns-Weil (2016) Gandhi, K. & Borns-Weil, Y. (2016), ‘Moment-based learning of mixture distributions’.
  • Grimmer et al. (2023) Grimmer, J., Knox, D. & Stewart, B. (2023), ‘Naive regression requires weaker assumptions than factor models to adjust for multiple cause confounding’, Journal of Machine Learning Research 24(182), 1–70.
  • Guo, Nie & Yang (2022) Guo, B., Nie, J. & Yang, Z. (2022), ‘Learning diagonal Gaussian mixture models and incomplete tensor decompositions’, Vietnam Journal of Mathematics 50, 421–446.
  • Guo, Ćevid & Bühlmann (2022) Guo, Z., Ćevid, D. & Bühlmann, P. (2022), ‘Doubly debiased lasso: High-dimensional inference under hidden confounding’, The Annals of Statistics 50(3), 1320 – 1347.
  • Heinrich & Kahn (2018) Heinrich, P. & Kahn, J. (2018), ‘Strong identifiability and optimal minimax rates for finite mixture estimation’, The Annals of Statistics 46(6A), 2844 – 2870.
  • Kakutani (1948) Kakutani, S. (1948), ‘On equivalence of infinite product measures’, Annals of Mathematics 49(1), 214–224.
  • Kalai et al. (2012) Kalai, A. T., Moitra, A. & Valiant, G. (2012), ‘Disentangling Gaussians’, Communications of the ACM 55(2), 113–120.
  • Kallenberg (2021) Kallenberg, O. (2021), Foundations of modern probability, Probability and its Applications (New York), third edn, Springer-Verlag, New York.
  • Leek & Storey (2007) Leek, J. T. & Storey, J. D. (2007), ‘Capturing heterogeneity in gene expression studies by surrogate variable analysis’, PLOS Genetics 3(9), 1–12.
  • Lundborg et al. (2023) Lundborg, A. R., Kim, I., Shah, R. D. & Samworth, R. J. (2023), ‘The projected covariance measure for assumption-lean variable significance testing’, arXiv:2211.02039 .
  • Miao et al. (2023) Miao, W., Hu, W., Ogburn, E. L. & Zhou, X.-H. (2023), ‘Identifying effects of multiple treatments in the presence of unmeasured confounding’, Journal of the American Statistical Association 118(543), 1953–1967.
  • Ndaoud (2022) Ndaoud, M. (2022), ‘Sharp optimal recovery in the two component Gaussian mixture model’, The Annals of Statistics 50(4), 2096 – 2126.
  • Ogburn et al. (2020) Ogburn, E. L., Shpitser, I. & Tchetgen, E. J. T. (2020), ‘Counterexamples to ”the blessings of multiple causes” by Wang and Blei’, arXiv:2001.06555 .
  • Patterson et al. (2006) Patterson, N., Price, A. L. & Reich, D. (2006), ‘Population structure and eigenanalysis’, PLOS Genetics 2(12), 1–20.
  • Pearl (2009) Pearl, J. (2009), Causality, Cambridge University Press.
  • Peters et al. (2017) Peters, J., Janzing, D. & Schölkopf, B. (2017), Elements of Causal Inference: Foundations and Learning Algorithms, MIT Press, Cambridge, MA, USA.
  • Price et al. (2006) Price, A. L., Patterson, N. J., Plenge, R. M., Weinblatt, M. E., Shadick, N. A. & Reich, D. (2006), ‘Principal components analysis corrects for stratification in genome-wide association studies’, Nature Genetics 38(8), 904–909.
  • Schennach (2016) Schennach, S. M. (2016), ‘Recent advances in the measurement error literature’, Annual Review of Economics 8, 341–377.
  • Song et al. (2015) Song, M., Hao, W. & Storey, J. D. (2015), ‘Testing for genetic associations in arbitrarily structured populations’, Nat Genet 47(5), 550–554.
  • Stewart (1977) Stewart, G. W. (1977), ‘On the perturbation of pseudo-inverses, projections and linear least squares problems’, SIAM Review 19(4), 634–662.
  • van der Vaart (1998) van der Vaart, A. W. (1998), Asymptotic statistics, Vol. 3 of Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press, Cambridge.
  • Vansteelandt & Dukes (2022) Vansteelandt, S. & Dukes, O. (2022), ‘Assumption-lean inference for generalised linear model parameters’, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 84(3), 657–685.
  • Wang et al. (2017) Wang, J., Zhao, Q., Hastie, T. & Owen, A. B. (2017), ‘Confounder adjustment in multiple hypothesis testing’, Ann. Statist. 45(5), 1863–1894.
  • Wang & Blei (2019) Wang, Y. & Blei, D. M. (2019), ‘The blessings of multiple causes’, Journal of the American Statistical Association 114(528), 1574–1596.
  • Wang & Blei (2020) Wang, Y. & Blei, D. M. (2020), ‘Towards clarifying the theory of the deconfounder’, arXiv:2003.04948 .