
Partial Identification of Dose Responses with Hidden Confounders

Myrl G. Marmarelis USC Information Sciences Institute
4676 Admiralty Way
Marina del Rey, CA 90292
Elizabeth Haddad USC Stevens Neuroimaging and Informatics Institute
4676 Admiralty Way
Marina del Rey, CA 90292
Andrew Jesson University of Oxford, OATML
14 Parks Road
Oxford, UK OX1 3AQ

Neda Jahanshad
USC Stevens Neuroimaging and Informatics Institute
4676 Admiralty Way
Marina del Rey, CA 90292
Aram Galstyan USC Information Sciences Institute
4676 Admiralty Way
Marina del Rey, CA 90292
Greg Ver Steeg USC Information Sciences Institute
4676 Admiralty Way
Marina del Rey, CA 90292
Abstract

Inferring causal effects of continuous-valued treatments from observational data is a crucial task promising to better inform policy- and decision-makers. A critical assumption needed to identify these effects is that all confounding variables (causal parents of both the treatment and the outcome) are included as covariates. Unfortunately, given observational data alone, we cannot know with certainty that this criterion is satisfied. Sensitivity analyses provide principled ways to give bounds on causal estimates when confounding variables are hidden. While much attention has been devoted to sensitivity analyses for discrete-valued treatments, far less has been paid to continuous-valued treatments. We present novel methodology to bound both average and conditional average continuous-valued treatment-effect estimates when they cannot be point identified due to hidden confounding. A semi-synthetic benchmark on multiple datasets shows that our method gives tighter coverage of the true dose-response curve than a recently proposed continuous sensitivity model and other baselines. Finally, we apply our method to a real-world observational case study to demonstrate the value of identifying dose-dependent causal effects.

1 Introduction

Causal inference on observational studies [Hill, 2011, Athey et al., 2019] attempts to predict conclusions of alternate versions of those studies, as if they were actually properly randomized experiments. The causal aspect is unique among inference tasks in that the goal is not prediction per se, as causal inference deals with counterfactuals, the problem of predicting unobservables: for example, what would have been a particular patient's health outcome had she taken some medication, versus not, while keeping all else equal (ceteris paribus)? There is quite often no way to validate the results without bringing in additional domain knowledge. A set of putative treatments $\mathcal{T}$, often binary with a treated/untreated dichotomy, induces potential outcomes $Y_{t\in\mathcal{T}}$. These can depend on covariates $X$, as with heterogeneous treatment effects $\mathbb{E}[Y_1 - Y_0 \mid X]$ in the binary case. Only one outcome is ever observed: that at the assigned treatment $T$. Potential biases arise from the incomplete observation. This problem is exacerbated with more than two treatment values, especially when there are infinite possibilities, like in a continuum, e.g. $\mathcal{T} = [0,1]$. Unfortunately, many consequential decisions in life involve this kind of treatment: What dose of drug should I take? How much of ___ should I eat/drink? How much exercise do I really need?

In an observational study, the direct causal link between assigned treatment $T$ and observed outcome $Y$ (also denoted as $Y_T$) can be influenced by indirect links modulated by confounding variables. For instance, wealth is often a confounder in an individual's health outcome from diet, medication, or exercise. Wealth affects access to each of these "treatments," and it also affects health through numerous other paths. Including the confounders as covariates in $X$ allows estimators to condition on them and disentangle the influences [Yao et al., 2021].

It can be challenging to collect sufficient data, in terms of quality and quantity, on confounders in order to adjust a causal estimate for them. Case in point: noisy observations of lifestyle confounders, for example, lead researchers to vacillate on the health implications of coffee [Atroszko, 2019], alcohol [Ystrom et al., 2022], and cheese [Godos et al., 2020].

For consequential real-world causal inference, it is only prudent to allow margins for some amount of hidden confounding. A major impediment to such analysis is that it is impossible to know how a hidden confounder would bias the causal effect. The role of any causal sensitivity model [Cornfield et al., 1959, Rosenbaum and Rubin, 1983] is to make reasonable structural assumptions [Manski, 2003] about different levels of hidden confounding. Most sensitivity analyses to hidden confounding require the treatment categories to be binary or at least discrete. This weakens empirical studies that are better specified by dose-response curves [Calabrese and Baldwin, 2001, Bonvini and Kennedy, 2022] from a continuous treatment variable. Estimated dose-response functions are indeed vulnerable in the presence of hidden confounders. Figure 1 highlights the danger of skewed observational studies that lead to biased estimates of personal toxic thresholds of treatment dosages.

Figure 1: Dose-response curves in medicine [e.g. Taleb, 2018] can be viewed as expected potential outcomes from continuous treatments. In this simulation (with details in §D), there is one unobserved confounder. The empirical estimate of the population-level dose responses massively overshoots the maximum effective dosage, and would suggest treatments that were actually toxic to the population. This phenomenon persists even when the vulnerable hidden subgroup occurs more often in the population.

1.1 Related works

There is growing interest in causal methodology for continuous treatments (or exposures, interventions), especially in the fields of econometrics [e.g. Huang et al., 2021, Tübbicke, 2022], health sciences [Vegetabile et al., 2021], and machine learning [Chernozhukov et al., 2021, Ghassami et al., 2021, Colangelo and Lee, 2021, Kallus and Santacatterina, 2019]. So far, most scrutiny on partial identification of potential outcomes has focused on the case of discrete treatments [e.g. Rosenbaum and Rubin, 1983, Louizos et al., 2017, Lim et al., 2021]. A number of creative approaches recently made strides in the discrete setting. Most rely on a sensitivity model for assessing the susceptibility of causal estimands to hidden-confounding bias. A sensitivity model allows hidden confounders but restricts their possible influence on the data, with an adjustable parameter that controls the overall tightness of that restriction.

The common discrete-treatment sensitivity models are incompatible with continuous treatments, which are needed for estimating dose-response curves. Still, some recent attempts have been made to handle hidden confounding under more general treatment domains [Chernozhukov et al., 2021]. Padh et al. [2022], Hu et al. [2021] optimize generative models to reflect bounds on the treatment effect due to ignorance, inducing an implicit sensitivity model through functional constraints. Instrumental variables are also helpful when they are available [Kilbertus et al., 2020]. The CMSM [Jesson et al., 2022] was developed in parallel to this work, and now serves as a baseline.

For binary treatments, the Marginal Sensitivity Model (MSM) due to Tan [2006] has found widespread usage [Zhao et al., 2019, Veitch and Zaveri, 2020, Yin et al., 2021, Kallus et al., 2019, Jesson et al., 2021]. Variations thereof include Rosenbaum’s earlier sensitivity model [2002] that enjoys ties to regression coefficients [Yadlowsky et al., 2020]. Alternatives to sensitivity models leverage generative modeling [Meresht et al., 2022] and robust optimization [Guo et al., 2022]. Other perspectives require additional structure to the data-generating (observed outcome, treatment, covariates) process. Proximal causal learning [Tchetgen et al., 2020, Mastouri et al., 2021] requires observation of proxy variables. Chen et al. [2022] rely on a large number of background variables to help filter out hidden confounding from apparent causal influences.

1.2 Contributions

We propose a novel sensitivity model for continuous treatments in §2. Next, we derive general formulas (§2.1) and solve closed forms for three versions (§2.3) of partially identified dose responses—for Beta, Gamma, and Gaussian treatment variables. We devise an efficient sampling algorithm (§3), and validate our results empirically using a semi-synthetic benchmark (§4) and realistic case study (§5).

1.3 Problem Statement

Our goal is the partial identification of causal dose responses under a bounded level of possible hidden confounding. We consider any setup that grants access to two predictors [Chernozhukov et al., 2017] that can be learned empirically and are assumed to output correct conditional distributions. These are (1) a predictor of outcomes conditioned on covariates and the assigned treatment, and (2) a predictor of the propensity of treatment assignments, taking the form of a probability density, conditioned on the covariates. The latter measures (non-)uniformity in treatment assignment for different parts of the population. The observed data come from a joint distribution of outcome, continuous treatment, and covariates that include any observed confounders.

Potential outcomes.

Causal inference is often cast in the nomenclature of potential outcomes, due to Rubin [1974]. Our first assumption, common to Rubin's framework, is that observation tuples of outcome, assigned treatment, and covariates, $\{(y^{(i)}, t^{(i)}, x^{(i)})\}_{i=1}^{n}$, are i.i.d. draws from a single joint distribution. This subsumes the Stable Unit Treatment Value Assumption (SUTVA), where units/individuals cannot depend on one another, since they are i.i.d. The second assumption is overlap/positivity, that all treatments have a chance of assignment for every individual in the data: $p_{T \mid X}(t \mid x) > 0$ for every $(t, x) \in \mathcal{T} \times \mathcal{X}$.

The third and most challenging fundamental assumption is that of ignorability/sufficiency: $\{(Y_t)_{t\in\mathcal{T}} \perp\!\!\!\perp T\} \mid X$. Clearly the outcome should depend on the assigned treatment, but potential outcomes ought not to be affected by the assignment, after blocking out paths through covariates.

Our study focuses on dealing with limited violations of ignorability. The situation is expressed formally as $\{(Y_t)_{t\in\mathcal{T}} \not\!\perp\!\!\!\perp T\} \mid X$, but more specifically, we shall introduce a sensitivity model that governs the shape and extent of that violation.

Let $p(y_t \mid x)$ denote the probability density function of potential outcome $Y_t = y_t$ from a treatment $t \in \mathcal{T}$, given covariates $X = x$. This is what we seek to infer, while observing realized outcomes that allow us to learn the density $p(y_t \mid x, T = t)$. If the ignorability condition held, then $p(y_t \mid x, T = t) = p(y_t \mid x)$ due to the conditional independence. However, without ignorability, one has to marginalize over treatment assignment, requiring $p(y_t \mid x, T \neq t)$ because

$$p(y_t \mid x) = \int_{\mathcal{T}} p(y_t \mid \tau, x)\, p(\tau \mid x)\, \mathrm{d}\tau, \tag{1}$$

where $p(y_t \mid \tau, x)$ is the distribution of potential outcomes conditioned on actual treatment $T = \tau \in \mathcal{T}$ that may differ from the potential outcome's index $t$. The density $p(\tau \mid x)$ is termed the nominal propensity, defining the distribution of treatment assignments for different covariate values.
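To make Equation 1 concrete, the following toy computation (a sketch with illustrative numbers of our own; the covariate $x$ is suppressed) contrasts the marginal $p(y_t \mid x)$ with the observable $p(y_t \mid x, T = t)$ when a hidden binary confounder $U$ influences both the propensity and the outcome:

```python
import numpy as np

# Toy check: with a hidden binary confounder U, conditioning on T = t
# re-weights U by the propensity, so E[Y_t | T = t] != E[Y_t].
p_u = np.array([0.5, 0.5])                    # prior P(U = 0), P(U = 1)

def propensity(t, u):                         # density p(T = t | U = u) on [0, 1]
    return 2.0 * t if u else 2.0 * (1.0 - t)

def outcome_mean(t, u):                       # E[Y_t | U = u]
    return t * (1.0 if u else 0.2)

t = 0.8
apo = sum(p_u[u] * outcome_mean(t, u) for u in (0, 1))       # marginalize U out
w = np.array([p_u[u] * propensity(t, u) for u in (0, 1)])    # U re-weighted by T = t
observed = (w / w.sum()) @ np.array([outcome_mean(t, u) for u in (0, 1)])
print(apo, observed)   # 0.48 versus 0.672: the kind of bias shown in Figure 1
```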

On notation.

Throughout this study, $y_t$ will indicate the value of the potential outcome at treatment $t$, and to disambiguate, $\tau$ will be used for assignment events where $T = \tau$. For instance, we may care about the counterfactual of a smoker's $(\tau = 1)$ health outcome had they not smoked $(y_{t=0})$, where $T = 0$ signifies no smoking and $T = 1$ is "full" smoking. We will use the shorthand $p(\cdots)$ with lowercase variables whenever working with probability densities of the corresponding variables in uppercase:

$$p(y_t \mid \tau, x) \ \text{ means } \ \frac{\partial}{\partial u}\,\mathbb{P}[\,Y_t \leq u \mid T = \tau,\ X = x\,]\Big\rvert_{u = y_t}.$$

Quantities of interest.

We attempt to impart intuition for the conditional probability densities involved, which can be confusing.

  • $p(y_t \mid x)$  [conditional potential outcome].  A person's outcome from a treatment, disentangled from the selection bias of treatment assignment in the population. We seek to characterize this in order to (partially) identify the Conditional Average Potential Outcome (CAPO) and the Average Potential Outcome (APO):

    $$\text{CAPO}(t, x) = \mathbb{E}[Y_t \mid X = x]; \quad \text{APO}(t) = \mathbb{E}[Y_t].$$
  • $p(y_t \mid \tau, x)$  [counterfactual].  What is the potential outcome of a person in the population characterized by $x$ and assigned treatment $\tau$? The answer changes with $\tau$ only when $x$ is inadequate to block all backdoor paths through confounders. We can estimate this for $t = \tau$.

  • $p(\tau \mid y_t, x)$  [complete propensity]  is related to the above by Bayes' rule. We distinguish it from the nominal propensity $p(\tau \mid x)$ because the unobservable $y_t$ possibly confers more information about the individual, again if $x$ is inadequate. The complete propensity cannot be estimated, even for $t = \tau$; hence, this is the target of our sensitivity model.

Figure 2: In this example, $Z$ encompasses all hidden confounders. Counterfactual $p(y_t \mid \tau, x)$ diverges from $p(y_t \mid x)$ because of the red path from $T$ to $Y_t$ through $Z$.

A backdoor path between potential outcomes and treatment can manifest in several ways. Figure 2 shows the barebones setting for hidden confounding to take place. Even merely noisy observations of the confounders could leak a backdoor path. It is important to understand the ontology [Sarvet and Stensrud, 2022] of the problem in order to ascribe hidden confounding to the stochasticity inherent to a potential outcome.

Sensitivity.

Explored by Tan [2006], followed by Kallus et al. [2019], Jesson et al. [2021], among many others, the Marginal Sensitivity Model (MSM) serves to bound the extent of (putative) hidden confounding in the regime of binary treatments $T' \in \{0, 1\}$. The MSM limits the discrepancy between the odds of treatment under the nominal propensity and the odds of treatment under the complete propensity.

Definition 1 (The Marginal Sensitivity Model).

For binary treatment $t' \in \{0, 1\}$ and violation factor $\Gamma \geq 1$, the following ratio is bounded:

$$\Gamma^{-1} \leq \left[\frac{p(t' \mid x)}{1 - p(t' \mid x)}\right]^{-1}\left[\frac{p(t' \mid y_{t'}, x)}{1 - p(t' \mid y_{t'}, x)}\right] \leq \Gamma.$$

The confines of a binary treatment afford a number of conveniences. For instance, one probability value is sufficient to describe the whole propensity landscape on a set of conditions: $p(1 - t' \mid \cdots) = 1 - p(t' \mid \cdots)$. As we transfer to the separate context of treatment continua, we must contend with infinitely many treatments and infinitely many potential outcomes.
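As a point of reference before moving to continua, here is a minimal sketch of what Definition 1 buys in the binary case: given a nominal propensity and $\Gamma$, the complete propensity is confined to an interval on the logit scale (the function name is our own):

```python
import numpy as np

# Interval for the complete propensity p(t'|y,x) implied by the MSM
# (Definition 1), given the nominal propensity p(t'|x) and factor Gamma.
def msm_interval(p_nominal, gamma):
    logit = np.log(p_nominal) - np.log1p(-p_nominal)
    expit = lambda z: 1.0 / (1.0 + np.exp(-z))
    # The odds ratio is bounded in [1/Gamma, Gamma], i.e. a shift of
    # +/- log(Gamma) on the log-odds scale.
    return expit(logit - np.log(gamma)), expit(logit + np.log(gamma))

print(msm_interval(0.3, 2.0))  # -> approximately (0.176, 0.462)
```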

2 Continuous Sensitivity Model

The counterfactuals required for Equation 1 are almost entirely unobservable. We look to the Radon-Nikodym derivative $\omega_\delta$ of one counterfactual with respect to another [Tan, 2006], quantifying their divergence between nearby treatment assignments (assuming mutual continuity):

$$\omega_\delta(y_t \mid \tau, x) \coloneqq \frac{p(y_t \mid \tau + \delta, x)}{p(y_t \mid \tau, x)} \stackrel{\text{(Bayes' rule)}}{=} \frac{p(\tau + \delta \mid y_t, x)\, p(\tau \mid x)}{p(\tau \mid y_t, x)\, p(\tau + \delta \mid x)} = \left[\frac{p(\tau + \delta \mid x)}{p(\tau \mid x)}\right]^{-1}\left[\frac{p(\tau + \delta \mid y_t, x)}{p(\tau \mid y_t, x)}\right].$$

As with the MSM, we encounter a ratio of odds, here contrasting $\tau$ versus $\tau + \delta$ in the assigned-treatment continuum. Assuming the densities are at least once differentiable,

$$\lim_{\delta \to 0} \delta^{-1} \log \omega_\delta(y_t \mid \tau, x) = \partial_\tau\big[\log p(\tau \mid y_t, x) - \log p(\tau \mid x)\big].$$

By constraining $\omega_\delta$ to be close to unity, via bounds above and below, we tie together the logarithmic derivatives of the nominal- and complete-propensity densities.

Definition 2 (The Infinitesimal Marginal Sensitivity Model).

For treatments $t \in \mathcal{T} \subseteq \mathbb{R}$, where $\mathcal{T}$ is connected, and violation-of-ignorability factor $\Gamma \geq 1$, the $\delta$MSM requires

$$\left|\frac{\partial}{\partial \tau} \log \frac{p(\tau \mid y_t, x)}{p(\tau \mid x)}\right| \leq \log\Gamma$$

everywhere, for all combinations of $\tau$, $t$, and $x$. This differs from the CMSM due to Jesson et al. [2022], which considers only $t = \tau$ and bounds the density ratios directly.

2.1 The Complete Framework

Assumption 1 (Bounded Hidden Confounding).

Invoking Definition 2, the violation of ignorability is constrained by a $\delta$MSM with some $\Gamma \geq 1$.

Assumption 2 (Anchor Point).

A special treatment value designated as zero is not informed by potential outcomes: $p(\tau = 0 \mid y_t, x) = p(\tau = 0 \mid x)$ for all $x$, $t$, and $y_t$.

At this point we state the core sensitivity assumptions. In addition to the $\delta$MSM, we require an anchor point at $T = 0$, which may be considered a lack of treatment. Strictly, we assume that hidden confounding does not affect the propensity density precisely at the anchor point. A broader interpretation is that the strength of causal effect, hence vulnerability to hidden confounders, roughly increases with $|T|$. Assumption 2 is necessary to make closed-form solutions feasible. We discuss ramifications and a relaxation in §2.3.

The unobservability of almost all counterfactuals is unique to the case of continuous treatments, since the discrete analogy would be a discrete sum with an observable term. Figure 3 explains our approach to solving Equation 1.

Figure 3: In the binary case, the red part is unobservable, but the MSM condition helps to bound that quantity. In the continuous case the integrand (Equation 1) is unobservable almost everywhere in the space of assigned treatments, except for the infinitesimal point $T = t$. In order to divide the integral into two parts (observable and unobservable) like with the binary sum, we must draw an approximation where assigned treatment and potential-outcome index are close enough. We use a soft window (yellow) to mark the validity of the approximation. Our continuous version of the MSM, the $\delta$MSM, allows us to bound the red part as well as reason about the yellow part. Covariates $X$ are omitted for brevity.

2.2 A Partial Approximation

We expand $p(y_t \mid \tau, x)$ around $\tau = t$, where $p(y_t \mid t, x) = p(y \mid t, x)$ is learnable from data. Suppose that $p(y_t \mid \tau, x)$ is twice differentiable in $\tau$. Construct a Taylor expansion

$$p(y_t \mid \tau, x) = p(y_t \mid t, x) + (\tau - t)\,\partial_\tau p(y_t \mid \tau, x)\big|_{\tau = t} + \frac{(\tau - t)^2}{2}\,\partial^2_\tau p(y_t \mid \tau, x)\big|_{\tau = t} + \mathcal{O}(\tau - t)^3. \tag{2}$$

Denote with $\tilde{p}(y_t \mid \tau, x)$ an approximation of second order as laid out above. One could have stopped at lower orders, but the difference in complexity is not that large. The intractable derivatives like $\partial_\tau p(y_t \mid \tau, x)|_{\tau = t}$ will be bounded using the $\delta$MSM machinery. Let us quantify the reliability of this approximation by a trust-weighing scheme $0 \leq w_t(\tau) \leq 1$, where typically $w_t(t) = 1$. This corresponds to the yellow part in Figure 3. We argue that $w_t(\tau)$ should be narrower with lower-entropy (narrower) propensities (§B). The possible forms of $w_t(\tau)$ are elaborated in §2.3.

Splitting Equation 1 along the trusted regime marked by $w_t(\tau)$, and then applying the approximation of Equation 2,

$$\begin{aligned} p(y_t \mid x) =\ & \int_{\mathcal{T}} \underbrace{w_t(\tau)\, p(y_t \mid \tau, x)\, p(\tau \mid x)\, \mathrm{d}\tau}_{\text{``observable'' (Fig. 3)}} + \int_{\mathcal{T}} \underbrace{[1 - w_t(\tau)]\, p(y_t \mid \tau, x)\, p(\tau \mid x)\, \mathrm{d}\tau}_{\text{``unobservable'' (Fig. 3)}} \\ \approx\ & \int_{\mathcal{T}} \underbrace{w_t(\tau)\, \tilde{p}(y_t \mid \tau, x)\, p(\tau \mid x)\, \mathrm{d}\tau}_{(A)\ \text{the approximated quantity}} + \int_{\mathcal{T}} \underbrace{[1 - w_t(\tau)]\, p(\tau \mid y_t, x)\, p(y_t \mid x)\, \mathrm{d}\tau}_{(B)\ \text{by Bayes' rule}}. \end{aligned} \tag{3}$$

The intuition behind separating the integral into two parts is the following. By choosing the weights $w_t(\tau)$ so that they are close to one in the range where the approximation of Equation 2 is valid (yellow region in Figure 3) and zero outside of this range, we can evaluate the first integral through the approximated counterfactuals. The second integral, which is effectively over the red region in Figure 3 and cannot be evaluated due to unobserved counterfactuals, will be bounded using the $\delta$MSM. Simplifying the second integral first,

$$\int_{\mathcal{T}} [1 - w_t(\tau)]\, p(\tau \mid y_t, x)\, p(y_t \mid x)\, \mathrm{d}\tau = p(y_t \mid x)\left[1 - \int_{\mathcal{T}} w_t(\tau)\, p(\tau \mid y_t, x)\, \mathrm{d}\tau\right].$$

By algebraic manipulation, we see already that $p(y_t \mid x)$ shall take the form of

$$p(y_t \mid x) \approx \frac{\int_{\mathcal{T}} w_t(\tau)\, \tilde{p}(y_t \mid \tau, x)\, p(\tau \mid x)\, \mathrm{d}\tau}{\int_{\mathcal{T}} w_t(\tau)\, p(\tau \mid y_t, x)\, \mathrm{d}\tau}. \tag{4}$$

Reflecting on Assumptions 1 & 2, the divergence between $p(\tau \mid y_t, x)$ and $p(\tau \mid x)$ is bounded, allowing characterization of the denominator in terms of the learnable $p(\tau \mid x)$. Similarly, the derivatives in Equation 2 can be bounded. These results would be sufficient to partially identify the numerator. Without loss of generality, consider the unknown quantity $\gamma$, which can be a function of $\tau$, $y_t$, and $x$, such that

$$\partial_\tau \log p(\tau \mid y_t, x) = \partial_\tau \log p(\tau \mid x) + \gamma(\tau \mid y_t, x), \quad \text{where } |\gamma(\tau \mid y_t, x)| \leq \log\Gamma \text{ using the } \delta\text{MSM}. \tag{5}$$

We may integrate both sides:

$$\begin{aligned} \int_0^{\,s} \partial_\tau \log p(\tau \mid y_t, x)\, \mathrm{d}\tau &= \int_0^{\,s} \partial_\tau \log p(\tau \mid x)\, \mathrm{d}\tau + \underbrace{\int_0^{\,s} \gamma(\tau \mid y_t, x)\, \mathrm{d}\tau}_{\coloneqq\ \lambda(s \mid y_t, x)}. \\ \therefore\ \log p(\tau{=}s \mid y_t, x) - \log p(\tau{=}0 \mid y_t, x) &= \log p(\tau{=}s \mid x) - \log p(\tau{=}0 \mid x) + \lambda(s \mid y_t, x), \\ \therefore\ \log p(\tau \mid y_t, x) &= \log p(\tau \mid x) + \lambda(\tau \mid y_t, x) \quad \text{(by Assumption 2)}, \\ \therefore\ p(\tau \mid y_t, x) &= p(\tau \mid x)\, \Lambda(\tau \mid y_t, x), \quad \Lambda \coloneqq \exp\lambda. \end{aligned} \tag{6}$$

One finds that $|\lambda(\tau \mid y_t, x)| \leq |\tau| \log\Gamma$ because $\lambda$ integrates $\gamma$, bounded by $\pm\log\Gamma$, over a support of length $|\tau|$. Subsequently, $\Lambda$ is bounded by $\Gamma^{\pm|\tau|}$. These are the requisite tools for bounding $p(y_t \mid x)$, or an approximation thereof, erring on ignorance via the trusted regime marked by $w_t(\tau)$. The derivation is completed in §A by framing the unknown quantities in terms of $\gamma$ and $\Lambda$, culminating in Equation 7.
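A brief numerical sanity check of this bound (a sketch with arbitrary random $\gamma$ paths, not part of the estimation pipeline): integrating any $\gamma$ bounded by $\pm\log\Gamma$ from the anchor point keeps $\lambda$ within $\pm|\tau|\log\Gamma$, hence $\Lambda$ within $\Gamma^{\pm|\tau|}$.

```python
import numpy as np

# Verify |lambda(s)| <= |s| log(Gamma) for random gamma paths with
# |gamma| <= log(Gamma), integrated from the anchor point tau = 0.
rng = np.random.default_rng(0)
log_gamma = np.log(2.0)                       # Gamma = 2
taus = np.linspace(0.0, 1.0, 501)
for _ in range(100):
    g = log_gamma * np.clip(rng.normal(size=taus.size), -1.0, 1.0)
    increments = (g[1:] + g[:-1]) / 2.0 * np.diff(taus)   # trapezoidal rule
    lam = np.concatenate([[0.0], np.cumsum(increments)])
    assert np.all(np.abs(lam) <= np.abs(taus) * log_gamma + 1e-9)
    assert np.all(np.exp(lam) <= 2.0 ** np.abs(taus) + 1e-9)  # Lambda bound
```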

Predicting potential outcomes.

The recovery of a fully normalized probability density $\tilde{p}(y_t \mid x)$ via Equation 4 is laid out below. It may be approximated with Monte Carlo or solved in closed form with specific formulations for the weights and propensity. Concretely, it takes on the form $\tilde{p}(y_t \mid x) = d(t \mid y_t, x)^{-1}\, p(y_t \mid t, x)$, where

$$d(t \mid y_t, x) \coloneqq \mathbb{E}_\tau[\Lambda(\tau \mid y_t, x)] - [\gamma\Lambda](t \mid y_t, x)\, \mathbb{E}_\tau[\tau - t] - \frac{1}{2}[(\dot{\gamma} + \gamma^2)\Lambda](t \mid y_t, x)\, \mathbb{E}_\tau[(\tau - t)^2], \tag{7}$$

and said expectations, $\mathbb{E}_\tau[\cdot]$, are with respect to the implicit distribution $q(\tau \mid t, x) \propto w_t(\tau)\, p(\tau \mid x)$. The notation $\dot{\gamma}$ denotes a derivative in the first argument of $\gamma(t \mid y_t, x)$.

Assumption 3 (Second-order Simplification).

The quantity $\dot{\gamma}(\tau \mid y_t, x)$ cannot be characterized as-is. We grant that $\gamma^2$ dominates over it, and consequently

$$|(\dot{\gamma} + \gamma^2)\Lambda| \leq |\gamma^2 \Lambda| + \varepsilon \quad \text{for small } \varepsilon \geq 0.$$
| Parametrization | Support ($\mathcal{T}$) | Params. | Precision ($r$) | Bounds for $\mathbb{E}_\tau[\Lambda(\tau \mid y_t, x)]$ |
|---|---|---|---|---|
| Beta | $[0, 1]$ | $\alpha, \beta$ | $\alpha + \beta - 2$ | ${}_1F_1(\bm{\alpha} + 1;\ \bm{\alpha} + \bm{\beta} + 2;\ \pm\log\Gamma)$, where $\bm{\alpha} \coloneqq \bar{\alpha} + \alpha - 2$, $\bm{\beta} \coloneqq \bar{\beta} + \beta - 2$ |
| Balanced Beta | $[0, 1]$ | $\alpha, \beta$ | $\alpha + \beta - 2$ | $t \cdot \langle\text{the Beta above}\rangle + (1 - t) \cdot \langle\text{Beta, mirrored}\rangle$ |
| Gamma | $[0, +\infty)$ | $\alpha, \beta$ | $\alpha / \beta^2$ | $[1 - (\pm\log\Gamma)/\bm{\beta}]^{-\bm{\alpha}}$, where $\bm{\alpha} \coloneqq \bar{\alpha} + \alpha - 1$, $\bm{\beta} \coloneqq \bar{\beta} + \beta$ |
| Gaussian | $(-\infty, +\infty)$ | $\mu, \sigma$ | $1/\sigma$ | $\exp(\bm{\sigma}^2(\log\Gamma)^2/2)\,\Big(\Gamma^{\pm\bm{\mu}}\big[1 + \mathrm{erf}\big(\frac{\bm{\mu} \pm \bm{\sigma}^2\log\Gamma}{\sqrt{2}\bm{\sigma}}\big)\big] + \Gamma^{\mp\bm{\mu}}\big[1 - \mathrm{erf}\big(\frac{\bm{\mu} \mp \bm{\sigma}^2\log\Gamma}{\sqrt{2}\bm{\sigma}}\big)\big]\Big)$, where $\bm{\mu} \coloneqq \frac{\mu\bar{\sigma}^2 + \bar{\mu}\sigma^2}{\bar{\sigma}^2 + \sigma^2}$, $\bm{\sigma}^2 \coloneqq \frac{\bar{\sigma}^2\sigma^2}{\bar{\sigma}^2 + \sigma^2}$ |

Table 1: Candidates for propensity and trust-weighing combinations. Each row specifies the distribution (Beta, Beta, Gamma, and Gaussian, respectively) of the propensity model $p(\tau \mid x)$. The last column lists solutions for the first term of Equation 7 / 8. This is a convolution of the propensity and weighing scheme, which have similar forms (see Bromiley [2003] for the Gaussian case). We distinguish the replicated parameters between propensity and weight by placing a bar over the propensity parameters. So if the propensity is $x \mapsto (\bar{\alpha}, \bar{\beta})$, then the weighing scheme has $t \mapsto (\alpha, \beta)$. The bold parameters are of the compound density, with respect to which the first and second moments are computed in Equation 7 / 8.

To make use of the formula in Equation 7, one first obtains the set of admissible $d(t \mid y_t, x) \in \big[\underline{d}(t \mid y_t, x),\ \overline{d}(t \mid y_t, x)\big]$ that violate ignorability up to a factor $\Gamma$ according to the $\delta$MSM. With the negative side of the $\pm$ corresponding to $\underline{d}$ and the positive side to $\overline{d}$, the bounds are expressible as

$$\big(\underline{d},\ \overline{d}\big) = \underbrace{\int_{\mathcal{T}} \Gamma^{\pm|\tau|}\, q(\tau \mid t, x)\, \mathrm{d}\tau}_{\longrightarrow\ \mathbb{E}_\tau[\Lambda(\tau \mid y_t, x)]} + (\pm\log\Gamma)\, \Gamma^{|t|} \left|\int_{\mathcal{T}} (\tau - t)\, q(\tau \mid t, x)\, \mathrm{d}\tau\right| + \frac{1}{2}\big(0,\ \log^2\Gamma\big)\, \Gamma^{|t|} \int_{\mathcal{T}} (\tau - t)^2\, q(\tau \mid t, x)\, \mathrm{d}\tau. \tag{8}$$

The $\Gamma^{\pm|\tau|}$ in the first integral, as well as the alternating sign of the other two terms taken together, reveal that $\underline{d} \leq 1 \leq \overline{d}$, with equality at $\Gamma = 1$. This is noteworthy because it implies that $p(y \mid t, x)$ is admissible for the partially identified $\tilde{p}(y_t \mid x)$. We cannot describe $p(y_t \mid x)$ once $\underline{d}$ crosses zero.
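For settings beyond Table 1, the bounds of Equation 8 can also be evaluated numerically. Below is a minimal quadrature sketch (our own illustrative helper; `w` and `p` are callables for the trust weights $w_t(\tau)$ and the nominal propensity density):

```python
import numpy as np

# Numerically evaluate the Equation-8 interval (d_lo, d_hi) on a grid.
def d_bounds(t, gamma, w, p, grid):
    q = w(grid, t) * p(grid)
    q /= np.trapz(q, grid)                        # normalize q(tau | t, x)
    log_g, g_t = np.log(gamma), gamma ** abs(t)
    e_lam_lo = np.trapz(gamma ** (-np.abs(grid)) * q, grid)
    e_lam_hi = np.trapz(gamma ** (+np.abs(grid)) * q, grid)
    m1 = abs(np.trapz((grid - t) * q, grid))      # |E_q[tau - t]|
    m2 = np.trapz((grid - t) ** 2 * q, grid)      # E_q[(tau - t)^2]
    d_lo = e_lam_lo - log_g * g_t * m1            # quadratic term is zero here
    d_hi = e_lam_hi + log_g * g_t * m1 + 0.5 * log_g ** 2 * g_t * m2
    return d_lo, d_hi
```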

Ensembles.

To quantify empirical uncertainties [Jesson et al., 2020] alongside our sensitivity, the predictors could be learned as ensembles, with $\tilde{p}(y_t \mid x)$ computed as (bootstrap resampled [Lo, 1987]) expectations over them.

2.3 Propensity-Trust Combinations

In addition to developing the general framework above, we derive analytical forms for a variety of parametrizations that span the relevant supports $\mathcal{T}$ for continuous treatments: the unit interval $[0, 1]$, the nonnegative reals $[0, +\infty)$, and the real number line $(-\infty, +\infty)$. For some nominal propensity distributions $p(\tau \mid x)$, we propose trust-weighing schemes $w_t(\tau)$ with shared form so that the expectations in Equation 8 are solvable.

For instance, consider the parametrization $(T \mid X = x) \sim \mathrm{Beta}(\alpha(x), \beta(x))$. We select a Beta-like weighing scheme, rescaled and translated, $w^{\mathrm{beta}}_t(\tau) = c_t\, \tau^{a_t - 1} (1 - \tau)^{b_t - 1}$. Two constraints are imposed on every $w_t(\tau)$ studied herein:

  • (the mode) that $w_t(\tau)$ peaks at $\tau = t$, with $w_t(t) = 1$.

  • (the precision) that some $r > 0$ defines a narrowness of the form, and can be set a priori.

For the Beta version we chose $a_t + b_t = r + 2$. These constraints imply that $a_t \coloneqq rt + 1$, $b_t \coloneqq r(1 - t) + 1$, and $c_t^{-1} \coloneqq t^{rt}(1 - t)^{r(1 - t)}$.
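A minimal sketch of this Beta trust-weighing scheme (valid for $t \in (0, 1)$; the function name is ours), which can be plugged into the hypothetical `d_bounds` quadrature above:

```python
import numpy as np

# Beta trust weights: a_t = r t + 1, b_t = r (1 - t) + 1, with c_t set so
# that the peak value w_t(t) equals one.
def w_beta(tau, t, r):
    a, b = r * t + 1.0, r * (1.0 - t) + 1.0
    log_c = -(r * t * np.log(t) + r * (1.0 - t) * np.log1p(-t))
    return np.exp((a - 1.0) * np.log(tau) + (b - 1.0) * np.log1p(-tau) + log_c)

assert abs(w_beta(0.25, 0.25, 16.0) - 1.0) < 1e-12   # peaks at tau = t, value 1
```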

Figure 4: Beta parametrizations for $w_t(\tau)$ in the unit square, plotted for $t = 0.125, 0.25, 0.5$, with panels $r = 4, 16, 64$. Trust declines with $r$.

The choices.

We present solutions for propensity-trust combinations in Table 1. Balanced Beta stands out by not strictly obeying Assumption 2. Rather, it adheres to a symmetrified mixture that is more versatile in realistic situations.

Balanced Beta.

Formally, for all $t$, $y_t$, and $x$, we balance the Beta parametrization by replacing Assumption 2 with

$$\begin{cases} \ p(\tau = 0 \mid y_t, x) = p(\tau = 0 \mid x) & \quad\text{w.p.}\quad t, \\ \ p(\tau = 1 \mid y_t, x) = p(\tau = 1 \mid x) & \quad\text{w.p.}\quad 1 - t. \end{cases}$$

This special parametrization deserves further justification. The premise is that distant treatments are decoupled; treatment assignment $\tau$ shares less information with a distal potential outcome $y_t$ than with a proximal one. If that were the case, then the above linear interpolation favors the less informative anchor point for a given $t$. This is helpful because the sensitivity analysis is vulnerable to the anchor points. Stratifying the anchor points eventually leads to an arithmetic mixture of $d(t \mid y_t, x)$ in Equation 7 with its mirrored version about $t \mapsto 1 - t$, and $(\alpha, \beta) \mapsto (\beta, \alpha)$.
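As a hedged illustration of this mixture (reusing the hypothetical `d_bounds` and `w_beta` sketches from above, and assuming SciPy is available), the Balanced Beta interval at $t$ mixes the original bounds with those of the reflected problem, $t \mapsto 1 - t$ with the propensity parameters swapped:

```python
import numpy as np
from scipy.stats import beta as beta_dist

# Balanced Beta: arithmetic mixture of the Equation-8 bounds with their
# mirrored version (t -> 1 - t, Beta propensity parameters swapped).
def balanced_d_bounds(t, gamma, r, a_bar, b_bar, grid):
    weight = lambda g, tt: w_beta(g, tt, r)
    orig = d_bounds(t, gamma, weight,
                    lambda g: beta_dist.pdf(g, a_bar, b_bar), grid)
    mirr = d_bounds(1.0 - t, gamma, weight,
                    lambda g: beta_dist.pdf(g, b_bar, a_bar), grid)
    return tuple(t * o + (1.0 - t) * m for o, m in zip(orig, mirr))

grid = np.linspace(1e-4, 1.0 - 1e-4, 4001)
print(balanced_d_bounds(0.3, 1.5, 16.0, 2.0, 3.0, grid))
```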

Controlling trust.

In the Beta parametrization, the absolute error of the approximation in Equation 3.A is bounded above by a form that can grow with narrower propensities (see §B). Intuitively, the error also depends on the smoothness of the complete propensity (the Taylor residual). For that reason we used the heuristic of setting the trust-weighing precision $r$ to the nominal propensity precision.

3 Estimating The Intervals

We seek to bound partially identified expectations with respect to the true potential-outcome densities, which are constrained according to Equations 7 / 8. The quantities of interest are the Average Potential Outcome (APO), $\mathbb{E}[f(Y_t)]$, and the Conditional Average Potential Outcome (CAPO), $\mathbb{E}[f(Y_t) \mid X = x]$, for any task-specific $f(y)$. We use Monte Carlo over $m$ realizations $y_i$ drawn from a proposal density $g(y)$, and covariates from a subsample of instances:

$$\tilde{\mathbb{E}}[f(Y_t) \mid X \in \{x^{(j)}\}_{j \in J}] = \frac{\sum_{i=1}^{m} \sum_{j \in J} f(y_i)\, \tilde{p}(y_t = y_i \mid x^{(j)}) / g(y_i)}{\sum_{i=1}^{m} \sum_{j \in J} \tilde{p}(y_t = y_i \mid x^{(j)}) / g(y_i)}, \tag{9}$$

where $J \subseteq \{1 \dots n\}$ indexes a subset of the finite instances. $|J| = 1$ recovers the formula for the CAPO, and $|J| = n$ for the APO. The partially identified $\tilde{p}(y_t \mid x)$ really encompasses a set of probability densities that includes $p(y \mid t, x)$ and smooth deviations from it. Our importance sampler ensures normalization [Tokdar and Kass, 2010], but is overly conservative [Dorn and Guo, 2022]. For current purposes, a greedy algorithm may be deployed to maximize (or minimize) Equation 9 by optimizing the weights $w_i$ attached to each $f(y_i)$, within the range

$$\underline{w}_i \coloneqq \frac{p(y_i \mid t, x)}{\overline{d}(t \mid y_i, x)\, g(y_i)}, \qquad \overline{w}_i \coloneqq \frac{p(y_i \mid t, x)}{\underline{d}(t \mid y_i, x)\, g(y_i)}.$$

Our Algorithm 1 adapts the method of Jesson et al. [2021], Kallus et al. [2019] to heterogeneous weight bounds $[\underline{w}_i, \overline{w}_i]$ per draw $i$. View a proof of correctness in §C.

Others have framed the APO as the averaged CAPOs, and left the min/max optimizations on the CAPO level [Jesson et al., 2022]. We optimize the APO directly, but have not studied the impact of one choice versus the other.

Input: $\{(\underline{w}_i, \overline{w}_i, f_i)\}_{i=1}^{n}$ ordered by ascending $f_i$.
Output: $\max_w \mathbb{E}[f(X)]$ estimated by importance sampling with $n$ draws.
Initialize $w_i \leftarrow \overline{w}_i$ for all $i = 1, 2, \dots, n$;
for $j = 1, 2, \dots, n$ do
    Compute $\Delta_j \coloneqq \sum_{i=1}^{n} w_i (f_j - f_i)$;
    if $\Delta_j < 0$ then
        $w_j \leftarrow \underline{w}_j$;
    else
        break;
    end if
end for
Return $\sum_i w_i f_i / \sum_i w_i$
Algorithm 1: The expectation maximizer, with $\mathcal{O}(n)$ runtime if intermediate $\Delta_j$ results are memoized.
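Below is a runnable sketch of Algorithm 1 in Python (variable names are our own), using the memoized running sums that give the stated $\mathcal{O}(n)$ runtime after sorting:

```python
import numpy as np

# Algorithm 1: maximize the self-normalized importance-sampling estimate
# over per-draw weights w_i confined to [w_lo_i, w_hi_i].
def maximize_expectation(w_lo, w_hi, f):
    order = np.argsort(f)                   # Algorithm 1 assumes ascending f_i
    w_lo, w_hi, f = w_lo[order], w_hi[order], f[order]
    w = w_hi.copy()                         # initialize at the upper bounds
    sum_w, sum_wf = w.sum(), (w * f).sum()  # memoized running sums
    for j in range(len(f)):
        delta_j = f[j] * sum_w - sum_wf     # = sum_i w_i (f_j - f_i)
        if delta_j >= 0:
            break                           # lowering w_j would no longer help
        sum_w += w_lo[j] - w[j]             # shrink the weight on small f_j
        sum_wf += (w_lo[j] - w[j]) * f[j]
        w[j] = w_lo[j]
    return sum_wf / sum_w

# The minimizer follows by symmetry: -maximize_expectation(w_lo, w_hi, -f).
```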
| Benchmarks | brain mean (std.) | blood mean (std.) | pbmc mean (std.) | mftc mean (std.) | % best | ratio to best |
|---|---|---|---|---|---|---|
| $\delta$MSM (ours) | **138** (120) | **141** (129) | **138** (121) | **144** (124) | **78.4** | **1.03** (0.08) |
| CMSM | 186 (153) | 188 (156) | 205 (169) | 182 (145) | 7.8 | 1.81 (2.15) |
| uniform | 158 (137) | 162 (146) | 157 (136) | 167 (141) | 4.8 | 1.20 (0.10) |
| binary MSM | 211 (128) | 213 (131) | 222 (127) | 214 (127) | 9.0 | 2.57 (2.34) |

Table 2: Semi-synthetic benchmark: divergence costs of 90% coverage of the Average Potential Outcome (APO), multiplied by 1000. The four datasets are listed on top. We report averages over 500 trials per experiment. A paired $t$-test and sign test, roughly corresponding to the mean and median, showed significant improvement by the $\delta$MSM over the others, all with $P < 10^{-5}$. "% best" counts the proportion of trials in which each method outperformed the rest, and "ratio to best" is the average ratio of each method's cost to the best method's in each trial; closer to one is better.

4 A Semi-synthetic Benchmark

It is common practice to test causal methods, especially under novel settings, with real datasets but synthetic outcomes [Curth et al., 2021, Cristali and Veitch, 2022]. We adopted four exceedingly diverse datasets spanning health, bioinformatics, and social-science sources. Our variable-generating process preserved the statistical idiosyncrasies of each dataset. Confounders and treatment were random projections of the data, which were quantile-normalized for uniform marginals in the unit interval. Half the confounders were observed as covariates and the other half were hidden. The outcome was Bernoulli with random linear or quadratic forms mixing the variables before passing through a normal CDF activation function. Outcome and propensity models were linear and estimated by maximum likelihood. See §E.

Selecting the baselines.

The $\delta$MSM with Balanced Beta was benchmarked against three relevant baselines.

  • (CMSM)  Use the recent model by Jesson et al. [2022], where $\underline{d} \coloneqq \Gamma^{-1} p(\tau \mid x)$, $\overline{d} \coloneqq \Gamma^{+1} p(\tau \mid x)$.

  • (uniform)  Suppose $\underline{d} \coloneqq \Gamma^{-1}$, $\overline{d} \coloneqq \Gamma^{+1}$, as if the propensity were uniform and constant.

  • (binary MSM)  Shoehorn the propensity into the classic MSM [Tan, 2006] by considering the treatment as binary with indicator $\mathbb{I}[T > 0.5]$.

Note that the CMSM becomes equivalent to the "uniform" baseline above when CAPOs are concerned (Equation 9 with $|J| = 1$, where the propensity factor is constant and cancels), which are not studied in this benchmark.

Figure 5: Divergence cost measures the size of the ignorance intervals (blue), weighted by the badness of each estimate (red; KL divergences from low to high). The black line is the true APO, plotted as a Bernoulli parameter against treatment $t$. Coverage is the portion of treatments contained in the blue shaded region, between A and B in this example. We target 90% of the unit interval in our benchmark with Beta-distributed treatments.

Scoring the coverages.

A reasonable goal would be to achieve a certain amount of coverage [McCandless et al., 2007] of the true APOs, like having 90% of the curve be contained in the ignorance intervals. Since the violation factor $\Gamma$ is not entirely interpretable, nor commensurable across sensitivity models, we measure the size of an ignorance interval via a cost incurred in terms of actionable inference. For each point $t$ of the dose-response curve, we integrated the KL divergence of the actual APO (which defines the $Y_t$ Bernoulli parameter) against the predicted APO uniformly between the bounds. This way, each additional unit of ignorance interval is weighed by its information-theoretic approximation cost. This score is a divergence cost of a target coverage.
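A minimal sketch of this divergence-cost score for Bernoulli APOs (function names and the quadrature grid are our own illustrative choices):

```python
import numpy as np

# KL divergence between Bernoulli(p) and Bernoulli(q).
def bernoulli_kl(p, q):
    p = np.clip(p, 1e-9, 1.0 - 1e-9)
    q = np.clip(q, 1e-9, 1.0 - 1e-9)
    return p * np.log(p / q) + (1.0 - p) * np.log((1.0 - p) / (1.0 - q))

# Divergence cost at each treatment point: the KL of the true APO against a
# predicted APO, averaged uniformly over the ignorance interval [lo, hi].
def divergence_cost(p_true, lo, hi, n_grid=256):
    qs = np.linspace(lo, hi, n_grid)        # broadcasts over the t-grid
    return bernoulli_kl(p_true, qs).mean(axis=0)

# Total score of a curve: average the pointwise costs over the treatment
# grid, e.g. divergence_cost(apo_true, apo_lo, apo_hi).mean().
```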

Analysis.

The main results are displayed in Table 2. There were ten confounders and the true dose-response curve was a random quadratic form in the treatment and confounders. Other settings are shown in Supplementary Table 4. Each trial exhibited completely new projections and outcome function. There were different levels and types of confounding as well as varying model fits. Still, clear patterns are evident in Table 2, like the rate at which the $\delta$MSM provided the lowest divergence cost against the baselines.

Figure 6: Performance for different target coverages (60–95%). Black line: rate of the $\delta$MSM achieving the lowest divergence cost compared to baselines. Dashed line: expected rate if the chance of any one method outperforming another were identical.

5 A Real-world Exemplar

The UK Biobank [Bycroft et al., 2018] is a large, densely phenotyped epidemiological study with brain imaging. We preprocessed 40 attributes, eight of which were continuous diet quality scores (DQSs) [Said et al., 2018, Zhuang et al., 2021] valued 0–10 and serving as treatments, on 42,032 people. The outcomes were the thicknesses of 34 cortical brain regions. A poor DQS could translate to noticeable atrophy in the brain of some older individuals, depending on their attributes [Gu et al., 2015, Melo Van Lent et al., 2022].

Continuous treatments enable the (Conditional) Average Causal Derivative, (C)ACD $\coloneqq \partial\mathbb{E}[Y_t \mid X] / \partial t$. The CACD informs investigators of the incremental change in outcome due to a small change in an individual's given treatment. For instance, it may be useful to identify the individuals who would benefit the most from an incremental improvement in diet. We plotted the age distributions of the top 1% of individuals by CACD (diet $\to$ cortical thickness) in Figure 7.
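As a sketch (assuming a fitted CAPO estimator `capo(t, x)`, which is hypothetical here), the CACD may be approximated by central finite differences; under partial identification, differencing opposite interval endpoints gives a conservative bracket:

```python
# Central finite-difference estimate of the CACD at treatment t.
def cacd(capo, t, x, h=1e-2):
    return (capo(t + h, x) - capo(t - h, x)) / (2.0 * h)

# With partially identified CAPO bounds capo_lo/capo_hi, a conservative
# interval for the derivative would be:
#   [(capo_lo(t + h, x) - capo_hi(t - h, x)) / (2 h),
#    (capo_hi(t + h, x) - capo_lo(t - h, x)) / (2 h)].
```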

Figure 7: Empirical CDFs of age for the individuals with the top 1% causal derivatives of cortical thickness with respect to DQSs, at $\Gamma = 1$ and $\Gamma = 1.2$. When we apply the $\delta$MSM $(\Gamma > 1)$ for partial identification, these individuals skew even older. This is expected logically because older people have had more years during which they could have revised their diets. The red dotted line corresponds to the entire population.

We also compared the $\delta$MSM to an equivalent binary MSM, where CACDs are computed in the latter case by thresholding the binary propensity at $t$. Each model's violation factor $\Gamma$ was set for an equivalent amount ($\sim$30%) of nonzero CACDs. Under the $\delta$MSM, the DQSs with the strongest average marginal benefit ranked as vegetables, whole grains, and then meat, for both females and males. They differed under the binary MSM, with meat, then whole grains as the top for females, and dairy, then refined grains as the top for males.

6 Discussion

Sensitivity analyses for hidden confounders can help to guard against erroneous conclusions from observational studies. We generalized the practice to causal dose-response curves, thereby increasing its practical applicability. However, there is no replacement for an actual interventional study, and researchers must be careful to maintain a healthy degree of skepticism towards observational results even after properly calibrating the partially identified effects.

Specifically for Average Potential Outcomes (APOs) via the sample-based algorithm, we demonstrated widespread applicability of the $\delta$MSM in §4 by showing that it provided tighter ignorance intervals than the recent CMSM and other models in 78% of all trials, notwithstanding the wide variation in scenarios tested. Ablating the approximation in Equation 2 by dropping the quadratic term, that percentage falls slightly to 74%. Even further, keeping just the constant term results in a large drop to 7%. This result suggests that the proposed Taylor expansion (Equation 2) is useful, and that terms of higher order would not give additional value.

We showcased sensible behaviors of the $\delta$MSM in a real observational case study (§5), e.g. how older people would be more impacted by (retroactive) changes to their reported diets. Additionally, the top effectual DQSs appeared more consistent under the $\delta$MSM than under the binary MSM.

Contrasting the CMSM.

Another recently proposed sensitivity model for continuous-valued treatments is the CMSM [Jesson et al., 2022], which was included in our benchmark (§4). Unlike the $\delta$MSM, the CMSM does not always guarantee $\underline{d} \leq 1 \leq \overline{d}$, and therefore $p(y \mid t, x)$ need not be admissible for $\tilde{p}(y_t \mid x)$. For partial identification of the CAPO with importance sampling, the propensity density factors out and does not affect outcome sensitivity under the CMSM. For that implementation it happens that $p(y \mid t, x)$ is indeed admissible. However, we believe that the nominal propensity should play a role in the CAPO's sensitivity to hidden confounders, as both the CMSM and the $\delta$MSM couple the hidden confounding (via the complete propensity) to the nominal propensity. Equations 7 & 8 make it clear that the propensity plays a key role in outcome sensitivity under the $\delta$MSM for both the CAPO and the APO. We remind the reader of the original MSM, which bounds a ratio of complete and nominal propensity odds. The $\delta$MSM takes that structure to the infinitesimal limit and maintains the original desirable property that $p(y \mid t, x)$ is admissible for $\tilde{p}(y_t \mid x)$.

Looking ahead.

Alternatives to the sampling-based Algorithm 1 deserve further investigation for computing ignorance intervals on expectations, and beyond. Our analytical solutions bound the density function $p(y_t \mid x)$ of conditional potential outcomes, which can generate other quantities of interest [Kallus, 2022] or play a role in larger pipelines. Further, an open challenge with the $\delta$MSM would be to find a pragmatic solution to sharp partial identification. Recent works have introduced sharpness to binary-treatment sensitivity analysis [Oprescu et al., 2023].

7 Conclusion

We recommend the novel $\delta$MSM for causal sensitivity analyses with continuous-valued treatments. The simple and practical Monte Carlo estimator for the APO and CAPO (Algorithm 1) gives tighter ignorance intervals with the $\delta$MSM than alternatives. We believe that the partial identification of the potential-outcome density shown in Equation 8, in conjunction with the parametric formulas of Table 1, is of general applicability for causal inference in real-world problems. The variety of settings presented in that table allows a domain-informed selection of realistic sensitivity assumptions. For instance, when estimating the effect of a real-valued variable's deviations from some base value, like a region's current temperature compared to its historical average, the Gaussian scheme could be used. Gamma is ideal for one-sided or unidirectional deviations. Finally, Balanced Beta is recommended for measurements in an interval where neither of the endpoints is special.

Acknowledgements.
This work was funded in part by Defense Advanced Research Projects Agency (DARPA) and Army Research Office (ARO) under Contract No. W911NF-21-C-0002.

References

  • Athey et al. [2019] S. Athey, J. Tibshirani, and S. Wager. Generalized random forests. The Annals of Statistics, 47(2):1148–1178, 2019.
  • Atroszko [2019] P. A. Atroszko. Is a high workload an unaccounted confounding factor in the relation between heavy coffee consumption and cardiovascular disease risk? The American Journal of Clinical Nutrition, 110(5):1257–1258, 2019.
  • Bonvini and Kennedy [2022] M. Bonvini and E. H. Kennedy. Fast convergence rates for dose-response estimation. arXiv preprint arXiv:2207.11825, 2022.
  • Bromiley [2003] P. Bromiley. Products and convolutions of Gaussian probability density functions. Tina-Vision Memo, 3(4):1, 2003.
  • Bycroft et al. [2018] C. Bycroft, C. Freeman, D. Petkova, G. Band, L. T. Elliott, K. Sharp, A. Motyer, D. Vukcevic, O. Delaneau, J. O’Connell, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature, 562(7726):203–209, 2018.
  • Calabrese and Baldwin [2001] E. J. Calabrese and L. A. Baldwin. U-shaped dose-responses in biology, toxicology, and public health. Annual Review of Public Health, 22(1):15–33, 2001. 10.1146/annurev.publhealth.22.1.15. PMID: 11274508.
  • Chen et al. [2022] Y.-L. Chen, L. Minorics, and D. Janzing. Correcting confounding via random selection of background variables. arXiv preprint arXiv:2202.02150, 2022.
  • Chernozhukov et al. [2017] V. Chernozhukov, D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, J. Robins, et al. Double/debiased machine learning for treatment and causal parameters. Technical report, 2017.
  • Chernozhukov et al. [2021] V. Chernozhukov, C. Cinelli, W. Newey, A. Sharma, and V. Syrgkanis. Long story short: Omitted variable bias in causal machine learning. arXiv preprint arXiv:2112.13398, 2021.
  • Colangelo and Lee [2021] K. Colangelo and Y.-Y. Lee. Double debiased machine learning nonparametric inference with continuous treatments. arXiv preprint arXiv:2004.03036, 2021.
  • Cornfield et al. [1959] J. Cornfield, W. Haenszel, E. C. Hammond, A. M. Lilienfeld, M. B. Shimkin, and E. L. Wynder. Smoking and lung cancer: recent evidence and a discussion of some questions. Journal of the National Cancer Institute, 22(1):173–203, 1959.
  • Cristali and Veitch [2022] I. Cristali and V. Veitch. Using embeddings for causal estimation of peer influence in social networks. In A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, editors, Advances in Neural Information Processing Systems, 2022.
  • Curth et al. [2021] A. Curth, D. Svensson, J. Weatherall, and M. van der Schaar. Really doing great at estimating CATE? a critical look at ML benchmarking practices in treatment effect estimation. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021.
  • Dorn and Guo [2022] J. Dorn and K. Guo. Sharp sensitivity analysis for inverse propensity weighting via quantile balancing. Journal of the American Statistical Association, pages 1–13, 2022.
  • Ghassami et al. [2021] A. Ghassami, N. Sani, Y. Xu, and I. Shpitser. Multiply robust causal mediation analysis with continuous treatments. arXiv preprint arXiv:2105.09254, 2021.
  • Godos et al. [2020] J. Godos, M. Tieri, F. Ghelfi, L. Titta, S. Marventano, A. Lafranconi, A. Gambera, E. Alonzo, S. Sciacca, S. Buscemi, et al. Dairy foods and health: an umbrella review of observational studies. International Journal of Food Sciences and Nutrition, 71(2):138–151, 2020.
  • Gu et al. [2015] Y. Gu, A. M. Brickman, Y. Stern, C. G. Habeck, Q. R. Razlighi, J. A. Luchsinger, J. J. Manly, N. Schupf, R. Mayeux, and N. Scarmeas. Mediterranean diet and brain structure in a multiethnic elderly cohort. Neurology, 85(20):1744–1751, 2015.
  • Guo et al. [2022] W. Guo, M. Yin, Y. Wang, and M. Jordan. Partial identification with noisy covariates: A robust optimization approach. In Conference on Causal Learning and Reasoning, pages 318–335. PMLR, 2022.
  • Hill [2011] J. L. Hill. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240, 2011.
  • Hoover et al. [2020] J. Hoover, G. Portillo-Wightman, L. Yeh, S. Havaldar, A. M. Davani, Y. Lin, B. Kennedy, M. Atari, Z. Kamel, M. Mendlen, et al. Moral foundations twitter corpus: A collection of 35k tweets annotated for moral sentiment. Social Psychological and Personality Science, 11(8):1057–1071, 2020.
  • Hu et al. [2021] Y. Hu, Y. Wu, L. Zhang, and X. Wu. A generative adversarial framework for bounding confounded causal effects. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 12104–12112, 2021.
  • Huang et al. [2021] W. Huang, O. Linton, and Z. Zhang. A unified framework for specification tests of continuous treatment effect models. Journal of Business & Economic Statistics, 0(0):1–14, 2021. 10.1080/07350015.2021.1981915.
  • Jesson et al. [2020] A. Jesson, S. Mindermann, U. Shalit, and Y. Gal. Identifying causal-effect inference failure with uncertainty-aware models. Advances in Neural Information Processing Systems, 33:11637–11649, 2020.
  • Jesson et al. [2021] A. Jesson, S. Mindermann, Y. Gal, and U. Shalit. Quantifying ignorance in individual-level causal-effect estimates under hidden confounding. ICML, 2021.
  • Jesson et al. [2022] A. Jesson, A. R. Douglas, P. Manshausen, M. Solal, N. Meinshausen, P. Stier, Y. Gal, and U. Shalit. Scalable sensitivity and uncertainty analyses for causal-effect estimates of continuous-valued interventions. In A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=PzI4ow094E.
  • Kallus [2022] N. Kallus. Treatment effect risk: Bounds and inference. In 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 213–213, 2022.
  • Kallus and Santacatterina [2019] N. Kallus and M. Santacatterina. Kernel optimal orthogonality weighting: A balancing approach to estimating effects of continuous treatments. arXiv preprint arXiv:1910.11972, 2019.
  • Kallus et al. [2019] N. Kallus, X. Mao, and A. Zhou. Interval estimation of individual-level causal effects under unobserved confounding. In The 22nd international conference on artificial intelligence and statistics, pages 2281–2290. PMLR, 2019.
  • Kang et al. [2018] H. M. Kang, M. Subramaniam, S. Targ, M. Nguyen, L. Maliskova, E. McCarthy, E. Wan, S. Wong, L. Byrnes, C. M. Lanata, et al. Multiplexed droplet single-cell RNA-sequencing using natural genetic variation. Nature Biotechnology, 36(1):89–94, 2018.
  • Kilbertus et al. [2020] N. Kilbertus, M. J. Kusner, and R. Silva. A class of algorithms for general instrumental variable models. Advances in Neural Information Processing Systems, 33:20108–20119, 2020.
  • Lim et al. [2021] J. Lim, C. X. Ji, M. Oberst, S. Blecker, L. Horwitz, and D. Sontag. Finding regions of heterogeneity in decision-making via expected conditional covariance. Advances in Neural Information Processing Systems, 34:15328–15343, 2021.
  • Lo [1987] A. Y. Lo. A large sample study of the bayesian bootstrap. The Annals of Statistics, 15(1):360–375, 1987.
  • Louizos et al. [2017] C. Louizos, U. Shalit, J. M. Mooij, D. Sontag, R. Zemel, and M. Welling. Causal effect inference with deep latent-variable models. Advances in neural information processing systems, 30, 2017.
  • Manski [2003] C. F. Manski. Partial identification of probability distributions, volume 5. Springer, 2003.
  • Mastouri et al. [2021] A. Mastouri, Y. Zhu, L. Gultchin, A. Korba, R. Silva, M. Kusner, A. Gretton, and K. Muandet. Proximal causal learning with kernels: Two-stage estimation and moment restriction. In International Conference on Machine Learning, pages 7512–7523. PMLR, 2021.
  • McCandless et al. [2007] L. C. McCandless, P. Gustafson, and A. Levy. Bayesian sensitivity analysis for unmeasured confounding in observational studies. Statist Med, 26:2331–2347, 2007.
  • Melo Van Lent et al. [2022] D. Melo Van Lent, H. Gokingco, M. I. Short, C. Yuan, P. F. Jacques, J. R. Romero, C. S. DeCarli, A. S. Beiser, S. Seshadri, J. J. Himali, et al. Higher dietary inflammatory index scores are associated with brain MRI markers of brain aging: results from the Framingham Heart Study Offspring cohort. Alzheimer’s & Dementia, 2022.
  • Meresht et al. [2022] V. B. Meresht, V. Syrgkanis, and R. G. Krishnan. Partial identification of treatment effects with implicit generative models. In A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=8cUGfg-zUnh.
  • Mokhberian et al. [2020] N. Mokhberian, A. Abeliuk, P. Cummings, and K. Lerman. Moral framing and ideological bias of news. In Social Informatics: 12th International Conference, SocInfo 2020, Pisa, Italy, October 6–9, 2020, Proceedings 12, pages 206–219. Springer, 2020.
  • Oprescu et al. [2023] M. Oprescu, J. Dorn, M. Ghoummaid, A. Jesson, N. Kallus, and U. Shalit. B-learner: Quasi-oracle bounds on heterogeneous causal effects under hidden confounding. arXiv preprint arXiv:2304.10577, 2023.
  • Padh et al. [2022] K. Padh, J. Zeitler, D. Watson, M. Kusner, R. Silva, and N. Kilbertus. Stochastic causal programming for bounding treatment effects. arXiv preprint arXiv:2202.10806, 2022.
  • Rosenbaum [2002] P. R. Rosenbaum. Observational Studies. Springer, 2002.
  • Rosenbaum and Rubin [1983] P. R. Rosenbaum and D. B. Rubin. Assessing sensitivity to an unobserved binary covariate in an observational study with binary outcome. Journal of the Royal Statistical Society: Series B (Methodological), 45(2):212–218, 1983.
  • Rubin [1974] D. B. Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688, 1974.
  • Said et al. [2018] M. A. Said, N. Verweij, and P. van der Harst. Associations of combined genetic and lifestyle risks with incident cardiovascular disease and diabetes in the UK Biobank study. JAMA Cardiology, 3(8):693–702, 2018.
  • Sarvet and Stensrud [2022] A. L. Sarvet and M. J. Stensrud. Without commitment to an ontology, there could be no causal inference. Epidemiology, 33(3):372–378, 2022.
  • Simpson [1951] E. H. Simpson. The interpretation of interaction in contingency tables. Journal of the Royal Statistical Society: Series B (Methodological), 13(2):238–241, 1951.
  • Taleb [2018] N. N. Taleb. (anti) fragility and convex responses in medicine. In Unifying Themes in Complex Systems IX: Proceedings of the Ninth International Conference on Complex Systems 9, pages 299–325. Springer, 2018.
  • Tan [2006] Z. Tan. A distributional approach for causal inference using propensity scores. Journal of the American Statistical Association, 101(476):1619–1637, 2006.
  • Tchetgen et al. [2020] E. J. T. Tchetgen, A. Ying, Y. Cui, X. Shi, and W. Miao. An introduction to proximal causal learning. arXiv preprint arXiv:2009.10982, 2020.
  • Tokdar and Kass [2010] S. T. Tokdar and R. E. Kass. Importance sampling: A review. WIREs Computational Statistics, 2(1):54–60, 2010.
  • Tübbicke [2022] S. Tübbicke. Entropy balancing for continuous treatments. Journal of Econometric Methods, 11(1):71–89, 2022.
  • Vegetabile et al. [2021] B. G. Vegetabile, B. A. Griffin, D. L. Coffman, M. Cefalu, M. W. Robbins, and D. F. McCaffrey. Nonparametric estimation of population average dose-response curves using entropy balancing weights for continuous exposures. Health Services and Outcomes Research Methodology, 21(1):69–110, 2021.
  • Veitch and Zaveri [2020] V. Veitch and A. Zaveri. Sense and sensitivity analysis: Simple post-hoc analysis of bias due to unobserved confounding. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 10999–11009. Curran Associates, Inc., 2020.
  • Yadlowsky et al. [2020] S. Yadlowsky, H. Namkoong, S. Basu, J. Duchi, and L. Tian. Bounds on the conditional and average treatment effect with unobserved confounding factors. arXiv preprint arXiv:1808.09521, 2020.
  • Yao et al. [2021] L. Yao, Z. Chu, S. Li, Y. Li, J. Gao, and A. Zhang. A survey on causal inference. ACM Transactions on Knowledge Discovery from Data (TKDD), 15(5):1–46, 2021.
  • Yin et al. [2021] M. Yin, C. Shi, Y. Wang, and D. M. Blei. Conformal sensitivity analysis for individual treatment effects. arXiv preprint arXiv:2112.03493v2, 2021.
  • Ystrom et al. [2022] E. Ystrom, E. Degerud, M. Tesli, A. Høye, T. Reichborn-Kjennerud, and Ø. Næss. Alcohol consumption and lower risk of cardiovascular and all-cause mortality: the impact of accounting for familial factors in twins. Psychological Medicine, pages 1–9, 2022.
  • Yule [1903] G. U. Yule. Notes on the theory of association of attributes in statistics. Biometrika, 2(2):121–134, 1903. ISSN 0006-3444. doi: 10.1093/biomet/2.2.121. URL https://doi.org/10.1093/biomet/2.2.121.
  • Zhao et al. [2019] Q. Zhao, D. S. Small, and B. B. Bhattacharya. Sensitivity analysis for inverse probability weighting estimators via the percentile bootstrap. Journal of the Royal Statistical Society (Series B), 81(4):735–761, 2019.
  • Zhuang et al. [2021] P. Zhuang, X. Liu, Y. Li, X. Wan, Y. Wu, F. Wu, Y. Zhang, and J. Jiao. Effect of diet quality and genetic predisposition on hemoglobin a1c and type 2 diabetes risk: gene-diet interaction analysis of 357,419 individuals. Diabetes Care, 44(11):2470–2479, 2021.

Appendix A Completing the Derivations

Consider Equation 3.A:

$$
\int_{0}^{1} w_{t}(\tau)\,\tilde{p}(y_{t}|\tau,x)\,p(\tau|x)\,\mathrm{d}\tau
= \underbrace{p(y_{t}|t,x)\int_{0}^{1} w_{t}(\tau)\,p(\tau|x)\,\mathrm{d}\tau}_{(A.0)}
\;+\; \underbrace{g_{1}(y_{t}|t,x)\int_{0}^{1} w_{t}(\tau)(\tau-t)\,p(\tau|x)\,\mathrm{d}\tau}_{(A.1)}
\;+\; \underbrace{g_{2}(y_{t}|t,x)\int_{0}^{1} w_{t}(\tau)\frac{(\tau-t)^{2}}{2}\,p(\tau|x)\,\mathrm{d}\tau}_{(A.2)},
\quad\text{where } g_{k}(y_{t}|t,x) \coloneqq \partial^{k}_{\tau}\,p(y_{t}|\tau,x)\big|_{\tau=t}. \tag{10}
$$

Lightening the notation with a shorthand for the weighted expectations, $\langle\cdot\rangle_{\tau} \coloneqq \int_{0}^{1} w_{t}(\tau)(\cdot)\,p(\tau|x)\,\mathrm{d}\tau$, it becomes apparent that we must grapple with the pseudo-moments $\langle 1\rangle_{\tau}$, $\langle\tau-t\rangle_{\tau}$, and $\langle(\tau-t)^{2}\rangle_{\tau}$. Note that $t$ should not be mistaken for a “mean” value.

Furthermore, we have yet to fully characterize $g_{k}(y_{t}|t,x)$. Observe that

$$
p(y_{t}|\tau,x) = \frac{p(\tau|y_{t},x)\,p(y_{t}|x)}{p(\tau|x)}
\quad\iff\quad
\partial_{\tau}\, p(y_{t}|\tau,x) = p(y_{t}|x)\cdot\frac{\partial}{\partial\tau}\frac{p(\tau|y_{t},x)}{p(\tau|x)}.
$$

The factor $p(y_{t}|x)$ will be moved to the other side of the equation as needed; by Equation 6,

$$
\frac{\partial}{\partial\tau}\frac{p(\tau|y_{t},x)}{p(\tau|x)} = \frac{\partial}{\partial\tau}\Lambda(\tau|y_{t},x).
$$

Expanding,

$$
\frac{\partial}{\partial\tau}\Lambda(\tau|y_{t},x)
= \frac{\partial}{\partial\tau}\exp\left(\int_{0}^{\tau}\gamma(\tau'|y_{t},x)\,\mathrm{d}\tau'\right)
= \gamma(\tau|y_{t},x)\exp\left(\int_{0}^{\tau}\gamma(\tau'|y_{t},x)\,\mathrm{d}\tau'\right)
= (\gamma\Lambda)(\tau|y_{t},x).
$$

Appropriate bounds will be calculated for $g_{2}(y_{t}|t,x)$ next, utilizing the finding above as their main ingredient. Let

$$
\tilde{g}_{k}(y_{t}|t,x) \coloneqq p(y_{t}|x)^{-1}\, g_{k}(y_{t}|t,x)
= \left.\left(\frac{\partial}{\partial\tau}\right)^{\!k}\frac{p(\tau|y_{t},x)}{p(\tau|x)}\right|_{\tau=t}.
$$

The second derivative may be calculated in terms of the ignorance quantities $\gamma, \Lambda$:

$$
\begin{aligned}
\tilde{g}_{2}(y_{t}|t,x)
&= \partial_{\tau}\,\gamma(\tau|y_{t},x)\Lambda(\tau|y_{t},x)\\
&= \gamma(\tau|y_{t},x)^{2}\Lambda(\tau|y_{t},x) + \dot{\gamma}(\tau|y_{t},x)\Lambda(\tau|y_{t},x)\\
&= (\gamma^{2}+\dot{\gamma})\Lambda(\tau|y_{t},x).
\end{aligned}
$$
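This step can also be machine-checked. The following SymPy snippet is our own addition, not part of the paper's code; it confirms that $\partial_{\tau}(\gamma\Lambda) = (\gamma^{2}+\dot{\gamma})\Lambda$ when $\Lambda(\tau) = \exp\int_{0}^{\tau}\gamma$, as in Equation 6:

```python
import sympy as sp

tau, s = sp.symbols("tau s")
gamma = sp.Function("gamma")

# Lambda(tau) = exp( integral_0^tau gamma(s) ds ), as in Equation 6
Lam = sp.exp(sp.Integral(gamma(s), (s, 0, tau)))

lhs = sp.diff(gamma(tau) * Lam, tau)                      # d/dtau (gamma * Lambda)
rhs = (gamma(tau) ** 2 + sp.diff(gamma(tau), tau)) * Lam  # (gamma^2 + gamma-dot) * Lambda
print(sp.simplify(lhs - rhs))                             # prints 0
```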

And finally we address $\tilde{p}(y_{t}|x)$. Carrying over the components of Equation 10 into Equation 3,

$$
\begin{aligned}
\tilde{p}(y_{t}|x)
&= \frac{p(y_{t}|t,x)\,\langle 1\rangle_{\tau}}{\langle\Lambda(\tau|y_{t},x)\rangle_{\tau} - \tilde{g}_{1}(y_{t}|t,x)\,\langle\tau-t\rangle_{\tau} - \tilde{g}_{2}(y_{t}|t,x)\,\big\langle\tfrac{(\tau-t)^{2}}{2}\big\rangle_{\tau}} \qquad\textrm{(11)}\\
&= \frac{p(y_{t}|t,x)}{\mathbb{E}_{\tau}[\Lambda(\tau|y_{t},x)] - (\gamma\Lambda)(t|y_{t},x)\,\mathbb{E}_{\tau}[\tau-t] - \tfrac{1}{2}\big((\dot{\gamma}+\gamma^{2})\Lambda\big)(t|y_{t},x)\,\mathbb{E}_{\tau}[(\tau-t)^{2}]},
\end{aligned}
$$

where these expectations $\mathbb{E}_{\tau}[\cdot]$ are taken with respect to the implicit distribution $q(\tau|t,x) \propto w_{t}(\tau)\,p(\tau|x)$. The first term in the denominator, $\mathbb{E}_{\tau}[\Lambda(\tau|y_{t},x)]$, may be approximately bounded by the same Algorithm 1.
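To make the remaining pseudo-moments concrete: under the Beta trust-weighing scheme of Appendix B, $w_{t}(\tau)p(\tau|x) \propto \tau^{A}(1-\tau)^{B}$, so $q(\tau|t,x)$ is itself a $\mathrm{Beta}(A+1, B+1)$ density and the required expectations have closed forms. The following Python sketch is our illustration under that assumption, not an excerpt of the released code:

```python
import numpy as np
from scipy.stats import beta as beta_dist

def pseudo_moments(alpha, beta_param, r, t):
    """E_tau[tau - t] and E_tau[(tau - t)^2] under q(tau|t,x), where
    A = alpha - 1 + r*t and B = beta_param - 1 + r*(1 - t) make
    q a Beta(A+1, B+1) density (see Appendix B)."""
    A = alpha - 1 + r * t
    B = beta_param - 1 + r * (1 - t)
    q = beta_dist(A + 1, B + 1)
    first = q.mean() - t                    # E[tau] - t
    second = q.var() + (q.mean() - t) ** 2  # variance plus squared bias
    return first, second
```

Plugging these into the denominator of Equation 11, together with the bounds on $\gamma$ and $\Lambda$, yields the partial-identification interval for $\tilde{p}(y_{t}|x)$.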

Appendix B How to Calibrate the Weighing Scheme

We present an argument based on the absolute error of the approximation in Equation 2, specifically for Beta propensities. The following applies to both the Beta and Balanced Beta schemes, with $0 < t < 1$.

Suppose the second derivative employed in the Taylor expansion is $Q$-Lipschitz, so that $|\partial^{3}_{\tau}\, p(y_{t}|\tau,x)| \leq Q$. Denote the remainder by $\rho(y_{t}|\tau,x)$. By Taylor’s theorem,

$$
|\rho(y_{t}|\tau,x)| \leq \frac{|\tau-t|^{3}}{6}\, Q.
$$

The approximated quantity (part A) in Equation 3 is the following integral, which ends up becoming the numerator in Equation 4:

$$
\int_{0}^{1} w_{t}(\tau)\,\tilde{p}(y_{t}|\tau,x)\,p(\tau|x)\,\mathrm{d}\tau
= \int_{0}^{1} w_{t}(\tau)\big[p(y_{t}|\tau,x) + \rho(y_{t}|\tau,x)\big]\,p(\tau|x)\,\mathrm{d}\tau.
$$

The absolute error of this integral is therefore

$$
\left|\int_{0}^{1} w_{t}(\tau)\,\rho(y_{t}|\tau,x)\,p(\tau|x)\,\mathrm{d}\tau\right|
\;\leq\; \frac{1}{6}\, Q \underbrace{\int_{0}^{1} w_{t}(\tau)\,p(\tau|x)\,|\tau-t|^{3}\,\mathrm{d}\tau}_{\coloneqq J,\ \text{which upper-bounds the error}}
\quad\text{by the remainder theorem.}
$$

Let $A = \alpha - 1 + rt$ and $B = \beta - 1 + r(1-t)$, where $(\alpha,\beta)$ parametrize the nominal propensity and $r$ is the precision of the Beta trust-weighing scheme. The trust-propensity combination is

$$
w_{t}(\tau)\,p(\tau|x) = \frac{\tau^{A}(1-\tau)^{B}}{c_{t}\,\mathbb{B}(\alpha,\beta)},
\quad\text{where } c_{t} = t^{rt}(1-t)^{r(1-t)}.
$$

Hence, the error bound reduces to

$$
\begin{aligned}
J &= [c_{t}\,\mathbb{B}(\alpha,\beta)]^{-1}\int_{0}^{1}\tau^{A}(1-\tau)^{B}\,|\tau-t|^{3}\,\mathrm{d}\tau\\
&= [c_{t}\,\mathbb{B}(\alpha,\beta)]^{-1}\left[\;\underbrace{\frac{\Gamma(A+1)\Gamma(B+1)}{\Gamma(A+B+5)}\, U_{3}(A,B,t)}_{\text{first term}}
\;+\; \underbrace{\frac{\Gamma(A+1)}{\Gamma(A+5)}\, 12\, t^{A+4}(1-t)^{B+4}\, {}_{2}F_{1}(4,\, A+B+5;\, A+5;\, t)}_{\text{second term}}\;\right],
\end{aligned}
$$

where $U_{3}(A,B,t)$ is a cubic polynomial in $A$, $B$, and $t$. Notice that even though the quantity is symmetric under $(A,B,t) \mapsto (B,A,1-t)$, its written form does not appear so. We shall focus on how the error bound depends on $A$ and $\alpha$, then justify the analogous conclusion for $B$ and $\beta$ by the underlying symmetry of the expression.

The Gaussian hypergeometric function in the second term can be expressed as

$$
\begin{aligned}
\sum_{i=0}^{\infty}\frac{(4)_{i}\,(A+B+5)_{i}}{(A+5)_{i}}\frac{t^{i}}{i!}
&= \sum_{i=0}^{\infty}(4)_{i}\underbrace{\left(\frac{A+B+5}{A+5}\right)\left(\frac{A+B+6}{A+6}\right)\cdots}_{\text{length } i}\frac{t^{i}}{i!}\\
&= \sum_{i=0}^{\infty}\frac{(4)_{i}}{i!}\left(1+\frac{B}{A+5}\right)\left(1+\frac{B}{A+6}\right)\cdots\, t^{i},
\quad\text{where } \frac{(4)_{i}}{i!} = \frac{(i+2)(i+3)(i+4)}{3!},
\end{aligned}
$$

using the definition of the Pochhammer symbol $(x)_{i} = x(x+1)\cdots(x+i-1)$. As $A \to \infty$, the whole second term in $J$ is $\mathcal{O}(A^{-4})$ due to the ratio of $\Gamma$ functions. The first term in $J$ is

$$
\mathcal{O}\!\left(A^{-(B+4)}\, B^{-(A+4)}\right)\cdot U_{3}(A,B,t) = \mathcal{O}\!\left(A^{-B-1}\, B^{-A-1}\right)
$$

by Stirling’s approximation, $\Gamma(x) = \mathcal{O}(x^{x-\frac{1}{2}})$. Clearly, a small $B > 0$ can cause the first term in $J$ to explode for large $A$, owing to the $\mathcal{O}(B^{-A-1})$ part. This occurs with high $\alpha$, low $\beta$, and low $r$: an instance of a high-precision propensity paired with a low-precision weighing scheme blowing up the upper error bound. Hence the argument for having $r$ match the propensity’s precision, to avoid these cases.

As mentioned earlier, the same argument goes through for large $B$ and small $A$, after swapping $t \mapsto (1-t)$.
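This calibration argument can be sanity-checked numerically. The sketch below is our own construction (function and variable names are ours): it evaluates $J$ directly by quadrature, so one can watch the bound degrade when a sharp propensity is paired with a low-precision weighing scheme, and recover when $r$ matches the propensity’s precision.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import beta as beta_fn

def error_bound_J(alpha, beta_param, r, t):
    """J = [c_t B(alpha, beta)]^{-1} * integral of tau^A (1-tau)^B |tau-t|^3."""
    A = alpha - 1 + r * t
    B = beta_param - 1 + r * (1 - t)
    c_t = t ** (r * t) * (1 - t) ** (r * (1 - t))
    integrand = lambda tau: tau ** A * (1 - tau) ** B * abs(tau - t) ** 3
    value, _ = quad(integrand, 0.0, 1.0, points=[t])  # kink at tau = t
    return value / (c_t * beta_fn(alpha, beta_param))

# Sharp propensity (high alpha, low beta) at t = 0.9; per the argument above,
# the mismatched low-r settings should yield a far looser bound.
for r in (1, 10, 50):
    print(r, error_bound_J(alpha=50, beta_param=1.5, r=r, t=0.9))
```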

Appendix C Correctness of Algorithm 1

The algorithm works by incrementally reallocating relative mass (through the weights) toward the right-hand side, using a cursor that begins at the left-hand end of the “tape.”

Proof.

First we characterize the indicator quantity $\Delta_{j}$. Differentiating the quantity to be maximized with respect to $w_{j}$:

$$
\begin{aligned}
\frac{\partial}{\partial w_{j}}\frac{\sum_{i} w_{i} f_{i}}{\sum_{i} w_{i}}
&= \frac{f_{j}}{\sum_{i} w_{i}} - \frac{\sum_{i} w_{i} f_{i}}{\left(\sum_{i} w_{i}\right)^{2}}\\
&= \frac{f_{j}\sum_{i} w_{i} - \sum_{i} w_{i} f_{i}}{\left(\sum_{i} w_{i}\right)^{2}}\\
&\propto \underbrace{\sum_{i} w_{i}(f_{j}-f_{i})}_{\coloneqq \Delta_{j}}
\quad\textrm{up to some positive factor.}
\end{aligned}
$$

Hence, $\Delta_{j}$ captures the sign of the derivative.

We shall proceed by induction. Begin with the first iteration, $j = 1$. No weights have been altered since initialization yet. Therefore we have

$$
\Delta_{1} = \sum_{i}\overline{w}_{i}\,(f_{1}-f_{i}).
$$

Since $f_{1} \leq f_{i}$ for all $i$ due to the prior sorting, $\Delta_{1}$ is either negative or zero. If zero, trivially terminate the procedure, as all function values are identical.

Now assume that by the time the algorithm reaches some $j > 1$, all $w_{k} = \underline{w}_{k}$ for $1 \leq k < j$. In other words,

$$
\Delta_{j} = \sum_{i<j}\underline{w}_{i}\underbrace{(f_{j}-f_{i})}_{(+)} + \sum_{i>j}\overline{w}_{i}\underbrace{(f_{j}-f_{i})}_{(-)}.
$$

Per the algorithm, we would flip the weight $w_{j} \leftarrow \underline{w}_{j}$ only if $\Delta_{j} < 0$. In that case,

$$
\sum_{i<j}\underline{w}_{i}\,(f_{j}-f_{i}) < \sum_{i>j}\overline{w}_{i}\,(f_{i}-f_{j}),
\quad\textrm{where both sides are non-negative.}
$$

Notice that the above is not affected by the current value of $w_{j}$. This update can only increase the current estimate, because the derivative remains negative while the weight at $j$ is non-increasing. We must verify that the derivatives for the previous weights, indexed at $k < j$, remain negative; otherwise, the procedure would need to backtrack and possibly flip some weights back up.

More generally, with every weight-assignment decision, we seek to ensure that the condition detailed above is not violated for any weights that have been finalized: the weights before $j$, and those after $j$ at the point of termination. Returning from this digression, at $k < j$ after updating $w_{j}$,

$$
\Delta_{k} = \sum_{i\leq j}\underline{w}_{i}\,(f_{k}-f_{i}) + \sum_{i>j}\overline{w}_{i}\,(f_{k}-f_{i}).
$$

To glean its sign, we refer to a quantity whose sign we already know:

$$
\begin{aligned}
\sum_{i<j}\underline{w}_{i}\,(f_{j}-f_{i}) &< \sum_{i>j}\overline{w}_{i}\,(f_{i}-f_{j})\\
\iff \sum_{i\leq j}\underline{w}_{i}\,(f_{k}-f_{i}) &< \sum_{i>j}\overline{w}_{i}\,(f_{i}-f_{j}) + \sum_{i\leq j}\underline{w}_{i}\,(f_{k}-f_{j})\\
\iff \underbrace{\sum_{i\leq j}\underline{w}_{i}\,(f_{k}-f_{i}) + \sum_{i>j}\overline{w}_{i}\,(f_{k}-f_{i})}_{\Delta_{k}} &< \underbrace{\sum_{i>j}\overline{w}_{i}\,(f_{k}-f_{j}) + \sum_{i\leq j}\underline{w}_{i}\,(f_{k}-f_{j})}_{\textrm{negative.}}
\end{aligned}
$$

The remaining fact to be demonstrated is that upon termination, when $\Delta_{j} \geq 0$, no later pseudo-derivatives $\Delta_{j'}$, $j' > j$, are negative. This must be the case simply because $f_{j'} \geq f_{j}$. ∎
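For readers who prefer code, here is a compact Python rendering of the procedure whose correctness is proven above (the maximization case; function and variable names are ours, and Algorithm 1 itself is stated in the main text). Minimization follows by negating $f$.

```python
import numpy as np

def maximize_weighted_average(f, w_lower, w_upper):
    """Maximize sum(w * f) / sum(w) subject to w_lower <= w <= w_upper,
    via the left-to-right weight-flipping scan proven correct above."""
    order = np.argsort(f)                # prior sorting: f ascending
    f, lo, hi = f[order], w_lower[order], w_upper[order]
    w = hi.copy()                        # initialize every weight at its upper bound
    for j in range(len(f)):
        delta = np.sum(w * (f[j] - f))   # sign of the pseudo-derivative Delta_j
        if delta >= 0:                   # all later deltas are also >= 0: terminate
            break
        w[j] = lo[j]                     # flip: decreasing w_j raises the ratio
    return np.sum(w * f) / np.sum(w)
```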

Appendix D On the Introductory Illustration

Figure 8: Elaboration on the example in Figure 1. Treatments were exponentially distributed, and the thresholds displayed in the grid controlled the center of the second sigmoid in $S^{2}$, due to Taleb [2018]. Two different visible attributes demonstrate how the hidden bias depends on the interplay between propensity and outcome, via the hidden attribute. The blue curve is a little shorter, which allows the vulnerable subgroup’s threshold change to be revealed in the data. Estimation minimized the empirical squared error.
[Figure: “Confounded Observational Study,” plotting continuous treatments against the outcome variable; legend: population average, sub-population curve, individual.]
Figure 9: A different example that shows the connection to Simpson’s paradox more clearly [Yule, 1903, Simpson, 1951]. When a confounder is distorting the assigned treatments in sub-populations, the overall population-level trend may appear flipped in comparison to each sub-population’s dose response.

Appendix E Details on the Benchmark

During each trial, 750 train and 250 test instances of (observed/hidden) confounders, treatment, and outcome were generated. The APO was computed on the test instances. Coverage of the dose-response curve was assessed on a treatment grid of $100$ evenly spaced points in $[0,1]$. The violation factors $\Gamma$ that we tested likewise came from a $100$-point grid, in $[0, 2.5]$.

The data-generating process constructed vectors
$$
V \coloneqq \langle \text{visible conf\ldots,\ treatment,\ hidden conf\ldots} \rangle \in \mathbb{R}^{k},
$$

where $k$ is the number of confounders plus one, for the treatment. Each of these variables is a projection of the original data with i.i.d. normal coefficients. We upscale the middle (i.e., treatment) entry by $(k-1)$ to keep the treatment effect strong enough. Then we experiment with two functional forms of confounded dose-response curves:

  • (linear) A mixing vector $\{M_{i}\}_{i=1}^{k} \sim \text{i.i.d. Normal}(0,1)$. The pre-activation outcome is $u \coloneqq M \cdot v$.

  • (quadratic) A matrix $\{M_{ij}\} \sim \text{i.i.d. Normal}(0,1)$. The pre-activation outcome is $u \coloneqq v^{\mathrm{T}} M v$. Unlike a covariance matrix, $M$ is not positive (semi-)definite. Because all entries are i.i.d. Gaussian, the off-diagonal entries can be much larger in magnitude than the on-diagonal entries, in a way that cannot occur in a covariance matrix. This induces more confounding and strengthens our benchmark.

The actual outcome is Bernoulli with probability $u^{\star} \coloneqq \phi\big((u-m)/s\big)$, where $\phi$ is the standard normal CDF, the location parameter $m$ is the sample median, and the scale $s$ is the sample mean absolute deviation from the median. If $u$ were normal, $s$ would be expected to be a bit smaller than $\sigma$, by a factor of $\sqrt{2/\pi}$. Because we use $s$, $u^{\star}$ is generally no longer marginally uniform; instead it gravitates towards zero or one. Since the estimated outcome models use logistic sigmoid activations, this setup already carries an intentional measure of model mismatch.
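Putting the pieces together, the following sketch illustrates one trial of this generative process. It is a paraphrase under stated assumptions (e.g., the treatment occupying the middle coordinate of $V$, and the exact projection scheme), not the benchmark’s released code.

```python
import numpy as np
from scipy.stats import norm

def generate_trial(data, n_confounders=2, quadratic=False, seed=0):
    """One draw of the semi-synthetic DGP: project real covariates into
    (visible confounders, treatment, hidden confounders), then squash a
    linear or quadratic pre-activation into a Bernoulli outcome."""
    rng = np.random.default_rng(seed)
    k = n_confounders + 1                        # confounders plus the treatment
    P = rng.standard_normal((data.shape[1], k))  # i.i.d. normal projection
    V = data @ P
    V[:, k // 2] *= (k - 1)                      # upscale the (assumed middle) treatment entry
    if quadratic:
        M = rng.standard_normal((k, k))          # not positive (semi-)definite
        u = np.einsum("ni,ij,nj->n", V, M, V)
    else:
        M = rng.standard_normal(k)
        u = V @ M
    m = np.median(u)
    s = np.mean(np.abs(u - m))                   # mean absolute deviation from the median
    u_star = norm.cdf((u - m) / s)
    y = rng.binomial(1, u_star)                  # Bernoulli outcome
    return V, y
```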

See Table 4 for results under all the settings considered.

The linear outcome and propensity predictors were estimated by maximum likelihood using the ADAM gradient-descent optimizer, with learning rate $10^{1}$, $4$ batches, and $50$ epochs throughout. For the outcome, we used a sigmoid activation stretched horizontally by $10^{2}$ for smooth training. For the propensity, similarly, we stretched a sigmoid horizontally and vertically, gating the output in order to yield Beta parameters within $(0, 10^{2})$.
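As an illustration of that gating, a stretched sigmoid can map unconstrained propensity-head outputs into valid Beta parameters. The constants below mirror the $(0, 10^{2})$ range mentioned above; this is a minimal sketch, not the exact stretching used in our code.

```python
import torch

def gated_beta_params(z, cap=100.0, stretch=100.0):
    # Horizontal stretch (z / stretch) smooths early training; vertical
    # stretch (cap * sigmoid) confines each Beta parameter to (0, cap).
    return cap * torch.sigmoid(z / stretch)
```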

Data sources.

The datasets brain and blood both came from the UK Biobank, described in the case study of §5. The two datasets are taken from disjoint subsets of all the available fields, one pertaining to parcellated brain volumes (via MRI) and the other to blood tests. The pbmc dataset came from single-cell RNA sequencing, a modality of rapidly growing popularity in bioinformatics; PBMC data are a commonly used benchmark in the field [Kang et al., 2018]. Finally, the mftc dataset consisted of BERT embeddings of morally loaded tweets [Hoover et al., 2020, Mokhberian et al., 2020].

Dataset Sample Size Dimension
brain 43,069 148
blood 31,811 42
pbmc 14,039 16
mftc 17,930 768
Table 3: Characteristics of the various datasets employed in our experiments.

Model mismatch varied with how approximately linear the true dose responses were. As expected, there was a significant negative correlation between model likelihood and divergence cost, so poorer fits had higher costs for coverage.

Benchmarks \ Scores            brain          blood          pbmc           mftc
                               mean  median   mean  median   mean  median   mean  median
linear, 2 confounders
    δMSM                         94      71     86      63    105      75     69      59
    CMSM                        291     253    261     228    288     259    243     204
    uniform                     116      82    104      71    128      83     78      66
    binary MSM                  116      90    104      73    127      94     91      73
linear, 6 confounders
    δMSM                         63      39     63      33     77      44     47      31
    CMSM                        177     111    186     117    198     136    167     105
    uniform                      68      41     68      36     83      47     51      33
    binary MSM                  177     176    173     163    188     195    168     160
linear, 10 confounders
    δMSM                         57      31     61      35     72      31     43      27
    CMSM                        151      81    146      84    158      84    126      74
    uniform                      58      32     63      37     73      33     45      28
    binary MSM                  177     181    182     190    172     170    184     191
quadratic, 2 confounders
    δMSM                        170     151    160     139    180     160    159     144
    CMSM                        301     275    283     263    299     274    270     248
    uniform                     198     180    190     166    212     188    190     167
    binary MSM                  205     186    192     169    217     198    190     173
quadratic, 6 confounders
    δMSM                        138     103    145     120    155     134    140     112
    CMSM                        216     171    220     193    239     223    222     198
    uniform                     171     118    181     149    189     158    177     132
    binary MSM                  217     231    227     257    230     266    224     249
quadratic, 10 confounders
    δMSM                        138     101    141     100    138     104    144     117
    CMSM                        186     173    188     165    205     178    182     165
    uniform                     158     116    162     108    157     117    167     140
    binary MSM                  211     241    213     240    222     258    214     242
Table 4: The full array of experiments. The δMSM rows, set in bold in the original, achieved the best scores in every setting. Underlined settings are those shown in Table 2.

Appendix F Details on the Biobank Study

The application number used to access data from the UK Biobank will be mentioned in the de-anonymized manuscript. The measured outcomes were cortical thicknesses and subcortical volumes, the latter normalized by intracranial volume, obtained via structural Magnetic Resonance Imaging (MRI). The results in the main text (§5) focused on the cortical thicknesses for brevity. Input variables comprising the covariates and DQS treatments are listed in Table 5. Inputs were normalized to the unit interval, and outputs were $z$-scored.

Training the models.

The outcome predictors, with 40 inputs and 48 outputs, were implemented as multilayer perceptrons with three hidden layers of width 32 and single-skip connections. They used Swish activation functions and a unit dropout rate of $0.1$. The ADAM optimizer with learning rate $5\times10^{-3}$ was run for $10^{4}$ epochs. The data were split into four non-overlapping test sets, with a separate ensemble of 16 predictors trained for each split. Training sets were bootstrap-resampled for each estimator in the ensemble. The propensity was formulated as a linear model outputting Beta parameters within $(0, 64)$, trained in a similar fashion. Finally, CAPOs were partially identified using the set of models from the train-test split for which the data instance belonged to the test set.
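A PyTorch sketch of one such outcome predictor appears below, matching the stated widths, activation, and dropout; the exact placement of the single-skip connections is our assumption.

```python
import torch
import torch.nn as nn

class OutcomeMLP(nn.Module):
    """Outcome predictor: 40 inputs, 48 outputs, three hidden layers of
    width 32, Swish (SiLU) activations, dropout 0.1, single-skip connections."""
    def __init__(self, d_in=40, d_out=48, width=32, p_drop=0.1):
        super().__init__()
        self.input = nn.Linear(d_in, width)
        self.hidden = nn.ModuleList(nn.Linear(width, width) for _ in range(3))
        self.act = nn.SiLU()
        self.drop = nn.Dropout(p_drop)
        self.output = nn.Linear(width, d_out)

    def forward(self, x):
        h = self.act(self.input(x))
        for layer in self.hidden:
            h = h + self.drop(self.act(layer(h)))  # skip over each hidden layer
        return self.output(h)
```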

Additional figures.

This exploratory study includes plots of relative effects on the various brain regions, shown in Figures 10 & 11. We plan to study the differential effects of diet on the brain further.

Figure 10: Normalized effect differences between males and females for the overall average diet score and stratified by individual diet components. The left-hand columns depict individual effects across all cortical-thickness parcellations, and the right-hand side shows subcortical regional volumes. Females show generally larger effects across most diet components.
Figure 11: Normalized effect differences comparing the δMSM against a shoehorned binary MSM (“δ” vs. “B”), stratified by sex. Note the differences in relative feature importance: continuous modeling ranks vegetables and whole grains as the most important, whereas the binary model emphasizes dairy, vegetable oils, refined grains (primarily for males), and fish.
Variable Features Classifications Data Field ID
Demographics Age at scan - 21003
Sex Male/Female 31
Townsend Deprivation Index - 189
ApoE4 copies 0, 1, 2 -
Education College/University Yes/No 6138
Physical Activity/ Body Composition American Heart Association (AHA) guidelines for weekly physical activity Ideal (\geq150 min/week moderate, or \geq75 min/week vigorous, or 150 min/week mixed); Intermediate (1–149 min/week moderate, or 1–74 min/week vigorous, or 1–149 min/week mixed); Poor (not performing any moderate or vigorous activity) 884, 904, 894, 914
Waist to Hip Ratio (WHR) - 48,49
Normal WHR Females: \leq 0.85; Males \leq 0.90 48,49
Body Mass Index (BMI) - 23104
Body fat percentage - 23099
Sleep Sleep 7-9 Hours a Night - 1160
Job Involves Night Shift Work Never/Rarely 3426
Daytime Dozing/Sleeping Never/Rarely 1220
Diet DQS 1 - Fruit - 1309, 1319
DQS 2 - Vegetables - 1289, 1299
DQS 3 - Whole Grains - 1438, 1448, 1458, 1468
DQS 4 - Fish - 1329, 1339
DQS 5 - Dairy - 1408, 1418
DQS 6 - Vegetable Oil - 1428, 2654, 1438
DQS 7 - Refined Grains - 1438, 1448, 1458, 1468
DQS 8 - Processed Meats - 1349, 3680
DQS 9 - Unprocessed Meats - 1369, 1379, 1389, 3680
DQS 10 - Sugary Foods/Drinks - 6144
Water intake Glasses/day 1528
Tea intake Cups/day 1488
Coffee intake Cups/day 1498
Fish Oil Supplementation Yes/No 20084
Vitamin/Mineral Supplementation Multivitamin (with iron/ calcium/ multimineral)/ Vitamins A, B6, B12, C, D, or E/ Folic acid/ Chromium/ Magnesium/ Selenium/ Calcium/ Iron/ Zinc/ Other vitamin 20084
Variation in diet Never/Rarely; Sometimes; Often 1548
Salt added to food Never/Rarely; Sometimes; Usually; Always 1478
Smoking Smoking status Never; Previous; Current 20116
Alcohol Alcohol Frequency Infrequent (1–3 times a month, special occasions only, or never); Occasional (1–2 a week or 3–4 times a week), Frequent (daily/almost daily and ICD conditions F10, G312, G621, I426, K292, K70, K860, T510) 1558/ICD
Social Support Leisure/social activities Sports club/gym; pub/social; social/religious; social/adult education; other social group 6160
Frequency of Friends/Family Visits Twice/week or more 1031
Able to Confide in Someone Almost Daily 2110
Table 5: Variables, features, classifications, and respective data fields used in the models. Diet quality scores (DQS), ranging from 0–10 for each of 10 components, were computed using the same coding scheme as in Said et al. [2018], Zhuang et al. [2021]. Leisure/social activity classifications served as their own binary variables. Our results omitted DQS #8 & #10 because they were not even approximately continuous, taking on only a few discrete values.

Appendix G Source-code Availability