
Learning Invariant Representations under General Interventions on the Response

Kang Du and Yu Xiang
K. Du and Y. Xiang are with the Department of Electrical and Computer Engineering, University of Utah, Salt Lake City, UT, 84112, USA. This paper has supplementary downloadable material, provided by the authors, available at http://ieeexplore.ieee.org. The material includes all the proofs. Contact {kang.du, yu.xiang}@utah.edu for further questions about this work.
Abstract

It has become increasingly common nowadays to collect observations of feature and response pairs from different environments. As a consequence, one has to apply learned predictors to data with a different distribution due to distribution shifts. One principled approach is to adopt structural causal models to describe training and test models, following the invariance principle, which says that the conditional distribution of the response given its predictors remains the same across environments. However, this principle might be violated in practical settings when the response is intervened. A natural question is whether it is still possible to identify other forms of invariance to facilitate prediction in unseen environments. To shed light on this challenging scenario, we focus on linear structural causal models (SCMs) and introduce the invariant matching property (IMP), an explicit relation that captures interventions through an additional feature, leading to an alternative form of invariance that enables a unified treatment of general interventions on the response as well as the predictors. We analyze the asymptotic generalization errors of our method under both the discrete and continuous environment settings, where the continuous case is handled by relating it to the semiparametric varying coefficient models. We present algorithms that show competitive performance compared to existing methods over various experimental settings including a COVID dataset.

Index Terms:
Multi-environment domain adaptation, invariance, structural causal models, semiparametric varying coefficient model.

1 Introduction

How to make reliable predictions in unseen environments that are different from training environments is a challenging problem, which is fundamentally different from the classical machine learning settings [1, 2, 3]. Modeling these distribution shifts in a principled way is of great importance in many fields including robotics, medical imaging, and environmental science. Apparently, this problem is ill-posed without any constraints on the relationship between training and test distributions, as the test distribution may be arbitrary. Consider the problem of predicting the response $Y$ given its predictors $X=(X_{1},\ldots,X_{d})^{\top}$ in unseen environments. To model distribution changes across different environments (or training and test distributions), we follow the approach of using structural causal models (SCMs) [4, 5] to model different data-generating mechanisms. The common assumption is that the assignment for $Y$ does not change across environments (or $Y$ is not intervened), which allows for natural formulations of the invariant conditional distribution of $Y$ given a subset of $X$ [6, 7, 8, 5, 9, 10, 11, 12, 13]. The underlying principle is known as invariance, autonomy, or modularity [14, 15, 4, 16].

Following this principle, the invariance-based causal prediction initiated in [8] (also see [17] and [11] and references therein) assumes that the conditional distribution of $Y$ given a set of predictors $X_{S}\subseteq\{X_{1},\ldots,X_{d}\}$ is invariant in all environments, i.e., $\mathcal{P}_{e}(Y|X_{S})=\mathcal{P}_{h}(Y|X_{S})$ for environments $e$ and $h$, where $(X,Y)$ is generated according to the joint distribution $\mathcal{P}_{e}:=\mathcal{P}_{e}^{X,Y}$. Focusing on linear SCMs, it assumes the existence of a linear model that is invariant across environments, with an unknown noise distribution and arbitrary dependence among predictors (see extensions to nonlinear [18] and time series [19] settings). Following this framework, theoretical guarantees for domain adaptation have been developed in [9, 20]. More recently, a multi-environment regression method for domain adaptation called stabilized regression [21] explicitly enforces stability (based on a weaker version of invariance, $\mathsf{E}_{\mathcal{P}_{e}}[Y|X_{S}=x_{s}]=\mathsf{E}_{\mathcal{P}_{h}}[Y|X_{S}=x_{s}]$) by introducing the stable blanket, a refined version of the Markov blanket that promotes generalization. The tradeoff between predictive performance on training and test data has been studied via regularization under shift interventions [22]. Motivated by [8], invariant risk minimization (IRM) [23] only uses data from the training environments (i.e., the out-of-distribution generalization setting) and imposes $\mathcal{P}_{e}(Y|\phi(X))=\mathcal{P}_{h}(Y|\phi(X))$, where $\phi$ is invariant across environments, leading to a bi-level optimization problem that is impractical to solve. Several relaxed versions of IRM have been proposed in [23], but they behave very differently from the original IRM (see, e.g., [24, 25]). For a framework of the out-of-distribution setting from a causal perspective with a focus on minimizing the worst-case risk, see [26] and references therein. In this line of invariance-based work, the fundamental assumption is that interventions on the target variable $Y$ are not allowed.

In practical settings, however, the structural assignment of $Y$ might change across environments, namely, $Y$ might be intervened. How to relax this assumption in a principled way is one of the main motivations of our work. We propose to explore alternative forms of invariance and make an attempt in this direction by focusing on linear SCMs. Concretely, the assignment for $Y$ allows general interventions

$$Y^{e}=(\beta^{e})^{\top}X^{e}+\varepsilon_{Y}^{e},$$

where $Y$ can be intervened through the coefficient $\beta^{e}$ and/or the noise $\varepsilon_{Y}^{e}$, to capture the dependence of the structural assignment on the environment (preliminary results have been reported by the same authors in [27]). We consider a multi-environment regression setting for domain adaptation: There are multiple training data sets $(X^{e},Y^{e})$ for $e\in\mathcal{E}^{\text{train}}$ that are generated from a training model and one test data set (indexed by $e^{\text{test}}$) from a test model; we assume the training model and test model follow SCMs with the same but unknown graph structure, but we allow $\beta^{e}$ and the mean and variance of $\varepsilon_{Y}^{e}$ to be arbitrarily different under the two models. To avoid an ill-posed setting, a key necessary condition is that $Y^{e}$ needs to have at least one child in the SCMs, as prediction is not possible otherwise given that $Y^{e}$ may change arbitrarily over environments. The main challenge lies in whether it is still possible to identify other forms of invariance to facilitate prediction in the test environment. We propose an alternative form of invariance, $\mathcal{P}_{e}(Y|\phi_{e}(X))=\mathcal{P}_{h}(Y|\phi_{h}(X))$, that is enabled by a family of conditional invariant transforms $\Phi\ni\phi_{e},\phi_{h}$. Under general interventions on $Y$, we provide explicit constructions of such transforms by developing the invariant matching property (IMP), a deterministic relation between an estimator of $Y$ and $X$ along with an additional predictor constructed from $X$. To enable a systematic way of constructing the IMP, we provide a natural decomposition of it and demonstrate this when only $Y$ is intervened or when both $X$ and $Y$ are intervened. The IMP comes with several appealing features: (1) it does not vary over environments, making it applicable in unseen environments, and (2) the identification of the IMP follows directly from the fact that the training data contains multiple environments. We study the asymptotic generalization error for both the discrete environment setting and its continuous counterpart, the latter being the more challenging setting. Interestingly, we reveal a connection between the continuous environment setting and the semiparametric varying coefficient models, which makes the asymptotic generalization analysis possible. We believe that our results open up new possibilities for multi-environment regression methods for domain adaptation under structural causal models.

1-A Further Related Work

In [28], the authors have provided a systematic treatment of domain adaptation using SCMs to enable analysis and comparison of domain adaptation methods, which leads to the conditional invariant residual matching (CIRM) method. CIRM and its variants combine the domain invariant projection (DIP)-type methods (see [29, 30] and the generalized label shift to handle target label perturbation [31, 32]) with the idea of a conditional invariance penalty (which appeared in [33, 34] under slightly different settings) that assumes the existence of conditional invariant components (CICs) in the anticausal setting where $Y$ causes $X$. Theoretical guarantees have been provided for the prediction performance under shift interventions on $Y$, while numerical studies are provided for interventions on the noise variance of $Y$ [28]. It has also been pointed out that the general mixed-causal-anticausal domain adaptation problem remains open. We aim to shed light on this challenging setting by constructing explicit conditional invariant transforms.

The role of causality in facilitating domain adaptation problems was first articulated in [6], focusing on causal and anticausal predictions. Reweighting methods have been extensively studied for covariate shift [1, 35, 36], which assumes that only the feature distribution changes over environments while the conditionals remain the same. Label shift, which aligns with the anticausal setting, has attracted much attention recently [37, 38, 39]. Many other interesting domain adaptation methods have been developed, but they are less related to this work. Performance bounds based on Vapnik-Chervonenkis (VC) theory were initiated in [40]. There are fundamental works from the robust statistics perspective including distributionally robust learning [41, 42, 43, 44, 45, 46] and adversarial machine learning [47, 48].

1-B Contribution and Structure

There are four main contributions in this work. (I) We formulate a general invariance property and the corresponding conditional invariant transforms for analyzing general interventions on linear SCMs (Section 2). (II) We tackle this problem by introducing the invariant matching property (IMP) and providing a systematic approach for establishing explicit characterizations of it (Section 3 and Section 4). (III) To handle the continuous environment setting, we bridge our framework with the profile likelihood estimators developed in the semiparametric literature, leading to asymptotic performance guarantees under this challenging setting (Section 6). (IV) Motivated by our theoretical results, we develop efficient algorithms that show competitive empirical performance over a variety of simulation settings including a COVID dataset (Section 5 and Section 7). All the technical details are deferred to the Appendix.

2 Background and Problem Formulation

Consider a response $Y\in\mathbb{R}$ and a vector of predictors $X=(X_{1},\ldots,X_{d})^{\top}\in\mathcal{X}\subseteq\mathbb{R}^{d}$ following an acyclic linear SCM,

$$\mathcal{M}:\quad X=\gamma Y+BX+\varepsilon_{X}$$
$$\mathcal{M}:\quad Y=\beta^{\top}X+\varepsilon_{Y},$$

where $\beta,\gamma\in\mathbb{R}^{d}$, $B\in\mathbb{R}^{d\times d}$, $\varepsilon_{X}=(\varepsilon_{X_{1}},\ldots,\varepsilon_{X_{d}})^{\top}$, and the noise variables $\{\varepsilon_{X_{1}},\ldots,\varepsilon_{X_{d}}\}$ and $\varepsilon_{Y}$ are jointly independent. We use $\mathcal{G}(\mathcal{M})$ to denote the directed acyclic graph induced by $\mathcal{M}$, with edges determined by the non-zero coefficients in $\mathcal{M}$. We denote the parents, children, descendants, and Markov blanket of a variable $Z\in\{X_{1},\ldots,X_{d},Y\}$ as $PA(Z)$, $CH(Z)$, $DE(Z)$, and $MB(Z)$, respectively. When $(X,Y)$ is observed in different environments (e.g., different experimental settings for data collection), the parameters $\{\beta,\gamma,B\}$ and the distributions of $\{\varepsilon_{X},\varepsilon_{Y}\}$ may change. In the following, we use interventions on the SCM $\mathcal{M}$ to model such changes.

Let $\mathcal{E}^{\text{all}}$ denote the set of all possible environments (we use training (or test) environments and observable (or unseen) environments interchangeably), which are partitioned into multiple training environments $\mathcal{E}^{\text{train}}$ and one test environment $\{e^{\text{test}}\}$ such that $\mathcal{E}^{\text{all}}=\mathcal{E}^{\text{train}}\cup\{e^{\text{test}}\}$. In each environment $e\in\mathcal{E}^{\text{all}}$, $(X,Y)$ is generated according to the joint distribution $\mathcal{P}_{e}:=\mathcal{P}_{e}^{X,Y}$, and to simplify notation, we write $\mathcal{P}_{\text{test}}:=\mathcal{P}_{e^{\text{test}}}^{X,Y}$. A variable from $\{X_{1},\ldots,X_{d},Y\}$ is intervened if the parameters or noise distribution in its assignment changes over different $e\in\mathcal{E}^{\text{all}}$. For instance, changes of $\beta$ and/or the distribution of $\varepsilon_{Y}$ correspond to an intervention on $Y$ (see the different intervention cases below). Importantly, we allow both $Y$ and any subset of $\{X_{1},\ldots,X_{d}\}$ to be intervened in $\mathcal{E}^{\text{all}}$.

Now, we introduce the linear SCM $\mathcal{M}$ with parameters that change with environments. For each $e\in\mathcal{E}^{\text{all}}$, the linear SCM $\mathcal{M}$ is modified to be

$$\mathcal{M}^{e}:\quad X^{e}=\gamma^{e}Y^{e}+B^{e}X^{e}+\varepsilon_{X}^{e} \qquad (1)$$
$$\mathcal{M}^{e}:\quad Y^{e}=(\beta^{e})^{\top}X^{e}+\varepsilon_{Y}^{e}. \qquad (2)$$

This formulation is fairly general. From the structural perspective, it covers causal, anticausal, and mixed-causal-anticausal settings [6]. It should be noted that we only adopt the linear SCM rather than the fully specified SCMs as in [49], since learning the functional forms can be more complicated than the prediction problem we aim to solve. Regarding the intervention types, we discuss several special cases to put them into perspective.

  1. Shift interventions on $X$ or $Y$: A variable $X_{j}$ is intervened through a shift if the mean of the noise variable $\varepsilon_{X,j}^{e}$ changes with $e\in\mathcal{E}^{\text{all}}$. For the shift intervention on $Y$, the mean of $\varepsilon_{Y}^{e}$ changes.

  2. Interventions on the coefficients of $X$ or $Y$: A variable $X_{j}$ is intervened through coefficients if the coefficients $\{\gamma_{j}^{e},B_{j\cdot}^{e}\}$ change with $e\in\mathcal{E}^{\text{all}}$. For $Y$, the change is on the coefficient vector $\beta^{e}$.

  3. Interventions on the noise variance of $X$ or $Y$: Similar to shift interventions, a variable $X_{j}$ or $Y$ is intervened if its noise variance changes.

We observe $n^{e}$ i.i.d. samples $\{(x_{1},y_{1}),\ldots,(x_{n^{e}},y_{n^{e}})\}$ from each training environment distribution $\mathcal{P}_{e}$ for $e\in\mathcal{E}^{\text{train}}$, but in the test environment $e^{\text{test}}$ we only observe $m$ i.i.d. samples $\{x_{1},\ldots,x_{m}\}$ from $\mathcal{P}_{\text{test}}^{X}$. The goal is to learn a function $f:\mathcal{X}\to\hat{\mathcal{Y}}$ that works well on $e^{\text{test}}$ in the sense that it minimizes the test population loss

$$\mathcal{L}_{\text{test}}(f):=\mathsf{E}_{(X,Y)\sim\mathcal{P}_{\text{test}}}[l(Y,f(X))], \qquad (3)$$

where $l$ is the square loss function $l(y,\hat{y})=(y-\hat{y})^{2}$. The optimal function is $f(x)=\mathsf{E}_{\mathcal{P}_{\text{test}}}[Y|X=x]$, which cannot be learned from the observed data in general when $X$ and/or $Y$ is intervened. Without any constraints on the relationship between $\mathcal{P}_{\text{test}}$ and $\mathcal{P}_{e}$ for $e\in\mathcal{E}^{\text{train}}$, the test population loss can be arbitrarily large. To make this problem tractable, we assume that $(X,Y)$ under $\mathcal{P}_{\text{test}}$ and $\mathcal{P}_{e}$ are generated according to the SCMs described in (2), but we do not assume that the causal graph is known and we allow for general types of interventions.

It is well-known that if $Y$ is not intervened, a general form of the invariance principle applies, assuming the existence of some subset $S\subseteq\{1,\ldots,d\}$ such that

$$\mathcal{P}_{e}(Y|X_{S})=\mathcal{P}_{h}(Y|X_{S}) \qquad (4)$$

holds for any $e,h\in\mathcal{E}^{\text{all}}$. Under this assumption, the causal function $\mathsf{E}_{\mathcal{P}_{e}}[Y|X_{PA(Y)}]$ is invariant across different environments and minimax under the class of all possible interventions on $X$ [9]. If not every predictor is intervened arbitrarily, previous works motivated by (4) (as mentioned in Section 1) aim to improve upon the causal function. The main challenge in our setting comes from the general interventions on $Y$, making the traditional invariance principle not applicable. Importantly, the causal function can change arbitrarily with environments. In this work, we propose to exploit an alternative form of invariance to tackle this problem.

Definition 1.

A function $\phi:(\mathcal{E}^{\text{all}},\mathbb{R}^{d})\to\mathbb{R}^{q}$ is called a conditional invariant transform if the following invariance property holds for any $e,h\in\mathcal{E}^{\text{all}}$:

$$\mathcal{P}_{e}(Y|\phi_{e}(X))=\mathcal{P}_{h}(Y|\phi_{h}(X)). \qquad (5)$$

Under general intervention settings, we denote this class of conditional invariant transforms as $\Phi$, and we provide explicit characterizations of it via the invariant matching property (IMP) (see Definition 2). For each $\phi\in\Phi$, the invariance property (5) enables us to compute

$$f_{\phi_{e}}(x)=g\circ\phi_{e}(x)=\mathsf{E}_{\mathcal{P}_{e}}[Y|\phi_{e}(x)], \qquad (6)$$

for any $e\in\mathcal{E}^{\text{all}}$, where the function $g:\mathbb{R}^{q}\to\hat{\mathcal{Y}}$ is invariant across environments and is nonlinear in general. Equivalently, this solves a relaxed version of (3) by minimizing $\mathcal{L}_{\text{test}}(f_{\phi})$ over $\{\phi\in\Phi\}$. This formulation allows us to treat the general mixed-causal-anticausal problem under general interventions on $Y$ in a unified manner.

3 Invariant Matching Property and Theoretical Guarantees

Among the intervention settings listed below (2), interventions on only $Y$ are important but rarely studied; they concern changes of $\beta^{e}$ and the distribution of $\varepsilon^{e}_{Y}$. Our method can be motivated in this setting by the following observation: If $X$ includes any descendants of $Y$, then $\beta$ and $\varepsilon_{Y}$ (which may change arbitrarily in the unseen environments) will be passed on to the descendants. Thus, the changes might be revealed by changes of certain statistical properties of $X$, leading to our proposed invariant matching property detailed in this section. We start with a few toy examples.

3-A Motivating Examples

Example:

Consider $(Y^{e},X^{e})$, $e\in\mathcal{E}^{\text{toy}}=\{1,2\}$, with $X^{e}:=(X_{1}^{e},X_{2}^{e},X_{3}^{e})^{\top}$ satisfying the following linear acyclic SCM (illustrated in Fig. 1),

Figure 1: Directed acyclic graph $\mathcal{G}(\mathcal{M}_{\text{toy}}^{e})$.

$$\mathcal{M}_{\text{toy}}^{e}:\begin{cases}Y^{e}=a^{e}X_{1}^{e}+X_{2}^{e}+N_{Y}^{e}\\ X_{3}^{e}=Y^{e}+X_{1}^{e}+N_{3}^{e},\end{cases} \qquad (7)$$

where $X_{1}^{e},X_{2}^{e},N_{3}^{e}$, and $N_{Y}^{e}$ are independent and $\mathcal{N}(0,1)$-distributed for every $e\in\mathcal{E}^{\text{toy}}$. Since $(Y^{e},X^{e})$ is multivariate Gaussian, the MMSE estimator of $Y^{e}$ given $X^{e}$ is

$$\mathsf{E}_{\mathcal{P}_{e}}[Y|X]=X^{\top}\left(\mathsf{E}_{\mathcal{P}_{e}}[XX^{\top}]\right)^{-1}\mathsf{E}_{\mathcal{P}_{e}}[XY]=\frac{1}{2}(a^{e}-1)X_{1}^{e}+\frac{1}{2}X_{2}^{e}+\frac{1}{2}X_{3}^{e},$$
Figure 2: Illustrations of invariant relations in the example above. In (a), we illustrate the linear invariant relation (8) by visualizing estimates of the tuple $(X_{4}^{e},\eta^{\top}X^{e},\mathsf{E}_{\mathcal{P}_{e}}[Y|X])$ (corresponding to the $\{x,y,z\}$ axes); for $e\in\{1,2\}$, $X_{4}^{e}$ and $\mathsf{E}_{\mathcal{P}_{e}}[Y|X]$ are estimated using OLS, and $\eta$ is then estimated using OLS by regressing $\mathsf{E}_{\mathcal{P}_{e}}[Y|X]$ on $(X_{4}^{e},X^{e})$ using the pooled data. Similarly, for (b) and (c), we verify such a relation for $X_{5}^{e}$ and $X_{6}^{e}$. The overlaps between the red dots ($e=1$) and the blue dots ($e=2$) in (a) and (b) indicate the invariant linear relations. However, the red and blue dots in (c) are not aligned, implying that no invariant relations as in (8) and (9) hold for $(\mathsf{E}_{\mathcal{P}_{e}}[Y|X],X_{6}^{e})$.

which is not directly applicable for predicting $Y^{e^{\text{test}}}$ as $a^{e}$ can change arbitrarily. Similarly, one can compute $\mathsf{E}_{\mathcal{P}_{e}}[X_{3}|X_{1},X_{2}]=(1+a^{e})X_{1}^{e}+X_{2}^{e}$. It is noteworthy that $\mathsf{E}_{\mathcal{P}_{e}}[Y|X]$ and $\mathsf{E}_{\mathcal{P}_{e}}[X_{3}|X_{1},X_{2}]$ can each be generated by a linear combination of $\mathsf{E}_{\mathcal{P}_{e}}[Y|X_{PA(Y)}]=a^{e}X_{1}^{e}+X_{2}^{e}$ and $\{X_{1}^{e},X_{2}^{e},X_{3}^{e}\}$, and we will highlight this observation by introducing two invariant relations (see Def. 4 and Def. 5). As a result, there exists a deterministic linear relation, which we refer to as matching,

$$\mathsf{E}_{\mathcal{P}_{e}}[Y|X]=\lambda\mathsf{E}_{\mathcal{P}_{e}}[X_{3}|X_{1},X_{2}]+\eta^{\top}X^{e}, \qquad (8)$$

with coefficients $\lambda=1/2$ and $\eta=(-1,0,1/2)^{\top}$ that are invariant with respect to the environment (illustrated in Fig. 2.(a)-(b)). For simplicity of notation, we denote $X_{4}^{e}:=\mathsf{E}_{\mathcal{P}_{e}}[X_{3}|X_{1},X_{2}]$ in Fig. 2. Moreover, one can verify that $\mathcal{P}_{e}(Y|X,X_{4})$ is invariant since $\mathsf{E}_{\mathcal{P}_{e}}[Y|X,X_{4}]$ and $\mathop{\rm Var}_{\mathcal{P}_{e}}(Y|X,X_{4})$ are invariant. Thus $\phi_{e}(X)=(X^{\top},X_{4})^{\top}$ satisfies the invariance property (5). A prediction model in (8) with invariant coefficients is often not unique when it exists. One can show that

$$\mathsf{E}_{\mathcal{P}_{e}}[Y|X]=-X_{1}^{e}+\frac{1}{2}X_{2}^{e}+X_{3}^{e}-\frac{3}{2}X_{5}^{e}, \qquad (9)$$

with $X_{5}^{e}:=\mathsf{E}_{\mathcal{P}_{e}}[X_{2}|X_{1},X_{3}]$. However, invariant relations in (8) and (9) do not hold for $X_{6}^{e}:=\mathsf{E}_{\mathcal{P}_{e}}[X_{1}|X_{2},X_{3}]$, since

$$X_{6}^{e}=-\frac{a^{e}+1}{(a^{e}+1)^{2}+2}X_{2}^{e}+\frac{a^{e}+1}{(a^{e}+1)^{2}+2}X_{3}^{e}$$

depends on $e$ in a more complicated form so that there is no linear invariant relation between $(X_{6}^{e},X^{e})$ and $\mathsf{E}_{\mathcal{P}_{e}}[Y|X]$, as illustrated in Fig. 2.(c). $\blacksquare$
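To make the matching relation concrete, the following is a minimal simulation sketch (not part of the paper; the environment coefficients $a^{1}=1$, $a^{2}=3$ and the sample size are arbitrary choices for illustration) that estimates $\mathsf{E}_{\mathcal{P}_{e}}[Y|X]$ and $X_{4}^{e}$ by per-environment OLS and then recovers the matching parameters in (8) from the pooled data, mirroring the estimation described in the caption of Fig. 2.

```python
# A sketch verifying the matching relation (8) on the toy SCM (7).
# Assumptions: a^1 = 1, a^2 = 3 and n = 200,000 samples per environment
# are illustrative choices, not values from the paper.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
coef_a = {1: 1.0, 2: 3.0}  # the varying coefficient a^e in (7)

def ols(A, b):
    """OLS coefficients of b regressed on the columns of A."""
    return np.linalg.lstsq(A, b, rcond=None)[0]

features, targets = [], []
for e, a in coef_a.items():
    X1, X2, N3, NY = rng.standard_normal((4, n))
    Y = a * X1 + X2 + NY
    X3 = Y + X1 + N3
    X = np.column_stack([X1, X2, X3])
    y_hat = X @ ols(X, Y)                # estimate of E_{P_e}[Y | X]
    X4 = X[:, :2] @ ols(X[:, :2], X3)    # estimate of X_4^e = E_{P_e}[X_3 | X_1, X_2]
    features.append(np.column_stack([X, X4]))
    targets.append(y_hat)

design = np.vstack(features)
target = np.concatenate(targets)
theta = ols(design, target)              # pooled fit of (eta, lambda)
residual = target - design @ theta

print("estimated (eta, lambda):", np.round(theta, 3))  # close to (-1, 0, 0.5, 0.5)
print("max |residual|:", np.abs(residual).max())       # near zero, up to sampling noise
```

With a large sample size, the recovered coefficients are close to $\eta=(-1,0,1/2)^{\top}$ and $\lambda=1/2$, and the pooled residuals are close to zero in both environments, as visualized in Fig. 2(a).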

In the next example, we extend the example above to allow for interventions on $X_{1}$, $X_{2}$, and $Y$ through the means and/or variances of the noise variables (see Fig. 3).

Figure 3: $\mathcal{G}(\mathcal{M}_{\text{toy}}^{e})$ with interventions on $(X_{1},Y)$ in (a), $(X_{2},Y)$ in (b), and $(X_{1},X_{2},X_{3},Y)$ in (c), respectively.

Example (additional settings):

Consider the model $\mathcal{M}_{\text{toy}}^{e}$ from the example above with additional shift interventions and/or interventions on the noise variances. The results are summarized in the following.

  1. Under shift interventions on $Y$ and $X_{1}$ and an intervention on the variance of $X_{1}$ as illustrated in Fig. 3(a), the two invariant relations (8) and (9) hold.

  2. When $Y$ is intervened through the variance of $N_{Y}$ as illustrated in Fig. 1, the relations (8) and (9) will not hold. In this case, new relations can be established if $\mathsf{E}_{\mathcal{P}_{e}}[Y|X]$ in (8) and (9) is replaced by $\mathsf{E}_{\mathcal{P}_{e}}[Y|X_{PA(Y)}]=\mathsf{E}_{\mathcal{P}_{e}}[Y|X_{1},X_{2}]$.

  3. When $X_{2}$ is intervened through either the mean or variance as illustrated in Fig. 3(b), a relation as in (9) that is based on $X_{5}^{e}$ will not hold.

  4. Combining the interventions above along with an intervention on $X_{3}$ through the noise variance as illustrated in Fig. 3(c), there will be one invariant relation left,

$$\mathsf{E}_{\mathcal{P}_{e}}[Y|X_{1},X_{2}]=X_{4}^{e}-X_{1}^{e}+b, \qquad (10)$$

for some intercept $b\in\mathbb{R}$. It is noteworthy that (10) will fail to hold if $X_{3}$ is intervened through the coefficients or noise mean. However, due to the intervention on the noise variance of $Y$, $\mathcal{P}_{e}(Y|X_{S},X_{4})$ is not invariant for any $S\subseteq\{1,2,3\}$, since $\mathop{\rm Var}_{\mathcal{P}_{e}}(Y|X_{S},X_{4})$ changes with $e$.

Remark 1.

In [50], the authors have shown that a form of varying filter connecting the features and response (as a special case of the varying coefficients captured by $\beta^{e}$ in (2)) is effective for causal inference tasks, by adopting estimators from [51].

$\blacksquare$

Figure 4: A cyclic model with interventions on $Y$ in (a) and $(X_{2},Y)$ in (b), respectively.

Example (cyclic graph setting):

Consider the model $\mathcal{M}_{\text{toy}}^{e}$ from the first example with $X_{2}^{e}$ generated by $X^{e}_{2}=aX^{e}_{3}+N^{e}_{2}$ such that $a\neq 1$, which creates a cyclic model as illustrated in Fig. 4(a). One can compute that the relation in (8) still holds. Moreover, the same relation holds even when $X_{2}$ is intervened through the coefficient $a$ (i.e., when it changes over $e$, denoted by $a^{e}$) and/or the noise $N^{e}_{2}$, illustrated in Fig. 4(b); this example can be further generalized to more complicated cyclic models. This cyclic setting highlights the point that the matching described in (8) can hold without assuming acyclic causal models, which will be left for future work. Focusing on acyclic causal models, in the subsequent sections, we formalize the matching conditions and provide a systematic analysis of sufficient conditions through a natural decomposition. $\blacksquare$

Remark 2.

The condition on $a$ prevents the cyclic model from being ill-defined, since combining the structural equations under $a=1$ yields a contradiction in which the independent noise variables satisfy a linear constraint.

3-B Invariant Matching Property

In this section, we generalize the invariant relations observed in the examples above to a class of such relations for $\mathcal{M}^{e}$, $e\in\mathcal{E}^{\text{all}}$, without assuming joint Gaussian distributions, and connect this with the invariance property in (5). For the identification of such relations, we show that even two training environments suffice (see Proposition 1).

To handle non-Gaussian cases (beyond the Gaussian toy examples), we choose to adopt the linear MMSE (or LMMSE) estimators for constructing linear invariant relations. For a target variable $Y\in\mathbb{R}$ given a vector of predictors $X\in\mathbb{R}^{p}$, the LMMSE estimator is defined as

$$\mathsf{E}_{l}[Y|X]:=(\theta^{\text{ols}})^{\top}(X-\mathsf{E}[X])+\mathsf{E}[Y],$$

where $\theta^{\text{ols}}:=\mathop{\rm Cov}(X,X)^{-1}\mathop{\rm Cov}(X,Y)$ is called the population ordinary least squares (OLS) estimator. With a slight abuse of notation, we write $\mathsf{E}_{l,\mathcal{P}_{e}}[Y\,|\,X]$ to denote the LMMSE of $Y$ given $X$ with respect to $(X,Y)\sim\mathcal{P}_{e}$. To simplify the presentation, we focus on $(X^{e},Y^{e})$ with zero means for each $e\in\mathcal{E}^{\text{all}}$ (or equivalently, all the noise variables have zero means), while the non-zero mean settings can be handled by introducing the constant one as an additional predictor.
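As an illustration of the definition above, the following small helper (our own sketch, with sample moments plugged in for the population quantities; the function names are not from the paper) computes the LMMSE estimator from i.i.d. draws of $(X,Y)$.

```python
# A plug-in sketch of the LMMSE estimator E_l[Y|X]; sample moments replace
# the population moments in the definition above.
import numpy as np

def lmmse_fit(X, Y):
    """Estimate theta_ols = Cov(X,X)^{-1} Cov(X,Y) and the means from samples."""
    mean_X, mean_Y = X.mean(axis=0), Y.mean()
    Xc, Yc = X - mean_X, Y - mean_Y
    theta_ols = np.linalg.solve(Xc.T @ Xc, Xc.T @ Yc)
    return theta_ols, mean_X, mean_Y

def lmmse_predict(x, theta_ols, mean_X, mean_Y):
    """Evaluate E_l[Y | X = x] = theta_ols^T (x - E[X]) + E[Y]."""
    return (x - mean_X) @ theta_ols + mean_Y
```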

Definition 2.

For $k\in\{1,\ldots,d\}$, $R\subseteq\{1,\ldots,d\}\setminus k$, and $S\subseteq\{1,\ldots,d\}$, we say that the tuple $(k,R,S)$ satisfies the invariant matching property (IMP) if, for every $e\in\mathcal{E}^{\text{all}}$,

$$\mathsf{E}_{l,\mathcal{P}_{e}}[Y\,|\,X_{S}]=\lambda\mathsf{E}_{l,\mathcal{P}_{e}}[X_{k}\,|\,X_{R}]+\eta^{\top}X^{e}, \qquad (11)$$

for some $\lambda\in\mathbb{R}$ and $\eta\in\mathbb{R}^{d}$ that do not depend on $e$. We denote $\mathcal{I}_{\mathcal{M}}:=\{(k,R,S):\text{(11) holds}\}$ for model $\mathcal{M}$, and we call $(\eta^{\top},\lambda)^{\top}$ the matching parameters.

For a tuple $(k,R,S)$ such that the IMP does not hold, we simply call it a non-IMP. Observe that $\mathsf{E}_{l,\mathcal{P}_{e}}[Y|X_{S}]$ is not directly applicable to the test environment because its components depend on $e$, but those components are fully captured by $\mathsf{E}_{l,\mathcal{P}_{e}}[X_{k}|X_{R}]$. If the matching parameters are identified from the training environments, the IMP is applicable to the test environment since $\mathsf{E}_{l,\mathcal{P}_{\text{test}}}[X_{k}|X_{R}]$ is completely determined by the distribution of $X^{e^{\text{test}}}$, without the need of the target $Y^{e^{\text{test}}}$. Since computing the additional feature $\mathsf{E}_{l,\mathcal{P}_{e}}[X_{k}|X_{R}]$ is simply the prediction of $X_{k}$ (as if $X_{k}$ were not observed), the IMP indicates that the prediction of $Y$ can benefit from the predictions of certain predictors. We formally define this class of additional features as follows.

Definition 3.

For any $k\in\{1,\ldots,d\}$ and $R\subseteq\{1,\ldots,d\}\setminus k$, we call $\mathsf{E}_{l,\mathcal{P}_{e}}[X_{k}\,|\,X_{R}]$ a prediction module. If a prediction module satisfies an IMP for some $S\subseteq\{1,\ldots,d\}$, we call it a matched prediction module for $S$.

To discuss the relationship between the IMP and the invariance property $\mathcal{P}_{e}(Y|\phi_{e}(X))=\mathcal{P}_{h}(Y|\phi_{h}(X))$ in (5), we rewrite (11) in a compact form as

$$\mathsf{E}_{l,\mathcal{P}_{e}}[Y|X_{S}]=\theta^{\top}\widetilde{X}^{e}, \qquad (12)$$

where

$$\widetilde{X}^{e}:=(X_{1}^{e},\ldots,X_{d}^{e},\mathsf{E}_{l,\mathcal{P}_{e}}[X_{k}|X_{R}])^{\top}, \qquad (13)$$

and $\theta=(\eta^{\top},\lambda)^{\top}$ denotes the matching parameter. Define

$$\phi_{e}^{(k,R,S)}(X^{e}):=(X_{S^{\prime}}^{e},\mathsf{E}_{l,\mathcal{P}_{e}}[X_{k}|X_{R}])^{\top}, \qquad (14)$$

where $X_{S^{\prime}}^{e}$ is a row vector for some $S^{\prime}\subseteq\{1,\ldots,d\}$. For notational convenience, we introduce the shorthand $\phi_{e}(X^{e}):=\phi_{e}^{(k,R,S)}(X^{e})$. Note that $\phi_{e}(X^{e})$ is a linear transform of $X^{e}$.

In general, however, the invariance of the matching parameter $\theta$ does not imply the invariance property (5). In Section 4, we will characterize a class of IMPs that each satisfies (5). When the invariance property holds, one can apply the general conditional expectation $f_{\phi_{e}}(x)=\mathsf{E}_{\mathcal{P}_{e}}[Y|\phi_{e}(x)]$ as in (6), since the linear estimator from the IMP is in general sub-optimal for the non-Gaussian cases. We will focus on linear estimators in this work as the extension can be handled via nonlinear regression methods in a straightforward manner.

It is noteworthy that since $\mathsf{E}_{l,\mathcal{P}_{e}}[X_{k}|X_{R}]$ is a linear function of $X_{R}^{e}$, the matching parameter $\theta$ is not unique given a single environment $e\in\mathcal{E}^{\text{train}}$. This causes issues when one aims to identify the possible IMPs given the distribution of $(X^{e},Y^{e})$. However, we show that two environments in the training data are sufficient to identify $\theta$ under a mild assumption on the matched prediction module.

Proposition 1.

For a tuple $(k,R,S)$ that satisfies an IMP, the matching parameter $\theta$ can be uniquely identified in $\mathcal{E}^{\text{train}}$ if $|\mathcal{E}^{\text{train}}|\geq 2$ and

$$\mathsf{E}_{l,\mathcal{P}_{e}}[X_{k}|X_{R}=x]\neq\mathsf{E}_{l,\mathcal{P}_{h}}[X_{k}|X_{R}=x] \qquad (15)$$

for some $e,h\in\mathcal{E}^{\text{train}}$ and $x\in\mathcal{X}_{R}$.

This proposition shows how the heterogeneity of the data generating process can be helpful for identifying important invariant relations.
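The role of heterogeneity can also be seen numerically: in the toy SCM (7), take $k=3$, $R=\{1,2\}$, and $S=\{1,2,3\}$; within one environment the prediction module is a fixed linear combination of $X_{1}$ and $X_{2}$, so the pooled design $[X_{S},\mathsf{E}_{l,\mathcal{P}_{e}}[X_{k}|X_{R}]]$ is rank-deficient, while two environments with different $a^{e}$ make it full rank. The sketch below (our own, with arbitrary $a^{e}$ values and sample size) checks exactly this.

```python
# A rank check illustrating Proposition 1 on the toy SCM (7); the values of
# a^e and the sample size are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

def imp_design(a):
    """Design matrix [X_1, X_2, X_3, E_l[X_3 | X_1, X_2]] for one environment."""
    X1, X2, N3, NY = rng.standard_normal((4, n))
    Y = a * X1 + X2 + NY
    X3 = Y + X1 + N3
    XR = np.column_stack([X1, X2])
    X4 = XR @ np.linalg.lstsq(XR, X3, rcond=None)[0]   # prediction module
    return np.column_stack([X1, X2, X3, X4])

D1, D2 = imp_design(a=1.0), imp_design(a=3.0)
print(np.linalg.matrix_rank(D1))                    # 3: theta is not identified
print(np.linalg.matrix_rank(np.vstack([D1, D2])))   # 4: theta is identified
```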

3-C A Decomposition of the IMP

In our toy examples, recall that the IMPs are derived by first computing $\mathsf{E}_{\mathcal{P}_{e}}[Y|X_{S}]$ and $\mathsf{E}_{\mathcal{P}_{e}}[X_{k}|X_{R}]$ separately and then fitting a linear relation from $(\mathsf{E}_{\mathcal{P}_{e}}[X_{k}|X_{R}],X_{S})$ to $\mathsf{E}_{\mathcal{P}_{e}}[Y|X_{S}]$. These two steps reveal a natural decomposition of the IMP, which we term the first and second matching properties below.

Definition 4.

We say that $S\subseteq\{1,\ldots,d\}$ satisfies the first matching property if, for every $e\in\mathcal{E}^{\text{all}}$,

$$\mathsf{E}_{l,\mathcal{P}_{e}}[Y|X_{S}]=\lambda_{Y}\mathsf{E}_{\mathcal{P}_{e}}[Y|X_{PA(Y)}]+\eta_{Y}^{\top}X^{e}, \qquad (16)$$

for some $\lambda_{Y}\in\mathbb{R}$ and $\eta_{Y}\in\mathbb{R}^{d}$ that do not depend on $e$.

First, observe that the first matching property holds for $S=PA(Y)$ since

$$\mathsf{E}_{l,\mathcal{P}_{e}}[Y|X_{PA(Y)}]=\mathsf{E}_{\mathcal{P}_{e}}[Y|X_{PA(Y)}]=(\beta^{e})^{\top}X^{e}.$$

The first matching property concerns the set $S$ such that the components in $\mathsf{E}_{l,\mathcal{P}_{e}}[Y|X_{S}]$ that depend on $e$ are fully captured by the causal function $\mathsf{E}_{\mathcal{P}_{e}}[Y|X_{PA(Y)}]$. However, this invariant relation is not directly useful for the prediction of $Y^{e^{\text{test}}}$, since the causal function can change arbitrarily with $e$. To this end, we identify another invariant relation from $\mathcal{M}^{e}$, which is called the second matching property.

Definition 5.

For $k\in\{1,\ldots,d\}$ and $R\subseteq\{1,\ldots,d\}\setminus k$, we say that a tuple $(k,R)$ satisfies the second matching property if, for every $e\in\mathcal{E}^{\text{all}}$,

$$\mathsf{E}_{l,\mathcal{P}_{e}}[X_{k}|X_{R}]=\lambda_{X}\mathsf{E}_{\mathcal{P}_{e}}[Y|X_{PA(Y)}]+\eta_{X}^{\top}X^{e}, \qquad (17)$$

for some $\lambda_{X}\in\mathbb{R}$ and $\eta_{X}\in\mathbb{R}^{d}$ that do not depend on $e$.

It is straightforward to see that, if $\lambda_{X}\neq 0$ in the second matching property, the first and second matching properties imply the IMP as follows,

$$\mathsf{E}_{l,\mathcal{P}_{e}}[Y|X_{S}]=\frac{\lambda_{Y}}{\lambda_{X}}\mathsf{E}_{l,\mathcal{P}_{e}}[X_{k}|X_{R}]+\left(\eta_{Y}-\frac{\lambda_{Y}}{\lambda_{X}}\eta_{X}\right)^{\top}X^{e}:=\lambda\mathsf{E}_{l,\mathcal{P}_{e}}[X_{k}|X_{R}]+\eta^{\top}X^{e}.$$

For prediction tasks under SCMs, the causal function often plays a central role. Our first and second matching properties show how the LMMSE estimator $\mathsf{E}_{l,\mathcal{P}_{e}}[Y|X_{S}]$ and the matched prediction module $\mathsf{E}_{l,\mathcal{P}_{e}}[X_{k}|X_{R}]$ are connected with the causal function, respectively. Together, the two individual connections make up the IMP (illustrated in Fig. 5).

Figure 5: A triangular relation consisting of the first, second, and invariant matching properties.

4 Characterization of Invariant Matching Properties

4-A Interventions on the Response

First, we consider model $\mathcal{M}^{e}$ with interventions only on $Y$ through the coefficients, i.e.,

$$\mathcal{M}^{e,1}:\quad X^{e}=\gamma Y^{e}+BX^{e}+\varepsilon_{X}^{e} \qquad (18)$$
$$\mathcal{M}^{e,1}:\quad Y^{e}=(\alpha^{e}+\beta)^{\top}X^{e}+\varepsilon_{Y}^{e}. \qquad (19)$$

To distinguish the parents of $Y$ with varying and invariant coefficients, we decompose $\beta^{e}$ in $\mathcal{M}^{e}$ into two parts, $\alpha^{e}$ and $\beta$. Without loss of generality, we assume that $\alpha_{j}^{e}\neq 0$ if and only if $\alpha_{j}^{e}$ is a non-constant function of $e$, and we define the following subset of parents of $Y$,

$$PE=\{j\in\{1,\ldots,d\}:\alpha_{j}^{e}\neq 0\}.$$
Remark 3.

Note that, in the non-zero mean settings, this model covers the shift intervention on $Y$ through the varying coefficient of the constant-one predictor.

Recall that prediction modules do not rely on the response $Y$ but on the relations between the predictors in each environment. When $Y$ is unobserved (or equivalently, substituting $Y$ in (19) into (18)), the relations between the predictors are as follows,

$$X^{e}=\left(\gamma\left(\alpha^{e}+\beta\right)^{\top}+B\right)X^{e}+\gamma\varepsilon_{Y}+\varepsilon_{X}, \qquad (20)$$

where $\gamma\varepsilon_{Y}+\varepsilon_{X}$ is a vector of dependent random variables when $\gamma$ is not a zero vector. If $\alpha^{e}$ vanishes from (20), the distribution of $X^{e}$ becomes invariant with respect to environments. As a consequence, condition (15) in Proposition 1 will not be satisfied. Observe that $\alpha^{e}$ is non-vanishing in (20) only if $\gamma$ is not a zero vector, which brings up the following key assumption.

Assumption 1.

When $Y$ is intervened, we assume that $Y$ has at least one child.

Note that if $Y$ has no children, by the Markov property of SCMs [4], the test data sampled from $\mathcal{P}_{\text{test}}^{X}$ provide no information about $\mathcal{P}_{\text{test}}^{Y|X}$, and thus the observed $X^{e^{\text{test}}}$ may correspond to two arbitrarily different $Y^{e^{\text{test}}}$'s, which makes the problem ill-posed.

The first and second matching properties enable us to characterize the tuples $(k,R,S)$ that satisfy IMPs through characterizations of $S$ (for the first matching property) and $(k,R)$ (for the second matching property) separately. In the following theorem, we show that a class of IMPs implied by the first and second matching properties satisfies the invariance property (5).

Theorem 1.

For model $\mathcal{M}^{e,1}$, the first and second matching properties hold in the following cases.

  1. On the first MP: For each $S\subseteq\{1,\ldots,d\}$ such that $PE\subseteq S$, the first matching property holds.

  2. On the second MP: For each $k\in\{1,\ldots,d\}\setminus PE$ and $R\subseteq\{1,\ldots,d\}\setminus k$ such that $PE\subseteq R$, the second matching property holds.

For any tuple $(k,R,S)$ above such that $R\subseteq S$, if $\lambda_{X}\neq 0$ in the second matching property, then $\phi_{e}(X^{e})=(X_{S}^{e},\,\mathsf{E}_{l,\mathcal{P}_{e}}[X_{k}|X_{R}])^{\top}$ satisfies (5). Furthermore, $\mathcal{L}_{\text{test}}(f_{\phi})$ is minimized by any $\phi$ with $S=\{1,\ldots,d\}$.

As a concrete example, recall (8) from our motivating example, where $R\subseteq S$ and $\lambda_{X}\neq 0$ are satisfied, so that $\mathsf{E}_{\mathcal{P}_{e}}[Y|X]$ can be represented using invariant coefficients. It is noteworthy that Assumption 1 is a necessary condition for $\lambda_{X}\neq 0$, and we provide a sufficient condition for $\lambda_{X}\neq 0$ in a concrete setting with $S=\{1,\ldots,d\}$ below.

Corollary 1.

For model $\mathcal{M}^{e,1}$, the first and second matching properties hold in the following cases.

  1. On the first MP: The first matching property holds for $S=\{1,\ldots,d\}$.

  2. On the second MP: For each $k\in\{j\in MB(Y):\alpha_{j}^{e}=0\}$ and $R=-k:=\{1,\ldots,d\}\setminus k$, the second matching property holds.

Proposition 2.

Under Assumption 1, for each $(k,R)$ in Corollary 1, we have $\lambda_{X}\neq 0$ in the second matching property if $B_{-k,k}$ is not in the following hyperplane,

$$w^{\top}x+b=0, \qquad (21)$$

where $w\in\mathbb{R}^{d-1}$ and $b\in\mathbb{R}$ are determined by the parameters in $\mathcal{M}^{e,1}$ other than $B_{-k,k}$.

The explicit expressions of $w$ and $b$ in terms of the parameters in $\mathcal{M}^{e,1}$ are provided in the proof of Proposition 2. For generic choices of the parameters, the second matching property holds with $\lambda_{X}\neq 0$, since $B_{-k,k}$ is not necessarily on the hyperplane described in (21).

When $Y$ is additionally intervened through the noise variance, the proof of Theorem 1 will break down in general (see Remark 8 in Appendix C). However, recall that the first matching property holds for $S=PA(Y)$ by definition. In this case, we provide an example for the second matching property in the following corollary.

Corollary 2.

Under Assumption 1, if $Y$ is intervened through the noise variance in model $\mathcal{M}^{e,1}$, the second matching property holds for $k\in CH(Y)$ such that $k\not\in DE(i)$ for any $i\in CH(Y)\setminus k$, and $R=\{1,\ldots,d\}\setminus DE(Y)$.

The resulting IMPs no longer satisfy the invariance property (5), but we can use the IMP directly for the prediction of $Y^{e^{\text{test}}}$.

Remark 4.

To sum up, the class of IMPs constructed under interventions on $Y$ only through the coefficients and shifts will in general imply the invariance property (5), but this is not the case under interventions on $Y$ through the noise variance. For the characterizations of the IMP, we focus on sufficient conditions that are relatively simple to evaluate, while they are not necessary conditions in general. One way to find necessary and sufficient conditions for the first/second matching property is through explicit (but tedious) calculations as in the proof of Proposition 2. However, such conditions are hard to evaluate and provide limited insight into our methodology. We thus do not pursue them in this work. When $Y$ is intervened through the coefficients of all its parents, i.e., $PE=PA(Y)$, a recent work [52] shows that $PA(Y)\subseteq S$ and $PA(Y)\subseteq R$ are also necessary for the IMP under mild assumptions, leading to a novel algorithm for identifying the direct causes of $Y$ in the multi-environment setting.

4-B Interventions on both Predictors and Response

To generalize the setting where only $Y$ is intervened to the general setting where both $X$ and $Y$ are intervened, an idea is to merge the former setting with the one where only $X$ is intervened. The latter setting has been studied in the stabilized regression framework [21]. The following set of predictors is identified (see Definition 3.4 therein),

$$X^{\text{int}}(Y)=CH^{I}(Y)\cup\bigl\{j\in\{1,\ldots,d\}\,|\,\exists i\in CH^{I}(Y)\text{ such that }j\in DE(X_{i})\bigr\},$$

which contains the intervened children of $Y$ (denoted by $CH^{I}(Y)$) and the descendants of such children. This useful notion can be defined for each $X_{j}\in\{X_{1},\ldots,X_{d}\}$, denoted by $X^{\text{int}}(X_{j})$. When only $X$ is intervened, the invariance principle (4) holds for $S^{*}=\{1,\ldots,d\}\setminus X^{\text{int}}(Y)$; the Markov blanket of $Y$ defined with respect to $X_{S^{*}}$ is called the stable blanket of $Y$ in [21]. In other words, by excluding the predictors in $X^{\text{int}}(Y)$, the target $Y$ is blocked from the interventions on $X$ when conditioning on $X_{S^{*}}$. This remains true when $Y$ is additionally intervened, as if only $Y$ were intervened given $X_{S^{*}}$. In order for $S^{*}$ to include at least one child of $Y$ as in Assumption 1, we need the following assumption.

Assumption 2.

When $Y$ is intervened, we assume that $Y$ has at least one child that is not intervened, and that child is not a descendant of any intervened child of $Y$.

Based on the observation above, we identify an important class of IMPs for the general setting in the following Theorem.

Theorem 2.

For the training model $\mathcal{M}^{e}$ without the intervention on the noise variance of $Y$, the first and second matching properties hold in the following cases.

  1. On the first MP: For $S\subseteq\{1,\ldots,d\}\setminus X^{\text{int}}(Y)$ such that $PA(Y)\subseteq S$, the first matching property holds.

  2. On the second MP: For each $k\in\{1,\ldots,d\}\setminus\{PE\cup X^{\text{int}}(Y)\}$ and $R\subseteq\{1,\ldots,d\}\setminus\{k,X^{\text{int}}(X_{k})\cup X^{\text{int}}(Y)\}$ such that $PA(X_{k})\setminus Y\subseteq R$, the second matching property holds.

Furthermore, if $\lambda_{X}\neq 0$ in the second matching property, then $\phi_{e}(X^{e})=(X_{S}^{e},\,\mathsf{E}_{l,\mathcal{P}_{e}}[X_{k}|X_{R}])^{\top}$ satisfies (5).

Remark 5.

In general, there exist subsets of $S$ and $R$ from Theorem 2 such that the first and second matching properties hold, respectively. In particular, $PA(Y)\subseteq S$ and $PA(X_{k})\subseteq R$ are not necessarily satisfied. Due to the Markov property of SCMs, $Y$ is independent of its ancestors when conditioning on $PA(Y)$; thus we focus on the IMPs that are more predictive by including $PA(Y)$ in $S$.

When $Y$ is intervened through its noise variance, recall again that the first matching property always holds for $S=PA(Y)$ by definition. The following proposition follows from Remark 9 in Appendix E.

Proposition 3.

When $Y$ is additionally intervened through the noise variance in Theorem 2, the second matching property in Theorem 2 still holds.

Similar to the argument in the proof of Theorem 1, the class of $\phi$'s from Theorem 2 will lead to the same test population loss, as they depend on the same $S$ that is fixed in this setting. Assumption 2 is necessary for $\lambda_{X}\neq 0$, while sufficient conditions for $\lambda_{X}\neq 0$ can be found similarly as in Proposition 2.

5 Algorithms

For each $e\in\mathcal{E}^{\text{train}}$, we are given the i.i.d. training data $\boldsymbol{X}_{e}\in\mathbb{R}^{n_{e}\times d},\boldsymbol{Y}_{e}\in\mathbb{R}^{n_{e}}$, and we observe the i.i.d. test data $\boldsymbol{X}^{\tau}\in\mathbb{R}^{m\times d}$ and aim to predict $\boldsymbol{Y}^{\tau}\in\mathbb{R}^{m}$. Let $\boldsymbol{X}\in\mathbb{R}^{n\times d}$ with $n:=\sum_{e\in\mathcal{E}^{\text{train}}}n_{e}$ denote the pooled data of the $\boldsymbol{X}_{e}$'s. In this section, we present the implementation of our method, starting with the case when $e$ is sampled from a discrete distribution with finite support. In this setting, we expect to have $n_{e}\gg 1$ for every $e\in\mathcal{E}^{\text{train}}$ in general, thus it is possible to perform estimation based on the data from each single environment. The challenging setting of continuous environments will be handled afterward.

5-A Discrete Environments

To implement our method, the main task is to identify the set of IMPs $\mathcal{I}_{\mathcal{M}}$ from the training data. For each tuple $(k,R,S)$ in Algorithm 1, we test the following null hypothesis:

$$\mathcal{H}_{0}:\text{ There exists }\theta\in\mathbb{R}^{d+1}\text{ such that (12) holds.}$$

We propose two test procedures.

  1. Test of the Deterministic Relation: Since the IMPs are linear and deterministic (i.e., noiseless), we test whether the residual vector $\boldsymbol{R}\in\mathbb{R}^{n}$ of fitting an IMP on $(k,R,S)$ is a zero vector or not, using the test statistic

$$T=\frac{1}{n}\boldsymbol{R}^{\top}\boldsymbol{R},$$

where $\boldsymbol{R}$ is the pooled data vector of the $\boldsymbol{R}_{e}$'s defined below. To fit an IMP, we first estimate the two LMMSE estimators in (11) using OLS for each environment, collecting the fitted values

$$\boldsymbol{\hat{L}}_{e,1}:=\boldsymbol{X}_{e,S}(\boldsymbol{X}_{e,S}^{\top}\boldsymbol{X}_{e,S})^{-1}\boldsymbol{X}_{e,S}^{\top}\boldsymbol{Y}_{e}\,,\qquad\boldsymbol{\hat{L}}_{e,2}:=\boldsymbol{X}_{e,R}(\boldsymbol{X}_{e,R}^{\top}\boldsymbol{X}_{e,R})^{-1}\boldsymbol{X}_{e,R}^{\top}\boldsymbol{X}_{e,k}\,.$$

Let $\boldsymbol{\hat{L}}_{1}\in\mathbb{R}^{n}$ and $\boldsymbol{\hat{L}}_{2}\in\mathbb{R}^{n}$ denote the pooled data of the $\boldsymbol{\hat{L}}_{e,1}$'s and $\boldsymbol{\hat{L}}_{e,2}$'s, respectively. It is noteworthy that $\boldsymbol{\hat{L}}_{2}$ only depends on $\boldsymbol{X}$, thus $\boldsymbol{\hat{L}}_{2}^{\tau}$ for the test data can be computed similarly using $\boldsymbol{X}^{\tau}$. Next, we estimate the matching parameter using OLS on the pooled data (recall that the matching parameter cannot be identified using the data from a single environment). The OLS estimator of the matching parameter is

$$\hat{\theta}:=(\hat{\eta}^{\top},\hat{\lambda})^{\top}=([\boldsymbol{X}_{S},\boldsymbol{\hat{L}}_{2}]^{\top}[\boldsymbol{X}_{S},\boldsymbol{\hat{L}}_{2}])^{-1}[\boldsymbol{X}_{S},\boldsymbol{\hat{L}}_{2}]^{\top}\boldsymbol{\hat{L}}_{1}\,.$$

For each $e\in\mathcal{E}^{\text{train}}$, we obtain the residual vector of fitting an IMP,

$$\boldsymbol{R}_{e}=\boldsymbol{\hat{L}}_{e,1}-\hat{\lambda}\boldsymbol{\hat{L}}_{e,2}-\boldsymbol{X}_{e,S}\hat{\eta}\,.$$

A code sketch of this procedure is given after this list.

  2. Approximate Test of Invariant Residual Distributions: According to the invariance property (6), we test whether the residual when regressing $\boldsymbol{Y}$ on $[\boldsymbol{X}_{S},\boldsymbol{\hat{L}}_{2}]$ has constant mean and variance across environments. Specifically, we use the t-test and F-test with corrections for multiple hypothesis testing from [8] (see Section 2.1, Method II). The test yields a p-value.
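The following sketch of the first test procedure is a minimal reading of the steps above, not the authors' implementation; `imp_test_statistic` and its arguments are names we introduce for illustration.

```python
# A sketch of the test of the deterministic relation: per-environment OLS fits
# of the two LMMSE estimates, pooled estimation of the matching parameter, and
# the statistic T = R^T R / n for a candidate tuple (k, R, S).
import numpy as np

def fitted(A, b):
    """OLS fitted values of b regressed on the columns of A."""
    return A @ np.linalg.lstsq(A, b, rcond=None)[0]

def imp_test_statistic(envs, k, R, S):
    """envs: list of (X_e, Y_e) pairs; k an index, R and S lists of indices."""
    L1, L2, XS = [], [], []
    for X_e, Y_e in envs:
        L1.append(fitted(X_e[:, S], Y_e))         # hat L_{e,1}
        L2.append(fitted(X_e[:, R], X_e[:, k]))   # hat L_{e,2} (prediction module)
        XS.append(X_e[:, S])
    L1, L2, XS = np.concatenate(L1), np.concatenate(L2), np.vstack(XS)
    design = np.column_stack([XS, L2])
    theta = np.linalg.lstsq(design, L1, rcond=None)[0]   # (eta, lambda)
    residual = L1 - design @ theta
    return residual @ residual / len(residual), theta
```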

The test statistic from the first procedure and the p-value from the second procedure quantify how likely an IMP is to hold (i.e., the smaller the value, the more likely), and thus we will refer to either one of them as an IMP score, denoted by $s_{\text{IMP}}$. Let $\hat{\mathcal{I}}=\{(k,R,S):s_{\text{IMP}}(k,R,S)<c_{\text{IMP}}\}$ denote the set of IMPs identified from the training data, where $c_{\text{IMP}}$ is some cutoff parameter. Then, since IMPs are not equally predictive in general, we focus on the most predictive ones by introducing the mean squared prediction error as a prediction score $s_{\text{pred}}$, and we select the set of IMPs that are more predictive, $\widehat{\mathcal{I}_{\text{pred}}}=\{(k,R,S)\in\hat{\mathcal{I}}:s_{\text{pred}}(k,R,S)<c_{\text{pred}}\}$, with some cutoff parameter $c_{\text{pred}}$. For the second IMP score, which is a p-value, the cutoff parameter $c_{\text{IMP}}$ is simply a significance level that is fixed to $0.05$ in this work. For choosing the rest of the cutoff parameters, we follow a bootstrap procedure from [19] with one subtle difference: We sample the same number of bootstrap samples from each environment rather than sampling over the pooled data as in [19], since our procedure involves estimation using the data from each environment.

Algorithm 1 Invariant Prediction using the IMP (discrete)
procedure Identify IMPs from the Training data
     for $k\in\{1,\ldots,d\}$, $S\subseteq\{1,\ldots,d\}$, $R\subseteq S\setminus k$ do
          Compute the IMP score $s_{\text{IMP}}$ and the prediction score $s_{\text{pred}}$ for $i=(k,R,S)$
          Regress $\boldsymbol{Y}$ on $[\boldsymbol{X}_{S},\boldsymbol{\hat{L}}_{2}]$ to obtain $f_{i}$
     Identify $\hat{\mathcal{I}}$ and $\widehat{\mathcal{I}_{\text{pred}}}$
procedure Prediction on the Testing data
     $\hat{\boldsymbol{Y}}^{\tau}=\frac{1}{|\widehat{\mathcal{I}_{\text{pred}}}|}\sum_{i\in\widehat{\mathcal{I}_{\text{pred}}}}f_{i}(\boldsymbol{X}_{S}^{\tau},\boldsymbol{\hat{L}}_{2}^{\tau})$

In practice, there can be spurious IMPs that have extremely small IMP scores but large prediction scores, e.g., when $Y$ is independent of $X_{S}$ and $X_{k}$ is independent of $X_{R}$. To this end, we pre-select the $(k,R,S)$'s with prediction scores smaller than the median of all the computed prediction scores before identifying $\hat{\mathcal{I}}$.

If the regression function $f_{i}$ in Algorithm 1 is chosen to be linear, one can use the IMP directly, i.e.,

$$\boldsymbol{\hat{Y}}_{\text{IMP}}(k,R,S)=[\boldsymbol{X}_{S},\boldsymbol{\hat{L}}_{2}]\hat{\theta}, \qquad (22)$$

which we call the discrete IMP estimator, denoted by $\text{IMP}_{\text{d}}$. To make use of all the IMPs selected in $\widehat{\mathcal{I}_{\text{pred}}}$, we use an averaging step for the prediction of $\boldsymbol{Y}^{\tau}$ in Algorithm 1.
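For completeness, here is a sketch (again with our own naming, reusing `imp_test_statistic` from the previous sketch to obtain $\hat{\theta}$ for each tuple) of the prediction step: the discrete IMP estimator (22) evaluated on the test data, and the averaging over the selected set $\widehat{\mathcal{I}_{\text{pred}}}$.

```python
# A sketch of the prediction step of Algorithm 1 with a linear f_i: evaluate
# (22) on the test data and average over the selected IMPs.
import numpy as np

def imp_predict(X_test, theta, k, R, S):
    """Recompute hat L_2^tau from the test predictors, then apply theta."""
    XR, Xk = X_test[:, R], X_test[:, k]
    L2_test = XR @ np.linalg.lstsq(XR, Xk, rcond=None)[0]
    return np.column_stack([X_test[:, S], L2_test]) @ theta

def averaged_prediction(X_test, selected):
    """selected: list of (theta, k, R, S) tuples kept in I_pred."""
    preds = [imp_predict(X_test, th, k, R, S) for th, k, R, S in selected]
    return np.mean(preds, axis=0)
```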

Now we discuss the computational complexity of our method. For a given graph of size $d\geq 3$, the number of $(k,R,S)$'s with nonempty $(R,S)$'s is given by

$$\sum_{j=1}^{d}\binom{d}{j}\biggl(j\cdot(2^{j-1}-1)+(d-j)\cdot(2^{j}-1)\biggr)=d\cdot(2\cdot 3^{d-1}-2^{d}),$$

which follows by first choosing a set $S$ with $j$ elements, and then considering the two settings $k\in S$ and $k\not\in S$, since $R$ has to satisfy $R\subseteq S\setminus k$. This calculation implies that the exhaustive search step in our method is not applicable for relatively large graphs (e.g., $d\geq 15$) due to the exponential time complexity. It is noteworthy that in the ICP framework [8], a similar issue occurs due to an exhaustive search over $S\subseteq\{1,\ldots,d\}$. We discuss several options to alleviate this issue. (1) One can adopt a preprocessing step for feature selection to reduce the dimension $d$, using Lasso [53] or boosting [54], as proposed in ICP [8]. (2) When prior information about the graph structure (e.g., the maximum number of parents of the nodes) is available or the graph structure can be estimated, the search space can be reduced. For instance, one can first adopt existing causal discovery methods (e.g., [55]) for graph structure estimation and then apply our methods according to the sufficient conditions in Theorems 1 and 2. (If no IMP is found, one can still perform an exhaustive search over other $(k,R,S)$'s.) (3) One can also exploit some intrinsic sparsity regarding IMPs: observe that there is only one matched prediction module (recall Definition 3) in the IMP, while the number of prediction modules grows as $d\cdot 3^{d}$. This sparsity has been studied when only $Y$ is intervened [56], leveraging a variant of Lasso.
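The closed form above can be checked numerically; the brute-force count below (a quick sanity check, not from the paper) evaluates the left-hand side for small $d$.

```python
# Verify d * (2 * 3**(d-1) - 2**d) against the explicit sum over |S| = j.
from math import comb

def count_tuples(d):
    return sum(comb(d, j) * (j * (2**(j - 1) - 1) + (d - j) * (2**j - 1))
               for j in range(1, d + 1))

for d in range(3, 10):
    assert count_tuples(d) == d * (2 * 3**(d - 1) - 2**d)
print("closed form verified for d = 3, ..., 9")
```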

5-B Continuous Environments

To model continuous environments, we introduce an environmental variable UU that is a continuous random variable with support 𝒰\mathcal{U}. Apparently, this is a much more challenging setting compared with the discrete environment case, as we only have one training data sample for each u𝒰u\in\mathcal{U}, making the OLS a poor estimate of 𝖤l,𝒫u[Y|XS]\mathsf{E}_{l,\mathcal{P}_{u}}[Y|X_{S}]. Fortunately, it turns out that we can leverage the semi-parametric varying coefficient (SVC) models [57] (see Appendix G) to remedy this issue. In particular, we estimate 𝖤l,𝒫u[Y|XS]\mathsf{E}_{l,\mathcal{P}_{u}}[Y|X_{S}] by fitting,

Y=M+βZ+N with M=α(U)W,Y=M+\beta^{\top}Z+N\;\text{ with }\;M=\alpha^{\top}(U)W, (23)

where NN is independent of UU, and XSX_{S} is partitioned into two vectors of predictors WpW\in\mathbb{R}^{p} (for the varying coefficients) and ZqZ\in\mathbb{R}^{q} (for the invariant coefficients) with p+q=|S|p+q=|S|. Since we assume that NN is independent of UU, we focus on the setting where YY is not intervened through the noise variance.
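To make the fitting step concrete, below is a minimal backfitting-style sketch for estimating β and M in (23), using local least squares with the Epanechnikov kernel and the bandwidth used in Section 7. It is only a simplified stand-in for the profile least-squares procedure of [57] detailed in Appendix G; the function names (fit_svc, epanechnikov) are ours.

```python
import numpy as np

def epanechnikov(t):
    # Epanechnikov kernel k(u) = 0.75 * max(1 - u^2, 0)
    return 0.75 * np.maximum(1.0 - t ** 2, 0.0)

def fit_svc(Y, W, Z, U, h=0.1, n_iter=20):
    """Backfitting sketch for Y = alpha(U)'W + beta'Z + N.

    Y: (n,) responses; W: (n, p) varying-coefficient predictors;
    Z: (n, q) invariant-coefficient predictors; U: (n,) environment values.
    Returns beta_hat (q,) and M_hat (n,) with M_hat[i] = alpha(U_i)'W_i.
    """
    n = len(Y)
    beta = np.zeros(Z.shape[1])
    M = np.zeros(n)
    for _ in range(n_iter):
        resid = Y - Z @ beta
        # local (kernel-weighted) least squares for alpha(U_i) at every U_i
        for i in range(n):
            w = epanechnikov((U - U[i]) / h)
            Ww = W * w[:, None]
            alpha_i = np.linalg.lstsq(Ww.T @ W, Ww.T @ resid, rcond=None)[0]
            M[i] = W[i] @ alpha_i
        # ordinary least squares for the invariant coefficients beta
        beta = np.linalg.lstsq(Z, Y - M, rcond=None)[0]
    return beta, M
```

Evaluating the varying part at a new environment u proceeds in the same way, by re-solving the local least-squares problem with kernel weights centered at u.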

Remark 6.

Our estimation procedure for the discrete environments can also be formulated under the SVC model with a discrete random variable UU, where we treat all the coefficients as varying coefficients (i.e., β=𝟎\beta=\boldsymbol{0}), and IMPd\text{IMP}_{\text{d}} becomes an estimate of MM.

An SVC model over (Yτ,Wτ,Zτ,Nτ)(Y^{\tau},W^{\tau},Z^{\tau},N^{\tau}) for the test data can be defined similarly, where σ2=𝖤[(Nτ)2]\sigma^{2}=\mathsf{E}[(N^{\tau})^{2}] is the population generalization error of the IMP estimator. Observe that the linear SCM u\mathcal{M}^{u} in (2) can be viewed as a collection of SVC models parameterized by U=uU=u. Thus the estimation tasks for the linear SCMs from continuous environments can greatly benefit from the existing theories developed for SVC models. More precisely, we employ the following estimate

𝖤l,𝒫u[Y|XS]^=M^|U=u+β^Z,\widehat{\mathsf{E}_{l,\mathcal{P}_{u}}[Y|X_{S}]}=\hat{M}|_{U=u}+\hat{\beta}^{\top}Z,

where the profile least-squares estimation of β\beta and MM proposed in [57] can be found in Appendix G. Similarly, 𝖤l,𝒫u[Xk|XR]\mathsf{E}_{l,\mathcal{P}_{u}}[X_{k}|X_{R}] can be estimated by fitting another semi-parametric varying coefficient model

V=MV+βVZV+NV with MV=αV(U)W,\displaystyle V=M_{V}+\beta_{V}^{\top}Z_{V}+N_{V}\;\text{ with }\;M_{V}=\alpha_{V}^{\top}(U)W, (24)

where VV denotes any XkX_{k}, and XRX_{R} is divided into ZVrZ_{V}\in\mathbb{R}^{r} and WW.

It is noteworthy that the two SVC models share the same set of predictors with varying coefficients, which we explain below. A challenge in fitting such models is that the vector of predictors with varying coefficients, namely WW, needs to be known. For continuous environments, we focus on discovering IMPs that can be decomposed into the first and second matching properties. Since the causal function captures the predictors with varying coefficients, the first and second matching properties imply that, for both models, the vector WW is simply XPEX_{PE}, i.e., the parents of YY with varying coefficients in u\mathcal{M}^{u}.

Based on this observation and Theorem 1, we replace the exhaustive search over (k,R,S)(k,R,S) in Algorithm 1 by a search over (P,k,R,S)(P,k,R,S) according to the conditions in Theorem 1 with PE=PPE=P. That is, we choose (P,k,R,S)(P,k,R,S) from

P{1,,d},k{1,,d}P,\displaystyle P\subseteq\{1,\ldots,d\},\quad k\in\{1,\ldots,d\}\setminus P,
PS{1,,d},PRSk,\displaystyle P\subseteq S\subseteq\{1,\ldots,d\},\quad P\subseteq R\subseteq S\setminus k,

such that W=XPW=X_{P}, Z=XSPZ=X_{S\setminus P}, V=XkV=X_{k}, and ZV=XRPZ_{V}=X_{R\setminus P}.

Unlike IMPd\text{IMP}_{\text{d}} in (22), we make use of the fact that β\beta is invariant and propose the continuous IMP estimator denoted by IMPc\text{IMP}_{\text{c}} as follows

𝒀^IMP(P,k,R,S)=[𝑾,𝑴^V]w^+𝒁β^,\boldsymbol{\hat{Y}}_{\text{IMP}}(P,k,R,S)=[\boldsymbol{W},\boldsymbol{\hat{M}}_{V}]\hat{w}+\boldsymbol{Z}\hat{\beta}, (25)

with the matching parameter wp+1w\in\mathbb{R}^{p+1} estimated by

w^=([𝑾,𝑴^V][𝑾,𝑴^V])1[𝑾,𝑴^V]𝑴^.\hat{w}=([\boldsymbol{W},\boldsymbol{\hat{M}}_{V}]^{\top}[\boldsymbol{W},\boldsymbol{\hat{M}}_{V}])^{-1}[\boldsymbol{W},\boldsymbol{\hat{M}}_{V}]^{\top}\boldsymbol{\hat{M}}. (26)

The data matrices for the two models (e.g., 𝑾n×p\boldsymbol{W}\in\mathbb{R}^{n\times p}) can be defined accordingly and we provide the details in Appendix I. In this case, the residual vector for the first IMP score is given by

𝑹=𝑴[𝑾,𝑴^V]w^.\boldsymbol{R}=\boldsymbol{M}-[\boldsymbol{W},\boldsymbol{\hat{M}}_{V}]\hat{w}.

Note that the second IMP score is not applicable to continuous environments due to the small sample size in each environment; thus, we focus on the first IMP score.
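For concreteness, the following sketch assembles the IMPc\text{IMP}_{\text{c}} estimator in (25)–(26) from two SVC fits, reusing the fit_svc helper sketched earlier in this section. How the test-time quantities (e.g., the test-data estimate of MVM_{V}) are constructed is detailed in Appendix I; here we simply re-fit the second SVC model on the unlabeled test data as one plausible choice, and the tuple handling and names are our own simplification.

```python
import numpy as np  # fit_svc is the helper sketched earlier in this section

def imp_c(Xtr, Ytr, Utr, Xte, Ute, P, k, R, S, h=0.1):
    """Continuous IMP estimator (25)-(26) for one candidate tuple (P, k, R, S).

    P, R, S are index lists with P a subset of R and R a subset of S; k is not in P.
    """
    Zcols = [j for j in S if j not in P]
    ZVcols = [j for j in R if j not in P]
    # two SVC fits on the training data: one for Y, one for V = X_k
    beta, M = fit_svc(Ytr, Xtr[:, P], Xtr[:, Zcols], Utr, h)
    _, M_V = fit_svc(Xtr[:, k], Xtr[:, P], Xtr[:, ZVcols], Utr, h)
    # matching parameter w_hat in (26), and the residual for the first IMP score
    D = np.column_stack([Xtr[:, P], M_V])
    w_hat = np.linalg.lstsq(D, M, rcond=None)[0]
    residual = M - D @ w_hat
    # on the test data, M_V can be re-estimated without labels (X_k is observed)
    _, M_V_te = fit_svc(Xte[:, k], Xte[:, P], Xte[:, ZVcols], Ute, h)
    # prediction (25): beta is reused from the training fit since it is invariant
    Y_hat = np.column_stack([Xte[:, P], M_V_te]) @ w_hat + Xte[:, Zcols] @ beta
    return Y_hat, residual
```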

Remark 7.

The IMPd\text{IMP}_{\text{d}} and IMPc\text{IMP}_{\text{c}} estimators are similar in spirit: IMPc\text{IMP}_{\text{c}} relies on the first and second matching properties to identify WW (so that we can reuse ZZ in (25)), whereas IMPd\text{IMP}_{\text{d}} directly tests the IMP since its estimation process treats all the coefficients as varying coefficients (also see the proof of Corollary 4).

6 Asymptotic Generalization Error

In this section, we provide the asymptotic generalization errors (as n,mn,m\to\infty) of the IMPc\text{IMP}_{\text{c}} and IMPd\text{IMP}_{\text{d}} estimators for (k,R,S)(k,R,S)’s that satisfy IMPs, i.e., (k,R,S)(k,R,S)\in\mathcal{I}_{\mathcal{M}}. Recall that σ2=𝖤[(Nτ)2]\sigma^{2}=\mathsf{E}[(N^{\tau})^{2}] is the population generalization error of the IMP estimators. Due to the estimation on both the training and test data, the asymptotic generalization error decomposes into error terms depending on the training data size nn and the test data size mm as follows. Let cn={log(1/h)nh}1/2+h2c_{n}=\left\{\frac{\log(1/h)}{nh}\right\}^{1/2}+h^{2}, where hh is the kernel bandwidth (see Appendix G for details).

Theorem 3.

For any (k,R,S)(k,R,S)\in\mathcal{I}_{\mathcal{M}}, under the technical assumptions in Appendix H, the asymptotic generalization error of the IMPc\text{IMP}_{\text{c}} estimator is given by

1mi=1m(Y^iτYiτ)2=σ2+Op(cnn1/2)+Op(cmm1/2).\frac{1}{m}\sum_{i=1}^{m}(\hat{Y}^{\tau}_{i}-Y_{i}^{\tau})^{2}=\sigma^{2}+O_{p}(c_{n}\vee n^{-1/2})+O_{p}(c_{m}\vee m^{-1/2}).

The following corollary considers the setting where the amounts of unlabeled training and test data grow at a faster rate than the number of labels in the training data. In this case, the generalization error due to the estimation on the test data vanishes.

Corollary 3.

Given i.i.d. training data of size nn with 0<ln<n0<l_{n}<n labels and test data of size mm, if max(lnn,lnm)0\max(\frac{l_{n}}{n},\frac{l_{n}}{m})\to 0 as min(m,n)\min(m,n)\to\infty, under the technical assumptions in Appendix H, the asymptotic generalization error of the IMPc\text{IMP}_{\text{c}} estimator is given by

1mi=1m(Y^iτYiτ)2=σ2+Op(clnln1/2).\frac{1}{m}\sum_{i=1}^{m}(\hat{Y}^{\tau}_{i}-Y_{i}^{\tau})^{2}=\sigma^{2}+O_{p}(c_{l_{n}}\vee l_{n}^{-1/2}).

The setting of discrete environments can be viewed as a special case of continuous environments, where the error term cnn1/2c_{n}\vee n^{-1/2} due to the kernel estimation procedure is replaced by an error term from multiple OLS estimations.

Corollary 4.

For any (k,R,S)(k,R,S)\in\mathcal{I}_{\mathcal{M}}, under the technical assumptions in Appendix H, the asymptotic generalization error of the IMPd\text{IMP}_{\text{d}} estimator is given by

1mi=1m(Y^iτYiτ)2=σ2+Op(an)+Op(am),\frac{1}{m}\sum_{i=1}^{m}(\hat{Y}^{\tau}_{i}-Y_{i}^{\tau})^{2}=\sigma^{2}+O_{p}(a_{n})+O_{p}(a_{m}),

where an=(minetrainne)1/2a_{n}=(\min_{e\in\mathcal{E}^{\text{train}}}n_{e})^{-1/2} and am=m1/2a_{m}=m^{-1/2}.

This asymptotic generalization error depends heavily on the environment with the smallest sample size, which further supports that IMPd\text{IMP}_{\text{d}} should not be employed in continuous environment settings.

7 Experiments

The prediction performance is measured by the mean residual sum of squares (RSS) on the test environments. We compare our method with several baseline methods: ordinary least squares (OLS), stabilized regression (SR) [21], and anchor regression (AR) [22]. We have also compared with domain invariant projection (DIP) [29], conditional invariance penalty (CIP) [34], conditional invariant residual matching (CIRM) [28], and invariant risk minimization (IRM) [23]; it turns out that the empirical performance of these methods is not as competitive as that of the other baselines in our experimental settings, so we do not report them below.

The two IMP scores lead to two versions of our algorithm, and we refer to the first one as IMP and the second one as IMPinv\text{IMP}_{\text{inv}} (since it tests the invariance of the noise mean and variance). We focus on linear functions fif_{i}’s in Algorithm 1, namely, we use the IMP estimators. For the profile likelihood estimation, we adopt the Epanechnikov kernel k(u)=0.75max(1u2,0)k(u)=0.75\max(1-u^{2},0) with the bandwidth fixed at 0.10.1. We test DIP, CIP, CIRM, and their variants provided in [28] with the default parameters. For anchor regression, we use a 55-fold cross-validation procedure to select the hyper-parameter γ\gamma from {0,0.05,0.1,,0.5}\{0,0.05,0.1,\ldots,0.5\}. The significance levels are fixed at 0.050.05 for all methods. We randomly simulate 500500 data sets for each experiment, unless mentioned otherwise.

7-A Discrete Environments

First, we generate linear SCMs e\mathcal{M}^{e}’s without interventions. For each etrain={1,,5}e\in\mathcal{E}^{\text{train}}=\{1,\ldots,5\} or etest={6,,10}e\in\mathcal{E}^{\text{test}}=\{6,\ldots,10\}, we randomly generate a linear SCM with 9 variables as follows. The graph 𝒢(e)\mathcal{G}(\mathcal{M}^{e}) is specified by a lower triangular matrix of i.i.d. Bernoulli(1/2)\mathrm{Bernoulli}(1/2) random variables. The response YY is randomly selected from the 99 variables and we require that YY has at least one parent and one child in 𝒢(e)\mathcal{G}(\mathcal{M}^{e}). When XX and YY are both intervened, we randomly choose a child of YY to not be intervened. For each linear SCM, the non-zero coefficients are sampled from Unif[1.5,0.5][0.5,1.5]\mathrm{Unif}[-1.5,-0.5]\cup[0.5,1.5] and the noise variables are standard normal. For each training or test environment, we simulate i.i.d. data of sample size 300300.
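A sketch of this simulation protocol is given below (lower-triangular Bernoulli(1/2) graph, coefficients from Unif[-1.5,-0.5] ∪ [0.5,1.5], standard normal noise). The rejection of graphs in which the response lacks a parent or a child is omitted for brevity, and all names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_linear_scm(d=9):
    """Lower-triangular Bernoulli(1/2) adjacency; nonzero coefficients drawn
    from Unif[0.5, 1.5] with a random sign, i.e., Unif[-1.5,-0.5] U [0.5,1.5]."""
    adj = np.tril(rng.binomial(1, 0.5, size=(d, d)), k=-1)
    signs = rng.choice([-1.0, 1.0], size=(d, d))
    return adj * signs * rng.uniform(0.5, 1.5, size=(d, d))

def sample_env(coefs, n=300):
    """Draw n i.i.d. samples by following the causal order (0, ..., d-1)."""
    d = coefs.shape[0]
    data = np.zeros((n, d))
    for j in range(d):
        # lower-triangular coefficients => only earlier columns enter assignment j
        data[:, j] = data @ coefs[j] + rng.standard_normal(n)
    return data

coefs = random_linear_scm()
train_env = sample_env(coefs)   # one environment; interventions are added on top of this
```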

7-A1 Interventions on XX

Since the baseline methods have been examined extensively under shift interventions, we focus on shift interventions on XX for comparison. The general interventions on XX will be considered in Section 7-C. Specifically, for each training environment, we randomly select 44 predictors to be intervened through shifts sampled from Unif[2,2]\mathrm{Unif}[-2,2]. For each test environment, the shifts are sampled from Unif[10,10]\mathrm{Unif}[-10,10]. In Fig. 6, IMPinv\text{IMP}_{\text{inv}} performs similarly to SR since they share a similar idea when only XX is intervened. Due to the averaging steps of IMP, IMPinv\text{IMP}_{\text{inv}}, and SR, they have smaller variances compared with OLS and AR (a similar result has been reported in [19]).

Figure 6: Experiment 7.A.1 with discrete environments: Only a subset of XX is intervened.

7-A2 Interventions on YY

We consider the response YY to be intervened through both the coefficients and shifts. We randomly select npUnif{1,,|PA(Y)|}n_{p}\sim\mathrm{Unif}\{1,\ldots,|PA(Y)|\} parents of YY to have varying coefficients. For each training environment, we add perturbation terms sampled from Unif[2,2]\mathrm{Unif}[-2,2] to the original coefficients. For each test environment, the perturbations are sampled from Unif[10,10]\mathrm{Unif}[-10,10]. The shift intervention on YY is the same as the shift interventions on XX in Section 7-A1. In this setting, since none of the baseline methods allows interventions on YY through the coefficients, they cannot even improve upon OLS. In Fig. 7, IMP performs slightly better than IMPinv\text{IMP}_{\text{inv}}, which may be because the IMP method aims to find all possible IMPs, whereas IMPinv\text{IMP}_{\text{inv}} only looks for IMPs that imply invariance. To further examine our test procedures for identifying IMPs, we check whether the sufficient conditions in Theorem 1 are satisfied for the estimated IMPs (k^,R^,S^)(\hat{k},\hat{R},\hat{S})’s, summarized in Table 1. We observe that the conditions on (k,S)(k,S) are satisfied in the majority of cases, while the condition on RR holds with a noticeably lower empirical probability. Note that our sufficient conditions might be conservative. Moreover, due to the randomly selected coefficients, the intervention strength can be quite weak in some cases, making it challenging to distinguish true IMPs from non-IMPs. Fortunately, the averaging step helps to mitigate inaccurate predictions made by non-IMPs.

  P^(k^PE)\hat{P}(\hat{k}\not\in PE) P^(PER^)\hat{P}(PE\subseteq\hat{R}) P^(PES^)\hat{P}(PE\subseteq\hat{S})
  IMP 0.9610 (0.0796) 0.7503 (0.2511) 0.9376 (0.1079)
 IMPinv\text{IMP}_{\text{inv}} 0.9603 (0.0850) 0.7060 (0.2617) 0.9373 (0.1090)
 
TABLE 1: Empirical results on the estimated IMP vs. conditions in Theorem 1 in terms of empirical probability and standard deviation (in parenthesis): Only YY is intervened.
Figure 7: Experiment 7.A.2 with discrete environments: Only YY is intervened.

7-A3 Interventions on both XX and YY

The setting of interventions on both XX and YY is simply a combination of the two settings above. In this challenging setting, our method outperforms the baselines by a large margin (Fig. 9). Since the sufficient conditions in Theorem 2 are not necessary in some settings, we verify the slightly relaxed conditions discussed in Remark 10. The estimation task is more challenging in this setting than when only XX or only YY is intervened. The results in Table 2 indicate that a large percentage of the estimated IMPs conform to the conditions. In Fig. 8, we illustrate the difference between the predictions made by the estimated IMPs and those made by the other (k,R,S)(k,R,S)’s that are rejected by our test procedures (which we call the estimated non-IMPs). For the estimated IMPs, the gap between the prediction error on the training environments and that on the test environments is relatively small. However, the gap can be huge for the majority of the estimated non-IMPs.

  P^(k^PEXint)\hat{P}(\hat{k}\not\in PE\cup X^{\text{int}}) P^(A)\hat{P}(A) P^(B)\hat{P}(B)
  IMP 0.8767 (0.2749) 0.7170 (0.3515) 0.7003 (0.4412)
 IMPinv\text{IMP}_{\text{inv}} 0.8558 (0.3144) 0.6571 (0.4085) 0.6853 (0.4460)
 
TABLE 2: Empirical results on the estimated IMP vs. conditions in Theorem 2 in terms of empirical probability and standard deviation (in parenthesis): Both XX and YY are intervened. Events AA and BB are defined by A={PER^}{(Xint(Y)Xint(Xk^))R^=}A=\{PE\subseteq\hat{R}\}\cap\{(X^{\text{int}}(Y)\cup X^{\text{int}}(X_{\hat{k}}))\cap\hat{R}=\varnothing\} and B={PES^}{Xint(Y)S^=}B=\{PE\subseteq\hat{S}\}\cap\{X^{\text{int}}(Y)\cap\hat{S}=\varnothing\}, respectively.
Figure 8: A generic example from Experiment 7.A.3. We compare the estimated IMPs (left plot in red) with estimated non-IMPs (right plot in blue) in terms of prediction performance in training vs. test environments. For visibility purposes, we only present a portion of estimated non-IMPs with low prediction errors.
Figure 9: Experiment 7.A.3 with discrete environments: Both XX and YY are intervened.

7-B Interventions on both XX and YY (continuous)

In this more challenging setting, we only compare with OLS, since AR only considers shift interventions while the other baselines are not developed for continuous environments; recall also that IMPinv\text{IMP}_{\text{inv}} is proposed for discrete environments. First, we define {U1,,U800}\{U_{1},\ldots,U_{800}\} sampled from Unif[0,1]\mathrm{Unif}[0,1] and {U1τ,,U800τ}\{U_{1}^{\tau},\ldots,U_{800}^{\tau}\} sampled from Unif[1,2]\mathrm{Unif}[1,2] as the training and test environments, respectively. Similar to the setting of discrete environments, we randomly generate the linear SCMs without interventions first and then add interventions to the model. Due to the high computational complexity, we focus on graphs with 55 nodes where 22 predictors are intervened. The interventions on the coefficients and shift interventions are defined by adding a perturbation term asin(2πwUi)a\sin(2\pi wU_{i}), where ww is sampled from Unif[0.5,2]\mathrm{Unif}[0.5,2]. The parameter aa is fixed at 22 for the training environments (i.e., the same range as for the discrete case) and 55 for the test environments. Since the parameter space is much smaller than in the previous experiments, we only generate 100100 data sets. Overall, the performance of our IMP algorithm is similar to that in the discrete setting.
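As an illustration, the environment-dependent perturbation described above can be generated as follows. This is only a sketch: whether the frequency w is shared between the training and test environments for a given parameter is our reading of the setup, and all names are ours.

```python
import numpy as np

rng = np.random.default_rng(1)

U_train = rng.uniform(0.0, 1.0, size=800)   # training environments: U ~ Unif[0, 1]
U_test = rng.uniform(1.0, 2.0, size=800)    # test environments:     U ~ Unif[1, 2]

w = rng.uniform(0.5, 2.0)                   # frequency for one intervened parameter

def perturbed(theta0, U, a):
    # environment-dependent parameter: theta0 + a * sin(2*pi*w*U)
    return theta0 + a * np.sin(2.0 * np.pi * w * U)

coef_train = perturbed(1.0, U_train, a=2.0)  # a = 2 on the training side
coef_test = perturbed(1.0, U_test, a=5.0)    # a = 5 on the test side
```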

Figure 10: Experiment 7.B with continuous environments: Both XX and YY are intervened.

7-C Robustness

In the previous experiments, we only consider shift interventions on XX and interventions on YY other than the noise variance. In this experiment, we consider an extreme case where only a child of YY with the highest causal ordering is not intervened (i.e., Assumption 2 is satisfied), while all other variables are intervened through every parameter. Specifically, a shift intervention or an intervention on the coefficient is defined by adding a perturbation term to the original parameter. The perturbation term is sampled from Unif[2,2]\mathrm{Unif}[-2,2] for the training data, and from Unif[5,5]\mathrm{Unif}[-5,5] for the test data. The intervened noise variances are sampled from Unif[0.75,1.25]\mathrm{Unif}[0.75,1.25] for each training environment, and from Unif[0.5,1.5]\mathrm{Unif}[0.5,1.5] for each test environment. To test how sensitive our method is to violations of Assumption 2 in this challenging setting, we gradually add interventions to the child of YY that is not intervened, where the shifts and coefficient interventions are sampled from Unif[2λ,2λ]\mathrm{Unif}[-2\lambda,2\lambda] and Unif[5λ,5λ]\mathrm{Unif}[-5\lambda,5\lambda] for the training and test environments, respectively. The intervened noise variances are sampled from Unif[10.25λ,1+0.25λ]\mathrm{Unif}[1-0.25\lambda,1+0.25\lambda] for training, and from Unif[10.5λ,1+0.5λ]\mathrm{Unif}[1-0.5\lambda,1+0.5\lambda] for testing. The parameter λ[0,1]\lambda\in[0,1] controls the intervention strength. Due to the results in Section 7-A3, it would not be informative to compare with the baseline methods, so we focus on our IMP method. Note that our IMPinv\text{IMP}_{\text{inv}} is also not included since IMPs will not imply invariance in this case (see Remark 4). Overall, as shown in Fig. 11, the median of the mean RSS is not too sensitive to mild interventions on the child that was originally not intervened, but the variance increases rapidly.

Figure 11: Experiment 7.C with discrete environments: YY and every XjX_{j} are intervened.

7-D COVID Data Set

For observational data, there is often no ground truth regarding the set of variables that are intervened. If the data is not collected under a carefully designed experimental setting (e.g., the gene perturbation experiments from [8]), it is reasonable to assume that every variable is more or less intervened, especially the response. To examine our methods under real-world distribution shifts, we consider a COVID data set [58] collected at 31423142 US counties from January 22, 2020, to June 10, 2021. The data set consists of 46 predictive features that are relevant to the number of COVID cases. Some of the features are treated as fixed or time-invariant, e.g., population density, age distribution, and education level, and others are temporal features, e.g., temperature, social distancing grade, and the number of COVID tests performed.

There are plenty of prediction tasks that can potentially be designed for this data set. In our experiment, we focus on the prediction of the number of COVID cases (measured in thousands) using the 1212 temporal features for the time interval from March 1, 2020, to September 30, 2020 (214214 days). For different regions of the US, features such as population density can differ greatly, making it challenging for a prediction model to perform well on a diverse set of regions. Specifically, we take 55 major cities from the Western U.S. (Los Angeles, San Francisco, San Diego, Seattle, and Phoenix) as 55 training environments, and we examine our methods and the baselines on 88 cities/counties that are mostly from the Eastern U.S. (Baltimore, Boston, Philadelphia, New York County, Queens County, Houston, Miami, and Chicago); only the data from Manhattan (New York County) and Queens are available for New York City in this data set. According to the total number of COVID cases in the considered time interval, we partition (using 5050k as the threshold) the cities/counties for testing into two groups in Table 3, which makes the results more informative as most of the methods perform worse on cities/counties with higher total cases. Since no invariance in terms of mean and variance can be identified by IMPinv\text{IMP}_{\text{inv}}, we focus on the IMP method. Overall, IMP and CIP show similar performance for the lower total cases, while the performance of CIP degrades more over the higher total cases compared with IMP. Since real-world interventions are often beyond shift interventions, AR does not improve upon OLS for this data set. It is noteworthy that no stable sets are found by SR at the significance level of 0.050.05, implying that the invariance principle is likely to be violated. IRM shows less competitive performance compared with the other methods, so its results are not presented in Table 3.

 Cities/Counties (total cases) IMP OLS AR SR CIP DIP CIRM
 Baltimore (18k) 0.1292 0.0937 0.0379 0.0978 0.0184 0.9199 0.3503
Boston (26k) 0.0238 0.0503 0.0351 0.0543 0.0192 0.5820 0.4605
Philadelphia (32k) 0.1532 0.1191 0.0701 0.0970 0.0353 1.0407 0.4217
NY County (34k) 0.1611 0.4088 0.5606 0.3449 0.3537 2.4055 2.6372
 average mean RSS 0.1168 0.1680 0.1759 0.1485 0.1066 1.2370 0.9674
  Queens County (73k) 0.1126 0.1020 0.0882 0.1109 0.0673 0.9389 0.9647
Houston (143k) 1.0884 1.2042 1.2200 1.1673 1.3705 1.7802 1.4982
Chicago (146k) 0.3687 0.4489 0.4026 0.4099 0.4562 0.6328 0.3499
Miami (171k) 0.4534 0.4621 0.5170 0.4025 0.6011 1.4465 0.8004
 average mean RSS 0.5058 0.5543 0.5548 0.5226 0.6238 1.1996 0.9033
 
TABLE 3: Experimental results on the COVID data set.

8 Discussion

To deal with general interventions on the response, we introduce the IMP that allows for an alternative form of invariance when the traditional invariance principle fails. We provide explicit characterizations of the IMP under different intervention settings and present an asymptotic generalization error analysis for both the discrete environment and the more challenging continuous environment settings, supported by our empirical studies.

Our work is motivated by allowing general interventions on the response, and it can be extended in several directions. First, we observe a connection between the IMP and self-supervised learning schemes (e.g., [59] from natural language processing). Specifically, the prediction of XkX_{k} using XRX_{R} corresponds to the pretext (or pretraining) task, and the linear relation in the IMP solves the downstream prediction task. We believe that methods developed through causal modeling, including our IMP method, will bring new opportunities for self-supervised learning. We have discussed several heuristic approaches to alleviate the computational complexity issue; it would be worthwhile to analyze such methods and provide theoretical guarantees (e.g., exploiting sparsity beyond the setting in [56]). Our asymptotic analysis of the generalization error is developed given that IMPs are correctly identified. It would be interesting to further analyze the consistency of the set of IMPs ^\hat{\mathcal{I}} identified by the proposed algorithms or new efficient algorithms.

References

  • [1] J. Quinonero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence, Dataset shift in machine learning.   MIT Press, 2008.
  • [2] K. Weiss, T. M. Khoshgoftaar, and D. Wang, “A survey of transfer learning,” Journal of Big Data, vol. 3, no. 1, pp. 1–40, 2016.
  • [3] G. Csurka, “Domain adaptation for visual applications: A comprehensive survey,” arXiv preprint arXiv:1702.05374, 2017.
  • [4] J. Pearl, Causality.   Cambridge University Press, 2009.
  • [5] J. Peters, D. Janzing, and B. Schölkopf, Elements of causal inference: foundations and learning algorithms.   The MIT Press, 2017.
  • [6] B. Schölkopf, D. Janzing, J. Peters, E. Sgouritsa, K. Zhang, and J. Mooij, “On causal and anticausal learning,” in 29th International Conference on Machine Learning (ICML 2012).   International Machine Learning Society, 2012, pp. 1255–1262.
  • [7] K. Zhang, B. Schölkopf, K. Muandet, and Z. Wang, “Domain adaptation under target and conditional shift,” in International Conference on Machine Learning.   PMLR, 2013, pp. 819–827.
  • [8] J. Peters, P. Bühlmann, and N. Meinshausen, “Causal inference by using invariant prediction: identification and confidence intervals,” Journal of the Royal Statistical Society. Series B (Statistical Methodology), pp. 947–1012, 2016.
  • [9] M. Rojas-Carulla, B. Schölkopf, R. Turner, and J. Peters, “Invariant models for causal transfer learning,” The Journal of Machine Learning Research, vol. 19, no. 1, pp. 1309–1342, 2018.
  • [10] C. Heinze-Deml and N. Meinshausen, “Conditional variance penalties and domain shift robustness,” arXiv preprint arXiv:1710.11469, 2017.
  • [11] P. Bühlmann, “Invariance, causality and robustness,” Statistical Science, vol. 35, no. 3, pp. 404–426, 2020.
  • [12] A. Ghassami, N. Kiyavash, B. Huang, and K. Zhang, “Multi-domain causal structure learning in linear systems,” Advances in neural information processing systems, vol. 31, 2018.
  • [13] A. Ghassami, S. Salehkaleybar, N. Kiyavash, and K. Zhang, “Learning causal structures using regression invariance,” Advances in Neural Information Processing Systems, vol. 30, 2017.
  • [14] T. Haavelmo, “The probability approach in econometrics,” Econometrica: Journal of the Econometric Society, pp. iii–115, 1944.
  • [15] J. Aldrich, “Autonomy,” Oxford Economic Papers, vol. 41, no. 1, pp. 15–34, 1989.
  • [16] G. W. Imbens and D. B. Rubin, Causal inference in statistics, social, and biomedical sciences.   Cambridge University Press, 2015.
  • [17] N. Meinshausen, “Causality from a distributional robustness point of view,” in 2018 IEEE Data Science Workshop (DSW).   IEEE, 2018, pp. 6–10.
  • [18] C. Heinze-Deml, J. Peters, and N. Meinshausen, “Invariant causal prediction for nonlinear models,” Journal of Causal Inference, vol. 6, no. 2, 2018.
  • [19] N. Pfister, P. Bühlmann, and J. Peters, “Invariant causal prediction for sequential data,” Journal of the American Statistical Association, vol. 114, no. 527, pp. 1264–1276, 2019.
  • [20] S. Magliacane, T. Van Ommen, T. Claassen, S. Bongers, P. Versteeg, and J. M. Mooij, “Domain adaptation by using causal inference to predict invariant conditional distributions,” Advances in Neural Information Processing Systems, vol. 31, 2018.
  • [21] N. Pfister, E. G. Williams, J. Peters, R. Aebersold, and P. Bühlmann, “Stabilizing variable selection and regression,” The Annals of Applied Statistics, vol. 15, no. 3, pp. 1220–1246, 2021.
  • [22] D. Rothenhäusler, N. Meinshausen, P. Bühlmann, and J. Peters, “Anchor regression: Heterogeneous data meet causality,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 83, no. 2, pp. 215–246, 2021.
  • [23] M. Arjovsky, L. Bottou, I. Gulrajani, and D. Lopez-Paz, “Invariant risk minimization,” arXiv preprint arXiv:1907.02893, 2019.
  • [24] E. Rosenfeld, P. Ravikumar, and A. Risteski, “The risks of invariant risk minimization,” in International Conference on Learning Representations, vol. 9, 2021.
  • [25] P. Kamath, A. Tangella, D. Sutherland, and N. Srebro, “Does invariant risk minimization capture invariance?” in International Conference on Artificial Intelligence and Statistics.   PMLR, 2021, pp. 4069–4077.
  • [26] R. Christiansen, N. Pfister, M. E. Jakobsen, N. Gnecco, and J. Peters, “A causal framework for distribution generalization,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
  • [27] K. Du and Y. Xiang, “An invariant matching property for distribution generalization under intervened response,” in European Conference on Signal Processing (EUSIPCO) 2022 arXiv:2205.09162, 2022.
  • [28] Y. Chen and P. Bühlmann, “Domain adaptation under structural causal models,” Journal of Machine Learning Research, vol. 22, pp. 1–80, 2021.
  • [29] M. Baktashmotlagh, M. T. Harandi, B. C. Lovell, and M. Salzmann, “Unsupervised domain adaptation by domain invariant projection,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 769–776.
  • [30] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, “Domain-adversarial training of neural networks,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 2096–2030, 2016.
  • [31] Y. Li, M. Murias, S. Major, G. Dawson, and D. Carlson, “On target shift in adversarial domain adaptation,” in The 22nd International Conference on Artificial Intelligence and Statistics.   PMLR, 2019, pp. 616–625.
  • [32] R. Tachet des Combes, H. Zhao, Y.-X. Wang, and G. J. Gordon, “Domain adaptation with conditional distribution matching and generalized label shift,” Advances in Neural Information Processing Systems, vol. 33, pp. 19 276–19 289, 2020.
  • [33] M. Gong, K. Zhang, T. Liu, D. Tao, C. Glymour, and B. Schölkopf, “Domain adaptation with conditional transferable components,” in International Conference on Machine Learning.   PMLR, 2016, pp. 2839–2848.
  • [34] C. Heinze-Deml and N. Meinshausen, “Conditional variance penalties and domain shift robustness,” Machine Learning, vol. 110, no. 2, pp. 303–348, 2021.
  • [35] A. Storkey, “When training and test sets are different: characterizing learning transfer,” Dataset shift in machine learning, vol. 30, pp. 3–28, 2009.
  • [36] M. Sugiyama and M. Kawanabe, Machine learning in non-stationary environments: Introduction to covariate shift adaptation, 2012.
  • [37] Z. Lipton, Y.-X. Wang, and A. Smola, “Detecting and correcting for label shift with black box predictors,” in International Conference on Machine Learning.   PMLR, 2018, pp. 3122–3130.
  • [38] K. Azizzadenesheli, A. Liu, F. Yang, and A. Anandkumar, “Regularized learning for domain adaptation under label shifts,” in International Conference on Learning Representations, 2018.
  • [39] S. Garg, Y. Wu, S. Balakrishnan, and Z. Lipton, “A unified view of label shift estimation,” Advances in Neural Information Processing Systems, vol. 33, pp. 3290–3300, 2020.
  • [40] S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira, “Analysis of representations for domain adaptation,” Advances in Neural Information Processing Systems, vol. 19, 2006.
  • [41] J. A. Bagnell, “Robust supervised learning,” in AAAI, 2005, pp. 714–719.
  • [42] Z. Hu and L. J. Hong, “Kullback-Leibler divergence constrained distributionally robust optimization,” Available at Optimization Online, pp. 1695–1724, 2013.
  • [43] A. Sinha, H. Namkoong, and J. Duchi, “Certifying some distributional robustness with principled adversarial training,” in International Conference on Learning Representations, 2018.
  • [44] R. Gao, X. Chen, and A. J. Kleywegt, “Distributional robustness and regularization in statistical learning,” arXiv preprint arXiv:1712.06050, 2017.
  • [45] J. Lee and M. Raginsky, “Minimax statistical learning with Wasserstein distances,” Advances in Neural Information Processing Systems, vol. 31, 2018.
  • [46] J. C. Duchi and H. Namkoong, “Learning models with uniform performance via distributionally robust optimization,” The Annals of Statistics, vol. 49, no. 3, pp. 1378–1406, 2021.
  • [47] I. Goodfellow, P. McDaniel, and N. Papernot, “Making machine learning robust against adversarial inputs,” Communications of the ACM, vol. 61, no. 7, pp. 56–66, 2018.
  • [48] A. Raghunathan, J. Steinhardt, and P. S. Liang, “Semidefinite relaxations for certifying robustness to adversarial examples,” Advances in Neural Information Processing Systems, vol. 31, 2018.
  • [49] J. Pearl and E. Bareinboim, “External validity: From do-calculus to transportability across populations,” in Probabilistic and Causal Inference: The Works of Judea Pearl, 2022, pp. 451–482.
  • [50] K. Du and Y. Xiang, “Causal inference from slowly varying nonstationary processes,” arXiv preprint arXiv:2012.13025, 2020.
  • [51] Y. Xiang, J. Ding, and V. Tarokh, “Estimation of the evolutionary spectra with application to stationarity test,” IEEE Transactions on Signal Processing, vol. 67, no. 5, pp. 1353–1365, 2019.
  • [52] K. Du, Y. Xiang, and I. Soloveychik, “Identifying direct causes using intervened target variable,” arXiv preprint arXiv:2307.07736, 2023.
  • [53] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 58, no. 1, pp. 267–288, 1996.
  • [54] P. Bartlett, Y. Freund, W. S. Lee, and R. E. Schapire, “Boosting the margin: A new explanation for the effectiveness of voting methods,” The annals of statistics, vol. 26, no. 5, pp. 1651–1686, 1998.
  • [55] S. Shimizu, P. O. Hoyer, A. Hyvärinen, A. Kerminen, and M. Jordan, “A linear non-gaussian acyclic model for causal discovery.” Journal of Machine Learning Research, vol. 7, no. 10, 2006.
  • [56] K. Du and Y. Xiang, “Generalized invariant matching property via lasso,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2023, pp. 1–5.
  • [57] J. Fan and T. Huang, “Profile likelihood inferences on semiparametric varying-coefficient partially linear models,” Bernoulli, vol. 11, no. 6, pp. 1031–1057, 2005.
  • [58] A. Haratian, H. Fazelinia, Z. Maleki, P. Ramazi, H. Wang, M. A. Lewis, R. Greiner, and D. Wishart, “Dataset of COVID-19 outbreak and potential predictive features in the USA,” Data in Brief, vol. 38, p. 107360, 2021.
  • [59] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
  • [60] M. S. Bartlett, “An inverse matrix adjustment arising in discriminant analysis,” The Annals of Mathematical Statistics, vol. 22, no. 1, pp. 107–111, 1951.
  • [61] Y.-p. Mack and B. W. Silverman, “Weak and strong uniform consistency of kernel regression estimates,” Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, vol. 61, no. 3, pp. 405–415, 1982.

Appendix A Proof of Proposition 1

Let train={e1,,en}\mathcal{E}^{\text{train}}=\{e_{1},\ldots,e_{n}\}. For a tuple (k,R,S)(k,R,S) that satisfies the IMP, let

𝒀^=(𝖤[Ye1|XSe1],,𝖤[Yen|XSen]),\displaystyle\hat{\boldsymbol{Y}}=(\mathsf{E}[Y^{e_{1}}|X_{S}^{e_{1}}],\ldots,\mathsf{E}[Y^{e_{n}}|X_{S}^{e_{n}}])^{\top},
𝑿~=[(Xe1)𝖤l[Xke1|XRe1](Xen)𝖤l[Xken|XRen]]:=[𝑿v],\boldsymbol{\tilde{X}}=\begin{bmatrix}(X^{e_{1}})^{\top}&\mathsf{E}_{l}[X_{k}^{e_{1}}|X_{R}^{e_{1}}]\\ \vdots&\vdots\\ (X^{e_{n}})^{\top}&\mathsf{E}_{l}[X_{k}^{e_{n}}|X_{R}^{e_{n}}]\end{bmatrix}:=\begin{bmatrix}\boldsymbol{X}&v\end{bmatrix},

where the rows of 𝑿~\boldsymbol{\tilde{X}} are independent. According to (12), we have 𝒀^=𝑿~θ\hat{\boldsymbol{Y}}=\boldsymbol{\tilde{X}}\theta. Then, if 𝖤[𝑿~𝑿~]\mathsf{E}[\boldsymbol{\tilde{X}}^{\top}\boldsymbol{\tilde{X}}] is invertible, we have

(𝖤[𝑿~𝑿~])1𝖤[𝑿~𝒀^]=(𝖤[𝑿~𝑿~])1𝖤[𝑿~𝑿~]θ=θ.\displaystyle(\mathsf{E}[\boldsymbol{\tilde{X}}^{\top}\boldsymbol{\tilde{X}}])^{-1}\mathsf{E}[\boldsymbol{\tilde{X}}^{\top}\hat{\boldsymbol{Y}}]=(\mathsf{E}[\boldsymbol{\tilde{X}}^{\top}\boldsymbol{\tilde{X}}])^{-1}\mathsf{E}[\boldsymbol{\tilde{X}}^{\top}\boldsymbol{\tilde{X}}]\theta=\theta. (27)

Now, we prove the invertibility. Observe that

𝖤[𝑿~𝑿~]=[𝖤[𝑿𝑿]𝖤[𝑿v]𝖤[v𝑿]𝖤[vv]],\mathsf{E}[\boldsymbol{\tilde{X}}^{\top}\boldsymbol{\tilde{X}}]=\begin{bmatrix}\mathsf{E}[\boldsymbol{X}^{\top}\boldsymbol{X}]&\mathsf{E}[\boldsymbol{X}^{\top}v]\\ \mathsf{E}[v^{\top}\boldsymbol{X}]&\mathsf{E}[v^{\top}v]\end{bmatrix},

where 𝖤[𝑿𝑿]=i=1n𝖤[Xei(Xei)]\mathsf{E}[\boldsymbol{X}^{\top}\boldsymbol{X}]=\sum_{i=1}^{n}\mathsf{E}[X^{e_{i}}(X^{e_{i}})^{\top}] is invertible since it is a sum of positive-definite matrices. Then, 𝖤[𝑿~𝑿~]\mathsf{E}[\boldsymbol{\tilde{X}}^{\top}\boldsymbol{\tilde{X}}] is invertible if and only if

𝖤[vv]𝖤[v𝑿]𝖤1[𝑿𝑿]𝖤[𝑿v]0.\mathsf{E}[v^{\top}v]-\mathsf{E}[v^{\top}\boldsymbol{X}]\mathsf{E}^{-1}[\boldsymbol{X}^{\top}\boldsymbol{X}]\mathsf{E}[\boldsymbol{X}^{\top}v]\neq 0.

This is equivalent to

𝖤[(v𝑿β)(v𝑿β)]0,\mathsf{E}[(v-\boldsymbol{X}\beta)^{\top}(v-\boldsymbol{X}\beta)]\neq 0,

where β:=𝖤1[𝑿𝑿]𝖤[𝑿v]\beta:=\mathsf{E}^{-1}[\boldsymbol{X}^{\top}\boldsymbol{X}]\mathsf{E}[\boldsymbol{X}^{\top}v]. This is true since there is no bdb\in\mathbb{R}^{d} such that v=𝑿bv=\boldsymbol{X}b almost surely by our assumption in (15). Therefore, θ\theta is uniquely determined by (27).

Appendix B Problem Formulation Using an Environmental Random Variable

By introducing an environmental random variable EallE\in\mathcal{E}^{\text{all}}, we define a mixture of e\mathcal{M}^{e}’s, ealle\in\mathcal{E}^{\text{all}}, as follows,

:\displaystyle\mathcal{M}: X=γ(E)Y+B(E)X+εX(E)\displaystyle X=\gamma(E)Y+B(E)X+\varepsilon_{X}(E)
:\displaystyle\mathcal{M}: Y=β(E)X+εY(E),\displaystyle Y=\beta^{\top}(E)X+\varepsilon_{Y}(E),

where EE is a root node in 𝒢()\mathcal{G}(\mathcal{M}) and the noise variables are jointly independent given EE. We do not specify the distribution of EE but assume that EE does not have a degenerate distribution. Under this formulation, the invariance property (5) is equivalent to

YE|ϕ(E,X),Y\mathchoice{\mathrel{\hbox to0.0pt{$\displaystyle\perp$\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{$\textstyle\perp$\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{$\scriptstyle\perp$\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{$\scriptscriptstyle\perp$\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}E\,|\,\phi(E,X), (28)

and the invariant, first, and second matching properties can be equivalently written as

𝖤l[Y|XS,E=e]\displaystyle\mathsf{E}_{l}[Y|X_{S},E=e] =λ𝖤l[Xk|XR,E=e]+ηX,\displaystyle=\lambda\mathsf{E}_{l}[X_{k}|X_{R},E=e]+\eta^{\top}X, (29)
𝖤l[Y|XS,E=e]\displaystyle\mathsf{E}_{l}[Y|X_{S},E=e] =λY𝖤[Y|XPA(Y),E=e]+ηYX,\displaystyle=\lambda_{Y}\mathsf{E}[Y|X_{PA(Y)},E=e]+\eta_{Y}^{\top}X, (30)
𝖤l[Xk|XR,E=e]\displaystyle\mathsf{E}_{l}[X_{k}|X_{R},E=e] =λX𝖤[Y|XPA(Y),E=e]+ηXX,\displaystyle=\lambda_{X}\mathsf{E}[Y|X_{PA(Y)},E=e]+\eta_{X}^{\top}X, (31)

for ealle\in\mathcal{E}^{\text{all}}. As a special case of \mathcal{M}, we define a mixture of e,1\mathcal{M}^{e,1}’s as

1:\displaystyle\mathcal{M}^{1}: X=γY+BX+εX\displaystyle X=\gamma Y+BX+\varepsilon_{X} (32)
1:\displaystyle\mathcal{M}^{1}: Y=(α(E)+β)X+εY,\displaystyle Y=(\alpha(E)+\beta)^{\top}X+\varepsilon_{Y}, (33)

where only YY is intervened through the coefficients.

The proofs of Theorems 1 and 2 and Corollaries 1 and 2 will be presented under this formulation.

Appendix C Proof of Theorem 1

The problem formulation and necessary notation for the proof are introduced in Appendix B. For the first part, let Z(E)=α(E)XZ(E)=\alpha^{\top}(E)X denote an additional node in the acyclic graph 𝒢(1)\mathcal{G}(\mathcal{M}^{1}); then the assignment of YY in (33) becomes

Y=Z(E)+βX+εY,Y=Z(E)+\beta^{\top}X+\varepsilon_{Y}, (34)

where EE is no longer a parent of YY. Note that Z(E)=𝖤[Y|XPA(Y),E]βXZ(E)=\mathsf{E}[Y|X_{PA(Y)},E]-\beta^{\top}X as we assume both XX and YY have zero means. Since EE is a root node, observe that EE and YY can be d-connected through only two types of paths as follows,

  1. EZ(E)YE\to Z(E)\to Y,

  2. EZ(E)XiXlYE\to Z(E)\leftarrow X_{i}\to\cdots\to X_{l}\leftarrow\cdots\leftarrow Y,

where iPEi\in PE, and XiXlYX_{i}\to\cdots\to X_{l}\leftarrow\cdots\leftarrow Y is a V-structure for some lDE(i)DE(Y)l\in DE(i)\cap DE(Y). Note that the second type of path does not exist if Assumption 1 is not satisfied or CH(i)YCH(i)\setminus Y is empty.

We start by showing that the d-separation Y𝒢E|{Z(E),XS}Y\mathchoice{\mathrel{\hbox to0.0pt{$\displaystyle\perp$\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{$\textstyle\perp$\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{$\scriptstyle\perp$\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{$\scriptscriptstyle\perp$\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}_{\mathcal{G}}E\,|\,\{Z(E),X_{S}\} holds given PESPE\subseteq S. First, the first path is immediately blocked by Z(E)Z(E). Second, for any sCH(i)Ys\in CH(i)\setminus Y, the path EZ(E)XiXsE\to Z(E)\leftarrow X_{i}\to X_{s} is blocked by {Z(E),Xi}\{Z(E),X_{i}\}. Thus, the second path is blocked given Z(E)Z(E) and XSX_{S}. According to the Markov property of SCMs [4], the d-separation Y𝒢E|{Z(E),XS}Y\mathchoice{\mathrel{\hbox to0.0pt{$\displaystyle\perp$\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{$\textstyle\perp$\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{$\scriptstyle\perp$\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{$\scriptscriptstyle\perp$\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}_{\mathcal{G}}E\,|\,\{Z(E),X_{S}\} implies

YE|{Z(E),XS}.Y\mathchoice{\mathrel{\hbox to0.0pt{$\displaystyle\perp$\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{$\textstyle\perp$\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{$\scriptstyle\perp$\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{$\scriptscriptstyle\perp$\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}E\,|\,\{Z(E),X_{S}\}. (35)

Now, we prove that the above conditional independence implies the first matching property (30).

By definition, the LMMSE estimators only rely on the (finite) first two moments of the variables. Thus we start with the case when (X,Y)|E=e(X,Y)|_{E=e} is jointly Gaussian, for each ealle\in\mathcal{E}^{\text{all}}. First we have

𝖤l[Y|XS,E=e]\displaystyle\mathsf{E}_{l}[Y|X_{S},E=e] =(a)𝖤[Y|XS,E=e]\displaystyle\overset{(a)}{=}\mathsf{E}[Y|X_{S},E=e]
=(b)𝖤[Y|Z(e),XS,E=e],\displaystyle\overset{(b)}{=}\mathsf{E}[Y|Z(e),X_{S},E=e],

where (a)(a) follows from the Gaussian assumption on (X,Y)|E=e(X,Y)|_{E=e} and (b)(b) from the fact that Z(e)Z(e) is a function of {XS,E=e}\{X_{S},E=e\} given our assumption that PESPE\subseteq S. This implies that

𝖤l[Y|XS,E]=𝖤[Y|Z(E),XS,E]=(a)𝖤[Y|Z(E),XS],\displaystyle\mathsf{E}_{l}[Y|X_{S},E]=\mathsf{E}[Y|Z(E),X_{S},E]\overset{(a)}{=}\mathsf{E}[Y|Z(E),X_{S}],

where (a)(a) follows from the conditional independence relation (35). Using the Gaussian assumption again, we have 𝖤[Y|Z(e),XS]=𝖤l[Y|Z(e),XS]\mathsf{E}[Y|Z(e),X_{S}]=\mathsf{E}_{l}[Y|Z(e),X_{S}]. Thus putting all the pieces together, we obtain

𝖤l[Y|XS,E=e]\displaystyle\mathsf{E}_{l}[Y|X_{S},E=e] =𝖤l[Y|Z(e),XS]\displaystyle=\mathsf{E}_{l}[Y|Z(e),X_{S}] (36)
=aZ(e)+bXS,\displaystyle=aZ(e)+b^{\top}X_{S}, (37)

where aa\in\mathbb{R} and b|S|b\in\mathbb{R}^{|S|} are not functions of EE. When (X,Y)|E=e(X,Y)|_{E=e} is non-Gaussian, one can replace it with Gaussian random variables with matching first and second moments. The same argument leading to (37) then still holds, and the first matching property follows from the fact that Z(E)=𝖤[Y|XPA(Y),E]βXZ(E)=\mathsf{E}[Y|X_{PA(Y)},E]-\beta^{\top}X.

Similarly, for the second part, we first show the following d-separation Xk𝒢E|{Z(E),XR}X_{k}\mathchoice{\mathrel{\hbox to0.0pt{$\displaystyle\perp$\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{$\textstyle\perp$\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{$\scriptstyle\perp$\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{$\scriptscriptstyle\perp$\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}_{\mathcal{G}}\,E\,|\,\{Z(E),X_{R}\}. Observe that XkX_{k} and EE can be d-connected through two types of paths as follows,

  1. EZ(E)YXkE\to Z(E)\to Y\cdots X_{k},

  2. EZ(E)XiXkE\to Z(E)\leftarrow X_{i}\cdots X_{k},

with i{j{1,,d}:αj(E)0}i\in\{j\in\{1,\ldots,d\}:\alpha_{j}(E)\neq 0\}, where YXkY\cdots X_{k} denotes any directed path between YY and XkX_{k}, and similarly for XiXkX_{i}\cdots X_{k}. The two types of paths are immediately blocked by {Z(E),XR}\{Z(E),X_{R}\} under our assumption that PERPE\subseteq R. We thus have the following when X|E=eX|_{E=e} is jointly Gaussian,

𝖤l[Xk|XR,E=e]\displaystyle\mathsf{E}_{l}[X_{k}|X_{R},E=e] =𝖤l[Xk|Z(e),XR,E=e]\displaystyle=\mathsf{E}_{l}[X_{k}|Z(e),X_{R},E=e]
=𝖤l[Xk|Z(e),XR]\displaystyle=\mathsf{E}_{l}[X_{k}|Z(e),X_{R}]
=cZ(e)+dXR\displaystyle=cZ(e)+d^{\top}X_{R} (38)

for some cc\in\mathbb{R} and d|R|d\in\mathbb{R}^{|R|} that do not depend on EE. The non-Gaussian cases can be handled in the same way as before, and thus the second matching property follows again from Z(E)=𝖤[Y|XPA(Y),E]βXZ(E)=\mathsf{E}[Y|X_{PA(Y)},E]-\beta^{\top}X. Observe that we have λX=c\lambda_{X}=c in the second matching property and Assumption 1 is a necessary condition for λX0\lambda_{X}\neq 0.

Finally, given RSR\subseteq S, we have that (38) with c0c\neq 0 (i.e., λX0\lambda_{X}\neq 0) provides a one-to-one mapping between {Z(E),XS}\{Z(E),X_{S}\} and {𝖤l[Xk|XR,E],XS}\{\mathsf{E}_{l}[X_{k}|X_{R},E],X_{S}\}. Therefore, the conditional independence (35) is equivalent to

YE|{𝖤l[Xk|XR,E],XS}.Y\mathchoice{\mathrel{\hbox to0.0pt{$\displaystyle\perp$\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{$\textstyle\perp$\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{$\scriptstyle\perp$\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{$\scriptscriptstyle\perp$\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}E\,\big{|}\,\{\mathsf{E}_{l}[X_{k}|X_{R},E],\,X_{S}\}. (39)

This implies that, using our previous notation, ϕe(Xe)=(XSe,𝖤l,𝒫e[Xk|XR])\phi_{e}(X^{e})=(X_{S}^{e},\,\mathsf{E}_{l,\mathcal{P}_{e}}[X_{k}|X_{R}])^{\top} satisfies the invariance property (5), which implies 𝖤𝒫e[Y|XS]=𝖤𝒫e[Y|ϕe(X)]\mathsf{E}_{\mathcal{P}_{e}}[Y|X_{S}]=\mathsf{E}_{\mathcal{P}_{e}}[Y|\phi_{e}(X)] (we use the shorthand ϕe(X)\phi_{e}(X) for ϕe(Xe)\phi_{e}(X^{e}) since the expectation is with respect to 𝒫e\mathcal{P}_{e}). Effectively, 𝖤𝒫e[Y|ϕe(X)]\mathsf{E}_{\mathcal{P}_{e}}[Y|\phi_{e}(X)] serves as a representation of 𝖤𝒫e[Y|XS]\mathsf{E}_{\mathcal{P}_{e}}[Y|X_{S}] that is invariant. Note that Φ\Phi is not empty when λX0\lambda_{X}\neq 0 holds with RSR\subseteq S. This implies that any ϕΦ\phi\in\Phi with S={1,,d}S=\{1,\ldots,d\} minimizes test(fϕ)\mathcal{L}_{\text{test}}(f_{\phi}), since the optimality of ϕΦ\phi\in\Phi only relies on the corresponding SS.

Remark 8.

When εY\varepsilon_{Y} is replaced by εY(E)\varepsilon_{Y}(E), we consider the following two cases.

  1. The mean of εYe\varepsilon_{Y}^{e} is a function of ee, and its variance is a constant.

  2. The variance of εYe\varepsilon_{Y}^{e} is a function of EE, and its mean can be either a function of ee or a constant.

For the first case, we can introduce Xd+1:=1X_{d+1}:=1 as a parent of YY with coefficient αd+1:=𝖤[εY(E)|E]\alpha_{d+1}:=\mathsf{E}[\varepsilon_{Y}(E)|E]. Additionally, Xd+1X_{d+1} is a parent of every XjX_{j} such that εX,j\varepsilon_{X,j} has a non-zero mean. Specifically, the coefficient of Xd+1X_{d+1} in the assignment of XjX_{j} will be 𝖤[εX,j]\mathsf{E}[\varepsilon_{X,j}]. Thus, the problem reduces to the setting where εXe\varepsilon_{X}^{e} and εYe\varepsilon_{Y}^{e} have zero means, which has been handled in the proof of Theorem 1. For the second case, however, the varying variance of εY(E)\varepsilon_{Y}(E) cannot be separated out in the same way as the mean; thus EE is always a parent of YY and the path EYE\to Y cannot be blocked by ZZ or any XS{X1,,Xd}X_{S}\subseteq\{X_{1},\ldots,X_{d}\}, i.e., the proof of Theorem 1 breaks down.

Appendix D Proof of Proposition 2

The following lemma is a slight extension of Lemma 3.6 from [21], where we consider linear models with dependent noise variables rather than the linear SCMs considered in [21].

Lemma 1.

Consider VV\in\mathbb{R} and X=(X1,,Xp)pX=(X_{1},\ldots,X_{p})^{\top}\in\mathbb{R}^{p} satisfying a linear model,

[VX]=[0hgA][VX]+[eVeX]:=A~[VX]+e,\begin{bmatrix}V\\ X\end{bmatrix}=\begin{bmatrix}0&h^{\top}\\ g&A\end{bmatrix}\begin{bmatrix}V\\ X\end{bmatrix}+\begin{bmatrix}e_{V}\\ e_{X}\end{bmatrix}:=\tilde{A}\begin{bmatrix}V\\ X\end{bmatrix}+e, (40)

where eVe_{V}\in\mathbb{R}, g,h,eXpg,h,e_{X}\in\mathbb{R}^{p}, and Ap×pA\in\mathbb{R}^{p\times p}. Assuming that IA~I-\tilde{A} is invertible, the population OLS estimator when regressing VV on XX is given by

θols\displaystyle\theta^{\text{ols}} =[cgΣ~1(Icσ2ggΣ~1)v]h+(IA)[cσ2Σ~1g+Σ~1(Icσ2ggΣ~1)v],\displaystyle=\left[c-g^{\top}\tilde{\Sigma}^{-1}\left(I-c\sigma^{2}gg^{\top}\tilde{\Sigma}^{-1}\right)v\right]h+\left(I-A^{\top}\right)\left[c\sigma^{2}\tilde{\Sigma}^{-1}g+\tilde{\Sigma}^{-1}\left(I-c\sigma^{2}gg^{\top}\tilde{\Sigma}^{-1}\right)v\right],

where c:=(1+σ2gΣ~1g)1c:=\left(1+\sigma^{2}g^{\top}\tilde{\Sigma}^{-1}g\right)^{-1}, Σ~:=Σ+vg+gv\tilde{\Sigma}:=\Sigma+vg^{\top}+gv^{\top}, and

Cov(e,e):=[σ2vvΣ],\mathop{\rm Cov}\nolimits(e,e):=\begin{bmatrix}\sigma^{2}&v^{\top}\\ v&\Sigma\end{bmatrix},

with σ2:=Var(eV)\sigma^{2}:=\mathop{\rm Var}\nolimits(e_{V}), Σ:=Cov(eX,eX)\Sigma:=\mathop{\rm Cov}\nolimits(e_{X},e_{X}), and v:=Cov(eX,eV)v:=\mathop{\rm Cov}\nolimits(e_{X},e_{V}).

Proof:

Let P=[0p×1,Ip×p]P=[0_{p\times 1},I_{p\times p}]. We continue from Equation A.7 in [21] as follows,

θols\displaystyle\theta^{\text{ols}} =Cov1(X,X)Cov(X,V)\displaystyle=\mathop{\rm Cov}\nolimits^{-1}(X,X)\mathop{\rm Cov}\nolimits(X,V)
={P(IA~)1[σ2vvΣ][(IA~)1]P}1P(IA~)1[σ2v]+h,\displaystyle=\left\{P(I-\tilde{A})^{-1}\begin{bmatrix}\sigma^{2}&v^{\top}\\ v&\Sigma\end{bmatrix}\left[(I-\tilde{A})^{-1}\right]^{\top}P^{\top}\right\}^{-1}P(I-\tilde{A})^{-1}\begin{bmatrix}\sigma^{2}\\ v\end{bmatrix}+h,

and it has been shown that

P(IA~)1=[w,M],P(I-\tilde{A})^{-1}=[w,M],

for M=(IAgh)1M=(I-A-gh^{\top})^{-1} and w=Mgw=Mg. Then,

θols\displaystyle\theta^{\text{ols}} =[σ2ww+wvM+Mvw+MΣM]1(σ2w+Mv)+h\displaystyle=\left[\sigma^{2}ww^{\top}+wv^{\top}M^{\top}+Mvw^{\top}+M\Sigma M^{\top}\right]^{-1}\left(\sigma^{2}w+Mv\right)+h
:=[σ2ww+MΣ~M]1(σ2w+Mv)+h,\displaystyle:=\left[\sigma^{2}ww^{\top}+M\tilde{\Sigma}M^{\top}\right]^{-1}\left(\sigma^{2}w+Mv\right)+h,

where Σ~:=Σ+vg+gv\tilde{\Sigma}:=\Sigma+vg^{\top}+gv^{\top} and it was computed in [21] using the Sherman-Morrison formula [60] that

[σ2ww+MΣ~M]1\displaystyle\left[\sigma^{2}ww^{\top}+M\tilde{\Sigma}M^{\top}\right]^{-1}
=\displaystyle= (IAgh)Σ~1(Icσ2ggΣ~1)M1,\displaystyle(I-A-gh^{\top})^{\top}\tilde{\Sigma}^{-1}\left(I-c\sigma^{2}gg^{\top}\tilde{\Sigma}^{-1}\right)M^{-1},

with c:=(1+σ2gΣ~1g)1c:=\left(1+\sigma^{2}g^{\top}\tilde{\Sigma}^{-1}g\right)^{-1}. Then, some simple algebra leads to

θols\displaystyle\theta^{\text{ols}} =h+cσ2(IAgh)Σ~1g+(IAgh)Σ~1(Icσ2ggΣ~1)v\displaystyle=h+c\sigma^{2}(I-A-gh^{\top})^{\top}\tilde{\Sigma}^{-1}g+(I-A-gh^{\top})^{\top}\tilde{\Sigma}^{-1}\left(I-c\sigma^{2}gg^{\top}\tilde{\Sigma}^{-1}\right)v
=[cgΣ~1(Icσ2ggΣ~1)v]h+(IA)[cσ2Σ~1g+Σ~1(Icσ2ggΣ~1)v].\displaystyle=\left[c-g^{\top}\tilde{\Sigma}^{-1}\left(I-c\sigma^{2}gg^{\top}\tilde{\Sigma}^{-1}\right)v\right]h+\left(I-A^{\top}\right)\left[c\sigma^{2}\tilde{\Sigma}^{-1}g+\tilde{\Sigma}^{-1}\left(I-c\sigma^{2}gg^{\top}\tilde{\Sigma}^{-1}\right)v\right].

∎

D-A Proof of Proposition 2

For each tuple (k,R,S)(k,R,S) with k{jMB(Y):αje=0}k\in\{j\in MB(Y):\alpha_{j}^{e}=0\}, R=k:={1,,d}kR=-k:=\{1,\ldots,d\}\setminus k, and S={1,,d}S=\{1,\ldots,d\}, we prove that λX0\lambda_{X}\neq 0. Equivalently, we prove that 𝖤l[Xk|Xk,E=e]\mathsf{E}_{l}[X_{k}|X_{-k},E=e] is a non-constant function of ee.

First, recall that, when the target variable YY is unobserved, the relations between the predictors in e,1\mathcal{M}^{e,1} are described by

X=(γ(β+αe)+B)X+γεY+εX.X=\left(\gamma(\beta+\alpha^{e})^{\top}+B\right)X+\gamma\varepsilon_{Y}+\varepsilon_{X}.

Now, we rewrite the above equation in the same form as the linear model (40) as follows,

[XkXk]=[0γk(β+αe)k+Bk,kγkβk+Bk,kγk(β+αe)k+Bk,k][XkXk]+εX+γεY,\displaystyle\begin{bmatrix}X_{k}\\ X_{-k}\end{bmatrix}=\begin{bmatrix}0&\gamma_{k}(\beta+\alpha^{e})_{-k}^{\top}+B_{k,-k}\\ \gamma_{-k}\beta_{k}+B_{-k,k}&\gamma_{-k}(\beta+\alpha^{e})_{-k}^{\top}+B_{-k,-k}\end{bmatrix}\begin{bmatrix}X_{k}\\ X_{-k}\end{bmatrix}+\varepsilon_{X}+\gamma\varepsilon_{Y},

where the top-left element of the coefficient matrix is zero, i.e., γk(βk+αke)+Bk,k=0\gamma_{k}(\beta_{k}+\alpha_{k}^{e})+B_{k,k}=0 since αke=0\alpha_{k}^{e}=0 (by assumption), Bk,k=0B_{k,k}=0 (due to acyclicity), and γkβk=0\gamma_{k}\beta_{k}=0 (since XkX_{k} cannot be both a child and a parent of YY). Now, by Lemma 1, the population OLS estimator when regressing XkX_{k} on XkX_{-k} given E=eE=e is

θols,k(e)\displaystyle\theta^{\text{ols},k}(e) =[cgΣ~1(Icσ2ggΣ~1)v]h+(IA)[cσ2Σ~1g+Σ~1(Icσ2ggΣ~1)v]\displaystyle=\left[c-g^{\top}\tilde{\Sigma}^{-1}\left(I-c\sigma^{2}gg^{\top}\tilde{\Sigma}^{-1}\right)v\right]h+\left(I-A^{\top}\right)\left[c\sigma^{2}\tilde{\Sigma}^{-1}g+\tilde{\Sigma}^{-1}\left(I-c\sigma^{2}gg^{\top}\tilde{\Sigma}^{-1}\right)v\right]
:=ah+(IA)b,\displaystyle:=ah+(I-A^{\top})b, (41)

where h=γk(β+αe)k+Bk,kh=\gamma_{k}(\beta+\alpha^{e})_{-k}+B_{k,-k}^{\top}, v=γkσY2γkv=\gamma_{k}\sigma_{Y}^{2}\gamma_{-k}, g=βkγk+Bk,kg=\beta_{k}\gamma_{-k}+B_{-k,k}, A=γk(β+αe)k+Bk,kA=\gamma_{-k}(\beta+\alpha^{e})_{-k}^{\top}+B_{-k,-k}, Σ=Cov(NXk,NXk)\Sigma=\mathop{\rm Cov}\nolimits(N_{X_{-k}},N_{X_{-k}}), Σ~=Σ+vg+gv\tilde{\Sigma}=\Sigma+vg^{\top}+gv^{\top}, and c=(1+σ2gΣ~1g)1c=\left(1+\sigma^{2}g^{\top}\tilde{\Sigma}^{-1}g\right)^{-1}. Note that aa\in\mathbb{R} and b(d1)b\in\mathbb{R}^{(d-1)} are not functions of ee and

a=gb+c(1+σ2gΣ~1g)=gb+1.a=g^{\top}b+c(1+\sigma^{2}g^{\top}\tilde{\Sigma}^{-1}g)=g^{\top}b+1. (42)
  1. kCH(Y)k\in CH(Y): First, observe that βk=0\beta_{k}=0 implies g=Bk,kg=B_{-k,k}. By plugging hh and AA into (41), we obtain

    θols,k(e)\displaystyle\theta^{\text{ols},k}(e) =(aγkb)αke+aγkβk+(IβkγkBk,k)b,\displaystyle=(a-\gamma^{\top}_{-k}b)\alpha_{-k}^{e}+a\gamma_{k}\beta_{-k}+(I-\beta_{-k}\gamma_{-k}^{\top}-B_{-k,-k}^{\top})b,

    where αke\alpha_{-k}^{e} in the first term is non-vanishing only if

    aγkb=1+(Bk,kγk)b0,a-\gamma^{\top}_{-k}b=1+(B_{-k,k}-\gamma_{-k})^{\top}b\neq 0,

    where we use (42).

  2. kPA(Y)k\in PA(Y): Observe that v=0(d1)v=0_{(d-1)} and γk=0\gamma_{k}=0, then

    θols,k(e)\displaystyle\theta^{\text{ols},k}(e) =cαe+cβ+cσ2(IA)Σ1g\displaystyle=c\alpha^{e}+c\beta+c\sigma^{2}\left(I-A^{\top}\right)\Sigma^{-1}g
    =c(1+σ2βkγkΣ1γk+σ2γkΣ1Bk,k)αke+cβ+cσ2(IβkγkBk,k)Σ1g,\displaystyle=c(1+\sigma^{2}\beta_{k}\gamma_{-k}^{\top}\Sigma^{-1}\gamma_{-k}+\sigma^{2}\gamma_{-k}^{\top}\Sigma^{-1}B_{-k,k})\alpha_{-k}^{e}+c\beta+c\sigma^{2}(I-\beta_{-k}\gamma_{-k}^{\top}-B_{-k,-k}^{\top})\Sigma^{-1}g,

    where the first term is not vanishing only if

    1+σ2βkγkΣ1γk+σ2γkΣ1Bk,k0.1+\sigma^{2}\beta_{k}\gamma_{-k}^{\top}\Sigma^{-1}\gamma_{-k}+\sigma^{2}\gamma_{-k}^{\top}\Sigma^{-1}B_{-k,k}\neq 0.
  3. k{j:iCH(Xj) such that iCH(Y)}k\in\{j:\exists i\in CH(X_{j})\text{ such that }i\in CH(Y)\}: Again, we have v=0(d1)×1v=0_{(d-1)\times 1} and γk=0\gamma_{k}=0. Additionally, we have βk=0\beta_{k}=0, then αke\alpha_{-k}^{e} in θols,k(e)\theta^{\text{ols},k}(e) is non-vanishing only if

    1+σ2γkΣ1Bk,k0.1+\sigma^{2}\gamma_{-k}^{\top}\Sigma^{-1}B_{-k,k}\neq 0.

Appendix E Proof of Corollary 2

First, we extract the assignments of (Xk,XR)(X_{k},X_{R})’s from (20) with E=eE=e, where kCH(Y)k\in CH(Y) and kDE(Xi)k\not\in DE(X_{i}) for any iCH(Y)ki\in CH(Y)\setminus k and R={1,,d}DE(Y)R=\{1,\ldots,d\}\setminus DE(Y) as follows,

[XkXR]\displaystyle\begin{bmatrix}X_{k}\\ X_{R}\end{bmatrix} =[0γk(β+αe)R+Bk,R0|R|×1BR,R][XkXR]+[εX,k+γkεYeεX,R],\displaystyle=\begin{bmatrix}0&\gamma_{k}(\beta+\alpha^{e})_{R}^{\top}+B_{k,R}\\ 0_{|R|\times 1}&B_{R,R}\end{bmatrix}\begin{bmatrix}X_{k}\\ X_{R}\end{bmatrix}+\begin{bmatrix}\varepsilon_{X,k}+\gamma_{k}\varepsilon_{Y}^{e}\\ \varepsilon_{X,R}\end{bmatrix},

where the zeros in the coefficient matrix are due to the fact that all the descendants of XkX_{k} are excluded from XRX_{R}, since DE(Xk)DE(Y)DE(X_{k})\subseteq DE(Y) and R={1,,d}DE(Y)R=\{1,\ldots,d\}\setminus DE(Y). Note that XkX_{k} is a child of YY that is not a descendant of any other child of YY; thus, any removed node j{1,,d}{k,R}j\in\{1,\ldots,d\}\setminus\{k,R\} cannot be a parent of any remaining node i{k,R}i\in\{k,R\}.

Now, using Lemma 1, the population OLS estimator when regressing $X_{k}$ on $X_{R}$ given $E=e$ is

\theta^{\text{ols}}(e)=\gamma_{k}(\beta+\alpha^{e})_{R}+B_{k,R}^{\top}.

This implies

\mathsf{E}_{l}[X_{k}|X_{R},E=e]=\gamma_{k}(\beta+\alpha^{e})^{\top}X+B_{k,\cdot}X,

where we use the fact that $\beta_{j}=\alpha^{e}_{j}=B_{k,j}=0$ for $j\not\in R$ (recall that $\{1,\ldots,d\}\setminus R$ contains neither the parents of $X_{k}$ nor the parents of $Y$). Therefore,

\mathsf{E}_{l}[X_{k}|X_{R},E=e]=\gamma_{k}\mathsf{E}[Y|X_{PA(Y)},E=e]+B_{k,\cdot}X,

which is the second matching property.

Remark 9.

Observe that $\theta^{\text{ols}}(e)$ does not depend on $B_{R,R}$ or the covariance of $\varepsilon_{X,R}$; thus the second matching property holds even if every $X_{j}\in X_{R}$ is intervened.
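To make the second matching property concrete, the following minimal numpy sketch (added here for illustration and not part of the original proofs; the toy SCM, coefficient values, and variable names are assumptions) simulates a linear SCM with one parent $X_{1}\to Y$, one child $Y\to X_{2}$, and an exogenous $X_{3}$, and checks that the within-environment OLS coefficient of $X_{2}$ on $X_{1}$ is approximately $\gamma_{2}(\beta_{1}+\alpha_{1}^{e})$, so that $\mathsf{E}[X_{2}|X_{R},E=e]$ tracks $\gamma_{2}\mathsf{E}[Y|X_{PA(Y)},E=e]$.

```python
import numpy as np

rng = np.random.default_rng(0)
beta1, gamma2 = 1.5, 0.8            # hypothetical coefficients: X1 -> Y and Y -> X2
n = 200_000

for alpha1_e in [0.0, 1.0, -2.0]:   # interventions on the assignment of Y across environments
    X1 = rng.normal(size=n)
    X3 = rng.normal(size=n)                          # non-descendant of Y
    Y = (beta1 + alpha1_e) * X1 + rng.normal(size=n)
    X2 = gamma2 * Y + rng.normal(size=n)             # the child X_k of Y

    # OLS of X_k on X_R = (X1, X3), approximating the population regression within this environment
    XR = np.column_stack([X1, X3])
    theta = np.linalg.lstsq(XR, X2, rcond=None)[0]

    # second matching property: the coefficient on X1 should equal gamma2 * (beta1 + alpha1_e)
    print(f"alpha^e = {alpha1_e:+.1f}: theta = {theta.round(3)}, "
          f"gamma2*(beta1+alpha1^e) = {gamma2 * (beta1 + alpha1_e):.3f}")
```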

Appendix F Proof of Theorem 2

For the first part, we only need to prove that $Y$ and $E$ are d-connected only through the arrow $E\to Y$ when conditioning on $X_{S}$, where $S\subseteq\{1,\ldots,d\}\setminus X^{\text{int}}(Y)$ is such that $PA(Y)\subseteq S$. The rest follows as in the proof of the first part of Theorem 1. Since $E$ is a root node, $Y$ and $E$ can only be d-connected through the following two types of paths:

  1. $E\to\cdots\to Y$,

  2. $Y\to\cdots\to X_{j}\leftarrow\cdots\leftarrow E$,

where $j\in DE(Y)\cap DE(E)$, and $E\to\cdots\to Y$ denotes any directed path from $E$ to $Y$. For the first type of paths, since $PA(Y)\subseteq S$, the only unblocked path when conditioning on $X_{S}$ is $E\to Y$. For the second type of paths, which are V-structures, we have $j\not\in S$ since $X^{\text{int}}(Y)$ contains all the intervened children of $Y$ and the descendants of such children, implying that the V-structures are always blocked when conditioning on $X_{S}$. Thus, the only unblocked path between $E$ and $Y$ is $E\to Y$ when conditioning on $X_{S}$.

For the second part, similarly, for $k\in\{1,\ldots,d\}\setminus\{PE\cup X^{\text{int}}(Y)\}$ and $R\subseteq\{1,\ldots,d\}\setminus\{k,X^{\text{int}}(X_{k})\cup X^{\text{int}}(Y)\}$ such that $PA(X_{k})\setminus Y\subseteq R$, we prove that $X_{k}$ is d-connected with $E$ only through paths that contain the subpath $E\to Y$ when conditioning on $X_{R}$ (i.e., the second type of path below). Following the same idea as in the first part, $Y$ and $E$ are d-connected only through the arrow $E\to Y$ when conditioning on $X_{R}$, since $X^{\text{int}}(Y)\cap R=\varnothing$ and $PA(Y)\subseteq R$. Then, since $X_{k}$ is not intervened (i.e., $E\not\in PA(X_{k})$), the variables $E$ and $X_{k}$ can only be d-connected through the following three types of paths:

  1. $E\to\cdots\to X_{k}$ that does not contain $Y$,

  2. $E\to Y\to\cdots\to X_{k}$,

  3. $E\to\cdots\to X_{i}\leftarrow\cdots\leftarrow X_{k}$,

where $i\in R$ and we use the fact that $k\not\in X^{\text{int}}(Y)$ (i.e., the node $k$ is not an intervened child of $Y$ or a descendant of an intervened child of $Y$). The first type of path is blocked immediately due to $PA(X_{k})\setminus Y\subseteq R$.

To handle the third type of paths, we will need two technical results. (I) If $i\in R$, then $j\in R$ for any $j\in PA(i)$ such that $j\neq k$. This can be proved by contradiction: if $j\not\in R$ (i.e., $j\in X^{\text{int}}(Y)\cup X^{\text{int}}(X_{k})$), then $i\in X^{\text{int}}(Y)\cup X^{\text{int}}(X_{k})$ by the definitions of $X^{\text{int}}(Y)$ and $X^{\text{int}}(X_{k})$, i.e., $i\not\in R$, a contradiction. (II) For $i\in R$, observe that $X_{i}$ can only be a child of $X_{k}$; otherwise the path $X_{i}\leftarrow X_{j}\leftarrow\cdots\leftarrow X_{k}$ is blocked by $X_{j}\in PA(X_{i})$ when conditioning on $X_{R}$, since $i\in R$ implies $j\in R$ by (I).

Now, we show that the third type of paths are always blocked when conditioning on $X_{R}$ by focusing on the subpath $E\to\cdots\to X_{i}$. For any $X_{i}\in CH(X_{k})$, this subpath cannot be $E\to X_{i}$, since $i\in R$ implies that $X_{i}$ is not an intervened child of $X_{k}$. Finally, any longer subpath $E\to\cdots\to X_{l}\to X_{i}$ is blocked by $X_{l}\in PA(X_{i})$ when conditioning on $X_{R}$, since $i\in R$ implies that $l\in R$ by (I).

Remark 10.

We require $PA(Y)$ to be included in $S$ in order to block every path from intervened ancestors to $Y$, but this is not necessary when the ancestors are not intervened. In contrast, it is important to include $PE$ in $S$ to handle interventions on $Y$. Moreover, including $PE$ blocks paths from intervened ancestors that pass through variables in $PE$. Similar arguments apply to $R$ as well.
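The d-separation arguments above can also be sanity-checked programmatically on a small example. The sketch below is added purely for illustration: the toy graph and node names are assumptions, and it relies on NetworkX exposing nx.d_separated (newer releases provide the same test as is_d_separator). Removing the arrow $E\to Y$ and testing d-separation given $X_{S}$ verifies that $E\to Y$ was the only unblocked path.

```python
import networkx as nx

# Toy graph: E -> Y (intervention on the response), E -> X1 (intervened parent), X1 -> Y,
# Y -> X2 with E -> X2 (intervened child), X2 -> X3; hence X^int(Y) = {X2, X3}.
G = nx.DiGraph([("E", "Y"), ("E", "X1"), ("X1", "Y"),
                ("Y", "X2"), ("E", "X2"), ("X2", "X3")])

# Drop the direct arrow E -> Y; if E and Y are then d-separated given X_S,
# the only unblocked E-Y path in the original graph was E -> Y itself.
G_minus = G.copy()
G_minus.remove_edge("E", "Y")

# Valid S = {X1}: contains PA(Y) and avoids X^int(Y).
print(nx.d_separated(G_minus, {"E"}, {"Y"}, {"X1"}))        # expected: True

# Invalid S: conditioning on the intervened child X2 opens the collider E -> X2 <- Y.
print(nx.d_separated(G_minus, {"E"}, {"Y"}, {"X1", "X2"}))  # expected: False
```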

Appendix G Semi-parametric Varying Coefficient Models and Profile Least-Squares Estimation

First, we introduce the semi-parametric varying coefficient model following most of the notation in [57]. Consider $Y\in\mathbb{R}$, $U\in\mathcal{U}$, and two vectors of predictors $W=(W_{1},\ldots,W_{p})^{\top}$ and $Z=(Z_{1},\ldots,Z_{q})^{\top}$, where the $Z_{j}$'s have invariant coefficients. A semi-parametric varying coefficient model over $(U,Y,W,Z)$ is defined by

Y=M+\beta^{\top}Z+N,\quad M=\alpha^{\top}(U)W, (43)

where $N$ is independent of $(U,W,Z)$.

We briefly introduce the profile least-squares estimator of $\beta$ proposed in [57]. Denote $n$ i.i.d. samples of $(U,Y,W,Z,N)$ as $\boldsymbol{U}=(U_{1},\ldots,U_{n})^{\top}$, $\boldsymbol{Y}=(Y_{1},\ldots,Y_{n})^{\top}$, $\boldsymbol{W}=(\boldsymbol{W}_{1},\ldots,\boldsymbol{W}_{n})^{\top}$ with $\boldsymbol{W}_{i}=(W_{i1},\ldots,W_{ip})^{\top}$, $\boldsymbol{Z}=(\boldsymbol{Z}_{1},\ldots,\boldsymbol{Z}_{n})^{\top}$ with $\boldsymbol{Z}_{i}=(Z_{i1},\ldots,Z_{iq})^{\top}$, and $\boldsymbol{N}=(N_{1},\ldots,N_{n})^{\top}$. We thus have $\boldsymbol{M}=(M_{1},\ldots,M_{n})^{\top}$ with $M_{i}=\alpha^{\top}(U_{i})\boldsymbol{W}_{i}$. Let $\boldsymbol{K}_{u}=\text{diag}(K_{h}(U_{1}-u),\ldots,K_{h}(U_{n}-u))$ for some kernel function $K_{h}(\cdot)=K(\cdot/h)/h$ with bandwidth $h$, and

\boldsymbol{\tilde{W}}_{u}=\begin{bmatrix}\boldsymbol{W}_{1}^{\top}&\frac{U_{1}-u}{h}\boldsymbol{W}_{1}^{\top}\\ \vdots&\vdots\\ \boldsymbol{W}_{n}^{\top}&\frac{U_{n}-u}{h}\boldsymbol{W}_{n}^{\top}\end{bmatrix}\in\mathbb{R}^{n\times 2p}.

For $u$ in a neighborhood of $u_{0}\in\mathcal{U}$, assume that each function $\alpha_{i}(u)$ in (43) can be approximated locally by the first-order Taylor expansion $\alpha_{i}(u)\approx a_{i}+b_{i}(u-u_{0})$. Then the varying coefficients $\alpha(U_{1}),\ldots,\alpha(U_{n})$ can be estimated by solving the weighted local least-squares problem

\min_{\{a_{i},b_{i}\}}\sum_{k=1}^{n}\left[\boldsymbol{Y}_{k}-\sum_{j=1}^{q}\beta_{j}Z_{kj}-\sum_{i=1}^{p}(a_{i}+b_{i}(U_{k}-u))W_{ki}\right]^{2}\cdot K_{h}(U_{k}-u), (44)

which has the solution

[\hat{a}_{1}(u),\ldots,\hat{a}_{p}(u),h\hat{b}_{1}(u),\ldots,h\hat{b}_{p}(u)]^{\top}=(\boldsymbol{\tilde{W}}_{u}^{\top}\boldsymbol{K}_{u}\boldsymbol{\tilde{W}}_{u})^{-1}\boldsymbol{\tilde{W}}_{u}^{\top}\boldsymbol{K}_{u}(\boldsymbol{Y}-\boldsymbol{Z}\beta),

for $u\in\{U_{1},\ldots,U_{n}\}$. Then, each $\boldsymbol{M}_{i}$ can be estimated by $\boldsymbol{\hat{M}}_{i}=\sum_{j=1}^{p}\hat{a}_{j}(U_{i})\boldsymbol{W}_{ij}$, and thus the vector $\boldsymbol{M}$ can be estimated by

\boldsymbol{\hat{M}}(\beta)=\begin{bmatrix}[\boldsymbol{W}_{1}^{\top}\quad 0_{1\times p}]\{\boldsymbol{\tilde{W}}_{u_{1}}^{\top}\boldsymbol{K}_{u_{1}}\boldsymbol{\tilde{W}}_{u_{1}}\}^{-1}\boldsymbol{\tilde{W}}_{u_{1}}^{\top}\boldsymbol{K}_{u_{1}}\\ \vdots\\ [\boldsymbol{W}_{n}^{\top}\quad 0_{1\times p}]\{\boldsymbol{\tilde{W}}_{u_{n}}^{\top}\boldsymbol{K}_{u_{n}}\boldsymbol{\tilde{W}}_{u_{n}}\}^{-1}\boldsymbol{\tilde{W}}_{u_{n}}^{\top}\boldsymbol{K}_{u_{n}}\end{bmatrix}(\boldsymbol{Y}-\boldsymbol{Z}\beta):=A(\boldsymbol{Y}-\boldsymbol{Z}\beta), (45)

which depends on the unknown parameter $\beta$ that will be estimated below. Substituting $\boldsymbol{\hat{M}}(\beta)$ into the vector form of (43), we obtain $(I-A)\boldsymbol{Y}=(I-A)\boldsymbol{Z}\beta+\boldsymbol{N}$. The profile least-squares estimator of $\beta$ is given by

\hat{\beta}=\{\boldsymbol{Z}^{\top}(I-A)^{\top}(I-A)\boldsymbol{Z}\}^{-1}\cdot\boldsymbol{Z}^{\top}(I-A)^{\top}(I-A)\boldsymbol{Y}. (46)

Finally, by replacing $\beta$ in (45) with $\hat{\beta}$, the final form of the estimator for $\boldsymbol{M}$ is given by

\boldsymbol{\hat{M}}=A(\boldsymbol{Y}-\boldsymbol{Z}\hat{\beta}). (47)
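For concreteness, the following numpy sketch (added here for illustration under the model (43); the kernel choice, bandwidth, and toy data are assumptions) implements the local linear smoother matrix $A$ from (45) and the profile least-squares estimators (46)-(47).

```python
import numpy as np

def epanechnikov(v):
    """Symmetric density kernel with bounded support."""
    return np.where(np.abs(v) <= 1.0, 0.75 * (1.0 - v ** 2), 0.0)

def smoother_matrix(U, W, h):
    """Row i of A maps (Y - Z beta) to the local linear fit at U_i, as in (45)."""
    n, p = W.shape
    A = np.zeros((n, n))
    for i in range(n):
        Ku = epanechnikov((U - U[i]) / h) / h                 # kernel weights K_h(U_j - U_i)
        Wt = np.hstack([W, ((U - U[i]) / h)[:, None] * W])    # n x 2p local design (tilde W)
        S = Wt.T @ (Ku[:, None] * Wt)                         # tilde W^T K tilde W
        e = np.concatenate([W[i], np.zeros(p)])               # [W_i^T, 0_{1 x p}]
        A[i] = e @ np.linalg.solve(S, Wt.T * Ku)              # i-th row of A
    return A

def profile_least_squares(Y, U, W, Z, h):
    """Profile least-squares estimates (beta_hat, M_hat) following (46) and (47)."""
    A = smoother_matrix(U, W, h)
    IA = np.eye(len(Y)) - A
    beta_hat = np.linalg.solve(Z.T @ IA.T @ IA @ Z, Z.T @ IA.T @ IA @ Y)
    M_hat = A @ (Y - Z @ beta_hat)
    return beta_hat, M_hat

# Hypothetical toy data: M = sin(2*pi*U)*W_1 and an invariant coefficient beta = 2.
rng = np.random.default_rng(0)
n = 400
U = rng.uniform(size=n)
W = rng.normal(size=(n, 1))
Z = rng.normal(size=(n, 1))
Y = np.sin(2 * np.pi * U) * W[:, 0] + 2.0 * Z[:, 0] + 0.1 * rng.normal(size=n)
beta_hat, M_hat = profile_least_squares(Y, U, W, Z, h=0.1)
print(beta_hat)   # should be close to [2.0]
```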

Appendix H Technical Lemmas for the Proof of Theorem 3

The semi-parametric varying coefficient model and profile least-squares estimation are introduced in Appendix G. First, we present some technical assumptions and two technical lemmas from [57]. Let $c_{n}^{\prime}=\left\{\frac{\log(1/h)}{nh}\right\}^{1/2}$ and $c_{n}=c_{n}^{\prime}+h^{2}$.

  1. $U$ has a bounded support $\mathcal{U}$ and a density function $f(\cdot)$ that is Lipschitz continuous and bounded away from 0.

  2. For each $U_{i}$, the matrix $\mathsf{E}[WW^{\top}|U_{i}]$ is non-singular, and the matrices $\mathsf{E}[WW^{\top}|U_{i}]$, $\left(\mathsf{E}[WW^{\top}|U_{i}]\right)^{-1}$, and $\mathsf{E}[WZ^{\top}|U_{i}]$ are Lipschitz continuous.

  3. $\alpha_{1}(u),\ldots,\alpha_{p}(u)$ have continuous second derivatives.

  4. $K(\cdot)$ is a symmetric density function.

  5. There exists an $s>2$ such that $\mathsf{E}[\|X\|^{2s}]<\infty$, $\mathsf{E}[\|Z_{Y}\|^{2s}]<\infty$, and $\mathsf{E}[\|Z_{V}\|^{2s}]<\infty$, and there exists $\varepsilon<2-s^{-1}$ such that $n^{2\varepsilon-1}h\to\infty$.

  6. The bandwidth $h$ satisfies $nh^{8}\to 0$ and $nh^{2}/(\log(n))^{2}\to\infty$.

Lemma 2 (​[61]).

Let $(U_{1},Y_{1}),\ldots,(U_{n},Y_{n})$ be i.i.d. random vectors in $\mathbb{R}^{2}$. Assume that $\mathsf{E}[|Y|^{s}]<\infty$ and $\sup_{u}\int|y|^{s}f(u,y)dy<\infty$, where $f(u,y)$ is the joint density of $(U,Y)$. Let $K$ be a bounded positive function with a bounded support that satisfies a Lipschitz condition. Given that $n^{2\varepsilon-1}h\to\infty$ for some $\varepsilon<1-s^{-1}$, then

\sup_{u}\left|\frac{1}{n}\sum_{i=1}^{n}K_{h}(U_{i}-u)Y_{i}-\mathsf{E}[K_{h}(U_{i}-u)Y_{i}]\right|=O_{p}(c_{n}^{\prime}).
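As an aside, the uniform-in-$u$ rate in Lemma 2 can be eyeballed with a small Monte Carlo sketch (added for illustration only; the data-generating choices are hypothetical): with $U$ uniform on $[0,1]$ and $Y=m(U)$ plus noise, the population term $\mathsf{E}[K_{h}(U-u)Y]=\int K_{h}(t-u)m(t)dt$ can be computed by numerical integration and the sup-deviation compared with $c_{n}^{\prime}$.

```python
import numpy as np

def K_h(v, h):
    """Epanechnikov kernel rescaled by the bandwidth."""
    k = np.where(np.abs(v / h) <= 1.0, 0.75 * (1.0 - (v / h) ** 2), 0.0)
    return k / h

m_fun = lambda u: np.sin(2 * np.pi * u)        # hypothetical regression function
rng = np.random.default_rng(0)

n, h = 5000, 0.1
U = rng.uniform(size=n)
Y = m_fun(U) + 0.5 * rng.normal(size=n)

grid = np.linspace(0.1, 0.9, 81)               # interior grid to avoid boundary effects
t = np.linspace(0.0, 1.0, 2001)                # integration grid; f(t) = 1 on [0, 1]
dt = t[1] - t[0]
emp = np.array([np.mean(K_h(U - u, h) * Y) for u in grid])
pop = np.array([np.sum(K_h(t - u, h) * m_fun(t)) * dt for u in grid])

c_n_prime = np.sqrt(np.log(1.0 / h) / (n * h))
print(np.max(np.abs(emp - pop)), c_n_prime)    # sup deviation vs. the rate c_n'
```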

Lemma 3 (​[57]).

Under assumptions 1)$\sim$6), $\sqrt{n}(\hat{\beta}-\beta)\xrightarrow{d}\mathcal{N}(0,\Sigma)$ as $n\to\infty$, where $\Sigma=\mathop{\rm Var}\nolimits(N)\cdot C^{-1}$ with

C=\mathsf{E}[ZZ^{\top}]-\mathsf{E}\big{[}\mathsf{E}[ZW^{\top}|U]\left(\mathsf{E}[WW^{\top}|U]\right)^{-1}\mathsf{E}[WZ^{\top}|U]\big{]}.

In the following lemma, we provide the rates of several quantities needed for the proof of Theorem 3.

Lemma 4.

Under the same assumptions as in Lemma 3,

R_{1} =\frac{1}{n}(\hat{\boldsymbol{M}}-\boldsymbol{M})^{\top}(\hat{\boldsymbol{M}}-\boldsymbol{M})=O_{p}(c_{n}^{2}\vee n^{-1}),
R_{2} =\frac{1}{n}(\hat{\boldsymbol{M}}-\boldsymbol{M})^{\top}[\boldsymbol{W},\boldsymbol{M}]=O_{p}(c_{n}\vee n^{-1/2}),
R_{3} =\frac{1}{n}(\hat{\boldsymbol{M}}-\boldsymbol{M})^{\top}\boldsymbol{Z}=O_{p}(c_{n}\vee n^{-1/2}),
R_{4} =\frac{1}{n}(\hat{\boldsymbol{M}}-\boldsymbol{M})^{\top}\boldsymbol{N}=O_{p}(c_{n}).

Proof:

First, (47) and the vector form of (43) give

\boldsymbol{\hat{M}}-\boldsymbol{M} =A(\boldsymbol{Y}-\boldsymbol{Z}\hat{\beta})-\boldsymbol{M}
=(A-I)\boldsymbol{M}+A\boldsymbol{Z}(\beta-\hat{\beta})+A\boldsymbol{N}.

Observe that $R_{2}\sim R_{4}$ can be defined through

I_{1} =\frac{1}{n}\boldsymbol{M}^{\top}(A^{\top}-I)\boldsymbol{P},
I_{2} =\frac{1}{n}(\beta-\hat{\beta})^{\top}\boldsymbol{Z}^{\top}A^{\top}\boldsymbol{P},
I_{3} =\frac{1}{n}\boldsymbol{N}^{\top}A^{\top}\boldsymbol{P},

where $\boldsymbol{P}$ can be replaced by $\boldsymbol{W}$, $\boldsymbol{Z}$, $\boldsymbol{M}$, or $\boldsymbol{N}$ (note that the rows of $\boldsymbol{P}$ are i.i.d.). We also have $R_{1}=\sum_{i=4}^{9}I_{i}$, where

I_{4} =\frac{1}{n}\boldsymbol{M}^{\top}(A^{\top}-I)(A-I)\boldsymbol{M},
I_{5} =\frac{1}{n}(\beta-\hat{\beta})^{\top}\boldsymbol{Z}^{\top}A^{\top}A\boldsymbol{Z}(\beta-\hat{\beta}),
I_{6} =\frac{1}{n}\boldsymbol{N}^{\top}A^{\top}A\boldsymbol{N},
I_{7} =\frac{1}{n}\boldsymbol{M}^{\top}(A^{\top}-I)\big{\{}A\boldsymbol{Z}(\beta-\hat{\beta})+A\boldsymbol{N}\big{\}},
I_{8} =\frac{1}{n}(\beta-\hat{\beta})^{\top}\boldsymbol{Z}^{\top}A^{\top}\big{\{}(A-I)\boldsymbol{M}+A\boldsymbol{N}\big{\}},
I_{9} =\frac{1}{n}\boldsymbol{N}^{\top}A^{\top}\big{\{}(A-I)\boldsymbol{M}+A\boldsymbol{Z}(\beta-\hat{\beta})\big{\}}.

It can be shown that $I_{1},I_{3}=O_{p}(c_{n})$ (we use this as a shorthand for $I_{1}=O_{p}(c_{n})$ and $I_{3}=O_{p}(c_{n})$, as $I_{1}$ and $I_{3}$ are not equal), $I_{2}=O_{p}(n^{-1/2})$, $I_{5}=O_{p}(n^{-1})$, $I_{4},I_{6}=O_{p}(c_{n}^{2})$, $I_{8}=O_{p}(c_{n}n^{-1/2})$, and $I_{7},I_{9}=O_{p}(c_{n}^{2}\vee c_{n}n^{-1/2})$, which implies $R_{1}=O_{p}(c_{n}^{2}\vee n^{-1})$, $R_{2},R_{3}=O_{p}(c_{n}\vee n^{-1/2})$, and $R_{4}=O_{p}(c_{n})$. The techniques for proving the rates of $I_{1}\sim I_{9}$ are similar; since all the components appearing in $I_{4}\sim I_{9}$ are already computed in order to obtain $I_{1}\sim I_{3}$, we only provide the proof for $I_{1}\sim I_{3}$ for simplicity of presentation.

Using Lemma 2, it has been shown in [57] that $\boldsymbol{\tilde{W}}_{u_{i}}^{\top}\boldsymbol{K}_{u_{i}}\boldsymbol{\tilde{W}}_{u_{i}}$ can be equivalently expressed as

\begin{bmatrix}B_{1}&B_{2}\\ B_{3}&B_{4}\end{bmatrix}=nf(U_{i})\mathsf{E}[WW^{\top}|U_{i}]\otimes\begin{pmatrix}1&0\\ 0&\mu_{2}\end{pmatrix}\{1+O_{p}(c_{n})\}, (48)

where $\mu_{2}=\int_{\mathcal{U}}u^{2}K(u)du$, and the four block matrices are $B_{1}=S_{0}(U_{i})$, $B_{2}=B_{3}=S_{1}(U_{i})$, and $B_{4}=S_{2}(U_{i})$, with

S_{k}(U_{i})=\sum_{j=1}^{n}\left(\frac{U_{j}-U_{i}}{h}\right)^{k}\boldsymbol{W}_{j}\boldsymbol{W}_{j}^{\top}K_{h}(U_{j}-U_{i}).

Since the techniques for proving (48), omitted in [57], will be used repeatedly in the rest of this paper, we provide the proof for completeness. Applying Lemma 2, it holds uniformly in $u\in\mathcal{U}$ that $\frac{1}{n}S_{k}(u)$ can be expressed as

\mathsf{E}\left[\left(\frac{U-u}{h}\right)^{k}\mathsf{E}[WW^{\top}|U]K_{h}(U-u)\right]\{1+O_{p}(c_{n}^{\prime})\}
= \int_{\mathcal{V}}v^{k}\mathsf{E}[WW^{\top}|U=hv+u]K(v)f(hv+u)dv\cdot\{1+O_{p}(c_{n}^{\prime})\} (49)
= \int_{\mathcal{V}}v^{k}K(v)\left[\mathsf{E}[WW^{\top}|U=u]+vO(h)\right]\cdot\left[f(u)+vO(h)\right]dv\{1+O_{p}(c_{n}^{\prime})\} (50)
= \begin{cases}\mathsf{E}[WW^{\top}|U=u]f(u)\mu_{k}\{1+O_{p}(c_{n})\},&k\text{ even}\\ O(h)+O_{p}(c_{n}^{\prime}),&k\text{ odd}\end{cases} (51)

where (49) is due to the change of variable $V=(U-u)/h$, (50) uses the Lipschitz continuity assumptions on $\mathsf{E}[WW^{\top}|U]$ and $f(\cdot)$, and (51) is by the symmetry of the kernel function $K(\cdot)$. Similarly, we obtain

\boldsymbol{\tilde{W}}_{u_{i}}^{\top}\boldsymbol{K}_{u_{i}}\boldsymbol{M}
= \begin{bmatrix}\sum_{j=1}^{n}\boldsymbol{W}_{j}\boldsymbol{W}_{j}^{\top}\alpha(U_{j})K_{h}(U_{j}-U_{i})\\ \sum_{j=1}^{n}\frac{U_{j}-U_{i}}{h}\boldsymbol{W}_{j}\boldsymbol{W}_{j}^{\top}\alpha(U_{j})K_{h}(U_{j}-U_{i})\end{bmatrix}
= nf(U_{i})\mathsf{E}[WW^{\top}|U_{i}]\alpha(U_{i})\otimes\begin{pmatrix}1&0\end{pmatrix}^{\top}\{1+O_{p}(c_{n})\}. (52)

Recalling the expression of $A$ in (45), we have

(A-I)\boldsymbol{M}
= \begin{bmatrix}[\boldsymbol{W}_{1}^{\top}\quad\boldsymbol{0}]\{\boldsymbol{\tilde{W}}_{u_{1}}^{\top}\boldsymbol{K}_{u_{1}}\boldsymbol{\tilde{W}}_{u_{1}}\}^{-1}\boldsymbol{\tilde{W}}_{u_{1}}^{\top}\boldsymbol{K}_{u_{1}}\boldsymbol{M}\\ \vdots\\ [\boldsymbol{W}_{n}^{\top}\quad\boldsymbol{0}]\{\boldsymbol{\tilde{W}}_{u_{n}}^{\top}\boldsymbol{K}_{u_{n}}\boldsymbol{\tilde{W}}_{u_{n}}\}^{-1}\boldsymbol{\tilde{W}}_{u_{n}}^{\top}\boldsymbol{K}_{u_{n}}\boldsymbol{M}\end{bmatrix}-\boldsymbol{M}.

Using (48) and (52), we obtain

I_{1} =\frac{1}{n}\boldsymbol{M}^{\top}(A^{\top}-I)\boldsymbol{P}
=\frac{1}{n}\sum_{i=1}^{n}(M_{i}\{1+O_{p}(c_{n})\}-M_{i})\boldsymbol{P}_{i}=O_{p}(c_{n}),

where $\boldsymbol{P}_{i}$ denotes the $i^{\text{th}}$ row of $\boldsymbol{P}$ and the last equality is due to the law of large numbers. For $I_{2}$, similarly as above, we compute

A\boldsymbol{Z}=\begin{bmatrix}\boldsymbol{W}_{1}^{\top}\left\{\mathsf{E}[WW^{\top}|U_{1}]\right\}^{-1}\mathsf{E}[WZ^{\top}|U_{1}]\\ \vdots\\ \boldsymbol{W}_{n}^{\top}\left\{\mathsf{E}[WW^{\top}|U_{n}]\right\}^{-1}\mathsf{E}[WZ^{\top}|U_{n}]\end{bmatrix}\{1+O_{p}(c_{n})\}.

Since $\beta-\hat{\beta}=O_{p}(n^{-1/2})$ by Lemma 3, we obtain

I_{2} =\frac{1}{n}(\beta-\hat{\beta})^{\top}\boldsymbol{Z}^{\top}A^{\top}\boldsymbol{P}
=(\beta-\hat{\beta})^{\top}\frac{1}{n}\sum_{i=1}^{n}\boldsymbol{W}_{i}^{\top}\left\{\mathsf{E}[WW^{\top}|U_{i}]\right\}^{-1}\cdot\mathsf{E}[WZ^{\top}|U_{i}]\{1+O_{p}(c_{n})\}\boldsymbol{P}_{i}
=O_{p}(n^{-1/2}),

where, again, the last equality is due to the law of large numbers. Finally, for

A\boldsymbol{N} =\begin{bmatrix}[\boldsymbol{W}_{1}^{\top}\quad\boldsymbol{0}]\{\boldsymbol{\tilde{W}}_{u_{1}}^{\top}\boldsymbol{K}_{u_{1}}\boldsymbol{\tilde{W}}_{u_{1}}\}^{-1}\boldsymbol{\tilde{W}}_{u_{1}}^{\top}\boldsymbol{K}_{u_{1}}\boldsymbol{N}\\ \vdots\\ [\boldsymbol{W}_{n}^{\top}\quad\boldsymbol{0}]\{\boldsymbol{\tilde{W}}_{u_{n}}^{\top}\boldsymbol{K}_{u_{n}}\boldsymbol{\tilde{W}}_{u_{n}}\}^{-1}\boldsymbol{\tilde{W}}_{u_{n}}^{\top}\boldsymbol{K}_{u_{n}}\boldsymbol{N}\end{bmatrix},

the same argument as for (51) leads to

\boldsymbol{\tilde{W}}_{u_{i}}^{\top}\boldsymbol{K}_{u_{i}}\boldsymbol{N}=n\left(f(U_{i})\mathsf{E}[WN^{\top}|U_{i}]+O_{p}(c_{n})\right),

where $\mathsf{E}[WN^{\top}|U_{i}]=0$ since $N$ is independent of $W$ and $U_{i}$, and $N$ has zero mean. Thus, by the law of large numbers,

I_{3} =\frac{1}{n}\boldsymbol{N}^{\top}A^{\top}\boldsymbol{P}=\frac{1}{n}\sum_{i=1}^{n}O_{p}(c_{n})\boldsymbol{P}_{i}=O_{p}(c_{n}).

$\blacksquare$

Appendix I Proofs of Theorem 3 and Corollaries 3 and 4

To reuse model (43) for the prediction of $Y^{\tau}$, the main challenge comes from $M$, which changes with $U$ (while $\beta$ remains invariant). First, we introduce some notation for the proofs. We define $(\boldsymbol{Y},\boldsymbol{W},\boldsymbol{W}_{i})$ and $(\boldsymbol{Z},\boldsymbol{Z}_{i},\boldsymbol{M},\boldsymbol{N})$ in the same way as in Appendix G. Similarly, we define $\boldsymbol{Z}_{V}$, $\boldsymbol{M}_{V}$, $\boldsymbol{N}_{V}$, and all the corresponding data matrices for the test data (e.g., $\boldsymbol{Y}^{\tau}$), and let $\sigma^{2}=\mathsf{E}[(N^{\tau})^{2}]$. With this notation, the IMP (11) implies that

M=\lambda M_{V}+\zeta^{\top}W:=[W^{\top},M_{V}]w, (53)

for some $\zeta\in\mathbb{R}^{p}$, where $w:=(\zeta^{\top},\lambda)^{\top}$. Let $\boldsymbol{\hat{W}}_{V}:=[\boldsymbol{W},\boldsymbol{\hat{M}}_{V}]$ and $\boldsymbol{\hat{W}}_{V}^{\tau}:=[\boldsymbol{W}^{\tau},\boldsymbol{\hat{M}}_{V}^{\tau}]$. Then, the OLS estimator of $w$ according to the above equation is given by

\hat{w}=(\boldsymbol{\hat{W}}_{V}^{\top}\boldsymbol{\hat{W}}_{V})^{-1}\boldsymbol{\hat{W}}_{V}^{\top}\boldsymbol{\hat{M}}. (54)

We predict $\boldsymbol{Y}^{\tau}$ using the continuous IMP estimator

\boldsymbol{\hat{Y}}^{\tau}=\boldsymbol{\hat{W}}_{V}^{\tau}\hat{w}+\boldsymbol{Z}^{\tau}\hat{\beta}, (55)

where $\hat{\beta}$, $\boldsymbol{\hat{M}}$, and $\boldsymbol{\hat{M}}_{V}$ are provided in Appendix G.
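Building on the profile least-squares sketch in Appendix G, (54)-(55) can be sketched in a few lines (added for illustration only; profile_least_squares is the hypothetical helper above, and $\boldsymbol{\hat{M}}_{V}$ is assumed to be obtained from an analogous profile fit for the matching variable).

```python
import numpy as np

def continuous_imp_predict(W, M_hat, M_V_hat, beta_hat, W_test, Z_test, M_V_hat_test):
    """Continuous IMP sketch of (54)-(55): regress M_hat on hat{W}_V = [W, hat{M}_V]
    to estimate w, then predict on test data using the invariant beta_hat."""
    W_V = np.column_stack([W, M_V_hat])                    # training hat{W}_V
    w_hat = np.linalg.lstsq(W_V, M_hat, rcond=None)[0]     # (54)
    W_V_test = np.column_stack([W_test, M_V_hat_test])     # test hat{W}_V^tau
    return W_V_test @ w_hat + Z_test @ beta_hat            # (55)
```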

Lemma 5.

Under assumptions 1)\sim 4) in Appendix H, it holds that

\hat{w}-w=O_{p}(c_{n}\vee n^{-1/2}).

Proof:

Using the fact that

\boldsymbol{\hat{W}}_{V}=[\boldsymbol{W},\boldsymbol{M}_{V}]+[\boldsymbol{0},\boldsymbol{\hat{M}}_{V}-\boldsymbol{M}_{V}],

we obtain

\frac{1}{n}\boldsymbol{\hat{W}}_{V}^{\top}\boldsymbol{\hat{W}}_{V}
=\frac{1}{n}[\boldsymbol{W},\boldsymbol{M}_{V}]^{\top}[\boldsymbol{W},\boldsymbol{M}_{V}]+\frac{1}{n}[\boldsymbol{W},\boldsymbol{M}_{V}]^{\top}[\boldsymbol{0},\boldsymbol{\hat{M}}_{V}-\boldsymbol{M}_{V}]+\frac{1}{n}[\boldsymbol{0},\boldsymbol{\hat{M}}_{V}-\boldsymbol{M}_{V}]^{\top}[\boldsymbol{W},\boldsymbol{M}_{V}]+\frac{1}{n}[\boldsymbol{0},\boldsymbol{\hat{M}}_{V}-\boldsymbol{M}_{V}]^{\top}[\boldsymbol{0},\boldsymbol{\hat{M}}_{V}-\boldsymbol{M}_{V}]
=\mathsf{E}\big{[}(W^{\top},M_{V})^{\top}(W^{\top},M_{V})\big{]}\{1+O_{p}(c_{n}\vee n^{-1/2})\},

where we use $R_{1}$ and $R_{2}$ from Lemma 4 and the law of large numbers. Similarly,

\frac{1}{n}\boldsymbol{\hat{W}}_{V}^{\top}\boldsymbol{\hat{M}}
=\frac{1}{n}[\boldsymbol{W},\boldsymbol{M}_{V}]^{\top}\boldsymbol{M}+\frac{1}{n}[\boldsymbol{W},\boldsymbol{M}_{V}]^{\top}(\boldsymbol{\hat{M}}-\boldsymbol{M})+\frac{1}{n}[\boldsymbol{0},\boldsymbol{\hat{M}}_{V}-\boldsymbol{M}_{V}]^{\top}(\boldsymbol{\hat{M}}-\boldsymbol{M})+\frac{1}{n}[\boldsymbol{0},\boldsymbol{\hat{M}}_{V}-\boldsymbol{M}_{V}]^{\top}\boldsymbol{M}
=\mathsf{E}\big{[}(W^{\top},M_{V})^{\top}M\big{]}\{1+O_{p}(c_{n}\vee n^{-1/2})\}.

Note that the rate is not directly implied by Lemma 4, but it can be proved using the same techniques demonstrated in the proof of Lemma 4. Therefore, using (53),

\hat{w}=(\boldsymbol{\hat{W}}_{V}^{\top}\boldsymbol{\hat{W}}_{V})^{-1}\boldsymbol{\hat{W}}_{V}^{\top}\boldsymbol{\hat{M}}=w\{1+O_{p}(c_{n}\vee n^{-1/2})\}.

$\blacksquare$

I-A Proof of Theorem 3

Using (55) and $\boldsymbol{Y}^{\tau}=\boldsymbol{M}^{\tau}+\boldsymbol{Z}^{\tau}\beta+\boldsymbol{N}^{\tau}$, we derive

\boldsymbol{\hat{Y}}^{\tau}-\boldsymbol{Y}^{\tau}
= [\boldsymbol{W}^{\tau},\boldsymbol{\hat{M}}^{\tau}_{V}]\hat{w}+\boldsymbol{Z}^{\tau}\hat{\beta}-[\boldsymbol{W}^{\tau},\boldsymbol{M}^{\tau}_{V}]w-\boldsymbol{Z}^{\tau}\beta-\boldsymbol{N}^{\tau}
= [\boldsymbol{0},\boldsymbol{\hat{M}}^{\tau}_{V}-\boldsymbol{M}^{\tau}_{V}](\hat{w}-w)+[\boldsymbol{W}^{\tau},\boldsymbol{M}^{\tau}_{V}](\hat{w}-w)+[\boldsymbol{0},\boldsymbol{\hat{M}}^{\tau}_{V}-\boldsymbol{M}^{\tau}_{V}]w+\boldsymbol{Z}^{\tau}(\hat{\beta}-\beta)-\boldsymbol{N}^{\tau}.

Then, the generalization error is given by

\frac{1}{m}(\boldsymbol{\hat{Y}}^{\tau}-\boldsymbol{Y}^{\tau})^{\top}(\boldsymbol{\hat{Y}}^{\tau}-\boldsymbol{Y}^{\tau})=\sum_{i=1}^{10}J_{i},

where

J_{1} =\frac{1}{m}(\hat{w}-w)^{\top}[\boldsymbol{0},\boldsymbol{\hat{M}}^{\tau}_{V}-\boldsymbol{M}^{\tau}_{V}]^{\top}[\boldsymbol{0},\boldsymbol{\hat{M}}^{\tau}_{V}-\boldsymbol{M}^{\tau}_{V}](\hat{w}-w),
J_{2} =\frac{1}{m}(\hat{w}-w)^{\top}[\boldsymbol{W}^{\tau},\boldsymbol{M}^{\tau}_{V}]^{\top}[\boldsymbol{W}^{\tau},\boldsymbol{M}^{\tau}_{V}](\hat{w}-w),
J_{3} =\frac{1}{m}w^{\top}[\boldsymbol{0},\boldsymbol{\hat{M}}^{\tau}_{V}-\boldsymbol{M}^{\tau}_{V}]^{\top}[\boldsymbol{0},\boldsymbol{\hat{M}}^{\tau}_{V}-\boldsymbol{M}^{\tau}_{V}]w,
J_{4} =\frac{1}{m}(\hat{\beta}-\beta)^{\top}(\boldsymbol{Z}^{\tau})^{\top}\boldsymbol{Z}^{\tau}(\hat{\beta}-\beta),
J_{5} =\frac{1}{m}(\boldsymbol{N}^{\tau})^{\top}\boldsymbol{N}^{\tau},
J_{6} =\frac{1}{m}(\hat{w}-w)^{\top}[\boldsymbol{0},\boldsymbol{\hat{M}}^{\tau}_{V}-\boldsymbol{M}^{\tau}_{V}]^{\top}\big{\{}[\boldsymbol{W}^{\tau},\boldsymbol{M}^{\tau}_{V}](\hat{w}-w)+[\boldsymbol{0},\boldsymbol{\hat{M}}^{\tau}_{V}-\boldsymbol{M}^{\tau}_{V}]w+\boldsymbol{Z}^{\tau}(\hat{\beta}-\beta)-\boldsymbol{N}^{\tau}\big{\}},
J_{7} =\frac{1}{m}(\hat{w}-w)^{\top}[\boldsymbol{W}^{\tau},\boldsymbol{M}^{\tau}_{V}]^{\top}\big{\{}[\boldsymbol{0},\boldsymbol{\hat{M}}^{\tau}_{V}-\boldsymbol{M}^{\tau}_{V}](\hat{w}-w)+[\boldsymbol{0},\boldsymbol{\hat{M}}^{\tau}_{V}-\boldsymbol{M}^{\tau}_{V}]w+\boldsymbol{Z}^{\tau}(\hat{\beta}-\beta)-\boldsymbol{N}^{\tau}\big{\}},
J_{8} =\frac{1}{m}w^{\top}[\boldsymbol{0},\boldsymbol{\hat{M}}^{\tau}_{V}-\boldsymbol{M}^{\tau}_{V}]^{\top}\big{\{}[\boldsymbol{0},\boldsymbol{\hat{M}}^{\tau}_{V}-\boldsymbol{M}^{\tau}_{V}](\hat{w}-w)+[\boldsymbol{W}^{\tau},\boldsymbol{M}^{\tau}_{V}](\hat{w}-w)+\boldsymbol{Z}^{\tau}(\hat{\beta}-\beta)-\boldsymbol{N}^{\tau}\big{\}},
J_{9} =\frac{1}{m}(\hat{\beta}-\beta)^{\top}(\boldsymbol{Z}^{\tau})^{\top}\big{\{}[\boldsymbol{0},\boldsymbol{\hat{M}}^{\tau}_{V}-\boldsymbol{M}^{\tau}_{V}](\hat{w}-w)+[\boldsymbol{W}^{\tau},\boldsymbol{M}^{\tau}_{V}](\hat{w}-w)+[\boldsymbol{0},\boldsymbol{\hat{M}}^{\tau}_{V}-\boldsymbol{M}^{\tau}_{V}]w-\boldsymbol{N}^{\tau}\big{\}},
J_{10} =-\frac{1}{m}(\boldsymbol{N}^{\tau})^{\top}\big{\{}[\boldsymbol{0},\boldsymbol{\hat{M}}^{\tau}_{V}-\boldsymbol{M}^{\tau}_{V}](\hat{w}-w)+[\boldsymbol{W}^{\tau},\boldsymbol{M}^{\tau}_{V}](\hat{w}-w)+[\boldsymbol{0},\boldsymbol{\hat{M}}^{\tau}_{V}-\boldsymbol{M}^{\tau}_{V}]w+\boldsymbol{Z}^{\tau}(\hat{\beta}-\beta)\big{\}}.

We can show the following rates of these terms through simple applications of Lemmas 3, 4, and 5, along with the law of large numbers and the central limit theorem: $J_{2},J_{3}=O_{p}(c_{n}^{2}\vee n^{-1})$, $J_{4}=O_{p}(n^{-1})$, $J_{5}=\sigma^{2}+O_{p}(m^{-1/2})$,

J_{10} =O_{p}(c_{m}\vee m^{-1/2})\cdot O_{p}(c_{n}\vee n^{-1/2})+O_{p}(c_{n}\vee n^{-1/2})+O_{p}(c_{m}\vee m^{-1/2})+O_{p}(n^{-1/2}),

and $J_{1},J_{6}\sim J_{9}$ are of higher order compared with either $O_{p}(c_{n}\vee n^{-1/2})$ or $O_{p}(c_{m}\vee m^{-1/2})$ in $J_{10}$. This completes the proof of Theorem 3.

I-B Proof of Corollary 3

We use $l$ as the shorthand notation for $l_{n}$ in the proof. Since the estimation of $\beta$, $\boldsymbol{M}$, and $w$ is based on the labeled data, we have $\hat{\beta}-\beta=O_{p}(l^{-1/2})$ in Lemma 3, and $R_{1},R_{2},R_{3}=O_{p}(c_{l}\vee l^{-1/2})$ and $R_{4}=O_{p}(c_{l})$ in Lemma 4. In the proof of Lemma 5, the rate of $\boldsymbol{\hat{W}}_{V}^{\top}\boldsymbol{\hat{W}}_{V}$ remains the same since the estimation of $\boldsymbol{M}_{V}$ only uses the unlabeled data, while the rate of $\boldsymbol{\hat{W}}_{V}^{\top}\boldsymbol{\hat{M}}$ now depends on $l$. This observation implies $\hat{w}-w=O_{p}(c_{l}\vee l^{-1/2})$. Now, observe that $J_{2}=O_{p}(c_{l}^{2}\vee l^{-1})$, $J_{3}=O_{p}(c_{m}\vee m^{-1/2})$, $J_{4}=O_{p}(l^{-1})$, $J_{5}=\sigma^{2}+O_{p}(m^{-1/2})$, and $J_{10}=O_{p}(c_{l}\vee l^{-1/2})+O_{p}(m^{-1/2})$. Since $\max(\frac{l}{n},\frac{l}{m})\to 0$ as $\min(n,m)\to\infty$, we get $\sum_{i=1}^{10}J_{i}=\sigma^{2}+O_{p}(c_{l}\vee l^{-1/2})$.

I-C Proof of Corollary 4

According to the definition of the discrete IMP estimator, all the coefficients are treated as varying coefficients (i.e., $\beta=\boldsymbol{0}$ and $\beta_{V}=\boldsymbol{0}$), and $\boldsymbol{M}$ is estimated by performing the OLS for each environment and then stacking the estimates into one vector; thus $R_{1}=O_{p}(a_{n}^{2})$ and $R_{2},R_{4}=O_{p}(a_{n})$ in Lemma 4, with $a_{n}=(\min_{e\in\mathcal{E}^{\text{train}}}n_{e})^{-1/2}$, by the asymptotic normality of the OLS estimators (note that $X$ and $Z$ are assumed to have finite fourth moments in Lemma 3). Accordingly, we have $\hat{w}-w=O_{p}(a_{n})$ in Lemma 5 using the law of large numbers. Setting $Z^{\tau}=\boldsymbol{0}$ in $J_{1}\sim J_{10}$, we obtain $J_{2},J_{3}=O_{p}(a_{n}^{2})$, $J_{4}=0$, $J_{5}=\sigma^{2}+O_{p}(m^{-1/2})$, and $J_{10}=O_{p}(a_{n})+O_{p}(a_{m})$, and the other terms are of higher order compared with $O_{p}(a_{n})$ or $O_{p}(a_{m})$. Similar to Theorem 3, the asymptotic generalization error is dominated by $J_{10}$.
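To make the construction in this proof concrete, a minimal numpy sketch of the discrete IMP estimator is given below (added for illustration only; the per-environment OLS step and the final matching regression follow the description above, while function and variable names are assumptions).

```python
import numpy as np

def discrete_imp_fit(Y, W, M_V_hat, env):
    """Discrete IMP sketch: all coefficients vary with the environment, so M is estimated
    by a separate OLS of Y on W within each training environment and the fitted values are
    stacked into one vector; w is then estimated by regressing M_hat on [W, hat{M}_V]."""
    M_hat = np.empty_like(Y, dtype=float)
    for e in np.unique(env):
        idx = (env == e)
        theta_e = np.linalg.lstsq(W[idx], Y[idx], rcond=None)[0]  # OLS within environment e
        M_hat[idx] = W[idx] @ theta_e
    W_V = np.column_stack([W, M_V_hat])
    return np.linalg.lstsq(W_V, M_hat, rcond=None)[0]             # w_hat

def discrete_imp_predict(w_hat, W_test, M_V_hat_test):
    """Prediction on test data; Z^tau = 0 in this setting, so only [W^tau, hat{M}_V^tau] enters."""
    return np.column_stack([W_test, M_V_hat_test]) @ w_hat
```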