
Bridging Multicalibration and Out-of-distribution Generalization Beyond Covariate Shift

Jiayun Wu
Department of Computer Science & Technology, Tsinghua University
[email protected]

Jiashuo Liu
Department of Computer Science & Technology, Tsinghua University
[email protected]

Peng Cui
Department of Computer Science & Technology, Tsinghua University
[email protected]

Zhiwei Steven Wu
School of Computer Science, Carnegie Mellon University
[email protected]

Work completed as a research intern at Carnegie Mellon University.
Abstract

We establish a new model-agnostic optimization framework for out-of-distribution generalization via multicalibration, a criterion that ensures a predictor is calibrated across a family of overlapping groups. Multicalibration has been shown to be associated with the robustness of statistical inference under covariate shift. We further establish a link between multicalibration and robustness for prediction tasks both under and beyond covariate shift. We accomplish this by extending multicalibration to incorporate grouping functions that consider covariates and labels jointly. This leads to an equivalence between extended multicalibration and invariance, an objective for robust learning in the presence of concept shift. We show a linear structure of the grouping function class spanned by density ratios, resulting in a unifying framework for robust learning by designing specific grouping functions. We propose MC-Pseudolabel, a post-processing algorithm to achieve both extended multicalibration and out-of-distribution generalization. The algorithm, with lightweight hyperparameters and optimization through a series of supervised regression steps, achieves superior performance on real-world datasets with distribution shift.

1 Introduction

We revisit the problem of out-of-distribution generalization and establish new connections with multicalibration (Hébert-Johnson et al., 2018), a criterion originating from algorithmic fairness. Multicalibration is a strengthening of calibration, which requires a predictor $f$ to be correct on average within each level set:

$$\mathbb{E}[Y-f(X)\mid f(X)]=0$$

Calibration is a relatively weak property, as it can be satisfied even by the uninformative constant predictor $f(X)=\mathbb{E}[Y]$ that predicts the average outcome. More broadly, calibration provides only a marginal guarantee that does not extend to sub-populations. Multicalibration (Hébert-Johnson et al., 2018) mitigates this issue by requiring calibration to hold over a family of (overlapping) subgroups $\mathcal{H}$: for all $h\in\mathcal{H}$,

$$\mathbb{E}[(Y-f(X))\,h(X)\mid f(X)]=0$$

Multicalibration was initially studied as a measure of subgroup fairness for boolean grouping functions $h$, with $h(X)=1$ indicating that $X$ is a member of group $h$ (Hébert-Johnson et al., 2018). Subsequently, Gopalan et al. (2022) and Kim et al. (2019) adopted a broader class of real-valued grouping functions that can identify sub-populations through reweighting. The formulation with real-valued grouping functions has enabled surprising connections between multicalibration and distribution shifts. Prior work (Kim et al., 2022; Roth, 2022) studied how distribution shift affects the measure of multicalibration, with a focus on covariate shift, where the relationship between $X$ and $Y$ remains fixed. Kim et al. (2022) show that whenever the set of real-valued grouping functions $\mathcal{H}$ includes the density ratio between the source and target distributions, a multicalibrated predictor with respect to the source remains calibrated on the shifted target distribution.
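To make the gap between calibration and multicalibration concrete, the following NumPy sketch (with hypothetical synthetic data of our own) checks that the constant predictor $f(X)=\mathbb{E}[Y]$ passes overall calibration but fails multicalibration for a boolean grouping function:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: Y's mean depends on a binary group X.
x = rng.integers(0, 2, size=100_000)                 # group membership
y = rng.binomial(1, np.where(x == 1, 0.8, 0.2)).astype(float)

# The constant predictor f(X) = E[Y] has a single level set, on which
# E[Y - f(X)] = 0, so it is perfectly calibrated.
f = np.full_like(y, y.mean())
calib_err = abs(np.mean(y - f))

# But for the grouping function h(X) = 1{X = 1}, the conditional mean
# E[(Y - f(X)) h(X) | f(X)] is far from zero: multicalibration fails.
h = (x == 1).astype(float)
mc_err = abs(np.mean((y - f) * h))

print(calib_err, mc_err)
```

Here `mc_err` is roughly $P(X{=}1)\cdot(\mathbb{E}[Y\mid X{=}1]-\mathbb{E}[Y])\approx 0.15$, while `calib_err` vanishes.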

Our work substantially expands the connections between multicalibration and distribution shifts. At a high level, our results show that robust prediction under distribution shift can actually be facilitated by multicalibration. We extend the notion of multicalibration by incorporating grouping functions that simultaneously consider both covariates $X$ and outcomes $Y$. This extension enables us to go beyond covariate shift and account for concept shift, which is prevalent in practice due to spurious correlation, missing variables, or confounding (Liu et al., 2023).

Our contributions. Building on the introduction of joint grouping functions, we establish new connections between our extended multicalibration notion and algorithmic robustness in the general setting of out-of-distribution generalization, where the target distribution on which the model is assessed differs from the source distribution on which it is learned.

1. We first revisit the setting of covariate shift and show that multicalibration implies Bayes optimality under covariate shift, provided a sufficiently rich class of grouping functions. Then, in the setting of concept shift, we show the equivalence of multicalibration and invariance (Arjovsky et al., 2019), a learning objective that searches for a Bayes optimal predictor $\mathbb{E}[Y|\Phi(X)]$ under a representation over features $\Phi(X)$, even though $\mathbb{E}[Y|X]$ differs across target distributions. We show a correspondence between an invariant representation $\Phi(X)$ and a multicalibrated predictor $\mathbb{E}[Y|\Phi(X)]$, with a grouping function class containing all density ratios between target distributions and the source distribution.

2. As part of our structural analysis of the new multicalibration concept, we investigate the maximal grouping function class that allows for a nontrivial multicalibrated predictor. For traditional covariate-based grouping functions, the Bayes optimal predictor $f(X)=\mathbb{E}[Y|X]$ is always multicalibrated, which is no longer the case for joint grouping functions. We show that the maximal grouping function class is a linear space spanned by the density ratios of the target distributions on which the predictor is invariant. As a structural characterization of distribution shift, this leads to an efficient parameterization of the grouping functions by linear combinations of a spanning set of density ratios. The spanning set can be flexibly designed to incorporate implicit assumptions of various methodologies for robust learning, including multi-environment learning (Shi et al., 2021) and hard sample learning (Liu et al., 2021).

3. We devise a post-processing algorithm to multicalibrate predictors and simultaneously produce invariant predictors. As a multicalibration algorithm, we prove its convergence under Gaussian distributions of data and certify multicalibration upon convergence. As a robust learning algorithm, the procedure is plainly supervised regression with respect to the models' hypothesis class and the grouping function class, introducing only the overhead of linear regression. This stands out from heavy optimization techniques for out-of-distribution generalization, such as bi-level optimization (Finn et al., 2017; Liu et al., 2022) and multi-objective learning (Ahuja et al., 2021; Arjovsky et al., 2019; Koyama and Yamaguchi, 2020), which typically involve high-order gradients (Ramé et al., 2022). The algorithm introduces no extra hyperparameters. This simplifies model selection, which is a significant challenge for out-of-distribution generalization since validation data is unavailable from the distribution where the model is deployed (Gulrajani and Lopez-Paz, 2021). Under the standard model selection protocol of DomainBed (Gulrajani and Lopez-Paz, 2021), the algorithm achieves superior performance to existing methods on real-world datasets with concept shift, including poverty estimation (Yeh et al., 2020), personal income prediction (Ding et al., 2021), and power consumption prediction (Malinin et al., 2021, 2022).

2 Multicalibration and Bayes Optimality under Covariate Shift

2.1 Multicalibration with Joint Grouping Functions

Settings of Out-of-distribution Generalization

We are concerned with prediction tasks under distribution shift, where covariates are denoted by a random vector $X\in\mathcal{X}$ and the target by $Y\in\mathcal{Y}$. The values taken by random variables are written in lowercase as $x,y$. We have an uncertainty set of absolutely continuous probability measures, denoted by $\mathcal{P}(X,Y)$, containing an accessible source measure $P_S\in\mathcal{P}$ and an unknown target measure $P_T\in\mathcal{P}$. We use capital letters such as $P$ to denote a single probability measure and lowercase letters such as $p$ to denote its probability density function. We define predictors as real-valued functions $f:\mathcal{X}\rightarrow\mathcal{Y}$, learned on the source distribution $P_S$ and assessed on the target distribution $P_T$. Given a loss function $\ell:\mathcal{Y}\times\mathcal{Y}\rightarrow\mathbb{R}$, we evaluate the average risk of a predictor $f$ w.r.t. a probability measure $P$, defined by $R_P(f):=\mathbb{E}_P[\ell(f(X),Y)]$. We focus on the setting with $\mathcal{Y}=[0,1]$ and $\ell(\hat{y},y)=(\hat{y}-y)^2$ in our theoretical analyses.

We propose a new definition of $\ell_2$ approximate multicalibration with joint grouping functions.

Definition 2.1 (Multicalibration with Joint Grouping Functions).

For a probability measure $P(X,Y)$ and a predictor $f$, let $\mathcal{H}\subset\mathbb{R}^{\mathcal{X}\times\mathcal{Y}}$ be a real-valued grouping function class. We say that $f$ is $\alpha$-approximately $\ell_2$ multicalibrated w.r.t. $\mathcal{H}$ and $P$ if for all $h\in\mathcal{H}$:

$$K_2(f,h,P)=\int\Big(\mathbb{E}_P\left[h(X,Y)(Y-v)\,\big|\,f(X)=v\right]\Big)^2\,dP_{f(X)}(v)\leq\alpha. \quad (1)$$

$P_{f(X)}(v)=P(f^{-1}(v))$ is the pushforward measure. We say $f$ is $\alpha$-approximately calibrated for $h\equiv 1$. We say $f$ is multicalibrated (calibrated) for $\alpha=0$. If the grouping function is defined on $X$ alone, which implies $h(x,y_1)=h(x,y_2)$ for any $x\in\mathcal{X}$ and $y_1,y_2\in\mathcal{Y}$, we abbreviate $h(x,\cdot)$ by $h(x)$.

In the special case of a constant grouping function $h\equiv 1$, $K_2(f,1,P)$ recovers the overall calibration error. For boolean grouping functions defined on $X$ (Hébert-Johnson et al., 2018), $K_2(f,h,P)$ computes the calibration error of the subgroup with $h(x)=1$. For real-valued grouping functions defined on $X$ (Gopalan et al., 2022; Kim et al., 2019), $K_2(f,h,P)$ evaluates a reweighted calibration error, whose weights $h(x)$ are proportional to the likelihood of a sample belonging to the subgroup. Most importantly, we propose an extended domain of grouping functions defined on covariates and outcomes jointly, which is useful for capturing more general distribution shifts. The multicalibration error quantifies the maximal calibration error over all subgroups associated with grouping functions in $\mathcal{H}$. We discuss multicalibration with covariate-based grouping functions in this section and with joint grouping functions in the next section.
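For intuition, the estimator below (a NumPy sketch; the synthetic data and the discretization grid are our own assumptions) computes the empirical $K_2$ error of Definition 2.1 by averaging squared conditional means over the level sets of a finite-range predictor:

```python
import numpy as np

def k2_error(f_vals, y, h_vals):
    """Empirical K_2(f, h, P) from Definition 2.1: the squared conditional
    mean of h * (Y - v) on each level set {f = v}, averaged under the
    pushforward distribution of f(X)."""
    err = 0.0
    for v in np.unique(f_vals):
        mask = f_vals == v
        cond = np.mean(h_vals[mask] * (y[mask] - v))   # E[h (Y - v) | f = v]
        err += mask.mean() * cond ** 2                  # weight by P(f(X) = v)
    return err

# Synthetic check: a well-specified, discretized predictor should have a
# small calibration error (h = 1), up to binning and sampling noise.
rng = np.random.default_rng(0)
x = rng.uniform(size=50_000)
y = x + rng.normal(scale=0.05, size=x.shape)
f_vals = np.round(x, 1)                  # predictor with finite range
h_vals = np.ones_like(y)                 # h = 1 recovers overall calibration
print(k2_error(f_vals, y, h_vals))
```

Joint grouping functions are handled by passing the per-sample values $h(x_i,y_i)$ as `h_vals`.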

2.2 Multicalibration Implies Bayes Optimality under Covariate Shift

In this subsection we focus on grouping functions $h(x)$ defined on covariates. We prove that approximately multicalibrated predictors simultaneously approach Bayes optimality on every target distribution under covariate shift, bridging the results of Kim et al. (2022) and Globus-Harris et al. (2023). To recap, Kim et al. (2022) study multicalibration under covariate shift and show that a multicalibrated predictor remains calibrated on the target distribution for a sufficiently large grouping function class. Further, multicalibrated predictors remain multicalibrated under covariate shift (Kim et al., 2022; Roth, 2022), assuming the grouping function class $\mathcal{H}$ is closed under a transformation by density ratios (Assumption 2.2.1). Second, Globus-Harris et al. (2023) show that multicalibration implies Bayes optimal accuracy, assuming $\mathcal{H}$ satisfies a weak learning condition (Assumption 2.2.2). A detailed discussion of other related work is deferred to Section A in the appendix.

Assumption 2.2 (Sufficiency of Grouping Function Class (informal, see Assumption F.1)).

1. (Closure under Covariate Shift) For a set of probability measures $\mathcal{P}(X)$ containing the source measure $P_S(X)$, $h\in\mathcal{H}$ implies $p/p_S\cdot h\in\mathcal{H}$ for any density function $p$ of a distribution in $\mathcal{P}$.

2. ($(\gamma,\rho)$-Weak Learning Condition) For any $P\in\mathcal{P}(X)P_S(Y|X)\equiv\{P'(X)P_S(Y\mid X):P'\in\mathcal{P}\}$ with the source conditional $P_S(Y|X)$, and every subset $G\subset\mathcal{X}$ with $P(X\in G)>\rho$: if the Bayes optimal predictor $\mathbb{E}_P[Y|X]$ has lower risk than the constant predictor $\mathbb{E}_P[Y|X\in G]$ by a margin $\gamma$, then there exists a predictor $h\in\mathcal{H}$ that also improves on the constant predictor by the margin $\gamma$.

Theorem 2.3 (Risk Bound under Covariate Shift).

For a source measure $P_S(X,Y)$ and a set of probability measures $\mathcal{P}(X)$ containing $P_S(X)$, given a predictor $f:\mathcal{X}\rightarrow[0,1]$ with finite range $m:=|\text{Range}(f)|$, consider a grouping function class $\mathcal{H}$ closed under affine transformations and satisfying Assumption 2.2 with $\rho=\gamma/m$. If $f$ is $\frac{\gamma^6}{256m^2}$-approximately $\ell_2$ multicalibrated w.r.t. $P_S$ and $\mathcal{H}_1:=\{h\in\mathcal{H}:\max_{x\in\mathcal{X}}h(x)^2\leq 1\}$, then for any target measure $P_T\in\mathcal{P}(X)P_S(Y|X)$,

$$R_{P_T}(f)\leq\inf_{f^*:\mathcal{X}\rightarrow[0,1]}R_{P_T}(f^*)+3\gamma. \quad (2)$$
Remark 2.4.

Following prior work in multicalibration (Globus-Harris et al., 2023; Roth, 2022), we study functions $f$ whose range has finite cardinality, which can be obtained by discretization.

3 Multicalibration and Invariance under Concept Shift

Theorem 2.3 shows that multicalibration implies Bayes optimal accuracy on target distributions under covariate shift. In practical scenarios, however, there are both marginal distribution shifts of the covariates ($X$) and concept shifts of the conditional distribution ($Y|X$). Concept shift is especially prevalent in tabular data due to missing variables and confounding (Liu et al., 2023). In order to go beyond covariate shift, we focus on grouping functions defined on covariates and outcomes jointly. We show that the multicalibration notion w.r.t. joint grouping functions is equivalent to invariance, a criterion for robust prediction under concept shift. Extending the robustness of multicalibration to general shift is non-trivial. The fundamental challenge is that no shared predictor is optimal on every target distribution, because the Bayes optimal predictor varies across different $Y|X$ distributions. As a first step, we show that predictors multicalibrated w.r.t. joint grouping functions are robust in the sense that, on each target distribution, they are optimal over all post-processing functions.

Theorem 3.1 (Risk Bound under Concept Shift).

For a set of absolutely continuous probability measures $\mathcal{P}(X,Y)$ containing the source measure $P_S(X,Y)$, consider a predictor $f:\mathcal{X}\rightarrow[0,1]$. Assume the grouping function class $\mathcal{H}$ satisfies the following condition:

$$\mathcal{H}\supset\left\{h(x,y)=\frac{p(x,y)}{p_S(x,y)}\,\Big|\,P\in\mathcal{P}(X,Y)\right\}. \quad (3)$$

If $f$ is $\alpha$-approximately $\ell_2$ multicalibrated w.r.t. $\mathcal{H}$ and $P_S$, then for any measure $P\in\mathcal{P}(X,Y)$,

$$R_P(f)\leq\inf_{g:[0,1]\rightarrow[0,1]}R_P(g\circ f)+2\sqrt{\alpha}. \quad (4)$$

The theorem shows that a predictor approximately multicalibrated on the source can barely be improved by post-processing on any target distribution. To ensure such robustness, the grouping function class must include all density ratios between target and source measures, which are functions over $\mathcal{X}\times\mathcal{Y}$. This characterization of robustness in terms of post-processing echoes Invariant Risk Minimization (IRM) (Arjovsky et al., 2019), a paradigm for out-of-distribution generalization under $Y|X$ shift, whose analysis, however, focuses on representation learning.

Definition 3.2 (Invariant Predictor).

Consider data collected from multiple environments in the set $\mathscr{E}$, where the probability measure in an environment $e\in\mathscr{E}$ is denoted by $P_e(X,Y)$. Denote the representation over covariates by a measurable function $\Phi(x)$. We say that $\Phi$ elicits an $\alpha$-approximately invariant predictor $g^*\circ\Phi$ across $\mathscr{E}$ if there exists a function $g^*\in\mathcal{G}:=\{g:\text{supp}(\Phi)\rightarrow[0,1]\}$ such that for all $e\in\mathscr{E}$:

$$R_{P_e}(g^*\circ\Phi)\leq\inf_{g\in\mathcal{G}}R_{P_e}(g\circ\Phi)+\alpha. \quad (5)$$
Remark 3.3.

(1) Predictors in $\mathcal{G}$ take a representation $\Phi$ extracted from the covariates as input. For a general predictor $f(x)$, if we take $\Phi(x)=f(x)$ and $g^*$ as the identity function, Equation 5 reduces to the form of Equation 4. Therefore, $f$ in Equation 4 is a $2\sqrt{\alpha}$-approximately invariant predictor across environments collected from the uncertainty set $\mathcal{P}$. (2) We give an approximate definition of invariant predictors, which recovers the original definition (Arjovsky et al., 2019) when $\alpha=0$. In this case, there exists a shared Bayes optimal predictor $g^\star$ across environments, taking $\Phi$ as input. This implies $\mathbb{E}_{e_1}[Y|\Phi]=\mathbb{E}_{e_2}[Y|\Phi]$ almost surely for any $e_1,e_2$.
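As an illustration (a NumPy sketch on hypothetical synthetic environments), Definition 3.2 can be checked empirically by comparing the risk of a candidate $g^\star\circ\Phi$ against an estimate of the per-environment infimum, here approximated by the conditional mean $\mathbb{E}_e[Y\mid\Phi]$ on a discretized $\Phi$; the helper names and the binning scheme are our own:

```python
import numpy as np

rng = np.random.default_rng(0)

def risk(pred, y):
    return float(np.mean((pred - y) ** 2))

def env_slack(phi, y, g_star, n_bins=20):
    """Slack R_e(g* o Phi) - inf_g R_e(g o Phi): the infimum is approximated
    by the per-bin conditional mean E_e[Y | Phi] on a discretized Phi, so
    the estimate can come out slightly negative."""
    bins = np.clip((phi * n_bins).astype(int), 0, n_bins - 1)
    best = np.zeros_like(y)
    for b in range(n_bins):
        m = bins == b
        if m.any():
            best[m] = y[m].mean()
    return risk(g_star(phi), y) - risk(best, y)

# Two environments sharing E[Y | Phi] = Phi but with different noise levels,
# so g*(phi) = phi should be near-optimal in both.
envs = []
for scale in (0.05, 0.2):
    phi = rng.uniform(size=100_000)
    envs.append((phi, phi + rng.normal(scale=scale, size=phi.shape)))

slacks = [env_slack(phi, y, g_star=lambda p: p) for phi, y in envs]
print(slacks)  # near zero in both environments: approximately invariant
```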

IRM searches for a representation such that the optimal predictor on top of the representation is invariant across environments. Motivated by causality, the interaction between outcomes and their causes is also assumed invariant, so IRM learns a representation of causal variables for stable prediction. We extend Theorem 3.1 to representation learning and prove an equivalence between multicalibrated and invariant predictors.

Theorem 3.4 (Equivalence of Multicalibration and Invariance).

Assume samples are drawn from an environment $e\in\mathscr{E}$ with a prior $P_S(e)$ such that $\sum_{e\in\mathscr{E}}P_S(e)=1$ and $P_S(e)>0$. The overall population satisfies $P_S(X,Y)=\sum_{e\in\mathscr{E}}P_e(X,Y)P_S(e)$, where $P_e(X,Y)$ is the environment-specific absolutely continuous measure. With a measurable function $\Phi(x)$, define a function class $\mathcal{H}$ as:

$$\mathcal{H}:=\left\{h(x,y)=\frac{p_e(x,y)}{p_S(x,y)}\,\Big|\,e\in\mathscr{E}\right\}. \quad (6)$$

1. If there is a bijection $g^\star:\text{supp}(\Phi)\rightarrow[0,1]$ such that $g^\star\circ\Phi$ is $\alpha$-approximately $\ell_2$ multicalibrated w.r.t. $\mathcal{H}$ and $P_S$, then $\Phi$ elicits a $2\sqrt{\alpha}$-approximately invariant predictor $g^\star\circ\Phi$ across $\mathscr{E}$.

2. If there is $g^\star:\text{supp}(\Phi)\rightarrow[0,1]$ such that $\Phi$ elicits an $\alpha$-approximately invariant predictor $g^\star\circ\Phi$ across $\mathscr{E}$, then $g^\star\circ\Phi$ is $\sqrt{\alpha/D}$-approximately $\ell_2$ multicalibrated w.r.t. $\mathcal{H}$ and $P_S$, where $D=\min_{e\in\mathscr{E}}P_S(e)$.

Remark 3.5.

(1) In the first statement, assuming $g^\star$ is a bijection avoids degenerate cases where $\Phi$ contains redundant information. For example, every predictor $g^\star\circ\Phi(X)$ over the representation $\Phi$ equals $(g^\star\circ\Phi)\circ\mathbb{I}(X)$ over the identity representation $X$. Confining $g^\star$ to bijections ensures a unique decomposition into predictors and representations. (2) Wald et al. (2021) prove an equivalence between exact invariance and simultaneous calibration in each environment. We strengthen their result by showing that multicalibration on a single source distribution suffices for invariance. Moreover, our results extend directly beyond their multi-environment setting to a general uncertainty set of target distributions, via the mapping between grouping functions and density ratios. Further, our theorem is established for both exact and approximate invariance.

The theorem bridges approximate multicalibration with approximate invariance for out-of-distribution generalization beyond covariate shift. The equivalence indicates that the density ratios between target and source distributions constitute the minimal grouping function class required for robust prediction in terms of invariance.

4 Structure of Grouping Function Classes

Section 3 inspires one to construct richer grouping function classes for stronger generalizability. However, fewer predictors are multicalibrated with respect to a richer function class. For covariate-based grouping functions, the Bayes optimal predictor $\mathbb{E}[Y|X]$ is always multicalibrated. But the existence of a multicalibrated predictor is nontrivial for joint grouping functions. For example, when we take the grouping function $h(x,y)=y$, a multicalibrated predictor $f$ satisfies $f(X)=\mathbb{E}[Y|f(X),h(X,Y)]=Y$, which is impossible for regression with label noise. In this section, we first study the maximal grouping function class that is feasible for a multicalibrated predictor. Then, we leverage our structural results to inform the design of grouping functions.

4.1 Maximal Grouping Function Space

We focus on continuous grouping functions defined on a compact set $\mathcal{X}\times\mathcal{Y}\subset\mathbb{R}^{d+1}$, i.e., $h\in C(\mathcal{X}\times\mathcal{Y})$, and consider absolutely continuous probability measures supported on $\mathcal{X}\times\mathcal{Y}$ with continuous density functions. Our first proposition shows that the maximal grouping function class for any predictor is a linear space.

Proposition 4.1 (Maximal Grouping Function Class).

Given an absolutely continuous probability measure $P_S(X,Y)$ and a predictor $f:\mathcal{X}\rightarrow[0,1]$, define the maximal grouping function class with respect to which $f$ is multicalibrated:

$$\mathcal{H}_f:=\{h\in C(\mathcal{X}\times\mathcal{Y}):K_2(f,h,P_S)=0\}. \quad (7)$$

Then $\mathcal{H}_f$ is a linear space.

In the following, we further analyze the spanning set of the maximal grouping function class for nontrivial predictors that are at least calibrated.

Theorem 4.2 (Spanning Set).

Consider an absolutely continuous probability measure $P_S(X,Y)$ and a calibrated predictor $f:\mathcal{X}\rightarrow[0,1]$. Then its maximal grouping function class $\mathcal{H}_f$ is given by:

$$\mathcal{H}_f=\mathrm{span}\left\{\frac{p(x,y)}{p_S(x,y)}:\text{$p$ is continuous and }R_P(f)=\inf_{g:[0,1]\rightarrow[0,1]}R_P(g\circ f)\right\}. \quad (8)$$

A predictor's maximal grouping function class is spanned by the density ratios of the target distributions on which the predictor is invariant. Correspondingly, Theorem 3.1 gives the minimal grouping function class, comprising density ratios between target and source distributions, required to ensure $f(x)$ is an invariant predictor. In contrast, Theorem 4.2 states that the maximal grouping function class for $f(x)$ is exactly the linear space spanned by those density ratios. Next, we investigate sub-structures of the maximal grouping function class, focusing on the representation learning setting of IRM.

Theorem 4.3 (Decomposition of Grouping Function Space).

Consider an absolutely continuous probability measure $P_S(X,Y)$ and a measurable function $\Phi:\mathbb{R}^d\rightarrow\mathbb{R}^{d_\Phi}$ with $d_\Phi\in\mathbb{Z}^+$. We define the Bayes optimal predictor over $\Phi$ as $f_\Phi(x)=\mathbb{E}_{P_S}[Y|\Phi(x)]$, and abbreviate $\mathcal{H}_{f_\Phi}$ as $\mathcal{H}_\Phi$. Then $\mathcal{H}_\Phi$ can be decomposed as the Minkowski sum $\mathcal{H}_{1,\Phi}+\mathcal{H}_{2,\Phi}$:

$$\mathcal{H}_{1,\Phi}=\mathrm{span}\left\{\frac{p(\Phi,y)}{p_S(\Phi,y)}:\text{$p$ is continuous and }R_P(f_\Phi)=\inf_{g:[0,1]\rightarrow[0,1]}R_P(g\circ f_\Phi)\right\}. \quad (9)$$
$$\mathcal{H}_{2,\Phi}=\mathrm{span}\left\{\frac{p(x|\Phi,y)}{p_S(x|\Phi,y)}:\text{$p$ is continuous}\right\}. \quad (10)$$

1. If a predictor $f$ is multicalibrated w.r.t. $\mathcal{H}_{1,\Phi}$, then $R_{P_S}(f)\leq R_{P_S}(f_\Phi)$.

2. $f_\Phi$ is an invariant predictor elicited by $\Phi$ across any set of environments $\mathscr{E}$ where $P_e(\Phi,Y)=P_S(\Phi,Y)$ for all $e\in\mathscr{E}$. If a predictor $f$ is multicalibrated w.r.t. $\mathcal{H}_{2,\Phi}$, then $f$ is also an invariant predictor across $\mathscr{E}$, elicited by some representation.

Remark 4.4.

$\mathcal{H}_{1,\Phi}$ and $\mathcal{H}_{2,\Phi}$ contain functions defined on $x,\Phi(x),y$, which can be rewritten as functions on $x,y$ by variable substitution. Thus, $\mathcal{H}_{1,\Phi}$ and $\mathcal{H}_{2,\Phi}$ are still subspaces of grouping functions. $\mathcal{H}_{1,\Phi}$ is spanned by the density ratios of $P(\Phi,Y)$ on which the Bayes optimal predictor over $\Phi$ must be invariant. $\mathcal{H}_{2,\Phi}$ is spanned by general density ratios of $P(X|\Phi,Y)$.

Multicalibration w.r.t. $\mathcal{H}_{1,\Phi}$ ensures at least the accuracy of the Bayes optimal predictor over $\Phi$, and multicalibration w.r.t. $\mathcal{H}_{2,\Phi}$ ensures at least the invariance of this predictor. However, we show in Proposition F.20 that the sizes of the two subspaces are negatively correlated. When $\Phi$ is a variable selector, $\mathcal{H}_{1,\Phi}$ expands with more selected covariates while $\mathcal{H}_{2,\Phi}$ shrinks. By choosing a combination of $\mathcal{H}_{1,\Phi}$ and $\mathcal{H}_{2,\Phi}$, we strike a balance between the accuracy and invariance of the multicalibrated predictor.

Proposition 4.5 (Monotonicity).

Consider $X\in\mathbb{R}^d$ which can be sliced as $X=(\Phi,\Psi)^T$ and $\Phi=(\Lambda,\Omega)^T$. Define $\mathcal{H}_{1,\Phi}':=\{h(\Phi(x))\in C(\mathcal{X}\times\mathcal{Y})\}$, with $\mathcal{H}_{1,X}'$ and $\mathcal{H}_{1,\Lambda}'$ defined similarly. We have:

1. $\mathcal{H}_{1,X}'\supset\mathcal{H}_{1,\Phi}'\supset\mathcal{H}_{1,\Lambda}'\supset\mathcal{H}_{1,\emptyset}'=\{C\}$.

2. $\{C\}=\mathcal{H}_{2,X}\subset\mathcal{H}_{2,\Phi}\subset\mathcal{H}_{2,\Lambda}\subset\mathcal{H}_{2,\emptyset}$.

$C$ denotes a constant function.

4.2 Design of Grouping Function Classes

The objective of a robust learning method can be represented by a tuple consisting of an assumption about the boundary of distribution shift and a metric of robustness. Multicalibration is equivalent to invariance as the metric of robustness, while the grouping function class provides a unifying view of assumptions about potential distribution shift. Given any uncertainty set of target distributions $\mathcal{P}$, Theorem 4.2 implies an efficient and reasonable construction of grouping functions as linear combinations of density ratios from $\mathcal{P}$. We implement two designs of grouping functions, for the learning settings with and without environment annotations respectively.

From Environments   If samples are drawn from multiple environments and environment annotations are available, we take the uncertainty set to be the union of each environment's distribution $P_e$. This exactly recovers IRM's objective, but we approach it with a different optimization technique in the next section. Taking the pooled data as the source $S$, the density ratios spanning the grouping function class are $h_e(x,y)=p_e(x,y)/p_S(x,y)=p_S(e|x,y)/p_S(e)$, where $p_S(e|x,y)$ is estimated by an environment classifier. A grouping function can then be represented as a linear combination of the $h_e$:

$$h(x,y)=\sum_{e\in\mathscr{E}}\lambda_e\,p_S(e|x,y),\qquad\lambda_e\in\mathbb{R}. \quad (11)$$
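The environment-based design can be sketched as follows in NumPy. The two synthetic environments and the histogram-based environment classifier are our own assumptions (any probabilistic classifier estimating $p_S(e|x,y)$ would do); `sample` and `posterior` are hypothetical helper names:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two environments with different Y|X; pooled data forms the source S.
def sample(n, slope):
    x = rng.uniform(size=n)
    return x, slope * x + rng.normal(scale=0.1, size=n)

(x0, y0), (x1, y1) = sample(50_000, 1.0), sample(50_000, 0.5)
x = np.concatenate([x0, x1]); y = np.concatenate([y0, y1])
e = np.concatenate([np.zeros_like(x0), np.ones_like(x1)]).astype(int)

# Histogram estimate of the environment posterior p_S(e | x, y).
def posterior(x, y, e, n_bins=20):
    xb = np.clip((x * n_bins).astype(int), 0, n_bins - 1)
    yb = np.clip(((y + 0.5) * n_bins / 2).astype(int), 0, n_bins - 1)
    counts = np.zeros((n_bins, n_bins, 2))
    np.add.at(counts, (xb, yb, e), 1.0)             # joint (x, y, e) counts
    tot = counts.sum(axis=2, keepdims=True)
    return (counts / np.maximum(tot, 1.0))[xb, yb, :]

p = posterior(x, y, e)                               # p_S(e | x_i, y_i)

# A grouping function is a linear combination h(x, y) = sum_e lambda_e p_S(e|x,y).
lam = np.array([1.0, -1.0])
h = p @ lam
print(h.mean())
```

With $\lambda=(1,-1)$ and equal environment priors, $\mathbb{E}_S[h]=p_S(e{=}0)-p_S(e{=}1)=0$, which the empirical mean reproduces.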

From Hard Samples   When data contains latent sub-populations without annotations, the uncertainty set can be constructed by identifying the sub-populations. Hard sample learning (Li et al., 2021; Lin et al., 2020; Liu et al., 2021) suggests that the risk is an indicator of sub-population structure: samples from the minority sub-population $M$ are more likely to have high risks. For example, JTT (Liu et al., 2021) identifies the minority subgroup using a risk threshold of a trained predictor $f_{id}$. We adopt a continuous grouping by assuming $P_S(X,Y\in M)\propto(f_{id}(X)-Y)^2$. We construct the uncertainty set as the union of the source $S$ and the minority sub-population $M$, resulting in grouping functions of the form:

$$h(x,y)=\lambda_M(f_{id}(x)-y)^2+\lambda_S,\qquad\lambda_M,\lambda_S\in\mathbb{R}. \quad (12)$$
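A minimal NumPy sketch of this hard-sample design, with hypothetical synthetic sub-populations of our own; here a least-squares fit merely stands in for the trained identification model $f_{id}$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Source data with a majority and a (harder) minority sub-population.
x = rng.uniform(size=50_000)
minority = rng.random(x.shape) < 0.1                 # latent, unannotated
y = np.where(minority, -x, x) + rng.normal(scale=0.05, size=x.shape)

# Identification model f_id: a least-squares line fit on the pooled
# (majority-dominated) source, standing in for a trained predictor.
a, b = np.polyfit(x, y, 1)
f_id = a * x + b

# Grouping function h(x, y) = lambda_M * (f_id(x) - y)^2 + lambda_S:
# the squared residual upweights likely-minority (hard) samples.
resid2 = (f_id - y) ** 2
lam_M, lam_S = 1.0, 0.5
h = lam_M * resid2 + lam_S

print(resid2[minority].mean(), resid2[~minority].mean())
```

The residual-based weight is much larger on the minority sub-population, matching the assumption $P_S(X,Y\in M)\propto(f_{id}(X)-Y)^2$.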

Another design, utilizing the assumption of Distributionally Robust Optimization (Duchi et al., 2019), is given in Section B.

5 MC-PseudoLabel: An Algorithm for Extended Multicalibration

In this section, we introduce an algorithm for multicalibration with respect to joint grouping functions. Simultaneously, the algorithm provides a new optimization paradigm for invariant prediction under distribution shift. The algorithm, called MC-PseudoLabel, post-processes a trained model by supervised learning with pseudolabels generated by grouping functions. As shown in Algorithm 1, given a predictor function class $\mathcal{F}$ and a dataset $D$ with empirical distribution $\hat{P}_D(X,Y)$, a regression oracle $A$ solves the optimization $A_{\mathcal{F}}(D)=\arg\min_{f\in\mathcal{F}}R_{\hat{P}_D}(f)$. We take as input a model $f_0$, possibly trained by Empirical Risk Minimization. $f_0$ has a finite range, following conventions of prior work in multicalibration (Globus-Harris et al., 2023). For continuous predictors, we discretize the model output, introducing a small rounding error (see Section C). In each iteration, the algorithm performs regression with grouping functions on each level set of the model. The predictions of the grouping functions rectify the uncalibrated model and serve as pseudolabels for model updates.

Algorithm 1 MC-Pseudolabel
Require: A dataset D=(D_{x},D_{y}), a grouping function class \mathcal{H}, a predictive function class \mathcal{F}.
1:  t\leftarrow 0;
2:  f_{0}\leftarrow\text{Initialization};     {For example, a model trained with ERM.}
3:  m\leftarrow|\text{Range}(\text{Discretize}(f_{0}))|;
4:  repeat
5:     f_{t}^{\prime}\leftarrow\text{Round}(f_{t};m):=\arg\min_{v\in[1/m]}|f_{t}(x)-v|;
6:     err_{t}=\mathbb{E}_{x,y\sim D}[(f_{t}^{\prime}(x)-y)^{2}];
7:     for each v\in[1/m] do
8:        D_{v}^{t}\leftarrow D|f_{t}^{\prime}(x)=v;
9:        h_{v}^{t}(x,y)\leftarrow A_{\mathcal{H}}(D_{v}^{t});     {Regression on level sets with grouping functions.}
10:     end for
11:     \tilde{f}_{t+1}(x,y)\leftarrow\sum_{v\in[\frac{1}{m}]}1_{\{f_{t}^{\prime}(x)=v\}}\cdot h_{v}^{t}(x,y);     {Generate pseudolabels.}
12:     \tilde{err}_{t+1}\leftarrow\mathbb{E}_{x,y\sim D}[(\tilde{f}_{t+1}(x,y)-y)^{2}];
13:     D_{t+1}\leftarrow(D_{x},\tilde{f}_{t+1}(D));
14:     f_{t+1}(x)\leftarrow A_{\mathcal{F}}(D_{t+1});     {Update the model with pseudolabels.}
15:     t\leftarrow t+1;
16:  until err_{t-1}-\tilde{err}_{t} stops decreasing
Output: f_{t-1}^{\prime}(x).
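To make the loop concrete, the following is a minimal numpy sketch of Algorithm 1, not the paper's implementation: the regression oracle is ordinary least squares with an intercept, and the hypothetical `group_feats` argument selects which features the grouping class sees (passing only covariates recovers the covariate-based special case).

```python
import numpy as np

def lstsq_oracle(features, target):
    """Regression oracle A(D): least squares with an intercept."""
    A = np.column_stack([features, np.ones(len(features))])
    w, *_ = np.linalg.lstsq(A, target, rcond=None)
    return lambda F: np.column_stack([F, np.ones(len(F))]) @ w

def round_to_grid(pred, m):
    """Round(f; m): snap predictions to the grid {1/m, 2/m, ..., 1}."""
    grid = (np.arange(m) + 1.0) / m
    return grid[np.abs(pred[:, None] - grid[None, :]).argmin(axis=1)]

def mc_pseudolabel(x, y, group_feats, m=10, max_iter=50, tol=1e-8):
    f = lstsq_oracle(x, y)                      # f_0: ERM initialization
    prev_gap, out = np.inf, f
    for _ in range(max_iter):
        level = round_to_grid(f(x), m)          # f'_t
        err = np.mean((level - y) ** 2)         # err_t
        pseudo = level.copy()
        for v in np.unique(level):              # regression on each level set
            idx = level == v
            if idx.sum() >= 3:                  # skip near-empty level sets
                h = lstsq_oracle(group_feats(x[idx], y[idx]), y[idx])
                pseudo[idx] = h(group_feats(x[idx], y[idx]))
        gap = err - np.mean((pseudo - y) ** 2)  # err_t - err~_{t+1}
        if gap >= prev_gap - tol:               # stop once the gap stops decreasing
            break
        prev_gap, out = gap, f
        f = lstsq_oracle(x, pseudo)             # project pseudolabels back onto F
    return lambda xq: round_to_grid(out(xq), m)
```

With `group_feats=lambda xs, ys: xs` the level-set regressions use covariates only; a joint grouping class would also pass features of `ys`, subject to the conditions of Theorem 4.2.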

Since we regress Y with grouping functions defined on Y, a poorly designed grouping class that violates Theorem 4.2 can produce trivial outputs. For example, if the grouping functions contain h(x,y)=y, the level-set regressions simply return the labels, so \tilde{err}_{t}=0 at every round: err_{t-1}-\tilde{err}_{t} never decreases, and the algorithm outputs f_{0}, because no multicalibrated predictor exists in this case. However, the algorithm certifies a multicalibrated output whenever it converges.
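A toy numerical check of this failure mode (synthetic data, with numpy least squares standing in for the regression oracle): once h(x,y)=y lies in the span of the grouping class, the regression reproduces the labels exactly and the pseudolabels carry no new information.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(100, 2))
y = x @ np.array([1.0, -0.5]) + rng.normal(0, 0.1, 100)

# A grouping class containing h(x, y) = y: regress y on (x, y) jointly.
A = np.column_stack([x, y, np.ones(100)])
c, *_ = np.linalg.lstsq(A, y, rcond=None)
pseudo = A @ c

# Since y is itself a column of A, the least-squares residual is exactly
# zero: pseudolabels equal the labels, err~_t = 0, and the stopping gap
# err_{t-1} - err~_t stays frozen at err_{t-1}.
print(np.mean((pseudo - y) ** 2))  # numerically ~0
```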

Theorem 5.1 (Certified Multicalibration).

In Algorithm 1, for \alpha,B>0, if err_{t-1}-\tilde{err}_{t}\leq\frac{\alpha}{B}, the output f_{t-1}^{\prime}(x) is \alpha-approximately \ell_{2} multicalibrated w.r.t. \mathcal{H}_{B}=\{h\in\mathcal{H}:\sup h(x,y)^{2}\leq B\}.

MC-Pseudolabel reduces to LSBoost (Globus-Harris et al., 2023), a boosting algorithm for multicalibration, if \mathcal{H} contains only covariate-based grouping functions. In this case, Line 14 of Algorithm 1 reduces to f_{t+1}(x)=\tilde{f}_{t+1}(x,\cdot), where \tilde{f}_{t+1} does not depend on y. For joint grouping functions, since \tilde{f}_{t+1}\in\mathbb{R}^{\mathcal{X}\times\mathcal{Y}}, we project it onto the models' space \mathbb{R}^{\mathcal{X}} by fitting the model with \tilde{f}_{t+1} as pseudolabels. This projection substantially changes the optimization dynamics. LSBoost monotonically decreases the model's risk, since R_{\hat{P}_{D}}(f_{t+1})=R_{\hat{P}_{D}}(\tilde{f}_{t+1})<R_{\hat{P}_{D}}(f_{t}). The projection step disrupts this monotonicity, so MC-Pseudolabel can output a predictor with a higher risk than its input. This is because multicalibration with joint grouping functions implies a balance between accuracy and invariance, as discussed in Theorem 4.3. The convergence proof of LSBoost relies on the monotonicity of risks and therefore does not carry over to MC-Pseudolabel. Instead, we study the algorithm's convergence in the context of representation learning. Assume we are given a grouping function class \mathcal{H}_{\Phi} with a latent representation \Phi. If a predictor is multicalibrated w.r.t. \mathcal{H}_{1,\Phi} and \mathcal{H}_{2,\Phi} respectively, then it is also multicalibrated w.r.t. \mathcal{H}_{\Phi}. We therefore study convergence for the two grouping function classes separately. In Proposition F.26, we show convergence for a subset of \mathcal{H}_{1,\Phi} consisting of covariate-based grouping functions, as a corollary of Globus-Harris et al.'s result. As a greater challenge, we derive convergence for \mathcal{H}_{2,\Phi} when the data follows a multivariate normal distribution.

Theorem 5.2 (Convergence for \mathcal{H}_{2,\Phi} (informal, see Theorem F.27)).

Consider X\in\mathbb{R}^{d} with X=(\Phi,\Psi)^{T}. Assume that (\Phi,\Psi,Y) follows a multivariate normal distribution \mathcal{N}_{d+1}(\mu,\Sigma) where the random variables are in general position such that \Sigma is positive definite. For any distribution D supported on \mathcal{X}\times\mathcal{Y}, take the grouping function class \mathcal{H}=\{h\in\mathcal{H}_{2,\Phi}:h(x,y)=c_{x}^{T}x+c_{y}y+c_{b},\ c_{x}\in\mathbb{R}^{d},\ c_{y},c_{b}\in\mathbb{R}\} and the predictor class \mathcal{F}=\mathbb{R}^{\mathcal{X}}. For an initial predictor f_{0}(x)=\mathbb{E}[Y|x], run MC-Pseudolabel(D,\mathcal{H},\mathcal{F}) without rounding; then f_{t}(x)\rightarrow\mathbb{E}[Y|\Phi(x)] as t\rightarrow\infty with a convergence rate of \mathcal{O}(M(\Sigma)^{t}) where 0\leq M(\Sigma)<1.
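Under normality, the limit point \mathbb{E}[Y|\Phi(x)] has the closed form \mathbb{E}[Y|\Phi=\varphi]=\mu_{Y}+\Sigma_{Y\Phi}\Sigma_{\Phi\Phi}^{-1}(\varphi-\mu_{\Phi}). A quick numerical check on hypothetical Gaussian data (this verifies only the closed-form limit, not the full iteration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
# Hypothetical jointly Gaussian (Phi, Psi, Y): Psi touches Y only through
# Phi plus independent noise, so E[Y|X] and E[Y|Phi] differ in general.
phi = rng.normal(0.0, 1.0, n)
y = 2.0 * phi + rng.normal(0.0, 0.5, n)
psi = 0.7 * y + rng.normal(0.0, 1.0, n)

# Closed-form conditional mean under normality (one-dimensional Phi).
slope = np.cov(y, phi, ddof=0)[0, 1] / np.var(phi)
cond_mean = y.mean() + slope * (phi - phi.mean())

# An ordinary least-squares fit of Y on Phi recovers the same linear function.
w = np.polyfit(phi, y, 1)
fit = np.polyval(w, phi)
```

Here `slope` estimates \Sigma_{Y\Phi}\Sigma_{\Phi\Phi}^{-1} (equal to 2 by construction), and `fit` agrees with `cond_mean` up to numerical precision.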

MC-Pseudolabel also serves as an optimization paradigm for invariance: certified multicalibration in Theorem 5.1 implies certified invariance. Furthermore, MC-Pseudolabel introduces no extra hyperparameters to trade off risk against robustness. Both certified invariance and lightweight hyperparameters simplify model selection, which is challenging for out-of-distribution generalization because validation data from target distributions is unavailable (Gulrajani and Lopez-Paz, 2021). The optimization itself is also lightweight, consisting of a series of supervised regressions. Compared with Empirical Risk Minimization, the overhead is the regression on level sets; by designing the grouping function class as a linear space, this extra step amounts to linear regression. Furthermore, regressions on different level sets can be parallelized. Computational complexity is further analyzed in section D.
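Because the level-set regressions are independent of one another, they parallelize trivially. A sketch with a thread pool (the level sets here are illustrative stand-ins from a crude model; threads suffice when the oracle releases the GIL, as numpy's LAPACK-backed routines typically do):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def fit_level_set(feats, target):
    """One independent linear regression per level set."""
    A = np.column_stack([feats, np.ones(len(feats))])
    w, *_ = np.linalg.lstsq(A, target, rcond=None)
    return A @ w

rng = np.random.default_rng(0)
x = rng.normal(size=(300, 2))
y = x[:, 0] + rng.normal(0, 0.1, 300)
pred = x[:, 0]                                # a crude stand-in model
levels = np.digitize(pred, [-0.5, 0.5])       # its (coarse) level sets

# Submit each level-set regression to the pool and gather pseudolabels.
with ThreadPoolExecutor() as pool:
    jobs = {v: pool.submit(fit_level_set, x[levels == v], y[levels == v])
            for v in np.unique(levels)}
pseudo = np.empty_like(y)
for v, job in jobs.items():
    pseudo[levels == v] = job.result()
```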

6 Experiments

6.1 Settings

We benchmark MC-Pseudolabel on real-world regression datasets with distributional shift. We adopt two experimental settings. For the multi-environment setting, algorithms are provided with training data collected from multiple annotated environments. Thereafter, the trained model is assessed on new environments. For the single-environment setting, algorithms are trained on a single source distribution. There could be latent sub-populations in training data, but environment annotations are unavailable. The trained model is assessed on a target dataset with distribution shift from the source.

Datasets We experiment on PovertyMap (Yeh et al., 2020) and ACSIncome (Ding et al., 2021) for the multi-environment setting, and VesselPower (Malinin et al., 2022) for the single-environment setting. As the only regression task in WILDS (Koh et al., 2021), a popular benchmark for in-the-wild distribution shift, PovertyMap estimates a poverty index for different spatial regions from satellite images. Data are collected from both urban and rural regions, which define the environment annotations. The test set also covers both environments but is collected from different countries. The primary metric is Worst-U/R Pearson, the worse of the Pearson correlations between predictions and labels over rural and urban regions. The other two datasets are tabular, where natural concept shift (Y|X shift) is more common due to missing variables and hidden confounders (Liu et al., 2023). ACSIncome (Ding et al., 2021) predicts personal income from US Census data collected across different US states. The original task is binary classification by an income threshold, but we use the raw data for regression. Environments are partitioned by occupations with similar average income. VesselPower comes from Shifts (Malinin et al., 2021, 2022), a benchmark focusing on regression tasks with real-world distribution shift. The objective is to predict the power consumption of a merchant vessel given navigation and weather data. Data are sampled at different times and wind speeds, causing distribution shift between training and test data.

Baselines For the multi-environment setting, baselines include ERM (Empirical Risk Minimization); methods for invariance learning, which mostly adopt multi-objective optimization: IRM (Arjovsky et al., 2019), MIP (Koyama and Yamaguchi, 2020), IB-IRM (Ahuja et al., 2021), CLOvE (Wald et al., 2021), MRI (Huh and Baidya, 2022), REX (Krueger et al., 2021), and Fishr (Ramé et al., 2022); an alignment-based method from domain generalization: IDGM (Shi et al., 2021); and GroupDRO (Sagawa et al., 2019). Notably, CLOvE learns a predictor calibrated simultaneously on all environments, but it is optimized by multi-objective learning with a differentiable calibration regularizer. For the single-environment setting, baselines include reweighting-based techniques: CVaR (Levy et al., 2020), JTT (Liu et al., 2021), and Tilted-ERM (Li et al., 2021); a Distributionally Robust Optimization method, \chi^{2}-DRO (Duchi and Namkoong, 2019); and a data augmentation method, C-Mixup (Yao et al., 2022). Other methods are not included because they are specific to classification (Zhang et al., 2018, 2022) or require exposure to target-distribution data during training (Idrissi et al., 2022; Kirichenko et al., 2023). For all experiments, we also train an Oracle ERM with data sampled from the target distribution.

Implementation  We implement the predictor as an MLP for ACSIncome and VesselPower, and as ResNet18-MS (He et al., 2016) for PovertyMap, following WILDS' default architectures. The grouping function class is implemented according to Equation 11 for the multi-environment setting and Equation 12 for the single-environment setting. We follow DomainBed's protocol (Gulrajani and Lopez-Paz, 2021) for model selection. Specifically, we randomly sample 20 sets of hyperparameters for each method, covering both the training hyperparameters and any extra hyperparameters of the robust learning algorithm. We select the best model across hyperparameters under three model selection criteria: in-distribution validation (average performance on training data), worst-environment validation (worst performance across training environments), and oracle validation (performance on target data). Oracle validation is discouraged by DomainBed, which suggests limiting the number of accesses to target data. Each entire run is repeated with three different seeds to measure the standard error of performance. For PovertyMap specifically, an OOD validation set is provided for oracle validation, and we perform 5-fold cross-validation instead of three repeated runs, following WILDS' setup.
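The three selection criteria amount to three scoring rules applied to the same pool of runs. A schematic sketch of the protocol (field names here are illustrative, not from the paper's code):

```python
import random

def select_model(runs, criterion):
    """Pick one run from a pool of hyperparameter samples by a validation rule."""
    if criterion == "id":          # average error over training environments
        score = lambda r: sum(r["val_errors"]) / len(r["val_errors"])
    elif criterion == "worst":     # worst error across training environments
        score = lambda r: max(r["val_errors"])
    elif criterion == "oracle":    # error on held-out target-distribution data
        score = lambda r: r["oracle_error"]
    else:
        raise ValueError(f"unknown criterion: {criterion}")
    return min(runs, key=score)

# 20 hyperparameter samples per method, as in the protocol above
# (errors are random placeholders for actual validation results).
random.seed(0)
runs = [{"lr": 10 ** random.uniform(-4, -2),
         "val_errors": [random.random() for _ in range(3)],
         "oracle_error": random.random()}
        for _ in range(20)]
best_worst = select_model(runs, "worst")
```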

6.2 Results

Table 1: Results on multi-environment datasets, evaluated on test data using three model selection criteria. ID: validation with averaged performance on training data. Worst: validation with the worst performance across training environments. Oracle: validation with performance on sampled test set.
Method | ACSIncome RMSE ↓: ID | Worst | Oracle | PovertyMap Worst-U/R Pearson ↑: ID | Worst | Oracle
ERM | 0.487±0.009 | 0.487±0.009 | 0.452±0.012 | 0.48±0.06 | 0.48±0.06 | 0.49±0.07
IRM | 0.466±0.002 | 0.466±0.002 | 0.465±0.002 | 0.38±0.07 | 0.39±0.06 | 0.45±0.08
MIP | 0.457±0.008 | 0.454±0.012 | 0.454±0.012 | 0.40±0.09 | 0.39±0.10 | 0.43±0.08
IB-IRM | 0.463±0.003 | 0.463±0.003 | 0.438±0.009 | 0.39±0.07 | 0.37±0.05 | 0.43±0.06
CLOvE | 0.455±0.005 | 0.454±0.002 | 0.450±0.005 | 0.46±0.09 | 0.42±0.13 | **0.51**±0.06
MRI | 0.458±0.011 | 0.458±0.011 | 0.455±0.013 | 0.47±0.10 | 0.46±0.08 | 0.49±0.07
REX | 0.466±0.009 | 0.464±0.009 | 0.458±0.003 | 0.43±0.09 | 0.42±0.09 | 0.45±0.09
Fishr | 0.458±0.006 | 0.455±0.012 | 0.450±0.008 | 0.42±0.09 | 0.41±0.09 | 0.43±0.08
IDGM | 1.843±0.018 | 1.843±0.018 | 1.843±0.018 | 0.02±0.07 | 0.01±0.15 | 0.13±0.14
GroupDRO | 0.481±0.035 | 0.449±0.017 | 0.433±0.013 | 0.38±0.15 | 0.37±0.16 | 0.42±0.12
MC-Pseudolabel | **0.428**±0.009 | **0.425**±0.012 | **0.411**±0.011 | **0.50**±0.07 | **0.50**±0.07 | **0.51**±0.08
Oracle ERM | 0.332±0.001 (all criteria) | | | 0.71±0.05 (all criteria) | |


Figure 1: Accuracy-on-the-line beyond covariate shift: correlation between models’ in-distribution and out-of-distribution risks on VesselPower.

Results are shown in Table 1 for the multi-environment setting and Table 2 for the single-environment setting. MC-Pseudolabel achieves superior performance on all datasets under in-distribution and worst-environment validation, which do not access test data. Under oracle validation, MC-Pseudolabel achieves performance comparable to the best method. For example, CLOvE, which also learns invariance through calibration, achieves the best performance under oracle validation on PovertyMap, but degrades sharply when target validation data is unavailable. This is because CLOvE tunes its regularizer's coefficient to trade off against ERM risk, and the optimal coefficient depends on the target distribution shift.

Table 2: Single-environment results.
Method | VesselPower RMSE ↓: ID | Oracle
ERM | 1.92±0.23 | 1.86±0.19
CVaR | 1.69±0.18 | **1.49**±0.10
JTT | 1.75±0.27 | 1.58±0.15
Tilted-ERM | 1.72±0.21 | 1.61±0.12
\chi^{2}-DRO | 1.69±0.20 | 1.56±0.06
C-Mixup | 1.72±0.15 | 1.56±0.08
MC-Pseudolabel | **1.61**±0.20 | 1.52±0.16
Oracle ERM | 1.18±0.01 (both criteria) |

In contrast, MC-Pseudolabel exhibits an advantage under in-distribution model selection. This is further supported by Figure 1, which shows that MC-Pseudolabel's out-of-distribution errors strongly correlate with its in-distribution errors. The experiment spans different hyperparameters and seeds with the same model architecture on VesselPower. This phenomenon, known as accuracy-on-the-line (Miller et al., 2021), is well established for a broad class of models under covariate shift. However, Liu et al. (2023) show that accuracy-on-the-line does not hold under concept shift, which is the case for ERM and C-Mixup here; this poses a significant challenge for model selection. MC-Pseudolabel restores the accuracy-on-the-line phenomenon.
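The correlation underlying Figure 1 is simply the Pearson coefficient between per-run in-distribution and out-of-distribution errors. A self-contained illustration (with synthetic errors standing in for actual runs):

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation between two equal-length samples."""
    a = np.asarray(a, float) - np.mean(a)
    b = np.asarray(b, float) - np.mean(b)
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

# Hypothetical per-run RMSEs over hyperparameters and seeds. Under
# accuracy-on-the-line, OOD error is roughly an affine function of ID error.
rng = np.random.default_rng(0)
id_err = rng.uniform(1.4, 2.0, 40)
ood_err = 1.3 * id_err + 0.2 + rng.normal(0, 0.02, 40)
r = pearson(id_err, ood_err)  # close to 1 when the runs lie on a line
```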

7 Conclusion

In conclusion, we establish a new optimization framework for out-of-distribution generalization through extended multicalibration with joint grouping functions. While the current algorithm focuses on regression, future work could extend our approach to more general tasks, particularly classification.

References

  • Ahuja et al. (2021) Kartik Ahuja, Ethan Caballero, Dinghuai Zhang, Jean-Christophe Gagnon-Audet, Yoshua Bengio, Ioannis Mitliagkas, and Irina Rish. Invariance principle meets information bottleneck for out-of-distribution generalization. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 3438–3450, 2021. URL https://proceedings.neurips.cc/paper/2021/hash/1c336b8080f82bcc2cd2499b4c57261d-Abstract.html.
  • Arjovsky et al. (2019) Martín Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. CoRR, abs/1907.02893, 2019. URL http://arxiv.org/abs/1907.02893.
  • Blasiok et al. (2023a) Jaroslaw Blasiok, Parikshit Gopalan, Lunjia Hu, and Preetum Nakkiran. A unifying theory of distance from calibration. In Barna Saha and Rocco A. Servedio, editors, Proceedings of the 55th Annual ACM Symposium on Theory of Computing, STOC 2023, Orlando, FL, USA, June 20-23, 2023, pages 1727–1740. ACM, 2023a. doi: 10.1145/3564246.3585182. URL https://doi.org/10.1145/3564246.3585182.
  • Blasiok et al. (2023b) Jaroslaw Blasiok, Parikshit Gopalan, Lunjia Hu, and Preetum Nakkiran. When does optimizing a proper loss yield calibration? In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023b. URL http://papers.nips.cc/paper_files/paper/2023/hash/e4165c96702bac5f4962b70f3cf2f136-Abstract-Conference.html.
  • Creager et al. (2021) Elliot Creager, Jörn-Henrik Jacobsen, and Richard S. Zemel. Environment inference for invariant learning. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 2189–2200. PMLR, 2021. URL http://proceedings.mlr.press/v139/creager21a.html.
  • Deng et al. (2023) Zhun Deng, Cynthia Dwork, and Linjun Zhang. Happymap : A generalized multicalibration method. In Yael Tauman Kalai, editor, 14th Innovations in Theoretical Computer Science Conference, ITCS 2023, January 10-13, 2023, MIT, Cambridge, Massachusetts, USA, volume 251 of LIPIcs, pages 41:1–41:23. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2023. doi: 10.4230/LIPICS.ITCS.2023.41. URL https://doi.org/10.4230/LIPIcs.ITCS.2023.41.
  • Ding et al. (2021) Frances Ding, Moritz Hardt, John Miller, and Ludwig Schmidt. Retiring adult: New datasets for fair machine learning. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 6478–6490, 2021. URL https://proceedings.neurips.cc/paper/2021/hash/32e54441e6382a7fbacbbbaf3c450059-Abstract.html.
  • Duchi and Namkoong (2019) John C. Duchi and Hongseok Namkoong. Variance-based regularization with convex objectives. J. Mach. Learn. Res., 20:68:1–68:55, 2019. URL http://jmlr.org/papers/v20/17-750.html.
  • Duchi and Namkoong (2021) John C Duchi and Hongseok Namkoong. Learning models with uniform performance via distributionally robust optimization. The Annals of Statistics, 49(3):1378–1406, 2021.
  • Duchi et al. (2019) John C Duchi, Tatsunori Hashimoto, and Hongseok Namkoong. Distributionally robust losses against mixture covariate shifts. Under review, 2(1), 2019.
  • Duchi et al. (2021) John C. Duchi, Peter W. Glynn, and Hongseok Namkoong. Statistics of robust optimization: A generalized empirical likelihood approach. Math. Oper. Res., 46(3):946–969, 2021. doi: 10.1287/MOOR.2020.1085. URL https://doi.org/10.1287/moor.2020.1085.
  • Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 1126–1135. PMLR, 2017. URL http://proceedings.mlr.press/v70/finn17a.html.
  • Globus-Harris et al. (2023) Ira Globus-Harris, Declan Harrison, Michael Kearns, Aaron Roth, and Jessica Sorrell. Multicalibration as boosting for regression. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 11459–11492. PMLR, 2023. URL https://proceedings.mlr.press/v202/globus-harris23a.html.
  • Gopalan et al. (2022) Parikshit Gopalan, Adam Tauman Kalai, Omer Reingold, Vatsal Sharan, and Udi Wieder. Omnipredictors. In Mark Braverman, editor, 13th Innovations in Theoretical Computer Science Conference, ITCS 2022, January 31 - February 3, 2022, Berkeley, CA, USA, volume 215 of LIPIcs, pages 79:1–79:21. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2022. doi: 10.4230/LIPICS.ITCS.2022.79. URL https://doi.org/10.4230/LIPIcs.ITCS.2022.79.
  • Gulrajani and Lopez-Paz (2021) Ishaan Gulrajani and David Lopez-Paz. In search of lost domain generalization. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URL https://openreview.net/forum?id=lQdXeXDoWtI.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 770–778. IEEE Computer Society, 2016. doi: 10.1109/CVPR.2016.90. URL https://doi.org/10.1109/CVPR.2016.90.
  • Hébert-Johnson et al. (2018) Úrsula Hébert-Johnson, Michael P. Kim, Omer Reingold, and Guy N. Rothblum. Multicalibration: Calibration for the (computationally-identifiable) masses. In Jennifer G. Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 1944–1953. PMLR, 2018. URL http://proceedings.mlr.press/v80/hebert-johnson18a.html.
  • Huh and Baidya (2022) Dongsung Huh and Avinash Baidya. The missing invariance principle found - the reciprocal twin of invariant risk minimization. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/91b482312a0845ed86e244adbd9935e4-Abstract-Conference.html.
  • Idrissi et al. (2022) Badr Youbi Idrissi, Martín Arjovsky, Mohammad Pezeshki, and David Lopez-Paz. Simple data balancing achieves competitive worst-group-accuracy. In Bernhard Schölkopf, Caroline Uhler, and Kun Zhang, editors, 1st Conference on Causal Learning and Reasoning, CLeaR 2022, Sequoia Conference Center, Eureka, CA, USA, 11-13 April, 2022, volume 177 of Proceedings of Machine Learning Research, pages 336–351. PMLR, 2022. URL https://proceedings.mlr.press/v177/idrissi22a.html.
  • Kim et al. (2019) Michael P. Kim, Amirata Ghorbani, and James Y. Zou. Multiaccuracy: Black-box post-processing for fairness in classification. In Vincent Conitzer, Gillian K. Hadfield, and Shannon Vallor, editors, Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, AIES 2019, Honolulu, HI, USA, January 27-28, 2019, pages 247–254. ACM, 2019. doi: 10.1145/3306618.3314287. URL https://doi.org/10.1145/3306618.3314287.
  • Kim et al. (2022) Michael P Kim, Christoph Kern, Shafi Goldwasser, Frauke Kreuter, and Omer Reingold. Universal adaptability: Target-independent inference that competes with propensity scoring. Proceedings of the National Academy of Sciences, 119(4):e2108097119, 2022.
  • Kirichenko et al. (2023) Polina Kirichenko, Pavel Izmailov, and Andrew Gordon Wilson. Last layer re-training is sufficient for robustness to spurious correlations. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/pdf?id=Zb6c8A-Fghk.
  • Koh et al. (2021) Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, Tony Lee, Etienne David, Ian Stavness, Wei Guo, Berton Earnshaw, Imran S. Haque, Sara M. Beery, Jure Leskovec, Anshul Kundaje, Emma Pierson, Sergey Levine, Chelsea Finn, and Percy Liang. WILDS: A benchmark of in-the-wild distribution shifts. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 5637–5664. PMLR, 2021. URL http://proceedings.mlr.press/v139/koh21a.html.
  • Koyama and Yamaguchi (2020) Masanori Koyama and Shoichiro Yamaguchi. When is invariance useful in an out-of-distribution generalization problem? arXiv preprint arXiv:2008.01883, 2020.
  • Krueger et al. (2021) David Krueger, Ethan Caballero, Jörn-Henrik Jacobsen, Amy Zhang, Jonathan Binas, Dinghuai Zhang, Rémi Le Priol, and Aaron C. Courville. Out-of-distribution generalization via risk extrapolation (rex). In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 5815–5826. PMLR, 2021. URL http://proceedings.mlr.press/v139/krueger21a.html.
  • Levy et al. (2020) Daniel Levy, Yair Carmon, John C. Duchi, and Aaron Sidford. Large-scale methods for distributionally robust optimization. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/64986d86a17424eeac96b08a6d519059-Abstract.html.
  • Li et al. (2021) Tian Li, Ahmad Beirami, Maziar Sanjabi, and Virginia Smith. Tilted empirical risk minimization. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URL https://openreview.net/forum?id=K5YasWXZT3O.
  • Lin et al. (2020) Tsung-Yi Lin, Priya Goyal, Ross B. Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. IEEE Trans. Pattern Anal. Mach. Intell., 42(2):318–327, 2020. doi: 10.1109/TPAMI.2018.2858826. URL https://doi.org/10.1109/TPAMI.2018.2858826.
  • Liu et al. (2021) Evan Zheran Liu, Behzad Haghgoo, Annie S. Chen, Aditi Raghunathan, Pang Wei Koh, Shiori Sagawa, Percy Liang, and Chelsea Finn. Just train twice: Improving group robustness without training group information. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 6781–6792. PMLR, 2021. URL http://proceedings.mlr.press/v139/liu21f.html.
  • Liu et al. (2022) Jiashuo Liu, Jiayun Wu, Bo Li, and Peng Cui. Distributionally robust optimization with data geometry. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/da535999561b932f56efdd559498282e-Abstract-Conference.html.
  • Liu et al. (2023) Jiashuo Liu, Tianyu Wang, Peng Cui, and Hongseok Namkoong. On the need for a language describing distribution shifts: Illustrations on tabular datasets. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023.
  • Malinin et al. (2021) Andrey Malinin, Neil Band, Yarin Gal, Mark J. F. Gales, Alexander Ganshin, German Chesnokov, Alexey Noskov, Andrey Ploskonosov, Liudmila Prokhorenkova, Ivan Provilkov, Vatsal Raina, Vyas Raina, Denis Roginskiy, Mariya Shmatova, Panagiotis Tigas, and Boris Yangel. Shifts: A dataset of real distributional shift across multiple large-scale tasks. In Joaquin Vanschoren and Sai-Kit Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual, 2021. URL https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/ad61ab143223efbc24c7d2583be69251-Abstract-round2.html.
  • Malinin et al. (2022) Andrey Malinin, Andreas Athanasopoulos, Muhamed Barakovic, Meritxell Bach Cuadra, Mark J. F. Gales, Cristina Granziera, Mara Graziani, Nikolay Kartashev, Konstantinos Kyriakopoulos, Po-Jui Lu, Nataliia Molchanova, Antonis Nikitakis, Vatsal Raina, Francesco La Rosa, Eli Sivena, Vasileios Tsarsitalidis, Efi Tsompopoulou, and Elena Volf. Shifts 2.0: Extending the dataset of real distributional shifts. CoRR, abs/2206.15407, 2022. doi: 10.48550/ARXIV.2206.15407. URL https://doi.org/10.48550/arXiv.2206.15407.
  • Miller et al. (2021) John Miller, Rohan Taori, Aditi Raghunathan, Shiori Sagawa, Pang Wei Koh, Vaishaal Shankar, Percy Liang, Yair Carmon, and Ludwig Schmidt. Accuracy on the line: on the strong correlation between out-of-distribution and in-distribution generalization. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 7721–7735. PMLR, 2021. URL http://proceedings.mlr.press/v139/miller21b.html.
  • Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
  • Ramé et al. (2022) Alexandre Ramé, Corentin Dancette, and Matthieu Cord. Fishr: Invariant gradient variances for out-of-distribution generalization. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato, editors, International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 18347–18377. PMLR, 2022. URL https://proceedings.mlr.press/v162/rame22a.html.
  • Roth (2022) Aaron Roth. Uncertain: Modern topics in uncertainty estimation, 2022.
  • Sagawa et al. (2019) Shiori Sagawa, Pang Wei Koh, Tatsunori B Hashimoto, and Percy Liang. Distributionally robust neural networks. In International Conference on Learning Representations, 2019.
  • Shi et al. (2021) Yuge Shi, Jeffrey Seely, Philip HS Torr, N Siddharth, Awni Hannun, Nicolas Usunier, and Gabriel Synnaeve. Gradient matching for domain generalization. arXiv preprint arXiv:2104.09937, 2021.
  • Sinha et al. (2018) Aman Sinha, Hongseok Namkoong, and John C. Duchi. Certifying some distributional robustness with principled adversarial training. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018. URL https://openreview.net/forum?id=Hk6kPgZA-.
  • Wald et al. (2021) Yoav Wald, Amir Feder, Daniel Greenfeld, and Uri Shalit. On calibration and out-of-domain generalization. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 2215–2227, 2021. URL https://proceedings.neurips.cc/paper/2021/hash/118bd558033a1016fcc82560c65cca5f-Abstract.html.
  • Wang et al. (2022) Haoxiang Wang, Bo Li, and Han Zhao. Understanding gradual domain adaptation: Improved analysis, optimal path and beyond. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato, editors, International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 22784–22801. PMLR, 2022. URL https://proceedings.mlr.press/v162/wang22n.html.
  • Yao et al. (2022) Huaxiu Yao, Yiping Wang, Linjun Zhang, James Y. Zou, and Chelsea Finn. C-mixup: Improving generalization in regression. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/1626be0ab7f3d7b3c639fbfd5951bc40-Abstract-Conference.html.
  • Yeh et al. (2020) Christopher Yeh, Anthony Perez, Anne Driscoll, George Azzari, Zhongyi Tang, David Lobell, Stefano Ermon, and Marshall Burke. Using publicly available satellite imagery and deep learning to understand economic well-being in africa. Nature Communications, 2020.
  • Zhang et al. (2018) Hongyi Zhang, Moustapha Cissé, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018. URL https://openreview.net/forum?id=r1Ddp1-Rb.
  • Zhang et al. (2022) Michael Zhang, Nimit S Sohoni, Hongyang R Zhang, Chelsea Finn, and Christopher Re. Correct-n-contrast: a contrastive approach for improving robustness to spurious correlations. In International Conference on Machine Learning, pages 26484–26516. PMLR, 2022.

Appendix A Related Work

Multicalibration   Multicalibration was first proposed by Hébert-Johnson et al. [2018] with binary grouping functions. Kim et al. [2019] and Gopalan et al. [2022] extend the grouping functions to real-valued functions. Globus-Harris et al. [2023] show that with a sufficiently rich class of real-valued grouping functions, multicalibration actually implies accuracy. Globus-Harris et al. also provide a boosting algorithm for both regression and multicalibration. The connection between multicalibration and distribution shift is first studied by Kim et al. [2022], who prove that the \ell_{1} multicalibration error is preserved under covariate shift, given a sufficiently large real-valued grouping function class. Kim et al. further show that under covariate shift, a multicalibrated predictor can perform statistical inference of the average outcome of a sample batch. In contrast, we derive a robustness result for individual prediction of outcomes for \ell_{2} multicalibrated predictors. In addition, Wald et al. [2021] study the equivalence of Invariant Risk Minimization and simultaneous calibration on each environment. Our equivalence results for multicalibration can be seen as a generalization of Wald et al.'s results beyond the multi-environment setting, obtained by deriving a mapping between density ratios and grouping functions. We also extend the equivalence to approximately multicalibrated and approximately invariant predictors. Furthermore, we move beyond Wald et al.'s multi-objective optimization with Lagrangian regularization by proposing a new post-processing optimization framework consisting of a series of supervised regressions. Meanwhile, Blasiok et al. [2023b] discuss connections between calibration and post-processing, which is an equivalent expression of invariance. There are other extensions of multicalibration, such as Deng et al. [2023], who generalize the term Y-f(X) in multicalibration's definition to a class of general functions. To our knowledge, our work is the first to generalize the grouping functions h to take the outcomes into account.

Out-of-distribution Generalization Beyond Covariate Shift   Despite abundant literature from domain generalization that focuses on image classification, where covariate shift dominates, research on algorithmic robustness for regression tasks beyond covariate shift is relatively limited. The setting can be categorized according to whether the source distribution is partitioned into several environments. For the multi-environment generalization setting, Invariant Risk Minimization and its variants assume that outcomes are generated by a common causal structural equation across all environments, and aim to recover such an invariant (or causal) predictor [Ahuja et al., 2021, Arjovsky et al., 2019, Huh and Baidya, 2022, Koyama and Yamaguchi, 2020, Krueger et al., 2021, Ramé et al., 2022]. Group DRO [Sagawa et al., 2019] is a simple but surprisingly strong technique that optimizes for the worst group risk with reweighting of environments. There are also meta-learning methods [Finn et al., 2017] that handle multi-environment generalization with bi-level optimization. For the single-environment setting, Distributionally Robust Optimization optimizes for the worst-case risk in an uncertainty set of distributions centered around the source distribution [Duchi and Namkoong, 2019, 2021, Duchi et al., 2021, Levy et al., 2020, Sinha et al., 2018]. Another branch of research targets spurious correlations under a simplicity-bias assumption: a simple model is used to discover latent sub-populations, and the biased predictor is then corrected by sample reweighting [Li et al., 2021, Lin et al., 2020, Liu et al., 2021], by retraining on a subgroup-balanced dataset or a small batch from the target distribution [Idrissi et al., 2022, Kirichenko et al., 2023, Zhang et al., 2022], or by performing Invariant Risk Minimization on the discovered subgroups [Creager et al., 2021]. Data augmentation is a prevalent technique to enhance algorithmic robustness for vision tasks, although many of these methods are tailored for classification. For example, Mixup [Zhang et al., 2018] interpolates between features of samples with the same label; the approach is extended to regression by C-Mixup [Yao et al., 2022]. Pseudolabelling is a common technique for out-of-distribution generalization, but it is typically adopted in settings with access to unlabelled samples from the target distribution, known as domain adaptation [Wang et al., 2022]. In contrast, MC-Pseudolabel generates pseudolabels for the source distribution itself.

Appendix B Grouping Functions for Distributionally Robust Optimization

Distributionally Robust Optimization assumes the target distribution to reside in an uncertainty set \mathcal{P} of distributions centered around the source distribution P_{S}. For example, Duchi et al. [2019] formulate the uncertainty set as arbitrary subgroups that have a proportion of at least \alpha_{0}\in(0,1). Duchi et al. only consider subgroups of covariates:

\mathcal{P}(X)=\{P(X):\text{ there exists a probability measure }P^{\prime}(X), (13)
P_{S}(X)=\alpha P(X)+(1-\alpha)P^{\prime}(X),\ \alpha\geq\alpha_{0}\}. (14)

By the correspondence between density ratios and grouping functions, the equivalent design of a grouping function class is given by:

\mathcal{H}=\left\{h\in\mathbb{R}^{\mathcal{X}}:0\leq h(x)\leq\frac{1}{\alpha_{0}},\;\;\forall x\right\}. (15)

We can also extend the grouping functions to consider both covariates and outcomes, such that general subgroups are incorporated into the uncertainty set:

\mathcal{H}=\left\{h\in\mathbb{R}^{\mathcal{X}\times\mathcal{Y}}:0\leq h(x,y)\leq\frac{1}{\alpha_{0}},\;\;\forall x,y\right\}. (16)

In the case of grouping functions defined on x and y jointly, the grouping function class is not closed under affine transformation and is not a linear space spanned by density ratios, which suggests that a perfectly multicalibrated solution might not exist. However, approximately multicalibrated predictors can still be pursued.
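To make the correspondence concrete, the following numpy sketch (ours, not the paper's code) builds a member of the clipped grouping class of Equation 15: the density ratio of the worst-case \alpha_{0}-subgroup, which reweights the \alpha_{0}-fraction of highest-loss samples by 1/\alpha_{0} and lies exactly on the boundary of the class.

```python
import numpy as np

def clip_to_dro_class(h, alpha0):
    """Project arbitrary nonnegative weights onto the grouping class of
    Equations 15-16, {h : 0 <= h <= 1/alpha0}; the constraint is
    pointwise, so elementwise clipping suffices."""
    return np.clip(h, 0.0, 1.0 / alpha0)

def worst_subgroup_density_ratio(losses, alpha0):
    """Density ratio of the worst-case alpha0-subgroup: weight 1/alpha0
    on the alpha0-fraction of samples with the highest loss, 0 elsewhere
    (the CVaR reweighting)."""
    n = len(losses)
    k = int(np.ceil(alpha0 * n))
    h = np.zeros(n)
    h[np.argsort(losses)[-k:]] = 1.0 / alpha0
    return h
```

By construction, h averages to 1 over the source sample (it is a density ratio), and the reweighted mean loss equals the worst \alpha_{0}-subgroup risk.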

Appendix C Model Discretization

For continuous predictors, we take a preprocessing step that discretizes the model into as many bins as possible, so that the rounding error is negligible while individual bins still contain enough samples. Specifically, we split the predictor's outputs into bins of equal width from the minimum to the maximum of the model output. We start from a minimum bin number m=10 and keep increasing m as long as 90% of the samples reside in a bin with at least 30 samples. When this criterion is violated, we stop increasing m and select the current value as the final bin number. The model discretization procedure is fixed across all experiments.
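The binning rule above can be sketched in numpy; this is an illustrative reading of the procedure, and the `m_max` cap is our addition to guarantee termination:

```python
import numpy as np

def bin_criterion(preds, m, coverage=0.9, min_bin_count=30):
    """Check whether at least `coverage` of samples fall into
    equal-width bins containing at least `min_bin_count` samples."""
    edges = np.linspace(preds.min(), preds.max(), m + 1)
    idx = np.minimum(np.digitize(preds, edges[1:]), m - 1)  # bin index in [0, m-1]
    counts = np.bincount(idx, minlength=m)
    return (counts[idx] >= min_bin_count).mean() >= coverage

def select_bin_number(preds, m_min=10, m_max=1000, coverage=0.9, min_bin_count=30):
    """Keep increasing the bin number while the criterion still holds
    for one more bin (a sketch of the discretization procedure above)."""
    m = m_min
    while m < m_max and bin_criterion(preds, m + 1, coverage, min_bin_count):
        m += 1
    return m
```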

Appendix D Computational Complexity

We assume that the predictor's outcomes are uniformly distributed. Denote the average bin size by N_{b}, which is a constant around 30 in our implementation. The bin number is given by m=N/N_{b}, where N is the sample size. For neural networks, N represents the batch size. The overhead of MC-Pseudolabel compared to Empirical Risk Minimization is a linear regression on each bin, whose complexity is \mathcal{O}(N_{b}^{3}) with OLS. Note that an individual linear regression on around 30 samples is extremely cheap. A non-parallel implementation of regression on every bin scales linearly with the bin number m, so the overall complexity is \mathcal{O}(N_{b}^{2}N). However, since the regressions on different bins are independent, we adopt a multi-processing implementation. Denoting the number of jobs by J, the overall time cost of MC-Pseudolabel is \mathcal{O}(N_{b}^{2}N/J). As a comparison, OLS on N samples has a computational complexity of \mathcal{O}(N^{3}).

In conclusion, the complexity of MC-Pseudolabel scales linearly with the sample size (or batch size for neural networks). Counterintuitively, increasing the bin number m (and thus decreasing the bin size) actually decreases the computational complexity. This is because linear regression scales cubically with sample size, so decreasing the sample size in each bin is preferable to decreasing the bin number.
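A minimal sketch of the per-bin regression step whose cost is analyzed above (illustrative, not the paper's implementation): each OLS fit only sees ~N_b samples, so the total cost is \mathcal{O}(N_{b}^{2}N) instead of the \mathcal{O}(N^{3}) of one OLS over all N samples.

```python
import numpy as np

def per_bin_regression(X, y, bin_idx, m):
    """Fit an independent OLS regression (with intercept) inside every
    level-set bin of the predictor and return the coefficient vectors."""
    coefs = []
    for b in range(m):
        mask = bin_idx == b
        Xb = np.c_[np.ones(mask.sum()), X[mask]]  # prepend intercept column
        beta, *_ = np.linalg.lstsq(Xb, y[mask], rcond=None)
        coefs.append(beta)
    return coefs
```

Because the fits are independent, they parallelize trivially (e.g. with `multiprocessing.Pool`), dividing the wall-clock cost by the number of jobs J.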

Appendix E Experiments

E.1 An Additional Experiment: Synthetic Dataset

We start from a multi-environment synthetic dataset with a multivariate normal distribution corresponding to Theorem 5.2. In this experiment, we examine the optimization dynamics of MC-Pseudolabel. The data generation process is inspired by Arjovsky et al. [2019]. The covariates can be sliced into X=(S,V)^{T} with S\in\mathbb{R}^{9} and V\in\mathbb{R}, where S is the causal variable for Y and V is the spurious variable. The data is generated by the following structural equations:

S \sim \mathcal{N}(0,1). (17)
Y = \alpha_{S}^{T}S+\epsilon_{Y},\;\;\alpha_{S}=(1,...,1)^{T}\in\mathbb{R}^{9},\;\epsilon_{Y}\sim\mathcal{N}(0,0.5^{2}). (18)
V = \alpha_{V}(\mathcal{E})\cdot Y+\epsilon_{V},\;\;\alpha_{V}(e_{1})=1.25,\;\alpha_{V}(e_{2})=0.75,\;\alpha_{V}(e_{T})=-1,\;\epsilon_{V}\sim\mathcal{N}(0,0.1^{2}). (19)

The covariate V is spuriously correlated with Y because the coefficient \alpha_{V}(e) depends on the specific environment e. We set \alpha_{V}(\mathcal{E})=1.25 and 0.75 for the two training environments, while \alpha_{V}(\mathcal{E}) extrapolates to -1 during testing. A robust algorithm is supposed to bypass the spurious variable and output a predictor f(X)=\alpha_{S}^{T}S in order to survive the test distribution, where the correlation between V and Y is negated.
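The structural equations (17)-(19) can be sampled directly; the sketch below is illustrative (fixed seed and sample size are our choices) and makes the correlation flip between training and test environments explicit:

```python
import numpy as np

def sample_environment(n, alpha_v, rng):
    """Draw n samples from one environment following Equations (17)-(19):
    S causes Y, and V depends on Y through the environment-specific
    coefficient alpha_v."""
    S = rng.normal(0.0, 1.0, size=(n, 9))
    Y = S.sum(axis=1) + rng.normal(0.0, 0.5, size=n)  # alpha_S = (1, ..., 1)
    V = alpha_v * Y + rng.normal(0.0, 0.1, size=n)
    return np.c_[S, V], Y                              # X = (S, V)

rng = np.random.default_rng(0)
X1, Y1 = sample_environment(5000, 1.25, rng)   # training environment e1
X2, Y2 = sample_environment(5000, 0.75, rng)   # training environment e2
Xt, Yt = sample_environment(5000, -1.0, rng)   # test: spurious correlation flips
```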

The predictor class for this dataset is linear models, and the environment classifier is implemented by an MLP with a single hidden layer. In this experiment, we fix the training hyperparameters of the base linear model and perform a grid search over the extra hyperparameter introduced by each robust learning method. All baselines except ERM and Group DRO share a hyperparameter, the regularizer's coefficient, while Group DRO introduces a temperature hyperparameter. We search over their hyperparameter spaces and report the RMSE on the test set in Figure 2. Most baselines exhibit a U-shaped curve as the hyperparameter increases, and the minimum point varies across methods. This hyperparameter sensitivity implies a dependence on a strong model selection criterion, such as oracle model selection on the target distribution. However, the dashed line for MC-Pseudolabel's error is tangent to all the baselines' U-shaped curves, indicating competitive performance of MC-Pseudolabel both with and without oracle model selection.


Figure 2: Results (RMSE) on the synthetic dataset. Curves show method performances across hyperparameters. Methods without extra hyperparameters are marked with dotted lines.

We also investigate the evolution of the pseudolabels \tilde{f}_{t} in Algorithm 1 to recover the dynamics of MC-Pseudolabel. The first row of Figure 3 demonstrates how pseudolabelling results in a multicalibrated predictor. It shows that the pseudolabels for the two environments deviate from the model prediction at Step 0, but the gap quickly closes by Step 4, implying multicalibrated prediction. The second row provides insight into how pseudolabelling contributes to an invariant predictor. We observe that the curves of the two environments gradually merge, because the pseudolabel introduces a special noise to the original label such that the correlation between the pseudolabel and the spurious variable V is weakened. As a result, the predictor comes to depend on the causal variable S, which is more strongly correlated with the pseudolabels.


Figure 3: Evolution of pseudolabels during MC-Pseudolabel. The first row plots values of pseudolabels against model predictions. The second row plots values of pseudolabels against VV. Columns represent different snapshots during optimization.

E.2 VesselPower

In Figure 4, we provide the correlation between models' in-distribution validation performance and out-of-distribution test performance across all methods.

Figure 4: Correlation between models’ in-distribution and out-of-distribution risks on VesselPower.

E.3 Training Details

We follow DomainBed's protocol [Gulrajani and Lopez-Paz, 2021] for model selection. Specifically, we randomly sample 20 sets of hyperparameters for each method, containing both the training hyperparameters of the base models in Table 3 and the extra hyperparameters of the robust learning algorithms in Table 4. We select the best model across hyperparameters based on three model selection criteria. In-distribution (ID) validation selects the model with the best average metric on an in-distribution validation dataset, which is sampled from the same distribution as the training data. Worst-environment (Worst) validation selects the best model by the worst performance across all environments in the in-distribution validation dataset; it is applicable only to the multi-environment setting. Oracle validation selects the best model by an out-of-distribution validation dataset sampled from the target distribution of the test data. Oracle validation leaks the test distribution, so it is not recommended by DomainBed. However, most robust learning methods rely on out-of-distribution validation, so DomainBed suggests a limited number of accesses to target data when using Oracle validation. Though MC-Pseudolabel already performs well under ID and Worst validation, we still report its performance under Oracle validation to compare the limits of robust learning methods regardless of model selection. Specifically for PovertyMap, an OOD validation set is provided for Oracle validation. The results under Oracle validation recover PovertyMap's public benchmark.

Following DomainBed, the entire model selection procedure is repeated with different seeds three times to measure the standard errors of performances. Thus, we have a total of 60 runs per method per dataset. Specifically for PovertyMap, we follow WILDS' setup [Koh et al., 2021] and perform 5-fold cross validation instead of three repeated experiments. For each fold of the dataset, we conduct the model selection procedure with a single seed, and we report the average and standard error of performance across the 5 folds. Thus, the standard error measures both the difficulty disparity across folds and the model's instability.
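The three model selection criteria described above can be sketched as follows; the run representation and field names (`id_val`, `ood_val`) are illustrative, not from DomainBed's code:

```python
def select_model(runs, criterion):
    """Pick the best hyperparameter run under one of the three model
    selection criteria. Each run is a dict holding per-environment
    in-distribution validation errors and an oracle (target-distribution)
    validation error; lower is better."""
    if criterion == "id":        # average in-distribution validation error
        key = lambda r: sum(r["id_val"]) / len(r["id_val"])
    elif criterion == "worst":   # worst environment in the ID validation set
        key = lambda r: max(r["id_val"])
    elif criterion == "oracle":  # held-out target-distribution error
        key = lambda r: r["ood_val"]
    else:
        raise ValueError(f"unknown criterion: {criterion}")
    return min(runs, key=key)
```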

The grouping function class of MC-Pseudolabel is implemented according to Equation 11 and Equation 12 for the multi-environment and single-environment settings respectively. For the multi-environment setting, the environment classifier p(e|x,y) is implemented as an MLP with a single hidden layer of size 100 for tabular datasets, including Simulation and ACSIncome. For PovertyMap, the environment classifier is implemented by Resnet18-MS with the same architecture as the predictor. For the single-environment setting of VesselPower, the identification model f_{id} is implemented as a Ridge regression model.
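For the multi-environment setting, the classifier's posteriors can be converted into density-ratio grouping functions by Bayes' rule, p_{e}(x,y)/p_{S}(x,y)=p(e|x,y)/P_{S}(e); the sketch below is ours, with illustrative array shapes (see Equation 11 for the exact form used):

```python
import numpy as np

def grouping_values(env_posterior, env_prior):
    """Turn an environment classifier's posteriors p(e|x,y) into the
    density-ratio grouping functions h_e(x,y) = p_e(x,y)/p_S(x,y).
    env_posterior: (n_samples, n_envs), rows summing to 1;
    env_prior: environment marginals P_S(e), length n_envs."""
    return env_posterior / np.asarray(env_prior)[None, :]
```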

Table 3: Hyperparameters for model architecture.
Simulation ACSIncome VesselPower PovertyMap
Architecture Linear MLP MLP Resnet18-MS
Hidden Layer Dimensions None 16 32, 8 Standard [Koh et al., 2021]
Optimizer Adam Adam Adam Adam
Weight Decay 0 0 0 0
Loss MSE MSE MSE MSE
Learning Rate 0.1 10^{\text{Uniform}[-3,-1]} 10^{\text{Uniform}[-3,-1]} 10^{\text{Uniform}[-4,-2]}
Batch Size 1024 [256, 512, 1024, 2048] [256, 512, 1024, 2048] [32, 64, 128]
Table 4: Hyperparameters for robust learning methods.
Range
Regularizer Coefficient 10^{\text{Uniform}[-3,2]}
\eta (GroupDRO) 10^{\text{Uniform}[-3,2]}
\alpha (JTT) [0.1, 0.2, 0.5, 0.7]
\lambda (JTT) [5, 10, 20, 50]
\eta (\chi^{2}-DRO) [0.2, 0.5, 1, 1.5]
t (Tilted-ERM) [0.1, 0.5, 1, 5, 10, 50, 100, 200]
\alpha (C-Mixup) [0.5, 1, 1.5, 2]
\sigma (C-Mixup) [0.01, 0.1, 1, 10, 100]

E.4 Software and Hardware

Our experiments are based on the architecture of PyTorch [Paszke et al., 2019]. Each experiment with a single set of hyperparameters is run on one NVIDIA GeForce RTX 3090 with 24GB of memory, taking at most 15 minutes.

Appendix F Theory

F.1 Multicalibration and Bayes Optimality under Covariate Shift

Assumption F.1 (Restatement of Assumption 2.2).

1. (Closure under Covariate Shift) For a set of probability measures \mathcal{P}(X) containing the source measure P_{S}(X),

\forall P\in\mathcal{P},\ h\in\mathcal{H}\ \Rightarrow\ \frac{p(\cdot)}{p_{S}(\cdot)}\cdot h(\cdot)\in\mathcal{H}. (20)

2. ((\gamma,\rho)-Weak Learning Condition) For any P\in\mathcal{P}(X)P_{S}(Y|X)\equiv\{P^{\prime}(X)P_{S}(Y\mid X):P^{\prime}\in\mathcal{P}\} with the source conditional measure P_{S}(Y|X) and every measurable set G\subset\mathcal{X} satisfying P(X\in G)>\rho, if

\mathbb{E}_{P}[(\mathbb{E}_{P}[Y|X]-Y)^{2}|X\in G]<\mathbb{E}_{P}[(\mathbb{E}_{P}[Y|X\in G]-Y)^{2}|X\in G]-\gamma, (21)

then there exists h\in\mathcal{H} satisfying

\mathbb{E}_{P}[(h(X)-Y)^{2}|X\in G]<\mathbb{E}_{P}[(\mathbb{E}_{P}[Y|X\in G]-Y)^{2}|X\in G]-\gamma. (22)
Lemma F.2 (Globus-Harris et al. [2023]).

Fix any distribution P\in\mathcal{P}(X,Y), any model f:\mathcal{X}\rightarrow[0,1], and any class of real-valued functions \mathcal{H} that is closed under affine transformation. Let

\mathcal{H}_{1}=\{h\in\mathcal{H}:\max_{x\in\mathcal{X}}h(x)^{2}\leq 1\}

be the set of functions in \mathcal{H} upper-bounded by 1 on \mathcal{X}. Let m=|\text{Range}(f)|, \gamma>0, and \alpha\leq\frac{\gamma^{3}}{16m}. Then if \mathcal{H} satisfies the (\gamma,\frac{\gamma}{m})-weak learning condition and f is \alpha-approximately \ell_{2} multicalibrated with respect to \mathcal{H}_{1} and P, then f has squared error

\mathbb{E}_{P}[(f(x)-y)^{2}]\leq\mathbb{E}_{P}[(f^{*}(x)-y)^{2}]+3\gamma,

where f^{*}(x)=\mathbb{E}_{P}[Y|x].

Definition F.3.

For a probability measure P(X,Y) and a predictor f:\mathcal{X}\rightarrow[0,1], let \mathcal{H}\subset\mathbb{R}^{\mathcal{X}\times\mathcal{Y}} be a function class. We say that f is \alpha-approximately \ell_{1} multicalibrated w.r.t. \mathcal{H} and P if for all h\in\mathcal{H}:

K_{1}(f,h,P) (23)
=\int\Big|\mathbb{E}_{P}\left[h(X,Y)(Y-v)\big|f(X)=v\right]\Big|\,dP_{f(X)}(v) (24)
\leq\alpha. (25)
Lemma F.4 (Roth [2022]).

Suppose P_{S},P_{T}\in\mathcal{P} have the same conditional label distribution, and suppose f is \alpha-approximately \ell_{1} multicalibrated with respect to P_{S} and \mathcal{H}. If \mathcal{H} satisfies Equation 20, then f is also \alpha-approximately \ell_{1} multicalibrated with respect to P_{T} and \mathcal{H}.

Lemma F.5.

For a predictor f:\mathcal{X}\rightarrow[0,1] and a grouping function h satisfying \max_{x\in\mathcal{X},y\in\mathcal{Y}}h(x,y)^{2}\leq B where B>0,

\frac{1}{\sqrt{B}}K_{2}(f,h,P)\leq K_{1}(f,h,P)\leq\sqrt{K_{2}(f,h,P)}. (26)
Remark F.6.

The lemma extends Roth [2022]'s result for B=1.

Proof.

First we prove K_{2}(f,h,P)\leq\sqrt{B}K_{1}(f,h,P). For any v\in[0,1],

\Big(\mathbb{E}_{P}\left[h(X,Y)(Y-v)\big|f(X)=v\right]\Big)^{2} (27)
=\Big|\mathbb{E}_{P}\left[h(X,Y)(Y-v)\big|f(X)=v\right]\Big|\cdot\Big|\mathbb{E}_{P}\left[h(X,Y)(Y-v)\big|f(X)=v\right]\Big| (28)
\leq\sqrt{\mathbb{E}\left[h(X,Y)^{2}\big|f(X)=v\right]\mathbb{E}\left[(Y-v)^{2}\big|f(X)=v\right]}\cdot\Big|\mathbb{E}_{P}\left[h(X,Y)(Y-v)\big|f(X)=v\right]\Big| (29)-(30)
\leq\sqrt{B}\cdot\Big|\mathbb{E}_{P}\left[h(X,Y)(Y-v)\big|f(X)=v\right]\Big|. (31)

Equation 29 follows from the Cauchy-Schwarz inequality, and Equation 31 uses h(X,Y)^{2}\leq B and (Y-v)^{2}\leq 1.

K_{2}(f,h,P)=\mathbb{E}_{v\sim P_{f(X)}}\left[\Big(\mathbb{E}_{P}\left[h(X,Y)(Y-v)\big|f(X)=v\right]\Big)^{2}\right] (32)
\leq\sqrt{B}\,\mathbb{E}_{v\sim P_{f(X)}}\Big|\mathbb{E}_{P}\left[h(X,Y)(Y-v)\big|f(X)=v\right]\Big| (33)
=\sqrt{B}K_{1}(f,h,P). (34)

Then we prove K_{1}(f,h,P)\leq\sqrt{K_{2}(f,h,P)}. Again from the Cauchy-Schwarz inequality:

K_{1}(f,h,P)=\mathbb{E}_{v\sim P_{f(X)}}\Big|\mathbb{E}_{P}\left[h(X,Y)(Y-v)\big|f(X)=v\right]\Big| (35)
\leq\sqrt{\mathbb{E}_{v\sim P_{f(X)}}[1^{2}]\;\mathbb{E}_{v\sim P_{f(X)}}\left[\Big(\mathbb{E}_{P}\left[h(X,Y)(Y-v)\big|f(X)=v\right]\Big)^{2}\right]} (36)
=\sqrt{K_{2}(f,h,P)}. (37)
∎

Theorem F.7 (Restatement of Theorem 2.3).

For a source measure P_{S}(X,Y) and a set of probability measures \mathcal{P}(X) containing P_{S}(X), given a predictor f:\mathcal{X}\rightarrow[0,1] with finite range m:=|\text{Range}(f)|, consider a grouping function class \mathcal{H} closed under affine transformation satisfying Assumption 2.2 with \rho=\gamma/m. If f is \frac{\gamma^{6}}{256m^{2}}-approximately \ell_{2} multicalibrated w.r.t. P_{S} and \mathcal{H}'s bounded subset \mathcal{H}_{1}:=\left\{h\in\mathcal{H}:\max_{x\in\mathcal{X}}h(x)^{2}\leq 1\right\}, then for any target measure P_{T}\in\mathcal{P}(X)P_{S}(Y|X),

R_{P_{T}}(f)\leq R_{P_{T}}(f^{*})+3\gamma, (38)

where f^{*}(x)=\mathbb{E}_{P_{T}}[Y|x] is the optimal regression function under the target distribution.

Proof.

For |h(x)|\leq 1 and f\in[0,1], according to Lemma F.5,

K_{2}(f,h,P)\leq K_{1}(f,h,P)\leq\sqrt{K_{2}(f,h,P)}. (39)

Since K_{2}(f,h,P_{S})\leq\frac{\gamma^{6}}{256m^{2}} for any h\in\mathcal{H}_{1},

K_{1}(f,h,P_{S})\leq\frac{\gamma^{3}}{16m}. (40)

With Lemma F.4, for any h\in\mathcal{H}_{1} and P_{T}\in\mathcal{P}, we have

K_{1}(f,h,P_{T})\leq\frac{\gamma^{3}}{16m}. (41)

Again, it follows from Lemma F.5 that:

K_{2}(f,h,P_{T})\leq\frac{\gamma^{3}}{16m}. (42)

With Lemma F.2, we have

\mathbb{E}_{P_{T}}[(f(x)-y)^{2}]\leq\mathbb{E}_{P_{T}}[(f^{*}(x)-y)^{2}]+3\gamma, (43)

which completes the proof. ∎

F.2 Multicalibration and Invariance under Concept Shift

Theorem F.8 (Restatement of Theorem 3.1).

For a set of absolutely continuous probability measures \mathcal{P}(X,Y) containing the source measure P_{S}(X,Y), consider a predictor f:\mathcal{X}\rightarrow[0,1]. Assume the grouping function class \mathcal{H} satisfies the following condition:

\mathcal{H}\supset\left\{h(x,y)=\frac{p(x,y)}{p_{S}(x,y)}\,\Big|\,P\in\mathcal{P}(X,Y)\right\}. (44)

If f is \alpha-approximately \ell_{2} multicalibrated w.r.t. \mathcal{H} and P_{S}, then for any measure P\in\mathcal{P}(X,Y),

R_{P}(f)\leq\inf_{g:[0,1]\rightarrow[0,1]}R_{P}(g\circ f)+2\sqrt{\alpha}. (45)
Proof.

For any h(x,y)=p(x,y)/p_{S}(x,y) where P\in\mathcal{P}, since f is \alpha-approximately \ell_{2} multicalibrated, K_{2}(f,h,P_{S})\leq\alpha. It follows from Lemma F.5 that K_{1}(f,h,P_{S})\leq\sqrt{\alpha}.

K_{1}(f,h,P_{S})=\int\Big|\mathbb{E}_{P_{S}}\left[h(X,Y)(Y-v)\big|f(X)=v\right]\Big|dP_{S}(f^{-1}(v)) (46)
=\int\left|\int\frac{p(x,y)}{p_{S}(x,y)}(y-v)p_{S}(x,y\big|f=v)d(x,y)\right|dP_{S}(f^{-1}(v)) (47)
=\int\left|\int\frac{dP(f^{-1}(v))}{dP_{S}(f^{-1}(v))}(y-v)p(x,y\big|f=v)d(x,y)\right|dP_{S}(f^{-1}(v)) (48)
=\int\frac{dP(f^{-1}(v))}{dP_{S}(f^{-1}(v))}\left|\int(y-v)p(x,y\big|f=v)d(x,y)\right|dP_{S}(f^{-1}(v)) (49)
=\int\Big|\mathbb{E}_{P}\left[Y-v\big|f(X)=v\right]\Big|dP(f^{-1}(v)) (50)
=K_{1}(f,1,P). (51)

Thus, we have K_{1}(f,1,P)\leq\sqrt{\alpha} for any P\in\mathcal{P}. We will prove an equivalent form of the \ell_{1} calibration error:

K_{1}(f,1,P)=\sup_{\eta:[0,1]\rightarrow[-1,1]}\int\eta(v)\mathbb{E}_{P}\left[Y-v\big|f(X)=v\right]dP_{f(X)}(v) (52)
=\sup_{\eta:[0,1]\rightarrow[-1,1]}\mathbb{E}_{P}[\eta(f(X))(Y-f(X))]. (53)

K_{1}(f,1,P)\leq\sup_{\eta:[0,1]\rightarrow[-1,1]}\mathbb{E}_{P}[\eta(f(X))(Y-f(X))] can be proved by taking \eta(v)=2\mathbb{I}\left[\mathbb{E}_{P}[Y-v|f(X)=v]>0\right]-1, the sign of the conditional bias. On the other hand,

\int\eta(v)\mathbb{E}_{P}\left[Y-v\big|f(X)=v\right]dP_{f(X)}(v)\leq\int\Big|\eta(v)\Big|\cdot\Big|\mathbb{E}_{P}\left[Y-v\big|f(X)=v\right]\Big|dP_{f(X)}(v) (54)
\leq K_{1}(f,1,P). (55)

Actually, the right-hand side of Equation 52 resembles smooth calibration [Blasiok et al., 2023a], which restricts \eta to Lipschitz functions. Based on smooth calibration, Blasiok et al. [2023b] show that approximately calibrated predictors cannot be improved much by post-processing. Above we present a similar proof for the \ell_{1} calibration error.

For any g:[0,1]\rightarrow[0,1], there exists \eta:[0,1]\rightarrow[-1,1] such that g(v)=v+\eta(v) for any v\in[0,1].

R_{P}(g\circ f)=\mathbb{E}_{P}\left[\big(Y-f(X)-\eta(f(X))\big)^{2}\right] (56)
=\mathbb{E}_{P}\left[(Y-f(X))^{2}\right]-2\mathbb{E}_{P}\left[(Y-f(X))\eta(f(X))\right]+\mathbb{E}_{P}\left[\eta(f(X))^{2}\right] (57)
\geq R_{P}(f)-\sup_{\eta^{\prime}:[0,1]\rightarrow[-1,1]}2\mathbb{E}[\eta^{\prime}(f(X))(Y-f(X))] (58)
=R_{P}(f)-2K_{1}(f,1,P) (59)
\geq R_{P}(f)-2\sqrt{\alpha}. (60)
∎

Theorem F.9 (Restatement of Theorem 3.4).

Assume samples are drawn from an environment e\in\mathscr{E} with a prior P_{S}(e) such that \sum_{e\in\mathscr{E}}P_{S}(e)=1 and P_{S}(e)>0. The overall population satisfies P_{S}(X,Y)=\sum_{e\in\mathscr{E}}P_{e}(X,Y)P_{S}(e), where P_{e}(X,Y) is the environment-specific absolutely continuous measure. For a representation \Phi\in\sigma(X) over features, define a function class \mathcal{H} as:

\mathcal{H}:=\left\{h(x,y)=\frac{p_{e}(x,y)}{p_{S}(x,y)}\,\Big|\,e\in\mathscr{E}\right\}. (61)

1. If there exists a bijection g^{*}:\text{supp}(\Phi)\rightarrow[0,1] such that g^{*}\circ\Phi is \alpha-approximately \ell_{2} multicalibrated w.r.t. \mathcal{H} and P_{S}, then for any e\in\mathscr{E},

R_{P_{e}}(g^{*}\circ\Phi)\leq\inf_{g:\text{supp}(\Phi)\rightarrow[0,1]}R_{P_{e}}(g\circ\Phi)+2\sqrt{\alpha}. (62)

2. For \Phi\in\sigma(X), if there exists g^{*}:\text{supp}(\Phi)\rightarrow[0,1] such that Equation 62 is satisfied for any e\in\mathscr{E}, then g^{*}\circ\Phi is \sqrt{2/D}\alpha^{1/4}-approximately \ell_{2} multicalibrated w.r.t. \mathcal{H} and P_{S}, where D=\min_{e\in\mathscr{E}}P_{S}(e).

Proof.

We first prove statement 1.

For any g:supp(Φ)[0,1]g:\text{supp}(\Phi)\rightarrow[0,1], since gg^{*} is a bijection, gΦ=(gg1)(gΦ)g\circ\Phi=(g\circ{g^{*}}^{-1})(g^{*}\circ\Phi) where gg1[0,1][0,1]g\circ{g^{*}}^{-1}\in[0,1]^{[0,1]}. Since gΦg^{*}\circ\Phi is α\alpha-approximately 2\ell_{2} multicalibrated, it follows from Theorem F.8 that for any ee\in\mathscr{E},

RPe(gΦ)\displaystyle R_{P_{e}}(g^{*}\circ\Phi) RPe((gg1)(gΦ))+2α\displaystyle\leq R_{P_{e}}\big{(}(g\circ{g^{*}}^{-1})(g^{*}\circ\Phi)\big{)}+2\sqrt{\alpha} (63)
=RPe(gΦ)+2α.\displaystyle=R_{P_{e}}(g\circ\Phi)+2\sqrt{\alpha}. (64)

Then we give a proof of statement 2, which is inspired by Blasiok et al. [Blasiok et al., 2023b].

For simplicity let f:=gΦf^{*}:=g^{*}\circ\Phi. For any ee\in\mathscr{E} and any η:[0,1][1,1]\eta:[0,1]\rightarrow[-1,1], define β:=EPe[(Yf(X))η(f(X))][1,1]\beta:=E_{P_{e}}\left[(Y-f^{*}(X))\eta(f^{*}(X))\right]\in[-1,1]. Construct κ(v):=proj[0,1](v+βη(v))\kappa(v):=\text{proj}_{[0,1]}(v+\beta\eta(v)), where proj[0,1]()=max{0,min{1,}}\text{proj}_{[0,1]}(\cdot)=\max\{0,\min\{1,\cdot\}\}.

RPe(κ(f))\displaystyle R_{P_{e}}(\kappa(f^{*})) =𝔼Pe[(Yκ(f(X)))2]\displaystyle=\mathbb{E}_{P_{e}}\left[\big{(}Y-\kappa(f^{*}(X))\big{)}^{2}\right] (65)
𝔼Pe[(Yf(X)βη(f(X)))2]\displaystyle\leq\mathbb{E}_{P_{e}}\left[\big{(}Y-f^{*}(X)-\beta\eta(f^{*}(X))\big{)}^{2}\right] (66)
=𝔼Pe[(Yf(X))2]2β2+β2𝔼Pe[η(f(x))2]\displaystyle=\mathbb{E}_{P_{e}}\left[(Y-f^{*}(X))^{2}\right]-2\beta^{2}+\beta^{2}\mathbb{E}_{P_{e}}[\eta(f^{*}(x))^{2}] (67)
RPe(f)2β2+β2\displaystyle\leq R_{P_{e}}(f^{*})-2\beta^{2}+\beta^{2} (68)
=RPe(f)β2.\displaystyle=R_{P_{e}}(f^{*})-\beta^{2}. (69)

Rearranging the inequality above gives:

(EPe[(Yf(X))η(f(X))])2=β2RPe(f)RPe(κ(f)).\displaystyle\Big{(}E_{P_{e}}\left[(Y-f^{*}(X))\eta(f^{*}(X))\right]\Big{)}^{2}=\beta^{2}\leq R_{P_{e}}(f^{*})-R_{P_{e}}(\kappa(f^{*})). (70)

Since κgsupp(Φ)[0,1]\kappa\circ g^{*}\in\text{supp}(\Phi)\rightarrow[0,1], it follows from Equation 62 that:

RPe(f)RPe(κ(f))=RPe(gΦ)RPe((κg)Φ)2α.\displaystyle R_{P_{e}}(f^{*})-R_{P_{e}}(\kappa(f^{*}))=R_{P_{e}}(g^{*}\circ\Phi)-R_{P_{e}}((\kappa\circ g^{*})\circ\Phi)\leq 2\sqrt{\alpha}. (71)

Combining Equation 70 and Equation 71 gives EPe[(Yf(X))η(f(X))]2α1/4E_{P_{e}}\left[(Y-f^{*}(X))\eta(f^{*}(X))\right]\leq\sqrt{2}\alpha^{1/4} for any η:[0,1][1,1]\eta:[0,1]\rightarrow[-1,1]. From Equation 52, it follows that:

K1(f,1,Pe)=supη:[0,1][1,1]𝔼Pe[η(f(X))(Yf(X))]2α1/4.\displaystyle K_{1}(f^{*},1,P_{e})=\sup_{\eta:[0,1]\rightarrow[-1,1]}\mathbb{E}_{P_{e}}[\eta(f^{*}(X))(Y-f^{*}(X))]\leq\sqrt{2}\alpha^{1/4}. (72)

From Equation 51, K1(f,h,PS)2α1/4K_{1}(f^{*},h,P_{S})\leq\sqrt{2}\alpha^{1/4} for any hh\in\mathcal{H}. Further, for any hh\in\mathcal{H},

h(x,y)\displaystyle h(x,y) =pe(x,y)pS(x,y)\displaystyle=\frac{p_{e}(x,y)}{p_{S}(x,y)} (73)
=pe(x,y)epe(x,y)PS(e)\displaystyle=\frac{p_{e}(x,y)}{\sum_{e^{\prime}\in\mathscr{E}}{p_{e^{\prime}}(x,y)}P_{S}(e^{\prime})} (74)
pe(x,y)pe(x,y)PS(e)\displaystyle\leq\frac{p_{e}(x,y)}{p_{e}(x,y)P_{S}(e)} (75)
1D.\displaystyle\leq\frac{1}{D}. (76)

By Lemma F.5, it follows that K2(f,h,PS)1/DK1(f,h,PS)2/Dα1/4.K_{2}(f^{*},h,P_{S})\leq\sqrt{1/D}\cdot K_{1}(f^{*},h,P_{S})\leq\sqrt{2/D}\alpha^{1/4}.
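The density ratio bound in Equations 73–76 can be sanity-checked numerically. Below is a minimal sketch on a toy discrete mixture; the pmfs, the mixing weights, and the variable names are illustrative assumptions, with `D` playing the role of a lower bound on the mixture weights:

```python
# Toy check: when the source P_S is a mixture of environment distributions
# with weights P_S(e), every density ratio p_e / p_S is bounded by 1 / D,
# where D lower-bounds the weights P_S(e). All numbers are made up.
points = [(0, 0), (0, 1), (1, 0), (1, 1)]  # a tiny (x, y) space

p = {
    "e1": {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3},
    "e2": {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.3, (1, 1): 0.2},
}
w = {"e1": 0.7, "e2": 0.3}   # mixing weights P_S(e)
D = min(w.values())          # D = min_e P_S(e)

# p_S(x, y) = sum_e p_e(x, y) * P_S(e)
p_S = {z: sum(w[e] * p[e][z] for e in p) for z in points}
max_ratio = max(p[e][z] / p_S[z] for e in p for z in points)

assert abs(sum(p_S.values()) - 1.0) < 1e-9   # p_S is a valid pmf
assert max_ratio <= 1.0 / D + 1e-9           # the claimed bound
```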

F.3 Structure of Grouping Function Classes

In this subsection, we focus on Euclidean space where 𝒳d\mathcal{X}\subset\mathbb{R}^{d} is compact and measurable for some dZ+d\in Z^{+} and 𝒴=[0,1]\mathcal{Y}=[0,1]. Grouping functions are assumed to be continuous, i.e., hC(𝒳×𝒴)h\in C(\mathcal{X}\times\mathcal{Y}). We consider absolutely continuous probability measures with continuous density functions.

Proposition F.10 (Restatement of Proposition 4.1).

Consider an absolutely continuous probability measure PS(X,Y)P_{S}(X,Y) and a predictor f:𝒳[0,1]f:\mathcal{X}\rightarrow[0,1], define the maximal grouping function class that ff is multicalibrated with respect to:

f:={hC(𝒳×𝒴):K2(f,h,PS)=0}.\displaystyle\mathcal{H}_{f}:=\{h\in C(\mathcal{X}\times\mathcal{Y}):K_{2}(f,h,P_{S})=0\}. (77)

Then f\mathcal{H}_{f} is a linear space.

Particularly for fΦ(x)=𝔼[Y|Φ(x)]f_{\Phi}(x)=\mathbb{E}[Y|\Phi(x)] where Φ:ddΦ,dΦZ+\Phi:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d_{\Phi}},{d_{\Phi}}\in Z^{+} is a measurable function, we abbreviate fΦ\mathcal{H}_{f_{\Phi}} with Φ\mathcal{H}_{\Phi}. Then any finite subset 𝒢Φ\mathcal{G}\subset\mathcal{H}_{\Phi} implies span{1,𝒢}Φ\text{span}\{1,\mathcal{G}\}\subset\mathcal{H}_{\Phi}, where 11 denotes a constant function.

Proof.

For any γ1,γ2\gamma_{1},\gamma_{2}\in\mathbb{R} and any h1,h2fh_{1},h_{2}\in\mathcal{H}_{f}, γ1h1+γ2h2C(𝒳×𝒴)\gamma_{1}h_{1}+\gamma_{2}h_{2}\in C(\mathcal{X}\times\mathcal{Y}).

K1(f,γ1h1+γ2h2,PS)\displaystyle K_{1}(f,\gamma_{1}h_{1}+\gamma_{2}h_{2},P_{S}) =|𝔼[(γ1h1(X,Y)+γ2h2(X,Y))(Yv)|f(X)=v]|dPS(f1(v))\displaystyle=\int\Big{|}\mathbb{E}\left[(\gamma_{1}h_{1}(X,Y)+\gamma_{2}h_{2}(X,Y))(Y-v)\big{|}f(X)=v\right]\Big{|}dP_{S}(f^{-1}(v)) (78)
|𝔼[γ1h1(X,Y)(Yv)|f(X)=v]|dPS(f1(v))\displaystyle\leq\int\Big{|}\mathbb{E}\left[\gamma_{1}h_{1}(X,Y)(Y-v)\big{|}f(X)=v\right]\Big{|}dP_{S}(f^{-1}(v)) (79)
+|𝔼[γ2h2(X,Y)(Yv)|f(X)=v]|dPS(f1(v))\displaystyle\;\;\;\;+\int\Big{|}\mathbb{E}\left[\gamma_{2}h_{2}(X,Y)(Y-v)\big{|}f(X)=v\right]\Big{|}dP_{S}(f^{-1}(v)) (80)
=|γ1|K1(f,h1,PS)+|γ2|K1(f,h2,PS)\displaystyle=|\gamma_{1}|K_{1}(f,h_{1},P_{S})+|\gamma_{2}|K_{1}(f,h_{2},P_{S}) (81)
=0.\displaystyle=0. (82)

According to Lemma F.5, K1(f,h,PS)=0K_{1}(f,h,P_{S})=0 is equivalent to K2(f,h,PS)=0K_{2}(f,h,P_{S})=0 for bounded hh. Thus, K2(f,γ1h1+γ2h2,PS)=0K_{2}(f,\gamma_{1}h_{1}+\gamma_{2}h_{2},P_{S})=0, which implies γ1h1+γ2h2f\gamma_{1}h_{1}+\gamma_{2}h_{2}\in\mathcal{H}_{f}. This finishes the proof that f\mathcal{H}_{f} is a linear space.

For fΦ(x)=𝔼[Y|Φ(x)]f_{\Phi}(x)=\mathbb{E}[Y|\Phi(x)],

𝔼[Y|fΦ(X)]\displaystyle\mathbb{E}[Y|f_{\Phi}(X)] =𝔼[𝔼[Y|fΦ(X),Φ(X)]|fΦ(X)]\displaystyle=\mathbb{E}\left[\mathbb{E}[Y|f_{\Phi}(X),\Phi(X)]\big{|}f_{\Phi}(X)\right] (83)
=𝔼[𝔼[Y|Φ(X)]|fΦ(X)]\displaystyle=\mathbb{E}\left[\mathbb{E}[Y|\Phi(X)]\big{|}f_{\Phi}(X)\right] (84)
=𝔼[fΦ(X)|fΦ(X)]\displaystyle=\mathbb{E}[f_{\Phi}(X)|f_{\Phi}(X)] (85)
=fΦ(X).\displaystyle=f_{\Phi}(X). (86)

Thus, K2(fΦ,1,PS)=0K_{2}(f_{\Phi},1,P_{S})=0 which implies 1Φ1\in\mathcal{H}_{\Phi}. Since Φ\mathcal{H}_{\Phi} is a linear space, 𝒢Φ\mathcal{G}\subset\mathcal{H}_{\Phi} implies span{1,𝒢}Φ\text{span}\{1,\mathcal{G}\}\subset\mathcal{H}_{\Phi}. ∎
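The linearity behind Proposition F.10 can be checked numerically: the per-level functional T_v(h) = E[h(X,Y)(Y - v) | f(X) = v] is linear in h, so its zero set is a linear space. A minimal sketch with synthetic data (the predictor, grouping functions, and sample sizes are illustrative assumptions; each queried level set is assumed non-empty):

```python
# Toy check of linearity: T_v(gamma1*h1 + gamma2*h2) equals
# gamma1*T_v(h1) + gamma2*T_v(h2) for the calibration functional T_v.
import random

random.seed(0)
data = [(random.random(), random.random()) for _ in range(1000)]  # (x, y) pairs
f = lambda x: round(x, 1)  # a toy predictor with discrete level sets

def T(h, v):
    # empirical E[h(X, Y) * (Y - v) | f(X) = v]
    pts = [(x, y) for x, y in data if f(x) == v]
    return sum(h(x, y) * (y - v) for x, y in pts) / len(pts)

h1 = lambda x, y: x + y
h2 = lambda x, y: x * y
g1, g2 = 2.0, -3.0
combo = lambda x, y: g1 * h1(x, y) + g2 * h2(x, y)

v = 0.5
assert abs(T(combo, v) - (g1 * T(h1, v) + g2 * T(h2, v))) < 1e-9
```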

Lemma F.11.

For any absolutely continuous probability measure P(X,Y)P(X,Y) with a continuous density function pp, and any grouping function hC(𝒳×𝒴)h\in C(\mathcal{X}\times\mathcal{Y}), there exist γ0\gamma\neq 0 and ρ\rho\in\mathbb{R} such that γh+ρC(𝒳×𝒴)\gamma h+\rho\in C(\mathcal{X}\times\mathcal{Y}), and it is the density ratio between some absolutely continuous probability measure PP^{\prime} and PP, i.e., dP(x,y)=(γh(x,y)+ρ)dP(x,y)dP^{\prime}(x,y)=(\gamma h(x,y)+\rho)dP(x,y), where PP^{\prime} also has a continuous density function.

Proof.

Since hh is continuous on the compact set 𝒳×𝒴\mathcal{X}\times\mathcal{Y} and hence bounded, there exists ρ\rho^{\prime}\in\mathbb{R} such that h(x,y)+ρ>0h(x,y)+\rho^{\prime}>0 for any x𝒳,y𝒴x\in\mathcal{X},y\in\mathcal{Y}. Define:

γ\displaystyle\gamma =1(h(x,y)+ρ)𝑑P(x,y).\displaystyle=\frac{1}{\int(h(x,y)+\rho^{\prime})dP(x,y)}. (87)
ρ\displaystyle\rho =ρ(h(x,y)+ρ)𝑑P(x,y).\displaystyle=\frac{\rho^{\prime}}{\int(h(x,y)+\rho^{\prime})dP(x,y)}. (88)

γh+ρ\gamma h+\rho is still continuous.

We have γh(x,y)+ρ=γ(h(x,y)+ρ)>0\gamma h(x,y)+\rho=\gamma(h(x,y)+\rho^{\prime})>0 for any x𝒳,y𝒴x\in\mathcal{X},y\in\mathcal{Y}.

𝑑P(x,y)\displaystyle\int dP^{\prime}(x,y) =(γh(x,y)+ρ)𝑑P(x,y)\displaystyle=\int(\gamma h(x,y)+\rho)dP(x,y) (89)
=γ(h(x,y)+ρ)𝑑P(x,y)\displaystyle=\gamma\int(h(x,y)+\rho^{\prime})dP(x,y) (90)
=1.\displaystyle=1. (91)

Thus, P(X,Y)P^{\prime}(X,Y) is an absolutely continuous probability measure.

Its density function p=(γh+ρ)pp^{\prime}=(\gamma h+\rho)p is continuous. ∎
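Lemma F.11's normalization can be illustrated on a toy discrete measure: shift any bounded grouping function to be positive, then rescale it so it integrates to one against P. The pmf and h below are illustrative assumptions:

```python
# Toy check of Lemma F.11: gamma*h + rho is a valid density ratio, i.e.
# it is positive and the reweighted measure p_new sums to 1.
points = [0, 1, 2, 3]
p = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}        # base measure P (a pmf)
h = {0: -1.5, 1: 0.5, 2: 2.0, 3: -0.25}     # a bounded grouping function

rho_prime = 1.0 - min(h.values())           # ensures h + rho' > 0 everywhere
Z = sum((h[z] + rho_prime) * p[z] for z in points)
gamma, rho = 1.0 / Z, rho_prime / Z         # as in Equations 87-88

ratio = {z: gamma * h[z] + rho for z in points}
p_new = {z: ratio[z] * p[z] for z in points}

assert min(ratio.values()) > 0
assert abs(sum(p_new.values()) - 1.0) < 1e-9
```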

Theorem F.12.

Consider an absolutely continuous probability measure PS(X,Y)P_{S}(X,Y) and a predictor fΦ(x)=𝔼PS[Y|Φ(x)]f_{\Phi}(x)=\mathbb{E}_{P_{S}}[Y|\Phi(x)] where Φ:ddΦ,dΦZ+\Phi:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d_{\Phi}},{d_{\Phi}}\in Z^{+} is a measurable function. We abbreviate fΦ\mathcal{H}_{f_{\Phi}} with Φ\mathcal{H}_{\Phi}.

Φ\displaystyle\mathcal{H}_{\Phi} ={hC(𝒳×𝒴):CovPS[h(X,Y),Y|fΦ=v]=0for almost everyv[0,1]}\displaystyle=\left\{h\in C(\mathcal{X}\times\mathcal{Y}):\text{Cov}_{P_{S}}\left[h(X,Y),Y|f_{\Phi}=v\right]=0\;\;\text{for almost every}\;v\in[0,1]\right\} (92)
=span{p(x,y)pS(x,y):𝔼P[Y|fΦ]=𝔼PS[Y|fΦ]almost surely,p is continuous}.\displaystyle=\text{span}\left\{\frac{p(x,y)}{p_{S}(x,y)}:\mathbb{E}_{P}[Y|f_{\Phi}]=\mathbb{E}_{P_{S}}[Y|f_{\Phi}]\;\;\text{almost surely},\;\text{$p$ is continuous}\right\}. (93)
Proof.

First we prove:

Φ={hC(𝒳×𝒴):CovPS[h(X,Y),Y|fΦ=v]=0for almost everyv[0,1]}.\displaystyle\mathcal{H}_{\Phi}=\left\{h\in C(\mathcal{X}\times\mathcal{Y}):\text{Cov}_{P_{S}}\left[h(X,Y),Y|f_{\Phi}=v\right]=0\;\;\text{for almost every}\;v\in[0,1]\right\}. (94)

For each v[0,1]v\in[0,1] and any hC(𝒳×𝒴)h\in C(\mathcal{X}\times\mathcal{Y}),

𝔼PS[h(X,Y)(Yv)|fΦ=v]\displaystyle\mathbb{E}_{P_{S}}\left[h(X,Y)(Y-v)\big{|}f_{\Phi}=v\right] =𝔼PS[h(X,Y)Y|fΦ=v]v𝔼PS[h(X,Y)|fΦ=v]\displaystyle=\mathbb{E}_{P_{S}}\left[h(X,Y)Y\Big{|}f_{\Phi}=v\right]-v\mathbb{E}_{P_{S}}\left[h(X,Y)\Big{|}f_{\Phi}=v\right] (95)
=𝔼PS[h(X,Y)Y|fΦ=v]\displaystyle=\mathbb{E}_{P_{S}}\left[h(X,Y)Y\Big{|}f_{\Phi}=v\right] (96)
𝔼PS[h(X,Y)|fΦ=v]𝔼[Y|fΦ=v]\displaystyle\;\;\;\;-\mathbb{E}_{P_{S}}\left[h(X,Y)\Big{|}f_{\Phi}=v\right]\mathbb{E}[Y|f_{\Phi}=v] (97)
=CovPS[h(X,Y),Y|fΦ=v].\displaystyle=\text{Cov}_{P_{S}}\left[h(X,Y),Y|f_{\Phi}=v\right]. (98)

Equation 97 follows from Equation 86.

For any hC(𝒳×𝒴)h\in C(\mathcal{X}\times\mathcal{Y}),

hΦ\displaystyle h\in\mathcal{H}_{\Phi} K2(fΦ,h,PS)=0\displaystyle\Leftrightarrow K_{2}(f_{\Phi},h,P_{S})=0 (99)
(𝔼PS[h(X,Y)(Yv)|fΦ=v])2𝑑PS(fΦ1(v))=0\displaystyle\Leftrightarrow\int\Big{(}\mathbb{E}_{P_{S}}\left[h(X,Y)(Y-v)\big{|}f_{\Phi}=v\right]\Big{)}^{2}dP_{S}(f_{\Phi}^{-1}(v))=0 (100)
𝔼PS[h(X,Y)(Yv)|fΦ=v]=0for almost everyv[0,1]\displaystyle\Leftrightarrow\mathbb{E}_{P_{S}}\left[h(X,Y)(Y-v)\big{|}f_{\Phi}=v\right]=0\;\;\text{for almost every}\;v\in[0,1] (101)
CovPS[h(X,Y),Y|fΦ=v]=0for almost everyv[0,1].\displaystyle\Leftrightarrow\text{Cov}_{P_{S}}\left[h(X,Y),Y|f_{\Phi}=v\right]=0\;\;\text{for almost every}\;v\in[0,1]. (102)

Next, we prove

Φspan{p(x,y)pS(x,y):𝔼P[Y|fΦ]=𝔼PS[Y|fΦ]almost surely,p is continuous}.\displaystyle\mathcal{H}_{\Phi}\supset\text{span}\left\{\frac{p(x,y)}{p_{S}(x,y)}:\mathbb{E}_{P}[Y|f_{\Phi}]=\mathbb{E}_{P_{S}}[Y|f_{\Phi}]\;\;\text{almost surely},\;\text{$p$ is continuous}\right\}. (103)

This is equivalent to saying that for any absolutely continuous probability measure PP satisfying 𝔼P[Y|fΦ]=𝔼PS[Y|fΦ]\mathbb{E}_{P}[Y|f_{\Phi}]=\mathbb{E}_{P_{S}}[Y|f_{\Phi}] almost surely, CovPS[p(X,Y)/pS(X,Y),Y|fΦ=v]=0\text{Cov}_{P_{S}}\left[p(X,Y)/p_{S}(X,Y),Y|f_{\Phi}=v\right]=0 for almost every v[0,1]v\in[0,1].

CovPS[p(X,Y)pS(X,Y),Y|fΦ=v]\displaystyle\;\;\;\;\text{Cov}_{P_{S}}\left[\frac{p(X,Y)}{p_{S}(X,Y)},Y\Big{|}f_{\Phi}=v\right] (104)
=𝔼PS[p(X,Y)pS(X,Y)Y|fΦ=v]𝔼PS[p(X,Y)pS(X,Y)|fΦ=v]𝔼PS[Y|fΦ=v]\displaystyle=\mathbb{E}_{P_{S}}\left[\frac{p(X,Y)}{p_{S}(X,Y)}Y\Big{|}f_{\Phi}=v\right]-\mathbb{E}_{P_{S}}\left[\frac{p(X,Y)}{p_{S}(X,Y)}\Big{|}f_{\Phi}=v\right]\mathbb{E}_{P_{S}}\left[Y\Big{|}f_{\Phi}=v\right] (105)
=dP(fΦ1(v))dPS(fΦ1(v))p(x,y|fΦ=v)pS(x,y|fΦ=v)y𝑑PS(x,y|fΦ=v)\displaystyle=\int\frac{dP(f_{\Phi}^{-1}(v))}{dP_{S}(f_{\Phi}^{-1}(v))}\frac{p(x,y|f_{\Phi}=v)}{p_{S}(x,y|f_{\Phi}=v)}y\cdot dP_{S}(x,y|f_{\Phi}=v) (106)
𝔼PS[Y|fΦ=v]dP(fΦ1(v))dPS(fΦ1(v))p(x,y|fΦ=v)pS(x,y|fΦ=v)𝑑PS(x,y|fΦ=v)\displaystyle\;\;\;\;-\mathbb{E}_{P_{S}}\left[Y\Big{|}f_{\Phi}=v\right]\cdot\int\frac{dP(f_{\Phi}^{-1}(v))}{dP_{S}(f_{\Phi}^{-1}(v))}\frac{p(x,y|f_{\Phi}=v)}{p_{S}(x,y|f_{\Phi}=v)}\cdot dP_{S}(x,y|f_{\Phi}=v) (107)
=𝔼P[Y|fΦ=v]dP(fΦ1(v))dPS(fΦ1(v))𝔼PS[Y|fΦ=v]dP(fΦ1(v))dPS(fΦ1(v))\displaystyle=\mathbb{E}_{P}\left[Y\Big{|}f_{\Phi}=v\right]\frac{dP(f_{\Phi}^{-1}(v))}{dP_{S}(f_{\Phi}^{-1}(v))}-\mathbb{E}_{P_{S}}\left[Y\Big{|}f_{\Phi}=v\right]\frac{dP(f_{\Phi}^{-1}(v))}{dP_{S}(f_{\Phi}^{-1}(v))} (108)
=[𝔼P[Y|fΦ=v]𝔼PS[Y|fΦ=v]]dP(fΦ1(v))dPS(fΦ1(v))\displaystyle=\left[\mathbb{E}_{P}\left[Y\Big{|}f_{\Phi}=v\right]-\mathbb{E}_{P_{S}}\left[Y\Big{|}f_{\Phi}=v\right]\right]\frac{dP(f_{\Phi}^{-1}(v))}{dP_{S}(f_{\Phi}^{-1}(v))} (109)
=0for almost everyv[0,1].\displaystyle=0\;\;\;\;\text{for almost every}\;v\in[0,1]. (110)

Next, we prove

Φspan{p(x,y)pS(x,y):𝔼P[Y|fΦ]=𝔼PS[Y|fΦ]almost surely,p is continuous}.\displaystyle\mathcal{H}_{\Phi}\subset\text{span}\left\{\frac{p(x,y)}{p_{S}(x,y)}:\mathbb{E}_{P}[Y|f_{\Phi}]=\mathbb{E}_{P_{S}}[Y|f_{\Phi}]\;\;\text{almost surely},\;\text{$p$ is continuous}\right\}. (111)

By Lemma F.11, any grouping function hΦh\in\mathcal{H}_{\Phi} can be written as h(x,y)=γp(x,y)/pS(x,y)+ρh(x,y)=\gamma p(x,y)/p_{S}(x,y)+\rho for some continuous density function pp and γ0\gamma\neq 0. Thus, it suffices to prove that CovPS[γp(X,Y)/pS(X,Y)+ρ,Y|fΦ=v]=0\text{Cov}_{P_{S}}\left[\gamma p(X,Y)/p_{S}(X,Y)+\rho,Y|f_{\Phi}=v\right]=0 implies 𝔼P[Y|fΦ=v]=𝔼PS[Y|fΦ=v]\mathbb{E}_{P}[Y|f_{\Phi}=v]=\mathbb{E}_{P_{S}}[Y|f_{\Phi}=v].

CovPS[γp(X,Y)pS(X,Y)+ρ,Y|fΦ=v]\displaystyle\;\;\;\;\text{Cov}_{P_{S}}\left[\gamma\frac{p(X,Y)}{p_{S}(X,Y)}+\rho,Y\Big{|}f_{\Phi}=v\right] (112)
=γ𝔼PS[p(X,Y)pS(X,Y)Y|fΦ=v]+ρEPS[Y|fΦ=v]\displaystyle=\gamma\mathbb{E}_{P_{S}}\left[\frac{p(X,Y)}{p_{S}(X,Y)}Y\Big{|}f_{\Phi}=v\right]+\rho E_{P_{S}}\left[Y\Big{|}f_{\Phi}=v\right] (113)
γEPS[p(X,Y)pS(X,Y)|fΦ=v]EPS[Y|fΦ=v]ρEPS[Y|fΦ=v]\displaystyle\;\;\;\;-\gamma E_{P_{S}}\left[\frac{p(X,Y)}{p_{S}(X,Y)}\Big{|}f_{\Phi}=v\right]E_{P_{S}}\left[Y\Big{|}f_{\Phi}=v\right]-\rho E_{P_{S}}\left[Y\Big{|}f_{\Phi}=v\right] (114)
=γ𝔼PS[p(X,Y)pS(X,Y)Y|fΦ=v]γEPS[p(X,Y)pS(X,Y)|fΦ=v]EPS[Y|fΦ=v]\displaystyle=\gamma\mathbb{E}_{P_{S}}\left[\frac{p(X,Y)}{p_{S}(X,Y)}Y\Big{|}f_{\Phi}=v\right]-\gamma E_{P_{S}}\left[\frac{p(X,Y)}{p_{S}(X,Y)}\Big{|}f_{\Phi}=v\right]E_{P_{S}}\left[Y\Big{|}f_{\Phi}=v\right] (115)
=γCovPS[p(X,Y)pS(X,Y),Y|fΦ=v]\displaystyle=\gamma\text{Cov}_{P_{S}}\left[\frac{p(X,Y)}{p_{S}(X,Y)},Y\Big{|}f_{\Phi}=v\right] (116)
=γ[𝔼P[Y|fΦ=v]𝔼PS[Y|fΦ=v]]dP(fΦ1(v))dPS(fΦ1(v)).\displaystyle=\gamma\left[\mathbb{E}_{P}\left[Y\Big{|}f_{\Phi}=v\right]-\mathbb{E}_{P_{S}}\left[Y\Big{|}f_{\Phi}=v\right]\right]\frac{dP(f_{\Phi}^{-1}(v))}{dP_{S}(f_{\Phi}^{-1}(v))}. (117)

Equation 117 follows from Equation 109. Since γ0\gamma\neq 0, we have 𝔼P[Y|fΦ=v]=𝔼PS[Y|fΦ=v]\mathbb{E}_{P}[Y|f_{\Phi}=v]=\mathbb{E}_{P_{S}}[Y|f_{\Phi}=v] if CovPS[γp(X,Y)/pS(X,Y)+ρ,Y|fΦ=v]=0\text{Cov}_{P_{S}}\left[\gamma p(X,Y)/p_{S}(X,Y)+\rho,Y|f_{\Phi}=v\right]=0. ∎
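The forward direction of Theorem F.12 can be checked on a small discrete model: a reweighting that only moves mass between points sharing the same level set and label preserves E[Y | f_Phi] and yields a density ratio with zero conditional covariance with Y. All numbers below are illustrative assumptions:

```python
# Toy check of Theorem F.12: a density ratio p_P / p_S whose distribution
# preserves E[Y | f_Phi] has Cov[h, Y | f_Phi = v] = 0 on every level set.
level = {0: "a", 1: "a", 2: "b", 3: "b"}   # f_Phi's level sets over x
p_S = {(0,0):0.10, (0,1):0.15, (1,0):0.10, (1,1):0.15,
       (2,0):0.20, (2,1):0.05, (3,0):0.20, (3,1):0.05}
# P moves mass only between points sharing (level, y), so E_P[Y|f] = E_PS[Y|f]:
p_P = dict(p_S)
p_P[(0,0)], p_P[(1,0)] = 0.15, 0.05
p_P[(2,1)], p_P[(3,1)] = 0.08, 0.02

h = {z: p_P[z] / p_S[z] for z in p_S}      # the grouping function (density ratio)

def cond_cov(v):
    # Cov_PS[h(X, Y), Y | f_Phi = v]
    pts = [z for z in p_S if level[z[0]] == v]
    m = sum(p_S[z] for z in pts)
    Eh  = sum(p_S[z] * h[z] for z in pts) / m
    Ey  = sum(p_S[z] * z[1] for z in pts) / m
    Ehy = sum(p_S[z] * h[z] * z[1] for z in pts) / m
    return Ehy - Eh * Ey

assert abs(cond_cov("a")) < 1e-9 and abs(cond_cov("b")) < 1e-9
```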

Corollary F.13 (Restatement of Theorem 4.2).

Consider an absolutely continuous probability measure PS(X,Y)P_{S}(X,Y) and a calibrated predictor ff.

f\displaystyle\mathcal{H}_{f} ={hC(𝒳×𝒴):CovPS[h(X,Y),Y|f=v]=0for almost everyv[0,1]}\displaystyle=\left\{h\in C(\mathcal{X}\times\mathcal{Y}):\text{Cov}_{P_{S}}\left[h(X,Y),Y|f=v\right]=0\;\;\text{for almost every}\;v\in[0,1]\right\} (118)
=span{p(x,y)pS(x,y):𝔼P[Y|f]=𝔼PS[Y|f]almost surely,p is continuous}.\displaystyle=\text{span}\left\{\frac{p(x,y)}{p_{S}(x,y)}:\mathbb{E}_{P}[Y|f]=\mathbb{E}_{P_{S}}[Y|f]\;\;\text{almost surely},\;\text{$p$ is continuous}\right\}. (119)
Remark F.14.

𝔼P[Y|f]=𝔼PS[Y|f]\mathbb{E}_{P}[Y|f]=\mathbb{E}_{P_{S}}[Y|f] almost surely implies 𝔼P[Y|f]=f\mathbb{E}_{P}[Y|f]=f, which is equivalent to RP(f)=infg:[0,1][0,1]RP(gf)R_{P}(f)=\inf_{g:[0,1]\rightarrow[0,1]}R_{P}(g\circ f), since we adopt squared error.

Proof.

Since ff is calibrated, we have 𝔼PS[Y|f]=f\mathbb{E}_{P_{S}}[Y|f]=f. Take Φ(x)=f(x)\Phi(x)=f(x).

fΦ(x)=𝔼PS[Y|Φ(x)]=𝔼PS[Y|f(x)]=f(x).\displaystyle f_{\Phi}(x)=\mathbb{E}_{P_{S}}[Y|\Phi(x)]=\mathbb{E}_{P_{S}}[Y|f(x)]=f(x). (120)

Apply Theorem F.12 and the proof is complete. ∎

Theorem F.15 (Restatement of Theorem 4.3 (first part)).

Consider an absolutely continuous probability measure PS(X,Y)P_{S}(X,Y) and a predictor fΦ(x)=𝔼PS[Y|Φ(x)]f_{\Phi}(x)=\mathbb{E}_{P_{S}}[Y|\Phi(x)] where Φ:ddΦ,dΦZ+\Phi:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d_{\Phi}},{d_{\Phi}}\in Z^{+} is a measurable function. Φ\mathcal{H}_{\Phi} can be decomposed as Φ=1,Φ+2,Φ\mathcal{H}_{\Phi}=\mathcal{H}_{1,\Phi}+\mathcal{H}_{2,\Phi}.

1,Φ\displaystyle\mathcal{H}_{1,\Phi} :={hC(𝒳×𝒴):CovPS[h(Φ,Y),Y|fΦ=v]=0for almost everyv[0,1]}\displaystyle:=\left\{h\in C(\mathcal{X}\times\mathcal{Y}):\text{Cov}_{P_{S}}\left[h(\Phi,Y),Y|f_{\Phi}=v\right]=0\;\;\text{for almost every}\;v\in[0,1]\right\} (121)
=span{p(Φ,y)pS(Φ,y):𝔼P[Y|fΦ]=𝔼PS[Y|fΦ]almost surely,p is continuous}.\displaystyle=\text{span}\left\{\frac{p(\Phi,y)}{p_{S}(\Phi,y)}:\mathbb{E}_{P}[Y|f_{\Phi}]=\mathbb{E}_{P_{S}}[Y|f_{\Phi}]\;\;\text{almost surely},\;\text{$p$ is continuous}\right\}. (122)
2,Φ\displaystyle\mathcal{H}_{2,\Phi} :={hC(𝒳×𝒴):𝔼[h(X,Y)|Φ,Y]ChΦ,Y}\displaystyle:=\left\{h\in C(\mathcal{X}\times\mathcal{Y}):\mathbb{E}[h(X,Y)|\Phi,Y]\equiv C_{h}\;\;\forall\Phi,Y\right\} (123)
=span{p(x|Φ,y)pS(x|Φ,y):p is continuous}.\displaystyle=\text{span}\left\{\frac{p(x|\Phi,y)}{p_{S}(x|\Phi,y)}:\;\text{$p$ is continuous}\right\}. (124)
Remark F.16.

1,Φ\mathcal{H}_{1,\Phi} contains functions defined on supp(Φ)×𝒴\text{supp}(\Phi)\times\mathcal{Y} which can be rewritten as functions on 𝒳×𝒴\mathcal{X}\times\mathcal{Y} by variable substitution. Thus, 1,Φ\mathcal{H}_{1,\Phi} is still a set of grouping functions. For 2,Φ\mathcal{H}_{2,\Phi}, ChC_{h} is a constant depending on hh.

Proof.

First we prove Φ1,Φ+2,Φ\mathcal{H}_{\Phi}\supset\mathcal{H}_{1,\Phi}+\mathcal{H}_{2,\Phi}.

For any h11,Φh_{1}\in\mathcal{H}_{1,\Phi} and h22,Φh_{2}\in\mathcal{H}_{2,\Phi},

CovPS[h1(Φ,Y)+h2(X,Y),Y|fΦ=v]\displaystyle\;\;\;\;\text{Cov}_{P_{S}}\left[h_{1}(\Phi,Y)+h_{2}(X,Y),Y|f_{\Phi}=v\right] (125)
=CovPS[h1(Φ,Y),Y|fΦ=v]+CovPS[h2(X,Y),Y|fΦ=v]\displaystyle=\text{Cov}_{P_{S}}\left[h_{1}(\Phi,Y),Y|f_{\Phi}=v\right]+\text{Cov}_{P_{S}}\left[h_{2}(X,Y),Y|f_{\Phi}=v\right] (126)
=CovPS[h2(X,Y),Y|fΦ=v]\displaystyle=\text{Cov}_{P_{S}}\left[h_{2}(X,Y),Y|f_{\Phi}=v\right] (127)
=𝔼PS[h2(X,Y)Y|fΦ=v]𝔼PS[h2(X,Y)|fΦ=v]𝔼PS[Y|fΦ=v]\displaystyle=\mathbb{E}_{P_{S}}\left[h_{2}(X,Y)Y|f_{\Phi}=v\right]-\mathbb{E}_{P_{S}}\left[h_{2}(X,Y)|f_{\Phi}=v\right]\mathbb{E}_{P_{S}}\left[Y|f_{\Phi}=v\right] (128)
=𝔼PS[𝔼PS[h2(X,Y)Y|Φ,Y]|fΦ=v]𝔼PS[𝔼PS[h2(X,Y)|Φ,Y]|fΦ=v]𝔼PS[Y|fΦ=v]\displaystyle=\mathbb{E}_{P_{S}}\left[\mathbb{E}_{P_{S}}\left[h_{2}(X,Y)Y|\Phi,Y\right]\big{|}f_{\Phi}=v\right]-\mathbb{E}_{P_{S}}\left[\mathbb{E}_{P_{S}}\left[h_{2}(X,Y)|\Phi,Y\right]\big{|}f_{\Phi}=v\right]\mathbb{E}_{P_{S}}\left[Y|f_{\Phi}=v\right] (129)
=𝔼PS[𝔼PS[h2(X,Y)|Φ,Y]Y|fΦ=v]𝔼PS[𝔼PS[h2(X,Y)|Φ,Y]|fΦ=v]𝔼PS[Y|fΦ=v]\displaystyle=\mathbb{E}_{P_{S}}\left[\mathbb{E}_{P_{S}}\left[h_{2}(X,Y)|\Phi,Y\right]Y\big{|}f_{\Phi}=v\right]-\mathbb{E}_{P_{S}}\left[\mathbb{E}_{P_{S}}\left[h_{2}(X,Y)|\Phi,Y\right]\big{|}f_{\Phi}=v\right]\mathbb{E}_{P_{S}}\left[Y|f_{\Phi}=v\right] (130)
=Ch2EPS[Y|fΦ=v]Ch2EPS[Y|fΦ=v]\displaystyle=C_{h_{2}}E_{P_{S}}\left[Y|f_{\Phi}=v\right]-C_{h_{2}}E_{P_{S}}\left[Y|f_{\Phi}=v\right] (131)
=0.\displaystyle=0. (132)

Next we prove Φ1,Φ+2,Φ\mathcal{H}_{\Phi}\subset\mathcal{H}_{1,\Phi}+\mathcal{H}_{2,\Phi}.

For any hΦh\in\mathcal{H}_{\Phi}, let

h1(Φ(x),y)\displaystyle h_{1}(\Phi(x),y) =𝔼PS[h(X,Y)|Φ=Φ(x),Y=y].\displaystyle=\mathbb{E}_{P_{S}}[h(X,Y)|\Phi=\Phi(x),Y=y]. (133)
h2(x,y)\displaystyle h_{2}(x,y) =h(x,y)𝔼PS[h(X,Y)|Φ=Φ(x),Y=y].\displaystyle=h(x,y)-\mathbb{E}_{P_{S}}[h(X,Y)|\Phi=\Phi(x),Y=y]. (134)

Then we have

CovPS[h1(Φ,Y),Y|fΦ=v]\displaystyle\;\;\;\;\text{Cov}_{P_{S}}\left[h_{1}(\Phi,Y),Y|f_{\Phi}=v\right] (135)
=𝔼PS[h1(Φ,Y)Y|fΦ=v]𝔼PS[h1(Φ,Y)|fΦ=v]𝔼PS[Y|fΦ=v]\displaystyle=\mathbb{E}_{P_{S}}\left[h_{1}(\Phi,Y)Y|f_{\Phi}=v\right]-\mathbb{E}_{P_{S}}\left[h_{1}(\Phi,Y)|f_{\Phi}=v\right]\mathbb{E}_{P_{S}}\left[Y|f_{\Phi}=v\right] (136)
=𝔼PS[𝔼PS[h(X,Y)|Φ,Y]Y|fΦ=v]𝔼PS[𝔼PS[h(X,Y)|Φ,Y]|fΦ=v]𝔼PS[Y|fΦ=v]\displaystyle=\mathbb{E}_{P_{S}}\left[\mathbb{E}_{P_{S}}[h(X,Y)|\Phi,Y]Y\big{|}f_{\Phi}=v\right]-\mathbb{E}_{P_{S}}\left[\mathbb{E}_{P_{S}}[h(X,Y)|\Phi,Y]\big{|}f_{\Phi}=v\right]\mathbb{E}_{P_{S}}\left[Y|f_{\Phi}=v\right] (137)
=𝔼PS[h(X,Y)Y|fΦ=v]𝔼PS[h(X,Y)|fΦ=v]𝔼PS[Y|fΦ=v]\displaystyle=\mathbb{E}_{P_{S}}\left[h(X,Y)Y|f_{\Phi}=v\right]-\mathbb{E}_{P_{S}}\left[h(X,Y)|f_{\Phi}=v\right]\mathbb{E}_{P_{S}}\left[Y|f_{\Phi}=v\right] (138)
=CovPS[h(X,Y),Y|fΦ=v]\displaystyle=\text{Cov}_{P_{S}}\left[h(X,Y),Y|f_{\Phi}=v\right] (139)
=0for almost everyv[0,1].\displaystyle=0\;\;\;\;\text{for almost every}\;v\in[0,1]. (140)

Thus, h1(Φ(x),y)1,Φh_{1}(\Phi(x),y)\in\mathcal{H}_{1,\Phi}.

𝔼PS[h2(X,Y)|Φ,Y]\displaystyle\mathbb{E}_{P_{S}}[h_{2}(X,Y)|\Phi,Y] =𝔼PS[h(X,Y)𝔼PS[h(X,Y)|Φ,Y]|Φ,Y]\displaystyle=\mathbb{E}_{P_{S}}\left[h(X,Y)-\mathbb{E}_{P_{S}}[h(X,Y)|\Phi,Y]\big{|}\Phi,Y\right] (141)
=𝔼PS[h(X,Y)|Φ,Y]𝔼PS[h(X,Y)|Φ,Y]\displaystyle=\mathbb{E}_{P_{S}}[h(X,Y)|\Phi,Y]-\mathbb{E}_{P_{S}}[h(X,Y)|\Phi,Y] (142)
=0.\displaystyle=0. (143)

Thus, h2(x,y)2,Φh_{2}(x,y)\in\mathcal{H}_{2,\Phi}. Following an argument similar to the proof of Theorem F.12, we have

1,Φ=span{p(Φ,y)pS(Φ,y):𝔼P[Y|fΦ]=𝔼PS[Y|fΦ]almost surely,p is continuous}.\displaystyle\mathcal{H}_{1,\Phi}=\text{span}\left\{\frac{p(\Phi,y)}{p_{S}(\Phi,y)}:\mathbb{E}_{P}[Y|f_{\Phi}]=\mathbb{E}_{P_{S}}[Y|f_{\Phi}]\;\;\text{almost surely},\;\text{$p$ is continuous}\right\}. (144)

Next, we prove 2,Φspan{p(x|Φ,y)/pS(x|Φ,y):p is continuous}\mathcal{H}_{2,\Phi}\supset\text{span}\left\{p(x|\Phi,y)/p_{S}(x|\Phi,y):\;p\text{ is continuous}\right\}.

This is equivalent to saying that p(x|Φ,y)/pS(x|Φ,y)2,Φp(x|\Phi,y)/p_{S}(x|\Phi,y)\in\mathcal{H}_{2,\Phi} for any continuous density function pp.

𝔼PS[p(X|Φ,Y)pS(X|Φ,Y)|Φ,Y]\displaystyle\mathbb{E}_{P_{S}}\left[\frac{p(X|\Phi,Y)}{p_{S}(X|\Phi,Y)}\Big{|}\Phi,Y\right] =p(x|Φ,Y)pS(x|Φ,Y)𝑑PS(x|Φ,Y)\displaystyle=\int\frac{p(x|\Phi,Y)}{p_{S}(x|\Phi,Y)}dP_{S}(x|\Phi,Y) (145)
=𝑑P(x|Φ,Y)\displaystyle=\int dP(x|\Phi,Y) (146)
=1.\displaystyle=1. (147)

Thus, we have p(x|Φ,y)/pS(x|Φ,y)2,Φp(x|\Phi,y)/p_{S}(x|\Phi,y)\in\mathcal{H}_{2,\Phi}.

Next, we prove 2,Φspan{p(x|Φ,y)/pS(x|Φ,y):p is continuous}\mathcal{H}_{2,\Phi}\subset\text{span}\left\{p(x|\Phi,y)/p_{S}(x|\Phi,y):\;p\text{ is continuous}\right\}.

By Lemma F.11, any grouping function h22,Φh_{2}\in\mathcal{H}_{2,\Phi} can be written as h2(x,y)=γp(x,y)/pS(x,y)+ρh_{2}(x,y)=\gamma p(x,y)/p_{S}(x,y)+\rho for some continuous density function pp and γ0\gamma\neq 0. Thus, we just need to prove the statement that 𝔼PS[γp(X,Y)/pS(X,Y)+ρ|Φ,Y]Ch2\mathbb{E}_{P_{S}}[\gamma p(X,Y)/p_{S}(X,Y)+\rho|\Phi,Y]\equiv C_{h_{2}} implies p(X,Y)/pS(X,Y)p(X|Φ,Y)/pS(X|Φ,Y)p(X,Y)/p_{S}(X,Y)\equiv p(X|\Phi,Y)/p_{S}(X|\Phi,Y).

𝔼PS[γp(X,Y)pS(X,Y)+ρ|Φ,Y]\displaystyle\mathbb{E}_{P_{S}}\left[\gamma\frac{p(X,Y)}{p_{S}(X,Y)}+\rho\Big{|}\Phi,Y\right] =𝔼PS[γp(X|Φ,Y)pS(X|Φ,Y)p(Φ,Y)pS(Φ,Y)+ρ|Φ,Y]\displaystyle=\mathbb{E}_{P_{S}}\left[\gamma\frac{p(X|\Phi,Y)}{p_{S}(X|\Phi,Y)}\frac{p(\Phi,Y)}{p_{S}(\Phi,Y)}+\rho\Big{|}\Phi,Y\right] (148)
=γp(Φ,Y)pS(Φ,Y)𝔼PS[p(X|Φ,Y)pS(X|Φ,Y)|Φ,Y]+ρ\displaystyle=\gamma\frac{p(\Phi,Y)}{p_{S}(\Phi,Y)}\mathbb{E}_{P_{S}}\left[\frac{p(X|\Phi,Y)}{p_{S}(X|\Phi,Y)}\Big{|}\Phi,Y\right]+\rho (149)
=γp(Φ,Y)pS(Φ,Y)𝔼P[1|Φ,Y]+ρ\displaystyle=\gamma\frac{p(\Phi,Y)}{p_{S}(\Phi,Y)}\mathbb{E}_{P}\left[1\Big{|}\Phi,Y\right]+\rho (150)
=γp(Φ,Y)pS(Φ,Y)+ρ.\displaystyle=\gamma\frac{p(\Phi,Y)}{p_{S}(\Phi,Y)}+\rho. (151)

Thus, we have p(Φ,Y)/pS(Φ,Y)(Ch2ρ)/γp(\Phi,Y)/p_{S}(\Phi,Y)\equiv(C_{h_{2}}-\rho)/\gamma which is a constant. Since 𝔼PS[p(Φ,Y)/pS(Φ,Y)]=1\mathbb{E}_{P_{S}}[p(\Phi,Y)/p_{S}(\Phi,Y)]=1, we have p(Φ,Y)/pS(Φ,Y)1p(\Phi,Y)/p_{S}(\Phi,Y)\equiv 1.

p(X,Y)pS(X,Y)\displaystyle\frac{p(X,Y)}{p_{S}(X,Y)} =p(X|Φ,Y)pS(X|Φ,Y)p(Φ,Y)pS(Φ,Y)\displaystyle=\frac{p(X|\Phi,Y)}{p_{S}(X|\Phi,Y)}\frac{p(\Phi,Y)}{p_{S}(\Phi,Y)} (152)
=p(X|Φ,Y)pS(X|Φ,Y).\displaystyle=\frac{p(X|\Phi,Y)}{p_{S}(X|\Phi,Y)}. (153)
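The decomposition in Theorem F.15 can be verified numerically: centering any grouping function h at its conditional mean given (Phi, Y) splits it into h1 = E[h | Phi, Y] and h2 = h - h1 with E[h2 | Phi, Y] = 0, as in Equations 133–134 and 141–143. The pmf and h below are illustrative assumptions:

```python
# Toy check of the decomposition h = h1 + h2 with E[h2 | Phi, Y] = 0.
phi = {0: "a", 1: "a", 2: "b", 3: "b"}     # Phi collapses x into two cells
p_S = {(0,0):0.10, (0,1):0.15, (1,0):0.10, (1,1):0.15,
       (2,0):0.20, (2,1):0.05, (3,0):0.20, (3,1):0.05}
h = {(x, y): x + 2*y + 0.1*x*y for (x, y) in p_S}  # an arbitrary grouping function

def cond_mean(g, cell):
    # E_PS[g(X, Y) | (Phi, Y) = cell]
    pts = [z for z in p_S if (phi[z[0]], z[1]) == cell]
    m = sum(p_S[z] for z in pts)
    return sum(p_S[z] * g[z] for z in pts) / m

h1 = {z: cond_mean(h, (phi[z[0]], z[1])) for z in p_S}  # E[h | Phi, Y]
h2 = {z: h[z] - h1[z] for z in p_S}                     # centered remainder

cells = {(phi[x], y) for (x, y) in p_S}
assert all(abs(h[z] - (h1[z] + h2[z])) < 1e-9 for z in p_S)
assert all(abs(cond_mean(h2, c)) < 1e-9 for c in cells)
```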

Lemma F.17 (Theorem 3.2 from Globus-Harris et al. [2023]).

If ff is calibrated and there exists an h(x)h(x) such that

𝔼[(f(X)Y)2(h(X)Y)2|f(X)=v]α,\displaystyle\mathbb{E}[(f(X)-Y)^{2}-(h(X)-Y)^{2}|f(X)=v]\geq\alpha, (154)

then:

𝔼[h(X)(Yv)|f(X)=v]α2.\displaystyle\mathbb{E}[h(X)(Y-v)|f(X)=v]\geq\frac{\alpha}{2}. (155)
Proposition F.18 (Restatement of Theorem 4.3 (second part)).

If a predictor ff is multicalibrated with 1,Φ\mathcal{H}_{1,\Phi}, then RPS(f)RPS(fΦ)R_{P_{S}}(f)\leq R_{P_{S}}(f_{\Phi}).

Proof.

We prove by contradiction. If RPS(f)>RPS(fΦ)R_{P_{S}}(f)>R_{P_{S}}(f_{\Phi}), then

𝔼[(f(X)Y)2(fΦ(X)Y)2|f(X)=v]𝑑PS(f1(v))>0.\displaystyle\int\mathbb{E}\left[(f(X)-Y)^{2}-(f_{\Phi}(X)-Y)^{2}|f(X)=v\right]dP_{S}(f^{-1}(v))>0. (156)

Let αv=𝔼[(f(X)Y)2(fΦ(X)Y)2|f(X)=v]\alpha_{v}=\mathbb{E}\left[(f(X)-Y)^{2}-(f_{\Phi}(X)-Y)^{2}|f(X)=v\right].

Since ff is multicalibrated with 1,Φ\mathcal{H}_{1,\Phi}, ff is calibrated. It follows from Lemma F.17:

𝔼[fΦ(X)(Yv)|f(X)=v]αv2.\displaystyle\mathbb{E}[f_{\Phi}(X)(Y-v)|f(X)=v]\geq\frac{\alpha_{v}}{2}. (157)

Then,

K1(f,fΦ,PS)\displaystyle K_{1}(f,f_{\Phi},P_{S}) =|E[fΦ(X)(Yv)|f(X)=v]|dPS(f1(v))\displaystyle=\int\Big{|}E[f_{\Phi}(X)(Y-v)|f(X)=v]\Big{|}dP_{S}(f^{-1}(v)) (158)
αv2𝑑PS(f1(v))\displaystyle\geq\int\frac{\alpha_{v}}{2}dP_{S}(f^{-1}(v)) (159)
>0.\displaystyle>0. (160)

From Lemma F.5, we have K2(f,fΦ,PS)K1(f,fΦ,PS)2>0K_{2}(f,f_{\Phi},P_{S})\geq K_{1}(f,f_{\Phi},P_{S})^{2}>0. Since fΦ1,Φf_{\Phi}\in\mathcal{H}_{1,\Phi}, this contradicts the assumption that ff is multicalibrated with 1,Φ\mathcal{H}_{1,\Phi}. ∎

Proposition F.19 (Restatement of Theorem 4.3 (third part)).

fΦf_{\Phi} is an invariant predictor elicited by Φ\Phi across a set of environments \mathscr{E} where Pe(Φ,Y)=PS(Φ,Y)P_{e}(\Phi,Y)=P_{S}(\Phi,Y) for any ee\in\mathscr{E}. If a predictor ff is multicalibrated with 2,Φ\mathcal{H}_{2,\Phi}, then ff is also an invariant predictor across \mathscr{E} elicited by some representation.

Proof.

Since Pe(Φ,Y)=PS(Φ,Y)P_{e}(\Phi,Y)=P_{S}(\Phi,Y), we have 𝔼Pe[Y|Φ]=𝔼PS[Y|Φ]=fΦ\mathbb{E}_{P_{e}}[Y|\Phi]=\mathbb{E}_{P_{S}}[Y|\Phi]=f_{\Phi} for every ee\in\mathscr{E}.

Thus,

RPe(fΦ)=infg𝒢RPe(gΦ).\displaystyle R_{P_{e}}(f_{\Phi})=\inf_{g\in\mathcal{G}}{R_{P_{e}}(g\circ\Phi)}. (161)

This implies fΦf_{\Phi} is an invariant predictor elicited by Φ\Phi across \mathscr{E}.

For any ee\in\mathscr{E}, we have:

pe(x,y)pS(x,y)=pe(Φ,y)pS(Φ,y)pe(x|Φ,y)pS(x|Φ,y)=pe(x|Φ,y)pS(x|Φ,y)2,Φ.\displaystyle\frac{p_{e}(x,y)}{p_{S}(x,y)}=\frac{p_{e}(\Phi,y)}{p_{S}(\Phi,y)}\frac{p_{e}(x|\Phi,y)}{p_{S}(x|\Phi,y)}=\frac{p_{e}(x|\Phi,y)}{p_{S}(x|\Phi,y)}\in\mathcal{H}_{2,\Phi}. (162)

Since ff is multicalibrated with 2,Φ\mathcal{H}_{2,\Phi}, it follows from Theorem F.8:

RPe(f)=infg:[0,1][0,1]RPe(gf).\displaystyle R_{P_{e}}(f)=\inf_{g:[0,1]\rightarrow[0,1]}R_{P_{e}}(g\circ f). (163)

This implies that ff is an invariant predictor across \mathscr{E} elicited by ff. ∎
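The premise of Proposition F.19 can be illustrated numerically: if environments share P(Phi, Y) and differ only in P(Psi | Phi, Y), then E[Y | Phi] is identical across environments, i.e., f_Phi is invariant. The distributions below are illustrative assumptions:

```python
# Toy check: two environments share P(Phi, Y) but disagree on P(Psi | Phi, Y);
# E[Y | Phi] nevertheless coincides across environments.
p_phi_y = {("a", 0): 0.2, ("a", 1): 0.3, ("b", 0): 0.4, ("b", 1): 0.1}  # shared P(Phi, Y)
q = {  # environment-specific P(Psi = 1 | Phi, Y)
    "e1": {k: 0.3 for k in p_phi_y},
    "e2": {("a", 0): 0.9, ("a", 1): 0.1, ("b", 0): 0.5, ("b", 1): 0.7},
}

def E_y_given_phi(env, phi):
    # build the joint over (Phi, Psi, Y), then compute E[Y | Phi = phi]
    joint = {(ph, ps, y): p_phi_y[(ph, y)] * (q[env][(ph, y)] if ps else 1 - q[env][(ph, y)])
             for (ph, y) in p_phi_y for ps in (0, 1)}
    pts = [k for k in joint if k[0] == phi]
    m = sum(joint[k] for k in pts)
    return sum(joint[k] * k[2] for k in pts) / m

for phi in ("a", "b"):
    assert abs(E_y_given_phi("e1", phi) - E_y_given_phi("e2", phi)) < 1e-9
```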

Proposition F.20 (Restatement of Proposition 4.5).

Consider XdX\in\mathbb{R}^{d} which could be sliced as X=(Φ,Ψ)TX=(\Phi,\Psi)^{T} and Φ=(Λ,Ω)T\Phi=(\Lambda,\Omega)^{T}. Define 1,Φ:={h(Φ(x))C(𝒳×𝒴)}\mathcal{H}_{1,\Phi}^{\prime}:=\{h(\Phi(x))\in C(\mathcal{X}\times\mathcal{Y})\}. 1,X\mathcal{H}_{1,X}^{\prime} and 1,Λ\mathcal{H}_{1,\Lambda}^{\prime} are similarly defined. We have:

1. 1,X1,Φ1,Λ1,={C}.\mathcal{H}_{1,X}^{\prime}\supset\mathcal{H}_{1,\Phi}^{\prime}\supset\mathcal{H}_{1,\Lambda}^{\prime}\supset\mathcal{H}_{1,\emptyset}^{\prime}=\{C\}.

2. {C}=2,X2,Φ2,Λ2,.\{C\}=\mathcal{H}_{2,X}\subset\mathcal{H}_{2,\Phi}\subset\mathcal{H}_{2,\Lambda}\subset\mathcal{H}_{2,\emptyset}.

CC is a constant value function.

Remark F.21.

The proposition shows that 1\mathcal{H}_{1}^{\prime}, as a subspace of 1\mathcal{H}_{1}, evolves monotonically and in the opposite direction to 2\mathcal{H}_{2}. If we view the representation Φ\Phi as an information filter, incorporating more information from covariates facilitates multicalibration w.r.t. 1\mathcal{H}_{1} (and accuracy) but hampers multicalibration w.r.t. 2\mathcal{H}_{2} (and invariance). With 1\mathcal{H}_{1}^{\prime} and 2\mathcal{H}_{2} combined, a multicalibrated predictor searches for an appropriate level of filtering to balance the tradeoff between accuracy and invariance.

Proof.

1. For any h(Λ)1,Λh(\Lambda)\in\mathcal{H}_{1,\Lambda}^{\prime}, since Φ=(Λ,Ω)\Phi=(\Lambda,\Omega), h(Λ)h(\Lambda) is also a function of Φ\Phi. Thus, we have h(Λ)1,Φh(\Lambda)\in\mathcal{H}_{1,\Phi}^{\prime}. It follows that 1,Φ1,Λ\mathcal{H}_{1,\Phi}^{\prime}\supset\mathcal{H}_{1,\Lambda}^{\prime}. Similarly we have 1,X1,Φ\mathcal{H}_{1,X}^{\prime}\supset\mathcal{H}_{1,\Phi}^{\prime} and 1,Λ1,\mathcal{H}_{1,\Lambda}^{\prime}\supset\mathcal{H}_{1,\emptyset}^{\prime}.

2. For any h(Λ,Ω,Ψ,Y)2,Φh(\Lambda,\Omega,\Psi,Y)\in\mathcal{H}_{2,\Phi} such that 𝔼[h(Φ,Ψ,Y)|Φ,Y]=Ch\mathbb{E}[h(\Phi,\Psi,Y)|\Phi,Y]=C_{h} for any values of Φ,Y\Phi,Y, we have

𝔼[h(Λ,Ω,Ψ,Y)|Λ,Y]\displaystyle\mathbb{E}[h(\Lambda,\Omega,\Psi,Y)|\Lambda,Y] =𝔼[𝔼[h(Λ,Ω,Ψ,Y)|Λ,Ω,Y]|Λ,Y]\displaystyle=\mathbb{E}\left[\mathbb{E}[h(\Lambda,\Omega,\Psi,Y)|\Lambda,\Omega,Y]\big{|}\Lambda,Y\right] (164)
=𝔼[𝔼[h(Φ,Ψ,Y)|Φ,Y]|Λ,Y]\displaystyle=\mathbb{E}\left[\mathbb{E}[h(\Phi,\Psi,Y)|\Phi,Y]\big{|}\Lambda,Y\right] (165)
=𝔼[Ch|Λ,Y]\displaystyle=\mathbb{E}[C_{h}|\Lambda,Y] (166)
=Chfor any values of Λ,Y.\displaystyle=C_{h}\;\;\;\;\text{for any values of }\Lambda,Y. (167)

Thus, h(Λ,Ω,Ψ,Y)2,Λh(\Lambda,\Omega,\Psi,Y)\in\mathcal{H}_{2,\Lambda}. It follows that 2,Φ2,Λ\mathcal{H}_{2,\Phi}\subset\mathcal{H}_{2,\Lambda}. Similarly, we have 2,X2,Φ\mathcal{H}_{2,X}\subset\mathcal{H}_{2,\Phi} and 2,Λ2,\mathcal{H}_{2,\Lambda}\subset\mathcal{H}_{2,\emptyset}. Particularly for h(x,y)2,Xh(x,y)\in\mathcal{H}_{2,X}, h(X,Y)=𝔼[h(X,Y)|X,Y]h(X,Y)=\mathbb{E}[h(X,Y)|X,Y] must be a constant for all values of X,YX,Y. ∎
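The containment H_{2,Phi} ⊂ H_{2,Lambda} rests on the tower property used in Equations 164–167: conditional-mean invariance is preserved under coarsening the representation. A minimal numerical sketch, where x = (lam, om, psi), Phi = (lam, om), Lambda = lam, and a member of H_{2,Phi} is constructed by centering an arbitrary function (all weights and functions are illustrative assumptions):

```python
# Toy check: if E[h | Phi, Y] = C with Phi = (Lambda, Omega),
# then E[h | Lambda, Y] = C as well.
import itertools
import random

random.seed(0)
pts = list(itertools.product((0, 1), repeat=4))   # points (lam, om, psi, y)
w = {z: random.random() for z in pts}
Z = sum(w.values())
p = {z: w[z] / Z for z in pts}                    # a pmf on (X, Y)

def cond_means(g, keyfun):
    # E[g | keyfun(z)] under p, for every value of the key
    out = {}
    for k in {keyfun(z) for z in pts}:
        sel = [z for z in pts if keyfun(z) == k]
        m = sum(p[z] for z in sel)
        out[k] = sum(p[z] * g[z] for z in sel) / m
    return out

g = {(l, o, s, y): l + 2*o + 3*s + y + 0.5*s*y for (l, o, s, y) in pts}
mu = cond_means(g, lambda z: (z[0], z[1], z[3]))          # E[g | Phi, Y]
h = {z: g[z] - mu[(z[0], z[1], z[3])] + 1.0 for z in pts}  # E[h | Phi, Y] = 1

fine = cond_means(h, lambda z: (z[0], z[1], z[3]))    # condition on (Phi, Y)
coarse = cond_means(h, lambda z: (z[0], z[3]))        # condition on (Lambda, Y)
assert all(abs(m - 1.0) < 1e-9 for m in fine.values())
assert all(abs(m - 1.0) < 1e-9 for m in coarse.values())
```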

Lemma F.22 (Globus-Harris et al. [2023]).

Let 𝒳\mathcal{H}\subset\mathbb{R}^{\mathcal{X}} be a grouping function class closed under affine transformations, i.e., γh+ρ\gamma h+\rho\in\mathcal{H} for any hh\in\mathcal{H} and γ,ρ\gamma,\rho\in\mathbb{R}. If \mathcal{H} satisfies the (0,0)(0,0)-weak learning condition in Assumption F.1, a predictor ff is multicalibrated w.r.t. \mathcal{H} if and only if f(x)=𝔼[Y|x]f(x)=\mathbb{E}[Y|x] almost surely.

Proposition F.23.

For a measurable function Φ:ddΦ,dΦZ+\Phi:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d_{\Phi}},{d_{\Phi}}\in Z^{+}, a predictor f:supp(Φ)[0,1]f:\text{supp}(\Phi)\rightarrow[0,1] is multicalibrated w.r.t. 1,Φ\mathcal{H}_{1,\Phi} if and only if ff is multicalibrated w.r.t. 1,Φ\mathcal{H}_{1,\Phi}^{\prime}.

Proof.

Since 1,Φ1,Φ\mathcal{H}_{1,\Phi}^{\prime}\subset\mathcal{H}_{1,\Phi}, ff’s multicalibration w.r.t. 1,Φ\mathcal{H}_{1,\Phi} implies ff’s multicalibration w.r.t. 1,Φ\mathcal{H}_{1,\Phi}^{\prime}.

On the other hand, 1,Φ\mathcal{H}_{1,\Phi}^{\prime} satisfies the (0,0)(0,0)-weak learning condition with the pushforward measure on Φ\Phi, because 𝔼[Y|Φ]1,Φ\mathbb{E}[Y|\Phi]\in\mathcal{H}_{1,\Phi}^{\prime}. It follows from Lemma F.22 that if ff is multicalibrated w.r.t. 1,Φ\mathcal{H}_{1,\Phi}^{\prime}, then f(Φ)=𝔼[Y|Φ]f(\Phi)=\mathbb{E}[Y|\Phi] almost surely. By the definition of 1,Φ\mathcal{H}_{1,\Phi}, f(Φ)f(\Phi) is multicalibrated w.r.t. 1,Φ\mathcal{H}_{1,\Phi}. ∎

F.4 MC-PseudoLabel: An Algorithm for Extended Multicalibration

Lemma F.24.

Fix a model f:𝒳[0,1]f:\mathcal{X}\rightarrow[0,1]. Suppose for some vRange(f)v\in\text{Range}(f) there is an hh\in\mathcal{H} such that:

𝔼[h(x,y)(yv)|f(x)=v]>α\mathbb{E}[h(x,y)(y-v)|f(x)=v]>\alpha

Let h=v+ηh(x,y)h^{\prime}=v+\eta h(x,y) for η=α𝔼[h(x,y)2|f(x)=v]\eta=\frac{\alpha}{\mathbb{E}[h(x,y)^{2}|f(x)=v]}.

Then:

𝔼[(f(x)y)2(h(x,y)y)2|f(x)=v]>α2𝔼[h(x,y)2|f(x)=v].\mathbb{E}[(f(x)-y)^{2}-(h^{\prime}(x,y)-y)^{2}|f(x)=v]>\frac{\alpha^{2}}{\mathbb{E}[h(x,y)^{2}|f(x)=v]}.

Proof. Following [Globus-Harris et al., 2023], we have

𝔼[(f(x)y)2\displaystyle\mathbb{E}[(f(x)-y)^{2} (h(x,y)y)2|f(x)=v]\displaystyle-(h^{\prime}(x,y)-y)^{2}|f(x)=v]
=𝔼[(vy)2(v+ηh(x,y)y)2|f(x)=v]\displaystyle=\mathbb{E}[(v-y)^{2}-(v+\eta h(x,y)-y)^{2}|f(x)=v]
=𝔼[v22vy+y2(v+ηh(x,y))2+2y(v+ηh(x,y))y2|f(x)=v]\displaystyle=\mathbb{E}[v^{2}-2vy+y^{2}-(v+\eta h(x,y))^{2}+2y(v+\eta h(x,y))-y^{2}|f(x)=v]
=𝔼[2ηyh(x,y)2ηvh(x,y)η2h(x,y)2|f(x)=v]\displaystyle=\mathbb{E}[2\eta yh(x,y)-2\eta vh(x,y)-\eta^{2}h(x,y)^{2}|f(x)=v]
=𝔼[2ηh(x,y)(yv)η2h(x,y)2|f(x)=v]\displaystyle=\mathbb{E}[2\eta h(x,y)(y-v)-\eta^{2}h(x,y)^{2}|f(x)=v]
>2ηαη2𝔼[h(x,y)2|f(x)=v]\displaystyle>2\eta\alpha-\eta^{2}\mathbb{E}[h(x,y)^{2}|f(x)=v]
=α2𝔼[h(x,y)2|f(x)=v].\displaystyle=\frac{\alpha^{2}}{\mathbb{E}[h(x,y)^{2}|f(x)=v]}.
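Lemma F.24's algebra can be verified empirically on a single toy level set: with eta = alpha / E[h^2], patching the constant prediction v to v + eta * h improves squared error by exactly alpha^2 / E[h^2] when alpha = E[h (y - v)] is taken to be the empirical value. The sampling scheme and grouping function below are illustrative assumptions:

```python
# Toy check of the error-improvement identity in Lemma F.24.
import random

random.seed(1)
v = 0.5  # the level set f(x) = v

def draw():
    x = random.random()
    y = 0.25 + 0.5 * x + 0.2 * (random.random() - 0.5)  # y correlated with x
    return x, y

samples = [draw() for _ in range(500)]

def E(g):
    return sum(g(x, y) for x, y in samples) / len(samples)

h = lambda x, y: x - 0.3                    # a grouping function with signal
alpha = E(lambda x, y: h(x, y) * (y - v))   # empirical E[h (y - v)]
Eh2 = E(lambda x, y: h(x, y) ** 2)
eta = alpha / Eh2

# squared-error improvement from patching v to v + eta * h:
gain = E(lambda x, y: (v - y) ** 2 - (v + eta * h(x, y) - y) ** 2)
assert alpha > 0
assert abs(gain - alpha ** 2 / Eh2) < 1e-9
```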
Theorem F.25 (Restatement of Theorem 5.1).

In Algorithm 1, for α,B>0\alpha,B>0, if the following is satisfied:

errt1err~tαB,\displaystyle err_{t-1}-\tilde{err}_{t}\leq\frac{\alpha}{B}, (168)

the output ft1(x)f_{t-1}^{\prime}(x) is α\alpha-approximately 2\ell_{2} multicalibrated w.r.t. B={h:suph(x,y)2B}\mathcal{H}_{B}=\{h\in\mathcal{H}:\sup h(x,y)^{2}\leq B\}.

Proof.

We prove by contradiction. Assume that ft1f_{t-1} is not α\alpha-approximately multicalibrated with respect to B\mathcal{H}_{B}. Then there exists hBh\in\mathcal{H}_{B} such that:

v[1/m]P(ft1(x)=v)(𝔼[h(x,y)(yv)|ft1(x)=v])2>α.\sum_{v\in[1/m]}P(f_{t-1}(x)=v)\left(\mathbb{E}\left[h(x,y)(y-v)\Big{|}f_{t-1}(x)=v\right]\right)^{2}>\alpha.

For each v[1/m]v\in[1/m] define

αv:=P(ft1(x)=v)(𝔼[h(x,y)(yv)|ft1(x)=v])2.\alpha_{v}:=P(f_{t-1}(x)=v)\left(\mathbb{E}\left[h(x,y)(y-v)\Big{|}f_{t-1}(x)=v\right]\right)^{2}.

Then we have v[1/m]αv>α\sum_{v\in[1/m]}\alpha_{v}>\alpha.

According to Lemma F.24, for each v[1/m]v\in[1/m], there exists hvh_{v}\in\mathcal{H} such that:

𝔼[(ft1(x)y)2(hv(x,y)y)2|ft1(x)=v]\displaystyle\mathbb{E}\left[(f_{t-1}(x)-y)^{2}-(h_{v}(x,y)-y)^{2}|f_{t-1}(x)=v\right] (169)
>\displaystyle> αv𝔼[h(x)2|ft1(x)=v]P(ft1(x)=v)\displaystyle\;\frac{\alpha_{v}}{\mathbb{E}[h(x)^{2}|f_{t-1}(x)=v]\cdot P(f_{t-1}(x)=v)} (170)
\displaystyle\geq αvBP(ft1(x)=v).\displaystyle\;\frac{\alpha_{v}}{B\cdot P(f_{t-1}(x)=v)}. (171)

Then,

𝔼[(ft1(x)y)2(f~t(x)y)2]\displaystyle\;\;\;\;\mathbb{E}\left[(f_{t-1}(x)-y)^{2}-(\tilde{f}_{t}(x)-y)^{2}\right]
=v[1/m]P(ft1(x)=v)𝔼[(ft1(x)y)2(f~t(x)y)2|ft1(x)=v]\displaystyle=\sum_{v\in[1/m]}P(f_{t-1}(x)=v)\mathbb{E}\left[(f_{t-1}(x)-y)^{2}-(\tilde{f}_{t}(x)-y)^{2}|f_{t-1}(x)=v\right]
=v[1/m]P(ft1(x)=v)𝔼[(ft1(x)y)2(hv(x,y)y)2|ft1(x)=v]\displaystyle=\sum_{v\in[1/m]}P(f_{t-1}(x)=v)\mathbb{E}\left[(f_{t-1}(x)-y)^{2}-(h_{v}(x,y)-y)^{2}|f_{t-1}(x)=v\right]
>αB,\displaystyle>\frac{\alpha}{B},

which contradicts the condition in Equation 168. ∎

The following proposition is a direct corollary from Globus-Harris et al. [2023]’s Theorem 4.3.

Proposition F.26.

For any distribution DD supported on 𝒳×𝒴\mathcal{X}\times\mathcal{Y} and Φσ(X)\Phi\in\sigma(X), take the grouping function class 1,Φ:={h(Φ(x))C(𝒳×𝒴)}\mathcal{H}\subset\mathcal{H}^{\prime}_{1,\Phi}:=\{h(\Phi(x))\in C(\mathcal{X}\times\mathcal{Y})\} and the predictor class =𝒳\mathcal{F}=\mathbb{R}^{\mathcal{X}}. For any 0<α<1,B>00<\alpha<1,B>0 and an initial predictor f0:𝒳[0,1]f_{0}:\mathcal{X}\rightarrow[0,1] with |Range(f0)|2Bα|\text{Range}(f_{0})|\geq\frac{2B}{\alpha}, MC-Pseudolabel(D,,)\text{MC-Pseudolabel}(D,\mathcal{H},\mathcal{F}) halts after at most T2BαT\leq\frac{2B}{\alpha} steps and outputs a model fT1(x)f_{T-1}^{\prime}(x) that is α\alpha-approximately 2\ell_{2} multicalibrated w.r.t. DD and B={h:suph(x,y)2B}\mathcal{H}_{B}=\{h\in\mathcal{H}:\sup h(x,y)^{2}\leq B\}.

Theorem F.27 (Restatement of Theorem 5.2).

Consider XdX\in\mathbb{R}^{d} with X=(Φ,Ψ)TX=(\Phi,\Psi)^{T}. Assume that (Φ,Ψ,Y)(\Phi,\Psi,Y) follows a multivariate normal distribution 𝒩d+1(μ,Σ)\mathcal{N}_{d+1}(\mu,\Sigma) where the random variables are in general position such that Σ\Sigma is positive definite. We partition Σ\Sigma into blocks:

Σ=(ΣΦΦΣΦΨΣΦyΣΨΦΣΨΨΣΨyΣyΦΣyΨΣyy).\displaystyle\Sigma=\begin{pmatrix}\Sigma_{\Phi\Phi}&\Sigma_{\Phi\Psi}&\Sigma_{\Phi y}\\ \Sigma_{\Psi\Phi}&\Sigma_{\Psi\Psi}&\Sigma_{\Psi y}\\ \Sigma_{y\Phi}&\Sigma_{y\Psi}&\Sigma_{yy}\end{pmatrix}. (172)

For any distribution DD supported on 𝒳×𝒴\mathcal{X}\times\mathcal{Y}, take the grouping function class ={h2,Φ:h(x,y)=cxTx+cyy+cb,cxd,cy,cb}\mathcal{H}=\{h\in\mathcal{H}_{2,\Phi}:h(x,y)=c_{x}^{T}x+c_{y}y+c_{b},c_{x}\in\mathbb{R}^{d},c_{y},c_{b}\in\mathbb{R}\} and the predictor class =𝒳\mathcal{F}=\mathbb{R}^{\mathcal{X}}. For an initial predictor f(0)(x)=𝔼[Y|x]f^{(0)}(x)=\mathbb{E}[Y|x], run MC-Pseudolabel(D,,)\text{MC-Pseudolabel}(D,\mathcal{H},\mathcal{F}) without rounding, then f(t)(x)𝔼[Y|Φ(x)]f^{(t)}(x)\rightarrow\mathbb{E}[Y|\Phi(x)] as tt\rightarrow\infty with a convergence rate of 𝒪(M(Σ)t)\mathcal{O}(M(\Sigma)^{t}), where

M(Σ)\displaystyle M(\Sigma) =(ΣyyΣyΦΣΦΦ1ΣΦy)1(ΣyΨΣyΦΣΦΦ1ΣΦΨ)\displaystyle=(\Sigma_{yy}-\Sigma_{y\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi y})^{-1}(\Sigma_{y\Psi}-\Sigma_{y\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi\Psi}) (173)
(ΣΨΨΣΨΦΣΦΦ1ΣΦΨ)1(ΣΨyΣΨΦΣΦΦ1ΣΦy).\displaystyle\;\;\;\;(\Sigma_{\Psi\Psi}-\Sigma_{\Psi\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi\Psi})^{-1}(\Sigma_{\Psi y}-\Sigma_{\Psi\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi y}). (174)

We have 0M(Σ)<10\leq M(\Sigma)<1.
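As a numerical sanity check (separate from the proof), one can evaluate M(Σ)M(\Sigma) for a randomly drawn positive definite covariance and confirm the claim 0M(Σ)<10\leq M(\Sigma)<1. The sketch below is illustrative only; the helper name `convergence_rate` and the block dimensions are our own choices.

```python
import numpy as np

def convergence_rate(Sigma, d_phi, d_psi):
    """Evaluate M(Sigma) from the block partition of the covariance of
    (Phi, Psi, Y): Phi occupies the first coordinates, Psi the next, Y the last."""
    P = slice(0, d_phi)
    S = slice(d_phi, d_phi + d_psi)
    y = -1
    inv = np.linalg.inv(Sigma[P, P])
    # residual (partial) covariances after regressing out Phi
    r_yy = Sigma[y, y] - Sigma[y, P] @ inv @ Sigma[P, y]
    r_ys = Sigma[y, S] - Sigma[y, P] @ inv @ Sigma[P, S]
    r_ss = Sigma[S, S] - Sigma[S, P] @ inv @ Sigma[P, S]
    return float(r_ys @ np.linalg.inv(r_ss) @ r_ys / r_yy)

rng = np.random.default_rng(0)
d_phi, d_psi = 3, 2
n = d_phi + d_psi + 1
A = rng.standard_normal((n, n))
Sigma = A @ A.T + 1e-2 * np.eye(n)  # positive definite, generic position

M = convergence_rate(Sigma, d_phi, d_psi)
assert 0.0 <= M < 1.0
```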

Remark F.28.

f(t)f^{(t)} are always linear. Thus, convergence of the functions is equivalent to convergence of their coefficients.

Remark F.29.

𝔼[Y|Φ(x)]\mathbb{E}[Y|\Phi(x)] is multicalibrated with respect to \mathcal{H}. Furthermore, any calibrated predictor on Φ\Phi, denoted by g(Φ)g(\Phi), is multicalibrated with respect to \mathcal{H}. This is because, writing Ch:=𝔼[h(X,Y)|Φ,Y]C_{h}:=\mathbb{E}[h(X,Y)|\Phi,Y] for the constant guaranteed by Theorem F.15:

𝔼[h(X,Y)(Yg(Φ))|g(Φ)]\displaystyle\mathbb{E}[h(X,Y)(Y-g(\Phi))|g(\Phi)] =𝔼[𝔼[h(X,Y)(Yg(Φ))|Φ,Y]|g(Φ)]\displaystyle=\mathbb{E}\left[\mathbb{E}\left[h(X,Y)(Y-g(\Phi))|\Phi,Y\right]|g(\Phi)\right] (175)
=𝔼[Ch(Yg(Φ))|g(Φ)]\displaystyle=\mathbb{E}\left[C_{h}(Y-g(\Phi))|g(\Phi)\right] (176)
=0.\displaystyle=0. (177)

The last equality holds because gg is calibrated, i.e., 𝔼[Yg(Φ)|g(Φ)]=0\mathbb{E}[Y-g(\Phi)|g(\Phi)]=0. However, 𝔼[Y|Φ]\mathbb{E}[Y|\Phi] is the most accurate predictor among all multicalibrated g(Φ)g(\Phi).

Remark F.30.

The convergence rate M(Σ)M(\Sigma) does not depend on the dimension dd of covariates. When YΨ|ΦY\perp\Psi\;|\;\Phi, so that Φ\Phi is sufficient for prediction, 𝔼[Y|Φ,Ψ]=𝔼[Y|Φ]\mathbb{E}[Y|\Phi,\Psi]=\mathbb{E}[Y|\Phi] implies:

ΣΨy=ΣΨΦΣΦΦ1ΣΦy.\displaystyle\Sigma_{\Psi y}=\Sigma_{\Psi\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi y}. (178)

It follows that M(Σ)=0M(\Sigma)=0 and the algorithm will converge in one step.
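This one-step case can be checked numerically. A hypothetical generative model with Y=aTΦ+εY=a^{T}\Phi+\varepsilon and Ψ=BTΦ+η\Psi=B^{T}\Phi+\eta, where ε,η\varepsilon,\eta are independent noise, satisfies YΨ|ΦY\perp\Psi\;|\;\Phi; assembling Σ\Sigma from it, the residual ΣΨyΣΨΦΣΦΦ1ΣΦy\Sigma_{\Psi y}-\Sigma_{\Psi\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi y} and hence M(Σ)M(\Sigma) vanish. All variable names below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d_phi, d_psi = 3, 2

# Generative model: Y = a^T Phi + eps, Psi = B^T Phi + eta, with eps and
# eta independent noise, so Y is independent of Psi given Phi.
L = rng.standard_normal((d_phi, d_phi))
S_pp = L @ L.T + np.eye(d_phi)          # Cov(Phi), positive definite
a = rng.standard_normal(d_phi)
B = rng.standard_normal((d_phi, d_psi))

# implied covariance blocks of (Phi, Psi, Y)
S_py = S_pp @ a                         # Sigma_{Phi y}
S_ps = S_pp @ B                         # Sigma_{Phi Psi}
S_sy = B.T @ S_pp @ a                   # Sigma_{Psi y}
S_ss = B.T @ S_pp @ B + np.eye(d_psi)   # Sigma_{Psi Psi}, with Cov(eta) = I
S_yy = a @ S_pp @ a + 1.0               # Sigma_{yy}, with Var(eps) = 1

inv = np.linalg.inv(S_pp)
residual = S_sy - S_ps.T @ inv @ S_py   # vanishes under conditional independence
M = residual @ np.linalg.inv(S_ss - S_ps.T @ inv @ S_ps) @ residual \
    / (S_yy - S_py @ inv @ S_py)

assert np.allclose(residual, 0)
assert np.isclose(M, 0)
```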

On the other hand, suppose YY and Ψ\Psi are linearly dependent given Φ\Phi, so that Σ\Sigma is singular and positive definiteness is violated. Then, as shown in the proof below:

ΣyyΣyΦΣΦΦ1ΣΦy=(ΣyΨΣyΦΣΦΦ1ΣΦΨ)(ΣΨΨΣΨΦΣΦΦ1ΣΦΨ)1(ΣΨyΣΨΦΣΦΦ1ΣΦy).\displaystyle\Sigma_{yy}-\Sigma_{y\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi y}=(\Sigma_{y\Psi}-\Sigma_{y\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi\Psi})(\Sigma_{\Psi\Psi}-\Sigma_{\Psi\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi\Psi})^{-1}(\Sigma_{\Psi y}-\Sigma_{\Psi\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi y}). (179)

It follows that M(Σ)=1M(\Sigma)=1 and the algorithm does not converge.

Thus the convergence rate depends on how close the problem is to singular. Since the algorithm converges to a predictor that does not depend on Ψ\Psi, the stronger the "spurious" correlation between YY and Ψ\Psi given Φ\Phi in the distribution DD, the longer the algorithm takes to converge.

Proof.

Without loss of generality, assume μ=0\mu=0.

Let c=(cx,cy,cb)T=(cΦ,cΨ,cy,cb)Tc=(c_{x},c_{y},c_{b})^{T}=(c_{\Phi},c_{\Psi},c_{y},c_{b})^{T}. Denote dimensions of Φ,Ψ\Phi,\Psi by dΦ,dΨd_{\Phi},d_{\Psi}.

Let 𝔼[Y|Φ,Ψ]=(αΦ)TΦ+(αΨ)TΨ\mathbb{E}[Y|\Phi,\Psi]=(\alpha_{\Phi}^{\star})^{T}\Phi+(\alpha_{\Psi}^{\star})^{T}\Psi and 𝔼[Ψ|Φ,Y]=(βΦ)TΦ+(βy)TY\mathbb{E}[\Psi|\Phi,Y]=(\beta_{\Phi}^{\star})^{T}\Phi+(\beta_{y}^{\star})^{T}Y with αΦdΦ,\alpha_{\Phi}^{\star}\in\mathbb{R}^{d_{\Phi}},
αΨdΨ,βΦdΦ×dΨ,βy1×dΨ\alpha_{\Psi}^{\star}\in\mathbb{R}^{d_{\Psi}},\beta_{\Phi}^{\star}\in\mathbb{R}^{d_{\Phi}\times d_{\Psi}},\beta_{y}^{\star}\in\mathbb{R}^{1\times d_{\Psi}}.

We have:

(αΦαΨ)\displaystyle\begin{pmatrix}\alpha_{\Phi}^{\star}\\ \alpha_{\Psi}^{\star}\end{pmatrix} =(ΣΦΦΣΦΨΣΨΦΣΨΨ)1(ΣΦyΣΨy).\displaystyle=\begin{pmatrix}\Sigma_{\Phi\Phi}&\Sigma_{\Phi\Psi}\\ \Sigma_{\Psi\Phi}&\Sigma_{\Psi\Psi}\end{pmatrix}^{-1}\begin{pmatrix}\Sigma_{\Phi y}\\ \Sigma_{\Psi y}\end{pmatrix}. (180)
(βΦβy)\displaystyle\begin{pmatrix}\beta_{\Phi}^{\star}\\ \beta_{y}^{\star}\end{pmatrix} =(ΣΦΦΣΦyΣyΦΣyy)1(ΣΦΨΣyΨ).\displaystyle=\begin{pmatrix}\Sigma_{\Phi\Phi}&\Sigma_{\Phi y}\\ \Sigma_{y\Phi}&\Sigma_{yy}\end{pmatrix}^{-1}\begin{pmatrix}\Sigma_{\Phi\Psi}\\ \Sigma_{y\Psi}\end{pmatrix}. (181)

According to Theorem F.15, 𝔼[h(X,Y)|Φ,Y]\mathbb{E}[h(X,Y)|\Phi,Y] is constant across values of (Φ,Y)(\Phi,Y).

𝔼[h(X,Y)|Φ,Y]\displaystyle\mathbb{E}[h(X,Y)|\Phi,Y] =cΦTΦ+cyTY+cΨT𝔼[Ψ|Φ,Y]+cb\displaystyle=c_{\Phi}^{T}\Phi+c_{y}^{T}Y+c_{\Psi}^{T}\mathbb{E}[\Psi|\Phi,Y]+c_{b} (182)
=cΦTΦ+cyTY+cΨT(ΣΨΦΣΨy)(ΣΦΦΣΦyΣyΦΣyy)1(ΦY)+cb.\displaystyle=c_{\Phi}^{T}\Phi+c_{y}^{T}Y+c_{\Psi}^{T}\begin{pmatrix}\Sigma_{\Psi\Phi}&\Sigma_{\Psi y}\end{pmatrix}\begin{pmatrix}\Sigma_{\Phi\Phi}&\Sigma_{\Phi y}\\ \Sigma_{y\Phi}&\Sigma_{yy}\\ \end{pmatrix}^{-1}\begin{pmatrix}\Phi\\ Y\\ \end{pmatrix}+c_{b}. (183)

This implies:

(ΣΦΦΣΦyΣyΦΣyy)1(ΣΦΨΣyΨ)cΨ+(cΦcy)=0.\displaystyle\begin{pmatrix}\Sigma_{\Phi\Phi}&\Sigma_{\Phi y}\\ \Sigma_{y\Phi}&\Sigma_{yy}\\ \end{pmatrix}^{-1}\begin{pmatrix}\Sigma_{\Phi\Psi}\\ \Sigma_{y\Psi}\end{pmatrix}c_{\Psi}+\begin{pmatrix}c_{\Phi}\\ c_{y}\end{pmatrix}=0. (184)

Rearranging to:

(ΣΦΦΣΦΨΣΦyΣyΦΣyΨΣyy)(cΦcΨcy)=0.\displaystyle\begin{pmatrix}\Sigma_{\Phi\Phi}&\Sigma_{\Phi\Psi}&\Sigma_{\Phi y}\\ \Sigma_{y\Phi}&\Sigma_{y\Psi}&\Sigma_{yy}\end{pmatrix}\begin{pmatrix}c_{\Phi}\\ c_{\Psi}\\ c_{y}\end{pmatrix}=0. (185)

Let f(t)=(αΦ(t))TΦ+(αΨ(t))TΨf^{(t)}=(\alpha_{\Phi}^{(t)})^{T}\Phi+(\alpha_{\Psi}^{(t)})^{T}\Psi and f~(t)=(α~Φ(t))TΦ+(α~Ψ(t))TΨ+(α~y(t))TY\tilde{f}^{(t)}=(\tilde{\alpha}_{\Phi}^{(t)})^{T}\Phi+(\tilde{\alpha}_{\Psi}^{(t)})^{T}\Psi+(\tilde{\alpha}_{y}^{(t)})^{T}Y. We claim that α~Ψ(t)=0\tilde{\alpha}_{\Psi}^{(t)}=0. Otherwise, consider f~(t)=(α~Φ(t))TΦ+(α~Ψ(t))T𝔼[Ψ|Φ,Y]+(α~y(t))TY\tilde{f}^{(t)^{\prime}}=(\tilde{\alpha}_{\Phi}^{(t)})^{T}\Phi+(\tilde{\alpha}_{\Psi}^{(t)})^{T}\mathbb{E}[\Psi|\Phi,Y]+(\tilde{\alpha}_{y}^{(t)})^{T}Y.

𝔼[f~(t)f~(t)|Φ,Y]=𝔼[(α~Ψ(t))TΨ|Φ,Y](α~Ψ(t))T𝔼[Ψ|Φ,Y]=0.\displaystyle\mathbb{E}[\tilde{f}^{(t)}-\tilde{f}^{(t)^{\prime}}|\Phi,Y]=\mathbb{E}[(\tilde{\alpha}_{\Psi}^{(t)})^{T}\Psi|\Phi,Y]-(\tilde{\alpha}_{\Psi}^{(t)})^{T}\mathbb{E}[\Psi|\Phi,Y]=0. (186)

Thus, f~(t)f~(t)\tilde{f}^{(t)}-\tilde{f}^{(t)^{\prime}}\in\mathcal{H}.

On the other hand,

𝔼[(Yf~(t))2]𝔼[(Yf~(t))2]\displaystyle\;\;\;\;\mathbb{E}[(Y-\tilde{f}^{(t)})^{2}]-\mathbb{E}[(Y-\tilde{f}^{(t)^{\prime}})^{2}] (187)
=𝔼[(α~Ψ(t))T[𝔼[ΨΨT|Φ,Y]𝔼[Ψ|Φ,Y]𝔼[ΨT|Φ,Y]]α~Ψ(t)]\displaystyle=\mathbb{E}\left[(\tilde{\alpha}_{\Psi}^{(t)})^{T}\left[\mathbb{E}[\Psi\Psi^{T}|\Phi,Y]-\mathbb{E}[\Psi|\Phi,Y]\mathbb{E}[\Psi^{T}|\Phi,Y]\right]\tilde{\alpha}_{\Psi}^{(t)}\right] (188)
>0.\displaystyle>0. (189)

The equality follows by expanding the square: the cross term vanishes because 𝔼[f~(t)f~(t)|Φ,Y]=0\mathbb{E}[\tilde{f}^{(t)}-\tilde{f}^{(t)^{\prime}}|\Phi,Y]=0 and f~(t)\tilde{f}^{(t)^{\prime}} is measurable with respect to (Φ,Y)(\Phi,Y). The inequality follows from the fact that 𝔼[ΨΨT|Φ,Y]𝔼[Ψ|Φ,Y]𝔼[ΨT|Φ,Y]\mathbb{E}[\Psi\Psi^{T}|\Phi,Y]-\mathbb{E}[\Psi|\Phi,Y]\mathbb{E}[\Psi^{T}|\Phi,Y] is the covariance matrix of Ψ|Φ,Y\Psi|\Phi,Y, which is positive definite because Σ\Sigma is positive definite. This contradicts the definition of f~(t)\tilde{f}^{(t)}. Thus, f~(t)=(α~Φ(t))TΦ+(α~y(t))TY\tilde{f}^{(t)}=(\tilde{\alpha}_{\Phi}^{(t)})^{T}\Phi+(\tilde{\alpha}_{y}^{(t)})^{T}Y.

Define a matrix C(d+1)×dtC\in\mathbb{R}^{(d+1)\times d_{t}} whose columns span the solution space of Equation 185. Then H:=CT(Φ,Ψ,Y)TdtH:=C^{T}(\Phi,\Psi,Y)^{T}\in\mathbb{R}^{d_{t}} is a random vector. According to the definition of f~(t)\tilde{f}^{(t)},

f~(t+1)\displaystyle\tilde{f}^{(t+1)} =𝔼[Y|f(t),H]\displaystyle=\mathbb{E}[Y|f^{(t)},H] (190)
=k(t)f(t)+(cΦ(t))TΦ+(cΨ(t))TΨ+(cy(t))TY\displaystyle=k^{(t)}f^{(t)}+(c_{\Phi}^{(t)})^{T}\Phi+(c_{\Psi}^{(t)})^{T}\Psi+(c_{y}^{(t)})^{T}Y (191)
=k(t)(αΦ(t))TΦ+k(t)(αΨ(t))TΨ+(cΦ(t))TΦ+(cΨ(t))TΨ+(cy(t))TY.\displaystyle=k^{(t)}(\alpha_{\Phi}^{(t)})^{T}\Phi+k^{(t)}(\alpha_{\Psi}^{(t)})^{T}\Psi+(c_{\Phi}^{(t)})^{T}\Phi+(c_{\Psi}^{(t)})^{T}\Psi+(c_{y}^{(t)})^{T}Y. (192)

In the above equation, k(t)k^{(t)}\in\mathbb{R}.

Since α~Ψ(t+1)=k(t)αΨ(t)+cΨ(t)=0\tilde{\alpha}_{\Psi}^{(t+1)}=k^{(t)}\alpha_{\Psi}^{(t)}+c_{\Psi}^{(t)}=0, we have cΨ(t)=k(t)αΨ(t)c_{\Psi}^{(t)}=-k^{(t)}\alpha_{\Psi}^{(t)}. Substituting into Equation 184:

(cΦ(t)cy(t))=k(t)(ΣΦΦΣΦyΣyΦΣyy)1(ΣΦΨΣyΨ)αΨ(t).\displaystyle\begin{pmatrix}c_{\Phi}^{(t)}\\ c_{y}^{(t)}\end{pmatrix}=k^{(t)}\begin{pmatrix}\Sigma_{\Phi\Phi}&\Sigma_{\Phi y}\\ \Sigma_{y\Phi}&\Sigma_{yy}\\ \end{pmatrix}^{-1}\begin{pmatrix}\Sigma_{\Phi\Psi}\\ \Sigma_{y\Psi}\end{pmatrix}\alpha_{\Psi}^{(t)}. (193)

Substituting into Equation 192:

(α~Φ(t+1)α~y(t+1))\displaystyle\begin{pmatrix}\tilde{\alpha}_{\Phi}^{(t+1)}\\ \tilde{\alpha}_{y}^{(t+1)}\end{pmatrix} =k(t)(αΦ(t)0)+(cΦ(t)cy(t))\displaystyle=k^{(t)}\begin{pmatrix}\alpha_{\Phi}^{(t)}\\ 0\end{pmatrix}+\begin{pmatrix}c_{\Phi}^{(t)}\\ c_{y}^{(t)}\end{pmatrix} (194)
=k(t)(IdΦβΦ0βy)(αΦ(t)αΨ(t)).\displaystyle=k^{(t)}\begin{pmatrix}I_{d_{\Phi}}&\beta_{\Phi}^{\star}\\ 0&\beta_{y}^{\star}\end{pmatrix}\begin{pmatrix}\alpha_{\Phi}^{(t)}\\ \alpha_{\Psi}^{(t)}\end{pmatrix}. (195)

Then we have f(t+1)=𝔼[f~(t+1)|Φ,Ψ]=(α~Φ(t+1))TΦ+(α~y(t+1))T𝔼[Y|Φ,Ψ]f^{(t+1)}=\mathbb{E}[\tilde{f}^{(t+1)}|\Phi,\Psi]=(\tilde{\alpha}_{\Phi}^{(t+1)})^{T}\Phi+(\tilde{\alpha}_{y}^{(t+1)})^{T}\mathbb{E}[Y|\Phi,\Psi].

This is equivalent to:

(αΦ(t+1)αΨ(t+1))\displaystyle\begin{pmatrix}\alpha_{\Phi}^{(t+1)}\\ \alpha_{\Psi}^{(t+1)}\end{pmatrix} =(α~Φ(t+1)0)+(ΣΦΦΣΦΨΣΨΦΣΨΨ)1(ΣΦyΣΨy)α~y(t+1)\displaystyle=\begin{pmatrix}\tilde{\alpha}_{\Phi}^{(t+1)}\\ 0\end{pmatrix}+\begin{pmatrix}\Sigma_{\Phi\Phi}&\Sigma_{\Phi\Psi}\\ \Sigma_{\Psi\Phi}&\Sigma_{\Psi\Psi}\end{pmatrix}^{-1}\begin{pmatrix}\Sigma_{\Phi y}\\ \Sigma_{\Psi y}\end{pmatrix}\tilde{\alpha}_{y}^{(t+1)} (196)
=(IdΦαΦ0αΨ)(α~Φ(t+1)α~y(t+1)).\displaystyle=\begin{pmatrix}I_{d_{\Phi}}&\alpha_{\Phi}^{\star}\\ 0&\alpha_{\Psi}^{\star}\end{pmatrix}\begin{pmatrix}\tilde{\alpha}_{\Phi}^{(t+1)}\\ \tilde{\alpha}_{y}^{(t+1)}\end{pmatrix}. (197)

Combining the two equations above, we have:

(αΦ(t+1)αΨ(t+1))\displaystyle\begin{pmatrix}\alpha_{\Phi}^{(t+1)}\\ \alpha_{\Psi}^{(t+1)}\end{pmatrix} =k(t)(IdΦαΦ0αΨ)(IdΦβΦ0βy)(αΦ(t)αΨ(t))\displaystyle=k^{(t)}\begin{pmatrix}I_{d_{\Phi}}&\alpha_{\Phi}^{\star}\\ 0&\alpha_{\Psi}^{\star}\end{pmatrix}\begin{pmatrix}I_{d_{\Phi}}&\beta_{\Phi}^{\star}\\ 0&\beta_{y}^{\star}\end{pmatrix}\begin{pmatrix}\alpha_{\Phi}^{(t)}\\ \alpha_{\Psi}^{(t)}\end{pmatrix} (198)
=k(t)(IdΦβΦ+αΦβy0αΨβy)(αΦ(t)αΨ(t)).\displaystyle=k^{(t)}\begin{pmatrix}I_{d_{\Phi}}&\beta_{\Phi}^{\star}+\alpha_{\Phi}^{\star}\beta_{y}^{\star}\\ 0&\alpha_{\Psi}^{\star}\beta_{y}^{\star}\end{pmatrix}\begin{pmatrix}\alpha_{\Phi}^{(t)}\\ \alpha_{\Psi}^{(t)}\end{pmatrix}. (199)

Thus,

(αΦ(t)αΨ(t))\displaystyle\begin{pmatrix}\alpha_{\Phi}^{(t)}\\ \alpha_{\Psi}^{(t)}\end{pmatrix} =K(t)(IdΦβΦ+αΦβy0αΨβy)t(αΦ(0)αΨ(0))\displaystyle=K^{(t)}\begin{pmatrix}I_{d_{\Phi}}&\beta_{\Phi}^{\star}+\alpha_{\Phi}^{\star}\beta_{y}^{\star}\\ 0&\alpha_{\Psi}^{\star}\beta_{y}^{\star}\end{pmatrix}^{t}\begin{pmatrix}\alpha_{\Phi}^{(0)}\\ \alpha_{\Psi}^{(0)}\end{pmatrix} (200)
=K(t)(IdΦβΦ+αΦβy0αΨβy)t(αΦαΨ).\displaystyle=K^{(t)}\begin{pmatrix}I_{d_{\Phi}}&\beta_{\Phi}^{\star}+\alpha_{\Phi}^{\star}\beta_{y}^{\star}\\ 0&\alpha_{\Psi}^{\star}\beta_{y}^{\star}\end{pmatrix}^{t}\begin{pmatrix}\alpha_{\Phi}^{\star}\\ \alpha_{\Psi}^{\star}\end{pmatrix}. (201)

In the above equation, K(t)=0u<tk(u)K^{(t)}=\prod_{0\leq u<t}k^{(u)}.

Define f^(t)=(α^Φ(t))TΦ+(α^Ψ(t))TΨ\hat{f}^{(t)}=(\hat{\alpha}_{\Phi}^{(t)})^{T}\Phi+(\hat{\alpha}_{\Psi}^{(t)})^{T}\Psi, where

(α^Φ(t)α^Ψ(t))=(IdΦβΦ+αΦβy0αΨβy)t(αΦαΨ).\displaystyle\begin{pmatrix}\hat{\alpha}_{\Phi}^{(t)}\\ \hat{\alpha}_{\Psi}^{(t)}\end{pmatrix}=\begin{pmatrix}I_{d_{\Phi}}&\beta_{\Phi}^{\star}+\alpha_{\Phi}^{\star}\beta_{y}^{\star}\\ 0&\alpha_{\Psi}^{\star}\beta_{y}^{\star}\end{pmatrix}^{t}\begin{pmatrix}\alpha_{\Phi}^{\star}\\ \alpha_{\Psi}^{\star}\end{pmatrix}. (202)

In the following we show α^Ψ(t)0\hat{\alpha}_{\Psi}^{(t)}\rightarrow 0. Since α^Ψ(t)=(αΨβy)tαΨ\hat{\alpha}_{\Psi}^{(t)}=(\alpha_{\Psi}^{\star}\beta_{y}^{\star})^{t}\alpha_{\Psi}^{\star}, and the rank-one matrix αΨβydΨ×dΨ\alpha_{\Psi}^{\star}\beta_{y}^{\star}\in\mathbb{R}^{d_{\Psi}\times d_{\Psi}} has at most one nonzero eigenvalue, namely βyαΨ\beta_{y}^{\star}\alpha_{\Psi}^{\star}, it suffices to show |βyαΨ|<1|\beta_{y}^{\star}\alpha_{\Psi}^{\star}|<1.
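The rank-one facts used here are elementary but easy to verify numerically: for vectors u,vu,v, the spectral radius of the outer product uvTuv^{T} equals |vTu||v^{T}u|, and (uvT)tu=(vTu)tu(uv^{T})^{t}u=(v^{T}u)^{t}u. In the illustrative sketch below, u and v stand in for αΨ\alpha_{\Psi}^{\star} and (βy)T(\beta_{y}^{\star})^{T}, with an arbitrary dimension.

```python
import numpy as np

rng = np.random.default_rng(2)
u = rng.standard_normal(4)  # stands in for alpha_Psi^*
v = rng.standard_normal(4)  # stands in for the row vector beta_y^*

A = np.outer(u, v)          # rank-one matrix alpha_Psi^* beta_y^*
eigs = np.linalg.eigvals(A)

# the only (possibly) nonzero eigenvalue is v^T u, so the spectral radius
# is |v^T u|, and powers of A act on u as geometric scaling by v^T u
assert np.isclose(np.max(np.abs(eigs)), abs(v @ u))
assert np.allclose(np.linalg.matrix_power(A, 3) @ u, (v @ u) ** 3 * u)
```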

βyαΨ\displaystyle\beta_{y}^{\star}\alpha_{\Psi}^{\star} =(ΣyyΣyΦΣΦΦ1ΣΦy)1\displaystyle=(\Sigma_{yy}-\Sigma_{y\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi y})^{-1} (203)
(ΣyΨΣyΦΣΦΦ1ΣΦΨ)(ΣΨΨΣΨΦΣΦΦ1ΣΦΨ)1(ΣΨyΣΨΦΣΦΦ1ΣΦy).\displaystyle\;\;\;\;(\Sigma_{y\Psi}-\Sigma_{y\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi\Psi})(\Sigma_{\Psi\Psi}-\Sigma_{\Psi\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi\Psi})^{-1}(\Sigma_{\Psi y}-\Sigma_{\Psi\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi y}). (204)

Since (ΣyyΣyΦΣΦΦ1ΣΦy)(\Sigma_{yy}-\Sigma_{y\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi y}) and (ΣΨΨΣΨΦΣΦΦ1ΣΦΨ)(\Sigma_{\Psi\Psi}-\Sigma_{\Psi\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi\Psi}) are Schur complements of principal submatrices of Σ\Sigma, they are still positive definite.

Thus,

ΣyyΣyΦΣΦΦ1ΣΦy>0.\displaystyle\Sigma_{yy}-\Sigma_{y\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi y}>0. (205)
(ΣyΨΣyΦΣΦΦ1ΣΦΨ)(ΣΨΨΣΨΦΣΦΦ1ΣΦΨ)1(ΣΨyΣΨΦΣΦΦ1ΣΦy)0.\displaystyle(\Sigma_{y\Psi}-\Sigma_{y\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi\Psi})(\Sigma_{\Psi\Psi}-\Sigma_{\Psi\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi\Psi})^{-1}(\Sigma_{\Psi y}-\Sigma_{\Psi\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi y})\geq 0. (206)

It follows that βyαΨ0\beta_{y}^{\star}\alpha_{\Psi}^{\star}\geq 0. So we just have to show βyαΨ<1\beta_{y}^{\star}\alpha_{\Psi}^{\star}<1.

Since det(Σ)>0\det(\Sigma)>0, by applying row operations to Σ\Sigma, we have:

det(ΣΦΦΣΦΨΣΦy0ΣΨΨΣΨΦΣΦΦ1ΣΦΨΣΨyΣΨΦΣΦΦ1ΣΦy0ΣyΨΣyΦΣΦΦ1ΣΦΨΣyyΣyΦΣΦΦ1ΣΦy)>0.\displaystyle\det\begin{pmatrix}\Sigma_{\Phi\Phi}&\Sigma_{\Phi\Psi}&\Sigma_{\Phi y}\\ 0&\Sigma_{\Psi\Psi}-\Sigma_{\Psi\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi\Psi}&\Sigma_{\Psi y}-\Sigma_{\Psi\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi y}\\ 0&\Sigma_{y\Psi}-\Sigma_{y\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi\Psi}&\Sigma_{yy}-\Sigma_{y\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi y}\end{pmatrix}>0. (207)

It follows that:

(ΣyyΣyΦΣΦΦ1ΣΦy)\displaystyle(\Sigma_{yy}-\Sigma_{y\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi y}) (208)
(ΣyΨΣyΦΣΦΦ1ΣΦΨ)(ΣΨΨΣΨΦΣΦΦ1ΣΦΨ)1(ΣΨyΣΨΦΣΦΦ1ΣΦy)>0.\displaystyle-(\Sigma_{y\Psi}-\Sigma_{y\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi\Psi})(\Sigma_{\Psi\Psi}-\Sigma_{\Psi\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi\Psi})^{-1}(\Sigma_{\Psi y}-\Sigma_{\Psi\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi y})>0. (209)

Rearranging to:

βyαΨ\displaystyle\beta_{y}^{\star}\alpha_{\Psi}^{\star} =(ΣyyΣyΦΣΦΦ1ΣΦy)1\displaystyle=(\Sigma_{yy}-\Sigma_{y\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi y})^{-1} (210)
(ΣyΨΣyΦΣΦΦ1ΣΦΨ)(ΣΨΨΣΨΦΣΦΦ1ΣΦΨ)1(ΣΨyΣΨΦΣΦΦ1ΣΦy)\displaystyle\;\;\;\;(\Sigma_{y\Psi}-\Sigma_{y\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi\Psi})(\Sigma_{\Psi\Psi}-\Sigma_{\Psi\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi\Psi})^{-1}(\Sigma_{\Psi y}-\Sigma_{\Psi\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi y}) (211)
<1.\displaystyle<1. (212)

Thus, 0βyαΨ<10\leq\beta_{y}^{\star}\alpha_{\Psi}^{\star}<1.

In the following, we show α^Φ(t)ΣΦΦ1ΣΦy\hat{\alpha}_{\Phi}^{(t)}\rightarrow\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi y}. By Equation 202 and |βyαΨ|<1|\beta_{y}^{\star}\alpha_{\Psi}^{\star}|<1, we have:

α^Φ(t)\displaystyle\hat{\alpha}_{\Phi}^{(t)} =αΦ+(βΦ+αΦβy)0u<t(αΨβy)uαΨ\displaystyle=\alpha_{\Phi}^{\star}+(\beta_{\Phi}^{\star}+\alpha_{\Phi}^{\star}\beta_{y}^{\star})\sum_{0\leq u<t}(\alpha_{\Psi}^{\star}\beta_{y}^{\star})^{u}\alpha_{\Psi}^{\star} (213)
tαΦ+(βΦ+αΦβy)(IdΨαΨβy)1αΨ\displaystyle\xrightarrow{t\to\infty}\alpha_{\Phi}^{\star}+(\beta_{\Phi}^{\star}+\alpha_{\Phi}^{\star}\beta_{y}^{\star})(I_{d_{\Psi}}-\alpha_{\Psi}^{\star}\beta_{y}^{\star})^{-1}\alpha_{\Psi}^{\star} (214)
=αΦ+(βΦ+αΦβy)(1βyαΨ)1αΨ\displaystyle=\alpha_{\Phi}^{\star}+(\beta_{\Phi}^{\star}+\alpha_{\Phi}^{\star}\beta_{y}^{\star})(1-\beta_{y}^{\star}\alpha_{\Psi}^{\star})^{-1}\alpha_{\Psi}^{\star} (215)
=αΦ+βΦαΨ1βyαΨ.\displaystyle=\frac{\alpha_{\Phi}^{\star}+\beta_{\Phi}^{\star}\alpha_{\Psi}^{\star}}{1-\beta_{y}^{\star}\alpha_{\Psi}^{\star}}. (216)

Equation 215 follows from the fact that (IdΨαΨβy)αΨ=(1βyαΨ)αΨ(I_{d_{\Psi}}-\alpha_{\Psi}^{\star}\beta_{y}^{\star})\alpha_{\Psi}^{\star}=(1-\beta_{y}^{\star}\alpha_{\Psi}^{\star})\alpha_{\Psi}^{\star}.

Define γΦ=ΣΦΦ1ΣΦy\gamma_{\Phi}^{\star}=\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi y} such that 𝔼[Y|Φ]=(γΦ)TΦ\mathbb{E}[Y|\Phi]=(\gamma_{\Phi}^{\star})^{T}\Phi. We have

𝔼[𝔼[𝔼[Y|Φ,Ψ]|Φ,Y]|Φ]=𝔼[Y|Φ].\displaystyle\;\;\;\;\;\;\mathbb{E}\left[\mathbb{E}\left[\mathbb{E}\left[Y|\Phi,\Psi\right]|\Phi,Y\right]|\Phi\right]=\mathbb{E}[Y|\Phi]. (217)
𝔼[𝔼[(αΦ)TΦ+(αΨ)TΨ|Φ,Y]|Φ]=(γΦ)TΦ.\displaystyle\Leftrightarrow\mathbb{E}\left[\mathbb{E}\left[(\alpha_{\Phi}^{\star})^{T}\Phi+(\alpha_{\Psi}^{\star})^{T}\Psi|\Phi,Y\right]|\Phi\right]=(\gamma_{\Phi}^{\star})^{T}\Phi. (218)
𝔼[(αΦ)TΦ+(αΨ)T(βΦ)TΦ+(αΨ)T(βy)TY|Φ]=(γΦ)TΦ.\displaystyle\Leftrightarrow\mathbb{E}\left[(\alpha_{\Phi}^{\star})^{T}\Phi+(\alpha_{\Psi}^{\star})^{T}(\beta_{\Phi}^{\star})^{T}\Phi+(\alpha_{\Psi}^{\star})^{T}(\beta_{y}^{\star})^{T}Y|\Phi\right]=(\gamma_{\Phi}^{\star})^{T}\Phi. (219)
(αΦ)TΦ+(αΨ)T(βΦ)TΦ+(αΨ)T(βy)T(γΦ)TΦ=(γΦ)TΦ.\displaystyle\Leftrightarrow(\alpha_{\Phi}^{\star})^{T}\Phi+(\alpha_{\Psi}^{\star})^{T}(\beta_{\Phi}^{\star})^{T}\Phi+(\alpha_{\Psi}^{\star})^{T}(\beta_{y}^{\star})^{T}(\gamma_{\Phi}^{\star})^{T}\Phi=(\gamma_{\Phi}^{\star})^{T}\Phi. (220)
αΦ+βΦαΨ+βyαΨγΦ=γΦ.\displaystyle\Leftrightarrow\alpha_{\Phi}^{\star}+\beta_{\Phi}^{\star}\alpha_{\Psi}^{\star}+\beta_{y}^{\star}\alpha_{\Psi}^{\star}\gamma_{\Phi}^{\star}=\gamma_{\Phi}^{\star}. (221)
αΦ+βΦαΨ1βyαΨ=γΦ.\displaystyle\Leftrightarrow\frac{\alpha_{\Phi}^{\star}+\beta_{\Phi}^{\star}\alpha_{\Psi}^{\star}}{1-\beta_{y}^{\star}\alpha_{\Psi}^{\star}}=\gamma_{\Phi}^{\star}. (222)

Thus, α^Φ(t)ΣΦΦ1ΣΦy\hat{\alpha}_{\Phi}^{(t)}\rightarrow\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi y}. Subsequently, f^(t)(ΣΦΦ1ΣΦy)TΦ=𝔼[Y|Φ]\hat{f}^{(t)}\rightarrow(\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi y})^{T}\Phi=\mathbb{E}[Y|\Phi]. The convergence rate is M(Σ)=βyαΨM(\Sigma)=\beta_{y}^{\star}\alpha_{\Psi}^{\star}.
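As a sketch separate from the proof, the normalized recursion of Equation 202 can be iterated numerically for a random positive definite Σ\Sigma: the Ψ\Psi-coefficients decay to zero and the Φ\Phi-coefficients approach ΣΦΦ1ΣΦy\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi y}, matching the fixed-point identity in Equation 222. Variable names and dimensions below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
d_phi, d_psi = 2, 2
n = d_phi + d_psi + 1
G = rng.standard_normal((n, n))
Sigma = G @ G.T + 0.5 * np.eye(n)  # covariance of (Phi, Psi, Y), positive definite

# alpha^*: regression coefficients of Y on (Phi, Psi)
alpha = np.linalg.solve(Sigma[:n - 1, :n - 1], Sigma[:n - 1, -1])
a_phi, a_psi = alpha[:d_phi], alpha[d_phi:]

# beta^*: regression coefficients of Psi on (Phi, Y)
idx = list(range(d_phi)) + [n - 1]
cols = list(range(d_phi, d_phi + d_psi))
beta = np.linalg.solve(Sigma[np.ix_(idx, idx)], Sigma[np.ix_(idx, cols)])
b_phi, b_y = beta[:d_phi, :], beta[d_phi, :]

M = b_y @ a_psi                    # convergence rate M(Sigma)
assert 0.0 <= M < 1.0

# normalized update matrix [[I, b_phi + a_phi b_y], [0, a_psi b_y]]
A = np.vstack([
    np.hstack([np.eye(d_phi), b_phi + np.outer(a_phi, b_y)]),
    np.hstack([np.zeros((d_psi, d_phi)), np.outer(a_psi, b_y)]),
])

coef = alpha.copy()
for _ in range(10000):
    coef = A @ coef

gamma = np.linalg.solve(Sigma[:d_phi, :d_phi], Sigma[:d_phi, -1])  # E[Y|Phi] coefficients
assert np.allclose(coef[:d_phi], gamma, atol=1e-6)   # Phi-coefficients converge to gamma
assert np.allclose(coef[d_phi:], 0, atol=1e-6)       # Psi-coefficients vanish
assert np.allclose((a_phi + b_phi @ a_psi) / (1 - M), gamma)  # Equation 222
```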

We have:

(α~Φ(t+1)α~y(t+1))\displaystyle\begin{pmatrix}\tilde{\alpha}_{\Phi}^{(t+1)}\\ \tilde{\alpha}_{y}^{(t+1)}\end{pmatrix} =K(t+1)(IdΦβΦ0βy)(α^Φ(t)α^Ψ(t)),\displaystyle=K^{(t+1)}\begin{pmatrix}I_{d_{\Phi}}&\beta_{\Phi}^{\star}\\ 0&\beta_{y}^{\star}\end{pmatrix}\begin{pmatrix}\hat{\alpha}_{\Phi}^{(t)}\\ \hat{\alpha}_{\Psi}^{(t)}\end{pmatrix}, (223)

where

(IdΦβΦ0βy)(α^Φ(t)α^Ψ(t))\displaystyle\begin{pmatrix}I_{d_{\Phi}}&\beta_{\Phi}^{\star}\\ 0&\beta_{y}^{\star}\end{pmatrix}\begin{pmatrix}\hat{\alpha}_{\Phi}^{(t)}\\ \hat{\alpha}_{\Psi}^{(t)}\end{pmatrix} t(IdΦβΦ0βy)(ΣΦΦ1ΣΦy0)\displaystyle\xrightarrow{t\to\infty}\begin{pmatrix}I_{d_{\Phi}}&\beta_{\Phi}^{\star}\\ 0&\beta_{y}^{\star}\end{pmatrix}\begin{pmatrix}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi y}\\ 0\end{pmatrix} (224)
=(ΣΦΦ1ΣΦy0).\displaystyle=\begin{pmatrix}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi y}\\ 0\end{pmatrix}. (225)

Define f~^(t)=(α~^Φ(t))TΦ+(α~^y(t))TY\hat{\tilde{f}}^{(t)}=(\hat{\tilde{\alpha}}_{\Phi}^{(t)})^{T}\Phi+(\hat{\tilde{\alpha}}_{y}^{(t)})^{T}Y, where

(α~^Φ(t+1)α~^y(t+1))=(IdΦβΦ0βy)(α^Φ(t)α^Ψ(t)).\displaystyle\begin{pmatrix}\hat{\tilde{\alpha}}_{\Phi}^{(t+1)}\\ \hat{\tilde{\alpha}}_{y}^{(t+1)}\end{pmatrix}=\begin{pmatrix}I_{d_{\Phi}}&\beta_{\Phi}^{\star}\\ 0&\beta_{y}^{\star}\end{pmatrix}\begin{pmatrix}\hat{\alpha}_{\Phi}^{(t)}\\ \hat{\alpha}_{\Psi}^{(t)}\end{pmatrix}. (226)

Thus, f~^(t)𝔼[Y|Φ]\hat{\tilde{f}}^{(t)}\rightarrow\mathbb{E}[Y|\Phi]. Since f~(t)=K(t)f~^(t)\tilde{f}^{(t)}=K^{(t)}\hat{\tilde{f}}^{(t)}, we have f~(t)=𝔼[Y|f~^(t)]𝔼[Y|Φ]\tilde{f}^{(t)}=\mathbb{E}[Y|\hat{\tilde{f}}^{(t)}]\rightarrow\mathbb{E}[Y|\Phi].

Subsequently, f(t)=𝔼[f~(t)|Φ,Ψ]𝔼[Y|Φ]f^{(t)}=\mathbb{E}[\tilde{f}^{(t)}|\Phi,\Psi]\rightarrow\mathbb{E}[Y|\Phi].

Appendix G Limitations

Both our theory and algorithm focus on the bounded regression setting. The definition of extended multicalibration does not depend on the risk function. However, the analysis of the maximal grouping function class as a linear space assumes a continuous probability distribution of observations, implying a continuous target domain. The convergence of MC-Pseudolabel is also established in a regression setting, and all experiments are performed on regression tasks. As most algorithms for out-of-distribution generalization are set up for classification problems, we fill the gap for regression and leave an extension to general risk functions for future work.