
Bridging Multicalibration and Out-of-distribution Generalization Beyond Covariate Shift

Jiayun Wu
Department of Computer Science & Technology, Tsinghua University
[email protected]

Jiashuo Liu
Department of Computer Science & Technology, Tsinghua University
[email protected]

Peng Cui
Department of Computer Science & Technology, Tsinghua University
[email protected]

Zhiwei Steven Wu
School of Computer Science, Carnegie Mellon University
[email protected]

Work completed as a research intern at Carnegie Mellon University.
Abstract

We establish a new model-agnostic optimization framework for out-of-distribution generalization via multicalibration, a criterion that ensures a predictor is calibrated across a family of overlapping groups. Multicalibration has been shown to be associated with the robustness of statistical inference under covariate shift. We further establish a link between multicalibration and robustness for prediction tasks both under and beyond covariate shift. We accomplish this by extending multicalibration to incorporate grouping functions that consider covariates and labels jointly. This leads to an equivalence between extended multicalibration and invariance, an objective for robust learning in the presence of concept shift. We show a linear structure of the grouping function class spanned by density ratios, resulting in a unifying framework for robust learning by designing specific grouping functions. We propose MC-Pseudolabel, a post-processing algorithm to achieve both extended multicalibration and out-of-distribution generalization. The algorithm, with lightweight hyperparameters and optimization through a series of supervised regression steps, achieves superior performance on real-world datasets with distribution shift.

1 Introduction

We revisit the problem of out-of-distribution generalization and establish new connections with multicalibration (Hébert-Johnson et al., 2018), a criterion originating from algorithmic fairness. Multicalibration is a strengthening of calibration, which requires a predictor $f$ to be correct on average within each level set:

$$\mathbb{E}[Y-f(X)\mid f(X)]=0$$

Calibration is a relatively weak property, as it can be satisfied even by the uninformative constant predictor $f(X)=\mathbb{E}[Y]$ that predicts the average outcome. More broadly, calibration provides only a marginal guarantee that does not extend to sub-populations. Multicalibration (Hébert-Johnson et al., 2018) mitigates this issue by requiring calibration to hold over a family of (overlapping) subgroups $\mathcal{H}$: for all $h\in\mathcal{H}$,

$$\mathbb{E}[(Y-f(X))\,h(X)\mid f(X)]=0$$

Multicalibration was initially studied as a measure of subgroup fairness for boolean grouping functions $h$, with $h(X)=1$ indicating that $X$ is a member of group $h$ (Hébert-Johnson et al., 2018). Subsequently, Gopalan et al. (2022) and Kim et al. (2019) adopted a broader class of real-valued grouping functions that can identify sub-populations through reweighting. The formulation with real-valued grouping functions has enabled surprising connections between multicalibration and distribution shifts. Prior work (Kim et al., 2022; Roth, 2022) studied how distribution shift affects the measure of multicalibration, with a focus on covariate shift, where the relationship between $X$ and $Y$ remains fixed. Kim et al. (2022) show that whenever the set of real-valued grouping functions $\mathcal{H}$ includes the density ratio between the source and target distributions, a multicalibrated predictor with respect to the source remains calibrated on the shifted target distribution.
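To make the gap between calibration and multicalibration concrete, the following NumPy sketch (with hypothetical synthetic data of our own) checks that the constant predictor $f(X)=\mathbb{E}[Y]$ passes overall calibration but fails multicalibration for a boolean grouping function:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: Y's mean depends on a binary group X.
x = rng.integers(0, 2, size=100_000)                 # group membership
y = rng.binomial(1, np.where(x == 1, 0.8, 0.2)).astype(float)

# The constant predictor f(X) = E[Y] has a single level set, on which
# E[Y - f(X)] = 0, so it is perfectly calibrated.
f = np.full_like(y, y.mean())
calib_err = abs(np.mean(y - f))

# But for the grouping function h(X) = 1{X = 1}, the conditional mean
# E[(Y - f(X)) h(X) | f(X)] is far from zero: multicalibration fails.
h = (x == 1).astype(float)
mc_err = abs(np.mean((y - f) * h))

print(calib_err, mc_err)
```

Here `mc_err` is roughly $P(X{=}1)\cdot(\mathbb{E}[Y\mid X{=}1]-\mathbb{E}[Y])\approx 0.15$, while `calib_err` vanishes.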

Our work substantially expands the connections between multicalibration and distribution shifts. At a high level, our results show that robust prediction under distribution shift can actually be facilitated by multicalibration. We extend the notion of multicalibration by incorporating grouping functions that simultaneously consider both covariates $X$ and outcomes $Y$. This extension enables us to go beyond covariate shift and account for concept shift, which is prevalent in practice due to spurious correlation, missing variables, or confounding (Liu et al., 2023).

Our contributions. Building on the introduction of joint grouping functions, we establish new connections between our extended multicalibration notion and algorithmic robustness in the general setting of out-of-distribution generalization, where the target distribution on which the model is assessed differs from the source distribution on which it is learned.

1. We first revisit the setting of covariate shift and show that multicalibration implies Bayes optimality under covariate shift, provided a sufficiently rich class of grouping functions. Then, in the setting of concept shift, we show the equivalence of multicalibration and invariance (Arjovsky et al., 2019), a learning objective that searches for a Bayes optimal predictor $\mathbb{E}[Y|\Phi(X)]$ under a representation over features $\Phi(X)$, even though $\mathbb{E}[Y|X]$ differs across target distributions. We show a correspondence between an invariant representation $\Phi(X)$ and a multicalibrated predictor $\mathbb{E}[Y|\Phi(X)]$, with a grouping function class containing all density ratios between target distributions and the source distribution.

2. As part of our structural analysis of the new multicalibration concept, we investigate the maximal grouping function class that allows for a nontrivial multicalibrated predictor. For traditional covariate-based grouping functions, the Bayes optimal predictor $f(X)=\mathbb{E}[Y|X]$ is always multicalibrated, which is no longer the case for joint grouping functions. We show that the maximal grouping function class is a linear space spanned by the density ratios of the target distributions on which the predictor is invariant. As a structural characterization of distribution shift, this leads to an efficient parameterization of the grouping functions by linear combinations of a spanning set of density ratios. The spanning set can be flexibly designed to incorporate implicit assumptions of various methodologies for robust learning, including multi-environment learning (Shi et al., 2021) and hard sample learning (Liu et al., 2021).

3. We devise a post-processing algorithm to multicalibrate predictors and simultaneously produce invariant predictors. As a multicalibration algorithm, we prove its convergence under Gaussian distributions of data and certify multicalibration upon convergence. As a robust learning algorithm, the procedure is plainly supervised regression with respect to the models' hypothesis class and the grouping function class, introducing only the overhead of linear regression. This stands out from heavy optimization techniques for out-of-distribution generalization, such as bi-level optimization (Finn et al., 2017; Liu et al., 2022) and multi-objective learning (Ahuja et al., 2021; Arjovsky et al., 2019; Koyama and Yamaguchi, 2020), which typically involve high-order gradients (Ramé et al., 2022). The algorithm introduces no extra hyperparameters. This simplifies model selection, which is a significant challenge for out-of-distribution generalization since validation data is unavailable from the distribution where the model is deployed (Gulrajani and Lopez-Paz, 2021). Under the standard model selection protocol of DomainBed (Gulrajani and Lopez-Paz, 2021), the algorithm achieves superior performance to existing methods on real-world datasets with concept shift, including poverty estimation (Yeh et al., 2020), personal income prediction (Ding et al., 2021), and power consumption prediction (Malinin et al., 2021, 2022).

2 Multicalibration and Bayes Optimality under Covariate Shift

2.1 Multicalibration with Joint Grouping Functions

Settings of Out-of-distribution Generalization

We are concerned with prediction tasks under distribution shift, where covariates are denoted by a random vector $X\in\mathcal{X}$ and the target by $Y\in\mathcal{Y}$. The values taken by random variables are written in lowercase as $x,y$. We have an uncertainty set of absolutely continuous probability measures, denoted by $\mathcal{P}(X,Y)$, containing an accessible source measure $P_S\in\mathcal{P}$ and an unknown target measure $P_T\in\mathcal{P}$. We use capital letters such as $P$ to denote a single probability measure and lowercase letters such as $p$ to denote its probability density function. We define predictors as real-valued functions $f:\mathcal{X}\rightarrow\mathcal{Y}$, learned on the source distribution $P_S$ and assessed on the target distribution $P_T$. Given a loss function $\ell:\mathcal{Y}\times\mathcal{Y}\rightarrow\mathbb{R}$, we evaluate the average risk of a predictor $f$ w.r.t. a probability measure $P$, defined by $R_P(f):=\mathbb{E}_P[\ell(f(X),Y)]$. We focus on the setting with $\mathcal{Y}=[0,1]$ and $\ell(\hat{y},y)=(\hat{y}-y)^2$ in our theoretical analyses.

We propose a new definition of $\ell_2$ approximate multicalibration with joint grouping functions.

Definition 2.1 (Multicalibration with Joint Grouping Functions).

For a probability measure $P(X,Y)$ and a predictor $f$, let $\mathcal{H}\subset\mathbb{R}^{\mathcal{X}\times\mathcal{Y}}$ be a real-valued grouping function class. We say that $f$ is $\alpha$-approximately $\ell_2$ multicalibrated w.r.t. $\mathcal{H}$ and $P$ if for all $h\in\mathcal{H}$:

$$K_2(f,h,P)=\int\Big(\mathbb{E}_P\left[h(X,Y)(Y-v)\,\big|\,f(X)=v\right]\Big)^2\,dP_{f(X)}(v)\leq\alpha. \quad (1)$$

$P_{f(X)}(v)=P(f^{-1}(v))$ is the pushforward measure. We say $f$ is $\alpha$-approximately calibrated for $h\equiv 1$. We say $f$ is multicalibrated (calibrated) for $\alpha=0$. If the grouping function is defined on $X$ alone, which implies $h(x,y_1)=h(x,y_2)$ for any $x\in\mathcal{X}$ and $y_1,y_2\in\mathcal{Y}$, we abbreviate $h(x,\cdot)$ by $h(x)$.

In the special case of a constant grouping function $h\equiv 1$, $K_2(f,1,P)$ recovers the overall calibration error. For boolean grouping functions defined on $X$ (Hébert-Johnson et al., 2018), $K_2(f,h,P)$ computes the calibration error of the subgroup with $h(x)=1$. For real-valued grouping functions defined on $X$ (Gopalan et al., 2022; Kim et al., 2019), $K_2(f,h,P)$ evaluates a reweighted calibration error, whose weights $h(x)$ are proportional to the likelihood of a sample belonging to the subgroup. Most importantly, we propose an extended domain of grouping functions defined on covariates and outcomes jointly, which is useful for capturing more general distribution shifts. The multicalibration error quantifies the maximal calibration error over all subgroups associated with grouping functions in $\mathcal{H}$. We discuss multicalibration with covariate-based grouping functions in this section and with joint grouping functions in the next section.
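For intuition, the estimator below (a NumPy sketch; the synthetic data and the discretization grid are our own assumptions) computes the empirical $K_2$ error of Definition 2.1 by averaging squared conditional means over the level sets of a finite-range predictor:

```python
import numpy as np

def k2_error(f_vals, y, h_vals):
    """Empirical K_2(f, h, P) from Definition 2.1: the squared conditional
    mean of h * (Y - v) on each level set {f = v}, averaged under the
    pushforward distribution of f(X)."""
    err = 0.0
    for v in np.unique(f_vals):
        mask = f_vals == v
        cond = np.mean(h_vals[mask] * (y[mask] - v))   # E[h (Y - v) | f = v]
        err += mask.mean() * cond ** 2                  # weight by P(f(X) = v)
    return err

# Synthetic check: a well-specified, discretized predictor should have a
# small calibration error (h = 1), up to binning and sampling noise.
rng = np.random.default_rng(0)
x = rng.uniform(size=50_000)
y = x + rng.normal(scale=0.05, size=x.shape)
f_vals = np.round(x, 1)                  # predictor with finite range
h_vals = np.ones_like(y)                 # h = 1 recovers overall calibration
print(k2_error(f_vals, y, h_vals))
```

Joint grouping functions are handled by passing the per-sample values $h(x_i,y_i)$ as `h_vals`.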

2.2 Multicalibration Implies Bayes Optimality under Covariate Shift

In this subsection we focus on grouping functions $h(x)$ defined on covariates. We prove that approximately multicalibrated predictors simultaneously approach Bayes optimality on every target distribution under covariate shift, bridging the results of Kim et al. (2022) and Globus-Harris et al. (2023). To recap, Kim et al. (2022) study multicalibration under covariate shift and show that a multicalibrated predictor remains calibrated on the target distribution for a sufficiently large grouping function class. Further, multicalibrated predictors remain multicalibrated under covariate shift (Kim et al., 2022; Roth, 2022), assuming the grouping function class $\mathcal{H}$ is closed under a transformation by density ratios (Assumption 2.2.1). Second, Globus-Harris et al. (2023) show that multicalibration implies Bayes optimal accuracy, assuming $\mathcal{H}$ satisfies a weak learning condition (Assumption 2.2.2). A detailed discussion of other related work is deferred to Section A in the appendix.

Assumption 2.2 (Sufficiency of Grouping Function Class (informal, see Assumption F.1)).

1. (Closure under Covariate Shift) For a set of probability measures $\mathcal{P}(X)$ containing the source measure $P_S(X)$, $h\in\mathcal{H}$ implies $p/p_S\cdot h\in\mathcal{H}$ for any density function $p$ of a distribution in $\mathcal{P}$.

2. ($(\gamma,\rho)$-Weak Learning Condition) For any $P\in\mathcal{P}(X)P_S(Y|X)\equiv\{P'(X)P_S(Y\mid X):P'\in\mathcal{P}\}$ with the source conditional $P_S(Y|X)$, and every subset $G\subset\mathcal{X}$ with $P(X\in G)>\rho$: if the Bayes optimal predictor $\mathbb{E}_P[Y|X]$ has lower risk than the constant predictor $\mathbb{E}_P[Y|X\in G]$ by a margin $\gamma$, then there exists a predictor $h\in\mathcal{H}$ that also improves on the constant predictor by the margin $\gamma$.

Theorem 2.3 (Risk Bound under Covariate Shift).

For a source measure $P_S(X,Y)$ and a set of probability measures $\mathcal{P}(X)$ containing $P_S(X)$, given a predictor $f:\mathcal{X}\rightarrow[0,1]$ with finite range $m:=|\text{Range}(f)|$, consider a grouping function class $\mathcal{H}$ closed under affine transformations and satisfying Assumption 2.2 with $\rho=\gamma/m$. If $f$ is $\frac{\gamma^6}{256m^2}$-approximately $\ell_2$ multicalibrated w.r.t. $P_S$ and $\mathcal{H}_1:=\{h\in\mathcal{H}:\max_{x\in\mathcal{X}}h(x)^2\leq 1\}$, then for any target measure $P_T\in\mathcal{P}(X)P_S(Y|X)$,

$$R_{P_T}(f)\leq\inf_{f^*:\mathcal{X}\rightarrow[0,1]}R_{P_T}(f^*)+3\gamma. \quad (2)$$
Remark 2.4.

Following prior work in multicalibration (Globus-Harris et al., 2023; Roth, 2022), we study functions $f$ whose range has finite cardinality, which can be obtained by discretization.

3 Multicalibration and Invariance under Concept Shift

Theorem 2.3 shows that multicalibration implies Bayes optimal accuracy on target distributions under covariate shift. In practical scenarios, however, there are both marginal distribution shifts of the covariates ($X$) and concept shifts of the conditional distribution ($Y|X$). Concept shift is especially prevalent in tabular data due to missing variables and confounding (Liu et al., 2023). In order to go beyond covariate shift, we focus on grouping functions defined on covariates and outcomes jointly. We show that the multicalibration notion w.r.t. joint grouping functions is equivalent to invariance, a criterion for robust prediction under concept shift. Extending the robustness of multicalibration to general shift is non-trivial. The fundamental challenge is that no shared predictor is optimal on every target distribution, because the Bayes optimal predictor varies across different $Y|X$ distributions. As a first step, we show that predictors multicalibrated w.r.t. joint grouping functions are robust in the sense that, on each target distribution, they are optimal over all post-processing functions.

Theorem 3.1 (Risk Bound under Concept Shift).

For a set of absolutely continuous probability measures $\mathcal{P}(X,Y)$ containing the source measure $P_S(X,Y)$, consider a predictor $f:\mathcal{X}\rightarrow[0,1]$. Assume the grouping function class $\mathcal{H}$ satisfies the following condition:

$$\mathcal{H}\supset\left\{h(x,y)=\frac{p(x,y)}{p_S(x,y)}\,\Big|\,P\in\mathcal{P}(X,Y)\right\}. \quad (3)$$

If $f$ is $\alpha$-approximately $\ell_2$ multicalibrated w.r.t. $\mathcal{H}$ and $P_S$, then for any measure $P\in\mathcal{P}(X,Y)$,

$$R_P(f)\leq\inf_{g:[0,1]\rightarrow[0,1]}R_P(g\circ f)+2\sqrt{\alpha}. \quad (4)$$

The theorem shows that a predictor approximately multicalibrated on the source can barely be improved by post-processing on any target distribution. To ensure such robustness, the grouping function class must include all density ratios between target and source measures, which are functions over $\mathcal{X}\times\mathcal{Y}$. This characterization of robustness in terms of post-processing echoes Invariant Risk Minimization (IRM) (Arjovsky et al., 2019), a paradigm for out-of-distribution generalization under $Y|X$ shift, whose analysis, however, focuses on representation learning.

Definition 3.2 (Invariant Predictor).

Consider data collected from multiple environments in the set $\mathscr{E}$, where the probability measure in an environment $e\in\mathscr{E}$ is denoted by $P_e(X,Y)$. Denote the representation over covariates by a measurable function $\Phi(x)$. We say that $\Phi$ elicits an $\alpha$-approximately invariant predictor $g^*\circ\Phi$ across $\mathscr{E}$ if there exists a function $g^*\in\mathcal{G}:=\{g:\text{supp}(\Phi)\rightarrow[0,1]\}$ such that for all $e\in\mathscr{E}$:

$$R_{P_e}(g^*\circ\Phi)\leq\inf_{g\in\mathcal{G}}R_{P_e}(g\circ\Phi)+\alpha. \quad (5)$$
Remark 3.3.

(1) Predictors in $\mathcal{G}$ take a representation $\Phi$ extracted from the covariates as input. For a general predictor $f(x)$, if we take $\Phi(x)=f(x)$ and $g^*$ as the identity function, Equation 5 reduces to the form of Equation 4. Therefore, $f$ in Equation 4 is a $2\sqrt{\alpha}$-approximately invariant predictor across environments collected from the uncertainty set $\mathcal{P}$. (2) We give an approximate definition of invariant predictors, which recovers the original definition (Arjovsky et al., 2019) when $\alpha=0$. In this case, there exists a shared Bayes optimal predictor $g^\star$ across environments, taking $\Phi$ as input. This implies $\mathbb{E}_{e_1}[Y|\Phi]=\mathbb{E}_{e_2}[Y|\Phi]$ almost surely for any $e_1,e_2$.
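As an illustration (a NumPy sketch on hypothetical synthetic environments), Definition 3.2 can be checked empirically by comparing the risk of a candidate $g^\star\circ\Phi$ against an estimate of the per-environment infimum, here approximated by the conditional mean $\mathbb{E}_e[Y\mid\Phi]$ on a discretized $\Phi$; the helper names and the binning scheme are our own:

```python
import numpy as np

rng = np.random.default_rng(0)

def risk(pred, y):
    return float(np.mean((pred - y) ** 2))

def env_slack(phi, y, g_star, n_bins=20):
    """Slack R_e(g* o Phi) - inf_g R_e(g o Phi): the infimum is approximated
    by the per-bin conditional mean E_e[Y | Phi] on a discretized Phi, so
    the estimate can come out slightly negative."""
    bins = np.clip((phi * n_bins).astype(int), 0, n_bins - 1)
    best = np.zeros_like(y)
    for b in range(n_bins):
        m = bins == b
        if m.any():
            best[m] = y[m].mean()
    return risk(g_star(phi), y) - risk(best, y)

# Two environments sharing E[Y | Phi] = Phi but with different noise levels,
# so g*(phi) = phi should be near-optimal in both.
envs = []
for scale in (0.05, 0.2):
    phi = rng.uniform(size=100_000)
    envs.append((phi, phi + rng.normal(scale=scale, size=phi.shape)))

slacks = [env_slack(phi, y, g_star=lambda p: p) for phi, y in envs]
print(slacks)  # near zero in both environments: approximately invariant
```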

IRM searches for a representation such that the optimal predictor on top of the representation is invariant across environments. Motivated by causality, the interaction between outcomes and their causes is also assumed invariant, so IRM learns a representation of causal variables for stable prediction. We extend Theorem 3.1 to representation learning and prove an equivalence between multicalibrated and invariant predictors.

Theorem 3.4 (Equivalence of Multicalibration and Invariance).

Assume samples are drawn from an environment $e\in\mathscr{E}$ with a prior $P_S(e)$ such that $\sum_{e\in\mathscr{E}}P_S(e)=1$ and $P_S(e)>0$. The overall population satisfies $P_S(X,Y)=\sum_{e\in\mathscr{E}}P_e(X,Y)P_S(e)$, where $P_e(X,Y)$ is the environment-specific absolutely continuous measure. With a measurable function $\Phi(x)$, define a function class $\mathcal{H}$ as:

$$\mathcal{H}:=\left\{h(x,y)=\frac{p_e(x,y)}{p_S(x,y)}\,\Big|\,e\in\mathscr{E}\right\}. \quad (6)$$

1. If there is a bijection $g^\star:\text{supp}(\Phi)\rightarrow[0,1]$ such that $g^\star\circ\Phi$ is $\alpha$-approximately $\ell_2$ multicalibrated w.r.t. $\mathcal{H}$ and $P_S$, then $\Phi$ elicits a $2\sqrt{\alpha}$-approximately invariant predictor $g^\star\circ\Phi$ across $\mathscr{E}$.

2. If there is $g^\star:\text{supp}(\Phi)\rightarrow[0,1]$ such that $\Phi$ elicits an $\alpha$-approximately invariant predictor $g^\star\circ\Phi$ across $\mathscr{E}$, then $g^\star\circ\Phi$ is $\sqrt{\alpha/D}$-approximately $\ell_2$ multicalibrated w.r.t. $\mathcal{H}$ and $P_S$, where $D=\min_{e\in\mathscr{E}}P_S(e)$.

Remark 3.5.

(1) In the first statement, assuming $g^\star$ is a bijection avoids degenerate cases where $\Phi$ contains redundant information. For example, every predictor $g^\star\circ\Phi(X)$ over the representation $\Phi$ equals $(g^\star\circ\Phi)\circ\mathbb{I}(X)$ over the identity representation $X$. Confining $g^\star$ to bijections ensures a unique decomposition into predictors and representations. (2) Wald et al. (2021) prove an equivalence between exact invariance and simultaneous calibration in each environment. We strengthen their result by showing that multicalibration on a single source distribution suffices for invariance. Moreover, our results extend directly beyond their multi-environment setting to a general uncertainty set of target distributions, via the mapping between grouping functions and density ratios. Further, our theorem is established for both exact and approximate invariance.

The theorem bridges approximate multicalibration with approximate invariance for out-of-distribution generalization beyond covariate shift. The equivalence indicates that the density ratios between target and source distributions constitute the minimal grouping function class required for robust prediction in terms of invariance.

4 Structure of Grouping Function Classes

Section 3 inspires one to construct richer grouping function classes for stronger generalizability. However, fewer predictors are multicalibrated with respect to a richer function class. For covariate-based grouping functions, the Bayes optimal predictor $\mathbb{E}[Y|X]$ is always multicalibrated. But the existence of a multicalibrated predictor is nontrivial for joint grouping functions. For example, when we take the grouping function $h(x,y)=y$, a multicalibrated predictor $f$ satisfies $f(X)=\mathbb{E}[Y|f(X),h(X,Y)]=Y$, which is impossible for regression with label noise. In this section, we first study the maximal grouping function class that is feasible for a multicalibrated predictor. Then, we leverage our structural results to inform the design of grouping functions.

4.1 Maximal Grouping Function Space

We focus on continuous grouping functions defined on a compact set $\mathcal{X}\times\mathcal{Y}\subset\mathbb{R}^{d+1}$, i.e., $h\in C(\mathcal{X}\times\mathcal{Y})$, and consider absolutely continuous probability measures supported on $\mathcal{X}\times\mathcal{Y}$ with continuous density functions. Our first proposition shows that the maximal grouping function class for any predictor is a linear space.

Proposition 4.1 (Maximal Grouping Function Class).

Given an absolutely continuous probability measure $P_S(X,Y)$ and a predictor $f:\mathcal{X}\rightarrow[0,1]$, define the maximal grouping function class with respect to which $f$ is multicalibrated:

$$\mathcal{H}_f:=\{h\in C(\mathcal{X}\times\mathcal{Y}):K_2(f,h,P_S)=0\}. \quad (7)$$

Then $\mathcal{H}_f$ is a linear space.

In the following, we further analyze the spanning set of the maximal grouping function class for nontrivial predictors that are at least calibrated.

Theorem 4.2 (Spanning Set).

Consider an absolutely continuous probability measure $P_S(X,Y)$ and a calibrated predictor $f:\mathcal{X}\rightarrow[0,1]$. Then its maximal grouping function class $\mathcal{H}_f$ is given by:

$$\mathcal{H}_f=\mathrm{span}\left\{\frac{p(x,y)}{p_S(x,y)}:\text{$p$ is continuous and }R_P(f)=\inf_{g:[0,1]\rightarrow[0,1]}R_P(g\circ f)\right\}. \quad (8)$$

A predictor's maximal grouping function class is spanned by the density ratios of the target distributions on which the predictor is invariant. Correspondingly, Theorem 3.1 gives the minimal grouping function class, comprising density ratios between target and source distributions, required to ensure $f(x)$ is an invariant predictor. In contrast, Theorem 4.2 states that the maximal grouping function class for $f(x)$ is exactly the linear space spanned by those density ratios. Next, we investigate sub-structures of the maximal grouping function class, focusing on the representation learning setting of IRM.

Theorem 4.3 (Decomposition of Grouping Function Space).

Consider an absolutely continuous probability measure $P_S(X,Y)$ and a measurable function $\Phi:\mathbb{R}^d\rightarrow\mathbb{R}^{d_\Phi}$ with $d_\Phi\in\mathbb{Z}^+$. We define the Bayes optimal predictor over $\Phi$ as $f_\Phi(x)=\mathbb{E}_{P_S}[Y|\Phi(x)]$, and abbreviate $\mathcal{H}_{f_\Phi}$ as $\mathcal{H}_\Phi$. Then $\mathcal{H}_\Phi$ can be decomposed as the Minkowski sum $\mathcal{H}_{1,\Phi}+\mathcal{H}_{2,\Phi}$:

$$\mathcal{H}_{1,\Phi}=\mathrm{span}\left\{\frac{p(\Phi,y)}{p_S(\Phi,y)}:\text{$p$ is continuous and }R_P(f_\Phi)=\inf_{g:[0,1]\rightarrow[0,1]}R_P(g\circ f_\Phi)\right\}. \quad (9)$$
$$\mathcal{H}_{2,\Phi}=\mathrm{span}\left\{\frac{p(x|\Phi,y)}{p_S(x|\Phi,y)}:\text{$p$ is continuous}\right\}. \quad (10)$$

1. If a predictor $f$ is multicalibrated w.r.t. $\mathcal{H}_{1,\Phi}$, then $R_{P_S}(f)\leq R_{P_S}(f_\Phi)$.

2. $f_\Phi$ is an invariant predictor elicited by $\Phi$ across any set of environments $\mathscr{E}$ where $P_e(\Phi,Y)=P_S(\Phi,Y)$ for all $e\in\mathscr{E}$. If a predictor $f$ is multicalibrated w.r.t. $\mathcal{H}_{2,\Phi}$, then $f$ is also an invariant predictor across $\mathscr{E}$, elicited by some representation.

Remark 4.4.

$\mathcal{H}_{1,\Phi}$ and $\mathcal{H}_{2,\Phi}$ contain functions defined on $x,\Phi(x),y$, which can be rewritten as functions on $x,y$ by variable substitution. Thus, $\mathcal{H}_{1,\Phi}$ and $\mathcal{H}_{2,\Phi}$ are still subspaces of grouping functions. $\mathcal{H}_{1,\Phi}$ is spanned by the density ratios of $P(\Phi,Y)$ on which the Bayes optimal predictor over $\Phi$ must be invariant. $\mathcal{H}_{2,\Phi}$ is spanned by general density ratios of $P(X|\Phi,Y)$.

Multicalibration w.r.t. $\mathcal{H}_{1,\Phi}$ ensures at least the accuracy of the Bayes optimal predictor over $\Phi$, and multicalibration w.r.t. $\mathcal{H}_{2,\Phi}$ ensures at least the invariance of this predictor. However, we show in Proposition F.20 that the sizes of the two subspaces are negatively correlated. When $\Phi$ is a variable selector, $\mathcal{H}_{1,\Phi}$ expands with more selected covariates while $\mathcal{H}_{2,\Phi}$ shrinks. By choosing a combination of $\mathcal{H}_{1,\Phi}$ and $\mathcal{H}_{2,\Phi}$, we strike a balance between the accuracy and invariance of the multicalibrated predictor.

Proposition 4.5 (Monotonicity).

Consider $X\in\mathbb{R}^d$ which can be sliced as $X=(\Phi,\Psi)^T$ and $\Phi=(\Lambda,\Omega)^T$. Define $\mathcal{H}_{1,\Phi}':=\{h(\Phi(x))\in C(\mathcal{X}\times\mathcal{Y})\}$, with $\mathcal{H}_{1,X}'$ and $\mathcal{H}_{1,\Lambda}'$ defined similarly. We have:

1. $\mathcal{H}_{1,X}'\supset\mathcal{H}_{1,\Phi}'\supset\mathcal{H}_{1,\Lambda}'\supset\mathcal{H}_{1,\emptyset}'=\{C\}$.

2. $\{C\}=\mathcal{H}_{2,X}\subset\mathcal{H}_{2,\Phi}\subset\mathcal{H}_{2,\Lambda}\subset\mathcal{H}_{2,\emptyset}$.

$C$ denotes a constant function.

4.2 Design of Grouping Function Classes

The objective of a robust learning method can be represented by a tuple consisting of an assumption about the boundary of distribution shift and a metric of robustness. Multicalibration is equivalent to invariance as the metric of robustness, while the grouping function class provides a unifying view of assumptions about potential distribution shift. Given any uncertainty set of target distributions $\mathcal{P}$, Theorem 4.2 implies an efficient and reasonable construction of grouping functions as linear combinations of density ratios from $\mathcal{P}$. We implement two designs of grouping functions, for the learning settings with and without environment annotations respectively.

From Environments   If samples are drawn from multiple environments and environment annotations are available, we take the uncertainty set to be the union of each environment's distribution $P_e$. This exactly recovers IRM's objective, but we approach it with a different optimization technique in the next section. Taking the pooled data as the source $S$, the density ratios spanning the grouping function class are $h_e(x,y)=p_e(x,y)/p_S(x,y)=p_S(e|x,y)/p_S(e)$, where $p_S(e|x,y)$ is estimated by an environment classifier. A grouping function can then be represented as a linear combination of the $h_e$:

$$h(x,y)=\sum_{e\in\mathscr{E}}\lambda_e\,p_S(e|x,y),\qquad\lambda_e\in\mathbb{R}. \quad (11)$$
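The environment-based design can be sketched as follows in NumPy. The two synthetic environments and the histogram-based environment classifier are our own assumptions (any probabilistic classifier estimating $p_S(e|x,y)$ would do); `sample` and `posterior` are hypothetical helper names:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two environments with different Y|X; pooled data forms the source S.
def sample(n, slope):
    x = rng.uniform(size=n)
    return x, slope * x + rng.normal(scale=0.1, size=n)

(x0, y0), (x1, y1) = sample(50_000, 1.0), sample(50_000, 0.5)
x = np.concatenate([x0, x1]); y = np.concatenate([y0, y1])
e = np.concatenate([np.zeros_like(x0), np.ones_like(x1)]).astype(int)

# Histogram estimate of the environment posterior p_S(e | x, y).
def posterior(x, y, e, n_bins=20):
    xb = np.clip((x * n_bins).astype(int), 0, n_bins - 1)
    yb = np.clip(((y + 0.5) * n_bins / 2).astype(int), 0, n_bins - 1)
    counts = np.zeros((n_bins, n_bins, 2))
    np.add.at(counts, (xb, yb, e), 1.0)             # joint (x, y, e) counts
    tot = counts.sum(axis=2, keepdims=True)
    return (counts / np.maximum(tot, 1.0))[xb, yb, :]

p = posterior(x, y, e)                               # p_S(e | x_i, y_i)

# A grouping function is a linear combination h(x, y) = sum_e lambda_e p_S(e|x,y).
lam = np.array([1.0, -1.0])
h = p @ lam
print(h.mean())
```

With $\lambda=(1,-1)$ and equal environment priors, $\mathbb{E}_S[h]=p_S(e{=}0)-p_S(e{=}1)=0$, which the empirical mean reproduces.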

From Hard Samples   When data contains latent sub-populations without annotations, the uncertainty set can be constructed by identifying the sub-populations. Hard sample learning (Li et al., 2021; Lin et al., 2020; Liu et al., 2021) suggests that the risk is an indicator of sub-population structure: samples from the minority sub-population $M$ are more likely to have high risks. For example, JTT (Liu et al., 2021) identifies the minority subgroup using a risk threshold of a trained predictor $f_{id}$. We adopt a continuous grouping by assuming $P_S(X,Y\in M)\propto(f_{id}(X)-Y)^2$. We construct the uncertainty set as the union of the source $S$ and the minority sub-population $M$, resulting in grouping functions of the form:

$$h(x,y)=\lambda_M(f_{id}(x)-y)^2+\lambda_S,\qquad\lambda_M,\lambda_S\in\mathbb{R}. \quad (12)$$
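A minimal NumPy sketch of this hard-sample design, with hypothetical synthetic sub-populations of our own; here a least-squares fit merely stands in for the trained identification model $f_{id}$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Source data with a majority and a (harder) minority sub-population.
x = rng.uniform(size=50_000)
minority = rng.random(x.shape) < 0.1                 # latent, unannotated
y = np.where(minority, -x, x) + rng.normal(scale=0.05, size=x.shape)

# Identification model f_id: a least-squares line fit on the pooled
# (majority-dominated) source, standing in for a trained predictor.
a, b = np.polyfit(x, y, 1)
f_id = a * x + b

# Grouping function h(x, y) = lambda_M * (f_id(x) - y)^2 + lambda_S:
# the squared residual upweights likely-minority (hard) samples.
resid2 = (f_id - y) ** 2
lam_M, lam_S = 1.0, 0.5
h = lam_M * resid2 + lam_S

print(resid2[minority].mean(), resid2[~minority].mean())
```

The residual-based weight is much larger on the minority sub-population, matching the assumption $P_S(X,Y\in M)\propto(f_{id}(X)-Y)^2$.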

Another design, utilizing the assumption of Distributionally Robust Optimization (Duchi et al., 2019), is given in Section B.

5 MC-PseudoLabel: An Algorithm for Extended Multicalibration

In this section, we introduce an algorithm for multicalibration with respect to joint grouping functions. Simultaneously, the algorithm provides a new optimization paradigm for invariant prediction under distribution shift. The algorithm, called MC-PseudoLabel, post-processes a trained model by supervised learning with pseudolabels generated by grouping functions. As shown in Algorithm 1, given a predictor function class $\mathcal{F}$ and a dataset $D$ with empirical distribution $\hat{P}_D(X,Y)$, a regression oracle $A$ solves the optimization $A_{\mathcal{F}}(D)=\arg\min_{f\in\mathcal{F}}R_{\hat{P}_D}(f)$. We take as input a model $f_0$, possibly trained by Empirical Risk Minimization. $f_0$ has a finite range, following conventions of prior work in multicalibration (Globus-Harris et al., 2023). For continuous predictors, we discretize the model output, introducing a small rounding error (see Section C). In each iteration, the algorithm performs regression with grouping functions on each level set of the model. The predictions of the grouping functions rectify the uncalibrated model and serve as pseudolabels for model updates.

Algorithm 1 MC-Pseudolabel
Require: A dataset D=(D_{x},D_{y}), a grouping function class \mathcal{H}, a predictive function class \mathcal{F}.
1:  t\leftarrow 0;
2:  f_{0}\leftarrow\text{Initialization};     {For example, a model trained with ERM.}
3:  m\leftarrow|\text{Range}(\text{Discretize}(f_{0}))|;
4:  repeat
5:     f_{t}^{\prime}\leftarrow\text{Round}(f_{t};m):=\arg\min_{v\in[1/m]}|f_{t}(x)-v|;
6:     err_{t}=\mathbb{E}_{x,y\sim D}[(f_{t}^{\prime}(x)-y)^{2}];
7:     for each v\in[1/m] do
8:        D_{v}^{t}\leftarrow D|f_{t}^{\prime}(x)=v;
9:        h_{v}^{t}(x,y)\leftarrow A_{\mathcal{H}}(D_{v}^{t});     {Regression on level sets with grouping functions.}
10:     end for
11:     \tilde{f}_{t+1}(x,y)\leftarrow\sum_{v\in[\frac{1}{m}]}1_{\{f_{t}^{\prime}(x)=v\}}\cdot h_{v}^{t}(x,y);     {Generate pseudolabels.}
12:     \tilde{err}_{t+1}\leftarrow\mathbb{E}_{x,y\sim D}[(\tilde{f}_{t+1}(x,y)-y)^{2}];
13:     D_{t+1}\leftarrow(D_{x},\tilde{f}_{t+1}(D));
14:     f_{t+1}(x)\leftarrow A_{\mathcal{F}}(D_{t+1});     {Update the model with pseudolabels.}
15:     t\leftarrow t+1;
16:  until err_{t-1}-\tilde{err}_{t} stops decreasing
Output: f_{t-1}^{\prime}(x).
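To make the loop concrete, the following is a minimal numpy sketch of Algorithm 1, not the paper's implementation: the regression oracle is ordinary least squares with an intercept, and the hypothetical `group_feats` argument selects which features the grouping class sees (passing only covariates recovers the covariate-based special case).

```python
import numpy as np

def lstsq_oracle(features, target):
    """Regression oracle A(D): least squares with an intercept."""
    A = np.column_stack([features, np.ones(len(features))])
    w, *_ = np.linalg.lstsq(A, target, rcond=None)
    return lambda F: np.column_stack([F, np.ones(len(F))]) @ w

def round_to_grid(pred, m):
    """Round(f; m): snap predictions to the grid {1/m, 2/m, ..., 1}."""
    grid = (np.arange(m) + 1.0) / m
    return grid[np.abs(pred[:, None] - grid[None, :]).argmin(axis=1)]

def mc_pseudolabel(x, y, group_feats, m=10, max_iter=50, tol=1e-8):
    f = lstsq_oracle(x, y)                      # f_0: ERM initialization
    prev_gap, out = np.inf, f
    for _ in range(max_iter):
        level = round_to_grid(f(x), m)          # f'_t
        err = np.mean((level - y) ** 2)         # err_t
        pseudo = level.copy()
        for v in np.unique(level):              # regression on each level set
            idx = level == v
            if idx.sum() >= 3:                  # skip near-empty level sets
                h = lstsq_oracle(group_feats(x[idx], y[idx]), y[idx])
                pseudo[idx] = h(group_feats(x[idx], y[idx]))
        gap = err - np.mean((pseudo - y) ** 2)  # err_t - err~_{t+1}
        if gap >= prev_gap - tol:               # stop once the gap stops decreasing
            break
        prev_gap, out = gap, f
        f = lstsq_oracle(x, pseudo)             # project pseudolabels back onto F
    return lambda xq: round_to_grid(out(xq), m)
```

With `group_feats=lambda xs, ys: xs` the level-set regressions use covariates only; a joint grouping class would also pass features of `ys`, subject to the conditions of Theorem 4.2.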

Since we regress Y with grouping functions defined on Y, a poorly designed grouping class that violates Theorem 4.2 can produce trivial outputs. For example, if the grouping functions contain h(x,y)=y, the level-set regressions simply return the labels, so \tilde{err}_{t}=0 at every round: err_{t-1}-\tilde{err}_{t} never decreases, and the algorithm outputs f_{0}, because no multicalibrated predictor exists in this case. However, the algorithm certifies a multicalibrated output whenever it converges.
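A toy numerical check of this failure mode (synthetic data, with numpy least squares standing in for the regression oracle): once h(x,y)=y lies in the span of the grouping class, the regression reproduces the labels exactly and the pseudolabels carry no new information.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(100, 2))
y = x @ np.array([1.0, -0.5]) + rng.normal(0, 0.1, 100)

# A grouping class containing h(x, y) = y: regress y on (x, y) jointly.
A = np.column_stack([x, y, np.ones(100)])
c, *_ = np.linalg.lstsq(A, y, rcond=None)
pseudo = A @ c

# Since y is itself a column of A, the least-squares residual is exactly
# zero: pseudolabels equal the labels, err~_t = 0, and the stopping gap
# err_{t-1} - err~_t stays frozen at err_{t-1}.
print(np.mean((pseudo - y) ** 2))  # numerically ~0
```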

Theorem 5.1 (Certified Multicalibration).

In Algorithm 1, for \alpha,B>0, if err_{t-1}-\tilde{err}_{t}\leq\frac{\alpha}{B}, the output f_{t-1}^{\prime}(x) is \alpha-approximately \ell_{2} multicalibrated w.r.t. \mathcal{H}_{B}=\{h\in\mathcal{H}:\sup h(x,y)^{2}\leq B\}.

MC-Pseudolabel reduces to LSBoost (Globus-Harris et al., 2023), a boosting algorithm for multicalibration, if \mathcal{H} contains only covariate-based grouping functions. In this case, Line 14 of Algorithm 1 reduces to f_{t+1}(x)=\tilde{f}_{t+1}(x,\cdot), where \tilde{f}_{t+1} does not depend on y. For joint grouping functions, since \tilde{f}_{t+1}\in\mathbb{R}^{\mathcal{X}\times\mathcal{Y}}, we project it onto the models' space \mathbb{R}^{\mathcal{X}} by fitting the model with \tilde{f}_{t+1} as pseudolabels. This projection substantially changes the optimization dynamics. LSBoost monotonically decreases the model's risk, since R_{\hat{P}_{D}}(f_{t+1})=R_{\hat{P}_{D}}(\tilde{f}_{t+1})<R_{\hat{P}_{D}}(f_{t}). The projection step disrupts this monotonicity, so MC-Pseudolabel can output a predictor with a higher risk than its input. This is because multicalibration with joint grouping functions implies a balance between accuracy and invariance, as discussed in Theorem 4.3. The convergence proof of LSBoost relies on the monotonicity of risks and therefore does not carry over to MC-Pseudolabel. Instead, we study the algorithm's convergence in the context of representation learning. Assume we are given a grouping function class \mathcal{H}_{\Phi} with a latent representation \Phi. If a predictor is multicalibrated w.r.t. \mathcal{H}_{1,\Phi} and \mathcal{H}_{2,\Phi} respectively, then it is also multicalibrated w.r.t. \mathcal{H}_{\Phi}. We therefore study convergence for the two grouping function classes separately. In Proposition F.26, we show convergence for a subset of \mathcal{H}_{1,\Phi} consisting of covariate-based grouping functions, as a corollary of Globus-Harris et al.'s result. As a greater challenge, we derive convergence for \mathcal{H}_{2,\Phi} when the data follows a multivariate normal distribution.

Theorem 5.2 (Convergence for \mathcal{H}_{2,\Phi} (informal, see Theorem F.27)).

Consider X\in\mathbb{R}^{d} with X=(\Phi,\Psi)^{T}. Assume that (\Phi,\Psi,Y) follows a multivariate normal distribution \mathcal{N}_{d+1}(\mu,\Sigma) where the random variables are in general position such that \Sigma is positive definite. For any distribution D supported on \mathcal{X}\times\mathcal{Y}, take the grouping function class \mathcal{H}=\{h\in\mathcal{H}_{2,\Phi}:h(x,y)=c_{x}^{T}x+c_{y}y+c_{b},\ c_{x}\in\mathbb{R}^{d},\ c_{y},c_{b}\in\mathbb{R}\} and the predictor class \mathcal{F}=\mathbb{R}^{\mathcal{X}}. For an initial predictor f_{0}(x)=\mathbb{E}[Y|x], run MC-Pseudolabel(D,\mathcal{H},\mathcal{F}) without rounding; then f_{t}(x)\rightarrow\mathbb{E}[Y|\Phi(x)] as t\rightarrow\infty with a convergence rate of \mathcal{O}(M(\Sigma)^{t}) where 0\leq M(\Sigma)<1.
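Under normality, the limit point \mathbb{E}[Y|\Phi(x)] has the closed form \mathbb{E}[Y|\Phi=\varphi]=\mu_{Y}+\Sigma_{Y\Phi}\Sigma_{\Phi\Phi}^{-1}(\varphi-\mu_{\Phi}). A quick numerical check on hypothetical Gaussian data (this verifies only the closed-form limit, not the full iteration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
# Hypothetical jointly Gaussian (Phi, Psi, Y): Psi touches Y only through
# Phi plus independent noise, so E[Y|X] and E[Y|Phi] differ in general.
phi = rng.normal(0.0, 1.0, n)
y = 2.0 * phi + rng.normal(0.0, 0.5, n)
psi = 0.7 * y + rng.normal(0.0, 1.0, n)

# Closed-form conditional mean under normality (one-dimensional Phi).
slope = np.cov(y, phi, ddof=0)[0, 1] / np.var(phi)
cond_mean = y.mean() + slope * (phi - phi.mean())

# An ordinary least-squares fit of Y on Phi recovers the same linear function.
w = np.polyfit(phi, y, 1)
fit = np.polyval(w, phi)
```

Here `slope` estimates \Sigma_{Y\Phi}\Sigma_{\Phi\Phi}^{-1} (equal to 2 by construction), and `fit` agrees with `cond_mean` up to numerical precision.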

MC-Pseudolabel also serves as an optimization paradigm for invariance: certified multicalibration in Theorem 5.1 implies certified invariance. Furthermore, MC-Pseudolabel introduces no extra hyperparameters to trade off risk against robustness. Both certified invariance and lightweight hyperparameters simplify model selection, which is challenging for out-of-distribution generalization because validation data from target distributions is unavailable (Gulrajani and Lopez-Paz, 2021). The optimization itself is also lightweight, consisting of a series of supervised regressions. Compared with Empirical Risk Minimization, the overhead is the regression on level sets; by designing the grouping function class as a linear space, this extra step amounts to linear regression. Furthermore, regressions on different level sets can be parallelized. Computational complexity is further analyzed in section D.
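Because the level-set regressions are independent of one another, they parallelize trivially. A sketch with a thread pool (the level sets here are illustrative stand-ins from a crude model; threads suffice when the oracle releases the GIL, as numpy's LAPACK-backed routines typically do):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def fit_level_set(feats, target):
    """One independent linear regression per level set."""
    A = np.column_stack([feats, np.ones(len(feats))])
    w, *_ = np.linalg.lstsq(A, target, rcond=None)
    return A @ w

rng = np.random.default_rng(0)
x = rng.normal(size=(300, 2))
y = x[:, 0] + rng.normal(0, 0.1, 300)
pred = x[:, 0]                                # a crude stand-in model
levels = np.digitize(pred, [-0.5, 0.5])       # its (coarse) level sets

# Submit each level-set regression to the pool and gather pseudolabels.
with ThreadPoolExecutor() as pool:
    jobs = {v: pool.submit(fit_level_set, x[levels == v], y[levels == v])
            for v in np.unique(levels)}
pseudo = np.empty_like(y)
for v, job in jobs.items():
    pseudo[levels == v] = job.result()
```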

6 Experiments

6.1 Settings

We benchmark MC-Pseudolabel on real-world regression datasets with distributional shift. We adopt two experimental settings. For the multi-environment setting, algorithms are provided with training data collected from multiple annotated environments. Thereafter, the trained model is assessed on new environments. For the single-environment setting, algorithms are trained on a single source distribution. There could be latent sub-populations in training data, but environment annotations are unavailable. The trained model is assessed on a target dataset with distribution shift from the source.

Datasets We experiment on PovertyMap (Yeh et al., 2020) and ACSIncome (Ding et al., 2021) for the multi-environment setting, and VesselPower (Malinin et al., 2022) for the single-environment setting. As the only regression task in WILDS (Koh et al., 2021), a popular benchmark for in-the-wild distribution shift, PovertyMap estimates a poverty index for different spatial regions from satellite images. Data are collected from both urban and rural regions, which define the environment annotations. The test set also covers both environments but is collected from different countries. The primary metric is Worst-U/R Pearson, the worse of the Pearson correlations between predictions and labels over rural and urban regions. The other two datasets are tabular, where natural concept shift (Y|X shift) is more common due to missing variables and hidden confounders (Liu et al., 2023). ACSIncome (Ding et al., 2021) predicts personal income from US Census data collected across different US states. The original task is binary classification by an income threshold, but we use the raw data for regression. Environments are partitioned by occupations with similar average income. VesselPower comes from Shifts (Malinin et al., 2021, 2022), a benchmark focusing on regression tasks with real-world distribution shift. The objective is to predict the power consumption of a merchant vessel given navigation and weather data. Data are sampled at different times and wind speeds, causing distribution shift between training and test data.

Baselines For the multi-environment setting, baselines include ERM (Empirical Risk Minimization); methods for invariance learning, which mostly adopt multi-objective optimization: IRM (Arjovsky et al., 2019), MIP (Koyama and Yamaguchi, 2020), IB-IRM (Ahuja et al., 2021), CLOvE (Wald et al., 2021), MRI (Huh and Baidya, 2022), REX (Krueger et al., 2021), and Fishr (Ramé et al., 2022); an alignment-based method from domain generalization: IDGM (Shi et al., 2021); and GroupDRO (Sagawa et al., 2019). Notably, CLOvE learns a predictor calibrated simultaneously on all environments, but it is optimized by multi-objective learning with a differentiable calibration regularizer. For the single-environment setting, baselines include reweighting-based techniques: CVaR (Levy et al., 2020), JTT (Liu et al., 2021), and Tilted-ERM (Li et al., 2021); a Distributionally Robust Optimization method, \chi^{2}-DRO (Duchi and Namkoong, 2019); and a data augmentation method, C-Mixup (Yao et al., 2022). Other methods are not included because they are specific to classification (Zhang et al., 2018, 2022) or require exposure to target-distribution data during training (Idrissi et al., 2022; Kirichenko et al., 2023). For all experiments, we also train an Oracle ERM with data sampled from the target distribution.

Implementation  We implement the predictor as an MLP for ACSIncome and VesselPower, and as ResNet18-MS (He et al., 2016) for PovertyMap, following WILDS' default architectures. The grouping function class is implemented according to Equation 11 for the multi-environment setting and Equation 12 for the single-environment setting. We follow DomainBed's protocol (Gulrajani and Lopez-Paz, 2021) for model selection. Specifically, we randomly sample 20 sets of hyperparameters for each method, covering both the training hyperparameters and any extra hyperparameters of the robust learning algorithm. We select the best model across hyperparameters under three model selection criteria: in-distribution validation (average performance on training data), worst-environment validation (worst performance across training environments), and oracle validation (performance on target data). Oracle validation is discouraged by DomainBed, which suggests limiting the number of accesses to target data. Each entire run is repeated with three different seeds to measure the standard error of performance. For PovertyMap specifically, an OOD validation set is provided for oracle validation, and we perform 5-fold cross-validation instead of three repeated runs, following WILDS' setup.
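The three selection criteria amount to three scoring rules applied to the same pool of runs. A schematic sketch of the protocol (field names here are illustrative, not from the paper's code):

```python
import random

def select_model(runs, criterion):
    """Pick one run from a pool of hyperparameter samples by a validation rule."""
    if criterion == "id":          # average error over training environments
        score = lambda r: sum(r["val_errors"]) / len(r["val_errors"])
    elif criterion == "worst":     # worst error across training environments
        score = lambda r: max(r["val_errors"])
    elif criterion == "oracle":    # error on held-out target-distribution data
        score = lambda r: r["oracle_error"]
    else:
        raise ValueError(f"unknown criterion: {criterion}")
    return min(runs, key=score)

# 20 hyperparameter samples per method, as in the protocol above
# (errors are random placeholders for actual validation results).
random.seed(0)
runs = [{"lr": 10 ** random.uniform(-4, -2),
         "val_errors": [random.random() for _ in range(3)],
         "oracle_error": random.random()}
        for _ in range(20)]
best_worst = select_model(runs, "worst")
```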

6.2 Results

Table 1: Results on multi-environment datasets, evaluated on test data using three model selection criteria. ID: validation with averaged performance on training data. Worst: validation with the worst performance across training environments. Oracle: validation with performance on sampled test set.
Method | ACSIncome RMSE ↓: ID | Worst | Oracle | PovertyMap Worst-U/R Pearson ↑: ID | Worst | Oracle
ERM | 0.487±0.009 | 0.487±0.009 | 0.452±0.012 | 0.48±0.06 | 0.48±0.06 | 0.49±0.07
IRM | 0.466±0.002 | 0.466±0.002 | 0.465±0.002 | 0.38±0.07 | 0.39±0.06 | 0.45±0.08
MIP | 0.457±0.008 | 0.454±0.012 | 0.454±0.012 | 0.40±0.09 | 0.39±0.10 | 0.43±0.08
IB-IRM | 0.463±0.003 | 0.463±0.003 | 0.438±0.009 | 0.39±0.07 | 0.37±0.05 | 0.43±0.06
CLOvE | 0.455±0.005 | 0.454±0.002 | 0.450±0.005 | 0.46±0.09 | 0.42±0.13 | **0.51**±0.06
MRI | 0.458±0.011 | 0.458±0.011 | 0.455±0.013 | 0.47±0.10 | 0.46±0.08 | 0.49±0.07
REX | 0.466±0.009 | 0.464±0.009 | 0.458±0.003 | 0.43±0.09 | 0.42±0.09 | 0.45±0.09
Fishr | 0.458±0.006 | 0.455±0.012 | 0.450±0.008 | 0.42±0.09 | 0.41±0.09 | 0.43±0.08
IDGM | 1.843±0.018 | 1.843±0.018 | 1.843±0.018 | 0.02±0.07 | 0.01±0.15 | 0.13±0.14
GroupDRO | 0.481±0.035 | 0.449±0.017 | 0.433±0.013 | 0.38±0.15 | 0.37±0.16 | 0.42±0.12
MC-Pseudolabel | **0.428**±0.009 | **0.425**±0.012 | **0.411**±0.011 | **0.50**±0.07 | **0.50**±0.07 | **0.51**±0.08
Oracle ERM | 0.332±0.001 (all criteria) | | | 0.71±0.05 (all criteria) | |


Figure 1: Accuracy-on-the-line beyond covariate shift: correlation between models’ in-distribution and out-of-distribution risks on VesselPower.

Results are shown in Table 1 for the multi-environment setting and Table 2 for the single-environment setting. MC-Pseudolabel achieves superior performance on all datasets under in-distribution and worst-environment validation, which do not access test data. Under oracle validation, MC-Pseudolabel achieves performance comparable to the best method. For example, CLOvE, which also learns invariance through calibration, achieves the best performance under oracle validation on PovertyMap, but degrades sharply when target validation data is unavailable. This is because CLOvE tunes its regularizer's coefficient to trade off against ERM risk, and the optimal coefficient depends on the target distribution shift.

Table 2: Single-environment results.
Method | VesselPower RMSE ↓: ID | Oracle
ERM | 1.92±0.23 | 1.86±0.19
CVaR | 1.69±0.18 | **1.49**±0.10
JTT | 1.75±0.27 | 1.58±0.15
Tilted-ERM | 1.72±0.21 | 1.61±0.12
\chi^{2}-DRO | 1.69±0.20 | 1.56±0.06
C-Mixup | 1.72±0.15 | 1.56±0.08
MC-Pseudolabel | **1.61**±0.20 | 1.52±0.16
Oracle ERM | 1.18±0.01 (both criteria) |

In contrast, MC-Pseudolabel exhibits an advantage under in-distribution model selection. This is further supported by Figure 1, which shows that MC-Pseudolabel's out-of-distribution errors strongly correlate with its in-distribution errors. The experiment spans different hyperparameters and seeds with the same model architecture on VesselPower. This phenomenon, known as accuracy-on-the-line (Miller et al., 2021), is well established for a broad class of models under covariate shift. However, Liu et al. (2023) show that accuracy-on-the-line does not hold under concept shift, which is the case for ERM and C-Mixup here; this poses a significant challenge for model selection. MC-Pseudolabel restores the accuracy-on-the-line phenomenon.
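The correlation underlying Figure 1 is simply the Pearson coefficient between per-run in-distribution and out-of-distribution errors. A self-contained illustration (with synthetic errors standing in for actual runs):

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation between two equal-length samples."""
    a = np.asarray(a, float) - np.mean(a)
    b = np.asarray(b, float) - np.mean(b)
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

# Hypothetical per-run RMSEs over hyperparameters and seeds. Under
# accuracy-on-the-line, OOD error is roughly an affine function of ID error.
rng = np.random.default_rng(0)
id_err = rng.uniform(1.4, 2.0, 40)
ood_err = 1.3 * id_err + 0.2 + rng.normal(0, 0.02, 40)
r = pearson(id_err, ood_err)  # close to 1 when the runs lie on a line
```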

7 Conclusion

In conclusion, we establish a new optimization framework for out-of-distribution generalization through extended multicalibration with joint grouping functions. While the current algorithm focuses on regression, future work could extend our approach to more general tasks, particularly classification.

References

  • Ahuja et al. (2021) Kartik Ahuja, Ethan Caballero, Dinghuai Zhang, Jean-Christophe Gagnon-Audet, Yoshua Bengio, Ioannis Mitliagkas, and Irina Rish. Invariance principle meets information bottleneck for out-of-distribution generalization. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 3438–3450, 2021. URL https://proceedings.neurips.cc/paper/2021/hash/1c336b8080f82bcc2cd2499b4c57261d-Abstract.html.
  • Arjovsky et al. (2019) Martín Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. CoRR, abs/1907.02893, 2019. URL http://arxiv.org/abs/1907.02893.
  • Blasiok et al. (2023a) Jaroslaw Blasiok, Parikshit Gopalan, Lunjia Hu, and Preetum Nakkiran. A unifying theory of distance from calibration. In Barna Saha and Rocco A. Servedio, editors, Proceedings of the 55th Annual ACM Symposium on Theory of Computing, STOC 2023, Orlando, FL, USA, June 20-23, 2023, pages 1727–1740. ACM, 2023a. doi: 10.1145/3564246.3585182. URL https://doi.org/10.1145/3564246.3585182.
  • Blasiok et al. (2023b) Jaroslaw Blasiok, Parikshit Gopalan, Lunjia Hu, and Preetum Nakkiran. When does optimizing a proper loss yield calibration? In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023b. URL http://papers.nips.cc/paper_files/paper/2023/hash/e4165c96702bac5f4962b70f3cf2f136-Abstract-Conference.html.
  • Creager et al. (2021) Elliot Creager, Jörn-Henrik Jacobsen, and Richard S. Zemel. Environment inference for invariant learning. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 2189–2200. PMLR, 2021. URL http://proceedings.mlr.press/v139/creager21a.html.
  • Deng et al. (2023) Zhun Deng, Cynthia Dwork, and Linjun Zhang. Happymap : A generalized multicalibration method. In Yael Tauman Kalai, editor, 14th Innovations in Theoretical Computer Science Conference, ITCS 2023, January 10-13, 2023, MIT, Cambridge, Massachusetts, USA, volume 251 of LIPIcs, pages 41:1–41:23. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2023. doi: 10.4230/LIPICS.ITCS.2023.41. URL https://doi.org/10.4230/LIPIcs.ITCS.2023.41.
  • Ding et al. (2021) Frances Ding, Moritz Hardt, John Miller, and Ludwig Schmidt. Retiring adult: New datasets for fair machine learning. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 6478–6490, 2021. URL https://proceedings.neurips.cc/paper/2021/hash/32e54441e6382a7fbacbbbaf3c450059-Abstract.html.
  • Duchi and Namkoong (2019) John C. Duchi and Hongseok Namkoong. Variance-based regularization with convex objectives. J. Mach. Learn. Res., 20:68:1–68:55, 2019. URL http://jmlr.org/papers/v20/17-750.html.
  • Duchi and Namkoong (2021) John C Duchi and Hongseok Namkoong. Learning models with uniform performance via distributionally robust optimization. The Annals of Statistics, 49(3):1378–1406, 2021.
  • Duchi et al. (2019) John C Duchi, Tatsunori Hashimoto, and Hongseok Namkoong. Distributionally robust losses against mixture covariate shifts. Under review, 2(1), 2019.
  • Duchi et al. (2021) John C. Duchi, Peter W. Glynn, and Hongseok Namkoong. Statistics of robust optimization: A generalized empirical likelihood approach. Math. Oper. Res., 46(3):946–969, 2021. doi: 10.1287/MOOR.2020.1085. URL https://doi.org/10.1287/moor.2020.1085.
  • Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 1126–1135. PMLR, 2017. URL http://proceedings.mlr.press/v70/finn17a.html.
  • Globus-Harris et al. (2023) Ira Globus-Harris, Declan Harrison, Michael Kearns, Aaron Roth, and Jessica Sorrell. Multicalibration as boosting for regression. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 11459–11492. PMLR, 2023. URL https://proceedings.mlr.press/v202/globus-harris23a.html.
  • Gopalan et al. (2022) Parikshit Gopalan, Adam Tauman Kalai, Omer Reingold, Vatsal Sharan, and Udi Wieder. Omnipredictors. In Mark Braverman, editor, 13th Innovations in Theoretical Computer Science Conference, ITCS 2022, January 31 - February 3, 2022, Berkeley, CA, USA, volume 215 of LIPIcs, pages 79:1–79:21. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2022. doi: 10.4230/LIPICS.ITCS.2022.79. URL https://doi.org/10.4230/LIPIcs.ITCS.2022.79.
  • Gulrajani and Lopez-Paz (2021) Ishaan Gulrajani and David Lopez-Paz. In search of lost domain generalization. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URL https://openreview.net/forum?id=lQdXeXDoWtI.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 770–778. IEEE Computer Society, 2016. doi: 10.1109/CVPR.2016.90. URL https://doi.org/10.1109/CVPR.2016.90.
  • Hébert-Johnson et al. (2018) Úrsula Hébert-Johnson, Michael P. Kim, Omer Reingold, and Guy N. Rothblum. Multicalibration: Calibration for the (computationally-identifiable) masses. In Jennifer G. Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 1944–1953. PMLR, 2018. URL http://proceedings.mlr.press/v80/hebert-johnson18a.html.
  • Huh and Baidya (2022) Dongsung Huh and Avinash Baidya. The missing invariance principle found - the reciprocal twin of invariant risk minimization. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/91b482312a0845ed86e244adbd9935e4-Abstract-Conference.html.
  • Idrissi et al. (2022) Badr Youbi Idrissi, Martín Arjovsky, Mohammad Pezeshki, and David Lopez-Paz. Simple data balancing achieves competitive worst-group-accuracy. In Bernhard Schölkopf, Caroline Uhler, and Kun Zhang, editors, 1st Conference on Causal Learning and Reasoning, CLeaR 2022, Sequoia Conference Center, Eureka, CA, USA, 11-13 April, 2022, volume 177 of Proceedings of Machine Learning Research, pages 336–351. PMLR, 2022. URL https://proceedings.mlr.press/v177/idrissi22a.html.
  • Kim et al. (2019) Michael P. Kim, Amirata Ghorbani, and James Y. Zou. Multiaccuracy: Black-box post-processing for fairness in classification. In Vincent Conitzer, Gillian K. Hadfield, and Shannon Vallor, editors, Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, AIES 2019, Honolulu, HI, USA, January 27-28, 2019, pages 247–254. ACM, 2019. doi: 10.1145/3306618.3314287. URL https://doi.org/10.1145/3306618.3314287.
  • Kim et al. (2022) Michael P Kim, Christoph Kern, Shafi Goldwasser, Frauke Kreuter, and Omer Reingold. Universal adaptability: Target-independent inference that competes with propensity scoring. Proceedings of the National Academy of Sciences, 119(4):e2108097119, 2022.
  • Kirichenko et al. (2023) Polina Kirichenko, Pavel Izmailov, and Andrew Gordon Wilson. Last layer re-training is sufficient for robustness to spurious correlations. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/pdf?id=Zb6c8A-Fghk.
  • Koh et al. (2021) Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, Tony Lee, Etienne David, Ian Stavness, Wei Guo, Berton Earnshaw, Imran S. Haque, Sara M. Beery, Jure Leskovec, Anshul Kundaje, Emma Pierson, Sergey Levine, Chelsea Finn, and Percy Liang. WILDS: A benchmark of in-the-wild distribution shifts. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 5637–5664. PMLR, 2021. URL http://proceedings.mlr.press/v139/koh21a.html.
  • Koyama and Yamaguchi (2020) Masanori Koyama and Shoichiro Yamaguchi. When is invariance useful in an out-of-distribution generalization problem? arXiv preprint arXiv:2008.01883, 2020.
  • Krueger et al. (2021) David Krueger, Ethan Caballero, Jörn-Henrik Jacobsen, Amy Zhang, Jonathan Binas, Dinghuai Zhang, Rémi Le Priol, and Aaron C. Courville. Out-of-distribution generalization via risk extrapolation (rex). In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 5815–5826. PMLR, 2021. URL http://proceedings.mlr.press/v139/krueger21a.html.
  • Levy et al. (2020) Daniel Levy, Yair Carmon, John C. Duchi, and Aaron Sidford. Large-scale methods for distributionally robust optimization. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/64986d86a17424eeac96b08a6d519059-Abstract.html.
  • Li et al. (2021) Tian Li, Ahmad Beirami, Maziar Sanjabi, and Virginia Smith. Tilted empirical risk minimization. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URL https://openreview.net/forum?id=K5YasWXZT3O.
  • Lin et al. (2020) Tsung-Yi Lin, Priya Goyal, Ross B. Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. IEEE Trans. Pattern Anal. Mach. Intell., 42(2):318–327, 2020. doi: 10.1109/TPAMI.2018.2858826. URL https://doi.org/10.1109/TPAMI.2018.2858826.
  • Liu et al. (2021) Evan Zheran Liu, Behzad Haghgoo, Annie S. Chen, Aditi Raghunathan, Pang Wei Koh, Shiori Sagawa, Percy Liang, and Chelsea Finn. Just train twice: Improving group robustness without training group information. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 6781–6792. PMLR, 2021. URL http://proceedings.mlr.press/v139/liu21f.html.
  • Liu et al. (2022) Jiashuo Liu, Jiayun Wu, Bo Li, and Peng Cui. Distributionally robust optimization with data geometry. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/da535999561b932f56efdd559498282e-Abstract-Conference.html.
  • Liu et al. (2023) Jiashuo Liu, Tianyu Wang, Peng Cui, and Hongseok Namkoong. On the need for a language describing distribution shifts: Illustrations on tabular datasets. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023.
  • Malinin et al. (2021) Andrey Malinin, Neil Band, Yarin Gal, Mark J. F. Gales, Alexander Ganshin, German Chesnokov, Alexey Noskov, Andrey Ploskonosov, Liudmila Prokhorenkova, Ivan Provilkov, Vatsal Raina, Vyas Raina, Denis Roginskiy, Mariya Shmatova, Panagiotis Tigas, and Boris Yangel. Shifts: A dataset of real distributional shift across multiple large-scale tasks. In Joaquin Vanschoren and Sai-Kit Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual, 2021. URL https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/ad61ab143223efbc24c7d2583be69251-Abstract-round2.html.
  • Malinin et al. (2022) Andrey Malinin, Andreas Athanasopoulos, Muhamed Barakovic, Meritxell Bach Cuadra, Mark J. F. Gales, Cristina Granziera, Mara Graziani, Nikolay Kartashev, Konstantinos Kyriakopoulos, Po-Jui Lu, Nataliia Molchanova, Antonis Nikitakis, Vatsal Raina, Francesco La Rosa, Eli Sivena, Vasileios Tsarsitalidis, Efi Tsompopoulou, and Elena Volf. Shifts 2.0: Extending the dataset of real distributional shifts. CoRR, abs/2206.15407, 2022. doi: 10.48550/ARXIV.2206.15407. URL https://doi.org/10.48550/arXiv.2206.15407.
  • Miller et al. (2021) John Miller, Rohan Taori, Aditi Raghunathan, Shiori Sagawa, Pang Wei Koh, Vaishaal Shankar, Percy Liang, Yair Carmon, and Ludwig Schmidt. Accuracy on the line: on the strong correlation between out-of-distribution and in-distribution generalization. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 7721–7735. PMLR, 2021. URL http://proceedings.mlr.press/v139/miller21b.html.
  • Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
  • Ramé et al. (2022) Alexandre Ramé, Corentin Dancette, and Matthieu Cord. Fishr: Invariant gradient variances for out-of-distribution generalization. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato, editors, International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 18347–18377. PMLR, 2022. URL https://proceedings.mlr.press/v162/rame22a.html.
  • Roth (2022) Aaron Roth. Uncertain: Modern topics in uncertainty estimation, 2022.
  • Sagawa et al. (2019) Shiori Sagawa, Pang Wei Koh, Tatsunori B Hashimoto, and Percy Liang. Distributionally robust neural networks. In International Conference on Learning Representations, 2019.
  • Shi et al. (2021) Yuge Shi, Jeffrey Seely, Philip HS Torr, N Siddharth, Awni Hannun, Nicolas Usunier, and Gabriel Synnaeve. Gradient matching for domain generalization. arXiv preprint arXiv:2104.09937, 2021.
  • Sinha et al. (2018) Aman Sinha, Hongseok Namkoong, and John C. Duchi. Certifying some distributional robustness with principled adversarial training. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018. URL https://openreview.net/forum?id=Hk6kPgZA-.
  • Wald et al. (2021) Yoav Wald, Amir Feder, Daniel Greenfeld, and Uri Shalit. On calibration and out-of-domain generalization. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 2215–2227, 2021. URL https://proceedings.neurips.cc/paper/2021/hash/118bd558033a1016fcc82560c65cca5f-Abstract.html.
  • Wang et al. (2022) Haoxiang Wang, Bo Li, and Han Zhao. Understanding gradual domain adaptation: Improved analysis, optimal path and beyond. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato, editors, International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 22784–22801. PMLR, 2022. URL https://proceedings.mlr.press/v162/wang22n.html.
  • Yao et al. (2022) Huaxiu Yao, Yiping Wang, Linjun Zhang, James Y. Zou, and Chelsea Finn. C-mixup: Improving generalization in regression. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/1626be0ab7f3d7b3c639fbfd5951bc40-Abstract-Conference.html.
  • Yeh et al. (2020) Christopher Yeh, Anthony Perez, Anne Driscoll, George Azzari, Zhongyi Tang, David Lobell, Stefano Ermon, and Marshall Burke. Using publicly available satellite imagery and deep learning to understand economic well-being in africa. Nature Communications, 2020.
  • Zhang et al. (2018) Hongyi Zhang, Moustapha Cissé, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018. URL https://openreview.net/forum?id=r1Ddp1-Rb.
  • Zhang et al. (2022) Michael Zhang, Nimit S Sohoni, Hongyang R Zhang, Chelsea Finn, and Christopher Re. Correct-n-contrast: a contrastive approach for improving robustness to spurious correlations. In International Conference on Machine Learning, pages 26484–26516. PMLR, 2022.

Appendix A Related Work

Multicalibration   Multicalibration was first proposed by Hébert-Johnson et al. [2018] with binary grouping functions. Kim et al. [2019] and Gopalan et al. [2022] extend the grouping functions to real-valued functions. Globus-Harris et al. [2023] show that with a sufficiently rich class of real-valued grouping functions, multicalibration actually implies accuracy. Globus-Harris et al. also provide a boosting algorithm for both regression and multicalibration. The connection between multicalibration and distribution shift is first studied by Kim et al. [2022], who prove that the \ell_{1} multicalibration error is preserved under covariate shift, given a sufficiently large real-valued grouping function class. Kim et al. further show that under covariate shift, a multicalibrated predictor can perform statistical inference of the average outcome of a sample batch. In contrast, we derive a robustness result for individual prediction of outcomes for \ell_{2} multicalibrated predictors. In addition, Wald et al. [2021] study the equivalence of Invariant Risk Minimization and simultaneous calibration on each environment. Our equivalence results for multicalibration can be seen as a generalization of Wald et al.'s results beyond the multi-environment setting, obtained by deriving a mapping between density ratios and grouping functions. We also extend the equivalence to approximately multicalibrated and approximately invariant predictors. Furthermore, we move beyond Wald et al.'s multi-objective optimization with Lagrangian regularization by proposing a new post-processing optimization framework consisting of a series of supervised regressions. Meanwhile, Blasiok et al. [2023b] discuss connections between calibration and post-processing, which is an equivalent expression of invariance. There are other extensions of multicalibration, such as Deng et al. [2023], who generalize the term Y-f(X) in multicalibration's definition to a class of general functions. To our knowledge, our work is the first to generalize the grouping functions h to take the outcomes into account.

Out-of-distribution Generalization Beyond Covariate Shift   Despite abundant literature from domain generalization that focuses on image classification, where covariate shift dominates, research on algorithmic robustness for regression tasks beyond covariate shift is relatively limited. The setting can be categorized according to whether the source distribution is partitioned into several environments. For the multi-environment generalization setting, Invariant Risk Minimization and its variants assume that outcomes are generated by a common causal structural equation across all environments, and aim to recover such an invariant (or causal) predictor [Ahuja et al., 2021, Arjovsky et al., 2019, Huh and Baidya, 2022, Koyama and Yamaguchi, 2020, Krueger et al., 2021, Ramé et al., 2022]. Group DRO [Sagawa et al., 2019] is a simple but surprisingly strong technique that optimizes for the worst group risk with reweighting of environments. There are also meta-learning methods [Finn et al., 2017] that handle multi-environment generalization with bi-level optimization. For the single-environment setting, Distributionally Robust Optimization optimizes for the worst-case risk in an uncertainty set of distributions centered around the source distribution [Duchi and Namkoong, 2019, 2021, Duchi et al., 2021, Levy et al., 2020, Sinha et al., 2018]. Another branch of research targets spurious correlations under a simplicity-bias assumption: a simple model is used to discover latent sub-populations, and the biased predictor is then corrected by sample reweighting [Li et al., 2021, Lin et al., 2020, Liu et al., 2021], by retraining on a subgroup-balanced dataset or a small batch from the target distribution [Idrissi et al., 2022, Kirichenko et al., 2023, Zhang et al., 2022], or by performing Invariant Risk Minimization on the discovered subgroups [Creager et al., 2021]. Data augmentation is a prevalent technique to enhance algorithmic robustness for vision tasks, although many of these methods are tailored for classification. For example, Mixup [Zhang et al., 2018] interpolates between features of samples with the same label; the approach is extended to regression by C-Mixup [Yao et al., 2022]. Pseudolabelling is a common technique for out-of-distribution generalization, but it is typically adopted in settings with access to unlabelled samples from the target distribution, known as domain adaptation [Wang et al., 2022]. In contrast, MC-Pseudolabel generates pseudolabels for the source distribution itself.

Appendix B Grouping Functions for Distributionally Robust Optimization

Distributionally Robust Optimization assumes the target distribution to reside in an uncertainty set \mathcal{P} of distributions centered around the source distribution P_{S}. For example, Duchi et al. [2019] formulate the uncertainty set as arbitrary subgroups that have a proportion of at least \alpha_{0}\in(0,1). Duchi et al. only consider subgroups of covariates:

\mathcal{P}(X)=\{P(X):\text{ there exists a probability measure }P^{\prime}(X), (13)
P_{S}(X)=\alpha P(X)+(1-\alpha)P^{\prime}(X),\ \alpha\geq\alpha_{0}\}. (14)

By the correspondence between density ratios and grouping functions, the equivalent design of a grouping function class is given by:

\mathcal{H}=\left\{h\in\mathbb{R}^{\mathcal{X}}:0\leq h(x)\leq\frac{1}{\alpha_{0}},\;\;\forall x\right\}. (15)

We can also extend the grouping functions to consider both covariates and outcomes, such that general subgroups are incorporated into the uncertainty set:

\mathcal{H}=\left\{h\in\mathbb{R}^{\mathcal{X}\times\mathcal{Y}}:0\leq h(x,y)\leq\frac{1}{\alpha_{0}},\;\;\forall x,y\right\}. (16)

In the case of grouping functions defined on x and y jointly, the grouping function class is not closed under affine transformation and is not a linear space spanned by density ratios, which suggests that a perfectly multicalibrated solution might not exist. However, approximately multicalibrated predictors can still be pursued.
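To make the correspondence concrete, the following numpy sketch (ours, not the paper's code) builds a member of the clipped grouping class of Equation 15: the density ratio of the worst-case \alpha_{0}-subgroup, which reweights the \alpha_{0}-fraction of highest-loss samples by 1/\alpha_{0} and lies exactly on the boundary of the class.

```python
import numpy as np

def clip_to_dro_class(h, alpha0):
    """Project arbitrary nonnegative weights onto the grouping class of
    Equations 15-16, {h : 0 <= h <= 1/alpha0}; the constraint is
    pointwise, so elementwise clipping suffices."""
    return np.clip(h, 0.0, 1.0 / alpha0)

def worst_subgroup_density_ratio(losses, alpha0):
    """Density ratio of the worst-case alpha0-subgroup: weight 1/alpha0
    on the alpha0-fraction of samples with the highest loss, 0 elsewhere
    (the CVaR reweighting)."""
    n = len(losses)
    k = int(np.ceil(alpha0 * n))
    h = np.zeros(n)
    h[np.argsort(losses)[-k:]] = 1.0 / alpha0
    return h
```

By construction, h averages to 1 over the source sample (it is a density ratio), and the reweighted mean loss equals the worst \alpha_{0}-subgroup risk.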

Appendix C Model Discretization

For continuous predictors, we take a preprocessing step that discretizes the model into as many bins as possible, so that the rounding error is negligible while individual bins still contain enough samples. Specifically, we split the predictor's outputs into bins of equal width from the minimum to the maximum of the model output. We start from a minimum bin number m=10 and keep increasing m as long as 90% of the samples reside in a bin with at least 30 samples. When this criterion is violated, we stop increasing m and select the current value as the final bin number. The model discretization procedure is fixed across all experiments.
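The binning rule above can be sketched in numpy; this is an illustrative reading of the procedure, and the `m_max` cap is our addition to guarantee termination:

```python
import numpy as np

def bin_criterion(preds, m, coverage=0.9, min_bin_count=30):
    """Check whether at least `coverage` of samples fall into
    equal-width bins containing at least `min_bin_count` samples."""
    edges = np.linspace(preds.min(), preds.max(), m + 1)
    idx = np.minimum(np.digitize(preds, edges[1:]), m - 1)  # bin index in [0, m-1]
    counts = np.bincount(idx, minlength=m)
    return (counts[idx] >= min_bin_count).mean() >= coverage

def select_bin_number(preds, m_min=10, m_max=1000, coverage=0.9, min_bin_count=30):
    """Keep increasing the bin number while the criterion still holds
    for one more bin (a sketch of the discretization procedure above)."""
    m = m_min
    while m < m_max and bin_criterion(preds, m + 1, coverage, min_bin_count):
        m += 1
    return m
```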

Appendix D Computational Complexity

We assume that the predictor's outcomes are uniformly distributed. Denote the average bin size by N_{b}, which is a constant around 30 in our implementation. The bin number is given by m=N/N_{b}, where N is the sample size. For neural networks, N represents the batch size. The overhead of MC-Pseudolabel compared to Empirical Risk Minimization is a linear regression on each bin, whose complexity is \mathcal{O}(N_{b}^{3}) with OLS. Note that an individual linear regression on around 30 samples is extremely cheap. A non-parallel implementation of regression on every bin scales linearly with the bin number m, so the overall complexity is \mathcal{O}(N_{b}^{2}N). However, since the regressions on different bins are independent, we adopt a multi-processing implementation. Denoting the number of jobs by J, the overall time cost of MC-Pseudolabel is \mathcal{O}(N_{b}^{2}N/J). As a comparison, OLS on N samples has a computational complexity of \mathcal{O}(N^{3}).

In conclusion, the complexity of MC-Pseudolabel scales linearly with the sample size (or batch size for neural networks). Counterintuitively, increasing the bin number m (and thus decreasing the bin size) actually decreases the computational complexity. This is because linear regression scales cubically with sample size, so decreasing the sample size in each bin is preferable to decreasing the bin number.
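A minimal sketch of the per-bin regression step whose cost is analyzed above (illustrative, not the paper's implementation): each OLS fit only sees ~N_b samples, so the total cost is \mathcal{O}(N_{b}^{2}N) instead of the \mathcal{O}(N^{3}) of one OLS over all N samples.

```python
import numpy as np

def per_bin_regression(X, y, bin_idx, m):
    """Fit an independent OLS regression (with intercept) inside every
    level-set bin of the predictor and return the coefficient vectors."""
    coefs = []
    for b in range(m):
        mask = bin_idx == b
        Xb = np.c_[np.ones(mask.sum()), X[mask]]  # prepend intercept column
        beta, *_ = np.linalg.lstsq(Xb, y[mask], rcond=None)
        coefs.append(beta)
    return coefs
```

Because the fits are independent, they parallelize trivially (e.g. with `multiprocessing.Pool`), dividing the wall-clock cost by the number of jobs J.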

Appendix E Experiments

E.1 An Additional Experiment: Synthetic Dataset

We start from a multi-environment synthetic dataset with a multivariate normal distribution corresponding to Theorem 5.2. In this experiment, we examine the optimization dynamics of MC-Pseudolabel. The data generation process is inspired by Arjovsky et al. [2019]. The covariates can be sliced into X=(S,V)^{T} with S\in\mathbb{R}^{9} and V\in\mathbb{R}, where S is the causal variable for Y and V is the spurious variable. The data is generated by the following structural equations:

S \sim \mathcal{N}(0,1). (17)
Y = \alpha_{S}^{T}S+\epsilon_{Y},\;\;\alpha_{S}=(1,...,1)^{T}\in\mathbb{R}^{9},\;\epsilon_{Y}\sim\mathcal{N}(0,0.5^{2}). (18)
V = \alpha_{V}(\mathcal{E})\cdot Y+\epsilon_{V},\;\;\alpha_{V}(e_{1})=1.25,\;\alpha_{V}(e_{2})=0.75,\;\alpha_{V}(e_{T})=-1,\;\epsilon_{V}\sim\mathcal{N}(0,0.1^{2}). (19)

The covariate V is spuriously correlated with Y because the coefficient \alpha_{V}(e) depends on the specific environment e. We set \alpha_{V}(\mathcal{E})=1.25 and 0.75 for the two training environments, while \alpha_{V}(\mathcal{E}) extrapolates to -1 during testing. A robust algorithm is supposed to bypass the spurious variable and output a predictor f(X)=\alpha_{S}^{T}S in order to survive the test distribution, where the correlation between V and Y is negated.
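The structural equations (17)-(19) can be sampled directly; the sketch below is illustrative (fixed seed and sample size are our choices) and makes the correlation flip between training and test environments explicit:

```python
import numpy as np

def sample_environment(n, alpha_v, rng):
    """Draw n samples from one environment following Equations (17)-(19):
    S causes Y, and V depends on Y through the environment-specific
    coefficient alpha_v."""
    S = rng.normal(0.0, 1.0, size=(n, 9))
    Y = S.sum(axis=1) + rng.normal(0.0, 0.5, size=n)  # alpha_S = (1, ..., 1)
    V = alpha_v * Y + rng.normal(0.0, 0.1, size=n)
    return np.c_[S, V], Y                              # X = (S, V)

rng = np.random.default_rng(0)
X1, Y1 = sample_environment(5000, 1.25, rng)   # training environment e1
X2, Y2 = sample_environment(5000, 0.75, rng)   # training environment e2
Xt, Yt = sample_environment(5000, -1.0, rng)   # test: spurious correlation flips
```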

The predictor class for this dataset is linear models, and the environment classifier is implemented by an MLP with a single hidden layer. In this experiment, we fix the training hyperparameters of the base linear model and perform a grid search over the extra hyperparameter introduced by each robust learning method. All baselines except ERM and Group DRO share a hyperparameter, the regularizer's coefficient, while Group DRO introduces a temperature hyperparameter. We search over their hyperparameter spaces and report the RMSE on the test set in Figure 2. Most baselines exhibit a U-shaped curve as the hyperparameter increases, and the minimum point varies across methods. This hyperparameter sensitivity implies a dependence on a strong model selection criterion, such as oracle model selection on the target distribution. However, the dashed line for MC-Pseudolabel's error is tangent to all the baselines' U-shaped curves, indicating competitive performance of MC-Pseudolabel both with and without oracle model selection.


Figure 2: Results (RMSE) on the synthetic dataset. Curves show method performances across hyperparameters. Methods without extra hyperparameters are marked with dotted lines.

We also investigate the evolution of the pseudolabels \tilde{f}_{t} in Algorithm 1 to recover the dynamics of MC-Pseudolabel. The first row of Figure 3 demonstrates how pseudolabelling results in a multicalibrated predictor. It shows that the pseudolabels for the two environments deviate from the model prediction at Step 0, but the gap quickly closes by Step 4, implying multicalibrated prediction. The second row provides insight into how pseudolabelling contributes to an invariant predictor. We observe that the curves of the two environments gradually merge, because the pseudolabel introduces a special noise to the original label such that the correlation between the pseudolabel and the spurious variable V is weakened. As a result, the predictor comes to depend on the causal variable S, which is more strongly correlated with the pseudolabels.


Figure 3: Evolution of pseudolabels during MC-Pseudolabel. The first row plots values of pseudolabels against model predictions. The second row plots values of pseudolabels against VV. Columns represent different snapshots during optimization.

E.2 VesselPower

In Figure 4, we provide the correlation between models' in-distribution validation performance and out-of-distribution test performance across all methods.

Figure 4: Correlation between models’ in-distribution and out-of-distribution risks on VesselPower.

E.3 Training Details

We follow DomainBed's protocol [Gulrajani and Lopez-Paz, 2021] for model selection. Specifically, we randomly sample 20 sets of hyperparameters for each method, containing both the training hyperparameters of the base models in Table 3 and the extra hyperparameters of the robust learning algorithms in Table 4. We select the best model across hyperparameters based on three model selection criteria. In-distribution (ID) validation selects the model with the best average metric on an in-distribution validation dataset, which is sampled from the same distribution as the training data. Worst-environment (Worst) validation selects the best model by the worst performance across all environments in the in-distribution validation dataset; it is applicable only to the multi-environment setting. Oracle validation selects the best model by an out-of-distribution validation dataset sampled from the target distribution of the test data. Oracle validation leaks the test distribution, so it is not recommended by DomainBed. However, most robust learning methods rely on out-of-distribution validation, so DomainBed suggests a limited number of accesses to target data when using Oracle validation. Though MC-Pseudolabel already performs well under ID and Worst validation, we still report its performance under Oracle validation to compare the limits of robust learning methods regardless of model selection. Specifically for PovertyMap, an OOD validation set is provided for Oracle validation. The results under Oracle validation recover PovertyMap's public benchmark.

Following DomainBed, the entire model selection procedure is repeated with different seeds three times to measure the standard errors of performances. Thus, we have a total of 60 runs per method per dataset. Specifically for PovertyMap, we follow WILDS' setup [Koh et al., 2021] and perform 5-fold cross validation instead of three repeated experiments. For each fold of the dataset, we conduct the model selection procedure with a single seed, and we report the average and standard error of performance across the 5 folds. Thus, the standard error measures both the difficulty disparity across folds and the model's instability.
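The three model selection criteria described above can be sketched as follows; the run representation and field names (`id_val`, `ood_val`) are illustrative, not from DomainBed's code:

```python
def select_model(runs, criterion):
    """Pick the best hyperparameter run under one of the three model
    selection criteria. Each run is a dict holding per-environment
    in-distribution validation errors and an oracle (target-distribution)
    validation error; lower is better."""
    if criterion == "id":        # average in-distribution validation error
        key = lambda r: sum(r["id_val"]) / len(r["id_val"])
    elif criterion == "worst":   # worst environment in the ID validation set
        key = lambda r: max(r["id_val"])
    elif criterion == "oracle":  # held-out target-distribution error
        key = lambda r: r["ood_val"]
    else:
        raise ValueError(f"unknown criterion: {criterion}")
    return min(runs, key=key)
```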

The grouping function class of MC-Pseudolabel is implemented according to Equation 11 and Equation 12 for the multi-environment and single-environment settings respectively. For the multi-environment setting, the environment classifier p(e|x,y) is implemented as an MLP with a single hidden layer of size 100 for tabular datasets, including Simulation and ACSIncome. For PovertyMap, the environment classifier is implemented by Resnet18-MS with the same architecture as the predictor. For the single-environment setting of VesselPower, the identification model f_{id} is implemented as a Ridge regression model.
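For the multi-environment setting, the classifier's posteriors can be converted into density-ratio grouping functions by Bayes' rule, p_{e}(x,y)/p_{S}(x,y)=p(e|x,y)/P_{S}(e); the sketch below is ours, with illustrative array shapes (see Equation 11 for the exact form used):

```python
import numpy as np

def grouping_values(env_posterior, env_prior):
    """Turn an environment classifier's posteriors p(e|x,y) into the
    density-ratio grouping functions h_e(x,y) = p_e(x,y)/p_S(x,y).
    env_posterior: (n_samples, n_envs), rows summing to 1;
    env_prior: environment marginals P_S(e), length n_envs."""
    return env_posterior / np.asarray(env_prior)[None, :]
```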

Table 3: Hyperparameters for model architecture.
Simulation ACSIncome VesselPower PovertyMap
Architecture Linear MLP MLP Resnet18-MS
Hidden Layer Dimensions None 16 32, 8 Standard [Koh et al., 2021]
Optimizer Adam Adam Adam Adam
Weight Decay 0 0 0 0
Loss MSE MSE MSE MSE
Learning Rate 0.1 10^{\text{Uniform}[-3,-1]} 10^{\text{Uniform}[-3,-1]} 10^{\text{Uniform}[-4,-2]}
Batch Size 1024 [256, 512, 1024, 2048] [256, 512, 1024, 2048] [32, 64, 128]
Table 4: Hyperparameters for robust learning methods.
Range
Regularizer Coefficient 10^{\text{Uniform}[-3,2]}
\eta (GroupDRO) 10^{\text{Uniform}[-3,2]}
\alpha (JTT) [0.1, 0.2, 0.5, 0.7]
\lambda (JTT) [5, 10, 20, 50]
\eta (\chi^{2}-DRO) [0.2, 0.5, 1, 1.5]
t (Tilted-ERM) [0.1, 0.5, 1, 5, 10, 50, 100, 200]
\alpha (C-Mixup) [0.5, 1, 1.5, 2]
\sigma (C-Mixup) [0.01, 0.1, 1, 10, 100]

E.4 Software and Hardware

Our experiments are based on the architecture of PyTorch [Paszke et al., 2019]. Each experiment with a single set of hyperparameters is run on one NVIDIA GeForce RTX 3090 with 24GB of memory, taking at most 15 minutes.

Appendix F Theory

F.1 Multicalibration and Bayes Optimality under Covariate Shift

Assumption F.1 (Restatement of Assumption 2.2).

1. (Closure under Covariate Shift) For a set of probability measures \mathcal{P}(X) containing the source measure P_{S}(X),

\forall P\in\mathcal{P},\ h\in\mathcal{H}\ \Rightarrow\ \frac{p(\cdot)}{p_{S}(\cdot)}\cdot h(\cdot)\in\mathcal{H}. (20)

2. ((\gamma,\rho)-Weak Learning Condition) For any P\in\mathcal{P}(X)P_{S}(Y|X)\equiv\{P^{\prime}(X)P_{S}(Y\mid X):P^{\prime}\in\mathcal{P}\} with the source conditional measure P_{S}(Y|X) and every measurable set G\subset\mathcal{X} satisfying P(X\in G)>\rho, if

\mathbb{E}_{P}[(\mathbb{E}_{P}[Y|X]-Y)^{2}|X\in G]<\mathbb{E}_{P}[(\mathbb{E}_{P}[Y|X\in G]-Y)^{2}|X\in G]-\gamma, (21)

then there exists h\in\mathcal{H} satisfying

\mathbb{E}_{P}[(h(X)-Y)^{2}|X\in G]<\mathbb{E}_{P}[(\mathbb{E}_{P}[Y|X\in G]-Y)^{2}|X\in G]-\gamma. (22)
Lemma F.2 (Globus-Harris et al. [2023]).

Fix any distribution P\in\mathcal{P}(X,Y), any model f:\mathcal{X}\rightarrow[0,1], and any class of real-valued functions \mathcal{H} that is closed under affine transformation. Let

\mathcal{H}_{1}=\{h\in\mathcal{H}:\max_{x\in\mathcal{X}}h(x)^{2}\leq 1\}

be the set of functions in \mathcal{H} upper-bounded by 1 on \mathcal{X}. Let m=|\text{Range}(f)|, \gamma>0, and \alpha\leq\frac{\gamma^{3}}{16m}. Then if \mathcal{H} satisfies the (\gamma,\frac{\gamma}{m})-weak learning condition and f is \alpha-approximately \ell_{2} multicalibrated with respect to \mathcal{H}_{1} and P, then f has squared error

\mathbb{E}_{P}[(f(x)-y)^{2}]\leq\mathbb{E}_{P}[(f^{*}(x)-y)^{2}]+3\gamma,

where f^{*}(x)=\mathbb{E}_{P}[Y|x].

Definition F.3.

For a probability measure P(X,Y) and a predictor f:\mathcal{X}\rightarrow[0,1], let \mathcal{H}\subset\mathbb{R}^{\mathcal{X}\times\mathcal{Y}} be a function class. We say that f is \alpha-approximately \ell_{1} multicalibrated w.r.t. \mathcal{H} and P if for all h\in\mathcal{H}:

K_{1}(f,h,P) (23)
=\int\Big|\mathbb{E}_{P}\left[h(X,Y)(Y-v)\big|f(X)=v\right]\Big|\,dP_{f(X)}(v) (24)
\leq\alpha. (25)
Lemma F.4 (Roth [2022]).

Suppose P_{S},P_{T}\in\mathcal{P} have the same conditional label distribution, and suppose f is \alpha-approximately \ell_{1} multicalibrated with respect to P_{S} and \mathcal{H}. If \mathcal{H} satisfies Equation 20, then f is also \alpha-approximately \ell_{1} multicalibrated with respect to P_{T} and \mathcal{H}.

Lemma F.5.

For a predictor f:\mathcal{X}\rightarrow[0,1] and a grouping function h satisfying \max_{x\in\mathcal{X},y\in\mathcal{Y}}h(x,y)^{2}\leq B where B>0,

\frac{1}{\sqrt{B}}K_{2}(f,h,P)\leq K_{1}(f,h,P)\leq\sqrt{K_{2}(f,h,P)}. (26)
Remark F.6.

The lemma extends Roth [2022]'s result for B=1.

Proof.

First we prove K_{2}(f,h,P)\leq\sqrt{B}K_{1}(f,h,P). For any v\in[0,1],

\Big(\mathbb{E}_{P}\left[h(X,Y)(Y-v)\big|f(X)=v\right]\Big)^{2} (27)
=\Big|\mathbb{E}_{P}\left[h(X,Y)(Y-v)\big|f(X)=v\right]\Big|\cdot\Big|\mathbb{E}_{P}\left[h(X,Y)(Y-v)\big|f(X)=v\right]\Big| (28)
\leq\sqrt{\mathbb{E}\left[h(X,Y)^{2}\big|f(X)=v\right]\mathbb{E}\left[(Y-v)^{2}\big|f(X)=v\right]}\cdot\Big|\mathbb{E}_{P}\left[h(X,Y)(Y-v)\big|f(X)=v\right]\Big| (29)-(30)
\leq\sqrt{B}\cdot\Big|\mathbb{E}_{P}\left[h(X,Y)(Y-v)\big|f(X)=v\right]\Big|. (31)

Equation 29 follows from the Cauchy-Schwarz inequality, and Equation 31 uses h(X,Y)^{2}\leq B and (Y-v)^{2}\leq 1.

K_{2}(f,h,P)=\mathbb{E}_{v\sim P_{f(X)}}\left[\Big(\mathbb{E}_{P}\left[h(X,Y)(Y-v)\big|f(X)=v\right]\Big)^{2}\right] (32)
\leq\sqrt{B}\,\mathbb{E}_{v\sim P_{f(X)}}\Big|\mathbb{E}_{P}\left[h(X,Y)(Y-v)\big|f(X)=v\right]\Big| (33)
=\sqrt{B}K_{1}(f,h,P). (34)

Then we prove K_{1}(f,h,P)\leq\sqrt{K_{2}(f,h,P)}. Again from the Cauchy-Schwarz inequality:

K_{1}(f,h,P)=\mathbb{E}_{v\sim P_{f(X)}}\Big|\mathbb{E}_{P}\left[h(X,Y)(Y-v)\big|f(X)=v\right]\Big| (35)
\leq\sqrt{\mathbb{E}_{v\sim P_{f(X)}}[1^{2}]\;\mathbb{E}_{v\sim P_{f(X)}}\left[\Big(\mathbb{E}_{P}\left[h(X,Y)(Y-v)\big|f(X)=v\right]\Big)^{2}\right]} (36)
=\sqrt{K_{2}(f,h,P)}. (37)
∎

Theorem F.7 (Restatement of Theorem 2.3).

For a source measure P_{S}(X,Y) and a set of probability measures \mathcal{P}(X) containing P_{S}(X), given a predictor f:\mathcal{X}\rightarrow[0,1] with finite range m:=|\text{Range}(f)|, consider a grouping function class \mathcal{H} closed under affine transformation satisfying Assumption 2.2 with \rho=\gamma/m. If f is \frac{\gamma^{6}}{256m^{2}}-approximately \ell_{2} multicalibrated w.r.t. P_{S} and \mathcal{H}'s bounded subset \mathcal{H}_{1}:=\left\{h\in\mathcal{H}:\max_{x\in\mathcal{X}}h(x)^{2}\leq 1\right\}, then for any target measure P_{T}\in\mathcal{P}(X)P_{S}(Y|X),

R_{P_{T}}(f)\leq R_{P_{T}}(f^{*})+3\gamma, (38)

where f^{*}(x)=\mathbb{E}_{P_{T}}[Y|x] is the optimal regression function under the target distribution.

Proof.

For |h(x)|\leq 1 and f\in[0,1], according to Lemma F.5,

K_{2}(f,h,P)\leq K_{1}(f,h,P)\leq\sqrt{K_{2}(f,h,P)}. (39)

Since K_{2}(f,h,P_{S})\leq\frac{\gamma^{6}}{256m^{2}} for any h\in\mathcal{H}_{1},

K_{1}(f,h,P_{S})\leq\frac{\gamma^{3}}{16m}. (40)

With Lemma F.4, for any h\in\mathcal{H}_{1} and P_{T}\in\mathcal{P}, we have

K_{1}(f,h,P_{T})\leq\frac{\gamma^{3}}{16m}. (41)

Again, it follows from Lemma F.5 that:

K_{2}(f,h,P_{T})\leq\frac{\gamma^{3}}{16m}. (42)

With Lemma F.2, we have

\mathbb{E}_{P_{T}}[(f(x)-y)^{2}]\leq\mathbb{E}_{P_{T}}[(f^{*}(x)-y)^{2}]+3\gamma, (43)

which completes the proof. ∎

F.2 Multicalibration and Invariance under Concept Shift

Theorem F.8 (Restatement of Theorem 3.1).

For a set of absolutely continuous probability measures \mathcal{P}(X,Y) containing the source measure P_{S}(X,Y), consider a predictor f:\mathcal{X}\rightarrow[0,1]. Assume the grouping function class \mathcal{H} satisfies the following condition:

\mathcal{H}\supset\left\{h(x,y)=\frac{p(x,y)}{p_{S}(x,y)}\,\Big|\,P\in\mathcal{P}(X,Y)\right\}. (44)

If f is \alpha-approximately \ell_{2} multicalibrated w.r.t. \mathcal{H} and P_{S}, then for any measure P\in\mathcal{P}(X,Y),

R_{P}(f)\leq\inf_{g:[0,1]\rightarrow[0,1]}R_{P}(g\circ f)+2\sqrt{\alpha}. (45)
Proof.

For any h(x,y)=p(x,y)/p_{S}(x,y) where P\in\mathcal{P}, since f is \alpha-approximately \ell_{2} multicalibrated, K_{2}(f,h,P_{S})\leq\alpha. It follows from Lemma F.5 that K_{1}(f,h,P_{S})\leq\sqrt{\alpha}.

K_{1}(f,h,P_{S})=\int\Big|\mathbb{E}_{P_{S}}\left[h(X,Y)(Y-v)\big|f(X)=v\right]\Big|dP_{S}(f^{-1}(v)) (46)
=\int\left|\int\frac{p(x,y)}{p_{S}(x,y)}(y-v)p_{S}(x,y\big|f=v)d(x,y)\right|dP_{S}(f^{-1}(v)) (47)
=\int\left|\int\frac{dP(f^{-1}(v))}{dP_{S}(f^{-1}(v))}(y-v)p(x,y\big|f=v)d(x,y)\right|dP_{S}(f^{-1}(v)) (48)
=\int\frac{dP(f^{-1}(v))}{dP_{S}(f^{-1}(v))}\left|\int(y-v)p(x,y\big|f=v)d(x,y)\right|dP_{S}(f^{-1}(v)) (49)
=\int\Big|\mathbb{E}_{P}\left[Y-v\big|f(X)=v\right]\Big|dP(f^{-1}(v)) (50)
=K_{1}(f,1,P). (51)

Thus, we have K_{1}(f,1,P)\leq\sqrt{\alpha} for any P\in\mathcal{P}. We will prove an equivalent form of the \ell_{1} calibration error:

K_{1}(f,1,P)=\sup_{\eta:[0,1]\rightarrow[-1,1]}\int\eta(v)\mathbb{E}_{P}\left[Y-v\big|f(X)=v\right]dP_{f(X)}(v) (52)
=\sup_{\eta:[0,1]\rightarrow[-1,1]}\mathbb{E}_{P}[\eta(f(X))(Y-f(X))]. (53)

K_{1}(f,1,P)\leq\sup_{\eta:[0,1]\rightarrow[-1,1]}\mathbb{E}_{P}[\eta(f(X))(Y-f(X))] can be proved by taking \eta(v)=2\mathbb{I}\left[\mathbb{E}_{P}[Y-v|f(X)=v]>0\right]-1, the sign of the conditional bias. On the other hand,

\int\eta(v)\mathbb{E}_{P}\left[Y-v\big|f(X)=v\right]dP_{f(X)}(v)\leq\int\Big|\eta(v)\Big|\cdot\Big|\mathbb{E}_{P}\left[Y-v\big|f(X)=v\right]\Big|dP_{f(X)}(v) (54)
\leq K_{1}(f,1,P). (55)

Actually, the right-hand side of Equation 52 resembles smooth calibration [Blasiok et al., 2023a], which restricts \eta to Lipschitz functions. Based on smooth calibration, Blasiok et al. [2023b] show that approximately calibrated predictors cannot be improved much by post-processing. Above we present a similar proof for the \ell_{1} calibration error.

For any g:[0,1]\rightarrow[0,1], there exists \eta:[0,1]\rightarrow[-1,1] such that g(v)=v+\eta(v) for any v\in[0,1].

R_{P}(g\circ f)=\mathbb{E}_{P}\left[\big(Y-f(X)-\eta(f(X))\big)^{2}\right] (56)
=\mathbb{E}_{P}\left[(Y-f(X))^{2}\right]-2\mathbb{E}_{P}\left[(Y-f(X))\eta(f(X))\right]+\mathbb{E}_{P}\left[\eta(f(X))^{2}\right] (57)
\geq R_{P}(f)-\sup_{\eta^{\prime}:[0,1]\rightarrow[-1,1]}2\mathbb{E}[\eta^{\prime}(f(X))(Y-f(X))] (58)
=R_{P}(f)-2K_{1}(f,1,P) (59)
\geq R_{P}(f)-2\sqrt{\alpha}. (60)
∎

Theorem F.9 (Restatement of Theorem 3.4).

Assume samples are drawn from an environment e\in\mathscr{E} with a prior P_{S}(e) such that \sum_{e\in\mathscr{E}}P_{S}(e)=1 and P_{S}(e)>0. The overall population satisfies P_{S}(X,Y)=\sum_{e\in\mathscr{E}}P_{e}(X,Y)P_{S}(e), where P_{e}(X,Y) is the environment-specific absolutely continuous measure. For a representation \Phi\in\sigma(X) over features, define a function class \mathcal{H} as:

\mathcal{H}:=\left\{h(x,y)=\frac{p_{e}(x,y)}{p_{S}(x,y)}\,\Big|\,e\in\mathscr{E}\right\}. (61)

1. If there exists a bijection g^{*}:\text{supp}(\Phi)\rightarrow[0,1] such that g^{*}\circ\Phi is \alpha-approximately \ell_{2} multicalibrated w.r.t. \mathcal{H} and P_{S}, then for any e\in\mathscr{E},

R_{P_{e}}(g^{*}\circ\Phi)\leq\inf_{g:\text{supp}(\Phi)\rightarrow[0,1]}R_{P_{e}}(g\circ\Phi)+2\sqrt{\alpha}. (62)

2. For \Phi\in\sigma(X), if there exists g^{*}:\text{supp}(\Phi)\rightarrow[0,1] such that Equation 62 is satisfied for any e\in\mathscr{E}, then g^{*}\circ\Phi is \sqrt{2/D}\alpha^{1/4}-approximately \ell_{2} multicalibrated w.r.t. \mathcal{H} and P_{S}, where D=\min_{e\in\mathscr{E}}P_{S}(e).

Proof.

We first prove statement 1.

For any g:supp(Φ)[0,1]g:\text{supp}(\Phi)\rightarrow[0,1], since gg^{*} is a bijection, gΦ=(gg1)(gΦ)g\circ\Phi=(g\circ{g^{*}}^{-1})(g^{*}\circ\Phi) where gg1[0,1][0,1]g\circ{g^{*}}^{-1}\in[0,1]^{[0,1]}. Since gΦg^{*}\circ\Phi is α\alpha-approximately 2\ell_{2} multicalibrated, it follows from Theorem F.8 that for any ee\in\mathscr{E},

RPe(gΦ)\displaystyle R_{P_{e}}(g^{*}\circ\Phi) RPe((gg1)(gΦ))+2α\displaystyle\leq R_{P_{e}}\big{(}(g\circ{g^{*}}^{-1})(g^{*}\circ\Phi)\big{)}+2\sqrt{\alpha} (63)
=RPe(gΦ)+2α.\displaystyle=R_{P_{e}}(g\circ\Phi)+2\sqrt{\alpha}. (64)

Then we give a proof of statement 2, which is inspired by Blasiok et al. [Blasiok et al., 2023b].

For simplicity let f:=gΦf^{*}:=g^{*}\circ\Phi. For any ee\in\mathscr{E} and any η:[0,1][1,1]\eta:[0,1]\rightarrow[-1,1], define β:=EPe[(Yf(X))η(f(X))][1,1]\beta:=E_{P_{e}}\left[(Y-f^{*}(X))\eta(f^{*}(X))\right]\in[-1,1]. Construct κ(v):=proj[0,1](v+βη(v))\kappa(v):=\text{proj}_{[0,1]}(v+\beta\eta(v)), where proj[0,1]()=max{0,min{1,}}\text{proj}_{[0,1]}(\cdot)=\max\{0,\min\{1,\cdot\}\}.

RPe(κ(f))\displaystyle R_{P_{e}}(\kappa(f^{*})) =𝔼Pe[(Yκ(f(X)))2]\displaystyle=\mathbb{E}_{P_{e}}\left[\big{(}Y-\kappa(f^{*}(X))\big{)}^{2}\right] (65)
𝔼Pe[(Yf(X)βη(f(X)))2]\displaystyle\leq\mathbb{E}_{P_{e}}\left[\big{(}Y-f^{*}(X)-\beta\eta(f^{*}(X))\big{)}^{2}\right] (66)
=𝔼Pe[(Yf(X))2]2β2+β2𝔼Pe[η(f(x))2]\displaystyle=\mathbb{E}_{P_{e}}\left[(Y-f^{*}(X))^{2}\right]-2\beta^{2}+\beta^{2}\mathbb{E}_{P_{e}}[\eta(f^{*}(x))^{2}] (67)
RPe(f)2β2+β2\displaystyle\leq R_{P_{e}}(f^{*})-2\beta^{2}+\beta^{2} (68)
=RPe(f)β2.\displaystyle=R_{P_{e}}(f^{*})-\beta^{2}. (69)

Rearranging the inequality above gives:

(EPe[(Yf(X))η(f(X))])2=β2RPe(f)RPe(κ(f)).\displaystyle\Big{(}E_{P_{e}}\left[(Y-f^{*}(X))\eta(f^{*}(X))\right]\Big{)}^{2}=\beta^{2}\leq R_{P_{e}}(f^{*})-R_{P_{e}}(\kappa(f^{*})). (70)

Since κgsupp(Φ)[0,1]\kappa\circ g^{*}\in\text{supp}(\Phi)\rightarrow[0,1], it follows from Equation 62 that:

RPe(f)RPe(κ(f))=RPe(gΦ)RPe((κg)Φ)2α.\displaystyle R_{P_{e}}(f^{*})-R_{P_{e}}(\kappa(f^{*}))=R_{P_{e}}(g^{*}\circ\Phi)-R_{P_{e}}((\kappa\circ g^{*})\circ\Phi)\leq 2\sqrt{\alpha}. (71)

Combining Equation 70 and Equation 71 gives EPe[(Yf(X))η(f(X))]2α1/4E_{P_{e}}\left[(Y-f^{*}(X))\eta(f^{*}(X))\right]\leq\sqrt{2}\alpha^{1/4} for any η:[0,1][1,1]\eta:[0,1]\rightarrow[-1,1]. From Equation 52, it follows that:

K1(f,1,Pe)=supη:[0,1][1,1]𝔼Pe[η(f(X))(Yf(X))]2α1/4.\displaystyle K_{1}(f^{*},1,P_{e})=\sup_{\eta:[0,1]\rightarrow[-1,1]}\mathbb{E}_{P_{e}}[\eta(f^{*}(X))(Y-f^{*}(X))]\leq\sqrt{2}\alpha^{1/4}. (72)

From Equation 51, K1(f,h,PS)2α1/4K_{1}(f^{*},h,P_{S})\leq\sqrt{2}\alpha^{1/4} for any hh\in\mathcal{H}. Further, for any hh\in\mathcal{H},

h(x,y)\displaystyle h(x,y) =pe(x,y)pS(x,y)\displaystyle=\frac{p_{e}(x,y)}{p_{S}(x,y)} (73)
=pe(x,y)epe(x,y)PS(e)\displaystyle=\frac{p_{e}(x,y)}{\sum_{e^{\prime}\in\mathscr{E}}{p_{e^{\prime}}(x,y)}P_{S}(e^{\prime})} (74)
pe(x,y)pe(x,y)PS(e)\displaystyle\leq\frac{p_{e}(x,y)}{p_{e}(x,y)P_{S}(e)} (75)
1D.\displaystyle\leq\frac{1}{D}. (76)

By Lemma F.5, it follows that K2(f,h,PS)1/DK1(f,h,PS)2/Dα1/4.K_{2}(f^{*},h,P_{S})\leq\sqrt{1/D}\cdot K_{1}(f^{*},h,P_{S})\leq\sqrt{2/D}\alpha^{1/4}.
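The density ratio bound in Equations 73–76 can be sanity-checked numerically. Below is a minimal sketch on a toy discrete mixture; the pmfs, the mixing weights, and the variable names are illustrative assumptions, with `D` playing the role of a lower bound on the mixture weights:

```python
# Toy check: when the source P_S is a mixture of environment distributions
# with weights P_S(e), every density ratio p_e / p_S is bounded by 1 / D,
# where D lower-bounds the weights P_S(e). All numbers are made up.
points = [(0, 0), (0, 1), (1, 0), (1, 1)]  # a tiny (x, y) space

p = {
    "e1": {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3},
    "e2": {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.3, (1, 1): 0.2},
}
w = {"e1": 0.7, "e2": 0.3}   # mixing weights P_S(e)
D = min(w.values())          # D = min_e P_S(e)

# p_S(x, y) = sum_e p_e(x, y) * P_S(e)
p_S = {z: sum(w[e] * p[e][z] for e in p) for z in points}
max_ratio = max(p[e][z] / p_S[z] for e in p for z in points)

assert abs(sum(p_S.values()) - 1.0) < 1e-9   # p_S is a valid pmf
assert max_ratio <= 1.0 / D + 1e-9           # the claimed bound
```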

F.3 Structure of Grouping Function Classes

In this subsection, we focus on Euclidean space where 𝒳d\mathcal{X}\subset\mathbb{R}^{d} is compact and measurable for some dZ+d\in Z^{+} and 𝒴=[0,1]\mathcal{Y}=[0,1]. Grouping functions are assumed to be continuous, i.e., hC(𝒳×𝒴)h\in C(\mathcal{X}\times\mathcal{Y}). We consider absolutely continuous probability measures with continuous density functions.

Proposition F.10 (Restatement of Proposition 4.1).

Consider an absolutely continuous probability measure PS(X,Y)P_{S}(X,Y) and a predictor f:𝒳[0,1]f:\mathcal{X}\rightarrow[0,1], define the maximal grouping function class that ff is multicalibrated with respect to:

f:={hC(𝒳×𝒴):K2(f,h,PS)=0}.\displaystyle\mathcal{H}_{f}:=\{h\in C(\mathcal{X}\times\mathcal{Y}):K_{2}(f,h,P_{S})=0\}. (77)

Then f\mathcal{H}_{f} is a linear space.

Particularly for fΦ(x)=𝔼[Y|Φ(x)]f_{\Phi}(x)=\mathbb{E}[Y|\Phi(x)] where Φ:ddΦ,dΦZ+\Phi:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d_{\Phi}},{d_{\Phi}}\in Z^{+} is a measurable function, we abbreviate fΦ\mathcal{H}_{f_{\Phi}} with Φ\mathcal{H}_{\Phi}. Then any finite subset 𝒢Φ\mathcal{G}\subset\mathcal{H}_{\Phi} implies span{1,𝒢}Φ\text{span}\{1,\mathcal{G}\}\subset\mathcal{H}_{\Phi}, where 11 denotes a constant function.

Proof.

For any γ1,γ2\gamma_{1},\gamma_{2}\in\mathbb{R} and any h1,h2fh_{1},h_{2}\in\mathcal{H}_{f}, γ1h1+γ2h2C(𝒳×𝒴)\gamma_{1}h_{1}+\gamma_{2}h_{2}\in C(\mathcal{X}\times\mathcal{Y}).

K1(f,γ1h1+γ2h2,PS)\displaystyle K_{1}(f,\gamma_{1}h_{1}+\gamma_{2}h_{2},P_{S}) =|𝔼[(γ1h1(X,Y)+γ2h2(X,Y))(Yv)|f(X)=v]|dPS(f1(v))\displaystyle=\int\Big{|}\mathbb{E}\left[(\gamma_{1}h_{1}(X,Y)+\gamma_{2}h_{2}(X,Y))(Y-v)\big{|}f(X)=v\right]\Big{|}dP_{S}(f^{-1}(v)) (78)
|𝔼[γ1h1(X,Y)(Yv)|f(X)=v]|dPS(f1(v))\displaystyle\leq\int\Big{|}\mathbb{E}\left[\gamma_{1}h_{1}(X,Y)(Y-v)\big{|}f(X)=v\right]\Big{|}dP_{S}(f^{-1}(v)) (79)
+|𝔼[γ2h2(X,Y)(Yv)|f(X)=v]|dPS(f1(v))\displaystyle\;\;\;\;+\int\Big{|}\mathbb{E}\left[\gamma_{2}h_{2}(X,Y)(Y-v)\big{|}f(X)=v\right]\Big{|}dP_{S}(f^{-1}(v)) (80)
=|γ1|K1(f,h1,PS)+|γ2|K1(f,h2,PS)\displaystyle=|\gamma_{1}|K_{1}(f,h_{1},P_{S})+|\gamma_{2}|K_{1}(f,h_{2},P_{S}) (81)
=0.\displaystyle=0. (82)

According to Lemma F.5, K1(f,h,PS)=0K_{1}(f,h,P_{S})=0 is equivalent to K2(f,h,PS)=0K_{2}(f,h,P_{S})=0 for bounded hh. Thus, K2(f,γ1h1+γ2h2,PS)=0K_{2}(f,\gamma_{1}h_{1}+\gamma_{2}h_{2},P_{S})=0, which implies γ1h1+γ2h2f\gamma_{1}h_{1}+\gamma_{2}h_{2}\in\mathcal{H}_{f}. This finishes the proof that f\mathcal{H}_{f} is a linear space.

For fΦ(x)=𝔼[Y|Φ(x)]f_{\Phi}(x)=\mathbb{E}[Y|\Phi(x)],

𝔼[Y|fΦ(X)]\displaystyle\mathbb{E}[Y|f_{\Phi}(X)] =𝔼[𝔼[Y|fΦ(X),Φ(X)]|fΦ(X)]\displaystyle=\mathbb{E}\left[\mathbb{E}[Y|f_{\Phi}(X),\Phi(X)]\big{|}f_{\Phi}(X)\right] (83)
=𝔼[𝔼[Y|Φ(X)]|fΦ(X)]\displaystyle=\mathbb{E}\left[\mathbb{E}[Y|\Phi(X)]\big{|}f_{\Phi}(X)\right] (84)
=𝔼[fΦ(X)|fΦ(X)]\displaystyle=\mathbb{E}[f_{\Phi}(X)|f_{\Phi}(X)] (85)
=fΦ(X).\displaystyle=f_{\Phi}(X). (86)

Thus, K2(fΦ,1,PS)=0K_{2}(f_{\Phi},1,P_{S})=0 which implies 1Φ1\in\mathcal{H}_{\Phi}. Since Φ\mathcal{H}_{\Phi} is a linear space, 𝒢Φ\mathcal{G}\subset\mathcal{H}_{\Phi} implies span{1,𝒢}Φ\text{span}\{1,\mathcal{G}\}\subset\mathcal{H}_{\Phi}. ∎
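The linearity behind Proposition F.10 can be checked numerically: the per-level functional T_v(h) = E[h(X,Y)(Y - v) | f(X) = v] is linear in h, so its zero set is a linear space. A minimal sketch with synthetic data (the predictor, grouping functions, and sample sizes are illustrative assumptions; each queried level set is assumed non-empty):

```python
# Toy check of linearity: T_v(gamma1*h1 + gamma2*h2) equals
# gamma1*T_v(h1) + gamma2*T_v(h2) for the calibration functional T_v.
import random

random.seed(0)
data = [(random.random(), random.random()) for _ in range(1000)]  # (x, y) pairs
f = lambda x: round(x, 1)  # a toy predictor with discrete level sets

def T(h, v):
    # empirical E[h(X, Y) * (Y - v) | f(X) = v]
    pts = [(x, y) for x, y in data if f(x) == v]
    return sum(h(x, y) * (y - v) for x, y in pts) / len(pts)

h1 = lambda x, y: x + y
h2 = lambda x, y: x * y
g1, g2 = 2.0, -3.0
combo = lambda x, y: g1 * h1(x, y) + g2 * h2(x, y)

v = 0.5
assert abs(T(combo, v) - (g1 * T(h1, v) + g2 * T(h2, v))) < 1e-9
```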

Lemma F.11.

For any absolutely continuous probability measure P(X,Y)P(X,Y) with a continuous density function pp, and any grouping function hC(𝒳×𝒴)h\in C(\mathcal{X}\times\mathcal{Y}), there exist γ0\gamma\neq 0 and ρ\rho\in\mathbb{R} such that γh+ρC(𝒳×𝒴)\gamma h+\rho\in C(\mathcal{X}\times\mathcal{Y}), and it is the density ratio between some absolutely continuous probability measure PP^{\prime} and PP, i.e., dP(x,y)=(γh(x,y)+ρ)dP(x,y)dP^{\prime}(x,y)=(\gamma h(x,y)+\rho)dP(x,y), where PP^{\prime} also has a continuous density function.

Proof.

Since hh is continuous on the compact set 𝒳×𝒴\mathcal{X}\times\mathcal{Y} and hence bounded, there exists ρ\rho^{\prime}\in\mathbb{R} such that h(x,y)+ρ>0h(x,y)+\rho^{\prime}>0 for any x𝒳,y𝒴x\in\mathcal{X},y\in\mathcal{Y}. Define:

γ\displaystyle\gamma =1(h(x,y)+ρ)𝑑P(x,y).\displaystyle=\frac{1}{\int(h(x,y)+\rho^{\prime})dP(x,y)}. (87)
ρ\displaystyle\rho =ρ(h(x,y)+ρ)𝑑P(x,y).\displaystyle=\frac{\rho^{\prime}}{\int(h(x,y)+\rho^{\prime})dP(x,y)}. (88)

γh+ρ\gamma h+\rho is still continuous.

We have γh(x,y)+ρ=γ(h(x,y)+ρ)>0\gamma h(x,y)+\rho=\gamma(h(x,y)+\rho^{\prime})>0 for any x𝒳,y𝒴x\in\mathcal{X},y\in\mathcal{Y}.

𝑑P(x,y)\displaystyle\int dP^{\prime}(x,y) =(γh(x,y)+ρ)𝑑P(x,y)\displaystyle=\int(\gamma h(x,y)+\rho)dP(x,y) (89)
=γ(h(x,y)+ρ)𝑑P(x,y)\displaystyle=\gamma\int(h(x,y)+\rho^{\prime})dP(x,y) (90)
=1.\displaystyle=1. (91)

Thus, P(X,Y)P^{\prime}(X,Y) is an absolutely continuous probability measure.

Its density function p=(γh+ρ)pp^{\prime}=(\gamma h+\rho)p is continuous. ∎
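Lemma F.11's normalization can be illustrated on a toy discrete measure: shift any bounded grouping function to be positive, then rescale it so it integrates to one against P. The pmf and h below are illustrative assumptions:

```python
# Toy check of Lemma F.11: gamma*h + rho is a valid density ratio, i.e.
# it is positive and the reweighted measure p_new sums to 1.
points = [0, 1, 2, 3]
p = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}        # base measure P (a pmf)
h = {0: -1.5, 1: 0.5, 2: 2.0, 3: -0.25}     # a bounded grouping function

rho_prime = 1.0 - min(h.values())           # ensures h + rho' > 0 everywhere
Z = sum((h[z] + rho_prime) * p[z] for z in points)
gamma, rho = 1.0 / Z, rho_prime / Z         # as in Equations 87-88

ratio = {z: gamma * h[z] + rho for z in points}
p_new = {z: ratio[z] * p[z] for z in points}

assert min(ratio.values()) > 0
assert abs(sum(p_new.values()) - 1.0) < 1e-9
```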

Theorem F.12.

Consider an absolutely continuous probability measure PS(X,Y)P_{S}(X,Y) and a predictor fΦ(x)=𝔼PS[Y|Φ(x)]f_{\Phi}(x)=\mathbb{E}_{P_{S}}[Y|\Phi(x)] where Φ:ddΦ,dΦZ+\Phi:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d_{\Phi}},{d_{\Phi}}\in Z^{+} is a measurable function. We abbreviate fΦ\mathcal{H}_{f_{\Phi}} with Φ\mathcal{H}_{\Phi}.

Φ\displaystyle\mathcal{H}_{\Phi} ={hC(𝒳×𝒴):CovPS[h(X,Y),Y|fΦ=v]=0for almost everyv[0,1]}\displaystyle=\left\{h\in C(\mathcal{X}\times\mathcal{Y}):\text{Cov}_{P_{S}}\left[h(X,Y),Y|f_{\Phi}=v\right]=0\;\;\text{for almost every}\;v\in[0,1]\right\} (92)
=span{p(x,y)pS(x,y):𝔼P[Y|fΦ]=𝔼PS[Y|fΦ]almost surely,p is continuous}.\displaystyle=\text{span}\left\{\frac{p(x,y)}{p_{S}(x,y)}:\mathbb{E}_{P}[Y|f_{\Phi}]=\mathbb{E}_{P_{S}}[Y|f_{\Phi}]\;\;\text{almost surely},\;\text{$p$ is continuous}\right\}. (93)
Proof.

First we prove:

Φ={hC(𝒳×𝒴):CovPS[h(X,Y),Y|fΦ=v]=0for almost everyv[0,1]}.\displaystyle\mathcal{H}_{\Phi}=\left\{h\in C(\mathcal{X}\times\mathcal{Y}):\text{Cov}_{P_{S}}\left[h(X,Y),Y|f_{\Phi}=v\right]=0\;\;\text{for almost every}\;v\in[0,1]\right\}. (94)

For each v[0,1]v\in[0,1] and any hC(𝒳×𝒴)h\in C(\mathcal{X}\times\mathcal{Y}),

𝔼PS[h(X,Y)(Yv)|fΦ=v]\displaystyle\mathbb{E}_{P_{S}}\left[h(X,Y)(Y-v)\big{|}f_{\Phi}=v\right] =𝔼PS[h(X,Y)Y|fΦ=v]v𝔼PS[h(X,Y)|fΦ=v]\displaystyle=\mathbb{E}_{P_{S}}\left[h(X,Y)Y\Big{|}f_{\Phi}=v\right]-v\mathbb{E}_{P_{S}}\left[h(X,Y)\Big{|}f_{\Phi}=v\right] (95)
=𝔼PS[h(X,Y)Y|fΦ=v]\displaystyle=\mathbb{E}_{P_{S}}\left[h(X,Y)Y\Big{|}f_{\Phi}=v\right] (96)
𝔼PS[h(X,Y)|fΦ=v]𝔼[Y|fΦ=v]\displaystyle\;\;\;\;-\mathbb{E}_{P_{S}}\left[h(X,Y)\Big{|}f_{\Phi}=v\right]\mathbb{E}[Y|f_{\Phi}=v] (97)
=CovPS[h(X,Y),Y|fΦ=v].\displaystyle=\text{Cov}_{P_{S}}\left[h(X,Y),Y|f_{\Phi}=v\right]. (98)

Equation 97 follows from Equation 86.

For any hC(𝒳×𝒴)h\in C(\mathcal{X}\times\mathcal{Y}),

hΦ\displaystyle h\in\mathcal{H}_{\Phi} K2(fΦ,h,PS)=0\displaystyle\Leftrightarrow K_{2}(f_{\Phi},h,P_{S})=0 (99)
(𝔼PS[h(X,Y)(Yv)|fΦ=v])2𝑑PS(fΦ1(v))=0\displaystyle\Leftrightarrow\int\Big{(}\mathbb{E}_{P_{S}}\left[h(X,Y)(Y-v)\big{|}f_{\Phi}=v\right]\Big{)}^{2}dP_{S}(f_{\Phi}^{-1}(v))=0 (100)
𝔼PS[h(X,Y)(Yv)|fΦ=v]=0for almost everyv[0,1]\displaystyle\Leftrightarrow\mathbb{E}_{P_{S}}\left[h(X,Y)(Y-v)\big{|}f_{\Phi}=v\right]=0\;\;\text{for almost every}\;v\in[0,1] (101)
CovPS[h(X,Y),Y|fΦ=v]=0for almost everyv[0,1].\displaystyle\Leftrightarrow\text{Cov}_{P_{S}}\left[h(X,Y),Y|f_{\Phi}=v\right]=0\;\;\text{for almost every}\;v\in[0,1]. (102)

Next, we prove

Φspan{p(x,y)pS(x,y):𝔼P[Y|fΦ]=𝔼PS[Y|fΦ]almost surely,p is continuous}.\displaystyle\mathcal{H}_{\Phi}\supset\text{span}\left\{\frac{p(x,y)}{p_{S}(x,y)}:\mathbb{E}_{P}[Y|f_{\Phi}]=\mathbb{E}_{P_{S}}[Y|f_{\Phi}]\;\;\text{almost surely},\;\text{$p$ is continuous}\right\}. (103)

This is equivalent to saying that for any absolutely continuous probability measure PP satisfying 𝔼P[Y|fΦ]=𝔼PS[Y|fΦ]\mathbb{E}_{P}[Y|f_{\Phi}]=\mathbb{E}_{P_{S}}[Y|f_{\Phi}] almost surely, CovPS[p(X,Y)/pS(X,Y),Y|fΦ=v]=0\text{Cov}_{P_{S}}\left[p(X,Y)/p_{S}(X,Y),Y|f_{\Phi}=v\right]=0 for almost every v[0,1]v\in[0,1].

CovPS[p(X,Y)pS(X,Y),Y|fΦ=v]\displaystyle\;\;\;\;\text{Cov}_{P_{S}}\left[\frac{p(X,Y)}{p_{S}(X,Y)},Y\Big{|}f_{\Phi}=v\right] (104)
=𝔼PS[p(X,Y)pS(X,Y)Y|fΦ=v]𝔼PS[p(X,Y)pS(X,Y)|fΦ=v]𝔼PS[Y|fΦ=v]\displaystyle=\mathbb{E}_{P_{S}}\left[\frac{p(X,Y)}{p_{S}(X,Y)}Y\Big{|}f_{\Phi}=v\right]-\mathbb{E}_{P_{S}}\left[\frac{p(X,Y)}{p_{S}(X,Y)}\Big{|}f_{\Phi}=v\right]\mathbb{E}_{P_{S}}\left[Y\Big{|}f_{\Phi}=v\right] (105)
=dP(fΦ1(v))dPS(fΦ1(v))p(x,y|fΦ=v)pS(x,y|fΦ=v)y𝑑PS(x,y|fΦ=v)\displaystyle=\int\frac{dP(f_{\Phi}^{-1}(v))}{dP_{S}(f_{\Phi}^{-1}(v))}\frac{p(x,y|f_{\Phi}=v)}{p_{S}(x,y|f_{\Phi}=v)}y\cdot dP_{S}(x,y|f_{\Phi}=v) (106)
𝔼PS[Y|fΦ=v]dP(fΦ1(v))dPS(fΦ1(v))p(x,y|fΦ=v)pS(x,y|fΦ=v)𝑑PS(x,y|fΦ=v)\displaystyle\;\;\;\;-\mathbb{E}_{P_{S}}\left[Y\Big{|}f_{\Phi}=v\right]\cdot\int\frac{dP(f_{\Phi}^{-1}(v))}{dP_{S}(f_{\Phi}^{-1}(v))}\frac{p(x,y|f_{\Phi}=v)}{p_{S}(x,y|f_{\Phi}=v)}\cdot dP_{S}(x,y|f_{\Phi}=v) (107)
=𝔼P[Y|fΦ=v]dP(fΦ1(v))dPS(fΦ1(v))𝔼PS[Y|fΦ=v]dP(fΦ1(v))dPS(fΦ1(v))\displaystyle=\mathbb{E}_{P}\left[Y\Big{|}f_{\Phi}=v\right]\frac{dP(f_{\Phi}^{-1}(v))}{dP_{S}(f_{\Phi}^{-1}(v))}-\mathbb{E}_{P_{S}}\left[Y\Big{|}f_{\Phi}=v\right]\frac{dP(f_{\Phi}^{-1}(v))}{dP_{S}(f_{\Phi}^{-1}(v))} (108)
=[𝔼P[Y|fΦ=v]𝔼PS[Y|fΦ=v]]dP(fΦ1(v))dPS(fΦ1(v))\displaystyle=\left[\mathbb{E}_{P}\left[Y\Big{|}f_{\Phi}=v\right]-\mathbb{E}_{P_{S}}\left[Y\Big{|}f_{\Phi}=v\right]\right]\frac{dP(f_{\Phi}^{-1}(v))}{dP_{S}(f_{\Phi}^{-1}(v))} (109)
=0for almost everyv[0,1].\displaystyle=0\;\;\;\;\text{for almost every}\;v\in[0,1]. (110)

Next, we prove

Φspan{p(x,y)pS(x,y):𝔼P[Y|fΦ]=𝔼PS[Y|fΦ]almost surely,p is continuous}.\displaystyle\mathcal{H}_{\Phi}\subset\text{span}\left\{\frac{p(x,y)}{p_{S}(x,y)}:\mathbb{E}_{P}[Y|f_{\Phi}]=\mathbb{E}_{P_{S}}[Y|f_{\Phi}]\;\;\text{almost surely},\;\text{$p$ is continuous}\right\}. (111)

By Lemma F.11, any grouping function hΦh\in\mathcal{H}_{\Phi} can be written as h(x,y)=γp(x,y)/pS(x,y)+ρh(x,y)=\gamma p(x,y)/p_{S}(x,y)+\rho for some continuous density function pp and γ0\gamma\neq 0. Thus, it suffices to prove that CovPS[γp(X,Y)/pS(X,Y)+ρ,Y|fΦ=v]=0\text{Cov}_{P_{S}}\left[\gamma p(X,Y)/p_{S}(X,Y)+\rho,Y|f_{\Phi}=v\right]=0 implies 𝔼P[Y|fΦ=v]=𝔼PS[Y|fΦ=v]\mathbb{E}_{P}[Y|f_{\Phi}=v]=\mathbb{E}_{P_{S}}[Y|f_{\Phi}=v].

CovPS[γp(X,Y)pS(X,Y)+ρ,Y|fΦ=v]\displaystyle\;\;\;\;\text{Cov}_{P_{S}}\left[\gamma\frac{p(X,Y)}{p_{S}(X,Y)}+\rho,Y\Big{|}f_{\Phi}=v\right] (112)
=γ𝔼PS[p(X,Y)pS(X,Y)Y|fΦ=v]+ρEPS[Y|fΦ=v]\displaystyle=\gamma\mathbb{E}_{P_{S}}\left[\frac{p(X,Y)}{p_{S}(X,Y)}Y\Big{|}f_{\Phi}=v\right]+\rho E_{P_{S}}\left[Y\Big{|}f_{\Phi}=v\right] (113)
γEPS[p(X,Y)pS(X,Y)|fΦ=v]EPS[Y|fΦ=v]ρEPS[Y|fΦ=v]\displaystyle\;\;\;\;-\gamma E_{P_{S}}\left[\frac{p(X,Y)}{p_{S}(X,Y)}\Big{|}f_{\Phi}=v\right]E_{P_{S}}\left[Y\Big{|}f_{\Phi}=v\right]-\rho E_{P_{S}}\left[Y\Big{|}f_{\Phi}=v\right] (114)
=γ𝔼PS[p(X,Y)pS(X,Y)Y|fΦ=v]γEPS[p(X,Y)pS(X,Y)|fΦ=v]EPS[Y|fΦ=v]\displaystyle=\gamma\mathbb{E}_{P_{S}}\left[\frac{p(X,Y)}{p_{S}(X,Y)}Y\Big{|}f_{\Phi}=v\right]-\gamma E_{P_{S}}\left[\frac{p(X,Y)}{p_{S}(X,Y)}\Big{|}f_{\Phi}=v\right]E_{P_{S}}\left[Y\Big{|}f_{\Phi}=v\right] (115)
=γCovPS[p(X,Y)pS(X,Y),Y|fΦ=v]\displaystyle=\gamma\text{Cov}_{P_{S}}\left[\frac{p(X,Y)}{p_{S}(X,Y)},Y\Big{|}f_{\Phi}=v\right] (116)
=γ[𝔼P[Y|fΦ=v]𝔼PS[Y|fΦ=v]]dP(fΦ1(v))dPS(fΦ1(v)).\displaystyle=\gamma\left[\mathbb{E}_{P}\left[Y\Big{|}f_{\Phi}=v\right]-\mathbb{E}_{P_{S}}\left[Y\Big{|}f_{\Phi}=v\right]\right]\frac{dP(f_{\Phi}^{-1}(v))}{dP_{S}(f_{\Phi}^{-1}(v))}. (117)

Equation 117 follows from Equation 109. Since γ0\gamma\neq 0, we have 𝔼P[Y|fΦ=v]=𝔼PS[Y|fΦ=v]\mathbb{E}_{P}[Y|f_{\Phi}=v]=\mathbb{E}_{P_{S}}[Y|f_{\Phi}=v] if CovPS[γp(X,Y)/pS(X,Y)+ρ,Y|fΦ=v]=0\text{Cov}_{P_{S}}\left[\gamma p(X,Y)/p_{S}(X,Y)+\rho,Y|f_{\Phi}=v\right]=0. ∎
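The forward direction of Theorem F.12 can be checked on a small discrete model: a reweighting that only moves mass between points sharing the same level set and label preserves E[Y | f_Phi] and yields a density ratio with zero conditional covariance with Y. All numbers below are illustrative assumptions:

```python
# Toy check of Theorem F.12: a density ratio p_P / p_S whose distribution
# preserves E[Y | f_Phi] has Cov[h, Y | f_Phi = v] = 0 on every level set.
level = {0: "a", 1: "a", 2: "b", 3: "b"}   # f_Phi's level sets over x
p_S = {(0,0):0.10, (0,1):0.15, (1,0):0.10, (1,1):0.15,
       (2,0):0.20, (2,1):0.05, (3,0):0.20, (3,1):0.05}
# P moves mass only between points sharing (level, y), so E_P[Y|f] = E_PS[Y|f]:
p_P = dict(p_S)
p_P[(0,0)], p_P[(1,0)] = 0.15, 0.05
p_P[(2,1)], p_P[(3,1)] = 0.08, 0.02

h = {z: p_P[z] / p_S[z] for z in p_S}      # the grouping function (density ratio)

def cond_cov(v):
    # Cov_PS[h(X, Y), Y | f_Phi = v]
    pts = [z for z in p_S if level[z[0]] == v]
    m = sum(p_S[z] for z in pts)
    Eh  = sum(p_S[z] * h[z] for z in pts) / m
    Ey  = sum(p_S[z] * z[1] for z in pts) / m
    Ehy = sum(p_S[z] * h[z] * z[1] for z in pts) / m
    return Ehy - Eh * Ey

assert abs(cond_cov("a")) < 1e-9 and abs(cond_cov("b")) < 1e-9
```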

Corollary F.13 (Restatement of Theorem 4.2).

Consider an absolutely continuous probability measure PS(X,Y)P_{S}(X,Y) and a calibrated predictor ff.

f\displaystyle\mathcal{H}_{f} ={hC(𝒳×𝒴):CovPS[h(X,Y),Y|f=v]=0for almost everyv[0,1]}\displaystyle=\left\{h\in C(\mathcal{X}\times\mathcal{Y}):\text{Cov}_{P_{S}}\left[h(X,Y),Y|f=v\right]=0\;\;\text{for almost every}\;v\in[0,1]\right\} (118)
=span{p(x,y)pS(x,y):𝔼P[Y|f]=𝔼PS[Y|f]almost surely,p is continuous}.\displaystyle=\text{span}\left\{\frac{p(x,y)}{p_{S}(x,y)}:\mathbb{E}_{P}[Y|f]=\mathbb{E}_{P_{S}}[Y|f]\;\;\text{almost surely},\;\text{$p$ is continuous}\right\}. (119)
Remark F.14.

𝔼P[Y|f]=𝔼PS[Y|f]\mathbb{E}_{P}[Y|f]=\mathbb{E}_{P_{S}}[Y|f] almost surely implies 𝔼P[Y|f]=f\mathbb{E}_{P}[Y|f]=f, which is equivalent to RP(f)=infg:[0,1][0,1]RP(gf)R_{P}(f)=\inf_{g:[0,1]\rightarrow[0,1]}R_{P}(g\circ f), since we adopt squared error.

Proof.

Since ff is calibrated, we have 𝔼PS[Y|f]=f\mathbb{E}_{P_{S}}[Y|f]=f. Take Φ(x)=f(x)\Phi(x)=f(x).

fΦ(x)=𝔼PS[Y|Φ(x)]=𝔼PS[Y|f(x)]=f(x).\displaystyle f_{\Phi}(x)=\mathbb{E}_{P_{S}}[Y|\Phi(x)]=\mathbb{E}_{P_{S}}[Y|f(x)]=f(x). (120)

Apply Theorem F.12 and the proof is complete. ∎

Theorem F.15 (Restatement of Theorem 4.3 (first part)).

Consider an absolutely continuous probability measure PS(X,Y)P_{S}(X,Y) and a predictor fΦ(x)=𝔼PS[Y|Φ(x)]f_{\Phi}(x)=\mathbb{E}_{P_{S}}[Y|\Phi(x)] where Φ:ddΦ,dΦZ+\Phi:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d_{\Phi}},{d_{\Phi}}\in Z^{+} is a measurable function. Φ\mathcal{H}_{\Phi} can be decomposed as Φ=1,Φ+2,Φ\mathcal{H}_{\Phi}=\mathcal{H}_{1,\Phi}+\mathcal{H}_{2,\Phi}.

1,Φ\displaystyle\mathcal{H}_{1,\Phi} :={hC(𝒳×𝒴):CovPS[h(Φ,Y),Y|fΦ=v]=0for almost everyv[0,1]}\displaystyle:=\left\{h\in C(\mathcal{X}\times\mathcal{Y}):\text{Cov}_{P_{S}}\left[h(\Phi,Y),Y|f_{\Phi}=v\right]=0\;\;\text{for almost every}\;v\in[0,1]\right\} (121)
=span{p(Φ,y)pS(Φ,y):𝔼P[Y|fΦ]=𝔼PS[Y|fΦ]almost surely,p is continuous}.\displaystyle=\text{span}\left\{\frac{p(\Phi,y)}{p_{S}(\Phi,y)}:\mathbb{E}_{P}[Y|f_{\Phi}]=\mathbb{E}_{P_{S}}[Y|f_{\Phi}]\;\;\text{almost surely},\;\text{$p$ is continuous}\right\}. (122)
2,Φ\displaystyle\mathcal{H}_{2,\Phi} :={hC(𝒳×𝒴):𝔼[h(X,Y)|Φ,Y]ChΦ,Y}\displaystyle:=\left\{h\in C(\mathcal{X}\times\mathcal{Y}):\mathbb{E}[h(X,Y)|\Phi,Y]\equiv C_{h}\;\;\forall\Phi,Y\right\} (123)
=span{p(x|Φ,y)pS(x|Φ,y):p is continuous}.\displaystyle=\text{span}\left\{\frac{p(x|\Phi,y)}{p_{S}(x|\Phi,y)}:\;\text{$p$ is continuous}\right\}. (124)
Remark F.16.

1,Φ\mathcal{H}_{1,\Phi} contains functions defined on supp(Φ)×𝒴\text{supp}(\Phi)\times\mathcal{Y} which can be rewritten as functions on 𝒳×𝒴\mathcal{X}\times\mathcal{Y} by variable substitution. Thus, 1,Φ\mathcal{H}_{1,\Phi} is still a set of grouping functions. For 2,Φ\mathcal{H}_{2,\Phi}, ChC_{h} is a constant depending on hh.

Proof.

First we prove Φ1,Φ+2,Φ\mathcal{H}_{\Phi}\supset\mathcal{H}_{1,\Phi}+\mathcal{H}_{2,\Phi}.

For any h11,Φh_{1}\in\mathcal{H}_{1,\Phi} and h22,Φh_{2}\in\mathcal{H}_{2,\Phi},

CovPS[h1(Φ,Y)+h2(X,Y),Y|fΦ=v]\displaystyle\;\;\;\;\text{Cov}_{P_{S}}\left[h_{1}(\Phi,Y)+h_{2}(X,Y),Y|f_{\Phi}=v\right] (125)
=CovPS[h1(Φ,Y),Y|fΦ=v]+CovPS[h2(X,Y),Y|fΦ=v]\displaystyle=\text{Cov}_{P_{S}}\left[h_{1}(\Phi,Y),Y|f_{\Phi}=v\right]+\text{Cov}_{P_{S}}\left[h_{2}(X,Y),Y|f_{\Phi}=v\right] (126)
=CovPS[h2(X,Y),Y|fΦ=v]\displaystyle=\text{Cov}_{P_{S}}\left[h_{2}(X,Y),Y|f_{\Phi}=v\right] (127)
=𝔼PS[h2(X,Y)Y|fΦ=v]𝔼PS[h2(X,Y)|fΦ=v]𝔼PS[Y|fΦ=v]\displaystyle=\mathbb{E}_{P_{S}}\left[h_{2}(X,Y)Y|f_{\Phi}=v\right]-\mathbb{E}_{P_{S}}\left[h_{2}(X,Y)|f_{\Phi}=v\right]\mathbb{E}_{P_{S}}\left[Y|f_{\Phi}=v\right] (128)
=𝔼PS[𝔼PS[h2(X,Y)Y|Φ,Y]|fΦ=v]𝔼PS[𝔼PS[h2(X,Y)|Φ,Y]|fΦ=v]𝔼PS[Y|fΦ=v]\displaystyle=\mathbb{E}_{P_{S}}\left[\mathbb{E}_{P_{S}}\left[h_{2}(X,Y)Y|\Phi,Y\right]\big{|}f_{\Phi}=v\right]-\mathbb{E}_{P_{S}}\left[\mathbb{E}_{P_{S}}\left[h_{2}(X,Y)|\Phi,Y\right]\big{|}f_{\Phi}=v\right]\mathbb{E}_{P_{S}}\left[Y|f_{\Phi}=v\right] (129)
=𝔼PS[𝔼PS[h2(X,Y)|Φ,Y]Y|fΦ=v]𝔼PS[𝔼PS[h2(X,Y)|Φ,Y]|fΦ=v]𝔼PS[Y|fΦ=v]\displaystyle=\mathbb{E}_{P_{S}}\left[\mathbb{E}_{P_{S}}\left[h_{2}(X,Y)|\Phi,Y\right]Y\big{|}f_{\Phi}=v\right]-\mathbb{E}_{P_{S}}\left[\mathbb{E}_{P_{S}}\left[h_{2}(X,Y)|\Phi,Y\right]\big{|}f_{\Phi}=v\right]\mathbb{E}_{P_{S}}\left[Y|f_{\Phi}=v\right] (130)
=Ch2EPS[Y|fΦ=v]Ch2EPS[Y|fΦ=v]\displaystyle=C_{h_{2}}E_{P_{S}}\left[Y|f_{\Phi}=v\right]-C_{h_{2}}E_{P_{S}}\left[Y|f_{\Phi}=v\right] (131)
=0.\displaystyle=0. (132)

Next we prove Φ1,Φ+2,Φ\mathcal{H}_{\Phi}\subset\mathcal{H}_{1,\Phi}+\mathcal{H}_{2,\Phi}.

For any hΦh\in\mathcal{H}_{\Phi}, let

h1(Φ(x),y)\displaystyle h_{1}(\Phi(x),y) =𝔼PS[h(X,Y)|Φ=Φ(x),Y=y].\displaystyle=\mathbb{E}_{P_{S}}[h(X,Y)|\Phi=\Phi(x),Y=y]. (133)
h2(x,y)\displaystyle h_{2}(x,y) =h(x,y)𝔼PS[h(X,Y)|Φ=Φ(x),Y=y].\displaystyle=h(x,y)-\mathbb{E}_{P_{S}}[h(X,Y)|\Phi=\Phi(x),Y=y]. (134)

Then we have

CovPS[h1(Φ,Y),Y|fΦ=v]\displaystyle\;\;\;\;\text{Cov}_{P_{S}}\left[h_{1}(\Phi,Y),Y|f_{\Phi}=v\right] (135)
=𝔼PS[h1(Φ,Y)Y|fΦ=v]𝔼PS[h1(Φ,Y)|fΦ=v]𝔼PS[Y|fΦ=v]\displaystyle=\mathbb{E}_{P_{S}}\left[h_{1}(\Phi,Y)Y|f_{\Phi}=v\right]-\mathbb{E}_{P_{S}}\left[h_{1}(\Phi,Y)|f_{\Phi}=v\right]\mathbb{E}_{P_{S}}\left[Y|f_{\Phi}=v\right] (136)
=𝔼PS[𝔼PS[h(X,Y)|Φ,Y]Y|fΦ=v]𝔼PS[𝔼PS[h(X,Y)|Φ,Y]|fΦ=v]𝔼PS[Y|fΦ=v]\displaystyle=\mathbb{E}_{P_{S}}\left[\mathbb{E}_{P_{S}}[h(X,Y)|\Phi,Y]Y\big{|}f_{\Phi}=v\right]-\mathbb{E}_{P_{S}}\left[\mathbb{E}_{P_{S}}[h(X,Y)|\Phi,Y]\big{|}f_{\Phi}=v\right]\mathbb{E}_{P_{S}}\left[Y|f_{\Phi}=v\right] (137)
=𝔼PS[h(X,Y)Y|fΦ=v]𝔼PS[h(X,Y)|fΦ=v]𝔼PS[Y|fΦ=v]\displaystyle=\mathbb{E}_{P_{S}}\left[h(X,Y)Y|f_{\Phi}=v\right]-\mathbb{E}_{P_{S}}\left[h(X,Y)|f_{\Phi}=v\right]\mathbb{E}_{P_{S}}\left[Y|f_{\Phi}=v\right] (138)
=CovPS[h(X,Y),Y|fΦ=v]\displaystyle=\text{Cov}_{P_{S}}\left[h(X,Y),Y|f_{\Phi}=v\right] (139)
=0for almost everyv[0,1].\displaystyle=0\;\;\;\;\text{for almost every}\;v\in[0,1]. (140)

Thus, h1(Φ(x),y)1,Φh_{1}(\Phi(x),y)\in\mathcal{H}_{1,\Phi}.

𝔼PS[h2(X,Y)|Φ,Y]\displaystyle\mathbb{E}_{P_{S}}[h_{2}(X,Y)|\Phi,Y] =𝔼PS[h(X,Y)𝔼PS[h(X,Y)|Φ,Y]|Φ,Y]\displaystyle=\mathbb{E}_{P_{S}}\left[h(X,Y)-\mathbb{E}_{P_{S}}[h(X,Y)|\Phi,Y]\big{|}\Phi,Y\right] (141)
=𝔼PS[h(X,Y)|Φ,Y]𝔼PS[h(X,Y)|Φ,Y]\displaystyle=\mathbb{E}_{P_{S}}[h(X,Y)|\Phi,Y]-\mathbb{E}_{P_{S}}[h(X,Y)|\Phi,Y] (142)
=0.\displaystyle=0. (143)

Thus, h2(x,y)2,Φh_{2}(x,y)\in\mathcal{H}_{2,\Phi}. Following an argument similar to the proof of Theorem F.12, we have

1,Φ=span{p(Φ,y)pS(Φ,y):𝔼P[Y|fΦ]=𝔼PS[Y|fΦ]almost surely,p is continuous}.\displaystyle\mathcal{H}_{1,\Phi}=\text{span}\left\{\frac{p(\Phi,y)}{p_{S}(\Phi,y)}:\mathbb{E}_{P}[Y|f_{\Phi}]=\mathbb{E}_{P_{S}}[Y|f_{\Phi}]\;\;\text{almost surely},\;\text{$p$ is continuous}\right\}. (144)

Next, we prove 2,Φspan{p(x|Φ,y)/pS(x|Φ,y):p is continuous}\mathcal{H}_{2,\Phi}\supset\text{span}\left\{p(x|\Phi,y)/p_{S}(x|\Phi,y):\;p\text{ is continuous}\right\}.

This is equivalent to saying that p(x|Φ,y)/pS(x|Φ,y)2,Φp(x|\Phi,y)/p_{S}(x|\Phi,y)\in\mathcal{H}_{2,\Phi} for any continuous density function pp.

𝔼PS[p(X|Φ,Y)pS(X|Φ,Y)|Φ,Y]\displaystyle\mathbb{E}_{P_{S}}\left[\frac{p(X|\Phi,Y)}{p_{S}(X|\Phi,Y)}\Big{|}\Phi,Y\right] =p(x|Φ,Y)pS(x|Φ,Y)𝑑PS(x|Φ,Y)\displaystyle=\int\frac{p(x|\Phi,Y)}{p_{S}(x|\Phi,Y)}dP_{S}(x|\Phi,Y) (145)
=𝑑P(x|Φ,Y)\displaystyle=\int dP(x|\Phi,Y) (146)
=1.\displaystyle=1. (147)

Thus, we have p(x|Φ,y)/pS(x|Φ,y)2,Φp(x|\Phi,y)/p_{S}(x|\Phi,y)\in\mathcal{H}_{2,\Phi}.

Next, we prove 2,Φspan{p(x|Φ,y)/pS(x|Φ,y):p is continuous}\mathcal{H}_{2,\Phi}\subset\text{span}\left\{p(x|\Phi,y)/p_{S}(x|\Phi,y):\;p\text{ is continuous}\right\}.

By Lemma F.11, any grouping function h22,Φh_{2}\in\mathcal{H}_{2,\Phi} can be written as h2(x,y)=γp(x,y)/pS(x,y)+ρh_{2}(x,y)=\gamma p(x,y)/p_{S}(x,y)+\rho for some continuous density function pp and γ0\gamma\neq 0. Thus, we just need to prove the statement that 𝔼PS[γp(X,Y)/pS(X,Y)+ρ|Φ,Y]Ch2\mathbb{E}_{P_{S}}[\gamma p(X,Y)/p_{S}(X,Y)+\rho|\Phi,Y]\equiv C_{h_{2}} implies p(X,Y)/pS(X,Y)p(X|Φ,Y)/pS(X|Φ,Y)p(X,Y)/p_{S}(X,Y)\equiv p(X|\Phi,Y)/p_{S}(X|\Phi,Y).

𝔼PS[γp(X,Y)pS(X,Y)+ρ|Φ,Y]\displaystyle\mathbb{E}_{P_{S}}\left[\gamma\frac{p(X,Y)}{p_{S}(X,Y)}+\rho\Big{|}\Phi,Y\right] =𝔼PS[γp(X|Φ,Y)pS(X|Φ,Y)p(Φ,Y)pS(Φ,Y)+ρ|Φ,Y]\displaystyle=\mathbb{E}_{P_{S}}\left[\gamma\frac{p(X|\Phi,Y)}{p_{S}(X|\Phi,Y)}\frac{p(\Phi,Y)}{p_{S}(\Phi,Y)}+\rho\Big{|}\Phi,Y\right] (148)
=γp(Φ,Y)pS(Φ,Y)𝔼PS[p(X|Φ,Y)pS(X|Φ,Y)|Φ,Y]+ρ\displaystyle=\gamma\frac{p(\Phi,Y)}{p_{S}(\Phi,Y)}\mathbb{E}_{P_{S}}\left[\frac{p(X|\Phi,Y)}{p_{S}(X|\Phi,Y)}\Big{|}\Phi,Y\right]+\rho (149)
=γp(Φ,Y)pS(Φ,Y)𝔼P[1|Φ,Y]+ρ\displaystyle=\gamma\frac{p(\Phi,Y)}{p_{S}(\Phi,Y)}\mathbb{E}_{P}\left[1\Big{|}\Phi,Y\right]+\rho (150)
=γp(Φ,Y)pS(Φ,Y)+ρ.\displaystyle=\gamma\frac{p(\Phi,Y)}{p_{S}(\Phi,Y)}+\rho. (151)

Thus, we have p(Φ,Y)/pS(Φ,Y)(Ch2ρ)/γp(\Phi,Y)/p_{S}(\Phi,Y)\equiv(C_{h_{2}}-\rho)/\gamma which is a constant. Since 𝔼PS[p(Φ,Y)/pS(Φ,Y)]=1\mathbb{E}_{P_{S}}[p(\Phi,Y)/p_{S}(\Phi,Y)]=1, we have p(Φ,Y)/pS(Φ,Y)1p(\Phi,Y)/p_{S}(\Phi,Y)\equiv 1.

p(X,Y)pS(X,Y)\displaystyle\frac{p(X,Y)}{p_{S}(X,Y)} =p(X|Φ,Y)pS(X|Φ,Y)p(Φ,Y)pS(Φ,Y)\displaystyle=\frac{p(X|\Phi,Y)}{p_{S}(X|\Phi,Y)}\frac{p(\Phi,Y)}{p_{S}(\Phi,Y)} (152)
=p(X|Φ,Y)pS(X|Φ,Y).\displaystyle=\frac{p(X|\Phi,Y)}{p_{S}(X|\Phi,Y)}. (153)
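The decomposition in Theorem F.15 can be verified numerically: centering any grouping function h at its conditional mean given (Phi, Y) splits it into h1 = E[h | Phi, Y] and h2 = h - h1 with E[h2 | Phi, Y] = 0, as in Equations 133–134 and 141–143. The pmf and h below are illustrative assumptions:

```python
# Toy check of the decomposition h = h1 + h2 with E[h2 | Phi, Y] = 0.
phi = {0: "a", 1: "a", 2: "b", 3: "b"}     # Phi collapses x into two cells
p_S = {(0,0):0.10, (0,1):0.15, (1,0):0.10, (1,1):0.15,
       (2,0):0.20, (2,1):0.05, (3,0):0.20, (3,1):0.05}
h = {(x, y): x + 2*y + 0.1*x*y for (x, y) in p_S}  # an arbitrary grouping function

def cond_mean(g, cell):
    # E_PS[g(X, Y) | (Phi, Y) = cell]
    pts = [z for z in p_S if (phi[z[0]], z[1]) == cell]
    m = sum(p_S[z] for z in pts)
    return sum(p_S[z] * g[z] for z in pts) / m

h1 = {z: cond_mean(h, (phi[z[0]], z[1])) for z in p_S}  # E[h | Phi, Y]
h2 = {z: h[z] - h1[z] for z in p_S}                     # centered remainder

cells = {(phi[x], y) for (x, y) in p_S}
assert all(abs(h[z] - (h1[z] + h2[z])) < 1e-9 for z in p_S)
assert all(abs(cond_mean(h2, c)) < 1e-9 for c in cells)
```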

Lemma F.17 (Theorem 3.2 from Globus-Harris et al. [2023]).

If ff is calibrated and there exists an h(x)h(x) such that

𝔼[(f(X)Y)2(h(X)Y)2|f(X)=v]α,\displaystyle\mathbb{E}[(f(X)-Y)^{2}-(h(X)-Y)^{2}|f(X)=v]\geq\alpha, (154)

then:

𝔼[h(X)(Yv)|f(X)=v]α2.\displaystyle\mathbb{E}[h(X)(Y-v)|f(X)=v]\geq\frac{\alpha}{2}. (155)
Proposition F.18 (Restatement of Theorem 4.3 (second part)).

If a predictor ff is multicalibrated with 1,Φ\mathcal{H}_{1,\Phi}, then RPS(f)RPS(fΦ)R_{P_{S}}(f)\leq R_{P_{S}}(f_{\Phi}).

Proof.

We prove by contradiction. If RPS(f)>RPS(fΦ)R_{P_{S}}(f)>R_{P_{S}}(f_{\Phi}), then

𝔼[(f(X)Y)2(fΦ(X)Y)2|f(X)=v]𝑑PS(f1(v))>0.\displaystyle\int\mathbb{E}\left[(f(X)-Y)^{2}-(f_{\Phi}(X)-Y)^{2}|f(X)=v\right]dP_{S}(f^{-1}(v))>0. (156)

Let αv=𝔼[(f(X)Y)2(fΦ(X)Y)2|f(X)=v]\alpha_{v}=\mathbb{E}\left[(f(X)-Y)^{2}-(f_{\Phi}(X)-Y)^{2}|f(X)=v\right].

Since ff is multicalibrated with 1,Φ\mathcal{H}_{1,\Phi}, ff is calibrated. It follows from Lemma F.17:

𝔼[fΦ(X)(Yv)|f(X)=v]αv2.\displaystyle\mathbb{E}[f_{\Phi}(X)(Y-v)|f(X)=v]\geq\frac{\alpha_{v}}{2}. (157)

Then,

K1(f,fΦ,PS)\displaystyle K_{1}(f,f_{\Phi},P_{S}) =|E[fΦ(X)(Yv)|f(X)=v]|dPS(f1(v))\displaystyle=\int\Big{|}E[f_{\Phi}(X)(Y-v)|f(X)=v]\Big{|}dP_{S}(f^{-1}(v)) (158)
αv2𝑑PS(f1(v))\displaystyle\geq\int\frac{\alpha_{v}}{2}dP_{S}(f^{-1}(v)) (159)
>0.\displaystyle>0. (160)

From Lemma F.5, we have K2(f,fΦ,PS)K1(f,fΦ,PS)2>0K_{2}(f,f_{\Phi},P_{S})\geq K_{1}(f,f_{\Phi},P_{S})^{2}>0. Since fΦ1,Φf_{\Phi}\in\mathcal{H}_{1,\Phi}, this contradicts the assumption that ff is multicalibrated with 1,Φ\mathcal{H}_{1,\Phi}. ∎

Proposition F.19 (Restatement of Theorem 4.3 (third part)).

fΦf_{\Phi} is an invariant predictor elicited by Φ\Phi across a set of environments \mathscr{E} where Pe(Φ,Y)=PS(Φ,Y)P_{e}(\Phi,Y)=P_{S}(\Phi,Y) for any ee\in\mathscr{E}. If a predictor ff is multicalibrated with 2,Φ\mathcal{H}_{2,\Phi}, then ff is also an invariant predictor across \mathscr{E} elicited by some representation.

Proof.

Since Pe(Φ,Y)=PS(Φ,Y)P_{e}(\Phi,Y)=P_{S}(\Phi,Y), we have 𝔼Pe[Y|Φ]=𝔼PS[Y|Φ]=fΦ\mathbb{E}_{P_{e}}[Y|\Phi]=\mathbb{E}_{P_{S}}[Y|\Phi]=f_{\Phi} for every ee\in\mathscr{E}.

Thus,

RPe(fΦ)=infg𝒢RPe(gΦ).\displaystyle R_{P_{e}}(f_{\Phi})=\inf_{g\in\mathcal{G}}{R_{P_{e}}(g\circ\Phi)}. (161)

This implies fΦf_{\Phi} is an invariant predictor elicited by Φ\Phi across \mathscr{E}.

For any ee\in\mathscr{E}, we have:

pe(x,y)pS(x,y)=pe(Φ,y)pS(Φ,y)pe(x|Φ,y)pS(x|Φ,y)=pe(x|Φ,y)pS(x|Φ,y)2,Φ.\displaystyle\frac{p_{e}(x,y)}{p_{S}(x,y)}=\frac{p_{e}(\Phi,y)}{p_{S}(\Phi,y)}\frac{p_{e}(x|\Phi,y)}{p_{S}(x|\Phi,y)}=\frac{p_{e}(x|\Phi,y)}{p_{S}(x|\Phi,y)}\in\mathcal{H}_{2,\Phi}. (162)

Since ff is multicalibrated with 2,Φ\mathcal{H}_{2,\Phi}, it follows from Theorem F.8:

RPe(f)=infg:[0,1][0,1]RPe(gf).\displaystyle R_{P_{e}}(f)=\inf_{g:[0,1]\rightarrow[0,1]}R_{P_{e}}(g\circ f). (163)

This implies that ff is an invariant predictor across \mathscr{E} elicited by ff. ∎
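The premise of Proposition F.19 can be illustrated numerically: if environments share P(Phi, Y) and differ only in P(Psi | Phi, Y), then E[Y | Phi] is identical across environments, i.e., f_Phi is invariant. The distributions below are illustrative assumptions:

```python
# Toy check: two environments share P(Phi, Y) but disagree on P(Psi | Phi, Y);
# E[Y | Phi] nevertheless coincides across environments.
p_phi_y = {("a", 0): 0.2, ("a", 1): 0.3, ("b", 0): 0.4, ("b", 1): 0.1}  # shared P(Phi, Y)
q = {  # environment-specific P(Psi = 1 | Phi, Y)
    "e1": {k: 0.3 for k in p_phi_y},
    "e2": {("a", 0): 0.9, ("a", 1): 0.1, ("b", 0): 0.5, ("b", 1): 0.7},
}

def E_y_given_phi(env, phi):
    # build the joint over (Phi, Psi, Y), then compute E[Y | Phi = phi]
    joint = {(ph, ps, y): p_phi_y[(ph, y)] * (q[env][(ph, y)] if ps else 1 - q[env][(ph, y)])
             for (ph, y) in p_phi_y for ps in (0, 1)}
    pts = [k for k in joint if k[0] == phi]
    m = sum(joint[k] for k in pts)
    return sum(joint[k] * k[2] for k in pts) / m

for phi in ("a", "b"):
    assert abs(E_y_given_phi("e1", phi) - E_y_given_phi("e2", phi)) < 1e-9
```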

Proposition F.20 (Restatement of Proposition 4.5).

Consider XdX\in\mathbb{R}^{d} which could be sliced as X=(Φ,Ψ)TX=(\Phi,\Psi)^{T} and Φ=(Λ,Ω)T\Phi=(\Lambda,\Omega)^{T}. Define 1,Φ:={h(Φ(x))C(𝒳×𝒴)}\mathcal{H}_{1,\Phi}^{\prime}:=\{h(\Phi(x))\in C(\mathcal{X}\times\mathcal{Y})\}. 1,X\mathcal{H}_{1,X}^{\prime} and 1,Λ\mathcal{H}_{1,\Lambda}^{\prime} are similarly defined. We have:

1. 1,X1,Φ1,Λ1,={C}.\mathcal{H}_{1,X}^{\prime}\supset\mathcal{H}_{1,\Phi}^{\prime}\supset\mathcal{H}_{1,\Lambda}^{\prime}\supset\mathcal{H}_{1,\emptyset}^{\prime}=\{C\}.

2. {C}=2,X2,Φ2,Λ2,.\{C\}=\mathcal{H}_{2,X}\subset\mathcal{H}_{2,\Phi}\subset\mathcal{H}_{2,\Lambda}\subset\mathcal{H}_{2,\emptyset}.

CC is a constant value function.

Remark F.21.

The proposition shows that 1\mathcal{H}_{1}^{\prime}, as a subspace of 1\mathcal{H}_{1}, evolves monotonically and in the opposite direction to 2\mathcal{H}_{2}. If we view the representation Φ\Phi as an information filter, incorporating more information from covariates facilitates multicalibration w.r.t. 1\mathcal{H}_{1} (and accuracy) but hampers multicalibration w.r.t. 2\mathcal{H}_{2} (and invariance). With 1\mathcal{H}_{1}^{\prime} and 2\mathcal{H}_{2} combined, a multicalibrated predictor searches for an appropriate level of filtering to balance the tradeoff between accuracy and invariance.

Proof.

1. For any h(Λ)1,Λh(\Lambda)\in\mathcal{H}_{1,\Lambda}^{\prime}, since Φ=(Λ,Ω)\Phi=(\Lambda,\Omega), h(Λ)h(\Lambda) is also a function of Φ\Phi. Thus, we have h(Λ)1,Φh(\Lambda)\in\mathcal{H}_{1,\Phi}^{\prime}. It follows that 1,Φ1,Λ\mathcal{H}_{1,\Phi}^{\prime}\supset\mathcal{H}_{1,\Lambda}^{\prime}. Similarly we have 1,X1,Φ\mathcal{H}_{1,X}^{\prime}\supset\mathcal{H}_{1,\Phi}^{\prime} and 1,Λ1,\mathcal{H}_{1,\Lambda}^{\prime}\supset\mathcal{H}_{1,\emptyset}^{\prime}.

2. For any h(Λ,Ω,Ψ,Y)2,Φh(\Lambda,\Omega,\Psi,Y)\in\mathcal{H}_{2,\Phi} such that 𝔼[h(Φ,Ψ,Y)|Φ,Y]=Ch\mathbb{E}[h(\Phi,\Psi,Y)|\Phi,Y]=C_{h} for any values of Φ,Y\Phi,Y, we have

𝔼[h(Λ,Ω,Ψ,Y)|Λ,Y]\displaystyle\mathbb{E}[h(\Lambda,\Omega,\Psi,Y)|\Lambda,Y] =𝔼[𝔼[h(Λ,Ω,Ψ,Y)|Λ,Ω,Y]|Λ,Y]\displaystyle=\mathbb{E}\left[\mathbb{E}[h(\Lambda,\Omega,\Psi,Y)|\Lambda,\Omega,Y]\big{|}\Lambda,Y\right] (164)
=𝔼[𝔼[h(Φ,Ψ,Y)|Φ,Y]|Λ,Y]\displaystyle=\mathbb{E}\left[\mathbb{E}[h(\Phi,\Psi,Y)|\Phi,Y]\big{|}\Lambda,Y\right] (165)
=𝔼[Ch|Λ,Y]\displaystyle=\mathbb{E}[C_{h}|\Lambda,Y] (166)
=Chfor any values of Λ,Y.\displaystyle=C_{h}\;\;\;\;\text{for any values of }\Lambda,Y. (167)

Thus, h(Λ,Ω,Ψ,Y)2,Λh(\Lambda,\Omega,\Psi,Y)\in\mathcal{H}_{2,\Lambda}. It follows that 2,Φ2,Λ\mathcal{H}_{2,\Phi}\subset\mathcal{H}_{2,\Lambda}. Similarly, we have 2,X2,Φ\mathcal{H}_{2,X}\subset\mathcal{H}_{2,\Phi} and 2,Λ2,\mathcal{H}_{2,\Lambda}\subset\mathcal{H}_{2,\emptyset}. Particularly for h(x,y)2,Xh(x,y)\in\mathcal{H}_{2,X}, h(X,Y)=𝔼[h(X,Y)|X,Y]h(X,Y)=\mathbb{E}[h(X,Y)|X,Y] must be a constant for all values of X,YX,Y. ∎
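The containment H_{2,Phi} ⊂ H_{2,Lambda} rests on the tower property used in Equations 164–167: conditional-mean invariance is preserved under coarsening the representation. A minimal numerical sketch, where x = (lam, om, psi), Phi = (lam, om), Lambda = lam, and a member of H_{2,Phi} is constructed by centering an arbitrary function (all weights and functions are illustrative assumptions):

```python
# Toy check: if E[h | Phi, Y] = C with Phi = (Lambda, Omega),
# then E[h | Lambda, Y] = C as well.
import itertools
import random

random.seed(0)
pts = list(itertools.product((0, 1), repeat=4))   # points (lam, om, psi, y)
w = {z: random.random() for z in pts}
Z = sum(w.values())
p = {z: w[z] / Z for z in pts}                    # a pmf on (X, Y)

def cond_means(g, keyfun):
    # E[g | keyfun(z)] under p, for every value of the key
    out = {}
    for k in {keyfun(z) for z in pts}:
        sel = [z for z in pts if keyfun(z) == k]
        m = sum(p[z] for z in sel)
        out[k] = sum(p[z] * g[z] for z in sel) / m
    return out

g = {(l, o, s, y): l + 2*o + 3*s + y + 0.5*s*y for (l, o, s, y) in pts}
mu = cond_means(g, lambda z: (z[0], z[1], z[3]))          # E[g | Phi, Y]
h = {z: g[z] - mu[(z[0], z[1], z[3])] + 1.0 for z in pts}  # E[h | Phi, Y] = 1

fine = cond_means(h, lambda z: (z[0], z[1], z[3]))    # condition on (Phi, Y)
coarse = cond_means(h, lambda z: (z[0], z[3]))        # condition on (Lambda, Y)
assert all(abs(m - 1.0) < 1e-9 for m in fine.values())
assert all(abs(m - 1.0) < 1e-9 for m in coarse.values())
```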

Lemma F.22 (Globus-Harris et al. [2023]).

Let 𝒳\mathcal{H}\subset\mathbb{R}^{\mathcal{X}} be a grouping function class closed under affine transformations, i.e., γh+ρ\gamma h+\rho\in\mathcal{H} for any hh\in\mathcal{H} and γ,ρ\gamma,\rho\in\mathbb{R}. If \mathcal{H} satisfies the (0,0)(0,0)-weak learning condition in Assumption F.1, a predictor ff is multicalibrated w.r.t. \mathcal{H} if and only if f(x)=𝔼[Y|x]f(x)=\mathbb{E}[Y|x] almost surely.

Proposition F.23.

For a measurable function Φ:ddΦ,dΦZ+\Phi:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d_{\Phi}},{d_{\Phi}}\in Z^{+}, a predictor f:supp(Φ)[0,1]f:\text{supp}(\Phi)\rightarrow[0,1] is multicalibrated w.r.t. 1,Φ\mathcal{H}_{1,\Phi} if and only if ff is multicalibrated w.r.t. 1,Φ\mathcal{H}_{1,\Phi}^{\prime}.

Proof.

Since 1,Φ1,Φ\mathcal{H}_{1,\Phi}^{\prime}\subset\mathcal{H}_{1,\Phi}, ff’s multicalibration w.r.t. 1,Φ\mathcal{H}_{1,\Phi} implies ff’s multicalibration w.r.t. 1,Φ\mathcal{H}_{1,\Phi}^{\prime}.

On the other hand, 1,Φ\mathcal{H}_{1,\Phi}^{\prime} satisfies the (0,0)(0,0)-weak learning condition with the pushforward measure on Φ\Phi, because 𝔼[Y|Φ]1,Φ\mathbb{E}[Y|\Phi]\in\mathcal{H}_{1,\Phi}^{\prime}. It follows from Lemma F.22 that if ff is multicalibrated w.r.t. 1,Φ\mathcal{H}_{1,\Phi}^{\prime}, then f(Φ)=𝔼[Y|Φ]f(\Phi)=\mathbb{E}[Y|\Phi] almost surely. By the definition of 1,Φ\mathcal{H}_{1,\Phi}, f(Φ)f(\Phi) is multicalibrated w.r.t. 1,Φ\mathcal{H}_{1,\Phi}. ∎

F.4 MC-PseudoLabel: An Algorithm for Extended Multicalibration

Lemma F.24.

Fix a model f:𝒳[0,1]f:\mathcal{X}\rightarrow[0,1]. Suppose for some vRange(f)v\in\text{Range}(f) there is an hh\in\mathcal{H} such that:

𝔼[h(x,y)(yv)|f(x)=v]>α\mathbb{E}[h(x,y)(y-v)|f(x)=v]>\alpha

Let h=v+ηh(x,y)h^{\prime}=v+\eta h(x,y) for η=α𝔼[h(x,y)2|f(x)=v]\eta=\frac{\alpha}{\mathbb{E}[h(x,y)^{2}|f(x)=v]}.

Then:

𝔼[(f(x)y)2(h(x,y)y)2|f(x)=v]>α2𝔼[h(x,y)2|f(x)=v].\mathbb{E}[(f(x)-y)^{2}-(h^{\prime}(x,y)-y)^{2}|f(x)=v]>\frac{\alpha^{2}}{\mathbb{E}[h(x,y)^{2}|f(x)=v]}.

Proof. Following [Globus-Harris et al., 2023], we have

𝔼[(f(x)y)2\displaystyle\mathbb{E}[(f(x)-y)^{2} (h(x,y)y)2|f(x)=v]\displaystyle-(h^{\prime}(x,y)-y)^{2}|f(x)=v]
=𝔼[(vy)2(v+ηh(x,y)y)2|f(x)=v]\displaystyle=\mathbb{E}[(v-y)^{2}-(v+\eta h(x,y)-y)^{2}|f(x)=v]
=𝔼[v22vy+y2(v+ηh(x,y))2+2y(v+ηh(x,y))y2|f(x)=v]\displaystyle=\mathbb{E}[v^{2}-2vy+y^{2}-(v+\eta h(x,y))^{2}+2y(v+\eta h(x,y))-y^{2}|f(x)=v]
=𝔼[2ηyh(x,y)2ηvh(x,y)η2h(x,y)2|f(x)=v]\displaystyle=\mathbb{E}[2\eta yh(x,y)-2\eta vh(x,y)-\eta^{2}h(x,y)^{2}|f(x)=v]
=𝔼[2ηh(x,y)(yv)η2h(x,y)2|f(x)=v]\displaystyle=\mathbb{E}[2\eta h(x,y)(y-v)-\eta^{2}h(x,y)^{2}|f(x)=v]
>2ηαη2𝔼[h(x,y)2|f(x)=v]\displaystyle>2\eta\alpha-\eta^{2}\mathbb{E}[h(x,y)^{2}|f(x)=v]
=α2𝔼[h(x,y)2|f(x)=v].\displaystyle=\frac{\alpha^{2}}{\mathbb{E}[h(x,y)^{2}|f(x)=v]}.
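Lemma F.24's algebra can be verified empirically on a single toy level set: with eta = alpha / E[h^2], patching the constant prediction v to v + eta * h improves squared error by exactly alpha^2 / E[h^2] when alpha = E[h (y - v)] is taken to be the empirical value. The sampling scheme and grouping function below are illustrative assumptions:

```python
# Toy check of the error-improvement identity in Lemma F.24.
import random

random.seed(1)
v = 0.5  # the level set f(x) = v

def draw():
    x = random.random()
    y = 0.25 + 0.5 * x + 0.2 * (random.random() - 0.5)  # y correlated with x
    return x, y

samples = [draw() for _ in range(500)]

def E(g):
    return sum(g(x, y) for x, y in samples) / len(samples)

h = lambda x, y: x - 0.3                    # a grouping function with signal
alpha = E(lambda x, y: h(x, y) * (y - v))   # empirical E[h (y - v)]
Eh2 = E(lambda x, y: h(x, y) ** 2)
eta = alpha / Eh2

# squared-error improvement from patching v to v + eta * h:
gain = E(lambda x, y: (v - y) ** 2 - (v + eta * h(x, y) - y) ** 2)
assert alpha > 0
assert abs(gain - alpha ** 2 / Eh2) < 1e-9
```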
Theorem F.25 (Restatement of Theorem 5.1).

In Algorithm 1, for α,B>0\alpha,B>0, if the following is satisfied:

errt1err~tαB,\displaystyle err_{t-1}-\tilde{err}_{t}\leq\frac{\alpha}{B}, (168)

the output ft1(x)f_{t-1}^{\prime}(x) is α\alpha-approximately 2\ell_{2} multicalibrated w.r.t. B={h:suph(x,y)2B}\mathcal{H}_{B}=\{h\in\mathcal{H}:\sup h(x,y)^{2}\leq B\}.

Proof.

We prove by contradiction. Assume that ft1f_{t-1} is not α\alpha-approximately multicalibrated with respect to B\mathcal{H}_{B}. Then there exists hBh\in\mathcal{H}_{B} such that:

v[1/m]P(ft1(x)=v)(𝔼[h(x,y)(yv)|ft1(x)=v])2>α.\sum_{v\in[1/m]}P(f_{t-1}(x)=v)\left(\mathbb{E}\left[h(x,y)(y-v)\Big{|}f_{t-1}(x)=v\right]\right)^{2}>\alpha.

For each v[1/m]v\in[1/m] define

αv:=P(ft1(x)=v)(𝔼[h(x,y)(yv)|ft1(x)=v])2.\alpha_{v}:=P(f_{t-1}(x)=v)\left(\mathbb{E}\left[h(x,y)(y-v)\Big{|}f_{t-1}(x)=v\right]\right)^{2}.

Then we have v[1/m]αv>α\sum_{v\in[1/m]}\alpha_{v}>\alpha.

According to Lemma F.24, for each v[1/m]v\in[1/m], there exists hvh_{v}\in\mathcal{H} such that:

𝔼[(ft1(x)y)2(hv(x,y)y)2|ft1(x)=v]\displaystyle\mathbb{E}\left[(f_{t-1}(x)-y)^{2}-(h_{v}(x,y)-y)^{2}|f_{t-1}(x)=v\right] (169)
>\displaystyle> αv𝔼[h(x)2|ft1(x)=v]P(ft1(x)=v)\displaystyle\;\frac{\alpha_{v}}{\mathbb{E}[h(x)^{2}|f_{t-1}(x)=v]\cdot P(f_{t-1}(x)=v)} (170)
\displaystyle\geq αvBP(ft1(x)=v).\displaystyle\;\frac{\alpha_{v}}{B\cdot P(f_{t-1}(x)=v)}. (171)

Then,

𝔼[(ft1(x)y)2(f~t(x)y)2]\displaystyle\;\;\;\;\mathbb{E}\left[(f_{t-1}(x)-y)^{2}-(\tilde{f}_{t}(x)-y)^{2}\right]
=v[1/m]P(ft1(x)=v)𝔼[(ft1(x)y)2(f~t(x)y)2|ft1(x)=v]\displaystyle=\sum_{v\in[1/m]}P(f_{t-1}(x)=v)\mathbb{E}\left[(f_{t-1}(x)-y)^{2}-(\tilde{f}_{t}(x)-y)^{2}|f_{t-1}(x)=v\right]
=v[1/m]P(ft1(x)=v)𝔼[(ft1(x)y)2(hv(x,y)y)2|ft1(x)=v]\displaystyle=\sum_{v\in[1/m]}P(f_{t-1}(x)=v)\mathbb{E}\left[(f_{t-1}(x)-y)^{2}-(h_{v}(x,y)-y)^{2}|f_{t-1}(x)=v\right]
>αB,\displaystyle>\frac{\alpha}{B},

which contradicts the condition in Equation 168. ∎

The following proposition is a direct corollary from Globus-Harris et al. [2023]’s Theorem 4.3.

Proposition F.26.

For any distribution DD supported on 𝒳×𝒴\mathcal{X}\times\mathcal{Y} and Φσ(X)\Phi\in\sigma(X), take the grouping function class 1,Φ:={h(Φ(x))C(𝒳×𝒴)}\mathcal{H}\subset\mathcal{H}^{\prime}_{1,\Phi}:=\{h(\Phi(x))\in C(\mathcal{X}\times\mathcal{Y})\} and the predictor class =𝒳\mathcal{F}=\mathbb{R}^{\mathcal{X}}. For any 0<α<1,B>00<\alpha<1,B>0 and an initial predictor f0:𝒳[0,1]f_{0}:\mathcal{X}\rightarrow[0,1] with |Range(f0)|2Bα|\text{Range}(f_{0})|\geq\frac{2B}{\alpha}, MC-Pseudolabel(D,,)\text{MC-Pseudolabel}(D,\mathcal{H},\mathcal{F}) halts after at most T2BαT\leq\frac{2B}{\alpha} steps and outputs a model fT1(x)f_{T-1}^{\prime}(x) that is α\alpha-approximately 2\ell_{2} multicalibrated w.r.t. DD and B={h:suph(x,y)2B}\mathcal{H}_{B}=\{h\in\mathcal{H}:\sup h(x,y)^{2}\leq B\}.

Theorem F.27 (Restatement of Theorem 5.2).

Consider XdX\in\mathbb{R}^{d} with X=(Φ,Ψ)TX=(\Phi,\Psi)^{T}. Assume that (Φ,Ψ,Y)(\Phi,\Psi,Y) follows a multivariate normal distribution 𝒩d+1(μ,Σ)\mathcal{N}_{d+1}(\mu,\Sigma) where the random variables are in general position such that Σ\Sigma is positive definite. We partition Σ\Sigma into blocks:

Σ=(ΣΦΦΣΦΨΣΦyΣΨΦΣΨΨΣΨyΣyΦΣyΨΣyy).\displaystyle\Sigma=\begin{pmatrix}\Sigma_{\Phi\Phi}&\Sigma_{\Phi\Psi}&\Sigma_{\Phi y}\\ \Sigma_{\Psi\Phi}&\Sigma_{\Psi\Psi}&\Sigma_{\Psi y}\\ \Sigma_{y\Phi}&\Sigma_{y\Psi}&\Sigma_{yy}\end{pmatrix}. (172)

For any distribution DD supported on 𝒳×𝒴\mathcal{X}\times\mathcal{Y}, take the grouping function class ={h2,Φ:h(x,y)=cxTx+cyy+cb,cxd,cy,cb}\mathcal{H}=\{h\in\mathcal{H}_{2,\Phi}:h(x,y)=c_{x}^{T}x+c_{y}y+c_{b},c_{x}\in\mathbb{R}^{d},c_{y},c_{b}\in\mathbb{R}\} and the predictor class =𝒳\mathcal{F}=\mathbb{R}^{\mathcal{X}}. For an initial predictor f(0)(x)=𝔼[Y|x]f^{(0)}(x)=\mathbb{E}[Y|x], run MC-Pseudolabel(D,,)\text{MC-Pseudolabel}(D,\mathcal{H},\mathcal{F}) without rounding, then f(t)(x)𝔼[Y|Φ(x)]f^{(t)}(x)\rightarrow\mathbb{E}[Y|\Phi(x)] as tt\rightarrow\infty with a convergence rate of 𝒪(M(Σ)t)\mathcal{O}(M(\Sigma)^{t}), where

M(Σ)\displaystyle M(\Sigma) =(ΣyyΣyΦΣΦΦ1ΣΦy)1(ΣyΨΣyΦΣΦΦ1ΣΦΨ)\displaystyle=(\Sigma_{yy}-\Sigma_{y\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi y})^{-1}(\Sigma_{y\Psi}-\Sigma_{y\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi\Psi}) (173)
(ΣΨΨΣΨΦΣΦΦ1ΣΦΨ)1(ΣΨyΣΨΦΣΦΦ1ΣΦy).\displaystyle\;\;\;\;(\Sigma_{\Psi\Psi}-\Sigma_{\Psi\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi\Psi})^{-1}(\Sigma_{\Psi y}-\Sigma_{\Psi\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi y}). (174)

We have 0M(Σ)<10\leq M(\Sigma)<1.
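As a numerical sanity check (separate from the proof), one can evaluate M(Σ)M(\Sigma) for a randomly drawn positive definite covariance and confirm the claim 0M(Σ)<10\leq M(\Sigma)<1. The sketch below is illustrative only; the helper name `convergence_rate` and the block dimensions are our own choices.

```python
import numpy as np

def convergence_rate(Sigma, d_phi, d_psi):
    """Evaluate M(Sigma) from the block partition of the covariance of
    (Phi, Psi, Y): Phi occupies the first coordinates, Psi the next, Y the last."""
    P = slice(0, d_phi)
    S = slice(d_phi, d_phi + d_psi)
    y = -1
    inv = np.linalg.inv(Sigma[P, P])
    # residual (partial) covariances after regressing out Phi
    r_yy = Sigma[y, y] - Sigma[y, P] @ inv @ Sigma[P, y]
    r_ys = Sigma[y, S] - Sigma[y, P] @ inv @ Sigma[P, S]
    r_ss = Sigma[S, S] - Sigma[S, P] @ inv @ Sigma[P, S]
    return float(r_ys @ np.linalg.inv(r_ss) @ r_ys / r_yy)

rng = np.random.default_rng(0)
d_phi, d_psi = 3, 2
n = d_phi + d_psi + 1
A = rng.standard_normal((n, n))
Sigma = A @ A.T + 1e-2 * np.eye(n)  # positive definite, generic position

M = convergence_rate(Sigma, d_phi, d_psi)
assert 0.0 <= M < 1.0
```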

Remark F.28.

f(t)f^{(t)} are always linear. Thus, convergence of the functions is equivalent to convergence of their coefficients.

Remark F.29.

𝔼[Y|Φ(x)]\mathbb{E}[Y|\Phi(x)] is multicalibrated with respect to \mathcal{H}. Furthermore, any calibrated predictor on Φ\Phi, denoted by g(Φ)g(\Phi), is multicalibrated with respect to \mathcal{H}. This is because, writing Ch:=𝔼[h(X,Y)|Φ,Y]C_{h}:=\mathbb{E}[h(X,Y)|\Phi,Y] for the constant guaranteed by Theorem F.15:

𝔼[h(X,Y)(Yg(Φ))|g(Φ)]\displaystyle\mathbb{E}[h(X,Y)(Y-g(\Phi))|g(\Phi)] =𝔼[𝔼[h(X,Y)(Yg(Φ))|Φ,Y]|g(Φ)]\displaystyle=\mathbb{E}\left[\mathbb{E}\left[h(X,Y)(Y-g(\Phi))|\Phi,Y\right]|g(\Phi)\right] (175)
=𝔼[Ch(Yg(Φ))|g(Φ)]\displaystyle=\mathbb{E}\left[C_{h}(Y-g(\Phi))|g(\Phi)\right] (176)
=0.\displaystyle=0. (177)

The last equality holds because gg is calibrated, i.e., 𝔼[Yg(Φ)|g(Φ)]=0\mathbb{E}[Y-g(\Phi)|g(\Phi)]=0. However, 𝔼[Y|Φ]\mathbb{E}[Y|\Phi] is the most accurate predictor among all multicalibrated g(Φ)g(\Phi).

Remark F.30.

The convergence rate M(Σ)M(\Sigma) does not depend on the dimension dd of covariates. When YΨ|ΦY\perp\Psi\;|\;\Phi, so that Φ\Phi is sufficient for prediction, 𝔼[Y|Φ,Ψ]=𝔼[Y|Φ]\mathbb{E}[Y|\Phi,\Psi]=\mathbb{E}[Y|\Phi] implies:

ΣΨy=ΣΨΦΣΦΦ1ΣΦy.\displaystyle\Sigma_{\Psi y}=\Sigma_{\Psi\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi y}. (178)

It follows that M(Σ)=0M(\Sigma)=0 and the algorithm will converge in one step.
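This one-step case can be checked numerically. A hypothetical generative model with Y=aTΦ+εY=a^{T}\Phi+\varepsilon and Ψ=BTΦ+η\Psi=B^{T}\Phi+\eta, where ε,η\varepsilon,\eta are independent noise, satisfies YΨ|ΦY\perp\Psi\;|\;\Phi; assembling Σ\Sigma from it, the residual ΣΨyΣΨΦΣΦΦ1ΣΦy\Sigma_{\Psi y}-\Sigma_{\Psi\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi y} and hence M(Σ)M(\Sigma) vanish. All variable names below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d_phi, d_psi = 3, 2

# Generative model: Y = a^T Phi + eps, Psi = B^T Phi + eta, with eps and
# eta independent noise, so Y is independent of Psi given Phi.
L = rng.standard_normal((d_phi, d_phi))
S_pp = L @ L.T + np.eye(d_phi)          # Cov(Phi), positive definite
a = rng.standard_normal(d_phi)
B = rng.standard_normal((d_phi, d_psi))

# implied covariance blocks of (Phi, Psi, Y)
S_py = S_pp @ a                         # Sigma_{Phi y}
S_ps = S_pp @ B                         # Sigma_{Phi Psi}
S_sy = B.T @ S_pp @ a                   # Sigma_{Psi y}
S_ss = B.T @ S_pp @ B + np.eye(d_psi)   # Sigma_{Psi Psi}, with Cov(eta) = I
S_yy = a @ S_pp @ a + 1.0               # Sigma_{yy}, with Var(eps) = 1

inv = np.linalg.inv(S_pp)
residual = S_sy - S_ps.T @ inv @ S_py   # vanishes under conditional independence
M = residual @ np.linalg.inv(S_ss - S_ps.T @ inv @ S_ps) @ residual \
    / (S_yy - S_py @ inv @ S_py)

assert np.allclose(residual, 0)
assert np.isclose(M, 0)
```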

On the other hand, suppose YY and Ψ\Psi are linearly dependent given Φ\Phi, so that Σ\Sigma is singular and positive definiteness is violated. Then, as shown in the proof below:

ΣyyΣyΦΣΦΦ1ΣΦy=(ΣyΨΣyΦΣΦΦ1ΣΦΨ)(ΣΨΨΣΨΦΣΦΦ1ΣΦΨ)1(ΣΨyΣΨΦΣΦΦ1ΣΦy).\displaystyle\Sigma_{yy}-\Sigma_{y\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi y}=(\Sigma_{y\Psi}-\Sigma_{y\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi\Psi})(\Sigma_{\Psi\Psi}-\Sigma_{\Psi\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi\Psi})^{-1}(\Sigma_{\Psi y}-\Sigma_{\Psi\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi y}). (179)

It follows that M(Σ)=1M(\Sigma)=1 and the algorithm does not converge.

Thus the convergence rate depends on how close the problem is to singular. Since the algorithm converges to a predictor that does not depend on Ψ\Psi, the stronger the "spurious" correlation between YY and Ψ\Psi given Φ\Phi in the distribution DD, the longer the algorithm takes to converge.

Proof.

Without loss of generality, assume μ=0\mu=0.

Let c=(cx,cy,cb)T=(cΦ,cΨ,cy,cb)Tc=(c_{x},c_{y},c_{b})^{T}=(c_{\Phi},c_{\Psi},c_{y},c_{b})^{T}. Denote dimensions of Φ,Ψ\Phi,\Psi by dΦ,dΨd_{\Phi},d_{\Psi}.

Let 𝔼[Y|Φ,Ψ]=(αΦ)TΦ+(αΨ)TΨ\mathbb{E}[Y|\Phi,\Psi]=(\alpha_{\Phi}^{\star})^{T}\Phi+(\alpha_{\Psi}^{\star})^{T}\Psi and 𝔼[Ψ|Φ,Y]=(βΦ)TΦ+(βy)TY\mathbb{E}[\Psi|\Phi,Y]=(\beta_{\Phi}^{\star})^{T}\Phi+(\beta_{y}^{\star})^{T}Y with αΦdΦ,\alpha_{\Phi}^{\star}\in\mathbb{R}^{d_{\Phi}},
αΨdΨ,βΦdΦ×dΨ,βy1×dΨ\alpha_{\Psi}^{\star}\in\mathbb{R}^{d_{\Psi}},\beta_{\Phi}^{\star}\in\mathbb{R}^{d_{\Phi}\times d_{\Psi}},\beta_{y}^{\star}\in\mathbb{R}^{1\times d_{\Psi}}.

We have:

(αΦαΨ)\displaystyle\begin{pmatrix}\alpha_{\Phi}^{\star}\\ \alpha_{\Psi}^{\star}\end{pmatrix} =(ΣΦΦΣΦΨΣΨΦΣΨΨ)1(ΣΦyΣΨy).\displaystyle=\begin{pmatrix}\Sigma_{\Phi\Phi}&\Sigma_{\Phi\Psi}\\ \Sigma_{\Psi\Phi}&\Sigma_{\Psi\Psi}\end{pmatrix}^{-1}\begin{pmatrix}\Sigma_{\Phi y}\\ \Sigma_{\Psi y}\end{pmatrix}. (180)
(βΦβy)\displaystyle\begin{pmatrix}\beta_{\Phi}^{\star}\\ \beta_{y}^{\star}\end{pmatrix} =(ΣΦΦΣΦyΣyΦΣyy)1(ΣΦΨΣyΨ).\displaystyle=\begin{pmatrix}\Sigma_{\Phi\Phi}&\Sigma_{\Phi y}\\ \Sigma_{y\Phi}&\Sigma_{yy}\end{pmatrix}^{-1}\begin{pmatrix}\Sigma_{\Phi\Psi}\\ \Sigma_{y\Psi}\end{pmatrix}. (181)

According to Theorem F.15, 𝔼[h(X,Y)|Φ,Y]\mathbb{E}[h(X,Y)|\Phi,Y] is constant across values of (Φ,Y)(\Phi,Y).

𝔼[h(X,Y)|Φ,Y]\displaystyle\mathbb{E}[h(X,Y)|\Phi,Y] =cΦTΦ+cyTY+cΨT𝔼[Ψ|Φ,Y]+cb\displaystyle=c_{\Phi}^{T}\Phi+c_{y}^{T}Y+c_{\Psi}^{T}\mathbb{E}[\Psi|\Phi,Y]+c_{b} (182)
=cΦTΦ+cyTY+cΨT(ΣΨΦΣΨy)(ΣΦΦΣΦyΣyΦΣyy)1(ΦY)+cb.\displaystyle=c_{\Phi}^{T}\Phi+c_{y}^{T}Y+c_{\Psi}^{T}\begin{pmatrix}\Sigma_{\Psi\Phi}&\Sigma_{\Psi y}\end{pmatrix}\begin{pmatrix}\Sigma_{\Phi\Phi}&\Sigma_{\Phi y}\\ \Sigma_{y\Phi}&\Sigma_{yy}\\ \end{pmatrix}^{-1}\begin{pmatrix}\Phi\\ Y\\ \end{pmatrix}+c_{b}. (183)

This implies:

(ΣΦΦΣΦyΣyΦΣyy)1(ΣΦΨΣyΨ)cΨ+(cΦcy)=0.\displaystyle\begin{pmatrix}\Sigma_{\Phi\Phi}&\Sigma_{\Phi y}\\ \Sigma_{y\Phi}&\Sigma_{yy}\\ \end{pmatrix}^{-1}\begin{pmatrix}\Sigma_{\Phi\Psi}\\ \Sigma_{y\Psi}\end{pmatrix}c_{\Psi}+\begin{pmatrix}c_{\Phi}\\ c_{y}\end{pmatrix}=0. (184)

Rearranging to:

(ΣΦΦΣΦΨΣΦyΣyΦΣyΨΣyy)(cΦcΨcy)=0.\displaystyle\begin{pmatrix}\Sigma_{\Phi\Phi}&\Sigma_{\Phi\Psi}&\Sigma_{\Phi y}\\ \Sigma_{y\Phi}&\Sigma_{y\Psi}&\Sigma_{yy}\end{pmatrix}\begin{pmatrix}c_{\Phi}\\ c_{\Psi}\\ c_{y}\end{pmatrix}=0. (185)

Let f(t)=(αΦ(t))TΦ+(αΨ(t))TΨf^{(t)}=(\alpha_{\Phi}^{(t)})^{T}\Phi+(\alpha_{\Psi}^{(t)})^{T}\Psi and f~(t)=(α~Φ(t))TΦ+(α~Ψ(t))TΨ+(α~y(t))TY\tilde{f}^{(t)}=(\tilde{\alpha}_{\Phi}^{(t)})^{T}\Phi+(\tilde{\alpha}_{\Psi}^{(t)})^{T}\Psi+(\tilde{\alpha}_{y}^{(t)})^{T}Y. We claim that α~Ψ(t)=0\tilde{\alpha}_{\Psi}^{(t)}=0. Otherwise, consider f~(t)=(α~Φ(t))TΦ+(α~Ψ(t))T𝔼[Ψ|Φ,Y]+(α~y(t))TY\tilde{f}^{(t)^{\prime}}=(\tilde{\alpha}_{\Phi}^{(t)})^{T}\Phi+(\tilde{\alpha}_{\Psi}^{(t)})^{T}\mathbb{E}[\Psi|\Phi,Y]+(\tilde{\alpha}_{y}^{(t)})^{T}Y.

𝔼[f~(t)f~(t)|Φ,Y]=𝔼[(α~Ψ(t))TΨ|Φ,Y](α~Ψ(t))T𝔼[Ψ|Φ,Y]=0.\displaystyle\mathbb{E}[\tilde{f}^{(t)}-\tilde{f}^{(t)^{\prime}}|\Phi,Y]=\mathbb{E}[(\tilde{\alpha}_{\Psi}^{(t)})^{T}\Psi|\Phi,Y]-(\tilde{\alpha}_{\Psi}^{(t)})^{T}\mathbb{E}[\Psi|\Phi,Y]=0. (186)

Thus, f~(t)f~(t)\tilde{f}^{(t)}-\tilde{f}^{(t)^{\prime}}\in\mathcal{H}.

On the other hand,

𝔼[(Yf~(t))2]𝔼[(Yf~(t))2]\displaystyle\;\;\;\;\mathbb{E}[(Y-\tilde{f}^{(t)})^{2}]-\mathbb{E}[(Y-\tilde{f}^{(t)^{\prime}})^{2}] (187)
=𝔼[(α~Ψ(t))T[𝔼[ΨΨT|Φ,Y]𝔼[Ψ|Φ,Y]𝔼[ΨT|Φ,Y]]α~Ψ(t)]\displaystyle=\mathbb{E}\left[(\tilde{\alpha}_{\Psi}^{(t)})^{T}\left[\mathbb{E}[\Psi\Psi^{T}|\Phi,Y]-\mathbb{E}[\Psi|\Phi,Y]\mathbb{E}[\Psi^{T}|\Phi,Y]\right]\tilde{\alpha}_{\Psi}^{(t)}\right] (188)
>0.\displaystyle>0. (189)

The equality follows by expanding the square: the cross term vanishes because 𝔼[f~(t)f~(t)|Φ,Y]=0\mathbb{E}[\tilde{f}^{(t)}-\tilde{f}^{(t)^{\prime}}|\Phi,Y]=0 and f~(t)\tilde{f}^{(t)^{\prime}} is measurable with respect to (Φ,Y)(\Phi,Y). The inequality follows from the fact that 𝔼[ΨΨT|Φ,Y]𝔼[Ψ|Φ,Y]𝔼[ΨT|Φ,Y]\mathbb{E}[\Psi\Psi^{T}|\Phi,Y]-\mathbb{E}[\Psi|\Phi,Y]\mathbb{E}[\Psi^{T}|\Phi,Y] is the covariance matrix of Ψ|Φ,Y\Psi|\Phi,Y, which is positive definite because Σ\Sigma is positive definite. This contradicts the definition of f~(t)\tilde{f}^{(t)}. Thus, f~(t)=(α~Φ(t))TΦ+(α~y(t))TY\tilde{f}^{(t)}=(\tilde{\alpha}_{\Phi}^{(t)})^{T}\Phi+(\tilde{\alpha}_{y}^{(t)})^{T}Y.

Define a matrix C(d+1)×dtC\in\mathbb{R}^{(d+1)\times d_{t}} whose columns span the solution space of Equation 185. Then H:=CT(Φ,Ψ,Y)TdtH:=C^{T}(\Phi,\Psi,Y)^{T}\in\mathbb{R}^{d_{t}} is a random vector. According to the definition of f~(t)\tilde{f}^{(t)},

f~(t+1)\displaystyle\tilde{f}^{(t+1)} =𝔼[Y|f(t),H]\displaystyle=\mathbb{E}[Y|f^{(t)},H] (190)
=k(t)f(t)+(cΦ(t))TΦ+(cΨ(t))TΨ+(cy(t))TY\displaystyle=k^{(t)}f^{(t)}+(c_{\Phi}^{(t)})^{T}\Phi+(c_{\Psi}^{(t)})^{T}\Psi+(c_{y}^{(t)})^{T}Y (191)
=k(t)(αΦ(t))TΦ+k(t)(αΨ(t))TΨ+(cΦ(t))TΦ+(cΨ(t))TΨ+(cy(t))TY.\displaystyle=k^{(t)}(\alpha_{\Phi}^{(t)})^{T}\Phi+k^{(t)}(\alpha_{\Psi}^{(t)})^{T}\Psi+(c_{\Phi}^{(t)})^{T}\Phi+(c_{\Psi}^{(t)})^{T}\Psi+(c_{y}^{(t)})^{T}Y. (192)

In the above equation, k(t)k^{(t)}\in\mathbb{R}.

Since α~Ψ(t+1)=k(t)αΨ(t)+cΨ(t)=0\tilde{\alpha}_{\Psi}^{(t+1)}=k^{(t)}\alpha_{\Psi}^{(t)}+c_{\Psi}^{(t)}=0, we have cΨ(t)=k(t)αΨ(t)c_{\Psi}^{(t)}=-k^{(t)}\alpha_{\Psi}^{(t)}. Substituting into Equation 184:

(cΦ(t)cy(t))=k(t)(ΣΦΦΣΦyΣyΦΣyy)1(ΣΦΨΣyΨ)αΨ(t).\displaystyle\begin{pmatrix}c_{\Phi}^{(t)}\\ c_{y}^{(t)}\end{pmatrix}=k^{(t)}\begin{pmatrix}\Sigma_{\Phi\Phi}&\Sigma_{\Phi y}\\ \Sigma_{y\Phi}&\Sigma_{yy}\\ \end{pmatrix}^{-1}\begin{pmatrix}\Sigma_{\Phi\Psi}\\ \Sigma_{y\Psi}\end{pmatrix}\alpha_{\Psi}^{(t)}. (193)

Substituting into Equation 192:

(α~Φ(t+1)α~y(t+1))\displaystyle\begin{pmatrix}\tilde{\alpha}_{\Phi}^{(t+1)}\\ \tilde{\alpha}_{y}^{(t+1)}\end{pmatrix} =k(t)(αΦ(t)0)+(cΦ(t)cy(t))\displaystyle=k^{(t)}\begin{pmatrix}\alpha_{\Phi}^{(t)}\\ 0\end{pmatrix}+\begin{pmatrix}c_{\Phi}^{(t)}\\ c_{y}^{(t)}\end{pmatrix} (194)
=k(t)(IdΦβΦ0βy)(αΦ(t)αΨ(t)).\displaystyle=k^{(t)}\begin{pmatrix}I_{d_{\Phi}}&\beta_{\Phi}^{\star}\\ 0&\beta_{y}^{\star}\end{pmatrix}\begin{pmatrix}\alpha_{\Phi}^{(t)}\\ \alpha_{\Psi}^{(t)}\end{pmatrix}. (195)

Then we have f(t+1)=𝔼[f~(t+1)|Φ,Ψ]=(α~Φ(t+1))TΦ+(α~y(t+1))T𝔼[Y|Φ,Ψ]f^{(t+1)}=\mathbb{E}[\tilde{f}^{(t+1)}|\Phi,\Psi]=(\tilde{\alpha}_{\Phi}^{(t+1)})^{T}\Phi+(\tilde{\alpha}_{y}^{(t+1)})^{T}\mathbb{E}[Y|\Phi,\Psi].

This is equivalent to:

(αΦ(t+1)αΨ(t+1))\displaystyle\begin{pmatrix}\alpha_{\Phi}^{(t+1)}\\ \alpha_{\Psi}^{(t+1)}\end{pmatrix} =(α~Φ(t+1)0)+(ΣΦΦΣΦΨΣΨΦΣΨΨ)1(ΣΦyΣΨy)α~y(t+1)\displaystyle=\begin{pmatrix}\tilde{\alpha}_{\Phi}^{(t+1)}\\ 0\end{pmatrix}+\begin{pmatrix}\Sigma_{\Phi\Phi}&\Sigma_{\Phi\Psi}\\ \Sigma_{\Psi\Phi}&\Sigma_{\Psi\Psi}\end{pmatrix}^{-1}\begin{pmatrix}\Sigma_{\Phi y}\\ \Sigma_{\Psi y}\end{pmatrix}\tilde{\alpha}_{y}^{(t+1)} (196)
=(IdΦαΦ0αΨ)(α~Φ(t+1)α~y(t+1)).\displaystyle=\begin{pmatrix}I_{d_{\Phi}}&\alpha_{\Phi}^{\star}\\ 0&\alpha_{\Psi}^{\star}\end{pmatrix}\begin{pmatrix}\tilde{\alpha}_{\Phi}^{(t+1)}\\ \tilde{\alpha}_{y}^{(t+1)}\end{pmatrix}. (197)

Combining the two equations above, we have:

(αΦ(t+1)αΨ(t+1))\displaystyle\begin{pmatrix}\alpha_{\Phi}^{(t+1)}\\ \alpha_{\Psi}^{(t+1)}\end{pmatrix} =k(t)(IdΦαΦ0αΨ)(IdΦβΦ0βy)(αΦ(t)αΨ(t))\displaystyle=k^{(t)}\begin{pmatrix}I_{d_{\Phi}}&\alpha_{\Phi}^{\star}\\ 0&\alpha_{\Psi}^{\star}\end{pmatrix}\begin{pmatrix}I_{d_{\Phi}}&\beta_{\Phi}^{\star}\\ 0&\beta_{y}^{\star}\end{pmatrix}\begin{pmatrix}\alpha_{\Phi}^{(t)}\\ \alpha_{\Psi}^{(t)}\end{pmatrix} (198)
=k(t)(IdΦβΦ+αΦβy0αΨβy)(αΦ(t)αΨ(t)).\displaystyle=k^{(t)}\begin{pmatrix}I_{d_{\Phi}}&\beta_{\Phi}^{\star}+\alpha_{\Phi}^{\star}\beta_{y}^{\star}\\ 0&\alpha_{\Psi}^{\star}\beta_{y}^{\star}\end{pmatrix}\begin{pmatrix}\alpha_{\Phi}^{(t)}\\ \alpha_{\Psi}^{(t)}\end{pmatrix}. (199)

Thus,

(αΦ(t)αΨ(t))\displaystyle\begin{pmatrix}\alpha_{\Phi}^{(t)}\\ \alpha_{\Psi}^{(t)}\end{pmatrix} =K(t)(IdΦβΦ+αΦβy0αΨβy)t(αΦ(0)αΨ(0))\displaystyle=K^{(t)}\begin{pmatrix}I_{d_{\Phi}}&\beta_{\Phi}^{\star}+\alpha_{\Phi}^{\star}\beta_{y}^{\star}\\ 0&\alpha_{\Psi}^{\star}\beta_{y}^{\star}\end{pmatrix}^{t}\begin{pmatrix}\alpha_{\Phi}^{(0)}\\ \alpha_{\Psi}^{(0)}\end{pmatrix} (200)
=K(t)(IdΦβΦ+αΦβy0αΨβy)t(αΦαΨ).\displaystyle=K^{(t)}\begin{pmatrix}I_{d_{\Phi}}&\beta_{\Phi}^{\star}+\alpha_{\Phi}^{\star}\beta_{y}^{\star}\\ 0&\alpha_{\Psi}^{\star}\beta_{y}^{\star}\end{pmatrix}^{t}\begin{pmatrix}\alpha_{\Phi}^{\star}\\ \alpha_{\Psi}^{\star}\end{pmatrix}. (201)

In the above equation, K(t)=0u<tk(u)K^{(t)}=\prod_{0\leq u<t}k^{(u)}.

Define f^(t)=(α^Φ(t))TΦ+(α^Ψ(t))TΨ\hat{f}^{(t)}=(\hat{\alpha}_{\Phi}^{(t)})^{T}\Phi+(\hat{\alpha}_{\Psi}^{(t)})^{T}\Psi, where

(α^Φ(t)α^Ψ(t))=(IdΦβΦ+αΦβy0αΨβy)t(αΦαΨ).\displaystyle\begin{pmatrix}\hat{\alpha}_{\Phi}^{(t)}\\ \hat{\alpha}_{\Psi}^{(t)}\end{pmatrix}=\begin{pmatrix}I_{d_{\Phi}}&\beta_{\Phi}^{\star}+\alpha_{\Phi}^{\star}\beta_{y}^{\star}\\ 0&\alpha_{\Psi}^{\star}\beta_{y}^{\star}\end{pmatrix}^{t}\begin{pmatrix}\alpha_{\Phi}^{\star}\\ \alpha_{\Psi}^{\star}\end{pmatrix}. (202)

In the following we show α^Ψ(t)0\hat{\alpha}_{\Psi}^{(t)}\rightarrow 0. Since α^Ψ(t)=(αΨβy)tαΨ\hat{\alpha}_{\Psi}^{(t)}=(\alpha_{\Psi}^{\star}\beta_{y}^{\star})^{t}\alpha_{\Psi}^{\star}, and the rank-one matrix αΨβydΨ×dΨ\alpha_{\Psi}^{\star}\beta_{y}^{\star}\in\mathbb{R}^{d_{\Psi}\times d_{\Psi}} has at most one nonzero eigenvalue, namely βyαΨ\beta_{y}^{\star}\alpha_{\Psi}^{\star}, it suffices to show |βyαΨ|<1|\beta_{y}^{\star}\alpha_{\Psi}^{\star}|<1.
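The rank-one facts used here are elementary but easy to verify numerically: for vectors u,vu,v, the spectral radius of the outer product uvTuv^{T} equals |vTu||v^{T}u|, and (uvT)tu=(vTu)tu(uv^{T})^{t}u=(v^{T}u)^{t}u. In the illustrative sketch below, u and v stand in for αΨ\alpha_{\Psi}^{\star} and (βy)T(\beta_{y}^{\star})^{T}, with an arbitrary dimension.

```python
import numpy as np

rng = np.random.default_rng(2)
u = rng.standard_normal(4)  # stands in for alpha_Psi^*
v = rng.standard_normal(4)  # stands in for the row vector beta_y^*

A = np.outer(u, v)          # rank-one matrix alpha_Psi^* beta_y^*
eigs = np.linalg.eigvals(A)

# the only (possibly) nonzero eigenvalue is v^T u, so the spectral radius
# is |v^T u|, and powers of A act on u as geometric scaling by v^T u
assert np.isclose(np.max(np.abs(eigs)), abs(v @ u))
assert np.allclose(np.linalg.matrix_power(A, 3) @ u, (v @ u) ** 3 * u)
```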

βyαΨ\displaystyle\beta_{y}^{\star}\alpha_{\Psi}^{\star} =(ΣyyΣyΦΣΦΦ1ΣΦy)1\displaystyle=(\Sigma_{yy}-\Sigma_{y\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi y})^{-1} (203)
(ΣyΨΣyΦΣΦΦ1ΣΦΨ)(ΣΨΨΣΨΦΣΦΦ1ΣΦΨ)1(ΣΨyΣΨΦΣΦΦ1ΣΦy).\displaystyle\;\;\;\;(\Sigma_{y\Psi}-\Sigma_{y\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi\Psi})(\Sigma_{\Psi\Psi}-\Sigma_{\Psi\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi\Psi})^{-1}(\Sigma_{\Psi y}-\Sigma_{\Psi\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi y}). (204)

Since (ΣyyΣyΦΣΦΦ1ΣΦy)(\Sigma_{yy}-\Sigma_{y\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi y}) and (ΣΨΨΣΨΦΣΦΦ1ΣΦΨ)(\Sigma_{\Psi\Psi}-\Sigma_{\Psi\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi\Psi}) are Schur complements of principal submatrices of Σ\Sigma, they are still positive definite.

Thus,

ΣyyΣyΦΣΦΦ1ΣΦy>0.\displaystyle\Sigma_{yy}-\Sigma_{y\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi y}>0. (205)
(ΣyΨΣyΦΣΦΦ1ΣΦΨ)(ΣΨΨΣΨΦΣΦΦ1ΣΦΨ)1(ΣΨyΣΨΦΣΦΦ1ΣΦy)0.\displaystyle(\Sigma_{y\Psi}-\Sigma_{y\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi\Psi})(\Sigma_{\Psi\Psi}-\Sigma_{\Psi\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi\Psi})^{-1}(\Sigma_{\Psi y}-\Sigma_{\Psi\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi y})\geq 0. (206)

It follows that βyαΨ0\beta_{y}^{\star}\alpha_{\Psi}^{\star}\geq 0. So we just have to show βyαΨ<1\beta_{y}^{\star}\alpha_{\Psi}^{\star}<1.

Since det(Σ)>0\det(\Sigma)>0, by applying row operations to Σ\Sigma, we have:

det(ΣΦΦΣΦΨΣΦy0ΣΨΨΣΨΦΣΦΦ1ΣΦΨΣΨyΣΨΦΣΦΦ1ΣΦy0ΣyΨΣyΦΣΦΦ1ΣΦΨΣyyΣyΦΣΦΦ1ΣΦy)>0.\displaystyle\det\begin{pmatrix}\Sigma_{\Phi\Phi}&\Sigma_{\Phi\Psi}&\Sigma_{\Phi y}\\ 0&\Sigma_{\Psi\Psi}-\Sigma_{\Psi\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi\Psi}&\Sigma_{\Psi y}-\Sigma_{\Psi\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi y}\\ 0&\Sigma_{y\Psi}-\Sigma_{y\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi\Psi}&\Sigma_{yy}-\Sigma_{y\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi y}\end{pmatrix}>0. (207)

It follows that:

(ΣyyΣyΦΣΦΦ1ΣΦy)\displaystyle(\Sigma_{yy}-\Sigma_{y\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi y}) (208)
(ΣyΨΣyΦΣΦΦ1ΣΦΨ)(ΣΨΨΣΨΦΣΦΦ1ΣΦΨ)1(ΣΨyΣΨΦΣΦΦ1ΣΦy)>0.\displaystyle-(\Sigma_{y\Psi}-\Sigma_{y\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi\Psi})(\Sigma_{\Psi\Psi}-\Sigma_{\Psi\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi\Psi})^{-1}(\Sigma_{\Psi y}-\Sigma_{\Psi\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi y})>0. (209)

Rearranging to:

βyαΨ\displaystyle\beta_{y}^{\star}\alpha_{\Psi}^{\star} =(ΣyyΣyΦΣΦΦ1ΣΦy)1\displaystyle=(\Sigma_{yy}-\Sigma_{y\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi y})^{-1} (210)
(ΣyΨΣyΦΣΦΦ1ΣΦΨ)(ΣΨΨΣΨΦΣΦΦ1ΣΦΨ)1(ΣΨyΣΨΦΣΦΦ1ΣΦy)\displaystyle\;\;\;\;(\Sigma_{y\Psi}-\Sigma_{y\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi\Psi})(\Sigma_{\Psi\Psi}-\Sigma_{\Psi\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi\Psi})^{-1}(\Sigma_{\Psi y}-\Sigma_{\Psi\Phi}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi y}) (211)
<1.\displaystyle<1. (212)

Thus, 0βyαΨ<10\leq\beta_{y}^{\star}\alpha_{\Psi}^{\star}<1.

In the following, we show α^Φ(t)ΣΦΦ1ΣΦy\hat{\alpha}_{\Phi}^{(t)}\rightarrow\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi y}. By Equation 202 and |βyαΨ|<1|\beta_{y}^{\star}\alpha_{\Psi}^{\star}|<1, we have:

α^Φ(t)\displaystyle\hat{\alpha}_{\Phi}^{(t)} =αΦ+(βΦ+αΦβy)0u<t(αΨβy)uαΨ\displaystyle=\alpha_{\Phi}^{\star}+(\beta_{\Phi}^{\star}+\alpha_{\Phi}^{\star}\beta_{y}^{\star})\sum_{0\leq u<t}(\alpha_{\Psi}^{\star}\beta_{y}^{\star})^{u}\alpha_{\Psi}^{\star} (213)
tαΦ+(βΦ+αΦβy)(IdΨαΨβy)1αΨ\displaystyle\xrightarrow{t\to\infty}\alpha_{\Phi}^{\star}+(\beta_{\Phi}^{\star}+\alpha_{\Phi}^{\star}\beta_{y}^{\star})(I_{d_{\Psi}}-\alpha_{\Psi}^{\star}\beta_{y}^{\star})^{-1}\alpha_{\Psi}^{\star} (214)
=αΦ+(βΦ+αΦβy)(1βyαΨ)1αΨ\displaystyle=\alpha_{\Phi}^{\star}+(\beta_{\Phi}^{\star}+\alpha_{\Phi}^{\star}\beta_{y}^{\star})(1-\beta_{y}^{\star}\alpha_{\Psi}^{\star})^{-1}\alpha_{\Psi}^{\star} (215)
=αΦ+βΦαΨ1βyαΨ.\displaystyle=\frac{\alpha_{\Phi}^{\star}+\beta_{\Phi}^{\star}\alpha_{\Psi}^{\star}}{1-\beta_{y}^{\star}\alpha_{\Psi}^{\star}}. (216)

Equation 215 follows from the fact that (IdΨαΨβy)αΨ=(1βyαΨ)αΨ(I_{d_{\Psi}}-\alpha_{\Psi}^{\star}\beta_{y}^{\star})\alpha_{\Psi}^{\star}=(1-\beta_{y}^{\star}\alpha_{\Psi}^{\star})\alpha_{\Psi}^{\star}.

Define γΦ=ΣΦΦ1ΣΦy\gamma_{\Phi}^{\star}=\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi y} such that 𝔼[Y|Φ]=(γΦ)TΦ\mathbb{E}[Y|\Phi]=(\gamma_{\Phi}^{\star})^{T}\Phi. We have

𝔼[𝔼[𝔼[Y|Φ,Ψ]|Φ,Y]|Φ]=𝔼[Y|Φ].\displaystyle\;\;\;\;\;\;\mathbb{E}\left[\mathbb{E}\left[\mathbb{E}\left[Y|\Phi,\Psi\right]|\Phi,Y\right]|\Phi\right]=\mathbb{E}[Y|\Phi]. (217)
𝔼[𝔼[(αΦ)TΦ+(αΨ)TΨ|Φ,Y]|Φ]=(γΦ)TΦ.\displaystyle\Leftrightarrow\mathbb{E}\left[\mathbb{E}\left[(\alpha_{\Phi}^{\star})^{T}\Phi+(\alpha_{\Psi}^{\star})^{T}\Psi|\Phi,Y\right]|\Phi\right]=(\gamma_{\Phi}^{\star})^{T}\Phi. (218)
𝔼[(αΦ)TΦ+(αΨ)T(βΦ)TΦ+(αΨ)T(βy)TY|Φ]=(γΦ)TΦ.\displaystyle\Leftrightarrow\mathbb{E}\left[(\alpha_{\Phi}^{\star})^{T}\Phi+(\alpha_{\Psi}^{\star})^{T}(\beta_{\Phi}^{\star})^{T}\Phi+(\alpha_{\Psi}^{\star})^{T}(\beta_{y}^{\star})^{T}Y|\Phi\right]=(\gamma_{\Phi}^{\star})^{T}\Phi. (219)
(αΦ)TΦ+(αΨ)T(βΦ)TΦ+(αΨ)T(βy)T(γΦ)TΦ=(γΦ)TΦ.\displaystyle\Leftrightarrow(\alpha_{\Phi}^{\star})^{T}\Phi+(\alpha_{\Psi}^{\star})^{T}(\beta_{\Phi}^{\star})^{T}\Phi+(\alpha_{\Psi}^{\star})^{T}(\beta_{y}^{\star})^{T}(\gamma_{\Phi}^{\star})^{T}\Phi=(\gamma_{\Phi}^{\star})^{T}\Phi. (220)
αΦ+βΦαΨ+βyαΨγΦ=γΦ.\displaystyle\Leftrightarrow\alpha_{\Phi}^{\star}+\beta_{\Phi}^{\star}\alpha_{\Psi}^{\star}+\beta_{y}^{\star}\alpha_{\Psi}^{\star}\gamma_{\Phi}^{\star}=\gamma_{\Phi}^{\star}. (221)
αΦ+βΦαΨ1βyαΨ=γΦ.\displaystyle\Leftrightarrow\frac{\alpha_{\Phi}^{\star}+\beta_{\Phi}^{\star}\alpha_{\Psi}^{\star}}{1-\beta_{y}^{\star}\alpha_{\Psi}^{\star}}=\gamma_{\Phi}^{\star}. (222)

Thus, α^Φ(t)ΣΦΦ1ΣΦy\hat{\alpha}_{\Phi}^{(t)}\rightarrow\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi y}. Subsequently, f^(t)(ΣΦΦ1ΣΦy)TΦ=𝔼[Y|Φ]\hat{f}^{(t)}\rightarrow(\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi y})^{T}\Phi=\mathbb{E}[Y|\Phi]. The convergence rate is M(Σ)=βyαΨM(\Sigma)=\beta_{y}^{\star}\alpha_{\Psi}^{\star}.
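As a sketch separate from the proof, the normalized recursion of Equation 202 can be iterated numerically for a random positive definite Σ\Sigma: the Ψ\Psi-coefficients decay to zero and the Φ\Phi-coefficients approach ΣΦΦ1ΣΦy\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi y}, matching the fixed-point identity in Equation 222. Variable names and dimensions below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
d_phi, d_psi = 2, 2
n = d_phi + d_psi + 1
G = rng.standard_normal((n, n))
Sigma = G @ G.T + 0.5 * np.eye(n)  # covariance of (Phi, Psi, Y), positive definite

# alpha^*: regression coefficients of Y on (Phi, Psi)
alpha = np.linalg.solve(Sigma[:n - 1, :n - 1], Sigma[:n - 1, -1])
a_phi, a_psi = alpha[:d_phi], alpha[d_phi:]

# beta^*: regression coefficients of Psi on (Phi, Y)
idx = list(range(d_phi)) + [n - 1]
cols = list(range(d_phi, d_phi + d_psi))
beta = np.linalg.solve(Sigma[np.ix_(idx, idx)], Sigma[np.ix_(idx, cols)])
b_phi, b_y = beta[:d_phi, :], beta[d_phi, :]

M = b_y @ a_psi                    # convergence rate M(Sigma)
assert 0.0 <= M < 1.0

# normalized update matrix [[I, b_phi + a_phi b_y], [0, a_psi b_y]]
A = np.vstack([
    np.hstack([np.eye(d_phi), b_phi + np.outer(a_phi, b_y)]),
    np.hstack([np.zeros((d_psi, d_phi)), np.outer(a_psi, b_y)]),
])

coef = alpha.copy()
for _ in range(10000):
    coef = A @ coef

gamma = np.linalg.solve(Sigma[:d_phi, :d_phi], Sigma[:d_phi, -1])  # E[Y|Phi] coefficients
assert np.allclose(coef[:d_phi], gamma, atol=1e-6)   # Phi-coefficients converge to gamma
assert np.allclose(coef[d_phi:], 0, atol=1e-6)       # Psi-coefficients vanish
assert np.allclose((a_phi + b_phi @ a_psi) / (1 - M), gamma)  # Equation 222
```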

We have:

(α~Φ(t+1)α~y(t+1))\displaystyle\begin{pmatrix}\tilde{\alpha}_{\Phi}^{(t+1)}\\ \tilde{\alpha}_{y}^{(t+1)}\end{pmatrix} =K(t+1)(IdΦβΦ0βy)(α^Φ(t)α^Ψ(t)),\displaystyle=K^{(t+1)}\begin{pmatrix}I_{d_{\Phi}}&\beta_{\Phi}^{\star}\\ 0&\beta_{y}^{\star}\end{pmatrix}\begin{pmatrix}\hat{\alpha}_{\Phi}^{(t)}\\ \hat{\alpha}_{\Psi}^{(t)}\end{pmatrix}, (223)

where

(IdΦβΦ0βy)(α^Φ(t)α^Ψ(t))\displaystyle\begin{pmatrix}I_{d_{\Phi}}&\beta_{\Phi}^{\star}\\ 0&\beta_{y}^{\star}\end{pmatrix}\begin{pmatrix}\hat{\alpha}_{\Phi}^{(t)}\\ \hat{\alpha}_{\Psi}^{(t)}\end{pmatrix} t(IdΦβΦ0βy)(ΣΦΦ1ΣΦy0)\displaystyle\xrightarrow{t\to\infty}\begin{pmatrix}I_{d_{\Phi}}&\beta_{\Phi}^{\star}\\ 0&\beta_{y}^{\star}\end{pmatrix}\begin{pmatrix}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi y}\\ 0\end{pmatrix} (224)
=(ΣΦΦ1ΣΦy0).\displaystyle=\begin{pmatrix}\Sigma_{\Phi\Phi}^{-1}\Sigma_{\Phi y}\\ 0\end{pmatrix}. (225)

Define f~^(t)=(α~^Φ(t))TΦ+(α~^y(t))TY\hat{\tilde{f}}^{(t)}=(\hat{\tilde{\alpha}}_{\Phi}^{(t)})^{T}\Phi+(\hat{\tilde{\alpha}}_{y}^{(t)})^{T}Y, where

(α~^Φ(t+1)α~^y(t+1))=(IdΦβΦ0βy)(α^Φ(t)α^Ψ(t)).\displaystyle\begin{pmatrix}\hat{\tilde{\alpha}}_{\Phi}^{(t+1)}\\ \hat{\tilde{\alpha}}_{y}^{(t+1)}\end{pmatrix}=\begin{pmatrix}I_{d_{\Phi}}&\beta_{\Phi}^{\star}\\ 0&\beta_{y}^{\star}\end{pmatrix}\begin{pmatrix}\hat{\alpha}_{\Phi}^{(t)}\\ \hat{\alpha}_{\Psi}^{(t)}\end{pmatrix}. (226)

Thus, f~^(t)𝔼[Y|Φ]\hat{\tilde{f}}^{(t)}\rightarrow\mathbb{E}[Y|\Phi]. Since f~(t)=K(t)f~^(t)\tilde{f}^{(t)}=K^{(t)}\hat{\tilde{f}}^{(t)}, we have f~(t)=𝔼[Y|f~^(t)]𝔼[Y|Φ]\tilde{f}^{(t)}=\mathbb{E}[Y|\hat{\tilde{f}}^{(t)}]\rightarrow\mathbb{E}[Y|\Phi].

Subsequently, f(t)=𝔼[f~(t)|Φ,Ψ]𝔼[Y|Φ]f^{(t)}=\mathbb{E}[\tilde{f}^{(t)}|\Phi,\Psi]\rightarrow\mathbb{E}[Y|\Phi].

Appendix G Limitations

Both our theory and algorithm focus on the bounded regression setting. The definition of extended multicalibration does not depend on the risk function. However, the analysis of the maximal grouping function class as a linear space assumes a continuous probability distribution of observations, implying a continuous target domain. The convergence of MC-Pseudolabel is also established in a regression setting, and all experiments are performed on regression tasks. As most algorithms for out-of-distribution generalization are set up for classification problems, we fill the gap for regression and leave an extension to general risk functions for future work.