
Minimax Regret Optimization for Robust Machine Learning under Distribution Shift

Alekh Agarwal
[email protected]
Google Research
   Tong Zhang
[email protected]
Google Research & HKUST
Abstract

In this paper, we consider learning scenarios where the learned model is evaluated under an unknown test distribution which potentially differs from the training distribution (i.e. distribution shift). The learner has access to a family of weight functions such that the test distribution is a reweighting of the training distribution under one of these functions, a setting typically studied under the name of Distributionally Robust Optimization (DRO). We consider the problem of deriving regret bounds in the classical learning theory setting, and require that the resulting regret bounds hold uniformly for all potential test distributions. We show that the DRO formulation does not guarantee uniformly small regret under distribution shift. We instead propose an alternative method called Minimax Regret Optimization (MRO), and show that under suitable conditions this method achieves uniformly low regret across all test distributions. We also adapt our technique to have stronger guarantees when the test distributions are heterogeneous in their similarity to the training data. Given the widespread optimization of worst case risks in current approaches to robust machine learning, we believe that MRO can be a strong alternative to address distribution shift scenarios.

1 Introduction

Learning good models for scenarios where the evaluation happens under a test distribution that differs from the training distribution has become an increasingly important problem in machine learning and statistics. It is particularly motivated by the widespread practical deployment of machine learning models, and the observation that model performance deteriorates when the test distribution differs from the training distribution. Some motivating scenarios range from class imbalance between training and test data (Galar et al., 2011; Japkowicz, 2000) to algorithmic fairness (Dwork et al., 2012; Barocas and Selbst, 2016). Consequently, several recent formalisms have been proposed to develop algorithms for such settings, such as adversarial training (Goodfellow et al., 2014; Madry et al., 2017; Wang et al., 2019), Invariant Risk Minimization (Arjovsky et al., 2019; Ahuja et al., 2020a; Rosenfeld et al., 2020) and Distributionally Robust Optimization (Duchi and Namkoong, 2021; Namkoong and Duchi, 2016; Staib and Jegelka, 2019; Kuhn et al., 2019). While the specifics differ across these formulations, at a high-level they all minimize the worst-case risk of the learned model across some family of test distributions, and spend considerable effort on finding natural classes of test distributions and designing efficient algorithms to optimize the worst-case objectives.

In this paper, we consider this problem from a learning theory perspective, by asking if it is possible to obtain regret bounds that hold uniformly across all test distributions. Recall that in supervised learning, we are given a function class $\mathcal{F}$ and a loss function $\ell(z,f(z))$, where $z$ denotes a sample. For a test distribution $P$, we are interested in learning a prediction function $f\in\mathcal{F}$ using the training data, so that the risk of $f$ under $P$, $R_{P}(f)={\mathbb{E}}_{z\sim P}\ell(z,f(z))$, is small. In classical learning theory, an estimator $\hat{f}$ is learned from the training data, and we are interested in bounding its regret (or excess risk) as follows:

{\mathrm{Regret}}_{P}(\hat{f})=R_{P}(\hat{f})-\inf_{f\in\mathcal{F}}R_{P}(f)=R_{P}(\hat{f})-R_{P}(f_{P}),

where $f_{P}$ minimizes $R_{P}(f)$ over $\mathcal{F}$. If the training data $(z_{1},\ldots,z_{n})$ are drawn from the same distribution as the test distribution $P$, then standard results imply that the empirical risk minimization (ERM) method, which finds an $\hat{f}$ with a small training error, achieves small regret. A natural question we ask in this paper is whether there exists an estimator that achieves small regret under distribution shift. More precisely, suppose we are given a family of distributions $\mathcal{P}$ with $P_{0}\in\mathcal{P}$. Given access to $n$ samples $(z_{1},\ldots,z_{n})$ from $P_{0}$, we would like to find an estimator $\hat{f}$ from the training data so that the uniform regret

\sup_{P\in\mathcal{P}}{\mathrm{Regret}}_{P}(\hat{f})=\sup_{P\in\mathcal{P}}\left[R_{P}(\hat{f})-\inf_{f\in\mathcal{F}}R_{P}(f)\right] (1)

is small. Here the model class $\mathcal{F}$ can be statistically misspecified in that it may not contain the optimal model $f^{\star}$ that attains the Bayes risk in a pointwise manner.

We note that this objective is different from the objective of Distributionally Robust Optimization (henceforth DRO), which directly minimizes the worst-case risk across all test distributions:

f_{\text{DRO}}=\operatorname*{\text{arginf}}_{f\in\mathcal{F}}\sup_{P\in\mathcal{P}}R_{P}(f). (2)

We observe that the objective (1) is less sensitive than DRO to heterogeneity in the amount of noise across test distributions. Moreover, when the model is well-specified, our criterion directly measures the closeness of the estimator $\hat{f}$ to the optimal model $f^{\star}$, which is often desirable in applications. We call the criterion of minimizing regret uniformly across test distributions Minimax Regret Optimization (MRO); its population formulation seeks to minimize the worst-case regret (1):

f_{\text{MRO}}=\operatorname*{\text{arginf}}_{f\in\mathcal{F}}\sup_{P\in\mathcal{P}}{\mathrm{Regret}}_{P}(f). (3)

Compared to DRO, MRO evaluates the regret of a candidate model $f$ on each distribution $P\in\mathcal{P}$, as opposed to the raw risk. As we show in the following sections, regret is comparable across distributions in a more robust manner than the risk, since subtracting the minimum risk for each distribution removes the variance due to noise. We note that a similar approach of removing the residual variance to deal with heterogeneous noise arises in the offline Reinforcement Learning setting (Antos et al., 2008), which we comment more on in the discussion of related work. Similar to ERM, which replaces the regret minimization problem over a single test distribution by its empirical minimization counterpart, we consider an empirical minimization counterpart of MRO, and analyze its generalization performance.

Our contributions.

With this context, our paper makes the following contributions.

  1.

    The objective (3), which gives an alternative to DRO, alleviates its shortcomings related to noise overfitting, and provides a robust learning formulation for problems with distribution shift.

  2.

    We show that the empirical counterpart to (3) can be analyzed using classical tools from learning theory. When the loss function is bounded and Lipschitz continuous, we show that the regret of our learned model $\widehat{f}_{\text{MRO}}$ on each distribution $P$ can be bounded by that of $f_{\text{MRO}}$ along with a deviation term that decreases as $\mathcal{O}(1/\sqrt{n})$. Concretely, we show that if $dP(z)/dP_{0}(z)\leq B$ for all $P\in\mathcal{P}$, then with high probability,

    \sup_{P\in\mathcal{P}}{\mathrm{Regret}}_{P}(\widehat{f}_{\text{MRO}})\leq\inf_{f\in\mathcal{F}}\sup_{P\in\mathcal{P}}{\mathrm{Regret}}_{P}(f)+\mathcal{O}\left(\frac{B}{\sqrt{n}}\right),

    where we omit the dependence on the complexities of $\mathcal{F}$ and $\mathcal{P}$ as well as the failure probability, and use a simplified assumption on $dP/dP_{0}$ compared with our formal result of Theorem 2 for ease of presentation. That is, $\widehat{f}_{\text{MRO}}$ has a bounded regret on each domain, so long as there is at least one function with a small regret on each domain. For the squared loss, if the class $\mathcal{F}$ is convex, we show that our rates improve to $\mathcal{O}(B/n)$ (Theorem 4). The result follows naturally as empirical regret concentrates to population regret at a fast rate in this setting. We also show that the worst-case risk of $\widehat{f}_{\text{DRO}}$ cannot approach the worst-case risk of $f_{\text{DRO}}$ at a rate faster than $1/\sqrt{n}$ (Proposition 6), which demonstrates a theoretical benefit of our selection criterion.

  3.

    We present SMRO, an adaptation of our basic technique which further rescales the regret of each distribution differently. We show that SMRO retains the same worst-case guarantees as MRO, but markedly improves upon it when the distributions $P\in\mathcal{P}$ satisfy an alignment condition, which includes all well-specified settings (Theorems 7 and 8). The rescaling parameters are fully estimated from data.

  4.

    Algorithmically, we show that our method can be implemented using access to a (weighted) ERM oracle for the function class $\mathcal{F}$, using the standard machinery of solving minimax problems as a two player game involving a no-regret learner and a best response player. The computational complexity scales linearly in the size of $\mathcal{P}$ and as $n^{2}$ in the number of samples.

We conclude this section with a discussion of related work.

1.1 Related Work

Distributional mismatch between training and test data has been studied in many settings (see e.g. (Quinonero-Candela et al., 2009)) and under several notions of mismatch. A common setting is that of covariate shift, where only the marginal distribution of the covariates can differ between training and test, while the conditional distribution of labels given covariates is shared (Shimodaira, 2000; Huang et al., 2006; Bickel et al., 2007). Others study changes in the proportions of the label or other discrete attributes between training and test (Dwork et al., 2012; Xu et al., 2020). More general shifts are considered in domain adaptation (Mansour et al., 2009; Ben-David et al., 2010; Patel et al., 2015) and transfer learning (Pan and Yang, 2009; Tan et al., 2018).

A recent line of work on Invariant Risk Minimization (IRM) (Arjovsky et al., 2019) considers particularly broad generalization to unseen domains from which no examples are available at training time, motivated by preventing the learning of spurious features in ERM-based methods. The idea is to find an estimator which is invariant across all potential test distributions. Although most empirical works on IRM use gradient-penalty based formulations for algorithmic considerations, one interpretation of IRM leads to a minimax formulation similar to the MRO objective (3):

f_{\text{IRM}}=\operatorname*{\text{arginf}}_{f\in\mathcal{F}}R_{P_{0}}(f)\quad\mbox{s.t.}\quad\sup_{P\in\mathcal{P}}\left[R_{P}(f)-R_{P}(f_{P})\right]\leq\epsilon, (4)

where the family $\mathcal{P}$ consists of all individual domains in the training data and $P_{0}$ is the entire training data pooled across domains. However, it is not obvious how to meaningfully analyze the distributional robustness of the IRM method (4) in the classical learning theory setting, and some subsequent works (Ahuja et al., 2020b; Kamath et al., 2021) formalize IRM as minimizing a DRO-style worst-case risk objective, instead of regret as in the original paper. In contrast, our MRO formulation is motivated by extending regret-driven reasoning to settings with distribution shift.

Distributionally robust optimization, which forms a starting point for this work, has a long history in the optimization literature (see e.g. (Ben-Tal et al., 2009; Shapiro, 2017)) and has gained recent prominence in the machine learning literature (Namkoong and Duchi, 2016; Duchi and Namkoong, 2021; Duchi et al., 2021; Staib and Jegelka, 2019; Kuhn et al., 2019; Zhu et al., 2020). DRO has been applied with mixed success in language modeling (Oren et al., 2019), correcting class imbalance (Xu et al., 2020) and group fairness (Hashimoto et al., 2018). For most of the theoretical works, the emphasis is on efficient optimization of the worst-case risk objective. While Duchi and Namkoong (2021) provide sharp upper and lower bounds on the risk of the DRO estimator, they do not examine regret, which is the central focus of our study.

We note that the raw risk used in DRO is sensitive to heterogeneous noise, in that larger noise leads to larger risk. Such sensitivity can be undesirable for some applications. Challenges in learning across scenarios with heterogeneous noise levels have been previously studied in supervised learning (Crammer et al., 2005), and also arise routinely in reinforcement learning (RL). In the setting of offline RL, a class of techniques is designed to minimize the Bellman error criterion (Bertsekas and Tsitsiklis, 1996; Munos and Szepesvári, 2008), which has known challenges in direct unbiased estimation (see e.g. Example 11.4 in Sutton and Barto (1998)). To counter this, Antos et al. (2008) suggest removing the residual variance, which is akin to considering regret instead of risk in our objective and yields minimax objectives similar to those in this work.

On the algorithmic side, we build on the use of regret minimization strategies for solving two player zero-sum games, pioneered by Freund and Schapire (1996) to allow for general distribution families 𝒫\mathcal{P} as opposed to relying on closed-form maximizations for specific families, as performed in many prior works (Namkoong and Duchi, 2016; Staib and Jegelka, 2019; Kuhn et al., 2019). Similar approaches have been studied in the presence of adversarial corruption in Feige et al. (2015) and in the context of algorithmic fairness for regression in Agarwal et al. (2019).

2 Setting

We consider a class of learning problems where the learner is given a dataset $z_{1},\ldots,z_{n}$ sampled from a distribution $P_{0}$, with $z_{i}\in\mathcal{Z}$. We also have a class $\mathcal{P}$ of distributions, and we want to do well relative to this entire class, even though we only have access to samples from $P_{0}$. In Minimax Regret Optimization, we formalize the property of doing uniformly well for all $P\in\mathcal{P}$ through the objective (3). As mentioned in the introduction, this is closely related to the DRO objective (2). The following result shows that small risk does not lead to small regret, which means DRO does not solve the MRO objective (3). The construction also shows that under heterogeneous noise, DRO tends to focus on a distribution $P\in\mathcal{P}$ with a large noise level, which can be undesirable for many applications.

Proposition 1 (DRO is sensitive to noise heterogeneity)

There exists a family of two distributions $\mathcal{P}=\{P_{1},P_{2}\}$ with heterogeneous noise levels, and a function class $\mathcal{F}=\{f_{1},f_{2}\}$, so that $\sup_{P\in\mathcal{P}}{\mathrm{Regret}}_{P}(f_{\text{DRO}})=0.21$, while $\sup_{P\in\mathcal{P}}{\mathrm{Regret}}_{P}(f_{\text{MRO}})=0.03$.

Proof  Suppose $\mathcal{P}$ consists of two distributions over $z\in[0,1]$. The first distribution $P_{1}$ is degenerate with $P_{1}(0.1)=1$. The second distribution is $P_{2}=\text{Ber}(0.5)$. The goal is to estimate the mean of the distributions using the squared loss. In this example, $P_{1}$ has zero noise and $P_{2}$ has relatively high noise. We consider a function class $\mathcal{F}$ consisting of two estimates of the mean, $f_{1}=0.3$ and $f_{2}=0.6$, with $\ell(z,f)=(f-z)^{2}$. Let us abbreviate $R_{P_{i}}(f)=R_{i}(f)$ for $i\in\{1,2\}$. Note that the true mean of $P_{1}$ is $0.1$, and that of $P_{2}$ is $0.5$. Thus $f_{1}$ is a better uniform estimate of the means.

It is easy to verify that $R_{1}(f_{1})=0.04$, $R_{1}(f_{2})=0.25$, $R_{2}(f_{1})=0.29$, and $R_{2}(f_{2})=0.26$. Then the DRO objective (2) evaluates to the optimal value of $0.26$, attained by $f_{2}$. On the other hand, the MRO objective (3) is equal to $0.03$ and this value is attained by $f_{1}$, while $f_{2}$ incurs a much higher objective of $\sup_{P}R_{P}(f_{2})-R_{P}(f_{P})=0.21$. As a result, we have $f_{\text{DRO}}=f_{2}$ and $f_{\text{MRO}}=f_{1}$. Comparing regrets lets MRO remove the underlying noise level of $0.25$ for $P_{2}$.
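The calculation above is easy to check numerically. The following sketch (our own illustration, not code from the paper) recomputes the risks under $P_{1}$ and $P_{2}$ and recovers the DRO and MRO choices:

```python
# Numerical check of Proposition 1: risks under P_1 (point mass at 0.1) and
# P_2 = Ber(0.5), squared loss, for the two candidate mean estimates.
import numpy as np

means = {"f1": 0.3, "f2": 0.6}
def risk_P1(f):  # E_{P_1}(f - z)^2 with z = 0.1 almost surely
    return (f - 0.1) ** 2
def risk_P2(f):  # E_{P_2}(f - z)^2 with z ~ Ber(0.5)
    return 0.5 * f ** 2 + 0.5 * (f - 1.0) ** 2

risks = np.array([[risk_P1(v), risk_P2(v)] for v in means.values()])  # rows: f, cols: P
best_per_P = risks.min(axis=0)                   # inf_f R_P(f) for each P
worst_risk = risks.max(axis=1)                   # DRO objective, per f
worst_regret = (risks - best_per_P).max(axis=1)  # MRO objective, per f

print("DRO picks", list(means)[worst_risk.argmin()], "worst-case risk", worst_risk.min())
print("MRO picks", list(means)[worst_regret.argmin()], "worst-case regret", worst_regret.min())
# Prints: DRO picks f2 (approx. 0.26), MRO picks f1 (approx. 0.03); f2's worst regret is 0.21.
```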


Note that while the worst-case regret of MRO is never worse than that of DRO, this does not imply that the function $f_{\text{MRO}}$ is always preferable to $f_{\text{DRO}}$ under each distribution $P\in\mathcal{P}$. Next we give one such example where $f_{\text{DRO}}$ achieves an arguably better trade-off in balanced performance across distributions. This situation can happen in real applications when the functions achieving minimum risk vary greatly across distributions, potentially due to overfitting. The example shows that both MRO and DRO may have advantages or disadvantages under different conditions.

Example 1 (Heterogeneous regrets across domains)

Suppose the target class $\mathcal{P}$ consists of two distributions over $z\in[0,1]$ and our function class $\mathcal{F}$ consists of three functions $\{f_{1},f_{2},f_{3}\}$. Let us assume that the risks of these functions under the two distributions are given by:

R_{1}(f_{1})=0,~{}~{}R_{1}(f_{2})=0.5,~{}~{}R_{1}(f_{3})=0.5+\epsilon,~{}~{}R_{2}(f_{1})=1,~{}~{}R_{2}(f_{2})=0.9,~{}~{}\mbox{and}~{}~{}R_{2}(f_{3})=0.4.

Then $f_{1}$ is easily disregarded under both MRO and DRO, since it has poor performance under $P_{2}$. For both $f_{2}$ and $f_{3}$, their regrets on $P_{1}$ are larger than on $P_{2}$, so MRO selects the better function for $P_{1}$, that is, $f_{\text{MRO}}=f_{2}$. DRO, on the other hand, prefers $f_{3}$ as it has a smaller worst-case risk, which is arguably the right choice in this scenario.

More generally, one can always find a distribution $P\in\mathcal{P}$ under which the regret of $f_{\text{DRO}}$ is smaller than that of $f_{\text{MRO}}$ and vice versa, but the ordering in the worst case is clear by definition. In the context of these observations, we next discuss how we might estimate $f_{\text{MRO}}$ from samples, for which it is useful to rewrite the objective (3) in an equivalent weight-based formulation we discuss next.

A weight-based reformulation.

Having shown the potential benefits of our population objective, we now consider how to optimize it using samples from $P_{0}$. Notice that the objective (3) does not depend on $P_{0}$ explicitly, which makes it unclear how we should approach the problem given our dataset. We address this issue by adopting an equivalent reweighting-based formulation, as is typically done in DRO. Concretely, let us assume that $P$ is absolutely continuous with respect to $P_{0}$ for all $P\in\mathcal{P}$, so that there exists a weighting function $w:\mathcal{Z}\to\mathbb{R}_{+}$ such that $dP(z)=w(z)dP_{0}(z)$, where ${\mathbb{E}}_{P_{0}}[w]=1$. We can equivalently rewrite the objective (3) as

f_{\text{MRO}}:=\operatorname*{\text{arginf}}_{f\in\mathcal{F}}\sup_{w\in\mathcal{W}}\left\{{\mathrm{Regret}}_{w}(f):=R_{w}(f)-\inf_{f^{\prime}\in\mathcal{F}}R_{w}(f^{\prime})=R_{w}(f)-R_{w}(f_{w})\right\}, (5)

where $\mathcal{W}=\{w~{}:~{}w(z)=dP(z)/dP_{0}(z)~{}~\text{for}~{}~P\in\mathcal{P}\}$ and $R_{w}(f)={\mathbb{E}}_{z\sim P_{0}}[w(z)\ell(z,f(z))]$. It is also straightforward to define an empirical counterpart for this objective. Given $n$ samples $z_{1},\ldots,z_{n}\sim P_{0}$, we define the (weighted) empirical risk and its corresponding minimizer as $\widehat{R}_{w}(f)=\frac{1}{n}\sum_{i=1}^{n}w(z_{i})\ell(z_{i},f(z_{i}))$ and $\widehat{f}_{w}=\operatorname*{\text{arginf}}_{f\in\mathcal{F}}\widehat{R}_{w}(f)$. With these notations, a natural empirical counterpart of the population objective (5) can be written as

\widehat{f}_{\text{MRO}}:=\operatorname*{\text{arginf}}_{f\in\mathcal{F}}\sup_{w\in\mathcal{W}}\big{[}\widehat{R}_{w}(f)-\inf_{f^{\prime}\in\mathcal{F}}\widehat{R}_{w}(f^{\prime})\big{]}=\operatorname*{\text{arginf}}_{f\in\mathcal{F}}\sup_{w\in\mathcal{W}}\big{[}\widehat{R}_{w}(f)-\widehat{R}_{w}(\widehat{f}_{w})\big{]}. (6)
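When both $\mathcal{F}$ and $\mathcal{W}$ are small finite sets, the objective (6) can be evaluated by direct enumeration. The following sketch is our own illustration of this computation (it is not the oracle-based algorithm of Section 6, which avoids enumerating $\mathcal{F}$):

```python
# Evaluate the empirical MRO objective (6) by enumeration over finite F and W.
import numpy as np

def empirical_mro(losses, weights):
    """losses:  (|F|, n) array with losses[j, i] = loss(z_i, f_j(z_i)).
       weights: (|W|, n) array with weights[k, i] = w_k(z_i), each row averaging ~1.
       Returns the index of f_MRO-hat and the per-(w, f) empirical regrets."""
    n = losses.shape[1]
    risk_hat = weights @ losses.T / n                              # R_hat_w(f) for every (w, f)
    regret_hat = risk_hat - risk_hat.min(axis=1, keepdims=True)    # subtract R_hat_w(f_hat_w)
    worst_regret = regret_hat.max(axis=0)                          # sup over w, for each f
    return int(worst_regret.argmin()), regret_hat

# Toy usage: 3 candidate functions, 2 weight functions, n = 100 samples.
rng = np.random.default_rng(0)
losses = rng.uniform(size=(3, 100))
weights = rng.uniform(0.5, 1.5, size=(2, 100))
weights /= weights.mean(axis=1, keepdims=True)   # enforce empirical mean 1 per weight function
j_hat, _ = empirical_mro(losses, weights)
print("f_MRO-hat index:", j_hat)
```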

We now give a couple of concrete examples to instantiate this general formulation.

Example 2 (Covariate shift in supervised learning)

Suppose the samples are $z=(x,y)$, where $x\in\mathbb{R}^{d}$ are features and $y\in\mathbb{R}$ is a prediction target. Suppose that the class $\mathcal{W}$ contains functions $\mathcal{W}=\{w(z)=w_{\theta}(x)~{}:~{}\theta\in\Theta\}$, where $\Theta$ is some parameter class. That is, we allow only the marginal distribution of $x$ to change while the conditional $P_{0}(y|x)$ is identical across test distributions (Shimodaira, 2000; Huang et al., 2006). In our formulation, it is easy to incorporate the covariate shift situation with a general class $\mathcal{W}$ under the restriction $w_{\theta}(x)\leq B$. It is possible to add additional regularity on the class $\Theta$, such as taking the unit ball in an RKHS (Huang et al., 2006; Staib and Jegelka, 2019). We note that most prior studies of DRO impose conditions on the joint perturbations of both $(x,y)$, which leads to closed-form solutions. However, directly instantiating $\mathcal{W}$ in DRO with a bounded $f$-divergence or MMD perturbations of only the marginal $P_{0}(x)$ (instead of the joint distribution of $(x,y)$) does not yield a nice closed-form solution. This results in additional computational and statistical challenges, as discussed in Duchi et al. (2019).
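One simple way to instantiate such a class $\mathcal{W}$ (an illustrative construction of our own, not one prescribed above) is to let each test distribution reweight discrete groups of covariates, so that the bound $B$ is determined by the smallest training group proportion:

```python
# Covariate-shift weights that reweight discrete covariate groups to target
# mixtures alpha, leaving P_0(y|x) fixed; each row has empirical mean exactly 1.
import numpy as np

def group_shift_weights(groups, alphas):
    """groups: (n,) integer group label of each x_i.
       alphas: list of target group-proportion vectors, one per test distribution.
       Returns a (|W|, n) array with w_k(x_i) = alpha_k[g(x_i)] / p_hat[g(x_i)]."""
    n = groups.shape[0]
    p_hat = np.bincount(groups, minlength=groups.max() + 1) / n    # training proportions
    return np.stack([np.asarray(a)[groups] / p_hat[groups] for a in alphas])

groups = np.random.default_rng(1).integers(0, 2, size=200)
W = group_shift_weights(groups, alphas=[[0.5, 0.5], [0.8, 0.2]])
print(W.mean(axis=1))   # each close to 1; B is max_g alpha_k[g] / p_hat[g]
```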

We now demonstrate the versatility of our setting with an example from reinforcement learning.

Example 3 (Offline contextual bandit learning)

In offline contextual bandit (CB) learning, the data consist of $z=(x,a,r)$, with $x\in\mathbb{R}^{d}$ denoting a context, $a\in[K]$ an action out of $K$ possible choices, and $r\in[0,1]$ a reward. There is a fixed and unknown joint distribution $P_{0}$ over $(x,r(1),\ldots,r(K))$. The object of interest is a decision policy $\pi$ which maps a context $x$ to a distribution over actions, and we seek a policy which maximizes the expected reward under its action choices: $\pi^{\star}=\operatorname*{\text{argmax}}_{\pi\in\Pi}{\mathbb{E}}[r(a)~{}|~{}a\sim\pi(\cdot|x),x]$, where $\Pi$ is some given policy class. In the offline setting, there is some fixed policy $\mu$ which is used to choose actions during the data collection process. The training data can then be described as $z=(x,a,r)$, where $x\sim P_{0}$, $a\sim\mu(\cdot|x)$ and $r\sim P_{0}(\cdot|x,a)$, and this differs from the action distributions that other policies in $\Pi$ induce. Existing literature on this problem typically creates an estimator $\eta$ for ${\mathbb{E}}[r|x,a]$ and outputs $\pi=\operatorname*{\text{argmax}}_{\pi\in\Pi}\sum_{i=1}^{n}\sum_{a}\pi(a|x_{i})\eta(x_{i},a)$, and the quality of the resulting policy critically relies on the quality of the estimator $\eta$. A particularly popular choice is the class of doubly robust estimators (Cassel et al., 1976; Dudík et al., 2014), which solve a (weighted) regression problem to estimate $\eta\approx\operatorname*{\text{argmin}}_{f}\sum_{i}\tfrac{\pi(a_{i}|x_{i})}{\mu(a_{i}|x_{i})}(f(x_{i},a_{i})-r_{i})^{2}$. Since we require the reward model to predict well on the actions selected according to different policies $\pi\in\Pi$, the ideal regression objective for any policy $\pi$ reweights the data as per that policy's actions, and doing this weighting results in a significant performance gain (Su et al., 2020; Farajtabar et al., 2018). Typically this weighting is simplified or skipped during policy learning, however, as one needs to simultaneously reweight for all policies $\pi\in\Pi$. MRO provides an approach to do exactly this, by choosing $\mathcal{W}=\{w(x,a,r)=\pi(a\mid x)/\mu(a\mid x)~{}:~{}\pi\in\Pi\}$ in the optimization for estimating $\eta$.
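In this example the weight class is indexed by policies. A minimal sketch of its construction (our notation; the `prob` method is an assumed interface, not part of the paper) is:

```python
# Build the weight class W = { w(x, a, r) = pi(a|x) / mu(a|x) : pi in Pi } on the
# logged data, assuming each policy exposes action probabilities via prob(x, a).
def cb_weights(data, policies, mu):
    """data: list of logged tuples (x, a, r); policies: a finite policy class Pi.
       Returns weights[k][i] = pi_k(a_i | x_i) / mu(a_i | x_i)."""
    return [[pi.prob(x, a) / mu.prob(x, a) for (x, a, r) in data] for pi in policies]
```

These weights satisfy the bound of Assumption 2 whenever the logging policy places probability at least $\pi(a\mid x)/B$ on every action that some $\pi\in\Pi$ can choose.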

Boundedness and complexity assumptions

We now make standard technical assumptions on $\mathcal{F}$, the loss function $\ell$ and the weight class $\mathcal{W}$. To focus attention on the essence of the results, we omit relatively complex tools such as chaining or local Rademacher complexities, and use an $\ell_{\infty}$-covering for uniform convergence over $\mathcal{F}$, where $N(\epsilon,\mathcal{F})$ denotes the $\ell_{\infty}$-covering number of $\mathcal{F}$ at accuracy $\epsilon$. For parametric classes, such an analysis is optimal up to log factors. We also make boundedness and Lipschitz continuity assumptions on the loss function.

Assumption 1 (Bounded and Lipschitz losses)

The loss function $\ell(z,v)$ is bounded in $[0,1]$ for all $z\in\mathcal{Z}$ and $v\in\{f(z)~{}:~{}f\in\mathcal{F},z\in\mathcal{Z}\}$, and is $L$-Lipschitz continuous with respect to $v$.

We also assume that the weight class satisfies a boundedness condition.

Assumption 2 (Bounded importance weights)

All weights $w\in\mathcal{W}$ satisfy $w(z)\leq B_{w}\leq B$ for all $z\in\mathcal{Z}$ and some constant $B\geq 1$.

To avoid the complexity of introducing an additional covering number, throughout this paper we assume that the weight class $\mathcal{W}$ is of finite cardinality, which is reasonable for many domain adaptation applications. We adopt the notations $d_{\mathcal{F}}(\delta)=1+\log\frac{N(1/(nLB),\mathcal{F})}{\delta}$ and $d_{\mathcal{F},\mathcal{W}}(\delta)=d_{\mathcal{F}}(\delta)+\ln(|\mathcal{W}|/\delta)$ to jointly capture the complexities of $\mathcal{F}$ and $\mathcal{W}$, given a failure probability $\delta$.

3 Regret Bounds for Bounded, Lipschitz Losses

We begin with the most general case, with fairly standard assumptions on the loss function and the function and weight classes, and show the following result on the regret of $\widehat{f}_{\text{MRO}}$ for any $w\in\mathcal{W}$.

Theorem 2

Under Assumptions 1 and 2, suppose further that ${\mathbb{E}}_{P_{0}}[w^{2}]\leq\sigma_{w}^{2}$ for any $w\in\mathcal{W}$. Then with probability at least $1-\delta$, we have for all $w\in\mathcal{W}$:

{\mathrm{Regret}}_{w}(\widehat{f}_{\text{MRO}})\leq\inf_{f\in\mathcal{F}}\sup_{w^{\prime}\in\mathcal{W}}{\mathrm{Regret}}_{w^{\prime}}(f)+\sup_{w^{\prime}\in\mathcal{W}}\underbrace{\mathcal{O}\bigg{(}\sqrt{\frac{\sigma_{w^{\prime}}^{2}d_{\mathcal{F},\mathcal{W}}(\delta)}{n}}+\frac{B_{w^{\prime}}d_{\mathcal{F},\mathcal{W}}(\delta)}{n}\bigg{)}}_{\epsilon_{w^{\prime}}}.

The bound of Theorem 2 highlights the main difference between our objective and DRO-style approaches. The bound states that our solution $\widehat{f}_{\text{MRO}}$ has a small regret for each $w\in\mathcal{W}$, as long as at least one such function exists in the function class, up to the usual finite-sample deviation terms. Thus, $\widehat{f}_{\text{MRO}}$ attains the uniform regret guarantee we asked for. To better interpret our result, we state an easy corollary, before making some additional remarks.

Corollary 3

Under the conditions of Theorem 2, suppose further that there exists $f^{\star}\in\mathcal{F}$ such that $R_{w}(f^{\star})\leq R_{w}(f_{w})+\tilde{\epsilon}_{w}$ for all $w\in\mathcal{W}$. Then with probability at least $1-\delta$, we have

\forall w\in\mathcal{W}~{}:~{}R_{w}(\widehat{f}_{\text{MRO}})\leq R_{w}(f^{\star})+\underbrace{\sup_{w^{\prime}\in\mathcal{W}}\tilde{\epsilon}_{w^{\prime}}+\sup_{w^{\prime}\in\mathcal{W}}\epsilon_{w^{\prime}}}_{\epsilon_{\mathcal{W}}}.

Comparison with DRO approaches.

While Proposition 1 already illustrates the improved robustness of MRO to differing noise levels across target distributions, it is instructive to further compare the guarantees for the two approaches. Concretely, in our setting, it can be shown that with probability at least $1-\delta$, we have for all $w\in\mathcal{W}$:

R_{w}(\widehat{f}_{\text{DRO}})\leq\inf_{f\in\mathcal{F}}\sup_{w^{\prime}\in\mathcal{W}}R_{w^{\prime}}(f)+\sup_{w^{\prime}\in\mathcal{W}}\epsilon_{w^{\prime}}. (7)

Compared with Theorem 2, this bound can be inferior when the different distributions identified by $w\in\mathcal{W}$ have vastly different noise levels. For instance, in the setting of Corollary 3, the bound (7) yields $R_{w}(\widehat{f}_{\text{DRO}})\leq\sup_{w\in\mathcal{W}}R_{w}(f^{\star})+\epsilon_{\mathcal{W}}$, which has an additional dependence on the worst-case risk of $f^{\star}$ across all the $w\in\mathcal{W}$. Recall that the assumption of Corollary 3 is satisfied with a small $\epsilon_{\mathcal{W}}$ in the example constructed in Proposition 1, in which DRO tries to fit the large noise. Similarly, if the data distribution is Gaussian with an identical mean but different variances across the $w$, then the two bounds reduce to

R_{w}(\widehat{f}_{\text{MRO}})\leq\text{Var}_{w}+\epsilon_{\mathcal{W}}\quad\mbox{and}\quad R_{w}(\widehat{f}_{\text{DRO}})\leq\sup_{w^{\prime}\in\mathcal{W}}\text{Var}_{w^{\prime}}+\epsilon_{\mathcal{W}},

where $\text{Var}_{w}$ is the variance of the Gaussian corresponding to the importance weights $w$. Overall, these examples serve to illustrate that the DRO objective is reasonable when the different distributions have similar noise levels, but is not robust to heterogeneity in noise levels.

Dependence on the complexity of $\mathcal{W}$.

For discrete $\mathcal{W}$, our results incur a union bound penalty of $\ln|\mathcal{W}|$. We suspect that this dependence is unavoidable in the general case. If we define $\mathcal{W}$ using a bound constraint such as $w(z)\leq B$, or an $f$-divergence based constraint as in the DRO literature, then the worst-case $w$ has a simple closed form in the DRO setting, with no explicit dependence on the complexity of $\mathcal{W}$. These results are easily adapted to MRO as well by observing that $\sup_{w\in\mathcal{W}}{\mathrm{Regret}}_{w}(f)=\sup_{w\in\mathcal{W}}\sup_{f^{\prime}\in\mathcal{F}}\left[R_{w}(f)-R_{w}(f^{\prime})\right]$. For instance, if $\mathcal{W}_{B}=\{w~{}:~{}0\leq w(z)\leq B,{\mathbb{E}}_{P_{0}}[w]=1\}$ is the class of all bounded importance weights, then the results of Shapiro (2017) adapted to MRO imply that

\sup_{w\in\mathcal{W}_{B}}{\mathrm{Regret}}_{w}(f)=\sup_{f^{\prime}\in\mathcal{F}}\inf_{\eta\in\mathbb{R}}\left\{\eta+B{\mathbb{E}}_{P_{0}}\left[\left(\ell(f)-\ell(f^{\prime})-\eta\right)_{+}\right]\right\}. (8)

Similar arguments hold for the family of Cressie-Read divergences studied in Duchi and Namkoong (2021). However, these closed-form maximizations rely on joint perturbations over $(x,y)$ in the supervised learning setting and do not apply to covariate shifts (Duchi et al., 2019).
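To make (8) concrete, the inner infimum over $\eta$ can be computed exactly from samples for a fixed pair $(f,f^{\prime})$, since the empirical objective is convex and piecewise linear in $\eta$ with breakpoints at the sample values of $\ell(f)-\ell(f^{\prime})$. The following sketch (our own, assuming $B\geq 1$) simply scans those breakpoints:

```python
# Empirical inner problem of (8): inf_eta { eta + B * mean((loss_f - loss_fprime - eta)_+) }.
import numpy as np

def worst_case_regret_wb(loss_f, loss_fprime, B):
    """loss_f, loss_fprime: per-sample losses of f and f' on z_1, ..., z_n."""
    d = np.asarray(loss_f) - np.asarray(loss_fprime)
    # The minimizing eta is one of the breakpoints d_j; evaluate the objective at each.
    candidates = d + B * np.maximum(d[None, :] - d[:, None], 0.0).mean(axis=1)
    return candidates.min()
```

Taking the supremum of this quantity over $f^{\prime}\in\mathcal{F}$ then gives an empirical version of the left-hand side of (8) for the class $\mathcal{W}_{B}$.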

Proof [Sketch of the proof of Theorem 2] For a fixed $f$, we can bound $|\widehat{R}_{w}(f)-R_{w}(f)|$ using Bernstein's inequality. This requires bounding the range and variance of the random variable $w(z)\ell(z,f(z))$. Since the losses are bounded in $[0,1]$, these two quantities are bounded by $B_{w}$ and $\sigma_{w}^{2}$ respectively. Assumption 1 allows us to further get a uniform bound over $f\in\mathcal{F}$ with an additional dependence on $d_{\mathcal{F}}(\delta)$. Standard arguments now yield closeness of $\widehat{R}_{w}(f)-\widehat{R}_{w}(\widehat{f}_{w})$ to $R_{w}(f)-R_{w}(f_{w})$ simultaneously for all $f\in\mathcal{F}$ and $w\in\mathcal{W}$, and utilizing the definition of $\widehat{f}_{\text{MRO}}$ as the minimizer of $\sup_{w\in\mathcal{W}}\widehat{R}_{w}(f)-\widehat{R}_{w}(\widehat{f}_{w})$ completes the proof.


4 Fast Rates for Squared Loss and Convex Classes

A limitation of Theorem 2 is that it does not leverage any structure of the loss function beyond Assumption 1, and as a result obtains a $1/\sqrt{n}$ dependence on the number of samples. For the case of a fixed distribution, it is well known that self-bounding properties of the loss variance can be used to obtain faster $1/n$ bounds on the regret of the ERM solution, and here we show that this improvement extends to our setting. For ease of exposition, we specialize to the squared loss in this section. That is, our samples $z$ take the form $(x,y)$ with $x\in\mathcal{X}\subseteq\mathbb{R}^{d}$ and $y\in[-1,1]$, our functions $f$ map from $\mathcal{X}$ to $[-1,1]$, and $\ell(z,f(z))=(f(x)-y)^{2}$. We make an additional convexity assumption on the function class $\mathcal{F}$.

Assumption 3

The class $\mathcal{F}$ is convex: for all $f,f^{\prime}\in\mathcal{F}$, $\alpha f+(1-\alpha)f^{\prime}\in\mathcal{F}$ for all $\alpha\in[0,1]$.

Note that convexity of $\mathcal{F}$ is quite different from convexity of $f$ in its parameters, and can always be satisfied by taking the convex hull of a base class. The assumption can be avoided by replacing our ERM-based solution with aggregation approaches (see e.g. Tsybakov, 2003; Dalalyan and Salmon, 2012). However, we focus on the convex case here to illustrate the key ideas.

Theorem 4

Under Assumptions 1, 2 and 3, with probability at least $1-\delta$, we have for all $w\in\mathcal{W}$:

{\mathrm{Regret}}_{w}(\widehat{f}_{\text{MRO}})\leq{\mathrm{Regret}}_{\star}+\mathcal{O}\Big{(}\sqrt{{\mathrm{Regret}}_{\star}\cdot Bd_{\mathcal{F},\mathcal{W}}(\delta)/n}+Bd_{\mathcal{F},\mathcal{W}}(\delta)/n\Big{)}=\mathcal{O}\Big{(}{\mathrm{Regret}}_{\star}+\frac{Bd_{\mathcal{F},\mathcal{W}}(\delta)}{n}\Big{)},~{}~{}\mbox{where }{\mathrm{Regret}}_{\star}=\inf_{f\in\mathcal{F}}\sup_{w^{\prime}\in\mathcal{W}}{\mathrm{Regret}}_{w^{\prime}}(f).

Theorem 4 crucially leverages the following property of convex classes $\mathcal{F}$ and the squared loss.

Lemma 5

For a convex class $\mathcal{F}$ and the squared loss $\ell(y,f(x))=(y-f(x))^{2}$, we have for any distribution $P$ and $f\in\mathcal{F}$: ${\mathbb{E}}_{P}[(f(x)-f_{P}(x))^{2}]\leq{\mathrm{Regret}}_{P}(f)$.

Note that a similar result holds for the squared loss and any class $\mathcal{F}$, if we replace $f_{P}$ with the unconstrained minimizer of $R_{P}(f)$ over all measurable functions, but using this property in our analysis would result in an additional approximation error term in the bound of Theorem 4, which is undesirable when considering regret within the function class.
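To see why Lemma 5 holds, we sketch the standard convexity argument (this derivation is our own addition, included for completeness). Expanding the squared loss,

{\mathrm{Regret}}_{P}(f)={\mathbb{E}}_{P}\big{[}(f(x)-y)^{2}-(f_{P}(x)-y)^{2}\big{]}={\mathbb{E}}_{P}\big{[}(f(x)-f_{P}(x))^{2}\big{]}+2\,{\mathbb{E}}_{P}\big{[}(f(x)-f_{P}(x))(f_{P}(x)-y)\big{]}.

Since $\mathcal{F}$ is convex and $f_{P}$ minimizes $R_{P}$ over $\mathcal{F}$, the directional derivative of $R_{P}$ at $f_{P}$ towards any $f\in\mathcal{F}$ is nonnegative, that is, $2\,{\mathbb{E}}_{P}\big{[}(f_{P}(x)-y)(f(x)-f_{P}(x))\big{]}\geq 0$, so the cross term above is nonnegative and ${\mathbb{E}}_{P}[(f(x)-f_{P}(x))^{2}]\leq{\mathrm{Regret}}_{P}(f)$ as claimed.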

Proof [Sketch of the proof of Theorem 4] The only difference from the proof of Theorem 2 is in the handling of the variance of $A=w(z)(\ell(z,f(z))-\ell(z,f_{w}(z)))$, which we use as our random variable of interest in this case. Since $w(z)\leq B_{w}$ almost surely, we get

{\mathbb{E}}_{P_{0}}[A^{2}]\stackrel{{(a)}}{{\leq}}16B_{w}{\mathbb{E}}_{w}[(f(x)-f_{w}(x))^{2}]\leq 16B_{w}{\mathrm{Regret}}_{w}(f),

where the inequality $(a)$ follows from the boundedness of $f\in\mathcal{F}$ and $y$, and the final inequality uses Lemma 5. We can now follow the usual recipe for fast rates with a self-bounding property of the variance. The rest of the argument mirrors the proof of Theorem 2.


Comparison with DRO.

It is natural to ask whether it is possible to obtain a fast rate for DRO under the conditions of Theorem 4. Of course, it is not possible to show a regret bound similar to Theorem 4 for DRO, since it does not optimize the worst-case regret, as Proposition 1 illustrates. Nevertheless, we investigate whether DRO can achieve a fast rate for the raw risk. Unfortunately, the following result shows that even under the assumptions of Theorem 4, the worst-case risk criterion for DRO is still subject to the slower $1/\sqrt{n}$ rate. The reason is that regret has a markedly smaller variance than the risk, and the latter usually deviates at a $1/\sqrt{n}$ rate even for the squared loss.

Proposition 6

Under the assumptions of Theorem 4, there is a family of distributions $\mathcal{W}$ with $|\mathcal{W}|=2$, satisfying $B_{w}\leq B=2$, and a function class $\mathcal{F}$ with $d_{\mathcal{F}}=\mathcal{O}(\ln(nL))$, such that with probability at least $1-1/n$, we have $\sup_{w\in\mathcal{W}}R_{w}(\widehat{f}_{\text{DRO}})-\sup_{w\in\mathcal{W}}R_{w}(f_{\text{DRO}})=\Omega(\sqrt{(\ln n)/n})$.

5 Adapting to Misspecification through Non-uniform Scaling across Distributions

Our results so far guarantee uniform regret, with a learning complexity that depends on the worst-case complexity over all distributions $w$. For example, if the complexity of the distribution family is characterized by $B_{w}$ as in our analysis, then the bound depends on $\sup_{w}B_{w}$. In this section, we show that it is possible to improve such dependence using an adaptive scaling of the objective function. To better illustrate the consequences of this approach, let us assume that we can rewrite our guarantees in the form that for each $w\in\mathcal{W}$ and $f\in\mathcal{F}$, we have with probability at least $1-\delta$,

{\mathrm{Regret}}_{w}(f)\leq c(\widehat{R}_{w}(f)-\widehat{R}_{w}(\widehat{f}_{w}))+c_{w}\epsilon, (9)

where the scaling function $c_{w}$ is a distribution-dependent quantity that is known to the algorithm. For instance, under the conditions of Theorem 2, we get this bound with $c=1$, $c_{w}=\sigma_{w}+B_{w}/\sqrt{n}$ and $\epsilon=d_{\mathcal{F},\mathcal{W}}(\delta)/\sqrt{n}$ (assuming $d_{\mathcal{F},\mathcal{W}}(\delta)\geq 1$). Under the conditions of Theorem 4, we can use $c=3$, $c_{w}=B_{w}$ and $\epsilon=d_{\mathcal{F},\mathcal{W}}(\delta)/n$, by taking the second $O(B/n)$ bound of Theorem 4. While this does worsen our dependence on $d_{\mathcal{F},\mathcal{W}}(\delta)$ in the first case, it is preferable to assuming that $d_{\mathcal{F},\mathcal{W}}(\delta)$ is known and setting $c_{w}=\sigma_{w}+B_{w}\sqrt{d_{\mathcal{F},\mathcal{W}}(\delta)/n}$ and $\epsilon=\sqrt{d_{\mathcal{F},\mathcal{W}}(\delta)/n}$.

Since we assume $c_{w}$ is known, we can define the following estimator (Scaled MRO, or SMRO):

\widehat{f}_{\text{SMRO}}:=\operatorname*{\text{arginf}}_{f\in\mathcal{F}}\sup_{w\in\mathcal{W}}(\widehat{R}_{w}(f)-\widehat{R}_{w}(\widehat{f}_{w}))/c_{w}. (10)
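For concreteness, the following sketch (ours, not from the paper) computes the SMRO estimator for a finite function class, using the data-driven scaling $c_{w}=\widehat{\sigma}_{w}+B_{w}/\sqrt{n}$ analyzed in Theorem 7 below; the bounds $B_{w}$ are assumed to be supplied by the user.

```python
# SMRO objective (10) over a finite function class, represented by a per-sample
# loss matrix, with c_w = sigma_hat_w + B_w / sqrt(n) estimated from the data.
import numpy as np

def empirical_smro(losses, weights, B_w):
    """losses: (|F|, n); weights: (|W|, n); B_w: (|W|,) known bounds on each weight."""
    n = losses.shape[1]
    risk_hat = weights @ losses.T / n
    regret_hat = risk_hat - risk_hat.min(axis=1, keepdims=True)
    c_w = np.sqrt((weights ** 2).mean(axis=1)) + np.asarray(B_w) / np.sqrt(n)
    return int((regret_hat / c_w[:, None]).max(axis=0).argmin())
```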

We now present our main results on the statistical properties of $\widehat{f}_{\text{SMRO}}$ under the assumptions of Theorems 2 and 4, before comparing the results and discussing some key consequences.

Theorem 7

Under the assumptions of Theorem 2, consider the estimator $\widehat{f}_{\text{SMRO}}$ with $c_{w}=\widehat{\sigma}_{w}+B_{w}/\sqrt{n}$, where $\widehat{\sigma}_{w}^{2}=\frac{1}{n}\sum_{i=1}^{n}w(z_{i})^{2}$ is the empirical second moment of the importance weights. Then with probability at least $1-\delta$, we have for all $w\in\mathcal{W}$:

{\mathrm{Regret}}_{w}(\widehat{f}_{\text{SMRO}})\leq c_{w}\inf_{f\in\mathcal{F}}\sup_{w^{\prime}\in\mathcal{W}}\frac{{\mathrm{Regret}}_{w^{\prime}}(f)}{c_{w^{\prime}}}+\underbrace{d_{\mathcal{F},\mathcal{W}}(\delta)\;\mathcal{O}\left(\sqrt{\sigma_{w}^{2}/n}+B_{w}/n\right)}_{\epsilon^{\prime}_{w}}.

We can also state a version for squared loss and convex classes.

Theorem 8

Under the assumptions of Theorem 4, consider the estimator $\widehat{f}_{\text{SMRO}}$ with $c_{w}=B_{w}$. Then with probability at least $1-\delta$, we have

\forall w\in\mathcal{W}~{}:~{}{\mathrm{Regret}}_{w}(\widehat{f}_{\text{SMRO}})\leq B_{w}\left[{\mathrm{Regret}}_{\star}+\mathcal{O}\left(\sqrt{{\mathrm{Regret}}_{\star}\cdot\frac{d_{\mathcal{F},\mathcal{W}}(\delta)}{n}}+\frac{d_{\mathcal{F},\mathcal{W}}(\delta)}{n}\right)\right],

where ${\mathrm{Regret}}_{\star}=\inf_{f\in\mathcal{F}}\sup_{w^{\prime}\in\mathcal{W}}\frac{{\mathrm{Regret}}_{w^{\prime}}(f)}{B_{w^{\prime}}}$.

Comparing Theorems 7 and 8 with their MRO counterparts, we notice a subtle but important difference. The finite-sample deviation term $\epsilon^{\prime}_{w}$ in Theorem 7 depends on the weights $w$ for which the bound is stated, while the corresponding term in Theorem 2 is $\sup_{w^{\prime}\in\mathcal{W}}\epsilon_{w^{\prime}}$. In other words, the heterogeneous scaling in SMRO allows us to obtain a regret bound for each distribution depending on the deviation properties of that particular distribution, unlike in the basic MRO approach. To see the benefits of this approach, we state a corollary of Theorems 7 and 8 next.

Corollary 9 (Aligned distribution class)

Suppose that $P_{0}$ and $\mathcal{W}$ satisfy $R_{w}(f_{w})=R_{w}(f^{\star})$ for all $w\in\mathcal{W}$, with $f^{\star}\in\mathcal{F}$. Under the conditions of Theorem 7, with probability at least $1-\delta$:

\forall w\in\mathcal{W}~{}:~{}R_{w}(\widehat{f}_{\text{SMRO}})\leq R_{w}(f^{\star})+d_{\mathcal{F},\mathcal{W}}(\delta)\;\mathcal{O}\left(\sqrt{\frac{\sigma_{w}^{2}}{n}}+\frac{B_{w}}{n}\right).

In the same setting, under the assumptions of Theorem 8, we have with probability at least $1-\delta$:

\forall w\in\mathcal{W}~{}:~{}R_{w}(\widehat{f}_{\text{SMRO}})\leq R_{w}(f^{\star})+\mathcal{O}\left(\frac{B_{w}d_{\mathcal{F},\mathcal{W}}(\delta)}{n}\right).

Both results follow by choosing $f=f_{w}=f^{\star}$ in Theorems 7 and 8. This result shows that the rescaling allows our bounds to adapt to the closeness of a target distribution to the data collection distribution, simultaneously for all target distributions. The alignment condition of a shared $f^{\star}$ is always true in well-specified problems, and we illustrate a consequence of Corollary 9 for well-specified linear regression in Appendix D.

6 Algorithmic considerations

So far our development has focused on the statistical properties of MRO. In this section, we discuss how the MRO estimator can be computed from a finite dataset, given some reasonable computational assumptions on the function class $\mathcal{F}$.

Definition 10 (ERM oracle)

Given a dataset of the form $(\omega_{i},z_{i})_{i=1}^{n}$, an ERM oracle for $\mathcal{F}$ solves the weighted empirical risk minimization problem: $\min_{f\in\mathcal{F}}\sum_{i=1}^{n}\omega_{i}\ell(z_{i},f(z_{i}))$.

While we assume access to an exact ERM oracle, we can weaken the notion to an approximate oracle with an optimization error comparable to the statistical error from finite samples. Given such an oracle, we can now approximately solve the MRO objective (6) (or (10)) by using a well-known strategy of solving minimax problems as two player zero-sum games. To do so, we recall our assumption of a finite class $\mathcal{W}$, and let $\Delta(\mathcal{W})$ denote the set of all distributions over the importance weight family. Similarly, $\Delta(\mathcal{F})$ is the set of distributions over our function class. Then we can rewrite the objective in Equation (6) as:

\inf_{f\in\mathcal{F}}\sup_{w\in\mathcal{W}}\widehat{R}_{w}(f)-\widehat{R}_{w}(\widehat{f}_{w})=\inf_{P\in\Delta(\mathcal{F})}\sup_{\rho\in\Delta(\mathcal{W})}{\mathbb{E}}_{f\sim P,w\sim\rho}\left[\widehat{R}_{w}(f)-\widehat{R}_{w}(\widehat{f}_{w})\right]. (11)

The objective is bilinear in $P$ and $\rho$, so that the following result holds due to the exchangeability of minimum and maximum for convex-concave saddle point problems. It is also a direct consequence of Proposition 12.

Proposition 11

There exists $\hat{\rho}\in\Delta(\mathcal{W})$ so that the solution of MRO is equivalent to the following solution of the weighted ERM method: $\widehat{f}_{\text{MRO}}=\operatorname*{\text{argmin}}_{f\in\mathcal{F}}{\mathbb{E}}_{w\sim\hat{\rho}}\widehat{R}_{w}(f)$.

For finite $\mathcal{W}$ which we consider in this paper, it is also possible to find the approximate $\hat{\rho}$ and $\widehat{f}_{\text{MRO}}$ efficiently, using the celebrated result of Freund and Schapire (1996) to solve this minimax problem through no-regret dynamics. We use the best response strategy for the $P$-player, as finding the best response distribution given a specific $\rho$ is equivalent to finding the function $f\in\mathcal{F}$ that minimizes ${\mathbb{E}}_{w\sim\rho}\widehat{R}_{w}(f)$, which can be accomplished through one call to the ERM oracle. For optimizing $\rho$, we use the exponentiated gradient algorithm of Kivinen and Warmuth (1997) (closely related to Hedge (Freund and Schapire, 1997)), which is a common no-regret strategy to learn a distribution over a finite collection of experts, with each $w\in\mathcal{W}$ being an "expert" in our setting. More formally, we initialize $\rho_{1}$ to be the uniform distribution over $\mathcal{W}$ and repeatedly update:

f_{t}=\operatorname*{\text{arginf}}_{f\in\mathcal{F}}{\mathbb{E}}_{w\sim\rho_{t}}\widehat{R}_{w}(f),\quad\rho_{t+1}(w)\propto\rho_{t}(w)\exp\left(\eta\big{(}\widehat{R}_{w}(f_{t})-\widehat{R}_{w}(\widehat{f}_{w})\big{)}\right). (12)

Let $P_{t}=(f_{1}+\ldots+f_{t})/t$ denote the uniform distribution over the iterates generated by (12). Then the results of Freund and Schapire (1996) yield the following suboptimality bound on $P_{t}$.
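The updates (12) are straightforward to implement for a finite class $\mathcal{W}$. The following sketch (our own illustration, with a finite $\mathcal{F}$ so that the ERM oracle of Definition 10 reduces to an argmin over precomputed risks) runs the best-response and exponentiated-gradient dynamics and returns the averaged distribution $P_{T}$:

```python
# No-regret / best-response dynamics (12) for finite F and W.
import numpy as np

def mro_game(losses, weights, T, B):
    """losses:  (|F|, n) per-sample losses of each candidate function.
       weights: (|W|, n) importance weights, bounded by B.
       Returns P_T, the uniform average over the best-response iterates f_t."""
    n = losses.shape[1]
    risk_hat = weights @ losses.T / n                     # R_hat_w(f), shape (|W|, |F|)
    regret_hat = risk_hat - risk_hat.min(axis=1, keepdims=True)
    eta = np.sqrt(np.log(weights.shape[0]) / (B ** 2 * T))
    log_rho = np.zeros(weights.shape[0])                  # rho_1 uniform over W
    counts = np.zeros(losses.shape[0])
    for _ in range(T):
        rho = np.exp(log_rho - log_rho.max())
        rho /= rho.sum()
        f_t = int((rho @ risk_hat).argmin())              # best response: weighted ERM over F
        counts[f_t] += 1
        log_rho += eta * regret_hat[:, f_t]               # exponentiated-gradient update on rho
    return counts / T
```

With a richer class $\mathcal{F}$, the argmin over rows would be replaced by a call to a weighted ERM oracle with per-sample weights ${\mathbb{E}}_{w\sim\rho_{t}}[w(z_{i})]$.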

Proposition 12

For any $T$, using $\eta=\sqrt{\tfrac{\ln|\mathcal{W}|}{B^{2}T}}$ in the updates (12), the distribution $P_{T}$ satisfies

{\mathbb{E}}_{f\sim P_{T}}\sup_{w}[\widehat{R}_{w}(f)-\widehat{R}_{w}(f_{w})]\leq 2B\sqrt{\frac{\ln|\mathcal{W}|}{T}}.

At a high level, Proposition 12 allows us to choose $T=n^{2}$ to ensure that the optimization error is no larger than the statistical error, allowing the same bounds to apply up to constants. The optimization strategy used here bears some resemblance to boosting techniques, which also seek to reweight the data in order to ensure a uniformly good performance on all samples, though there the reweighting is not constrained to a particular class $\mathcal{W}$. Note that while the optimization error scales logarithmically in $|\mathcal{W}|$, the computational complexity is linear in the size of this set, meaning that our strategy is computationally feasible for sets $\mathcal{W}$ of a modest size. On the other hand, our earlier discussion around (8) suggests that alternative reformulations of the objective might be computationally preferable for classes $\mathcal{W}$ which are continuous. Handling the intermediate regime of a large discrete class $\mathcal{W}$ in a computationally efficient manner is an interesting question for future research.

7 Conclusion

In this paper, we introduce the MRO criterion for problems with distribution shift, and establish learning theoretic properties of optimizing the criterion from samples. We demonstrate the many benefits of reasoning with uniform regret as opposed to uniform risk guarantees, and we expect these observations to have implications beyond the setting of distribution shift.

On the technical side, it remains interesting to further develop scalable algorithms for large datasets and weight classes. Better understanding the statistical scaling with the size of the weight class, and refining our techniques for important scenarios such as covariate shift, are also important directions for future research.

References

  • Agarwal et al. [2019] Alekh Agarwal, Miroslav Dudík, and Zhiwei Steven Wu. Fair regression: Quantitative definitions and reduction-based algorithms. In International Conference on Machine Learning, pages 120–129. PMLR, 2019.
  • Ahuja et al. [2020a] Kartik Ahuja, Karthikeyan Shanmugam, Kush Varshney, and Amit Dhurandhar. Invariant risk minimization games. In International Conference on Machine Learning, pages 145–155. PMLR, 2020a.
  • Ahuja et al. [2020b] Kartik Ahuja, Jun Wang, Amit Dhurandhar, Karthikeyan Shanmugam, and Kush R Varshney. Empirical or invariant risk minimization? a sample complexity perspective. arXiv preprint arXiv:2010.16412, 2020b.
  • Antos et al. [2008] András Antos, Csaba Szepesvári, and Rémi Munos. Learning near-optimal policies with bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89–129, 2008.
  • Arjovsky et al. [2019] Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.
  • Barocas and Selbst [2016] Solon Barocas and Andrew D Selbst. Big data’s disparate impact. Calif. L. Rev., 104:671, 2016.
  • Ben-David et al. [2010] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine learning, 79(1):151–175, 2010.
  • Ben-Tal et al. [2009] Aharon Ben-Tal, Laurent El Ghaoui, and Arkadi Nemirovski. Robust optimization. Princeton university press, 2009.
  • Bertsekas and Tsitsiklis [1996] Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1st edition, 1996. ISBN 1886529108.
  • Bickel et al. [2007] Steffen Bickel, Michael Brückner, and Tobias Scheffer. Discriminative learning for differing training and test distributions. In Proceedings of the 24th international conference on Machine learning, pages 81–88, 2007.
  • Cassel et al. [1976] Claes M Cassel, Carl E Särndal, and Jan H Wretman. Some results on generalized difference estimation and generalized regression estimation for finite populations. Biometrika, 63(3):615–620, 1976.
  • Crammer et al. [2005] Koby Crammer, Michael Kearns, and Jennifer Wortman. Learning from data of variable quality. In NIPS, pages 219–226, 2005.
  • Dalalyan and Salmon [2012] Arnak S Dalalyan and Joseph Salmon. Sharp oracle inequalities for aggregation of affine estimators. The Annals of Statistics, 40(4):2327–2355, 2012.
  • Duchi and Namkoong [2021] John C Duchi and Hongseok Namkoong. Learning models with uniform performance via distributionally robust optimization. The Annals of Statistics, 49(3):1378–1406, 2021.
  • Duchi et al. [2019] John C Duchi, Tatsunori Hashimoto, and Hongseok Namkoong. Distributionally robust losses against mixture covariate shifts. Under review, 2019.
  • Duchi et al. [2021] John C Duchi, Peter W Glynn, and Hongseok Namkoong. Statistics of robust optimization: A generalized empirical likelihood approach. Mathematics of Operations Research, 2021.
  • Dudík et al. [2014] Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. Doubly robust policy evaluation and optimization. Statistical Science, 29(4):485–511, 2014.
  • Dwork et al. [2012] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference, pages 214–226, 2012.
  • Farajtabar et al. [2018] Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. More robust doubly robust off-policy evaluation. In International Conference on Machine Learning, pages 1447–1456. PMLR, 2018.
  • Feige et al. [2015] Uriel Feige, Yishay Mansour, and Robert Schapire. Learning and inference in the presence of corrupted inputs. In Conference on Learning Theory, pages 637–657. PMLR, 2015.
  • Freund and Schapire [1996] Yoav Freund and Robert E Schapire. Game theory, on-line prediction and boosting. In Proceedings of the ninth annual conference on Computational learning theory, pages 325–332, 1996.
  • Freund and Schapire [1997] Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119–139, 1997.
  • Galar et al. [2011] Mikel Galar, Alberto Fernandez, Edurne Barrenechea, Humberto Bustince, and Francisco Herrera. A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(4):463–484, 2011.
  • Goodfellow et al. [2014] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
  • Hashimoto et al. [2018] Tatsunori Hashimoto, Megha Srivastava, Hongseok Namkoong, and Percy Liang. Fairness without demographics in repeated loss minimization. In International Conference on Machine Learning, pages 1929–1938. PMLR, 2018.
  • Huang et al. [2006] Jiayuan Huang, Arthur Gretton, Karsten Borgwardt, Bernhard Schölkopf, and Alex Smola. Correcting sample selection bias by unlabeled data. Advances in neural information processing systems, 19:601–608, 2006.
  • Japkowicz [2000] Nathalie Japkowicz. The class imbalance problem: Significance and strategies. In Proc. of the Int’l Conf. on Artificial Intelligence, volume 56. Citeseer, 2000.
  • Kamath et al. [2021] Pritish Kamath, Akilesh Tangella, Danica Sutherland, and Nathan Srebro. Does invariant risk minimization capture invariance? In International Conference on Artificial Intelligence and Statistics, pages 4069–4077. PMLR, 2021.
  • Kivinen and Warmuth [1997] Jyrki Kivinen and Manfred K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Inf. Comput., 132(1):1–63, 1997.
  • Kuhn et al. [2019] Daniel Kuhn, Peyman Mohajerin Esfahani, Viet Anh Nguyen, and Soroosh Shafieezadeh-Abadeh. Wasserstein distributionally robust optimization: Theory and applications in machine learning. In Operations Research & Management Science in the Age of Analytics, pages 130–166. INFORMS, 2019.
  • Madry et al. [2017] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
  • Mansour et al. [2009] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. arXiv preprint arXiv:0902.3430, 2009.
  • Munos and Szepesvári [2008] Rémi Munos and Csaba Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(5), 2008.
  • Namkoong and Duchi [2016] Hongseok Namkoong and John C Duchi. Stochastic gradient methods for distributionally robust optimization with f-divergences. In NIPS, volume 29, pages 2208–2216, 2016.
  • Oren et al. [2019] Yonatan Oren, Shiori Sagawa, Tatsunori B Hashimoto, and Percy Liang. Distributionally robust language modeling. arXiv preprint arXiv:1909.02060, 2019.
  • Pan and Yang [2009] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2009.
  • Patel et al. [2015] Vishal M Patel, Raghuraman Gopalan, Ruonan Li, and Rama Chellappa. Visual domain adaptation: A survey of recent advances. IEEE signal processing magazine, 32(3):53–69, 2015.
  • Quinonero-Candela et al. [2009] J. Quinonero-Candela, M. Sugiyama, A. Schwaighofer, and N.D. Lawrence. Dataset Shift in Machine Learning. Neural Information Processing series. MIT Press, 2009. ISBN 9780262170055. URL https://books.google.com/books?id=7zrhAAAAMAAJ.
  • Rosenfeld et al. [2020] Elan Rosenfeld, Pradeep Kumar Ravikumar, and Andrej Risteski. The risks of invariant risk minimization. In International Conference on Learning Representations, 2020.
  • Shalev-Shwartz [2012] Shai Shalev-Shwartz. Online learning and online convex optimization. Found. Trends Mach. Learn., page 107–194, 2012.
  • Shapiro [2017] Alexander Shapiro. Distributionally robust stochastic programming. SIAM Journal on Optimization, 27(4):2258–2275, 2017.
  • Shimodaira [2000] Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of statistical planning and inference, 90(2):227–244, 2000.
  • Staib and Jegelka [2019] Matthew Staib and Stefanie Jegelka. Distributionally robust optimization and generalization in kernel methods. Advances in Neural Information Processing Systems, 32:9134–9144, 2019.
  • Su et al. [2020] Yi Su, Maria Dimakopoulou, Akshay Krishnamurthy, and Miroslav Dudík. Doubly robust off-policy evaluation with shrinkage. In International Conference on Machine Learning, pages 9167–9176. PMLR, 2020.
  • Sutton and Barto [1998] R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. Adaptive Computation and Machine Learning series. MIT Press, 1998. ISBN 9780262303842. URL https://books.google.com/books?id=U57uDwAAQBAJ.
  • Tan et al. [2018] Chuanqi Tan, Fuchun Sun, Tao Kong, Wenchang Zhang, Chao Yang, and Chunfang Liu. A survey on deep transfer learning. In International conference on artificial neural networks, pages 270–279. Springer, 2018.
  • Tsybakov [2003] Alexandre B Tsybakov. Optimal rates of aggregation. In Learning theory and kernel machines, pages 303–313. Springer, 2003.
  • Wang et al. [2019] Yisen Wang, Xingjun Ma, James Bailey, Jinfeng Yi, Bowen Zhou, and Quanquan Gu. On the convergence and robustness of adversarial training. In International Conference on Machine Learning, pages 6586–6595. PMLR, 2019.
  • Xu et al. [2020] Ziyu Xu, Chen Dan, Justin Khim, and Pradeep Ravikumar. Class-weighted classification: Trade-offs and robust approaches. In International Conference on Machine Learning, pages 10544–10554. PMLR, 2020.
  • Zhu et al. [2020] Jia-Jie Zhu, Wittawat Jitkrittum, Moritz Diehl, and Bernhard Schölkopf. Kernel distributionally robust optimization. arXiv preprint arXiv:2006.06981, 2020.

Appendix A Proof of Theorem 2

Recalling our assumptions that for any w\in\mathcal{W}, we have \max_{z}w(z)\leq B_{w}\leq B and {\mathbb{E}}_{z\sim P_{0}}w(z)^{2}\leq\sigma_{w}^{2}, we know that for a fixed w\in\mathcal{W} and f\in\mathcal{F}, we have, with probability at least 1-\delta:

|R^w(f)Rw(f)|=𝒪(σw2ln(1/δ)n+Bwln(1/δ)n).\left|\widehat{R}_{w}(f)-R_{w}(f)\right|=\mathcal{O}\left(\sqrt{\frac{\sigma_{w}^{2}\ln(1/\delta)}{n}}+\frac{B_{w}\ln(1/\delta)}{n}\right). (13)

This is a consequence of Bernstein’s inequality applied to the random variable A=w(z)(z,f(z))A=w(z)\ell(z,f(z)) which is bounded by BwB_{w} almost surely, and has a second moment at most σw2\sigma_{w}^{2} when zP0z\sim P_{0}. Since (z,f(z))\ell(z,f(z)) is LL-Lipschitz in the second argument, if ff1/(nLB)\|f-f^{\prime}\|_{\infty}\leq 1/(nLB), we have for any w𝒲w\in\mathcal{W} and any zz:

|w(z)(z,f(z))w(z)(z,f(z))|BwLnLB1n,\displaystyle|w(z)\ell(z,f(z))-w(z)\ell(z,f^{\prime}(z))|\leq\frac{B_{w}L}{nLB}\leq\frac{1}{n},

where the second inequality follows since BwBB_{w}\leq B for all w𝒲w\in\mathcal{W} by assumption. Hence, defining \mathcal{F}^{\prime} to be an \ell_{\infty} cover of \mathcal{F} at a scale 1/nLB1/nLB, a union bound combined with the bound (13) yields that with probability 1δ1-\delta, we have for all ff\in\mathcal{F}^{\prime}:

|R^w(f)Rw(f)|=𝒪(σw2ln(N(,1/(nLB))/δ)n+Bwln(N(,1/(nLB))/δ)n).\left|\widehat{R}_{w}(f)-R_{w}(f)\right|=\mathcal{O}\left(\sqrt{\frac{\sigma_{w}^{2}\ln(N(\mathcal{F},1/(nLB))/\delta)}{n}}+\frac{B_{w}\ln(N(\mathcal{F},1/(nLB))/\delta)}{n}\right).

Using the Lipschitz property of the loss and further taking a union bound over w𝒲w\in\mathcal{W}, this gives for all ff\in\mathcal{F} and w𝒲w\in\mathcal{W}, with probability at least 1δ1-\delta:

|R^w(f)Rw(f)|\displaystyle\left|\widehat{R}_{w}(f)-R_{w}(f)\right| =𝒪(σw2ln(N(,1/(nLB))|𝒲|/δ)n+Bwln(N(,1/(nLB))|𝒲|/δ)n)+1n\displaystyle=\mathcal{O}\left(\sqrt{\frac{\sigma_{w}^{2}\ln(N(\mathcal{F},1/(nLB))|\mathcal{W}|/\delta)}{n}}+\frac{B_{w}\ln(N(\mathcal{F},1/(nLB))|\mathcal{W}|/\delta)}{n}\right)+\frac{1}{n}
=𝒪(σw2d,𝒲(δ)n+Bwd,𝒲(δ)n)=:ϵw,\displaystyle=\mathcal{O}\left(\sqrt{\frac{\sigma_{w}^{2}d_{\mathcal{F},\mathcal{W}}(\delta)}{n}}+\frac{B_{w}d_{\mathcal{F},\mathcal{W}}(\delta)}{n}\right)=:\epsilon_{w}, (14)

where the second equality recalls our definition of d_{\mathcal{F},\mathcal{W}}(\delta)=1+\ln\frac{N(\mathcal{F},1/(nLB))}{\delta}+\ln\frac{|\mathcal{W}|}{\delta}. We now condition on the 1-\delta probability event that the bound of Equation 14 holds, to avoid stating error probabilities in each bound. In particular, applying the bound above to the empirical minimizer \widehat{f}_{w} for any w\in\mathcal{W}, we observe that

Rw(f^w)R^w(f^w)+ϵwR^w(fw)+ϵwRw(fw)+2ϵw.\displaystyle R_{w}(\widehat{f}_{w})\leq\widehat{R}_{w}(\widehat{f}_{w})+\epsilon_{w}\leq\widehat{R}_{w}(f_{w})+\epsilon_{w}\leq R_{w}(f_{w})+2\epsilon_{w}.

Hence, for any ff\in\mathcal{F}:

supw𝒲[Rw(f^MRO)Rw(fw)]\displaystyle\sup_{w\in\mathcal{W}}\left[R_{w}(\widehat{f}_{\text{MRO}})-R_{w}(f_{w})\right] supw𝒲[Rw(f^MRO)R^w(f^w)]+supwϵw\displaystyle\leq\sup_{w\in\mathcal{W}}[R_{w}(\widehat{f}_{\text{MRO}})-\widehat{R}_{w}(\widehat{f}_{w})]+\sup_{w}\epsilon_{w}
supw𝒲[R^w(f^MRO)R^w(f^w)]+2supwϵw\displaystyle\leq\sup_{w\in\mathcal{W}}[\widehat{R}_{w}(\widehat{f}_{\text{MRO}})-\widehat{R}_{w}(\widehat{f}_{w})]+2\sup_{w}\epsilon_{w}
supw𝒲[R^w(f)R^w(f^w)]+2supwϵw\displaystyle\leq\sup_{w\in\mathcal{W}}[\widehat{R}_{w}(f)-\widehat{R}_{w}(\widehat{f}_{w})]+2\sup_{w}\epsilon_{w} (since f^MRO=arginffsupw𝒲R^w(f)R^w(f^w)\widehat{f}_{\text{MRO}}=\operatorname*{\text{arginf}}_{f\in\mathcal{F}}\sup_{w\in\mathcal{W}}\widehat{R}_{w}(f)-\widehat{R}_{w}(\widehat{f}_{w}))
supw𝒲[Rw(f)Rw(f^w)]+4supwϵw\displaystyle\leq\sup_{w\in\mathcal{W}}[R_{w}(f)-R_{w}(\widehat{f}_{w})]+4\sup_{w}\epsilon_{w}
supw𝒲[Rw(f)Rw(fw)]+4supwϵw.\displaystyle\leq\sup_{w\in\mathcal{W}}[R_{w}(f)-R_{w}(f_{w})]+4\sup_{w}\epsilon_{w}.

Taking an infimum over ff\in\mathcal{F} completes the proof of the theorem.
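To make the objects in this proof concrete, the sketch below instantiates them when both \mathcal{F} and \mathcal{W} are finite: it forms the importance-weighted empirical risks \widehat{R}_{w}(f), the per-weight empirical minima \widehat{R}_{w}(\widehat{f}_{w}), and the empirical MRO objective \sup_{w\in\mathcal{W}}[\widehat{R}_{w}(f)-\widehat{R}_{w}(\widehat{f}_{w})] minimized by \widehat{f}_{\text{MRO}}. This is a minimal illustration only; the array layout, function names, and the finite discretization of \mathcal{F} are our own assumptions and not part of the paper.

```python
import numpy as np

def empirical_mro(losses, weights):
    """losses[k, i] = loss of candidate f_k on sample z_i; weights[j, i] = w_j(z_i).
    Returns the index of the empirical MRO solution among the finite candidates."""
    n = losses.shape[1]
    R_hat = losses @ weights.T / n                  # R_hat[k, j] = empirical weighted risk of f_k under w_j
    baseline = R_hat.min(axis=0)                    # baseline[j] = empirical risk of the per-weight ERM
    mro_objective = (R_hat - baseline).max(axis=1)  # worst-case empirical regret of each candidate
    return int(mro_objective.argmin())

# toy usage: 5 candidate predictors, 3 weight functions, 100 samples
rng = np.random.default_rng(0)
losses = rng.uniform(size=(5, 100))
weights = rng.uniform(0.0, 2.0, size=(3, 100))
k_mro = empirical_mro(losses, weights)
```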

Appendix B Proof of Theorem 4

We first prove Theorem 4 using Lemma 5, before proceeding to prove the lemma.

Proof [Proof of Theorem 4] We begin with the deviation bound for a fixed f\in\mathcal{F} and w\in\mathcal{W} as before, but use the regret random variable A=w(z)(\ell(z,f(z))-\ell(z,f_{w}(z))) this time. Then, using w(z)\leq B_{w} together with Lemma 5, we have

𝔼P0[A2]\displaystyle{\mathbb{E}}_{P_{0}}[A^{2}] Bw𝔼P0[w(z)((z,f(z))(z,fw(z)))2]=Bw𝔼w[((z,f(z))(z,fw(z)))2]\displaystyle\leq B_{w}{\mathbb{E}}_{P_{0}}[w(z)(\ell(z,f(z))-\ell(z,f_{w}(z)))^{2}]=B_{w}{\mathbb{E}}_{w}[(\ell(z,f(z))-\ell(z,f_{w}(z)))^{2}]
=Bw𝔼w[((f(x)y)2(fw(x)y)2)2]\displaystyle=B_{w}{\mathbb{E}}_{w}\left[\left((f(x)-y)^{2}-(f_{w}(x)-y)^{2}\right)^{2}\right]
=Bw𝔼w[(f(x)fw(x))2(f(x)+fw(x)2y)2]\displaystyle=B_{w}{\mathbb{E}}_{w}\left[(f(x)-f_{w}(x))^{2}(f(x)+f_{w}(x)-2y)^{2}\right]
16Bw𝔼w[(f(x)fw(x))2]\displaystyle\leq 16B_{w}{\mathbb{E}}_{w}\left[(f(x)-f_{w}(x))^{2}\right] ((|f(x)|,|fw(x)|,|y|1(|f(x)|,|f_{w}(x)|,|y|\leq 1)
16BwRegretw(f).\displaystyle\leq 16B_{w}{\mathrm{Regret}}_{w}(f). (Lemma 5)

Thus we see that for any fixed ff\in\mathcal{F} and w𝒲w\in\mathcal{W}, we have with probability 1δ1-\delta:

|Regretw(f)[R^w(f)R^w(fw)]|\displaystyle|{\mathrm{Regret}}_{w}(f)-[\widehat{R}_{w}(f)-\widehat{R}_{w}(f_{w})]| =𝒪(BwRegretw(f)ln(1/δ)n+Bwln(1/δ)n)\displaystyle=\mathcal{O}\left(\sqrt{\frac{B_{w}{\mathrm{Regret}}_{w}(f)\ln(1/\delta)}{n}}+\frac{B_{w}\ln(1/\delta)}{n}\right)
γRegretw(f)+(1+γ1)𝒪(Bwln(1/δ)n).\displaystyle\leq\gamma{\mathrm{Regret}}_{w}(f)+(1+\gamma^{-1})\mathcal{O}\left(\frac{B_{w}\ln(1/\delta)}{n}\right).

where \gamma>0 is arbitrary, and the second step uses the inequality \sqrt{ab}\leq\gamma a+b/(4\gamma).

Hence, with probability at least 1δ1-\delta, we have for a fixed ff and ww:

(1γ)Regretw(f)\displaystyle(1-\gamma){\mathrm{Regret}}_{w}(f) (R^w(f)R^w(fw))+(1+γ1)𝒪(Bwln(1/δ)n),and\displaystyle\leq(\widehat{R}_{w}(f)-\widehat{R}_{w}(f_{w}))+(1+\gamma^{-1})\mathcal{O}\left(\frac{B_{w}\ln(1/\delta)}{n}\right),\quad\mbox{and}
R^w(f)R^w(fw)\displaystyle\widehat{R}_{w}(f)-\widehat{R}_{w}(f_{w}) (1+γ)Regretw(f)+(1+γ1)𝒪(Bwln(1/δ)n).\displaystyle\leq(1+\gamma){\mathrm{Regret}}_{w}(f)+(1+\gamma^{-1})\mathcal{O}\left(\frac{B_{w}\ln(1/\delta)}{n}\right).

In particular, choosing f=f^wf=\widehat{f}_{w} in the first inequality, we see that

R^w(fw)R^w(f^w)+(1+γ1)𝒪(Bwln(1/δ)n)\displaystyle\widehat{R}_{w}(f_{w})\leq\widehat{R}_{w}(\widehat{f}_{w})+(1+\gamma^{-1})\mathcal{O}\left(\frac{B_{w}\ln(1/\delta)}{n}\right) (15)

Using identical arguments as the proof of Theorem 2 allows us to turn both statements into uniform bounds over all ff\in\mathcal{F} and w𝒲w\in\mathcal{W}. Defining ϵ=Bd,𝒲(δ)/n\epsilon^{\prime}=Bd_{\mathcal{F},\mathcal{W}}(\delta)/n, we now have with probability at least 1δ1-\delta:

\displaystyle(1-\gamma)\sup_{w\in\mathcal{W}}{\mathrm{Regret}}_{w}(\widehat{f}_{\text{MRO}})
supw𝒲[R^w(f^MRO)R^w(fw)]+(1+γ1)ϵ\displaystyle\leq\sup_{w\in\mathcal{W}}[\widehat{R}_{w}(\widehat{f}_{\text{MRO}})-\widehat{R}_{w}(f_{w})]+(1+\gamma^{-1})\epsilon^{\prime}
supw𝒲[R^w(f^MRO)R^w(f^w)]+(1+γ1)ϵ\displaystyle\leq\sup_{w\in\mathcal{W}}[\widehat{R}_{w}(\widehat{f}_{\text{MRO}})-\widehat{R}_{w}(\widehat{f}_{w})]+(1+\gamma^{-1})\epsilon^{\prime} (R^w(fw)R^w(f^w)\widehat{R}_{w}(f_{w})\geq\widehat{R}_{w}(\widehat{f}_{w}))
\displaystyle\leq\sup_{w\in\mathcal{W}}[\widehat{R}_{w}(f)-\widehat{R}_{w}(\widehat{f}_{w})]+(1+\gamma^{-1})\epsilon^{\prime} (since \widehat{f}_{\text{MRO}}=\operatorname*{\text{arginf}}_{f\in\mathcal{F}}\sup_{w\in\mathcal{W}}[\widehat{R}_{w}(f)-\widehat{R}_{w}(\widehat{f}_{w})])
\displaystyle\leq\sup_{w\in\mathcal{W}}[\widehat{R}_{w}(f)-\widehat{R}_{w}(f_{w})]+2(1+\gamma^{-1})\epsilon^{\prime} (Equation 15)
(1+γ)supw𝒲[Rw(f)Rw(fw)]+3(1+γ1)ϵ\displaystyle\leq(1+\gamma)\sup_{w\in\mathcal{W}}[R_{w}(f)-R_{w}(f_{w})]+3(1+\gamma^{-1})\epsilon^{\prime}
=(1+γ)supw𝒲Regretw(f)+3(1+γ1)ϵ.\displaystyle=(1+\gamma)\sup_{w\in\mathcal{W}}{\mathrm{Regret}}_{w}(f)+3(1+\gamma^{-1})\epsilon^{\prime}.

Taking an infimum over ff\in\mathcal{F} and γ=min(0.5,ϵ/inffsupw𝒲Regretw(f))\gamma=\min(0.5,\sqrt{\epsilon^{\prime}/\inf_{f\in\mathcal{F}}\sup_{w\in\mathcal{W}}{\mathrm{Regret}}_{w}(f)}) completes the proof.  
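For concreteness, write R^{\star}=\inf_{f\in\mathcal{F}}\sup_{w\in\mathcal{W}}{\mathrm{Regret}}_{w}(f). Since the chosen \gamma satisfies \gamma\leq 1/2, dividing the display above by (1-\gamma) and using (1+\gamma)/(1-\gamma)\leq 1+4\gamma together with (1+\gamma^{-1})/(1-\gamma)\leq 4/\gamma shows that, up to the constants absorbed into \epsilon^{\prime},

\sup_{w\in\mathcal{W}}{\mathrm{Regret}}_{w}(\widehat{f}_{\text{MRO}})\leq R^{\star}+4\gamma R^{\star}+\frac{12\epsilon^{\prime}}{\gamma}\leq R^{\star}+\mathcal{O}\left(\sqrt{\epsilon^{\prime}R^{\star}}+\epsilon^{\prime}\right),

where the last inequality substitutes \gamma=\min(0.5,\sqrt{\epsilon^{\prime}/R^{\star}}) and treats the cases \epsilon^{\prime}\leq R^{\star}/4 and \epsilon^{\prime}>R^{\star}/4 separately.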


We now prove Lemma 5.

B.1 Proof of Lemma 5

Since \mathcal{F} is convex, if f1f_{1}\in\mathcal{F} and f2f_{2}\in\mathcal{F}, then αf1+(1α)f2\alpha f_{1}+(1-\alpha)f_{2}\in\mathcal{F} for α[0,1]\alpha\in[0,1]. Then using fP=arginffRP(f)f_{P}=\operatorname*{\text{arginf}}_{f\in\mathcal{F}}R_{P}(f), we have for any distribution PP and ff\in\mathcal{F}:

0\displaystyle 0 RP(αf+(1α)fP)RP(fP)\displaystyle\leq R_{P}(\alpha f+(1-\alpha)f_{P})-R_{P}(f_{P})
=α𝔼P[(αf(x)+(2α)fP(x)2y)(f(x)fP(x))]\displaystyle=\alpha{\mathbb{E}}_{P}[(\alpha f(x)+(2-\alpha)f_{P}(x)-2y)(f(x)-f_{P}(x))]
=α2𝔼P[(f(x)fP(x))2]+2α𝔼P[(fP(x)y)(f(x)fP(x))].\displaystyle=\alpha^{2}{\mathbb{E}}_{P}[(f(x)-f_{P}(x))^{2}]+2\alpha{\mathbb{E}}_{P}[(f_{P}(x)-y)(f(x)-f_{P}(x))].

Since this holds for any α[0,1]\alpha\in[0,1], we take the limit α0\alpha\downarrow 0 to conclude that for all ff\in\mathcal{F}

𝔼P[(fP(x)y)(f(x)fP(x))]0.{\mathbb{E}}_{P}[(f_{P}(x)-y)(f(x)-f_{P}(x))]\geq 0. (16)

This inequality further allows us to conclude for any ff\in\mathcal{F}

\displaystyle R_{P}(f)-R_{P}(f_{P})={\mathbb{E}}_{P}[(f(x)+f_{P}(x)-2y)(f(x)-f_{P}(x))]
=𝔼P[(f(x)fP(x))2]+2𝔼P[(fP(x)y)(f(x)fP(x))]\displaystyle={\mathbb{E}}_{P}[(f(x)-f_{P}(x))^{2}]+2{\mathbb{E}}_{P}[(f_{P}(x)-y)(f(x)-f_{P}(x))]
𝔼P[(f(x)fP(x))2].\displaystyle\geq{\mathbb{E}}_{P}[(f(x)-f_{P}(x))^{2}].

B.2 Proof of Proposition 6

We consider a simple problem where there is a class 𝒫={P1,P2}\mathcal{P}=\{P_{1},P_{2}\} consisting of two distributions that we want to do well under. Each distribution PiP_{i} is such that xPix\sim P_{i} takes the form x=μi+ϵx=\mu_{i}+\epsilon where μi\mu_{i}\in\mathbb{R} and ϵ\epsilon is ±1\pm 1 with probability 1/21/2. The function class =[C,C]\mathcal{F}=[-C,C] for a constant CC that will be appropriately chosen in the proof. Let Δμ=μ1μ2\Delta_{\mu}=\mu_{1}-\mu_{2} be the difference of means. Let us choose P0P_{0} to be supported over ×{1,2}\mathbb{R}\times\{1,2\} so that z=(x,1)z=(x,1) where xP1x\sim P_{1} with probability 0.5 and z=(x,2)z=(x,2) with xP2x\sim P_{2} with probability 0.5. We set w1(z)=1w_{1}(z)=1 if z2=1z_{2}=1 and similarly w2(z)=1w_{2}(z)=1 if z2=2z_{2}=2. Let us define g(z)=f(z1)g(z)=f(z_{1}) for ff\in\mathcal{F}. Since g(z)g(z) is equivalent to f(x)f(x) with x=z1x=z_{1} in this example, we stick to using the notation ff\in\mathcal{F} for the rest of the proof for consistency with our notation throughout. Let us use RiR_{i}, R^i\widehat{R}_{i} to denote the empirical and expected risks on samples from PiP_{i} (equivalently wiw_{i}). Then a direct calculation shows that

Ri(f)=(fμi)2+1,andR^i(f)=(fμi)2+12Si(fμi),\displaystyle R_{i}(f)=(f-\mu_{i})^{2}+1,\quad\mbox{and}\quad\widehat{R}_{i}(f)=(f-\mu_{i})^{2}+1-2S_{i}(f-\mu_{i}),

where Si=j=1nwi(zj)ϵj/(j=1nwi(zj))S_{i}=\sum_{j=1}^{n}w_{i}(z_{j})\epsilon_{j}/(\sum_{j=1}^{n}w_{i}(z_{j})) with ϵj\epsilon_{j} being the noise realization in zjz_{j}. Then for sufficiently large CC, we see that

fDRO=argminfmaxiRi(f)\displaystyle f_{\text{DRO}}=\operatorname*{\text{argmin}}_{f\in\mathcal{F}}\max_{i}R_{i}(f) =argminfmax{(fμ1)2,(fμ2)2}=μ1+μ22,\displaystyle=\operatorname*{\text{argmin}}_{f\in\mathcal{F}}\max\{(f-\mu_{1})^{2},(f-\mu_{2})^{2}\}=\frac{\mu_{1}+\mu_{2}}{2},

and the best worst case risk is given by maxiRi(fDRO)=1+Δμ2/4\max_{i}R_{i}(f_{\text{DRO}})=1+\Delta_{\mu}^{2}/4. Now let us examine the empirical situation. We have

f^DRO\displaystyle\widehat{f}_{\text{DRO}} =argminfmaxiR^i(f)=argminfmax{(fμ1)22S1(fμ1),(fμ2)22S2(fμ2)}.\displaystyle=\operatorname*{\text{argmin}}_{f\in\mathcal{F}}\max_{i}\widehat{R}_{i}(f)=\operatorname*{\text{argmin}}_{f\in\mathcal{F}}\max\{(f-\mu_{1})^{2}-2S_{1}(f-\mu_{1}),(f-\mu_{2})^{2}-2S_{2}(f-\mu_{2})\}.

Let us denote f^i=μi+Si\widehat{f}_{i}=\mu_{i}+S_{i} as the minimizer of R^i(f)\widehat{R}_{i}(f). Let us condition on the event \mathcal{E} where:

={R^1(f^1)R^2(f^1)andR^1(f^2)R^2(f^2)}.\mathcal{E}=\{\widehat{R}_{1}(\widehat{f}_{1})\leq\widehat{R}_{2}(\widehat{f}_{1})\quad\mbox{and}\quad\widehat{R}_{1}(\widehat{f}_{2})\geq\widehat{R}_{2}(\widehat{f}_{2})\}. (17)

Under this event, since both \widehat{R}_{1}(f) and \widehat{R}_{2}(f) are continuous functions of f, they are equal for some f\in[\min(\widehat{f}_{1},\widehat{f}_{2}),\max(\widehat{f}_{1},\widehat{f}_{2})]. So conditioned on \mathcal{E}, we seek a solution to

0\displaystyle 0 =(fμ1)22S1(fμ1)(fμ2)2+2S2(fμ2)\displaystyle=(f-\mu_{1})^{2}-2S_{1}(f-\mu_{1})-(f-\mu_{2})^{2}+2S_{2}(f-\mu_{2})
=2f(μ2μ1+S2S1)+(μ12μ22+2S1μ12S2μ2),\displaystyle=2f(\mu_{2}-\mu_{1}+S_{2}-S_{1})+(\mu_{1}^{2}-\mu_{2}^{2}+2S_{1}\mu_{1}-2S_{2}\mu_{2}),

so that we get

f^DRO\displaystyle\widehat{f}_{\text{DRO}} =μ22μ12+2S2μ22S1μ12(μ2μ1+S2S1)\displaystyle=\frac{\mu_{2}^{2}-\mu_{1}^{2}+2S_{2}\mu_{2}-2S_{1}\mu_{1}}{2(\mu_{2}-\mu_{1}+S_{2}-S_{1})}
=Δμ(μ1+μ2)+2S1μ12S2μ22(Δμ+ΔS)\displaystyle=\frac{\Delta_{\mu}(\mu_{1}+\mu_{2})+2S_{1}\mu_{1}-2S_{2}\mu_{2}}{2(\Delta_{\mu}+\Delta_{S})}
=μ1+μ22+2S1μ12S2μ2(μ1+μ2)ΔS2(Δμ+ΔS)\displaystyle=\frac{\mu_{1}+\mu_{2}}{2}+\frac{2S_{1}\mu_{1}-2S_{2}\mu_{2}-(\mu_{1}+\mu_{2})\Delta_{S}}{2(\Delta_{\mu}+\Delta_{S})}
=μ1+μ22+S1μ1S2μ2S1μ2+S2μ12(Δμ+ΔS)\displaystyle=\frac{\mu_{1}+\mu_{2}}{2}+\frac{S_{1}\mu_{1}-S_{2}\mu_{2}-S_{1}\mu_{2}+S_{2}\mu_{1}}{2(\Delta_{\mu}+\Delta_{S})}
=μ1+μ22+(S1+S2)Δμ2(Δμ+ΔS)ξ.\displaystyle=\frac{\mu_{1}+\mu_{2}}{2}+\underbrace{\frac{(S_{1}+S_{2})\Delta_{\mu}}{2(\Delta_{\mu}+\Delta_{S})}}_{\xi}.

Plugging the minimizer back in, we obtain the best worst case population risk as

\displaystyle\max_{i}R_{i}(\widehat{f}_{\text{DRO}})-1=\max\{(\widehat{f}_{\text{DRO}}-\mu_{1})^{2},(\widehat{f}_{\text{DRO}}-\mu_{2})^{2}\}
=max{(Δμ/2+ξ)2,(Δμ/2+ξ)2}\displaystyle=\max\left\{(-\Delta_{\mu}/2+\xi)^{2},(\Delta_{\mu}/2+\xi)^{2}\right\}
=Δμ2/4+ξ2+max{Δμξ,Δμξ}\displaystyle=\Delta_{\mu}^{2}/4+\xi^{2}+\max\left\{-\Delta_{\mu}\xi,\Delta_{\mu}\xi\right\}
=Δμ2/4+ξ2+|Δμξ|.\displaystyle=\Delta_{\mu}^{2}/4+\xi^{2}+|\Delta_{\mu}\xi|.

Recalling that minfmaxiRi(f)=1+Δμ2/4\min_{f}\max_{i}R_{i}(f)=1+\Delta_{\mu}^{2}/4, we further obtain

maxiRi(f^DRO)maxiRi(fDRO)\displaystyle\max_{i}R_{i}(\widehat{f}_{\text{DRO}})-\max_{i}R_{i}(f_{\text{DRO}}) =ξ2+|Δμξ|.\displaystyle=\xi^{2}+|\Delta_{\mu}\xi|.

In order to complete the lower bound on this regret term, we now analyze the random variable \xi, and we choose \Delta_{\mu}=1. Writing n_{i}=\sum_{j=1}^{n}w_{i}(z_{j}) for the number of samples drawn from P_{i}, standard concentration arguments for S_{i} and n_{i} give, with probability at least 1-4\delta, for i=1,2:

\displaystyle|S_{i}|\leq c\sqrt{\frac{\log(1/\delta)}{n_{i}}},\quad\mbox{and}\quad\left|n_{i}-\frac{n}{2}\right|\leq c\sqrt{n\log(1/\delta)}.

Under this 14δ1-4\delta event, we have ΔS=𝒪(ln(1/δ)/n)\Delta_{S}=\mathcal{O}(\sqrt{\ln(1/\delta)/n}) and S1+S2=O(ln(1/δ)/n)S_{1}+S_{2}=O(\sqrt{\ln(1/\delta)/n}). Consequently, under this event

|ξ|=𝒪(|Δμln(1/δ)2n(Δμln(1/δ)/n)|).\displaystyle|\xi|=\mathcal{O}\left(\left|\frac{\Delta_{\mu}\sqrt{\ln(1/\delta)}}{2\sqrt{n}(\Delta_{\mu}-\sqrt{\ln(1/\delta)/n})}\right|\right).

Choosing 4δ=1/n4\delta=1/n, we further have

|ξ|=𝒪(|ΔμlnnnΔμ|)=𝒪(lnnn),\displaystyle|\xi|=\mathcal{O}\left(\left|\frac{\Delta_{\mu}\sqrt{\ln n}}{\sqrt{n}\Delta_{\mu}}\right|\right)=\mathcal{O}\left(\sqrt{\frac{\ln n}{n}}\right),

where we have used \Delta_{\mu}/2\geq\sqrt{\ln n/n}, which holds for all sufficiently large n since \Delta_{\mu}=1. Note that we have not yet set the constant C defining \mathcal{F}. In order for the unconstrained minimizations above to be attained in \mathcal{F}, we need f_{\text{DRO}},\widehat{f}_{\text{DRO}}\in\mathcal{F}. This means that C\geq(\mu_{1}+\mu_{2})/2 and C\geq(\mu_{2}^{2}-\mu_{1}^{2}+2S_{2}\mu_{2}-2S_{1}\mu_{1})/(2(\mu_{2}-\mu_{1}+S_{2}-S_{1})). The latter is satisfied under our 1-4/n probability event when C\geq\mu_{1}+\mu_{2}.

Finally we recall that we have conditioned so far on the event \mathcal{E} defined in Equation 17. We now examine (C)\mathbb{P}(\mathcal{E}^{C}) under the 14/n1-4/n event, and focus on the first condition in (17). We have

\displaystyle\widehat{R}_{1}(\widehat{f}_{1})-\widehat{R}_{2}(\widehat{f}_{1})=-S_{1}^{2}+S_{2}^{2}-(\Delta_{\mu}+\Delta_{S})^{2}
𝒪(lognn1+lognn)0,\displaystyle\leq\mathcal{O}\left(\frac{\log n}{n}-1+\sqrt{\frac{\log n}{n}}\right)\leq 0,

under the 14/n1-4/n probability event. Similar reasoning also shows that R^2(f^2)R^1(f^2)\widehat{R}_{2}(\widehat{f}_{2})\leq\widehat{R}_{1}(\widehat{f}_{2}) in this event, so that we get the conclusion of the proposition.
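The construction above is simple enough to simulate numerically. The sketch below is our own illustration (the grid search over f and the specific constants are not from the paper): it draws the mixture sample, forms the group-wise empirical risks \widehat{R}_{i}(f)=(f-\mu_{i})^{2}+1-2S_{i}(f-\mu_{i}), and compares the worst-group population risk of \widehat{f}_{\text{DRO}} to that of f_{\text{DRO}}=(\mu_{1}+\mu_{2})/2, which makes the fluctuation term \xi visible.

```python
import numpy as np

def dro_gap(n, mu1=0.0, mu2=1.0, seed=0):
    """Simulate the two-group example: x = mu_i + eps with eps = +/-1, group i chosen uniformly."""
    rng = np.random.default_rng(seed)
    group = rng.integers(1, 3, size=n)                       # z_2 in {1, 2}
    eps = rng.choice([-1.0, 1.0], size=n)
    S = np.array([eps[group == i].mean() for i in (1, 2)])   # group-wise noise averages S_1, S_2
    mus = np.array([mu1, mu2])

    # empirical worst-group risk, minimized over a fine grid standing in for F = [-C, C]
    grid = np.linspace(min(mus) - 1.0, max(mus) + 1.0, 20001)
    emp = np.max((grid[:, None] - mus) ** 2 + 1 - 2 * S * (grid[:, None] - mus), axis=1)
    f_dro_hat = grid[emp.argmin()]

    # worst-group population risks of the empirical and population DRO solutions
    pop = lambda f: np.max((f - mus) ** 2 + 1)
    return pop(f_dro_hat) - pop((mu1 + mu2) / 2.0)           # approximately xi^2 + |Delta_mu * xi|

# the gap decays roughly like sqrt(log n / n) as n grows
for n in (100, 1000, 10000):
    print(n, dro_gap(n))
```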

Appendix C Proofs for Section 5

We start with a technical lemma about the concentration of empirical estimates of the second moment of importance weights, which is required for Theorem 7.

Lemma 13

Under Assumption 2, we have with probability at least 1δ1-\delta, for all w𝒲w\in\mathcal{W}:

𝔼P0w(z)22ni=1nw(zi)2+𝒪(Bw2ln(|𝒲|/δ)n).\displaystyle{\mathbb{E}}_{P_{0}}w(z)^{2}\leq\frac{2}{n}\sum_{i=1}^{n}w(z_{i})^{2}+\mathcal{O}\left(\frac{B_{w}^{2}\ln(|\mathcal{W}|/\delta)}{n}\right).

Proof  Consider the non-negative random variable A=w(z)2A=w(z)^{2} so that ABw2A\leq B_{w}^{2} with probability 1 when zP0z\sim P_{0}. Then we have

𝔼P0[A2]\displaystyle{\mathbb{E}}_{P_{0}}[A^{2}] =𝔼P0[w(z)4]Bw2𝔼P0[w(z)2].\displaystyle={\mathbb{E}}_{P_{0}}[w(z)^{4}]\leq B_{w}^{2}{\mathbb{E}}_{P_{0}}[w(z)^{2}].

Then for a fixed w\in\mathcal{W}, Bernstein's inequality followed by the AM-GM inequality \sqrt{ab}\leq(a+b)/2 gives, with probability at least 1-\delta:

|1ni=1nw(zi)2𝔼P0w(z)2|\displaystyle\left|\frac{1}{n}\sum_{i=1}^{n}w(z_{i})^{2}-{\mathbb{E}}_{P_{0}}w(z)^{2}\right|
=𝒪(Bw2𝔼P0[w(z)2]ln(1/δ)n+Bw2ln(1/δ)n)\displaystyle=\mathcal{O}\left(\sqrt{\frac{B_{w}^{2}{\mathbb{E}}_{P_{0}}[w(z)^{2}]\ln(1/\delta)}{n}}+\frac{B_{w}^{2}\ln(1/\delta)}{n}\right)
𝔼P0[w(z)2]2+𝒪(Bw2ln(1/δ)n).\displaystyle\leq\frac{{\mathbb{E}}_{P_{0}}[w(z)^{2}]}{2}+\mathcal{O}\left(\frac{B_{w}^{2}\ln(1/\delta)}{n}\right).

Rearranging terms and taking a union bound over w𝒲w\in\mathcal{W} completes the proof.  


We are now ready to prove Theorem 7.

C.1 Proof of Theorem 7

Recall the definition σ^w2=1ni=1nw(zi)2\widehat{\sigma}^{2}_{w}=\frac{1}{n}\sum_{i=1}^{n}w(z_{i})^{2}. Plugging the result of Lemma 13 in Equation 13, we see that with probability at least 1δ1-\delta, for a fixed ff\in\mathcal{F} and w𝒲w\in\mathcal{W} we have:

|R^w(f)Rw(f)|\displaystyle\left|\widehat{R}_{w}(f)-R_{w}(f)\right| =𝒪((2σ^w2+Bw2/n)ln(2/δ)n+Bwln(2/δ)n)\displaystyle=\mathcal{O}\left(\sqrt{\frac{(2\widehat{\sigma}_{w}^{2}+B_{w}^{2}/n)\ln(2/\delta)}{n}}+\frac{B_{w}\ln(2/\delta)}{n}\right)
=𝒪(σ^w2ln(2/δ)n+Bwln(2/δ)n).\displaystyle=\mathcal{O}\left(\sqrt{\frac{\widehat{\sigma}_{w}^{2}\ln(2/\delta)}{n}}+\frac{B_{w}\ln(2/\delta)}{n}\right).

Now following the proof of Theorem 2, with probability at least 1δ1-\delta, we have for all w𝒲w\in\mathcal{W}:

Rw(f^SMRO)Rw(fw)\displaystyle R_{w}(\widehat{f}_{\text{SMRO}})-R_{w}(f_{w})
\displaystyle=\widehat{R}_{w}(\widehat{f}_{\text{SMRO}})-\widehat{R}_{w}(\widehat{f}_{w})+\mathcal{O}\left(\sqrt{\frac{\widehat{\sigma}_{w}^{2}(d_{\mathcal{F}}(\delta)+\log(|\mathcal{W}|/\delta))}{n}}+\frac{B_{w}(d_{\mathcal{F}}(\delta)+\log(|\mathcal{W}|/\delta))}{n}\right)
=R^w(f^SMRO)R^w(f^w)+(σ^w+Bwn)cw𝒪(d,𝒲(δ)n)ϵ.\displaystyle=\widehat{R}_{w}(\widehat{f}_{\text{SMRO}})-\widehat{R}_{w}(\widehat{f}_{w})+\underbrace{\left(\widehat{\sigma}_{w}+\frac{B_{w}}{\sqrt{n}}\right)}_{c_{w}}\,\underbrace{\mathcal{O}\left(\frac{d_{\mathcal{F},\mathcal{W}}(\delta)}{\sqrt{n}}\right)}_{\epsilon}.

Now dividing through by cwc_{w}, we see that for all ff\in\mathcal{F}, we have with probability at least 1δ1-\delta:

supw𝒲Rw(f^SMRO)Rw(fw)cw\displaystyle\sup_{w\in\mathcal{W}}\frac{R_{w}(\widehat{f}_{\text{SMRO}})-R_{w}(f_{w})}{c_{w}} supw𝒲R^w(f^SMRO)R^w(f^w)cw+ϵ\displaystyle\leq\sup_{w\in\mathcal{W}}\frac{\widehat{R}_{w}(\widehat{f}_{\text{SMRO}})-\widehat{R}_{w}(\widehat{f}_{w})}{c_{w}}+\epsilon
inffsupw𝒲R^w(f)R^w(f^w)cw+ϵ\displaystyle\leq\inf_{f\in\mathcal{F}}\sup_{w\in\mathcal{W}}\frac{\widehat{R}_{w}(f)-\widehat{R}_{w}(\widehat{f}_{w})}{c_{w}}+\epsilon
inffsupw𝒲Rw(f)Rw(f^w)cw+2ϵ\displaystyle\leq\inf_{f\in\mathcal{F}}\sup_{w\in\mathcal{W}}\frac{R_{w}(f)-R_{w}(\widehat{f}_{w})}{c_{w}}+2\epsilon
inffsupw𝒲Rw(f)Rw(fw)cw+2ϵ,\displaystyle\leq\inf_{f\in\mathcal{F}}\sup_{w\in\mathcal{W}}\frac{R_{w}(f)-R_{w}(f_{w})}{c_{w}}+2\epsilon,

where the second inequality follows from the definition of SMRO as the empirical optimizer of the objective (10). As a result, we have for any w𝒲w\in\mathcal{W}, with probability at least 1δ1-\delta:

\displaystyle R_{w}(\widehat{f}_{\text{SMRO}})-R_{w}(f_{w})\leq c_{w}\inf_{f\in\mathcal{F}}\sup_{w^{\prime}\in\mathcal{W}}\frac{R_{w^{\prime}}(f)-R_{w^{\prime}}(f_{w^{\prime}})}{c_{w^{\prime}}}+2c_{w}\epsilon.
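As with MRO, the scaled objective defining \widehat{f}_{\text{SMRO}} is easy to write down for finite \mathcal{F} and \mathcal{W}. The sketch below is illustrative only: it reuses the array layout from the earlier MRO snippet, and it estimates B_{w} by the empirical maximum of the weights, which is an assumption on our part rather than the paper's definition. It computes the per-weight scales c_{w}=\widehat{\sigma}_{w}+B_{w}/\sqrt{n} and the rescaled worst-case empirical regret.

```python
import numpy as np

def empirical_smro(losses, weights):
    """losses[k, i] = loss of candidate f_k on sample z_i; weights[j, i] = w_j(z_i).
    Minimizes sup_w [R_hat_w(f) - R_hat_w(f_hat_w)] / c_w with c_w = sigma_hat_w + B_w / sqrt(n)."""
    n = losses.shape[1]
    R_hat = losses @ weights.T / n                    # empirical weighted risks, one column per w
    baseline = R_hat.min(axis=0)                      # per-weight empirical minima
    sigma_hat = np.sqrt((weights ** 2).mean(axis=1))  # empirical second moment of each weight function
    B_w = weights.max(axis=1)                         # empirical proxy for the bound B_w
    c_w = sigma_hat + B_w / np.sqrt(n)
    smro_objective = ((R_hat - baseline) / c_w).max(axis=1)
    return int(smro_objective.argmin())
```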

C.2 Proof of Theorem 8

We will be terse as the proof is largely a combination of the proofs of Theorems 4 and 7. Proceeding as in the proof of Theorem 4, we see that with probability at least 1δ1-\delta, we have for all w𝒲w\in\mathcal{W} and ff\in\mathcal{F}:

Regretw(f)(1+γ)[R^w(f)R^w(f^w)]+(1+γ1)Bwcw𝒪(d,𝒲n)ϵ{\mathrm{Regret}}_{w}(f)\leq(1+\gamma)[\widehat{R}_{w}(f)-\widehat{R}_{w}(\widehat{f}_{w})]+(1+\gamma^{-1})\underbrace{B_{w}}_{c_{w}}\,\underbrace{\mathcal{O}\left(\frac{d_{\mathcal{F},\mathcal{W}}}{n}\right)}_{\epsilon}

and

[R^w(f)R^w(f^w)](1+γ)Regretw(f)+(1+γ1)Bwcw𝒪(d,𝒲n)ϵ[\widehat{R}_{w}(f)-\widehat{R}_{w}(\widehat{f}_{w})]\leq(1+\gamma){\mathrm{Regret}}_{w}(f)+(1+\gamma^{-1})\underbrace{B_{w}}_{c_{w}}\,\underbrace{\mathcal{O}\left(\frac{d_{\mathcal{F},\mathcal{W}}}{n}\right)}_{\epsilon}

for all γ>0\gamma>0.

Dividing through by cwc_{w} as before and taking a supremum over w𝒲w\in\mathcal{W}, we obtain with probability at least 1δ1-\delta:

\displaystyle\sup_{w\in\mathcal{W}}\frac{{\mathrm{Regret}}_{w}(\widehat{f}_{\text{SMRO}})}{c_{w}}\leq(1+\gamma)\sup_{w\in\mathcal{W}}\frac{\widehat{R}_{w}(\widehat{f}_{\text{SMRO}})-\widehat{R}_{w}(\widehat{f}_{w})}{c_{w}}+(1+\gamma^{-1})\epsilon
(1+γ)inffsupw𝒲R^w(f)R^w(f^w)cw+(1+γ1)ϵ\displaystyle\leq(1+\gamma)\inf_{f\in\mathcal{F}}\sup_{w\in\mathcal{W}}\frac{\widehat{R}_{w}(f)-\widehat{R}_{w}(\widehat{f}_{w})}{c_{w}}+(1+\gamma^{-1})\epsilon
(1+γ)2inffsupw𝒲Regretw(f)cw+(1+γ1)(2+γ)ϵ.\displaystyle\leq(1+\gamma)^{2}\inf_{f\in\mathcal{F}}\sup_{w\in\mathcal{W}}\frac{{\mathrm{Regret}}_{w}(f)}{c_{w}}+(1+\gamma^{-1})(2+\gamma)\epsilon.

Now following the remaining proof of Theorem 7 gives the desired bound.

Appendix D Well-specified linear regression with covariate shift

Let us consider a special case of Theorem 8, where ={βx:βd,β21}\mathcal{F}=\{\beta^{\top}x~{}:~{}\beta\in\mathbb{R}^{d},\|\beta\|_{2}\leq 1\} is the class of linear prediction functions with unit norm weights and the data satisfies x21\|x\|_{2}\leq 1. We further assume that y=xβ+νy=x^{\top}\beta^{\star}+\nu, where β21\|\beta^{\star}\|_{2}\leq 1 and ν\nu is zero-mean noise such that |y|C|y|\leq C for some constant CC (this just causes additional scaling with CC in the bounds of Theorems 4 and 8). Suppose further that the covariance ΣP0:=𝔼xP0xx\Sigma_{P_{0}}:={\mathbb{E}}_{x\sim P_{0}}xx^{\top} is full rank with the smallest eigenvalue equal to λ\lambda. Let β^P0\widehat{\beta}_{P_{0}} be the ordinary least squares estimator, given samples from P0P_{0}. Then it is well-known that with probability at least 1δ1-\delta,

β^P0βΣP02=𝒪(dln(1/δ)n).\displaystyle\|\widehat{\beta}_{P_{0}}-\beta^{\star}\|_{\Sigma_{P_{0}}}^{2}=\mathcal{O}\left(\frac{d\ln(1/\delta)}{n}\right). (18)

Consequently, we have for any other distribution PP over (x,y)(x,y):

RP(β^P0)RP(β)=𝔼P[(xβ^P0xβ)2]=𝒪(dln(1/δ)λn),\displaystyle R_{P}(\widehat{\beta}_{P_{0}})-R_{P}(\beta^{\star})={\mathbb{E}}_{P}[(x^{\top}\widehat{\beta}_{P_{0}}-x^{\top}\beta^{\star})^{2}]=\mathcal{O}\left(\frac{d\ln(1/\delta)}{\lambda n}\right),

where the inequality uses x21\|x\|_{2}\leq 1 and that the smallest eigenvalue of ΣP0\Sigma_{P_{0}} is at least λ\lambda. That is, doing OLS under P0P_{0} yields a strong guarantee for all target distributions PP, since we get pointwise accurate predictions in this scenario. However, directly applying the results of Theorem 4 would still have additional scaling with BwB_{w} for any target distribution with importance weights ww. On the other hand, let us consider the bound of Corollary 9 for any class 𝒲\mathcal{W} such that w01w_{0}\equiv 1 is in 𝒲\mathcal{W}, so that we always include P0P_{0} in our class of target distributions. Since Bw0=1B_{w_{0}}=1 and d=dln(1δ)d_{\mathcal{F}}=d\ln(\frac{1}{\delta}) in this case, the second bound of the corollary yields with probability at least 1δ1-\delta:

β^SMROβΣP02=RP0(β^SMRO)RP0(β)=𝒪((dln(1/δ)+ln|𝒲|δ)n).\displaystyle\|\widehat{\beta}_{\text{SMRO}}-\beta^{\star}\|_{\Sigma_{P_{0}}}^{2}=R_{P_{0}}(\widehat{\beta}_{\text{SMRO}})-R_{P_{0}}(\beta^{\star})=\mathcal{O}\left(\frac{(d\ln(1/\delta)+\ln\tfrac{|\mathcal{W}|}{\delta})}{n}\right).

This is comparable to the prediction error bound in (18) for ERM on P_{0}, incurring only an additional \ln|\mathcal{W}| term. Note that ERM is asymptotically efficient in this setting, so we cannot expect to do better, and we suffer only a small penalty for our worst-case robustness. The approach of learning on P_{0} alone is of course not robust to misspecification of the linear regression assumption. The guarantee of Theorem 4, in contrast, incurs a bound \sup_{w\in\mathcal{W}}B_{w}\,\mathcal{O}\left(\frac{d\ln(1/\delta)+\ln\tfrac{|\mathcal{W}|}{\delta}}{n}\right), which can be significantly worse.
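A quick simulation illustrates why the additive \ln|\mathcal{W}| price is the right comparison here: when the linear model is well-specified, an OLS fit on P_{0} already transfers to covariate-shifted test distributions. The sketch below is our own illustration with an arbitrary choice of training and shifted covariate distributions; it only demonstrates the qualitative point, not the constants in (18).

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 2000
beta_star = rng.normal(size=d)
beta_star /= np.linalg.norm(beta_star)

# training data from P_0 (covariates scaled so that ||x||_2 stays moderate)
X = rng.normal(size=(n, d)) / np.sqrt(d)
y = X @ beta_star + 0.1 * rng.normal(size=n)
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]        # OLS fit on P_0

# excess risk under a covariate-shifted test distribution P (same conditional of y given x)
X_shift = rng.normal(loc=0.3, size=(10000, d)) / np.sqrt(d)
excess = np.mean((X_shift @ (beta_hat - beta_star)) ** 2)
print(excess)   # small: pointwise-accurate predictions transfer to any covariate distribution
```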

Appendix E Proofs for Section 6

It is clearly seen that the updates of \rho_{t} correspond to the exponentiated gradient (EG) algorithm applied to the linear objective -{\mathbb{E}}_{w\sim\rho}[\widehat{R}_{w}(f_{t})-\widehat{R}_{w}(\widehat{f}_{w})]/B at round t, where the negative sign arises because the EG algorithm is stated for minimization problems, while we apply it to a maximization problem. The regret guarantee for EG, specifically Corollary 2.14 of Shalev-Shwartz [2012], states that for any w\in\mathcal{W}

t=1T𝔼wρtR^w(ft)R^w(f^w)Bt=1TR^w(ft)R^w(f^w)B+ln|𝒲|ηB+ηBT.\displaystyle-\sum_{t=1}^{T}{\mathbb{E}}_{w^{\prime}\sim\rho_{t}}\frac{\widehat{R}_{w^{\prime}}(f_{t})-\widehat{R}_{w^{\prime}}(\widehat{f}_{w^{\prime}})}{B}\leq-\sum_{t=1}^{T}\frac{\widehat{R}_{w}(f_{t})-\widehat{R}_{w}(\widehat{f}_{w})}{B}+\frac{\ln|\mathcal{W}|}{\eta B}+\eta BT.

Multiplying through by BB and substituting ηB=ln|𝒲|/T\eta B=\sqrt{\ln|\mathcal{W}|/T} gives

t=1T𝔼wρt[R^w(ft)R^w(f^w)]t=1T[R^w(ft)R^w(f^w)]+2BTln|𝒲|.\displaystyle-\sum_{t=1}^{T}{\mathbb{E}}_{w^{\prime}\sim\rho_{t}}[\widehat{R}_{w^{\prime}}(f_{t})-\widehat{R}_{w^{\prime}}(\widehat{f}_{w^{\prime}})]\leq-\sum_{t=1}^{T}[\widehat{R}_{w}(f_{t})-\widehat{R}_{w}(\widehat{f}_{w})]+2B\sqrt{T\ln|\mathcal{W}|}.

Now recalling the definition Pt=(f1++ft)/tP_{t}=(f_{1}+\ldots+f_{t})/t, following the proof technique of Freund and Schapire [1996] gives that

𝔼fPTsupwR^w(f)R^w(fw)B\displaystyle{\mathbb{E}}_{f\sim P_{T}}\sup_{w}\frac{\widehat{R}_{w}(f)-\widehat{R}_{w}(f_{w})}{B}
=1Tt=1TsupwR^w(ft)R^w(fw)B\displaystyle=\frac{1}{T}\sum_{t=1}^{T}\sup_{w}\frac{\widehat{R}_{w}(f_{t})-\widehat{R}_{w}(f_{w})}{B}
1Tt=1T𝔼wρtR^w(ft)R^w(f^w)B+2ln|𝒲|T\displaystyle\leq\frac{1}{T}\sum_{t=1}^{T}{\mathbb{E}}_{w\sim\rho_{t}}\frac{\widehat{R}_{w}(f_{t})-\widehat{R}_{w}(\widehat{f}_{w})}{B}+2\sqrt{\frac{\ln|\mathcal{W}|}{T}}
(a)1Tt=1Tinff𝔼wρtR^w(f)R^w(f^w)B+2ln|𝒲|T\displaystyle\stackrel{{\scriptstyle(a)}}{{\leq}}\frac{1}{T}\sum_{t=1}^{T}\inf_{f\in\mathcal{F}}{\mathbb{E}}_{w\sim\rho_{t}}\frac{\widehat{R}_{w}(f)-\widehat{R}_{w}(\widehat{f}_{w})}{B}+2\sqrt{\frac{\ln|\mathcal{W}|}{T}}
inff1Tt=1T𝔼wρtR^w(f)R^w(f^w)B+2ln|𝒲|T\displaystyle\leq\inf_{f\in\mathcal{F}}\frac{1}{T}\sum_{t=1}^{T}{\mathbb{E}}_{w\sim\rho_{t}}\frac{\widehat{R}_{w}(f)-\widehat{R}_{w}(\widehat{f}_{w})}{B}+2\sqrt{\frac{\ln|\mathcal{W}|}{T}}
inffsupw𝒲R^w(f)R^w(f^w)B+2ln|𝒲|T,\displaystyle\leq\inf_{f\in\mathcal{F}}\sup_{w\in\mathcal{W}}\frac{\widehat{R}_{w}(f)-\widehat{R}_{w}(\widehat{f}_{w})}{B}+2\sqrt{\frac{\ln|\mathcal{W}|}{T}},

where (a)(a) follows from the best response property of ftf_{t} for ρt\rho_{t} (12).
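To make the procedure analyzed above concrete, here is a minimal sketch of the exponentiated-gradient outer loop together with the best-response step (12). The finite candidate set standing in for the best-response oracle and the specific array layout are our own simplifications; only the multiplicative update, the step size \eta B=\sqrt{\ln|\mathcal{W}|/T}, and the averaging over iterates mirror the analysis.

```python
import numpy as np

def eg_mro(R_hat, baseline, B, T):
    """R_hat[k, j]: empirical weighted risk of candidate f_k under w_j;
    baseline[j]: empirical risk of the per-weight ERM under w_j.
    Runs T rounds of exponentiated gradient on rho with best responses f_t and
    returns the distribution P_T over candidates (uniform over the selected iterates)."""
    num_f, num_w = R_hat.shape
    payoffs = (R_hat - baseline) / B              # scaled empirical regrets, roughly in [0, 1]
    rho = np.full(num_w, 1.0 / num_w)             # rho_1: uniform over the weight functions
    eta_B = np.sqrt(np.log(num_w) / T)            # step size eta * B from the analysis
    counts = np.zeros(num_f)
    for _ in range(T):
        f_t = int((payoffs @ rho).argmin())       # best response: minimize the rho_t-average regret
        counts[f_t] += 1
        rho = rho * np.exp(eta_B * payoffs[f_t])  # EG step for the maximizing rho player
        rho /= rho.sum()
    return counts / T
```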