
Minimax Regret Optimization for Robust Machine Learning under Distribution Shift

Alekh Agarwal
[email protected]
Google Research
   Tong Zhang
[email protected]
Google Research & HKUST
Abstract

In this paper, we consider learning scenarios where the learned model is evaluated under an unknown test distribution which potentially differs from the training distribution (i.e. distribution shift). The learner has access to a family of weight functions such that the test distribution is a reweighting of the training distribution under one of these functions, a setting typically studied under the name of Distributionally Robust Optimization (DRO). We consider the problem of deriving regret bounds in the classical learning theory setting, and require that the resulting regret bounds hold uniformly for all potential test distributions. We show that the DRO formulation does not guarantee uniformly small regret under distribution shift. We instead propose an alternative method called Minimax Regret Optimization (MRO), and show that under suitable conditions this method achieves uniformly low regret across all test distributions. We also adapt our technique to have stronger guarantees when the test distributions are heterogeneous in their similarity to the training data. Given the widespread optimization of worst case risks in current approaches to robust machine learning, we believe that MRO can be a strong alternative to address distribution shift scenarios.

1 Introduction

Learning good models for scenarios where the evaluation happens under a test distribution that differs from the training distribution has become an increasingly important problem in machine learning and statistics. It is particularly motivated by the widespread practical deployment of machine learning models, and the observation that model performance deteriorates when the test distribution differs from the training distribution. Some motivating scenarios range from class imbalance between training and test data (Galar et al., 2011; Japkowicz, 2000) to algorithmic fairness (Dwork et al., 2012; Barocas and Selbst, 2016). Consequently, several recent formalisms have been proposed to develop algorithms for such settings, such as adversarial training (Goodfellow et al., 2014; Madry et al., 2017; Wang et al., 2019), Invariant Risk Minimization (Arjovsky et al., 2019; Ahuja et al., 2020a; Rosenfeld et al., 2020) and Distributionally Robust Optimization (Duchi and Namkoong, 2021; Namkoong and Duchi, 2016; Staib and Jegelka, 2019; Kuhn et al., 2019). While the specifics differ across these formulations, at a high-level they all minimize the worst-case risk of the learned model across some family of test distributions, and spend considerable effort on finding natural classes of test distributions and designing efficient algorithms to optimize the worst-case objectives.

In this paper, we consider this problem from a learning theory perspective, by asking if it is possible to obtain regret bounds that hold uniformly across all test distributions. Recall that in supervised learning, we are given a function class $\mathcal{F}$ and a loss function $\ell(z,f(z))$, where $z$ denotes a sample. For a test distribution $P$, we are interested in learning a prediction function $f\in\mathcal{F}$ using the training data, so that the risk of $f$ under $P$, $R_{P}(f)={\mathbb{E}}_{z\sim P}\ell(z,f(z))$, is small. In classical learning theory, an estimator $\hat{f}$ is learned from the training data, and we are interested in bounding its regret (or excess risk) as follows:

{\mathrm{Regret}}_{P}(\hat{f})=R_{P}(\hat{f})-\inf_{f\in\mathcal{F}}R_{P}(f)=R_{P}(\hat{f})-R_{P}(f_{P}),

where $f_{P}$ minimizes $R_{P}(f)$ over $\mathcal{F}$. If the training data $(z_{1},\ldots,z_{n})$ are drawn from the same distribution as the test distribution $P$, then standard results imply that the empirical risk minimization (ERM) method, which finds an $\hat{f}$ with a small training error, achieves small regret. A natural question we ask in this paper is whether there exists an estimator that achieves small regret under distribution shift. More precisely, suppose we are given a family of distributions $\mathcal{P}$ with $P_{0}\in\mathcal{P}$. Given access to $n$ samples $(z_{1},\ldots,z_{n})$ from $P_{0}$, we would like to find an estimator $\hat{f}$ from the training data so that the uniform regret

\sup_{P\in\mathcal{P}}{\mathrm{Regret}}_{P}(\hat{f})=\sup_{P\in\mathcal{P}}\left[R_{P}(\hat{f})-\inf_{f\in\mathcal{F}}R_{P}(f)\right] (1)

is small. Here the model class $\mathcal{F}$ can be statistically misspecified in that it may not contain the optimal model $f^{\star}$ that attains the Bayes risk in a pointwise manner.

We note that this objective is different from the objective of Distributionally Robust Optimization (henceforth DRO), which directly minimizes the worst-case risk across all test distributions:

f_{\text{DRO}}=\operatorname*{\text{arginf}}_{f\in\mathcal{F}}\sup_{P\in\mathcal{P}}R_{P}(f). (2)

We observe that the objective (1) is less sensitive than DRO to heterogeneity in the amount of noise across test distributions. Moreover, when the model is well-specified, our criterion directly measures the closeness of the estimator $\hat{f}$ to the optimal model $f^{\star}$, which is often desirable in applications. We call the criterion of minimizing regret uniformly across test distributions Minimax Regret Optimization (MRO); its population formulation seeks to minimize the worst-case regret (1):

f_{\text{MRO}}=\operatorname*{\text{arginf}}_{f\in\mathcal{F}}\sup_{P\in\mathcal{P}}{\mathrm{Regret}}_{P}(f). (3)

Compared to DRO, MRO evaluates the regret of a candidate model $f$ on each distribution $P\in\mathcal{P}$, as opposed to the raw risk. As we show in the following sections, regret is comparable across distributions in a more robust manner than the risk, since subtracting the minimum risk for each distribution removes the variance due to noise. We note that a similar approach of removing the residual variance to deal with heterogeneous noise arises in the offline Reinforcement Learning setting (Antos et al., 2008), which we comment more on in the discussion of related work. Similar to ERM, which replaces the regret minimization problem over a single test distribution by its empirical minimization counterpart, we consider an empirical minimization counterpart of MRO, and analyze its generalization performance.

Our contributions.

With this context, our paper makes the following contributions.

  1.

    The objective (3), which gives an alternative to DRO, alleviates its shortcomings related to noise overfitting, and provides a robust learning formulation for problems with distribution shift.

  2.

    We show that the empirical counterpart to (3) can be analyzed using classical tools from learning theory. When the loss function is bounded and Lipschitz continuous, we show that the regret of our learned model $\widehat{f}_{\text{MRO}}$ on each distribution $P$ can be bounded by that of $f_{\text{MRO}}$ along with a deviation term that decreases as $\mathcal{O}(1/\sqrt{n})$. Concretely, we show that if $dP(z)/dP_{0}(z)\leq B$ for all $P\in\mathcal{P}$, then with high probability,

    \sup_{P\in\mathcal{P}}{\mathrm{Regret}}_{P}(\widehat{f}_{\text{MRO}})\leq\inf_{f\in\mathcal{F}}\sup_{P\in\mathcal{P}}{\mathrm{Regret}}_{P}(f)+\mathcal{O}\left(\frac{B}{\sqrt{n}}\right),

    where we omit the dependence on the complexities of $\mathcal{F}$ and $\mathcal{P}$ as well as the failure probability, and use a simplified assumption on $dP/dP_{0}$ compared with our formal result of Theorem 2 for ease of presentation. That is, $\widehat{f}_{\text{MRO}}$ has a bounded regret on each domain, so long as there is at least one function with a small regret on each domain. For the squared loss, if the class $\mathcal{F}$ is convex, we show that our rates improve to $\mathcal{O}(B/n)$ (Theorem 4). The result follows naturally as empirical regret concentrates to population regret at a fast rate in this setting. We also show that the worst-case risk of $\widehat{f}_{\text{DRO}}$ cannot approach the worst-case risk of $f_{\text{DRO}}$ at a rate faster than $1/\sqrt{n}$ (Proposition 6), which demonstrates a theoretical benefit of our selection criterion.

  3.

    We present SMRO, an adaptation of our basic technique which further rescales the regret of each distribution differently. We show that SMRO retains the same worst-case guarantees as MRO, but markedly improves upon it when the distributions $P\in\mathcal{P}$ satisfy an alignment condition, which includes all well-specified settings (Theorems 7 and 8). The rescaling parameters are fully estimated from data.

  4.

    Algorithmically, we show that our method can be implemented using access to a (weighted) ERM oracle for the function class $\mathcal{F}$, using the standard machinery of solving minimax problems as a two player game involving a no-regret learner and a best response player. The computational complexity scales linearly in the size of $\mathcal{P}$ and as $n^{2}$ in the number of samples.

We conclude this section with a discussion of related work.

1.1 Related Work

Distributional mismatch between training and test data has been studied in many settings (see e.g. (Quinonero-Candela et al., 2009)) and under several notions of mismatch. A common setting is that of covariate shift, where only the marginal distribution of the covariates can differ between training and test, while the conditional distribution of labels given covariates is shared (Shimodaira, 2000; Huang et al., 2006; Bickel et al., 2007). Others study changes in the proportions of the label or other discrete attributes between training and test (Dwork et al., 2012; Xu et al., 2020). More general shifts are considered in domain adaptation (Mansour et al., 2009; Ben-David et al., 2010; Patel et al., 2015) and transfer learning (Pan and Yang, 2009; Tan et al., 2018).

A recent line of work on Invariant Risk Minimization (IRM) (Arjovsky et al., 2019) considers particularly broad generalization to unseen domains from which no examples are available at training time, motivated by preventing the learning of spurious features in ERM-based methods. The idea is to find an estimator which is invariant across all potential test distributions. Although most empirical works on IRM use gradient-penalty based formulations for algorithmic considerations, one interpretation of IRM leads to a minimax formulation similar to the MRO objective (3):

f_{\text{IRM}}=\operatorname*{\text{arginf}}_{f\in\mathcal{F}}R_{P_{0}}(f)\quad\mbox{s.t.}\quad\sup_{P\in\mathcal{P}}\left[R_{P}(f)-R_{P}(f_{P})\right]\leq\epsilon, (4)

where the family $\mathcal{P}$ consists of all individual domains in the training data and $P_{0}$ is the entire training data pooled across domains. However, it is not obvious how to meaningfully analyze the distributional robustness of the IRM method (4) in the classical learning theory setting, and some subsequent works (Ahuja et al., 2020b; Kamath et al., 2021) formalize IRM as minimizing a DRO-style worst-case risk objective, instead of regret as in the original paper. In contrast, our MRO formulation is motivated by extending regret-driven reasoning to settings with distribution shift.

Distributionally robust optimization, which forms a starting point for this work, has a long history in the optimization literature (see e.g. (Ben-Tal et al., 2009; Shapiro, 2017)) and has gained recent prominence in the machine learning literature (Namkoong and Duchi, 2016; Duchi and Namkoong, 2021; Duchi et al., 2021; Staib and Jegelka, 2019; Kuhn et al., 2019; Zhu et al., 2020). DRO has been applied with mixed success in language modeling (Oren et al., 2019), correcting class imbalance (Xu et al., 2020) and group fairness (Hashimoto et al., 2018). For most of the theoretical works, the emphasis is on efficient optimization of the worst-case risk objective. While Duchi and Namkoong (2021) provide sharp upper and lower bounds on the risk of the DRO estimator, they do not examine regret, which is the central focus of our study.

We note that the raw risk used in DRO is sensitive to heterogeneous noise, in that larger noise leads to larger risk. Such sensitivity can be undesirable for some applications. Challenges in learning across scenarios with heterogeneous noise levels have been previously studied in supervised learning (Crammer et al., 2005), and also arise routinely in reinforcement learning (RL). In the setting of offline RL, a class of techniques is designed to minimize the Bellman error criterion (Bertsekas and Tsitsiklis, 1996; Munos and Szepesvári, 2008), which has known challenges in direct unbiased estimation (see e.g. Example 11.4 in Sutton and Barto (1998)). To counter this, Antos et al. (2008) suggest removing the residual variance, which is akin to considering regret instead of risk in our objective and yields minimax objectives similar to those in this work.

On the algorithmic side, we build on the use of regret minimization strategies for solving two player zero-sum games, pioneered by Freund and Schapire (1996) to allow for general distribution families 𝒫\mathcal{P} as opposed to relying on closed-form maximizations for specific families, as performed in many prior works (Namkoong and Duchi, 2016; Staib and Jegelka, 2019; Kuhn et al., 2019). Similar approaches have been studied in the presence of adversarial corruption in Feige et al. (2015) and in the context of algorithmic fairness for regression in Agarwal et al. (2019).

2 Setting

We consider a class of learning problems where the learner is given a dataset $z_{1},\ldots,z_{n}$ sampled from a distribution $P_{0}$, with $z_{i}\in\mathcal{Z}$. We also have a class $\mathcal{P}$ of distributions, and we want to do well relative to this entire class, even though we only have access to samples from $P_{0}$. In Minimax Regret Optimization, we formalize the property of doing uniformly well for all $P\in\mathcal{P}$ through the objective (3). As mentioned in the introduction, this is closely related to the DRO objective (2). The following result shows that small risk does not lead to small regret, which means DRO does not solve the MRO objective (3). The construction also shows that under heterogeneous noise, DRO tends to focus on a distribution $P\in\mathcal{P}$ with a large noise level, which can be undesirable for many applications.

Proposition 1 (DRO is sensitive to noise heterogeneity)

There exists a family of two distributions $\mathcal{P}=\{P_{1},P_{2}\}$ with heterogeneous noise levels, and a function class $\mathcal{F}=\{f_{1},f_{2}\}$, so that $\sup_{P\in\mathcal{P}}{\mathrm{Regret}}_{P}(f_{\text{DRO}})=0.21$, while $\sup_{P\in\mathcal{P}}{\mathrm{Regret}}_{P}(f_{\text{MRO}})=0.03$.

Proof  Suppose $\mathcal{P}$ consists of two distributions over $z\in[0,1]$. The first distribution $P_{1}$ is degenerate with $P_{1}(0.1)=1$. The second distribution is $P_{2}=\text{Ber}(0.5)$. The goal is to estimate the mean of the distributions using the squared loss. In this example, $P_{1}$ has zero noise and $P_{2}$ has relatively high noise. We consider a function class $\mathcal{F}$ consisting of two estimates of the mean, $f_{1}=0.3$ and $f_{2}=0.6$, with $\ell(z,f)=(f-z)^{2}$. Let us abbreviate $R_{P_{i}}(f)=R_{i}(f)$ for $i\in\{1,2\}$. Note that the true mean of $P_{1}$ is $0.1$, and that of $P_{2}$ is $0.5$. Thus $f_{1}$ is a better uniform estimate of the means.

It is easy to verify that $R_{1}(f_{1})=0.04$, $R_{1}(f_{2})=0.25$, $R_{2}(f_{1})=0.29$, and $R_{2}(f_{2})=0.26$. Then the DRO objective (2) evaluates to the optimal value of $0.26$, attained by $f_{2}$. On the other hand, the MRO objective (3) is equal to $0.03$ and this value is attained by $f_{1}$, while $f_{2}$ incurs a much higher objective of $\sup_{P}R_{P}(f_{2})-R_{P}(f_{P})=0.21$. As a result, we have $f_{\text{DRO}}=f_{2}$ and $f_{\text{MRO}}=f_{1}$. Comparing regrets lets MRO remove the underlying noise level of $0.25$ for $P_{2}$.
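The calculation above is easy to check numerically. The following sketch (our own illustration, not code from the paper) recomputes the risks under $P_{1}$ and $P_{2}$ and recovers the DRO and MRO choices:

```python
# Numerical check of Proposition 1: risks under P_1 (point mass at 0.1) and
# P_2 = Ber(0.5), squared loss, for the two candidate mean estimates.
import numpy as np

means = {"f1": 0.3, "f2": 0.6}
def risk_P1(f):  # E_{P_1}(f - z)^2 with z = 0.1 almost surely
    return (f - 0.1) ** 2
def risk_P2(f):  # E_{P_2}(f - z)^2 with z ~ Ber(0.5)
    return 0.5 * f ** 2 + 0.5 * (f - 1.0) ** 2

risks = np.array([[risk_P1(v), risk_P2(v)] for v in means.values()])  # rows: f, cols: P
best_per_P = risks.min(axis=0)                   # inf_f R_P(f) for each P
worst_risk = risks.max(axis=1)                   # DRO objective, per f
worst_regret = (risks - best_per_P).max(axis=1)  # MRO objective, per f

print("DRO picks", list(means)[worst_risk.argmin()], "worst-case risk", worst_risk.min())
print("MRO picks", list(means)[worst_regret.argmin()], "worst-case regret", worst_regret.min())
# Prints: DRO picks f2 (approx. 0.26), MRO picks f1 (approx. 0.03); f2's worst regret is 0.21.
```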


Note that while the worst-case regret of MRO is never worse than that of DRO, this does not imply that the function $f_{\text{MRO}}$ is always preferable to $f_{\text{DRO}}$ under each distribution $P\in\mathcal{P}$. Next we give one such example where $f_{\text{DRO}}$ achieves an arguably better trade-off in balanced performance across distributions. This situation can happen in real applications when the functions achieving minimum risk vary greatly across distributions, potentially due to overfitting. The example shows that both MRO and DRO may have advantages or disadvantages under different conditions.

Example 1 (Heterogeneous regrets across domains)

Suppose the target class $\mathcal{P}$ consists of two distributions over $z\in[0,1]$ and our function class $\mathcal{F}$ consists of three functions $\{f_{1},f_{2},f_{3}\}$. Let us assume that the risks of these functions under the two distributions are given by:

R_{1}(f_{1})=0,~{}~{}R_{1}(f_{2})=0.5,~{}~{}R_{1}(f_{3})=0.5+\epsilon,~{}~{}R_{2}(f_{1})=1,~{}~{}R_{2}(f_{2})=0.9,~{}~{}\mbox{and}~{}~{}R_{2}(f_{3})=0.4.

Then $f_{1}$ is easily disregarded under both MRO and DRO, since it has poor performance under $P_{2}$. For both $f_{2}$ and $f_{3}$, their regrets on $P_{1}$ are larger than on $P_{2}$, so MRO selects the better function for $P_{1}$, that is, $f_{\text{MRO}}=f_{2}$. DRO, on the other hand, prefers $f_{3}$ as it has a smaller worst-case risk, which is arguably the right choice in this scenario.

More generally, one can always find a distribution $P\in\mathcal{P}$ under which the regret of $f_{\text{DRO}}$ is smaller than that of $f_{\text{MRO}}$ and vice versa, but the ordering in the worst case is clear by definition. In the context of these observations, we next discuss how we might estimate $f_{\text{MRO}}$ from samples, for which it is useful to rewrite the objective (3) in an equivalent weight-based formulation we discuss next.

A weight-based reformulation.

Having shown the potential benefits of our population objective, we now consider how to optimize it using samples from $P_{0}$. Notice that the objective (3) does not depend on $P_{0}$ explicitly, which makes it unclear how we should approach the problem given our dataset. We address this issue by adopting an equivalent reweighting-based formulation, as is typically done in DRO. Concretely, let us assume that $P$ is absolutely continuous with respect to $P_{0}$ for all $P\in\mathcal{P}$, so that there exists a weighting function $w:\mathcal{Z}\to\mathbb{R}_{+}$ such that $dP(z)=w(z)dP_{0}(z)$, where ${\mathbb{E}}_{P_{0}}[w]=1$. We can equivalently rewrite the objective (3) as

f_{\text{MRO}}:=\operatorname*{\text{arginf}}_{f\in\mathcal{F}}\sup_{w\in\mathcal{W}}\left\{{\mathrm{Regret}}_{w}(f):=R_{w}(f)-\inf_{f^{\prime}\in\mathcal{F}}R_{w}(f^{\prime})=R_{w}(f)-R_{w}(f_{w})\right\}, (5)

where $\mathcal{W}=\{w~{}:~{}w(z)=dP(z)/dP_{0}(z)~{}~\text{for}~{}~P\in\mathcal{P}\}$ and $R_{w}(f)={\mathbb{E}}_{z\sim P_{0}}[w(z)\ell(z,f(z))]$. It is also straightforward to define an empirical counterpart for this objective. Given $n$ samples $z_{1},\ldots,z_{n}\sim P_{0}$, we define the (weighted) empirical risk and its corresponding minimizer as $\widehat{R}_{w}(f)=\frac{1}{n}\sum_{i=1}^{n}w(z_{i})\ell(z_{i},f(z_{i}))$ and $\widehat{f}_{w}=\operatorname*{\text{arginf}}_{f\in\mathcal{F}}\widehat{R}_{w}(f)$. With these notations, a natural empirical counterpart of the population objective (5) can be written as

\widehat{f}_{\text{MRO}}:=\operatorname*{\text{arginf}}_{f\in\mathcal{F}}\sup_{w\in\mathcal{W}}\big{[}\widehat{R}_{w}(f)-\inf_{f^{\prime}\in\mathcal{F}}\widehat{R}_{w}(f^{\prime})\big{]}=\operatorname*{\text{arginf}}_{f\in\mathcal{F}}\sup_{w\in\mathcal{W}}\big{[}\widehat{R}_{w}(f)-\widehat{R}_{w}(\widehat{f}_{w})\big{]}. (6)
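When both $\mathcal{F}$ and $\mathcal{W}$ are small finite sets, the objective (6) can be evaluated by direct enumeration. The following sketch is our own illustration of this computation (it is not the oracle-based algorithm of Section 6, which avoids enumerating $\mathcal{F}$):

```python
# Evaluate the empirical MRO objective (6) by enumeration over finite F and W.
import numpy as np

def empirical_mro(losses, weights):
    """losses:  (|F|, n) array with losses[j, i] = loss(z_i, f_j(z_i)).
       weights: (|W|, n) array with weights[k, i] = w_k(z_i), each row averaging ~1.
       Returns the index of f_MRO-hat and the per-(w, f) empirical regrets."""
    n = losses.shape[1]
    risk_hat = weights @ losses.T / n                              # R_hat_w(f) for every (w, f)
    regret_hat = risk_hat - risk_hat.min(axis=1, keepdims=True)    # subtract R_hat_w(f_hat_w)
    worst_regret = regret_hat.max(axis=0)                          # sup over w, for each f
    return int(worst_regret.argmin()), regret_hat

# Toy usage: 3 candidate functions, 2 weight functions, n = 100 samples.
rng = np.random.default_rng(0)
losses = rng.uniform(size=(3, 100))
weights = rng.uniform(0.5, 1.5, size=(2, 100))
weights /= weights.mean(axis=1, keepdims=True)   # enforce empirical mean 1 per weight function
j_hat, _ = empirical_mro(losses, weights)
print("f_MRO-hat index:", j_hat)
```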

We now give a couple of concrete examples to instantiate this general formulation.

Example 2 (Covariate shift in supervised learning)

Suppose the samples are $z=(x,y)$, where $x\in\mathbb{R}^{d}$ are features and $y\in\mathbb{R}$ is a prediction target. Suppose that the class $\mathcal{W}$ contains functions $\mathcal{W}=\{w(z)=w_{\theta}(x)~{}:~{}\theta\in\Theta\}$, where $\Theta$ is some parameter class. That is, we allow only the marginal distribution of $x$ to change while the conditional $P_{0}(y|x)$ is identical across test distributions (Shimodaira, 2000; Huang et al., 2006). In our formulation, it is easy to incorporate the covariate shift situation with a general class $\mathcal{W}$ under the restriction $w_{\theta}(x)\leq B$. It is possible to add additional regularity on the class $\Theta$, such as taking the unit ball in an RKHS (Huang et al., 2006; Staib and Jegelka, 2019). We note that most prior studies of DRO impose conditions on the joint perturbations of both $(x,y)$, which leads to closed-form solutions. However, directly instantiating $\mathcal{W}$ in DRO with a bounded $f$-divergence or MMD perturbations of only the marginal $P_{0}(x)$ (instead of the joint distribution of $(x,y)$) does not yield a nice closed-form solution. This results in additional computational and statistical challenges, as discussed in Duchi et al. (2019).
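One simple way to instantiate such a class $\mathcal{W}$ (an illustrative construction of our own, not one prescribed above) is to let each test distribution reweight discrete groups of covariates, so that the bound $B$ is determined by the smallest training group proportion:

```python
# Covariate-shift weights that reweight discrete covariate groups to target
# mixtures alpha, leaving P_0(y|x) fixed; each row has empirical mean exactly 1.
import numpy as np

def group_shift_weights(groups, alphas):
    """groups: (n,) integer group label of each x_i.
       alphas: list of target group-proportion vectors, one per test distribution.
       Returns a (|W|, n) array with w_k(x_i) = alpha_k[g(x_i)] / p_hat[g(x_i)]."""
    n = groups.shape[0]
    p_hat = np.bincount(groups, minlength=groups.max() + 1) / n    # training proportions
    return np.stack([np.asarray(a)[groups] / p_hat[groups] for a in alphas])

groups = np.random.default_rng(1).integers(0, 2, size=200)
W = group_shift_weights(groups, alphas=[[0.5, 0.5], [0.8, 0.2]])
print(W.mean(axis=1))   # each close to 1; B is max_g alpha_k[g] / p_hat[g]
```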

We now demonstrate the versatility of our setting with an example from reinforcement learning.

Example 3 (Offline contextual bandit learning)

In offline contextual bandit (CB) learning, the data consist of $z=(x,a,r)$, with $x\in\mathbb{R}^{d}$ denoting a context, $a\in[K]$ an action out of $K$ possible choices, and $r\in[0,1]$ a reward. There is a fixed and unknown joint distribution $P_{0}$ over $(x,r(1),\ldots,r(K))$. The object of interest is a decision policy $\pi$ which maps a context $x$ to a distribution over actions, and we seek a policy which maximizes the expected reward under its action choices: $\pi^{\star}=\operatorname*{\text{argmax}}_{\pi\in\Pi}{\mathbb{E}}[r(a)~{}|~{}a\sim\pi(\cdot|x),x]$, where $\Pi$ is some given policy class. In the offline setting, there is some fixed policy $\mu$ which is used to choose actions during the data collection process. The training data can then be described as $z=(x,a,r)$, where $x\sim P_{0}$, $a\sim\mu(\cdot|x)$ and $r\sim P_{0}(\cdot|x,a)$, and this differs from the action distributions that other policies in $\Pi$ induce. Existing literature on this problem typically creates an estimator $\eta$ for ${\mathbb{E}}[r|x,a]$ and outputs $\pi=\operatorname*{\text{argmax}}_{\pi\in\Pi}\sum_{i=1}^{n}\sum_{a}\pi(a|x_{i})\eta(x_{i},a)$, and the quality of the resulting policy critically relies on the quality of the estimator $\eta$. A particularly popular choice is the class of doubly robust estimators (Cassel et al., 1976; Dudík et al., 2014), which solve a (weighted) regression problem to estimate $\eta\approx\operatorname*{\text{argmin}}_{f}\sum_{i}\tfrac{\pi(a_{i}|x_{i})}{\mu(a_{i}|x_{i})}(f(x_{i},a_{i})-r_{i})^{2}$. Since we require the reward model to predict well on the actions selected according to different policies $\pi\in\Pi$, the ideal regression objective for any policy $\pi$ reweights the data as per that policy's actions, and doing this weighting results in a significant performance gain (Su et al., 2020; Farajtabar et al., 2018). Typically this weighting is simplified or skipped during policy learning, however, as one needs to simultaneously reweight for all policies $\pi\in\Pi$. MRO provides an approach to do exactly this, by choosing $\mathcal{W}=\{w(x,a,r)=\pi(a\mid x)/\mu(a\mid x)~{}:~{}\pi\in\Pi\}$ in the optimization for estimating $\eta$.
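In this example the weight class is indexed by policies. A minimal sketch of its construction (our notation; the `prob` method is an assumed interface, not part of the paper) is:

```python
# Build the weight class W = { w(x, a, r) = pi(a|x) / mu(a|x) : pi in Pi } on the
# logged data, assuming each policy exposes action probabilities via prob(x, a).
def cb_weights(data, policies, mu):
    """data: list of logged tuples (x, a, r); policies: a finite policy class Pi.
       Returns weights[k][i] = pi_k(a_i | x_i) / mu(a_i | x_i)."""
    return [[pi.prob(x, a) / mu.prob(x, a) for (x, a, r) in data] for pi in policies]
```

These weights satisfy the bound of Assumption 2 whenever the logging policy places probability at least $\pi(a\mid x)/B$ on every action that some $\pi\in\Pi$ can choose.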

Boundedness and complexity assumptions

We now make standard technical assumptions on $\mathcal{F}$, the loss function $\ell$ and the weight class $\mathcal{W}$. To focus attention on the essence of the results, we omit relatively complex tools such as chaining or local Rademacher complexities, and use an $\ell_{\infty}$-covering for uniform convergence over $\mathcal{F}$, where $N(\epsilon,\mathcal{F})$ denotes the $\ell_{\infty}$-covering number of $\mathcal{F}$ at accuracy $\epsilon$. For parametric classes, such an analysis is optimal up to log factors. We also make boundedness and Lipschitz continuity assumptions on the loss function.

Assumption 1 (Bounded and Lipschitz losses)

The loss function $\ell(z,v)$ is bounded in $[0,1]$ for all $z\in\mathcal{Z}$ and $v\in\{f(z)~{}:~{}f\in\mathcal{F},z\in\mathcal{Z}\}$, and is $L$-Lipschitz continuous with respect to $v$.

We also assume that the weight class satisfies a boundedness condition.

Assumption 2 (Bounded importance weights)

All weights $w\in\mathcal{W}$ satisfy $w(z)\leq B_{w}\leq B$ for all $z\in\mathcal{Z}$ and some constant $B\geq 1$.

To avoid the complexity of introducing an additional covering number, throughout this paper we assume that the weight class $\mathcal{W}$ is of finite cardinality, which is reasonable for many domain adaptation applications. We adopt the notations $d_{\mathcal{F}}(\delta)=1+\log\frac{N(1/(nLB),\mathcal{F})}{\delta}$ and $d_{\mathcal{F},\mathcal{W}}(\delta)=d_{\mathcal{F}}(\delta)+\ln(|\mathcal{W}|/\delta)$ to jointly capture the complexities of $\mathcal{F}$ and $\mathcal{W}$, given a failure probability $\delta$.

3 Regret Bounds for Bounded, Lipschitz Losses

We begin with the most general case, with fairly standard assumptions on the loss function and the function and weight classes, and show the following result on the regret of $\widehat{f}_{\text{MRO}}$ for any $w\in\mathcal{W}$.

Theorem 2

Under Assumptions 1 and 2, suppose further that ${\mathbb{E}}_{P_{0}}[w^{2}]\leq\sigma_{w}^{2}$ for any $w\in\mathcal{W}$. Then with probability at least $1-\delta$, we have for all $w\in\mathcal{W}$:

{\mathrm{Regret}}_{w}(\widehat{f}_{\text{MRO}})\leq\inf_{f\in\mathcal{F}}\sup_{w^{\prime}\in\mathcal{W}}{\mathrm{Regret}}_{w^{\prime}}(f)+\sup_{w^{\prime}\in\mathcal{W}}\underbrace{\mathcal{O}\bigg{(}\sqrt{\frac{\sigma_{w^{\prime}}^{2}d_{\mathcal{F},\mathcal{W}}(\delta)}{n}}+\frac{B_{w^{\prime}}d_{\mathcal{F},\mathcal{W}}(\delta)}{n}\bigg{)}}_{\epsilon_{w^{\prime}}}.

The bound of Theorem 2 highlights the main difference between our objective and DRO-style approaches. The bound states that our solution $\widehat{f}_{\text{MRO}}$ has a small regret for each $w\in\mathcal{W}$, as long as at least one such function exists in the function class, up to the usual finite-sample deviation terms. Thus, $\widehat{f}_{\text{MRO}}$ attains the uniform regret guarantee we asked for. To better interpret our result, we state an easy corollary, before making some additional remarks.

Corollary 3

Under the conditions of Theorem 2, suppose further that there exists $f^{\star}\in\mathcal{F}$ such that $R_{w}(f^{\star})\leq R_{w}(f_{w})+\tilde{\epsilon}_{w}$ for all $w\in\mathcal{W}$. Then with probability at least $1-\delta$, we have

\forall w\in\mathcal{W}~{}:~{}R_{w}(\widehat{f}_{\text{MRO}})\leq R_{w}(f^{\star})+\underbrace{\sup_{w^{\prime}\in\mathcal{W}}\tilde{\epsilon}_{w^{\prime}}+\sup_{w^{\prime}\in\mathcal{W}}\epsilon_{w^{\prime}}}_{\epsilon_{\mathcal{W}}}.

Comparison with DRO approaches.

While Proposition 1 already illustrates the improved robustness of MRO to differing noise levels across target distributions, it is instructive to further compare the guarantees for the two approaches. Concretely, in our setting, it can be shown that with probability at least $1-\delta$, we have for all $w\in\mathcal{W}$:

R_{w}(\widehat{f}_{\text{DRO}})\leq\inf_{f\in\mathcal{F}}\sup_{w^{\prime}\in\mathcal{W}}R_{w^{\prime}}(f)+\sup_{w^{\prime}\in\mathcal{W}}\epsilon_{w^{\prime}}. (7)

Compared with Theorem 2, this bound can be inferior when the different distributions identified by $w\in\mathcal{W}$ have vastly different noise levels. For instance, in the setting of Corollary 3, the bound (7) yields $R_{w}(\widehat{f}_{\text{DRO}})\leq\sup_{w\in\mathcal{W}}R_{w}(f^{\star})+\epsilon_{\mathcal{W}}$, which has an additional dependence on the worst-case risk of $f^{\star}$ across all the $w\in\mathcal{W}$. Recall that the assumption of Corollary 3 is satisfied with a small $\epsilon_{\mathcal{W}}$ in the example constructed in Proposition 1, in which DRO tries to fit the large noise. Similarly, if the data distribution is Gaussian with an identical mean but different variances across the $w$, then the two bounds reduce to

R_{w}(\widehat{f}_{\text{MRO}})\leq\text{Var}_{w}+\epsilon_{\mathcal{W}}\quad\mbox{and}\quad R_{w}(\widehat{f}_{\text{DRO}})\leq\sup_{w^{\prime}\in\mathcal{W}}\text{Var}_{w^{\prime}}+\epsilon_{\mathcal{W}},

where $\text{Var}_{w}$ is the variance of the Gaussian corresponding to the importance weights $w$. Overall, these examples serve to illustrate that the DRO objective is reasonable when the different distributions have similar noise levels, but is not robust to heterogeneity in noise levels.

Dependence on the complexity of $\mathcal{W}$.

For discrete $\mathcal{W}$, our results incur a union bound penalty of $\ln|\mathcal{W}|$. We suspect that this dependence is unavoidable in the general case. If we define $\mathcal{W}$ using a bound constraint such as $w(z)\leq B$, or an $f$-divergence based constraint as in the DRO literature, then the worst-case $w$ has a simple closed form in the DRO setting, with no explicit dependence on the complexity of $\mathcal{W}$. These results are easily adapted to MRO as well by observing that $\sup_{w\in\mathcal{W}}{\mathrm{Regret}}_{w}(f)=\sup_{w\in\mathcal{W}}\sup_{f^{\prime}\in\mathcal{F}}\left[R_{w}(f)-R_{w}(f^{\prime})\right]$. For instance, if $\mathcal{W}_{B}=\{w~{}:~{}0\leq w(z)\leq B,{\mathbb{E}}_{P_{0}}[w]=1\}$ is the class of all bounded importance weights, then the results of Shapiro (2017) adapted to MRO imply that

\sup_{w\in\mathcal{W}_{B}}{\mathrm{Regret}}_{w}(f)=\sup_{f^{\prime}\in\mathcal{F}}\inf_{\eta\in\mathbb{R}}\left\{\eta+B{\mathbb{E}}_{P_{0}}\left[\left(\ell(f)-\ell(f^{\prime})-\eta\right)_{+}\right]\right\}. (8)

Similar arguments hold for the family of Cressie-Read divergences studied in Duchi and Namkoong (2021). However, these closed-form maximizations rely on joint perturbations over $(x,y)$ in the supervised learning setting and do not apply to covariate shifts (Duchi et al., 2019).
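To make (8) concrete, the inner infimum over $\eta$ can be computed exactly from samples for a fixed pair $(f,f^{\prime})$, since the empirical objective is convex and piecewise linear in $\eta$ with breakpoints at the sample values of $\ell(f)-\ell(f^{\prime})$. The following sketch (our own, assuming $B\geq 1$) simply scans those breakpoints:

```python
# Empirical inner problem of (8): inf_eta { eta + B * mean((loss_f - loss_fprime - eta)_+) }.
import numpy as np

def worst_case_regret_wb(loss_f, loss_fprime, B):
    """loss_f, loss_fprime: per-sample losses of f and f' on z_1, ..., z_n."""
    d = np.asarray(loss_f) - np.asarray(loss_fprime)
    # The minimizing eta is one of the breakpoints d_j; evaluate the objective at each.
    candidates = d + B * np.maximum(d[None, :] - d[:, None], 0.0).mean(axis=1)
    return candidates.min()
```

Taking the supremum of this quantity over $f^{\prime}\in\mathcal{F}$ then gives an empirical version of the left-hand side of (8) for the class $\mathcal{W}_{B}$.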

Proof [Sketch of the proof of Theorem 2] For a fixed $f$, we can bound $|\widehat{R}_{w}(f)-R_{w}(f)|$ using Bernstein's inequality. This requires bounding the range and variance of the random variable $w(z)\ell(z,f(z))$. Since the losses are bounded in $[0,1]$, these two quantities are bounded by $B_{w}$ and $\sigma_{w}^{2}$ respectively. Assumption 1 allows us to further get a uniform bound over $f\in\mathcal{F}$ with an additional dependence on $d_{\mathcal{F}}(\delta)$. Standard arguments now yield closeness of $\widehat{R}_{w}(f)-\widehat{R}_{w}(\widehat{f}_{w})$ to $R_{w}(f)-R_{w}(f_{w})$ simultaneously for all $f\in\mathcal{F}$ and $w\in\mathcal{W}$, and utilizing the definition of $\widehat{f}_{\text{MRO}}$ as the minimizer of $\sup_{w\in\mathcal{W}}\widehat{R}_{w}(f)-\widehat{R}_{w}(\widehat{f}_{w})$ completes the proof.


4 Fast Rates for Squared Loss and Convex Classes

A limitation of Theorem 2 is that it does not leverage any structure of the loss function beyond Assumption 1, and as a result obtains a $1/\sqrt{n}$ dependence on the number of samples. For the case of a fixed distribution, it is well known that self-bounding properties of the loss variance can be used to obtain faster $1/n$ bounds on the regret of the ERM solution, and here we show that this improvement extends to our setting. For ease of exposition, we specialize to the squared loss in this section. That is, our samples $z$ take the form $(x,y)$ with $x\in\mathcal{X}\subseteq\mathbb{R}^{d}$ and $y\in[-1,1]$, our functions $f$ map from $\mathcal{X}$ to $[-1,1]$, and $\ell(z,f(z))=(f(x)-y)^{2}$. We make an additional convexity assumption on the function class $\mathcal{F}$.

Assumption 3

The class $\mathcal{F}$ is convex: for all $f,f^{\prime}\in\mathcal{F}$, $\alpha f+(1-\alpha)f^{\prime}\in\mathcal{F}$ for all $\alpha\in[0,1]$.

Note that convexity of $\mathcal{F}$ is quite different from convexity of $f$ in its parameters, and can always be satisfied by taking the convex hull of a base class. The assumption can be avoided by replacing our ERM-based solution with aggregation approaches (see e.g. Tsybakov, 2003; Dalalyan and Salmon, 2012). However, we focus on the convex case here to illustrate the key ideas.

Theorem 4

Under Assumptions 1, 2 and 3, with probability at least $1-\delta$, we have for all $w\in\mathcal{W}$:

{\mathrm{Regret}}_{w}(\widehat{f}_{\text{MRO}})\leq{\mathrm{Regret}}_{\star}+\mathcal{O}\Big{(}\sqrt{{\mathrm{Regret}}_{\star}\cdot Bd_{\mathcal{F},\mathcal{W}}(\delta)/n}+Bd_{\mathcal{F},\mathcal{W}}(\delta)/n\Big{)}=\mathcal{O}\Big{(}{\mathrm{Regret}}_{\star}+\frac{Bd_{\mathcal{F},\mathcal{W}}(\delta)}{n}\Big{)},~{}~{}\mbox{where }{\mathrm{Regret}}_{\star}=\inf_{f\in\mathcal{F}}\sup_{w^{\prime}\in\mathcal{W}}{\mathrm{Regret}}_{w^{\prime}}(f).

Theorem 4 crucially leverages the following property of convex classes $\mathcal{F}$ and the squared loss.

Lemma 5

For a convex class $\mathcal{F}$ and the squared loss $\ell(y,f(x))=(y-f(x))^{2}$, we have for any distribution $P$ and $f\in\mathcal{F}$: ${\mathbb{E}}_{P}[(f(x)-f_{P}(x))^{2}]\leq{\mathrm{Regret}}_{P}(f)$.

Note that a similar result holds for the squared loss and any class $\mathcal{F}$, if we replace $f_{P}$ with the unconstrained minimizer of $R_{P}(f)$ over all measurable functions, but using this property in our analysis would result in an additional approximation error term in the bound of Theorem 4, which is undesirable when considering regret within the function class.
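To see why Lemma 5 holds, we sketch the standard convexity argument (this derivation is our own addition, included for completeness). Expanding the squared loss,

{\mathrm{Regret}}_{P}(f)={\mathbb{E}}_{P}\big{[}(f(x)-y)^{2}-(f_{P}(x)-y)^{2}\big{]}={\mathbb{E}}_{P}\big{[}(f(x)-f_{P}(x))^{2}\big{]}+2\,{\mathbb{E}}_{P}\big{[}(f(x)-f_{P}(x))(f_{P}(x)-y)\big{]}.

Since $\mathcal{F}$ is convex and $f_{P}$ minimizes $R_{P}$ over $\mathcal{F}$, the directional derivative of $R_{P}$ at $f_{P}$ towards any $f\in\mathcal{F}$ is nonnegative, that is, $2\,{\mathbb{E}}_{P}\big{[}(f_{P}(x)-y)(f(x)-f_{P}(x))\big{]}\geq 0$, so the cross term above is nonnegative and ${\mathbb{E}}_{P}[(f(x)-f_{P}(x))^{2}]\leq{\mathrm{Regret}}_{P}(f)$ as claimed.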

Proof [Sketch of the proof of Theorem 4] The only difference from the proof of Theorem 2 is in the handling of the variance of $A=w(z)(\ell(z,f(z))-\ell(z,f_{w}(z)))$, which we use as our random variable of interest in this case. Since $w(z)\leq B_{w}$ almost surely, we get

{\mathbb{E}}_{P_{0}}[A^{2}]\stackrel{{(a)}}{{\leq}}16B_{w}{\mathbb{E}}_{w}[(f(x)-f_{w}(x))^{2}]\leq 16B_{w}{\mathrm{Regret}}_{w}(f),

where the inequality $(a)$ follows from the boundedness of $f\in\mathcal{F}$ and $y$, and the final inequality uses Lemma 5. We can now follow the usual recipe for fast rates with a self-bounding property of the variance. The rest of the argument mirrors the proof of Theorem 2.


Comparison with DRO.

It is natural to ask whether it is possible to obtain a fast rate for DRO under the conditions of Theorem 4. Of course, it is not possible to show a regret bound similar to Theorem 4 for DRO, since it does not optimize the worst-case regret, as Proposition 1 illustrates. Nevertheless, we investigate whether DRO can achieve a fast rate for the raw risk. Unfortunately, the following result shows that even under the assumptions of Theorem 4, the worst-case risk criterion for DRO is still subject to the slower $1/\sqrt{n}$ rate. The reason is that regret has a markedly smaller variance than the risk, and the latter usually deviates at a $1/\sqrt{n}$ rate even for the squared loss.

Proposition 6

Under the assumptions of Theorem 4, there is a family of distributions $\mathcal{W}$ with $|\mathcal{W}|=2$, satisfying $B_{w}\leq B=2$, and a function class $\mathcal{F}$ with $d_{\mathcal{F}}=\mathcal{O}(\ln(nL))$, such that with probability at least $1-1/n$, we have $\sup_{w\in\mathcal{W}}R_{w}(\widehat{f}_{\text{DRO}})-\sup_{w\in\mathcal{W}}R_{w}(f_{\text{DRO}})=\Omega(\sqrt{(\ln n)/n})$.

5 Adapting to Misspecification through Non-uniform Scaling across Distributions

Our results so far guarantee uniform regret, with a learning complexity that depends on the worst-case complexity over all distributions $w$. For example, if the complexity of the distribution family is characterized by $B_{w}$ as in our analysis, then the bound depends on $\sup_{w}B_{w}$. In this section, we show that it is possible to improve such dependence using an adaptive scaling of the objective function. To better illustrate the consequences of this approach, let us assume that we can rewrite our guarantees in the form that for each $w\in\mathcal{W}$ and $f\in\mathcal{F}$, we have with probability at least $1-\delta$,

{\mathrm{Regret}}_{w}(f)\leq c(\widehat{R}_{w}(f)-\widehat{R}_{w}(\widehat{f}_{w}))+c_{w}\epsilon, (9)

where the scaling function $c_{w}$ is a distribution-dependent quantity that is known to the algorithm. For instance, under the conditions of Theorem 2, we get this bound with $c=1$, $c_{w}=\sigma_{w}+B_{w}/\sqrt{n}$ and $\epsilon=d_{\mathcal{F},\mathcal{W}}(\delta)/\sqrt{n}$ (assuming $d_{\mathcal{F},\mathcal{W}}(\delta)\geq 1$). Under the conditions of Theorem 4, we can use $c=3$, $c_{w}=B_{w}$ and $\epsilon=d_{\mathcal{F},\mathcal{W}}(\delta)/n$, by taking the second $O(B/n)$ bound of Theorem 4. While this does worsen our dependence on $d_{\mathcal{F},\mathcal{W}}(\delta)$ in the first case, it is preferable to assuming that $d_{\mathcal{F},\mathcal{W}}(\delta)$ is known and setting $c_{w}=\sigma_{w}+B_{w}\sqrt{d_{\mathcal{F},\mathcal{W}}(\delta)/n}$ and $\epsilon=\sqrt{d_{\mathcal{F},\mathcal{W}}(\delta)/n}$.

Since we assume $c_{w}$ is known, we can define the following estimator (Scaled MRO, or SMRO):

\widehat{f}_{\text{SMRO}}:=\operatorname*{\text{arginf}}_{f\in\mathcal{F}}\sup_{w\in\mathcal{W}}(\widehat{R}_{w}(f)-\widehat{R}_{w}(\widehat{f}_{w}))/c_{w}. (10)
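For concreteness, the following sketch (ours, not from the paper) computes the SMRO estimator for a finite function class, using the data-driven scaling $c_{w}=\widehat{\sigma}_{w}+B_{w}/\sqrt{n}$ analyzed in Theorem 7 below; the bounds $B_{w}$ are assumed to be supplied by the user.

```python
# SMRO objective (10) over a finite function class, represented by a per-sample
# loss matrix, with c_w = sigma_hat_w + B_w / sqrt(n) estimated from the data.
import numpy as np

def empirical_smro(losses, weights, B_w):
    """losses: (|F|, n); weights: (|W|, n); B_w: (|W|,) known bounds on each weight."""
    n = losses.shape[1]
    risk_hat = weights @ losses.T / n
    regret_hat = risk_hat - risk_hat.min(axis=1, keepdims=True)
    c_w = np.sqrt((weights ** 2).mean(axis=1)) + np.asarray(B_w) / np.sqrt(n)
    return int((regret_hat / c_w[:, None]).max(axis=0).argmin())
```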

We now present our main results on the statistical properties of $\widehat{f}_{\text{SMRO}}$ under the assumptions of Theorems 2 and 4, before comparing the results and discussing some key consequences.

Theorem 7

Under the assumptions of Theorem 2, consider the estimator $\widehat{f}_{\text{SMRO}}$ with $c_{w}=\widehat{\sigma}_{w}+B_{w}/\sqrt{n}$, where $\widehat{\sigma}_{w}^{2}=\frac{1}{n}\sum_{i=1}^{n}w(z_{i})^{2}$ is the empirical second moment of the importance weights. Then with probability at least $1-\delta$, we have for all $w\in\mathcal{W}$:

{\mathrm{Regret}}_{w}(\widehat{f}_{\text{SMRO}})\leq c_{w}\inf_{f\in\mathcal{F}}\sup_{w^{\prime}\in\mathcal{W}}\frac{{\mathrm{Regret}}_{w^{\prime}}(f)}{c_{w^{\prime}}}+\underbrace{d_{\mathcal{F},\mathcal{W}}(\delta)\;\mathcal{O}\left(\sqrt{\sigma_{w}^{2}/n}+B_{w}/n\right)}_{\epsilon^{\prime}_{w}}.

We can also state a version for squared loss and convex classes.

Theorem 8

Under the assumptions of Theorem 4, consider the estimator $\widehat{f}_{\text{SMRO}}$ with $c_{w}=B_{w}$. Then with probability at least $1-\delta$, we have

\forall w\in\mathcal{W}~{}:~{}{\mathrm{Regret}}_{w}(\widehat{f}_{\text{SMRO}})\leq B_{w}\left[{\mathrm{Regret}}_{\star}+\mathcal{O}\left(\sqrt{{\mathrm{Regret}}_{\star}\cdot\frac{d_{\mathcal{F},\mathcal{W}}(\delta)}{n}}+\frac{d_{\mathcal{F},\mathcal{W}}(\delta)}{n}\right)\right],

where ${\mathrm{Regret}}_{\star}=\inf_{f\in\mathcal{F}}\sup_{w^{\prime}\in\mathcal{W}}\frac{{\mathrm{Regret}}_{w^{\prime}}(f)}{B_{w^{\prime}}}$.

Comparing Theorems 7 and 8 with their MRO counterparts, we notice a subtle but important difference. The finite-sample deviation term $\epsilon^{\prime}_{w}$ in Theorem 7 depends on the weights $w$ for which the bound is stated, while the corresponding term in Theorem 2 is $\sup_{w^{\prime}\in\mathcal{W}}\epsilon_{w^{\prime}}$. In other words, the heterogeneous scaling in SMRO allows us to obtain a regret bound for each distribution depending on the deviation properties of that particular distribution, unlike in the basic MRO approach. To see the benefits of this approach, we state a corollary of Theorems 7 and 8 next.

Corollary 9 (Aligned distribution class)

Suppose that $P_{0}$ and $\mathcal{W}$ satisfy $R_{w}(f_{w})=R_{w}(f^{\star})$ for all $w\in\mathcal{W}$, with $f^{\star}\in\mathcal{F}$. Under the conditions of Theorem 7, with probability at least $1-\delta$:

\forall w\in\mathcal{W}~{}:~{}R_{w}(\widehat{f}_{\text{SMRO}})\leq R_{w}(f^{\star})+d_{\mathcal{F},\mathcal{W}}(\delta)\;\mathcal{O}\left(\sqrt{\frac{\sigma_{w}^{2}}{n}}+\frac{B_{w}}{n}\right).

In the same setting, under the assumptions of Theorem 8, we have with probability at least $1-\delta$:

\forall w\in\mathcal{W}~{}:~{}R_{w}(\widehat{f}_{\text{SMRO}})\leq R_{w}(f^{\star})+\mathcal{O}\left(\frac{B_{w}d_{\mathcal{F},\mathcal{W}}(\delta)}{n}\right).

Both results follow by choosing $f=f_{w}=f^{\star}$ in Theorems 7 and 8. This result shows that the rescaling allows our bounds to adapt to the closeness of a target distribution to the data collection distribution, simultaneously for all target distributions. The alignment condition of a shared $f^{\star}$ is always true in well-specified problems, and we illustrate a consequence of Corollary 9 for well-specified linear regression in Appendix D.

6 Algorithmic considerations

So far our development has focused on the statistical properties of MRO. In this section, we discuss how the MRO estimator can be computed from a finite dataset, given some reasonable computational assumptions on the function class $\mathcal{F}$.

Definition 10 (ERM oracle)

Given a dataset of the form $(\omega_{i},z_{i})_{i=1}^{n}$, an ERM oracle for $\mathcal{F}$ solves the weighted empirical risk minimization problem: $\min_{f\in\mathcal{F}}\sum_{i=1}^{n}\omega_{i}\ell(z_{i},f(z_{i}))$.

While we assume access to an exact ERM oracle, we can weaken the notion to an approximate oracle with an optimization error comparable to the statistical error from finite samples. Given such an oracle, we can now approximately solve the MRO objective (6) (or (10)) by using a well-known strategy of solving minimax problems as two player zero-sum games. To do so, we recall our assumption of a finite class $\mathcal{W}$, and let $\Delta(\mathcal{W})$ denote the set of all distributions over the importance weight family. Similarly, $\Delta(\mathcal{F})$ is the set of distributions over our function class. Then we can rewrite the objective in Equation (6) as:

\inf_{f\in\mathcal{F}}\sup_{w\in\mathcal{W}}\widehat{R}_{w}(f)-\widehat{R}_{w}(\widehat{f}_{w})=\inf_{P\in\Delta(\mathcal{F})}\sup_{\rho\in\Delta(\mathcal{W})}{\mathbb{E}}_{f\sim P,w\sim\rho}\left[\widehat{R}_{w}(f)-\widehat{R}_{w}(\widehat{f}_{w})\right]. (11)

The objective is bilinear in $P$ and $\rho$, so that the following result holds due to the exchangeability of minimum and maximum for convex-concave saddle point problems. It is also a direct consequence of Proposition 12.

Proposition 11

There exists $\hat{\rho}\in\Delta(\mathcal{W})$ so that the solution of MRO is equivalent to the following solution of the weighted ERM method: $\widehat{f}_{\text{MRO}}=\operatorname*{\text{argmin}}_{f\in\mathcal{F}}{\mathbb{E}}_{w\sim\hat{\rho}}\widehat{R}_{w}(f)$.

For finite $\mathcal{W}$ which we consider in this paper, it is also possible to find the approximate $\hat{\rho}$ and $\widehat{f}_{\text{MRO}}$ efficiently, using the celebrated result of Freund and Schapire (1996) to solve this minimax problem through no-regret dynamics. We use the best response strategy for the $P$-player, as finding the best response distribution given a specific $\rho$ is equivalent to finding the function $f\in\mathcal{F}$ that minimizes ${\mathbb{E}}_{w\sim\rho}\widehat{R}_{w}(f)$, which can be accomplished through one call to the ERM oracle. For optimizing $\rho$, we use the exponentiated gradient algorithm of Kivinen and Warmuth (1997) (closely related to Hedge (Freund and Schapire, 1997)), which is a common no-regret strategy to learn a distribution over a finite collection of experts, with each $w\in\mathcal{W}$ being an "expert" in our setting. More formally, we initialize $\rho_{1}$ to be the uniform distribution over $\mathcal{W}$ and repeatedly update:

f_{t}=\operatorname*{\text{arginf}}_{f\in\mathcal{F}}{\mathbb{E}}_{w\sim\rho_{t}}\widehat{R}_{w}(f),\quad\rho_{t+1}(w)\propto\rho_{t}(w)\exp\left(\eta\big{(}\widehat{R}_{w}(f_{t})-\widehat{R}_{w}(\widehat{f}_{w})\big{)}\right). (12)

Let $P_{t}=(f_{1}+\ldots+f_{t})/t$ denote the uniform distribution over the iterates generated by (12). Then the results of Freund and Schapire (1996) yield the following suboptimality bound on $P_{t}$.
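The updates (12) are straightforward to implement for a finite class $\mathcal{W}$. The following sketch (our own illustration, with a finite $\mathcal{F}$ so that the ERM oracle of Definition 10 reduces to an argmin over precomputed risks) runs the best-response and exponentiated-gradient dynamics and returns the averaged distribution $P_{T}$:

```python
# No-regret / best-response dynamics (12) for finite F and W.
import numpy as np

def mro_game(losses, weights, T, B):
    """losses:  (|F|, n) per-sample losses of each candidate function.
       weights: (|W|, n) importance weights, bounded by B.
       Returns P_T, the uniform average over the best-response iterates f_t."""
    n = losses.shape[1]
    risk_hat = weights @ losses.T / n                     # R_hat_w(f), shape (|W|, |F|)
    regret_hat = risk_hat - risk_hat.min(axis=1, keepdims=True)
    eta = np.sqrt(np.log(weights.shape[0]) / (B ** 2 * T))
    log_rho = np.zeros(weights.shape[0])                  # rho_1 uniform over W
    counts = np.zeros(losses.shape[0])
    for _ in range(T):
        rho = np.exp(log_rho - log_rho.max())
        rho /= rho.sum()
        f_t = int((rho @ risk_hat).argmin())              # best response: weighted ERM over F
        counts[f_t] += 1
        log_rho += eta * regret_hat[:, f_t]               # exponentiated-gradient update on rho
    return counts / T
```

With a richer class $\mathcal{F}$, the argmin over rows would be replaced by a call to a weighted ERM oracle with per-sample weights ${\mathbb{E}}_{w\sim\rho_{t}}[w(z_{i})]$.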

Proposition 12

For any $T$, using $\eta=\sqrt{\tfrac{\ln|\mathcal{W}|}{B^{2}T}}$ in the updates (12), the distribution $P_{T}$ satisfies

{\mathbb{E}}_{f\sim P_{T}}\sup_{w}[\widehat{R}_{w}(f)-\widehat{R}_{w}(f_{w})]\leq 2B\sqrt{\frac{\ln|\mathcal{W}|}{T}}.

At a high level, Proposition 12 allows us to choose $T=n^{2}$ to ensure that the optimization error is no larger than the statistical error, allowing the same bounds to apply up to constants. The optimization strategy used here bears some resemblance to boosting techniques, which also seek to reweight the data in order to ensure a uniformly good performance on all samples, though there the reweighting is not constrained to a particular class $\mathcal{W}$. Note that while the optimization error scales logarithmically in $|\mathcal{W}|$, the computational complexity is linear in the size of this set, meaning that our strategy is computationally feasible for sets $\mathcal{W}$ of a modest size. On the other hand, our earlier discussion around (8) suggests that alternative reformulations of the objective might be computationally preferable for classes $\mathcal{W}$ which are continuous. Handling the intermediate regime of a large discrete class $\mathcal{W}$ in a computationally efficient manner is an interesting question for future research.

7 Conclusion

In this paper, we introduce the MRO criterion for problems with distribution shift, and establish learning theoretic properties of optimizing the criterion from samples. We demonstrate the many benefits of reasoning with uniform regret as opposed to uniform risk guarantees, and we expect these observations to have implications beyond the setting of distribution shift.

On the technical side, it remains interesting to further develop scalable algorithms for large datasets and weight classes. Better understanding the statistical scaling with the size of the weight class, and refining our techniques for important scenarios such as covariate shift, are also important directions for future research.

References

  • Agarwal et al. [2019] Alekh Agarwal, Miroslav Dudík, and Zhiwei Steven Wu. Fair regression: Quantitative definitions and reduction-based algorithms. In International Conference on Machine Learning, pages 120–129. PMLR, 2019.
  • Ahuja et al. [2020a] Kartik Ahuja, Karthikeyan Shanmugam, Kush Varshney, and Amit Dhurandhar. Invariant risk minimization games. In International Conference on Machine Learning, pages 145–155. PMLR, 2020a.
  • Ahuja et al. [2020b] Kartik Ahuja, Jun Wang, Amit Dhurandhar, Karthikeyan Shanmugam, and Kush R Varshney. Empirical or invariant risk minimization? a sample complexity perspective. arXiv preprint arXiv:2010.16412, 2020b.
  • Antos et al. [2008] András Antos, Csaba Szepesvári, and Rémi Munos. Learning near-optimal policies with bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89–129, 2008.
  • Arjovsky et al. [2019] Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.
  • Barocas and Selbst [2016] Solon Barocas and Andrew D Selbst. Big data’s disparate impact. Calif. L. Rev., 104:671, 2016.
  • Ben-David et al. [2010] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine learning, 79(1):151–175, 2010.
  • Ben-Tal et al. [2009] Aharon Ben-Tal, Laurent El Ghaoui, and Arkadi Nemirovski. Robust optimization. Princeton university press, 2009.
  • Bertsekas and Tsitsiklis [1996] Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1st edition, 1996. ISBN 1886529108.
  • Bickel et al. [2007] Steffen Bickel, Michael Brückner, and Tobias Scheffer. Discriminative learning for differing training and test distributions. In Proceedings of the 24th international conference on Machine learning, pages 81–88, 2007.
  • Cassel et al. [1976] Claes M Cassel, Carl E Särndal, and Jan H Wretman. Some results on generalized difference estimation and generalized regression estimation for finite populations. Biometrika, 63(3):615–620, 1976.
  • Crammer et al. [2005] Koby Crammer, Michael Kearns, and Jennifer Wortman. Learning from data of variable quality. In NIPS, pages 219–226, 2005.
  • Dalalyan and Salmon [2012] Arnak S Dalalyan and Joseph Salmon. Sharp oracle inequalities for aggregation of affine estimators. The Annals of Statistics, 40(4):2327–2355, 2012.
  • Duchi and Namkoong [2021] John C Duchi and Hongseok Namkoong. Learning models with uniform performance via distributionally robust optimization. The Annals of Statistics, 49(3):1378–1406, 2021.
  • Duchi et al. [2019] John C Duchi, Tatsunori Hashimoto, and Hongseok Namkoong. Distributionally robust losses against mixture covariate shifts. Under review, 2019.
  • Duchi et al. [2021] John C Duchi, Peter W Glynn, and Hongseok Namkoong. Statistics of robust optimization: A generalized empirical likelihood approach. Mathematics of Operations Research, 2021.
  • Dudík et al. [2014] Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. Doubly robust policy evaluation and optimization. Statistical Science, 29(4):485–511, 2014.
  • Dwork et al. [2012] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference, pages 214–226, 2012.
  • Farajtabar et al. [2018] Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. More robust doubly robust off-policy evaluation. In International Conference on Machine Learning, pages 1447–1456. PMLR, 2018.
  • Feige et al. [2015] Uriel Feige, Yishay Mansour, and Robert Schapire. Learning and inference in the presence of corrupted inputs. In Conference on Learning Theory, pages 637–657. PMLR, 2015.
  • Freund and Schapire [1996] Yoav Freund and Robert E Schapire. Game theory, on-line prediction and boosting. In Proceedings of the ninth annual conference on Computational learning theory, pages 325–332, 1996.
  • Freund and Schapire [1997] Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119–139, 1997.
  • Galar et al. [2011] Mikel Galar, Alberto Fernandez, Edurne Barrenechea, Humberto Bustince, and Francisco Herrera. A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(4):463–484, 2011.
  • Goodfellow et al. [2014] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
  • Hashimoto et al. [2018] Tatsunori Hashimoto, Megha Srivastava, Hongseok Namkoong, and Percy Liang. Fairness without demographics in repeated loss minimization. In International Conference on Machine Learning, pages 1929–1938. PMLR, 2018.
  • Huang et al. [2006] Jiayuan Huang, Arthur Gretton, Karsten Borgwardt, Bernhard Schölkopf, and Alex Smola. Correcting sample selection bias by unlabeled data. Advances in neural information processing systems, 19:601–608, 2006.
  • Japkowicz [2000] Nathalie Japkowicz. The class imbalance problem: Significance and strategies. In Proc. of the Int’l Conf. on Artificial Intelligence, volume 56. Citeseer, 2000.
  • Kamath et al. [2021] Pritish Kamath, Akilesh Tangella, Danica Sutherland, and Nathan Srebro. Does invariant risk minimization capture invariance? In International Conference on Artificial Intelligence and Statistics, pages 4069–4077. PMLR, 2021.
  • Kivinen and Warmuth [1997] Jyrki Kivinen and Manfred K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Inf. Comput., 132(1):1–63, 1997.
  • Kuhn et al. [2019] Daniel Kuhn, Peyman Mohajerin Esfahani, Viet Anh Nguyen, and Soroosh Shafieezadeh-Abadeh. Wasserstein distributionally robust optimization: Theory and applications in machine learning. In Operations Research & Management Science in the Age of Analytics, pages 130–166. INFORMS, 2019.
  • Madry et al. [2017] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
  • Mansour et al. [2009] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. arXiv preprint arXiv:0902.3430, 2009.
  • Munos and Szepesvári [2008] Rémi Munos and Csaba Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(5), 2008.
  • Namkoong and Duchi [2016] Hongseok Namkoong and John C Duchi. Stochastic gradient methods for distributionally robust optimization with f-divergences. In NIPS, volume 29, pages 2208–2216, 2016.
  • Oren et al. [2019] Yonatan Oren, Shiori Sagawa, Tatsunori B Hashimoto, and Percy Liang. Distributionally robust language modeling. arXiv preprint arXiv:1909.02060, 2019.
  • Pan and Yang [2009] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2009.
  • Patel et al. [2015] Vishal M Patel, Raghuraman Gopalan, Ruonan Li, and Rama Chellappa. Visual domain adaptation: A survey of recent advances. IEEE signal processing magazine, 32(3):53–69, 2015.
  • Quinonero-Candela et al. [2009] J. Quinonero-Candela, M. Sugiyama, A. Schwaighofer, and N.D. Lawrence. Dataset Shift in Machine Learning. Neural Information Processing series. MIT Press, 2009. ISBN 9780262170055. URL https://books.google.com/books?id=7zrhAAAAMAAJ.
  • Rosenfeld et al. [2020] Elan Rosenfeld, Pradeep Kumar Ravikumar, and Andrej Risteski. The risks of invariant risk minimization. In International Conference on Learning Representations, 2020.
  • Shalev-Shwartz [2012] Shai Shalev-Shwartz. Online learning and online convex optimization. Found. Trends Mach. Learn., page 107–194, 2012.
  • Shapiro [2017] Alexander Shapiro. Distributionally robust stochastic programming. SIAM Journal on Optimization, 27(4):2258–2275, 2017.
  • Shimodaira [2000] Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of statistical planning and inference, 90(2):227–244, 2000.
  • Staib and Jegelka [2019] Matthew Staib and Stefanie Jegelka. Distributionally robust optimization and generalization in kernel methods. Advances in Neural Information Processing Systems, 32:9134–9144, 2019.
  • Su et al. [2020] Yi Su, Maria Dimakopoulou, Akshay Krishnamurthy, and Miroslav Dudík. Doubly robust off-policy evaluation with shrinkage. In International Conference on Machine Learning, pages 9167–9176. PMLR, 2020.
  • Sutton and Barto [1998] R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. Adaptive Computation and Machine Learning series. MIT Press, 1998. ISBN 9780262303842. URL https://books.google.com/books?id=U57uDwAAQBAJ.
  • Tan et al. [2018] Chuanqi Tan, Fuchun Sun, Tao Kong, Wenchang Zhang, Chao Yang, and Chunfang Liu. A survey on deep transfer learning. In International conference on artificial neural networks, pages 270–279. Springer, 2018.
  • Tsybakov [2003] Alexandre B Tsybakov. Optimal rates of aggregation. In Learning theory and kernel machines, pages 303–313. Springer, 2003.
  • Wang et al. [2019] Yisen Wang, Xingjun Ma, James Bailey, Jinfeng Yi, Bowen Zhou, and Quanquan Gu. On the convergence and robustness of adversarial training. In International Conference on Machine Learning, pages 6586–6595. PMLR, 2019.
  • Xu et al. [2020] Ziyu Xu, Chen Dan, Justin Khim, and Pradeep Ravikumar. Class-weighted classification: Trade-offs and robust approaches. In International Conference on Machine Learning, pages 10544–10554. PMLR, 2020.
  • Zhu et al. [2020] Jia-Jie Zhu, Wittawat Jitkrittum, Moritz Diehl, and Bernhard Schölkopf. Kernel distributionally robust optimization. arXiv preprint arXiv:2006.06981, 2020.

Appendix A Proof of Theorem 2

Recalling our assumptions that for any w\in\mathcal{W}, we have \max_{z}w(z)\leq B_{w}\leq B and {\mathbb{E}}_{z\sim P_{0}}w(z)^{2}\leq\sigma_{w}^{2}, we know that for a fixed w\in\mathcal{W} and f\in\mathcal{F}, we have, with probability at least 1-\delta:

|R^w(f)Rw(f)|=𝒪(σw2ln(1/δ)n+Bwln(1/δ)n).\left|\widehat{R}_{w}(f)-R_{w}(f)\right|=\mathcal{O}\left(\sqrt{\frac{\sigma_{w}^{2}\ln(1/\delta)}{n}}+\frac{B_{w}\ln(1/\delta)}{n}\right). (13)

This is a consequence of Bernstein’s inequality applied to the random variable A=w(z)(z,f(z))A=w(z)\ell(z,f(z)) which is bounded by BwB_{w} almost surely, and has a second moment at most σw2\sigma_{w}^{2} when zP0z\sim P_{0}. Since (z,f(z))\ell(z,f(z)) is LL-Lipschitz in the second argument, if ff1/(nLB)\|f-f^{\prime}\|_{\infty}\leq 1/(nLB), we have for any w𝒲w\in\mathcal{W} and any zz:

|w(z)(z,f(z))w(z)(z,f(z))|BwLnLB1n,\displaystyle|w(z)\ell(z,f(z))-w(z)\ell(z,f^{\prime}(z))|\leq\frac{B_{w}L}{nLB}\leq\frac{1}{n},

where the second inequality follows since BwBB_{w}\leq B for all w𝒲w\in\mathcal{W} by assumption. Hence, defining \mathcal{F}^{\prime} to be an \ell_{\infty} cover of \mathcal{F} at a scale 1/nLB1/nLB, a union bound combined with the bound (13) yields that with probability 1δ1-\delta, we have for all ff\in\mathcal{F}^{\prime}:

|R^w(f)Rw(f)|=𝒪(σw2ln(N(,1/(nLB))/δ)n+Bwln(N(,1/(nLB))/δ)n).\left|\widehat{R}_{w}(f)-R_{w}(f)\right|=\mathcal{O}\left(\sqrt{\frac{\sigma_{w}^{2}\ln(N(\mathcal{F},1/(nLB))/\delta)}{n}}+\frac{B_{w}\ln(N(\mathcal{F},1/(nLB))/\delta)}{n}\right).

Using the Lipschitz property of the loss and further taking a union bound over w𝒲w\in\mathcal{W}, this gives for all ff\in\mathcal{F} and w𝒲w\in\mathcal{W}, with probability at least 1δ1-\delta:

|R^w(f)Rw(f)|\displaystyle\left|\widehat{R}_{w}(f)-R_{w}(f)\right| =𝒪(σw2ln(N(,1/(nLB))|𝒲|/δ)n+Bwln(N(,1/(nLB))|𝒲|/δ)n)+1n\displaystyle=\mathcal{O}\left(\sqrt{\frac{\sigma_{w}^{2}\ln(N(\mathcal{F},1/(nLB))|\mathcal{W}|/\delta)}{n}}+\frac{B_{w}\ln(N(\mathcal{F},1/(nLB))|\mathcal{W}|/\delta)}{n}\right)+\frac{1}{n}
=𝒪(σw2d,𝒲(δ)n+Bwd,𝒲(δ)n)=:ϵw,\displaystyle=\mathcal{O}\left(\sqrt{\frac{\sigma_{w}^{2}d_{\mathcal{F},\mathcal{W}}(\delta)}{n}}+\frac{B_{w}d_{\mathcal{F},\mathcal{W}}(\delta)}{n}\right)=:\epsilon_{w}, (14)

where the second equality recalls our definition of d_{\mathcal{F},\mathcal{W}}(\delta)=1+\ln\frac{N(\mathcal{F},1/(nLB))}{\delta}+\ln\frac{|\mathcal{W}|}{\delta}. We now condition on the 1-\delta probability event that the bound of Equation 14 holds, to avoid stating error probabilities in each bound. In particular, applying the bound above to the empirical minimizer \widehat{f}_{w} for any w\in\mathcal{W}, we observe that

Rw(f^w)R^w(f^w)+ϵwR^w(fw)+ϵwRw(fw)+2ϵw.\displaystyle R_{w}(\widehat{f}_{w})\leq\widehat{R}_{w}(\widehat{f}_{w})+\epsilon_{w}\leq\widehat{R}_{w}(f_{w})+\epsilon_{w}\leq R_{w}(f_{w})+2\epsilon_{w}.

Hence, for any ff\in\mathcal{F}:

supw𝒲[Rw(f^MRO)Rw(fw)]\displaystyle\sup_{w\in\mathcal{W}}\left[R_{w}(\widehat{f}_{\text{MRO}})-R_{w}(f_{w})\right] supw𝒲[Rw(f^MRO)R^w(f^w)]+supwϵw\displaystyle\leq\sup_{w\in\mathcal{W}}[R_{w}(\widehat{f}_{\text{MRO}})-\widehat{R}_{w}(\widehat{f}_{w})]+\sup_{w}\epsilon_{w}
supw𝒲[R^w(f^MRO)R^w(f^w)]+2supwϵw\displaystyle\leq\sup_{w\in\mathcal{W}}[\widehat{R}_{w}(\widehat{f}_{\text{MRO}})-\widehat{R}_{w}(\widehat{f}_{w})]+2\sup_{w}\epsilon_{w}
supw𝒲[R^w(f)R^w(f^w)]+2supwϵw\displaystyle\leq\sup_{w\in\mathcal{W}}[\widehat{R}_{w}(f)-\widehat{R}_{w}(\widehat{f}_{w})]+2\sup_{w}\epsilon_{w} (since f^MRO=arginffsupw𝒲R^w(f)R^w(f^w)\widehat{f}_{\text{MRO}}=\operatorname*{\text{arginf}}_{f\in\mathcal{F}}\sup_{w\in\mathcal{W}}\widehat{R}_{w}(f)-\widehat{R}_{w}(\widehat{f}_{w}))
supw𝒲[Rw(f)Rw(f^w)]+4supwϵw\displaystyle\leq\sup_{w\in\mathcal{W}}[R_{w}(f)-R_{w}(\widehat{f}_{w})]+4\sup_{w}\epsilon_{w}
supw𝒲[Rw(f)Rw(fw)]+4supwϵw.\displaystyle\leq\sup_{w\in\mathcal{W}}[R_{w}(f)-R_{w}(f_{w})]+4\sup_{w}\epsilon_{w}.

Taking an infimum over ff\in\mathcal{F} completes the proof of the theorem.
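To make the objects in this proof concrete, the sketch below instantiates them when both \mathcal{F} and \mathcal{W} are finite: it forms the importance-weighted empirical risks \widehat{R}_{w}(f), the per-weight empirical minima \widehat{R}_{w}(\widehat{f}_{w}), and the empirical MRO objective \sup_{w\in\mathcal{W}}[\widehat{R}_{w}(f)-\widehat{R}_{w}(\widehat{f}_{w})] minimized by \widehat{f}_{\text{MRO}}. This is a minimal illustration only; the array layout, function names, and the finite discretization of \mathcal{F} are our own assumptions and not part of the paper.

```python
import numpy as np

def empirical_mro(losses, weights):
    """losses[k, i] = loss of candidate f_k on sample z_i; weights[j, i] = w_j(z_i).
    Returns the index of the empirical MRO solution among the finite candidates."""
    n = losses.shape[1]
    R_hat = losses @ weights.T / n                  # R_hat[k, j] = empirical weighted risk of f_k under w_j
    baseline = R_hat.min(axis=0)                    # baseline[j] = empirical risk of the per-weight ERM
    mro_objective = (R_hat - baseline).max(axis=1)  # worst-case empirical regret of each candidate
    return int(mro_objective.argmin())

# toy usage: 5 candidate predictors, 3 weight functions, 100 samples
rng = np.random.default_rng(0)
losses = rng.uniform(size=(5, 100))
weights = rng.uniform(0.0, 2.0, size=(3, 100))
k_mro = empirical_mro(losses, weights)
```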

Appendix B Proof of Theorem 4

We first prove Theorem 4 using Lemma 5, before proceeding to prove the lemma.

Proof [Proof of Theorem 4] We begin with the deviation bound for a fixed f\in\mathcal{F} and w\in\mathcal{W} as before, but use the regret random variable A=w(z)(\ell(z,f(z))-\ell(z,f_{w}(z))) this time. Then, using w(z)\leq B_{w} together with Lemma 5, we have

𝔼P0[A2]\displaystyle{\mathbb{E}}_{P_{0}}[A^{2}] Bw𝔼P0[w(z)((z,f(z))(z,fw(z)))2]=Bw𝔼w[((z,f(z))(z,fw(z)))2]\displaystyle\leq B_{w}{\mathbb{E}}_{P_{0}}[w(z)(\ell(z,f(z))-\ell(z,f_{w}(z)))^{2}]=B_{w}{\mathbb{E}}_{w}[(\ell(z,f(z))-\ell(z,f_{w}(z)))^{2}]
=Bw𝔼w[((f(x)y)2(fw(x)y)2)2]\displaystyle=B_{w}{\mathbb{E}}_{w}\left[\left((f(x)-y)^{2}-(f_{w}(x)-y)^{2}\right)^{2}\right]
=Bw𝔼w[(f(x)fw(x))2(f(x)+fw(x)2y)2]\displaystyle=B_{w}{\mathbb{E}}_{w}\left[(f(x)-f_{w}(x))^{2}(f(x)+f_{w}(x)-2y)^{2}\right]
16Bw𝔼w[(f(x)fw(x))2]\displaystyle\leq 16B_{w}{\mathbb{E}}_{w}\left[(f(x)-f_{w}(x))^{2}\right] ((|f(x)|,|fw(x)|,|y|1(|f(x)|,|f_{w}(x)|,|y|\leq 1)
16BwRegretw(f).\displaystyle\leq 16B_{w}{\mathrm{Regret}}_{w}(f). (Lemma 5)

Thus we see that for any fixed ff\in\mathcal{F} and w𝒲w\in\mathcal{W}, we have with probability 1δ1-\delta:

|Regretw(f)[R^w(f)R^w(fw)]|\displaystyle|{\mathrm{Regret}}_{w}(f)-[\widehat{R}_{w}(f)-\widehat{R}_{w}(f_{w})]| =𝒪(BwRegretw(f)ln(1/δ)n+Bwln(1/δ)n)\displaystyle=\mathcal{O}\left(\sqrt{\frac{B_{w}{\mathrm{Regret}}_{w}(f)\ln(1/\delta)}{n}}+\frac{B_{w}\ln(1/\delta)}{n}\right)
γRegretw(f)+(1+γ1)𝒪(Bwln(1/δ)n).\displaystyle\leq\gamma{\mathrm{Regret}}_{w}(f)+(1+\gamma^{-1})\mathcal{O}\left(\frac{B_{w}\ln(1/\delta)}{n}\right).

where \gamma>0 is arbitrary, and the second step uses the inequality \sqrt{ab}\leq\gamma a+b/(4\gamma).

Hence, with probability at least 1δ1-\delta, we have for a fixed ff and ww:

(1γ)Regretw(f)\displaystyle(1-\gamma){\mathrm{Regret}}_{w}(f) (R^w(f)R^w(fw))+(1+γ1)𝒪(Bwln(1/δ)n),and\displaystyle\leq(\widehat{R}_{w}(f)-\widehat{R}_{w}(f_{w}))+(1+\gamma^{-1})\mathcal{O}\left(\frac{B_{w}\ln(1/\delta)}{n}\right),\quad\mbox{and}
R^w(f)R^w(fw)\displaystyle\widehat{R}_{w}(f)-\widehat{R}_{w}(f_{w}) (1+γ)Regretw(f)+(1+γ1)𝒪(Bwln(1/δ)n).\displaystyle\leq(1+\gamma){\mathrm{Regret}}_{w}(f)+(1+\gamma^{-1})\mathcal{O}\left(\frac{B_{w}\ln(1/\delta)}{n}\right).

In particular, choosing f=f^wf=\widehat{f}_{w} in the first inequality, we see that

R^w(fw)R^w(f^w)+(1+γ1)𝒪(Bwln(1/δ)n)\displaystyle\widehat{R}_{w}(f_{w})\leq\widehat{R}_{w}(\widehat{f}_{w})+(1+\gamma^{-1})\mathcal{O}\left(\frac{B_{w}\ln(1/\delta)}{n}\right) (15)

Using identical arguments as the proof of Theorem 2 allows us to turn both statements into uniform bounds over all ff\in\mathcal{F} and w𝒲w\in\mathcal{W}. Defining ϵ=Bd,𝒲(δ)/n\epsilon^{\prime}=Bd_{\mathcal{F},\mathcal{W}}(\delta)/n, we now have with probability at least 1δ1-\delta:

\displaystyle(1-\gamma)\sup_{w\in\mathcal{W}}{\mathrm{Regret}}_{w}(\widehat{f}_{\text{MRO}})
supw𝒲[R^w(f^MRO)R^w(fw)]+(1+γ1)ϵ\displaystyle\leq\sup_{w\in\mathcal{W}}[\widehat{R}_{w}(\widehat{f}_{\text{MRO}})-\widehat{R}_{w}(f_{w})]+(1+\gamma^{-1})\epsilon^{\prime}
supw𝒲[R^w(f^MRO)R^w(f^w)]+(1+γ1)ϵ\displaystyle\leq\sup_{w\in\mathcal{W}}[\widehat{R}_{w}(\widehat{f}_{\text{MRO}})-\widehat{R}_{w}(\widehat{f}_{w})]+(1+\gamma^{-1})\epsilon^{\prime} (R^w(fw)R^w(f^w)\widehat{R}_{w}(f_{w})\geq\widehat{R}_{w}(\widehat{f}_{w}))
\displaystyle\leq\sup_{w\in\mathcal{W}}[\widehat{R}_{w}(f)-\widehat{R}_{w}(\widehat{f}_{w})]+(1+\gamma^{-1})\epsilon^{\prime} (since \widehat{f}_{\text{MRO}}=\operatorname*{\text{arginf}}_{f\in\mathcal{F}}\sup_{w\in\mathcal{W}}[\widehat{R}_{w}(f)-\widehat{R}_{w}(\widehat{f}_{w})])
\displaystyle\leq\sup_{w\in\mathcal{W}}[\widehat{R}_{w}(f)-\widehat{R}_{w}(f_{w})]+2(1+\gamma^{-1})\epsilon^{\prime} (Equation 15)
(1+γ)supw𝒲[Rw(f)Rw(fw)]+3(1+γ1)ϵ\displaystyle\leq(1+\gamma)\sup_{w\in\mathcal{W}}[R_{w}(f)-R_{w}(f_{w})]+3(1+\gamma^{-1})\epsilon^{\prime}
=(1+γ)supw𝒲Regretw(f)+3(1+γ1)ϵ.\displaystyle=(1+\gamma)\sup_{w\in\mathcal{W}}{\mathrm{Regret}}_{w}(f)+3(1+\gamma^{-1})\epsilon^{\prime}.

Taking an infimum over ff\in\mathcal{F} and γ=min(0.5,ϵ/inffsupw𝒲Regretw(f))\gamma=\min(0.5,\sqrt{\epsilon^{\prime}/\inf_{f\in\mathcal{F}}\sup_{w\in\mathcal{W}}{\mathrm{Regret}}_{w}(f)}) completes the proof.  
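For concreteness, write R^{\star}=\inf_{f\in\mathcal{F}}\sup_{w\in\mathcal{W}}{\mathrm{Regret}}_{w}(f). Since the chosen \gamma satisfies \gamma\leq 1/2, dividing the display above by (1-\gamma) and using (1+\gamma)/(1-\gamma)\leq 1+4\gamma together with (1+\gamma^{-1})/(1-\gamma)\leq 4/\gamma shows that, up to the constants absorbed into \epsilon^{\prime},

\sup_{w\in\mathcal{W}}{\mathrm{Regret}}_{w}(\widehat{f}_{\text{MRO}})\leq R^{\star}+4\gamma R^{\star}+\frac{12\epsilon^{\prime}}{\gamma}\leq R^{\star}+\mathcal{O}\left(\sqrt{\epsilon^{\prime}R^{\star}}+\epsilon^{\prime}\right),

where the last inequality substitutes \gamma=\min(0.5,\sqrt{\epsilon^{\prime}/R^{\star}}) and treats the cases \epsilon^{\prime}\leq R^{\star}/4 and \epsilon^{\prime}>R^{\star}/4 separately.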


We now prove Lemma 5.

B.1 Proof of Lemma 5

Since \mathcal{F} is convex, if f1f_{1}\in\mathcal{F} and f2f_{2}\in\mathcal{F}, then αf1+(1α)f2\alpha f_{1}+(1-\alpha)f_{2}\in\mathcal{F} for α[0,1]\alpha\in[0,1]. Then using fP=arginffRP(f)f_{P}=\operatorname*{\text{arginf}}_{f\in\mathcal{F}}R_{P}(f), we have for any distribution PP and ff\in\mathcal{F}:

0\displaystyle 0 RP(αf+(1α)fP)RP(fP)\displaystyle\leq R_{P}(\alpha f+(1-\alpha)f_{P})-R_{P}(f_{P})
=α𝔼P[(αf(x)+(2α)fP(x)2y)(f(x)fP(x))]\displaystyle=\alpha{\mathbb{E}}_{P}[(\alpha f(x)+(2-\alpha)f_{P}(x)-2y)(f(x)-f_{P}(x))]
=α2𝔼P[(f(x)fP(x))2]+2α𝔼P[(fP(x)y)(f(x)fP(x))].\displaystyle=\alpha^{2}{\mathbb{E}}_{P}[(f(x)-f_{P}(x))^{2}]+2\alpha{\mathbb{E}}_{P}[(f_{P}(x)-y)(f(x)-f_{P}(x))].

Since this holds for any α[0,1]\alpha\in[0,1], we take the limit α0\alpha\downarrow 0 to conclude that for all ff\in\mathcal{F}

𝔼P[(fP(x)y)(f(x)fP(x))]0.{\mathbb{E}}_{P}[(f_{P}(x)-y)(f(x)-f_{P}(x))]\geq 0. (16)

This inequality further allows us to conclude for any ff\in\mathcal{F}

\displaystyle R_{P}(f)-R_{P}(f_{P})={\mathbb{E}}_{P}[(f(x)+f_{P}(x)-2y)(f(x)-f_{P}(x))]
=𝔼P[(f(x)fP(x))2]+2𝔼P[(fP(x)y)(f(x)fP(x))]\displaystyle={\mathbb{E}}_{P}[(f(x)-f_{P}(x))^{2}]+2{\mathbb{E}}_{P}[(f_{P}(x)-y)(f(x)-f_{P}(x))]
𝔼P[(f(x)fP(x))2].\displaystyle\geq{\mathbb{E}}_{P}[(f(x)-f_{P}(x))^{2}].

B.2 Proof of Proposition 6

We consider a simple problem where there is a class 𝒫={P1,P2}\mathcal{P}=\{P_{1},P_{2}\} consisting of two distributions that we want to do well under. Each distribution PiP_{i} is such that xPix\sim P_{i} takes the form x=μi+ϵx=\mu_{i}+\epsilon where μi\mu_{i}\in\mathbb{R} and ϵ\epsilon is ±1\pm 1 with probability 1/21/2. The function class =[C,C]\mathcal{F}=[-C,C] for a constant CC that will be appropriately chosen in the proof. Let Δμ=μ1μ2\Delta_{\mu}=\mu_{1}-\mu_{2} be the difference of means. Let us choose P0P_{0} to be supported over ×{1,2}\mathbb{R}\times\{1,2\} so that z=(x,1)z=(x,1) where xP1x\sim P_{1} with probability 0.5 and z=(x,2)z=(x,2) with xP2x\sim P_{2} with probability 0.5. We set w1(z)=1w_{1}(z)=1 if z2=1z_{2}=1 and similarly w2(z)=1w_{2}(z)=1 if z2=2z_{2}=2. Let us define g(z)=f(z1)g(z)=f(z_{1}) for ff\in\mathcal{F}. Since g(z)g(z) is equivalent to f(x)f(x) with x=z1x=z_{1} in this example, we stick to using the notation ff\in\mathcal{F} for the rest of the proof for consistency with our notation throughout. Let us use RiR_{i}, R^i\widehat{R}_{i} to denote the empirical and expected risks on samples from PiP_{i} (equivalently wiw_{i}). Then a direct calculation shows that

Ri(f)=(fμi)2+1,andR^i(f)=(fμi)2+12Si(fμi),\displaystyle R_{i}(f)=(f-\mu_{i})^{2}+1,\quad\mbox{and}\quad\widehat{R}_{i}(f)=(f-\mu_{i})^{2}+1-2S_{i}(f-\mu_{i}),

where Si=j=1nwi(zj)ϵj/(j=1nwi(zj))S_{i}=\sum_{j=1}^{n}w_{i}(z_{j})\epsilon_{j}/(\sum_{j=1}^{n}w_{i}(z_{j})) with ϵj\epsilon_{j} being the noise realization in zjz_{j}. Then for sufficiently large CC, we see that

fDRO=argminfmaxiRi(f)\displaystyle f_{\text{DRO}}=\operatorname*{\text{argmin}}_{f\in\mathcal{F}}\max_{i}R_{i}(f) =argminfmax{(fμ1)2,(fμ2)2}=μ1+μ22,\displaystyle=\operatorname*{\text{argmin}}_{f\in\mathcal{F}}\max\{(f-\mu_{1})^{2},(f-\mu_{2})^{2}\}=\frac{\mu_{1}+\mu_{2}}{2},

and the best worst case risk is given by maxiRi(fDRO)=1+Δμ2/4\max_{i}R_{i}(f_{\text{DRO}})=1+\Delta_{\mu}^{2}/4. Now let us examine the empirical situation. We have

f^DRO\displaystyle\widehat{f}_{\text{DRO}} =argminfmaxiR^i(f)=argminfmax{(fμ1)22S1(fμ1),(fμ2)22S2(fμ2)}.\displaystyle=\operatorname*{\text{argmin}}_{f\in\mathcal{F}}\max_{i}\widehat{R}_{i}(f)=\operatorname*{\text{argmin}}_{f\in\mathcal{F}}\max\{(f-\mu_{1})^{2}-2S_{1}(f-\mu_{1}),(f-\mu_{2})^{2}-2S_{2}(f-\mu_{2})\}.

Let us denote f^i=μi+Si\widehat{f}_{i}=\mu_{i}+S_{i} as the minimizer of R^i(f)\widehat{R}_{i}(f). Let us condition on the event \mathcal{E} where:

={R^1(f^1)R^2(f^1)andR^1(f^2)R^2(f^2)}.\mathcal{E}=\{\widehat{R}_{1}(\widehat{f}_{1})\leq\widehat{R}_{2}(\widehat{f}_{1})\quad\mbox{and}\quad\widehat{R}_{1}(\widehat{f}_{2})\geq\widehat{R}_{2}(\widehat{f}_{2})\}. (17)

Under this event, since both \widehat{R}_{1}(f) and \widehat{R}_{2}(f) are continuous functions of f, they are equal for some f\in[\min(\widehat{f}_{1},\widehat{f}_{2}),\max(\widehat{f}_{1},\widehat{f}_{2})]. So conditioned on \mathcal{E}, we seek a solution to

0\displaystyle 0 =(fμ1)22S1(fμ1)(fμ2)2+2S2(fμ2)\displaystyle=(f-\mu_{1})^{2}-2S_{1}(f-\mu_{1})-(f-\mu_{2})^{2}+2S_{2}(f-\mu_{2})
=2f(μ2μ1+S2S1)+(μ12μ22+2S1μ12S2μ2),\displaystyle=2f(\mu_{2}-\mu_{1}+S_{2}-S_{1})+(\mu_{1}^{2}-\mu_{2}^{2}+2S_{1}\mu_{1}-2S_{2}\mu_{2}),

so that we get

f^DRO\displaystyle\widehat{f}_{\text{DRO}} =μ22μ12+2S2μ22S1μ12(μ2μ1+S2S1)\displaystyle=\frac{\mu_{2}^{2}-\mu_{1}^{2}+2S_{2}\mu_{2}-2S_{1}\mu_{1}}{2(\mu_{2}-\mu_{1}+S_{2}-S_{1})}
=Δμ(μ1+μ2)+2S1μ12S2μ22(Δμ+ΔS)\displaystyle=\frac{\Delta_{\mu}(\mu_{1}+\mu_{2})+2S_{1}\mu_{1}-2S_{2}\mu_{2}}{2(\Delta_{\mu}+\Delta_{S})}
=μ1+μ22+2S1μ12S2μ2(μ1+μ2)ΔS2(Δμ+ΔS)\displaystyle=\frac{\mu_{1}+\mu_{2}}{2}+\frac{2S_{1}\mu_{1}-2S_{2}\mu_{2}-(\mu_{1}+\mu_{2})\Delta_{S}}{2(\Delta_{\mu}+\Delta_{S})}
=μ1+μ22+S1μ1S2μ2S1μ2+S2μ12(Δμ+ΔS)\displaystyle=\frac{\mu_{1}+\mu_{2}}{2}+\frac{S_{1}\mu_{1}-S_{2}\mu_{2}-S_{1}\mu_{2}+S_{2}\mu_{1}}{2(\Delta_{\mu}+\Delta_{S})}
=μ1+μ22+(S1+S2)Δμ2(Δμ+ΔS)ξ.\displaystyle=\frac{\mu_{1}+\mu_{2}}{2}+\underbrace{\frac{(S_{1}+S_{2})\Delta_{\mu}}{2(\Delta_{\mu}+\Delta_{S})}}_{\xi}.

Plugging the minimizer back in, we obtain the best worst case population risk as

\displaystyle\max_{i}R_{i}(\widehat{f}_{\text{DRO}})-1=\max\{(\widehat{f}_{\text{DRO}}-\mu_{1})^{2},(\widehat{f}_{\text{DRO}}-\mu_{2})^{2}\}
=max{(Δμ/2+ξ)2,(Δμ/2+ξ)2}\displaystyle=\max\left\{(-\Delta_{\mu}/2+\xi)^{2},(\Delta_{\mu}/2+\xi)^{2}\right\}
=Δμ2/4+ξ2+max{Δμξ,Δμξ}\displaystyle=\Delta_{\mu}^{2}/4+\xi^{2}+\max\left\{-\Delta_{\mu}\xi,\Delta_{\mu}\xi\right\}
=Δμ2/4+ξ2+|Δμξ|.\displaystyle=\Delta_{\mu}^{2}/4+\xi^{2}+|\Delta_{\mu}\xi|.

Recalling that minfmaxiRi(f)=1+Δμ2/4\min_{f}\max_{i}R_{i}(f)=1+\Delta_{\mu}^{2}/4, we further obtain

maxiRi(f^DRO)maxiRi(fDRO)\displaystyle\max_{i}R_{i}(\widehat{f}_{\text{DRO}})-\max_{i}R_{i}(f_{\text{DRO}}) =ξ2+|Δμξ|.\displaystyle=\xi^{2}+|\Delta_{\mu}\xi|.

In order to complete the lower bound on this regret term, we now analyze the random variable \xi, and we choose \Delta_{\mu}=1. Writing n_{i}=\sum_{j=1}^{n}w_{i}(z_{j}) for the number of samples drawn from P_{i}, standard concentration arguments for S_{i} and n_{i} give, with probability at least 1-4\delta, for i=1,2:

\displaystyle|S_{i}|\leq c\sqrt{\frac{\log(1/\delta)}{n_{i}}},\quad\mbox{and}\quad\left|n_{i}-\frac{n}{2}\right|\leq c\sqrt{n\log(1/\delta)}.

Under this 14δ1-4\delta event, we have ΔS=𝒪(ln(1/δ)/n)\Delta_{S}=\mathcal{O}(\sqrt{\ln(1/\delta)/n}) and S1+S2=O(ln(1/δ)/n)S_{1}+S_{2}=O(\sqrt{\ln(1/\delta)/n}). Consequently, under this event

|ξ|=𝒪(|Δμln(1/δ)2n(Δμln(1/δ)/n)|).\displaystyle|\xi|=\mathcal{O}\left(\left|\frac{\Delta_{\mu}\sqrt{\ln(1/\delta)}}{2\sqrt{n}(\Delta_{\mu}-\sqrt{\ln(1/\delta)/n})}\right|\right).

Choosing 4δ=1/n4\delta=1/n, we further have

|ξ|=𝒪(|ΔμlnnnΔμ|)=𝒪(lnnn),\displaystyle|\xi|=\mathcal{O}\left(\left|\frac{\Delta_{\mu}\sqrt{\ln n}}{\sqrt{n}\Delta_{\mu}}\right|\right)=\mathcal{O}\left(\sqrt{\frac{\ln n}{n}}\right),

where we have used \Delta_{\mu}/2\geq\sqrt{\ln n/n}, which holds for all sufficiently large n since \Delta_{\mu}=1. Note that we have not yet set the constant C defining \mathcal{F}. In order for the unconstrained minimizations above to be attained in \mathcal{F}, we need f_{\text{DRO}},\widehat{f}_{\text{DRO}}\in\mathcal{F}. This means that C\geq(\mu_{1}+\mu_{2})/2 and C\geq(\mu_{2}^{2}-\mu_{1}^{2}+2S_{2}\mu_{2}-2S_{1}\mu_{1})/(2(\mu_{2}-\mu_{1}+S_{2}-S_{1})). The latter is satisfied under our 1-4/n probability event when C\geq\mu_{1}+\mu_{2}.

Finally we recall that we have conditioned so far on the event \mathcal{E} defined in Equation 17. We now examine (C)\mathbb{P}(\mathcal{E}^{C}) under the 14/n1-4/n event, and focus on the first condition in (17). We have

\displaystyle\widehat{R}_{1}(\widehat{f}_{1})-\widehat{R}_{2}(\widehat{f}_{1})=-S_{1}^{2}+S_{2}^{2}-(\Delta_{\mu}+\Delta_{S})^{2}
𝒪(lognn1+lognn)0,\displaystyle\leq\mathcal{O}\left(\frac{\log n}{n}-1+\sqrt{\frac{\log n}{n}}\right)\leq 0,

under the 14/n1-4/n probability event. Similar reasoning also shows that R^2(f^2)R^1(f^2)\widehat{R}_{2}(\widehat{f}_{2})\leq\widehat{R}_{1}(\widehat{f}_{2}) in this event, so that we get the conclusion of the proposition.
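The construction above is simple enough to simulate numerically. The sketch below is our own illustration (the grid search over f and the specific constants are not from the paper): it draws the mixture sample, forms the group-wise empirical risks \widehat{R}_{i}(f)=(f-\mu_{i})^{2}+1-2S_{i}(f-\mu_{i}), and compares the worst-group population risk of \widehat{f}_{\text{DRO}} to that of f_{\text{DRO}}=(\mu_{1}+\mu_{2})/2, which makes the fluctuation term \xi visible.

```python
import numpy as np

def dro_gap(n, mu1=0.0, mu2=1.0, seed=0):
    """Simulate the two-group example: x = mu_i + eps with eps = +/-1, group i chosen uniformly."""
    rng = np.random.default_rng(seed)
    group = rng.integers(1, 3, size=n)                       # z_2 in {1, 2}
    eps = rng.choice([-1.0, 1.0], size=n)
    S = np.array([eps[group == i].mean() for i in (1, 2)])   # group-wise noise averages S_1, S_2
    mus = np.array([mu1, mu2])

    # empirical worst-group risk, minimized over a fine grid standing in for F = [-C, C]
    grid = np.linspace(min(mus) - 1.0, max(mus) + 1.0, 20001)
    emp = np.max((grid[:, None] - mus) ** 2 + 1 - 2 * S * (grid[:, None] - mus), axis=1)
    f_dro_hat = grid[emp.argmin()]

    # worst-group population risks of the empirical and population DRO solutions
    pop = lambda f: np.max((f - mus) ** 2 + 1)
    return pop(f_dro_hat) - pop((mu1 + mu2) / 2.0)           # approximately xi^2 + |Delta_mu * xi|

# the gap decays roughly like sqrt(log n / n) as n grows
for n in (100, 1000, 10000):
    print(n, dro_gap(n))
```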

Appendix C Proofs for Section 5

We start with a technical lemma about the concentration of empirical estimates of the second moment of importance weights, which is required for Theorem 7.

Lemma 13

Under Assumption 2, we have with probability at least 1δ1-\delta, for all w𝒲w\in\mathcal{W}:

𝔼P0w(z)22ni=1nw(zi)2+𝒪(Bw2ln(|𝒲|/δ)n).\displaystyle{\mathbb{E}}_{P_{0}}w(z)^{2}\leq\frac{2}{n}\sum_{i=1}^{n}w(z_{i})^{2}+\mathcal{O}\left(\frac{B_{w}^{2}\ln(|\mathcal{W}|/\delta)}{n}\right).

Proof  Consider the non-negative random variable A=w(z)2A=w(z)^{2} so that ABw2A\leq B_{w}^{2} with probability 1 when zP0z\sim P_{0}. Then we have

𝔼P0[A2]\displaystyle{\mathbb{E}}_{P_{0}}[A^{2}] =𝔼P0[w(z)4]Bw2𝔼P0[w(z)2].\displaystyle={\mathbb{E}}_{P_{0}}[w(z)^{4}]\leq B_{w}^{2}{\mathbb{E}}_{P_{0}}[w(z)^{2}].

Then for a fixed w\in\mathcal{W}, Bernstein's inequality followed by the AM-GM inequality \sqrt{ab}\leq(a+b)/2 gives, with probability at least 1-\delta:

|1ni=1nw(zi)2𝔼P0w(z)2|\displaystyle\left|\frac{1}{n}\sum_{i=1}^{n}w(z_{i})^{2}-{\mathbb{E}}_{P_{0}}w(z)^{2}\right|
=𝒪(Bw2𝔼P0[w(z)2]ln(1/δ)n+Bw2ln(1/δ)n)\displaystyle=\mathcal{O}\left(\sqrt{\frac{B_{w}^{2}{\mathbb{E}}_{P_{0}}[w(z)^{2}]\ln(1/\delta)}{n}}+\frac{B_{w}^{2}\ln(1/\delta)}{n}\right)
𝔼P0[w(z)2]2+𝒪(Bw2ln(1/δ)n).\displaystyle\leq\frac{{\mathbb{E}}_{P_{0}}[w(z)^{2}]}{2}+\mathcal{O}\left(\frac{B_{w}^{2}\ln(1/\delta)}{n}\right).

Rearranging terms and taking a union bound over w𝒲w\in\mathcal{W} completes the proof.  


We are now ready to prove Theorem 7.

C.1 Proof of Theorem 7

Recall the definition σ^w2=1ni=1nw(zi)2\widehat{\sigma}^{2}_{w}=\frac{1}{n}\sum_{i=1}^{n}w(z_{i})^{2}. Plugging the result of Lemma 13 in Equation 13, we see that with probability at least 1δ1-\delta, for a fixed ff\in\mathcal{F} and w𝒲w\in\mathcal{W} we have:

|R^w(f)Rw(f)|\displaystyle\left|\widehat{R}_{w}(f)-R_{w}(f)\right| =𝒪((2σ^w2+Bw2/n)ln(2/δ)n+Bwln(2/δ)n)\displaystyle=\mathcal{O}\left(\sqrt{\frac{(2\widehat{\sigma}_{w}^{2}+B_{w}^{2}/n)\ln(2/\delta)}{n}}+\frac{B_{w}\ln(2/\delta)}{n}\right)
=𝒪(σ^w2ln(2/δ)n+Bwln(2/δ)n).\displaystyle=\mathcal{O}\left(\sqrt{\frac{\widehat{\sigma}_{w}^{2}\ln(2/\delta)}{n}}+\frac{B_{w}\ln(2/\delta)}{n}\right).

Now following the proof of Theorem 2, with probability at least 1δ1-\delta, we have for all w𝒲w\in\mathcal{W}:

Rw(f^SMRO)Rw(fw)\displaystyle R_{w}(\widehat{f}_{\text{SMRO}})-R_{w}(f_{w})
\displaystyle=\widehat{R}_{w}(\widehat{f}_{\text{SMRO}})-\widehat{R}_{w}(\widehat{f}_{w})+\mathcal{O}\left(\sqrt{\frac{\widehat{\sigma}_{w}^{2}(d_{\mathcal{F}}(\delta)+\log(|\mathcal{W}|/\delta))}{n}}+\frac{B_{w}(d_{\mathcal{F}}(\delta)+\log(|\mathcal{W}|/\delta))}{n}\right)
=R^w(f^SMRO)R^w(f^w)+(σ^w+Bwn)cw𝒪(d,𝒲(δ)n)ϵ.\displaystyle=\widehat{R}_{w}(\widehat{f}_{\text{SMRO}})-\widehat{R}_{w}(\widehat{f}_{w})+\underbrace{\left(\widehat{\sigma}_{w}+\frac{B_{w}}{\sqrt{n}}\right)}_{c_{w}}\,\underbrace{\mathcal{O}\left(\frac{d_{\mathcal{F},\mathcal{W}}(\delta)}{\sqrt{n}}\right)}_{\epsilon}.

Now dividing through by cwc_{w}, we see that for all ff\in\mathcal{F}, we have with probability at least 1δ1-\delta:

supw𝒲Rw(f^SMRO)Rw(fw)cw\displaystyle\sup_{w\in\mathcal{W}}\frac{R_{w}(\widehat{f}_{\text{SMRO}})-R_{w}(f_{w})}{c_{w}} supw𝒲R^w(f^SMRO)R^w(f^w)cw+ϵ\displaystyle\leq\sup_{w\in\mathcal{W}}\frac{\widehat{R}_{w}(\widehat{f}_{\text{SMRO}})-\widehat{R}_{w}(\widehat{f}_{w})}{c_{w}}+\epsilon
inffsupw𝒲R^w(f)R^w(f^w)cw+ϵ\displaystyle\leq\inf_{f\in\mathcal{F}}\sup_{w\in\mathcal{W}}\frac{\widehat{R}_{w}(f)-\widehat{R}_{w}(\widehat{f}_{w})}{c_{w}}+\epsilon
inffsupw𝒲Rw(f)Rw(f^w)cw+2ϵ\displaystyle\leq\inf_{f\in\mathcal{F}}\sup_{w\in\mathcal{W}}\frac{R_{w}(f)-R_{w}(\widehat{f}_{w})}{c_{w}}+2\epsilon
inffsupw𝒲Rw(f)Rw(fw)cw+2ϵ,\displaystyle\leq\inf_{f\in\mathcal{F}}\sup_{w\in\mathcal{W}}\frac{R_{w}(f)-R_{w}(f_{w})}{c_{w}}+2\epsilon,

where the second inequality follows from the definition of SMRO as the empirical optimizer of the objective (10). As a result, we have for any w𝒲w\in\mathcal{W}, with probability at least 1δ1-\delta:

\displaystyle R_{w}(\widehat{f}_{\text{SMRO}})-R_{w}(f_{w})\leq c_{w}\inf_{f\in\mathcal{F}}\sup_{w^{\prime}\in\mathcal{W}}\frac{R_{w^{\prime}}(f)-R_{w^{\prime}}(f_{w^{\prime}})}{c_{w^{\prime}}}+2c_{w}\epsilon.
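As with MRO, the scaled objective defining \widehat{f}_{\text{SMRO}} is easy to write down for finite \mathcal{F} and \mathcal{W}. The sketch below is illustrative only: it reuses the array layout from the earlier MRO snippet, and it estimates B_{w} by the empirical maximum of the weights, which is an assumption on our part rather than the paper's definition. It computes the per-weight scales c_{w}=\widehat{\sigma}_{w}+B_{w}/\sqrt{n} and the rescaled worst-case empirical regret.

```python
import numpy as np

def empirical_smro(losses, weights):
    """losses[k, i] = loss of candidate f_k on sample z_i; weights[j, i] = w_j(z_i).
    Minimizes sup_w [R_hat_w(f) - R_hat_w(f_hat_w)] / c_w with c_w = sigma_hat_w + B_w / sqrt(n)."""
    n = losses.shape[1]
    R_hat = losses @ weights.T / n                    # empirical weighted risks, one column per w
    baseline = R_hat.min(axis=0)                      # per-weight empirical minima
    sigma_hat = np.sqrt((weights ** 2).mean(axis=1))  # empirical second moment of each weight function
    B_w = weights.max(axis=1)                         # empirical proxy for the bound B_w
    c_w = sigma_hat + B_w / np.sqrt(n)
    smro_objective = ((R_hat - baseline) / c_w).max(axis=1)
    return int(smro_objective.argmin())
```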

C.2 Proof of Theorem 8

We will be terse as the proof is largely a combination of the proofs of Theorems 4 and 7. Proceeding as in the proof of Theorem 4, we see that with probability at least 1δ1-\delta, we have for all w𝒲w\in\mathcal{W} and ff\in\mathcal{F}:

Regretw(f)(1+γ)[R^w(f)R^w(f^w)]+(1+γ1)Bwcw𝒪(d,𝒲n)ϵ{\mathrm{Regret}}_{w}(f)\leq(1+\gamma)[\widehat{R}_{w}(f)-\widehat{R}_{w}(\widehat{f}_{w})]+(1+\gamma^{-1})\underbrace{B_{w}}_{c_{w}}\,\underbrace{\mathcal{O}\left(\frac{d_{\mathcal{F},\mathcal{W}}}{n}\right)}_{\epsilon}

and

[R^w(f)R^w(f^w)](1+γ)Regretw(f)+(1+γ1)Bwcw𝒪(d,𝒲n)ϵ[\widehat{R}_{w}(f)-\widehat{R}_{w}(\widehat{f}_{w})]\leq(1+\gamma){\mathrm{Regret}}_{w}(f)+(1+\gamma^{-1})\underbrace{B_{w}}_{c_{w}}\,\underbrace{\mathcal{O}\left(\frac{d_{\mathcal{F},\mathcal{W}}}{n}\right)}_{\epsilon}

for all γ>0\gamma>0.

Dividing through by cwc_{w} as before and taking a supremum over w𝒲w\in\mathcal{W}, we obtain with probability at least 1δ1-\delta:

\displaystyle\sup_{w\in\mathcal{W}}\frac{{\mathrm{Regret}}_{w}(\widehat{f}_{\text{SMRO}})}{c_{w}}\leq(1+\gamma)\sup_{w\in\mathcal{W}}\frac{\widehat{R}_{w}(\widehat{f}_{\text{SMRO}})-\widehat{R}_{w}(\widehat{f}_{w})}{c_{w}}+(1+\gamma^{-1})\epsilon
(1+γ)inffsupw𝒲R^w(f)R^w(f^w)cw+(1+γ1)ϵ\displaystyle\leq(1+\gamma)\inf_{f\in\mathcal{F}}\sup_{w\in\mathcal{W}}\frac{\widehat{R}_{w}(f)-\widehat{R}_{w}(\widehat{f}_{w})}{c_{w}}+(1+\gamma^{-1})\epsilon
(1+γ)2inffsupw𝒲Regretw(f)cw+(1+γ1)(2+γ)ϵ.\displaystyle\leq(1+\gamma)^{2}\inf_{f\in\mathcal{F}}\sup_{w\in\mathcal{W}}\frac{{\mathrm{Regret}}_{w}(f)}{c_{w}}+(1+\gamma^{-1})(2+\gamma)\epsilon.

Now following the remaining proof of Theorem 7 gives the desired bound.

Appendix D Well-specified linear regression with covariate shift

Let us consider a special case of Theorem 8, where ={βx:βd,β21}\mathcal{F}=\{\beta^{\top}x~{}:~{}\beta\in\mathbb{R}^{d},\|\beta\|_{2}\leq 1\} is the class of linear prediction functions with unit norm weights and the data satisfies x21\|x\|_{2}\leq 1. We further assume that y=xβ+νy=x^{\top}\beta^{\star}+\nu, where β21\|\beta^{\star}\|_{2}\leq 1 and ν\nu is zero-mean noise such that |y|C|y|\leq C for some constant CC (this just causes additional scaling with CC in the bounds of Theorems 4 and 8). Suppose further that the covariance ΣP0:=𝔼xP0xx\Sigma_{P_{0}}:={\mathbb{E}}_{x\sim P_{0}}xx^{\top} is full rank with the smallest eigenvalue equal to λ\lambda. Let β^P0\widehat{\beta}_{P_{0}} be the ordinary least squares estimator, given samples from P0P_{0}. Then it is well-known that with probability at least 1δ1-\delta,

β^P0βΣP02=𝒪(dln(1/δ)n).\displaystyle\|\widehat{\beta}_{P_{0}}-\beta^{\star}\|_{\Sigma_{P_{0}}}^{2}=\mathcal{O}\left(\frac{d\ln(1/\delta)}{n}\right). (18)

Consequently, we have for any other distribution PP over (x,y)(x,y):

RP(β^P0)RP(β)=𝔼P[(xβ^P0xβ)2]=𝒪(dln(1/δ)λn),\displaystyle R_{P}(\widehat{\beta}_{P_{0}})-R_{P}(\beta^{\star})={\mathbb{E}}_{P}[(x^{\top}\widehat{\beta}_{P_{0}}-x^{\top}\beta^{\star})^{2}]=\mathcal{O}\left(\frac{d\ln(1/\delta)}{\lambda n}\right),

where the inequality uses x21\|x\|_{2}\leq 1 and that the smallest eigenvalue of ΣP0\Sigma_{P_{0}} is at least λ\lambda. That is, doing OLS under P0P_{0} yields a strong guarantee for all target distributions PP, since we get pointwise accurate predictions in this scenario. However, directly applying the results of Theorem 4 would still have additional scaling with BwB_{w} for any target distribution with importance weights ww. On the other hand, let us consider the bound of Corollary 9 for any class 𝒲\mathcal{W} such that w01w_{0}\equiv 1 is in 𝒲\mathcal{W}, so that we always include P0P_{0} in our class of target distributions. Since Bw0=1B_{w_{0}}=1 and d=dln(1δ)d_{\mathcal{F}}=d\ln(\frac{1}{\delta}) in this case, the second bound of the corollary yields with probability at least 1δ1-\delta:

β^SMROβΣP02=RP0(β^SMRO)RP0(β)=𝒪((dln(1/δ)+ln|𝒲|δ)n).\displaystyle\|\widehat{\beta}_{\text{SMRO}}-\beta^{\star}\|_{\Sigma_{P_{0}}}^{2}=R_{P_{0}}(\widehat{\beta}_{\text{SMRO}})-R_{P_{0}}(\beta^{\star})=\mathcal{O}\left(\frac{(d\ln(1/\delta)+\ln\tfrac{|\mathcal{W}|}{\delta})}{n}\right).

This is comparable to the prediction error bound in (18) for ERM on P_{0}, incurring only an additional \ln|\mathcal{W}| term. Note that ERM is asymptotically efficient in this setting, so we cannot expect to do better, and we suffer only a small penalty for our worst-case robustness. The approach of learning on P_{0} alone is of course not robust to misspecification of the linear regression assumption. The guarantee of Theorem 4, in contrast, incurs a bound \sup_{w\in\mathcal{W}}B_{w}\,\mathcal{O}\left(\frac{d\ln(1/\delta)+\ln\tfrac{|\mathcal{W}|}{\delta}}{n}\right), which can be significantly worse.
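A quick simulation illustrates why the additive \ln|\mathcal{W}| price is the right comparison here: when the linear model is well-specified, an OLS fit on P_{0} already transfers to covariate-shifted test distributions. The sketch below is our own illustration with an arbitrary choice of training and shifted covariate distributions; it only demonstrates the qualitative point, not the constants in (18).

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 2000
beta_star = rng.normal(size=d)
beta_star /= np.linalg.norm(beta_star)

# training data from P_0 (covariates scaled so that ||x||_2 stays moderate)
X = rng.normal(size=(n, d)) / np.sqrt(d)
y = X @ beta_star + 0.1 * rng.normal(size=n)
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]        # OLS fit on P_0

# excess risk under a covariate-shifted test distribution P (same conditional of y given x)
X_shift = rng.normal(loc=0.3, size=(10000, d)) / np.sqrt(d)
excess = np.mean((X_shift @ (beta_hat - beta_star)) ** 2)
print(excess)   # small: pointwise-accurate predictions transfer to any covariate distribution
```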

Appendix E Proofs for Section 6

It is clearly seen that the updates of \rho_{t} correspond to the exponentiated gradient (EG) algorithm applied to the linear objective -{\mathbb{E}}_{w\sim\rho}[\widehat{R}_{w}(f_{t})-\widehat{R}_{w}(\widehat{f}_{w})]/B at round t, where the negative sign arises because the EG algorithm is stated for minimization problems, while we apply it to a maximization problem. The regret guarantee for EG, specifically Corollary 2.14 of Shalev-Shwartz [2012], states that for any w\in\mathcal{W}

t=1T𝔼wρtR^w(ft)R^w(f^w)Bt=1TR^w(ft)R^w(f^w)B+ln|𝒲|ηB+ηBT.\displaystyle-\sum_{t=1}^{T}{\mathbb{E}}_{w^{\prime}\sim\rho_{t}}\frac{\widehat{R}_{w^{\prime}}(f_{t})-\widehat{R}_{w^{\prime}}(\widehat{f}_{w^{\prime}})}{B}\leq-\sum_{t=1}^{T}\frac{\widehat{R}_{w}(f_{t})-\widehat{R}_{w}(\widehat{f}_{w})}{B}+\frac{\ln|\mathcal{W}|}{\eta B}+\eta BT.

Multiplying through by BB and substituting ηB=ln|𝒲|/T\eta B=\sqrt{\ln|\mathcal{W}|/T} gives

t=1T𝔼wρt[R^w(ft)R^w(f^w)]t=1T[R^w(ft)R^w(f^w)]+2BTln|𝒲|.\displaystyle-\sum_{t=1}^{T}{\mathbb{E}}_{w^{\prime}\sim\rho_{t}}[\widehat{R}_{w^{\prime}}(f_{t})-\widehat{R}_{w^{\prime}}(\widehat{f}_{w^{\prime}})]\leq-\sum_{t=1}^{T}[\widehat{R}_{w}(f_{t})-\widehat{R}_{w}(\widehat{f}_{w})]+2B\sqrt{T\ln|\mathcal{W}|}.

Now recalling the definition Pt=(f1++ft)/tP_{t}=(f_{1}+\ldots+f_{t})/t, following the proof technique of Freund and Schapire [1996] gives that

𝔼fPTsupwR^w(f)R^w(fw)B\displaystyle{\mathbb{E}}_{f\sim P_{T}}\sup_{w}\frac{\widehat{R}_{w}(f)-\widehat{R}_{w}(f_{w})}{B}
=1Tt=1TsupwR^w(ft)R^w(fw)B\displaystyle=\frac{1}{T}\sum_{t=1}^{T}\sup_{w}\frac{\widehat{R}_{w}(f_{t})-\widehat{R}_{w}(f_{w})}{B}
1Tt=1T𝔼wρtR^w(ft)R^w(f^w)B+2ln|𝒲|T\displaystyle\leq\frac{1}{T}\sum_{t=1}^{T}{\mathbb{E}}_{w\sim\rho_{t}}\frac{\widehat{R}_{w}(f_{t})-\widehat{R}_{w}(\widehat{f}_{w})}{B}+2\sqrt{\frac{\ln|\mathcal{W}|}{T}}
(a)1Tt=1Tinff𝔼wρtR^w(f)R^w(f^w)B+2ln|𝒲|T\displaystyle\stackrel{{\scriptstyle(a)}}{{\leq}}\frac{1}{T}\sum_{t=1}^{T}\inf_{f\in\mathcal{F}}{\mathbb{E}}_{w\sim\rho_{t}}\frac{\widehat{R}_{w}(f)-\widehat{R}_{w}(\widehat{f}_{w})}{B}+2\sqrt{\frac{\ln|\mathcal{W}|}{T}}
inff1Tt=1T𝔼wρtR^w(f)R^w(f^w)B+2ln|𝒲|T\displaystyle\leq\inf_{f\in\mathcal{F}}\frac{1}{T}\sum_{t=1}^{T}{\mathbb{E}}_{w\sim\rho_{t}}\frac{\widehat{R}_{w}(f)-\widehat{R}_{w}(\widehat{f}_{w})}{B}+2\sqrt{\frac{\ln|\mathcal{W}|}{T}}
inffsupw𝒲R^w(f)R^w(f^w)B+2ln|𝒲|T,\displaystyle\leq\inf_{f\in\mathcal{F}}\sup_{w\in\mathcal{W}}\frac{\widehat{R}_{w}(f)-\widehat{R}_{w}(\widehat{f}_{w})}{B}+2\sqrt{\frac{\ln|\mathcal{W}|}{T}},

where (a)(a) follows from the best response property of ftf_{t} for ρt\rho_{t} (12).
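To make the procedure analyzed above concrete, here is a minimal sketch of the exponentiated-gradient outer loop together with the best-response step (12). The finite candidate set standing in for the best-response oracle and the specific array layout are our own simplifications; only the multiplicative update, the step size \eta B=\sqrt{\ln|\mathcal{W}|/T}, and the averaging over iterates mirror the analysis.

```python
import numpy as np

def eg_mro(R_hat, baseline, B, T):
    """R_hat[k, j]: empirical weighted risk of candidate f_k under w_j;
    baseline[j]: empirical risk of the per-weight ERM under w_j.
    Runs T rounds of exponentiated gradient on rho with best responses f_t and
    returns the distribution P_T over candidates (uniform over the selected iterates)."""
    num_f, num_w = R_hat.shape
    payoffs = (R_hat - baseline) / B              # scaled empirical regrets, roughly in [0, 1]
    rho = np.full(num_w, 1.0 / num_w)             # rho_1: uniform over the weight functions
    eta_B = np.sqrt(np.log(num_w) / T)            # step size eta * B from the analysis
    counts = np.zeros(num_f)
    for _ in range(T):
        f_t = int((payoffs @ rho).argmin())       # best response: minimize the rho_t-average regret
        counts[f_t] += 1
        rho = rho * np.exp(eta_B * payoffs[f_t])  # EG step for the maximizing rho player
        rho /= rho.sum()
    return counts / T
```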