This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Examining and Combating Spurious Features under Distribution Shift

Chunting Zhou    Xuezhe Ma    Paul Michel    Graham Neubig
Abstract

A central goal of machine learning is to learn robust representations that capture the causal relationship between inputs features and output labels. However, minimizing empirical risk over finite or biased datasets often results in models latching on to spurious correlations between the training input/output pairs that are not fundamental to the problem at hand. In this paper, we define and analyze robust and spurious representations using the information-theoretic concept of minimal sufficient statistics. We prove that even when there is only bias of the input distribution (i.e. covariate shift), models can still pick up spurious features from their training data. Group distributionally robust optimization (DRO) provides an effective tool to alleviate covariate shift by minimizing the worst-case training loss over a set of pre-defined groups. Inspired by our analysis, we demonstrate that group DRO can fail when groups do not directly account for various spurious correlations that occur in the data. To address this, we further propose to minimize the worst-case losses over a more flexible set of distributions that are defined on the joint distribution of groups and instances, instead of treating each group as a whole at optimization time. Through extensive experiments on one image and two language tasks, we show that our model is significantly more robust than comparable baselines under various partitions. Our code is available at https://github.com/violet-zct/group-conditional-DRO.

Machine Learning, distributionally robust optimization, group robustness, noisy groups, ICML

1 Introduction

Many machine learning models that minimize the average training loss via empirical risk minimization (ERM) are trained and evaluated on randomly shuffled and split training and test sets. However, such in-distribution learning setups can hide critical issues: models that achieve high accuracy on average often underperform when the test distribution drifts away from the training one (Hashimoto et al., 2018; Koenecke et al., 2020; Koh et al., 2020). Such models are often “right for the wrong reasons” due to reliance on spurious correlations (or “dataset biases”) (Torralba & Efros, 2011; Goyal et al., 2017; McCoy et al., 2019; Gururangan et al., 2018), heuristics that hold for most training examples but are not inherent to the task of interest, such as strong associations between the presence of green pastures background with the label “cows” in image classification. Naturally, models that use such features will fail when tested on data where the correlation does not hold.

Refer to caption
Figure 1: Consider data points xx in 2\mathbb{R}^{2} with two classes yy. The vertical axis of xx is a spurious feature that highly correlates with yy, and the horizontal axis is the robust feature. There are two subclasses in each class, where the top-right and lower-left are two minority subclasses. The robust accuracy is test worst-case accuracy over the four subclasses. We train a linear classifier with different methods. For models trained with the clean partitions, each subclass is a group. For the imperfect partitions, dots with the same shape is a group (best viewed in color).

Recent work has investigated how models trained with ERM learn spurious features that do not generalize, from the points of view of causality (Arjovsky et al., 2019), understanding model overparameterization (Sagawa et al., 2020b) and information theory (Lovering et al., 2021). However, these works have not characterized the idea of spurious features mathematically. In this paper, we characterize spurious features from an information-theoretic perspective. We consider prediction of target random variable Y𝒴Y\in\mathcal{Y} from input variable X𝒳X\in\mathcal{X} and characterize spurious features learned under changes to the input distribution p(X)p(X) (i.e. covariate shift).

A central goal of machine learning is to learn true causal relationships between XX and YY in a manner robust to spurious factors concerning the variables. We assume that there exists an “ideal” data distribution pidealp_{\text{ideal}} (short for pideal(X,Y)p_{\text{ideal}}(X,Y) below) which contains data from all possible experimental conditions concerning the confounders that cause spurious correlations, both observable and hypothetical (Lewis, 2013; Arjovsky et al., 2019; Bellot & van der Schaar, 2020). For example, consider the problem of classifying images of cows and camels (Beery et al., 2018). Under the ideal conditions, we assume that pictures of cows and camels on any background can be collected, including cows in deserts and camels in green pastures. Therefore, under pidealp_{\text{ideal}} the background of the image XX is no longer a spurious factor of the label YY. However, such an “ideal” distribution pidealp_{\text{ideal}} is not accessible in practice (Bahng et al., 2020; Koh et al., 2020; McCoy et al., 2019), and our training distribution ptrainp_{\text{train}} (often, in practice, an associated empirical distribution) does not match pidealp_{\text{ideal}}. ERM-based learning algorithms indiscriminately fit all correlations found in ptrainp_{\text{train}}, including spurious correlations based on confounders (Tenenbaum, 2018; Lopez-Paz, 2016).

To investigate the spurious features learned under the distribution shift from pidealp_{\text{ideal}} to ptrainp_{\text{train}}, we first characterize those features of XX which most efficiently capture all possible information needed to predict YY. We define these robust features using the notion of minimal sufficient statistic (MSS) (Dynkin, 2000; Cvitkovic & Koliander, 2019) under pidealp_{\text{ideal}}. We then examine whether the features learned under ptrainp_{\text{train}} contain spurious features compared to the MSS learned under pidealp_{\text{ideal}}. Through our analysis, we find that even only with covariate shift, the features learned on ptrainp_{\text{train}} can contain spurious features or miss robust features of pidealp_{\text{ideal}}.

Models that fit spurious correlations in ptrainp_{\text{train}} can be vulnerable to groups (subpopulations of pidealp_{\text{ideal}}/ptestp_{\text{test}}) where the correlation does not hold. A common approach to avoid learning a model that suffers high worst group errors is group distributionally robust optimization (group DRO), a training procedure that efficiently minimizes the worst expected loss over a set of groups in the training data (Oren et al., 2019; Sagawa et al., 2020a). The partition of groups can be defined in several ways, such as by presence of manually identified potentially spurious features (Sagawa et al., 2020a), data domains (Koh et al., 2020), or topics of text (Oren et al., 2019). In a typical setup, the groups of interest in the test set align with those used to partition the training data. Under such setups, group DRO usually outperforms ERM with respect to the worst-group accuracy. We contend that this is because it promotes learning robust features that perform uniformly well across all groups. However, in many tasks, we can not collect clean group membership of training examples due to expensive annotation cost or privacy concerns regarding e.g. demographic identities of users or other sensitive information.

Inspired by our analysis of spurious features, we demonstrate that group DRO can fail under “imperfect” partitions of training data that are not consistent with the test set, especially when reducing spurious correlation in one group could exacerbate the spurious correlations in another (§4.2), as shown in Fig. 1. This is because group DRO treats each training group as a unit, preventing it from adjusting learning weights differently for subgroups within each group. Recent work has proposed to use sophisticated unsupervised clustering algorithm to search for meaningful subclasses (Sohoni et al., 2020) and execute group DRO on the found subclasses. To learn robust models under noisy protected groups, Wang et al. (2020) designs robust approaches that is based on an estimate of a noise model between the clean and noisy groups. Instead of relying on good partitions of groups or a not readily available noise model, we propose group-conditional DRO (GC-DRO) that defines the uncertainty set over the joint distribution of groups and their instances (i.e. q(G)q(X,Y|G)q(G)q(X,Y|G)). Every training example is reweighted by both its group weight and the instance-level weight, which offers a more flexible uncertainty set compared to group DRO. Through extensive experiments on three tasks — facial attribute classification, natural language inference, and toxicity detection, we show that GC-DRO significantly outperforms both ERM and group DRO in various partitions of training data and demonstrate the robustness of GC-DRO against various group partitions.

2 Preliminaries on Robust Representations

To study spurious features, we need to formally define which features or properties of the data describe spurious correlations, and which features are robust features relevant to the task at hand. In supervised learning we are interested in finding a good representation T(X)T(X) of the input XX111We assume that T(X)T(X) is a deterministic mapping of XX given neural network parameters. that is useful to predict a target label YY. What characterizes the optimal representations of XX w.r.t. YY is much debated, but a common assertion is that T(X)T(X) should be a minimal sufficient statistic (MSS) of XX for YY (Adragni & Cook, 2009; Schwartz-Ziv & Tishby, 2017; Achille & Soatto, 2018; Cvitkovic & Koliander, 2019), which is:

(i) T(X)T(X) should be sufficient for YY, i.e. x𝒳,t𝒯,y𝒴,p(x|t,y)=p(x|t)\forall x\in\mathcal{X},t\in\mathcal{T},y\in\mathcal{Y},p(x|t,y)=p(x|t), which is equivalent to p(y|t,x)=p(y|t)p(y|t,x)=p(y|t). This means given the value of T(X)T(X), the distribution of XX does not depend on the value of YY.

(ii) Given that T(X)T(X) is sufficient, it should be minimal w.r.t. XX, i.e. for any sufficient statistic SS, there exists a deterministic function ff such that T=f(S)T=f(S) almost everywhere w.r.t. XX. This means for any measurable, non-invertible function gg, g(T)g(T) is no longer sufficient for YY.

In other words, the minimal sufficient statistics most efficiently capture all information useful for predicting YY. The notion of MSS has been connected to Shannon’s information theory (Kullback & Leibler, 1951; Cover, 1999) and extended to any joint distribution P(X,Y)P(X,Y) of XX and YY in the information bottleneck (IB) framework (Tishby et al., 2000; Shamir et al., 2010; Kolchinsky et al., 2019), which provides a principled way to characterize the extraction of relevant information from XX for predicting YY. Loosely speaking, learning a MSS TT is equivalent to maximizing I(T(X);Y)I(T(X);Y) (sufficiency) and minimizing I(X;T(X))I(X;T(X)) (minimality).

Robust Features. Suppose 𝒜\mathcal{A} contains all possible combinations of spurious variables, both observable and hypothetical, and we consider datasets 𝒟(a,y)={xi}i=1Na,y\mathcal{D}_{(a,y)}=\{x_{i}\}_{i=1}^{N_{a,y}} collected under each condition of (a𝒜,y𝒴)(a\in\mathcal{A},y\in\mathcal{Y}), where each 𝒟(a,y)\mathcal{D}_{(a,y)} contains examples that are i.i.d. according to some probability distribution p(X|y,a)p(X|y,a). We define pidealp_{\text{ideal}} as the mixture distribution of p(X|y,a)p(X|y,a) with uniform weights over (a,y)𝒜×𝒴(a,y)\in\mathcal{A}\times\mathcal{Y}. Thus, MSS learned on pidealp_{\text{ideal}} provide a good candidate for robust features T(X)T(X) (sometimes denoted Tideal(X)T_{ideal}(X) for clarity), which most efficiently capture the information from XX necessary for predicting YY on a distribution that is free of spurious factors.

Spurious Features. In contrast, we define representations T(X)T^{\prime}(X) that contain spurious features. Specifically, the entropy of T(X)T^{\prime}(X) conditioned on T(X)T(X) under pidealp_{\text{ideal}} is positive.

Hideal(T(X)|T(X))>0\small H_{{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}ideal}}(T^{\prime}(X)|T(X))>0\vspace{-1mm} (1)

Because these learned features are not deterministic given T(X)T(X) then they contain additional information that is not useful for predicting YY.222Note that it is not just the case of T(X)T^{\prime}(X) containing redundant features, in which case H(T(X)|T(X))=0H(T^{\prime}(X)|T(X))=0. For example, in image classification, knowing that the image contains a horse, we cannot predict the background with certainty (a horse could be on a race track or a beach). Another example in natural language inference (NLI) task is that model learned on a biased data set often associates negation with the label “contradiction”. This is another spurious feature under our definition, because given the meaning of a sentence (robust features), whether it contains negation or not is not deterministic, e.g. “Don’t worry.” and “Be calm.” are synonymous but only one contains negation. A classifier that uses these spurious features can suffer from the risk of learning the spurious correlations between T(X)T^{\prime}(X) and the labels YY.

3 Spurious Features under Covariate Shift

The training data is often marred by various abnormalities, such as selection biases (Buolamwini & Gebru, 2018) and confounding factors (Gururangan et al., 2018). We ask if the MSS learned under ptrainp_{\text{train}} are robust features under pidealp_{\text{ideal}}. Note that we do not study how to learn MSS via ERM in this paper, on the other hand, considering that MSS provides a good candidate for robust representations, we want to study if the MSS learned under ptrainp_{\text{train}} contains spurious features with respect to the MSS learned under pidealp_{\text{ideal}}, which are universal robust features against various spurious factors.

We consider the distribution shift in p(X)p(X),333It is often assumed that p(Y|X)p(Y|X) is invariant in supervised learning problems (Arjovsky et al., 2019). also known as covariate shift (David et al., 2010), and we show that the entropy of MSS learned under ptrainp_{\text{train}} conditioned on the robust features is zero in Theorem 1 with proofs in §A.

Theorem 1.

Suppose that there is only covariate shift in ptrainp_{\text{train}}, i.e. x𝒳train s.t. ptrain(x)pideal(x)\exists x\in\mathcal{X}_{train}\text{~{}s.t.~{}}p_{\text{train}}(x)\neq p_{\text{ideal}}(x) but ptrain(Y|X=x)=pideal(Y|X=x),x𝒳trainp_{\text{train}}(Y|X=x)=p_{\text{ideal}}(Y|X=x),\,\,\forall x\in\mathcal{X}_{train}. Let Ttrain(X)T_{\text{train}}(X) be the MSS representation learned under ptrainp_{\text{train}}, then we have:

Htrain(Ttrain(x)|Tideal(x))=0.\small H_{{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}train}}(T_{train}(x)|T_{ideal}(x))=0.\vspace{-2mm} (2)

Theorem 1 tells us that Ttrain(X)T_{train}(X) is deterministic given Tideal(X)T_{ideal}(X) under ptrain{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}p_{\text{train}}} (shown in blue to distinguish from Eq. 1). However, this does not imply Hideal(Ttrain(X)|Tideal(X))=0H_{ideal}(T_{train}(X)|T_{ideal}(X))=0 under pideal{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}p_{\text{ideal}}}. Thus, we cannot conclude that Ttrain(X)T_{train}(X) contains no spurious features. We further discuss the implications with two cases based on the relationship between the support of input 𝒳train\mathcal{X}_{train} and that of 𝒳ideal\mathcal{X}_{ideal}: (1) 𝒳train=𝒳ideal\mathcal{X}_{train}=\mathcal{X}_{ideal} and (2) 𝒳train𝒳ideal\mathcal{X}_{train}\subset\mathcal{X}_{ideal}. When the input support of ptrainp_{\text{train}} is equal to that of pidealp_{\text{ideal}}, we have the following corollary:

Corollary 1.

Suppose 𝒳train\mathcal{X}_{train} = 𝒳ideal\mathcal{X}_{ideal} in Theorem 1, then Ttrain(X)T_{train}(X) is also the MSS under pidealp_{\text{ideal}}.

Corollary 1 corroborates the findings in Wen et al. (2014) that the (unweighted) solution learned by ERM is also the robust solution when only covariate shift exists and 𝒳train=𝒳ideal\mathcal{X}_{train}=\mathcal{X}_{ideal}. In practice, however, this assumption does not hold (because we only have datasets with limited support) and thus the representation Ttrain(X)T_{train}(X) learned by ERM is not necessarily equivalent to Tideal(X)T_{ideal}(X). By Theorem 1, Ttrain(X)T_{train}(X) is deterministic given Tideal(X)T_{ideal}(X) under ptrainp_{\text{train}}, which implies that the information contained in Ttrain(X)T_{train}(X) is equal to or less than that contained in Tideal(X)T_{ideal}(X). In the former case, Ttrain(X)T_{train}(X) can be equivalent in representation to Tideal(X)T_{ideal}(X) but can also contain spurious features that co-occur with the robust features in the training data. In the latter case, Ttrain(X)T_{train}(X) can miss robust features in Tideal(X)T_{ideal}(X). We demonstrate these two cases with synthetic experiments in Appendix D due to space limit.

Discussion. We have discussed the cases of learning spurious features when the model learns MSS under ptrainp_{\text{train}}. However, we normally adopt maximum likelihood estimation (MLE) as an instantiation of ERM for classification problems. We provide the connection of MLE with learning MSS via the information bottleneck method (Tishby et al., 2000; Shamir et al., 2010) in the Appendix B, where under certain assumptions, we can view MLE as an objective that approximately learns MSS.

4 Does Group DRO Learn Robust Features?

The discussions in §3 suggest that under covariate shift, directly learning from the empirical data distribution can result in learning the spurious correlations satisfied by the majority of the training data. When the spurious factors are known, we can apply group distributionally robust optimization (group DRO), which reweights the losses of different groups associated with spurious factors to alleviate covariate shift and learn robust features that generalize to both minority and majority groups. In this section, we first review group DRO and discuss under which cases it can fail.

4.1 Group Distributionally Robust Optimization

Group DRO is an instance of distributionally robust optimization (Ben-Tal et al., 2013; Duchi et al., 2016) that minimizes the worst expected loss over a set of potential test distributions 𝒬\mathcal{Q} (the uncertainty set):

DRO(θ)=supq𝒬𝔼(x,y)q[(x,y;θ)]\small\mathcal{L}_{\mathrm{DRO}}(\theta)=\sup_{q\in\mathcal{Q}}\mathbb{E}_{(x,y)\sim q}\big{[}\ell(x,y;\theta)\big{]}\vspace{-2mm} (3)

This worst-case objective upper bounds the test risk for all qtest𝒬q_{\mathrm{test}}\in\mathcal{Q}, which is useful for learning under train-test distribution shift. However, its success crucially depends on choosing an adequate uncertainty set that encodes the possible test distributions of interest. Choosing a general family of distribution as the uncertainty set, such as a divergence ball around the training distribution (Ben-Tal et al., 2013; Hu & Hong, 2013; Gao & Kleywegt, 2016), encompasses a wide set of distribution shifts, but can also lead to a conservative objective emphasizing implausible worst-case distributions (Duchi et al., 2019; Oren et al., 2019).

To construct a viable uncertainty set, one can optimize models over all meaningful subpopulations or groups gg depending on the available source information regarding the data, such as domains, demographics, topics, etc. Group DRO (Hu et al., 2018; Oren et al., 2019) leverages such structural information and constructs the uncertainty set as any mixture of these groups. Following Oren et al. (2019), we adopt the conditional value at risk (CVaR) which is a type of distributionally robust risk to achieve low losses on all α\alpha-fraction subpopulations (Rockafellar et al., 2000) of the training distribution (i.e. {p:αp(x)ptrain(x),x}\{p:\alpha p(x)\leq p_{\text{train}}(x),\forall x\}). As we assume that each data point comes from some group p(x,y|g)p(x,y|g) and ptrainp_{\text{train}} is a mixture of mm groups ptrain(g)p_{\text{train}}(g), we can extend the definition of CVaR to groups and construct the uncertainty set 𝒬\mathcal{Q} as all group distributions that are α\alpha-covered by ptrain(g)p_{\text{train}}(g) (or topic CVaR (Oren et al., 2019)):

𝒬={q:q(g)ptrain(g)αg}\small\mathcal{Q}=\left\{q:q(g)\leq\frac{p_{\text{train}}(g)}{\alpha}~{}~{}\forall g\right\}\vspace{-1mm} (4)

This upper bounds the group distribution within the uncertainty set by its corresponding training distribution. The group DRO objective then minimizes the expected loss under the worst-case group distribution:

GDRO=supq𝒬𝔼gq𝔼(x,y)p(x,y|g)[(x,y;θ)]\small\mathcal{L}_{\mathrm{GDRO}}=\sup_{q\in\mathcal{Q}}\mathbb{E}_{g\sim q}\mathbb{E}_{(x,y)\sim p(x,y|g)}\left[\ell(x,y;\theta)\right]\vspace{-1mm} (5)

Intuitively, this objective encourages uniform losses across different groups, which allows us to learn a model that is robust to group shifts. We adopt the efficient online greedy algorithm developed in Oren et al. (2019) to update the model parameters θ\theta and the worst-case distribution qq in an interleaved manner. The greedy algorithm roughly amounts to upweighting the sample losses by 1α\frac{1}{\alpha} which belong to the α\alpha-fraction of groups that have the worst losses. We present the detailed algorithm in Appendix C.

4.2 Group DRO Can Fail with Imperfect Partitions

As discussed earlier, we aim to learn a model that is robust to spurious factors. For example, in toxicity detection, a robust model should perform equally well on data from different demographic groups. Group DRO mitigates covariate shift by minimizing the worst-case loss under the uncertainty set 𝒬\mathcal{Q}, consisting of mixtures of sub-group distributions. Intuitively, given that optimizing pidealp_{\text{ideal}} allows for learning of robust, non-spurious features, defining a 𝒬\mathcal{Q} that covers pidealp_{\text{ideal}} is highly advantageous from a learning perspective.

If we know all the spurious attributes of the training data 𝒜\mathcal{A}, we can adopt the setup in Sagawa et al. (2020a) that divides the data into |𝒜|×|𝒴||\mathcal{A}|\times|\mathcal{Y}| groups, where each example belongs to one of the groups g=(a,y)g=(a,y). We define such grouping strategy as “clean partitions” in which each group is uniquely associated with one value of (a,y)(a,y).444Our discussions also apply to multiple spurious attributes for which the clean partition corresponds to |𝒴|×i|𝒜i||\mathcal{Y}|\times\prod_{i}|\mathcal{A}_{i}| groups. If 𝒜\mathcal{A} contains all the spurious factors of interest, it can be seen that there exists some mixture of groups g=1mq(g)ptrain(g)\sum_{g=1}^{m}q(g)p_{\text{train}}(\cdot\mid g) that can recover pidealp_{\text{ideal}}, where qΔmq\in\Delta_{m} and Δm\Delta_{m} is the (m1)(m-1)-dimensional probability simplex. Thus, pidealp_{\text{ideal}} is contained in 𝒬\mathcal{Q}. Such clean partitions provide a plausible environment for group DRO to learn well in the presence of covariate shift that causes spurious correlations in the training data.

G1G_{1} G2G_{2}
S=0S=0 S=1S=1 S=0S=0 S=1S=1
P(Y=0|S)P(Y=0|S) 0.5 0 1 0.5
P(Y=1|S)P(Y=1|S) 0.5 1 0 0.5
Table 1: An example of imperfect partition.

In contrast, we define “imperfect partitions” where each group contains samples from multiple values of (a,y)(a,y) such that there does not exist a qΔmq\in\Delta_{m} that recovers pidealp_{\text{ideal}}, in other words, 𝒬\mathcal{Q} does not include pidealp_{\text{ideal}}. In this case, group DRO can not eliminate covariate shift effectively.

To illustrate, consider a binary random variable S{0,1}S\in\{0,1\} following a uniform distribution, and the target label Y{0,1}Y\in\{0,1\} also follows a uniform distribution and is independent of SS. Due to covariate shift, there are spurious correlations between S=0S=0, Y=0Y=0 and between S=1S=1, Y=1Y=1 in the training data. We partition the training data into two groups with an equal number of samples and the conditional distribution of P(Y|S)P(Y|S) is shown in Tab. 1. To prevent the model from learning the spurious correlations between S=1S=1 and Y=1Y=1, one can upweight losses of its “negative” samples for which the spurious correlation does not hold, i.e. samples of (S=1,Y=0)(S=1,Y=0) in G2G_{2}; however, group DRO upweights the group as a whole, which inevitably also upweights the (S=0,Y=0)(S=0,Y=0) and causes the model to latch on the spurious attribute S=0S=0 to predict Y=0Y=0. Therefore, there does not exist a mixture distribution of these two groups, under which SYS\perp Y (pidealp_{\text{ideal}}). Such underlying conflicts prevent the group DRO from formulating a worst-case distribution that can eliminate covariate shift, resulting in a passive reliance on certain spurious correlations.

Imperfect partitions of training data are common in practice, as it can be expensive or infeasible to acquire the labels of spurious attributes for each training instance. For example, we may only have rough partitions based on the data sources or the outputs from (unsupervised) clustering algorithms. Our analysis shows that under these practical settings, the group DRO algorithm can not effectively alleviate covariate shift due to the rigid treatment of group losses.

Input: α\alpha; β\beta; mm: #groups; nin_{i}: #samples of group ii
Initialize historical average group losses L^(0)\hat{L}^{(0)}, historical estimate of group probabilities p^tr(0)\hat{p}^{tr(0)}, historical average instance losses L^g(0)\hat{L}_{g}^{(0)} and q(0)(x,y|g)=𝟏Tq^{(0)}(x,y|g)=\bm{1}^{T} for g{1,,m}g\in\{1,\cdots,m\}
for t=1,,Tt=1,\cdots,T do
       Sample a mini-batch (𝐱,𝐲,𝐠)({\mathbf{x}},{\mathbf{y}},{\mathbf{g}}) from PtrainP_{train}
       Perform online greedy updates for q(t)(Alg.2)q^{(t)}(Alg.\ref{alg:gdro})
       \triangleright Update model parameters θ\theta
       di=niq(t)(𝐠i)q(t)(x,y|𝐠i)p^train(t)(𝐠i)(𝐱i,𝐲i;θ(t1))d_{i}=\frac{n_{i}q^{(t)}({\mathbf{g}}_{i})q^{(t)}(x,y|{\mathbf{g}}_{i})}{\hat{p}^{train(t)}({\mathbf{g}}_{i})}\nabla\ell({\mathbf{x}}_{i},{\mathbf{y}}_{i};\theta^{(t-1)})
       θ(t)=θ(t1)η|B|i=1|B|di\theta^{(t)}=\theta^{(t-1)}-\frac{\eta}{|B|}\sum_{i=1}^{|B|}d_{i}
       if reached inner update criterion then
             \triangleright Update q(t)(x,y|g)q^{(t)}(x,y|g)
             for g=1,,mg=1,\cdots,m do
                   Sort instances in group gg in the decreasing order of (x,y;θt)\ell(x,y;\theta^{t}); denote the sorted index 𝝅g\bm{\pi}^{g}
                   cutoff=(Nni)niβNni\mathrm{cutoff}=\left\lceil\frac{(N-n_{i})n_{i}\beta}{N-n_{i}}\right\rceil
                   q(t)((x,y)𝝅g(j)|g)=1β,1jcutoffq^{(t)}((x,y)_{\bm{\pi}^{g}(j)}|g)=\frac{1}{\beta},\forall 1\leq j\leq\mathrm{cutoff}
                   q(t)((x,y)𝝅g(j)|g)=niN,j>cutoffq^{(t)}((x,y)_{\bm{\pi}^{g}(j)}|g)=\frac{n_{i}}{N},\forall j>\mathrm{cutoff}
             end for
            
       end if
      
end for
Algorithm 1 Online greedy algorithm for GC-DRO.

5 Proposed Method: Group-conditional DRO

Since group DRO can be problematic with imperfect partitions, we propose a more flexible uncertainty set over the joint distribution of (x,y,g)(x,y,g), i.e. q(g)q(x,y|g)q(g)q(x,y|g), using fine-grained weights over instances within each group instead of treating the entire group as a whole. We extend the α\alpha-covered distribution to both the group-level (q(g)q(g)) and conditional instance-level (q(x,y|g)q(x,y|g)) distributions to define the uncertainty set 𝒬\mathcal{Q}. At training time, a sample is weighted by both its group weight induced from q(g)q(g) as well as the instance-level weight induced from q(x,y|g)q(x,y|g). Specifically, the new uncertainty set is

𝒬α,β={q(g)q(x,y|g):q(g)ptrain(g)α,1Nq(x,y|g)ptrain(x,y|g)β,x,y,g},\small\begin{split}\mathcal{Q}^{\alpha,\beta}&=\left\{q(g)q(x,y|g):q(g)\leq\frac{p_{\text{train}}(g)}{\alpha},\right.\\ &\left.\frac{1}{N}\leq q(x,y|g)\leq\frac{p_{\text{train}}(x,y|g)}{\beta},\forall x,y,g\right\},\vspace{-2mm}\end{split}\vspace{-5mm} (6)

where NN is the number of training examples and α,β(0,1]\alpha,\beta\in(0,1]. Denote nin_{i} the number of samples in group ii, then ptrain(x,y|g=i)=1nip^{train}(x,y|g=i)=\frac{1}{n_{i}}. The second constraint of Eq. 6 can be rewritten as 1Nq(x,y|g)1βni\frac{1}{N}\leq q(x,y|g)\leq\frac{1}{\beta n_{i}}. Compared with the β\beta-covered distribution, we add a lower bound q(x,y|g)1Nq(x,y|g)\geq\frac{1}{N} to compensate for imbalanced group sizes. With a plain β\beta-covered distribution for q(x,y|g)q(x,y|g), the DRO objective roughly upweights a β\beta-fraction of instance losses of each group. However, we only want to emphasize a small subset of examples that perform badly in the majority groups. Thus, we add this lower bound to q(x,y|g)q(x,y|g) in Eq. 6 to directly “punish” larger groups. To see this, the percentage of examples that are upweighted by 1β\frac{1}{\beta} in group ii is roughly NniNniββ\frac{N-n_{i}}{N-n_{i}\beta}\beta, which is monotonically decreasing function w.r.t. nin_{i}. Therefore, the larger the group size nin_{i} is, the smaller fraction of instances in group ii are upweighted.

Online Optimization Algorithm. Similarly to the online greedy algorithm for group DRO (Oren et al., 2019) (details in Appendix C), we interleave the updates between model parameters θ\theta and the worst-case distribution q(g)q(x,y|g)q(g)q(x,y|g). The greedy algorithm involves sorting losses of all the variables when updating the worst-case distribution defined by the α\alpha-covered distribution. However, frequently updating q(x,y|g)q(x,y|g) over large-scale training data (e.g millions of samples) can be costly and unstable. Therefore, we only update q(g)q(g) at every iteration, while performing updates on q(x,y|g)q(x,y|g) lazily once every epoch or when the robust accuracy on the validation set drops (inner update criterion). We present the pseudo code for the training process in Alg. 1.

Discussions. Another potential approach to circumventing the purely group-level loss is constructing an instance-level uncertainty set (Ben-Tal et al., 2013; Husain, 2020; Michel et al., 2021), however, the resulting 𝒬\mathcal{Q} can be too pessimistic (Hu et al., 2018; Duchi et al., 2019) or difficult to optimize (Michel et al., 2021). Instead, we leverage the structural information of data partitions and expand the flexibility of uncertainty set by incorporating the conditional probabilities of instances. Furthermore, this allows us to execute the min-max optimization in an efficient manner.

6 Experiments

In this section, we evaluate the proposed group-conditional DRO on one image classification task and two language tasks — natural language inference and toxicity detection. To demonstrate the effectiveness of our method under various partitions of data, we first introduce the clean (group number m=|𝒜|×|𝒴|m=|\mathcal{A}|\times|\mathcal{Y}|) and imperfect data partitions of each task. As we discussed at the end of §4.2, there are various cases where the partitions of training data are imperfect such that each group is not purely associated with examples from one pair of (a,y)(a,y). In this section, we inspect several cases reflecting diverse properties of partitions to evaluate our method. First, on the image and NLI tasks, we manually design adversarial partitions of data such that there are explicit conflicts between groups and purely reweighing over groups cannot eliminate covariate shift (§6.1). Second, we use the attributes provided by a supervised classifier to create the imperfect partitions of the toxicity data set (§6.1). Third, we also perform unsupervised clustering on the toxicity data set to obtain imperfect partitions in §6.4.

6.1 Data and Tasks

male female
dark 65,487 / 1,387 22,880 / 48,749
blonde 0 / 1,387 22,880 / 0
(a) The imperfect partitions for the CelebA dataset (G1/G2G_{1}/G_{2}).
no neg neg 1 neg 2
contradiction 57,605 / 0 / 0 0 / 1,406 / 0 0 / 0 / 9,897
entailment 67,335 / 0 / 0 0 / 0 / 215 0 / 1,318 / 0
neutral 66,401 / 0 / 0 0 / 0 / 251 0 / 1,747 / 0
(b) The imperfect partitions for the MNLI dataset (G1/G2/G3G_{1}/G_{2}/G_{3}).
White-aligned AAE Hispanic Others
abusive 11,281 7,392 6,707 1,770
spam 8,147 1,041 541 4,301
normal 41,756 2,562 2,638 6,895
hateful 2,696 1,420 509 340
(c) Statistics of each group in the clean partition of the hate speech dataset. Data of each dialect attribute (column) corresponds one group in the imperfect partition.
Table 2: Statistics of data in different groups partitioned by attributes (row) and labels (column).

Object Recognition. We use the CelebA dataset (Liu et al., 2015) which has 162,770 training examples of celebrity faces. We classify the hair color from 𝒴={\mathcal{Y}=\{blond, dark}\} following the set up in Sagawa et al. (2020a). In this task, labels are spuriously correlated with the demographic information — gender of the input 𝒜={\mathcal{A}=\{female, male}\}, which together with 𝒴\mathcal{Y} results in 4 clean groups. The statistics of groups in the imperfect partition are presented in Tab. 2(a) (separated by “/”), each of which consists of data from multiple values of (a,y)(a,y). Concretely, we create an imperfect partition of 2 groups with two explicit spurious correlations: i) in group G1G_{1} (dark, male) are spuriously correlated since we put all their counterparts (blonde, male) in group G2G_{2}; ii) similarly, (dark, female) in G2G_{2} are spuriously correlated.

Datasets Methods Clean Partition Imperfect Partition
Robust Acc Average Acc Robust Acc Average Acc
Celeb-A ERM 40.14 ±\pm 0.99 95.92 ±\pm 0.05 40.14 ±\pm 0.99 95.92 ±\pm 0.05
resampling 86.81 ±\pm 1.26 92.72 ±\pm 0.28 44.17 ±\pm 1.15 95.58 ±\pm 0.03
group DRO (EG) 86.11 ±\pm 1.96 92.33 ±\pm 0.65 42.92 ±\pm 0.91 95.82 ±\pm 0.07
group DRO (greedy) 88.19 ±\pm 2.31 92.65 ±\pm 0.20 45.97 ±\pm 1.73 95.81 ±\pm 0.09
GC-DRO 88.75 ±\pm 0.82 92.92 ±\pm 0.16 82.85 ±\pm 1.54 89.32 ±\pm 2.21
MNLI ERM 70.84 ±\pm 2.47 86.18 ±\pm 0.18 70.84 ±\pm 2.47 86.18 ±\pm 0.18
resampling 67.02 ±\pm 2.43 85.72 ±\pm 0.37 67.26 ±\pm 1.63 85.22 ±\pm 0.58
group DRO (EG) 77.88 ±\pm 1.36 85.16 ±\pm 0.44 69.66 ±\pm 1.98 84.96 ±\pm 0.56
group DRO (greedy) 75.14 ±\pm 3.96 85.82 ±\pm 0.24 70.34 ±\pm 2.19 86.02 ±\pm 0.25
GC-DRO 77.82 ±\pm 1.45 85.04 ±\pm 0.67 75.32 ±\pm 0.93 84.82 ±\pm 0.74
FDCL18 ERM 34.30 ±\pm 1.83 79.70 ±\pm 1.05 34.30 ±\pm 1.83 79.70 ±\pm 1.05
resampling 55.44 ±\pm 4.69 72.04 ±\pm 1.99 26.10 ±\pm 4.11 80.66 ±\pm 0.52
group DRO (EG) 55.98 ±\pm 1.67 70.06 ±\pm 3.06 35.20 ±\pm 2.24 79.58 ±\pm 0.95
group DRO (greedy) 56.83 ±\pm 2.94 70.52 ±\pm 1.99 36.24 ±\pm 3.80 79.40 ±\pm 1.12
GC-DRO 57.28 ±\pm 2.71 70.26 ±\pm 0.94 48.42 ±\pm 6.72 72.02 ±\pm 2.96
Table 3: Robust and average test accuracy and standard deviation on the three tasks.
Refer to caption
Figure 2: Under the imperfect partition of the MNLI dataset, the aggregated average training weights of instance losses in each group divided by attributes and labels (top: group DRO; bottom: GC-DRO).

Natural Language Inference (NLI). NLI is the task of determining whether a hypothesis is true (entailment), false (contradiction) or undetermined (neutral) given a premise. We use the MultiNLI dataset (Williams et al., 2018) and follow the train/dev/test split in Sagawa et al. (2020a), which results in 206,175 training examples. Gururangan et al. (2018) have shown that there is spurious correlation between the label of contradiction and the presence of negation words (nobody, nothing, no, never) due to annotation artifacts. We further split the negation words into two groups: set 1 (nobody, nothing) and set 2 (no, never) to have more variety in the attributes, i.e. 𝒜={\mathcal{A}=\{no negation, negation 1, negation 2}\}, which together with labels forms 9 groups in the clean partition. We create 3 groups in the imperfect partition as shown in Tab. 2(b), where G1G_{1} only contains examples from a=a=no negation, while G2G_{2} and G3G_{3} contain data from both a=a=negation 1 and a=a=negation 2. This causes a dilemma when upweighting either of the groups.

Toxicity Detection. This task aims to identity various forms of toxic languages (e.g. abusive speech, hate speech), an application with practical and important real-world consequences. Sap et al. (2019) have shown that there is a strong correlation between certain surface markers of English spoken by minority groups and the labels of toxicity. And such biases can be acquired and propagated by models trained on these corpora. We perform experiments on the FDCL18 (Fortuna & Nunes, 2018) dataset, a corpus of 100k tweets annotated with four labels: 𝒴={\mathcal{Y}=\{hateful, spam, abusive and normal}\}. Since the dataset does not contain the dialect information, we follow Sap et al. (2019) and use annotations predicted by the dialect classifier in Blodgett et al. (2016) to label each example as one of four English varieties: 𝒜={\mathcal{A}=\{White-aligned, African American (AAE), Hispanic, and Other}\}. As noted in Sap et al. (2019), these automatically obtained labels correlate highly with self-reported race and provide an accurate proxy for the dialect labels. These dialect attributes and toxicity labels together divide the dataset into 16 groups in the clean partition. To construct the imperfect partitions, we investigate a natural setting where data is divided by the dialect attributes, therefore we have 4 groups in the imperfect partition. The test set of FDCL18 contains some groups that are severely under-represented. In order to make the robust accuracy reliable yet still representative of the under-represented groups, we follow Michel et al. (2021) and combine groups that contain less than 100 samples into a single group to compute robust test accuracies.

6.2 Experimental Setup

We fine-tune pretrained models for object recognition and NLP tasks that achieve high average test accuracies, specifically ResNet18 (He et al., 2016) on CelebA and RoBERTa (Liu et al., 2019) on the MultiNLI and FDCL18 datasets. We select hyperparameters by the robust validation accuracy. For the clean partitions, we set α=0.2,β=0.5\alpha=0.2,\beta=0.5 for all the three tasks. For the imperfect partitions, we set a relatively lower value of β\beta to highlight badly performed instances within groups. Specifically, for NLP tasks we set α=0.5\alpha=0.5, β=0.2\beta=0.2 and 0.250.25 for NLI and toxicity detection respectively, and for the image task, we set α=0.2,β=0.1\alpha=0.2,\beta=0.1. For more training details, see Appendix E. We measure both the average accuracy over all the test data as well as the robust accuracy (worst accuracy across all groups). Even though different partitions (clean/imperfect) are used at training time, we always evaluate the model’s robust accuracy across groups of the clean partitions of the test data.

We compare with ERM, which minimizes the average training loss on the empirical training distribution, formally

ERM(θ)=𝔼(x,y)Ptrain[(x,y;θ)]\small\mathcal{L}_{\mathrm{ERM}}(\theta)=\mathbb{E}_{(x,y)\sim P_{train}}[\ell(x,y;\theta)]\vspace{-1mm} (7)

We also compare with two variants of group DRO with different objective and optimization procedures: a greedy (Oren et al., 2019) algorithm for CVaR-group DRO and a exponentiated-gradient (EG) (Sagawa et al., 2020a) procedure with full simplex . Note that while previous work (Sagawa et al., 2020a) has found the greedy algorithm is unstable and underperforms EG, we did not observe this issue with our implementation where we took a slight different approach to compute the worst expected loss and we detail this difference in Appendix E.2. In addition, we compare with the resampling method, which optimizes on minibatches sampled from uniform group frequencies, which is often used for imbalanced datasets.

6.3 Main Results

We present the robust and average test accuracies of all three tasks under different partitions in Tab. 3. Models are selected based on the worst-performing accuracy of group (of the clean partition) in the validation set. All the results are averaged over 5 runs with different random seeds. Except for ERM, all the models leverage the group information at training time. First, as expected, ERM models attain high average test accuracies across all the datasets but perform poorly on the worst-case group. Second, we observe that under the clean partition, group DRO models always significantly outperform ERM on the worst-group test accuracy with modest drop in the average test accuracy. And we also note that group DRO optimized with the greedy algorithm performs on par with that optimized by the EG based algorithm. By contrast, the resampling method can not consistently perform well on the worst test groups on all datasets. Furthermore, our method performs similarly to or slightly better than group DRO on all three datasets under the clean partition. Third, under the imperfect partition, neither group DRO nor resampling can perform well in the worst test groups and achieves similar performance to that of ERM models. On the other hand, our method performs significantly better in terms of the robust accuracy on all three datasets, with 5-37 points in improvement over group DRO models. Although the results are worse compared to those under the clean partition, we demonstrate that our method is much more agnostic to the underlying data partitions.

6.4 Analysis

Why does GC-DRO perform well on robust accuracy? In this section, we investigate why group-conditional DRO works well under imperfect partitions. To do this, we first compute the actual weight (niq(t)(gi)q(t)(x,y|gi)p^train(t)(gi)\frac{n_{i}q^{(t)}(g_{i})q^{(t)}(x,y|g_{i})}{\hat{p}^{train(t)}(g_{i})} in Alg. 1) applied to each instance (xi,yi)(x_{i},y_{i}) at every step tt for group DRO and our method respectively. The groups in imperfect partitions contain instances from different values of (a,y)(a,y) and to study the effects of learning weights on the test groups, we aggregate the weights of instances on each group (a,y)(a,y) (i.e. the clean partition) by taking average over all steps in each epoch. In Fig. 2, we plot the dynamic aggregated weights over the training course learned with the imperfect partition (3 groups, see Tab. 2(b)) of the MNLI dataset. We observe that GC-DRO can assign higher weights to instances that belong to groups of (a=(a=negation, yy\neq contradiction)), which helps prevent the model from learning the spurious correlations between a=a= negation and y=y=contradiction. On the contrary, group DRO can not accomplish this goal because it can not handle these subgroups inside groups in a fine-grained way.

Refer to caption
Figure 3: The heatmap of summarized learning weights for different groups from group DRO and our method.

To make this trend more clear, we summarize the weights across all the epochs for each group of (a,y)(a,y) and present the heat map in Fig. 3. We can see that group DRO focuses on learning from the large group that does not contain negation words but pays less attention to those minority groups. On the contrary, our method encourages the model to learn from minority groups that can help combat spurious features.

Robust Acc Average Acc
ERM 34.30 ±\pm 1.83 79.7 ±\pm 1.05
resampling 34.20 ±\pm 2.36 79.4 ±\pm 1.24
group DRO (EG) 32.84 ±\pm 2.72 80.5 ±\pm 0.59
group DRO (greedy) 34.48 ±\pm 4.69 79.62 ±\pm 0.59
GC-DRO (ours) 45.06 ±\pm 6.77 70.7 ±\pm 4.81
Table 4: Average and robust test accuracies of FDCL18 under the partitions via unsupervised clustering.

On groups produced by unsupervised clustering. We study a more realistic setting where no group information is available and we use an unsupervised clustering algorithm to produce the partitions. Specifically, we first embed the training sentences of the FDCL18 dataset with Sentence-BERT (Reimers et al., 2019), a well-performing semantic sentence embedder, then we use K-means to obtain 8 clusters. In Tab. 4, we show the robust and average accuracy on the test set of the toxicity detection task. Our method once again significantly outperforms other baseline methods on the robust test accuracy, which demonstrates the robustness of GC-DRO under different partitions.

Refer to caption
Figure 4: Ablation studies on α\alpha and β\beta on the MNLI datasets.

Ablation studies on α\alpha and β\beta. We perform ablation studies on the two important hyperparameters α\alpha and β\beta used in our method. In Fig 4, we fix one value and vary the other and plot the robust test accuracies over 5 random runs (the variance of average test accuracies is very small) on the NLI task. We observe that GC-DRO is less sensitive to different combinations of α\alpha and β\beta under the imperfect partitions. However, for the clean partitions, a larger β\beta and a smaller α\alpha tends to yield better performance, as GC-DRO behaves more close to the plain group DRO.

7 Conclusion

Through a mathematical characterization of features used in prediction, we have demonstrated that under covariate shift ERM models can pick up spurious features or miss robust features. The GC-DRO algorithm resulting from this analysis allows for a more flexible uncertainty set that performs consistently well in the worst test group under different partitions. This new understanding of features opens up new avenues in both redesigning our distributionally robust algorithms, and further characterizing possible spurious factors that may influence model robustness, for example through unsupervised learning.

Acknowledgements

This work in this paper was supported in part by a Facebook SRA Award and the NSF/Amazon Fairness in AI program under grant number 2040926. We thank the anonymous reviewers for useful suggestions.

References

  • Achille & Soatto (2018) Achille, A. and Soatto, S. Emergence of invariance and disentanglement in deep representations. The Journal of Machine Learning Research, 19(1):1947–1980, 2018.
  • Adragni & Cook (2009) Adragni, K. P. and Cook, R. D. Sufficient dimension reduction and prediction in regression. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 367(1906):4385–4405, 2009.
  • Arjovsky et al. (2019) Arjovsky, M., Bottou, L., Gulrajani, I., and Lopez-Paz, D. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.
  • Bahng et al. (2020) Bahng, H., Chun, S., Yun, S., Choo, J., and Oh, S. J. Learning de-biased representations with biased representations. In International Conference on Machine Learning, pp. 528–539. PMLR, 2020.
  • Beery et al. (2018) Beery, S., Van Horn, G., and Perona, P. Recognition in terra incognita. In Proceedings of the European Conference on Computer Vision (ECCV), pp.  456–473, 2018.
  • Bellot & van der Schaar (2020) Bellot, A. and van der Schaar, M. Generalization and invariances in the presence of unobserved confounding. arXiv preprint arXiv:2007.10653, 2020.
  • Ben-Tal et al. (2013) Ben-Tal, A., Den Hertog, D., De Waegenaere, A., Melenberg, B., and Rennen, G. Robust solutions of optimization problems affected by uncertain probabilities. Management Science, 59(2):341–357, 2013.
  • Blodgett et al. (2016) Blodgett, S. L., Green, L., and O’Connor, B. Demographic dialectal variation in social media: A case study of african-american english. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp.  1119–1130, 2016.
  • Buolamwini & Gebru (2018) Buolamwini, J. and Gebru, T. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on fairness, accountability and transparency, pp.  77–91, 2018.
  • Cover (1999) Cover, T. M. Elements of information theory. John Wiley & Sons, 1999.
  • Cvitkovic & Koliander (2019) Cvitkovic, M. and Koliander, G. Minimal achievable sufficient statistic learning. volume 97, pp.  1465–1474. PMLR, 2019.
  • David et al. (2010) David, S. B., Lu, T., Luu, T., and Pál, D. Impossibility theorems for domain adaptation. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp.  129–136. JMLR Workshop and Conference Proceedings, 2010.
  • Duchi et al. (2016) Duchi, J., Glynn, P., and Namkoong, H. Statistics of robust optimization: A generalized empirical likelihood approach. arXiv preprint arXiv:1610.03425, 2016.
  • Duchi et al. (2019) Duchi, J. C., Hashimoto, T., and Namkoong, H. Distributionally robust losses against mixture covariate shifts. 2019.
  • Dynkin (2000) Dynkin, E. B. Necessary and sufficient statistics for a family of probability distributions. Selected Papers of EB Dynkin with Commentary, 14:393, 2000.
  • Fortuna & Nunes (2018) Fortuna, P. and Nunes, S. A survey on automatic detection of hate speech in text. ACM Computing Surveys (CSUR), 51(4):1–30, 2018.
  • Gao & Kleywegt (2016) Gao, R. and Kleywegt, A. J. Distributionally robust stochastic optimization with wasserstein distance. arXiv preprint arXiv:1604.02199, 2016.
  • Geiger (2020) Geiger, B. C. On information plane analyses of neural network classifiers–a review. arXiv preprint arXiv:2003.09671, 2020.
  • Goyal et al. (2017) Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., and Parikh, D. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  6904–6913, 2017.
  • Gururangan et al. (2018) Gururangan, S., Swayamdipta, S., Levy, O., Schwartz, R., Bowman, S., and Smith, N. A. Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp.  107–112, 2018.
  • Hashimoto et al. (2018) Hashimoto, T., Srivastava, M., Namkoong, H., and Liang, P. Fairness without demographics in repeated loss minimization. In International Conference on Machine Learning, pp. 1929–1938, 2018.
  • He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  770–778, 2016.
  • Hochreiter & Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • Hu et al. (2018) Hu, W., Niu, G., Sato, I., and Sugiyama, M. Does distributionally robust supervised learning give robust classifiers? In International Conference on Machine Learning, pp. 2029–2037. PMLR, 2018.
  • Hu & Hong (2013) Hu, Z. and Hong, L. J. Kullback-leibler divergence constrained distributionally robust optimization. Available at Optimization Online, 2013.
  • Husain (2020) Husain, H. Distributional robustness with ipms and links to regularization and gans. In Proceedings of the 34th Annual Conference on Neural Information Processing Systems (NIPS), 2020.
  • Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Koenecke et al. (2020) Koenecke, A., Nam, A., Lake, E., Nudell, J., Quartey, M., Mengesha, Z., Toups, C., Rickford, J. R., Jurafsky, D., and Goel, S. Racial disparities in automated speech recognition. Proceedings of the National Academy of Sciences, 117(14):7684–7689, 2020.
  • Koh et al. (2020) Koh, P. W., Sagawa, S., Marklund, H., Xie, S. M., Zhang, M., Balsubramani, A., Hu, W., Yasunaga, M., Phillips, R. L., Beery, S., et al. Wilds: A benchmark of in-the-wild distribution shifts. arXiv preprint arXiv:2012.07421, 2020.
  • Kolchinsky et al. (2019) Kolchinsky, A., Tracey, B. D., and Van Kuyk, S. Caveats for information bottleneck in deterministic scenarios. In International Conference on Learning Representations, 2019.
  • Kullback & Leibler (1951) Kullback, S. and Leibler, R. A. On information and sufficiency. The annals of mathematical statistics, 22(1):79–86, 1951.
  • Lewis (2013) Lewis, D. Counterfactuals. John Wiley & Sons, 2013.
  • Liu et al. (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
  • Liu et al. (2015) Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In Proceedings of the IEEE international conference on computer vision, pp.  3730–3738, 2015.
  • Lopez-Paz (2016) Lopez-Paz, D. From dependence to causation. PhD thesis, University of Cambridge, 2016.
  • Lovering et al. (2021) Lovering, C., Jha, R., Linzen, T., and Pavlick, E. Predicting inductive biases of pre-trained models. In International Conference on Learning Representations (ICLR), 2021.
  • McCoy et al. (2019) McCoy, T., Pavlick, E., and Linzen, T. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp.  3428–3448, 2019.
  • Michel et al. (2021) Michel, P., Hashimoto, T., and Neubig, G. Modeling the second player in distributionally robust optimization. In International Conference on Learning Representations (ICLR), 2021.
  • Oren et al. (2019) Oren, Y., Sagawa, S., Hashimoto, T. B., and Liang, P. Distributionally robust language modeling. In Conference on Empirical Methods in Natural Language Processing (EMNLP), Hong Kong, November 2019.
  • Ott et al. (2019) Ott, M., Edunov, S., Baevski, A., Fan, A., Gross, S., Ng, N., Grangier, D., and Auli, M. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations, 2019.
  • Reimers et al. (2019) Reimers, N., Gurevych, I., Reimers, N., Gurevych, I., Thakur, N., Reimers, N., Daxenberger, J., and Gurevych, I. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2019.
  • Rockafellar et al. (2000) Rockafellar, R. T., Uryasev, S., et al. Optimization of conditional value-at-risk. Journal of risk, 2:21–42, 2000.
  • Sagawa et al. (2020a) Sagawa, S., Koh, P. W., Hashimoto, T. B., and Liang, P. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. In International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, April 2020a.
  • Sagawa et al. (2020b) Sagawa, S., Raghunathan, A., Koh, P. W., and Liang, P. An investigation of why overparameterization exacerbates spurious correlations. In International Conference on Machine Learning (ICML), July 2020b.
  • Sap et al. (2019) Sap, M., Card, D., Gabriel, S., Choi, Y., and Smith, N. A. The risk of racial bias in hate speech detection. In Proceedings of the 57th annual meeting of the association for computational linguistics, pp.  1668–1678, 2019.
  • Schwartz-Ziv & Tishby (2017) Schwartz-Ziv, R. and Tishby, N. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.
  • Shamir et al. (2010) Shamir, O., Sabato, S., and Tishby, N. Learning and generalization with the information bottleneck. volume 411, pp.  2696–2711. Elsevier, 2010.
  • Sohoni et al. (2020) Sohoni, N., Dunnmon, J., Angus, G., Gu, A., and Ré, C. No subclass left behind: Fine-grained robustness in coarse-grained classification problems. 34th Conference on Neural Information Processing Systems (NeurIPS), 33, 2020.
  • Tenenbaum (2018) Tenenbaum, J. Building machines that learn and think like people. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp.  5–5, 2018.
  • Tishby et al. (2000) Tishby, N., Pereira, F. C., and Bialek, W. The information bottleneck method. arXiv preprint physics/0004057, 2000.
  • Torralba & Efros (2011) Torralba, A. and Efros, A. A. Unbiased look at dataset bias. In CVPR 2011, pp.  1521–1528. IEEE, 2011.
  • Wang et al. (2020) Wang, S., Guo, W., Narasimhan, H., Cotter, A., Gupta, M. R., and Jordan, M. I. Robust optimization for fairness with noisy protected groups. 34th Conference on Neural Information Processing Systems (NeurIPS)., 2020.
  • Wen et al. (2014) Wen, J., Yu, C.-N., and Greiner, R. Robust learning under uncertain test distributions: Relating covariate shift to model misspecification. In ICML, pp.  631–639, 2014.
  • Williams et al. (2018) Williams, A., Nangia, N., and Bowman, S. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp.  1112–1122, 2018.

Appendix A Proofs of Theorem 1

Lemma 1.

If TT is sufficient statistics, we have p(Y,X|T)=p(Y|T)p(X|T)p(Y,X|T)=p(Y|T)\cdot p(X|T).

Lemma 2.

If TT is sufficient statistics, we have p(Y|T(X))=p(Y|X)p(Y|T(X))=p(Y|X).

Proof.

Find T(X)T^{\prime}(X), s.t. S(x)=T(X),T(X)S(x)=\langle T(X),T^{\prime}(X)\rangle is an invertible mapping of XX, thus p(Y|X)=p(Y|S(X))=p(Y|T(X),T(X))p(Y|X)=p(Y|S(X))=p(Y|T(X),T^{\prime}(X)). We have,

p(Y,T(X),T(X)|T(X))=p(Y|T(X),T(X))p(T(X)|T(X))p(Y,T(X),T^{\prime}(X)|T(X))=p(Y|T^{\prime}(X),T(X))p(T^{\prime}(X)|T(X)) (8)

From Lemma 1, we have

p(Y,T(X),T(X)|T(X))=p(Y|T(X))p(T(X)|T(X))p(Y,T(X),T^{\prime}(X)|T(X))=p(Y|T(X))p(T^{\prime}(X)|T(X)) (9)

By (8) and (9), we obtain p(Y|T(X),T(X))=p(Y|T(X))=p(Y|X)p(Y|T^{\prime}(X),T(X))=p(Y|T(X))=p(Y|X). ∎

See 1

Proof.

Since there is covariate shift between pidealp_{\text{ideal}} and ptrainp_{\text{train}}, we have ptrain(Y|X)=pideal(Y|X),x𝒳trainp_{\text{train}}(Y|X)=p_{\text{ideal}}(Y|X),\forall x\in\mathcal{X}_{train}. Since Ttrain(X)T_{train}(X) is MSS of ptrainp_{\text{train}} and by Lemma 2, we have ptrain(Y|Ttrain(X))=ptrain(Y|X)=pideal(Y|X)=pideal(Y|Tideal(X)),x𝒳trainp_{\text{train}}(Y|T_{train}(X))=p_{\text{train}}(Y|X)=p_{\text{ideal}}(Y|X)=p_{\text{ideal}}(Y|T_{ideal}(X)),\forall x\in\mathcal{X}_{train}. Then x𝒳train,y𝒴\forall x\in\mathcal{X}_{train},y\in\mathcal{Y},

ptrain(y|Tideal(x))\displaystyle p_{\text{train}}(y|T_{ideal}(x)) =x:Tideal(x)=Tideal(x)ptrain(y|x)ptrain(x|T(x))\displaystyle=\sum\limits_{x^{\prime}:T_{ideal}(x^{\prime})=T_{ideal}(x)}p_{\text{train}}(y|x^{\prime})p_{\text{train}}(x^{\prime}|T(x))
=x:Tideal(x)=Tideal(x)pideal(y|x)ptrain(x|T(x))\displaystyle=\sum\limits_{x^{\prime}:T_{ideal}(x^{\prime})=T_{ideal}(x)}p_{\text{ideal}}(y|x^{\prime})p_{\text{train}}(x^{\prime}|T(x))
=x:Tideal(x)=Tideal(x)pideal(y|T(x))ptrain(x|T(x))\displaystyle=\sum\limits_{x^{\prime}:T_{ideal}(x^{\prime})=T_{ideal}(x)}p_{\text{ideal}}(y|T(x))p_{\text{train}}(x^{\prime}|T(x))
=pideal(y|Tideal(x))\displaystyle=p_{\text{ideal}}(y|T_{ideal}(x)) (10)

Then we have

Htrain(Y|Ttrain(X))\displaystyle H_{train}(Y|T_{train}(X)) =x,yptrain(x,y)[logptrain(y|Ttrain(x))]\displaystyle=\sum_{x,y}p_{\text{train}}(x,y)[-\log p_{\text{train}}(y|T_{train}(x))]
=x,yptrain(x,y)[logpideal(y|Tideal(x))]\displaystyle=\sum_{x,y}p_{\text{train}}(x,y)[-\log p_{\text{ideal}}(y|T_{ideal}(x))]
=x,yptrain(x,y)[logptrain(y|Tideal(x))]\displaystyle=\sum_{x,y}p_{\text{train}}(x,y)[-\log p_{\text{train}}(y|T_{ideal}(x))]
=Htrain(Y|Tideal(X))\displaystyle=H_{train}(Y|T_{ideal}(X)) (11)

From equation A and the definition of sufficient statistics, we have

Itrain(Y;Ttrain(X))=Itrain(Y;X)=Itrain(Y;Tideal(X))I_{train}(Y;T_{train}(X))=I_{train}(Y;X)=I_{train}(Y;T_{ideal}(X)) (12)

Thus, Tideal(X)T_{ideal}(X) is the sufficient statistics of XX about YY under ptrainp_{\text{train}}. By definition, we have

Htrain(Ttrain(X)|Tideal(X))=0.H_{train}(T_{train}(X)|T_{ideal}(X))=0. (13)

See 1

Proof.

Since 𝒳train=𝒳ideal=𝒳\mathcal{X}_{train}=\mathcal{X}_{ideal}=\mathcal{X}, with the similar derivation of equation A, we have x𝒳,y𝒴\forall x\in\mathcal{X},y\in\mathcal{Y}

pideal(y|Tideal(x))=pideal(y|Ttrain(x))p_{\text{ideal}}(y|T_{ideal}(x))=p_{\text{ideal}}(y|T_{train}(x)) (14)

Together with Theorem 1, we have Ttrain(x)T_{train}(x) is also the MSS under pidealp_{\text{ideal}}. ∎

Appendix B Connections between MLE and Learning Minimal Sufficient Statistics

B.1 Information Bottleneck (IB) Method

The information bottleneck (IB) method (Tishby et al., 2000) is an information theoretic principle introduced to extract relevant information that an input X𝒳X\in\mathcal{X} contains about an output random variable Y𝒴Y\in\mathcal{Y}. Defined on a joint distribution of XX and YY, IB learns a mapping function T(X)T(X) by optimizing the trade-off between the mutual information I(X;T)I(X;T) and I(Y;T)I(Y;T) such that T(X)T(X) is a compressed representation of XX (quantified by I(X;T)I(X;T)) that is most informative about YY (quantified by I(Y;T)I(Y;T)). Let TT be parameterized by θ\theta, the objective of IB optimizes the trade-off between I(Y;Tθ(X))I(Y;T_{\theta}(X)) and I(X;Tθ(X))I(X;T_{\theta}(X)):

minθI(Y,Tθ(X))+βI(X;Tθ(X))\displaystyle\min_{\theta}-I(Y,T_{\theta}(X))+\beta I(X;T_{\theta}(X)) (15)

where β\beta is a positive Lagrange multiplier.

Schwartz-Ziv & Tishby (2017) casts finding of minimal sufficient statistics (MSS) T(X)T(X) as a constrained optimization problem using data-processing inequality (Cover, 1999):

minT(X)I(T(X);X)\displaystyle\min_{T(X)}~{}~{}I(T(X);X)
s.tI(T(X);Y)=I(X;Y)\displaystyle s.t~{}~{}I(T(X);Y)=I(X;Y) (16)

This corresponds to the IB method (Eq. 15) which extends the notion of relevance between functions of samples and parameters in conventional MSS to any joint distribution of XX and YY. The IB method provides a computational framework for finding approximate MSS in a soft manner by trading off the sufficiency for YY (I(Y; T(X))) and the minimality of the statistic (I(X,T(X))I(X,T(X))) with the Lagrange multiplier β\beta (Schwartz-Ziv & Tishby, 2017; Shamir et al., 2010).

B.2 Connections between MLE and IB

Given that the IB objective is approximately learning MSS in a soft manner, we next build the connections between the popularly adopted maximum likelihood estimation (MLE) in supervised learning and the IB objective. We show that under certain assumptions, MLE is approximating the IB objective defined on the joint distribution of ptrain(X,Y)p_{\text{train}}(X,Y).

To facilitate the discussions, we decompose the model parameters into θ\theta and ϕ\phi that denote the parameters of the feature extractor Tθ(x)T_{\theta}(x) and the classifier respectively. MLE minimizes the expected negative log probability under ptrain(X,Y)p_{\text{train}}(X,Y):

minθ,ϕ𝔼x,yptrain(X,Y)[logpθ,ϕ(x,y)]\displaystyle\min_{\theta,\phi}\mathbb{E}_{x,y\sim p_{\text{train}}(X,Y)}[-\log p_{\theta,\phi}(x,y)] (17)
\displaystyle\iff minθ,ϕ𝔼x,yptrain(X,Y)[logpϕ(y|Tθ(x))logpθ(x)]\displaystyle\min_{\theta,\phi}\mathbb{E}_{x,y\sim p_{\text{train}}(X,Y)}[-\log p_{\phi}(y|T_{\theta}(x))-\log p_{\theta}(x)] (18)

Usually, we only model the conditional distribution pϕ(Y|X)p_{\phi}(Y|X) and assume that pθ(X)=ptrain(X)p_{\theta}(X)=p_{\text{train}}(X) which is independent from θ\theta. With the assumption that pθ(x)pβ(Tθ(x))),β>0p_{\theta}(x)\propto p^{\beta}(T_{\theta}(x))),\beta>0, (18) can be rewritten as:

minθ,ϕ𝔼x,yptrain(X,Y)[logpϕ(y|Tθ(x))]+β𝔼xptrain(X)[logp(Tθ(x))]\displaystyle\min_{\theta,\phi}\mathbb{E}_{x,y\sim p_{\text{train}}(X,Y)}[-\log p_{\phi}(y|T_{\theta}(x))]+\beta\mathbb{E}_{x\sim p_{\text{train}}(X)}[-\log p(T_{\theta}(x))] (19)

Assume that the neural network parameterized by ϕ\phi is a universal function approximator, then we can replace minθ,ϕ\min_{\theta,\phi} with minθ\min_{\theta} and (19) can be written as:

minθH(Y|Tθ(X))+βH(Tθ(X))\displaystyle\min_{\theta}H(Y|T_{\theta}(X))+\beta H(T_{\theta}(X)) (20)
by(1)I(Y;Tθ(X))=H(Y)H(Y|Tθ(X))\displaystyle\mathrm{by\;(1)}\;I(Y;T_{\theta}(X))=H(Y)-H(Y|T_{\theta}(X))
(2)H(Tθ(X))=I(X;Tθ(X))+H(Tθ(X)|X)=I(X;Tθ(X))\displaystyle\mathrm{\;\;\;\;\;(2)}\;H(T_{\theta}(X))=I(X;T_{\theta}(X))+H(T_{\theta}(X)|X)=I(X;T_{\theta}(X))
\displaystyle\iff minθI(Y;Tθ(X))+βI(X;Tθ(X))\displaystyle\min_{\theta}-I(Y;T_{\theta}(X))+\beta I(X;T_{\theta}(X)) (21)

We can see that under the assumption of pθ(x)pβ(Tθ(x)))p_{\theta}(x)\propto p^{\beta}(T_{\theta}(x))), the MLE objective can be converted into the same form as the IB objective. In practice, we usually do not model ptrain(X)p_{\text{train}}(X) and only optimize the first term I(Y;Tθ(X))I(Y;T_{\theta}(X)) in (21). However, previous work (Schwartz-Ziv & Tishby, 2017; Geiger, 2020) has shown that deep neural networks (DNNs) are implicitly minimizing I(X;Tθ(X))I(X;T_{\theta}(X)) with a wide range of activation functions and architectures, which are manifested as a second compression phase during learning with SGD. Thus, we can presumably consider MLE as approximating the IB objective, which is equivalent to learning the MSS on the train distribution ptrain(X,Y)p_{\text{train}}(X,Y).

Appendix C Details of the Online Greedy Algorithm for Group DRO

Input : α\alpha; mm: total number of groups
Initialize historical average group losses L^(0)\hat{L}^{(0)}; historical estimate of group probabilities p^train(0)\hat{p}^{train(0)}; learning rate η\eta
for t=1,,Tt=1,\cdots,T do
       Sample a mini-batch batch B=(𝐱,𝐲,𝐠)B=({\mathbf{x}},{\mathbf{y}},{\mathbf{g}}) uniformly from ptrainp_{\text{train}}
       \triangleright Update the historical vectors of L^(t)\hat{L}^{(t)} and p^train(t)\hat{p}^{train(t)} for each group g{1,,m}g\in\{1,\cdots,m\}
       L^(t)(g)EMA({(𝐱i,𝐲i;θ(t1)):𝐠i=g},L^(t1)(g))\hat{L}^{(t)}(g)\leftarrow\mathrm{EMA}(\{\ell({\mathbf{x}}_{i},{\mathbf{y}}_{i};\theta^{(t-1)}):{\mathbf{g}}_{i}=g\},~{}\hat{L}^{(t-1)}(g))
       p^train(t)EMA(#samplesofeachgroupinB,p^train(t1))\hat{p}^{train(t)}\leftarrow\mathrm{EMA}(\mathrm{\#samples~{}of~{}each~{}group~{}in~{}B},~{}\hat{p}^{train(t-1)})
       \triangleright Update the worst-case distribution q(t)q^{(t)}
       Sort p^train(t)\hat{p}^{train(t)} in the order of decreasing L^(t)\hat{L}^{(t)} and denote the sorted group indexes 𝝅\bm{\pi}
       q(t)(g𝝅i)=min{p^train(t)(g𝝅i)α,1j=1i1p^train(t)(g𝝅j)α}q^{(t)}(g_{\bm{\pi}_{i}})=\min\{\frac{\hat{p}^{train(t)}(g_{\bm{\pi}_{i}})}{\alpha},1-\sum_{j=1}^{i-1}\frac{\hat{p}^{train(t)}(g_{\bm{\pi}_{j}})}{\alpha}\}
       \triangleright Update model parameters θ\theta
       θ(t)=θ(t1)η|B|i=1|B|q(t)(𝐠i)p^train(t)(𝐠i)(𝐱i,𝐲i;θ(t1))\theta^{(t)}=\theta^{(t-1)}-\frac{\eta}{|B|}\sum_{i=1}^{|B|}\frac{q^{(t)}({\mathbf{g}}_{i})}{\hat{p}^{train(t)}({\mathbf{g}}_{i})}\nabla\ell({\mathbf{x}}_{i},{\mathbf{y}}_{i};\theta^{(t-1)})
end for
Algorithm 2 Online greedy algorithm for group DRO (Oren et al., 2019)

EMA refers to exponential weighted moving average such that EMA(v1,v2)=γv1+(1γ)v2\mathrm{EMA}(v_{1},v_{2})=\gamma v_{1}+(1-\gamma)v_{2}, where γ(0,1)\gamma\in(0,1).

Appendix D Synthetic Experiments: on Investigation Spurious Features under Covariate Shift

Refer to caption
Figure 5: An illustrative example of the synthetic task.

Synthetic Experiments

We design synthetic experiments where data is generated based on the ground-truth rules and different biases are injected. We show that even in the presence of necessary information to learn the rules, the ERM model (specifically, we examine MLE) can still learn spurious features or miss robust features under covariate shift. The synthetic task aims to predict an integer y{0,,9}y\in\{0,\cdots,9\} conditioned on a sequence xx as shown in Fig. 5. Concretely, xx is composed of mm chunks, where each chunk cic_{i} has |ci||c_{i}| characters that are randomly sampled from an alphabet 𝒱\mathcal{V}. We prepend an integer ci1c_{i}^{1} and append an integer ci2c_{i}^{2} to each chunk cic_{i}, and both ci1c_{i}^{1} and ci2c_{i}^{2} are uniformly sampled from [1,10][1,10]. The target integer yy is predicted following the rules: each triple of (ci1,ci,ci2)(c_{i}^{1},c_{i},c_{i}^{2}) produces an indicator value did_{i}; di=ci2ci1d_{i}=c_{i}^{2}-c_{i}^{1} if ci2>ci1c_{i}^{2}>c_{i}^{1}, otherwise di=0d_{i}=0; then y=(i=1mdi)mod10y=(\sum_{i=1}^{m}d_{i})\mod 10. We set 3m63\leq m\leq 6, 3|ci|53\leq|c_{i}|\leq 5 and |𝒱|=26|\mathcal{V}|=26,. We use a one-layer bidirectional LSTM (Hochreiter & Schmidhuber, 1997) to model the input sequence and use the final hidden states of the LSTM to predict the target value. We create training data following the the above description and design two settings that introduce covariate shift to examine if the model can learn the rules with ERM.

(a) Setting 1 — ERM-trained models can miss robust features under covariate shift: We create the training data by imposing cm2>cm1c_{m}^{2}>c_{m}^{1} on the last chunk cmc_{m} of all the training samples. When we create the training data, the rules applied to each chunk are the same as described above, which means that the model does not need to learn additional rules for the last chunk. We are interested in examining whether the model trained with ERM will apply the rules learned from other chunks to the last one or it will miss the robust features of the last chunk. At test time, we evaluate on two groups of test sets: 𝒟out\mathcal{D}_{out} where cm2cm1c_{m}^{2}\leq c_{m}^{1}, different from the training data, and 𝒟in\mathcal{D}_{in} where cm2>cm1c_{m}^{2}>c_{m}^{1}, consistent with the training data. From Tab. 5, we see that the test accuracy on 𝒟out\mathcal{D}_{out} is much lower that that on 𝒟in\mathcal{D}_{in}. This demonstrates that the model only learns robust features from chunks c1m1c_{1}^{m-1} but misses the robust features of the last chunk cmc_{m}. We conjecture that the model trained with ERM learns in a lazy way where it tries to minimize the entropy of learned features by memorizing patterns and taking shortcuts as discussed further in Appendix B.2.

(b) Setting 2 — ERM-trained models can learn spurious features under covariate shift: In the second setting, we inject spurious patterns into the training data that co-occur with the rules we aim to learn. As both robust rules and spurious patterns co-exist in the training data, we would like to see whether the model picks up the spurious ones or the robust ones. Specifically, each training input sequence has a chunk cjc_{j} that includes a special segment of characters, e.g. a b. The remainder of dj=cj2cj1d_{j}=c_{j}^{2}-c_{j}^{1} and the sum of all indicators i=1mdi\sum_{i=1}^{m}d_{i} mod by 10 are the same such that the target label yy is always the same as the indicator djd_{j}. Similarly, we test on two cases: i) 𝒟in\mathcal{D}_{in} where every sequence includes a special chunk as in the training set; ii) 𝒟out\mathcal{D}_{out} where characters in each chunk are uniformly sampled. We can see from Tab. 5 that the model learns to use the spurious patterns to predict the target label instead of the general rules.

𝒟in\mathcal{D}_{in} 𝒟out\mathcal{D}_{out}
Setting 1 99.93 ±\pm 0.02 14.68 ±\pm 2.60
Setting 2 100.00 ±\pm 0.00 10.26 ±\pm 0.25
Table 5: Test accuracy of the synthetic task.

Appendix E Experimental Details

E.1 Models and Training Details

Model Specific Settings

In our method, we adopt two criterions in GC-DRO to determine when to update q(x,y|g)q(x,y|g) for each groups: (1) update when the robust validation accuracy drops (2) update at every epoch. With (2), q(x,y|g)q(x,y|g) is updated more frequently. For MNLI and Celeb-A, we use the second criterion. For FDCL18, we use the first criterion, because this is a relatively smaller dataset and updating q(x,y|g)q(x,y|g) less frequently makes training more stable. Every time q(x,y|g)q(x,y|g) is updates, we clear the historical losses in EMA that is used for updating q(g)q(g). We use exponentially weighted moving average (EMA) to compute the historical losses for both q(g)q(g) and q(x,y|g)q(x,y|g), for which we denote EMAG\mathrm{EMA}_{\text{G}} and and EMACG\mathrm{EMA}_{\text{CG}} respectively. As shown above, we use γ\gamma to denote the coefficient for current value in EMA, thus 1γ1-\gamma is used to the historical value. We found that the value of γ\gamma is an important hyperparameter in some cases to achieve better performance, since the final qq distribution is computed through sorting the losses accumulated via EMA. Basically, a higher γ\gamma pays more attention to the current value. We search over {0.1,0.5}\{0.1,0.5\} for both γ\gamma used in EMAG\mathrm{EMA}_{\text{G}} and EMACG\mathrm{EMA}_{\text{CG}} respectively. Through the robust accuracy on the validation set, we set both γ\gamma’s to be 0.5 for the NLP tasks except that for the imperfect partition of toxicity detection we set γ\gamma used in EMAG\mathrm{EMA}_{\text{G}} to be 0.1. For the image task, we set both γ\gamma’s to be 0.1. For the γ\gamma used in accumulating the historical fractions of groups, we always use a small value 0.01.

Training Details

For the NLP tasks, we finetune a base Roberta model (Liu et al., 2019; Ott et al., 2019) and we segment the input text into the sub-word tokens using the tokenization described in (Liu et al., 2019). During training, we sample minibatches that contain at most 4400 tokens. We train MNLI using Adam (Kingma & Ba, 2014) with an intitial learnig rate of 1e51e-5 for 35 epochs and FDCL18 for 45 epochs, and we linearly decay the learning rate at every step until the end of training. For the image task, we fine-tune a ResNet-18 (He et al., 2016) for 50 epochs with batch size of 256. We use SGD with learning rate of 1e41e-4. At the end of every epoch, we evaluate the robust accuracy on the validation set. We train on one Volta-16G GPU and it takes around 2 - 5 hours to finish one experiments for different datasets.

E.2 Implementation of the Group DRO Loss

We referred to the implementation of greedy group DRO in Sagawa et al. (2020a), where they use the exact formulation in Eq. 5 to compute the expected loss, which leads to inferior performance compared to the exponentiated-gradient based optimization as reported in Sagawa et al. (2020a). The implementation computed the final loss by first computing the average loss over instances for each group (MC for the inner expectation), then compute the full expected value over the averaged group loss, as shown below:

(𝐱,𝐲,𝐠;θ)=gq(g)¯(g)=gq(g)1Cg{i,𝐠i=g}(𝐱i,𝐲i;θ),\vspace{-3mm}\ell({\mathbf{x}},{\mathbf{y}},{\mathbf{g}};\theta)=\sum_{g}q(g)\bar{\ell}(g)=\sum_{g}q(g)\frac{1}{C_{g}}\sum_{\{i,\forall{\mathbf{g}}_{i}=g\}}{\ell({\mathbf{x}}_{i},{\mathbf{y}}_{i};\theta)}, (22)

where (𝐱,𝐲,𝐠)({\mathbf{x}},{\mathbf{y}},{\mathbf{g}}) is a mini-batch and CgC_{g} is the number of samples that belong to group gg in the mini-batch. We can see that instances that belong to different groups are weighted correspondingly by the number of group size in a mini-batch. This causes that instances in large group get unfairly lower weights, especially when its probability in the qq distribution is low. We fix this by directly computing the expected loss over the joint distribution of q(x,y,g)q(x,y,g), i.e. 𝔼(xi,yi,gi)q(x,y,g)(xi,yi,gi;θ)=𝔼(xi,yi,gi)ptrain(x,y,g)q(xi,yi,gi)ptrain(xi,yi,gi)(xi,yi,gi)\mathbb{E}_{(x_{i},y_{i},g_{i})\sim q(x,y,g)}\ell(x_{i},y_{i},g_{i};\theta)=\mathbb{E}_{(x_{i},y_{i},g_{i})\sim p_{\text{train}}(x,y,g)}\frac{q(x_{i},y_{i},g_{i})}{p_{\text{train}}(x_{i},y_{i},g_{i})}\ell(x_{i},y_{i},g_{i}). Specifically, we do this by summing over all the importance weighted instance losses using corresponding group weights and taking average. This allows us to obtain unbiased gradient estimates of θ\theta.

1Niq(𝐠i)ptrain(gi)(𝐱i,𝐲i;θ)\frac{1}{N}\sum_{i}\frac{q({\mathbf{g}}_{i})}{p_{\text{train}}(g_{i})}\ell({\mathbf{x}}_{i},{\mathbf{y}}_{i};\theta) (23)