
Representation Matters:
Assessing the Importance of Subgroup Allocations in Training Data

Esther Rolf, Theodora Worledge, Benjamin Recht, and Michael I. Jordan
Department of Electrical Engineering and Computer Sciences and Department of Statistics, University of California, Berkeley
Abstract

Collecting more diverse and representative training data is often touted as a remedy for the disparate performance of machine learning predictors across subpopulations. However, a precise framework for understanding how dataset properties like diversity affect learning outcomes is largely lacking. By casting data collection as part of the learning process, we demonstrate that diverse representation in training data is key not only to increasing subgroup performances, but also to achieving population-level objectives. Our analysis and experiments describe how dataset compositions influence performance and provide constructive results for using trends in existing data, alongside domain knowledge, to help guide intentional, objective-aware dataset design.

1 Introduction

Datasets play a critical role in shaping the perception of performance and progress in machine learning (ML)—the way we collect, process, and analyze data affects the way we benchmark success and form new research agendas (Paullada et al., 2020; Dotan & Milli, 2020). A growing appreciation of this determinative role of datasets has sparked a concomitant concern that standard datasets used for training and evaluating ML models lack diversity along significant dimensions, for example, geography, gender, and skin type (Shankar et al., 2017; Buolamwini & Gebru, 2018). Lack of diversity in evaluation data can obfuscate disparate performance when evaluating based on aggregate accuracy (Buolamwini & Gebru, 2018). Lack of diversity in training data can limit the extent to which learned models can adequately apply to all portions of a population, a concern highlighted in recent work in the medical domain (Habib et al., 2019; Hofmanninger et al., 2020).

Our work aims to develop a general unifying perspective on the way that dataset composition affects outcomes of machine learning systems. We focus on dataset allocations: the number of datapoints from predefined subsets of the population. While we acknowledge that numerical inclusion of groups is an imperfect proxy of representation, we believe that allocations provide a useful initial mathematical abstraction for formulating relationships among diversity, data collection, and statistical risk. We discuss broader implications of our formulation in Section 5.

With the implicit assumption that the learning task is well specified and performance evaluation from data is meaningful for all groups, we ask:

1. Are group allocations in training data pivotal to performance? To what extent can methods that up-weight underrepresented groups help, and when might up-weighting actually hurt performance?

Taking a point of view that data collection is a critical component of the overall machine learning process, we study the effect that dataset composition has on group and population accuracies. This complements work showing that simply gathering more data can mitigate some sources of bias or unfairness in ML outcomes (Chen et al., 2018), a phenomenon which has been observed in practice as well. Indeed, in response to the Gender Shades study (Buolamwini & Gebru, 2018), companies selectively collected additional data to decrease the exposed inaccuracies of their facial recognition models for certain groups, often raising aggregate accuracy in the process (Raji & Buolamwini, 2019). Given the potential for targeted data collection efforts to repair unintended outcomes of ML systems, we next ask:

2. How might we describe “optimal” dataset allocations for different learning objectives? Does the often-witnessed lack of diversity in large-scale datasets align with maximizing population accuracy?

We show that purposeful data collection efforts can proactively support intentional objectives of an ML system, and that diversity and population objectives are often aligned. Many datasets have recently been designed or amended to exhibit diversity of the underlying population (Ryu et al., 2017; Tschandl et al., 2018; Yang et al., 2020). Such endeavors are significant undertakings, as data gathering and annotation must consider consent, privacy, and power concerns in addition to inclusivity, transparency and reusability (Gebru et al., 2018; Gebru, 2020; Wilkinson et al., 2016). Given the importance of more representative and diverse datasets, and the effort required to create them, our final question asks:

3. When and how can we leverage existing datasets to help inform better allocations, towards achieving diverse objectives in a subsequent dataset collection effort?

Representation bias, or systematic underrepresentation of subpopulations in data, is one of many forms of bias in ML (Suresh & Guttag, 2019). Our work provides a data-focused perspective on the design and evaluation of ML pipelines. Our main contributions are:

1. We analyze the complementary roles of dataset allocation and algorithmic interventions for achieving per-group and total-population performance (Section 2). Our experiments show that while algorithmically up-weighting underrepresented groups can help, dataset composition is the most consistent determinant of performance (Section 4.2).

2. We propose a scaling model that describes the impact of dataset allocations on group accuracies (Section 3). Under this model, when the parameters governing the relative values of within-group data are equal for all groups, the allocation that minimizes population risk overrepresents minority groups.

3. We demonstrate that our proposed scaling model captures major trends in the relationship between dataset allocations and performance (Sections 4.3 and 4.5). We show that a small initial sample can be used to inform subsequent data collection efforts to, for example, maximize the minimum accuracy over groups without sacrificing population accuracy (Section 4.4).

Sections 2 and 3 formalize data collection as part of the learning problem and derive results under illustrative settings. Experiments in Section 4 support these results and expose nuances inherent to real-data contexts. Section 5 synthesizes results and delineates future work.

1.1 Additional Related Work

Targeted data collection in ML. Recent research shows that targeted data collection can be an effective way to reduce disparate performance of ML models evaluated across sub-populations (Raji & Buolamwini, 2019). Chen et al. (2018) present a formal argument that the addition of training data can lessen discriminatory outcomes while improving accuracy of learned models, and Abernethy et al. (2020) show that adaptively collecting data from the lowest-performing sub-population can increase the minimum accuracy over groups. At the same time, there are many complexities of gathering data as a solution to disparate performance across groups. Targeted data collection from specific groups can present undue burdens of surveillance or skirt consent (Paullada et al., 2020). When ML systems fail portions of their intended population due to issues of measurement and construct validity, more thorough data collection is unlikely to solve the issue without further intervention (Jacobs & Wallach, 2019).

With these complexities in mind, we study the importance of numerical representation in training datasets in achieving diverse objectives. Optimal allocation of subpopulations in statistical survey designs dates back to at least Neyman (1934), including stratified sampling methods to ensure coverage across sub-populations (Lohr, 2009). For more complex prediction systems, the field of optimal experimental design (Pukelsheim, 2006) studies what inputs are most valuable for reaching a given objective, often focusing on linear prediction functions. We consider a constrained sampling structure and directly model the impact of group allocations on subgroup performance for general model classes.

Valuing data. In economics, allocations indicate a division of goods to various entities (Cole et al., 2013). While we focus on the influence of data allocations on model accuracies across groups, there are many approaches to valuing data. Methods centering on a theory of Shapley valuations (Yona et al., 2019; Ghorbani & Zou, 2019) complement studies of the influence of individual data points on model performance to aid subsampling data (Vodrahalli et al., 2018).

Methods for handling group-imbalanced data. Importance sampling and importance weighting are standard approaches to addressing class imbalance or small group sizes (Haixiang et al., 2017; Buda et al., 2018), though the effects of importance weighting for deep learning may vary with regularization (Byrd & Lipton, 2019). Other methods specifically address differential performance between groups. Maximizing minimum performance across groups can reduce accuracy disparities (Sagawa et al., 2020) and promote fairer sequential outcomes (Hashimoto et al., 2018). For broader classes of group-aware objectives, techniques exist to mitigate unfairness or disparate performance of black box prediction functions (Dwork et al., 2018; Kim et al., 2019). It might not be clear a priori which subsets need attention; Sohoni et al. (2020) propose a method to identify and account for hidden strata, while other methods are defined for any subsets (Hashimoto et al., 2018; Kim et al., 2019).

In addition to modifying the optimization objective or learning algorithm, one can also modify the input data itself by re-sampling the training data to match the desired population by downsampling or by upsampling with data augmentation techniques (Chawla et al., 2002; Iosifidis & Ntoutsi, 2018).

1.2 Notation

$\Delta^k$ denotes the $k$-dimensional simplex; $\mathbb{Z}^+$ denotes the non-negative integers and $\mathbb{R}^+$ the non-negative reals.

2 Training set allocations and alternatives

We study settings in which each data instance is associated with a group $g_i$, so that the training set can be expressed as $\mathcal{S}=\{x_i,y_i,g_i\}_{i=1}^n$, where $x_i,y_i$ denote the features and labels of each instance. We index the discrete groups by integers $\mathcal{G}=\{1,\dots,|\mathcal{G}|\}$; when we specifically consider just two groups, we write $\mathcal{G}=\{A,B\}$. We assume that groups are disjoint and cover the entire population, with $\gamma_g = P_{(X,Y,G)\sim\mathcal{D}}[G=g]$ denoting the population prevalence of group $g$, so that $\vec{\gamma}\in\Delta^{|\mathcal{G}|}$. Groups could represent inclusion in one of many binned demographic categories, or simply a general association with latent characteristics that are relevant to prediction.

For a given population with distribution $\mathcal{D}$ over features, labels, and groups, we are interested in the population-level risk $\mathcal{R}(\hat{f}(\mathcal{S});\mathcal{D}) := \mathbb{E}_{(X,Y,G)\sim\mathcal{D}}[\ell(\hat{f}(X),Y)]$ of a predictor $\hat{f}$ trained on dataset $\mathcal{S}$, as well as group-specific risks. Denoting the group distributions by $\mathcal{D}_g$, defined as conditional distributions via $P_{(X,Y)\sim\mathcal{D}_g}[X=x,Y=y] = P_{(X,Y,G)\sim\mathcal{D}}[X=x,Y=y,G=g]/\gamma_g$, the population risk decomposes as a weighted average over group risks:

$$\mathcal{R}(\hat{f}(\mathcal{S});\mathcal{D}) = \sum_{g\in\mathcal{G}} \gamma_g\cdot\mathcal{R}(\hat{f}(\mathcal{S});\mathcal{D}_g). \qquad (1)$$

In Section 2.2 we will assume that the loss $\ell(\hat{y},y)$ is a separable function over data instances. While this holds for many common loss functions, some objectives do not decouple in this sense (e.g., group losses and associated classes of fairness-constrained objectives; see Dwork et al., 2018). We revisit this point in Sections 4 and 5.

2.1 Training Set Allocations

In light of the decomposition of the population-level risk as a weighted average over group risks in Eq. (1), we now consider the composition of fixed-size training sets, in terms of how many samples come from each group.

Definition 1 (Allocations).

Given a dataset of $n$ triplets $\{x_i,y_i,g_i\}_{i=1}^n$, the allocation $\vec{\alpha}\in\Delta^{|\mathcal{G}|}$ describes the relative proportions of each group in the dataset:

$$\alpha_g := \tfrac{1}{n}\sum_{i=1}^n \mathbb{I}[g_i=g], \quad g\in\mathcal{G}. \qquad (2)$$

It will be illuminating to consider $\vec{\alpha}$ not only as a property of an existing dataset, but as a parameter governing dataset construction, as captured in the following definition.

Definition 2 (Sampling from allocation $\vec{\alpha}$).

Given sample size $n$, group distributions $\{\mathcal{D}_g\}_{g\in\mathcal{G}}$, and allocation $\vec{\alpha}\in\Delta^{|\mathcal{G}|}$ such that $n_g := \alpha_g n \in \mathbb{Z}^+$ for all $g\in\mathcal{G}$, to sample from allocation $\vec{\alpha}$ is procedurally equivalent to independently sampling $|\mathcal{G}|$ disjoint datasets $\mathcal{S}_g$ and concatenating:

$$\mathcal{S}(\vec{\alpha},n) = \bigcup_{g\in\mathcal{G}} \mathcal{S}_g, \qquad \mathcal{S}_g = \{x_i,y_i,g\}_{i=1}^{n_g}, \quad (x_i,y_i)\sim_{\text{i.i.d.}}\mathcal{D}_g. \qquad (3)$$

For $\vec{\alpha}$ not satisfying the requirement that $\alpha_g n$ is integral, we could randomize the fractional allocations, or take $n_g = \lfloor \alpha_g n \rfloor$, reducing the total number of samples to $\sum_g \lfloor \alpha_g n \rfloor$. In the following sections we generally allow allocations with $n_g \notin \mathbb{Z}$, assuming that the effect of up to $|\mathcal{G}|$ fractionally assigned instances is negligible for large $n$.
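To make Definition 2 concrete, the following is a minimal Python sketch of sampling a training set from an allocation; the dictionary-of-pools interface and the floor-based handling of fractional allocations are illustrative assumptions, not the interface of our released code.

```python
import numpy as np

def sample_from_allocation(group_data, alpha, n, rng=None):
    """Sample a training set S(alpha, n) as in Definition 2.

    group_data: dict mapping group id -> (X_g, y_g) arrays to draw from.
    alpha: dict mapping group id -> allocation proportion (sums to 1).
    n: total number of training points. Fractional allocations are
    floored, so the realized size may fall short of n by at most |G|.
    """
    rng = np.random.default_rng(rng)
    X_parts, y_parts, g_parts = [], [], []
    for g, (X_g, y_g) in group_data.items():
        n_g = int(np.floor(alpha[g] * n))  # n_g = floor(alpha_g * n)
        idx = rng.choice(len(X_g), size=n_g, replace=False)
        X_parts.append(X_g[idx])
        y_parts.append(y_g[idx])
        g_parts.append(np.full(n_g, g))
    # Concatenate the per-group samples: S = union over g of S_g.
    return (np.concatenate(X_parts),
            np.concatenate(y_parts),
            np.concatenate(g_parts))
```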

The procedure in Definition 2 suggests formalizing data collection as a component of the learning process in the following way: in addition to choosing a loss function and a method for minimizing the risk, choose the relative proportions at which to sample the groups in the training set:

$$\vec{\alpha}^* = \operatorname*{argmin}_{\vec{\alpha}\in\Delta^{|\mathcal{G}|}}\; \min_{\hat{f}\in\mathcal{F}}\; \mathcal{R}\bigl(\hat{f}(\mathcal{S}(\vec{\alpha},n));\mathcal{D}\bigr).$$

In Section 3, we show that when a dataset curator can design dataset allocations in the sense of Definition 2, they have the opportunity to improve the accuracy of the trained model. Of course, one does not always have the opportunity to collect new data or modify the composition of an existing dataset. Section 2.2 considers methods for using fixed datasets in which some groups have small training set allocations $\alpha_g$ relative to $\gamma_g$, or high risk relative to the population.

2.2 Accounting for Small Group Allocations in Fixed Datasets

In classical empirical risk minimization (ERM), one learns a function from class $\mathcal{F}$ that minimizes the average prediction loss over the training instances $(x_i,y_i,g_i)\in\mathcal{S}$ (we also abuse notation and write $i\in\mathcal{S}$), with optional regularization $R$:

$$\hat{f}(\mathcal{S}) = \operatorname*{argmin}_{f\in\mathcal{F}} \sum_{i\in\mathcal{S}} \ell(f(x_i),y_i) + R(f,\mathcal{S}).$$

There are many methods for addressing small group allocations in data (see Section 1.1). Of particular relevance to our work are objective functions that minimize group or population risks. In particular, one approach is to use importance weighting (IW) to re-weight training samples with respect to a target distribution defined by $\vec{\gamma}$:

$$\hat{f}^{\text{IW}}(\mathcal{S}) = \operatorname*{argmin}_{f\in\mathcal{F}} \sum_{g\in\mathcal{G}} \frac{\gamma_g}{\alpha_g}\left(\sum_{i\in\mathcal{S}_g} \ell(f(x_i),y_i)\right) + R(f,\mathcal{S}).$$

This empirical risk with instances weighted by $\gamma_g/\alpha_g = \gamma_g n/n_g$ is an unbiased estimate of the population risk, up to regularization. While unbiasedness is often desirable, IW can induce high variance in the estimator when $\gamma_g/\alpha_g$ is large for some group (Cortes et al., 2010), which happens when group $g$ is severely underrepresented in the training data relative to its population prevalence.
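As an illustration, the IW objective can be implemented with any learner that accepts per-instance weights. The sketch below uses scikit-learn's logistic regression with weights $\gamma_g/\alpha_g$; the known prevalences `gamma` and the specific model choice are assumptions of this example, not a prescription from our experiments.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_importance_weighted(X, y, groups, gamma):
    """Minimize the IW objective by weighting each instance by gamma_g / alpha_g.

    gamma: dict mapping group id -> population prevalence gamma_g.
    """
    # Empirical allocation alpha_g of the training set (Definition 1).
    alpha = {g: np.mean(groups == g) for g in np.unique(groups)}
    weights = np.array([gamma[g] / alpha[g] for g in groups])
    model = LogisticRegression(max_iter=1000)
    model.fit(X, y, sample_weight=weights)  # weighted ERM
    return model
```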

Alternatively, group distributionally robust optimization (GDRO) (Hu et al., 2018; Sagawa et al., 2020) minimizes the maximum empirical risk over all groups:

$$\hat{f}^{\text{GDRO}}(\mathcal{S}) = \operatorname*{argmin}_{f\in\mathcal{F}} \max_{g\in\mathcal{G}} \left(\tfrac{1}{n_g}\sum_{i\in\mathcal{S}_g} \ell(f(x_i),y_i) + R(f,\mathcal{S}_g)\right).$$

For losses $\ell$ that are continuous and convex in the parameters of $f$, the optimal GDRO solution corresponds to the minimizer of a group-weighted objective $\tfrac{1}{n}\sum_{i=1}^n w(g_i)\cdot\ell(f(x_i),y_i)$, though this does not hold in general for nonconvex losses (see Prop. 1 of Sagawa et al., 2020, and the remark immediately thereafter).
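For intuition, the online GDRO algorithm of Sagawa et al. (2020) interleaves an exponentiated-gradient update on group weights with ordinary gradient steps on the weighted loss. The PyTorch-style sketch below is a simplified rendering under several assumptions (cross-entropy loss, no group-dependent regularization or group adjustments, fixed step size `eta_q`); it is not our exact training code.

```python
import torch

def gdro_step(model, optimizer, batch, q, eta_q=0.01):
    """One simplified GDRO update (cf. Sagawa et al., 2020).

    batch: (X, y, g) tensors for one minibatch; g holds integer group ids.
    q: tensor of current group weights on the simplex.
    """
    X, y, g = batch
    losses = torch.nn.functional.cross_entropy(model(X), y, reduction="none")
    # Per-group average losses on this batch (groups absent get loss 0).
    group_losses = torch.zeros_like(q)
    for gid in range(len(q)):
        mask = (g == gid)
        if mask.any():
            group_losses[gid] = losses[mask].mean()
    # Exponentiated-gradient ascent on the group weights q.
    q = q * torch.exp(eta_q * group_losses.detach())
    q = q / q.sum()
    # Gradient step on the q-weighted loss.
    optimizer.zero_grad()
    (q * group_losses).sum().backward()
    optimizer.step()
    return q
```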

Given the correspondence of GDRO (for convex loss functions) and IW to the optimization of group-weighted ERM objectives, we now investigate the joint roles of sample allocation and group re-weighting for estimating group-weighted risks. For prediction function $f$, loss function $\ell$, and group weights $w:\mathcal{G}\rightarrow\mathbb{R}^+$, let $\hat{L}(w,\vec{\alpha},n;f,\ell)$ be the random variable defined by

$$\hat{L}(w,\vec{\alpha},n;f,\ell) := \tfrac{1}{n}\sum_{i\in\mathcal{S}(\vec{\alpha},n)} w(g_i)\cdot\ell(f(x_i),y_i),$$

where the randomness in $\hat{L}$ comes from the draws of $(x_i,y_i)$ from $\mathcal{D}_{g_i}$ according to the procedure $\mathcal{S}(\vec{\alpha},n)$ (Definition 2), as well as any randomness in $f$.

The following proposition shows that group weights and allocations play complementary roles in risk function estimation. In particular, if $w(g)$ depends on the sampling allocations $\alpha_g$, then there are alternative group weights $w^*$ and an alternative allocation $\vec{\alpha}^*$ such that the alternative estimator has the same expected value but lower variance.

Proposition 1 (Weights and allocations).

For any loss $\ell$, prediction function $f$, and group distributions $\mathcal{D}_g$, there exist weights with $w^*(g) \propto \left(\mathrm{Var}_{(x,y)\sim\mathcal{D}_g}[\ell(f(x),y)]\right)^{-1/2}$ such that for any triplet $(\vec{\alpha},w,n)$ with $\sum_g \alpha_g w(g) > 0$, if $w$ is not approximately proportional to $w^*$,¹ there exists an alternative allocation $\vec{\alpha}^*$ such that

$$\mathbb{E}[\hat{L}(w^*,\vec{\alpha}^*,n;f,\ell)] = \mathbb{E}[\hat{L}(w,\vec{\alpha},n;f,\ell)] \quad\text{and}\quad \mathrm{Var}[\hat{L}(w^*,\vec{\alpha}^*,n;f,\ell)] < \mathrm{Var}[\hat{L}(w,\vec{\alpha},n;f,\ell)].$$

If $w(g) > w^*(g)$ then $\alpha^*_g > \alpha_g$, and if $w(g) < w^*(g)$ then $\alpha^*_g < \alpha_g$.

¹The “approximately” in this relation stems from finite-sample and integrality concerns; for example, the proposition holds if we take “not approximately proportional” to mean $\exists g\in\mathcal{G}: |1 - w(g)/w^*(g)| > |\mathcal{G}|/(\alpha_g n)$.

Proof.

(Sketch; the full proof appears in Section A.1.) For any deterministic weighting function $w:\mathcal{G}\rightarrow\mathbb{R}^+$, there exists a vector $\vec{\gamma}'\in\Delta^{|\mathcal{G}|}$ with entries $\gamma'_g \propto w(g)\alpha_g$ such that

$$\mathbb{E}[\hat{L}(w,\vec{\alpha},n;f,\ell)] = \tfrac{1}{n}\sum_{(x_i,y_i,g_i)\in\mathcal{S}} \mathbb{E}[w(g_i)\,\ell(f(x_i),y_i)] = c\cdot\mathbb{E}_{g\sim\mathrm{Multinomial}(\vec{\gamma}')}\,\mathbb{E}_{(x,y)\sim\mathcal{D}_g}[\ell(f(x),y)]$$

for the constant $c=\sum_g \alpha_g w(g)$. For any fixed $f$ and any “target distribution” defined by $\vec{\gamma}'$, the pair $(\vec{\alpha}^*,w^*)$ that minimizes the variance of the estimator, subject to the constraint $w(g)\alpha_g = c\gamma'_g$ for all $g$, has weights $w^*$ of the form given above. Since the original pair $(\vec{\alpha},w)$ satisfies this constraint, the minimizing pair $(\vec{\alpha}^*,w^*)$ must satisfy $\mathbb{E}[\hat{L}(w^*,\vec{\alpha}^*,n;f,\ell)] = \mathbb{E}[\hat{L}(w,\vec{\alpha},n;f,\ell)]$ and $\mathrm{Var}[\hat{L}(w^*,\vec{\alpha}^*,n;f,\ell)] \leq \mathrm{Var}[\hat{L}(w,\vec{\alpha},n;f,\ell)]$. ∎

Since the estimation of risk functions is a key component of learning, Proposition 1 illuminates an interplay between the roles of sampling allocations and group-weighting schemes like IW and GDRO. When allocations and weights are jointly optimized, the optimal allocation accounts for an implicit target distribution $\vec{\gamma}'$ (defined above), which may vary by objective function, while the optimal weights account for the per-group variability $\mathrm{Var}_{(x,y)\sim\mathcal{D}_g}[\ell(f(x),y)]$. In Section 4 we find that it can be advantageous to use IW and GDRO when some groups have small $\alpha_g/\gamma_g$, though the boost in accuracy is smaller than that of an optimally allocated training set to begin with, and diminishes when all groups are appropriately represented in the training set allocation.
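To make the construction in Proposition 1 concrete, a small helper can compute the variance-optimal pair $(\vec{\alpha}^*, w^*)$ for a given target distribution $\vec{\gamma}'$, assuming the per-group loss variances have been estimated elsewhere (e.g., on held-out data). This is a sketch of the proposition's construction, not an estimator we deploy in the experiments.

```python
import numpy as np

def variance_optimal_pair(gamma_prime, loss_vars):
    """Variance-optimal (alpha*, w*) for estimating a gamma'-weighted risk.

    gamma_prime: array, target distribution over groups (sums to 1).
    loss_vars: array, estimated Var_{D_g}[loss] for each group.
    Returns an allocation alpha* on the simplex and weights w* satisfying
    w*(g) * alpha*_g proportional to gamma'_g (Proposition 1).
    """
    w_star = 1.0 / np.sqrt(loss_vars)           # w*(g) proportional to Var^(-1/2)
    alpha_star = gamma_prime / w_star           # from w(g) * alpha_g = c * gamma'_g
    alpha_star = alpha_star / alpha_star.sum()  # project onto the simplex
    # Rescale weights so sum_g alpha_g * w(g) = 1 preserves the estimand.
    w_star = gamma_prime / alpha_star
    return alpha_star, w_star
```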

3 Allocating samples to minimize population-level risk

Having motivated the importance of group allocations, we now investigate the direct effects of training set allocations on group and population risks. Using a model of per-group performance as a function of allocations, we study the optimal allocations under a variety of settings.

3.1 A Per-group Power-law Scaling Model

We model the impact of allocations on performance with scaling laws that describe per-group risks as a function of the number of data points from their respective group, as well as the total number of training instances.

Assumption 1 (Group risk scaling with allocation).

The group risks $\mathcal{R}(\hat{f};\mathcal{D}_g) := \mathbb{E}_{(x,y)\sim\mathcal{D}_g}[\ell(\hat{f}(x),y)]$ scale approximately as the sum of inverse power functions of the number of samples from group $g$ and the total number of samples. That is, there exist $M_g > 0$; $\sigma_g,\tau_g,\delta_g \geq 0$; and $p,q > 0$ such that for a learning procedure that returns predictor $\hat{f}(\mathcal{S})$, and a training set $\mathcal{S}$ with group sizes $n_g \geq M_g$:

$$\mathcal{R}\bigl(\hat{f}(\mathcal{S}(\vec{\alpha},n));\mathcal{D}_g\bigr) \approx r(\alpha_g n, n; \sigma_g,\tau_g,\delta_g,p,q) \quad \forall g\in\mathcal{G}, \quad\text{where}\quad r(n_g,n;\sigma_g,\tau_g,\delta_g,p,q) := \sigma_g^2 n_g^{-p} + \tau_g^2 n^{-q} + \delta_g. \qquad (4)$$

Assumption 1 is similar to the scaling law in Chen et al. (2018), but includes a $\tau_g^2 n^{-q}$ term to allow for data from other groups to influence the risk evaluated on group $g$. It additionally requires that the same exponents $p,q$ apply to each group, an assumption that underpins our theoretical results in Section 3. We examine the extent to which Assumption 1 holds empirically in Section 4.3, and will modify Eq. (4) to include group-dependent terms $p_g,q_g$ when appropriate. The following examples give intuition into the form of Eq. (4).

Example 1 (Split classifiers per group).

When separate models are trained for each group, using training data only from that group, we expect Eq. (4) to apply with $\tau_g = 0$ for all $g\in\mathcal{G}$. The parameter $p$ could be derived through generalization bounds (Boucheron et al., 2005) or through modeling assumptions (Example 3).

It is often advantageous to pool training data to learn a single classifier. In this case, model performance evaluated on group $g$ will depend on both $n_g$ and $n$, as the next examples show.

Example 2 (Groups irrelevant to prediction).

When groups are irrelevant for prediction and the model class $\mathcal{F}$ correctly accounts for this, we expect Eq. (4) to apply with $\sigma_g = 0$ for all $g\in\mathcal{G}$.

Example 3 (Shared linear model with group-dependent intercepts).

Consider a $(d+1)$-dimensional linear model, where two groups $\{A,B\}$ share a weight vector $\beta$ and features $x\sim\mathcal{N}(0,\Sigma_x)$, but the intercept varies by group:

$$y_i = \beta^\top x_i + c_A\,\mathbb{I}[g_i=A] + c_B\,\mathbb{I}[g_i=B] + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0,\sigma^2).$$

As we show in Section A.5, the ordinary least squares predictor has group risks $\mathbb{E}_{(x,y)\sim\mathcal{D}_g}[(x^\top\hat{\beta} + \hat{c}_g - y)^2] = \sigma^2\left(1 + 1/n_g + O(d/n)\right)$, where the $1/n_g$ term arises because we need samples from group $g$ to estimate the intercept $c_g$, whereas samples from both groups help us estimate $\beta$.
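A quick Monte Carlo simulation can corroborate the $1/n_g$ behavior in Example 3; in the sketch below, the dimension, intercepts, and group sizes are arbitrary illustrative choices.

```python
import numpy as np

def simulate_group_risk(n_A, n_B, d=10, sigma=1.0, trials=200, seed=0):
    """Monte Carlo estimate of group-A test risk for the OLS fit in Example 3."""
    rng = np.random.default_rng(seed)
    beta = rng.normal(size=d)
    c_A, c_B = 1.0, -1.0  # group-dependent intercepts
    risks = []
    for _ in range(trials):
        # Design matrix: shared features plus one indicator column per group.
        X = rng.normal(size=(n_A + n_B, d))
        ind = np.zeros((n_A + n_B, 2))
        ind[:n_A, 0] = 1.0  # group A indicator
        ind[n_A:, 1] = 1.0  # group B indicator
        Z = np.hstack([X, ind])
        y = X @ beta + ind @ np.array([c_A, c_B]) + sigma * rng.normal(size=n_A + n_B)
        theta = np.linalg.lstsq(Z, y, rcond=None)[0]  # OLS fit
        # Fresh test data from group A.
        X_t = rng.normal(size=(2000, d))
        y_t = X_t @ beta + c_A + sigma * rng.normal(size=2000)
        pred = X_t @ theta[:d] + theta[d]  # uses group A's estimated intercept
        risks.append(np.mean((pred - y_t) ** 2))
    return np.mean(risks)  # roughly sigma^2 * (1 + 1/n_A + O(d/n))
```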

Example 3 suggests that in some settings, we can relate $\sigma_g$ and $\tau_g$ to ‘group-specific’ and ‘group-agnostic’ model components that affect performance for group $g$. In general, the relationship between group sizes and group risks can be more nuanced. Data from different groups may be correlated, so that samples from groups similar to or different from $g$ have greater effect on $\mathcal{R}(\hat{f};\mathcal{D}_g)$ (see Section 4.5). Eq. (4) is meant to capture the dominant effects of training set allocations on group risks, and serves as our main structural assumption in the next section, where we study the allocation that minimizes the approximate population risk.

3.2 Optimal (w.r.t. Population Risk) Allocations

We now study properties of the allocation that minimizes the approximated population risk:

$$\hat{\mathcal{R}}(\vec{\alpha},n) := \sum_{g\in\mathcal{G}} \gamma_g\, r(\alpha_g n, n; \sigma_g,\tau_g,\delta_g,p,q) \approx \sum_{g\in\mathcal{G}} \gamma_g\, \mathcal{R}\bigl(\hat{f}(\mathcal{S});\mathcal{D}_g\bigr) = \mathcal{R}\bigl(\hat{f}(\mathcal{S});\mathcal{D}\bigr). \qquad (5)$$

The following proposition lays the foundation for two corollaries, which show that:

1. when only the population prevalences $\vec{\gamma}$ vary between groups, the allocation that minimizes the approximate population risk up-represents groups with small $\gamma_g$;

2. for two groups with different scaling parameters $\sigma_g$, the optimal allocation of the group with $\gamma_g < \tfrac{1}{2}$ is bounded by functions of $\sigma_A$, $\sigma_B$, and $\vec{\gamma}$.

Proposition 2.

Given a population made up of disjoint groups $g\in\mathcal{G}$ with population prevalences $\gamma_g$, under the conditions of Assumption 1, the allocation $\vec{\alpha}^*\in\Delta^{|\mathcal{G}|}$ that minimizes the approximated population risk $\hat{\mathcal{R}}$ in Eq. (5) has elements:

$$\alpha_g^* = \frac{(\gamma_g\sigma_g^2)^{1/(p+1)}}{\sum_{g'\in\mathcal{G}}(\gamma_{g'}\sigma_{g'}^2)^{1/(p+1)}}. \qquad (6)$$

If $\sigma_g = 0$ for all $g\in\mathcal{G}$, then any allocation in $\Delta^{|\mathcal{G}|}$ minimizes $\hat{\mathcal{R}}$.

The proof of Proposition 2 appears in Section A.2. Note that $\vec{\alpha}^*$ does not depend on $n$, $\{\tau_g\}_{g\in\mathcal{G}}$, or $q$; this will in general not hold if the powers $p_g$ differ by group.
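Since Eq. (6) is in closed form, computing $\vec{\alpha}^*$ is a one-liner; the helper below transcribes it directly (the docstring example reproduces the counterexample in the remark below Corollary 1).

```python
import numpy as np

def optimal_allocation(gamma, sigma2, p):
    """Population-risk-minimizing allocation from Eq. (6).

    gamma: array of population prevalences gamma_g.
    sigma2: array of per-group constants sigma_g^2 from the scaling law.
    p: shared scaling exponent.
    Example (remark below Corollary 1):
        optimal_allocation(np.array([.68, .30, .01, .01]), np.ones(4), p=1)
    gives alpha_2* ~= 0.348 > gamma_2 = 0.30.
    """
    unnormalized = (gamma * sigma2) ** (1.0 / (p + 1))
    return unnormalized / unnormalized.sum()
```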

We now study the form of $\vec{\alpha}^*$ under illustrative settings. Corollary 1 shows that when the group scaling parameters $\sigma_g$ in Eq. (4) are equal across groups, the allocation that minimizes the approximate population risk allocates samples to minority groups at higher rates than their population prevalences. The proof of Corollary 1 appears in Section A.3.

Corollary 1 (Many groups with equal $\sigma_g$).

When $\sigma_g = \sigma > 0$ for all $g\in\mathcal{G}$, the allocation that minimizes $\hat{\mathcal{R}}$ in Eq. (5) satisfies $\alpha_g^* \geq \gamma_g$ for any group with $\gamma_g \leq \frac{1}{|\mathcal{G}|}$.

This shows that the allocation that minimizes population risk can differ from the actual population prevalences $\vec{\gamma}$. In fact, Corollary 1 asserts that near the allocation $\vec{\alpha}=\vec{\gamma}$, the marginal returns to additional data from group $g$ are largest for groups with small $\alpha_g$, enough so as to offset the small weight $\gamma_g$ in Eq. (1). This result provides evidence against the idea that a small training set allocation to minority groups might comply with minimizing population risk as a result of a small relative contribution to the population risk.

Remark. A counterexample shows that $\alpha_g^* \leq \gamma_g$ does not hold for all $g$ with $\gamma_g > 1/|\mathcal{G}|$. Take $\vec{\gamma} = [.68, .30, .01, .01]$ and $p=1$; Eq. (6) gives $\alpha_2^* > 0.3 = \gamma_2 > 1/4$. In general, whether a group $g$ with $\gamma_g \geq 1/|\mathcal{G}|$ gets up- or down-sampled depends on the distribution of $\vec{\gamma}$ across all groups.

Complementing the investigation of the role of the population proportions $\vec{\gamma}$ in Corollary 1, the next corollary shows that the optimal allocation $\vec{\alpha}^*$ generally depends on the relative values of $\sigma_g$ between groups. Inspecting Eq. (4) shows that $\sigma_g$ defines a limit of performance: if $\sigma_g^2$ is large, the only way to make the approximate risk for group $g$ small is to make $n_g$ large. For two groups, we can bound the optimal allocations $\vec{\alpha}^*$ in terms of $\{\sigma_g\}_{g\in\mathcal{G}}$ and the population proportions $\vec{\gamma}$. Without loss of generality, we let $A$ be the smaller of the two groups. From Eq. (6), we know that for two groups, $\alpha^*_A$ is increasing in $\tfrac{\sigma_A}{\sigma_B}$; Corollary 2 gives upper and lower bounds on $\alpha^*_A$ in terms of $\sigma_A$ and $\sigma_B$. Corollary 2 is proved in Section A.4.

Corollary 2 (Unequal per-group constants).

For two groups $\{A,B\}=\mathcal{G}$ with $\gamma_A < \gamma_B$ and parameters $\sigma_A,\sigma_B > 0$ in Eq. (4), the allocation $\alpha^*_A$ of the smaller group that minimizes $\hat{\mathcal{R}}$ in Eq. (5) is upper and lower bounded as

$$\frac{\gamma_A(\sigma_A^2)^{1/(p+1)}}{\gamma_A(\sigma_A^2)^{1/(p+1)} + \gamma_B(\sigma_B^2)^{1/(p+1)}} \;<\; \alpha^*_A \;<\; \frac{(\sigma_A^2)^{1/(p+1)}}{(\sigma_A^2)^{1/(p+1)} + (\sigma_B^2)^{1/(p+1)}}.$$

When $\sigma_A \geq \sigma_B$, $\alpha_A^* > \gamma_A$; and when $\sigma_A \leq \sigma_B$, $\alpha_A^* < 1/2$.

Altogether, these results highlight key properties of training set allocations that minimize population risk. Experiments in Section 4 give further insight into the values of weights and allocations for minimizing group and population risks and apply the scaling law model in real data settings.

4 Empirical Results

Having shown the importance of training set allocations from a theoretical perspective, we now provide a complementary empirical investigation of this phenomenon. Throughout our experiments, we use a diverse collection of datasets to give as full a picture of the empirical phenomena as possible (Section 4.1 and Table 1). See Appendix B for full details on each experimental setup.²

²Code to replicate the experiments is available at https://github.com/estherrolf/representation-matters.

The first set of experiments (Section 4.2) investigates group and population accuracies as a function of the training set allocation $\vec{\alpha}$, by sub-sampling different training set allocations for a fixed training set size. We also study the extent to which importance weighting and group distributionally robust optimization can increase group accuracies, complementing the results of Proposition 1 with an empirical perspective. The second set of experiments (Section 4.3) uses a similar subsetting procedure to examine the fit of the scaling model proposed in Section 3. The third set of experiments (Section 4.4) investigates using the estimated scaling law fits to inform future sampling practices: we simulate using a small pilot training set to inform targeted dataset augmentation through collecting additional samples. In Section 4.5, we probe a setting where we might expect the scaling law to be too simplistic, exposing the need for more nuanced modeling in such settings. In contrast to Section 2, here losses are defined over sets of data; note that AUROC is not separable over groups, and thus Eq. (1) does not apply for this metric.

4.1 Datasets

Here we give a brief description of each dataset we use. For more details, see Tables 1 and B.1.

Modified CIFAR-4. To create a dataset where we can ensure class balance across groups, we modify the CIFAR-10 dataset (Krizhevsky, 2009) by subsetting to the bird, car, horse, and plane classes. We predict whether the image subject moves primarily by air (plane/bird) or by land (car/horse), and group by whether the image contains an animal (bird/horse) or a vehicle (car/plane); see Figure B.1. We set the population prevalence of the vehicle group to $\gamma = 0.9$.

ISIC. The International Skin Imaging Collaboration dataset is a set of labeled images of skin lesions designed to aid development and standardization of imaging procedures for automatic detection of melanomas (Codella et al., 2019). For our main analysis, we follow preprocessing steps similar to those of Sohoni et al. (2020), removing any images with patches. We predict whether a lesion is benign or malignant, and group instances by the approximate age of the patient in each image.

Goodreads. Given publicly available book reviews compiled from the Goodreads database (Wan & McAuley, 2018), we predict the rating (1-5) corresponding to each review. We group instances by genre.

Mooc. The HarvardX-MITx Person-Course Dataset contains student demographic and activity data from 2013 offerings of online courses (HarvardX, 2014). We predict whether a student earned a certification in the course and we group instances by highest completed level of education.

Adult. The Adult dataset, downloaded from the UCI Machine Learning Repository (Dua & Graff, 2017) and originally taken from the 1994 Census database, has a prediction task of whether an individual’s annual income is over \$50,000. We group instances by sex, codified as binary male/female, and exclude features that directly determine group status.

Table 1: Brief description of datasets; details are given in Section B.1.

| dataset | groups $\{A,B\}$ | $\gamma_A$ | $\min_g n_g$ | $n_{\text{test}}$ | target label | loss metric | main model used |
|---|---|---|---|---|---|---|---|
| CIFAR-4 | {animal, vehicle} | 0.1 | 10,000 | 4,000 | air/land | 0/1 loss | resnet-18 |
| ISIC | {age $\geq 55$, age $< 55$} | 0.43 | 4,092 | 2,390 | benign/malignant | 1 − AUROC | resnet-18 |
| Goodreads | {history, fantasy} | 0.38 | 50,000 | 25,000 | book rating (1–5) | $\ell_1$ loss | logistic regression |
| Mooc | {edu $\leq 2^\circ$, edu $> 2^\circ$} | 0.16 | 3,897 | 6,032 | certified | 1 − AUROC | random forest |
| Adult | {female, male} | 0.5 | 10,771 | 16,281 | income > \$50K | 0/1 loss | random forest |

4.2 Allocation-aware Objectives vs. Ideal Allocations

Figure 1: Performance across $\vec{\alpha}$. Shaded regions denote one standard deviation above and below the mean over 10 trials. Stars indicate population minima for each objective (ERM: white, IS/IW: grey, GDRO: black).

We first investigate (a) the change in group and population performance at different training set allocations, and (b) the extent to which optimizing the three objective functions defined in Section 2.2 decreases average and group errors.

For each dataset, we vary the training set allocation $\vec{\alpha}$ between $(0,1)$ and $(1,0)$ while fixing the training set size at $n = \min_g n_g$ (see Table 1), and evaluate the per-group and population losses on subsets of the held-out test sets. For the image classification tasks, we compare group-agnostic empirical risk minimization (ERM) to importance weighting (implemented via importance sampling (IS) of batches, following the findings of Buda et al. (2018)) and group distributionally robust optimization (GDRO) with group-dependent regularization, as described in Sagawa et al. (2020). For the non-image datasets, we implement importance weighting (IW) by weighting instances in the loss function during training, and do not compare to GDRO, as the gradient-based algorithm of Sagawa et al. (2020) is not easily adaptable to the predictors we use for these datasets. We pick hyperparameters for each method based on cross-validation results over a coarse grid of $\vec{\alpha}$; for IS, IW, and GDRO, we allow the hyperparameters to vary with $\vec{\alpha}$; for ERM we choose a single hyperparameter setting for all $\vec{\alpha}$ values.

Figure 1 highlights the importance of at least a minimal representation of each group in order to achieve low population loss (black curves) for all objectives. For CIFAR-4, the population loss increases sharply for $\alpha_A < 0.1$ and $\alpha_A > 0.8$; for ISIC, it does so when $\alpha_A < 0.2$. While not as crucial for achieving low population losses for the remaining datasets, the optimal allocations $\vec{\alpha}^*$ (stars) do require a minimal representation of each group. The $\vec{\alpha}^*$ are largely consistent across the training objectives (different star colors), and the population losses (black curves) are largely consistent across mid-range values of $\alpha_A$ for all training objectives. The relatively shallow slopes of the black curves for $\alpha_A$ near $\alpha^*_A$ (stars) stand in contrast to the per-group losses (blue and orange curves), which can vary considerably as $\vec{\alpha}$ changes. From the perspective of model evaluation, this reinforces a well-documented need for more comprehensive reporting of performance. From the view of dataset design, it exposes an opportunity to choose allocations that optimize diverse evaluation objectives while maintaining low population loss. Experiments in Section 4.4 investigate this further.

Across the CIFAR-4 and ISIC tasks, GDRO (dotted curves) is more effective than IS (dashed curves) at reducing per-group losses. This is expected, as minimizing the largest loss of any group is the explicit objective of GDRO. Figure 1 shows that GDRO can also improve the population loss (see $\alpha_A > 0.7$ for CIFAR-4 and $\alpha_A < 0.2$ for ISIC) as a result of minimizing the worst group loss. Importance weighting (dashed curves) has little effect on performance for Mooc and Adult (random forest models), and actually increases the loss for Goodreads (multiclass logistic regression model).

For all the datasets we study, the advantages of using IS or GDRO are greatest when one group has a very small training set allocation ($\alpha_A$ near 0 or 1). When allocations are optimized (stars in Figure 1), the boost that these methods give over ERM diminishes. In light of Proposition 1, these results suggest that in practice, part of the value of such methods lies in compensating for sub-optimal allocations. We find, however, that explicitly optimizing the maximum per-group loss with GDRO can reduce population loss more effectively than directly accounting for allocations with IS.

Section C.2 shows that similar phenomena hold for different loss functions and models on the same dataset, though the exact $\vec{\alpha}^*$ can differ. In Section C.1, we show that losses of groups with small $\alpha_g$ can degrade more severely when group attributes are included in the feature matrix, likely a result of the small number of samples from which to learn group-specific model components (see Example 3).

4.3 Assessing the Scaling Law Fits

Table 2: Estimated scaling parameters for Eq. (7). Parentheses denote standard deviations estimated by the nonlinear least squares fit. Parameters are constrained so that $\hat{\tau}_g,\hat{\sigma}_g,\hat{\delta}_g \geq 0$ and $\hat{p}_g,\hat{q}_g \in [0,2]$.

| dataset | $M_g$ | group $g$ | $\hat{\sigma}_g$ | $\hat{p}_g$ | $\hat{\tau}_g$ | $\hat{q}_g$ | $\hat{\delta}_g$ |
|---|---|---|---|---|---|---|---|
| CIFAR-4 | 500 | animal | 1.9 (0.12) | 0.47 (9.8e-04) | 4.5e-09 (1.8e+06) | 2.0 (0.0e+00) | 1.1e-03 (8.9e-06) |
| | | vehicle | 1.6 (0.19) | 0.54 (2.0e-03) | 3.2e-12 (1.1e+06) | 2.0 (0.0e+00) | 1.4e-03 (2.8e-06) |
| ISIC | 200 | age $\geq 55$ | 0.61 (1.7e-03) | 0.20 (1.1e-03) | 1.7e-09 (1.9e+04) | 1.9 (0.0e+00) | 1.4e-15 (6.1e-04) |
| | | age $< 55$ | 0.26 (9.3e-04) | 0.13 (0.012) | 0.61 (0.044) | 0.3 (7.5e-03) | 7.5e-11 (7.2e-03) |
| Goodreads | 2500 | history | 0.16 (1.2e-03) | 0.074 (2.5e-03) | 2.5 (0.058) | 0.37 (2.0e-04) | 0.41 (3.0e-03) |
| | | fantasy | 0.62 (0.69) | 0.020 (1.2e-03) | 3.1 (0.093) | 0.39 (1.9e-04) | 7.2e-21 (0.72) |
| Mooc | 50 | edu $\leq 2^\circ$ | 0.08 (2.6e-05) | 0.14 (6.0e-03) | 0.73 (0.059) | 0.63 (4.8e-03) | 1.3e-15 (2.6e-04) |
| | | edu $> 2^\circ$ | 0.038 (6.2e-04) | 0.068 (6.3e-03) | 0.54 (6.5e-03) | 0.61 (9.8e-04) | 2.8e-12 (8.0e-04) |
| Adult | 50 | female | 0.078 (0.051) | 0.018 (3.6e-03) | 0.43 (8.3e-03) | 0.59 (1.6e-03) | 8.0e-16 (0.052) |
| | | male | 0.066 (2.6e-05) | 0.21 (1.2e-03) | 0.47 (6.5e-03) | 0.50 (1.1e-03) | 0.16 (5.4e-06) |

For each dataset, we combine the results in Figure 1 with extra subsetting runs in which we vary both $n_g$ and $n$. From the combined results, we use nonlinear least squares to estimate the parameters of modified scaling laws in which the exponents can differ by group:

$$\text{loss}_g \approx \sigma_g^2 n_g^{-p_g} + \tau_g^2 n^{-q_g} + \delta_g. \qquad (7)$$
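A minimal version of this fitting step can be written with scipy.optimize.curve_fit, as sketched below for a single group, with the same sign and box constraints reported in Table 2. The initialization and other fitting details here are illustrative assumptions; full details of our procedure are in Appendix B.

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(nn, sigma, p, tau, q, delta):
    """Eq. (7): loss_g ~ sigma^2 * n_g^(-p) + tau^2 * n^(-q) + delta."""
    n_g, n = nn
    return sigma**2 * n_g**(-p) + tau**2 * n**(-q) + delta

def fit_group_scaling(n_g, n, losses):
    """Fit the per-group scaling law by nonlinear least squares.

    n_g, n, losses: arrays of observed (group size, total size, loss) triples.
    """
    params, cov = curve_fit(
        scaling_law, (n_g, n), losses,
        p0=[1.0, 0.5, 1.0, 0.5, 1e-3],
        # Constraints matching Table 2: sigma, tau, delta >= 0; p, q in [0, 2].
        bounds=([0, 0, 0, 0, 0], [np.inf, 2, np.inf, 2, np.inf]),
        maxfev=20000,
    )
    std_errs = np.sqrt(np.diag(cov))  # std. devs. as reported in Table 2
    return params, std_errs
```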

The estimated parameters of Eq. (7), given in Table 2, capture different high-level phenomena across the five datasets. For CIFAR-4, $\hat{\tau}_g \approx 0$ for both groups, indicating that most of the group performance is explained by $n_g$, the number of training instances from that group, whereas the total number of data points $n$ has less influence. For Goodreads, both $n_g$ and $n$ have influence in the fitted model, though $\hat{\tau}_g$ and $\hat{q}_g$ are larger than $\hat{\sigma}_g$ and $\hat{p}_g$, respectively. For ISIC, $\hat{\tau}_A \approx 0$ but $\hat{\tau}_B \not\approx 0$, suggesting that other-group data has little effect on the first group but is beneficial to the latter. For the non-image datasets (Goodreads, Mooc, and Adult), $0 < \hat{\sigma}_g < \hat{\tau}_g$ and $\hat{p}_g < \hat{q}_g$ for all groups.

These results shed light on the applicability of the assumptions made in Section 3. Figure B.2 in Section B.5 shows that the fitted curves capture the overall trends of per-group losses as a function of $n$ and $n_g$. However, the assumptions of Proposition 2 and Corollaries 1 and 2 (e.g., equal $p_g$ for all $g\in\mathcal{G}$) are not always reflected in the empirical fits. The results in Section 3 use Eq. (4) to describe optimal allocations under different hypothetical settings; we find that allowing the scaling parameters to vary by group, as in Eq. (7), is more realistic in empirical settings.

The estimated models describe the overall trends (Figure B.2), but the parameter estimates are variable (Table 2), indicating that a range of parameters can fit the data, a well-known phenomenon in fitting power laws to data (Clauset et al., 2009). While we caution against absolute or prescriptive interpretations based on the estimates given in Table 2, if such interpretations are desired (Chen et al., 2018), we suggest evaluating variation due to subsetting patterns and comparing to alternative models such as log-normal and exponential fits (cf. Clauset et al., 2009).

Figure 2: Pilot sample experiment. Panels show the results of the three allocations $\vec{\alpha} \in \{\hat{\alpha}^*_{\text{minmax}}, \vec{\gamma}, (1/2,1/2)\}$ for different sizes of the new training sets, compared with an $\alpha^*_{\text{grid}}$ baseline that minimizes the maximum group loss over a grid of resolution 0.01, averaged over the random trials. Purple circles indicate average maximum error over groups and grey diamonds indicate average population error. Ranges denote standard errors taken over the 10 trials.

4.4 Targeted Data Collection with Fitted Scaling Laws

We now study the use of scaling models fitted on a small pilot dataset to inform a subsequent data collection effort. Given the results of Section 4.2, we aim to collect a training set that minimizes the maximum loss on any group. This procedure goes beyond the descriptive use of the estimated scaling models in Section 4.3; important considerations for operationalizing these findings are discussed below.

We perform this experiment with the Goodreads dataset, the largest of the five we study. The pilot sample contains 2,500 instances from each group, drawn at random from the full training set. We estimate the parameters of Eq. (7) using a procedure similar to that described in Section 4.3. For a new training set of size $n_{\text{new}}$, we suggest an allocation that minimizes the maximum forecasted loss of any group:

$$\hat{\alpha}^*_{\text{minmax}} = \operatorname*{argmin}_{\vec{\alpha}\in\Delta^2}\; \max_{g\in\mathcal{G}} \left(\hat{\sigma}_g^2(\alpha_g n_{\text{new}})^{-\hat{p}_g} + \hat{\tau}_g^2\, n_{\text{new}}^{-\hat{q}_g} + \hat{\delta}_g\right).$$

For $n_{\text{new}} \in [1\times, 2\times, 4\times, 8\times]$ the pilot sample size, we simulate collecting a new training set by drawing $n_{\text{new}}$ fresh samples from the training set with allocation $\vec{\alpha} = \hat{\alpha}^*_{\text{minmax}}(n_{\text{new}})$. We train a model on this sample (ERM objective) and evaluate on the test set. For comparison, we also sample at $\vec{\alpha} = \vec{\gamma}$ (population proportions) and $\vec{\alpha} = (0.5, 0.5)$ (equal allocation to both groups). We repeat the experiment, starting with the random instantiation of the pilot dataset, for ten trials. As a point of comparison, we also compute the results for all $\vec{\alpha}$ in a grid of resolution 0.01, and denote the allocation value in this grid that minimizes the average maximum group loss over the ten trials as $\alpha^*_{\text{grid}}$.
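For two groups, the minmax problem defining $\hat{\alpha}^*_{\text{minmax}}$ is one-dimensional over $\alpha_A$, so a bounded scalar search suffices. The sketch below uses SciPy and assumes fitted parameter tuples for each group; the boundary padding `eps` is an illustrative choice to keep $n_g > 0$.

```python
from scipy.optimize import minimize_scalar

def forecast_loss(n_g, n, params):
    """Forecasted group loss under fitted Eq. (7) parameters."""
    sigma2, p, tau2, q, delta = params
    return sigma2 * n_g**(-p) + tau2 * n**(-q) + delta

def minmax_allocation(params_A, params_B, n_new, eps=1e-4):
    """alpha_hat*_minmax for two groups: minimize the max forecasted loss."""
    def max_group_loss(alpha_A):
        return max(forecast_loss(alpha_A * n_new, n_new, params_A),
                   forecast_loss((1 - alpha_A) * n_new, n_new, params_B))
    res = minimize_scalar(max_group_loss, bounds=(eps, 1 - eps), method="bounded")
    return res.x  # estimated alpha_A; alpha_B = 1 - alpha_A
```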

Among the three allocation strategies we compare, $\vec{\alpha} = \hat{\alpha}^*_{\text{minmax}}$ minimizes the average maximum loss over groups, across $n_{\text{new}}$ (Figure 2). Since per-group losses generally decrease with increased allocation to that group, we expect the best minmax loss over groups to be achieved when the purple and grey bars meet in Figure 2. The allocation strategy $\vec{\alpha} = \hat{\alpha}^*_{\text{minmax}}$ does not quite achieve this; however, it does not increase the population loss over that of the other allocation strategies. This reinforces the finding of Section 4.2 that different per-group losses can be reached at similar population losses, and provides evidence that we can navigate these possible outcomes by leveraging information from a small initial sample.

While the results in Figure 2 are promising, error bars highlight the variation across trials. The variability in performance across trials for the allocation baseline $\alpha^*_{\text{grid}}$ (which is kept constant across the ten trials) is largely consistent with that of the other allocation sampling strategies examined (standard errors in Figure 2). However, the estimation of $\hat{\alpha}^*$ in each trial does introduce additional variation: across the ten draws of the pilot data, the range of $\hat{\alpha}^*$ values for subsequent dataset size $n_{\text{new}} = 5{,}000$ is [2e-04, 0.04]; for $n_{\text{new}} = 10{,}000$ it is [1e-04, 0.05]; for $n_{\text{new}} = 20{,}000$ it is [5e-05, 0.14]; and for $n_{\text{new}} = 40{,}000$ it is [2e-05, 0.82]. Therefore, the estimated $\hat{\alpha}^*$ should be leveraged with caution, especially if the subsequent sample will be much larger than the pilot sample. Further caution should be taken if there may be distribution shifts between the pilot and subsequent samples. We suggest interpreting estimated $\hat{\alpha}^*$ values as one signal among many that can inform dataset design, in conjunction with current and emerging practices for ethical data collection (see Section 5).

4.5 Interactions Between Groups

We now shift the focus of our analysis to explore potential between-group and within-group interactions that are more nuanced than the scaling law in Eq. (4) provides for. The results highlight the need for, and encourage, future work extending our analysis to more complex notions of groups (e.g., intersectional, continuous, or proxy groups).

As discussed in Section 3, data from groups similar to or different from group $g$ may have greater effect on $\mathcal{R}(\hat{f}(\mathcal{S});\mathcal{D}_g)$ than data drawn at random from the entire distribution. We examine this possibility on the ISIC dataset, which is aggregated from different studies (Section B.1). We measure baseline performance of the model trained on data from all of the studies. We then remove one study at a time from the training set, retrain the model, and evaluate the change in performance for all studies in the test set.

Figure 3(a) shows the percent changes in performance due to leaving out studies from the training set. The MSK and UDA studies comprise 5 and 2 sub-studies, respectively; Figure 3(b) shows the results of leaving out each sub-study. Rows correspond to the study withheld from the training set and columns correspond to the study used for evaluation. Rows and columns are ordered by % malignancy; for Figure 3(a) this is the same as ordering by dataset size, SONIC being the largest study.

Consistent with our modelling assumptions and results so far, accuracies evaluated on group gg decrease as a result of removing group gg from the training set (diagonal entries of Figure 3). However, additional patterns show more nuanced relationships between groups.

Positive values in the upper-right regions of Figures 3(a) and 3(b) show that excluding studies with low malignancy rates can raise performance evaluated on studies with high malignancy rates. This could be partially due to differences in label distributions when certain studies are removed from the training data. Importantly, this provides a counterpoint to an assumption implicit in Assumption 1: that group risks decrease in the total training set size $n$, regardless of the groups to which these $n$ instances belong. To study more nuanced interactions between pairs of groups $g' \neq g$, future work could modify Eq. (4) by reparameterizing $r(\cdot)$ to directly account for $n_{g'}$.

Figure 3: Percent change in performance (AUROC / accuracy) due to withholding a study from the training set; panel (a) groups by study, panel (b) by sub-study. Studies are ordered by % malignancy of the data in the evaluation set. SONIC and MSK-5 contain all benign instances and the 2018 JID Editorial Images dataset contains all malignant instances; for these we report % change in binary accuracy. For the remaining groups, we report % change in AUROC. Note that the random training/test splits differ between (a) and (b), accounting for the differences in values for corresponding cells between the two figures.

Grouping by sub-studies within the UDA and MSK studies reveals that even within well-defined groups, interactions between subgroups can arise. Negative off-diagonal entries in Figure 3(b) suggest strong interactions between different groups, underscoring the importance of evaluating results across hierarchies and intersections of groups when feasible.

Of the 16,965 images in the full training set, 7,401 are from the SONIC study. When evaluating on all non-SONIC instances (i.e., the evaluation set used in the rest of the paper), withholding the SONIC study from the training set (akin to the training set used in the rest of the paper) leads to higher AUROC (0.905) than training on all studies (0.890). This demonstrates that more data is not always better, especially if distributional differences between the additional data and the target population are not well accounted for.

5 Generalizations, limitations, and future work

We study the ways in which group and population performance depend on the numerical allocations of discrete groups in training sets. While focusing on discrete groups allows us to derive meaningful results, understanding similar phenomena for intersectional groups and continuous notions of inclusion is an important next step. Addressing the more nuanced relationships between the allocations of different data sources (Section 4.5) is a first step in this direction.

We find that underrepresentation of groups in training data can limit group and population accuracies. However, assuming we can easily and ethically collect more data about any group is often naive, as there may be unintended consequences of upweighting certain groups in an objective function. Naive targeted data collection attempts can present undue burdens of surveillance or skirt consent (Paullada et al., 2020). When ML systems fail to represent subpopulations due to measurement or construct validity issues, more comprehensive interventions are needed (Jacobs & Wallach, 2019).

Our results expose key properties of sub-group representation in training data from a statistical sampling perspective, complementary to current and emerging practices for ethical, contextualized data collection and curation (Gebru et al., 2018; Gebru, 2020; Denton et al., 2020; Abebe et al., 2021). Studying the role of numerical allocation targets within ethical and context-aware data collection practices will be an important step toward operationalizing our findings.

Representation is a broad and often ambiguous concept (Chasalow & Levy, 2021), and numerical allocation is an imperfect proxy of representation or inclusion. For example, if the data annotation process systematically misrepresents certain groups, optimizing allocations to maximize accuracy with respect to those labels would not reflect our true goals across all groups. If prediction models are tailored to majority groups and are less relevant for smaller groups, an optimized allocation might allocate more data points to smaller groups as a remedy, when in reality a better model or feature representation is preferable. True representation thus requires that each data instance measures the intended variables and their complexities in addition to numerical allocations. That said, if the optimal allocation for a certain group is well beyond that group’s population proportion, this may be cause to reflect on why that is the case. Future work could contextualize allocations as a way of auditing the limits of machine learning systems from a data-focused perspective.

By incorporating dataset collection as a design step in the learning procedure, we were able to assess and leverage the value of different data allocations toward reaching high group and population accuracies. Extending this framework to other objectives and loss functions (e.g., robustness for out-of-distribution prediction and fairness objectives) will be an important area of future work.

6 Conclusions

We demonstrate that representation in data is fundamental to training machine learning models that work for the entire population of interest. By casting dataset design as part of the learning procedure, we can formalize the characteristics that training data must satisfy in order to reach the objectives and specifications of the overall machine learning system. Empirical results bolster our theoretical findings and explore the nuances of real-data phenomena that call for domain-dependent analyses in order to operationalize our general results in specific contexts. Overall, we provide insight and constructive results toward understanding, prioritizing, and leveraging conscientious data collection for successful applications of machine learning.

Acknowledgments

We thank Inioluwa Deborah Raji and Ludwig Schmidt for feedback at various stages of this work, and Andrea Bajcsy, Sara Fridovich-Keil, and Max Simchowitz for comments and suggestions during the editing of this manuscript. We thank Nimit Sohoni and Jared Dunnmon for helpful discussions regarding the ISIC dataset.

This material is based upon work supported by the NSF Graduate Research Fellowship under Grant No. DGE 1752814. ER acknowledges the support of a Google PhD Fellowship. This research is generously supported in part by ONR awards N00014-20-1-2497 and N00014-18-1-2833, NSF CPS award 1931853, and the DARPA Assured Autonomy program (FA8750-18-C-0101).

References

  • Abebe et al. (2021) Abebe, R., Aruleba, K., Birhane, A., Kingsley, S., Obaido, G., Remy, S. L., and Sadagopan, S. Narratives and counternarratives on data sharing in Africa. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 329–341, 2021.
  • Abernethy et al. (2020) Abernethy, J., Awasthi, P., Kleindessner, M., Morgenstern, J., and Zhang, J. Adaptive sampling to reduce disparate performance. arXiv preprint arXiv:2006.06879, 2020.
  • Boucheron et al. (2005) Boucheron, S., Bousquet, O., and Lugosi, G. Theory of classification: A survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.
  • Buda et al. (2018) Buda, M., Maki, A., and Mazurowski, M. A. A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks, 106:249–259, 2018.
  • Buolamwini & Gebru (2018) Buolamwini, J. and Gebru, T. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency, pp.  77–91, 2018.
  • Byrd & Lipton (2019) Byrd, J. and Lipton, Z. C. What is the effect of importance weighting in deep learning? In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pp.  872–881. PMLR, 2019. URL http://proceedings.mlr.press/v97/byrd19a.html.
  • Chasalow & Levy (2021) Chasalow, K. and Levy, K. Representativeness in statistics, politics, and machine learning. arXiv preprint arXiv:2101.03827, 2021.
  • Chawla et al. (2002) Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. SMOTE: Synthetic minority over-sampling technique. Journal of artificial intelligence research, 16:321–357, 2002.
  • Chen et al. (2018) Chen, I. Y., Johansson, F. D., and Sontag, D. A. Why is my classifier discriminatory? In Bengio, S., Wallach, H. M., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pp.  3543–3554, 2018. URL https://proceedings.neurips.cc/paper/2018/hash/1f1baa5b8edac74eb4eaa329f14a0361-Abstract.html.
  • Clauset et al. (2009) Clauset, A., Shalizi, C. R., and Newman, M. E. Power-law distributions in empirical data. SIAM Review, 51(4):661–703, 2009.
  • Codella et al. (2019) Codella, N., Rotemberg, V., Tschandl, P., Celebi, M. E., Dusza, S., Gutman, D., Helba, B., Kalloo, A., Liopyris, K., Marchetti, M., et al. Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the international skin imaging collaboration (ISIC). arXiv preprint arXiv:1902.03368, 2019.
  • Codella et al. (2017) Codella, N. C. F., Gutman, D., Celebi, M. E., Helba, B., Marchetti, M. A., Dusza, S. W., Kalloo, A., Liopyris, K., Mishra, N. K., Kittler, H., and Halpern, A. Skin lesion analysis toward melanoma detection: A challenge at the 2017 international symposium on biomedical imaging (ISBI), hosted by the international skin imaging collaboration (ISIC). arXiv preprint, 2017.
  • Cole et al. (2013) Cole, R., Gkatzelis, V., and Goel, G. Mechanism design for fair division: Allocating divisible items without payments. In Proceedings of the Fourteenth ACM Conference on Electronic Commerce, EC ’13, pp.  251–268, New York, NY, USA, 2013. Association for Computing Machinery.
  • Cortes et al. (2010) Cortes, C., Mansour, Y., and Mohri, M. Learning bounds for importance weighting. In Lafferty, J. D., Williams, C. K. I., Shawe-Taylor, J., Zemel, R. S., and Culotta, A. (eds.), Advances in Neural Information Processing Systems 23: 24th Annual Conference on Neural Information Processing Systems 2010. Vancouver, British Columbia, Canada, pp.  442–450. Curran Associates, Inc., 2010. URL https://proceedings.neurips.cc/paper/2010/hash/59c33016884a62116be975a9bb8257e3-Abstract.html.
  • Denton et al. (2020) Denton, E., Hanna, A., Amironesei, R., Smart, A., Nicole, H., and Scheuerman, M. K. Bringing the people back in: Contesting benchmark machine learning datasets. arXiv preprint arXiv:2007.07399, 2020.
  • Dotan & Milli (2020) Dotan, R. and Milli, S. Value-laden disciplinary shifts in machine learning. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp.  294–294, 2020.
  • Dua & Graff (2017) Dua, D. and Graff, C. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
  • Dwork et al. (2018) Dwork, C., Immorlica, N., Kalai, A. T., and Leiserson, M. Decoupled classifiers for group-fair and efficient machine learning. In Conference on Fairness, Accountability and Transparency, pp.  119–133, 2018.
  • Gebru (2020) Gebru, T. Lessons from archives: Strategies for collecting sociocultural data in machine learning. In Gupta, R., Liu, Y., Tang, J., and Prakash, B. A. (eds.), KDD ’20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23-27, 2020, pp.  3609. ACM, 2020. URL https://dl.acm.org/doi/10.1145/3394486.3409559.
  • Gebru et al. (2018) Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., and Crawford, K. Datasheets for datasets. arXiv preprint arXiv:1803.09010, 2018.
  • Ghorbani & Zou (2019) Ghorbani, A. and Zou, J. Y. Data shapley: Equitable valuation of data for machine learning. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pp.  2242–2251. PMLR, 2019. URL http://proceedings.mlr.press/v97/ghorbani19c.html.
  • Habib et al. (2019) Habib, A., Karmakar, C., and Yearwood, J. Impact of ECG dataset diversity on generalization of CNN model for detecting QRS complex. IEEE Access, 7:93275–93285, 2019.
  • Haixiang et al. (2017) Haixiang, G., Yijing, L., Shang, J., Mingyun, G., Yuanyue, H., and Bing, G. Learning from class-imbalanced data: Review of methods and applications. Expert Systems with Applications, 73:220–239, 2017.
  • HarvardX (2014) HarvardX. HarvardX Person-Course Academic Year 2013 De-Identified dataset, version 3.0, 2014. URL https://doi.org/10.7910/DVN/26147.
  • Hashimoto et al. (2018) Hashimoto, T. B., Srivastava, M., Namkoong, H., and Liang, P. Fairness without demographics in repeated loss minimization. In Dy, J. G. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pp.  1934–1943. PMLR, 2018. URL http://proceedings.mlr.press/v80/hashimoto18a.html.
  • Hofmanninger et al. (2020) Hofmanninger, J., Prayer, F., Pan, J., Röhrich, S., Prosch, H., and Langs, G. Automatic lung segmentation in routine imaging is primarily a data diversity problem, not a methodology problem. European Radiology Experimental, 4(1):1–13, 2020.
  • Hu et al. (2018) Hu, W., Niu, G., Sato, I., and Sugiyama, M. Does distributionally robust supervised learning give robust classifiers? In Dy, J. G. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pp.  2034–2042. PMLR, 2018. URL http://proceedings.mlr.press/v80/hu18a.html.
  • Iosifidis & Ntoutsi (2018) Iosifidis, V. and Ntoutsi, E. Dealing with bias via data augmentation in supervised learning scenarios. In Proceedings of the International Workshop on Bias in Information, Algorithms, and Systems, pp. 24–29, 2018.
  • Jacobs & Wallach (2019) Jacobs, A. Z. and Wallach, H. Measurement and fairness. arXiv preprint arXiv:1912.05511, 2019.
  • Kim et al. (2019) Kim, M. P., Ghorbani, A., and Zou, J. Multiaccuracy: Black-box post-processing for fairness in classification. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pp.  247–254, 2019.
  • Krizhevsky (2009) Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
  • Lohr (2009) Lohr, S. L. Sampling: Design and Analysis. Nelson Education, 2009.
  • Neyman (1934) Neyman, J. On the two different aspects of the representative method: The method of stratified sampling and the method of purposive selection. Journal of the Royal Statistical Society, 97(4):558–625, 1934.
  • Paullada et al. (2020) Paullada, A., Raji, I. D., Bender, E. M., Denton, E., and Hanna, A. Data and its (dis) contents: A survey of dataset development and use in machine learning research. arXiv preprint arXiv:2012.05345, 2020.
  • Pukelsheim (2006) Pukelsheim, F. Optimal design of experiments. Classics in applied mathematics ; 50. Society for Industrial and Applied Mathematics, classic ed. edition, 2006.
  • Raji & Buolamwini (2019) Raji, I. D. and Buolamwini, J. Actionable auditing: Investigating the impact of publicly naming biased performance results of commercial AI products. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pp. 429–435, 2019.
  • Ryu et al. (2017) Ryu, H. J., Adam, H., and Mitchell, M. InclusiveFaceNet: Improving face attribute detection with race and gender diversity. arXiv preprint arXiv:1712.00193, 2017.
  • Sagawa et al. (2020) Sagawa, S., Koh, P. W., Hashimoto, T. B., and Liang, P. Distributionally robust neural networks. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020. URL https://openreview.net/forum?id=ryxGuJrFvS.
  • Shankar et al. (2017) Shankar, S., Halpern, Y., Breck, E., Atwood, J., Wilson, J., and Sculley, D. No classification without representation: Assessing geodiversity issues in open data sets for the developing world. arXiv preprint arXiv:1711.08536, 2017.
  • Sohoni et al. (2020) Sohoni, N., Dunnmon, J., Angus, G., Gu, A., and Ré, C. No subclass left behind: Fine-grained robustness in coarse-grained classification problems. Advances in Neural Information Processing Systems, 33, 2020.
  • Suresh & Guttag (2019) Suresh, H. and Guttag, J. V. A framework for understanding unintended consequences of machine learning. arXiv preprint arXiv:1901.10002, 2019.
  • Tschandl et al. (2018) Tschandl, P., Rosendahl, C., and Kittler, H. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Scientific Data, 5(1), 2018. URL https://doi.org/10.1038/sdata.2018.161.
  • Vodrahalli et al. (2018) Vodrahalli, K., Li, K., and Malik, J. Are all training examples created equal? An empirical study. arXiv preprint arXiv:1811.12569, 2018.
  • Wan & McAuley (2018) Wan, M. and McAuley, J. J. Item recommendation on monotonic behavior chains. In Pera, S., Ekstrand, M. D., Amatriain, X., and O’Donovan, J. (eds.), Proceedings of the 12th ACM Conference on Recommender Systems, RecSys 2018, Vancouver, BC, Canada, October 2-7, 2018, pp. 86–94. ACM, 2018. doi: 10.1145/3240323.3240369. URL https://doi.org/10.1145/3240323.3240369.
  • Wilkinson et al. (2016) Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.-W., da Silva Santos, L. B., Bourne, P. E., et al. The FAIR guiding principles for scientific data management and stewardship. Scientific Data, 3(1):1–9, 2016.
  • Wright (2020) Wright, T. A general exact optimal sample allocation algorithm: With bounded cost and bounded sample sizes. Statistics & Probability Letters, pp.  108829, 2020.
  • Yang et al. (2020) Yang, K., Qinami, K., Fei-Fei, L., Deng, J., and Russakovsky, O. Towards fairer datasets: Filtering and balancing the distribution of the people subtree in the imagenet hierarchy. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp.  547–558, 2020.
  • Yona et al. (2019) Yona, G., Ghorbani, A., and Zou, J. Who’s responsible? Jointly quantifying the contribution of the learning algorithm and training data. arXiv preprint arXiv:1910.04214, 2019.

Appendix A Proofs and Derivations

A.1 Full proof of Proposition 1

Recall the random variable $\hat{L}$ defined with respect to function $f$, loss function $\ell$, and group weights $w:\mathcal{G}\rightarrow\mathbb{R}^{+}$:

$$\hat{L}(w,\vec{\alpha},n;f,\ell) := \frac{1}{n}\sum_{i\in\mathcal{S}(\vec{\alpha},n)} w(g_i)\cdot\ell(f(x_i),y_i),$$

where the randomness comes from the draws of $(x_i,y_i)$ from $\mathcal{D}_{g_i}$ according to procedure $\mathcal{S}(\vec{\alpha},n)$ (Definition 2), as well as any randomness in $f$.

Proposition 1 (restated).

Proof.

For any $n$, any $(\vec{\alpha},w)$ pair induces a vector $\vec{\gamma}'$ with entries $\gamma'_g(w,\vec{\alpha}) := w(g)\alpha_g \big/ \big(\sum_{g'\in\mathcal{G}} w(g')\alpha_{g'}\big)$, where

$$\begin{aligned}
\mathbb{E}[\hat{L}(w,\vec{\alpha},n;f,\ell)] &= \frac{1}{n}\sum_{(x_i,y_i,g_i)\in\mathcal{S}(\vec{\alpha},n)} w(g_i)\,\mathbb{E}[\ell(f(x_i),y_i)] \\
&= \sum_{g\in\mathcal{G}} \frac{n_g}{n}\, w(g)\,\mathbb{E}_{(x,y)\sim\mathcal{D}_g}[\ell(f(x),y)] \\
&= c\cdot\mathbb{E}_{g\sim\mathrm{Multinomial}(\vec{\gamma}')}\big[\mathbb{E}_{(x,y)\sim\mathcal{D}_g}[\ell(f(x),y)]\big]
\end{aligned}$$

for constant $c=\sum_{g}\alpha_g w(g)$. The vector $\vec{\gamma}'$ in this sense describes an implicit ‘target distribution’ induced by applying the weights $w$ after sampling with allocation $\vec{\alpha}$. Note that unless $w(g)=0$ for all $g$ with $\alpha_g>0$, $\vec{\gamma}'$ has at least one nonzero entry. The constant $c$ re-scales the weighted objective function with original weights $w$ so as to match the expected loss with respect to the group proportions $\vec{\gamma}'$. Stated another way, for any alternative allocation $\vec{\alpha}'$, we could pick weights $w'(g)=c\gamma'_g/\alpha'_g$ (letting $w'(g)=0$ if $\alpha'_g=0$) and satisfy

$$\mathbb{E}[\hat{L}(w',\vec{\alpha}',n;f,\ell)] = \mathbb{E}[\hat{L}(w,\vec{\alpha},n;f,\ell)]\,.$$

Given this correspondence, we now find the pair $(\vec{\alpha}^*,w^*)$ which minimizes $\mathrm{Var}[\hat{L}(w',\vec{\alpha}',n;f,\ell)]$, subject to $w'(g)\alpha'_g=c\gamma'_g$. Since the original pair $(\vec{\alpha},w)$ satisfies this constraint (by construction), we must have

$$\min_{\vec{\alpha}',w':\ w'(g)\alpha'_g=c\gamma'_g} \mathrm{Var}[\hat{L}(w',\vec{\alpha}',n;f,\ell)] \leq \mathrm{Var}[\hat{L}(w,\vec{\alpha},n;f,\ell)]\,.$$

We first compute $\mathrm{Var}[\hat{L}(w',\vec{\alpha}',n;f,\ell)]$. By Definition 2, samples $(x_i,y_i)$ are assumed to be independent draws from distributions $\mathcal{D}_{g_i}$, so that the variance of the estimator can be written as (for convenience we assume here that $n\alpha'_g\in\mathbb{Z}$; see the discussion below):

$$\begin{aligned}
\mathrm{Var}\big[\hat{L}(w',\vec{\alpha}',n;f,\ell)\big] &= \frac{1}{n^2}\sum_{g\in\mathcal{G}} w'(g)^2 \sum_{i=1}^{n\alpha'_g} \mathbb{E}_{(x_i,y_i)\sim\mathcal{D}_g}\Big[\big(\ell(f(x_i),y_i) - \mathbb{E}_{(x,y)\sim\mathcal{D}_g}[\ell(f(x),y)]\big)^2\Big] \\
&= \frac{1}{n}\sum_{g\in\mathcal{G}} \alpha'_g\, w'(g)^2\, \mathrm{Var}\big[\ell_g^{(i)}\big]\,,
\end{aligned}$$

where $\mathrm{Var}[\ell_g^{(i)}]$ is shorthand for $\mathbb{E}_{(x_i,y_i)\sim\mathcal{D}_g}\big[\big(\ell(f(x_i),y_i)-\mathbb{E}_{(x,y)\sim\mathcal{D}_g}[\ell(f(x),y)]\big)^2\big]$. Now, respecting the constraint $w'(g)\alpha'_g=c\gamma'_g$ means that for any $g$ with $\gamma'_g>0$, $w'(g)$ is a deterministic function of $\alpha'_g$, since $c$ and $\vec{\gamma}'$ are determined by the initial pair $(\vec{\alpha},w)$. It is then sufficient to compute

$$\operatorname*{argmin}_{\vec{\alpha}'\in\Delta^{|\mathcal{G}|}} \frac{1}{n}\sum_{g\in\mathcal{G}:\gamma'_g>0} \alpha'_g \left(\frac{c\gamma'_g}{\alpha'_g}\right)^2 \mathrm{Var}\big[\ell_g^{(i)}\big] = \operatorname*{argmin}_{\vec{\alpha}'\in\Delta^{|\mathcal{G}|}} \frac{c^2}{n}\sum_{g\in\mathcal{G}:\gamma'_g>0} \frac{(\gamma'_g)^2}{\alpha'_g}\, \mathrm{Var}\big[\ell_g^{(i)}\big]\,.$$

The minimizer $\vec{\alpha}^*$ has entries $\alpha^*_g = \gamma'_g\sqrt{\mathrm{Var}[\ell_g^{(i)}]} \,\big/\, \big(\sum_{g\in\mathcal{G}} \gamma'_g\sqrt{\mathrm{Var}[\ell_g^{(i)}]}\big)$. Because $\vec{\alpha}^*$ is unique and determines $w^*$, $\mathrm{Var}[\hat{L}(w^*,\vec{\alpha}^*,n;f,\ell)] < \mathrm{Var}[\hat{L}(w',\vec{\alpha}',n;f,\ell)]$ with strict inequality unless $(w',\vec{\alpha}')=(w^*,\vec{\alpha}^*)$. The optimal weights are $w^*(g) = c\gamma'_g/\alpha^*_g = c\big(\sum_{g\in\mathcal{G}}\gamma'_g\sqrt{\mathrm{Var}[\ell_g^{(i)}]}\big)\big/\sqrt{\mathrm{Var}[\ell_g^{(i)}]}$. Note that for any pair of groups $(g,g')$, the relative weights satisfy $w^*(g)/w^*(g') = \sqrt{\mathrm{Var}[\ell_{g'}^{(i)}]/\mathrm{Var}[\ell_g^{(i)}]}$, and thus do not depend on $\vec{\gamma}'$.

If we consider finite-sample concerns, the minimizer $\vec{\alpha}^*$ must satisfy integer values $n\alpha^*_g\in\mathbb{Z}^+\ \forall g\in\mathcal{G}$. In this case, efficient algorithms exist for finding the integral solution to allocating $n_g$ (Wright, 2020). However, the non-integer-restricted solution $\vec{\alpha}^*$ has a closed form, and we will use the fact that, for any group $g$, $\alpha^*_g$ as defined above and its variant under the additional constraint $n\alpha^*_g\in\mathbb{Z}^+$ can differ by at most $\tfrac{|\mathcal{G}|}{n}$. This means that any $\vec{\alpha}$ with $|\alpha_g-\alpha^*_g|>\tfrac{|\mathcal{G}|}{n}$ cannot be a minimizer of the objective function, even when constrained to $n\alpha_g\in\mathbb{Z}$. Since $w(g)\alpha_g=c\gamma'_g$, an equivalent statement in terms of $w$ is $\big|1-\tfrac{w(g)}{w^*(g)}\big| > \tfrac{w(g)|\mathcal{G}|}{nc\gamma'_g} = \tfrac{|\mathcal{G}|}{n\alpha_g} = \tfrac{|\mathcal{G}|}{n_g}$.

Finally, we show that if $w^*(g)<w(g)$, then $\alpha^*_g>\alpha_g$. This follows from our definition of $\vec{\gamma}'$ such that $w(g)=c\gamma'_g/\alpha_g$, and our constraint, such that $w^*(g)=c\gamma'_g/\alpha^*_g$. From these, we must have $w^*(g)\alpha^*_g = w(g)\alpha_g$, from which the claim and its reverse follow. ∎
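As a quick numeric sanity check on this closed form, the sketch below (with hypothetical values for $\vec{\gamma}'$ and the per-group loss variances) compares the variance of the weighted estimator at $\vec{\alpha}^*$ against random allocations on the simplex.

```python
# Numeric check of Proposition 1's allocation; gamma_prime and var_g are
# hypothetical stand-ins for the target distribution and Var[l_g^(i)].
import numpy as np

rng = np.random.default_rng(0)
gamma_prime = np.array([0.2, 0.3, 0.5])  # implicit target distribution gamma'
var_g = np.array([4.0, 1.0, 0.25])       # per-group loss variances
n, c = 1000, 1.0

def estimator_variance(alpha):
    # Reduced form of Var[L-hat]: (c^2 / n) * sum_g gamma'_g^2 Var_g / alpha_g.
    return (c**2 / n) * np.sum(gamma_prime**2 * var_g / alpha)

# Closed-form minimizer: alpha*_g proportional to gamma'_g * sqrt(Var_g).
alpha_star = gamma_prime * np.sqrt(var_g)
alpha_star /= alpha_star.sum()

# Compare against many random allocations drawn from the simplex.
random_alphas = rng.dirichlet(np.ones(3), size=100_000)
rand_vars = (c**2 / n) * (gamma_prime**2 * var_g / random_alphas).sum(axis=1)
print(estimator_variance(alpha_star) <= rand_vars.min())  # True
```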

A.2 Proof of Proposition 2

Proposition 2 (restated).

Proof.

Recall the decomposition of the estimated population risk:

$$\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\ell(\hat{f}_{\mathcal{S}}(x),y)\big] = \sum_{g\in\mathcal{G}}\gamma_g\cdot\mathbb{E}_{(x,y)\sim\mathcal{D}_g}\big[\ell(\hat{f}(x),y)\big] \approx \sum_{g\in\mathcal{G}}\gamma_g\big(\sigma_g^2 (n\alpha_g)^{-p} + \tau_g^2 n^{-q} + \delta_g\big)\,.$$

Now we find, dropping the terms that do not depend on $\vec{\alpha}$,

$$\vec{\alpha}^* = \operatorname*{argmin}_{\vec{\alpha}\in\Delta^{|\mathcal{G}|}} \sum_{g\in\mathcal{G}}\gamma_g\big(\sigma_g^2(n\alpha_g)^{-p} + \tau_g^2 n^{-q} + \delta_g\big) = \operatorname*{argmin}_{\vec{\alpha}\in\Delta^{|\mathcal{G}|}} n^{-p}\sum_{g\in\mathcal{G}}\gamma_g\sigma_g^2\alpha_g^{-p} = \operatorname*{argmin}_{\vec{\alpha}\in\Delta^{|\mathcal{G}|}} \sum_{g\in\mathcal{G}}\gamma_g\sigma_g^2\alpha_g^{-p}\,.$$

If $\sigma_g=0\ \forall g\in\mathcal{G}$, then any allocation $\vec{\alpha}^*\in\Delta^{|\mathcal{G}|}$ minimizes the approximated population loss. Otherwise, $\alpha^*_g=0$ for any group with $\sigma_g=0$; what follows describes the solution $\alpha^*_g$ for $g$ with $\sigma_g>0$. If any $\alpha_g=0$, then the objective is unbounded above, so we can restrict our constraints to $\vec{\alpha}\in(0,1)^{|\mathcal{G}|}$. As the sum of convex functions, the objective function is convex in $\vec{\alpha}$. It is continuously differentiable when $\alpha_g>0\ \forall g\in\mathcal{G}$. The KKT conditions are satisfied when

$$\begin{aligned}
p\gamma_g\sigma_g^2\alpha_g^{-(p+1)} &= \lambda \quad \forall g\,, \\
\sum_g \alpha_g &= 1\,.
\end{aligned}$$

Solving this system of equations yields that the KKT conditions are satisfied when $\alpha^*_g = (\gamma_g\sigma_g^2)^{1/(p+1)} \big/ \sum_g(\gamma_g\sigma_g^2)^{1/(p+1)}$. Since this is the only solution to the KKT conditions, it is the unique minimizer. ∎
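The closed form is easy to verify numerically; the sketch below uses hypothetical values of $\vec{\gamma}$, $\sigma_g^2$, and $p$, and checks the formula against a direct minimization (for two groups, parameterizing $\alpha_B = 1-\alpha_A$).

```python
# Numeric check of Proposition 2's optimal allocation; all values hypothetical.
import numpy as np
from scipy.optimize import minimize

gamma = np.array([0.1, 0.9])   # population proportions gamma_g
sigma2 = np.array([1.0, 1.0])  # per-group constants sigma_g^2
p = 0.5                        # scaling exponent, p > 0

def objective(alpha):
    return np.sum(gamma * sigma2 * alpha ** (-p))

# Closed form: alpha*_g proportional to (gamma_g * sigma_g^2)^(1 / (p + 1)).
alpha_star = (gamma * sigma2) ** (1.0 / (p + 1))
alpha_star /= alpha_star.sum()

res = minimize(lambda a: objective(np.array([a[0], 1.0 - a[0]])),
               x0=[0.5], bounds=[(1e-6, 1.0 - 1e-6)])
print(alpha_star[0], res.x[0])  # both ~0.19; note alpha*_A > gamma_A = 0.1
```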

A.3 Proof of Corollary 1

Corollary 1 (restated).

Proof.

Let $m=|\mathcal{G}|$ denote the number of groups. When $\sigma_g=\sigma\ \forall g\in\mathcal{G}$,

$$\alpha^*_g = \frac{\gamma_g^{1/(p+1)}}{\sum_g \gamma_g^{1/(p+1)}} = \gamma_g\cdot\frac{1}{\gamma_g + \gamma_g^{p/(p+1)}\sum_{g'\neq g}\gamma_{g'}^{1/(p+1)}}\,.$$

Since $p>0$ by assumption, $\sum_{i=1}^n \gamma_i^{1/(p+1)}$ with $\gamma_i$ subject to (a) $\sum_{i=1}^n \gamma_i = s$ and (b) $\gamma_i>0$ is maximized when all $\gamma_i$ are equal. Then, since $\sum_{g'\neq g}\gamma_{g'} = 1-\gamma_g$,

$$\begin{aligned}
\gamma_g\cdot\frac{1}{\gamma_g + \gamma_g^{p/(p+1)}\sum_{g'\neq g}\gamma_{g'}^{1/(p+1)}} &\geq \gamma_g\cdot\frac{1}{\gamma_g + \gamma_g^{p/(p+1)}\sum_{g'\neq g}\left(\frac{1-\gamma_g}{m-1}\right)^{1/(p+1)}} \\
&= \gamma_g\cdot\frac{1}{\gamma_g + \left((m-1)\gamma_g\right)^{p/(p+1)}(1-\gamma_g)^{1/(p+1)}}\,.
\end{aligned}$$

When $\gamma_g\leq 1/m$, $\gamma_g/(1-\gamma_g)\leq\frac{1}{m-1}$, so that

$$\alpha^*_g \geq \gamma_g\cdot\frac{1}{\gamma_g + (1-\gamma_g)(m-1)^{p/(p+1)}\left(\gamma_g/(1-\gamma_g)\right)^{p/(p+1)}} \geq \gamma_g\cdot\frac{1}{\gamma_g+(1-\gamma_g)} = \gamma_g\,. \qquad ∎$$
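A brief numeric spot-check of this corollary, over random population proportions and a hypothetical exponent $p$:

```python
# Spot-check of Corollary 1: with equal sigma_g, every group with
# gamma_g <= 1/m receives alpha*_g >= gamma_g.
import numpy as np

rng = np.random.default_rng(1)
p = 0.31  # hypothetical scaling exponent
for _ in range(1000):
    gamma = rng.dirichlet(np.ones(5))
    alpha_star = gamma ** (1.0 / (p + 1))
    alpha_star /= alpha_star.sum()
    minority = gamma <= 1.0 / len(gamma)
    assert np.all(alpha_star[minority] >= gamma[minority] - 1e-12)
print("Corollary 1 holds on all sampled gamma.")
```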

A.4 Proof of Corollary 2

Corollary 2 (restated).

Proof.

For the setting of two groups, we can express the optimal allocations as:

$$\alpha^*_A = \frac{(\gamma_A\sigma_A^2)^{1/(p+1)}}{(\gamma_A\sigma_A^2)^{1/(p+1)} + ((1-\gamma_A)\sigma_B^2)^{1/(p+1)}}\,, \qquad \alpha^*_B = 1-\alpha^*_A\,.$$

Rearranging,

$$\alpha^*_A = \gamma_A\,\frac{1}{\gamma_A + (\sigma_B^2/\sigma_A^2)^{1/(p+1)}(1-\gamma_A)^{1/(p+1)}(\gamma_A)^{p/(p+1)}}\,.$$

For $p>0$, it holds that $0<\frac{1}{p+1}<1$. Therefore, for any $p>0$ and $\gamma<0.5$,

$$\gamma < (\gamma)^{\frac{1}{p+1}}(1-\gamma)^{\frac{p}{p+1}} < (1-\gamma)\,.$$

From this, we derive the upper bound

$$\alpha^*_A < \gamma_A\,\frac{1}{\gamma_A + (\sigma_B^2/\sigma_A^2)^{1/(p+1)}\gamma_A} = \frac{(\sigma_A^2)^{1/(p+1)}}{(\sigma_A^2)^{1/(p+1)} + (\sigma_B^2)^{1/(p+1)}}\,,$$

and the lower bound

$$\alpha^*_A > \gamma_A\,\frac{1}{\gamma_A + (\sigma_B^2/\sigma_A^2)^{1/(p+1)}(1-\gamma_A)} = \gamma_A\,\frac{(\sigma_A^2)^{1/(p+1)}}{\gamma_A(\sigma_A^2)^{1/(p+1)} + (1-\gamma_A)(\sigma_B^2)^{1/(p+1)}}\,.$$

When σAσB\sigma_{A}\geq\sigma_{B},

$$\alpha^*_A > \gamma_A\,\frac{(\sigma_A^2)^{1/(p+1)}}{\gamma_A(\sigma_A^2)^{1/(p+1)} + (1-\gamma_A)(\sigma_B^2)^{1/(p+1)}} \geq \gamma_A\,\frac{(\sigma_A^2)^{1/(p+1)}}{(\gamma_A + 1-\gamma_A)(\sigma_A^2)^{1/(p+1)}} = \gamma_A\,,$$

and when σAσB\sigma_{A}\leq\sigma_{B},

$$\alpha^*_A < \frac{(\sigma_A^2)^{1/(p+1)}}{(\sigma_A^2)^{1/(p+1)} + (\sigma_B^2)^{1/(p+1)}} \leq \frac{(\sigma_A^2)^{1/(p+1)}}{(\sigma_A^2)^{1/(p+1)} + (\sigma_A^2)^{1/(p+1)}} = 1/2\,. \qquad ∎$$
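As with the previous corollaries, these two-group bounds can be spot-checked numerically; the values of $\gamma_A$, $\sigma_A^2$, $\sigma_B^2$, and $p$ below are hypothetical.

```python
# Spot-check of Corollary 2's bounds for a minority group A (gamma_A < 1/2).
import numpy as np

rng = np.random.default_rng(2)
p = 0.4
for _ in range(1000):
    gamma_A = rng.uniform(0.01, 0.49)
    s2_A, s2_B = rng.uniform(0.1, 4.0, size=2)
    num = (gamma_A * s2_A) ** (1.0 / (p + 1))
    alpha_A = num / (num + ((1.0 - gamma_A) * s2_B) ** (1.0 / (p + 1)))
    if s2_A >= s2_B:
        assert alpha_A > gamma_A   # lower bound
    if s2_A <= s2_B:
        assert alpha_A < 0.5       # upper bound
print("Corollary 2 bounds hold on all samples.")
```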

A.5 Additional Derivations

In Example 3, we consider the model $y_i = x_i^\top\beta + \alpha_1\mathbb{I}[g_i=A] + \alpha_2\mathbb{I}[g_i=B] + \varepsilon_i$ with noise $\varepsilon_i\sim\mathcal{N}(0,\sigma^2)$, where $x_i\sim\mathcal{N}(0,\Sigma_x)$, and denote $\theta=[\alpha_1,\alpha_2,\beta^\top]^\top$ (note: here $\alpha_1,\alpha_2$ denote the model coefficients, not allocations). We want to compute

$$\begin{aligned}
\mathbb{E}_{(x,y,g)\sim\mathcal{D}}\big[(\hat{f}(x)-y)^2 \,\big|\, g=A\big] &= \sigma^2 + \mathbb{E}\big[\|x^\top(\hat{\beta}-\beta) + (\hat{\alpha}_1-\alpha_1)\|^2\big] \\
&= \sigma^2 + \mathbb{E}\big[x^\top(\hat{\beta}-\beta)(\hat{\beta}-\beta)^\top x\big] + 2\,\mathbb{E}\big[(\hat{\alpha}_1-\alpha_1)(\hat{\beta}-\beta)^\top x\big] + \mathbb{E}\big[(\hat{\alpha}_1-\alpha_1)^2\big]\,.
\end{aligned}$$

Since the draw $(x,y,g)\sim\mathcal{D}$ is independent of the data from which the ordinary least squares solution $\hat{\theta}$ is estimated, we can write out each of these terms in terms of their dependence on $n$, the total number of data points, as well as $n_A$ and $n_B$ (where $n_A+n_B=n$), the number of datapoints from each group. To do this, we solve for the entries of the covariance matrix:

$$\mathbb{E}[(\hat{\theta}-\theta)(\hat{\theta}-\theta)^\top] = \begin{bmatrix} \mathbb{E}\big[(\hat{\alpha}_1-\alpha_1)^2\big] & \mathbb{E}\big[(\hat{\alpha}_1-\alpha_1)(\hat{\alpha}_2-\alpha_2)\big] & \mathbb{E}\big[(\hat{\beta}-\beta)(\hat{\alpha}_1-\alpha_1)\big]^\top \\ \mathbb{E}\big[(\hat{\alpha}_1-\alpha_1)(\hat{\alpha}_2-\alpha_2)\big] & \mathbb{E}\big[(\hat{\alpha}_2-\alpha_2)^2\big] & \mathbb{E}\big[(\hat{\beta}-\beta)(\hat{\alpha}_2-\alpha_2)\big]^\top \\ \mathbb{E}\big[(\hat{\beta}-\beta)(\hat{\alpha}_1-\alpha_1)\big] & \mathbb{E}\big[(\hat{\beta}-\beta)(\hat{\alpha}_2-\alpha_2)\big] & \mathbb{E}\big[(\hat{\beta}-\beta)(\hat{\beta}-\beta)^\top\big] \end{bmatrix} = \sigma^2(Z^\top Z)^{-1}$$

where $Z$ is the $n\times(d+2)$ design matrix with rows $\{(\mathbb{I}[g_i=A],\mathbb{I}[g_i=B],x_i^\top)\}_{i=1}^n$. Next, we find the block entries of the matrix $(Z^\top Z)^{-1}$. We first examine the term within the inverse:

$$Z^\top Z = \begin{bmatrix} n_A & 0 & n_A\bar{x}_A^\top \\ 0 & n_B & n_B\bar{x}_B^\top \\ n_A\bar{x}_A & n_B\bar{x}_B & X^\top X \end{bmatrix}$$

where $\bar{x}_A = \frac{1}{n_A}\sum_{i=1}^n x_i\cdot\mathbb{I}[g_i=A]$, and similarly for $\bar{x}_B$. We now use the Schur complement to compute the desired blocks of $(Z^\top Z)^{-1}$. The Schur complement is

$$S = X^\top X - \begin{bmatrix} n_A\bar{x}_A & n_B\bar{x}_B \end{bmatrix} \begin{bmatrix} n_A^{-1} & 0 \\ 0 & n_B^{-1} \end{bmatrix} \begin{bmatrix} n_A\bar{x}_A^\top \\ n_B\bar{x}_B^\top \end{bmatrix} = X^\top X - n\bar{x}\bar{x}^\top\,,$$

which we simplify to $S = X^\top X$ by assuming that we zero-mean the sample feature matrix $X$ before calculating the least squares solution. Using the Schur complement, the covariance matrix in block form is

$$(Z^\top Z)^{-1} = \begin{bmatrix} \frac{1}{n_A} + \bar{x}_A^\top S^{-1}\bar{x}_A & \bar{x}_A^\top S^{-1}\bar{x}_B & -\bar{x}_A^\top S^{-1} \\ \bar{x}_B^\top S^{-1}\bar{x}_A & \frac{1}{n_B} + \bar{x}_B^\top S^{-1}\bar{x}_B & -\bar{x}_B^\top S^{-1} \\ -S^{-1}\bar{x}_A & -S^{-1}\bar{x}_B & S^{-1} \end{bmatrix}\,.$$

Plugging the appropriate blocks into our original equation, we get:

$$\begin{aligned}
\mathbb{E}_{(x,y,g)\sim\mathcal{D}}\big[(\hat{f}(x)-y)^2 \,\big|\, g=A\big] &= \sigma^2 + \mathbb{E}_{(x,y)\sim\mathcal{D}_A}\big[x^\top(\hat{\beta}-\beta)(\hat{\beta}-\beta)^\top x + 2(\hat{\alpha}_1-\alpha_1)(\hat{\beta}-\beta)^\top x + (\hat{\alpha}_1-\alpha_1)^2\big] \\
&= \sigma^2\Big(1 + \frac{1}{n_A} + \mathbb{E}_{g=A}\big[x^\top S^{-1}x + \bar{x}_A^\top S^{-1}\bar{x}_A\big]\Big)\,,
\end{aligned}$$

where the middle term cancels since $\mathbb{E}[x]=0$. Note that $S$ is the scaled sample covariance matrix. The vectors $x_i$ are drawn i.i.d. from $\mathcal{N}(0,\Sigma_x)$, so that $S^{-1}$ follows an inverse Wishart distribution with parameters $n,d,\Sigma_x$. For the fresh sample $x$,

$$\mathbb{E}[x^\top S^{-1}x] = \mathrm{Trace}\big(\mathbb{E}[S^{-1}]\,\mathbb{E}[xx^\top]\big) = \frac{1}{n-d-1}\,\mathrm{Tr}(\Sigma_x^{-1}\Sigma_x) = \frac{d}{n-d-1}\,.$$

For the $\bar{x}_A^\top S^{-1}\bar{x}_A$ term, we invoke the matrix inversion lemma. For a single row $x_i$ of $X$, let $X_{-i}$ denote the $(n-1)\times d$ matrix comprised of all rows of $X$ except $x_i$. Then

$$\begin{aligned}
x_i^\top(X^\top X)^{-1}x_i &= x_i^\top(X_{-i}^\top X_{-i} + x_i x_i^\top)^{-1}x_i \\
&= x_i^\top\Big((X_{-i}^\top X_{-i})^{-1} - (X_{-i}^\top X_{-i})^{-1}x_i\big(1 + x_i^\top(X_{-i}^\top X_{-i})^{-1}x_i\big)^{-1}x_i^\top(X_{-i}^\top X_{-i})^{-1}\Big)x_i\,.
\end{aligned}$$

Letting $a_i = x_i^\top(X_{-i}^\top X_{-i})^{-1}x_i \geq 0$, we rewrite the above as

$$x_i^\top(X^\top X)^{-1}x_i = a_i - \frac{a_i^2}{1+a_i} = \frac{a_i}{1+a_i} \leq a_i\,.$$

Since the $x_i$ are independent and zero mean, $\mathbb{E}\big[x_i^\top(X_{-i}^\top X_{-i})^{-1}x_j\big] = \mathbb{E}[x_i^\top]\,\mathbb{E}[(X_{-i}^\top X_{-i})^{-1}x_j] = 0\ \forall i\neq j$. By an argument similar to that given above, we derive that $\mathbb{E}[a_i] = d/(n-d-2)$, so that

$$\mathbb{E}\big[\bar{x}_A^\top S^{-1}\bar{x}_A\big] = \mathbb{E}\bigg[\Big(\frac{1}{n_A}\sum_i^{n_A}x_i\Big)^\top S^{-1}\Big(\frac{1}{n_A}\sum_i^{n_A}x_i\Big)\bigg] = \frac{1}{n_A^2}\,\mathbb{E}\bigg[\sum_i^{n_A}x_i^\top S^{-1}x_i\bigg] \leq \frac{1}{n_A}\cdot\frac{d}{n-d-2}\,.$$

Putting this all together, we conclude that for ndn\gg d,

$$\mathbb{E}_{(x,y,g)\sim\mathcal{D}}\big[(\hat{f}(x)-y)^2 \,\big|\, g=A\big] = \sigma^2\Big(1 + \frac{1}{n_A} + \frac{d}{n-d-1} + \mathbb{E}[\bar{x}_A^\top S^{-1}\bar{x}_A]\Big) = \sigma^2\Big(1 + \frac{1}{n_A} + O\Big(\frac{d}{n}\Big)\Big)\,.$$
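A Monte Carlo sketch of this derivation is below: it fits OLS with group intercepts on synthetic data and compares the empirical group-A risk to $\sigma^2(1 + 1/n_A + d/(n-d-1))$; all problem sizes and coefficients are hypothetical.

```python
# Monte Carlo check that group-A risk tracks sigma^2 * (1 + 1/n_A + O(d/n)).
import numpy as np

rng = np.random.default_rng(3)
d, n_A, n_B, sigma = 5, 50, 950, 1.0
n = n_A + n_B
beta = rng.normal(size=d)
a1, a2 = 0.7, -0.3  # group intercepts alpha_1, alpha_2

risks = []
for _ in range(200):
    g = np.r_[np.zeros(n_A), np.ones(n_B)]  # 0 = group A, 1 = group B
    X = rng.normal(size=(n, d))
    X -= X.mean(axis=0)                     # zero-mean features, as assumed
    y = X @ beta + np.where(g == 0, a1, a2) + sigma * rng.normal(size=n)
    Z = np.column_stack([(g == 0), (g == 1), X]).astype(float)
    theta = np.linalg.lstsq(Z, y, rcond=None)[0]
    # evaluate on fresh group-A draws
    Xt = rng.normal(size=(20_000, d))
    yt = Xt @ beta + a1 + sigma * rng.normal(size=20_000)
    Zt = np.column_stack([np.ones(20_000), np.zeros(20_000), Xt])
    risks.append(np.mean((Zt @ theta - yt) ** 2))

print(np.mean(risks))                          # empirical group-A risk
print(sigma**2 * (1 + 1/n_A + d/(n - d - 1)))  # predicted approximation
```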

Appendix B Experiment Details

Section B.1 details the datasets used, including preprocessing steps and any data excluded from the experiments; the remainder of Appendix B provides a detailed explanation of each experiment described in the main text. Appendix C describes additional experiments that complement the findings of the main experiments. Code to process data and replicate the experiments detailed here is available at https://github.com/estherrolf/representation-matters.


B.1 Dataset Descriptions

We use and modify benchmark machine learning and fairness datasets, as well as more recent datasets on image diagnosis and student performance to study the effect of training set allocations on group and population accuracies in a systematic fashion. Each of the datasets we use is described below, with download links given at the end of this section.

Modified CIFAR-4. We modify an existing machine learning benchmark dataset, CIFAR-10 (Krizhevsky, 2009), to instantiate binary classification tasks with binary groups, where groups $g$ are statistically independent of the labels $y$. We take four of the ten CIFAR-10 classes, {plane, car, bird, horse}, and sort them into the binary categories {land/air} (labels) and {animal/vehicle} (groups), as in Fig. B.1. This construction balances the class labels among the two groups $g$, so that no matter the allocation of groups in the training set, the label distribution remains balanced. This eliminates class imbalance as a possible cause of changes in accuracy due to training set composition or up-weighting methods.

There are 5,000 training and 1,000 test instances of each class in the CIFAR-10 dataset, resulting in 10,000 training instances of each group in our modified CIFAR-4 dataset (20,000 instances total), and 2,000 instances of each group in the test set (4,000 instances total). By construction, the average label in the test and train sets is 0.5. Since this dataset is designed to assess the main questions of our study under a controlled setting, and there is no natural setting of population rates of animal and vehicle photos, we set the population prevalence parameter of group A (images of animals) to $\gamma_A = 0.1$.

Figure B.1: Modified CIFAR-4 dataset setup.
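A minimal sketch of this relabeling is below, assuming CIFAR-10 labels are loaded as an integer array with torchvision's standard class indices (0 = plane, 1 = car, 2 = bird, 7 = horse).

```python
# Construct CIFAR-4 labels and groups from CIFAR-10 class indices.
import numpy as np

def cifar4_label_and_group(cifar10_labels):
    """Label y: land (0) vs. air (1); group g: vehicle (0) vs. animal (1).

    Each group contains one land class and one air class, so the label
    distribution stays balanced under any group allocation.
    """
    keep = np.isin(cifar10_labels, [0, 1, 2, 7])  # plane, car, bird, horse
    kept = cifar10_labels[keep]
    y = np.isin(kept, [0, 2]).astype(int)  # plane and bird are 'air'
    g = np.isin(kept, [2, 7]).astype(int)  # bird and horse are 'animal'
    return keep, y, g

labels = np.random.default_rng(0).integers(0, 10, size=60_000)  # stand-in
mask, y, g = cifar4_label_and_group(labels)
```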

Goodreads ratings prediction. The Goodreads dataset (Wan & McAuley, 2018) contains a large corpus of book reviews and ratings, collected from public book reviews online. From the text of the reviews, we predict the numerical rating (integers 1-5) a reviewer assigns to a book. From the larger corpus, we consider reviews for two genres: history/biography (henceforth “history”) and fantasy/paranormal (henceforth “fantasy”). We calculate the population prevalences from the total number of reviews in the corpus for each genre. As of the writing of this document, there are 2,066,193 reviews for history and biography books, and 3,424,641 for fantasy and paranormal books, so that $\vec{\gamma} = [0.376, 0.624]$.

After dropping entries with no valid review or rating, we have 1,985,464 reviews from the history genre and 3,309,417 reviews from the fantasy genre, with no books assigned to both genres. To reduce dataset size and increase performance of the basic prediction task, we further reduce each dataset to only the 100 most frequently reviewed books of each genre (following a procedure similar to that of Chen et al. (2018)). To instantiate the dataset we use in our analysis, we draw 62,500 review/rating pairs uniformly at random from each of these pools. The mean rating for fantasy instances is 4.146; for history it is 4.103. We allocate 20% of the data to a test set and the remaining 80% to the training set, with an equal number of instances of each genre in each set.

We use tf-idf vectors of the 2,000 most frequently occurring words across the entire dataset of 125,000 instances as features (a featurization similar to that of Chen et al. (2018), with fewer total features). We note that this is a different version of the Goodreads dataset from that used by Chen et al. (2018); the updated dataset we use has more reviews, and we use different group variables.
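The sketch below illustrates this featurization with scikit-learn, which the experiments already use; the two stand-in reviews and the `max_features` vocabulary cutoff mirror the setup described above.

```python
# tf-idf featurization and the multiclass rating model from Table B.3.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

reviews = ["loved this fantasy epic", "dry but informative history"]  # stand-ins
ratings = [5, 3]                                                      # stand-ins

vectorizer = TfidfVectorizer(max_features=2000)  # keep most frequent terms
X = vectorizer.fit_transform(reviews)

clf = LogisticRegression(C=1.0, penalty="l2", max_iter=1000)
clf.fit(X, ratings)  # multiclass over ratings 1-5 on the real data
```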

Adult. The Adult dataset, originally extracted from the 1994 Census database, is downloaded from the UCI Machine Learning Repository (Dua & Graff, 2017). The dataset contains demographic features, and the classification task is to predict whether an individual makes more than 50K per year. We drop the final-weight feature, a measure of the proportion of the population represented by that profile. There are $d=13$ remaining features; we one-hot encode non-binary discontinuous features, including work class, education, marital status, occupation, relationship, race, and native country, resulting in 103 feature columns.

For the main analysis, we exclude the features for sex, husband, and wife, as they indicate group status but do not affect predictive ability (see Section C.1), for a total of 10 features and 100 columns in the feature matrix. We keep the original train and test splits, with 32,561 instances (21,790 from group male and 10,771 from group female) and 16,281 instances (10,860 from group male and 5,421 from group female), respectively. We group instances based on gender (encoded as binary male/female). While the training and test sets have roughly a 2/3 male group, 1/3 female group balance, we set $\vec{\gamma} = [0.5, 0.5]$ to more adequately reflect the true proportions of men and women in the world. In the test set, the average label for the male group is 0.300 and for the female group is 0.109 (the numbers are similar for the train set: 0.306 for the male group and 0.109 for the female group).
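A sketch of this tabular preprocessing with pandas is below; the toy dataframe and column subset are illustrative, not the full UCI schema.

```python
# One-hot encode non-binary categorical features, dropping group indicators.
import pandas as pd

df = pd.DataFrame({
    "age": [39, 50],
    "workclass": ["State-gov", "Self-emp-not-inc"],
    "education": ["Bachelors", "HS-grad"],
    "sex": ["Male", "Female"],
    "label": [0, 1],
})

groups = (df["sex"] == "Female").astype(int)  # group variable, held aside
features = df.drop(columns=["sex", "label"])  # exclude group indicator
X = pd.get_dummies(features, columns=["workclass", "education"])
y = df["label"]
```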

Mooc dataset. The dataset we refer to as the MOOC (Massive Open Online Course) dataset is the HarvardX-MITx Person-Course Dataset (HarvardX, 2014). This dataset contains anonymized data from the 2013 academic year offerings of MITx and HarvardX edX classes. Each row in the dataset corresponds to a student-course pairing; students taking more than one course may correspond to multiple rows. We keep the following demographic, participation, and administrative features: gender, LoE_DI (highest completed level of education), final_cc_cname_DI (country), YoB, ndays_act (number of active days on the edX platform), nplay_video (number of video plays on edX), nforum_posts (number of discussion forum posts on edX), nevents (number of interactions on edX), course_id, and certified (whether the student achieved a high enough grade for certification). After dropping instances without valid entries for these features, we have 25,213 instances. We one-hot encode the non-binary discontinuous features, including course_id, LoE_DI, and final_cc_cname_DI, expanding the original 9 features into 47 feature columns. In Section C.1 we exclude demographic features. We partition the data into a train set of size 24,130 and a test set of size 6,032, with the same proportion of groups in each set.

We group instances by the highest level of education self-reported by the person taking the class; we bin this into those who have completed at most secondary education and those who have completed more than secondary education. For reference, in the USA, completing secondary education corresponds to completing high school. The train and test sets contain an equal fraction of each group, with about 16% of instances where the person taking the course had no recorded education higher than secondary. Of all training and test instances, those from group A (student had at most secondary education) have a 5.2% certification rate, and those from group B (student had more than secondary education) have an 8.2% certification rate.

ISIC skin cancer detection dataset. We download the dataset from the ISIC archive (Codella et al., 2017; Tschandl et al., 2018; Codella et al., 2019), using the repository https://github.com/GalAvineri/ISIC-Archive-Downloader to assist in downloading. To match the dataset used by Sohoni et al. (2020), we use only images added or modified prior to 2019, which gives us 23,906 instances. From this set we include only images that are explicitly labeled as benign or malignant. We additionally remove any data points from the ‘SONIC’ study, as these are all benign cases and are easily identified via colorful dots on the images (Sohoni et al., 2020). The remaining 11,952 instances are, to our knowledge, identical to the “Non-patch” dataset in Sohoni et al. (2020), up to the random splits into training/validation and test sets.

As groups, we subset based on the approximate age of the patient who is the subject of the photo. Approximate age is binned to the nearest 5 years in the original dataset; we define groups A $= \{i : \texttt{age\_approx}[i] \geq 55\}$ and B $= \{i : \texttt{age\_approx}[i] < 55\}$. Of the group A instances in the training and test sets, 31.4% are malignant; of the group B instances, 8.37% are malignant. We set $\vec{\gamma}$ to match the distribution in the 11,952 data points in our preprocessed set, so that $\vec{\gamma} = [0.43, 0.57]$. We split 80% of the data into a train set and 20% into a test set, with the same proportion of groups in each set.

The ISIC archive is a compilation of medical images aggregated from many individual studies. Experiments detailed in Sections 4.5 and B.7 use the study from which each image originates as the group variable, to investigate the interactions between groups when $|\mathcal{G}| > 2$. These results also motivate our choice to exclude the SONIC study from the dataset we use for the main analysis.

Download links. Code for processing these datasets as described above can be found at https://github.com/estherrolf/representation-matters. The datasets we use (before our subsetting/preprocessing) can be downloaded at the following links:

Loss metrics. For binary prediction problems, we report the 0/1 loss when there is no significant class imbalance (modified CIFAR-4, Adult). For the ISIC and Mooc tasks, we report 1 - AUROC, where AUROC is the area under the receiver operating characteristic curve. AUROC is a standard metric for medical image prediction (Sohoni et al., 2020), and for Mooc we choose this metric due to the label imbalance (low certification rates). Since AUROC is constrained to be between 0 and 1, the loss metric 1 - AUROC will also be between 0 and 1. For the Goodreads dataset, we optimize the $\ell_1$ loss (mean absolute error, MAE); in Section C.2 we compare this to minimizing the $\ell_2$ loss (mean squared error, MSE) of the predictions.
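For concreteness, both loss metrics are available in scikit-learn; a small sketch with made-up predictions:

```python
# Loss metrics used across datasets: 0/1 loss and 1 - AUROC.
import numpy as np
from sklearn.metrics import roc_auc_score, zero_one_loss

y_true = np.array([0, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2])  # predicted probabilities

auroc_loss = 1.0 - roc_auc_score(y_true, y_score)             # ISIC, Mooc
zo_loss = zero_one_loss(y_true, (y_score > 0.5).astype(int))  # CIFAR-4, Adult
```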

B.2 Model Details

For the neural net classifiers, we compare the empirical risk minimization (ERM) objective with an importance weighted objective, implemented via importance sampling (IS) following results in Buda et al. (2018), and a group distributionally robust optimization (GDRO) objective (Sagawa et al., 2020). We adapt our group distributionally robust training procedure from https://github.com/kohpangwei/group_DRO, the code repository accompanying Sagawa et al. (2020). For the non-image datasets, we choose the model class from a set of common prediction functions (Section B.3). We use implementations of these algorithms from https://scikit-learn.org, using the built-in sample weight parameters to implement importance weighting (IW). Since many of the prediction functions we consider are not gradient-based, we cannot apply the algorithm from Sagawa et al. (2020); thus we do not compare to GDRO for the non-image datasets.

  • CIFAR-4 (metric: 0/1 loss; model: pretrained resnet-18, fine-tuned)
    - ERM: num_epochs = 20, lr = 1e-3, wd = 1e-4, momentum = 0.9
    - IS: num_epochs = 20, lr = 1e-3, momentum = 0.9; wd = 1e-2 if $\alpha_A \geq 0.98$ or $\alpha_A \leq 0.02$, wd = 1e-3 otherwise
    - GDRO: num_epochs = 20, lr = 1e-3, wd = 1e-3, momentum = 0.9, gdro_ss = 1e-2, group_adj = 4.0
  • ISIC (metric: 1 - AUROC; model: pretrained resnet-18, fine-tuned)
    - ERM: num_epochs = 20, lr = 1e-3, wd = 1e-4, momentum = 0.9
    - IS: num_epochs = 20, lr = 1e-3, wd = 1e-3, momentum = 0.9
    - GDRO: num_epochs = 20, lr = 1e-3, wd = 1e-4, momentum = 0.9, gdro_ss = 1e-1, group_adj = 1.0
  • Goodreads* (metric: $\ell_1$ loss (MAE); model: multiclass logistic regression)
    - ERM: C = 1.0, penalty = $\ell_2$
    - IW: C = 1.0, penalty = $\ell_2$
  • Goodreads (metric: $\ell_2$ loss (MSE); model: ridge regression)
    - ERM: $\lambda = 0.1$
    - IW: $\lambda = 1.0$ if $\alpha_A \geq 0.95$ or $\alpha_A \leq 0.05$; $\lambda = 0.1$ otherwise
  • Mooc* (with dem. features) (metric: 1 - AUROC; model: random forest classifier)
    - ERM: num_trees = 400, max_depth = 16
    - IW: num_trees = 400, max_depth = 16
  • Mooc (no dem. features) (metric: 1 - AUROC; model: random forest classifier)
    - ERM: num_trees = 400, max_depth = 8
    - IW: num_trees = 400, max_depth = 8
  • Adult (with group features) (metric: 0/1 loss; model: random forest classifier)
    - ERM: num_trees = 200, max_depth = 16
    - IW: num_trees = 200, max_depth = 16
  • Adult* (without group features) (metric: 0/1 loss; model: random forest classifier)
    - ERM: num_trees = 400, max_depth = 16
    - IW: num_trees = 400, max_depth = 16
Table B.3: Models and hyperparameters used to generate Figure 1. Asterisks denote the Goodreads, Mooc, and Adult settings that appear in Figure 1; the other settings appear in the supplementary experiments of Appendix C.
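To make the IW objective above concrete, the following sketch re-weights each training point by $w(g) = \gamma_g / \alpha_g$ and passes the weights through scikit-learn's `sample_weight`; the synthetic data and target $\vec{\gamma}$ are illustrative.

```python
# Importance weighting (IW) via sample_weight in scikit-learn.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def group_weights(g_train, gamma):
    # w(g) = gamma_g / alpha_g, so the weighted training loss targets gamma.
    alphas = np.bincount(g_train) / len(g_train)  # empirical allocation
    return gamma[g_train] / alphas[g_train]

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
g = rng.integers(0, 2, size=500)
y = (X[:, 0] + 0.5 * g + rng.normal(size=500) > 0).astype(int)

gamma = np.array([0.5, 0.5])  # target population proportions
clf = RandomForestClassifier(n_estimators=200, max_depth=16)
clf.fit(X, y, sample_weight=group_weights(g, gamma))
```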

B.3 Hyperparameter Selection

We evaluate hyperparameters for each prediction model using 5-fold cross validation on the training sets (see Section B.1), stratifying folds to maintain group proportions. We run the cross validation across a sparse grid of $\alpha_A \in [0.02, 0.05, 0.2, 0.5, 0.8, 0.95, 0.98]$, allowing us to determine whether hyperparameters should be set differently for different values of $\vec{\alpha}$. Table B.3 describes the models and parameters chosen as a result of this process.
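A sketch of one step of this selection procedure is below, stratifying folds on group membership; the synthetic data and the depth grid are illustrative.

```python
# 5-fold CV over a hyperparameter grid, with folds stratified by group.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import zero_one_loss
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
g = rng.integers(0, 2, size=1000)  # group labels used for stratification
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for max_depth in [8, 16, 32]:
    losses = []
    for tr, va in skf.split(X, g):  # stratify on g to preserve proportions
        clf = RandomForestClassifier(n_estimators=200, max_depth=max_depth)
        clf.fit(X[tr], y[tr])
        losses.append(zero_one_loss(y[va], clf.predict(X[va])))
    print(max_depth, np.mean(losses))
```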

Image Datasets. For the results shown in Figure 1 with image datasets (CIFAR-4 and ISIC), we fine-tune a pretrained resnet-18 by running SGD with momentum for 20 epochs for each dataset. We did not find significant improvements from training for more epochs on either dataset. Sohoni et al. (2020) use a pretrained resnet-50 for the ISIC prediction task; we use a resnet-18 since we are mostly working with smaller training set sizes due to subsetting. We did not find major differences in ERM performance on ISIC between the resnet-18 and resnet-50 for the dataset sizes we considered. We use a resnet-18 for all models for the CIFAR-4 and ISIC tasks.

In the 5-fold cross validation procedure, we search over learning rate in $[0.01, 0.001, 0.0001]$ and weight decay in $[0.01, 0.001, 0.0001]$, keeping the momentum at 0.9 for both the importance sampling and ERM procedures. Given these results, for GDRO, we search over group adjustment in $[1.0, 4.0, 8.0]$, GDRO step size in $[0.1, 0.01, 0.001]$, and weight decay in $[0.001, 0.0001]$, fixing momentum at 0.9 and learning rate at 0.001.

The optimal hyperparameter configurations for the coarse grid of $\vec{\alpha}$ are given in Table B.3. Across $\vec{\alpha}$ values from the coarse grid, either the optimal parameters for each objective were largely consistent, or performance did not vary greatly between hyperparameter configurations, for almost all dataset/objective pairs. As a result, for both the modified CIFAR-4 and ISIC datasets, we keep the hyperparameters fixed across $\vec{\alpha}$, with the exception of IS for CIFAR-4, where we increase the weight decay for extreme $\vec{\alpha}$ (see Table B.3).

Non-Image Datasets. For the Goodreads dataset, we evaluate the following models and parameters: a ridge regression model with $\lambda \in [0.01, 0.1, 1.0, 10.0, 100.0]$; a random forest regressor with splits determined by MSE, with number of trees and maximum depth of trees in $[100, 200, 400] \times [32, 64, 128]$; and a multiclass logistic regression classifier with $\ell_2$ penalty and inverse regularization strength $C \in [0.01, 0.1, 1.0, 10.0]$.

The multiclass logistic regression model minimized mean absolute error (MAE) among the models we considered. For ERM, the optimal regularization parameter was $C = 1.0$ for all $\vec{\alpha}$ in our sparse grid. For IW, the optimal value was $C = 1.0$ for all $\alpha_A$ other than 0.98, where the optimum for history MAE was $C = 10.0$. Since this affected only one group and was not symmetric across $\vec{\alpha}$, we set $C = 1.0$ for all $\vec{\alpha}$ in the evaluation results.

For the Mooc dataset, we evaluate a binary logistic regression model with $\ell_2$ penalty and inverse regularization parameter $C \in [0.001, 0.01, 0.1, 1.0, 10.0]$; a random forest classifier with number of trees and maximum depth of trees in $[100, 200, 400] \times [8, 16, 32]$; and a ridge regression model with $\lambda \in [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0]$ and a threshold of 0.5 for the binary classification decision. The best model across both group and population accuracies was a random forest; the best maximum depth was 16 for both ERM and IW, and results were robust to the number of estimators, so we chose 200 for both. The best hyperparameters were largely consistent across all $\vec{\alpha}$ in the sparse grid.

For the Adult dataset, we evaluate the same models and parameter configurations as for the Mooc dataset. The best model for both ERM and IW was the random forest, with optimal values given in Table B.3.

B.4 Navigating Tradeoffs

Using the hyperparameters selected by the procedure described in Section B.3, we evaluate performance on the held-out test sets (see Section B.1). For the final evaluation, we evaluate $\alpha_A \in [0.0, 0.01, 0.02, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 0.98, 0.99, 1.0]$, skipping the extremes for GDRO and IS/IW.

The training set size is determined by the smaller of the training set sizes for each group, denoted $\min_g n_g$ in Table 1. This results in $n = 10{,}000$ for the modified CIFAR-4, $n = 4{,}092$ for ISIC, $n = 50{,}000$ for Goodreads, $n = 3{,}897$ for Mooc, and $n = 10{,}771$ for Adult. Note that the number of test set instances ($n_{\textrm{test}}$ in Table 1) does not necessarily follow the proportions indicated by $\gamma_A$ in Table 1. For the CIFAR-4 and Goodreads datasets, the test set is constructed to have 50% instances from group A and 50% from group B; in reporting performance, we take a weighted average over the errors from each group, as in Eq. (1). For Adult, we re-weight the test set instances to reflect $\vec{\gamma} = (0.5, 0.5)$. For the remaining two datasets, the test set compositions align with $\vec{\gamma}$. We assess variability in the performance of each method under each setting by examining results over 10 random seeds, which apply to both the random sampling in the subsetting procedure and the randomness in the training procedure.
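The core subsetting step, drawing a training set of fixed size $n$ at allocation $\vec{\alpha}$ from per-group index pools, can be sketched as follows; pool sizes and names are illustrative.

```python
# Sample a training set of size n with group allocation alpha.
import numpy as np

def sample_allocation(group_pools, alpha, n, rng):
    """group_pools: dict g -> array of available indices for group g."""
    take = []
    for g, pool in group_pools.items():
        n_g = int(round(alpha[g] * n))
        take.append(rng.choice(pool, size=n_g, replace=False))
    return np.concatenate(take)

rng = np.random.default_rng(0)
pools = {0: np.arange(0, 10_000), 1: np.arange(10_000, 20_000)}
train_idx = sample_allocation(pools, alpha={0: 0.3, 1: 0.7}, n=10_000, rng=rng)
```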

B.5 Assessing Scaling Law Fit

Figure B.2: Estimated scaling law fits describe observed trends of group errors as a function of $(n_g, n)$. Grey points are not included in the scaling law fit, as $n_g < M_g$ (see Table 2).

In addition to the results on the holdout set in the previous section, where we vary $\vec{\alpha}$ and keep the size of the training set fixed at $n = \min_g n_g$, we conduct subsetting experiments to assess the effect of $n$, as well as $n_g$, on the group accuracies as in Eq. (7). Specifically, for groups A and B, we vary the relative number of data points from group A and group B, as well as the total number of datapoints $n$. We define our subsetting grid by subsampling twice for each configuration of $(n_g, n)$. First, we choose an allocation ratio of group A to group B (from options $[0.125\!:\!1, 0.25\!:\!1, 0.5\!:\!1, 1\!:\!1]$). Then, we subsample again to a fraction $x$ of the largest subset size for this allocation ratio, for $x \in [0.01, 0.02, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]$ (we skip $x = 0.01$ for allocation ratios $< 1$). We repeat the sampling pattern for all pairs of allocation ratio and $x$, and again with the roles of groups A and B switched. This results in a set of 99 unique $(n_A, n_B)$ pairs, which we combine with the 25 pairs from the previous experiment, where we fix $n = n_A + n_B$ (see Figure B.3). We run ten random seeds for each $(n_A, n_B)$ configuration, for a total of 1240 points from which we estimate the scaling law fits.

Figure B.3: Sample configurations for $(n_g, n)$ described in Section B.5, shown for the CIFAR-4 dataset, where $\min_g n_g = 10{,}000$. Note that there are two groups, so $n = n_A + n_B$.

Since our main point of comparison is across training set sizes and allocations, rather than optimizing hyperparameters for each new sample size, we use the same hyperparameter configurations as for the first set of experiments (ERM rows in Table B.3). For the neural net classifiers, we decrease the number of epochs as the total training set size $n$ increases, so as to keep roughly the same number of gradient steps per training run. Specifically, we set the number of epochs to the nearest integer to (# epochs for first set of experiments) $\times$ ($n$ for first set of experiments) / ($n$ for current run), where the first set of experiments corresponds to those in Figure 1.

Together with the results from the previous experiment shown in Figure 1, we use these values of $(n, n_A, n_B)$ and the accuracies evaluated on groups A and B to estimate the parameters of the scaling laws in Eq. (7). We use the nonlinear least squares implementation of scipy.optimize.curve_fit. The estimated parameters are given in Table 2; the standard deviations reported there are those estimated by the nonlinear least squares fits. Figure B.2 plots the fitted model over a subset of the data used to fit it, showing that the modified scaling model in Eq. (7) can express the trends in per-group losses as a function of $n_g$ and $n$.
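A sketch of the fitting step is below; the observed losses are stand-ins generated from the model itself, and the parameter names follow Eq. (7).

```python
# Fit the scaling law of Eq. (7) with scipy.optimize.curve_fit.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(sizes, sigma2, tau2, delta, p, q):
    n_g, n = sizes
    return sigma2 * n_g ** (-p) + tau2 * n ** (-q) + delta

rng = np.random.default_rng(0)
n_g = rng.integers(100, 10_000, size=300).astype(float)
n = n_g + rng.integers(100, 10_000, size=300)
true_params = (2.0, 1.0, 0.05, 0.4, 0.6)
loss = scaling_law((n_g, n), *true_params) + 0.002 * rng.normal(size=300)

params, cov = curve_fit(scaling_law, (n_g, n), loss,
                        p0=(1.0, 1.0, 0.1, 0.5, 0.5), maxfev=20_000)
param_sds = np.sqrt(np.diag(cov))  # standard deviations, as in Table 2
```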

The parameter estimates sometimes have large standard deviations; we also found the parameter estimates to vary with $M_{g}$, the minimum number of points required to include a data point in the fitting procedure, as well as with the overall subsetting design. When seeking precise and robust estimates of the parameters in Eq. (7), we suggest following the experimental procedures outlined in Clauset et al. (2009), and additionally accounting for potential variation due to sampling patterns.

B.6 Pilot Sample Experiment

In this experiment, we take a small random sample from the Goodreads training set to instantiate a “pilot sample” from which to estimate scaling law parameters and suggest an $\hat{\alpha}^{*}$ at which to sample a larger training set in a subsequent collection effort. The pilot sample contains $5{,}000$ total instances, with $n_{A}=n_{B}=2{,}500$. From the pilot sample, we construct a series of subset configurations $(n_{A},n_{B})$, similar to the procedure outlined in Section B.5, using relative allocations in $[0.0625\!:\!1,\ 0.125\!:\!1,\ 0.25\!:\!1,\ 0.5\!:\!1,\ 1\!:\!1]$ and $x\in[0.02, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]$. For each subset configuration, we sample 5 random seeds. From these results, we estimate the scaling parameters of Eq. (7) according to the performance of each subset configuration on the held-out evaluation set. We set the minimum number of points from which to fit the scaling parameters to $M_{g}=250$.

Next, we use these estimated fits to extrapolate predicted per-group losses at larger sample sizes. For a given sample size, we suggest the $\hat{\alpha}^{*}$ that minimizes the maximum of the estimated per-group losses. We use the training data held separate from the pilot sample to sample a training set at allocation $\vec{\alpha}=\hat{\alpha}^{*}$, and evaluate performance on the original held-out evaluation set. We follow the same procedure for the baseline allocations $\vec{\alpha}=\vec{\gamma}$ and $\vec{\alpha}=(0.5,0.5)$.
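A minimal sketch of this selection step: given per-group fits of the (assumed) form above, predict each group's loss at the new sample size for every candidate allocation and pick the minimax allocation. The parameter values below are placeholders, not our pilot-sample estimates.

```python
import numpy as np

def predicted_loss(fit, n_g, n):
    sigma, p, tau, q, delta = fit
    return sigma * n_g ** (-p) + tau * n ** (-q) + delta

# Placeholder pilot-sample fits for groups A and B: (sigma, p, tau, q, delta).
fits = {"A": (0.5, 0.4, 0.3, 0.5, 0.01), "B": (0.8, 0.3, 0.4, 0.5, 0.02)}

def minmax_alpha(fits, n_new, grid=np.linspace(0.01, 0.99, 99)):
    # alpha = fraction of the n_new samples allocated to group A.
    worst = [max(predicted_loss(fits["A"], a * n_new, n_new),
                 predicted_loss(fits["B"], (1 - a) * n_new, n_new))
             for a in grid]
    return float(grid[int(np.argmin(worst))])

print(minmax_alpha(fits, n_new=40_000))  # suggested alpha-hat*
```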

As a point of comparison, we also compute the results for all $\alpha$ in a grid of resolution $0.01$, i.e., $\alpha\in[0.0, 0.01, 0.02, \ldots, 0.99, 1.0]$, and denote by $\alpha^{*}_{\textrm{grid}}$ the allocation value in this grid that minimizes the average maximum group loss over trials. The best allocations possible (up to the 0.01 grid resolution) were: $\alpha^{*}_{\textrm{grid}}=0.01$ for $n_{\textrm{new}}=5{,}000$, $\alpha^{*}_{\textrm{grid}}=0.06$ for $n_{\textrm{new}}=10{,}000$, $\alpha^{*}_{\textrm{grid}}=0.03$ for $n_{\textrm{new}}=20{,}000$, and $\alpha^{*}_{\textrm{grid}}=0.07$ for $n_{\textrm{new}}=40{,}000$. The $\alpha^{*}_{\textrm{grid}}$ baseline helps contextualize the performance of the other allocation strategies relative to the optimal-in-hindsight allocation, $\alpha^{*}_{\textrm{minmax}}$. Furthermore, the variability across trials due to subsetting around $\alpha=\alpha^{*}_{\textrm{grid}}$ is largely consistent with that of the other allocation sampling strategies examined.

We repeat this entire procedure (starting with generation of the pilot sample) for 10 random seeds; the results are reported in Figure 2. Since the Goodreads dataset has enough training data outside of the pilot sample to simulate gathering larger datasets of up to $40{,}000$ instances, we simulate collecting a new training dataset of size $n_{\textrm{new}}\in[5000, 10000, 20000, 40000]$, i.e., $[1\times, 2\times, 4\times, 8\times]$ the size of the pilot training dataset, respectively.

The error bars in Figure 2 are similarly large for all three allocation strategies we compare. The allocation $\vec{\alpha}=\hat{\alpha}^{*}_{\textrm{minmax}}$ when $n_{\textrm{new}}=40{,}000$ is an exception, indicating that high variation in $\hat{\alpha}^{*}_{\textrm{minmax}}$ may be an issue when $n_{\textrm{new}}$ is large relative to the pilot training sample size.

B.7 Interactions Between Groups

While the main experiments exclude the SONIC study from the ISIC dataset, consistent with the ‘non-patch’ dataset in Sohoni et al. (2020), this experiment utilizes the larger corpus of labeled images from the ISIC repository; the difference is the inclusion of the SONIC sub-study. This larger dataset has 21,203 instances after restricting to cases labeled precisely as benign or malignant. The number of data instances coming from each sub-study is given by the grey bars (black annotated numbers) in Figure 4; purple bars denote the number of malignant instances. Note the logarithmic scale.

We separate 16,965 images into the training set and the remaining 4,238 into the test set, with equal ratios of each sub-study represented in the train and test sets. The train and test splits differ between the study-level and sub-study-level analyses, accounting for differing values in corresponding cells of Figure 3(a) and Figure 3(b); this is particularly apparent when evaluating on JID Editorial Images, the study with the smallest number of data instances (see Figure 4). We choose hyperparameters based on a 5-fold cross-validation procedure on the 16,965 training instances, searching over the same hyperparameter options as in Section B.3, ultimately using 20 epochs, momentum 0.9, weight decay 0.001, and learning rate 0.001 to tune the resnet-18 model (with the ERM objective). For each of the 5 studies and 10 sub-studies that comprise the larger dataset, we retrain the resnet-18 model on the training set with that study or sub-study excluded from the training data, and compare to the performance of the model trained on the entire training set. As in Section B.5, we keep the number of gradient steps for each training process roughly equal. We run 10 random seeds for all conditions and report the differences in the mean performance metrics across groups.

Figure 4: Sub-studies that comprise the ISIC dataset. The SONIC study is excluded from the main analysis.

Appendix C Supplementary Experiments

In this appendix we detail additional experiments to supplement the findings of Section 4. These experiments investigate the robustness of our findings to different problem formulations and data pre-processing.

C.1 The Effects of Including or Excluding Group Features

Here we compare models that use group or demographic information as features with models that do not. We examine differences in group performance across $\vec{\alpha}$ and the effect of importance weighting in both cases.
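Throughout, importance weighting (IW) re-weights each training example from group $g$ by $\gamma_{g}/\alpha_{g}$, so the weighted training distribution matches the target proportions; a minimal sketch of this standard re-weighting, with toy values (our formulation, not code from the paper):

```python
import numpy as np

gamma = {"A": 0.5, "B": 0.5}   # target population proportions
alpha = {"A": 0.2, "B": 0.8}   # training allocation proportions

groups = np.array(["A", "B", "B", "B", "A"])           # toy group labels
weights = np.array([gamma[g] / alpha[g] for g in groups])
print(weights)  # [2.5, 0.625, 0.625, 0.625, 2.5]
# e.g., passed as sample_weight=weights to a model's .fit().
```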

| dataset | $M_g$ | group $g$ | $\hat{\sigma}_g$ | $\hat{p}_g$ | $\hat{\tau}_g$ | $\hat{q}_g$ | $\hat{\delta}_g$ |
|---|---|---|---|---|---|---|---|
| Mooc (with dem. features) | 50 | edu $\leq 2^{\circ}$ | 0.08 (2.6e-05) | 0.14 (6.0e-03) | 0.73 (0.059) | 0.63 (4.8e-03) | 1.3e-15 (2.6e-04) |
| | | edu $>2^{\circ}$ | 0.038 (6.2e-04) | 0.068 (6.3e-03) | 0.54 (6.5e-03) | 0.61 (9.8e-04) | 2.8e-12 (8.0e-04) |
| Mooc (without dem. features) | 50 | edu $\leq 2^{\circ}$ | 0.41 (0.86) | 1.0 (0.26) | 0.6 (0.028) | 0.56 (3.5e-03) | 0.029 (1.4e-06) |
| | | edu $>2^{\circ}$ | 0.029 (1.3e-03) | 0.055 (0.011) | 0.33 (2.2e-03) | 0.54 (9.5e-04) | 1.9e-14 (1.5e-03) |
| Adult (with group features) | 50 | female | 0.036 (3.2e-06) | 0.14 (2.5e-03) | 0.23 (3.3e-03) | 0.47 (2.3e-03) | 0.055 (2.1e-05) |
| | | male | 0.12 (6.8e-05) | 0.24 (5.8e-04) | 0.3 (6.2e-03) | 0.47 (2.7e-03) | 0.16 (3.5e-06) |
| Adult (without group features) | 50 | female | 0.078 (0.051) | 0.018 (3.6e-03) | 0.43 (8.3e-03) | 0.59 (1.6e-03) | 8.0e-16 (0.052) |
| | | male | 0.066 (2.6e-05) | 0.21 (1.2e-03) | 0.47 (6.5e-03) | 0.50 (1.1e-03) | 0.16 (5.4e-06) |

Table 4: Estimated scaling parameters for Eq. (7) with and without demographic features included. Parentheses denote standard deviations estimated by the nonlinear least squares fit. Parameters are constrained so that $\hat{\tau}_{g},\hat{\sigma}_{g}\geq 0$ and $\hat{p}_{g},\hat{q}_{g}\in[0,2]$.
Figure 1: Comparison of phenomena with group information included or excluded from the training set. Panels: (a) Mooc dataset with demographic features excluded; (b) Mooc dataset with demographic features included; (c) Adult dataset with demographic features excluded; (d) Adult dataset with demographic features included.

For Mooc, we compare the model used in the main analysis against a model trained with all demographic information removed, including education level, the feature by which we group data instances. Five features remain: number of active days, number of video plays, number of forum posts, number of events, and course-id, which we one-hot encode. The random forest model is still the best performing of those we considered, though the optimal hyperparameters after a 5-fold cross-validation search shifted to num_trees=400 and max_depth=8. For the Adult dataset, we add back in the features for sex, wife, and husband. This results in 13 unique features and, after one-hot encoding, 103 feature columns. After running a 5-fold cross-validation search with the new feature matrix, the optimal parameters were num_trees=400 and max_depth=16.
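A sketch of the resulting Mooc model without demographic features, written as a scikit-learn pipeline (the column names are our stand-ins for the five features listed above, and num_trees corresponds to n_estimators in scikit-learn):

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# One-hot encode course-id; pass the four numeric activity features through.
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["course_id"])],
    remainder="passthrough",
)
model = Pipeline([
    ("encode", preprocess),
    ("rf", RandomForestClassifier(n_estimators=400, max_depth=8)),
])
# model.fit(X_train, y_train) on a DataFrame with columns: n_active_days,
# n_video_plays, n_forum_posts, n_events, course_id.
```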

Figure 1 compares the results after excluding (left column) and including (right column) this information in the training feature matrix. Models with the group features excluded generally exhibit less degradation in performance for $\alpha_{A}$ near 0 or 1. The negative impacts of IW are lessened slightly when the models do not incorporate groups as features.

The different scaling law fits are shown in Table 4. For the Adult task without group features, the estimated exponent and scaling constant on group-agnostic data, $\hat{q}_{g}$ and $\hat{\tau}_{g}$, are larger than for the model with demographic features. For the Mooc task, the effect is reversed. In all settings, the fitted parameters show that $n$ influences group errors more than $n_{g}$, so long as $n_{g}$ is at least $M_{g}=50$ (the region over which we estimate the fit).

In the main analysis, we include demographic features for the Mooc certification prediction task, as this increases accuracy for the “edu $\leq$ secondary” group, and exclude gender features from the Adult income prediction task, as doing so does not greatly impact performance for $\alpha_{A}$ near 0.5 and improves performance for more extreme settings of $\alpha_{A}$ (Figure 1). For the other three datasets we study, there is no single feature corresponding to the group that we can hold out in a similar fashion.

C.2 Trends across models and performance metrics

Figure 2 contrasts the trade-offs we can navigate with different allocations and re-weighting schemes for different models and accuracy metrics. The left-hand side of Figure 2 shows the result of the model fitting and evaluation procedure that minimizes the $\ell_{1}$ loss (MAE) of the Goodreads predictions (the same results as shown in Figure 1). The right-hand side shows the result of the same model fitting and evaluation procedure applied to minimizing the $\ell_{2}$ loss (MSE), for which ridge regression is the best performing model of those we consider. The ridge model evaluated with MSE exhibits similar trends across $\vec{\alpha}$ values as the multiclass logistic regression model evaluated on MAE. The value $\alpha^{*}_{A}$ that minimizes population loss is similar for both methods: near $\gamma_{A}$, though on opposite sides of it. The degradation in population loss for $\alpha_{A}$ close to 0 or 1 is also apparent for the ridge model, and the range of $\vec{\alpha}$ for which importance weighting (IW) increases per-group and population losses compared to empirical risk minimization (ERM) is similar across models.
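The sketch below illustrates the two model/metric pairings on synthetic stand-in data (the Goodreads features and ratings are not reproduced here): a multiclass logistic regression scored with MAE, and ridge regression scored with MSE.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
# Synthetic 1-5 star ratings driven by the first feature.
y = np.clip(np.round(X[:, 0] + rng.normal(scale=0.5, size=500)) + 3, 1, 5)
X_tr, X_te, y_tr, y_te = X[:400], X[400:], y[:400], y[400:]

# Left panel analogue: multiclass logistic regression, evaluated with l1 loss (MAE).
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("MAE:", mean_absolute_error(y_te, clf.predict(X_te)))

# Right panel analogue: ridge regression, evaluated with l2 loss (MSE).
reg = Ridge(alpha=1.0).fit(X_tr, y_tr)
print("MSE:", mean_squared_error(y_te, reg.predict(X_te)))
```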

Figure 2: Comparison of phenomena for different accuracy metrics and models on the Goodreads dataset.