
Representation Matters:
Assessing the Importance of Subgroup Allocations in Training Data

Esther Rolf, Theodora Worledge, Benjamin Recht, and Michael I. Jordan
Department of Electrical Engineering and Computer Sciences and Department of Statistics, University of California, Berkeley
Abstract

Collecting more diverse and representative training data is often touted as a remedy for the disparate performance of machine learning predictors across subpopulations. However, a precise framework for understanding how dataset properties like diversity affect learning outcomes is largely lacking. By casting data collection as part of the learning process, we demonstrate that diverse representation in training data is key not only to increasing subgroup performances, but also to achieving population-level objectives. Our analysis and experiments describe how dataset compositions influence performance and provide constructive results for using trends in existing data, alongside domain knowledge, to help guide intentional, objective-aware dataset design.

1 Introduction

Datasets play a critical role in shaping the perception of performance and progress in machine learning (ML)—the way we collect, process, and analyze data affects the way we benchmark success and form new research agendas (Paullada et al., 2020; Dotan & Milli, 2020). A growing appreciation of this determinative role of datasets has sparked a concomitant concern that standard datasets used for training and evaluating ML models lack diversity along significant dimensions, for example, geography, gender, and skin type (Shankar et al., 2017; Buolamwini & Gebru, 2018). Lack of diversity in evaluation data can obfuscate disparate performance when evaluating based on aggregate accuracy (Buolamwini & Gebru, 2018). Lack of diversity in training data can limit the extent to which learned models can adequately apply to all portions of a population, a concern highlighted in recent work in the medical domain (Habib et al., 2019; Hofmanninger et al., 2020).

Our work aims to develop a general unifying perspective on the way that dataset composition affects outcomes of machine learning systems. We focus on dataset allocations: the number of datapoints from predefined subsets of the population. While we acknowledge that numerical inclusion of groups is an imperfect proxy of representation, we believe that allocations provide a useful initial mathematical abstraction for formulating relationships among diversity, data collection, and statistical risk. We discuss broader implications of our formulation in Section 5.

With the implicit assumption that the learning task is well specified and performance evaluation from data is meaningful for all groups, we ask:

1. Are group allocations in training data pivotal to performance? To what extent can methods that up-weight underrepresented groups help, and when might up-weighting actually hurt performance?

Taking a point of view that data collection is a critical component of the overall machine learning process, we study the effect that dataset composition has on group and population accuracies. This complements work showing that simply gathering more data can mitigate some sources of bias or unfairness in ML outcomes (Chen et al., 2018), a phenomenon which has been observed in practice as well. Indeed, in response to the Gender Shades study (Buolamwini & Gebru, 2018), companies selectively collected additional data to decrease the exposed inaccuracies of their facial recognition models for certain groups, often raising aggregate accuracy in the process (Raji & Buolamwini, 2019). Given the potential for targeted data collection efforts to repair unintended outcomes of ML systems, we next ask:

2. How might we describe “optimal” dataset allocations for different learning objectives? Does the often-witnessed lack of diversity in large-scale datasets align with maximizing population accuracy?

We show that purposeful data collection efforts can proactively support intentional objectives of an ML system, and that diversity and population objectives are often aligned. Many datasets have recently been designed or amended to exhibit diversity of the underlying population (Ryu et al., 2017; Tschandl et al., 2018; Yang et al., 2020). Such endeavors are significant undertakings, as data gathering and annotation must consider consent, privacy, and power concerns in addition to inclusivity, transparency and reusability (Gebru et al., 2018; Gebru, 2020; Wilkinson et al., 2016). Given the importance of more representative and diverse datasets, and the effort required to create them, our final question asks:

3. When and how can we leverage existing datasets to help inform better allocations, towards achieving diverse objectives in a subsequent dataset collection effort?

Representation bias, or systematic underrepresentation of subpopulations in data, is one of many forms of bias in ML (Suresh & Guttag, 2019). Our work provides a data-focused perspective on the design and evaluation of ML pipelines. Our main contributions are:

1. We analyze the complementary roles of dataset allocation and algorithmic interventions for achieving per-group and total-population performance (Section 2). Our experiments show that while algorithmically up-weighting underrepresented groups can help, dataset composition is the most consistent determinant of performance (Section 4.2).

2. We propose a scaling model that describes the impact of dataset allocations on group accuracies (Section 3). Under this model, when the parameters governing the relative values of within-group data are equal for all groups, the allocation that minimizes population risk overrepresents minority groups.

3. We demonstrate that our proposed scaling model captures major trends in the relationship between dataset allocations and performance (Sections 4.3 and 4.5). We show that a small initial sample can be used to inform subsequent data collection efforts to, for example, maximize the minimum accuracy over groups without sacrificing population accuracy (Section 4.4).

Sections 2 and 3 formalize data collection as part of the learning problem and derive results under illustrative settings. Experiments in Section 4 support these results and expose nuances inherent to real-data contexts. Section 5 synthesizes results and delineates future work.

1.1 Additional Related Work

Targeted data collection in ML. Recent research shows that targeted data collection can be an effective way to reduce disparate performance of ML models evaluated across sub-populations (Raji & Buolamwini, 2019). Chen et al. (2018) present a formal argument that the addition of training data can lessen discriminatory outcomes while improving accuracy of learned models, and Abernethy et al. (2020) show that adaptively collecting data from the lowest-performing sub-population can increase the minimum accuracy over groups. At the same time, there are many complexities of gathering data as a solution to disparate performance across groups. Targeted data collection from specific groups can present undue burdens of surveillance or skirt consent (Paullada et al., 2020). When ML systems fail portions of their intended population due to issues of measurement and construct validity, more thorough data collection is unlikely to solve the issue without further intervention (Jacobs & Wallach, 2019).

With these complexities in mind, we study the importance of numerical representation in training datasets in achieving diverse objectives. Optimal allocation of subpopulations in statistical survey designs dates back to at least Neyman (1934), including stratified sampling methods to ensure coverage across sub-populations (Lohr, 2009). For more complex prediction systems, the field of optimal experimental design (Pukelsheim, 2006) studies what inputs are most valuable for reaching a given objective, often focusing on linear prediction functions. We consider a constrained sampling structure and directly model the impact of group allocations on subgroup performance for general model classes.

Valuing data. In economics, allocations indicate a division of goods to various entities (Cole et al., 2013). While we focus on the influence of data allocations on model accuracies across groups, there are many approaches to valuing data. Methods centering on a theory of Shapley valuations (Yona et al., 2019; Ghorbani & Zou, 2019) complement studies of the influence of individual data points on model performance to aid subsampling data (Vodrahalli et al., 2018).

Methods for handling group-imbalanced data. Importance sampling and importance weighting are standard approaches to addressing class imbalance or small group sizes (Haixiang et al., 2017; Buda et al., 2018), though the effects of importance weighting for deep learning may vary with regularization (Byrd & Lipton, 2019). Other methods specifically address differential performance between groups. Maximizing minimum performance across groups can reduce accuracy disparities (Sagawa et al., 2020) and promote fairer sequential outcomes (Hashimoto et al., 2018). For broader classes of group-aware objectives, techniques exist to mitigate unfairness or disparate performance of black box prediction functions (Dwork et al., 2018; Kim et al., 2019). It might not be clear a priori which subsets need attention; Sohoni et al. (2020) propose a method to identify and account for hidden strata, while other methods are defined for any subsets (Hashimoto et al., 2018; Kim et al., 2019).

In addition to modifying the optimization objective or learning algorithm, one can also modify the input data itself by re-sampling the training data to match the desired population by downsampling or by upsampling with data augmentation techniques (Chawla et al., 2002; Iosifidis & Ntoutsi, 2018).

1.2 Notation

$\Delta^k$ denotes the $k$-dimensional simplex; $\mathbb{Z}^+$ denotes the non-negative integers and $\mathbb{R}^+$ the non-negative reals.

2 Training set allocations and alternatives

We study settings in which each data instance is associated with a group $g_i$, so that the training set can be expressed as $\mathcal{S}=\{x_i,y_i,g_i\}_{i=1}^n$, where $x_i,y_i$ denote the features and labels of each instance. We index the discrete groups by integers $\mathcal{G}=\{1,\dots,|\mathcal{G}|\}$; when we specifically consider just two groups, we write $\mathcal{G}=\{A,B\}$. We assume that groups are disjoint and cover the entire population, with $\gamma_g = P_{(X,Y,G)\sim\mathcal{D}}[G=g]$ denoting the population prevalence of group $g$, so that $\vec{\gamma}\in\Delta^{|\mathcal{G}|}$. Groups could represent inclusion in one of many binned demographic categories, or simply a general association with latent characteristics that are relevant to prediction.

For a given population with distribution $\mathcal{D}$ over features, labels, and groups, we are interested in the population-level risk $\mathcal{R}(\hat{f}(\mathcal{S});\mathcal{D}) := \mathbb{E}_{(X,Y,G)\sim\mathcal{D}}[\ell(\hat{f}(X),Y)]$ of a predictor $\hat{f}$ trained on dataset $\mathcal{S}$, as well as group-specific risks. Denoting the group distributions by $\mathcal{D}_g$, defined as conditional distributions via $P_{(X,Y)\sim\mathcal{D}_g}[X=x,Y=y] = P_{(X,Y,G)\sim\mathcal{D}}[X=x,Y=y,G=g]/\gamma_g$, the population risk decomposes as a weighted average over group risks:

$$\mathcal{R}(\hat{f}(\mathcal{S});\mathcal{D}) = \sum_{g\in\mathcal{G}} \gamma_g\cdot\mathcal{R}(\hat{f}(\mathcal{S});\mathcal{D}_g). \qquad (1)$$

In Section 2.2 we will assume that the loss $\ell(\hat{y},y)$ is a separable function over data instances. While this holds for many common loss functions, some objectives do not decouple in this sense (e.g., group losses and associated classes of fairness-constrained objectives; see Dwork et al., 2018). We revisit this point in Sections 4 and 5.

2.1 Training Set Allocations

In light of the decomposition of the population-level risk as a weighted average over group risks in Eq. (1), we now consider the composition of fixed-size training sets, in terms of how many samples come from each group.

Definition 1 (Allocations).

Given a dataset of $n$ triplets $\{x_i,y_i,g_i\}_{i=1}^n$, the allocation $\vec{\alpha}\in\Delta^{|\mathcal{G}|}$ describes the relative proportions of each group in the dataset:

$$\alpha_g := \tfrac{1}{n}\sum_{i=1}^n \mathbb{I}[g_i=g], \quad g\in\mathcal{G}. \qquad (2)$$

It will be illuminating to consider $\vec{\alpha}$ not only as a property of an existing dataset, but as a parameter governing dataset construction, as captured in the following definition.

Definition 2 (Sampling from allocation $\vec{\alpha}$).

Given sample size $n$, group distributions $\{\mathcal{D}_g\}_{g\in\mathcal{G}}$, and allocation $\vec{\alpha}\in\Delta^{|\mathcal{G}|}$ such that $n_g := \alpha_g n \in \mathbb{Z}^+$ for all $g\in\mathcal{G}$, to sample from allocation $\vec{\alpha}$ is procedurally equivalent to independently sampling $|\mathcal{G}|$ disjoint datasets $\mathcal{S}_g$ and concatenating:

$$\mathcal{S}(\vec{\alpha},n) = \bigcup_{g\in\mathcal{G}} \mathcal{S}_g, \qquad \mathcal{S}_g = \{x_i,y_i,g\}_{i=1}^{n_g}, \quad (x_i,y_i)\sim_{\text{i.i.d.}}\mathcal{D}_g. \qquad (3)$$

For $\vec{\alpha}$ not satisfying the requirement that $\alpha_g n$ is integral, we could randomize the fractional allocations, or take $n_g = \lfloor \alpha_g n \rfloor$, reducing the total number of samples to $\sum_g \lfloor \alpha_g n \rfloor$. In the following sections we generally allow allocations with $n_g \notin \mathbb{Z}$, assuming that the effect of up to $|\mathcal{G}|$ fractionally assigned instances is negligible for large $n$.
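To make Definition 2 concrete, the following is a minimal Python sketch of sampling a training set from an allocation; the dictionary-of-pools interface and the floor-based handling of fractional allocations are illustrative assumptions, not the interface of our released code.

```python
import numpy as np

def sample_from_allocation(group_data, alpha, n, rng=None):
    """Sample a training set S(alpha, n) as in Definition 2.

    group_data: dict mapping group id -> (X_g, y_g) arrays to draw from.
    alpha: dict mapping group id -> allocation proportion (sums to 1).
    n: total number of training points. Fractional allocations are
    floored, so the realized size may fall short of n by at most |G|.
    """
    rng = np.random.default_rng(rng)
    X_parts, y_parts, g_parts = [], [], []
    for g, (X_g, y_g) in group_data.items():
        n_g = int(np.floor(alpha[g] * n))  # n_g = floor(alpha_g * n)
        idx = rng.choice(len(X_g), size=n_g, replace=False)
        X_parts.append(X_g[idx])
        y_parts.append(y_g[idx])
        g_parts.append(np.full(n_g, g))
    # Concatenate the per-group samples: S = union over g of S_g.
    return (np.concatenate(X_parts),
            np.concatenate(y_parts),
            np.concatenate(g_parts))
```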

The procedure in Definition 2 suggests formalizing data collection as a component of the learning process in the following way: in addition to choosing a loss function and a method for minimizing the risk, choose the relative proportions at which to sample the groups in the training set:

$$\vec{\alpha}^* = \operatorname*{argmin}_{\vec{\alpha}\in\Delta^{|\mathcal{G}|}}\; \min_{\hat{f}\in\mathcal{F}}\; \mathcal{R}\bigl(\hat{f}(\mathcal{S}(\vec{\alpha},n));\mathcal{D}\bigr).$$

In Section 3, we show that when a dataset curator can design dataset allocations in the sense of Definition 2, they have the opportunity to improve the accuracy of the trained model. Of course, one does not always have the opportunity to collect new data or modify the composition of an existing dataset. Section 2.2 considers methods for using fixed datasets in which some groups have small training set allocations $\alpha_g$ relative to $\gamma_g$, or high risk relative to the population.

2.2 Accounting for Small Group Allocations in Fixed Datasets

In classical empirical risk minimization (ERM), one learns a function from class $\mathcal{F}$ that minimizes the average prediction loss over the training instances $(x_i,y_i,g_i)\in\mathcal{S}$ (we also abuse notation and write $i\in\mathcal{S}$), with optional regularization $R$:

$$\hat{f}(\mathcal{S}) = \operatorname*{argmin}_{f\in\mathcal{F}} \sum_{i\in\mathcal{S}} \ell(f(x_i),y_i) + R(f,\mathcal{S}).$$

There are many methods for addressing small group allocations in data (see Section 1.1). Of particular relevance to our work are objective functions that minimize group or population risks. In particular, one approach is to use importance weighting (IW) to re-weight training samples with respect to a target distribution defined by $\vec{\gamma}$:

$$\hat{f}^{\text{IW}}(\mathcal{S}) = \operatorname*{argmin}_{f\in\mathcal{F}} \sum_{g\in\mathcal{G}} \frac{\gamma_g}{\alpha_g}\left(\sum_{i\in\mathcal{S}_g} \ell(f(x_i),y_i)\right) + R(f,\mathcal{S}).$$

This empirical risk with instances weighted by $\gamma_g/\alpha_g = \gamma_g n/n_g$ is an unbiased estimate of the population risk, up to regularization. While unbiasedness is often desirable, IW can induce high variance in the estimator when $\gamma_g/\alpha_g$ is large for some group (Cortes et al., 2010), which happens when group $g$ is severely underrepresented in the training data relative to its population prevalence.
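As an illustration, the IW objective can be implemented with any learner that accepts per-instance weights. The sketch below uses scikit-learn's logistic regression with weights $\gamma_g/\alpha_g$; the known prevalences `gamma` and the specific model choice are assumptions of this example, not a prescription from our experiments.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_importance_weighted(X, y, groups, gamma):
    """Minimize the IW objective by weighting each instance by gamma_g / alpha_g.

    gamma: dict mapping group id -> population prevalence gamma_g.
    """
    # Empirical allocation alpha_g of the training set (Definition 1).
    alpha = {g: np.mean(groups == g) for g in np.unique(groups)}
    weights = np.array([gamma[g] / alpha[g] for g in groups])
    model = LogisticRegression(max_iter=1000)
    model.fit(X, y, sample_weight=weights)  # weighted ERM
    return model
```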

Alternatively, group distributionally robust optimization (GDRO) (Hu et al., 2018; Sagawa et al., 2020) minimizes the maximum empirical risk over all groups:

$$\hat{f}^{\text{GDRO}}(\mathcal{S}) = \operatorname*{argmin}_{f\in\mathcal{F}} \max_{g\in\mathcal{G}} \left(\tfrac{1}{n_g}\sum_{i\in\mathcal{S}_g} \ell(f(x_i),y_i) + R(f,\mathcal{S}_g)\right).$$

For losses $\ell$ that are continuous and convex in the parameters of $f$, the optimal GDRO solution corresponds to the minimizer of a group-weighted objective $\tfrac{1}{n}\sum_{i=1}^n w(g_i)\cdot\ell(f(x_i),y_i)$, though this does not hold in general for nonconvex losses (see Prop. 1 of Sagawa et al., 2020, and the remark immediately thereafter).
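For intuition, the online GDRO algorithm of Sagawa et al. (2020) interleaves an exponentiated-gradient update on group weights with ordinary gradient steps on the weighted loss. The PyTorch-style sketch below is a simplified rendering under several assumptions (cross-entropy loss, no group-dependent regularization or group adjustments, fixed step size `eta_q`); it is not our exact training code.

```python
import torch

def gdro_step(model, optimizer, batch, q, eta_q=0.01):
    """One simplified GDRO update (cf. Sagawa et al., 2020).

    batch: (X, y, g) tensors for one minibatch; g holds integer group ids.
    q: tensor of current group weights on the simplex.
    """
    X, y, g = batch
    losses = torch.nn.functional.cross_entropy(model(X), y, reduction="none")
    # Per-group average losses on this batch (groups absent get loss 0).
    group_losses = torch.zeros_like(q)
    for gid in range(len(q)):
        mask = (g == gid)
        if mask.any():
            group_losses[gid] = losses[mask].mean()
    # Exponentiated-gradient ascent on the group weights q.
    q = q * torch.exp(eta_q * group_losses.detach())
    q = q / q.sum()
    # Gradient step on the q-weighted loss.
    optimizer.zero_grad()
    (q * group_losses).sum().backward()
    optimizer.step()
    return q
```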

Given the correspondence of GDRO (for convex loss functions) and IW to the optimization of group-weighted ERM objectives, we now investigate the joint roles of sample allocation and group re-weighting for estimating group-weighted risks. For prediction function $f$, loss function $\ell$, and group weights $w:\mathcal{G}\rightarrow\mathbb{R}^+$, let $\hat{L}(w,\vec{\alpha},n;f,\ell)$ be the random variable defined by

$$\hat{L}(w,\vec{\alpha},n;f,\ell) := \tfrac{1}{n}\sum_{i\in\mathcal{S}(\vec{\alpha},n)} w(g_i)\cdot\ell(f(x_i),y_i),$$

where the randomness in $\hat{L}$ comes from the draws of $(x_i,y_i)$ from $\mathcal{D}_{g_i}$ according to the procedure $\mathcal{S}(\vec{\alpha},n)$ (Definition 2), as well as any randomness in $f$.

The following proposition shows that group weights and allocations play complementary roles in risk function estimation. In particular, if $w(g)$ depends on the sampling allocations $\alpha_g$, then there are alternative group weights $w^*$ and an alternative allocation $\vec{\alpha}^*$ such that the alternative estimator has the same expected value but lower variance.

Proposition 1 (Weights and allocations).

For any loss $\ell$, prediction function $f$, and group distributions $\mathcal{D}_g$, there exist weights with $w^*(g) \propto \left(\mathrm{Var}_{(x,y)\sim\mathcal{D}_g}[\ell(f(x),y)]\right)^{-1/2}$ such that for any triplet $(\vec{\alpha},w,n)$ with $\sum_g \alpha_g w(g) > 0$, if $w$ is not approximately proportional to $w^*$,¹ there exists an alternative allocation $\vec{\alpha}^*$ such that

$$\mathbb{E}[\hat{L}(w^*,\vec{\alpha}^*,n;f,\ell)] = \mathbb{E}[\hat{L}(w,\vec{\alpha},n;f,\ell)] \quad\text{and}\quad \mathrm{Var}[\hat{L}(w^*,\vec{\alpha}^*,n;f,\ell)] < \mathrm{Var}[\hat{L}(w,\vec{\alpha},n;f,\ell)].$$

If $w(g) > w^*(g)$ then $\alpha^*_g > \alpha_g$, and if $w(g) < w^*(g)$ then $\alpha^*_g < \alpha_g$.

¹The “approximately” in this relation stems from finite-sample and integrality concerns; for example, the proposition holds if we take “not approximately proportional” to mean $\exists g\in\mathcal{G}: |1 - w(g)/w^*(g)| > |\mathcal{G}|/(\alpha_g n)$.

Proof.

(Sketch; the full proof appears in Section A.1.) For any deterministic weighting function $w:\mathcal{G}\rightarrow\mathbb{R}^+$, there exists a vector $\vec{\gamma}'\in\Delta^{|\mathcal{G}|}$ with entries $\gamma'_g \propto w(g)\alpha_g$ such that

$$\mathbb{E}[\hat{L}(w,\vec{\alpha},n;f,\ell)] = \tfrac{1}{n}\sum_{(x_i,y_i,g_i)\in\mathcal{S}} \mathbb{E}[w(g_i)\,\ell(f(x_i),y_i)] = c\cdot\mathbb{E}_{g\sim\mathrm{Multinomial}(\vec{\gamma}')}\,\mathbb{E}_{(x,y)\sim\mathcal{D}_g}[\ell(f(x),y)]$$

for the constant $c=\sum_g \alpha_g w(g)$. For any fixed $f$ and any “target distribution” defined by $\vec{\gamma}'$, the pair $(\vec{\alpha}^*,w^*)$ that minimizes the variance of the estimator, subject to the constraint $w(g)\alpha_g = c\gamma'_g$ for all $g$, has weights $w^*$ of the form given above. Since the original pair $(\vec{\alpha},w)$ satisfies this constraint, the minimizing pair $(\vec{\alpha}^*,w^*)$ must satisfy $\mathbb{E}[\hat{L}(w^*,\vec{\alpha}^*,n;f,\ell)] = \mathbb{E}[\hat{L}(w,\vec{\alpha},n;f,\ell)]$ and $\mathrm{Var}[\hat{L}(w^*,\vec{\alpha}^*,n;f,\ell)] \leq \mathrm{Var}[\hat{L}(w,\vec{\alpha},n;f,\ell)]$. ∎

Since the estimation of risk functions is a key component of learning, Proposition 1 illuminates an interplay between the roles of sampling allocations and group-weighting schemes like IW and GDRO. When allocations and weights are jointly optimized, the optimal allocation accounts for an implicit target distribution $\vec{\gamma}'$ (defined above), which may vary by objective function, while the optimal weights account for the per-group variability $\mathrm{Var}_{(x,y)\sim\mathcal{D}_g}[\ell(f(x),y)]$. In Section 4 we find that it can be advantageous to use IW and GDRO when some groups have small $\alpha_g/\gamma_g$, though the boost in accuracy is smaller than that of an optimally allocated training set to begin with, and diminishes when all groups are appropriately represented in the training set allocation.
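To make the construction in Proposition 1 concrete, a small helper can compute the variance-optimal pair $(\vec{\alpha}^*, w^*)$ for a given target distribution $\vec{\gamma}'$, assuming the per-group loss variances have been estimated elsewhere (e.g., on held-out data). This is a sketch of the proposition's construction, not an estimator we deploy in the experiments.

```python
import numpy as np

def variance_optimal_pair(gamma_prime, loss_vars):
    """Variance-optimal (alpha*, w*) for estimating a gamma'-weighted risk.

    gamma_prime: array, target distribution over groups (sums to 1).
    loss_vars: array, estimated Var_{D_g}[loss] for each group.
    Returns an allocation alpha* on the simplex and weights w* satisfying
    w*(g) * alpha*_g proportional to gamma'_g (Proposition 1).
    """
    w_star = 1.0 / np.sqrt(loss_vars)           # w*(g) proportional to Var^(-1/2)
    alpha_star = gamma_prime / w_star           # from w(g) * alpha_g = c * gamma'_g
    alpha_star = alpha_star / alpha_star.sum()  # project onto the simplex
    # Rescale weights so sum_g alpha_g * w(g) = 1 preserves the estimand.
    w_star = gamma_prime / alpha_star
    return alpha_star, w_star
```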

3 Allocating samples to minimize population-level risk

Having motivated the importance of group allocations, we now investigate the direct effects of training set allocations on group and population risks. Using a model of per-group performance as a function of allocations, we study the optimal allocations under a variety of settings.

3.1 A Per-group Power-law Scaling Model

We model the impact of allocations on performance with scaling laws that describe per-group risks as a function of the number of data points from their respective group, as well as the total number of training instances.

Assumption 1 (Group risk scaling with allocation).

The group risks $\mathcal{R}(\hat{f};\mathcal{D}_g) := \mathbb{E}_{(x,y)\sim\mathcal{D}_g}[\ell(\hat{f}(x),y)]$ scale approximately as the sum of inverse power functions of the number of samples from group $g$ and the total number of samples. That is, there exist $M_g > 0$; $\sigma_g,\tau_g,\delta_g \geq 0$; and $p,q > 0$ such that for a learning procedure that returns predictor $\hat{f}(\mathcal{S})$, and a training set $\mathcal{S}$ with group sizes $n_g \geq M_g$:

$$\mathcal{R}\bigl(\hat{f}(\mathcal{S}(\vec{\alpha},n));\mathcal{D}_g\bigr) \approx r(\alpha_g n, n; \sigma_g,\tau_g,\delta_g,p,q) \quad \forall g\in\mathcal{G}, \quad\text{where}\quad r(n_g,n;\sigma_g,\tau_g,\delta_g,p,q) := \sigma_g^2 n_g^{-p} + \tau_g^2 n^{-q} + \delta_g. \qquad (4)$$

Assumption 1 is similar to the scaling law in Chen et al. (2018), but includes a $\tau_g^2 n^{-q}$ term to allow for data from other groups to influence the risk evaluated on group $g$. It additionally requires that the same exponents $p,q$ apply to each group, an assumption that underpins our theoretical results in Section 3. We examine the extent to which Assumption 1 holds empirically in Section 4.3, and will modify Eq. (4) to include group-dependent terms $p_g,q_g$ when appropriate. The following examples give intuition into the form of Eq. (4).

Example 1 (Split classifiers per group).

When separate models are trained for each group, using training data only from that group, we expect Eq. (4) to apply with $\tau_g = 0$ for all $g\in\mathcal{G}$. The parameter $p$ could be derived through generalization bounds (Boucheron et al., 2005) or through modeling assumptions (Example 3).

It is often advantageous to pool training data to learn a single classifier. In this case, model performance evaluated on group $g$ will depend on both $n_g$ and $n$, as the next examples show.

Example 2 (Groups irrelevant to prediction).

When groups are irrelevant for prediction and the model class $\mathcal{F}$ correctly accounts for this, we expect Eq. (4) to apply with $\sigma_g = 0$ for all $g\in\mathcal{G}$.

Example 3 (Shared linear model with group-dependent intercepts).

Consider a $(d+1)$-dimensional linear model, where two groups $\{A,B\}$ share a weight vector $\beta$ and features $x\sim\mathcal{N}(0,\Sigma_x)$, but the intercept varies by group:

$$y_i = \beta^\top x_i + c_A\,\mathbb{I}[g_i=A] + c_B\,\mathbb{I}[g_i=B] + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0,\sigma^2).$$

As we show in Section A.5, the ordinary least squares predictor has group risks $\mathbb{E}_{(x,y)\sim\mathcal{D}_g}[(x^\top\hat{\beta} + \hat{c}_g - y)^2] = \sigma^2\left(1 + 1/n_g + O(d/n)\right)$, where the $1/n_g$ term arises because we need samples from group $g$ to estimate the intercept $c_g$, whereas samples from both groups help us estimate $\beta$.
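A quick Monte Carlo simulation can corroborate the $1/n_g$ behavior in Example 3; in the sketch below, the dimension, intercepts, and group sizes are arbitrary illustrative choices.

```python
import numpy as np

def simulate_group_risk(n_A, n_B, d=10, sigma=1.0, trials=200, seed=0):
    """Monte Carlo estimate of group-A test risk for the OLS fit in Example 3."""
    rng = np.random.default_rng(seed)
    beta = rng.normal(size=d)
    c_A, c_B = 1.0, -1.0  # group-dependent intercepts
    risks = []
    for _ in range(trials):
        # Design matrix: shared features plus one indicator column per group.
        X = rng.normal(size=(n_A + n_B, d))
        ind = np.zeros((n_A + n_B, 2))
        ind[:n_A, 0] = 1.0  # group A indicator
        ind[n_A:, 1] = 1.0  # group B indicator
        Z = np.hstack([X, ind])
        y = X @ beta + ind @ np.array([c_A, c_B]) + sigma * rng.normal(size=n_A + n_B)
        theta = np.linalg.lstsq(Z, y, rcond=None)[0]  # OLS fit
        # Fresh test data from group A.
        X_t = rng.normal(size=(2000, d))
        y_t = X_t @ beta + c_A + sigma * rng.normal(size=2000)
        pred = X_t @ theta[:d] + theta[d]  # uses group A's estimated intercept
        risks.append(np.mean((pred - y_t) ** 2))
    return np.mean(risks)  # roughly sigma^2 * (1 + 1/n_A + O(d/n))
```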

Example 3 suggests that in some settings, we can relate $\sigma_g$ and $\tau_g$ to ‘group-specific’ and ‘group-agnostic’ model components that affect performance for group $g$. In general, the relationship between group sizes and group risks can be more nuanced. Data from different groups may be correlated, so that samples from groups similar to or different from $g$ have greater effect on $\mathcal{R}(\hat{f};\mathcal{D}_g)$ (see Section 4.5). Eq. (4) is meant to capture the dominant effects of training set allocations on group risks, and serves as our main structural assumption in the next section, where we study the allocation that minimizes the approximate population risk.

3.2 Optimal (w.r.t. Population Risk) Allocations

We now study properties of the allocation that minimizes the approximated population risk:

$$\hat{\mathcal{R}}(\vec{\alpha},n) := \sum_{g\in\mathcal{G}} \gamma_g\, r(\alpha_g n, n; \sigma_g,\tau_g,\delta_g,p,q) \approx \sum_{g\in\mathcal{G}} \gamma_g\, \mathcal{R}\bigl(\hat{f}(\mathcal{S});\mathcal{D}_g\bigr) = \mathcal{R}\bigl(\hat{f}(\mathcal{S});\mathcal{D}\bigr). \qquad (5)$$

The following proposition lays the foundation for two corollaries, which show that:

1. when only the population prevalences $\vec{\gamma}$ vary between groups, the allocation that minimizes the approximate population risk up-represents groups with small $\gamma_g$;

2. for two groups with different scaling parameters $\sigma_g$, the optimal allocation of the group with $\gamma_g < \tfrac{1}{2}$ is bounded by functions of $\sigma_A$, $\sigma_B$, and $\vec{\gamma}$.

Proposition 2.

Given a population made up of disjoint groups $g\in\mathcal{G}$ with population prevalences $\gamma_g$, under the conditions of Assumption 1, the allocation $\vec{\alpha}^*\in\Delta^{|\mathcal{G}|}$ that minimizes the approximated population risk $\hat{\mathcal{R}}$ in Eq. (5) has elements:

$$\alpha_g^* = \frac{(\gamma_g\sigma_g^2)^{1/(p+1)}}{\sum_{g'\in\mathcal{G}}(\gamma_{g'}\sigma_{g'}^2)^{1/(p+1)}}. \qquad (6)$$

If $\sigma_g = 0$ for all $g\in\mathcal{G}$, then any allocation in $\Delta^{|\mathcal{G}|}$ minimizes $\hat{\mathcal{R}}$.

The proof of Proposition 2 appears in Section A.2. Note that $\vec{\alpha}^*$ does not depend on $n$, $\{\tau_g\}_{g\in\mathcal{G}}$, or $q$; this will in general not hold if the powers $p_g$ differ by group.
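Since Eq. (6) is in closed form, computing $\vec{\alpha}^*$ is a one-liner; the helper below transcribes it directly (the docstring example reproduces the counterexample in the remark below Corollary 1).

```python
import numpy as np

def optimal_allocation(gamma, sigma2, p):
    """Population-risk-minimizing allocation from Eq. (6).

    gamma: array of population prevalences gamma_g.
    sigma2: array of per-group constants sigma_g^2 from the scaling law.
    p: shared scaling exponent.
    Example (remark below Corollary 1):
        optimal_allocation(np.array([.68, .30, .01, .01]), np.ones(4), p=1)
    gives alpha_2* ~= 0.348 > gamma_2 = 0.30.
    """
    unnormalized = (gamma * sigma2) ** (1.0 / (p + 1))
    return unnormalized / unnormalized.sum()
```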

We now study the form of $\vec{\alpha}^*$ under illustrative settings. Corollary 1 shows that when the group scaling parameters $\sigma_g$ in Eq. (4) are equal across groups, the allocation that minimizes the approximate population risk allocates samples to minority groups at higher rates than their population prevalences. The proof of Corollary 1 appears in Section A.3.

Corollary 1 (Many groups with equal $\sigma_g$).

When $\sigma_g = \sigma > 0$ for all $g\in\mathcal{G}$, the allocation that minimizes $\hat{\mathcal{R}}$ in Eq. (5) satisfies $\alpha_g^* \geq \gamma_g$ for any group with $\gamma_g \leq \frac{1}{|\mathcal{G}|}$.

This shows that the allocation that minimizes population risk can differ from the actual population prevalences $\vec{\gamma}$. In fact, Corollary 1 asserts that near the allocation $\vec{\alpha}=\vec{\gamma}$, the marginal returns to additional data from group $g$ are largest for groups with small $\alpha_g$, enough so as to offset the small weight $\gamma_g$ in Eq. (1). This result provides evidence against the idea that a small training set allocation to minority groups might comply with minimizing population risk as a result of a small relative contribution to the population risk.

Remark. A counterexample shows that $\alpha_g^* \leq \gamma_g$ does not hold for all $g$ with $\gamma_g > 1/|\mathcal{G}|$. Take $\vec{\gamma} = [.68, .30, .01, .01]$ and $p=1$; Eq. (6) gives $\alpha_2^* > 0.3 = \gamma_2 > 1/4$. In general, whether a group $g$ with $\gamma_g \geq 1/|\mathcal{G}|$ gets up- or down-sampled depends on the distribution of $\vec{\gamma}$ across all groups.

Complementing the investigation of the role of the population proportions $\vec{\gamma}$ in Corollary 1, the next corollary shows that the optimal allocation $\vec{\alpha}^*$ generally depends on the relative values of $\sigma_g$ between groups. Inspecting Eq. (4) shows that $\sigma_g$ defines a limit of performance: if $\sigma_g^2$ is large, the only way to make the approximate risk for group $g$ small is to make $n_g$ large. For two groups, we can bound the optimal allocations $\vec{\alpha}^*$ in terms of $\{\sigma_g\}_{g\in\mathcal{G}}$ and the population proportions $\vec{\gamma}$. Without loss of generality, we let $A$ be the smaller of the two groups. From Eq. (6), we know that for two groups, $\alpha^*_A$ is increasing in $\tfrac{\sigma_A}{\sigma_B}$; Corollary 2 gives upper and lower bounds on $\alpha^*_A$ in terms of $\sigma_A$ and $\sigma_B$. Corollary 2 is proved in Section A.4.

Corollary 2 (Unequal per-group constants).

For two groups $\{A,B\}=\mathcal{G}$ with $\gamma_A < \gamma_B$ and parameters $\sigma_A,\sigma_B > 0$ in Eq. (4), the allocation $\alpha^*_A$ of the smaller group that minimizes $\hat{\mathcal{R}}$ in Eq. (5) is upper and lower bounded as

$$\frac{\gamma_A(\sigma_A^2)^{1/(p+1)}}{\gamma_A(\sigma_A^2)^{1/(p+1)} + \gamma_B(\sigma_B^2)^{1/(p+1)}} \;<\; \alpha^*_A \;<\; \frac{(\sigma_A^2)^{1/(p+1)}}{(\sigma_A^2)^{1/(p+1)} + (\sigma_B^2)^{1/(p+1)}}.$$

When $\sigma_A \geq \sigma_B$, $\alpha_A^* > \gamma_A$; and when $\sigma_A \leq \sigma_B$, $\alpha_A^* < 1/2$.

Altogether, these results highlight key properties of training set allocations that minimize population risk. Experiments in Section 4 give further insight into the values of weights and allocations for minimizing group and population risks and apply the scaling law model in real data settings.

4 Empirical Results

Having shown the importance of training set allocations from a theoretical perspective, we now provide a complementary empirical investigation of this phenomenon. Throughout our experiments, we use a diverse collection of datasets to give as full a picture of the empirical phenomena as possible (Section 4.1 and Table 1). See Appendix B for full details on each experimental setup.²

²Code to replicate the experiments is available at https://github.com/estherrolf/representation-matters.

The first set of experiments (Section 4.2) investigates group and population accuracies as a function of the training set allocation $\vec{\alpha}$, by sub-sampling different training set allocations for a fixed training set size. We also study the extent to which importance weighting and group distributionally robust optimization can increase group accuracies, complementing the results of Proposition 1 with an empirical perspective. The second set of experiments (Section 4.3) uses a similar subsetting procedure to examine the fit of the scaling model proposed in Section 3. The third set of experiments (Section 4.4) investigates using the estimated scaling law fits to inform future sampling practices: we simulate using a small pilot training set to inform targeted dataset augmentation through collecting additional samples. In Section 4.5, we probe a setting where we might expect the scaling law to be too simplistic, exposing the need for more nuanced modeling in such settings. In contrast to Section 2, here losses are defined over sets of data; note that AUROC is not separable over groups, and thus Eq. (1) does not apply for this metric.

4.1 Datasets

Here we give a brief description of each dataset we use. For more details, see Tables 1 and B.1.

Modified CIFAR-4. To create a dataset where we can ensure class balance across groups, we modify the CIFAR-10 dataset (Krizhevsky, 2009) by subsetting to the bird, car, horse, and plane classes. We predict whether the image subject moves primarily by air (plane/bird) or by land (car/horse), and group by whether the image contains an animal (bird/horse) or a vehicle (car/plane); see Figure B.1. We set the population prevalence of the vehicle group to $\gamma = 0.9$.

ISIC. The International Skin Imaging Collaboration dataset is a set of labeled images of skin lesions designed to aid development and standardization of imaging procedures for automatic detection of melanomas (Codella et al., 2019). For our main analysis, we follow preprocessing steps similar to those of Sohoni et al. (2020), removing any images with patches. We predict whether a lesion is benign or malignant, and group instances by the approximate age of the patient in each image.

Goodreads. Given publicly available book reviews compiled from the Goodreads database (Wan & McAuley, 2018), we predict the rating (1-5) corresponding to each review. We group instances by genre.

Mooc. The HarvardX-MITx Person-Course Dataset contains student demographic and activity data from 2013 offerings of online courses (HarvardX, 2014). We predict whether a student earned a certification in the course and we group instances by highest completed level of education.

Adult. The Adult dataset, downloaded from the UCI Machine Learning Repository (Dua & Graff, 2017) and originally taken from the 1994 Census database, has a prediction task of whether an individual’s annual income is over \$50,000. We group instances by sex, codified as binary male/female, and exclude features that directly determine group status.

Table 1: Brief description of datasets; details are given in Section B.1.

| dataset | groups $\{A,B\}$ | $\gamma_A$ | $\min_g n_g$ | $n_{\text{test}}$ | target label | loss metric | main model used |
|---|---|---|---|---|---|---|---|
| CIFAR-4 | {animal, vehicle} | 0.1 | 10,000 | 4,000 | air/land | 0/1 loss | resnet-18 |
| ISIC | {age $\geq 55$, age $< 55$} | 0.43 | 4,092 | 2,390 | benign/malignant | 1 − AUROC | resnet-18 |
| Goodreads | {history, fantasy} | 0.38 | 50,000 | 25,000 | book rating (1–5) | $\ell_1$ loss | logistic regression |
| Mooc | {edu $\leq 2^\circ$, edu $> 2^\circ$} | 0.16 | 3,897 | 6,032 | certified | 1 − AUROC | random forest |
| Adult | {female, male} | 0.5 | 10,771 | 16,281 | income > \$50K | 0/1 loss | random forest |

4.2 Allocation-aware Objectives vs. Ideal Allocations

Figure 1: Performance across $\vec{\alpha}$. Shaded regions denote one standard deviation above and below the mean over 10 trials. Stars indicate population minima for each objective (ERM: white, IS/IW: grey, GDRO: black).

We first investigate (a) the change in group and population performance at different training set allocations, and (b) the extent to which optimizing the three objective functions defined in Section 2.2 decreases average and group errors.

For each dataset, we vary the training set allocation $\vec{\alpha}$ between $(0,1)$ and $(1,0)$ while fixing the training set size at $n = \min_g n_g$ (see Table 1), and evaluate the per-group and population losses on subsets of the held-out test sets. For the image classification tasks, we compare group-agnostic empirical risk minimization (ERM) to importance weighting (implemented via importance sampling (IS) of batches, following the findings of Buda et al. (2018)) and group distributionally robust optimization (GDRO) with group-dependent regularization, as described in Sagawa et al. (2020). For the non-image datasets, we implement importance weighting (IW) by weighting instances in the loss function during training, and do not compare to GDRO, as the gradient-based algorithm of Sagawa et al. (2020) is not easily adaptable to the predictors we use for these datasets. We pick hyperparameters for each method based on cross-validation results over a coarse grid of $\vec{\alpha}$; for IS, IW, and GDRO, we allow the hyperparameters to vary with $\vec{\alpha}$; for ERM we choose a single hyperparameter setting for all $\vec{\alpha}$ values.

Figure 1 highlights the importance of at least a minimal representation of each group in order to achieve low population loss (black curves) for all objectives. For CIFAR-4, the population loss increases sharply for $\alpha_A < 0.1$ and $\alpha_A > 0.8$; for ISIC, it does so when $\alpha_A < 0.2$. While not as crucial for achieving low population losses for the remaining datasets, the optimal allocations $\vec{\alpha}^*$ (stars) do require a minimal representation of each group. The $\vec{\alpha}^*$ are largely consistent across the training objectives (different star colors), and the population losses (black curves) are largely consistent across mid-range values of $\alpha_A$ for all training objectives. The relatively shallow slopes of the black curves for $\alpha_A$ near $\alpha^*_A$ (stars) stand in contrast to the per-group losses (blue and orange curves), which can vary considerably as $\vec{\alpha}$ changes. From the perspective of model evaluation, this reinforces a well-documented need for more comprehensive reporting of performance. From the view of dataset design, it exposes an opportunity to choose allocations that optimize diverse evaluation objectives while maintaining low population loss. Experiments in Section 4.4 investigate this further.

Across the CIFAR-4 and ISIC tasks, GDRO (dotted curves) is more effective than IS (dashed curves) at reducing per-group losses. This is expected, as minimizing the largest loss of any group is the explicit objective of GDRO. Figure 1 shows that GDRO can also improve the population loss (see $\alpha_A > 0.7$ for CIFAR-4 and $\alpha_A < 0.2$ for ISIC) as a result of minimizing the worst group loss. Importance weighting (dashed curves) has little effect on performance for Mooc and Adult (random forest models), and actually increases the loss for Goodreads (multiclass logistic regression model).

For all the datasets we study, the advantages of using IS or GDRO are greatest when one group has a very small training set allocation ($\alpha_A$ near 0 or 1). When allocations are optimized (stars in Figure 1), the boost that these methods give over ERM diminishes. In light of Proposition 1, these results suggest that in practice, part of the value of such methods lies in compensating for sub-optimal allocations. We find, however, that explicitly optimizing the maximum per-group loss with GDRO can reduce population loss more effectively than directly accounting for allocations with IS.

Section C.2 shows that similar phenomena hold for different loss functions and models on the same dataset, though the exact $\vec{\alpha}^*$ can differ. In Section C.1, we show that losses of groups with small $\alpha_g$ can degrade more severely when group attributes are included in the feature matrix, likely a result of the small number of samples from which to learn group-specific model components (see Example 3).

4.3 Assessing the Scaling Law Fits

Table 2: Estimated scaling parameters for Eq. (7). Parentheses denote standard deviations estimated by the nonlinear least squares fit. Parameters are constrained so that $\hat{\tau}_g,\hat{\sigma}_g,\hat{\delta}_g \geq 0$ and $\hat{p}_g,\hat{q}_g \in [0,2]$.

| dataset | $M_g$ | group $g$ | $\hat{\sigma}_g$ | $\hat{p}_g$ | $\hat{\tau}_g$ | $\hat{q}_g$ | $\hat{\delta}_g$ |
|---|---|---|---|---|---|---|---|
| CIFAR-4 | 500 | animal | 1.9 (0.12) | 0.47 (9.8e-04) | 4.5e-09 (1.8e+06) | 2.0 (0.0e+00) | 1.1e-03 (8.9e-06) |
| | | vehicle | 1.6 (0.19) | 0.54 (2.0e-03) | 3.2e-12 (1.1e+06) | 2.0 (0.0e+00) | 1.4e-03 (2.8e-06) |
| ISIC | 200 | age $\geq 55$ | 0.61 (1.7e-03) | 0.20 (1.1e-03) | 1.7e-09 (1.9e+04) | 1.9 (0.0e+00) | 1.4e-15 (6.1e-04) |
| | | age $< 55$ | 0.26 (9.3e-04) | 0.13 (0.012) | 0.61 (0.044) | 0.3 (7.5e-03) | 7.5e-11 (7.2e-03) |
| Goodreads | 2500 | history | 0.16 (1.2e-03) | 0.074 (2.5e-03) | 2.5 (0.058) | 0.37 (2.0e-04) | 0.41 (3.0e-03) |
| | | fantasy | 0.62 (0.69) | 0.020 (1.2e-03) | 3.1 (0.093) | 0.39 (1.9e-04) | 7.2e-21 (0.72) |
| Mooc | 50 | edu $\leq 2^\circ$ | 0.08 (2.6e-05) | 0.14 (6.0e-03) | 0.73 (0.059) | 0.63 (4.8e-03) | 1.3e-15 (2.6e-04) |
| | | edu $> 2^\circ$ | 0.038 (6.2e-04) | 0.068 (6.3e-03) | 0.54 (6.5e-03) | 0.61 (9.8e-04) | 2.8e-12 (8.0e-04) |
| Adult | 50 | female | 0.078 (0.051) | 0.018 (3.6e-03) | 0.43 (8.3e-03) | 0.59 (1.6e-03) | 8.0e-16 (0.052) |
| | | male | 0.066 (2.6e-05) | 0.21 (1.2e-03) | 0.47 (6.5e-03) | 0.50 (1.1e-03) | 0.16 (5.4e-06) |

For each dataset, we combine the results in Figure 1 with extra subsetting runs in which we vary both $n_g$ and $n$. From the combined results, we use nonlinear least squares to estimate the parameters of modified scaling laws in which the exponents can differ by group:

$$\text{loss}_g \approx \sigma_g^2 n_g^{-p_g} + \tau_g^2 n^{-q_g} + \delta_g. \qquad (7)$$
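A minimal version of this fitting step can be written with scipy.optimize.curve_fit, as sketched below for a single group, with the same sign and box constraints reported in Table 2. The initialization and other fitting details here are illustrative assumptions; full details of our procedure are in Appendix B.

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(nn, sigma, p, tau, q, delta):
    """Eq. (7): loss_g ~ sigma^2 * n_g^(-p) + tau^2 * n^(-q) + delta."""
    n_g, n = nn
    return sigma**2 * n_g**(-p) + tau**2 * n**(-q) + delta

def fit_group_scaling(n_g, n, losses):
    """Fit the per-group scaling law by nonlinear least squares.

    n_g, n, losses: arrays of observed (group size, total size, loss) triples.
    """
    params, cov = curve_fit(
        scaling_law, (n_g, n), losses,
        p0=[1.0, 0.5, 1.0, 0.5, 1e-3],
        # Constraints matching Table 2: sigma, tau, delta >= 0; p, q in [0, 2].
        bounds=([0, 0, 0, 0, 0], [np.inf, 2, np.inf, 2, np.inf]),
        maxfev=20000,
    )
    std_errs = np.sqrt(np.diag(cov))  # std. devs. as reported in Table 2
    return params, std_errs
```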

The estimated parameters of Eq. (7), given in Table 2, capture different high-level phenomena across the five datasets. For CIFAR-4, $\hat{\tau}_g \approx 0$ for both groups, indicating that most of the group performance is explained by $n_g$, the number of training instances from that group, whereas the total number of data points $n$ has less influence. For Goodreads, both $n_g$ and $n$ have influence in the fitted model, though $\hat{\tau}_g$ and $\hat{q}_g$ are larger than $\hat{\sigma}_g$ and $\hat{p}_g$, respectively. For ISIC, $\hat{\tau}_A \approx 0$ but $\hat{\tau}_B \not\approx 0$, suggesting that other-group data has little effect on the first group but is beneficial to the latter. For the non-image datasets (Goodreads, Mooc, and Adult), $0 < \hat{\sigma}_g < \hat{\tau}_g$ and $\hat{p}_g < \hat{q}_g$ for all groups.

These results shed light on the applicability of the assumptions made in Section 3. Figure B.2 in Section B.5 shows that the fitted curves capture the overall trends of per-group losses as a function of $n$ and $n_g$. However, the assumptions of Proposition 2 and Corollaries 1 and 2 (e.g., equal $p_g$ for all $g\in\mathcal{G}$) are not always reflected in the empirical fits. The results in Section 3 use Eq. (4) to describe optimal allocations under different hypothetical settings; we find that allowing the scaling parameters to vary by group, as in Eq. (7), is more realistic in empirical settings.

The estimated models describe the overall trends (Figure B.2), but the parameter estimates are variable (Table 2), indicating that a range of parameters can fit the data, a well-known phenomenon in fitting power laws to data (Clauset et al., 2009). While we caution against absolute or prescriptive interpretations based on the estimates given in Table 2, if such interpretations are desired (Chen et al., 2018), we suggest evaluating variation due to subsetting patterns and comparing to alternative models such as log-normal and exponential fits (cf. Clauset et al., 2009).

Figure 2: Pilot sample experiment. Panels show the results of the three allocations $\vec{\alpha} \in \{\hat{\alpha}^*_{\text{minmax}}, \vec{\gamma}, (1/2,1/2)\}$ for different sizes of the new training sets, compared with an $\alpha^*_{\text{grid}}$ baseline that minimizes the maximum group loss over a grid of resolution 0.01, averaged over the random trials. Purple circles indicate average maximum error over groups and grey diamonds indicate average population error. Ranges denote standard errors taken over the 10 trials.

4.4 Targeted Data Collection with Fitted Scaling Laws

We now study the use of scaling models fitted on a small pilot dataset to inform a subsequent data collection effort. Given the results of Section 4.2, we aim to collect a training set that minimizes the maximum loss on any group. This procedure goes beyond the descriptive use of the estimated scaling models in Section 4.3; important considerations for operationalizing these findings are discussed below.

We perform this experiment with the Goodreads dataset, the largest of the five we study. The pilot sample contains 2,500 instances from each group, drawn at random from the full training set. We estimate the parameters of Eq. (7) using a procedure similar to that described in Section 4.3. For a new training set of size $n_{\text{new}}$, we suggest an allocation that minimizes the maximum forecasted loss of any group:

$$\hat{\alpha}^*_{\text{minmax}} = \operatorname*{argmin}_{\vec{\alpha}\in\Delta^2}\; \max_{g\in\mathcal{G}} \left(\hat{\sigma}_g^2(\alpha_g n_{\text{new}})^{-\hat{p}_g} + \hat{\tau}_g^2\, n_{\text{new}}^{-\hat{q}_g} + \hat{\delta}_g\right).$$

For $n_{\text{new}} \in [1\times, 2\times, 4\times, 8\times]$ the pilot sample size, we simulate collecting a new training set by drawing $n_{\text{new}}$ fresh samples from the training set with allocation $\vec{\alpha} = \hat{\alpha}^*_{\text{minmax}}(n_{\text{new}})$. We train a model on this sample (ERM objective) and evaluate on the test set. For comparison, we also sample at $\vec{\alpha} = \vec{\gamma}$ (population proportions) and $\vec{\alpha} = (0.5, 0.5)$ (equal allocation to both groups). We repeat the experiment, starting with the random instantiation of the pilot dataset, for ten trials. As a point of comparison, we also compute the results for all $\vec{\alpha}$ in a grid of resolution 0.01, and denote the allocation value in this grid that minimizes the average maximum group loss over the ten trials as $\alpha^*_{\text{grid}}$.
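For two groups, the minmax problem defining $\hat{\alpha}^*_{\text{minmax}}$ is one-dimensional over $\alpha_A$, so a bounded scalar search suffices. The sketch below uses SciPy and assumes fitted parameter tuples for each group; the boundary padding `eps` is an illustrative choice to keep $n_g > 0$.

```python
from scipy.optimize import minimize_scalar

def forecast_loss(n_g, n, params):
    """Forecasted group loss under fitted Eq. (7) parameters."""
    sigma2, p, tau2, q, delta = params
    return sigma2 * n_g**(-p) + tau2 * n**(-q) + delta

def minmax_allocation(params_A, params_B, n_new, eps=1e-4):
    """alpha_hat*_minmax for two groups: minimize the max forecasted loss."""
    def max_group_loss(alpha_A):
        return max(forecast_loss(alpha_A * n_new, n_new, params_A),
                   forecast_loss((1 - alpha_A) * n_new, n_new, params_B))
    res = minimize_scalar(max_group_loss, bounds=(eps, 1 - eps), method="bounded")
    return res.x  # estimated alpha_A; alpha_B = 1 - alpha_A
```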

Among the three allocation strategies we compare, $\vec{\alpha} = \hat{\alpha}^*_{\text{minmax}}$ minimizes the average maximum loss over groups, across $n_{\text{new}}$ (Figure 2). Since per-group losses generally decrease with increased allocation to that group, we expect the best minmax loss over groups to be achieved when the purple and grey bars meet in Figure 2. The allocation strategy $\vec{\alpha} = \hat{\alpha}^*_{\text{minmax}}$ does not quite achieve this; however, it does not increase the population loss over that of the other allocation strategies. This reinforces the finding of Section 4.2 that different per-group losses can be reached at similar population losses, and provides evidence that we can navigate these possible outcomes by leveraging information from a small initial sample.

While the results in Figure 2 are promising, error bars highlight the variation across trials. The variability in performance across trials for the allocation baseline $\alpha^*_{\text{grid}}$ (which is kept constant across the ten trials) is largely consistent with that of the other allocation sampling strategies examined (standard errors in Figure 2). However, the estimation of $\hat{\alpha}^*$ in each trial does introduce additional variation: across the ten draws of the pilot data, the range of $\hat{\alpha}^*$ values for subsequent dataset size $n_{\text{new}} = 5{,}000$ is [2e-04, 0.04]; for $n_{\text{new}} = 10{,}000$ it is [1e-04, 0.05]; for $n_{\text{new}} = 20{,}000$ it is [5e-05, 0.14]; and for $n_{\text{new}} = 40{,}000$ it is [2e-05, 0.82]. Therefore, the estimated $\hat{\alpha}^*$ should be leveraged with caution, especially if the subsequent sample will be much larger than the pilot sample. Further caution should be taken if there may be distribution shifts between the pilot and subsequent samples. We suggest interpreting estimated $\hat{\alpha}^*$ values as one signal among many that can inform dataset design, in conjunction with current and emerging practices for ethical data collection (see Section 5).

4.5 Interactions Between Groups

We now shift the focus of our analysis to explore potential between-group and within-group interactions that are more nuanced than the scaling law in Eq. (4) provides for. The results highlight the need for, and encourage, future work extending our analysis to more complex notions of groups (e.g., intersectional, continuous, or proxy groups).

As discussed in Section 3, data from groups similar to or different from group $g$ may have greater effect on $\mathcal{R}(\hat{f}(\mathcal{S});\mathcal{D}_g)$ than data drawn at random from the entire distribution. We examine this possibility on the ISIC dataset, which is aggregated from different studies (Section B.1). We measure baseline performance of the model trained on data from all of the studies. We then remove one study at a time from the training set, retrain the model, and evaluate the change in performance for all studies in the test set.

Figure 3(a) shows the percent changes in performance due to leaving out studies from the training set. The MSK and UDA studies comprise 5 and 2 sub-studies, respectively; Figure 3(b) shows the results of leaving out each sub-study. Rows correspond to the study withheld from the training set and columns correspond to the study used for evaluation. Rows and columns are ordered by % malignancy; for Figure 3(a) this is the same as ordering by dataset size, SONIC being the largest study.

Consistent with our modelling assumptions and results so far, accuracies evaluated on group gg decrease as a result of removing group gg from the training set (diagonal entries of Figure 3). However, additional patterns show more nuanced relationships between groups.

Positive values in the upper-right regions of Figures 3(a) and 3(b) show that excluding studies with low malignancy rates can raise performance evaluated on studies with high malignancy rates. This could be partially due to differences in label distributions when certain studies are removed from the training data. Importantly, this provides a counterpoint to an assumption implicit in Assumption 1: that group risks decrease in the total training set size $n$, regardless of the groups to which these $n$ instances belong. To study more nuanced interactions between pairs of groups $g' \neq g$, future work could modify Eq. (4) by reparameterizing $r(\cdot)$ to directly account for $n_{g'}$.

Figure 3: Percent change in performance (AUROC / accuracy) due to withholding a study from the training set; panel (a) groups by study, panel (b) by sub-study. Studies are ordered by % malignancy of the data in the evaluation set. SONIC and MSK-5 contain all benign instances and the 2018 JID Editorial Images dataset contains all malignant instances; for these we report % change in binary accuracy. For the remaining groups, we report % change in AUROC. Note that the random training/test splits differ between (a) and (b), accounting for the differences in values for corresponding cells between the two figures.

Grouping by sub-studies within the UDA and MSK studies reveals that even within well-defined groups, interactions between subgroups can arise. Negative off-diagonal entries in Figure 3(b) suggest strong interactions between different groups, underscoring the importance of evaluating results across hierarchies and intersections of groups when feasible.

Of the 16,965 images in the full training set, 7,401 are from the SONIC study. When evaluating on all non-SONIC instances (i.e., the evaluation set used in the rest of the paper), withholding the SONIC study from the training set (akin to the training set used in the rest of the paper) leads to higher AUROC (0.905) than training on all studies (0.890). This demonstrates that more data is not always better, especially if distributional differences between the additional data and the target population are not well accounted for.

5 Generalizations, limitations, and future work

We study the ways in which group and population performance depend on the numerical allocations of discrete groups in training sets. While focusing on discrete groups allows us to derive meaningful results, understanding similar phenomena for intersectional groups and continuous notions of inclusion is an important next step. Addressing the more nuanced relationships between the allocations of different data sources (Section 4.5) is a first step in this direction.

We find that underrepresentation of groups in training data can limit group and population accuracies. However, assuming we can easily and ethically collect more data about any group is often naive, as there may be unintended consequences of upweighting certain groups in an objective function. Naive targeted data collection attempts can present undue burdens of surveillance or skirt consent (Paullada et al., 2020). When ML systems fail to represent subpopulations due to measurement or construct validity issues, more comprehensive interventions are needed (Jacobs & Wallach, 2019).

Our results expose key properties of sub-group representation in training data from a statistical sampling perspective, complementary to current and emerging practices for ethical, contextualized data collection and curation (Gebru et al., 2018; Gebru, 2020; Denton et al., 2020; Abebe et al., 2021). Studying the role of numerical allocation targets within ethical and context-aware data collection practices will be an important step toward operationalizing our findings.

Representation is a broad and often ambiguous concept (Chasalow & Levy, 2021), and numerical allocation is an imperfect proxy of representation or inclusion. For example, if the data annotation process systematically misrepresents certain groups, optimizing allocations to maximize accuracy with respect to those labels would not reflect our true goals across all groups. If prediction models are tailored to majority groups and are less relevant for smaller groups, an optimized allocation might allocate more data points to smaller groups as a remedy, when in reality a better model or feature representation is preferable. True representation thus requires that each data instance measures the intended variables and their complexities in addition to numerical allocations. That said, if the optimal allocation for a certain group is well beyond that group’s population proportion, this may be cause to reflect on why that is the case. Future work could contextualize allocations as a way of auditing the limits of machine learning systems from a data-focused perspective.

By incorporating dataset collection as a design step in the learning procedure, we were able to assess and leverage the value of different data allocations toward reaching high group and population accuracies. Extending this framework to other objectives and loss functions (e.g., robustness for out-of-distribution prediction and fairness objectives) will be an important area of future work.

6 Conclusions

We demonstrate that representation in data is fundamental to training machine learning models that work for the entire population of interest. By casting dataset design as part of the learning procedure, we can formalize the characteristics that training data must satisfy in order to reach the objectives and specifications of the overall machine learning system. Empirical results bolster our theoretical findings and explore the nuances of real-data phenomena that call for domain-dependent analyses in order to operationalize our general results in specific contexts. Overall, we provide insight and constructive results toward understanding, prioritizing, and leveraging conscientious data collection for successful applications of machine learning.

Acknowledgments

We thank Inioluwa Deborah Raji and Ludwig Schmidt for feedback at various stages of this work, and Andrea Bajcsy, Sara Fridovich-Keil, and Max Simchowitz for comments and suggestions during the editing of this manuscript. We thank Nimit Sohoni and Jared Dunnmon for helpful discussions regarding the ISIC dataset.

This material is based upon work supported by the NSF Graduate Research Fellowship under Grant No. DGE 1752814. ER acknowledges the support of a Google PhD Fellowship. This research is generously supported in part by ONR awards N00014-20-1-2497 and N00014-18-1-2833, NSF CPS award 1931853, and the DARPA Assured Autonomy program (FA8750-18-C-0101).

References

  • Abebe et al. (2021) Abebe, R., Aruleba, K., Birhane, A., Kingsley, S., Obaido, G., Remy, S. L., and Sadagopan, S. Narratives and counternarratives on data sharing in Africa. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 329–341, 2021.
  • Abernethy et al. (2020) Abernethy, J., Awasthi, P., Kleindessner, M., Morgenstern, J., and Zhang, J. Adaptive sampling to reduce disparate performance. arXiv preprint arXiv:2006.06879, 2020.
  • Boucheron et al. (2005) Boucheron, S., Bousquet, O., and Lugosi, G. Theory of classification: A survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.
  • Buda et al. (2018) Buda, M., Maki, A., and Mazurowski, M. A. A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks, 106:249–259, 2018.
  • Buolamwini & Gebru (2018) Buolamwini, J. and Gebru, T. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency, pp.  77–91, 2018.
  • Byrd & Lipton (2019) Byrd, J. and Lipton, Z. C. What is the effect of importance weighting in deep learning? In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pp.  872–881. PMLR, 2019. URL http://proceedings.mlr.press/v97/byrd19a.html.
  • Chasalow & Levy (2021) Chasalow, K. and Levy, K. Representativeness in statistics, politics, and machine learning. arXiv preprint arXiv:2101.03827, 2021.
  • Chawla et al. (2002) Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. SMOTE: Synthetic minority over-sampling technique. Journal of artificial intelligence research, 16:321–357, 2002.
  • Chen et al. (2018) Chen, I. Y., Johansson, F. D., and Sontag, D. A. Why is my classifier discriminatory? In Bengio, S., Wallach, H. M., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pp.  3543–3554, 2018. URL https://proceedings.neurips.cc/paper/2018/hash/1f1baa5b8edac74eb4eaa329f14a0361-Abstract.html.
  • Clauset et al. (2009) Clauset, A., Shalizi, C. R., and Newman, M. E. Power-law distributions in empirical data. SIAM Review, 51(4):661–703, 2009.
  • Codella et al. (2019) Codella, N., Rotemberg, V., Tschandl, P., Celebi, M. E., Dusza, S., Gutman, D., Helba, B., Kalloo, A., Liopyris, K., Marchetti, M., et al. Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the international skin imaging collaboration (ISIC). arXiv preprint arXiv:1902.03368, 2019.
  • Codella et al. (2017) Codella, N. C. F., Gutman, D., Celebi, M. E., Helba, B., Marchetti, M. A., Dusza, S. W., Kalloo, A., Liopyris, K., Mishra, N. K., Kittler, H., and Halpern, A. Skin lesion analysis toward melanoma detection: A challenge at the 2017 international symposium on biomedical imaging (ISBI), hosted by the international skin imaging collaboration (ISIC). arXiv preprint, 2017.
  • Cole et al. (2013) Cole, R., Gkatzelis, V., and Goel, G. Mechanism design for fair division: Allocating divisible items without payments. In Proceedings of the Fourteenth ACM Conference on Electronic Commerce, EC ’13, pp.  251–268, New York, NY, USA, 2013. Association for Computing Machinery.
  • Cortes et al. (2010) Cortes, C., Mansour, Y., and Mohri, M. Learning bounds for importance weighting. In Lafferty, J. D., Williams, C. K. I., Shawe-Taylor, J., Zemel, R. S., and Culotta, A. (eds.), Advances in Neural Information Processing Systems 23: 24th Annual Conference on Neural Information Processing Systems 2010. Vancouver, British Columbia, Canada, pp.  442–450. Curran Associates, Inc., 2010. URL https://proceedings.neurips.cc/paper/2010/hash/59c33016884a62116be975a9bb8257e3-Abstract.html.
  • Denton et al. (2020) Denton, E., Hanna, A., Amironesei, R., Smart, A., Nicole, H., and Scheuerman, M. K. Bringing the people back in: Contesting benchmark machine learning datasets. arXiv preprint arXiv:2007.07399, 2020.
  • Dotan & Milli (2020) Dotan, R. and Milli, S. Value-laden disciplinary shifts in machine learning. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp.  294–294, 2020.
  • Dua & Graff (2017) Dua, D. and Graff, C. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
  • Dwork et al. (2018) Dwork, C., Immorlica, N., Kalai, A. T., and Leiserson, M. Decoupled classifiers for group-fair and efficient machine learning. In Conference on Fairness, Accountability and Transparency, pp.  119–133, 2018.
  • Gebru (2020) Gebru, T. Lessons from archives: Strategies for collecting sociocultural data in machine learning. In Gupta, R., Liu, Y., Tang, J., and Prakash, B. A. (eds.), KDD ’20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23-27, 2020, pp.  3609. ACM, 2020. URL https://dl.acm.org/doi/10.1145/3394486.3409559.
  • Gebru et al. (2018) Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., and Crawford, K. Datasheets for datasets. arXiv preprint arXiv:1803.09010, 2018.
  • Ghorbani & Zou (2019) Ghorbani, A. and Zou, J. Y. Data shapley: Equitable valuation of data for machine learning. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pp.  2242–2251. PMLR, 2019. URL http://proceedings.mlr.press/v97/ghorbani19c.html.
  • Habib et al. (2019) Habib, A., Karmakar, C., and Yearwood, J. Impact of ECG dataset diversity on generalization of CNN model for detecting QRS complex. IEEE Access, 7:93275–93285, 2019.
  • Haixiang et al. (2017) Haixiang, G., Yijing, L., Shang, J., Mingyun, G., Yuanyue, H., and Bing, G. Learning from class-imbalanced data: Review of methods and applications. Expert Systems with Applications, 73:220–239, 2017.
  • HarvardX (2014) HarvardX. HarvardX Person-Course Academic Year 2013 De-Identified dataset, version 3.0, 2014. URL https://doi.org/10.7910/DVN/26147.
  • Hashimoto et al. (2018) Hashimoto, T. B., Srivastava, M., Namkoong, H., and Liang, P. Fairness without demographics in repeated loss minimization. In Dy, J. G. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pp.  1934–1943. PMLR, 2018. URL http://proceedings.mlr.press/v80/hashimoto18a.html.
  • Hofmanninger et al. (2020) Hofmanninger, J., Prayer, F., Pan, J., Röhrich, S., Prosch, H., and Langs, G. Automatic lung segmentation in routine imaging is primarily a data diversity problem, not a methodology problem. European Radiology Experimental, 4(1):1–13, 2020.
  • Hu et al. (2018) Hu, W., Niu, G., Sato, I., and Sugiyama, M. Does distributionally robust supervised learning give robust classifiers? In Dy, J. G. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pp.  2034–2042. PMLR, 2018. URL http://proceedings.mlr.press/v80/hu18a.html.
  • Iosifidis & Ntoutsi (2018) Iosifidis, V. and Ntoutsi, E. Dealing with bias via data augmentation in supervised learning scenarios. In Proceedings of the International Workshop on Bias in Information, Algorithms, and Systems, pp. 24–29, 2018.
  • Jacobs & Wallach (2019) Jacobs, A. Z. and Wallach, H. Measurement and fairness. arXiv preprint arXiv:1912.05511, 2019.
  • Kim et al. (2019) Kim, M. P., Ghorbani, A., and Zou, J. Multiaccuracy: Black-box post-processing for fairness in classification. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pp.  247–254, 2019.
  • Krizhevsky (2009) Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
  • Lohr (2009) Lohr, S. L. Sampling: Design and Analysis. Nelson Education, 2009.
  • Neyman (1934) Neyman, J. On the two different aspects of the representative method: The method of stratified sampling and the method of purposive selection. Journal of the Royal Statistical Society, 97(4):558–625, 1934.
  • Paullada et al. (2020) Paullada, A., Raji, I. D., Bender, E. M., Denton, E., and Hanna, A. Data and its (dis) contents: A survey of dataset development and use in machine learning research. arXiv preprint arXiv:2012.05345, 2020.
  • Pukelsheim (2006) Pukelsheim, F. Optimal design of experiments. Classics in applied mathematics ; 50. Society for Industrial and Applied Mathematics, classic ed. edition, 2006.
  • Raji & Buolamwini (2019) Raji, I. D. and Buolamwini, J. Actionable auditing: Investigating the impact of publicly naming biased performance results of commercial AI products. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pp. 429–435, 2019.
  • Ryu et al. (2017) Ryu, H. J., Adam, H., and Mitchell, M. InclusiveFaceNet: Improving face attribute detection with race and gender diversity. arXiv preprint arXiv:1712.00193, 2017.
  • Sagawa et al. (2020) Sagawa, S., Koh, P. W., Hashimoto, T. B., and Liang, P. Distributionally robust neural networks. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020. URL https://openreview.net/forum?id=ryxGuJrFvS.
  • Shankar et al. (2017) Shankar, S., Halpern, Y., Breck, E., Atwood, J., Wilson, J., and Sculley, D. No classification without representation: Assessing geodiversity issues in open data sets for the developing world. arXiv preprint arXiv:1711.08536, 2017.
  • Sohoni et al. (2020) Sohoni, N., Dunnmon, J., Angus, G., Gu, A., and Ré, C. No subclass left behind: Fine-grained robustness in coarse-grained classification problems. Advances in Neural Information Processing Systems, 33, 2020.
  • Suresh & Guttag (2019) Suresh, H. and Guttag, J. V. A framework for understanding unintended consequences of machine learning. arXiv preprint arXiv:1901.10002, 2019.
  • Tschandl et al. (2018) Tschandl, P., Rosendahl, C., and Kittler, H. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Scientific Data, 5(1), 2018. URL https://doi.org/10.1038/sdata.2018.161.
  • Vodrahalli et al. (2018) Vodrahalli, K., Li, K., and Malik, J. Are all training examples created equal? An empirical study. arXiv preprint arXiv:1811.12569, 2018.
  • Wan & McAuley (2018) Wan, M. and McAuley, J. J. Item recommendation on monotonic behavior chains. In Pera, S., Ekstrand, M. D., Amatriain, X., and O’Donovan, J. (eds.), Proceedings of the 12th ACM Conference on Recommender Systems, RecSys 2018, Vancouver, BC, Canada, October 2-7, 2018, pp. 86–94. ACM, 2018. doi: 10.1145/3240323.3240369. URL https://doi.org/10.1145/3240323.3240369.
  • Wilkinson et al. (2016) Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.-W., da Silva Santos, L. B., Bourne, P. E., et al. The FAIR guiding principles for scientific data management and stewardship. Scientific Data, 3(1):1–9, 2016.
  • Wright (2020) Wright, T. A general exact optimal sample allocation algorithm: With bounded cost and bounded sample sizes. Statistics & Probability Letters, pp.  108829, 2020.
  • Yang et al. (2020) Yang, K., Qinami, K., Fei-Fei, L., Deng, J., and Russakovsky, O. Towards fairer datasets: Filtering and balancing the distribution of the people subtree in the imagenet hierarchy. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp.  547–558, 2020.
  • Yona et al. (2019) Yona, G., Ghorbani, A., and Zou, J. Who’s responsible? Jointly quantifying the contribution of the learning algorithm and training data. arXiv preprint arXiv:1910.04214, 2019.

Appendix A Proofs and Derivations

A.1 Full proof of Proposition 1

Recall the random variable $\hat{L}$ defined with respect to function $f$, loss function $\ell$, and group weights $w:\mathcal{G}\rightarrow\mathbb{R}^{+}$:

$$\hat{L}(w,\vec{\alpha},n;f,\ell) := \frac{1}{n}\sum_{i\in\mathcal{S}(\vec{\alpha},n)} w(g_i)\cdot\ell(f(x_i),y_i),$$

where the randomness comes from the draws of $(x_i,y_i)$ from $\mathcal{D}_{g_i}$ according to procedure $\mathcal{S}(\vec{\alpha},n)$ (Definition 2), as well as any randomness in $f$.

Proposition 1 (restated).

Proof.

For any $n$, any $(\vec{\alpha},w)$ pair induces a vector $\vec{\gamma}'$ with entries $\gamma'_g(w,\vec{\alpha}) := w(g)\alpha_g \big/ \big(\sum_{g'\in\mathcal{G}} w(g')\alpha_{g'}\big)$, where

$$\begin{aligned}
\mathbb{E}[\hat{L}(w,\vec{\alpha},n;f,\ell)] &= \frac{1}{n}\sum_{(x_i,y_i,g_i)\in\mathcal{S}(\vec{\alpha},n)} w(g_i)\,\mathbb{E}[\ell(f(x_i),y_i)] \\
&= \sum_{g\in\mathcal{G}} \frac{n_g}{n}\, w(g)\,\mathbb{E}_{(x,y)\sim\mathcal{D}_g}[\ell(f(x),y)] \\
&= c\cdot\mathbb{E}_{g\sim\mathrm{Multinomial}(\vec{\gamma}')}\big[\mathbb{E}_{(x,y)\sim\mathcal{D}_g}[\ell(f(x),y)]\big]
\end{aligned}$$

for constant $c=\sum_{g}\alpha_g w(g)$. The vector $\vec{\gamma}'$ in this sense describes an implicit ‘target distribution’ induced by applying the weights $w$ after sampling with allocation $\vec{\alpha}$. Note that unless $w(g)=0$ for all $g$ with $\alpha_g>0$, $\vec{\gamma}'$ has at least one nonzero entry. The constant $c$ re-scales the weighted objective function with original weights $w$ so as to match the expected loss with respect to the group proportions $\vec{\gamma}'$. Stated another way, for any alternative allocation $\vec{\alpha}'$, we could pick weights $w'(g)=c\gamma'_g/\alpha'_g$ (letting $w'(g)=0$ if $\alpha'_g=0$) and satisfy

$$\mathbb{E}[\hat{L}(w',\vec{\alpha}',n;f,\ell)] = \mathbb{E}[\hat{L}(w,\vec{\alpha},n;f,\ell)]\,.$$

Given this correspondence, we now find the pair $(\vec{\alpha}^*,w^*)$ which minimizes $\mathrm{Var}[\hat{L}(w',\vec{\alpha}',n;f,\ell)]$, subject to $w'(g)\alpha'_g=c\gamma'_g$. Since the original pair $(\vec{\alpha},w)$ satisfies this constraint (by construction), we must have

$$\min_{\vec{\alpha}',w':\ w'(g)\alpha'_g=c\gamma'_g} \mathrm{Var}[\hat{L}(w',\vec{\alpha}',n;f,\ell)] \leq \mathrm{Var}[\hat{L}(w,\vec{\alpha},n;f,\ell)]\,.$$

We first compute $\mathrm{Var}[\hat{L}(w',\vec{\alpha}',n;f,\ell)]$. By Definition 2, samples $(x_i,y_i)$ are assumed to be independent draws from distributions $\mathcal{D}_{g_i}$, so that the variance of the estimator can be written as (for convenience we assume here that $n\alpha'_g\in\mathbb{Z}$; see the discussion below):

$$\begin{aligned}
\mathrm{Var}\big[\hat{L}(w',\vec{\alpha}',n;f,\ell)\big] &= \frac{1}{n^2}\sum_{g\in\mathcal{G}} w'(g)^2 \sum_{i=1}^{n\alpha'_g} \mathbb{E}_{(x_i,y_i)\sim\mathcal{D}_g}\Big[\big(\ell(f(x_i),y_i) - \mathbb{E}_{(x,y)\sim\mathcal{D}_g}[\ell(f(x),y)]\big)^2\Big] \\
&= \frac{1}{n}\sum_{g\in\mathcal{G}} \alpha'_g\, w'(g)^2\, \mathrm{Var}\big[\ell_g^{(i)}\big]\,,
\end{aligned}$$

where $\mathrm{Var}[\ell_g^{(i)}]$ is shorthand for $\mathbb{E}_{(x_i,y_i)\sim\mathcal{D}_g}\big[\big(\ell(f(x_i),y_i)-\mathbb{E}_{(x,y)\sim\mathcal{D}_g}[\ell(f(x),y)]\big)^2\big]$. Now, respecting the constraint $w'(g)\alpha'_g=c\gamma'_g$ means that for any $g$ with $\gamma'_g>0$, $w'(g)$ is a deterministic function of $\alpha'_g$, since $c$ and $\vec{\gamma}'$ are determined by the initial pair $(\vec{\alpha},w)$. It is then sufficient to compute

$$\operatorname*{argmin}_{\vec{\alpha}'\in\Delta^{|\mathcal{G}|}} \frac{1}{n}\sum_{g\in\mathcal{G}:\gamma'_g>0} \alpha'_g \left(\frac{c\gamma'_g}{\alpha'_g}\right)^2 \mathrm{Var}\big[\ell_g^{(i)}\big] = \operatorname*{argmin}_{\vec{\alpha}'\in\Delta^{|\mathcal{G}|}} \frac{c^2}{n}\sum_{g\in\mathcal{G}:\gamma'_g>0} \frac{(\gamma'_g)^2}{\alpha'_g}\, \mathrm{Var}\big[\ell_g^{(i)}\big]\,.$$

The minimizer $\vec{\alpha}^*$ has entries $\alpha^*_g = \gamma'_g\sqrt{\mathrm{Var}[\ell_g^{(i)}]} \,\big/\, \big(\sum_{g\in\mathcal{G}} \gamma'_g\sqrt{\mathrm{Var}[\ell_g^{(i)}]}\big)$. Because $\vec{\alpha}^*$ is unique and determines $w^*$, $\mathrm{Var}[\hat{L}(w^*,\vec{\alpha}^*,n;f,\ell)] < \mathrm{Var}[\hat{L}(w',\vec{\alpha}',n;f,\ell)]$ with strict inequality unless $(w',\vec{\alpha}')=(w^*,\vec{\alpha}^*)$. The optimal weights are $w^*(g) = c\gamma'_g/\alpha^*_g = c\big(\sum_{g\in\mathcal{G}}\gamma'_g\sqrt{\mathrm{Var}[\ell_g^{(i)}]}\big)\big/\sqrt{\mathrm{Var}[\ell_g^{(i)}]}$. Note that for any pair of groups $(g,g')$, the relative weights satisfy $w^*(g)/w^*(g') = \sqrt{\mathrm{Var}[\ell_{g'}^{(i)}]/\mathrm{Var}[\ell_g^{(i)}]}$, and thus do not depend on $\vec{\gamma}'$.

If we consider finite-sample concerns, the minimizer $\vec{\alpha}^*$ must satisfy integer values $n\alpha^*_g\in\mathbb{Z}^+\ \forall g\in\mathcal{G}$. In this case, efficient algorithms exist for finding the integral solution to allocating $n_g$ (Wright, 2020). However, the non-integer-restricted solution $\vec{\alpha}^*$ has a closed form, and we will use the fact that, for any group $g$, $\alpha^*_g$ as defined above and its variant under the additional constraint $n\alpha^*_g\in\mathbb{Z}^+$ can differ by at most $\tfrac{|\mathcal{G}|}{n}$. This means that any $\vec{\alpha}$ with $|\alpha_g-\alpha^*_g|>\tfrac{|\mathcal{G}|}{n}$ cannot be a minimizer of the objective function, even when constrained to $n\alpha_g\in\mathbb{Z}$. Since $w(g)\alpha_g=c\gamma'_g$, an equivalent statement in terms of $w$ is $\big|1-\tfrac{w(g)}{w^*(g)}\big| > \tfrac{w(g)|\mathcal{G}|}{nc\gamma'_g} = \tfrac{|\mathcal{G}|}{n\alpha_g} = \tfrac{|\mathcal{G}|}{n_g}$.

Finally, we show that if $w^*(g)<w(g)$, then $\alpha^*_g>\alpha_g$. This follows from our definition of $\vec{\gamma}'$ such that $w(g)=c\gamma'_g/\alpha_g$, and our constraint, such that $w^*(g)=c\gamma'_g/\alpha^*_g$. From these, we must have $w^*(g)\alpha^*_g = w(g)\alpha_g$, from which the claim and its reverse follow. ∎
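As a quick numeric sanity check on this closed form, the sketch below (with hypothetical values for $\vec{\gamma}'$ and the per-group loss variances) compares the variance of the weighted estimator at $\vec{\alpha}^*$ against random allocations on the simplex.

```python
# Numeric check of Proposition 1's allocation; gamma_prime and var_g are
# hypothetical stand-ins for the target distribution and Var[l_g^(i)].
import numpy as np

rng = np.random.default_rng(0)
gamma_prime = np.array([0.2, 0.3, 0.5])  # implicit target distribution gamma'
var_g = np.array([4.0, 1.0, 0.25])       # per-group loss variances
n, c = 1000, 1.0

def estimator_variance(alpha):
    # Reduced form of Var[L-hat]: (c^2 / n) * sum_g gamma'_g^2 Var_g / alpha_g.
    return (c**2 / n) * np.sum(gamma_prime**2 * var_g / alpha)

# Closed-form minimizer: alpha*_g proportional to gamma'_g * sqrt(Var_g).
alpha_star = gamma_prime * np.sqrt(var_g)
alpha_star /= alpha_star.sum()

# Compare against many random allocations drawn from the simplex.
random_alphas = rng.dirichlet(np.ones(3), size=100_000)
rand_vars = (c**2 / n) * (gamma_prime**2 * var_g / random_alphas).sum(axis=1)
print(estimator_variance(alpha_star) <= rand_vars.min())  # True
```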

A.2 Proof of Proposition 2

Proposition 2 (restated).

Proof.

Recall the decomposition of the estimated population risk:

$$\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\ell(\hat{f}_{\mathcal{S}}(x),y)\big] = \sum_{g\in\mathcal{G}}\gamma_g\cdot\mathbb{E}_{(x,y)\sim\mathcal{D}_g}\big[\ell(\hat{f}(x),y)\big] \approx \sum_{g\in\mathcal{G}}\gamma_g\big(\sigma_g^2 (n\alpha_g)^{-p} + \tau_g^2 n^{-q} + \delta_g\big)\,.$$

Now we find, dropping the terms that do not depend on $\vec{\alpha}$,

$$\vec{\alpha}^* = \operatorname*{argmin}_{\vec{\alpha}\in\Delta^{|\mathcal{G}|}} \sum_{g\in\mathcal{G}}\gamma_g\big(\sigma_g^2(n\alpha_g)^{-p} + \tau_g^2 n^{-q} + \delta_g\big) = \operatorname*{argmin}_{\vec{\alpha}\in\Delta^{|\mathcal{G}|}} n^{-p}\sum_{g\in\mathcal{G}}\gamma_g\sigma_g^2\alpha_g^{-p} = \operatorname*{argmin}_{\vec{\alpha}\in\Delta^{|\mathcal{G}|}} \sum_{g\in\mathcal{G}}\gamma_g\sigma_g^2\alpha_g^{-p}\,.$$

If $\sigma_g=0\ \forall g\in\mathcal{G}$, then any allocation $\vec{\alpha}^*\in\Delta^{|\mathcal{G}|}$ minimizes the approximated population loss. Otherwise, $\alpha^*_g=0$ for any group with $\sigma_g=0$; what follows describes the solution $\alpha^*_g$ for $g$ with $\sigma_g>0$. If any $\alpha_g=0$, then the objective is unbounded above, so we can restrict our constraints to $\vec{\alpha}\in(0,1)^{|\mathcal{G}|}$. As the sum of convex functions, the objective function is convex in $\vec{\alpha}$. It is continuously differentiable when $\alpha_g>0\ \forall g\in\mathcal{G}$. The KKT conditions are satisfied when

$$\begin{aligned}
p\gamma_g\sigma_g^2\alpha_g^{-(p+1)} &= \lambda \quad \forall g\,, \\
\sum_g \alpha_g &= 1\,.
\end{aligned}$$

Solving this system of equations yields that the KKT conditions are satisfied when $\alpha^*_g = (\gamma_g\sigma_g^2)^{1/(p+1)} \big/ \sum_g(\gamma_g\sigma_g^2)^{1/(p+1)}$. Since this is the only solution to the KKT conditions, it is the unique minimizer. ∎
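The closed form is easy to verify numerically; the sketch below uses hypothetical values of $\vec{\gamma}$, $\sigma_g^2$, and $p$, and checks the formula against a direct minimization (for two groups, parameterizing $\alpha_B = 1-\alpha_A$).

```python
# Numeric check of Proposition 2's optimal allocation; all values hypothetical.
import numpy as np
from scipy.optimize import minimize

gamma = np.array([0.1, 0.9])   # population proportions gamma_g
sigma2 = np.array([1.0, 1.0])  # per-group constants sigma_g^2
p = 0.5                        # scaling exponent, p > 0

def objective(alpha):
    return np.sum(gamma * sigma2 * alpha ** (-p))

# Closed form: alpha*_g proportional to (gamma_g * sigma_g^2)^(1 / (p + 1)).
alpha_star = (gamma * sigma2) ** (1.0 / (p + 1))
alpha_star /= alpha_star.sum()

res = minimize(lambda a: objective(np.array([a[0], 1.0 - a[0]])),
               x0=[0.5], bounds=[(1e-6, 1.0 - 1e-6)])
print(alpha_star[0], res.x[0])  # both ~0.19; note alpha*_A > gamma_A = 0.1
```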

A.3 Proof of Corollary 1

Corollary 1 (restated).

Proof.

Let $m=|\mathcal{G}|$ denote the number of groups. When $\sigma_g=\sigma\ \forall g\in\mathcal{G}$,

$$\alpha^*_g = \frac{\gamma_g^{1/(p+1)}}{\sum_g \gamma_g^{1/(p+1)}} = \gamma_g\cdot\frac{1}{\gamma_g + \gamma_g^{p/(p+1)}\sum_{g'\neq g}\gamma_{g'}^{1/(p+1)}}\,.$$

Since $p>0$ by assumption, $\sum_{i=1}^n \gamma_i^{1/(p+1)}$ with $\gamma_i$ subject to (a) $\sum_{i=1}^n \gamma_i = s$ and (b) $\gamma_i>0$ is maximized when all $\gamma_i$ are equal. Then, since $\sum_{g'\neq g}\gamma_{g'} = 1-\gamma_g$,

$$\begin{aligned}
\gamma_g\cdot\frac{1}{\gamma_g + \gamma_g^{p/(p+1)}\sum_{g'\neq g}\gamma_{g'}^{1/(p+1)}} &\geq \gamma_g\cdot\frac{1}{\gamma_g + \gamma_g^{p/(p+1)}\sum_{g'\neq g}\left(\frac{1-\gamma_g}{m-1}\right)^{1/(p+1)}} \\
&= \gamma_g\cdot\frac{1}{\gamma_g + \left((m-1)\gamma_g\right)^{p/(p+1)}(1-\gamma_g)^{1/(p+1)}}\,.
\end{aligned}$$

When $\gamma_g\leq 1/m$, $\gamma_g/(1-\gamma_g)\leq\frac{1}{m-1}$, so that

$$\alpha^*_g \geq \gamma_g\cdot\frac{1}{\gamma_g + (1-\gamma_g)(m-1)^{p/(p+1)}\left(\gamma_g/(1-\gamma_g)\right)^{p/(p+1)}} \geq \gamma_g\cdot\frac{1}{\gamma_g+(1-\gamma_g)} = \gamma_g\,. \qquad ∎$$
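A brief numeric spot-check of this corollary, over random population proportions and a hypothetical exponent $p$:

```python
# Spot-check of Corollary 1: with equal sigma_g, every group with
# gamma_g <= 1/m receives alpha*_g >= gamma_g.
import numpy as np

rng = np.random.default_rng(1)
p = 0.31  # hypothetical scaling exponent
for _ in range(1000):
    gamma = rng.dirichlet(np.ones(5))
    alpha_star = gamma ** (1.0 / (p + 1))
    alpha_star /= alpha_star.sum()
    minority = gamma <= 1.0 / len(gamma)
    assert np.all(alpha_star[minority] >= gamma[minority] - 1e-12)
print("Corollary 1 holds on all sampled gamma.")
```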

A.4 Proof of Corollary 2

Corollary 2 (restated).

Proof.

For the setting of two groups, we can express the optimal allocations as:

$$\alpha^*_A = \frac{(\gamma_A\sigma_A^2)^{1/(p+1)}}{(\gamma_A\sigma_A^2)^{1/(p+1)} + ((1-\gamma_A)\sigma_B^2)^{1/(p+1)}}\,, \qquad \alpha^*_B = 1-\alpha^*_A\,.$$

Rearranging,

$$\alpha^*_A = \gamma_A\,\frac{1}{\gamma_A + (\sigma_B^2/\sigma_A^2)^{1/(p+1)}(1-\gamma_A)^{1/(p+1)}(\gamma_A)^{p/(p+1)}}\,.$$

For $p>0$, it holds that $0<\frac{1}{p+1}<1$. Therefore, for any $p>0$ and $\gamma<0.5$,

$$\gamma < (\gamma)^{\frac{1}{p+1}}(1-\gamma)^{\frac{p}{p+1}} < (1-\gamma)\,.$$

From this, we derive the upper bound

$$\alpha^*_A < \gamma_A\,\frac{1}{\gamma_A + (\sigma_B^2/\sigma_A^2)^{1/(p+1)}\gamma_A} = \frac{(\sigma_A^2)^{1/(p+1)}}{(\sigma_A^2)^{1/(p+1)} + (\sigma_B^2)^{1/(p+1)}}\,,$$

and the lower bound

$$\alpha^*_A > \gamma_A\,\frac{1}{\gamma_A + (\sigma_B^2/\sigma_A^2)^{1/(p+1)}(1-\gamma_A)} = \gamma_A\,\frac{(\sigma_A^2)^{1/(p+1)}}{\gamma_A(\sigma_A^2)^{1/(p+1)} + (1-\gamma_A)(\sigma_B^2)^{1/(p+1)}}\,.$$

When σAσB\sigma_{A}\geq\sigma_{B},

$$\alpha^*_A > \gamma_A\,\frac{(\sigma_A^2)^{1/(p+1)}}{\gamma_A(\sigma_A^2)^{1/(p+1)} + (1-\gamma_A)(\sigma_B^2)^{1/(p+1)}} \geq \gamma_A\,\frac{(\sigma_A^2)^{1/(p+1)}}{(\gamma_A + 1-\gamma_A)(\sigma_A^2)^{1/(p+1)}} = \gamma_A\,,$$

and when σAσB\sigma_{A}\leq\sigma_{B},

$$\alpha^*_A < \frac{(\sigma_A^2)^{1/(p+1)}}{(\sigma_A^2)^{1/(p+1)} + (\sigma_B^2)^{1/(p+1)}} \leq \frac{(\sigma_A^2)^{1/(p+1)}}{(\sigma_A^2)^{1/(p+1)} + (\sigma_A^2)^{1/(p+1)}} = 1/2\,. \qquad ∎$$
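As with the previous corollaries, these two-group bounds can be spot-checked numerically; the values of $\gamma_A$, $\sigma_A^2$, $\sigma_B^2$, and $p$ below are hypothetical.

```python
# Spot-check of Corollary 2's bounds for a minority group A (gamma_A < 1/2).
import numpy as np

rng = np.random.default_rng(2)
p = 0.4
for _ in range(1000):
    gamma_A = rng.uniform(0.01, 0.49)
    s2_A, s2_B = rng.uniform(0.1, 4.0, size=2)
    num = (gamma_A * s2_A) ** (1.0 / (p + 1))
    alpha_A = num / (num + ((1.0 - gamma_A) * s2_B) ** (1.0 / (p + 1)))
    if s2_A >= s2_B:
        assert alpha_A > gamma_A   # lower bound
    if s2_A <= s2_B:
        assert alpha_A < 0.5       # upper bound
print("Corollary 2 bounds hold on all samples.")
```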

A.5 Additional Derivations

In Example 3, we consider the model $y_i = x_i^\top\beta + \alpha_1\mathbb{I}[g_i=A] + \alpha_2\mathbb{I}[g_i=B] + \varepsilon_i$ with noise $\varepsilon_i\sim\mathcal{N}(0,\sigma^2)$, where $x_i\sim\mathcal{N}(0,\Sigma_x)$, and denote $\theta=[\alpha_1,\alpha_2,\beta^\top]^\top$ (note: here $\alpha_1,\alpha_2$ denote the model coefficients, not allocations). We want to compute

$$\begin{aligned}
\mathbb{E}_{(x,y,g)\sim\mathcal{D}}\big[(\hat{f}(x)-y)^2 \,\big|\, g=A\big] &= \sigma^2 + \mathbb{E}\big[\|x^\top(\hat{\beta}-\beta) + (\hat{\alpha}_1-\alpha_1)\|^2\big] \\
&= \sigma^2 + \mathbb{E}\big[x^\top(\hat{\beta}-\beta)(\hat{\beta}-\beta)^\top x\big] + 2\,\mathbb{E}\big[(\hat{\alpha}_1-\alpha_1)(\hat{\beta}-\beta)^\top x\big] + \mathbb{E}\big[(\hat{\alpha}_1-\alpha_1)^2\big]\,.
\end{aligned}$$

Since the draw $(x,y,g)\sim\mathcal{D}$ is independent of the data from which the ordinary least squares solution $\hat{\theta}$ is estimated, we can write out each of these terms in terms of their dependence on $n$, the total number of data points, as well as $n_A$ and $n_B$ (where $n_A+n_B=n$), the number of datapoints from each group. To do this, we solve for the entries of the covariance matrix:

$$\mathbb{E}[(\hat{\theta}-\theta)(\hat{\theta}-\theta)^\top] = \begin{bmatrix} \mathbb{E}\big[(\hat{\alpha}_1-\alpha_1)^2\big] & \mathbb{E}\big[(\hat{\alpha}_1-\alpha_1)(\hat{\alpha}_2-\alpha_2)\big] & \mathbb{E}\big[(\hat{\beta}-\beta)(\hat{\alpha}_1-\alpha_1)\big]^\top \\ \mathbb{E}\big[(\hat{\alpha}_1-\alpha_1)(\hat{\alpha}_2-\alpha_2)\big] & \mathbb{E}\big[(\hat{\alpha}_2-\alpha_2)^2\big] & \mathbb{E}\big[(\hat{\beta}-\beta)(\hat{\alpha}_2-\alpha_2)\big]^\top \\ \mathbb{E}\big[(\hat{\beta}-\beta)(\hat{\alpha}_1-\alpha_1)\big] & \mathbb{E}\big[(\hat{\beta}-\beta)(\hat{\alpha}_2-\alpha_2)\big] & \mathbb{E}\big[(\hat{\beta}-\beta)(\hat{\beta}-\beta)^\top\big] \end{bmatrix} = \sigma^2(Z^\top Z)^{-1}$$

where $Z$ is the $n\times(d+2)$ design matrix with rows $\{(\mathbb{I}[g_i=A],\mathbb{I}[g_i=B],x_i^\top)\}_{i=1}^n$. Next, we find the block entries of the matrix $(Z^\top Z)^{-1}$. We first examine the term within the inverse:

$$Z^\top Z = \begin{bmatrix} n_A & 0 & n_A\bar{x}_A^\top \\ 0 & n_B & n_B\bar{x}_B^\top \\ n_A\bar{x}_A & n_B\bar{x}_B & X^\top X \end{bmatrix}$$

where $\bar{x}_A = \frac{1}{n_A}\sum_{i=1}^n x_i\cdot\mathbb{I}[g_i=A]$, and similarly for $\bar{x}_B$. We now use the Schur complement to compute the desired blocks of $(Z^\top Z)^{-1}$. The Schur complement is

$$S = X^\top X - \begin{bmatrix} n_A\bar{x}_A & n_B\bar{x}_B \end{bmatrix} \begin{bmatrix} n_A^{-1} & 0 \\ 0 & n_B^{-1} \end{bmatrix} \begin{bmatrix} n_A\bar{x}_A^\top \\ n_B\bar{x}_B^\top \end{bmatrix} = X^\top X - n\bar{x}\bar{x}^\top\,,$$

which we simplify to $S = X^\top X$ by assuming that we zero-mean the sample feature matrix $X$ before calculating the least squares solution. Using the Schur complement, the covariance matrix in block form is

$$(Z^\top Z)^{-1} = \begin{bmatrix} \frac{1}{n_A} + \bar{x}_A^\top S^{-1}\bar{x}_A & \bar{x}_A^\top S^{-1}\bar{x}_B & -\bar{x}_A^\top S^{-1} \\ \bar{x}_B^\top S^{-1}\bar{x}_A & \frac{1}{n_B} + \bar{x}_B^\top S^{-1}\bar{x}_B & -\bar{x}_B^\top S^{-1} \\ -S^{-1}\bar{x}_A & -S^{-1}\bar{x}_B & S^{-1} \end{bmatrix}\,.$$

Plugging the appropriate blocks into our original equation, we get:

$$\begin{aligned}
\mathbb{E}_{(x,y,g)\sim\mathcal{D}}\big[(\hat{f}(x)-y)^2 \,\big|\, g=A\big] &= \sigma^2 + \mathbb{E}_{(x,y)\sim\mathcal{D}_A}\big[x^\top(\hat{\beta}-\beta)(\hat{\beta}-\beta)^\top x + 2(\hat{\alpha}_1-\alpha_1)(\hat{\beta}-\beta)^\top x + (\hat{\alpha}_1-\alpha_1)^2\big] \\
&= \sigma^2\Big(1 + \frac{1}{n_A} + \mathbb{E}_{g=A}\big[x^\top S^{-1}x + \bar{x}_A^\top S^{-1}\bar{x}_A\big]\Big)\,,
\end{aligned}$$

where the middle term cancels since $\mathbb{E}[x]=0$. Note that $S$ is the scaled sample covariance matrix. The vectors $x_i$ are drawn i.i.d. from $\mathcal{N}(0,\Sigma_x)$, so that $S^{-1}$ follows an inverse Wishart distribution with parameters $n,d,\Sigma_x$. For the fresh sample $x$,

$$\mathbb{E}[x^\top S^{-1}x] = \mathrm{Trace}\big(\mathbb{E}[S^{-1}]\,\mathbb{E}[xx^\top]\big) = \frac{1}{n-d-1}\,\mathrm{Tr}(\Sigma_x^{-1}\Sigma_x) = \frac{d}{n-d-1}\,.$$

For the $\bar{x}_A^\top S^{-1}\bar{x}_A$ term, we invoke the matrix inversion lemma. For a single row $x_i$ of $X$, let $X_{-i}$ denote the $(n-1)\times d$ matrix comprised of all rows of $X$ except $x_i$. Then

$$\begin{aligned}
x_i^\top(X^\top X)^{-1}x_i &= x_i^\top(X_{-i}^\top X_{-i} + x_i x_i^\top)^{-1}x_i \\
&= x_i^\top\Big((X_{-i}^\top X_{-i})^{-1} - (X_{-i}^\top X_{-i})^{-1}x_i\big(1 + x_i^\top(X_{-i}^\top X_{-i})^{-1}x_i\big)^{-1}x_i^\top(X_{-i}^\top X_{-i})^{-1}\Big)x_i\,.
\end{aligned}$$

Letting $a_i = x_i^\top(X_{-i}^\top X_{-i})^{-1}x_i \geq 0$, we rewrite the above as

$$x_i^\top(X^\top X)^{-1}x_i = a_i - \frac{a_i^2}{1+a_i} = \frac{a_i}{1+a_i} \leq a_i\,.$$

Since the $x_i$ are independent and zero mean, $\mathbb{E}\big[x_i^\top(X_{-i}^\top X_{-i})^{-1}x_j\big] = \mathbb{E}[x_i^\top]\,\mathbb{E}[(X_{-i}^\top X_{-i})^{-1}x_j] = 0\ \forall i\neq j$. By an argument similar to that given above, we derive that $\mathbb{E}[a_i] = d/(n-d-2)$, so that

$$\mathbb{E}\big[\bar{x}_A^\top S^{-1}\bar{x}_A\big] = \mathbb{E}\bigg[\Big(\frac{1}{n_A}\sum_i^{n_A}x_i\Big)^\top S^{-1}\Big(\frac{1}{n_A}\sum_i^{n_A}x_i\Big)\bigg] = \frac{1}{n_A^2}\,\mathbb{E}\bigg[\sum_i^{n_A}x_i^\top S^{-1}x_i\bigg] \leq \frac{1}{n_A}\cdot\frac{d}{n-d-2}\,.$$

Putting this all together, we conclude that for ndn\gg d,

$$\mathbb{E}_{(x,y,g)\sim\mathcal{D}}\big[(\hat{f}(x)-y)^2 \,\big|\, g=A\big] = \sigma^2\Big(1 + \frac{1}{n_A} + \frac{d}{n-d-1} + \mathbb{E}[\bar{x}_A^\top S^{-1}\bar{x}_A]\Big) = \sigma^2\Big(1 + \frac{1}{n_A} + O\Big(\frac{d}{n}\Big)\Big)\,.$$
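A Monte Carlo sketch of this derivation is below: it fits OLS with group intercepts on synthetic data and compares the empirical group-A risk to $\sigma^2(1 + 1/n_A + d/(n-d-1))$; all problem sizes and coefficients are hypothetical.

```python
# Monte Carlo check that group-A risk tracks sigma^2 * (1 + 1/n_A + O(d/n)).
import numpy as np

rng = np.random.default_rng(3)
d, n_A, n_B, sigma = 5, 50, 950, 1.0
n = n_A + n_B
beta = rng.normal(size=d)
a1, a2 = 0.7, -0.3  # group intercepts alpha_1, alpha_2

risks = []
for _ in range(200):
    g = np.r_[np.zeros(n_A), np.ones(n_B)]  # 0 = group A, 1 = group B
    X = rng.normal(size=(n, d))
    X -= X.mean(axis=0)                     # zero-mean features, as assumed
    y = X @ beta + np.where(g == 0, a1, a2) + sigma * rng.normal(size=n)
    Z = np.column_stack([(g == 0), (g == 1), X]).astype(float)
    theta = np.linalg.lstsq(Z, y, rcond=None)[0]
    # evaluate on fresh group-A draws
    Xt = rng.normal(size=(20_000, d))
    yt = Xt @ beta + a1 + sigma * rng.normal(size=20_000)
    Zt = np.column_stack([np.ones(20_000), np.zeros(20_000), Xt])
    risks.append(np.mean((Zt @ theta - yt) ** 2))

print(np.mean(risks))                          # empirical group-A risk
print(sigma**2 * (1 + 1/n_A + d/(n - d - 1)))  # predicted approximation
```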

Appendix B Experiment Details

Section B.1 details the datasets used, including preprocessing steps and any data excluded from the experiments; the remainder of Appendix B provides a detailed explanation of each experiment described in the main text. Appendix C describes additional experiments that complement the findings of the main experiments. Code to process data and replicate the experiments detailed here is available at https://github.com/estherrolf/representation-matters.


B.1 Dataset Descriptions

We use and modify benchmark machine learning and fairness datasets, as well as more recent datasets on image diagnosis and student performance to study the effect of training set allocations on group and population accuracies in a systematic fashion. Each of the datasets we use is described below, with download links given at the end of this section.

Modified CIFAR-4. We modify an existing machine learning benchmark dataset, CIFAR-10 (Krizhevsky, 2009), to instantiate binary classification tasks with binary groups, where groups $g$ are statistically independent of the labels $y$. We take four of the ten CIFAR-10 classes, {plane, car, bird, horse}, and sort them into the binary categories {land/air} (labels) and {animal/vehicle} (groups), as in Fig. B.1. This construction balances the class labels among the two groups $g$, so that no matter the allocation of groups in the training set, the label distribution remains balanced. This eliminates class imbalance as a possible cause of changes in accuracy due to training set composition or up-weighting methods.

There are 5,000 training and 1,000 test instances of each class in the CIFAR-10 dataset, resulting in 10,000 training instances of each group in our modified CIFAR-4 dataset (20,000 instances total), and 2,000 instances of each group in the test set (4,000 instances total). By construction, the average label in the test and train sets is 0.5. Since this dataset is designed to assess the main questions of our study under a controlled setting, and there is no natural setting of population rates of animal and vehicle photos, we set the population prevalence parameter of group A (images of animals) to $\gamma_A = 0.1$.

Figure B.1: Modified CIFAR-4 dataset setup.
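A minimal sketch of this relabeling is below, assuming CIFAR-10 labels are loaded as an integer array with torchvision's standard class indices (0 = plane, 1 = car, 2 = bird, 7 = horse).

```python
# Construct CIFAR-4 labels and groups from CIFAR-10 class indices.
import numpy as np

def cifar4_label_and_group(cifar10_labels):
    """Label y: land (0) vs. air (1); group g: vehicle (0) vs. animal (1).

    Each group contains one land class and one air class, so the label
    distribution stays balanced under any group allocation.
    """
    keep = np.isin(cifar10_labels, [0, 1, 2, 7])  # plane, car, bird, horse
    kept = cifar10_labels[keep]
    y = np.isin(kept, [0, 2]).astype(int)  # plane and bird are 'air'
    g = np.isin(kept, [2, 7]).astype(int)  # bird and horse are 'animal'
    return keep, y, g

labels = np.random.default_rng(0).integers(0, 10, size=60_000)  # stand-in
mask, y, g = cifar4_label_and_group(labels)
```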

Goodreads ratings prediction. The Goodreads dataset (Wan & McAuley, 2018) contains a large corpus of book reviews and ratings, collected from public book reviews online. From the text of the reviews, we predict the numerical rating (integers 1-5) a reviewer assigns to a book. From the larger corpus, we consider reviews for two genres: history/biography (henceforth “history”) and fantasy/paranormal (henceforth “fantasy”). We calculate the population prevalences from the total number of reviews in the corpus for each genre. As of the writing of this document, there are 2,066,193 reviews for history and biography books, and 3,424,641 for fantasy and paranormal books, so that $\vec{\gamma} = [0.376, 0.624]$.

After dropping entries with no valid review or rating, we have 1,985,464 reviews from the history genre and 3,309,417 reviews from the fantasy genre, with no books assigned to both genres. To reduce dataset size and increase performance of the basic prediction task, we further reduce each dataset to only the 100 most frequently reviewed books of each genre (following a procedure similar to that of Chen et al. (2018)). To instantiate the dataset we use in our analysis, we draw 62,500 review/rating pairs uniformly at random from each of these pools. The mean rating for fantasy instances is 4.146; for history it is 4.103. We allocate 20% of the data to a test set and the remaining 80% to the training set, with an equal number of instances of each genre in each set.

We use tf-idf vectors of the 2,000 most frequently occurring words across the entire dataset of 125,000 instances as features (a featurization similar to that of Chen et al. (2018), with fewer total features). We note that this is a different version of the Goodreads dataset from that used by Chen et al. (2018); the updated dataset we use has more reviews, and we use different group variables.
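The sketch below illustrates this featurization with scikit-learn, which the experiments already use; the two stand-in reviews and the `max_features` vocabulary cutoff mirror the setup described above.

```python
# tf-idf featurization and the multiclass rating model from Table B.3.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

reviews = ["loved this fantasy epic", "dry but informative history"]  # stand-ins
ratings = [5, 3]                                                      # stand-ins

vectorizer = TfidfVectorizer(max_features=2000)  # keep most frequent terms
X = vectorizer.fit_transform(reviews)

clf = LogisticRegression(C=1.0, penalty="l2", max_iter=1000)
clf.fit(X, ratings)  # multiclass over ratings 1-5 on the real data
```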

Adult. The Adult dataset, originally extracted from the 1994 Census database, is downloaded from the UCI Machine Learning Repository (Dua & Graff, 2017). The dataset contains demographic features, and the classification task is to predict whether an individual makes more than 50K per year. We drop the final-weight feature, a measure of the proportion of the population represented by that profile. There are $d=13$ remaining features; we one-hot encode non-binary discontinuous features, including work class, education, marital status, occupation, relationship, race, and native country, resulting in 103 feature columns.

For the main analysis, we exclude the features for sex, husband, and wife, as they indicate group status but do not affect predictive ability (see Section C.1), for a total of 10 features and 100 columns in the feature matrix. We keep the original train and test splits, with 32,561 instances (21,790 from group male and 10,771 from group female) and 16,281 instances (10,860 from group male and 5,421 from group female), respectively. We group instances based on gender (encoded as binary male/female). While the training and test sets have roughly a 2/3 male group, 1/3 female group balance, we set $\vec{\gamma} = [0.5, 0.5]$ to more adequately reflect the true proportions of men and women in the world. In the test set, the average label for the male group is 0.300 and for the female group is 0.109 (the numbers are similar for the train set: 0.306 for the male group and 0.109 for the female group).
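A sketch of this tabular preprocessing with pandas is below; the toy dataframe and column subset are illustrative, not the full UCI schema.

```python
# One-hot encode non-binary categorical features, dropping group indicators.
import pandas as pd

df = pd.DataFrame({
    "age": [39, 50],
    "workclass": ["State-gov", "Self-emp-not-inc"],
    "education": ["Bachelors", "HS-grad"],
    "sex": ["Male", "Female"],
    "label": [0, 1],
})

groups = (df["sex"] == "Female").astype(int)  # group variable, held aside
features = df.drop(columns=["sex", "label"])  # exclude group indicator
X = pd.get_dummies(features, columns=["workclass", "education"])
y = df["label"]
```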

Mooc dataset. The dataset we refer to as the MOOC (Massive Open Online Course) dataset is the HarvardX-MITx Person-Course Dataset (HarvardX, 2014). This dataset contains anonymized data from the 2013 academic year offerings of MITx and HarvardX edX classes. Each row in the dataset corresponds to a student-course pairing; students taking more than one course may correspond to multiple rows. We keep the following demographic, participation, and administrative features: gender, LoE_DI (highest completed level of education), final_cc_cname_DI (country), YoB, ndays_act (number of active days on the edX platform), nplay_video (number of video plays on edX), nforum_posts (number of discussion forum posts on edX), nevents (number of interactions on edX), course_id, and certified (whether the student achieved a high enough grade for certification). After dropping instances without valid entries for these features, we have 25,213 instances. We one-hot encode the non-binary discontinuous features, including course_id, LoE_DI, and final_cc_cname_DI, expanding the original 9 features into 47 feature columns. In Section C.1 we exclude demographic features. We partition the data into a train set of size 24,130 and a test set of size 6,032, with the same proportion of groups in each set.

We group instances by the highest level of education self-reported by the person taking the class; we bin this into those who have completed at most secondary education and those who have completed more than secondary education. For reference, in the USA, completing secondary education corresponds to completing high school. The train and test sets contain an equal fraction of each group, with about 16% of instances where the person taking the course had no recorded education higher than secondary. Of all training and test instances, those from group A (student had at most secondary education) have a 5.2% certification rate, and those from group B (student had more than secondary education) have an 8.2% certification rate.

ISIC skin cancer detection dataset. We download the dataset from the ISIC archive (Codella et al., 2017; Tschandl et al., 2018; Codella et al., 2019), using the repository https://github.com/GalAvineri/ISIC-Archive-Downloader to assist in downloading. To match the dataset used by Sohoni et al. (2020), we use only images added or modified prior to 2019, which gives us 23,906 instances. From this set we include only images that are explicitly labeled as benign or malignant. We additionally remove any data points from the ‘SONIC’ study, as these are all benign cases and are easily identified via colorful dots on the images (Sohoni et al., 2020). The remaining 11,952 instances are, to our knowledge, identical to the “Non-patch” dataset in Sohoni et al. (2020), up to the random splits into training/validation and test sets.

As groups, we subset based on the approximate age of the patient who is the subject of the photo. Approximate age is binned to the nearest 5 years in the original dataset; we define groups A $= \{i : \texttt{age\_approx}[i] \geq 55\}$ and B $= \{i : \texttt{age\_approx}[i] < 55\}$. Of the group A instances in the training and test sets, 31.4% are malignant; of the group B instances, 8.37% are malignant. We set $\vec{\gamma}$ to match the distribution in the 11,952 data points in our preprocessed set, so that $\vec{\gamma} = [0.43, 0.57]$. We split 80% of the data into a train set and 20% into a test set, with the same proportion of groups in each set.

The ISIC archive is a compilation of medical images aggregated from many individual studies. Experiments detailed in Sections 4.5 and B.7 use the study from which each image originates as the group variable, to investigate the interactions between groups when $|\mathcal{G}| > 2$. These results also motivate our choice to exclude the SONIC study from the dataset we use for the main analysis.

Download links. Code for processing these datasets as described above can be found at https://github.com/estherrolf/representation-matters. The datasets we use (before our subsetting/preprocessing) can be downloaded at the following links:

Loss metrics. For binary prediction problems, we report the 0/1 loss when there is no significant class imbalance (modified CIFAR-4, Adult). For the ISIC and Mooc tasks, we report 1 - AUROC, where AUROC is the area under the receiver operating characteristic curve. AUROC is a standard metric for medical image prediction (Sohoni et al., 2020), and for Mooc we choose this metric due to the label imbalance (low certification rates). Since AUROC is constrained to be between 0 and 1, the loss metric 1 - AUROC will also be between 0 and 1. For the Goodreads dataset, we optimize the $\ell_1$ loss (mean absolute error, MAE); in Section C.2 we compare this to minimizing the $\ell_2$ loss (mean squared error, MSE) of the predictions.
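For concreteness, both loss metrics are available in scikit-learn; a small sketch with made-up predictions:

```python
# Loss metrics used across datasets: 0/1 loss and 1 - AUROC.
import numpy as np
from sklearn.metrics import roc_auc_score, zero_one_loss

y_true = np.array([0, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2])  # predicted probabilities

auroc_loss = 1.0 - roc_auc_score(y_true, y_score)             # ISIC, Mooc
zo_loss = zero_one_loss(y_true, (y_score > 0.5).astype(int))  # CIFAR-4, Adult
```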

B.2 Model Details

For the neural net classifiers, we compare the empirical risk minimization (ERM) objective with an importance weighted objective, implemented via importance sampling (IS) following results in Buda et al. (2018), and a group distributionally robust optimization (GDRO) objective (Sagawa et al., 2020). We adapt our group distributionally robust training procedure from https://github.com/kohpangwei/group_DRO, the code repository accompanying Sagawa et al. (2020). For the non-image datasets, we choose the model class from a set of common prediction functions (Section B.3). We use implementations of these algorithms from https://scikit-learn.org, using the built-in sample weight parameters to implement importance weighting (IW). Since many of the prediction functions we consider are not gradient-based, we cannot apply the algorithm from Sagawa et al. (2020); thus we do not compare to GDRO for the non-image datasets.

  • CIFAR-4 (metric: 0/1 loss; model: pretrained resnet-18, fine-tuned)
    - ERM: num_epochs = 20, lr = 1e-3, wd = 1e-4, momentum = 0.9
    - IS: num_epochs = 20, lr = 1e-3, momentum = 0.9; wd = 1e-2 if $\alpha_A \geq 0.98$ or $\alpha_A \leq 0.02$, wd = 1e-3 otherwise
    - GDRO: num_epochs = 20, lr = 1e-3, wd = 1e-3, momentum = 0.9, gdro_ss = 1e-2, group_adj = 4.0
  • ISIC (metric: 1 - AUROC; model: pretrained resnet-18, fine-tuned)
    - ERM: num_epochs = 20, lr = 1e-3, wd = 1e-4, momentum = 0.9
    - IS: num_epochs = 20, lr = 1e-3, wd = 1e-3, momentum = 0.9
    - GDRO: num_epochs = 20, lr = 1e-3, wd = 1e-4, momentum = 0.9, gdro_ss = 1e-1, group_adj = 1.0
  • Goodreads* (metric: $\ell_1$ loss (MAE); model: multiclass logistic regression)
    - ERM: C = 1.0, penalty = $\ell_2$
    - IW: C = 1.0, penalty = $\ell_2$
  • Goodreads (metric: $\ell_2$ loss (MSE); model: ridge regression)
    - ERM: $\lambda = 0.1$
    - IW: $\lambda = 1.0$ if $\alpha_A \geq 0.95$ or $\alpha_A \leq 0.05$; $\lambda = 0.1$ otherwise
  • Mooc* (with dem. features) (metric: 1 - AUROC; model: random forest classifier)
    - ERM: num_trees = 400, max_depth = 16
    - IW: num_trees = 400, max_depth = 16
  • Mooc (no dem. features) (metric: 1 - AUROC; model: random forest classifier)
    - ERM: num_trees = 400, max_depth = 8
    - IW: num_trees = 400, max_depth = 8
  • Adult (with group features) (metric: 0/1 loss; model: random forest classifier)
    - ERM: num_trees = 200, max_depth = 16
    - IW: num_trees = 200, max_depth = 16
  • Adult* (without group features) (metric: 0/1 loss; model: random forest classifier)
    - ERM: num_trees = 400, max_depth = 16
    - IW: num_trees = 400, max_depth = 16
Table B.3: Models and hyperparameters used to generate Figure 1. Asterisks denote the Goodreads, Mooc, and Adult settings that appear in Figure 1; the other settings appear in the supplementary experiments of Appendix C.
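To make the IW objective above concrete, the following sketch re-weights each training point by $w(g) = \gamma_g / \alpha_g$ and passes the weights through scikit-learn's `sample_weight`; the synthetic data and target $\vec{\gamma}$ are illustrative.

```python
# Importance weighting (IW) via sample_weight in scikit-learn.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def group_weights(g_train, gamma):
    # w(g) = gamma_g / alpha_g, so the weighted training loss targets gamma.
    alphas = np.bincount(g_train) / len(g_train)  # empirical allocation
    return gamma[g_train] / alphas[g_train]

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
g = rng.integers(0, 2, size=500)
y = (X[:, 0] + 0.5 * g + rng.normal(size=500) > 0).astype(int)

gamma = np.array([0.5, 0.5])  # target population proportions
clf = RandomForestClassifier(n_estimators=200, max_depth=16)
clf.fit(X, y, sample_weight=group_weights(g, gamma))
```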

B.3 Hyperparameter Selection

We evaluate hyperparameters for each prediction model using 5-fold cross validation on the training sets (see Section B.1), stratifying folds to maintain group proportions. We run the cross validation across a sparse grid of $\alpha_A \in [0.02, 0.05, 0.2, 0.5, 0.8, 0.95, 0.98]$, allowing us to determine whether hyperparameters should be set differently for different values of $\vec{\alpha}$. Table B.3 describes the models and parameters chosen as a result of this process.
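A sketch of one step of this selection procedure is below, stratifying folds on group membership; the synthetic data and the depth grid are illustrative.

```python
# 5-fold CV over a hyperparameter grid, with folds stratified by group.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import zero_one_loss
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
g = rng.integers(0, 2, size=1000)  # group labels used for stratification
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for max_depth in [8, 16, 32]:
    losses = []
    for tr, va in skf.split(X, g):  # stratify on g to preserve proportions
        clf = RandomForestClassifier(n_estimators=200, max_depth=max_depth)
        clf.fit(X[tr], y[tr])
        losses.append(zero_one_loss(y[va], clf.predict(X[va])))
    print(max_depth, np.mean(losses))
```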

Image Datasets. For the results shown in Figure 1 with image datasets (CIFAR-4 and ISIC), we fine-tune a pretrained resnet-18 by running SGD with momentum for 20 epochs for each dataset. We did not find significant improvements from training for more epochs on either dataset. Sohoni et al. (2020) use a pretrained resnet-50 for the ISIC prediction task; we use a resnet-18 since we are mostly working with smaller training set sizes due to subsetting. We did not find major differences in ERM performance on ISIC between the resnet-18 and resnet-50 for the dataset sizes we considered. We use a resnet-18 for all models for the CIFAR-4 and ISIC tasks.

In the 5-fold cross validation procedure, we search over learning rate in $[0.01, 0.001, 0.0001]$ and weight decay in $[0.01, 0.001, 0.0001]$, keeping the momentum at 0.9 for both the importance sampling and ERM procedures. Given these results, for GDRO, we search over group adjustment in $[1.0, 4.0, 8.0]$, GDRO step size in $[0.1, 0.01, 0.001]$, and weight decay in $[0.001, 0.0001]$, fixing momentum at 0.9 and learning rate at 0.001.

The optimal hyperparameter configurations for the coarse grid of $\vec{\alpha}$ are given in Table B.3. Across $\vec{\alpha}$ values from the coarse grid, either the optimal parameters for each objective were largely consistent, or performance did not vary greatly between hyperparameter configurations, for almost all dataset/objective pairs. As a result, for both the modified CIFAR-4 and ISIC datasets, we keep the hyperparameters fixed across $\vec{\alpha}$, with the exception of IS for CIFAR-4, where we increase the weight decay for extreme $\vec{\alpha}$ (see Table B.3).

Non-Image Datasets. For the Goodreads dataset, we evaluate the following models and parameters: a ridge regression model with $\lambda \in [0.01, 0.1, 1.0, 10.0, 100.0]$; a random forest regressor with splits determined by MSE, with number of trees and maximum depth of trees in $[100, 200, 400] \times [32, 64, 128]$; and a multiclass logistic regression classifier with $\ell_2$ penalty and inverse regularization strength $C \in [0.01, 0.1, 1.0, 10.0]$.

The multiclass logistic regression model minimized mean absolute error (MAE) among the models we considered. For ERM, the optimal regularization parameter was $C = 1.0$ for all $\vec{\alpha}$ in our sparse grid. For IW, the optimal value was $C = 1.0$ for all $\alpha_A$ other than 0.98, where the optimum for history MAE was $C = 10.0$. Since this affected only one group and was not symmetric across $\vec{\alpha}$, we set $C = 1.0$ for all $\vec{\alpha}$ in the evaluation results.

For the Mooc dataset, we evaluate a binary logistic regression model with $\ell_2$ penalty and inverse regularization parameter $C \in [0.001, 0.01, 0.1, 1.0, 10.0]$; a random forest classifier with number of trees and maximum depth of trees in $[100, 200, 400] \times [8, 16, 32]$; and a ridge regression model with $\lambda \in [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0]$ and a threshold of 0.5 for the binary classification decision. The best model across both group and population accuracies was a random forest; the best maximum depth was 16 for both ERM and IW, and results were robust to the number of estimators, so we chose 200 for both. The best hyperparameters were largely consistent across all $\vec{\alpha}$ in the sparse grid.

For the Adult dataset, we evaluate the same models and parameter configurations as for the Mooc dataset. The best model for both ERM and IW was the random forest, with optimal values given in Table B.3.

B.4 Navigating Tradeoffs

Using the hyperparameters selected by the procedure described in Section B.3, we evaluate performance on the held-out test sets (see Section B.1). For the final evaluation, we evaluate $\alpha_A \in [0.0, 0.01, 0.02, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 0.98, 0.99, 1.0]$, skipping the extremes for GDRO and IS/IW.

The training set size is determined by the smaller of the training set sizes for each group, denoted $\min_g n_g$ in Table 1. This results in $n = 10{,}000$ for the modified CIFAR-4, $n = 4{,}092$ for ISIC, $n = 50{,}000$ for Goodreads, $n = 3{,}897$ for Mooc, and $n = 10{,}771$ for Adult. Note that the number of test set instances ($n_{\textrm{test}}$ in Table 1) does not necessarily follow the proportions indicated by $\gamma_A$ in Table 1. For the CIFAR-4 and Goodreads datasets, the test set is constructed to have 50% instances from group A and 50% from group B; in reporting performance, we take a weighted average over the errors from each group, as in Eq. (1). For Adult, we re-weight the test set instances to reflect $\vec{\gamma} = (0.5, 0.5)$. For the remaining two datasets, the test set compositions align with $\vec{\gamma}$. We assess variability in the performance of each method under each setting by examining results over 10 random seeds, which apply to both the random sampling in the subsetting procedure and the randomness in the training procedure.
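The core subsetting step, drawing a training set of fixed size $n$ at allocation $\vec{\alpha}$ from per-group index pools, can be sketched as follows; pool sizes and names are illustrative.

```python
# Sample a training set of size n with group allocation alpha.
import numpy as np

def sample_allocation(group_pools, alpha, n, rng):
    """group_pools: dict g -> array of available indices for group g."""
    take = []
    for g, pool in group_pools.items():
        n_g = int(round(alpha[g] * n))
        take.append(rng.choice(pool, size=n_g, replace=False))
    return np.concatenate(take)

rng = np.random.default_rng(0)
pools = {0: np.arange(0, 10_000), 1: np.arange(10_000, 20_000)}
train_idx = sample_allocation(pools, alpha={0: 0.3, 1: 0.7}, n=10_000, rng=rng)
```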

B.5 Assessing Scaling Law Fit

Figure B.2: Estimated scaling law fits describe observed trends of group errors as a function of $(n_g, n)$. Grey points are not included in the scaling law fit, as $n_g < M_g$ (see Table 2).

In addition to the results on the holdout set in the previous section, where we vary $\vec{\alpha}$ and keep the size of the training set fixed at $n = \min_g n_g$, we conduct subsetting experiments to assess the effect of $n$, as well as $n_g$, on the group accuracies as in Eq. (7). Specifically, for groups A and B, we vary the relative number of data points from group A and group B, as well as the total number of datapoints $n$. We define our subsetting grid by subsampling twice for each configuration of $(n_g, n)$. First, we choose an allocation ratio of group A to group B (from options $[0.125\!:\!1, 0.25\!:\!1, 0.5\!:\!1, 1\!:\!1]$). Then, we subsample again to a fraction $x$ of the largest subset size for this allocation ratio, for $x \in [0.01, 0.02, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]$ (we skip $x = 0.01$ for allocation ratios $< 1$). We repeat the sampling pattern for all pairs of allocation ratio and $x$, and again with the roles of groups A and B switched. This results in a set of 99 unique $(n_A, n_B)$ pairs, which we combine with the 25 pairs from the previous experiment, where we fix $n = n_A + n_B$ (see Figure B.3). We run ten random seeds for each $(n_A, n_B)$ configuration, for a total of 1240 points from which we estimate the scaling law fits.

Figure B.3: Sample configurations for $(n_g, n)$ described in Section B.5, shown for the CIFAR-4 dataset, where $\min_g n_g = 10{,}000$. Note that there are two groups, so $n = n_A + n_B$.

Since our main point of comparison is across training set sizes and allocations, rather than optimizing hyperparameters for each new sample size, we use the same hyperparameter configurations as for the first set of experiments (ERM rows in Table B.3). For the neural net classifiers, we decrease the number of epochs as the total training set size $n$ increases, so as to keep roughly the same number of gradient steps per training run. Specifically, we set the number of epochs to the nearest integer to (# epochs for first set of experiments) $\times$ ($n$ for first set of experiments) / ($n$ for current run), where the first set of experiments corresponds to those in Figure 1.

Together with the results from the previous experiment shown in Figure 1, we use these values of $(n, n_A, n_B)$ and the accuracies evaluated on groups A and B to estimate the parameters of the scaling laws in Eq. (7). We use the nonlinear least squares implementation of scipy.optimize.curve_fit. The estimated parameters are given in Table 2; the standard deviations reported there are those estimated by the nonlinear least squares fits. Figure B.2 plots the fitted model over a subset of the data used to fit it, showing that the modified scaling model in Eq. (7) can express the trends in per-group losses as a function of $n_g$ and $n$.
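A sketch of the fitting step is below; the observed losses are stand-ins generated from the model itself, and the parameter names follow Eq. (7).

```python
# Fit the scaling law of Eq. (7) with scipy.optimize.curve_fit.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(sizes, sigma2, tau2, delta, p, q):
    n_g, n = sizes
    return sigma2 * n_g ** (-p) + tau2 * n ** (-q) + delta

rng = np.random.default_rng(0)
n_g = rng.integers(100, 10_000, size=300).astype(float)
n = n_g + rng.integers(100, 10_000, size=300)
true_params = (2.0, 1.0, 0.05, 0.4, 0.6)
loss = scaling_law((n_g, n), *true_params) + 0.002 * rng.normal(size=300)

params, cov = curve_fit(scaling_law, (n_g, n), loss,
                        p0=(1.0, 1.0, 0.1, 0.5, 0.5), maxfev=20_000)
param_sds = np.sqrt(np.diag(cov))  # standard deviations, as in Table 2
```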

The parameter estimates sometimes have large standard deviations; we also found the parameter estimates to vary with $M_{g}$, the minimum number of points required to include a data point in the fitting procedure, as well as with the overall subsetting design. When seeking precise and robust estimates of the parameters in Eq. (7), we suggest following the experimental procedures outlined in Clauset et al. (2009), and additionally accounting for potential variation due to sampling patterns.

B.6 Pilot Sample Experiment

In this experiment, we take a small random sample from the Goodreads training set to instantiate a “pilot sample” from which to estimate scaling law parameters and suggest an $\hat{\alpha}^{*}$ at which to sample a larger training set in a subsequent collection effort. The pilot sample contains $5{,}000$ total instances, with $n_{A}=n_{B}=2{,}500$. From the pilot sample, we construct a series of subset configurations $(n_{A},n_{B})$, similar to the procedure outlined in Section B.5, using relative allocations in $[0.0625\!:\!1,\ 0.125\!:\!1,\ 0.25\!:\!1,\ 0.5\!:\!1,\ 1\!:\!1]$ and $x\in[0.02, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]$. For each subset configuration, we sample 5 random seeds. From these results, we estimate the scaling parameters of Eq. (7) according to the performance of each subset configuration on the held-out evaluation set. We set the minimum number of points from which to fit the scaling parameters to $M_{g}=250$.

Next, we use these estimated fits to extrapolate predicted per-group losses at larger sample sizes. For a given sample size, we suggest the $\hat{\alpha}^{*}$ that minimizes the maximum of the estimated per-group losses. We use the training data held separate from the pilot sample to sample a training set at allocation $\vec{\alpha}=\hat{\alpha}^{*}$, and evaluate performance on the original held-out evaluation set. We follow the same procedure for the baseline allocations $\vec{\alpha}=\vec{\gamma}$ and $\vec{\alpha}=(0.5,0.5)$.
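A minimal sketch of this selection step: given per-group fits of the (assumed) form above, predict each group's loss at the new sample size for every candidate allocation and pick the minimax allocation. The parameter values below are placeholders, not our pilot-sample estimates.

```python
import numpy as np

def predicted_loss(fit, n_g, n):
    sigma, p, tau, q, delta = fit
    return sigma * n_g ** (-p) + tau * n ** (-q) + delta

# Placeholder pilot-sample fits for groups A and B: (sigma, p, tau, q, delta).
fits = {"A": (0.5, 0.4, 0.3, 0.5, 0.01), "B": (0.8, 0.3, 0.4, 0.5, 0.02)}

def minmax_alpha(fits, n_new, grid=np.linspace(0.01, 0.99, 99)):
    # alpha = fraction of the n_new samples allocated to group A.
    worst = [max(predicted_loss(fits["A"], a * n_new, n_new),
                 predicted_loss(fits["B"], (1 - a) * n_new, n_new))
             for a in grid]
    return float(grid[int(np.argmin(worst))])

print(minmax_alpha(fits, n_new=40_000))  # suggested alpha-hat*
```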

As a point of comparison, we also compute the results for all $\alpha$ in a grid of resolution $0.01$, i.e., $\alpha\in[0.0, 0.01, 0.02, \ldots, 0.99, 1.0]$, and denote by $\alpha^{*}_{\textrm{grid}}$ the allocation value in this grid that minimizes the average maximum group loss over trials. The best allocations possible (up to the 0.01 grid resolution) were: $\alpha^{*}_{\textrm{grid}}=0.01$ for $n_{\textrm{new}}=5{,}000$, $\alpha^{*}_{\textrm{grid}}=0.06$ for $n_{\textrm{new}}=10{,}000$, $\alpha^{*}_{\textrm{grid}}=0.03$ for $n_{\textrm{new}}=20{,}000$, and $\alpha^{*}_{\textrm{grid}}=0.07$ for $n_{\textrm{new}}=40{,}000$. The $\alpha^{*}_{\textrm{grid}}$ baseline helps contextualize the performance of the other allocation strategies relative to the optimal-in-hindsight allocation, $\alpha^{*}_{\textrm{minmax}}$. Furthermore, the variability across trials due to subsetting around $\alpha=\alpha^{*}_{\textrm{grid}}$ is largely consistent with that of the other allocation sampling strategies examined.

We repeat this entire procedure (starting with generation of the pilot sample) for 10 random seeds; the results are reported in Figure 2. Since the Goodreads dataset has enough training data outside of the pilot sample to simulate gathering larger datasets of up to $40{,}000$ instances, we simulate collecting a new training dataset of size $n_{\textrm{new}}\in[5000, 10000, 20000, 40000]$, i.e., $[1\times, 2\times, 4\times, 8\times]$ the size of the pilot training dataset, respectively.

The error bars in Figure 2 are similarly large for all three allocation strategies we compare. The allocation $\vec{\alpha}=\hat{\alpha}^{*}_{\textrm{minmax}}$ when $n_{\textrm{new}}=40{,}000$ is an exception, indicating that high variation in $\hat{\alpha}^{*}_{\textrm{minmax}}$ may be an issue when $n_{\textrm{new}}$ is large relative to the pilot training sample size.

B.7 Interactions Between Groups

While the main experiments exclude the SONIC study from the ISIC dataset, consistent with the ‘non-patch’ dataset in Sohoni et al. (2020), this experiment utilizes the larger corpus of labeled images from the ISIC repository; the difference is the inclusion of the SONIC sub-study. This larger dataset has 21,203 instances after restricting to cases labeled precisely as benign or malignant. The number of data instances coming from each sub-study is given by the grey bars (black annotated numbers) in Figure 4; purple bars denote the number of malignant instances. Note the logarithmic scale.

We separate 16,965 images into the training set and the remaining 4,238 into the test set, with equal ratios of each sub-study represented in the train and test sets. The train and test splits differ between the study-level and sub-study-level analyses, accounting for differing values in corresponding cells of Figure 3(a) and Figure 3(b); this is particularly apparent when evaluating on JID Editorial Images, the study with the smallest number of data instances (see Figure 4). We choose hyperparameters based on a 5-fold cross-validation procedure on the 16,965 training instances, searching over the same hyperparameter options as in Section B.3, ultimately using 20 epochs, momentum 0.9, weight decay 0.001, and learning rate 0.001 to tune the resnet-18 model (with the ERM objective). For each of the 5 studies and 10 sub-studies that comprise the larger dataset, we retrain the resnet-18 model on the training set with that study or sub-study excluded from the training data, and compare to the performance of the model trained on the entire training set. As in Section B.5, we keep the number of gradient steps for each training process roughly equal. We run 10 random seeds for all conditions and report the differences in the mean performance metrics across groups.

Figure 4: Sub-studies that comprise the ISIC dataset. The SONIC study is excluded from the main analysis.

Appendix C Supplementary Experiments

In this appendix we detail additional experiments to supplement the findings of Section 4. These experiments investigate the robustness of our findings to different problem formulations and data pre-processing.

C.1 The Effects of Including or Excluding Group Features

Here we compare models that use group or demographic information as features with models that do not. We examine differences in group performance across $\vec{\alpha}$ and the effect of importance weighting in both cases.
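Throughout, importance weighting (IW) re-weights each training example from group $g$ by $\gamma_{g}/\alpha_{g}$, so the weighted training distribution matches the target proportions; a minimal sketch of this standard re-weighting, with toy values (our formulation, not code from the paper):

```python
import numpy as np

gamma = {"A": 0.5, "B": 0.5}   # target population proportions
alpha = {"A": 0.2, "B": 0.8}   # training allocation proportions

groups = np.array(["A", "B", "B", "B", "A"])           # toy group labels
weights = np.array([gamma[g] / alpha[g] for g in groups])
print(weights)  # [2.5, 0.625, 0.625, 0.625, 2.5]
# e.g., passed as sample_weight=weights to a model's .fit().
```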

| dataset | $M_g$ | group $g$ | $\hat{\sigma}_g$ | $\hat{p}_g$ | $\hat{\tau}_g$ | $\hat{q}_g$ | $\hat{\delta}_g$ |
|---|---|---|---|---|---|---|---|
| Mooc (with dem. features) | 50 | edu $\leq 2^{\circ}$ | 0.08 (2.6e-05) | 0.14 (6.0e-03) | 0.73 (0.059) | 0.63 (4.8e-03) | 1.3e-15 (2.6e-04) |
| | | edu $>2^{\circ}$ | 0.038 (6.2e-04) | 0.068 (6.3e-03) | 0.54 (6.5e-03) | 0.61 (9.8e-04) | 2.8e-12 (8.0e-04) |
| Mooc (without dem. features) | 50 | edu $\leq 2^{\circ}$ | 0.41 (0.86) | 1.0 (0.26) | 0.6 (0.028) | 0.56 (3.5e-03) | 0.029 (1.4e-06) |
| | | edu $>2^{\circ}$ | 0.029 (1.3e-03) | 0.055 (0.011) | 0.33 (2.2e-03) | 0.54 (9.5e-04) | 1.9e-14 (1.5e-03) |
| Adult (with group features) | 50 | female | 0.036 (3.2e-06) | 0.14 (2.5e-03) | 0.23 (3.3e-03) | 0.47 (2.3e-03) | 0.055 (2.1e-05) |
| | | male | 0.12 (6.8e-05) | 0.24 (5.8e-04) | 0.3 (6.2e-03) | 0.47 (2.7e-03) | 0.16 (3.5e-06) |
| Adult (without group features) | 50 | female | 0.078 (0.051) | 0.018 (3.6e-03) | 0.43 (8.3e-03) | 0.59 (1.6e-03) | 8.0e-16 (0.052) |
| | | male | 0.066 (2.6e-05) | 0.21 (1.2e-03) | 0.47 (6.5e-03) | 0.50 (1.1e-03) | 0.16 (5.4e-06) |

Table 4: Estimated scaling parameters for Eq. (7) with and without demographic features included. Parentheses denote standard deviations estimated by the nonlinear least squares fit. Parameters are constrained so that $\hat{\tau}_{g},\hat{\sigma}_{g}\geq 0$ and $\hat{p}_{g},\hat{q}_{g}\in[0,2]$.
Figure 1: Comparison of phenomena with group information included or excluded from the training set. Panels: (a) Mooc dataset with demographic features excluded; (b) Mooc dataset with demographic features included; (c) Adult dataset with demographic features excluded; (d) Adult dataset with demographic features included.

For Mooc, we compare the model used in the main analysis against a model trained with all demographic information removed, including education level, the feature by which we group data instances. Five features remain: number of active days, number of video plays, number of forum posts, number of events, and course-id, which we one-hot encode. The random forest model is still the best performing of those we considered, though the optimal hyperparameters after a 5-fold cross-validation search shifted to num_trees=400 and max_depth=8. For the Adult dataset, we add back in the features for sex, wife, and husband. This results in 13 unique features and, after one-hot encoding, 103 feature columns. After running a 5-fold cross-validation search with the new feature matrix, the optimal parameters were num_trees=400 and max_depth=16.
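A sketch of the resulting Mooc model without demographic features, written as a scikit-learn pipeline (the column names are our stand-ins for the five features listed above, and num_trees corresponds to n_estimators in scikit-learn):

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# One-hot encode course-id; pass the four numeric activity features through.
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["course_id"])],
    remainder="passthrough",
)
model = Pipeline([
    ("encode", preprocess),
    ("rf", RandomForestClassifier(n_estimators=400, max_depth=8)),
])
# model.fit(X_train, y_train) on a DataFrame with columns: n_active_days,
# n_video_plays, n_forum_posts, n_events, course_id.
```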

Figure 1 compares the results after excluding (left column) and including (right column) this information in the training feature matrix. Models with the group features excluded generally exhibit less degradation in performance for $\alpha_{A}$ near 0 or 1. The negative impacts of IW are lessened slightly when the models do not incorporate groups as features.

The different scaling law fits are shown in Table 4. For the Adult task without group features, the estimated exponent and scaling constant on group-agnostic data, $\hat{q}_{g}$ and $\hat{\tau}_{g}$, are larger than for the model with demographic features. For the Mooc task, the effect is reversed. In all settings, the fitted parameters show that $n$ influences group errors more than $n_{g}$, so long as $n_{g}$ is at least $M_{g}=50$ (the region over which we estimate the fit).

In the main analysis, we include demographic features for the Mooc certification prediction task, as this increases accuracy for the “edu $\leq$ secondary” group, and exclude gender features from the Adult income prediction task, as doing so does not greatly impact performance for $\alpha_{A}$ near 0.5 and improves performance for more extreme settings of $\alpha_{A}$ (Figure 1). For the other three datasets we study, there is no single feature corresponding to the group that we can hold out in a similar fashion.

C.2 Trends across models and performance metrics

Figure 2 contrasts the trade-offs we can navigate with different allocations and re-weighting schemes for different models and accuracy metrics. The left-hand side of Figure 2 shows the result of the model fitting and evaluation procedure that minimizes the $\ell_{1}$ loss (MAE) of the Goodreads predictions (the same results as shown in Figure 1). The right-hand side shows the result of the same model fitting and evaluation procedure applied to minimizing the $\ell_{2}$ loss (MSE), for which ridge regression is the best performing model of those we consider. The ridge model evaluated with MSE exhibits similar trends across $\vec{\alpha}$ values as the multiclass logistic regression model evaluated on MAE. The value $\alpha^{*}_{A}$ that minimizes population loss is similar for both methods: near $\gamma_{A}$, though on opposite sides of it. The degradation in population loss for $\alpha_{A}$ close to 0 or 1 is also apparent for the ridge model, and the range of $\vec{\alpha}$ for which importance weighting (IW) increases per-group and population losses compared to empirical risk minimization (ERM) is similar across models.
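The sketch below illustrates the two model/metric pairings on synthetic stand-in data (the Goodreads features and ratings are not reproduced here): a multiclass logistic regression scored with MAE, and ridge regression scored with MSE.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
# Synthetic 1-5 star ratings driven by the first feature.
y = np.clip(np.round(X[:, 0] + rng.normal(scale=0.5, size=500)) + 3, 1, 5)
X_tr, X_te, y_tr, y_te = X[:400], X[400:], y[:400], y[400:]

# Left panel analogue: multiclass logistic regression, evaluated with l1 loss (MAE).
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("MAE:", mean_absolute_error(y_te, clf.predict(X_te)))

# Right panel analogue: ridge regression, evaluated with l2 loss (MSE).
reg = Ridge(alpha=1.0).fit(X_tr, y_tr)
print("MSE:", mean_squared_error(y_te, reg.predict(X_te)))
```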

Figure 2: Comparison of phenomena for different accuracy metrics and models on the Goodreads dataset.