How many labelers do you have?
A closer look at gold-standard labels
Abstract
The construction of most supervised learning datasets revolves around collecting multiple labels for each instance, then aggregating the labels to form a type of “gold-standard”. We question the wisdom of this pipeline by developing a (stylized) theoretical model of this process and analyzing its statistical consequences, showing how access to non-aggregated label information can make training well-calibrated models more feasible than it is with gold-standard labels. The entire story, however, is subtle, and the contrasts between aggregated and fuller label information depend on the particulars of the problem, where estimators that use aggregated information exhibit robust but slower rates of convergence, while estimators that can effectively leverage all labels converge more quickly if they have fidelity to (or can learn) the true labeling process. The theory makes several predictions for real-world datasets, including when non-aggregate labels should improve learning performance, which we test to corroborate the validity of our predictions.
1 Introduction
The centrality of data collection to the development of statistical machine learning is evident [12], with numerous challenge datasets driving advances [27, 25, 1, 22, 11, 37, 38]. Essential to these is the collection of labeled data. While in the past, experts could provide reliable labels for reasonably sized datasets, the cost and size of modern datasets often precludes this expert annotation, motivating a growing literature on crowdsourcing and other sophisticated dataset generation strategies that aggregate expert and non-expert feedback or collect internet-based loosely supervised and multimodal data [10, 20, 48, 37, 34, 38, 13]. By aggregating multiple labels, one typically hopes to obtain clean, true, “gold-standard” data. Yet most statistical machine learning development—theoretical or methodological—does not investigate this full data generating process, assuming only that data comes in the form of pairs of covariates and targets (labels) [45, 5, 2, 17]. Here, we argue for a more holistic perspective: broadly, that analysis and algorithmic development should focus on the more complete machine learning pipeline, from dataset construction to model output; and more narrowly, questioning such aggregation strategies and the extent to which such cleaned data is essential or even useful.
To that end, we develop a stylized theoretical model to capture uncertainties in the labeling process, allowing us to understand the contrasts, limitations and possible improvements of using aggregated or non-aggregated data in a statistical learning pipeline. We model each example as a pair $(x_i, (y_{i1}, \ldots, y_{im}))$, where $x_i$ is a data point and $y_{i1}, \ldots, y_{im}$ are noisy labels. In the most basic formulation of our results, we compare two methods: empirical risk minimization using all the labels, and empirical risk minimization using cleaned labels based on majority vote. While this grossly simplifies modern crowdsourcing and other label aggregation strategies [10, 37, 34], the simplicity allows us to (i) understand fundamental limitations of algorithms based on majority-vote aggregation, (ii) circumvent these limits by using full, non-aggregated information, and (iii) lay a potential base for future work: the purpose of the simplicity is to allow application of classical statistical tools. By carefully analyzing these models, our main results show that models fit using non-aggregated label information outperform "standard" estimators that use aggregated (cleaned, majority-vote) labels, so long as the model has fidelity to the data; majority-vote estimators, in contrast, provide robust (but slower) convergence, and these tradeoffs are fundamental. We develop several extensions to these basic results, including misspecified models, semiparametric scenarios where one must learn link functions, and simple models of learned annotator reliability.
While our models are stylized, they also make several concrete and testable predictions for real datasets; if our approach provides a useful abstraction, it must suggest improvements in learning even for more complex and challenging to analyze scenarios. Our theory predicts that methods that fit predictive models on non-aggregated data should both make better-calibrated predictions and, in general, have lower classification error than models that use aggregated clean labels. To that end, we consider two real datasets as well as one large scale semisynthetic dataset, and they corroborate the predictions and implications of our theory even beyond logistic models. In particular, majority-vote based algorithms yield uncalibrated models in all experiments, whereas the algorithms that use full-label information train (more) calibrated models. Moreover, the former algorithms exhibit worse classification error in our experiments, with the error gap depending on parameters—such as inherent label noise—that we can also address in our theoretical models.
1.1 Problem formulation
To situate our results, we begin by providing the key ingredients in the paper.
The model with multiple labels.
Consider a binary classification problem with data points $x_i \in \mathbb{R}^d$, $i = 1, \ldots, n$, and $m$ labelers. We assume each labeler annotates data points independently through a generalized linear model, where the labelers use possibly different link functions $\sigma_j : \mathbb{R} \to [0, 1]$, $j = 1, \ldots, m$.
Here $\sigma_j(t) \ge 1/2$ for $t \ge 0$ and $\sigma_j(t) \le 1/2$ for $t \le 0$. If $\sigma(-t) = 1 - \sigma(t)$ for all $t$, we say the link function is symmetric and denote the class of symmetric link functions accordingly. The link functions generate labels $y_{ij} \in \{-1, +1\}$ via the distribution
$$\mathbb{P}(y_{ij} = 1 \mid x_i) = \sigma_j(\langle \theta^\star, x_i \rangle), \qquad \mathbb{P}(y_{ij} = -1 \mid x_i) = 1 - \sigma_j(\langle \theta^\star, x_i \rangle). \tag{1}$$
Key to our stylized model, and what allows our analysis, is that we assume labelers use the same linear classifier $\theta^\star \in \mathbb{R}^d$—though each labeler may have a distinct link $\sigma_j$—so we obtain conditionally independent labels $y_{i1}, \ldots, y_{im}$ given $x_i$. For example, in the logistic model where labelers have identical link, $\sigma_j(t) = 1/(1 + e^{-t})$ for each $j$. We seek to recover $\theta^\star$ or the direction $\theta^\star / \|\theta^\star\|_2$ from the observations $\{(x_i, y_{i1}, \ldots, y_{im})\}_{i=1}^n$.
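To make the generative model concrete, the following minimal Python sketch draws a sample from it; the isotropic Gaussian covariates, the common logistic link, and the function name are our illustrative choices rather than requirements of the model.

```python
import numpy as np

def sample_multilabel_data(n, d, m, theta_star, rng=None):
    """Draw (x_i, y_i1, ..., y_im): shared linear score, logistic link,
    conditionally independent labels in {-1, +1}. Illustrative sketch."""
    rng = np.random.default_rng(rng)
    X = rng.standard_normal((n, d))                # isotropic Gaussian features
    p = 1.0 / (1.0 + np.exp(-(X @ theta_star)))    # P(y = +1 | x)
    Y = np.where(rng.random((n, m)) < p[:, None], 1, -1)
    return X, Y

# Example: n = 1000 points in d = 5 dimensions with m = 7 labelers.
theta_star = np.array([2.0, -1.0, 0.5, 0.0, 1.0])
X, Y = sample_multilabel_data(1000, 5, 7, theta_star, rng=0)
```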
Classification and calibration.
For an estimator $\hat\theta$ and associated direction $\hat\theta / \|\hat\theta\|_2$, we measure performance through
(i) the classification (direction) error $\bigl\| \hat\theta / \|\hat\theta\|_2 - \theta^\star / \|\theta^\star\|_2 \bigr\|_2$, and
(ii) the calibration error $\| \hat\theta - \theta^\star \|_2$.
We term these classification and calibration from the rationale that for classification, we only need to control the difference between the directions $\hat\theta / \|\hat\theta\|_2$ and $\theta^\star / \|\theta^\star\|_2$, while calibration—that for a new data point $x$, the predicted probability $\sigma(\langle \hat\theta, x \rangle)$ is close to $\mathbb{P}(y = 1 \mid x)$—requires controlling the error in $\hat\theta$ as an estimate of $\theta^\star$. As another brief motivation for calling (i) the classification error, note that if $x$ has a rotationally symmetric distribution, then for any unit vectors $u, v$ we have $\mathbb{P}(\operatorname{sign}\langle u, x \rangle \neq \operatorname{sign}\langle v, x \rangle) = \frac{1}{\pi}\arccos\langle u, v \rangle$; because $\arccos\langle u, v \rangle \asymp \|u - v\|_2$ as $\|u - v\|_2 \to 0$, the error (i) is asymptotically equivalent to the angle between $\hat\theta$ and $\theta^\star$.
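In code, the two measures compare normalized directions (classification) and raw parameter vectors (calibration); the helpers below transcribe the measures as written above, with names of our choosing.

```python
import numpy as np

def classification_error(theta_hat, theta_star):
    """Direction error (cf. (i)): distance between the normalized vectors."""
    u_hat = theta_hat / np.linalg.norm(theta_hat)
    u_star = theta_star / np.linalg.norm(theta_star)
    return np.linalg.norm(u_hat - u_star)

def calibration_error(theta_hat, theta_star):
    """Parameter error (cf. (ii)): theta_hat as an estimate of theta_star itself."""
    return np.linalg.norm(theta_hat - theta_star)
```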
Estimators.
We consider two types of estimators: one using aggregated labels and the other using each label from the different annotators. At the highest level, the aggregated estimator depends on a single processed label $\overline{y}_i$ for each example $x_i$, while the non-aggregated estimator uses all the labels $y_{i1}, \ldots, y_{im}$. To center the discussion, we instantiate this via logistic regression (with generalizations in the sequel). For the logistic link $\sigma(t) = 1/(1 + e^{-t})$, define the logistic loss
$$\ell(\theta; (x, y)) := \log\bigl(1 + \exp(-y \langle \theta, x \rangle)\bigr).$$
In the non-aggregated model, we let $P_n^{\mathrm{ml}}$ be the empirical measure on the pairs $\{(x_i, y_{ij})\}_{i \le n, j \le m}$ and consider the logistic regression estimator
$$\hat\theta^{\mathrm{ml}}_n \in \operatorname*{argmin}_{\theta} \; P_n^{\mathrm{ml}} \ell(\theta; (x, y)) = \operatorname*{argmin}_{\theta} \; \frac{1}{nm} \sum_{i=1}^n \sum_{j=1}^m \ell(\theta; (x_i, y_{ij})), \tag{2}$$
which is the maximum likelihood estimator (MLE) assuming the logistic model is true. We focus on the simplest aggregation strategy, where example $i$ has majority vote label
$$\overline{y}_i := \operatorname{sign}\Bigl(\sum_{j=1}^m y_{ij}\Bigr), \quad \text{breaking ties uniformly at random}.$$
Then letting $\overline{P}_n$ be the empirical measure on $\{(x_i, \overline{y}_i)\}_{i=1}^n$, the majority-vote estimator solves
$$\hat\theta^{\mathrm{mv}}_n \in \operatorname*{argmin}_{\theta} \; \overline{P}_n \ell(\theta; (x, \overline{y})) = \operatorname*{argmin}_{\theta} \; \frac{1}{n} \sum_{i=1}^n \ell(\theta; (x_i, \overline{y}_i)). \tag{3}$$
Method (3) acts as our proxy for the "standard" data analysis pipeline, with cleaned labels, while method (2) is our proxy for non-aggregated methods using all labels. A more general formulation than the majority vote (3) could allow complex aggregation strategies, e.g., crowdsourcing, but we abstract away details to capture what we view as the essentials for statistical learning problems (e.g. CIFAR [22] or ImageNet [37]) where only aggregated label information is available.
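As a concrete reference implementation of the two pipelines, the sketch below fits the non-aggregated estimator (2) and the majority-vote estimator (3) with scikit-learn; the very weak regularization (large C) approximates the unpenalized objectives, and the function names are ours.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_multilabel(X, Y):
    """Eq. (2): pool every (x_i, y_ij) pair and run logistic regression."""
    n, m = Y.shape
    X_rep = np.repeat(X, m, axis=0)             # each example appears m times
    clf = LogisticRegression(C=1e6, fit_intercept=False, max_iter=5000)
    clf.fit(X_rep, Y.reshape(-1))
    return clf.coef_.ravel()

def fit_majority_vote(X, Y, rng=None):
    """Eq. (3): aggregate to one majority-vote label per example, then fit."""
    rng = np.random.default_rng(rng)
    sums = Y.sum(axis=1)
    ties = rng.choice([-1, 1], size=len(sums))  # break ties uniformly at random
    y_bar = np.where(sums != 0, np.sign(sums), ties)
    clf = LogisticRegression(C=1e6, fit_intercept=False, max_iter=5000)
    clf.fit(X, y_bar)
    return clf.coef_.ravel()
```

With data from the sampler above, the two fits can be compared directly via the classification and calibration errors defined earlier.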
Our main technical approaches characterize the estimators $\hat\theta^{\mathrm{ml}}_n$ and $\hat\theta^{\mathrm{mv}}_n$ via asymptotic calculations. Under appropriate assumptions on the data generating mechanisms (1), which will include misspecification, we both (i) provide consistency results that elucidate the infinite sample limits for $\hat\theta^{\mathrm{ml}}_n$, $\hat\theta^{\mathrm{mv}}_n$, and a few more general estimators, and (ii) carefully evaluate their limit distributions via asymptotic normality calculations. The latter allows direct comparisons between the different estimators through their limiting covariances, which exhibit (to us) interestingly varying dependence on the number of labelers $m$ and the scaling of the true parameter $\theta^\star$.
1.2 Summary of theoretical results and implications
We obtain several results clarifying the distinctions between estimators that use aggregated labels and those that treat the labels individually.
Improved performance using multiple labels for well-specified models
As in our discussion above, our main approach to highlighting the import of multiple labels is through asymptotic characterization of the estimators (2)–(3) and similar estimators. We begin this in Section 2 by focusing on the particular case that the labelers follow a logistic model. As specializations of our major results to come, we show that the multi-label MLE (2) is calibrated and enjoys faster rates of convergence (in the number of labelers $m$) than the majority-vote estimator (3). The improvements depend in subtle ways on the particulars of the underlying distribution, and we connect them to Mammen-Tsybakov-type noise conditions in Propositions 1 and 2. In "standard" cases (say, Gaussian features), the improvement scales as $\sqrt{m}$; for problems with little classification noise, the majority vote estimator becomes brittle (and relatively slower), while the convergence rate gap decreases for noisier problems.
Robustness of majority-vote estimators
Nonetheless, our results also provide support for majority vote estimators of the form (3). Indeed, in Section 3 we provide master results and consequences that hold for both well- and mis-specified losses, which highlight the robustness of the majority vote estimator. While MLE-type estimators (2) enjoy faster rates of convergence when the model is correct, these rates break down when the model is mis-specified in ways we make precise; in contrast, majority-vote-type estimators (3) maintain their (not quite optimal) convergence guarantees, yielding $\sqrt{m}$-rate improvements even under mis-specification, suggesting the practical value of cleaning data when modeling labeler uncertainty is impossible.
Fundamental limits of majority vote
In Section 4, we develop fundamental limits for estimation given only majority vote labels . We start with a simple result that any estimator based on majority vote aggregation cannot generally be calibrated. Leveraging local asymptotic minimax theory, we also provide lower bounds for estimating the direction and parameter given majority vote labels, which coincide with our results in Section 2. These results highlight both (i) that cleaned labels necessarily force slower convergence in the number of labelers and (ii) the robustness of majority-vote-based estimators: even with a mis-specified link function, they achieve rate-optimal convergence for procedures receiving only cleaned labels .
Semi-parametric approaches
The final theoretical component of the paper (Section 5) provides an exemplar approach that puts the pieces together and achieves efficiency via semi-parametric approaches. While not the main focus of the paper, we highlight two applications. In the first, we use an initial estimator to fit a link function, then produce a refined estimator minimizing the induced semiparametric loss and recovering efficient estimation. In the second, we highlight how our results provide potential insights into existing crowdsourcing techniques by leveraging a blackbox crowdsourcing algorithm providing measures of labeler reliability to achieve (optimal) estimates.
Experimental evaluation
Our theoretical results predict two outcomes: first, that if the model has fidelity to the labeling process, then using all noisy labels should yield better estimates than procedures using cleaned labels, and second, that majority vote estimators should be more robust. In Section 6, we therefore provide real and semi-synthetic experiments that corroborate the theoretical predictions; the experiments also highlight a need, which some researchers are now beginning to address [13], for more applied development of actual datasets, so that we can more carefully evaluate the downstream consequences of filtering, cleaning, and otherwise manipulating the data we use to fit our models.
1.3 Related work
We briefly overview related work, acknowledging that to make our theoretical model tractable, we capture only a few of the complexities inherent in dataset construction. Label aggregation strategies often attempt to evaluate labeler uncertainty, dating to Dawid and Skene [10], who study labeler uncertainty estimation to overcome noise in clinical patient measurements. With the rise of crowdsourcing, such reliability models have attracted substantial recent interest, with approaches for optimal budget allocation [21], addressing untrustworthy or malicious labelers [8], and more broadly an intensive line of work studying crowd labeling and aggregation [48, 47, 42, 34], with substantial applications [11, 37].
The focus in many of these, however, is to obtain a single clean and trustworthy label for each example. Thus, while these aggregation techniques have been successful and important, our work takes a different perspective. First, we target statistical analysis for the full learning pipeline—to understand the theoretical landscape of the learning problem with multiple labels for each example—as opposed to obtaining only clean labels. Moreover, we argue for an increased focus on calibration of the resulting predictive model, which aggregated (clean) labels necessarily impede. We therefore adopt a perspective similar to Peterson et al. and Platanios et al.’s applied work [29, 32], which highlight ways that incorporating human uncertainty into learning pipelines can make classification “more robust” [29]. Instead of splitting the learning procedure into two phases, where the first aggregates labels and the second trains, we simply use non-aggregated labels throughout learning. Platanios et al. [32] propose a richer model than our stylized scenarios, directly modeling labeler reliability, but the simpler approaches we investigate allow us to be theoretically precise about the limiting behavior and performance of the methods.
Our results complement a new strain of experimental and applied work in machine learning. The Neural Information Processing Systems conference, as of 2021, includes a Datasets and Benchmarks track, recognizing that they “are crucial for the development of machine learning methods” [3]. Gadre et al. [13] introduce DataComp, a testbed for experimental work on datasets, building off the long history of dataset-driven development in machine learning [16]. We take a more theoretical approach, attempting to lay mathematical foundations for this line of work.
We mention in passing that our precise consistency results rely on distributional assumptions on the covariates , for example, that they are Gaussian. That such technical conditions appear may be familiar from work, for example, in single-index or nonlinear measurement models [31, 30]. In such settings, one assumes for an unknown increasing , and it is essential that to allow estimation of ; we leverage similar techniques.
Notation
We use $\|v\|_2$ to denote the Euclidean norm of a vector $v$. For a matrix $A$, $\|A\|_{\mathrm{op}}$ is its spectral norm, and $A^\dagger$ is its Moore-Penrose pseudo-inverse. For a unit vector $u$, the projection operator to the orthogonal space of $u$ is $P_{u^\perp} := I - uu^\top$. We write $a \lesssim b$ if there exists a numerical constant $C < \infty$ such that $a \le Cb$, and $a \asymp b$ if there exist numerical constants $0 < c \le C < \infty$ such that $cb \le a \le Cb$. We also use the empirical process notation $P_n f := \frac{1}{n}\sum_{i=1}^n f(x_i)$. We let $a_m = o(b_m)$ denote that $a_m / b_m \to 0$ as $m \to \infty$.
2 The well-specified logistic model
We begin in a setting that allows the cleanest and most precise comparisons between a method using aggregated labels and one without by considering the logistic model for the link (1), $\sigma(t) = 1/(1 + e^{-t})$.
This simplicity allows results that highlight many of the conclusions we draw, and so we present initial, relatively clean, results for the estimators (2) and (3) here. In particular, we assume identical links $\sigma_j \equiv \sigma$ and have an i.i.d. sample $\{(x_i, y_{i1}, \ldots, y_{im})\}_{i=1}^n$, where for each $i$ we draw the labels independently with $\mathbb{P}(y_{ij} = 1 \mid x_i) = \sigma(\langle \theta^\star, x_i \rangle)$ for a true vector $\theta^\star \in \mathbb{R}^d$.
2.1 The isotropic Gaussian case
To give a general taste of our results on the performance of the full information (2) and majority vote (3) approaches, we start by studying the simplest case, when $x_i \stackrel{\mathrm{iid}}{\sim} \mathsf{N}(0, I_d)$.
Performance with non-aggregated data.
The non-aggregated MLE estimator in Eq. (2) admits a standard analysis [43], which we state as a corollary of a result where $x$ has a more general distribution in Proposition 1 to come.
Corollary 1.
Let and . The maximum likelihood estimator is consistent, with , and for ,
The first part of Corollary 1 demonstrates that the non-aggregated MLE classifier is calibrated: it recovers both the direction and scale of $\theta^\star$. Moreover, the second part shows that this classifier enjoys convergence rates that roughly scale as $1/(nm)$, so that a linear increase in the label size $m$ roughly yields a linear increase in convergence rate.
Performance with majority-vote aggregation.
The analysis of the majority vote estimator (3) requires more care, though the assumption that $x \sim \mathsf{N}(0, I_d)$ allows us to calculate the limits explicitly. In brief, we show that when $x$ is Gaussian, the estimator is not calibrated and has slower convergence rates in $m$ for classification error than the non-aggregated classifier. The basic idea is that a classifier fit using majority vote labels should still point in the direction of $\theta^\star$, but it should be (roughly) "calibrated" to the probability of a majority of labels being correct.
We follow this idea and sketch the derivation here, as it is central to all of our coming theorems, and then state the companion corollary to Corollary 1. Each result depends on the probability
of obtaining a correct label using majority vote, where defines the binomial probability function
(4) |
when $m$ is odd. (When $m$ is even, the final sum has an additional additive term accounting for ties, which is asymptotically negligible.) Key to the coming result is choosing a parameter to roughly equalize binomial (majority vote) and logistic (Bernoulli) probabilities, and so we define the function
(5) |
We use to find the minimizer of population loss by considering the ansatz that for some . Using the definition (4) of , we can write
where , and compute the gradient
We set and decompose into the independent sum . Substituting in yields
(6) |
where in (i) we substitute . As we will present in Corollary 2, has a unique solution , and the global minimizer of the population loss is thus exactly .
By completing the calculations for the precise value above and performing a few asymptotic normality calculations, we have the following result, a special case of Proposition 2 to come.
Corollary 2.
Let and . There are numerical constants such that the following hold: for the function in (5), there is a unique solving and
Moreover, there exists a function as such that satisfies
It is instructive to compare the rates of this estimator to the rates for the non-aggregated MLE in Corollary 1. First, the non-aggregated estimator is calibrated in that $\hat\theta^{\mathrm{ml}}_n \to \theta^\star$, in contrast to the majority-vote estimator, which instead roughly "calibrates" to the probability that the majority vote is correct (cf. (6)). The scaling in Corollary 2 is also important: the majority-vote estimator exhibits worse convergence rates, by a factor of roughly $\sqrt{m}$, than the estimator $\hat\theta^{\mathrm{ml}}_n$; for constants that depend only on the problem parameters, the asymptotic variances differ by a factor of order $\sqrt{m}$.
To obtain intuition for this result via comparison with Corollary 1, consider the Fisher information for logistic regression. Letting $t = \langle \theta, x \rangle$ be the predicted margin, there is only "information" available for an estimator when $|t|$ is near $0$, as otherwise $\sigma'(t) \approx 0$; under majority vote, the fitted parameter scales up with $m$, so we analogously require the true margin to satisfy $|\langle \theta^\star, x \rangle| \lesssim 1/\sqrt{m}$ by standard binomial concentration. This occurs on about a $1/\sqrt{m}$ fraction of the sample, so relative to the non-aggregated MLE, majority vote "loses" a factor of $\sqrt{m}$.
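To see numerically how aggregation reshapes label probabilities (and hence why the majority-vote minimizer calibrates to a rescaled parameter), one can evaluate the binomial majority-vote accuracy of expression (4) directly; the helper below uses our naming and breaks ties at random when m is even.

```python
import numpy as np
from scipy.stats import binom

def majority_correct_prob(p, m):
    """Probability the majority of m independent labels, each correct with
    probability p, is correct; ties (even m) broken at random (cf. Eq. (4))."""
    k = np.arange(m + 1)
    pmf = binom.pmf(k, m, p)
    prob = pmf[k > m / 2].sum()
    if m % 2 == 0:
        prob += 0.5 * pmf[k == m // 2].sum()  # asymptotically negligible tie term
    return prob

sigma = lambda t: 1.0 / (1.0 + np.exp(-t))
for m in (1, 3, 9, 27, 81):
    print(f"m={m:2d}  per-label acc={sigma(0.5):.3f}  "
          f"majority acc={majority_correct_prob(sigma(0.5), m):.3f}")
```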
2.2 Comparisons for more general feature distributions
The key to the preceding results—and an indication that they are stylized—is that the covariates decompose into components aligned with and independent noise. Here, we abstract away the Gaussianity assumptions to allow a more general and nuanced development carefully tracking label noise, as margin and noise conditions play a strong role in the relative merits of maximum-likelihood-type (full label information) estimators versus those using cleaned majority-vote labels. The results, as in Sec. 2.1, are consequences of the master theorems to come in the sequel.
We first make our independence assumption.
Assumption A1.
The covariates have non-singular covariance and decompose as a sum of independent random vectors in the span of and its complement
Under these assumptions, we develop a characterization of the limiting behavior of the majority vote and non-aggregated models based on classification difficulty, adopting Mammen and Tsybakov's perspective [26] and measuring the difficulty of classification through the proximity of the probability $\mathbb{P}(y = 1 \mid x)$ to $1/2$. Thus, for a noise exponent $\alpha$, we consider the condition
() |
We see that as $\alpha$ grows the problem becomes "easier," since it is less likely to have a small margin—in particular, taking $\alpha \to \infty$ recovers a hard-margin condition for all small thresholds. Under the independent decomposition Assumption A1, the noise condition () depends solely on the covariate's projection onto the signal direction $\theta^\star / \|\theta^\star\|_2$. We therefore consider the following assumption on this projection.
Assumption A2.
For a given , is -regular, meaning that the absolute value has density on , no point mass at , and satisfies
As the logistic function satisfies , for in our logistic model (1) we have . More generally, for any link function differentiable at with , we have , so that the Mammen-Tsybakov noise condition () is equivalent to
Thus, under Assumption A2, condition () holds, as by dominated convergence we have
As a concrete case, when the features are isotropic Gaussian, the normalized margin is standard normal and Assumption A2 holds with $\alpha = 1$. We provide extensions of Corollaries 1 and 2 in the more general cases the noise exponent $\alpha$ allows.
The maximum likelihood estimator retains its convergence guarantees in this setting, and we can be more precise for the analogue of the final claim of Corollary 1 (see Appendix E.1 for a proof):
Proposition 1.
For majority-vote aggregation, we can in turn generalize Corollary 2. In this case we still have . However, the interesting factor here is that the convergence rate now depends on the noise exponent .
Proposition 2.
We defer the proof to Appendix E.2.
Paralleling the discussion in Section 2.1, we may compare the performance of the MLE $\hat\theta^{\mathrm{ml}}_n$, which uses all labels, and the majority-vote estimator using only the cleaned labels. When the classification problem is hard—meaning that $\alpha$ in Condition () is near 0, so that classifying most examples is nearly random chance—we see that the aggregation in the majority vote estimator still allows convergence (nearly) as quickly as the non-aggregated estimator; the problem is so noisy that data "cleaning" by aggregation is helpful. Yet for easier problems, where $\alpha$ is large, the gap between them grows substantially; this is sensible, as aggregation is likely to force a dataset to be separable, thus making fitting methods unstable (and indeed, a minimizer may fail to exist).
3 Label aggregation and misspecified model
The logistic link provides clean interpretation and results, but we can move beyond it to more realistic cases where labelers use distinct links, although, to allow precise statements, we still assume the same linear term for each labeler’s generalized linear model. We study generalizations of the maximum likelihood and majority vote estimators (2) and (3), highlighting dependence on link fidelity. In this setting, there are (unknown and possibly distinct) link functions , . We show that the majority-vote estimator enjoys better robustness to model mis-specification than the non-aggregated estimator , though both use identical losses. In particular, our main result in this section implies
where the constants depend only on the link functions. In contrast to the previous section, the majority-vote estimator enjoys roughly $\sqrt{m}$-faster rates than the non-aggregated estimator, maintaining its (slow) improvement with $m$, which the MLE loses to misspecification.
To set the stage for our results, we define the general link-based loss
We then consider the general multi-label estimator and the majority-vote estimator based on the loss ,
(7) |
When is the logistic link, we recover the logistic loss , and thus we recover the results in Section 2. For both the estimators, we suppress the dependence on the link to write when the context is clear.
3.1 Master results
To characterize the behavior of multiple label estimators versus majority vote, we provide master results as a foundation for our convergence rate analyses throughout. By a bit of notational chicanery, we consider both the cases that is a majority vote and that we use multiple (non-aggregated) labels simultaneously. In the case that the estimator uses the majority vote , let
and in the case that the estimator uses each label from the labelers, let
In either case, we then see that the population loss with the link-based loss becomes
(8) |
where we have taken a conditional expectation given . We assume Assumption A1 holds, that decomposes into the independent sum with , and the true link functions . We further impose the following assumption for the model link.
Assumption A3.
For each sign , the model link function satisfies for a constant and is a.e. differentiable.
Minimizer of the population loss.
We begin by characterizing—at a somewhat abstract level—the (unique) solutions to the problem of minimizing the population loss (8). To characterize the minimizer , we hypothesize that it aligns with , using the familiar ansatz that has the form . Using the formulation (8), we see that for ,
(9) |
where the final line uses the decomposition for the random vector independent of , and we recall expression (5) to define the calibration function
(10) |
The function measures the gap between the hypothesized link function and the label probabilities, functioning to approximately "calibrate" the fitted scale to the observed probabilities. If we presume that such a solution exists, then evidently the induced parameter is a minimizer of the population loss. In fact, such a solution exists and is unique (see Appendix B for a proof):
Asymptotic normality with multiple labels.
With the existence of minimizers assured, we turn to their asymptotics. For each of these, we require slightly different calculations, as the resulting covariances are slightly different. To state the result when we have multiple labels, we define the average link function and the three functions
(12) | ||||
The first, the link error , measures the mis-specification of the link relative to the average link . The second function, , is a Hessian term, as , and the third is a variance term for each labeler . We have the following theorem, which we prove in Appendix C.
Theorem 1.
Theorem 1 exhibits two dependencies: the first on the link error terms—essentially, a bias term—and the second on the rescaled average variance. So the multi-label estimator recovers an optimal covariance if the link errors are negligible, but if they are not, then it necessarily has an asymptotic covariance of constant order in $m$. The next corollary highlights how things simplify. In the well-specified case that the link is symmetric and correctly specified, the zero of the gap function (10) is evident, the link error term vanishes, and the Hessian and variance terms simplify by symmetry:
Corollary 3 (The well-specified case).
Let the conditions above hold. Then
Asymptotic normality with majority vote.
When we use the majority vote estimators, the asymptotics differ: there is no averaging to reduce variance as increases, even in a “well-specified” case. The asymptotic variance does (typically) decrease as grows, but at a slower rate, roughly related to the fraction of data where there is “signal”, as we discuss briefly following Corollary 2.
Theorem 2.
We defer the proof to Appendix D. In most cases, we will take the link function to be symmetric, so that , and thus , so that . This simplifies the denominator in Theorem 2 to . Written differently, we may define a (scalar) variance-characterizing function implicitly as follows: let be a zero of in , that is, so that is a function of the size (recall the gap (10)), and then define
(13) |
where above is implicitly defined. Then
Each of our main results, including those on well-specified models previously, then follows by characterizing the behavior of in the asymptotics as and the scaling of the solution norm , which the calibration gap (10) determines. The key is that the scaling with varies depending on the fidelity of the model, behavior of the links , and the noise exponent (), and our coming consequences of the master theorems 1 and 2 help to reify this scaling.
3.2 Robustness to model mis-specification
Having established the general convergence results for the multi-label estimator and the majority vote estimator , we further explicate their performance when we have a mis-specified model—the link is incorrect—by leveraging Theorems 1 and 2 to precisely characterize their asymptotics and show the majority-vote estimator can be more robust to model mis-specification.
Multi-label estimator.
As our focus here is descriptive, to make interpretable statements about the multi-label estimator in (7), we simplify by assuming that each link is identical. Then an immediate corollary of Theorem 1 follows:
Corollary 4.
So in this simplified case, the asymptotic covariance remains of constant order in $m$ unless the link is correctly specified. In contrast, as we now show, the majority vote estimator exhibits more robustness; this is perhaps expected, as Corollary 2 shows that in the logistic link case, which is a fortiori misspecified for majority vote labels, the majority-vote estimator has covariance scaling as $1/\sqrt{m}$, though the generality of the behavior and its distinction from Corollary 4 is interesting.
Majority vote estimator.
For the majority-vote estimator, we relax our assumptions and allow the link functions to be different, showing how the broad conclusions Corollary 4 suggests continue to hold in some generality: majority vote estimators achieve slower convergence than well-specified (maximum likelihood) estimators using each label, but exhibit more robustness. To characterize the large-$m$ behavior, we require the following regularity conditions on the average link, which we require to have a limiting derivative at $0$.
Assumption A4.
For the sequence of link functions , let . There exists such that
(14a) | ||||
(14b) | ||||
(14c) |
These assumptions simplify if the links are identical: if , we only require is differentiable around with and .
We can apply Theorem 2 to obtain asymptotic normality for the majority vote estimator (7). We recall the probability
(15) |
of the majority vote being correct given the margin and the calibration gap function (10), which by a calculation resolves to the more convenient form
The main technical challenge is to characterize the large-$m$ behavior of the asymptotic covariance function defined implicitly in the quantity (13). We postpone the details to Appendix E.3 and state the result below, which is a consequence of Theorem 2 and a careful asymptotic expansion of the covariance function (13).
Proposition 3.
Proposition 3 highlights the robustness of the majority vote estimator: even when the link is (more or less) arbitrarily incorrect, the asymptotic covariance still exhibits reasonable scaling. The noise parameter $\alpha$ in Assumption A2, roughly equivalent to the Mammen-Tsybakov noise exponent (), also plays an important role. In typical cases with $\alpha = 1$ (e.g., when the margin is Gaussian), we see covariance scaling of order $1/\sqrt{m}$. In noisier cases, corresponding to small $\alpha$, majority vote provides substantial benefit, approaching a well-specified model; conversely, in "easy" cases where $\alpha$ is large, majority vote estimators become more unstable, as they make the data (very nearly) separable, which causes logistic-regression and other margin-type estimators to be unstable [6].
4 Fundamental estimation limits of majority vote labels
We complement our convergence guarantees via lower bounds for labelers with identical logistic links. We begin with a somewhat trivial observation that, unless we know the number of labels and the link ahead of time, consistently estimating $\theta^\star$ from aggregated labels is impossible, which contrasts with algorithms using non-aggregated data. Section 4.2 presents the more precise local asymptotic efficiency limits for estimation from majority labels $\overline{y}_i$.
4.1 Inconsistency from aggregate labels
To make clear the impossibility of estimation without some knowledge of the link and number of labelers, we show that there exist distributions with different linear classifiers and generating the same data distribution. Consider a generalized linear model for binary classification, with link function , , define the model (1) with , and let , where for each data point we generate labels , . As usual, let denote the majority-vote (breaking ties randomly). Then Proposition 4 shows there always exists a calibration function and label size such that has identical distribution under both the model (1) and also with replacing and . Letting denote the induced distribution on , we have the following formal statement, which we prove in Appendix F.1.
Proposition 4.
Suppose satisfies for all . For any such that and any positive integer , there is another link function such that
4.2 Fundamental asymptotic limits in the logistic case
As most of our results repose on classical asymptotics, we provide lower bounds in the same setting, using Hajek-Le Cam local asymptotic (minimax) theory. We recall a typical result, which combines Le Cam and Yang [23, Lemma 6.6.5] and van der Vaart and Wellner [44, Theorem 3.11.5]:
Corollary 5.
Let be a family of distributions indexed by with continuous Fisher Information in a neighborhood of . Then there exist probability measures supported on such that for any quasi-convex, symmetric, and bounded loss ,
where . If is differentiable at with derivative matrix , then additionally
The infima above are taken over all estimators and based on i.i.d. observations from .
This corollary is the sense in which estimators, such as the MLE, achieving asymptotic variance equal to the inverse Fisher Information are efficient. It also makes clear that (except perhaps at a Lebesgue measure-zero set of parameters ) any estimator with asymptotic variance necessarily satisfies .
We therefore complete our theoretical analysis of fundamental limits by giving the Fisher information matrix for binary classification models with aggregated majority vote data. We assume the links are identical and logistic. Recall (see Prop. 2) that in the case where the margin is normal, the majority vote estimator has asymptotic variance of order $1/\sqrt{m}$ in the number of labelers. As we will present momentarily in Theorem 3, this is indeed rate optimal for classification as $m \to \infty$.
As usual, letting Assumption A1 hold, we write with . Let . In this case, we may explicitly decompose the Fisher information.
Lemma 4.1.
The Fisher information matrix for the majority vote model satisfies
Proof.
For , where is the -dimensional centered density of , the Fisher information matrix at is
Substituting , we then have
as desired. ∎
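Though we do not reproduce the closed form of Lemma 4.1 here, the Fisher information of the majority-vote model is straightforward to approximate numerically: for a Bernoulli observation with success probability $q_m(t)$, $t = \langle\theta, x\rangle$, the information is $\mathbb{E}\bigl[q_m'(t)^2 / (q_m(t)(1 - q_m(t))) \, xx^\top\bigr]$, which the Monte Carlo sketch below estimates under Gaussian features (the helper names are ours; $m$ is taken odd for simplicity).

```python
import numpy as np
from scipy.stats import binom

def mv_prob(t, m):
    """P(majority-vote label = +1 | margin t) under the logistic link, m odd."""
    p = 1.0 / (1.0 + np.exp(-t))
    k = np.arange(m // 2 + 1, m + 1)
    return binom.pmf(k, m, p).sum()

def mc_fisher_information(theta, m, n_mc=20_000, eps=1e-4, seed=0):
    """Monte Carlo estimate of the Fisher information of the Bernoulli model
    P(ybar = 1 | x) = mv_prob(<theta, x>, m), with x ~ N(0, I)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_mc, theta.shape[0]))
    t = X @ theta
    q = np.array([mv_prob(ti, m) for ti in t])
    dq = np.array([(mv_prob(ti + eps, m) - mv_prob(ti - eps, m)) / (2 * eps)
                   for ti in t])
    w = dq ** 2 / np.clip(q * (1 - q), 1e-12, None)  # per-sample information
    return (X * w[:, None]).T @ X / n_mc
```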
By characterizing the large-$m$ behavior of the Fisher information, we can then apply Corollary 5 to understand the optimal efficiency of any estimator given aggregate (majority vote) data.
Theorem 3.
See Appendix F.2 for the proof.
By combining Theorem 3 with the classical local asymptotic minimax bounds, we can obtain optimality for estimators of both and the direction as gets large. Computing the inverse Fisher information for and for the normalized quantity , which satisfies , we thus have the efficient limiting covariances
where we recall the shorthand . (Here we use that if are symmetric matrices with , then ; see Lemma A.1 in Appendix A.) Then combining Theorem 3 with Corollary 5 yields the following efficiency lower bound. (Other lower bounds arising from, e.g., the convolution and variants of the local asymptotic minimax theorem are possible; we present only this to give the flavor of such results.)
Corollary 6.
Let and be arbitrary estimators of and using the aggregated data . For any bounded and symmetric quasi-convex loss ,
If there exist random variables or such that (respectively, ), then and for Lebesgue almost every .
This result carries several implications. First, the scaling in both and for the asymptotic (achieved) covariance in Proposition 2 are rate-optimal: we have for the majority vote estimator . Notably, this optimal scaling holds even though the model may be mis-specified: estimators using the aggregated majority vote labels are rate-optimal in the scale and the number of labelers (and sample size ) no matter the margin-based loss. (Achieving the optimal constant requires a well-specified loss, of course.) This robustness of estimators using aggregated data may motivate the use of strong data-cleaning measures in dataset construction when it is difficult to model the particulars of the data generation process.
A perhaps less intuitive result is that for any noise scale on the margin variable arising from Assumption A2, the inverse information grows as $m \to \infty$, meaning calibration becomes fundamentally harder for estimators using only aggregated labels, even with a (known) logistic link. For example, in the Gaussian case, the scaling matches that in Corollary 2. The same phenomenon holds even in the "very low noise" regime, where the margin variable is typically far from 0 (recall the exponent ()). In this case, as $m \to \infty$, a paucity of the data aligns well with the true margin; estimators (nearly) perfectly predict on training data and become non-robust.
5 Semi-parametric approaches
The preceding analysis highlights distinctions between a fuller likelihood-based approach—which uses all the labels, as in (2)—and the robust but somewhat slower rates that majority vote estimators enjoy (as in Proposition 3). That full-label estimators’ performance so strongly depends on the fidelity of the link (recall Corollary 4) suggests that we target estimators achieving the best of both worlds: learn both a link function (or collection thereof) and refit the model using all the labels. While we mainly focus on efficiency bounds for procedures we view as closer to practice—fitting a model based on available data—in this section, we provide a few results targeting more efficient estimation schemes through semiparametric estimation approaches.
As developing full semi-parametric efficiency bounds would unnecessarily lengthen and dilute the paper, we develop a few relatively general and simple convergence results into which we can essentially plug in semiparametric estimators. In distinction from standard results in semiparametric theory (e.g. [43, Ch. 25] or [4]), our results require little more than consistent estimation of the links to recover (optimal) scaling in the asymptotic covariance, as the special structure of our classification problem allows more nuanced calculations; we assume each labeler (link function) generates labels for each of the datapoints , but we could relax the assumption at the expense of extraordinarily cumbersome notation. We give two example applications of the general theory: the first (Sec. 5.2) analyzing a full pipeline for a single index model setting, which robustly estimates direction , the link , and then re-estimates ; the second assuming a stylized black-box crowdsourcing mechanism that provides estimates of labeler reliability, highlighting how even in some crowdsourcing scenarios, there could be substantial advantages to using full label information.
5.1 Master results
For our specializations, we first provide master results that allow semi-parametric estimation of the link functions. We consider Lipschitz symmetric link functions, where for we define
We consider the general case where there are distinct labeler link functions . To eliminate ambiguity in the links, we assume the model is normalized, so . To distinguish from the typical case, we write , and for define
which allows us to consider both the standard margin-based loss and the case in which we learn separate measures of quality per labeler. With this notation, we can then naturally define the population loss
For any sequence of (estimated) links and data for , we define the semi-parametric estimator
We will demonstrate both consistency and asymptotic normality under appropriate convergence and regularity assumptions for the link functions. We assume there is a (semiparametric) collection of link functions of interest, which may coincide with but may be smaller, making estimation easier. Define the distance on by
We make the following assumption.
Assumption A5.
The links are normalized so that , and the sequence is consistent:
Additionally, the mapping is continuous for at .
The continuity of at allows us to develop local asymptotic normality. To see that we may expect the assumption to hold, we give reasonably simple conditions sufficient for it, including that the collection of links is sufficiently smooth or the data distribution is continuous enough. (See Appendix H.1 for a proof.)
Lemma 5.1.
Let be the distance in Assumption A5. Let Assumption A1 hold, where with probability one, has nonzero and continuously differentiable density on satisfying for . The mapping is continuous for at whenever and either of the following conditions holds:
(1) For any , are Lipschitz continuous.
(2) has continuous density on .
We can now present the master result for semi-parametric approaches, which characterizes the asymptotic behavior of the semi-parametric estimator with the variance function
Theorem 4.
5.2 A single index model
Our first example application of Theorem 4 is to a single index model. We present a multi-phase estimator that first estimates the direction , then uses this estimate to find a (consistent) estimate of the link , which we can then substitute directly into Theorem 4. We defer all proofs of this section to Appendix I, which also includes a few auxiliary results that we use to prove the results proper.
We present the abstract convergence result for link functions first, considering a scenario where we have an initial guess of the direction that is independent of the main sample, for example constructed via a small held-out subset of the data. We set
where and so it consists of nondecreasing -Lipschitz link functions with . We assume that for all , there exists a (potentially random) such that
Proposition 5.
Let be vectors with , where . Then with probability , there is a finite (random) such that for all large enough ,
The proof is more or less a consequence of standard convergence results for nonparametric function estimation, though we include it for completeness in Appendix I.2 as it includes a few additional technicalities because of the initial estimate of .
Summarizing, we see that a natural procedure is available: if we have models powerful enough to accurately estimate the conditional label probabilities , then Proposition 5 coupled with Theorem 4 shows that we can achieve estimation with near-optimal asymptotic covariance. In particular, if is consistent (so ), then induces a normalized estimator satisfying .
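A minimal sketch of the pipeline this section describes, under our own simplifications: a pilot direction from pooled logistic regression, a monotone (isotonic) estimate of the link along that direction, and a refit of the parameter by plugging the estimated link into the negative log-likelihood. We drop the symmetry and normalization constraints the theory imposes, and all function names are ours.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

def pilot_direction(X, Y):
    """Pilot estimate of the direction via pooled logistic regression."""
    clf = LogisticRegression(C=1e6, fit_intercept=False, max_iter=5000)
    clf.fit(np.repeat(X, Y.shape[1], axis=0), Y.reshape(-1))
    w = clf.coef_.ravel()
    return w / np.linalg.norm(w)

def estimate_link(X, Y, u0):
    """Monotone (isotonic) estimate of t -> P(y = 1 | <u0, x> = t)."""
    t = np.repeat(X @ u0, Y.shape[1])
    y01 = (Y.reshape(-1) + 1) / 2
    iso = IsotonicRegression(y_min=1e-3, y_max=1 - 1e-3, out_of_bounds="clip")
    iso.fit(t, y01)
    return iso

def refit_theta(X, Y, iso, theta_init):
    """Refit theta by minimizing the plug-in negative log-likelihood."""
    y = Y.reshape(-1)
    Xr = np.repeat(X, Y.shape[1], axis=0)

    def nll(theta):
        p = np.clip(iso.predict(Xr @ theta), 1e-3, 1 - 1e-3)
        return -np.mean(np.where(y == 1, np.log(p), np.log(1 - p)))

    # derivative-free search, since the isotonic link is piecewise constant
    return minimize(nll, theta_init, method="Powell").x
```

In keeping with Proposition 5, the pilot direction should be computed on a small held-out split so that it is independent of the sample used for link estimation and the final refit.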
5.3 Crowdsourcing model
Crowdsourcing typically targets estimating rater reliability, then using these reliability estimates to recover ground truth labels as accurately as possible, with versions of this approach central since at least Dawid and Skene’s Expectation-Maximization-based approaches [10, 48, 35]. We focus here on a simple model of rater reliability, highlighting how—at least in our stylized model of classifier learning—by combining a crowdsourcing reliability model and still using all labels in estimating a classifier, we can achieve asymptotically efficient estimates of , rather than the robust but slower estimates arising from “cleaned” labels.
We adopt Whitehill et al.’s roughly “low-rank” model for label generation [48]: for binary classification with labelers and distinct link functions , model the difficulty of by , where denotes the true class belongs to. A parameter models the expertise of annotator , and the probability labeler correctly classifies is
(See also Raykar et al. [35].) The focus in these papers was to construct gold-standard labels and datasets ; here, we take the alternative perspective we have so far advocated to show how using all labels can yield strong performance.
We thus adopt a semiparametric approach: we model the labelers, assuming a black-box crowdsourcing model that can infer each labeler’s ability, then fit the classifier. We represent labeler ’s expertise by a scalar . Given data and the normalized , we assume a modified logistic link
so a large expertise parameter represents an (essentially) omniscient labeler, while a value of zero means the labeler chooses random labels regardless of the data. Then, in keeping with the plug-in approach of the master Theorem 4, we assume the blackbox crowdsourcing model generates an estimate of the labelers' expertise parameters from the data.
We consider the algorithm using the blackbox crowdsourcing model and empirical risk minimization with the margin-based loss as in Section 5.1, with , or equivalently using the rescaled logistic loss
This allows us to apply our general semiparametric Theorem 4 as long as the crowdsourcing model produces a consistent estimate (see Appendix H.2 for a proof):
Proposition 6.
Let Assumption A1 hold, have nonzero and continuous density on , and . If , then is asymptotically normal, and the normalized estimator satisfies
By Proposition 6, the semiparametric estimator is efficient when the rater reliability estimates are consistent. It is also immediate that if are bounded, then the asymptotic covariance multiplier , so we recover the scaling of the MLE, as opposed to the slower rates of the majority vote estimators in Section 3.2. At this point, the refrain is perhaps unsurprising: using all the label information can yield much stronger convergence guarantees.
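A sketch of the resulting procedure: given black-box reliability estimates (here just an input array `a_hat`, however the crowdsourcing model produced it), we minimize a rescaled logistic loss in which labeler j's margin is multiplied by its estimated reliability, matching the modified logistic link above. The code and names are ours.

```python
import numpy as np
from scipy.optimize import minimize

def fit_reliability_weighted(X, Y, a_hat):
    """Fit theta with the reliability-rescaled logistic loss: labeler j's margin
    is scaled by a_hat[j], the black-box estimate of that labeler's expertise."""
    n, m = Y.shape

    def loss_and_grad(theta):
        z = a_hat[None, :] * Y * (X @ theta)[:, None]   # (n, m) scaled margins
        loss = np.mean(np.logaddexp(0.0, -z))
        s = 1.0 / (1.0 + np.exp(z))                     # sigmoid(-z)
        coef = -(s * a_hat[None, :] * Y).sum(axis=1) / (n * m)
        return loss, X.T @ coef

    res = minimize(loss_and_grad, np.zeros(X.shape[1]), jac=True,
                   method="L-BFGS-B")
    return res.x
```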
6 Experiments
We conclude the paper with several experiments to test whether the (sometimes implicit) methodological suggestions hold merit. Before delving into our experimental results, we detail a few of the expected behaviors our theory suggests; if we fail to see them, then the model we have proposed is too unrealistic to inform practice. First, based on the results in Section 2, we expect the classification error to be better for the non-aggregated algorithm , and the gap between the two algorithms to become larger for less noisy problems. Moreover, we only expect to be calibrated, and our theory predicts that the majority-vote estimator’s calibration worsens as the number of labels increases. Generally, so long as we model uncertainty with enough fidelity, Corollary 4 suggests that multilabel estimators should exhibit better performance than those using majority vote labels .
To that end, we provide experiments on two real datasets and one semi-synthetic dataset: the BlueBirds dataset (Section 6.1), CIFAR-10H (Section 6.2), and a modified version of the CIFAR-10 dataset (Section 6.3). We consider our two main algorithmic models:
(i) The maximum likelihood estimator based on non-aggregated data in Eq. (2).
(ii) The majority-vote based estimator of Eq. (3), our proxy for modern data pipelines.
Unfortunately, a paucity of large-scale multi-label datasets (that we know of) precludes experiments on fully real datasets at larger scale; the ImageNet creators [11, 37] no longer have any intermediate label information from their construction. To simulate the larger scale data collection process, we create a large-scale semisynthetic dataset (Section 6.3) from the raw CIFAR-10 data, using trained neural networks as labelers to provide multiple labels per example. As the ground truth model is well-specified and we know the link function, beyond the aforementioned estimators with full information and majority vote, it is also of practical interest to see how the aggregated labels from other crowdsourcing methods compare to majority vote, and to knowing all individual labels.
6.1 BlueBirds
We begin with the BlueBirds dataset [47], a relatively small dataset consisting of 108 images. The classification problem is challenging: the task is to classify each image as either an Indigo Bunting or a Blue Grosbeak (two similar-looking blue bird species). For each image, we have labels obtained through Amazon Mechanical Turk workers. We use a pretrained (on ImageNet) ResNet50 model to generate image features, then apply PCA to reduce their dimensionality.
We repeat the following experiment multiple times. For each number of labelers $m$, we fit the multilabel logistic model (2) and the majority vote estimator (3), computing calibration and classification errors via 10-fold cross validation. We measure calibration error on a held-out example $x_i$ by $|\hat p(x_i) - \overline{p}_i|$, where $\hat p(x_i)$ is the predicted probability and $\overline{p}_i$ is the empirical probability of the positive label over the $m$ labelers; we measure classification error on example $x_i$ with labels $y_{i1}, \ldots, y_{im}$ by the fraction of those labels with which the predicted class disagrees, giving an inherent noise floor because of labeler uncertainty. We report the results in Figure 1. These plots corroborate our theoretical predictions: as the number of labelers increases, both the majority vote method and the full label method exhibit improved classification error, but considering all labels gives a (significant) improvement in accuracy and in calibration error.
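The evaluation loop is simple enough to sketch; the metric implementations below follow our reading of the description above (per-example calibration gap against the empirical labeler frequency, and per-label disagreement for the classification error), with `fit_fn` standing in for either estimator (2) or (3).

```python
import numpy as np
from sklearn.model_selection import KFold

def run_trial(X, Y, m_sub, fit_fn, rng=None):
    """Subsample m_sub labelers and report 10-fold CV calibration and
    classification errors (our reading of the metrics described above)."""
    rng = np.random.default_rng(rng)
    Ys = Y[:, rng.choice(Y.shape[1], size=m_sub, replace=False)]
    calib, classif = [], []
    for tr, te in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
        theta = fit_fn(X[tr], Ys[tr])
        p_hat = 1.0 / (1.0 + np.exp(-(X[te] @ theta)))   # predicted P(y = 1 | x)
        p_emp = (Ys[te] == 1).mean(axis=1)               # empirical labeler frequency
        calib.append(np.abs(p_hat - p_emp).mean())
        y_pred = np.where(p_hat >= 0.5, 1, -1)
        classif.append((Ys[te] != y_pred[:, None]).mean())  # noise-floored error
    return np.mean(calib), np.mean(classif)
```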
Figure 1: BlueBirds results: (a) classification error and (b) calibration error as a function of the number of labelers $m$.
6.2 CIFAR-10H
Figure 2: CIFAR-10H results: (a) classification error and (b) calibration error as a function of the number of labelers $m$.
For our second experiment, we consider Peterson et al.'s CIFAR-10H dataset [29], which consists of the images in the CIFAR-10 test set with soft labeling: for each image, we have approximately 50 labels from different annotators. Each image in the dataset belongs to one of the ten classes airplane, automobile, bird, cat, deer, dog, frog, horse, ship, or truck; labelers assign each image to one of the classes. To maintain some fidelity to the binary classification setting we analyze throughout the paper, we transform the problem into a set of 10 binary classification problems. For each target class, we take each initial image/label pair, assigning binary label $+1$ if the annotator's label matches the target class and $-1$ otherwise (so the annotator labels it as an alternative class). Most of the images in the dataset are very easy to classify: more than 80% have a unanimous label from all of the labelers, meaning that the MLE and majority vote estimators (2) and (3) coincide for these. (In experiments with this full dataset, we saw little difference between the two estimators.)
As our theoretical results highlight the importance of classifier difficulty, we balance the dataset by considering subsets of harder images as follows. For each fixed target class (e.g., cat) and each image $i$, let $\hat p_i$ be the empirical probability of the target class among the 50 annotator labels. Then for $\alpha \in [0, 1/2)$, define the subsets of images whose empirical probability $\hat p_i$ lies in $[\alpha, 1 - \alpha]$,
so that $\alpha$ near $1/2$ corresponds to images with substantial confusion and $\alpha = 0$ to all images (most of which are easy). We test on the subset with $\alpha = 0.1$ (labelers have at most 90% agreement). For each image, we again generate features by taking the last layer of a pretrained ResNet50 neural network, using PCA to reduce the feature dimension. We follow the same procedure as in Sec. 6.1, subsampling labelers and using 10-fold cross validation to evaluate classification and calibration error. We report the results in Figure 2. Again we see that—as the number of labelers increases—both aggregated and non-aggregated methods evidence improved classification error, but the majority vote procedure (cleaned data) yields less improvement than one with access to all (uncertain) labels. These results are again consistent with our theoretical predictions.
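The subset construction and binarization admit a short sketch; the interval form of the filter is our reading of the definition of the subsets, consistent with the 90%-agreement description for alpha = 0.1.

```python
import numpy as np

def binary_labels_for_target(labels, target):
    """labels: (n_images, n_annotators) integer class labels in {0, ..., 9}.
    Returns {-1, +1} one-vs-rest labels for class `target`."""
    return np.where(labels == target, 1, -1)

def hard_subset(labels, target, alpha=0.1):
    """Indices of images whose empirical target probability lies in
    [alpha, 1 - alpha], i.e. annotators agree on at most a (1 - alpha) fraction."""
    p_hat = (labels == target).mean(axis=1)
    return np.where((p_hat >= alpha) & (p_hat <= 1 - alpha))[0]
```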
6.3 Crowdsourcing methods on semisynthetic CIFAR-10 labels
In our final set of experiments, we adapt the original CIFAR-10 [22] dataset, which consists of 6,000 images from each of 10 classes (60,000 total images). To mimic collecting and cleaning data with noisy labelers—rather than the single-label "gold standard" in the base CIFAR-10 data—we construct pseudo-labelers using a ResNet18 network whose accuracy we can directly adjust. This allows us to compare the maximum-likelihood estimator, the majority vote estimator, and, to test whether the predictions of our stylized models still hold for more advanced aggregation strategies, the Dawid-Skene [10] and GLAD [48] crowdsourcing estimators for the labels.
To generate the semisynthetic labelers and labels, we use a pretrained ResNet18 model [18], an eighteen layer residual network, . For , define the softmax mapping . Then for an input , the model outputs a score , where indicates the probabilities . The model has the form
where $W$ is a matrix of per-class weights and $h(x)$ represents the second-to-last layer outputs of the neural network. As we fit linear models to the data, we therefore take these outputs as our data, so that for the $i$th image we use the features $h(x_i)$, and the weight matrix $W$ is the ground-truth parameter. We construct semi-synthetic labels of varying accuracy/labeler expertise as follows: letting $\beta > 0$ be a chosen constant that we vary to model labeler expertise, we draw each labeler's label from the network's class probabilities sharpened or flattened according to $\beta$.
We vary $\beta$ in experiments to adjust the median labeler accuracy: the "expert labelers" case corresponds to a larger $\beta$, while the "crowd labelers" case corresponds to a smaller $\beta$ and noisier labels.
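A sketch of the semisynthetic label generation, under our assumption that labeler expertise is modeled by scaling the network's logits before the softmax (larger `beta` yields sharper, more accurate labelers); `H` and `W` denote the penultimate-layer features and per-class weights described above.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def draw_semisynthetic_labels(H, W, m, beta, seed=0):
    """H: (n, p) penultimate-layer features, W: (K, p) per-class weights.
    Draw m class labels per image from softmax(beta * W h(x))."""
    rng = np.random.default_rng(seed)
    probs = softmax(beta * (H @ W.T), axis=1)      # (n, K) label distribution
    cum = probs.cumsum(axis=1)
    u = rng.random((H.shape[0], m, 1))
    return (u > cum[:, None, :]).sum(axis=2)       # (n, m) labels via inverse CDF
```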
Given a semisynthetic dataset, each method produces an estimate of the ground-truth parameter. We fit the full-information and majority-vote estimators using the multiclass logistic loss. We also investigate two standard crowdsourcing approaches for aggregating labels—the Dawid-Skene and GLAD methods [10, 48]—which also produce soft labels. Both methods parameterize the ability of labelers, and GLAD additionally parameterizes task hardness, using the Expectation-Maximization algorithm to estimate the latent parameters. Both methods output estimated (hard) labels as well as soft labels for each example, where the soft label gives the crowdsourcing model's estimated probability that the example belongs to each class. Given these imputed labels, we can fit estimators
the former the hard label estimator and the latter the soft. We let and denote the corresponding estimators using one-hot hard labels, and and denote those trained using the estimated per-example label probabilities.
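For completeness, a sketch of fitting the downstream linear classifier on imputed soft labels: we minimize the cross-entropy between the crowdsourcing model's label distribution `Q` and a multiclass logistic model, so hard-label training is the special case of one-hot `Q`. The small ridge term and all names are our additions for numerical stability.

```python
import numpy as np
from scipy.optimize import minimize

def fit_soft_label_classifier(H, Q, reg=1e-4):
    """H: (n, p) features; Q: (n, K) soft labels with rows summing to one.
    Returns the (K, p) weight matrix of a multiclass logistic model."""
    n, p = H.shape
    K = Q.shape[1]

    def objective(w_flat):
        W = w_flat.reshape(K, p)
        Z = H @ W.T
        Z = Z - Z.max(axis=1, keepdims=True)
        log_probs = Z - np.log(np.exp(Z).sum(axis=1, keepdims=True))
        ce = -(Q * log_probs).sum(axis=1).mean()
        grad_Z = (np.exp(log_probs) - Q) / n            # softmax probs minus targets
        grad_W = grad_Z.T @ H + reg * W
        return ce + 0.5 * reg * (W ** 2).sum(), grad_W.ravel()

    res = minimize(objective, np.zeros(K * p), jac=True, method="L-BFGS-B")
    return res.x.reshape(K, p)
```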
Figure 3: Test error on the semisynthetic CIFAR-10 data: (a) estimators trained with hard (aggregated) labels and (b) estimators trained with soft labels; panels (c) and (d) give additional comparisons.
We report the results in Fig. 3. As the ground truth model is well-specified, the MLE using the full information of the individual labels outperforms not only all aggregation methods, but also the black-box crowdsourcing models that generate soft labels. The more advanced aggregation methods (Dawid-Skene and GLAD) yield smaller test error than majority vote, and using their soft labels further reduces the error, suggesting the benefits of using soft labels from crowdsourcing methods for training even when the underlying model is unknown.
7 Discussion
In spite of the technical detail we require to prove our results, we view this work as almost preliminary and hope that it inspires further work on the full pipeline of statistical machine learning, from dataset creation to model release. Many questions remain both on the theoretical and applied sides of the work.
On the theoretical side, our main focus has been on a stylized model of label aggregation, with majority vote mostly functioning as the stand-in for more sophisticated aggregation strategies (the crowdsourcing model in Sec. 5 being the exception). It seems challenging to show that no aggregation strategy can work as well as multi-label strategies; it would be interesting to more precisely delineate the benefits and drawbacks of more sophisticated denoising and whether it is useful. We focus throughout on low-dimensional asymptotics, using asymptotic normality to compare estimators. While these insights are valuable, and make predictions consistent with the experimental work we provide, investigating how things may change with high-dimensional scaling or via non-asymptotic results might yield new insights both for theory and methodology. As one example, with high-dimensional scaling, classification datasets often become separable [6, 7], which is consistent with modern applied machine learning [49] but makes the asymptotics we derive impossible. One option is to investigate the asymptotics of maximum-margin estimators, which coincide with the limits of ridge-regularized logistic regression [36, 41].
On the methodological and applied side, given the extent to which the test/challenge dataset methodology drives progress in machine learning [12], it seems that developing newer datasets to incorporate labeler uncertainty could yield substantial benefits. A particular refrain is that modern deep learning methods are overconfident in their predictions [15, 49]; perhaps by calibrating them to labeler uncertainty we could substantially improve their robustness and performance. Additionally, recent large-scale models now frequently use only distantly supervised labels [33, 38, 13], making the study of appropriate data cleaning and collection more timely. We look forward to deeper investigations of the intricacies and intellectual foundations of the full practice of statistical machine learning.
Support
This research was partially supported by the Office of Naval Research under awards N00014-19-2288 and N00014-22-1-2669, the Stanford DAWN Consortium, and the National Science Foundation grants IIS-2006777 and HDR-1934578.
References
- Asuncion and Newman [2007] A. Asuncion and D. J. Newman. UCI machine learning repository, 2007. URL http://www.ics.uci.edu/~mlearn/MLRepository.html.
- Bartlett et al. [2006] P. L. Bartlett, M. I. Jordan, and J. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101:138–156, 2006.
- Beygelzimer et al. [2021] A. Beygelzimer, P. Liang, J. Wortman Vaughan, and Y. Dauphin. NeurIPS 2021 datasets and benchmarks track. https://neurips.cc/Conferences/2021/CallForDatasetsBenchmarks, 2021.
- Bickel et al. [1998] P. Bickel, C. A. J. Klaassen, Y. Ritov, and J. Wellner. Efficient and Adaptive Estimation for Semiparametric Models. Springer Verlag, 1998.
- Boucheron et al. [2005] S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: a survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.
- Candès and Sur [2019] E. Candès and P. Sur. A modern maximum-likelihood theory for high-dimensional logistic regression. Proceedings of the National Academy of Sciences, 116(29):14516–14525, 2019.
- Candès and Sur [2020] E. Candès and P. Sur. The phase transition for the existence of the maximum likelihood estimate in high-dimensional logistic regression. Annals of Statistics, 48(1):27–42, 2020.
- Charikar et al. [2017] M. Charikar, J. Steinhardt, and G. Valiant. Learning from untrusted data. In Proceedings of the Forty-Ninth Annual ACM Symposium on the Theory of Computing, 2017.
- Chen et al. [2010] L. H. Chen, L. Goldstein, and Q.-M. Shao. Normal Approximation by Stein’s method. Springer, 2010.
- Dawid and Skene [1979] A. Dawid and A. Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society, 28:20–28, 1979.
- Deng et al. [2009] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. ImageNet: a large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
- Donoho [2017] D. L. Donoho. 50 years of data science. Journal of Computational and Graphical Statistics, 26(4):745–766, 2017.
- Gadre et al. [2023] S. Y. Gadre, G. Ilharco, A. Fang, J. Hayase, G. Smyrnis, T. Nguyen, R. Marten, M. Wortsman, D. Ghosh, J. Zhang, E. Orgad, R. Entezari, G. Daras, S. Pratt, V. Ramanujan, Y. Bitton, K. Marathe, S. Mussmann, R. Vencu, M. Cherti, R. Krishna, P. W. Koh, O. Saukh, A. Ratner, S. Song, H. Hajishirzi, A. Farhadi, R. Beaumont, S. Oh, A. Dimakis, J. Jitsev, Y. Carmon, V. Shankar, and L. Schmidt. DataComp: In search of the next generation of multimodal datasets. In Advances in Neural Information Processing Systems 36, 2023.
- Giné and Nickl [2021] E. Giné and R. Nickl. Mathematical Foundations of Infinite-Dimensional Statistical Models. Cambridge University Press, 2021.
- Goodfellow et al. [2015] I. Goodfellow, O. Vinyals, and A. Saxe. Qualitatively characterizing neural network optimization problems. In Proceedings of the Third International Conference on Learning Representations, 2015.
- Hardt and Recht [2022] M. Hardt and B. Recht. Patterns, Predictions, and Actions: A story about machine learning. Princeton University Press, 2022. Available at https://mlstory.org/.
- Hastie et al. [2009] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, second edition, 2009.
- He et al. [2016] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- Horn and Johnson [1985] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 1985.
- Howe [2006] J. Howe. The rise of crowdsourcing. Wired Magazine, 14(6):1–4, 2006.
- Karger et al. [2014] D. Karger, S. Oh, and D. Shah. Budget-optimal task allocation for reliable crowdsourcing systems. Operations Research, 62(1):1–24, 2014.
- Krizhevsky and Hinton [2009] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
- Le Cam and Yang [2000] L. Le Cam and G. L. Yang. Asymptotics in Statistics: Some Basic Concepts. Springer, 2000.
- Ledoux and Talagrand [1991] M. Ledoux and M. Talagrand. Probability in Banach Spaces. Springer, 1991.
- Lewis et al. [2004] D. Lewis, Y. Yang, T. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397, 2004.
- Mammen and Tsybakov [1999] E. Mammen and A. B. Tsybakov. Smooth discrimination analysis. Annals of Statistics, 27:1808–1829, 1999.
- Marcus et al. [1994] M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19:313–330, 1994.
- Owen [1990] A. Owen. Empirical likelihood ratio confidence regions. The Annals of Statistics, 18(1):90–120, 1990.
- Peterson et al. [2019] J. C. Peterson, R. M. Battleday, T. L. Griffiths, and O. Russakovsky. Human uncertainty makes classification more robust. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9617–9626, 2019.
- Plan and Vershynin [2013] Y. Plan and R. Vershynin. Robust 1-bit compressed sensing and sparse logistic regression: A convex programming approach. IEEE Transactions on Information Theory, 59(1):482–494, 2013.
- Plan and Vershynin [2016] Y. Plan and R. Vershynin. The generalized lasso with non-linear observations. IEEE Transactions on Information Theory, 62(3):1528–1537, 2016.
- Platanios et al. [2020] E. A. Platanios, M. Al-Shedivat, E. Xing, and T. Mitchell. Learning from imperfect annotations. arXiv:2004.03473 [cs.LG], 2020.
- Radford et al. [2021] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, 2021.
- Ratner et al. [2017] A. Ratner, S. H. Bach, H. Ehrenberg, J. Fries, S. Wu, and C. Ré. Snorkel: rapid training data creation with weak supervision. Proceedings of the VLDB Endowment, 11(3):269–282, 2017.
- Raykar et al. [2010] V. C. Raykar, S. Yu, L. H. Zhao, G. H. Valadez, C. Florin, L. Bogoni, and L. Moy. Learning from crowds. Journal of Machine Learning Research, 11(4), 2010.
- Rosset et al. [2004] S. Rosset, J. Zhu, and T. Hastie. Boosting as a regularized path to a maximum margin classifier. Journal of Machine Learning Research, 5:941–973, 2004.
- Russakovsky et al. [2015] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
- Schuhmann et al. [2022] C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, P. Schramowski, S. Kundurthy, K. Crowson, L. Schmidt, R. Kaczmarczyk, and J. Jitsev. LAION-5B: An open large-scale dataset for training next generation image-text models. In Advances in Neural Information Processing Systems 35, 2022. arXiv:2210.08402 [cs.CV].
- Shapiro et al. [2009] A. Shapiro, D. Dentcheva, and A. Ruszczyński. Lectures on Stochastic Programming: Modeling and Theory. SIAM and Mathematical Programming Society, 2009.
- Shevtsova [2014] I. Shevtsova. On the absolute constants in the Berry-Esseen-type inequalities. Doklady Mathematics, 89(3):378–381, 2014.
- Soudry et al. [2018] D. Soudry, E. Hoffer, M. S. Nacson, S. Gunasekar, and N. Srebro. The implicit bias of gradient descent on separable data. Journal of Machine Learning Research, 19(18):1–57, 2018.
- Tian and Zhu [2015] T. Tian and J. Zhu. Max-margin majority voting for learning from crowds. In Advances in Neural Information Processing Systems 28, pages 1621–1629, 2015.
- van der Vaart [1998] A. W. van der Vaart. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 1998.
- van der Vaart and Wellner [1996] A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, New York, 1996.
- Vapnik [1995] V. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
- Wainwright [2019] M. J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, 2019.
- Welinder et al. [2010] P. Welinder, S. Branson, P. Perona, and S. Belongie. The multidimensional wisdom of crowds. In Advances in Neural Information Processing Systems 23, pages 2424–2432, 2010.
- Whitehill et al. [2009] J. Whitehill, T. Wu, J. Bergsma, J. Movellan, and P. Ruvolo. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. Advances in Neural Information Processing Systems 22, 22:2035–2043, 2009.
- Zhang et al. [2021] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115, 2021.
Appendix A Technical lemmas
We collect several technical lemmas in this section that will be helpful in the main proofs; see Appendices A.1, A.2, A.3, and A.4 for their proofs.
Lemma A.1.
Suppose are symmetric, and the matrix is invertible. Then
The next two lemmas characterize asymptotic behaviors of expectations involving a fixed function and some random variable that satisfies Assumption A2 for given and . To facilitate stating the theorems, we recall that such are -regular.
Lemma A.2.
Let and be a function on such that is integrable. If is -regular (Assumption A2), then
Lemma A.3.
Let and , and be -regular, let satisfy for some and all , and assume has finite th moment. Additionally let be the majority vote prediction function (15) and Assumption A4 hold for each with limiting average derivative at zero. Then for any
where is the standard normal cumulative distribution function.
The fourth lemma is a uniform convergence result for the empirical risk in the case that we have potentially distinct link functions (cf. Sec. 5.1).
Lemma A.4.
Assume for some and and let the radius . Let . Then for a constant , we have
A.1 Proof of Lemma A.1
As , the symmetric matrices and commute and so are simultaneously orthogonally diagonalizable [19, Thm. 4.5.15]. As , we can thus write
for some orthogonal , and as is invertible, are invertible diagonal matrices. We conclude the proof by writing
A.2 Proof of Lemma A.2
By the change of variables , we have
As , where is integrable by assumption, we can invoke dominated convergence to see that
A.3 Proof of Lemma A.3
By rescaling arguments, it suffices to prove the lemma for any function satisfying on for any such that has finite th moment. For such , we wish to show
where is the standard normal cdf. The key insight is that we can approximate by a suitable Gaussian cumulative distribution function (recognizing that for by definition (15) as the probability the majority vote is correct given margin ).
We first assume with probability , as the general result follows by writing . We decompose into two expectations, depending on being large or small:
(16)
The proof consists of three main parts.
1. We approximate by a Gaussian cdf.
2. We approximate term (I) by replacing with the Gaussian cdf, showing that the limit (17) holds.
3. For term (II), we show is small when , which allows us to show that
Thus by adding the two preceding displays and taking , we obtain the lemma.
Before we dive into further details, we use the shorthand functions
(18)
and we also write .
Part 1. Normal approximation for when .
Let for shorthand and be independent random variables. For , then
Consider the centered and standardized random variables so that are zero mean, mutually independent, and satisfy
By the Berry-Esseen theorem (cf. Chen et al. [9], Shevtsova [40]), for all
Fix any . Then for and large enough , the right hand side of the preceding display has the upper bound , as in the numerator we have , and in the denominator for all as , which follows from Assumption A4 that
By repeating the same argument for , we obtain that for large and ,
(19)
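As a quick numerical sanity check on this normal approximation (our own illustration, assuming a logistic base link so that each labeler is independently correct with probability sigma(t / sqrt(m))), one can compare the exact binomial majority-vote probability with its Gaussian approximation:

```python
import math
from scipy.stats import binom, norm

def maj_correct_exact(p, m):        # odd m: P(Binomial(m, p) > m / 2)
    return binom.sf(m // 2, m, p)

def maj_correct_normal(p, m):       # CLT approximation of the same tail
    return norm.cdf(math.sqrt(m) * (2 * p - 1) / (2 * math.sqrt(p * (1 - p))))

sigma = lambda s: 1.0 / (1.0 + math.exp(-s))
t = 1.0                             # a fixed margin; per-vote accuracy tends to 1/2 as m grows
for m in (5, 25, 125, 625):
    p = sigma(t / math.sqrt(m))
    print(m, round(maj_correct_exact(p, m), 4), round(maj_correct_normal(p, m), 4))
```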
Part 2. Approximating (I) by Gaussian cdf.
For the first term (I) in (16), we further decompose into a normal approximation term and an error term,
where as in def. (18). We will show and so dominates. By the change of variables , we can further write (III) as
We want to take the limit and apply the dominated convergence theorem. Because and , we have
As and , is integrable on , and by (14a) and (14c) in Assumption A4,
Using the above display and that , we can thus apply the dominated convergence theorem to conclude that
Part 3. Upper bounding (II).
In term (II), when is large, the key is that the quantity is small when : Hoeffding’s inequality implies the tail bound
Thus
Using the assumption that from (14b), that and has finite th moment, we observe that
and consequently
For , we have
while Assumption (14b) gives , so
Using the inequality , we apply dominated convergence:
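The Hoeffding step above also admits a short numerical illustration (ours, not part of the proof): if each of m labelers is independently correct with probability p > 1/2, the probability that the majority vote errs is at most exp(-2 m (p - 1/2)^2).

```python
import math
from scipy.stats import binom

def maj_wrong_exact(p, m):          # odd m: P(Binomial(m, p) <= (m - 1) / 2)
    return binom.cdf(m // 2, m, p)

for m in (5, 25, 125):
    for p in (0.6, 0.75, 0.9):
        exact = maj_wrong_exact(p, m)
        bound = math.exp(-2 * m * (p - 0.5) ** 2)
        print(f"m={m}, p={p}: exact={exact:.2e} <= Hoeffding={bound:.2e}")
```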
A.4 Proof of Lemma A.4
We follow a typical symmetrization approach, then construct a covering that we use to prove the lemma. Let be the (random) symmetrized measure with point masses at for . Then by a standard symmetrization argument, we have
(20)
We use a covering argument to bound the symmetrized expectation (20). Let be chosen later, and for an (again, to be determined) , let denote an -cover of in the supremum norm on , that is, , and so for each there exists such that . Then by [44, Ch. 2.7] we have . Let be a minimal -cover of in , so that and . We claim that for each and , there exists and such that
(21)
Indeed, for any and , we have
where we have used that . Taking the elements in the respective coverings to minimize the above bound gives the guarantee (21).
We now leverage inequality (21) in the symmetrization step (20). We have
where inequality uses that if are -sub-Gaussian, then , and that conditional on , the symmetrized sum is -sub-Gaussian, as . We optimize this bound to get the final guarantee of the lemma: set (and note that we will choose ) to obtain
Choose .
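The sub-Gaussian maximal inequality invoked in the final display can be checked numerically; the following sketch (ours, using i.i.d. standard normals as the canonical 1-sub-Gaussian case) compares a Monte Carlo estimate of the expected maximum of N draws against sqrt(2 log N).

```python
import numpy as np

rng = np.random.default_rng(0)
reps = 5000
for N in (10, 100, 1000):
    draws = rng.standard_normal((reps, N))
    emp = draws.max(axis=1).mean()              # Monte Carlo estimate of E[max of N draws]
    print(N, round(emp, 3), "<=", round(np.sqrt(2 * np.log(N)), 3))
```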
Appendix B Proof of Lemma 3.1
Recall that is the calibration gap function (10). We see that because , we have , while using Assumption A3 and that , we apply dominated convergence and that with probability 1 to obtain
Because , we see that there is a unique solving , and evidently is a minimizer of .
We compute the Hessian of . For this, we again let to write
which gives and so is unique. Finally, the desired form of the Hessian follows as .
Appendix C Proof of Theorem 1
Let be the solution to as in Lemma 3.1. The consistency argument is immediate: the losses are convex, continuous, and locally Lipschitz, so Shapiro et al. [39, Thm. 5.4] gives . By an appeal to standard M-estimator theory [e.g. 43, Thm. 5.23], we thus obtain
(22)
We expand the covariance term to obtain the first main result of the theorem. For shorthand, let , so that and the are conditionally independent given . Applying the law of total covariance, we have
where we have used the conditional independence of the conditional on . We control each of terms (I) and (II) in turn.
For the first, we have by the independent decomposition that
and so
where we used the independence of and . For term (II) above, we see that conditional on , is a binary random variable taking values in with probabilities and , respectively, so that a calculation leveraging the variance of Bernoulli random variables yields
where we used the definition (12) of the variance terms. A similar calculation to that we used for term (I) then gives that
Applying Lemma A.1 then allows us to decompose the covariance in expression (22) into terms in the span of and those perpendicular to it, so that the asymptotic covariance is
by applying Lemma 3.1 for the form of the Hessian and Lemma A.1 for the inverse, giving the first result of the theorem.
To obtain the second result, we apply the delta method with the mapping , which satisfies , so that for we have
as desired.
Appendix D Proof of Theorem 2
As in the proof of Theorem 1, we begin with a consistency result. Let be the solution to as in Lemma 3.1. Then once again, Shapiro et al. [39, Thm. 5.4] shows that . As previously, appealing to standard M-estimator theory [e.g. 43, Thm. 5.23], we obtain
(23)
the difference from the asymptotic expansion (22) appearing in the covariance term. Here, we recognize that conditional on , the vector takes on the values each with probabilities and , respectively, while is (unconditionally) mean zero. Thus we have
Appendix E Proofs of asymptotic normality
In this appendix, we include proofs of the convergence results in Propositions 1, 2, and 3. In each, we divide the proof into three steps: we characterize the loss minimizer, apply one of the master Theorems 1 or 2 to obtain asymptotic normality, and then characterize the behavior of the asymptotic covariance as .
E.1 Proof of Proposition 1
Asymptotic normality of the MLE.
The asymptotic normality result is an immediate consequence of the classical asymptotics for maximum likelihood estimators [43, Thm. 5.29].
Normalized estimator.
For the normalized estimator, we appeal to the master results developed in Section 3.1. In particular, since we are in the well-specified logistic model, we can invoke Corollary 3 and write directly that
which immediately implies
and that further . To compute the limit when , we invoke Lemma A.2 and conclude that
E.2 Proof of Proposition 2
Minimizer of the population loss.
Asymptotic variance.
Large behavior.
The remainder of the proof is to characterize the behavior of as . We first derive asymptotics for . To simplify notation, we let . Because solves we have
(24)
we must have as because for any as , so the right side of equality (24) converges to 0 by the dominated convergence theorem, and hence so must the left hand side. Invoking Lemma A.2 for the left hand side, it follows that
while invoking Lemma A.3 for the right hand side of (24), it follows that
where the last line follows from change of variables . The identity (24) implies that the ratio and so we have that as ,
In particular, .
We finally proceed to compute the asymptotic behavior of , the variance (13). By Lemma A.2, the limit of its denominator as satisfies
We decompose the numerator into the two parts
As we have already shown that , we know for any that for large enough , . We can thus invoke Lemma A.2 to get
With the same argument, we apply Lemma A.3 to establish the convergence
where we use the change of variables . Taking limits, we have
where we used that as above.
E.3 Proof of Proposition 3
Minimizer of the population loss.
By Lemma 3.1, we know the gap has a unique solution , with minimizing the population loss .
Asymptotic variance.
Large behavior.
We derive the large asymptotics of and under Assumption A4 and using the shorthand . The proof is essentially identical to that of Proposition 2 in Appendix E.2. First, recalling the probability (15), , we see that for any as , and thus by dominated convergence, . The analogue of the identity (24) in the proof of Proposition 2, that is the zero of , implies
(26)
As is bounded and for any , the convergence of the left hand side of equality (26) to 0 as means we must have . Invoking Lemma A.2 yields
Applying Lemma A.3 gives
Rewriting the identity (26) using the symmetry of , so that , we have the equivalent statement that , or . Using this identity for the ratio, we find that
This concludes the asymptotic characterization that .
Finally, we turn to the asymptotics for in (25). By Lemma A.2 its denominator has limit
We decompose the (rescaled) numerator of the variance (25) into the two parts
Lemma A.2 and the fact that , coupled with the dominated convergence theorem, establish the convergence
Similarly, Lemma A.3 and that gives that
where in the last line we use change of variables . Hence we have
Finally, as we obtain that for some constant depending only on , and .
Appendix F Proofs of fundamental limits of estimation with aggregate labels
F.1 Proof of Proposition 4
We assume without loss of generality that is odd. When is even the proof is identical, except that we randomize when we have equal votes for both classes. As the marginal distribution of is for both and , we need only show the existence of a link such that the conditional distribution on any is the same, i.e.,
For , define the probability that a is at least through the one-to-one transformation
For any , monotonically increases in , and , and by symmetry . By the definition of majority vote
Therefore, the link
satisfies and for all , since maps to . We also have
where (i) and (ii) follow from symmetry of and , respectively. Thus belongs to and is a valid link function. This yields the desired equality:
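The construction admits a concrete sketch (ours, taking a logistic base link purely for illustration): the induced aggregate link maps a margin to the probability that the majority of the m votes is correct, and one can verify numerically that it is again monotone and symmetric.

```python
import math
from scipy.stats import binom

def sigma(s):                         # assumed base link: logistic
    return 1.0 / (1.0 + math.exp(-s))

def majority_link(s, m):              # P(majority of m votes correct | margin s)
    return binom.sf(m // 2, m, sigma(s))

m = 5
grid = [-3 + 0.5 * i for i in range(13)]
vals = [majority_link(s, m) for s in grid]
assert all(b >= a for a, b in zip(vals, vals[1:]))                                     # monotone
assert all(abs(majority_link(-s, m) + majority_link(s, m) - 1) < 1e-9 for s in grid)   # symmetric
print([round(v, 3) for v in vals])
```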
F.2 Proof of Theorem 3
Throughout this proof, we use to mean that as , or equivalently that . By Lemma 4.1, we have
with
(27a)
(27b)
It is crucial to understand the behavior of and , which have the exact forms
Throughout this proof, we assume without loss of generality that is odd. When is even, we replace with , but otherwise all arguments are identical.
Lemma F.1.
The function is even, and for any ,
Proof.
By direct calculations
where (i) uses the combinatorial identity
This yields a telescoping sum and establishes the result. ∎
Next, we define the function
We can then write for the Fisher information via
as is an even function. The next lemma provides the key asymptotic guarantees for in the above displays. See Appendix F.3 for a proof.
Lemma F.2.
There exists a function satisfying the following:
(i) For any , and
(ii) For any , there exist and such that for all and all ,
(iii) For any fixed ,
We use the asymptotics in Lemma F.2 to compute and , respectively.
Part I: Asymptotics for .
We begin by showing the claimed asymptotic formula for in the theorem’s statement. By Stirling’s formula, the multiplicative factor in satisfies
and thus
Our goal now is to show that
(28)
which then immediately implies the asymptotic for , as claimed in the theorem.
The proof of the limit (28) involves an argument via dominated convergence. By the change of variables ,
We now demonstrate dominating functions for the various terms in the above integrand. To bound the last several terms, we use Assumption A2 that and Lemma F.2, which combine to give the upper bound
for some and all sufficiently large . The next lemma controls the first terms in the integral above. (We defer its proof to Appendix F.4.)
Lemma F.3.
There exists a universal constant such that for all ,
Part II: Asymptotics for .
The calculations are similar to those we have done to approximate . In this case
and again invoking dominated convergence,
Hence
F.3 Proof of Lemma F.2
We begin by proving part (i). Define
where the inequality is valid for . Invoking Lemma F.1 yields
which gives the equality in part (i). It then remains to show . For this, we upper bound
and noting the inequality , it then follows that
Proceeding to the proof of part (ii), for any ,
where the inequality uses that is decreasing in . We next establish that . Indeed, applying Stirling’s formula yields
Substituting the above displays into the definition , we can then establish the upper bound using , so
Since , this inequality establishes part (ii).
For part (iii), we compute . For any fixed , we explicitly expand to obtain
Here, we partition the set of all nonnegative integers into sets for . Using the same asymptotics derived above for and that for all , it holds for some remainder term satisfying that
We will invoke dominated convergence for series to allow us to exchange summation and limits in . We observe the following termwise domination:
which is summable in . We can then invoke the dominated convergence theorem for infinite series to obtain that for some ,
implying that for any ,
Since the left hand side does not depend on , we can then send to zero on the right hand side and use the definition of the Riemann integral to obtain
Now we compute the limit of the remaining term in ,
Taking the above displays collectively into
we establish part (iii) of the lemma, that is,
F.4 Proof of Lemma F.3
We pick some constant . When ,
and when ,
Then for and , it holds and consequently
F.5 Proof of Corollary 6
We have the explicit formula for the density , with and being the -dimensional centered density with covariance ; it is differentiable in quadratic mean, which allows us to invoke van der Vaart [43, Thm. 8.9]. It only remains to show the asymptotic rate in for and . Indeed, by Lemma A.1,
and Theorem 3 implies
Analogously, we establish the result for .
Appendix G Proof of Theorem 4
We divide the theorem into two main parts, consistent with the typical division of asymptotic normality results into a consistency result and a distributional result. Recall the notation and , where has the distribution that Assumption A1 specifies.
G.1 Proof of consistency
We demonstrate the consistency in three parts, which we present as Lemmas G.1, G.2, and G.4. The first presents an analogue of Lemma 3.1 generalized to the case in which there are distinct link functions, allowing us to characterize the link-dependent minimizers
via a one-dimensional scalar on the line . The second, Lemma G.2, then shows that as approaches , where the scaling follows from the normalization in Assumption A5. Finally, the third lemma demonstrates the probabilistic convergence whenever in .
We begin with the promised analogue of Lemma 3.1.
Lemma G.1.
Define the calibration gap function
(29)
Then the loss has unique minimizer for the unique solving . Additionally, taking , we have
Proof.
We perform a derivation similar to that we used to derive the gap function (10), with a few modifications to allow collections of link functions. Note that
so that leveraging the usual ansatz that , we have
where is the immediate generalization of the gap (10). We now simplify it to the form (29). Using the symmetries and for any , we observe that
and so . Of course, , as , while . Then as , there exists a unique satisfying .
We turn to the Hessian derivation, where as in the proof of Lemma 3.1, we write for that
and so and has unique minimizer . ∎
With Lemma G.1 serving as the analogue of Lemma 3.1, we can now show the continuity of the optimizing parameter in :
Lemma G.2.
As , we have .
Proof.
Via Lemma G.1, it is evidently sufficient to show that the solution to converges to . To show this, note the expansion
(30)
We use the following claim, which shows that the first term tends to 0 uniformly in near :
Claim G.3.
For any , we have whenever .
Proof.
We use that the density of is continuous and nonzero on . Take any . Then
where in we use that the link functions and are bounded within . For any fixed , the ratio is bounded for , and using the substitution we have
We thus have that , where depends only on and the distribution of . Take and . ∎
Finally, we proceed to the third part of the consistency argument: the convergence in probability.
Lemma G.4.
If , then .
Proof.
By Lemma G.1 and the assumed continuity of the population Hessian, there exists such that
(31)
whenever both and . Applying the uniform convergence Lemma A.4, we see that for any , we have
For , define the events
where Lemma G.2 and the assumption that imply that for all . By the growth condition (31) and uniform convergence over , we therefore have that with probability tending to 1,
The convexity of the losses in and that minimizes then guarantee the desired convergence . ∎
G.2 Asymptotic normality via Donsker classes
While we do not have , which allows the cleanest and simplest asymptotic normality results with nuisance parameters (e.g. [43, Thm. 25.54]), we still expect to be asymptotically normal, and therefore, as for some , the normalized estimators should be asymptotically normal to . To develop the asymptotic normality results, we perform an analysis of the empirical process centered at the estimators , rather than the “true” parameter as would be typical.
We begin with an expansion. We let for shorthand. Then as and , we can derive
The assumed continuity of at (recall Assumption A5) then implies that
(32)
where we have used the consistency guarantees by Lemmas G.2 and G.4 and that by assumption.
The expansion (32) forms the basis of our asymptotic normality result; while and may be data dependent, by leveraging uniform central limit theorems and the theory of Donsker function classes, we can show that has an appropriate normal limit. To that end, define the function classes
where we leave the Lipschitz constant in tacit. By Assumption A5 and that , we know that with probability tending to 1 we have the membership . The key result is then that is a Donsker class:
Lemma G.5.
Assume that . Then is a Donsker class, and moreover, if , then
Temporarily deferring the proof of Lemma G.5, let us see how it leads to the proof of Theorem 4. Using Lemma G.5 and Slutsky’s lemmas in the equality (32), we obtain
Calculations completely similar to those we use in the proof of Theorem 1 then give Theorem 4: because by assumption,
because are conditionally independent, while
When , the Hessian function in Lemma G.1 simplifies to as is symmetric about 0. We then apply the delta method as in the proof of Theorem 1.
Finally, we return to the proof of Lemma G.5.
Proof of Lemma G.5.
To prove is Donsker, we show that each coordinate of is, and as , this amounts to showing the coordinate functions form a Donsker class when has distribution . Let
so it is evidently sufficient to prove that forms a Donsker class.
We use bracketing and entropy numbers [44] to control the . Recall that for a function class , an -bracket of in is a collection of functions such that for each , there exists such that and . The bracketing number is the cardinality of the smallest such -bracket, and the bracketing entropy is
To show that is Donsker, it is sufficient [44, Ch. 2.5.2] to show that . Our approach to demonstrate that is Donsker is thus to construct an appropriate -bracket of , which we do by first covering -balls in , then for vectors , constructing a bracketing of the induced function class , which we combine to give the final bracketing of .
We proceed with this two-stage covering and bracketing. Let be small numbers whose values we determine later. Define the ball , and for any , let be a minimal -cover of in the Euclidean norm of size , , so that for any with there exists with . Standard bounds [46, Lemma 5.7] give
(33)
For simplicity of notation and to avoid certain tedious negations, we define the “flipped” monotone function family
Now, for any , let denote the pushforward measure of . For , we then let be a minimal -bracketing of in the norm. That is, for , we have , and for each , there exists such that
Because elements of are monotone, van der Vaart and Wellner [44, Thm. 2.7.5] guarantee there exists a universal constant such that
(34)
and in particular, for each .
With the covering and induced bracketing collections , we now turn to a construction of the actual bracketing of the class . For any and bracket , define the functionals by
The key is that these functions form a bracketing of :
Lemma G.6.
Define the set
Then is a bracketing of with cardinality at most .
Proof.
Let . Take satisfying and such that for all , where . We first demonstrate the bracketing guarantee
For the upper bound, we have
where step (i) follows from the -Lipschitz continuity of , (ii) from the Cauchy-Schwarz inequality and that , while step (iii) follows by the construction that for all . Similarly, we obtain the lower bound
again valid for all .
The second part of the proof is to bound the distance between the upper and lower elements in the bracketing. By definition, has the pointwise upper bound
Recalling that , by the Minkowski and Cauchy-Schwarz inequalities, we thus obtain
Noting the trivial bounds and the assumed bracketing distance
we have the desired bracketing distance .
Lemma G.6 will yield the desired entropy integral bound. Fix any , and note that if we take and , then the set is a -bracketing of in , and moreover, we have the cardinality bound
Additionally, as covering numbers are necessarily integer-valued, we have whenever . This gives the entropy integral bound
and consequently (cf. [43, Thm. 19.5] or [44, Ch. 2.5.2]), is a Donsker class. A completely identical argument shows that are Donsker, and so is a Donsker class, completing the proof of the first claim in Lemma G.5.
To complete the proof of Lemma G.5, we need to show that
where denotes the empirical process on . Notably, because is finite, it is sufficient to show that
To that end, we suppress dependence on for notational simplicity and simply write and , where . For any Donsker class and , we have
(see [14, Thm. 3.7.31]), and so in turn it is sufficient to prove that
(35)
Appendix H Proofs for semiparametric approaches
H.1 Proof of Lemma 5.1
As and are -Lipschitz, we may without loss of generality assume that and so and . We can compute the Hessian at any and ,
where in the last line we use that and are symmetric for . Therefore we can upper bound the distance between Hessians by
and thus we only need to prove each quantity if , that is, if and . In the following, we will show under the two different conditions. To further simplify the quantity, we claim it is sufficient to show . Indeed, we have
Lemma H.1.
If and , then .
Proof.
By the triangle inequality and the independent decomposition , we have
It remains to show . Using the symmetry of and , so , we can replace by . Then integrating by parts, for any we have
By our w.l.o.g. assumption that , we have . Thus, recognizing the trivial bound , we have
We show for any fixed , . Applying the Cauchy-Schwarz inequality twice, we have the bounds
where for the final inequality we use that is nonzero and continuously differentiable.
As , we evidently have for arbitrary . Using the assumptions that and for , we conclude the proof by taking and . ∎
Finally we prove under the two conditions in the statement of Lemma 5.1: that are Lipschitz or has a continuous density.
Condition 1. The links have Lipschitz derivatives
We apply Jensen’s inequality to write
where denotes the Lipschitz constant of its argument. Taking completes the proof for this case.
Condition 2. The covariates have continuous density
Let have density . We rewrite the convergence instead as where is invertible. We again divide the expectation into a large part and a small part. Let be large enough that . Then using , we obtain
By the linear transformation of variables , the first term in the above display is
Since is absolutely continuous on any compact set, the above term converges to as . Finally, as was arbitrary, we take to conclude that .
H.2 Proof of Proposition 6
We apply Theorem 4. We first give the specialization of that the assumptions of the proposition imply, recognizing that as , we have
We now argue that we can actually invoke Theorem 4, which requires verification of Assumption A5. Because for any , the link functions have Lipschitz continuous derivatives for , when has the form , Lemma 5.1 implies the continuity of the mapping for at . Then recognizing that by the Lipschitz continuity of we have
whenever satisfies , we obtain the proposition.
Appendix I Proofs of nonparametric convergence results
In this technical appendix, we include proofs of the results from Section 5.2 as well as a few additional results, which are essentially corollaries of results on localized complexities and nonparametric regression models, though we require a few modifications because our setting is slightly non-standard.
I.1 Preliminary results
To set notation and to make this appendix self-contained, we provide some definitions. The norm of a function or random vector is , so that its -norm is . We consider the following abstract nonparametric regression setting: we have a function class with , and our observations follow the model
(36)
but instead of observing pairs we observe pairs, where may not be identical to (these play the role of versus in the results to come). We assume that are bounded so that , independent, and satisfy the conditional mean-zero property that (though may not be independent of ). For a (thus far unspecified) function class we set
We now demonstrate that the error can be bounded by a combination of the errors and local complexities of the function class . Our starting point is a local complexity bound analogous to the localization results available for in-sample prediction error in nonparametric regression [cf. 46, Thm. 13.5]. To present the results, for a function class we define the localized -complexity
where we treat the expectation conditionally on and are random (i.e. ). For the model (36), we define the centered class , which is star-shaped as is a convex set (a set is star-shaped if for all , if then ). We say that satisfies the critical radius inequality if
(37)
With this, we can provide a proposition giving a high-probability bound on the in-sample prediction error of the empirical estimator , which is essentially identical to [46, Thm. 13.5], though we require a few modifications to address that we observe and not and that the noise are bounded but not Gaussian.
Proposition 7.
Let be a convex function class, satisfy the critical inequality (37), and let . Then for ,
See Section I.3 for a proof of the proposition.
Revisiting the critical radius inequality (37), we can also apply [46, Corollary 13.7], which allows us to use an entropy integral to guarantee bounds on the critical radius. Here, we again fix , and for a function class we let . Let denote the -covering number of in -norm. Then modifying a few numerical constants, we have
Corollary 7 (Wainwright [46], Corollary 13.7).
As an immediate consequence of this inequality, we have the following:
Corollary 8.
Assume for all and that is contained in the class of -Lipschitz functions with (or any other fixed constant). Then for a numerical constant , the choice
satisfies the critical inequality (37).
I.2 Proof of Proposition 5
We assume without loss of generality that , as nothing in the proof changes except notationally (as we assume is fixed). We apply Proposition 7 and Corollary 8. For notational simplicity, let denote the measure on that induces for , and similarly for . We first show that converges. To this end, we recall [28, Lemma 3] that as . Thus there is a (random) such that for all . Therefore, Corollary 8 implies that the choice satisfies the critical inequality (37), and taking , we have that for ,
on the event that for all , where we have conditioned (in the probability) on . As , we may choose and find that the Borel-Cantelli lemma then implies that with probability , there is a random such that
(38)
for all with probability .
Finally we argue that . Let be otherwise arbitrary, and let be the collection of -Lipschitz functions on with and for , noting that restricted to evidently belongs to . Then we have
(39)
where the inequality used that by construction. We have already controlled the first term in inequality (39). We can control the second supremum term almost immediately using Dudley's entropy integral and a Rademacher contraction inequality. Indeed, we have
where inequality is a standard symmetrization inequality, is the Rademacher contraction inequality [24, Ch. 4] applied to the function , which is Lipschitz for , and is Dudley’s entropy integral bound. As the -norm log-covering numbers of -Lipschitz functions on scale as for and are 0 otherwise, we obtain . The bounded-differences inequality then implies that for any ,
Finally, the last term in the bound (39) evidently satisfies and with probability at least by the DKW inequality. We thus find by the Borel-Cantelli lemma that, simultaneously for all , with probability at least we have
where is a numerical constant.
Substituting inequality (38) into the preceding display and taking , we get the result.
I.3 Proof of Proposition 7
We begin with an extension of the familiar basic inequality [e.g. 46, Eq. (13.18)], where we see by convexity that for any we have
Noting that , algebraic manipulations yield that if is the error vector, then
Simplifying, we obtain the following:
Lemma I.1.
Let and . Then
Proof.
The first inequality follows from algebraic manipulations and uses that our choice of was arbitrary. For the second, we consider two cases: that and that . In the first case, we consider , yielding the simplified bound that for . Taking gives that
In the case that , we choose , and the bound simplifies to . ∎
We now return to the proof of the proposition proper. We begin with an essentially immediate extension of the result [46, Lemma 13.12]. We let be an arbitrary star-shaped function class. Define the event
(40)
which we treat conditionally on as in the definition of the local complexity . (Here the noise are still random, taken conditionally on .)
Lemma I.2 (Modification of Lemma 13.12 of Wainwright [46]).
Let be a star-shaped function class and let satisfy the critical radius inequality
Then for all , we have
Deferring the proof of Lemma I.2 to Section I.3.1, we can parallel the argument for [46, Thm. 13.5] to obtain our proposition.
Let in Lemma I.2. Whenever , we have . We consider two cases. In the former, that , there is nothing to do. In the latter, we have while , so that if fails then we must have
From the extension of the basic inequality in Lemma I.1 we see that
Solving for then yields
Simplifying gives Proposition 7.
I.3.1 Proof of Lemma I.2
Mimicking the proof of [46, Lemma 13.12], we begin with [46, Eq. (13.40)]:
Now note that if , then the function is -Lipschitz with respect to the -norm, so that convex concentration inequalities [e.g. 46, Theorem 3.4] imply that whenever , and so for we have
As , we finally use that the normalized complexity is non-decreasing [46, Lemma 13.6] to obtain that for , , where the last inequality holds by assumption. In particular, we find that for we have , and so
as desired.